I've started my final project for Data Talks Club's data engineering zoomcamp. I'm using the Massachusetts voter data from the Secretary of State, with names and addresses removed. The data consists of all the elections that every voter voted in, going back to 1970.
I had the data on a CD from August 2022, when I ran for State Treasurer. The data files on it were password-protected, so I had to call the office of the Secretary of State to get the password. I have an old laptop that can still read a CD, but its browser was too old to run the Google Cloud console. So I downloaded the data onto an 8 GB thumb drive and used a less old computer to move it from the thumb drive to a Google Cloud Storage bucket. I couldn't unzip the files on the thumb drive itself; 8 GB wasn't enough space.
I opened a Python notebook on my virtual machine on GCP and was able to unpack the files after consulting with ChatGPT. I had to set GOOGLE_APPLICATION_CREDENTIALS in the Jupyter notebook and use the Cloud Storage client library to download and unpack the data. I also had to install the google-cloud-storage package.
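In case it's useful, here is a minimal sketch of that setup, assuming the service-account key file is already on the VM (the path below is a placeholder; use wherever your key actually lives):

# Install the client library once, e.g. from a notebook cell:
# !pip install google-cloud-storage

import os

# Point the Google client libraries at a service-account key file.
# The path is a placeholder, not the actual location on my VM.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/me/keys/service-account.json'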
Here's the code that I used to do that:
import os
import zipfile
from google.cloud import storage

# Initialize the GCS client
client = storage.Client()

# Define the GCS bucket, zip file path, and local destination
bucket_name = 'cris-voter-data'
zip_file_path = 'voter_activity_20220828.zip'
destination_path = 'temp.zip'

try:
    # Get the bucket
    bucket = client.get_bucket(bucket_name)
    print("Bucket:", bucket)

    # Get the blob (file) from the bucket
    blob = bucket.blob(zip_file_path)
    print("Blob:", blob)

    # Download the blob's content and write it to a local file
    with open(destination_path, 'wb') as file:
        blob.download_to_file(file)
    print("Downloaded zip file to:", destination_path)

    # Check that the download succeeded, then extract the zip file
    if os.path.exists(destination_path):
        with zipfile.ZipFile(destination_path, 'r') as zip_ref:
            zip_ref.extractall('extracted_files')
        print("Successfully extracted zip file contents.")
    else:
        print("Failed to download zip file.")
except Exception as e:
    print("Error:", e)
So now I have 351 .txt files taking up 11 GB in a directory named "extracted_files" on my virtual machine. The next step is to use the Python notebook to remove the names and street addresses and write the results out as Parquet files in Google Cloud Storage. I'm not sure yet whether I'll land them in a bucket or load them directly into BigQuery. After that, I'm going to use dbt to transform them, join them with the district lists, and build a fact table for visualizing the data.
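A rough sketch of that de-identification step might look like the following. It assumes pandas and pyarrow are installed (plus gcsfs for writing straight to a bucket), and the delimiter, column names, and output prefix are placeholders that need to be checked against the actual files.

import glob
import os
import pandas as pd

# Hypothetical PII column names; confirm against the real file headers.
PII_COLUMNS = ['First Name', 'Last Name', 'Street Address']

for path in glob.glob('extracted_files/*.txt'):
    # Assumes pipe-delimited text; adjust sep= to match the files.
    df = pd.read_csv(path, sep='|', dtype=str)

    # Drop name and street-address columns before the data leaves the VM.
    df = df.drop(columns=PII_COLUMNS, errors='ignore')

    # Write one Parquet file per input file directly to the bucket
    # (requires gcsfs; the 'parquet/' prefix is a placeholder).
    out_name = os.path.basename(path).replace('.txt', '.parquet')
    df.to_parquet(f'gs://cris-voter-data/parquet/{out_name}', index=False)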