totalSophie
Ingesting Data to Postgres

DE Zoomcamp study notes

To set up PostgreSQL in Docker, connect to it with pgcli, and execute SQL statements, you can follow these steps:

Step 1: Install Docker

Make sure you have Docker installed on your machine. You can download Docker from the official website: Docker.

Step 2: Pull PostgreSQL Docker Image

Open a terminal and pull the official PostgreSQL Docker image:

```bash
docker pull postgres
```

Step 3: Run PostgreSQL Container

Run a PostgreSQL container, setting the user, password, and database via environment variables:

```bash
docker run --name mypostgres \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="password" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  -d postgres
```

This command starts a container named 'mypostgres', creates the user 'root' with password 'password' and the database 'ny_taxi', and exposes port 5432 on the host.
-e declares the environment variables.
-v mounts a host directory as the data volume, so the data persists across container restarts.
-p maps the container port to a host port.
-d runs the container in detached (background) mode.

Step 4: Install pgcli

Install pgcli, a command-line interface for PostgreSQL, on your local machine:

```bash
pip install pgcli
```

Step 5: Connect to PostgreSQL using pgcli

Connect to the PostgreSQL database using pgcli:

```bash
pgcli -h localhost -p 5432 -U root -d ny_taxi -W
```

-h is the host (localhost here).
-p is the connection port.
-U is the username.
-d is the database name.
-W prompts for the password after you enter the command.

Enter the password when prompted (use 'password' if you followed the previous steps).

Step 6: Execute SQL Statements

Once connected, you can execute SQL statements directly in the pgcli interface. For example:

```sql
-- Create a new database
CREATE DATABASE mydatabase;

-- Switch to the new database
\c mydatabase

-- Create a table
CREATE TABLE mytable (
    id serial PRIMARY KEY,
    name VARCHAR (100),
    age INT
);

-- Insert some data
INSERT INTO mytable (name, age) VALUES ('John', 25), ('Jane', 30);

-- Query the data
SELECT * FROM mytable;
```

Feel free to modify the SQL statements according to your requirements.

Step 7: To Exit pgcli and Stop the PostgreSQL Container

To exit pgcli, type \q. After that, stop and remove the PostgreSQL container:

```bash
docker stop mypostgres
docker rm mypostgres
```

Data Ingestion from CSV to PostgreSQL using Pandas and SQLAlchemy

  • Used Jupyter notebook to insert the data in chunks.
  • Downloaded the NY taxi 2021 data

Step 1: Setting Up the Environment:

  • Use Pandas to read the CSV file in chunks for more efficient processing.
  • Define a PostgreSQL connection string using SQLAlchemy.
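As a tiny, self-contained sketch of how chunked reading behaves (synthetic in-memory CSV standing in for the taxi data, assuming pandas is installed):

```python
from io import StringIO
import pandas as pd

# Synthetic CSV standing in for the (much larger) taxi file
csv_text = "id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n"

# chunksize turns read_csv into an iterator of DataFrames
chunks = list(pd.read_csv(StringIO(csv_text), chunksize=2))

print(len(chunks))               # 5 rows at chunksize=2 -> 3 chunks
print([len(c) for c in chunks])  # rows per chunk: [2, 2, 1]
```

Each chunk is a regular DataFrame, so anything you can do to a DataFrame (type conversion, `to_sql`, etc.) works per chunk without loading the whole file into memory.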

Step 2: Creating the Table Schema:

  • Read the first chunk of data to create the initial table schema in the database.
  • Utilize Pandas' to_sql method to replace or create the table in the PostgreSQL database.

Step 3: Iterative Data Insertion:

  • Iterate through the remaining chunks of the CSV file.
  • Optimize timestamp data types using Pandas' to_datetime.
  • Append each chunk to the existing PostgreSQL table.
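A quick illustration of the timestamp fix in isolation (synthetic values, assuming pandas is installed): `read_csv` leaves the timestamp columns as plain strings, and `to_datetime` converts them so they land in Postgres as TIMESTAMP rather than TEXT.

```python
import pandas as pd

# Timestamps read from a CSV arrive as plain strings (dtype: object)
raw = pd.Series(["2021-01-01 00:15:30", "2021-01-01 00:31:10"])
print(raw.dtype)

# to_datetime parses them into proper datetime64 values
ts = pd.to_datetime(raw)
print(ts.dtype)  # datetime64[ns]
```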

Final Code:

```python
from sqlalchemy import create_engine
from time import time
import pandas as pd

# Connection string matching the docker run command above:
# postgresql://username:password@host:port/dbname
db_url = 'postgresql://root:password@localhost:5432/ny_taxi'
engine = create_engine(db_url)

# Chunk size for reading the CSV and inserting into the database
chunk_size = 100000

# Create an iterator for reading the CSV in chunks
csv_iter = pd.read_csv('2021_Yellow_Taxi_Trip_Data.csv', iterator=True, chunksize=chunk_size)

# Use the first chunk to create the table schema
first_chunk = next(csv_iter)
first_chunk['tpep_pickup_datetime'] = pd.to_datetime(first_chunk['tpep_pickup_datetime'])
first_chunk['tpep_dropoff_datetime'] = pd.to_datetime(first_chunk['tpep_dropoff_datetime'])
first_chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='replace', index=False)

# Loop through the remaining chunks and append to the table
for chunk in csv_iter:
    t_start = time()

    # Parse the timestamp columns so they are stored as TIMESTAMP, not TEXT
    chunk['tpep_pickup_datetime'] = pd.to_datetime(chunk['tpep_pickup_datetime'])
    chunk['tpep_dropoff_datetime'] = pd.to_datetime(chunk['tpep_dropoff_datetime'])

    # Append data to the existing table
    chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append', index=False)

    # Benchmark and report the time per chunk
    t_end = time()
    print(f'Inserted another chunk... took {t_end - t_start:.3f} second(s)')
```

Extra, Extra!!!


Using argparse to Parse Command Line Arguments

Utilizing the argparse standard library to efficiently parse command line arguments, this script downloads a CSV file from a specified URL and ingests its data into a PostgreSQL database.
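Before the full script, here is a minimal sketch of the argparse pattern it relies on (hypothetical two-argument parser; `parse_args` accepts an explicit argv list, which is handy for trying it out without the command line):

```python
import argparse

parser = argparse.ArgumentParser(description="Ingest CSV data to Postgres")
parser.add_argument('--user', help="user name for postgres")
parser.add_argument('--port', type=int, help="port for postgres")

# Passing a list instead of reading sys.argv, for demonstration
args = parser.parse_args(['--user', 'root', '--port', '5432'])
print(args.user, args.port)  # values become attributes on the namespace
```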

```python
from time import time
from sqlalchemy import create_engine
import pandas as pd
import argparse
import os

def main(params):
    user = params.user
    password = params.password
    host = params.host
    port = params.port
    db = params.db
    table_name = params.table_name
    url = params.url

    csv_name = 'output.csv'

    # Download the CSV, shelling out to wget from Python
    os.system(f"wget {url} -O {csv_name}")

    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')

    df_iter = pd.read_csv(csv_name, iterator=True, chunksize=100000)
    df = next(df_iter)

    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

    # Create the table with just the column names
    df.head(n=0).to_sql(name=table_name, con=engine, if_exists="replace")

    # Insert the first batch of rows
    df.to_sql(name=table_name, con=engine, if_exists="append")

    while True:
        t_start = time()

        try:
            df = next(df_iter)
        except StopIteration:
            print('Finished ingesting all chunks.')
            break

        df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
        df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

        df.to_sql(name=table_name, con=engine, if_exists="append")

        t_end = time()

        print('Inserted another chunk... took %.3f second(s)' % (t_end - t_start))

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Ingest CSV data to Postgres")

    parser.add_argument('--user', help="user name for postgres")
    parser.add_argument('--password', help="password for postgres")
    parser.add_argument('--host', help="host for postgres")
    parser.add_argument('--port', help="port for postgres")
    parser.add_argument('--db', help="database name for postgres")
    parser.add_argument('--table_name', help="name of the table where we will write the results to")
    parser.add_argument('--url', help="url of the CSV")

    args = parser.parse_args()
    main(args)
```

Dockerizing the Ingestion Script

In the Dockerfile below:

```dockerfile
FROM python:3.9.1

RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2

WORKDIR /app
COPY ingest_data.py ingest_data.py

ENTRYPOINT ["python", "ingest_data.py"]
```

The psycopg2 package is included so Python can access the PostgreSQL database; it is the driver SQLAlchemy uses under the hood.

To build the Docker image, execute the following command:

```bash
docker build -t taxi_ingest:v001 .
```

Now run the image instead of the script, adding the network argument and changing the database host.

If your CSV file is local, you can serve it over HTTP and access it through your machine's IP address by running this in the file's directory:

```bash
python3 -m http.server
```

```bash
# If your file is served locally
URL="http://192.x.x.x:8000/2021_Yellow_Taxi_Trip_Data.csv"
docker run -it \
  --network=pgadmin-network \
  taxi_ingest:v001 \
  --user=root \
  --password=password \
  --host=pg-database \
  --port=5432 \
  --db=ny_taxi \
  --table_name=yellow_taxi_trips \
  --url="${URL}"
```

Not yet, though... the Docker network referenced above has to exist first. Let's set that up.

Connecting pgAdmin and Postgres

pgcli is handy for quickly looking into the data, but a more convenient way to work with a Postgres database is pgAdmin, a web-based GUI tool.

To install pgAdmin in a Docker container, you can follow these steps:

  1. Pull the pgAdmin Docker image: Use the following command to pull the official pgAdmin image from Docker Hub.

```bash
docker pull dpage/pgadmin4
```
  2. Create a Docker network: It's good practice to create a Docker network so the PostgreSQL container and the pgAdmin container can communicate.

```bash
docker network create pgadmin-network
```
  3. Run the PostgreSQL container: Now modify the earlier postgres run command to join the network.

```bash
docker run --name pg-database \
  --network pgadmin-network \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="password" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  -d postgres
```

Replace password with your desired PostgreSQL password.

  4. Run the pgAdmin container: Now run the pgAdmin container on the same network.

```bash
docker run --name pgadmin-container \
  --network pgadmin-network \
  -e PGADMIN_DEFAULT_EMAIL=myemail@example.com \
  -e PGADMIN_DEFAULT_PASSWORD=mypassword \
  -p 5055:80 \
  -d dpage/pgadmin4
```

Replace myemail@example.com and mypassword with your desired pgAdmin login credentials.

  5. Access pgAdmin:
    Open your web browser and navigate to http://localhost:5055. Log in with the credentials you provided in the previous step.

  6. Add the PostgreSQL server:
    In pgAdmin, click on "Add New Server" and fill in the necessary details to connect to the PostgreSQL server running in the Docker container.

  • Host name/address: pg-database (the name of your PostgreSQL container)
  • Port: 5432
  • Username: root
  • Password: password (or whatever you set in step 3)

Now, you should be able to manage your PostgreSQL server using pgAdmin in a Docker container. Adjust the commands and parameters according to your specific requirements and environment.

Alternatively, we can use Docker Compose.

Create a docker-compose.yml. With Compose you don't need to specify a network: services defined in the same file are placed on a shared default network automatically.



```yaml
services:
  pgdatabase:
    image: postgres:latest
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5432:5432"
    container_name: mypostgres

  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=myemail@example.com
      - PGADMIN_DEFAULT_PASSWORD=mypassword
    ports:
      - "5055:80"
    container_name: pgadmin
```

  • To start Docker Compose: docker-compose up
  • To run it in the background: docker-compose up -d
  • To view the containers: docker-compose ps
  • To stop and remove the containers: docker-compose down (this works whether or not you used -d)
  • To also remove the volumes when stopping: docker-compose down -v
