CSV - Process Local & Remote Files in Python

#python #csv #devtools

Hello coders!

This article presents an open-source tool that is able to process local and remote CSV files, load and print the information, and later map the column to Django Types. Processing CSV files is usually required when the dataset becomes large, custom reports are unsupported by Excel or full data manipulation via data tables, and API is needed.
The current list of features can be further extended to map CSV files to database tables/models and fully generate dashboard web apps.

Source Code: CSV Processor part of the AppSeed Service (open-source)

Before we start explaining the code and the usage, let's summarize the tool features:

load local and remote files
print values
print detected column types
print the mapping types to a Django Model

The CSV parser can be executed via the CLI after cloning the project sources and making it usable as explained in the README. Once the installation is completed, we can call the CVS processor using this one-liner:

$ python manage.py tool_inspect_source -f media/tool_inspect/csv_inspect.json

The tool performs the following tasks:

validate the input
locate the CSV file (exit with error if not found)
loads the information and detects the column types
detects the Django column type
print the first 10 rows

The same can be applied to local and remote files. For instance, we can analyze the notorious Titanic.cvs by running this one-liner:

$ python manage.py tool_inspect_source -f media/tool_inspect/csv_inspect_distant.json

# Output
> Processing .\media\tool_inspect\csv_inspect_distant.json
    |-- file: https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
    |-- type: csv


Field        CSV Type    Django Types
-----------  ----------  ------------------------------------------
PassengerId  int64       models.IntegerField(blank=True, null=True)
Survived     int64       models.IntegerField(blank=True, null=True)
Pclass       int64       models.IntegerField(blank=True, null=True)
Name         object      models.TextField(blank=True, null=True)
Sex          object      models.TextField(blank=True, null=True)
Age          float64     models.FloatField(blank=True, null=True)
SibSp        int64       models.IntegerField(blank=True, null=True)
Parch        int64       models.IntegerField(blank=True, null=True)
Ticket       object      models.TextField(blank=True, null=True)
Fare         float64     models.FloatField(blank=True, null=True)
Cabin        object      models.TextField(blank=True, null=True)
Embarked     object      models.TextField(blank=True, null=True)


[1] - PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
[2] - 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
[3] - 2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
[4] - 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
[5] - 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
[6] - 5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
[7] - 6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
[8] - 7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
[9] - 8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
[10] - 9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
... (truncated output)

Here are the relevant parts of the tool:

Loads the information and prior checks the source if is local or remote

    print( '> Processing ' + ARG_JSON )
    print( '    |-- file: ' + JSON_DATA['source'] )
    print( '    |-- type: ' + JSON_DATA['type'  ] )
    print( '\n')

    tmp_file_path = None 

    if 'http' in JSON_DATA['source']:
        url = JSON_DATA['source']
        r = requests.get(url)
        tmp_file = h_random_ascii( 8 ) + '.csv'
        tmp_file_path = os.path.join( DIR_TMP, tmp_file )
        if not file_write(tmp_file_path, r.text ):
            return
        JSON_DATA['source'] = tmp_file_path
    else:    
        if not file_exists( JSON_DATA['source'] ):
            print( ' > Err loading SOURCE: ' + JSON_DATA['source'] )            
            return

    csv_types = parse_csv( JSON_DATA['source'] )

Analyze the headers and map the detected types to Django Types.

For the tabular view, Tabulate Library is used:

    csv_types = parse_csv( JSON_DATA['source'] )

    #pprint.pp ( csv_types )

    table_headers = ['Field', 'CSV Type', 'Django Types']
    table_rows    = []

    for t in csv_types:
        t_type        = csv_types[t]['type']
        t_type_django = django_fields[ t_type ]
        table_rows.append( [t, t_type, t_type_django] )

    print(tabulate(table_rows, table_headers))

The last step is to Print the CSV data:

    csv_data = load_csv_data( JSON_DATA['source'] )

    idx = 0
    for l in csv_data:
        idx += 1
        print( '['+str(idx)+'] - ' + str(l) )  

        # Truncate output ..
        if idx == 10:
            print( ' ... (truncated output) ' ) 
            break

At this point, the code provides us access to the CSV information, data types, and the correspondent data types for Django. The mapping can be easily extended for any framework like Flask, Express, or NextJS.

The type mapping for Django is this one:

# Pandas Type
django_fields = {
    'int'           : 'models.IntegerField(blank=True, null=True)',
    'integer'       : 'models.IntegerField(blank=True, null=True)',
    'string'        : "models.TextField(blank=True, null=True)",
    'string_unique' : "models.TextField(blank=True, null=False, unique=True)",
    'object'        : "models.TextField(blank=True, null=True)",
    'object_unique' : "models.TextField(blank=True, null=False, unique=True)",
    'int64'         : 'models.IntegerField(blank=True, null=True)',
    'float64'       : 'models.FloatField(blank=True, null=True)',
    'bool'          : 'models.BooleanField(null=True)',
}

This tool is under active development and here are the next steps:

connect the tool to more data sources like remote/local databases (SQLite, MySql, PgSQL), JSON
Generate Models for any framework: FastAPI, Flask, Express, NextJS
Generate secure APIs on top
Generate server-side paginated DataTables using Tailwind/Bootstrap for styling

Thanks for reading!

For those interested in contributing, feel free to join the new AppSeed platform and connect with the community on Discord:

AppSeed - open-source platform for developers
AppSeed Community - 3k+ Discord members

DEV Community

CSV - Process Local & Remote Files in Python

Top comments (0)

Read next

Random Forest Classification: Unveiling the Powerful Machine Learning Technique That's Transforming Decision-Making

Remaking a rule-engine DSL

การใช้งาน Polyglot notebook กับ Python

Places365 in PyTorch