Hello coders!
This article presents an open-source tool that is able to process local and remote CSV files, load and print the information, and later map the column to Django Types. Processing CSV files is usually required when the dataset becomes large, custom reports are unsupported by Excel or full data manipulation via data tables, and API is needed.
The current list of features can be further extended to map CSV files to database tables/models and fully generate dashboard web apps.
Source Code: CSV Processor part of the AppSeed Service (open-source)
Before we start explaining the code and the usage, let's summarize the tool features:
- load local and remote files
- print values
- print detected column types
- print the mapping types to a Django Model
The CSV parser can be executed via the CLI after cloning the project sources and making it usable as explained in the README. Once the installation is completed, we can call the CVS processor using this one-liner:
$ python manage.py tool_inspect_source -f media/tool_inspect/csv_inspect.json
The tool performs the following tasks:
- validate the input
- locate the CSV file (exit with error if not found)
- loads the information and detects the column types
- detects the Django column type
- print the first 10 rows
The same can be applied to local and remote files. For instance, we can analyze the notorious Titanic.cvs by running this one-liner:
$ python manage.py tool_inspect_source -f media/tool_inspect/csv_inspect_distant.json
# Output
> Processing .\media\tool_inspect\csv_inspect_distant.json
|-- file: https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
|-- type: csv
Field CSV Type Django Types
----------- ---------- ------------------------------------------
PassengerId int64 models.IntegerField(blank=True, null=True)
Survived int64 models.IntegerField(blank=True, null=True)
Pclass int64 models.IntegerField(blank=True, null=True)
Name object models.TextField(blank=True, null=True)
Sex object models.TextField(blank=True, null=True)
Age float64 models.FloatField(blank=True, null=True)
SibSp int64 models.IntegerField(blank=True, null=True)
Parch int64 models.IntegerField(blank=True, null=True)
Ticket object models.TextField(blank=True, null=True)
Fare float64 models.FloatField(blank=True, null=True)
Cabin object models.TextField(blank=True, null=True)
Embarked object models.TextField(blank=True, null=True)
[1] - PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
[2] - 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
[3] - 2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
[4] - 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
[5] - 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
[6] - 5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
[7] - 6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
[8] - 7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
[9] - 8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
[10] - 9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
... (truncated output)
Here are the relevant parts of the tool:
Loads the information and prior checks the source if is local or remote
print( '> Processing ' + ARG_JSON )
print( ' |-- file: ' + JSON_DATA['source'] )
print( ' |-- type: ' + JSON_DATA['type' ] )
print( '\n')
tmp_file_path = None
if 'http' in JSON_DATA['source']:
url = JSON_DATA['source']
r = requests.get(url)
tmp_file = h_random_ascii( 8 ) + '.csv'
tmp_file_path = os.path.join( DIR_TMP, tmp_file )
if not file_write(tmp_file_path, r.text ):
return
JSON_DATA['source'] = tmp_file_path
else:
if not file_exists( JSON_DATA['source'] ):
print( ' > Err loading SOURCE: ' + JSON_DATA['source'] )
return
csv_types = parse_csv( JSON_DATA['source'] )
Analyze the headers and map the detected types to Django Types.
For the tabular view, Tabulate Library
is used:
csv_types = parse_csv( JSON_DATA['source'] )
#pprint.pp ( csv_types )
table_headers = ['Field', 'CSV Type', 'Django Types']
table_rows = []
for t in csv_types:
t_type = csv_types[t]['type']
t_type_django = django_fields[ t_type ]
table_rows.append( [t, t_type, t_type_django] )
print(tabulate(table_rows, table_headers))
The last step is to Print the CSV data:
csv_data = load_csv_data( JSON_DATA['source'] )
idx = 0
for l in csv_data:
idx += 1
print( '['+str(idx)+'] - ' + str(l) )
# Truncate output ..
if idx == 10:
print( ' ... (truncated output) ' )
break
At this point, the code provides us access to the CSV information, data types, and the correspondent data types for Django. The mapping can be easily extended for any framework like Flask, Express, or NextJS.
The type mapping for Django is this one:
# Pandas Type
django_fields = {
'int' : 'models.IntegerField(blank=True, null=True)',
'integer' : 'models.IntegerField(blank=True, null=True)',
'string' : "models.TextField(blank=True, null=True)",
'string_unique' : "models.TextField(blank=True, null=False, unique=True)",
'object' : "models.TextField(blank=True, null=True)",
'object_unique' : "models.TextField(blank=True, null=False, unique=True)",
'int64' : 'models.IntegerField(blank=True, null=True)',
'float64' : 'models.FloatField(blank=True, null=True)',
'bool' : 'models.BooleanField(null=True)',
}
This tool is under active development and here are the next steps:
- connect the tool to more data sources like remote/local databases (SQLite, MySql, PgSQL), JSON
- Generate Models for any framework: FastAPI, Flask, Express, NextJS
- Generate secure APIs on top
- Generate server-side paginated DataTables using Tailwind/Bootstrap for styling
Thanks for reading!
For those interested in contributing, feel free to join the new AppSeed platform and connect with the community on Discord:
- AppSeed - open-source platform for developers
- AppSeed Community - 3k+ Discord members
Top comments (0)