Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on) provide analysts with an impressive range of data manipulation, exploration, and visualization tools. One important tool is pandas. Developed by Wes McKinney in 2008, but really gaining in popularity after 2012, pandas is now an essential library for data analysis in Python, along with other popular packages such as numpy, matplotlib, and scipy.
A key pandas object is the data frame, which represents data as a tabular structure, with rows and columns. In this way, it is similar to the other data stores used by a variety of other applications, including spreadsheets, databases and statistical packages. However, a pandas data frame also has indexing functionality that makes selecting, combining, and transforming data relatively straightforward. In this tutorial we show you how to import data from CSV files.
Technical requirements
The code and notebooks described here are available on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook.
Importing CSV files
The read_csv method of the pandas library can be used to read a file with comma separated values (CSV) and load it into memory as a pandas data frame. In the following recipe, we read a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical data missing.
Raw data is often stored as CSV files. These files have a carriage return at the end of each line of data to demarcate a row, and a comma between each data value to delineate columns. Something other than a comma can be used as the delimiter, such as a tab. Quotation marks may be placed around values, which can be helpful when the delimiter occurs naturally within certain values, which sometimes happens with commas.
All data in a CSV file are text characters, regardless of the logical data type. This is why it is easy to view a CSV file, presuming it is not too large, in a text editor. The pandas read_csv method will make an educated guess about the data type of each column, but you will need to help it along to ensure that these guesses are on the mark.
Getting ready
Create a folder for this tutorial and create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder and place the file landtempssample.csv (which can be found in the data sub-folder of the Chapter01 folder of the GitHub repo mentioned above) file in that subfolder. Alternatively, you could retrieve all the files from the GitHub repository. Here’s a code sample from the beginning of the CSV file:
locationid,year,month,temp,latitude,longitude,stnelev,station,
countryid,country
USS0010K01S,2000,4,5.27,39.9,-110.75,2773.7,INDIAN_CANYON,US,United States
CI000085406,1940,5,18.04,-18.35,-70.333,58.0,ARICA,CI,Chile
USC00036376,2013,12,6.22,34.3703,91.1242,61.0,SAINT_CHARLES,US,United
States
ASN00024002,1963,2,22.93,-34.2833,140.6,65.5,BERRI_IRRIGATION,
AS,Australia
ASN00028007,2001,11,,-14.7803,143.5036,79.4,MUSGRAVE,AS,Australia
Note
This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-baseddatasets/global-historical-climatology-networkmonthly-version-4. This is just a 100,000-row sample of the full dataset, which is also available in the repository.
How to do it…
We will import a CSV file into pandas, taking advantage of some very useful read_csv options:
- Import the pandas library and set up the environment to make viewing the output easier:
`>>> import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.width', 85)
pd.set_option('display.max_columns', 8)`
- Read the data file, set new names for the headings, and parse the date column. Pass an argument of 1 to the skiprows parameter to skip the first row, pass a list of columns to parse_dates to create a pandas datetime column from those columns, and set low_memory to False to reduce the usage of memory during the import process:
`>>> landtemps = pd.read_csv('data/landtempssample.csv',
...
names=['stationid','year','month','avgtemp','latitude',
...
'longitude','elevation','station','countryid','country'],
... skiprows=1,
... parse_dates=[['month','year']],
... low_memory=False)
type(landtemps)
`
- Get a quick glimpse of the data. View the first few rows. Show the data type for all columns, as well as the number of rows and columns:
landtemps.head(7)
month_year stationid ... countryid country
0 2000-04-01 USS0010K01S ... US United States
1 1940-05-01 CI000085406 ... CI Chile
2 2013-12-01 USC00036376 ... US United States
3 1963-02-01 ASN00024002 ... AS Australia
4 2001-11-01 ASN00028007 ... AS Australia
5 1991-04-01 USW00024151 ... US United States
6 1993-12-01 RSM00022641 ... RS Russia
[7 rows x 9 columns]
landtemps.dtypes
month_year datetime64[ns]
stationid object
avgtemp float64
latitude float64
longitude float64
elevation float64
station object
countryid object
country object
dtype: object
landtemps.shape
(100000, 9)
- Give the date column a better name and view the summary statistics for average monthly temperature: landtemps.rename(columns={'month_ year':'measuredate'}, inplace=True) landtemps.dtypes measuredate datetime64[ns] stationid object avgtemp float64 latitude float64 longitude float64 elevation float64 station object countryid object country object dtype: object landtemps.avgtemp.describe() count 85,554.00 mean 10.92 std 11.52 min -70.70 25% 3.46 50% 12.22 75% 19.57 max 39.95 Name: avgtemp, dtype: float64
- Look for missing values for each column. Use isnull, which returns True for each value that is missing for each column, and False when not missing. Chain this with sum to count the missings for each column. When working with Boolean values, sum treats True as 1 and False as 0. I discuss method chaining in There's more... (below): landtemps.isnull().sum() measuredate 0 stationid 0 avgtemp 14446 latitude 0 longitude 0 elevation 0 station 0 countryid 0 country 5 dtype: int64
- Remove rows with missing data for avgtemp. Use the subset parameter to tell dropna to drop rows where avgtemp is missing. Set inplace to True. Leaving inplace at its default value of False would display the data frame, but the changes we have made would not be retained. Use the shape attribute of the data frame to get the number of rows and columns: landtemps.dropna(subset=['avgtemp'], inplace=True) landtemps.shape (85554, 9)
That's it! Importing CSV files into pandas is as simple as that.
How it works...
In step 1 we import the pandas library, and refer to it as pd to make it easier to reference later. This is customary. We also use float_format to display float values in a readable way and set_option to make the terminal output wide enough to accommodate the number of variables.
Much of the work is done by the first line in step 2. We use read_csv to load a pandas data frame in memory and call it landtemps. In addition to passing a filename, we set the names parameter to a list of our preferred column headings. We also tell read_csv to skip the first row, by setting skiprows to 1, since the original column headings are in the first row of the CSV file. If we do not tell it to skip the first row, read_csv will treat the header row in the file as actual data.
read_csv also solves a date conversion issue for us. We use the parse_dates parameter to ask it to convert the month and year columns to a date value.
Step 3 runs through a few standard data checks. We use head(7) to print out all columns for the first 7 rows. We use the dtypes attribute of the data frame to show the data type of all columns. Each column has the expected data type. In pandas, character data has the object data type, a data type that allows for mixed values. shape returns a tuple, whose first element is the number of rows in the data frame (100,000 in this case) and whose second element is the number of columns (9).
When we used read_csv to parse the month and year columns, it gave the resulting column the name month_year. We use the rename method in step 4 to give that column a better name. We need to specify inplace=True to replace the old column name with the new column name in memory. The describe method provides summary statistics on the avgtemp column
.
Notice that the count for avgtemp indicates that there are 85,554 rows that have valid values for avgtemp. This is out of 100,000 rows for the whole data frame, as provided by the shape attribute. The listing of missing values for each column in step 5 (landtemps.isnull().sum()) confirms this: 100,000 – 85,554 = 14,446.
Step 6 drops all rows where avgtemp is NaN. (The NaN value, not a number, is the pandas representation of missing values.) subset is used to indicate which column to check for missings. The shape attribute for landtemps now indicates that there are 85,554 rows, which is what we would expect given the previous count from describe.
There's more...
If the file you are reading uses a delimiter other than a comma, such as a tab, this can be specified in the sep parameter of read_csv. When creating the pandas data frame, an index was also created. The numbers to the far left of the output when head and sample were run are index values. Any number of rows can be specified for head or sample. The default value is 5.
Setting low_memory to False causes read_csv to parse data in chunks. This is easier on systems with lower memory when working with larger files. However, the full data frame will still be loaded into memory once read_csv completes successfully.
The landtemps.isnull().sum() statement is an example of chaining methods. First, isnull returns a data frame of True and False values, resulting from testing whether each column value is null. sum takes that data frame and sums the True values for each column, interpreting the True values as 1 and the False values as 0. We would have obtained the same result if we had used the following two steps:
checknull = landtemps.isnull()
checknull.sum()
There’s no hard and fast rule for when to chain methods and when not to. I find it helpful to chain when I really think of something I am doing as being a single step, but only two or more steps, mechanically speaking. Chaining also has the side benefit of not creating extra objects that I might not need.
The dataset used in this recipe is just a sample from the full land temperatures database with almost 17 million records. You can run the larger file if your machine can handle it, with the following code:
>>> landtemps = pd.read_csv('data/landtemps.zip',
compression='zip',
... names=['stationid','year','month','avgtemp','latitude',
...
'longitude','elevation','station','countryid','country'],
... skiprows=1,
... parse_dates=[['month','year']],
... low_memory=False)
read_csv can read a compressed ZIP file. We get it to do this by passing the name of the ZIP file and the type of compression.
Points to note
A significant amount of reshaping of the Global Historical Climatology Network raw data was performed before using it in this recipe.
You can read a text file that is not delimited, one that is fixed, by using panda’s read_fwf function.
Purchase your copy on Amazon
Top comments (0)