Input data
The input is a CSV file for 5.5.2020 from the Mobility Trends Reports published by Apple. The reports are published daily and reflect requests for directions in Apple Maps.
The CSV file has four string columns (geo_type, region, transportation_type, alternative_name); every following column holds a float that compares mobility on the day named in the column header with the baseline day, 13.1.2020. And this is where the problems begin.
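To make the layout concrete, here is a minimal sketch that splits a header line into the four label columns and the per-day columns. The header string is a shortened stand-in with only three date columns (the ones visible in the output printed later in this post), and `split_header` is a hypothetical helper written for this illustration.

```python
# Shortened stand-in for the real header: the actual file has one column per day.
header = (
    "geo_type,region,transportation_type,alternative_name,"
    "2020-05-03,2020-05-04,2020-05-05"
)


def split_header(line, n_labels=4):
    """Split a header line into label column names and date column names."""
    names = line.split(",")
    return names[:n_labels], names[n_labels:]


label_cols, date_cols = split_header(header)
print(label_cols)  # the four string columns
print(date_cols)   # one column per day
```

The mix of text columns and numeric columns in one table is what causes the trouble in the next section.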
Reading CSV to numpy array
I used numpy's built-in function genfromtxt. My initial, naive code for reading the file downloaded from Mobility Trends Reports ended with an encoding error.
import numpy as np

def getData():
    path = "data/applemobilitytrends-2020-05-05.csv"
    npcsv = np.genfromtxt(path, delimiter=',')
    print(npcsv)

getData()
Error:
Exception has occurred: UnicodeDecodeError
'charmap' codec can't decode byte 0x98 in position 5961: character maps to <undefined>
File "C:\Users\marcel.kocisek\Documents\marcel\covid\examples\csv.py", line 5, in getData
npcsv = np.genfromtxt(path, delimiter=',')
File "C:\Users\marcel.kocisek\Documents\marcel\covid\examples\csv.py", line 8, in <module>
getData()
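The 'charmap' in the message suggests that the file was decoded with a legacy Windows code page instead of UTF-8. Here is a minimal sketch that reproduces the same kind of failure. It assumes the code page was cp1250 (a guess; the traceback does not name it) and uses 'Ø' only as an example of a character whose UTF-8 encoding contains the byte 0x98:

```python
# U+00D8 (Latin capital letter O with stroke) is encoded in UTF-8
# as the two bytes 0xC3 0x98.
raw = "\u00d8".encode("utf8")

try:
    # Assumption: cp1250 stands in for the Windows default code page.
    raw.decode("cp1250")
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print(err)

# Decoding with the encoding the bytes were written in works.
decoded = raw.decode("utf8")
print(decoded == "\u00d8")
```

In cp1250 the byte 0x98 has no character assigned, so the decoder gives up at that position.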
The fix was easy, just add the encoding as a parameter:
import numpy as np

def getDataV2():
    path = "data/applemobilitytrends-2020-05-05.csv"
    npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8')
    print(npcsv)

getDataV2()
In the result, the header and the first four string columns are replaced by nan:
[[ nan nan nan ... nan nan nan]
[ nan nan nan ... 36. 43.69 42.61]
[ nan nan nan ... 43.41 49.59 46.44]
...
[ nan nan nan ... 128.55 110.19 107.62]
[ nan nan nan ... 113.52 104.54 104.41]
[ nan nan nan ... 82.94 72.42 72.63]]
Why? Because the default data type of the resulting 2D numpy array is float, and the string names of countries and regions cannot be converted to float. Therefore, add the dtype parameter to switch from float to str (the np.float and np.str aliases were removed in NumPy 1.24, so use the built-in float and str).
npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8', dtype=str)
The result is a numpy array that includes the header, with all values as strings (the important thing is that we have the values :) )
[['geo_type' 'region' 'transportation_type' ... '2020-05-03' '2020-05-04'
'2020-05-05']
['country/region' 'Albania' 'driving' ... '36.0' '43.69' '42.61']
['country/region' 'Albania' 'walking' ... '43.41' '49.59' '46.44']
...
['sub-region' 'Östergötland County' 'driving' ... '128.55' '110.19'
'107.62']
['sub-region' 'Ústí nad Labem Region' 'driving' ... '113.52' '104.54'
'104.41']
['sub-region' 'Žilina Region' 'driving' ... '82.94' '72.42' '72.63']]
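The effect of dtype can be checked without the real file. The sketch below feeds genfromtxt a three-line in-memory sample through io.StringIO. The sample is a cut-down stand-in built from the rows printed above, with only three date columns and the alternative_name field left empty as an assumption.

```python
import io

import numpy as np

# Cut-down stand-in for the real file (assumption: alternative_name left empty).
sample = (
    "geo_type,region,transportation_type,alternative_name,"
    "2020-05-03,2020-05-04,2020-05-05\n"
    "country/region,Albania,driving,,36.0,43.69,42.61\n"
    "country/region,Albania,walking,,43.41,49.59,46.44\n"
)

# Default dtype is float: every cell that is not a number becomes nan.
as_float = np.genfromtxt(io.StringIO(sample), delimiter=',', encoding='utf8')

# dtype=str keeps every cell, header included, as text.
as_str = np.genfromtxt(
    io.StringIO(sample), delimiter=',', encoding='utf8', dtype=str
)

print(as_float)
print(as_str)
```

The float version keeps only the numbers in the date columns, while the string version keeps everything but leaves the numbers as text.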
To ignore the header, you can use the skip_header parameter of genfromtxt:
npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8', dtype=str, skip_header=1)
or use numpy 2D array indexing to drop the first row, [1:, :]:
npcsv = np.genfromtxt(path, delimiter=',', encoding='utf8', dtype=str)
print(npcsv[1:, :])
Result:
[['country/region' 'Albania' 'driving' ... '36.0' '43.69' '42.61']
['country/region' 'Albania' 'walking' ... '43.41' '49.59' '46.44']
['country/region' 'Argentina' 'driving' ... '16.44' '32.01' '33.63']
...
['sub-region' 'Östergötland County' 'driving' ... '128.55' '110.19'
'107.62']
['sub-region' 'Ústí nad Labem Region' 'driving' ... '113.52' '104.54'
'104.41']
['sub-region' 'Žilina Region' 'driving' ... '82.94' '72.42' '72.63']]
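Both ways of dropping the header should give the same array, and once the header is gone the numeric part can be converted back to floats. Here is a minimal sketch on an in-memory stand-in for the file (three date columns only, alternative_name left empty as an assumption). Note that astype(float) raises a ValueError on empty cells; this sample contains none in the date columns, and whether the real file does is not checked here.

```python
import io

import numpy as np

# Cut-down stand-in for the real file (assumption: alternative_name left empty).
sample = (
    "geo_type,region,transportation_type,alternative_name,"
    "2020-05-03,2020-05-04,2020-05-05\n"
    "country/region,Albania,driving,,36.0,43.69,42.61\n"
    "country/region,Albania,walking,,43.41,49.59,46.44\n"
)

# Variant 1: let genfromtxt skip the header line.
skipped = np.genfromtxt(
    io.StringIO(sample), delimiter=',', encoding='utf8', dtype=str, skip_header=1
)

# Variant 2: read everything, then slice off the first row.
sliced = np.genfromtxt(
    io.StringIO(sample), delimiter=',', encoding='utf8', dtype=str
)[1:, :]

print(np.array_equal(skipped, sliced))

# Split the text labels from the per-day values and make the values numeric.
labels = skipped[:, :4]
values = skipped[:, 4:].astype(float)
print(values)
```

Keeping labels and values in two arrays avoids the all-strings compromise: the labels stay text and the values support arithmetic.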