NYC Open Data Portal
Working Through CSV
Oil-burning boilers are one of the largest sources of air pollution in NYC. The purpose of this project is to assess how complete this dataset is and to map the data in a geographic visualization. The original data is small (~22 MB, 8K rows) and, appropriately enough, dirty.
Steps
- Get data from the remote URL using the NYC Open Data API, and save it to the local file rows.csv.
- Data cleaning: there are issues with the data as imported, having to do with cleanliness and completeness. First, drop the rows with all-NaN values at the end of the dataframe. Then parse the 'Owner Address' column to extract clean latitude and longitude information.
- Save the clean data, with latitude and longitude, into a new file oil.csv.
- Load the cleaned CSV data into a new dataset on CartoDB for geographic visualization.
- Georeference the map by the cleaned latitude and longitude columns.
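The core of the cleaning step is pulling the coordinate pair out of the free-text 'Owner Address' field. As a minimal sketch (the address string below is hypothetical, but the trailing "(lat, lon)" layout matches what the regular expressions later in this write-up expect):

```python
import re

# hypothetical 'Owner Address' value: street and city lines, then "(lat, lon)"
raw = "123 Example St\nNew York, NY 10001\n(40.712700, -74.005900)"

# NYC latitudes start with 4 and longitudes with -7, which the patterns rely on
latitude = re.findall(r'\n\((4.*),', raw)
longitude = re.findall(r',\ (-7.*)\)', raw)
```

Each `findall` returns a list: one string if the field had coordinates, empty otherwise, which is why the cleaning code flattens the results afterward.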
Code
1. Get data from the remote URL using the NYC Open Data API, and save it to the local file rows.csv.
import io
import os
import re
import requests
import pandas as pd
import numpy as np

def getCachedData(url, cache_file="cache.data", ignore_cache=False):
    '''Get data from the remote URL, with optional caching to a file.'''
    _data = None
    if not ignore_cache and os.path.isfile(cache_file):
        print('pulling from cache')
        with open(cache_file, 'r') as f:
            _data = f.read()  # read the whole thing into memory
        print('data length:', len(_data), 'bytes')
        return _data
    if _data is None:
        _data = requests.get(url)
        if _data.status_code == 200:
            print('header date:', _data.headers['date'])
            print('status_code:', _data.status_code)
            print('apparent_encoding:', _data.apparent_encoding)
            print('encoding:', _data.encoding)
            print('data length:', len(_data.text), 'bytes')
            if not ignore_cache:
                with open(cache_file, 'w') as f:
                    f.write(_data.text)
        else:
            print('bad status code:', _data.status_code)
            raise RuntimeError('fatal error fetching data from network')
    return _data.text
%%time
path = 'https://nycopendata.socrata.com/api/views/jfzu-yy6n/rows.csv?accessType=DOWNLOAD'
_csv = getCachedData(path, cache_file='rows.csv')
%%time
cdf = pd.read_csv(io.StringIO(_csv))
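The cache-or-fetch pattern can be exercised offline before hitting the live API. This is a simplified sketch of the same logic with an injectable fetch function standing in for requests (the URL and filenames here are placeholders):

```python
import os

def cached_get(url, fetch, cache_file='cache.data', ignore_cache=False):
    # return cached text if present, otherwise fetch it and write the cache
    if not ignore_cache and os.path.isfile(cache_file):
        with open(cache_file) as f:
            return f.read()
    text = fetch(url)
    if not ignore_cache:
        with open(cache_file, 'w') as f:
            f.write(text)
    return text

calls = []
def fake_fetch(url):
    calls.append(url)           # record how many times the "network" is hit
    return 'col_a,col_b\n1,2\n'

first = cached_get('http://example.test/rows.csv', fake_fetch, cache_file='demo.cache')
second = cached_get('http://example.test/rows.csv', fake_fetch, cache_file='demo.cache')
os.remove('demo.cache')         # clean up the demo cache file
```

The second call returns the same text but never touches the fetch function, which is the point of caching a ~22 MB download.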
2. Data cleaning: inspect the dataframe, drop the all-NaN rows, and parse 'Owner Address'.
cdf.dtypes
cdf.columns
cdf.head()
cdf.tail()
# drop the all-NaN rows at the end of the dataframe
cdf = cdf.dropna(subset=[cdf.columns[0]])
cdf.tail()
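The effect of dropna with a subset of the first column can be seen on a toy frame (the column names below are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Boiler ID': ['B1', 'B2', np.nan, np.nan],
    'Gallons':   [100.0, 250.0, np.nan, np.nan],
})

# rows whose first column is NaN (the junk rows at the end) are dropped
clean = toy.dropna(subset=[toy.columns[0]])
```

Rows with real IDs survive even if other columns are NaN; only the trailing all-NaN rows go away.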
address = []
latitude = []
longitude = []
address_pattern = r'(.*\n.*)\n\('
latitude_pattern = r'\n\((4.*),'
longitude_pattern = r',\ (-7.*)\)'
# use regular expressions to parse the Owner Address and extract clean latitude and longitude
for i in cdf[cdf.columns[5]]:
    i = str(i)
    address.append(re.findall(address_pattern, i))
    latitude.append(re.findall(latitude_pattern, i))
    longitude.append(re.findall(longitude_pattern, i))
def stringSeries(lst):
    '''Convert a list of re.findall results into a Series of plain strings.'''
    lst = pd.Series(lst)
    for i in lst.index:
        if lst[i] == []:
            lst[i] = ''
        else:
            lst[i] = lst[i][0]
    return lst
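stringSeries just flattens re.findall's list-of-matches output, taking the first match when there is one and an empty string when there is none. The same idea in plain Python, on hypothetical sample values:

```python
# each element is a re.findall result: one match, or no match at all
matches = [['40.7127'], [], ['40.6892']]

# first match if any, else empty string, mirroring stringSeries
flattened = [m[0] if m else '' for m in matches]
```

An empty string rather than a missing value keeps the column writable to CSV without reintroducing NaNs.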
# insert the clean latitude and longitude as new columns
cdf[cdf.columns[5]] = stringSeries(address)
cdf.insert(6, 'Latitude', stringSeries(latitude))
cdf.insert(7, 'Longitude', stringSeries(longitude))
3. Save the clean data with latitude and longitude into a new file oil.csv.
cdf.to_csv('oil.csv', index=False)  # index=False avoids writing the row index as an extra column
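Before uploading to CartoDB, it is worth sanity-checking that the parsed coordinates actually fall inside NYC. The bounding box below is an approximate assumption, not something taken from the dataset; it is consistent with the patterns above, which expect latitudes starting with 4 and longitudes starting with -7:

```python
def in_nyc_bbox(lat_str, lon_str):
    # approximate NYC bounding box; empty strings mean the parse failed
    if not lat_str or not lon_str:
        return False
    lat, lon = float(lat_str), float(lon_str)
    return 40.4 <= lat <= 41.0 and -74.3 <= lon <= -73.6
```

Rows that fail this check (failed parses, or addresses geocoded outside the city) are the ones worth inspecting by hand before mapping.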
4 & 5. Data Visualization: load oil.csv into a new CartoDB dataset and georeference the map by the Latitude and Longitude columns.