NYC Oil Boilers – Detailed Fuel Consumption and Building Data

NYC Open Data Portal

Working Through CSV

Source Data

Oil-burning boilers are one of the largest sources of air pollution in NYC. The purpose of the project is to see how complete this dataset is and to map the data in a geographic visualization. The original data is small-sized (~22MB, 8K rows), and appropriately enough, dirty.

Steps

  1. Get data from remote url using NYC Open Data API, and save it to local file rows.csv.
  2. Data cleaning: There are issues with the data as imported, having to do with cleanliness and completeness. First, get rid of rows with all NaN value at the end of the dataframe. Then parse the ‘Owner Address’ to get clean latitude and longitude information.
  3. Save the clean data with latitude and logitude into a new file oil.csv.
  4. Load the cleaned CSV data into a new dataset on CartoDB for geographic visualization.
  5. Georeference the map by the cleaned latitude and longitude columns.

Code

  1. Get data from remote url using NYC Open Data API, and save it to local file rows.csv.
import requests, os, StringIO, re
import pandas as pd
import numpy as np
def getCachedData(url, cache_file = "cache.data", ignore_cache = False):
    '''Get data from the remote url, with optional caching to a file.'''
    
    _data = None
    
    if not ignore_cache and os.path.isfile(cache_file):
        print 'pulling from cache'
        with open(cache_file, 'r') as f:
            _data = f.read()  #read the whole thing into memory
            print 'data length:', len(_data), 'bytes'
        return _data
    
    if _data == None:       
        _data = requests.get(url)
        if _data.status_code == 200:
            print 'header date', _data.headers['date']
            print 'status_code:', _data.status_code
            print 'apparent_encoding:',_data.apparent_encoding
            print 'encoding:', _data.encoding
            print 'data length:', len(_data.text), 'bytes'
            if not ignore_cache:
                with open(cache_file, 'w') as f:
                    f.write(_data.text)
        else:
            print 'bad status code:', _data.status_code
            raise Error('fatal error fetching data from network')
        return _data.text
%%time
path = 'https://nycopendata.socrata.com/api/views/jfzu-yy6n/rows.csv?accessType=DOWNLOAD'
_csv  = getCachedData(path, cache_file = 'rows.csv')
pulling from cache
data length: 3166301 bytes
CPU times: user 542 µs, sys: 3.21 ms, total: 3.75 ms
Wall time: 8.58 ms
%%time
cdf = pd.read_csv(StringIO.StringIO(_csv))
cdf.dtypes
CPU times: user 92.3 ms, sys: 11.7 ms, total: 104 ms
Wall time: 106 ms
cdf.columns
Index([u'Borough, Block and Lot #', u'Facility Address',
       u'Natural Gas Utility (Con Edison or National Grid',
       u'Building Manager', u'Owner', u'Owner Address',
       u'DEP Boiler Application #',
       u'Deadline for phasing out #6 oil (i.e. data of next DEP permit renewal after July 1 2012)',
       u'Boiler Model', u'Number of identical boilers',
       u'Boiler capacity (Gross BTU)', u'Boiler Installation Date',
       u'Estimated retirement date of boiler (assuming 35 year average useful life)',
       u'Is boiler dual fuel capable?', u'Age range of boiler',
       u'Burner Model', u'Primary Fuel',
       u'Total Estimated Cosumption - High Estimate (Gallons) ',
       u'Total Estimated Cosumption - Low Estimate (Gallons) ',
       u'Total Estimated Cosumption - High Estimate (MMBTUs) ',
       u'Total Estimated Cosumption - Low Estimate  (MMBTUs) ',
       u'Needs to comply with Greener Greater Buildings Laws?',
       u'Deadline for complying with Audit and Retrocommissioning Law',
       u'Building Type', u'City Council District',
       u'Total area of buildings on lot',
       u'Number of buildings on property (tax lot)', u'Number of floors',
       u'Number of residential units', u'Number of total units',
       u'Year constructed', u'Condominium housing?', u'Cooperative housing?'],
      dtype='object')
cdf.head()
Borough, Block and Lot # Facility Address Natural Gas Utility (Con Edison or National Grid Building Manager Owner Owner Address DEP Boiler Application # Deadline for phasing out #6 oil (i.e. data of next DEP permit renewal after July 1 2012) Boiler Model Number of identical boilers Building Type City Council District Total area of buildings on lot Number of buildings on property (tax lot) Number of floors Number of residential units Number of total units Year constructed Condominium housing? Cooperative housing?
0 2032590022 2840 BAILEY AVENUE\nBronx, NY 10463\n(40.87433… Con Edison BESNIK REALTY CORP. BESNIK REALTY CORP 754 MACE AVENUE\nBRONX, NY 10467\n(40.86314421… CA348086Y 7/6/05 ROCKMILLS MP 150 1 Elevator Apartments 14 86412 1 6 87 87 1926
1 2057800940 3900 GREYSTONE AVENUE\nBronx, NY 10463\n(40.88… Con Edison TROY PITTS SPRINT WIENER, EDITH 855 AVE\nMANHATTAN, NY 10001\n(40.750259021436… CA014493L 7/6/05 FEDERAL FST-150 2 Elevator Apartments 11 134703 2 6 127 127 1928
2 2028230001 1650 GRAND\nBronx, NY 10457\n(40.8457983712978… Con Edison BRONX LEBANON HOSPITAL BRONX-LEBANON HOSPITA 1650 SELWYN AVE\nBRONX, NY 10457\n(40.84317219… CA159489X 7/6/05 CLEAVER BROOKS CB-600-600 2 Elevator Apartments 14 480769 1 11 0 1 1941
3 2030640001 4422 3 AVENUE\nBronx, NY 10457\n(40.8540683145… Con Edison ST BARNABAS HOSPITAL BOILER ROOM ST BARNABAS NURSING H 4422 3 AVENUE\nN Y 10457\n(40.854068314549714,… CB078401R NaN B & W FM 10-66 1 Loft Buildings 15 421234 4 8 0 1 1943
4 2025060016 1204 SHAKESPEARE AVENUE\nBronx, NY 10452\n(40…. Con Edison BRAGREIS REALTY CORP. BRAGREIS REALTY CORP 191024 BROOKLYN\nNY 11219\n(40.63277295540763,… CA352883Z 7/6/05 ROCKMILLS MP100 1 Walk-Up Apartments 16 53376 1 6 67 67 1926

5 rows × 33 columns

cdf.tail()
Borough, Block and Lot # Facility Address Natural Gas Utility (Con Edison or National Grid Building Manager Owner Owner Address DEP Boiler Application # Deadline for phasing out #6 oil (i.e. data of next DEP permit renewal after July 1 2012) Boiler Model Number of identical boilers Building Type City Council District Total area of buildings on lot Number of buildings on property (tax lot) Number of floors Number of residential units Number of total units Year constructed Condominium housing? Cooperative housing?
8768 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8769 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8770 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8771 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8772 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 33 column

2 & 3. Data cleaning, and save to new file.
cdf = cdf.dropna(subset=[cdf.columns[0]])
# drop all NaN rows in the end of the data frame
cdf.tail()
Borough, Block and Lot # Facility Address Natural Gas Utility (Con Edison or National Grid Building Manager Owner Owner Address DEP Boiler Application # Deadline for phasing out #6 oil (i.e. data of next DEP permit renewal after July 1 2012) Boiler Model Number of identical boilers Building Type City Council District Total area of buildings on lot Number of buildings on property (tax lot) Number of floors Number of residential units Number of total units Year constructed Condominium housing? Cooperative housing?
8043 5005890053 610 VICTORY BOULEVARD\nStaten Island, NY 10301… National Grid 610 VICTORY BLVD.LLC 610 VICTORY BOULEVARD 97 77 QUEENS BLVD\nREGO PARK, NY 11374\n(40.72… CA288985X 7/5/05 ROCKMILLS MP-125 1 Elevator Apartments 49 72600 1 6 72 72 1940
8044 5000160054 350 ST MARK’S PLACE\nStaten Island, NY 10301\n… National Grid BALKON REALTY CO. BALKON REALTY CO 2350 OCEAN AVENUE\nBROOKLYN, NY 11229\n(40.604… CA163687Z 7/6/05 FEDERAL FLW 1518 1 Elevator Apartments 49 64680 1 5 0 1 1964
8045 5028520001 191 VERMONT AVENUE\nStaten Island, NY 10305\n(… National Grid BOARD OF EDUCATION BOARD OF EDUCATION 28 11 QUEENS PLAZA\nL I C, NY 11101\n(40.74570… CA031979X 7/5/05 NaN 2 Elevator Apartments 49 76623 1 3 0 1 1981
8046 5005890044 630 VICTORY BOULEVARD\nStaten Island, NY 10301… National Grid CLIFTON BUILDERS CO 630 VICTORY BOULEVARD 97 77 QUEENS BOULEVARD\nREGO PARK, NY 11374\n(… CA289085P 7/5/05 ROCKMILLS MP-100 SS 1 Elevator Apartments 49 72600 1 6 60 60 1940
8047 5005790045 4 CHESTER PLACE\nStaten Island, NY 10304\n(40…. National Grid JANE KURTIN CHESTER PLACE PARTNER NaN CA293085R 7/4/05 FEDERAL FST-60 1 Elevator Apartments 49 32000 1 5 34 34 1929

5 rows × 33 columns

address = []
latitude = []
longtitude = []

address_pattern = '(.*\n.*)\n\('
latitude_pattern = '\n\((4.*),'
longtitude_pattern = ',\ (-7.*)\)'

# use regular expression to parse the Owner Address and get clean latitude and longtitude
for i in cdf[cdf.columns[5]]:
    i = str(i)
    address.append(re.findall(address_pattern, i))
    latitude.append(re.findall(latitude_pattern, i))
    longtitude.append(re.findall(longtitude_pattern, i))

def stringSeries(lst):
    lst = pd.Series(lst)
    for i in lst.index:
        if lst[i] == []: lst[i] = ''
        else: lst[i] = lst[i][0]
    return lst

# insert clean latitude and longtitude into new columns
cdf[cdf.columns[5]] = stringSeries(address)
cdf.insert(6, 'Latitude', stringSeries(latitude))
cdf.insert(7, 'Longtitude', stringSeries(longtitude))

cdf.to_csv('oil.csv')
4 & 5. Data Visualization
QQ20150621-1@2x

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s