NYC Open Data Portal
Working Through CSV
Oil-burning boilers are one of the largest sources of air pollution in NYC. The purpose of this project is to assess how complete this dataset is and to map the data in a geographic visualization. The original data is small (~22 MB, 8K rows) and, appropriately enough, dirty.
Steps
- Get data from the remote URL using the NYC Open Data API, and save it to the local file rows.csv.
- Data cleaning: there are issues with the data as imported, having to do with cleanliness and completeness. First, drop the rows with all-NaN values at the end of the dataframe. Then parse the 'Owner Address' column to extract clean latitude and longitude information.
- Save the clean data, with latitude and longitude, into a new file oil.csv.
- Load the cleaned CSV data into a new dataset on CartoDB for geographic visualization.
- Georeference the map by the cleaned latitude and longitude columns.
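The core of the cleaning step is pulling the coordinate pair out of the free-text 'Owner Address' field. As a minimal sketch (the address string below is hypothetical, but the trailing "(lat, lon)" layout matches what the regular expressions later in this write-up expect):

```python
import re

# hypothetical 'Owner Address' value: street and city lines, then "(lat, lon)"
raw = "123 Example St\nNew York, NY 10001\n(40.712700, -74.005900)"

# NYC latitudes start with 4 and longitudes with -7, which the patterns rely on
latitude = re.findall(r'\n\((4.*),', raw)
longitude = re.findall(r',\ (-7.*)\)', raw)
```

Each `findall` returns a list: one string if the field had coordinates, empty otherwise, which is why the cleaning code flattens the results afterward.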
Code
1. Get data from the remote URL using the NYC Open Data API, and save it to the local file rows.csv.
import io
import os
import re
import requests
import pandas as pd
import numpy as np

def getCachedData(url, cache_file="cache.data", ignore_cache=False):
    '''Get data from the remote URL, with optional caching to a file.'''
    _data = None
    if not ignore_cache and os.path.isfile(cache_file):
        print('pulling from cache')
        with open(cache_file, 'r') as f:
            _data = f.read()  # read the whole thing into memory
        print('data length:', len(_data), 'bytes')
        return _data
    if _data is None:
        _data = requests.get(url)
        if _data.status_code == 200:
            print('header date:', _data.headers['date'])
            print('status_code:', _data.status_code)
            print('apparent_encoding:', _data.apparent_encoding)
            print('encoding:', _data.encoding)
            print('data length:', len(_data.text), 'bytes')
            if not ignore_cache:
                with open(cache_file, 'w') as f:
                    f.write(_data.text)
        else:
            print('bad status code:', _data.status_code)
            raise RuntimeError('fatal error fetching data from network')
    return _data.text
%%time
path = 'https://nycopendata.socrata.com/api/views/jfzu-yy6n/rows.csv?accessType=DOWNLOAD'
_csv = getCachedData(path, cache_file='rows.csv')
%%time
cdf = pd.read_csv(io.StringIO(_csv))
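The cache-or-fetch pattern can be exercised offline before hitting the live API. This is a simplified sketch of the same logic with an injectable fetch function standing in for requests (the URL and filenames here are placeholders):

```python
import os

def cached_get(url, fetch, cache_file='cache.data', ignore_cache=False):
    # return cached text if present, otherwise fetch it and write the cache
    if not ignore_cache and os.path.isfile(cache_file):
        with open(cache_file) as f:
            return f.read()
    text = fetch(url)
    if not ignore_cache:
        with open(cache_file, 'w') as f:
            f.write(text)
    return text

calls = []
def fake_fetch(url):
    calls.append(url)           # record how many times the "network" is hit
    return 'col_a,col_b\n1,2\n'

first = cached_get('http://example.test/rows.csv', fake_fetch, cache_file='demo.cache')
second = cached_get('http://example.test/rows.csv', fake_fetch, cache_file='demo.cache')
os.remove('demo.cache')         # clean up the demo cache file
```

The second call returns the same text but never touches the fetch function, which is the point of caching a ~22 MB download.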
2. Data cleaning: inspect the dataframe, drop the all-NaN rows, and parse 'Owner Address'.
cdf.dtypes
cdf.columns
cdf.head()
cdf.tail()
# drop the all-NaN rows at the end of the dataframe
cdf = cdf.dropna(subset=[cdf.columns[0]])
cdf.tail()
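The effect of dropna with a subset of the first column can be seen on a toy frame (the column names below are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Boiler ID': ['B1', 'B2', np.nan, np.nan],
    'Gallons':   [100.0, 250.0, np.nan, np.nan],
})

# rows whose first column is NaN (the junk rows at the end) are dropped
clean = toy.dropna(subset=[toy.columns[0]])
```

Rows with real IDs survive even if other columns are NaN; only the trailing all-NaN rows go away.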
address = []
latitude = []
longitude = []
address_pattern = r'(.*\n.*)\n\('
latitude_pattern = r'\n\((4.*),'
longitude_pattern = r',\ (-7.*)\)'
# use regular expressions to parse the Owner Address and extract clean latitude and longitude
for i in cdf[cdf.columns[5]]:
    i = str(i)
    address.append(re.findall(address_pattern, i))
    latitude.append(re.findall(latitude_pattern, i))
    longitude.append(re.findall(longitude_pattern, i))
def stringSeries(lst):
    '''Convert a list of re.findall results into a Series of plain strings.'''
    lst = pd.Series(lst)
    for i in lst.index:
        if lst[i] == []:
            lst[i] = ''
        else:
            lst[i] = lst[i][0]
    return lst
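stringSeries just flattens re.findall's list-of-matches output, taking the first match when there is one and an empty string when there is none. The same idea in plain Python, on hypothetical sample values:

```python
# each element is a re.findall result: one match, or no match at all
matches = [['40.7127'], [], ['40.6892']]

# first match if any, else empty string, mirroring stringSeries
flattened = [m[0] if m else '' for m in matches]
```

An empty string rather than a missing value keeps the column writable to CSV without reintroducing NaNs.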
# insert the clean latitude and longitude as new columns
cdf[cdf.columns[5]] = stringSeries(address)
cdf.insert(6, 'Latitude', stringSeries(latitude))
cdf.insert(7, 'Longitude', stringSeries(longitude))
3. Save the clean data with latitude and longitude into a new file oil.csv.
cdf.to_csv('oil.csv', index=False)  # index=False avoids writing the row index as an extra column
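Before uploading to CartoDB, it is worth sanity-checking that the parsed coordinates actually fall inside NYC. The bounding box below is an approximate assumption, not something taken from the dataset; it is consistent with the patterns above, which expect latitudes starting with 4 and longitudes starting with -7:

```python
def in_nyc_bbox(lat_str, lon_str):
    # approximate NYC bounding box; empty strings mean the parse failed
    if not lat_str or not lon_str:
        return False
    lat, lon = float(lat_str), float(lon_str)
    return 40.4 <= lat <= 41.0 and -74.3 <= lon <= -73.6
```

Rows that fail this check (failed parses, or addresses geocoded outside the city) are the ones worth inspecting by hand before mapping.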
4 & 5. Data Visualization: load oil.csv into a new CartoDB dataset and georeference the map by the Latitude and Longitude columns.