1 - Abstract
2 - Project Background
3 - Project Objectives
4 - Data Description
5 - Data Cleaning
6 - Data Exploration
7 - Model Development
8 - Interpretation and Conclusions
9 - Acknowledgements and Sources
As the U.S. climate becomes more volatile and the population continues to rise, informed water use is more important than ever. The aim of this project is to generate predictive models for urban water use from climate, demographic, and geographic data. Additionally, models are generated to predict the amount of electricity needed to process and distribute the water consumed by cities. The models developed in this project fall under the category of Land Use Regression, or spatial prediction. There is limited publicly available data about city-level water use across the United States, and the motivation behind this project is to fill in those geographic data gaps, which may help inform policy decisions and improve the use of two vital resources: water and energy.
Four models are developed for both per capita drinking water predictions and per capita electricity predictions for drinking water processing: K Nearest Neighbors, Linear regression, Ridge regression, and Lasso regression. The takeaways from this project are that the KNN and Ridge models perform best overall for water per capita prediction. Both of these models perform significantly better for larger, higher-consuming cities, while the predictions for smaller, lower-consuming cities had error margins too large to be informative. No single feature in the climate, demographic, and geographic data used to train the models dominated the predictions generated. None of the models developed for per capita electricity usage made well-performing predictions, which is predominantly due to the variation in the types of energy cities use to process their drinking water, be that natural gas, coal, electricity, or others. Poor predictions may also be due to varying levels of technology across the country: some cities have much more advanced energy-saving technologies than others. Next steps include investigating features that may indicate how advanced a city's technology is, to better predict its electricity use for drinking water processing, as well as finding more complete data for total processing energy use. The main takeaway from this project is that more primary data is necessary in order to develop accurate prediction models for urban water use and processing energy use. The findings of this project may motivate agencies to prioritize resources toward better data collection on water use, in order to illuminate our society's consumption of this precious, declining resource.
Most of the U.S. population lives within urban districts, and utilities across the United States generate data on how much water is consumed in any given month, as well as the electricity needed to distribute and treat that water. However, this information is not often collected or made public. The Advancing Earth and Space Sciences journal published a paper with the aim of addressing this issue, setting out to generate a primary data source containing drinking and wastewater usage by month or year, as well as the energy needed to process each respective type of water use. Their primary data set contains this information for 2012.
Upon reading Chini and Stillwell's analysis of the water-energy nexus across the United States, we were motivated to generate spatial prediction models that could fill in the gaps for locations missing from the existing primary data set. Developing this type of prediction model would help inform better policy decisions, such as how to allocate resources among cities in order to help decrease water consumption in inefficient locations.
One of the interesting conclusions from this analysis is that cities in hotter climates often consume roughly the same amount across all months, while those in colder climates consume varying amounts according to the seasons. This led us to include climate data in our prediction model, since climate data is readily available for locations across the United States.
Another interesting note about the dataset we were working from is that in colder climates, the electricity used to produce drinking water often increases as temperatures drop. This led us to hypothesize that there is some relationship between temperature and the energy needed to process drinking water. Our inclusion of geographic data (elevation and geographic coordinates) in our prediction model stems from the effect of elevation and proximity to the equator on climate.
With a changing climate, there is a threat of increased temperatures globally. If temperature has an effect on drinking water consumption, we need to understand this effect in order to adequately conserve this finite, vital resource. We intended to better understand this effect by developing models that predict drinking water consumption and energy usage based on climate, geographic, and demographic data. If water consumption could be predicted with high accuracy from this data, cities would know how much water they will need in coming years given projected climate and population data. This would help predict how climate change may impact water consumption in different locations.
On top of this, if we could also accurately predict the electricity used to process consumed water, we would be able to see how much climate change will increase electricity demand from water processing alone. If U.S. cities wish to begin conserving water and electricity, they need to know their current consumption rates, which our model would provide, as well as how those rates may change with the changing climate and population. Much like climate scientists who aim to show how the world will be impacted by the rapid change in global climate, if our model is successful, it could be used to promote more sustainable water usage and make another case for more sustainable sources of electricity.
Both of these predictions are valuable because they would help fill in missing data across the United States about water and energy usage for locations that lack the capacity to collect this type of data. This would help address a resource allocation problem of identifying which cities need help in decreasing per capita water consumption by providing a complete picture of consumption levels across the country. These models would also help cities across the United States predict what their future water consumption might look like given how their climate and populations are projected to change. They can give cities worst-case or potential disaster scenarios to look out for, and could even help cities identify points at which water consumption would exhaust the water supply. Should we achieve our purpose, it would build a stronger case for water conservation as a whole.
Our data for electricity and water consumption was originally provided by the Consortium of Universities for the Advancement of Hydrologic Science, a non-profit created to analyze and advance science within the water use world. The group originally acquired the data from utilities in each city in order to get exact numbers. This data was not complete for all columns, which eliminated some data points for us; we removed any incomplete data. The drinking water and electricity consumption data were given monthly, providing a separate data point for each month, so we had 12 data points per city. The data is from 2012. We obtained population data from text files included in our primary data repository for each city. We had to manually input monthly weather data points from Weather Underground and NOAA into a file to be merged with our water and population data. The columns for fuel, propane, and natural gas usage were all relatively empty, so we eliminated them from our data set and focused on electricity and drinking water predictions.
Our climate data came from Weather Underground, which stores a history of weather data, as well as from the National Oceanic and Atmospheric Administration (NOAA), a large, reputable, government-run agency. Weather Underground uses sensors on major buildings across each city that upload data wirelessly, providing our monthly data. We input the data for our climate features manually from both Weather Underground and NOAA, which was a massive undertaking given how many data points we had (around 804). There was no single file available containing all the data we needed, so we had to toggle through filters on their websites. All data was taken from 2012 and matched with the corresponding water consumption data point. The data contains monthly average maximum temperature, average minimum temperature, overall average temperature, average dew point, average wind speed, and average sea level pressure, as well as total rainfall and total snowfall.
Our elevation, latitude, and longitude data came from the United States Geological Survey, another government-run organization that provides reputable data. These data points were also added manually to our dataset. It was a complete set with no missing values, and since we have 12 data points per city but only one elevation per city, each elevation value is repeated 12 times.
The final data set, generated through extensive merging and cleaning, had the following characteristics:
The energy and drinking water data that we merged into our final data frame came from individual csv files for each city. The weather data was originally stored on a website whose filter features helped us get to each data point we needed for our data frame.
Our data are aggregated by month and by city.
Our data cover 67 cities in the United States, with data from each month of 2012.
Time is represented in the data by month, since the entire data set is from the single year 2012.
The data are trustworthy given our exclusion of cities with missing data, and given our data exploration, which revealed no significant outliers.
We cannot provide the final data frame here because it is generated through the Data Cleaning section below; you may find this data frame at the end of that section.
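The outlier check mentioned above can be sanity-checked with a simple interquartile-range rule. The report does not specify the exact screen used, so this is only a minimal sketch; the `water_per_capita` values below are made up, not from our data set:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values falling outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return values[(values < low) | (values > high)]

# Hypothetical per capita values (mg per person):
water_per_capita = [0.004, 0.005, 0.0045, 0.0052, 0.0048, 0.005]
print(iqr_outliers(water_per_capita))  # nothing flagged -> prints []
```

The same function applied to each city's 12 monthly values would flag any month far outside that city's typical range.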
Goal overview for this section:
#Import dependencies.
import pandas as pd
import numpy as np
from pathlib import Path
names_1st_half = ['AlbanyNY', 'AlexandriaVA', 'AmarilloTX', 'AnchorageAK', 'AugustaGA',
'AustinTX','BakersfieldCA','BaltimoreMD','BeaumontTX','BoiseID',
'BostonMA','BridgeportCT','BuffaloNY','BurlingtonVT','CharlestonSC','CharlestonWV',
'CharlotteNC','CheyenneWY','ChicagoIL','CincinnatiOH','ClevelandOH','ColoradoSpringsCO',
'ColumbiaSC','ColumbusOH','CorpusChristiTX','DallasTX','DaytonOH','DenverCO','DetriotMI',
'DuluthMN','ElPasoTX','EugeneOR','FargoND','FortCollinsCO','FortWayneIN','FortWorthTX',
'FresnoCA','GreensboroNC',
'GreenvilleSC','HarrisburgPA','HonoluluHI','HoustonTX','IndianapolisIN','JacksonvilleFL']
names_2nd_half = ["KalamazooMI", "KansasCityMO", "KnoxvilleTN", "LakeCharlesLA", "LasVegasNV",
"LincolnNE", "LittleRockAR", "LosAngelesCA", "LouisvilleKY", "LubbockTX", "MadisonWI",
"ManchesterNH", "MemphisTN", "MiamiFL", "MilwaukeeWI", "MinneapolisMN","NashvilleTN",
"NewarkNJ", "NewOrleansLA", "NewYorkNY", "NorfolkVA", "North Texas", "OaklandCA",
"OgdenUT", "OklahomaCityOH", "OmahaNE", "PeoriaIL", "PhiladelphiaPA", "PhoenixAZ",
"PittsburghPA", "PortlandME", "PortlandOR", "ProvidenceRI", "ProvoUT", "RaleighNC",
"RenoNV", "SacramentoCA", "SalemOR", "SaltLakeCityUT", "SanAntonioTX", "SanDiegoCA",
"SanFranciscoCA", "SanJoseCA", "SantaFeNM", "SavannahGA", "SeattleWA", "SiouxFallsSD",
"SpokaneWA", "SpringfieldMA", "StLouisMO", "StPaulMN", "SyracuseNY", "TacomaWA",
"TallahasseeFL", "TampaFL", "ToledoOH", "TucsonAZ", "TulsaOK", "WashingtonDC",
"WichitaKS", "WorcesterMA"]
all_names = names_1st_half + names_2nd_half
Consideration:
#Function to load csv files as data frames
def load_one_Water(name):
    #AnchorageAK's files live in a folder with different capitalization.
    if name == "AnchorageAK":
        return pd.read_csv("AnchorageAk/Water_"+name+".csv")
    #Skip cities whose drinking water data are incomplete.
    if (name == "NashvilleTN") or (name == "NewYorkNY") or (name == "OgdenUT"):
        return
    my_file = Path(name+"/Water_"+name+".csv")
    if not my_file.exists():
        #print("Water "+name+" doesn't exist")
        return
    return pd.read_csv(name+"/Water_"+name+".csv")
#Function to load all drinking water data frames into a dictionary.
def load_all_drinking_water():
    df_list_Water = []
    for name in all_names:
        df_list_Water.append(load_one_Water(name))
    drinking_water_dict = dict(zip(all_names, df_list_Water))
    return drinking_water_dict
drinking_water_dict = load_all_drinking_water()
Considerations:
#Create drinking_water_df for all files with data for all 12 months
#Reset dict
drinking_water_dict = load_all_drinking_water()
frames = []
for name in all_names:
    curr_drinking_water_df = drinking_water_dict.get(name)
    #Check that the file exists for the city, and check that it has all 12 months
    if (curr_drinking_water_df is not None) and (len(curr_drinking_water_df) > 11):
        city = name[:-2]
        state = name[-2:]
        curr_drinking_water_df["City"] = city
        curr_drinking_water_df["State"] = state
        curr_drinking_water_df["CityState"] = name
        #Make all column names lower case
        curr_drinking_water_df.columns = map(str.lower, curr_drinking_water_df.columns)
        #Standardize the inconsistent column names across the city files
        curr_drinking_water_df = curr_drinking_water_df.rename(columns={
            "biogas (therms)": "biogas (therm)",
            "electric consumption (kwh)": "electricity (kwh)",
            "electricity (kwh)*": "electricity (kwh)",
            "natural gas (therms)": "natural gas (therm)",
            "volume (mgd)": "volume (mg)"})
        #Flag any file where volume and electricity were fused into a single tab-separated column
        if "volume (mg)\telectricity (kwh)" in curr_drinking_water_df.columns:
            print(name)
        frames.append(curr_drinking_water_df[:12])
drinking_water_df = pd.concat(frames)
drinking_cities = drinking_water_df["citystate"].unique()
Considerations:
#Extract the populations from the text files. Returns a list of the population(s) for each place.
def get_numbers_from_text(file_name):
    #Check that a txt file exists for the place
    if not Path(file_name+"/READ-ME_"+file_name+".txt").exists():
        return
    #SanJoseCA and WorcesterMA have an error in their text files that cannot be read.
    if (file_name == "SanJoseCA") or (file_name == "WorcesterMA"):
        return
    words = []
    with open(file_name+"/READ-ME_"+file_name+".txt", "r") as f:
        for line in f:
            for word in line.split():
                words.append(word)
    num = []
    numbers = []
    for i in words:
        if i == "Error!":
            return
        for j in i:
            if j.isdigit():
                num.append(j)
        if len(num) != 0:
            joined_num = int("".join(num))
            #Keep only plausible population figures; the second constant filters a known junk value.
            if (joined_num > 10000) and (joined_num != 446515096638828383):
                numbers.append(joined_num)
            num = []
    return numbers
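For reference, the digit-scanning loop above can be expressed more compactly with a regular expression. This is an equivalent sketch, not the code used to build the population dictionary; the helper name `extract_populations` is ours:

```python
import re

def extract_populations(text, floor=10_000, junk=446515096638828383):
    """Concatenate the digits within each whitespace token, then keep plausible population figures."""
    tokens = text.split()
    if "Error!" in tokens:
        return None
    nums = []
    for token in tokens:
        digits = re.findall(r"\d", token)
        if digits:
            nums.append(int("".join(digits)))
    #Drop small numbers (dates, footnotes) and the known junk value.
    return [n for n in nums if n > floor and n != junk]

print(extract_populations("Population served: 226,984 (2012 estimate)"))  # -> [226984]
```

Note that, like the original loop, this joins the digits within each token, so "226,984" becomes 226984 while "(2012" becomes 2012 and is filtered out by the size floor.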
#Create population dictionary. Keys correspond to place names, and values correspond to the population(s).
pop_dict = dict()
for file_name in all_names:
    numbers = get_numbers_from_text(file_name)
    if numbers is not None:
        pop_dict[file_name] = numbers
places_in_pop_dict = pop_dict.keys()
Considerations
#Add population into drinking_water_df
drinking_pops = []
for place in drinking_cities:
    if (place == "WorcesterMA"):
        drinking_pops += [None for i in range(12)]
    if (place == "AnchorageAK"):
        drinking_pops += [226984 for i in range(12)]
    if pop_dict.get(place) is not None:
        drinking_pop = pop_dict.get(place)[0]
        drinking_pops += [drinking_pop for i in range(12)]
#Manually input Worcester population, missing from Hydro Share data set due to error in text file.
drinking_pops[-12:] = [182627 for i in range(12)]
drinking_water_df["population"] = drinking_pops
drinking_water_df.shape
drinking_water_df.to_csv("drinking_water_monthly.csv")
Here is a clear list of the places that have complete water data. This is helpful to know for the next step of aggregating the correct weather data for the places that we have water data.
print("Places with complete and clean drinking water data:")
print(drinking_cities)
weather_drinking = pd.read_csv("drinking_weather.csv")
Considerations
#Rename column to match drinking_water_df for concatenating
weather_drinking = weather_drinking.rename(columns={"Unnamed: 0": "citystate"})
Outputting the heads of the two data frames, weather_drinking and drinking_water_df, provides an idea of how the data frames look and helps to visualize how to join the two data frames.
weather_drinking.head()
drinking_water_df.head()
#Drop Nashville, New York and Ogden from weather drinking set because complete data missing from drinking water set.
weather_drinking = weather_drinking[weather_drinking["citystate"]!="NashvilleTN"]
weather_drinking = weather_drinking[weather_drinking["citystate"]!="NewYorkNY"]
weather_drinking = weather_drinking[weather_drinking["citystate"]!="OgdenUT"]
#From this code, I was able to find and fix spelling errors we had in our weather data set.
weather_drinking_cities = weather_drinking["citystate"].unique()
drinking_cities = drinking_water_df["citystate"].unique()
print(len(drinking_cities))
print(len(weather_drinking_cities))
for i in drinking_cities:
    if i not in weather_drinking_cities:
        print(i)
weather_drinking_cities
Here is a clear output of the cities in each data frame. You can see that both lists match, which is confirmed by the cell above.
drinking_cities
Considerations
drinking_water_df = drinking_water_df.reset_index()
weather_drinking = weather_drinking.reset_index()
drinking_complete_df = pd.concat([drinking_water_df, weather_drinking], axis=1)
Ensure that the finished data frame is the correct shape.
print(drinking_water_df.shape)
print(weather_drinking.shape)
print(drinking_complete_df.shape)
Rename the index column to month number. Remove the duplicate columns.
drinking_complete_df = drinking_complete_df.rename(columns={"index": "month number"})
drinking_complete_df = drinking_complete_df.loc[:, ~drinking_complete_df.columns.duplicated()]
Export the complete drinking water data frame that now contains weather to a csv for collaboration purposes.
drinking_complete_df.to_csv("drinking_complete_df.csv")
print(drinking_complete_df.shape)
print()
print(drinking_complete_df.columns)
#Load in external data set
lat_lons = pd.read_csv("lat lons.csv")
lat_lon_places = list(lat_lons["Unnamed: 0"])
#Clean the strings in external set to match our data frame place names.
clean_lat_lon_places = []
for s in lat_lon_places:
    s = s.replace(',', '')
    s = s.replace(" ", "")
    s = s.replace("USA", "")
    s = s.replace("theUS", "")
    clean_lat_lon_places.append(s)
lat_lons["places"] = clean_lat_lon_places
lat_lons = lat_lons.drop(columns = "Unnamed: 0")
lat_lons = lat_lons.rename(columns={"Unnamed: 1":"Latitude", "Unnamed: 2": "Longitude"})
lat_lon_clean_df = lat_lons.drop(index = 0).reset_index(drop=True)
len(lat_lon_clean_df)
lat_lon_clean_df.drop_duplicates(subset="places", keep="first", inplace=True)
print(len(lat_lon_clean_df))
lat_lon_clean_df.head()
lat_lon_dict = dict()
lats = [np.nan] * len(drinking_cities)
lons = [np.nan] * len(drinking_cities)
num = 0
for i in range(len(drinking_cities)):
    for j in lat_lon_clean_df["places"]:
        if drinking_cities[i] == j:
            lat = float(lat_lon_clean_df[lat_lon_clean_df["places"] == drinking_cities[i]]["Latitude"])
            lon = float(lat_lon_clean_df[lat_lon_clean_df["places"] == drinking_cities[i]]["Longitude"])
            lat_lon_dict[drinking_cities[i]] = [lat, lon]
            lats[i] = lat
            lons[i] = lon
            num += 1
for i in range(len(lats)):
    if lats[i] is np.nan:
        print(drinking_cities[i])
        print(i)
#Input missing lats
lats[11] = 34.0007 #columbia
lats[45] = 40.4406 #pitts
lats[46] = 43.6591 #portland me
lats[56] = 43.5473 #sioux
lats[59] = 38.6270 #stlouis
#Input missing lons
lons[11] = 81.0348 #columbia
lons[45] = 79.9959 #pitts
lons[46] = 70.2568 #portland me
lons[56] = 96.7283 #sioux
lons[59] = 90.1994 #stlouis
#This shows that all lats and lons are now accounted for.
for i in range(len(lats)):
    if lats[i] is np.nan:
        print(drinking_cities[i])
        print(i)
lats_extended = []
for i in lats:
    lats_extended += [i for j in range(12)]
lons_extended = []
for i in lons:
    lons_extended += [i for j in range(12)]
print(len(lons_extended))
print(len(lats_extended))
drinking_complete_df["lats"] = lats_extended
drinking_complete_df["lons"] = lons_extended
drinking_complete_df.to_csv("drinking_complete_df.csv")
#Import dependencies.
import datetime
import os
import requests
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.font_manager import FontProperties
%matplotlib inline
drinking_complete_df.head()
Add a column into the data frame that has the volume of water consumed per capita.
water_per_capita = drinking_complete_df['volume (mg)']/drinking_complete_df['population']
drinking_complete_df['water per capita'] = water_per_capita
drinking_complete_df.head()
Add a column in that has the electricity use per capita.
First we must restrict our drinking water data set to observations that have non-null values for electricity (kwh), and then add an electricity per capita column. We must also convert the string values (which contain commas) to floats.
electricity = drinking_complete_df[pd.notnull(drinking_complete_df['electricity (kwh)'])].copy()
#Convert non-floats to floats in electricity column.
clean_electricity = list(electricity["electricity (kwh)"])
for i in range(len(clean_electricity)):
    if type(clean_electricity[i]) == str:
        s = clean_electricity[i]
        s = s.replace(",", "")
        clean_electricity[i] = float(s)
electricity["electricity (kwh)"] = clean_electricity
electricity["electricity per capita"] = electricity["electricity (kwh)"]/electricity["population"]
electricity.columns
#Ensure all values in electricity column have been converted to float values.
for i in list(electricity["electricity (kwh)"]):
    if type(i) == str:
        print(i)
This plot shows the density of water use per capita by month. It shows that drinking water consumption varies more in certain months, which suggests that including month as a feature in our prediction model may be beneficial. The winter months have distributions concentrated around a particular water use per capita value, while the summer months have more spread-out distributions. This could be because in the summer, places with very high temperatures may consume more drinking water than places with more moderate climates. Variation in cold during the winter does not seem to impact drinking water consumption as much as heat does.
#Figure 1 code
month_names = drinking_complete_df["Month"].unique().tolist()
plt.figure(figsize=(10, 7))
palette = sns.cubehelix_palette(12)
for i in range(0,12):
    sns.distplot(drinking_complete_df[drinking_complete_df["month number"] == i]["water per capita"],
                 hist=False, kde=True, label=month_names[i], color = palette[i-12])
plt.legend()
plt.title('Water Use Density by Month', size = 20)
plt.xlabel('Water per Capita', size = 15)
plt.ylabel('Density', size = 15)
Following up on Figure 1, let's look further into how the summer months compare with the winter months. It is now clearer that whether a month falls in summer or winter does in fact have some relation to per capita water consumption.
#Figure 2 code
plt.figure(figsize=(15, 7))
#Plot summer months
plt.subplot(121)
for i in range(4,10):
    sns.distplot(drinking_complete_df[drinking_complete_df["month number"] == i]["water per capita"],
                 hist=False, kde=True, label=month_names[i])
plt.legend()
plt.title('Water Use Density Summer Months', size = 20)
plt.xlabel('Water per Capita', size = 15)
plt.ylabel('Density', size = 15)
plt.ylim(0, 400)
plt.xlim(0, .03)
#Plot winter months
plt.subplot(122)
for i in list(range(0,4)) + list(range(10, 12)):
    sns.distplot(drinking_complete_df[drinking_complete_df["month number"] == i]["water per capita"],
                 hist=False, kde=True, label=month_names[i])
plt.legend()
plt.title('Water Use Density Winter Months', size = 20)
plt.xlabel('Water per Capita', size = 15)
plt.ylabel('Density', size = 15)
plt.ylim(0, 400)
plt.xlim(0, .03)
Let's look at a few simple visualizations of average temperature, precipitation, and snow depth versus water consumption. There is generally a positive association between temperature and water consumption, but the association with precipitation is less clear. Snow depth has an interesting relationship with water consumption: lower snow depths show much higher variation in water consumption. This may be evidence that another variable associated with low snow depth leads to high water usage, such as high summer temperatures, as Figure 2 implies.
#Figure 4 code
plt.figure(figsize=(5, 5))
drinking_complete_df.plot.scatter(x = "T Avg", y ="water per capita", subplots= True)
plt.title("Temperature vs Water per Capita Consumption", size = 20)
plt.xlabel("Average Temperature", size = 15)
plt.ylabel("Water Consumption Per Capita", size = 13)
drinking_complete_df.plot.scatter(x = "Precipitation (Inches)", y ="water per capita", subplots= True)
plt.title("Precipitation vs Water per Capita Consumption", size = 20)
plt.xlabel("Precipitation", size = 15)
plt.ylabel("Water Consumption Per Capita", size = 13)
drinking_complete_df.plot.scatter(x = "Snowdepth (Inches)", y ="water per capita", subplots= True)
plt.title("Snowdepth vs Water per Capita Consumption", size = 20)
plt.xlabel("Snowdepth", size = 15)
plt.ylabel("Water Consumption Per Capita", size = 13)
Here we show the water per capita per month by city.
color = sns.color_palette("colorblind", 69)
cities = list(drinking_complete_df["city"].unique())
plt.figure(figsize = (30,20))
for name in cities:
    city_df = drinking_complete_df[drinking_complete_df["city"] == name]
    plt.plot(city_df["Month"], city_df["water per capita"], color = color[cities.index(name)])
plt.legend(cities, loc = "upper right")
plt.xlabel("Month", fontsize = 30)
plt.ylabel("Water per Capita", fontsize = 30)
plt.title("Water per Capita per Month for All Cities", fontsize = 40)
Now let's look further into the places with the hottest and coldest temperatures to compare their water consumption in those months specifically.
#Find 10 places that have the highest average temperature throughout all 12 of their months.
top_10_hottest_cities = drinking_complete_df.sort_values(by = ["T Avg"], ascending = False)["citystate"].unique()[:10]
print(top_10_hottest_cities)
print()
#Find 10 places that have the lowest average temperature throughout all 12 of their months.
top_10_coldest_cities = drinking_complete_df.sort_values(by = ["T Avg"], ascending = True)["citystate"].unique()[:10]
print(top_10_coldest_cities)
color = sns.color_palette("colorblind", 69)
cities = list(drinking_complete_df["citystate"].unique())
plt.figure(figsize = (30,20))
for name in cities:
    city_df = drinking_complete_df[drinking_complete_df["citystate"] == name]
    if name in top_10_hottest_cities:
        plt.plot(city_df["Month"], city_df["water per capita"], color = "darkred")
    elif name in top_10_coldest_cities:
        plt.plot(city_df["Month"], city_df["water per capita"], color = "blue")
    else:
        plt.plot(city_df["Month"], city_df["water per capita"], color = "lightgray")
plt.legend(cities, loc = "upper right")
plt.xlabel("Month", fontsize = 30)
plt.ylabel("Water per Capita", fontsize = 30)
plt.title("Water per Capita per Month for All Cities", fontsize = 40)
electricity.shape
Let's see how electricity and water consumption compare for the hottest and coldest cities by generating a plot similar to Figure 5. There is a clear division between the hottest and coldest cities: the hotter cities appear to consume more electricity per capita during the summer months.
#Find 10 places that have the highest average temperature throughout all 12 of their months.
top_10_hottest_cities_e = electricity.sort_values(by = ["T Avg"], ascending = False)["citystate"].unique()[:10]
print(top_10_hottest_cities_e)
print()
#Find 10 places that have the lowest average temperature throughout all 12 of their months.
top_10_coldest_cities_e = electricity.sort_values(by = ["T Avg"], ascending = True)["citystate"].unique()[:10]
print(top_10_coldest_cities_e)
color = sns.color_palette("colorblind", 69)
cities = list(electricity["citystate"].unique())
plt.figure(figsize = (30,20))
for name in cities:
    city_df = electricity[electricity["citystate"] == name]
    #Use the hottest/coldest lists computed from the electricity set above.
    if name in top_10_hottest_cities_e:
        plt.plot(city_df["Month"], city_df["electricity per capita"], color = "darkred")
    elif name in top_10_coldest_cities_e:
        plt.plot(city_df["Month"], city_df["electricity per capita"], color = "blue")
    else:
        plt.plot(city_df["Month"], city_df["electricity per capita"], color = "lightgray")
plt.legend(cities, loc = "upper right")
plt.xlabel("Month", fontsize = 30)
plt.ylabel("Electricity per Capita", fontsize = 30)
plt.title("Electricity per Capita per Month for All Cities", fontsize = 40)
Let's take a look at the total electricity per capita consumed per month in 2012 to explore the relationship between month and electricity per capita consumed. From this graph it is clear that the electricity per capita consumption does vary by month, which motivates the generation of our model in the next few steps.
df_grouped = electricity[["Month", "electricity per capita"]].groupby("Month").sum()
#Reindex from alphabetical to calendar order so the bar chart reads January through December.
df_grouped = df_grouped.reindex(month_names)
df_grouped["Month"] = list(df_grouped.index)
df_grouped
df_grouped.plot.bar(x = "Month", y = "electricity per capita", legend = False)
plt.title("Total Electricity per Capita Consumed by Month", size = 20)
plt.ylabel("Total Electricity per Capita")
We will be using different modeling approaches to predict drinking water use and the electricity needed to process drinking water.
For our KNN model, nearest neighbor proximity will be based on spatial distance. For each location, we will find its K-nearest neighbors in our drinking water dataset, and then we will use their average drinking water consumption as the forecast for that location.
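The forecasting rule described above can be illustrated with a toy example. The coordinates and per capita values below are made up, not from our data set, and the planar lat/lon distance mirrors the simple distance measure used later in this section:

```python
import numpy as np

# Hypothetical training cities: (latitude, longitude, water per capita)
train = np.array([
    [40.7,  -74.0, 0.0045],
    [39.9,  -75.2, 0.0050],
    [42.4,  -71.1, 0.0040],
    [34.1, -118.2, 0.0060],
])

def knn_forecast(lat, lon, train, k):
    """Average the k spatially nearest training values (planar lat/lon distance)."""
    dists = np.sqrt((train[:, 0] - lat) ** 2 + (train[:, 1] - lon) ** 2)
    nearest = np.argsort(dists)[:k]
    return train[nearest, 2].mean()

# Forecast for a point near the three Northeast cities, with k=3:
print(knn_forecast(41.0, -74.5, train, k=3))  # averages 0.0045, 0.0050, 0.0040 -> 0.0045
```

With k=3, the distant fourth city is ignored and the forecast is simply the mean of the three nearby cities' values; the functions below implement this same idea on our actual data frame.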
To assist in this part of the process, the function get_month_data() takes in a month parameter and returns a data frame containing only the data recorded in that month, restricted to latitude, longitude, and water per capita. We predict water per capita because a place's population will most likely have a significant impact on the total amount of water it consumes, so modeling consumption per capita will be more accurate.
#Part a code.
#Making output an input in order to make it easier to predict energy later.
def get_month_data(df, month, output):
    month_data = df[df["Month"] == month]
    month_data = month_data[["lats", 'lons', output]]
    return month_data.reset_index(drop=True)
#Check that the length is 67 to ensure all 67 cities are accounted for in the January data.
jan = get_month_data(drinking_complete_df, "January", "water per capita")
print(jan.shape)
This method samples a random set of rows from the month data set to form the test set; those rows are then removed from the month data set, and what remains forms the train set. This gives us held-out observations to predict and score, so that we can compute the average error for a given value of k.
#Part b code.
def train_test(month_data, test_size, seed):
    test = month_data.sample(n = test_size, random_state=seed)
    to_drop = list(test.index)
    train = month_data.drop(to_drop)
    return train, test
#Affirms train_test works.
train, test = train_test(jan, 12, 1)
test.shape, train.shape
#Part c code.
#Returns a list of the indexes in train that are the closest k points to a test index.
def closest_k(lat, lon, train, k):
    #Pair each training index with its distance to the test point.
    #(Pairs, rather than a dict keyed by distance, so points with tied
    #distances are not silently dropped.)
    dists = []
    for i in list(train.index):
        dist = calc_dist(train.loc[i]["lats"], train.loc[i]["lons"], lat, lon)
        dists.append((dist, i))
    #Sort by distance and return the indexes of the k nearest points.
    dists.sort(key=lambda pair: pair[0])
    return [i for _, i in dists[:k]]
#Calculate distance between two sets of latitudes and longitudes.
def calc_dist(lat_1, lon_1, lat_2, lon_2):
return np.sqrt((np.subtract(lat_1, lat_2))**2 + np.subtract(lon_1,lon_2)**2)
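Note that calc_dist treats latitude and longitude as planar coordinates, which is reasonable for ranking nearby points but distorts east-west distances away from the equator. A great-circle (haversine) version could be swapped in instead; the sketch below is not part of the original pipeline, and haversine_dist is a hypothetical name:

```python
import numpy as np

def haversine_dist(lat_1, lon_1, lat_2, lon_2):
    """Great-circle distance in kilometers between two (lat, lon) points.

    A drop-in alternative to calc_dist: unlike the planar formula, it
    accounts for the Earth's curvature and for longitude degrees
    shrinking away from the equator.
    """
    r = 6371.0  # mean Earth radius in km
    phi_1, phi_2 = np.radians(lat_1), np.radians(lat_2)
    d_phi = np.radians(lat_2 - lat_1)
    d_lam = np.radians(lon_2 - lon_1)
    a = np.sin(d_phi / 2) ** 2 + np.cos(phi_1) * np.cos(phi_2) * np.sin(d_lam / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))
```

Because closest_k only ranks distances, swapping in the haversine metric would change which neighbors are selected mainly for city pairs at very different latitudes.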
#Compute test set average error for certain value of k.
def k_error(test, train, k, output):
    errors = []
    for i in list(test.index):
        test_lat = test.loc[i, "lats"]
        test_lon = test.loc[i, "lons"]
        closest_in_train_by_index = closest_k(test_lat, test_lon, train, k)
        #Collect the k nearest neighbors' values of "output" for this test point.
        train_outputs = []
        for j in closest_in_train_by_index:
            train_outputs.append(train.loc[j, output])
        prediction = np.mean(train_outputs)
        error = abs(prediction - test.loc[i, output])
        errors.append(error)
    return np.mean(errors)
def knn(df, month, k, test_size, seed, output):
    month_data = get_month_data(df, month, output)
    train, test = train_test(month_data, test_size, seed)
    avg_error = k_error(test, train, k, output)
    #print(month + " has an error of " + str(avg_error) + " when k=" + str(k))
    return avg_error
df = drinking_complete_df
month = "January"
k = 3
test_size = 10
seed = 1
output_1 = "water per capita"
#Test knn function.
knn(df, month, k, test_size, seed, output_1)
all_month_errors = pd.DataFrame()
month_names_1 = list(drinking_complete_df["Month"].unique())
values_of_k_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_size_1 = 10
seed_1 = 1
output_1 = "water per capita"
all_month_errors["k"] = values_of_k_1
for month in month_names_1:
    month_errors_1 = []
    for i in values_of_k_1:
        month_errors_1.append(knn(drinking_complete_df, month, i, test_size_1, seed_1, output_1))
    all_month_errors[month] = month_errors_1
Here you can see the average errors for each month and each value of k. The error generally decreases as k increases, before leveling off.
all_month_errors
no_k = all_month_errors.drop(columns = ["k"])
all_month_errors['mean'] = no_k.mean(axis=1)
all_month_errors
Here you are able to see the average error for each value of k. The value k = 5 has the smallest mean error, so predicting drinking water consumption per capita using the nearest 5 neighbors is the most accurate. It is important to note that a low k (e.g., 1) would result in high variance, and a high k (e.g., 10) would result in high bias. For this reason as well, 5 is a good choice for k.
summary = all_month_errors[["k", "mean"]]
summary.plot.bar(x = "k", y = "mean", legend = False)
plt.title("Average Prediction Error per Value of K", size = 20)
plt.ylabel("Average Prediction Error")
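As a cross-check on our hand-rolled pipeline, the same spatial KNN can be expressed with scikit-learn's KNeighborsRegressor, which also uses Euclidean distance on (lat, lon) by default. The sketch below mirrors knn(); knn_sklearn_error is a hypothetical helper, and it assumes a month data frame shaped like the output of get_month_data() (lats, lons, then the target column).

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

def knn_sklearn_error(month_data, k, test_size, seed):
    """Mean absolute error of a spatial KNN on one month of data.

    Mirrors the hand-rolled knn() above: predict the target column
    (assumed to be the third column, e.g. "water per capita") from
    latitude/longitude with a uniform average of k nearest neighbors.
    """
    test = month_data.sample(n=test_size, random_state=seed)
    train = month_data.drop(test.index)
    target = month_data.columns[2]  # e.g. "water per capita"
    model = KNeighborsRegressor(n_neighbors=k)  # Euclidean on (lat, lon)
    model.fit(train[["lats", "lons"]], train[target])
    preds = model.predict(test[["lats", "lons"]])
    return np.mean(np.abs(preds - test[target].values))
```

If both implementations are correct, errors from this helper should track the hand-rolled values closely (small differences can arise from how ties in distance are broken).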
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
import numpy as np
def get_X_y(df):
    X = df.drop(columns = ['month number',"city", "citystate", "state", "month", 'Month', "electricity (kwh)", "fuel oil (gal)", 'natural gas (therm)', "propane (gal)", "volume (mg)", 'water per capita'])
    y = df[["volume (mg)"]]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 123)
    return X_train, X_test, y_train, y_test
X_drink_train, X_drink_test, y_drink_train, y_drink_test = get_X_y(drinking_complete_df)
Check that the data appears as expected.
X_drink_train.head()
Using cross-validation, we will search for the alpha value that gives each regression its best fit. To do this, we must also import a few more functions from sklearn.
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
alphas_ridge = np.linspace(0, 10000, 200)
kf = KFold(n_splits = 3, shuffle = True, random_state = 0)
ridgecv = RidgeCV(cv = kf, alphas=alphas_ridge)
ridgecv.fit(X_drink_train, y_drink_train)
r_alpha_opt = ridgecv.alpha_
print("optimal alpha:", r_alpha_opt)
y_pred_ridgecv = ridgecv.predict(X_drink_test)
ridgecv_mse = mean_squared_error(y_pred_ridgecv, y_drink_test)
print("MSE with cross-validated Ridge:", ridgecv_mse)
from sklearn.linear_model import LassoCV
alphas_lasso = np.linspace(1, 100, 20)
kf = KFold(n_splits = 3, shuffle = True, random_state = 0)
lassocv = LassoCV(cv = kf, alphas=alphas_lasso, tol = .001, max_iter = 100000)
lassocv.fit(X_drink_train, y_drink_train.values.ravel())
l_alpha_opt = lassocv.alpha_
print("optimal alpha:", l_alpha_opt)
y_pred_lassocv = lassocv.predict(X_drink_test)
lassocv_mse = mean_squared_error(y_pred_lassocv, y_drink_test)
print("MSE with cross-validated Lasso:", lassocv_mse)
def fit_model(Model, X_train, X_test, y_train, y_test, lasso_alpha = l_alpha_opt, ridge_alpha = r_alpha_opt):
    if Model == LinearRegression:
        model = Model()
    elif Model == Ridge:
        model = Model(alpha = ridge_alpha)
    else:
        model = Model(alpha = lasso_alpha)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    coef = model.coef_.flatten()
    return mse, coef
Models = [LinearRegression, Ridge, Lasso]
mse_drink = np.zeros(len(Models))
coef_drink = np.zeros((X_drink_train.shape[1], len(Models)))
for Model, i in zip(Models, np.arange(len(Models))):
    mse_drink[i], coef_drink[:,i] = fit_model(Model, X_drink_train, X_drink_test, y_drink_train, y_drink_test)
When examining these MSEs, keep in mind that the target, monthly water volume, ranges up to about 25,000 million gallons, so even modest relative errors produce large squared errors. This is why the MSE values appear high in absolute terms.
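Because MSE is in squared units, one quick way to make these numbers interpretable is to convert them to RMSE, which is back in million gallons and can be compared directly to typical monthly volumes. A minimal sketch (to_rmse is a hypothetical helper; mse_drink is the array computed above):

```python
import numpy as np

def to_rmse(mse_values):
    """Convert an array of MSE scores to RMSE, restoring the target's units."""
    return np.sqrt(np.asarray(mse_values, dtype=float))

# e.g. to_rmse(mse_drink) gives one RMSE per model, in million gallons
```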
ind = np.arange(coef_drink.shape[0])
width = 0.25
pos = np.array([ind - width, ind, ind + width])
modelNames = ["Linear regression", "Ridge", "Lasso"]
plt.figure(figsize = (20,10))
plt.subplot(111)
for i in np.arange(coef_drink.shape[1]):
    plt.bar(pos[i], height = coef_drink[:,i], width = width, label = modelNames[i])
plt.legend()
plt.xlabel("Feature number")
plt.ylabel("Feature coefficient")
plt.title("Model comparison, Water")
plt.tight_layout()
plt.show()
mse_drink
coef_drink
The MSE for the ridge regression is the best of the three. Ridge regression proves to be the best model for this dataset because most of our features have some effect on the output; Lasso, by contrast, prefers solutions in which only a handful of features carry nonzero coefficients.
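This Ridge-versus-Lasso point can be checked directly: Lasso drives some coefficients exactly to zero, so counting the nonzero entries in each column of the coefficient array shows how many features each model actually uses. A sketch, assuming a coefficient array shaped like coef_drink above (one column per model); nonzero_feature_counts is a hypothetical helper:

```python
import numpy as np

def nonzero_feature_counts(coef, model_names):
    """Count features with nonzero coefficients for each fitted model.

    coef is assumed to have one column per model, as coef_drink does above.
    A small tolerance guards against floating-point near-zeros.
    """
    counts = (np.abs(coef) > 1e-10).sum(axis=0)
    return dict(zip(model_names, counts.tolist()))
```

If Lasso's count is much smaller than Ridge's while its MSE is worse, that supports the claim that most features matter here.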
This process will be similar to part 1, where we found the optimal k for predicting water per capita with a KNN model. However, now we will predict the electricity per capita required to process water.
For our KNN model, nearest neighbor proximity will be based on spatial distance. For each location, we will find its K-nearest neighbors in our drinking water dataset, and then we will use their average electricity consumption per capita as the forecast for that location.
We are able to use many of the functions defined in part 1 to run KNN, in particular our knn function that uses all the helpers we constructed.
all_month_errors_electricity = pd.DataFrame()
month_names_2 = list(drinking_complete_df["Month"].unique())
values_of_k_2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_size_2 = 10
seed_2 = 1
output_2 = "electricity per capita"
all_month_errors_electricity["k"] = values_of_k_2
for month in month_names_2:
    month_errors_2 = []
    for i in values_of_k_2:
        month_errors_2.append(knn(electricity, month, i, test_size_2, seed_2, output_2))
    all_month_errors_electricity[month] = month_errors_2
all_month_errors_electricity
no_k = all_month_errors_electricity.drop(columns = ["k"])
all_month_errors_electricity['mean'] = no_k.mean(axis=1)
all_month_errors_electricity
Here it is clear that K Nearest Neighbors is not a good predictor for electricity: the lowest-bias, highest-variance model (k = 1) has the lowest error, which suggests that averaging over more neighbors adds no real signal.
summary_2 = all_month_errors_electricity[["k", "mean"]]
summary_2
summary_2.plot.bar(x = "k", y = "mean", legend = False)
plt.title("Average Prediction Error per Value of K", size = 20)
plt.ylabel("Average Prediction Error")
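One way to substantiate the conclusion that KNN is uninformative for electricity is to compare its errors against a naive baseline that always predicts the training-set mean; if KNN does not clearly beat this, the spatial signal is weak. A sketch in the style of train_test above (baseline_error is a hypothetical helper, not part of the original pipeline):

```python
import numpy as np
import pandas as pd

def baseline_error(month_data, target, test_size, seed):
    """Mean absolute error of always predicting the training-set mean.

    Uses the same sample-then-drop split as train_test above, so the
    result is directly comparable to the KNN errors for the same seed.
    """
    test = month_data.sample(n=test_size, random_state=seed)
    train = month_data.drop(test.index)
    prediction = train[target].mean()  # one constant prediction for all test rows
    return np.mean(np.abs(test[target] - prediction))
```

Any model worth keeping should beat this constant-prediction baseline by a clear margin.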
We will start the same way we did last time: dropping unnecessary columns, creating X and y data frames, and creating our train and test splits.
def get_X_y2(df):
    X = df.drop(columns = ['electricity per capita','month number',"city", "citystate", "state", "month", 'Month', "electricity (kwh)", "fuel oil (gal)", 'natural gas (therm)', "propane (gal)", "volume (mg)", 'water per capita'])
    y = df[["electricity (kwh)"]]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 123)
    return X_train, X_test, y_train, y_test
X_elec_train, X_elec_test, y_elec_train, y_elec_test = get_X_y2(electricity)
X_elec_train.columns = X_elec_train.columns.str.strip()
X_elec_test.columns = X_elec_test.columns.str.strip()
y_elec_train.columns = y_elec_train.columns.str.strip()
y_elec_test.columns = y_elec_test.columns.str.strip()
alphas_ridge = np.linspace(0, 10000, 200)
kf = KFold(n_splits = 3, shuffle = True, random_state = 0)
ridgecv = RidgeCV(cv = kf, alphas=alphas_ridge)
ridgecv.fit(X_elec_train, y_elec_train)
r_alpha_opt2 = ridgecv.alpha_
print("optimal alpha:", r_alpha_opt2)
y_pred_ridgecv = ridgecv.predict(X_elec_test)
ridgecvelec_mse = mean_squared_error(y_pred_ridgecv, y_elec_test)
print("Electricty MSE with cross-validated Ridge:", ridgecvelec_mse)
alphas_lasso = np.linspace(1, 100, 20)
kf = KFold(n_splits = 3, shuffle = True, random_state = 0)
lassocv = LassoCV(cv = kf, alphas=alphas_lasso, tol = .00001, max_iter = 1000000)
lassocv.fit(X_elec_train, y_elec_train.values.ravel())
l_alpha_opt2 = lassocv.alpha_
print("optimal alpha:", l_alpha_opt2)
y_pred_lassocv = lassocv.predict(X_elec_test)
lassocvelec_mse = mean_squared_error(y_pred_lassocv, y_elec_test)
print("Electricity MSE with cross-validated Lasso:", lassocvelec_mse)
def fit_model2(Model, X_train, X_test, y_train, y_test, lasso_alpha = l_alpha_opt2, ridge_alpha = r_alpha_opt2):
    if Model == LinearRegression:
        model = Model()
    elif Model == Ridge:
        model = Model(alpha = ridge_alpha)
    else:
        model = Model(alpha = lasso_alpha)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    coef = model.coef_.flatten()
    return mse, coef
#Perform the fits, collect the mse and coefficients into their own arrays so they can be compared
Models = [LinearRegression, Ridge, Lasso]
mse_elec = np.zeros(len(Models))
coef_elec = np.zeros((X_elec_train.shape[1], len(Models)))
for Model, i in zip(Models, np.arange(len(Models))):
    mse_elec[i], coef_elec[:,i] = fit_model2(Model, X_elec_train, X_elec_test, y_elec_train, y_elec_test)
ind = np.arange(coef_elec.shape[0])
width = 0.25
pos = np.array([ind - width, ind, ind + width])
modelNames = ["Linear regression", "Ridge", "Lasso"]
plt.figure(figsize = (20,10))
plt.subplot(111)
for i in np.arange(coef_elec.shape[1]):
    plt.bar(pos[i], height = coef_elec[:,i], width = width, label = modelNames[i])
plt.legend()
plt.xlabel("Feature number")
plt.ylabel("Feature coefficient")
plt.title("Model comparison, Electricity")
plt.tight_layout()
plt.show()
mse_elec
In conclusion, the KNN (k = 5) and Ridge models for water per capita prediction perform the best overall. Both models perform significantly better for larger, higher-consuming cities; the predictions for smaller, lower-consuming cities carried margins of error too large to be informative. No single feature in the climate, demographic, and geographic training data dominated the predictions. The KNN results support the idea that geographically close locations have similar per capita water consumption in a given month, which may be attributed to similarity in climate.
None of the models developed for electricity per capita usage made well-performing predictions, predominantly because of variation in the types of energy cities use to process their drinking water, be that natural gas, coal, electricity, or others. Poor predictions may also be due to uneven levels of technology across the country: some cities have much more advanced energy-saving equipment than others. Next steps include investigating features that indicate how advanced a city's treatment technology is, in order to better predict its electricity use for drinking water processing, as well as finding more complete data on total processing energy use.
A main finding from this project is that a relationship exists between population, climate, and urban drinking water consumption. As the climate continues to change and regional average temperatures shift, water demands will change too. The implications of this finding could extend into policy making, better informing decisions about the changing climate and society's water needs. Another takeaway is that more primary data is necessary to develop accurate prediction models for urban water use and for processing energy use; there is a clear lack of publicly available, well-tracked water use data. The findings of this project may motivate agencies to prioritize resources towards better data collection on water use, illuminating our society's consumption of this precious, declining resource.
Nathan and I would like to thank Professor Duncan Callaway and Sallmah Elmallah of the Energy and Resources department at UC Berkeley for their guidance and feedback throughout this project. We are very grateful to have had such knowledgeable mentors in Data Science and resource conservation.
<a href = "https://agupubs.onlinelibrary.wiley.com/doi/10.1002/2017WR022265">Advancing Earth and Space Science Journal</a>