1 - Abstract
2 - Project Background
3 - Project Objectives
4 - Data Description
5 - Data Cleaning
6 - Data Exploration
7 - Model Development
8 - Interpretation and Conclusions
9 - Acknowledgements and Sources
As the U.S. climate becomes more volatile and the population continues to rise, informed water use is more important than ever. The aim of this project is to generate predictive models for urban water use from climate, demographic, and geographic data. Additionally, models are generated to predict the amount of electricity needed to process and distribute the water consumed by cities. The models developed in this project fall under the category of Land Use Regression, or spatial prediction. There is limited publicly available data about city-level water use across the United States, and the motivation behind this project is to fill in those geographic data gaps, which may help inform policy decisions and improve the use of two vital resources: water and energy.
Four models are developed for both per capita drinking water predictions and per capita electricity predictions for drinking water processing: K Nearest Neighbors, Linear regression, Ridge regression, and Lasso regression. The takeaways from this project are that the KNN and Ridge models perform best overall for water per capita prediction. Both of these models perform significantly better for larger, higher-consuming cities, while the predictions for smaller, lower-consuming cities had error margins too large to be informative. No single feature in the climate, demographic, and geographic data used to train the models dominated the predictions generated. None of the models developed for per capita electricity usage made well-performing predictions, which is predominantly due to the variation in the types of energy cities use to process their drinking water, be that natural gas, coal, electricity, or others. Poor predictions may also be due to varying levels of technology across the country: some cities have much more advanced energy-saving technologies than others. Next steps include investigating features that may indicate how advanced a city's technology is, to better predict its electricity use for drinking water processing, as well as finding more complete data for total processing energy use. The main takeaway from this project is that more primary data is necessary in order to develop accurate prediction models for urban water use and processing energy use. The findings of this project may motivate agencies to prioritize resources toward better data collection on water use, in order to illuminate our society's consumption of this precious, declining resource.
Most of the U.S. population lives within urban districts, and utilities across the United States generate data on how much water is consumed in any given month, as well as the electricity needed to distribute and treat that water. However, this information is not often collected or made public. The Advancing Earth and Space Sciences journal published a paper with the aim of addressing this issue, setting out to generate a primary data source containing drinking and wastewater usage by month or year, as well as the energy needed to process each respective type of water use. Their primary data set contains this information for 2012.
Upon reading Chini and Stillwell's analysis of the water-energy nexus across the United States, we were motivated to generate spatial prediction models that could fill in the gaps for locations missing from the existing primary data set. Developing this type of prediction model would help inform better policy decisions, such as how to allocate resources among cities in order to help decrease water consumption in inefficient locations.
One of the interesting conclusions from this analysis is that cities in hotter climates often consume roughly the same amount across all months, while those in colder climates consume varying amounts according to the seasons. This led us to include climate data in our prediction model, since climate data is readily available for locations across the United States.
Another interesting note about the dataset we were working from is that in colder climates, the electricity used to produce drinking water often increases as temperatures drop. This led us to hypothesize that there is some relationship between temperature and the energy needed to process drinking water. Our inclusion of geographic data (elevation and geographic coordinates) in our prediction model stems from the effect of elevation and proximity to the equator on climate.
With a changing climate, there is a threat of increased temperatures globally. If temperature has an effect on drinking water consumption, we need to understand this effect in order to adequately conserve this finite, vital resource. We intended to better understand this effect by developing models that predict drinking water consumption and energy usage based on climate, geographic, and demographic data. If water consumption could be predicted with high accuracy from this data, cities would know how much water they will need in coming years given projected climate and population data. This would help predict how climate change may impact water consumption in different locations.
On top of this, if we could also accurately predict the electricity used to process consumed water, we would be able to see how much climate change will increase electricity demand from water processing alone. If U.S. cities wish to begin conserving water and electricity, they need to know their current consumption rates, which our model would provide, as well as how those rates may change with the changing climate and population. Much like climate scientists who aim to show how the world will be impacted by the rapid change in global climate, if our model is successful, it could be used to promote more sustainable water usage and make another case for more sustainable sources of electricity.
Both of these predictions are valuable because they would help fill in missing data across the United States about water and energy usage for locations that lack the capacity to collect this type of data. This would help address a resource allocation problem of identifying which cities need help in decreasing per capita water consumption by providing a complete picture of consumption levels across the country. These models would also help cities across the United States predict what their future water consumption might look like given how their climate and populations are projected to change. They can give cities worst-case or potential disaster scenarios to look out for, and could even help cities identify points at which water consumption would exhaust the water supply. Should we achieve our purpose, it would build a stronger case for water conservation as a whole.
Our data for electricity and water consumption was originally provided by the Consortium of Universities for the Advancement of Hydrologic Science, a non-profit created to analyze and advance science within the water use world. The group originally acquired the data from utilities in each city in order to get exact numbers. This data was not complete for all columns, which eliminated some data points for us; we removed any incomplete data. The drinking water and electricity consumption data were given monthly, providing a separate data point for each month, so we had 12 data points per city. The data is from 2012. We obtained population data from text files included in our primary data repository for each city. We had to manually input monthly weather data points from Weather Underground and NOAA into a file to be merged with our water and population data. The columns for fuel, propane, and natural gas usage were all relatively empty, so we eliminated them from our data set and focused on electricity and drinking water predictions.
Our climate data came from Weather Underground, which stores a history of weather data, as well as from the National Oceanic and Atmospheric Administration (NOAA), a large, reputable, government-run agency. Weather Underground uses sensors on major buildings across each city that upload data wirelessly, providing our monthly data. We input the data for our climate features manually from both Weather Underground and NOAA, which was a massive undertaking given how many data points we had (around 804). There was no single file available containing all the data we needed, so we had to toggle through filters on their websites. All data was taken from 2012 and matched with the corresponding water consumption data point. The data contains monthly average maximum temperature, average minimum temperature, overall average temperature, average dew point, average wind speed, and average sea level pressure, as well as total rainfall and total snowfall.
Our elevation, latitude, and longitude data came from the United States Geological Survey, another government-run organization that provides reputable data. These data points were also added manually to our dataset. It was a complete set with no missing values, and since we have 12 data points per city but only one elevation per city, each elevation value is repeated 12 times.
The final data set, generated through extensive merging and cleaning, had the following characteristics:
The energy and drinking water data that we merged into our final data frame came from individual csv files for each city. The weather data was originally stored on a website whose filter features helped us get to each data point we needed for our data frame.
Our data are aggregated by month and by city.
Our data cover 67 cities in the United States, with data from each month of 2012.
Time is represented in the data by month, since the entire data set is from the single year 2012.
The data are trustworthy given our exclusion of cities with missing data, and given our data exploration, which revealed no significant outliers.
We cannot provide the final data frame here because it is generated through the Data Cleaning section below; you may find this data frame at the end of that section.
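The outlier check mentioned above can be sanity-checked with a simple interquartile-range rule. The report does not specify the exact screen used, so this is only a minimal sketch; the `water_per_capita` values below are made up, not from our data set:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values falling outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return values[(values < low) | (values > high)]

# Hypothetical per capita values (mg per person):
water_per_capita = [0.004, 0.005, 0.0045, 0.0052, 0.0048, 0.005]
print(iqr_outliers(water_per_capita))  # nothing flagged -> prints []
```

The same function applied to each city's 12 monthly values would flag any month far outside that city's typical range.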
Goal overview for this section:
#Import dependencies.
import pandas as pd
import numpy as np
from pathlib import Path
names_1st_half = ['AlbanyNY', 'AlexandriaVA', 'AmarilloTX', 'AnchorageAK', 'AugustaGA',
'AustinTX','BakersfieldCA','BaltimoreMD','BeaumontTX','BoiseID',
'BostonMA','BridgeportCT','BuffaloNY','BurlingtonVT','CharlestonSC','CharlestonWV',
'CharlotteNC','CheyenneWY','ChicagoIL','CincinnatiOH','ClevelandOH','ColoradoSpringsCO',
'ColumbiaSC','ColumbusOH','CorpusChristiTX','DallasTX','DaytonOH','DenverCO','DetriotMI',
'DuluthMN','ElPasoTX','EugeneOR','FargoND','FortCollinsCO','FortWayneIN','FortWorthTX',
'FresnoCA','GreensboroNC',
'GreenvilleSC','HarrisburgPA','HonoluluHI','HoustonTX','IndianapolisIN','JacksonvilleFL']
names_2nd_half = ["KalamazooMI", "KansasCityMO", "KnoxvilleTN", "LakeCharlesLA", "LasVegasNV",
"LincolnNE", "LittleRockAR", "LosAngelesCA", "LouisvilleKY", "LubbockTX", "MadisonWI",
"ManchesterNH", "MemphisTN", "MiamiFL", "MilwaukeeWI", "MinneapolisMN","NashvilleTN",
"NewarkNJ", "NewOrleansLA", "NewYorkNY", "NorfolkVA", "North Texas", "OaklandCA",
"OgdenUT", "OklahomaCityOH", "OmahaNE", "PeoriaIL", "PhiladelphiaPA", "PhoenixAZ",
"PittsburghPA", "PortlandME", "PortlandOR", "ProvidenceRI", "ProvoUT", "RaleighNC",
"RenoNV", "SacramentoCA", "SalemOR", "SaltLakeCityUT", "SanAntonioTX", "SanDiegoCA",
"SanFranciscoCA", "SanJoseCA", "SantaFeNM", "SavannahGA", "SeattleWA", "SiouxFallsSD",
"SpokaneWA", "SpringfieldMA", "StLouisMO", "StPaulMN", "SyracuseNY", "TacomaWA",
"TallahasseeFL", "TampaFL", "ToledoOH", "TucsonAZ", "TulsaOK", "WashingtonDC",
"WichitaKS", "WorcesterMA"]
all_names = names_1st_half + names_2nd_half
Consideration:
#Function to load csv files as data frames
def load_one_Water(name):
    #AnchorageAK's files live in a folder with different capitalization.
    if name == "AnchorageAK":
        return pd.read_csv("AnchorageAk/Water_"+name+".csv")
    #Skip cities whose drinking water data are incomplete.
    if (name == "NashvilleTN") or (name == "NewYorkNY") or (name == "OgdenUT"):
        return
    my_file = Path(name+"/Water_"+name+".csv")
    if not my_file.exists():
        #print("Water "+name+" doesn't exist")
        return
    return pd.read_csv(name+"/Water_"+name+".csv")
#Function to load all drinking water data frames into a dictionary.
def load_all_drinking_water():
    df_list_Water = []
    for name in all_names:
        df_list_Water.append(load_one_Water(name))
    drinking_water_dict = dict(zip(all_names, df_list_Water))
    return drinking_water_dict
drinking_water_dict = load_all_drinking_water()
Considerations:
#Create drinking_water_df for all files with data for all 12 months
#Reset dict
drinking_water_dict = load_all_drinking_water()
frames = []
for name in all_names:
    curr_drinking_water_df = drinking_water_dict.get(name)
    #Check that the file exists for the city, and check that it has all 12 months
    if (curr_drinking_water_df is not None) and (len(curr_drinking_water_df) > 11):
        city = name[:-2]
        state = name[-2:]
        curr_drinking_water_df["City"] = city
        curr_drinking_water_df["State"] = state
        curr_drinking_water_df["CityState"] = name
        #Make all column names lower case
        curr_drinking_water_df.columns = map(str.lower, curr_drinking_water_df.columns)
        #Standardize the inconsistent column names across the city files
        curr_drinking_water_df = curr_drinking_water_df.rename(columns={
            "biogas (therms)": "biogas (therm)",
            "electric consumption (kwh)": "electricity (kwh)",
            "electricity (kwh)*": "electricity (kwh)",
            "natural gas (therms)": "natural gas (therm)",
            "volume (mgd)": "volume (mg)"})
        #Flag any file where volume and electricity were fused into a single tab-separated column
        if "volume (mg)\telectricity (kwh)" in curr_drinking_water_df.columns:
            print(name)
        frames.append(curr_drinking_water_df[:12])
drinking_water_df = pd.concat(frames)
drinking_cities = drinking_water_df["citystate"].unique()
Considerations:
#Extract the populations from the text files. Returns a list of the population(s) for each place.
def get_numbers_from_text(file_name):
    #Check that a txt file exists for the place
    if not Path(file_name+"/READ-ME_"+file_name+".txt").exists():
        return
    #SanJoseCA and WorcesterMA have an error in their text files that cannot be read.
    if (file_name == "SanJoseCA") or (file_name == "WorcesterMA"):
        return
    words = []
    with open(file_name+"/READ-ME_"+file_name+".txt", "r") as f:
        for line in f:
            for word in line.split():
                words.append(word)
    num = []
    numbers = []
    for i in words:
        if i == "Error!":
            return
        for j in i:
            if j.isdigit():
                num.append(j)
        if len(num) != 0:
            joined_num = int("".join(num))
            #Keep only plausible population figures; the second constant filters a known junk value.
            if (joined_num > 10000) and (joined_num != 446515096638828383):
                numbers.append(joined_num)
            num = []
    return numbers
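For reference, the digit-scanning loop above can be expressed more compactly with a regular expression. This is an equivalent sketch, not the code used to build the population dictionary; the helper name `extract_populations` is ours:

```python
import re

def extract_populations(text, floor=10_000, junk=446515096638828383):
    """Concatenate the digits within each whitespace token, then keep plausible population figures."""
    tokens = text.split()
    if "Error!" in tokens:
        return None
    nums = []
    for token in tokens:
        digits = re.findall(r"\d", token)
        if digits:
            nums.append(int("".join(digits)))
    #Drop small numbers (dates, footnotes) and the known junk value.
    return [n for n in nums if n > floor and n != junk]

print(extract_populations("Population served: 226,984 (2012 estimate)"))  # -> [226984]
```

Note that, like the original loop, this joins the digits within each token, so "226,984" becomes 226984 while "(2012" becomes 2012 and is filtered out by the size floor.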
#Create population dictionary. Keys correspond to place names, and values correspond to the population(s).
pop_dict = dict()
for file_name in all_names:
    numbers = get_numbers_from_text(file_name)
    if numbers is not None:
        pop_dict[file_name] = numbers
places_in_pop_dict = pop_dict.keys()
Considerations
#Add population into drinking_water_df
drinking_pops = []
for place in drinking_cities:
    if (place == "WorcesterMA"):
        drinking_pops += [None for i in range(12)]
    if (place == "AnchorageAK"):
        drinking_pops += [226984 for i in range(12)]
    if pop_dict.get(place) is not None:
        drinking_pop = pop_dict.get(place)[0]
        drinking_pops += [drinking_pop for i in range(12)]
#Manually input Worcester population, missing from Hydro Share data set due to error in text file.
drinking_pops[-12:] = [182627 for i in range(12)]
drinking_water_df["population"] = drinking_pops
drinking_water_df.shape
drinking_water_df.to_csv("drinking_water_monthly.csv")
Here is a clear list of the places that have complete water data. This is helpful to know for the next step of aggregating the correct weather data for the places that we have water data.
print("Places with complete and clean drinking water data:")
print(drinking_cities)
weather_drinking = pd.read_csv("drinking_weather.csv")
Considerations
#Rename column to match drinking_water_df for concatenating
weather_drinking = weather_drinking.rename(columns={"Unnamed: 0": "citystate"})
Outputting the heads of the two data frames, weather_drinking and drinking_water_df, provides an idea of how the data frames look and helps to visualize how to join the two data frames.
weather_drinking.head()
drinking_water_df.head()
#Drop Nashville, New York and Ogden from weather drinking set because complete data missing from drinking water set.
weather_drinking = weather_drinking[weather_drinking["citystate"]!="NashvilleTN"]
weather_drinking = weather_drinking[weather_drinking["citystate"]!="NewYorkNY"]
weather_drinking = weather_drinking[weather_drinking["citystate"]!="OgdenUT"]
#From this code, I was able to find and fix spelling errors we had in our weather data set.
weather_drinking_cities = weather_drinking["citystate"].unique()
drinking_cities = drinking_water_df["citystate"].unique()
print(len(drinking_cities))
print(len(weather_drinking_cities))
for i in drinking_cities:
    if i not in weather_drinking_cities:
        print(i)
weather_drinking_cities
Here is a clear output of the cities in each data frame. You can see that both lists match, which is confirmed by the cell above.
drinking_cities
Considerations
drinking_water_df = drinking_water_df.reset_index()
weather_drinking = weather_drinking.reset_index()
drinking_complete_df = pd.concat([drinking_water_df, weather_drinking], axis=1)
Ensure that the finished data frame is the correct shape.
print(drinking_water_df.shape)
print(weather_drinking.shape)
print(drinking_complete_df.shape)
Rename the index column to month number. Remove the duplicate columns.
drinking_complete_df = drinking_complete_df.rename(columns={"index": "month number"})
drinking_complete_df = drinking_complete_df.loc[:, ~drinking_complete_df.columns.duplicated()]
Export the complete drinking water data frame that now contains weather to a csv for collaboration purposes.
drinking_complete_df.to_csv("drinking_complete_df.csv")
print(drinking_complete_df.shape)
print()
print(drinking_complete_df.columns)
#Load in external data set
lat_lons = pd.read_csv("lat lons.csv")
lat_lon_places = list(lat_lons["Unnamed: 0"])
#Clean the strings in external set to match our data frame place names.
clean_lat_lon_places = []
for s in lat_lon_places:
    s = s.replace(',', '')
    s = s.replace(" ", "")
    s = s.replace("USA", "")
    s = s.replace("theUS", "")
    clean_lat_lon_places.append(s)
lat_lons["places"] = clean_lat_lon_places
lat_lons = lat_lons.drop(columns = "Unnamed: 0")
lat_lons = lat_lons.rename(columns={"Unnamed: 1":"Latitude", "Unnamed: 2": "Longitude"})
lat_lon_clean_df = lat_lons.drop(index = 0).reset_index(drop=True)
len(lat_lon_clean_df)
lat_lon_clean_df.drop_duplicates(subset="places", keep="first", inplace=True)
print(len(lat_lon_clean_df))
lat_lon_clean_df.head()
lat_lon_dict = dict()
lats = [np.nan] * len(drinking_cities)
lons = [np.nan] * len(drinking_cities)
num = 0
for i in range(len(drinking_cities)):
    for j in lat_lon_clean_df["places"]:
        if drinking_cities[i] == j:
            lat = float(lat_lon_clean_df[lat_lon_clean_df["places"] == drinking_cities[i]]["Latitude"])
            lon = float(lat_lon_clean_df[lat_lon_clean_df["places"] == drinking_cities[i]]["Longitude"])
            lat_lon_dict[drinking_cities[i]] = [lat, lon]
            lats[i] = lat
            lons[i] = lon
            num += 1
for i in range(len(lats)):
    if lats[i] is np.nan:
        print(drinking_cities[i])
        print(i)
#Input missing lats
lats[11] = 34.0007 #columbia
lats[45] = 40.4406 #pitts
lats[46] = 43.6591 #portland me
lats[56] = 43.5473 #sioux
lats[59] = 38.6270 #stlouis
#Input missing lons
lons[11] = 81.0348 #columbia
lons[45] = 79.9959 #pitts
lons[46] = 70.2568 #portland me
lons[56] = 96.7283 #sioux
lons[59] = 90.1994 #stlouis
#This shows that all lats and lons are now accounted for.
for i in range(len(lats)):
    if lats[i] is np.nan:
        print(drinking_cities[i])
        print(i)
lats_extended = []
for i in lats:
    lats_extended += [i for j in range(12)]
lons_extended = []
for i in lons:
    lons_extended += [i for j in range(12)]
print(len(lons_extended))
print(len(lats_extended))
drinking_complete_df["lats"] = lats_extended
drinking_complete_df["lons"] = lons_extended
drinking_complete_df.to_csv("drinking_complete_df.csv")
#Import dependencies.
import datetime
import os
import requests
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.font_manager import FontProperties
%matplotlib inline
drinking_complete_df.head()
Add a column into the data frame that has the volume of water consumed per capita.
water_per_capita = drinking_complete_df['volume (mg)']/drinking_complete_df['population']
drinking_complete_df['water per capita'] = water_per_capita
drinking_complete_df.head()
Add a column in that has the electricity use per capita.
First we must restrict our drinking water data set to observations that have non-null values for electricity (kwh), and then add an electricity per capita column. We must also convert the string values (which contain commas) to floats.
electricity = drinking_complete_df[pd.notnull(drinking_complete_df['electricity (kwh)'])].copy()
#Convert non-floats to floats in electricity column.
clean_electricity = list(electricity["electricity (kwh)"])
for i in range(len(clean_electricity)):
    if type(clean_electricity[i]) == str:
        s = clean_electricity[i]
        s = s.replace(",", "")
        clean_electricity[i] = float(s)
electricity["electricity (kwh)"] = clean_electricity
electricity["electricity per capita"] = electricity["electricity (kwh)"]/electricity["population"]
electricity.columns
#Ensure all values in electricity column have been converted to float values.
for i in list(electricity["electricity (kwh)"]):
    if type(i) == str:
        print(i)
This plot shows the density of water use per capita by month. It shows that drinking water consumption varies more in certain months, which suggests that including month as a feature in our prediction model may be beneficial. The winter months have distributions concentrated around a particular water use per capita value, while the summer months have more spread-out distributions. This could be because in the summer, places with very high temperatures may consume more drinking water than places with more moderate climates. Variation in cold during the winter does not seem to impact drinking water consumption as much as heat does.
#Figure 1 code
month_names = drinking_complete_df["Month"].unique().tolist()
plt.figure(figsize=(10, 7))
palette = sns.cubehelix_palette(12)
for i in range(0,12):
    sns.distplot(drinking_complete_df[drinking_complete_df["month number"] == i]["water per capita"],
                 hist=False, kde=True, label=month_names[i], color = palette[i-12])
plt.legend()
plt.title('Water Use Density by Month', size = 20)
plt.xlabel('Water per Capita', size = 15)
plt.ylabel('Density', size = 15)
Following up on Figure 1, let's look further into how the summer months compare with the winter months. It is now clearer that whether a month falls in summer or winter does in fact have some relation to per capita water consumption.
#Figure 2 code
plt.figure(figsize=(15, 7))
#Plot summer months
plt.subplot(121)
for i in range(4,10):
    sns.distplot(drinking_complete_df[drinking_complete_df["month number"] == i]["water per capita"],
                 hist=False, kde=True, label=month_names[i])
plt.legend()
plt.title('Water Use Density Summer Months', size = 20)
plt.xlabel('Water per Capita', size = 15)
plt.ylabel('Density', size = 15)
plt.ylim(0, 400)
plt.xlim(0, .03)
#Plot winter months
plt.subplot(122)
for i in list(range(0,4)) + list(range(10, 12)):
    sns.distplot(drinking_complete_df[drinking_complete_df["month number"] == i]["water per capita"],
                 hist=False, kde=True, label=month_names[i])
plt.legend()
plt.title('Water Use Density Winter Months', size = 20)
plt.xlabel('Water per Capita', size = 15)
plt.ylabel('Density', size = 15)
plt.ylim(0, 400)
plt.xlim(0, .03)
Let's look at a few simple visualizations of average temperature, precipitation, and snow depth versus water consumption. There is generally a positive association between temperature and water consumption, but the association with precipitation is less clear. Snow depth has an interesting relationship with water consumption: lower snow depths show much higher variation in water consumption. This may be evidence that another variable associated with low snow depth leads to high water usage, such as high summer temperatures, as Figure 2 implies.
#Figure 4 code
plt.figure(figsize=(5, 5))
drinking_complete_df.plot.scatter(x = "T Avg", y ="water per capita", subplots= True)
plt.title("Temperature vs Water per Capita Consumption", size = 20)
plt.xlabel("Average Temperature", size = 15)
plt.ylabel("Water Consumption Per Capita", size = 13)
drinking_complete_df.plot.scatter(x = "Precipitation (Inches)", y ="water per capita", subplots= True)
plt.title("Precipitation vs Water per Capita Consumption", size = 20)
plt.xlabel("Precipitation", size = 15)
plt.ylabel("Water Consumption Per Capita", size = 13)
drinking_complete_df.plot.scatter(x = "Snowdepth (Inches)", y ="water per capita", subplots= True)
plt.title("Snowdepth vs Water per Capita Consumption", size = 20)
plt.xlabel("Snowdepth", size = 15)
plt.ylabel("Water Consumption Per Capita", size = 13)
Here we show the water per capita per month by city.
color = sns.color_palette("colorblind", 69)
cities = list(drinking_complete_df["city"].unique())
plt.figure(figsize = (30,20))
for name in cities:
    city_df = drinking_complete_df[drinking_complete_df["city"] == name]
    plt.plot(city_df["Month"], city_df["water per capita"], color = color[cities.index(name)])
plt.legend(cities, loc = "upper right")
plt.xlabel("Month", fontsize = 30)
plt.ylabel("Water per Capita", fontsize = 30)
plt.title("Water per Capita per Month for All Cities", fontsize = 40)
Now let's look further into the places with the hottest and coldest temperatures to compare their water consumption in those months specifically.
#Find 10 places that have the highest average temperature throughout all 12 of their months.
top_10_hottest_cities = drinking_complete_df.sort_values(by = ["T Avg"], ascending = False)["citystate"].unique()[:10]
print(top_10_hottest_cities)
print()
#Find 10 places that have the lowest average temperature throughout all 12 of their months.
top_10_coldest_cities = drinking_complete_df.sort_values(by = ["T Avg"], ascending = True)["citystate"].unique()[:10]
print(top_10_coldest_cities)
color = sns.color_palette("colorblind", 69)
cities = list(drinking_complete_df["citystate"].unique())
plt.figure(figsize = (30,20))
for name in cities:
    city_df = drinking_complete_df[drinking_complete_df["citystate"] == name]
    if name in top_10_hottest_cities:
        plt.plot(city_df["Month"], city_df["water per capita"], color = "darkred")
    elif name in top_10_coldest_cities:
        plt.plot(city_df["Month"], city_df["water per capita"], color = "blue")
    else:
        plt.plot(city_df["Month"], city_df["water per capita"], color = "lightgray")
plt.legend(cities, loc = "upper right")
plt.xlabel("Month", fontsize = 30)
plt.ylabel("Water per Capita", fontsize = 30)
plt.title("Water per Capita per Month for All Cities", fontsize = 40)
electricity.shape
Let's see how electricity and water consumption compare for the hottest and coldest cities by generating a plot similar to Figure 5. There is a clear division between the hottest and coldest cities: the hotter cities appear to consume more electricity per capita during the summer months.
#Find 10 places that have the highest average temperature throughout all 12 of their months.
top_10_hottest_cities_e = electricity.sort_values(by = ["T Avg"], ascending = False)["citystate"].unique()[:10]
print(top_10_hottest_cities_e)
print()
#Find 10 places that have the lowest average temperature throughout all 12 of their months.
top_10_coldest_cities_e = electricity.sort_values(by = ["T Avg"], ascending = True)["citystate"].unique()[:10]
print(top_10_coldest_cities_e)
color = sns.color_palette("colorblind", 69)
cities = list(electricity["citystate"].unique())
plt.figure(figsize = (30,20))
for name in cities:
    city_df = electricity[electricity["citystate"] == name]
    #Use the hottest/coldest lists computed from the electricity set above.
    if name in top_10_hottest_cities_e:
        plt.plot(city_df["Month"], city_df["electricity per capita"], color = "darkred")
    elif name in top_10_coldest_cities_e:
        plt.plot(city_df["Month"], city_df["electricity per capita"], color = "blue")
    else:
        plt.plot(city_df["Month"], city_df["electricity per capita"], color = "lightgray")
plt.legend(cities, loc = "upper right")
plt.xlabel("Month", fontsize = 30)
plt.ylabel("Electricity per Capita", fontsize = 30)
plt.title("Electricity per Capita per Month for All Cities", fontsize = 40)
Let's take a look at the total electricity per capita consumed per month in 2012 to explore the relationship between month and electricity per capita consumed. From this graph it is clear that the electricity per capita consumption does vary by month, which motivates the generation of our model in the next few steps.
df_grouped = electricity[["Month", "electricity per capita"]].groupby("Month").sum()
#Reindex from alphabetical to calendar order so the bar chart reads January through December.
df_grouped = df_grouped.reindex(month_names)
df_grouped["Month"] = list(df_grouped.index)
df_grouped
df_grouped.plot.bar(x = "Month", y = "electricity per capita", legend = False)
plt.title("Total Electricity per Capita Consumed by Month", size = 20)
plt.ylabel("Total Electricity per Capita")
We will be using different modeling approaches to predict drinking water use and the electricity needed to process drinking water.
For our KNN model, nearest neighbor proximity will be based on spatial distance. For each location, we will find its K-nearest neighbors in our drinking water dataset, and then we will use their average drinking water consumption as the forecast for that location.
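The forecasting rule described above can be illustrated with a toy example. The coordinates and per capita values below are made up, not from our data set, and the planar lat/lon distance mirrors the simple distance measure used later in this section:

```python
import numpy as np

# Hypothetical training cities: (latitude, longitude, water per capita)
train = np.array([
    [40.7,  -74.0, 0.0045],
    [39.9,  -75.2, 0.0050],
    [42.4,  -71.1, 0.0040],
    [34.1, -118.2, 0.0060],
])

def knn_forecast(lat, lon, train, k):
    """Average the k spatially nearest training values (planar lat/lon distance)."""
    dists = np.sqrt((train[:, 0] - lat) ** 2 + (train[:, 1] - lon) ** 2)
    nearest = np.argsort(dists)[:k]
    return train[nearest, 2].mean()

# Forecast for a point near the three Northeast cities, with k=3:
print(knn_forecast(41.0, -74.5, train, k=3))  # averages 0.0045, 0.0050, 0.0040 -> 0.0045
```

With k=3, the distant fourth city is ignored and the forecast is simply the mean of the three nearby cities' values; the functions below implement this same idea on our actual data frame.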
To assist in this part of the process, the function get_month_data() takes in a month parameter and returns a data frame containing only the data recorded in that month, restricted to latitude, longitude, and water per capita. We predict water per capita because a place's population will most likely have a significant impact on the total amount of water it consumes, so modeling consumption per capita will be more accurate.
#Part a code.
#Making output an input in order to make it easier to predict energy later.
def get_month_data(df, month, output):
    month_data = df[df["Month"] == month]
    month_data = month_data[["lats", 'lons', output]]
    return month_data.reset_index(drop=True)
#Check that the length is 67 to ensure all 67 cities are accounted for in the January data.
jan = get_month_data(drinking_complete_df, "January", "water per capita")
print(jan.shape)
This method samples a random set of rows from the month data set to form the test set; those rows are then removed from the month data set, and what remains forms the train set. This gives us held-out observations to predict and score, so that we can compute the average error for a given value of k.
#Part b code.
def train_test(month_data, test_size, seed):
    test = month_data.sample(n = test_size, random_state=seed)
    to_drop = list(test.index)
    train = month_data.drop(to_drop)
    return train, test
#Affirms train_test works.
train, test = train_test(jan, 12, 1)
test.shape, train.shape
#Part c code.
#Returns a list of the indexes in train that are the closest k points to a test index.
def closest_k(lat, lon, train, k):
    #Pair each training index with its distance to the test point.
    #(Pairs, rather than a dict keyed by distance, so points with tied
    #distances are not silently dropped.)
    dists = []
    for i in list(train.index):
        dist = calc_dist(train.loc[i]["lats"], train.loc[i]["lons"], lat, lon)
        dists.append((dist, i))
    #Sort by distance and return the indexes of the k nearest points.
    dists.sort(key=lambda pair: pair[0])
    return [i for _, i in dists[:k]]
#Calculate distance between two sets of latitudes and longitudes.
def calc_dist(lat_1, lon_1, lat_2, lon_2):
return np.sqrt((np.subtract(lat_1, lat_2))**2 + np.subtract(lon_1,lon_2)**2)
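Note that calc_dist treats latitude and longitude as planar coordinates, which is reasonable for ranking nearby points but distorts east-west distances away from the equator. A great-circle (haversine) version could be swapped in instead; the sketch below is not part of the original pipeline, and haversine_dist is a hypothetical name:

```python
import numpy as np

def haversine_dist(lat_1, lon_1, lat_2, lon_2):
    """Great-circle distance in kilometers between two (lat, lon) points.

    A drop-in alternative to calc_dist: unlike the planar formula, it
    accounts for the Earth's curvature and for longitude degrees
    shrinking away from the equator.
    """
    r = 6371.0  # mean Earth radius in km
    phi_1, phi_2 = np.radians(lat_1), np.radians(lat_2)
    d_phi = np.radians(lat_2 - lat_1)
    d_lam = np.radians(lon_2 - lon_1)
    a = np.sin(d_phi / 2) ** 2 + np.cos(phi_1) * np.cos(phi_2) * np.sin(d_lam / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))
```

Because closest_k only ranks distances, swapping in the haversine metric would change which neighbors are selected mainly for city pairs at very different latitudes.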
#Compute test set average error for certain value of k.
def k_error(test, train, k, output):
    errors = []
    for i in list(test.index):
        test_lat = test.loc[i, "lats"]
        test_lon = test.loc[i, "lons"]
        closest_in_train_by_index = closest_k(test_lat, test_lon, train, k)
        #Collect the k nearest neighbors' values of "output" for this test point.
        train_outputs = []
        for j in closest_in_train_by_index:
            train_outputs.append(train.loc[j, output])
        prediction = np.mean(train_outputs)
        error = abs(prediction - test.loc[i, output])
        errors.append(error)
    return np.mean(errors)
def knn(df, month, k, test_size, seed, output):
    month_data = get_month_data(df, month, output)
    train, test = train_test(month_data, test_size, seed)
    avg_error = k_error(test, train, k, output)
    #print(month + " has an error of " + str(avg_error) + " when k=" + str(k))
    return avg_error
df = drinking_complete_df
month = "January"
k = 3
test_size = 10
seed = 1
output_1 = "water per capita"
#Test knn function.
knn(df, month, k, test_size, seed, output_1)
all_month_errors = pd.DataFrame()
month_names_1 = list(drinking_complete_df["Month"].unique())
values_of_k_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_size_1 = 10
seed_1 = 1
output_1 = "water per capita"
all_month_errors["k"] = values_of_k_1
for month in month_names_1:
    month_errors_1 = []
    for i in values_of_k_1:
        month_errors_1.append(knn(drinking_complete_df, month, i, test_size_1, seed_1, output_1))
    all_month_errors[month] = month_errors_1
Here you can see the average errors for each month and each value of k. The error generally decreases as k increases, before leveling off.
all_month_errors
no_k = all_month_errors.drop(columns = ["k"])
all_month_errors['mean'] = no_k.mean(axis=1)
all_month_errors
Here you are able to see the average error for each value of k. The value k = 5 has the smallest mean error, so predicting drinking water consumption per capita using the nearest 5 neighbors is the most accurate. It is important to note that a low k (e.g., 1) would result in high variance, and a high k (e.g., 10) would result in high bias. For this reason as well, 5 is a good choice for k.
summary = all_month_errors[["k", "mean"]]
summary.plot.bar(x = "k", y = "mean", legend = False)
plt.title("Average Prediction Error per Value of K", size = 20)
plt.ylabel("Average Prediction Error")
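As a cross-check on our hand-rolled pipeline, the same spatial KNN can be expressed with scikit-learn's KNeighborsRegressor, which also uses Euclidean distance on (lat, lon) by default. The sketch below mirrors knn(); knn_sklearn_error is a hypothetical helper, and it assumes a month data frame shaped like the output of get_month_data() (lats, lons, then the target column).

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

def knn_sklearn_error(month_data, k, test_size, seed):
    """Mean absolute error of a spatial KNN on one month of data.

    Mirrors the hand-rolled knn() above: predict the target column
    (assumed to be the third column, e.g. "water per capita") from
    latitude/longitude with a uniform average of k nearest neighbors.
    """
    test = month_data.sample(n=test_size, random_state=seed)
    train = month_data.drop(test.index)
    target = month_data.columns[2]  # e.g. "water per capita"
    model = KNeighborsRegressor(n_neighbors=k)  # Euclidean on (lat, lon)
    model.fit(train[["lats", "lons"]], train[target])
    preds = model.predict(test[["lats", "lons"]])
    return np.mean(np.abs(preds - test[target].values))
```

If both implementations are correct, errors from this helper should track the hand-rolled values closely (small differences can arise from how ties in distance are broken).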
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
import numpy as np
def get_X_y(df):
    X = df.drop(columns = ['month number',"city", "citystate", "state", "month", 'Month', "electricity (kwh)", "fuel oil (gal)", 'natural gas (therm)', "propane (gal)", "volume (mg)", 'water per capita'])
    y = df[["volume (mg)"]]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 123)
    return X_train, X_test, y_train, y_test
X_drink_train, X_drink_test, y_drink_train, y_drink_test = get_X_y(drinking_complete_df)
Check that the data appears as expected.
X_drink_train.head()
Using cross-validation, we will search for the alpha value that gives each regression its best fit. To do this, we must also import a few more functions from sklearn.
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
alphas_ridge = np.linspace(0, 10000, 200)
kf = KFold(n_splits = 3, shuffle = True, random_state = 0)
ridgecv = RidgeCV(cv = kf, alphas=alphas_ridge)
ridgecv.fit(X_drink_train, y_drink_train)
r_alpha_opt = ridgecv.alpha_
print("optimal alpha:", r_alpha_opt)
y_pred_ridgecv = ridgecv.predict(X_drink_test)
ridgecv_mse = mean_squared_error(y_pred_ridgecv, y_drink_test)
print("MSE with cross-validated Ridge:", ridgecv_mse)
from sklearn.linear_model import LassoCV
alphas_lasso = np.linspace(1, 100, 20)
kf = KFold(n_splits = 3, shuffle = True, random_state = 0)
lassocv = LassoCV(cv = kf, alphas=alphas_lasso, tol = .001, max_iter = 100000)
lassocv.fit(X_drink_train, y_drink_train.values.ravel())
l_alpha_opt = lassocv.alpha_
print("optimal alpha:", l_alpha_opt)
y_pred_lassocv = lassocv.predict(X_drink_test)
lassocv_mse = mean_squared_error(y_pred_lassocv, y_drink_test)
print("MSE with cross-validated Lasso:", lassocv_mse)
def fit_model(Model, X_train, X_test, y_train, y_test, lasso_alpha = l_alpha_opt, ridge_alpha = r_alpha_opt):
    if Model == LinearRegression:
        model = Model()
    elif Model == Ridge:
        model = Model(alpha = ridge_alpha)
    else:
        model = Model(alpha = lasso_alpha)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    coef = model.coef_.flatten()
    return mse, coef
Models = [LinearRegression, Ridge, Lasso]
mse_drink = np.zeros(len(Models))
coef_drink = np.zeros((X_drink_train.shape[1], len(Models)))
for Model, i in zip(Models, np.arange(len(Models))):
    mse_drink[i], coef_drink[:,i] = fit_model(Model, X_drink_train, X_drink_test, y_drink_train, y_drink_test)
When examining these MSEs, keep in mind that the target, monthly water volume, ranges up to about 25,000 million gallons, so even modest relative errors produce large squared errors. This is why the MSE values appear high in absolute terms.
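Because MSE is in squared units, one quick way to make these numbers interpretable is to convert them to RMSE, which is back in million gallons and can be compared directly to typical monthly volumes. A minimal sketch (to_rmse is a hypothetical helper; mse_drink is the array computed above):

```python
import numpy as np

def to_rmse(mse_values):
    """Convert an array of MSE scores to RMSE, restoring the target's units."""
    return np.sqrt(np.asarray(mse_values, dtype=float))

# e.g. to_rmse(mse_drink) gives one RMSE per model, in million gallons
```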
ind = np.arange(coef_drink.shape[0])
width = 0.25
pos = np.array([ind - width, ind, ind + width])
modelNames = ["Linear regression", "Ridge", "Lasso"]
plt.figure(figsize = (20,10))
plt.subplot(111)
for i in np.arange(coef_drink.shape[1]):
    plt.bar(pos[i], height = coef_drink[:,i], width = width, label = modelNames[i])
plt.legend()
plt.xlabel("Feature number")
plt.ylabel("Feature coefficient")
plt.title("Model comparison, Water")
plt.tight_layout()
plt.show()
mse_drink
coef_drink
The MSE for the ridge regression is the best of the three. Ridge regression proves to be the best model for this dataset because most of our features have some effect on the output; Lasso, by contrast, prefers solutions in which only a handful of features carry nonzero coefficients.
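This Ridge-versus-Lasso point can be checked directly: Lasso drives some coefficients exactly to zero, so counting the nonzero entries in each column of the coefficient array shows how many features each model actually uses. A sketch, assuming a coefficient array shaped like coef_drink above (one column per model); nonzero_feature_counts is a hypothetical helper:

```python
import numpy as np

def nonzero_feature_counts(coef, model_names):
    """Count features with nonzero coefficients for each fitted model.

    coef is assumed to have one column per model, as coef_drink does above.
    A small tolerance guards against floating-point near-zeros.
    """
    counts = (np.abs(coef) > 1e-10).sum(axis=0)
    return dict(zip(model_names, counts.tolist()))
```

If Lasso's count is much smaller than Ridge's while its MSE is worse, that supports the claim that most features matter here.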
This process will be similar to part 1, where we found the optimal k for predicting water per capita with a KNN model. However, now we will predict the electricity per capita required to process water.
For our KNN model, nearest neighbor proximity will be based on spatial distance. For each location, we will find its K-nearest neighbors in our drinking water dataset, and then we will use their average electricity consumption per capita as the forecast for that location.
We are able to use many of the functions defined in part 1 to run KNN, in particular our knn function that uses all the helpers we constructed.
all_month_errors_electricity = pd.DataFrame()
month_names_2 = list(drinking_complete_df["Month"].unique())
values_of_k_2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_size_2 = 10
seed_2 = 1
output_2 = "electricity per capita"
all_month_errors_electricity["k"] = values_of_k_2
for month in month_names_2:
    month_errors_2 = []
    for i in values_of_k_2:
        month_errors_2.append(knn(electricity, month, i, test_size_2, seed_2, output_2))
    all_month_errors_electricity[month] = month_errors_2
all_month_errors_electricity
no_k = all_month_errors_electricity.drop(columns = ["k"])
all_month_errors_electricity['mean'] = no_k.mean(axis=1)
all_month_errors_electricity
Here it is clear that K Nearest Neighbors is not a good predictor for electricity: the lowest-bias, highest-variance model (k = 1) has the lowest error, which suggests that averaging over more neighbors adds no real signal.
summary_2 = all_month_errors_electricity[["k", "mean"]]
summary_2
summary_2.plot.bar(x = "k", y = "mean", legend = False)
plt.title("Average Prediction Error per Value of K", size = 20)
plt.ylabel("Average Prediction Error")
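One way to substantiate the conclusion that KNN is uninformative for electricity is to compare its errors against a naive baseline that always predicts the training-set mean; if KNN does not clearly beat this, the spatial signal is weak. A sketch in the style of train_test above (baseline_error is a hypothetical helper, not part of the original pipeline):

```python
import numpy as np
import pandas as pd

def baseline_error(month_data, target, test_size, seed):
    """Mean absolute error of always predicting the training-set mean.

    Uses the same sample-then-drop split as train_test above, so the
    result is directly comparable to the KNN errors for the same seed.
    """
    test = month_data.sample(n=test_size, random_state=seed)
    train = month_data.drop(test.index)
    prediction = train[target].mean()  # one constant prediction for all test rows
    return np.mean(np.abs(test[target] - prediction))
```

Any model worth keeping should beat this constant-prediction baseline by a clear margin.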
We will start the same way we did last time: dropping unnecessary columns, creating X and y data frames, and creating our train and test splits.
def get_X_y2(df):
    X = df.drop(columns = ['electricity per capita','month number',"city", "citystate", "state", "month", 'Month', "electricity (kwh)", "fuel oil (gal)", 'natural gas (therm)', "propane (gal)", "volume (mg)", 'water per capita'])
    y = df[["electricity (kwh)"]]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 123)
    return X_train, X_test, y_train, y_test
X_elec_train, X_elec_test, y_elec_train, y_elec_test = get_X_y2(electricity)
X_elec_train.columns = X_elec_train.columns.str.strip()
X_elec_test.columns = X_elec_test.columns.str.strip()
y_elec_train.columns = y_elec_train.columns.str.strip()
y_elec_test.columns = y_elec_test.columns.str.strip()
alphas_ridge = np.linspace(0, 10000, 200)
kf = KFold(n_splits = 3, shuffle = True, random_state = 0)
ridgecv = RidgeCV(cv = kf, alphas=alphas_ridge)
ridgecv.fit(X_elec_train, y_elec_train)
r_alpha_opt2 = ridgecv.alpha_
print("optimal alpha:", r_alpha_opt2)
y_pred_ridgecv = ridgecv.predict(X_elec_test)
ridgecvelec_mse = mean_squared_error(y_pred_ridgecv, y_elec_test)
print("Electricty MSE with cross-validated Ridge:", ridgecvelec_mse)
alphas_lasso = np.linspace(1, 100, 20)
kf = KFold(n_splits = 3, shuffle = True, random_state = 0)
lassocv = LassoCV(cv = kf, alphas=alphas_lasso, tol = .00001, max_iter = 1000000)
lassocv.fit(X_elec_train, y_elec_train.values.ravel())
l_alpha_opt2 = lassocv.alpha_
print("optimal alpha:", l_alpha_opt2)
y_pred_lassocv = lassocv.predict(X_elec_test)
lassocvelec_mse = mean_squared_error(y_pred_lassocv, y_elec_test)
print("Electricity MSE with cross-validated Lasso:", lassocvelec_mse)
def fit_model2(Model, X_train, X_test, y_train, y_test, lasso_alpha = l_alpha_opt2, ridge_alpha = r_alpha_opt2):
    if Model == LinearRegression:
        model = Model()
    elif Model == Ridge:
        model = Model(alpha = ridge_alpha)
    else:
        model = Model(alpha = lasso_alpha)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    coef = model.coef_.flatten()
    return mse, coef
#Perform the fits, collect the mse and coefficients into their own arrays so they can be compared
Models = [LinearRegression, Ridge, Lasso]
mse_elec = np.zeros(len(Models))
coef_elec = np.zeros((X_elec_train.shape[1], len(Models)))
for Model, i in zip(Models, np.arange(len(Models))):
    mse_elec[i], coef_elec[:,i] = fit_model2(Model, X_elec_train, X_elec_test, y_elec_train, y_elec_test)
ind = np.arange(coef_elec.shape[0])
width = 0.25
pos = np.array([ind - width, ind, ind + width])
modelNames = ["Linear regression", "Ridge", "Lasso"]
plt.figure(figsize = (20,10))
plt.subplot(111)
for i in np.arange(coef_elec.shape[1]):
    plt.bar(pos[i], height = coef_elec[:,i], width = width, label = modelNames[i])
plt.legend()
plt.xlabel("Feature number")
plt.ylabel("Feature coefficient")
plt.title("Model comparison, Electricity")
plt.tight_layout()
plt.show()
mse_elec
In conclusion, the KNN (k = 5) and Ridge models for water per capita prediction perform the best overall. Both models perform significantly better for larger, higher-consuming cities; the predictions for smaller, lower-consuming cities carried margins of error too large to be informative. No single feature in the climate, demographic, and geographic training data dominated the predictions. The KNN results support the idea that geographically close locations have similar per capita water consumption in a given month, which may be attributed to similarity in climate.
None of the models developed for electricity per capita usage made well-performing predictions, predominantly because of variation in the types of energy cities use to process their drinking water, be that natural gas, coal, electricity, or others. Poor predictions may also be due to uneven levels of technology across the country: some cities have much more advanced energy-saving equipment than others. Next steps include investigating features that indicate how advanced a city's treatment technology is, in order to better predict its electricity use for drinking water processing, as well as finding more complete data on total processing energy use.
A main finding from this project is that a relationship exists between population, climate, and urban drinking water consumption. As the climate continues to change and regional average temperatures shift, water demands will change too. The implications of this finding could extend into policy making, better informing decisions about the changing climate and society's water needs. Another takeaway is that more primary data is necessary to develop accurate prediction models for urban water use and for processing energy use; there is a clear lack of publicly available, well-tracked water use data. The findings of this project may motivate agencies to prioritize resources towards better data collection on water use, illuminating our society's consumption of this precious, declining resource.
Nathan and I would like to thank Professor Duncan Callaway and Sallmah Elmallah of the Energy and Resources department at UC Berkeley for their guidance and feedback throughout this project. We are very grateful to have had such knowledgeable mentors in Data Science and resource conservation.
<a href = "https://agupubs.onlinelibrary.wiley.com/doi/10.1002/2017WR022265">Advancing Earth and Space Science Journal</a>