A few years back I had to deliver this task for a European project (the THERMOSS project). THERMOSS was an H2020 project that aimed at facilitating the adoption of District Heating and Cooling.
In this project, the first stage was to create geoclusters for a district heating and cooling catalogue; each geocluster groups areas with similar characteristics.
At the time I did not know Python very well and used MATLAB for the task. While the results were satisfying back then, I now want to reproduce them using Python and some well-known GIS packages.
So let's get to it...
1. The data
The table below gives the list of features that have been used to create the geoclusters. Those are believed to be the most influential parameters for heating and cooling loads.
Note: the environmental conditions are taken from a NetCDF dataset (see section 2)
Parameter | Source |
---|---|
Environmental conditions | |
Cold season mean temperature | |
Hot season mean temperature | |
Building typology | |
Residential building energy use | |
Non-residential building energy use | |
Single family unit U-value | http://webtool.building-typology.eu/#bm |
Multiple family unit U-value | http://webtool.building-typology.eu/#bm |
Single family unit floor area | http://webtool.building-typology.eu/#bm |
Multiple family unit floor area | http://webtool.building-typology.eu/#bm |
Building stock | |
Residential building stock | |
Single family stock | |
Economic considerations | |
Energy use | |
Gas prices for domestic consumers | |
Gas prices for industrial consumers | |
Electricity prices for domestic consumers | |
Electricity prices for industrial consumers | |
Share of renewable energy in gross final energy consumption | |
For the building typology and stock, some manual work was needed in Excel to produce a neat table with only the required information.
Thankfully, Eurostat provides some nice tables, ready to use. See the example below for the renewable energy share.
import requests
import pandas as pd
import io
url = "https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/T2020_RD330/?format=CSV"
urlData = requests.get(url).content
renew_df = pd.read_csv(io.StringIO(urlData.decode('utf-8')))
renew_df.iloc[:,0] = renew_df.iloc[:,0].apply(lambda x: x.split(';')[-1]) #the first column bundles freq;nrg_bal;unit;geo, keep only the country code
renew_df.rename(columns={'freq;nrg_bal;unit;geoTIME_PERIOD': 'CODE'},inplace=True) #give that column a shorter name
renew_df
CODE | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AL | 29.621 | 31.367 | 32.070 | 32.657 | 32.448 | 31.437 | 31.867 | 31.187 | 35.152 | 33.167 | 31.856 | 34.896 | 36.939 | 35.898 | 36.844 | 36.667 |
1 | AT | 22.554 | 24.355 | 26.277 | 28.145 | 28.790 | 31.041 | 31.207 | 31.553 | 32.736 | 32.666 | 33.553 | 33.502 | 33.374 | 33.141 | 33.806 | 33.626 |
2 | BA | 20.274 | 19.752 | 19.213 | 18.701 | 16.708 | 18.471 | 18.710 | 17.995 | 18.014 | 19.307 | 24.873 | 26.607 | 25.358 | 23.241 | 35.972 | 37.578 |
3 | BE | 1.890 | 2.332 | 2.633 | 3.101 | 3.590 | 4.715 | 6.002 | 6.275 | 7.089 | 7.650 | 8.043 | 8.026 | 8.752 | 9.113 | 9.478 | 9.924 |
4 | BG | 9.231 | 9.173 | 9.415 | 9.098 | 10.345 | 12.005 | 13.928 | 14.152 | 15.837 | 18.898 | 18.050 | 18.261 | 18.760 | 18.701 | 20.592 | 21.564 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34 | SE | 38.677 | 40.265 | 42.040 | 43.551 | 44.288 | 47.476 | 46.595 | 48.135 | 50.027 | 50.792 | 51.817 | 52.947 | 53.328 | 54.157 | 54.651 | 56.391 |
35 | SI | 18.397 | 19.809 | 18.416 | 19.674 | 18.645 | 20.764 | 21.080 | 20.936 | 21.549 | 23.161 | 22.461 | 22.880 | 21.977 | 21.658 | 21.378 | 21.974 |
36 | SK | 6.391 | 6.360 | 6.584 | 7.766 | 7.723 | 9.368 | 9.099 | 10.348 | 10.453 | 10.133 | 11.713 | 12.883 | 12.029 | 11.465 | 11.896 | 16.894 |
37 | UK | 1.096 | 1.281 | 1.488 | 1.735 | 2.814 | 3.448 | 3.862 | 4.392 | 4.461 | 5.524 | 6.737 | 8.385 | 9.032 | 9.858 | 11.138 | 12.336 |
38 | XK | 20.541 | 19.773 | 19.512 | 18.812 | 18.429 | 18.230 | 18.230 | 17.598 | 18.625 | 18.823 | 19.544 | 18.484 | 24.472 | 23.082 | 24.616 | 25.686 |
39 rows × 17 columns
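Since the merged table further down keeps a single value per country, one option is to retain only the most recent year from this table. A small sketch, assuming the year columns come out of the CSV as plain strings and that the last column is the latest year (the renew_share name is only illustrative):
latest = renew_df.columns[-1] #assumed to be the most recent year
renew_share = renew_df.set_index('CODE')[[latest]].rename(columns={latest: 'renew share in %'})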
In addition to that, we will also be using an EU countries GeoJSON file found here, with all the country codes set to ISO alpha-2.
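Loading that GeoJSON with geopandas and indexing it by country code could look like this (a sketch; the file name and the name of the country-code column are assumptions):
import geopandas as gpd
eu_df = gpd.read_file("european-union-countries.geojson") #assumed file name
eu_df = eu_df.set_index("iso_a2").to_crs(epsg=3857) #'iso_a2' is the assumed country-code column; reproject to match the other layers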
import geopandas as gpd
import numpy as np
all_data = pd.concat([renew_df,energy_df,gas_df,elc_df,typo_df,stock_df,eu_df[['geometry']]],join='inner',axis=1).replace(': ',np.nan) #eurostat marks missing values with ':'
all_data_gdf = gpd.GeoDataFrame(all_data)
all_data_gdf
renew share in % | energy cons | residential gas price in euro | indus gas price in euro | residential elec price in euro | indus elec price in euro | single family U value | multiple family U value | single family area | mutiple family area | Residential share in % | Non-residential share in % | single family share in % | mutiple family share in % | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AT | 33.626 | 6695.287 | 18.3400 | 7.3744 | 0.1950 | 0.0621 | 0.978513 | 0.882793 | 145 | 418.0 | 61.6 | 38.4 | 47.5 | 52.5 | POLYGON ((1687803.014 6264215.777, 1696293.843... |
BE | 9.924 | 7899.465 | 15.3755 | 6.2508 | 0.2857 | 0.0816 | 1.019217 | 1.121330 | 220 | 1613.0 | 67.5 | 32.5 | 72.9 | 27.1 | POLYGON ((536053.133 6697902.835, 536858.496 6... |
BG | 21.564 | 2159.861 | 12.4631 | 8.2636 | 0.0955 | 0.0753 | 1.182875 | 0.989048 | 172 | 495.0 | 72.1 | 27.9 | 54.9 | 45.1 | POLYGON ((2551393.950 5439823.819, 2575025.606... |
CZ | 16.244 | 7006.728 | 16.2682 | 7.7802 | 0.1438 | 0.0677 | 0.813539 | 0.967243 | 112 | 681.0 | 64.9 | 35.1 | 44.3 | 55.7 | POLYGON ((1602756.662 6623613.894, 1605874.568... |
DE | 17.354 | 57743.118 | 17.5503 | 7.7126 | 0.3048 | 0.0761 | 0.722681 | 0.861979 | 173 | 1239.0 | 68.4 | 31.6 | 45.2 | 54.8 | MULTIPOLYGON (((750538.061 7090702.816, 751353... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
IT | 18.181 | 31612.006 | 21.3645 | 8.2732 | 0.2132 | 0.0829 | 1.295267 | 1.109430 | 154 | 884.0 | 89.0 | 11.0 | 26.0 | 74.0 | MULTIPOLYGON (((1404993.029 4230990.607, 14038... |
NL | 8.768 | 9307.869 | 25.5701 | 6.1856 | 0.1562 | 0.0607 | 1.075242 | 1.544523 | 163 | 2618.0 | 60.7 | 39.3 | 70.8 | 29.2 | MULTIPOLYGON (((-7596140.830 1348758.669, -759... |
PL | 12.164 | 18196.434 | 13.1310 | 9.3669 | 0.1446 | 0.0673 | 0.699272 | 0.729197 | 136 | 2186.0 | 67.0 | 33.0 | 33.1 | 66.9 | POLYGON ((2115307.046 7235751.798, 2157060.914... |
SE | 56.391 | 7363.914 | 32.8557 | 8.7466 | 0.1936 | 0.0643 | 0.255006 | NaN | 121 | 1207.0 | 66.6 | 33.4 | 44.1 | 55.9 | MULTIPOLYGON (((1744966.813 7582362.064, 17458... |
SI | 21.974 | 1057.443 | 15.8903 | 8.5024 | 0.1609 | 0.0619 | 0.788727 | 1.110169 | 201 | 1362.0 | 81.7 | 18.3 | 60.7 | 39.3 | POLYGON ((1819341.831 5895544.825, 1820883.526... |
15 rows × 15 columns
2. Handling NetCDF
For the NetCDF data, things are a bit more technical. NetCDF files are multidimensional arrays that let us store both the temporal and the spatial dimensions of a phenomenon; they can also be seen as temporally related raster layers. For that reason, we are going to use Python libraries such as xarray, rasterio and rioxarray.
First let's use xarray to load our data.
import xarray as xr
import rioxarray
import rasterio
with xr.open_dataset('tg_0.50deg_reg_1995-2016_v14.0.nc') as file:
dsCDF = file.rio.write_crs("epsg:4326", inplace=True)
dsCDF
<xarray.Dataset>
Dimensions:      (latitude: 101, longitude: 232, time: 7914)
Coordinates:
  * longitude    (longitude) float32 -40.25 -39.75 -39.25 ... 74.25 74.75 75.25
  * latitude     (latitude) float32 25.25 25.75 26.25 ... 74.25 74.75 75.25
  * time         (time) datetime64[ns] 1995-01-01 1995-01-02 ... 2016-08-31
    spatial_ref  int32 0
Data variables:
    tg           (time, latitude, longitude) float32 ...
Attributes:
    Ensembles_ECAD:  14.0
    Conventions:     CF-1.4
    References:      http://www.ecad.eu http://www.ecad.eu/download/ensemble...
    history:         Mon Oct 17 11:28:47 2016: ncks -a -d time,16436,24349 tg...
    NCO:             "4.5.3"
    grid_mapping:    spatial_ref
From there, we will create a typical year by grouping the data by day of year and taking the median for each lon/lat cell.
#Create typical year from the 20 years'set with daily data point (grid X 366 days)
typical_year = dsCDF.groupby('time.dayofyear').median('time')
We are now going to split our typical year into two seasons, cold and hot, so that for each cell we have the median temperature over each season.
#Split typical year into seasonal bins (grid X 2 seasons)
day_bins = [0,106,289,366] #Limits of the 2 seasons (day number over the year)
typical_Seasonal_year = typical_year.groupby_bins('dayofyear',day_bins,labels=['cold1','hot','cold2']).median('dayofyear') #Split the year into 3 bins
Cold = typical_Seasonal_year.sel(dayofyear_bins = ['cold2','cold1']).mean('dayofyear_bins') #Average of the 2 "Cold" season bins
Hot = typical_Seasonal_year.sel(dayofyear_bins = ['hot'])
<xarray.Dataset>
Dimensions:      (latitude: 101, longitude: 232)
Coordinates:
  * longitude    (longitude) float32 -40.25 -39.75 -39.25 ... 74.25 74.75 75.25
  * latitude     (latitude) float32 25.25 25.75 26.25 ... 74.25 74.75 75.25
    spatial_ref  int32 0
Data variables:
    tg           (latitude, longitude) float32 nan nan nan nan ... nan nan nan
Also, the area covered by the NetCDF is larger than what we actually need. Let's clip the Hot and Cold rasters to the area of interest.
#Note that this could have been done before defining Hot and Cold ¯\_(ツ)_/¯
Hot.rio.write_crs("epsg:4326", inplace=True) #set the CRS on our raster
Hot = Hot.rio.reproject(dst_crs='EPSG:3857') #reproject because the GeoJSON is in EPSG:3857
Hot = Hot.rio.clip(eu) #clip to the EU outline
Hot = Hot.where(Hot['tg'] != -9999.) #keep only the necessary values (drop the nodata fill)
Cold.rio.write_crs("epsg:4326", inplace=True)
... #idem
3. Binning
OK, so we have 16 features that we must transform into categorical features. For that, we split each of them into quantile-based bins. With 4 bins per feature, there are 4^16 = 4,294,967,296 possible geoclusters. More than enough! In reality there will be far fewer: tests have shown that 4 bins per feature results in 116 geoclusters, which is sufficient for now.
hot_bins = Hot.quantile([0,0.25,0.5,0.75,1]).tg.values #quartile edges of the hot-season temperatures
Hot_discr = np.digitize(Hot.tg.values, hot_bins) #assign each cell to a quartile bin
Hot_discr = Hot_discr.astype(np.int32) #rasterio's shapes() expects a supported integer dtype
#idem for Cold raster
Now let's vectorize those rasters, i.e. turn groups of cells sharing a bin into polygons.
from rasterio.features import shapes
west, east = min(Hot.x.values), max(Hot.x.values) #get corners coordinates
south, north = min(Hot.y.values), max(Hot.y.values) #get corners coordinates
width = len(Hot.x.values)
height = len(Hot.y.values)
res_x = np.diff(Hot.x.values).mean() #get the x resolution
res_y = np.diff(Hot.y.values).mean() #get the y resolution
transform = rasterio.transform.from_origin(west, north, res_x, -res_y) #Return an Affine transformation for a georeferenced raster
data = []
mask = Hot_discr != 5 #bin 5 catches the NaN (no data) cells, mask them out
for shp, val in shapes(Hot_discr, mask=mask, transform=transform):
    record = {'properties':{'bin': val}, 'geometry': shp}
    data.append(record)
#let's make a geodataframe from it
Hot_df = gpd.GeoDataFrame.from_features(data,crs=3857)
Hot_df = Hot_df.dissolve(by='bin')
Hot_temp in C | |
---|---|
bin | |
1.0 | MULTIPOLYGON (((-727543.440 7693137.472, -7242... |
2.0 | MULTIPOLYGON (((-352714.079 7639933.464, -3604... |
3.0 | MULTIPOLYGON (((-986903.059 5272566.153, -9883... |
4.0 | MULTIPOLYGON (((-1010018.459 4695661.164, -101... |
Let's also fit the data to the countries' outlines.
for index, row in Hot_df.iterrows(): #for each bin
    temp_poly = row.geometry.buffer(100*1000) #buffer the current geometry by 100 km
    temp_poly = temp_poly.intersection(eu) #intersect it with the EU geometry (outline)
    other_polys = Hot_df.loc[Hot_df.index != index].geometry.unary_union #union of all the other polygons
    to_remove = temp_poly.intersection(other_polys) #intersection to retrieve what has to be removed (because of the previous buffer)
    temp_poly = temp_poly.difference(to_remove) #remove the extra bits
    Hot_df.loc[index,['geometry']] = temp_poly #overwrite the geometry in the dataframe
Hot_df | Cold_df |
---|---|
*(map of the binned hot-season temperatures)* | *(map of the binned cold-season temperatures)* |
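These maps can be reproduced with geopandas' built-in plotting. A minimal sketch, where the colormap and figure size are my own choices:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
Hot_df.reset_index().plot(column='bin', cmap='coolwarm', legend=True, ax=axes[0]) #one colour per quartile bin
Cold_df.reset_index().plot(column='bin', cmap='coolwarm', legend=True, ax=axes[1])
axes[0].set_title('Hot season bins')
axes[1].set_title('Cold season bins')
plt.show()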
We binned our rasters. Let's bin our other variables.
num_cols = all_data.columns.drop('geometry') #discretize the numeric columns only
all_data_dis = all_data.copy()
all_data_dis[num_cols] = all_data_dis[num_cols].apply(lambda col: pd.qcut(col, 4, labels=False, duplicates='drop')) + 1 #qcut is a quantile-based discretization function; +1 so the bins range from 1 to 4 (and not 0 to 3)
renew share in % | energy cons | residential gas price in euro | indus gas price in euro | residential elec price in euro | indus elec price in euro | single family U value | multiple family U value | single family area | mutiple family area | Residential share in % | Non-residential share in % | single family share in % | mutiple family share in % | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AT | 4 | 2 | 2 | 1 | 3 | 1 | 2 | 2.0 | 2 | 1.0 | 1 | 4 | 2 | 3 | POLYGON ((1687803.014 6264215.777, 1696293.843... |
BE | 1 | 3 | 1 | 1 | 4 | 4 | 3 | 4.0 | 4 | 4.0 | 2 | 3 | 4 | 1 | POLYGON ((536053.133 6697902.835, 536858.496 6... |
BG | 3 | 1 | 1 | 3 | 1 | 3 | 4 | 3.0 | 3 | 1.0 | 3 | 2 | 2 | 2 | POLYGON ((2551393.950 5439823.819, 2575025.606... |
CZ | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2.0 | 1 | 2.0 | 1 | 4 | 2 | 3 | POLYGON ((1602756.662 6623613.894, 1605874.568... |
DE | 2 | 4 | 2 | 2 | 4 | 3 | 1 | 2.0 | 4 | 3.0 | 2 | 3 | 2 | 3 | MULTIPOLYGON (((750538.061 7090702.816, 751353... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
IT | 3 | 4 | 4 | 3 | 3 | 4 | 4 | 3.0 | 3 | 2.0 | 4 | 1 | 1 | 4 | MULTIPOLYGON (((1404993.029 4230990.607, 14038... |
NL | 1 | 3 | 4 | 1 | 2 | 1 | 4 | 4.0 | 3 | 4.0 | 1 | 4 | 4 | 1 | MULTIPOLYGON (((-7596140.830 1348758.669, -759... |
PL | 1 | 4 | 1 | 4 | 1 | 2 | 1 | 1.0 | 2 | 4.0 | 2 | 3 | 1 | 4 | POLYGON ((2115307.046 7235751.798, 2157060.914... |
SE | 4 | 2 | 4 | 4 | 2 | 2 | 1 | NaN | 1 | 3.0 | 1 | 4 | 1 | 4 | MULTIPOLYGON (((1744966.813 7582362.064, 17458... |
SI | 4 | 1 | 2 | 4 | 2 | 1 | 2 | 4.0 | 4 | 3.0 | 4 | 1 | 3 | 2 | POLYGON ((1819341.831 5895544.825, 1820883.526... |
4. Merge all in a final table
Now all our data are ready to come together. I am going to spare you the dataframe manipulations because they are not the topic here, but in the end the final table looks like this:
quartiles | geometry | ||
---|---|---|---|
1.0 | renew share in % | MULTIPOLYGON (((-7600964.855 1353927.589, -760... | |
energy cons | MULTIPOLYGON (((-1146332.292 6785384.989, -114... | ||
residential gas price in euro | MULTIPOLYGON (((532003.307 6697072.451, 530749... | ||
indus gas price in euro | MULTIPOLYGON (((-7601494.819 1355961.122, -760... | ||
residential elec price in euro | MULTIPOLYGON (((2089863.323 6360968.338, 20852... | ||
... | ... | ... | ... |
4.0 | Non-residential share in % | MULTIPOLYGON (((-7601703.180 1358036.485, -760... | |
single family share in % | MULTIPOLYGON (((-7600964.855 1353927.589, -760... | ||
mutiple family share in % | MULTIPOLYGON (((-1997145.494 3212924.432, -199... | ||
Cold_temp in C | MULTIPOLYGON (((-1160953.847 6780231.201, -115... | ||
Hot_temp in C | MULTIPOLYGON (((-1010018.459 4695661.164, -101... |
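For reference, one way such a table could be assembled is to take, for each feature and each quartile, the union of the matching country geometries. This is only a sketch of the idea, not the exact manipulations that were used:
records = []
for feature in all_data_dis.columns.drop('geometry'):
    for q, group in all_data_gdf.groupby(all_data_dis[feature]): #group countries by the bin they fall into for this feature
        records.append({'quartiles': float(q), 'feature': feature, 'geometry': group.geometry.unary_union})
final_data = gpd.GeoDataFrame(records, crs=3857).set_index(['quartiles', 'feature'])
#the Hot_df and Cold_df bins would then be appended in the same shape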
The final result can then be stored in a PostgreSQL database, a GeoJSON file, a Shapefile and more...
final_data.to_postgis(name, con, schema=None, if_exists='fail', index=False, index_label=None, chunksize=None, dtype=None)
final_data.to_file("data.shp")
final_data.to_file("data.geojson", driver='GeoJSON')
...
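to_postgis expects a SQLAlchemy engine as its con argument; for example (the connection string and table name below are placeholders):
from sqlalchemy import create_engine
con = create_engine("postgresql://user:password@localhost:5432/geoclusters") #placeholder credentials
final_data.to_postgis("geoclusters", con, if_exists="replace")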
5. Plotting
I first intended to use JavaScript for the plotting, with mapbox gl, turf and d3, to get a more interactive output, but I ran into some issues with the polygon intersections. So I went for something simpler in Python, using folium, geopy and ipywidgets.
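As a taste of that simpler route, here is a minimal folium sketch that puts one of the binned layers on an interactive map (the styling, the geopy geocoding and the ipywidgets controls of the original are left out):
import folium
m = folium.Map(location=[50, 10], zoom_start=4, tiles='cartodbpositron')
folium.GeoJson(Hot_df.reset_index().to_crs(epsg=4326)).add_to(m) #folium expects WGS84 coordinates
m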