Leeds EPC Data Exploration & Analysis Workbook


Import required Python libraries

Set some constants:

Extract and load

Extract ZIP files to CSV files (if they do not already exist)

Load a subset of the EPC Data columns into a Dataframe

Load the ONS postcode lookup file into a Dataframe:

Data cleaning:

Sort by data (newest to oldest), and remove outdated EPC's (ie. de-duplicate by 'PROPERTY_REFERENCE_NUMBER')

Map postcodes to LSOAs, dropping rows with invalid postcodes

Standardise postcodes in both tables:

Merge tables on 'POSTCODE' column (dropping any rows with invalid postcodes), then drop the postcode columns:

Drop rows where property is not within Leeds (using the 'ladcd' column)


Simple breakdowns (for Leeds as a whole)

Number of properties by current EPC rating:

Number of properties by transaction type:

Number of properties by construction age band:

The 'CONSTRUCTION_AGE_BAND' column contains empty some values, so we will fill those with 'UNKNOWN', and also standardise the naming:

Grouped by LSOA:

Create a Dataframe to store our aggregated output:

Calculate number of properties in each age band for each LSOA:

Do the same for EPC Ratings:

Use the above outputs to calculate the modal average age band and EPC Rating by LSOA:

Calculate the mean 'CURRENT_ENERGY_EFFICIENCY' value for each LSOA:

Since the EPC Ratings are based on the values of the energy efficiency values, we can use the mean energy efficiency to derive the mean EPC Rating for each LSOA: