EPC Cleaner (on the example of Leeds)

Packages

Load EPC Data & Standardise Missing Values

There is a lot of missing data in the EPC and often there are different expressions describing missingness that need to be standardised. While 9 of these expressions are outlined in the online description of the EPC data, there are a few more that we discovered while working with the Leeds data (and there might be a few more that we haven't found yet).

Removing Duplicate Entries

There are a few errors in the EPC data like accidental doubling of entries, which should be removed to reduce the amount of redundant information. We take this step before removing any unuseful columns to ensure that these observations are truly duplicated across all variables.

We arrange the data by reverse order of the date and time EPCs were lodged to remove older doubles, as we assume newer entries to be updates.

We considered removing additional types of potential duplicates as outlined by Hardy and Glew (2019), e.g. by considering EPCs lodged for the same building reference number within the same day or week updates. We decided against this because we were not sure in which cases a duplicated building reference number may refer to a flat or building.

Redundant Columns - Removing Geographies

We drop all the columns that contain geographic descriptors, apart from postcode. We do this, since it annonymises the data to an extent, and only post code is necessary to attach more meaningful geographies like LSOAs. You can remove additional columns depending on what your use for the EPC data is.

Removing Columns with a lot of Missing Values

Columns that are mostly empty are likely to be unhelpful in our analysis and waste processing power.

SIDENOTE: Identify the columns that contain more than a certain percentage of NAs

This may help us decide whether we really want to drop all of the columns containing a certain percentage of NAs.

Remove columns that contain more than a certain percentage of NAs

The following code removes all of the columns that contain more than 70% NAs.

Alternatively, you could pick only the columns you want to remove out of those identified in the last (SIDENOTE) step.

Adding LSOAs & Removing Postcodes

We join our cleaned dataset with the National Statistics Postcode Lookup (NSPL) to add some meaningful geographies. Postcodes should then be removed to further anonymise the outputs and avoid copyright issues.

Outcome

Save a Copy of the Cleaned Data

Useful References

Hardy, ALR. and Glew, D. 2019. An analysis of errors in the Energy Performance certificate database. Energy Policy, 129. pp. 1168-1178. Available from: http://eprints.leedsbeckett.ac.uk/id/eprint/5844/1/AnAnalysisofErrorsintheEnergyPerformanceCertificateDatabaseAM-HARDY.pdf