library(tidyverse)
Registered S3 methods overwritten by 'ggplot2': method from [.quosures rlang c.quosures rlang print.quosures rlang Registered S3 method overwritten by 'rvest': method from read_xml.response xml2 -- Attaching packages --------------------------------------- tidyverse 1.2.1 -- v ggplot2 3.1.1 v purrr 0.3.2 v tibble 2.1.1 v dplyr 0.8.0.1 v tidyr 0.8.3 v stringr 1.4.0 v readr 1.3.1 v forcats 0.4.0 -- Conflicts ------------------------------------------ tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag()
Read in Leeds EPC and UK postcode data:
certificates <- read.csv("~/11.2020_Project Cygnus/20.11.2020_EPC_Leeds/domestic-E08000035-Leeds/certificates.csv")
PC <- read.csv("~/11.2020_Project Cygnus/20.11.2020_EPC_Leeds/domestic-E08000035-Leeds/Postcode File/PCD_OA_LSOA_MSOA_LAD_NOV20_UK_LU.csv")
sum(is.na(certificates$MAINHEAT_DESCRIPTION))
There are no missing values. This is due to this being a compulsory field and potentially the use of different missing data indicators like "NODATA" or "NA". But we will get to that when we look at unique values.
Remove all observations that contain duplicated building reference numbers since we only want to look at the most recent records. (It is worth noting that this may limit the outputs since the building reference number can reger to buildings or flats within them.)
certificates_ord <- certificates[order(rev(certificates[, 'BUILDING_REFERENCE_NUMBER']), certificates[, 'INSPECTION_DATE']),]
certificates_dedup <- certificates_ord[!duplicated(certificates_ord$BUILDING_REFERENCE_NUMBER),]
Join with the postcode dataframe to remove outdated postcodes and add on LSOAs and filter out all observations that do not have the Leeds ladcd code.
PC_LSOA <- PC %>%
select(pcds, lsoa11cd, ladcd)
cert_final <- inner_join(certificates_dedup, PC_LSOA, by = c("POSTCODE" = "pcds")) %>%
filter(ladcd == "E08000035")
Warning message: "Column `POSTCODE`/`pcds` joining factors with different levels, coercing to character vector"
There are a lot of unique values in the Main Heating Description column. This is probably due to it combining descriptions of heating fixtures and overarching heating systems or fuels.
unique(cert_final$MAINHEAT_DESCRIPTION)
Since heating system and fuel seem to generally be after a comma, we decided to remove the strings before commas to retain only overarching heating types and fuels.
cert_final$Heat_type <- sub(".*, ", "", cert_final$MAINHEAT_DESCRIPTION)
This left us with a more manageable amount of unique values (that also show no sign of missing value descriptions):
unique(cert_final$Heat_type)
table(cert_final$Heat_type)
|electric|trydan 179 |mains gas|nwy prif gyflenwad 56 anthracite 11 Boiler and radiators 4 bottled gas 3 bottled LPG 5 coal 96 community 19 Community scheme 4177 Community scheme with CHP 34 dual fuel (mineral and wood) 94 electric 21279 Electric ceiling heating 43 Electric storage heaters 12531 Electric underfloor heating 979 Electricaire 348 electricity 3 LPG 399 mains gas 179520 No system present: electric heaters assumed 1165 nwy prif gyflenwad 3 oil 373 Portable electric heaters 8 Portable electric heaters assumed for most rooms 746 Portable electric heating assumed for most rooms 108 radiators 743 SAP05:Main-Heating 384 smokeless fuel 55 wood chips 2 wood logs 33 wood pellets 16
Still, some values seem to only show heating fixtures (e.g. radiators), which is caused by them not containing any additional information.
Create a dataframe that only contains relevant columns:
Heating <- cert_final %>% select(HEATING_COST_CURRENT, HEATING_COST_POTENTIAL, MAINHEAT_DESCRIPTION, Heat_type, lsoa11cd)
Create an extra column to rename unique values according to reasonable subgroups. We chose to leave some unique values that only hinted at the fuel type (e.g. Electric Storage Heaters) in their own groups. We did this to give an unbiased overview of what EPC data offers.
Heating2 <- Heating %>%
mutate(
Gen_Heat = case_when(
Heat_type == "bottled gas" ~ "Bottled Gas",
Heat_type == "bottled LPG" ~ "Bottled LPG",
Heat_type == "LPG" ~ "LPG",
Heat_type == "mains gas" ~ "Mains Gas",
Heat_type == "|mains gas|nwy prif gyflenwad" ~ "Mains Gas",
Heat_type == "nwy prif gyflenwad" ~ "Mains Gas",
Heat_type == "anthracite" ~ "Coal and Anthracite",
Heat_type == "coal" ~ "Coal and Anthracite",
Heat_type == "anthracite" ~ "Coal and Anthracite",
Heat_type == "dual fuel (mineral and wood)" ~ "Dual Fuel (Mineral and Wood)",
Heat_type == "wood chips" ~ "Wood",
Heat_type == "wood logs" ~ "Wood",
Heat_type == "wood pellets" ~ "Wood",
Heat_type == "community" ~ "Community Scheme",
Heat_type == "Community scheme" ~ "Community Scheme",
Heat_type == "Community scheme with CHP" ~ "Community Scheme with CHP",
Heat_type == "electric" ~ "Electric",
Heat_type == "|electric|trydan" ~ "Electric",
Heat_type == "Electricaire" ~ "Electric",
Heat_type == "electricity" ~ "Electric",
Heat_type == "Electric ceiling heating" ~ "Electric Ceiling Heating",
Heat_type == "Electric storage heaters" ~ "Electric Storage Heating",
MAINHEAT_DESCRIPTION == "Electric storage heaters, radiators" ~ "Electric Storage Heating & Radiators",
Heat_type == "Electric underfloor heating" ~ "Electric Underfloor Heating",
Heat_type == "Portable electric heaters" ~ "Portable Electric Heating",
Heat_type == "Portable electric heaters assumed for most rooms" ~ "Portable Electric Heating",
Heat_type == "Portable electric heating assumed for most rooms" ~ "Portable Electric Heating",
Heat_type == "No system present: electric heaters assumed" ~ "No system present: Electric heaters assumed",
Heat_type == "oil" ~ "Oil",
Heat_type == "smokeless fuel" ~ "Smokeless Fuel",
Heat_type == "Boiler and radiators" ~ "Unknown",
Heat_type == "SAP05:Main-Heating" ~ "Unknown"
)
)
We created two tables one containing the frequencies of the Heating Types for visualisation, and Heating Types by LSOAs for the Data Mapper.
Heat_Freq <- data.frame(table(Heating2$Gen_Heat))
write.csv(Heat_Freq, 'HeatingTypes_Frequencies.csv')
Heat_LSOA <- xtabs(~ lsoa11cd + Gen_Heat, data=Heating2)
write.csv(Heat_LSOA, 'HeatingType_LSOA.csv')
Create a bar chart for the Heating Types:
Heat_Freq %>%
mutate(Var1 = fct_reorder(Var1, Freq)) %>%
ggplot( aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="#D73058", alpha=.6, width=.4) +
xlab("Heating Systems or Fuels") +
ylab("Number of Dwellings") +
theme_bw() +
coord_flip()
This approach results in a somewhat unhelpful graph. However, it shows how much need there is to standardise in the EPC creation process.
The ONS report on the EPC data also contains a visualisation of Heating Systems and Fuel distribution. They chose to heavily summarise the groupings by only using "Mains Gas", "Electric", "Community Scheme", "Oil", "Other" and "Unknown".
We grouped smaller subgroups as 'Other' and made the assumption that observations that showed only electrical fixtures could be grouped into 'Electric'.
HeatingONS <- Heating %>%
mutate(
Gen_Heat = case_when(
Heat_type == "bottled gas" ~ "Other",
Heat_type == "bottled LPG" ~ "Other",
Heat_type == "LPG" ~ "Other",
Heat_type == "mains gas" ~ "Mains Gas",
Heat_type == "|mains gas|nwy prif gyflenwad" ~ "Mains Gas",
Heat_type == "nwy prif gyflenwad" ~ "Mains Gas",
Heat_type == "anthracite" ~ "Other",
Heat_type == "coal" ~ "Other",
Heat_type == "anthracite" ~ "Other",
Heat_type == "dual fuel (mineral and wood)" ~ "Other",
Heat_type == "wood chips" ~ "Other",
Heat_type == "wood logs" ~ "Other",
Heat_type == "wood pellets" ~ "Other",
Heat_type == "community" ~ "Community Scheme",
Heat_type == "Community scheme" ~ "Community Scheme",
Heat_type == "Community scheme with CHP" ~ "Community Scheme",
Heat_type == "electric" ~ "Electric",
Heat_type == "|electric|trydan" ~ "Electric",
Heat_type == "Electricaire" ~ "Electric",
Heat_type == "electricity" ~ "Electric",
Heat_type == "Electric ceiling heating" ~ "Electric",
Heat_type == "Electric storage heaters" ~ "Electric",
MAINHEAT_DESCRIPTION == "Electric storage heaters, radiators" ~ "Electric",
Heat_type == "Electric underfloor heating" ~ "Electric",
Heat_type == "Portable electric heaters" ~ "Electric",
Heat_type == "Portable electric heaters assumed for most rooms" ~ "Electric",
Heat_type == "Portable electric heating assumed for most rooms" ~ "Electric",
Heat_type == "No system present: electric heaters assumed" ~ "Unknown",
Heat_type == "oil" ~ "Oil",
Heat_type == "smokeless fuel" ~ "Other",
Heat_type == "Boiler and radiators" ~ "Unknown",
Heat_type == "SAP05:Main-Heating" ~ "Unknown"
)
)
We, again, created two tables one containing the frequencies of the Heating Types for visualisation, and Heating Types by LSOAs for the Data Mapper.
Heat_FreqONS <- data.frame(table(HeatingONS$Gen_Heat))
write.csv(Heat_Freq, 'HeatingTypesONS_Frequencies.csv')
Heat_LSOAONS <- xtabs(~ lsoa11cd + Gen_Heat, data=HeatingONS)
write.csv(Heat_LSOA, 'HeatingType_LSOA.csv')
Create a bar chart for the Heating Types:
Heat_FreqONS %>%
mutate(Var1 = fct_reorder(Var1, Freq)) %>%
ggplot( aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="#D73058", alpha=.6, width=.4) +
xlab("Heating Systems or Fuels") +
ylab("Number of Dwellings") +
theme_bw() +
coord_flip()
This creates a much more less cluttered visual of the Heating Systems and Fuels that are being used in Leeds. However, this also heavily simplifies the groupings, which may bias potential further analysis into heating cost differences.
This uses the cleaned dataframe 'cert_final' from the Heating Types - Data Cleaning Section.
sum(is.na(cert_final$TRANSACTION_TYPE))
Again there are no missing values.
unique(cert_final$TRANSACTION_TYPE)
table (cert_final$TRANSACTION_TYPE)
assessment for green deal 6876 ECO assessment 8404 FiT application 2634 following green deal 297 marketed sale 80455 new dwelling 20807 non marketed sale 4866 none of the above 15365 not recorded 7 rental 1667 rental (private) 42630 rental (private) - this is for backwards compatibility only and should not be used 13 rental (social) 39153 RHI application 115 Stock Condition Survey 10 unknown 117
A look into the unique values of this column shows that it is more standardised and missing entries are summarised as "unknown".
There are a few variables that clutter the data. This means that the long and unuseful private rental variable is being removed. Then "not recorded" cases are renamed to "unknown". "None of the above", is not grouped into "unknown" since it does not necessarily mean this data is unavailable.
Transaction <- cert_final %>%
select(lsoa11cd, TRANSACTION_TYPE)
Transaction$TRANSACTION_TYPE <- gsub('not recorded', 'unknown', Transaction$TRANSACTION_TYPE)
Transaction <- Transaction[!(Transaction$TRANSACTION_TYPE=="rental (private) - this is for backwards compatibility only and should not be used"),]
unique(Transaction$TRANSACTION_TYPE)
TransDf <- data.frame(table(Transaction$TRANSACTION_TYPE))
write.csv(TransDf, 'TransactionType_Frequencies.csv')
TransDf1 <- xtabs(~ lsoa11cd + TRANSACTION_TYPE, data=Transaction)
write.csv(TransDf1, 'TransactionType_LSOA.csv')
TransDf %>%
mutate(Var1 = fct_reorder(Var1, Freq)) %>%
ggplot( aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="#178CFF", alpha=.6, width=.4) +
xlab("Reason for EPC Evaluation") +
ylab("Number of Dwellings") +
theme_bw() +
coord_flip()
We can also look at all reasons for EPC inspections in Leeds rather than just the most recent ones.
For this, we only clean out invalid postcodes, create a table with only relevant columns, and rename and group variables as before.
cert_ALL <- inner_join(certificates, PC_LSOA, by = c("POSTCODE" = "pcds")) %>%
filter(ladcd == "E08000035")
Transaction_ALL <- cert_ALL %>%
select(lsoa11cd, TRANSACTION_TYPE)
Transaction_ALL$TRANSACTION_TYPE <- gsub('not recorded', 'unknown', Transaction_ALL$TRANSACTION_TYPE)
Transaction_ALL <- Transaction_ALL[!(Transaction_ALL$TRANSACTION_TYPE=="rental (private) - this is for backwards compatibility only and should not be used"),]
Warning message: "Column `POSTCODE`/`pcds` joining factors with different levels, coercing to character vector"
TransDf_ALL <- data.frame(table(Transaction_ALL$TRANSACTION_TYPE))
write.csv(TransDf_ALL, 'ALLTransactionType_Frequencies.csv')
TransDf1_ALL <- xtabs(~ lsoa11cd + TRANSACTION_TYPE, data=Transaction_ALL)
write.csv(TransDf1_ALL, 'ALLTransactionType_LSOA.csv')
TransDf_ALL %>%
mutate(Var1 = fct_reorder(Var1, Freq)) %>%
ggplot( aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="#178CFF", alpha=.6, width=.4) +
xlab("Reason for EPC Evaluation") +
ylab("Number of Dwellings") +
theme_bw() +
coord_flip()
We decided to take a more detailed look at the results of the Green Deal (GD) scheme. This is relatively well reflected in the EPC data, since the Transaction Types include "Assessment for Green Deal" and "Following Green Deal".
This time, we do not work with the previously cleaned cert_final data frame, since we want to keep duplicates and invalid postcodes. This is because we want to compare te effect of GD initiatives on the same properties. Since assessment for the GD may be a bit older and may contain invalid postcodes, we keep these to avoid losing observations.
First, we drop every observation that does not contain GD in its Transaction Type.
GreenDeal <- certificates_ord %>%
filter(str_detect(TRANSACTION_TYPE, "green deal"))
We keep only duplicated building reference numbers since we want to compare before and after for the same building reference numbers.
GreenDeal <- GreenDeal[GreenDeal$BUILDING_REFERENCE_NUMBER %in% GreenDeal$BUILDING_REFERENCE_NUMBER[duplicated(GreenDeal$BUILDING_REFERENCE_NUMBER)],] %>%
select(BUILDING_REFERENCE_NUMBER, TRANSACTION_TYPE, INSPECTION_DATE, CURRENT_ENERGY_EFFICIENCY, POTENTIAL_ENERGY_EFFICIENCY)
GreenDeal$INSPECTION_DATE <- as.numeric(GreenDeal$INSPECTION_DATE)
We then remove older doubles for the observations classed as "assessment for GD" and newer doubles of those classed as "following GD". We do this to compare the state of properties, as closely surrounding the implementation of GD measures as possible. This is done by creating two separate dataframes.
# Assessment for Green Deal:
GD_assmt <- GreenDeal %>%
filter(TRANSACTION_TYPE == "assessment for green deal") %>%
arrange(., BUILDING_REFERENCE_NUMBER, desc(INSPECTION_DATE))
GD_assmt <- GD_assmt[!duplicated(GD_assmt$BUILDING_REFERENCE_NUMBER),]
# Create a Column for the difference between Potential and Current Energy Efficiency
GD_assmt$Diff_EE <- (GD_assmt$POTENTIAL_ENERGY_EFFICIENCY - GD_assmt$CURRENT_ENERGY_EFFICIENCY)
# Add an indicator to the column names to help differentiate the dataframes
colnames(GD_assmt) <- paste("A", colnames(GD_assmt), sep = "_")
# Following Green Deal:
GD_follw <- GreenDeal %>%
filter(TRANSACTION_TYPE == "following green deal") %>%
arrange(., BUILDING_REFERENCE_NUMBER, INSPECTION_DATE)
GD_follw <- GD_follw[!duplicated(GD_follw$BUILDING_REFERENCE_NUMBER),]
# Create a Column for the difference between Potential and Current Energy Efficiency
GD_follw$Diff_EE <- (GD_follw$POTENTIAL_ENERGY_EFFICIENCY - GD_follw$CURRENT_ENERGY_EFFICIENCY)
# Add an indicator to the column names to help differentiate the dataframes
colnames(GD_follw) <- paste("F", colnames(GD_follw), sep = "_")
The two dataframes created in the last step are then joined to compare them directly:
GD_comp <- inner_join(GD_assmt, GD_follw, by = c("A_BUILDING_REFERENCE_NUMBER" = "F_BUILDING_REFERENCE_NUMBER"))
Interestingly, there are some building reference numbers where inspection dates for GD assessments are later than inspections classed as "following GD". This may indicate that there may have been several GDs in these properties, suggesting that cleaning out duplicates in the last step may have caused these inconsistencies. This may be looked into in further analyses. However, for now these observations will be dropped, since the assessment date for the GD should reasonably not be after the GD has been implemented.
GD_comp2 <- GD_comp %>%
filter(., A_INSPECTION_DATE < F_INSPECTION_DATE)
In this step the differences between the different levels of Energy Efficiency are calculated, to enable the illustration of changes.
GD_comp2$Diff_CEE <- (GD_comp2$F_CURRENT_ENERGY_EFFICIENCY - GD_comp2$A_CURRENT_ENERGY_EFFICIENCY)
GD_comp2$Diff_AP_FC <- (GD_comp2$F_CURRENT_ENERGY_EFFICIENCY - GD_comp2$A_POTENTIAL_ENERGY_EFFICIENCY)
GD_comp2$Diff_ADE_FDE <- (GD_comp2$F_Diff_EE - GD_comp2$A_Diff_EE)
mean (GD_comp2 [["Diff_CEE"]])
This means that Energy Efficiency on average improved by roughly 1 band upon the implementation of GD measures.
mean (GD_comp2 [["Diff_AP_FC"]])
This shows that the Energy Efficiency following the implementation of GD measures was generally 1 band under the proposed potential prior to GD measures.
mean (-(GD_comp2 [["A_Diff_EE"]]))
# negative to make it comparable
The Energy Efficiency before the implementation of GD measures was on average 2 bands lower than its respective potential. Combined with the previous average, this means that the gap between Energy Efficiency and the old potential has aproximately halved.
mean (GD_comp2 [["Diff_ADE_FDE"]])
This comparison may not be too necessary but it shows that the difference between current and potential Energy Efficiency after GD implementation is generally lower than before it.
write.csv(GD_comp2, 'Green_Deal_Comparisons.csv')
ggplot(GD_comp2, aes(x=Diff_CEE)) +
geom_histogram(color= "black", fill="#67E767", alpha=.6, binwidth=10) +
xlab("Difference in Energy Efficiency") +
ylab("Number of Dwellings") +
theme_bw()
This graph shows that the Energy Efficiency of dwellings following GD implementation is mostly larger than before. There are some cases at the upper end of the spectrum where the Energy Efficiency has improved greatly, but there are also a few cases where Energy Efficiency has decreased. These outliers should be further explored to identify underlying reasons or if they are errors.
ggplot(GD_comp2, aes(x=Diff_AP_FC)) +
geom_histogram(color= "black", fill="#67E767", alpha=.6, binwidth=10) +
xlab("Difference in Energy Efficiency") +
ylab("Number of Dwellings") +
theme_bw()
This visualisation shows that the Energy Efficiency dwellings have achieved through the GD was sometimes on target for the potential proposed prior to GD implementation but generally just under. Some dwellings have even surpassed previous potentials following GD measures.
ggplot(GD_comp2, aes(x=-(A_Diff_EE))) +
geom_histogram(color= "black", fill="#67E767", alpha=.6, binwidth=10) +
xlab("Difference in Energy Efficiency") +
ylab("Number of Dwellings") +
theme_bw()
This chart offers a good point of comparison for the previous one. It shows that the gap between the current energy efficiency preceding GD measures and the respective potential was generally bigger than following GD implementation. It may be interesting to look into changes in Energy Efficiency for individual dwellings, especially at the lower end of the spectrum to test the magnitude of changes and improvements.
ggplot(GD_comp2, aes(x=Diff_ADE_FDE)) +
geom_histogram(color= "black", fill="#67E767", alpha=.6, binwidth=10) +
xlab("Difference in Energy Efficiency") +
ylab("Number of Dwellings") +
theme_bw()
This final graph visualises the difference between the current-potential gaps in Energy Efficiency before and after GD implementation. It shows that the proposed potentials after dwellings have taken advantage of GD measures are often similarly distant to their respective current Energy Efficiency values, as the before GD implementation. However, the current-potential gap is generally smaller following the GD than before it.