library(tidyverse)
Registered S3 methods overwritten by 'ggplot2': method from [.quosures rlang c.quosures rlang print.quosures rlang Registered S3 method overwritten by 'rvest': method from read_xml.response xml2 -- Attaching packages --------------------------------------- tidyverse 1.2.1 -- v ggplot2 3.1.1 v purrr 0.3.2 v tibble 2.1.1 v dplyr 0.8.0.1 v tidyr 0.8.3 v stringr 1.4.0 v readr 1.3.1 v forcats 0.4.0 -- Conflicts ------------------------------------------ tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag()
Read in Leeds EPC and UK postcode data:
certificates <- read.csv("~/11.2020_Project Cygnus/20.11.2020_EPC_Leeds/domestic-E08000035-Leeds/certificates.csv")
PC <- read.csv("~/11.2020_Project Cygnus/20.11.2020_EPC_Leeds/domestic-E08000035-Leeds/Postcode File/PCD_OA_LSOA_MSOA_LAD_NOV20_UK_LU.csv")
sum(is.na(certificates$MAINHEAT_DESCRIPTION))
There are no missing values. This is due to this being a compulsory field and potentially the use of different missing data indicators like "NODATA" or "NA". But we will get to that when we look at unique values.
Remove all observations that contain duplicated building reference numbers since we only want to look at the most recent records. (It is worth noting that this may limit the outputs since the building reference number can reger to buildings or flats within them.)
certificates_ord <- certificates[order(rev(certificates[, 'BUILDING_REFERENCE_NUMBER']), certificates[, 'INSPECTION_DATE']),]
certificates_dedup <- certificates_ord[!duplicated(certificates_ord$BUILDING_REFERENCE_NUMBER),]
Join with the postcode dataframe to remove outdated postcodes and add on LSOAs and filter out all observations that do not have the Leeds ladcd code.
PC_LSOA <- PC %>%
select(pcds, lsoa11cd, ladcd)
cert_final <- inner_join(certificates_dedup, PC_LSOA, by = c("POSTCODE" = "pcds")) %>%
filter(ladcd == "E08000035")
Warning message: "Column `POSTCODE`/`pcds` joining factors with different levels, coercing to character vector"
There are a lot of unique values in the Main Heating Description column. This is probably due to it combining descriptions of heating fixtures and overarching heating systems or fuels.
unique(cert_final$MAINHEAT_DESCRIPTION)
Since heating system and fuel seem to generally be after a comma, we decided to remove the strings before commas to retain only overarching heating types and fuels.
cert_final$Heat_type <- sub(".*, ", "", cert_final$MAINHEAT_DESCRIPTION)
This left us with a more manageable amount of unique values (that also show no sign of missing value descriptions):
unique(cert_final$Heat_type)
table(cert_final$Heat_type)
|electric|trydan
179
|mains gas|nwy prif gyflenwad
56
anthracite
11
Boiler and radiators
4
bottled gas
3
bottled LPG
5
coal
96
community
19
Community scheme
4177
Community scheme with CHP
34
dual fuel (mineral and wood)
94
electric
21279
Electric ceiling heating
43
Electric storage heaters
12531
Electric underfloor heating
979
Electricaire
348
electricity
3
LPG
399
mains gas
179520
No system present: electric heaters assumed
1165
nwy prif gyflenwad
3
oil
373
Portable electric heaters
8
Portable electric heaters assumed for most rooms
746
Portable electric heating assumed for most rooms
108
radiators
743
SAP05:Main-Heating
384
smokeless fuel
55
wood chips
2
wood logs
33
wood pellets
16
Still, some values seem to only show heating fixtures (e.g. radiators), which is caused by them not containing any additional information.
Create a dataframe that only contains relevant columns:
Heating <- cert_final %>% select(HEATING_COST_CURRENT, HEATING_COST_POTENTIAL, MAINHEAT_DESCRIPTION, Heat_type, lsoa11cd)
Create an extra column to rename unique values according to reasonable subgroups. We chose to leave some unique values that only hinted at the fuel type (e.g. Electric Storage Heaters) in their own groups. We did this to give an unbiased overview of what EPC data offers.
Heating2 <- Heating %>%
mutate(
Gen_Heat = case_when(
Heat_type == "bottled gas" ~ "Bottled Gas",
Heat_type == "bottled LPG" ~ "Bottled LPG",
Heat_type == "LPG" ~ "LPG",
Heat_type == "mains gas" ~ "Mains Gas",
Heat_type == "|mains gas|nwy prif gyflenwad" ~ "Mains Gas",
Heat_type == "nwy prif gyflenwad" ~ "Mains Gas",
Heat_type == "anthracite" ~ "Coal and Anthracite",
Heat_type == "coal" ~ "Coal and Anthracite",
Heat_type == "anthracite" ~ "Coal and Anthracite",
Heat_type == "dual fuel (mineral and wood)" ~ "Dual Fuel (Mineral and Wood)",
Heat_type == "wood chips" ~ "Wood",
Heat_type == "wood logs" ~ "Wood",
Heat_type == "wood pellets" ~ "Wood",
Heat_type == "community" ~ "Community Scheme",
Heat_type == "Community scheme" ~ "Community Scheme",
Heat_type == "Community scheme with CHP" ~ "Community Scheme with CHP",
Heat_type == "electric" ~ "Electric",
Heat_type == "|electric|trydan" ~ "Electric",
Heat_type == "Electricaire" ~ "Electric",
Heat_type == "electricity" ~ "Electric",
Heat_type == "Electric ceiling heating" ~ "Electric Ceiling Heating",
Heat_type == "Electric storage heaters" ~ "Electric Storage Heating",
MAINHEAT_DESCRIPTION == "Electric storage heaters, radiators" ~ "Electric Storage Heating & Radiators",
Heat_type == "Electric underfloor heating" ~ "Electric Underfloor Heating",
Heat_type == "Portable electric heaters" ~ "Portable Electric Heating",
Heat_type == "Portable electric heaters assumed for most rooms" ~ "Portable Electric Heating",
Heat_type == "Portable electric heating assumed for most rooms" ~ "Portable Electric Heating",
Heat_type == "No system present: electric heaters assumed" ~ "No system present: Electric heaters assumed",
Heat_type == "oil" ~ "Oil",
Heat_type == "smokeless fuel" ~ "Smokeless Fuel",
Heat_type == "Boiler and radiators" ~ "Unknown",
Heat_type == "SAP05:Main-Heating" ~ "Unknown"
)
)
We created two tables one containing the frequencies of the Heating Types for visualisation, and Heating Types by LSOAs for the Data Mapper.
Heat_Freq <- data.frame(table(Heating2$Gen_Heat))
write.csv(Heat_Freq, 'HeatingTypes_Frequencies.csv')
Heat_LSOA <- xtabs(~ lsoa11cd + Gen_Heat, data=Heating2)
write.csv(Heat_LSOA, 'HeatingType_LSOA.csv')
Create a bar chart for the Heating Types:
Heat_Freq %>%
mutate(Var1 = fct_reorder(Var1, Freq)) %>%
ggplot( aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="#D73058", alpha=.6, width=.4) +
xlab("Heating Systems or Fuels") +
ylab("Number of Dwellings") +
theme_bw() +
coord_flip()
This approach results in a somewhat unhelpful graph. However, it shows how much need there is to standardise in the EPC creation process.
The ONS report on the EPC data also contains a visualisation of Heating Systems and Fuel distribution. They chose to heavily summarise the groupings by only using "Mains Gas", "Electric", "Community Scheme", "Oil", "Other" and "Unknown".
We grouped smaller subgroups as 'Other' and made the assumption that observations that showed only electrical fixtures could be grouped into 'Electric'.
HeatingONS <- Heating %>%
mutate(
Gen_Heat = case_when(
Heat_type == "bottled gas" ~ "Other",
Heat_type == "bottled LPG" ~ "Other",
Heat_type == "LPG" ~ "Other",
Heat_type == "mains gas" ~ "Mains Gas",
Heat_type == "|mains gas|nwy prif gyflenwad" ~ "Mains Gas",
Heat_type == "nwy prif gyflenwad" ~ "Mains Gas",
Heat_type == "anthracite" ~ "Other",
Heat_type == "coal" ~ "Other",
Heat_type == "anthracite" ~ "Other",
Heat_type == "dual fuel (mineral and wood)" ~ "Other",
Heat_type == "wood chips" ~ "Other",
Heat_type == "wood logs" ~ "Other",
Heat_type == "wood pellets" ~ "Other",
Heat_type == "community" ~ "Community Scheme",
Heat_type == "Community scheme" ~ "Community Scheme",
Heat_type == "Community scheme with CHP" ~ "Community Scheme",
Heat_type == "electric" ~ "Electric",
Heat_type == "|electric|trydan" ~ "Electric",
Heat_type == "Electricaire" ~ "Electric",
Heat_type == "electricity" ~ "Electric",
Heat_type == "Electric ceiling heating" ~ "Electric",
Heat_type == "Electric storage heaters" ~ "Electric",
MAINHEAT_DESCRIPTION == "Electric storage heaters, radiators" ~ "Electric",
Heat_type == "Electric underfloor heating" ~ "Electric",
Heat_type == "Portable electric heaters" ~ "Electric",
Heat_type == "Portable electric heaters assumed for most rooms" ~ "Electric",
Heat_type == "Portable electric heating assumed for most rooms" ~ "Electric",
Heat_type == "No system present: electric heaters assumed" ~ "Unknown",
Heat_type == "oil" ~ "Oil",
Heat_type == "smokeless fuel" ~ "Other",
Heat_type == "Boiler and radiators" ~ "Unknown",
Heat_type == "SAP05:Main-Heating" ~ "Unknown"
)
)
We, again, created two tables one containing the frequencies of the Heating Types for visualisation, and Heating Types by LSOAs for the Data Mapper.
Heat_FreqONS <- data.frame(table(HeatingONS$Gen_Heat))
write.csv(Heat_Freq, 'HeatingTypesONS_Frequencies.csv')
Heat_LSOAONS <- xtabs(~ lsoa11cd + Gen_Heat, data=HeatingONS)
write.csv(Heat_LSOA, 'HeatingType_LSOA.csv')
Create a bar chart for the Heating Types:
Heat_FreqONS %>%
mutate(Var1 = fct_reorder(Var1, Freq)) %>%
ggplot( aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="#D73058", alpha=.6, width=.4) +
xlab("Heating Systems or Fuels") +
ylab("Number of Dwellings") +
theme_bw() +
coord_flip()
This creates a much more less cluttered visual of the Heating Systems and Fuels that are being used in Leeds. However, this also heavily simplifies the groupings, which may bias potential further analysis into heating cost differences.
This uses the cleaned dataframe 'cert_final' from the Heating Types - Data Cleaning Section.
sum(is.na(cert_final$TRANSACTION_TYPE))
Again there are no missing values.
unique(cert_final$TRANSACTION_TYPE)
table (cert_final$TRANSACTION_TYPE)
assessment for green deal
6876
ECO assessment
8404
FiT application
2634
following green deal
297
marketed sale
80455
new dwelling
20807
non marketed sale
4866
none of the above
15365
not recorded
7
rental
1667
rental (private)
42630
rental (private) - this is for backwards compatibility only and should not be used
13
rental (social)
39153
RHI application
115
Stock Condition Survey
10
unknown
117
A look into the unique values of this column shows that it is more standardised and missing entries are summarised as "unknown".
There are a few variables that clutter the data. This means that the long and unuseful private rental variable is being removed. Then "not recorded" cases are renamed to "unknown". "None of the above", is not grouped into "unknown" since it does not necessarily mean this data is unavailable.
Transaction <- cert_final %>%
select(lsoa11cd, TRANSACTION_TYPE)
Transaction$TRANSACTION_TYPE <- gsub('not recorded', 'unknown', Transaction$TRANSACTION_TYPE)
Transaction <- Transaction[!(Transaction$TRANSACTION_TYPE=="rental (private) - this is for backwards compatibility only and should not be used"),]
unique(Transaction$TRANSACTION_TYPE)
TransDf <- data.frame(table(Transaction$TRANSACTION_TYPE))
write.csv(TransDf, 'TransactionType_Frequencies.csv')
TransDf1 <- xtabs(~ lsoa11cd + TRANSACTION_TYPE, data=Transaction)
write.csv(TransDf1, 'TransactionType_LSOA.csv')
TransDf %>%
mutate(Var1 = fct_reorder(Var1, Freq)) %>%
ggplot( aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="#178CFF", alpha=.6, width=.4) +
xlab("Reason for EPC Evaluation") +
ylab("Number of Dwellings") +
theme_bw() +
coord_flip()
We can also look at all reasons for EPC inspections in Leeds rather than just the most recent ones.
For this, we only clean out invalid postcodes, create a table with only relevant columns, and rename and group variables as before.
cert_ALL <- inner_join(certificates, PC_LSOA, by = c("POSTCODE" = "pcds")) %>%
filter(ladcd == "E08000035")
Transaction_ALL <- cert_ALL %>%
select(lsoa11cd, TRANSACTION_TYPE)
Transaction_ALL$TRANSACTION_TYPE <- gsub('not recorded', 'unknown', Transaction_ALL$TRANSACTION_TYPE)
Transaction_ALL <- Transaction_ALL[!(Transaction_ALL$TRANSACTION_TYPE=="rental (private) - this is for backwards compatibility only and should not be used"),]
Warning message: "Column `POSTCODE`/`pcds` joining factors with different levels, coercing to character vector"
TransDf_ALL <- data.frame(table(Transaction_ALL$TRANSACTION_TYPE))
write.csv(TransDf_ALL, 'ALLTransactionType_Frequencies.csv')
TransDf1_ALL <- xtabs(~ lsoa11cd + TRANSACTION_TYPE, data=Transaction_ALL)
write.csv(TransDf1_ALL, 'ALLTransactionType_LSOA.csv')
TransDf_ALL %>%
mutate(Var1 = fct_reorder(Var1, Freq)) %>%
ggplot( aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="#178CFF", alpha=.6, width=.4) +
xlab("Reason for EPC Evaluation") +
ylab("Number of Dwellings") +
theme_bw() +
coord_flip()
We decided to take a more detailed look at the results of the Green Deal (GD) scheme. This is relatively well reflected in the EPC data, since the Transaction Types include "Assessment for Green Deal" and "Following Green Deal".
This time, we do not work with the previously cleaned cert_final data frame, since we want to keep duplicates and invalid postcodes. This is because we want to compare te effect of GD initiatives on the same properties. Since assessment for the GD may be a bit older and may contain invalid postcodes, we keep these to avoid losing observations.
First, we drop every observation that does not contain GD in its Transaction Type.
GreenDeal <- certificates_ord %>%
filter(str_detect(TRANSACTION_TYPE, "green deal"))
We keep only duplicated building reference numbers since we want to compare before and after for the same building reference numbers.
GreenDeal <- GreenDeal[GreenDeal$BUILDING_REFERENCE_NUMBER %in% GreenDeal$BUILDING_REFERENCE_NUMBER[duplicated(GreenDeal$BUILDING_REFERENCE_NUMBER)],] %>%
select(BUILDING_REFERENCE_NUMBER, TRANSACTION_TYPE, INSPECTION_DATE, CURRENT_ENERGY_EFFICIENCY, POTENTIAL_ENERGY_EFFICIENCY)
GreenDeal$INSPECTION_DATE <- as.numeric(GreenDeal$INSPECTION_DATE)
We then remove older doubles for the observations classed as "assessment for GD" and newer doubles of those classed as "following GD". We do this to compare the state of properties, as closely surrounding the implementation of GD measures as possible. This is done by creating two separate dataframes.
# Assessment for Green Deal:
GD_assmt <- GreenDeal %>%
filter(TRANSACTION_TYPE == "assessment for green deal") %>%
arrange(., BUILDING_REFERENCE_NUMBER, desc(INSPECTION_DATE))
GD_assmt <- GD_assmt[!duplicated(GD_assmt$BUILDING_REFERENCE_NUMBER),]
# Create a Column for the difference between Potential and Current Energy Efficiency
GD_assmt$Diff_EE <- (GD_assmt$POTENTIAL_ENERGY_EFFICIENCY - GD_assmt$CURRENT_ENERGY_EFFICIENCY)
# Add an indicator to the column names to help differentiate the dataframes
colnames(GD_assmt) <- paste("A", colnames(GD_assmt), sep = "_")
# Following Green Deal:
GD_follw <- GreenDeal %>%
filter(TRANSACTION_TYPE == "following green deal") %>%
arrange(., BUILDING_REFERENCE_NUMBER, INSPECTION_DATE)
GD_follw <- GD_follw[!duplicated(GD_follw$BUILDING_REFERENCE_NUMBER),]
# Create a Column for the difference between Potential and Current Energy Efficiency
GD_follw$Diff_EE <- (GD_follw$POTENTIAL_ENERGY_EFFICIENCY - GD_follw$CURRENT_ENERGY_EFFICIENCY)
# Add an indicator to the column names to help differentiate the dataframes
colnames(GD_follw) <- paste("F", colnames(GD_follw), sep = "_")
The two dataframes created in the last step are then joined to compare them directly:
GD_comp <- inner_join(GD_assmt, GD_follw, by = c("A_BUILDING_REFERENCE_NUMBER" = "F_BUILDING_REFERENCE_NUMBER"))
Interestingly, there are some building reference numbers where inspection dates for GD assessments are later than inspections classed as "following GD". This may indicate that there may have been several GDs in these properties, suggesting that cleaning out duplicates in the last step may have caused these inconsistencies. This may be looked into in further analyses. However, for now these observations will be dropped, since the assessment date for the GD should reasonably not be after the GD has been implemented.
GD_comp2 <- GD_comp %>%
filter(., A_INSPECTION_DATE < F_INSPECTION_DATE)
In this step the differences between the different levels of Energy Efficiency are calculated, to enable the illustration of changes.
GD_comp2$Diff_CEE <- (GD_comp2$F_CURRENT_ENERGY_EFFICIENCY - GD_comp2$A_CURRENT_ENERGY_EFFICIENCY)
GD_comp2$Diff_AP_FC <- (GD_comp2$F_CURRENT_ENERGY_EFFICIENCY - GD_comp2$A_POTENTIAL_ENERGY_EFFICIENCY)
GD_comp2$Diff_ADE_FDE <- (GD_comp2$F_Diff_EE - GD_comp2$A_Diff_EE)
mean (GD_comp2 [["Diff_CEE"]])
This means that Energy Efficiency on average improved by roughly 1 band upon the implementation of GD measures.
mean (GD_comp2 [["Diff_AP_FC"]])
This shows that the Energy Efficiency following the implementation of GD measures was generally 1 band under the proposed potential prior to GD measures.
mean (-(GD_comp2 [["A_Diff_EE"]]))
# negative to make it comparable
The Energy Efficiency before the implementation of GD measures was on average 2 bands lower than its respective potential. Combined with the previous average, this means that the gap between Energy Efficiency and the old potential has aproximately halved.
mean (GD_comp2 [["Diff_ADE_FDE"]])
This comparison may not be too necessary but it shows that the difference between current and potential Energy Efficiency after GD implementation is generally lower than before it.
write.csv(GD_comp2, 'Green_Deal_Comparisons.csv')
ggplot(GD_comp2, aes(x=Diff_CEE)) +
geom_histogram(color= "black", fill="#67E767", alpha=.6, binwidth=10) +
xlab("Difference in Energy Efficiency") +
ylab("Number of Dwellings") +
theme_bw()
This graph shows that the Energy Efficiency of dwellings following GD implementation is mostly larger than before. There are some cases at the upper end of the spectrum where the Energy Efficiency has improved greatly, but there are also a few cases where Energy Efficiency has decreased. These outliers should be further explored to identify underlying reasons or if they are errors.
ggplot(GD_comp2, aes(x=Diff_AP_FC)) +
geom_histogram(color= "black", fill="#67E767", alpha=.6, binwidth=10) +
xlab("Difference in Energy Efficiency") +
ylab("Number of Dwellings") +
theme_bw()
This visualisation shows that the Energy Efficiency dwellings have achieved through the GD was sometimes on target for the potential proposed prior to GD implementation but generally just under. Some dwellings have even surpassed previous potentials following GD measures.
ggplot(GD_comp2, aes(x=-(A_Diff_EE))) +
geom_histogram(color= "black", fill="#67E767", alpha=.6, binwidth=10) +
xlab("Difference in Energy Efficiency") +
ylab("Number of Dwellings") +
theme_bw()
This chart offers a good point of comparison for the previous one. It shows that the gap between the current energy efficiency preceding GD measures and the respective potential was generally bigger than following GD implementation. It may be interesting to look into changes in Energy Efficiency for individual dwellings, especially at the lower end of the spectrum to test the magnitude of changes and improvements.
ggplot(GD_comp2, aes(x=Diff_ADE_FDE)) +
geom_histogram(color= "black", fill="#67E767", alpha=.6, binwidth=10) +
xlab("Difference in Energy Efficiency") +
ylab("Number of Dwellings") +
theme_bw()
This final graph visualises the difference between the current-potential gaps in Energy Efficiency before and after GD implementation. It shows that the proposed potentials after dwellings have taken advantage of GD measures are often similarly distant to their respective current Energy Efficiency values, as the before GD implementation. However, the current-potential gap is generally smaller following the GD than before it.