Aggregation

Aggregation is one process of de-identification and refers to the process of grouping together raw data and providing this in a summarised form. Aggregation can take place in several ways including:

Datasets at the individual-level typically contain the greatest level of detail and are most valuable for analysis, but the goal of maximizing the value of the data must be balanced with the need for privacy and security.

Underlying assumptions

Many datasets contain personal information or sensitive data that cannot be published or shared with others in a raw form. By aggregating data, you are reducing the risk of individuals, single organisations, or instances being identified. This creates some loss in accuracy, but the loss in accuracy is made up for by:

The goal is to reduce the risks for identification while still allowing overall patterns to be identified by data re-users.

Key considerations and risks

Reidentification

The primary concern is the risk of re-identification from an aggregated dataset. Even after a dataset has been aggregated and direct identifiers such as name or address removed, it may be possible to identify individuals and organisations.

This could be through combining aggregated data with other datasets, or it could be that there are only a small number of individuals or organisations represented within categorisations of aggregated data. There are over 55,000 postcodes in England and Wales that contain only one household, and data released at the postcode level without considering these instances would inadvertently reveal information about those individual households. To mitigate this, it might be possible to publish data at a postcode level, and where there are single household postcodes, include data relating to these in neighbouring areas.

Granularity and how the data is aggregated

Balanced against the risk of identification is the need to ensure that the data is of sufficient granularity to provide value for the data reuser. Data aggregation removes individual variation and loss of this may impact analysis undertaken. For instance, while there may be no correlation between family income and educational achievement when aggregated at a local authority or city level, there may be a strong relationship between these at an individual level.

Smart meter data is a classic example of the tensions between granularity and re-identification risks. If shared with external parties on a half-hourly basis, there may be many household activities exposed; when adults are at work, when everyone goes to bed, or when an electric vehicle is being charged. These increase security risks to the household.

Aggregating smart meter data conceals these patterns. However, it may not conceal when a household is on holiday and the home is empty. Furthermore, the specific value that a data re-user may be looking for might be in understanding how households are using data on an ongoing basis. A possible solution might be to aggregate energy use across a street, which conceals the behaviour of a single household, although it may make it easy to identify when a given street is mostly empty.

Engage with communities

If the data you are aggregating is personal data, you need to consider whether there are any GDPR implications related to collecting, using, and sharing this data.

Considering how to be transparent with individuals about how you will share data, the codes of practice that will guide the ethical sharing of data, and how individuals might choose to have their data excluded.