Data binning is a technique that preserves the granularity of data, keeping individual-level records, whilst reducing the ability to readily identify individuals. It recognises that certain information, such as age and town of birth, can act as quasi-identifiers. Data binning replaces a specific field, such as age, with a reference to a ‘bucket’ or ‘bin’, such as an age range. This reduces the risk that the information can be used to identify individuals.
Typically, the technique is focussed on reducing the probability of a risk occurring, assuming the risk is associated with reidentification. The impact of the risk, should it occur, is unaffected.
This is a useful technique if the value of the dataset resides in the ability to analyse the features of individual entries, rather than in descriptive statistics that might be released in a summarised or aggregated dataset. If the individual fields are otherwise too risky to release because they comprise personal or other sensitive data, a binned version can provide a means of reducing the risk of identifying an individual.
Data binning is a relatively simple and understandable technique that retains much of the value in the raw data. The considerations set out on this page help to support a safe release.
Examples of data binning methods might include:
| Specific data | Data Bin | Example |
|---|---|---|
| Age | Age Range | 15-19, 20-24 |
| Date of birth | Year or decade of birth | 1950s, 1960s |
| Full postcode | Postcode outward code | HX7 |
| Full postcode | Administrative region | Calderdale |
| Full postcode | Census geography | MSOA |
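The first three rows of the table can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the function names, the 5-year default width and the sample record are all assumptions made for the example, and the postcode helper assumes a well-formed UK postcode. Mapping a postcode to an administrative region or census geography needs an external lookup table and is not shown.

```python
from datetime import date


def age_band(age: int, width: int = 5) -> str:
    """Map an exact age to a band label, e.g. 17 -> '15-19'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def decade_of_birth(dob: date) -> str:
    """Map a date of birth to its decade, e.g. 1958-03-14 -> '1950s'."""
    return f"{(dob.year // 10) * 10}s"


def outward_code(postcode: str) -> str:
    """Keep only the outward code of a UK postcode, e.g. 'HX7 8AB' -> 'HX7'."""
    cleaned = postcode.replace(" ", "").upper()
    # Assumption: the postcode is well formed, so the inward code
    # is always the final three characters.
    if not 5 <= len(cleaned) <= 7:
        raise ValueError(f"unexpected postcode length: {postcode!r}")
    return cleaned[:-3]


# Invented record, for illustration only.
record = {"age": 17, "date_of_birth": date(2006, 3, 14), "postcode": "HX7 8AB"}

binned = {
    "age_band": age_band(record["age"]),
    "decade_of_birth": decade_of_birth(record["date_of_birth"]),
    "postcode_outward": outward_code(record["postcode"]),
}
print(binned)
```

Each record stays in the dataset as an individual row; only the precision of the quasi-identifying fields is reduced.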
When selecting data bins, be aware that related datasets might use different bins, making it difficult to compare between datasets. ONS data, for example, uses 5-year age bands (0-4, 5-9, etc.). Selecting incompatible age bands (16-21, 22-26, etc.) will make the datasets less valuable.
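One way to check compatibility before release is to test whether every proposed band boundary falls on a boundary of the reference scheme. The sketch below assumes the reference bands start at 0 and have a uniform width (5 years here, matching the ONS example); the function name is hypothetical.

```python
def misaligned_edges(lower_edges, reference_width=5):
    """Return the lower band edges that do not sit on a reference band boundary.

    Assumption: reference bands start at 0 and share one uniform width.
    """
    return [edge for edge in lower_edges if edge % reference_width != 0]


# Bands 16-21, 22-26, ... have lower edges 16, 22, 27: none align with 5-year bands.
print(misaligned_edges([16, 22, 27]))  # [16, 22, 27]

# Bands 15-19, 20-24, 25-29 match the reference scheme exactly.
print(misaligned_edges([15, 20, 25]))  # []

# Ten-year bands are coarser but still comparable, because each one
# is a union of whole 5-year reference bands.
print(misaligned_edges([0, 10, 20]))  # []
```

An empty result means the proposed bands can be compared with the reference data, either directly or by merging reference bands.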
The threat of reidentification is significant where small populations fit a given categorisation. For example, people living in a sparsely populated rural geographic area (e.g. a postcode) might still be identifiable if only a small number of people fall within a given age band. If there are other identifying characteristics, it can be possible to identify individuals in the dataset. Under these circumstances, redaction or aggregation might also be required.
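A simple way to find such small populations is to count how many records share each combination of binned quasi-identifiers and flag any combination below a chosen minimum. The sketch below is illustrative only: the threshold `k=5`, the field names and the records are assumptions for the example, and an appropriate threshold depends on the dataset and its release context.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k=5):
    """Return each quasi-identifier combination shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Invented binned records: six people aged 20-24 and one aged 85-89 in the same area.
records = (
    [{"postcode_outward": "HX7", "age_band": "20-24"}] * 6
    + [{"postcode_outward": "HX7", "age_band": "85-89"}]
)

risky = small_cells(records, ["postcode_outward", "age_band"], k=5)
print(risky)  # {('HX7', '85-89'): 1}
```

Combinations returned by this check are candidates for redaction, wider bins or aggregation before release.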
This page is derived in part or in whole from the Energy Networks Association (ENA) Data Triage Playbook.