Data Techniques
This page contains a catalogue of data techniques.
Common Techniques
- Access Control
Introducing access control means that you are restricting who can see and use a given data set. As such, you are no longer publishing Open Data. Instead, you are either publishing Shared Data, creating steps to authorise and authenticate suitable users, or negotiating bespoke one-to-one contracts between yourself and data reusers before providing access to the data.
- Aggregation
Aggregation is one process of de-identification and refers to the process of grouping together raw data and providing this in a summarised form. Aggregation can take place in several ways including:
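As one illustration, grouping raw records by a category and publishing only a count and an average per group could be sketched as follows. The function name `aggregate_by` and the field names `postcode_area` and `kwh` are hypothetical and chosen only for this example; this is a minimal sketch, not a complete de-identification process.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by(records, group_key, value_key):
    """Group raw records by group_key and return a count and mean per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {
        group: {"count": len(values), "mean": mean(values)}
        for group, values in groups.items()
    }


# Hypothetical raw data: one row per household.
raw = [
    {"postcode_area": "LS", "kwh": 10.0},
    {"postcode_area": "LS", "kwh": 14.0},
    {"postcode_area": "YO", "kwh": 9.0},
    {"postcode_area": "YO", "kwh": 11.0},
]

# Only the summary is published, not the individual rows.
summary = aggregate_by(raw, "postcode_area", "kwh")
```

Groups that contain very few records may still point to individuals, so a real release would likely need a minimum group size as well.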
- Data Binning
Data Binning is a technique which can preserve the granularity of data (keeping individual-level records) whilst reducing the ability to readily identify individuals. It recognises that certain information, such as age and town of birth, can be used as quasi-identifiers. Data binning replaces a specific field, such as age, with a reference to a 'bucket' or 'bin', such as an age range. This can reduce the risk of the information being used to identify individuals.
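The age example above could be sketched like this. The helper names `age_to_bin` and `bin_ages`, and the default bin width of 10 years, are assumptions made for illustration.

```python
def age_to_bin(age, width=10):
    """Return the label of the fixed-width bin that contains age."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def bin_ages(records, width=10):
    """Return copies of the records with the exact age replaced by its bin."""
    return [{**record, "age": age_to_bin(record["age"], width)} for record in records]


# Hypothetical individual-level records.
people = [{"id": 1, "age": 34}, {"id": 2, "age": 40}]
binned = bin_ages(people)
```

Each record is still present after binning, but the exact age is no longer available as a quasi-identifier.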
- Delayed Publication
Data released in real time can compromise the security of individuals or of an organisation. For instance, it could be used to track a person in real time, to show when a house is empty, or to show when capacity is reached in a network. This can allow those with harmful intent to follow an individual's or organisation's activity in real time and find opportune moments to inflict harm. Delaying publication, so that data is released only after a suitable interval, reduces this risk.
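A minimal sketch of a publication delay, assuming each record carries a `timestamp` field and that a 24-hour embargo is appropriate. Both the field name and the delay are hypothetical; the right interval depends on the data set.

```python
from datetime import datetime, timedelta, timezone


def publishable(records, now, delay=timedelta(hours=24)):
    """Return only the records that are at least `delay` old at time `now`."""
    cutoff = now - delay
    return [record for record in records if record["timestamp"] <= cutoff]


# Hypothetical readings, one older and one newer than the embargo period.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
readings = [
    {"meter": "A", "timestamp": datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)},
    {"meter": "A", "timestamp": datetime(2024, 1, 2, 11, 0, tzinfo=timezone.utc)},
]
released = publishable(readings, now)
```

Only the older reading is released; the newer one becomes publishable once the delay has passed.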
- Redaction
Redaction is the technique where certain features of the dataset are removed or made otherwise unreadable (for example, by replacing them with inert dummy data). This could apply to a field or fields across all records, or involve the removal of complete records.
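Both forms of redaction described above could be sketched as follows. The function names `redact_fields` and `drop_records`, the placeholder value and the sample field names are assumptions for illustration only.

```python
def redact_fields(records, fields, placeholder="REDACTED"):
    """Replace the named fields with inert dummy data in every record."""
    return [
        {key: (placeholder if key in fields else value) for key, value in record.items()}
        for record in records
    ]


def drop_records(records, should_drop):
    """Remove complete records for which should_drop(record) is true."""
    return [record for record in records if not should_drop(record)]


# Hypothetical customer records.
customers = [
    {"name": "A. Smith", "postcode": "LS1 1AA", "vulnerable": False},
    {"name": "B. Jones", "postcode": "YO1 2BB", "vulnerable": True},
]

# Remove whole records first, then redact a field across the remainder.
kept = drop_records(customers, lambda record: record["vulnerable"])
redacted = redact_fields(kept, {"name"})
```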
- Restrictive Licensing
A licence tells a data user what they are able to do with a given data set once they have access to it. If relevant permissions are not granted through a licence, a data user is often legally not allowed to download or otherwise copy data, combine that data with other datasets, use the data to generate maps, or use the data to help inform internal business conversations.
Less Common Techniques
- Anonymisation
Anonymisation is the process of removing personal identifiers, both direct and indirect, that may lead to an individual being identified.
- Feature Extraction
Extracting or generating new features from the data, which hide the private data and are published in its place.
- Obfuscation
The process of hiding original data by replacing it with modified content.
- Pseudonymisation
Replacing identifying features with a unique identifier that retains the reference to an individual whilst breaking the link with the 'real world' identity.
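One possible way to produce such an identifier is a keyed hash, sketched below. This is an assumption about implementation, not a method prescribed by this page; the function name `pseudonymise` and the sample key are hypothetical, and the key must be kept secret by the data holder.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Map an identifier to a stable pseudonym using HMAC-SHA256.

    The same identifier always yields the same pseudonym, so records about
    one individual can still be linked, but the original value cannot be
    read from the output without the secret key.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key, for illustration only.
key = b"example-key-kept-by-the-data-holder"
first = pseudonymise("alice@example.com", key)
again = pseudonymise("alice@example.com", key)
other = pseudonymise("bob@example.com", key)
```

A keyed hash is used here rather than a plain hash so that someone who knows a likely identifier cannot simply recompute the pseudonym.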
- Randomness
A group of techniques which alter the data by adding randomness.
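A minimal sketch of one member of this group, assuming numeric values and bounded uniform noise. The function name `add_noise`, the `scale` parameter and the sample values are hypothetical; other noise distributions are equally possible.

```python
import random


def add_noise(values, scale, seed=None):
    """Return the values with uniform random noise in [-scale, scale] added."""
    rng = random.Random(seed)  # a seed makes the example repeatable
    return [value + rng.uniform(-scale, scale) for value in values]


# Hypothetical readings, perturbed by at most 0.5 units each.
readings = [10.0, 12.5, 9.8]
noisy = add_noise(readings, scale=0.5, seed=1)
```

A larger `scale` hides individual values more effectively but makes the published data less accurate.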
- Synthetic Data
Generating a dataset which has the same properties as a real dataset, but contains no real-world data.
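As a deliberately simple sketch, the example below fits a normal distribution to one numeric column and samples new values from it. It preserves only the mean and standard deviation, which is an assumption about which properties matter; the function name `synthesise` and the sample ages are hypothetical.

```python
import random
from statistics import mean, stdev


def synthesise(real_values, n, seed=None):
    """Sample n synthetic values from a normal distribution fitted to real_values."""
    rng = random.Random(seed)  # a seed makes the example repeatable
    mu = mean(real_values)
    sigma = stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical real ages; the synthetic values are drawn, not copied.
real_ages = [20, 25, 30, 35, 40]
synthetic_ages = synthesise(real_ages, n=5000, seed=0)
```

Realistic synthetic data generators also need to preserve relationships between columns, which this sketch does not attempt.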