Randomness

A group of techniques which alter the data by adding randomness.

One way to scramble a dataset is to add noise, which can be applied to most forms of data. For image data, noise can be added to each pixel to obscure the picture; for a time series, adding noise can make distinguishing features harder to detect. In energy systems, one use case is adding noise to smart meter data to hide particular behaviours.

There is a trade-off between the variability of the noise added and the utility of the dataset. High variability may completely hide private information, but it may also hide any useful features of the data. In contrast, low levels of noise may leave private information exposed, but the data retain more utility for any algorithms applied to them.
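The trade-off described above can be illustrated with a small sketch. The load profile below is synthetic, and the noise levels, the RMSE measure and the "peak index" check are assumptions chosen for illustration rather than recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily load profile (48 half-hourly values): a flat baseline
# plus a short evening peak standing in for a "private" behaviour.
t = np.arange(48)
profile = 0.3 + 1.2 * np.exp(-0.5 * (t - 36) ** 2)

def rmse(a, b):
    """Root mean squared error between two arrays of the same shape."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

results = {}
for noise_std in (0.05, 0.5, 2.0):
    # Zero-mean Gaussian noise with the chosen standard deviation.
    noisy = profile + rng.normal(0.0, noise_std, size=profile.shape)
    results[noise_std] = {
        "rmse": rmse(noisy, profile),          # loss of utility
        "peak_index": int(np.argmax(noisy)),   # where the peak appears to be
    }

for noise_std, r in results.items():
    print(f"std={noise_std}: RMSE={r['rmse']:.2f}, peak at index {r['peak_index']}")
```

As the standard deviation grows, the error against the original profile grows with it, and the position of the evening peak becomes less reliable to read off the noisy data.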

Since generating random data is a core functionality of most programming languages, it is relatively easy to generate random noise and add it to a dataset. Choosing the right type and level of noise will depend on the type of data you are using and the level of privacy required. For example, for time series data, simple white noise may be suitable. The standard deviation (a measure of spread) of the noise could then be chosen based on the components that the user wishes to conceal.
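As a minimal sketch of the white noise approach, the following Python example uses NumPy to add zero-mean Gaussian noise to a short series. The function name add_white_noise, the readings and the chosen standard deviation are all hypothetical and only serve as an illustration.

```python
import numpy as np

def add_white_noise(values, noise_std, seed=None):
    """Return a copy of `values` with zero-mean Gaussian white noise added."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # One independent noise sample per data point.
    noise = rng.normal(loc=0.0, scale=noise_std, size=values.shape)
    return values + noise

# Hypothetical half-hourly smart meter readings in kWh (made-up values).
readings = np.array([0.21, 0.19, 0.20, 0.85, 1.40, 0.90, 0.25, 0.22])
noisy_readings = add_white_noise(readings, noise_std=0.3, seed=42)
print(noisy_readings)
```

Note that noise added in this way can push values outside their physical range (for example, negative consumption), so some post-processing may be needed depending on how the data will be used.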

Differential Privacy

Differential privacy is a more sophisticated method for adding randomness to a dataset.

Differential privacy allows information to be collected without revealing private data. Even anonymised data can be used to identify individuals via linkage attacks, where the anonymised data are combined with other non-private data to re-identify individuals.

Differential privacy is designed to protect against linkage attacks while still allowing aggregate insights and statistics to be produced without revealing individuals' information. The goal is to ensure that removing any one individual from the dataset does not noticeably change the results of statistical functions run on the data.

The amount of randomness inserted should depend on the size of the dataset: the smaller the dataset, the more randomness is needed to conceal identities. If it is known how the randomness was added, statistical estimates can still be made that accurately represent the original data [4]. Implementing differential privacy correctly is relatively complex, so it is often better to use open source libraries rather than developing your own implementation.
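To illustrate how the amount of randomness relates to the size of the dataset, here is a minimal sketch of one common differential privacy technique, the Laplace mechanism, applied to a mean. It is not taken from any of the libraries mentioned in this section. The function name private_mean, the value bounds and the privacy parameter epsilon are assumptions for illustration, and the sketch assumes that each individual contributes one value and that the dataset size is public. For real use, a maintained library is the safer choice.

```python
import numpy as np

def private_mean(values, lower, upper, epsilon, rng=None):
    """Estimate the mean of `values` using the Laplace mechanism.

    Assumptions: one value per individual, the dataset size is public,
    and values are clipped to the range [lower, upper].
    """
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    n = clipped.size
    # Changing one individual's value moves the mean by at most this much.
    sensitivity = (upper - lower) / n
    # Smaller datasets (or smaller epsilon) lead to a larger noise scale.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)

rng = np.random.default_rng(1)
small = rng.uniform(0.0, 5.0, size=20)       # made-up data, 20 individuals
large = rng.uniform(0.0, 5.0, size=20_000)   # made-up data, 20,000 individuals
for data in (small, large):
    estimate = private_mean(data, lower=0.0, upper=5.0, epsilon=0.5, rng=rng)
    print(f"n={data.size}: true mean={data.mean():.3f}, private estimate={estimate:.3f}")
```

With the same epsilon, the noise scale for the small dataset is 1,000 times larger than for the large one, which reflects the point above that smaller datasets need more randomness to conceal individuals.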

Google use differential privacy libraries when collecting data for, e.g., their maps tools, which show how busy various businesses are during the day, and have made some of these libraries available. Similarly, IBM have made some differential privacy mechanisms and models available.

Scenarios

TKTKTK EXAMPLE SCENARIOS

References

Various packages can generate random noise. For example, in Python both Gaussian random numbers and random categorical values can be generated with the NumPy package: https://numpy.org/.

In R, similar functions exist, including rnorm (Gaussian distribution): https://www.rdocumentation.org/packages/compositions/versions/1.40-5/topics/rnorm, and sample (random draws from a set of values, such as positive integers): https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sample.

Google have developed an open source set of libraries for various differential privacy algorithms: https://github.com/google/differential-privacy

IBM differential privacy tool (based on Python): https://github.com/IBM/differential-privacy-library