Safe data looks at identifying and managing data sensitivity and detail, and how data should be handled.
Examples of questions/issues being addressed:
- Is there sufficient detail to allow the project to go ahead?
- Is there excessive data that is not necessary for the project?
Input statistical disclosure control (SDC; sometimes called de-identification or anonymisation) is crucial in maintaining safe data. It is a set of methods used to protect confidential data, ensuring that organisations and individuals cannot be identified from aggregated or released data. The aim is to balance data utility with privacy protection.
This branch of statistics is well understood, and has been developing effective and uncontroversial guidance for over forty years. The techniques used include:
- Suppression: Removing sensitive or identifying data points
- Aggregation: Publishing data in broader categories (e.g. age ranges instead of exact ages)
- Noise addition: Introducing small random variations to mask exact values
- Swapping: Exchanging values between records to prevent re-identification
- Rounding: Adjusting numbers to less precise values
- Sampling and sub-setting: Releasing only a sample of the data to reduce re-identification risk
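As an illustration only, a few of the techniques above (aggregation, suppression, rounding and noise addition) might be sketched in Python as follows. The function names, the band width of 10, the threshold of 10 and the rounding base of 5 are all assumptions chosen for the example, not recommended values; real thresholds should come from the data owner's disclosure control policy.

```python
import random
from collections import Counter


def age_band(age, width=10):
    """Aggregation: map an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress_small(count, threshold=10):
    """Suppression: hide counts below an (assumed) minimum threshold."""
    return None if count < threshold else count


def round_to_base(count, base=5):
    """Rounding: round a non-negative count to the nearest multiple of `base`."""
    return base * ((count + base // 2) // base)


def add_noise(value, scale=1.0, rng=None):
    """Noise addition: perturb a value by a small uniform random amount."""
    rng = rng or random.Random()
    return value + rng.uniform(-scale, scale)


def safe_frequency_table(ages, width=10, threshold=10, base=5):
    """Band the ages, then suppress small cells and round the rest."""
    counts = Counter(age_band(a, width) for a in ages)
    table = {}
    for band, n in counts.items():
        kept = suppress_small(n, threshold)
        table[band] = None if kept is None else round_to_base(kept, base)
    return table


if __name__ == "__main__":
    ages = [34] * 12 + [41] * 3
    # The 40-49 band has only 3 people, so it is suppressed (None).
    print(safe_frequency_table(ages))
```

In this sketch a suppressed cell is represented by `None`; a published table would normally show a marker instead, and may need secondary suppression so that hidden cells cannot be recovered from totals.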
The UK Anonymisation Network has created good-practice guidelines on anonymisation, offering advice and information to anyone who handles data and needs to share it. The purpose is to maximise the value of data and minimise the risks to privacy.
Possible ways to achieve ‘safe data’:
- Remove direct identifiers such as names and addresses (where possible), and remove or mask outliers that could single out an individual.
- This is covered in depth at outputchecking.org, particularly on its resources page.
- Use statistical disclosure control (SDC) to apply rounding, noise and other de-identification techniques to outputs, suppressing identifiable information.
- Provide tiered access, such as public-use files versus restricted microdata.
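To make the first and last points more concrete, the following is a minimal Python sketch of removing direct identifiers and serving different views of a record by access tier. The field names (`name`, `address`, `age`, `region`), the tier labels (`public`, `restricted`) and the function names are hypothetical and exist only for this example; which fields count as identifiers, and what each tier may see, has to be decided for each dataset.

```python
# Assumed set of direct-identifier fields; adjust per dataset.
DIRECT_IDENTIFIERS = {"name", "address"}


def drop_direct_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without direct-identifier fields."""
    return {k: v for k, v in record.items() if k not in identifiers}


def public_use_view(record, width=10):
    """Public tier: no direct identifiers, and exact age replaced by a band."""
    out = drop_direct_identifiers(record)
    if "age" in out:
        low = (out["age"] // width) * width
        out["age_band"] = f"{low}-{low + width - 1}"
        del out["age"]
    return out


def release(record, tier):
    """Return the view of a record appropriate to the (assumed) access tier."""
    if tier == "public":
        return public_use_view(record)
    if tier == "restricted":
        # Restricted microdata keeps detail but still drops direct identifiers.
        return drop_direct_identifiers(record)
    raise ValueError(f"unknown tier: {tier!r}")


if __name__ == "__main__":
    row = {"name": "A. Person", "address": "1 High St", "age": 34, "region": "North"}
    print(release(row, "public"))
    print(release(row, "restricted"))
```

The design choice here is that both tiers start from the same de-identified record, with the public tier applying additional coarsening, so the restricted view never contains less protection than the rules for direct identifiers require.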