k-anonymity
k-anonymity is a formal data protection model that allows statements to be made about anonymized data records.
A data release provides k-anonymity if the identifying information of each individual is indistinguishable from that of at least k-1 other individuals, which makes it harder to correctly link an individual to the associated sensitive attributes. The letter k is a parameter that is replaced by a natural number in each concrete case. A larger k means greater anonymity.
The concept was published in 2002 by Latanya Sweeney, professor at Harvard University, with the aim of making scientific data public while ensuring that the individuals the data is about cannot be re-identified and that the data remains useful for its intended purposes. k-anonymity is thus a compromise between a higher level of data protection on the one hand and a loss of data accuracy on the other.
Explanation
In the context of k-anonymity, a database is understood to be a table with n rows and m columns. Each row represents a (not necessarily unique) record that belongs to a specific individual. The values in the various columns are the values of the attributes that correspond to the individuals.
The attributes can be divided into identifiers, quasi-identifiers and sensitive attributes. Identifiers, such as ID numbers or matriculation numbers, identify an individual uniquely. Quasi-identifiers are attributes that do not allow identification on their own but that, in combination with generally accessible data, allow a unique assignment. Sensitive attributes contain personal information that is worth protecting, such as illnesses or salary information. The exact value of an individual's sensitive attribute should therefore not be disclosed.
Irrespective of the concept of k-anonymity, anonymization can be achieved by various means, for example by adding noise, suppressing information or generalizing data.
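As a rough sketch of two of these techniques, the Python snippet below generalizes an age into a fixed-width range, truncates a postcode to a prefix, and suppresses a name. The function names, the bin width of 10 and the two-digit prefix are assumptions chosen for illustration. They are not prescribed by k-anonymity, and the record is made up.

```python
def generalize_age(age, width=10):
    """Replace an exact age by the fixed-width range that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_postcode(postcode, keep=2):
    """Keep the first `keep` characters of the postcode and mask the rest."""
    return postcode[:keep] + "*"


def suppress(_value):
    """Suppress a value entirely."""
    return "*"


# Illustrative record (hypothetical data).
record = {"name": "Anna", "age": 21, "gender": "Female",
          "postcode": "76189", "illness": "flu"}

anonymized = {
    "name": suppress(record["name"]),                      # identifier: suppressed
    "age": generalize_age(record["age"]),                  # quasi-identifier: generalized
    "gender": record["gender"],                            # quasi-identifier: kept as-is
    "postcode": generalize_postcode(record["postcode"]),   # quasi-identifier: generalized
    "illness": record["illness"],                          # sensitive attribute: unchanged
}
print(anonymized)
```

Generalization alone does not guarantee any particular k. Whether the result is k-anonymous depends on how many records end up sharing the same generalized values.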
Illustration
The following table is a non-anonymized database consisting of patient data from a fictitious hospital.
Name (identifier) | Age (quasi-identifier) | Gender (quasi-identifier) | Postcode (quasi-identifier) | Illness (sensitive attribute) |
---|---|---|---|---|
Anna | 21 | Female | 76189 | flu |
Louis | 35 | Male | 77021 | cancer |
Holger | 39 | Male | 63092 | hair loss |
Frederic | 23 | Male | 63331 | muscle strain |
Anika | 24 | Female | 76121 | flu |
Peter | 31 | Male | 77462 | poisoning |
Tobias | 38 | Male | 77109 | dementia |
Charlotte | 19 | Female | 83133 | caries |
Sarah | 27 | Female | 89777 | acne |
The next table results from anonymization using generalization:
Name (identifier) | Age (quasi-identifier) | Gender (quasi-identifier) | Postcode (quasi-identifier) | Illness (sensitive attribute) |
---|---|---|---|---|
* | 20 < age < 25 | Female | 76* | flu |
* | 30 < age < 40 | Male | 77* | cancer |
* | 20 < age < 40 | Male | 63* | hair loss |
* | 20 < age < 40 | Male | 63* | muscle strain |
* | 20 < age < 25 | Female | 76* | flu |
* | 30 < age < 40 | Male | 77* | poisoning |
* | 30 < age < 40 | Male | 77* | dementia |
* | 18 < age < 28 | Female | 8* | caries |
* | 18 < age < 28 | Female | 8* | acne |
Grouping the rows by their quasi-identifier values yields four equivalence classes:
Equivalence class | Name (identifier) | Age (quasi-identifier) | Gender (quasi-identifier) | Postcode (quasi-identifier) | Illness (sensitive attribute) |
---|---|---|---|---|---|
A | * | 20 < age < 25 | Female | 76* | flu |
A | * | 20 < age < 25 | Female | 76* | flu |
B | * | 30 < age < 40 | Male | 77* | cancer |
B | * | 30 < age < 40 | Male | 77* | poisoning |
B | * | 30 < age < 40 | Male | 77* | dementia |
C | * | 20 < age < 40 | Male | 63* | hair loss |
C | * | 20 < age < 40 | Male | 63* | muscle strain |
D | * | 18 < age < 28 | Female | 8* | caries |
D | * | 18 < age < 28 | Female | 8* | acne |
Each equivalence class contains at least 2 records, so 2-anonymity is guaranteed. Note that in equivalence class A the sensitive attribute values also match, while this is not the case in the other equivalence classes. k-anonymity makes no statement about the distribution of the sensitive attribute values (see the section on the homogeneity attack).
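The grouping used in this illustration can be reproduced mechanically. The following Python sketch groups the generalized rows by their quasi-identifier values and reports the size of the smallest group, which is the largest k for which the table is k-anonymous. The names `equivalence_classes` and `k_anonymity` are hypothetical helpers introduced here, not part of any library.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("age", "gender", "postcode")
COLUMNS = ("age", "gender", "postcode", "illness")

# Generalized rows from the illustration (the suppressed name column is omitted).
RAW = [
    ("20 < age < 25", "Female", "76*", "flu"),
    ("30 < age < 40", "Male", "77*", "cancer"),
    ("20 < age < 40", "Male", "63*", "hair loss"),
    ("20 < age < 40", "Male", "63*", "muscle strain"),
    ("20 < age < 25", "Female", "76*", "flu"),
    ("30 < age < 40", "Male", "77*", "poisoning"),
    ("30 < age < 40", "Male", "77*", "dementia"),
    ("18 < age < 28", "Female", "8*", "caries"),
    ("18 < age < 28", "Female", "8*", "acne"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in RAW]


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same values in all quasi-identifier columns."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row)
    return dict(classes)


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest equivalence class (0 for an empty table)."""
    classes = equivalence_classes(rows, quasi_identifiers)
    if not classes:
        return 0
    return min(len(members) for members in classes.values())


classes = equivalence_classes(ROWS, QUASI_IDENTIFIERS)
print(len(classes))                          # 4
print(k_anonymity(ROWS, QUASI_IDENTIFIERS))  # 2
```

The check only looks at the quasi-identifier columns. The illness column plays no role in the result, which is why k-anonymity says nothing about how the sensitive values are distributed.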
Shortcomings
The concept of k-anonymity has known shortcomings that can allow deanonymization: individuals in a k-anonymous table may still be uniquely identifiable. Two of these shortcomings are explained in more detail below.
Homogeneity attack
The homogeneity attack takes advantage of the fact that all k records of an equivalence class may have identical sensitive attribute values. If the attacker knows that a person is contained in a database and can assign this person to the correct equivalence class, the attacker learns that person's sensitive attribute.
Illustration
Alice is a very nosy neighbor of Bob. When Bob is picked up by an ambulance one day, Alice wants to find out what Bob is suffering from. She discovers the 4-anonymous table with current patient data published by the hospital. She knows that Bob must be in the table and knows his age, gender, and zip code. From this she concludes that his record must be in equivalence class C. Since all patients in this equivalence class suffer from the same disease, Alice learns Bob's disease.
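Equivalence class | Name (identifier) | Age (quasi-identifier) | Gender (quasi-identifier) | Postcode (quasi-identifier) | Illness (sensitive attribute) |
---|---|---|---|---|---|
B | * | 25 < age < 30 | Female | 13* | ... |
B | * | 25 < age < 30 | Female | 13* | heart disease |
C | * | 40 < age < 50 | Male | 13* | cancer |
C | * | 40 < age < 50 | Male | 13* | cancer |
C | * | 40 < age < 50 | Male | 13* | cancer |
C | * | 40 < age < 50 | Male | 13* | cancer |
D | * | 20 < age < 35 | Female | 12* | flu |
D | * | 20 < age < 35 | Female | 12* | ... |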
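A data publisher can test for this weakness before release. The sketch below, with a hypothetical helper `homogeneous_classes`, reports every equivalence class whose records all share the same sensitive value. Class C is taken from the table above. The second class is invented for contrast because the other classes in the table are only partially shown.

```python
from collections import defaultdict


def homogeneous_classes(rows, quasi_identifiers, sensitive):
    """Map each equivalence class with a single sensitive value to that value."""
    values = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values[key].add(row[sensitive])
    return {key: next(iter(vals)) for key, vals in values.items() if len(vals) == 1}


QUASI_IDENTIFIERS = ("age", "gender", "postcode")

# Class C from the table: four records, all with the same illness.
class_c = [
    {"age": "40 < age < 50", "gender": "Male", "postcode": "13*", "illness": "cancer"}
    for _ in range(4)
]
# Hypothetical mixed class, invented for contrast (not from the table).
mixed = [
    {"age": "50 < age < 60", "gender": "Male", "postcode": "14*", "illness": illness}
    for illness in ("flu", "heart disease", "flu", "acne")
]

exposed = homogeneous_classes(class_c + mixed, QUASI_IDENTIFIERS, "illness")

# Alice only needs Bob's quasi-identifier values to look up his class.
bob = ("40 < age < 50", "Male", "13*")
print(exposed.get(bob))  # cancer
```

Both classes have four records, so this toy table is 4-anonymous, yet the first class discloses its sensitive value to anyone who can place a person in it.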
Background knowledge attack
With additional knowledge, it may be possible to assign people uniquely despite k-anonymity. If the attacker knows that a person is contained in a database and can assign this person to the correct equivalence class, the attacker may be able to use the additional knowledge to rule out certain sensitive attribute values for that person.
Illustration
Alice has a pen pal named Yui who is admitted to a hospital and whose patient data is contained in a 4-anonymous table that the hospital regularly publishes. Alice knows that Yui is a 21-year-old Japanese woman who is currently registered under the zip code 12345. Based on this information, Alice can conclude that Yui's record must be in equivalence class B. Without additional information, Alice cannot be sure whether Yui has a viral disease or heart disease. However, it is well known that heart disease is very rare among Japanese people. This allows Alice to conclude that Yui has a viral disease.
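Equivalence class | Name (identifier) | Age (quasi-identifier) | Gender (quasi-identifier) | Postcode (quasi-identifier) | Illness (sensitive attribute) |
---|---|---|---|---|---|
A | * | 30 < age < 35 | Male | 14* | ... |
A | * | 30 < age < 35 | Male | 14* | flu |
B | * | 20 < age < 30 | Female | 12* | heart disease |
B | * | 20 < age < 30 | Female | 12* | viral disease |
B | * | 20 < age < 30 | Female | 12* | viral disease |
B | * | 20 < age < 30 | Female | 12* | heart disease |
C | * | 30 < age < 35 | Female | 12* | cancer |
C | * | 30 < age < 35 | Female | 12* | ... |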
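Alice's reasoning amounts to a set difference: the values occurring in Yui's equivalence class minus the values her background knowledge rules out. The sketch below uses a hypothetical helper `remaining_candidates` and treats the background knowledge as certain, which is a simplification, since "very rarely" is only a probabilistic statement.

```python
def remaining_candidates(class_values, ruled_out):
    """Sensitive values in the class that the attacker cannot exclude."""
    return set(class_values) - set(ruled_out)


# Sensitive values of equivalence class B in the table above.
class_b = ["heart disease", "viral disease", "viral disease", "heart disease"]

# Without background knowledge, two candidates remain.
print(remaining_candidates(class_b, ruled_out=[]))

# With Alice's (simplified) background knowledge, only one candidate is left.
candidates = remaining_candidates(class_b, ruled_out=["heart disease"])
print(candidates)  # {'viral disease'}
```

The attack succeeds whenever the remaining set has exactly one element, regardless of how large k is.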
Extensions
To remedy these shortcomings of k-anonymity, the extensions l-diversity and t-closeness were designed on top of it. l-diversity addresses in particular the weakness against homogeneity attacks by ensuring a certain degree of variety among the sensitive attribute values in each equivalence class. t-closeness extends the concept so that the distribution of the sensitive attribute values in each equivalence class corresponds as closely as possible to their distribution in the entire table.
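Both ideas can be approximated with a few lines of Python. The sketch below is a simplification under stated assumptions: `distinct_l` counts distinct sensitive values per class, which is only the simplest reading of l-diversity, and `max_distance` uses the total variation distance between a class and the whole table as a stand-in for the distance measure of t-closeness, which the original paper defines via the Earth Mover's Distance. The function names and the toy data are hypothetical.

```python
from collections import Counter


def distinct_l(groups):
    """Smallest number of distinct sensitive values found in any class."""
    return min(len(set(values)) for values in groups.values())


def max_distance(groups):
    """Largest total variation distance between a class and the whole table."""
    all_values = [v for values in groups.values() for v in values]
    overall = Counter(all_values)
    n = len(all_values)
    worst = 0.0
    for values in groups.values():
        local = Counter(values)
        # Half the L1 distance between the class and table distributions.
        dist = 0.5 * sum(abs(local[v] / len(values) - overall[v] / n) for v in overall)
        worst = max(worst, dist)
    return worst


# Toy data: sensitive values per equivalence class (illustrative only).
groups = {
    "B": ["heart disease", "viral disease", "viral disease", "heart disease"],
    "C": ["cancer", "cancer", "cancer", "cancer"],
}

print(distinct_l(groups))    # 1
print(max_distance(groups))  # 0.5
```

In this toy table class C contains a single distinct value, so the table is only 1-diverse in this simple sense, which is exactly the situation the homogeneity attack exploits. A table would satisfy the simplified closeness check for a threshold t if `max_distance(groups) <= t`.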
References
- Latanya Sweeney: k-anonymity: A model for protecting privacy. In: International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 10, Issue 5, World Scientific, 2002, pp. 557–570.
- Zhen Li, Xiaojun Ye: Privacy protection on multiple sensitive attributes. In: Information and Communications Security, Vol. 1, Springer Berlin Heidelberg, 2007, pp. 141–152.
- Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, Muthuramakrishnan Venkitasubramaniam: l-diversity: Privacy beyond k-anonymity. In: ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 1, ACM, 2007.
- Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian: t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In: ICDE, Vol. 7, 2007, pp. 106–115.