k-anonymity


k-anonymity is a formal data protection model that allows statements to be made about anonymized data records.

A data publication offers k-anonymity if the identifying information of every individual is indistinguishable from that of at least k-1 other individuals, which makes a correct link to the associated sensitive attributes more difficult. The letter k denotes a parameter that is replaced by a natural number in a concrete case; a larger k corresponds to greater anonymity.

The concept was published in 2002 by Latanya Sweeney, a professor at Harvard University, with the aim of making scientific data public in such a way that the individuals the data describe cannot be re-identified while the data remain useful for their intended purposes. This amounts to a trade-off between stronger privacy protection on the one hand and a loss of data accuracy on the other.

Explanation

In the context of k-anonymity, a database is understood to be a table with n rows and m columns. Each row represents a (not necessarily unique) record belonging to a specific individual. The values in the columns are the attribute values associated with the individuals.

The individual attributes can be divided into identifiers, quasi-identifiers and sensitive attributes. Identifiers, such as ID numbers or matriculation numbers, uniquely identify individuals. Quasi-identifiers are attributes that, taken by themselves, do not allow identification but that, in combination with generally accessible data, permit a unique assignment. Sensitive attributes contain personal information that is worth protecting, such as illnesses or salary information; the exact value of an individual's sensitive attribute should therefore not be disclosed.

Irrespective of the concept of k-anonymity, anonymization can be achieved by various means, for example by adding noise, suppressing information or generalizing data.
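As a brief, hedged sketch (not part of the original concept description), the following Python snippet illustrates two of these techniques on a single record: suppression of the identifier and generalization of the quasi-identifiers. The field names, bucket size and prefix length are illustrative assumptions.

    def generalize_record(record, age_bucket=10, zip_prefix_len=2):
        """Suppress the identifier and coarsen the quasi-identifiers of one record.
        Assumes `record` is a dict with the keys used in the tables below."""
        low = (record["age"] // age_bucket) * age_bucket            # e.g. 35 -> 30
        return {
            "surname": "*",                                          # suppression
            "age": f"{low}-{low + age_bucket - 1}",                  # generalization to a range
            "gender": record["gender"],
            "postcode": record["postcode"][:zip_prefix_len] + "*",   # generalization to a prefix
            "illness": record["illness"],                            # sensitive attribute is kept
        }

    print(generalize_record(
        {"surname": "Louis", "age": 35, "gender": "Male", "postcode": "77021", "illness": "Cancer"}
    ))
    # {'surname': '*', 'age': '30-39', 'gender': 'Male', 'postcode': '77*', 'illness': 'Cancer'}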

Illustration

The following table is a non-anonymized database consisting of patient data from a fictitious hospital.

Identifier | Quasi-identifiers        | Sensitive attribute
Surname    | Age   Gender   Postcode  | Illness
Anna       | 21    Female   76189     | Flu
Louis      | 35    Male     77021     | Cancer
Holger     | 39    Male     63092     | Hair loss
Frederic   | 23    Male     63331     | Muscle strain
Anika      | 24    Female   76121     | Flu
Peter      | 31    Male     77462     | Poisoning
Tobias     | 38    Male     77109     | Dementia
Charlotte  | 19    Female   83133     | Caries
Sarah      | 27    Female   89777     | Acne

The next table results from anonymization using generalization:

Identifier | Quasi-identifiers                 | Sensitive attribute
Surname    | Age            Gender   Postcode  | Illness
*          | 20 < age < 25  Female   76*       | Flu
*          | 30 < age < 40  Male     77*       | Cancer
*          | 20 < age < 40  Male     63*       | Hair loss
*          | 20 < age < 40  Male     63*       | Muscle strain
*          | 20 < age < 25  Female   76*       | Flu
*          | 30 < age < 40  Male     77*       | Poisoning
*          | 30 < age < 40  Male     77*       | Dementia
*          | 18 < age < 28  Female   8*        | Caries
*          | 18 < age < 28  Female   8*        | Acne

There are 4 equivalence classes:

Equivalence class | Surname | Age            Gender   Postcode | Illness
A                 | *       | 20 < age < 25  Female   76*      | Flu
A                 | *       | 20 < age < 25  Female   76*      | Flu
B                 | *       | 30 < age < 40  Male     77*      | Cancer
B                 | *       | 30 < age < 40  Male     77*      | Poisoning
B                 | *       | 30 < age < 40  Male     77*      | Dementia
C                 | *       | 20 < age < 40  Male     63*      | Hair loss
C                 | *       | 20 < age < 40  Male     63*      | Muscle strain
D                 | *       | 18 < age < 28  Female   8*       | Caries
D                 | *       | 18 < age < 28  Female   8*       | Acne

Each equivalence class contains at least 2 elements, so 2-anonymity is guaranteed. Note that in equivalence class A the values of the sensitive attribute also match, while this is not the case in the other equivalence classes. k-anonymity makes no statement about the distribution of the values of the sensitive attributes (see the section on the homogeneity attack).
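A check for k-anonymity can be sketched as grouping the records by their quasi-identifier combination and taking the size of the smallest group. The following is a minimal illustrative sketch in Python, not Sweeney's original procedure; the attribute names follow the tables above, with the generalized ages written as ranges.

    from collections import Counter

    def k_anonymity(records, quasi_identifiers):
        """Largest k for which the table is k-anonymous: the size of the
        smallest equivalence class over the quasi-identifier combination."""
        classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return min(classes.values())

    # The generalized hospital table from above, reduced to the relevant attributes.
    table = [
        {"age": "20-25", "gender": "Female", "postcode": "76*", "illness": "Flu"},
        {"age": "30-40", "gender": "Male",   "postcode": "77*", "illness": "Cancer"},
        {"age": "20-40", "gender": "Male",   "postcode": "63*", "illness": "Hair loss"},
        {"age": "20-40", "gender": "Male",   "postcode": "63*", "illness": "Muscle strain"},
        {"age": "20-25", "gender": "Female", "postcode": "76*", "illness": "Flu"},
        {"age": "30-40", "gender": "Male",   "postcode": "77*", "illness": "Poisoning"},
        {"age": "30-40", "gender": "Male",   "postcode": "77*", "illness": "Dementia"},
        {"age": "18-28", "gender": "Female", "postcode": "8*",  "illness": "Caries"},
        {"age": "18-28", "gender": "Female", "postcode": "8*",  "illness": "Acne"},
    ]

    print(k_anonymity(table, ["age", "gender", "postcode"]))  # 2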

Shortcomings

The concept of k-anonymity has known shortcomings that can allow de-anonymization, meaning that individual participants in a k-anonymous table may still be uniquely identifiable. Two of these shortcomings are explained in more detail below.

Homogeneity attack

The homogeneity attack exploits the fact that all k records of an equivalence class may have identical sensitive attribute values. If an attacker knows that a person is contained in a database and can assign that person to the correct equivalence class, the attacker learns the person's sensitive attribute.

Illustration

Alice is a very nosy neighbour of Bob. When Bob is picked up by an ambulance one day, Alice wants to find out what Bob is suffering from. She discovers the 4-anonymous table with current patient data published by the hospital. She knows that Bob must be in the table and knows his age, gender and zip code. From this she concludes that his record must be in equivalence class C. Since all patients in this equivalence class suffer from the same disease, Alice learns Bob's illness.

Equivalence class | Surname | Age            Gender   Postcode | Illness
B                 | *       | 25 < age < 30  Female   13*      | ...
                  |         |                                  | Heart disease
C                 | *       | 40 < age < 50  Male     13*      | Cancer
                  |         |                                  | Cancer
                  |         |                                  | Cancer
                  |         |                                  | Cancer
D                 | *       | 20 < age < 35  Female   12*      | Flu
                  |         |                                  | ...
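One simple, hedged way to screen a published table for this weakness is to count the distinct sensitive values in each equivalence class: a class with only one distinct value leaks the sensitive attribute exactly as in the example above. The sketch below reuses the record layout assumed in the earlier snippets; the function name is made up for illustration.

    from collections import defaultdict

    def homogeneous_classes(records, quasi_identifiers, sensitive):
        """Equivalence classes in which every record shares the same
        sensitive value, i.e. the targets of a homogeneity attack."""
        values = defaultdict(set)
        for r in records:
            key = tuple(r[q] for q in quasi_identifiers)
            values[key].add(r[sensitive])
        return [key for key, vals in values.items() if len(vals) == 1]

    # Applied to the 2-anonymous hospital table from the earlier sketch, only
    # equivalence class A (age 20-25, Female, 76*) is homogeneous: both of its
    # records carry the sensitive value "Flu".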

Background knowledge attack

By using additional knowledge, it may be possible to assign people uniquely despite k-anonymity. If the attacker knows that a person is contained in a database and can assign that person to the correct equivalence class, the additional knowledge may allow certain sensitive attribute values to be ruled out for that person.

Illustration

Alice has a pen pal named Yui who is admitted to a hospital and whose patient data is contained in a 4-anonymous table that the hospital regularly publishes. Alice knows that Yui is a 21-year-old Japanese woman who is currently registered under the zip code 12345. From this information, Alice can conclude that Yui's record must be in equivalence class B. Without additional information, Alice cannot be sure whether Yui has a viral disease or heart disease. However, it is well known that Japanese people very rarely suffer from heart disease. This allows Alice to conclude that Yui has a viral disease.

Equivalence class | Surname | Age            Gender   Postcode | Illness
A                 | *       | 30 < age < 35  Male     14*      | ...
                  |         |                                  | Flu
B                 | *       | 20 < age < 30  Female   12*      | Heart disease
                  |         |                                  | Viral disease
                  |         |                                  | Viral disease
                  |         |                                  | Heart disease
C                 | *       | 30 < age < 35  Female   12*      | Cancer
                  |         |                                  | ...
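The reasoning in this example can be written down as a small filtering step: the attacker starts from the sensitive values in Yui's equivalence class and removes the values that the background knowledge rules out. The class contents and the excluded value are taken from the table above; the function name is made up for this sketch.

    def plausible_values(class_values, ruled_out_by_background_knowledge):
        """Sensitive values that remain possible after applying background knowledge."""
        return set(class_values) - set(ruled_out_by_background_knowledge)

    # Equivalence class B from the table above, with heart disease ruled out for Yui:
    print(plausible_values(
        ["Heart disease", "Viral disease", "Viral disease", "Heart disease"],
        ["Heart disease"],
    ))
    # {'Viral disease'}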

Extensions

To remedy the shortcomings of k-anonymity mentioned above, the extensions l-diversity and, building on it, t-closeness were designed. l-diversity addresses the weakness against homogeneity attacks in particular by ensuring a certain degree of diversity among the sensitive attribute values within each equivalence class. t-closeness extends the concept so that the distribution of the sensitive attribute values in each equivalence class matches the distribution in the whole table as closely as possible.
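As a rough sketch of these two ideas, the following Python functions compute distinct l-diversity (a simple variant) and a t-closeness value using total variation distance as a stand-in for the Earth Mover's Distance used in the original t-closeness paper. They reuse the record layout assumed in the earlier snippets and are illustrative, not the reference definitions.

    from collections import Counter, defaultdict

    def distinct_l_diversity(records, quasi_identifiers, sensitive):
        """Smallest number of distinct sensitive values over all equivalence classes."""
        values = defaultdict(set)
        for r in records:
            values[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
        return min(len(v) for v in values.values())

    def t_closeness(records, quasi_identifiers, sensitive):
        """Largest distance between a class distribution of the sensitive attribute
        and its distribution in the whole table (total variation distance as a
        simple stand-in for the Earth Mover's Distance of the original paper)."""
        overall = Counter(r[sensitive] for r in records)
        total = len(records)
        classes = defaultdict(list)
        for r in records:
            classes[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
        worst = 0.0
        for members in classes.values():
            local = Counter(members)
            dist = 0.5 * sum(
                abs(local[v] / len(members) - overall[v] / total) for v in overall
            )
            worst = max(worst, dist)
        return worst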


References

  1. Latanya Sweeney: k-anonymity: A model for protecting privacy. In: International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 10, Issue 5, World Scientific, 2002, pp. 557–570.
  2. Zhen Li, Xiaojun Ye: Privacy protection on multiple sensitive attributes. In: Information and Communications Security, Springer Berlin Heidelberg, 2007, pp. 141–152.
  3. Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, Muthuramakrishnan Venkitasubramaniam: l-diversity: Privacy beyond k-anonymity. In: ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 1, ACM, 2007.
  4. Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian: t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In: ICDE, Vol. 7, 2007, pp. 106–115.