k-anonymity


k-anonymity is a formal data protection model that allows statements to be made about anonymized data records.

A data publication offers k-anonymity if the identifying information of every individual is indistinguishable from that of at least k-1 other individuals, which makes a correct link to the associated sensitive attributes more difficult. The letter k denotes a parameter that is replaced by a natural number in a concrete case; a larger k corresponds to greater anonymity.

The concept was published in 2002 by Latanya Sweeney, a professor at Harvard University, with the aim of making scientific data public in such a way that the individuals the data describe cannot be re-identified while the data remain useful for their intended purposes. This amounts to a trade-off between stronger privacy protection on the one hand and a loss of data accuracy on the other.

Explanation

In the context of k-anonymity, a database is understood to be a table with n rows and m columns. Each row represents a (not necessarily unique) record belonging to a specific individual. The values in the columns are the attribute values associated with the individuals.

The individual attributes can be divided into identifiers, quasi-identifiers and sensitive attributes. Identifiers, such as ID numbers or matriculation numbers, uniquely identify individuals. Quasi-identifiers are attributes that, taken by themselves, do not allow identification but that, in combination with generally accessible data, permit a unique assignment. Sensitive attributes contain personal information that is worth protecting, such as illnesses or salary information; the exact value of an individual's sensitive attribute should therefore not be disclosed.

Irrespective of the concept of k-anonymity, anonymization can be achieved by various means, for example by adding noise, suppressing information or generalizing data.
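As a brief, hedged sketch (not part of the original concept description), the following Python snippet illustrates two of these techniques on a single record: suppression of the identifier and generalization of the quasi-identifiers. The field names, bucket size and prefix length are illustrative assumptions.

    def generalize_record(record, age_bucket=10, zip_prefix_len=2):
        """Suppress the identifier and coarsen the quasi-identifiers of one record.
        Assumes `record` is a dict with the keys used in the tables below."""
        low = (record["age"] // age_bucket) * age_bucket            # e.g. 35 -> 30
        return {
            "surname": "*",                                          # suppression
            "age": f"{low}-{low + age_bucket - 1}",                  # generalization to a range
            "gender": record["gender"],
            "postcode": record["postcode"][:zip_prefix_len] + "*",   # generalization to a prefix
            "illness": record["illness"],                            # sensitive attribute is kept
        }

    print(generalize_record(
        {"surname": "Louis", "age": 35, "gender": "Male", "postcode": "77021", "illness": "Cancer"}
    ))
    # {'surname': '*', 'age': '30-39', 'gender': 'Male', 'postcode': '77*', 'illness': 'Cancer'}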

Illustration

The following table is a non-anonymized database consisting of patient data from a fictitious hospital.

Identifier | Quasi-identifiers        | Sensitive attribute
Surname    | Age   Gender   Postcode  | Illness
Anna       | 21    Female   76189     | Flu
Louis      | 35    Male     77021     | Cancer
Holger     | 39    Male     63092     | Hair loss
Frederic   | 23    Male     63331     | Muscle strain
Anika      | 24    Female   76121     | Flu
Peter      | 31    Male     77462     | Poisoning
Tobias     | 38    Male     77109     | Dementia
Charlotte  | 19    Female   83133     | Caries
Sarah      | 27    Female   89777     | Acne

The next table results from anonymization using generalization:

Identifier | Quasi-identifiers                 | Sensitive attribute
Surname    | Age            Gender   Postcode  | Illness
*          | 20 < age < 25  Female   76*       | Flu
*          | 30 < age < 40  Male     77*       | Cancer
*          | 20 < age < 40  Male     63*       | Hair loss
*          | 20 < age < 40  Male     63*       | Muscle strain
*          | 20 < age < 25  Female   76*       | Flu
*          | 30 < age < 40  Male     77*       | Poisoning
*          | 30 < age < 40  Male     77*       | Dementia
*          | 18 < age < 28  Female   8*        | Caries
*          | 18 < age < 28  Female   8*        | Acne

There are 4 equivalence classes:

Equivalence class | Surname | Age            Gender   Postcode | Illness
A                 | *       | 20 < age < 25  Female   76*      | Flu
A                 | *       | 20 < age < 25  Female   76*      | Flu
B                 | *       | 30 < age < 40  Male     77*      | Cancer
B                 | *       | 30 < age < 40  Male     77*      | Poisoning
B                 | *       | 30 < age < 40  Male     77*      | Dementia
C                 | *       | 20 < age < 40  Male     63*      | Hair loss
C                 | *       | 20 < age < 40  Male     63*      | Muscle strain
D                 | *       | 18 < age < 28  Female   8*       | Caries
D                 | *       | 18 < age < 28  Female   8*       | Acne

Each equivalence class contains at least 2 elements, so 2-anonymity is guaranteed. Note that in equivalence class A the values of the sensitive attribute also match, while this is not the case in the other equivalence classes. k-anonymity makes no statement about the distribution of the values of the sensitive attributes (see the section on the homogeneity attack).
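A check for k-anonymity can be sketched as grouping the records by their quasi-identifier combination and taking the size of the smallest group. The following is a minimal illustrative sketch in Python, not Sweeney's original procedure; the attribute names follow the tables above, with the generalized ages written as ranges.

    from collections import Counter

    def k_anonymity(records, quasi_identifiers):
        """Largest k for which the table is k-anonymous: the size of the
        smallest equivalence class over the quasi-identifier combination."""
        classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return min(classes.values())

    # The generalized hospital table from above, reduced to the relevant attributes.
    table = [
        {"age": "20-25", "gender": "Female", "postcode": "76*", "illness": "Flu"},
        {"age": "30-40", "gender": "Male",   "postcode": "77*", "illness": "Cancer"},
        {"age": "20-40", "gender": "Male",   "postcode": "63*", "illness": "Hair loss"},
        {"age": "20-40", "gender": "Male",   "postcode": "63*", "illness": "Muscle strain"},
        {"age": "20-25", "gender": "Female", "postcode": "76*", "illness": "Flu"},
        {"age": "30-40", "gender": "Male",   "postcode": "77*", "illness": "Poisoning"},
        {"age": "30-40", "gender": "Male",   "postcode": "77*", "illness": "Dementia"},
        {"age": "18-28", "gender": "Female", "postcode": "8*",  "illness": "Caries"},
        {"age": "18-28", "gender": "Female", "postcode": "8*",  "illness": "Acne"},
    ]

    print(k_anonymity(table, ["age", "gender", "postcode"]))  # 2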

Shortcomings

The concept of k-anonymity has known shortcomings that can allow de-anonymization, meaning that individual participants in a k-anonymous table may still be uniquely identifiable. Two of these shortcomings are explained in more detail below.

Homogeneity attack

The homogeneity attack exploits the fact that all k records of an equivalence class may have identical sensitive attribute values. If an attacker knows that a person is contained in a database and can assign that person to the correct equivalence class, the attacker learns the person's sensitive attribute.

Illustration

Alice is a very nosy neighbour of Bob. When Bob is picked up by an ambulance one day, Alice wants to find out what Bob is suffering from. She discovers the 4-anonymous table with current patient data published by the hospital. She knows that Bob must be in the table and knows his age, gender and zip code. From this she concludes that his record must be in equivalence class C. Since all patients in this equivalence class suffer from the same disease, Alice learns Bob's illness.

Equivalence class | Surname | Age            Gender   Postcode | Illness
B                 | *       | 25 < age < 30  Female   13*      | ...
                  |         |                                  | Heart disease
C                 | *       | 40 < age < 50  Male     13*      | Cancer
                  |         |                                  | Cancer
                  |         |                                  | Cancer
                  |         |                                  | Cancer
D                 | *       | 20 < age < 35  Female   12*      | Flu
                  |         |                                  | ...
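One simple, hedged way to screen a published table for this weakness is to count the distinct sensitive values in each equivalence class: a class with only one distinct value leaks the sensitive attribute exactly as in the example above. The sketch below reuses the record layout assumed in the earlier snippets; the function name is made up for illustration.

    from collections import defaultdict

    def homogeneous_classes(records, quasi_identifiers, sensitive):
        """Equivalence classes in which every record shares the same
        sensitive value, i.e. the targets of a homogeneity attack."""
        values = defaultdict(set)
        for r in records:
            key = tuple(r[q] for q in quasi_identifiers)
            values[key].add(r[sensitive])
        return [key for key, vals in values.items() if len(vals) == 1]

    # Applied to the 2-anonymous hospital table from the earlier sketch, only
    # equivalence class A (age 20-25, Female, 76*) is homogeneous: both of its
    # records carry the sensitive value "Flu".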

Background knowledge attack

By using additional knowledge, it may be possible to assign people uniquely despite k-anonymity. If the attacker knows that a person is contained in a database and can assign that person to the correct equivalence class, the additional knowledge may allow certain sensitive attribute values to be ruled out for that person.

Illustration

Alice has a pen pal named Yui who is admitted to a hospital and whose patient data is contained in a 4-anonymous table that the hospital regularly publishes. Alice knows that Yui is a 21-year-old Japanese woman who is currently registered under the zip code 12345. From this information, Alice can conclude that Yui's record must be in equivalence class B. Without additional information, Alice cannot be sure whether Yui has a viral disease or heart disease. However, it is well known that Japanese people very rarely suffer from heart disease. This allows Alice to conclude that Yui has a viral disease.

Equivalence class | Surname | Age            Gender   Postcode | Illness
A                 | *       | 30 < age < 35  Male     14*      | ...
                  |         |                                  | Flu
B                 | *       | 20 < age < 30  Female   12*      | Heart disease
                  |         |                                  | Viral disease
                  |         |                                  | Viral disease
                  |         |                                  | Heart disease
C                 | *       | 30 < age < 35  Female   12*      | Cancer
                  |         |                                  | ...
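The reasoning in this example can be written down as a small filtering step: the attacker starts from the sensitive values in Yui's equivalence class and removes the values that the background knowledge rules out. The class contents and the excluded value are taken from the table above; the function name is made up for this sketch.

    def plausible_values(class_values, ruled_out_by_background_knowledge):
        """Sensitive values that remain possible after applying background knowledge."""
        return set(class_values) - set(ruled_out_by_background_knowledge)

    # Equivalence class B from the table above, with heart disease ruled out for Yui:
    print(plausible_values(
        ["Heart disease", "Viral disease", "Viral disease", "Heart disease"],
        ["Heart disease"],
    ))
    # {'Viral disease'}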

Extensions

To remedy the shortcomings of k-anonymity mentioned above, the extensions l-diversity and, building on it, t-closeness were designed. l-diversity addresses the weakness against homogeneity attacks in particular by ensuring a certain degree of diversity among the sensitive attribute values within each equivalence class. t-closeness extends the concept so that the distribution of the sensitive attribute values in each equivalence class matches the distribution in the whole table as closely as possible.
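As a rough sketch of these two ideas, the following Python functions compute distinct l-diversity (a simple variant) and a t-closeness value using total variation distance as a stand-in for the Earth Mover's Distance used in the original t-closeness paper. They reuse the record layout assumed in the earlier snippets and are illustrative, not the reference definitions.

    from collections import Counter, defaultdict

    def distinct_l_diversity(records, quasi_identifiers, sensitive):
        """Smallest number of distinct sensitive values over all equivalence classes."""
        values = defaultdict(set)
        for r in records:
            values[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
        return min(len(v) for v in values.values())

    def t_closeness(records, quasi_identifiers, sensitive):
        """Largest distance between a class distribution of the sensitive attribute
        and its distribution in the whole table (total variation distance as a
        simple stand-in for the Earth Mover's Distance of the original paper)."""
        overall = Counter(r[sensitive] for r in records)
        total = len(records)
        classes = defaultdict(list)
        for r in records:
            classes[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
        worst = 0.0
        for members in classes.values():
            local = Counter(members)
            dist = 0.5 * sum(
                abs(local[v] / len(members) - overall[v] / total) for v in overall
            )
            worst = max(worst, dist)
        return worst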


References

  1. Latanya Sweeney: k-anonymity: A model for protecting privacy. In: International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 10, Issue 5, World Scientific, 2002, pp. 557–570.
  2. Zhen Li, Xiaojun Ye: Privacy protection on multiple sensitive attributes. In: Information and Communications Security, Springer Berlin Heidelberg, 2007, pp. 141–152.
  3. Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, Muthuramakrishnan Venkitasubramaniam: l-diversity: Privacy beyond k-anonymity. In: ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 1, ACM, 2007.
  4. Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian: t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In: ICDE, Vol. 7, 2007, pp. 106–115.