POkA: identifying Pareto-optimal k-anonymous nodes in a domain hierarchy lattice

Data generalization is widely used to protect identities and prevent inference of sensitive information when microdata is publicly released. The k-anonymity model has been extensively applied in this context. The model seeks a generalization scheme under which every individual becomes indistinguishable from at least k-1 other individuals while the resulting loss of information is kept to a minimum. The search is performed on a domain hierarchy lattice in which every node is a vector giving the level of generalization for each attribute. Understanding the trade-off between privacy and data utility requires knowing the minimum possible information loss for every possible value of k. However, obtaining this can easily lead to an exhaustive evaluation of all nodes in the hierarchy lattice. In this paper, we propose using the concept of Pareto-optimality to obtain the desired trade-off information. A Pareto-optimal generalization is one for which no other generalization provides a higher value of k without increasing the information loss. We introduce the Pareto-Optimal k-Anonymization (POkA) algorithm to traverse the hierarchy lattice and show that the number of node evaluations required to find the Pareto-optimal generalizations can be significantly reduced. Results on a benchmark data set show that the algorithm identifies all Pareto-optimal nodes while evaluating only 20% of the nodes in the lattice.
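To make the notion of Pareto-optimality over lattice nodes concrete, the following minimal Python sketch filters a set of already evaluated nodes down to the non-dominated ones. It is the exhaustive baseline the abstract refers to, not the POkA traversal itself, whose pruning rules are not given here. The function name `pareto_optimal`, the toy lattice, and every (k, loss) value are hypothetical, and the dominance test assumes the usual definition: a node is dominated if another node has k at least as high and loss at most as high, with one of the two strictly better.

```python
def pareto_optimal(evaluated):
    """Return the non-dominated nodes, sorted.

    `evaluated` maps a lattice node (a tuple of generalization levels,
    one per attribute) to a pair (k, loss). Both the structure and the
    name are assumptions made for this illustration.
    """
    front = []
    for node, (k, loss) in evaluated.items():
        dominated = any(
            k2 >= k and loss2 <= loss and (k2 > k or loss2 < loss)
            for other, (k2, loss2) in evaluated.items()
            if other != node
        )
        if not dominated:
            front.append(node)
    return sorted(front)


# Toy lattice for two attributes with 2 and 3 generalization levels.
# The (k, loss) values are invented purely for illustration.
evaluated = {
    (0, 0): (1, 0.0),
    (0, 1): (2, 0.3),
    (1, 0): (2, 0.4),  # dominated by (0, 1): same k, lower loss
    (1, 1): (3, 0.6),
    (0, 2): (3, 0.7),  # dominated by (1, 1): same k, lower loss
    (1, 2): (5, 1.0),
}

front = pareto_optimal(evaluated)
print(front)
```

Each node on the resulting front gives the smallest loss found for its value of k, which is the trade-off information described above. This baseline needs the (k, loss) pair of every node in the lattice, and that is the cost a guided traversal would aim to avoid.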
