
Purpose of the Project

Data anonymization is required before a big-data business can use personal information effectively without compromising privacy. Choosing the best algorithm to anonymize given data securely for a given purpose is not trivial: an accurate assessment of the risk of data being compromised must balance utility against security. We therefore propose a competition, run on common pseudo microdata, to find the best anonymization and re-identification algorithms.
The project addresses the aim of the competition, the target microdata, sample algorithms, and utility and security metrics. It also studies the design of an evaluation platform.

De-identified information

In Japan, the Personal Information Protection Commission (PPC) put the amended Act on the Protection of Personal Information fully into effect on May 30, 2017. The Act introduces “Anonymously Processed Information (API)” as a kind of de-identified information that satisfies two conditions: it has been processed so that the person concerned cannot be identified, and restoring the original personal information is prohibited. However, the enforcement rules leave some uncertainty about when data can be declared API (see the table of rules).

Rule | Description | Example
(1) | Deleting all or part of the descriptions that can identify a specific individual | Name, address, date of birth, telephone number
(2) | Deleting all individual identification codes | Passport number, driver’s license, biometric data (DNA, face)
(3) | Deleting codes that link multiple pieces of information to one another | Management IDs, email address
(4) | Deleting idiosyncratic descriptions etc. | Medical history (when the number of cases is small), “116 years old”
(5) | Taking appropriate action, based on the other measures, on attributes etc. of the PI database | Record/cell suppression, generalization, top-coding, service IDs, purchase history, transportation history

figure: Enforcement Rules
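
The measures named under rule (5) can be illustrated with a minimal Python sketch. It is not taken from the enforcement rules or from the competition: the toy records, the field names, the age cap of 90, the 3-digit postal-code prefix, and the minimum count of 2 are all assumptions chosen only to show top-coding, generalization, and cell suppression side by side.

```python
from collections import Counter

# Hypothetical toy records; field names and values are illustrative only.
RECORDS = [
    {"age": 34, "zip": "1010021", "disease": "flu"},
    {"age": 37, "zip": "1010047", "disease": "flu"},
    {"age": 116, "zip": "9900031", "disease": "rare-x"},
]


def coarsen_age(age, width=10, cap=90):
    """Top-coding plus generalization: ages at or above `cap` become
    '<cap>+', all other ages become a bucket label such as '30-39'."""
    if age >= cap:
        return f"{cap}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code, keep=3):
    """Generalization: keep the first `keep` characters, mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def suppress_rare(records, key, min_count=2, mask="*"):
    """Cell suppression: mask values of `key` that occur fewer than
    `min_count` times in the dataset."""
    counts = Counter(r[key] for r in records)
    return [
        {**r, key: r[key] if counts[r[key]] >= min_count else mask}
        for r in records
    ]


def anonymize(records):
    """Apply the three measures in sequence (assumed order)."""
    coarse = [
        {
            "age": coarsen_age(r["age"]),
            "zip": generalize_zip(r["zip"]),
            "disease": r["disease"],
        }
        for r in records
    ]
    return suppress_rare(coarse, "disease")


if __name__ == "__main__":
    for row in anonymize(RECORDS):
        print(row)
```

In this sketch the “116 years old” record loses its exact age through top-coding and its rare medical history through suppression, which corresponds to the kind of idiosyncratic description that rule (4) targets. Whether such processing is sufficient for data to count as API is exactly the uncertainty noted above.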

Competition Design

To address the issues in anonymizing big data, we propose an open data competition. We focus on the risk of record re-identification and define baseline utility functions and several re-identification algorithms. Participants may use arbitrary anonymization techniques, and the best anonymized dataset is determined. The table shows the PWS Cup editions held so far; the competition has been part of an academic conference, the IPSJ Computer Security Symposium (CSS), since 2015.

Item | 2015 | 2016 | 2017
Date | 10/21-22 | 10/11-12 | 10/23-24
Venue | Nagasaki Brick H. | Akita Castel H. | Yamagata Int. H.
Participants | 13 teams, 20 participants | 15 teams, 42 participants | 14 teams, 43 participants
Dataset | NSTAC synthesized data | UCI Dataset “Online Retail”, POS records | UCI Dataset “Online Retail”, POS records
# attributes | 25 | 11 (customer 4 + transaction 7) | 11 (customer 4 + transaction 7)
# persons | 8,333 | 400 | 500
# records | N/A | 18,524 | 44,917
Duration | 1 year | 1 year | 12 months

figure: PWS Cup Competition series

Result and Future

The figures illustrate how our competition evaluates the utility of de-identified data and the risk of re-identification. Based on pre-defined utility functions, we automated the process of risk evaluation and observed a tradeoff between utility and privacy in data anonymization. We will continue to explore reasonable and reliable schemes for these technologies.

figure: Automated Risk Evaluation

figure: Utility-Privacy Tradeoff
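
The evaluation loop described in this section can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the competition's actual platform or metrics: rounding stands in for an arbitrary anonymization technique, mean absolute error stands in for a utility function, and nearest-value linkage stands in for a re-identification algorithm. All function names are hypothetical.

```python
def round_to(values, step):
    """Toy anonymization (assumed): round every value to a multiple of `step`."""
    return [step * round(v / step) for v in values]


def utility_loss(original, anonymized):
    """Toy utility metric (assumed): mean absolute difference between each
    original value and its anonymized counterpart. Lower is better."""
    return sum(abs(o - a) for o, a in zip(original, anonymized)) / len(original)


def reidentify(original, published):
    """Toy adversary (assumed): link each published value to the index of
    the closest original value."""
    return [
        min(range(len(original)), key=lambda i: abs(original[i] - value))
        for value in published
    ]


def reidentification_rate(guesses, truth):
    """Fraction of published rows linked back to the correct original row."""
    return sum(g == t for g, t in zip(guesses, truth)) / len(truth)


def evaluate(original, step, permutation):
    """Anonymize, shuffle rows with a known permutation, attack, and score.

    Row j of the published data comes from original row permutation[j],
    so the permutation is the ground truth the adversary tries to recover.
    """
    anonymized = round_to(original, step)
    published = [anonymized[i] for i in permutation]
    guesses = reidentify(original, published)
    return (
        utility_loss(original, anonymized),
        reidentification_rate(guesses, permutation),
    )


if __name__ == "__main__":
    original = [12, 18, 31, 47, 52]
    permutation = [3, 0, 4, 1, 2]
    for step in (1, 10, 50):
        loss, rate = evaluate(original, step, permutation)
        print(f"step={step}: utility loss={loss:.2f}, re-id rate={rate:.2f}")
```

Coarser rounding raises the utility loss and lowers the re-identification rate in this toy setting, which is the shape of tradeoff the figures report. Because every step is a pure function of the submitted data, scoring a new submission needs no manual intervention, which is the sense in which the evaluation can be automated.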