diff --git a/report.md b/report.md index d13330f..c055689 100644 --- a/report.md +++ b/report.md @@ -21,11 +21,11 @@ Attribute | Classification `marital-status` | QID `occupation` | QID `relationship` | QID -`race` | Sensitive +`race` | QID `sex` | QID `capital-gain` | Sensitive `capital-loss` | Sensitive -`hours-per-week` | Insensitive +`hours-per-week` | QID `native-country` | Insensitive `prediction` | Insensitive @@ -44,7 +44,7 @@ attribute is classified as a QID. ### `workclass` This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's -deemed insensitive. +deemed Insensitive. ### `fnlwgt` @@ -54,7 +54,7 @@ original dataset. This can be seen with the results below. Additionally, it's no to another auxiliary info dataset. ```bash -tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \ +$ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \ END {for(sex in count){print sex, count[sex]}}' ``` @@ -70,6 +70,8 @@ Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight} The sum of these values is 6,179,373,392. This value is much larger than the population of the U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated. +We also note there are substantially more Male than Female records (more than double the `fnlwgt`). + ### `education` This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified @@ -77,9 +79,17 @@ as a QID. ### `education-num` -As a numerical representation of the `education` attribute, this attribute recieves the same -classification, which is backed by the equally high separation value of 80.96%, so it's qualified as -a QID. +We exported the anonymized dataset and used the following command to verify there weren't any discrepencies between the +`education` and `education-num` columns: + +```bash +$ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un +``` + +Since there was a one-to-one mapping, we concluded this was just a +representation of the `education` attribute. As such, this attribute +recieves the same classification, which is backed by the equally high +separation value of 80.96%, so it's qualified as a QID. ### `marital-status` @@ -97,7 +107,7 @@ Given it's separation value of 73.21%, this attribute is classified as a QID. ### `race` This collumn presents some weirdly specified values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact -that this attribute could be cross referenced with other datases, it is classified as Sensitive, so +that this attribute could be cross referenced with other datases, it is classified as a QID, so it may be transformed into more generic values. ### `sex` @@ -105,33 +115,178 @@ it may be transformed into more generic values. Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since it can be easily cross referenced with other datasets. -We noted this dataset seems to more males than females. See @tbl:sex_weight +We noted this dataset seems to more males than females. See @tbl:sex_weight and the following table +`education` | Female | Male +-------------+-------:+----: +Preschool | 16 | 35 +1st-4th | 46 | 122 +5th-6th | 84 | 249 +7th-8th | 160 | 486 +9th | 144 | 370 +10th | 295 | 638 +11th | 432 | 743 +12th | 144 | 289 +HS-grad | 3390 | 7111 +Some-college | 2806 | 4485 +Assoc-voc | 500 | 882 +Assoc-acdm | 421 | 646 +Bachelors | 1619 | 3736 +Masters | 536 | 1187 +Prof-school | 92 | 484 +Doctorate | 86 | 327 + +Table: Number of records with each `education` for each `sex` {#tbl:education_sex} + +### `capital-gain` & `capital-loss` + +With a separation of 15.93% and 9.15% respectively, these attributes are not QIDs. They're qualified as +Sensitive, as the individuals may not want their capital gains and +losses publicly known. + +A t-closeness privacy model was chosen for these attributes, with a +value of t of 0.2. This reasoning is discussed in Applying +anonymization models > k-Anonymity > Effect of parameters + +### `hours-per-week` + +This attribute has a relatively high separation (76.24%) and since it had really unique values, it +could be cross referenced with another dataset to help identify individuals, so it's classified as QID. ### `native-country` While this attribute might be regarded as a QID, it presents really low separation values (19.65%) in this -dataset, so it's qualified as Sensitive. +dataset, so it's qualified as Insensitive. ----------------- +### `prediction` +This is the target attribute, the attribute the other attributes predict, and is therefore Insensitive. -Higer Precision (Generation Intensity) implies the attributes are closer to the ones in the original dataset, therefore -provide higher utility. +# Privacy risks in the original dataset + +In the original dataset, nearly 40% of records have a more than 50% risk of re-identification by +a prosecutor. In general, we see a stepped distribution of the record risk, which indicates some +privacy model was already applied to the dataset, however to a different standard than what we +intend. + +All records had really high uniqueness percentage even for small sampling factors, according to the +Zayatz, Pitman and Dankar methods. Only SNB indicated a low uniquess percentage for sampling factors +under 90%. What this means, is that with a fraction of the original dataset, a very significant +number of records was sufficiently unique that it could be distinguished among the rest, which means +it's potentially easier to re-identify the individuals in question. + +All attacker models show a success rate of more than 50%, which is not acceptable. + +# Applying anonymization models + +## k-Anonymity + +We opted for 8-anonymity, for it's tradeoff between maximal risk and suppression. + +t-closeness was chosen for `capital-gain` and `capital-loss` +(sensitive attributes). + +### Re-identification risk + +The average re-identification risk dropped to nearly 0%, whereas the +maximal risk dropped to 12.5%. The success rate for all attacker +models was reduced drastically, to 1.3%. + +### Utility + +The original Classification Performance, a measure of how well the attributes +predict the target variable (`prediction`) was 83.24% and it remained +at 82.45%. + +10.07% of attributes are missing from the anonymized dataset. This +value being equal across all atributes suggests entire rows were +removed, rather than select values from separate rows. The only +exception is the `occupation` attribute, which was entirely removed. + +### Effect of parameters + +At a suppression limit of 0%, the same accuracy is maintained, but the +vast majority of QIDs are entirely removed. + +At a suppression limit of 5%, roughly the same prediction accuracy is +maintained, with around 4.5% of values missing, however with really +high Generalization Intensity values for some attributes (e.g. 95.42% +for `sex`, 93.87% for `race` and 91.47% for `education` and +`education-num`). `occupation` was entirely removed. + +At a suppression limit of 10%, the prediction accuracy is maintained, +with around 9.8% of values missing. However, the Gen. Intensity drops +to around 90%. + +At a suppression limit of 20%, accuracy is maintained, once again, +with around 10% of values missing, indicating this would be the +optimal settings, as the same results are achieved with a limit of +100%. + +At a t-closeness for `capital-gain` and `capital-loss` t value of +0.001 (the default), anonymization fails, not producing any output. + +At a t value of 0.01, accuracy drops to 75% and most attributes have +missing values of 100%. + +At a t value of 0.1, classification accuracy is nearly 81%, but +missings values are around 20%. + +At a t value of 0.2, the chosen value, the accuracy is 82.5% with +lower Gen. Intensity values. + +At a t value of 0.5, the classification accuracy goes to 82.2% with +increased Generalization Intensity values. + +Adjusting the coding model had no significant effects. + +## $(\epsilon, \delta)$-Differential Privacy + +With the default $\epsilon$ value of 2 and a $\delta$ value of +$10^{-6}$, the performance was really good. + +### Re-identification risk + +All indicators for risk by each attacker model was between 0.1% and 0.9%. + +### Utility + +The original Classification Performance was 83.24% and it remained +at 80.97%. + +Nearly 16% of attributes are missing, with the expection of `age` and +`education-num`, which are 100% missing. + +### Effect of parameters + +An $\epsilon$ value of 3 maintained the accuracy at 80.5% with +missings values rounding 32%. + +An increase of $\delta$ to $10^{-5}$ resulted in a classification +performance of 82.05% and a missings value of 21.02% for all attributes. + +A further increase of $\delta$ to $10^{-4}$ resulted in an increased +accuracy of 82.32%, but a maximal risk of 1.25%. + +# Results + +The 8-anonymity model was chosen as it resulted in a broader +distribution of attribute values like `age`, whereas with Differential +Privacy, they were split into only 2 categories. + +# Observations We noted that the contingency between `sex` and `relationship` maintained the same distribution after anonymization, meaning that these changes don't mean `relationship` can identify an individual's `sex` any more than in the original dataset. -We exported the anonymized dataset and used the following command to verify there weren't any discrepencies between the -`education` and `education-num` columns: +With the following commands, we noted some possible errors in the +original dataset, where the `sex` and `relationship` attributes didn't +map entirely one to one: there was one occurence of (Husband, Female) +and two of (Wife, Male). It's possible this is an error in the +original dataset. ```bash -cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u -``` - -```bash -cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n -``` +$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n 1 Husband, Female 2 Wife, Male @@ -145,26 +300,19 @@ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | 3875 Not-in-family, Female 4430 Not-in-family, Male 13192 Husband, Male - -``` -~/projects/uni/DataAnonymisation/ (master)$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f8,10 | sort | uniq -c | sort -n ``` - 1 Husband Female - 2 Wife Male - 168 Other-relative * - 336 Own-child * - 342 Other-relative Female - 471 Other-relative Male - 552 Wife * - 573 Unmarried Male - 728 Unmarried * - 1014 Wife Female - 1649 Not-in-family * - 2042 Husband * - 2081 Own-child Female - 2145 Unmarried Female - 2651 Own-child Male - 3209 Not-in-family Female - 3447 Not-in-family Male - 11150 Husband Male +```bash +$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t + + 1295 {Husband, Wife} Female + 2264 {Other-relative, Own-child} Female + 2981 {Other-relative, Own-child} Male + 3280 * * + 4391 {Unmarried, Not-in-family} Male + 5713 {Unmarried, Not-in-family} Female + 12637 {Husband, Wife} Male +``` + +Since there were occurences of (Wide, Male), "({Husband, Wife}, Male)" +does not undo the transformation of the `relationship` attribute.