diff --git a/README.md b/README.md index 69805e0..a5c264a 100644 --- a/README.md +++ b/README.md @@ -33,22 +33,28 @@ Attribute | Classification Table: Attribute classifications + ## Justifications The vast majority of attributes present low values of distinction. This is consistent with the nature of the dataset, considering that `fnlwgt` should indicate the quantity of individuals that present the same set of attributes. + ### `age` According to HIPPA recommendations, and together with it's very high separation value (99.87%), we classify this attribute as a QID. +![Hierarchy for attribute `age`](coding-model/hierarchies/age/age.png){width=14cm} + + ### `workclass` This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's deemed Insensitive. + ### `fnlwgt` Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID @@ -75,10 +81,14 @@ U.S.A., the origin of the dataset, which implies this attribute is not a count, We also note there are substantially more Male than Female records (more than double the `fnlwgt`). + ### `education` This attribute presents a separation of 80.96%, which is quite high, thus we classified it as a QID. +![Hierarchy for attribute `education`](coding-model/hierarchies/education/education.png){width=14cm} + + ### `education-num` We exported the anonymized dataset and used the following command to verify there weren't any discrepencies @@ -93,25 +103,40 @@ representation of the `education` attribute. As such, this attribute recieves the same classification, which is backed by the equally high separation value of 80.96%, so it's classified as a QID. +![Hierarchy for attribute `education-num`](coding-model/hierarchies/education/education-num.png){width=14cm} + + ### `marital-status` With a relatively high separation value of 66.01%, together with the fact that it could be cross referenced with other available datasets, we classify this attribute as a QID. 
+![Hierarchy for attribute `marital-status`](coding-model/hierarchies/marital-status/marital-status.png){width=14cm} + + ### `occupation` With a separation of 90.02%, this attribute is classified as a QID. +![Hierarchy for attribute `occupation`](coding-model/hierarchies/occupation.png){width=14cm} + + ### `relationship` Given it's separation value of 73.21%, this attribute is classified as a QID. +![Hierarchy for attribute `relationship`](coding-model/hierarchies/relationship/relationship.png){width=14cm} + + ### `race` This collumn presents some weirdly specific values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact that this attribute could be cross referenced with other datases, it is classified as a QID, so it may be transformed into more generic values. +![Hierarchy for attribute `race`](coding-model/hierarchies/race.png) + + ### `sex` Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since @@ -140,6 +165,9 @@ Doctorate | 86 | 327 Table: Number of records with each `education` for each `sex` +![Hierarchy for attribute `sex`](coding-model/hierarchies/sex.png) + + ### `capital-gain` & `capital-loss` With a separation of 15.93% and 9.15% respectively, these attributes are not QIDs. They're qualified as @@ -150,20 +178,24 @@ A t-closeness privacy model was chosen for these attributes, with a value of t of 0.2. This reasoning is discussed in Applying anonymization models > k-Anonymity > Effect of parameters + ### `hours-per-week` This attribute has a relatively high separation (76.24%) and since it had really unique values, it could be cross referenced with another dataset to help identify individuals, so it's classified as QID. + ### `native-country` While this attribute might be regarded as a QID, it presents really low separation values (19.65%) in this dataset, so it's qualified as Insensitive. 
+ ### `prediction` This is the target attribute, the attribute the other attributes predict, and is therefore Insensitive. + # Privacy risks in the original dataset In the original dataset, nearly 40% of records have a more than 50% risk of re-identification by @@ -179,8 +211,10 @@ it's potentially easier to re-identify the individuals in question. All attacker models show a success rate of more than 50%, which is not acceptable. + # Applying anonymization models + ## k-Anonymity We opted for 8-anonymity, for it's tradeoff between maximal risk and suppression. @@ -188,12 +222,14 @@ We opted for 8-anonymity, for it's tradeoff between maximal risk and suppression t-closeness was chosen for `capital-gain` and `capital-loss` (sensitive attributes). + ### Re-identification risk The average re-identification risk dropped to nearly 0%, whereas the maximal risk dropped to 12.5%. The success rate for all attacker models was reduced drastically, to 1.3%. + ### Utility The original Classification Performance, a measure of how well the attributes @@ -205,6 +241,7 @@ value being equal across all atributes suggests entire rows were removed, rather than select values from separate rows. The only exception is the `occupation` attribute, which was entirely removed. + ### Effect of parameters At a suppression limit of 0%, the same accuracy is maintained, but the @@ -242,15 +279,18 @@ increased Generalization Intensity values. Adjusting the coding model had no significant effects. + ## $(\epsilon, \delta)$-Differential Privacy With the default $\epsilon$ value of 2 and a $\delta$ value of $10^{-6}$, the performance was really good. + ### Re-identification risk All indicators for risk by each attacker model was between 0.1% and 0.9%. + ### Utility The original Classification Performance was 83.24% and it remained @@ -259,6 +299,7 @@ at 80.97%. Nearly 16% of attributes are missing, with the expection of `age` and `education-num`, which are 100% missing. 
+ ### Effect of parameters An $\epsilon$ value of 3 maintained the accuracy at 80.5% with @@ -270,12 +311,14 @@ performance of 82.05% and a missings value of 21.02% for all attributes. A further increase of $\delta$ to $10^{-4}$ resulted in an increased accuracy of 82.32%, but a maximal risk of 1.25%. + # Results The 8-anonymity model was chosen as it resulted in a broader distribution of attribute values like `age`, whereas with Differential Privacy, they were split into only 2 categories. + # Observations We noted that the contingency between `sex` and `relationship` maintained
diff --git a/coding-model/hierarchies/age/age.ahs b/coding-model/hierarchies/age/age.ahs
new file mode 100644
index 0000000..86564db
Binary files /dev/null and b/coding-model/hierarchies/age/age.ahs differ
diff --git a/coding-model/hierarchies/age/age.png b/coding-model/hierarchies/age/age.png
new file mode 100644
index 0000000..1ca9ac2
Binary files /dev/null and b/coding-model/hierarchies/age/age.png differ
diff --git a/coding-model/hierarchies/education/education-num.ahs b/coding-model/hierarchies/education/education-num.ahs
new file mode 100644
index 0000000..07d05d9
Binary files /dev/null and b/coding-model/hierarchies/education/education-num.ahs differ
diff --git a/coding-model/hierarchies/education/education-num.png b/coding-model/hierarchies/education/education-num.png
new file mode 100644
index 0000000..806ecd9
Binary files /dev/null and b/coding-model/hierarchies/education/education-num.png differ
diff --git a/coding-model/hierarchies/education/education.ahs b/coding-model/hierarchies/education/education.ahs
new file mode 100644
index 0000000..e54cada
Binary files /dev/null and b/coding-model/hierarchies/education/education.ahs differ
diff --git a/coding-model/hierarchies/education/education.png b/coding-model/hierarchies/education/education.png
new file mode 100644
index 0000000..82c38e3
Binary files /dev/null and b/coding-model/hierarchies/education/education.png differ
diff --git a/coding-model/hierarchies/marital-status/marital-status.ahs b/coding-model/hierarchies/marital-status/marital-status.ahs
new file mode 100644
index 0000000..57ebb4c
Binary files /dev/null and b/coding-model/hierarchies/marital-status/marital-status.ahs differ
diff --git a/coding-model/hierarchies/marital-status/marital-status.png b/coding-model/hierarchies/marital-status/marital-status.png
new file mode 100644
index 0000000..317bb95
Binary files /dev/null and b/coding-model/hierarchies/marital-status/marital-status.png differ
diff --git a/coding-model/hierarchies/occupation.png b/coding-model/hierarchies/occupation.png
new file mode 100644
index 0000000..af024ec
Binary files /dev/null and b/coding-model/hierarchies/occupation.png differ
diff --git a/coding-model/hierarchies/race.png b/coding-model/hierarchies/race.png
new file mode 100644
index 0000000..8e546ee
Binary files /dev/null and b/coding-model/hierarchies/race.png differ
diff --git a/coding-model/hierarchies/relationship/relationship.ahs b/coding-model/hierarchies/relationship/relationship.ahs
new file mode 100644
index 0000000..fca1cef
Binary files /dev/null and b/coding-model/hierarchies/relationship/relationship.ahs differ
diff --git a/coding-model/hierarchies/relationship/relationship.png b/coding-model/hierarchies/relationship/relationship.png
new file mode 100644
index 0000000..646bddd
Binary files /dev/null and b/coding-model/hierarchies/relationship/relationship.png differ
diff --git a/coding-model/hierarchies/sex.png b/coding-model/hierarchies/sex.png
new file mode 100644
index 0000000..350c916
Binary files /dev/null and b/coding-model/hierarchies/sex.png differ
diff --git a/report.pdf b/report.pdf
index f45c78b..443b681 100644
Binary files a/report.pdf and b/report.pdf differ