hierarchies

Diogo Peralta Cordeiro 2022-06-05 17:56:30 +01:00
parent de366a6571
commit fb5875be1a
Signed by: diogo
GPG Key ID: 18D2D35001FBFAB0
15 changed files with 43 additions and 0 deletions

@ -33,22 +33,28 @@ Attribute | Classification
Table: Attribute classifications
## Justifications
The vast majority of attributes present low values of distinction. This is consistent with the nature of
the dataset, considering that `fnlwgt` should indicate the quantity of individuals that present the same
set of attributes.
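
As a rough illustration of the two metrics quoted throughout this section, the sketch below assumes the common definitions: distinction as the share of distinct values among all records, and separation as the share of record pairs that differ on the attribute. The function names and the toy column are ours and purely hypothetical; they are not taken from the tool used for this report.

```python
from collections import Counter


def distinction(values):
    """Share of distinct values: |unique values| / |records| (assumed definition)."""
    return len(set(values)) / len(values)


def separation(values):
    """Share of record pairs whose values differ (assumed definition)."""
    n = len(values)
    total_pairs = n * (n - 1) // 2
    # Pairs that share a value are counted per value, then subtracted.
    same_pairs = sum(c * (c - 1) // 2 for c in Counter(values).values())
    return (total_pairs - same_pairs) / total_pairs


# Made-up toy column, not data from the report.
toy_sex = ["Male", "Male", "Female", "Male"]
print(distinction(toy_sex), separation(toy_sex))
```

Under these definitions an attribute with few distinct values can still reach a moderate separation, which is why both numbers are reported.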
### `age`
According to HIPAA recommendations, and together with its very high separation value (99.87%), we classify
this attribute as a QID.
![Hierarchy for attribute `age`](coding-model/hierarchies/age/age.png){width=14cm}
### `workclass`
This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's
deemed Insensitive.
### `fnlwgt`
Despite high values of distinction (66.48%) and separation (99.99%), the `fnlwgt` column is not a QID
@ -75,10 +81,14 @@ U.S.A., the origin of the dataset, which implies this attribute is not a count,
We also note there are substantially more Male than Female records (more than double the `fnlwgt`).
### `education`
This attribute presents a separation of 80.96%, which is quite high, so we classified it as a QID.
![Hierarchy for attribute `education`](coding-model/hierarchies/education/education.png){width=14cm}
### `education-num`
We exported the anonymized dataset and used the following command to verify there weren't any discrepancies
@ -93,25 +103,40 @@ representation of the `education` attribute. As such, this attribute
receives the same classification, which is backed by the equally high
separation value of 80.96%, so it's classified as a QID.
![Hierarchy for attribute `education-num`](coding-model/hierarchies/education/education-num.png){width=14cm}
### `marital-status`
With a relatively high separation value of 66.01%, together with the fact that it could be
cross-referenced with other available datasets, we classify this attribute as a QID.
![Hierarchy for attribute `marital-status`](coding-model/hierarchies/marital-status/marital-status.png){width=14cm}
### `occupation`
With a separation of 90.02%, this attribute is classified as a QID.
![Hierarchy for attribute `occupation`](coding-model/hierarchies/occupation.png){width=14cm}
### `relationship`
Given its separation value of 73.21%, this attribute is classified as a QID.
![Hierarchy for attribute `relationship`](coding-model/hierarchies/relationship/relationship.png){width=14cm}
### `race`
This column presents some unusually specific values (Amer-Indian-Eskimo), but has a separation of only 25.98%; given
that this attribute could be cross-referenced with other datasets, it is classified as a QID, so
it may be transformed into more generic values.
![Hierarchy for attribute `race`](coding-model/hierarchies/race.png)
### `sex`
Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
@ -140,6 +165,9 @@ Doctorate | 86 | 327
Table: Number of records with each `education` for each `sex`
![Hierarchy for attribute `sex`](coding-model/hierarchies/sex.png)
### `capital-gain` & `capital-loss`
With separations of 15.93% and 9.15%, respectively, these attributes are not QIDs. They're classified as
@ -150,20 +178,24 @@ A t-closeness privacy model was chosen for these attributes, with a
value of t of 0.2. This reasoning is discussed in Applying
anonymization models > k-Anonymity > Effect of parameters.
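
To make the t-closeness criterion concrete, here is a minimal sketch of the check for a single sensitive attribute. It assumes the equal ground distance variant, where the Earth Mover's Distance reduces to half the sum of absolute differences between the two distributions; the report does not state which distance was used, and an ordered distance may suit numeric attributes such as `capital-gain` better. All names and the toy records are hypothetical.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def equal_distance_emd(p, q):
    """EMD under equal ground distance: half the L1 distance (assumed variant)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def satisfies_t_closeness(records, qids, sensitive, t):
    """True if every equivalence class stays within distance t of the whole table."""
    overall = distribution([r[sensitive] for r in records])
    classes = defaultdict(list)
    for r in records:
        # Records sharing all (generalized) QID values form one equivalence class.
        classes[tuple(r[q] for q in qids)].append(r[sensitive])
    return all(
        equal_distance_emd(distribution(vals), overall) <= t
        for vals in classes.values()
    )


# Made-up toy table: two equivalence classes of four records each.
toy = (
    [{"age": "20-39", "capital-gain": g} for g in (0, 0, 0, 5000)]
    + [{"age": "40-59", "capital-gain": g} for g in (0, 0, 5000, 5000)]
)
print(satisfies_t_closeness(toy, ["age"], "capital-gain", t=0.2))
```

In this toy table both classes sit at a distance of 0.125 from the overall distribution, so the check passes for a t of 0.2 and would fail for a t of 0.1.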
### `hours-per-week`
This attribute has a relatively high separation (76.24%), and since its values are quite distinctive, it
could be cross-referenced with another dataset to help identify individuals, so it's classified as a QID.
### `native-country`
While this attribute might be regarded as a QID, it presents a really low separation value (19.65%) in this
dataset, so it's classified as Insensitive.
### `prediction`
This is the target attribute, the one the other attributes are used to predict, and is therefore Insensitive.
# Privacy risks in the original dataset
In the original dataset, nearly 40% of records have a more than 50% risk of re-identification by
@ -179,8 +211,10 @@ it's potentially easier to re-identify the individuals in question.
All attacker models show a success rate of more than 50%, which is not acceptable.
# Applying anonymization models
## k-Anonymity
We opted for 8-anonymity, for its tradeoff between maximal risk and suppression.
@ -188,12 +222,14 @@ We opted for 8-anonymity, for its tradeoff between maximal risk and suppression
t-closeness was chosen for `capital-gain` and `capital-loss`
(sensitive attributes).
### Re-identification risk
The average re-identification risk dropped to nearly 0%, whereas the
maximal risk dropped to 12.5%. The success rate for all attacker
models was reduced drastically, to 1.3%.
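
The 12.5% maximal risk follows directly from k = 8: under the usual per-record risk of 1 / (size of the record's equivalence class), the smallest allowed class of 8 records gives 1/8 = 12.5%. The sketch below illustrates this under that assumption; the function name and the toy records are hypothetical and not part of the report's tooling.

```python
from collections import Counter


def reidentification_risks(records, qids):
    """Return (k, maximal risk, average risk), taking risk as 1 / class size."""
    classes = Counter(tuple(r[q] for q in qids) for r in records)
    k = min(classes.values())            # smallest equivalence class
    max_risk = 1 / k                     # worst-case record
    # Each class contributes size * (1 / size) = 1 to the sum of risks.
    avg_risk = len(classes) / len(records)
    return k, max_risk, avg_risk


# Made-up toy table: one class of 8 records and one of 24.
toy = (
    [{"age": "20-39", "sex": "Male"}] * 8
    + [{"age": "40-59", "sex": "Female"}] * 24
)
k, max_risk, avg_risk = reidentification_risks(toy, ["age", "sex"])
print(k, max_risk, avg_risk)
```

The average risk falls well below the maximum whenever most records sit in classes much larger than k, which is consistent with an average close to 0% alongside a 12.5% maximum.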
### Utility
The original Classification Performance, a measure of how well the attributes
@ -205,6 +241,7 @@ value being equal across all attributes suggests entire rows were
removed, rather than select values from separate rows. The only
exception is the `occupation` attribute, which was entirely removed.
### Effect of parameters
At a suppression limit of 0%, the same accuracy is maintained, but the
@ -242,15 +279,18 @@ increased Generalization Intensity values.
Adjusting the coding model had no significant effect.
## $(\epsilon, \delta)$-Differential Privacy
With the default $\epsilon$ value of 2 and a $\delta$ value of
$10^{-6}$, the model performed well on both risk and utility.
### Re-identification risk
The risk indicators for every attacker model were between 0.1% and 0.9%.
### Utility
The original Classification Performance was 83.24% and it remained
@ -259,6 +299,7 @@ at 80.97%.
Nearly 16% of attributes are missing, with the exception of `age` and
`education-num`, which are 100% missing.
### Effect of parameters
An $\epsilon$ value of 3 maintained the accuracy at 80.5% with
@ -270,12 +311,14 @@ performance of 82.05% and a missings value of 21.02% for all attributes.
A further increase of $\delta$ to $10^{-4}$ resulted in an increased
accuracy of 82.32%, but a maximal risk of 1.25%.
# Results
The 8-anonymity model was chosen as it resulted in a broader
distribution of values for attributes like `age`, whereas with Differential
Privacy, they were split into only 2 categories.
# Observations
We noted that the contingency between `sex` and `relationship` maintained
