hierarchies
README.md
@ -33,22 +33,28 @@ Attribute | Classification
Table: Attribute classifications

## Justifications

The vast majority of attributes present low distinction values. This is consistent with the nature of
the dataset, considering that `fnlwgt` should indicate the number of individuals that share the same
set of attributes.
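
The distinction and separation figures quoted throughout this section can be illustrated with a minimal sketch. It assumes the usual definitions (distinction as the share of distinct values among all records, separation as the share of record pairs whose values differ); the function names and the toy column are hypothetical and not taken from the dataset or from any tool.

```python
from collections import Counter
from math import comb


def distinction(values):
    """Fraction of distinct values among all records."""
    values = list(values)
    return len(set(values)) / len(values)


def separation(values):
    """Fraction of record pairs whose values differ (needs at least two records)."""
    values = list(values)
    total_pairs = comb(len(values), 2)
    # Pairs that share a value are NOT separated by this attribute.
    same_pairs = sum(comb(count, 2) for count in Counter(values).values())
    return (total_pairs - same_pairs) / total_pairs


# Hypothetical toy column, not taken from the dataset.
sex = ["Male", "Male", "Male", "Female"]
print(distinction(sex))  # 2 distinct values over 4 records
print(separation(sex))   # 3 differing pairs over 6 pairs
```

Under these definitions a two-valued attribute can never separate more than about half of all record pairs, which is why a binary column such as `sex` shows a separation below 50% regardless of how the values are distributed.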

### `age`

Following HIPAA recommendations, and given its very high separation value (99.87%), we classify
this attribute as a QID.

![Hierarchy for attribute `age`](coding-model/hierarchies/age/age.png){width=14cm}

### `workclass`

This attribute presents a relatively low separation value (49.71%) and, given how generic it is, it's
deemed Insensitive.

### `fnlwgt`

Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID
@ -75,10 +81,14 @@ U.S.A., the origin of the dataset, which implies this attribute is not a count,
We also note there are substantially more Male than Female records (more than double the `fnlwgt`).

### `education`

This attribute presents a separation of 80.96%, which is quite high, so we classified it as a QID.

![Hierarchy for attribute `education`](coding-model/hierarchies/education/education.png){width=14cm}

### `education-num`

We exported the anonymized dataset and used the following command to verify there weren't any discrepancies
@ -93,25 +103,40 @@ representation of the `education` attribute. As such, this attribute
receives the same classification, which is backed by the equally high
separation value of 80.96%, so it's classified as a QID.

![Hierarchy for attribute `education-num`](coding-model/hierarchies/education/education-num.png){width=14cm}

### `marital-status`

With a relatively high separation value of 66.01%, together with the fact that it could be
cross-referenced with other available datasets, we classify this attribute as a QID.

![Hierarchy for attribute `marital-status`](coding-model/hierarchies/marital-status/marital-status.png){width=14cm}

### `occupation`

With a separation of 90.02%, this attribute is classified as a QID.

![Hierarchy for attribute `occupation`](coding-model/hierarchies/occupation.png){width=14cm}

### `relationship`

Given its separation value of 73.21%, this attribute is classified as a QID.

![Hierarchy for attribute `relationship`](coding-model/hierarchies/relationship/relationship.png){width=14cm}

### `race`

This column presents some unusually specific values (Amer-Indian-Eskimo), but has a separation of only 25.98%. Since
this attribute could be cross-referenced with other datasets, it is nevertheless classified as a QID, so that
it may be generalized into broader values.

![Hierarchy for attribute `race`](coding-model/hierarchies/race.png)

### `sex`

Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
@ -140,6 +165,9 @@ Doctorate | 86 | 327
Table: Number of records with each `education` for each `sex`
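
A contingency table like the one captioned above can be tallied with a few lines of standard-library Python. This is only a sketch: the `contingency` helper and the sample records are hypothetical, and the counts are not the ones from the dataset.

```python
from collections import Counter


def contingency(rows, row_key, col_key):
    """Count records for each (row_key value, col_key value) pair."""
    return Counter((row[row_key], row[col_key]) for row in rows)


# Hypothetical records, not taken from the dataset.
records = [
    {"education": "Doctorate", "sex": "Female"},
    {"education": "Doctorate", "sex": "Male"},
    {"education": "Doctorate", "sex": "Male"},
    {"education": "Bachelors", "sex": "Female"},
]

table = contingency(records, "education", "sex")
for (education, sex), count in sorted(table.items()):
    print(education, sex, count)
```

Pairs that never occur are simply absent from the `Counter` and read back as zero.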

![Hierarchy for attribute `sex`](coding-model/hierarchies/sex.png)

### `capital-gain` & `capital-loss`

With a separation of 15.93% and 9.15% respectively, these attributes are not QIDs. They're qualified as
@ -150,20 +178,24 @@ A t-closeness privacy model was chosen for these attributes, with a
value of t of 0.2. This reasoning is discussed in Applying
anonymization models > k-Anonymity > Effect of parameters.
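
To make the role of the threshold concrete, here is a minimal sketch of a t-closeness check. It assumes the equal-ground-distance variant, where the Earth Mover's Distance reduces to half the L1 distance between the class distribution and the overall distribution; the variant actually used for this report is not stated, and an ordered distance may be more appropriate for numeric attributes. All names and sample values are hypothetical.

```python
from collections import Counter


def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def equal_distance_emd(p, q):
    """EMD under equal ground distance: half the L1 distance between p and q."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def is_t_close(classes, t):
    """True if every equivalence class stays within distance t of the overall distribution."""
    overall = distribution([value for cls in classes for value in cls])
    return all(equal_distance_emd(distribution(cls), overall) <= t for cls in classes)


# Hypothetical sensitive values (e.g. a capital gain) grouped by equivalence class.
classes = [[0, 0, 0, 5000], [0, 0, 5000, 0], [0, 0, 0, 0]]
print(is_t_close(classes, 0.2))
print(is_t_close(classes, 0.1))
```

A smaller t forces each class to mirror the overall distribution more closely, which generally requires more generalization or suppression.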

### `hours-per-week`

This attribute has a relatively high separation (76.24%) and, since it contains highly specific values, it
could be cross-referenced with another dataset to help identify individuals, so it's classified as a QID.

### `native-country`

While this attribute might be regarded as a QID, it presents a very low separation value (19.65%) in this
dataset, so it's qualified as Insensitive.

### `prediction`

This is the target attribute, the attribute the other attributes predict, and is therefore Insensitive.

# Privacy risks in the original dataset

In the original dataset, nearly 40% of records have a more than 50% risk of re-identification by
@ -179,8 +211,10 @@ it's potentially easier to re-identify the individuals in question.
All attacker models show a success rate of more than 50%, which is not acceptable.

# Applying anonymization models

## k-Anonymity

We opted for 8-anonymity, for its tradeoff between maximal risk and suppression.
@ -188,12 +222,14 @@ We opted for 8-anonymity, for its tradeoff between maximal risk and suppression
t-closeness was chosen for `capital-gain` and `capital-loss`
(sensitive attributes).

### Re-identification risk

The average re-identification risk dropped to nearly 0%, whereas the
maximal risk dropped to 12.5% (that is, 1/8, the upper bound implied by
8-anonymity). The success rate for all attacker models was reduced
drastically, to 1.3%.
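
The link between k and the maximal risk can be sketched as follows, assuming a prosecutor-style attacker for whom the risk of a record is one over the size of its equivalence class. The helper names, the QID choice, and the sample records are hypothetical.

```python
from collections import Counter


def class_sizes(records, qids):
    """Size of each equivalence class (records sharing all QID values)."""
    return Counter(tuple(record[q] for q in qids) for record in records)


def reidentification_risks(records, qids):
    """Return (maximal risk, average risk), with per-record risk = 1 / class size."""
    sizes = class_sizes(records, qids)
    per_record = [1 / sizes[tuple(record[q] for q in qids)] for record in records]
    return max(per_record), sum(per_record) / len(per_record)


# Hypothetical generalized records: one class of 8 and one class of 16.
records = (
    [{"age": "30-39", "sex": "Male"}] * 8
    + [{"age": "40-49", "sex": "Female"}] * 16
)

maximal, average = reidentification_risks(records, ["age", "sex"])
print(maximal)  # smallest class has 8 records, so 1/8
print(average)  # number of classes divided by number of records
```

With every class holding at least k records, the maximal risk cannot exceed 1/k, so 8-anonymity caps it at 12.5%.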

### Utility

The original Classification Performance, a measure of how well the attributes
@ -205,6 +241,7 @@ value being equal across all attributes suggests entire rows were
removed, rather than select values from separate rows. The only
exception is the `occupation` attribute, which was entirely removed.

### Effect of parameters

At a suppression limit of 0%, the same accuracy is maintained, but the
@ -242,15 +279,18 @@ increased Generalization Intensity values.
Adjusting the coding model had no significant effects.

## $(\epsilon, \delta)$-Differential Privacy

With the default $\epsilon$ value of 2 and a $\delta$ value of
$10^{-6}$, the model performed well in terms of both risk and utility.

### Re-identification risk

All risk indicators for each attacker model were between 0.1% and 0.9%.

### Utility

The original Classification Performance was 83.24% and it remained
@ -259,6 +299,7 @@ at 80.97%.
Nearly 16% of each attribute's values are missing, with the exception of `age` and
`education-num`, which are 100% missing.

### Effect of parameters

An $\epsilon$ value of 3 maintained the accuracy at 80.5% with
@ -270,12 +311,14 @@ performance of 82.05% and a missings value of 21.02% for all attributes.
A further increase of $\delta$ to $10^{-4}$ resulted in an increased
accuracy of 82.32%, but a maximal risk of 1.25%.

# Results

The 8-anonymity model was chosen as it resulted in a broader
distribution of values for attributes like `age`, whereas with Differential
Privacy, they were split into only 2 categories.

# Observations

We noted that the contingency between `sex` and `relationship` maintained
BIN coding-model/hierarchies/age/age.ahs
BIN coding-model/hierarchies/age/age.png (24 KiB)
BIN coding-model/hierarchies/education/education-num.ahs
BIN coding-model/hierarchies/education/education-num.png (22 KiB)
BIN coding-model/hierarchies/education/education.ahs
BIN coding-model/hierarchies/education/education.png (88 KiB)
BIN coding-model/hierarchies/marital-status/marital-status.ahs
BIN coding-model/hierarchies/marital-status/marital-status.png (24 KiB)
BIN coding-model/hierarchies/occupation.png (24 KiB)
BIN coding-model/hierarchies/race.png (7.4 KiB)
BIN coding-model/hierarchies/relationship/relationship.ahs
BIN coding-model/hierarchies/relationship/relationship.png (13 KiB)
BIN coding-model/hierarchies/sex.png (3.3 KiB)