Update report. Nearly done
parent
cd02c5d7fb
commit
f8661ac889
232
report.md
@ -21,11 +21,11 @@ Attribute | Classification
|
|||||||
`marital-status` | QID
|
`marital-status` | QID
|
||||||
`occupation` | QID
|
`occupation` | QID
|
||||||
`relationship` | QID
|
`relationship` | QID
|
||||||
`race` | Sensitive
|
`race` | QID
|
||||||
`sex` | QID
|
`sex` | QID
|
||||||
`capital-gain` | Sensitive
|
`capital-gain` | Sensitive
|
||||||
`capital-loss` | Sensitive
|
`capital-loss` | Sensitive
|
||||||
`hours-per-week` | Insensitive
|
`hours-per-week` | QID
|
||||||
`native-country` | Insensitive
|
`native-country` | Insensitive
|
||||||
`prediction` | Insensitive
|
`prediction` | Insensitive
|
||||||
|
|
||||||
@ -44,7 +44,7 @@ attribute is classified as a QID.
|
|||||||
### `workclass`
|
### `workclass`
|
||||||
|
|
||||||
This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's
|
This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's
|
||||||
deemed insensitive.
|
deemed Insensitive.
|
||||||
|
|
||||||
### `fnlwgt`
|
### `fnlwgt`
|
||||||
|
|
||||||
@ -54,7 +54,7 @@ original dataset. This can be seen with the results below. Additionally, it's no
|
|||||||
to another auxiliary info dataset.
|
to another auxiliary info dataset.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
|
$ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
|
||||||
END {for(sex in count){print sex, count[sex]}}'
|
END {for(sex in count){print sex, count[sex]}}'
|
||||||
```
|
```
|
||||||
|
|
||||||
@ -70,6 +70,8 @@ Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}
|
|||||||
The sum of these values is 6,179,373,392. This value is much larger than the population of the
|
The sum of these values is 6,179,373,392. This value is much larger than the population of the
|
||||||
U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
|
U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
|
||||||
|
|
||||||
|
We also note there are substantially more Male than Female records (more than double the `fnlwgt`).
|
||||||
|
|
||||||
### `education`
|
### `education`
|
||||||
|
|
||||||
This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
|
This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
|
||||||
@ -77,9 +79,17 @@ as a QID.
|
|||||||
|
|
||||||
### `education-num`
|
### `education-num`
|
||||||
|
|
||||||
As a numerical representation of the `education` attribute, this attribute receives the same
|
We exported the anonymized dataset and used the following command to verify there weren't any discrepancies between the
|
||||||
classification, which is backed by the equally high separation value of 80.96%, so it's qualified as
|
`education` and `education-num` columns:
|
||||||
a QID.
|
|
||||||
|
```bash
|
||||||
|
$ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
|
||||||
|
```
|
||||||
|
|
||||||
|
Since there was a one-to-one mapping, we concluded this was just a
|
||||||
|
representation of the `education` attribute. As such, this attribute
|
||||||
|
receives the same classification, which is backed by the equally high
|
||||||
|
separation value of 80.96%, so it's qualified as a QID.
|
||||||
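The one-to-one check described above can also be sketched as a small shell function. This is only a minimal sketch: it assumes the export is semicolon-separated with a header row, and that `education` and `education-num` are fields 4 and 5, as in the command above. The name `check_one_to_one` is hypothetical.

```bash
# Hypothetical helper: reads semicolon-separated rows on stdin (header skipped)
# and prints "one-to-one" if fields 4 and 5 always pair up; otherwise it
# prints one "conflict" line per mismatch.
check_one_to_one() {
  awk -F';' 'NR > 1 {
      a = $4; b = $5
      # Same field-4 value seen before with a different field-5 value
      if ((a in fwd) && fwd[a] != b) { print "conflict: " a " -> " fwd[a] ", " b; bad = 1 }
      # Same field-5 value seen before with a different field-4 value
      if ((b in rev) && rev[b] != a) { print "conflict: " b " <- " rev[b] ", " a; bad = 1 }
      fwd[a] = b; rev[b] = a
  }
  END { if (!bad) print "one-to-one" }'
}
```

It could be run as `check_one_to_one < anonymized.csv`; any `conflict` line would point at a pair that breaks the mapping.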
|
|
||||||
### `marital-status`
|
### `marital-status`
|
||||||
|
|
||||||
@ -97,7 +107,7 @@ Given it's separation value of 73.21%, this attribute is classified as a QID.
|
|||||||
### `race`
|
### `race`
|
||||||
|
|
||||||
This column presents some weirdly specified values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
|
This column presents some weirdly specified values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
|
||||||
that this attribute could be cross-referenced with other datasets, it is classified as Sensitive, so
|
that this attribute could be cross-referenced with other datasets, it is classified as a QID, so
|
||||||
it may be transformed into more generic values.
|
it may be transformed into more generic values.
|
||||||
|
|
||||||
### `sex`
|
### `sex`
|
||||||
@ -105,33 +115,178 @@ it may be transformed into more generic values.
|
|||||||
Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
|
Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
|
||||||
it can be easily cross referenced with other datasets.
|
it can be easily cross referenced with other datasets.
|
||||||
|
|
||||||
We noted this dataset seems to contain more males than females. See @tbl:sex_weight
|
We noted this dataset seems to contain more males than females. See @tbl:sex_weight and the following table:
|
||||||
|
|
||||||
|
`education` | Female | Male
|
||||||
|
-------------+-------:+----:
|
||||||
|
Preschool | 16 | 35
|
||||||
|
1st-4th | 46 | 122
|
||||||
|
5th-6th | 84 | 249
|
||||||
|
7th-8th | 160 | 486
|
||||||
|
9th | 144 | 370
|
||||||
|
10th | 295 | 638
|
||||||
|
11th | 432 | 743
|
||||||
|
12th | 144 | 289
|
||||||
|
HS-grad | 3390 | 7111
|
||||||
|
Some-college | 2806 | 4485
|
||||||
|
Assoc-voc | 500 | 882
|
||||||
|
Assoc-acdm | 421 | 646
|
||||||
|
Bachelors | 1619 | 3736
|
||||||
|
Masters | 536 | 1187
|
||||||
|
Prof-school | 92 | 484
|
||||||
|
Doctorate | 86 | 327
|
||||||
|
|
||||||
|
Table: Number of records with each `education` for each `sex` {#tbl:education_sex}
|
||||||
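The command behind the table above is not shown. A table of this kind could be produced with a sketch along the following lines. It assumes `adult_data.csv` has a header row and uses `, ` as separator, that `sex` is field 10 (as in the `fnlwgt` command) and that `education` is field 4 (an assumption). The name `crosstab` is hypothetical.

```bash
# Hypothetical helper: counts records for each pair of values found in the
# two given field numbers. Input on stdin, header row skipped.
crosstab() {  # $1 = row field number, $2 = column field number
  awk -F', *' -v r="$1" -v c="$2" '
      NR > 1 { n[$(r) " | " $(c)]++ }
      END    { for (k in n) print k " | " n[k] }' | LC_ALL=C sort
}
```

For example, `crosstab 4 10 < adult_data.csv` would print one `education | sex | count` line per combination.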
|
|
||||||
|
### `capital-gain` & `capital-loss`
|
||||||
|
|
||||||
|
With a separation of 15.93% and 9.15% respectively, these attributes are not QIDs. They're qualified as
|
||||||
|
Sensitive, as the individuals may not want their capital gains and
|
||||||
|
losses publicly known.
|
||||||
|
|
||||||
|
A t-closeness privacy model was chosen for these attributes, with a
|
||||||
|
value of t of 0.2. This reasoning is discussed in Applying
|
||||||
|
anonymization models > k-Anonymity > Effect of parameters.
|
||||||
|
|
||||||
|
### `hours-per-week`
|
||||||
|
|
||||||
|
This attribute has a relatively high separation (76.24%) and since it had really unique values, it
|
||||||
|
could be cross-referenced with another dataset to help identify individuals, so it's classified as a QID.
|
||||||
|
|
||||||
### `native-country`
|
### `native-country`
|
||||||
|
|
||||||
While this attribute might be regarded as a QID, it presents really low separation values (19.65%) in this
|
While this attribute might be regarded as a QID, it presents really low separation values (19.65%) in this
|
||||||
dataset, so it's qualified as Sensitive.
|
dataset, so it's qualified as Insensitive.
|
||||||
|
|
||||||
----------------
|
### `prediction`
|
||||||
|
|
||||||
|
This is the target attribute, the attribute the other attributes predict, and is therefore Insensitive.
|
||||||
|
|
||||||
Higer Precision (Generation Intensity) implies the attributes are closer to the ones in the original dataset, therefore
|
# Privacy risks in the original dataset
|
||||||
provide higher utility.
|
|
||||||
|
In the original dataset, nearly 40% of records have a more than 50% risk of re-identification by
|
||||||
|
a prosecutor. In general, we see a stepped distribution of the record risk, which indicates some
|
||||||
|
privacy model was already applied to the dataset, however to a different standard than what we
|
||||||
|
intend.
|
||||||
|
|
||||||
|
All records had really high uniqueness percentage even for small sampling factors, according to the
|
||||||
|
Zayatz, Pitman and Dankar methods. Only SNB indicated a low uniqueness percentage for sampling factors
|
||||||
|
under 90%. This means that, with only a fraction of the original dataset, a very significant
|
||||||
|
number of records was sufficiently unique that it could be distinguished among the rest, which means
|
||||||
|
it's potentially easier to re-identify the individuals in question.
|
||||||
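The estimators named above predict population uniqueness. As a simpler illustration of the underlying idea, the sketch below only measures sample uniqueness: the share of records whose combination of QID values occurs exactly once in the file. It assumes a comma-separated input with a header row. The name `sample_uniqueness` and the field numbers passed to it are hypothetical.

```bash
# Hypothetical helper: prints the percentage of records that are unique
# on the given fields. Input on stdin, header row skipped.
sample_uniqueness() {  # $1 = comma-separated field numbers, e.g. "4,6,10"
  awk -F',' -v cols="$1" 'BEGIN { n = split(cols, c, ",") }
  NR > 1 {
      key = ""
      for (i = 1; i <= n; i++) key = key FS $(c[i])   # build the QID combination
      count[key]++; total++
  }
  END {
      for (k in count) if (count[k] == 1) u++         # classes of size one
      if (total) printf "%.2f%%\n", 100 * u / total
  }'
}
```

A call such as `sample_uniqueness "4,6,10" < adult_data.csv` would report how many records stand alone on those three fields.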
|
|
||||||
|
All attacker models show a success rate of more than 50%, which is not acceptable.
|
||||||
|
|
||||||
|
# Applying anonymization models
|
||||||
|
|
||||||
|
## k-Anonymity
|
||||||
|
|
||||||
|
We opted for 8-anonymity, for its trade-off between maximal risk and suppression.
|
||||||
|
|
||||||
|
t-closeness was chosen for `capital-gain` and `capital-loss`
|
||||||
|
(sensitive attributes).
|
||||||
|
|
||||||
|
### Re-identification risk
|
||||||
|
|
||||||
|
The average re-identification risk dropped to nearly 0%, whereas the
|
||||||
|
maximal risk dropped to 12.5%. The success rate for all attacker
|
||||||
|
models was reduced drastically, to 1.3%.
|
||||||
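The 12.5% maximal risk is what 8-anonymity implies: under the prosecutor model, a record's risk is commonly taken as 1 divided by the size of its equivalence class, and every class holds at least 8 records, so the risk is at most 1/8. The sketch below illustrates that relation. It assumes a semicolon-separated export with a header row, and it counts fully suppressed rows as an ordinary class. The name `max_prosecutor_risk` and the field numbers passed to it are hypothetical.

```bash
# Hypothetical helper: finds the smallest equivalence class over the given
# fields and prints the resulting maximal prosecutor risk (1 / class size).
max_prosecutor_risk() {  # $1 = comma-separated QID field numbers, e.g. "8,10"
  awk -F';' -v cols="$1" 'BEGIN { n = split(cols, c, ",") }
  NR > 1 {
      key = ""
      for (i = 1; i <= n; i++) key = key FS $(c[i])   # build the QID combination
      size[key]++
  }
  END {
      min = 0
      for (k in size) if (min == 0 || size[k] < min) min = size[k]
      if (min > 0) printf "min class size %d, max risk %.1f%%\n", min, 100 / min
  }'
}
```

On an 8-anonymous export, the smallest class should have at least 8 records, giving a maximal risk of at most 12.5%.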
|
|
||||||
|
### Utility
|
||||||
|
|
||||||
|
The original Classification Performance, a measure of how well the attributes
|
||||||
|
predict the target variable (`prediction`), was 83.24% and it remained
|
||||||
|
at 82.45%.
|
||||||
|
|
||||||
|
10.07% of attribute values are missing from the anonymized dataset. This
|
||||||
|
value being equal across all attributes suggests entire rows were
|
||||||
|
removed, rather than select values from separate rows. The only
|
||||||
|
exception is the `occupation` attribute, which was entirely removed.
|
||||||
|
|
||||||
|
### Effect of parameters
|
||||||
|
|
||||||
|
At a suppression limit of 0%, the same accuracy is maintained, but the
|
||||||
|
vast majority of QIDs are entirely removed.
|
||||||
|
|
||||||
|
At a suppression limit of 5%, roughly the same prediction accuracy is
|
||||||
|
maintained, with around 4.5% of values missing, however with really
|
||||||
|
high Generalization Intensity values for some attributes (e.g. 95.42%
|
||||||
|
for `sex`, 93.87% for `race` and 91.47% for `education` and
|
||||||
|
`education-num`). `occupation` was entirely removed.
|
||||||
|
|
||||||
|
At a suppression limit of 10%, the prediction accuracy is maintained,
|
||||||
|
with around 9.8% of values missing. However, the Gen. Intensity drops
|
||||||
|
to around 90%.
|
||||||
|
|
||||||
|
At a suppression limit of 20%, accuracy is maintained, once again,
|
||||||
|
with around 10% of values missing, indicating this would be the
|
||||||
|
optimal setting, as the same results are achieved with a limit of
|
||||||
|
100%.
|
||||||
|
|
||||||
|
With the t-closeness t value for `capital-gain` and `capital-loss` set to
|
||||||
|
0.001 (the default), anonymization fails, not producing any output.
|
||||||
|
|
||||||
|
At a t value of 0.01, accuracy drops to 75% and most attributes have
|
||||||
|
missing values of 100%.
|
||||||
|
|
||||||
|
At a t value of 0.1, classification accuracy is nearly 81%, but
|
||||||
|
missing values are around 20%.
|
||||||
|
|
||||||
|
At a t value of 0.2, the chosen value, the accuracy is 82.5% with
|
||||||
|
lower Gen. Intensity values.
|
||||||
|
|
||||||
|
At a t value of 0.5, the classification accuracy goes to 82.2% with
|
||||||
|
increased Generalization Intensity values.
|
||||||
|
|
||||||
|
Adjusting the coding model had no significant effects.
|
||||||
|
|
||||||
|
## $(\epsilon, \delta)$-Differential Privacy
|
||||||
|
|
||||||
|
With the default $\epsilon$ value of 2 and a $\delta$ value of
|
||||||
|
$10^{-6}$, the performance was really good.
|
||||||
|
|
||||||
|
### Re-identification risk
|
||||||
|
|
||||||
|
All risk indicators for each attacker model were between 0.1% and 0.9%.
|
||||||
|
|
||||||
|
### Utility
|
||||||
|
|
||||||
|
The original Classification Performance was 83.24% and it remained
|
||||||
|
at 80.97%.
|
||||||
|
|
||||||
|
Nearly 16% of attributes are missing, with the exception of `age` and
|
||||||
|
`education-num`, which are 100% missing.
|
||||||
|
|
||||||
|
### Effect of parameters
|
||||||
|
|
||||||
|
An $\epsilon$ value of 3 maintained the accuracy at 80.5% with
|
||||||
|
missing values around 32%.
|
||||||
|
|
||||||
|
An increase of $\delta$ to $10^{-5}$ resulted in a classification
|
||||||
|
performance of 82.05% and 21.02% missing values for all attributes.
|
||||||
|
|
||||||
|
A further increase of $\delta$ to $10^{-4}$ resulted in an increased
|
||||||
|
accuracy of 82.32%, but a maximal risk of 1.25%.
|
||||||
|
|
||||||
|
# Results
|
||||||
|
|
||||||
|
The 8-anonymity model was chosen as it resulted in a broader
|
||||||
|
distribution of attribute values like `age`, whereas with Differential
|
||||||
|
Privacy, they were split into only 2 categories.
|
||||||
|
|
||||||
|
# Observations
|
||||||
|
|
||||||
We noted that the contingency between `sex` and `relationship` maintained the same distribution after anonymization,
|
We noted that the contingency between `sex` and `relationship` maintained the same distribution after anonymization,
|
||||||
meaning that these changes don't mean `relationship` can identify an individual's `sex` any more than in the original dataset.
|
meaning that these changes don't mean `relationship` can identify an individual's `sex` any more than in the original dataset.
|
||||||
|
|
||||||
We exported the anonymized dataset and used the following command to verify there weren't any discrepencies between the
|
With the following commands, we noted some possible errors in the
|
||||||
`education` and `education-num` columns:
|
original dataset, where the `sex` and `relationship` attributes didn't
|
||||||
|
map entirely one to one: there was one occurrence of (Husband, Female)
|
||||||
|
and two of (Wife, Male). It's possible this is an error in the
|
||||||
|
original dataset.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
|
$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
|
||||||
```
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
|
|
||||||
```
|
|
||||||
|
|
||||||
1 Husband, Female
|
1 Husband, Female
|
||||||
2 Wife, Male
|
2 Wife, Male
|
||||||
@ -145,26 +300,19 @@ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 |
|
|||||||
3875 Not-in-family, Female
|
3875 Not-in-family, Female
|
||||||
4430 Not-in-family, Male
|
4430 Not-in-family, Male
|
||||||
13192 Husband, Male
|
13192 Husband, Male
|
||||||
|
|
||||||
```
|
|
||||||
~/projects/uni/DataAnonymisation/ (master)$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f8,10 | sort | uniq -c | sort -n
|
|
||||||
```
|
```
|
||||||
|
|
||||||
1 Husband Female
|
```bash
|
||||||
2 Wife Male
|
$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
|
||||||
168 Other-relative *
|
|
||||||
336 Own-child *
|
1295 {Husband, Wife} Female
|
||||||
342 Other-relative Female
|
2264 {Other-relative, Own-child} Female
|
||||||
471 Other-relative Male
|
2981 {Other-relative, Own-child} Male
|
||||||
552 Wife *
|
3280 * *
|
||||||
573 Unmarried Male
|
4391 {Unmarried, Not-in-family} Male
|
||||||
728 Unmarried *
|
5713 {Unmarried, Not-in-family} Female
|
||||||
1014 Wife Female
|
12637 {Husband, Wife} Male
|
||||||
1649 Not-in-family *
|
```
|
||||||
2042 Husband *
|
|
||||||
2081 Own-child Female
|
Since there were occurrences of (Wife, Male), "({Husband, Wife}, Male)"
|
||||||
2145 Unmarried Female
|
does not undo the transformation of the `relationship` attribute.
|
||||||
2651 Own-child Male
|
|
||||||
3209 Not-in-family Female
|
|
||||||
3447 Not-in-family Male
|
|
||||||
11150 Husband Male
|
|
||||||
|