---
title: Privacy-Preserving Data Publishing
subtitle: Assignment \#4
author:
- Diogo Cordeiro (up201705417)
- Hugo Sales (up201704178)
date: 2022/06/02
---

# Attribute classification

We classified the attributes as follows:

Attribute        | Classification
-----------------+---------------
`age`            | QID
`workclass`      | Insensitive
`fnlwgt`         | Insensitive
`education`      | QID
`education-num`  | QID
`marital-status` | QID
`occupation`     | QID
`relationship`   | QID
`race`           | Sensitive
`sex`            | QID
`capital-gain`   | Sensitive
`capital-loss`   | Sensitive
`hours-per-week` | Insensitive
`native-country` | Insensitive
`prediction`     | Insensitive

Table: Attribute classifications

## Justifications

The vast majority of attributes present extremely low values of distinction. We speculate this may
be an TODO

### `age`

Following HIPAA recommendations, and given its very high separation value (99.87%), this
attribute is classified as a QID.
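
The distinction and separation values quoted in these justifications are taken as given. As a rough sketch of how such figures could be reproduced on the command line, assuming the common definitions (distinction is the share of distinct values among all records, separation is the share of record pairs that differ in the attribute), a helper along these lines might be used. The name `attr_stats` and its arguments are hypothetical and only serve as illustration.

```bash
# attr_stats FILE COLUMN
# Hypothetical helper: prints distinction and separation (in %) for one
# comma-separated column, skipping the header row.
# Assumed definitions:
#   distinction = distinct values / records
#   separation  = record pairs with different values / all record pairs
attr_stats() {
  tail -n +2 "$1" | awk -F',' -v col="$2" '
    { n[$col]++; total++ }
    END {
      if (total < 2) { print "need at least two records"; exit 1 }
      # Pairs of records sharing the same value, summed over all values
      for (v in n) { distinct++; same += n[v] * (n[v] - 1) / 2 }
      pairs = total * (total - 1) / 2
      printf "distinction=%.2f%% separation=%.2f%%\n",
             100 * distinct / total, 100 * (pairs - same) / pairs
    }'
}
```

Under these assumptions, `attr_stats adult_data.csv 1` would report the figures for `age`. Results may differ slightly from the tool's own numbers if it treats missing values differently.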

### `workclass`

This attribute presents a relatively low separation value (49.71%) and, given how generic it is, it is
deemed Insensitive.

### `fnlwgt`

Despite its high distinction (66.48%) and separation (99.99%) values, the `fnlwgt` column is not a QID
because it represents a weight, not a count of individuals in the same equivalence class in the
original dataset. This can be seen in the results below. Additionally, it cannot easily be linked
to an auxiliary dataset.

```bash
# Sum the `fnlwgt` column (field 3) for each value of `sex` (field 10), skipping the header row
tail -n +2 adult_data.csv | awk -F',' '{ count[$10] += $3 }
    END { for (sex in count) print sex, count[sex] }'
```

Resulting in:

Sex    | Sum
-------+-----------
Female | 2000673518
Male   | 4178699874

Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}

These values sum to 6,179,373,392, which is far larger than the population of the U.S.A., where the
dataset originates. This implies that the attribute is not a count of individuals, as stated.

### `education`

This attribute presents a separation of 80.96%, which is quite high, so it is classified
as a QID.

### `education-num`

As a numerical representation of the `education` attribute, this attribute receives the same
classification, which is backed by the equally high separation value of 80.96%, so it is classified as
a QID.

### `marital-status`

With a relatively high separation value of 66.01%, together with the fact that it could be
cross-referenced with other available datasets, we classify this attribute as a QID.

### `occupation`

With a separation of 90.02%, this attribute is classified as a QID.

### `relationship`

Given its separation value of 73.21%, this attribute is classified as a QID.

### `race`

This column contains some oddly specified values (e.g. `Amer-Indian-Eskimo`) and has a separation of
25.98%. Given that this attribute could be cross-referenced with other datasets, it is classified as
Sensitive, so that it may be transformed into more generic values.

### `sex`

Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
it can easily be cross-referenced with other datasets.

We noted that this dataset seems to contain more males than females (see @tbl:sex_weight).

### `native-country`

While this attribute might be regarded as a QID, it presents a really low separation value (19.65%) in
this dataset, so it is classified as Sensitive.

----------------

Higher Precision (Generalization Intensity) implies that the attribute values are closer to the ones in the
original dataset and therefore provide higher utility.

We noted that the contingency between `sex` and `relationship` kept the same distribution after anonymization,
meaning that `relationship` does not identify an individual's `sex` any more than it did in the original dataset.
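
One way to compare two such contingency tables is to look at the share of each `sex` value within every `relationship` value rather than at raw counts. The sketch below assumes tab-separated `relationship`/`sex` pairs on standard input. The helper name `sex_share` is hypothetical.

```bash
# sex_share: reads "relationship<TAB>sex" pairs on stdin and prints, for each
# distinct pair, its percentage within that relationship value.
sex_share() {
  awk -F'\t' '
    { pair[$1 FS $2]++; rel[$1]++ }
    END {
      for (p in pair) {
        split(p, k, FS)   # k[1] = relationship, k[2] = sex
        printf "%s\t%s\t%.1f%%\n", k[1], k[2], 100 * pair[p] / rel[k[1]]
      }
    }' | sort
}
```

For example, assuming the anonymized export is comma-separated without spaces, `tail -n +2 anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -f8,10 | sex_share` would print the shares for the anonymized dataset. Similar percentages for both files would support the observation above.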

We exported the anonymized dataset and used the following command to verify that there weren't any
discrepancies between the `education` and `education-num` columns:

```bash
# List the unique (education, education-num) pairs, i.e. columns 4 and 5.
# sed turns the separating commas into tabs, which is the default delimiter of cut.
sed -r 's/,([^ ])/\t\1/g' anonymized.csv | cut -f4,5 | sort -u
```
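
Instead of inspecting the unique pairs by eye, the same check can be automated. The following sketch assumes tab-separated `education`/`education-num` pairs on standard input (for example, the output of the pipeline above) and reports every `education` value that is paired with more than one `education-num`. The name `check_mapping` is hypothetical.

```bash
# check_mapping: reads "education<TAB>education-num" pairs on stdin
# (duplicate pairs are fine). Prints each education value that appears with
# more than one education-num and returns a non-zero status if any exists.
check_mapping() {
  awk -F'\t' '
    !seen[$1 FS $2]++ { nums[$1]++ }   # count distinct numbers per education
    END {
      for (e in nums) if (nums[e] > 1) { print e; bad = 1 }
      exit bad + 0
    }'
}
```

This only checks one direction. Swapping the two columns before piping them in checks whether an `education-num` is shared by several `education` values.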

```bash
# Count each (relationship, sex) combination in the original dataset (columns 8 and 10)
tail -n +2 adult_data.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
```

        1 Husband, Female
        2 Wife, Male
      430 Other-relative, Female
      551 Other-relative, Male
      792 Unmarried, Male
     1566 Wife, Female
     2245 Own-child, Female
     2654 Unmarried, Female
     2823 Own-child, Male
     3875 Not-in-family, Female
     4430 Not-in-family, Male
    13192 Husband, Male

```bash
# Count each (relationship, sex) combination in the anonymized dataset (columns 8 and 10)
tail -n +2 anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -f8,10 | sort | uniq -c | sort -n
```

        1 Husband Female
        2 Wife Male
      168 Other-relative *
      336 Own-child *
      342 Other-relative Female
      471 Other-relative Male
      552 Wife *
      573 Unmarried Male
      728 Unmarried *
     1014 Wife Female
     1649 Not-in-family *
     2042 Husband *
     2081 Own-child Female
     2145 Unmarried Female
     2651 Own-child Male
     3209 Not-in-family Female
     3447 Not-in-family Male
    11150 Husband Male