Start justifying classifications in reports

Hugo Sales 2022-06-04 17:25:22 +01:00
parent 2a52f44f4e
commit a7a612e29e
Signed by untrusted user who does not match committer: someonewithpc
GPG Key ID: 7D0C7EAFC9D835A0

136
report.md

@ -1,10 +1,120 @@
Despite high values of distinction (66.48%) and separation (99.99%), the `fnlwgt` column is not a QID because it represents a weight, not a
count of individuals in the same equivalence class in the original dataset. Additionally, it's not easily connected to
another auxiliary info dataset.
---
title: Privacy-Preserving Data Publishing
subtitle: Assignment \#4
author:
- Diogo Cordeiro (up201705417)
- Hugo Sales (up201704178)
date: 2022/06/02
---
We determined that `age` is a QID, since it's widely regarded as such in all datasets, according to HIPAA recommendations.
# Attribute classification
We classified the attributes as follows:
Attribute | Classification
-----------------+---------------
`age` | QID
`workclass` | Insensitive
`fnlwgt` | Insensitive
`education` | QID
`education-num` | QID
`marital-status` | QID
`occupation` | QID
`relationship` | QID
`race` | Sensitive
`sex` | QID
`capital-gain` | Sensitive
`capital-loss` | Sensitive
`hours-per-week` | Insensitive
`native-country` | Insensitive
`prediction` | Insensitive
Table: Attribute classifications
## Justifications
The vast majority of attributes present extremely low values of distinction. We speculate this may
be an TODO
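For reference, distinction and separation can be recomputed from a single column. The sketch below assumes the usual definitions (distinction as the share of distinct values among all records, separation as the share of record pairs whose values differ), which may not match the tool's exact computation. The `column_stats` helper and the four-row sample are hypothetical and not part of the original analysis.

```bash
#!/usr/bin/env bash
# column_stats FILE COLUMN
# FILE has a header row and ", "-separated fields (as adult_data.csv appears to).
# Assumed definitions: distinction = distinct values / records,
# separation = pairs of records with differing values / all pairs.
column_stats() {
  tail -n +2 "$1" | awk -F', *' -v col="$2" '
    { freq[$col]++; n++ }
    END {
      for (v in freq) {
        distinct++
        same += freq[v] * (freq[v] - 1) / 2   # pairs sharing this value
      }
      pairs = n * (n - 1) / 2                 # needs at least two records
      printf "distinction=%.2f%% separation=%.2f%%\n",
             100 * distinct / n, 100 * (pairs - same) / pairs
    }'
}

# Hypothetical sample, not taken from adult_data.csv.
sample=$(mktemp)
printf 'age, sex\n39, Male\n50, Male\n38, Female\n39, Female\n' > "$sample"
column_stats "$sample" 1
column_stats "$sample" 2
rm -f "$sample"
```

On the real file, the same helper could be called as `column_stats adult_data.csv 3` for `fnlwgt`, assuming the column order used elsewhere in this report.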
### `age`
According to HIPAA recommendations, and given its very high separation value (99.87%), this
attribute is classified as a QID.
### `workclass`
This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's
deemed insensitive.
### `fnlwgt`
Despite high values of distinction (66.48%) and separation (99.99%), the `fnlwgt` column is not a QID
because it represents a weight, not a count of individuals in the same equivalence class in the
original dataset. This can be seen with the results below. Additionally, it's not easily connected
to another auxiliary info dataset.
```bash
tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
END {for(sex in count){print sex, count[sex]}}'
```
Resulting in:
Sex | Sum
-------+--------
Female | 2000673518
Male | 4178699874
Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}
The sum of these values is 6,179,373,392, which is much larger than the population of the
U.S.A., the origin of the dataset. This implies the attribute is a weight rather than a count of individuals.
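The arithmetic can be checked with a few lines of shell. The two sums are copied from the table above; the population figure is an assumed round number of 330 million, used only to give a sense of scale.

```bash
# Per-sex sums of fnlwgt, copied from the table above.
female=2000673518
male=4178699874
total=$((female + male))

# Assumption: a round 330 million for the U.S. population (scale only).
us_population=330000000

# Integer division, so the ratio is rounded down.
echo "total=$total ratio=$((total / us_population))"
```

Under that assumption the summed weights are roughly 18 times the population, which would be impossible if `fnlwgt` counted individuals.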
### `education`
This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
as a QID.
### `education-num`
As a numerical representation of the `education` attribute, this attribute receives the same
classification (QID), which is backed by the equally high separation value of 80.96%.
### `marital-status`
With a relatively high separation value of 66.01%, together with the fact that it could be
cross-referenced with other available datasets, we classify this attribute as a QID.
### `occupation`
With a separation of 90.02%, this attribute is classified as a QID.
### `relationship`
Given its separation value of 73.21%, this attribute is classified as a QID.
### `race`
This column contains some unusually specific values (Amer-Indian-Eskimo) and has a separation of 25.98%. Given
that this attribute could be cross-referenced with other datasets, it is classified as Sensitive, so
it may be transformed into more generic values.
### `sex`
Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
it can be easily cross-referenced with other datasets.
We noted this dataset seems to contain more males than females. See @tbl:sex_weight.
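Since @tbl:sex_weight sums weights rather than records, a direct record count per value is a useful complement. The following is a minimal sketch; `count_by_column` and the three-row sample are hypothetical, and the column index for `sex` in `adult_data.csv` is assumed to be 10, as in the `awk` command above.

```bash
# count_by_column FILE COLUMN
# Prints the number of records for each value of a ","-separated column,
# skipping the header row and trimming the space that follows each comma.
count_by_column() {
  tail -n +2 "$1" | cut -d',' -f"$2" | sed 's/^ *//' | sort | uniq -c | sort -n
}

# Hypothetical sample, not taken from adult_data.csv.
sample=$(mktemp)
printf 'age, sex\n39, Male\n50, Male\n38, Female\n' > "$sample"
count_by_column "$sample" 2
rm -f "$sample"
```

On the real file this would be `count_by_column adult_data.csv 10`.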
### `native-country`
While this attribute might be regarded as a QID, it presents a very low separation value (19.65%) in this
dataset, so it's classified as Sensitive.
----------------
We noted this dataset contains more males than females.
Higher Precision (Generation Intensity) implies the attributes are closer to the ones in the original dataset, and therefore
provide higher utility.
@ -16,11 +126,13 @@ We exported the anonymized dataset and used the following command to verify ther
`education` and `education-num` columns:
```bash
cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
```
```bash
cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
```
/projects/uni/DataAnonymisation/ (master)$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
1 Husband, Female
2 Wife, Male
430 Other-relative, Female
@ -33,7 +145,11 @@ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
3875 Not-in-family, Female
4430 Not-in-family, Male
13192 Husband, Male
```
~/projects/uni/DataAnonymisation/ (master)$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f8,10 | sort | uniq -c | sort -n
```
1 Husband Female
2 Wife Male
168 Other-relative *
@ -52,9 +168,3 @@ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
3209 Not-in-family Female
3447 Not-in-family Male
11150 Husband Male