Start justifying classifications in reports

This commit is contained in:
Hugo Sales 2022-06-04 17:25:22 +01:00
parent 2a52f44f4e
commit a7a612e29e
Signed by untrusted user who does not match committer: someonewithpc
GPG Key ID: 7D0C7EAFC9D835A0

136
report.md

@ -1,10 +1,120 @@
Despite high values of distinction (66.48%) and separation (99.99%), the `fnlwgt` column is not a QID because it represents a weight, not a
count of individuals in the same equivalence class in the original dataset. Additionally, it's not easily connected to
another auxiliary dataset.
---
title: Privacy-Preserving Data Publishing
subtitle: Assignment \#4
author:
- Diogo Cordeiro (up201705417)
- Hugo Sales (up201704178)
date: 2022/06/02
---
We determined that `age` is a QID, since it's widely regarded as such in all datasets, according to HIPAA recommendations.
# Attribute classification
We classified the attributes as follows:
Attribute | Classification
-----------------+---------------
`age` | QID
`workclass` | Insensitive
`fnlwgt` | Insensitive
`education` | QID
`education-num` | QID
`marital-status` | QID
`occupation` | QID
`relationship` | QID
`race` | Sensitive
`sex` | QID
`capital-gain` | Sensitive
`capital-loss` | Sensitive
`hours-per-week` | Insensitive
`native-country` | Insensitive
`prediction` | Insensitive
Table: Attribute classifications
## Justifications
The vast majority of attributes present extremely low values of distinction. We speculate this may
be an TODO
### `age`
According to HIPAA recommendations, and given its very high separation value (99.87%), this
attribute is classified as a QID.
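The distinction and separation percentages quoted in these justifications can be recomputed from the raw CSV. The following is a minimal sketch, assuming the usual definitions (distinction: number of distinct values divided by the number of records; separation: fraction of record pairs that differ on the attribute) and assuming `adult_data.csv` has a header row and comma-separated fields. `column_stats` is a hypothetical helper name introduced for illustration, not part of any tool used in this report.

```bash
# Hypothetical helper: prints distinction and separation (in %) of one CSV column.
# Assumes a header row (skipped) and comma-separated fields.
# usage: column_stats FILE COLUMN_NUMBER
column_stats() {
  tail -n +2 "$1" | awk -F',' -v col="$2" '
    NF >= col { count[$col]++; n++ }   # tally each value of the chosen column
    END {
      if (n < 2) { print "need at least two records"; exit 1 }
      for (v in count) {
        distinct++                               # one more distinct value
        same += count[v] * (count[v] - 1) / 2    # pairs sharing this value
      }
      pairs = n * (n - 1) / 2                    # all unordered record pairs
      printf "distinction %.2f%% separation %.2f%%\n",
             100 * distinct / n, 100 * (1 - same / pairs)
    }'
}
```

For example, `column_stats adult_data.csv 1` would report the figures for `age`, assuming it is the first column of the file.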
### `workclass`
This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's
deemed insensitive.
### `fnlwgt`
Despite high values of distinction (66.48%) and separation (99.99%), the `fnlwgt` column is not a QID
because it represents a weight, not a count of individuals in the same equivalence class in the
original dataset. This can be seen with the results below. Additionally, it's not easily connected
to another auxiliary dataset.
```bash
tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
END {for(sex in count){print sex, count[sex]}}'
```
Resulting in:
Sex | Sum
-------+--------
Female | 2000673518
Male | 4178699874
Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}
The sum of these values is 6,179,373,392. This value is much larger than the population of the
U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
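The grand total can also be computed in a single pass instead of adding the per-sex sums by hand. This is a minimal sketch under the same assumptions as the command above (header row skipped, comma-separated fields, `fnlwgt` in column 3); `column_total` is a hypothetical helper name.

```bash
# Hypothetical helper: total of a numeric CSV column, header row skipped.
# usage: column_total FILE COLUMN_NUMBER
column_total() {
  tail -n +2 "$1" | awk -F',' -v col="$2" '
    { total += $col }
    # %.0f rather than %d: some awk implementations clamp %d to 32 bits
    END { printf "%.0f\n", total }'
}
```

If the file matches the one used here, `column_total adult_data.csv 3` should reproduce the total quoted above, which can then be compared against a population figure of your choice.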
### `education`
This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
as a QID.
### `education-num`
As a numerical representation of the `education` attribute, this attribute receives the same
classification, which is backed by the equally high separation value of 80.96%, so it's classified as
a QID.
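The claim that `education-num` is just a numeric encoding of `education` can be checked by looking for values of one column that pair with more than one value of the other. The following is a minimal sketch, assuming a header row, comma-separated fields, and that the two attributes are columns 4 and 5; `ambiguous_mappings` is a hypothetical helper name.

```bash
# Hypothetical helper: prints each value of column A that is paired with
# more than one distinct value of column B, followed by how many.
# Prints nothing when A determines B.
# usage: ambiguous_mappings FILE COL_A COL_B
ambiguous_mappings() {
  tail -n +2 "$1" | awk -F',' -v a="$2" -v b="$3" '
    !seen[$a FS $b]++ { partners[$a]++ }   # count distinct (A, B) pairs per A
    END { for (v in partners) if (partners[v] > 1) print v, partners[v] }'
}
```

Running it in both directions (`ambiguous_mappings adult_data.csv 4 5` and `ambiguous_mappings adult_data.csv 5 4`) and getting no output from either would indicate a one-to-one correspondence.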
### `marital-status`
With a relatively high separation value of 66.01%, together with the fact that it could be
cross-referenced with other available datasets, we classify this attribute as a QID.
### `occupation`
With a separation of 90.02%, this attribute is classified as a QID.
### `relationship`
Given its separation value of 73.21%, this attribute is classified as a QID.
### `race`
This column contains some unusually specified values (e.g. `Amer-Indian-Eskimo`) and has a separation of 25.98%. Given
that this attribute could be cross-referenced with other datasets, it is classified as Sensitive, so
it may be transformed into more generic values.
### `sex`
Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
it can be easily cross-referenced with other datasets.
We noted this dataset seems to contain more males than females. See @tbl:sex_weight.
### `native-country`
While this attribute might be regarded as a QID, it presents a very low separation value (19.65%) in this
dataset, so it's classified as Sensitive.
----------------
We noted this dataset contains more males than females.
Higher Precision (Generation Intensity) implies the attributes are closer to the ones in the original dataset, and therefore
provide higher utility.
@ -16,11 +126,13 @@ We exported the anonymized dataset and used the following command to verify ther
`education` and `education-num` columns:
```bash
cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
```
```bash
cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
```
/projects/uni/DataAnonymisation/ (master)$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
1 Husband, Female
2 Wife, Male
430 Other-relative, Female
@ -33,7 +145,11 @@ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
3875 Not-in-family, Female
4430 Not-in-family, Male
13192 Husband, Male
```
~/projects/uni/DataAnonymisation/ (master)$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f8,10 | sort | uniq -c | sort -n
```
1 Husband Female
2 Wife Male
168 Other-relative *
@ -52,9 +168,3 @@ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
3209 Not-in-family Female
3447 Not-in-family Male
11150 Husband Male