---
title: Privacy-Preserving Data Publishing
subtitle: Assignment \#4
author:
  - Diogo Cordeiro (up201705417)
  - Hugo Sales (up201704178)
date: 2022/06/02
---

# Attribute classification

We classified the attributes as follows:

Attribute        | Classification
-----------------+---------------
`age`            | QID
`workclass`      | Insensitive
`fnlwgt`         | Insensitive
`education`      | QID
`education-num`  | QID
`marital-status` | QID
`occupation`     | QID
`relationship`   | QID
`race`           | Sensitive
`sex`            | QID
`capital-gain`   | Sensitive
`capital-loss`   | Sensitive
`hours-per-week` | Insensitive
`native-country` | Insensitive
`prediction`     | Insensitive

Table: Attribute classifications

## Justifications

The vast majority of attributes present extremely low values of distinction. We speculate this may
be an TODO

### `age`

In line with HIPAA recommendations, and given its very high separation value (99.87%), this
attribute is classified as a QID.
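
The separation and distinction percentages quoted throughout this section can be approximated
directly from the CSV. The sketch below is not the anonymization tool's implementation. It assumes
that distinction is the share of distinct values in a column and that separation is the share of
record pairs whose values differ in that column. The helper names `distinction` and `separation`
are hypothetical.

```bash
# distinction FILE COLUMN: percentage of distinct values in COLUMN.
# Assumes FILE has a header row and comma-separated fields.
distinction() {
    tail -n +2 "$1" | awk -F',' -v col="$2" '
        { seen[$col] = 1; n++ }
        END {
            if (n == 0) { print "0.00"; exit }
            d = 0
            for (v in seen) d++
            printf "%.2f\n", 100 * d / n
        }'
}

# separation FILE COLUMN: percentage of record pairs that differ in COLUMN.
# Pairs sharing a value are counted per value as f * (f - 1) / 2.
separation() {
    tail -n +2 "$1" | awk -F',' -v col="$2" '
        { freq[$col]++; n++ }
        END {
            if (n < 2) { print "0.00"; exit }
            same = 0
            for (v in freq) same += freq[v] * (freq[v] - 1) / 2
            total = n * (n - 1) / 2
            printf "%.2f\n", 100 * (total - same) / total
        }'
}

# Usage (assuming `age` is the first column): separation adult_data.csv 1
```

Counting equal pairs per value avoids comparing every pair of records explicitly, which keeps the
computation linear in the number of rows.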

### `workclass`

This attribute presents a relatively low separation value (49.71%) and is quite generic, so it is
deemed Insensitive.

### `fnlwgt`

Despite high values of distinction (66.48%) and separation (99.99%), the `fnlwgt` column is not a
QID because it represents a weight, not a count of individuals in the same equivalence class in the
original dataset. The results below support this. Additionally, it cannot easily be linked to an
auxiliary dataset.

```bash
tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
    END {for(sex in count){print sex, count[sex]}}'
```

Resulting in:

Sex    | Sum
-------+--------
Female | 2000673518
Male   | 4178699874

Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}

The sum of these values is 6,179,373,392. This value is much larger than the population of the
U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.

### `education`

This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
as a QID.

### `education-num`

As a numerical representation of the `education` attribute, this attribute receives the same
classification, which is backed by the equally high separation value of 80.96%, so it is classified
as a QID.

### `marital-status`

With a relatively high separation value of 66.01%, together with the fact that it could be
cross-referenced with other available datasets, we classify this attribute as a QID.

### `occupation`

With a separation of 90.02%, this attribute is classified as a QID.

### `relationship`

Given its separation value of 73.21%, this attribute is classified as a QID.

### `race`

This column contains some unusually specific values (e.g. `Amer-Indian-Eskimo`), but has a
separation of only 25.98%. Given that this attribute could be cross-referenced with other datasets,
it is classified as Sensitive, so it may be transformed into more generic values.

### `sex`

Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
it can be easily cross-referenced with other datasets.

We noted that this dataset seems to contain more males than females; see @tbl:sex_weight.
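
Note that @tbl:sex_weight sums weights rather than records. A minimal sketch for counting records
per value of a column instead, assuming `sex` is the tenth comma-separated field, as in the `awk`
command used for `fnlwgt` (the helper name `count_by_column` is hypothetical):

```bash
# count_by_column FILE COLUMN: number of records per distinct value of COLUMN.
# Assumes FILE has a header row and comma-separated fields.
count_by_column() {
    tail -n +2 "$1" | awk -F',' -v col="$2" '
        {
            value = $col
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim padding around the value
            count[value]++
        }
        END { for (v in count) print v, count[v] }' | LC_ALL=C sort
}

# Usage: count_by_column adult_data.csv 10
```

Adding up the `relationship` and `sex` counts listed at the end of this report gives 10,771 female
and 21,790 male records, which is the split this helper should reproduce.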


### `native-country`

While this attribute might be regarded as a QID, it presents a very low separation value (19.65%)
in this dataset, so it is classified as Sensitive.

----------------


Higher Precision (Generalization Intensity) implies that the attributes are closer to the ones in the original dataset
and therefore provide higher utility.

We noted that the contingency between `sex` and `relationship` maintained the same distribution after anonymization,
meaning that `relationship` does not identify an individual's `sex` any more than it does in the original dataset.

We exported the anonymized dataset and used the following command to verify that there were no discrepancies between
the `education` and `education-num` columns:

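```bash
# Turn field-separating commas (those not followed by a space) into tabs,
# then list the unique (education, education-num) pairs.
# cut splits on tabs by default, so no -d option is needed.
sed -r 's/,([^ ])/\t\1/g' anonymized.csv | cut -f4,5 | sort -u
```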
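
Listing the unique pairs relies on reading them by eye. A minimal sketch that flags violations
automatically, assuming the check is that every `education` value maps to exactly one
`education-num` value and vice versa (the helper name `check_one_to_one` is hypothetical):

```bash
# check_one_to_one: reads tab-separated pairs on stdin and exits non-zero
# if any left value has several right values, or the other way round.
check_one_to_one() {
    awk -F'\t' '
        { pair = $1 FS $2 }
        !(pair in seen) { seen[pair] = 1; left[$1]++; right[$2]++ }
        END {
            bad = 0
            for (k in left)  if (left[k]  > 1) { print "maps to several values: " k; bad = 1 }
            for (k in right) if (right[k] > 1) { print "maps to several values: " k; bad = 1 }
            exit bad
        }'
}

# Usage:
#   sed -r 's/,([^ ])/\t\1/g' anonymized.csv | cut -f4,5 | check_one_to_one
```

If the anonymization generalizes one of the two columns more than the other, this stricter check
may report pairs that the visual inspection would accept.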

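To compare the contingency between `relationship` and `sex`, we counted each pair of values in the original dataset:

```bash
# The original file separates fields with ", ", so cut can split on commas directly.
tail -n +2 adult_data.csv | cut -d',' -f8,10 | sort | uniq -c | sort -n
```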

          1  Husband, Female
          2  Wife, Male
        430  Other-relative, Female
        551  Other-relative, Male
        792  Unmarried, Male
       1566  Wife, Female
       2245  Own-child, Female
       2654  Unmarried, Female
       2823  Own-child, Male
       3875  Not-in-family, Female
       4430  Not-in-family, Male
      13192  Husband, Male

The same count on the anonymized dataset:

```bash
# The anonymized export has no space after its commas, so convert them to tabs first.
tail -n +2 anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -f8,10 | sort | uniq -c | sort -n
```

          1 Husband	Female
          2 Wife	Male
        168 Other-relative	*
        336 Own-child	*
        342 Other-relative	Female
        471 Other-relative	Male
        552 Wife	*
        573 Unmarried	Male
        728 Unmarried	*
       1014 Wife	Female
       1649 Not-in-family	*
       2042 Husband	*
       2081 Own-child	Female
       2145 Unmarried	Female
       2651 Own-child	Male
       3209 Not-in-family	Female
       3447 Not-in-family	Male
      11150 Husband	Male
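
The two listings above give raw counts, which are hard to compare because the anonymized dataset
contains suppressed values (`*`). A minimal sketch that turns such pairs into per-`relationship`
percentages, ignoring suppressed `sex` values (the helper name `conditional_share` is hypothetical):

```bash
# conditional_share: reads tab-separated "relationship<TAB>sex" pairs on stdin
# and prints, for each relationship, the percentage of each non-suppressed sex.
conditional_share() {
    awk -F'\t' '
        $2 != "*" { pair[$1 FS $2]++; total[$1]++ }
        END {
            for (p in pair) {
                split(p, key, FS)
                printf "%s\t%s\t%.2f\n", key[1], key[2], 100 * pair[p] / total[key[1]]
            }
        }' | LC_ALL=C sort
}

# Usage:
#   tail -n +2 adult_data.csv | awk -F', ' -v OFS='\t' '{ print $8, $10 }' | conditional_share
#   tail -n +2 anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -f8,10 | conditional_share
```

Comparing the two outputs side by side shows whether the share of each `sex` within a
`relationship` value shifted after anonymization.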