Fix typos and fill todo

Diogo Peralta Cordeiro 2022-06-05 17:24:09 +01:00
parent b26bbc7168
commit de366a6571
Signed by: diogo
GPG Key ID: 18D2D35001FBFAB0
3 changed files with 36 additions and 28 deletions

@ -1,10 +1,12 @@
---
title: Privacy-Preserving Data Publishing
subtitle: Assignment \#4
title: "Security and Privacy - Assignment 4"
subtitle: "Privacy-Preserving Data Publishing"
author:
- Diogo Cordeiro (up201705417)
- Hugo Sales (up201704178)
- Diogo Cordeiro (201705417)
- Hugo Sales (201704178)
date: 2022/06/02
geometry: margin=2cm
output: pdf_document
---
# Attribute classification
@ -33,13 +35,14 @@ Table: Attribute classifications
## Justifications
The vast majority of attributes present extremely low values of distinction. We speculate this may
be an TODO
The vast majority of attributes present low values of distinction. This is consistent with the nature of
the dataset, considering that `fnlwgt` should indicate the quantity of individuals that present the same
set of attributes.
### `age`
According to HIPPA recommendations, and together with it's very high separation value (99.87%), this
attribute is classified as a QID.
According to HIPAA recommendations, and together with its very high separation value (99.87%), we classify
this attribute as a QID.
### `workclass`
@ -49,11 +52,11 @@ deemed Insensitive.
### `fnlwgt`
Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID
becuase it represents a weight, not a count of individuals in the same equivalence class in the
because it represents a weight, not a count of individuals in the same equivalence class in the
original dataset. This can be seen with the results below. Additionally, it's not easily connected
to another auxiliary info dataset.
```bash
```sh
$ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
END {for(sex in count){print sex, count[sex]}}'
```
@ -65,7 +68,7 @@ Sex | Sum
Female | 2000673518
Male | 4178699874
Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}
Table: Sum of `fnlwgt` for each `sex`
The sum of these values is 6,179,373,392. This value is much larger than the population of the
U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
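
As a quick sanity check of the total quoted above, the two per-sex sums from the table can be added directly. This is only a sketch: the variable names are illustrative, and it assumes a shell whose arithmetic is 64-bit, since the total does not fit in 32 bits.

```sh
# Per-sex sums of `fnlwgt`, copied from the table above.
female=2000673518
male=4178699874

# Assumption: the shell uses 64-bit integer arithmetic (true for current
# bash and dash builds); the result exceeds the 32-bit range.
total=$((female + male))
echo "$total"
```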
@ -74,22 +77,21 @@ We also note there are substantially more Male than Female records (more than do
### `education`
This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
as a QID.
This attribute presents a separation of 80.96%, which is quite high, thus we classified it as a QID.
### `education-num`
We exported the anonymized dataset and used the following command to verify there weren't any discrepencies between the
`education` and `education-num` columns:
We exported the anonymized dataset and used the following command to verify there weren't any discrepancies
between the `education` and `education-num` columns:
```bash
```sh
$ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
```
Since there was a one-to-one mapping, we concluded this was just a
Since there was a one-to-one mapping, we confirmed this was just a
representation of the `education` attribute. As such, this attribute
receives the same classification, which is backed by the equally high
separation value of 80.96%, so it's qualified as a QID.
separation value of 80.96%, so it's classified as a QID.
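
The one-to-one check described above was done by inspecting the sorted pairs. It could also be automated along these lines. This is a minimal sketch, not the command we ran: `check_one_to_one` is a hypothetical helper, it assumes the two columns were already extracted as tab-separated pairs, and the rows fed to it are illustrative.

```sh
# Hypothetical helper: reads "a<TAB>b" pairs on stdin and reports whether
# every `a` maps to exactly one `b` and every `b` to exactly one `a`.
check_one_to_one() {
    awk -F'\t' '
        {
            # Flag a conflict when a value was already paired with a different partner
            if (($1 in fwd) && fwd[$1] != $2) bad = 1
            if (($2 in rev) && rev[$2] != $1) bad = 1
            fwd[$1] = $2
            rev[$2] = $1
        }
        END {
            if (bad) { print "not one-to-one"; exit 1 }
            print "one-to-one"
        }
    '
}

# Illustrative rows only; repeated identical pairs are not a conflict.
printf 'Bachelors\t13\nMasters\t14\nBachelors\t13\n' | check_one_to_one
```

The helper exits with a non-zero status when a conflict is found, so it can also be used in a script condition.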
### `marital-status`
@ -106,7 +108,7 @@ Given it's separation value of 73.21%, this attribute is classified as a QID.
### `race`
This collumn presents some weirdly specified values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
This column presents some weirdly specific values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
that this attribute could be cross-referenced with other datasets, it is classified as a QID, so
it may be transformed into more generic values.
@ -115,7 +117,7 @@ it may be transformed into more generic values.
Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
it can be easily cross referenced with other datasets.
We noted this dataset seems to more males than females. See @tbl:sex_weight and the following table
We noted this dataset seems to have more males than females. See Table 2 and the following table
`education` | Female | Male
-------------+-------:+----:
@ -136,7 +138,7 @@ Masters | 536 | 1187
Prof-school | 92 | 484
Doctorate | 86 | 327
Table: Number of records with each `education` for each `sex` {#tbl:education_sex}
Table: Number of records with each `education` for each `sex`
### `capital-gain` & `capital-loss`
@ -276,8 +278,10 @@ Privacy, they were split into only 2 categories.
# Observations
We noted that the contingency between `sex` and `relationship` maintained the same distribution after anonymization,
meaning that these changes don't mean `relationship` can identify an individual's `sex` any more than in the original dataset.
We noted that the contingency between `sex` and `relationship` maintained
the same distribution after anonymization, meaning that these changes don't
mean `relationship` can identify an individual's `sex` any more than in the
original dataset.
With the following commands, we noted some possible errors in the
original dataset, where the `sex` and `relationship` attributes didn't
@ -285,8 +289,9 @@ map entirely one to one: there was one occurence of (Husband, Female)
and two of (Wife, Male). It's possible this is an error in the
original dataset.
```bash
$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
```sh
$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
cut -d',' -f8,10 | sort | uniq -c | sort -n
1 Husband, Female
2 Wife, Male
@ -302,8 +307,9 @@ $ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10
13192 Husband, Male
```
```bash
$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
```sh
$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
1295 {Husband, Wife} Female
2264 {Other-relative, Own-child} Female
@ -314,5 +320,5 @@ $ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10
12637 {Husband, Wife} Male
```
Since there were occurences of (Wide, Male), "({Husband, Wife}, Male)"
Since there were occurrences of (Wife, Male), "({Husband, Wife}, Male)"
does not undo the transformation of the `relationship` attribute.

2
render.sh Executable file

@ -0,0 +1,2 @@
#!/bin/sh
pandoc README.md --pdf-engine=xelatex -o report.pdf

Binary file not shown.