Fix typos and fill todo

Diogo Peralta Cordeiro 2022-06-05 17:24:09 +01:00
parent b26bbc7168
commit de366a6571
Signed by: diogo
GPG Key ID: 18D2D35001FBFAB0
3 changed files with 36 additions and 28 deletions

@ -1,10 +1,12 @@
--- ---
title: Privacy-Preserving Data Publishing title: "Security and Privacy - Assignment 4"
subtitle: Assignment \#4 subtitle: "Privacy-Preserving Data Publishing"
author: author:
- Diogo Cordeiro (up201705417) - Diogo Cordeiro (201705417)
- Hugo Sales (up201704178) - Hugo Sales (201704178)
date: 2022/06/02 date: 2022/06/02
geometry: margin=2cm
output: pdf_document
--- ---
# Attribute classification # Attribute classification
@ -33,13 +35,14 @@ Table: Attribute classifications
## Justifications ## Justifications
The vast majority of attributes present extremely low values of distinction. We speculate this may The vast majority of attributes present low values of distinction. This is consistent with the nature of
be an TODO the dataset, considering that `fnlwgt` should indicate the quantity of individuals that present the same
set of attributes.
### `age` ### `age`
According to HIPAA recommendations, and together with its very high separation value (99.87%), this According to HIPAA recommendations, and together with its very high separation value (99.87%), we classify
attribute is classified as a QID. this attribute as a QID.
### `workclass` ### `workclass`
@ -49,11 +52,11 @@ deemed Insensitive.
### `fnlwgt` ### `fnlwgt`
Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID
becuase it represents a weight, not a count of individuals in the same equivalence class in the because it represents a weight, not a count of individuals in the same equivalence class in the
original dataset. This can be seen with the results below. Additionally, it's not easily connected original dataset. This can be seen with the results below. Additionally, it's not easily connected
to another auxiliary info dataset. to another auxiliary info dataset.
```bash ```sh
$ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \ $ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
END {for(sex in count){print sex, count[sex]}}' END {for(sex in count){print sex, count[sex]}}'
``` ```
@ -65,7 +68,7 @@ Sex | Sum
Female | 2000673518 Female | 2000673518
Male | 4178699874 Male | 4178699874
Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight} Table: Sum of `fnlwgt` for each `sex`
The sum of these values is 6,179,373,392. This value is much larger than the population of the The sum of these values is 6,179,373,392. This value is much larger than the population of the
U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated. U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
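The argument above (the total of `fnlwgt` is too large to be a head count) can be sketched as a reusable check. This is a minimal sketch, not the authors' method: the function name `column_sum_exceeds` and its parameters are hypothetical, and the population limit is left as an argument for the reader to supply.

```sh
# Hypothetical helper (not part of the original report).
# column_sum_exceeds FILE SEP COL LIMIT
# Prints the sum of numeric column COL (header row skipped) and
# exits 0 when that sum is strictly greater than LIMIT, 1 otherwise.
column_sum_exceeds() {
    awk -F"$2" -v c="$3" -v limit="$4" '
        NR > 1 { total += $c }            # skip the header line
        END {
            printf "%.0f\n", total        # print the total without exponent notation
            exit !(total > limit)         # exit status 0 means "exceeds"
        }
    ' "$1"
}
```

Assuming `fnlwgt` is the third column of `adult_data.csv`, as in the command above, a call such as `column_sum_exceeds adult_data.csv ',' 3 "$population"` would succeed when the weights add up to more than the chosen population figure.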
@ -74,22 +77,21 @@ We also note there are substantially more Male than Female records (more than do
### `education` ### `education`
This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified This attribute presents a separation of 80.96%, which is quite high, thus we classified it as a QID.
as a QID.
### `education-num` ### `education-num`
We exported the anonymized dataset and used the following command to verify there weren't any discrepancies between the We exported the anonymized dataset and used the following command to verify there weren't any discrepancies
`education` and `education-num` columns: between the `education` and `education-num` columns:
```bash ```sh
$ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un $ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
``` ```
Since there was a one-to-one mapping, we concluded this was just a Since there was a one-to-one mapping, we confirmed this was just a
representation of the `education` attribute. As such, this attribute representation of the `education` attribute. As such, this attribute
receives the same classification, which is backed by the equally high receives the same classification, which is backed by the equally high
separation value of 80.96%, so it's qualified as a QID. separation value of 80.96%, so it's classified as a QID.
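The one-to-one check described above was done by inspecting the sorted pairs by eye. A minimal sketch of an automated version follows; the helper name `check_one_to_one` and its arguments are hypothetical and not part of the original report.

```sh
# Hypothetical helper (not part of the original report).
# check_one_to_one FILE SEP A B
# Exits 0 when every value in column A pairs with exactly one value in
# column B and vice versa (header row skipped), 1 otherwise.
check_one_to_one() {
    awk -F"$2" -v a="$3" -v b="$4" '
        NR > 1 {
            k = $a; v = $b
            if ((k in fwd) && fwd[k] != v) bad = 1   # A value seen with two B values
            if ((v in rev) && rev[v] != k) bad = 1   # B value seen with two A values
            fwd[k] = v; rev[v] = k
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Assuming `education` and `education-num` are columns 4 and 5 of a `;`-separated export, as in the command above, `check_one_to_one anonymized.csv ';' 4 5 && echo "one-to-one"` would print only when the mapping holds in both directions.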
### `marital-status` ### `marital-status`
@ -106,7 +108,7 @@ Given it's separation value of 73.21%, this attribute is classified as a QID.
### `race` ### `race`
This column presents some weirdly specified values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact This column presents some weirdly specific values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
that this attribute could be cross referenced with other datasets, it is classified as a QID, so that this attribute could be cross referenced with other datasets, it is classified as a QID, so
it may be transformed into more generic values. it may be transformed into more generic values.
@ -115,7 +117,7 @@ it may be transformed into more generic values.
Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
it can be easily cross referenced with other datasets. it can be easily cross referenced with other datasets.
We noted this dataset seems to more males than females. See @tbl:sex_weight and the following table We noted this dataset seems to have more males than females. See Table 2 and the following table
`education` | Female | Male `education` | Female | Male
-------------+-------:+----: -------------+-------:+----:
@ -136,7 +138,7 @@ Masters | 536 | 1187
Prof-school | 92 | 484 Prof-school | 92 | 484
Doctorate | 86 | 327 Doctorate | 86 | 327
Table: Number of records with each `education` for each `sex` {#tbl:education_sex} Table: Number of records with each `education` for each `sex`
### `capital-gain` & `capital-loss` ### `capital-gain` & `capital-loss`
@ -276,8 +278,10 @@ Privacy, they were split into only 2 categories.
# Observations # Observations
We noted that the contingency between `sex` and `relationship` maintained the same distribution after anonymization, We noted that the contingency between `sex` and `relationship` maintained
meaning that these changes don't mean `relationship` can identify an individual's `sex` any more than in the original dataset. the same distribution after anonymization, meaning that these changes don't
mean `relationship` can identify an individual's `sex` any more than in the
original dataset.
With the following commands, we noted some possible errors in the With the following commands, we noted some possible errors in the
original dataset, where the `sex` and `relationship` attributes didn't original dataset, where the `sex` and `relationship` attributes didn't
@ -285,8 +289,9 @@ map entirely one to one: there was one occurence of (Husband, Female)
and two of (Wife, Male). It's possible this is an error in the and two of (Wife, Male). It's possible this is an error in the
original dataset. original dataset.
```bash ```sh
$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n $ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
cut -d',' -f8,10 | sort | uniq -c | sort -n
1 Husband, Female 1 Husband, Female
2 Wife, Male 2 Wife, Male
@ -302,8 +307,9 @@ $ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10
13192 Husband, Male 13192 Husband, Male
``` ```
```bash ```sh
$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t $ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
1295 {Husband, Wife} Female 1295 {Husband, Wife} Female
2264 {Other-relative, Own-child} Female 2264 {Other-relative, Own-child} Female
@ -314,5 +320,5 @@ $ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10
12637 {Husband, Wife} Male 12637 {Husband, Wife} Male
``` ```
Since there were occurrences of (Wide, Male), "({Husband, Wife}, Male)" Since there were occurrences of (Wife, Male), "({Husband, Wife}, Male)"
does not undo the transformation of the `relationship` attribute. does not undo the transformation of the `relationship` attribute.

2
render.sh Executable file

@ -0,0 +1,2 @@
#!/bin/sh
pandoc README.md --pdf-engine=xelatex -o report.pdf

Binary file not shown.