Fix typos and fill todo
parent b26bbc7168
commit de366a6571
@ -1,10 +1,12 @@
|
|||||||
---
|
---
|
||||||
title: Privacy-Preserving Data Publishing
|
title: "Security and Privacy - Assignment 4"
|
||||||
subtitle: Assignment \#4
|
subtitle: "Privacy-Preserving Data Publishing"
|
||||||
author:
|
author:
|
||||||
- Diogo Cordeiro (up201705417)
|
- Diogo Cordeiro (201705417)
|
||||||
- Hugo Sales (up201704178)
|
- Hugo Sales (201704178)
|
||||||
date: 2022/06/02
|
date: 2022/06/02
|
||||||
|
geometry: margin=2cm
|
||||||
|
output: pdf_document
|
||||||
---
|
---
|
||||||
|
|
||||||
# Attribute classification
|
# Attribute classification
|
||||||
@ -33,13 +35,14 @@ Table: Attribute classifications
|
|||||||
|
|
||||||
## Justifications
|
## Justifications
|
||||||
|
|
||||||
The vast majority of attributes present extremely low values of distinction. We speculate this may
|
The vast majority of attributes present low values of distinction. This is consistent with the nature of
|
||||||
be an TODO
|
the dataset, considering that `fnlwgt` should indicate the quantity of individuals that present the same
|
||||||
|
set of attributes.
|
||||||
|
|
||||||
### `age`
|
### `age`
|
||||||
|
|
||||||
According to HIPAA recommendations, and together with its very high separation value (99.87%), this
|
According to HIPAA recommendations, and together with its very high separation value (99.87%), we classify
|
||||||
attribute is classified as a QID.
|
this attribute as a QID.
|
||||||
|
|
||||||
### `workclass`
|
### `workclass`
|
||||||
|
|
||||||
@ -49,11 +52,11 @@ deemed Insensitive.
|
|||||||
### `fnlwgt`
|
### `fnlwgt`
|
||||||
|
|
||||||
Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID
|
Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID
|
||||||
becuase it represents a weight, not a count of individuals in the same equivalence class in the
|
because it represents a weight, not a count of individuals in the same equivalence class in the
|
||||||
original dataset. This can be seen with the results below. Additionally, it's not easily connected
|
original dataset. This can be seen with the results below. Additionally, it's not easily connected
|
||||||
to another auxiliary info dataset.
|
to another auxiliary info dataset.
|
||||||
|
|
||||||
```bash
|
```sh
|
||||||
$ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
|
$ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
|
||||||
END {for(sex in count){print sex, count[sex]}}'
|
END {for(sex in count){print sex, count[sex]}}'
|
||||||
```
|
```
|
||||||
@ -65,7 +68,7 @@ Sex | Sum
|
|||||||
Female | 2000673518
|
Female | 2000673518
|
||||||
Male | 4178699874
|
Male | 4178699874
|
||||||
|
|
||||||
Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}
|
Table: Sum of `fnlwgt` for each `sex`
|
||||||
|
|
||||||
The sum of these values is 6,179,373,392. This value is much larger than the population of the
|
The sum of these values is 6,179,373,392. This value is much larger than the population of the
|
||||||
U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
|
U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
|
||||||
@ -74,22 +77,21 @@ We also note there are substantially more Male than Female records (more than do
|
|||||||
|
|
||||||
### `education`
|
### `education`
|
||||||
|
|
||||||
This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
|
This attribute presents a separation of 80.96%, which is quite high, thus we classified it as a QID.
|
||||||
as a QID.
|
|
||||||
|
|
||||||
### `education-num`
|
### `education-num`
|
||||||
|
|
||||||
We exported the anonymized dataset and used the following command to verify there weren't any discrepancies between the
|
We exported the anonymized dataset and used the following command to verify there weren't any discrepancies
|
||||||
`education` and `education-num` columns:
|
between the `education` and `education-num` columns:
|
||||||
|
|
||||||
```bash
|
```sh
|
||||||
$ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
|
$ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
|
||||||
```
|
```
|
||||||
|
|
||||||
Since there was a one-to-one mapping, we concluded this was just a
|
Since there was a one-to-one mapping, we confirmed this was just a
|
||||||
representation of the `education` attribute. As such, this attribute
|
representation of the `education` attribute. As such, this attribute
|
||||||
receives the same classification, which is backed by the equally high
|
receives the same classification, which is backed by the equally high
|
||||||
separation value of 80.96%, so it's qualified as a QID.
|
separation value of 80.96%, so it's classified as a QID.
|
||||||
|
|
||||||
### `marital-status`
|
### `marital-status`
|
||||||
|
|
||||||
@ -106,7 +108,7 @@ Given its separation value of 73.21%, this attribute is classified as a QID.
|
|||||||
|
|
||||||
### `race`
|
### `race`
|
||||||
|
|
||||||
This column presents some weirdly specified values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
|
This column presents some weirdly specific values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
|
||||||
that this attribute could be cross referenced with other datasets, it is classified as a QID, so
|
that this attribute could be cross referenced with other datasets, it is classified as a QID, so
|
||||||
it may be transformed into more generic values.
|
it may be transformed into more generic values.
|
||||||
|
|
||||||
@ -115,7 +117,7 @@ it may be transformed into more generic values.
|
|||||||
Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
|
Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
|
||||||
it can be easily cross referenced with other datasets.
|
it can be easily cross referenced with other datasets.
|
||||||
|
|
||||||
We noted this dataset seems to more males than females. See @tbl:sex_weight and the following table
|
We noted this dataset seems to have more males than females. See Table 2 and the following table
|
||||||
|
|
||||||
`education` | Female | Male
|
`education` | Female | Male
|
||||||
-------------+-------:+----:
|
-------------+-------:+----:
|
||||||
@ -136,7 +138,7 @@ Masters | 536 | 1187
|
|||||||
Prof-school | 92 | 484
|
Prof-school | 92 | 484
|
||||||
Doctorate | 86 | 327
|
Doctorate | 86 | 327
|
||||||
|
|
||||||
Table: Number of records with each `education` for each `sex` {#tbl:education_sex}
|
Table: Number of records with each `education` for each `sex`
|
||||||
|
|
||||||
### `capital-gain` & `capital-loss`
|
### `capital-gain` & `capital-loss`
|
||||||
|
|
||||||
@ -276,8 +278,10 @@ Privacy, they were split into only 2 categories.
|
|||||||
|
|
||||||
# Observations
|
# Observations
|
||||||
|
|
||||||
We noted that the contingency between `sex` and `relationship` maintained the same distribution after anonymization,
|
We noted that the contingency between `sex` and `relationship` maintained
|
||||||
meaning that these changes don't mean `relationship` can identify an individual's `sex` any more than in the original dataset.
|
the same distribution after anonymization, meaning that these changes don't
|
||||||
|
mean `relationship` can identify an individual's `sex` any more than in the
|
||||||
|
original dataset.
|
||||||
|
|
||||||
With the following commands, we noted some possible errors in the
|
With the following commands, we noted some possible errors in the
|
||||||
original dataset, where the `sex` and `relationship` attributes didn't
|
original dataset, where the `sex` and `relationship` attributes didn't
|
||||||
@ -285,8 +289,9 @@ map entirely one to one: there was one occurrence of (Husband, Female)
|
|||||||
and two of (Wife, Male). It's possible this is an error in the
|
and two of (Wife, Male). It's possible this is an error in the
|
||||||
original dataset.
|
original dataset.
|
||||||
|
|
||||||
```bash
|
```sh
|
||||||
$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
|
$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
|
||||||
|
cut -d',' -f8,10 | sort | uniq -c | sort -n
|
||||||
|
|
||||||
1 Husband, Female
|
1 Husband, Female
|
||||||
2 Wife, Male
|
2 Wife, Male
|
||||||
@ -302,8 +307,9 @@ $ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10
|
|||||||
13192 Husband, Male
|
13192 Husband, Male
|
||||||
```
|
```
|
||||||
|
|
||||||
```bash
|
```sh
|
||||||
$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
|
$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
|
||||||
|
cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
|
||||||
|
|
||||||
1295 {Husband, Wife} Female
|
1295 {Husband, Wife} Female
|
||||||
2264 {Other-relative, Own-child} Female
|
2264 {Other-relative, Own-child} Female
|
||||||
@ -314,5 +320,5 @@ $ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10
|
|||||||
12637 {Husband, Wife} Male
|
12637 {Husband, Wife} Male
|
||||||
```
|
```
|
||||||
|
|
||||||
Since there were occurrences of (Wide, Male), "({Husband, Wife}, Male)"
|
Since there were occurrences of (Wife, Male), "({Husband, Wife}, Male)"
|
||||||
does not undo the transformation of the `relationship` attribute.
|
does not undo the transformation of the `relationship` attribute.
|
2
render.sh
Executable file
@ -0,0 +1,2 @@
|
|||||||
|
#!/bin/sh
|
||||||
|
pandoc README.md --pdf-engine=xelatex -o report.pdf
|
BIN
report.pdf
Binary file not shown.