Fix typos and fill todo
parent b26bbc7168
commit de366a6571
@@ -1,10 +1,12 @@
---
title: Privacy-Preserving Data Publishing
subtitle: Assignment \#4
title: "Security and Privacy - Assignment 4"
subtitle: "Privacy-Preserving Data Publishing"
author:
- Diogo Cordeiro (up201705417)
- Hugo Sales (up201704178)
- Diogo Cordeiro (201705417)
- Hugo Sales (201704178)
date: 2022/06/02
geometry: margin=2cm
output: pdf_document
---

# Attribute classification
@@ -33,13 +35,14 @@ Table: Attribute classifications

## Justifications

The vast majority of attributes present extremely low values of distinction. We speculate this may
be an TODO
The vast majority of attributes present low values of distinction. This is consistent with the nature of
the dataset, considering that `fnlwgt` should indicate the quantity of individuals that present the same
set of attributes.

### `age`

According to HIPPA recommendations, and together with it's very high separation value (99.87%), this
attribute is classified as a QID.
According to HIPAA recommendations, and together with its very high separation value (99.87%), we classify
this attribute as a QID.

### `workclass`

@@ -49,11 +52,11 @@ deemed Insensitive.
### `fnlwgt`

Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID
becuase it represents a weight, not a count of individuals in the same equivalence class in the
because it represents a weight, not a count of individuals in the same equivalence class in the
original dataset. This can be seen with the results below. Additionally, it's not easily connected
to another auxiliary info dataset.

```bash
```sh
$ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
END {for(sex in count){print sex, count[sex]}}'
```
@@ -65,7 +68,7 @@ Sex | Sum
Female | 2000673518
Male | 4178699874

Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}
Table: Sum of `fnlwgt` for each `sex`

The sum of these values is 6,179,373,392. This value is much larger than the population of the
U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
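
The per-`sex` sum computed by the `awk` one-liner can also be sketched in Python, which makes it easy to print the row count next to the weight sum and see that the two are different quantities. This is only a minimal sketch: the three-row `SAMPLE` and the function name `sum_weights_by_group` are hypothetical, and the real `adult_data.csv` has more columns and different values.

```python
import csv
import io
from collections import defaultdict

# Hypothetical three-row sample holding only the two columns of interest;
# the real adult_data.csv has more columns and different values.
SAMPLE = """fnlwgt,sex
77516,Male
83311,Male
215646,Female
"""


def sum_weights_by_group(csv_text, weight_col="fnlwgt", group_col="sex"):
    """Return {group: (row_count, weight_sum)} for the given CSV text."""
    counts = defaultdict(int)
    sums = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        group = row[group_col].strip()
        counts[group] += 1
        sums[group] += int(row[weight_col])
    return {group: (counts[group], sums[group]) for group in counts}


totals = sum_weights_by_group(SAMPLE)
for group, (n_rows, weight_sum) in sorted(totals.items()):
    # The weight sum is far larger than the number of records in the group.
    print(group, n_rows, weight_sum)
```

Reporting both numbers side by side is a design choice made for this sketch: it shows directly that the summed weight is not the number of records in the group.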
@@ -74,22 +77,21 @@ We also note there are substantially more Male than Female records (more than do

### `education`

This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
as a QID.
This attribute presents a separation of 80.96%, which is quite high, thus we classified it as a QID.

### `education-num`

We exported the anonymized dataset and used the following command to verify there weren't any discrepencies between the
`education` and `education-num` columns:
We exported the anonymized dataset and used the following command to verify there weren't any discrepancies
between the `education` and `education-num` columns:

```bash
```sh
$ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
```

Since there was a one-to-one mapping, we concluded this was just a
Since there was a one-to-one mapping, we confirmed this was just a
representation of the `education` attribute. As such, this attribute
receives the same classification, which is backed by the equally high
separation value of 80.96%, so it's qualified as a QID.
separation value of 80.96%, so it's classified as a QID.
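
The one-to-one check between `education` and `education-num` was done by eye on the `sort -un` output. A programmatic version could look like the sketch below. The helper name `is_one_to_one` and the sample pairs are illustrative assumptions and are not taken from the exported file.

```python
from collections import defaultdict


def is_one_to_one(pairs):
    """Return True if every left value maps to exactly one right value and vice versa."""
    forward = defaultdict(set)   # education -> set of education-num values seen
    backward = defaultdict(set)  # education-num -> set of education values seen
    for left, right in pairs:
        forward[left].add(right)
        backward[right].add(left)
    return (all(len(values) == 1 for values in forward.values())
            and all(len(values) == 1 for values in backward.values()))


# Illustrative (education, education-num) pairs, not read from anonymized.csv.
consistent = [("Bachelors", 13), ("HS-grad", 9), ("Bachelors", 13)]
# Adding a second label for the same number breaks the mapping.
inconsistent = consistent + [("Masters", 13)]

print(is_one_to_one(consistent))
print(is_one_to_one(inconsistent))
```

Checking both directions matters: a mapping can be a function from `education` to `education-num` without being one-to-one if two labels share a number.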

### `marital-status`

@@ -106,7 +108,7 @@ Given its separation value of 73.21%, this attribute is classified as a QID.

### `race`

This collumn presents some weirdly specified values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
This column presents some weirdly specific values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
that this attribute could be cross referenced with other datasets, it is classified as a QID, so
it may be transformed into more generic values.

@@ -115,7 +117,7 @@ it may be transformed into more generic values.
Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
it can be easily cross referenced with other datasets.

We noted this dataset seems to more males than females. See @tbl:sex_weight and the following table
We noted this dataset seems to have more males than females. See Table 2 and the following table

`education` | Female | Male
-------------+-------:+----:
@@ -136,7 +138,7 @@ Masters | 536 | 1187
Prof-school | 92 | 484
Doctorate | 86 | 327

Table: Number of records with each `education` for each `sex` {#tbl:education_sex}
Table: Number of records with each `education` for each `sex`

### `capital-gain` & `capital-loss`

@@ -276,8 +278,10 @@ Privacy, they were split into only 2 categories.

# Observations

We noted that the contingency between `sex` and `relationship` maintained the same distribution after anonymization,
meaning that these changes don't mean `relationship` can identify an individual's `sex` any more than in the original dataset.
We noted that the contingency between `sex` and `relationship` maintained
the same distribution after anonymization, meaning that these changes don't
mean `relationship` can identify an individual's `sex` any more than in the
original dataset.

With the following commands, we noted some possible errors in the
original dataset, where the `sex` and `relationship` attributes didn't
@@ -285,8 +289,9 @@ map entirely one to one: there was one occurrence of (Husband, Female)
and two of (Wife, Male). It's possible this is an error in the
original dataset.

```bash
$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
```sh
$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
cut -d',' -f8,10 | sort | uniq -c | sort -n

1 Husband, Female
2 Wife, Male
@@ -302,8 +307,9 @@ $ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10
13192 Husband, Male
```

```bash
$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
```sh
$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t

1295 {Husband, Wife} Female
2264 {Other-relative, Own-child} Female
@@ -314,5 +320,5 @@ $ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10
12637 {Husband, Wife} Male
```

Since there were occurences of (Wide, Male), "({Husband, Wife}, Male)"
Since there were occurrences of (Wife, Male), "({Husband, Wife}, Male)"
does not undo the transformation of the `relationship` attribute.
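
The `cut | sort | uniq -c` pipelines above amount to counting (`relationship`, `sex`) pairs before and after generalization. A minimal Python sketch of the same idea follows. The five sample rows and the helper `generalize` are hypothetical; only the `{Husband, Wife}` grouping is taken from the anonymized output shown above.

```python
from collections import Counter


def generalize(relationship):
    """Merge Husband and Wife into one generalized value, as in the anonymized output."""
    if relationship in {"Husband", "Wife"}:
        return "{Husband, Wife}"
    return relationship


# Hypothetical (relationship, sex) rows, including the two unusual combinations.
rows = [
    ("Husband", "Male"),
    ("Husband", "Male"),
    ("Wife", "Female"),
    ("Wife", "Male"),
    ("Husband", "Female"),
]

original = Counter(rows)
anonymized = Counter((generalize(rel), sex) for rel, sex in rows)

for (rel, sex), count in sorted(anonymized.items()):
    # Each generalized cell mixes records that were Husband and Wife originally.
    print(count, rel, sex)
```

In this sample the generalized `({Husband, Wife}, Male)` cell contains both (Husband, Male) and (Wife, Male) rows, so the cell alone does not reveal which original value a record had.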
2
render.sh
Executable file
@@ -0,0 +1,2 @@
#!/bin/sh
pandoc README.md --pdf-engine=xelatex -o report.pdf
BIN
report.pdf
Binary file not shown.