diff --git a/report.md b/README.md
similarity index 86%
rename from report.md
rename to README.md
index c055689..69805e0 100644
--- a/report.md
+++ b/README.md
@@ -1,10 +1,12 @@
 ---
-title: Privacy-Preserving Data Publishing
-subtitle: Assignment \#4
+title: "Security and Privacy - Assignment 4"
+subtitle: "Privacy-Preserving Data Publishing"
 author:
- - Diogo Cordeiro (up201705417)
- - Hugo Sales (up201704178)
+ - Diogo Cordeiro (201705417)
+ - Hugo Sales (201704178)
 date: 2022/06/02
+geometry: margin=2cm
+output: pdf_document
 ---
 
 # Attribute classification
@@ -33,13 +35,14 @@ Table: Attribute classifications
 
 ## Justifications
 
-The vast majority of attributes present extremely low values of distinction. We speculate this may
-be an TODO
+The vast majority of attributes present low values of distinction. This is consistent with the nature of
+the dataset, considering that `fnlwgt` should indicate the quantity of individuals that present the same
+set of attributes.
 
 ### `age`
 
-According to HIPPA recommendations, and together with it's very high separation value (99.87%), this
-attribute is classified as a QID.
+According to HIPPA recommendations, and together with it's very high separation value (99.87%), we classify
+this attribute as a QID.
 
 ### `workclass`
 
@@ -49,11 +52,11 @@ deemed Insensitive.
 ### `fnlwgt`
 
 Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID
-becuase it represents a weight, not a count of individuals in the same equivalence class in the
+because it represents a weight, not a count of individuals in the same equivalence class in the
 original dataset. This can be seen with the results below. Additionally, it's not easily connected
 to another auxiliary info dataset.
 
-```bash
+```sh
 $ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
 END {for(sex in count){print sex, count[sex]}}'
 ```
@@ -65,7 +68,7 @@ Sex | Sum
 Female | 2000673518
 Male | 4178699874
 
-Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}
+Table: Sum of `fnlwgt` for each `sex`
 
 The sum of these values is 6,179,373,392. This value is much larger than the population of the
 U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
@@ -74,22 +77,21 @@ We also note there are substantially more Male than Female records (more than do
 
 ### `education`
 
-This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
-as a QID.
+This attribute presents a separation of 80.96%, which is quite high, thus we classified it as a QID.
 
 ### `education-num`
 
-We exported the anonymized dataset and used the following command to verify there weren't any discrepencies between the
-`education` and `education-num` columns:
+We exported the anonymized dataset and used the following command to verify there weren't any discrepencies
+between the `education` and `education-num` columns:
 
-```bash
+```sh
 $ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
 ```
 
-Since there was a one-to-one mapping, we concluded this was just a
+Since there was a one-to-one mapping, we confirmed this was just a
 representation of the `education` attribute. As such, this attribute
 recieves the same classification, which is backed by the equally high
-separation value of 80.96%, so it's qualified as a QID.
+separation value of 80.96%, so it's classified as a QID.
 
 ### `marital-status`
 
@@ -106,7 +108,7 @@ Given it's separation value of 73.21%, this attribute is classified as a QID.
 
 ### `race`
 
-This collumn presents some weirdly specified values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
+This collumn presents some weirdly specific values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
 that this attribute could be cross referenced with other datases, it is classified as a QID, so
 it may be transformed into more generic values.
 
@@ -115,7 +117,7 @@ it may be transformed into more generic values.
 Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
 it can be easily cross referenced with other datasets.
 
-We noted this dataset seems to more males than females. See @tbl:sex_weight and the following table
+We noted this dataset seems to have more males than females. See Table 2 and the following table
 
 `education` | Female | Male
 -------------+-------:+----:
@@ -136,7 +138,7 @@ Masters | 536 | 1187
 Prof-school | 92 | 484
 Doctorate | 86 | 327
 
-Table: Number of records with each `education` for each `sex` {#tbl:education_sex}
+Table: Number of records with each `education` for each `sex`
 
 ### `capital-gain` & `capital-loss`
 
@@ -276,8 +278,10 @@ Privacy, they were split into only 2 categories.
 
 # Observations
 
-We noted that the contingency between `sex` and `relationship` maintained the same distribution after anonymization,
-meaning that these changes don't mean `relationship` can identify an individual's `sex` any more than in the original dataset.
+We noted that the contingency between `sex` and `relationship` maintained
+the same distribution after anonymization, meaning that these changes don't
+mean `relationship` can identify an individual's `sex` any more than in the
+original dataset.
 
 With the following commands, we noted some possible errors in the
 original dataset, where the `sex` and `relationship` attributes didn't
@@ -285,8 +289,9 @@ map entirely one to one: there was one occurence of (Husband, Female) and
 two of (Wife, Male).
 It's possible this is an error in the original dataset.
 
-```bash
-$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
+```sh
+$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
+cut -d',' -f8,10 | sort | uniq -c | sort -n
 
 1 Husband, Female
 2 Wife, Male
@@ -302,8 +307,9 @@ $ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10
 13192 Husband, Male
 ```
 
-```bash
-$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
+```sh
+$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
+cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
 
 1295 {Husband, Wife} Female
 2264 {Other-relative, Own-child} Female
@@ -314,5 +320,5 @@ $ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10
 12637 {Husband, Wife} Male
 ```
 
-Since there were occurences of (Wide, Male), "({Husband, Wife}, Male)"
+Since there were occurences of (Wife, Male), "({Husband, Wife}, Male)"
 does not undo the transformation of the `relationship` attribute.
diff --git a/render.sh b/render.sh
new file mode 100755
index 0000000..f726fa6
--- /dev/null
+++ b/render.sh
@@ -0,0 +1,2 @@
+#!/bin/sh
+pandoc README.md --pdf-engine=xelatex -o report.pdf
diff --git a/report.pdf b/report.pdf
index eea2e7b..f45c78b 100644
Binary files a/report.pdf and b/report.pdf differ