Fix typos and fill todo

2022-06-05 17:24:09 +01:00 · de366a6571
commit de366a6571
parent b26bbc7168
3 changed files with 36 additions and 28 deletions
--- a/README.md
+++ b/README.md
@ -1,10 +1,12 @@
 ---
-title: Privacy-Preserving Data Publishing
+title: "Security and Privacy - Assignment 4"
-subtitle: Assignment \#4
+subtitle: "Privacy-Preserving Data Publishing"
 author:
-  - Diogo Cordeiro (up201705417)
+  - Diogo Cordeiro (201705417)
-  - Hugo Sales (up201704178)
+  - Hugo Sales (201704178)
 date: 2022/06/02
 geometry: margin=2cm
 output: pdf_document
 ---
 # Attribute classification
@ -33,13 +35,14 @@ Table: Attribute classifications
 ## Justifications
-The vast majority of attributes present extremely low values of distinction. We speculate this may
+The vast majority of attributes present low values of distinction. This is consistent with the nature of
-be an TODO
+the dataset, considering that `fnlwgt` should indicate the quantity of individuals that present the same
 set of attributes.
 ### `age`
-According to HIPPA recommendations, and together with it's very high separation value (99.87%), this
+According to HIPAA recommendations, and together with its very high separation value (99.87%), we classify
-attribute is classified as a QID.
+this attribute as a QID.
 ### `workclass`
@ -49,11 +52,11 @@ deemed Insensitive.
 ### `fnlwgt`
 Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID
-becuase it represents a weight, not a count of individuals in the same equivalence class in the
+because it represents a weight, not a count of individuals in the same equivalence class in the
 original dataset. This can be seen with the results below. Additionally, it's not easily connected
 to another auxiliary info dataset.
-```bash
+```sh
 $ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
    END {for(sex in count){print sex, count[sex]}}'
 ```
@ -65,7 +68,7 @@ Sex    | Sum
 Female | 2000673518
 Male   | 4178699874
-Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}
+Table: Sum of `fnlwgt` for each `sex`
 The sum of these values is 6,179,373,392. This value is much larger than the population of the
 U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
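As a quick sanity check of the arithmetic in this hunk, the two per-sex sums can be added in the shell. This is only a minimal sketch: the two sums are copied from the table above, while `us_population_approx` is an assumed round figure used for an order-of-magnitude comparison, not a value taken from the dataset or the README.

```sh
#!/bin/sh
# Per-sex sums of `fnlwgt`, copied from the table in the README.
female_sum=2000673518
male_sum=4178699874

# Assumption: a round order-of-magnitude figure for the U.S. population,
# not a value from the dataset or the README.
us_population_approx=330000000

# Shell arithmetic is integer-only; these values need 64-bit integers,
# which common shells such as bash and dash provide on 64-bit systems.
total=$((female_sum + male_sum))
ratio=$((total / us_population_approx))

echo "sum of fnlwgt: $total"
echo "roughly ${ratio}x the assumed population"
```

Under that assumed population figure, the total exceeds the population by more than an order of magnitude, which supports reading `fnlwgt` as a weight rather than a head count.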
@ -74,22 +77,21 @@ We also note there are substantially more Male than Female records (more than do
 ### `education`
-This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
+This attribute presents a separation of 80.96%, which is quite high, thus we classified it as a QID.
 as a QID.
 ### `education-num`
-We exported the anonymized dataset and used the following command to verify there weren't any discrepencies between the
+We exported the anonymized dataset and used the following command to verify there weren't any discrepancies
-`education` and `education-num` columns:
+between the `education` and `education-num` columns:
-```bash
+```sh
 $ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
 ```
-Since there was a one-to-one mapping, we concluded this was just a
+Since there was a one-to-one mapping, we confirmed this was just a
 representation of the `education` attribute. As such, this attribute
 receives the same classification, which is backed by the equally high
-separation value of 80.96%, so it's qualified as a QID.
+separation value of 80.96%, so it's classified as a QID.
 ### `marital-status`
@ -106,7 +108,7 @@ Given it's separation value of 73.21%, this attribute is classified as a QID.
 ### `race`
-This collumn presents some weirdly specified values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
+This column presents some weirdly specific values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
 that this attribute could be cross-referenced with other datasets, it is classified as a QID, so
 it may be transformed into more generic values.
@ -115,7 +117,7 @@ it may be transformed into more generic values.
 Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
 it can be easily cross referenced with other datasets.
-We noted this dataset seems to more males than females. See @tbl:sex_weight and the following table
+We noted this dataset seems to have more males than females. See Table 2 and the following table
 `education`  | Female | Male
 -------------+-------:+----:
@ -136,7 +138,7 @@ Masters		 |	  536 | 1187
 Prof-school	 |	   92 |	 484
 Doctorate	 |	   86 |	 327
-Table: Number of records with each `education` for each `sex` {#tbl:education_sex}
+Table: Number of records with each `education` for each `sex`
 ### `capital-gain` & `capital-loss`
@ -276,8 +278,10 @@ Privacy, they were split into only 2 categories.
 # Observations
-We noted that the contingency between `sex` and `relationship` maintained the same distribution after anonymization,
+We noted that the contingency between `sex` and `relationship` maintained
-meaning that these changes don't mean `relationship` can identify an individual's `sex` any more than in the original dataset.
+the same distribution after anonymization, meaning that these changes don't
 mean `relationship` can identify an individual's `sex` any more than in the
 original dataset.
 With the following commands, we noted some possible errors in the
 original dataset, where the `sex` and `relationship` attributes didn't
@ -285,8 +289,9 @@ map entirely one to one: there was one occurence of (Husband, Female)
 and two of (Wife, Male). It's possible this is an error in the
 original dataset.
-```bash
+```sh
-$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
+$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
 cut -d',' -f8,10 | sort | uniq -c | sort -n
      1  Husband, Female
      2  Wife, Male
@ -302,8 +307,9 @@ $ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10
  13192  Husband, Male
 ```
-```bash
+```sh
-$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
+$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' |
 cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
   1295 {Husband, Wife}              Female
   2264 {Other-relative, Own-child}  Female
@ -314,5 +320,5 @@ $ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d';' -f8,10
  12637 {Husband, Wife}              Male
 ```
-Since there were occurences of (Wide, Male), "({Husband, Wife}, Male)"
+Since there were occurrences of (Wife, Male), "({Husband, Wife}, Male)"
 does not undo the transformation of the `relationship` attribute.
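The one-to-one check between `education` and `education-num` described earlier in the README could also be automated rather than inspected by eye. The sketch below is one possible approach, not the authors' method: `is_one_to_one` is a hypothetical helper, and the sample file (with a `;` separator and a header row) is made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper (not part of the repository): exits 0 only if the
# values in column A and column B of FILE map one-to-one.
# Usage: is_one_to_one FILE SEPARATOR COL_A COL_B
is_one_to_one() {
    awk -F"$2" -v a="$3" -v b="$4" '
        NR == 1 { next }    # skip the header row
        {
            # A value already paired with a different partner breaks the mapping.
            if (($a in fwd) && fwd[$a] != $b) bad = 1
            if (($b in rev) && rev[$b] != $a) bad = 1
            fwd[$a] = $b
            rev[$b] = $a
        }
        END { exit bad + 0 }
    ' "$1"
}

# Made-up sample data, only to demonstrate the helper.
sample=$(mktemp)
printf '%s\n' 'education;education-num' 'Bachelors;13' 'Masters;14' 'Bachelors;13' > "$sample"

if is_one_to_one "$sample" ';' 1 2; then
    echo "one-to-one"
else
    echo "not one-to-one"
fi
rm -f "$sample"
```

Against the exported file it might be called as `is_one_to_one anonymized.csv ';' 4 5`, using the separator and column numbers from the README's own `awk` command.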
--- a/render.sh
+++ b/render.sh
@ -0,0 +1,2 @@
 #!/bin/sh
 pandoc README.md --pdf-engine=xelatex -o report.pdf
--- a/report.pdf
+++ b/report.pdf