Start justifying classifications in reports

2022-06-04 17:25:22 +01:00
parent 2a52f44f4e
commit a7a612e29e
1 changed files with 123 additions and 13 deletions
--- a/report.md
+++ b/report.md
@@ -1,10 +1,120 @@
-Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID becuase it represents a weight, not a
-count of individuals in the same equivalence class in the original dataset. Additionally, it's not easily connected to
-another auxiliary info dataset.
+---
+title: Privacy-Preserving Data Publishing
+subtitle: Assignment \#4
+author:
+  - Diogo Cordeiro (up201705417)
+  - Hugo Sales (up201704178)
+date: 2022/06/02
+---

-We determined that `age` is a QID, since it's widely regarded as such, in all datasets, according to HIPPA recommendations.
+# Attribute classification
+
+We classified the attributes as follows:
+
+Attribute        | Classification
+-----------------+---------------
+`age`            | QID
+`workclass`      | Insensitive
+`fnlwgt`         | Insensitive
+`education`      | QID
+`education-num`  | QID
+`marital-status` | QID
+`occupation`     | QID
+`relationship`   | QID
+`race`           | Sensitive
+`sex`            | QID
+`capital-gain`   | Sensitive
+`capital-loss`   | Sensitive
+`hours-per-week` | Insensitive
+`native-country` | Insensitive
+`prediction`     | Insensitive
+
+Table: Attribute classifications
+
+## Justifications
+
+The vast majority of attributes present extremely low values of distinction. We speculate this may
+be an TODO
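
The distinction and separation figures quoted in the sections below can be approximated directly from the raw CSV. The following is a minimal sketch, assuming the usual definitions (distinction: number of distinct values divided by the number of records; separation: fraction of record pairs whose values differ), a header row, and comma-separated fields. The helper name `attr_metrics` is hypothetical and not part of the original analysis.

```bash
# attr_metrics FILE COLUMN
# Hypothetical helper: prints distinction and separation (in %) for one column.
# Assumes FILE has a header row and comma-separated fields.
attr_metrics() {
    tail -n +2 "$1" | awk -F',' -v col="$2" '
        { count[$col]++; n++ }          # occurrences of each value, and record total
        END {
            if (n < 2) { print "need at least two records"; exit 1 }
            # Pairs sharing a value: sum of c * (c - 1) over all distinct values
            for (v in count) { distinct++; same += count[v] * (count[v] - 1) }
            printf "distinction=%.2f%% separation=%.2f%%\n",
                100 * distinct / n, 100 * (1 - same / (n * (n - 1)))
        }'
}
```

For example, `attr_metrics adult_data.csv 10` would report both metrics for the `sex` column, assuming it is the tenth field as in the `awk` command used for `fnlwgt`.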
+
+### `age`
+
+Following HIPAA recommendations, and given its very high separation value (99.87%), this
+attribute is classified as a QID.
+
+### `workclass`
+
+This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's
+deemed insensitive.
+
+### `fnlwgt`
+
+Despite high values of distinction (66.48%) and separation (99.99%), the `fnlwgt` column is not a QID
+because it represents a weight, not a count of individuals in the same equivalence class in the
+original dataset. This can be seen with the results below. Additionally, it's not easily connected
+to another auxiliary info dataset.
+
+```bash
+tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
+    END {for(sex in count){print sex, count[sex]}}'
+```
+
+Resulting in:
+
+Sex    | Sum
+-------+--------
+Female | 2000673518
+Male   | 4178699874
+
+Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}
+
+The sum of these values is 6,179,373,392. This value is much larger than the population of the
+U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
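
To put the total in context, it can be compared against the number of records it was computed from. The sketch below assumes, as in the command above, that `fnlwgt` is the third comma-separated field and that the file has a header row. The helper name `weight_summary` is hypothetical.

```bash
# weight_summary FILE
# Hypothetical helper: prints the record count and the total of column 3 (fnlwgt).
# Assumes FILE has a header row and comma-separated fields.
weight_summary() {
    tail -n +2 "$1" | awk -F',' '
        { total += $3; n++ }            # accumulate the weight and count the record
        END { printf "records=%d total=%.0f\n", n, total }'
}
```

Running `weight_summary adult_data.csv` prints both numbers on one line, which makes it easy to see how far the summed weight exceeds the number of rows.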
+
+### `education`
+
+This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
+as a QID.
+
+### `education-num`
+
+As a numerical representation of the `education` attribute, this attribute receives the same
+classification, which is backed by the identical separation value of 80.96%, so it's classified as
+a QID.
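
The claim that `education-num` is just a numerical encoding of `education` can be checked mechanically. The sketch below assumes the two attributes are the fourth and fifth comma-separated fields and that the file has a header row. The helper name `mapping_is_one_to_one` is hypothetical.

```bash
# mapping_is_one_to_one FILE
# Hypothetical helper: succeeds (exit 0) only if every column-4 value pairs with
# exactly one column-5 value and vice versa.
# Assumes FILE has a header row and comma-separated fields.
mapping_is_one_to_one() {
    tail -n +2 "$1" | awk -F',' '
        { pair[$4 FS $5] = 1; left[$4] = 1; right[$5] = 1 }
        END {
            for (k in pair) p++
            for (k in left) l++
            for (k in right) r++
            # A one-to-one mapping has as many distinct pairs as distinct values on each side
            exit !(p == l && p == r)
        }'
}
```

It can be used as `mapping_is_one_to_one adult_data.csv && echo "one-to-one"`.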
+
+### `marital-status`
+
+With a relatively high separation value of 66.01%, together with the fact that it could be
+cross-referenced with other available datasets, we classify this attribute as a QID.
+
+### `occupation`
+
+With a separation of 90.02%, this attribute is classified as a QID.
+
+### `relationship`
+
+Given its separation value of 73.21%, this attribute is classified as a QID.
+
+### `race`
+
+This column presents some oddly specific values (Amer-Indian-Eskimo), but has a separation of only 25.98%; given the fact
+that this attribute could be cross-referenced with other datasets, it is classified as Sensitive, so
+it may be transformed into more generic values.
+
+### `sex`
+
+Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
+it can be easily cross-referenced with other datasets.
+
+We noted this dataset seems to contain more males than females. See @tbl:sex_weight.
+
+
+### `native-country`
+
+While this attribute might be regarded as a QID, it presents a very low separation value (19.65%) in this
+dataset, so it's classified as Sensitive.
+
+----------------

-We noted this dataset contains more males than females.

 Higher Precision (Generation Intensity) implies the attributes are closer to the ones in the original dataset, and therefore
 provide higher utility.
@@ -16,11 +126,13 @@ We exported the anonymized dataset and used the following command to verify ther
 `education` and `education-num` columns:

 ```bash
-cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d'  ' -f4,5 | sort -u
+cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d'	' -f4,5 | sort -u
 ```

+```bash
+cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
+```

-/projects/uni/DataAnonymisation/ (master)$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
      1  Husband, Female
      2  Wife, Male
    430  Other-relative, Female
@@ -33,7 +145,11 @@ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d'  ' -f4,5 | sort -u
   3875  Not-in-family, Female
   4430  Not-in-family, Male
  13192  Husband, Male
+
+```
 ~/projects/uni/DataAnonymisation/ (master)$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d'       ' -f8,10 | sort | uniq -c | sort -n
+```
+
      1 Husband	Female
      2 Wife	Male
    168 Other-relative	*
@@ -52,9 +168,3 @@ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d'  ' -f4,5 | sort -u
   3209 Not-in-family	Female
   3447 Not-in-family	Male
  11150 Husband	Male
-
-
-
-
-
-