Start justifying classifications in reports

2022-06-04 17:25:22 +01:00
parent 2a52f44f4e
commit a7a612e29e
1 changed files with 123 additions and 13 deletions
--- a/report.md
+++ b/report.md
@@ -1,10 +1,120 @@
-Despite high values of distinction (66.48%) and separation (99.99%), the `fnlwgt` column is not a QID because it represents a weight, not a
+---
-count of individuals in the same equivalence class in the original dataset. Additionally, it's not easily connected to
+title: Privacy-Preserving Data Publishing
-another auxiliary info dataset.
+subtitle: Assignment \#4
 author:
  - Diogo Cordeiro (up201705417)
  - Hugo Sales (up201704178)
 date: 2022/06/02
 ---
-We determined that `age` is a QID, since it's widely regarded as such in all datasets, according to HIPAA recommendations.
+# Attribute classification
 We classified the attributes as follows:
 Attribute        | Classification
 -----------------+---------------
 `age`            | QID
 `workclass`      | Insensitive
 `fnlwgt`         | Insensitive
 `education`      | QID
 `education-num`  | QID
 `marital-status` | QID
 `occupation`     | QID
 `relationship`   | QID
 `race`           | Sensitive
 `sex`            | QID
 `capital-gain`   | Sensitive
 `capital-loss`   | Sensitive
 `hours-per-week` | Insensitive
 `native-country` | Insensitive
 `prediction`     | Insensitive
 Table: Attribute classifications
 ## Justifications
 The vast majority of attributes present extremely low values of distinction. We speculate this may
 be an TODO
 ### `age`
 In line with HIPAA recommendations, and given its very high separation value (99.87%), this
 attribute is classified as a QID.
 ### `workclass`
 This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's
 deemed insensitive.
 ### `fnlwgt`
 Despite high values of distinction (66.48%) and separation (99.99%), the `fnlwgt` column is not a QID
 because it represents a weight, not a count of individuals in the same equivalence class in the
 original dataset. This can be seen with the results below. Additionally, it's not easily connected
 to another auxiliary dataset.
 ```bash
 tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
    END {for(sex in count){print sex, count[sex]}}'
 ```
 Resulting in:
 Sex    | Sum
 -------+--------
 Female | 2000673518
 Male   | 4178699874
 Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}
 The sum of these values is 6,179,373,392. This value is much larger than the population of the
 U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
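 The total quoted above can be re-derived from the two per-sex sums. The following is a minimal
 sketch: the variable names are ours, and the population figure is a rough assumption (about 330
 million), not a value taken from the dataset.
 ```bash
 # Per-sex sums of fnlwgt, copied from the table above.
 female_sum=2000673518
 male_sum=4178699874
 # Bash arithmetic is 64-bit, so the total fits without overflow.
 total=$((female_sum + male_sum))
 # Assumption: rough U.S. population (about 330 million), not from the dataset.
 us_population=330000000
 # Print the total and how many times it exceeds the assumed population (integer division).
 echo "total=${total} ratio=$((total / us_population))"
 ```
 A ratio well above 1 is consistent with `fnlwgt` being a weight rather than a head count.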
 ### `education`
 This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
 as a QID.
 ### `education-num`
 As a numerical representation of the `education` attribute, this attribute receives the same
 classification, which is backed by the equally high separation value of 80.96%, so it's classified as
 a QID.
 ### `marital-status`
 With a relatively high separation value of 66.01%, together with the fact that it could be
 cross-referenced with other available datasets, we classify this attribute as a QID.
 ### `occupation`
 With a separation of 90.02%, this attribute is classified as a QID.
 ### `relationship`
 Given its separation value of 73.21%, this attribute is classified as a QID.
 ### `race`
 This column presents some unusually specific values (Amer-Indian-Eskimo), but has a separation of 25.98%; given
 that this attribute could be cross-referenced with other datasets, it is classified as Sensitive, so
 it may be transformed into more generic values.
 ### `sex`
 Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
 it can be easily cross-referenced with other datasets.
 We noted this dataset seems to contain more males than females; see @tbl:sex_weight.
 ### `native-country`
 While this attribute might be regarded as a QID, it presents really low separation values (19.65%) in this
 dataset, so it's qualified as Sensitive.
 ----------------
 We noted this dataset contains more males than females.
 Higher Precision (Generation Intensity) implies the attributes are closer to the ones in the original dataset, and therefore
 provide higher utility.
@@ -16,11 +126,13 @@ We exported the anonymized dataset and used the following command to verify ther
 `education` and `education-num` columns:
 ```bash
-cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d'  ' -f4,5 | sort -u
+cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d'	' -f4,5 | sort -u
 ```
 ```bash
 cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
 ```
 /projects/uni/DataAnonymisation/ (master)$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
      1  Husband, Female
      2  Wife, Male
    430  Other-relative, Female
@@ -33,7 +145,11 @@ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d'  ' -f4,5 | sort -u
   3875  Not-in-family, Female
   4430  Not-in-family, Male
  13192  Husband, Male
 ```
 ~/projects/uni/DataAnonymisation/ (master)$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d'       ' -f8,10 | sort | uniq -c | sort -n
 ```
      1 Husband	Female
      2 Wife	Male
    168 Other-relative	*
@@ -52,9 +168,3 @@ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d'  ' -f4,5 | sort -u
   3209 Not-in-family	Female
   3447 Not-in-family	Male
  11150 Husband	Male