diff --git a/report.md b/report.md
index 56ed9a2..d13330f 100644
--- a/report.md
+++ b/report.md
@@ -1,10 +1,120 @@
-Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID becuase it represents a weight, not a
-count of individuals in the same equivalence class in the original dataset. Additionally, it's not easily connected to
-another auxiliary info dataset.
+---
+title: Privacy-Preserving Data Publishing
+subtitle: Assignment \#4
+author:
+ - Diogo Cordeiro (up201705417)
+ - Hugo Sales (up201704178)
+date: 2022/06/02
+---
-We determined that `age` is a QID, since it's widely regarded as such, in all datasets, according to HIPPA recommendations.
+
+# Attribute classification
+
+We classified the attributes as follows:
+
+Attribute        | Classification
+-----------------+---------------
+`age`            | QID
+`workclass`      | Insensitive
+`fnlwgt`         | Insensitive
+`education`      | QID
+`education-num`  | QID
+`marital-status` | QID
+`occupation`     | QID
+`relationship`   | QID
+`race`           | Sensitive
+`sex`            | QID
+`capital-gain`   | Sensitive
+`capital-loss`   | Sensitive
+`hours-per-week` | Insensitive
+`native-country` | Insensitive
+`prediction`     | Insensitive
+
+Table: Attribute classifications
+
+## Justifications
+
+The vast majority of attributes present extremely low values of distinction. We speculate this may
+be an TODO
+
+### `age`
+
+According to HIPAA recommendations, and together with its very high separation value (99.87%), this
+attribute is classified as a QID.
+
+### `workclass`
+
+This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's
+deemed insensitive.
+
+### `fnlwgt`
+
+Despite high values of distinction (66.48%) and separation (99.99%), the `fnlwgt` column is not a QID
+because it represents a weight, not a count of individuals in the same equivalence class in the
+original dataset. This can be seen with the results below.
Additionally, it's not easily connected
+to another auxiliary info dataset.
+
+```bash
+tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
+    END {for(sex in count){print sex, count[sex]}}'
+```
+
+Resulting in:
+
+Sex    | Sum
+-------+--------
+Female | 2000673518
+Male   | 4178699874
+
+Table: Sum of `fnlwgt` for each `sex` {#tbl:sex_weight}
+
+The sum of these values is 6,179,373,392. This value is much larger than the population of the
+U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
+
+### `education`
+
+This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified
+as a QID.
+
+### `education-num`
+
+As a numerical representation of the `education` attribute, this attribute receives the same
+classification, which is backed by the equally high separation value of 80.96%, so it's qualified as
+a QID.
+
+### `marital-status`
+
+With a relatively high separation value of 66.01%, together with the fact that it could be
+cross-referenced with other available datasets, we classify this attribute as a QID.
+
+### `occupation`
+
+With a separation of 90.02%, this attribute is classified as a QID.
+
+### `relationship`
+
+Given its separation value of 73.21%, this attribute is classified as a QID.
+
+### `race`
+
+This column presents some weirdly specified values (Amer-Indian-Eskimo), but has a separation of 25.98%; given the fact
+that this attribute could be cross-referenced with other datasets, it is classified as Sensitive, so
+it may be transformed into more generic values.
+
+### `sex`
+
+Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since
+it can be easily cross-referenced with other datasets.
+
+We noted this dataset seems to contain more males than females.
See @tbl:sex_weight
+
+
+### `native-country`
+
+While this attribute might be regarded as a QID, it presents really low separation values (19.65%) in this
+dataset, so it's qualified as Sensitive.
+
+----------------
+
-We noted this dataset contains more males than females.
 Higer Precision (Generation Intensity) implies the attributes are closer to the ones in the original dataset, therefore provide higher utility.
@@ -16,11 +126,13 @@ We exported the anonymized dataset and used the following command to verify ther
 `education` and `education-num` columns:
 ```bash
-cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
+cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
 ```
+```bash
+cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
+```
-/projects/uni/DataAnonymisation/ (master)$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
 1 Husband, Female
 2 Wife, Male
 430 Other-relative, Female
@@ -33,7 +145,11 @@ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
 3875 Not-in-family, Female
 4430 Not-in-family, Male
 13192 Husband, Male
+
+```
 ~/projects/uni/DataAnonymisation/ (master)$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f8,10 | sort | uniq -c | sort -n
+```
+
 1 Husband Female
 2 Wife Male
 168 Other-relative *
@@ -52,9 +168,3 @@ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d' ' -f4,5 | sort -u
 3209 Not-in-family Female
 3447 Not-in-family Male
 11150 Husband Male
-
-
-
-
-
-