title: Privacy-Preserving Data Publishing
subtitle: Assignment #4
author: Diogo Cordeiro (up201705417), Hugo Sales (up201704178)
date: 2022/06/02

Attribute classification

We classified the attributes as follows:

Attribute        | Classification
-----------------+---------------
age              | QID
workclass        | Insensitive
fnlwgt           | Insensitive
education        | QID
education-num    | QID
marital-status   | QID
occupation       | QID
relationship     | QID
race             | Sensitive
sex              | QID
capital-gain     | Sensitive
capital-loss     | Sensitive
hours-per-week   | Insensitive
native-country   | Insensitive
prediction       | Insensitive

Table: Attribute classifications

Justifications

The vast majority of attributes present extremely low values of distinction. We speculate this may be an TODO

age

According to HIPAA recommendations, and together with its very high separation value (99.87%), this attribute is classified as a QID.
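The separation and distinction percentages quoted in this section can be reproduced approximately from the raw CSV. The sketch below assumes the usual definitions: distinction is the number of distinct values divided by the number of records, and separation is the fraction of record pairs that differ on the attribute. The helper name `column_stats` is hypothetical, and the input is assumed to be comma-separated with a single header row.

```shell
# column_stats FILE COL: print distinction and separation (in %) of one column.
# Hypothetical helper; assumes a comma-separated FILE with one header row.
column_stats() {
    tail -n +2 "$1" | awk -F',' -v col="$2" '
        # Count how often each value of the chosen column occurs (skip short lines).
        NF >= col { count[$col]++; n++ }
        END {
            if (n < 2) { print "need at least two records"; exit 1 }
            # same = number of ordered record pairs sharing the same value.
            for (v in count) { distinct++; same += count[v] * (count[v] - 1) }
            printf "distinction %.2f%% separation %.2f%%\n",
                100 * distinct / n, 100 * (1 - same / (n * (n - 1)))
        }'
}
```

For example, `column_stats adult_data.csv 10` would report the values for sex, assuming it is the tenth column as in the awk command of the fnlwgt section.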

workclass

This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's deemed insensitive.

fnlwgt

Despite high values of distinction (66.48%) and separation (99.99%), the fnlwgt column is not a QID because it represents a weight, not a count of individuals in the same equivalence class in the original dataset. This can be seen with the results below. Additionally, it cannot easily be linked to an auxiliary dataset.

tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
    END {for(sex in count){print sex, count[sex]}}'

Resulting in:

Sex    | Sum
-------+-----------
Female | 2000673518
Male   | 4178699874

Table: Sum of fnlwgt for each sex {#tbl:sex_weight}

The sum of these values is 6,179,373,392. This value is much larger than the population of the U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
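The total can be cross-checked without grouping by sex. This is a minimal sketch, assuming fnlwgt is the third comma-separated column and the file has one header row, as in the command above; `weight_total` is a hypothetical helper name.

```shell
# weight_total FILE: print the number of records and the sum of column 3 (fnlwgt).
# Hypothetical helper; assumes a comma-separated FILE with one header row.
weight_total() {
    tail -n +2 "$1" | awk -F',' '
        NF >= 3 { total += $3; rows++ }
        # %.0f avoids integer overflow for totals beyond 32 bits.
        END { printf "%d records, fnlwgt total %.0f\n", rows, total }'
}
```

Dividing the total by the record count gives the average weight per record, which makes it easy to see how far the column is from a per-record count of one.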

education

This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified as a QID.

education-num

As a numerical representation of the education attribute, this attribute receives the same classification, which is backed by the equally high separation value of 80.96%, so it's classified as a QID.

marital-status

With a relatively high separation value of 66.01%, together with the fact that it could be cross-referenced with other available datasets, we classify this attribute as a QID.

occupation

With a separation of 90.02%, this attribute is classified as a QID.

relationship

Given its separation value of 73.21%, this attribute is classified as a QID.

race

This column presents some oddly specific values (Amer-Indian-Eskimo), but has a separation of only 25.98%; given that this attribute could be cross-referenced with other datasets, it is classified as Sensitive, so it may be transformed into more generic values.

sex

Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since it can be easily cross-referenced with other datasets.

We noted this dataset seems to contain more males than females. See @tbl:sex_weight

native-country

While this attribute might be regarded as a QID, it presents a really low separation value (19.65%) in this dataset, so it's classified as Insensitive.


Higher Precision (Generalization Intensity) implies the attributes are closer to the ones in the original dataset, and therefore provide higher utility.

We noted that the contingency between sex and relationship kept the same distribution after anonymization, so relationship does not identify an individual's sex any better in the anonymized dataset than in the original one.

We exported the anonymized dataset and used the following command to verify there weren't any discrepancies between the education and education-num columns:

cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d'	' -f4,5 | sort -u
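The `sort -u` output has to be inspected by eye. As a rough alternative, the sketch below flags any education value paired with more than one education-num, or the reverse. `check_one_to_one` is a hypothetical helper; it assumes tab-separated pairs on standard input, such as the output of the `cut` step above.

```shell
# check_one_to_one: read "education<TAB>education-num" pairs on stdin and report
# every value on either side that is paired with more than one partner.
# Hypothetical helper; exits with status 1 if any inconsistency is found.
check_one_to_one() {
    sort -u | awk -F'\t' '
        {
            # After sort -u each distinct pair appears once, so a second
            # occurrence of a key means it maps to two different partners.
            if (++left[$1] == 2)  { print "education maps to several numbers: " $1; bad = 1 }
            if (++right[$2] == 2) { print "number maps to several educations: " $2; bad = 1 }
        }
        END { exit bad + 0 }'
}
```

Under these assumptions, `sed -r 's/,([^ ])/\t\1/g' anonymized.csv | cut -f4,5 | check_one_to_one` prints nothing and exits with status 0 when the two columns are consistent.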

The contingency between relationship and sex in the original dataset:

cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
    1  Husband, Female
    2  Wife, Male
  430  Other-relative, Female
  551  Other-relative, Male
  792  Unmarried, Male
 1566  Wife, Female
 2245  Own-child, Female
 2654  Unmarried, Female
 2823  Own-child, Male
 3875  Not-in-family, Female
 4430  Not-in-family, Male
13192  Husband, Male

The same contingency in the anonymized dataset:

cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -f8,10 | sort | uniq -c | sort -n
    1 Husband         Female
    2 Wife            Male
  168 Other-relative  *
  336 Own-child       *
  342 Other-relative  Female
  471 Other-relative  Male
  552 Wife            *
  573 Unmarried       Male
  728 Unmarried       *
 1014 Wife            Female
 1649 Not-in-family   *
 2042 Husband         *
 2081 Own-child       Female
 2145 Unmarried       Female
 2651 Own-child       Male
 3209 Not-in-family   Female
 3447 Not-in-family   Male
11150 Husband         Male
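
To compare the two listings more directly, the counts can be turned into per-relationship shares. The following is a sketch only: `sex_share` is a hypothetical helper that assumes tab-separated relationship and sex pairs on standard input.

```shell
# sex_share: read "relationship<TAB>sex" pairs on stdin and print, for each
# relationship, the percentage of its records that carry each sex value.
# Hypothetical helper; output is tab-separated and sorted for easy diffing.
sex_share() {
    awk -F'\t' '
        { pair[$1 FS $2]++; total[$1]++ }
        END {
            for (k in pair) {
                split(k, part, FS)
                printf "%s\t%s\t%.1f%%\n", part[1], part[2], 100 * pair[k] / total[part[1]]
            }
        }' | LC_ALL=C sort
}
```

For the anonymized file, the input could come from the same `sed` and `cut` steps used above, stopping before `sort | uniq -c`. Suppressed values (`*`) appear as their own category, so they should be taken into account when comparing the shares with those of the original dataset.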