---
title: Privacy-Preserving Data Publishing
subtitle: "Assignment #4"
author:
- Diogo Cordeiro (up201705417)
- Hugo Sales (up201704178)
date: 2022/06/02
---

Attribute classification

We classified the attributes as follows:

Table: Attribute classifications

Justifications

The vast majority of attributes present extremely low values of distinction. We speculate this may be an TODO
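The distinction and separation percentages quoted in this section can be approximated directly from the raw file. The sketch below is not the tooling used for the report. It assumes the common definitions, which the report does not state: distinction is the number of distinct values divided by the number of records, and separation is the share of record pairs that differ on the attribute. It also assumes a comma-separated file with one header line and no quoted commas. `column_stats` is a hypothetical helper name.

```shell
# Hypothetical helper, not part of the original report.
# Usage: column_stats FILE COLUMN_NUMBER
# Assumes: one header line, comma-separated fields, no quoted commas.
column_stats() {
    tail -n +2 "$1" | awk -F',' -v col="$2" '
        { count[$col]++; n++ }          # tally each value of the chosen column
        END {
            if (n < 2) { print "need at least two records"; exit 1 }
            for (v in count) {
                distinct++
                # number of record pairs that share the value v
                same += count[v] * (count[v] - 1) / 2
            }
            pairs = n * (n - 1) / 2     # all unordered record pairs
            printf "distinction %.2f%%\n", 100 * distinct / n
            printf "separation %.2f%%\n", 100 * (pairs - same) / pairs
        }'
}
```

For instance, `column_stats adult_data.csv 1` would report both measures for `age`, assuming `age` is the first column of the file.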

`age`

According to HIPAA recommendations, and together with its very high separation value (99.87%), this attribute is classified as a QID.

`workclass`

This attribute presents a relatively low separation value (49.71%), and given how generic it is, it's deemed insensitive.

`fnlwgt`

Despite high values of distinction (66.48%) and separation (99.99%), the fnlwgt column is not a QID because it represents a weight, not a count of individuals in the same equivalence class in the original dataset. This can be seen from the results below. Additionally, it is not easily linked to an auxiliary dataset.

```shell
tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;}
    END {for(sex in count){print sex, count[sex]}}'
```

Resulting in:

Sex    | Sum
-------+-----------
Female | 2000673518
Male   | 4178699874

Table: Sum of fnlwgt for each sex {#tbl:sex_weight}

The sum of these values is 6,179,373,392. This value is much larger than the population of the U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.

`education`

This attribute presents a separation of 80.96%, which is quite high, so this attribute is classified as a QID.

`education-num`

As a numerical representation of the education attribute, this attribute receives the same classification, which is backed by the equally high separation value of 80.96%, so it is classified as a QID.

`marital-status`

With a relatively high separation value of 66.01%, together with the fact that it could be cross-referenced with other available datasets, we classify this attribute as a QID.

`occupation`

With a separation of 90.02%, this attribute is classified as a QID.

`relationship`

Given its separation value of 73.21%, this attribute is classified as a QID.

`race`

This column presents some oddly specific values (Amer-Indian-Eskimo) but has a low separation of 25.98%. Since this attribute could be cross-referenced with other datasets, it is classified as Sensitive, so that it may be transformed into more generic values.

`sex`

Despite the low separation value of 44.27%, this attribute is canonically classified as a QID, since it can be easily cross-referenced with other datasets.

We noted that this dataset seems to contain more males than females (see @tbl:sex_weight).

`native-country`

While this attribute might be regarded as a QID, it presents really low separation values (19.65%) in this dataset, so it's qualified as Sensitive.

Higher Precision (Generalization Intensity) implies that the attributes are closer to those in the original dataset, and therefore provide higher utility.

We noted that the contingency between sex and relationship kept the same distribution after anonymization, so relationship does not identify an individual's sex any more than it did in the original dataset.

We exported the anonymized dataset and used the following command to verify that there weren't any discrepancies between the education and education-num columns:

```shell
cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -f4,5 | sort -u
```
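The command above lists the unique pairs for manual inspection. As a complement, the check can be automated with a small filter that flags any education value paired with more than one education-num. This is a sketch and not the authors' procedure: `check_mapping` is a hypothetical helper, and it assumes tab-separated rows on standard input, such as those produced by the `sed` step.

```shell
# Hypothetical helper, not part of the original report.
# Usage: ... | check_mapping KEY_COLUMN VALUE_COLUMN
# Reads tab-separated rows on stdin and prints every value of KEY_COLUMN
# that appears with more than one distinct value of VALUE_COLUMN.
# Exits with status 1 if any such value is found, 0 otherwise.
check_mapping() {
    awk -F'\t' -v k="$1" -v v="$2" '
        {
            pair = $k SUBSEP $v
            if (!(pair in seen)) {      # first time this (key, value) pair occurs
                seen[pair] = 1
                partners[$k]++          # distinct values seen for this key
            }
        }
        END {
            for (key in partners)
                if (partners[key] > 1) { print key; bad = 1 }
            exit bad + 0
        }'
}
```

With the column positions used above, this could be run as `sed -r 's/,([^ ])/\t\1/g' anonymized.csv | check_mapping 4 5`. Swapping the two arguments checks the mapping in the opposite direction.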

```
$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
      1  Husband, Female
      2  Wife, Male
    430  Other-relative, Female
    551  Other-relative, Male
    792  Unmarried, Male
   1566  Wife, Female
   2245  Own-child, Female
   2654  Unmarried, Female
   2823  Own-child, Male
   3875  Not-in-family, Female
   4430  Not-in-family, Male
  13192  Husband, Male
```

```
$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -f8,10 | sort | uniq -c | sort -n
      1 Husband	Female
      2 Wife	Male
    168 Other-relative	*
    336 Own-child	*
    342 Other-relative	Female
    471 Other-relative	Male
    552 Wife	*
    573 Unmarried	Male
    728 Unmarried	*
   1014 Wife	Female
   1649 Not-in-family	*
   2042 Husband	*
   2081 Own-child	Female
   2145 Unmarried	Female
   2651 Own-child	Male
   3209 Not-in-family	Female
   3447 Not-in-family	Male
  11150 Husband	Male
```
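To compare the two tallies more directly, the counts can be reduced to one percentage per relationship. The sketch below is an illustration and not part of the original analysis: `male_share` is a hypothetical helper that assumes tab-separated `relationship<TAB>sex` rows on standard input and ignores rows whose sex was suppressed (`*`).

```shell
# Hypothetical helper, not part of the original report.
# Usage: ... | male_share
# Reads "relationship<TAB>sex" rows on stdin and prints, for each
# relationship, the percentage of rows marked Male among the rows whose
# sex is not suppressed ("*"). Output is sorted by relationship.
male_share() {
    awk -F'\t' '
        $2 == "*" { next }              # skip suppressed values
        { total[$1]++ }
        $2 == "Male" { male[$1]++ }
        END {
            for (r in total)
                printf "%s\t%.1f\n", r, 100 * male[r] / total[r]
        }' | sort
}
```

For the anonymized file this could be run as `tail -n +2 anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -f8,10 | male_share`. The original file would first need its `, ` separators converted to tabs.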
