Despite high values of distinction (66.48%) and separation (99.99%), the fnlwgt column is not a QID because it represents a weight, not a count of individuals in the same equivalence class in the original dataset. Additionally, it cannot easily be linked to an auxiliary dataset.
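The two metrics quoted above can be sketched in a few lines of awk. This is a minimal sketch that assumes distinction is the number of distinct values divided by the number of records, and separation is the share of record pairs that differ on the attribute; the anonymization tool may define them slightly differently. The helper name `column_stats` is hypothetical.

```shell
# Hypothetical helper: reads one attribute value per line on stdin and
# prints distinction and separation as percentages (assumed definitions).
column_stats() {
  awk '
    { count[$0]++; n++ }
    END {
      if (n < 2) { print "need at least two records"; exit 1 }
      distinct = 0; same_pairs = 0
      for (v in count) {
        distinct++
        # pairs of records that share the value v
        same_pairs += count[v] * (count[v] - 1) / 2
      }
      total_pairs = n * (n - 1) / 2
      printf "distinction=%.2f%% separation=%.2f%%\n",
             100 * distinct / n, 100 * (1 - same_pairs / total_pairs)
    }'
}

# Toy input: 4 records, 3 distinct values, 1 of 6 pairs identical.
stats=$(printf '%s\n' 100 200 200 300 | column_stats)
echo "$stats"
```

Assuming fnlwgt is the third column of adult_data.csv, the same helper could be applied with `tail -n +2 adult_data.csv | cut -d',' -f3 | column_stats`.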
We determined that age is a QID, since it is widely regarded as one across datasets, in line with HIPAA recommendations.
We noted this dataset contains more males than females.
Higher Precision (Generalization Intensity) implies that the attribute values are closer to those in the original dataset and therefore provide higher utility.
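To make the relation between precision and utility concrete, here is a minimal sketch based on Sweeney's definition of precision, assuming full-domain generalization (every value of an attribute is generalized to the same level). Under that assumption precision is 1 minus the average of level/height over the quasi-identifiers. The helper name `precision` and the levels and hierarchy heights in the example are made up for illustration; they are not the ones used in this project.

```shell
# Hypothetical helper: stdin has one line per quasi-identifier,
# "<name> <applied generalization level> <hierarchy height>".
precision() {
  awk '
    $3 > 0 { loss += $2 / $3; attrs++ }   # fraction of the hierarchy climbed
    END { if (attrs) printf "%.2f\n", 1 - loss / attrs }'
}

# Illustrative values only: age generalized 1 of 4 levels,
# sex untouched, education fully generalized.
prec=$(precision <<'EOF'
age 1 4
sex 0 1
education 2 2
EOF
)
echo "precision=$prec"
```

A dataset with no generalization at all yields 1.00, and one with every attribute fully generalized yields 0.00, which is why a higher value indicates data closer to the original.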
We noted that the contingency between sex and relationship maintained the same distribution after anonymization, meaning that relationship cannot identify an individual's sex any more than it could in the original dataset.
We exported the anonymized dataset and used the following command to verify that there weren't any discrepancies between the education and education-num columns:
cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | cut -d$'\t' -f4,5 | sort -u
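Reading the unique pairs by eye works for a handful of values; the check can also be automated. The sketch below assumes tab-separated "education, education-num" pairs on stdin and flags any value that maps to more than one partner. The helper name `check_mapping` is hypothetical.

```shell
# Hypothetical helper: stdin has "<education>\t<education-num>" per line.
# Exits 0 when the mapping is one-to-one, 1 otherwise.
check_mapping() {
  awk -F'\t' '
    # count each distinct pair once per side
    !seen[$1 FS $2]++ { names[$1]++; nums[$2]++ }
    END {
      bad = 0
      for (k in names) if (names[k] > 1) { print "ambiguous education: " k; bad = 1 }
      for (k in nums)  if (nums[k] > 1)  { print "ambiguous education-num: " k; bad = 1 }
      exit bad
    }'
}

# Consistent sample: duplicates of the same pair are fine.
printf 'Bachelors\t13\nHS-grad\t9\nBachelors\t13\n' | check_mapping && echo "consistent"
```

It could be attached to the end of the pipeline above in place of `sort -u`, so that the exit status reports whether any discrepancy exists.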
/projects/uni/DataAnonymisation/ (master)$ cat adult_data.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d',' -f8,10 | sort | uniq -c | sort -n
      1 Husband, Female
      2 Wife, Male
    430 Other-relative, Female
    551 Other-relative, Male
    792 Unmarried, Male
   1566 Wife, Female
   2245 Own-child, Female
   2654 Unmarried, Female
   2823 Own-child, Male
   3875 Not-in-family, Female
   4430 Not-in-family, Male
  13192 Husband, Male
~/projects/uni/DataAnonymisation/ (master)$ cat anonymized.csv | tail -n +2 | sed -r 's/,([^ ])/\t\1/g' | cut -d$'\t' -f8,10 | sort | uniq -c | sort -n
      1 Husband Female
      2 Wife Male
    168 Other-relative *
    336 Own-child *
    342 Other-relative Female
    471 Other-relative Male
    552 Wife *
    573 Unmarried Male
    728 Unmarried *
   1014 Wife Female
   1649 Not-in-family *
   2042 Husband *
   2081 Own-child Female
   2145 Unmarried Female
   2651 Own-child Male
   3209 Not-in-family Female
   3447 Not-in-family Male
  11150 Husband Male
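One way to compare the two listings above numerically is to compute, per relationship, the share of records marked Male, ignoring rows where sex was suppressed (*). This is only a sketch of one possible comparison, not necessarily the one used for the conclusion in this report; the helper name `male_share` is hypothetical, and the counts are copied from the output above.

```shell
# Hypothetical helper: stdin has "<count> <relationship> <sex>" per line.
# Prints "<relationship> <percentage of Male>" for non-suppressed rows.
male_share() {
  awk '
    $3 == "*" { next }                 # skip suppressed sex values
    { total[$2] += $1 }
    $3 == "Male" { male[$2] += $1 }
    END { for (r in total) printf "%s %.1f\n", r, 100 * male[r] / total[r] }' | sort
}

orig=$(male_share <<'EOF'
1 Husband Female
2 Wife Male
430 Other-relative Female
551 Other-relative Male
792 Unmarried Male
1566 Wife Female
2245 Own-child Female
2654 Unmarried Female
2823 Own-child Male
3875 Not-in-family Female
4430 Not-in-family Male
13192 Husband Male
EOF
)

anon=$(male_share <<'EOF'
1 Husband Female
2 Wife Male
168 Other-relative *
336 Own-child *
342 Other-relative Female
471 Other-relative Male
552 Wife *
573 Unmarried Male
728 Unmarried *
1014 Wife Female
1649 Not-in-family *
2042 Husband *
2081 Own-child Female
2145 Unmarried Female
2651 Own-child Male
3209 Not-in-family Female
3447 Not-in-family Male
11150 Husband Male
EOF
)

echo "original:";   echo "$orig"
echo "anonymized:"; echo "$anon"
```

Printing both lists makes it easy to see how far the per-relationship shares move after anonymization.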