Final report

Hugo Sales 2022-06-05 23:23:59 +01:00
parent 380400f343
commit ccf85b166a
Signed by untrusted user who does not match committer: someonewithpc
GPG Key ID: 7D0C7EAFC9D835A0
2 changed files with 65 additions and 11 deletions


@ -248,8 +248,41 @@ models was reduced drastically, to 1.3%.
### Utility
The original Classification Performance, a measure of how well the attributes
predict the target variable (`prediction`) was 83.24% and it remained
#### Definitions
##### Precision
Measures data distortion, equated to the Generalization Intensity
(Gen. Intensity) of attribute values. [1]
##### Information Loss
Measures the extent to which values are generalized. It summarizes the
degree to which transformed attribute values cover the original domain
of an attribute. It is equated to the converse of Granularity.
We checked [2], as mentioned in ARX's help, but no useful definition of
granularity was provided therein.
##### Classification Performance
Measures how well the attributes predict the target variable
(`prediction`, in this case).
##### Discernibility
Measures the size of groups of indistinguishable records, with
a penalty for records that have been completely suppressed. [3]
##### Average class size
Measures the average size of groups of indistinguishable records. [4]
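As a rough illustration of the last two definitions, the sketch below computes both measures on a toy table. It assumes that suppressed values are marked with `*` and that each fully suppressed record is charged the size of the whole dataset, in the spirit of the penalty described in [3]. The function names and sample rows are hypothetical, and ARX may normalize the values it reports differently.

```python
from collections import Counter

SUPPRESSED = "*"  # assumption: marker used for suppressed values


def split_records(rows, quasi_identifiers):
    """Return (sizes of equivalence classes, number of fully suppressed rows)."""
    classes = Counter()
    suppressed = 0
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        if all(value == SUPPRESSED for value in key):
            suppressed += 1
        else:
            classes[key] += 1
    return list(classes.values()), suppressed


def discernibility(class_sizes, suppressed, total):
    """Charge each record the size of its class; suppressed records cost `total`."""
    return sum(size * size for size in class_sizes) + suppressed * total


def average_class_size(class_sizes):
    """Plain mean of the sizes of the groups of indistinguishable records."""
    return sum(class_sizes) / len(class_sizes)


# Hypothetical toy data: two classes (sizes 2 and 3) and one suppressed row.
rows = [
    {"age": "[20, 40[", "sex": "Male"},
    {"age": "[20, 40[", "sex": "Male"},
    {"age": "[40, 60[", "sex": "Female"},
    {"age": "[40, 60[", "sex": "Female"},
    {"age": "[40, 60[", "sex": "Female"},
    {"age": "*", "sex": "*"},
]

sizes, n_suppressed = split_records(rows, ["age", "sex"])
print(discernibility(sizes, n_suppressed, len(rows)))
print(average_class_size(sizes))
```

Larger classes raise both values, which is why high readings for these two models indicate coarser, less distinguishable data.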
\pagebreak
### Analysis
The original Classification Performance was 83.24% and it remained
at 82.45%.
10.07% of the values of each attribute are missing from the anonymized dataset. This
@ -257,13 +290,17 @@ value being equal across all attributes suggests entire rows were
removed, rather than select values from separate rows. The only
exception is the `occupation` attribute, which was entirely removed.
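The reasoning above can be checked with a short sketch: if the same rows are missing for every attribute, whole records were suppressed rather than individual cells. The `*` marker, the helper names, and the toy rows are assumptions made for illustration and are not taken from the actual dataset.

```python
SUPPRESSED = "*"  # assumption: marker used for suppressed values


def missing_fraction(rows, attribute):
    """Fraction of records whose value for `attribute` is suppressed."""
    return sum(1 for row in rows if row[attribute] == SUPPRESSED) / len(rows)


def missing_rows(rows, attribute):
    """Indices of the records whose value for `attribute` is suppressed."""
    return {i for i, row in enumerate(rows) if row[attribute] == SUPPRESSED}


# Hypothetical toy data: one of ten rows is fully suppressed,
# and `occupation` is removed everywhere.
rows = [{"age": "[20, 40[", "sex": "Male", "occupation": "*"} for _ in range(9)]
rows.append({"age": "*", "sex": "*", "occupation": "*"})

fractions = {a: missing_fraction(rows, a) for a in ("age", "sex", "occupation")}
same_rows = missing_rows(rows, "age") == missing_rows(rows, "sex")

print(fractions)
print("row-level suppression:", same_rows)
```

Equal fractions alone are only suggestive; comparing the sets of affected row indices, as `same_rows` does, is the stronger check.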
![Quality models](quality.png)
![Quality models](quality.png){width=16cm}
The values for precision (Gen. Intensity) and Information Loss (Granularity)
are high. The values for Discernibility and Average Equivalence Class Size
The high values for Generalization Intensity and Granularity suggest a
moderate amount of information loss and a loss of precision.
The values for Discernibility and Average Equivalence Class Size
are also high. And in general, all the quality models (both attribute-level
and dataset-level) are high. Hence, we believe we've achieved a good utility.
and dataset-level) are high.
However, given the classification performance is maintained, this was
deemed acceptable.
### Effect of parameters
@ -311,8 +348,7 @@ $10^{-6}$, the performance was really good.
### Re-identification risk
All indicators for risk by each attacker model was between 0.1% and 0.9%.
All indicators for risk by each attacker model were between 0.1% and 0.9%.
### Utility
@ -322,7 +358,6 @@ at 80.97%.
Nearly 16% of attributes are missing, with the exception of `age` and
`education-num`, which are 100% missing.
### Effect of parameters
An $\epsilon$ value of 3 maintained the accuracy at 80.5% with
@ -334,14 +369,16 @@ performance of 82.05% and a missings value of 21.02% for all attributes.
A further increase of $\delta$ to $10^{-4}$ resulted in an increased
accuracy of 82.32%, but a maximal risk of 1.25%.
# Results
$(\epsilon, \delta)$-Differential Privacy resulted in more missing
attributes, leading to lower precision. Hence, we opted for
k-Anonymity, despite the higher maximal risk.
The 8-anonymity model was chosen as it resulted in a broader
distribution of values for attributes like `age`, whereas with
Differential Privacy, they were split into only 2 categories.
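One way to make this comparison concrete is to count the distinct generalized categories of an attribute in each output. The sketch below does so for `age`; the interval labels and the two tiny tables are invented for illustration and do not reproduce the actual generalization hierarchies.

```python
from collections import Counter


def value_distribution(rows, attribute):
    """Count how often each (generalized) value of `attribute` occurs."""
    return Counter(row[attribute] for row in rows)


# Hypothetical outputs: finer intervals under k-anonymity,
# only two intervals under differential privacy.
k_anonymity_rows = [
    {"age": a} for a in ("[17, 30[", "[30, 45[", "[45, 60[", "[60, 91[")
]
dp_rows = [
    {"age": a} for a in ("[17, 54[", "[54, 91[", "[17, 54[", "[54, 91[")
]

k_categories = len(value_distribution(k_anonymity_rows, "age"))
dp_categories = len(value_distribution(dp_rows, "age"))

print(k_categories, dp_categories)
```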
# Observations
We noted that the contingency between `sex` and `relationship` maintained
@ -388,3 +425,20 @@ cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
Since there were occurrences of (Wife, Male), "({Husband, Wife}, Male)"
does not undo the transformation of the `relationship` attribute.
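For readers who prefer to reproduce the count outside the shell, the following is a minimal Python sketch of the same idea as the `cut | sort | uniq -c` pipeline: it counts pairs of fields 8 and 10 in `;`-separated records. The sample lines are placeholders (only the two relevant fields carry meaning), and the function name is hypothetical.

```python
import csv
from collections import Counter


def count_pairs(lines, first=8, second=10, delimiter=";"):
    """Count co-occurrences of two 1-indexed fields, like `cut -f8,10 | sort | uniq -c`."""
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter((row[first - 1], row[second - 1]) for row in reader)


def make_line(relationship, sex):
    """Build a placeholder record with `relationship` in field 8 and `sex` in field 10."""
    return ";".join(["x"] * 7 + [relationship, "x", sex])


# Hypothetical sample, not taken from the real dataset.
lines = [
    make_line("Husband", "Male"),
    make_line("Husband", "Male"),
    make_line("Wife", "Female"),
    make_line("Wife", "Male"),
]

counts = count_pairs(lines)

# Print in ascending order of frequency, mirroring `sort -n`.
for (relationship, sex), n in sorted(counts.items(), key=lambda item: item[1]):
    print(n, relationship, sex)
```

Any nonzero count for `("Wife", "Male")` is enough to show that knowing `sex` does not uniquely recover the original `relationship` value.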
# Citations
1: Sweeney, L.: Achieving k-anonymity privacy protection
using generalization and suppression. J. Uncertain. Fuzz. Knowl. Sys.
10 (5), p. 571-588 (2002).
2: Iyengar, V.: Transforming data to satisfy privacy
constraints. Proc. Int. Conf. Knowl. Disc. Data Mining, p. 279-288
(2002).
3: Bayardo, R., Agrawal, R.: Data privacy through optimal
k-anonymization. Proc. Int. Conf. Data Engineering, p. 217-228 (2005).
4: LeFevre, K., DeWitt, D., Ramakrishnan, R.: Mondrian
multidimensional k-anonymity. Proc. Int. Conf. Data Engineering
(2006).

Binary file not shown.