Final report
parent
380400f343
commit
ccf85b166a
76
README.md
@@ -248,8 +248,41 @@ models was reduced drastically, to 1.3%.

### Utility

The original Classification Performance, a measure of how well the attributes
predict the target variable (`prediction`) was 83.24% and it remained
#### Definitions

##### Precision

Measures data distortion, equated to the Generalization Intensity
(Gen. Intensity) of attribute values. [1]

##### Information Loss

Measures the extent to which values are generalized. It summarizes the
degree to which transformed attribute values cover the original domain
of an attribute. It is equated to the converse of Granularity.

We checked [2], as mentioned in ARX's help, but no useful definition of
granularity was provided therein.

##### Classification Performance

Measures how well the attributes predict the target variable
(`prediction`, in this case).

##### Discernibility

Measures the size of groups of indistinguishable records, with
a penalty for records which have been completely suppressed. [3]

##### Average class size

Measures the average size of groups of indistinguishable records. [4]

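The Discernibility and Average class size definitions can be made concrete with a minimal sketch. It assumes the unnormalized form of the discernibility metric from [3] (each group of indistinguishable records costs the square of its size, and each fully suppressed record costs the size of the whole dataset); ARX may report a normalized value instead. The function names and the sample records are hypothetical.

```python
from collections import Counter


def equivalence_classes(records):
    """Count how many records share each combination of attribute values."""
    return Counter(tuple(r) for r in records)


def discernibility(records, suppressed=0):
    """Sum of squared class sizes, plus |D| for every fully suppressed record.

    `records` holds only the records that were kept; `suppressed` is the
    number of records that were removed entirely (assumed convention).
    """
    dataset_size = len(records) + suppressed
    class_penalty = sum(n * n for n in equivalence_classes(records).values())
    return class_penalty + suppressed * dataset_size


def average_class_size(records):
    """Mean number of records per group of indistinguishable records."""
    classes = equivalence_classes(records)
    return len(records) / len(classes)


# Hypothetical generalized records: (age range, sex)
kept = [("30-39", "Male")] * 3 + [("40-49", "Female")] * 2
print(discernibility(kept, suppressed=1))  # 3*3 + 2*2 + 1*6 = 19
print(average_class_size(kept))            # 5 / 2 = 2.5
```

Larger groups raise both values, which is why high Discernibility and a high average class size indicate coarser data.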
\pagebreak

### Analysis

The original Classification Performance was 83.24% and it remained
nearly unchanged at 82.45%.

10.07% of the values of each attribute are missing from the anonymized dataset. This
@@ -257,13 +290,17 @@ value being equal across all attributes suggests entire rows were
removed, rather than select values from separate rows. The only
exception is the `occupation` attribute, which was entirely removed.

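The reasoning about missing values can be checked with a small sketch that computes the share of missing values per attribute. It assumes a `;`-separated export with a header row and that suppressed values are written as `*`; both the marker and the sample data are assumptions for illustration, not taken from the actual dataset.

```python
import csv
import io


def missing_share(rows, missing="*"):
    """Return, per column, the fraction of values equal to `missing`."""
    rows = list(rows)
    header, data = rows[0], rows[1:]
    return {
        name: sum(row[i] == missing for row in data) / len(data)
        for i, name in enumerate(header)
    }


# Hypothetical anonymized export: one fully suppressed row,
# and `occupation` suppressed everywhere.
sample = io.StringIO(
    "age;occupation;sex\n"
    "30-39;*;Male\n"
    "*;*;*\n"
    "40-49;*;Female\n"
    "30-39;*;Male\n"
)
shares = missing_share(csv.reader(sample, delimiter=";"))
print(shares)  # {'age': 0.25, 'occupation': 1.0, 'sex': 0.25}
```

Identical shares for every attribute except one are what whole-row suppression produces; suppressing individual cells would normally give different shares per column.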
![Quality models](quality.png)
![Quality models](quality.png){width=16cm}

The values for precision (Gen. Intensity) and Information Loss (Granularity)
are high. The values for Discernibility and Average Equivalence Class Size
The high values for Generalization Intensity and Granularity suggest a
moderate amount of information loss and a loss of precision.

The values for Discernibility and Average Equivalence Class Size
are also high. And in general, all the quality models (both attribute-level
and dataset-level) are high. Hence, we believe we've achieved a good utility.
and dataset-level) are high.

However, given the classification performance is maintained, this was
deemed acceptable.

### Effect of parameters

@@ -311,8 +348,7 @@ $10^{-6}$, the performance was really good.

### Re-identification risk

All indicators for risk by each attacker model was between 0.1% and 0.9%.

All indicators for risk by each attacker model were between 0.1% and 0.9%.

### Utility

@@ -322,7 +358,6 @@ at 80.97%.
Nearly 16% of the values of each attribute are missing, with the exception of `age` and
`education-num`, which are 100% missing.


### Effect of parameters

An $\epsilon$ value of 3 maintained the accuracy at 80.5% with
@@ -334,14 +369,16 @@ performance of 82.05% and a missings value of 21.02% for all attributes.
A further increase of $\delta$ to $10^{-4}$ resulted in an increased
accuracy of 82.32%, but a maximal risk of 1.25%.


# Results

$(\epsilon, \delta)$-Differential Privacy resulted in more missing
attribute values, leading to a lower precision; hence we opted for
k-Anonymity, despite the higher maximal risk.

The 8-anonymity model was chosen as it resulted in a broader
distribution of attribute values like `age`, whereas with Differential
Privacy, they were split into only 2 categories.


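As a reminder of what the chosen model guarantees, k-anonymity requires every combination of quasi-identifier values to occur at least k times. The following is a minimal sketch of that check, not ARX's implementation; the function name, the choice of quasi-identifiers, and the records are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    classes = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(size >= k for size in classes.values())


# Hypothetical generalized records: one group of 8 and one group of 7.
records = (
    [{"age": "30-39", "sex": "Male", "prediction": "<=50K"}] * 8
    + [{"age": "40-49", "sex": "Female", "prediction": ">50K"}] * 7
)
print(is_k_anonymous(records, ["age", "sex"], k=8))  # False: one group has 7
print(is_k_anonymous(records, ["age", "sex"], k=7))  # True
```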
# Observations

We noted that the contingency between `sex` and `relationship` maintained
@@ -388,3 +425,20 @@ cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t

Since there were occurrences of (Wife, Male), the combination "({Husband, Wife}, Male)"
does not undo the transformation of the `relationship` attribute.

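The same contingency count can be sketched in Python for readers who prefer it over the shell pipeline. It assumes, as the `cut -d';' -f8,10` call suggests, that fields 8 and 10 of the `;`-separated file hold `relationship` and `sex`; the helper names and the sample rows are hypothetical.

```python
import csv
from collections import Counter


def pair_counts(lines, first=8, second=10, delimiter=";"):
    """Count co-occurrences of two 1-based fields, like `cut | sort | uniq -c`."""
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter((row[first - 1], row[second - 1]) for row in reader)


def make_row(relationship, sex):
    """Build a hypothetical 10-field row with placeholder values elsewhere."""
    fields = ["x"] * 10
    fields[7] = relationship  # field 8
    fields[9] = sex           # field 10
    return ";".join(fields)


lines = [
    make_row("Husband", "Male"),
    make_row("Husband", "Male"),
    make_row("Wife", "Female"),
    make_row("Wife", "Male"),
]
counts = pair_counts(lines)
# Print in ascending order of frequency, like `sort -n`.
for (relationship, sex), n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(n, relationship, sex)
```

A non-zero count for (Wife, Male) is what prevents a generalized value such as {Husband, Wife} from being resolved by looking at `sex` alone.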
# Citations

1: Sweeney, L.: Achieving k-anonymity privacy protection
using generalization and suppression. J. Uncertain. Fuzz. Knowl. Sys.
10 (5), p. 571-588 (2002).

2: Iyengar, V.: Transforming data to satisfy privacy
constraints. Proc. Int. Conf. Knowl. Disc. Data Mining, p. 279-288
(2002).

3: Bayardo, R., Agrawal, R.: Data privacy through optimal
k-anonymization. Proc. Int. Conf. Data Engineering, p. 217-228 (2005).

4: LeFevre, K., DeWitt, D., Ramakrishnan, R.: Mondrian
multidimensional k-anonymity. Proc. Int. Conf. Data Engineering
(2006).
BIN
report.pdf
Binary file not shown.