Final report
parent
380400f343
commit
ccf85b166a
76
README.md
@ -248,8 +248,41 @@ models was reduced drastically, to 1.3%.
|
|||||||
|
|
||||||
### Utility
|
### Utility
|
||||||
|
|
||||||
The original Classification Performance, a measure of how well the attributes
|
#### Definitions
|
||||||
predict the target variable (`prediction`) was 83.24% and it remained
|
|
||||||
|
##### Precision
|
||||||
|
|
||||||
|
Measures data distortion, equated to the Generalization Intensity
|
||||||
|
(Gen. Intensity) of attribute values. [1]
|
||||||
|
|
||||||
|
##### Information Loss
|
||||||
|
|
||||||
|
Measures the extent to which values are generalized. It summarizes the
|
||||||
|
degree to which transformed attribute values cover the original domain
|
||||||
|
of an attribute. It is equated to the converse of Granularity.
|
||||||
|
|
||||||
|
We checked [2], as mentioned in ARX's help, but no useful definition of
|
||||||
|
granularity was provided therein.
|
||||||
|
|
||||||
|
##### Classification Performance
|
||||||
|
|
||||||
|
Measures how well the attributes predict the target variable
|
||||||
|
(`prediction`, in this case).
|
||||||
|
|
||||||
|
##### Discernibility
|
||||||
|
|
||||||
|
Measures the size of groups of indistinguishable records, with
|
||||||
|
a penalty for records which have been completely suppressed. [3]
|
||||||
|
|
||||||
|
##### Average class size
|
||||||
|
|
||||||
|
Measures the average size of groups of indistinguishable records. [4]
|
||||||
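As a rough illustration of the two dataset-level definitions above, the following sketch computes Discernibility and Average class size for a toy table. It assumes the common formulation in which each equivalence class of size `s` contributes `s * s`, and each fully suppressed record is penalized by the dataset size; ARX may normalize these values differently. The function names, the `*` suppression marker, and the sample records are hypothetical, not taken from the report.

```python
from collections import Counter


def equivalence_classes(records, quasi_identifiers):
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def discernibility(records, quasi_identifiers, suppressed="*"):
    """Sum of squared class sizes, plus a penalty for suppressed records.

    Assumption: a record counts as suppressed when every quasi-identifier
    equals the `suppressed` marker; each such record costs len(records).
    """
    n = len(records)
    total = 0
    for key, size in equivalence_classes(records, quasi_identifiers).items():
        if all(value == suppressed for value in key):
            total += size * n
        else:
            total += size * size
    return total


def average_class_size(records, quasi_identifiers):
    """Plain mean size of the groups of indistinguishable records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return len(records) / len(classes)


# Made-up sample: two classes of real values and one suppressed record.
sample = [
    {"age": "30-39", "sex": "Male"},
    {"age": "30-39", "sex": "Male"},
    {"age": "40-49", "sex": "Female"},
    {"age": "*", "sex": "*"},
]
qi = ["age", "sex"]
print(discernibility(sample, qi))
print(average_class_size(sample, qi))
```

Larger values of either measure indicate coarser groups, which is why high values are read as information loss rather than as good utility.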
|
|
||||||
|
\pagebreak
|
||||||
|
|
||||||
|
### Analysis
|
||||||
|
|
||||||
|
The original Classification Performance was 83.24% and it remained
|
||||||
at 82.45%.
|
at 82.45%.
|
||||||
|
|
||||||
10.07% of attributes are missing from the anonymized dataset. This
|
10.07% of attributes are missing from the anonymized dataset. This
|
||||||
@ -257,13 +290,17 @@ value being equal across all attributes suggests entire rows were
|
|||||||
removed, rather than select values from separate rows. The only
|
removed, rather than select values from separate rows. The only
|
||||||
exception is the `occupation` attribute, which was entirely removed.
|
exception is the `occupation` attribute, which was entirely removed.
|
||||||
|
|
||||||
![Quality models](quality.png)
|
![Quality models](quality.png){width=16cm}
|
||||||
|
|
||||||
The values for precision (Gen. Intensity) and Information Loss (Granularity)
|
The high values for Generalization Intensity and Granularity suggest a
|
||||||
are high. The values for Discernibility and Average Equivalence Class Size
|
moderate amount of information loss and a loss of precision.
|
||||||
|
|
||||||
|
The values for Discernibility and Average Equivalence Class Size
|
||||||
are also high. And in general, all the quality models (both attribute-level
|
are also high. And in general, all the quality models (both attribute-level
|
||||||
and dataset-level) are high. Hence, we believe we've achieved a good utility.
|
and dataset-level) are high.
|
||||||
|
|
||||||
|
However, given the classification performance is maintained, this was
|
||||||
|
deemed acceptable.
|
||||||
|
|
||||||
### Effect of parameters
|
### Effect of parameters
|
||||||
|
|
||||||
@ -311,8 +348,7 @@ $10^{-6}$, the performance was really good.
|
|||||||
|
|
||||||
### Re-identification risk
|
### Re-identification risk
|
||||||
|
|
||||||
All indicators for risk by each attacker model was between 0.1% and 0.9%.
|
All indicators for risk by each attacker model were between 0.1% and 0.9%.
|
||||||
|
|
||||||
|
|
||||||
### Utility
|
### Utility
|
||||||
|
|
||||||
@ -322,7 +358,6 @@ at 80.97%.
|
|||||||
Nearly 16% of attributes are missing, with the exception of `age` and
|
Nearly 16% of attributes are missing, with the exception of `age` and
|
||||||
`education-num`, which are 100% missing.
|
`education-num`, which are 100% missing.
|
||||||
|
|
||||||
|
|
||||||
### Effect of parameters
|
### Effect of parameters
|
||||||
|
|
||||||
An $\epsilon$ value of 3 maintained the accuracy at 80.5% with
|
An $\epsilon$ value of 3 maintained the accuracy at 80.5% with
|
||||||
@ -334,14 +369,16 @@ performance of 82.05% and a missings value of 21.02% for all attributes.
|
|||||||
A further increase of $\delta$ to $10^{-4}$ resulted in an increased
|
A further increase of $\delta$ to $10^{-4}$ resulted in an increased
|
||||||
accuracy of 82.32%, but a maximal risk of 1.25%.
|
accuracy of 82.32%, but a maximal risk of 1.25%.
|
||||||
|
|
||||||
|
|
||||||
# Results
|
# Results
|
||||||
|
|
||||||
|
$(\epsilon, \delta)$-Differential Privacy resulted in more missing
|
||||||
|
attributes, leading to a lower precision. Hence, we opted for
|
||||||
|
k-Anonymity, despite the higher maximal risk.
|
||||||
|
|
||||||
The 8-anonymity model was chosen as it resulted in a broader
|
The 8-anonymity model was chosen as it resulted in a broader
|
||||||
distribution of attribute values like `age`, whereas with Differential
|
distribution of attribute values like `age`, whereas with Differential
|
||||||
Privacy, they were split into only 2 categories.
|
Privacy, they were split into only 2 categories.
|
||||||
|
|
||||||
|
|
||||||
# Observations
|
# Observations
|
||||||
|
|
||||||
We noted that the contingency between `sex` and `relationship` maintained
|
We noted that the contingency between `sex` and `relationship` maintained
|
||||||
@ -388,3 +425,20 @@ cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t
|
|||||||
|
|
||||||
Since there were occurrences of (Wife, Male), "({Husband, Wife}, Male)"
|
Since there were occurrences of (Wife, Male), "({Husband, Wife}, Male)"
|
||||||
does not undo the transformation of the `relationship` attribute.
|
does not undo the transformation of the `relationship` attribute.
|
||||||
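The contingency check can also be sketched in Python, as a rough equivalent of the `cut -d';' -f8,10 | sort | uniq -c` pipeline. This assumes a semicolon-delimited export in which fields 8 and 10 (1-based) hold `relationship` and `sex`; the function names and the sample rows are made up for illustration.

```python
import csv
from collections import Counter


def contingency(lines, first=7, second=9, delimiter=";"):
    """Count co-occurrences of two columns (0-based indices) in delimited text."""
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter(
        (row[first], row[second])
        for row in reader
        if len(row) > max(first, second)  # skip short or empty lines
    )


def make_row(relationship, sex):
    """Build a hypothetical 10-field record with placeholder values."""
    fields = ["x"] * 10
    fields[7] = relationship
    fields[9] = sex
    return ";".join(fields)


# Made-up rows; a real run would iterate over the exported dataset file.
rows = [
    make_row("Husband", "Male"),
    make_row("Husband", "Male"),
    make_row("Wife", "Female"),
    make_row("Wife", "Male"),
]
counts = contingency(rows)
for (relationship, sex), n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(n, relationship, sex)
```

A non-zero count for a pair such as (Wife, Male) is what prevents `sex` from being used to recover the original `relationship` value.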
|
|
||||||
|
# Citations
|
||||||
|
|
||||||
|
1: Sweeney, L.: Achieving k-anonymity privacy protection
|
||||||
|
using generalization and suppression. J. Uncertain. Fuzz. Knowl. Sys.
|
||||||
|
10 (5), p. 571-588 (2002).
|
||||||
|
|
||||||
|
2: Iyengar, V.: Transforming data to satisfy privacy
|
||||||
|
constraints. Proc. Int. Conf. Knowl. Disc. Data Mining, p. 279-288
|
||||||
|
(2002).
|
||||||
|
|
||||||
|
3: Bayardo, R., Agrawal, R.: Data privacy through optimal
|
||||||
|
k-anonymization. Proc. Int. Conf. Data Engineering, p. 217-228 (2005).
|
||||||
|
|
||||||
|
4: LeFevre, K., DeWitt, D., Ramakrishnan, R.: Mondrian
|
||||||
|
multidimensional k-anonymity. Proc. Int. Conf. Data Engineering
|
||||||
|
(2006).
|
||||||
|
BIN
report.pdf
Binary file not shown.