diff --git a/README.md b/README.md
index de821fc..bcbecc4 100644
--- a/README.md
+++ b/README.md
@@ -248,8 +248,41 @@ models was reduced drastically, to 1.3%.

 ### Utility

-The original Classification Performance, a measure of how well the attributes
-predict the target variable (`prediction`) was 83.24% and it remained
+#### Definitions
+
+##### Precision
+
+Measures data distortion, equated to the Generalization Intensity
+(Gen. Intensity) of attribute values. [1]
+
+##### Information Loss
+
+Measures the extent to which values are generalized. It summarizes the
+degree to which transformed attribute values cover the original domain
+of an attribute. It is equated to the converse of Granularity.
+
+We checked [2], as mentioned in ARX's help, but no useful definition of
+granularity was provided therein.
+
+##### Classification Performance
+
+Measures how well the attributes predict the target variable
+(`prediction`, in this case).
+
+##### Discernibility
+
+Measures the size of groups of indistinguishable records, with
+a penalty for records which have been completely suppressed. [3]
+
+##### Average Equivalence Class Size
+
+Measures the average size of groups of indistinguishable records. [4]
+
+\pagebreak
+
+### Analysis
+
+The original Classification Performance was 83.24% and it remained
 at 82.45%.

 10.07% of attributes are missing from the anonymized dataset. This
@@ -257,13 +290,17 @@ value being equal across all attributes suggests entire rows were
 removed, rather than select values from separate rows. The only
 exception is the `occupation` attribute, which was entirely removed.

-![Quality models](quality.png)
+![Quality models](quality.png){width=16cm}

-The values for precision (Gen. Intensity) and Information Loss (Granularity)
-are high. The values for Discernibility and Average Equivalence Class Size
+The high values for Generalization Intensity and Granularity suggest a
+moderate amount of information loss and a loss of precision.
+
+The values for Discernibility and Average Equivalence Class Size
 are also high. And in general, all the quality models (both attribute-level
-and dataset-level) are high. Hence, we believe we've achieved a good utility.
+and dataset-level) are high.
+However, given that the classification performance is maintained, this was
+deemed acceptable.

 ### Effect of parameters

@@ -311,8 +348,7 @@ $10^{-6}$, the performance was really good.

 ### Re-identification risk

-All indicators for risk by each attacker model was between 0.1% and 0.9%.
-
+All indicators for risk by each attacker model were between 0.1% and 0.9%.

 ### Utility

@@ -322,7 +358,6 @@ at 80.97%.
 Nearly 16% of attributes are missing, with the exception of `age` and
 `education-num`, which are 100% missing.
-

 ### Effect of parameters

 An $\epsilon$ value of 3 maintained the accuracy at 80.5% with
@@ -334,14 +369,16 @@ performance of 82.05% and a missings value of 21.02% for all
 attributes.

 A further increase of $\delta$ to $10^{-4}$ resulted in an increased
 accuracy of 82.32%, but a maximal risk of 1.25%.
-

 # Results

+$(\epsilon, \delta)$-Differential Privacy resulted in more missing
+attributes, leading to a lower precision; hence we opted for
+k-Anonymity, despite the higher maximal risk.
+
 The 8-anonymity model was chosen as it resulted in a broader
 distribution of attribute values like `age`, whereas with Differential
 Privacy, they were split into only 2 categories.
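+
+For reference, the guarantees behind the two models can be stated in
+their textbook form (generic notation, not tied to ARX's
+implementation): a dataset is k-anonymous if every equivalence class
+$E$ over the quasi-identifiers satisfies $|E| \geq k$, while a
+mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private
+if, for all neighbouring datasets $D$, $D'$ and all output sets $S$,
+
+$$\Pr[\mathcal{M}(D) \in S] \leq e^{\epsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta.$$
+
+Larger $\epsilon$ and $\delta$ relax this bound, which is consistent
+with the higher accuracy and higher maximal risk observed above when
+increasing $\delta$.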
-

 # Observations

 We noted that the contingency between `sex` and `relationship` maintained
@@ -388,3 +425,20 @@ cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t

 Since there were occurrences of (Wife, Male), "({Husband, Wife}, Male)"
 does not undo the transformation of the `relationship` attribute.
+
+# Citations
+
+1: Sweeney, L.: Achieving k-anonymity privacy protection
+using generalization and suppression. Int. J. Uncertain. Fuzz.
+Knowl.-Based Syst. 10 (5), p. 571-588 (2002).
+
+2: Iyengar, V.: Transforming data to satisfy privacy
+constraints. Proc. Int. Conf. Knowl. Disc. Data Mining, p. 279-288
+(2002).
+
+3: Bayardo, R., Agrawal, R.: Data privacy through optimal
+k-anonymization. Proc. Int. Conf. Data Engineering, p. 217-228 (2005).
+
+4: LeFevre, K., DeWitt, D., Ramakrishnan, R.: Mondrian
+multidimensional k-anonymity. Proc. Int. Conf. Data Engineering
+(2006).
diff --git a/report.pdf b/report.pdf
index b49357d..1caa662 100644
Binary files a/report.pdf and b/report.pdf differ
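The contingency check referenced in the last README hunk above relies on
the `cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t`
pipeline. A minimal stand-alone sketch of that check, assuming the input
is a semicolon-separated export with a header row, here called `data.csv`
(a hypothetical name), and that columns 8 and 10 hold `relationship` and
`sex`:

```sh
# Tabulate (relationship, sex) pair counts from a semicolon-separated export.
# The file name, the header row, and the column positions are assumptions;
# adjust them to the actual export before running.
tail -n +2 data.csv \
  | cut -d';' -f8,10 \
  | sort \
  | uniq -c \
  | sort -n \
  | column -s ';' -t
```

The least frequent pairs appear first; any (Wife, Male) rows are the
occurrences discussed in the observations above.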