Final report

2022-06-05 23:23:59 +01:00
parent 380400f343
commit ccf85b166a
2 changed files with 65 additions and 11 deletions
--- a/README.md
+++ b/README.md
@@ -248,8 +248,41 @@ models was reduced drastically, to 1.3%.

 ### Utility

-The original Classification Performance, a measure of how well the attributes
-predict the target variable (`prediction`) was 83.24% and it remained
+#### Definitions
+
+##### Precision
+
+Measures data distortion, equated to the Generalization Intensity
+(Gen. Intensity) of attribute values. [1]
+
+##### Information Loss
+
+Measures the extent to which values are generalized. It summarizes the
+degree to which transformed attribute values cover the original domain
+of an attribute. It is equated to the converse of Granularity.
+
+We checked [2], as mentioned in ARX's help, but no useful definition of
+granularity was provided therein.
+
+##### Classification Performance
+
+Measures how well the attributes predict the target variable
+(`prediction`, in this case).
+
+##### Discernibility
+
+Measures the size of groups of indistinguishable records, with
+a penalty for records that have been completely suppressed. [3]
+
+##### Average class size
+
+Measures the average size of groups of indistinguishable records. [4]
+
+\pagebreak
+
+### Analysis
+
+The original Classification Performance was 83.24%, and it remained
 at 82.45%.

 10.07% of attributes are missing from the anonymized dataset. This
@@ -257,13 +290,17 @@ value being equal across all atributes suggests entire rows were
 removed, rather than select values from separate rows. The only
 exception is the `occupation` attribute, which was entirely removed.

-![Quality models](quality.png)
+![Quality models](quality.png){width=16cm}

-The values for precision (Gen. Intensity) and Information Loss (Granularity)
-are high. The values for Discernibility and Average Equivalence Class Size
+The high values for Generalization Intensity and Granularity suggest a
+moderate amount of information loss and a loss of precision.
+
+The values for Discernibility and Average Equivalence Class Size
 are also high. And in general, all the quality models (both attribute-level
-and dataset-level) are high. Hence, we believe we've achieved a good utility.
+and dataset-level) are high.

+However, given that the classification performance was maintained, this
+loss was deemed acceptable.

 ### Effect of parameters

@@ -311,8 +348,7 @@ $10^{-6}$, the performance was really good.

 ### Re-identification risk

-All indicators for risk by each attacker model was between 0.1% and 0.9%.
-
+All indicators for risk by each attacker model were between 0.1% and 0.9%.

 ### Utility

@@ -322,7 +358,6 @@ at 80.97%.
 Nearly 16% of attributes are missing, with the exception of `age` and
 `education-num`, which are 100% missing.

-
 ### Effect of parameters

 An $\epsilon$ value of 3 maintained the accuracy at 80.5% with
@@ -334,14 +369,16 @@ performance of 82.05% and a missings value of 21.02% for all attributes.
 A further increase of $\delta$ to $10^{-4}$ resulted in an increased
 accuracy of 82.32%, but a maximal risk of 1.25%.

-
 # Results

+$(\epsilon, \delta)$-Differential Privacy resulted in more missing
+attributes, leading to lower precision. We therefore opted for
+k-Anonymity, despite its higher maximal risk.
+
 The 8-anonymity model was chosen as it resulted in a broader
 distribution of attribute values like `age`, whereas with Differential
 Privacy, they were split into only 2 categories.

-
 # Observations

 We noted that the contingency between `sex` and `relationship` maintained
@@ -388,3 +425,20 @@ cut -d';' -f8,10 | sort | uniq -c | sort -n | column -s ';' -t

 Since there were occurrences of (Wife, Male), "({Husband, Wife}, Male)"
 does not undo the transformation of the `relationship` attribute.
+
+# Citations
+
+1: Sweeney, L.: Achieving k-anonymity privacy protection
+using generalization and suppression. J. Uncertain. Fuzz. Knowl. Sys.
+10 (5), p. 571-588 (2002).
+
+2: Iyengar, V.: Transforming data to satisfy privacy
+constraints. Proc. Int. Conf. Knowl. Disc. Data Mining, p. 279-288
+(2002).
+
+3: Bayardo, R., Agrawal, R.: Data privacy through optimal
+k-anonymization. Proc. Int. Conf. Data Engineering, p. 217-228 (2005).
+
+4: LeFevre, K., DeWitt, D., Ramakrishnan, R.: Mondrian
+multidimensional k-anonymity. Proc. Int. Conf. Data Engineering
+(2006).
--- a/report.pdf
+++ b/report.pdf