Minor layout changes and text revision

2022-06-05 22:15:04 +01:00
commit 380400f343
parent 7b7fc0e617
2 changed files with 30 additions and 14 deletions
--- a/README.md
+++ b/README.md
@@ -7,12 +7,21 @@ author:
 date: 2022/06/02
 geometry: margin=2cm
 output: pdf_document
 header-includes: |
   \usepackage{float}
   \let\origfigure\figure
   \let\endorigfigure\endfigure
   \renewenvironment{figure}[1][2]{\expandafter\origfigure\expandafter[H]}{\endorigfigure}
 ---
 \vspace{3em}
 # Attribute classification
 We classified the attributes as follows:
 \vspace{3em}
 Attribute        | Classification
 -----------------+---------------
 `age`            | QID
@@ -33,6 +42,7 @@ Attribute        | Classification
 Table: Attribute classifications
 \pagebreak
 ## Justifications
@@ -46,7 +56,7 @@ set of attributes.
 According to HIPAA recommendations, and together with its very high separation value (99.87%), we classify
 this attribute as a QID.
-![Hierarchy for attribute `age`](coding-model/hierarchies/age/age.png){width=14cm}
+![Hierarchy for attribute `age`](coding-model/hierarchies/age/age.png){width=10cm}
 ### `workclass`
@@ -60,7 +70,7 @@ deemed Insensitive.
 Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID
 because it represents a weight, not a count of individuals in the same equivalence class in the
 original dataset. This can be seen with the results below. Additionally, it's not easily connected
-to another auxiliary info dataset.
+to other auxiliary datasets.
 ```sh
 $ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
@@ -79,23 +89,26 @@ Table: Sum of `fnlwgt` for each `sex`
 The sum of these values is 6,179,373,392. This value is much larger than the population of the
 U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
-We also note there are substantially more Male than Female records (more than double the `fnlwgt`).
+We also note there are substantially more Male than Female records:
-
+the sum of `fnlwgt` for Male is more than double that of
 Female, and the number of rows is 10771 for Female and
 21790 for Male.
 ### `education`
 This attribute presents a separation of 80.96%, which is quite high, thus we classified it as a QID.
-![Hierarchy for attribute `education`](coding-model/hierarchies/education/education.png){width=14cm}
+![Hierarchy for attribute `education`](coding-model/hierarchies/education/education.png){width=18cm}
 \vspace{-2em}
 ### `education-num`
-We exported the anonymized dataset and used the following command to verify there weren't any discrepancies
+We used the following command to verify there weren't any
-between the `education` and `education-num` columns:
+discrepancies between the `education` and `education-num` columns:
 ```sh
-$ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
+$ cat adult_data.csv | awk -F',' '{print $5, $4}' | sort -un
 ```
 Since there was a one-to-one mapping, we confirmed this was just a
@@ -103,7 +116,9 @@ representation of the `education` attribute. As such, this attribute
 receives the same classification, which is backed by the equally high
 separation value of 80.96%, so it's classified as a QID.
-![Hierarchy for attribute `education-num`](coding-model/hierarchies/education/education-num.png){width=14cm}
+\vspace{-1em}
 ![Hierarchy for attribute `education-num`](coding-model/hierarchies/education/education-num.png){height=9.5cm}
 ### `marital-status`
@@ -111,21 +126,22 @@ separation value of 80.96%, so it's classified as a QID.
 With a relatively high separation value of 66.01%, together with the fact that it could be cross
 referenced with other available datasets, we classify this attribute as a QID.
-![Hierarchy for attribute `marital-status`](coding-model/hierarchies/marital-status/marital-status.png){width=14cm}
+![Hierarchy for attribute `marital-status`](coding-model/hierarchies/marital-status/marital-status.png){width=10cm}
 ### `occupation`
 With a separation of 90.02%, this attribute is classified as a QID.
-![Hierarchy for attribute `occupation`](coding-model/hierarchies/occupation.png){width=14cm}
+![Hierarchy for attribute `occupation`](coding-model/hierarchies/occupation.png){width=8cm}
 \pagebreak
 ### `relationship`
 Given its separation value of 73.21%, this attribute is classified as a QID.
-![Hierarchy for attribute `relationship`](coding-model/hierarchies/relationship/relationship.png){width=14cm}
+![Hierarchy for attribute `relationship`](coding-model/hierarchies/relationship/relationship.png){width=8cm}
 ### `race`
@@ -134,7 +150,7 @@ This column presents some weirdly specific values (Amer-Indian-Eskimo), but has
 that this attribute could be cross referenced with other datasets, it is classified as a QID, so
 it may be transformed into more generic values.
-![Hierarchy for attribute `race`](coding-model/hierarchies/race.png)
+![Hierarchy for attribute `race`](coding-model/hierarchies/race.png){width=7cm}
 ### `sex`
@@ -165,7 +181,7 @@ Doctorate	 |	   86 |	 327
 Table: Number of records with each `education` for each `sex`
-![Hierarchy for attribute `sex`](coding-model/hierarchies/sex.png)
+![Hierarchy for attribute `sex`](coding-model/hierarchies/sex.png){width=7cm}
 ### `capital-gain` & `capital-loss`
--- a/report.pdf
+++ b/report.pdf
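
The `education` / `education-num` check in the README hunk above relies on `sort -un`. With `-n` and `-u` combined, `sort` treats lines whose leading numeric keys compare equal as duplicates, so a number that mapped to two different labels could be collapsed into a single output line. A minimal sketch of a stricter check, in the same shell/awk style, is shown here. The function name `check_one_to_one` is hypothetical, and the column positions (4 and 5) are taken from the command in the diff; it assumes a plain comma-separated file without quoted commas.

```sh
#!/bin/sh
# Sketch: verify that two comma-separated columns map one-to-one.
# Reads CSV rows on stdin; the two column numbers are passed as arguments.
# Prints "one-to-one" and returns 0 when every value in the first column
# pairs with exactly one value in the second column and vice versa;
# otherwise prints one line per conflicting row and returns 1.
check_one_to_one() {
  awk -F',' -v a="$1" -v b="$2" '
    BEGIN { bad = 0 }
    NF == 0 { next }                       # skip blank lines
    {
      x = $a; y = $b
      gsub(/^[ \t]+|[ \t]+$/, "", x)       # trim padding around each value
      gsub(/^[ \t]+|[ \t]+$/, "", y)
      if ((x in fwd) && fwd[x] != y) {     # same left value, different right value
        print "conflict: " x " -> " fwd[x] ", " y
        bad = 1
      }
      if ((y in rev) && rev[y] != x) {     # same right value, different left value
        print "conflict: " y " <- " rev[y] ", " x
        bad = 1
      }
      if (!(x in fwd)) fwd[x] = y
      if (!(y in rev)) rev[y] = x
    }
    END {
      if (!bad) print "one-to-one"
      exit (bad ? 1 : 0)
    }
  '
}
```

Under these assumptions it could be run as `check_one_to_one 4 5 < adult_data.csv`. Both directions are checked explicitly, so the result does not depend on how `sort` deduplicates its input.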