Minor layout changes and text revision

Hugo Sales 2022-06-05 22:15:04 +01:00
parent 7b7fc0e617
commit 380400f343
Signed by untrusted user who does not match committer: someonewithpc
GPG Key ID: 7D0C7EAFC9D835A0
2 changed files with 30 additions and 14 deletions

@ -7,12 +7,21 @@ author:
date: 2022/06/02
geometry: margin=2cm
output: pdf_document
header-includes: |
\usepackage{float}
\let\origfigure\figure
\let\endorigfigure\endfigure
\renewenvironment{figure}[1][2]{\expandafter\origfigure\expandafter[H]}{\endorigfigure}
---
\vspace{3em}
# Attribute classification
We classified the attributes as follows:
\vspace{3em}
Attribute | Classification
-----------------+---------------
`age` | QID
@ -33,6 +42,7 @@ Attribute | Classification
Table: Attribute classifications
\pagebreak
## Justifications
@ -46,7 +56,7 @@ set of attributes.
According to HIPAA recommendations, and together with its very high separation value (99.87%), we classify
this attribute as a QID.
![Hierarchy for attribute `age`](coding-model/hierarchies/age/age.png){width=14cm}
![Hierarchy for attribute `age`](coding-model/hierarchies/age/age.png){width=10cm}
### `workclass`
@ -60,7 +70,7 @@ deemed Insensitive.
Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID
because it represents a weight, not a count of individuals in the same equivalence class in the
original dataset. This can be seen with the results below. Additionally, it's not easily connected
to another auxiliary info dataset.
to other auxiliary datasets.
```sh
$ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
@ -79,23 +89,26 @@ Table: Sum of `fnlwgt` for each `sex`
The sum of these values is 6,179,373,392. This value is much larger than the population of the
U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
We also note there are substantially more Male than Female records (more than double the `fnlwgt`).
We also note there are substantially more Male than Female records:
the sum of `fnlwgt` for Male is more than double that of Female,
and the number of rows is 21790 for Male against 10771 for
Female.
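
The per-`sex` row counts and `fnlwgt` sums quoted above could be reproduced in one pass with something like the following. This is only a sketch: `sex_summary` is a hypothetical helper name, and it assumes a comma-separated file with one header row, `fnlwgt` in column 3 and `sex` in column 10, which are the column positions used by the report's own `awk` command.

```sh
# Sketch: print "<sex> <row count> <sum of fnlwgt>" for each value of `sex`.
# Assumes a header row, fnlwgt in column 3 and sex in column 10.
sex_summary() {
    tail -n '+2' "$1" | awk -F',' '
        {
            gsub(/^ +| +$/, "", $10)   # tolerate padding around the value
            rows[$10] += 1             # number of records for this sex
            weight[$10] += $3          # running sum of fnlwgt
        }
        END {
            # %.0f keeps large sums from being printed in exponent form
            for (s in rows) printf "%s %d %.0f\n", s, rows[s], weight[s]
        }' | sort
}

# Usage (file name taken from the report): sex_summary adult_data.csv
```

Comparing the two numeric columns makes the distinction concrete: the second is a count of rows, while the third is a sum of weights and can exceed any plausible head count.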
### `education`
This attribute presents a separation of 80.96%, which is quite high, thus we classified it as a QID.
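
For reference, the separation and distinction figures quoted throughout this report could be recomputed roughly as follows. This is a sketch under stated assumptions, not the tool's own implementation: it assumes the common definitions (distinction is the share of distinct values among all records, separation is the share of record pairs that differ in the attribute), and `separation_and_distinction` is a hypothetical helper name.

```sh
# Sketch: distinction and separation of one column of a CSV file.
# $1 = CSV file with a header row, $2 = 1-based column number.
separation_and_distinction() {
    tail -n '+2' "$1" | awk -F',' -v col="$2" '
        { count[$col] += 1; n += 1 }
        END {
            if (n < 2) { print "need at least two rows"; exit 1 }
            same = 0; distinct = 0
            for (v in count) {
                distinct += 1
                # pairs of records sharing the value v
                same += count[v] * (count[v] - 1) / 2
            }
            pairs = n * (n - 1) / 2    # all unordered record pairs
            printf "distinction %.2f%%\n", 100 * distinct / n
            printf "separation %.2f%%\n", 100 * (pairs - same) / pairs
        }'
}

# Usage (assuming education is column 4): separation_and_distinction adult_data.csv 4
```

Under these definitions an attribute with few, evenly spread values can still separate most record pairs, which is why separation is read alongside distinction rather than instead of it.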
![Hierarchy for attribute `education`](coding-model/hierarchies/education/education.png){width=14cm}
![Hierarchy for attribute `education`](coding-model/hierarchies/education/education.png){width=18cm}
\vspace{-2em}
### `education-num`
We exported the anonymized dataset and used the following command to verify there weren't any discrepancies
between the `education` and `education-num` columns:
We used the following command to verify there weren't any
discrepancies between the `education` and `education-num` columns:
```sh
$ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
$ cat adult_data.csv | awk -F',' '{print $5, $4}' | sort -un
```
Since there was a one-to-one mapping, we confirmed this was just a
@ -103,7 +116,9 @@ representation of the `education` attribute. As such, this attribute
receives the same classification, which is backed by the equally high
separation value of 80.96%, so it's classified as a QID.
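
One caveat when checking the mapping: `sort -un` keeps only the first line for each numeric key, so a listing produced that way shows at most one `education` value per `education-num`. A stricter check that tests both directions could look like the sketch below. `check_one_to_one` is a hypothetical helper name, and it assumes a header row with `education` in column 4 and `education-num` in column 5, as in the command above.

```sh
# Sketch: verify that education <-> education-num is one-to-one.
# Prints a verdict and returns non-zero if any conflict is found.
check_one_to_one() {
    tail -n '+2' "$1" | awk -F',' '
        {
            gsub(/^ +| +$/, "", $4); gsub(/^ +| +$/, "", $5)
            # same education seen earlier with a different number
            if (($4 in num_of) && num_of[$4] != $5) bad = 1
            # same number seen earlier with a different education
            if (($5 in edu_of) && edu_of[$5] != $4) bad = 1
            num_of[$4] = $5
            edu_of[$5] = $4
        }
        END { print (bad ? "not one-to-one" : "one-to-one"); exit bad }'
}

# Usage (file name taken from the report): check_one_to_one adult_data.csv
```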
![Hierarchy for attribute `education-num`](coding-model/hierarchies/education/education-num.png){width=14cm}
\vspace{-1em}
![Hierarchy for attribute `education-num`](coding-model/hierarchies/education/education-num.png){height=9.5cm}
### `marital-status`
@ -111,21 +126,22 @@ separation value of 80.96%, so it's classified as a QID.
With a relatively high separation value of 66.01%, together with the fact that it could be cross
referenced with other available datasets, we classify this attribute as a QID.
![Hierarchy for attribute `marital-status`](coding-model/hierarchies/marital-status/marital-status.png){width=14cm}
![Hierarchy for attribute `marital-status`](coding-model/hierarchies/marital-status/marital-status.png){width=10cm}
### `occupation`
With a separation of 90.02%, this attribute is classified as a QID.
![Hierarchy for attribute `occupation`](coding-model/hierarchies/occupation.png){width=14cm}
![Hierarchy for attribute `occupation`](coding-model/hierarchies/occupation.png){width=8cm}
\pagebreak
### `relationship`
Given its separation value of 73.21%, this attribute is classified as a QID.
![Hierarchy for attribute `relationship`](coding-model/hierarchies/relationship/relationship.png){width=14cm}
![Hierarchy for attribute `relationship`](coding-model/hierarchies/relationship/relationship.png){width=8cm}
### `race`
@ -134,7 +150,7 @@ This column presents some weirdly specific values (Amer-Indian-Eskimo), but has
that this attribute could be cross-referenced with other datasets, it is classified as a QID, so
it may be transformed into more generic values.
![Hierarchy for attribute `race`](coding-model/hierarchies/race.png)
![Hierarchy for attribute `race`](coding-model/hierarchies/race.png){width=7cm}
### `sex`
@ -165,7 +181,7 @@ Doctorate | 86 | 327
Table: Number of records with each `education` for each `sex`
![Hierarchy for attribute `sex`](coding-model/hierarchies/sex.png)
![Hierarchy for attribute `sex`](coding-model/hierarchies/sex.png){width=7cm}
### `capital-gain` & `capital-loss`

Binary file not shown.