Minor layout changes and text revision
parent 7b7fc0e617
commit 380400f343
44 README.md
@@ -7,12 +7,21 @@ author:
|
||||
date: 2022/06/02
|
||||
geometry: margin=2cm
|
||||
output: pdf_document
|
||||
header-includes: |
|
||||
\usepackage{float}
|
||||
\let\origfigure\figure
|
||||
\let\endorigfigure\endfigure
|
||||
\renewenvironment{figure}[1][2]{\expandafter\origfigure\expandafter[H]}{\endorigfigure}
|
||||
---
|
||||
|
||||
\vspace{3em}
|
||||
|
||||
# Attribute classification
|
||||
|
||||
We classified the attributes as follows:
|
||||
|
||||
\vspace{3em}
|
||||
|
||||
Attribute | Classification
|
||||
-----------------+---------------
|
||||
`age` | QID
|
||||
@@ -33,6 +42,7 @@ Attribute | Classification
|
||||
|
||||
Table: Attribute classifications
|
||||
|
||||
\pagebreak
|
||||
|
||||
## Justifications
|
||||
|
||||
@@ -46,7 +56,7 @@ set of attributes.
|
||||
According to HIPAA recommendations, and given its very high separation value (99.87%), we classify
|
||||
this attribute as a QID.
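
A minimal sketch of how distinction and separation values like the ones quoted in this report could be recomputed from the CSV. It assumes the usual definitions (distinction is the share of distinct values among all records; separation is the share of record pairs that differ in the attribute), which the report itself does not spell out. The function name `column_stats` is hypothetical, and the file is assumed to have one header row, as in the other commands in this README.

```sh
# column_stats FILE COL: print distinction and separation (in percent) for
# column number COL of a comma-separated file with one header row.
# Assumed definitions (not taken from the report):
#   distinction = distinct values / records
#   separation  = pairs of records that differ in this column / all pairs
column_stats() {
    tail -n '+2' "$1" | awk -F',' -v col="$2" '
        NF { n[$col]++; total++ }
        END {
            if (total < 2) { print "need at least two records"; exit 1 }
            for (v in n) { distinct++; same += n[v] * (n[v] - 1) }
            printf "distinction=%.2f%% separation=%.2f%%\n",
                   100 * distinct / total,
                   100 * (1 - same / (total * (total - 1)))
        }'
}
```

Assuming `age` is the first column of `adult_data.csv`, it could be called as `column_stats adult_data.csv 1`.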
|
||||
|
||||
![Hierarchy for attribute `age`](coding-model/hierarchies/age/age.png){width=14cm}
|
||||
![Hierarchy for attribute `age`](coding-model/hierarchies/age/age.png){width=10cm}
|
||||
|
||||
|
||||
### `workclass`
|
||||
@@ -60,7 +70,7 @@ deemed Insensitive.
|
||||
Despite high values of distinction (66.48%) and separation (99.99%), the `fnlwgt` column is not a QID
|
||||
because it represents a weight, not a count of individuals in the same equivalence class in the
|
||||
original dataset. This can be seen with the results below. Additionally, it's not easily connected
|
||||
to another auxiliary info dataset.
|
||||
to other auxiliary datasets.
|
||||
|
||||
```sh
|
||||
$ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
|
||||
@@ -79,23 +89,26 @@ Table: Sum of `fnlwgt` for each `sex`
|
||||
The sum of these values is 6,179,373,392. This value is much larger than the population of the
|
||||
U.S.A., the origin of the dataset, which implies this attribute is not a count, as stated.
|
||||
|
||||
We also note there are substantially more Male than Female records (more than double the `fnlwgt`).
|
||||
|
||||
We also note there are substantially more Male than Female records:
the dataset has 21790 Male rows and 10771 Female rows, and the sum of
`fnlwgt` for Male is more than double that of Female.
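
The row counts and `fnlwgt` sums per `sex` can be obtained in one pass. The following is a sketch only: the function name `per_sex_totals` is hypothetical, and the column numbers (3 for `fnlwgt`, 10 for `sex`) are taken from the `awk` command shown earlier in this section.

```sh
# per_sex_totals FILE: for each value of column 10 (`sex`), print the number
# of rows and the sum of column 3 (`fnlwgt`). Assumes one header row.
per_sex_totals() {
    tail -n '+2' "$1" | awk -F',' '
        NF { rows[$10]++; weight[$10] += $3 }
        END {
            for (s in rows)
                printf "%s rows=%d fnlwgt=%.0f\n", s, rows[s], weight[s]
        }' | sort
}
```

It could be run as `per_sex_totals adult_data.csv`. The sums are printed with `%.0f` rather than `%d` because they may exceed the 32-bit integer range in some `awk` implementations.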
|
||||
|
||||
### `education`
|
||||
|
||||
This attribute has a separation of 80.96%, which is quite high, so we classified it as a QID.
|
||||
|
||||
![Hierarchy for attribute `education`](coding-model/hierarchies/education/education.png){width=14cm}
|
||||
![Hierarchy for attribute `education`](coding-model/hierarchies/education/education.png){width=18cm}
|
||||
|
||||
\vspace{-2em}
|
||||
|
||||
### `education-num`
|
||||
|
||||
We exported the anonymized dataset and used the following command to verify there weren't any discrepancies
|
||||
between the `education` and `education-num` columns:
|
||||
We used the following command to verify there weren't any
|
||||
discrepancies between the `education` and `education-num` columns:
|
||||
|
||||
```sh
|
||||
$ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
|
||||
$ cat adult_data.csv | awk -F',' '{print $5, $4}' | sort -un
|
||||
```
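
One caveat with `sort -un`: two lines with the same leading number compare as equal under `-n`, so `-u` keeps only one of them. An `education-num` that mapped to two different `education` values could therefore go unnoticed. A stricter check, sketched below, reports conflicts in both directions and prints nothing when the mapping is one-to-one. The function name `mapping_conflicts` is hypothetical; columns 4 and 5 are the ones used in the command above.

```sh
# mapping_conflicts FILE: print every `education` value (column 4) paired with
# more than one `education-num` (column 5), and vice versa. No output means the
# two columns map one-to-one. Assumes one header row.
mapping_conflicts() {
    tail -n '+2' "$1" | awk -F',' '
        NF {
            if (($4 in num) && num[$4] != $5)
                print "education", $4, "->", num[$4], "and", $5
            if (($5 in name) && name[$5] != $4)
                print "education-num", $5, "->", name[$5], "and", $4
            if (!($4 in num)) num[$4] = $5
            if (!($5 in name)) name[$5] = $4
        }' | sort -u
}
```

It could be run as `mapping_conflicts adult_data.csv`; an empty result supports the one-to-one claim.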
|
||||
|
||||
Since there was a one-to-one mapping, we confirmed this was just a
|
||||
@@ -103,7 +116,9 @@ representation of the `education` attribute. As such, this attribute
|
||||
receives the same classification, which is backed by the equally high
|
||||
separation value of 80.96%, so it's classified as a QID.
|
||||
|
||||
![Hierarchy for attribute `education-num`](coding-model/hierarchies/education/education-num.png){width=14cm}
|
||||
\vspace{-1em}
|
||||
|
||||
![Hierarchy for attribute `education-num`](coding-model/hierarchies/education/education-num.png){height=9.5cm}
|
||||
|
||||
|
||||
### `marital-status`
|
||||
@@ -111,21 +126,22 @@ separation value of 80.96%, so it's classified as a QID.
|
||||
With a relatively high separation value of 66.01%, together with the fact that it could be cross
|
||||
referenced with other available datasets, we classify this attribute as a QID.
|
||||
|
||||
![Hierarchy for attribute `marital-status`](coding-model/hierarchies/marital-status/marital-status.png){width=14cm}
|
||||
![Hierarchy for attribute `marital-status`](coding-model/hierarchies/marital-status/marital-status.png){width=10cm}
|
||||
|
||||
|
||||
### `occupation`
|
||||
|
||||
With a separation of 90.02%, this attribute is classified as a QID.
|
||||
|
||||
![Hierarchy for attribute `occupation`](coding-model/hierarchies/occupation.png){width=14cm}
|
||||
![Hierarchy for attribute `occupation`](coding-model/hierarchies/occupation.png){width=8cm}
|
||||
|
||||
\pagebreak
|
||||
|
||||
### `relationship`
|
||||
|
||||
Given its separation value of 73.21%, this attribute is classified as a QID.
|
||||
|
||||
![Hierarchy for attribute `relationship`](coding-model/hierarchies/relationship/relationship.png){width=14cm}
|
||||
![Hierarchy for attribute `relationship`](coding-model/hierarchies/relationship/relationship.png){width=8cm}
|
||||
|
||||
|
||||
### `race`
|
||||
@@ -134,7 +150,7 @@ This column presents some weirdly specific values (Amer-Indian-Eskimo), but has
|
||||
that this attribute could be cross-referenced with other datasets, it is classified as a QID, so
|
||||
it may be transformed into more generic values.
|
||||
|
||||
![Hierarchy for attribute `race`](coding-model/hierarchies/race.png)
|
||||
![Hierarchy for attribute `race`](coding-model/hierarchies/race.png){width=7cm}
|
||||
|
||||
|
||||
### `sex`
|
||||
@@ -165,7 +181,7 @@ Doctorate | 86 | 327
|
||||
|
||||
Table: Number of records with each `education` for each `sex`
|
||||
|
||||
![Hierarchy for attribute `sex`](coding-model/hierarchies/sex.png)
|
||||
![Hierarchy for attribute `sex`](coding-model/hierarchies/sex.png){width=7cm}
|
||||
|
||||
|
||||
### `capital-gain` & `capital-loss`
|
||||
|
BIN
report.pdf
Binary file not shown.