diff --git a/README.md b/README.md
index 3967b50..de821fc 100644
--- a/README.md
+++ b/README.md
@@ -7,12 +7,21 @@ author:
 date: 2022/06/02
 geometry: margin=2cm
 output: pdf_document
+header-includes: |
+ \usepackage{float}
+ \let\origfigure\figure
+ \let\endorigfigure\endfigure
+ \renewenvironment{figure}[1][2]{\expandafter\origfigure\expandafter[H]}{\endorigfigure}
 ---
 
+\vspace{3em}
+
 # Attribute classification
 
 We classified the attributes as follows:
 
+\vspace{3em}
+
 Attribute | Classification
 -----------------+---------------
 `age` | QID
@@ -33,6 +42,7 @@ Attribute | Classification
 
 Table: Attribute classifications
 
+\pagebreak
 
 ## Justifications
 
@@ -46,7 +56,7 @@ set of attributes.
 According to HIPPA recommendations, and together with it's very high
 separation value (99.87%), we classify this attribute as a QID.
 
-![Hierarchy for attribute `age`](coding-model/hierarchies/age/age.png){width=14cm}
+![Hierarchy for attribute `age`](coding-model/hierarchies/age/age.png){width=10cm}
 
 ### `workclass`
 
@@ -60,7 +70,7 @@ deemed Insensitive.
 Despite high values of distinction (66.48%) and separation (99.99%) the `fnlwgt` column is not a QID because
 it represents a weight, not a count of individuals in the same equivalence class in the original dataset.
 This can be seen with the results below. Additionally, it's not easily connected
-to another auxiliary info dataset.
+to other auxiliary datasets.
 
 ```sh
 $ tail -n '+2' adult_data.csv | awk -F',' '{count[$10] += $3;} \
@@ -79,23 +89,26 @@ Table: Sum of `fnlwgt` for each `sex`
 The sum of these values is 6,179,373,392. This value is much larger
 than the population of the U.S.A., the origin of the dataset, which
 implies this attribute is not a count, as stated.
-We also note there are substantially more Male than Female records (more than double the `fnlwgt`).
-
+We also note there are substantially more Male than Female records,
+being that the sum of `fnlwgt` for Male is more than double that of
+Female, as well as that the number of rows with Female is 10771 and
+for Male is 21790.
 
 ### `education`
 
 This attribute presents a separation of 80.96%, which is quite high,
 thus we classified it as a QID.
 
-![Hierarchy for attribute `education`](coding-model/hierarchies/education/education.png){width=14cm}
+![Hierarchy for attribute `education`](coding-model/hierarchies/education/education.png){width=18cm}
+\vspace{-2em}
 
 ### `education-num`
 
-We exported the anonymized dataset and used the following command to verify there weren't any discrepencies
-between the `education` and `education-num` columns:
+We used the following command to verify there weren't any
+discrepencies between the `education` and `education-num` columns:
 
 ```sh
-$ cat anonymized.csv | sed -r 's/,([^ ])/\t\1/g' | awk -F';' '{print $5, $4}' | sort -un
+$ cat adult_data.csv | awk -F',' '{print $5, $4}' | sort -un
 ```
 
 Since there was a one-to-one mapping, we confirmed this was just a
@@ -103,7 +116,9 @@ representation of the `education` attribute. As such, this attribute
 recieves the same classification, which is backed by the equally high
 separation value of 80.96%, so it's classified as a QID.
 
-![Hierarchy for attribute `education-num`](coding-model/hierarchies/education/education-num.png){width=14cm}
+\vspace{-1em}
+
+![Hierarchy for attribute `education-num`](coding-model/hierarchies/education/education-num.png){height=9.5cm}
 
 ### `marital-status`
 
@@ -111,21 +126,22 @@ separation value of 80.96%, so it's classified as a QID.
 With a relatively high separation value of 66.01%, together with the
 fact that it could be cross referenced with other available datasets,
 we classify this attribute as a QID.
-![Hierarchy for attribute `marital-status`](coding-model/hierarchies/marital-status/marital-status.png){width=14cm}
+![Hierarchy for attribute `marital-status`](coding-model/hierarchies/marital-status/marital-status.png){width=10cm}
 
 ### `occupation`
 
 With a separation of 90.02%, this attribute is classified as a QID.
 
-![Hierarchy for attribute `occupation`](coding-model/hierarchies/occupation.png){width=14cm}
+![Hierarchy for attribute `occupation`](coding-model/hierarchies/occupation.png){width=8cm}
+\pagebreak
 
 
 ### `relationship`
 
 Given it's separation value of 73.21%, this attribute is classified as
 a QID.
 
-![Hierarchy for attribute `relationship`](coding-model/hierarchies/relationship/relationship.png){width=14cm}
+![Hierarchy for attribute `relationship`](coding-model/hierarchies/relationship/relationship.png){width=8cm}
 
 ### `race`
 
@@ -134,7 +150,7 @@ This collumn presents some weirdly specific values (Amer-Indian-Eskimo), but has
 that this attribute could be cross referenced with other datases, it is
 classified as a QID, so it may be transformed into more generic values.
 
-![Hierarchy for attribute `race`](coding-model/hierarchies/race.png)
+![Hierarchy for attribute `race`](coding-model/hierarchies/race.png){width=7cm}
 
 ### `sex`
 
@@ -165,7 +181,7 @@ Doctorate | 86 | 327
 
 Table: Number of records with each `education` for each `sex`
 
-![Hierarchy for attribute `sex`](coding-model/hierarchies/sex.png)
+![Hierarchy for attribute `sex`](coding-model/hierarchies/sex.png){width=7cm}
 
 ### `capital-gain` & `capital-loss`
 
diff --git a/report.pdf b/report.pdf
index 19c823a..b49357d 100644
Binary files a/report.pdf and b/report.pdf differ