Tidy data

Prerequisites

Packages:

  • here
  • dplyr
  • tidyr
  • ggplot2

Data:

  • fake_graves.csv - download here

Start:

  • open your R Project, create a new script file
  • load libraries here,dplyr and ggplot2, install and load library dplyr
  • import data “fake_graves.csv”

Hints:

  • install.packages(), library(), here(), read.csv()

The problem

Have a look at df_graves:

head(df_graves, 2)
  grave_number dating  sex   age pottery bronze stone_chipped stone_polished
1          800 ne.lin male 31-40       3      0             1              2
2          801 ne.lin male 21-30       4      0             1              1
  grave_length grave_depth
1          210          50
2          160          40

Questions:

  • what is one observation? What are the variables?

Imagine we want to visualise the relationship between types of artefact, sex, and dating of graves. This can only be achieved by visualising each artefact separately, since they are separated into their own columns, and the argument aes(x = , y = ) of the ggplot() only allows one variable to be visualised in either x or y.

Here is an example with the artefact type “pottery”

ggplot(df_graves, aes(x = dating, y = pottery, fill = sex))+
  geom_bar(stat = "identity")

Note that we had to add stat = identity argument to geom_bar(). We will explain this later.

Solution

Ideally, we would like all artefacts to be in one variable/column, with their count in another. To achieve this, we will create a “long” table using the tidyr::pivot_longer() function.

  • cols = - defines which columns will be transformed into one
  • names_to - sets the name for the column in which other columns will be transformed (their names will be transformed to values)
  • values_to - sets the name for the column in which values of the transformed columns will be stored
df_long <- df_graves |> 
  pivot_longer(
    cols = c(pottery, bronze, stone_chipped, stone_polished),
    names_to = "artefact",
    values_to = "n_artefact"
  )

head(df_long, 6)
# A tibble: 6 × 8
  grave_number dating sex   age   grave_length grave_depth artefact   n_artefact
         <int> <chr>  <chr> <chr>        <int>       <int> <chr>           <int>
1          800 ne.lin male  31-40          210          50 pottery             3
2          800 ne.lin male  31-40          210          50 bronze              0
3          800 ne.lin male  31-40          210          50 stone_chi…          1
4          800 ne.lin male  31-40          210          50 stone_pol…          2
5          801 ne.lin male  21-30          160          40 pottery             4
6          801 ne.lin male  21-30          160          40 bronze              0

Questions:

  • what is one now the observation?
  • could you describe the variable “artefact”?
  • did all variables change?

The logic of long table 1

  • original observations are being distributed into more rows

The logic of long table 2

  • various variables are being moved into one column

The logic of long table 3

  • values from different rows and columns are being moved into just one column

Plots with a longtable

Now we can play with ggplot():

ggplot(df_long, aes(x=dating, y = n_artefact, fill = artefact))+
  geom_bar(stat = "identity")+
  theme_light()

Plots with a longtable

ggplot(df_long, aes(x=dating, y = n_artefact, fill = artefact))+
  geom_bar(stat = "identity")+
  facet_wrap(~sex)+
  theme_light()

Mini task!

Note that there are rows with value = 0 and also columns we are not using.

head(df_long, 2)
# A tibble: 2 × 8
  grave_number dating sex   age   grave_length grave_depth artefact n_artefact
         <int> <chr>  <chr> <chr>        <int>       <int> <chr>         <int>
1          800 ne.lin male  31-40          210          50 pottery           3
2          800 ne.lin male  31-40          210          50 bronze            0

Modify this code so it will exclude rows with 0 at “n_artefact” and remove columns with variables “grave_length” and “grave_depth”

df_long <- df_graves |> 
  pivot_longer(
    cols = c(pottery, bronze, stone_chipped, stone_polished),
    names_to = "artefact",
    values_to = "n_artefact"
  )

Solution:

df_long <- df_graves |> 
  pivot_longer(
    cols = c(pottery, bronze, stone_chipped, stone_polished),
    names_to = "artefact",
    values_to = "n_artefact"
  ) |> 
  select(!c(grave_length, grave_depth)) |> 
  filter(n_artefact != 0)

head(df_long, 4)
# A tibble: 4 × 6
  grave_number dating sex   age   artefact       n_artefact
         <int> <chr>  <chr> <chr> <chr>               <int>
1          800 ne.lin male  31-40 pottery                 3
2          800 ne.lin male  31-40 stone_chipped           1
3          800 ne.lin male  31-40 stone_polished          2
4          801 ne.lin male  21-30 pottery                 4

Be carefull!

  • bear in mind that the observations (i.e. the contents of each row) in the original wide table are different from the observations in the new long table.
  • that is why, if we want to calculate the number of sexes in the long table, the function table() will return incorrect numbers because its calculation is based on rows.
table(df_long$sex)

female   male 
    20     20 

You have to either use the original table:

table(df_graves$sex)

female   male 
    10     10 

…or do some extra dplyr coding with the long table

df_long |> 
  distinct(grave_number, sex) |> 
  count(sex, name = "n_graves")
# A tibble: 2 × 2
  sex    n_graves
  <chr>     <int>
1 female       10
2 male         10