Data transformation

Prerequisites

Packages:

here
dplyr

Data:

dartpoints.csv

Start:

open your R Project, create a new script file
load library here, install and load library dplyr
import data “dartpoints.csv

Hints:

install.packages(), library(), here(), read.csv()

Course overview

In this course, you will learn how to:

select specific columns or remove them – select()
rename your columns or change their position – rename(), relocate()
create new columns and change the values based on different conditions – mutate(), case_when()
filter specific rows based on various conditions – filter()
order your rows from lowest to highest values (or vice versa) – arrange()
group your data and calculate different summary statistics – summarise(), group_by()
work with different functions more effectively with the pipe operator – |>

At the end, you will be able to create and understand such a crazy code:

df_sample |> 
  group_by(Name) |> 
  summarise(
    mean_length = round(mean(Length),1),
    median_length = round(median(Length),1),
    sd_length = round(sd(Length),1),
    n_name = n()
  ) |> 
  mutate(
  percent_name = round(n_name/sum(n_name)*100, 1)
  ) |> 
  ggplot()+
  aes(x = Name, y = sd_length)+
  geom_col(fill = "steelblue")+
  theme_light()

package `dplyr`

we will work with the package dplyr which uses tidy data logic (one variable = one column, one observation = one row,…)
a big advantage of this package is that it has more intuitive and human readable syntax

let’s say we want to analyse only those dartpoints that have name “Travis”
in R base, the code to select only dartpoints of that type, would look like this:

df_dartpoints[df_dartpoints$Name == "Travis",]

This is a solution with the function filter() from the dplyr package:

filter(df_dartpoints, Name == "Travis")

Similarly, you can also select different columns by their name:

R base solution:

df_dartpoints[, c("Name", "Length", "Width", "Weight")]

and this is the dplyr solution with function select():

select(df_dartpoints, Name, Length, Width, Weight)

pipe operator - `|>`

|> allows you to combine various functions into one code chunks, which makes your code shorter and easier to read
syntax:

dataframe |> function_1(variable, condition) |> function_2(variable, condition)

So you can combine functions select(), filter() and head() :

df_dartpoints |> 
  select(Name, Length, Width, Weight) |> 
  filter(Name == "Travis") |> 
  head(4)

    Name Length Width Weight
1 Travis   56.5  21.1    9.5
2 Travis   54.6  22.4   10.4
3 Travis   46.3  21.3    7.5
4 Travis   57.6  18.9    8.7

notice you don´t need to repeat the name of the dataframe into every function arguments, since you already specified it in the begining of the “pipe”

`select()` - more details

df_dartpoints |> 
  select(Name, Length, Width, Weight, Blade.Sh) |> 
  head(4)

  Name Length Width Weight Blade.Sh
1 Darl   42.8  15.8    3.6        S
2 Darl   40.5  17.4    4.5        S
3 Darl   37.5  16.3    3.6        S
4 Darl   40.3  16.1    4.0        S

df_dartpoints |> 
  select(Name, Length:Weight) |> 
  head(4)

  Name Length Width Thickness B.Width J.Width H.Length Weight
1 Darl   42.8  15.8       5.8    11.3    10.6     11.6    3.6
2 Darl   40.5  17.4       5.8      NA    13.7     12.9    4.5
3 Darl   37.5  16.3       6.1    12.1    11.3      8.2    3.6
4 Darl   40.3  16.1       6.3    13.5    11.7      8.3    4.0

You can exclude specific columns by !

df_dartpoints |> 
  select(!Name) |> 
  head(4)

  Catalog     TARL  Quad Length Width Thickness B.Width J.Width H.Length Weight
1 41-0322 41CV0536 26/59   42.8  15.8       5.8    11.3    10.6     11.6    3.6
2 35-2946 41CV0235 21/63   40.5  17.4       5.8      NA    13.7     12.9    4.5
3 35-2921 41CV0132 20/63   37.5  16.3       6.1    12.1    11.3      8.2    3.6
4 36-3487 41CV0594 10/54   40.3  16.1       6.3    13.5    11.7      8.3    4.0
  Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
1        S       I         S         T       S       E
2        S       I         S         T       S       E
3        S       I         S         T       S       E
4        S       I         S         T       S       E

df_dartpoints |> 
  select(!Name:Quad) |> 
  head(4)

  Length Width Thickness B.Width J.Width H.Length Weight Blade.Sh Base.Sh
1   42.8  15.8       5.8    11.3    10.6     11.6    3.6        S       I
2   40.5  17.4       5.8      NA    13.7     12.9    4.5        S       I
3   37.5  16.3       6.1    12.1    11.3      8.2    3.6        S       I
4   40.3  16.1       6.3    13.5    11.7      8.3    4.0        S       I
  Should.Sh Should.Or Haft.Sh Haft.Or
1         S         T       S       E
2         S         T       S       E
3         S         T       S       E
4         S         T       S       E

selecting columns with numeric variables:

df_dartpoints |> 
  select(where(is.numeric)) |> 
  head(4)

  Length Width Thickness B.Width J.Width H.Length Weight
1   42.8  15.8       5.8    11.3    10.6     11.6    3.6
2   40.5  17.4       5.8      NA    13.7     12.9    4.5
3   37.5  16.3       6.1    12.1    11.3      8.2    3.6
4   40.3  16.1       6.3    13.5    11.7      8.3    4.0

`rename()`

you can change the names of your columns by rename() function
syntax:

rename(new_name = old_name)

df_dartpoints |> 
  rename(
    typ_sipky = Name,
    delka_sipky = Length,
    sirka_sipky = Width
  ) |> 
  head(4)

  typ_sipky Catalog     TARL  Quad delka_sipky sirka_sipky Thickness B.Width
1      Darl 41-0322 41CV0536 26/59        42.8        15.8       5.8    11.3
2      Darl 35-2946 41CV0235 21/63        40.5        17.4       5.8      NA
3      Darl 35-2921 41CV0132 20/63        37.5        16.3       6.1    12.1
4      Darl 36-3487 41CV0594 10/54        40.3        16.1       6.3    13.5
  J.Width H.Length Weight Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
1    10.6     11.6    3.6        S       I         S         T       S       E
2    13.7     12.9    4.5        S       I         S         T       S       E
3    11.3      8.2    3.6        S       I         S         T       S       E
4    11.7      8.3    4.0        S       I         S         T       S       E

`relocate()`

relocate() will defautly move your chosen column to the front:

df_dartpoints |> 
  rename(
    typ_sipky = Name,
    delka_sipky = Length,
    sirka_sipky = Width
  ) |> 
  relocate(delka_sipky, sirka_sipky) |> 
  head(4)

  delka_sipky sirka_sipky typ_sipky Catalog     TARL  Quad Thickness B.Width
1        42.8        15.8      Darl 41-0322 41CV0536 26/59       5.8    11.3
2        40.5        17.4      Darl 35-2946 41CV0235 21/63       5.8      NA
3        37.5        16.3      Darl 35-2921 41CV0132 20/63       6.1    12.1
4        40.3        16.1      Darl 36-3487 41CV0594 10/54       6.3    13.5
  J.Width H.Length Weight Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
1    10.6     11.6    3.6        S       I         S         T       S       E
2    13.7     12.9    4.5        S       I         S         T       S       E
3    11.3      8.2    3.6        S       I         S         T       S       E
4    11.7      8.3    4.0        S       I         S         T       S       E

note that you must already be working with the new variable names.

you can manualy specify the position of relocated columns by aruments .before = or .after =

df_dartpoints |> 
  rename(
    typ_sipky = Name,
    delka_sipky = Length,
    sirka_sipky = Width
  ) |> 
  relocate(
    delka_sipky, sirka_sipky, .after =  typ_sipky
    ) |> 
  head(4)

  typ_sipky delka_sipky sirka_sipky Catalog     TARL  Quad Thickness B.Width
1      Darl        42.8        15.8 41-0322 41CV0536 26/59       5.8    11.3
2      Darl        40.5        17.4 35-2946 41CV0235 21/63       5.8      NA
3      Darl        37.5        16.3 35-2921 41CV0132 20/63       6.1    12.1
4      Darl        40.3        16.1 36-3487 41CV0594 10/54       6.3    13.5
  J.Width H.Length Weight Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
1    10.6     11.6    3.6        S       I         S         T       S       E
2    13.7     12.9    4.5        S       I         S         T       S       E
3    11.3      8.2    3.6        S       I         S         T       S       E
4    11.7      8.3    4.0        S       I         S         T       S       E

note the dot before arguments .after and .before - it indicates to R that those are not objects or variables, but arguments of a function

Before we continue:

you can also create a new object from your selection with <-:

df_sample <- df_dartpoints |> 
  select(Name, Length, Width, Weight, Base.Sh)
head(df_sample, 4)

  Name Length Width Weight Base.Sh
1 Darl   42.8  15.8    3.6       I
2 Darl   40.5  17.4    4.5       I
3 Darl   37.5  16.3    3.6       I
4 Darl   40.3  16.1    4.0       I

From now on, we will continue with this dataframe, in order to reduce the number of the columns.

`mutate()`

this function creates new columns that are calculated from existing ones (i.e. the new columns are functions of existing variables).
syntax: dataframe |> mutate(variable_name = function())

here is an example of creating a new column whose values are the multiplication of length by width

df_sample |> 
  mutate(
    square_size = Length*Width
    ) |> 
  head(4)

  Name Length Width Weight Base.Sh square_size
1 Darl   42.8  15.8    3.6       I      676.24
2 Darl   40.5  17.4    4.5       I      704.70
3 Darl   37.5  16.3    3.6       I      611.25
4 Darl   40.3  16.1    4.0       I      648.83

note what argument .keep = "used" does:

df_sample |> 
  mutate(
    square_size = Length*Width,
    .keep = "used"
    ) |> 
  head(4)

  Length Width square_size
1   42.8  15.8      676.24
2   40.5  17.4      704.70
3   37.5  16.3      611.25
4   40.3  16.1      648.83

we will explore the function mutate() in more detail later

`filter()`

for specifying conditions for filtering, dplyr uses the following logical and mathematical operators: ==, !=, <, >, >=, <=, &, |, %in%, etc (use ?dplyr::filter for more details)

here, we will use the symbol > to filter out dart points with a length greater than 80 mm

df_sample |> 
  filter(Length > 80)

        Name Length Width Weight Base.Sh
1 Pedernales  109.5  49.3   28.8       I
2 Pedernales   84.0  21.2    9.3       I

and here we use == to choose only those dartpoints which are of type “Travis”

df_sample |> 
  filter(Name == "Travis")

     Name Length Width Weight Base.Sh
1  Travis   56.5  21.1    9.5       S
2  Travis   54.6  22.4   10.4       S
3  Travis   46.3  21.3    7.5       S
4  Travis   57.6  18.9    8.7       S
5  Travis   49.1  21.4    6.9       S
6  Travis   64.6  21.5   15.0       I
7  Travis   69.0  20.9   11.4       E
8  Travis   40.1  18.4    6.3       S
9  Travis   41.5  19.2    7.5       S
10 Travis   46.3  17.9    5.9       S
11 Travis   39.6  21.5    5.4       S

alternatively, you can exclude all points of a type “Travis” by negation !=

df_sample |> 
  filter(Name != "Travis") |> 
  head(4)

  Name Length Width Weight Base.Sh
1 Darl   42.8  15.8    3.6       I
2 Darl   40.5  17.4    4.5       I
3 Darl   37.5  16.3    3.6       I
4 Darl   40.3  16.1    4.0       I

add & if you want to filter with more than one condition, for example here we will filter all points which are type “Wells” AND are heavier than 10 grams

df_sample |> 
  filter(Name == "Wells" & Weight > 10)

   Name Length Width Weight Base.Sh
1 Wells   65.4  25.1   12.6       S
2 Wells   58.9  24.4   10.5       I
3 Wells   63.1  24.7   16.3       E

Task: instead of & try the operator | (OR) and see how the result differs

Filtering based on a vector

you can make your code less complicated when you create vector from desired values and then filter all observations which fall into that vector by using operator %in%

two_types <- c("Pedernales", "Ensor")

df_sample |> 
  filter(Name %in% two_types) |> 
  head(4)

   Name Length Width Weight Base.Sh
1 Ensor   43.5  20.1    4.6       S
2 Ensor   42.1  20.8    5.4       S
3 Ensor   42.1  25.1    5.9       S
4 Ensor   43.1  20.0    5.1       S

`arrange()`

this function will order your observations from the lowest to highest (or vice versa)

df_sample |> 
  arrange(Length) |> 
  head(4)

  Name Length Width Weight Base.Sh
1 Darl   30.6  17.1    2.3       I
2 Darl   31.2  15.6    2.5       I
3 Darl   32.0  16.0    3.3       E
4 Darl   32.4  14.5    2.5       I

df_sample |> 
  arrange(desc(Length)) |> 
  head(4)

        Name Length Width Weight Base.Sh
1 Pedernales  109.5  49.3   28.8       I
2 Pedernales   84.0  21.2    9.3       I
3 Pedernales   78.3  28.1   14.8       I
4 Pedernales   70.4  30.4   13.1       I

`distinct()` and `count()`

try to find out what these functions are doing:

df_sample |> 
  distinct(Name, Base.Sh) |> 
  head(10)

         Name Base.Sh
1        Darl       I
2        Darl       E
3        Darl       S
4        Darl       R
5        Darl    <NA>
6       Ensor       S
7       Ensor       R
8       Ensor       I
9       Ensor    <NA>
10 Pedernales       I

df_sample |> 
  count(Name, Base.Sh, sort = TRUE) |> 
  head(4)

        Name Base.Sh  n
1 Pedernales       I 32
2       Darl       I 18
3     Travis       S  9
4       Darl       S  6

`summarise()`

this function is used for calculating basic summary stastics from the existing variables
syntax: summarise(new_variable = function(variable))
in this case, the mean and sd values are calculated for whole dataframe:

df_sample |> 
  summarise(
    mean_lenght = mean(Length),
    sd_length = sd(Length)
  )

  mean_lenght sd_length
1    49.33077  12.73619

`group_by()`

if you want to calculate summary statistic for each dartpoint type, you have to use function group_by which will create groups of rows based on chosen variable, in this case “Name”

df_sample |> 
  group_by(Name) |> 
  summarise(
    mean_lenght = mean(Length),
    sd_length = sd(Length),
    n = n()
  ) |> 
  arrange(desc(n))

# A tibble: 5 × 4
  Name       mean_lenght sd_length     n
  <chr>            <dbl>     <dbl> <int>
1 Pedernales        57.9     14.1     32
2 Darl              39.8      6.18    28
3 Travis            51.4      9.90    11
4 Ensor             42.7      5.79    10
5 Wells             53.1      7.94    10

df_sample |> 
  group_by(Name, Base.Sh) |> 
  summarise(
    mean_lenght = round(mean(Length),2),
    sd_length = round(sd(Length),2),
    n = n()
  ) |> 
  arrange(Name) |> 
  head(7)

# A tibble: 7 × 5
# Groups:   Name [2]
  Name  Base.Sh mean_lenght sd_length     n
  <chr> <chr>         <dbl>     <dbl> <int>
1 Darl  E              36.9      6.93     2
2 Darl  I              39.5      6.59    18
3 Darl  R              44.2     NA        1
4 Darl  S              40.6      6.36     6
5 Darl  <NA>           40       NA        1
6 Ensor I              34.9     NA        1
7 Ensor R              39.3      2.75     3

back to `mutate()`

creating new values based on different conditions

df_sample |> 
  mutate(
    size_cathegory = case_when(
      Length > 70 ~ "big_stone",
      Length < 30 ~ "small_stone",
      .default = "ok_stone"
    )
  ) |> 
  arrange(desc(Length)) |> 
  head(10)

         Name Length Width Weight Base.Sh size_cathegory
1  Pedernales  109.5  49.3   28.8       I      big_stone
2  Pedernales   84.0  21.2    9.3       I      big_stone
3  Pedernales   78.3  28.1   14.8       I      big_stone
4  Pedernales   70.4  30.4   13.1       I      big_stone
5      Travis   69.0  20.9   11.4       E       ok_stone
6  Pedernales   67.2  27.1   15.3       I       ok_stone
7  Pedernales   66.0  27.2   12.5       I       ok_stone
8       Wells   65.4  25.1   12.6       S       ok_stone
9  Pedernales   65.0  31.6    4.6       I       ok_stone
10     Travis   64.6  21.5   15.0       I       ok_stone

back to `mutate()`

changing existing values based on combination of two conditions

imagine we want to create a new type of dartpoints from the Pedernales that are longer than 80 mm
we don’t need to create a new column, but only change the values in the existing one:

df_sample |> 
  mutate(
    Name = case_when(
      Name == "Pedernales" & Length > 80 ~ "Big_Pedernales",
      .default = Name
    )
  ) |> 
  arrange(desc(Length)) |> 
  head(4)

            Name Length Width Weight Base.Sh
1 Big_Pedernales  109.5  49.3   28.8       I
2 Big_Pedernales   84.0  21.2    9.3       I
3     Pedernales   78.3  28.1   14.8       I
4     Pedernales   70.4  30.4   13.1       I

note that if we want to leave the other values of the variable “Name” unchanged, we do so by specifying .default = Name

Combining plenty of functions into one huge code chunk:

df_sample |> 
  group_by(Name) |> 
  summarise(
    mean_length = round(mean(Length),1),
    median_length = round(median(Length),1),
    sd_length = round(sd(Length),1),
    n_name = n()
  ) |> 
  mutate(
  percent_name = round(n_name/sum(n_name)*100, 1)
  ) |> 
  arrange(desc(n_name))

# A tibble: 5 × 6
  Name       mean_length median_length sd_length n_name percent_name
  <chr>            <dbl>         <dbl>     <dbl>  <int>        <dbl>
1 Pedernales        57.9          55.8      14.1     32         35.2
2 Darl              39.8          40.1       6.2     28         30.8
3 Travis            51.4          49.1       9.9     11         12.1
4 Ensor             42.7          42.3       5.8     10         11  
5 Wells             53.1          53.7       7.9     10         11

::: fragment - we can even include ggplot() into the pipeline:

df_sample |> 
  group_by(Name) |> 
  summarise(
    mean_length = round(mean(Length),1),
    median_length = round(median(Length),1),
    sd_length = round(sd(Length),1),
    n_name = n()
  ) |> 
  mutate(
  percent_name = round(n_name/sum(n_name)*100, 1)
  ) |> 
  ggplot()+
  aes(x = Name, y = sd_length)+
  geom_col(fill = "steelblue")+
  theme_light()

Saving your result

df_result <- df_sample |> 
  group_by(Name) |> 
  summarise(
    mean_length = round(mean(Length),1),
    median_length = round(median(Length),1),
    sd_length = round(sd(Length),1),
    n_name = n()
  ) |> 
  mutate(
  percent_name = round(n_name/sum(n_name)*100, 1)
  ) |> 
  arrange(desc(n_name))

write.csv(df_result, here("results/dartpoints_result.csv"), row.names = F)

Data transformation

Prerequisites

Packages:

Data:

Start:

Hints:

Course overview

package dplyr

pipe operator - |>

select() - more details

rename()

relocate()

Before we continue:

mutate()

filter()

Filtering based on a vector

arrange()

distinct() and count()

summarise()

group_by()

back to mutate()

creating new values based on different conditions

back to mutate()

changing existing values based on combination of two conditions

Combining plenty of functions into one huge code chunk:

Saving your result

package `dplyr`

pipe operator - `|>`

`select()` - more details

`rename()`

`relocate()`

`mutate()`

`filter()`

`arrange()`

`distinct()` and `count()`

`summarise()`

`group_by()`

back to `mutate()`

back to `mutate()`