|>
df_sample group_by(Name) |>
summarise(
mean_length = round(mean(Length),1),
median_length = round(median(Length),1),
sd_length = round(sd(Length),1),
n_name = n()
|>
) mutate(
percent_name = round(n_name/sum(n_name)*100, 1)
|>
) ggplot()+
aes(x = Name, y = sd_length)+
geom_col(fill = "steelblue")+
theme_light()
Data transformation
Prerequisites
Packages:
here
dplyr
Data:
- dartpoints.csv
Start:
- open your R Project, create a new script file
- load library
here
, install and load librarydplyr
- import data “dartpoints.csv
Hints:
install.packages()
,library()
,here()
,read.csv()
Course overview
In this course, you will learn how to:
- select specific columns or remove them –
select()
- rename your columns or change their position –
rename()
,relocate()
- create new columns and change the values based on different conditions –
mutate()
,case_when()
- filter specific rows based on various conditions –
filter()
- order your rows from lowest to highest values (or vice versa) –
arrange()
- group your data and calculate different summary statistics –
summarise()
,group_by()
- work with different functions more effectively with the pipe operator –
|>
At the end, you will be able to create and understand such a crazy code:
package dplyr
- we will work with the package
dplyr
which uses tidy data logic (one variable = one column, one observation = one row,…) - a big advantage of this package is that it has more intuitive and human readable syntax
- let’s say we want to analyse only those dartpoints that have name “Travis”
- in R base, the code to select only dartpoints of that type, would look like this:
$Name == "Travis",] df_dartpoints[df_dartpoints
This is a solution with the function filter()
from the dplyr
package:
filter(df_dartpoints, Name == "Travis")
Similarly, you can also select different columns by their name:
R base solution:
c("Name", "Length", "Width", "Weight")] df_dartpoints[,
- and this is the
dplyr
solution with functionselect()
:
select(df_dartpoints, Name, Length, Width, Weight)
pipe operator - |>
|>
allows you to combine various functions into one code chunks, which makes your code shorter and easier to readsyntax:
dataframe |> function_1(variable, condition) |> function_2(variable, condition)
So you can combine functions select()
, filter()
and head()
:
|>
df_dartpoints select(Name, Length, Width, Weight) |>
filter(Name == "Travis") |>
head(4)
Name Length Width Weight
1 Travis 56.5 21.1 9.5
2 Travis 54.6 22.4 10.4
3 Travis 46.3 21.3 7.5
4 Travis 57.6 18.9 8.7
- notice you don´t need to repeat the name of the dataframe into every function arguments, since you already specified it in the begining of the “pipe”
select()
- more details
|>
df_dartpoints select(Name, Length, Width, Weight, Blade.Sh) |>
head(4)
Name Length Width Weight Blade.Sh
1 Darl 42.8 15.8 3.6 S
2 Darl 40.5 17.4 4.5 S
3 Darl 37.5 16.3 3.6 S
4 Darl 40.3 16.1 4.0 S
|>
df_dartpoints select(Name, Length:Weight) |>
head(4)
Name Length Width Thickness B.Width J.Width H.Length Weight
1 Darl 42.8 15.8 5.8 11.3 10.6 11.6 3.6
2 Darl 40.5 17.4 5.8 NA 13.7 12.9 4.5
3 Darl 37.5 16.3 6.1 12.1 11.3 8.2 3.6
4 Darl 40.3 16.1 6.3 13.5 11.7 8.3 4.0
You can exclude specific columns by !
|>
df_dartpoints select(!Name) |>
head(4)
Catalog TARL Quad Length Width Thickness B.Width J.Width H.Length Weight
1 41-0322 41CV0536 26/59 42.8 15.8 5.8 11.3 10.6 11.6 3.6
2 35-2946 41CV0235 21/63 40.5 17.4 5.8 NA 13.7 12.9 4.5
3 35-2921 41CV0132 20/63 37.5 16.3 6.1 12.1 11.3 8.2 3.6
4 36-3487 41CV0594 10/54 40.3 16.1 6.3 13.5 11.7 8.3 4.0
Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
1 S I S T S E
2 S I S T S E
3 S I S T S E
4 S I S T S E
|>
df_dartpoints select(!Name:Quad) |>
head(4)
Length Width Thickness B.Width J.Width H.Length Weight Blade.Sh Base.Sh
1 42.8 15.8 5.8 11.3 10.6 11.6 3.6 S I
2 40.5 17.4 5.8 NA 13.7 12.9 4.5 S I
3 37.5 16.3 6.1 12.1 11.3 8.2 3.6 S I
4 40.3 16.1 6.3 13.5 11.7 8.3 4.0 S I
Should.Sh Should.Or Haft.Sh Haft.Or
1 S T S E
2 S T S E
3 S T S E
4 S T S E
- selecting columns with numeric variables:
|>
df_dartpoints select(where(is.numeric)) |>
head(4)
Length Width Thickness B.Width J.Width H.Length Weight
1 42.8 15.8 5.8 11.3 10.6 11.6 3.6
2 40.5 17.4 5.8 NA 13.7 12.9 4.5
3 37.5 16.3 6.1 12.1 11.3 8.2 3.6
4 40.3 16.1 6.3 13.5 11.7 8.3 4.0
rename()
you can change the names of your columns by
rename()
functionsyntax:
rename(new_name = old_name)
|>
df_dartpoints rename(
typ_sipky = Name,
delka_sipky = Length,
sirka_sipky = Width
|>
) head(4)
typ_sipky Catalog TARL Quad delka_sipky sirka_sipky Thickness B.Width
1 Darl 41-0322 41CV0536 26/59 42.8 15.8 5.8 11.3
2 Darl 35-2946 41CV0235 21/63 40.5 17.4 5.8 NA
3 Darl 35-2921 41CV0132 20/63 37.5 16.3 6.1 12.1
4 Darl 36-3487 41CV0594 10/54 40.3 16.1 6.3 13.5
J.Width H.Length Weight Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
1 10.6 11.6 3.6 S I S T S E
2 13.7 12.9 4.5 S I S T S E
3 11.3 8.2 3.6 S I S T S E
4 11.7 8.3 4.0 S I S T S E
relocate()
relocate()
will defautly move your chosen column to the front:
|>
df_dartpoints rename(
typ_sipky = Name,
delka_sipky = Length,
sirka_sipky = Width
|>
) relocate(delka_sipky, sirka_sipky) |>
head(4)
delka_sipky sirka_sipky typ_sipky Catalog TARL Quad Thickness B.Width
1 42.8 15.8 Darl 41-0322 41CV0536 26/59 5.8 11.3
2 40.5 17.4 Darl 35-2946 41CV0235 21/63 5.8 NA
3 37.5 16.3 Darl 35-2921 41CV0132 20/63 6.1 12.1
4 40.3 16.1 Darl 36-3487 41CV0594 10/54 6.3 13.5
J.Width H.Length Weight Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
1 10.6 11.6 3.6 S I S T S E
2 13.7 12.9 4.5 S I S T S E
3 11.3 8.2 3.6 S I S T S E
4 11.7 8.3 4.0 S I S T S E
- note that you must already be working with the new variable names.
- you can manualy specify the position of relocated columns by aruments
.before =
or.after =
|>
df_dartpoints rename(
typ_sipky = Name,
delka_sipky = Length,
sirka_sipky = Width
|>
) relocate(
.after = typ_sipky
delka_sipky, sirka_sipky, |>
) head(4)
typ_sipky delka_sipky sirka_sipky Catalog TARL Quad Thickness B.Width
1 Darl 42.8 15.8 41-0322 41CV0536 26/59 5.8 11.3
2 Darl 40.5 17.4 35-2946 41CV0235 21/63 5.8 NA
3 Darl 37.5 16.3 35-2921 41CV0132 20/63 6.1 12.1
4 Darl 40.3 16.1 36-3487 41CV0594 10/54 6.3 13.5
J.Width H.Length Weight Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
1 10.6 11.6 3.6 S I S T S E
2 13.7 12.9 4.5 S I S T S E
3 11.3 8.2 3.6 S I S T S E
4 11.7 8.3 4.0 S I S T S E
- note the dot before arguments
.after
and.before
- it indicates to R that those are not objects or variables, but arguments of a function
Before we continue:
- you can also create a new object from your selection with
<-
:
<- df_dartpoints |>
df_sample select(Name, Length, Width, Weight, Base.Sh)
head(df_sample, 4)
Name Length Width Weight Base.Sh
1 Darl 42.8 15.8 3.6 I
2 Darl 40.5 17.4 4.5 I
3 Darl 37.5 16.3 3.6 I
4 Darl 40.3 16.1 4.0 I
From now on, we will continue with this dataframe, in order to reduce the number of the columns.
mutate()
this function creates new columns that are calculated from existing ones (i.e. the new columns are functions of existing variables).
syntax:
dataframe |> mutate(variable_name = function())
- here is an example of creating a new column whose values are the multiplication of length by width
|>
df_sample mutate(
square_size = Length*Width
|>
) head(4)
Name Length Width Weight Base.Sh square_size
1 Darl 42.8 15.8 3.6 I 676.24
2 Darl 40.5 17.4 4.5 I 704.70
3 Darl 37.5 16.3 3.6 I 611.25
4 Darl 40.3 16.1 4.0 I 648.83
- note what argument
.keep = "used"
does:
|>
df_sample mutate(
square_size = Length*Width,
.keep = "used"
|>
) head(4)
Length Width square_size
1 42.8 15.8 676.24
2 40.5 17.4 704.70
3 37.5 16.3 611.25
4 40.3 16.1 648.83
- we will explore the function
mutate()
in more detail later
filter()
- for specifying conditions for filtering,
dplyr
uses the following logical and mathematical operators: ==, !=, <, >, >=, <=, &, |, %in%, etc (use?dplyr::filter
for more details)
- here, we will use the symbol
>
to filter out dart points with a length greater than 80 mm
|>
df_sample filter(Length > 80)
Name Length Width Weight Base.Sh
1 Pedernales 109.5 49.3 28.8 I
2 Pedernales 84.0 21.2 9.3 I
- and here we use
==
to choose only those dartpoints which are of type “Travis”
|>
df_sample filter(Name == "Travis")
Name Length Width Weight Base.Sh
1 Travis 56.5 21.1 9.5 S
2 Travis 54.6 22.4 10.4 S
3 Travis 46.3 21.3 7.5 S
4 Travis 57.6 18.9 8.7 S
5 Travis 49.1 21.4 6.9 S
6 Travis 64.6 21.5 15.0 I
7 Travis 69.0 20.9 11.4 E
8 Travis 40.1 18.4 6.3 S
9 Travis 41.5 19.2 7.5 S
10 Travis 46.3 17.9 5.9 S
11 Travis 39.6 21.5 5.4 S
- alternatively, you can exclude all points of a type “Travis” by negation
!=
|>
df_sample filter(Name != "Travis") |>
head(4)
Name Length Width Weight Base.Sh
1 Darl 42.8 15.8 3.6 I
2 Darl 40.5 17.4 4.5 I
3 Darl 37.5 16.3 3.6 I
4 Darl 40.3 16.1 4.0 I
- add
&
if you want to filter with more than one condition, for example here we will filter all points which are type “Wells” AND are heavier than 10 grams
|>
df_sample filter(Name == "Wells" & Weight > 10)
Name Length Width Weight Base.Sh
1 Wells 65.4 25.1 12.6 S
2 Wells 58.9 24.4 10.5 I
3 Wells 63.1 24.7 16.3 E
- Task: instead of
&
try the operator|
(OR) and see how the result differs
Filtering based on a vector
- you can make your code less complicated when you create vector from desired values and then filter all observations which fall into that vector by using operator
%in%
<- c("Pedernales", "Ensor")
two_types
|>
df_sample filter(Name %in% two_types) |>
head(4)
Name Length Width Weight Base.Sh
1 Ensor 43.5 20.1 4.6 S
2 Ensor 42.1 20.8 5.4 S
3 Ensor 42.1 25.1 5.9 S
4 Ensor 43.1 20.0 5.1 S
arrange()
- this function will order your observations from the lowest to highest (or vice versa)
|>
df_sample arrange(Length) |>
head(4)
Name Length Width Weight Base.Sh
1 Darl 30.6 17.1 2.3 I
2 Darl 31.2 15.6 2.5 I
3 Darl 32.0 16.0 3.3 E
4 Darl 32.4 14.5 2.5 I
|>
df_sample arrange(desc(Length)) |>
head(4)
Name Length Width Weight Base.Sh
1 Pedernales 109.5 49.3 28.8 I
2 Pedernales 84.0 21.2 9.3 I
3 Pedernales 78.3 28.1 14.8 I
4 Pedernales 70.4 30.4 13.1 I
distinct()
and count()
- try to find out what these functions are doing:
|>
df_sample distinct(Name, Base.Sh) |>
head(10)
Name Base.Sh
1 Darl I
2 Darl E
3 Darl S
4 Darl R
5 Darl <NA>
6 Ensor S
7 Ensor R
8 Ensor I
9 Ensor <NA>
10 Pedernales I
|>
df_sample count(Name, Base.Sh, sort = TRUE) |>
head(4)
Name Base.Sh n
1 Pedernales I 32
2 Darl I 18
3 Travis S 9
4 Darl S 6
summarise()
- this function is used for calculating basic summary stastics from the existing variables
- syntax:
summarise(new_variable = function(variable))
- in this case, the mean and sd values are calculated for whole dataframe:
|>
df_sample summarise(
mean_lenght = mean(Length),
sd_length = sd(Length)
)
mean_lenght sd_length
1 49.33077 12.73619
group_by()
- if you want to calculate summary statistic for each dartpoint type, you have to use function
group_by
which will create groups of rows based on chosen variable, in this case “Name”
|>
df_sample group_by(Name) |>
summarise(
mean_lenght = mean(Length),
sd_length = sd(Length),
n = n()
|>
) arrange(desc(n))
# A tibble: 5 × 4
Name mean_lenght sd_length n
<chr> <dbl> <dbl> <int>
1 Pedernales 57.9 14.1 32
2 Darl 39.8 6.18 28
3 Travis 51.4 9.90 11
4 Ensor 42.7 5.79 10
5 Wells 53.1 7.94 10
|>
df_sample group_by(Name, Base.Sh) |>
summarise(
mean_lenght = round(mean(Length),2),
sd_length = round(sd(Length),2),
n = n()
|>
) arrange(Name) |>
head(7)
# A tibble: 7 × 5
# Groups: Name [2]
Name Base.Sh mean_lenght sd_length n
<chr> <chr> <dbl> <dbl> <int>
1 Darl E 36.9 6.93 2
2 Darl I 39.5 6.59 18
3 Darl R 44.2 NA 1
4 Darl S 40.6 6.36 6
5 Darl <NA> 40 NA 1
6 Ensor I 34.9 NA 1
7 Ensor R 39.3 2.75 3
back to mutate()
creating new values based on different conditions
|>
df_sample mutate(
size_cathegory = case_when(
> 70 ~ "big_stone",
Length < 30 ~ "small_stone",
Length .default = "ok_stone"
)|>
) arrange(desc(Length)) |>
head(10)
Name Length Width Weight Base.Sh size_cathegory
1 Pedernales 109.5 49.3 28.8 I big_stone
2 Pedernales 84.0 21.2 9.3 I big_stone
3 Pedernales 78.3 28.1 14.8 I big_stone
4 Pedernales 70.4 30.4 13.1 I big_stone
5 Travis 69.0 20.9 11.4 E ok_stone
6 Pedernales 67.2 27.1 15.3 I ok_stone
7 Pedernales 66.0 27.2 12.5 I ok_stone
8 Wells 65.4 25.1 12.6 S ok_stone
9 Pedernales 65.0 31.6 4.6 I ok_stone
10 Travis 64.6 21.5 15.0 I ok_stone
back to mutate()
changing existing values based on combination of two conditions
- imagine we want to create a new type of dartpoints from the Pedernales that are longer than 80 mm
- we don’t need to create a new column, but only change the values in the existing one:
|>
df_sample mutate(
Name = case_when(
== "Pedernales" & Length > 80 ~ "Big_Pedernales",
Name .default = Name
)|>
) arrange(desc(Length)) |>
head(4)
Name Length Width Weight Base.Sh
1 Big_Pedernales 109.5 49.3 28.8 I
2 Big_Pedernales 84.0 21.2 9.3 I
3 Pedernales 78.3 28.1 14.8 I
4 Pedernales 70.4 30.4 13.1 I
- note that if we want to leave the other values of the variable “Name” unchanged, we do so by specifying
.default = Name
Combining plenty of functions into one huge code chunk:
|>
df_sample group_by(Name) |>
summarise(
mean_length = round(mean(Length),1),
median_length = round(median(Length),1),
sd_length = round(sd(Length),1),
n_name = n()
|>
) mutate(
percent_name = round(n_name/sum(n_name)*100, 1)
|>
) arrange(desc(n_name))
# A tibble: 5 × 6
Name mean_length median_length sd_length n_name percent_name
<chr> <dbl> <dbl> <dbl> <int> <dbl>
1 Pedernales 57.9 55.8 14.1 32 35.2
2 Darl 39.8 40.1 6.2 28 30.8
3 Travis 51.4 49.1 9.9 11 12.1
4 Ensor 42.7 42.3 5.8 10 11
5 Wells 53.1 53.7 7.9 10 11
::: fragment - we can even include ggplot()
into the pipeline:
|>
df_sample group_by(Name) |>
summarise(
mean_length = round(mean(Length),1),
median_length = round(median(Length),1),
sd_length = round(sd(Length),1),
n_name = n()
|>
) mutate(
percent_name = round(n_name/sum(n_name)*100, 1)
|>
) ggplot()+
aes(x = Name, y = sd_length)+
geom_col(fill = "steelblue")+
theme_light()
Saving your result
<- df_sample |>
df_result group_by(Name) |>
summarise(
mean_length = round(mean(Length),1),
median_length = round(median(Length),1),
sd_length = round(sd(Length),1),
n_name = n()
|>
) mutate(
percent_name = round(n_name/sum(n_name)*100, 1)
|>
) arrange(desc(n_name))
write.csv(df_result, here("results/dartpoints_result.csv"), row.names = F)