Introduction to R

Part 2

Peter Tkáč

Overview

Overview of the workshop

4.) Data manipulation

  • dartpoints data
  • package dplyr / tidyverse

5.) Data visualisation

  • package ggplot2

Exercise

Task:

  • create a new project and save it in a new parent folder; call it, for example, “atRium2024”
  • create subfolders for data, figures, script and results
  • move the data “dartpoints.csv” to your data folder
  • open a new script and save it as “data_manipulation.R” in your script folder
  • load the package here with library(here)
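
If you prefer to create the subfolders from R rather than in the file manager, a sketch in base R could look like this. The path project_root is a stand-in pointing to a temporary folder; in your own project you would replace it with the project folder, for example getwd():

```r
# Stand-in for the project folder (assumption); replace with your own path
project_root <- file.path(tempdir(), "atRium2024")

# Create the four subfolders named in the task
for (folder in c("data", "figures", "script", "results")) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.dirs(project_root, full.names = FALSE, recursive = FALSE)
```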

Data manipulation

dplyr package

We will learn how to:

  • use the pipe %>% to work with functions more effectively
  • select desired variables - select()
  • rename your variables - rename()
  • order them from lowest to highest values (or vice versa) - arrange()
  • filter your data based on different conditions - filter()
  • calculate different summary statistics such as mean or count - summarise()
  • add new variables such as percentage - mutate()
  • save your results as a comma-separated file - write.csv()

But first, install and load the packages dplyr and here, and then load the dartpoints.csv data

install.packages("dplyr") # run this only once
library(dplyr)
library(here)
df_darts <- read.csv(here("data/dartpoints.csv"))

Select

head(df_darts,2)
  Name Catalog     TARL  Quad Length Width Thickness B.Width J.Width H.Length
1 Darl 41-0322 41CV0536 26/59   42.8  15.8       5.8    11.3    10.6     11.6
2 Darl 35-2946 41CV0235 21/63   40.5  17.4       5.8      NA    13.7     12.9
  Weight Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
1    3.6        S       I         S         T       S       E
2    4.5        S       I         S         T       S       E
  • too many variables? Select only the ones you want to work with
  • select(dataframe, variable1, variable2)
df_darts_edit <- select(df_darts, Catalog, Name, Length, Width, Weight)
head(df_darts_edit, 2)
  Catalog Name Length Width Weight
1 41-0322 Darl   42.8  15.8    3.6
2 35-2946 Darl   40.5  17.4    4.5

Pipe

  • the hotkey for the pipe %>% on my computer is “CTRL + SHIFT + M”
  • note that when using the pipe, you don’t need to pass the name of the object (in this case “df_darts”) as an argument of the function
df_darts_edit <- df_darts %>% 
  select(Name, Catalog, Length, Width, Weight)
  • note that I have created a new object, “df_darts_edit”. From now on I will manipulate data only in this object and leave “df_darts” unchanged as a back-up
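
The pipe passes the object on its left as the first argument of the function on its right. A minimal sketch on a small made-up data frame (the object toy and its values are hypothetical, not part of dartpoints.csv) comparing nested calls with the piped version:

```r
library(dplyr)

# Made-up data, only for illustration
toy <- data.frame(
  id = 1:3,
  length = c(42.8, 40.5, 37.5),
  width = c(15.8, 17.4, 16.3)
)

# Nested calls: read from the inside out
nested <- head(select(toy, id, length), 2)

# Piped calls: read from top to bottom
piped <- toy %>%
  select(id, length) %>%
  head(2)

identical(nested, piped)
```

Both versions return the same result; the piped one lists the steps in the order in which they happen.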

Renaming

  • renaming your variables with the function rename(data, new_name = old_name) is useful when the original names are cryptic codes or in a different language
df_darts_edit <- df_darts_edit %>%  
  rename(
    dart_type = Name,
    dart_ID = Catalog,
    dart_length = Length,
    dart_width = Width,
    dart_weight = Weight
    ) 

head(df_darts_edit, 2)
  dart_type dart_ID dart_length dart_width dart_weight
1      Darl 41-0322        42.8       15.8         3.6
2      Darl 35-2946        40.5       17.4         4.5

Arranging

  • here you can order your observations from the lowest to the highest value (or vice versa). To do so, use the function arrange(data, variable)
df_darts_edit <- df_darts_edit %>% 
  arrange(dart_length)

head(df_darts_edit, 4)
  dart_type dart_ID dart_length dart_width dart_weight
1      Darl 36-3321        30.6       17.1         2.3
2      Darl 35-2382        31.2       15.6         2.5
3      Darl 36-3619        32.0       16.0         3.3
4      Darl 36-3520        32.4       14.5         2.5
  • to arrange the data in the opposite order, nest the function desc() inside arrange():
df_darts_edit <-  df_darts_edit %>% 
  arrange(desc(dart_length))

head(df_darts_edit, 4)
   dart_type dart_ID dart_length dart_width dart_weight
1 Pedernales 35-2855       109.5       49.3        28.8
2 Pedernales 36-3879        84.0       21.2         9.3
3 Pedernales 35-0173        78.3       28.1        14.8
4 Pedernales 38-0098        70.4       30.4        13.1

Filtering

  • the function filter(data, variable <operator> value) allows you to filter your data based on different conditions, for example minimum weight, type of the dartpoint, etc.

  • logical and mathematical operators: ==, !=, <, >, >=, <=, &, |, etc (use ?dplyr::filter for more details)

  • here we use > to get only dartpoints longer than 80 mm

df_darts_80 <- df_darts_edit %>% 
  filter(dart_length > 80)
  • and here we use == to choose only those dartpoints which are of type “Travis”
df_darts_travis <- df_darts_edit %>% 
  filter(dart_type == "Travis")

unique(df_darts_travis$dart_type)
[1] "Travis"
  • alternatively, you can exclude all points of type “Travis” with the negation !=
df_darts_no_travis <- df_darts_edit %>% 
  filter(dart_type != "Travis")

unique(df_darts_no_travis$dart_type)
[1] "Pedernales" "Wells"      "Ensor"      "Darl"      

Filtering with multiple conditions

  • you can use | or & for filtering with more than one condition
  • for example, here we keep all points which are of type “Wells” AND are heavier than 10 grams
df_darts_wells_10 <- df_darts_edit %>% 
  filter(dart_type == "Wells" & dart_weight > 10)
head(df_darts_wells_10)
  dart_type dart_ID dart_length dart_width dart_weight
1     Wells 36-3088        65.4       25.1        12.6
2     Wells 44-0732        63.1       24.7        16.3
3     Wells 35-3079        58.9       24.4        10.5

Task: instead of & try operator | (OR) and see how the result differs
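
To see how & and | differ in principle, here is a minimal sketch on a made-up data frame (hypothetical values, so it does not give away the result for the real data):

```r
library(dplyr)

# Made-up data, only for illustration
toy <- data.frame(
  dart_type = c("Wells", "Wells", "Darl", "Darl"),
  dart_weight = c(12, 8, 11, 3)
)

# AND: both conditions must be true
both <- toy %>%
  filter(dart_type == "Wells" & dart_weight > 10)

# OR: at least one condition must be true
either <- toy %>%
  filter(dart_type == "Wells" | dart_weight > 10)

nrow(both)    # 1
nrow(either)  # 3
```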

Filtering based on vector

  • you can simplify your code by creating a vector of the desired values and then keeping all observations which match any value in that vector with the operator %in%
darts_of_interest <- c("Pedernales", "Ensor")

df_darts_inter <- df_darts_edit %>% 
  filter(dart_type %in% darts_of_interest)

unique(df_darts_inter$dart_type)
[1] "Pedernales" "Ensor"     

Summarise

  • we already know some functions to calculate basic summaries, for example function mean
mean(df_darts_edit$dart_length)
[1] 49.33077
  • but if you want to create a new dataframe from calculated statistics, the function summarise(data, new_variable = summary_statistics) is much more helpful
  • for summary statistics you can use different functions: mean(), median(), sd(), min()… (use ?summarise for more details)
df_darts_edit %>%
  summarise(mean_length = mean(dart_length))
  mean_length
1    49.33077
  • you can also calculate more summaries:
df_darts_summary <- df_darts_edit %>% 
  summarise(
    mean_length = mean(dart_length),
    sd_length = sd(dart_length),
    min_length = min(dart_length),
    max_length = max(dart_length),
    total_count = n()
  )

df_darts_summary
  mean_length sd_length min_length max_length total_count
1    49.33077  12.73619       30.6      109.5          91
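
The dartpoints data contain missing values (see the NA in B.Width in the first printout). By default mean() and similar functions return NA as soon as one value is missing. A small sketch with made-up widths (hypothetical values) showing the na.rm = TRUE argument inside summarise():

```r
library(dplyr)

# Made-up widths with one missing value
toy <- data.frame(b_width = c(11.3, NA, 12.1))

res <- toy %>%
  summarise(
    mean_default = mean(b_width),               # NA, because of the missing value
    mean_no_na = mean(b_width, na.rm = TRUE)    # missing value dropped first
  )

res
```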

Grouping data

  • the summaries above were applied to the whole dataframe. Here we will learn how to calculate summaries for each type of dartpoint by using group_by(data, variable_to_be_grouped_by)
df_darts_edit %>% 
  group_by(dart_type) %>% 
  summarise(
    mean_length = mean(dart_length),
    type_count = n()
    )
# A tibble: 5 × 3
  dart_type  mean_length type_count
  <chr>            <dbl>      <int>
1 Darl              39.8         28
2 Ensor             42.7         10
3 Pedernales        57.9         32
4 Travis            51.4         11
5 Wells             53.1         10
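  • Let's round the means to two decimals with the function round(); note that a tibble prints only about three significant digits, so the printout below looks unchanged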
df_darts_edit %>% 
  group_by(dart_type) %>% 
  summarise(
    mean_length = round(mean(dart_length), 2),
    type_count = n()
    )
# A tibble: 5 × 3
  dart_type  mean_length type_count
  <chr>            <dbl>      <int>
1 Darl              39.8         28
2 Ensor             42.7         10
3 Pedernales        57.9         32
4 Travis            51.4         11
5 Wells             53.1         10
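
To check the stored values rather than the abbreviated tibble printout, you can pull the column out with $ or convert the result to a plain data frame. A minimal sketch on made-up lengths (hypothetical values):

```r
library(dplyr)

# Made-up data, only for illustration
toy <- data.frame(
  dart_type = c("A", "A", "B"),
  dart_length = c(39.812, 39.832, 57.123)
)

res <- toy %>%
  group_by(dart_type) %>%
  summarise(mean_length = round(mean(dart_length), 2))

res$mean_length      # 39.82 57.12
as.data.frame(res)   # plain data frame printing shows the decimals
```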

Mutate

  • the function mutate() creates a new variable and adds it to the dataframe you are working with
df_darts_edit %>% 
  group_by(dart_type) %>% 
  mutate(
    mean_weight = round(mean(dart_weight),2)
  )
# A tibble: 91 × 6
# Groups:   dart_type [5]
   dart_type  dart_ID dart_length dart_width dart_weight mean_weight
   <chr>      <chr>         <dbl>      <dbl>       <dbl>       <dbl>
 1 Pedernales 35-2855       110.        49.3        28.8       10.6 
 2 Pedernales 36-3879        84         21.2         9.3       10.6 
 3 Pedernales 35-0173        78.3       28.1        14.8       10.6 
 4 Pedernales 38-0098        70.4       30.4        13.1       10.6 
 5 Travis     43-0112        69         20.9        11.4        8.59
 6 Pedernales 41-0239        67.2       27.1        15.3       10.6 
 7 Pedernales 35-2391        66         27.2        12.5       10.6 
 8 Wells      36-3088        65.4       25.1        12.6        8.68
 9 Pedernales 43-0110        65         31.6         4.6       10.6 
10 Travis     36-0006        64.6       21.5        15          8.59
# ℹ 81 more rows

Difference between summarise() and mutate()

  • summarise() creates a new dataframe from calculated values
  • the example below shows the maximum width of the dartpoints grouped by dart type
df_darts_edit %>% 
  group_by(dart_type) %>% 
  summarise(
    width_max = max(dart_width)
  )
# A tibble: 5 × 2
  dart_type  width_max
  <chr>          <dbl>
1 Darl            23.3
2 Ensor           27.3
3 Pedernales      49.3
4 Travis          22.4
5 Wells           29.6
  • mutate() adds a new variable to the dataframe
df_darts_edit %>% 
  group_by(dart_type) %>% 
  mutate(
    width_max = max(dart_width)
          ) %>% 
  head(8)
# A tibble: 8 × 6
# Groups:   dart_type [3]
  dart_type  dart_ID dart_length dart_width dart_weight width_max
  <chr>      <chr>         <dbl>      <dbl>       <dbl>     <dbl>
1 Pedernales 35-2855       110.        49.3        28.8      49.3
2 Pedernales 36-3879        84         21.2         9.3      49.3
3 Pedernales 35-0173        78.3       28.1        14.8      49.3
4 Pedernales 38-0098        70.4       30.4        13.1      49.3
5 Travis     43-0112        69         20.9        11.4      22.4
6 Pedernales 41-0239        67.2       27.1        15.3      49.3
7 Pedernales 35-2391        66         27.2        12.5      49.3
8 Wells      36-3088        65.4       25.1        12.6      29.6
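
The difference is easiest to see in the number of rows. A minimal sketch on a made-up data frame (hypothetical values); ungroup() is an additional dplyr function, not used above, that drops the grouping once it is no longer needed:

```r
library(dplyr)

# Made-up data, only for illustration
toy <- data.frame(
  dart_type = c("A", "A", "B"),
  dart_width = c(10, 14, 20)
)

# summarise(): one row per group
summarised <- toy %>%
  group_by(dart_type) %>%
  summarise(width_max = max(dart_width))

# mutate(): all original rows, plus the new column
mutated <- toy %>%
  group_by(dart_type) %>%
  mutate(width_max = max(dart_width)) %>%
  ungroup()

nrow(summarised)  # 2
nrow(mutated)     # 3
```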

More complex summarising with dplyr and pipe

df_darts_sum <- df_darts_edit %>% 
  group_by(dart_type) %>% 
  summarise(
    length_mean = round(mean(dart_length), 1),
    weight_mean = round(mean(dart_weight), 1),
    type_count = n()) %>%
  mutate(type_percent = round(type_count/sum(type_count)*100, 1)) %>% 
  arrange(desc(type_count))

df_darts_sum
# A tibble: 5 × 5
  dart_type  length_mean weight_mean type_count type_percent
  <chr>            <dbl>       <dbl>      <int>        <dbl>
1 Pedernales        57.9        10.6         32         35.2
2 Darl              39.8         4.4         28         30.8
3 Travis            51.4         8.6         11         12.1
4 Ensor             42.7         5.1         10         11  
5 Wells             53.1         8.7         10         11  
  • write.csv(name_of_your_object, file = "path_to_your_file") saves your result as a .csv file; row.names = FALSE leaves out the column of row numbers
write.csv(df_darts_sum, file = here("results/darts_summary.csv"), row.names = FALSE)
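
To check what was written, you can read the file back in. A minimal sketch that writes a made-up summary table (hypothetical values) to a temporary file instead of your results folder:

```r
# Made-up summary table, only for illustration
toy_summary <- data.frame(
  dart_type = c("A", "B"),
  type_count = c(3L, 2L)
)

# Temporary file as a stand-in for here("results/...")
out_file <- file.path(tempdir(), "toy_summary.csv")

write.csv(toy_summary, file = out_file, row.names = FALSE)

# Read the file back and compare with the original
check <- read.csv(out_file)
check
```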

Data Visualisation

Visualising your data with package ggplot2

Starting with ggplot2

#install.packages("ggplot2")
library(ggplot2)

Basic syntax

ggplot(data = <your data frame>) +
  aes(x = <variable to be mapped to axis x>) +
  geom_<geometry>()

Basic types of ggplot - barplot

  • counts of one categorical variable
ggplot(data = df_darts_edit)+
  aes(x = dart_type)+
  geom_bar()

Basic types of ggplot - histogram

  • distribution of one variable
ggplot(data = df_darts_edit)+
  aes(x = dart_length)+
  geom_histogram()
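
geom_histogram() uses 30 bins unless told otherwise and prints a message suggesting that you pick a better value. A sketch with made-up lengths (hypothetical values) that sets the bin width explicitly with the binwidth argument:

```r
library(ggplot2)

# Made-up lengths in mm, only for illustration
toy <- data.frame(
  dart_length = c(31, 33, 38, 42, 44, 47, 52, 58, 66, 84)
)

p_hist <- ggplot(data = toy) +
  aes(x = dart_length) +
  geom_histogram(binwidth = 5)  # each bar covers 5 mm

# Type p_hist in the console to draw the plot
```

Smaller values of binwidth show more detail, larger ones give a smoother picture, so it is worth trying a few.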

Basic types of ggplot - density plot

  • distribution of one variable
ggplot(data = df_darts_edit)+
  aes(x = dart_length)+
  geom_density()

Basic types of ggplot - boxplot

ggplot(data = df_darts_edit)+
  aes(x = dart_type, y = dart_length)+
  geom_boxplot()

Basic types of ggplot - scatter plot

  • comparing two or more variables
ggplot(data = df_darts_edit)+
  aes(x = dart_weight, y = dart_length)+
  geom_point()

Refining your plot

Let's go back to the scatterplot and play a little

ggplot(data = df_darts_edit)+
  aes(x = dart_weight, y = dart_length)+
  geom_point(color = "red", alpha = 0.4, size = 3, shape = 15)+
  geom_smooth()+
  theme_light()

Task:

  • try different colours, shapes and themes
ggplot(data = df_darts_edit)+
  aes(x = dart_weight, y = dart_length)+
  geom_point(color = "red", alpha = 0.4, size = 3, shape = 15)+
  theme_light()

Different shapes and their codes (0 to 25) are listed in ?points.

Playing with variables

  • in this case, the colour and size of the points depend on the values of the variables
ggplot(data = df_darts_edit)+
  aes(x = dart_weight, y = dart_length, color = dart_type, size = dart_weight)+
  geom_point(alpha = 0.5)+
  theme_light()
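
An aesthetic set inside geom_point() is a fixed value for all points, while an aesthetic placed inside aes() is mapped to a variable and gets a legend. A minimal sketch using four dartpoints copied from the printouts above (the object toy is only for illustration):

```r
library(ggplot2)

# Four dartpoints taken from the printed output above
toy <- data.frame(
  dart_weight = c(3.6, 4.5, 9.3, 14.8),
  dart_length = c(42.8, 40.5, 84.0, 78.3),
  dart_type = c("Darl", "Darl", "Pedernales", "Pedernales")
)

# Fixed: every point is red, no legend
p_fixed <- ggplot(data = toy) +
  aes(x = dart_weight, y = dart_length) +
  geom_point(color = "red")

# Mapped: the colour depends on dart_type, a legend is added
p_mapped <- ggplot(data = toy) +
  aes(x = dart_weight, y = dart_length, color = dart_type) +
  geom_point()
```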

Adding text

ggplot(data = df_darts_edit)+
  aes(x = dart_weight, y = dart_length, color = dart_type, size = dart_weight)+
  geom_point(alpha = 0.5)+
  labs(
    title = "A very nice plot",
    subtitle = "Look at those colors!",
    x = "weight (g)",
    y = "length (mm)", 
    caption = "Data: package archdata",
    color = "Type of a dart",
    size = "Weight of a dart")+
  theme_classic()

Splitting plots

ggplot(data = df_darts_edit)+
  aes(x = dart_weight, y = dart_length, color = dart_type, size = dart_weight)+
  geom_point(alpha = 0.5)+
  facet_wrap(~dart_type)+
  labs(
    title = "A very nice plot",
    subtitle = "Look at those colors!",
    x = "weight (g)",
    y = "length (mm)", 
    caption = "Data: package archdata",
    color = "Type of a dart",
    size = "Weight of a dart")+
  theme_light()

Saving plot

best_plot <- ggplot(data = df_darts_edit)+
  aes(x = dart_weight, y = dart_length, color = dart_type, size = dart_weight)+
  geom_point(alpha = 0.5)+
  labs(
    title = "A very nice plot",
    subtitle = "Look at those colors!",
    x = "weight (g)",
    y = "length (mm)", 
    caption = "Data: package archdata",
    color = "Type of a dart",
    size = "Weight of a dart")+
  theme_classic()
ggsave(plot = best_plot, 
       filename = here("results/plot_darts.jpg"),
       width = 15,
       height = 10,
       units = "cm"
)

Back to barplots

ggplot(data = df_darts_edit)+
  aes(x = dart_type)+
  geom_bar()

ggplot(data = df_darts_edit)+
  aes(x = dart_type, color = dart_type)+
  geom_bar()

ggplot(data = df_darts_edit)+
  aes(x = dart_type, fill = dart_type)+
  geom_bar()

ggplot(data = df_darts_edit)+
  aes(x = dart_type)+
  geom_bar(fill = "steelblue")+
  theme_light()

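# fct_infreq() and fct_rev() need the package forcats
library(forcats)

# bars ordered from the most to the least frequent type
ggplot(data = df_darts_edit)+
  aes(x = fct_infreq(dart_type))+
  geom_bar(fill = "steelblue")+
  theme_light()

# horizontal bars, the most frequent type at the top
ggplot(data = df_darts_edit)+
  aes(x = fct_rev(fct_infreq(dart_type)))+
  geom_bar(fill = "steelblue")+
  coord_flip()+
  labs(x = "dart type")+
  theme_light()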
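
The functions fct_infreq() and fct_rev() belong to the forcats package, which has to be loaded with library(forcats). fct_infreq() orders the levels from the most to the least frequent and fct_rev() reverses that order. A small sketch on a made-up vector of types (hypothetical values):

```r
library(forcats)

# Made-up vector: Darl occurs 3 times, Wells 2 times, Ensor once
types <- c("Darl", "Wells", "Darl", "Ensor", "Darl", "Wells")

levels(fct_infreq(types))           # "Darl"  "Wells" "Ensor"
levels(fct_rev(fct_infreq(types)))  # "Ensor" "Wells" "Darl"
```

ggplot2 draws the bars in the order of the levels, which is why wrapping the variable in these functions changes the order of the bars.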

Back to the density plot

ggplot(data = df_darts_edit)+
  aes(x = dart_length)+
  geom_density()

ggplot(data = df_darts_edit)+
  aes(x = dart_length)+
  geom_density(fill = "grey75", color = "grey50")+
  theme_linedraw()

ggplot(data = df_darts_edit)+
  aes(x = dart_length, fill = dart_type)+
  geom_density(color = "white", alpha = 0.3)+
  theme_linedraw()

Points instead of boxplots

ggplot(data = df_darts_edit)+
  aes(x = dart_type, y = dart_length)+
  geom_point()

ggplot(data = df_darts_edit)+
  aes(x = reorder(dart_type, dart_length), y = dart_length, color = dart_type)+
  geom_point(size = 4, alpha = 0.5, show.legend = FALSE)+
  coord_flip()+
  theme_linedraw()
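
reorder() is a base R function that sorts the levels of its first argument by a summary of the second one, by default the mean. A sketch with four lengths taken from the printouts above (the vectors are only for illustration):

```r
# Two Pedernales and two Wells points from the printed output above
dart_type <- c("Pedernales", "Pedernales", "Wells", "Wells")
dart_length <- c(84.0, 78.3, 65.4, 58.9)

# Levels sorted by mean length: Wells (62.15) before Pedernales (81.15)
ordered_type <- reorder(dart_type, dart_length)
levels(ordered_type)  # "Wells" "Pedernales"
```

Without reorder() the types would be plotted alphabetically.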

Exercise

Task:

  • Download the data set of Bronze Age cups, bacups.csv
  • Explore the data set and its structure.
  • What are the observations?
  • What types of variables are there?
  • Create a plot showing the distribution of cup heights (H).
  • Create a boxplot of cup heights divided by phases (Phase).
  • Are there any outliers?
  • Create a plot showing the relationship between cup height and rim diameter.
  • Color the cups from different phases (Phase) differently.
  • Label the axes sensibly.

Hint: you can get the information about the dataset by:

# install.packages("archdata")
library(archdata)
?archdata::BACups

Other hints:

geom_histogram(), geom_boxplot(), geom_point(), labs()

Solution

library(here)
df_cups <- read.csv(here("data/bacups.csv"))
head(df_cups)
    RD   ND   SD   H  NH       Phase
1 11.1 10.0 10.3 5.5 2.5 Subapennine
2  9.5  9.2  9.8 4.8 2.0 Subapennine
3 20.8 20.9 22.0 9.5 3.8 Subapennine
4 19.5 18.2 19.5 8.8 2.7 Subapennine
5 15.5 15.5 18.8 9.8 3.2 Subapennine
6 11.7 11.1 11.5 3.8 1.4 Subapennine
str(df_cups)
'data.frame':   60 obs. of  6 variables:
 $ RD   : num  11.1 9.5 20.8 19.5 15.5 11.7 10.8 15 18.5 11 ...
 $ ND   : num  10 9.2 20.9 18.2 15.5 11.1 10.7 16.1 16.4 8.9 ...
 $ SD   : num  10.3 9.8 22 19.5 18.8 11.5 10.8 16.4 18 9.5 ...
 $ H    : num  5.5 4.8 9.5 8.8 9.8 3.8 3.5 11.8 10.5 5.8 ...
 $ NH   : num  2.5 2 3.8 2.7 3.2 1.4 1.7 3.5 4.8 3.7 ...
 $ Phase: chr  "Subapennine" "Subapennine" "Subapennine" "Subapennine" ...
names(df_cups)
[1] "RD"    "ND"    "SD"    "H"     "NH"    "Phase"
library(ggplot2)
ggplot(data = df_cups)+
  aes(x = H)+
  geom_histogram()

ggplot(data = df_cups)+
  aes(x = Phase, y = H)+
  geom_boxplot()

ggplot(data = df_cups)+
  aes(x = H, y = RD)+
  geom_point()

ggplot(data = df_cups)+
  aes(x = H, y = RD, color = Phase)+
  geom_point(size = 4, alpha = 0.5)+
  labs(x = "Height", y = "Rim Diameter")+
  theme_light()