Introduction to R

Part 1

Peter Tkáč

Overview

R Basics

1.) Introduction - code along

  • orientation in RStudio
  • execution command, basic operators
  • assignment operator
  • vectors
  • comments

2.) Syntax and basic functions

  • functions, objects, values, syntax
  • dataframes and vectors

Workflow

3.) Workflow

  • projects and scripts
  • packages
  • loading data

Orientation in RStudio

Basic operators

5+5
[1] 10
2*2
[1] 4
10/2
[1] 5
3^2
[1] 9
sqrt(9)
[1] 3
3==3
[1] TRUE
3==4
[1] FALSE
10>5
[1] TRUE
10<5
[1] FALSE

Assignment operator

my_number <- 10
my_number
[1] 10
my_number+5
[1] 15
my_other_number <- 200
my_number + my_other_number
[1] 210
my_number == 10
[1] TRUE
my_number < my_other_number
[1] TRUE

Creating vectors

my_vector <- c(1, 2, 3, 4, 5)
my_vector
[1] 1 2 3 4 5
my_vector + 10
[1] 11 12 13 14 15
my_other_vector <- c(6:10)
my_other_vector
[1]  6  7  8  9 10
my_other_vector + my_vector
[1]  7  9 11 13 15
my_other_vector[2]
[1] 7
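
Square brackets can do more than pick a single element. A minimal sketch, reusing my_vector from above, of a few other common ways to select elements:

```r
my_vector <- c(1, 2, 3, 4, 5)

my_vector[c(1, 3)]        # elements 1 and 3
# [1] 1 3
my_vector[2:4]            # elements 2 to 4
# [1] 2 3 4
my_vector[-1]             # everything except the first element
# [1] 2 3 4 5
my_vector[my_vector > 3]  # only the elements larger than 3
# [1] 4 5
```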

Adding comments

# this is a comment
# 10 / 2

Exercise

Task:

1.) create one vector which contains 10 numbers from 51 to 60

2.) and another vector which contains 10 numbers from 101 to 110

3.) save the first vector as “vect_1” and second as “vect_2”

4.) subtract vect_1 from vect_2 and save the results as “vect_sub”

Solution

vect_1 <- c(51:60)
vect_2 <- c(101:110)

vect_sub <- vect_2 - vect_1
vect_sub
 [1] 50 50 50 50 50 50 50 50 50 50

Functions and syntax

  • a function call is always written with parentheses ()
  • functions do the work: they take inputs (arguments) and return a result
  • syntax:

function_name(argument1 = value1, argument2 = value2, ...)

mean(1:10)
[1] 5.5
a <- mean(1:10)
a
[1] 5.5
summary(1:10)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.50    5.50    7.75   10.00 
my_sequence <- seq(from = 1000,  to = 2000, by = 10)
my_sequence
  [1] 1000 1010 1020 1030 1040 1050 1060 1070 1080 1090 1100 1110 1120 1130 1140
 [16] 1150 1160 1170 1180 1190 1200 1210 1220 1230 1240 1250 1260 1270 1280 1290
 [31] 1300 1310 1320 1330 1340 1350 1360 1370 1380 1390 1400 1410 1420 1430 1440
 [46] 1450 1460 1470 1480 1490 1500 1510 1520 1530 1540 1550 1560 1570 1580 1590
 [61] 1600 1610 1620 1630 1640 1650 1660 1670 1680 1690 1700 1710 1720 1730 1740
 [76] 1750 1760 1770 1780 1790 1800 1810 1820 1830 1840 1850 1860 1870 1880 1890
 [91] 1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000
length(my_sequence)
[1] 101
range(my_sequence)
[1] 1000 2000
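
The argument syntax shown above can be tried out with seq(). A small sketch: arguments can be matched by name or by position, and arguments you leave out keep their default values. The na.rm argument of mean() is not used elsewhere in these slides and appears here only as an illustration.

```r
# The same call written three ways
by_name     <- seq(from = 1000, to = 2000, by = 10)
by_position <- seq(1000, 2000, 10)
reordered   <- seq(by = 10, to = 2000, from = 1000)  # names make the order free

identical(by_name, by_position)
# [1] TRUE
identical(by_name, reordered)
# [1] TRUE

# Arguments that are left out fall back to their defaults
mean(c(1, 2, NA))                # na.rm defaults to FALSE
# [1] NA
mean(c(1, 2, NA), na.rm = TRUE)  # missing value is ignored
# [1] 1.5
```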

Basic syntax

Objects and values

  • there are different types of objects and values in R; each type allows different operations

Objects

  • for now, it will be enough to introduce vector and dataframe
  • vector
    • a list of items that are of the same type
  • dataframe
    • a table
    • has rows and columns
    • rectangular, i.e. every column has the same number of rows

Values

  • similarly, there are many types of values - characters, numbers, factors.
  • all you need to know for now is that before doing mathematical operations, you should always check that your numbers are really stored as numbers and not as something else, such as characters
  • the function str() quickly tells you what kind of object you have and what kind of values it holds

Vector

nums <- c(1:10)
nums
 [1]  1  2  3  4  5  6  7  8  9 10
hunds <- c(101:110)
hunds
 [1] 101 102 103 104 105 106 107 108 109 110
str(hunds)
 int [1:10] 101 102 103 104 105 106 107 108 109 110
letts <- letters[1:10]
letts
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
str(letts)
 chr [1:10] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
capital_towns  <- c("Berlin", "Bratislava", "Prague", "Vienna", "Warsaw")
str(capital_towns)
 chr [1:5] "Berlin" "Bratislava" "Prague" "Vienna" "Warsaw"

If a vector combines numbers and words, R stores the numbers as characters, so it is not possible to do mathematical operations with them:

strange_vector <- c("Berlin", 1, 5, 12, 110)
str(strange_vector)
 chr [1:5] "Berlin" "1" "5" "12" "110"
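
A short sketch of what this means in practice, using the same strange_vector. The name converted is made up for this example. as.numeric() converts whatever looks like a number and turns the rest into NA (R also prints a warning, which is silenced here).

```r
strange_vector <- c("Berlin", 1, 5, 12, 110)

is.numeric(strange_vector)
# [1] FALSE
is.character(strange_vector)
# [1] TRUE

# "Berlin" cannot be converted, so it becomes NA
converted <- suppressWarnings(as.numeric(strange_vector))
converted
# [1]  NA   1   5  12 110

# na.rm = TRUE tells sum() to skip the missing value
sum(converted, na.rm = TRUE)
# [1] 128
```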

Dataframes

  • you can create a dataframe by binding vectors of the same length (!) together
  • cbind() binds the vectors into columns and as.data.frame() then turns the result into a dataframe
df <- as.data.frame(cbind(nums, hunds, letts))
df
   nums hunds letts
1     1   101     a
2     2   102     b
3     3   103     c
4     4   104     d
5     5   105     e
6     6   106     f
7     7   107     g
8     8   108     h
9     9   109     i
10   10   110     j

Dataframe - structure

Get the basic information about the dataframe with str()

str(df)
'data.frame':   10 obs. of  3 variables:
 $ nums : chr  "1" "2" "3" "4" ...
 $ hunds: chr  "101" "102" "103" "104" ...
 $ letts: chr  "a" "b" "c" "d" ...

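We see that the columns nums and hunds are stored as characters, not as numbers. This happens because cbind() first builds a matrix, and a matrix can hold only one type of value. To do mathematical operations, we need to convert the values to numbers with the function as.numeric():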

df$nums <- as.numeric(df$nums)
df$hunds <- as.numeric(df$hunds)
str(df)
'data.frame':   10 obs. of  3 variables:
 $ nums : num  1 2 3 4 5 6 7 8 9 10
 $ hunds: num  101 102 103 104 105 106 107 108 109 110
 $ letts: chr  "a" "b" "c" "d" ...
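
As an alternative sketch (not the approach used in these slides), the function data.frame() builds the table directly from the vectors and keeps the type of each column, so no conversion is needed afterwards. The name df_typed is made up for this example.

```r
nums  <- c(1:10)
hunds <- c(101:110)
letts <- letters[1:10]

# Each vector becomes one column and keeps its own type
df_typed <- data.frame(nums, hunds, letts, stringsAsFactors = FALSE)

sapply(df_typed, class)
#        nums       hunds       letts
#   "integer"   "integer" "character"

mean(df_typed$hunds)  # works straight away
# [1] 105.5
```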

Subsetting data

Square brackets [,]

name_of_your_dataframe[row_number,column_number]

First row

df[1,]
  nums hunds letts
1    1   101     a

First column

df[,1]
 [1]  1  2  3  4  5  6  7  8  9 10
sum(df[,2])
[1] 1055

Subsetting data with $

df$letts
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
mean(df$hunds)
[1] 105.5
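
A few more subsetting patterns, as a sketch. The dataframe is rebuilt here with data.frame() only so that the example runs on its own. Rows and columns can be selected by number, by name, or by a condition.

```r
df <- data.frame(nums = 1:10, hunds = 101:110, letts = letters[1:10],
                 stringsAsFactors = FALSE)

df[2, 3]                  # a single cell: row 2, column 3
# [1] "b"
df[1:3, ]                 # first three rows, all columns
df[, c("nums", "letts")]  # columns selected by name
df[df$nums > 8, ]         # rows where nums is larger than 8
#    nums hunds letts
# 9     9   109     i
# 10   10   110     j
```

Selecting columns by name is usually safer than by number, because the code keeps working when the order of the columns changes.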

Dataframe with atRium participants

Copy, paste and run this whole code chunk:

first_name <- c("Margaux", "Lesley", "Carole", "Alexander", "Dita", "Brigit", "Sara", "Yiu-Kang", "Romane", "Nicky", "Mihailo", "Valeriia", "Carlo", "Panagiotis", "Juan Carlos", "Anna", "Swe Zin", "Ilenia")
country <- c("Lithuania","Ireland","Germany","United Kingdom","Netherlands","Austria","Germany","Germany","Italy","United Kingdom","Serbia","Ukraine","Germany","Italy","France","United Kingdom","Switzerland ","Luxembourg")
position <- c("post_doc","researcher","post_doc","masters_student","post_doc","researcher","phd_student","researcher","phd_student","researcher","researcher","ba_student","post_doc","researcher","researcher","phd_student","phd_student","phd_student")
institution <- c("university","public_research_org","university","university","university","public_research_org","university","public_research_org","university","digital_repository","university","university","public_research_org","public_research_org","private_org","university","university","university")
city <- c("Vilnius","Dublin","Kiel","York","Leiden","Vienna","Kiel","Bochum","Padova","York","Belgrade","Odesa","Leipzig","Rome","Paris","Glasgow","Bern","Luxembourg")
distance_km <- c(850, 1650, 720, 1300, 910, 110, 720, 710, 550, 1300, 560, 1090, 380, 870, 1040, 1590, 720, 760)


df_people <- as.data.frame(cbind(first_name, country, position, institution, city, distance_km))
df_people$distance_km <- as.numeric(df_people$distance_km)
head(df_people, 4)
  first_name        country        position         institution    city
1    Margaux      Lithuania        post_doc          university Vilnius
2     Lesley        Ireland      researcher public_research_org  Dublin
3     Carole        Germany        post_doc          university    Kiel
4  Alexander United Kingdom masters_student          university    York
  distance_km
1         850
2        1650
3         720
4        1300

Let’s play a bit

What’s your name?

df_people$first_name
 [1] "Margaux"     "Lesley"      "Carole"      "Alexander"   "Dita"       
 [6] "Brigit"      "Sara"        "Yiu-Kang"    "Romane"      "Nicky"      
[11] "Mihailo"     "Valeriia"    "Carlo"       "Panagiotis"  "Juan Carlos"
[16] "Anna"        "Swe Zin"     "Ilenia"     

Where are you coming from?

unique(df_people$country)
 [1] "Lithuania"      "Ireland"        "Germany"        "United Kingdom"
 [5] "Netherlands"    "Austria"        "Italy"          "Serbia"        
 [9] "Ukraine"        "France"         "Switzerland "   "Luxembourg"    

Which country is most represented?

table(df_people$country)

       Austria         France        Germany        Ireland          Italy 
             1              1              4              1              2 
     Lithuania     Luxembourg    Netherlands         Serbia   Switzerland  
             1              1              1              1              1 
       Ukraine United Kingdom 
             1              3 
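
With more countries it gets harder to spot the largest count by eye. A possible sketch: sort the table so that the most frequent value comes first. The country vector is repeated so that the example runs on its own, and trimws() removes the stray space after "Switzerland"; the name counts is made up here.

```r
country <- c("Lithuania", "Ireland", "Germany", "United Kingdom", "Netherlands",
             "Austria", "Germany", "Germany", "Italy", "United Kingdom", "Serbia",
             "Ukraine", "Germany", "Italy", "France", "United Kingdom",
             "Switzerland ", "Luxembourg")

country <- trimws(country)  # strip leading and trailing spaces

# Count the countries and put the most frequent one first
counts <- sort(table(country), decreasing = TRUE)
head(counts, 3)

names(counts)[1]            # the most represented country
# [1] "Germany"
```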

Quick Task:

  • can you find out which cities are represented?

What is the longest distance one of you had to travel?

max(df_people$distance_km)
[1] 1650

Let’s play a bit 2

What are your positions?

table(df_people$position)

     ba_student masters_student     phd_student        post_doc      researcher 
              1               1               5               4               7 

Who are the PhD students?

df_people[df_people$position=="phd_student",]
   first_name        country    position institution       city distance_km
7        Sara        Germany phd_student  university       Kiel         720
9      Romane          Italy phd_student  university     Padova         550
16       Anna United Kingdom phd_student  university    Glasgow        1590
17    Swe Zin   Switzerland  phd_student  university       Bern         720
18     Ilenia     Luxembourg phd_student  university Luxembourg         760

Alternative - selecting specific columns

df_people[df_people$position=="phd_student",c(1,3,5)]
   first_name    position       city
7        Sara phd_student       Kiel
9      Romane phd_student     Padova
16       Anna phd_student    Glasgow
17    Swe Zin phd_student       Bern
18     Ilenia phd_student Luxembourg
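
Conditions can also be combined. A sketch under the assumption that only three of the columns are needed (they are repeated here so that the example runs on its own): & means "and", and %in% checks whether a value is one of several options. The names far_phd and students_postdocs are made up for this example.

```r
first_name <- c("Margaux", "Lesley", "Carole", "Alexander", "Dita", "Brigit",
                "Sara", "Yiu-Kang", "Romane", "Nicky", "Mihailo", "Valeriia",
                "Carlo", "Panagiotis", "Juan Carlos", "Anna", "Swe Zin", "Ilenia")
position <- c("post_doc", "researcher", "post_doc", "masters_student", "post_doc",
              "researcher", "phd_student", "researcher", "phd_student", "researcher",
              "researcher", "ba_student", "post_doc", "researcher", "researcher",
              "phd_student", "phd_student", "phd_student")
distance_km <- c(850, 1650, 720, 1300, 910, 110, 720, 710, 550, 1300, 560, 1090,
                 380, 870, 1040, 1590, 720, 760)
df_people <- data.frame(first_name, position, distance_km, stringsAsFactors = FALSE)

# PhD students who travelled more than 700 km, columns selected by name
far_phd <- df_people[df_people$position == "phd_student" & df_people$distance_km > 700,
                     c("first_name", "distance_km")]
far_phd

# Everyone who is either a PhD student or a postdoc
students_postdocs <- df_people[df_people$position %in% c("phd_student", "post_doc"),
                               "first_name"]
students_postdocs
```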

Quick Task:

  • can you subset the row with your name and check whether I messed up your data?

Exercise

Task:

Use the dataframe df_people to answer these questions:

  1. What are the names of the variables in the dataframe?
  2. Which types of institution are represented here?
  3. Which types of institution are most represented here?
  4. What is the average distance between Brno and the cities?
  5. Who are the postdocs and what distance did they have to travel?

Hints: names(), unique(), table(), mean(), [,]

Solution

  1. What are the names of the variables in the dataframe?
names(df_people)
[1] "first_name"  "country"     "position"    "institution" "city"       
[6] "distance_km"

Alternative:

colnames(df_people)
[1] "first_name"  "country"     "position"    "institution" "city"       
[6] "distance_km"
  2. Which types of institution are represented here?
unique(df_people$institution)
[1] "university"          "public_research_org" "digital_repository" 
[4] "private_org"        
  3. Which types of institution are most represented here?
table(df_people$institution)

 digital_repository         private_org public_research_org          university 
                  1                   1                   5                  11 

Solution

  4. What is the average distance between Brno and the cities?
mean(df_people$distance_km)
[1] 879.4444

Alternative

summary(df_people$distance_km)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  110.0   712.5   805.0   879.4  1077.5  1650.0 
  5. Who are the postdocs and what distance did they have to travel?
df_people[df_people$position=="post_doc",c(1,3,6)]
   first_name position distance_km
1     Margaux post_doc         850
3      Carole post_doc         720
5        Dita post_doc         910
13      Carlo post_doc         380

Other useful functions for dataframe

  • str() - reveals the structure of the dataframe
str(df_people)
'data.frame':   18 obs. of  6 variables:
 $ first_name : chr  "Margaux" "Lesley" "Carole" "Alexander" ...
 $ country    : chr  "Lithuania" "Ireland" "Germany" "United Kingdom" ...
 $ position   : chr  "post_doc" "researcher" "post_doc" "masters_student" ...
 $ institution: chr  "university" "public_research_org" "university" "university" ...
 $ city       : chr  "Vilnius" "Dublin" "Kiel" "York" ...
 $ distance_km: num  850 1650 720 1300 910 110 720 710 550 1300 ...
  • head(), tail()
head(df_people, 2)
  first_name   country   position         institution    city distance_km
1    Margaux Lithuania   post_doc          university Vilnius         850
2     Lesley   Ireland researcher public_research_org  Dublin        1650
tail(df_people, 2)
   first_name      country    position institution       city distance_km
17    Swe Zin Switzerland  phd_student  university       Bern         720
18     Ilenia   Luxembourg phd_student  university Luxembourg         760
  • ncol(), nrow()
ncol(df_people)
[1] 6
nrow(df_people)
[1] 18
  • sum()
sum(df_people$distance_km)
[1] 15830

Workflow

  • scripts
  • projects
  • packages
  • loading data

Scripts

# Practice script for atRium training school 2024
# Author: Peter Tkáč
# Date: 2024-09-10

## ---- Packages
library(here)
library(tidyverse)

## ---- Data Loading

df_darts <- read.csv(here("data/dartpoints.csv"))
str(df_darts)

## ---- Basic summaries

nrow(df_darts) # number of dartpoints
table(df_darts$Name) # number of dartpoints of each type

## ---- Plots

ggplot(df_darts, aes(x=Name))+
  geom_bar(fill = "pink")+
  theme_light()

## ---- Saving result

ggsave(filename = "very_important_plot.png")

Projects

  • .Rproj file - a “storage” of your scripts, data…
  • we recommend keeping the project file in a parent folder, with sub-folders for data, figures, scripts, etc.

Packages

  • by installing additional packages, you can extend what you can do in R
  • there are plenty of packages with different functions and aims
  • the basic principle is shown here with the package “here”
#install.packages("here") # installs the package
library(here) # loads the package
here() # runs a function from the package
[1] "C:/Users/pajdla/Documents/projects/atRium"
  • you only need to install a package once, with install.packages("name_of_the_package"), but you have to load it with library(name_of_the_package) every time you start a new R session; it is good practice to put the library() calls at the top of each script

  • sometimes you need to specify which package a function comes from: package_name::function_name()

dplyr::filter(df_people, city == "Kiel")
  first_name country    position institution city distance_km
1     Carole Germany    post_doc  university Kiel         720
2       Sara Germany phd_student  university Kiel         720
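
The :: syntax works for any installed package, including those that come with R. A minimal sketch that uses only the built-in packages stats and utils, so it runs without installing anything:

```r
# Call a function from a specific package without loading it with library()
stats::sd(c(2, 4, 6))
# [1] 2
utils::head(letters, 3)
# [1] "a" "b" "c"

# Note: the stats package also has a function called filter().
# Once dplyr is loaded, filter() means dplyr::filter(), and the
# other one is still reachable as stats::filter().
```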

If you are not sure which package a function comes from, open its help page; the package name is shown in curly braces at the top:

?filter
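
The same information can also be obtained in the console. A sketch using sd() as an example; it assumes a fresh R session, because the answer of find() depends on which packages are currently loaded.

```r
# Which loaded packages contain an object with this name?
find("sd")
# [1] "package:stats"

# Where is the function itself defined?
environmentName(environment(sd))
# [1] "stats"
```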

Loading data

Paths

Absolute file path - the full path from the root of the file system. It is specific to a given computer and user.

C:/Documents/MyProject/data/dartpoints.csv

Relative file path - the path relative to the current working directory. If I am currently in the MyProject/ folder:

./data/dartpoints.csv
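
A small sketch of how the two kinds of paths relate, using base R only. The names rel_path and abs_path are made up for this example, and the file does not have to exist for the code to run.

```r
getwd()  # the current working directory; relative paths start here

# file.path() joins folder and file names with the right separator
rel_path <- file.path("data", "dartpoints.csv")
rel_path
# [1] "data/dartpoints.csv"

# Working directory + relative path = absolute path
abs_path <- file.path(getwd(), rel_path)
abs_path

file.exists(rel_path)  # TRUE only if the file is really there
```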

Package here

  • The package here is here to save the day!
  • The function here() knows where the top directory of your project is, so you do not need to write the whole file path

Try running here() to see where your project is stored

here()
[1] "C:/Users/pajdla/Documents/projects/atRium"

Loading data

An example of loading data with here() function:

  • NOTE that in this case the data you want to load have to be in a subfolder “data”, located in the same folder as your project file
df_darts <- read.csv(here("data/dartpoints.csv"))
  • read.csv() loads .csv files (comma-separated values) into R
  • if the values in your file are separated in another way, you have to adjust. For example, for values separated by a semicolon (;), use the argument sep = ";":
df_darts2 <- read.csv(here("data/dartpoints.csv"), sep = ";")
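
To see what the separator does, here is a self-contained sketch. It does not use the dartpoints file: a tiny made-up table (the columns id and length are invented for illustration) is written to a temporary file and read back in two ways.

```r
# Made-up example data, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;length", "1;42.5", "2;37.1"), tmp)

# Default separator is a comma, so each line ends up in a single column
wrong <- read.csv(tmp)
ncol(wrong)
# [1] 1

# With the correct separator we get two proper columns
right <- read.csv(tmp, sep = ";")
ncol(right)
# [1] 2
str(right)

# read.csv2() uses sep = ";" by default; dec = "." keeps the decimal point
same <- read.csv2(tmp, dec = ".")
```

If str() shows one strange column instead of several, a wrong separator is a likely cause.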

Exercise

Observe the dartpoint data

  1. How many observations and how many variables are in the dataframe?
  2. What are the names of the variables?
  3. Are there any quantitative variables? Are they stored properly as numbers, so that we can do mathematical operations with them?
  4. What is the mean length of the dartpoints?