Basic workflows

Today, you will learn how to:

follow a basic workflow by
- organising your work with scripts and projects
- using additional packages
- importing your data
- observing the structure of your data
describe your data
- visually (plots)
- numerically (descriptive statistics)
observe the relationship between two variables

Warm up!

explain what this image is trying to say:

what does this operator do? <-

and this one? $

can you explain what’s going on here?

my_object <- aggregate(grave_length ~ dating, data = df_grave, FUN = mean)

Introduction

the general workflow of data analysis looks like this:

Organize your work in scripts

create a new script with Ctrl + Shift + n
put some basic info on what you are doing at the top.
use # to comment your code
comment on the why, not the what.
divide the code into sections with Ctrl + Shift + r

# Section name ----
arrange your code in the order in which you will run it (so packages first, then importing data, then transformations, analyses, … and finally exporting the results)
RStudio will give you hints; hit Tab to autocomplete function names.
execute the current line with Ctrl + Enter
run the whole script with Ctrl + Shift + Enter

Script example:

# Practice script for "AES_707: Statistics seminar for archaeology students"
# Author: Peter Tkáč
# Date: 2025-10-03
# Update: 2025-10-05

# Packages ----------------------------
library(here)
library(dplyr)
library(ggplot2)

# Data Import --------------------------
df_dartpoints <- read.csv(here("data/dartpoints.csv"))

# Structure overview -----------------

# for quick overview of the data structure

str(df_dartpoints)
nrow(df_dartpoints) # number of dartpoints / rows
table(df_dartpoints$Name) # number of dartpoints of each type

# Transformation ---------------------------------

sum_dartpoints <- df_dartpoints %>% 
  group_by(Name) %>% 
  summarise(
    mean_length = mean(Length),
    sd_length = sd(Length),
    mean_width = mean(Width),
    sd_width = sd(Width)
  )

sum_dartpoints

# Data Visualisation ---------------------

# for a quick overview of the distribution of the variable Length

plot_1 <- ggplot(df_dartpoints, aes(x = Length)) +
  geom_histogram() +
  theme_light()

plot_1

# Results saving -------------------------------

ggsave(filename = "very_important_plot.png", plot = plot_1)

Projects

an .Rproj file marks a folder as an RStudio project; the project is a kind of “storage” for all your project-related scripts, datasets, figures…
we recommend that you store each project in a separate directory (folder) and the different parts of your project in subdirectories
this organisation of scripts and data will be used throughout the whole course
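
A minimal sketch of setting up such a folder structure from R. The subfolder names data and scripts follow the exercise later in this session; the names figures and my_project are assumptions, and the project is created in a temporary directory here so that nothing on your disk is overwritten.

```r
# Hypothetical project root; in practice this is the folder holding your .Rproj file
project_dir <- file.path(tempdir(), "my_project")

# "data" and "scripts" are used in this course; "figures" is an assumed extra
subdirs <- c("data", "scripts", "figures")

# Create each subfolder (recursive = TRUE also creates the project folder itself)
for (d in subdirs) {
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.files(project_dir)
```

You can, of course, create the same folders by hand in your file manager or in the Files pane of RStudio.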

Packages

by installing additional packages, you can expand the range of things you can do in R
there are plenty of packages with different functions and aims
we will introduce the basic principles with the package here

#install.packages("here") # installs the package
library(here) # loads the package
here() # runs a function from the package

[1] "C:/Users/pajdla/Documents/projects/stat4arch"

you need to install a package only once, with install.packages("name_of_the_package"), but you have to load it with library(name_of_the_package) every time you start a new R session, so each script should load the packages it uses
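
One possible way to avoid reinstalling a package every time the script runs is to check first whether it is already there. This is only a sketch: is_installed() is a hypothetical helper name, and the install step is left commented out because it needs an internet connection.

```r
# Hypothetical helper: returns TRUE if the package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this is TRUE on any installation
is_installed("stats")

# Install here only when it is missing, then load it
# if (!is_installed("here")) install.packages("here")
# library(here)
```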

Importing data into R

Paths

Absolute file path: starts at the root of the file system, so it is specific to a given computer and user.

C:/Documents/MyProject/data/dartpoints.csv

Relative file path: starts from the folder where your project is stored:

./data/dartpoints.csv
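
A short sketch of how the two kinds of path relate, using only base R. file.path() builds a path from its parts, and a relative path is resolved against the current working directory returned by getwd(). The variable names rel_path and abs_path are chosen for illustration only.

```r
# Build a relative path from its parts; file.path() joins them with "/"
rel_path <- file.path("data", "dartpoints.csv")
rel_path
# [1] "data/dartpoints.csv"

# The same file as an absolute path, resolved against the working directory.
# The result differs between computers, which is why absolute paths are fragile.
abs_path <- file.path(getwd(), rel_path)
abs_path
```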

Package here

Package here is here to save the day!
Function here() knows where the top directory of your project is, so you do not need to write the whole file path

Try running here() to see where your project is stored

here()

[1] "C:/Users/pajdla/Documents/projects/stat4arch"

Importing Data into R - 2

example of importing data with a relative path:
NOTE that in this case, your data have to be in the subfolder “data” which is located in the same folder as your project

df_dartpoints <- read.csv(here("data/dartpoints.csv"))

function read.csv() imports .csv files (comma-separated values) into R
if your data use a different separator, you will have to adjust. For example, for values separated by a semicolon (;), you need to use the argument sep = ";"

df_dartpoints <- read.csv(here("data/dartpoints.csv"), sep = ";")

you can check how the values are separated by opening your .csv file in Notepad (or another plain-text editor) or with View File in RStudio
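
The separator can also be checked from R itself. The sketch below writes a tiny semicolon-separated file with made-up rows to a temporary location (tmp_csv and df_demo are illustrative names, not part of the course data) and shows what happens with and without sep = ";".

```r
# Write a tiny semicolon-separated file to a temporary location (made-up rows)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("Name;Length;Width", "A;42.8;15.8", "B;40.5;17.4"), tmp_csv)

# Peek at the raw text: the separator is visible in the first lines
readLines(tmp_csv, n = 2)

# With the default comma separator everything lands in a single column
ncol(read.csv(tmp_csv))
# [1] 1

# With sep = ";" the three columns are split correctly
df_demo <- read.csv(tmp_csv, sep = ";")
str(df_demo)
```

If str() shows a single column whose name contains all the headers glued together, the separator is most likely wrong.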

Before we continue - the Dartpoints data

download the data dartpoints.csv
find out how the values in the file are separated and proceed accordingly
create a new project and copy dartpoints.csv into the “data” subfolder
create a new script and save it into the “scripts” subfolder
install and load the package here
import the data from dartpoints.csv and save them as df_dartpoints

Structure of your data

we already know the function str(), which reveals the basic structure of any object

str(df_dartpoints)

'data.frame':   91 obs. of  17 variables:
 $ Name     : chr  "Darl" "Darl" "Darl" "Darl" ...
 $ Catalog  : chr  "41-0322" "35-2946" "35-2921" "36-3487" ...
 $ TARL     : chr  "41CV0536" "41CV0235" "41CV0132" "41CV0594" ...
 $ Quad     : chr  "26/59" "21/63" "20/63" "10/54" ...
 $ Length   : num  42.8 40.5 37.5 40.3 30.6 41.8 40.3 48.5 47.7 33.6 ...
 $ Width    : num  15.8 17.4 16.3 16.1 17.1 16.8 20.7 18.7 17.5 15.8 ...
 $ Thickness: num  5.8 5.8 6.1 6.3 4 4.1 5.9 6.9 7.2 5.1 ...
 $ B.Width  : num  11.3 NA 12.1 13.5 12.6 12.7 11.7 14.7 14.3 NA ...
 $ J.Width  : num  10.6 13.7 11.3 11.7 11.2 11.5 11.4 13.4 11.8 12.5 ...
 $ H.Length : num  11.6 12.9 8.2 8.3 8.9 11 7.6 9.2 8.9 11.5 ...
 $ Weight   : num  3.6 4.5 3.6 4 2.3 3 3.9 6.2 5.1 2.8 ...
 $ Blade.Sh : chr  "S" "S" "S" "S" ...
 $ Base.Sh  : chr  "I" "I" "I" "I" ...
 $ Should.Sh: chr  "S" "S" "S" "S" ...
 $ Should.Or: chr  "T" "T" "T" "T" ...
 $ Haft.Sh  : chr  "S" "S" "S" "S" ...
 $ Haft.Or  : chr  "E" "E" "E" "E" ...

head(), tail()

head(df_dartpoints, 4)

  Name Catalog     TARL  Quad Length Width Thickness B.Width J.Width H.Length
1 Darl 41-0322 41CV0536 26/59   42.8  15.8       5.8    11.3    10.6     11.6
2 Darl 35-2946 41CV0235 21/63   40.5  17.4       5.8      NA    13.7     12.9
3 Darl 35-2921 41CV0132 20/63   37.5  16.3       6.1    12.1    11.3      8.2
4 Darl 36-3487 41CV0594 10/54   40.3  16.1       6.3    13.5    11.7      8.3
  Weight Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
1    3.6        S       I         S         T       S       E
2    4.5        S       I         S         T       S       E
3    3.6        S       I         S         T       S       E
4    4.0        S       I         S         T       S       E

tail(df_dartpoints, 2)

    Name Catalog     TARL  Quad Length Width Thickness B.Width J.Width H.Length
90 Wells 35-3012 41CV0270 24/62   49.1  21.1       6.3    14.8    15.2     16.6
91 Wells 44-0732 41BL0239 39/55   63.1  24.7       5.4    10.3    12.1     21.1
   Weight Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
90    5.2        S       E         S         T       S       P
91   16.3        S       E         S         T       S       T

ncol(), nrow()

ncol(df_dartpoints)

[1] 17

nrow(df_dartpoints)

[1] 91

names()

names(df_dartpoints)

 [1] "Name"      "Catalog"   "TARL"      "Quad"      "Length"    "Width"    
 [7] "Thickness" "B.Width"   "J.Width"   "H.Length"  "Weight"    "Blade.Sh" 
[13] "Base.Sh"   "Should.Sh" "Should.Or" "Haft.Sh"   "Haft.Or"
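
To try these functions without the file at hand, a small stand-in data frame is enough. The sketch below copies three columns of the first four rows shown by head() above; df_small is a hypothetical name used only here.

```r
# A small stand-in: three columns of the first four rows shown by head() above
df_small <- data.frame(
  Name   = c("Darl", "Darl", "Darl", "Darl"),
  Length = c(42.8, 40.5, 37.5, 40.3),
  Width  = c(15.8, 17.4, 16.3, 16.1)
)

# dim() returns the number of rows and columns at once
dim(df_small)
# [1] 4 3

# Column names, as with names(df_dartpoints)
names(df_small)

# $ picks a single column, which can be passed on to another function
mean(df_small$Length)
```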