Correspondence analysis

Goals

  • new ways of data manipulation
    • combining vectors into dataframe
    • selecting columns, rows and values
    • mathematical operations in rows
  • some fancy stuff to make your ggplot look better
  • introduction to correspondence analysis in R

Data Manipulation

Vectors

  • you already know vectors - a vector is a combination of several values stored in a single object
cisla <- c(1:10)
cisla
 [1]  1  2  3  4  5  6  7  8  9 10
stovky <- c(101:110)
stovky
 [1] 101 102 103 104 105 106 107 108 109 110
pismena <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
pismena
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

Combining different vectors with rbind and cbind

  • you can combine different vectors into matrices or dataframes

  • rbind() combines vectors into Rows, cbind() into Columns

  • the vectors have to be of the same length!

  • cbind

sloupce <- cbind(cisla, stovky, pismena)
head(sloupce, 5)
     cisla stovky pismena
[1,] "1"   "101"  "a"    
[2,] "2"   "102"  "b"    
[3,] "3"   "103"  "c"    
[4,] "4"   "104"  "d"    
[5,] "5"   "105"  "e"    
  • rbind
radky <- rbind(cisla, stovky, pismena)
radky
        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]
cisla   "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10" 
stovky  "101" "102" "103" "104" "105" "106" "107" "108" "109" "110"
pismena "a"   "b"   "c"   "d"   "e"   "f"   "g"   "h"   "i"   "j"  
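
What happens when the lengths differ can be checked with a small sketch. The vectors dlouhy and kratky are made up for illustration and are not part of the lesson data. cbind() recycles the shorter vector when its length divides the longer one, while data.frame() stops with an error when the lengths do not fit.

```r
# Hypothetical vectors of different lengths (not part of the lesson data)
dlouhy <- 1:4
kratky <- c(10, 20)

# cbind() recycles the shorter vector to fill all four rows
recyklovano <- cbind(dlouhy, kratky)
recyklovano

# data.frame() refuses lengths that do not fit (3 rows against 2)
chyba <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
chyba
```

Silent recycling is easy to overlook, which is one more reason to keep the vectors the same length.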

Dataframe

You can then very simply create a dataframe:

tabulka <- data.frame(sloupce)

But you have to check whether the numbers are really stored as numbers:

str(tabulka)
'data.frame':   10 obs. of  3 variables:
 $ cisla  : chr  "1" "2" "3" "4" ...
 $ stovky : chr  "101" "102" "103" "104" ...
 $ pismena: chr  "a" "b" "c" "d" ...

Whoa! The numbers are not numbers but characters, because cbind() first created a matrix and a matrix can hold only one data type. You will have to convert them with as.numeric() if you want to do mathematical operations. It is also good practice to always check the structure of your dataframe before carrying out any operations.

tabulka$cisla <- as.numeric(tabulka$cisla)
tabulka$stovky <- as.numeric(tabulka$stovky)
str(tabulka)
'data.frame':   10 obs. of  3 variables:
 $ cisla  : num  1 2 3 4 5 6 7 8 9 10
 $ stovky : num  101 102 103 104 105 106 107 108 109 110
 $ pismena: chr  "a" "b" "c" "d" ...
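
The conversion can be avoided altogether by passing the vectors straight to data.frame(). A minimal sketch, assuming the same three vectors as above (the name tabulka_primo is made up here):

```r
cisla <- 1:10
stovky <- 101:110
pismena <- letters[1:10]

# data.frame() stores each vector as its own column and keeps its type
tabulka_primo <- data.frame(cisla, stovky, pismena)
str(tabulka_primo)

# cbind() builds a matrix first, so every value becomes a character
typ_v_matici <- class(cbind(cisla, stovky, pismena)[1, 1])
typ_v_matici
```

The detour through cbind() is still useful to know, because imported tables sometimes arrive with numbers stored as text.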

Square brackets

  cisla stovky pismena
1     1    101       a
2     2    102       b
3     3    103       c
4     4    104       d
  • with square brackets [] you can easily select specific columns, rows or cells

  • your_dataframe[row number, column number]

  • 1st row:

tabulka[1,]
  cisla stovky pismena
1     1    101       a
  • 1st column:
tabulka[,1]
 [1]  1  2  3  4  5  6  7  8  9 10
  • value in the 1st row and 2nd column:
tabulka[1,2]
[1] 101
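
Square brackets also accept several positions, column names and logical conditions. A short sketch on a freshly built copy of the table:

```r
tabulka <- data.frame(cisla = 1:10, stovky = 101:110, pismena = letters[1:10])

# several rows and columns at once: rows 2 to 4, columns 1 and 3
vyber <- tabulka[2:4, c(1, 3)]
vyber

# a column can be selected by name instead of by position
tabulka[1, "stovky"]

# rows can be selected by a logical condition
velke <- tabulka[tabulka$cisla > 8, ]
velke
```

Selecting by name is usually safer than selecting by position, because it keeps working when columns are added or reordered.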

Mathematics in dataframes

Now you can perform mathematical operations on specific cells selected with square brackets:

tabulka[1,2] + tabulka[2,2]
[1] 203

You can even create a new column with the results:

tabulka$novy_sloupec <- tabulka[, 1] + 10
head(tabulka, 4)
  cisla stovky pismena novy_sloupec
1     1    101       a           11
2     2    102       b           12
3     3    103       c           13
4     4    104       d           14
  • or alternatively use the mutate() function from the dplyr package
library(dplyr)
tabulka <- mutate(tabulka, este_jeden_sloupec = cisla + 100)
head(tabulka,4)
  cisla stovky pismena novy_sloupec este_jeden_sloupec
1     1    101       a           11                101
2     2    102       b           12                102
3     3    103       c           13                103
4     4    104       d           14                104

rowSums

  • you can calculate the sums of specific columns in each row with rowSums()
tabulka_1 <- tabulka
tabulka_1$sucet <- rowSums(tabulka_1[,1:2])
head(tabulka_1, 4)
  cisla stovky pismena novy_sloupec este_jeden_sloupec sucet
1     1    101       a           11                101   102
2     2    102       b           12                102   104
3     3    103       c           13                103   106
4     4    104       d           14                104   108
  • alternative solution in dplyr:
tabulka <- tabulka %>% 
  mutate(sucet = cisla + stovky)
head(tabulka,4)
  cisla stovky pismena novy_sloupec este_jeden_sloupec sucet
1     1    101       a           11                101   102
2     2    102       b           12                102   104
3     3    103       c           13                103   106
4     4    104       d           14                104   108
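
rowSums() can also take columns selected by name, and it has an na.rm argument for missing values. The small table below is made up for illustration; the NA is not in the lesson data.

```r
tabulka <- data.frame(
  cisla = 1:4,
  stovky = c(101, NA, 103, 104),
  pismena = letters[1:4]
)

# selecting columns by name keeps working if the column order changes
s_chybou <- rowSums(tabulka[, c("cisla", "stovky")])
s_chybou

# na.rm = TRUE ignores missing values instead of returning NA for the row
bez_chyby <- rowSums(tabulka[, c("cisla", "stovky")], na.rm = TRUE)
bez_chyby
```

Whether ignoring missing values is appropriate depends on the data: a missing count is not necessarily a zero.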

colnames

  • sometimes you will need the names of the columns, either to rename them or to create a vector from them
  • you can read and change column names with the colnames() function
nazvy_sloupcu <- colnames(tabulka)
nazvy_sloupcu
[1] "cisla"              "stovky"             "pismena"           
[4] "novy_sloupec"       "este_jeden_sloupec" "sucet"             

Let’s have a look at the table radky: its column names look a bit strange

data.frame(radky)
         X1  X2  X3  X4  X5  X6  X7  X8  X9 X10
cisla     1   2   3   4   5   6   7   8   9  10
stovky  101 102 103 104 105 106 107 108 109 110
pismena   a   b   c   d   e   f   g   h   i   j

Let’s say we want to name the columns after the values in the pismena row

colnames(radky) <- radky[3, ]
head(radky, 4)
        a     b     c     d     e     f     g     h     i     j    
cisla   "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10" 
stovky  "101" "102" "103" "104" "105" "106" "107" "108" "109" "110"
pismena "a"   "b"   "c"   "d"   "e"   "f"   "g"   "h"   "i"   "j"  

You can also remove the row that was used for the column names:

radky <- radky[-3, ]
radky <- data.frame(radky)
radky
         a   b   c   d   e   f   g   h   i   j
cisla    1   2   3   4   5   6   7   8   9  10
stovky 101 102 103 104 105 106 107 108 109 110
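
A single column can be renamed by assigning into one position of colnames(). A minimal sketch with a made-up two-column table; the new names "numbers" and "hundreds" are chosen only for illustration:

```r
tabulka <- data.frame(cisla = 1:3, stovky = 101:103)

# rename a single column by position
colnames(tabulka)[2] <- "hundreds"

# or look the column up by its current name
colnames(tabulka)[colnames(tabulka) == "cisla"] <- "numbers"
colnames(tabulka)

# a matrix without column names gets X1, X2, ... when converted
m <- rbind(cisla = 1:3, stovky = 101:103)
nazvy_z_matice <- colnames(data.frame(m))
nazvy_z_matice
```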

Exercise

  1. copy, paste and run this code chunk from our website:
datace <- c("ne.lin", "ne.lin", "en.zvo", "ne.lin", "en.zvo", "en.zvo")
nadoby <- c(4, 3, 7, 2, 4, 3)
sipy <- c(0,1,3,0, 2, 4)
cu_dycka <- c(0,0,3,0, 4,2)
spondyl <- c(5,2,0,4, 0, 0)
kopytovity_klin <- c(4,5,0,4, 0, 0)
desticka <- c(0,0,2,0,1, 3)
zrnoterka <- c(1, 0, 0, 1, 0, 0)

pohrebiste <- data.frame(datace, nadoby, sipy, cu_dycka, spondyl, kopytovity_klin, desticka, zrnoterka)
pohrebiste
  datace nadoby sipy cu_dycka spondyl kopytovity_klin desticka zrnoterka
1 ne.lin      4    0        0       5               4        0         1
2 ne.lin      3    1        0       2               5        0         0
3 en.zvo      7    3        3       0               0        2         0
4 ne.lin      2    0        0       4               4        0         1
5 en.zvo      4    2        4       0               0        1         0
6 en.zvo      3    4        2       0               0        3         0
  1. create a new column with the sum of the artefacts found in each grave
  2. create a new column called “ID” where each grave has a unique number

Solution

  1. create a new column with the sum of the artefacts found in each grave
pohrebiste_2 <- pohrebiste
pohrebiste_2$pocet_artefaktu <- rowSums(pohrebiste[, 2:8])
pohrebiste_2$ID <- c(1:6)
pohrebiste_2
  datace nadoby sipy cu_dycka spondyl kopytovity_klin desticka zrnoterka
1 ne.lin      4    0        0       5               4        0         1
2 ne.lin      3    1        0       2               5        0         0
3 en.zvo      7    3        3       0               0        2         0
4 ne.lin      2    0        0       4               4        0         1
5 en.zvo      4    2        4       0               0        1         0
6 en.zvo      3    4        2       0               0        3         0
  pocet_artefaktu ID
1              14  1
2              11  2
3              15  3
4              11  4
5              11  5
6              12  6
  • alternative solution with package dplyr:
pohrebiste <- pohrebiste %>% 
  mutate(pocet_artefaktu = rowSums(pohrebiste[, 2:8]),
         ID = c(1:6))

head(pohrebiste, 3)
  datace nadoby sipy cu_dycka spondyl kopytovity_klin desticka zrnoterka
1 ne.lin      4    0        0       5               4        0         1
2 ne.lin      3    1        0       2               5        0         0
3 en.zvo      7    3        3       0               0        2         0
  pocet_artefaktu ID
1              14  1
2              11  2
3              15  3
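
The solution above hard-codes the column range 2:8 and the number of graves 1:6. A slightly more general sketch, shown on a shortened copy of the exercise data (three graves, two artefact columns), picks the numeric columns automatically and numbers the rows with seq_len():

```r
pohrebiste <- data.frame(
  datace = c("ne.lin", "ne.lin", "en.zvo"),
  nadoby = c(4, 3, 7),
  sipy = c(0, 1, 3)
)

# find the numeric columns instead of hard-coding their positions
ciselne <- sapply(pohrebiste, is.numeric)
pohrebiste$pocet_artefaktu <- rowSums(pohrebiste[, ciselne])

# seq_len(nrow()) always matches the number of rows
pohrebiste$ID <- seq_len(nrow(pohrebiste))
pohrebiste
```

This only works as intended while every numeric column is an artefact count, so the sum has to be computed before the ID column is added.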

Correspondence analysis (CA)

A bit of theory

  • one of the multivariate statistical methods
  • suitable for contingency tables, where you have counts for two or more categorical variables
  • one variable is stored in rows, the other in columns
  • typically, you visualise the result of CA with a scatterplot

A simple contingency table

  • now, have a look at this simple contingency table of a fictional Bell Beaker burial site: one variable is the sex of the buried individuals (columns), the other is the position of their bodies (rows)
           male female
left side    10      1
right side    2     12
  • even at first sight it is obvious that females were predominantly buried on their right side and males on their left side
  • correspondence analysis will help you when your contingency table has many more variables and the relations between them are much harder to see
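
The pattern in the small table can be made explicit with column proportions. A minimal sketch that types the table in by hand (the object names are made up here):

```r
pohlavi_poloha <- matrix(
  c(10, 2, 1, 12),
  nrow = 2,
  dimnames = list(c("left side", "right side"), c("male", "female"))
)
pohlavi_poloha

# share of each body position within each sex (every column sums to 1)
podily <- prop.table(pohlavi_poloha, margin = 2)
round(podily, 2)
```

With only two rows and two columns the proportions are enough; correspondence analysis becomes useful once such a table is too large to read this way.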

Use of CA

  • typical archaeological uses of CA:
    • seriation
    • counts of different types of artefacts (e.g. artefacts in burial sites, types of chipped stone industry, …)
    • presence or absence of different features in archaeological collections (e.g. presence or absence of construction elements in graves)
  • it works only with whole numbers (integers), i.e. counts or dichotomies (presence / absence, 1/0)
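
Counts can be turned into a presence / absence table with a single comparison. A small sketch using three graves and three artefact columns taken from the exercise data (the row names grave1 to grave3 are made up):

```r
pocty <- matrix(
  c(4, 0, 5,
    3, 1, 2,
    7, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("grave1", "grave2", "grave3"),
                  c("nadoby", "sipy", "spondyl"))
)

# TRUE / FALSE multiplied by 1 gives 1 (present) or 0 (absent)
pritomnost <- (pocty > 0) * 1
pritomnost
```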

Quick correspondence analysis

  • imagine we have a burial site and want to explore the relation between the counts of different types of artefacts found in the graves and the dating of the graves

Research questions

  • are certain types of artefacts related to a certain dating?
  • are there differences within the groups of graves with the same dating?

Materials and methods

  • we will use the fictional data from the last exercise, the dataframe “pohrebiste”
  • TASKS:
    • clear your workspace and load the “pohrebiste” data from our webpage again
    • install and load package ca
  datace nadoby sipy cu_dycka spondyl kopytovity_klin desticka zrnoterka
1 ne.lin      4    0        0       5               4        0         1
2 ne.lin      3    1        0       2               5        0         0
3 en.zvo      7    3        3       0               0        2         0
4 ne.lin      2    0        0       4               4        0         1
5 en.zvo      4    2        4       0               0        1         0
6 en.zvo      3    4        2       0               0        3         0

Results

library(ca)
ca_vysledek <- ca(pohrebiste[,2:8])
plot(ca_vysledek)
  • notice we are running the analysis only on the columns that hold the counts of artefacts (2nd to 8th)
  • also notice there are two types of points in the plot:
    • blue circles represent row values, which are in this case graves
    • red triangles represent column values - numbers of specific artefacts

Interpretation

  • there are two clearly distinct groups of graves with distinct types of artefacts
  • on the other hand, the number of vessels (nadoby) does not seem to differ between the groups - both grave groups contain a similar number of ceramic vessels
  • the x axis explains 82 % of the difference between graves and artefacts, the y axis only 9 %. This means that the difference between those two groups is bigger than the difference within them. In other words, the difference between e.g. the copper dagger (cu_dycka) and spondylus is much greater than between the copper dagger and arrowheads (sipy)
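
Where the percentages on the axes come from can be sketched in base R, without the ca package. This is an illustration of the usual computation (singular value decomposition of the standardised residuals), assuming the same seven count columns; the names N, P, r, s and S are chosen here and do not come from the lesson.

```r
# counts from the pohrebiste table, one row per grave
N <- matrix(
  c(4, 0, 0, 5, 4, 0, 1,
    3, 1, 0, 2, 5, 0, 0,
    7, 3, 3, 0, 0, 2, 0,
    2, 0, 0, 4, 4, 0, 1,
    4, 2, 4, 0, 0, 1, 0,
    3, 4, 2, 0, 0, 3, 0),
  nrow = 6, byrow = TRUE
)

P <- N / sum(N)   # proportions of the grand total
r <- rowSums(P)   # row masses
s <- colSums(P)   # column masses

# standardised residuals: (observed - expected) / sqrt(expected)
S <- (P - r %o% s) / sqrt(r %o% s)

# singular values; the sixth one is numerically zero and is dropped
sv <- svd(S)$d[1:5]
inertia <- sv^2
pct <- 100 * inertia / sum(inertia)
round(pct, 2)
```

The first value of pct corresponds to the x axis and the second to the y axis of the plot.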

Now let’s have a closer look

ca_vysledek

 Principal inertias (eigenvalues):
           1        2        3        4        5       
Value      0.667126 0.069873 0.054181 0.018769 0.001128
Percentage 82.25%   8.61%    6.68%    2.31%    0.14%   


 Rows:
            [,1]      [,2]      [,3]     [,4]      [,5]      [,6]
Mass    0.189189  0.148649  0.202703 0.148649  0.148649  0.162162
ChiDist 0.879868  0.836034  0.738122 1.026842  0.926948  1.011206
Inertia 0.146464  0.103898  0.110437 0.156736  0.127724  0.165817
Dim. 1  1.037358  0.783658 -0.856640 1.225783 -0.979198 -1.083841
Dim. 2  0.450424 -0.963242  0.350276 0.134612  1.561622 -1.635249


 Columns:
           nadoby      sipy  cu_dycka  spondyl kopytovity_klin  desticka
Mass     0.310811  0.135135  0.121622 0.148649        0.175676  0.081081
ChiDist  0.300899  0.875803  1.086977 1.097286        1.064182  1.129904
Inertia  0.028141  0.103653  0.143698 0.178978        0.198950  0.103515
Dim. 1  -0.224254 -0.989256 -1.177308 1.297475        1.221578 -1.212896
Dim. 2   0.489148 -1.259831  1.692644 0.297174       -0.720551 -1.666812
        zrnoterka
Mass     0.027027
ChiDist  1.415361
Inertia  0.054142
Dim. 1   1.385408
Dim. 2   1.106622
  • you can see that this is not a dataframe, but a so-called “list” with a much more complicated structure. Even so, it is possible to extract the coordinates of the points (we will show how later) and use them for other analyses or for visualisation in ggplot.

The structure of the CA result

  • this looks even more complicated, but don’t worry!
str(ca_vysledek)
List of 16
 $ sv        : num [1:5] 0.8168 0.2643 0.2328 0.137 0.0336
 $ nd        : logi NA
 $ rownames  : NULL
 $ rowmass   : num [1:6] 0.189 0.149 0.203 0.149 0.149 ...
 $ rowdist   : num [1:6] 0.88 0.836 0.738 1.027 0.927 ...
 $ rowinertia: num [1:6] 0.146 0.104 0.11 0.157 0.128 ...
 $ rowcoord  : num [1:6, 1:5] 1.037 0.784 -0.857 1.226 -0.979 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:5] "Dim1" "Dim2" "Dim3" "Dim4" ...
 $ rowsup    : logi(0) 
 $ colnames  : chr [1:7] "nadoby" "sipy" "cu_dycka" "spondyl" ...
 $ colmass   : num [1:7] 0.311 0.135 0.122 0.149 0.176 ...
 $ coldist   : num [1:7] 0.301 0.876 1.087 1.097 1.064 ...
 $ colinertia: num [1:7] 0.0281 0.1037 0.1437 0.179 0.1989 ...
 $ colcoord  : num [1:7, 1:5] -0.224 -0.989 -1.177 1.297 1.222 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:7] "nadoby" "sipy" "cu_dycka" "spondyl" ...
  .. ..$ : chr [1:5] "Dim1" "Dim2" "Dim3" "Dim4" ...
 $ colsup    : logi(0) 
 $ N         : num [1:6, 1:7] 4 3 7 2 4 3 0 1 3 0 ...
 $ call      : language ca.matrix(obj = as.matrix(obj))
 - attr(*, "class")= chr "ca"
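
Elements of a list are reached with $ or with double square brackets. The sketch below uses a small made-up list that only imitates a few elements of the CA result, so it runs without the ca package:

```r
# a made-up list imitating part of the structure printed above
vysledek <- list(
  sv = c(0.8168, 0.2643),
  colnames = c("nadoby", "sipy"),
  colcoord = matrix(
    c(-0.224, -0.989, 0.489, -1.260),
    nrow = 2,
    dimnames = list(c("nadoby", "sipy"), c("Dim1", "Dim2"))
  )
)

names(vysledek)              # which elements does the list have?
vysledek$sv                  # extract one element with $
vysledek[["colnames"]]       # the same with double square brackets
vysledek$colcoord[, "Dim1"]  # then index the matrix as usual
```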

Visualising CA in ggplot

  • in order to visualise data in ggplot, it has to be in dataframe format
  • so the first step will be extracting coordinates from the CA result
  • in this case we will extract them from first two dimensions only
x1 <- ca_vysledek$rowcoord[,1]
x2 <- ca_vysledek$rowcoord[,2]
y1 <- ca_vysledek$colcoord[,1]
y2 <- ca_vysledek$colcoord[,2]
  • x1 and x2 represent coordinates of row values (graves), y1 and y2 represent column values (artefacts)

  • quick check:

plot(x1,x2)
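
One detail is worth checking in the package documentation: as far as the ca documentation describes it, rowcoord and colcoord hold standard coordinates, while plot() on a CA result draws a symmetric map in principal coordinates by default, i.e. standard coordinates multiplied by the singular values in sv. The scale of plot(x1, x2) therefore need not match plot(ca_vysledek). A minimal sketch of the rescaling, with the values copied from the printed result above:

```r
# first two singular values and row coordinates copied from the output above
sv <- c(0.8168, 0.2643)
rowcoord <- cbind(
  Dim1 = c(1.037358, 0.783658, -0.856640, 1.225783, -0.979198, -1.083841),
  Dim2 = c(0.450424, -0.963242, 0.350276, 0.134612, 1.561622, -1.635249)
)

# multiply every column by its singular value
principal <- sweep(rowcoord, 2, sv, "*")
round(principal, 3)
```

The relative positions of the points along each axis stay the same; only the stretch of the axes changes.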

Creating dataframes

  • now we need to create dataframes for both row values and column values

  • row values (graves):

    • in this step, you can combine various variables describing your objects with the row coordinates. We have added “ID” and “datace”
# combine the grave attributes with the row coordinates in one step
# (assumes pohrebiste still contains the ID column from the exercise)
coord_datace <- data.frame(
  Datace = pohrebiste$datace,
  X1 = x1,
  X2 = x2,
  ID = pohrebiste$ID
)
head(coord_datace, 4)
  Datace         X1         X2 ID
1 ne.lin  1.0373584  0.4504244  1
2 ne.lin  0.7836584 -0.9632417  2
3 en.zvo -0.8566397  0.3502757  3
4 ne.lin  1.2257826  0.1346118  4
  • column values (artefacts):
artefakty <- data.frame(colnames(pohrebiste[,2:8]))
coord_artefakty <- data.frame(cbind(y1, y2))
coord_artefakty$artefakty <- rownames(coord_artefakty)
colnames(coord_artefakty) <- c("Y1", "Y2","Artefakty")
rownames(coord_artefakty) <- c(1:7)
head(coord_artefakty, 4)
          Y1         Y2 Artefakty
1 -0.2242544  0.4891476    nadoby
2 -0.9892557 -1.2598313      sipy
3 -1.1773081  1.6926442  cu_dycka
4  1.2974752  0.2971739   spondyl
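
The two blocks above repeat the same steps, so they could be wrapped in a small function. coords_to_df() below is a hypothetical helper written for this sketch, not part of the ca package, and the coordinate matrix is made up so that the example runs on its own:

```r
# hypothetical helper: first two dimensions of a coordinate matrix as a dataframe
coords_to_df <- function(coords, prefix, labels = rownames(coords)) {
  df <- data.frame(coords[, 1:2, drop = FALSE])
  colnames(df) <- paste0(prefix, 1:2)
  # fall back to running numbers when the matrix has no row names
  if (is.null(labels)) labels <- seq_len(nrow(df))
  df$label <- labels
  rownames(df) <- NULL
  df
}

# made-up coordinates standing in for ca_vysledek$colcoord
colcoord <- matrix(
  c(-0.224, -0.989, 0.489, -1.260, 0.1, 0.2),
  nrow = 2,
  dimnames = list(c("nadoby", "sipy"), c("Dim1", "Dim2", "Dim3"))
)
artefakty_df <- coords_to_df(colcoord, prefix = "Y")
artefakty_df
```

With the real result, the calls would presumably be coords_to_df(ca_vysledek$colcoord, "Y") and coords_to_df(ca_vysledek$rowcoord, "X").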

A very simple ggplot

  • let’s make a quick ggplot
library(ggplot2)
ggplot()+
  geom_point(data = coord_datace, aes(X1, X2, color = Datace))+
  geom_point(data = coord_artefakty, aes(Y1, Y2), shape = 17)
  • notice there are 2 types of values in 2 dataframes, so you need to call geom_point() twice
  • in ggplot we can also easily distinguish graves of different dating by colour

Advanced ggplot

  • now you can play more with adjusting the plot

Solution

  • you can add text to your ggplot with geom_text()
  • you can also rename plot labels without renaming variables in your original dataframe. To do so, create a vector with the label names and pass it to geom_text(..., aes(label = your_vector))
  • for more tips on adjusting text, see ggplot2.tidyverse.org
labels_artefacts <- c("nádoba", "hrot šípu", "mědená dýčka", "spondylus", "kopytovitý klin", "nátepní destička", "zrnotěrka")
ggplot()+
  geom_vline(xintercept = 0, color="gray80", linewidth = 0.75, linetype = "dashed")+
  geom_hline(yintercept = 0, color="gray80", linewidth = 0.75, linetype = "dashed")+
  geom_point(data = coord_datace, aes(X1, X2, colour = Datace), size = 2)+
  geom_point(data = coord_artefakty, aes(Y1, Y2), shape = 17, size = 2)+
  geom_text(data = coord_artefakty, aes(label = labels_artefacts, x=Y1, y=Y2), vjust = -1.5)+
  geom_text(data = coord_datace, aes(label = ID, x=X1, y=X2), vjust = 1.75)+
  xlim(-1.5, 1.6)+
  ylim(-1.75, 2)+
  labs(x="Dimenze 1 - 82,3 %", y="Dimenze 2 - 8,6 %",
       title = "Korespondenční analýza fiktivního pohřebiště")+
  theme_light()
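
The percentages in the axis labels are typed by hand in the plot above. They could instead be computed from the principal inertias, so that the labels follow the data. A minimal sketch in base R, with the inertias copied from the printed CA result (the name popisky_os is made up here):

```r
# principal inertias copied from the printed CA result
inertias <- c(0.667126, 0.069873, 0.054181, 0.018769, 0.001128)
pct <- 100 * inertias / sum(inertias)

# formatC() with decimal.mark = "," writes a decimal comma
popisky_os <- paste0(
  "Dimenze ", 1:2, " - ",
  formatC(pct[1:2], format = "f", digits = 1, decimal.mark = ","),
  " %"
)
popisky_os
```

The two strings can then be passed to the plot as labs(x = popisky_os[1], y = popisky_os[2]).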