Descriptive statistics

Descriptive Statistics

Brainstorming

Run this code:

ggplot(data = df_dartpoints, mapping = aes(x=Length))+
  geom_histogram()

  • what do you think this plot shows?
  • which lengths are more represented?
  • do you see any extreme values?

Characterizing centrality

Mean (aritmetický průměr)

  • mean, or average, is the sum of all values divided by the number of values
  • we already know function mean() for arithmetic mean:
mean(df_dartpoints$Length)
[1] 49.33077

Median (medián)

  • medians is the value separating data into lower half and an upper half, i.e. 50 % of the values are less than or equal to the median and 50 % are greater than or equal to the median

  • median is more robust and minimizes influence of outliers

  • use function median()

median(df_dartpoints$Length)
[1] 47.1

What are outliers? (odlehlé hodnoty)

  • Outliers are data points that significantly differ from other observations.
  • May indicate a measurement error, an exceptional observation, etc.

Characterizing dispersion and/or spread

Range (rozpětí)

max(), min() or range()

max(df_dartpoints$Length)
[1] 109.5
range(df_dartpoints$Length)
[1]  30.6 109.5

Variance and Standard deviation (rozptyl a směrodatná odchylka)

sd()

sd(df_dartpoints$Length)
[1] 12.73619
  • perhaps you can argue that one value does not tell us too much about our dataset, we will touch this problem soon

Interquartile range (midspread, IQR, kvantil, mezikvartilové rozpětí)

IQR()
- Robust, minimizes influence of outliers.

Summary

  • centrality and dispersion explained visually:

Various plots of a normal distribution

Exercise

Task

Use the dataset dartpoints.csv and:

  1. count mean and median weight, how do they differ?

  2. what is the range of the weights?

calculate 3) the mean value and 4) the standard deviation of length, BUT do it for each type (variable “Name”) of the dartpoint (use the function aggregate())

  1. find out which type (“Name”) of dartpoints have the highest standard deviations and try to explain what does it say about this type of dartpoints

Solution

mean(df_dartpoints$Weight)
[1] 7.642857
median(df_dartpoints$Weight)
[1] 6.8
range(df_dartpoints$Weight)
[1]  2.3 28.8
aggregate(Length ~ Name, data = df_dartpoints, FUN = mean)
        Name   Length
1       Darl 39.75357
2      Ensor 42.74000
3 Pedernales 57.88125
4     Travis 51.38182
5      Wells 53.12000
aggregate(Length ~ Name, data = df_dartpoints, FUN = sd)
        Name    Length
1       Darl  6.175668
2      Ensor  5.786997
3 Pedernales 14.127197
4     Travis  9.904223
5      Wells  7.943803