# First run this
install.packages("pacman")
Getting to know your Data with R
Goal
Practice basic R commands/methods for descriptive data analysis. If you are already familiar with some of the commands/methods, practice the ones new to you.
Note: copying and pasting early in learning will not produce the results you are looking for, and will catch up to you eventually.
Submission
Please submit .r, .rmd, or .qmd files ONLY.
Additional resources
I wrote a Quarto book on Exploratory Data Analysis in R using the dlookr package. I won’t be covering this package in depth here, but you can use the functions from my book below as well.
Installing required packages
library(pacman)
p_load(dlookr,
       DMwR2, # Data Mining with R functions
       GGally, # Pair-wise plots using ggplot2
       Hmisc, # Data analysis
       palmerpenguins, # Alternative to the Iris dataset
       tidyverse) # Data wrangling, manipulation, visualization
Loading data
data(algae, package = "DMwR2")
algae |> glimpse()
Rows: 200
Columns: 18
$ season <fct> winter, spring, autumn, spring, autumn, winter, summer, autumn,…
$ size <fct> small, small, small, small, small, small, small, small, small, …
$ speed <fct> medium, medium, medium, medium, medium, high, high, high, mediu…
$ mxPH <dbl> 8.00, 8.35, 8.10, 8.07, 8.06, 8.25, 8.15, 8.05, 8.70, 7.93, 7.7…
$ mnO2 <dbl> 9.8, 8.0, 11.4, 4.8, 9.0, 13.1, 10.3, 10.6, 3.4, 9.9, 10.2, 11.…
$ Cl <dbl> 60.800, 57.750, 40.020, 77.364, 55.350, 65.750, 73.250, 59.067,…
$ NO3 <dbl> 6.238, 1.288, 5.330, 2.302, 10.416, 9.248, 1.535, 4.990, 0.886,…
$ NH4 <dbl> 578.000, 370.000, 346.667, 98.182, 233.700, 430.000, 110.000, 2…
$ oPO4 <dbl> 105.000, 428.750, 125.667, 61.182, 58.222, 18.250, 61.250, 44.6…
$ PO4 <dbl> 170.000, 558.750, 187.057, 138.700, 97.580, 56.667, 111.750, 77…
$ Chla <dbl> 50.000, 1.300, 15.600, 1.400, 10.500, 28.400, 3.200, 6.900, 5.5…
$ a1 <dbl> 0.0, 1.4, 3.3, 3.1, 9.2, 15.1, 2.4, 18.2, 25.4, 17.0, 16.6, 32.…
$ a2 <dbl> 0.0, 7.6, 53.6, 41.0, 2.9, 14.6, 1.2, 1.6, 5.4, 0.0, 0.0, 0.0, …
$ a3 <dbl> 0.0, 4.8, 1.9, 18.9, 7.5, 1.4, 3.2, 0.0, 2.5, 0.0, 0.0, 0.0, 2.…
$ a4 <dbl> 0.0, 1.9, 0.0, 0.0, 0.0, 0.0, 3.9, 0.0, 0.0, 2.9, 0.0, 0.0, 0.0…
$ a5 <dbl> 34.2, 6.7, 0.0, 1.4, 7.5, 22.5, 5.8, 5.5, 0.0, 0.0, 1.2, 0.0, 1…
$ a6 <dbl> 8.3, 0.0, 0.0, 0.0, 4.1, 12.6, 6.8, 8.7, 0.0, 0.0, 0.0, 0.0, 0.…
$ a7 <dbl> 0.0, 2.1, 9.7, 1.4, 1.0, 2.9, 0.0, 0.0, 0.0, 1.7, 6.0, 1.5, 2.1…
The |> is the base R pipe (available since R 4.1.0), as opposed to the magrittr pipe %>%. The |> pipe works with any function and needs no package, while the %>% pipe comes from magrittr and is most commonly used with the tidyverse.
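As a minimal sketch of what the pipe does (the vector x is a made-up toy example, not part of the algae data, and R 4.1.0 or later is assumed): x |> f() is just another way of writing f(x), so a chain of pipes reads left to right instead of inside out.

```r
# Toy vector (hypothetical values, for illustration only)
x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Same computation with the base R pipe: read left to right
piped <- x |> sqrt() |> sum()

# Both forms give the same result
identical(nested, piped)
```

Every `object |> function()` call in the rest of this page can be rewritten as `function(object)` if you prefer.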
Central tendency: mean, median, mode
Mean
algae$a1 |>
  mean()
[1] 16.9235
Median
algae$a1 |>
  median()
[1] 6.95
Mode
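Base R doesn’t have a function for the statistical mode, so we create a simple one, which also illustrates how to write a function.
(This method works only for unimodal data; if several values tie for the highest count, it returns only the first one.)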
Mode <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)
  return(ux[which.max(tabulate(match(x, ux)))])
}
algae$a2 |> Mode()
[1] 0
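If the data may have more than one mode, one possible variation is to return every value tied for the highest frequency. The sketch below is an assumption-laden illustration: all_modes is a hypothetical helper name, and the inputs are toy vectors rather than the algae data.

```r
# Return every value tied for the highest frequency.
# all_modes is a hypothetical helper, not part of any package.
all_modes <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)                      # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))     # how often each distinct value occurs
  ux[counts == max(counts)]            # keep all values with the top count
}

bimodal  <- all_modes(c(1, 1, 2, 2, 3))   # two modes: 1 and 2
unimodal <- all_modes(c("a", "b", "b"))   # a single mode: "b"
```

For unimodal input this agrees with Mode() above; for ties it returns a vector with more than one element, so downstream code should not assume a single value.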
DMwR2’s centralValue() function:
returns the median for numerical variables, or the mode for nominal variables.
# Numerical variable
algae$a1 |> centralValue()
[1] 6.95
# Nominal variable
algae$speed |> centralValue()
[1] "high"
Statistics of spread (variation)
Variance
algae$a1 |> var()
[1] 455.7532
Standard deviation
algae$a1 |> sd()
[1] 21.34838
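To make the link between the two measures concrete, here is a small sketch on a toy vector (hypothetical values, not the algae data). It computes the sample variance by hand with the n - 1 denominator and takes its square root, then compares both with var() and sd().

```r
# Toy values (hypothetical, for illustration only)
x <- c(2, 4, 4, 5, 7, 9)

n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

# Compare with the built-in functions
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
```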
Range
Note that this gives you both maximum and minimum values.
algae$a1 |> range()
[1] 0.0 89.8
Maximum value
algae$a1 |> max()
[1] 89.8
Minimum value
algae$a1 |> min()
[1] 0
Interquartile range
3rd quartile (75%) - 1st quartile (25%)
algae$a1 |> IQR()
[1] 23.3
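The definition above can be checked directly. This is a minimal sketch on a toy vector (hypothetical values); it assumes the default quantile algorithm, which is what IQR() uses unless told otherwise.

```r
# Toy values (hypothetical, for illustration only)
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# 1st and 3rd quartiles
q <- quantile(x, probs = c(0.25, 0.75))

# Interquartile range by hand: 3rd quartile minus 1st quartile
manual_iqr <- unname(q[2] - q[1])

# Matches the built-in function
all.equal(manual_iqr, IQR(x))
```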
Quantiles
algae$a1 |> quantile()
0% 25% 50% 75% 100%
0.00 1.50 6.95 24.80 89.80
Requesting specific quantiles:
algae$a1 |> quantile(probs = c(0.2, 0.8))
20% 80%
1.20 32.18
Missing values
library(purrr)
# Compute the total number of NA values in the dataset
nas <- algae %>%
  purrr::map_dbl(~sum(is.na(.))) %>%
  sum()
cat("The dataset contains ", nas, "NA values. \n")
The dataset contains 33 NA values.
# Compute the number of incomplete rows in the dataset
incomplete_rows <- algae %>%
  filter(!complete.cases(.)) %>%
  nrow()
cat("The dataset contains ", incomplete_rows, "(out of ", nrow(algae),") incomplete rows. \n")
The dataset contains 16 (out of 200 ) incomplete rows.
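The same counts can also be obtained with base R alone. The sketch below uses a small made-up data frame (df and its columns are hypothetical) so it runs without any packages; the same three calls should work on algae.

```r
# Toy data frame with some missing values (hypothetical, for illustration only)
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  c = c(NA, NA, 1.5, 2.5)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(df))

# Total number of NA values in the whole data frame
total_nas <- sum(is.na(df))

# Number of rows with at least one NA
incomplete_rows <- sum(!complete.cases(df))
```

Note that complete.cases() is applied to the whole data frame, so it returns one TRUE/FALSE per row rather than one per cell.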
Summaries of a dataset
Base R’s summary()
algae |> summary()
season size speed mxPH mnO2
autumn:40 large :45 high :84 Min. :5.600 Min. : 1.500
spring:53 medium:84 low :33 1st Qu.:7.700 1st Qu.: 7.725
summer:45 small :71 medium:83 Median :8.060 Median : 9.800
winter:62 Mean :8.012 Mean : 9.118
3rd Qu.:8.400 3rd Qu.:10.800
Max. :9.700 Max. :13.400
NA's :1 NA's :2
Cl NO3 NH4 oPO4
Min. : 0.222 Min. : 0.050 Min. : 5.00 Min. : 1.00
1st Qu.: 10.981 1st Qu.: 1.296 1st Qu.: 38.33 1st Qu.: 15.70
Median : 32.730 Median : 2.675 Median : 103.17 Median : 40.15
Mean : 43.636 Mean : 3.282 Mean : 501.30 Mean : 73.59
3rd Qu.: 57.824 3rd Qu.: 4.446 3rd Qu.: 226.95 3rd Qu.: 99.33
Max. :391.500 Max. :45.650 Max. :24064.00 Max. :564.60
NA's :10 NA's :2 NA's :2 NA's :2
PO4 Chla a1 a2
Min. : 1.00 Min. : 0.200 Min. : 0.00 Min. : 0.000
1st Qu.: 41.38 1st Qu.: 2.000 1st Qu.: 1.50 1st Qu.: 0.000
Median :103.29 Median : 5.475 Median : 6.95 Median : 3.000
Mean :137.88 Mean : 13.971 Mean :16.92 Mean : 7.458
3rd Qu.:213.75 3rd Qu.: 18.308 3rd Qu.:24.80 3rd Qu.:11.375
Max. :771.60 Max. :110.456 Max. :89.80 Max. :72.600
NA's :2 NA's :12
a3 a4 a5 a6
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 1.550 Median : 0.000 Median : 1.900 Median : 0.000
Mean : 4.309 Mean : 1.992 Mean : 5.064 Mean : 5.964
3rd Qu.: 4.925 3rd Qu.: 2.400 3rd Qu.: 7.500 3rd Qu.: 6.925
Max. :42.800 Max. :44.600 Max. :44.400 Max. :77.600
a7
Min. : 0.000
1st Qu.: 0.000
Median : 1.000
Mean : 2.495
3rd Qu.: 2.400
Max. :31.600
Hmisc’s describe()
data("penguins")
penguins |> Hmisc::describe()
penguins
8 Variables 344 Observations
--------------------------------------------------------------------------------
species
n missing distinct
344 0 3
Value Adelie Chinstrap Gentoo
Frequency 152 68 124
Proportion 0.442 0.198 0.360
--------------------------------------------------------------------------------
island
n missing distinct
344 0 3
Value Biscoe Dream Torgersen
Frequency 168 124 52
Proportion 0.488 0.360 0.151
--------------------------------------------------------------------------------
bill_length_mm
n missing distinct Info Mean Gmd .05 .10
342 2 164 1 43.92 6.274 35.70 36.60
.25 .50 .75 .90 .95
39.23 44.45 48.50 50.80 51.99
lowest : 32.1 33.1 33.5 34 34.1, highest: 55.1 55.8 55.9 58 59.6
--------------------------------------------------------------------------------
bill_depth_mm
n missing distinct Info Mean Gmd .05 .10
342 2 80 1 17.15 2.267 13.9 14.3
.25 .50 .75 .90 .95
15.6 17.3 18.7 19.5 20.0
lowest : 13.1 13.2 13.3 13.4 13.5, highest: 20.7 20.8 21.1 21.2 21.5
--------------------------------------------------------------------------------
flipper_length_mm
n missing distinct Info Mean Gmd .05 .10
342 2 55 0.999 200.9 16.03 181.0 185.0
.25 .50 .75 .90 .95
190.0 197.0 213.0 220.9 225.0
lowest : 172 174 176 178 179, highest: 226 228 229 230 231
--------------------------------------------------------------------------------
body_mass_g
n missing distinct Info Mean Gmd .05 .10
342 2 94 1 4202 911.8 3150 3300
.25 .50 .75 .90 .95
3550 4050 4750 5400 5650
lowest : 2700 2850 2900 2925 2975, highest: 5850 5950 6000 6050 6300
--------------------------------------------------------------------------------
sex
n missing distinct
333 11 2
Value female male
Frequency 165 168
Proportion 0.495 0.505
--------------------------------------------------------------------------------
year
n missing distinct Info Mean Gmd
344 0 3 0.888 2008 0.8919
Value 2007 2008 2009
Frequency 110 114 120
Proportion 0.320 0.331 0.349
For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------
Gmd (Gini’s mean difference) is the mean absolute difference between all pairs of observations. It is a robust measure of dispersion, especially for non-normally distributed data.
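To illustrate that definition, here is a minimal base R sketch on a toy vector (hypothetical values). It relies on dist(), which for a single numeric vector returns the absolute difference for every pair of observations; this is a hand computation for intuition, not the code Hmisc uses internally.

```r
# Toy values (hypothetical, for illustration only)
x <- c(1, 2, 4, 7)

# Absolute difference |x_i - x_j| for every pair i < j
pairwise_diffs <- as.vector(dist(x))

# Gini's mean difference: the average of those pairwise differences
gmd <- mean(pairwise_diffs)
gmd
```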
dlookr’s describe()
penguins |> dlookr::describe()
# A tibble: 5 × 26
described_variables n na mean sd se_mean IQR skewness
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bill_length_mm 342 2 43.9 5.46 0.295 9.27 0.0531
2 bill_depth_mm 342 2 17.2 1.97 0.107 3.1 -0.143
3 flipper_length_mm 342 2 201. 14.1 0.760 23 0.346
4 body_mass_g 342 2 4202. 802. 43.4 1200 0.470
5 year 344 0 2008. 0.818 0.0441 2 -0.0537
# ℹ 18 more variables: kurtosis <dbl>, p00 <dbl>, p01 <dbl>, p05 <dbl>,
# p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>,
# p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>,
# p99 <dbl>, p100 <dbl>
Summaries on a subset of data
Use dplyr’s summarise() and summarise_all(), alone or combined with select() and group_by(), to create summaries on a subset of the data. Note: summarise() and summarize() are synonyms.
algae |>
  summarise(avgNO3 = mean(NO3, na.rm=TRUE),
            medA1 = median(a1))
# A tibble: 1 × 2
avgNO3 medA1
<dbl> <dbl>
1 3.28 6.95
summarise_all() can be used to apply any function that produces a scalar value to every column of a data frame.
algae |>
  select(mxPH:Cl) |>
  summarise_all(list(mean, median), na.rm = TRUE)
# A tibble: 1 × 6
mxPH_fn1 mnO2_fn1 Cl_fn1 mxPH_fn2 mnO2_fn2 Cl_fn2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8.01 9.12 43.6 8.06 9.8 32.7
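The _fn1 and _fn2 suffixes appear because the list of functions is unnamed. In current dplyr, the scoped verbs such as summarise_all() are superseded by across(); the sketch below shows one way to write the same kind of summary with named functions. It assumes dplyr 1.0.0 or later and R 4.1.0 or later (for the \(x) lambda syntax), and df is a made-up toy tibble rather than the algae data.

```r
library(dplyr)

# Toy tibble with missing values (hypothetical, for illustration only)
df <- tibble(
  u = c(1, 2, NA, 4),
  v = c(10, 20, 30, NA)
)

# Apply a named list of functions to every column;
# the list names become the column suffixes
result <- df |>
  summarise(across(
    everything(),
    list(
      mean   = \(x) mean(x, na.rm = TRUE),
      median = \(x) median(x, na.rm = TRUE)
    )
  ))

names(result)
```

With algae, replacing everything() by mxPH:Cl would restrict the summary to the same columns as the select() call above.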
algae |>
  select(a1:a7) |>
  summarise_all(var)
# A tibble: 1 × 7
a1 a2 a3 a4 a5 a6 a7
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 456. 122. 48.3 19.5 56.1 136. 26.6
algae |>
  select(a1:a7) |>
  summarise_all(c("min", "max"))
# A tibble: 1 × 14
a1_min a2_min a3_min a4_min a5_min a6_min a7_min a1_max a2_max a3_max a4_max
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 0 0 0 0 89.8 72.6 42.8 44.6
# ℹ 3 more variables: a5_max <dbl>, a6_max <dbl>, a7_max <dbl>
Use summarise() with group_by()
algae |>
  group_by(season, size) |>
  summarise(nObs = n(), mA7 = median(a7))
`summarise()` has grouped output by 'season'. You can override using the
`.groups` argument.
# A tibble: 12 × 4
# Groups: season [4]
season size nObs mA7
<fct> <fct> <int> <dbl>
1 autumn large 11 0
2 autumn medium 16 1.05
3 autumn small 13 0
4 spring large 12 1.95
5 spring medium 21 1
6 spring small 20 0
7 summer large 10 0
8 summer medium 21 1
9 summer small 14 1.45
10 winter large 12 0
11 winter medium 26 1.4
12 winter small 24 0
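The message about grouped output appears because summarise() removes only the last grouping variable by default, so the result above is still grouped by season. A minimal sketch of making that choice explicit with the .groups argument follows; df, g1, g2, and value are hypothetical toy names, and dplyr 1.0.0 or later is assumed.

```r
library(dplyr)

# Toy tibble with two grouping columns (hypothetical, for illustration only)
df <- tibble(
  g1    = c("a", "a", "b", "b"),
  g2    = c("x", "y", "x", "x"),
  value = c(1, 2, 3, 5)
)

# .groups = "drop" returns an ungrouped result and silences the message
out <- df |>
  group_by(g1, g2) |>
  summarise(n_obs = n(), med = median(value), .groups = "drop")

is_grouped_df(out)
```

Dropping the groups is usually the safer default when the summary is the end of the pipeline, since later verbs will then operate on the whole table.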
penguins |>
  group_by(species) |>
  summarise(var = var(bill_length_mm, na.rm = TRUE))
# A tibble: 3 × 2
species var
<fct> <dbl>
1 Adelie 7.09
2 Chinstrap 11.2
3 Gentoo 9.50
Aggregating data
reframe() can be helpful for summary functions that don’t return a scalar, such as quantile().
penguins |>
  group_by(species) |>
  reframe(var = quantile(bill_length_mm, na.rm = TRUE))
# A tibble: 15 × 2
species var
<fct> <dbl>
1 Adelie 32.1
2 Adelie 36.8
3 Adelie 38.8
4 Adelie 40.8
5 Adelie 46
6 Chinstrap 40.9
7 Chinstrap 46.3
8 Chinstrap 49.6
9 Chinstrap 51.1
10 Chinstrap 58
11 Gentoo 40.9
12 Gentoo 45.3
13 Gentoo 47.3
14 Gentoo 49.6
15 Gentoo 59.6
summarise() expects each summary function to return a single value per group, but quantile() returns a vector; reframe() accepts results of any length.
Note: returning more (or fewer) than one row per group from summarise() was deprecated in dplyr 1.1.0; reframe() should be used instead.
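One drawback of the output above is that nothing says which row is which quantile. A possible refinement, sketched here on a made-up tibble (df, group, value, and probs are hypothetical names), is to return the probabilities as a second column. It assumes dplyr 1.1.0 or later, where reframe() was introduced.

```r
library(dplyr)

# Toy tibble with two groups (hypothetical, for illustration only)
df <- tibble(
  group = rep(c("a", "b"), each = 5),
  value = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

# Quantiles to compute
probs <- c(0.25, 0.5, 0.75)

# One row per group and quantile, with the probability as a label column
out <- df |>
  group_by(group) |>
  reframe(
    prob = probs,
    q    = unname(quantile(value, probs = probs))
  )

out
```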
I recommend just using dlookr…
penguins |>
  group_by(species) |>
  dlookr::describe(bill_length_mm)
# A tibble: 3 × 27
described_variables species n na mean sd se_mean IQR skewness
<chr> <fct> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bill_length_mm Adelie 151 1 38.8 2.66 0.217 4 0.162
2 bill_length_mm Chinstrap 68 0 48.8 3.34 0.405 4.73 -0.0906
3 bill_length_mm Gentoo 123 1 47.5 3.08 0.278 4.25 0.651
# ℹ 18 more variables: kurtosis <dbl>, p00 <dbl>, p01 <dbl>, p05 <dbl>,
# p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>,
# p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>,
# p99 <dbl>, p100 <dbl>
Grouping functions
[Advanced]
Getting to know your dataset:
List data types of the attributes in your tidy dataset
Check for skewness in data distribution in the attributes
Check for correlations among attributes
Examine the extent of missing data. What would be the best way to deal with the missing data in this case?