If you are seeing error messages, please review them and try to understand them. Doing so is good debugging practice.
If you are seeing warning messages, please review them. Warning messages are typically not fatal; they often point to something obsolete or deprecated.
Universal things every R user should know:
Find which version of R you are using
R.version
_
platform aarch64-apple-darwin20
arch aarch64
os darwin20
system aarch64, darwin20
status
major 4
minor 3.1
year 2023
month 06
day 16
svn rev 84548
language R
version.string R version 4.3.1 (2023-06-16)
nickname Beagle Scouts
Packages
R has many tools wrapped in packages, and we often use those tools in our work.
To use a tool, you need to install it.
The package used in Data Mining with R is DMwR2
On Windows 11, this should run without problems.
On Ubuntu 20.04, you might see errors. One of them is fixed by running sudo apt-get install libcurl4-openssl-dev in your terminal.
install.packages("DMwR2")
To see what is in a package, use help(). If you do not see documentation, there might be errors.
help(package="DMwR2")
The above step takes some time and requires an internet connection.
Now the packages are installed on your computer. To use a function in the package, either of the two ways works:
(1) when you need to use the function frequently, load the package into memory for your current session with the library() function (one RStudio window is one session; if you have multiple RStudio windows open, they are different sessions)
(2) when you only need to use the function once or twice, you can call the function/dataset through the notation package::functionname
library(DMwR2)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Now you can use any function or dataset provided in DMwR2 by referencing its name directly.
data(algae) # load algae dataset
algae
# A tibble: 200 × 18
season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla a1
<fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 winter small medium 8 9.8 60.8 6.24 578 105 170 50 0
2 spring small medium 8.35 8 57.8 1.29 370 429. 559. 1.3 1.4
3 autumn small medium 8.1 11.4 40.0 5.33 347. 126. 187. 15.6 3.3
4 spring small medium 8.07 4.8 77.4 2.30 98.2 61.2 139. 1.4 3.1
5 autumn small medium 8.06 9 55.4 10.4 234. 58.2 97.6 10.5 9.2
6 winter small high 8.25 13.1 65.8 9.25 430 18.2 56.7 28.4 15.1
7 summer small high 8.15 10.3 73.2 1.54 110 61.2 112. 3.2 2.4
8 autumn small high 8.05 10.6 59.1 4.99 206. 44.7 77.4 6.9 18.2
9 winter small medium 8.7 3.4 22.0 0.886 103. 36.3 71 5.54 25.4
10 winter small high 7.93 9.9 8 1.39 5.8 27.2 46.6 0.8 17
# ℹ 190 more rows
# ℹ 6 more variables: a2 <dbl>, a3 <dbl>, a4 <dbl>, a5 <dbl>, a6 <dbl>,
# a7 <dbl>
manyNAs(algae) # find rows with too many NAs
[1] 62 199
library() without arguments:
It will provide you the list of packages installed in different libraries on your computer.
Think of library() as a library of all installed packages. library(packagename) checks a package out.
.packages() shows all checked out packages in the current session.
If you loaded a package, say dbplyr, by mistake, you can detach it from your session using detach
install.packages("dbplyr") # assuming you have dbplyr installed before
# now you try to check out dplyr, but typed dbplyr by accident
library(dbplyr)
(.packages())
# you realized the mistake, and you don't want this package to be live in this session due to potential conflicts
# you can detach the package
detach("package:dbplyr", unload=TRUE)
(.packages())
library(dplyr) # load wanted library
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Another way to see what packages have been installed in your computer:
installed.packages()
Package LibPath
alphavantager "alphavantager" "/Users/gchism/Library/R/arm64/4.3/library"
anytime "anytime" "/Users/gchism/Library/R/arm64/4.3/library"
askpass "askpass" "/Users/gchism/Library/R/arm64/4.3/library"
Version Priority Depends
alphavantager "0.1.3" NA "R (>= 3.3.0)"
anytime "0.3.9" NA "R (>= 3.2.0)"
askpass "1.2.0" NA NA
Imports
alphavantager "dplyr (>= 0.7.0), glue (>= 1.1.1), httr (>= 1.2.1), jsonlite\n(>= 1.5), purrr (>= 0.2.2.2), readr (>= 1.1.1), stringr (>=\n1.2.0), tibble (>= 1.3.3), tidyr (>= 0.6.3), timetk (>=\n0.1.1.1)"
anytime "Rcpp (>= 0.12.9)"
askpass "sys (>= 2.1)"
LinkingTo Suggests Enhances
alphavantager NA "testthat, knitr" NA
anytime "Rcpp (>= 0.12.9), BH" "tinytest (>= 1.0.0), gettz" NA
askpass NA "testthat" NA
License License_is_FOSS License_restricts_use
alphavantager "GPL (>= 3)" NA NA
anytime "GPL (>= 2)" NA NA
askpass "MIT + file LICENSE" NA NA
OS_type MD5sum NeedsCompilation Built
alphavantager NA NA "no" "4.3.0"
anytime NA NA "yes" "4.3.0"
askpass NA NA "yes" "4.3.0"
Find out if your installed packages have a newer version on CRAN:
old.packages()
Update all your installed packages to the newest version – this may take a long time:
update.packages()
Update all your installed packages WITHOUT having to confirm for each package (Note: as this could take a long time, you don’t have to practice this command. Do not worry too much if you see warning or failure messages)
update.packages(ask =FALSE)
To find out which namespace/package a function belongs to among your installed packages, type the function name without parentheses. For example, the function mean is in base R:
mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x118a25ca8>
<environment: namespace:base>
Find help on a function in an installed package, say mean(). If you use RStudio, the R documentation for mean() is displayed in the lower right pane of the window:
help(mean)
If two packages provide a function with the same name and you need to use both functions, use package::functionname to differentiate the functions.
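As a small sketch of the package::functionname notation, the following uses only packages that ship with R; the vector x is made up for illustration.

```r
x <- c(3, 1, 2)

# Call functions through their package name without attaching anything
stats::median(x)
base::mean(x)

# Ask R which namespace a function lives in
environmentName(environment(stats::median))
```

The same notation picks one of two same-named functions, for example stats::filter versus dplyr::filter once dplyr is installed.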
When you want to see if a package you need to use has already been made, search for it using some keywords:
RSiteSearch('neural networks')
# useful controls in RStudio
# Ctrl+1 Move focus to the Source Editor.
# Ctrl+2 Move focus to the Console.
# Ctrl+L Clear the Console.
# Esc Interrupt R.
Project and Session Management
Use Project to manage your R scripts and data.
In RStudio, File > New Project to create a new folder on your computer for your project.
Multiple scripts can be created and saved in the project folder, along with data used
File > Open Project to resume your work in the workspace.
Your project folder is your current working directory, where you can save your .R and .RData files.
But a .R file can exist outside a project /project folder.
Close Project closes the current project in RStudio but keeps the session (the RStudio interface stays up).
Quit Session closes the current RStudio window.
Typing long and complex commands in a console can be limiting.
You can type all the commands in a text file and save it, then use [1] source('path_to_mycode.R') to execute the series of commands or [2] open mycode.R in RStudio script tab and execute your commands from there using Run or Source button.
Run: run the code line by line
Source: run the entire script
You often need to save large data objects or functions for later use.
save.image() saves all objects to a .RData file in the current working directory, which you can load in a future session.
save.image()
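save.image() writes every object in the workspace. As a hedged sketch, save() and load() do the same for selected objects; the object name and the temporary file below are chosen only for illustration.

```r
scores <- c(78, 85, 99)                 # example object to keep

rdata_file <- tempfile(fileext = ".RData")
save(scores, file = rdata_file)         # write only the named object(s)

rm(scores)                              # simulate a fresh session
load(rdata_file)                        # restores 'scores' under its original name
scores
```

In a real project you would replace the temporary file with a path inside your project folder.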
Run getwd() and setwd() in RStudio Console to show the current working directory and to set working directory respectively.
getwd()
setwd("/home/gchism/Documents/523") # setwd using what you get from getwd()
getwd()
R Objects and Variables
Variables are references to storage locations in computer memory that hold some content (objects), ranging from a simple number to a complex model. Use the assignment operator <- to associate an object (e.g., the number 0.2) with a variable.
vat <- 0.2
Now see what vat holds:
vat
[1] 0.2
Use () to enclose a statement to have the returned values print directly:
(vat <-0.2)
[1] 0.2
More examples:
x <- 5
y <- vat * x
y
[1] 1
z <- (y/2)^2
y
[1] 1
z
[1] 0.25
All variables stay alive until you delete them or exit R without saving them. To list the variables currently alive, use ls() or objects():
ls()
[1] "algae" "algae.sols" "has_annotations" "test.algae"
[5] "vat" "x" "y" "z"
objects()
[1] "algae" "algae.sols" "has_annotations" "test.algae"
[5] "vat" "x" "y" "z"
Remove a variable to free memory space:
rm(vat)
R Functions
Functions are a special type of R object designed to carry out some operation. A function expects some input arguments and outputs the result of its operation. R has many functions already, the packages you load contain more functions you can use, and you can also create new functions.
Examples of R functions:
max(4, 5, 6, 12, -4)
[1] 12
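mean(4, 5, 6, 12, -4) # caution: only the first argument (4) is treated as data; use mean(c(4, 5, 6, 12, -4)) to average all five values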
[1] 4
max(sample(1:100, 30))
[1] 99
mean(sample(1:100, 30))
[1] 46.56667
Why does the same function with the same argument give different results above? Use help(sample) to find out what function sample does.
What do you expect?
set.seed(1) # the seed determines the starting point used in generating a sequence of pseudo random numbers
# set.seed() has global scope, meaning it affects all random number generators used/called in your program.
# there is a function to remove the seed: rm(.Random.seed, envir=.GlobalEnv)
rnorm(1) # give me one number from a normal distribution
[1] -0.6264538
rnorm(1)
[1] 0.1836433
set.seed(2)
rnorm(1)
[1] -0.8969145
rnorm(1)
[1] 0.1848492
We use set.seed() to make sure multiple runs of a program involving random samples give the same result, for debugging purposes.
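A minimal sketch of that reproducibility claim: resetting the same seed before the same call reproduces the same draws. The seed value 42 and the variable names are arbitrary choices for illustration.

```r
set.seed(42)
first_run <- sample(1:100, 5)

set.seed(42)                       # reset to the same starting point
second_run <- sample(1:100, 5)

identical(first_run, second_run)   # TRUE: same seed, same draws
```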
To create a new function, se (standard error of means), first test if se exists in our current environment.
exists("se")
[1] FALSE
No object named se exists, now create the function that computes the standard error of a sample:
se <- function(x) {
  variance <- var(x)
  n <- length(x)
  return(sqrt(variance/n))
}
Object se has been created:
exists("se")
[1] TRUE
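A side note: how is se different from sd? They are very different! sd describes the spread of the individual values, while se (sd divided by the square root of the sample size) describes the uncertainty of the sample mean.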
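To make the se versus sd comparison concrete, here is a small sketch. It redefines se so the block runs on its own, and the sample values are made up.

```r
se <- function(x) {
  sqrt(var(x) / length(x))
}

heights <- c(2, 4, 4, 4, 5, 5, 7, 9)    # made-up sample

sd(heights)                             # spread of the individual values
se(heights)                             # uncertainty of the sample mean
sd(heights) / sqrt(length(heights))     # same value as se(heights)
```

For any sample with more than one value, se is smaller than sd, and it shrinks further as the sample grows.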
Create another function with multiple arguments:
convMeters will convert meters to inches, feet, yards, and miles.
exists("convMeters")
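The definition of convMeters does not appear at this point in the text. The following is a hypothetical reconstruction, not necessarily the author's code: the parameter names val and to and the error message are assumptions, and the conversion factors are standard values that reproduce the outputs shown in the calls that follow.

```r
# Hypothetical reconstruction of convMeters; factors are meters-to-unit multipliers
convMeters <- function(val, to = "inch") {
  mult <- switch(to,
                 inch = 39.3701,
                 foot = 3.28084,
                 yard = 1.09361,
                 mile = 0.000621371,
                 NA)                     # any other unit falls through to NA
  if (is.na(mult)) stop("Unknown target unit of length.")
  val * mult
}
```

Giving to a default value in the function signature is what lets callers omit it.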
If the argument to is not provided, the default value 'inch' is used:
convMeters(40)
[1] 1574.804
Arguments for the function can be supplied in the order as in the function signature:
convMeters(56.2, "yard")
[1] 61.46088
Arguments can also be supplied in other orders if enough arguments are named for R to assign them unambiguously.
convMeters(to="yard", 56.2)
[1] 61.46088
Factors
Conceptually, factors are variables in R that take on a limited number of different values. A factor can be seen as a categorical (i.e., nominal) variable; its levels are the set of unique values the variable can take. Factors are different from character vectors.
To create a factor, use factor(). Factors are represented internally as numeric vectors. This factor has two levels, f and m:
g <- c('f', 'm', 'f', 'f', 'f', 'm', 'm', 'f')
g <- factor(g)
Factors are extremely useful for nominal values. A more compact way to create the same factor with levels f and m, followed by table() to illustrate marginal frequencies (marginal distributions):
g <- factor(c('f', 'm', 'f', 'f', 'f', 'm', 'm', 'f'))
table(g)
g
f m
5 3
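When the levels are known in advance, they can be passed explicitly through the levels argument of factor(). A minimal sketch, with made-up data in which one level never occurs:

```r
# All observed values are 'm', but the factor still knows about level 'f'
other_g <- factor(c('m', 'm', 'm'), levels = c('f', 'm'))

levels(other_g)
table(other_g)    # 'f' is counted as 0 rather than dropped
```

This is one way factors differ from character vectors: the set of possible values is stored with the data.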
Add an age factor to the table (table can have more than two factors):
a <- factor(c('adult', 'juvenile', 'adult', 'juvenile', 'adult', 'juvenile', 'juvenile', 'juvenile'))
table(a, g)
g
a f m
adult 3 0
juvenile 2 3
R assumes the values at the same index in the two factors are associated with the same entity. In our dataset, we have 3 female adults, 2 female juveniles, and 3 male juveniles.
What if factor a is not the same length as factor g?
a <- factor(c('adult', 'juvenile', 'adult', 'juvenile', 'adult', 'juvenile', 'juvenile'))
table(a, g)
Error in table(a, g): all arguments must have the same length
Bring the good a back and create a new table with factor g:
a <- factor(c('adult', 'juvenile', 'adult', 'juvenile', 'adult', 'juvenile', 'juvenile', 'juvenile'))
t <- table(a, g)
t
g
a f m
adult 3 0
juvenile 2 3
Find marginal frequencies for a factor:
margin.table(t, 1) # 1 refers to the first factor, a (age)
a
adult juvenile
3 5
margin.table(t, 2) # now find the marginal freq of the second factor g
g
f m
5 3
We can also find relative frequencies (proportions) with respect to each margin and the overall:
t
g
a f m
adult 3 0
juvenile 2 3
prop.table(t, 1) #use the margin generated for the 1st factor a
g
a f m
adult 1.0 0.0
juvenile 0.4 0.6
Adults are all female, and among the juveniles, 40% are female and 60% are male.
prop.table(t, 2)
g
a f m
adult 0.6 0.0
juvenile 0.4 1.0
prop.table(t) #overall
g
a f m
adult 0.375 0.000
juvenile 0.250 0.375
Show the same results in percentage:
prop.table(t) *100
g
a f m
adult 37.5 0.0
juvenile 25.0 37.5
R data structures
Vectors
The most basic data object is a vector. One single number is a vector with a single element. All elements in one vector must be of one base data type.
Create a vector:
v <- c(2, 5, 3, 4)
length(v)
[1] 4
Data type of elements in v:
mode(v)
[1] "numeric"
If you mix strings and numbers:
v <- c(2, 5, 3, 4, 'me')
mode(v)
[1] "character"
v
[1] "2" "5" "3" "4" "me"
See the difference? All values in v have now become character strings.
All vectors can contain a special value NA, often used to represent a missing value:
v <- c(2, 5, 3, 4, NA)
mode(v)
[1] "numeric"
v
[1] 2 5 3 4 NA
A boolean vector (TRUE, FALSE)
b <- c(TRUE, FALSE, NA, TRUE)
mode(b)
[1] "logical"
b
[1] TRUE FALSE NA TRUE
Elements in vectors are indexed starting with 1:
b[3]
[1] NA
To update a value at a specific index:
b[3] <- TRUE
b
[1] TRUE FALSE TRUE TRUE
Vectors are elastic; you can add values to any index position:
b[10] <- FALSE
b
[1] TRUE FALSE TRUE TRUE NA NA NA NA NA FALSE
Empty elements are filled with NA, as shown above
Create an empty vector:
e <- vector()
mode(e)
[1] "logical"
e <- c()
mode(e)
[1] "NULL"
length(e)
[1] 0
Use the elements of one vector to construct another vector:
b2 <- c(b[1], b[3], b[5])
b2
[1] TRUE TRUE NA
Vectorization performs an operation on each element of a vector. It is very powerful and used widely.
Find the square root of all elements in v:
sqrt(v)
[1] 1.414214 2.236068 1.732051 2.000000 NA
Vector arithmetic
v1 <- c(3, 6, 9)
v2 <- c(1, 4, 8)
v1+v2 # addition
[1] 4 10 17
v1*v2 # element-wise multiplication (not the dot product)
[1] 3 24 72
v1-v2 #subtraction
[1] 2 2 1
v1/v2 # division
[1] 3.000 1.500 1.125
Warning: arithmetic with vectors of different lengths is allowed in R. R uses the recycling rule to make the shorter vector the same length as the longer vector.
v3 <- c(1, 4)
v1+v3 # the recycling rule makes v3 [1, 4, 1]
Warning in v1 + v3: longer object length is not a multiple of shorter object
length
[1] 4 10 10
Remember, a single value is a vector too?
2*v1
[1] 6 12 18
Vector summary:
Elements are of same data type, elastic, vectorization, arithmetic operations and the recycling rule.
Use a vector to illustrate a “for” loop:
mysum <- function(x) {
  total <- 0 # running total; named total to avoid shadowing the built-in sum()
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  return(total)
}
(mysum(c(1, 2, 3)))
[1] 6
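As a hedged sketch of how the explicit loop relates to R's vectorized functions, the block below compares a loop with the built-in sum() and shows why seq_along() is a safer loop range than 1:length(x) when the vector may be empty. All names are chosen for illustration.

```r
values <- c(1, 2, 3)

# Explicit loop over the positions of the vector
loop_total <- 0
for (i in seq_along(values)) {
  loop_total <- loop_total + values[i]
}

loop_total == sum(values)   # the built-in sum() gives the same total

# For an empty vector the two loop ranges differ
seq_along(numeric(0))       # integer(0): a loop over this runs zero times
1:length(numeric(0))        # 1 0: a loop over this would run twice
```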
PART II
Easy ways to generate vectors
These are useful when you need to generate some data with known distribution to test certain functions.
Use () to print the result of a statement in the console:
(x <-1:10)
[1] 1 2 3 4 5 6 7 8 9 10
(x <-10:1)
[1] 10 9 8 7 6 5 4 3 2 1
Note that the precedence of the operator : is higher than that of the arithmetic operators. Observe the difference below: why do they give different results?
10:15-1
[1] 9 10 11 12 13 14
10:(15-1)
[1] 10 11 12 13 14
Use seq() to generate a sequence of real numbers:
(seq(from=1, to=5, length=4)) # 4 values between 1 and 5 inclusive, even intervals/steps
Use gl() to generate factor levels:
gl(2, 5, labels=c('female', 'male')) # two levels, each level repeated 5 times
[1] female female female female female male male male male male
Levels: female male
# first argument 2 says two levels.
# second argument 1 says repeat once
# third argument 20 says generate 20 values
gl(2, 1, 20, labels=c('female', 'male')) # 10 alternating female and male pairs, a total of 20 values.
[1] female male female male female male female male female male
[11] female male female male female male female male female male
Levels: female male
Use factor() to convert number sequence to factor level labels. This is very useful for labeling a dataset:
n <- rep(1:2, each=3)
(n <- factor(n, levels = c(1, 2), labels = c('female', 'male')))
[1] female female female male male male
Levels: female male
n
[1] female female female male male male
Levels: female male
Generate random data according to some probability density functions. These functions have the general signature rfunc(n, par1, par2, …):
r stands for random, func is the name of the density function, n is the number of values to generate, and par1, par2, … are the parameters of the density function.
Generate 10 values following a normal distribution with mean = 10 and standard deviation = 3:
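The call for this step is not shown in the text. A minimal sketch using rnorm(), whose arguments are n, mean, and sd; the seed and the variable name draws are arbitrary choices for illustration:

```r
set.seed(123)                       # arbitrary seed so the draw is repeatable
draws <- rnorm(10, mean = 10, sd = 3)
draws

# Other distributions follow the same r<name>(n, parameters) pattern
runif(5, min = 0, max = 1)          # 5 values from a uniform distribution
rt(5, df = 10)                      # 5 values from a Student t distribution
```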
Access matrix elements using position indexes (again, index starting from 1):
m <- c(45, 23, 66, 77, 33, 44, 56, 12, 78, 23)
# then 'organize' the vector as a matrix
dim(m) <- c(2, 5) # make the vector a 2 by 5 matrix; 2x5 must equal the length of the vector
m
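The three-dimensional array printed next, and used as ar in the following steps, is not defined in the surviving text. The sketch below reproduces the printed values and dimension names; the construction itself is an assumption.

```r
# Assumed construction: 27 values filled column by column into a 3 x 3 x 3 array
ar <- array(1:27,
            dim = c(3, 3, 3),
            dimnames = list(c('a', 'b', 'c'),
                            c('d', 'e', 'f'),
                            c('g', 'h', 'i')))
ar
```

Like a matrix, an array is a vector with a dim attribute; the dimnames allow slices such as ar[, , 'g'] to be selected by name.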
, , g
d e f
a 1 4 7
b 2 5 8
c 3 6 9
, , h
d e f
a 10 13 16
b 11 14 17
c 12 15 18
, , i
d e f
a 19 22 25
b 20 23 26
c 21 24 27
Split the array into matrices:
# matrix1 <- ar[,,g] # without quotes, g is looked up as a variable rather than as the dimension name 'g'
matrix1 <- ar[,,'g']
matrix1
d e f
a 1 4 7
b 2 5 8
c 3 6 9
matrix2 <- ar[,,'h']
matrix2
d e f
a 10 13 16
b 11 14 17
c 12 15 18
Perform arithmetic operations on matrices:
sum <- matrix1 + matrix2
sum
d e f
a 11 17 23
b 13 19 25
c 15 21 27
matrix1*3
d e f
a 3 12 21
b 6 15 24
c 9 18 27
A matrix is just a long vector organized into dimensions, so the recycling rule applies:
matrix1
d e f
a 1 4 7
b 2 5 8
c 3 6 9
matrix1*c(2, 3)
Warning in matrix1 * c(2, 3): longer object length is not a multiple of shorter
object length
d e f
a 2 12 14
b 6 10 24
c 6 18 18
matrix1*c(2,3,2,3,2,3,2,3,2)
d e f
a 2 12 14
b 6 10 24
c 6 18 18
matrix1*c(1, 2, 3)
d e f
a 1 4 7
b 4 10 16
c 9 18 27
matrix1/c(1, 2, 3)
d e f
a 1 4.0 7
b 1 2.5 4
c 1 2.0 3
matrix1/c(1, 2, 3, 1, 2, 3, 1, 2, 3)
d e f
a 1 4.0 7
b 1 2.5 4
c 1 2.0 3
Lists
Lists are vectors too, but they are ‘recursive’ (as opposed to the ‘atomic’ vectors we learned before: vectors, matrices, arrays), meaning they can hold other lists and data of different types. Lists consist of an ordered collection of objects known as their components.
List components do not need to be of the same type.
List components are always numbered (with an index) and may also have a name attached to them.
Use list$component_name to access a component in a list; $ cannot be used on atomic vectors.
Both indices and names can be used to extract the subset. To use names, the object must have a name-type attribute such as names, rownames, colnames, etc.
You can use negative integers to indicate exclusion.
Unquoted variables are interpolated within the brackets.
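The subsetting rules above can be sketched with single brackets on a small named vector; the data and variable names are made up for illustration.

```r
scores <- c(judy = 78, max = 85, dan = 99)   # made-up named vector

scores[c(1, 3)]            # by position
scores[c("judy", "dan")]   # by name
scores[-2]                 # negative index excludes the second element

who <- "max"
scores[who]                # the variable who is evaluated, then used as the name
```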
Extract one item with [[
The double square brackets are used to extract one element from potentially many. Vectors yield a vector with a single value; data frames give a column vector; lists give one element.
You can return only one item. The result is not (necessarily) the same type of object as the container. The dimension will be the dimension of the one item, which is not necessarily 1. And, as before, names or indices can both be used, and variables are interpolated.
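A short sketch contrasting single and double brackets on a list; the list student and its contents are hypothetical.

```r
student <- list(name = "Judy", grades = c(78, 85, 99))  # made-up list

class(student["grades"])     # "list": single brackets keep the container
class(student[["grades"]])   # "numeric": double brackets return the item itself
length(student[["grades"]])  # 3: the one item can hold more than one value

field <- "name"
student[[field]]             # variables work inside [[ ]]
student[[2]]                 # so do integer positions
```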
Interact with $
$ is a special case of [[ in which you access a single item by actual name (but not used for atomic vectors). You cannot use integer indices.
The name will not be interpolated, and $ returns only one item. If the name contains special characters, the name must be enclosed in back-ticks.
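A minimal sketch of these $ rules, using a made-up list whose second component name contains a space:

```r
student <- list(name = "Judy", `final grade` = 99)  # made-up list

student$name             # same item as student[["name"]]
student$`final grade`    # back-ticks needed because of the space

field <- "name"
student$field            # NULL: $ looks for a component literally called "field"
student[[field]]         # "Judy": [[ evaluates the variable
```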
area season P.h.
1 A winter 7.4
2 B summer 7.3
3 A summer 10.6
4 A spring 7.2
5 B fall 8.9
names(my.dataframe)[3] <- 'ph'
my.dataframe
area season ph
1 A winter 7.4
2 B summer 7.3
3 A summer 10.6
4 A spring 7.2
5 B fall 8.9
Tibbles
Tibbles are similar to data frames, but more convenient.
Columns can be defined based on other columns defined earlier. Tibbles do not convert character columns to factors, and they do not print an entire dataset (a large one would fill your screen and more).
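The code that builds the tibble printed next is not shown in the text. A sketch under stated assumptions: the name my_tibble, the seed, the temperature range, and the split of locations are all made up, so the values will differ from the printed ones. It requires the tibble package.

```r
library(tibble)

set.seed(1)   # arbitrary seed; the values in the text came from a different draw
my_tibble <- tibble(
  TempCels = sample(-10:40, size = 100, replace = TRUE),
  TempFahr = TempCels * 9 / 5 + 32,            # defined from the column above
  Location = rep(letters[1:2], each = 50)      # stays character, not factor
)
my_tibble     # prints only the first 10 rows
```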
# A tibble: 100 × 3
TempCels TempFahr Location
<int> <dbl> <chr>
1 16 60.8 a
2 -5 23 a
3 31 87.8 a
4 -4 24.8 a
5 7 44.6 a
6 -3 26.6 a
7 12 53.6 a
8 25 77 a
9 -10 14 a
10 25 77 a
# ℹ 90 more rows
Use the penguins data frame from the palmerpenguins package:
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Convert a data frame to a tibble
pe <- as_tibble(penguins)
class(pe)
[1] "tbl_df" "tbl" "data.frame"
pe
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Note: you can use print(pe, n=Inf, width=Inf) to print the entire pe dataset.
mode is a mutually exclusive classification of objects according to their basic structure. The ‘atomic’ modes are numeric, complex, character and logical. Recursive objects have modes such as ‘list’ or ‘function’ or a few others. An object has one and only one mode.
class is a property assigned to an object that determines how generic functions operate with it. It is not a mutually exclusive classification. If an object has no specific class assigned to it, such as a simple numeric vector, its class is usually the same as its mode, by convention.
Changing the mode of an object is often called ‘coercion’. The mode of an object can change without necessarily changing the class.
To check an object's structure more precisely, use typeof or specific type testers, e.g., is.vector, is.atomic, is.data.frame, etc.
x <- 1:16
mode(x)
[1] "numeric"
dim(x) <- c(4, 4)
class(x)
[1] "matrix" "array"
is.numeric(x)
[1] TRUE
mode(x) <- "character"
mode(x)
[1] "character"
class(x)
[1] "matrix" "array"
#mode changed from 'numeric' to 'character', but class stays 'matrix'
However:
x <- factor(x)
class(x)
[1] "factor"
mode(x)
[1] "numeric"
# class changed from 'matrix' to 'factor', and mode changed from 'character' back to 'numeric'
# At this stage, even though x has mode numeric again, its new class, 'factor', prohibits it being used in arithmetic operations.
A set of `is.xxx()` functions can be used to check the data structure of an object
is.array(x)
[1] FALSE
is.list(x)
[1] FALSE
is.data.frame(x)
[1] FALSE
is.matrix(x)
[1] FALSE
is_tibble(x)
[1] FALSE
is.vector(x)
[1] FALSE
typeof(x)
[1] "integer"
Subsetting a tibble results in a smaller tibble
Note: this is different from a data frame. Subsetting a data frame can return a vector when the result is a single value or a single column of values.
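A small sketch of that difference, using made-up data and assuming the tibble package is installed:

```r
library(tibble)

df <- data.frame(id = 1:3, score = c(78, 85, 99))  # made-up data
tb <- as_tibble(df)

class(df[, "score"])                 # "numeric": the data frame drops to a vector
class(tb[, "score"])                 # tibble classes: still a one-column tibble
class(df[, "score", drop = FALSE])   # "data.frame": drop = FALSE keeps the data frame
tb[["score"]]                        # use [[ or $ to pull a tibble column out as a vector
```

Because a tibble subset with [ always stays a tibble, code that uses it does not need to handle a vector as a special case.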
Create a data object to hold student names (Judy, Max, Dan) and their grades (78, 85, 99). Convert number grades to letter grades: 90-100: A; 80-89: B; 70-79: C; <70: F
students <- list(names=c("Judy", "Max", "Dan"), grades=c(78, 85, 99))
print("before:")