Fundamental Techniques in Data Science with R Notes
Table of Contents
- 1. R basics
- 2. R Objects & Data Types
- 3. Data Manipulation and pipes
- 4. Data Visualization
- 5. Data Cleaning
- 6. Introduction to Linear Modeling
- 7. Regression Assumptions and Diagnostics
- 7.1. Assumptions of MLR
- 7.1.1. 1 The model is linear in the parameters
- 7.1.2. 2 The predictor matrix is full rank
- 7.1.3. 3 The predictors are strictly exogenous
- 7.1.4. 4 The errors have constant, finite variance (heteroscedasticity)
- 7.1.5. 5 The errors are uncorrelated
- 7.1.6. 6 The errors are normally distributed.
- 7.1.7. Spherical errors (4+5)
- 7.2. Regression Diagnostics
- 7.3. Influential Observations
- 8. Generalized Linear Model & Logistic regression
- 9. Assumptions of Logistic Regression
1. R basics
1.1. Getting help
You can get help for a function (e.g., anova()) using
?anova
help(anova)
1.2. CRAN and packages
CRAN is a repo for R packages
install.packages("mice")

## There are two ways to load a package into R:
library(stats)
require(stats)
1.3. R Data formats and importing/writing data
1.3.1. Native data formats
## Load the built-in 'bfi' data from the 'psychTools' package
data(bfi, package = "psychTools")

## Access the documentation for the 'bfi' data
?psychTools::bfi

## Define the directory holding our data
dataDir <- "../../../data/"

## Load the 'boys' data from the R workspace '../../../data/boys.RData'
load(paste0(dataDir, "boys.RData"))
save(boys, file = paste0(dataDir, "tmp.RData"))

## Load the 'titanic' data stored in the R data set '../../../data/titanic.rds'
titanic <- readRDS(paste0(dataDir, "titanic.rds"))
saveRDS(titanic, file = "titanic2.rds")
1.3.2. Delimited Data types (csv etc.)
## Load the 'diabetes' data from the tab-delimited file '../../../data/diabetes.txt'
diabetes <- read.table(paste0(dataDir, "diabetes.txt"), header = TRUE, sep = "\t")
write.table(boys, paste0(dataDir, "boys.txt"), row.names = FALSE, sep = "\t", na = "-999")

## Load the 2017 UTMB data from the comma-separated file '../../../data/utmb_2017.csv'
utmb1 <- read.csv(paste0(dataDir, "utmb_2017.csv"))
write.csv2(boys, paste0(dataDir, "boys.csv"), row.names = FALSE, na = "")
- The read.csv() function assumes the values are separated by commas
- For EU-formatted CSV files (semicolons instead of commas), we can use read.csv2()
1.3.3. SPSS Data (SAV)
Reading this format is tricky, if we want to read SAV files there are two popular options:
- foreign::read.spss()
- haven::read_spss()
The foreign package does not have a comparable function for writing SAV files
## Load the foreign package:
library(foreign)

## Use foreign::read.spss() to read '../../../data/mtcars.sav' into a list
mtcars1 <- read.spss(paste0(dataDir, "mtcars.sav"))

## Read '../../../data/mtcars.sav' as a data frame
mtcars2 <- read.spss(paste0(dataDir, "mtcars.sav"), to.data.frame = TRUE)

## Read '../../../data/mtcars.sav' without value labels
mtcars3 <- read.spss(paste0(dataDir, "mtcars.sav"),
                     to.data.frame = TRUE,
                     use.value.labels = FALSE)
library(haven)
library(labelled)

## Use haven::read_spss() to read '../../../data/mtcars.sav' into a tibble
mtcars4 <- read_spss(paste0(dataDir, "mtcars.sav"))
haven::read_spss() converts any SPSS variables with labels into labelled vectors
- We can use the labelled::unlabelled() function to remove the value labels
- e.g., a value printed as [V-Sha~ in the tibble is the truncated value label "V-Shaped"
The best option we have for writing SPSS data is haven::write_sav()
write_sav(mtcars2, paste0(dataDir, "mtcars2.sav"))
This will preserve the label information carried by factor variables and haven_labelled vectors
1.3.4. Excel (xlsx)
We have two good options for loading data from Excel spreadsheets:
- readxl::read_excel()
- openxlsx::read.xlsx()
## Load the packages:
library(readxl)
library(openxlsx)

## Use the readxl::read_excel() function to read the data from the 'titanic'
## sheet of the Excel workbook stored at '../../../data/example_data.xlsx'
titanic2 <- read_excel(paste0(dataDir, "example_data.xlsx"), sheet = "titanic")

## Use the openxlsx::read.xlsx() function to read the data from the 'titanic'
## sheet of the Excel workbook stored at '../../../data/example_data.xlsx'
titanic3 <- read.xlsx(paste0(dataDir, "example_data.xlsx"), sheet = "titanic")
The openxlsx package provides a powerful toolkit for programmatically building Excel workbooks in R and saving the results.
## Use the openxlsx::write.xlsx() function to write the 'diabetes' data to an
## XLSX workbook
write.xlsx(diabetes, paste0(dataDir, "diabetes.xlsx"), overwrite = TRUE)

## Use the openxlsx::write.xlsx() function to write each data frame in a list
## to a separate sheet of an XLSX workbook
write.xlsx(list(titanic = titanic, diabetes = diabetes, mtcars = mtcars),
           paste0(dataDir, "example_data.xlsx"),
           overwrite = TRUE)
1.4. R Functions
Any R command written as a word followed by parentheses is a function
- mean()
- library()
Infix operators are aliased functions
- <-
- + - *
- > < ==
1.4.1. User-Defined Functions
We can define our own functions using the “function()” function
square <- function(x) {
  out <- x^2
  out
}
square(5)
25
One-line functions don’t need braces:
square <- function(x) x^2
square(5)
25
- Not strictly typed
Function arguments are not strictly typed
square <- function(x) x^2
square(TRUE) # 1
square(1:5)
1 4 9 16 25
But there are limits:
square("bob")
Error in x^2 : non-numeric argument to binary operator
- Multiple/list arguments
Multiple arguments:
mod <- function(x, y) x %% y
mod(10, 3)
Or you can use a list to get multiple arguments:
getLsBeta <- function(datList) {
  X <- datList$X
  y <- datList$y
  solve(crossprod(X)) %*% t(X) %*% y
}

X <- matrix(runif(500), ncol = 5)
datList <- list(y = X %*% rep(0.5, 5), X = X)
getLsBeta(datList = datList)
- Type/class of functions
Functions are first-class objects in R.
- We can treat functions like any other R object.
R views an unevaluated function as an object with type ”closure”.
class(getLsBeta)
[1] “function”
typeof(getLsBeta)
[1] “closure”
An evaluated function is equivalent to the object it returns.
class(getLsBeta(datList))
[1] “matrix” “array”
typeof(getLsBeta(datList))
[1] “double”
1.5. Iteration
1.5.1. Loops
There are three types of loops in R: for, while, and repeat. This course only covers for loops (lol).
for(INDEX in RANGE) { Stuff To Do with the Current INDEX Value }
examples:
val <- 0
for(i in 1:100) {
  val <- val + i
}
val
5050
This loop will compute the mean of every column in the mtcars data
means <- rep(0, ncol(mtcars))
for(j in 1:ncol(mtcars)) {
  means[j] <- mean(mtcars[ , j])
}
means
listThing <- list(6, 9, 4, 2)
sum <- 0
for(i in listThing) {
  sum <- i + sum
}
sum
21
Usually there is a function to achieve what you are trying to do with the loop (sum() for example)
1.5.2. Apply
Usually one of two forms:
apply(DATA, MARGIN, FUNCTION, ...)
## The slides also showed apply(DATA, FUNCTION, ...), but a form without a
## MARGIN argument is really the lapply()/sapply() signature, not apply()
- Margin
Margin is how the function should be applied: E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.
data(bfi, package = "psychTools")
dat1 <- bfi[1:5, 1:3]
# | 2 | 4 | 3 |
# | 2 | 4 | 5 |
# | 5 | 4 | 5 |
# | 4 | 4 | 6 |
# | 2 | 3 | 3 |
# sum over rows of the above table
apply(dat1, 1, sum)
9 11 14 14 8

data(bfi, package = "psychTools")
dat1 <- bfi[1:5, 1:3]
# subtract 1 from every cell
apply(dat1, 1:2, function(x) x - 1)
1 3 2
1 3 4
4 3 4
3 3 5
1 2 2

- lapply and sapply
similar concept but works on lists
list1 <- list(1, 3, 4, 2, 4)
lapply(list1, function(x) x + 1)
2 4 5 3 5
sapply() is a user-friendly wrapper for lapply() which by default returns a vector or matrix
1.6. Some tips/ style guide
- While using RStudio, use TAB to quickly access the documentation of the function arguments
- Use the tidyverse style guide
- Can be done automatically in RStudio -> Addins -> Style selection
2. R Objects & Data Types
2.1. Vectors
Vectors are the simplest kind of R object, there is no concept of a “scalar” in R
2.1.1. Atomic modes
Vectors come in one of six “atomic modes”:
- numeric/double
- logical -> bool
- character
- integer
- complex
- A complex value in R is defined via the pure imaginary value i.
- raw
- Intended to hold raw bytes
2.1.2. Generating vectors
c(1, 2, 3)
1 |
2 |
3 |
1:5
1 |
2 |
3 |
4 |
5 |
1.2:5.3
1.2 |
2.2 |
3.2 |
4.2 |
5.2 |
rep(33,4)
33 |
33 |
33 |
33 |
rep(1:3,3)
1 |
2 |
3 |
1 |
2 |
3 |
1 |
2 |
3 |
rep(1:3,each=2)
1 |
1 |
2 |
2 |
3 |
3 |
seq(0, 1, 0.25)
0 |
0.25 |
0.5 |
0.75 |
1 |
2.1.3. Combining data types in vectors?
Not possible, some data types will be converted
a <- c(1, 2, 3)
b <- c("foo", "bar")
c(a, b)
[1] "1" "2" "3" "foo" "bar"
2.2. Matrices
Matrices generalize vectors by adding a dimension attribute.
2.2.1. Generating Matrices
matrix(1:5, nrow = 5, ncol = 2)
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
cbind(c(1,2,3),c(4,5,6))
1 | 4 |
2 | 5 |
3 | 6 |
m1 <- matrix(1:5, nrow = 5, ncol = 2)
attributes(m1)
5 |
2 |
Matrices are populated in column-major order, by default
matrix(1:9, 3, 3)
1 | 4 | 7 |
2 | 5 | 8 |
3 | 6 | 9 |
The byrow = TRUE option allows us to fill by row-major order
matrix(1:9, 3, 3, byrow = TRUE)
1 | 2 | 3 |
4 | 5 | 6 |
7 | 8 | 9 |
2.2.2. Mixing data types
Like vectors, matrices can only hold one type of data
2.3. Lists
Lists are the workhorse of R data objects
- An R list can hold an arbitrary set of other R objects
In lists data types CAN be mixed
2.3.1. Creating a list
list(1,2, "HI mom", TRUE)
1 | 2 | HI mom | TRUE |
2.3.2. Named list elements
l3 <- list(name = "bob", alive = TRUE, age = 33, relationshipStatus = 42+3i)
l3$name
bob
Can also be done post-hoc via the names()
l1 <- list(3, "hi", TRUE)
names(l1) <- c("name1", "second", "last")
l1$second
hi
2.3.3. Append onto list
(l4 <- list())
l4$people <- c("Bob", "Alice", "Suzy")
l4$money <- 0
l4$logical <- FALSE
l4
$people
[1] "Bob" "Alice" "Suzy"

$money
[1] 0

$logical
[1] FALSE
2.4. Data Frames
R’s way of storing rectangular data sets
- Each column of a data frame is a vector
- Each of these vectors can have a different type:
Data frames are actually lists of vectors (representing the columns)
2.4.1. Creating a dataframe
We can create one using data.frame()
data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
1 | -1 | 0.1 |
2 | 1 | 0.2 |
3 | -1 | 0.3 |
4 | 1 | 0.4 |
5 | -1 | 0.5 |
6 | 1 | 0.6 |
Columns can also be given names:
data.frame(x = 1:6, y = c(-1, 1), z = seq(0.1, 0.6, 0.1))
x y z 1 1 -1 0.1 2 2 1 0.2 3 3 -1 0.3 4 4 1 0.4 5 5 -1 0.5 6 6 1 0.6
Creating from matrix
data.frame(matrix(5, 4, 3))
5 | 5 | 5 |
5 | 5 | 5 |
5 | 5 | 5 |
5 | 5 | 5 |
2.4.2. Filling using sample or runif
Sample makes a vector of some length randomly sampled from a vector
Runif generates a vector of random numbers
data.frame(a = sample(c(TRUE, FALSE), 5, replace = TRUE), b = sample(c("foo", "bar"), 5, replace = TRUE), c = runif(5) )
TRUE | foo | 0.640138026094064 |
TRUE | bar | 0.0884113181382418 |
TRUE | bar | 0.099734982708469 |
TRUE | bar | 0.862016763305292 |
TRUE | bar | 0.991723588900641 |
2.5. Factors
Factors are R’s way of representing nominal (categorical) variables
- We can create a factor using the factor() function
2.5.1. Creating factors
factor(sample(1:3, 8, TRUE), labels = c("red", "yellow", "blue"))
red |
blue |
red |
blue |
yellow |
blue |
yellow |
yellow |
2.5.2. Factor attributes
x <- factor(sample(1:3, 8, TRUE), labels = c("red", "yellow", "blue"))
attributes(x)
$levels
[1] "red" "yellow" "blue"

$class
[1] "factor"
2.5.3. Internal representation
Even though a factor’s data are represented by an integer vector, R does not consider factors to be integer/numeric data.
3. Data Manipulation and pipes
3.1. Base R Subsetting
In Base R, we typically use three operators to subset objects
- []
- [[]]
- $
Which we choose depends on:
- What type of object are we trying to subset?
- How much of the original typing do we want to keep in the subset?
3.1.1. []
x <- 10:20
x[2]
x[1:3]
x[c(2, 5, 7)] # 11, 14, 16
x[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
10 |
12 |
15 |
17 |
20 |
3.1.2. [[]]
Rarely used, but it does behave differently: it drops any names or dimnames attributes, and it only allows selection of a single element
x <- 10:20
x[[2]]
11
The [] method returns objects of class list (or data.frame if foo was a data.frame) while the [[]] method returns objects whose class is determined by the type of their values.
3.1.3. Matrices
y <- matrix(1:24, 6, 4)
y[4:6, 2:3]
10 | 16 |
11 | 17 |
12 | 18 |
We can mix different selecting styles
y <- matrix(1:24, 6, 4)
y[c(2, 4), c(FALSE, TRUE, TRUE)]
8 | 14 |
10 | 16 |
3.1.4. $ and lists
We can use all three operators to subset lists.
list1 <- list(2, "noo", TRUE)
names(list1) <- c("first", "second", "last")
list1$first
list1[1]
list1[["first"]]
2
l4 <- list(2, "noo", TRUE)
names(l4) <- c("first", "second", "last")

tmp1 <- l4[1]
tmp1        # $first [1] 2
class(tmp1) # [1] "list"

tmp2 <- l4[[1]]
tmp2        # [1] 2
class(tmp2)
numeric
3.1.5. Data Frames
We can subset the columns of a data frame using list semantics
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3$a
1 |
2 |
3 |
4 |
5 |
6 |
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3[1]
1 |
2 |
3 |
4 |
5 |
6 |
3.1.6. Logical indexing
Filter by rows
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3[d3$a %% 2 == 0 & d3$c > 0, ]
Notice the comma at the end!
This comma is needed because R uses the following notation for subsetting data frames.
data[rows you want, columns you want]
2 | 1 | 0.2 |
4 | 1 | 0.4 |
6 | 1 | 0.6 |
3.2. Tidyverse subsetting
The dplyr package provides ways to subset data, two are most frequently used:
- select
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
select(d3, a, b)
1 -1
2  1
3 -1
4  1
5 -1
6  1

library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
select(d3, -a)
-1 0.1
 1 0.2
-1 0.3
 1 0.4
-1 0.5
 1 0.6

- filter and logical indexing
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
filter(d3, a %% 2 == 0, c > 0)
2 1 0.2
4 1 0.4
6 1 0.6

We can achieve the same using logical indexing in Base R
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3[d3$a %% 2 == 0 & d3$c > 0, ]
Notice the comma at the end!
2 1 0.2
4 1 0.4
6 1 0.6
3.3. Base R Variable Transformation
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")

d4 <- d3
d4$a <- scale(d4$a)
d4$a <- as.numeric(d4$a)
d4
-1.33630620956212 | -1 | 0.1 |
-0.801783725737273 | 1 | 0.2 |
-0.267261241912424 | -1 | 0.3 |
0.267261241912424 | 1 | 0.4 |
0.801783725737273 | -1 | 0.5 |
1.33630620956212 | 1 | 0.6 |
3.4. Tidyverse Variable Transformations
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")

mutate(d3,
       d = rbinom(nrow(d3), 1, c),
       e = d * c,
       a = scale(a)
       )
-1.33630620956212 | -1 | 0.1 | 0 | 0 |
-0.801783725737273 | 1 | 0.2 | 0 | 0 |
-0.267261241912424 | -1 | 0.3 | 1 | 0.3 |
0.267261241912424 | 1 | 0.4 | 1 | 0.4 |
0.801783725737273 | -1 | 0.5 | 0 | 0 |
1.33630620956212 | 1 | 0.6 | 1 | 0.6 |
3.5. Sorting & Ordering
3.5.1. Base R sort
l3 <- runif(n = 8, min = 1, max = 20)
sort(l3)
1.14997771941125 |
6.18364422139712 |
7.40294478787109 |
8.80276804696769 |
10.6937525619287 |
14.7475167061202 |
16.3595214642119 |
16.9985088445246 |
l3 <- runif(n = 8, min = 1, max = 20)
sort(l3, decreasing = TRUE)
15.9437932423316 |
15.0901547158137 |
9.608802491799 |
4.20592033001594 |
4.16299939691089 |
2.97972352942452 |
2.14775081258267 |
1.15068965195678 |
3.5.2. Tidyverse Ordering
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3$d <- runif(n = 6, min = 1, max = 20)
arrange(d3, d)
4 | 1 | 0.4 | 1.03251850209199 |
5 | -1 | 0.5 | 2.67228518775664 |
3 | -1 | 0.3 | 6.22691969852895 |
1 | -1 | 0.1 | 8.31828686594963 |
6 | 1 | 0.6 | 12.4162582552526 |
2 | 1 | 0.2 | 15.5486617612187 |
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3$d <- runif(n = 6, min = 1, max = 20)
arrange(d3, a * d, c)
2 | 1 | 0.2 | 1.041215069592 |
1 | -1 | 0.1 | 19.1222469233908 |
6 | 1 | 0.6 | 8.04311123443767 |
3 | -1 | 0.3 | 18.1283924696036 |
4 | 1 | 0.4 | 15.9365976287518 |
5 | -1 | 0.5 | 15.7623822011519 |
3.6. Pipes
The %>% symbol represents the pipe operator, used to compose functions
- f(x) becomes x %>% f()
- f(x, y) becomes x %>% f(y)
- h(g(f(x))) becomes x %>% f %>% g %>% h
Big brain lecturer forgot to mention that this operator is not in base R; it comes from the magrittr package and is re-exported by dplyr and the tidyverse, among others
3.6.1. Example use
Example use:
library(dplyr)

sqr <- function(x) { x**2 }
plus1 <- function(x) { x + 1 }

test <- (1:10) %>% sqr() %>% plus1()
test
2 |
5 |
10 |
17 |
26 |
37 |
50 |
65 |
82 |
101 |
This is equivalent to:
test <- plus1(sqr(1:10))
test
3.6.2. dot . in a pipeline, target specific argument
Specify where you want the piped value to go
library(dplyr)

power <- function(x, y) { x**y }

test <- (1:10) %>% power(3, y = .)
test
3 |
9 |
27 |
81 |
243 |
729 |
2187 |
6561 |
19683 |
59049 |
3.6.3. Exposition Pipe: %$%
The exposition pipe, %$%, exposes the names in the LHS to the RHS expression. This is useful when functions do not have a built-in data argument; in other words, it exposes the contents of an object to the next function in the pipeline.
## %$% comes from the magrittr package; the cats data are in MASS:
library(magrittr)
data(cats, package = "MASS")

## All equivalent:
cats %$% t.test(Hwt ~ Sex)
cats %>% t.test(Hwt ~ Sex, data = .)
t.test(Hwt ~ Sex, data = cats)
4. Data Visualization
4.1. Base R Graphics
4.1.1. Scatterplots
We can create a basic scatterplot using the plot() function
plot(runif(20, 0, 100), runif(20, 0, 10))
diabetes %$% plot(y = tc,
                  x = bmi,
                  ylab = "Total Cholesterol",
                  xlab = "Body Mass Index",
                  main = "Relation between BMI and Cholesterol",
                  ylim = c(0, 350),
                  xlim = c(0, 50)
                  )
4.1.2. Histogram
We can create a simple histogram with the hist() function
hist(runif(30,0,10))
4.1.3. Boxplots
We can create simple boxplots via the boxplot function
boxplot(runif(30,40,60),runif(60,10,50))
4.1.4. Kernel Density Plots
density(runif(5,0,100)) %>% plot()
4.2. GGPlot graphics
Base R graphics are fine for quick and dirty visualizations but for publication quality graphics, you should probably use GGPlot
- GGPlot uses the ”grammar of graphics” and ”tidy data” to build up a figure from modular components.
4.2.1. Grammar of graphics
4.2.2. Basic Setup (Data and aesthetic)
We start by initializing the object that will define our plot
- This is regardless of what we want to plot
- We must specify a data source and an aesthetic mapping via the aes() function
library(ggplot2)
p1 <- ggplot(data = diabetes, mapping = aes(x = bmi, y = glu))
p1
This doesn’t make any plot yet
4.2.3. Geometries (line, dots etc)
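The notes don't include a geometry example here; a minimal sketch using the p1 object defined above (assuming ggplot2 and the diabetes data are loaded):
## Add a point geometry (scatterplot) to the base plot object:
p1 + geom_point()

## Geometries can be layered, e.g. points plus a rug of the marginal distributions:
p1 + geom_point() + geom_rug()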
4.2.4. Statistics
We can also add statistical summaries of the data.
p1 + geom_point() + geom_smooth()
p1 + geom_point() + geom_smooth(method = "lm")
4.2.5. Styling
Changing the style outside of aes() applies the styling to the entire plot
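A minimal sketch (p1 as defined above): constants set outside aes() style every point, whereas mapping them inside aes() would tie them to data values.
## Style applied to the whole layer (outside aes()):
p1 + geom_point(color = "blue", size = 2, alpha = 0.5)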
4.2.6. Themes
We can apply canned themes to adjust a plot's overall appearance. Default theme:
p1.1 <- p1 + geom_point()
p1.1 + theme_classic()
You can also modify individual themes
p1.1 + theme_classic() +
  theme(axis.title = element_text(size = 16,
                                  family = "serif",
                                  face = "bold",
                                  color = "blue"),
        aspect.ratio = 1)
4.2.7. Facets
Faceting allow us to make arrays of conditional plots.
p7 <- ggplot(titanic, aes(age, survived, color = class)) +
  geom_jitter(height = 0.05) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE)

p7 + facet_wrap(vars(sex))

p8 <- ggplot(titanic, aes(age, survived)) +
  geom_jitter(height = 0.05) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE)

p8 + facet_grid(vars(sex), vars(class))
4.2.8. Joining multiple Figures
If we want to paste several different plots into a single figure (without faceting), we can use the utilities in the gridExtra package.
library(gridExtra)

grid.arrange(p1 + geom_point(),
             p3 + geom_boxplot(),
             p1 + geom_line(),
             p8 + facet_grid(vars(sex), vars(class)),
             ncol = 2)
5. Data Cleaning
When we receive new data, they are generally messy and contaminated by various anomalies and errors
One of the first steps in processing a new set of data is cleaning
By cleaning the data, we ensure a few properties:
- The data are in an analyzable format.
- All data take legal values.
- Any outliers are located and treated.
- Any missing data are located and treated.
5.1. Missing data
Empty cells in a dataset where there should be observed values
- The missing cells correspond to true population values, but we haven’t observed those values
Not every empty cell is a missing datum
- Quality-of-life ratings for dead patients in a mortality study
- Firm profitability after the company goes out of business
- Self-reported severity of menstrual cramping for men
- Empty blocks of data following “gateway” items
5.1.1. Missing Data Pattern
Missing data patterns represent unique combinations of observed and missing items
- \(P\) items -> \(2^P\) possible patterns
- Example in R
## Load the mice library (plot_pattern() comes from the ggmice package):
library(mice)
library(ggmice)

## Compute missing data patterns:
pats <- md.pattern(boys)
pats

plot_pattern(boys, rotate = TRUE)
    age reg wgt hgt bmi hc gen phb  tv
223   1   1   1   1   1  1   1   1   1    0
19    1   1   1   1   1  1   1   1   0    1
1     1   1   1   1   1  1   1   0   1    1
1     1   1   1   1   1  1   0   1   0    2
437   1   1   1   1   1  1   0   0   0    3
43    1   1   1   1   1  0   0   0   0    4
16    1   1   1   0   0  1   0   0   0    5
1     1   1   1   0   0  0   0   0   0    6
1     1   1   0   1   0  1   0   0   0    5
1     1   1   0   0   0  1   1   1   1    3
1     1   1   0   0   0  0   1   1   1    4
1     1   1   0   0   0  0   0   0   0    7
3     1   0   1   1   1  1   0   0   0    4
      0   3   4  20  21 46 503 503 522 1622
5.1.2. Nonresponse Rates
- Proportion missing
- The proportion of cells containing missing data
- Should be computed for each variable, not for the entire dataset
## Load the mice library:
library(mice)

## Load some example data:
data(boys, package = "mice")

## Compute variable-wise proportions missing:
mMat <- is.na(boys)
mMat %>% colMeans() %>% round(3)

  age   hgt   wgt   bmi    hc   gen   phb    tv   reg
0.000 0.027 0.005 0.028 0.061 0.672 0.672 0.698 0.004

## Compute observation-wise proportions missing:
pmRow <- rowMeans(mMat)

## Summarize the above:
range(pmRow)                 # [1] 0.0000000 0.7777778
pmRow[pmRow > 0] %>% range() # [1] 0.1111111 0.7777778
median(pmRow)                # [1] 0.3333333

## Compute the proportion of complete cases:
mean(pmRow == 0)             # [1] 0.2981283
- Attrition Rate
- The proportion of participants that drop-out of a study at each measurement occasion
- Covariance Coverage
- The proportion of cases available to estimate a given pairwise relationship
- e.g. a covariance between two variables
- Very important to have adequate coverage for the parameters you want to estimate
We can use routines from the mice package to calculate covariance coverage and response patterns.
## Compute the covariance coverage:
cc <- md.pairs(boys)$rr / nrow(boys)

## Check the result:
round(cc, 2)

     age  hgt  wgt  bmi   hc  gen  phb  tv  reg
age 1.00 0.97 0.99 0.97 0.94 0.33 0.33 0.3 1.00
hgt 0.97 0.97 0.97 0.97 0.92 0.32 0.32 0.3 0.97
wgt 0.99 0.97 0.99 0.97 0.94 0.32 0.32 0.3 0.99
bmi 0.97 0.97 0.97 0.97 0.91 0.32 0.32 0.3 0.97
hc  0.94 0.92 0.94 0.91 0.94 0.33 0.33 0.3 0.93
gen 0.33 0.32 0.32 0.32 0.33 0.33 0.33 0.3 0.33
phb 0.33 0.32 0.32 0.32 0.33 0.33 0.33 0.3 0.33
tv  0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.3 0.30
reg 1.00 0.97 0.99 0.97 0.93 0.33 0.33 0.3 1.00
5.1.3. Visualizing Incomplete Data
The ggmice package provides some nice ways to visualize incomplete data and objects created during missing data treatment.
library(ggmice); library(ggplot2) ggmice(boys, aes(wgt, hgt)) + geom_point()
We can also create a nicer version of the response pattern plot.
plot_pattern(boys, rotate = TRUE)
The naniar package also provides some nice visualization and numerical summary routines for incomplete data.
naniar::gg_miss_upset(boys)
5.2. Outliers
We only consider univariate outliers (i.e., involving only one variable)
- Extreme values with respect to the distribution of a variable’s other observations
- A human height measurement of 3 meters
- A student with an income of $250,000
- Not accounting for any particular model
A univariate outlier may, or may not be an illegal value
- Data entry errors are probably the most common cause
- Outliers can also be legal, but extreme, values
Key Point: We choose to view an outlier as arising from a different population than the one to which we want to generalize our findings
5.2.1. Finding outliers: Boxplot Method
Tukey (1977) described a procedure for flagging potential outliers based on a box-and-whiskers plot.
- Does not require normally distributed X
- Not sensitive to outliers
A fence is an interval defined as the following function of the first quartile, the third quartile, and the interquartile range \((IQR=Q_3-Q_1)\): \[F=\{Q_1-C\times IQR,\ Q_3+C\times IQR\}\]
- Taking C = 1.5 produces the inner fence
- Taking C = 3.0 produces the outer fence
We can use these fences to identify potential outliers (see the sketch after this list):
- Any value that falls outside of the inner fence is a possible outlier
- Any value that falls outside of the outer fence is a probable outlier
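A minimal sketch of computing the fences by hand (assuming the boys data from the mice package); the boxplot.stats() call in the next subsection does this for us:
data(boys, package = "mice")
q1  <- quantile(boys$bmi, 0.25, na.rm = TRUE)
q3  <- quantile(boys$bmi, 0.75, na.rm = TRUE)
iqr <- q3 - q1

## Inner fence (C = 1.5): possible outliers
c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

## Outer fence (C = 3.0): probable outliers
c(q1 - 3.0 * iqr, q3 + 3.0 * iqr)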
5.2.2. Boxplot in R
## Find potentially outlying cases:
(out <- boys %$% boxplot.stats(bmi, coef = 3)$out)
30.62 31.34 31.74 |
## Which observations are potential outliers?
boys %$% which(bmi %in% out)
574 668 733 |
## View the potentially outlying cases:
boys %>% filter(bmi %in% out)
 | age | hgt | wgt | bmi | hc | gen | phb | tv | reg |
---|---|---|---|---|---|---|---|---|---|
1 | 15.493 | 182.5 | 102.0 | 30.62 | 57.7 | <NA> | <NA> | NA | west |
2 | 17.749 | 174.0 | 94.9 | 31.34 | 56.3 | G5 | P5 | 25 | west |
3 | 19.926 | 192.3 | 117.4 | 31.74 | 57.6 | G5 | P6 | 18 | north |
5.2.3. Breakdown Point
To compare statistics, we consider their breakdown points.
- The breakdown point is the minimum proportion of cases that must be replaced by ∞ to cause the value of the statistic to go to ∞.
The mean has a breakdown point of 1/N.
- Replacing a single value with ∞ will produce an infinite mean.
- This is why we shouldn’t use basic z-scores to find outliers.
The median has a breakdown point of 50%.
- We can replace n < N/2 of the observations with ∞ without producing an infinite median.
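A quick illustration of the two breakdown points (toy numbers, not from the slides):
x <- c(1, 2, 3, 4, 5)

## One extreme value ruins the mean but not the median:
xBad <- c(x, 1e10)
mean(xBad)   # enormous
median(xBad) # still 3.5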
5.2.4. Outliers for Categorical Data
Nominal, ordinal, and binary items can have outliers
- Outliers on categorical variables are often more indicative of bad variables than outlying cases.
Ordinal
- Most participants endorse one of the lowest categories on an ordinal item, but a few participants endorse the highest category.
- The participants who endorse the highest category may be outliers.
Nominal
- Groups with very low membership may be outliers on nominal grouping variables.
Binary
- If most endorse the item, the few who do not may be outliers.
5.2.5. Treating outliers
If we locate any outliers, they must be treated.
- Outliers caused by errors, mistakes, or malfunctions (i.e., error outliers) should be directly corrected.
- Labeling non-error outliers is a subjective task.
- A (non-error) outlier must originate from a population separate from the one we care about.
- Don’t blindly automate the decision process.
The most direct solution is to delete any outlying observation.
- If you delete non-error outliers, the analysis should be reported twice: with outliers and without.
For univariate outliers, we can use less extreme types of deletion.
- Delete outlying values (but not the entire observation).
- These empty cells then become missing data.
- Treat the missing values along with any naturally-occurring nonresponse.
Winsorization:
- Replace the outliers with the nearest non-outlying value
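A minimal winsorization sketch using the Tukey fences from above as the cutoffs (the choice of fence is an assumption, not prescribed by the notes):
data(boys, package = "mice")
q1 <- quantile(boys$bmi, 0.25, na.rm = TRUE)
q3 <- quantile(boys$bmi, 0.75, na.rm = TRUE)
lo <- q1 - 1.5 * (q3 - q1)
hi <- q3 + 1.5 * (q3 - q1)

## Pull values outside the fences back to the nearest fence:
bmiWins <- pmin(pmax(boys$bmi, lo), hi)
summary(bmiWins)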
We can also use robust regression procedures to estimate the model directly in the presence of outliers.
Weight the objective function to reduce the impact of outliers
- M-estimation
Trim outlying observations during estimation
- Least trimmed squares, MCD, MVE
Take the median, instead of the mean, of the squared residuals
- Least median of squares
Model some quantile of the DV’s distribution instead of the mean
- Quantile regression
Model the outcome with a heavy-tailed distribution
- Laplacian, Student’s T
6. Introduction to Linear Modeling
6.1. Simple Linear Regression
Best fit line is defined by a simple equation:
\begin{equation} \hat Y = \hat \beta_0+\hat\beta_1X \end{equation}
6.1.1. Thinking about error
The above equation only describes the best fit line. We need to account for the estimation error
\begin{equation} Y = \hat \beta_0+\hat\beta_1X +\hat\epsilon \end{equation}
6.1.2. Estimating the Regression Coefficients
Minimize the sum of squared residuals
\begin{equation} RSS =\sum^N_{n=1} \hat \epsilon^2_n \end{equation}
where \(\hat \epsilon_n=Y_n-\hat Y_n\)
6.1.3. Model Fit, measures, r-squared
We may also want to know how well our model explains the outcome.
- Our model explains some proportion of the outcome’s variability.
- The residual variance \(\hat\sigma^2 = \mathrm{Var}(\hat\epsilon)\) will be less than \(\mathrm{Var}(Y)\)
We quantify the proportion of the outcome’s variance that is explained by our model using the \(R^2\) statistic
\begin{equation} R^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS} \end{equation}
\begin{equation} TSS=\sum^N_{n=1}(Y_n-\bar Y)^2=\mathrm{Var}(Y)\times(N-1) \end{equation}
where \(\bar Y\) is the mean of \(Y\)
When assessing predictive performance, we will most often use the mean squared error as our criterion.
\begin{equation} MSE=\frac 1 N \sum^N_{n=1}(Y_n-\hat Y_n)^2=\frac {RSS} N \end{equation}
MSE is the better criterion when the goal is prediction, where we don't care as much about explanation.
\[RMSE = \sqrt{MSE}\]
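A minimal sketch computing these fit measures from a fitted lm() object (using the built-in mtcars data purely for illustration):
fit <- lm(mpg ~ wt, data = mtcars)

rss <- sum(resid(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

r2   <- 1 - rss / tss      # matches summary(fit)$r.squared
mse  <- rss / nrow(mtcars)
rmse <- sqrt(mse)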
6.1.4. Information Criteria
We can use information criteria to quickly compare non-nested models while accounting for model complexity.
Information criteria balance two competing forces.
- The optimized loglikelihood, \(\ln\hat L\), quantifies fit to the data.
- The penalty term corrects for model complexity.
\(K\) is the number of estimated parameters, which here grows with the number of predictor variables (see the formulas below).
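For reference, the standard definitions (not shown in the notes), with \(\hat L\) the maximized likelihood and \(N\) the sample size:
\[AIC = -2\ln\hat L + 2K\]
\[BIC = -2\ln\hat L + K\ln(N)\]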
These statistics are only interesting for comparing models; on their own they don't say much.
6.1.5. In R
out1 <- lm(bp ~ age, data = dDat)
coef(out1) # get the coefficients
6.2. Multiple Linear Regression
Just simple linear regression with multiple predictor variables.
6.2.1. Comparing models/ multiple R2/ f statistic
- Multiple R2
out1 <- lm(bp ~ age, data = dDat)
out2 <- lm(bp ~ age + bmi, data = dDat)

## Extract R^2 values:
r2.1 <- summary(out1)$r.squared
r2.2 <- summary(out2)$r.squared
- F-statistic/null hypothesis
We test whether the increase in R-squared is more than we would expect by chance.
We use the F-statistic to test \(H_0:R^2=0\) vs. \(H_1:R^2>0\).
f1 <- summary(out1)$fstatistic
f1
#    value    numdf    dendf
# 55.78116  1.00000 440.00000

pf(q = f1[1], df1 = f1[2], df2 = f1[3], lower.tail = FALSE)
value 4.392569e-13
f2 <- summary(out2)$fstatistic
f2
#   value   numdf   dendf
# 64.6647  2.0000 439.0000

pf(f2[1], f2[2], f2[3], lower.tail = FALSE)
value 2.433518e-25
How do we quantify the additional variation explained by BMI, above and beyond age?
- Compute the \(\Delta R^2\)
## Compute change in R^2: r2.2 - r2.1
[1] 0.115049
How do we know if \(\Delta R^2\) represents a significantly greater degree of explained variation?
- Use an F-test for \(H_0: \Delta R^2 = 0\) vs. \(H_1: \Delta R^2 > 0\)
## Is that increase significantly greater than zero?
anova(out1, out2)

Analysis of Variance Table

Model 1: bp ~ age
Model 2: bp ~ age + bmi
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)
1    440 74873
2    439 65167  1    9706.1 65.386 6.057e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Comparing models with MSE
We can also compare models based on their prediction errors.
- For OLS regression, we usually compare MSE values.

## MSE() comes from the MLmetrics package:
library(MLmetrics)

mse1 <- MSE(y_pred = predict(out1), y_true = dDat$bp)
mse2 <- MSE(y_pred = predict(out2), y_true = dDat$bp)

mse1 # [1] 169.3963
mse2 # [1] 147.4367

In this case, the MSE for the model with BMI included is smaller.
- We should prefer the larger model.
- Comparing models with information criteria
Finally, we can compare models based on information criteria.
AIC(out1, out2)
#      df      AIC
# out1  3 3528.792
# out2  4 3469.424

BIC(out1, out2)
#      df      BIC
# out1  3 3541.066
# out2  4 3485.789

In this case, both the AIC and the BIC for the model with BMI included are smaller.
- We should prefer the larger model.
6.3. Categorical Predictors
Categorical predictors have to be converted to dummy codes:
 | drink | juice | tea |
1 | juice | 1 | 0 |
2 | coffee | 0 | 0 |
3 | tea | 0 | 1 |
4 | tea | 0 | 1 |
Then these dummy variables can simply be used as normally: \[Y=\beta_0+\beta_1X_{juice}+\beta_2X_{tea}+\epsilon\]
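A minimal sketch of how R builds such dummy codes (hypothetical drink variable):
drink <- factor(c("juice", "coffee", "tea", "tea"))

## model.matrix() shows the design matrix a regression would use;
## coffee is the reference category here (first level alphabetically):
model.matrix(~ drink)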
6.3.1. Contrast
All R factors have an associated contrasts attribute.
- The contrasts define a coding to represent the grouping information.
- Modeling functions code the factors using the rules defined by the contrasts.
x <- factor(c("a", "b", "c")) contrasts(x)
b | c | |
a | 0 | 0 |
b | 1 | 0 |
c | 0 | 1 |
6.3.2. Significance Testing
For variables with only two levels, we can test the overall factor’s significance by evaluating the significance of a single dummy code.
out <- lm(Price ~ Man.trans.avail, data = Cars93)
partSummary(out, -c(1, 2))
# Coefficients:
#                    Estimate Std. Error t value Pr(>|t|)
# (Intercept)          23.841      1.623  14.691   <2e-16
# Man.trans.availYes   -6.603      2.004  -3.295   0.0014
#
# Residual standard error: 9.18 on 91 degrees of freedom
# Multiple R-squared: 0.1066, Adjusted R-squared: 0.09679
# F-statistic: 10.86 on 1 and 91 DF, p-value: 0.001403
For variables with more than two levels, we need to simultaneously evaluate the significance of each of the variable’s dummy codes
partSummary(out4, -c(1, 2))
# Coefficients:
#                    Estimate Std. Error t value Pr(>|t|)
# (Intercept)         21.7187     2.9222   7.432 6.25e-11
# Man.trans.availYes  -5.8410     1.8223  -3.205  0.00187
# DriveTrainFront     -0.2598     2.8189  -0.092  0.92677
# DriveTrainRear      10.5169     3.3608   3.129  0.00237
#
# Residual standard error: 8.314 on 89 degrees of freedom
# Multiple R-squared: 0.2834, Adjusted R-squared: 0.2592
# F-statistic: 11.73 on 3 and 89 DF, p-value: 1.51e-06
summary(out4)$r.squared - summary(out)$r.squared
# [1] 0.1767569

anova(out, out4)
# Analysis of Variance Table
#
# Model 1: Price ~ Man.trans.avail
# Model 2: Price ~ Man.trans.avail + DriveTrain
#   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
# 1     91 7668.9
# 2     89 6151.6  2    1517.3 10.976 5.488e-05 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
6.4. Model-Based Prediction
6.4.1. Prediction example
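The worked example is in the slides; a minimal sketch of model-based prediction with predict(), reusing the assumed dDat/bp/age/bmi names from the earlier examples:
fit <- lm(bp ~ age + bmi, data = dDat)

## Predictions for two hypothetical new observations:
newDat <- data.frame(age = c(30, 50), bmi = c(22, 31))
predict(fit, newdata = newDat)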
6.4.2. Interval estimates for prediction
To quantify uncertainty in our predictions, we want to use an appropriate interval estimate: There are two flavors:
- Confidence interval for \(\hat Y_m\)
The CI for \(\hat Y_m\) gives a likely range (in the sense of coverage probability and "confidence") for the mth value of the true conditional mean.
CIs only account for uncertainty in the estimated regression coefficients, \(\{\hat\beta_0, \hat\beta_p\}\).
- Prediction intervals for a specific observation, \(Y_m\)
The prediction interval for \(Y_m\) gives a likely range (in the same sense as CIs) for the mth outcome value.
Prediction intervals also account for the regression errors, \(\epsilon\).
- Confidence vs prediction intervals
CIs for \(\hat Y\) ignore the errors, \(\epsilon\).
- Only accounts for the best fit
Prediction intervals are wider than CIs.
- They account for the additional uncertainty contributed by \(\epsilon\).
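A minimal sketch of both interval types via predict() (same assumed dDat example as above):
fit    <- lm(bp ~ age + bmi, data = dDat)
newDat <- data.frame(age = c(30, 50), bmi = c(22, 31))

## Confidence interval for the conditional mean:
predict(fit, newdata = newDat, interval = "confidence")

## Prediction interval for a new outcome value (wider):
predict(fit, newdata = newDat, interval = "prediction")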
6.5. Moderation
Additive models allow us to examine the partial effects of several predictors on some outcome.
- The effect of one predictor does not change based on the values of other predictors
Moderation allows us to ask when one variable, X, affects another variable, Y.
- We’re considering the conditional effects of X on Y given certain levels of a third variable Z.
- How much a variable affects the outcome depends on another variable
6.5.1. Additive regression + derivation
In additive MLR we have the following equation: \(Y=\beta_0+\beta_1X+\beta_2Z+\epsilon\) This equation assumes that X and Z are independent predictors of Y.
When X and Z are independent predictors, the following are true
- X and Z can be correlated
- \(\beta_1\) and \(\beta_2\) are partial regression coefficients
- The effect of X on Y is the same at all levels of Z, and the effect of Z on Y is the same at all levels of X.
When testing for moderation, we hypothesize that the effect of X on Y varies as a function of Z.
- \(Y=\beta_0+f(Z)X+\beta_2Z+\epsilon\)
If we assume that Z linearly affects the relationship between X and Y:
- \(f(Z)=\beta_1+\beta_3Z\)
Combining the two equations:
- \(Y=\beta_0+(\beta_1+\beta_3Z)X+\beta_2Z+\epsilon\)
- \(Y=\beta_0+\beta_1X+\beta_2Z+\beta_3XZ+\epsilon\)
6.5.2. Test for moderation
\(Y=\beta_0+\beta_1X+\beta_2Z+\beta_3XZ+\epsilon\)
To test for moderation we simply need to test the significance of the interaction term, XZ
\(\hat\beta_3\) quantifies the effect of Z on the focal effect (the X->Y effect)
- For a unit change in Z, it is the expected change in the effect of X on Y
\(\hat\beta_1\text{ and } \hat\beta_2\) are conditional effects.
- interpreted where the other predictor is zero
- For a unit change in X, \(\hat\beta_1\) is the expected change in Y, when Z = 0.
- For a unit change in Z, \(\hat\beta_2\) is the expected change in Y, when X = 0.
6.5.3. in R
## Load data:
socSup <- readRDS(paste0(dataDir, "social_support.rds"))

## Estimate the moderated regression model:
out <- lm(bdi ~ tanSat * sex, data = socSup)
partSummary(out, -c(1, 2))
# Coefficients:
#                Estimate Std. Error t value Pr(>|t|)
# (Intercept)     20.8478     6.2114   3.356  0.00115
# tanSat          -0.5772     0.3614  -1.597  0.11372
# sexmale         14.3667    12.2054   1.177  0.24223
# tanSat:sexmale  -0.9482     0.7177  -1.321  0.18978
#
# Residual standard error: 9.267 on 91 degrees of freedom
# Multiple R-squared: 0.08955, Adjusted R-squared: 0.05954
# F-statistic: 2.984 on 3 and 91 DF, p-value: 0.03537
7. Regression Assumptions and Diagnostics
You always have to make some assumptions to create a model.
7.1. Assumptions of MLR
7.1.1. 1 The model is linear in the parameters
This is OK: \(Y=\beta_0+\beta_1X+\beta_2Z+\beta_3XZ+\beta_4X^2+\beta_5X^3\)
This is not: \(Y=\beta_0X^{\beta_1}+\epsilon\)
- power is also a parameter here -> not allowed
If the model is not linear in the parameters, then we’re not even working with linear regression.
- We need to move to entirely different modeling paradigm
Each modeled X must exhibit a linear relation with Y.
- We can define X via nonlinear transformations of the original data.
- residuals plots
out1 <- lm(MPG ~ Horsepower, data = Cars93)
plot(Cars93$MPG, Cars93$Horsepower)
plot(out1, 1)
In multiple regression models, basic residual plots won’t tell us which predictors exhibit nonlinear associations.
- Component + residuals plots
We can use Component + Residual Plots (AKA, partial residual plots) to visualize the unique effects of each X variable.
library(car) crPlots(out3)
- Treating Residual Nonlinearity
Nonlinearity in the residual plots is usually a sign of either:
- Model misspecification
- Influential observations
This type of model misspecification usually implies omitted functions of modeled variables.
- Polynomial terms
- Interactions
The solution is to include the omitted term into the model and refit.
- This is very much easier said than done.
7.1.2. 2 The predictor matrix is full rank
- \(N>P\)
- No \(X_p\) can be a linear combination of other predictors
If the predictor matrix is not full rank, the model is not estimable.
- The parameter estimates cannot be uniquely determined from the data
7.1.3. 3 The predictors are strictly exogenous
The predictors do not correlate with the errors
- \(Cov(\hat Y,\epsilon)=0\)
- \(E[\epsilon_n]=0\)
If the predictors are not exogenous, the estimated regression coefficients will be biased.
- Omitted variables
The most common cause of endogeneity (i.e., violating Assumption 3) is omitted variable bias.
- If we leave an important predictor variable out of our equation, some modeled predictors will become endogenous and their estimated regression slopes will be biased.
- The omitted variable must be correlated with Y and at least one of the modeled Xp, to be a problem.
Assume the following is the true regression model. \[Y=\beta_0+\beta_1X+\beta_2Z+\epsilon\]
Now, suppose we omit Z from the model: \[Y=\beta_0+\beta_1X+\omega\] \[\omega = \epsilon + \beta_2Z\]
Our new error, 𝜔, is a combination of the true error, 𝜀, and the omitted term, \(\beta_2Z\)
- Consequently, if X and Z are correlated, omitting Z induces a correlation between X and 𝜔 (i.e., endogeneity).
Omitted variable bias can have severe consequences, but you can’t really test for it.
- The errors are correlated with the predictors, but our model is estimated under the assumption of exogeneity, so the residuals from our model will generally be uncorrelated with the predictors.
- We mostly have to pro-actively work to include all relevant variables in our model.
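A small simulation sketch (all numbers hypothetical) showing the bias: the slope of X is recovered when Z is included but not when Z is omitted.
set.seed(1)
n <- 1e5
z <- rnorm(n)
x <- 0.5 * z + rnorm(n)            # X and Z are correlated
y <- 1 + 2 * x + 3 * z + rnorm(n)  # true effect of X is 2

coef(lm(y ~ x + z))["x"] # close to 2
coef(lm(y ~ x))["x"]     # noticeably larger than 2 (omitted-variable bias)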
7.1.4. 4 The errors have constant, finite variance (heteroscedasticity)
- \(Var(\epsilon_n)=\sigma^2<\infty\)
see spherical
- Consequences of heteroscedasticity
Non-constant error variance will not bias the parameter estimates.
- The best fit line is still correct.
- Our measure of uncertainty around that best fit line is wrong.
Heteroscedasticity will bias standard errors (usually downward).
- Test statistics will be too large.
- CIs will be too narrow.
- We will have inflated Type I error rates.
- Treating heteroscedasticity
- Transform your outcome using a concave function (e.g., ln(Y), √Y).
- These transformations will shrink extreme values more than small/moderate ones.
- It’s usually a good idea to first shift the variable’s scale by setting the minimum value to 1.
- Refit the model using weighted least squares.
- Create inverse weights using functions of the residual variances or quantities highly correlated therewith.
- Use a Heteroscedasticity Consistent (HC) estimate of the asymptotic covariance matrix.
- Robust standard errors, Huber-White standard errors, Sandwich estimators
- HC estimators correct the standard errors for non-constant error variance.
## The 'sandwich' package provides several HC estimators:
library(sandwich)

## The 'lmtest' package provides fancy testing tools for linear models:
library(lmtest)

## Use a sandwich estimator to compute the ACOV matrix:
hcCov <- vcovHC(out1)

## Test coefficients with robust SEs:
robTest <- coeftest(out1, vcov = hcCov)

## Test coefficients with default SEs:
defTest <- summary(out1)$coefficients
- Testing
We can check this assumption with a residual plot: non-constant error variance (heteroscedasticity) violates assumption 4.
7.1.5. 5 The errors are uncorrelated
\(\mathrm{Cov}(\epsilon_i,\epsilon_j)=0,\ i\ne j\)
see spherical
- Correlated Errors
Errors can become correlated in two basic ways:
- Serial dependence
- When modeling longitudinal data, the errors for a given observational unit can be correlated over time
- We can detect temporal dependence by examining the autocorrelation of the residuals
- Clustering
- Your data have some important, unmodeled, grouping structure
- Children nested within classrooms
- Romantic couples
- Departments within a company
- We can detect problematic level of clustering with the intraclass correlation coefficient (icc)
- We need to know the clustering variable to apply the icc
- Your data have some important, unmodeled, grouping structure
- Serial dependence
- Treating Correlated Errors
Serially dependent errors in a longitudinal model usually indicate an inadequate model.
- Your model is ignoring some important aspect of the temporal variation that is being absorbed by the error terms.
- Hopefully, you can add the missing component to your model.
Clustering can be viewed as theoretically meaningful or as a nuisance factor that needs to be controlled
If the clustering is meaningful, you should model the data using multilevel modelling
- Hierarchical linear regression
- Mixed models
- Random effects models
If the clustering is an uninteresting nuisance, you can use specialized HC variance estimators that deal with clustering
## Read in some data:
LeeBryk <- readRDS(paste0(dataDir, "lee_bryk.rds"))

## Estimate a linear regression model:
fit <- lm(math ~ ses + sector, data = LeeBryk)

## Calculate the residual ICC:
ICC::ICCbare(x = LeeBryk$schoolid, y = resid(fit))

[1] 0.07487712

## Robust tests:
coeftest(fit, vcov = vcovCL(fit, ~ schoolid))
# t test of coefficients:
#                Estimate Std. Error t value  Pr(>|t|)
# (Intercept)   11.79965    0.20318 58.0746 < 2.2e-16 ***
# ses            2.94860    0.12794 23.0475 < 2.2e-16 ***
# sectorprivate  1.93495    0.31717  6.1006 1.111e-09 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
7.1.6. 6 The errors are normally distributed.
\(\epsilon\sim N(0,\sigma^2)\)
If errors are non-normal, small-sample inferences may be biased.
- The justification for some tests and procedures used in regression analysis may not hold
- Test with Q-Q Plot
plot(out1, 2)
One of the best ways to evaluate the normality of the error distribution is with a Q-Q plot.
- Plot the quantiles of the residual distribution against the theoretically ideal quantiles.
- We can actually use a Q-Q Plot to compare any two distributions.
- Consequences of Violating Normality
In small samples, with fixed predictors, normally distributed errors imply normal sampling distributions for the regression coefficients.
- In large samples, the central limit theorem implies normal sampling distributions for the coefficients, regardless of the error distribution.
Prediction intervals require normally distributed errors.
- Confidence intervals for predictions share the same normality requirements as the coefficients’ sampling distributions.
Parameter estimates will not be fully efficient.
- Standard errors will be larger than they would have been with normally distributed errors.
- Treating Violations of Normality
We usually don’t need to do anything about non-normal errors
- The central limit theorem will protect our inferences.
We can use bootstrapping to get around the need for normality.
- Treat your sample as a synthetic population from which you draw many new samples (with replacement).
- Estimate your model in each new sample.
- The replicates of your estimated parameters generate an empirical sampling distribution that you can use for inference.
Bootstrapping can be used for inference on pretty much any estimable parameter, but it won’t work with small samples.
- Need to assume that your sample is representative of the population
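A minimal nonparametric bootstrap sketch for a regression slope, reusing the Cars93 example (assumes the MASS package for the data):
data(Cars93, package = "MASS")
set.seed(1)

bootSlopes <- replicate(1000, {
  idx <- sample(nrow(Cars93), replace = TRUE)
  coef(lm(Price ~ Horsepower, data = Cars93[idx, ]))["Horsepower"]
})

## Percentile bootstrap 95% CI for the slope:
quantile(bootSlopes, c(0.025, 0.975))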
7.1.7. Spherical errors (4+5)
The assumption of spherical erros combines assumptions 4 and 5 \(Var(\epsilon)=\sigma^2I_N\)
If the errors are not spherical, the standard errors will be biased.
- The estimated regression coefficients will be unbiased, though.
- But test statistics and standard errors can be biased
7.2. Regression Diagnostics
After fitting the model we check the assumptions; most assumptions can't be checked before fitting.
If some of the assumptions are violated, the inferences we make using the model may be wrong
- We need to check the tenability of our assumptions before leaning too heavily on the model estimates
These checks are called regression diagnostics
- Graphical visualizations
- Quantitative indices/measures
- Formal statistical tests
A lot of these tests require their own assumptions, which is a drawback.
7.2.1. Residual Plots
One of the most useful diagnostic graphics is the plot of residuals vs. predicted values.
We can create this plot by plotting the fitted lm() object in R.
out1 <- lm(Price ~ Horsepower, data = Cars93) plot(out1, 1)
7.3. Influential Observations
Influential observations contaminate analyses in two ways:
- Exert too much influence on the fitted regression model
- Invalidate estimates/inferences by violating assumptions
There are two distinct types of influential observations:
- Outliers
- Observations with extreme outcome values, relative to the other data.
- Observations with outcome values that fit the model very badly.
- High-leverage observations
- Observation with extreme predictor values, relative to other data
7.3.1. Outliers
Outliers can be identified by scrutinizing the residuals.
- Observations with residuals of large magnitude may be outliers.
- The difficulty arises in quantifying what constitutes a “large” residual.
If the residuals do not have constant variance, then we cannot directly compare them.
- We need to standardize the residuals in some way
We are specifically interested in externally studentized residuals.
We can’t simply standardize the ordinary residuals.
- Internally studentized residuals
- Outliers can pull the regression line towards themselves.
- The internally studentized residuals for outliers will be too small.
Begin by defining the concept of a deleted residual: \[ \hat \epsilon_{(n)}=Y_n-\hat Y_{(n)} \]
- \(\hat\epsilon_{(n)}\) quantifies the distance of \(Y_n\) from the regression line estimated after excluding the nth observation.
- \(\hat Y_{(n)}\) is the predicted value for the nth observation from the regression line estimated after excluding the nth observation
- Studentized Residuals
If we standardize the deleted residual, \(\hat\epsilon_{(n)}\), we get the externally studentized residual: \[t_{(n)}=\frac{\hat\epsilon_{(n)}}{SE_{\hat\epsilon_{(n)}}}\] The externally studentized residuals have two very useful properties:
- Each t(n) is scaled equivalently.
- We can directly compare different t(n) .
- The t(n) are Student’s t distributed.
- We can quantify outliers in terms of quantiles of the t distribution.
- |t(n) | > 3.0 is a common rule of thumb for flagging outliers.
Index plots of the externally studentized residuals can help spotlight potential outliers.
- Look for observations that clearly “stand out from the crowd”
rstudent(out1) %>% plot()
7.3.2. High-Leverage Points
We identify high-leverage observations through their leverage values.
- An observation’s leverage, \(h_n\), quantifies the extent to which its predictors affect the fitted regression model.
- Observations with \(X\) values very far from the mean, \(\bar X\), affect the fitted model disproportionately.
Index plots of the leverage values can help spotlight high-leverage points.
- Again, look for observations that clearly “stand out from the crowd.”
In simple linear regression, the nth leverage is given by: \[h_n=\frac 1 N + \frac{(X_n-\bar X)^2}{\sum^N_{m=1}(X_m-\bar X)^2}\]
hatvalues(out1) %>% plot()
7.3.3. Influential Points
Observations with high leverage or large (externally) studentized residuals are not necessarily influential.
- High-leverage observations tend to be more influential than outliers.
- The worst problems arise from observations that are both outliers and have high leverage.
Measures of influence simultaneously consider extremity in both X and Y dimensions.
- Observations with high measures of influence are very likely to cause problems.
All measures of influence use the same logic as the deleted residual.
- Compare models estimated from the whole sample to models estimated from samples excluding individual observations.
One of the most common measures of influence is Cook's Distance \[\text{Cook's } D_n=\frac{\sum^N_{m=1}(\hat Y_m-\hat Y_{m(n)})^2}{(P+1)\hat \sigma^2}=(P+1)^{-1}t^2_n\frac{h_n}{1-h_n}\]
Index plots of Cook’s distances can help spotlight the influential points.
- Look for observations that clearly “stand out from the crowd.”
cd <- cooks.distance(out1) plot(cd)
7.3.4. Removing Influential Observations
Observation number 28 was the most influential according to Cook’s Distance
(maxD <- which.max(cd))
28 |
## Exclude the influential case:
Cars93.2 <- Cars93[-maxD, ]

## Fit model with reduced sample:
out2 <- lm(Price ~ Horsepower, data = Cars93.2)

round(summary(out1)$coefficients, 6)
 | Estimate | Std. Error | t value | Pr(>|t|) |
(Intercept) | -1.398769 | 1.820016 | -0.768548 | 0.444152 |
Horsepower | 0.145371 | 0.011898 | 12.218325 | 0.000000 |
round(summary(out2)$coefficients, 6)
 | Estimate | Std. Error | t value | Pr(>|t|) |
(Intercept) | -2.837646 | 1.806418 | -1.570868 | 0.119722 |
Horsepower | 0.156750 | 0.011996 | 13.066942 | 0.000000 |
Removing the outlier did not have a massive effect on the slope, but the intercept and the t-values did shift somewhat
You could also remove the 2 most influential observations
(maxDs <- sort(cd) %>% names() %>% tail(2) %>% as.numeric())

## Exclude influential cases:
Cars93.2 <- Cars93[-maxDs, ]

## Fit model with reduced sample:
out2.2 <- lm(Price ~ Horsepower, data = Cars93.2)
7.3.5. Treating Influential Points
The most common way to address influential observations is simply to delete them and refit the model.
- This approach is often effective—and always simple—but it is not fool-proof.
- Although an observation is influential, we may not be able to justify excluding it from the analysis.
Robust regression procedures can estimate the model directly in the presence of influential observations.
- Observations in the tails of the distribution are weighted less in the estimation process, so outliers and high-leverage points cannot exert substantial influence on the fit.
- We can do robust regression with the rlm() function from the MASS package.
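A minimal sketch (assuming the MASS package and the Cars93 example from above):
library(MASS)

## Huber-type M-estimation via rlm():
rFit <- rlm(Price ~ Horsepower, data = Cars93)
summary(rFit)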
8. Generalized Linear Model & Logistic regression
8.1. General Linear model
So far we have only discussed general linear models:
\[Y = \beta_0 + \sum^P_{p=1} \beta_p X_p+\epsilon\]
All flavors of linear regression are general linear models
- Simple Linear regression, Multiple Linear Regression
- t-test, ANOVA, ANCOVA
- Multilevel linear regression models
We can break our model into pieces:
- \(\eta = \beta_0 + \sum^P_{p=1} \beta_p X_p\)
- \(Y = \eta + \epsilon\)
\(\epsilon \sim N(0,\sigma^2)\), so we can also write: \[Y\sim N(\eta,\sigma^2)\]
Where:
- \(\eta\) is the systematic component of the model (AKA, the linear predictor)
- The normal distribution, \(N(...,...)\) is the model’s random component
The purpose of general linear modeling is to build a model of the outcome’s mean \(\mu_Y\)
- In this case \(\mu_Y = \eta\)
- The systematic component defines the mean of \(Y\)
The random component quantifies variability around \(\mu_Y\)
- In the general linear model, we assume that this error variance follows a normal distribution
8.2. Generalized Linear Model
We can generalize the models we’ve been using in two important ways:
- Allow for random components other than the normal distribution
- Allow for more complicated relations between \(\mu_Y\) and \(\eta\)
- Allow a link function: \(g(\mu_Y)=\eta\)
In this way we obtain the class of generalized linear models (GLMs)
The random component in a GLM can be any distribution from the exponential family:
- Normal
- Binomial
- Poisson
- Many others…
The systematic component of a GLM is exactly the same as it is in general linear models.
- \(\eta = \beta_0 + \sum^P_{p=1} \beta_p X_p\)
In GLMs, \(\eta\) does not directly describe \(\mu_Y\)
- We first transform \(\mu_Y\) via a link function
The link function performs two important functions
- Linearize the association betweeen \(X\) and \(Y\)
- Nonlinear: \(X\rightarrow \mu_Y\)
- Linear: \(X\rightarrow g(\mu_Y)\)
- Allows GLMs for outcomes with restricted ranges without requiring any restrictions on the range of the \(\{X_p\}\)
- In many cases, \(\mu_Y\) has a restricted range:
- Counts: \(\mu_Y > 0\)
- Probabilities: \(\mu_Y \in [0,1]\)
- When correctly specified, \(g(\mu_Y)\) can take any value on the real line
Every GLM is built from three components
- The systematic component, \(\eta\)
- A linear function of the predictors, \(\{X_P\}\)
- Describes the association between \(X\) and \(Y\)
- The link function, \(g(\mu_Y)\)
- Linearizes the relation between \(X\) and \(Y\)
- Transforms \(\mu_Y\) so that it can take any value on the real line
- The random component, \(P(Y|g^{-1}(\eta))\)
- The distribution of the observed \(Y\)
- Quantifies the error variance around \(\eta\)
The General Linear Model is a special case of GLM
- Systematic component: \[\eta = \beta_0 + \sum^P_{p=1} \beta_p X_p\]
- Link function: \[\mu_Y=\eta\]
- Random component: \[Y\sim N(\eta, \sigma^2)\]
8.2.1. Generalized Linear Model in R
In R you can train a generalized linear model with glm
data(iris)

## Generalized linear model:
glmFit <- glm(Petal.Length ~ Petal.Width + Species,
              family = gaussian(link = "identity"),
              data = iris)
8.3. Logistic Regression
We want to use logistic regression for a classification task with a two-level outcome.
Using a linear model doesn't work well for this kind of task
- The link function ensures legal predicted values (between 0 and 1)
- The sigmoid curve implies fluctuation in the effectiveness of extra study time
- example below: More study time is most beneficial for students with around 5.5 hours of study
In logistic regression problems we are modeling binary data:
- \(Y\in\{1="Succes",0="failure"\}\)
The systematic component in our logistic regression model will be the binomial distribution.
The mean of the binomial distribution (with \(N=1\)) is the “succes” probability, \(\pi=P(Y=1)\)
- we are interested in modelling this succes probability, \(\mu_Y=\pi\):
\[g(\pi) = \beta_0 + \sum ^p_{p=1} \beta_p X_p\]
We use the logit link function to map the success probability, which is restricted to values between 0 and 1, onto the whole real line.
- Given \(\pi\), the odds of success are: \[O_s=\frac \pi {1-\pi}\]
- Because \(\pi \in [0,1]\), we know that \(O_s \ge 0\)
- We can take the natural log of the odds as the last step to fully map \(\pi\) to the real line \[\operatorname{logit}(\pi)=\ln\left(\frac \pi {1-\pi}\right)\]
Note that in the above function we map the probability to the real line, not the other way around.
- For the other way around use the logistic function:
\[\sigma(x)=\frac 1 {1+e^{-x}}=\frac {e^x} {1+e^x}\]
- Also called sigmoid or expit
- Is the inverse of the logit function
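As an aside, the logit and logistic functions are available in base R as qlogis() and plogis(), respectively; a quick check that they are inverses:
qlogis(0.75)         ## ln(0.75 / 0.25) = 1.0986...
plogis(qlogis(0.75)) ## recovers 0.75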
Our final logistic regression model is:
\[Y\sim \operatorname{Bin}(\pi,1)\] \[\operatorname{logit}(\pi)=\beta_0 + \sum^p_{p=1}\beta_pX_p\]
The fitted model can be represented as: \[\operatorname{logit}(\hat\pi)=\hat\beta_0 + \sum^p_{p=1}\hat\beta_pX_p\]
8.4. Interpreting logistic regression
We have a model:
\[\operatorname{logit}(\hat\pi)=\hat\beta_0 + \sum^p_{p=1}\hat\beta_pX_p\]
The fitted coefficients, \(\{\hat\beta_0, \hat\beta_p\}\), are interpreted in units of log odds.
Log odds do not lend themselves to interpretation.
- We can convert the effects back to an odds scale by exponentiation
- \(\hat\beta\) is in log-odds units, but \(e^{\hat\beta}\) is in odds units.
Note: for all of these interpretations, odds and probabilities are not the same thing!
Exponentiating the coefficients also converts the additive effects to multiplicative effects.
- We can interpret \(\hat\beta\) as we would in linear regression:
- A unit change in \(X_p\) produces an expected change of \(\hat\beta\) units in \(\operatorname{logit}(\pi)\)
- After exponentiation, however, unit changes in \(X_p\) imply multiplicative changes in the odds, \(O_s=\frac \pi {1-\pi}\)
- A unit change in \(X_p\) results in multiplying \(O_s\) by \(e^{\hat\beta_p}\)
Due to the confusing interpretations of the coefficients, we often focus on the valence of the effects.
- Additional study time is associated with increased odds of passing
- \(\hat\beta_p>0\) and \(e^{\hat\beta_p}>1\) both indicate increased odds of success
For example interpretations, see the lecture slides (slide 28 onward).
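As a minimal sketch of these conversions in R, assuming some fitted logistic regression object, say logFit (a hypothetical name for a glm(..., family = "binomial") fit):
coef(logFit)         ## coefficients in log-odds units
exp(coef(logFit))    ## the same effects in odds units (odds ratios)
exp(confint(logFit)) ## profile-likelihood CIs, exponentiated to the odds scale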
8.5. Model Comparison for logistic regression
We don’t have an \(R^2\) statistic for logistic regression models, so we need to use a likelihood ratio test to compare nested models.
anova(fit0, fit, test = "LRT") ## Analysis of Deviance Table ## ## Model 1: survived ~ age + sex ## Model 2: survived ~ age + sex + class ## Resid. Df Resid. Dev Df Deviance Pr(>Chi) ## 1 884 916.00 ## 2 882 801.59 2 114.41 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also use information criteria
AIC(fit0, fit) ## df AIC ## fit0 3 921.9989 ## fit 5 811.5940 BIC(fit0, fit) ## df BIC ## fit0 3 936.3624 ## fit 5 835.5333
8.6. Classification with logistic regression
Given a fitted logistic regression model, we can get predictions for new observations \(X'_p\)
Directly applying \(\hat\beta_0,\hat\beta_p\) to \(X'_p\) produces predictions on the scale of \(\eta\): \[\hat\eta ' = \hat\beta_0+\sum^p_{p=1}\hat\beta_pX'_p\]
By applying the inverse link function, \(g^{-1}()\), to \(\hat\eta'\), we get predicted success probabilities: \[\hat\pi'=g^{-1}(\hat\eta')\]
Once we have computed the predicted success probabilities, \(\hat\pi'\), we can use them to classify new observations.
By choosing a threshold on \(\hat\pi'\), say \(\hat\pi'=t\), we can classify the new observations as successes or failures: \[\hat Y'= \begin{cases} 1 &\text{ if }\hat\pi'\ge t \\ 0 &\text{ if }\hat\pi'< t \end{cases}\]
Note that the optimal threshold is not always \(0.5\).
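A minimal sketch of these prediction steps in R, assuming a fitted binomial glm called logFit and a new data frame newDat (both hypothetical names):
etaHat <- predict(logFit, newdata = newDat, type = "link")     ## predictions on the eta scale
piHat  <- predict(logFit, newdata = newDat, type = "response") ## predicted success probabilities
all.equal(piHat, plogis(etaHat))                               ## for the logit link, the inverse link is the logistic
yHat <- ifelse(piHat >= 0.5, 1, 0)                             ## classify with threshold t = 0.5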
9. Assumptions of Logistic Regression
9.1. 1 The model is linear in the parameters
- This is OK: \[\operatorname{logit}(\pi)=\beta_0 + \beta_1X + \beta_2Z + \beta_3X^2\]
- This is not: \[\operatorname{logit}(\pi)=\beta_0X^{\beta_1}\]
9.1.1. Checking Assumption 1
Assumption 1 implies a linear relation between continuous predictors and the logit of the success probability.
- We can basically evaluate the linearity assumption using the same methods we applied with linear regression.
car::crPlots(glmFit, terms = ~ age + fare)
9.2. 2 The predictor matrix is full rank
- \(N>P\)
- No \(X_p\) can be a linear combination of other predictors
9.2.1. Checking Assumption 2
The check is basically the same as for linear regression. Assumption 2 implies two conditions:
- \(N > P\)
- No severe (multi)collinearity among the predictors
We can quantify multicollinearity with the variance inflation factor (VIF).
car::vif(glmFit)
age | sex | fare |
1.031829 | 1.007699 | 1.026373 |
VIF > 10 indicates severe multicollinearity.
9.3. 3 The outcome is independently and identically binomially distributed
The distributional assumptions of logistic regression are not framed in terms of residuals; instead, we consider the entire outcome distribution.
This assumption fills the role played by all of the residual-related assumptions in linear regression.
\[Y_n \overset{iid}{\sim} \operatorname{Bin}(\hat\pi_n, 1)\] \[\hat\pi_n=\operatorname{logistic}\left( \hat\beta_0 + \sum^p_{p=1} \hat\beta_pX_{np} \right)\]
9.3.1. Checking Assumption 3
Assumption 3 implies several conditions.
- The outcome, \(Y\), is binary.
- The linear predictor, \(\eta\), can explain all the systematic trends in \(\pi\).
- No residual clustering after accounting for \(X\).
- No important variables omitted from \(X\).
We can easily check the first condition with summary statistics.
levels(titanic$survived)
“no” | “yes” |
table(titanic$survived)
no | yes |
545 | 342 |
If we have a non-binary, categorical outcome, we can use a different type of model.
- Multiclass nominal variables: multinomial logistic regression via nnet::multinom()
- Ordinal variables: proportional odds logistic regression via MASS::polr()
- Counts: Poisson regression via glm() with family = "poisson"
The binomial distribution is also appropriate for modeling the proportion of successes in N trials.
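For such aggregated data, the usual R idiom is to give glm() a two-column matrix of success and failure counts; a hedged sketch with hypothetical data aggDat containing the columns successes, failures, and a predictor x:
propFit <- glm(cbind(successes, failures) ~ x, family = binomial, data = aggDat)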
For the second condition, we can check for residual clustering by calculating the ICC using deviance residuals.
## Check for residual dependence induced by 'class': ICC::ICCbare(x = titanic$class, y = resid(glmFit, type = "deviance"))
0.1054665 |
9.4. Residuals
In logistic regression the outcome is binary, \(Y\in\{0,1\}\), but the parameter that we’re trying to model is continuous, \(\pi \in (0,1)\).
- Due to this mismatch in measurement levels, we don’t have a natural definition of a “residual” in logistic regression
The most basic residual is the raw residual, \(e_n\)
- The difference between the observed outcome value and the predicted probability
\[e_n=Y_n-\hat\pi_n\]
We train a model on the titanic dataset:
## Read the data: titanic <- readRDS(paste0(dataDir, "titanic.rds")) ## Estimate the logistic regression model: glmFit <- glm(survived ~ age + sex + fare, data = titanic, family = "binomial") ## Save the linear predictor estimates: titanic$etaHat <- predict(glmFit, type = "link")
Raw residual plot:
library(ggplot2) ## Calculate the raw residuals: titanic$e <- resid(glmFit, type = "response") ## Plot raw residuals vs. fitted ## linear predictor values: ggplot(titanic, aes(etaHat, e)) + geom_point() + geom_smooth() + theme_classic() + xlab("Linear Predictor") + ylab("Raw Residual")
Pearson residuals, \(r_n\), are scaled raw residuals \[r_n=\frac {e_n}{\sqrt{\hat\pi_n(1-\hat\pi_n)}}\]
## Calculate the Pearson residuals: titanic$r <- resid(glmFit, type = "pearson")
Deviance residuals, \(d_n\), are derived directly from the objective function used to estimate the model.
\[d_n=\operatorname{sign}(e_n)\sqrt{-2[Y_n\ln (\hat\pi_n)+(1-Y_n)\ln(1-\hat\pi_n)]} \]
The residual deviance, \(D\), is the sum of squared deviance residuals.
- Quantifies how well the model fits the data
\[D=\sum^N_{n=1}d^2_n\]
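In R (assuming the glmFit logistic regression for the titanic data fitted above), the deviance residuals and the residual deviance can be extracted directly:
d <- resid(glmFit, type = "deviance") ## deviance residuals
sum(d^2)                              ## sum of squared deviance residuals...
deviance(glmFit)                      ## ...equals the residual deviance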
9.5. Computational Considerations
In addition to the preceding statistical assumptions, we must satisfy three computational requirements that were not necessary in linear regression.
9.5.1. 1 Sufficient sample size
Logistic regression models are estimated with numerical methods, so we need larger samples than we would for linear regression models.
- The sample size requirements increase with model complexity.
Some suggested rules of thumb:
- 10 cases for each predictor (Agresti, 2018)
- \(N = 10P/\pi_0\) (Peduzzi, Concato, Kemper, Holford, & Feinstein, 1996)
- \(P\): Number of predictors
- \(\pi_0\): Proportion of the minority class
- \(N = 100 + 50P\) (Bujang, Omar, & Baharum, 2018)
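For example, with \(P = 3\) predictors and a minority-class proportion of roughly \(0.39\) (as in the titanic data used below), these rules suggest minimum sample sizes of about \(30\), \(10 \times 3 / 0.39 \approx 77\), and \(100 + 50 \times 3 = 250\), respectively.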
9.5.2. 2 Outcome classes are sufficiently balanced
Logistic regression may not perform well when the outcome classes are severely imbalanced.
- e.g., many more observations in one class than in the other
with(titanic, table(survived) / length(survived))
no | yes |
0.6144307 | 0.3855693 |
We have a few possible solutions for problematic imbalance:
- Down-sampling the majority class
- Randomly remove some samples from the majority class
- Up-sampling the minority class
- Randomly select samples from the minority class to use multiple times (i.e., sample with replacement)
- Use weights when estimating the logistic regression model, via the weights argument in glm()
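A minimal sketch of the weighting approach, assuming the titanic data and the survival model used elsewhere in these notes (this particular weight definition is just one simple choice):
## Weight each case inversely to the frequency of its class:
w <- with(titanic, ifelse(survived == "yes", 1 / mean(survived == "yes"), 1 / mean(survived == "no")))
glmFitW <- glm(survived ~ age + sex + fare, data = titanic, family = "binomial", weights = w)
## R warns about non-integer counts for a binomial model, but the weighted fit still runs.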
9.5.3. 3 There is no perfect prediction
We don’t actually want to perfectly predict class membership.
- The model cannot be estimated when the classes are perfectly separable.
Why?
- Imagine the shape of the fitted logistic curve needed to separate the classes perfectly
- The predicted probability would have to stay near 0 and then jump to near 1 at the boundary
- To achieve that jump, the slope would have to be infinite, which obviously cannot be estimated
Model regularization (e.g., a ridge or LASSO penalty) may help; see glmnet::glmnet().
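A hedged sketch of a penalized fit with glmnet, assuming the titanic data and predictors used elsewhere in these notes (and complete data on those predictors):
library(glmnet)
x <- model.matrix(survived ~ age + sex + fare, data = titanic)[, -1] ## drop the intercept column
y <- titanic$survived
ridgeFit <- glmnet(x, y, family = "binomial", alpha = 0) ## alpha = 0: ridge; alpha = 1: LASSO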
9.6. Influential Cases
As with linear regression, we need to deal with any overly influential cases.
- We can use the linear predictor values to calculate Cook’s Distances.
- Any case that exerts undue influence on the linear predictor will have the same effect on the predicted success probabilities.
cooks.distance(glmFit) %>% plot()
plot(glmFit, 4)
9.7. Classification Performance Measures
9.7.1. Confusion Matrix
One of the most direct ways to evaluate classification performance is the confusion matrix.
library(magrittr) library(dplyr) ## Add predictions to the dataset: titanic %<>% mutate(piHat = predict(glmFit, type = "response"), yHat = as.factor(ifelse(piHat <= 0.5, "no", "yes"))) library(caret) cMat <- titanic %$% confusionMatrix(data = yHat, reference = survived) cMat$table
Reference | ||
Prediction | no | yes |
no | 458 | 106 |
yes | 87 | 236 |
Each cell in the confusion matrix represents a certain classification result:
- TP true positive: Correctly predicted positive
- TN true negative: Correctly predicted negative
- FP false positive: Falsely predicted positive
- FN false negative: Falsely predicted negative
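The headline statistics reported below are simple functions of these four counts, e.g.:
\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Precision} = \frac{TP}{TP + FP}\]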
Summaries/statistics of the confusion matrix:
cMat$overall
Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
7.824126e-01 | 5.359709e-01 | 7.537802e-01 | 8.091549e-01 | 6.144307e-01 | 7.471405e-27 | 1.950898e-01 |
cMat$byClass
Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall |
0.8403670 | 0.6900585 | 0.8120567 | 0.7306502 | 0.8120567 | 0.8403670 |
F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
0.8259693 | 0.6144307 | 0.5163472 | 0.6358512 | 0.7652127 |
9.7.2. TODO ROC Curve
A receiver operating characteristic (ROC) curve illustrates the diagnostic ability of a binary classifier for all possible values of the classification threshold.
- The ROC curve plots sensitivity against specificity (equivalently, the true-positive rate against the false-positive rate) across all threshold values.
The area under the ROC curve (AUC) is a one-number summary of the potential performance of the classifier.
- The AUC does not depend on the classification threshold.
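The rocData object used below can be constructed with the pROC package from the observed outcomes and the predicted probabilities added to titanic above; a minimal sketch:
library(pROC)
rocData <- roc(response = titanic$survived, predictor = titanic$piHat)
plot(rocData) ## the ROC curve across all thresholds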
pROC::auc(rocData)
Area under the curve: 0.8298 |
Rule of thumb for AUC according to Mandrekar (2010):
- AUC value from 0.7 – 0.8: Acceptable
- AUC value from 0.8 – 0.9: Excellent
- AUC value over 0.9: Outstanding
9.7.3. Threshold Selection
We can use numerical methods to estimate an optimal threshold value
library(OptimalCutpoints) ocOut <- optimal.cutpoints(X = "piHat", status = "survived", tag.healthy = "no", data = titanic, method = "ROC01" ) partSummary(ocOut, -1)
Area under the ROC curve (AUC): 0.83 (0.802, 0.858)
CRITERION: ROC01
Number of optimal cutoffs: 1
Estimate | |
cutoff | 0.2360978 |
Se | 0.7543860 |
Sp | 0.7761468 |
PPV | 0.6789474 |
NPV | 0.8343195 |
DLR.Positive | 3.3700029 |
DLR.Negative | 0.3164531 |
FP | 122.0000000 |
FN | 84.0000000 |
Optimal criterion | 0.1104365 |
9.7.4. Cross-Entropy Error
Not exam material, but it might be useful for the practical.
Measuring classification performance from a confusion matrix can be problematic.
- Sometimes too coarse.
We can also base our error measure on the residual deviance with the Cross-Entropy Error: \[CEE = -N^{-1} \sum^N_{n=1} \left[ Y_n\ln(\hat\pi_n) + (1 - Y_n) \ln(1 -\hat\pi_n) \right]\]
- The CEE is sensitive to classification confidence.
- Stronger predictions are more heavily weighted
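A minimal sketch of computing the CEE in R, assuming the glmFit model and titanic data used above:
piHat <- predict(glmFit, type = "response")          ## predicted success probabilities
y <- as.numeric(titanic$survived == "yes")           ## observed outcomes as 0/1
cee <- -mean(y * log(piHat) + (1 - y) * log(1 - piHat))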
The misclassification rate is a naïvely appealing option.
- The proportion of cases assigned to the wrong group
Consider two perfect classifiers:
- \(P(\hat Y_n = 1 \mid Y_n = 1) = 0.90,\quad P(\hat Y_n = 1 \mid Y_n = 0) = 0.10,\quad n = 1, 2, \dots, N\)
- \(P(\hat Y_n = 1 \mid Y_n = 1) = 0.55,\quad P(\hat Y_n = 1 \mid Y_n = 0) = 0.45,\quad n = 1, 2, \dots, N\)
Both of these classifiers will have the same misclassification rate.
- Neither model ever makes an incorrect group assignment.
The first model will have a lower CEE.
- The classifications are made with higher confidence.
- \(CEE_1 = 0.105\), \(CEE_2 = 0.598\)
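These values follow directly from the definition: every case contributes \(-\ln(0.90)\) to the first classifier’s CEE and \(-\ln(0.55)\) to the second’s, so \(CEE_1 = -\ln(0.90) \approx 0.105\) and \(CEE_2 = -\ln(0.55) \approx 0.598\).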