Fundamental Techniques in Data Science with R Notes

Table of Contents

1. R basics

1.1. Getting help

You can get help for a function (e.g., anova()) using

?anova
help(anova)

1.2. CRAN and packages

CRAN is a repo for R packages

install.packages("mice")
# There are two ways to load a package into R:
library(stats)
require(stats)
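
A quick way to see the difference (a minimal sketch; "notARealPackage" is a made-up name): library() throws an error when the package is missing, while require() returns FALSE with a warning.

## library(notARealPackage)  # Error: there is no package called 'notARealPackage'

## require() returns FALSE (with a warning) instead of an error:
ok <- require(notARealPackage)
ok
# [1] FALSE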

1.3. R Data formats and importing/writing data

1.3.1. Native data formats

## Load the built-in 'bfi' data from the 'psychTools' package
data(bfi, package = "psychTools")

## Access the documentation for the 'bfi' data
?psychTools::bfi

## Define the directory holding our data
dataDir <- "../../../data/"

## Load the 'boys' data from the R workspace
## '../../../data/boys.RData'
load(paste0(dataDir, "boys.RData"))
save(boys, file = paste0(dataDir, "tmp.RData"))

## Load the 'titanic' data stored in R data set
## '../../../data/titanic.rds'
titanic <- readRDS(paste0(dataDir, "titanic.rds"))
saveRDS(titanic, file="titanic2.rds")

1.3.2. Delimited Data types (csv etc.)

## Load the 'diabetes' data from the tab-delimited file
## '../../../data/diabetes.txt'
diabetes <- read.table(paste0(dataDir, "diabetes.txt"),
    header = TRUE,
    sep = "\t")

write.table(boys,
    paste0(dataDir, "boys.txt"),
    row.names = FALSE,
    sep = "\t",
    na = "-999")

## Load the 2017 UTMB data from the comma-separated file
## '../../../data/utmb_2017.csv'
utmb1 <- read.csv(paste0(dataDir, "utmb_2017.csv"))

write.csv2(boys, paste0(dataDir, "boys.csv"), row.names = FALSE, na = "")
  • The read.csv() function assumes the values are separated by commas
  • For EU-formatted CSV files (semicolon separators and decimal commas), we can use read.csv2() and write.csv2()

1.3.3. SPSS Data (SAV)

Reading this format is tricky. If we want to read SAV files, there are two popular options:

  • foreign::read.spss()
  • haven::read_spss()

foreign does not have a straightforward write equivalent (see the note on writing SPSS data below)

## Load the foreign package:
library(foreign)

## Use foreign::read.spss() to read '../../../data/mtcars.sav' into a list
mtcars1 <- read.spss(paste0(dataDir, "mtcars.sav"))

## Read '../../../data/mtcars.sav' as a data frame
mtcars2 <- read.spss(paste0(dataDir, "mtcars.sav"), to.data.frame = TRUE)

## Read '../../../data/mtcars.sav' without value labels
mtcars3 <- read.spss(paste0(dataDir, "mtcars.sav"),
    to.data.frame = TRUE,
    use.value.labels = FALSE)
library(labelled)

## Use haven::read_spss() to read '../../../data/mtcars.sav' into a tibble
mtcars4 <- read_spss(paste0(dataDir, "mtcars.sav"))

haven::read_spss() converts any SPSS variables with value labels into labelled vectors

  • We can use the labelled::unlabelled() function to remove the value labels
  • e.g., a label shown truncated as "V-Sha~" in the tibble becomes the full factor level "V-Shaped"

The best option we have for writing SPSS data is haven::write_sav()

write_sav(mtcars2, paste0(dataDir, "mtcars2.sav"))

This will preserve label information provided by factor variables and the haven_labelled class

1.3.4. Excel (xlsx)

We have two good options for loading data from Excel spreadsheets:

  • readxl::read_excel()
  • openxlsx::read.xlsx()
## Load the packages:
library(readxl)
library(openxlsx)

## Use the readxl::read_excel() function to read the data from the 'titanic'
## sheet of the Excel workbook stored at '../../../data/example_data.xlsx'
titanic2 <- read_excel(paste0(dataDir, "example_data.xlsx"),
    sheet = "titanic")

## Use the openxlsx::read.xlsx() function to read the data from the 'titanic'
## sheet of the Excel workbook stored at '../../../data/example_data.xlsx'
titanic3 <- read.xlsx(paste0(dataDir, "example_data.xlsx"),
    sheet = "titanic")

The openxlsx package provides a powerful toolkit for programmatically building Excel workbooks in R and saving the results.

## Use the openxlsx::write.xlsx() function to write the 'diabetes' data to an
## XLSX workbook
write.xlsx(diabetes, paste0(dataDir, "diabetes.xlsx"), overwrite = TRUE)

## Use the openxlsx::write.xlsx() function to write each data frame in a list
## to a separate sheet of an XLSX workbook
write.xlsx(list(titanic = titanic, diabetes = diabetes, mtcars = mtcars),
    paste0(dataDir, "example_data.xlsx"),
    overwrite = TRUE)

1.4. R Functions

Any R command written as a word followed by parentheses is a function

  • mean()
  • library()

Infix operators are aliased functions

  • <-
  • + - *
  • > < ==
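
For example, an infix operator can be called like an ordinary function by wrapping it in backticks (a minimal illustration):

`+`(2, 3)   # same as 2 + 3
# [1] 5
`<-`(x, 10) # same as x <- 10
x
# [1] 10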

1.4.1. User-Defined Functions

We can define our own functions using the “function()” function

square <- function(x) {
    out <- x^2
    out
}

square(5)
25

One-line functions don’t need braces:

square <- function(x) x^2
square(5)
25
  1. Not strictly typed

    Function arguments are not strictly typed

    square <- function(x) x^2
    square(TRUE) # 1
    square(1:5)
    
    1
    4
    9
    16
    25

    But there are limits:

    square("bob")
    

    Error in x^2 : non-numeric argument to binary operator

  2. Multiple/list arguments

    Multiple arguments:

    mod <- function(x, y) x %% y
    mod(10, 3)
    

    Or you can use a list to get multiple arguments:

    getLsBeta <- function(datList) {
        X <- datList$X
        y <- datList$y
        solve(crossprod(X)) %*% t(X) %*% y
    }
    
    X <- matrix(runif(500), ncol = 5)
    datList <- list(y = X %*% rep(0.5, 5), X = X)
    getLsBeta(datList = datList)
    
  3. Type/class of functions

    Functions are first-class objects in R.

    • We can treat functions like any other R object.

    R views an unevaluated function as an object with type ”closure”.

    class(getLsBeta)
    

    [1] “function”

    typeof(getLsBeta)
    

    [1] “closure”

    An evaluated function is equivalent to the object it returns.

    class(getLsBeta(datList))
    

    [1] “matrix” “array”

    typeof(getLsBeta(datList))
    

    [1] “double”

1.5. Iteration

1.5.1. Loops

There are three types of loops in R: for, while, and repeat. This course only covers for loops.

for(INDEX in RANGE) { Stuff To Do with the Current INDEX Value }

examples:

val <- 0
for(i in 1:100) {
    val <- val + i
}
val
5050

This loop will compute the mean of every column in the mtcars data

means <- rep(0, ncol(mtcars))
for(j in 1:ncol(mtcars)) {
    means[j] <- mean(mtcars[ , j])
}
means
listThing <- list(6,9,4,2)
sum <- 0
for(i in listThing) {
    sum <- i + sum
}
sum
21

Usually there is a function to achieve what you are trying to do with the loop (sum() for example)
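
For example, each of the loops above has a one-line built-in equivalent:

sum(1:100)              # 5050, same as the first loop
colMeans(mtcars)        # column means, same as the second loop
sum(unlist(listThing))  # 21, same as the list loop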

1.5.2. Apply

Usually one of two forms:

apply(DATA, MARGIN, FUNCTION, ...)
sapply(DATA, FUNCTION, ...) # apply() itself always needs MARGIN; this second form is the lapply()/sapply() signature
  1. Margin

    Margin is how the function should be applied: E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.

    data(bfi, package = "psychTools")
    dat1 <- bfi[1:5,1:3]
    # | 2 | 4 | 3 |
    # | 2 | 4 | 5 |
    # | 5 | 4 | 5 |
    # | 4 | 4 | 6 |
    # | 2 | 3 | 3 |
    # sum over rows with above table
    apply(dat1, 1, sum)
    
    9
    11
    14
    14
    8
    data(bfi, package = "psychTools")
    dat1 <- bfi[1:5,1:3]
    # subtract 1 from every cell
    apply(dat1, 1:2, function(x) x - 1)
    
    1 3 2
    1 3 4
    4 3 4
    3 3 5
    1 2 2
  2. lapply and sapply

    similar concept but works on lists

    list1 <- list(1,3,4,2,4)
    lapply(list1, function(x) x+1)
    
    2 4 5 3 5

    sapply is a user-friendly wrapper for lapply which by default returns a vector or matrix (it simplifies the result when possible)
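
    A quick comparison (minimal sketch):

    list1 <- list(1, 3, 4, 2, 4)

    lapply(list1, function(x) x + 1)  # returns a list
    sapply(list1, function(x) x + 1)  # simplifies to a numeric vector
    # [1] 2 4 5 3 5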

1.6. Some tips/ style guide

  • While using RStudio, use TAB to quickly access the documentation of a function's arguments
  • Use the tidyverse style guide
    • Styling can be applied automatically in RStudio -> Addins -> Style selection (via the styler package)

2. R Objects & Data Types

2.1. Vectors

Vectors are the simplest kind of R object; there is no concept of a “scalar” in R
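
For example, what looks like a scalar is just a length-one vector:

x <- 5
is.vector(x)  # TRUE
length(x)     # 1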

2.1.1. Atomic modes

Vectors come in one of six “atomic modes”:

  • numeric/double
  • logical -> bool
  • character
  • integer
  • complex
    • A complex value in R is defined via the pure imaginary value i.
  • raw
    • Intended to hold raw bytes

2.1.2. Generating vectors

c(1, 2, 3)
1
2
3
1:5
1
2
3
4
5
1.2:5.3
1.2
2.2
3.2
4.2
5.2
rep(33,4)
33
33
33
33
rep(1:3,3)
1
2
3
1
2
3
1
2
3
rep(1:3,each=2)
1
1
2
2
3
3
seq(0, 1, 0.25)
0
0.25
0.5
0.75
1

2.1.3. Combining data types in vectors?

Not possible; when types are mixed, R coerces all elements to a common type (here, character)

a <- c(1,2,3)
b <- c("foo", "bar")
c(a,b)

[1] “1” “2” “3” “foo” “bar”

2.2. Matrices

Matrices generalize vectors by adding a dimension attribute.

2.2.1. Generating Matrices

matrix(1:5, nrow = 5, ncol = 2)
1 1
2 2
3 3
4 4
5 5
cbind(c(1,2,3),c(4,5,6))
1 4
2 5
3 6
m1 <- matrix(1:5, nrow = 5, ncol = 2)
attributes(m1)
$dim
[1] 5 2

Matrices are populated in column-major order, by default

matrix(1:9, 3, 3)
1 4 7
2 5 8
3 6 9

The byrow = TRUE option allows us to fill by row-major order

matrix(1:9, 3, 3, byrow = TRUE)
1 2 3
4 5 6
7 8 9

2.2.2. Mixing data types

Like vectors, matrices can only hold one type of data

2.3. Lists

Lists are the workhorse of R data objects

  • An R list can hold an arbitrary set of other R objects

In lists data types CAN be mixed

2.3.1. Creating a list

list(1,2, "HI mom", TRUE)
1 2 HI mom TRUE

2.3.2. Named list elements

l3 <- list(name = "bob",
    alive = TRUE,
    age = 33,
    relationshipStatus = 42+3i)

l3$name
bob

Names can also be assigned post hoc via the names() function

l1 = list(3, "hi", TRUE)
names(l1) <- c("name1", "second", "last")
l1$second
hi

2.3.3. Append onto list

(l4 <- list())
l4$people <- c("Bob", "Alice", "Suzy")
l4$money <- 0
l4$logical <- FALSE
l4

$people [1] “Bob” “Alice” “Suzy” $money [1] 0 $logical [1] FALSE

2.4. Data Frames

R’s way of storing rectangular data sets

  • Each column of a data frame is a vector
  • Each of these vectors can have a different type:

Data frames are actually lists of vectors (representing the columns)
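
A small illustration that a data frame really is a list of column vectors (minimal sketch):

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
is.list(df)  # TRUE
typeof(df)   # "list"
df$y         # list-style access to a column
# [1] "a" "b" "c"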

2.4.1. Creating a dataframe

We can create one using data.frame()

data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
1 -1 0.1
2 1 0.2
3 -1 0.3
4 1 0.4
5 -1 0.5
6 1 0.6

Columns can also be given names:

data.frame(x = 1:6, y = c(-1, 1), z = seq(0.1, 0.6, 0.1))

x y z 1 1 -1 0.1 2 2 1 0.2 3 3 -1 0.3 4 4 1 0.4 5 5 -1 0.5 6 6 1 0.6

Creating from matrix

data.frame(matrix(5, 4, 3))
5 5 5
5 5 5
5 5 5
5 5 5

2.4.2. Filling using sample or runif

sample() draws a vector of a given length by randomly sampling (with or without replacement) from another vector

runif() generates a vector of uniformly distributed random numbers

data.frame(a = sample(c(TRUE, FALSE), 5, replace = TRUE),
    b = sample(c("foo", "bar"), 5, replace = TRUE),
    c = runif(5)
    )
TRUE foo 0.640138026094064
TRUE bar 0.0884113181382418
TRUE bar 0.099734982708469
TRUE bar 0.862016763305292
TRUE bar 0.991723588900641

2.5. Factors

Factors are R’s way of representing nominal (categorical) variables

  • We can create a factor using the factor() function

2.5.1. Creating factors

factor(sample(1:3, 8, TRUE), labels = c("red", "yellow", "blue"))
red
blue
red
blue
yellow
blue
yellow
yellow

2.5.2. Factor attributes

x <- factor(sample(1:3, 8, TRUE), labels = c("red", "yellow", "blue"))
attributes(x)

$levels
[1] “red” “yellow” “blue”

$class
[1] “factor”

2.5.3. Internal representation

Even though a factor’s data are represented by an integer vector, R does not consider factors to be integer/numeric data.
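
A quick check of the internal representation (minimal sketch, reusing the factor x from above):

typeof(x)      # "integer"  (internal storage)
is.numeric(x)  # FALSE      (R does not treat it as numeric)
as.integer(x)  # the underlying level codes
unclass(x)     # the codes plus the levels attribute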

3. Data Manipulation and pipes

3.1. Base R Subsetting

In Base R, we typically use three operators to subset objects

  • []
  • [[]]
  • $

Which we choose depends on:

  • What type of object are we trying to subset?
  • How much of the original object's structure (type/class) do we want to keep in the subset?

3.1.1. []

x <- 10:20
x[2]           # 11
x[1:3]         # 10 11 12
x[c(2, 5, 7)]  # 11, 14, 16
x[c(TRUE, FALSE, TRUE, FALSE, FALSE)] # the logical index is recycled
10
12
15
17
20

3.1.2. [[]]

Less commonly used for vectors; it does have some differences, though: it drops any names or dimnames attributes and only allows selection of a single element

x <- 10:20
x[[2]]
11

For lists and data frames, the [] method returns an object of the same class as the original (list or data.frame), while the [[]] method returns an object whose class is determined by the type of the extracted element.

3.1.3. Matrices

y <- matrix(1:24, 6, 4)
y[4:6, 2:3]
10 16
11 17
12 18

We can mix different selecting styles

y <- matrix(1:24, 6, 4)
y[c(2, 4), c(FALSE, TRUE, TRUE)] # the logical column index is recycled
8 14
10 16

3.1.4. $ and lists

We can use all three operators to subset lists.

list1 <- list(2, "noo", TRUE)
names(list1) <- c("first", "second", "last")

list1$first

list1[1]

list1[["first"]]
2
## With the named list from section 2.3.3:
l4 <- list(people = c("Bob", "Alice", "Suzy"), money = 0, logical = FALSE)

tmp1 <- l4[1]    # single brackets keep the list structure
tmp1$people
# [1] "Bob" "Alice" "Suzy"
class(tmp1)
# [1] "list"

tmp2 <- l4[[1]]  # double brackets extract the element itself
tmp2
# [1] "Bob" "Alice" "Suzy"
class(tmp2)
# [1] "character"

3.1.5. Data Frames

We can subset the columns of a data frame using list semantics

d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3$a
1
2
3
4
5
6
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3[1]
1
2
3
4
5
6

2023-11-23_16-18-08_screenshot.png

3.1.6. Logical indexing

Filter by rows

d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3[d3$a %% 2 == 0 & d3$c > 0, ]

Notice the comma at the end!

This comma is needed because R uses the following notation for subsetting data frames:

data[rows you want, columns you want]

2 1 0.2
4 1 0.4
6 1 0.6

3.2. Tidyverse subsetting

The dplyr package provides several ways to subset data; two functions are most frequently used:

  1. select
    library(dplyr)
    d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
    names(d3) <- c("a", "b", "c")
    select(d3, a, b)
    
    1 -1
    2 1
    3 -1
    4 1
    5 -1
    6 1
    library(dplyr)
    d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
    names(d3) <- c("a", "b", "c")
    select(d3, -a)
    
    -1 0.1
    1 0.2
    -1 0.3
    1 0.4
    -1 0.5
    1 0.6
  2. filter and logical indexing
    library(dplyr)
    d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
    names(d3) <- c("a", "b", "c")
    filter(d3, a %% 2 == 0, c > 0)
    
    2 1 0.2
    4 1 0.4
    6 1 0.6

    We can achieve the same using logical indexing in Base R

    d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
    names(d3) <- c("a", "b", "c")
    d3[d3$a %% 2 == 0 & d3$c > 0, ]
    

    Notice the comma at the end!

    2 1 0.2
    4 1 0.4
    6 1 0.6

3.3. Base R Variable Transformation

d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d4 <- d3
d4$a <- scale(d4$a)
d4$a <- as.numeric(d4$a)
d4
-1.33630620956212 -1 0.1
-0.801783725737273 1 0.2
-0.267261241912424 -1 0.3
0.267261241912424 1 0.4
0.801783725737273 -1 0.5
1.33630620956212 1 0.6

3.4. Tidyverse Variable Transformations

library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")

mutate(d3,
    d = rbinom(nrow(d3), 1, c),
    e = d * c,
    a = scale(a)
)
-1.33630620956212 -1 0.1 0 0
-0.801783725737273 1 0.2 0 0
-0.267261241912424 -1 0.3 1 0.3
0.267261241912424 1 0.4 1 0.4
0.801783725737273 -1 0.5 0 0
1.33630620956212 1 0.6 1 0.6

3.5. Sorting & Ordering

3.5.1. Base R sort

l3 <- runif(n=8, min=1, max=20)
sort(l3)
1.14997771941125
6.18364422139712
7.40294478787109
8.80276804696769
10.6937525619287
14.7475167061202
16.3595214642119
16.9985088445246
l3 <- runif(n=8, min=1, max=20)
sort(l3, decreasing = TRUE)
15.9437932423316
15.0901547158137
9.608802491799
4.20592033001594
4.16299939691089
2.97972352942452
2.14775081258267
1.15068965195678

3.5.2. Tidyverse Ordering

library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")

d3$d <- runif(n=6, min=1, max=20)
arrange(d3, d)
4 1 0.4 1.03251850209199
5 -1 0.5 2.67228518775664
3 -1 0.3 6.22691969852895
1 -1 0.1 8.31828686594963
6 1 0.6 12.4162582552526
2 1 0.2 15.5486617612187
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")

d3$d <- runif(n=6, min=1, max=20)
arrange(d3, a*d, c)
2 1 0.2 1.041215069592
1 -1 0.1 19.1222469233908
6 1 0.6 8.04311123443767
3 -1 0.3 18.1283924696036
4 1 0.4 15.9365976287518
5 -1 0.5 15.7623822011519

3.6. Pipes

The %>% symbol represents the pipe operator, used to compose functions

  • f(x) becomes x %>% f()
  • f(x, y) becomes x %>% f(y)
  • h(g(f(x))) becomes x %>% f %>% g %>% h

Note that this operator is not part of base R; it comes from the magrittr package and is re-exported by dplyr and the tidyverse, among others (since R 4.1 base R also provides a native pipe, |>)

3.6.1. Example use

Example use:

library(dplyr)

sqr <- function (x)
{
    x**2
}

plus1 <- function (x)
{
    x + 1
}

test <-
    (1:10) %>%
    sqr() %>%
    plus1()

test
2
5
10
17
26
37
50
65
82
101

This is equivalent to:

test <-
    plus1(
        sqr(
            (1:10)))

test

3.6.2. dot . in a pipeline, target specific argument

Specify where you want the piped value to go

library(dplyr)

power <- function (x, y)
{
    x**y
}

test <-
    (1:10) %>%
    power(3, y=.)

test
3
9
27
81
243
729
2187
6561
19683
59049

3.6.3. Exposition Pipe: %$%

The exposition pipe %$% exposes the names in the lhs to the rhs expression; in other words, it exposes the contents of an object to the next function in the pipeline. This is useful when functions do not have a built-in data argument.

library(magrittr) # %$% comes from magrittr (not dplyr)
data(cats, package = "MASS")

# all equivalent:

cats %$% t.test(Hwt ~ Sex)
cats %>% t.test(Hwt ~ Sex, data = .)
t.test(Hwt ~ Sex, data = cats)

4. Data Visualization

4.1. Base R Graphics

4.1.1. Scatterplots

We can create a basic scatterplot using the plot() function

plot(runif(20, 0, 100), runif(20, 0, 10))

2023-11-27_17-17-25_screenshot.png

diabetes %$% plot(
    y = tc,
    x = bmi,
    ylab = "Total Cholesterol",
    xlab = "Body Mass Index",
    main = "Relation between BMI and Cholesterol",
    ylim = c(0, 350),
    xlim = c(0, 50)
    )

2023-11-27_17-19-43_screenshot.png

4.1.2. Histogram

We can create a simple histogram with the hist() function

hist(runif(30,0,10))

2023-11-27_17-21-52_screenshot.png

4.1.3. Boxplots

We can create simple boxplots via the boxplot function

boxplot(runif(30,40,60),runif(60,10,50))

2023-11-27_17-24-41_screenshot.png

4.1.4. Kernel Density Plots

density(runif(5,0,100)) %>% plot()

2023-11-27_17-27-25_screenshot.png

4.2. GGPlot graphics

Base R graphics are fine for quick and dirty visualizations, but for publication-quality graphics you should probably use ggplot2

  • GGPlot uses the ”grammar of graphics” and ”tidy data” to build up a figure from modular components.

4.2.1. Grammar of graphics

2023-11-27_17-29-27_screenshot.png

4.2.2. Basic Setup (Data and aesthetic)

We start by initializing the object that will define our plot

  • This is regardless of what we want to plot
  • We must specify a data source and an aesthetic mapping via the aes() function
library(ggplot2)
p1 <- ggplot(data = diabetes, mapping = aes(x = bmi, y = glu))
p1

This doesn’t make any plot yet

2023-11-27_17-34-05_screenshot.png

4.2.3. Geometries (line, dots etc)

  1. Line and dots
    p1 + geom_point() + geom_line()
    

    2023-11-27_17-36-34_screenshot.png

  2. Histogram
    p2 <- ggplot(diabetes, aes(tc))
    p2 + geom_histogram()
    

    2023-11-27_17-47-28_screenshot.png

  3. Density
    p2 + geom_density()
    

    2023-11-27_17-49-57_screenshot.png

  4. Boxplot
    p2 + geom_boxplot()
    

    2023-11-27_17-52-44_screenshot.png

    p3 <- ggplot(diabetes, aes(sex, bmi))
    p3 + geom_boxplot()
    

    2023-11-27_17-53-59_screenshot.png

4.2.4. Statistics

We can also add statistical summaries of the data.

p1 + geom_point() + geom_smooth()

2023-11-27_17-56-05_screenshot.png

p1 + geom_point() + geom_smooth(method = "lm")

2023-11-27_17-58-08_screenshot.png

4.2.5. Styling

Setting style arguments outside of aes() applies the styling to the whole layer, rather than mapping it to the data

  1. Jitter
    p5 <- ggplot(titanic, aes(age, survived))
    p5 + geom_jitter(color = "blue", size = 2, height = 0.1)
    

    2023-11-27_18-02-02_screenshot.png

4.2.6. Themes

We can apply canned themes to adjust a plot's overall appearance. Default theme:

p1.1 <- p1 + geom_point()

2023-12-29_12-20-35_screenshot.png

p1.1 + theme_classic()

2023-12-29_12-22-38_screenshot.png

You can also modify individual themes

p1.1 + theme_classic() +
    theme(axis.title = element_text(size = 16,
                                    family = "serif",
                                    face = "bold",
                                    color = "blue"),
        aspect.ratio = 1)

4.2.7. Facets

Faceting allow us to make arrays of conditional plots.

p7 <- ggplot(titanic, aes(age, survived, color = class)) +
    geom_jitter(height = 0.05) +
    geom_smooth(method = "glm",
                method.args = list(family = "binomial"),
                se = FALSE)
p7 + facet_wrap(vars(sex))

2023-12-29_12-34-35_screenshot.png

p8 <- ggplot(titanic, aes(age, survived)) +
    geom_jitter(height = 0.05) +
    geom_smooth(method = "glm",
                method.args = list(family = "binomial"),
                se = FALSE)
p8 + facet_grid(vars(sex), vars(class))

2023-12-29_12-38-20_screenshot.png

4.2.8. Joining multiple Figures

If we want to paste several different plots into a single figure (without faceting), we can use the utilities in the gridExtra package.

library(gridExtra)

grid.arrange(p1 + geom_point(),
             p3 + geom_boxplot(),
             p1 + geom_line(),
             p8 + facet_grid(vars(sex), vars(class)),
             ncol = 2)

2023-12-29_12-43-52_screenshot.png

5. Data Cleaning

When we receive new data, they are generally messy and contaminated by various anomalies and errors

One of the first steps in processing a new set of data is cleaning

By cleaning the data, we ensure a few properties:

  • The data are in an analyzable format.
  • All data take legal values.
  • Any outliers are located and treated.
  • Any missing data are located and treated.

5.1. Missing data

Empty cells in a dataset where there should be observed values

  • The missing cells correspond to true population values, but we haven’t observed those values

Not every empty cell is a missing datum

  • Quality-of-life ratings for dead patients in a mortality study
  • Firm profitability after the company goes out of business
  • Self-reported severity of menstrual cramping for men
  • Empty blocks of data following “gateway” items

5.1.1. Missing Data Pattern

Missing data patterns represent unique combinations of observed and missing items

  • \(P\) items -> \(2^P\) possible patterns

2023-12-29_14-07-36_screenshot.png

  1. example/ in r
    ## Load the mice library (and ggmice, which provides plot_pattern()):
    library(mice)
    library(ggmice)
    
    ## Compute missing data patterns:
    pats <- md.pattern(boys)
    
    pats
    plot_pattern(boys, rotate = TRUE)
    
      age reg wgt hgt bmi hc gen phb tv  
    223 1 1 1 1 1 1 1 1 1 0
    19 1 1 1 1 1 1 1 1 0 1
    1 1 1 1 1 1 1 1 0 1 1
    1 1 1 1 1 1 1 0 1 0 2
    437 1 1 1 1 1 1 0 0 0 3
    43 1 1 1 1 1 0 0 0 0 4
    16 1 1 1 0 0 1 0 0 0 5
    1 1 1 1 0 0 0 0 0 0 6
    1 1 1 0 1 0 1 0 0 0 5
    1 1 1 0 0 0 1 1 1 1 3
    1 1 1 0 0 0 0 1 1 1 4
    1 1 1 0 0 0 0 0 0 0 7
    3 1 0 1 1 1 1 0 0 0 4
      0 3 4 20 21 46 503 503 522 1622

    2023-12-29_14-09-04_screenshot.png

5.1.2. Nonresponse Rates

  1. Proportion missing
    • The proportion of cells containing missing data
    • Should be computed for each variable, not for the entire dataset
    ## Load the mice library:
    library(mice)
    
    ## Load some example data:
    data(boys, package = "mice")
    ## Compute variable-wise proportions missing:
    mMat <- is.na(boys)
    mMat %>% colMeans() %>% round(3)
    
    age hgt wgt bmi hc gen phb tv reg
    0.000 0.027 0.005 0.028 0.061 0.672 0.672 0.698 0.004
    ## Compute observation-wise proportions missing:
    pmRow <- rowMeans(mMat)
    ## Summarize the above:
    range(pmRow)
    # [1] 0.0000000 0.7777778
    
    pmRow[pmRow > 0] %>% range()
    # [1] 0.1111111 0.7777778
    
    median(pmRow)
    # [1] 0.3333333
    
    ## Compute the proportion of complete cases:
    mean(pmRow == 0)
    # [1] 0.2981283
    
  2. Attrition Rate
    • The proportion of participants that drop-out of a study at each measurement occasion
  3. Covariance Coverage
    • The proportion of cases available to estimate a given pairwise relationship
      • e.g., a covariance between two variables
    • Very important to have adequate coverage for the parameters you want to estimate

    We can use routines from the mice package to calculate covariance coverage and response patterns.

    ## Compute the covariance coverage:
    cc <- md.pairs(boys)$rr / nrow(boys)
    ## Check the result:
    round(cc, 2)
    
      age hgt wgt bmi hc gen phb tv reg
    age 1.00 0.97 0.99 0.97 0.94 0.33 0.33 0.3 1.00
    hgt 0.97 0.97 0.97 0.97 0.92 0.32 0.32 0.3 0.97
    wgt 0.99 0.97 0.99 0.97 0.94 0.32 0.32 0.3 0.99
    bmi 0.97 0.97 0.97 0.97 0.91 0.32 0.32 0.3 0.97
    hc 0.94 0.92 0.94 0.91 0.94 0.33 0.33 0.3 0.93
    gen 0.33 0.32 0.32 0.32 0.33 0.33 0.33 0.3 0.33
    phb 0.33 0.32 0.32 0.32 0.33 0.33 0.33 0.3 0.33
    tv 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.3 0.30
    reg 1.00 0.97 0.99 0.97 0.93 0.33 0.33 0.3 1.00

5.1.3. Visualizing Incomplete Data

The ggmice package provides some nice ways to visualize incomplete data and objects created during missing data treatment.

library(ggmice); library(ggplot2)
ggmice(boys, aes(wgt, hgt)) + geom_point()

2023-12-29_15-44-49_screenshot.png

We can also create a nicer version of the response pattern plot.

plot_pattern(boys, rotate = TRUE)

2023-12-29_14-09-04_screenshot.png

The naniar package also provides some nice visualization and numerical summary routines for incomplete data.

naniar::gg_miss_upset(boys)

2023-12-29_16-04-49_screenshot.png

5.2. Outliers

We only consider univariate (i.e., only involving one variable) outliers

  • Extreme values with respect to the distribution of a variable’s other observations
    • A human height measurement of 3 meters
    • A student with an income of $250,000
  • Not accounting for any particular model

A univariate outlier may, or may not be an illegal value

  • Data entry errors are probably the most common cause
  • Outliers can also be legal, but extreme, values

Key Point: We choose to view an outlier as arising from a different population than the one to which we want to generalize our findings

5.2.1. Finding outliers: Boxplot Method

Tukey (1977) described a procedure for flagging potential outliers based on a box-and-whiskers plot.

  • Does not require normally distributed X
  • Not sensitive to outliers

A fence is an interval defined as the following function of the first quartile, the third quartile, and the interquartile range \((IQR=Q_3-Q_1)\): \[F=\{Q_1-C\times IQR,\ Q_3+C\times IQR\}\]

  • Taking C = 1.5 produces the inner fence
  • Taking C = 3.0 produces the outer fence

We can use these fences to identify potential outliers:

  • Any value that falls outside of the inner fence is a possible outlier
  • Any value that falls outside of the outer fence is a probable outlier
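
A minimal sketch of computing these fences by hand (assuming a numeric vector x; boxplot.stats() in the next subsection does this for you):

x   <- c(rnorm(100), 10)  # toy data with one extreme value
q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1

innerFence <- c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outerFence <- c(q1 - 3.0 * iqr, q3 + 3.0 * iqr)

## Flag possible outliers (outside the inner fence):
x[x < innerFence[1] | x > innerFence[2]]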

5.2.2. Boxplot in R

## Find potentially outlying cases:
(out <- boys %$% boxplot.stats(bmi, coef = 3)$out)
30.62 31.34 31.74
## Which observations are potential outliers?
boys %$% which(bmi %in% out)
574 668 733
## View the potentially outlying cases:
boys %>% filter(bmi %in% out)
  age hgt wgt bmi hc gen phb tv reg
1 15.493 182.5 102.0 30.62 57.7 <NA> <NA> NA west
2 17.749 174.0 94.9 31.34 56.3 G5 P5 25 west
3 19.926 192.3 117.4 31.74 57.6 G5 P6 18 north

5.2.3. Breakdown Point

To compare statistics, we consider their breakdown points.

  • The breakdown point is the minimum proportion of cases that must be replaced by ∞ to cause the value of the statistic to go to ∞.

The mean has a breakdown point of 1/N.

  • Replacing a single value with ∞ will produce an infinite mean.
  • This is why we shouldn’t use basic z-scores to find outliers.

The median has breakdown point of 50%.

  • We can replace n < N/2 of the observations with ∞ without producing an infinite median.
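
A small demonstration of the difference (minimal sketch):

x <- c(1, 2, 3, 4, 5)
mean(x)       # 3
median(x)     # 3

x[5] <- 1e10  # replace one value with something huge
mean(x)       # blows up (about 2e+09)
median(x)     # still 3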

5.2.4. Outliers for Categorical Data

Nominal, ordinal, and binary items can have outliers

  • Outliers on categorical variables are often more indicative of bad variables than outlying cases.

Ordinal

  • Most participants endorse one of the lowest categories on an ordinal item, but a few participants endorse the highest category.
  • The participants who endorse the highest category may be outliers.

Nominal

  • Groups with very low membership may be outliers on nominal grouping variables.

Binary

  • If most endorse the item, the few who do not may be outliers.

5.2.5. Treating outliers

If we locate any outliers, they must be treated.

  • Outliers caused by errors, mistakes, or malfunctions (i.e., error outliers) should be directly corrected.
  • Labeling non-error outliers is a subjective task.
    • A (non-error) outlier must originate from a population separate from the one we care about.
    • Don’t blindly automate the decision process.

The most direct solution is to delete any outlying observation.

  • If you delete non-error outliers, the analysis should be reported twice: with outliers and without.

For univariate outliers, we can use less extreme types of deletion.

  • Delete outlying values (but not the entire observation).
  • These empty cells then become missing data.
  • Treat the missing values along with any naturally-occurring nonresponse.

Winsorization:

  • Replace the outliers with the nearest non-outlying value
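
A minimal sketch of winsorizing via the boxplot fences (the winsorize() helper and the toy data are illustrative, not from the notes):

winsorize <- function(x, coef = 1.5) {
    ## Whisker ends = most extreme non-outlying values:
    stats <- boxplot.stats(x, coef = coef)$stats
    lo <- stats[1]
    hi <- stats[5]
    
    ## Replace outliers with the nearest non-outlying value:
    pmin(pmax(x, lo), hi)
}

x <- c(rnorm(100), 10)
summary(winsorize(x))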

We can also use robust regression procedures to estimate the model directly in the presence of outliers.

Weight the objective function to reduce the impact of outliers

  • M-estimation

Trim outlying observations during estimation

  • Least trimmed squares, MCD, MVE

Take the median, instead of the mean, of the squared residuals

  • Least median of squares

Model some quantile of the DV’s distribution instead of the mean

  • Quantile regression

Model the outcome with a heavy-tailed distribution

  • Laplacian, Student’s T

6. Introduction to Linear Modeling

6.1. Simple Linear Regression

Best fit line is defined by a simple equation:

\begin{equation} \hat Y = \hat\beta_0+\hat\beta_1X \end{equation}

6.1.1. Thinking about error

The above equation only describes the best fit line. We need to account for the estimation error

\begin{equation} Y = \hat\beta_0+\hat\beta_1X +\hat\epsilon \end{equation}

6.1.2. Estimating the Regression Coefficients

Minimize the sum of squared residuals

\begin{equation} RSS =\sum^N_{n=1} \hat \epsilon^2_n \end{equation}

\(\hat \epsilon_n=Y_n-\hat Y_n\)

6.1.3. Model Fit, measures, r-squared

We may also want to know how well our model explains the outcome.

  • Our model explains some proportion of the outcome’s variability.
  • The residual variance \(\hat\sigma^2 = Var(\hat\epsilon)\) will be less than \(Var(Y)\)

We quantify the proportion of the outcome’s variance that is explained by our model using the \(R^2\) statistic

\begin{equation} R^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS} \end{equation} \begin{equation} TSS=\sum^N_{n=1}(Y_n-\bar Y)^2=Var(Y)\times(N-1) \end{equation}

\(\bar Y\) is the average of \(Y\)

When assessing predictive performance, we will most often use the mean squared error as our criterion.

\begin{equation} MSE=\frac 1 N \sum^N_{n=1}(Y_n-\hat Y_n)^2=\frac {RSS} N \end{equation}

MSE is the better criterion when our goal is prediction rather than explanation.

\[RMSE = \sqrt{MSE}\]
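
A minimal sketch of computing these fit measures from a fitted lm object (using the built-in mtcars data as a stand-in; names are illustrative):

fit <- lm(mpg ~ wt, data = mtcars)

rss <- sum(resid(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

r2   <- 1 - rss / tss   # matches summary(fit)$r.squared
mse  <- rss / nrow(mtcars)
rmse <- sqrt(mse)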

6.1.4. Information Criteria

We can use information criteria to quickly compare non-nested models while accounting for model complexity.

2023-12-04_17-46-19_screenshot.png

Information criteria balance two competing forces.

  • Blue: The optimized loglikelihood quantifies fit to the data.
  • Red: The penalty term corrects for model complexity

K is the number of estimated parameters (roughly, the number of X variables plus the intercept in this case)

These statistics are only useful for comparing models; on their own they don't say much

6.1.5. In R

out1 <- lm(bp ~ age, data = dDat)
coef(out1) # get the estimated coefficients
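
The same workflow on a built-in dataset (a minimal sketch; dDat is the course's data and is not included here, so mtcars stands in):

fit <- lm(mpg ~ hp, data = mtcars)
coef(fit)     # intercept and slope
summary(fit)  # coefficients, R^2, F-statistic
confint(fit)  # confidence intervals for the coefficients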

6.2. Multiple Linear Regression

Just simple linear regression with multiple predictor variables.

6.2.1. Comparing models/ multiple R2/ f statistic

  1. Multiple R2
    out1 <- lm(bp ~ age, data = dDat)
    out2 <- lm(bp ~ age + bmi, data = dDat)
    
    ## Extract R^2 values:
    r2.1 <- summary(out1)$r.squared
    r2.2 <- summary(out2)$r.squared
    
  2. F-statistic/null hypothesis

    We test whether the nonzero \(R^2\) could simply be due to chance.

    We use the F-statistic to test \(H_0:R^2=0\) vs. \(H_1:R^2>0\).

    f1 <- summary(out1)$fstatistic
    f1
    value numdf dendf
    55.78116 1.00000 440.00000
    pf(q = f1[1], df1 = f1[2], df2 = f1[3], lower.tail = FALSE)
    

    value 4.392569e-13

    f2 <- summary(out2)$fstatistic
    f2
    value numdf dendf
    64.6647 2.0000 439.0000
    pf(f2[1], f2[2], f2[3], lower.tail = FALSE)
    

    value 2.433518e-25

    How do we quantify the additional variation explained by BMI, above and beyond age? • Compute the ΔR2

    ## Compute change in R^2:
    r2.2 - r2.1
    

    [1] 0.115049

    How do we know if ΔR2 represents a significantly greater degree of explained variation? • Use an F-test for H0 : ΔR2 = 0 vs. H1 : ΔR2 > 0

    ## Is that increase significantly greater than zero?
    anova(out1, out2)
    

    Analysis of Variance Table
    
    Model 1: bp ~ age
    Model 2: bp ~ age + bmi
      Res.Df   RSS Df Sum of Sq      F    Pr(>F)
    1    440 74873
    2    439 65167  1    9706.1 65.386 6.057e-15 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  3. Comparing models with MSE

    We can also compare models based on their prediction errors. • For OLS regression, we usually compare MSE values.

    ## MSE() is not in base R; here it comes from the MLmetrics package:
    library(MLmetrics)
    
    mse1 <- MSE(y_pred = predict(out1), y_true = dDat$bp)
    mse2 <- MSE(y_pred = predict(out2), y_true = dDat$bp)
    mse1
    # [1] 169.3963
    mse2
    # [1] 147.4367
    

    In this case, the MSE for the model with BMI included is smaller. • We should prefer the larger model.

  4. Comparing models with information criteria

    Finally, we can compare models based on information criteria.

    AIC(out1, out2)
    # df AIC
    # out1 3 3528.792
    # out2 4 3469.424
    BIC(out1, out2)
    # df BIC
    # out1 3 3541.066
    # out2 4 3485.789
    

    In this case, both the AIC and the BIC for the model with BMI included are smaller. • We should prefer the larger model.

6.3. Categorical Predictors

Categorical predictors have to be converted to dummy codes:

   drink juice tea
1  juice     1   0
2 coffee     0   0
3    tea     0   1
4    tea     0   1

Then these dummy variables can simply be used as normally: \[Y=\beta_0+\beta_1X_{juice}+\beta_2X_{tea}+\epsilon\]
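
A minimal sketch of how R builds these dummy codes for a factor (the drink variable is the illustrative one from the table above):

dat <- data.frame(drink = factor(c("juice", "coffee", "tea", "tea"),
                                 levels = c("coffee", "juice", "tea")))

## The design matrix used by lm() contains the dummy codes:
model.matrix(~ drink, data = dat)
#   (Intercept) drinkjuice drinktea
# 1           1          1        0
# 2           1          0        0
# 3           1          0        1
# 4           1          0        1
# (attribute lines omitted)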

6.3.1. Contrast

All R factors have an associated contrasts attribute.

  • The contrasts define a coding to represent the grouping information.
  • Modeling functions code the factors using the rules defined by the contrasts.

x <- factor(c("a", "b", "c"))
contrasts(x)
  b c
a 0 0
b 1 0
c 0 1

6.3.2. Significance Testing

For variables with only two levels, we can test the overall factor’s significance by evaluating the significance of a single dummy code.

out <- lm(Price ~ Man.trans.avail, data = Cars93)
partSummary(out, -c(1, 2))
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 23.841 1.623 14.691 <2e-16
# Man.trans.availYes -6.603 2.004 -3.295 0.0014
# Residual standard error: 9.18 on 91 degrees of freedom
# Multiple R-squared: 0.1066,Adjusted R-squared: 0.09679
# F-statistic: 10.86 on 1 and 91 DF, p-value: 0.001403

For variables with more than two levels, we need to simultaneously evaluate the significance of each of the variable’s dummy codes

## Estimate the larger model (adding DriveTrain) and summarize it:
out4 <- lm(Price ~ Man.trans.avail + DriveTrain, data = Cars93)
partSummary(out4, -c(1, 2))
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 21.7187 2.9222 7.432 6.25e-11
# Man.trans.availYes -5.8410 1.8223 -3.205 0.00187
# DriveTrainFront -0.2598 2.8189 -0.092 0.92677
# DriveTrainRear 10.5169 3.3608 3.129 0.00237
# Residual standard error: 8.314 on 89 degrees of freedom
# Multiple R-squared: 0.2834,Adjusted R-squared: 0.2592
# F-statistic: 11.73 on 3 and 89 DF, p-value: 1.51e-06
summary(out4)$r.squared - summary(out)$r.squared
# [1] 0.1767569
# anova(out, out4)
# Analysis of Variance Table
# Model 1: Price ~ Man.trans.avail
# Model 2: Price ~ Man.trans.avail + DriveTrain
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 91 7668.9
# 2 89 6151.6 2 1517.3 10.976 5.488e-05 ***
# ---
# Signif. codes:
# 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.4. Model-Based Prediction

6.4.1. Prediction example

2023-12-09_14-20-49_screenshot.png

6.4.2. Interval estimates for prediction

To quantify uncertainty in our predictions, we want to use an appropriate interval estimate: There are two flavors:

  1. Confidence interval for \(\hat Y_m\)

    The CI for \(\hat Y_m\) gives a likely range (in the sense of coverage probability and “confidence”) for the mth value of the true conditional mean.

    CIs only account for uncertainty in the estimated regression coefficients, \(\{\hat\beta_0, \hat\beta_p\}\).

  2. Prediction intervals for a specific observation, \(Y_m\)

    The prediction interval for \(Y_m\) gives a likely range (in the same sense as CIs) for the mth outcome value.

    Prediction intervals also account for the regression errors, \(\epsilon\).

  3. Confidence vs prediction intervals

    CIs for \(\hat Y\) ignore the errors, \(\epsilon\).

    • Only accounts for the best fit

    Prediction intervals are wider than CIs.

    • They account for the additional uncertainty contributed by \(\epsilon\).

    2023-12-09_14-34-27_screenshot.png
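
    A minimal sketch of both interval types with predict() (fit and newDat are assumed names for a fitted lm object and a data frame with the model's predictors):

    ## Confidence interval for the conditional mean:
    predict(fit, newdata = newDat, interval = "confidence")
    
    ## Prediction interval for a new observation (wider):
    predict(fit, newdata = newDat, interval = "prediction")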

6.5. Moderation

Additive models allow us to examine the partial effects of several predictors on some outcome.

  • The effect of one predictor does not change based on the values of other predictors

Moderation allows us to ask when one variable, X, affects another variable, Y.

  • We’re considering the conditional effects of X on Y given certain levels of a third variable Z.
  • How much a variable affects the outcome depends on another variable

6.5.1. Additive regression + derivation

In additive MLR we have the following equation: \(Y=\beta_0+\beta_1X+\beta_2Z+\epsilon\). This equation assumes that X and Z are independent predictors of Y.

When X and Z are independent predictors, the following are true

  • X and Z can be correlated
  • \(\beta_1\) and \(\beta_2\) are partial regression coefficients
  • The effect of X on Y is the same at all levels of Z, and the effect of Z on Y is the same at all levels of X.

When testing for moderation, we hypothesize that the effect of X on Y varies as a function of Z.

  • \(Y=\beta_0+f(Z)X+\beta_2Z+\epsilon\)

If we assume that Z linearly affects the relationship between X and Y:

  • \(f(Z)=\beta_1+\beta_3Z\)

Combining the two equations:

  • \(Y=\beta_0+(\beta_1+\beta_3Z)X+\beta_2Z+\epsilon\)
  • \(Y=\beta_0+\beta_1X+\beta_2Z+\beta_3XZ+\epsilon\)

6.5.2. Test for moderation

\(Y=\beta_0+\beta_1X+\beta_2Z+\beta_3XZ+\epsilon\)

To test for moderation we simply need to test the significance of the interaction term, XZ

\(\hat\beta_3\) quantifies the effect of Z on the focal effect (the X->Y effect)

  • For a unit change in Z, it is the expected change in the effect of X on Y

\(\hat\beta_1\text{ and } \hat\beta_2\) are conditional effects.

  • interpreted where the other predictor is zero
  • For a unit change in X, \(\hat\beta_1\) is the expected change in Y, when Z = 0.
  • For a unit change in Z, \(\hat\beta_2\) is the expected change in Y, when X = 0.

6.5.3. in R

## Load data:
socSup <- readRDS(paste0(dataDir, "social_support.rds"))
## Estimate the moderated regression model:
out <- lm(bdi ~ tanSat * sex, data = socSup)
partSummary(out, -c(1, 2))

# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 20.8478 6.2114 3.356 0.00115
# tanSat -0.5772 0.3614 -1.597 0.11372
# sexmale 14.3667 12.2054 1.177 0.24223
# tanSat:sexmale -0.9482 0.7177 -1.321 0.18978
# Residual standard error: 9.267 on 91 degrees of freedom
# Multiple R-squared: 0.08955,Adjusted R-squared: 0.05954
# F-statistic: 2.984 on 3 and 91 DF, p-value: 0.03537

7. Regression Assumptions and Diagnostics

You always have to make same assumptions to create a model.

7.1. Assumptions of MLR

7.1.1. 1 The model is linear in the parameters

This is OK: \(Y=\beta_0+\beta_1X+\beta_2Z+\beta_3XZ+\beta_4X^2+\beta_5X^3\)

This is not: \(Y=\beta_0X^{\beta_1}+\epsilon\)

  • power is also a parameter here -> not allowed

If the model is not linear in the parameters, then we’re not even working with linear regression.

  • We need to move to entirely different modeling paradigm

Each modeled X must exhibit a linear relation with Y.

  • We can define X via nonlinear transformations of the original data.
  1. residuals plots
    ## The Cars93 data come from the MASS package:
    data(Cars93, package = "MASS")
    
    out1 <- lm(MPG ~ Horsepower, data = Cars93)
    plot(Cars93$Horsepower, Cars93$MPG) # scatterplot of the raw association
    plot(out1, 1)                       # residuals vs. fitted values
    

    In multiple regression models, basic residual plots won’t tell us which predictors exhibit nonlinear associations.

  2. Component + residuals plots

    We can use Component + Residual Plots (AKA, partial residual plots) to visualize the unique effects of each X variable.

    library(car)
    ## out3 is a multiple-regression fit from the slides (any lm() with several
    ## predictors will do here):
    crPlots(out3)
    
  3. Treating Residual Nonlinearity

    Nonlinearity in the residual plots is usually a sign of either:

    1. Model misspecification
    2. Influential observations

    This type of model misspecification usually implies omitted functions of modeled variables.

    • Polynomial terms
    • Interactions

    The solution is to include the omitted term into the model and refit.

    • This is very much easier said than done.

7.1.2. 2 The predictor matrix is full rank

  • \(N>P\)
  • No \(X_p\) can be a linear combination of other predictors

If the predictor matrix is not full rank, the model is not estimable.

  • The parameter estimates cannot be uniquely determined from the data

7.1.3. 3 The predictors are strictly exogenous

The predictors do not correlate with the errors

  • \(Cov(\hat Y,\epsilon)=0\)
  • \(E[\epsilon_n]=0\)

If the predictors are not exogenous, the estimated regression coefficients will be biased.

  1. Omitted variables

    The most common cause of endogeneity (i.e., violating Assumption 3) is omitted variable bias.

    • If we leave an important predictor variable out of our equation, some modeled predictors will become endogenous and their estimated regression slopes will be biased.
    • The omitted variable must be correlated with Y and at least one of the modeled Xp, to be a problem.

    Assume the following is the true regression model. \[Y=\beta_0+\beta_1X+\beta_2Z+\epsilon\]

    Now, suppose we omit Z from the model: \[Y=\beta_0+\beta_1X+\omega\] \[\omega = \epsilon + \beta_2Z\]

    Our new error, 𝜔, is a combination of the true error, 𝜀, and the omitted term, \(\beta_2Z\)

    • Consequently, if X and Z are correlated, omitting Z induces a correlation between X and 𝜔 (i.e., endogeneity).

    Omitted variable bias can have severe consequences, but you can’t really test for it.

    • The errors are correlated with the predictors, but our model is estimated under the assumption of exogeneity, so the residuals from our model will generally be uncorrelated with the predictors.
    • We mostly have to pro-actively work to include all relevant variables in our model.

7.1.4. 4 The errors have constant, finite variance (no heteroscedasticity)

  • \(Var(\epsilon_n)=\sigma^2<\infty\)

see spherical

  1. Consequences of heteroscedasticity

    Non-constant error variance will not bias the parameter estimates.

    • The best fit line is still correct.
    • Our measure of uncertainty around that best fit line is wrong.

    Heteroscedasticity will bias standard errors (usually downward).

    • Test statistics will be too large.
    • CIs will be too narrow.
    • We will have inflated Type I error rates.
  2. Treating heteroscedasticity
    1. Transform your outcome using a concave function (e.g., ln(Y), √Y).
      • These transformations will shrink extreme values more than small/moderate ones.
      • It’s usually a good idea to first shift the variable’s scale by setting the minimum value to 1.
    2. Refit the model using weighted least squares.
      • Create inverse weights using functions of the residual variances or quantities highly correlated therewith.
    3. Use a Heteroscedasticity Consistent (HC) estimate of the asymptotic covariance matrix.
      • Robust standard errors, Huber-White standard errors, Sandwich estimators
      • HC estimators correct the standard errors for non-constant error variance.
    ## The 'sandwich' package provides several HC estimators:
    library(sandwich)
    ## the 'lmtest' package provides fancy testing tools for linear models:
    library(lmtest)
    ## Use sandwich estimator to compute ACOV matrix:
    hcCov <- vcovHC(out1)
    ## Test coefficients with robust SEs:
    robTest <- coeftest(out1, vcov = hcCov)
    ## Test coefficients with default SEs:
    defTest <- summary(out1)$coefficients
    
  3. Testing

    We can create a residual plot to test for this

    Non-constant error variance violates assumption 4

    2023-12-11_17-54-50_screenshot.png

    Non-constant error variance

7.1.5. 5 The errors are uncorrelated

\(Cov(\epsilon_i,\epsilon_j)=0,\ i\ne j\)

see spherical

  1. Correlated Errors

    Errors can become correlated in two basic ways:

    1. Serial dependence
      • When modeling longitudinal data, the errors for a given observational unit can be correlated over time
      • We can detect temporal dependence by examining the autocorrelation of the residuals
    2. Clustering
      • Your data have some important, unmodeled, grouping structure
        • Children nested within classrooms
        • Romantic couples
        • Departments within a company
      • We can detect problematic level of clustering with the intraclass correlation coefficient (icc)
        • We need to know the clustering variable to apply the icc
  2. Treating Correlated Errors

    Serially dependent errors in a longitudinal model usually indicate an inadequate model.

    • Your model is ignoring some important aspect of the temporal variation that is being absorbed by the error terms.
    • Hopefully, you can add the missing component to your model.

    Clustering can be viewed as theoretically meaningful or as a nuisance factor that needs to be controlled

    If the clustering is meaningfull, you should model the data using multilevel modelling

    • Hierarchical linear regression
    • Mixed models
    • Random effects models

    If the clustering is an uninteresting nuisance, you can use specialized HC variance estimators that deal with clustering

    ## Read in some data:
    LeeBryk <- readRDS(paste0(dataDir, "lee_bryk.rds"))
    ## Estimate a linear regression model:
    fit <- lm(math ~ ses + sector, data = LeeBryk)
    ## Calculate the residual ICC:
    ICC::ICCbare(x = LeeBryk$schoolid, y = resid(fit))
    
    0.07487712
    ## Robust tests:
    coeftest(fit, vcov = vcovCL(fit, ~ schoolid))
    
    #t test of coefficients:
    #Estimate Std. Error t value Pr(>|t|)
    #(Intercept) 11.79965 0.20318 58.0746 < 2.2e-16 ***
    #ses 2.94860 0.12794 23.0475 < 2.2e-16 ***
    #sectorprivate 1.93495 0.31717 6.1006 1.111e-09 ***
    #---
    #Signif. codes:
    #0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    

7.1.6. 6 The errors are normally distributed.

\(\epsilon\sim N(0,\sigma^2)\)

If errors are non-normal, small-sample inferences may be biased.

  • The justification for some tests and procedures used in regression analysis may not hold
  1. Test with Q-Q Plot
    plot(out1, 2)
    

    One of the best ways to evaluate the normality of the error distribution with a Q-Q Plot.

    • Plot the quantiles of the residual distribution against the theoretically ideal quantiles.
    • We can actually use a Q-Q Plot to compare any two distributions.

    2024-01-02_17-46-29_screenshot.png

  2. Consequences of Violating Normality

    In small samples, with fixed predictors, normally distributed errors imply normal sampling distributions for the regression coefficients.

    • In large samples, the central limit theorem implies normal sampling distributions for the coefficients, regardless of the error distribution.

    Prediction intervals require normally distributed errors.

    • Confidence intervals for predictions share the same normality requirements as the coefficients’ sampling distributions.

    Parameter estimates will not be fully efficient.

    • Standard errors will be larger than they would have been with normally distributed errors.
  3. Treating Violations of Normality

    We usually don’t need to do anything about non-normal errors

    • The central limit theorem will protect our inferences.

    We can use bootstrapping to get around the need for normality.

    1. Treat your sample as a synthetic population from which you draw many new samples (with replacement).
    2. Estimate your model in each new sample.
    3. The replicates of your estimated parameters generate an empirical sampling distribution that you can use for inference.

    Bootstrapping can be used for inference on pretty much any estimable parameter, but it won’t work with small samples.

    • Need to assume that your sample is representative of the population
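
    A minimal sketch of such a nonparametric bootstrap for the regression coefficients (reusing the Cars93 Price ~ Horsepower model from the diagnostics section):

    set.seed(235711)
    
    boots <- replicate(2000, {
        ## Resample rows with replacement and refit the model:
        idx <- sample(nrow(Cars93), replace = TRUE)
        coef(lm(Price ~ Horsepower, data = Cars93[idx, ]))
    })
    
    ## Empirical sampling distribution -> percentile 95% CIs:
    apply(boots, 1, quantile, probs = c(0.025, 0.975))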

7.1.7. Spherical errors (4+5)

The assumption of spherical errors combines assumptions 4 and 5: \(Var(\epsilon)=\sigma^2I_N\)

If the errors are not spherical, the standard errors will be biased.

  • The estimated regression coefficients will be unbiased, though.
  • But the test statistics and standard errors can be biased

7.2. Regression Diagnostics

After fitting the model we check the assumptions; most assumptions can't be checked before fitting.

If some of the assumptions are violated, the inferences we make using the model may be wrong

  • We need to check the tenability of our assumptions before leaning too heavily on the model estimates

These checks are called regression diagnostics

  • Graphical visualizations
  • Quantitative indices/measures
  • Formal statistical tests

Many of these formal tests require their own assumptions, which is a drawback.

7.2.1. Residual Plots

One of the most useful diagnostic graphics is the plot of residuals vs. predicted values.

2023-12-11_17-51-01_screenshot.png

We can create this plot by plotting the fitted lm() object in R.

out1 <- lm(Price ~ Horsepower,
    data = Cars93)
plot(out1, 1)

7.3. Influential Observations

Influential observations contaminate analyses in two ways:

  1. Exert too much influence on the fitted regression model
  2. Invalidate estimates/inferences by violating assumptions

There are two distinct types of influential observations:

  1. Outliers
    • Observations with extreme outcome values, relative to the other data.
    • Observations with outcome values that fit the model very badly.
  2. High-leverage observations
    • Observation with extreme predictor values, relative to other data

7.3.1. Outliers

Outliers can be identified by scrutinizing the residuals.

  • Observations with residuals of large magnitude may be outliers.
  • The difficulty arises in quantifying what constitutes a “large” residual.

If the residuals do not have constant variance, then we cannot directly compare them.

  • We need to standardize the residuals in some way

We are specifically interested in externally studentized residuals.

We can’t simply standardize the ordinary residuals.

  • Internally studentized residuals
  • Outliers can pull the regression line towards themselves.
  • The internally studentized residuals for outliers will be too small.

Begin by defining the concept of a deleted residual: \[ \hat \epsilon_{(n)}=Y_n-\hat Y_{(n)} \]

  • \(\hat\epsilon_{(n)}\) quantifies the distance of \(Y_n\) from the regression line estimated after excluding the nth observation.
  • \(\hat Y_{(n)}\) is the predicted value for the nth observation from the regression line estimated after excluding the nth observation
  1. Studentized Residuals

    If we standardize the deleted residual, \(\hat\epsilon_{(n)}\), we get the externally studentized residual: \[t_{(n)}=\frac{\hat\epsilon_{(n)}}{SE_{\hat\epsilon_{(n)}}}\] The externally studentized residuals have two very useful properties:

    1. Each t(n) is scaled equivalently.
      • We can directly compare different t(n) .
    2. The t(n) are Student’s t distributed.
      • We can quantify outliers in terms of quantiles of the t distribution.
      • |t(n) | > 3.0 is a common rule of thumb for flagging outliers.

    Index plots of the externally studentized residuals can help spotlight potential outliers.

    • Look for observations that clearly “stand out from the crowd”
    rstudent(out1) %>% plot()
    

    2024-01-02_19-29-57_screenshot.png

7.3.2. High-Leverage Points

We identify high-leverage observations through their leverage values.

  • An observation’s leverage, \(h_n\), quantifies the extent to which its predictors affect the fitted regression model.
  • Observations with \(X\) values very far from the mean, \(\bar X\), affect the fitted model disproportionately.

Index plots of the leverage values can help spotlight high-leverage points.

  • Again, look for observations that clearly “stand out from the crowd.”

In simple linear regression, the nth leverage is given by: \[h_n=\frac 1 N + \frac{(X_n-\bar X)^2}{\sum^N_{m=1}(X_m-\bar X)^2}\]

hatvalues(out1) %>% plot()

2024-01-02_19-38-14_screenshot.png

7.3.3. Influential Points

Observations with high leverage or large (externally) studentized residuals are not necessarily influential.

  • High-leverage observations tend to be more influential than outliers.
  • The worst problems arise from observations that are both outliers and have high leverage.

Measures of influence simultaneously consider extremity in both X and Y dimensions.

  • Observations with high measures of influence are very likely to cause problems.

All measures of influence use the same logic as the deleted residual.

  • Compare models estimated from the whole sample to models estimated from samples excluding individual observations.

One of the most common measures of influence is Cook’s Distance \[Cook's\ D_n=\frac{\sum^N_{m=1}(\hat Y_m-\hat Y_{m(n)})^2}{(P+1)\hat \sigma^2}=(P+1)^{-1}t^2_n\frac{h_n}{1-h_n}\]

Index plots of Cook’s distances can help spotlight the influential points.

  • Look for observations that clearly “stand out from the crowd.”
cd <- cooks.distance(out1)
plot(cd)

2024-01-02_19-49-05_screenshot.png

7.3.4. Removing Influential Observations

Observation number 28 was the most influential according to Cook’s Distance

(maxD <- which.max(cd))
28
## Exclude the influential case:
Cars93.2 <- Cars93[-maxD, ]
## Fit model with reduced sample:
out2 <- lm(Price ~ Horsepower, data = Cars93.2)
round(summary(out1)$coefficients, 6)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.398769 1.820016 -0.768548 0.444152
Horsepower 0.145371 0.011898 12.218325 0.000000
round(summary(out2)$coefficients, 6)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.837646 1.806418 -1.570868 0.119722
Horsepower 0.156750 0.011996 13.066942 0.000000

Removing the influential case did not have a massive effect on the slope, but the coefficient estimates and t-values did shift noticeably

2024-01-02_20-02-02_screenshot.png

You could also remove the 2 most influential observations

(maxDs <- sort(cd) %>% names() %>% tail(2) %>% as.numeric())

## Exclude influential cases:
Cars93.2 <- Cars93[-maxDs, ]

## Fit model with reduced sample:
out2.2 <- lm(Price ~ Horsepower, data = Cars93.2)

7.3.5. Treating Influential Points

The most common way to address influential observations is simply to delete them and refit the model.

  • This approach is often effective—and always simple—but it is not fool-proof.
  • Although an observation is influential, we may not be able to justify excluding it from the analysis.

Robust regression procedures can estimate the model directly in the presence of influential observations.

  • Observations in the tails of the distribution are weighted less in the estimation process, so outliers and high-leverage points cannot exert substantial influence on the fit.
  • We can do robust regression with the rlm() function from the MASS package.
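
A minimal sketch of such a robust fit (reusing the Cars93 Price ~ Horsepower example; robFit is an illustrative name):

library(MASS)

## Robust regression via M-estimation (Huber weights by default):
robFit <- rlm(Price ~ Horsepower, data = Cars93)
summary(robFit)

## Compare with the OLS fit:
coef(out1)
coef(robFit)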

8. Generalized Linear Model & Logistic regression

8.1. General Linear model

So far we have only discussed general linear models:

\[Y = \beta_0 + \sum^P_{p=1} \beta_p X_p+\epsilon\]

All flavor of linear regression are general linear models

  • Simple Linear regression, Multiple Linear Regression
  • t-test, ANOVA, ANCOVA
  • Multilevel linear regression models

We can break our model into pieces:

  • \(\eta = \beta_0 + \sum^P_{p=1} \beta_p X_p\)
  • \(Y = \eta + \epsilon\)

\(\epsilon \sim N(0,\sigma^2)\), so we can also write: \[Y\sim N(\eta,\sigma^2)\]

Where:

  • \(\eta\) is the systematic component of the model (AKA, the linear predictor)
  • The normal distribution, \(N(...,...)\) is the model’s random component

The purpose of general linear modeling is to build a model of the outcome’s mean \(\mu_Y\)

  • In this case \(\mu_Y = \eta\)
  • The systematic component defines the mean of \(Y\)

The random component quantifies variability around \(\mu_Y\)

  • In the general linear model, we assume that the errors follow a normal distribution with constant variance

8.2. Generalized Linear Model

We can generalize the models we’ve been using in two important ways:

  1. Allow for random components other than the normal distribution
  2. Allow for more complicated relations between \(\mu_Y\) and \(\eta\)
    • Allow a link function: \(g(\mu_Y)=\eta\)

In this way, we obtain the class of generalized linear models (GLMs)

The random component in a GLM can be any distribution from the exponential family:

  • Normal
  • Binomial
  • Poisson
  • Many others…

The systematic component of a GLM is exactly the same as it is in general linear models.

  • \(\eta = \beta_0 + \sum ^P_{p=1} \beta_p X_p\)

In GLMs, \(\eta\) does not directly describe \(\mu_Y\)

  • We first transform \(\mu_Y\) via a link function

The link function serves two important purposes

  1. Linearizes the association between \(X\) and \(Y\)
    • Nonlinear: \(X\rightarrow \mu_Y\)
    • Linear: \(X\rightarrow g(\mu_Y)\)
  2. Allows GLMs for outcomes with restricted ranges without requiring any restrictions on the range of the \(\{X_P\}\)
    • In many cases, \(\mu_Y\) has a restricted range:
      • Counts: \(\mu_Y > 0\)
      • Probabilities: \(\mu_Y \in [0,1]\)
    • When correctly specified, \(g(\mu_Y)\) can take any value on the real line

Every GLM is built from three components

  1. The systematic component, \(\eta\)
    • A linear function of the predictors, \(\{X_P\}\)
    • Describes the association between \(X\) and \(Y\)
  2. The link function, \(g(\mu_Y)\)
    • Linearizes the relation between \(X\) and \(Y\)
    • Transforms \(\mu_Y\) so that it can take any value on the real line
  3. The random component, \(P(Y|g^{-1}(\eta))\)
    • The distribution of the observed \(Y\)
    • Quantifies the error variance around \(\eta\)

The General Linear Model is a special case of GLM

  1. Systematic component: \[\eta = \beta_0 + \sum ^P_{p=1} \beta_p X_p\]
  2. Link function: \[\mu_Y=\eta\]
  3. Random component: \[Y\sim N(\eta, \sigma^2)\]

8.2.1. Generalized Linear Model in R

In R, you can fit a generalized linear model with the glm() function:

data(iris)

## Generalized linear model:
glmFit <- glm(Petal.Length ~ Petal.Width + Species,
              family = gaussian(link = "identity"),
              data = iris)
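
Since the Gaussian family with an identity link is just the general linear model, this fit should reproduce lm(); a quick sanity check (a sketch, not from the source):

## The equivalent general linear model:
lmFit <- lm(Petal.Length ~ Petal.Width + Species, data = iris)

## The two sets of coefficient estimates should (essentially) match:
all.equal(coef(glmFit), coef(lmFit))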

8.3. Logistic Regression

We want to use logistic regression for a classification task with a two-level outcome.

Using a linear model doesn't work well for this kind of task; the logistic model addresses its shortcomings:

  • The link function ensures legal predicted values (between 0 and 1)
  • The sigmoid curve implies that the benefit of extra study time varies across its range
    • In the example below, extra study time is most beneficial for students with around 5.5 hours of study

2024-01-09_15-40-58_screenshot.png

In logistic regression problems we are modeling binary data:

  • \(Y\in\{1="Success",0="Failure"\}\)

The random component in our logistic regression model will be the binomial distribution.

The mean of the binomial distribution (with \(N=1\)) is the “success” probability, \(\pi=P(Y=1)\)

  • We are interested in modelling this success probability, \(\mu_Y=\pi\):

\[g(\pi) = \beta_0 + \sum ^P_{p=1} \beta_p X_p\]

We use the logit link function to map the success probability, \(\pi\), onto the whole real line (the scale of the systematic component).

  • Given \(\pi\), the odds of success are: \[O_s=\frac \pi {1-\pi}\]
  • Because \(\pi \in [0,1]\), we know that \(O_s \ge 0\)
  • We can take the natural log of the odds as the last step to fully map \(\pi\) to the real line \[\operatorname{logit}(\pi)=\ln\left(\frac \pi {1-\pi}\right)\]

Note that in the above function we map the probability to the real line, not the other way around.

  • For the other way around use the logistic function: \[\sigma(x)=\frac 1 {1+e^{-x}}=\frac {e^x} {1+e^x}\]
    • Also called sigmoid or expit
    • Is the inverse of the logit function
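
Base R already provides both functions: qlogis() is the logit and plogis() is the logistic (expit). A quick demonstration:

p <- 0.75

## Map a probability onto the real line (logit) and back (logistic):
(eta <- qlogis(p))   # log(0.75 / 0.25), roughly 1.099
plogis(eta)          # recovers 0.75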

Our final logistic regression model is:

\[Y\sim \operatorname{Bin}(\pi,1)\] \[\operatorname{logit}(\pi)=\beta_0 + \sum^P_{p=1}\beta_pX_p\]

The fitted model can be represented as: \[\operatorname{logit}(\hat\pi)=\hat\beta_0 + \sum^P_{p=1}\hat\beta_pX_p\]

8.4. Interpreting logistic regression

We have a model:

\[\operatorname{logit}(\hat\pi)=\hat\beta_0 + \sum^P_{p=1}\hat\beta_pX_p\]

The fitted coefficients, \(\{\hat\beta_0, \hat\beta_p\}\), are interpreted in units of log odds.

Log odds do not lend themselves to easy interpretation.

  • We can convert the effects back to an odds scale by exponentiation
  • \(\hat\beta\) has log-odds units, but \(e^{\hat\beta}\) has odds units.

Note: for all of these interpretations, keep in mind that odds and probabilities are not the same thing!

Exponentiating the coefficients also converts the additive effects to multiplicative effects.

  • We can interpret \(\hat\beta\) as we would in linear regression:
    • A unit change in \(X_p\) produces an expected change of \(\hat\beta_p\) units in \(\operatorname{logit}(\pi)\)
  • After exponentiation, however, unit changes in \(X_p\) imply multiplicative changes in the odds, \(O_s=\frac \pi {1-\pi}\)
    • A unit change in \(X_p\) results in multiplying \(O_s\) by \(e^{\hat\beta_p}\)
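
As a sketch, for a fitted logistic regression object (here generically called logitFit, a hypothetical name), exponentiating the coefficients puts them on the odds-ratio scale:

## Odds ratios for a fitted logistic regression model 'logitFit':
exp(coef(logitFit))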

Due to the confusing interpretations of the coefficients, we often focus on the valence of the effects.

  • Additional study time is associated with increased odds of passing
  • \(\hat\beta_p>0 = "Increased Success"\), \(e^{\hat\beta_p}>1 = "Increased Success"\)

For example interpretations, see the slides (starting from slide 28)

8.5. Model Comparison for logistic regression

We don’t have an \(R^2\) statistic for logistic regression models, so we need to use a likelihood ratio test to compare nested models.
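
The models fit0 and fit compared below could be fit as follows (a sketch, consistent with the output shown and assuming the titanic data):

## Restricted and full models for the likelihood ratio test:
fit0 <- glm(survived ~ age + sex,         family = "binomial", data = titanic)
fit  <- glm(survived ~ age + sex + class, family = "binomial", data = titanic)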

anova(fit0, fit, test = "LRT")

## Analysis of Deviance Table
##
## Model 1: survived ~ age + sex
## Model 2: survived ~ age + sex + class
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 884 916.00
## 2 882 801.59 2 114.41 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also use information criteria

AIC(fit0, fit)

## df AIC
## fit0 3 921.9989
## fit 5 811.5940

BIC(fit0, fit)

## df BIC
## fit0 3 936.3624
## fit 5 835.5333

8.6. Classification with logistic regression

Given a fitted logistic regression model, we can get predictions for new observations \(X'_p\)

Directly applying \(\hat\beta_0,\hat\beta_p\) to \(X'_p\) produces predictions on the scale of \(\eta\): \[\hat\eta ' = \hat\beta_0+\sum^P_{p=1}\hat\beta_pX'_p\]

By applying the inverse link function, \(g^{-1}()\), to \(\hat\eta'\), we get predicted success probabilities: \[\hat\pi'=g^{-1}(\hat\eta')\]

Once we have computed the predicted success probabilities, \(\hat\pi'\), we can use them to classify new observations.

By choosing a threshold on \(\hat\pi'\), say \(\hat\pi'=t\), we can classify the new observations as successes or failures: \[\hat Y'= \begin{cases} 1 &\text{ if }\hat\pi'\ge t \\ 0 &\text{ if }\hat\pi'< t \end{cases}\]

Note that the optimal threshold is not always \(0.5\).
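
A minimal prediction-and-classification sketch (the model 'fit' and data 'newData' are hypothetical names, not objects defined in these notes):

## Predictions on the linear-predictor scale (eta):
etaHat <- predict(fit, newdata = newData, type = "link")

## Predicted success probabilities (inverse link applied automatically):
piHat <- predict(fit, newdata = newData, type = "response")

## Classify using a threshold of t = 0.5:
yHat <- ifelse(piHat >= 0.5, 1, 0)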

9. Assumptions of Logistic Regression

9.1. 1 The model is linear in the parameters

Same as linear regression

  • This is OK: \[\operatorname{logit}(\pi)=\beta_0 + \beta_1X + \beta_2Z + \beta_3X^2\]
  • This is not: \[\operatorname{logit}(\pi)=\beta_0X^{\beta_1}\]

9.1.1. Checking Assumption 1

Assumption 1 implies a linear relation between continuous predictors and the logit of the success probability.

  • We can basically evaluate the linearity assumption using the same methods we applied with linear regression.
car::crPlots(glmFit, terms = ~ age + fare)

2024-01-10_15-15-52_screenshot.png

9.2. 2 The predictor matrix is full rank

Same as linear regression

  • \(N>P\)
  • No \(X_p\) can be a linear combination of other predictors

9.2.1. Checking Assumption 2

Basically the same as for linear regression; Assumption 2 implies two conditions:

  1. \(N > P\)
  2. No severe (multi)collinearity among the predictors

We can quantify multicollinearity with the variance inflation factor (VIF).

car::vif(glmFit)
age sex fare
1.031829 1.007699 1.026373

VIF > 10 indicates severe multicollinearity.

9.3. 3 The outcome is independently and identically binomially distributed

The distributional assumptions of logistic regression are not framed in terms of residuals.

This assumption fills the role played by all residual-related assumptions in linear regression.

Instead we consider the entire outcome distribution in logistic regression.

\[Y_n \sim^{iid} \operatorname{Bin}(\hat\pi_n, 1)\] \[\hat\pi_n=\operatorname{logistic}\left( \hat\beta_0 + \sum^P_{p=1} \hat\beta_pX_{np} \right)\]

9.3.1. Checking Assumption 3

Assumption 3 implies several conditions.

  1. The outcome, Y, is binary.
  2. The linear predictor, \(\eta\), can explain all the systematic trends in \(\pi\).
    • No residual clustering after accounting for X.
    • No important variables omitted from X.

We can easily check the first condition with summary statistics.

levels(titanic$survived)
“no” “yes”
table(titanic$survived)
no yes
545 342

If we have a non-binary, categorical outcome, we can use a different type of model.

  • Multiclass nominal variables: multinomial logistic regression
    • nnet::multinom()
  • Ordinal variables: Proportional odds logistic regression
    • MASS::polr()
  • Counts: Poisson regression
    • glm() with family = "poisson"

The binomial distribution is also appropriate for modeling the proportion of successes in N trials.

For the second condition:

We can check for residual clustering by calculating the ICC using deviance residuals.

## Check for residual dependence induced by 'class':
ICC::ICCbare(x = titanic$class, y = resid(glmFit, type = "deviance"))
0.1054665

9.4. Residuals

In logistic regression the outcome is binary, \(Y\in\{0,1\}\), but the parameter that we’re trying to model is continuous, \(\pi \in (0,1)\).

  • Due to this mismatch in measurement levels, we don’t have a natural definition of a “residual” in logistic regression

The most basic residual is the raw residual, \(e_n\)

  • The difference between the observed outcome value and the predicted probability

\[e_n=Y_n-\hat\pi_n\]

We train a model on the titanic dataset:

## Read the data:
titanic <- readRDS(paste0(dataDir, "titanic.rds"))
## Estimate the logistic regression model:
glmFit <- glm(survived ~ age + sex + fare,
              data = titanic,
              family = "binomial")
## Save the linear predictor estimates:
titanic$etaHat <- predict(glmFit, type = "link")

Raw residual plot:

library(ggplot2)

## Calculate the raw residuals:
titanic$e <- resid(glmFit, type = "response")

## Plot raw residuals vs. fitted
## linear predictor values:
ggplot(titanic, aes(etaHat, e)) +
    geom_point() +
    geom_smooth() +
    theme_classic() +
    xlab("Linear Predictor") +
    ylab("Raw Residual")

2024-01-10_14-31-15_screenshot.png

Pearson residuals, \(r_n\), are scaled raw residuals \[r_n=\frac {e_n}{\sqrt{\hat\pi_n(1-\hat\pi_n)}}\]

## Calculate the Pearson residuals:
titanic$r <- resid(glmFit, type = "pearson")

Deviance residuals, \(d_n\), are derived directly from the objective function used to estimate the model.

\[d_n=\operatorname{sign}(e_n)\sqrt{-2[Y_n\ln (\hat\pi_n)+(1-Y_n)\ln(1-\hat\pi_n)]} \]

The residual deviance, \(D\), is the sum of squared deviance residuals.

  • Quantifies how well the model fits the data

\[D=\sum^N_{n=1}d^2_n\]
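
As a quick check with the titanic model fit above, the sum of squared deviance residuals reproduces the residual deviance reported by glm():

## Residual deviance two ways (using the fitted 'glmFit' logistic model):
sum(resid(glmFit, type = "deviance")^2)
deviance(glmFit)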

9.5. Computational Considerations

In addition to the preceding statistical assumptions, we must satisfy three computational requirements that were not necessary in linear regression.

9.5.1. 1 Sufficient sample size

Logistic regression models are estimated with numerical methods, so we need larger samples than we would for linear regression models.

  • The sample size requirements increase with model complexity.

Some suggested rules of thumb:

  • 10 cases for each predictor (Agresti, 2018)
  • \(N = 10P/\pi_0\) (Peduzzi, Concato, Kemper, Holford, & Feinstein, 1996)
    • \(P\): Number of predictors
    • \(\pi_0\): Proportion of the minority class
  • \(N = 100 + 50P\) (Bujang, Omar, & Baharum, 2018)
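
As an illustration (the numbers are not from the source), for a model with \(P = 3\) predictors and a minority-class proportion of roughly \(\pi_0 = 0.39\), as in the titanic example:

P   <- 3     # number of predictors (e.g., age, sex, fare)
pi0 <- 0.39  # approximate proportion of the minority class

10 * P        # 10 cases per predictor:  30
10 * P / pi0  # Peduzzi et al. (1996):   ~77
100 + 50 * P  # Bujang et al. (2018):    250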

9.5.2. 2 Outcome classes are sufficiently balanced

Logistic regression may not perform well when the outcome classes are severely imbalanced.

  • E.g., when there are many more observations in one class than in the other
with(titanic, table(survived) / length(survived))
no yes
0.6144307 0.3855693

We have a few possible solutions for problematic imbalance:

  • Down-sampling the majority class
    • Randomly remove some samples from the majority class
  • Up-sampling the minority class
    • Randomly select samples from the minority class to use multiple times
  • Use weights when estimating the logistic regression model
    • weights argument in glm()
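
A minimal down-sampling sketch for the titanic data (an illustration; the seed and object names are arbitrary):

## Down-sample the majority class ('no') to the size of the minority class:
set.seed(235711)

nMinority   <- sum(titanic$survived == "yes")
majorityIdx <- which(titanic$survived == "no")

keep       <- c(which(titanic$survived == "yes"), sample(majorityIdx, nMinority))
titanicBal <- titanic[keep, ]

table(titanicBal$survived)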

9.5.3. 3 There is no perfect prediction

We don’t actually want to perfectly predict class membership.

  • The model cannot be estimated when the classes are perfectly separable.

Why?

  • Imagine the shape the fitted logistic curve would need to separate the classes perfectly
  • The curve would have to be essentially flat, then jump from 0 to 1 at the separation point
  • Producing that jump requires an infinite slope, which cannot be estimated

Model regularization (e.g., ridge or LASSO penalty) may help.

  • glmnet::glmnet()
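
A penalized logistic regression sketch with glmnet (illustrative, not from the source; glmnet() expects a numeric predictor matrix, so we build one with model.matrix()):

library(glmnet)

## Use complete cases so the predictor matrix and the outcome line up:
dat <- na.omit(titanic[c("survived", "age", "sex", "fare")])

x <- model.matrix(survived ~ age + sex + fare, data = dat)[, -1]
y <- dat$survived

## alpha = 0 gives a ridge penalty; alpha = 1 would give the LASSO:
ridgeFit <- glmnet(x, y, family = "binomial", alpha = 0)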

9.6. Influential Cases

As with linear regression, we need to deal with any overly influential cases.

  • We can use the linear predictor values to calculate Cook’s Distances.
  • Any case that exerts undue influence on the linear predictor will have the same effect on the predicted success probabilities.
cooks.distance(glmFit) %>% plot()

2024-01-10_16-41-13_screenshot.png

plot(glmFit, 4)

2024-01-10_16-42-16_screenshot.png

9.7. Classification Performance Measures

9.7.1. Confusion Matrix

One of the most direct ways to evaluate classification performance is the confusion matrix.

## Load the pipes (%<>%, %$%) and mutate():
library(magrittr)
library(dplyr)

## Add predictions to the dataset:
titanic %<>%
    mutate(piHat = predict(glmFit, type = "response"),
           yHat = as.factor(ifelse(piHat <= 0.5, "no", "yes")))

library(caret)

cMat <- titanic %$% confusionMatrix(data = yHat, reference = survived)

cMat$table
           Reference
Prediction  no yes
       no  458 106
       yes  87 236

Each cell in the confusion matrix represents a certain classification result:

  • TP true positive: Correctly predicted positive
  • TN true negative: Correctly predicted negative
  • FP false positive: Falsely predicted positive
  • FN false negative: Falsely predicted negative
cMat$overall
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue McnemarPValue
7.824126e-01 5.359709e-01 7.537802e-01 8.091549e-01 6.144307e-01 7.471405e-27 1.950898e-01
cMat$byClass
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall
0.8403670 0.6900585 0.8120567 0.7306502 0.8120567 0.8403670
F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy  
0.8259693 0.6144307 0.5163472 0.6358512 0.7652127  

Summaries/statistics of the Confusion matrix:

  • \(Accuracy = (TP+TN)/(P+N)\)
  • \(Error\ Rate = (FP+FN)/(P+N)=1-Accuracy\)
  • \(Sensitivity = TP/(TP+FN)\)
  • \(Specificity = TN/(TN+FP)\)
  • \(False\ Positive\ Rate (FPR)=FP/(TN+FP)=1-Specificity\)
  • \(Positive\ Predictive\ Value\ (PPV) = TP / (TP + FP)\)
  • \(Negative\ Predictive\ Value\ (NPV) = TN / (TN + FN)\)
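
As a check of these formulas (using the confusion matrix above; caret treats the first factor level, "no", as the positive class here):

tab <- cMat$table

## Accuracy = (TP + TN) / (P + N):
sum(diag(tab)) / sum(tab)

## Sensitivity = TP / (TP + FN), with "no" as the positive class:
tab["no", "no"] / sum(tab[, "no"])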

9.7.2. TODO ROC Curve

A receiver operating characteristic (ROC) curve illustrates the diagnostic ability of a binary classifier for all possible values of the classification threshold.

The area under the ROC curve (AUC) is a one-number summary of the potential performance of the classifier.

  • The AUC does not depend on the classification threshold.
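
The rocData object used below could be constructed with the pROC package as follows (a sketch, assuming the piHat predictions added earlier):

library(pROC)

## Build the ROC object from the observed outcomes and predicted probabilities:
rocData <- roc(response = titanic$survived, predictor = titanic$piHat)

plot(rocData)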
pROC::auc(rocData)
Area under the curve: 0.8298

Rule of thumb for AUC according to Mandrekar (2010):

  • AUC value from 0.7 – 0.8: Acceptable
  • AUC value from 0.8 – 0.9: Excellent
  • AUC value over 0.9: Outstanding

9.7.3. Threshold Selection

We can use numerical methods to estimate an optimal threshold value

library(OptimalCutpoints)
ocOut <- optimal.cutpoints(X = "piHat",
                           status = "survived",
                           tag.healthy = "no",
                           data = titanic,
                           method = "ROC01"
                           )

partSummary(ocOut, -1)

Area under the ROC curve (AUC): 0.83 (0.802, 0.858)

CRITERION: ROC01

Number of optimal cutoffs: 1

  Estimate
cutoff 0.2360978
Se 0.7543860
Sp 0.7761468
PPV 0.6789474
NPV 0.8343195
DLR.Positive 3.3700029
DLR.Negative 0.3164531
FP 122.0000000
FN 84.0000000
Optimal criterion 0.1104365

9.7.4. Cross-Entropy Error

Not exam material, but it might be useful for the practical.

Measuring classification performance from a confusion matrix can be problematic.

  • Sometimes too coarse.

We can also base our error measure on the residual deviance with the Cross-Entropy Error: \[CEE = -N^{-1} \sum^N_{n=1} \left[ Y_n\ln(\hat\pi_n) + (1 - Y_n) \ln(1 -\hat\pi_n) \right]\]

  • The CEE is sensitive to classification confidence.
  • Stronger predictions are more heavily weighted
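
A by-hand CEE computation for the titanic model (illustrative, reusing the piHat predictions from the classification section):

## Code the outcome as 0/1 and apply the CEE formula directly:
y <- as.numeric(titanic$survived == "yes")

-mean(y * log(titanic$piHat) + (1 - y) * log(1 - titanic$piHat))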

The misclassification rate is a naïvely appealing option.

  • The proportion of cases assigned to the wrong group

Consider two perfect classifiers:

  1. \(P(\hat Y_n = 1 \mid Y_n = 1) = 0.90\), \(P(\hat Y_n = 1 \mid Y_n = 0) = 0.10\), \(n = 1, 2, \ldots, N\)
  2. \(P(\hat Y_n = 1 \mid Y_n = 1) = 0.55\), \(P(\hat Y_n = 1 \mid Y_n = 0) = 0.45\), \(n = 1, 2, \ldots, N\)

Both of these classifiers will have the same misclassification rate.

  • Neither model ever makes an incorrect group assignment.

The first model will have a lower CEE.

  • The classifications are made with higher confidence.
  • \(CEE_1 = 0.105\), \(CEE_2 = 0.598\)

Author: Martijn Voordouw

Created: 2024-01-10 Wed 17:42