suppressPackageStartupMessages({
library(piecewiseSEM)
library(tidyverse)
library(stargazer)
library(broom)
})
storks <- read.csv("Data/storks.csv")
This course was part of the pre-conference workshops at the International Congress for Conservation Biology, in Cartagena, Colombia. The course was given by Dr. Achaz von Hardenberg @achazhardenberg.
Subjects on causal inference, path analalysis and bayesian inference were covered.
The scientist’s mantra: “Correlation does not imply causation”
ggplot(data = storks, mapping=aes(x = Storks, y = Birth)) +
geom_point() +
geom_smooth(method = "lm", color = "black") +
theme_bw() +
labs(x = "Number of breeding stork pairs", y = "Human birth rate (thousands/year)")
lm(Birth ~ Storks, data = storks) %>%
stargazer(single.row = T, type = "html", dep.var.labels = "Human birth rate *thousands / year)")
Dependent variable: | |
Human birth rate *thousands / year) | |
Storks | 0.029*** (0.009) |
Constant | 225.029** (93.561) |
Observations | 17 |
R2 | 0.385 |
Adjusted R2 | 0.344 |
Residual Std. Error | 332.185 (df = 15) |
F Statistic | 9.380*** (df = 1; 15) |
Note: | p<0.1; p<0.05; p<0.01 |
Other variables appear to be correlated, so we can inspect their correlation and include relevant ones in the model.
library(corrplot)
corrplot(cor(storks[,-1]), method="ellipse", type="lower")
lm1 <- lm(Birth ~ Storks, storks)
lm2 <- lm(Area ~ Storks, storks)
lm3 <- lm(Birth ~ Area, storks)
lm4 <- lm(Birth ~ Area + Storks, storks)
stargazer(lm1, lm2, lm3, lm4, type ="html", single.row = T, add.lines = list(AIC = c("AIC", formatC(AIC(lm1, lm2, lm3, lm4)[,2], digits = 2, format = "f"))))
number of rows of result is not a multiple of vector length (arg 2)
Dependent variable: | ||||
Birth | Area | Birth | ||
(1) | (2) | (3) | (4) | |
Storks | 0.029*** (0.009) | 14.400** (5.231) | 0.006 (0.006) | |
Area | 0.002*** (0.0002) | 0.002*** (0.0002) | ||
Constant | 225.029** (93.561) | 146,817.600** (52,057.720) | -7.775 (56.938) | -7.412 (56.702) |
AIC | 249.51 | 464.44 | 225.39 | 226.08 |
Observations | 17 | 17 | 17 | 17 |
R2 | 0.385 | 0.336 | 0.851 | 0.862 |
Adjusted R2 | 0.344 | 0.291 | 0.841 | 0.842 |
Residual Std. Error | 332.185 (df = 15) | 184,830.100 (df = 15) | 163.422 (df = 15) | 162.744 (df = 14) |
F Statistic | 9.380*** (df = 1; 15) | 7.578** (df = 1; 15) | 85.731*** (df = 1; 15) | 43.787*** (df = 2; 14) |
Note: | p<0.1; p<0.05; p<0.01 |
After including Area
, for example, we observe that the effect of Storks
significantly decreases.
Correlation imples and unresolved causal structure
Once we have identified processes, we can start to imply causation:
Causality always imples a completely resolved correlation structure
Limitations of commonly employed statistical methods (i.e. multiple linear regression):
Can only analyze one dependent variable at a time
A particular variable can either be a predictor or a response
“Path analysis is an extension of multiple regression, and was developed to overcome these limitations”
In path analysis, we call each variable a vertex (or vertices, in plural) and their connections are called edges. Directed graphs include directions (i.e. the edges are arrow that define direction). In Indirect graphs, edges only connect vertices, but do not specify the direction of influence.
A full explanation on the grammar of path analysis is found in Achaz’s materials
From his model:
DiagrammeR::grViz("
digraph boxes_and_circles{
# Define nodes
node [shape = square
penwidth = 2,
fontsize = 24]
A; B; C; D; E; F
# Add edge statements
#Advaance classes
A -> B
B -> C
B -> D
E -> C
E -> D
F -> E
}
", height = 500)
A is:
direct cause of B
indirect cause of C and F
B is:
C and D are:
indirectly independent on A and F
directly dependent on B and E
E is:
F is:
direct cause of E
Indirect cause of C and F
In the case above, B is an active vertex, becauser it is bot an effect (of A) and a cause (of C or D). Similarly, D is an inactive vertez, which is only an effect of 1 or more other vertices, but not a cause of anything. Often, we try to understand relationships between D and A, leaving out B.
Drawa causal graph with 6 variables:
(fromAtoF)where:
A is direct cause of B
C is a causal child of B
D is directly dependent on B and C
F is directly dependent on D and E
DiagrammeR::grViz("
digraph boxes_and_circles{
# Define nodes
node [shape = square
penwidth = 2,
fontsize = 24]
A; B; C; D; E; F
# Add edge statements
#Advaance classes
A -> B
B -> C
C -> D
B -> D
D -> F
E -> F
}
", height = 500)
To start transforming this into statistical models, we perform d-separation:
d_separ <- data.frame(Pairs = c("C,D","A,C", "A,D", "A,E", "A,F", "B,E","B,F", "C,F","D,F"))
d_separ
d_separ <- data.frame(Pairs = c("C,D","A,C", "A,D", "A,E", "A,F", "B,E","B,F", "C,F","D,F"),
Parents = c("B,E", "B", "B,E", "F", "None", "A, F", "A", "B,E", "B,E"))
d_separ
d_separ %>%
mutate(D_separation = paste0("(", Pairs, ")", ", {", Parents, "}"))
The number of elements in the basis test should be given by
\[\frac{V!}{2(V-2)!}-A\]
In R code, we can define a function for this as:
numbers <- function(V, A){
n <- (factorial(V)/(2*factorial((V-2))))-A
return(n)
}
Where:
-A is the number of arrows
d_separ %>%
mutate(D_separation = paste0("(", Pairs, ")", ", {", Parents, "}"),
Claims = c("D ~ B + E + C", "C ~ B + A", "D ~ B + E + A", "E ~ F + A", "F ~ A", "E ~ A + F + B", "F ~ B + A", "F ~ B + E + C", "F ~ B + E + D"))
We then use Fisher’s C test to test the composite probability of the whole set.
With a model of the sape:
DiagrammeR::grViz("
digraph boxes_and_circles{
# Define nodes
node [shape = square
penwidth = 2,
fontsize = 24]
A; S; B; I
# Add edge statements
#Advaance classes
A -> S
A -> B
B -> I
}
", height = 500)
Where:
A = Area
S = Storks
B = Birth rate
I = Inhabitants
Using the formula above, we know that we need to identify 3 pairs.
Using the steps above, we must identify all non-adjacent pairs
storke_pairs <- data.frame(Pairs = c("A, I", "S, B", "S,I"))
storke_pairs
Identify all parents of each pair
storke_pairs <- data.frame(Pairs = c("A, I", "S, B", "S,I"),
Parents = c("B", "A", "A, B"))
storke_pairs
The d-separation statements are then:
storke_pairs %>%
mutate(D_separation = paste0("(", Pairs, ")", ", {", Parents, "}"))
Test the probabilistic conditional dependence of every d-statement with the following claims:
storke_pairs %>%
mutate(D_separation = paste0("(", Pairs, ")", ", {", Parents, "}"),
Claims = c("I ~ B + A", "B ~ A + S", "I ~ A + B + I"))
Evaluate the models:
stargazer(path1, path2, path3, type = "html", report = c("vcp"), single.row = T)
number of rows of result is not a multiple of vector length (arg 2)
Dependent variable: | |||
I | B | I | |
(1) | (2) | (3) | |
B | 0.039 | 0.049 | |
p = 0.079 | p = 0.032 | ||
A | 0.00002 | 0.002 | 0.00002 |
p = 0.624 | p = 0.00001 | p = 0.575 | |
S | 0.006 | -0.001 | |
p = 0.307 | p = 0.111 | ||
Constant | 6.744 | -7.412 | 6.771 |
p = 0.162 | p = 0.898 | p = 0.138 | |
Observations | 17 | 17 | 17 |
R2 | 0.729 | 0.862 | 0.779 |
Adjusted R2 | 0.691 | 0.842 | 0.728 |
Residual Std. Error | 13.084 (df = 14) | 162.744 (df = 14) | 12.264 (df = 13) |
F Statistic | 18.872*** (df = 2; 14) | 43.787*** (df = 2; 14) | 15.297*** (df = 3; 13) |
Note: | p<0.1; p<0.05; p<0.01 |
Using Fisher’s C:
\[C = -2\sum_{i = 1}^{k{}} log(p_i)\] Where:
C follows a \(\chi^2\) distribution, with 2k degrees of freedom: 1-pchisq(C,2*k)
ppath1 <- broom::tidy(path1) %>% filter(term == "A") %>% select(p.value)
ppath2 <- broom::tidy(path2) %>% filter(term == "S") %>% select(p.value)
ppath3 <- broom::tidy(path3) %>% filter(term == "S") %>% select(p.value)
C <- rbind(ppath1, ppath2, ppath3) %>%
mutate(logs = log(p.value)) %$%
sum(logs)
C <- -2*C
test <- 1-pchisq(C,2*3)
Thus, the probability is p = 0.2597204.
piecewiseSEM
Here, we only specify the dependent paths as within an lm()
call. For example, in this model, I is caused by B, so we call lm(B~I)
. We include each lm call inside a list, and pass that to sem.fit
, which will identify the missing paths, and do all the calculations.
storks_model <- list(
lm(S ~ A, storks),
lm(B ~ A, storks),
lm(I ~ B, storks)) %>%
sem.fit(data = storks, .progressBar = F, conditional = T)
knitr::kable(storks_model)
missing.path | estimate | std.error | df | crit.value | p.value | |
---|---|---|---|---|---|---|
I ~ A + B | 0e+00 | 0.0000 | 14 | 0.5024 | 0.6232 | |
B ~ S + A | 6e-03 | 0.0057 | 14 | 1.0609 | 0.3067 | |
I ~ S + A + B | -8e-04 | 0.0004 | 13 | -1.7128 | 0.1105 |
fisher.c | df | p.value |
---|---|---|
7.72 | 6 | 0.26 |
AIC | AICc | K | n |
---|---|---|---|
25.72 | 51.434 | 9 | 17 |