This vignette provides a quick demo of the truh package. The example considered here is taken from Figure 3 of the paper: Trambak Banerjee, Bhaswar B. Bhattacharya, Gourab Mukherjee, Ann. Appl. Stat. 14(4): 1777-1805 (December 2020) <DOI: 10.1214/20-AOAS1362>.
We consider a nonparametric two-sample testing problem in which the d-dimensional baseline (or uninfected) sample U = (U1, …, Un) is i.i.d. with cdf F0 and the d-dimensional treated (infected) sample V = (V1, …, Vm) is i.i.d. with cdf G. We assume that the heterogeneity in the baseline population is reflected by K different subgroups, each having a unimodal distribution, with distinct modes, cdfs F1, …, FK, and mixing proportions w1, …, wK such that $$F_0=\sum_{a=1}^{K}w_aF_a~\text{where}~w_a\in(0,1)~\text{and}~\sum_{a=1}^{K}w_a=1. $$
The goal is to test the following composite hypothesis: H0 : G ∈ ℱ(F0) versus H1 : G ∉ ℱ(F0), where ℱ(F0) is the convex hull of F1, …, FK. We take d = 2, n = 2000, m = 500 and sample U1, …, Un from F0 where F0 = 0.3N(0, I2) + 0.3N(μ1, I2) + 0.4N(μ2, I2), with μ1 = (0, −4) and μ2 = (4, −2).
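n <- 2000
d <- 2
# Sampling the baseline (uninfected)
set.seed(1)
p <- runif(n, 0, 1)  # p decides which mixture component each row comes from
set.seed(10)
# All three components are drawn for every row; the logical masks keep only
# the draw from the selected component and zero out the other two.
U <- (p <= 0.3) * matrix(rnorm(d * n), n, d) +
  (p > 0.3 & p <= 0.6) * cbind(matrix(rnorm(n), n, 1),
                               matrix(rnorm(n, -4), n, 1)) +
  (p > 0.6) * cbind(matrix(rnorm(n, 4), n, 1),
                    matrix(rnorm(n, -2), n, 1))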
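The masking construction above draws every component for every row. As an illustration only, the same mixture could be sampled by first drawing a component label for each observation. The helper `rmixnorm` below is hypothetical (it is not part of truh), assumes identity covariance for every component, and will not reproduce the exact values of U because it consumes random numbers in a different order.

```r
# Hypothetical helper (not part of truh): draw n points from a mixture of
# d-dimensional Gaussians with identity covariance.
# w  : mixing proportions (must sum to 1)
# mu : K x d matrix whose rows are the component means
rmixnorm <- function(n, w, mu) {
  stopifnot(abs(sum(w) - 1) < 1e-8, length(w) == nrow(mu))
  d <- ncol(mu)
  # Component label for each observation, drawn with probabilities w
  z <- sample(seq_along(w), size = n, replace = TRUE, prob = w)
  # Standard normal noise shifted by the mean of the assigned component
  matrix(rnorm(n * d), n, d) + mu[z, , drop = FALSE]
}

set.seed(1)
mu <- rbind(c(0, 0), c(0, -4), c(4, -2))
U_alt <- rmixnorm(2000, w = c(0.3, 0.3, 0.4), mu = mu)
dim(U_alt)
colMeans(U_alt)  # should be close to the mixture mean (1.6, -2)
```

The mixture mean follows from the weights and component means: 0.4 × 4 = 1.6 in the first coordinate and 0.3 × (−4) + 0.4 × (−2) = −2 in the second.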
To sample V1, …, Vm we consider three settings for G.
# Setting 1: sampling the treated (infected) from N(mu2, I2)
m <- 500
set.seed(50)
V1 <- cbind(matrix(rnorm(m, 4), m, 1),
            matrix(rnorm(m, -2), m, 1))
# Scatter plot of the data
grp <- c(rep('Baseline', n),
         rep('Treated', m))
plot(c(U[, 1], V1[, 1]), c(U[, 2], V1[, 2]),
     pch = 19,
     col = factor(grp),
     xlab = 'X_1',
     ylab = 'X_2')
# Legend
legend("topright",
       legend = levels(factor(grp)),
       pch = 19,
       col = factor(levels(factor(grp))))
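The same scatter plot and legend are drawn once per setting. A small wrapper could avoid the repetition. The function `plot_samples` below is a hypothetical convenience, not something provided by truh, and it uses integer colour codes in place of the factor-valued `col` used in this vignette.

```r
# Hypothetical helper (not part of truh): overlay baseline and treated samples.
# U : n x 2 matrix of baseline points, V : m x 2 matrix of treated points
plot_samples <- function(U, V) {
  grp <- factor(c(rep("Baseline", nrow(U)), rep("Treated", nrow(V))))
  plot(c(U[, 1], V[, 1]), c(U[, 2], V[, 2]),
       pch = 19,
       col = as.integer(grp),  # 1 = Baseline, 2 = Treated
       xlab = "X_1",
       ylab = "X_2")
  legend("topright",
         legend = levels(grp),
         pch = 19,
         col = seq_along(levels(grp)))
  invisible(grp)  # return the group labels without printing them
}
```

With the objects defined in this vignette, `plot_samples(U, V1)` would draw the Setting 1 figure.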
# Setting 2: sampling the treated (infected) from
# 0.5 N((2, -2), I2) + 0.5 N((3, 3), I2)
m <- 500
set.seed(20)
q <- runif(m, 0, 1)
set.seed(50)
V2 <- (q <= 0.5) * cbind(matrix(rnorm(m, 2), m, 1),
                         matrix(rnorm(m, -2), m, 1)) +
  (q > 0.5) * cbind(matrix(rnorm(m, 3), m, 1),
                    matrix(rnorm(m, 3), m, 1))
# Scatter plot of the data
plot(c(U[, 1], V2[, 1]), c(U[, 2], V2[, 2]),
     pch = 19,
     col = factor(grp),
     xlab = 'X_1',
     ylab = 'X_2')
# Legend
legend("topright",
       legend = levels(factor(grp)),
       pch = 19,
       col = factor(levels(factor(grp))))
# Setting 3: sampling the treated (infected) from
# 0.8 N(0, I2) + 0.1 N(mu1, I2) + 0.1 N(mu2, I2)
m <- 500
set.seed(20)
q <- runif(m, 0, 1)
set.seed(50)
V3 <- (q <= 0.8) * matrix(rnorm(d * m), m, d) +
  (q > 0.8 & q <= 0.9) * cbind(matrix(rnorm(m), m, 1),
                               matrix(rnorm(m, -4), m, 1)) +
  (q > 0.9) * cbind(matrix(rnorm(m, 4), m, 1),
                    matrix(rnorm(m, -2), m, 1))
# Scatter plot of the data
plot(c(U[, 1], V3[, 1]), c(U[, 2], V3[, 2]),
     pch = 19,
     col = factor(grp),
     xlab = 'X_1',
     ylab = 'X_2')
# Legend
legend("topright",
       legend = levels(factor(grp)),
       pch = 19,
       col = factor(levels(factor(grp))))
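Reading the sampling code above, the three choices of G can be written out explicitly. The labels G_1, G_2, G_3 are introduced here only for convenience and do not appear in the original text; μ1 = (0, −4) and μ2 = (4, −2) are as defined earlier.

$$G_1 = N(\mu_2, I_2)$$

$$G_2 = 0.5\,N\big((2,-2), I_2\big) + 0.5\,N\big((3,3), I_2\big)$$

$$G_3 = 0.8\,N(0, I_2) + 0.1\,N(\mu_1, I_2) + 0.1\,N(\mu_2, I_2)$$

G_1 and G_3 are mixtures of the components of F0 (with weights (0, 0, 1) and (0.8, 0.1, 0.1) respectively), so both lie in ℱ(F0). G_2 is built from two Gaussians that are not components of F0, so it lies outside ℱ(F0).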
Let us now execute the truh testing procedure for these scenarios. Recall that the goal is to test the following composite hypothesis: H0 : G ∈ ℱ(F0) versus H1 : G ∉ ℱ(F0).
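The calls that produced the p-values reported below are not shown in this text. The sketch that follows is an assumption about how such a call might look: it supposes that the package exports a function `truh(V, U, B)` taking the treated sample, the baseline sample and a number of resamples B, and that the result carries a `pval` element. Check the package reference manual before relying on these names. The data here are small synthetic matrices, not the samples from this vignette, and the call is skipped when the package is not installed.

```r
# Small synthetic data so the sketch runs quickly; not the vignette's samples
set.seed(1)
V_small <- matrix(rnorm(20 * 2), nrow = 20, ncol = 2)
set.seed(2)
U_small <- matrix(rnorm(200 * 2), nrow = 200, ncol = 2)

pval <- NA_real_  # stays NA if the truh package is not installed
if (requireNamespace("truh", quietly = TRUE)) {
  # Assumed interface: truh(V, U, B) returning a list with element $pval
  out <- truh::truh(V_small, U_small, B = 100)
  pval <- out$pval
}
pval
```

For the vignette's own data the corresponding calls would presumably pass V1, V2 or V3 together with U. The value of B behind the printed p-values is not stated here.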
- Setting 1: Here G = N(μ2, I2) is one of the mixture components of F0, so G ∈ ℱ(F0) and H0 is true.
## [1] 0.375
With a p-value of 0.375, truh correctly fails to reject the null hypothesis.
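- Setting 2: Here G is an equal mixture of N((2, −2), I2) and N((3, 3), I2), neither of which is a component of F0, so G ∉ ℱ(F0) and H0 is false.
## [1] 0
We see that truh correctly rejects the null hypothesis.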
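- Setting 3: Here G has the same components as F0 but different mixing proportions (0.8, 0.1, 0.1), so G ∈ ℱ(F0) and H0 is true.
## [1] 0.205
In this case, truh makes the correct decision and fails to reject H0.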
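The three decisions can be summarised by comparing each reported p-value with a significance level. The level 0.05 used below is an assumption for illustration; the text does not state which level was used.

```r
# p-values reported in this vignette for Settings 1, 2 and 3
pvals <- c(setting1 = 0.375, setting2 = 0, setting3 = 0.205)
alpha <- 0.05  # assumed significance level; not stated in the text

# Reject H0 when the p-value does not exceed alpha
decision <- ifelse(pvals <= alpha, "reject H0", "fail to reject H0")
decision
```

Under this assumed level only Setting 2 leads to rejection, which matches the conclusions stated above for all three settings.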