Data Simulation in R

NBP Tea Time

Ann Huang

08/12/2022

Why Simulate

Functions for Probability Distributions in R

Functions for simulating probability distributions in R

Functions for Probability Distributions in R

Corresponding functions for evaluating the probability distributions, prefixed with a

Generate Random Numbers

Take Normal distributions as an example

dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)

If you do not specify the mean and standard deviation, the default will be a standard normal distribution which has mean 0 and standard deviation 1.

x <- rnorm(5) # the vector has 5 random normal numbers with mean 0 sd 1
x
## [1]  1.6443789 -1.2462700  0.3392013  1.6931209 -0.8309770
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2463 -0.8310  0.3392  0.3199  1.6444  1.6931

Set Seed for Reproducibility

set.seed(1)
rnorm(5)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
rnorm(5) # vector is different
## [1] -0.8204684  0.4874291  0.7383247  0.5757814 -0.3053884
set.seed(1) # reset the seed to 1
rnorm(5)    # exactly the same as the first one
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

Random Sampling

set.seed(1)
sample(1:10,4) # sample randomly 4 of the integers
## [1] 9 4 7 1
sample(1:10,4) # 1~10 without replacement
## [1] 2 7 3 6
sample(letters, 5) 
## [1] "r" "s" "a" "u" "w"
sample(1:10) # if you don't specify you get a permutation
##  [1] 10  6  9  2  1  5  8  4  3  7
sample(1:10)
##  [1]  5 10  2  8  6  1  4  3  9  7
sample(1:10, replace = TRUE) # sample with replacement
##  [1]  3  6 10 10  6  4  4 10  9  7

Random Sampling

sample(c(0,1), size = 20, replace = TRUE, prob = c(0.3, 0.7))
##  [1] 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0
rbinom(n = 20, size = 1, prob = 0.7) # like coin flipping distribution
##  [1] 1 0 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 1 1 1
                                     # 20 ppl flipped a coin 1 time
                                     # land "success"/heads 70% of the time
rbinom(n = 10, size = 5, prob = 0.5)
##  [1] 2 1 3 4 3 3 2 2 3 3

Simulate a Linear Model

Suppose we want to simulate data from the following model

\(Y = \beta_{0} + \beta_{1}x+ \epsilon\)

where \(\epsilon \sim N(0,2^2)\). Assume \(x \sim N(0,1^2)\), \(\beta_{0} = 0.5\) and \(\beta_{1} = 2\)

set.seed(2)
x <- rnorm(100) # x the predictor with standard Normal distribution
e <- rnorm(100, 0, 2) # epsilon with mean 0 and sd 2
y <- 0.5 + 2 * x + e  # add them altogether to generate y
summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -6.1180 -1.5795  0.6741  0.4970  2.9112  7.1640

Simulate a Linear Model

plot(x,y)

Simulate a Linear Model

What if \(x\) is binary?

set.seed(3)
x <- rbinom(100, 1, 0.5) # generate a 100 binomial random variables
e <- rnorm(100, 0, 2)    # normal error term
y <- 0.5 + 2 * x + e
summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.4647 -0.2239  1.4970  1.4364  3.2434  4.9341

Simulate a Linear Model

plot(x,y)

Simulate a Logistic Regression Model

\(log(\frac{p(X)}{1-p(X)})= \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}\)
\(Prob{Y = 1|X} = \frac {1}{1 + e^{-X\beta}}\)

set.seed(4)
gender <- sample(c(0,1), size = 1000, replace = TRUE) # 1 for male, 0 for female
age <- round(runif(1000, 18, 80)) # draws 1000 values from a uniform distribution
                                  # that ranges in value from 18 to 80

b0 <- -8  # intercept
# betas represent the change in log odds per unit change in the predictors,
# i.e. when gender = 1, the log odds increase by 3.5

b1 <- 3.5
b2 <- 0.2

xbs <- b0 + b1*gender + b2*age

p <- 1/(1 + exp(-(xbs))) # simulate/generate some outcome probabilities
summary(p)               # notice that p lies between 0 and 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01213 0.62246 0.97205 0.77181 0.99877 0.99999
y <- rbinom(n = 1000, size = 1, prob = p) # with this p, we could generate ones
                                          # and zeros to indicate
                                          # occurrence or non-occurrence
summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   0.765   1.000   1.000

Simulate a Logistic Regression Model

mod <- glm(y ~ gender + age, family = "binomial")
summary(mod)
## 
## Call:
## glm(formula = y ~ gender + age, family = "binomial")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.98202   0.00253   0.03347   0.21640   3.14730  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.40723    0.77766  -12.10   <2e-16 ***
## gender       3.74439    0.36366   10.30   <2e-16 ***
## age          0.23482    0.01882   12.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1090.50  on 999  degrees of freedom
## Residual deviance:  392.67  on 997  degrees of freedom
## AIC: 398.67
## 
## Number of Fisher Scoring iterations: 8
library(performance)
r2_tjur(mod)
## Tjur's R2 
## 0.6580922

Simulate a Logistic Regression Model

exp(3.5) # obtain odds ratio
## [1] 33.11545
# male, age 30
p_male <- 1/(1 + exp(-(-9 + 3.5*1 + 0.2*30)))
p_male
## [1] 0.6224593
# female, age 30
p_female <- 1/(1 + exp(-(-9 + 3.5*0 + 0.2*30)))
p_female
## [1] 0.04742587

Summary

References