Functions for Probability Distributions in R

Functions for simulating probability distributions in R

rnrom(): generate random normal variates with a given mean and standard deviation
rbinom(): generate binomial distribution with a given size (number of trials) and probability (of success on each trial)
runif(): generate uniform distribution with a given number of random samples within any interval defined
rpois(): generate Poisson distribution with a given vector of (non-negative) means

Functions for Probability Distributions in R

Corresponding functions for evaluating the probability distributions, prefixed with a

d for density
r for random number generation
p for cumulative distribution
q for quantile function

Generate Random Numbers

Take Normal distributions as an example

dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)

If you do not specify the mean and standard deviation, the default will be a standard normal distribution which has mean 0 and standard deviation 1.

x <- rnorm(5) # the vector has 5 random normal numbers with mean 0 sd 1
x

## [1]  1.6443789 -1.2462700  0.3392013  1.6931209 -0.8309770

summary(x)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2463 -0.8310  0.3392  0.3199  1.6444  1.6931

Set Seed for Reproducibility

set.seed() ensures reproducibility
always set the random number seed when doing simulation!

set.seed(1)
rnorm(5)

## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

rnorm(5) # vector is different

## [1] -0.8204684  0.4874291  0.7383247  0.5757814 -0.3053884

set.seed(1) # reset the seed to 1
rnorm(5)    # exactly the same as the first one

## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

Random Sampling

The sample() draws randomly from a specified set of objects (scalar) with or without replacement

set.seed(1)
sample(1:10,4) # sample randomly 4 of the integers

## [1] 9 4 7 1

sample(1:10,4) # 1~10 without replacement

## [1] 2 7 3 6

sample(letters, 5)

## [1] "r" "s" "a" "u" "w"

sample(1:10) # if you don't specify you get a permutation

##  [1] 10  6  9  2  1  5  8  4  3  7

sample(1:10)

##  [1]  5 10  2  8  6  1  4  3  9  7

sample(1:10, replace = TRUE) # sample with replacement

##  [1]  3  6 10 10  6  4  4 10  9  7

Random Sampling

Using sample() simulate binary data with specified probabilities
Sample of 20 zeroes and ones, where 0 has a 30% chance of being sampled and 1 has a 70% chance of being sampled

sample(c(0,1), size = 20, replace = TRUE, prob = c(0.3, 0.7))

##  [1] 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0

Another way is to use the rbinom() that generate random values from a binomial distribution with a certain trial size and probability p of success or occurrence

rbinom(n = 20, size = 1, prob = 0.7) # like coin flipping distribution

##  [1] 1 0 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 1 1 1

                                     # 20 ppl flipped a coin 1 time
                                     # land "success"/heads 70% of the time
rbinom(n = 10, size = 5, prob = 0.5)

##  [1] 2 1 3 4 3 3 2 2 3 3

Simulate a Linear Model

Suppose we want to simulate data from the following model

\(Y = \beta_{0} + \beta_{1}x+ \epsilon\)

where \(\epsilon \sim N(0,2^2)\). Assume \(x \sim N(0,1^2)\), \(\beta_{0} = 0.5\) and \(\beta_{1} = 2\)

set.seed(2)
x <- rnorm(100) # x the predictor with standard Normal distribution
e <- rnorm(100, 0, 2) # epsilon with mean 0 and sd 2
y <- 0.5 + 2 * x + e  # add them altogether to generate y
summary(y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -6.1180 -1.5795  0.6741  0.4970  2.9112  7.1640

Simulate a Linear Model

plot(x,y)

Simulate a Linear Model

What if \(x\) is binary?

set.seed(3)
x <- rbinom(100, 1, 0.5) # generate a 100 binomial random variables
e <- rnorm(100, 0, 2)    # normal error term
y <- 0.5 + 2 * x + e
summary(y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.4647 -0.2239  1.4970  1.4364  3.2434  4.9341

Simulate a Linear Model

plot(x,y)

Simulate a Logistic Regression Model

Suppose we want to simulate from a logistic regression model

\(log(\frac{p(X)}{1-p(X)})= \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}\)
\(Prob{Y = 1|X} = \frac {1}{1 + e^{-X\beta}}\)

Goal: model occurrence or non-occurrence of a disease given predictors such as age and gender

set.seed(4)
gender <- sample(c(0,1), size = 1000, replace = TRUE) # 1 for male, 0 for female
age <- round(runif(1000, 18, 80)) # draws 1000 values from a uniform distribution
                                  # that ranges in value from 18 to 80

b0 <- -8  # intercept
# betas represent the change in log odds per unit change in the predictors,
# i.e. when gender = 1, the log odds increase by 3.5

b1 <- 3.5
b2 <- 0.2

xbs <- b0 + b1*gender + b2*age

p <- 1/(1 + exp(-(xbs))) # simulate/generate some outcome probabilities
summary(p)               # notice that p lies between 0 and 1

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01213 0.62246 0.97205 0.77181 0.99877 0.99999

y <- rbinom(n = 1000, size = 1, prob = p) # with this p, we could generate ones
                                          # and zeros to indicate
                                          # occurrence or non-occurrence
summary(y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   0.765   1.000   1.000

Simulate a Logistic Regression Model

Run a logistic regression using glm() on the data to see how well it recovers the “true” \(\beta\) values

mod <- glm(y ~ gender + age, family = "binomial")
summary(mod)

## 
## Call:
## glm(formula = y ~ gender + age, family = "binomial")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.98202   0.00253   0.03347   0.21640   3.14730  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.40723    0.77766  -12.10   <2e-16 ***
## gender       3.74439    0.36366   10.30   <2e-16 ***
## age          0.23482    0.01882   12.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1090.50  on 999  degrees of freedom
## Residual deviance:  392.67  on 997  degrees of freedom
## AIC: 398.67
## 
## Number of Fisher Scoring iterations: 8

library(performance)
r2_tjur(mod)

## Tjur's R2 
## 0.6580922

Simulate a Logistic Regression Model

Get the odds ratio or the probabilities for easier interpretation

exp(3.5) # obtain odds ratio

## [1] 33.11545

# male, age 30
p_male <- 1/(1 + exp(-(-9 + 3.5*1 + 0.2*30)))
p_male

## [1] 0.6224593

# female, age 30
p_female <- 1/(1 + exp(-(-9 + 3.5*0 + 0.2*30)))
p_female

## [1] 0.04742587

The odds of a “1” (disease occurrence) for male is about 33 times the odds of a “1” for a female
Could do the same for age

Data Simulation in R

NBP Tea Time

Why Simulate

Functions for Probability Distributions in R

Functions for Probability Distributions in R

Generate Random Numbers

Set Seed for Reproducibility

Random Sampling

Random Sampling

Simulate a Linear Model

Simulate a Linear Model

Simulate a Linear Model

Simulate a Linear Model

Simulate a Logistic Regression Model

Simulate a Logistic Regression Model

Simulate a Logistic Regression Model

Summary

References