Ann Huang
08/12/2022
Functions for simulating probability distributions in R
rnrom(): generate random normal variates with a given mean and standard deviationrbinom(): generate binomial distribution with a given size (number of trials) and probability (of success on each trial)runif(): generate uniform distribution with a given number of random samples within any interval definedrpois(): generate Poisson distribution with a given vector of (non-negative) meansCorresponding functions for evaluating the probability distributions, prefixed with a
d for densityr for random number generationp for cumulative distributionq for quantile functionTake Normal distributions as an example
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
If you do not specify the mean and standard deviation, the default will be a standard normal distribution which has mean 0 and standard deviation 1.
## [1] 1.6443789 -1.2462700 0.3392013 1.6931209 -0.8309770
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2463 -0.8310 0.3392 0.3199 1.6444 1.6931
set.seed() ensures reproducibility## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
## [1] -0.8204684 0.4874291 0.7383247 0.5757814 -0.3053884
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
sample() draws randomly from a specified set of objects (scalar) with or without replacement## [1] 9 4 7 1
## [1] 2 7 3 6
## [1] "r" "s" "a" "u" "w"
## [1] 10 6 9 2 1 5 8 4 3 7
## [1] 5 10 2 8 6 1 4 3 9 7
## [1] 3 6 10 10 6 4 4 10 9 7
sample() simulate binary data with specified probabilities## [1] 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0
rbinom() that generate random values from a binomial distribution with a certain trial size and probability p of success or occurrence## [1] 1 0 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 1 1 1
# 20 ppl flipped a coin 1 time
# land "success"/heads 70% of the time
rbinom(n = 10, size = 5, prob = 0.5)## [1] 2 1 3 4 3 3 2 2 3 3
Suppose we want to simulate data from the following model
\(Y = \beta_{0} + \beta_{1}x+ \epsilon\)
where \(\epsilon \sim N(0,2^2)\). Assume \(x \sim N(0,1^2)\), \(\beta_{0} = 0.5\) and \(\beta_{1} = 2\)
set.seed(2)
x <- rnorm(100) # x the predictor with standard Normal distribution
e <- rnorm(100, 0, 2) # epsilon with mean 0 and sd 2
y <- 0.5 + 2 * x + e # add them altogether to generate y
summary(y)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.1180 -1.5795 0.6741 0.4970 2.9112 7.1640
What if \(x\) is binary?
set.seed(3)
x <- rbinom(100, 1, 0.5) # generate a 100 binomial random variables
e <- rnorm(100, 0, 2) # normal error term
y <- 0.5 + 2 * x + e
summary(y)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.4647 -0.2239 1.4970 1.4364 3.2434 4.9341
\(log(\frac{p(X)}{1-p(X)})= \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}\)
\(Prob{Y = 1|X} = \frac {1}{1 + e^{-X\beta}}\)
set.seed(4)
gender <- sample(c(0,1), size = 1000, replace = TRUE) # 1 for male, 0 for female
age <- round(runif(1000, 18, 80)) # draws 1000 values from a uniform distribution
# that ranges in value from 18 to 80
b0 <- -8 # intercept
# betas represent the change in log odds per unit change in the predictors,
# i.e. when gender = 1, the log odds increase by 3.5
b1 <- 3.5
b2 <- 0.2
xbs <- b0 + b1*gender + b2*age
p <- 1/(1 + exp(-(xbs))) # simulate/generate some outcome probabilities
summary(p) # notice that p lies between 0 and 1## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01213 0.62246 0.97205 0.77181 0.99877 0.99999
y <- rbinom(n = 1000, size = 1, prob = p) # with this p, we could generate ones
# and zeros to indicate
# occurrence or non-occurrence
summary(y)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 0.765 1.000 1.000
glm() on the data to see how well it recovers the “true” \(\beta\) values##
## Call:
## glm(formula = y ~ gender + age, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.98202 0.00253 0.03347 0.21640 3.14730
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.40723 0.77766 -12.10 <2e-16 ***
## gender 3.74439 0.36366 10.30 <2e-16 ***
## age 0.23482 0.01882 12.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1090.50 on 999 degrees of freedom
## Residual deviance: 392.67 on 997 degrees of freedom
## AIC: 398.67
##
## Number of Fisher Scoring iterations: 8
## Tjur's R2
## 0.6580922
## [1] 33.11545
## [1] 0.6224593
## [1] 0.04742587
age