The correct way to simulate binary outcome data for logistic regression?
I am trying to conduct a simulation study that involves 3 covariates and a binary outcome based on the three covariates. What I did is to simulate three continuous variable x1,x2,x3 based on normal distribution, and assign coefficients $gamma$ for each of these covariates. Then I use function $exp(x^T\gamma)/(1+exp(x^T\gamma))$ to generate the value for $\pi$, and then generate binary outcome using binomial distribution based on $\pi$. My task is to test if using glm()
in R with family "binomial" can give good estimation for the coefficient. However, the estimated coefficient is not very close to the true coefficient. However, I don't think this should happen if my data simulation process is correct.
I have attached my code in R. The reason I use logistic model without interception is because I didn't assume interception when generating t
.
r <- c(0.33,3.75,2.82)
# specifying x when there's no correlation
x1 <- rnorm(500,1,1)
x2 <- rnorm(500,2,1)
x3 <- rnorm(500,5,1)
df2 <- data.frame(x1=x1,x2=x2,x3=x3)
x <- as.matrix(df2)
df2 <- df2 %>%
mutate(l = x %*% r,
p = 1/(1+l),
t = rbinom(n(),1,t))
coef_corr_2 <- glm(t ~ x1+x2+x3-1,data=df2,family="binomial") %>% coef()
The estimated coefficient values are
> glm(t ~ x1+x2+x3-1,data=df2,family="binomial") %>% coef()
x1 x2 x3
-0.008725804 -0.022818767 0.083695370
which is so far off from the true value.
I am wondering if I have done anything wrong at the data generation stage? I wish someone can spot the problems in my code or my line of reasoning, which really is just with such simple generation process the glm()
should give a very good estimation of the coefficient.
Answers
Your approach to simulating the data and fitting a logistic regression model seems correct in principle, but there are a few issues in your R code that might lead to unexpected results:
-
Intercept Term: By excluding the intercept term (
-1
in the formula), you are assuming that the logistic regression model passes through the origin. This may not be appropriate depending on the underlying data generating process. Consider including the intercept term in the model. -
Sample Size: A sample size of 500 might not be large enough to accurately estimate the coefficients, especially if the effect sizes are small or if there is noise in the data. Consider increasing the sample size to obtain more reliable estimates.
-
Random Seed: It's a good practice to set a random seed before generating random numbers to ensure reproducibility. This can be done using the
set.seed()
function. -
Correlation among Covariates: In your simulation, you have generated the covariates
x1
,x2
, andx3
independently. Consider introducing correlation among these covariates if it reflects the real-world scenario.
Here's an updated version of your R code addressing these points:
# Set random seed for reproducibility
set.seed(123)
# True coefficients
true_coefs <- c(0.33, 3.75, 2.82)
# Sample size
n <- 1000
# Generate covariates with correlation
library(MASS)
Sigma <- matrix(c(1, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 1), ncol = 3)
covariates <- mvrnorm(n, mu = c(1, 2, 5), Sigma = Sigma)
# Calculate linear predictor
linear_pred <- covariates %*% true_coefs
# Calculate probabilities
probabilities <- plogis(linear_pred)
# Generate binary outcome
outcomes <- rbinom(n, size = 1, prob = probabilities)
# Create data frame
df <- data.frame(cbind(outcomes, covariates))
# Fit logistic regression model
model <- glm(outcomes ~ ., data = df, family = "binomial")
# Display coefficients
coef(model)
In this code:
- I've introduced correlation among the covariates using the
mvrnorm
function from theMASS
package. - I've increased the sample size to 1000 for more reliable estimates.
- I've set a random seed (
set.seed(123)
) for reproducibility. - I've included the intercept term in the logistic regression model by default.
- I've simulated the binary outcome based on the calculated probabilities.
By incorporating these changes, you should obtain more accurate estimates of the coefficients using the logistic regression model. Adjust the correlation matrix and sample size as needed to better reflect your actual data scenario.