Point-biserial correlation for multiple variables in R

xeufq47z  posted 5 months ago  in Other

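I am trying to compute the correlations among several variables. Most of my variables are continuous, but one of them is binary. I would like to produce a single matrix that contains all of the variables, but my code does not work. Here is my code: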

cor.test(df[, c('sat','pba','cte_certs', 'course_credits', "college.going")], use="complete.obs")

Here is what my data frame looks like:

[image: screenshot of the data frame]
How should I set up my code to compute the point-biserial correlation between these variables?

i2loujxw (answer #1)

One approach:

1. Check for normality:

library(ggplot2)
library(dplyr)
library(tidyr)
library(broom)

# Create a fake sample data frame: 

set.seed(123)
df <- data.frame(
  sat = rnorm(100, 100, 50),            
  pba = rnorm(100, 50, 75),            
  cte_certs = rnorm(100, 2, 55),       
  course_credits = rnorm(100, 20, 45),   
  college_going = rbinom(100, 1, 0.5)   
)

# Test for normality with the Shapiro-Wilk test and Q-Q plots
# Reshape to long format (one row per observation and variable)
df_long <- df %>% 
  pivot_longer(-college_going, names_to = "variable", values_to = "value")

# Perform Shapiro-Wilk test
test_results <- df_long %>% 
  group_by(variable) %>% 
  summarize(
    shapiro_p = shapiro.test(value)$p.value,
    .groups = 'drop'
  ) %>%
  mutate(across(starts_with("shapiro_p"), ~paste("SW:", round(., 3))))

# Merge test results with the long data
df_long <- left_join(df_long, test_results, by = "variable")

# Create QQ plot with test results
p <- ggplot(df_long, aes(sample = value)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~variable) +
  geom_text(aes(label = paste(shapiro_p), y = Inf, x = Inf), 
            hjust = 1.1, vjust = 2, size = 6, check_overlap = TRUE)+
  theme_minimal()

# Display the plot
p

Let's assume that all of the interval variables are normally distributed.
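The normality assumption matters mainly for the significance test rather than for the coefficient itself: for a 0/1 variable, the test that `cor.test()` performs on Pearson's r is equivalent to a pooled-variance two-sample t-test. The following is a minimal sketch of that equivalence on simulated data; the names `g` and `x` and the simulated effect size are illustrative assumptions, not part of the question's data.

```r
# Hypothetical data: g is a binary variable coded 0/1, x is continuous
set.seed(1)
g <- rbinom(100, 1, 0.5)
x <- rnorm(100, mean = 100 + 10 * g, sd = 50)

# Point-biserial correlation = Pearson correlation with a 0/1 variable
ct <- cor.test(x, g, method = "pearson")

# Pooled-variance (Student) two-sample t-test on the same data
tt <- t.test(x ~ g, var.equal = TRUE)

p_cor <- ct$p.value
p_t   <- tt$p.value
t_cor <- unname(ct$statistic)
t_t   <- unname(tt$statistic)

# The p-values agree; the t statistics agree up to sign, because
# t.test() reports group 0 minus group 1
print(all.equal(p_cor, p_t))
print(all.equal(abs(t_cor), abs(t_t)))
```

In practice this means the same checks one would make before a t-test (approximate normality within groups, similar variances) are the relevant ones here.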

2. Compute the point-biserial correlations:

# Helper: the point-biserial coefficient is Pearson's r with a 0/1-coded variable
point_biserial_cor <- function(binary_var, continuous_var) {
  cor.test(binary_var, continuous_var, method = "pearson")$estimate
}

# Identify binary and continuous variables
binary_var <- "college_going"
continuous_vars <- setdiff(names(df), binary_var)

# Pre-allocate a one-column matrix to hold the results
cor_matrix <- matrix(NA, nrow = length(continuous_vars), ncol = 1, dimnames = list(continuous_vars, binary_var))

# Calculate correlations
for (var in continuous_vars) {
  cor_matrix[var, binary_var] <- point_biserial_cor(df[[binary_var]], df[[var]])
}

# View the correlation matrix
print(cor_matrix)

Correlation matrix:

               college_going
sat              -0.13065092
pba               0.09670408
cte_certs        -0.18555204
course_credits    0.11620321
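The question asked for one matrix containing all of the variables. The call in the question fails because `cor.test()` compares exactly two vectors, whereas `cor()` accepts a whole data frame of numeric columns. Since the point-biserial coefficient is Pearson's r with a 0/1-coded variable, a full matrix can be sketched as below. This assumes the binary column is stored as numeric 0/1; it reuses the simulated data from this answer, and with real data the column selection from the question would take the place of `df`.

```r
# Same simulated data as in the answer (binary column coded 0/1)
set.seed(123)
df <- data.frame(
  sat = rnorm(100, 100, 50),
  pba = rnorm(100, 50, 75),
  cte_certs = rnorm(100, 2, 55),
  course_credits = rnorm(100, 20, 45),
  college_going = rbinom(100, 1, 0.5)
)

# Full 5 x 5 Pearson matrix; the college_going row/column holds the
# point-biserial coefficients. use = "complete.obs" drops rows with NA.
full_cor <- cor(df, use = "complete.obs", method = "pearson")
print(round(full_cor, 3))

# cor() returns no p-values, so run cor.test() pairwise for the binary column
continuous_vars <- setdiff(names(df), "college_going")
p_values <- sapply(continuous_vars, function(v) {
  cor.test(df[[v]], df[["college_going"]], method = "pearson")$p.value
})
print(round(p_values, 3))
```

The `college_going` column of `full_cor` should match the one-column matrix built by the loop above, while the remaining entries are ordinary Pearson correlations between the continuous variables.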

3. As an add-on, a visualization:

df_long %>% 
  mutate(college_going = factor(college_going)) %>% 
  mutate(id = row_number()) %>% 
  ggplot(aes(x=id, y=value, color=college_going, label = college_going)) +
  geom_point(size = 4, alpha=0.5)+
  geom_text(color = "black", hjust=0.5, vjust=0.5)+
  scale_color_manual(values = c("steelblue", "purple"), labels = c("No", "Yes"))+
  scale_x_continuous(breaks = 1:200, labels = 1:200)+
  scale_y_continuous(breaks= scales::pretty_breaks())+
  facet_wrap(. ~ variable, 
             nrow = 2, strip.position = "bottom", scales = "free") +
  labs(y = "value", 
       color="College going vs. not")+
  see::theme_modern()+  # theme_modern() comes from the 'see' package
  theme(
    aspect.ratio = 2,
    strip.background = element_blank(),
    strip.placement = "outside",
    legend.position = "bottom",
    axis.title.x=element_blank(),
    axis.text.x=element_blank(),
    axis.ticks.x=element_blank(),
    text=element_text(size=16)
  )
# Note see here: https://stackoverflow.com/questions/76098463/assistence-in-creating-point-biserial-correlation-plot
# It is a matter of choice whether one likes the plot or not!

