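I am trying to compute correlations among several variables. Most of my variables are continuous, but one of them is binary. I want to produce a single matrix that contains all of the variables, but my code does not work. Here is my code: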
cor.test(df[, c('sat','pba','cte_certs', 'course_credits', "college.going")], use="complete.obs")
Here is what my data frame looks like:
How should I set up my code to compute the point-biserial correlations between these variables?
i2loujxw1#
One approach:
1. Check for normality:
library(ggplot2)
library(dplyr)
library(tidyr)
library(broom)

# Create a fake sample data frame:
set.seed(123)
df <- data.frame(
  sat            = rnorm(100, 100, 50),
  pba            = rnorm(100, 50, 75),
  cte_certs      = rnorm(100, 2, 55),
  course_credits = rnorm(100, 20, 45),
  college_going  = rbinom(100, 1, 0.5)
)

# Test for normality with the Shapiro-Wilk test and a QQ plot:
# long format
df_long <- df %>%
  pivot_longer(-college_going, names_to = "variable", values_to = "value")

# Perform Shapiro-Wilk test
test_results <- df_long %>%
  group_by(variable) %>%
  summarize(
    shapiro_p = shapiro.test(value)$p.value,
    .groups = 'drop'
  ) %>%
  mutate(across(starts_with("shapiro_p"), ~paste("SW:", round(., 3))))

# Merge test results with the long data
df_long <- left_join(df_long, test_results, by = "variable")

# Create QQ plot with test results
p <- ggplot(df_long, aes(sample = value)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~variable) +
  geom_text(aes(label = paste(shapiro_p), y = Inf, x = Inf),
            hjust = 1.1, vjust = 2, size = 6, check_overlap = TRUE) +
  theme_minimal()

# Display the plot
print(p)
We assume that all the interval variables are normally distributed.
2. Compute the point-biserial correlations:
# Create a function
point_biserial_cor <- function(binary_var, continuous_var) {
  cor.test(binary_var, continuous_var, method = "pearson")$estimate
}

# Identify binary and continuous variables
binary_var <- "college_going"
continuous_vars <- setdiff(names(df), binary_var)

# Set up the correlation matrix
cor_matrix <- matrix(NA, nrow = length(continuous_vars), ncol = 1,
                     dimnames = list(continuous_vars, binary_var))

# Calculate correlations
for (var in continuous_vars) {
  cor_matrix[var, binary_var] <- point_biserial_cor(df[[binary_var]], df[[var]])
}

# View the correlation matrix
print(cor_matrix)
Correlation matrix:
               college_going
sat              -0.13065092
pba               0.09670408
cte_certs        -0.18555204
course_credits    0.11620321
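Because the point-biserial coefficient is numerically the Pearson correlation computed with the binary variable coded 0/1, a single matrix containing all five variables (which is what the question asks for) can also be sketched with base R's `cor()`. The block below is only a minimal sketch and not part of the answer's own code: it rebuilds the same simulated `df` so that it runs on its own, and `full_cor` and `r_formula` are names introduced here for illustration.

```r
# Rebuild the simulated data so this block runs on its own
set.seed(123)
df <- data.frame(
  sat            = rnorm(100, 100, 50),
  pba            = rnorm(100, 50, 75),
  cte_certs      = rnorm(100, 2, 55),
  course_credits = rnorm(100, 20, 45),
  college_going  = rbinom(100, 1, 0.5)
)

# Full 5 x 5 Pearson matrix; the college_going row/column holds the
# point-biserial coefficients. use = "complete.obs" drops rows containing NA.
full_cor <- cor(df, use = "complete.obs")
print(round(full_cor, 3))

# Cross-check one entry against the textbook point-biserial formula:
# r_pb = (M1 - M0) / s_n * sqrt(p * q), where s_n is the SD with divisor n
x   <- df$sat
g   <- df$college_going
p   <- mean(g)                          # proportion coded 1
s_n <- sqrt(mean((x - mean(x))^2))      # divisor n, not n - 1
r_formula <- (mean(x[g == 1]) - mean(x[g == 0])) / s_n * sqrt(p * (1 - p))

# Compare the formula result with the matrix entry
print(all.equal(r_formula, full_cor["sat", "college_going"]))
```

Using `cor()` on the whole data frame keeps the continuous-by-continuous correlations in the same matrix, so no separate loop is needed when only the coefficients are of interest.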
3. As an add-on, a visualization:
# theme_modern() is assumed to come from the see package,
# which the original code does not load
library(see)

df_long %>%
  mutate(college_going = factor(college_going)) %>%
  mutate(id = row_number()) %>%
  ggplot(aes(x = id, y = value, color = college_going, label = college_going)) +
  geom_point(size = 4, alpha = 0.5) +
  geom_text(color = "black", hjust = 0.5, vjust = 0.5) +
  scale_color_manual(values = c("steelblue", "purple"), labels = c("No", "Yes")) +
  scale_x_continuous(breaks = 1:200, labels = 1:200) +
  scale_y_continuous(breaks = scales::pretty_breaks()) +
  facet_wrap(. ~ variable, nrow = 2, strip.position = "bottom", scales = "free") +
  labs(y = "value", color = "College going vs. not") +
  theme_modern() +
  theme(
    aspect.ratio = 2,
    strip.background = element_blank(),
    strip.placement = "outside",
    legend.position = "bottom",
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    text = element_text(size = 16)
  )

# Note, see here: https://stackoverflow.com/questions/76098463/assistence-in-creating-point-biserial-correlation-plot
# It is a matter of choice whether one likes the plot or not!
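`cor.test()` expects two numeric vectors (`x` and `y`) rather than a data frame, which is why the call in the question fails. If p-values are wanted alongside the coefficients, one option is to run `cor.test()` once per continuous variable and collect the results. This is a minimal sketch on the same simulated data, not the answer's own method; `pb_tests` is an illustrative name.

```r
# Rebuild the simulated data so this block runs on its own
set.seed(123)
df <- data.frame(
  sat            = rnorm(100, 100, 50),
  pba            = rnorm(100, 50, 75),
  cte_certs      = rnorm(100, 2, 55),
  course_credits = rnorm(100, 20, 45),
  college_going  = rbinom(100, 1, 0.5)
)

continuous_vars <- setdiff(names(df), "college_going")

# One cor.test() per continuous variable; keep the estimate and the p-value.
# cor.test() itself drops pairs with missing values.
pb_tests <- do.call(rbind, lapply(continuous_vars, function(v) {
  ct <- cor.test(df[[v]], df$college_going, method = "pearson")
  data.frame(
    variable = v,
    r        = unname(ct$estimate),
    p_value  = ct$p.value
  )
}))

print(pb_tests)
```

Each row of `pb_tests` pairs one continuous variable with `college_going`; the `r` column should match the coefficients computed above.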