R语言 根据其他列中的字符串匹配对列求和

h7wcgrx3  于 8个月前  发布在  其他
关注(0)|答案(1)|浏览(83)

我有一个数据名,其结构类似于这样:

df <- data.frame("id1"=c("A", "A", NA, "B", "A", "A"),
                 "val1"=c(1, 2, NA, 2, 3, 1),
                 "id2"=c("A", "B", "B", "B", "B", "A"),
                 "val2"=c(3, 2, 1, 3, 3, 2),
                 "id3" = c("C", "B", "C", "C", "D", "A"),
                 "val3" = c(2, 1, 2, 2, 1, 1))

> df
   id1 val1 id2 val2 id3 val3
1    A    1   A    3   C    2
2    A    2   B    2   B    1
3 <NA>   NA   B    1   C    2
4    B    2   B    3   C    2
5    A    3   B    3   D    1
6    A    1   A    2   A    1

我想做的是,每行“合并”那些有字符串匹配的id列,并对相应的val(或值)列求和。我希望请求是明确的,但最后我希望输出如下:

df_new <- data.frame("id1"=c("A", "A", "B", "B", "A", "A"),
                     "val1"=c(4, 2, 1, 5, 3, 4),
                     "id2"=c("C", "B", "C", "C", "B", NA),
                     "val2"=c(2, 3, 2, 2, 3, NA),
                     "id3"=c(NA, NA, NA, NA, "D", NA),
                     "val3"=c(NA, NA,NA, NA, 1, NA))

> df_new
  id1 val1  id2 val2  id3 val3
1   A    4    C    2 <NA>   NA
2   A    2    B    3 <NA>   NA
3   B    1    C    2 <NA>   NA
4   B    5    C    2 <NA>   NA
5   A    3    B    3    D    1
6   A    4 <NA>   NA <NA>   NA

我试着用一系列ifelse()语句来做这件事,但最后还是很困惑。必须有一个更干净的方法来实现这一点。提前感谢您的任何帮助!

vnzz0bqm

vnzz0bqm1#

下面是tidyverse中的管道:

library(dplyr)
library(tidyr)

df |>
  mutate(rown = row_number()) |> 
  pivot_longer(-rown, names_pattern = "(id|val)\\d+", 
               names_to = ".value", values_drop_na = TRUE) |> 
  summarise(val = sum(val), .by = c(rown, id)) |> 
  mutate(number = row_number(), .by = rown) |> 
  pivot_wider(names_from = "number", values_from = c("id", "val"),
              names_vary = "slowest", names_sep = "") |>
  select(-rown)

#   id1 val1  id2 val2  id3 val3
# 1   A    4    C    2 <NA>   NA
# 2   A    2    B    3 <NA>   NA
# 3   B    1    C    2 <NA>   NA
# 4   B    5    C    2 <NA>   NA
# 5   A    3    B    3    D    1
# 6   A    4 <NA>   NA <NA>   NA

**解释:**这个想法是转换为long以获得一个整洁的格式,这使得它更容易操作数据,特别是在tidyverse中。这就是为什么有pivot_longer,然后是pivot_wider(以获得原始的宽格式)。在这两者之间,我通过rownid创建了一个sum,然后基于row_numberrown组创建了一个number变量,这对于在pivot_wider(和names_vary = "slowest")中正确排序很有用。

相关问题