在R语言中，基于包含要联接的列的变量的联接

在R中，我需要能够采用现有的 Dataframe 和列的“矩阵”来确定特定情况是否适用。这将更容易用示例数据来解释。
现有数据框：

df <- data.frame(
  FirstName = c("Daniel", "Daniel", "Daniel", "Daniel", "Daniel", "Daniel", "Daniel", "Daniel", "Daniel", "Daniel"),
  LastName = c("White",   "White",    "Adams",    "Adams",    "Andrews",  "Andrews",  "Thomas",   "Thomas", "Brown", "Brown"),
  City = c("Phoenix", "MESA", "Denver",   "Albany",   "Washington",   "Washington",   "Tampa", "Taos",    "Orlando", NA),
  State = c("AZ", "AZ",   "CO",   "NY",   "DC",   "VA",   "FL",   "NM",   "FL", NA),
  Group1 = c("ABC", "ABC", "XWZ", "XWZ", "LMN", "LMN", "XWZ", "MMM", NA, NA),
  Group2 = c("D001", "D001", "A444", "A444", "M555", "M555", "E333", "F222", "Q777", "Q777"),
  Group3 = c("FORT", "FORT", "SETS", "SETS", "SETS", "SETS", "FORT", "FORT", "SETS", "SETS")
  )

用于确定结果的色谱柱矩阵：

ID_Matrix <- data.frame(
  PrimaryKeyField = c("LastName", "LastName", "Group1", "Group3"),
  PrimaryMatchData = c("White", "Adams", "XWZ", "LMN"),
  SecondaryKeyField = c("State", "Group3", "", ""),
  SecondaryMatchData = c("AZ", "SETS","",""),
  Result = c("Red", "Blue", "Red", "Green")
)

ID_Matrix看起来像这样：

ID_Matrix
  PrimaryKeyField PrimaryMatchData SecondaryKeyField SecondaryMatchData Result
1        LastName            White             State                 AZ    Red
2        LastName            Adams            Group3               SETS   Blue
3          Group1              XWZ                                         Red
4          Group3              LMN                                       Green

实际上，ID_Matrix表标识了应使用哪些列来标识结果。因此，在上面的示例数据中，如果df中的“Last Name”列（PrimaryKeyField）是“白色”（PrimaryMatchData）并且“State”列（SecondaryKeyField）是“AZ”（SecondaryMatchData），则结果是“Red”。
到目前为止，我已经使用了多个连接，更改了join_by以匹配特定的环境。类似于以下内容：

SingleJoin_OnGroup1 <- df |> 
  inner_join(
    ID_Matrix[,c("PrimaryMatchData", "Result")],
    by = join_by(Group1 == PrimaryMatchData)
)

DoubleJoin_OnLName_State <- df |> 
  inner_join(
    ID_Matrix[,c("PrimaryMatchData", "SecondaryMatchData", "Result")],
    by = join_by(LastName == PrimaryMatchData, 
                 State == SecondaryMatchData
                 )
)

df_out <- bind_rows(SingleJoin_OnGroup1, DoubleJoin_OnLName_State)

...但是还有其他问题，我宁愿直接使用ID_Matrix来自动连接/匹配，使用当时需要的任何列来确定结果。如果可能的话，我更喜欢使用tidyverse解决方案领域的解决方案，但乞丐不能挑肥拣瘦。
我觉得这应该通过使用reduce或map来解决，但我不能让它与join/merge或更简单的case_ option一起工作。

6月18日更新-添加预期结果：

FirstName LastName       City State Group1 Group2 Group3 Result
1     Daniel    White    Phoenix    AZ    ABC   D001   FORT    Red
2     Daniel    White       MESA    AZ    ABC   D001   FORT    Red
3     Daniel    Adams     Denver    CO    XWZ   A444   SETS   Blue
4     Daniel    Adams     Albany    NY    XWZ   A444   SETS   Blue
5     Daniel  Andrews Washington    DC    LMN   M555   SETS   Green
6     Daniel  Andrews Washington    VA    LMN   M555   SETS   Green
7     Daniel   Thomas      Tampa    FL    XWZ   E333   FORT   Red
8     Daniel   Thomas       Taos    NM    MMM   F222   FORT   <NA>
9     Daniel    Brown    Orlando    FL   <NA>   Q777   SETS   <NA>
10    Daniel    Brown       <NA>  <NA>   <NA>   Q777   SETS   <NA>

我也没有指出这个示例数据显示的另一个特性--ID_Matrix中较早的规则覆盖了较晚的规则（例如- 第二个规则，其中LastName是“亚当斯”，Group 3是“SETS”，覆盖后面的Group 1是“XWZ”，因此第三和第四行结果是蓝色而不是红色）。这与case_when的逻辑类似。
有什么想法？

# Create an empty column for the results
df <- df %>% mutate(Result = NA_character_)

# Replace the empty strings (in the secondary column) with TRUE
ID_Matrix[ID_Matrix == ""] <- TRUE

match_id_matrix <- function(primary_match=TRUE, secondary_match=TRUE) {
  # Options for primary_match and secondary_match:
  # TRUE: match the key
  # FALSE: do not match the key
  # NA: disregard the key
  pmap(ID_Matrix, function(PrimaryKeyField, PrimaryMatchData, SecondaryKeyField, SecondaryMatchData, Result) { # parallel map
      # Find rows in df where the primary key field matches the primary match data
      primaryMatchingRows <- df[[PrimaryKeyField]] == PrimaryMatchData
      
      # Find rows in df where the secondary key field matches the secondary match data
        secondaryMatchingRows <- df[[SecondaryKeyField]] == SecondaryMatchData

      if (primary_match & secondary_match) { # find rows where both the primary and secondary match
        matchingRows <- primaryMatchingRows & secondaryMatchingRows
      } else if (primary_match) { # find rows where only the primary match
        matchingRows <- primaryMatchingRows
      } else if (secondary_match) { # find rows where only the secondary match
        matchingRows <- secondaryMatchingRows
      } else if (primary_match | secondary_match) { # find rows where either the primary or secondary match (but not both)
        matchingRows <- primaryMatchingRows | secondaryMatchingRows
      } else if (primary_match & is.na(secondary_match)) { # only check the primary key
        matchingRows <- primaryMatchingRows
      } else if (secondary_match & is.na(primary_match)) { # only check the secondary key
        matchingRows <- secondaryMatchingRows
      } else { # find rows where neither the primary nor secondary match
        matchingRows <- !(primaryMatchingRows | secondaryMatchingRows)
      }
      # Assign the result value for the matching rows
      df$Result[matchingRows] <<- Result
    })
}

# try out the function
match_id_matrix(TRUE, TRUE) # match both
df
match_id_matrix(TRUE, NA) # match the primary key, ignore the secondary key (i.e. it can be any value)
df
match_id_matrix(TRUE, FALSE) # match the primary key, do not match the secondary key

在R语言中，基于包含要联接的列的变量的联接

1条答案

相关问题

热门标签

最新问答