在R中,我需要能够采用现有的 Dataframe 和列的“矩阵”来确定特定情况是否适用。这将更容易用示例数据来解释。
现有数据框:
df <- data.frame(
FirstName = c("Daniel", "Daniel", "Daniel", "Daniel", "Daniel", "Daniel", "Daniel", "Daniel", "Daniel", "Daniel"),
LastName = c("White", "White", "Adams", "Adams", "Andrews", "Andrews", "Thomas", "Thomas", "Brown", "Brown"),
City = c("Phoenix", "MESA", "Denver", "Albany", "Washington", "Washington", "Tampa", "Taos", "Orlando", NA),
State = c("AZ", "AZ", "CO", "NY", "DC", "VA", "FL", "NM", "FL", NA),
Group1 = c("ABC", "ABC", "XWZ", "XWZ", "LMN", "LMN", "XWZ", "MMM", NA, NA),
Group2 = c("D001", "D001", "A444", "A444", "M555", "M555", "E333", "F222", "Q777", "Q777"),
Group3 = c("FORT", "FORT", "SETS", "SETS", "SETS", "SETS", "FORT", "FORT", "SETS", "SETS")
)
用于确定结果的色谱柱矩阵:
ID_Matrix <- data.frame(
PrimaryKeyField = c("LastName", "LastName", "Group1", "Group3"),
PrimaryMatchData = c("White", "Adams", "XWZ", "LMN"),
SecondaryKeyField = c("State", "Group3", "", ""),
SecondaryMatchData = c("AZ", "SETS","",""),
Result = c("Red", "Blue", "Red", "Green")
)
ID_Matrix看起来像这样:
ID_Matrix
PrimaryKeyField PrimaryMatchData SecondaryKeyField SecondaryMatchData Result
1 LastName White State AZ Red
2 LastName Adams Group3 SETS Blue
3 Group1 XWZ Red
4 Group3 LMN Green
实际上,ID_Matrix表标识了应使用哪些列来标识结果。因此,在上面的示例数据中,如果df中的“Last Name”列(PrimaryKeyField)是“白色”(PrimaryMatchData)并且“State”列(SecondaryKeyField)是“AZ”(SecondaryMatchData),则结果是“Red”。
到目前为止,我已经使用了多个连接,更改了join_by以匹配特定的环境。类似于以下内容:
SingleJoin_OnGroup1 <- df |>
inner_join(
ID_Matrix[,c("PrimaryMatchData", "Result")],
by = join_by(Group1 == PrimaryMatchData)
)
DoubleJoin_OnLName_State <- df |>
inner_join(
ID_Matrix[,c("PrimaryMatchData", "SecondaryMatchData", "Result")],
by = join_by(LastName == PrimaryMatchData,
State == SecondaryMatchData
)
)
df_out <- bind_rows(SingleJoin_OnGroup1, DoubleJoin_OnLName_State)
...但是还有其他问题,我宁愿直接使用ID_Matrix来自动连接/匹配,使用当时需要的任何列来确定结果。如果可能的话,我更喜欢使用tidyverse解决方案领域的解决方案,但乞丐不能挑肥拣瘦。
我觉得这应该通过使用reduce或map来解决,但我不能让它与join/merge或更简单的case_ option一起工作。
6月18日更新-添加预期结果:
FirstName LastName City State Group1 Group2 Group3 Result
1 Daniel White Phoenix AZ ABC D001 FORT Red
2 Daniel White MESA AZ ABC D001 FORT Red
3 Daniel Adams Denver CO XWZ A444 SETS Blue
4 Daniel Adams Albany NY XWZ A444 SETS Blue
5 Daniel Andrews Washington DC LMN M555 SETS Green
6 Daniel Andrews Washington VA LMN M555 SETS Green
7 Daniel Thomas Tampa FL XWZ E333 FORT Red
8 Daniel Thomas Taos NM MMM F222 FORT <NA>
9 Daniel Brown Orlando FL <NA> Q777 SETS <NA>
10 Daniel Brown <NA> <NA> <NA> Q777 SETS <NA>
我也没有指出这个示例数据显示的另一个特性--ID_Matrix中较早的规则覆盖了较晚的规则(例如- 第二个规则,其中LastName是“亚当斯”,Group 3是“SETS”,覆盖后面的Group 1是“XWZ”,因此第三和第四行结果是蓝色而不是红色)。这与case_when的逻辑类似。
有什么想法?
1条答案
按热度按时间kokeuurv1#