使用数据字典替换所有值,其中列名与字典中的行匹配(R语言)

9rygscc1  于 12个月前  发布在  R语言
关注(0)|答案(3)|浏览(179)

我有一个编码的调查文件和数据字典,它描述了每列的每个编码值。Example of inputs and desired output here
数据字典中的“字段”列列出了调查数据中的每一列。根据列名的不同,这些值具有不同的说明。例如,如果“1”在“满意度”列中,则表示“非常满意”,但如果“Metro”在“地区”列中,则表示“Metro”。
我已经让这段代码在Python中工作,但我需要它在R中工作。
我对其他方法持开放态度,这段代码正是我提出的。它在大约5秒内运行150列和40,000行。

import pandas as pd

# Read the data dictionary and raw survey data
dictionary_df = pd.read_excel(filepath)
dictionary_df["Value"] = dictionary_df["Value"].astype(int)

survey_df = pd.read_excel(filepath)

# Replaces the original unalbelled survey_df with a labelled survey_df.

lookup_dict = {}
cols = list(dictionary_df.Field.unique())
for i in cols:
    subset = dictionary_df.loc[dictionary_df.Field==i]
    lookup_dict_vals = dict(zip(subset['Value'], subset['Description']))
    lookup_dict[i] = lookup_dict_vals

# This try/except is necessary where columns that are in the raw file are not in the data dictionary. These columns will remain untouched, but columns to the right will get labelled.

for i in cols:
    try:
        survey_df[i] = survey_df[i].map(lookup_dict[i])
    except:
        continue

我试图在R中复制这一点,但得到了一个充满NA的 Dataframe 。有没有办法修复下面的代码?

library(dplyr)

# Read the data dictionary and raw survey data
dictionary_df <- readxl::read_excel(filepath)
dictionary_df$Value <- as.integer(dictionary_df$Value)

survey_df <- readxl::read_excel(filepath)

# Replaces the original unlabelled survey_df with a labelled survey_df.

lookup_dict <- list()
cols <- unique(dictionary_df$Field)

for (i in cols) {
  subset <- filter(dictionary_df, Field == i)
  lookup_dict_vals <- setNames(subset$Description, subset$Value)
  lookup_dict[[i]] <- lookup_dict_vals
}

# This try/except is necessary where columns that are in the raw file are not in the data dictionary. These columns will remain untouched, but columns to the right will get labelled.

for (i in colnames(survey_df)) {
  tryCatch({
    survey_df <- survey_df %>% mutate({{i}} := lookup_dict[[i]][{{i}}])
  }, error = function(e) {
    next
  })
}
jqjz2hbq

jqjz2hbq1#

我很好奇这是如何比较性能明智的。旋转步骤可能代价很高,但连接应该非常快。

library(tidyverse)
survey_df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "Field", values_to = "Value") %>%
  # left_join(dictionary_df) %>% # won't work for values not in dictionary
  # EDIT -- I think this will keep existing unmatched values
  rows_update(dictionary_df, unmatched = "ignore") %>%  
  select(-Value) %>%
  pivot_wider(names_from = Field, values_from = Description)
yptwkmov

yptwkmov2#

因为字典可以将条目存储为

0     1
Rural Metro

(not必须使用定义1,2,3...),我们必须在循环中调用nameswhich以获得索引,然后为这些索引子集字典。只需添加一点涉及名称和unlist函数的开销代码,该函数将单列 Dataframe 转换为向量。

for (i in colnames(survey_df)) {
  tryCatch({
    k <- c()
    for (j in unlist(survey_df[{{i}}])) { #for each coded value
      k<-c(k, which(names(lookup_dict[[i]]) == j)) #find the definition's slot
    }
    survey_df <- survey_df %>% mutate({{i}} := lookup_dict[[i]][k]) #take those names
  }, error = function(e) {
    next
  })
}
ukxgm1gy

ukxgm1gy3#

我发现对代码的一些微小更改允许保留原始编码的调查值,其中在字典中未找到字段和值的匹配。这是我在Jon Spring的建议下使用的代码。通过在skinny表中创建一个helper列,我可以将Description与编码的调查值合并,以便来自Left Join的任何NA都将保持为原始编码值。

survey_df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "Field", values_to = "Value") %>%
  left_join(dictionary_df) %>%
  mutate(Value = as.character(Value)) %>% 
  mutate(Description2 = coalesce(Description, Value)) %>% 
  select(-Value, -Description) %>%
  pivot_wider(names_from = Field, values_from = Description2) %>% 
  
  write_xlsx(filepath)```

相关问题