我有一个编码的调查文件和数据字典,它描述了每列的每个编码值。Example of inputs and desired output here。
数据字典中的“字段”列列出了调查数据中的每一列。根据列名的不同,这些值具有不同的说明。例如,如果“1”在“满意度”列中,则表示“非常满意”,但如果“Metro”在“地区”列中,则表示“Metro”。
我已经让这段代码在Python中工作,但我需要它在R中工作。
我对其他方法持开放态度,这段代码正是我提出的。它在大约5秒内运行150列和40,000行。
import pandas as pd
# Read the data dictionary and raw survey data
dictionary_df = pd.read_excel(filepath)
dictionary_df["Value"] = dictionary_df["Value"].astype(int)
survey_df = pd.read_excel(filepath)
# Replaces the original unalbelled survey_df with a labelled survey_df.
lookup_dict = {}
cols = list(dictionary_df.Field.unique())
for i in cols:
subset = dictionary_df.loc[dictionary_df.Field==i]
lookup_dict_vals = dict(zip(subset['Value'], subset['Description']))
lookup_dict[i] = lookup_dict_vals
# This try/except is necessary where columns that are in the raw file are not in the data dictionary. These columns will remain untouched, but columns to the right will get labelled.
for i in cols:
try:
survey_df[i] = survey_df[i].map(lookup_dict[i])
except:
continue
我试图在R中复制这一点,但得到了一个充满NA的 Dataframe 。有没有办法修复下面的代码?
library(dplyr)
# Read the data dictionary and raw survey data
dictionary_df <- readxl::read_excel(filepath)
dictionary_df$Value <- as.integer(dictionary_df$Value)
survey_df <- readxl::read_excel(filepath)
# Replaces the original unlabelled survey_df with a labelled survey_df.
lookup_dict <- list()
cols <- unique(dictionary_df$Field)
for (i in cols) {
subset <- filter(dictionary_df, Field == i)
lookup_dict_vals <- setNames(subset$Description, subset$Value)
lookup_dict[[i]] <- lookup_dict_vals
}
# This try/except is necessary where columns that are in the raw file are not in the data dictionary. These columns will remain untouched, but columns to the right will get labelled.
for (i in colnames(survey_df)) {
tryCatch({
survey_df <- survey_df %>% mutate({{i}} := lookup_dict[[i]][{{i}}])
}, error = function(e) {
next
})
}
3条答案
按热度按时间jqjz2hbq1#
我很好奇这是如何比较性能明智的。旋转步骤可能代价很高,但连接应该非常快。
yptwkmov2#
因为字典可以将条目存储为
(not必须使用定义1,2,3...),我们必须在循环中调用
names
和which
以获得索引,然后为这些索引子集字典。只需添加一点涉及名称和unlist
函数的开销代码,该函数将单列 Dataframe 转换为向量。ukxgm1gy3#
我发现对代码的一些微小更改允许保留原始编码的调查值,其中在字典中未找到字段和值的匹配。这是我在Jon Spring的建议下使用的代码。通过在skinny表中创建一个helper列,我可以将Description与编码的调查值合并,以便来自Left Join的任何NA都将保持为原始编码值。