有没有加速随机样本和连接的例子？

xnifntxz 于 2021-07-09 发布在 Spark

关注(0)|答案(0)|浏览(227)

我正在寻求帮助，试图了解如何使 sparklyr 在对建模数据集进行采样的日常工作中，工作更智能、速度更快。
在hive中使用结构化数据，我启用了相同的执行器和ram配置，这样我就可以在不到15分钟的时间内完成pyspark中的相同流程。。。但与 sparklyr ，大约需要90分钟。
这个过程是我需要从一个巨大的数据集采样到一个成千上万的数据集。。。假设我选择bernoulli目标样本的大小为10000个true和20000个false，暂时忽略这些表的bucketing或分区。
使用SparkyR，从hive查询，我就是这样完成的：

referenceData<-"datamart.archivedFile" # About 200 million rows and 1000 columns
nonTargetCriteria<-"Score>0 AND MonthsSinceActivity>12"  # Drops to about 80% of the file
targetCriteria<-"MonthsSinceActivity>0 AND MonthsSinceActivity<4" # Drops to about 1% of the file
WHERE.NT<-paste("WHERE ",nonTargetCriteria)
WHERE.T<-paste("WHERE ",targetCriteria)
nT<- 10000 # sample size of target behavior (let's say response is coded as target=1)
nNT<- 20000 # sample size of non-targets (let's say non-response is coded as target=0)
join_vars <- c('id','address')
names(join_vars) <- c('id','address')

## First I count my responders available

n.targetSDF <- hive_context(sc) %>% invoke('sql', paste("select count(1) AS N, ",paste("SUM(CASE WHEN ",targetCriteria," THEN 1 ELSE 0 END)  nTar"),
                                                        " from ",targetData)) %>% sdf_collect()

## Then I count the full available population in the reference data set subject to criteria

n.referenceSDF <- hive_context(sc) %>% invoke('sql', paste("select count(1) AS N, ",paste("SUM(CASE WHEN ",nonTargetCriteria," THEN 1 ELSE 0 END) nNonTar"),
                                                               " from ",referenceData)) %>% sdf_collect()

# Now I pull the target=1 group and join them to the earlier archived data

targetSDF.S <- hive_context(sc) %>% invoke('sql', paste("select ",paste0(join_vars,collapse=",")," from ",targetData,WHERE.T)) %>% 
  sdf_sample(fraction = 2*nT/n.targetSDF[2]) ## Bring in the count from above to define the fraction.

join.S <- tbl(sc,referenceData) %>% inner_join(targetSDF.S,by=join_vars) %>% mutate(target=1)

# Now I initialize the response=0 data

referenceSDF.S <- hive_context(sc) %>% invoke('sql', paste("select *, 0 target from ",referenceData,WHERE.NT)) 

# Bring in the count from above to define the fraction.

N.NT<-n.referenceSDF[2]

# Union the two samples and collect into R.

df <- sdf_bind_rows(join.S,referenceSDF.S %>% sdf_sample(fraction = 1.2*nNT/N.NT)) %>% sdf_collect()

apache-spark cloudera r sparklyr

来源：https://stackoverflow.com/questions/66751754/are-there-examples-for-speeding-up-sparklyr-random-samples-and-joins