How to find the coefficient of variation of two rows in PySpark

hs1rzwqc · asked 2021-07-14 · Spark

I have the following PySpark dataframe:

stat        col_A    col_B     col_C     col_D
count          14       14        14        14
Actual          4     4001    160987        49
Regression      3     3657    131225        38

I want to compute the coefficient of variation (population standard deviation divided by the mean, times 100) of the Actual and Regression rows, and append the result as a new row, CV:

stat        col_A    col_B     col_C     col_D
count          14       14        14        14
Actual          4     4001    160987        49
Regression      3     3657    131225        38
CV

The Spark docs offer corr(col1, col2, method=None), but that operates on columns, whereas I need the computation across rows. In pandas I did something like this:

(df1.loc[['Actual','Regression']].std(axis = 0, ddof=0,skipna = True))/(df1.loc[['Actual','Regression']].mean(axis = 0))*100
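For reference, here is the same arithmetic done by hand for col_A (a minimal plain-Python sketch; 4 and 3 are the Actual and Regression values from the table above):

import numpy as np

# col_A values from the Actual and Regression rows
values = np.array([4, 3])

# coefficient of variation: population stddev (ddof=0) / mean * 100
cv = values.std(ddof=0) / values.mean() * 100
print(cv)  # ~14.2857, matching col_A in the desired CV row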

lnvxswe2 · 1#

from pyspark.sql import functions as F

# Keep only the Actual and Regression rows, collapse each value column
# into its CV (population stddev / mean * 100), label the row 'CV',
# and append it to the original dataframe.
result = df.union(
    df.filter("stat in ('Actual', 'Regression')")
      .select(
          F.lit('CV').alias('stat'),
          *[(F.stddev_pop(c) / F.mean(c) * 100).alias(c) for c in df.columns[1:]]
      )
)

result.show()
+----------+------------------+-----------------+------------------+------------------+
|      stat|             col_A|            col_B|             col_C|             col_D|
+----------+------------------+-----------------+------------------+------------------+
|     count|              14.0|             14.0|              14.0|              14.0|
|    Actual|               4.0|           4001.0|          160987.0|              49.0|
|Regression|               3.0|           3657.0|          131225.0|              38.0|
|        CV|14.285714285714285|4.492034473752938|10.185071112753755|12.643678160919542|
+----------+------------------+-----------------+------------------+------------------+
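Note that union matches columns by position, which is why the select rebuilds the row in the original column order ('stat' first, then df.columns[1:]). If you prefer matching by name, unionByName is a drop-in alternative (a sketch of the same logic, not part of the original answer):

result = df.unionByName(
    df.filter("stat in ('Actual', 'Regression')")
      .select(
          F.lit('CV').alias('stat'),
          *[(F.stddev_pop(c) / F.mean(c) * 100).alias(c) for c in df.columns[1:]]
      )
)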

The CV row matches your expected pandas result:

(df1.loc[['Actual','Regression']].std(axis = 0, ddof=0,skipna = True))/(df1.loc[['Actual','Regression']].mean(axis = 0))*100
col_A    14.285714
col_B     4.492034
col_C    10.185071
col_D    12.643678
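The figures agree because F.stddev_pop is the population standard deviation, matching ddof=0 in the pandas call; Spark's plain F.stddev is the sample version (pandas ddof=1) and would give larger values. A quick check with the two col_A values:

import statistics

vals = [4, 3]  # col_A: Actual and Regression
print(statistics.pstdev(vals))  # 0.5    -> F.stddev_pop, pandas ddof=0
print(statistics.stdev(vals))   # ~0.707 -> F.stddev, pandas ddof=1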
