我需要从两个配置单元表中提取数据,这两个配置单元表非常大。它们在两个不同的模式中,但具有相同的定义。
我需要比较这两个表,并在PySpark中确定以下内容
1.表1中存在但表2中缺少的行
1.两个表中都存在的行,但任何非键列中的值不匹配
1.表2中存在但表1中缺少的行
例如,假设该表具有以下参数
ProductId - BigInteger - PK
ProductVersion - int - PK
ProductName - char
ProductPrice - decimal
ProductDesc - varchar
字符串
假设数据如下
Table1 in Schema1
[1, 1, "T-Shirt", 10.50, "Soft-Washed Slub-Knit V-Neck"] -> Matches with Table2
[1, 2, "T-Shirt", 10.50, "Soft-Washed Striped Crew-Neck "] -> Price is different in Table1
[2, 1, "Short Sleeve Shirt", 10.50, "Everyday Printed Short-Sleeve Shirt"] -> Missing in Table2
[3, 1, "T-Shirt", 10.50, "Breathe ON Camo Tee"] -> Prod Desc is different in Table2
Table2 in Schema2
[1, 1, "T-Shirt", 10.50, "Soft-Washed Slub-Knit V-Neck"] -> Matches with Table1
[2, 1, "Short Sleeve Shirt", 12.50, "Everyday Printed Short-Sleeve Shirt"] -> Price is different
[3, 1, "T-Shirt", 10.50, "Breathe ON Camo"] -> Prod Desc is different in Table2
[3, 2, "T-Shirt", 20, "Breathe ON ColorBlock Tee"] -> Missing in Table1
型
预期结果将是三个独立的 Dataframe
dfOut1
-将包含表1中存在但基于主键在表2中缺失的行
["Missing in Table2", [1, 2, "T-Shirt", 10.50, "Soft-Washed Striped Crew-Neck "]]
型
第一列将指示差异类型,如果差异类型为"Missing in Table1"
或"Missing in Table2"
,则源表中的整行都将可用。
dfdiff
-
["Difference", "ProductPrice", 2, 1, 10.50, 12.50]
["Difference", "ProductDesc", 3,1, "Breathe ON Camo Tee", "Breathe ON Camo"]
型
dfout2
-
["Missing in Table1", [3, 2, "T-Shirt", 20, "Breathe ON ColorBlock Tee"]]
型
我在考虑采取以下方法
1. Create df1 from table1 using query "select * from schema1.table1"
2. Create df2 from table2 using query "select * from schema2.table2"
3. Use df1.except(df2)
型
我指的是documentation
我不确定这种方法是否有效。df1.except(df2)
会比较所有字段,还是只比较键列?
此外,不确定如何进一步分离输出
1条答案
按热度按时间6mzjoqzu1#
您基本上是在尝试查找两个数据集之间的插入、更新和删除(增量)。
字符串