mock_data = [('TYCO', ' 1303', '13'), ('EMC', ' 120989 ', '123'), ('VOLVO ', '102329 ', '1234'), ('BMW', '1301571345 ', ' '), ('FORD', '004', '21212')]
df = spark.createDataFrame(mock_data, ['col1', 'col2', 'col3'])
df.show()
+-------+------------+-----+
|   col1|        col2| col3|
+-------+------------+-----+
|   TYCO|        1303|   13|
|    EMC|     120989 |  123|
| VOLVO |     102329 | 1234|
|    BMW| 1301571345 |     |
|   FORD|         004|21212|
+-------+------------+-----+
I need to trim col2 and then, based on its length (10 - length of col2), dynamically left-pad col3 with zeros. Finally, concatenate col2 and col3.
from pyspark.sql.functions import length, trim
df2 = df.withColumn('length_col2', 10 - length(trim(df.col2)))
+-------+------------+-----+-----------+
|   col1|        col2| col3|length_col2|
+-------+------------+-----+-----------+
|   TYCO|        1303|   13|          6|
|    EMC|     120989 |  123|          4|
| VOLVO |     102329 | 1234|          4|
|    BMW| 1301571345 |     |          0|
|   FORD|         004|21212|          7|
+-------+------------+-----+-----------+
Expected output
+-------+------------+-----+----------+
|   col1|        col2| col3|    output|
+-------+------------+-----+----------+
|   TYCO|        1303|   13|1303000013|
|    EMC|     120989 |  123|1209890123|
| VOLVO |     102329 | 1234|1023291234|
|    BMW| 1301571345 |     |1301571345|
|   FORD|         004|21212|0040021212|
+-------+------------+-----+----------+
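To make the rule concrete, here is a plain-Python sketch (no Spark involved) of the padding described above. The helper name pad_and_concat and the width parameter are made up for illustration; the sketch assumes that both columns are trimmed of spaces before anything else happens.

```python
def pad_and_concat(col2, col3, width=10):
    """Hypothetical helper: trim both values, left-pad col3 with zeros,
    then concatenate so the result is `width` characters long."""
    left = col2.strip(" ")
    right = col3.strip(" ")
    pad_len = width - len(left)  # same quantity as length_col2 above
    return left + right.rjust(pad_len, "0")


# (col2, col3) pairs copied from mock_data
rows = [
    (" 1303", "13"),
    (" 120989 ", "123"),
    ("102329 ", "1234"),
    ("1301571345 ", " "),
    ("004", "21212"),
]
outputs = [pad_and_concat(c2, c3) for c2, c3 in rows]
```

With the mock data, outputs holds the five values shown in the expected output column.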
1 answer
You are looking for the rpad function in pyspark.sql.functions, as listed here => https://spark.apache.org/docs/2.3.0/api/sql/index.html. See the solution below:
And the result:
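The answer's own code and result are not reproduced above. As a rough sketch of one way to apply rpad here (not necessarily what the answer used): right-pad the trimmed col2 with zeros up to 10 minus the length of the trimmed col3, then append col3. The names TOTAL_WIDTH, OUTPUT_EXPR and with_output are hypothetical. The sketch goes through expr on the assumption that the Python rpad wrapper in the linked Spark version expects a fixed integer length, whereas the SQL function accepts an expression.

```python
TOTAL_WIDTH = 10  # taken from the question's "10 - length" rule

# Spark SQL expression: zero-pad trimmed col2 on the right so that, once
# trimmed col3 is appended, the result is TOTAL_WIDTH characters long.
OUTPUT_EXPR = (
    "concat("
    f"rpad(trim(col2), {TOTAL_WIDTH} - length(trim(col3)), '0'), "
    "trim(col3))"
)


def with_output(df):
    """Return df with an added 'output' column (hypothetical helper)."""
    # Imported lazily so the expression string can be inspected without Spark.
    from pyspark.sql.functions import expr

    return df.withColumn("output", expr(OUTPUT_EXPR))
```

With the df from the question, with_output(df).show() would be the way to check the new column against the expected output. Note that rpad truncates when the target length is shorter than the input, so rows where col2 and col3 together exceed 10 characters would need separate handling.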