How do I read a CSV file whose columns are separated by spaces into a DataFrame in Scala?

Posted by o2rvlv0m on 2021-05-27 in Spark


I am trying to load a CSV file whose columns are separated by runs of spaces.
First line of the CSV:

058921107                          039128053                          20200701-290640-0             20200701 000000BORGWARNER ITHACA LLC DBA BORGWARNE                         489140-10001                       LDD INVENTORY                                               039128053           1     4359697                                           PACKAGE,CHAIN DRIVE                                                                                 005                 285000492           0                     19691231 185959                              0                     20200101 00000020200630 000000IMMEDIATE                1600                  20200630 000000

Sample of the script I used:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// `spark` is the SparkSession provided by spark-shell.
val df1: DataFrame = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", " ")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .csv("test.csv")

df1.show(2)

Answer 1 (jw5wzhpr)

I set the number of columns to 18, though I am not sure whether that is correct.

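from pyspark.sql.functions import col, regexp_replace, split

df = spark.read.text('test.csv')

col_size = 18  # number of fields to extract from each row

# Strip trailing spaces, turn every run of 2+ spaces into '|', then split on '|'.
cleaned = regexp_replace(regexp_replace('value', r' +$', ''), r' {2,}', '|')
df.withColumn('value', split(cleaned, r'\|')) \
  .select(*[col('value')[i] for i in range(col_size)]) \
  .toDF(*[f'col{i + 1}' for i in range(col_size)]) \
  .show(30, False)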

+---------+---------+-----------------+--------------------------------------------------+------------+-------------+---------+----+-------+-------------------+-----+---------+-----+---------------+-----+---------------------------------------+-----+---------------+
|col1     |col2     |col3             |col4                                              |col5        |col6         |col7     |col8|col9   |col10              |col11|col12    |col13|col14          |col15|col16                                  |col17|col18          |
+---------+---------+-----------------+--------------------------------------------------+------------+-------------+---------+----+-------+-------------------+-----+---------+-----+---------------+-----+---------------------------------------+-----+---------------+
|058921107|039128053|20200701-290640-0|20200701 000000BORGWARNER ITHACA LLC DBA BORGWARNE|489140-10001|LDD INVENTORY|039128053|1   |4359697|PACKAGE,CHAIN DRIVE|005  |285000492|0    |19691231 185959|0    |20200101 00000020200630 000000IMMEDIATE|1600 |20200630 000000|
+---------+---------+-----------------+--------------------------------------------------+------------+-------------+---------+----+-------+-------------------+-----+---------+-----+---------------+-----+---------------------------------------+-----+---------------+
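To see what the two regular expressions do to a single row, here is a minimal plain-Python sketch of the same idea, without Spark. The `line` value is a shortened excerpt of the sample row with illustrative spacing, and `split_on_space_runs` is a hypothetical helper name, not part of any library.

```python
import re
from typing import List

# Shortened excerpt of the sample row. The lengths of the space runs are
# illustrative and not copied from the real file.
line = (
    "058921107      039128053      20200701-290640-0    "
    "20200701 000000BORGWARNER ITHACA LLC      1     4359697    "
)


def split_on_space_runs(row: str) -> List[str]:
    """Strip trailing spaces, then split wherever two or more spaces occur."""
    without_trailing = re.sub(r" +$", "", row)
    return re.split(r" {2,}", without_trailing)


fields = split_on_space_runs(line)
for i, field in enumerate(fields, start=1):
    print(f"col{i}: {field!r}")
```

Single spaces are never treated as separators, so a value such as `20200701 000000BORGWARNER ITHACA LLC` stays in one field. That matches what col4 and col16 look like in the output above.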

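In the output above, col4 holds what looks like a timestamp and a company name with no gap between them, which suggests the file may be fixed-width rather than space-delimited. If that is the case, slicing by position is an alternative to splitting on spaces. The following is a plain-Python sketch under that assumption; the widths `[15, 35]` are hypothetical values read off the col4 sample, and `slice_fixed_width` is an illustrative helper, not a library function.

```python
from typing import List, Sequence


def slice_fixed_width(row: str, widths: Sequence[int]) -> List[str]:
    """Cut `row` into consecutive fields of the given widths and trim padding."""
    fields = []
    start = 0
    for width in widths:
        fields.append(row[start:start + width].strip())
        start += width
    return fields


# Hypothetical layout: a 15-character timestamp followed by a 35-character name.
WIDTHS = [15, 35]
fused = "20200701 000000BORGWARNER ITHACA LLC DBA BORGWARNE"

parts = slice_fixed_width(fused, WIDTHS)
print(parts)
```

In Spark the same idea would be expressed with `substring` on the `value` column, once the real field offsets are known from the file's specification.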