I have a .txt file with a header that I want to remove. The file looks like this:
Entry Per Account Description
16524 01 3930621977 TXNPUES
191675 01 2368183100 OUNHQEX
191667 01 3714468136 GHAKASC
191673 01 2632703881 PAHFSAP
80495 01 2766389794 XDZANTV
80507 01 4609266335 BWWYEZL
80509 01 1092717420 QJYPKVO
80497 01 3386366766 SOQLCMU
191669 01 5905893739 FYIWNKA
191671 01 2749355876 CBMJTLP
# Create Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("fixed-width")\
    .config("spark.some.config.option", "some-value")\
    .getOrCreate()

# Read the fixed-width text file into a DataFrame
# (note: the "header" and "inferSchema" options are ignored by .text())
df = spark.read.option("header", "true")\
    .option("inferSchema", "true")\
    .text(file)
df.show()
df.printSchema()
This returns:
+--------------------+
| value|
+--------------------+
|Entry Per Accou...|
| 16524 01 39306...|
|191675 01 23681...|
|191667 01 37144...|
|191673 01 26327...|
| 80495 01 27663...|
| 80507 01 46092...|
| 80509 01 10927...|
| 80497 01 33863...|
|191669 01 59058...|
|191671 01 27493...|
+--------------------+
root
|-- value: string (nullable = true)
I can grab the header:
header = df.first()
header
which returns:
Row(value='Entry Per GL Account Description ')
Then split it into separate columns:
# Take the fixed-width file and split it into 4 distinct columns
sorted_df = df.select(
    df.value.substr(1, 6).alias('Entry'),
    df.value.substr(8, 3).alias('Per'),
    df.value.substr(12, 11).alias('GL Account'),
    df.value.substr(24, 11).alias('Description'),
)
sorted_df.show()
sorted_df.printSchema()
sorted_df.show()
sorted_df.printSchema()
This returns:
+------+---+-----------+-----------+
| Entry|Per| GL Account|Description|
+------+---+-----------+-----------+
|Entry |Per| GL Account| Descriptio|
| 16524| 01| 3930621977| TXNPUES |
|191675| 01| 2368183100| OUNHQEX |
|191667| 01| 3714468136| GHAKASC |
|191673| 01| 2632703881| PAHFSAP |
| 80495| 01| 2766389794| XDZANTV |
| 80507| 01| 4609266335| BWWYEZL |
| 80509| 01| 1092717420| QJYPKVO |
| 80497| 01| 3386366766| SOQLCMU |
|191669| 01| 5905893739| FYIWNKA |
|191671| 01| 2749355876| CBMJTLP |
+------+---+-----------+-----------+
Now you can see that the header still shows up as the first row of my DataFrame, and I can't figure out how to remove it. `.iloc` is not available, and I often see the following approach, but it only works on RDDs:

header = rdd.first()
rdd.filter(lambda line: line != header)

So what are the alternatives?
1 Answer
You can use `.csv`, `.text`, or `.textFile` for this case.

1. Using `.csv`: read the file with the `.csv` method so that Spark consumes the header itself; we don't have to filter it out of the resulting DataFrame.
2. Using `.text`: read the file with `.text`, then filter the header row out of the resulting DataFrame.
3. Using `.textFile`: first get the header into a variable:
header = spark.sparkContext.textFile("path").first()
Then read with `.textFile`, filter out the header, split each line, and convert to a DataFrame:
spark.sparkContext.textFile("path")\
    .filter(lambda l: not str(l).startswith(header))\
    .map(lambda x: x.split())\
    .map(lambda x: (str(x[0].strip()), str(x[1].strip()),
                    str(x[2].strip()), str(x[3].strip())))\
    .toDF(["Entry", "Per", "Account", "Description"])\
    .show()
+------+---+----------+-----------+
| Entry|Per| Account|Description|
+------+---+----------+-----------+
| 16524| 01|3930621977| TXNPUES|
|191675| 01|2368183100| OUNHQEX|
|191667| 01|3714468136| GHAKASC|
|191673| 01|2632703881| PAHFSAP|
| 80495| 01|2766389794| XDZANTV|
| 80507| 01|4609266335| BWWYEZL|
| 80509| 01|1092717420| QJYPKVO|
| 80497| 01|3386366766| SOQLCMU|
|191669| 01|5905893739| FYIWNKA|
|191671| 01|2749355876| CBMJTLP|
+------+---+----------+-----------+