pyspark - Remove the first row from a DataFrame

bweufnob · published 2021-05-27 in Spark

I have a .txt file with a header that I want to remove. The file looks like this:

Entry  Per  Account     Description               
 16524  01  3930621977  TXNPUES                     
191675  01  2368183100  OUNHQEX            
191667  01  3714468136  GHAKASC             
191673  01  2632703881  PAHFSAP              
 80495  01  2766389794  XDZANTV                    
 80507  01  4609266335  BWWYEZL                   
 80509  01  1092717420  QJYPKVO                  
 80497  01  3386366766  SOQLCMU                  
191669  01  5905893739  FYIWNKA             
191671  01  2749355876  CBMJTLP

# Create spark session

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("fixed-width")\
                    .config("spark.some.config.option", "some-value")\
                    .getOrCreate()

# Read the fixed-width text file into a DataFrame.
# Note: "header" and "inferSchema" are csv options; .text() ignores them
# and always returns a single string column named "value".
df = spark.read.option("header", "true")\
               .option("inferSchema", "true")\
               .text(file)
df.show()
df.printSchema()

Returns:

+--------------------+
|               value|
+--------------------+
|Entry  Per  Accou...|
| 16524  01  39306...|
|191675  01  23681...|
|191667  01  37144...|
|191673  01  26327...|
| 80495  01  27663...|
| 80507  01  46092...|
| 80509  01  10927...|
| 80497  01  33863...|
|191669  01  59058...|
|191671  01  27493...|
+--------------------+

root
 |-- value: string (nullable = true)

I can grab the header:

header = df.first()
header

Returns:

Row(value='Entry  Per  GL Account  Description               ')
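(To compare against the `value` column later, you usually want the raw string rather than the `Row` wrapper; either accessor works:)

header_str = header.value   # or header[0]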

Then split it into separate columns:


# Take the fixed-width file and split it into 4 distinct columns

sorted_df = df.select(
    df.value.substr( 1,  6).alias('Entry'      ),
    df.value.substr( 8,  3).alias('Per'        ),
    df.value.substr(12, 11).alias('GL Account' ),
    df.value.substr(24, 11).alias('Description'),
)

sorted_df.show()
sorted_df.printSchema()

Returns:

+------+---+-----------+-----------+
| Entry|Per| GL Account|Description|
+------+---+-----------+-----------+
|Entry |Per| GL Account| Descriptio|
| 16524| 01| 3930621977| TXNPUES   |
|191675| 01| 2368183100| OUNHQEX   |
|191667| 01| 3714468136| GHAKASC   |
|191673| 01| 2632703881| PAHFSAP   |
| 80495| 01| 2766389794| XDZANTV   |
| 80507| 01| 4609266335| BWWYEZL   |
| 80509| 01| 1092717420| QJYPKVO   |
| 80497| 01| 3386366766| SOQLCMU   |
|191669| 01| 5905893739| FYIWNKA   |
|191671| 01| 2749355876|   CBMJTLP |
+------+---+-----------+-----------+

Now you can see that the header still shows up as the first row of my DataFrame, and I'm not sure how to get rid of it.
`.iloc` is not available, and I often see this approach, but it only works on an RDD:

header = rdd.first()
rdd.filter(lambda line: line != header)

So what are the alternatives?


tpgth1q7 · 1#

You can use any of `.csv`, `.text`, or `.textFile` for this case.
Read the file with the `.csv` method so that Spark consumes the header for us (then we don't have to filter it out).
1. Using `.csv`: `.csv` results in a `df`.

df = spark.read.option("header", "true").csv("path")
df.show(10, False)

# +----------------------------------------------------+
# |Entry  Per  Account     Description                 |
# +----------------------------------------------------+
# | 16524  01  3930621977  TXNPUES                     |
# |191675  01  2368183100  OUNHQEX                     |
# |191667  01  3714468136  GHAKASC                     |
# |191673  01  2632703881  PAHFSAP                     |
# | 80495  01  2766389794  XDZANTV                     |
# | 80507  01  4609266335  BWWYEZL                     |
# | 80509  01  1092717420  QJYPKVO                     |
# | 80497  01  3386366766  SOQLCMU                     |
# |191669  01  5905893739  FYIWNKA                     |
# |191671  01  2749355876  CBMJTLP                     |
# +----------------------------------------------------+
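Note that `.csv` here still leaves each fixed-width line in a single column (there are no commas to split on), so the positional split is still needed afterwards; a sketch reusing the offsets from the question:

# the single column's name is the header text itself, so grab it by position
line = df[df.columns[0]]

split_df = df.select(
    line.substr( 1,  6).alias('Entry'),
    line.substr( 8,  3).alias('Per'),
    line.substr(12, 11).alias('GL Account'),
    line.substr(24, 11).alias('Description'),
)
split_df.show()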

2. Using `.text`: `.text` results in a `df`.


from pyspark.sql.functions import col

# .text can't consume the header, so it comes in as the first row
df = spark.read.text("path")

# get the header string from the first row
header = df.first()[0]

# filter the header out of the data and keep the result
df = df.filter(~col("value").contains(header))
df.show(10, False)

# +----------------------------------------------------+
# |value                                               |
# +----------------------------------------------------+
# | 16524  01  3930621977  TXNPUES                     |
# |191675  01  2368183100  OUNHQEX                     |
# |191667  01  3714468136  GHAKASC                     |
# |191673  01  2632703881  PAHFSAP                     |
# | 80495  01  2766389794  XDZANTV                     |
# | 80507  01  4609266335  BWWYEZL                     |
# | 80509  01  1092717420  QJYPKVO                     |
# | 80497  01  3386366766  SOQLCMU                     |
# |191669  01  5905893739  FYIWNKA                     |
# |191671  01  2749355876  CBMJTLP                     |
# +----------------------------------------------------+

Then apply the same split as before:

sorted_df = df.select(
    df.value.substr( 1,  6).alias('Entry'      ),
    df.value.substr( 8,  3).alias('Per'        ),
    df.value.substr(12, 11).alias('GL Account' ),
    df.value.substr(24, 11).alias('Description'),
)

sorted_df.show()
sorted_df.printSchema()
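Since fixed-width slices keep their blank padding (visible in the earlier output), you may also want to trim each column; a minimal sketch using `trim` from `pyspark.sql.functions`:

from pyspark.sql.functions import trim

# strip the leading/trailing spaces from every column
trimmed_df = sorted_df.select(
    [trim(sorted_df[c]).alias(c) for c in sorted_df.columns]
)
trimmed_df.show()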
3. Using `.textFile`: `.textFile` results in an `rdd`.

# get the header into a variable
header = spark.sparkContext.textFile("path").first()

# read with .textFile, filter out the header, split each line on
# whitespace and build a DataFrame
spark.sparkContext.textFile("path")\
    .filter(lambda l: not l.startswith(header))\
    .map(lambda x: x.split())\
    .map(lambda x: (x[0].strip(), x[1].strip(), x[2].strip(), x[3].strip()))\
    .toDF(["Entry", "Per", "Account", "Description"])\
    .show()

+------+---+----------+-----------+
| Entry|Per|   Account|Description|
+------+---+----------+-----------+
| 16524| 01|3930621977|    TXNPUES|
|191675| 01|2368183100|    OUNHQEX|
|191667| 01|3714468136|    GHAKASC|
|191673| 01|2632703881|    PAHFSAP|
| 80495| 01|2766389794|    XDZANTV|
| 80507| 01|4609266335|    BWWYEZL|
| 80509| 01|1092717420|    QJYPKVO|
| 80497| 01|3386366766|    SOQLCMU|
|191669| 01|5905893739|    FYIWNKA|
|191671| 01|2749355876|    CBMJTLP|
+------+---+----------+-----------+
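One caveat with all three variants: they match the header by content, so a data row that happened to equal (or start with) the header text would be dropped too. A position-based alternative, not part of the original answer, is `zipWithIndex`:

rdd = spark.sparkContext.textFile("path")

# pair each line with its index, drop index 0 (the header), keep the lines
no_header = rdd.zipWithIndex()\
               .filter(lambda pair: pair[1] > 0)\
               .map(lambda pair: pair[0])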
