How to read a DataFrame row by row without changing the order, in Spark Scala?

5jdjgkvh  posted on 2021-07-14  in Java

I have a DataFrame that contains a sequence of rows. I want to iterate over it row by row without changing the order.
I tried the code below.

scala> val df = Seq(
 |     (0,"Load","employeeview", "employee.empdetails", null ),
 |     (1,"Query","employeecountview",null,"select count(*) from employeeview"),
 |     (2,"store", "employeecountview",null,null)
 |   ).toDF("id", "Operation","ViewName","DiectoryName","Query")
df: org.apache.spark.sql.DataFrame = [id: int, Operation: string ... 3 more fields]

scala> df.show()
+---+---------+-----------------+-------------------+--------------------+
| id|Operation|         ViewName|       DiectoryName|               Query|
+---+---------+-----------------+-------------------+--------------------+
|  0|     Load|     employeeview|employee.empdetails|                null|
|  1|    Query|employeecountview|               null|select count(*) f...|
|  2|    store|employeecountview|               null|                null|
+---+---------+-----------------+-------------------+--------------------+

scala> val dfcount = df.count().toInt
dfcount: Int = 3

scala> for( a <- 0 to dfcount-1){
              // first Iteration I want  id =0   Operation="Load" ViewName="employeeview" DiectoryName="employee.empdetails" Query= null
                // second iteration I want  id=1  Operation="Query" ViewName="employeecountview"  DiectoryName="null" Query= "select count(*) from employeeview"
               // Third Iteration I want   id= 2  Operation= "store" ViewName="employeecountview"  DiectoryName="null"  Query= "null"
          //ignore below sample code 
         //  val Operation = get(Operation(i))                   
        //       if (Operation=="Load"){
                           // based on operation type i am calling appropriate function  and passing entire row as a parameter 
        //       } else if(Operation= "Query"){      
        //                
        //       } else if(Operation= "store"){ 

        //       }

      }

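Note: the processing order must not change. The only unique identifier here is id, so the rows must be processed as 0, 1, 2, and so on.
Thanks in advance.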

fkaflof6 #1

Take a look at this:

scala> val df = Seq(
     |     (0,"Load","employeeview", "employee.empdetails", null ),
     |     (1,"Query","employeecountview",null,"select count(*) from employeeview"),
     |     (2,"store", "employeecountview",null,null)
     |   ).toDF("id", "Operation","ViewName","DiectoryName","Query")
df: org.apache.spark.sql.DataFrame = [id: int, Operation: string ... 3 more fields]

scala> df.show()
+---+---------+-----------------+-------------------+--------------------+
| id|Operation|         ViewName|       DiectoryName|               Query|
+---+---------+-----------------+-------------------+--------------------+
|  0|     Load|     employeeview|employee.empdetails|                null|
|  1|    Query|employeecountview|               null|select count(*) f...|
|  2|    store|employeecountview|               null|                null|
+---+---------+-----------------+-------------------+--------------------+

scala> val dfcount = df.count().toInt
dfcount: Int = 3

scala> :paste
// Entering paste mode (ctrl-D to finish)

for (a <- 0 to dfcount - 1) {
  // Look up the row whose id equals a; each iteration runs its own Spark job
  val operation = df.filter(s"id=${a}").select("Operation").as[String].first

  operation match {
    case "Query" => println("matching Query") // or call a function here for Query()
    case "Load"  => println("matching Load")  // or call a function here for Load()
    case "store" => println("matching store") // or call a function here for store()
    case x       => println("matched " + x)
  }
}

// Exiting paste mode, now interpreting.

matching Load
matching Query
matching store

scala>
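
The loop above starts one Spark job per id and relies on the ids running from 0 to count-1 without gaps. A possible alternative, sketched here under the assumption that the table is small enough to fit in driver memory, is to sort by `id` once, `collect()` the rows, and dispatch on the driver. The names `OrderedRowIteration`, `describe`, `processInOrder` and `sample` are hypothetical; `describe` only returns a description string where the real per-operation functions would be called.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object OrderedRowIteration {

  // Hypothetical handler: receives the whole Row and returns a description.
  // Replace each branch with the real Load/Query/store function.
  def describe(row: Row): String = {
    val id = row.getAs[Int]("id")
    // The sample data holds nulls, so nullable cells are wrapped in Option.
    val dir   = Option(row.getAs[String]("DiectoryName")).getOrElse("<none>")
    val query = Option(row.getAs[String]("Query")).getOrElse("<none>")
    row.getAs[String]("Operation") match {
      case "Load"  => s"$id: Load from $dir"
      case "Query" => s"$id: Query $query"
      case "store" => s"$id: store ${row.getAs[String]("ViewName")}"
      case other   => s"$id: unmatched $other"
    }
  }

  // Sort once on the cluster, then walk the rows sequentially on the driver.
  def processInOrder(df: DataFrame): Seq[String] =
    df.orderBy("id").collect().toSeq.map(describe)

  // Same rows as in the question, deliberately out of order
  // to show that orderBy restores the sequence.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(Int, String, String, String, String)](
      (2, "store", "employeecountview", null, null),
      (0, "Load", "employeeview", "employee.empdetails", null),
      (1, "Query", "employeecountview", null, "select count(*) from employeeview")
    ).toDF("id", "Operation", "ViewName", "DiectoryName", "Query")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ordered-rows").getOrCreate()
    processInOrder(sample(spark)).foreach(println)
    spark.stop()
  }
}
```

If the table is too large to collect, `toLocalIterator()` can be used in place of `collect()`; it fetches one partition at a time.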

Edit 1:

scala> val df = Seq((3,"sam",23,9876543210L)).toDF("id","name","age","phone")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 2 more fields]

scala> df.withColumn("json1",to_json(struct($"id",$"name",$"age",$"phone"))).show(false)
+---+----+---+----------+-------------------------------------------------+
|id |name|age|phone     |json1                                            |
+---+----+---+----------+-------------------------------------------------+
|3  |sam |23 |9876543210|{"id":3,"name":"sam","age":23,"phone":9876543210}|
+---+----+---+----------+-------------------------------------------------+

scala>

scala> df.withColumn("json1",to_json(struct(df.columns.map(col(_)):_*))).show(false)
+---+----+---+----------+-------------------------------------------------+
|id |name|age|phone     |json1                                            |
+---+----+---+----------+-------------------------------------------------+
|3  |sam |23 |9876543210|{"id":3,"name":"sam","age":23,"phone":9876543210}|
+---+----+---+----------+-------------------------------------------------+

scala>

scala> val inp=List("name","age")
inp: List[String] = List(name, age)

scala> df.withColumn("json1",to_json(struct(inp.map(col(_)):_*))).show(false)
+---+----+---+----------+-----------------------+
|id |name|age|phone     |json1                  |
+---+----+---+----------+-----------------------+
|3  |sam |23 |9876543210|{"name":"sam","age":23}|
+---+----+---+----------+-----------------------+

scala>
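
If the aim of Edit 1 is to hand an entire row to a handler as a single value, the `to_json(struct(...))` column can be combined with a sort on `id`. The following is only a sketch under that assumption: `RowsAsJson`, `jsonRowsInOrder` and `sample` are made-up names, the two extra rows are invented for illustration, and `id` is assumed to be the ordering column as in the question. Note that `to_json` leaves out fields whose value is null by default.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, struct, to_json}

object RowsAsJson {

  // Returns one JSON string per row, in ascending id order.
  def jsonRowsInOrder(df: DataFrame): Seq[String] =
    df.orderBy("id")
      .select(to_json(struct(df.columns.map(col(_)): _*)).as("json1"))
      .as[String](Encoders.STRING) // explicit encoder, no implicits needed
      .collect()
      .toSeq

  // The row from Edit 1 plus two invented rows, listed out of order.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (3, "sam", 23, 9876543210L),
      (1, "amy", 31, 9000000001L), // made-up row
      (2, "bob", 45, 9000000002L)  // made-up row
    ).toDF("id", "name", "age", "phone")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rows-as-json").getOrCreate()
    // Each string could be passed to a per-row handler here.
    jsonRowsInOrder(sample(spark)).foreach(println)
    spark.stop()
  }
}
```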
ktca8awb #2

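Here is my solution using a Dataset. It gives type safety and cleaner code. Performance still needs to be benchmarked, though it should not differ much.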

import org.apache.spark.sql.{Dataset, Encoders}
import spark.implicits._ // supplies the Encoder[String] that ds.map needs

case class EmployeeOperations(id: Int, operation: String, viewName: String, DiectoryName: String, query: String)

val data = Seq(
  EmployeeOperations(0, "Load", "employeeview", "employee.empdetails", ""),
  EmployeeOperations(1, "Query", "employeecountview", "", "select count(*) from employeeview"),
  EmployeeOperations(2, "store", "employeecountview", "", "")
)
val ds: Dataset[EmployeeOperations] = spark.createDataset(data)(Encoders.product[EmployeeOperations])

// Defined before its first use so the snippet can be run top to bottom.
// println executes inside the Spark task; outside local mode the messages
// end up in the executor logs rather than on the driver console.
def printOperation(ds: Dataset[EmployeeOperations]): Dataset[String] = {
  ds.map(x => x.operation match {
    case "Query" => println("matching Query"); "Query"
    case "Load"  => println("matching Load"); "Load"
    case "store" => println("matching store"); "store"
    case _       => println("Found something else"); "Nothing"
  })
}

printOperation(ds).show

For testing I only return a string; any primitive type can be returned. This outputs:

scala> printOperation(ds).show
matching Load
matching Query
matching store
+-----+
|value|
+-----+
| Load|
|Query|
|store|
+-----+
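
One caveat with `map` plus `println`: the function runs inside Spark tasks, so on a cluster the messages go to the executor logs and their order is not tied to `id`. A minimal sketch of a driver-side alternative, assuming the same `EmployeeOperations` case class, is to sort by `id` and read the rows through `toLocalIterator()`, which brings them to the driver one partition at a time. `OrderedDatasetIteration`, `operationsInOrder` and `sample` are illustrative names and not part of the answer above.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import scala.collection.mutable.ListBuffer

object OrderedDatasetIteration {

  case class EmployeeOperations(id: Int, operation: String, viewName: String,
                                DiectoryName: String, query: String)

  // Visits every row on the driver in ascending id order and
  // returns the operations in the order they were handled.
  def operationsInOrder(ds: Dataset[EmployeeOperations]): List[String] = {
    val rows = ds.orderBy("id").toLocalIterator() // a java.util.Iterator
    val handled = ListBuffer.empty[String]
    while (rows.hasNext) {
      val op = rows.next()
      op.operation match {
        case "Load"  => println(s"${op.id}: matching Load")  // call the Load function here
        case "Query" => println(s"${op.id}: matching Query") // call the Query function here
        case "store" => println(s"${op.id}: matching store") // call the store function here
        case other   => println(s"${op.id}: found something else: $other")
      }
      handled += op.operation
    }
    handled.toList
  }

  // Same rows as in the answer, listed out of order on purpose.
  def sample(spark: SparkSession): Dataset[EmployeeOperations] =
    spark.createDataset(Seq(
      EmployeeOperations(2, "store", "employeecountview", "", ""),
      EmployeeOperations(0, "Load", "employeeview", "employee.empdetails", ""),
      EmployeeOperations(1, "Query", "employeecountview", "", "select count(*) from employeeview")
    ))(Encoders.product[EmployeeOperations])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ordered-dataset").getOrCreate()
    println(operationsInOrder(sample(spark)))
    spark.stop()
  }
}
```

Because the match runs on the driver, the handlers are free to call back into Spark, for example to run the query text held in a row.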
