Finding the most populous city in each country

Posted by fjaof16o on 2021-05-29 in Spark

I need to write code that returns the most populous city in each country. Here is the input data:

// Input data (toDF requires `import spark.implicits._`)
val inputDf = Seq(
  ("Warsaw", "Poland", "1 764 615"),
  ("Cracow", "Poland", "769 498"),
  ("Paris", "France", "2 206 488"),
  ("Villeneuve-Loubet", "France", "15 020"),
  ("Pittsburgh PA", "United States", "302 407"),
  ("Chicago IL", "United States", "2 716 000"),
  ("Milwaukee WI", "United States", "595 351"),
  ("Vilnius", "Lithuania", "580 020"),
  ("Stockholm", "Sweden", "972 647"),
  ("Goteborg", "Sweden", "580 020")
).toDF("name", "country", "population")
println("Input:")
inputDf.show(false)

My solution is:

val topPopulation = inputDf
  // Strip the space thousands separators and cast the population to an integer
  .withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
  .groupBy("country")
  // Only the grouping key and the aggregate survive here, so "name" is lost
  .agg(
    max("population").alias("population")
  )
  .orderBy($"population".desc)
topPopulation

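But I run into trouble, because I get this error: "Except can only be performed on tables with the same number of columns, but the first table has 2 columns and the second table has 3 columns;;"
Input: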

+-----------------+-------------+----------+
|name             |country      |population|
+-----------------+-------------+----------+
|Warsaw           |Poland       |1 764 615 |
|Cracow           |Poland       |769 498   |
|Paris            |France       |2 206 488 |
|Villeneuve-Loubet|France       |15 020    |
|Pittsburgh PA    |United States|302 407   |
|Chicago IL       |United States|2 716 000 |
|Milwaukee WI     |United States|595 351   |
|Vilnius          |Lithuania    |580 020   |
|Stockholm        |Sweden       |972 647   |
|Goteborg         |Sweden       |580 020   |
+-----------------+-------------+----------+

Expected:

+----------+-------------+----------+
|name      |country      |population|
+----------+-------------+----------+
|Warsaw    |Poland       |1 764 615 |
|Paris     |France       |2 206 488 |
|Chicago IL|United States|2 716 000 |
|Vilnius   |Lithuania    |580 020   |
|Stockholm |Sweden       |972 647   |
+----------+-------------+----------+

Actual:

+-------------+----------+
|country      |population|
+-------------+----------+
|United States|2716000   |
|France       |2206488   |
|Poland       |1764615   |
|Sweden       |972647    |
|Lithuania    |580020    |
+-------------+----------+

jm2pwxwz #1

Try this:

Load the test data

val inputDf = Seq(
      ("Warsaw", "Poland", "1 764 615"),
      ("Cracow", "Poland", "769 498"),
      ("Paris", "France", "2 206 488"),
      ("Villeneuve-Loubet", "France", "15 020"),
      ("Pittsburgh PA", "United States", "302 407"),
      ("Chicago IL", "United States", "2 716 000"),
      ("Milwaukee WI", "United States", "595 351"),
      ("Vilnius", "Lithuania", "580 020"),
      ("Stockholm", "Sweden", "972 647"),
      ("Goteborg", "Sweden", "580 020")
    ).toDF("name", "country", "population")
    println("Input:")
    inputDf.show(false)
    /**
      * Input:
      * +-----------------+-------------+----------+
      * |name             |country      |population|
      * +-----------------+-------------+----------+
      * |Warsaw           |Poland       |1 764 615 |
      * |Cracow           |Poland       |769 498   |
      * |Paris            |France       |2 206 488 |
      * |Villeneuve-Loubet|France       |15 020    |
      * |Pittsburgh PA    |United States|302 407   |
      * |Chicago IL       |United States|2 716 000 |
      * |Milwaukee WI     |United States|595 351   |
      * |Vilnius          |Lithuania    |580 020   |
      * |Stockholm        |Sweden       |972 647   |
      * |Goteborg         |Sweden       |580 020   |
      * +-----------------+-------------+----------+
      */

Find the city with the largest population in each country

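val topPopulation = inputDf
      // Strip the space thousands separators and cast the population to an integer
      .withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
      // Pack population and name into one struct, with population as the first field
      .withColumn("population_name", struct($"population", $"name"))
      .groupBy("country")
      // max over the struct keeps the name that belongs to the largest population
      .agg(max("population_name").as("population_name"))
      // Flatten the struct back into top-level population and name columns
      .selectExpr("country", "population_name.*")
    topPopulation.show(false)
    topPopulation.printSchema()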

    /**
      * +-------------+----------+----------+
      * |country      |population|name      |
      * +-------------+----------+----------+
      * |France       |2206488   |Paris     |
      * |Poland       |1764615   |Warsaw    |
      * |Lithuania    |580020    |Vilnius   |
      * |Sweden       |972647    |Stockholm |
      * |United States|2716000   |Chicago IL|
      * +-------------+----------+----------+
      *
      * root
      * |-- country: string (nullable = true)
      * |-- population: integer (nullable = true)
      * |-- name: string (nullable = true)
      */
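
The struct approach works because Spark compares struct values field by field, from left to right. With population as the first field, `max` picks the row with the largest population and only falls back to the name on a tie. The sketch below isolates that step and also selects the columns back in the `name, country, population` order shown in the expected output. The object name `StructMaxSketch`, the helper `topCityPerCountry`, the local `SparkSession` settings, and the four-row data set are assumptions made for illustration, and the population column is assumed to be numeric already.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, struct}

object StructMaxSketch {
  // Hypothetical helper. Expects columns: name (string), country (string),
  // population (numeric).
  def topCityPerCountry(cities: DataFrame): DataFrame =
    cities
      .groupBy("country")
      // population is the first struct field, so it drives the comparison
      .agg(max(struct(col("population"), col("name"))).as("top"))
      // Pull the fields back out in the column order the question expects
      .select(
        col("top.name").as("name"),
        col("country"),
        col("top.population").as("population")
      )

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings)
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("struct-max-sketch")
      .getOrCreate()
    import spark.implicits._

    // Small illustrative subset of the question's data, already numeric
    val cities = Seq(
      ("Cracow", "Poland", 769498),
      ("Warsaw", "Poland", 1764615),
      ("Goteborg", "Sweden", 580020),
      ("Stockholm", "Sweden", 972647)
    ).toDF("name", "country", "population")

    topCityPerCountry(cities).show(false)
    spark.stop()
  }
}
```

If the field order in the struct were reversed (name first), `max` would return the alphabetically last city instead, so the order of the struct fields matters.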

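A different technique that is often used for "top row per group" problems is a window function, which is not what the answer above does. The sketch below ranks cities inside each country with `row_number` and keeps rank 1. Because the original row is kept intact, the result has the same three columns as the input, including the population in its original string form. The object name `WindowTopCitySketch`, the helper `topCityPerCountry`, and the temporary columns `population_num` and `rank` are hypothetical names chosen for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, regexp_replace, row_number}

object WindowTopCitySketch {
  // Hypothetical helper. Expects string columns: name, country, population,
  // where population uses spaces as thousands separators ("1 764 615").
  def topCityPerCountry(cities: DataFrame): DataFrame = {
    // Largest population first; name breaks ties so the result is deterministic
    val byCountry = Window
      .partitionBy("country")
      .orderBy(col("population_num").desc, col("name"))

    cities
      // Numeric copy of the population, used only for ranking
      .withColumn("population_num", regexp_replace(col("population"), " ", "").cast("int"))
      .withColumn("rank", row_number().over(byCountry))
      .where(col("rank") === 1)
      // Drop the helper columns and keep the original three
      .select("name", "country", "population")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings)
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("window-top-city-sketch")
      .getOrCreate()
    import spark.implicits._

    // Small subset of the question's input data
    val cities = Seq(
      ("Warsaw", "Poland", "1 764 615"),
      ("Cracow", "Poland", "769 498"),
      ("Vilnius", "Lithuania", "580 020"),
      ("Stockholm", "Sweden", "972 647"),
      ("Goteborg", "Sweden", "580 020")
    ).toDF("name", "country", "population")

    topCityPerCountry(cities).show(false)
    spark.stop()
  }
}
```

Compared with the struct approach, the window version keeps every original column without packing it into a struct, at the cost of a per-country sort rather than a single aggregation.
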