pig-Map和检索主表中的两列?

o4tp2gmn  于 2021-05-27  发布在  Hadoop
关注(0)|答案(1)|浏览(325)

我在openflights数据集上用pig做实验(https://openflights.org/data.html). 我正在尝试Map一个包含所有唯一可能的飞行路线的查询,即下表

+---------------+-------------+
| Start_Airport | End_Airport |
+---------------+-------------+
| YYZ           | NYC         |
| YBG           | YVR         |
| AEY           | GOH         |
+---------------+-------------+

然后将这两个值与包含每个机场的经度和纬度的主表连接起来。即

+---------+----------+-----------+
| Airport | Latitude | Longitude |
+---------+----------+-----------+
| YYZ     |    -10.3 |      1.23 |
| YBG     |    -40.3 |      50.4 |
| AEY     |     30.3 |      30.3 |
+---------+----------+-----------+

我该怎么做呢?我基本上是想有一个最终的表,看起来像

+----------------+----------+-----------+-------------+----------+-----------+
| Start_Airport  | Latitude | Longitude | End_Airport | Latitude | Longitude |
+----------------+----------+-----------+-------------+----------+-----------+
| YYZ            |    -10.3 |      1.23 | NYC         | blah     | blah      |
| YBG            |    -40.3 |      50.4 | YVR         | blah     | blah      |
| AEY            |     30.3 |      30.3 | GOH         | blah     | blah      |
+----------------+----------+-----------+-------------+----------+-----------+

我目前正在做如下工作,c是第一个表

route_data = JOIN c by (start_airport, end_airport), airports_all by ($0, $0);

我认为这基本上是说,对于查询,根据相应的代码连接开始的\u aiport和结束的\u airport,然后通过相应的经纬度,

j9per5c4

j9per5c41#

route\u data=join c by(start\u airport,end\u airport),airports\u all by($0,$0);
这类似于sql世界中典型连接查询的“and”conditions子句。想象一下下面的查询。它会产生你想要的结果。在a.start\u airport=b.first\u字段和a.end\u airport=b.first\u字段上,从c t1选择join airports\u all t2;只有当起始机场和结束机场相同时,才会产生结果。
你的愿望可以通过以下方式实现:

cat > routes.txt
YYZ,NYC
YBG,YVR
AEY,GOH

cat > airports_all.txt
YYZ,-10.3,1.23
YBG,-40.3,50.4
AEY,30.3,30.3

清管器代码:

tab1 = load '/home/ec2-user/routes.txt' using PigStorage(',') as (start_airport,end_airport);
describe tab1
tab2 = load '/home/ec2-user/airports_all.txt' using PigStorage(',') as (Airport,Latitude,Longitude);
describe tab2
tab3 = JOIN tab1 by (start_airport), tab2 by (Airport);
describe tab3
tab4 = foreach tab3 generate $0 as start_airport, $3 as start_Latitude, $4 as start_Longitude, $1 as end_airport;
describe tab4
tab5 = JOIN tab4 by (end_airport), tab2 by (Airport);
describe tab5
tab6 = foreach tab5 generate $0 as start_airport, $1 as start_Latitude, $2 as start_Longitude, $3 as end_airport, $5 as end_Latitude, $6 as end_Longitude;
describe tab6
dump tab6

相关问题