hive的隐式连接总是内部连接吗?

zlhcx6iw  于 2021-06-02  发布在  Hadoop
关注(0)|答案(1)|浏览(371)

hive的连接文档鼓励使用隐式连接,即。

SELECT * 
FROM table1 t1, table2 t2, table3 t3 
WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '02535';

这等于

SELECT t1.*, t2.*, t3.*
FROM table1 t1
INNER JOIN table2 t2 ON
  t1.id = t2.id 
INNER JOIN table3 t3 ON
  t2.id = t3.id
WHERE t1.zipcode = '02535'

,或以上是否会返回其他记录?

pnwntuvh

pnwntuvh1#

不总是这样。您的查询是等效的。但是没有 WHERE t1.id = t2.id AND t2.id = t3.id 会的 CROSS JOIN .
更新:
这是一个有趣的问题,我决定添加一些演示。让我们创建两个表: A(c1 int, c2 string) 以及 B(c1 int, c2 string) .
加载数据:

insert into table A
 select 1, 'row one' union all
 select 2, 'row two';

insert into table B
 select 1, 'row one' union all
 select 3, 'row three';

检查数据:

hive> select * from A;
OK
1       row one
2       row two
Time taken: 1.29 seconds, Fetched: 2 row(s)
hive> select * from B;
OK
1       row one
3       row three
Time taken: 0.091 seconds, Fetched: 2 row(s)

检查交叉连接(隐式连接,不带 where 转化为交叉):

hive> select a.c1, a.c2, b.c1, b.c2 from a,b;
Warning: Map Join MAPJOIN[14][bigTable=a] in task 'Stage-3:MAPRED' is a cross product
Warning: Map Join MAPJOIN[22][bigTable=b] in task 'Stage-4:MAPRED' is a cross product
Warning: Shuffle Join JOIN[4][tables = [a, b]] in Stage 'Stage-1:MAPRED' is a cross product

OK
1       row one 1       row one
2       row two 1       row one
1       row one 3       row three
2       row two 3       row three
Time taken: 54.804 seconds, Fetched: 4 row(s)

检查内部联接(隐式联接) where 作为内部工作):

hive> select a.c1, a.c2, b.c1, b.c2 from a,b where a.c1=b.c1;
OK
1       row one 1       row one
Time taken: 38.413 seconds, Fetched: 1 row(s)

尝试通过添加 OR b.c1 is null 到以下位置:

hive> select a.c1, a.c2, b.c1, b.c2 from a,b where (a.c1=b.c1) OR (b.c1 is null);
OK
1       row one 1       row one
Time taken: 57.317 seconds, Fetched: 1 row(s)

如你所见,我们又有了内部连接。 or b.c1 is null 被忽略
现在 left join 没有 where 以及 ON 子句(转换为交叉):

select a.c1, a.c2, b.c1, b.c2 from a left join b;
OK
1       row one 1       row one
1       row one 3       row three
2       row two 1       row one
2       row two 3       row three
Time taken: 37.104 seconds, Fetched: 4 row(s)

正如你所看到的,我们又生气了。
尝试左连接 where 从句和从句 ON (用作内部):

select a.c1, a.c2, b.c1, b.c2 from a left join b where a.c1=b.c1;
OK
1       row one 1       row one
Time taken: 40.617 seconds, Fetched: 1 row(s)

我们有内部连接
尝试左连接 where 从句和从句 ON +尝试允许空值:

select a.c1, a.c2, b.c1, b.c2 from a left join b where a.c1=b.c1 or b.c1 is null;
OK
1       row one 1       row one
Time taken: 53.873 seconds, Fetched: 1 row(s)

又一次进入内心。或者 b.c1 is null 被忽略。
左连接 on 条款:

hive> select a.c1, a.c2, b.c1, b.c2 from a left join b on a.c1=b.c1;
OK
1       row one 1       row one
2       row two NULL    NULL
Time taken: 48.626 seconds, Fetched: 2 row(s)

是的,这是真的左连接。
左连接 on + where (内心深处):

hive> select a.c1, a.c2, b.c1, b.c2 from a left join b on a.c1=b.c1 where a.c1=b.c1;
OK
1       row one 1       row one
Time taken: 49.54 seconds, Fetched: 1 row(s)

我们得到了inner,因为where不允许空值。
带where+的左连接允许空值:

hive> select a.c1, a.c2, b.c1, b.c2 from a left join b on a.c1=b.c1 where a.c1=b.c1 or b.c1 is null;
OK
1       row one 1       row one
2       row two NULL    NULL
Time taken: 55.951 seconds, Fetched: 2 row(s)

是的,它是左连接。
结论:
隐式连接用作innner(with where)或cross if(如果没有where子句)。
如果没有on和where,左连接可以作为cross,如果where子句不允许右表为null,左连接也可以作为inner。
最好使用ansi语法,因为它是自我解释的,而且很容易理解您希望它的工作方式。隐式连接或左连接作为内部连接或交叉连接很难理解,并且很容易出错。

相关问题