Hive is dropping records: row count decreases after a partitioned insert

70gysomp  posted on 2021-06-25  in Hive

I created a Hive table from a CSV file:

CREATE TABLE RECORD_CSV(
  completed_on string, distance_travelled double, 
  end_location_lat double, end_location_long double, 
  started_on string, driver_rating double, 
  rider_rating double, start_zip_code int, 
  end_zip_code int, charity_id int, 
  requested_car_category string, free_credit_used double, 
  surge_factor double, start_location_long double, 
  start_location_lat double, color string, 
  make string, model string, year int, 
  rating double, Date string, PRCP double, 
  TMAX double, TMIN double, AWND double, 
  GustSpeed2 double, Fog double, HeavyFog double, 
  Thunder double
) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE;

When I run SELECT COUNT(*) FROM RECORD_CSV; it returns:

OK
911057
Time taken: 21.403 seconds, Fetched: 1 row(s)

Then I created another table, partitioned on the color field, and loaded it from the first one:

CREATE TABLE RECORD_CSV_BYCOLOR(completed_on string, distance_travelled double,
end_location_lat double ,end_location_long double,
started_on string ,driver_rating double ,rider_rating double ,
start_zip_code int ,end_zip_code int ,charity_id int,
requested_car_category string,free_credit_used double,
surge_factor double,start_location_long double,start_location_lat double ,
make string ,model string ,year int ,rating double,Date string,PRCP double,
TMAX double,TMIN double,AWND double,GustSpeed2 double,
Fog double,HeavyFog double,Thunder double
)
PARTITIONED BY (color string)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' 
STORED AS TEXTFILE;
INSERT OVERWRITE table RECORD_CSV_BYCOLOR PARTITION(color) 
select completed_on, distance_travelled,end_location_lat, 
end_location_long, started_on, driver_rating, rider_rating,
start_zip_code, end_zip_code, charity_id, requested_car_category,
free_credit_used, surge_factor, start_location_long, start_location_lat,
make, model, year, rating, Date, PRCP, TMAX, TMIN, AWND, GustSpeed2,
Fog, HeavyFog, Thunder, color FROM RECORD_CSV;

When I run SELECT COUNT(*) FROM RECORD_CSV_BYCOLOR; I see that the record count has dropped:

OK
693991
Time taken: 21.552 seconds, Fetched: 1 row(s)

Here is the per-color breakdown, using GROUP BY on color, for the table RECORD_CSV:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 7.11 sec HDFS Read: 165793766 HDFS Write: 349 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 110 msec
OK
Silver 634
Black 204004
Bronze 214
Burgundy 1587
GREEN 195
Gold 6346
Gray 644
Maroon 847
Silver 170241
Silver 147
Tan 1066
Teal 913
White 152919
White 404
Yellow/Gold 20540
Blue 90
Brown 18594
Gray 134155
Navy Blue 48
Red 80352
WHITE 52
Yellow 448
Black 361
Blue 81999
Dark Blue 199
Dark Grey 18
Green 15396
Grey 12503
Magenta 324
Orange 5817
Time taken: 25.186 seconds, Fetched: 30 row(s)

And here is the same breakdown for RECORD_CSV_BYCOLOR:
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.48 sec   HDFS Read: 30648230 HDFS Write: 281 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 480 msec
OK
 Silver 634
Black   361
Blue    90
Bronze  214
Brown   18594
Burgundy    1587
Dark Blue   199
Dark Grey   18
GREEN   195
Gold    6346
Gray    644
Green   15396
Grey    12503
Magenta 324
Maroon  847
Navy Blue   48
Orange  5817
Red 80352
Silver  147
Tan 1066
Teal    913
WHITE   52
White   404
Yellow  448
Yellow/Gold 20540
Time taken: 20.937 seconds, Fetched: 25 row(s)

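The GROUP BY on the source table returns two counts for some colors (Black, Blue, Gray, Silver, White), while the target table keeps only the row with the smaller count for each of them. The difference is clearly there, but why does this happen, and what should I change in my code?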
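
One thing that may be worth checking first is whether the duplicated colors are really identical strings. This is an assumption, not a confirmed diagnosis. The target output above lists " Silver" with a leading space as a separate value, so other variants could differ by characters that the console does not show. A minimal diagnostic sketch, using only the tables from the question and the built-in functions concat, length and hex (the column aliases are illustrative names, not part of the original schema):

```sql
-- Print every distinct color with visible brackets, its length and its raw bytes,
-- so that values which look the same on screen can be told apart.
SELECT concat('[', color, ']') AS bracketed_color,
       length(color)           AS color_length,
       hex(color)              AS color_hex,
       count(*)                AS row_count
FROM RECORD_CSV
GROUP BY color
ORDER BY bracketed_color;

-- List the partitions that the dynamic-partition insert actually created.
SHOW PARTITIONS RECORD_CSV_BYCOLOR;
```

If two rows show the same visible name but a different length or hex value, those variants are distinct partition values as far as Hive is concerned, and comparing them against the SHOW PARTITIONS output may indicate which ones ended up in the target table.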
