Apache Pig - ERROR org.apache.pig.impl.PigContext - Encountered " <OTHER> ",= "" at line 1, column 1

vql8enpb  posted on 2021-05-29 in Hadoop

I am trying to use Apache Pig to do some data cleaning on data stored in a Hive table.
My Pig script looks like this:

INPUT_FILE = LOAD 'staging_area' USING org.apache.hive.hcatalog.pig.HCatLoader()
AS
          (ID:Long, 
          CHAIN:Int,
          DEPT:Int,
          CATEGORY:Int,
          COMPANY:Long,
          BRAND:Long,
          DATE:Chararray,
          QUARTER:Int,
          MONTH:Int,
          DAY:Int,
          WEEKDAY:Int,
          PRODUCT_SIZE:Int,
          PRODUCT_MEASURE:Chararray,
          PRODUCT_QUANTITY:Int,
          PURCHASE_AMOUNT:Double);

SPLIT INPUT_FILE INTO DATA IF (PRODUCT_SIZE > 0 AND PURCHASE_AMOUNT > 0 AND PRODUCT_QUANTITY > 0), MISSING_VALUES if (PRODUCT_QUANTITY <= 0 OR PURCHASE_AMOUNT <= 0);

DATA_TRANSFORMATION = FOREACH DATA GENERATE 
                                            ID,
                                            CHAIN,
                                            DEPT,
                                            CATEGORY,
                                            ToDate(DATE,'yyyy-MM-dd') as DATE_ID,
                                            QUARTER,
                                            MONTH,
                                            DAY,
                                            WEEKDAY,
                                            PRODUCT_SIZE,
                                            PURCHASE_AMOUNT;

GRP = GROUP DATA_TRANSFORMATION BY ID;

SUMMED = foreach GRP {
     amount = SUM(DATA_TRANSFORMATION.PURCHASE_AMOUNT);
     cnt = COUNT(DATA_TRANSFORMATION.ID);
     generate group, Purchase_Average,Freq_Visits;
}

JOINED = join DATA_TRANSFORMATION by $0, SUMMED by $0;

DATASET = FOREACH JOINED GENERATE $0,$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12;

RANKING = rank DATASET by $6,$1,$0;

DW = FOREACH RANKING GENERATE $1 as ID,$2 as Purchase_Average, $3 as Freq_Visits, $0 as Transaction_ID, $4,$5,$6,$7,$8,$9,$10,$11,$12,$13;

STORE DW INTO '/user/cloudera/data' USING PigStorage(',');

The Hive table contains the following data (first 10 rows):

id          chain  dept  category  company     brand  date_id           quarter  month_id  day_id  weekday  productsize  productmeasure  purchasequantity  purchaseamount
1940424003  46     99    9909      1081843181  25935  29-01-2013 00:00  1        1         29      2        6            OZ              2                 5
1940424003  46     35    3504      103500030   13470  04-02-2013 00:00  1        2         4       1        25           OZ              2                 5
1940424003  46     91    9115      108048080   1230   08-02-2013 00:00  1        2         8       5        0            LT              1                 13.99
1940452798  46     7     706       101200010   17286  09-02-2013 00:00  1        2         9       6        38           OZ              1                 5.75
1940452798  46     45    4517      107220575   17340  10-02-2013 00:00  1        2         10      7        16           OZ              1                 45
1940452798  46     99    9909      107143070   5072   10-02-2013 00:00  1        2         10      7        12           OZ              1                 1.99
1940452798  46     21    2119      1061300868  867    10-02-2013 00:00  1        2         10      7        138          OZ              1                 43.8
1940452798  46     56    5616      1071373373  11473  10-02-2013 00:00  1        2         10      7        8            OZ              1                 2.5
1940452798  46     7     706       107146474   2142   10-02-2013 00:00  1        2         10      7        15           OZ              1                 2
1940452798  46     72    7205      103700030   4294   22-02-2013 00:00  1        2         22      5        6            OZ              1                 3

Every time I run the script I get the following error:

ERROR org.apache.pig.impl.PigContext - Encountered " <OTHER> ",= "" at line 1, column 1

Does anyone know how to fix this? My data has 3,000,000 records, and I am using the Cloudera QuickStart VM 5.8.

bfhwhh0e  #1

SUMMED = foreach GRP {
     amount = SUM(DATA_TRANSFORMATION.PURCHASE_AMOUNT);
     cnt = COUNT(DATA_TRANSFORMATION.ID);
     generate group, Purchase_Average,Freq_Visits;
}

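You cannot project Purchase_Average and Freq_Visits here. Neither name is defined anywhere in the script: the nested block only computes amount and cnt, so the generate statement refers to fields that do not exist.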
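One way to make that statement valid, as a minimal sketch. The question never defines the two fields, so I am assuming Purchase_Average is meant to be the mean PURCHASE_AMOUNT per ID and Freq_Visits the number of rows per ID. It uses the built-in AVG and COUNT functions and names the outputs with AS, which also removes the need for the nested block:

```pig
-- Assumption: Purchase_Average = mean purchase amount per customer ID,
--             Freq_Visits      = number of transactions per customer ID.
GRP = GROUP DATA_TRANSFORMATION BY ID;

SUMMED = FOREACH GRP GENERATE
             group                                    AS ID,
             AVG(DATA_TRANSFORMATION.PURCHASE_AMOUNT) AS Purchase_Average,
             COUNT(DATA_TRANSFORMATION.ID)            AS Freq_Visits;
```

With the fields named this way, the later join on $0 still lines up, because the first column of SUMMED is the ID. I have not checked whether this change alone clears the PigContext error.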
