Apache URI and referer combination with Pig

xhv8bpkk  posted on 2021-06-25  in  Pig

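Let me start by saying that I am a sysadmin by trade and a Pig newbie, so please be gentle.
I am trying to use Pig to parse Apache web logs from a CDN. For one application we have three different call types, which can be gleaned from the URI, and 3 different application/version strings (caused by inconsistencies in app development). I need to collect these and produce a report detailing the number of calls of each type for each application/version.
The call type will contain one of the following: valid, wms, tile.
The application name in the userAgent field can look like any of the following:
Application%20Name/0.0 CFNetwork/609.1.4 Darwin/13.0.0
Android Application Name 0.0.0 (SCH-I605-Android 4.1.2, SDK XX)
Application Name 0.0.0 (iPhone OS 6.1.3-iPhone, .xx..xx.x, xx 0.0)
The script below is what I had working before I discovered the userAgent naming inconsistency. It is probably a hack at best, but it produced what was needed.
Thanks for any help.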

-- Load piggybank and define the log loader and helper UDFs.
REGISTER file:/home/hadoop/lib/pig/piggybank.jar
DEFINE LogLoader org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();
DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
logs = LOAD '$INPUT' USING LogLoader AS (remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes, referer,userAgent);
-- Keep only requests from the app's user agents.
FILTERED = FILTER logs BY userAgent matches '.*MapKit.*' OR userAgent matches '.*Darwin.*' OR userAgent matches '.*Android.*';
-- Despite its name, DARWINONLY holds every user agent kept by the filter above.
DARWINONLY = FOREACH FILTERED GENERATE DayExtractor(time) AS day, uri, bytes, userAgent;
-- Split by call type, based on a substring of the URI.
FILTERVALID = FILTER DARWINONLY BY uri matches '.*valid.*';
FILTERTILE = FILTER DARWINONLY BY uri matches '.*tile.*';
FILTERWMS = FILTER DARWINONLY BY uri matches '.*wms.*';
-- Take the first whitespace-delimited token of the user agent as the app name.
VALIDAPPTIME = FOREACH FILTERVALID GENERATE day AS validframeday, EXTRACT(userAgent, '([^\\s]+)') AS validframeapp,bytes AS validbytes;
WMSAPPTIME = FOREACH FILTERWMS GENERATE day AS wmsday, EXTRACT(userAgent, '([^\\s]+)') AS wmsapp,  bytes AS wmsbytes;
TILEAPPTIME = FOREACH FILTERTILE GENERATE day AS tileday, EXTRACT(userAgent, '([^\\s]+)') AS tileapp, bytes AS tilebytes;
-- Group each call type by (day, app), then count calls and sum bytes.
GROUPWMS = GROUP WMSAPPTIME BY ($0,$1);
GROUPTILE = GROUP TILEAPPTIME BY ($0,$1);
GROUPVALID = GROUP VALIDAPPTIME BY ($0,$1);
WMSAPPCOUNT = FOREACH GROUPWMS GENERATE FLATTEN(group), COUNT($1) AS wmsnum, SUM(WMSAPPTIME.wmsbytes) AS wmstotalbytes;
VALIDAPPCOUNT = FOREACH GROUPVALID GENERATE FLATTEN(group), COUNT($1) AS validnum, SUM(VALIDAPPTIME.validbytes) AS validtotalbytes;
TILEAPPCOUNT = FOREACH GROUPTILE GENERATE FLATTEN(group), COUNT($1) AS tilenum, SUM(TILEAPPTIME.tilebytes) AS tiletotalbytes:int;
-- Combine the three per-type summaries into one row per (day, app).
Y = COGROUP VALIDAPPCOUNT BY (validframeday,validframeapp), WMSAPPCOUNT BY (wmsday,wmsapp), TILEAPPCOUNT BY (tileday,tileapp);
Z = FOREACH Y GENERATE group AS dailyapp, VALIDAPPCOUNT.validnum, VALIDAPPCOUNT.validtotalbytes, WMSAPPCOUNT.wmsnum, WMSAPPCOUNT.wmstotalbytes, TILEAPPCOUNT.tilenum, TILEAPPCOUNT.tiletotalbytes;
STORE Z INTO '$OUTPUT';
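
The script above takes the first whitespace-delimited token of the userAgent as the app name, which only suits the first of the three formats. As a rough way to prototype the extraction logic outside Pig, the following Python 3 sketch pulls an (app, version) pair out of all three userAgent shapes listed in the question and counts calls per call type. It is only a sketch under assumptions: the function names, the regular expression, the "first substring hit wins" rule for call types, and treating a `-` byte count as 0 are all illustrative choices and are not part of the original script.

```python
import re
from collections import defaultdict
from urllib.parse import unquote

# Call types named in the question; the first substring hit wins (an assumption).
CALL_TYPES = ("valid", "wms", "tile")

# App name, then a space or slash, then a dotted version number that is
# followed by whitespace, an opening parenthesis, or the end of the string.
UA_PATTERN = re.compile(r"^(?P<app>.+?)[ /](?P<version>\d+(?:\.\d+)*)(?=[\s(]|$)")


def parse_user_agent(user_agent):
    """Return (app, version), or None if the string fits none of the shapes."""
    # Decode %20 and friends first so "Application%20Name" becomes "Application Name".
    match = UA_PATTERN.match(unquote(user_agent).strip())
    if match is None:
        return None
    return match.group("app").strip(), match.group("version")


def call_type(uri):
    """Return the first call type found in the URI, or None."""
    for name in CALL_TYPES:
        if name in uri:
            return name
    return None


def summarize(records):
    """Count calls and sum bytes per (day, app, version, call type).

    `records` is an iterable of (day, uri, num_bytes, user_agent) tuples,
    the same four fields the Pig script projects after loading.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [call count, total bytes]
    for day, uri, num_bytes, user_agent in records:
        parsed = parse_user_agent(user_agent)
        kind = call_type(uri)
        if parsed is None or kind is None:
            continue  # unknown user agent shape or not one of the call types
        entry = totals[(day, parsed[0], parsed[1], kind)]
        entry[0] += 1
        # Access logs may record "-" when no bytes were sent; count that as 0.
        entry[1] += int(num_bytes) if str(num_bytes).isdigit() else 0
    return {key: tuple(value) for key, value in totals.items()}
```

Calling `parse_user_agent` on each of the three sample strings from the question should give the app name and version without the device or OS details. Note that the `Android` prefix is kept as part of the app name; whether it should be stripped so that Android and iOS builds are grouped together depends on the report that is wanted. The same regular expression could probably be carried back into the Pig script through a regex-extraction function in place of `'([^\\s]+)'`, but that has not been tested here.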
