Pig script does not flatten tuples correctly

Posted by uoifb46i on 2021-06-25 in Pig

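I am new to Pig. Each record in my data is a JSON string of varying length, made up of key-value pairs in which the key is a quoted city name and the value is an integer: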

{"Chicago, Illinois": 123, "London, England, Great Britain": 555, "Mexico City, Federal District, Mexico": 333 ...}
{"Chicago, Illinois": 555, "London, England, Great Britain": 222, "Dublin, Ireland": 888 ...}

Expected output:

Chicago, Illinois
Mexico City, Federal District, Mexico
London, England, Great Britain
Dublin, Ireland

I am trying to build a list of unique city names from all the JSON strings in the collection (one JSON string per row).
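To make the goal concrete outside of Pig, here is a minimal Python sketch, assuming every input line is a complete, valid JSON object. The function name `unique_cities` and the two sample lines are illustrative only and are not part of the original script.

```python
import json

def unique_cities(lines):
    """Collect distinct JSON keys (city names) in first-seen order."""
    seen = []
    for line in lines:
        # Iterating over the parsed dict yields its keys, i.e. the city names.
        for city in json.loads(line):
            if city not in seen:
                seen.append(city)
    return seen

# Hypothetical sample rows, shortened from the shape shown in the question.
sample = [
    '{"Chicago, Illinois": 123, "London, England, Great Britain": 555}',
    '{"Chicago, Illinois": 555, "Dublin, Ireland": 888}',
]
print(unique_cities(sample))
```

Because the keys come from a JSON parser rather than from splitting on a delimiter, commas and spaces inside a city name cannot break it apart.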

cities_data = LOAD 'hbase://facebook_insights'
                    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('c:cities_frequency', '-caster=HBaseBinaryConverter -loadKey true -gte 20141201')
          AS (key:chararray, cities_frequency);

tokenized_cities = FOREACH cities_data GENERATE flatten(TOKENIZE(cities_frequency,'"')) as caught_cities;

filtered_cities_data = FILTER tokenized_cities by NOT STARTSWITH(caught_cities, ':');

city_group = GROUP filtered_cities_data BY caught_cities;
unique_cities = FOREACH city_group GENERATE group;

STORE unique_cities INTO 'output/cities';

Most city names come through intact in the output, but a few are split apart. For example:

Chic
ago, Illinois
Mexico City, Federal District, 
Mexico
London, England, Great Britain
Dublin, Ireland

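When I tokenize cities_frequency, it produces a bag of tuples. I have confirmed that this step is correct: each tuple contains the full city name that makes up the key.
Here is sample output from tokenizing cities_frequency on the double quote: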

{({),(Tulsa, OK),(: 1, ),(Ponte Vedra Beach, FL),(: 2, ),(Virginia Beach, VA),(: 1, ),(Hollywood, FL),(: 1, ),(Riyadh, Ar Riyad, Saudi Arabia),(: 2, ),(Lake Panasoffkee, FL),(: 1,     ),(Bowmantown, TN),(: 1, ),(Duluth, GA),(: 1, ),(Atlanta, GA),(: 2, ),(Jacksonville Beach, FL),(: 2, ),(Cushing, OK),(: 3, ),(Miri, Sarawak, Malaysia),(: 2, ),(Davenport, IA),(: 1, ),(Saint Petersburg, FL),(: 1, ),(Ocala, FL),(: 1, ),(Osasco, S\u00e3o Paulo, Brazil),(: 1, ),(Haywards Heath, England, United Kingdom),(: 1, ),(Chicago, IL),(: 1, ),(Skipperville, AL),(: 1, ),(Grain Valley, MO),(: 1, ),(Jacksonville, FL),(: 14, ),(Atlantic Beach, FL),(: 1, ),(Carlton, OR),(: 1, ),(Gainesville, FL),(: 3, ),(Kuala Lumpur, Malaysia),(: 1})}
{({),(Baker, FL),(: 3, ),(Playas, Guayas, Ecuador),(: 1, ),(Port Hueneme, CA),(: 1, ),(Pace, FL),(: 2, ),(St. Louis, MO),(: 1, ),(Tampa, FL),(: 1, ),(Crestview, FL),(: 19, ),(Wells, MN),(: 1, ),(Daphne, AL),(: 1, ),(Scottsdale, AZ),(: 1, ),(Niceville, FL),(: 4, ),(Los Angeles, CA),(: 1, ),(DeFuniak Springs, FL),(: 1, ),(Moorpark, CA),(: 1, ),(Cantonment, FL),(: 1, ),(Urbana, IL),(: 1, ),(Albuquerque, NM),(: 1, ),(Fort Walton Beach, FL),(: 1, ),(Port Barre, LA),(: 1, ),(Harrisburg, AR),(: 1, ),(McCalla, AL),(: 1, ),(Ville Platte, LA),(: 1, ),(De Funiak Springs, FL),(: 1, ),(Destin, FL),(: 1, ),(Orlando, FL),(: 1, ),(Pensacola, FL),(: 1})}
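The tokenize-then-filter logic behind bags like the ones above can be mimicked in plain Python. This is only a sketch of the logic, not of Pig's actual implementation; the helper names `tokenize_on_quote` and `keep_non_colon` and the shortened sample row are assumptions made for illustration.

```python
def tokenize_on_quote(text):
    """Split on the double quote and drop empty tokens."""
    return [tok for tok in text.split('"') if tok]

def keep_non_colon(tokens):
    """Mirror the FILTER step: drop tokens that start with ':'."""
    return [tok for tok in tokens if not tok.startswith(':')]

# Hypothetical row, shortened from the sample output in the question.
row = '{"Tulsa, OK": 1, "Chicago, IL": 1}'
tokens = tokenize_on_quote(row)
print(tokens)
print(keep_non_colon(tokens))
```

In this sketch the leading `{` token does not start with a colon, so it survives the filter and would need to be excluded separately.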

When the tuples are flattened with

FOREACH cities_data GENERATE flatten(TOKENIZE(cities_frequency, '"')) as caught_cities

some of the names get split, however. Is there something I am overlooking? I also tried:

FOREACH cities_data GENERATE REGEXP_EXTRACT_ALL(cities_frequency, '"([a-zA-Z])*"') as caught_cities;

without success (the function comes from piggybank.jar). It returns empty results.
Thanks!
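One possible reason the pattern above finds nothing is that the character class `[a-zA-Z]` excludes the commas, spaces, and periods that appear in these city names. The Python sketch below illustrates this with the `re` module. It is not Pig code, and Java's regex engine, which Pig relies on, may differ in details. The second pattern, `"([^"]+)"\s*:`, is an assumed alternative and does not come from the original post. It may also be worth checking in the Pig documentation whether the extraction function requires the pattern to match the entire input string.

```python
import re

# Hypothetical row in the same shape as the data in the question.
row = '{"Tulsa, OK": 1, "St. Louis, MO": 1, "Chicago, IL": 2}'

# Letters-only pattern from the question: a quoted name containing a comma,
# space, or period can never match it.
letters_only = re.findall(r'"([a-zA-Z])*"', row)

# Assumed alternative: a quoted run of non-quote characters followed by a colon.
keys = re.findall(r'"([^"]+)"\s*:', row)

print(letters_only)
print(keys)
```

Requiring the trailing colon keeps the pattern anchored to keys, so it would not pick up quoted string values if the data ever contained them.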
