How to merge information in Pandas?

cgvd09ve · asked 6 months ago · in: Other

Here is a sample of the dataset I need to work with (a CSV delimited with ';'):

2018-01-01 00:06:06;read;country_2;2458151267;Reddit;Europe
2018-01-01 00:06:15;read;country_6;2458151268;AdWords;Europe
2018-01-01 00:07:21;read;country_7;2458151269;AdWords;North America
2018-01-01 00:07:29;read;country_5;2458151270;Reddit;North America
2018-01-01 00:07:41;subscribe;2458151268
2018-01-01 00:07:44;subscribe;2458151267

I can see on line 2 that the customer with ID = 2458151268.0 read about my product and then subscribed to it (line 5). Since line 5 carries very little information on its own, how can I find out where my subscribers come from (country, source and continent)? And how do I read this CSV correctly in pandas?
I have 7,000 subscribers to work through, so if you have an idea or can point me in the right direction, thank you very much!
I've tried a bunch of groupby() calls without success!
When I use

data = pd.read_csv('example.csv', delimiter = ';')


my dataset does not come out right, because the "subscribe" rows do not carry all the information that the "read" rows do!
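
For reference, this is the same trick the answer below relies on: if you pass explicit column names to read_csv, pandas pads the shorter "subscribe" rows with NaN instead of misaligning them against an inferred header. A minimal sketch, assuming the sample above is saved as example.csv and using column names that are my own guesses:

import pandas as pd

cols = ['TIME', 'Event', 'Country', 'ID', 'Source', 'Continent']
data = pd.read_csv('example.csv', sep=';', header=None, names=cols)

# the "subscribe" rows get NaN in the columns they lack; note that
# their ID value lands in the third field, i.e. the 'Country' column
print(data)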

hxzsmxv2 · answer 1

One possible approach is to first clean the data by copying the value that ends up in the Country column of the subscribe rows into ID, then merge the non-read events with their corresponding read events and fill the NaN values from those. For example:

from io import StringIO

import numpy as np
import pandas as pd

# Sample data from question
text = '''2018-01-01 00:06:06;read;country_2;2458151267;Reddit;Europe
2018-01-01 00:06:15;read;country_6;2458151268;AdWords;Europe
2018-01-01 00:07:21;read;country_7;2458151269;AdWords;North America
2018-01-01 00:07:29;read;country_5;2458151270;Reddit;North America
2018-01-01 00:07:41;subscribe;2458151268
2018-01-01 00:07:44;subscribe;2458151267'''

df = pd.read_csv(StringIO(text), sep=';', header=None, names=['TIME', 'Event', 'Country', 'ID', 'Source', 'Continent'])

# rows with no id
noid = df['ID'].isna()

# for those rows the ID was parsed into the `Country` column; copy it into `ID`
df['ID'] = df['ID'].fillna(df['Country']).astype('int64')

# set the corresponding country values to `NaN`
df.loc[noid, 'Country'] = np.nan

Output at this stage:

TIME      Event    Country          ID   Source      Continent
0  2018-01-01 00:06:06       read  country_2  2458151267   Reddit         Europe
1  2018-01-01 00:06:15       read  country_6  2458151268  AdWords         Europe
2  2018-01-01 00:07:21       read  country_7  2458151269  AdWords  North America
3  2018-01-01 00:07:29       read  country_5  2458151270   Reddit  North America
4  2018-01-01 00:07:41  subscribe        NaN  2458151268      NaN            NaN
5  2018-01-01 00:07:44  subscribe        NaN  2458151267      NaN            NaN
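
As a quick optional check (not part of the original answer), you can verify at this point that every row now carries an ID and that every subscriber also appears in a read row, so the merge below will find a match for each of them:

# optional sanity checks: every row now has an ID, and every
# subscribe ID also shows up in some read row
assert df['ID'].notna().all()
read_ids = set(df.loc[df['Event'] == 'read', 'ID'])
assert df.loc[df['Event'] == 'subscribe', 'ID'].isin(read_ids).all()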


Now we merge the events that lack this data with the events that have it, and fill the NaN values from them:

gotdata = df['Country'].notna()
out = df[~gotdata].merge(df[gotdata], on='ID', suffixes=['','_y'])
out = (out
    .fillna({'Country' : out['Country_y'], 'Source' : out['Source_y'], 'Continent' : out['Continent_y'] })
    .drop(columns=[f'{col}_y' for col in df.columns if col != 'ID'])
)


Now we have:

TIME      Event    Country          ID   Source Continent
0  2018-01-01 00:07:41  subscribe  country_6  2458151268  AdWords    Europe
1  2018-01-01 00:07:44  subscribe  country_2  2458151267   Reddit    Europe


Now we add these rows back to the rows that already had the data:

out = pd.concat([df[gotdata], out]).sort_values('TIME').reset_index(drop=True)


Output:

TIME      Event    Country          ID   Source      Continent
0  2018-01-01 00:06:06       read  country_2  2458151267   Reddit         Europe
1  2018-01-01 00:06:15       read  country_6  2458151268  AdWords         Europe
2  2018-01-01 00:07:21       read  country_7  2458151269  AdWords  North America
3  2018-01-01 00:07:29       read  country_5  2458151270   Reddit  North America
4  2018-01-01 00:07:41  subscribe  country_6  2458151268  AdWords         Europe
5  2018-01-01 00:07:44  subscribe  country_2  2458151267   Reddit         Europe
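
Since the question mentions trying groupby(), the same fill can also be written more compactly on the cleaned df from the first step (i.e. right after the ID fix, before any merging). This is only a sketch: it relies on the rows already being in time order, so that within each ID group the read row precedes the subscribe row and a forward fill copies its details down:

# alternative: forward-fill country/source/continent within each ID group
cols = ['Country', 'Source', 'Continent']
df[cols] = df.groupby('ID')[cols].ffill()

# then, for example, count where the subscribers come from
print(df[df['Event'] == 'subscribe'].groupby('Country').size())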
