How to merge information in Pandas?

cgvd09ve · asked 6 months ago · in: Other

Here is a sample of the dataset I need to work with (a CSV delimited with ';'):

2018-01-01 00:06:06;read;country_2;2458151267;Reddit;Europe
2018-01-01 00:06:15;read;country_6;2458151268;AdWords;Europe
2018-01-01 00:07:21;read;country_7;2458151269;AdWords;North America
2018-01-01 00:07:29;read;country_5;2458151270;Reddit;North America
2018-01-01 00:07:41;subscribe;2458151268
2018-01-01 00:07:44;subscribe;2458151267

I can see on line 2 that the customer with ID = 2458151268.0 read about my product and then subscribed to it (line 5). Since line 5 carries very little information on its own, how can I find out where my subscribers come from (country, source and continent)? And how do I read this CSV correctly in pandas?
I have 7,000 subscribers to work through, so if you have an idea or can point me in the right direction, thank you very much!
I've tried a bunch of groupby() calls without success!
When I use

data = pd.read_csv('example.csv', delimiter = ';')


my dataset does not come out right, because the "subscribe" rows do not carry all the information that the "read" rows do!
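
For reference, this is the same trick the answer below relies on: if you pass explicit column names to read_csv, pandas pads the shorter "subscribe" rows with NaN instead of misaligning them against an inferred header. A minimal sketch, assuming the sample above is saved as example.csv and using column names that are my own guesses:

import pandas as pd

cols = ['TIME', 'Event', 'Country', 'ID', 'Source', 'Continent']
data = pd.read_csv('example.csv', sep=';', header=None, names=cols)

# the "subscribe" rows get NaN in the columns they lack; note that
# their ID value lands in the third field, i.e. the 'Country' column
print(data)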

hxzsmxv2 · answer 1

One possible approach is to first clean the data by copying the value that ends up in the Country column of the subscribe rows into ID, then merge the non-read events with their corresponding read events and fill the NaN values from those. For example:

from io import StringIO

import numpy as np
import pandas as pd

# Sample data from question
text = '''2018-01-01 00:06:06;read;country_2;2458151267;Reddit;Europe
2018-01-01 00:06:15;read;country_6;2458151268;AdWords;Europe
2018-01-01 00:07:21;read;country_7;2458151269;AdWords;North America
2018-01-01 00:07:29;read;country_5;2458151270;Reddit;North America
2018-01-01 00:07:41;subscribe;2458151268
2018-01-01 00:07:44;subscribe;2458151267'''

df = pd.read_csv(StringIO(text), sep=';', header=None, names=['TIME', 'Event', 'Country', 'ID', 'Source', 'Continent'])

# rows with no id
noid = df['ID'].isna()

# for those rows the ID was parsed into the `Country` column; copy it into `ID`
df['ID'] = df['ID'].fillna(df['Country']).astype('int64')

# set the corresponding country values to `NaN`
df.loc[noid, 'Country'] = np.nan

Output at this stage:

TIME      Event    Country          ID   Source      Continent
0  2018-01-01 00:06:06       read  country_2  2458151267   Reddit         Europe
1  2018-01-01 00:06:15       read  country_6  2458151268  AdWords         Europe
2  2018-01-01 00:07:21       read  country_7  2458151269  AdWords  North America
3  2018-01-01 00:07:29       read  country_5  2458151270   Reddit  North America
4  2018-01-01 00:07:41  subscribe        NaN  2458151268      NaN            NaN
5  2018-01-01 00:07:44  subscribe        NaN  2458151267      NaN            NaN
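
As a quick optional check (not part of the original answer), you can verify at this point that every row now carries an ID and that every subscriber also appears in a read row, so the merge below will find a match for each of them:

# optional sanity checks: every row now has an ID, and every
# subscribe ID also shows up in some read row
assert df['ID'].notna().all()
read_ids = set(df.loc[df['Event'] == 'read', 'ID'])
assert df.loc[df['Event'] == 'subscribe', 'ID'].isin(read_ids).all()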


Now we merge the events that lack this data with the events that have it, and fill the NaN values from them:

gotdata = df['Country'].notna()
out = df[~gotdata].merge(df[gotdata], on='ID', suffixes=['','_y'])
out = (out
    .fillna({'Country' : out['Country_y'], 'Source' : out['Source_y'], 'Continent' : out['Continent_y'] })
    .drop(columns=[f'{col}_y' for col in df.columns if col != 'ID'])
)


Now we have:

TIME      Event    Country          ID   Source Continent
0  2018-01-01 00:07:41  subscribe  country_6  2458151268  AdWords    Europe
1  2018-01-01 00:07:44  subscribe  country_2  2458151267   Reddit    Europe


Now we add these rows back to the rows that already had the data:

out = pd.concat([df[gotdata], out]).sort_values('TIME').reset_index(drop=True)


Output:

TIME      Event    Country          ID   Source      Continent
0  2018-01-01 00:06:06       read  country_2  2458151267   Reddit         Europe
1  2018-01-01 00:06:15       read  country_6  2458151268  AdWords         Europe
2  2018-01-01 00:07:21       read  country_7  2458151269  AdWords  North America
3  2018-01-01 00:07:29       read  country_5  2458151270   Reddit  North America
4  2018-01-01 00:07:41  subscribe  country_6  2458151268  AdWords         Europe
5  2018-01-01 00:07:44  subscribe  country_2  2458151267   Reddit         Europe
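
Since the question mentions trying groupby(), the same fill can also be written more compactly on the cleaned df from the first step (i.e. right after the ID fix, before any merging). This is only a sketch: it relies on the rows already being in time order, so that within each ID group the read row precedes the subscribe row and a forward fill copies its details down:

# alternative: forward-fill country/source/continent within each ID group
cols = ['Country', 'Source', 'Continent']
df[cols] = df.groupby('ID')[cols].ffill()

# then, for example, count where the subscribers come from
print(df[df['Event'] == 'subscribe'].groupby('Country').size())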
