pandas: How can I use the line breaks in a column to create a MultiIndex?

6bc51xsx posted 5 months ago in Other

My input is a DataFrame (built from this page: https://www.timeanddate.com/time/zones/):

import pandas as pd

df = pd.DataFrame({'Abbreviation': ['ADT', 'ET', 'GMT'],
 'Time zone name': ["Atlantic Daylight Time\nADST – Atlantic Daylight Saving Time\nAST – Atlantic Summer Time\nHAA – Heure Avancée de l'Atlantique (French)",
  'Eastern Time',
  'Greenwich Mean Time\nUTC – Coordinated Universal Time\nGT – Greenwich Time']})

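| Abbreviation | Time zone name |
| -- | -- |
| ADT | Atlantic Daylight Time<br>ADST – Atlantic Daylight Saving Time<br>AST – Atlantic Summer Time<br>HAA – Heure Avancée de l'Atlantique (French) |
| ET | Eastern Time |
| GMT | Greenwich Mean Time<br>UTC – Coordinated Universal Time<br>GT – Greenwich Time |

Some time zones have one or more equivalents. For example, GMT has two. Others, such as ET, have none.
I am trying to extract the equivalent time zones and turn them into a MultiIndex.
My expected output is this: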

                                 Time zone name                                 Details
Abbreviation Equivalent                                                                
ADT          ADST        Atlantic Daylight Time           Atlantic Daylight Saving Time
             AST         Atlantic Daylight Time                    Atlantic Summer Time
             HAA         Atlantic Daylight Time  Heure Avancée de l'Atlantique (French)
ET           NaN                   Eastern Time                                     NaN
GMT          UTC            Greenwich Mean Time              Coordinated Universal Time
             GT             Greenwich Mean Time                          Greenwich Time


To get there I wrote the code below, but unfortunately the row for the time zone ET goes missing:

first_split = df['Time zone name'].str.split('\n')

second_split = first_split.explode().str.split(' – ', expand=True)

df['Time zone name'] = first_split.str[0]

final = pd.concat([df, second_split], axis=1).rename(columns={0: 'Equivalent', 1: 'Details'})

final = final.dropna(subset='Details')

final = final.set_index(['Abbreviation', 'Equivalent'])


Could you help me fix my code? I am open to any other ideas.

wtzytmuj #1

You can use Index.duplicated to drop only the first row of each repeated index label, while keeping rows whose label occurs just once (such as ET):

final = final[final.index.duplicated() | ~final.index.duplicated(keep=False)]
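As a minimal sketch of how this mask behaves, the toy frame below stands in for the result of `pd.concat` in the question: its index repeats the original row number once per exploded line. The name `toy` and its shortened contents are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated frame: the index repeats the
# original row number, and the first row of each group is the main name.
toy = pd.DataFrame(
    {
        "Equivalent": ["Atlantic Daylight Time", "ADST", "AST",
                       "Eastern Time", "Greenwich Mean Time", "UTC"],
        "Details": [None, "Atlantic Daylight Saving Time", "Atlantic Summer Time",
                    None, None, "Coordinated Universal Time"],
    },
    index=[0, 0, 0, 1, 2, 2],
)

# True for every occurrence of a label except the first one.
later_rows = toy.index.duplicated()
# True only for labels that occur exactly once (the ET-like rows).
single_rows = ~toy.index.duplicated(keep=False)

mask = later_rows | single_rows
kept = toy[mask]
print(kept)
```

The first row of each multi-row group is dropped, and the single row with label 1 survives. In this sketch, as with the frame built in the question, that surviving row still carries the time zone name in `Equivalent` rather than NaN.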

Or combine Series.notna with Index.map and Index.value_counts:

final = final[final['Details'].notna() | 
              (final.index.map(final.index.value_counts()) == 1)]


Use either of these in place of:

final = final.dropna(subset='Details')
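The Series.notna / Index.value_counts variant can be tried on a small frame as well. This is only a sketch: `toy` is a hypothetical frame whose index repeats the original row number, mimicking the data before the `dropna` call.

```python
import pandas as pd

# Hypothetical frame: label 1 occurs once and has no Details,
# labels 0 and 2 each have a main-name row followed by equivalents.
toy = pd.DataFrame(
    {
        "Equivalent": ["Atlantic Daylight Time", "ADST", "AST",
                       "Eastern Time", "Greenwich Mean Time", "UTC"],
        "Details": [None, "Atlantic Daylight Saving Time", "Atlantic Summer Time",
                    None, None, "Coordinated Universal Time"],
    },
    index=[0, 0, 0, 1, 2, 2],
)

# How many rows share each index label, aligned to every row.
rows_per_label = toy.index.map(toy.index.value_counts())

# Keep rows that have Details, plus rows whose label occurs only once.
mask = toy["Details"].notna() | (rows_per_label == 1)
kept = toy[mask]
print(kept)
```

Unlike a plain `dropna(subset='Details')`, the extra `== 1` condition keeps the lone row with label 1 even though its `Details` value is missing.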


Another, complete solution uses DataFrame.explode and Series.where, then forward-fills the missing values of the Time zone name column:

final = (df.assign(**{'Time zone name':df['Time zone name'].str.split('\n')})
           .explode('Time zone name'))

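# Raw string avoids invalid-escape warnings; the lazy (.*?) keeps the
# space before the dash out of the Equivalent value
final[['Equivalent','Details']] = final['Time zone name'].str.extract(r'(.*?)\s*–\s*(.*)')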

final['Time zone name'] = (final['Time zone name'].where(final['Equivalent'].isna())
                                                  .ffill())

final = final[final.index.duplicated() | ~final.index.duplicated(keep=False)]

final = final.set_index(['Abbreviation', 'Equivalent'])

print(final)
                                 Time zone name  \
Abbreviation Equivalent                           
ADT          ADST        Atlantic Daylight Time   
             AST         Atlantic Daylight Time   
             HAA         Atlantic Daylight Time   
ET           NaN                   Eastern Time   
GMT          UTC            Greenwich Mean Time   
             GT             Greenwich Mean Time   

                                                        Details  
Abbreviation Equivalent                                          
ADT          ADST                 Atlantic Daylight Saving Time  
             AST                           Atlantic Summer Time  
             HAA         Heure Avancée de l'Atlantique (French)  
ET           NaN                                            NaN  
GMT          UTC                     Coordinated Universal Time  
             GT                                  Greenwich Time
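The extract / where / ffill step of the complete solution can be looked at in isolation. The sketch below assumes a small Series `names` (a hypothetical stand-in for the exploded Time zone name column) and uses a lazy first group so that `Equivalent` carries no trailing space.

```python
import pandas as pd

# Hypothetical exploded column: a main name, two equivalents, then a
# time zone that has no equivalents at all.
names = pd.Series([
    "Greenwich Mean Time",
    "UTC – Coordinated Universal Time",
    "GT – Greenwich Time",
    "Eastern Time",
])

# Rows without an en dash do not match and get NaN in both columns.
parts = names.str.extract(r"(.*?)\s*–\s*(.*)")
parts.columns = ["Equivalent", "Details"]

# Keep the text only on main-name rows, then carry it down to the
# equivalent rows that follow.
main_name = names.where(parts["Equivalent"].isna()).ffill()

print(pd.concat([main_name.rename("Time zone name"), parts], axis=1))
```

After this step every equivalent row carries the main name of its group, and the main-name rows themselves have NaN in `Equivalent` and `Details`. That is why the index-based filter afterwards can drop the first row of each multi-row group and still keep a row like Eastern Time.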
