python PerformanceWarning: DataFrame is highly fragmented when creating new columns

mtb9vblg  posted 6 months ago  in  Python

I have a df with 1M rows in which one column is always 5000 characters long (A-Z, 0-9).
I parse that long column into 972 columns with:

def parse_long_string(df):
    df['a001'] = df['long_string'].str[0:2]
    df['a002'] = df['long_string'].str[2:4]
    df['a003'] = df['long_string'].str[4:13]
    df['a004'] = df['long_string'].str[13:22]
    df['a005'] = df['long_string'].str[22:31]
    df['a006'] = df['long_string'].str[31:40]
    # ...
    df['a972'] = df['long_string'].str[4994:]
    return df

When I call the function, I get the following warning: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
From reading "PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance", this issue arises when creating more than 100 columns without specifying the data type of the new columns, but in my case every column is automatically a string.
Is there a way around this other than suppressing the warning with the following?
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

l0oc07j2 #1

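I don't know how you ended up with such a setup, but yes, I can trigger the PerformanceWarning on a similar frame and similar code. Here is one possible way to get rid of the warning: build all the new columns first and attach them in a single step with concat.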

slices = {
    "a001": (0, 2),
    "a002": (2, 4),
    "a003": (4, 13),
    "a004": (13, 22),
    "a005": (22, 31),
    "a006": (31, 40),
    # ... add the rest here
    "a972": (4994, None)
}  # I used a dict, but a list works as well

def parse_long_string(df, mapper):

    new_cols = pd.concat(
        {
            col: df["long_string"].str[s:e]
            for col, (s, e) in mapper.items()
        }, axis=1
    )

    return df.join(new_cols)

out = parse_long_string(df, slices)

Output:

print(out)

     long_string a001 a002 a003 a004  ... a968 a969 a970 a971 a972
0     ILR03X...    IL   R0   3X   3D  ...   wm   xC   95   cZ   GT
1     uluF81...    ul   uF   81   Jl  ...   98   RE   80   wc   Qk
2     NLRCIh...    NL   RC   Ih   t4  ...   Xk   os   KL   Ge   lp
3     ScrgOj...    Sc   rg   Oj   GS  ...   nM   8T   gy   Ju   8z
4     saWtdD...    sa   Wt   dD   zN  ...   cf   o2   xX   hM   ze
...         ...   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
9995  4FxlzY...    4F   xl   zY   6b  ...   fi   Mb   V9   Vf   bK
9996  hsjUFa...    hs   jU   Fa   fL  ...   Io   ka   SJ   73   hM
9997  Sr4zFU...    Sr   4z   FU   3c  ...   yb   6a   AF   lv   P4
9998  q4eon1...    q4   eo   n1   Kg  ...   9g   u1   dq   sj   Wa
9999  5UxVXL...    5U   xV   XL   f2  ...   zC   6F   7T   kE   kt

[10000 rows x 973 columns]
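
To check that the concat version really avoids the warning, one option is to record the warnings each approach emits. The following is only a sketch under assumptions: the small frame, the 150 two-character slices, and the helper names parse_by_assignment, parse_by_concat and count_performance_warnings are all made up for illustration and are not part of the original code.

```python
import warnings

import pandas as pd

# Hypothetical small frame: 50 rows holding a 400-character string.
df = pd.DataFrame({"long_string": ["AB" * 200] * 50})
# 150 illustrative two-character slices named a001 ... a150.
slices = {f"a{i + 1:03d}": (i * 2, (i + 1) * 2) for i in range(150)}


def parse_by_assignment(frame, mapper):
    """Add the columns one assignment at a time, as in the question."""
    frame = frame.copy()
    for col, (s, e) in mapper.items():
        frame[col] = frame["long_string"].str[s:e]
    return frame


def parse_by_concat(frame, mapper):
    """Build every new column first, then attach them with one join."""
    new_cols = pd.concat(
        {col: frame["long_string"].str[s:e] for col, (s, e) in mapper.items()},
        axis=1,
    )
    return frame.join(new_cols)


def count_performance_warnings(func, *args):
    """Run func(*args) and count the PerformanceWarnings it emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = func(*args)
    n = sum(issubclass(w.category, pd.errors.PerformanceWarning) for w in caught)
    return result, n


out_assign, n_assign = count_performance_warnings(parse_by_assignment, df, slices)
out_concat, n_concat = count_performance_warnings(parse_by_concat, df, slices)

print("warnings from assignment loop:", n_assign)
print("warnings from concat + join:", n_concat)
```

Whether the assignment loop warns can depend on the pandas version and on the dtype the string columns end up with, so the first count may differ between environments. The concat version should report zero, and both functions should return the same frame.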


Input used:

import numpy as np
import pandas as pd
import string

np.random.seed(0)

df = pd.DataFrame({
    "long_string": ["".join(np.random.choice(
        [*string.printable[:62]], size=5000)) for _ in range(10000)]
})

slices = {f"a{i+1:03d}": (i*2, (i+1)*2) for i in range(972)}
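
If the layout is easier to write down as a list of field widths than as explicit (start, stop) pairs, the mapper can be derived from that list. This is a minimal sketch, assuming the first six widths implied by the question (2, 2, 9, 9, 9, 9); the helper name slices_from_widths and the sample string are hypothetical, and the remaining widths would have to come from the real layout.

```python
from itertools import accumulate

import pandas as pd

# Widths of the first six fields only; the full layout is not reproduced here.
widths = [2, 2, 9, 9, 9, 9]


def slices_from_widths(widths, prefix="a"):
    """Turn consecutive field widths into {name: (start, stop)} pairs."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return {
        f"{prefix}{i:03d}": (s, e)
        for i, (s, e) in enumerate(zip(starts, ends), start=1)
    }


field_slices = slices_from_widths(widths)

# Hypothetical 40-character sample row to exercise the slices.
sample = pd.DataFrame({"long_string": ["0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ0123"]})

new_cols = pd.concat(
    {col: sample["long_string"].str[s:e] for col, (s, e) in field_slices.items()},
    axis=1,
)
parsed = sample.join(new_cols)
print(field_slices)
print(parsed.iloc[0].to_dict())
```

The resulting dict has the same shape as the slices mapper above, so it can be passed to parse_long_string unchanged.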
