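pandas: Most efficient way to replace elements based on the first 3 characters of every original element in an extra-large DataFrame across over 20,000 selected columns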

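Posted by sgtfey8w 6 months ago in Other

I am trying to replace whole elements of a large DataFrame (where applicable, based on a dictionary) according to the first 3 characters of every original element in specific columns. This process has to be repeated many times. With over 20,000 selected columns, the for loop below is very slow.
Note that all of the elements are strings.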

d = {'0/0': 0, '0/1': 1, '1/0': 1, '1/1': 2, './.': 3}
cols = list(set(merged.columns) - set(["subject", "Group"]))
for col in cols:
    merged[col] = merged[col].str[:3].replace(d)

I tried using a lambda function (see below), but this is also slow. I believe it is the apply call that slows things down. (Note: using applymap is also slow.)

d = {'0/0': 0, '0/1': 1, '1/0': 1, '1/1': 2, "./.": 3}

cols = list(set(merged.columns) - set(["subject", "Group"]))

merged[cols] = merged[cols].apply(lambda x: x.str[:3].replace(d))

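I am looking for a more efficient approach, such as vectorization, but have not been able to work out how to proceed.
An example of the data is shown below (note that the real data is far larger, and the strings in each cell are much longer).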

data = {
    'Sample1': ['0/0:0,1:33', '0/1:2,3:32', '1/0:4,5', '1/1:6,7', './.:8,9'],
    'Sample2': ['0/0:10,11', '0/1:12,13', '1/0:14,15', '1/1:16,17', './.:18,19'],
    'Sample3': ['0/0:20,21', '0/1:22,23', '1/0:24,25:23', '1/1:26,27', './.:28,29'],
}

df = pd.DataFrame(data)

Update: a data sample of representative size

import numpy as np
import pandas as pd
sample = ['0/0:0,1:33', '0/1:2,3:32', '1/0:4,5', '1/1:6,7', './.:8,9', '0/0:10,11', 
          '0/1:12,13', '1/0:14,15', '1/1:16,17', './.:18,19', '0/4:20,21', 
          '0/1:22,23', '1/0:24,25:23', '1/1:26,27', './.:28,29']
df = pd.DataFrame(np.random.choice(sample, (2000, 20000)))

pandas

Source: https://stackoverflow.com/questions/77631627/most-efficient-way-to-replace-elements-based-on-first-3-characters-of-every-orig

5 Answers

jpfvwuh41#

Example

Let's build a 2000 x 20000 sample:

import numpy as np
import pandas as pd
sample = ['0/0:0,1:33', '0/1:2,3:32', '1/0:4,5', '1/1:6,7', './.:8,9', '0/0:10,11', 
          '0/1:12,13', '1/0:14,15', '1/1:16,17', './.:18,19', '0/0:20,21', 
          '0/1:22,23', '1/0:24,25:23', '1/1:26,27', './.:28,29']
df = pd.DataFrame(np.random.choice(sample, (2000, 20000)))

Code

Create the mapper:

m = {'0/0': 0, '0/1': 1, '1/0': 1,  '1/1': 2, "./.": 3}

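I don't think there is much difference between using a for loop and using apply, since both are loops.
However, map is a bit faster than replace (note that unmapped values come back as NaN), and apply with axis=1 is more efficient because there are more columns than rows.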
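To make the difference between replace and map concrete, here is a minimal sketch on a three-element Series. The Series and the deliberately unmatched prefix '0/4' are made up for illustration; only the mapper comes from the answer.

```python
import pandas as pd

m = {'0/0': 0, '0/1': 1, '1/0': 1, '1/1': 2, './.': 3}

# Illustrative data: '0/4' is deliberately absent from the mapper.
s = pd.Series(['0/0:0,1:33', '0/4:20,21', './.:8,9'], dtype=object)
prefix = s.str[:3]

replaced = prefix.replace(m)  # an unmatched prefix is kept as the string '0/4'
mapped = prefix.map(m)        # an unmatched prefix becomes NaN

print(replaced.tolist())
print(mapped.tolist())
```

So with map you can spot unmatched cells with isna(), while with replace they stay in the column as strings next to the integers.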

replace (your code)

%timeit df.apply(lambda x: x.str[:3].replace(m))

Result:

1min 9s ± 5.81 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

map with axis=0 (default)

%timeit df.apply(lambda x: x.str[:3].map(m))

Result:

52.9 s ± 1.3 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

map with axis=1

%timeit df.apply(lambda x: x.str[:3].map(m), axis=1)

Result:

18.3 s ± 1.54 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

It is possible to get slightly more efficient results.

c86crjj02#

How about this, using numpy's <U3 dtype to quickly extract the first three characters? It takes 6.22 s on a 2000 x 20_000 frame (AWS r6id.2xlarge).

def replace_match(df, d, n=3, default=-1):
    codes, uniques = pd.factorize(df.to_numpy().ravel().astype(f'<U{n}'))
    new = np.array([d.get(e, default) for e in uniques])
    return pd.DataFrame(
        new[codes].reshape(df.shape),
        index=df.index, columns=df.columns)
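As a rough walk-through of what each step of replace_match produces, here is the same logic unrolled on a tiny 2 x 2 frame. The frame `small` and the intermediate names `prefixes` and `lookup` are only for illustration.

```python
import numpy as np
import pandas as pd

d = {'0/0': 0, '0/1': 1, '1/0': 1, '1/1': 2, './.': 3}

# Illustrative frame; '0/4' has no entry in d.
small = pd.DataFrame({
    'Sample1': ['0/0:0,1:33', '0/4:2,3:32'],
    'Sample2': ['1/1:16,17', './.:18,19'],
})

# Casting to a fixed-width unicode dtype truncates every string to 3 characters.
prefixes = small.to_numpy().ravel().astype('<U3')

# factorize returns one integer code per cell plus the distinct prefixes,
# so the dict lookup runs once per distinct prefix rather than once per cell.
codes, uniques = pd.factorize(prefixes)
lookup = np.array([d.get(u, -1) for u in uniques])

result = pd.DataFrame(lookup[codes].reshape(small.shape),
                      index=small.index, columns=small.columns)
print(result)
```

The per-cell work is then just integer indexing (`lookup[codes]`), which is presumably where the speed-up over the apply-based versions comes from.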

Test, with some cells deliberately unmatched (the output for those will be default):

sample = ['0/0:0,1:33', '0/4:2,3:32', '1/0:4,5', '1/1:6,7', './.:8,9', '0/0:10,11', 
          '0/1:12,13', '1/0:14,15', '1/1:16,17', './.:18,19', '0/0:20,21', 
          '0/1:22,23', '1/0:24,25:23', '1/1:26,27', './.:28,29']
df = pd.DataFrame(np.random.choice(sample, (2000, 20_000)))

d = {'0/0': 0, '0/1': 1, '1/0': 1,  '1/1': 2, './.': 3}

%timeit replace_match(df, d)
6.22 s ± 46.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For comparison:
PaulS's solution:

default = -1

%timeit df.applymap(lambda x: d.get(x[:3], default))
14 s ± 44.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

One Panda Kim's solution:

%timeit df.apply(lambda x: x.str[:3].map(d), axis=1)
8.06 s ± 58.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that this fills with NaN wherever d has no match, which makes the frame dtype=float64.
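If you want integers back after the map-based version, one option is to fill the NaN cells with a sentinel and cast. This is a sketch only: the sentinel -1 mirrors the default used in replace_match, and int8 is my assumption that all codes fit in one byte.

```python
import pandas as pd

d = {'0/0': 0, '0/1': 1, '1/0': 1, '1/1': 2, './.': 3}

# Illustrative frame; '0/4' has no entry in d.
small = pd.DataFrame({
    'Sample1': ['0/0:0,1', '0/4:2,3'],
    'Sample2': ['1/1:6,7', './.:8,9'],
}, dtype=object)

mapped = small.apply(lambda row: row.str[:3].map(d), axis=1)

# Replace NaN with a sentinel, then cast to a compact integer dtype.
as_int = mapped.fillna(-1).astype('int8')
print(as_int)
print(as_int.dtypes)
```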


n1bvdmb63#

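Solution 1 (fastest)

The fastest solution found so far (it takes only 3.27 s on a (2000, 20000) dataframe). It is based on taking the first 3 bytes of each dataframe element (S3 dtype) and then using nested np.where calls that compare bytes: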

a = df.astype('S3').to_numpy()

pd.DataFrame(
    np.where(a == b'0/1', 1, 
             np.where(a == b'0/0', 0, 
                      np.where(a == b'1/1', 2,
                               np.where(a == b'1/0', 1, 3)))))
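One thing to watch with the nested np.where: the innermost else branch is 3, so any prefix that is not one of the four listed values ends up as 3, not only './.'. The sketch below shows this on a tiny frame and, as a possible alternative that I have not benchmarked, uses np.select with an explicit default. The frame `small` and the default -1 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame; '0/4' is not one of the expected prefixes.
small = pd.DataFrame({
    'Sample1': ['0/0:0,1:33', '0/4:2,3:32'],
    'Sample2': ['1/0:14,15', './.:18,19'],
})

# First 3 bytes of every cell (works for ASCII-only strings).
a = small.to_numpy().astype('S3')

# Nested np.where as in the answer: '0/4' silently falls through to 3.
nested = np.where(a == b'0/1', 1,
                  np.where(a == b'0/0', 0,
                           np.where(a == b'1/1', 2,
                                    np.where(a == b'1/0', 1, 3))))

# np.select with one condition per known prefix and an explicit default.
conditions = [a == b'0/0', a == b'0/1', a == b'1/0', a == b'1/1', a == b'./.']
choices = [0, 1, 1, 2, 3]
selected = np.select(conditions, choices, default=-1)

print(nested)
print(selected)
```

If your data is guaranteed to contain only the five known prefixes, the two give the same result and the nested version from the answer is fine.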

Performance:

This new solution:

3.27 s ± 30.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2

Another possible solution, based on numpy.vectorize:

pd.DataFrame(np.vectorize(lambda x: d[x[:3]])(df.values))
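Note that d[x[:3]] raises a KeyError as soon as one cell has a prefix that is not in d. A variant that tolerates such cells could look like the sketch below; the default of -1 and the int8 output type are my assumptions, not part of the answer.

```python
import numpy as np
import pandas as pd

d = {'0/0': 0, '0/1': 1, '1/0': 1, '1/1': 2, './.': 3}

# Illustrative frame; '0/4' has no entry in d.
small = pd.DataFrame({
    'Sample1': ['0/0:0,1:33', '0/4:2,3:32'],
    'Sample2': ['1/1:16,17', './.:18,19'],
})

# d.get supplies a default for unknown prefixes; otypes fixes the output dtype
# so numpy does not have to infer it from the first element.
lookup = np.vectorize(lambda x: d.get(x[:3], -1), otypes=[np.int8])

result = pd.DataFrame(lookup(small.to_numpy()),
                      index=small.index, columns=small.columns)
print(result)
```

Keep in mind that np.vectorize is a convenience wrapper around a Python-level loop, so it saves pandas overhead but is not vectorized in the numpy sense.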

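Performance:
Before my Solution 1, the fastest known solution was @PierreD's. Compared with it, my solution (on my machine, on a (2000, 20000) dataframe):

Mine:

7.31 s ± 70.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@PierreD's solution:

9.55 s ± 66.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 3

Alternatively: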

df.map(lambda x: d[x[:3]])

Output:

   Sample1  Sample2  Sample3
0        0        0        0
1        1        1        1
2        1        1        1
3        2        2        2
4        3        3        3


mm5n2pyu4#

Updated based on the example:

df = df.stack().str.extract(r'([^:]+)', expand=False).replace(d).unstack()

print(df)

 Index     Sample1  Sample2  Sample3
    0        0        0        0
    1        1        1        1
    2        1        1        1
    3        2        2        2
    4        3        3        3
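Be aware that the regex ([^:]+) captures everything before the first colon, which is not always the same as the first 3 characters. A small sketch of the difference, where '0/10:5,6' is a made-up value that does not appear in the question's data:

```python
import pandas as pd

# Illustrative values; the second one has a 4-character field before the colon.
s = pd.Series(['0/1:12,13', '0/10:5,6', '1/1'], dtype=object)

before_colon = s.str.extract(r'([^:]+)', expand=False)
first_three = s.str[:3]

print(before_colon.tolist())
print(first_three.tolist())
```

For the sample data in the question both give the same keys, so this only matters if the field before the colon can be longer or shorter than 3 characters.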


jum4pzuy5#

I'll add a benchmark that makes it easy to test all the functions in the same call (and on a frame of the same size):

from functools import partial

import numpy as np
import pandas as pd
import timeit
from statistics import mean, stdev

def build_dataframe(size):
    sample = [
        "0/0:0,1:33",
        "0/1:2,3:32",
        "1/0:4,5",
        "1/1:6,7",
        "./.:8,9",
        "0/0:10,11",
        "0/1:12,13",
        "1/0:14,15",
        "1/1:16,17",
        "./.:18,19",
        "0/0:20,21",
        "0/1:22,23",
        "1/0:24,25:23",
        "1/1:26,27",
        "./.:28,29",
    ]
    return pd.DataFrame(np.random.choice(sample, (size, size)))

def str_method(data):
    data = data.copy()
    d = {"0/0": 0, "0/1": 1, "1/0": 1, "1/1": 2, "./.": 3}
    for col in data.columns:
        data[col] = data[col].str[:3].replace(d)

def apply_method(data):
    data = data.copy()
    d = {"0/0": 0, "0/1": 1, "1/0": 1, "1/1": 2, "./.": 3}
    data[data.columns] = data[data.columns].apply(lambda x: x.str[:3].replace(d))

def stack_method(data):
    data = data.copy()
    d = {"0/0": 0, "0/1": 1, "1/0": 1, "1/1": 2, "./.": 3}
    data.stack().str.extract(r"([^:]+)", expand=False).replace(d).unstack()

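def replace_match(data, n=3, inplace=False):
    d = {"0/0": 0, "0/1": 1, "1/0": 1, "1/1": 2, "./.": 3}
    # Truncate every cell to its first n characters
    pre = data.to_numpy().astype(f'U{n}')
    # Keep cells whose prefix is not in d; replace the others with the mapped value.
    # Use data's own index/columns rather than the global df.
    return data.where(
        np.isin(pre, np.array(list(d)), invert=True),
        pd.DataFrame(pre, index=data.index, columns=data.columns).replace(d),
        inplace=inplace,
    )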

def map_method(data):
    data = data.copy()
    d = {"0/0": 0, "0/1": 1, "1/0": 1, "1/1": 2, "./.": 3}
    data.map(lambda x: d[x[:3]])

turn_number = 100
df = build_dataframe(1000)
for one_method in [str_method, apply_method, stack_method, map_method, replace_match]:
    function_to_test = partial(one_method, data=df)
    durations = timeit.Timer("function_to_test()", globals=globals()).repeat(
        repeat=turn_number, number=1
    )
    print(f"{one_method.__name__}, mean {mean(durations)*10**3} ms, deviation {stdev(durations)*10**3} ms")
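The benchmark functions above discard their results, so it only measures time. Before trusting the timings it may be worth checking on a small frame that two candidate methods really produce the same values. A minimal sketch, where the 50 x 40 size, the seed, and the names `by_map` and `by_factorize` are arbitrary choices of mine:

```python
import numpy as np
import pandas as pd

d = {"0/0": 0, "0/1": 1, "1/0": 1, "1/1": 2, "./.": 3}
sample = ["0/0:0,1:33", "0/1:2,3:32", "1/0:4,5", "1/1:6,7", "./.:8,9"]

# Small reproducible frame of strings.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.choice(sample, (50, 40))).astype(object)

# Column-wise slice + map.
by_map = frame.apply(lambda col: col.str[:3].map(d))

# Truncate with a fixed-width dtype, factorize, then look up each distinct prefix once.
codes, uniques = pd.factorize(frame.to_numpy().ravel().astype("<U3"))
lookup = np.array([d[u] for u in uniques])
by_factorize = pd.DataFrame(lookup[codes].reshape(frame.shape),
                            index=frame.index, columns=frame.columns)

same = bool((by_map.to_numpy() == by_factorize.to_numpy()).all())
print(same)
```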

If you have a multi-core machine, you should take a look at dask. You could do something like this:

import dask.dataframe as dd
data = dd. # don't know how you build your dataframe
...
# with one of the methods.
data2 = data.map(lambda x: d[x[:3]])
data2.compute()

Here are my results with the different methods:

str_method, mean 1940.1417385198874 ms, deviation 30.65625035255771 ms
apply_method, mean 2049.7742383097648 ms, deviation 66.40606504300902 ms
stack_method, mean 1921.710312769137 ms, deviation 27.082265869582276 ms
map_method, mean 900.7486290999805 ms, deviation 13.709531124297671 ms
replace_match, mean 1385.514028930047 ms, deviation 20.934190486529495 ms

If you use Pierre D's solution, more work will be needed to adapt it to dask.
