NumPy array from .txt file too large to load into memory

Posted by ercv8c1e on 2023-04-06 in Other


I have a large .txt file, about 6 GB, structured like this:

0.1 0.4 0.9 1.2 0.2 3.8
2.8 4.2 0.3 1.9
0.3 5.8 9.6 0.05 2.2

I turn it into a NumPy array with .loadtxt(file.txt), which gives:

[[0.1 0.4 0.9 1.2 0.2 3.8], [2.8 4.2 0.3 1.9], [0.3 5.8 9.6 0.05 2.2]]

The file is now too large to load into memory and I get a MemoryError, so I have been trying to load it in chunks with this function:

def loadFile(filePath):
    chunk_size = 10000
    data = []
    with open(filePath, 'r') as f:
        while True:
            chunk = np.genfromtxt(f, max_rows=chunk_size)
            if len(chunk) < chunk_size:
                # Last chunk
                data.append(chunk)
                break
            data.append(chunk)
            # Move pointer to start of next chunk
            f.seek(chunk_size-len(chunk), 1)

    # Joining chunks into a single array
    data = np.concatenate(data)
    return data

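The idea was that only a single chunk would need to be in memory at any one time, but this still exhausts my memory and crashes my computer.
What am I missing? Splitting the file into several files is really not an option.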

numpy

Source: https://stackoverflow.com/questions/75893145/numpy-array-from-txt-file-too-large-to-load-into-memory

2 Answers


js4nwp541#

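When you use chunking to break up a large data file, you should load one chunk into memory, process it, and then free it before loading the next. What your code does instead is read the data in chunks, append every chunk to the data list, and then concatenate all the chunks into a single array. That is equivalent to loading the whole file at once, just with extra steps, so peak memory use is no lower. If you really need the whole dataset in memory for your processing, you may need to upgrade your hardware or look for alternative modules. If you don't need the whole dataset at once, however, you can do the processing inside the while loop that reads the chunks: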

with open(filePath, 'r') as f:
    while True:
        chunk = np.genfromtxt(f, max_rows=chunk_size)
        # Do some processing here; keep only the result, not the chunk
        # [insert code]
        if len(chunk) < chunk_size:
            # Last chunk: nothing left to read
            break
        # No seek is needed: the file handle continues from the next
        # unread line on the following genfromtxt call
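
As a rough illustration of the load, process, free pattern described above, here is a minimal sketch that computes the mean of every value in the file while holding only one chunk at a time. The helper names `iter_chunks` and `running_mean` are hypothetical and not part of NumPy. The sketch also assumes rows may have different lengths, as in the sample data in the question, so it parses each line on its own instead of calling `np.genfromtxt`.

```python
import itertools

import numpy as np


def iter_chunks(path, chunk_size=10000):
    """Yield lists of 1-D float arrays, at most chunk_size rows per list.

    Hypothetical helper: rows are parsed one by one so that rows of
    different lengths are accepted.
    """
    with open(path, "r") as f:
        while True:
            # islice consumes at most chunk_size lines from the open file
            lines = list(itertools.islice(f, chunk_size))
            if not lines:
                break  # end of file
            yield [
                np.array([float(tok) for tok in line.split()])
                for line in lines
                if line.strip()  # skip blank lines
            ]


def running_mean(path, chunk_size=10000):
    """Mean of all values in the file, keeping only two scalars between chunks."""
    total = 0.0
    count = 0
    for rows in iter_chunks(path, chunk_size):
        for row in rows:
            total += row.sum()
            count += row.size
        # `rows` is rebound on the next iteration, so the previous chunk
        # becomes unreachable and its memory can be reclaimed
    return total / count if count else float("nan")
```

Only the accumulated `total` and `count` survive from one chunk to the next, which is what keeps memory use bounded. Any reduction that can be updated incrementally (sums, minima, histograms) fits the same loop.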

2023-04-06

tuwxkamq2#

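If possible, you can try specifying a smaller data type. That alone may not solve the problem, but it should help.
Python float values are usually 64-bit, corresponding to a C double. In NumPy you can use numpy.single (32-bit) or numpy.half (16-bit) instead.
To specify the data type, pass dtype as an argument to loadtxt. For example: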

data = np.loadtxt(fileName, dtype=np.half)
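
To make the saving concrete, the following sketch compares the memory footprint of the same number of values stored as 64-bit, 32-bit and 16-bit floats. The function name `footprint` and the element count are made up for illustration. Note that the smaller types also lose precision and range, which the `np.finfo` calls report.

```python
import numpy as np


def footprint(n_values, dtype):
    """Bytes needed to store n_values elements of the given dtype."""
    return n_values * np.dtype(dtype).itemsize


n_values = 1_000_000  # hypothetical number of values in the file

for dtype in (np.float64, np.single, np.half):
    info = np.finfo(dtype)
    print(
        np.dtype(dtype).name,
        footprint(n_values, dtype), "bytes,",
        "approx. decimal digits:", info.precision,
        "largest value:", info.max,
    )
```

Halving the element size halves the array's memory, so np.half uses a quarter of what the default 64-bit float needs. Whether its reduced precision and range are acceptable depends on the data.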

References (from the NumPy documentation):

loadtxt: https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html
NumPy data types: https://numpy.org/doc/stable/user/basics.types.html

2023-04-06
