Retrieve a huge data table from MySQL within a Jupyter notebook

Posted by vxqlmq5t on 2021-06-25 in MySQL


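I'm currently trying to fetch 100 million rows from a MySQL table using a Jupyter notebook. I have made several attempts with pymysql.cursors to open the MySQL connection. I tried to use batches to speed up the query selection process, because selecting all the rows at once is too heavy an operation. Here is my test: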

import pymysql.cursors

# Connect to the database

connection = pymysql.connect(host='XXX',
                             user='XXX',
                             password='XXX',
                             db='XXX',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

try: 
    with connection.cursor() as cursor:

        print(cursor.execute("SELECT count(*) FROM `table`"))
        count = cursor.fetchone()[0]

        batch_size = 50

        for offset in xrange(0, count, batch_size):
            cursor.execute(
                "SELECT * FROM `table` LIMIT %s OFFSET %s", 
                (batch_size, offset))
            for row in cursor:
                print(row)
finally:
    connection.close()

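At the moment the test should just print out each row (which is not very useful by itself), but in my opinion the best solution would be to store everything in a pandas DataFrame.
Unfortunately, when I run it, I get the following error: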
KeyError                                  Traceback (most recent call last)

      print(cursor.execute("SELECT count(*) FROM `table`"))
----> count = cursor.fetchone()[0]
      batch_size = 50

KeyError: 0

Does anyone have an idea what the problem could be? Maybe using chunksize would be a better idea? Thanks in advance!
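
A likely explanation for the KeyError, sketched under assumptions: the connection is created with `cursorclass=pymysql.cursors.DictCursor`, so each fetched row is a dict keyed by column name rather than a tuple, and `row[0]` looks up the key `0`, which does not exist. The helper `scalar_from_row` and the sample rows below are hypothetical and only simulate the two row shapes; no database is involved.

```python
def scalar_from_row(row):
    """Return the only value of a one-column row (dict-style or tuple-style)."""
    if isinstance(row, dict):
        # DictCursor-style row: keyed by column name, so row[0] would raise KeyError
        return next(iter(row.values()))
    # Tuple-style row (default cursor): positional access works
    return row[0]


# Simulated fetchone() results with made-up values
dict_row = {"count(*)": 100}  # shape a DictCursor hands back
tuple_row = (100,)            # shape the default cursor hands back

count_from_dict = scalar_from_row(dict_row)
count_from_tuple = scalar_from_row(tuple_row)
```

In the original code, an alternative would be to alias the column (`SELECT COUNT(*) AS n FROM ...`) and read `cursor.fetchone()['n']`. Note also that `cursor.execute()` returns the number of affected rows, so printing it does not show the count itself.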

UPDATE

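I rewrote the code without the batch size and stored the query result in a DataFrame. It now seems to run, but because the table holds 100 million rows the execution time looks practically "infinite":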

connection = pymysql.connect(user='XXX', password='XXX', database='XXX', host='XXX')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM `table`"

        cursor.execute(query)
        cursor.fetchall()

        df = pd.read_sql(query, connection)
finally:
    connection.close()

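What would be the right way to speed up the process? Maybe passing chunksize = 250 as a parameter? But with that parameter, df is no longer a DataFrame but a generator.
If I print df, the output is: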

<generator object _query_iterator at 0x11358be10>

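How can I get the data in DataFrame format? If I print the result of the fetchall command I can see the correct output of the query, so up to this point everything works as expected.
If I try to call DataFrame() on the result of the fetchall command, I get: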

ValueError: DataFrame constructor not properly called!
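
The exact cause of this ValueError cannot be seen from the snippet, so the following is only a sketch of one way that generally works: pass the rows as a list of tuples and take the column names from `cursor.description` (part of the DB-API that pymysql also implements). An in-memory SQLite table named `demo` stands in for the MySQL table here; both the table and its contents are assumptions made so that the example runs on its own.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL connection (assumption for this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO demo (id, value) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 11)],
)
conn.commit()

cur = conn.execute("SELECT * FROM demo ORDER BY id")
rows = cur.fetchall()

# Column names come from the cursor metadata: first item of each description entry
columns = [entry[0] for entry in cur.description]

# Pass the rows as a list, together with explicit column names
from_rows = pd.DataFrame(list(rows), columns=columns)
conn.close()
```

With a DictCursor the rows are already dicts, so `pd.DataFrame(list(rows))` should be enough without the `columns` argument.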

ANOTHER UPDATE

I was able to output the result by iterating over pd.read_sql like this:

chunks = []
for chunk in pd.read_sql(query, connection, chunksize=250):
    chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)
print(type(result))
# print(result)

In the end I get a single DataFrame called result.
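
Concatenating every chunk still leaves all rows in memory at the end, which may be the limiting factor with 100 million rows. If only an aggregate or a filtered subset is needed, each chunk can be reduced as it arrives and then discarded. The sketch below assumes a hypothetical `demo` table with an `amount` column, again in in-memory SQLite so that it can run without a MySQL server.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL connection (assumption for this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany(
    "INSERT INTO demo (id, amount) VALUES (?, ?)",
    [(i, i * 10) for i in range(1, 11)],
)
conn.commit()

total_rows = 0
total_amount = 0

# With chunksize set, read_sql yields one DataFrame per chunk
for chunk in pd.read_sql("SELECT * FROM demo", conn, chunksize=4):
    # Reduce each chunk right away; nothing is accumulated in a list
    total_rows += len(chunk)
    total_amount += int(chunk["amount"].sum())

conn.close()
```

Only the running totals are kept, so memory use depends on the chunk size rather than on the size of the table.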
Now the questions are:
Is it possible to query all the data without a LIMIT?
What exactly affects how long the process takes?
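
On the second question, one factor that commonly matters with the LIMIT ... OFFSET batching from the first attempt is that the server generally has to step over all skipped rows for every batch, so later batches get slower. A frequently used alternative is keyset pagination, which resumes from the last key seen. The sketch below assumes an indexed, increasing integer `id` column with positive values (not confirmed by the question) and uses in-memory SQLite with a hypothetical `demo` table; the function name `iter_batches` is also made up for illustration.

```python
import sqlite3


def iter_batches(conn, batch_size):
    """Yield lists of rows, paging by primary key instead of OFFSET."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, value FROM demo WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        # Resume after the last id of this batch
        last_id = rows[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO demo (id, value) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 11)],
)
conn.commit()

batches = list(iter_batches(conn, batch_size=4))
batch_sizes = [len(batch) for batch in batches]
all_ids = [row[0] for batch in batches for row in batch]
conn.close()
```

With pymysql the placeholders would be `%s` instead of `?`. As for reading everything in one query without a LIMIT, pymysql also provides an unbuffered cursor class, `pymysql.cursors.SSCursor`, which streams rows from the server instead of loading the whole result set on the client first; whether that is faster for this table would need to be measured.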

mysql python jupyter-notebook pymysql

Source: https://stackoverflow.com/questions/49879205/retrieve-huge-data-table-from-mysql-within-jupyter-notebook