Workarounds for Slow File Writes with DataFrame.to_csv()

Pandas' DataFrame.to_csv() is a very commonly used function for writing in-memory data to disk as CSV, but with larger data it often takes uncomfortably long. The root cause is that to_csv() is not well optimized internally; we can route around the problem by writing the same data through other packages.

Test Environment

Hardware:

  • Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  • 32.0 GB RAM / 512 GB SSD

Software:

  • Windows 10 22H2
  • Python 3.9.13
  • pandas 2.2.2
  • numpy 1.26.4
  • duckdb 1.0.0
  • pyarrow 16.1.0

Test Procedure

First, generate a large DataFrame of random float64 values, write it with to_csv() directly, and record the elapsed time. Then convert the same DataFrame to the corresponding numpy, duckdb, and pyarrow structures (the conversions themselves are fast enough to ignore) and time each CSV write in milliseconds. The results:

DataFrame shape   CSV file size   pandas (ms)   numpy (ms)   duckdb (ms)   pyarrow (ms)
(10000,100)       17 MB                  1503          948          529           218
(50000,100)       87 MB                  7598         4815         2683          1107
(10000,500)       87 MB                  7341         4381         2811          1085

Speedup factors relative to native to_csv():

DataFrame shape   CSV file size   pandas   numpy   duckdb   pyarrow
(10000,100)       17 MB               1x    1.6x     2.8x      6.9x
(50000,100)       87 MB               1x    1.6x     2.8x      6.9x
(10000,500)       87 MB               1x    1.7x     2.6x      6.8x

Note: a dask multi-partition write was also tried, but the output is not a single file, so it is excluded from the tables; its write time was also close to native pandas. A rough sketch follows.
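This is an illustrative reconstruction rather than the measured code; the partition count (npartitions=4) and output paths are assumptions:

# Sketch of the dask multi-partition write (illustrative, not the measured code)
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=4)   # df as built in the benchmark below
ddf.to_csv('c:/temp/dask-*.csv')          # writes one CSV per partition
# dask can merge the partitions into a single file, but that serializes the write:
# ddf.to_csv('c:/temp/dask.csv', single_file=True)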

Conclusions

Across all tested sizes, every package beats native pandas by a fairly stable margin, with pyarrow the clear winner at close to 7x. When write performance matters, convert the DataFrame to a pyarrow Table and write from there.
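As a minimal sketch of that recommendation (the helper name fast_to_csv is our own, not a pandas or pyarrow API):

import pandas as pd
import pyarrow as pa
import pyarrow.csv as pacsv

def fast_to_csv(df: pd.DataFrame, path: str) -> None:
    # The conversion is cheap; the CSV formatting runs in pyarrow's native code.
    # Note: unlike df.to_csv(path), this does not write the index column.
    pacsv.write_csv(pa.Table.from_pandas(df), path)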

Benchmark code:

import pandas as pd
import numpy as np
import time

# Build a DataFrame of random float64 data
np.random.seed(42)
data = np.random.uniform(0, 100, size=(10000, 100))
df = pd.DataFrame(data)

# Write directly with to_csv
# (chunksize, index=False, etc. were tested and barely affect write time)
t0 = time.time() * 1000
df.to_csv('c:/temp/1.csv')
t1 = time.time() * 1000
print(f'pandas cost {t1-t0} ms')

# Convert the DataFrame to a numpy array, then write the CSV
numpy_array = df.values
t0 = time.time() * 1000
np.savetxt('c:/temp/1.csv', numpy_array, delimiter=',', fmt='%s')
t1 = time.time() * 1000
print(f'numpy cost {t1-t0} ms')

# Register the DataFrame as a duckdb in-memory table, then COPY to CSV
import duckdb
con = duckdb.connect(database=':memory:', read_only=False)
con.register('my_table', df)
t0 = time.time() * 1000
con.execute("COPY (SELECT * FROM my_table) TO 'c:/temp/2.csv' WITH (FORMAT CSV)")
t1 = time.time() * 1000
print(f'duckdb cost {t1-t0} ms')

# Convert the DataFrame to a pyarrow Table, then write the CSV
import pyarrow as pa
import pyarrow.csv as csv
table = pa.Table.from_pandas(df)
t0 = time.time() * 1000
csv.write_csv(table, 'c:/temp/1.csv')
t1 = time.time() * 1000
print(f'pyarrow cost {t1-t0} ms')
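As an optional sanity check (an addition of ours, continuing the script above rather than part of the original benchmark), the last file written can be read back and compared against the source values; writers differ in default float formatting, so a tolerance-based comparison is the safe choice:

# Read back the pyarrow output and compare with the original data
df_back = pd.read_csv('c:/temp/1.csv')
assert np.allclose(df.values, df_back.values)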


Apache Beam Installation Failure on the Latest Python, and the Fix

Beam installation guide:
https://beam.apache.org/get-started/quickstart-py/

To install Apache Beam, I first downloaded and installed the latest Python, 3.12.4.


> python --version
Python 3.12.4

Installing beam as documented then failed with wheel build errors:

> pip install apache-beam
...
      INFO:root:copying apache_beam\portability\api\org\apache\beam\model\pipeline\v1\schema_pb2.pyi -> build\lib.win-amd64-cpython-312\apache_beam\portability\api\org\apache\beam\model\pipeline\v1
      INFO:root:copying apache_beam\portability\api\org\apache\beam\model\pipeline\v1\standard_window_fns_pb2.pyi -> build\lib.win-amd64-cpython-312\apache_beam\portability\api\org\apache\beam\model\pipeline\v1
      INFO:root:copying apache_beam\portability\api\standard_coders.yaml -> build\lib.win-amd64-cpython-312\apache_beam\portability\api
      INFO:root:running build_ext
      INFO:root:building 'apache_beam.coders.coder_impl_row_encoders' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for apache-beam
  Building wheel for pyarrow (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for pyarrow (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [299 lines of output]
      <string>:34: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
      WARNING setuptools_scm.pyproject_reading toml section missing 'pyproject.toml does not contain a tool.setuptools_scm section'
      Traceback (most recent call last):
        File "C:\Users\zhanghao\AppData\Local\Temp\pip-build-env-0wvk6red\overlay\Lib\site-packages\setuptools_scm\_integration\pyproject_reading.py", line 36, in read_pyproject
          section = defn.get("tool", {})[tool_name]
                    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
      KeyError: 'setuptools_scm'
      running bdist_wheel
      running build
      running build_py

Following the hint in the error message, I downloaded the latest VC++ Build Tools from https://visualstudio.microsoft.com/zh-hans/visual-cpp-build-tools/, but the error persisted.


A similar question on Stack Overflow offers two fixes. One is to install the full Visual C++ toolchain, but that is a 6 GB+ download and far too heavy. The other is to avoid the newest Python release, because prebuilt wheels for some dependencies have not been published for it yet.

https://stackoverflow.com/questions/64261546/how-to-solve-error-microsoft-visual-c-14-0-or-greater-is-required-when-inst

This error can happen when using the latest version of Python, e.g. 3.12, because the package wheels were only built for earlier versions of Python
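A related tip (an addition here, not from the linked answer): pip can be told to refuse source builds outright, which surfaces the missing-wheel problem as an immediate error instead of a long, failing compile log:

> pip install --only-binary=:all: apache-beam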

After uninstalling Python 3.12.4 and installing Python 3.9.13, the beam installation succeeded.

> python --version
Python 3.9.13

> pip install apache-beam
...
Using legacy 'setup.py install' for crcmod, since package 'wheel' is not installed.
Using legacy 'setup.py install' for dill, since package 'wheel' is not installed.
Using legacy 'setup.py install' for hdfs, since package 'wheel' is not installed.
Using legacy 'setup.py install' for pyjsparser, since package 'wheel' is not installed.
Using legacy 'setup.py install' for docopt, since package 'wheel' is not installed.
Installing collected packages: pytz, pyjsparser, docopt, crcmod, zstandard, urllib3, tzdata, typing-extensions, six, rpds-py, regex, pyparsing, pyarrow-hotfix, protobuf, packaging, orjson, objsize, numpy, jsonpickle, idna, grpcio, fasteners, fastavro, dnspython, dill, cloudpickle, charset-normalizer, certifi, attrs, async-timeout, tzlocal, requests, referencing, redis, python-dateutil, pymongo, pydot, pyarrow, proto-plus, httplib2, jsonschema-specifications, js2py, hdfs, jsonschema, apache-beam
  Running setup.py install for pyjsparser ... done
  Running setup.py install for docopt ... done
  Running setup.py install for crcmod ... done
  Running setup.py install for dill ... done
  Running setup.py install for hdfs ... done
Successfully installed apache-beam-2.56.0 async-timeout-4.0.3 attrs-23.2.0 ...