Workarounds for Slow File Writes with DataFrame.to_csv()

Pandas' DataFrame.to_csv() is a very commonly used function for writing in-memory data to disk as CSV, but with larger data it often takes uncomfortably long. The root cause is that to_csv() is not well optimized internally; we can route around the problem by writing the same data through other packages.

Test Environment

Hardware:

  • Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  • 32.0 GB RAM / 512 GB SSD

Software:

  • Windows 10 22H2
  • Python 3.9.13
  • pandas 2.2.2
  • numpy 1.26.4
  • duckdb 1.0.0
  • pyarrow 16.1.0

Test Procedure

First, generate a large DataFrame of random float64 values, write it with to_csv() directly, and record the elapsed time. Then convert the same DataFrame to the corresponding numpy, duckdb, and pyarrow structures (the conversions themselves are fast enough to ignore) and time each CSV write in milliseconds. The results:

DataFrame shape   CSV file size   pandas (ms)   numpy (ms)   duckdb (ms)   pyarrow (ms)
(10000,100)       17 MB                  1503          948          529           218
(50000,100)       87 MB                  7598         4815         2683          1107
(10000,500)       87 MB                  7341         4381         2811          1085

Speedup factors relative to native to_csv():

DataFrame shape   CSV file size   pandas   numpy   duckdb   pyarrow
(10000,100)       17 MB               1x    1.6x     2.8x      6.9x
(50000,100)       87 MB               1x    1.6x     2.8x      6.9x
(10000,500)       87 MB               1x    1.7x     2.6x      6.8x

Note: a dask multi-partition write was also tried, but the output is not a single file, so it is excluded from the tables; its write time was also close to native pandas. A rough sketch follows.
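This is an illustrative reconstruction rather than the measured code; the partition count (npartitions=4) and output paths are assumptions:

# Sketch of the dask multi-partition write (illustrative, not the measured code)
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=4)   # df as built in the benchmark below
ddf.to_csv('c:/temp/dask-*.csv')          # writes one CSV per partition
# dask can merge the partitions into a single file, but that serializes the write:
# ddf.to_csv('c:/temp/dask.csv', single_file=True)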

Conclusions

Across all tested sizes, every package beats native pandas by a fairly stable margin, with pyarrow the clear winner at close to 7x. When write performance matters, convert the DataFrame to a pyarrow Table and write from there.
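As a minimal sketch of that recommendation (the helper name fast_to_csv is our own, not a pandas or pyarrow API):

import pandas as pd
import pyarrow as pa
import pyarrow.csv as pacsv

def fast_to_csv(df: pd.DataFrame, path: str) -> None:
    # The conversion is cheap; the CSV formatting runs in pyarrow's native code.
    # Note: unlike df.to_csv(path), this does not write the index column.
    pacsv.write_csv(pa.Table.from_pandas(df), path)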

Benchmark code:

import pandas as pd
import numpy as np
import time

# Build a DataFrame of random float64 data
np.random.seed(42)
data = np.random.uniform(0, 100, size=(10000, 100))
df = pd.DataFrame(data)

# Write directly with to_csv
# (chunksize, index=False, etc. were tested and barely affect write time)
t0 = time.time() * 1000
df.to_csv('c:/temp/1.csv')
t1 = time.time() * 1000
print(f'pandas cost {t1-t0} ms')

# Convert the DataFrame to a numpy array, then write the CSV
numpy_array = df.values
t0 = time.time() * 1000
np.savetxt('c:/temp/1.csv', numpy_array, delimiter=',', fmt='%s')
t1 = time.time() * 1000
print(f'numpy cost {t1-t0} ms')

# Register the DataFrame as a duckdb in-memory table, then COPY to CSV
import duckdb
con = duckdb.connect(database=':memory:', read_only=False)
con.register('my_table', df)
t0 = time.time() * 1000
con.execute("COPY (SELECT * FROM my_table) TO 'c:/temp/2.csv' WITH (FORMAT CSV)")
t1 = time.time() * 1000
print(f'duckdb cost {t1-t0} ms')

# Convert the DataFrame to a pyarrow Table, then write the CSV
import pyarrow as pa
import pyarrow.csv as csv
table = pa.Table.from_pandas(df)
t0 = time.time() * 1000
csv.write_csv(table, 'c:/temp/1.csv')
t1 = time.time() * 1000
print(f'pyarrow cost {t1-t0} ms')
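As an optional sanity check (an addition of ours, continuing the script above rather than part of the original benchmark), the last file written can be read back and compared against the source values; writers differ in default float formatting, so a tolerance-based comparison is the safe choice:

# Read back the pyarrow output and compare with the original data
df_back = pd.read_csv('c:/temp/1.csv')
assert np.allclose(df.values, df_back.values)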


Apache Beam Installation Failure on the Latest Python, and the Fix

Beam installation guide:
https://beam.apache.org/get-started/quickstart-py/

To install Apache Beam, I first downloaded and installed the latest Python, 3.12.4.


> python --version
Python 3.12.4

Installing beam as documented then failed with wheel build errors:

> pip install apache-beam
...
      INFO:root:copying apache_beam\portability\api\org\apache\beam\model\pipeline\v1\schema_pb2.pyi -> build\lib.win-amd64-cpython-312\apache_beam\portability\api\org\apache\beam\model\pipeline\v1
      INFO:root:copying apache_beam\portability\api\org\apache\beam\model\pipeline\v1\standard_window_fns_pb2.pyi -> build\lib.win-amd64-cpython-312\apache_beam\portability\api\org\apache\beam\model\pipeline\v1
      INFO:root:copying apache_beam\portability\api\standard_coders.yaml -> build\lib.win-amd64-cpython-312\apache_beam\portability\api
      INFO:root:running build_ext
      INFO:root:building 'apache_beam.coders.coder_impl_row_encoders' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for apache-beam
  Building wheel for pyarrow (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for pyarrow (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [299 lines of output]
      <string>:34: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
      WARNING setuptools_scm.pyproject_reading toml section missing 'pyproject.toml does not contain a tool.setuptools_scm section'
      Traceback (most recent call last):
        File "C:\Users\zhanghao\AppData\Local\Temp\pip-build-env-0wvk6red\overlay\Lib\site-packages\setuptools_scm\_integration\pyproject_reading.py", line 36, in read_pyproject
          section = defn.get("tool", {})[tool_name]
                    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
      KeyError: 'setuptools_scm'
      running bdist_wheel
      running build
      running build_py

Following the hint in the error message, I downloaded the latest VC++ Build Tools from https://visualstudio.microsoft.com/zh-hans/visual-cpp-build-tools/, but the error persisted.


A similar question on Stack Overflow offers two fixes. One is to install the full Visual C++ toolchain, but that is a 6 GB+ download and far too heavy. The other is to avoid the newest Python release, because prebuilt wheels for some dependencies have not been published for it yet.

https://stackoverflow.com/questions/64261546/how-to-solve-error-microsoft-visual-c-14-0-or-greater-is-required-when-inst

This error can happen when using the latest version of Python, e.g. 3.12, because the package wheels were only built for earlier versions of Python
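A related tip (an addition here, not from the linked answer): pip can be told to refuse source builds outright, which surfaces the missing-wheel problem as an immediate error instead of a long, failing compile log:

> pip install --only-binary=:all: apache-beam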

After uninstalling Python 3.12.4 and installing Python 3.9.13, the beam installation succeeded.

> python --version
Python 3.9.13

> pip install apache-beam
...
Using legacy 'setup.py install' for crcmod, since package 'wheel' is not installed.
Using legacy 'setup.py install' for dill, since package 'wheel' is not installed.
Using legacy 'setup.py install' for hdfs, since package 'wheel' is not installed.
Using legacy 'setup.py install' for pyjsparser, since package 'wheel' is not installed.
Using legacy 'setup.py install' for docopt, since package 'wheel' is not installed.
Installing collected packages: pytz, pyjsparser, docopt, crcmod, zstandard, urllib3, tzdata, typing-extensions, six, rpds-py, regex, pyparsing, pyarrow-hotfix, protobuf, packaging, orjson, objsize, numpy, jsonpickle, idna, grpcio, fasteners, fastavro, dnspython, dill, cloudpickle, charset-normalizer, certifi, attrs, async-timeout, tzlocal, requests, referencing, redis, python-dateutil, pymongo, pydot, pyarrow, proto-plus, httplib2, jsonschema-specifications, js2py, hdfs, jsonschema, apache-beam
  Running setup.py install for pyjsparser ... done
  Running setup.py install for docopt ... done
  Running setup.py install for crcmod ... done
  Running setup.py install for dill ... done
  Running setup.py install for hdfs ... done
Successfully installed apache-beam-2.56.0 async-timeout-4.0.3 attrs-23.2.0 ...