pandas快速读取大文件csv方法

本文介绍: 当我们使用 python进行大量数据的读取操作时，可以通过设置 read _csv 参数、使用 polar s、data table等第三方库的方法提速。

在项目中遇到客户已经下载下来的【20M-10G】大的CSV数据需要用pandas 处理建模，在pandas 加载中发现，加载速度很慢，直接读取有时甚至会超出内存，测试了网上提到的一些加速处理方法，在此汇总记录（由于参考其他文章时，未及时记录，如有侵权联系删除）。

windows10、pandas 版本: 1.5.3、12th Gen Int el® Co re™ i5-12600KF 3.70 GHz、RAM16GB

import pandas as pd
import time
from sys import getsizeof

time_start = time.time()
data = pd.read_csv("../data/input/test_data.csv", encoding="gbk",engine="python") 
time_end = time.time()
print("耗时{}秒".format(time_end - time_start))
print("数据共{}kB".format(round(getsizeof(data) / 1024, 2)))

输出结果为:

耗时27.383215188980103秒
数据共1591406.19kB

time_start = time.time()
data = pd.read_csv("../data/input/test_data.csv", encoding="gbk",engine="c") 
time_end = time.time()
print("耗时{}秒".format(time_end - time_start))
print("数据共{}kB".format(round(getsizeof(data) / 1024, 2)))

输出结果为:

耗时5.371628665924072秒
数据共1591406.19kB

read_csv 中的c引擎比python引擎快5倍左右！

data = pd.read_csv("../data/input/test_data.csv", encoding="gbk", engine="c")
print(data.dtypes)

测试1     object
测试2     object
测试3      int64
测试4     object
测试5    float64
测试6      int64
测试7    float64
测试8    float64
dtype: object

time_start = time.time()
data = pd.read_csv("../data/input/test_data.csv", encoding="gbk",engine="c",
dtype = {"测试3": np.int8, "测试5": np.float16, "测试6": np.int8, "测试7": np.float16, "测试8": np.float16}) 
time_end = time.time()
print("耗时{}秒".format(time_end - time_start))
print("数据共{}kB".format(round(getsizeof(data) / 1024, 2)))

测试1            object
测试2            object
测试3              int8
测试4            object
测试5           float16
测试6              int8
测试7           float16
测试8           float16
dtype: object
耗时5.005128908157349秒
数据共1441526.0kB

data = pd.read_csv("../data/input/test-data.csv",encoding="gbk", engine="c", chunksize=30)
for part_data in data:
    print(part_data.columns)
    break

Index(['测试1', '测试2', '测试3', '测试4', '测试5', '测试6', '测试7', '测试8'], dtype='object')

time_start = time.time()
data = pd.read_csv("../data/input/test-data.csv",encoding="gbk", engine="c", usecols=["测试1"])
time_end = time.time()
print("耗时{}秒".format(time_end - time_start))
print("数据共{}kB".format(round(getsizeof(data) / 1024, 2)))

耗时3.5131638050079346秒
数据共355965.88kB

pip install polars

import time
import polars as pl
from sys import getsizeof

time_start = time.time()
data = pl.read_csv("../data/input/test-data.csv", encoding="utf8-lossy")
end_time = time.time()
print("耗时{}秒".format(end_time - time_start))
print(type(data))
print(data.dtypes)
print(data)
print("data对象共{}kB".format(round(getsizeof(data) / 1024, 2)))
pandas_df = data.to_pandas()
memory_usage_bytes = pandas_df.memory_usage(deep=True).sum()
memory_usage_kb = round(memory_usage_bytes / 1024, 2)  
print("数据共{}KB".format(memory_usage_kb))

耗时0.07973313331604004秒
<class 'polars.dataframe.frame.DataFrame'>
[Int64, Utf8, Utf8, Int64, Utf8, Float64, Int64, Float64, Float64]
shape: (186_958, 9)
┌────────┬─────────────────────┬─────────────────────┬───────┬───┬────────┬───────┬───────┬───────┐
│        ┆ ����1               ┆ ����2               ┆ ����3 ┆ … ┆ ����5  ┆ ����6 ┆ ����7 ┆ ����8 │
│ ---    ┆ ---                 ┆ ---                 ┆ ---   ┆   ┆ ---    ┆ ---   ┆ ---   ┆ ---   │
│ i64    ┆ str                 ┆ str                 ┆ i64   ┆   ┆ f64    ┆ i64   ┆ f64   ┆ f64   │
╞════════╪═════════════════════╪═════════════════════╪═══════╪═══╪════════╪═══════╪═══════╪═══════╡
│ 0      ┆ 2023-03-09 00:00:00 ┆ 2023-03-09 01:00:00 ┆ 205   ┆ … ┆ 1.1236 ┆ 3     ┆ 0.01  ┆ 0.23  │
│ 1      ┆ 2023-03-09 00:00:00 ┆ 2023-03-09 01:00:00 ┆ 201   ┆ … ┆ 0.0    ┆ 0     ┆ 0.0   ┆ 0.0   │
│ 2      ┆ 2023-03-09 00:00:00 ┆ 2023-03-09 01:00:00 ┆ 202   ┆ … ┆ 0.0    ┆ 0     ┆ 0.0   ┆ 0.0   │
│ 3      ┆ 2023-03-09 00:00:00 ┆ 2023-03-09 01:00:00 ┆ 203   ┆ … ┆ 0.1947 ┆ 2     ┆ 0.0   ┆ 0.01  │
│ …      ┆ …                   ┆ …                   ┆ …     ┆ … ┆ …      ┆ …     ┆ …     ┆ …     │
│ 186954 ┆ 2023-06-05 22:00:00 ┆ 2023-06-05 23:00:00 ┆ 19    ┆ … ┆ 0.1172 ┆ 4     ┆ 0.0   ┆ 0.01  │
│ 186955 ┆ 2023-06-05 22:00:00 ┆ 2023-06-05 23:00:00 ┆ 20    ┆ … ┆ 0.4344 ┆ 18    ┆ 0.0   ┆ 0.04  │
│ 186956 ┆ 2023-06-05 22:00:00 ┆ 2023-06-05 23:00:00 ┆ 21    ┆ … ┆ 1.0528 ┆ 21    ┆ 0.01  ┆ 0.07  │
│ 186957 ┆ 2023-06-05 22:00:00 ┆ 2023-06-05 23:00:00 ┆ 22    ┆ … ┆ 0.6484 ┆ 17    ┆ 0.01  ┆ 0.16  │
└────────┴─────────────────────┴─────────────────────┴───────┴───┴────────┴───────┴───────┴───────┘
data对象共0.05kB
数据共67628.84KB

Process finished with exit code 0

从测试结果看同一份数据polars读取数据速度比pandas中的read_csv快将近50倍! ! !。

exceptions.ComputeError: invalid utf-8 sequence in csv

这个错误可以在polars读取数据的时候将encoding参数设置成”utf8-lossy“，这样一些无法编码的字符会使用�代替。

data = pl.read_csv("../data/input/test-data.csv", encoding="utf8-lossy")

pip install datatable

import time
import datatable as dt
from sys import getsizeof


time_start = time.time()
data = dt.fread("../data/input/test-data.csv", encoding="gbk")
end_time = time.time()
print("耗时{}秒".format(end_time - time_start))
print("datatable版本:", dt.__version__)
print(type(data))
print("数据共{}kB".format(round(getsizeof(data) / 1024, 2)))

耗时0.11500072479248047秒
datatable版本: 1.0.0
<class 'datatable.Frame'>
数据共22585.8kB

从测试结果来看datatable读取数据的速度和polars差不多。

Collecting datatable
  Using cached datatable-1.0.0.tar.gz (1.1 MB)
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [22 lines of output]
      Traceback (most recent call last):
        File "D:ancondalibsite-packagespip_vendorpyproject_hooks_in_process_in_process.py", line 353, in <module>
          main()
        File "D:ancondalibsite-packagespip_vendorpyproject_hooks_in_process_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "D:ancondalibsite-packagespip_vendorpyproject_hooks_in_process_in_process.py", line 152, in prepare_metadata_for_build_wheel
          whl_basename = backend.build_wheel(metadata_directory, config_settings)
        File "C:UsersAdministratorAppDataLocalTemppip-install-r7zihkkbdatatable_259142d33cfc4c0cb26d395c8dfed6fdciext.py", line 573, in build_wheel
          generate_build_info(flavor, strict=not is_source_distribution())
        File "C:UsersAdministratorAppDataLocalTemppip-install-r7zihkkbdatatable_259142d33cfc4c0cb26d395c8dfed6fdciext.py", line 467, in generate_build_info
          git_hash = shell_cmd(["git", "rev-parse", "HEAD"], strict=strict)
        File "C:UsersAdministratorAppDataLocalTemppip-install-r7zihkkbdatatable_259142d33cfc4c0cb26d395c8dfed6fdciext.py", line 437, in shell_cmd
          return subprocess.check_output(cmd, universal_newlines=True,
        File "D:ancondalibsubprocess.py", line 421, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "D:ancondalibsubprocess.py", line 503, in run
          with Popen(*popenargs, **kwargs) as process:
        File "D:ancondalibsubprocess.py", line 971, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "D:ancondalibsubprocess.py", line 1440, in _execute_child
          hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
      FileNotFoundError: [WinError 2] 系统找不到指定的文件。
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

pip install dask

import time
import dask
import dask.dataframe as dd
from sys import getsizeof


time_start = time.time()
data = dd.read_csv("../data/input/test-data.csv", encoding="gbk")
end_time = time.time()
print("耗时{}秒".format(end_time - time_start))
print("dask版本:", dask.__version__)
print(type(data))
print("data对象共{}kB".format(round(getsizeof(data) / 1024, 2)))
print("数据共{}kB".format(round(getsizeof(data.head(len(data)-1)) / 1024, 2)))
print(data)

耗时0.005980253219604492秒
dask版本: 2023.6.0
<class 'dask.dataframe.core.DataFrame'>
data对象共0.05kB
数据共64114.47kB
Dask DataFrame Structure:
              Unnamed: 0     测试1     测试2    测试3     测试4      测试5    测试6      测试7      测试8
npartitions=1                                                                            
                   int64  object  object  int64  object  float64  int64  float64  float64
                     ...     ...     ...    ...     ...      ...    ...      ...      ...
Dask Name: read-csv, 1 graph layer

Process finished with exit code 0

从测试结果看dask读取数据的速度将近read_csv的1000倍，datatable和polars的20倍！！！！，速度惊人！！

1、工作中若是无法要求数据保存的格式，当遇到较大数据量级的数据需要采用python进行建模处理时，对于csv文件可以使用dask、polars、datatable等第三方库进行协助操作，另外pandas库中相关的函数参数设置也可以提高我们数据读取的速度。
2、如果对数据的保存格式有自主权，可尝试 pickle、hdf5格式、极其不建议存成xlsx 格式，本人测试该格式数据读取速度非常慢！！！。
欢迎相关小伙伴进群交流学习qq群：865416910

显示所有内容

声明：本站所有文章，如无特殊说明或标注，均为本站原创发布。任何个人或组织，在未征得本站同意时，禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益，可联系我们进行处理。

文章 目录

前言

一、测试环境

二、pandas中的read_csv 读取速度测试

1.引入库

2.测试

三、pandas中的re ad_csv提速方法

1.使用”C”引擎

2.read_csv 读取数据时指定 数据格式

3.read_cs v 读取数据时使用use col s参数只加载 需要的数据

四、利用 第三方库进行提速

1.使用polars库读取文件

1.1 polars安装

1.2 polars读取数据测试

2. datatable 读取数据测试

2.1 datatable 安装

2.2 datatable测试

3. dask 读取数据测试

3.1 dask 安装

3.2 dask 读取数据测试

五、经验总结

发表回复取消回复

一、测试环境

二、pandas中的read_csv读取速度测试

1.引入库

2.测试

三、pandas中的read_csv提速方法

1.使用”C”引擎

2.read_csv读取数据时指定数据格式

3.read_csv读取数据时使用usecols参数只加载需要的数据

四、利用第三方库进行提速

1.使用polars库读取文件

1.1 polars安装

1.2 polars读取数据测试

2. datatable读取数据测试

2.1 datatable安装

2.2 datatable测试

3. dask读取数据测试

3.1 dask安装

3.2 dask读取数据测试

五、经验总结

相关文章

发表回复 取消回复