I am using Python's deque() to implement a simple circular buffer:
    from collections import deque
    import numpy as np

    # 100 rows of 2 values; list() is needed because range objects cannot be multiplied in Python 3
    test_sequence = np.array(list(range(100)) * 2).reshape(100, 2)
    # buffer of 10 rows, initially all zeros
    mybuffer = deque(np.zeros(20).reshape((10, 2)))

    for i in test_sequence:
        mybuffer.popleft()   # drop the oldest row
        mybuffer.append(i)   # add the newest row
        do_something_on(mybuffer)
I was wondering if there's a simple way of obtaining the same thing in Pandas using a Series (or DataFrame). In other words, how can I efficiently add a single row at the end and remove a single row at the beginning of a Series or DataFrame?
So far, I have tried this:

    import pandas as pd

    myPandasBuffer = pd.DataFrame(columns=('A', 'B'), data=np.zeros(20).reshape((10, 2)))
    newpoint = pd.DataFrame(columns=('A', 'B'), data=np.array([[1, 1]]))

    for i in test_sequence:
        newpoint[['A', 'B']] = i
        # .ix has been removed from pandas; .iloc[1:] drops the first row
        myPandasBuffer = pd.concat([myPandasBuffer.iloc[1:], newpoint], ignore_index=True)
        do_something_on(myPandasBuffer)

But it is much slower than the deque() method.
Recommended answer:
As noted by dorvak, pandas is not designed for queue-like behaviour.
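Since pandas is a poor fit for the rolling update itself, one option is to keep the rolling state in a deque and build a DataFrame only when one is actually needed. The following is a minimal sketch, assuming two-column rows as in the question. The `maxlen` argument makes the deque discard the oldest row automatically, and `snapshot` is a hypothetical helper name, not something from the original posts.

```python
from collections import deque

import numpy as np
import pandas as pd

# maxlen makes append() discard the oldest row once the buffer is full,
# so no explicit popleft() is needed.
buffer = deque(np.zeros((10, 2)), maxlen=10)


def snapshot(buf):
    """Hypothetical helper: copy the current buffer contents into a DataFrame."""
    return pd.DataFrame(np.array(list(buf)), columns=["A", "B"])


test_sequence = np.array(list(range(100)) * 2).reshape(100, 2)
for row in test_sequence:
    buffer.append(row)

# The DataFrame is built once, after the updates, from the last 10 rows.
df = snapshot(buffer)
```

Whether this helps depends on how often `do_something_on` needs a real DataFrame: building one on every iteration would bring back much of the pandas overhead.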
Below I've replicated the simple insert function from deque in pandas dataframes, numpy arrays, and also in hdf5 using the h5py module.
The timeit function reveals (unsurprisingly) that the collections module is much faster, followed by numpy and then pandas.
    from collections import deque
    import pandas as pd
    import numpy as np
    import h5py

    def insert_deque(test_sequence, buffer_deque):
        for item in test_sequence:
            buffer_deque.popleft()
            buffer_deque.append(item)
        return buffer_deque

    def insert_df(test_sequence, buffer_df):
        for item in test_sequence:
            # shift every row up by one, then overwrite the last row
            buffer_df.iloc[0:-1,:] = buffer_df.iloc[1:,:].values
            buffer_df.iloc[-1] = item
        return buffer_df

    def insert_arraylike(test_sequence, buffer_arr):
        for item in test_sequence:
            # same shift-and-overwrite, for numpy arrays and h5py datasets
            buffer_arr[:-1] = buffer_arr[1:]
            buffer_arr[-1] = item
        return buffer_arr

    test_sequence = np.array(list(range(100))*2).reshape(100,2)

    # create buffer arrays
    nested_list = [[0]*2]*5
    buffer_deque = deque(nested_list)
    buffer_df = pd.DataFrame(nested_list, columns=('A','B'))
    buffer_arr = np.array(nested_list)

    # calculate speed of each process in ipython
    print("deque : ")
    %timeit insert_deque(test_sequence, buffer_deque)
    print("pandas : ")
    %timeit insert_df(test_sequence, buffer_df)
    print("numpy array : ")
    %timeit insert_arraylike(test_sequence, buffer_arr)
    print("hdf5 with h5py : ")
    with h5py.File("h5py_test.h5", "w") as f:
        f["buffer_hdf5"] = np.array(nested_list)
        %timeit insert_arraylike(test_sequence, f["buffer_hdf5"])

%timeit results:
deque : 34.1 µs per loop
pandas : 48 ms per loop
numpy array : 187 µs per loop
hdf5 with h5py : 31.7 ms per loop
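The `%timeit` magic only works inside IPython. Below is a rough sketch of the deque and numpy parts of the comparison as a plain Python script, using the standard-library `timeit` module. The names `shift_deque` and `shift_array` and the repetition count are assumptions made for illustration, and absolute timings will differ from the figures above depending on machine and library versions.

```python
import timeit
from collections import deque

import numpy as np

test_sequence = np.array(list(range(100)) * 2).reshape(100, 2)


def shift_deque(seq, buf):
    # Drop the oldest row, then append the newest one.
    for item in seq:
        buf.popleft()
        buf.append(item)
    return buf


def shift_array(seq, buf):
    # Shift every row up by one, then overwrite the last row.
    for item in seq:
        buf[:-1] = buf[1:]
        buf[-1] = item
    return buf


# Five-row buffers, matching the buffer size used in the answer.
buffer_deque = deque([[0, 0] for _ in range(5)])
buffer_arr = np.zeros((5, 2), dtype=int)

n = 100  # number of repetitions; chosen arbitrarily for this sketch
t_deque = timeit.timeit(lambda: shift_deque(test_sequence, buffer_deque), number=n)
t_arr = timeit.timeit(lambda: shift_array(test_sequence, buffer_arr), number=n)

# timeit returns the total time in seconds for all n calls.
print(f"deque: {t_deque / n * 1e6:.1f} us per loop")
print(f"numpy: {t_arr / n * 1e6:.1f} us per loop")
```

Both functions update their buffer in place, so after timing, each buffer holds the last five rows of `test_sequence`.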
Notes:
My pandas slicing method was only slightly faster than the concat method listed in the question.
The hdf5 format (via h5py) did not show any advantages. I also don't see any advantages of HDFStore, as suggested by Andy.