Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd
nkeys, nrows, ncols = 50, 5000, 10
tickers = ["X%04d" % i for i in range(nkeys)]
columns = ["C%d" % i for i in range(ncols)]
sample = pd.DataFrame(np.zeros((nrows, ncols)), columns=columns)
tickers = tickers[::-1]  # reverse the tickers so the resulting index is unsorted
data = {t: sample for t in tickers}
rawdata = pd.concat(data, names=["ticker"])
rawdata = rawdata.reset_index().drop(columns="level_1")
indexed = rawdata.set_index('ticker')
indexed.groupby('ticker').apply(lambda x: x)
Installed Versions
INSTALLED VERSIONS
commit : 06d2301
python : 3.8.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : English_United States.1252
pandas : 1.4.1
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: 0.10.0
bs4 : 4.9.3
bottleneck : 1.3.2
fastparquet : None
fsspec : 0.9.0
gcsfs : None
matplotlib : 3.3.4
numba : 0.53.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.6.2
sqlalchemy : 1.4.7
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None
Prior Performance
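When the index of indexed is sorted, the groupby-apply call runs in under 100 ms.
When ticker is a regular column rather than the index (i.e. grouping rawdata directly), it also runs in under 100 ms.
To check, remove the following line from the example: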
tickers = tickers[::-1]
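
For anyone who wants to compare the two cases side by side, here is a minimal timing sketch built from the example above. The helper names build_indexed and time_apply are hypothetical, and group_keys=False is a choice made for this sketch (not part of the original call) so that the result keeps the same shape across pandas versions. Timings will depend on the machine and the pandas version, so no expected numbers are given.

```python
import time

import numpy as np
import pandas as pd


def build_indexed(nkeys=50, nrows=5000, ncols=10, reverse=True):
    """Build the frame from the report; reverse=True leaves the index unsorted."""
    tickers = ["X%04d" % i for i in range(nkeys)]
    columns = ["C%d" % i for i in range(ncols)]
    sample = pd.DataFrame(np.zeros((nrows, ncols)), columns=columns)
    if reverse:
        tickers = tickers[::-1]  # same reversal as in the report
    data = {t: sample for t in tickers}
    rawdata = pd.concat(data, names=["ticker"])
    rawdata = rawdata.reset_index().drop(columns="level_1")
    return rawdata.set_index("ticker")


def time_apply(indexed):
    """Time the identity groupby-apply and return (seconds, result)."""
    start = time.perf_counter()
    # group_keys=False is an assumption of this sketch, see the note above.
    result = indexed.groupby("ticker", group_keys=False).apply(lambda x: x)
    elapsed = time.perf_counter() - start
    return elapsed, result


if __name__ == "__main__":
    for reverse in (False, True):
        frame = build_indexed(reverse=reverse)
        elapsed, _ = time_apply(frame)
        label = "unsorted index" if reverse else "sorted index"
        print(f"{label}: {elapsed * 1000:.1f} ms")
```

Running it once with reverse=False and once with reverse=True isolates the index ordering as the only difference between the two measurements.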