Description
Code Sample, a copy-pastable example if possible
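import pandas as pd
# Small frame: ten zeros followed by 1, 2, 3
smallData = pd.DataFrame({'a': [0]*10 + [1,2,3]})
print(smallData.a.rank(pct=True).tail())
# Large frame: 100,000,000 zeros followed by 1, 2, 3
bigData = pd.DataFrame({'a': [0]*100000000 + [1,2,3]})
print(bigData.a.rank(pct=True).tail())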
When I call rank(pct=True) on a small column (the first example), it returns percentile ranks between 0 and 1. On the large column (the second example), the returned values are greater than 1, so they are no longer percentages. Maybe this is the expected output, but I just want to calculate percentiles on big data.
Output
smallData:
8 0.423077
9 0.423077
10 0.846154
11 0.923077
12 1.000000
bigData:
99999998 2.980232
99999999 2.980232
100000000 5.960465
100000001 5.960465
100000002 5.960465
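One observation about the large-data numbers, offered as a guess rather than a diagnosis: they match what you would get if the ranks were divided by 2**24 (16,777,216) instead of by the real row count of 100,000,003. The divisor 2**24 is my assumption, chosen because it is where 32-bit floats stop representing consecutive integers; nothing in the output itself confirms the cause. The arithmetic can be checked without pandas:

```python
# Hypothetical check: assumes the denominator was capped at 2**24.
n_zeros = 100_000_000
n_rows = n_zeros + 3          # zeros plus the values 1, 2, 3
cap = 2 ** 24                 # assumed divisor, not taken from pandas source

# Average rank shared by all tied zeros: (1 + n_zeros) / 2
zeros_rank = (1 + n_zeros) / 2
top_rank = n_rows             # rank of the largest value

zeros_pct = zeros_rank / cap
top_pct = top_rank / cap
print(round(zeros_pct, 6), round(top_pct, 6))
```

Both rounded values agree with the 2.980232 and 5.960465 printed for the large column.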
Expected Output
I would expect something close to 0.5 for all the 0 values (they are tied, so they share the average rank) and something close to 1 for all other values.
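As a possible workaround, here is a minimal sketch that computes the percentile rank by hand: take the plain ranks and divide by the number of non-null values. The helper name pct_rank is hypothetical, and I am assuming that rank() without pct=True returns correct ranks on the large column, which I have not verified. On small data it gives the same values as rank(pct=True):

```python
import pandas as pd

def pct_rank(series):
    """Hypothetical helper: percentile rank as rank / non-null count."""
    ranks = series.rank()            # average ranks for ties (the default method)
    return ranks / series.count()    # count() ignores NaN values

small = pd.Series([0] * 10 + [1, 2, 3])
result = pct_rank(small)
print(result.tail())
```

For the small example the ten zeros share the rank 5.5, so each gets 5.5 / 13, and the largest value gets 13 / 13 = 1.0.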
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None