Description
The Pandas documentation states that a hierarchically indexed DataFrame can be partially selected from by passing a partial list of index values. For example if we have a 5-level index, then df.ix['A','B','C','D'] should return a DataFrame with the subset of rows where levels[0,1,2,3] match values ['A','B','C','D'] indexed with by the remaining lowest index level 5. I have found that this only works when partially selecting on the first two levels, anything beyond the first 2 levels requires the user to append a null slice at the end for .ix to work. This becomes cumbersome especially if you do not know exactly how many levels are available in a DataFrame after several processing steps. The same behavior persists regardless if your MultiIndex is int64 or string or mixed types.
It would be best if any partial tuple of index values passed to the .ix method returns a partial selection matching that tuple without requiring users to explicitly append the empty slice for the remaining levels. Let me know if this is how the .ix method is intended to work. It might be a simple fix of by checking if the tuple length == # levels and if not appending the empty slice within the .ix method.
Example Showing the Issue:
Take an example data set with 5 index levels, group by the first 4 levels and count the group sizes. Select the group with the largest size.
Results: (Clean Code Pasted Below IPython Output)
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: print pd.__version__
0.11.0.dev-80945b6
In [4]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [5]: NUM_COLS = 10
In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [7]: index_cols = col_names[:5]
In [8]: # Set DataFrame to have 5 level Hierarchical Index.
...: # The dtype does not matter try str or np.int64 same results.
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [9]: df = df.set_index(index_cols)
In [10]: df
Out[10]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (2, 2, 4, 0, 0) to (3, 2, 3, 1, 1)
Data columns:
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [11]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:-1])
....:
In [12]: # Compute group size and find index of largest group.
....: grp_size = grp.size()
....:
In [13]: grp_size.sort()
In [14]: grp_size[::-1]
Out[14]: A0 A1 A2 A3
2 1 2 1 200
1 4 1 3 200
3 1 4 1 199
4 1 1 1 196
2 4 2 0 190
4 4 0 1 190
3 1 0 0 189
0 0 4 3 189
4 1 2 1 188
0 1 4 2 187
2 2 3 0 187
3 2 3 2 187
3 4 0 186
1 3 0 186
3 3 4 185
...
0 1 1 4 136
2 4 4 3 135
4 4 1 2 135
2 4 3 2 135
3 3 4 1 134
4 2 3 1 133
3 2 2 4 133
4 0 1 3 133
0 4 3 4 131
1 2 1 0 130
4 4 1 4 130
2 1 4 128
0 2 3 2 127
3 3 0 1 121
1 0 2 3 106
Length: 625, dtype: int64
In [15]: loc = grp.size().idxmax()
In [16]: loc
Out[16]: (1, 4, 1, 3)
In [17]: # The following line does not work.
....: # Can't use partial selection to select largest group.
....: df.ix[loc]
....:
---------------------------------------------------------------------------
IndexingError Traceback (most recent call last)
<ipython-input-17-e26047d119d7> in <module>()
1 # The following line does not work.
2 # Can't use partial selection to select largest group.
----> 3 df.ix[loc]
4
C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
44 pass
45
---> 46 return self._getitem_tuple(key)
47 else:
48 return self._getitem_axis(key, axis=0)
C:\Python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_tuple(self, tup)
235 for i, key in enumerate(tup):
236 if i >= self.obj.ndim:
--> 237 raise IndexingError('Too many indexers')
238
239 if _is_null_slice(key):
IndexingError: Too many indexers
In [18]: # Appending the null slice at the end works!
....: # I was hoping for partial selection to work above.
....: df.ix[loc+(slice(None),)]
....:
Out[18]: <class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 0
Data columns:
A5 200 non-null values
A6 200 non-null values
A7 200 non-null values
A8 200 non-null values
A9 200 non-null values
dtypes: int64(5)
In [19]: # Selecting on first level works.
....: df.ix[loc[0]]
....:
Out[19]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 19796 entries, (3, 1, 1, 1) to (4, 1, 3, 0)
Data columns:
A5 19796 non-null values
A6 19796 non-null values
A7 19796 non-null values
A8 19796 non-null values
A9 19796 non-null values
dtypes: int64(5)
In [20]: # Selecting on first two levels works.
....: df.ix[loc[0], loc[1]]
....:
Out[20]: A1 A2 A3 A4
3 1 1 1 3
0 1 1 0 2
4 0 1 3
4 3 1 4 1
1 4 4 4 4
0 0 3 2 4
1 1 2 3 1
4 3 4 3 4
2 1 2 0 3
1 2 0 2 1
0 0 3 0 3
2 4 2 2
4 4 0 4 1
2 2 4 4
1 3 0 3 4
...
0 2 4 0 0
1 4 2 1 4
0 2 1 4
3 4 2 3 2
1 1 3 2 3
3 3 0 3 1
2 1 2 4 3
4 4 2 2 0
0 2 3 2
4 4 1 3
3 1 3 4 2
2 2 4 1 4
3 1 1 1 1
1 0 1 3 1
4 1 3 0 3
Name: A9, Length: 19796, dtype: int64
In [21]: # Selecting on first 3 levels fails.
....: df.ix[loc[0], loc[1], loc[2]]
....:
---------------------------------------------------------------------------
IndexingError Traceback (most recent call last)
<ipython-input-21-38eb049081e0> in <module>()
1 # Selecting on first 3 levels fails.
----> 2 df.ix[loc[0], loc[1], loc[2]]
3
C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
44 pass
45
---> 46 return self._getitem_tuple(key)
47 else:
48 return self._getitem_axis(key, axis=0)
C:\Python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_tuple(self, tup)
235 for i, key in enumerate(tup):
236 if i >= self.obj.ndim:
--> 237 raise IndexingError('Too many indexers')
238
239 if _is_null_slice(key):
IndexingError: Too many indexers
In [22]: # Adding the null slice and it works.
....: df.ix[loc[0], loc[1], loc[2], slice(None)]
Out[22]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 848 entries, (2, 1) to (3, 0)
Data columns:
A5 848 non-null values
A6 848 non-null values
A7 848 non-null values
A8 848 non-null values
A9 848 non-null values
dtypes: int64(5)
Code:
import numpy as np
import pandas as pd
print pd.__version__
# Generate Test DataFrame
NUM_ROWS = 100000
NUM_COLS = 10
col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
index_cols = col_names[:5]
# Set DataFrame to have 5 level Hierarchical Index.
# The dtype does not matter try str or np.int64 same results.
df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
df = df.set_index(index_cols)
df
# Group by first 4 index columns.
grp = df.groupby(level=index_cols[:-1])
# Compute group size and find index of largest group.
grp_size = grp.size()
grp_size.sort()
grp_size[::-1]
loc = grp.size().idxmax()
loc
# The following line does not work.
# Can't use partial selection to select largest group.
df.ix[loc]
# Appending the null slice at the end works!
# I was hoping for partial selection to work above.
df.ix[loc+(slice(None),)]
# Selecting on first level works.
df.ix[loc[0]]
# Selecting on first two levels works.
df.ix[loc[0], loc[1]]
# Selecting on first 3 levels fails.
df.ix[loc[0], loc[1], loc[2]]
# Adding the null slice and it works.
df.ix[loc[0], loc[1], loc[2], slice(None)]