Pandas Indexing & Selecting using iloc, loc and ix
In Pandas 19.02, the indexing follows the same paradigms as Numpy. There’s not much of a difference if a newbie starts to slice pandas Dataframe
according to the numpy conventions. However, there comes a time when things take turn for the worse when he/she encounters the three musketeers: iloc, loc and ix
.
Source: Frost Click: Three Musketeers
Here’s to the many hours that I spent pulling out my hair in understanding this lot.
We’ll start with the basic definitions that Pandas documentation has to offer:
.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found.
.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. (this conforms with python/numpy slice semantics).
.ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. .ix is the most general and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.
However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc.
Hmmm, sounds too technical. Let’s take the help of code to understand these.
Consider a dataframe df
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 5)), columns=list('ABCDE'))
df.head()
A | B | C | D | E | |
---|---|---|---|---|---|
0 | 84 | 71 | 64 | 93 | 65 |
1 | 87 | 17 | 12 | 67 | 80 |
2 | 45 | 70 | 12 | 83 | 90 |
3 | 56 | 98 | 5 | 88 | 90 |
4 | 31 | 70 | 98 | 26 | 33 |
df.shape
(100, 5)
We’ll apply all the three on our dataset:
# Selecting first three rows?
df.iloc[:3]
A | B | C | D | E | |
---|---|---|---|---|---|
0 | 84 | 71 | 64 | 93 | 65 |
1 | 87 | 17 | 12 | 67 | 80 |
2 | 45 | 70 | 12 | 83 | 90 |
# Selecting first three rows?
df.loc[:3]
A | B | C | D | E | |
---|---|---|---|---|---|
0 | 84 | 71 | 64 | 93 | 65 |
1 | 87 | 17 | 12 | 67 | 80 |
2 | 45 | 70 | 12 | 83 | 90 |
3 | 56 | 98 | 5 | 88 | 90 |
# Selecting first three rows?
df.ix[:3]
A | B | C | D | E | |
---|---|---|---|---|---|
0 | 84 | 71 | 64 | 93 | 65 |
1 | 87 | 17 | 12 | 67 | 80 |
2 | 45 | 70 | 12 | 83 | 90 |
3 | 56 | 98 | 5 | 88 | 90 |
Apart from iloc
, the rest two don’t really follow the same that points out the first difference: iloc
is integer based while the rest aren’t (Note: loc
is exclusively label based, while ix
plays the devil’s advocate).
loc
and ix
select values till the index label 3 i.e. we get values of [0,1,2,3]
Let’s correct the loc
and ix
to select first 3 rows
:
# since both behave as label based, let's write them as:
print ("::::The output for loc::::")
print (df.loc[:2])
print ("::::The output for ix::::")
print (df.ix[:2])
::::The output for loc::::
A B C D E
0 84 71 64 93 65
1 87 17 12 67 80
2 45 70 12 83 90
::::The output for ix::::
A B C D E
0 84 71 64 93 65
1 87 17 12 67 80
2 45 70 12 83 90
Now, the question arises why do we need loc
or ix
when the insanely simple iloc
solves our problems?
Let’s consider the following scenario where we need to slice select columns from the dataframe using iloc
:
# Selecting the column's [A,C] for the first 3 rows
df.iloc[:3, ['A', 'C']]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-2f83469716a2> in <module>()
1 # Selecting the column's [A,C] for the first 3 rows
----> 2 df.iloc[:3, ['A', 'C']]
/home/pratos/miniconda3/lib/python3.5/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1308
1309 if type(key) is tuple:
-> 1310 return self._getitem_tuple(key)
1311 else:
1312 return self._getitem_axis(key, axis=0)
/home/pratos/miniconda3/lib/python3.5/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
1558 def _getitem_tuple(self, tup):
1559
-> 1560 self._has_valid_tuple(tup)
1561 try:
1562 return self._getitem_lowerdim(tup)
/home/pratos/miniconda3/lib/python3.5/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
149 if i >= self.obj.ndim:
150 raise IndexingError('Too many indexers')
--> 151 if not self._has_valid_type(k, i):
152 raise ValueError("Location based indexing can only have [%s] "
153 "types" % self._valid_types)
/home/pratos/miniconda3/lib/python3.5/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
1528 return self._is_valid_integer(key, axis)
1529 elif is_list_like_indexer(key):
-> 1530 return self._is_valid_list_like(key, axis)
1531 return False
1532
/home/pratos/miniconda3/lib/python3.5/site-packages/pandas/core/indexing.py in _is_valid_list_like(self, key, axis)
1551 ax = self.obj._get_axis(axis)
1552 l = len(ax)
-> 1553 if len(arr) and (arr.max() >= l or arr.min() < -l):
1554 raise IndexError("positional indexers are out-of-bounds")
1555
/home/pratos/miniconda3/lib/python3.5/site-packages/numpy/core/_methods.py in _amax(a, axis, out, keepdims)
24 # small reductions
25 def _amax(a, axis=None, out=None, keepdims=False):
---> 26 return umr_maximum(a, axis, None, out, keepdims)
27
28 def _amin(a, axis=None, out=None, keepdims=False):
TypeError: cannot perform reduce with flexible type
Whoa! Up there’s a mess homie!
Unlike numpy, pandas have a label
based columns and there’s a need for a better control over selection of columns in a dataframe. loc
provides that with it’s label based selection (sometimes if not used properly might leave the practitioner with a numb brain). Let’s see whether we can do that with loc
df.loc[:2, ['A', 'C']]
A | C | |
---|---|---|
0 | 84 | 64 |
1 | 87 | 12 |
2 | 45 | 12 |
Note: loc
includes last element, iloc
doesn’t. So, the above behaviour while selecting rows explains that.
Sweet! It seems loc
is awesome! What about ix
, would it be the same?
Note: ix
has a very dividing opinion, in the pandas documentation itself the maintainers point towards using iloc
and loc
whenever a sane dataframe is being presented. Just to avoid the confusion that prevails from mixture label based and integer based slicing.
If loc
is a Katana, then ix
is Iron aged double edged swords. Let’s refer the documentation definition of ix
again:
ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. .ix is the most general and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.
The base behavior of ix
is illustrated as below:
df.ix[3]
A 56
B 98
C 5
D 88
E 90
Name: 3, dtype: int64
df.ix[3]
returns by default the value of the 3rd row. df.ix[3,]
would give the similar results.
df.ix[3,]
A 56
B 98
C 5
D 88
E 90
Name: 3, dtype: int64
# A more "complex" query using ix
# Hybrid approach
df.ix[1:3, ['A', 'C']]
A | C | |
---|---|---|
1 | 87 | 12 |
2 | 45 | 12 |
3 | 56 | 5 |
# Selecting rows from 1 to 3, and columns 1 to 2
# Pure Index based
df.ix[1:3, 1:2]
B | |
---|---|
1 | 17 |
2 | 70 |
3 | 98 |
From the above two examples, we can understand how handy ix
. Almost like being rescued by Gandalf himself with his sword, Glamdrig! (At least I felt like today 😊)
Source: Cultjer
With all the small snippets, I hope the basic demons about our three musketeers are exorcised. A small real life dataset would help in getting a feel of the things.
winered = pd.read_csv('winequality-red.csv', sep=';')
winered.head(5)
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
# Select pH scores having less than 3 with quality equal to 6
winered.ix[:,['pH', 'quality']].query('pH < 3 & quality == 6')
pH | quality | |
---|---|---|
86 | 2.93 | 6 |
91 | 2.93 | 6 |
464 | 2.98 | 6 |
544 | 2.86 | 6 |
614 | 2.87 | 6 |
667 | 2.94 | 6 |
669 | 2.94 | 6 |
1017 | 2.89 | 6 |
1018 | 2.89 | 6 |
1319 | 2.90 | 6 |
As the above query, we can go on with many such. Hope this has been helpful for the readers. Do comment and correct if any inconsistencies.
Sources: