DataFrame Indexing Methods

DataFrame Indexing Methods#

We will explore four methods that index data within a Pandas DataFrame. The abundance of approaches can be overwhelming and this write-up aims provide clarity. For additional information, refer to the Pandas DataFrame Documentation.

API	Index/Position	Comment
`df[]`	rows by position, columns by name	One argument only
`df.loc[]`	by index	`[inclusive : inclusive]`
`df.iloc[]`	by position	Columns by position, too
`df.take()`	by position	`axis=0` take rows `axis=1` take columns

Except for loc the API are [inclusive : exclusive].

Argument Types#

Type	Example	Notes
Number	`df.iloc[3]` `df.loc[3]`	Describes the position, or, index
Numerical Slice	`df[0:3]` `df.loc[1:3]`	A slice using numbers (positional) (index)
Numerical List	`df.take([0, 3])` `df.loc[[1,3]]`	List of numbers (positional) (index)
Mask	`df[[True, False, True]]`	Any iterable of True/False values
String	`df['A']`	A column name (or index value)
String Slice	`df.loc[:, 'C':'words']`	Represents a range of columns
String List	`df[['C','D','words']]`	A list of column names

API Details#

df

df[] takes a single argument only.
Slices are [ Inclusive : Exclusive ].
The most common usage is to get a Series (column) from the DataFrame via df['col_name'].
One can add a column to a DataFrame with df['new_column'] = list_of_values

It is important to note here that when a numerical slice is used, the values represent positions, not the index. When string values are used, the values represent column names.

Valid Argument Types:

Numerical Slice: represents the Row Positions
Mask: represents which Rows to keep
String: column name
String List: list of column names

Invalid Argument Types:

Numerical List: You cannot use df[[0, 5, 8]] to access a list of rows.
String Slice: You cannot use df['first':'exclude_me'] to access a slice of columns. Instead, you must use a list of strings.

Examples:

Valid	Results	Invalid	Note
`df[0:1]`	DataFrame All columns, 1 row	`df[0]`	single number NOT allowed
`df[1:3]`	DataFrame All columns, 2 rows	`df[[0, 3]]`	List must be column names
`df[2:10]`	DataFrame All columns. Many rows		All rows at those positions. If the slice is completely out of bounds, an empty DataFrame is returned.
`df[[True, False]]`	DataFrame All columns, rows where mask==True		Length of mask must be len(df)
`df[['C', 'D']]`	DataFrame All rows with columns ‘C’ and ‘D’		The columns are ordered as they appear in the list.
`df[['C']]`	DataFrame All rows, one column ‘C’		A list of just 1 still results in a DataFrame.
`df['C']`	Series. Column ‘C’	`df['C':'words']`	Column slices NOT allowed.

Note that, except for the last example, all valid API return a DataFrame.

df.loc

df.loc[rows, cols] takes one or two arguments.
To get all rows, use an ‘empty’ slice: df.loc[ : , 'Name']
Each argument can describe multiple values or a single value.
The type returned depends on the argument types (vectors vs scalar [1][2]).
- If both arguments are vectors, a DataFrame is returned.
- If exactly one argument is a vector, then a Series is returned.
- If both arguments are scalar, then the contents of the single cell is returned.
Slices are [ Inclusive : Inclusive ].
One can set a value. df.loc[1, 'Name'] = 5 will set the DataFrame’s value to 5 at index 1 and row ‘Name’.

The first argument describes the indices of the rows desired.
The second argument describes the column names desired.

Valid Argument Types:

Slice: For rows, the slice must match the index type. For columns, the slice indicates the names of the columns to include and every column between.
Mask: Represents which rows/cols to keep. Mask is an iterable of True/False values (such as a Series or a list)
List: A list of indices or column names to include.

Examples:

Valid	Note
`df.loc[0:3]`	Rows with indices 0 & 3, and everything in between
`df.loc[11:5]`	Since indices don’t need to be sorted, this will have index 11 as the first row and index 5 as the last row, and all rows that are positioned between them.
`df.loc[mask1, mask2]`	Gets all the rows where mask1 is True, and all the columns where mask2 is True. len(mask1) == count of rows. len(mask2) == count of columns.
‘df.loc[ : , df.columns != ‘Bad’]`	Gets all the rows and all the columns except for ‘Bad’. df.columns is a list of column names. The result of comparing the list to ‘Bad’ is a list of True/False values… a mask
`df.loc[ [0, 8, 11], ['One', 'Three'] ]`	Each list represents which row/column to keep. The result has the same order as that provided in the list.

Why Inclusive?

The reason this API is inclusive is because one will often want to get a set of columns that one knows about. It is better to say:

“I want all the columns starting with ‘One’ and ending with ‘Last’”

than it is to say:

“I want all the columns starting with ‘One’ and goes until the column after the ‘Last’ column that I want named… gosh, what is the name of the column that I don’t want?

The same can be true for rows when the index is not sorted.

df.iloc

Under constructions.

df.iloc[rows, cols] takes 1 or 2 arguments.
Arguments are primarily integers that represent position, but can also be a boolean mask.
Arguments can also be a lambda for with the argument of the lambda is the calling object, either a DataFrame or Series.
There are no string arguments.

Valid Argument Types:

integer that represents the position of the row or column.
Numerical Slice that represents the positions of the rows or columns.
Numerical List that represents the positions of the rows or columns.

df.take

under construction

Examples#

Common

df = Default Index

Result	API
	`df[0:1]` `df.loc[0:0]` `df.iloc[0:1]` `df.take([0])`
	`df[1:3]` `df.loc[1:2]` `df.iloc[1:3]` `df.take([1,2])`
	`mask=[True,False,True,False]` `df[mask]` `df.loc[mask]` `df.loc[[0, 2]]` `df.iloc[mask]` `df.iloc[[0, 2]]` `df.take([0, 2])` `df[[0, 2]]` FAILS!!
	`mask=[False,False,True,True,True,False]` `df[['C', 'D', 'words']]` `df.loc[:, mask]` `df.loc[:, 'C':'words']` `df.loc[:, ['C', 'D', 'words']]` `df.iloc[:, mask]` `df.iloc[:, [0, 2]]` `df.take([2, 3, 4], axis=1)` `df['C':'words']` FAILS!!
	`df.loc[:, mask]` `df.loc[:, 'A':'A']` `df.loc[:, ['A']]` `df.iloc[:, 0:1]` `df.iloc[:, mask]` `df.iloc[:, [0]]` `df.take([0], axis=1)` `df.loc[:, 'A']` WRONG!! Returns Series
Series: Column ‘A’ 0 1 1 2 2 3 3 4 Name: A, dtype: int64	`df['A']` `df.loc[:, 'A']` `df.loc[:, mask]` `df.iloc[:, 0]` `df.iloc[:, mask]`
Series: 2nd Row with index 1 A 2 B 6 C 10 D 14 words cat jumpy 2 Name: 1, dtype: object0	`df.loc[1, :]` `df.iloc[1, :]`
Scalar Value: `7`	`df.loc[2, 'B']` `df.iloc[2, 1]`

Unsorted Index

DataFrame	Result	API
		`df[0:1]` `df.loc[9:9]` `df.iloc[0:1]` `df.take([0])`
		`df[1:3]` `df.loc[2:11]` `df.iloc[1:3]` `df.take([1,2])`

String Index

DataFrame	Result	API
		`df[0:1]` `df.loc['ax']` `df.iloc[0:1]` `df.take([0])`
		`df[1:3]` `df.loc['cat':'dog']` `df.iloc[1:3]` `df.take([1,2])`

df.take()#

df.take() is primarily a positional based indexer/slicer. It will return the columns or rows in the position specified by the inputs, which will either be a single value, list, or slice. It has another parameter, axis, which is set to 0 by default. The default parameter refers to the index/rows, meaning we are selecting rows to extract. If you set axis to 1, then it will select columns.

Series

Since df.take() is positional based, you cannot input labels. They must be integers. Additionally, it ignores whatever custom index you may have applied, even if it is an integer index. It will use the underlying positional index to collect the data. Remember, you need to include the square brackets, because the function takes only lists.

'''
Series based on a row
'''
s1 = df.take([1]) # returns a series representing row 1

'''
Series based on a column
'''
s2 = df.take([1], axis=1) # returns a series representing column 1

Dataframe

d1 = df.take([1, 3, 0]) # returns a dataframe containing rows 1, 3, and 0

# Setting `axis` to 1 indicates whether the list of indexes is for **rows** or **columns**
d2 = df.take([1, 3, 0], axis=1) # returns a dataframe containing columns 1, 3, and 0

Exceptions and Oddities

Make sure to look out for IndexError exceptions, which will arise if you input a number that is not in the index. This means you can’t use len(df) as a parameter in order to get the last row of the dataframe.
You can use negative indices. This works similarly to slicing in lists in base python, where the -1 index corresponds to the last value of the set.
df.take([0]) returns the first (0 position) row.

Footnotes#

[1] Vector is intended to a data structure that can hold multiple values such as a list, slice, or array, even when these structures contain a single value.
[2] Scalar is intended to mean a single value such as a string or integer. It is NOT a list, slice, or array.

DataFrame Indexing Methods

Contents

DataFrame Indexing Methods#

Argument Types#

API Details#

Examples#

df.take()#

Footnotes#