Useful API#

Both DataFrame and Series#

API

Comments

describe()

Outputs useful statistical information about the data

unique()

Returns the unique values in a Series, as an array. DataFrame has no unique(); use drop_duplicates() to get unique rows

drop_duplicates()

Removes all duplicate rows/values in the data

isnull()

Returns a boolean object of the same shape: True where the value is NaN, False otherwise

notnull()

Returns a boolean object of the same shape: False where the value is NaN, True otherwise

dropna()

Removes all rows that have any NaN value

fillna(value)

Fills all values that are NaN with the value given

apply(fn)

Applies the function to each value of a Series. On a DataFrame, the function is applied to each column (or to each row with axis=1)

index

The row labels of the data, as an Index object. The name of the index is index.name

nunique()

Gets a count of the unique values in a Series. On a DataFrame, returns the count of unique values in each column

copy()

Makes a copy of the DF or Series (a deep copy by default). Copies cost memory, so it should be used judiciously

take(iterable)

Returns the rows at the integer positions given in the iterable. Works on Series too

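pd.concat()

A top-level function, called as pd.concat([a, b]), that joins Series or DataFrames along an axis (rows by default; axis=1 for columns). Not the same as the append() method, which was removed in Pandas 2.0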
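
A minimal sketch of several calls from the table above. The Series `s` and DataFrame `df` and their values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Illustrative data only: one missing value and one repeated value/row.
s = pd.Series([3.0, 1.0, np.nan, 3.0])
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

missing_mask = s.isnull()            # boolean Series, True where the value is NaN
without_nan = s.dropna()             # drops the NaN entry
filled = s.fillna(0.0)               # NaN replaced by 0.0

unique_values = s.dropna().unique()  # array of distinct values (Series only)
unique_count = s.nunique()           # NaN is not counted by default
per_column_counts = df.nunique()     # Series: distinct values per column

deduplicated = df.drop_duplicates()  # keeps the first of each repeated row
doubled = s.apply(lambda v: v * 2)   # function applied to each value
column_max = df[["a"]].apply(lambda col: col.max())  # on a DataFrame, fn receives each column
first_and_last = df.take([0, 2])     # rows by integer position
```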

Question: How to append multiple columns to a DataFrame?
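
One possible answer to the question above, as a sketch rather than the only approach: pd.concat with axis=1 joins DataFrames side by side, matching rows by index label, and assign() adds several new columns in one call. The frames `left` and `right` and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical frames that share the same default index (0, 1, 2).
left = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({"b": [10, 20, 30], "c": [100, 200, 300]})

# Option 1: concatenate along the column axis; rows are matched by index label.
combined = pd.concat([left, right], axis=1)

# Option 2: assign returns a new DataFrame with the extra columns added.
assigned = left.assign(b=right["b"], c=right["c"])
```

Both options leave `left` unchanged and return a new DataFrame.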

DataFrame only#

API

Comments

columns

An iterable of all the column names

loc[rows, cols]

Returns a value, Series or DataFrame filtered as specified

set_index('col')

Sets the index of the DF to ‘col’

reset_index()

Restores the index of the DF to the default ordinal values (0, 1, 2, ...). The old index becomes a column unless drop=True is passed

sort_index()

Sorts the DF by the index. Also available on Series

sort_values(by='col')

Sorts the DF by the values in column ‘col’, ascending by default

nlargest(count, 'col')

Returns a DataFrame, sorted by ‘col’, descending, of size count

nsmallest(count, 'col')

Returns a DataFrame, sorted by ‘col’, ascending, of size count

groupby('groupcol')['value_col'].xxx()

Returns a Series whose index is the distinct values in ‘groupcol’ and whose values are computed from the column ‘value_col’. The computation ‘xxx’ is one of many aggregation functions, such as sum(), mean() or max().

rename(columns=dict)

Uses the dictionary to rename the columns. Note that rename(dict) without columns= renames index labels instead
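
The DataFrame calls above can be tried on a small frame. This is a sketch with made-up data; the column names `city`, `year` and `sales` are assumptions chosen for the example.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "year": [2021, 2021, 2022, 2022],
    "sales": [10, 40, 30, 20],
})

by_sales = df.sort_values(by="sales")                    # ascending by default
top_two = df.nlargest(2, "sales")                        # two rows with the largest sales
boston_sales = df.loc[df["city"] == "Boston", "sales"]   # boolean row mask, one column -> Series

indexed = df.set_index("city")       # 'city' becomes the row labels
restored = indexed.reset_index()     # back to 0..n-1; 'city' returns as a column

totals = df.groupby("city")["sales"].sum()        # Series indexed by city
renamed = df.rename(columns={"sales": "units"})   # rename columns via a mapping
```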

Available on DataFrame, Series and Groupby Result#

Math API

Comments

min()

Finds the minimum value

max()

Finds the maximum value

mean()

Finds the average/mean value

idxmin()

Finds the index of the minimum value

idxmax()

Finds the index of the maximum value
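
A short sketch of how the same math calls behave on a Series, a DataFrame and a groupby result. The `team` and `score` data are invented for the example.

```python
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame({
    "team": ["red", "red", "blue", "blue"],
    "score": [7, 3, 9, 5],
})

scores = df["score"]
lowest = scores.min()              # Series -> a single value
lowest_label = scores.idxmin()     # index label where that value sits

column_means = df[["score"]].mean()                 # DataFrame -> one value per column
best_per_team = df.groupby("team")["score"].max()   # groupby -> one value per group
best_row_per_team = df.groupby("team")["score"].idxmax()  # index label of each group's max
```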

Math methods#

add, sub, mul, div, mod, pow

# do element-by-element math, aligned on row and column labels (same as df1 + df2)
result = df1.add(df2)
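
One reason to use the method form instead of the operator is that the methods accept extra arguments such as fill_value. A minimal sketch with two made-up Series whose labels only partly overlap:

```python
import pandas as pd

# Illustrative Series; 'z' exists only in a.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["x", "y"])

plain_sum = a + b                     # 'z' has no partner in b, so the result there is NaN
filled_sum = a.add(b, fill_value=0)   # the missing partner is treated as 0
```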

Series only#

API

Comments

nlargest(count)

Returns a sorted Series, descending, of size count

nsmallest(count)

Returns a sorted Series, ascending, of size count

isin(iterable)

Returns a boolean Series, True where the value is found in the iterable. (DataFrame also has isin(), which returns a boolean DataFrame)
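
A small sketch of the Series calls above, using an invented Series. The boolean result of isin() is typically used as a filter.

```python
import pandas as pd

# Invented values for illustration.
s = pd.Series([5, 1, 9, 3], index=["a", "b", "c", "d"])

top_two = s.nlargest(2)        # largest first
bottom_two = s.nsmallest(2)    # smallest first
mask = s.isin([1, 9])          # boolean Series with the same index as s
selected = s[mask]             # keep only the matching values
```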

read_csv#

For full documentation, see Pandas API reference

Formatting the Date:
In Pandas version 2.0, read_csv has a date_format argument. The version that NCHS has installed is older (v 1.3) and does not have it. In Jupyter Notebook, you can see your version using: print(pd.__version__).

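Older versions support date_parser=fun, where fun is a function that parses the date string. (date_parser is deprecated as of Pandas 2.0 in favor of date_format.) It is recommended to build fun around pd.to_datetime(date_string, format='%Y-%m-%d'), changing the format to match your data.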

Here is an example that will parse MM-YYYY-DD formatted dates such as 03-2023-31.

df = pd.read_csv('data.csv', index_col='date', parse_dates=True, 
                 date_parser=lambda s: pd.to_datetime(s, format='%m-%Y-%d'))

With parse_dates=True, read_csv parses the index as dates. Here the index is the date column, chosen by index_col='date'.
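
An alternative that does not depend on the Pandas version, sketched under the assumption that the file has a `date` column in MM-YYYY-DD form: read the column as text, then convert it with pd.to_datetime. The inline CSV text stands in for data.csv and its contents are made up.

```python
import io
import pandas as pd

# Stand-in for data.csv; the contents are invented for illustration.
csv_text = "date,value\n03-2023-31,1.5\n04-2023-01,2.5\n"

df = pd.read_csv(io.StringIO(csv_text))
# Convert after reading, which needs neither date_parser nor date_format.
df["date"] = pd.to_datetime(df["date"], format="%m-%Y-%d")
df = df.set_index("date")
```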

Iterating a DataFrame#

There are several ways to iterate through a DataFrame.

# iterates through the column names
for n in df2:
    print(n)

# iterates through the tuples (col_name, col_series)
# note: iteritems() was removed in Pandas 2.0, so this only works on older versions
for col_name, col_series in df2.iteritems():
    print(col_name)
    print(col_series)

# identical to the above, and works on both old and new versions
for col_name, col_series in df2.items():
    print(col_name)
    print(col_series)
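
The loops above walk the columns. To walk the rows instead, iterrows() and itertuples() are the usual options. A minimal sketch, with a made-up `df2`:

```python
import pandas as pd

# Made-up frame for illustration.
df2 = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

# iterrows yields (index_label, row_as_Series) pairs
labels = []
for label, row in df2.iterrows():
    labels.append((label, row["name"]))

# itertuples yields one namedtuple per row; fields are accessed as attributes
ages = [record.age for record in df2.itertuples(index=False)]
```

Row-by-row loops are slow on large data, so a vectorized call or apply() is often a better fit when one exists.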

DataFrame Plot API Summary#

For full documentation, see Pandas DataFrame plot.

x : label or position, default None (use the index as the x-axis)
y : label, position, list of labels or positions, default None.
kind : One of the following strings describing the kind of plot to produce.

  • ‘line’ : line plot (default)

  • ‘bar’ : vertical bar plot

  • ‘barh’ : horizontal bar plot

  • ‘hist’ : histogram

  • ‘box’ : boxplot

  • ‘kde’ : Kernel Density Estimation plot

  • ‘density’ : same as ‘kde’

  • ‘area’ : area plot

  • ‘pie’ : pie plot

  • ‘scatter’ : scatter plot (DataFrame only)

  • ‘hexbin’ : hexbin plot (DataFrame only)

ax : The axes object to plot to.
subplots : bool, True means to make separate subplots for each column.

There are many more arguments such as: title, grid, legend, style, xticks, yticks, xlim, ylim, xlabel, ylabel, rot, colormap, table, stacked

import pandas as pd
df = pd.DataFrame()
help(df.plot)