Useful API#
Both DataFrame and Series#
API |
Comments |
---|---|
|
Outputs useful statistical information about the data |
|
Returns only unique rows/values in the data. TODO: validate on DF |
|
Removes all duplicate rows/values in the data |
|
Replaces all values that are NaN with True. False otherwise |
|
Replaces all values that are NaN with False. Tue otherwise |
|
Removes all rows that have any NaN value |
|
Fills all values that are NaN with the value given |
|
Applies the function to all values (TODO: Validate works on DF) |
|
The name of the index |
|
Gets a count of the unique rows/values in the data |
|
Makes a copy of the DF. Should be used judiciously |
|
Returns the rows found in the iterable TODO: Series? |
|
Not the same as ‘append’. TODO: Series. Describe better. |
Question: How to append multiple columns to a DataFrame?
DataFrame only#
API |
Comments |
---|---|
|
An iterable of all the column names |
|
Returns a value, Series or DataFrame filtered as specified |
|
Sets the index of the DF to ‘col’ |
|
Restores the index of the DF to the default ordinal values |
|
Sorts the DF by the index. TODO: is this Series too? |
|
Sorts the DF by the values in column, ascending |
|
Returns a DataFrame, sorted by ‘col’, descending, of size count |
|
Returns a DataFrame, sorted by ‘col’, ascending, of size count |
|
Returns a Series where the index are the values in ‘groupcol’, the values are computed from the column ‘value_col’ and the computation is ‘xxx’ which is one of many mathematical functions. |
|
Uses the dictionary to rename the columns |
Available on DataFrame, Series and Groupby Result#
Math API |
Comments |
---|---|
|
Finds the minimum value |
|
Finds the maximum value |
|
Finds the average/mean value |
|
Finds the index of the minimum value |
|
Finds the index of the maximum value |
Math methods#
add
, sub
, mul
, div
, mod
, pow
# do element-by-element math
result = df1.add(df2)
Series only#
API |
Comments |
---|---|
|
Returns a sorted Series, descending, of size count |
|
Returns a sorted Series, ascending, of size count |
|
Returns a boolean Series, True if the value is found in the iterable |
read_csv#
For full documentation, see Pandas API reference
Formatting the Date:
In Pandas version 2.0, there is a date_format
argument. But the version that NCHS has installed is older (v 1.3) and does not. In Jupyter Notebook, you can see your version using: print(pd._version)
.
Older versions support data_parser=fun
. fun is a function to parse the date. It is recommended to use pd.to_datetime(date_string, format='%Y-%m-%d')
.
Here is an example that will parse MM-YYYY-DD
formatted dates such as 03-2023-31
.
df = pd.read_csv('data.csv', index_col='date', parse_dates=True,
date_parser=lambda s: pd.to_datetime(s, format='%m-%Y-%d'))
When parse_dates=True
, then we will parse the index as our date.
parse_dates
parse_dates: bool, list of Hashable, list of lists or dict of {Hashablelist}, default False
The line above has the following behavior:
bool. If True -> try parsing the index.
list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
list of list. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
dict, e.g.
{'foo' : [1, 3]}
-> parse columns 1, 3 as date and call resultfoo
If a column or index cannot be represented as an array of datetime, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use to_datetime() after read_csv().
Iterating a DataFrame#
There are several ways to iterate through a DataFrame.
# iterates through the columns
for n in df2:
print(n)
# iterates through the tuples (col_name, col_series)
for col_name, col_series in df2.iteritems():
print(col_name)
print(col_series)
# identical to the above
for col_name, col_series in df2.items():
print(col_name)
print(col_series)
DataFrame Plot API Summary#
For full documentation, see Pandas DataFrame plot.
Documentation Notes
Note that in this documentation:
label
means the column name in the DataFrame.position
is a integer representing the ordinal of the column.
x : label or position, default None
(use the index as the x-axis)
y : label, position, list of labels or positions, default None.
kind : One of the following strings describing the kind of plot to produce.
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
‘scatter’ : scatter plot (DataFrame only)
‘hexbin’ : hexbin plot (DataFrame only) ax : The axes object to plot to subplot : bool,
True
means to make separate subplots for each column.
There are many more arguments such as: title
, grid
, legend
, style
, xticks
, yticks
, xlim
, ylim
, xlabel
, ylabel
, rot
, colormap
, table
, stacked
import pandas as pd
df = pd.DataFrame()
help(df.plot)