Pandas
Commands in Pandas which I have used or would like to use. This is not meant to be complete, but cover some common usage situations or some tricky or obscure syntax.
See also:
- Pandas Cheat Sheet PDF on Pandas main site.
- Pandas on Learn to Code.
Imports
import numpy as np
import pandas as pd
Types
pd.Series
pd.GroupBy
pd.DataFrame
Create dataframe
Column-input
With data
as a dict
where key is column name and value is a list of values.
data = {"a" : [4 ,5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}
df = pd.DataFrame(
data,
index=[1, 2, 3],
)
Row input
With data
as a 2-D list - a list of rows where each row is a list. Columns must be named.
data = [[4, 7, 10], [5, 8, 11], [6, 9, 12]]
df = pd.DataFrame(
data,
indexes=[1, 2, 3],
columns=['a', 'b', 'c'],
)
Select and filter
Get column.
df['foo']
df.foo
Get columns.
df[['foo', 'bar']]
Get columns of name matching pattern.
df.filter(regex='foo')
Select range of named columns, inclusive.
df.loc[:, 'foo':'bar']
Select individual columns by index.
df.iloc[:, [0, 2 ,5]]
Also filter by condition.
df.loc[df['a'] > 10, ['a','c']]
Summarize
Row count.
len(df)
Get pairs of values and frequency, sorted by frequency descending.
df['foo'].value_counts()
e.g.
df = pd.Series([2, 2, 5, 3, 4, np.nan])
df.value_counts()
# 2.0 2
# 4.0 1
# 3.0 1
# 5.0 1
Distinct values. Does not sort.
df['foo'].nunique()
e.g.
> pd.Series([2, 4, 3, 3], name='P').unique()
# => array([2, 4, 3], dtype=int64)
Basic descriptive statistics for each column
df.describe()
Functions
Calculations on a column or possibly dataframe.
.count()
.sum()
.min()
.max()
.mean()
.median()
.var()
.std()
.quantile([0.25,0.75]
.apply(function)
Group by
Get GroupBy
object.
Column.
df.groupby(by="foo")
Index.
df.groupby(level="foo")
Size of group.
.size()
Plot
Histogram for each column.
df.plot.hist()
Scatter chart using pairs of points.
df.plot.scatter(x='foo', y='bar')
Read and write files
See pandas: How to Read and Write Files on Real Python blog.
CSV
$ pip install pandas
Write:
df.to_csv('data.csv')
Read:
df = pd.read_csv('data.csv', index_col=0)
Excel
$ pip install xlwt openpyxl xlsxwriter xlrd
Write:
df.to_excel('data.xlsx')