Pandas

Commands in Pandas which I have used or would like to use. This is not meant to be complete, but cover some common usage situations or some tricky or obscure syntax.

Imports

import numpy as np
import pandas as pd

Types

pd.Series
pd.GroupBy
pd.DataFrame

Create dataframe

Column-input

With data as a dict where key is column name and value is a list of values.

data = {"a" : [4 ,5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}
df = pd.DataFrame(
    data,
    index=[1, 2, 3],
)

Row input

With data as a 2-D list - a list of rows where each row is a list. Columns must be named.

data = [[4, 7, 10], [5, 8, 11], [6, 9, 12]]
df = pd.DataFrame(
    data,
    indexes=[1, 2, 3],
    columns=['a', 'b', 'c'],
)

Select and filter

Get column.

df['foo']

df.foo

Get columns.

df[['foo', 'bar']]

Get columns of name matching pattern.

df.filter(regex='foo')

Select range of named columns, inclusive.

df.loc[:, 'foo':'bar']

Select individual columns by index.

df.iloc[:, [0, 2 ,5]]

Also filter by condition.

df.loc[df['a'] > 10, ['a','c']]

Summarize

Row count.

len(df)

Get pairs of values and frequency, sorted by frequency descending.

df['foo'].value_counts()

e.g.

df = pd.Series([2, 2, 5, 3, 4, np.nan])
df.value_counts()
# 2.0    2
# 4.0    1
# 3.0    1
# 5.0    1

Distinct values. Does not sort.

df['foo'].nunique()

e.g.

> pd.Series([2, 4, 3, 3], name='P').unique()
# => array([2, 4, 3], dtype=int64)

Basic descriptive statistics for each column

df.describe()

Functions

Calculations on a column or possibly dataframe.

.count()
.sum()
.min()
.max()
.mean()
.median()
.var()
.std()
.quantile([0.25,0.75]
.apply(function)

Group by

Get GroupBy object.

Column.

df.groupby(by="foo")

Index.

df.groupby(level="foo")

Size of group.

.size()

Plot

Histogram for each column.

df.plot.hist()

Scatter chart using pairs of points.

df.plot.scatter(x='foo', y='bar')

Read and write files

See pandas: How to Read and Write Files on Real Python blog.

CSV

$ pip install pandas

Write:

df.to_csv('data.csv')

Read:

df = pd.read_csv('data.csv', index_col=0)

Excel

$ pip install xlwt openpyxl xlsxwriter xlrd

Write:

df.to_excel('data.xlsx')