NumPy and pandas are the basic data science tools in the Python environment. Having a good understanding of their capabilities and how they process data is imperative to writing optimal code. This post provides an introductory overview and a refresher for those coming back to the libraries after a break. The end of the post covers external interfaces for speeding up code execution and performing more sophisticated matrix operations.
Pandas Series and Dataframe
These are the basic data structures used by pandas. Pandas derives its name from pan(el) da(ta): panel data is an econometrics term for data recorded over time. Pandas' author, Wes McKinney, was working in the financial industry when he began writing it.
- Series objects: 1D array, similar to a column in a spreadsheet
- DataFrame objects: 2D table, similar to a spreadsheet, dictionary of Series
- Panel objects: dictionary of DataFrames, similar to sheets in MS Excel (removed in pandas 0.25.0)
Config
Let’s load our modules. Numpy is always used with Pandas because it is an underlying dependency.
import numpy as np
import pandas as pd
We will create a few DataFrames to work with throughout the post.
A Series is similar to R's vector c(), and corresponds to a single column within a pandas DataFrame.
s = pd.Series([1,3,5,np.nan,6,8])
dates = pd.date_range('20130101', periods=6)
df1 = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df1
                   A         B         C         D
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894
2013-01-05 -1.184577 -1.383917  0.798301  0.913562
2013-01-06  0.024899  0.252519 -0.833479 -0.109767
df2 = pd.DataFrame({ 'E' : 1.,
....: 'F' : pd.Timestamp('20130102'),
....: 'G' : pd.Series(1,index=list(range(4)),dtype='float32'),
....: 'H' : np.array([3] * 4,dtype='int32'),
....: 'I' : pd.Categorical(["test","train","test","train"]),
....: 'J' : 'foo' })
df2
     E          F    G  H      I    J
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
tmpdf2 = df2.copy(deep=True)
tmpdf2.set_index(tmpdf2.F, inplace=True)
df3 = df1.join(tmpdf2, how='outer')
#or: df3 = pd.merge(df1, tmpdf2, how='outer', left_index=True, right_index=True)
df3
                   A         B         C         D    E          F    G    H      I    J
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078  NaN        NaT  NaN  NaN    NaN  NaN
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0 2013-01-02  1.0  3.0   test  foo
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0 2013-01-02  1.0  3.0  train  foo
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0 2013-01-02  1.0  3.0   test  foo
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0 2013-01-02  1.0  3.0  train  foo
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  NaN        NaT  NaN  NaN    NaN  NaN
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  NaN        NaT  NaN  NaN    NaN  NaN
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  NaN        NaT  NaN  NaN    NaN  NaN
2013-01-06  0.024899  0.252519 -0.833479 -0.109767  NaN        NaT  NaN  NaN    NaN  NaN
df4 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
....: 'foo', 'bar', 'foo', 'foo'],
....: 'B': ['one', 'one', 'two', 'three',
....: 'two', 'two', 'one', 'three'],
....: 'C': np.random.randn(8),
....: 'D': np.random.randn(8)})
df4
     A      B         C         D
0  foo    one -0.870843  1.255164
1  bar    one -0.678605  0.569444
2  foo    two  0.270890  0.692131
3  bar  three -1.008900  1.841500
4  foo    two  0.381278  0.918376
5  bar    two -0.666534 -0.483262
6  foo    one -0.132010  1.949751
7  foo  three  1.517630 -0.458782
Index and axes
c = df1.columns.tolist()
c
['A', 'B', 'C', 'D']
#calculate sum (across rows) for each column
df1.sum(axis=0)
A -5.885887
B -0.973229
C -3.602993
D -1.248308
dtype: float64
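A quick way to remember the axis argument: axis=0 collapses the rows (one result per column), while axis=1 collapses the columns (one result per row). A minimal sketch with a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

col_sums = df.sum(axis=0)  # collapse the rows -> one value per column
row_sums = df.sum(axis=1)  # collapse the columns -> one value per row

print(col_sums.tolist())  # [6, 60]
print(row_sums.tolist())  # [11, 22, 33]
```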
pd.concat([df1,df3], axis=0, sort=False).head()
                   A         B         C         D   E    F   G   H    I    J
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078 NaN  NaT NaN NaN  NaN  NaN
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683 NaN  NaT NaN NaN  NaN  NaN
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605 NaN  NaT NaN NaN  NaN  NaN
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894 NaN  NaT NaN NaN  NaN  NaN
2013-01-05 -1.184577 -1.383917  0.798301  0.913562 NaN  NaT NaN NaN  NaN  NaN
Data types
NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy()
, pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
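To see this in practice, compare a homogeneous frame with a mixed one (a small sketch with throwaway frames):

```python
import pandas as pd

floats = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})   # all float columns
mixed = pd.DataFrame({'a': [1.0, 2.0], 'b': ['x', 'y']})    # float + string columns

print(floats.to_numpy().dtype)  # float64 -- one native dtype covers every column
print(mixed.to_numpy().dtype)   # object  -- falls back to boxed Python objects
```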
arr = np.array([1,2,3])
pd.Series(arr).values
array([1, 2, 3])
Locate by column label, index
Timestamp('2013-01-01 00:00:00', freq='D')
A -1.360827
B -0.871347
C -1.603602
D 0.073078
Name: 2013-01-01 00:00:00, dtype: float64
                   A         B
2013-01-01 -1.360827 -0.871347
2013-01-02 -0.210426  0.511720
2013-01-03 -1.829322  0.343560
2013-01-04 -1.325634  0.174237
2013-01-05 -1.184577 -1.383917
2013-01-06  0.024899  0.252519
2013-01-01 -1.360827
2013-01-02 -0.210426
2013-01-03 -1.829322
2013-01-04 -1.325634
2013-01-05 -1.184577
Freq: D, Name: A, dtype: float64
A -1.360827
B -0.871347
C -1.603602
D 0.073078
Name: 2013-01-01 00:00:00, dtype: float64
2013-01-01 -1.360827
2013-01-02 -0.210426
2013-01-03 -1.829322
2013-01-04 -1.325634
2013-01-05 -1.184577
2013-01-06 0.024899
Freq: D, Name: A, dtype: float64
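The unlabeled outputs above come from label- and position-based selection (.loc, .iloc, and plain column access). A minimal sketch of the main idioms with a throwaway frame (df and dates here are stand-ins, not the post's df1):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

row = df.loc[dates[0]]        # one row by label -> Series indexed by column name
cols = df.loc[:, ['A', 'B']]  # all rows, two columns by label
cell = df.iloc[0, 0]          # scalar by integer position
col = df['A']                 # one column -> Series indexed by date

print(row.tolist())           # [0, 1, 2, 3]
print(cell, col.iloc[-1])     # 0 20
```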
Locate by boolean
2013-01-01 False
2013-01-02 False
2013-01-03 False
2013-01-04 False
2013-01-05 False
2013-01-06 True
Freq: D, Name: A, dtype: bool
                   A         B         C         D
2013-01-06  0.024899  0.252519 -0.833479 -0.109767
df5 = df1.copy()
df5['K'] = ['one', 'one','two','three','four','three']
df5['K']
2013-01-01 one
2013-01-02 one
2013-01-03 two
2013-01-04 three
2013-01-05 four
2013-01-06 three
Freq: D, Name: K, dtype: object
df5[df5['K'].isin(['two','four'])]
                   A         B         C         D     K
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605   two
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  four
Sort values
#sort
df1.sort_values([c[0]], ascending=False)
                   A         B         C         D
2013-01-06  0.024899  0.252519 -0.833479 -0.109767
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683
2013-01-05 -1.184577 -1.383917  0.798301  0.913562
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605
Subset
                   A         B         C         D
2013-01-06  0.024899  0.252519 -0.833479 -0.109767
Setting values
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
df1['F'] = s1
df1
                   A         B         C         D    F
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078  NaN
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  2.0
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  3.0
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  4.0
2013-01-06  0.024899  0.252519 -0.833479 -0.109767  5.0
df5 = df1.copy()
df5[df5 > 0] = -df5
df5
                   A         B         C         D    F
2013-01-01 -1.360827 -0.871347 -1.603602 -0.073078  NaN
2013-01-02 -0.210426 -0.511720 -1.037810 -0.746683 -1.0
2013-01-03 -1.829322 -0.343560 -0.005303 -1.312605 -2.0
2013-01-04 -1.325634 -0.174237 -0.921101 -0.065894 -3.0
2013-01-05 -1.184577 -1.383917 -0.798301 -0.913562 -4.0
2013-01-06 -0.024899 -0.252519 -0.833479 -0.109767 -5.0
df5.index.to_series().apply(pd.to_datetime)
2013-01-01 2013-01-01
2013-01-02 2013-01-02
2013-01-03 2013-01-03
2013-01-04 2013-01-04
2013-01-05 2013-01-05
2013-01-06 2013-01-06
Freq: D, dtype: datetime64[ns]
df5.index.to_series().dt.year
2013-01-01 2013
2013-01-02 2013
2013-01-03 2013
2013-01-04 2013
2013-01-05 2013
2013-01-06 2013
Freq: D, dtype: int64
Calculations
df1.apply(np.cumsum)
                   A         B         C         D     F
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078   NaN
2013-01-02 -1.571253 -0.359628 -2.641412 -0.673604   1.0
2013-01-03 -3.400575 -0.016067 -2.646715 -1.986209   3.0
2013-01-04 -4.726209  0.158169 -3.567816 -2.052103   6.0
2013-01-05 -5.910786 -1.225748 -2.769514 -1.138541  10.0
2013-01-06 -5.885887 -0.973229 -3.602993 -1.248308  15.0
df1.apply(lambda x: x.max() - x.min())
A 1.854221
B 1.895637
C 2.401903
D 2.226167
F 4.000000
dtype: float64
# string ops - vectorized
df1.iloc[:,1].astype('str').str.lower()
2013-01-01 -0.8713474062085093
2013-01-02 0.5117196166006319
2013-01-03 0.3435604532681823
2013-01-04 0.17423665911328937
2013-01-05 -1.3839171647406576
2013-01-06 0.2525192847352602
Freq: D, Name: B, dtype: object
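Lowercasing floats isn't very illuminating; the same vectorized .str accessor is more typical on real text (a small sketch with a throwaway Series):

```python
import pandas as pd

s = pd.Series(['Alpha', 'BETA ', ' gamma'])

clean = s.str.strip().str.lower()  # chained, element-wise string methods
print(clean.tolist())              # ['alpha', 'beta', 'gamma']
print(s.str.len().tolist())        # [5, 5, 6] -- whitespace counts toward length
```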
Missing values
# missing data
df1.dropna(how='any')
                   A         B         C         D    F
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  2.0
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  3.0
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  4.0
2013-01-06  0.024899  0.252519 -0.833479 -0.109767  5.0
df1.dropna(subset=['F'], how='all')
                   A         B         C         D    F
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  2.0
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  3.0
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  4.0
2013-01-06  0.024899  0.252519 -0.833479 -0.109767  5.0
df1.fillna(value=5)
                   A         B         C         D    F
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078  5.0
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  2.0
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  3.0
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  4.0
2013-01-06  0.024899  0.252519 -0.833479 -0.109767  5.0
pd.isna(df1)
                A      B      C      D      F
2013-01-01  False  False  False  False   True
2013-01-02  False  False  False  False  False
2013-01-03  False  False  False  False  False
2013-01-04  False  False  False  False  False
2013-01-05  False  False  False  False  False
2013-01-06  False  False  False  False  False
Merge and shape
#note: DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df2], ignore_index=True, sort=True) instead
df1.append(df2, ignore_index=True, sort=True).head()
          A         B         C         D   E    F   G   H    I    J
0 -1.360827 -0.871347 -1.603602  0.073078 NaN  NaN NaN NaN  NaN  NaN
1 -0.210426  0.511720 -1.037810 -0.746683 NaN    1 NaN NaN  NaN  NaN
2 -1.829322  0.343560 -0.005303 -1.312605 NaN    2 NaN NaN  NaN  NaN
3 -1.325634  0.174237 -0.921101 -0.065894 NaN    3 NaN NaN  NaN  NaN
4 -1.184577 -1.383917  0.798301  0.913562 NaN    4 NaN NaN  NaN  NaN
pieces = [df1[:3], df1[3:7], df1[7:]]
pd.concat(pieces).head()
                   A         B         C         D    F
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078  NaN
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  2.0
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  3.0
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  4.0
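Splitting a frame into pieces and re-concatenating along the row axis is lossless; a quick sketch with a throwaway frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(6, 2), columns=['a', 'b'])

pieces = [df[:3], df[3:]]     # split on row position
rebuilt = pd.concat(pieces)   # stack the pieces back along axis 0

print(rebuilt.equals(df))     # True
```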
GroupBy
df4.groupby(['A','B']).sum()
                  C         D
A   B
bar one   -0.678605  0.569444
    three -1.008900  1.841500
    two   -0.666534 -0.483262
foo one   -1.002854  3.204916
    three  1.517630 -0.458782
    two    0.652168  1.610507
Stacked
stacked = df4.stack()
stacked.index[:4]
MultiIndex([(0, 'A'),
(0, 'B'),
(0, 'C'),
(0, 'D')],
)
stacked.head()
0  A          foo
   B          one
   C    -0.870843
   D      1.25516
1  A          bar
dtype: object
stacked.unstack().head()
     A      B          C         D
0  foo    one  -0.870843   1.25516
1  bar    one  -0.678605  0.569444
2  foo    two    0.27089  0.692131
3  bar  three    -1.0089    1.8415
4  foo    two   0.381278  0.918376
Pivot table
     A      B         C         D
0  foo    one -0.870843  1.255164
1  bar    one -0.678605  0.569444
2  foo    two  0.270890  0.692131
3  bar  three -1.008900  1.841500
4  foo    two  0.381278  0.918376
5  bar    two -0.666534 -0.483262
6  foo    one -0.132010  1.949751
7  foo  three  1.517630 -0.458782
# pivot tables (http://pbpython.com/pandas-pivot-table-explained.html)
df4["A"] = df4["A"].astype("category")
df4["A"] = df4["A"].cat.set_categories(["foo", "bar"]) #the inplace=True form is deprecated
pd.pivot_table(df4,index=['A','B']) #same as:
df4.groupby(['A','B']).mean()
df4.groupby(['A','B'])['D'].agg([np.sum,np.std])
                sum       std
A   B
foo one    3.204916  0.491147
    three -0.458782       NaN
    two    1.610507  0.159980
bar one    0.569444       NaN
    three  1.841500       NaN
    two   -0.483262       NaN
pd.pivot_table(df4,index=['A','B'], values=['D'], aggfunc=np.sum)
                  D
A   B
foo one    3.204916
    three -0.458782
    two    1.610507
bar one    0.569444
    three  1.841500
    two   -0.483262
pd.pivot_table(df4,index=['A'], columns=['B'], values=['D'], aggfunc=np.sum)
            D
B         one     three       two
A
foo  3.204916 -0.458782  1.610507
bar  0.569444  1.841500 -0.483262
pd.pivot_table(df4,index=['A'], columns=['B'], values=['D'], aggfunc=np.sum, fill_value=0,margins=True)
            D
B         one     three       two       All
A
foo  3.204916 -0.458782  1.610507  4.356641
bar  0.569444  1.841500 -0.483262  1.927682
All  3.774359  1.382718  1.127246  6.284323
pd.pivot_table(df4, index=['A', 'B'], columns=['C'], values='D')
C          -1.008900  -0.870843  -0.678605  -0.666534  -0.132010  0.270890  0.381278  1.517630
A   B
foo one          NaN   1.255164        NaN        NaN   1.949751       NaN       NaN       NaN
    three        NaN        NaN        NaN        NaN        NaN       NaN       NaN -0.458782
    two          NaN        NaN        NaN        NaN        NaN  0.692131  0.918376       NaN
bar one          NaN        NaN   0.569444        NaN        NaN       NaN       NaN       NaN
    three   1.841500        NaN        NaN        NaN        NaN       NaN       NaN       NaN
    two          NaN        NaN        NaN  -0.483262        NaN       NaN       NaN       NaN
pd.pivot_table(df4, index=['A'], columns=['B','C'], values='D')
B         one                           three                two
C   -0.870843 -0.678605 -0.132010  -1.008900  1.517630  -0.666534  0.270890  0.381278
A
foo  1.255164       NaN  1.949751        NaN -0.458782        NaN  0.692131  0.918376
bar       NaN  0.569444       NaN   1.841500       NaN  -0.483262       NaN       NaN
Timeseries
rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.head()
2012-01-01 00:00:00 114
2012-01-01 00:00:01 279
2012-01-01 00:00:02 32
2012-01-01 00:00:03 495
2012-01-01 00:00:04 433
Freq: S, dtype: int64
## frequency conversion (e.g., converting secondly data into 5-minutely data)
ts.resample('5Min').sum()
2012-01-01 24652
Freq: 5T, dtype: int64
## time zones
ts_utc = ts.tz_localize('UTC')
ts_utc.tz_convert('US/Eastern').head()
2011-12-31 19:00:00-05:00 114
2011-12-31 19:00:01-05:00 279
2011-12-31 19:00:02-05:00 32
2011-12-31 19:00:03-05:00 495
2011-12-31 19:00:04-05:00 433
Freq: S, dtype: int64
## format
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ps = ts.to_period()
ps.to_timestamp()
2012-01-01 2.490119
2012-02-01 0.237758
2012-03-01 0.249177
2012-04-01 -0.015938
2012-05-01 -0.314638
Freq: MS, dtype: float64
## timestamp conversion
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()
1990-03-01 09:00 -0.546684
1990-06-01 09:00 0.091767
1990-09-01 09:00 -1.180808
1990-12-01 09:00 0.929530
1991-03-01 09:00 1.646660
Freq: H, dtype: float64
Categorical data
df5 = df4.copy(deep=True)
## create categories
df5['incomeranges'] = pd.cut(df5['C'], 14)
df5["A"] = df5["A"].astype('category')
df5.head()
     A      B         C         D      incomeranges
0  foo    one -0.870843  1.255164  (-1.011, -0.828]
1  bar    one -0.678605  0.569444  (-0.828, -0.648]
2  foo    two  0.270890  0.692131    (0.254, 0.435]
3  bar  three -1.008900  1.841500  (-1.011, -0.828]
4  foo    two  0.381278  0.918376    (0.254, 0.435]
df5["A"]
0 foo
1 bar
2 foo
3 bar
4 foo
5 bar
6 foo
7 foo
Name: A, dtype: category
Categories (2, object): [foo, bar]
#rename categories to more meaningful names (a no-op here, since they are already "foo" and "bar")
df5["A"] = df5["A"].cat.rename_categories(["foo", "bar"])
df5.head()
     A      B         C         D      incomeranges
0  foo    one -0.870843  1.255164  (-1.011, -0.828]
1  bar    one -0.678605  0.569444  (-0.828, -0.648]
2  foo    two  0.270890  0.692131    (0.254, 0.435]
3  bar  three -1.008900  1.841500  (-1.011, -0.828]
4  foo    two  0.381278  0.918376    (0.254, 0.435]
# domain of categories
df5["A"] = df5["A"].cat.set_categories(["very bad", "bar", "medium", "foo", "very good"])
df5.sort_values(by="A").head()
     A      B         C         D      incomeranges
1  bar    one -0.678605  0.569444  (-0.828, -0.648]
3  bar  three -1.008900  1.841500  (-1.011, -0.828]
5  bar    two -0.666534 -0.483262  (-0.828, -0.648]
0  foo    one -0.870843  1.255164  (-1.011, -0.828]
2  foo    two  0.270890  0.692131    (0.254, 0.435]
df5["A"].value_counts()
foo          5
bar          3
very good    0
medium       0
very bad     0
Name: A, dtype: int64
## subset on category
tmp = df5[df5['A'] == "foo"]
tmp
     A      B         C         D      incomeranges
0  foo    one -0.870843  1.255164  (-1.011, -0.828]
2  foo    two  0.270890  0.692131    (0.254, 0.435]
4  foo    two  0.381278  0.918376    (0.254, 0.435]
6  foo    one -0.132010  1.949751  (-0.287, -0.107]
7  foo  three  1.517630 -0.458782    (1.337, 1.518]
## subset with a boolean mask, keeping only some columns
mask = (df5['A'] == 'foo') & (df5['B'] == 'one')
df5[mask][['C', 'D']][:10]
#Plotting
## options
### make the graphs a bit prettier and bigger (the old 'display.mpl_style' and 'display.line_width' options were removed from pandas)
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15, 5)
### always display all the columns
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
Series calculations
df2
     E          F    G  H      I    J
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
df2['H'].values
array([3, 3, 3, 3], dtype=int32)
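Arithmetic between Series is vectorized and aligned on the index, so column math needs no loops. A small sketch with throwaway Series:

```python
import pandas as pd

e = pd.Series([1.0, 1.0, 1.0, 1.0])
h = pd.Series([3, 3, 3, 3])

print((e + h).tolist())  # [4.0, 4.0, 4.0, 4.0] -- element-wise, upcast to float
print((h * 2).tolist())  # [6, 6, 6, 6]         -- scalar broadcast over the Series
```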
Numpy
Pandas uses NumPy's ndarray for many of its underlying operations. For example, you can use the DataFrame attribute .values (or the preferred DataFrame.to_numpy()) to represent a DataFrame df as a NumPy array, and you can pass pandas data structures directly to NumPy methods.
Now that we've looked into pandas columns (Series), let's take a look at what is driving these operations. This low-level perspective will help us understand strengths and weaknesses, and give us a much better sense of how to improve our code.
Introduction
NumPy's arrays are more compact than Python lists: a list of lists would take at least 20 MB or so, while a NumPy 3D array with single-precision floats in the cells would fit in 4 MB. Reading and writing items is also faster with NumPy.
The difference is mostly due to indirectness: a Python list is an array of pointers to Python objects, at least 4 bytes per pointer plus 16 bytes for even the smallest Python object (4 for the type pointer, 4 for the reference count, 4 for the value, and the allocator rounds up to 16). A NumPy array is an array of uniform values: single-precision numbers take 4 bytes each, double-precision ones 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!
Because NumPy's popularity rests on its performance, let's focus on memory and speed in this discussion.
Memory layout
The following attributes describe the memory layout of an array:
- ndarray.flags: information about the memory layout of the array.
- ndarray.shape: tuple of array dimensions.
- ndarray.ndim: number of array dimensions.
- ndarray.strides: tuple of bytes to step in each dimension when traversing the array.
- ndarray.data: Python buffer object pointing to the start of the array's data.
- ndarray.size: number of elements in the array.
- ndarray.itemsize: length of one array element in bytes.
- ndarray.dtype: type of the elements in the array, e.g., int64.
- ndarray.nbytes: total bytes consumed by the elements of the array.
- ndarray.base: base object, if the memory is borrowed from some other object.
Note that ndarray.__sizeof__() is somewhat bigger than ndarray.nbytes (the number of bytes in the allocated array), because it also includes the object overhead.
import sys

arr = np.random.randn(10**3)
lst = list(arr)
arr2d = arr.reshape(20,50)
print( len(lst) )
print( arr.shape )
print( arr2d.shape )
print( type(lst) )
print( type(arr) )
print( type(arr2d) )
print( type(lst[0]) )
print( arr.dtype )
1000
(1000,)
(20, 50)
<class 'list'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.float64'>
float64
#item
print( sys.getsizeof( lst[0] ) )
print( arr.itemsize )
#total
print( sys.getsizeof( lst ) )
print( sys.getsizeof( arr ) )
print( sys.getsizeof( arr2d ) )
print( arr.nbytes )
print( arr2d.nbytes )
32
8
9112
8096
112
8000
8000
print( arr.tobytes()[:100] )
b',!1\xcdk\\\xf0?\x8b\x0f\x93\x19)\xf0\xe7?v\xec3\xcb}e\xbf?\x89\x92\x0b\xac\xc4_\xb7?\x1d\xdb\\*\xdaG\xe2?}\x16\xf8n\x99\xcd\xd7\xbfc\xf3\xf9g\x87\xb5\x03\xc0\x04&\x86\xc7\x08*\xf4?\xb8j\xd6\x03}\xed\xef\xbf\xe6\xcf\xb0\x0e\x90\xe1\xd8\xbf{\xa2\x1dn\xe1\xef\xe1\xbfF\x90{\xfd\xf2\x14\xde\xbf\\\xb8pv'
Comparison of ndarray to list
Assigning a value to a slice is easier with a NumPy array than with a list.
a = np.array([1, 2, 5, 7, 8])
a[1:3] = -1
a
array([ 1, -1, -1, 7, 8])
b = [1, 2, 5, 7, 8]
try:
    b[1:3] = -1
except TypeError as e:
    print("TypeError:", e)
TypeError: can only assign an iterable
Unlike list slices, which copy the data, ndarray slices are views on the same data buffer: modifying a slice modifies the original ndarray as well.
a=[1,2,5,7,8]
b=a[1:5]
b[1]=3
print(a)
print(b)
[1, 2, 5, 7, 8]
[2, 3, 7, 8]
a = np.array([1, 2, 5, 7, 8])
a_slice = a[1:5]
a_slice[1] = 1000
print( a )
print( a_slice)
[ 1 2 1000 7 8]
[ 2 1000 7 8]
a = np.array([1, 2, 5, 7, 8])
a_slice = a[1:5].copy()
a_slice[1] = 1000
print( a )
print( a_slice)
[1 2 5 7 8]
[ 2 1000 7 8]
2D ndarray slicing uses arr[row_start:row_end, col_start:col_end].
ndarray also allows boolean indexing.
a = np.array([0, 1, 2, 5, 7, 8]).reshape(3,2)
a
array([[0, 1],
[2, 5],
[7, 8]])
rows_on = np.array([True, False, True])
a[rows_on,:]
array([[0, 1],
       [7, 8]])
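A boolean array the same shape as the data can also filter elementwise, returning a flattened 1D result:

```python
import numpy as np

a = np.array([[0, 1], [2, 5], [7, 8]])

print(a[a > 2])        # [5 7 8] -- elements passing the condition, flattened
print(a[a % 2 == 0])   # [0 2 8] -- any elementwise boolean expression works
```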
A quick comparison of summation speed between a NumPy array and a Python range:
from numpy import arange
from timeit import Timer
Nelements = 10000
Ntimeits = 10000
x = arange(Nelements)
y = range(Nelements)
t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list: %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))
numpy: 1.171e-05
list: 1.768e-04
Creating matrices
Your typical arrays of random initializers are always useful.
print( np.random.randint(1,10, 5) )
print( np.arange(1,10) )
[8 8 7 4 3]
[1 2 3 4 5 6 7 8 9]
A matrix of random numbers.
print( np.random.rand(3,3) )
[[0.24273703 0.4428199 0.51066964]
[0.66546027 0.83538717 0.92546135]
[0.63670416 0.83450213 0.57655533]]
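For reproducible examples, newer NumPy code seeds a Generator instead of relying on the global random state (a small sketch):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded Generator -> reproducible draws

ints = rng.integers(1, 10, 5)    # 5 ints in [1, 10)
mat = rng.random((2, 2))         # 2x2 matrix of floats in [0, 1)

print(ints)
print(mat)
```

Re-creating the generator with the same seed reproduces the same sequence, which is handy for tests and tutorials.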
A matrix of zeros.
np.zeros((4,3))
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
The identity matrix.
np.eye(4)
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
Scalar operations
arr = np.arange(1,10)
np.append(arr, 12)
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 12])
print( arr.size )
print( arr.mean() )
print( arr.min() )
print( arr.max() )
print( type(arr.tolist()) )
print( arr.astype('str') )
print( arr.astype('int') )
9
5.0
1
9
<class 'list'>
['1' '2' '3' '4' '5' '6' '7' '8' '9']
[1 2 3 4 5 6 7 8 9]
Matrix operations
arr = np.arange(1,10)
print( arr + arr )
print( arr - arr )
[ 2 4 6 8 10 12 14 16 18]
[0 0 0 0 0 0 0 0 0]
#A[n+1]-A[n]
np.diff(arr, n=1)
array([1, 1, 1, 1, 1, 1, 1, 1])
Broadcasting provides a means of vectorizing array operations so that looping occurs in an implementation of C, instead of in Python. The term refers how numpy treats arrays with different shapes during arithmetic operations. The smaller array is repeated so that it matches the size of the larger array.
arr * 2
array([ 2,  4,  6,  8, 10, 12, 14, 16, 18])
arr + 4
array([ 5,  6,  7,  8,  9, 10, 11, 12, 13])
1.36 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.45 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
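The same broadcasting rule extends to higher dimensions: a 1D array is stretched across each row of a 2D array when the trailing shapes match. A small sketch:

```python
import numpy as np

M = np.arange(1, 7).reshape(2, 3)  # shape (2, 3)
v = np.array([10, 20, 30])         # shape (3,) -- broadcast across each row

print(M + v)
# [[11 22 33]
#  [14 25 36]]
```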
M = np.arange(1,10).reshape(3,3)
M
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
M.T
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])
M.dot(M)
array([[ 30,  36,  42],
       [ 66,  81,  96],
       [102, 126, 150]])
np.sum(M, axis=1) #sum of each row; use `axis=0` for column sums
array([ 6, 15, 24])
External Interfaces
Using CUDA with Numba
Basic C++
Working with Eigen
Conclusion
Pandas provides fast and thorough means of manipulating data in Python, powered by the operations in NumPy. While pandas is useful for preparing data, the actual analysis and computations are typically performed on the underlying NumPy arrays. This is part of the reason NumPy is adopted by so many important libraries, such as scikit-learn, PyTorch, and many others.