NumPy and pandas are the basic data science tools in the Python environment. Having a good understanding of their capabilities and how they process data is imperative to writing optimal code. This post provides an introductory overview and a refresher for those coming back to the libraries after a break. The end of the post covers external interfaces for speeding up code execution and performing more sophisticated matrix operations.
Pandas Series and Dataframe
These are the basic data structures used by pandas. Pandas derives its name from pan(el) da(ta): panel data is an econometrics term for data recorded over time. Pandas' author, Wes McKinney, was working in the financial industry when he began writing it.
- Series objects: 1D array, similar to a column in a spreadsheet
- DataFrame objects: 2D table, similar to a spreadsheet, dictionary of Series
- Panel objects: dictionary of DataFrames, similar to sheets in MS Excel (removed in pandas 0.25.0)
Config
Let’s load our modules. Numpy is always used with Pandas because it is an underlying dependency.
import numpy as np
import pandas as pd
We will create a few DataFrames to work with throughout the post.
A Series is similar to R's vector c(), and corresponds to a single column within a pandas DataFrame.
s = pd.Series([1,3,5,np.nan,6,8])
dates = pd.date_range('20130101', periods=6)
df1 = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df1
                   A         B         C         D
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894
2013-01-05 -1.184577 -1.383917  0.798301  0.913562
2013-01-06  0.024899  0.252519 -0.833479 -0.109767
df2 = pd.DataFrame({ 'E' : 1.,
....: 'F' : pd.Timestamp('20130102'),
....: 'G' : pd.Series(1,index=list(range(4)),dtype='float32'),
....: 'H' : np.array([3] * 4,dtype='int32'),
....: 'I' : pd.Categorical(["test","train","test","train"]),
....: 'J' : 'foo' })
df2
     E          F    G  H      I    J
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
tmpdf2 = df2.copy(deep=True)
tmpdf2.set_index(tmpdf2.F, inplace=True)
df3 = df1.join(tmpdf2, how='outer')
#or: df3 = pd.merge(df1, tmpdf2, how='outer', left_index=True, right_index=True)
df3
                   A         B         C         D    E          F    G    H      I    J
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078  NaN        NaT  NaN  NaN    NaN  NaN
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0 2013-01-02  1.0  3.0   test  foo
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0 2013-01-02  1.0  3.0  train  foo
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0 2013-01-02  1.0  3.0   test  foo
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0 2013-01-02  1.0  3.0  train  foo
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  NaN        NaT  NaN  NaN    NaN  NaN
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  NaN        NaT  NaN  NaN    NaN  NaN
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  NaN        NaT  NaN  NaN    NaN  NaN
2013-01-06  0.024899  0.252519 -0.833479 -0.109767  NaN        NaT  NaN  NaN    NaN  NaN
df4 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
....: 'foo', 'bar', 'foo', 'foo'],
....: 'B': ['one', 'one', 'two', 'three',
....: 'two', 'two', 'one', 'three'],
....: 'C': np.random.randn(8),
....: 'D': np.random.randn(8)})
df4
     A      B         C         D
0  foo    one -0.870843  1.255164
1  bar    one -0.678605  0.569444
2  foo    two  0.270890  0.692131
3  bar  three -1.008900  1.841500
4  foo    two  0.381278  0.918376
5  bar    two -0.666534 -0.483262
6  foo    one -0.132010  1.949751
7  foo  three  1.517630 -0.458782
Index and axes
c = df1.columns.tolist()
c
['A', 'B', 'C', 'D']
#calculate sum (across rows) for each column
df1.sum(axis=0)
A -5.885887
B -0.973229
C -3.602993
D -1.248308
dtype: float64
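A quick way to remember the axis argument: axis=0 collapses the rows (one result per column), while axis=1 collapses the columns (one result per row). A minimal sketch with a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

col_sums = df.sum(axis=0)  # collapse the rows -> one value per column
row_sums = df.sum(axis=1)  # collapse the columns -> one value per row

print(col_sums.tolist())  # [6, 60]
print(row_sums.tolist())  # [11, 22, 33]
```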
pd.concat([df1,df3], axis=0, sort=False).head()
                   A         B         C         D   E    F   G   H    I    J
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078 NaN  NaT NaN NaN  NaN  NaN
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683 NaN  NaT NaN NaN  NaN  NaN
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605 NaN  NaT NaN NaN  NaN  NaN
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894 NaN  NaT NaN NaN  NaN  NaN
2013-01-05 -1.184577 -1.383917  0.798301  0.913562 NaN  NaT NaN NaN  NaN  NaN
Data types
NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy()
, pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
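To see this in practice, compare a homogeneous frame with a mixed one (a small sketch with throwaway frames):

```python
import pandas as pd

floats = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})   # all float columns
mixed = pd.DataFrame({'a': [1.0, 2.0], 'b': ['x', 'y']})    # float + string columns

print(floats.to_numpy().dtype)  # float64 -- one native dtype covers every column
print(mixed.to_numpy().dtype)   # object  -- falls back to boxed Python objects
```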
arr = np.array([1,2,3])
pd.Series(arr).values
array([1, 2, 3])
Locate by column label, index
Timestamp('2013-01-01 00:00:00', freq='D')
A -1.360827
B -0.871347
C -1.603602
D 0.073078
Name: 2013-01-01 00:00:00, dtype: float64
                   A         B
2013-01-01 -1.360827 -0.871347
2013-01-02 -0.210426  0.511720
2013-01-03 -1.829322  0.343560
2013-01-04 -1.325634  0.174237
2013-01-05 -1.184577 -1.383917
2013-01-06  0.024899  0.252519
2013-01-01 -1.360827
2013-01-02 -0.210426
2013-01-03 -1.829322
2013-01-04 -1.325634
2013-01-05 -1.184577
Freq: D, Name: A, dtype: float64
A -1.360827
B -0.871347
C -1.603602
D 0.073078
Name: 2013-01-01 00:00:00, dtype: float64
2013-01-01 -1.360827
2013-01-02 -0.210426
2013-01-03 -1.829322
2013-01-04 -1.325634
2013-01-05 -1.184577
2013-01-06 0.024899
Freq: D, Name: A, dtype: float64
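The unlabeled outputs above come from label- and position-based selection (.loc, .iloc, and plain column access). A minimal sketch of the main idioms with a throwaway frame (df and dates here are stand-ins, not the post's df1):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

row = df.loc[dates[0]]        # one row by label -> Series indexed by column name
cols = df.loc[:, ['A', 'B']]  # all rows, two columns by label
cell = df.iloc[0, 0]          # scalar by integer position
col = df['A']                 # one column -> Series indexed by date

print(row.tolist())           # [0, 1, 2, 3]
print(cell, col.iloc[-1])     # 0 20
```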
Locate by boolean
2013-01-01 False
2013-01-02 False
2013-01-03 False
2013-01-04 False
2013-01-05 False
2013-01-06 True
Freq: D, Name: A, dtype: bool
                   A         B         C         D
2013-01-06  0.024899  0.252519 -0.833479 -0.109767
df5 = df1.copy()
df5['K'] = ['one', 'one','two','three','four','three']
df5['K']
2013-01-01 one
2013-01-02 one
2013-01-03 two
2013-01-04 three
2013-01-05 four
2013-01-06 three
Freq: D, Name: K, dtype: object
df5[df5['K'].isin(['two','four'])]
                   A         B         C         D     K
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605   two
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  four
Sort values
#sort
df1.sort_values([c[0]], ascending=False)
                   A         B         C         D
2013-01-06  0.024899  0.252519 -0.833479 -0.109767
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683
2013-01-05 -1.184577 -1.383917  0.798301  0.913562
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605
Subset
                   A         B         C         D
2013-01-06  0.024899  0.252519 -0.833479 -0.109767
Setting values
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
df1['F'] = s1
df1
                   A         B         C         D    F
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078  NaN
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  2.0
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  3.0
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  4.0
2013-01-06  0.024899  0.252519 -0.833479 -0.109767  5.0
df5 = df1.copy()
df5[df5 > 0] = -df5
df5
                   A         B         C         D    F
2013-01-01 -1.360827 -0.871347 -1.603602 -0.073078  NaN
2013-01-02 -0.210426 -0.511720 -1.037810 -0.746683 -1.0
2013-01-03 -1.829322 -0.343560 -0.005303 -1.312605 -2.0
2013-01-04 -1.325634 -0.174237 -0.921101 -0.065894 -3.0
2013-01-05 -1.184577 -1.383917 -0.798301 -0.913562 -4.0
2013-01-06 -0.024899 -0.252519 -0.833479 -0.109767 -5.0
df5.index.to_series().apply(pd.to_datetime)
2013-01-01 2013-01-01
2013-01-02 2013-01-02
2013-01-03 2013-01-03
2013-01-04 2013-01-04
2013-01-05 2013-01-05
2013-01-06 2013-01-06
Freq: D, dtype: datetime64[ns]
df5.index.to_series().dt.year
2013-01-01 2013
2013-01-02 2013
2013-01-03 2013
2013-01-04 2013
2013-01-05 2013
2013-01-06 2013
Freq: D, dtype: int64
Calculations
df1.apply(np.cumsum)
                   A         B         C         D     F
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078   NaN
2013-01-02 -1.571253 -0.359628 -2.641412 -0.673604   1.0
2013-01-03 -3.400575 -0.016067 -2.646715 -1.986209   3.0
2013-01-04 -4.726209  0.158169 -3.567816 -2.052103   6.0
2013-01-05 -5.910786 -1.225748 -2.769514 -1.138541  10.0
2013-01-06 -5.885887 -0.973229 -3.602993 -1.248308  15.0
df1.apply(lambda x: x.max() - x.min())
A 1.854221
B 1.895637
C 2.401903
D 2.226167
F 4.000000
dtype: float64
# string ops - vectorized
df1.iloc[:,1].astype('str').str.lower()
2013-01-01 -0.8713474062085093
2013-01-02 0.5117196166006319
2013-01-03 0.3435604532681823
2013-01-04 0.17423665911328937
2013-01-05 -1.3839171647406576
2013-01-06 0.2525192847352602
Freq: D, Name: B, dtype: object
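Lowercasing floats isn't very illuminating; the same vectorized .str accessor is more typical on real text (a small sketch with a throwaway Series):

```python
import pandas as pd

s = pd.Series(['Alpha', 'BETA ', ' gamma'])

clean = s.str.strip().str.lower()  # chained, element-wise string methods
print(clean.tolist())              # ['alpha', 'beta', 'gamma']
print(s.str.len().tolist())        # [5, 5, 6] -- whitespace counts toward length
```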
Missing values
# missing data
df1.dropna(how='any')
                   A         B         C         D    F
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  2.0
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  3.0
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  4.0
2013-01-06  0.024899  0.252519 -0.833479 -0.109767  5.0
df1.dropna(subset=['F'], how='all')
                   A         B         C         D    F
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  2.0
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  3.0
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  4.0
2013-01-06  0.024899  0.252519 -0.833479 -0.109767  5.0
df1.fillna(value=5)
                   A         B         C         D    F
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078  5.0
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  2.0
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  3.0
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  4.0
2013-01-06  0.024899  0.252519 -0.833479 -0.109767  5.0
pd.isna(df1)
                A      B      C      D      F
2013-01-01  False  False  False  False   True
2013-01-02  False  False  False  False  False
2013-01-03  False  False  False  False  False
2013-01-04  False  False  False  False  False
2013-01-05  False  False  False  False  False
2013-01-06  False  False  False  False  False
Merge and shape
#note: DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df2], ignore_index=True, sort=True) instead
df1.append(df2, ignore_index=True, sort=True).head()
          A         B         C         D   E    F   G   H    I    J
0 -1.360827 -0.871347 -1.603602  0.073078 NaN  NaN NaN NaN  NaN  NaN
1 -0.210426  0.511720 -1.037810 -0.746683 NaN    1 NaN NaN  NaN  NaN
2 -1.829322  0.343560 -0.005303 -1.312605 NaN    2 NaN NaN  NaN  NaN
3 -1.325634  0.174237 -0.921101 -0.065894 NaN    3 NaN NaN  NaN  NaN
4 -1.184577 -1.383917  0.798301  0.913562 NaN    4 NaN NaN  NaN  NaN
pieces = [df1[:3], df1[3:7], df1[7:]]
pd.concat(pieces).head()
                   A         B         C         D    F
2013-01-01 -1.360827 -0.871347 -1.603602  0.073078  NaN
2013-01-02 -0.210426  0.511720 -1.037810 -0.746683  1.0
2013-01-03 -1.829322  0.343560 -0.005303 -1.312605  2.0
2013-01-04 -1.325634  0.174237 -0.921101 -0.065894  3.0
2013-01-05 -1.184577 -1.383917  0.798301  0.913562  4.0
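Splitting a frame into pieces and re-concatenating along the row axis is lossless; a quick sketch with a throwaway frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(6, 2), columns=['a', 'b'])

pieces = [df[:3], df[3:]]     # split on row position
rebuilt = pd.concat(pieces)   # stack the pieces back along axis 0

print(rebuilt.equals(df))     # True
```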
GroupBy
df4.groupby(['A','B']).sum()
                  C         D
A   B
bar one   -0.678605  0.569444
    three -1.008900  1.841500
    two   -0.666534 -0.483262
foo one   -1.002854  3.204916
    three  1.517630 -0.458782
    two    0.652168  1.610507
Stacked
stacked = df4.stack()
stacked.index[:4]
MultiIndex([(0, 'A'),
(0, 'B'),
(0, 'C'),
(0, 'D')],
)
stacked.head()
0  A          foo
   B          one
   C    -0.870843
   D      1.25516
1  A          bar
dtype: object
stacked.unstack().head()
     A      B          C         D
0  foo    one  -0.870843   1.25516
1  bar    one  -0.678605  0.569444
2  foo    two    0.27089  0.692131
3  bar  three    -1.0089    1.8415
4  foo    two   0.381278  0.918376
Pivot table
     A      B         C         D
0  foo    one -0.870843  1.255164
1  bar    one -0.678605  0.569444
2  foo    two  0.270890  0.692131
3  bar  three -1.008900  1.841500
4  foo    two  0.381278  0.918376
5  bar    two -0.666534 -0.483262
6  foo    one -0.132010  1.949751
7  foo  three  1.517630 -0.458782
# pivot tables (http://pbpython.com/pandas-pivot-table-explained.html)
df4["A"] = df4["A"].astype("category")
df4["A"] = df4["A"].cat.set_categories(["foo", "bar"]) #the inplace=True form is deprecated
pd.pivot_table(df4,index=['A','B']) #same as:
df4.groupby(['A','B']).mean()
df4.groupby(['A','B'])['D'].agg([np.sum,np.std])
                sum       std
A   B
foo one    3.204916  0.491147
    three -0.458782       NaN
    two    1.610507  0.159980
bar one    0.569444       NaN
    three  1.841500       NaN
    two   -0.483262       NaN
pd.pivot_table(df4,index=['A','B'], values=['D'], aggfunc=np.sum)
                  D
A   B
foo one    3.204916
    three -0.458782
    two    1.610507
bar one    0.569444
    three  1.841500
    two   -0.483262
pd.pivot_table(df4,index=['A'], columns=['B'], values=['D'], aggfunc=np.sum)
            D
B         one     three       two
A
foo  3.204916 -0.458782  1.610507
bar  0.569444  1.841500 -0.483262
pd.pivot_table(df4,index=['A'], columns=['B'], values=['D'], aggfunc=np.sum, fill_value=0,margins=True)
            D
B         one     three       two       All
A
foo  3.204916 -0.458782  1.610507  4.356641
bar  0.569444  1.841500 -0.483262  1.927682
All  3.774359  1.382718  1.127246  6.284323
pd.pivot_table(df4, index=['A', 'B'], columns=['C'], values='D')
C          -1.008900  -0.870843  -0.678605  -0.666534  -0.132010  0.270890  0.381278  1.517630
A   B
foo one          NaN   1.255164        NaN        NaN   1.949751       NaN       NaN       NaN
    three        NaN        NaN        NaN        NaN        NaN       NaN       NaN -0.458782
    two          NaN        NaN        NaN        NaN        NaN  0.692131  0.918376       NaN
bar one          NaN        NaN   0.569444        NaN        NaN       NaN       NaN       NaN
    three   1.841500        NaN        NaN        NaN        NaN       NaN       NaN       NaN
    two          NaN        NaN        NaN  -0.483262        NaN       NaN       NaN       NaN
pd.pivot_table(df4, index=['A'], columns=['B','C'], values='D')
B         one                           three                two
C   -0.870843 -0.678605 -0.132010  -1.008900  1.517630  -0.666534  0.270890  0.381278
A
foo  1.255164       NaN  1.949751        NaN -0.458782        NaN  0.692131  0.918376
bar       NaN  0.569444       NaN   1.841500       NaN  -0.483262       NaN       NaN
Timeseries
rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.head()
2012-01-01 00:00:00 114
2012-01-01 00:00:01 279
2012-01-01 00:00:02 32
2012-01-01 00:00:03 495
2012-01-01 00:00:04 433
Freq: S, dtype: int64
## frequency conversion (e.g., converting secondly data into 5-minutely data)
ts.resample('5Min').sum()
2012-01-01 24652
Freq: 5T, dtype: int64
## time zones
ts_utc = ts.tz_localize('UTC')
ts_utc.tz_convert('US/Eastern').head()
2011-12-31 19:00:00-05:00 114
2011-12-31 19:00:01-05:00 279
2011-12-31 19:00:02-05:00 32
2011-12-31 19:00:03-05:00 495
2011-12-31 19:00:04-05:00 433
Freq: S, dtype: int64
## format
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ps = ts.to_period()
ps.to_timestamp()
2012-01-01 2.490119
2012-02-01 0.237758
2012-03-01 0.249177
2012-04-01 -0.015938
2012-05-01 -0.314638
Freq: MS, dtype: float64
## timestamp conversion
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()
1990-03-01 09:00 -0.546684
1990-06-01 09:00 0.091767
1990-09-01 09:00 -1.180808
1990-12-01 09:00 0.929530
1991-03-01 09:00 1.646660
Freq: H, dtype: float64
Categorical data
df5 = df4.copy(deep=True)
## create categories
df5['incomeranges'] = pd.cut(df5['C'], 14)
df5["A"] = df5["A"].astype('category')
df5.head()
     A      B         C         D      incomeranges
0  foo    one -0.870843  1.255164  (-1.011, -0.828]
1  bar    one -0.678605  0.569444  (-0.828, -0.648]
2  foo    two  0.270890  0.692131    (0.254, 0.435]
3  bar  three -1.008900  1.841500  (-1.011, -0.828]
4  foo    two  0.381278  0.918376    (0.254, 0.435]
df5["A"]
0 foo
1 bar
2 foo
3 bar
4 foo
5 bar
6 foo
7 foo
Name: A, dtype: category
Categories (2, object): [foo, bar]
#rename categories to more meaningful names (a no-op here, since they are already "foo" and "bar")
df5["A"] = df5["A"].cat.rename_categories(["foo", "bar"])
df5.head()
     A      B         C         D      incomeranges
0  foo    one -0.870843  1.255164  (-1.011, -0.828]
1  bar    one -0.678605  0.569444  (-0.828, -0.648]
2  foo    two  0.270890  0.692131    (0.254, 0.435]
3  bar  three -1.008900  1.841500  (-1.011, -0.828]
4  foo    two  0.381278  0.918376    (0.254, 0.435]
# domain of categories
df5["A"] = df5["A"].cat.set_categories(["very bad", "bar", "medium", "foo", "very good"])
df5.sort_values(by="A").head()
     A      B         C         D      incomeranges
1  bar    one -0.678605  0.569444  (-0.828, -0.648]
3  bar  three -1.008900  1.841500  (-1.011, -0.828]
5  bar    two -0.666534 -0.483262  (-0.828, -0.648]
0  foo    one -0.870843  1.255164  (-1.011, -0.828]
2  foo    two  0.270890  0.692131    (0.254, 0.435]
df5["A"].value_counts()
foo          5
bar          3
very good    0
medium       0
very bad     0
Name: A, dtype: int64
## subset on category
tmp = df5[df5['A'] == "foo"]
tmp
     A      B         C         D      incomeranges
0  foo    one -0.870843  1.255164  (-1.011, -0.828]
2  foo    two  0.270890  0.692131    (0.254, 0.435]
4  foo    two  0.381278  0.918376    (0.254, 0.435]
6  foo    one -0.132010  1.949751  (-0.287, -0.107]
7  foo  three  1.517630 -0.458782    (1.337, 1.518]
## subset with a boolean mask, keeping only some columns
mask = (df5['A'] == 'foo') & (df5['B'] == 'one')
df5[mask][['C', 'D']][:10]
#Plotting
## options
### make the graphs a bit prettier and bigger (the old 'display.mpl_style' and 'display.line_width' options were removed from pandas)
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15, 5)
### always display all the columns
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
Series calculations
df2
     E          F    G  H      I    J
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
df2['H'].values
array([3, 3, 3, 3], dtype=int32)
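Arithmetic between Series is vectorized and aligned on the index, so column math needs no loops. A small sketch with throwaway Series:

```python
import pandas as pd

e = pd.Series([1.0, 1.0, 1.0, 1.0])
h = pd.Series([3, 3, 3, 3])

print((e + h).tolist())  # [4.0, 4.0, 4.0, 4.0] -- element-wise, upcast to float
print((h * 2).tolist())  # [6, 6, 6, 6]         -- scalar broadcast over the Series
```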
Numpy
Pandas uses NumPy's ndarray for many of its underlying operations. For example, you can use the DataFrame attribute .values (or the preferred DataFrame.to_numpy()) to represent a DataFrame df as a NumPy array, and you can pass pandas data structures directly to NumPy methods.
Now that we've looked into pandas columns (Series), let's take a look at what is driving these operations. This low-level perspective will help us understand strengths and weaknesses, and give us a much better sense of how to improve our code.
Introduction
NumPy's arrays are more compact than Python lists: a list of lists would take at least 20 MB or so, while a NumPy 3D array with single-precision floats in the cells would fit in 4 MB. Reading and writing items is also faster with NumPy.
The difference is mostly due to indirectness: a Python list is an array of pointers to Python objects, at least 4 bytes per pointer plus 16 bytes for even the smallest Python object (4 for the type pointer, 4 for the reference count, 4 for the value, and the allocator rounds up to 16). A NumPy array is an array of uniform values: single-precision numbers take 4 bytes each, double-precision ones 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!
Because NumPy's popularity rests on its performance, let's focus on memory and speed in this discussion.
Memory layout
The following attributes describe the memory layout of an array:
- ndarray.flags: information about the memory layout of the array.
- ndarray.shape: tuple of array dimensions.
- ndarray.ndim: number of array dimensions.
- ndarray.strides: tuple of bytes to step in each dimension when traversing the array.
- ndarray.data: Python buffer object pointing to the start of the array's data.
- ndarray.size: number of elements in the array.
- ndarray.itemsize: length of one array element in bytes.
- ndarray.dtype: type of the elements in the array, e.g., int64.
- ndarray.nbytes: total bytes consumed by the elements of the array.
- ndarray.base: base object, if the memory is borrowed from some other object.
Note that ndarray.__sizeof__() is somewhat bigger than ndarray.nbytes (the number of bytes in the allocated array), because it also includes the object overhead.
import sys

arr = np.random.randn(10**3)
lst = list(arr)
arr2d = arr.reshape(20,50)
print( len(lst) )
print( arr.shape )
print( arr2d.shape )
print( type(lst) )
print( type(arr) )
print( type(arr2d) )
print( type(lst[0]) )
print( arr.dtype )
1000
(1000,)
(20, 50)
<class 'list'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.float64'>
float64
#item
print( sys.getsizeof( lst[0] ) )
print( arr.itemsize )
#total
print( sys.getsizeof( lst ) )
print( sys.getsizeof( arr ) )
print( sys.getsizeof( arr2d ) )
print( arr.nbytes )
print( arr2d.nbytes )
32
8
9112
8096
112
8000
8000
print( arr.tobytes()[:100] )
b',!1\xcdk\\\xf0?\x8b\x0f\x93\x19)\xf0\xe7?v\xec3\xcb}e\xbf?\x89\x92\x0b\xac\xc4_\xb7?\x1d\xdb\\*\xdaG\xe2?}\x16\xf8n\x99\xcd\xd7\xbfc\xf3\xf9g\x87\xb5\x03\xc0\x04&\x86\xc7\x08*\xf4?\xb8j\xd6\x03}\xed\xef\xbf\xe6\xcf\xb0\x0e\x90\xe1\xd8\xbf{\xa2\x1dn\xe1\xef\xe1\xbfF\x90{\xfd\xf2\x14\xde\xbf\\\xb8pv'
Comparison of ndarray to list
Assigning a value to a slice is easier with a NumPy array than with a list.
a = np.array([1, 2, 5, 7, 8])
a[1:3] = -1
a
array([ 1, -1, -1, 7, 8])
b = [1, 2, 5, 7, 8]
try:
    b[1:3] = -1
except TypeError as e:
    print("TypeError:", e)
TypeError: can only assign an iterable
Unlike list slices, which copy the data, ndarray slices are views on the same data buffer: modifying a slice modifies the original ndarray as well.
a=[1,2,5,7,8]
b=a[1:5]
b[1]=3
print(a)
print(b)
[1, 2, 5, 7, 8]
[2, 3, 7, 8]
a = np.array([1, 2, 5, 7, 8])
a_slice = a[1:5]
a_slice[1] = 1000
print( a )
print( a_slice)
[ 1 2 1000 7 8]
[ 2 1000 7 8]
a = np.array([1, 2, 5, 7, 8])
a_slice = a[1:5].copy()
a_slice[1] = 1000
print( a )
print( a_slice)
[1 2 5 7 8]
[ 2 1000 7 8]
2D ndarray slicing uses arr[row_start:row_end, col_start:col_end].
ndarray also allows boolean indexing.
a = np.array([0, 1, 2, 5, 7, 8]).reshape(3,2)
a
array([[0, 1],
[2, 5],
[7, 8]])
rows_on = np.array([True, False, True])
a[rows_on,:]
array([[0, 1],
       [7, 8]])
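A boolean array the same shape as the data can also filter elementwise, returning a flattened 1D result:

```python
import numpy as np

a = np.array([[0, 1], [2, 5], [7, 8]])

print(a[a > 2])        # [5 7 8] -- elements passing the condition, flattened
print(a[a % 2 == 0])   # [0 2 8] -- any elementwise boolean expression works
```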
A quick comparison of summation speed between a NumPy array and a Python range:
from numpy import arange
from timeit import Timer
Nelements = 10000
Ntimeits = 10000
x = arange(Nelements)
y = range(Nelements)
t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list: %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))
numpy: 1.171e-05
list: 1.768e-04
Creating matrices
Your typical arrays of random initializers are always useful.
print( np.random.randint(1,10, 5) )
print( np.arange(1,10) )
[8 8 7 4 3]
[1 2 3 4 5 6 7 8 9]
A matrix of random numbers.
print( np.random.rand(3,3) )
[[0.24273703 0.4428199 0.51066964]
[0.66546027 0.83538717 0.92546135]
[0.63670416 0.83450213 0.57655533]]
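For reproducible examples, newer NumPy code seeds a Generator instead of relying on the global random state (a small sketch):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded Generator -> reproducible draws

ints = rng.integers(1, 10, 5)    # 5 ints in [1, 10)
mat = rng.random((2, 2))         # 2x2 matrix of floats in [0, 1)

print(ints)
print(mat)
```

Re-creating the generator with the same seed reproduces the same sequence, which is handy for tests and tutorials.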
A matrix of zeros.
np.zeros((4,3))
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
The identity matrix.
np.eye(4)
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
Scalar operations
arr = np.arange(1,10)
np.append(arr, 12)
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 12])
print( arr.size )
print( arr.mean() )
print( arr.min() )
print( arr.max() )
print( type(arr.tolist()) )
print( arr.astype('str') )
print( arr.astype('int') )
9
5.0
1
9
<class 'list'>
['1' '2' '3' '4' '5' '6' '7' '8' '9']
[1 2 3 4 5 6 7 8 9]
Matrix operations
arr = np.arange(1,10)
print( arr + arr )
print( arr - arr )
[ 2 4 6 8 10 12 14 16 18]
[0 0 0 0 0 0 0 0 0]
#A[n+1]-A[n]
np.diff(arr, n=1)
array([1, 1, 1, 1, 1, 1, 1, 1])
Broadcasting provides a means of vectorizing array operations so that looping occurs in an implementation of C, instead of in Python. The term refers how numpy treats arrays with different shapes during arithmetic operations. The smaller array is repeated so that it matches the size of the larger array.
arr * 2
array([ 2,  4,  6,  8, 10, 12, 14, 16, 18])
arr + 4
array([ 5,  6,  7,  8,  9, 10, 11, 12, 13])
1.36 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.45 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
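The same broadcasting rule extends to higher dimensions: a 1D array is stretched across each row of a 2D array when the trailing shapes match. A small sketch:

```python
import numpy as np

M = np.arange(1, 7).reshape(2, 3)  # shape (2, 3)
v = np.array([10, 20, 30])         # shape (3,) -- broadcast across each row

print(M + v)
# [[11 22 33]
#  [14 25 36]]
```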
M = np.arange(1,10).reshape(3,3)
M
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
M.T
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])
M.dot(M)
array([[ 30,  36,  42],
       [ 66,  81,  96],
       [102, 126, 150]])
np.sum(M, axis=1) #sum of each row; use `axis=0` for column sums
array([ 6, 15, 24])
External Interfaces
Using CUDA with Numba
Basic C++
Working with Eigen
Conclusion
Pandas provides fast and thorough means of manipulating data in Python, powered by the operations in NumPy. While pandas is useful for preparing data, the actual analysis and computations are typically performed on the underlying NumPy arrays. This is part of the reason NumPy is adopted by so many important libraries, such as scikit-learn, PyTorch, and many others.