An Introduction to Numpy and Pandas

Numpy and Pandas are the foundational data science tools in the Python ecosystem. A good understanding of their capabilities, and of how they process data, is imperative to writing efficient code. This post provides an introductory overview, and a refresher for those returning to the libraries after a break. The end of the post covers external interfaces for speeding up code execution and performing more sophisticated matrix operations.

Pandas Series and DataFrame

These are the basic data structures used by Pandas. Pandas derives its name from pan(el) da(ta). Panel data is an econometrics term for data recorded over time. Pandas' author, Wes McKinney, was working in the financial industry when he began writing it.

  • Series objects: 1D array, similar to a column in a spreadsheet
  • DataFrame objects: 2D table, similar to a spreadsheet, dictionary of Series
  • Panel objects: Dictionary of DataFrames, similar to a sheet in MS Excel (removed in pandas 0.25.0)
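The "dictionary of Series" description of a DataFrame can be verified directly; a minimal sketch (the column names and values here are purely illustrative):

```python
import pandas as pd

# A DataFrame behaves like a dict of aligned Series sharing one index
s1 = pd.Series([1.0, 2.0, 3.0], name="x")
s2 = pd.Series(["a", "b", "c"], name="y")
df = pd.DataFrame({"x": s1, "y": s2})

# Selecting a column returns the underlying Series
col = df["x"]
print(type(col).__name__)   # Series
print(list(df.columns))     # ['x', 'y']
```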

Config

Let’s load our modules. Numpy is always used with Pandas because it is an underlying dependency.

import numpy as np
import pandas as pd

We will create a few DataFrames to work with throughout the post.

A Series is similar to R’s vector c() and serves as a column within a Pandas DataFrame.

s = pd.Series([1,3,5,np.nan,6,8])
dates = pd.date_range('20130101', periods=6)
df1 = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df1

A B C D
2013-01-01 -1.360827 -0.871347 -1.603602 0.073078
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683
2013-01-03 -1.829322 0.343560 -0.005303 -1.312605
2013-01-04 -1.325634 0.174237 -0.921101 -0.065894
2013-01-05 -1.184577 -1.383917 0.798301 0.913562
2013-01-06 0.024899 0.252519 -0.833479 -0.109767
df2 = pd.DataFrame({ 'E' : 1.,
                     'F' : pd.Timestamp('20130102'),
                     'G' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'H' : np.array([3] * 4,dtype='int32'),
                     'I' : pd.Categorical(["test","train","test","train"]),
                     'J' : 'foo' })
df2

E F G H I J
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
tmpdf2 = df2.copy(deep=True)
tmpdf2.set_index(tmpdf2.F, inplace=True)
df3 = df1.join(tmpdf2, how='outer')
#or: df3 = pd.merge(df1, tmpdf2, how='outer', left_index=True, right_index=True)
df3

A B C D E F G H I J
2013-01-01 -1.360827 -0.871347 -1.603602 0.073078 NaN NaT NaN NaN NaN NaN
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683 1.0 2013-01-02 1.0 3.0 test foo
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683 1.0 2013-01-02 1.0 3.0 train foo
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683 1.0 2013-01-02 1.0 3.0 test foo
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683 1.0 2013-01-02 1.0 3.0 train foo
2013-01-03 -1.829322 0.343560 -0.005303 -1.312605 NaN NaT NaN NaN NaN NaN
2013-01-04 -1.325634 0.174237 -0.921101 -0.065894 NaN NaT NaN NaN NaN NaN
2013-01-05 -1.184577 -1.383917 0.798301 0.913562 NaN NaT NaN NaN NaN NaN
2013-01-06 0.024899 0.252519 -0.833479 -0.109767 NaN NaT NaN NaN NaN NaN
df4 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})
df4

A B C D
0 foo one -0.870843 1.255164
1 bar one -0.678605 0.569444
2 foo two 0.270890 0.692131
3 bar three -1.008900 1.841500
4 foo two 0.381278 0.918376
5 bar two -0.666534 -0.483262
6 foo one -0.132010 1.949751
7 foo three 1.517630 -0.458782

Index and axes

c = df1.columns.tolist()
c
['A', 'B', 'C', 'D']
n = df1.index.tolist()
n
[Timestamp('2013-01-01 00:00:00', freq='D'),
 Timestamp('2013-01-02 00:00:00', freq='D'),
 Timestamp('2013-01-03 00:00:00', freq='D'),
 Timestamp('2013-01-04 00:00:00', freq='D'),
 Timestamp('2013-01-05 00:00:00', freq='D'),
 Timestamp('2013-01-06 00:00:00', freq='D')]
#calculate the sum of each column (aggregating down the rows, axis=0)
df1.sum(axis=0)
A   -5.885887
B   -0.973229
C   -3.602993
D   -1.248308
dtype: float64
pd.concat([df1,df3], axis=0, sort=False).head()

A B C D E F G H I J
2013-01-01 -1.360827 -0.871347 -1.603602 0.073078 NaN NaT NaN NaN NaN NaN
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683 NaN NaT NaN NaN NaN NaN
2013-01-03 -1.829322 0.343560 -0.005303 -1.312605 NaN NaT NaN NaN NaN NaN
2013-01-04 -1.325634 0.174237 -0.921101 -0.065894 NaN NaT NaN NaN NaN NaN
2013-01-05 -1.184577 -1.383917 0.798301 0.913562 NaN NaT NaN NaN NaN NaN

Data types

NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
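To see this upcasting in action, a small sketch (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
# int64 and float64 share a common numeric dtype: float64
numeric_dtype = df.to_numpy().dtype
print(numeric_dtype)   # float64

df["s"] = ["a", "b"]
# a string column forces the widest common dtype: object
mixed_dtype = df.to_numpy().dtype
print(mixed_dtype)     # object
```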

arr = np.array([1,2,3])
pd.Series(arr).values
array([1, 2, 3])
arr.dtype
dtype('int64')
df1.A.to_numpy()
array([-1.36082656, -0.2104261 , -1.82932214, -1.32563421, -1.18457701,
        0.024899  ])

Locate by column label, index

n[0]
Timestamp('2013-01-01 00:00:00', freq='D')
df1.loc[n[0],]
A   -1.360827
B   -0.871347
C   -1.603602
D    0.073078
Name: 2013-01-01 00:00:00, dtype: float64
df1.loc[:,[c[0],c[1]] ]

A B
2013-01-01 -1.360827 -0.871347
2013-01-02 -0.210426 0.511720
2013-01-03 -1.829322 0.343560
2013-01-04 -1.325634 0.174237
2013-01-05 -1.184577 -1.383917
2013-01-06 0.024899 0.252519
df1.loc[n[0],c[0]]
-1.3608265619610327
df1[c[0]][:5]
2013-01-01   -1.360827
2013-01-02   -0.210426
2013-01-03   -1.829322
2013-01-04   -1.325634
2013-01-05   -1.184577
Freq: D, Name: A, dtype: float64
df1.iloc[0,]
A   -1.360827
B   -0.871347
C   -1.603602
D    0.073078
Name: 2013-01-01 00:00:00, dtype: float64
df1.iloc[:,0]
2013-01-01   -1.360827
2013-01-02   -0.210426
2013-01-03   -1.829322
2013-01-04   -1.325634
2013-01-05   -1.184577
2013-01-06    0.024899
Freq: D, Name: A, dtype: float64
df1.iloc[0,0]
-1.3608265619610327
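For single-cell access, .at and .iat are faster scalar counterparts of .loc and .iloc; a quick sketch on a toy frame (labels and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20], "B": [30, 40]}, index=["r1", "r2"])

# Label-based scalar access
print(df.at["r1", "B"])   # 30
# Position-based scalar access
print(df.iat[1, 0])       # 20
```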

Locate by boolean

df1.A > 0
2013-01-01    False
2013-01-02    False
2013-01-03    False
2013-01-04    False
2013-01-05    False
2013-01-06     True
Freq: D, Name: A, dtype: bool
df1[df1.A > 0]

A B C D
2013-01-06 0.024899 0.252519 -0.833479 -0.109767
df5 = df1.copy()
df5['K'] = ['one', 'one','two','three','four','three']
df5['K']
2013-01-01      one
2013-01-02      one
2013-01-03      two
2013-01-04    three
2013-01-05     four
2013-01-06    three
Freq: D, Name: K, dtype: object
df5[df5['K'].isin(['two','four'])]

A B C D K
2013-01-03 -1.829322 0.343560 -0.005303 -1.312605 two
2013-01-05 -1.184577 -1.383917 0.798301 0.913562 four

Sort values

#sort
df1.sort_values([c[0]], ascending=False)

A B C D
2013-01-06 0.024899 0.252519 -0.833479 -0.109767
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683
2013-01-05 -1.184577 -1.383917 0.798301 0.913562
2013-01-04 -1.325634 0.174237 -0.921101 -0.065894
2013-01-01 -1.360827 -0.871347 -1.603602 0.073078
2013-01-03 -1.829322 0.343560 -0.005303 -1.312605

Subset

df1.query('A > 0')

A B C D
2013-01-06 0.024899 0.252519 -0.833479 -0.109767
df1[ (df1['A'] > 0) ]

A B C D
2013-01-06 0.024899 0.252519 -0.833479 -0.109767

Setting values

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
df1['F'] = s1
df1

A B C D F
2013-01-01 -1.360827 -0.871347 -1.603602 0.073078 NaN
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683 1.0
2013-01-03 -1.829322 0.343560 -0.005303 -1.312605 2.0
2013-01-04 -1.325634 0.174237 -0.921101 -0.065894 3.0
2013-01-05 -1.184577 -1.383917 0.798301 0.913562 4.0
2013-01-06 0.024899 0.252519 -0.833479 -0.109767 5.0
df5 = df1.copy()
df5[df5 > 0] = -df5
df5

A B C D F
2013-01-01 -1.360827 -0.871347 -1.603602 -0.073078 NaN
2013-01-02 -0.210426 -0.511720 -1.037810 -0.746683 -1.0
2013-01-03 -1.829322 -0.343560 -0.005303 -1.312605 -2.0
2013-01-04 -1.325634 -0.174237 -0.921101 -0.065894 -3.0
2013-01-05 -1.184577 -1.383917 -0.798301 -0.913562 -4.0
2013-01-06 -0.024899 -0.252519 -0.833479 -0.109767 -5.0
df5.index.to_series().apply(pd.to_datetime)
2013-01-01   2013-01-01
2013-01-02   2013-01-02
2013-01-03   2013-01-03
2013-01-04   2013-01-04
2013-01-05   2013-01-05
2013-01-06   2013-01-06
Freq: D, dtype: datetime64[ns]
df5.index.to_series().dt.year
2013-01-01    2013
2013-01-02    2013
2013-01-03    2013
2013-01-04    2013
2013-01-05    2013
2013-01-06    2013
Freq: D, dtype: int64
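For targeted updates, assigning through .loc with a boolean row mask and a column label avoids chained-assignment pitfalls; a minimal sketch with illustrative values:

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0, -3.0], "B": [1.0, 1.0, 1.0]})

# Set B to 0 on rows where A is negative; .loc takes (row mask, column label)
df.loc[df["A"] < 0, "B"] = 0.0
print(df["B"].tolist())   # [0.0, 1.0, 0.0]
```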

Calculations

df1.apply(np.cumsum)

A B C D F
2013-01-01 -1.360827 -0.871347 -1.603602 0.073078 NaN
2013-01-02 -1.571253 -0.359628 -2.641412 -0.673604 1.0
2013-01-03 -3.400575 -0.016067 -2.646715 -1.986209 3.0
2013-01-04 -4.726209 0.158169 -3.567816 -2.052103 6.0
2013-01-05 -5.910786 -1.225748 -2.769514 -1.138541 10.0
2013-01-06 -5.885887 -0.973229 -3.602993 -1.248308 15.0
df1.apply(lambda x: x.max() - x.min())
A    1.854221
B    1.895637
C    2.401903
D    2.226167
F    4.000000
dtype: float64
# string ops - vectorized
df1.iloc[:,1].astype('str').str.lower()
2013-01-01    -0.8713474062085093
2013-01-02     0.5117196166006319
2013-01-03     0.3435604532681823
2013-01-04    0.17423665911328937
2013-01-05    -1.3839171647406576
2013-01-06     0.2525192847352602
Freq: D, Name: B, dtype: object
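The .str accessor exposes many more vectorized string methods than lower(); a short sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(["Apple", "banana", "Cherry"])

# Vectorized string methods via the .str accessor
print(s.str.upper().tolist())          # ['APPLE', 'BANANA', 'CHERRY']
print(s.str.contains("an").tolist())   # [False, True, False]
print(s.str.len().tolist())            # [5, 6, 6]
```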

Missing values

# missing data
df1.dropna(how='any')

A B C D F
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683 1.0
2013-01-03 -1.829322 0.343560 -0.005303 -1.312605 2.0
2013-01-04 -1.325634 0.174237 -0.921101 -0.065894 3.0
2013-01-05 -1.184577 -1.383917 0.798301 0.913562 4.0
2013-01-06 0.024899 0.252519 -0.833479 -0.109767 5.0
df1.dropna(subset=['F'], how='all')

A B C D F
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683 1.0
2013-01-03 -1.829322 0.343560 -0.005303 -1.312605 2.0
2013-01-04 -1.325634 0.174237 -0.921101 -0.065894 3.0
2013-01-05 -1.184577 -1.383917 0.798301 0.913562 4.0
2013-01-06 0.024899 0.252519 -0.833479 -0.109767 5.0
df1.fillna(value=5)

A B C D F
2013-01-01 -1.360827 -0.871347 -1.603602 0.073078 5.0
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683 1.0
2013-01-03 -1.829322 0.343560 -0.005303 -1.312605 2.0
2013-01-04 -1.325634 0.174237 -0.921101 -0.065894 3.0
2013-01-05 -1.184577 -1.383917 0.798301 0.913562 4.0
2013-01-06 0.024899 0.252519 -0.833479 -0.109767 5.0
pd.isnull(df1)

A B C D F
2013-01-01 False False False False True
2013-01-02 False False False False False
2013-01-03 False False False False False
2013-01-04 False False False False False
2013-01-05 False False False False False
2013-01-06 False False False False False

Merge and shape

df1.append(df2, ignore_index=True, sort=True).head()

A B C D E F G H I J
0 -1.360827 -0.871347 -1.603602 0.073078 NaN NaN NaN NaN NaN NaN
1 -0.210426 0.511720 -1.037810 -0.746683 NaN 1 NaN NaN NaN NaN
2 -1.829322 0.343560 -0.005303 -1.312605 NaN 2 NaN NaN NaN NaN
3 -1.325634 0.174237 -0.921101 -0.065894 NaN 3 NaN NaN NaN NaN
4 -1.184577 -1.383917 0.798301 0.913562 NaN 4 NaN NaN NaN NaN
pieces = [df1[:3], df1[3:7], df1[7:]]
pd.concat(pieces).head()

A B C D F
2013-01-01 -1.360827 -0.871347 -1.603602 0.073078 NaN
2013-01-02 -0.210426 0.511720 -1.037810 -0.746683 1.0
2013-01-03 -1.829322 0.343560 -0.005303 -1.312605 2.0
2013-01-04 -1.325634 0.174237 -0.921101 -0.065894 3.0
2013-01-05 -1.184577 -1.383917 0.798301 0.913562 4.0

GroupBy

df4.groupby(['A','B']).sum()

C D
A B
bar one -0.678605 0.569444
three -1.008900 1.841500
two -0.666534 -0.483262
foo one -1.002854 3.204916
three 1.517630 -0.458782
two 0.652168 1.610507

Stacked

stacked = df4.stack()
stacked.index[:4]
MultiIndex([(0, 'A'),
            (0, 'B'),
            (0, 'C'),
            (0, 'D')],
           )
stacked.head()
0  A         foo
   B         one
   C   -0.870843
   D     1.25516
1  A         bar
dtype: object
stacked.unstack().head()

A B C D
0 foo one -0.870843 1.25516
1 bar one -0.678605 0.569444
2 foo two 0.27089 0.692131
3 bar three -1.0089 1.8415
4 foo two 0.381278 0.918376
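stack and unstack are inverses when no data is lost; a minimal round-trip sketch on a toy frame (single dtype, so the round trip is exact):

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

stacked = df.stack()           # columns become an inner index level
print(stacked.index.nlevels)   # 2

restored = stacked.unstack()   # the inner level pivots back into columns
print(restored.equals(df))     # True
```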

Pivot table

df4

A B C D
0 foo one -0.870843 1.255164
1 bar one -0.678605 0.569444
2 foo two 0.270890 0.692131
3 bar three -1.008900 1.841500
4 foo two 0.381278 0.918376
5 bar two -0.666534 -0.483262
6 foo one -0.132010 1.949751
7 foo three 1.517630 -0.458782
# pivot tables (http://pbpython.com/pandas-pivot-table-explained.html)
df4["A"] = df4["A"].astype("category")
df4["A"].cat.set_categories(["foo", "bar"],inplace=True)
pd.pivot_table(df4,index=['A','B'])	#same as:
df4.groupby(['A','B']).mean()
df4.groupby(['A','B'])['D'].agg([np.sum,np.std])

sum std
A B
foo one 3.204916 0.491147
three -0.458782 NaN
two 1.610507 0.159980
bar one 0.569444 NaN
three 1.841500 NaN
two -0.483262 NaN
pd.pivot_table(df4,index=['A','B'], values=['D'], aggfunc=np.sum)

D
A B
foo one 3.204916
three -0.458782
two 1.610507
bar one 0.569444
three 1.841500
two -0.483262
pd.pivot_table(df4,index=['A'], columns=['B'], values=['D'], aggfunc=np.sum)

D
B one three two
A
foo 3.204916 -0.458782 1.610507
bar 0.569444 1.841500 -0.483262
pd.pivot_table(df4,index=['A'], columns=['B'], values=['D'], aggfunc=np.sum, fill_value=0,margins=True)

D
B one three two All
A
foo 3.204916 -0.458782 1.610507 4.356641
bar 0.569444 1.841500 -0.483262 1.927682
All 3.774359 1.382718 1.127246 6.284323
pd.pivot_table(df4, index=['A', 'B'], columns=['C'], values='D')

C -1.008900 -0.870843 -0.678605 -0.666534 -0.132010 0.270890 0.381278 1.517630
A B
foo one NaN 1.255164 NaN NaN 1.949751 NaN NaN NaN
three NaN NaN NaN NaN NaN NaN NaN -0.458782
two NaN NaN NaN NaN NaN 0.692131 0.918376 NaN
bar one NaN NaN 0.569444 NaN NaN NaN NaN NaN
three 1.8415 NaN NaN NaN NaN NaN NaN NaN
two NaN NaN NaN -0.483262 NaN NaN NaN NaN
pd.pivot_table(df4, index=['A'], columns=['B','C'], values='D')

B one three two
C -0.870843 -0.678605 -0.132010 -1.008900 1.517630 -0.666534 0.270890 0.381278
A
foo 1.255164 NaN 1.949751 NaN -0.458782 NaN 0.692131 0.918376
bar NaN 0.569444 NaN 1.8415 NaN -0.483262 NaN NaN
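When only counts are needed, pd.crosstab gives a pivot-table-style frequency table without specifying an aggregation; a small sketch with toy label columns:

```python
import pandas as pd

a = pd.Series(["foo", "bar", "foo", "foo"])
b = pd.Series(["one", "one", "two", "one"])

# Frequency table of the two label columns
tab = pd.crosstab(a, b)
print(tab.loc["foo", "one"])   # 2
```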

Timeseries

rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.head()
2012-01-01 00:00:00    114
2012-01-01 00:00:01    279
2012-01-01 00:00:02     32
2012-01-01 00:00:03    495
2012-01-01 00:00:04    433
Freq: S, dtype: int64
## frequency conversion (e.g., converting secondly data into 5-minutely data)
ts.resample('5Min').sum()
2012-01-01    24652
Freq: 5T, dtype: int64
## time zones
ts_utc = ts.tz_localize('UTC')
ts_utc.tz_convert('US/Eastern').head()
2011-12-31 19:00:00-05:00    114
2011-12-31 19:00:01-05:00    279
2011-12-31 19:00:02-05:00     32
2011-12-31 19:00:03-05:00    495
2011-12-31 19:00:04-05:00    433
Freq: S, dtype: int64
## format
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ps = ts.to_period()
ps.to_timestamp()
2012-01-01    2.490119
2012-02-01    0.237758
2012-03-01    0.249177
2012-04-01   -0.015938
2012-05-01   -0.314638
Freq: MS, dtype: float64
## timestamp conversion
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()
1990-03-01 09:00   -0.546684
1990-06-01 09:00    0.091767
1990-09-01 09:00   -1.180808
1990-12-01 09:00    0.929530
1991-03-01 09:00    1.646660
Freq: H, dtype: float64

Categorical data

df5 = df4.copy(deep=True)
## create categories
df5['incomeranges'] = pd.cut(df5['C'], 14)
df5["A"] = df5["A"].astype('category')
df5.head()

A B C D incomeranges
0 foo one -0.870843 1.255164 (-1.011, -0.828]
1 bar one -0.678605 0.569444 (-0.828, -0.648]
2 foo two 0.270890 0.692131 (0.254, 0.435]
3 bar three -1.008900 1.841500 (-1.011, -0.828]
4 foo two 0.381278 0.918376 (0.254, 0.435]
df5.A
0    foo
1    bar
2    foo
3    bar
4    foo
5    bar
6    foo
7    foo
Name: A, dtype: category
Categories (2, object): [foo, bar]
#rename to more meaningful
df5["A"].cat.categories = ["foo", "bar"]
df5.head()

A B C D incomeranges
0 foo one -0.870843 1.255164 (-1.011, -0.828]
1 bar one -0.678605 0.569444 (-0.828, -0.648]
2 foo two 0.270890 0.692131 (0.254, 0.435]
3 bar three -1.008900 1.841500 (-1.011, -0.828]
4 foo two 0.381278 0.918376 (0.254, 0.435]
# domain of categories
df5["A"] = df5["A"].cat.set_categories(["very bad", "bar", "medium", "foo", "very good"])
df5.sort_values(by="A").head()

A B C D incomeranges
1 bar one -0.678605 0.569444 (-0.828, -0.648]
3 bar three -1.008900 1.841500 (-1.011, -0.828]
5 bar two -0.666534 -0.483262 (-0.828, -0.648]
0 foo one -0.870843 1.255164 (-1.011, -0.828]
2 foo two 0.270890 0.692131 (0.254, 0.435]
df5['A'].value_counts()
foo          5
bar          3
very good    0
medium       0
very bad     0
Name: A, dtype: int64
## subset on category
tmp = df5[df5['A'] == "foo"]
tmp

A B C D incomeranges
0 foo one -0.870843 1.255164 (-1.011, -0.828]
2 foo two 0.270890 0.692131 (0.254, 0.435]
4 foo two 0.381278 0.918376 (0.254, 0.435]
6 foo one -0.132010 1.949751 (-0.287, -0.107]
7 foo three 1.517630 -0.458782 (1.337, 1.518]
## subset on two conditions, keeping only selected columns
df5[(df5['A'] == 'foo') & (df5['B'] == 'one')][['C', 'D']][:10]
#TODO:Plotting
## options
### Make the graphs a bit bigger (the old 'display.mpl_style' option was removed from pandas)
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15, 5)
### Always display all the columns ('display.line_width' is now 'display.width')
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)

Series calculations

df2

E F G H I J
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
df2.H.values
array([3, 3, 3, 3], dtype=int32)
df2.H.abs().tolist()
[3, 3, 3, 3]
df2.H.mean().tolist()
3.0

Numpy

Pandas uses Numpy’s ndarray for many of its underlying operations. For example, the DataFrame attribute .values (or the newer .to_numpy() method) represents a DataFrame df as a NumPy array, and you can pass pandas data structures directly to NumPy functions.

Now that we’ve looked at Pandas columns (Series), let’s take a look at what is driving these operations. This low-level perspective will help us understand strengths and weaknesses, and give us a much better sense of how to improve our code.

Introduction

NumPy’s arrays are more compact than Python lists: a large 3D structure built from nested lists can take 20 MB or so, while a NumPy 3D array of single-precision floats holding the same values fits in about 4 MB. Reading and writing items is also faster with NumPy.

The difference is mostly due to indirection: a Python list is an array of pointers to Python objects (8 bytes per pointer on a 64-bit build), and every element is a full Python object carrying its own type pointer, reference count, and value, rounded up by the memory allocator. A NumPy array is a contiguous block of uniform values: single-precision numbers take 4 bytes each, double-precision ones 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!

Because Numpy’s popularity is based on optimization, let’s focus on memory and speed during our discussion.

Memory layout

The following attributes contain information about the memory layout of the array:

  • ndarray.flags Information about the memory layout of the array.
  • ndarray.shape Tuple of array dimensions.
  • ndarray.ndim Number of array dimensions.
  • ndarray.strides Tuple of bytes to step in each dimension when traversing an array.
  • ndarray.data Python buffer object pointing to the start of the array’s data
  • ndarray.size Number of elements in the array.
  • ndarray.itemsize Length of one array element in bytes.
  • ndarray.dtype Data type of the array’s elements, e.g., int64 or character.
  • ndarray.nbytes Total bytes consumed by the elements of the array.
  • ndarray.base Base object if memory is from some other object.

ndarray.__sizeof__() is somewhat bigger than ndarray.nbytes because it includes the object overhead; nbytes counts only the bytes in the allocated array data.

import sys
arr = np.random.randn(10**3)
lst = list(arr)
arr2d = arr.reshape(20,50)
print( len(lst) )
print( arr.shape )
print( arr2d.shape )
1000
(1000,)
(20, 50)
print( type(lst) )
print( type(arr) )
print( type(arr2d) )

print( type(lst[0]) )
print( arr.dtype )
<class 'list'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.float64'>
float64
#item
print( sys.getsizeof( lst[0] ) )
print( arr.itemsize )

#total
print( sys.getsizeof( lst ) )
print( sys.getsizeof( arr ) )
print( sys.getsizeof( arr2d ) )

print( arr.nbytes )
print( arr2d.nbytes )
32
8
9112
8096
112
8000
8000
print( arr.tobytes()[:100] )
b',!1\xcdk\\\xf0?\x8b\x0f\x93\x19)\xf0\xe7?v\xec3\xcb}e\xbf?\x89\x92\x0b\xac\xc4_\xb7?\x1d\xdb\\*\xdaG\xe2?}\x16\xf8n\x99\xcd\xd7\xbfc\xf3\xf9g\x87\xb5\x03\xc0\x04&\x86\xc7\x08*\xf4?\xb8j\xd6\x03}\xed\xef\xbf\xe6\xcf\xb0\x0e\x90\xe1\xd8\xbf{\xa2\x1dn\xe1\xef\xe1\xbfF\x90{\xfd\xf2\x14\xde\xbf\\\xb8pv'

Comparison of ndarray to list

Slice assignment is easier with a Numpy array than with a list.

a = np.array([1, 2, 5, 7, 8])
a[1:3] = -1
a
array([ 1, -1, -1,  7,  8])
b = [1, 2, 5, 7, 8]
try:
    b[1:3] = -1
except:
    print("TypeError: can only assign an iterable")
TypeError: can only assign an iterable

ndarray slices are actually views on the same data buffer: modifying a slice modifies the original ndarray as well (Python list slices, by contrast, are copies).

a=[1,2,5,7,8]
b=a[1:5]
b[1]=3
print(a)
print(b)
[1, 2, 5, 7, 8]
[2, 3, 7, 8]
a = np.array([1, 2, 5, 7, 8])
a_slice = a[1:5]
a_slice[1] = 1000
print( a )
print( a_slice)
[   1    2 1000    7    8]
[   2 1000    7    8]
a = np.array([1, 2, 5, 7, 8])
a_slice = a[1:5].copy()
a_slice[1] = 1000
print( a )
print( a_slice)
[1 2 5 7 8]
[   2 1000    7    8]
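Whether two arrays share a buffer can be confirmed with np.shares_memory; a quick sketch of the view/copy distinction:

```python
import numpy as np

a = np.array([1, 2, 5, 7, 8])

view = a[1:5]          # slicing returns a view on the same buffer
copied = a[1:5].copy() # .copy() allocates a new buffer

print(np.shares_memory(a, view))    # True
print(np.shares_memory(a, copied))  # False
```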

2D ndarray slicing is different from list slicing: both axes go into a single subscript, arr[row_start:row_end, col_start:col_end].

ndarray also allows boolean indexing.

a = np.array([0, 1, 2, 5, 7, 8]).reshape(3,2)
a
array([[0, 1],
       [2, 5],
       [7, 8]])
rows_on = np.array([True, False, True])
a[rows_on,:]
array([[0, 1],
       [7, 8]])
(arr != 2)[:10]
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])
arr[arr != 2] [:10]
array([ 1.02256375,  0.74806647,  0.12264239,  0.09130506,  0.57127102,
       -0.37192379, -2.46363717,  1.26026228, -0.99774028, -0.38876726])

A quick comparison of summation speed for a Numpy array versus a Python list.

from numpy import arange
from timeit import Timer

Nelements = 10000
Ntimeits = 10000

x = arange(Nelements)
y = range(Nelements)

t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list:  %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))
numpy: 1.171e-05
list:  1.768e-04

Creating matrices

The typical random and range-based array initializers are always useful.

print( np.random.randint(1,10, 5) )
print( np.arange(1,10) )
[8 8 7 4 3]
[1 2 3 4 5 6 7 8 9]

A matrix of random numbers.

print( np.random.rand(3,3) )
[[0.24273703 0.4428199  0.51066964]
 [0.66546027 0.83538717 0.92546135]
 [0.63670416 0.83450213 0.57655533]]

A matrix of zeros.

np.zeros((4,3))
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

The identity matrix.

np.eye(4)
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

Scalar operations

arr = np.arange(1,10)
np.append(arr, 12)
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 12])
print( arr.size )
print( arr.mean() )
9
5.0
print( arr.min() )
print( arr.max() )
1
9
print( type(arr.tolist()) )
<class 'list'>
print( arr.astype('str') )
print( arr.astype('int') )
['1' '2' '3' '4' '5' '6' '7' '8' '9']
[1 2 3 4 5 6 7 8 9]

Matrix operations

arr = np.arange(1,10)
np.append(arr, 12)
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 12])
print( arr + arr )
print( arr - arr )
[ 2  4  6  8 10 12 14 16 18]
[0 0 0 0 0 0 0 0 0]
#A[n+1]-A[n]
np.diff(arr, n=1)
array([1, 1, 1, 1, 1, 1, 1, 1])

Broadcasting provides a means of vectorizing array operations so that looping occurs in compiled C instead of in Python. The term refers to how Numpy treats arrays of different shapes during arithmetic: the smaller array is virtually repeated (without copying) so that it matches the shape of the larger array.

arr * 2
array([ 2,  4,  6,  8, 10, 12, 14, 16, 18])
arr + 4
array([ 5,  6,  7,  8,  9, 10, 11, 12, 13])
%%timeit 
arr * 2
1.36 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
twos = np.ones(9) * 2
%%timeit
arr * twos
1.45 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
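Broadcasting also extends across dimensions: combining a (3, 1) column with a length-3 row produces a full (3, 3) result. A small sketch (values illustrative):

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1): [[0], [1], [2]]
row = np.arange(3) * 10            # shape (3,):   [0, 10, 20]

# Shapes (3, 1) and (3,) broadcast to (3, 3): an "outer sum"
table = col + row
print(table.shape)    # (3, 3)
print(table[2, 1])    # 2 + 10 = 12
```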
M = np.arange(1,10).reshape(3,3)
M
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
M.T
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])
M.diagonal()
array([1, 5, 9])
np.dot(M, M)
array([[ 30,  36,  42],
       [ 66,  81,  96],
       [102, 126, 150]])
np.sum( M )
45
np.sum(M, axis=1)    #row sums (aggregating across columns); use `axis=0` for column sums
array([ 6, 15, 24])
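Beyond dot products, np.linalg provides common linear-algebra routines; a minimal sketch solving a linear system Ax = b (the matrix values are illustrative):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Solve A @ x = b directly, rather than inverting A
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))   # True
```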

External Interfaces

Using CUDA with Numba

Basic C++

Working with Eigen

Conclusion

Pandas provides fast and thorough means of manipulating data in Python, powered by the operations in Numpy. While Pandas is useful for preparing data, the actual analysis and computation are typically performed on the underlying Numpy arrays. This is part of the reason Numpy is the foundation of so many important libraries, such as scikit-learn, PyTorch, and many others.