The R and Pandas Dataframe

Although pandas uses the DataFrame as its primary data structure, just as R does, the pandas syntax and underlying fundamentals can be disorienting for R users. This post describes some basic comparisons and inconsistencies between the two languages. It also provides examples of some very non-intuitive solutions to common problems.

Introduction

The data-science debate of R versus Pandas is really an apples-and-oranges comparison. R is a domain-specific language for statistics, analytics, and data visualization. This makes R great for consulting, research, and basic analysis, especially in a careful academic context.

In contrast, Python’s statistics packages are woefully inadequate and rarely mention details that matter greatly to statistical practitioners. One example is the use of contrasts in linear models: the different Types (I-IV) of analysis-of-variance models use different contrast encodings, and determining their estimators is not trivial.

However, if you want tight integration with other applications, the strengths of a general-purpose programming language, and to ‘just get models done’, then Python / Pandas is a great solution. Pandas is quite good at data manipulation. Python has the very strong NumPy and scikit-learn libraries, which are excellent for matrix operations and predictive modeling. And Python itself is a really good general scripting language with strong support for strings and datetime types.

Setup

First, import the proper libraries. To compare with R syntax directly, we use rpy2.

import pandas as pd
import numpy as np
%load_ext rpy2.ipython
%R require(ggplot2)
array([1], dtype=int32)

These Dataframes will be used for the various problems.

trades = pd.DataFrame(
    [
        ["2016-05-25 13:30:01.023", "MSFT", 51.95, 75],
        ["2016-05-25 13:30:01.038", "MSFT", 51.95, 155],
        ["2016-05-25 13:30:03.048", "GOOG", 720.77, 100],
        ["2016-05-25 13:30:03.048", "GOOG", 720.92, 100],
        ["2016-05-25 13:30:03.048", "AAPL", 98.00, 100],
    ],
    columns=["timestamp", "ticker", "price", "quantity"],   # when loading from a file, read_csv can set the index with index_col='timestamp'
)
trades['timestamp'] = pd.to_datetime(trades['timestamp'])
trades.head()

timestamp ticker price quantity
0 2016-05-25 13:30:01.023 MSFT 51.95 75
1 2016-05-25 13:30:01.038 MSFT 51.95 155
2 2016-05-25 13:30:03.048 GOOG 720.77 100
3 2016-05-25 13:30:03.048 GOOG 720.92 100
4 2016-05-25 13:30:03.048 AAPL 98.00 100

We then set the index to the timestamp. The index allows selecting rows with the loc and at methods.

trades_idx = trades.set_index('timestamp')
trades_idx.head()

ticker price quantity
timestamp
2016-05-25 13:30:01.023 MSFT 51.95 75
2016-05-25 13:30:01.038 MSFT 51.95 155
2016-05-25 13:30:03.048 GOOG 720.77 100
2016-05-25 13:30:03.048 GOOG 720.92 100
2016-05-25 13:30:03.048 AAPL 98.00 100

Syntax Differences

Basic selection

The primary purpose of the DataFrame indexing operator, df[ ] is to select columns.

trades['ticker'].head(3)
0    MSFT
1    MSFT
2    GOOG
Name: ticker, dtype: object
trades[['ticker','price']].head(3)

ticker price
0 MSFT 51.95
1 MSFT 51.95
2 GOOG 720.77
%%R -i trades
head( trades['ticker'], 3)
  ticker
0   MSFT
1   MSFT
2   GOOG
%%R -i trades
head( trades[c('ticker','price')], 3)
  ticker  price
0   MSFT  51.95
1   MSFT  51.95
2   GOOG 720.77

When the indexing operator is passed a string or integer, it attempts to find a column with that particular name and return it as a Series.

For example, df[2] searches for a column name matching the integer value 2. This column does not exist and a KeyError is raised.

try:
    trades[2]
except KeyError:
    print("got key error")
got key error
%%R -i trades
head(trades[2], 3)
  ticker
0   MSFT
1   MSFT
2   GOOG
%%R -i trades
head(trades[,2], 3)
[1] "MSFT" "MSFT" "GOOG"

In R, we can use integer indexing of columns. It’s not necessary to use the c() combine function to create a vector, but it can often make your code cleaner.

%%R -i trades
head(trades[,c(2:3)], 3)
  ticker  price
0   MSFT  51.95
1   MSFT  51.95
2   GOOG 720.77
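For comparison, the pandas counterpart to R’s integer column indexing is .iloc with a column slice. A minimal sketch (note that pandas positions are 0-based and end-exclusive, while R’s are 1-based and inclusive):

```python
import pandas as pd

trades = pd.DataFrame(
    [["2016-05-25 13:30:01.023", "MSFT", 51.95, 75],
     ["2016-05-25 13:30:01.038", "MSFT", 51.95, 155]],
    columns=["timestamp", "ticker", "price", "quantity"],
)

# all rows, columns at integer positions 1 and 2 ('ticker', 'price')
subset = trades.iloc[:, 1:3]
print(subset.columns.tolist())  # ['ticker', 'price']
```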

However, the DataFrame indexing operator completely changes behavior when slice notation is used: with a slice, it selects rows, either by integer location or by index label.

trades[:3]

timestamp ticker price quantity
0 2016-05-25 13:30:01.023 MSFT 51.95 75
1 2016-05-25 13:30:01.038 MSFT 51.95 155
2 2016-05-25 13:30:03.048 GOOG 720.77 100
%%R -i trades
head(trades[1:3,], 3)
            timestamp ticker  price quantity
0 2016-05-25 13:30:01   MSFT  51.95       75
1 2016-05-25 13:30:01   MSFT  51.95      155
2 2016-05-25 13:30:03   GOOG 720.77      100

The following selects rows beginning at integer location 1, up to but not including 6, by every third row.

trades[1:6:3]

timestamp ticker price quantity
1 2016-05-25 13:30:01.038 MSFT 51.95 155
4 2016-05-25 13:30:03.048 AAPL 98.00 100

You can also use slices of string labels if your DataFrame index contains strings. Here we meet two new operators: .iloc, which explicitly supports only integer indexing, and .loc, which explicitly supports only label indexing.

trades.iloc[1:6:3]

timestamp ticker price quantity
1 2016-05-25 13:30:01.038 MSFT 51.95 155
4 2016-05-25 13:30:03.048 AAPL 98.00 100

With the timestamp index, we can subset using a variety of input string formats.

trades_idx.loc['2016-05'].head(3)

ticker price quantity
timestamp
2016-05-25 13:30:01.023 MSFT 51.95 75
2016-05-25 13:30:01.038 MSFT 51.95 155
2016-05-25 13:30:03.048 GOOG 720.77 100
trades_idx.loc['May 2016'].head(3)

ticker price quantity
timestamp
2016-05-25 13:30:01.023 MSFT 51.95 75
2016-05-25 13:30:01.038 MSFT 51.95 155
2016-05-25 13:30:03.048 GOOG 720.77 100

The .loc / .iloc operators can select on both rows and columns simultaneously; the indexing operator [ ] cannot.

trades_idx.loc['May 2016', 'ticker']
timestamp
2016-05-25 13:30:01.023    MSFT
2016-05-25 13:30:01.038    MSFT
2016-05-25 13:30:03.048    GOOG
2016-05-25 13:30:03.048    GOOG
2016-05-25 13:30:03.048    AAPL
Name: ticker, dtype: object
try:
    trades[2, 'ticker']
except:
    print('use R')
use R

This all seems very tedious for R users: just get on with it.

%%R -i trades
head(trades[2,'ticker'], 3)
[1] "MSFT"

Dataframe Index

One of the fundamental differences between R and pandas is that pandas leans heavily on the dataframe index. To R users it can seem like an annoyance, but you cannot avoid it: many pandas methods return a dataframe with a meaningful index.

The index is already set for trades_idx, but if we want to replace it with a new one, it is easily done.

trades_idx.set_index('price')

ticker quantity
price
51.95 MSFT 75
51.95 MSFT 155
720.77 GOOG 100
720.92 GOOG 100
98.00 AAPL 100

There are two ways to move the index into a column.

new_trades_idx = trades_idx.copy()
new_trades_idx['timestamp'] = trades_idx.index
new_trades_idx

ticker price quantity timestamp
timestamp
2016-05-25 13:30:01.023 MSFT 51.95 75 2016-05-25 13:30:01.023
2016-05-25 13:30:01.038 MSFT 51.95 155 2016-05-25 13:30:01.038
2016-05-25 13:30:03.048 GOOG 720.77 100 2016-05-25 13:30:03.048
2016-05-25 13:30:03.048 GOOG 720.92 100 2016-05-25 13:30:03.048
2016-05-25 13:30:03.048 AAPL 98.00 100 2016-05-25 13:30:03.048
trades_idx.reset_index(level=0, inplace=False)

timestamp ticker price quantity
0 2016-05-25 13:30:01.023 MSFT 51.95 75
1 2016-05-25 13:30:01.038 MSFT 51.95 155
2 2016-05-25 13:30:03.048 GOOG 720.77 100
3 2016-05-25 13:30:03.048 GOOG 720.92 100
4 2016-05-25 13:30:03.048 AAPL 98.00 100

When using a timestamp index, partial string indexing expects the default ISO 8601 datetime format (“yyyy-mm-dd hh:mm:ss”).

try:
    trades_idx.loc['2016-05-25 13:30:00.023']
except:
    print('cannot use fractions of seconds')
cannot use fractions of seconds
trades_idx.loc['2016-05-25 13:30:01':'2016-05-25 13:30:02']

ticker price quantity
timestamp
2016-05-25 13:30:01.023 MSFT 51.95 75
2016-05-25 13:30:01.038 MSFT 51.95 155

While the index of an R dataframe is immutable, the index of a pandas dataframe can gain or lose arbitrary items. In this example, some rows are removed, and their index labels go with them. If we don’t reset_index(), those labels stay missing.

tmp2 = trades.copy()
tmp3 = tmp2[0:4:2]
tmp3

timestamp ticker price quantity
0 2016-05-25 13:30:01.023 MSFT 51.95 75
2 2016-05-25 13:30:03.048 GOOG 720.77 100
tmp3.reset_index(inplace=True)
tmp3

index timestamp ticker price quantity
0 0 2016-05-25 13:30:01.023 MSFT 51.95 75
1 2 2016-05-25 13:30:03.048 GOOG 720.77 100

Forgetting to reset_index() can cause serious problems.
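A minimal sketch of the kind of problem this causes (hypothetical frames): assignment in pandas aligns on index labels, not position, so a frame with a gappy index silently picks up missing values.

```python
import pandas as pd

# survivor rows kept their original labels 0 and 2
tmp3 = pd.DataFrame({"price": [51.95, 720.77]}, index=[0, 2])

# a fresh Series carries the default index 0, 1 -- alignment decides the match
tmp3["flag"] = pd.Series(["a", "b"])

print(tmp3["flag"].tolist())  # ['a', nan] -- label 2 had no partner
```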

You can also create a multi-index to make things even more confusing.

index = pd.MultiIndex.from_product([['TX', 'FL', 'CA'], 
                                    ['North', 'South']], 
                                   names=['State', 'Direction'])

df = pd.DataFrame(index=index, 
                  data=np.random.randint(0, 10, (6,4)), 
                  columns=list('abcd'))
df

a b c d
State Direction
TX North 6 4 1 5
South 0 9 3 2
FL North 3 2 9 3
South 7 2 3 0
CA North 9 6 3 4
South 5 6 9 0
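Selecting from a multi-index is less confusing once you know .loc for the outer level and .xs for cross-sections on inner levels. A sketch with deterministic data (the frame above was random):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['TX', 'FL', 'CA'], ['North', 'South']],
                                   names=['State', 'Direction'])
df = pd.DataFrame(index=index,
                  data=np.arange(24).reshape(6, 4),
                  columns=list('abcd'))

# outer level: all rows for one state
tx = df.loc['TX']
# cross-section on the inner level: all 'North' rows across states
north = df.xs('North', level='Direction')

print(tx.shape)     # (2, 4)
print(north.shape)  # (3, 4)
```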

Subset replacement (the SettingWithCopy warning)

Conditionally replacing values in R is straightforward because we can use simultaneous selection.

%%R -i trades

tmp1 <- trades
tmp1[tmp1$quantity > 80, c('price')] <- 100
tmp1
            timestamp ticker  price quantity
0 2016-05-25 13:30:01   MSFT  51.95       75
1 2016-05-25 13:30:01   MSFT 100.00      155
2 2016-05-25 13:30:03   GOOG 100.00      100
3 2016-05-25 13:30:03   GOOG 100.00      100
4 2016-05-25 13:30:03   AAPL 100.00      100

We know that we cannot use simultaneous selection in pandas using the indexing operator.

tmp1 = trades.copy()
try:
    tmp1[tmp1.quantity > 80, 'price'] = 100
except TypeError as e:
    print("TypeError:", e)
TypeError: 'Series' objects are mutable, thus they cannot be hashed

Chained indexing doesn’t give us what we want, either; the direct-looking way fails! Oh, pandas, why must you be so difficult…

tmp1 = trades.copy()
tmp1[tmp1.quantity > 80]['price'] = 100
tmp1
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  

timestamp ticker price quantity
0 2016-05-25 13:30:01.023 MSFT 51.95 75
1 2016-05-25 13:30:01.038 MSFT 51.95 155
2 2016-05-25 13:30:03.048 GOOG 720.77 100
3 2016-05-25 13:30:03.048 GOOG 720.92 100
4 2016-05-25 13:30:03.048 AAPL 98.00 100

To understand, we must look under the hood:

df.loc[df.A > 5, 'B'] = 4
# becomes
df.__setitem__((df.A > 5, 'B'), 4)

This is a single __setitem__ call on df. On the other hand, consider this code:

df[df.A > 5]['B'] = 4
# becomes
df.__getitem__(df.A > 5).__setitem__('B', 4)

Now, depending on whether __getitem__ returned a view or a copy, the __setitem__ operation may not work.

The pandas documentation addresses these issues in its indexing guide. To summarize:

Whenever an array of labels or a boolean vector is involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned. (These examples from the docs use the now-deprecated .ix accessor, but the view-versus-copy rules are the same.)

Finally, the .loc method gives us what we want because we can use it with boolean indexing.

tmp1.quantity > 80
0    False
1     True
2     True
3     True
4     True
Name: quantity, dtype: bool
tmp1 = trades.copy()
tmp1.loc[tmp1.quantity > 80, 'price'] = 100
tmp1

timestamp ticker price quantity
0 2016-05-25 13:30:01.023 MSFT 51.95 75
1 2016-05-25 13:30:01.038 MSFT 100.00 155
2 2016-05-25 13:30:03.048 GOOG 100.00 100
3 2016-05-25 13:30:03.048 GOOG 100.00 100
4 2016-05-25 13:30:03.048 AAPL 100.00 100

General Usage

Timestamps

The pandas object dtype is used for text or mixed numeric and non-numeric values. To sort and select the timestamp column correctly, we need to convert it to the datetime64 type.

Parsing datetime values has its own idioms, of course. A one-liner worth keeping:

df.timestamp = pd.to_datetime(df.timestamp.str.replace("D", "T"))
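As a runnable sketch (the raw strings here are hypothetical, using “D” as a date/time separator):

```python
import pandas as pd

# hypothetical raw strings with 'D' separating date and time
df = pd.DataFrame({"timestamp": ["2016-05-25D13:30:01", "2016-05-26D09:15:00"]})

# swap the separator for ISO 8601's 'T', then parse
df["timestamp"] = pd.to_datetime(df["timestamp"].str.replace("D", "T"))
print(df["timestamp"].dtype)  # datetime64[ns]
```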

Merge

Merge with an indicator for where the rows came from

df_merge = left.merge(right, on='key', how='left', indicator=True)

The _merge column is used to check for unexpected rows

df_merge._merge.value_counts()

Merge by the nearest (not exact) timestamp

pd.merge_asof(trades, quotes, on="timestamp", by='ticker', tolerance=pd.Timedelta('10ms'), direction='backward')
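A runnable sketch of the as-of merge (the quotes frame here is hypothetical, invented for illustration; both frames must be sorted on the key column):

```python
import pandas as pd

trades = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-25 13:30:01.023",
                                 "2016-05-25 13:30:01.038"]),
    "ticker": ["MSFT", "MSFT"],
    "price": [51.95, 51.95],
})
quotes = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-25 13:30:01.020",
                                 "2016-05-25 13:30:01.030"]),
    "ticker": ["MSFT", "MSFT"],
    "bid": [51.93, 51.94],
})

# each trade picks up the most recent quote within 10ms, per ticker
merged = pd.merge_asof(trades, quotes, on="timestamp", by="ticker",
                       tolerance=pd.Timedelta("10ms"), direction="backward")
print(merged["bid"].tolist())  # [51.93, 51.94]
```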

Missing values

Because NumPy has no native NA type, pandas uses multiple values to denote a missing value:

  • NaN for numeric/object missing values
  • NaT for datetime missing values
  • None, which comes from Python

What surprised me is that None equals None in Python, but nan does not equal nan in NumPy.

None == None          # True
np.nan == np.nan      # False

This is important so that we don’t try to filter values by comparing with None:

df[df.some_column == None]  # wrong: always empty
# instead use:
df[df.some_column.isnull()]
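A minimal sketch of why the comparison fails (hypothetical frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"some_column": [1.0, np.nan, 3.0]})

# comparing against None/NaN never matches -- this returns an empty frame
wrong = df[df.some_column == None]  # noqa: E711
# isnull() is the reliable test
right = df[df.some_column.isnull()]

print(len(wrong), len(right))  # 0 1
```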

Report creation

Pandas (with the XlsxWriter library) lets us create an Excel sheet with charts and other Excel functionality. We need to define the chart type (a line chart in this example) and the chart’s data series (the data series must be present in the Excel spreadsheet).

report_name = 'example_report.xlsx'
sheet_name = 'Sheet1'
writer = pd.ExcelWriter(report_name, engine='xlsxwriter')
df.to_excel(writer, sheet_name=sheet_name, index=False)
# writer.save()

# define the workbook
workbook = writer.book
worksheet = writer.sheets[sheet_name]
# create a chart line object
chart = workbook.add_chart({'type': 'line'})
# configure the series of the chart from the spreadsheet
# using a list of values instead of category/value formulas:
#     [sheetname, first_row, first_col, last_row, last_col]
chart.add_series({
    'categories': [sheet_name, 1, 0, 3, 0],
    'values':     [sheet_name, 1, 1, 3, 1],
})
# configure the chart axes
chart.set_x_axis({'name': 'Index', 'position_axis': 'on_tick'})
chart.set_y_axis({'name': 'Value', 'major_gridlines': {'visible': False}})
# place the chart on the worksheet
worksheet.insert_chart('E2', chart)
# output the excel file
writer.save()

Saving objects

When we save this (large, randomly generated) DataFrame as CSV, it takes almost 300 MB on disk.

df.to_csv('random_data.csv', index=False)

With the single argument compression='gzip', we can reduce the file size to 136 MB.

df.to_csv('random_data.gz', compression='gzip', index=False)

It is also easy to read the gzipped data to the DataFrame, so we don’t lose any functionality.

df = pd.read_csv('random_data.gz')
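A small round-trip sketch (tiny hypothetical frame, written to a temporary directory; read_csv infers the compression from the .gz extension):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"x": range(5), "y": list("abcde")})
path = os.path.join(tempfile.mkdtemp(), "random_data.gz")

df.to_csv(path, compression="gzip", index=False)
# compression='infer' (the default) picks gzip from the extension
df2 = pd.read_csv(path)

print(df.equals(df2))  # True
```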

Specific examples

Negate a string replacement

You can use a negative lookahead assertion (?!); ^(?!.*fish).*$ first asserts that the string does not contain the word fish and then matches everything to the end of the string, replacing it with foo:

  • ^ denotes the beginning of the string; combined with (?!.*fish), it asserts at the start that no pattern like .*fish occurs in the string;
  • If the assertion succeeds, .*$ matches everything to the end of the string, and the match is replaced with foo; if the assertion fails, the pattern doesn’t match and nothing happens;

so:

df.replace(r'^(?!.*fish).*$', 'foo', regex=True)
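A runnable sketch on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"text": ["one fish", "two fish", "red cat"]})

# rows containing 'fish' are left alone; everything else becomes 'foo'
out = df.replace(r'^(?!.*fish).*$', 'foo', regex=True)
print(out["text"].tolist())  # ['one fish', 'two fish', 'foo']
```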

Remove Outliers

df['size_clip'] = df['size'].clip(df['size'].quantile(0.01),
                                  df['size'].quantile(0.99))
df.size_clip.plot(kind='box')
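A self-contained sketch of the clipping step (hypothetical random data; clip winsorizes by pulling everything outside the 1st/99th percentiles back to those bounds):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({"size": np.random.normal(100, 10, 1000)})

# clamp values to the [1st percentile, 99th percentile] interval
df["size_clip"] = df["size"].clip(df["size"].quantile(0.01),
                                  df["size"].quantile(0.99))

print(df["size_clip"].between(df["size"].quantile(0.01),
                              df["size"].quantile(0.99)).all())  # True
```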

Bin / group data

df['price_discrete_labels'] = pd.cut(df.price, 5, labels=['very low', 'low', 'mid', 'high', 'very high'])
df['price_discrete_labels'].value_counts()

Equal-sized groups

df['price_discrete_equal_bins'] = pd.qcut(df.price, 5)
df['price_discrete_equal_bins'].value_counts()
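Putting both together as a runnable sketch (hypothetical random prices): cut makes equal-width bins, qcut makes bins with roughly equal row counts.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({"price": np.random.uniform(10, 100, 500)})

# cut: 5 equal-width bins with readable labels
df["price_discrete_labels"] = pd.cut(
    df.price, 5, labels=['very low', 'low', 'mid', 'high', 'very high'])
# qcut: 5 quantile bins, each holding about the same number of rows
df["price_discrete_equal_bins"] = pd.qcut(df.price, 5)

print(df["price_discrete_labels"].nunique())      # 5
print(df["price_discrete_equal_bins"].nunique())  # 5
```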

Conclusion

After reviewing these examples, we hope you agree that R is a terse, minimalist, and powerful language. You can do so much with just the dataframe indexing operator that interacting with R is a joy. So much of a joy, in fact, that I typically don’t know how the operations are actually performed; they just work.

Pandas is different. It is bolted onto Python, and its syntax is either inconsistent or just unwieldy. But maybe that is a good thing, because with Python I’m typically automating a process, not performing an interactive investigation. I want to be forced to understand how the data is actually being processed, and I want to optimize operations for performance, because the process will be automated.

Let’s stop the impassioned vitriol of one clan against another, and use each of these tools for its respective strengths.