Pandas introduction

This section introduces some basic functionality of the Python pandas data analysis package and aims to ensure that all users are able to rapidly backtest realistic strategies.

Note: the pandas library underlies parts of the SigTech Framework and is widely used in example notebooks.

Importing modules

Import the datetime module under the alias dtm. This module is used for defining dates, times and more.

It is also useful for creating schedules and performing analysis of data indexed by dates and timestamps.

import datetime as dtm
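
For instance, a calendar date and a timestamp can be constructed as follows (a brief illustration using only the standard library; the variable names are just examples):

# a calendar date and a date with a time component
example_date = dtm.date(2021, 1, 4)
example_timestamp = dtm.datetime(2021, 1, 4, 9, 30)
print(example_date, example_timestamp)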

Import the pandas data analysis module under the alias pd:

import pandas as pd

Import the numpy numerical computing module under the alias np:

import numpy as np

Import the matplotlib.pyplot library for plotting under the alias plt:

import matplotlib.pyplot as plt

Import the plotting library seaborn under the alias sns. Set the default figure size for all plots within the notebook:

import seaborn as sns
sns.set(rc={'figure.figsize': (18, 6)})

Import and initialise the SigTech Framework:

import sigtech.framework as sig
end_date = dtm.date(2021, 12, 31)
sig.init(env_date=end_date)

Pandas objects

The SigTech platform is object-oriented in design: every asset is modelled as a different object with its own attributes, methods and attached datasets.

Fetch a SingleStock object:

stock = sig.obj.get('1000045.SINGLE_STOCK.TRADABLE')

Pandas series objects

A historical time series of an asset's value can be obtained by calling the history method on that asset. This returns a pandas Series (pd.Series) object, whose values are prices and whose index entries are the corresponding dates:

stock_history = stock.history()
stock_history

The length of the pandas series (5536) is displayed automatically. It can also be obtained by calling len on a series (or dataframe):

len(stock_history)

Individual time series data in the SigTech platform is generally returned in the pd.Series format. The cells below check that stock_history is a series:

type(stock_history)
isinstance(stock_history, pd.Series)

Pandas DataFrame objects

A pandas dataframe object, pd.DataFrame, is returned if there are multiple columns of time series data.

To give an example dataframe to work with, first fetch a futures contract:

futures_contract = sig.obj.get('ESH22 INDEX')
futures_contract

To view the available historical time series data attached to an object, access the history_fields attribute of that object:

futures_contract.history_fields

A pandas dataframe whose columns are from a specified list of fields can be obtained by supplying the list to the history_df method.

futures_data = futures_contract.history_df(
    fields=['HighPrice', 'LowPrice']
)
futures_data.head()

Check that futures_data is a pd.DataFrame:

isinstance(futures_data, pd.DataFrame)
type(futures_data)

The columns of a DataFrame can be fetched directly:

futures_data.columns

Data manipulation

Slicing data

The first five rows of a series or dataframe can be obtained by calling head on it. The analogue for the last rows is to use the tail method.

An optional argument can be passed to return a different number of rows.

This example returns the first five rows of stock_history:

stock_history.head()

This example returns the last three rows of futures_data:

futures_data.tail(3)

A pandas series or dataframe consists of an index and values.

For time series data within the SigTech platform, the index consists of observation dates, or datetimes.

For both the unadjusted stock prices and the futures data, the index of dates is stored as a DatetimeIndex:

stock_history.index
futures_data.index
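
The underlying values can also be accessed directly as a NumPy array via the values attribute (or the equivalent to_numpy method); a short aside for completeness:

# the prices without their date index, as a NumPy array
stock_history.values[:5]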

A DatetimeIndex can be easily sliced and used in data filtering. To filter a pandas series or dataframe based on the values of its index, use the loc accessor. In the case of a DatetimeIndex there are further possibilities:

Restrict to futures data from February 2021:

futures_data.loc['2021-02'].head()

Note: futures_data itself is not altered; the slicing returns a new pd.DataFrame.

futures_data.head()

Half-open intervals can also be provided.

The following cell restricts to stock prices from 2019 onwards. The original pd.Series is unaltered but a new pd.Series is returned.

stock_history.loc['2019':]

Restrict to stock prices between April and June 2016, inclusive. Here the head method is called on the resulting series so that only its first seven rows are returned:

stock_history.loc['2016-04': '2016-06'].head(7)

An alternative way to slice data is by position, using the iloc accessor.

Integer positions instead of labels are passed to indicate which rows to include, using zero-based indexing:

# equivalent to `head()`
stock_history.iloc[:5]
# equivalent to `tail(3)`
futures_data.iloc[-3:]

A pandas dataframe can be sliced on columns as well as rows.

If a list of columns is provided, then a dataframe is returned. If just one column is provided, then a pandas series is returned.

# essentially no change as both columns are kept
columns_to_keep = ['HighPrice', 'LowPrice']
futures_data[columns_to_keep].tail()

Selecting a single column returns a pd.Series:

isinstance(futures_data['HighPrice'], pd.Series)

Using the loc syntax to filter columns:

futures_data.loc[:, 'HighPrice']

Line plots

A pandas series or dataframe can be plotted by calling the plot method:

stock_history.plot()

For a dataframe, the column names are automatically used to form the legend.

futures_data.plot()

As filtering by date returns another pandas series or dataframe, the same plot method can be used on the result of the filtering.

stock_history.loc[:'2010'].plot();

Computing returns

Returns can be obtained from price series by using the pct_change method:

returns = stock_history.pct_change()
returns

An equivalent way to obtain this is to divide the prices by the prices shifted forward by one day, and then subtract one:

stock_history.head()
stock_history.shift(1).head()
stock_history / stock_history.shift(1) - 1

Note: division is well defined and done on an element-by-element basis, matching on the index. The subtraction of one is vectorised (applied to all elements).
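
As a small illustration of this index alignment, consider two synthetic series with partially overlapping indices (an aside, not part of the stock data):

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])
# division aligns on the index; non-matching labels produce NaN
s1 / s2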

To compute log returns, apply np.log to the ratio of prices to shifted prices. This relies on the vectorised way in which the function is applied to each element of the series:

log_returns = np.log(stock_history / stock_history.shift(1))
log_returns

Dropping NaN values

The first entry in the returns series is NaN. No return can be computed for January 3rd 2000 as no prices prior to this date are available.

returns.iloc[0]

Such values can be removed from pandas series and dataframes by calling the dropna method.

Calling this method does not alter the original series or dataframe but returns a new one with NaN values removed.

returns.dropna()
returns.head()

Reassign returns to be the output from calling dropna:

returns = returns.dropna()
returns.head()

To take a synthetic example, a dataframe with three columns is built as follows:

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan],
    'B': [np.nan, 0, 0.5, np.nan],
    'C': [np.nan, np.nan, np.nan, np.nan]
})
df

The default behaviour of dropna is to remove any row containing at least one NaN.

df.dropna()

Setting the how parameter to 'all' requires every element in a row to be NaN before that row is removed.

df.dropna(how='all')

When performing operations on a dataframe there are two possible axes along which to apply the operation, as opposed to one for a series.

To apply an operation along the dataframe's index, set axis to 0; to apply it along the dataframe's columns, set axis to 1.

In the following cell, any column whose values are all NaN is dropped by setting axis to 1, so the operation is performed along the columns.

df.dropna(how='all', axis=1)

Note: none of these transformations is permanent.

df

To make such a transformation permanent, the result can either be saved in a new variable or the inplace parameter can be used:

df_no_na = df.dropna(how='all', axis=1)
df_no_na
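
Alternatively, the inplace parameter modifies the object directly and returns None; a brief sketch (assigning the result to a new variable, as above, is generally preferred):

# drops column 'C' from df itself rather than returning a copy
df.dropna(how='all', axis=1, inplace=True)
df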

Outliers & sorting

Plot returns:

returns.plot()

At first, some of these large downside moves may appear unexpected. To investigate, the cell below sorts the data.

A pandas series can be sorted based on values by calling sort_values:

returns.sort_values()

Four of the five most negative returns are due to the fact that this price series is not adjusted for stock splits. An adjusted series can be fetched for any stock by calling adjusted_price_history:

stock_history = stock.adjusted_price_history(True, True)

Recompute returns, drop NaNs and filter to data from 2010 onwards in one line. This sort of method chaining, or carrying out several operations in one line, can always be used with series and dataframes:

returns = stock_history.pct_change().dropna().loc['2010':]
returns.plot();

For a dataframe the method is also sort_values, but a list of columns should be provided via the by parameter to indicate the order of sorting preference:

df.sort_values(by=['B', 'A'])

Common statistics

Common statistics like max, min, mean, median and standard deviation are easily obtained from a series or dataframe.

For the series case, it is clear what each of the statistics means since there is only one axis to compute over:

print(f'Max daily return: {round(100 * returns.max(), 2)}%')
print(f'Min daily return: {round(100 * returns.min(), 2)}%')
print(f'Mean daily return: {round(100 * returns.mean(), 2)}%')
print(f'Median daily return: {round(100 * returns.median(), 2)}%')
print(f'Standard deviation of daily returns: {round(100 * returns.std(), 2)}%')

You can also use the built-in describe method to retrieve a collection of common statistics from the numeric data in a series or a dataframe. In particular, the result's index will include count, mean, std, min and max, as well as the 25th, 50th and 75th percentiles.

returns.describe()

When the describe method is applied to object data, such as strings or timestamps, the resulting index will include count, unique, top and freq: top is the most common value and freq is its frequency. For timestamps, the result also includes the first and last items.

returns.index.to_frame(name='Description').describe()

Dataframes can be built in a number of ways. One commonly used approach is to pass a dictionary to the pd.DataFrame constructor whose keys are strings and whose values are pd.Series objects:

etfs = ['EEM UP EQUITY', 'SPY UP EQUITY', 'TIP UP EQUITY']
dictionary_of_series = {
    etf: sig.obj.get(etf).adjusted_price_history(True, True) 
    for etf in etfs
}
etf_data = pd.DataFrame(dictionary_of_series)
etf_data.head()

Remove any rows containing NaN values:

etf_data = etf_data.dropna(how='any', axis=0)
etf_data.head()
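
The remaining cells work with a dataframe of daily ETF returns, computed from etf_data in the same way as before (a minimal sketch, assuming simple percentage returns with the initial NaN row dropped):

etf_returns = etf_data.pct_change().dropna()
etf_returns.head()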

Note: by default, aggregation operations on dataframes are done along the index (axis 0).

etf_returns.max()
etf_returns.max(axis=0)

Specifying axis to be 1 yields a series indexed by date with the relevant statistic computed over the list of ETFs on each day:

etf_returns.max(axis=1)

Sharpe Ratios

Compute the annualised Sharpe ratios for the ETFs, assuming a risk-free rate of 0%:

rfr = 0.00
annualisation_factor = np.sqrt(250)
mean_returns = etf_returns.mean(axis=0) - rfr
volatilities = etf_returns.std(axis=0)
annualisation_factor * mean_returns / volatilities

Rolling Volatility

Compute rolling annualised volatilities over a 250-day window:

rolling_volatilities = etf_returns.rolling(window=250).std().dropna()
rolling_volatilities *= annualisation_factor
rolling_volatilities.plot();

Scaling & reconstructing prices

Rescale ETF prices to have the same starting point and plot:

scaled_etf_data = etf_data * 100 / etf_data.iloc[0]
scaled_etf_data.plot()

Reconstruct scaled price series from the returns (the reconstructed series starts one day later, as the first day's return was dropped):

etf_prices = (1 + etf_returns).cumprod()
etf_prices *= 100 / etf_prices.iloc[0]
etf_prices.plot()

Further plotting

Plot ETF returns together:

etf_returns.plot();

On separate axes:

fig, axes = plt.subplots(3)
for i, etf in enumerate(etf_returns):
    axes[i].plot(etf_returns[etf])

It is easy to show histograms as well as line plots:

etf_returns.plot(kind='hist', bins=100, stacked=True);
fig, axes = plt.subplots(3)
for i, etf in enumerate(etf_returns):
    axes[i].hist(etf_returns[etf], bins=100);
    axes[i].set_xlim(-0.1, 0.1)

Winsorisation example

This section shows an example of manually Winsorising returns at 3𝜎 for the TIP ETF.

Fetch returns and compute an estimate of 𝜎:

tip_returns = etf_returns['TIP UP EQUITY'].copy()
sigma = tip_returns.std()

Compute the upper and lower cutoff values:

upper_cutoff = tip_returns.mean() + 3 * sigma
lower_cutoff = tip_returns.mean() - 3 * sigma
print(f'Upper cutoff: {round(100 * upper_cutoff, 2)}%')
print(f'Lower cutoff: {round(100 * lower_cutoff, 2)}%')

Count the number of returns that are larger than the upper cutoff:

upper_cutoff_count = tip_returns.loc[tip_returns > upper_cutoff].count()
upper_cutoff_count

A common indexing technique is being used here.

The following comparison returns a boolean series with the same index as the tip_returns series:

tip_returns > upper_cutoff

When passed to loc, this returns the series restricted to those index values where the condition is true:

tip_returns.loc[tip_returns > upper_cutoff].head()

Calculate the proportion of returns impacted on the upside:

print(f'{round(100 * upper_cutoff_count / len(tip_returns), 2)}%')

The same could be done on the downside.
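
For instance, the downside count and proportion could be computed in the same way (a sketch mirroring the upside calculation above):

lower_cutoff_count = tip_returns.loc[tip_returns < lower_cutoff].count()
print(f'{round(100 * lower_cutoff_count / len(tip_returns), 2)}%')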

To set new values on a slice of a series or dataframe, it is good practice to use loc to access that particular slice and then assign the new values:

tip_returns.loc[tip_returns > upper_cutoff] = upper_cutoff
tip_returns.loc[tip_returns < lower_cutoff] = lower_cutoff
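
As an aside, the same result can be achieved in a single call with the clip method, which caps values at the given bounds (an equivalent alternative, not part of the original walkthrough):

# equivalent to the two loc assignments above
tip_returns = tip_returns.clip(lower=lower_cutoff, upper=upper_cutoff)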

To see the impact of this winsorisation:

tip_returns.plot(kind='hist', bins=100);

Mapping for signal strategy

This section shows an example of building a signal directly from a rolling futures strategy history time series and then using this signal as input to the SignalStrategy building block.

Import the default ES rolling futures strategy and build it:

from sigtech.framework.default_strategy_objects.rolling_futures import es_index_front
es_rfs = es_index_front()
es_rfs.build()

Create a constant +1 signal:

signal = 1 + 0 * es_rfs.history()
signal.head()

This is currently a pd.Series:

type(signal)

Convert to a pd.DataFrame, using the name of the tradable strategy as the column name:

signal_df = signal.to_frame(name=es_rfs.name)
signal_df.head()
type(signal_df)

Build SignalStrategy from signal_df:

strategy = sig.SignalStrategy(
    currency='USD',
    start_date=dtm.date(2018, 2, 1),
    end_date=dtm.date(2021, 12, 1),
    signal_name=sig.signal_library.from_ts(signal_df).name
)
strategy.build()
strategy.history().plot();
