Search…
Pandas introduction
This section introduces some basic functionality of the Python pandas data analysis package, and aims to ensure that all users are able to rapidly backtest realistic strategies.
Note: the pandas library underlies parts of the SigTech Framework and is widely used in example notebooks.

Sections

Importing modules

Import the datetime module with the alias dtm. This module is used in defining dates, times, and more.
It is also useful for creating schedules and performing analysis of data indexed by dates and timestamps.
1
import datetime as dtm
Copied!
Import the pandas data analysis module under the alias pd:
1
import pandas as pd
Copied!
Import the numpy numerical computing module under the alias np:
1
import numpy as np
Copied!
Import the matplotlib.pyplot library for plotting under the alias plt:
1
import matplotlib.pyplot as plt
Copied!
Import the plotting library seaborn under the alias sns. Set the default figure size for all plots within the notebook:
1
import seaborn as sns
2
sns.set(rc={'figure.figsize': (18, 6)})
Copied!
Import and initialise the SigTech Framework:
1
import sigtech.framework as sig
2
end_date = dtm.date(2021, 12, 31)
3
sig.init(env_date=end_date)
Copied!

Pandas objects

The SigTech platform is object-oriented in design: every asset is modelled as a different object with its own attributes, methods and attached datasets.
Fetch a SingleStock object:
1
stock = sig.obj.get('1000045.SINGLE_STOCK.TRADABLE')
Copied!

Pandas series objects

A historical time series of an asset's value can be obtained by calling the history method on that asset. This returns a pandas Series(pd.Series) object, whose values are prices and whose index entries are the corresponding dates:
Input
Output
1
stock_history = stock.history()
2
stock_history
Copied!
1
2000-01-03 111.93750
2
2000-01-04 102.50000
3
2000-01-05 104.00000
4
2000-01-06 95.00000
5
2000-01-07 99.50000
6
...
7
2021-12-27 180.33001
8
2021-12-28 179.29000
9
2021-12-29 179.38001
10
2021-12-30 178.20000
11
2021-12-31 177.57001
12
13
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), Length: 5536, dtype: float64.
14
Copied!
The length of the pandas series is displayed automatically (5536). This can also be obtained by calling len on a series (or dataframe)
Input
Output
1
len(stock_history)
Copied!
1
5536
Copied!
Individual time series data in the SigTech platform is generally returned in the pd.Series format. The below cells check that stock_history is a series:
Input
Output
1
type(stock_history)
Copied!
1
pandas.core.series.Series
Copied!
Input
Output
1
isinstance(stock_history, pd.Series)
Copied!
1
True
Copied!

Pandas DataFrame objects

A pandas dataframe object, pd.DataFrame, is returned if there are multiple columns of time series data.
To give an example dataframe to work with, first fetch a futures contract:
Input
Output
1
futures_contract = sig.obj.get('ESH22 INDEX')
2
futures_contract
Copied!
1
ESH22 INDEX <class 'sigtech.framework.instruments.futures.EquityIndexFuture'>[140429123939536]
Copied!
To view the available historical time series data attached to an object, call the history_fields on that object:
Input
Output
1
futures_contract.history_fields
Copied!
1
['AskPrice',
2
'BidPrice',
3
'LastPrice',
4
'data_point',
5
'HighPrice',
6
'LowPrice',
7
'MidPrice',
8
'OpenPrice',
9
'open_interest',
10
'Volume']
Copied!
A pandas dataframe whose columns are from a specified list of fields can be obtained by supplying the list to the history_df method.
1
futures_data = futures_contract.history_df(
2
fields=['HighPrice', 'LowPrice']
3
)
4
futures_data.head()
Copied!
futures_data is a pd.DataFrame
Input
Output
1
isinstance(futures_data, pd.DataFrame)
Copied!
1
True
Copied!
Input
Output
1
type(futures_data)
Copied!
1
pandas.core.frame.DataFrame
Copied!
The columns of a DataFrame can be fetched directly:
Input
Output
1
futures_data.columns
Copied!
1
Index(['HighPrice', 'LowPrice'], dtype='object')
Copied!

Data manipulation

Slicing data

The first five rows of a series or dataframe can be obtained by calling head on it. The analogue for the last rows is to use the tail method.
An optional argument can be passed to determine an alternate number of rows to return.
This example returns the first five rows of stock_history:
Input
Output
1
stock_history.head()
Copied!
1
2000-01-03 111.9375
2
2000-01-04 102.5000
3
2000-01-05 104.0000
4
2000-01-06 95.0000
5
2000-01-07 99.5000
6
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), dtype: float64
Copied!
This example returns the last three rows of futures_data:
1
futures_data.tail(3)
Copied!
A pandas series or dataframe consists of an index and values.
For time series data within the SigTech platform, the index consists of observation dates, or datetimes.
For both the unadjusted stock prices, and the futures data, the index of dates is stored as a DatetimeIndex.
Input
Output
1
stock_history.index
Copied!
1
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
2
'2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
3
'2000-01-13', '2000-01-14',
4
...
5
'2021-12-17', '2021-12-20', '2021-12-21', '2021-12-22',
6
'2021-12-23', '2021-12-27', '2021-12-28', '2021-12-29',
7
'2021-12-30', '2021-12-31'],
8
dtype='datetime64[ns]', length=5536, freq=None)
Copied!
Input
Output
1
futures_data.index
Copied!
1
DatetimeIndex(['2020-12-17', '2020-12-18', '2020-12-21', '2020-12-22',
2
'2020-12-23', '2020-12-24', '2020-12-28', '2020-12-29',
3
'2020-12-30', '2020-12-31',
4
...
5
'2021-12-17', '2021-12-20', '2021-12-21', '2021-12-22',
6
'2021-12-23', '2021-12-27', '2021-12-28', '2021-12-29',
7
'2021-12-30', '2021-12-31'],
8
dtype='datetime64[ns]', length=262, freq=None)
Copied!
A DatetimeIndex can be easily sliced and utilised in data filtering. To filter a pandas series or dataframe based on the value of elements of its index the loc method can be used. In the case of a DatetimeIndex there are further possibilities:
Restrict to futures data from February 2021:
1
futures_data.loc['2021-02'].head()
Copied!
Note: futures_data is not altered but the method returns a new pd.DataFrame.
1
futures_data.head()
Copied!
Half open intervals can also be provided.
The following cell restricts to stock prices from 2019 onwards. The original pd.Series is unaltered but a new pd.Series is returned.
Input
Ouput
1
stock_history.loc['2019':]
Copied!
1
2019-01-02 157.92000
2
2019-01-03 142.19001
3
2019-01-04 148.26000
4
2019-01-07 147.93000
5
2019-01-08 150.75000
6
...
7
2021-12-27 180.33001
8
2021-12-28 179.29000
9
2021-12-29 179.38001
10
2021-12-30 178.20000
11
2021-12-31 177.57001
12
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), Length: 757, dtype: float64
Copied!
Restrict to stock prices between April and June 2016, inclusive. Here the head method is called on the resulting series so that only its first seven rows are returned:
Input
Output
1
stock_history.loc['2016-04': '2016-06'].head(7)
Copied!
1
2016-04-01 109.99001
2
2016-04-04 111.12000
3
2016-04-05 109.81000
4
2016-04-06 110.96001
5
2016-04-07 108.54000
6
2016-04-08 108.66001
7
2016-04-11 109.02001
8
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), dtype: float64
Copied!
An alternative way to slice data based on an index is with the iloc method.
Numerical values instead of labels are passed indicating which rows to include using zero-based indexing:
Input
Output
1
# equivalent to `head()`
2
stock_history.iloc[:5]
Copied!
1
2000-01-03 111.9375
2
2000-01-04 102.5000
3
2000-01-05 104.0000
4
2000-01-06 95.0000
5
2000-01-07 99.5000
6
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), dtype: float64
Copied!
1
# equivalent to `tail(3)`
2
futures_data.iloc[-3:]
Copied!
A pandas dataframe can be sliced on columns as well as rows.
If a list of columns is provided, then a dataframe is returned. If just one column is provided, then a pandas series is returned.
1
# essentially no change as both columns are kept
2
columns_to_keep = ['HighPrice', 'LowPrice']
3
futures_data[columns_to_keep].tail()
Copied!
Selecting a single column returns a pd.Series:
Input
Output
1
isinstance(futures_data['HighPrice'], pd.Series)
Copied!
1
True
Copied!
Using the loc syntax to filter columns:
Input
Output
1
futures_data.loc[:, 'HighPrice']
Copied!
1
2020-12-17 3661.50
2
2020-12-18 3651.50
3
2020-12-21 3636.00
4
2020-12-22 3628.00
5
2020-12-23 3633.00
6
...
7
2021-12-27 4784.25
8
2021-12-28 4798.00
9
2021-12-29 4796.00
10
2021-12-30 4799.75
11
2021-12-31 4778.50
12
Name: HighPrice, Length: 262, dtype: float64
Copied!

Line plots

A pandas series or dataframe can be plotted by calling the plot method:
Input
Output
1
stock_history.plot()
Copied!
1
<matplotlib.axes._subplots.AxesSubplot at 0x7fb826b4dcd0>
Copied!
For a dataframe, the column names are automatically used to form the legend.
Input
Output
1
futures_data.plot()
Copied!
1
<matplotlib.axes._subplots.AxesSubplot at 0x7fb826a69950>
Copied!
As filtering based on date returns another pandas series, or dataframe, the same plot method can be used on the result of filtering.
1
stock_history.loc[:'2010'].plot();
Copied!

Computing returns

Returns can be obtained from price series by using the pct_change method:
Input
Output
1
returns = stock_history.pct_change()
2
returns
Copied!
1
2000-01-03 NaN
2
2000-01-04 -0.084310
3
2000-01-05 0.014634
4
2000-01-06 -0.086538
5
2000-01-07 0.047368
6
...
7
2021-12-27 0.022975
8
2021-12-28 -0.005767
9
2021-12-29 0.000502
10
2021-12-30 -0.006578
11
2021-12-31 -0.003535
12
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), Length: 5536, dtype: float64
Copied!
An equivalent way to obtain this would be to take the ratio of prices with one day forward shifted prices, and then subtract one:
Input
Output
1
stock_history.head()
Copied!
1
2000-01-03 111.9375
2
2000-01-04 102.5000
3
2000-01-05 104.0000
4
2000-01-06 95.0000
5
2000-01-07 99.5000
6
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), dtype: float64
Copied!
Input
Output
1
stock_history.shift(1).head()
Copied!
1
2000-01-03 NaN
2
2000-01-04 111.9375
3
2000-01-05 102.5000
4
2000-01-06 104.0000
5
2000-01-07 95.0000
6
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), dtype: float64
Copied!
Input
Output
1
stock_history / stock_history.shift(1) - 1
Copied!
1
2000-01-03 NaN
2
2000-01-04 -0.084310
3
2000-01-05 0.014634
4
2000-01-06 -0.086538
5
2000-01-07 0.047368
6
...
7
2021-12-27 0.022975
8
2021-12-28 -0.005767
9
2021-12-29 0.000502
10
2021-12-30 -0.006578
11
2021-12-31 -0.003535
12
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), Length: 5536, dtype: float64
Copied!
Note: division is well defined and done on an element-by-element basis, matching on the index. The subtraction of one is vectorised (applied to all elements).
To compute log returns, apply np.log to the ratio of prices to shifted prices. This is relying on the vectorised way in which a function is applied to each element of the series:
Input
Output
1
log_returns = np.log(stock_history / stock_history.shift(1))
2
log_returns
Copied!
1
2000-01-03 NaN
2
2000-01-04 -0.088078
3
2000-01-05 0.014528
4
2000-01-06 -0.090514
5
2000-01-07 0.046281
6
...
7
2021-12-27 0.022715
8
2021-12-28 -0.005784
9
2021-12-29 0.000502
10
2021-12-30 -0.006600
11
2021-12-31 -0.003542
12
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), Length: 5536, dtype: float64
Copied!

Dropping nan values

The first entry in the returns series is a 'nan'. No return can be computed for January 3rd 2000 as no prices prior to this date are available.
Input
Output
1
returns.iloc[0]
Copied!
1
nan
Copied!
Such values can be removed from pandas series and dataframes by calling the dropna method.
Calling this method does not alter the series or dataframe but returns a different series with nan values removed.
Input
Output
1
returns.dropna()
2
returns.head()
Copied!
1
2000-01-03 NaN
2
2000-01-04 -0.084310
3
2000-01-05 0.014634
4
2000-01-06 -0.086538
5
2000-01-07 0.047368
6
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), dtype: float64
Copied!
Reassign returns to be the output from calling dropna:
Input
Output
1
returns = returns.dropna()
2
returns.head()
Copied!
1
2000-01-04 -0.084310
2
2000-01-05 0.014634
3
2000-01-06 -0.086538
4
2000-01-07 0.047368
5
2000-01-10 -0.017588
6
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), dtype: float64
Copied!
To take a synthetic example, a dataframe with three columns is built as follows:
1
df = pd.DataFrame({
2
'A': [1, 2, np.nan, np.nan],
3
'B': [np.nan, 0, 0.5, np.nan],
4
'C': [np.nan, np.nan, np.nan, np.nan]
5
})
6
df
Copied!
The default behaviour of dropna is to remove any row containing at least one nan.
1
df.dropna()
Copied!
The how parameter enables a requirement of every element in a row being nan before that row is removed.
1
df.dropna(how='all')
Copied!
When performing operations on a dataframe there are two possible axes to apply the operation, as opposed to one for a series.
To apply an operation along the dataframe's index set axis to 0, and to apply it along the dataframe's columns set axis to 1.
In the following cell, any column whose values are all nan is dropped, by specifying axis to be 1. In this way the operation is performed along columns.
1
df.dropna(how='all', axis=1)
Copied!
Note: none of these transformations is permanent.
1
df
Copied!
To make such a transformation permanent the result can either be saved in a new variable or the inplace parameter can be used.
1
df_no_na = df.dropna(how='all', axis=1)
2
df_no_na
Copied!

Outliers & sorting

Plot returns:
Input
1
returns.plot()
Copied!
At first, some of these large downside moves may appear unexpected. To find out more the below cell sorts the data.
A pandas series can be sorted based on values by calling sort_values:
Input
Output
1
returns.sort_values()
Copied!
1
2014-06-09 -0.854857
2
2020-08-31 -0.741522
3
2000-09-29 -0.518692
4
2005-02-28 -0.495899
5
2000-06-21 -0.450617
6
...
7
2008-11-24 0.125575
8
2001-04-19 0.128565
9
2004-10-14 0.131572
10
2000-03-01 0.136859
11
2008-10-13 0.139050
12
Name: (LastPrice, EOD, 1000045.SINGLE_STOCK.TRADABLE), Length: 5535, dtype: float64
Copied!
Four of the five most negative returns are due to the fact that this price series is not adjusted for stock splits. An adjusted series can be fetched for any stock by calling adjusted_price_series.
1
stock_history = stock.adjusted_price_history(True, True)
Copied!
Recompute returns, drop nans and filter to data from 2010 onwards in one line. This sort of 'piping', or carrying out several operations in one line, can always be used with series and dataframes.
1
returns = stock_history.pct_change().dropna().loc['2010':]
2
returns.plot();
Copied!
For a dataframe the method to sort is sort_values. A list of columns should then be provided to indicate the order of sorting preference.
1
df.sort_values(by=['B', 'A'])
Copied!

Common statistics

Common statistics like max, min, mean, median and standard deviation are easily obtained from a series or dataframe.
For the series case, it is clear what each of the statistics mean since there is only one axis to compute it over.
Input
Output
1
print(f'Max daily return: {round(100 * returns.max(), 2)}%')
2
print(f'Min daily return: {round(100 * returns.min(), 2)}%')
3
print(f'Mean daily return: {round(100 * returns.mean(), 2)}%')
4
print(f'Median daily return: {round(100 * returns.median(), 2)}%')
5
print(f'Standard deviation of daily returns: {round(100 * returns.std(), 2)}%')
Copied!
1
Max daily return: 11.98%
2
Min daily return: -12.86%
3
Mean daily return: 0.13%
4
Median daily return: 0.1%
5
Standard deviation of daily returns: 1.77%
Copied!
You can also use the built-in describe method to retrieve a collection of common statistics from the numeric data in a series or a dataframe. In particular, the result’s index will include count, mean, std, min, max as well as 25, 50 and 75 percentiles.
Input
Output
1
returns.describe()
Copied!
1
count 3021.000000
2
mean 0.001254
3
std 0.017667
4
min -0.128647
5
25% -0.007021
6
50% 0.001001
7
75% 0.010511
8
max 0.119808
9
dtype: float64
Copied!
When the describe method is applied to object data, such as strings or timestamps, the resulting index will include count, unique, top and freq: top is the most common value, freq is the most common value’s frequency. Timestamps also include the first and last items.
1
returns.index.to_frame(name='Description').describe()
Copied!
Dataframes can be built in a number of ways. One commonly used approach is to pass a dictionary to the pd.DataFrame constructor whose keys are strings and whose values are pd.Series objects:
1
etfs = ['EEM UP EQUITY', 'SPY UP EQUITY', 'TIP UP EQUITY']
2
dictionary_of_series = {
3
etf: sig.obj.get(etf).adjusted_price_history(True, True)
4
for etf in etfs
5
}
Copied!
1
etf_data = pd.DataFrame(dictionary_of_series)
2
etf_data.head()
Copied!
Remove any rows containing nan values.
1
etf_data = etf_data.dropna(how='any', axis=0)
2
etf_data.head()
Copied!
Note: by default, aggregation operations on dataframes are done along the index (axis 0).
Input
Output
1
etf_returns.max()
Copied!
1
EEM UP EQUITY 0.227699
2
SPY UP EQUITY 0.145198
3
TIP UP EQUITY 0.044537
4
dtype: float64
Copied!
Input
Output
1
etf_returns.max(axis=0)
Copied!
1
EEM UP EQUITY 0.227699
2
SPY UP EQUITY 0.145198
3
TIP UP EQUITY 0.044537
4
dtype: float64
Copied!
Specifying axis to be 1 yields a series indexed by date with the relevant statistic computed over the list of ETFs on each day:
Input
Output
1
etf_returns.max(axis=1)
Copied!
1
2003-12-08 0.006738
2
2003-12-09 -0.002555
3
2003-12-10 0.002660
4
2003-12-11 0.018224
5
2003-12-12 0.004587
6
...
7
2021-12-27 0.014152
8
2021-12-28 -0.000622
9
2021-12-29 0.001279
10
2021-12-30 0.011456
11
2021-12-31 -0.001237
12
Length: 4549, dtype: float64
Copied!

Sharpe Ratios

Compute the annualised Sharpe Ratios for the ETFs, assuming a Risk Free Rate of 0%:
1
rfr = 0.00
2
annualisation_factor = np.sqrt(250)
Copied!
Input
Output
1
mean_returns = etf_returns.mean(axis=0) - rfr
2
volatilities = etf_returns.std(axis=0)
3
annualisation_factor * mean_returns / volatilities
Copied!
1
EEM UP EQUITY 0.410277
2
SPY UP EQUITY 0.633739
3
TIP UP EQUITY 0.736693
4
dtype: float64
Copied!

Rolling Volatility

Compute rolling annualised volatilities over a 250 day window:
1
rolling_volatilities = etf_returns.rolling(window=250).std().dropna()
2
rolling_volatilities *= annualisation_factor
3
rolling_volatilities.plot();
Copied!

Scaling & reconstructing prices

Rescale ETF prices to have the same starting point and plot:
1
scaled_etf_data = etf_data * 100 / etf_data.iloc[0]
2
scaled_etf_data.plot()
Copied!
Reconstruct scaled price series from returns (except the first day):
1
etf_prices = (1 + etf_returns).cumprod()
2
etf_prices *= 100 / etf_prices.iloc[0]
3
etf_prices.plot()
Copied!

Further plotting

Plot ETF returns together:
1
etf_returns.plot();
Copied!
On separate axes:
1
fig, axes = plt.subplots(3)
2
for i, etf in enumerate(etf_returns):
3
axes[i].plot(etf_returns[etf])
Copied!
It is easy to show histograms as well as line plots:
1
etf_returns.plot(kind='hist', bins=100, stacked=True);
Copied!
1
fig, axes = plt.subplots(3)
2
for i, etf in enumerate(etf_returns):
3
axes[i].hist(etf_returns[etf], bins=100);
4
axes[i].set_xlim(-0.1, 0.1)
Copied!

Winsorisation example

This sections shows an example of manually Winsorising returns at 3𝜎 for TIP ETF.
Fetch returns and compute an estimate of 𝜎:
1
tip_returns = etf_returns['TIP UP EQUITY'].copy()
2
sigma = tip_returns.std()
Copied!
Compute the upper and lower cutoff values:
Input
Output
1
upper_cutoff = tip_returns.mean() + 3 * sigma
2
lower_cutoff = tip_returns.mean() - 3 * sigma
3
print(f'Upper cutoff: {round(100 * upper_cutoff, 2)}%')
4
print(f'Lower cutoff: {round(100 * lower_cutoff, 2)}%')
Copied!
1
Upper cutoff: 1.17%
2
Lower cutoff: -1.13%
Copied!
Check that the number of returns are larger than the upper cutoff:
Input
Output
1
upper_cutoff_count = tip_returns.loc[tip_returns > upper_cutoff].count()
2
upper_cutoff_count
Copied!
1
25
Copied!
A common indexing technique is being used here.
The following comparison returns a series of boolean with the same index as the tip_returns dataframe:
Input
Output
1
tip_returns > upper_cutoff
Copied!
1
2003-12-08 False
2
2003-12-09 False
3
2003-12-10 False
4
2003-12-11 False
5
2003-12-12 False
6
...
7
2021-12-27 False
8
2021-12-28 False
9
2021-12-29 False
10
2021-12-30 False
11
2021-12-31 False
12
Name: TIP UP EQUITY, Length: 4549, dtype: bool
Copied!
When passed to tip_returns this returns a series restricted to those index values where the condition was true:
Input
Output
1
tip_returns.loc[tip_returns > upper_cutoff].head()
Copied!
1
2008-02-28 0.013242
2
2008-06-06 0.012236
3
2008-10-01 0.013234
4
2008-10-14 0.022872
5
2008-10-20 0.013722
6
Name: TIP UP EQUITY, dtype: float64
Copied!
Calculate the proportion of returns impacted on the upside:
Input
Output
1
print(f'{round(100 * upper_cutoff_count / len(tip_returns), 2)}%')
Copied!
1
0.55%
Copied!
The same could be done on the downside.
To set different values on a slice of a dataframe it is good practice to use loc to access that particular slice and then set the new values:
1
tip_returns.loc[tip_returns > upper_cutoff] = upper_cutoff
2
tip_returns.loc[tip_returns < lower_cutoff] = lower_cutoff
Copied!
To see the impact of this winsorisation:
1
tip_returns.plot(kind='hist', bins=100);
Copied!

Mapping for signal strategy

This section shows an example of building a signal directly from a rolling futures strategy history time series and then using this signal as input to the