Pandas introduction
This section introduces some basic functionality of the Python pandas data analysis package, and aims to ensure that all users are able to rapidly backtest realistic strategies.
Note: the pandas library underlies parts of the SigTech Framework and is widely used in example notebooks.
Importing modules
Import the datetime module with the alias dtm. This module is used in defining dates and times, and is useful for creating schedules and performing analysis of data indexed by dates and timestamps.
Import the pandas data analysis module under the alias pd:
Import the numpy numerical computing module under the alias np:
Import the matplotlib.pyplot library for plotting under the alias plt:
Import the plotting library seaborn under the alias sns. Set the default figure size for all plots within the notebook:
Import and initialise the SigTech Framework:
Pandas objects
The SigTech platform is object-oriented in design: every asset is modelled as a different object with its own attributes, methods and attached datasets.
Fetch a SingleStock object:
Pandas series objects
A historical time series of an asset's value can be obtained by calling the history method on that asset. This returns a pandas series (pd.Series) object, whose values are prices and whose index entries are the corresponding dates:
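The platform's history call is environment-specific, but its output has the same shape as a plain pandas series on a DatetimeIndex. A synthetic stand-in (random prices, purely illustrative; the length matches the series described below):

```python
import numpy as np
import pandas as pd

# Stand-in for stock_history: daily prices indexed by business dates
dates = pd.date_range('2000-01-03', periods=5536, freq='B')
stock_history = pd.Series(
    100 + np.random.default_rng(0).standard_normal(5536).cumsum(),
    index=dates,
    name='LastPrice',
)
```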
The length of the pandas series is displayed automatically (5536). It can also be obtained by calling len on a series (or dataframe).
Individual time series data in the SigTech platform is generally returned in the pd.Series format. The below cells check that stock_history is a series:
Pandas DataFrame objects
A pandas dataframe object, pd.DataFrame, is returned if there are multiple columns of time series data.
To give an example dataframe to work with, first fetch a futures contract:
To view the available historical time series data attached to an object, call history_fields on that object:
A pandas dataframe whose columns are drawn from a specified list of fields can be obtained by supplying the list to the history_df method.
Check that futures_data is a pd.DataFrame:
The columns of a DataFrame can be fetched directly:
Data manipulation
Slicing data
The first five rows of a series or dataframe can be obtained by calling head on it. The analogue for the last rows is the tail method. An optional argument can be passed to return a different number of rows.
This example returns the first five rows of stock_history:
This example returns the last three rows of futures_data:
A pandas series or dataframe consists of an index and values.
For time series data within the SigTech platform, the index consists of observation dates, or datetimes.
For both the unadjusted stock prices and the futures data, the index of dates is stored as a DatetimeIndex.
A DatetimeIndex can be easily sliced and used in data filtering. To filter a pandas series or dataframe based on the value of elements of its index, the loc accessor can be used. In the case of a DatetimeIndex there are further possibilities:
Restrict to futures data from February 2021:
Note: futures_data is not altered; the operation returns a new pd.DataFrame.
Half-open intervals can also be provided. The following cell restricts to stock prices from 2019 onwards. The original pd.Series is unaltered; a new pd.Series is returned.
Restrict to stock prices between April and June 2016, inclusive. Here the head method is called on the resulting series so that only its first seven rows are returned:
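The date-based slicing above can be sketched on a synthetic series. Partial date strings select whole months or years, and date-range slices with loc are inclusive of both endpoints:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2016-01-01', '2021-12-31', freq='B')
prices = pd.Series(np.arange(len(idx), dtype=float), index=idx)

feb_2021 = prices.loc['2021-02']           # a single month via a partial string
from_2019 = prices.loc['2019':]            # half-open: 2019 onwards
q2_2016 = prices.loc['2016-04':'2016-06']  # inclusive date-range slice
```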
An alternative way to slice data based on an index is with the iloc accessor. Numerical positions instead of labels are passed, indicating which rows to include using zero-based indexing:
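A short sketch of position-based slicing with iloc (synthetic values):

```python
import pandas as pd

s = pd.Series([10.0, 11.0, 12.0, 13.0],
              index=pd.date_range('2021-01-01', periods=4))

first_two = s.iloc[:2]   # positions 0 and 1, regardless of index labels
last_value = s.iloc[-1]  # the final value
```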
A pandas dataframe can be sliced on columns as well as rows.
If a list of columns is provided, then a dataframe is returned. If just one column is provided, then a pandas series is returned.
Selecting a single column returns a pd.Series:
Using the loc syntax to filter columns:
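Both selection styles can be sketched on a small dataframe (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'LastPrice': [1.0, 2.0], 'Volume': [100, 200]},
                  index=pd.date_range('2021-01-01', periods=2))

one_col = df['LastPrice']               # a single column -> pd.Series
two_cols = df[['LastPrice', 'Volume']]  # a list of columns -> pd.DataFrame
via_loc = df.loc[:, ['Volume']]         # loc: all rows, selected columns
```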
Line plots
A pandas series or dataframe can be plotted by calling the plot method:
For a dataframe, the column names are automatically used to form the legend.
As filtering based on date returns another pandas series or dataframe, the same plot method can be used on the result of filtering.
Computing returns
Returns can be obtained from price series by using the pct_change method:
An equivalent way to obtain this is to take the ratio of prices to prices shifted forward by one day, and then subtract one:
Note: division is well defined and done on an element-by-element basis, matching on the index. The subtraction of one is vectorised (applied to all elements).
To compute log returns, apply np.log to the ratio of prices to shifted prices. This relies on the vectorised way in which a function is applied to each element of the series:
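All three return calculations can be sketched on a toy price series; pct_change and the shift-based formula agree element by element:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 103.0],
                   index=pd.date_range('2021-01-01', periods=4))

returns = prices.pct_change()               # simple returns; first entry is NaN
manual = prices / prices.shift(1) - 1       # element-wise ratio, minus one
log_returns = np.log(prices / prices.shift(1))  # vectorised log returns
```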
Dropping NaN values
The first entry in the returns series is NaN: no return can be computed for 3 January 2000, as no prices prior to this date are available.
Such values can be removed from pandas series and dataframes by calling the dropna method. Calling this method does not alter the original object but returns a new series or dataframe with NaN values removed.
Reassign returns to be the output from calling dropna:
To take a synthetic example, a dataframe with three columns is built as follows:
The default behaviour of dropna is to remove any row containing at least one NaN. The how parameter can instead require every element in a row to be NaN before that row is removed.
When performing operations on a dataframe there are two possible axes along which to apply the operation, as opposed to one for a series. To apply an operation along the dataframe's index, set axis to 0; to apply it along the dataframe's columns, set axis to 1.
In the following cell, any column whose values are all NaN is dropped by specifying axis to be 1. In this way the operation is performed along columns.
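The how and axis behaviours can be sketched on a small synthetic dataframe. Note that because column 'b' is entirely NaN, a plain dropna() removes every row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, np.nan],
                   'c': [4.0, np.nan, 6.0]})

any_dropped = df.dropna()                      # rows with any NaN removed
all_dropped = df.dropna(how='all')             # only the all-NaN middle row removed
cols_dropped = df.dropna(axis=1, how='all')    # the all-NaN column 'b' removed
```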
Note: none of these transformations is permanent. To make a transformation permanent, either save the result in a new variable or use the inplace parameter.
Outliers & sorting
Plot returns:
At first, some of these large downside moves may appear unexpected. To find out more, the below cell sorts the data. A pandas series can be sorted by value by calling sort_values:
Four of the five most negative returns arise because this price series is not adjusted for stock splits. An adjusted series can be fetched for any stock by calling adjusted_price_series.
Recompute returns, drop NaNs and filter to data from 2010 onwards in one line. This sort of 'piping', or carrying out several operations in one line, can always be used with series and dataframes.
For a dataframe the sorting method is also sort_values. A list of columns can be provided to indicate the order of sorting preference.
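Sorting a series and a dataframe can be sketched on toy data; for the dataframe, ties in the first column are broken by the second:

```python
import pandas as pd

returns = pd.Series([0.01, -0.05, 0.03], index=['a', 'b', 'c'])
worst_first = returns.sort_values()  # ascending by default

df = pd.DataFrame({'ret': [0.01, 0.01, -0.02],
                   'vol': [0.2, 0.1, 0.3]})
ordered = df.sort_values(['ret', 'vol'])  # sort by 'ret', then by 'vol'
```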
Common statistics
Common statistics like max, min, mean, median and std (standard deviation) are easily obtained from a series or dataframe.
For the series case, it is clear what each statistic means, since there is only one axis to compute it over.
You can also use the built-in describe method to retrieve a collection of common statistics from the numeric data in a series or dataframe. In particular, the result's index will include count, mean, std, min and max, as well as the 25th, 50th and 75th percentiles.
When the describe method is applied to object data, such as strings or timestamps, the resulting index will instead include count, unique, top and freq: top is the most common value, and freq is that value's frequency. For timestamps, the result also includes first and last.
Dataframes can be built in a number of ways. One commonly used approach is to pass a dictionary to the pd.DataFrame constructor whose keys are strings and whose values are pd.Series objects:
Remove any rows containing NaN values.
Note: by default, aggregation operations on dataframes are done along the index (axis 0).
Specifying axis to be 1 yields a series indexed by date, with the relevant statistic computed over the list of ETFs on each day:
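The dictionary construction and the two aggregation axes can be sketched together (the ticker names and prices are illustrative):

```python
import pandas as pd

dates = pd.date_range('2021-01-01', periods=3)
prices = pd.DataFrame({
    'SPY': pd.Series([400.0, 401.0, 402.0], index=dates),
    'TIP': pd.Series([120.0, 119.5, 120.5], index=dates),
})

col_means = prices.mean()        # default: aggregate along the index (axis=0)
row_means = prices.mean(axis=1)  # per-date mean across the ETF columns
```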
Sharpe Ratios
Compute the annualised Sharpe ratios for the ETFs, assuming a risk-free rate of 0%:
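A minimal sketch of the calculation on synthetic daily returns, assuming 252 trading days per year and a zero risk-free rate:

```python
import numpy as np
import pandas as pd

# Synthetic daily returns standing in for an ETF's return series
returns = pd.Series(np.random.default_rng(1).normal(0.0004, 0.01, 1000))

# Annualised Sharpe ratio: mean over std, scaled by sqrt(252)
sharpe = returns.mean() / returns.std() * np.sqrt(252)
```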
Rolling Volatility
Compute rolling annualised volatilities over a 250-day window:
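A sketch on synthetic returns; the first 249 entries are NaN because the window is not yet full, and sqrt(252) annualises the daily standard deviation:

```python
import numpy as np
import pandas as pd

returns = pd.Series(np.random.default_rng(2).normal(0, 0.01, 500))

# 250-day rolling standard deviation, annualised
rolling_vol = returns.rolling(window=250).std() * np.sqrt(252)
```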
Scaling & reconstructing prices
Rescale ETF prices to have the same starting point and plot:
Reconstruct scaled price series from returns (except the first day):
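Both steps can be sketched on a toy price series: rebase to a common start of 100, then recover the rebased prices from returns with a cumulative product:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2021-01-01', periods=4)
prices = pd.Series([120.0, 121.2, 119.9, 122.0], index=dates)

scaled = prices / prices.iloc[0] * 100       # rebase so the series starts at 100
returns = scaled.pct_change().dropna()
rebuilt = 100 * (1 + returns).cumprod()      # recovers scaled (except day one)
```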
Further plotting
Plot ETF returns together:
On separate axes:
It is easy to show histograms as well as line plots:
Winsorisation example
This section shows an example of manually winsorising returns at 3σ for the TIP ETF.
Fetch returns and compute an estimate of σ:
Compute the upper and lower cutoff values:
Check the number of returns that are larger than the upper cutoff:
A common indexing technique is being used here. The following comparison returns a boolean series with the same index as the tip_returns series:
When passed to tip_returns, this returns a series restricted to those index values where the condition was true:
Calculate the proportion of returns impacted on the upside:
The same could be done on the downside.
To set different values on a slice of a dataframe, it is good practice to use loc to access that particular slice and then set the new values:
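The whole winsorisation recipe can be sketched end to end on synthetic returns (standing in for the TIP ETF series):

```python
import numpy as np
import pandas as pd

# Synthetic daily returns standing in for tip_returns
tip_returns = pd.Series(np.random.default_rng(3).normal(0, 0.005, 2000))

sigma = tip_returns.std()
upper, lower = 3 * sigma, -3 * sigma

mask_hi = tip_returns > upper          # boolean series, same index
n_hi = mask_hi.sum()                   # count of upside outliers
prop_hi = n_hi / len(tip_returns)      # proportion impacted on the upside

winsorised = tip_returns.copy()
winsorised.loc[winsorised > upper] = upper  # clamp upside outliers via loc
winsorised.loc[winsorised < lower] = lower  # and downside outliers
```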
To see the impact of this winsorisation:
Mapping for signal strategy
This section shows an example of building a signal directly from a rolling futures strategy's history time series, and then using this signal as input to the SignalStrategy building block.
Import the ES default Rolling Futures Strategy:
Create constant +1 signal:
This is currently a pd.Series:
Convert to a pd.DataFrame, with the column name set to the tradable strategy name:
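The conversion can be sketched in plain pandas; the column name 'ES_ROLLING' below is a placeholder, where the platform would use the actual tradable strategy name:

```python
import pandas as pd

# A constant +1 signal standing in for the strategy-derived series
signal = pd.Series([1.0, 1.0, 1.0],
                   index=pd.date_range('2021-01-01', periods=3))

# One-column DataFrame whose column is named after the tradable strategy
signal_df = signal.to_frame('ES_ROLLING')
```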
Build a SignalStrategy from signal_df: