Abstract
Rough path theory is a branch of mathematics arising out of stochastic analysis. One of the main tools of rough path analysis is the signature, which captures the evolution of an unparametrised path, including the order in which events occur. This turns out to be a useful tool in data science applications involving sequential data. RoughPy is our new Python package that aims to change the way we think about sequential streamed data, by viewing it through the lens of rough paths. In RoughPy, data is wrapped in a stream object which can be composed and queried to obtain signatures that can be used in analysis. It also provides a platform for further exploration of the connections between rough path theory and data science.
1 Introduction
Sequential data appears everywhere in the modern world: text, finance, health records, radio (and other electromagnetic spectra), sound (and speech), and so on. Traditionally, these data are tricky to work with because of the exponential complexity and the different scales of the underlying process. Until the recent development of transformers and large language models, it was difficult to capture long-term patterns whilst also capturing short-term fine detail. Rough path theory gives us tools to work with sequential, ordered data in a mathematically rigorous way, which should provide a means to overcome some of the inherent complexity of the data. In this paper, we introduce a new package, RoughPy, for working with sequential data through the lens of rough path theory, with which we can perform rigorous analyses and explore different ways to understand sequential data.
Rough paths arise in the study of controlled differential equations (CDEs), which generalise ordinary differential equations (ODEs) and stochastic differential equations (Lyons, 1998; Lyons et al., 2007). These are equations of the form $\mathrm{d}Y_t = f(Y_t)\,\mathrm{d}X_t$, subject to an initial condition $Y_a = y_0$, that model a non-linear system driven by an input path $X$. One simple CDE turns out to be critical to the theory:
$$\mathrm{d}S_t = S_t \otimes \mathrm{d}X_t, \qquad S_a = \mathbf{1}.$$
The solution of this equation is called the signature of $X$. It is analogous to the exponential function for ODEs, in that the solution of any CDE can be expressed in terms of the signature of the driving path. When the path is sufficiently regular, the signature can be computed directly as a sequence of iterated integrals. In other cases, we can still solve CDEs if we are given higher-order data that can be used in place of the iterated integrals. A path equipped with this higher-order data is called a rough path.
The signature turns out to be a useful summary of sequential data. It captures the order of events but not necessarily the rate at which these events occur. The signature is robust to irregular sampling and provides a fixed-size view of the data, regardless of how many observations are used to compute it. This means the signature can be a useful feature map for machine learning on sequential data. Numerous examples of using signatures to analyse sequential data are outlined in Section 2.
Besides signatures, there are two other rough path-based methods that have found their way into data science in recent years. These are the signature kernel and neural CDEs. Both of these enjoy the same robustness of the signature, and expand the range of applications of rough path-based methods. We give a short overview of these methods in Section 1.2.
There are several Python packages for computing signatures of sequential data,
including esig (Lyons & Maxwell, 2017), iisignature (Reizenstein & Graham, 2020),
and signatory (Kidger & Lyons, 2020).
These packages provide functions for computing signatures from raw,
structured data presented in an $N \times d$ array, where $d$ is the dimension
of the stream and $N$ is the number of samples.
This means the user is responsible for interpreting the data as a path and
arranging the computations that need to be done.
RoughPy is a new package for working with sequential data and rough paths. The design philosophy for this package is to shift the emphasis from simply computing signatures on data to instead working with streams. A stream is a view of some data as if it were a rough path, which can be queried over intervals to obtain a signature. The actual form of the data is abstracted away in favour of stream objects that closely resemble the mathematics. The aim is to change the way that users think about sequential data and advance the understanding of path-like data analysis.
On top of the streams, RoughPy also provides concrete implementations for elements of the various algebras associated with rough path analysis. These include free tensor algebras, shuffle tensor algebras, and Lie algebras (see Section 1.1). This allows the user to manipulate signatures, and other related objects, in a natural manner, and to develop new methods quickly by following the mathematics.
The paper is organised as follows. In the remainder of this section, we give a brief overview of the mathematics associated with rough path theory, and provide some additional detail for the signature kernel and neural CDEs. In Section 2 we list several recent applications of signatures and rough path-based methods in data science applications. These applications should serve to motivate the development of RoughPy. Finally, in Section 3 we give a more detailed overview of the RoughPy library, the types and functions it contains, and give an example of how it can be used.
RoughPy is open source (BSD 3-Clause) and available on GitHub
https://
1.1 Mathematical background
In this section we give a very short introduction to signatures and rough path theory that should be sufficient to inform the discussion in the sequel. For a far more comprehensive and rigorous treatment, we refer the reader to the recent survey (Lyons & McLeod, 2022). For the remainder of the paper, we write $V$ for the vector space $\mathbb{R}^d$, where $d \geq 1$.
A path in $V$ is a continuous function $x : [a, b] \to V$, where $a < b$ are real numbers. For the purposes of this discussion, we shall further impose the condition that all paths are of bounded variation. The value of a path $x$ at a parameter $t \in [a, b]$ is denoted $x_t$.
The signature of $x$ is an element of the (free) tensor algebra. For $m \geq 0$, the $m$th tensor power of $V$ is defined recursively by $V^{\otimes 0} = \mathbb{R}$, $V^{\otimes 1} = V$, and $V^{\otimes m} = V \otimes V^{\otimes (m-1)}$ for $m > 1$. For example, $V^{\otimes 2}$ is the space of $d \times d$ matrices, and $V^{\otimes 3}$ is the space of $d \times d \times d$ tensors. The tensor algebra over $V$ is the space
$$T((V)) = \prod_{m=0}^{\infty} V^{\otimes m} = \bigl\{\mathbf{x} = (x_0, x_1, x_2, \dots) : x_m \in V^{\otimes m}\bigr\},$$
equipped with the tensor product as multiplication. The tensor algebra is a Hopf algebra, and comes equipped with an antipode operation $\alpha$. It contains a group $G(V)$ of elements under tensor multiplication and the antipode; the members of $G(V)$ are called group-like elements. For each $n \geq 1$, we write $T^{(n)}(V)$ for the truncated tensor algebra of degree $n$, which is the space of all $\mathbf{x} \in T((V))$ such that $x_m = 0$ whenever $m > n$. Similarly, we write $T^{>n}(V)$ for the subspace of elements where $x_m = 0$ whenever $m \leq n$, which is an ideal in $T((V))$, and $T^{(n)}(V) \cong T((V))/T^{>n}(V)$. The truncated tensor algebra is an algebra when given the truncated tensor product, obtained by truncating the full tensor product.
The signature of a path $x$ over a subinterval $[s, t) \subseteq [a, b]$ is $S(x)_{s,t} = (1, S^1_{s,t}, S^2_{s,t}, \dots)$, where for each $m \geq 1$ the term $S^m_{s,t} \in V^{\otimes m}$ is given by the iterated (Riemann-Stieltjes) integral
$$S^m_{s,t} = \int_{s < u_1 < \dots < u_m < t} \mathrm{d}x_{u_1} \otimes \mathrm{d}x_{u_2} \otimes \cdots \otimes \mathrm{d}x_{u_m}.$$
The signature respects concatenation of paths, meaning that $S(x)_{s,u} \otimes S(x)_{u,t} = S(x)_{s,t}$ for any $s < u < t$. This property is usually called Chen's relation. Two paths have the same signature if and only if they differ by a tree-like path (Hambly & Lyons, 2010). The signature is translation invariant, and it is invariant under reparametrisation.
The dual of $T((V))$ is the shuffle algebra $\mathrm{Sh}(V)$. This is the space of linear functionals on $T((V))$ and consists of sequences $\boldsymbol{\lambda} = (\lambda_0, \lambda_1, \dots)$ with $\lambda_m \in (V^{\otimes m})^*$ and where $\lambda_m = 0$ for all $m$ larger than some $M \geq 0$. (Here $(V^{\otimes m})^*$ denotes the dual space of $V^{\otimes m}$. In our notation, $(V^{\otimes m})^* \cong (V^*)^{\otimes m}$.) The multiplication on $\mathrm{Sh}(V)$ is the shuffle product, which corresponds to point-wise multiplication of functions on the path. Continuous functions on paths can be approximated (uniformly) by shuffle tensors acting on the signature. This is a consequence of the Stone-Weierstrass theorem. This property is sometimes referred to as universal non-linearity.
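To make the shuffle product concrete, the following pure-Python sketch (independent of RoughPy; the function name is hypothetical) computes the shuffle of two tensor words, represented as tuples of letters, using the standard recursion: the shuffle of two words is the sum over all interleavings that preserve the relative order of the letters within each word.

from collections import Counter

def shuffle(u, v):
    """Shuffle product of two words u, v: a Counter mapping each
    interleaving (a tuple of letters) to its multiplicity."""
    # Shuffling with the empty word returns the other word.
    if not u:
        return Counter({v: 1})
    if not v:
        return Counter({u: 1})
    result = Counter()
    # Recursion: shuffle(u, v) = shuffle(u[:-1], v)*u[-1] + shuffle(u, v[:-1])*v[-1]
    for word, count in shuffle(u[:-1], v).items():
        result[word + (u[-1],)] += count
    for word, count in shuffle(u, v[:-1]).items():
        result[word + (v[-1],)] += count
    return result

# (1, 2) shuffled with (3,) gives (1, 2, 3) + (1, 3, 2) + (3, 1, 2).
print(shuffle((1, 2), (3,)))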
There are several Lie algebras associated to $T((V))$. Define a Lie bracket on $T((V))$ by the formula $[\mathbf{u}, \mathbf{v}] = \mathbf{u} \otimes \mathbf{v} - \mathbf{v} \otimes \mathbf{u}$, for $\mathbf{u}, \mathbf{v} \in T((V))$. We define subspaces $L_m$ of $T((V))$ for each $m \geq 0$ inductively as follows: $L_0 = \{0\}$, $L_1 = V$, and, for $m > 1$,
$$L_m = [V, L_{m-1}] = \operatorname{span}\bigl\{[u, v] : u \in V,\ v \in L_{m-1}\bigr\}.$$
The space of formal Lie series over $V$ is the subspace $L((V))$ of $T((V))$ containing sequences of the form $(0, \ell_1, \ell_2, \dots)$, where $\ell_m \in L_m$ for each $m \geq 1$. Note that $L_m \subseteq V^{\otimes m}$. For any $\mathbf{x} \in T((V))$ with $x_0 = 1$ we define
$$\log(\mathbf{x}) = \sum_{m=1}^{\infty} \frac{(-1)^{m-1}}{m} (\mathbf{x} - \mathbf{1})^{\otimes m}.$$
For any path $x$, we have $\log(S(x)_{s,t}) \in L((V))$, and we call this the log-signature of $x$ over $[s, t)$. This is an alternative representation of the path, but it does not enjoy the same universal non-linearity as the signature.
1.2 Rough paths in data science
Now we turn to the applications of rough path theory to data science. Our first task is to form a bridge between sequential data and paths. Consider a finite, ordered sequence of observations $(t_1, \mathbf{x}_1), \dots, (t_N, \mathbf{x}_N)$, where $t_1 < t_2 < \dots < t_N$ and $\mathbf{x}_i \in V$. (More generally, we might consider observations that already contain higher-order information, taking values in a Lie algebra rather than in $V$; in our language, such data is a genuine rough path.) We can find numerous paths that interpolate these observations; that is, a path $x$ such that, for each $i$, $x_{t_i} = \mathbf{x}_i$. The simplest interpolation is to take the path that is linear between adjacent observations.
Once we have a path, we need to be able to compute signatures. For practical purposes, we truncate all signatures (and log-signatures) to a particular degree $n$, which we typically call the depth. The dimension $d$ of the ambient space $V$ is usually called the width. Using linear interpolation, we can compute the iterated integrals explicitly using a free tensor exponential of the difference of successive terms:
$$S_{t_i, t_{i+1}} = \exp(\mathbf{x}_{i+1} - \mathbf{x}_i) = \sum_{m=0}^{n} \frac{(\mathbf{x}_{i+1} - \mathbf{x}_i)^{\otimes m}}{m!}.$$
Here, and in the remainder of the paper, we shall denote the empirical signature over an interval $[s, t)$ by $S_{s,t}$ and the log-signature by $L_{s,t} = \log(S_{s,t})$. We can compute the signature over arbitrary intervals by taking the product of these terms, using the multiplicative property (Chen's relation) of the signature. A small worked sketch of this computation is given below.
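As an illustration of these two formulae (a minimal NumPy sketch with hypothetical helper names, not how RoughPy performs these computations internally), the following code computes the depth-2 truncated tensor exponential of each increment and combines them with the truncated tensor product from Chen's relation.

import numpy as np

def tensor_exp_depth2(increment):
    """Depth-2 truncated tensor exponential: exp(v) = (1, v, outer(v, v) / 2)."""
    return (1.0, increment, np.outer(increment, increment) / 2.0)

def tensor_mul_depth2(a, b):
    """Depth-2 truncated tensor product, as used in Chen's relation."""
    a0, a1, a2 = a
    b0, b1, b2 = b
    return (a0 * b0,
            a0 * b1 + a1 * b0,
            a0 * b2 + np.outer(a1, b1) + a2 * b0)

# Signature of a two-segment piecewise linear path in R^2.
x0, x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])
sig = tensor_mul_depth2(tensor_exp_depth2(x1 - x0), tensor_exp_depth2(x2 - x1))
print(sig[1])  # level 1: the total increment x2 - x0
print(sig[2])  # level 2: iterated integrals, capturing the order of the moves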
1.2.1 The signature transform
In most of the early applications of rough paths in data science, the (truncated) signature was used as a feature map (Kidger et al., 2019). This provides a summary of the path that is independent of the parameterisation and the number of observations. Unfortunately, the size of the signature grows geometrically with the truncation depth. If $d > 1$, then the dimension of $T^{(n)}(V)$ is
$$1 + d + d^2 + \dots + d^n = \frac{d^{n+1} - 1}{d - 1}.$$
The size of the signature is a reflection of the complexity of the data, where data with a higher complexity generally needs a higher truncation level and thus a larger signature. It is worth noting that this still represents a significant compression of stream information in many cases.
For some applications, it might be possible to replace the signature with the log-signature. The log-signature is smaller than the signature, but we lose the universal non-linearity property of the signature. Alternatively, we might turn to other techniques that don’t require a full calculation of the signature (such as the signature kernel below). As the connection between rough paths and data science becomes more mathematically mature, we will likely find new ways to use the signature without requiring its full size.
1.2.2 Signature kernels
Kernel methods are useful tools for learning with sequential data. Mathematically, a kernel on a set $\mathcal{X}$ is a positive-definite function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Kernels are often quite easy to evaluate because of the kernel trick, which involves embedding the data in an inner product space, via a feature map, in which the kernel can be evaluated by simply taking an inner product. Informally, kernels measure the similarity between two points. They are used in a variety of machine learning tasks such as classification.
The signature kernel is a kernel induced on the space of paths by combining the signature with an inner product defined on the tensor algebra (Kiraly & Oberhauser, 2019). The theory surrounding the signature kernel has been expanded several times since its introduction (Fermanian et al., 2021; Cass et al., 2024). Typically, the inner product on the tensor algebra will itself be derived from an inner product on $V$, extended to the tensor algebra.
Signatures are infinite objects, so we can’t simply evaluate inner products on the tensor algebra. Fortunately, we can approximate the signature kernel by taking inner products of truncated signatures. Even better, it turns out that, in certain cases, the signature kernel can be realised as the solution to a partial differential equation (PDE) of Goursat type. This means the full signature kernel can be computed from raw data without needing to compute full signatures Salvi et al., 2021.
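To illustrate the PDE point of view (a minimal NumPy sketch of a first-order finite-difference scheme for the Goursat equation $\partial^2 k/\partial s\,\partial t = \langle \dot{x}_s, \dot{y}_t \rangle\, k$ for piecewise linear paths; this is not RoughPy's implementation, and Salvi et al., 2021 describe more accurate discretisations):

import numpy as np

def signature_kernel_pde(x, y):
    """First-order finite-difference solution of the Goursat PDE for the
    signature kernel of two piecewise linear paths.
    x and y are arrays of shape (len_x, dim) and (len_y, dim)."""
    dx = np.diff(x, axis=0)   # increments of the first path
    dy = np.diff(y, axis=0)   # increments of the second path
    inner = dx @ dy.T         # <dx_i, dy_j> for every grid cell
    k = np.ones((len(dx) + 1, len(dy) + 1))   # boundary condition k = 1
    for i in range(len(dx)):
        for j in range(len(dy)):
            # Explicit update over one grid cell of the Goursat problem.
            k[i + 1, j + 1] = (k[i + 1, j] + k[i, j + 1]
                               - k[i, j] * (1.0 - inner[i, j]))
    return k[-1, -1]

x = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
print(signature_kernel_pde(x, x))  # kernel of a path with itself

Refining the grid, by subdividing each linear segment, improves the accuracy of this simple scheme.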
In fact, in a recent preprint, it has been shown that there are higher-order solvers for signature kernels, obtained by rewriting the kernel as the solution of a system of PDEs of Goursat type (Lemercier & Lyons, 2024). A critical part of their method involves the adjoints of left and right free tensor multiplication, which are not available in the existing packages for computing signatures. These functions are provided by RoughPy.
1.2.3 Neural controlled differential equations
Neural CDEs are a method for modelling irregular time series. We consider CDEs of the form
$$\mathrm{d}Y_t = f_\theta(Y_t)\,\mathrm{d}X_t, \qquad Y_{t_0} = y_0,$$
where $f_\theta$ is a neural network. We can treat the path $Y$ as a "hidden state" that we can tune using data to understand the relationship between the driving path $X$ and some response. Neural CDEs can be regarded as a continuous-time analogue of a recurrent neural network (Kidger et al., 2020).
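The following is a minimal NumPy sketch of this idea, stepping the Euler discretisation $Y_{i+1} = Y_i + f_\theta(Y_i)\,\Delta X_i$ along the observed increments of the driving path; the toy, untrained vector field and all names here are hypothetical, and practical neural CDE implementations (e.g. Kidger et al., 2020) instead use differentiable ODE solvers on an interpolated control path.

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 8

# A toy (untrained) vector field f_theta: R^hidden -> R^(hidden x input).
W1 = rng.normal(size=(hidden_dim, hidden_dim)) / np.sqrt(hidden_dim)
W2 = rng.normal(size=(hidden_dim * input_dim, hidden_dim)) / np.sqrt(hidden_dim)

def f_theta(y):
    return (W2 @ np.tanh(W1 @ y)).reshape(hidden_dim, input_dim)

# Euler scheme for dY_t = f_theta(Y_t) dX_t along the increments of X.
x = rng.normal(size=(50, input_dim)).cumsum(axis=0)   # a toy driving path
y = np.zeros(hidden_dim)                              # initial hidden state
for dx in np.diff(x, axis=0):
    y = y + f_theta(y) @ dx
print(y)  # terminal hidden state, e.g. passed to a linear readout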
Neural CDEs initially showed some promising results on several benchmarks but now lag behind current state-of-the-art approaches to time series modelling. The latest iteration of neural CDEs is the recently introduced log-neural controlled differential equation (Walker et al., 2024), which makes use of the Log-ODE method for solving rough differential equations in order to boost the performance of neural CDEs.
2 Current applications of rough paths
In this section we enumerate several applications where rough paths have been
used to develop or improve methods. The list presented here is certainly not
exhaustive. In addition to the literature cited below, there are numerous
additional references and worked examples, in the form of Jupyter notebooks,
available on the DataSig website (https://
2.1 Detecting interference in radio astronomy data
Radio frequency interference (RFI) is a substantial problem in the field of radio astronomy. Even small amounts of RFI can obscure the faint signals generated by distant stellar objects and events. The problem of identifying RFI in a signal falls into a class of semi-supervised learning tasks called novelty (or anomaly) detection. Rough path methods have been used to build a novelty detection framework for identifying RFI in radio astronomy data from several radio telescopes (Arrubarrena et al., 2024). Their results show that the framework is effective at detecting even faint RFI within the test data. This work is based on a general novelty detection framework (Shao et al., 2020).
Signature kernels have also been used for the similar problem of detecting malware by inspecting the streaming tree of processes on a computer system (Cochrane et al., 2021). Their method uses a support vector machine classifier to identify processes that are malicious, by comparison with "normal" behaviour learned via training on a corpus of normality.
2.2 Tracking mood via natural language processing
One application of rough paths in natural language processing has been in the domain of mental health (Tseriotou et al., 2023; Tseriotou et al., 2024). In this work, the authors present a model for identifying changes in a person's mood based on their online textual content. Many mental health conditions have symptoms that manifest in (textual) expression, so this could be a powerful tool for mental health professionals to identify changes in patients and intervene before the condition develops further. Their model achieves state-of-the-art performance compared with existing models on two datasets.
2.3 Predicting battery cell degradation
Another recent application of signatures is predicting the degradation of lithium-ion cells (Ibraheem et al., 2023). The authors use signature features to train a model that can accurately predict the end of life of a cell using relatively low-frequency sampling compared to existing models. They also observed that the performance at higher sampling frequencies was comparable to other models.
2.4 Prediction of sepsis in intensive care data
One of the first effective demonstrations of the utility of signatures and rough path-based methods in healthcare was the 2019 PhysioNet challenge (Morrill et al., 2020), in which teams were invited to develop models to predict sepsis in patients from intensive care unit data. A team utilising signatures to enhance predictive power placed first in the official phase of the challenge. Since then, signatures and other rough path-based approaches have been used in several other clinical contexts (Cohen et al., 2024; Falcioni et al., 2023; Tseriotou et al., 2024). Clinical data is often irregularly sampled and often exhibits a high degree of missingness, but it can also be very high-frequency and dense. Rough path-based methods can handle these data in an elegant way, and retain the structure of long- and short-term dependencies within the data.
2.5 Human action recognition
The task of identifying a specific action performed by a person from a short video clip is very challenging. Signatures derived from landmark data extracted from video have been used to train classification models that achieve state-of-the-art performance compared with contemporary models (Yang et al., 2022; Cheng et al., 2024; Liao et al., 2021). (See also the preprints Ibrahim & Lyons, 2023 and Jiang et al., 2024.) Also in the domain of computer vision, signatures have been used to produce lightweight models for image classification (Ibrahim & Lyons, 2022) and in handwriting recognition tasks (Xie et al., 2018).
3 RoughPy
RoughPy is a new library that aims to support the development of connections between rough path theory and data science. It represents a shift in philosophy from simple computations of signatures for sequential data, to a representation of these data as a rough path. The design objectives for RoughPy are as follows:
- provide a class that presents a rough path view of some source of data, exposing methods for querying the data over intervals to get a signature or log-signature;
- provide classes and functions that allow the users to interact with the signatures and other algebraic objects in a natural, mathematical manner;
- all operations should be differentiable and objects should be interoperable with objects from machine learning frameworks such as TensorFlow, JAX, and PyTorch.
The first two objectives are relatively simple design and implementation problems. The final objective presents the most difficulty, especially interoperability between RoughPy and common machine learning libraries. There are array interchange formats for NumPy-like arrays, such as the Python Array API standard (Meurer et al., 2023) and the DLPack protocol (DLPack, 2023). These provide part of the picture, but in order for them to be fully supported, RoughPy must support a variety of compute backends such as CUDA (NVIDIA), ROCm/HIP (AMD), and Metal (Apple).
RoughPy is a substantial library with numerous components, mostly written in C++ with a Python interface defined using Pybind11 Jakob et al., 2017. The original design of the library closely followed the C++ template libraries libRDE and libalgebra Buckley et al., 2006, although it has seen many iterations since.
In the remainder of this section, we discuss some of the core components of RoughPy, give an example of using RoughPy, and discuss the future of RoughPy.
3.1 Free tensors, shuffle tensors, and Lie objects
In order to properly support rough path based methods and allow users to write
code based on mathematical concepts, we provide realisations of several algebra
types. The algebras provided in RoughPy are FreeTensor
, ShuffleTensor
, and
Lie
, which define elements of a particular free tensor algebra, shuffle tensor
algebra, and Lie algebra respectively. Each of these algebras is initialized
with a width, depth, and scalar coefficient type, encapsulated in a Context
object.
In addition to the algebra classes, RoughPy provides a number of supporting
functions, including antipodes and half-shuffle products for
FreeTensor
/ShuffleTensor
objects, and adjoint operators for left free tensor
multiplication. These are operations that are frequently used in the theory of
rough paths, and will likely be necessary in developing new applications later
(as in the signature kernels).
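As a short sketch of how these objects might be manipulated in practice (the exact construction and method names here, such as building a FreeTensor from a coefficient vector, exp, log, and lie_to_tensor, reflect our reading of the RoughPy documentation and should be treated as indicative rather than definitive):

import roughpy as rp

# NOTE: method names below are assumed from the RoughPy documentation.
# A context fixes the width, truncation depth, and coefficient type.
ctx = rp.get_context(width=2, depth=3, coeffs=rp.DPReal)

# Free tensors constructed from a flat vector of coefficients, degree by
# degree: (empty word, letter 1, letter 2, ...).
x = rp.FreeTensor([0.0, 1.0, 2.0], ctx=ctx)
y = rp.FreeTensor([0.0, 0.5, -1.0], ctx=ctx)

print(x * y)          # truncated tensor product
print(x.exp())        # free tensor exponential
print(x.exp().log())  # and its logarithm, recovering x

# Lie elements live in the free Lie algebra of the same width and depth.
l = rp.Lie([1.0, 2.0], ctx=ctx)
print(ctx.lie_to_tensor(l))  # embed the Lie element into the tensor algebra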
RoughPy algebras are designed around a flexible scalar ring system that allows users to perform calculations with different accuracy, or derive expressions by using polynomial coefficients. For most applications, single or double precision floating point numbers will provide a good balance between performance and accuracy. (Double precision floats are the default.) When more precision is required, rational coefficients can be used instead. These are backed by GMP rationals for fast, arbitrary precision rational arithmetic Granlund & the GMP development team, 2012. Polynomial coefficients can be used to derive formulae by performing calculations. This is a powerful technique for understanding the terms that appear in the result, particularly whilst testing and debugging.
3.2 Intervals
RoughPy is very careful in the way it handles intervals. All intervals in
RoughPy are half-open, meaning that they include one end point but not the
other; they are either clopen (closed on the left, open on the right) or
opencl (open on the left, closed on the right). Besides the type (clopen or
opencl), all intervals must provide methods for retrieving the infimum ($s$ in
the notation above) and the supremum ($t$ above) of the interval as double
precision floats. This is enforced by means of an abstract base class Interval.
The main concrete interval types are RealInterval, an interval with arbitrary
real endpoints, and DyadicInterval, as described below. For brevity, we shall
only consider clopen intervals.
A dyadic interval is an interval of the form $[k2^{-n}, (k+1)2^{-n})$, where $k$ and $n$ are integers. The number $n$ is often described as the resolution of the interval. The family of dyadic intervals of a fixed resolution partitions the real line, so that every real number belongs to a unique dyadic interval of each resolution. Moreover, the family of all dyadic intervals has the property that two dyadic intervals are either disjoint or one contains the other (including the possibility that they are equal).
In many cases, RoughPy will granularise an interval into dyadic intervals. The dyadic granularisation of $[s, t)$ with resolution $n$ is $[\lfloor 2^n s\rfloor 2^{-n}, \lfloor 2^n t\rfloor 2^{-n})$. In effect, the dyadic granularisation is the result of "rounding" each end point to the included end of the unique dyadic interval of resolution $n$ that contains it.
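The granularisation rule itself is simple enough to state in a few lines of Python (a sketch of the arithmetic only, with a hypothetical function name, not of RoughPy's Interval classes):

import math

def dyadic_granularise(inf, sup, resolution):
    """Round each endpoint of the clopen interval [inf, sup) down to the
    left endpoint of the resolution-n dyadic interval containing it."""
    scale = 2 ** resolution
    return (math.floor(inf * scale) / scale,
            math.floor(sup * scale) / scale)

print(dyadic_granularise(0.3, 1.7, 3))  # (0.25, 1.625)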
3.3 Streams
Streams are central to RoughPy. A RoughPy Stream is a rough path view of some
underlying data. It provides two key methods to query the object over intervals
to retrieve either a signature or log-signature. Importantly, once constructed,
the underlying data is inaccessible except by querying via these methods.
Streams are designed to be composed in various ways, such as by concatenation,
in order to build up more complex streams. A Stream is actually a
(type-erasing) wrapper around a more minimal StreamInterface abstract class.
We construct streams by a factory function associated with each different
StreamInterface, which might perform some compression of the underlying data.
For example, a basic StreamInterface is the LieIncrementStream, which can be
constructed using the associated from_increments factory function (a static
method of the class), which accepts an array of increment data.
These data will typically be the differences between successive values of the
data (but could also include higher-order Lie terms). This is similar to the
way that libraries such as esig, iisignature, and signatory consume data.
RoughPy streams cache the result of log-signature queries over dyadic intervals so they can be reused in later calculations. To compute the log-signature over any interval $[s, t)$, we granularise at a fixed stream resolution $n$ to obtain the interval $[s', t')$, and then compute
$$L_{s,t} \approx \log\bigl(\exp(L_{D_1}) \otimes \exp(L_{D_2}) \otimes \cdots \otimes \exp(L_{D_K})\bigr),$$
where $D_1, \dots, D_K$ are the resolution-$n$ dyadic intervals making up $[s', t')$.
The terms on the right-hand side are either retrieved from the cache, or computed from the underlying source. This is essentially the Campbell-Baker-Hausdorff formula applied to the log-signatures at the finest level. In practice, we can reduce the number of terms in the product by merging complementary dyadic intervals that appear in the granularisation. We further optimise by using a fused multiply-exponential operation, which computes $\mathbf{a} \otimes \exp(\mathbf{b})$ in a single step.
Signatures are always computed by first computing the log-signature and then exponentiating. Directly computing the signature as a product of exponentials of (cached) log-signatures might accumulate enough numerical errors to drift slightly from a group-like tensor. That is, the result might not actually be a true signature. Taking the logarithm and then exponentiating back to obtain the signature has the effect of correcting this numerical drift from a true signature.
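As a brief sketch of these interval queries (the call signatures shown, in particular passing a RealInterval to signature and log_signature, follow our reading of the RoughPy documentation and are indicative only):

import numpy as np
import roughpy as rp

# NOTE: call signatures below are assumed from the RoughPy documentation.
ctx = rp.get_context(width=2, depth=2, coeffs=rp.DPReal)
# Increments indexed at parameter values 0, 1, 2, 3 by default.
increments = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
stream = rp.LieIncrementStream.from_increments(increments, ctx=ctx)

interval = rp.RealInterval(0.0, 2.0)
# Queries over (dyadic pieces of) this interval are cached and reused.
print(stream.log_signature(interval))
print(stream.signature(interval))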
Aside from the basic LieIncrementStream, there are several other
implementations of the StreamInterface currently available in RoughPy. The
BrownianStream approximates Brownian motion by generating normally distributed
increments over dyadic intervals of arbitrary resolution on demand, forming a
reasonable approximation of true Brownian motion. The ExternalDataStream is an
interface for loading data from various external sources, such as a database
or a specialised data format. Currently, only sound files are supported, but we
plan to extend support to other sources as the need arises. This will certainly
include "online" data sources such as computer peripheral devices
(e.g. microphones).
The other main StreamInterface implementation is the PiecewiseAbelianStream,
which is an important construction from the theory of CDEs. A piecewise Abelian
path, or log-linear path, is an example of a smooth rough path, which
generalises piecewise linear approximations of an arbitrary stream. Formally,
an Abelian path is a pair $([s, t), \ell)$ where $[s, t)$ is an interval and
$\ell \in L((V))$. The log-signature over an arbitrary subinterval
$[u, v) \subseteq [s, t)$ is given by
$$L_{u,v} = \frac{v - u}{t - s}\,\ell.$$
A piecewise Abelian path is the concatenation of finitely many Abelian paths with adjacent intervals. For any rough path $x$ and partition $s = t_0 < t_1 < \dots < t_K = t$, there is a piecewise Abelian approximation of this path given by the pairs
$$\bigl([t_{j-1}, t_j),\ \log(S(x)_{t_{j-1}, t_j})\bigr), \qquad j = 1, \dots, K.$$
This construction turns out to be vital for computing signature kernels (Meurer et al., 2023) and for solving CDEs (Lyons et al., 2007; Walker et al., 2024). In particular, this construction can be used to compress data at some degree, which can then be used in computations at a higher degree.
3.4 Example
In this section we show a very simple example of how to use RoughPy to construct
a stream and compute a signature. This example is similar to the first few steps
of the tutorial found in the RoughPy documentation.[1] RoughPy can be
installed using pip, where prebuilt wheels are available for Windows, Linux,
and macOS:
pip install roughpy
We refer the reader to this documentation for much more detail. We will construct a stream in $\mathbb{R}^{26}$ by taking each letter in a word, "scipy" in this example, as the increments of a path:
import numpy as np

text = "scipy"
increments = np.zeros((len(text), 26), dtype="int8")
for i, c in enumerate(text):
    increments[i, ord(c) - 97] = 1  # one-hot increment for each letter, a = 0
Now we import RoughPy and construct a Stream
using the factory mentioned
above. One other critical ingredient is the algebra Context
, which is used to
set up a consistent set of algebra objects with the desired width (26),
truncation level (2), and coefficient type (Rational
).
import roughpy as rp

ctx = rp.get_context(width=26, depth=2,
                     coeffs=rp.Rational)
stream = rp.LieIncrementStream.from_increments(
    increments, ctx=ctx)
Now we can compute the signature of the stream over the whole domain of the stream by omitting the interval argument:
sig = stream.signature()
print(sig)
# { 1() 1(3) 1(9) 1(16) 1(19) 1(25) 1/2(3,3)
# 1(3,9) 1(3,16) 1(3,25) 1/2(9,9) 1(9,16)
# 1(9,25) 1/2(16,16) 1(16,25) 1(19,3) 1(19,9)
# 1(19,16) 1/2(19,19) 1(19,25) 1/2(25,25) }
The first term of the signature is always 1, and the empty parentheses
indicate the empty tensor word.
The next five terms correspond to the counts of each unique letter that
appears; the number in parentheses indicates the letter (with a
being 1).
The final terms indicate the order in which each pair of letters appear in
the word.
For instance, the term 1(3,9)
indicates that a c
appears before an i
.
This is only the beginning of the story.
From here, we can use the signatures to compute the similarity between streams,
via the signature kernel for instance, or use them as features in a variety of
machine learning problems.
More detailed examples of how to use signatures in data science are given on the
DataSig website https://
3.5 The future of RoughPy
RoughPy is continuously evolving. At the time of writing, the current version uses libalgebra and libalgebra-lite (libalgebra with fewer templates) for computations. Unfortunately, this has made it difficult to achieve the differentiability and compute device support that we want. We are currently changing the way we implement vectors and algebras to provide the support for on-device computation that we need. Making the operations differentiable is crucial for machine learning, and will be the biggest challenge.
Long term, we need to expand support for signature kernels and CDEs. As applications of these tools grow in data science, we will need to devise new methods for computing kernels or solving CDEs. We will also build a framework for constructing and working with linear maps and homomorphisms. For example, one very useful construction is the extension of a linear map defined on $V$ to the whole tensor algebra.
4 Conclusions
The use of rough path theory in data science is rapidly expanding and provides a different way to view sequential data. Signatures, and other methods arising from rough path theory, are already used in a wide variety of applications, with great effect. The next steps in overcoming the difficulty in modelling sequential data will require a change of perspective. Viewing these data through the lens of rough path theory might provide this change.
RoughPy is a new Python library for working with streamed data using rough path methods. It is designed to abstract away the form and source of data so that analysis can be performed by querying path-like objects. This approach is much closer to the mathematics. It also allows users to interact with the various algebras associated with rough paths (free tensor algebra, shuffle tensor algebra, Lie algebra) in a natural way. RoughPy is under active development, and a long list of improvements and extensions are planned.
Acknowledgments
This work was supported in part by EPSRC (NSFC) under Grant EP/S026347/1, in part by The Alan Turing Institute under the EPSRC grant EP/N510129/1, the Data Centric Engineering Programme (under the Lloyd’s Register Foundation grant G0095), and the Defence and Security Programme (funded by the UK Government). Terry Lyons was additionally supported by the Hong Kong Innovation and Technology Commission (InnoHK project CIMDA). For the purpose of Open Access, the author has applied a CC-BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
- Lyons, T. J. (1998). Differential equations driven by rough signals. Revista Matemática Iberoamericana, 14(2), 215–310. http://eudml.org/doc/39555
- Lyons, T. J., Caruana, M., & Lévy, T. (2007). Differential Equations Driven by Rough Paths: École d’Été de Probabilités de Saint-Flour XXXIV - 2004. In Lecture Notes in Mathematics. Springer Berlin Heidelberg. 10.1007/978-3-540-71285-5
- Lyons, T., & Maxwell, D. (2017). esig.
- Reizenstein, J. F., & Graham, B. (2020). Algorithm 1004: The Iisignature Library: Efficient Calculation of Iterated-Integral Signatures and Log Signatures. ACM Transactions on Mathematical Software, 46(1), 1–21. 10.1145/3371237
- Kidger, P., & Lyons, T. (2020). Signatory: differentiable computations of the signature and logsignature transforms, on both CPU and GPU. arXiv. 10.48550/ARXIV.2001.00706