Introduction to theft

Trent Henderson

2024-10-03

library(theft)

Purpose

theft enables the standardised calculation of time-series features from multiple existing feature sets, and any user-supplied features.

Core functionality

To explore package functionality, we are going to use a dataset called simData that ships with theft. This dataset contains a collection of randomly generated time series for six different types of processes. The dataset can be accessed via:

theft::simData

The data has the following structure:

head(simData)
#>                      values timepoint               id        process
#> Gaussian Noise.1 -0.6264538         1 Gaussian Noise_1 Gaussian Noise
#> Gaussian Noise.2  0.1836433         2 Gaussian Noise_1 Gaussian Noise
#> Gaussian Noise.3 -0.8356286         3 Gaussian Noise_1 Gaussian Noise
#> Gaussian Noise.4  1.5952808         4 Gaussian Noise_1 Gaussian Noise
#> Gaussian Noise.5  0.3295078         5 Gaussian Noise_1 Gaussian Noise
#> Gaussian Noise.6 -0.8204684         6 Gaussian Noise_1 Gaussian Noise
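
To make this input shape concrete, here is a minimal base R sketch that builds a small data frame with the same four columns as simData. The names toy_data and make_series, the two process labels, and the series length are hypothetical choices for illustration; this is not how simData itself was generated.

```r
# Build a toy long-format data set: one row per (series, timepoint) pair.
set.seed(123)

n_timepoints <- 50  # hypothetical series length

# Hypothetical helper: wrap one numeric vector in the four-column layout.
make_series <- function(values, process, index) {
  data.frame(
    values    = values,
    timepoint = seq_along(values),
    id        = paste(process, index, sep = "_"),  # unique series identifier
    process   = process,                           # grouping label
    stringsAsFactors = FALSE
  )
}

toy_data <- rbind(
  make_series(rnorm(n_timepoints), "Gaussian Noise", 1),
  make_series(cumsum(rnorm(n_timepoints)), "Random Walk", 1)
)

head(toy_data)
```

Each series is stacked on top of the next, so the id column is what separates one time series from another and the process column records its group.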

Calculating feature summary statistics

The core function is calculate_features, which computes every feature in the chosen feature set(s) in a single call. You can choose which feature sets to calculate with the feature_set argument. The choices are currently "catch22", "feasts", "Kats", "tsfeatures", "tsfresh", and/or "TSFEL".

Note that Kats, tsfresh and TSFEL are Python packages. The R package reticulate is used to call Python code that uses these packages and applies it within the broader tidy data philosophy embodied by theft. At present, depending on the input time series, theft provides access to more than 1200 features.

However, as discussed in the functionality demonstrations below, you can also supply your own list of features! But more on that later…

Installing Python feature sets

The Kats, tsfresh and TSFEL feature sets require a working Python 3.9 installation; the R-based sets will run fine without it. If you want to use the Python sets, first install theft and then run the function install_python_pkgs(venv), where the venv argument is the name of the virtual environment you want to create.

For example, to install the Python libraries into the default virtual environment folder used by reticulate, run the following after installing theft (here I am just creating a new virtual environment called "theft-package", but you can call it whatever you like!):

install_python_pkgs("theft-package")

You can then run the following to activate the virtual environment:

init_theft("theft-package")

You are now ready to commit theft using all six potential factory feature sets!

However, you do not have to use these convenience functions. If you have another method for pointing R to the correct Python (such as reticulate or findpython), you can use that in your workflow instead; just make sure you install Kats, tsfresh or TSFEL as required.

NOTE 1: You only need to call init_theft or your other solution once per session.

NOTE 2: If you have issues installing Kats with install_python_pkgs, try install_python_pkgs("theft-package", standard_kats = FALSE).
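
If you manage environments yourself, it can help to check whether a virtual environment already exists before creating one. The sketch below uses reticulate::virtualenv_exists for this. The helper name venv_ready is hypothetical and not part of theft, and the check only looks for the environment itself; it does not confirm that Kats, tsfresh or TSFEL are installed inside it.

```r
# Hypothetical helper: TRUE if reticulate can find a virtual environment
# with this name, FALSE if not, NA if reticulate itself is not installed.
venv_ready <- function(venv) {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    return(NA)  # reticulate not installed, so nothing can be checked
  }
  isTRUE(reticulate::virtualenv_exists(venv))
}

venv_ready("theft-package")
```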

Calculating features

You are then ready to use the rest of the package’s functionality, beginning with the extraction of time-series features. Here is an example with the catch22 set:

feature_matrix <- calculate_features(data = simData, 
                                     id_var = "id", 
                                     time_var = "timepoint", 
                                     values_var = "values", 
                                     group_var = "process", 
                                     feature_set = "catch22",
                                     seed = 123)

head(feature_matrix)
#>                 id          group                    names      values
#> 1 Gaussian Noise_1 Gaussian Noise       DN_HistogramMode_5 -0.01408452
#> 2 Gaussian Noise_1 Gaussian Noise      DN_HistogramMode_10 -0.27031413
#> 3 Gaussian Noise_1 Gaussian Noise                CO_f1ecac  1.60843425
#> 4 Gaussian Noise_1 Gaussian Noise           CO_FirstMin_ac  3.00000000
#> 5 Gaussian Noise_1 Gaussian Noise CO_HistogramAMI_even_2_5  0.09661403
#> 6 Gaussian Noise_1 Gaussian Noise            CO_trev_1_num  1.51953865
#>   feature_set
#> 1     catch22
#> 2     catch22
#> 3     catch22
#> 4     catch22
#> 5     catch22
#> 6     catch22
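
Many modelling functions expect one row per time series and one column per feature, whereas the output above is in long format. The sketch below shows one way to pivot output of this shape using base R's reshape. Here toy_features is a hypothetical stand-in with the same column names as feature_matrix; its feature names and values are invented purely for illustration.

```r
# Hypothetical long-format feature table with the same columns as above.
toy_features <- data.frame(
  id          = rep(c("Gaussian Noise_1", "Gaussian Noise_2"), each = 2),
  group       = "Gaussian Noise",
  names       = rep(c("feature_a", "feature_b"), times = 2),
  values      = c(0.1, 0.9, 0.3, 1.1),  # invented numbers
  feature_set = "toy",
  stringsAsFactors = FALSE
)

# Pivot to wide: one row per (id, group), one column per feature name.
wide <- reshape(
  toy_features[, c("id", "group", "names", "values")],
  idvar     = c("id", "group"),
  timevar   = "names",
  direction = "wide"
)

# reshape() prefixes the new columns with "values."; strip that prefix.
names(wide) <- sub("^values\\.", "", names(wide))
rownames(wide) <- NULL

wide
```

The feature_set column is dropped before reshaping because it is constant within a single feature set; if several sets are combined, one option is to paste the set and feature names together first so that column names stay unique.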

Note that for the catch22 set you can set the additional catch24 argument to calculate the mean and standard deviation in addition to the standard 22 features:

feature_matrix <- calculate_features(data = simData, 
                                     id_var = "id", 
                                     time_var = "timepoint", 
                                     values_var = "values", 
                                     group_var = "process", 
                                     feature_set = "catch22",
                                     catch24 = TRUE,
                                     seed = 123)

NOTE: If using the tsfresh feature set, you might want to consider the tsfresh_cleanup argument to calculate_features. This argument defaults to FALSE and specifies whether to use the in-built tsfresh relevant feature filter or not.

You can also supply your own named list of functions to compute as time-series features. Below is an example with mean and standard deviation. The list must be named because theft uses the list element names to label the time-series features internally. If you don’t want to use any of the existing feature sets in theft and only want to calculate the features you supply to the features argument, just set feature_set = NULL.

feature_matrix2 <- calculate_features(data = simData, 
                                      group_var = "process",
                                      feature_set = NULL,
                                      features = list("mean" = mean, "sd" = sd))

head(feature_matrix2)
#>                 id          group names       values feature_set
#> 1 Gaussian Noise_1 Gaussian Noise  mean  0.106146509        User
#> 2 Gaussian Noise_1 Gaussian Noise    sd  0.900816596        User
#> 3 Gaussian Noise_2 Gaussian Noise  mean  0.029534984        User
#> 4 Gaussian Noise_2 Gaussian Noise    sd  1.129814821        User
#> 5 Gaussian Noise_3 Gaussian Noise  mean -0.009571088        User
#> 6 Gaussian Noise_3 Gaussian Noise    sd  0.872840204        User
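
As a hedged illustration of what such a named list can look like beyond built-in functions, the sketch below defines three features, one of them an anonymous function, and applies them to a single numeric vector with base R. It assumes each supplied function takes a numeric vector and returns a single number, as mean and sd do; the names my_features and lag1_acf are hypothetical.

```r
# A named list of feature functions; the element names become feature labels.
my_features <- list(
  "mean"     = mean,
  "sd"       = sd,
  # Lag-1 autocorrelation: element 1 of $acf is lag 0, element 2 is lag 1.
  "lag1_acf" = function(x) stats::acf(x, lag.max = 1, plot = FALSE)$acf[2]
)

# Check the functions on one series before handing them to a larger workflow.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
result <- vapply(my_features, function(f) f(x), numeric(1))

result
```

Testing the list on a single vector like this is a quick way to catch functions that return more than one value or fail on your data. A list such as my_features could then be passed to the features argument in the same way as the mean and sd example above.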

Comparison of feature sets

For a detailed comparison of the six feature sets, see the review paper by Henderson and Fulcher [1].

Reading and processing hctsa-formatted files

As theft is based on the foundations laid by hctsa, there is also functionality for reading in hctsa-formatted MATLAB files and automatically processing them into tidy dataframes ready for feature extraction in theft. The process_hctsa_file function takes a string filepath to the MATLAB file and does all the work for you, returning a dataframe with naming conventions consistent with other theft functionality. As per hctsa specifications for Input File Format 1, this file should have three variables with the following exact names: timeSeriesData, labels, and keywords. The filepath can be a local drive path or a URL.

Analysing, interpreting, and visualising time-series features

Please see the companion package theftdlc (‘theft downloadable content’) for a large suite of functions for analysing, interpreting, and visualising time-series features.


  1. T. Henderson and B. D. Fulcher, “An Empirical Evaluation of Time-Series Feature Sets,” 2021 International Conference on Data Mining Workshops (ICDMW), 2021, pp. 1032-1038, doi: 10.1109/ICDMW53433.2021.00134.