autostsm (Automatic Structural Time Series Model) is designed to automatically detect the appropriate decomposition of a univariate time series into trend, seasonal, and cycle components using a state space approach. The package also has the functionality to perform structural interpolation: i.e., interpolating a quarterly series into a monthly series using either end-of-period, period-average, or period-sum methods. The unobserved components are estimated with the Kalman filter and all model parameters are estimated using maximum likelihood. The Kalman filter and smoother are implemented in Rcpp, so it is reasonably fast. This package is designed to handle time series of any frequency (standard or not), series with multiple seasonalities, seasonalities with fractional periodicities, missing data imputation, and irregularly spaced dates (i.e. daily data with gaps due to weekends, holidays, etc.).
With ease-of-use in mind, all the user has to do is provide a two
column data frame: one column with dates and one column with the
univariate time series. Everything else is handled by the algorithm
including data frequency detection (observations per year), trend
specification, cycle specification, seasonal specification, missing data
imputation, anomaly detection, structural break detection, and whether
or not to log transform the data. The user can manually set the
preferred decomposition if desired, however. All components are assumed
to be stochastic, which allows for time-varying trend, cycle, and
seasonalities, unless the user specifies otherwise using the det_trend, det_drift, det_seas, det_cycle, and det_obs arguments in stsm_estimate. The algorithm can handle secondly, minutely, hourly, daily, monthly, and quarterly frequencies as well as daily data that is weekday only. It is also capable of handling non-standard frequencies (i.e. frequencies that are none of the above). To build an interpolation model, all the user has to do is set interpolate and interpolate_method. The user only needs to call stsm_estimate to fit the model and stsm_forecast to filter and forecast the data.
Additionally, the user can call stsm_detect_anomalies and stsm_detect_breaks to perform anomaly and structural break detection, respectively.
The model is based on the decomposition
\[ Y_t = T_t + C_t + S_t + B^o X^o_t + e_t \]
where \(Y_t\) is a univariate time series, \(T_t\) is the trend component, \(C_t\) is the cycle component, \(S_t\) is the seasonal component, \(X^o_t\) is optional exogenous data in the observation equation, and \(e_t \sim N(0, \sigma_e^2)\) is the observation error. The seasonal and cycle components are assumed to be of a trigonometric form, while the trend is assumed to be one of three types: a random walk, a random walk with an AR(1) drift, or a double random walk (random walk with a random walk drift).
The state space model is written as \[ Y_t = A + H \beta_t + B^o X^o_t + e_t, e_t \sim N(0, R)\\ \beta_t = D + F \beta_{t-1} + B^s X^s_t + u_t, u_t \sim N(0, Q) \]
where the first equation is the observation equation and the second equation is the transition equation. \(A\) is the observation intercept, \(H\) is the observation matrix, \(e_t \sim N(0, R)\) is the observation error, and \(R\) is the observation error-covariance matrix. \(\beta_t\) is the vector of the state components, \(D\) is the state intercept matrix, \(F\) is the transition matrix, \(X^s_t\) is optional exogenous data in the state equation, \(u_t \sim N(0, Q)\) is the state error, and \(Q\) is the state error-covariance matrix.
If trend is “random-walk”, the trend model is specified as the I(1) process \[ T_t = T_{t-1} + e_t, e_t \sim N(0, \sigma_t^2) \]
If trend is “random-walk-drift”, the trend model is specified as the I(1) process \[ T_t = D_{t-1} + T_{t-1} + e_t, e_t \sim N(0, \sigma_t^2) \\ D_t = d + \phi_d D_{t-1} + n_t, n_t \sim N(0, \sigma_d^2) \] where \(D_t\) is the drift.
If trend is “double-random-walk”, the trend model is specified as the I(2) process \[ T_t = D_{t-1} + T_{t-1} + e_t, e_t \sim N(0, \sigma_t^2) \\ D_t = D_{t-1} + n_t, n_t \sim N(0, \sigma_d^2) \]
The cycle is modeled using either the trigonometric process
\[ \begin{bmatrix} c_t \\ c_t^{*} \end{bmatrix} = \phi_c \begin{bmatrix} cos(\lambda) & sin(\lambda) \\ -sin(\lambda) & cos(\lambda) \end{bmatrix} \begin{bmatrix} c_{t-1} \\ c_{t-1}^{*} \end{bmatrix} + \begin{bmatrix} u_t \\ u_t^{*} \end{bmatrix}, u_t, u_t^{*} \sim N(0, \sigma_c^2) \] where \(\lambda\) is the frequency of the cycle and \(\phi_c \in (0,1)\) to make the cycle stationary for forecasting purposes, or the ARMA(p,q) process
\[ \begin{bmatrix} c_t \\ c_{t-1} \\ \vdots \\ c_{t-p+2} \\ c_{t-p+1} \\ e_t \\ e_{t-1} \\ \vdots \\ e_{t-q+2} \\ e_{t-q+1} \end{bmatrix} = \begin{bmatrix} \phi_{c,1} & \phi_{c,2} & \ldots & \phi_{c,p-1} & \phi_{c,p} & \theta_{c,1} & \theta_{c,2} & \ldots & \theta_{c,q-1} & \theta_{c,q} \\ 1 & 0 & \ldots & 0 & 0 & \ldots & \ldots & \ldots & \ldots & 0 \\ 0 & \ddots & & \vdots & \vdots & \ddots & & \ddots & & \vdots \\ \vdots & & \ddots & \vdots & \vdots & & \ddots& & \ddots & \vdots \\ 0 & \ldots & \dots & 1 & 0 & \ldots & \ldots & \ldots & \dots & 0 \\ 0 & \ldots & \ldots & \ldots & 0 & \ldots & \ldots & \ldots & \ldots & 0 \\ \vdots & \ddots & & & \vdots & 1 & 0 & \ldots & 0 & \vdots \\ \vdots & & \ddots & & \vdots & 0 & \ddots & & \vdots & \vdots \\ \vdots & & & \ddots & \vdots & \vdots & & \ddots & \vdots & \vdots \\ 0 & \ldots & \ldots & \ldots & 0 & 0 & \ldots & \ldots & 1 & 0\\ \end{bmatrix} \begin{bmatrix} c_{t-1} \\ c_{t-2} \\ \vdots \\ c_{t-p+1} \\ c_{t-p} \\ e_{t-1} \\ e_{t-2} \\ \vdots \\ e_{t-q+1} \\e_{t-q} \end{bmatrix} + \begin{bmatrix} e_t \\ 0 \\ \vdots \\ \vdots \\ 0 \\ e_t \\ 0 \\ \vdots \\ \vdots \\0 \end{bmatrix}, e_t \sim N(0, \sigma_c^2) \] where \(p\) is the autoregressive lag length and \(q\) is the moving average lag length. The ARMA(p,q) specification can have real or complex roots, while the trigonometric specification will only have complex roots. The trigonometric specification is a special case of an ARMA(2,1) specification.
If the roots are complex, the dynamics will be damped oscillations about the trend, which may not always be desirable. If the user would like to remove oscillations from the forecast that are caused by the complex roots, the user can specify dampen_cycle = TRUE in stsm_forecast. This option will smooth the forecast of the cycle into the trend using a fitted sigmoid curve.
Seasonality is modeled using a Fourier series, which is the sum of sin-cos waves with each wave representing a seasonal pattern. Total seasonality is \[ s_t = \sum_{j = 1}^{h} s_t^j \]
where each season \(s_t^j\) is modeled using the trigonometric specification below
\[ \begin{bmatrix} s_t^j \\ s_t^{j,*} \end{bmatrix} = \begin{bmatrix} cos(\lambda_j) & sin(\lambda_j) \\ -sin(\lambda_j) & cos(\lambda_j) \end{bmatrix} \begin{bmatrix} s_{t-1}^j \\ s_{t-1}^{j,*} \end{bmatrix} + \begin{bmatrix} w_t^j \\ w_t^{j,*} \end{bmatrix}, w_t^j \sim N(0, \sigma_j^2) \] where \(s_t^{j,*}\) exists by construction and is just an auxiliary variable, \(\lambda_j = \frac{2\pi j}{f}\) is the frequency of season \(j\), \(j\) represents the number of periods per year (i.e. \(j = 52\) represents weekly seasonality at 52 periods per year), and \(f\) is the frequency of the data (i.e. the number of observations per year). The seasons to include in the model are automatically detected in the algorithm using wavelet analysis. Each \(\lambda_j\) is predetermined from wavelet analysis and are not estimated parameters.
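As a quick illustration of the trigonometric recursion (a minimal sketch in Python, not package code): iterating the noiseless 2x2 rotation traces out a sinusoid with period \(f/j\) observations while preserving its amplitude, which is why the auxiliary state \(s_t^{j,*}\) is needed to carry the phase forward.

```python
import numpy as np

# Iterate the trigonometric seasonal recursion with the noise set to
# zero: the state rotates, producing a sinusoid with period f / j.
f, j = 365.25, 52.0            # daily data, weekly seasonality
lam = 2 * np.pi * j / f        # seasonal frequency lambda_j
rot = np.array([[np.cos(lam),  np.sin(lam)],
                [-np.sin(lam), np.cos(lam)]])

state = np.array([1.0, 0.0])   # [s_0, s_0*]
path = []
for _ in range(1000):
    path.append(state[0])      # record s_t before stepping forward
    state = rot @ state

path = np.array(path)          # path[t] equals cos(lam * t)
# The rotation preserves amplitude: s_t^2 + (s_t*)^2 stays constant.
```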
The automatic model specification algorithm follows the procedure below. Rather than using brute-force selection by checking every possible model combination and selecting the model that minimizes a selection criterion like the AIC, which can be computationally intensive and time consuming, the procedure uses a series of statistical tests to determine the model specification. However, the user is free to write their own brute-force model selection routine from the functions provided in this package, as stsm_estimate returns a table with the model’s log likelihood, AIC, AICc, and BIC.
The first step is to detect the frequency of the data by examining the sequence of dates provided. If the day difference in subsequent dates is closest to
- \(1\), then a frequency of 365.25 is used for daily data
- \(\frac{365.25}{12}\), then a frequency of 12 is used for monthly data
- \(\frac{365.25}{4}\), then a frequency of 4 is used for quarterly data

It can also detect sub-daily data if the day difference in subsequent dates is closest to
- \(\frac{1}{24}\), then a frequency of 8760 is used for hourly data
- \(\frac{1}{60 \cdot 24}\), then a frequency of 525600 is used for minutely data
- \(\frac{1}{60 \cdot 60 \cdot 24}\), then a frequency of 31536000 is used for secondly data
The numeric frequency reflects the number of periods in one year.
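The detection heuristic can be sketched as follows (a hypothetical Python illustration: the function name detect_frequency and the closest-match rule are assumptions for exposition, not the package's implementation):

```python
from datetime import date, timedelta

# Hypothetical sketch of date-based frequency detection: match the
# median day difference between subsequent dates to the closest known
# spacing, and return the corresponding periods-per-year frequency.
KNOWN = {1.0: 365.25, 365.25 / 12: 12.0, 365.25 / 4: 4.0,
         1 / 24: 8760.0, 1 / (60 * 24): 525600.0,
         1 / (60 * 60 * 24): 31536000.0}

def detect_frequency(dates):
    diffs = sorted((b - a).total_seconds() / 86400.0
                   for a, b in zip(dates, dates[1:]))
    median = diffs[len(diffs) // 2]
    # pick the known day spacing closest to the observed median spacing
    return KNOWN[min(KNOWN, key=lambda d: abs(d - median))]

monthly = [date(2020, m, 1) for m in range(1, 13)]
daily = [date(2020, 1, 1) + timedelta(days=i) for i in range(30)]
```

Using the median spacing makes the heuristic robust to a few gaps from missing dates.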
Once the frequency is detected, a sequence of continuous dates is
constructed to ensure evenly spaced data. If an observation is missing
for any of the dates in the constructed sequence, it is treated as a
missing value. However, if the dates provided contain only weekdays and
the data is of daily frequency for example, the frequency is changed to
\(365.25 \cdot \frac{5}{7} = 260.8929\)
to reflect the number of weekdays per year. The same concept is applied to sub-daily data with weekday-only observations. If none of the above frequencies are detected, then the frequency is set to the number of rows in the data to allow the algorithm to search for seasonal patterns. A flag standard_freq will be set to FALSE in the final output.
Before any of the model specification tests can be performed, the data must be seasonally and cycle adjusted. This procedure is performed by decomposing the time series into a loess trend and a seasonal-cycle remainder. The loess function in the stats package requires that there are no missing values in the data, so any missing values are imputed using the Kalman filter with a local trend using the stats package’s StructTS and KalmanSmooth functions. If you know the missing data pattern in your data (i.e. missing data should be 0), it’s best to fill in the missing data using your knowledge before allowing this method to fill in the rest. A smoother cycle is also estimated from the seasonal-cycle component using a smoothing spline from the smooth.spline function of the stats package.
Once the seasonal and cycle periods are known, the seasonal-cycle is more accurately decomposed into the seasonal and cycle components by first decomposing the seasonal-cycle using the mstl function from the forecast package, then regressing the seasonal component on a Fourier series composed of the seasonal periods.
The first step is wavelet analysis for seasonal detection, which is done using a series of harmonic regressions in parallel on the detrended data. The detrended data may also be rescaled by the trend if the Cox-Stuart deviation test suggests that the detrended data does not have a constant amplitude. The regression is performed individually for each seasonal harmonic using the equation
\[ \tilde{y_t} = a_j sin\left(2\pi\frac{j}{f}t\right) + b_j cos\left(2\pi\frac{j}{f}t\right) \]
for harmonics \(j\) that correspond to the spectrum of minutely, hourly, workday-hour, half-daily, daily, weekend, weekday, weekly, monthly, quarterly, semi-yearly, and yearly seasonal frequencies where applicable. However, since this limits the number of frequencies tested, a linear space of 100 items is created between each harmonic so that the space captures 1% of the distance between the original two harmonics. If a nonstandard frequency is detected, the harmonics are set to \(j = 1, \ldots, \lfloor \frac{f}{2} \rfloor\), taking only 1000 equally spaced harmonics among those that have at least 3 cycles in the data. This strategy limits the number of harmonics tested for speed, while also trying to capture some a priori important seasonalities based on the frequency of the data.
At each iteration, an F-test is performed using the waldtest function from the lmtest package to measure the significance of the harmonic in explaining the seasonality, understanding that this will not be robust to heteroskedasticity or autocorrelation. This accuracy is sacrificed for speed at this step. The frequencies (defined as \(\frac{f}{j}\)) are then matched to the spectrum of seasonalities that correspond to the a priori important seasonal spectrum if the frequency of the harmonic is within one delta of the frequency of this predefined spectrum. This delta is equal to the unique difference between the frequencies of the harmonics.
Lastly, a backward stepwise regression procedure using an F-test for the significance of each seasonality (i.e. joint significance of \(a_j\) and \(b_j\)) is used to build the final seasonality specification, starting with only the harmonics that were statistically significant in the first step. At each stage of the stepwise procedure, an F-test for the statistical significance of each seasonality is performed and any harmonics that are not statistically significant are removed. The process is performed iteratively until all remaining harmonics are statistically significant. The F-tests for this procedure are calculated using HAC standard errors to be robust to heteroskedasticity and autocorrelation, reducing the chance of selecting spurious seasonalities.
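A single step of this harmonic-regression procedure can be sketched as follows (a Python illustration with an ordinary, non-HAC F-statistic; the helper harmonic_F is an assumption for exposition, not the package's code, which uses lmtest and HAC errors at the stepwise stage):

```python
import numpy as np

# Regress detrended data on the sin/cos pair at harmonic j and form
# the F-statistic for the joint significance of (a_j, b_j).
rng = np.random.default_rng(0)
f, n = 365.25, 1500
t = np.arange(n)
# simulated series with a true weekly seasonality (j = 52) plus noise
y = 2 * np.sin(2 * np.pi * t * 52 / f) + rng.normal(0, 1, n)

def harmonic_F(y, j, f):
    t = np.arange(len(y))
    X = np.column_stack([np.sin(2 * np.pi * j * t / f),
                         np.cos(2 * np.pi * j * t / f)])
    beta, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
    resid = (y - y.mean()) - X @ beta
    rss0 = np.sum((y - y.mean()) ** 2)   # restricted: no harmonic
    rss1 = np.sum(resid ** 2)            # unrestricted: with harmonic
    return ((rss0 - rss1) / 2) / (rss1 / (len(y) - 3))

# The F-statistic spikes at the true harmonic and is small elsewhere,
# which is what the power-spectrum search exploits.
```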
Next, if cycle = NULL, wavelet analysis is performed to detect a trigonometric form for the cycle. If one is not detected, it is set to an ARMA(p,q) specification if the model prior for the cycle is determined to be stationary. If cycle = "trig", wavelet analysis is performed to detect a trigonometric form, but if none is detected, it will default to no cycle. If cycle = "arma", then wavelet analysis will not be performed and the cycle will be set to the ARMA(p,q) form. In addition to setting cycle = "arma", the user can set the parameter arma = c(p = p, q = q) to specify a specific ARMA order; if left as the default with p and q set to NA, the order will be auto-selected.
Wavelet analysis for cycle detection is done using a series of harmonic regressions in parallel on the detrended data. The detrended data may also be rescaled by the trend if the Cox-Stuart deviation test suggests that the detrended data does not have a constant amplitude. The regression is performed individually for each harmonic using the equation
\[ \tilde{y_t} = a_j sin\left(2\pi\frac{j}{f}t\right) + b_j cos\left(2\pi\frac{j}{f}t\right) \]
for harmonics from \(j = 0.01\) to \(j = 0.99\) in increments of 0.01, truncated to harmonics that correspond to periods \(\in [2.5f, T]\) and harmonics \(< 1/2\), where \(T\) is the length of the data.
At each iteration, an F-test is performed using the waldtest function from the lmtest package to measure the significance of the harmonic in explaining the cycle, understanding that this will not be robust to heteroskedasticity or autocorrelation. This accuracy is sacrificed for speed at this step. Once all harmonics have been tested, the algorithm looks for a global maximum in the power spectrum, measured with the F-statistic, that is statistically significant.
Lastly, a final regression is performed with only the harmonic that was selected from the first step. From this regression, the harmonic is kept only if the cycle is statistically significant (i.e. \(a_j\) and \(b_j\) are jointly significant) as measured by the F-test for the final harmonic regression. The F-test for this step is calculated using HAC standard errors to be robust to heteroskedasticity and autocorrelation, reducing the chance of selecting a spurious cycle.
The third step in the model construction is to determine if the model should be based on a multiplicative or additive model. After the seasonal and cycle steps are completed, a new model prior is created based on the seasonal and cycle periods detected. For a multiplicative model, the data must all be positive and satisfy one of two tests. The model is restricted to a multiplicative or additive decomposition, rather than allowing the full range of Box-Cox transformations, so that the decomposition is easy to compare to the original time series. Additive makes the decomposition comparable in levels, multiplicative makes the decomposition comparable in percentages relative to the original time series, but Box-Cox transformations don’t have an easy interpretation like this. However, the user can Box-Cox transform the data prior to estimation and specify multiplicative = FALSE if that method is preferred.
First is a test of a linear vs. exponential trend. This test compares the log likelihood of 1) a linear regression of the seasonal and cycle adjusted data on a linear trend, and 2) a linear regression of the logged seasonal and cycle adjusted data on a linear trend, using the petest from the lmtest package with heteroskedasticity- and autocorrelation-robust standard errors from the sandwich package’s vcovHAC function. A log trend model is selected if and only if the linear model is rejected and the log model is not rejected.
Next is a test for constant seasonal and cycle amplitudes using the Cox-Stuart deviation test on the detrended (and de-seasonalized or de-cycled) data. If the seasonal component is determined to have a changing amplitude, a multiplicative model is used. However, since this test is sensitive to outliers, outliers in the seasonal and cycle series are detected and replaced using the forecast package’s tsclean function (without replacing missing values) before the test is performed.
The final step is to determine the specification of the trend after adjusting for the cycle and seasonality detected in the prior steps. This mainly boils down to determining if the data is integrated of order 0, 1, or 2 (I(0), I(1), or I(2)). To determine the number of differences required to make the data stationary, the ADF, PP, and KPSS tests are performed on the seasonal and cycle adjusted data using the forecast package’s ndiffs function. However, since these tests are sensitive to outliers, outliers in the seasonal and cycle adjusted series are detected and replaced using the forecast package’s tsclean function (without replacing missing values) before the tests are performed.
The number of differences selected for the trend specification is the average number of differences across the tests, rounded to the nearest digit. Next, the Cox-Stuart trend test is performed on the seasonal and cycle adjusted data to determine if the data is trending. If the data is trending, the order of integration is set to at least 1.
If the order of integration is set to 1 at this point, the algorithm will then check for structural breaks in the drift using the breakpoints function from the strucchange package, as long as the minimum prior drift is below 0. If the average value of the prior drift changes sign (from positive to negative, or negative to positive) between break segments, then the order of integration is set to 2.
If the data is determined to be
To estimate the unobserved components (\(T_t, C_t, S_t\), and \(D_t\)), the Kalman filter is utilized. The Kalman filter is a forward two-stage procedure that is the optimal linear filter based on the multivariate normal distribution. The “forwardness” of the filter means that it uses only data up to time \(t\) to make an inference on the unobserved components at time \(t\) and does not peek into the future to make that inference.
The first stage is the prediction stage, which makes predictions of the state components based on information up to time \(t-1\). This stage is made up of four equations, the first is the prediction of the state components based on its time series properties
\[ \beta_{t|t-1} = D + F \beta_{t-1|t-1} + B^s X^s_t \]
Second, is the prediction of the covariance matrix of the state components
\[ P_{t|t-1} = F P_{t-1|t-1} F^{\prime} + Q \]
Third, is the prediction error of the time series of interest
\[ \eta_{t|t-1} = Y_t - (A + H \beta_{t|t-1} + B^o X^o_t) \]
And finally, we have the variance of the prediction error
\[ f_{t|t-1} = H P_{t|t-1} H^{\prime} + R \]
The second stage is the updating stage, which makes predictions based on all information up to time \(t\). It consists of three equations. The first equation is the prediction of the state components based on the the full information set
\[ \beta_{t|t} = \beta_{t|t-1} + K_t \eta_{t|t-1} \] where \(K_t\) is the Kalman gain, which determines the optimal weight to give new information in making predictions about \(\beta_t\), and is the second equation
\[ K_t = P_{t|t-1} H^{\prime} f_{t|t-1}^{-1} \] The last equation is the updating of the covariance matrix of the unobserved components
\[ P_{t|t} = P_{t|t-1} - K_t H P_{t|t-1} \] The seven equations above make up the full Kalman filter routine. If \(Y_t\) is missing for any observation, then
\[ \beta_{t|t} = \beta_{t|t-1} \\ P_{t|t} = P_{t|t-1} \\ K_t = 0 \\ f_{t|t-1} = \infty \]
Once the Kalman filter is applied to the data, a smoothing procedure can be applied in the backward direction to make a better inference of the state components based on the entire data set. Unlike the filter, the smoother does look into the future to make an inference of the state components at time \(t\). This procedure consists of only two equations.
The first equation updates the prediction of the state components based on all the available information
\[ \beta_{t|T} = \beta_{t|t} + P_{t|t} F^{\prime} P_{t+1|t}^{-1} (\beta_{t+1|T} - \beta_{t+1|t}) \]
The second equation updates the covariance matrix of the state components based on all the available information
\[ P_{t|T} = P_{t|t} + P_{t|t} F^{\prime} P_{t+1|t}^{-1} (P_{t+1|T} - P_{t+1|t}) P_{t+1|t}^{-1 \prime} F P_{t|t}^{\prime} \]
The model above is estimated using MLE with box constraints. The log likelihood is given by
\[ ln(L(\theta)) = -\frac{1}{2} \sum_{t=1}^T \ln(|f_{t|t-1}|) - \frac{1}{2}\sum_{t=1}^T \eta_{t|t-1}^{\prime} f_{t|t-1}^{-1} \eta_{t|t-1} \]
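The filter recursions and the innovation-form log likelihood above can be sketched in a few lines (shown in Python for a univariate local-level special case with \(A = D = 0\), \(H = F = 1\), \(Q = \sigma_t^2\), \(R = \sigma_e^2\); this is an illustration, not the package's Rcpp implementation):

```python
import numpy as np

# Univariate Kalman filter for a local-level model:
# beta_t = beta_{t-1} + u_t,  Y_t = beta_t + e_t.
def kalman_loglik(y, q, r, b0=0.0, p0=1.0):
    b, p, ll = b0, p0, 0.0
    for obs in y:
        b_pred, p_pred = b, p + q                # prediction stage
        if np.isnan(obs):                        # missing: K_t = 0
            b, p = b_pred, p_pred
            continue
        eta = obs - b_pred                       # prediction error
        f_var = p_pred + r                       # its variance
        k = p_pred / f_var                       # Kalman gain
        b = b_pred + k * eta                     # updating stage
        p = p_pred - k * p_pred
        ll += -0.5 * (np.log(f_var) + eta ** 2 / f_var)
    return b, ll

rng = np.random.default_rng(1)
level = np.cumsum(rng.normal(0, 0.5, 200)) + 100  # true random-walk trend
y = level + rng.normal(0, 1.0, 200)               # noisy observations
b_T, ll = kalman_loglik(y, q=0.25, r=1.0, b0=y[0])
```

Note how a missing observation simply propagates the prediction, matching the \(K_t = 0\) case above; maximizing ll over (q, r) is the MLE step.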
Initial parameter values are essential for finding a good model when using maximum likelihood estimation. To this end, the algorithm uses initial parameter values derived from a decomposition using the mstl function, which acts like the model prior and is represented by
\[ Y_t^* = T_t^* + C_t^* + S_t^* + e_t^* \]
This function will decompose the time series into trend and seasonality. The seasonality is then further split into the seasonal harmonics by fitting a linear regression on the seasonal Fourier series and extracting the fitted components. Finally, the trend is further split into a smoother trend and the cycle by fitting a loess trend to the mstl trend using the loess function. The drift is then derived by taking the first difference of the trend.
These parameters are initially set to match the theoretical variance of the model depending on the trend type. If the trend is set to “random-walk”, then \(\sigma_t = sd(\Delta T_t^{*})\). If the trend is set to “double-random-walk”, then \(\sigma_t = \sigma_d = sd(\Delta T_t^{*})/\sqrt{2}\). If the trend is set to “random-walk-drift”, an AR(1) model is fit to the drift prior, \(\sigma_d\) is set to the standard deviation of the model errors, \(\phi_d\) is set to the estimated AR(1) parameter, and \(\sigma_t\) is set to \(\sqrt{var(\Delta T_t^*) - \frac{\sigma_d^2}{1 - \phi_d^2}}\), the theoretical value if the trend is I(1).
If the cycle model is set to the ARMA(p,q) specification or a cycle is not detected, \(p\) and \(q\) are determined using the forecast package’s auto.arima function on the prior cycle. The estimated AR and MA components are then used as the initial parameters. If the cycle model is set to the trigonometric specification, \(\lambda\) is set to \(\frac{2\pi}{c}\) where \(c\) is the estimated cycle period. \(\phi_c\) is set to the maximum eigenvalue of an ARIMA(2,0,1) process fit to the cycle prior, as this is related to the speed of convergence towards the trend in the same way \(\phi_c\) is in an AR(1) process. Finally, \(\sigma_c\) is set to the standard deviation of the model errors.
If seasonality is included, each \(\sigma_j\) is set to \(sd(\Delta S_t^*)/\sqrt{h}\), where \(h\) is the number of seasonal harmonics, so that \(var(\Delta S_t^*) = \sum_j \sigma_j^2\).
Lastly, \(\sigma_e\) is set to the standard deviation of the remainder of the model prior, \(sd(e_t^*)\).
The initial values of the unobserved components must also be initialized. These values are not optimized but set in advance to aid in speed. The only value that is optimized is a single value for the diagonal of the covariance matrix \(P_{0|0}\), which is initialized to the identity matrix. The initial values for the unobserved components are taken to be the first non-NA value for each series from the prior.
The box constraints act like priors and provide bounds on parameter estimates to ensure a reasonable model. The constraints include stationarity constraints for the trigonometric cycle, \(0 < \phi_c < 1\), autoregressive cycle \(0 < \sum \phi_{c,i} < 1\), moving average component \(0 < \sum \theta_{c,i} < 1\), drift \(-1 < \phi_d < 1\), as well as \(\sigma_x > 0\) for all variance parameters. Trend smoothness constraints
\[ \sigma_t + \sigma_d < \sigma_e \\ \sigma_t + \sigma_d < \sigma_c \\ \sigma_t + \sigma_d < \sum_j \sigma_j \]
are also included to ensure that the trend accounts for the least variability among all the components. These constraints can be relaxed by setting unconstrained = TRUE in stsm_estimate.
If an interpolation model is specified, the model is redefined based on the values set in interpolate and interpolate_method. For all interpolation models,
\[ I_t = T_{t-1} + S_{t-1} + C_{t-1} \] where \(I_t\) is the interpolated series, and a deterministic observation equation is used by setting \(\sigma_e = 0\). The only other piece that changes is the specification of the observation equation, which is determined by the interpolation method
where \(N\) is the period to interpolate over. For example, if the frequency of the data is quarterly and the user would like to interpolate the data to a monthly frequency, then \(N = 3\).
The interpolation frequency is set by specifying interpolate to “quarterly”, “monthly”, “weekly”, or “daily”, as long as that frequency is higher than the frequency of the data. However, it is recommended to interpolate only one frequency higher than the frequency of the data.
Interpolation methods are defined by setting interpolate_method to “eop” for end of period, “avg” for period average, or “sum” for period sum.
Interpolation errors will depend on the volatility of the time series as well as model misspecification. However, given the model, the end of period method will be more susceptible to errors due to volatility than either the average or sum methods.
Anomaly detection uses the estimated structural model and the recursive nature of the Kalman filter to detect observations that are outside a given threshold of the model’s predicted values for time \(t\) given the data and estimates of the unobserved components at time \(t-1\). The threshold is determined by the forecast error variance of the estimated structural model. Anomalies that are detected can then be replaced with the model’s predicted values or the threshold if the user wants to replace outliers.
It is important to note that anomalies are detected by comparison to the model, which is based on normally distributed innovations. Thus, the interpretation of an anomaly here is a value that has an “x” percent chance of occurring based on the data being generated from normal innovations, where the “x” percent is determined by the threshold. Normal distributions may not be appropriate for all time series, but this method of anomaly detection could be used to tell the user about the validity of the normality assumption: i.e. if more than 1% of the sample is detected as an anomaly with a 99% threshold, then normality may not be the best assumption for innovations for that particular series.
stsm_detect_anomalies will only plot if anomalies are detected and plot = TRUE.
The model makes recursive predictions for the j-th step ahead using
\[ \hat{Y_{t+j}} = H F^j \beta_{t} \] since \(A = 0\) and \(D = 0\) for this model. The j-step ahead forecast error variance is given by
\[ FEV = \left(\sum_{i=0}^{j-1} H F^i Q F^{i\prime} H^{\prime}\right) + R \] The error variance is based on the assumption of linearity and normality, as the Kalman filter is a linear Gaussian filter, so the confidence intervals should be understood to be a lower bound, as the decomposition is a simplification of an arguably more complex process. The uncertainty lost in that abstraction to linearity is not quantifiable, and uncertainty will also be lost in the simplification from a stochastic process with excess kurtosis or skewness to normality.
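The recursion can be sketched directly from these two formulas (a Python illustration with a one-dimensional stationary state; the matrices below are assumptions for exposition, not estimates from any model):

```python
import numpy as np

# j-step-ahead forecast and forecast error variance (FEV) for an
# illustrative state space model with A = D = 0, H = 1, F = 0.9.
F = np.array([[0.9]])
H = np.array([[1.0]])
Q = np.array([[0.5]])          # state error variance
R = np.array([[1.0]])          # observation error variance
beta_t = np.array([[2.0]])     # current filtered state

def forecast(j):
    # Y_hat_{t+j} = H F^j beta_t
    y_hat = H @ np.linalg.matrix_power(F, j) @ beta_t
    # FEV = sum_{i=0}^{j-1} H F^i Q F^i' H' + R
    fev = sum(H @ np.linalg.matrix_power(F, i) @ Q
              @ np.linalg.matrix_power(F, i).T @ H.T for i in range(j))
    return y_hat.item(), (fev + R).item()

# The FEV grows with the horizon and, because |F| < 1 here, converges
# to a finite long-run value as j increases.
```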
Structural breaks are detected using the breakpoints function from the strucchange package. It attempts to detect structural breaks in the trend, cycle, and seasonalities based on the estimated structural model. These tests can show how well the stochastic nature of the model picks up on the time-varying nature of the time series.
If the data has an upward (or downward) trend, the algorithm will look for breaks in trend growth. Otherwise, it will look for breaks in the level of the trend. If the growth (or level) is not constant across the entire time series, then a break should be detected.
Structural breaks in the seasonality and cycle are detected by looking for changes in the mean absolute value of the seasonal series (an analog for the amplitude). To check for breaks in the amplitude, a simple test on the mean absolute value is performed, as this is related to the amplitude via the relationship \(Avg(|x_t|) = \frac{2}{\pi}Amp\). If the amplitude is not stable across the entire series, then a break should be detected.
Seasonality is defined by a constant periodicity, but economic cycles are rarely constant in their periodicity. So, a second structural break test is used on the cycle to check for changes in the periodicity. To check for breaks in the periodicity, the cycle is modeled using the trigonometric structure described above (\(C_t = \phi_c cos(\lambda_t) C_{t-1} + \phi_c sin(\lambda_t) C_{t-1}^{*} = a C_{t-1} + b C_{t-1}^{*}\)) and looking for instabilities in the coefficients over the time sample. The periodicity is identified from the linear regression using the relationship \(period = \frac{2\pi}{tan^{-1}(\frac{b}{a})}\). If the coefficients of this model are not stable across the entire series, then a break should be detected.
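Both identities used by these break tests can be verified numerically (a Python sketch under the assumption of a pure sinusoid):

```python
import numpy as np

# Check: for a sinusoid, Avg(|x_t|) = (2/pi) * Amp, and the period is
# recoverable from the trig coefficients a = phi*cos(lambda),
# b = phi*sin(lambda) via period = 2*pi / arctan(b / a).
amp, period = 3.0, 36.0
lam = 2 * np.pi / period
t = np.linspace(0, period, 100000, endpoint=False)  # one whole cycle
x = amp * np.sin(lam * t)

mean_abs = np.abs(x).mean()    # should be close to (2/pi) * amp

phi = 0.95
a, b = phi * np.cos(lam), phi * np.sin(lam)
recovered_period = 2 * np.pi / np.arctan(b / a)     # should be ~36
```

Note that \(\phi_c\) cancels in the ratio \(b/a\), so the recovered period does not depend on the damping.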
This procedure can be time consuming, even with high performance parallel computing, so if the user is only interested in structural breaks for a particular component, the user can specify the components argument of stsm_detect_breaks to be any combination of “trend”, “cycle”, and “seasonal”. stsm_detect_breaks will only plot the breaks that are detected if plot = TRUE.
Below is an example of how to use this package:
library(ggplot2)
library(gridExtra)
library(autostsm)
library(lubridate)
library(data.table)
set.seed(1024)
#Daily data
freq = 365.25
#Build the trend and drift
t = c()
m = c()
t[1] = 100
m[1] = 1
sig_e = 0.1
sig_t = 1
sig_m = 0.1
sig_s = 0.01
for(i in 2:3000){
  m[i] = 0.05 + 0.75*m[i-1] + rnorm(1, 0, sig_m)
  t[i] = t[i-1] + m[i-1] + rnorm(1, 0, sig_t)
}
#Build the seasonality including yearly and weekly
s365 = sin(2*pi/freq*(1:length(t))) + rnorm(length(t), 0, sig_s)
s365 = (s365 - min(s365))/diff(range(s365))*(1.125 - 0.865) + 0.865
s7 = sin(2*pi/7*(1:length(t))) + rnorm(length(t), 0, sig_s)
s7 = (s7 - min(s7))/diff(range(s7))*(1.125 - 0.865) + 0.865
s = s365 + s7
s = (s - min(s))/diff(range(s))*(1.25 - 0.75) + 0.75
#Build the cyclicality using every 3 years periodicity
c = sin(2*pi*0.33/freq*(1:length(t))) + rnorm(length(t), 0, sig_s)
c = (c - min(c))/diff(range(c))*(1.25 - 0.75) + 0.75
#Build the data using a multiplicative model
ts = data.table(date = as.Date("2016-01-01") + 1:length(t),
                y = t*c*s*exp(rnorm(length(t), 0, sig_e)),
                trend = t, seasonal = s, seasonal7 = s7,
                seasonal365 = s365, cycle = c)
#Create some missing values
ts[sample(1:nrow(ts), round(0.05*nrow(ts))), "y" := NA]
#View the data
g1 = ggplot(melt(ts, id.vars = "date", measure.vars = c("y", "trend"))) +
labs(title = "Observed vs Trend") +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")
g2 = ggplot(melt(ts, id.vars = "date", measure.vars = c("cycle"))) +
labs(title = "Cycle") +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")
g3 = ggplot(melt(ts, id.vars = "date", measure.vars = colnames(ts)[grepl("seasonal", colnames(ts))])) +
labs(title = "Seasonal") +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")
grid.arrange(g1, g2, g3, layout_matrix = rbind(c(1, 1), c(2, 3)))
#Estimate the model
stsm = stsm_estimate(ts[, c("date", "y"), with = FALSE], verbose = TRUE)
#Forecast and plot the results
stsm_fc = stsm_forecast(stsm, y = ts[, c("date", "y"), with = FALSE], n.ahead = floor(stsm$freq)*3, plot = TRUE)
#Detect anomalies
stsm_fc = merge(stsm_fc,
stsm_detect_anomalies(stsm, y = ts[, c("date", "y"), with = FALSE], plot = TRUE),
by = "date", all = TRUE)
#Detect structural breaks
stsm_fc = merge(stsm_fc,
stsm_detect_breaks(stsm, y = ts[, c("date", "y"), with = FALSE], plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
#Define the exogenous data
exo_obs = data.table(date = ts$date, Xo = rnorm(nrow(ts)))
exo_state = data.table(date = ts$date, Xs_trend = rnorm(nrow(ts)),
Xs_seasonal = rep(0, nrow(ts)))
exo_state[seq(1, .N, by = 7), "Xs_seasonal" := 1]
#Estimate the model
stsm = stsm_estimate(ts[, c("date", "y"), with = FALSE],
exo_obs = exo_obs, exo_state = exo_state,
state_eqns = c("trend ~ Xs_trend - 1", "seasonal ~ Xs_seasonal - 1"),
verbose = TRUE)
#Check the exogenous parameter estimates
stsm_ssm(model = stsm)[c("betaO", "betaS")]
#Filter and plot the results
stsm_fc = stsm_filter(stsm, y = ts[, c("date", "y"), with = FALSE],
exo_obs = exo_obs, exo_state = exo_state, plot = TRUE)
#Detect anomalies
stsm_fc = merge(stsm_fc,
stsm_detect_anomalies(stsm, y = ts[, c("date", "y"), with = FALSE],
exo_obs = exo_obs, exo_state = exo_state, plot = TRUE),
by = "date", all = TRUE)
#Detect structural breaks
stsm_fc = merge(stsm_fc,
stsm_detect_breaks(stsm, y = ts[, c("date", "y"), with = FALSE],
exo_obs = exo_obs, exo_state = exo_state,
plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
##### Unemployment rate examples #####
#Not seasonally adjusted
data("UNRATENSA", package = "autostsm") #From FRED
UNRATENSA = data.table(UNRATENSA, keep.rownames = TRUE)
colnames(UNRATENSA) = c("date", "y")
UNRATENSA[, "date" := as.Date(date)]
UNRATENSA[, "y" := as.numeric(y)]
stsm = stsm_estimate(UNRATENSA, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = UNRATENSA, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_anomalies(stsm, y = UNRATENSA, plot = TRUE),
by = "date", all = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_breaks(stsm, y = UNRATENSA, plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
#Seasonally adjusted
data("UNRATE", package = "autostsm") #From FRED
UNRATE = data.table(UNRATE, keep.rownames = TRUE)
colnames(UNRATE) = c("date", "y")
UNRATE[, "date" := as.Date(date)]
UNRATE[, "y" := as.numeric(y)]
stsm = stsm_estimate(UNRATE, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = UNRATE, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_anomalies(stsm, y = UNRATE, plot = TRUE),
by = "date", all = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_breaks(stsm, y = UNRATE, plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
##### GDP examples #####
#Not seasonally adjusted
data("NA000334Q", package = "autostsm") #From FRED
NA000334Q = data.table(NA000334Q, keep.rownames = TRUE)
colnames(NA000334Q) = c("date", "y")
NA000334Q[, "date" := as.Date(date)]
NA000334Q[, "y" := as.numeric(y)]
NA000334Q = NA000334Q[date >= "1990-01-01", ]
stsm = stsm_estimate(NA000334Q, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = NA000334Q, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_anomalies(stsm, y = NA000334Q, plot = TRUE),
by = "date", all = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_breaks(stsm, y = NA000334Q, plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
#Seasonally adjusted
data("GDP", package = "autostsm") #From FRED
GDP = data.table(GDP, keep.rownames = TRUE)
colnames(GDP) = c("date", "y")
GDP[, "date" := as.Date(date)]
GDP[, "y" := as.numeric(y)]
GDP = GDP[date >= "1990-01-01", ]
stsm = stsm_estimate(GDP, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = GDP, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_anomalies(stsm, y = GDP, plot = TRUE),
by = "date", all = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_breaks(stsm, y = GDP, plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
##### S&P 500 example #####
data("SP500", package = "autostsm") #From FRED
SP500 = data.table(SP500, keep.rownames = TRUE)
colnames(SP500) = c("date", "y")
SP500[, "date" := as.Date(date)]
SP500[, "y" := as.numeric(y)]
stsm = stsm_estimate(SP500, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = SP500, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_anomalies(stsm, y = SP500, plot = TRUE),
by = "date", all = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_breaks(stsm, y = SP500, plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
##### 5 Year Treasury Yield example #####
data("DGS5", package = "autostsm") #From FRED
DGS5 = data.table(DGS5, keep.rownames = TRUE)
colnames(DGS5) = c("date", "y")
DGS5[, "date" := as.Date(date)]
DGS5[, "y" := as.numeric(y)]
stsm = stsm_estimate(DGS5, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = DGS5, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_anomalies(stsm, y = DGS5, plot = TRUE),
by = "date", all = TRUE)
stsm_fc = merge(stsm_fc,
stsm_detect_breaks(stsm, y = DGS5, plot = TRUE, show_progress = TRUE),
by = "date", all = TRUE)
library(ggplot2)
library(autostsm)
library(lubridate)
library(data.table)
#Rebuild the data
set.seed(1024)
#Daily data
freq = 365.25
#Build the trend and drift
t = c()
m = c()
t[1] = 100
m[1] = 1
sig_e = 0.1
sig_t = 1
sig_m = 0.1
sig_s = 0.01
for(i in 2:3000){
m[i] = 0.05 + 0.75*m[i-1] + rnorm(1, 0, sig_m)
t[i] = t[i-1] + m[i-1] + rnorm(1, 0, sig_t)
}
#Build the seasonality including yearly and weekly
s365 = sin(2*pi/freq*(1:length(t))) + rnorm(length(t), 0, sig_s)
s365 = (s365 - min(s365))/diff(range(s365))*(1.125 - 0.865) + 0.865
s7 = sin(2*pi/7*(1:length(t))) + rnorm(length(t), 0, sig_s)
s7 = (s7 - min(s7))/diff(range(s7))*(1.125 - 0.865) + 0.865
s = s365 + s7
s = (s - min(s))/diff(range(s))*(1.25 - 0.75) + 0.75
#Build the cyclicality with an approximately 3-year periodicity
c = sin(2*pi*0.33/freq*(1:length(t))) + rnorm(length(t), 0, sig_s)
c = (c - min(c))/diff(range(c))*(1.25 - 0.75) + 0.75
#Build the data using a multiplicative model
ts = data.table(date = as.Date("2016-01-01") + 1:length(t),
y = t*c*s*exp(rnorm(length(t), 0, sig_e)))
#Convert the data to monthly by using the end of period
ts[, "mnth" := floor_date(date, "month")]
ts[, "rn" := .N:1, by = "mnth"]
ts = ts[rn == 1, c("mnth", "y"), with = FALSE]
colnames(ts)[colnames(ts) == "mnth"] = "date"
#Get the quarter and set the date to the last month of the quarter
ts[, "qtr" := ceiling_date(date, "quarter") %m-% months(1)]
ts[, "y_avg" := frollmean(y, n = 3, align = "right")]
ts[, "y_sum" := frollsum(y, n = 3, align = "right")]
#We already know the data has a multiplicative structure with yearly seasonality,
#and the frequency will be detected as quarterly, so we set seasons = 4 and
#multiplicative = TRUE to help the model with identification: the data set shrinks
#from 3000 daily observations to 99 monthly observations and then to
#33 quarterly observations
#End of period
ts[, "rn" := .N:1, by = "qtr"]
ts.eop = ts[rn == 1, ][, "rn" := NULL]
stsm = stsm_estimate(y = ts.eop[, c("qtr", "y"), with = FALSE], seasons = 4, multiplicative = TRUE,
interpolate = "monthly", interpolate_method = "eop", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = ts.eop[, c("qtr", "y"), with = FALSE], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")],
ts[, c("date", "y")], by = "date", all.x = TRUE, all.y = TRUE)
eop.err = cbind(method = "eop", stsm_fc[is.na(observed), .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")
#Period average
ts.avg = ts[, .(y = mean(y, na.rm = TRUE)), by = "qtr"]
stsm = stsm_estimate(y = ts.avg[, c("qtr", "y"), with = FALSE], seasons = 4, multiplicative = TRUE,
interpolate = "monthly", interpolate_method = "avg", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = ts.avg[, c("qtr", "y"), with = FALSE], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")],
ts[, c("date", "y", "y_avg")], by = "date", all.x = TRUE, all.y = TRUE)
if(stsm$multiplicative == TRUE){
stsm_fc[, "interpolated_rollup" := exp(frollmean(log(interpolated), n = 3, align = "right"))]
}else{
stsm_fc[, "interpolated_rollup" := frollmean(interpolated, n = 3, align = "right")]
}
avg.err = cbind(method = "avg",
stsm_fc[, .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))],
stsm_fc[is.na(observed), .(mape_rollup = mean(abs(y_avg - interpolated_rollup)/(1 + y_avg)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")
#Period sum
ts.sum = ts[, .(y = sum(y, na.rm = TRUE)), by = "qtr"]
stsm = stsm_estimate(y = ts.sum[, c("qtr", "y"), with = FALSE], seasons = 4, multiplicative = FALSE,
interpolate = "monthly", interpolate_method = "sum", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = ts.sum[, c("qtr", "y"), with = FALSE], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")],
ts[, c("date", "y", "y_sum")], by = "date", all.x = TRUE, all.y = TRUE)
if(stsm$multiplicative == TRUE){
stsm_fc[, "interpolated_rollup" := exp(frollsum(log(interpolated), n = 3, align = "right"))]
}else{
stsm_fc[, "interpolated_rollup" := frollsum(interpolated, n = 3, align = "right")]
}
sum.err = cbind(method = "sum",
stsm_fc[, .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))],
stsm_fc[is.na(observed), .(mape_rollup = mean(abs(y_sum - interpolated_rollup)/(1 + y_sum)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")
#Errors depend on the volatility of the series:
#the end of period method is more susceptible to the volatility of the series
#than either the average or sum methods
rbind(eop.err, avg.err, sum.err, use.names = TRUE, fill = TRUE)
library(ggplot2)
library(autostsm)
library(lubridate)
library(data.table)
##### Unemployment rate examples #####
#Not seasonally adjusted
data("UNRATENSA", package = "autostsm") #From FRED
UNRATENSA = data.table(UNRATENSA, keep.rownames = TRUE)
colnames(UNRATENSA) = c("date", "y")
UNRATENSA[, "date" := as.Date(date)]
UNRATENSA[, "y" := as.numeric(y)]
UNRATENSA[, "qtr" := ceiling_date(date, "quarter") %m-% months(1)]
UNRATENSA[, "y_avg" := frollmean(y, n = 3, align = "right")]
#Cycle taken from previous example
stsm = stsm_estimate(y = UNRATENSA[, .(y = mean(y, na.rm = TRUE)), by = "qtr"],
seasons = 4, cycle = 156/3,
interpolate = "monthly", interpolate_method = "avg", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = UNRATENSA[, .(y = mean(y, na.rm = TRUE)), by = "qtr"], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")],
UNRATENSA[, c("date", "y", "y_avg")], by = "date", all.x = TRUE, all.y = TRUE)
if(stsm$multiplicative == TRUE){
stsm_fc[, "interpolated_rollup" := exp(frollmean(log(interpolated), n = 3, align = "right"))]
}else{
stsm_fc[, "interpolated_rollup" := frollmean(interpolated, n = 3, align = "right")]
}
cbind(stsm_fc[, .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))],
stsm_fc[is.na(observed), .(mape_rollup = mean(abs(y_avg - interpolated_rollup)/(1 + y_avg)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")
#Seasonally adjusted
data("UNRATE", package = "autostsm") #From FRED
UNRATE = data.table(UNRATE, keep.rownames = TRUE)
colnames(UNRATE) = c("date", "y")
UNRATE[, "date" := as.Date(date)]
UNRATE[, "y" := as.numeric(y)]
UNRATE[, "qtr" := ceiling_date(date, "quarter") %m-% months(1)]
UNRATE[, "y_avg" := frollmean(y, n = 3, align = "right")]
#Cycle taken from previous example
stsm = stsm_estimate(y = UNRATE[, .(y = mean(y, na.rm = TRUE)), by = "qtr"],
seasons = FALSE, cycle = 156/3,
interpolate = "monthly", interpolate_method = "avg", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = UNRATE[, .(y = mean(y, na.rm = TRUE)), by = "qtr"], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")],
UNRATE[, c("date", "y", "y_avg")], by = "date", all.x = TRUE, all.y = TRUE)
if(stsm$multiplicative == TRUE){
stsm_fc[, "interpolated_rollup" := exp(frollmean(log(interpolated), n = 3, align = "right"))]
}else{
stsm_fc[, "interpolated_rollup" := frollmean(interpolated, n = 3, align = "right")]
}
cbind(stsm_fc[, .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))],
stsm_fc[is.na(observed), .(mape_rollup = mean(abs(y_avg - interpolated_rollup)/(1 + y_avg)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")
#5 Year Treasury Yield
data("DGS5", package = "autostsm") #From FRED
DGS5 = data.table(DGS5, keep.rownames = TRUE)
colnames(DGS5) = c("date", "y")
DGS5[, "date" := as.Date(date)]
DGS5[, "y" := as.numeric(y)]
DGS5[, "qtr" := ceiling_date(date, "quarter") %m-% months(1)]
DGS5[, "y_avg" := frollmean(y, n = 3, align = "right")]
#Cycle taken from previous example
stsm = stsm_estimate(y = DGS5[, .(y = mean(y, na.rm = TRUE)), by = "qtr"],
seasons = FALSE, cycle = 112/3,
interpolate = "monthly", interpolate_method = "avg", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = DGS5[, .(y = mean(y, na.rm = TRUE)), by = "qtr"], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")],
DGS5[, c("date", "y", "y_avg")], by = "date", all.x = TRUE, all.y = TRUE)
if(stsm$multiplicative == TRUE){
stsm_fc[, "interpolated_rollup" := exp(frollmean(log(interpolated), n = 3, align = "right"))]
}else{
stsm_fc[, "interpolated_rollup" := frollmean(interpolated, n = 3, align = "right")]
}
cbind(stsm_fc[, .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))],
stsm_fc[is.na(observed), .(mape_rollup = mean(abs(y_avg - interpolated_rollup)/(1 + y_avg)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) +
geom_line(aes(x = date, y = value, group = variable, color = variable)) +
scale_color_viridis_d() +
theme_minimal() + guides(color = guide_legend(title = NULL)) +
theme(legend.position = "bottom")