The VDPO R package is designed to extend statistical methods for analyzing variable domain functional data. In traditional functional data analysis, observations are usually defined over a common and fixed domain, such as time or spatial coordinates. However, in some applications, the domain over which the data are defined may vary between observations. This type of data is referred to as variable domain functional data.
The methodologies implemented in the VDPO package can be applied to a wide range of fields like:
The package is built upon the theoretical developments presented in recent research papers that rigorously explore the mathematical underpinnings and practical implications of the methodologies. More information can be found in:
The VDPO package includes a data generation function, data_generator_vd(), that allows users to simulate variable domain functional data for testing and evaluation purposes. This section explains how to use this function and the various scenarios it can generate.
data_generator_vd(
  N = 100,              # Number of subjects
  J = 100,              # Maximum observations per subject
  nsims = 1,            # Number of simulations
  Rsq = 0.95,           # Controls the signal-to-noise ratio
  aligned = TRUE,       # If TRUE, generates aligned data
  multivariate = FALSE, # If TRUE, generates data with 2 variables
  beta_index = 1,       # Index for the beta function (1 or 2)
  use_x = FALSE,        # If TRUE, adds a non-functional covariate
  use_f = FALSE         # If TRUE, adds a non-linear effect
)
- N: Number of subjects (default: 100)
- J: Maximum number of observations per subject (default: 100)
- nsims: Number of simulation iterations (default: 1)
- Rsq: Controls the signal-to-noise ratio (default: 0.95)
The function can generate two types of domains:
- Aligned domains (aligned = TRUE):
- Non-aligned domains (aligned = FALSE):
In both cases,
For each subject, the function generates:
- If multivariate = TRUE, additional variables Y_s and Y_se are generated.
The mathematical expression for generating the variable domain functional data is the following:
\[X_i(t) = u_i + \sum_{k=1}^{10} \left(v_{ik1} \cdot \sin\left(\frac{2\pi k}{100}t\right) + v_{ik2} \cdot \cos\left(\frac{2\pi k}{100}t\right)\right) + \delta_i(t)\]
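The expression above can be sketched in base R as follows. This is only an illustration, not the package's implementation: the helper simulate_curve(), its arguments, and the normal distributions used for \(u_i\), \(v_{ik1}\), \(v_{ik2}\) and \(\delta_i(t)\) are assumptions made here, since the text does not state how these terms are drawn. The sketch also assumes that the noise-free curve omits \(\delta_i(t)\) and the noisy curve includes it.

```r
# Sketch: simulate one curve X_i(t) on a domain t = 1, ..., T_i.
# ASSUMPTION: all random terms are normal with illustrative variances;
# the distributions used by data_generator_vd() may differ.
simulate_curve <- function(T_i, K = 10, noise_sd = 0.1) {
  t_grid <- seq_len(T_i)
  u_i <- rnorm(1)                          # subject-specific level u_i
  v1 <- rnorm(K, sd = 1 / seq_len(K))      # sine coefficients v_ik1
  v2 <- rnorm(K, sd = 1 / seq_len(K))      # cosine coefficients v_ik2
  # T_i x K matrix of angles (2 * pi * k / 100) * t
  angles <- outer(t_grid, seq_len(K)) * 2 * pi / 100
  smooth <- u_i + as.vector(sin(angles) %*% v1 + cos(angles) %*% v2)
  noisy <- smooth + rnorm(T_i, sd = noise_sd)  # adds delta_i(t)
  list(t = t_grid, smooth = smooth, noisy = noisy)
}

set.seed(1)
curve_i <- simulate_curve(T_i = 60)
length(curve_i$smooth)  # one value per point of the subject's domain
```

Calling the helper with a different T_i for each subject gives curves of different lengths, which is the defining feature of variable domain functional data.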
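The response variable y is generated from the functional covariate. Setting use_f = TRUE adds a non-linear effect, and setting use_x = TRUE adds a non-functional covariate.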
The mathematical expression for generating the response variable is the following:
\[\eta_i = \frac{1}{T_i}\sum_{t=1}^{T_i} X_i(t)\,\beta(t, T_i), \quad t = 1, \ldots, T_i \leq J\]
Here \(T_i\) is the domain length of the \(i\)-th subject, that is, the number of points at which \(X_i\) is observed.
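As a rough illustration of this formula, the sketch below computes \(\eta_i\) for every row of an NA-padded matrix. The helper linear_predictor() and the constant coefficient function const_beta() are hypothetical names introduced here; they are not part of VDPO, and the coefficient surfaces selected by beta_index are not reproduced. The sketch assumes each row stores its observed values first, followed by NA padding.

```r
# Sketch: eta_i = (1 / T_i) * sum_t X_i(t) * beta(t, T_i) for each subject.
# ASSUMPTION: row i of X holds X_i(1), ..., X_i(T_i) followed by NAs.
linear_predictor <- function(X, beta_fun) {
  apply(X, 1, function(x_row) {
    T_i <- sum(!is.na(x_row))        # domain length of this subject
    t_grid <- seq_len(T_i)
    mean(x_row[t_grid] * beta_fun(t_grid, T_i))
  })
}

# Toy data: two subjects with domain lengths 3 and 2
X <- rbind(
  c(1, 2, 3, NA),
  c(2, 2, NA, NA)
)

# Hypothetical coefficient function: beta(t, T) = 1 everywhere,
# so eta_i reduces to the mean of the observed values
const_beta <- function(t, T_i) rep(1, length(t))

eta <- linear_predictor(X, const_beta)
eta
#> [1] 2 2
```

Because the sum is divided by \(T_i\), subjects with long and short domains contribute on a comparable scale.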
# Generate basic simulation data
sim_data <- data_generator_vd()
# Generate more complex data
complex_sim <- data_generator_vd(
  N = 200,
  J = 150,
  aligned = FALSE,
  multivariate = TRUE,
  use_x = TRUE,
  use_f = TRUE
)
# Access generated components
head(sim_data$y) # Response variable
#> [1] 0.22328994 -0.58169677 0.05271251 0.43631241 -0.41264759 0.53602756
dim(sim_data$X_s) # Dimensions of functional covariate
#> [1] 100 99
head(sim_data$x1) # Non-functional covariate (if use_x = TRUE)
#> [1] 0.6001512 0.1239913 -0.5213500 -1.4971886 1.3049948 -1.1349646
The function returns a list containing:
- y: Response variable
- X_s: Noise-free functional covariate
- X_se: Noisy functional covariate
- Y_s, Y_se: Additional functional variables (if multivariate = TRUE)
- x1: Non-functional covariate
- x2: Vector of length N containing the observed values of the smooth term
- smooth_term: Vector of length N containing a smooth term
- Beta: Array containing the true functional coefficients (determined by beta_index)
This data generation function allows users to create various scenarios for testing and evaluating variable domain functional regression models implemented in the VDPO package.
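When working with the returned functional covariates, it is often useful to recover each subject's domain length. The sketch below uses a small hand-made matrix as a stand-in for X_s. It assumes, as the plotting code in this article suggests, that shorter curves are padded with trailing NA values. The object names are illustrative only.

```r
# Toy stand-in for sim_data$X_s: 3 subjects, at most 5 time points.
# ASSUMPTION: unobserved positions at the end of a row are stored as NA.
X_s <- rbind(
  c(0.1, 0.4, 0.3, NA, NA),
  c(0.2, 0.1, 0.5, 0.7, 0.6),
  c(0.9, 0.8, NA, NA, NA)
)

# Domain length of each subject = number of non-missing values in its row
domain_length <- rowSums(!is.na(X_s))
domain_length
#> [1] 3 5 2

# Extract the observed part of each curve as a list of numeric vectors
observed <- lapply(seq_len(nrow(X_s)), function(i) {
  X_s[i, seq_len(domain_length[i])]
})
lengths(observed)
#> [1] 3 5 2
```

The same two lines should apply to the simulated output by replacing the toy matrix with sim_data$X_s, provided the padding assumption holds.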
To better understand the structure of the simulated data, let’s create some visualizations. We’ll look at both multiple functional curves and compare an original curve with its noisy version.
First, let’s visualize multiple functional curves generated by our simulation:
library(ggplot2)
library(tidyr)
library(dplyr)
# Generate sample data
set.seed(42)
sim_data <- data_generator_vd(N = 100, J = 100)
# Select specific rows for plotting
selected_rows <- c(20, 30, 60, 80)
# Prepare data for plotting - Multiple curves
plot_data_multiple <- data.frame(
  time = rep(seq_len(ncol(sim_data$X_s)), length(selected_rows)),
  value = as.vector(t(sim_data$X_s[selected_rows, ])),
  curve = factor(rep(paste("Subject", selected_rows), each = ncol(sim_data$X_s)))
)
# Keep each curve only up to its first NA, so every line stays contiguous
plot_data_multiple <- plot_data_multiple %>%
  group_by(curve) %>%
  mutate(is_na = is.na(value)) %>%
  filter(cumsum(is_na) == 0) %>%
  select(-is_na) %>%
  ungroup()
# Colorblind-friendly palette (only the first four colors are used)
colors <- c("#0072B2", "#D55E00", "#CC79A7", "#009E73", "#E69F00")
p1 <- ggplot(plot_data_multiple, aes(x = time, y = value, color = curve)) +
  geom_line(linewidth = 1) +
  theme_minimal(base_size = 12, base_family = "sans") +
  scale_color_manual(values = colors) +
  labs(
    title = "Variable Domain Functional Data",
    subtitle = "Selected subjects showing different domain lengths",
    x = "Time",
    y = "Value",
    color = "Subject ID"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12, color = "gray40"),
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray90"),
    panel.border = element_rect(color = "gray90", fill = NA),
    axis.title = element_text(face = "bold")
  )
p1
This plot shows four different functional curves generated by our simulation. Notice how each curve has a different domain length and pattern, reflecting the variable domain nature of our data.
Next, let’s compare an original functional curve with its noisy version:
# Plot single curve with noise
selected_curve <- 50
plot_data_single <- data.frame(
  time = rep(1:ncol(sim_data$X_s), 2),
  value = c(sim_data$X_s[selected_curve, ], sim_data$X_se[selected_curve, ]),
  type = factor(rep(c("Original", "Noisy"), each = ncol(sim_data$X_s)))
) %>%
  filter(!is.na(value))
ggplot(plot_data_single, aes(x = time, y = value, color = type)) +
  geom_line(linewidth = 1) +
  theme_minimal() +
  scale_color_manual(values = c("Original" = "#1f77b4", "Noisy" = "#ff7f0e")) +
  labs(
    title = "Original vs Noisy Functional Curve",
    x = "Time",
    y = "Value",
    color = "Type"
  )
This visualization shows how the added noise affects a single functional curve. The blue line represents the original functional data (X_s), while the orange line shows the same curve with added noise (X_se). The noise level is proportional to the variance of the original curve, ensuring consistent relative noise levels across different curves.
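One way to obtain noise that scales with the variance of each curve is sketched below. The helper add_relative_noise() and its noise_ratio argument are hypothetical and are not taken from the package; the sketch only illustrates the idea of setting the noise variance to a fixed fraction of the curve's own variance while leaving NA padding untouched.

```r
# Sketch: add Gaussian noise whose variance is a fixed fraction of the
# variance of the observed curve.
# ASSUMPTION: noise_ratio = 0.1 is an arbitrary illustrative value.
add_relative_noise <- function(x, noise_ratio = 0.1) {
  obs <- !is.na(x)                              # observed positions only
  noise_sd <- sqrt(noise_ratio * var(x[obs]))   # sd scaled to the curve
  x[obs] <- x[obs] + rnorm(sum(obs), sd = noise_sd)
  x
}

set.seed(42)
# Toy curve: 50 observed points followed by 10 padded NAs
x_clean <- c(sin(seq(0, 2 * pi, length.out = 50)), rep(NA, 10))
x_noisy <- add_relative_noise(x_clean)

sum(is.na(x_noisy))  # padding is preserved
#> [1] 10
```

With this scaling, a curve with large swings receives proportionally larger noise than a nearly flat one, so the relative noise level stays comparable across subjects.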
These visualizations help us understand the structure and characteristics of the simulated data, including the variable domain lengths and the impact of added noise.