| Type: | Package |
| Title: | Variable Selection using the Pivotal Information Criterion |
| Version: | 0.1.2 |
| Date: | 2026-05-30 |
| Description: | Sparse regression and classification via the Pivotal Information Criterion (PIC), an alternative to the Bayesian Information Criterion (BIC), cross-validation, and Lasso-based tuning. The regularisation parameter is selected from a pivotal null-distribution statistic, eliminating the need for cross-validation and yielding sharper support recovery. Provides Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) optimisation for the L1, Smoothly Clipped Absolute Deviation (SCAD), and Minimax Concave Penalty (MCP) penalties across six response distributions: Gaussian, binomial, Poisson, exponential, Gumbel, and Cox. Under standard sparsity assumptions, the selector achieves a phase transition for exact support recovery, analogous to results in compressed sensing. See Sardy, van Cutsem and van de Geer (2026) <doi:10.48550/arXiv.2603.04172>. |
| License: | GPL-2 |
| URL: | https://github.com/VcMaxouuu/picreg |
| BugReports: | https://github.com/VcMaxouuu/picreg/issues |
| Encoding: | UTF-8 |
| LazyData: | true |
| Depends: | R (≥ 3.6.0) |
| Imports: | stats, graphics, grDevices, parallel, future, future.apply, Rcpp (≥ 1.0.10) |
| LinkingTo: | Rcpp, RcppArmadillo |
| Suggests: | testthat (≥ 3.0.0), knitr, rmarkdown, xfun, glmnet |
| VignetteBuilder: | knitr |
| SystemRequirements: | C++17 |
| RoxygenNote: | 8.0.0 |
| Config/roxygen2/version: | 8.0.0 |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | yes |
| Packaged: | 2026-05-30 10:45:46 UTC; maxvancutsem |
| Author: | Maxime van Cutsem [aut, cre], Sylvain Sardy [aut] |
| Maintainer: | Maxime van Cutsem <maxime.vancutsem@unige.ch> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-03 13:40:02 UTC |
picreg: Variable Selection using the Pivotal Information Criterion
Description
Sparse regression and classification via the Pivotal Information Criterion (PIC), an alternative to BIC, cross-validation, and Lasso-based tuning. The regularisation parameter is selected from a pivotal null-distribution statistic, eliminating the need for cross-validation and yielding sharper support recovery. Provides FISTA optimisation for the L1, SCAD, and MCP penalties across six response distributions: Gaussian, binomial, Poisson, exponential, Gumbel, and Cox. Under standard sparsity assumptions, the selector achieves a phase transition for exact support recovery, analogous to results in compressed sensing. See doi:10.48550/arXiv.2603.04172.
Author(s)
Maintainer: Maxime van Cutsem maxime.vancutsem@unige.ch
Authors:
Maxime van Cutsem maxime.vancutsem@unige.ch
Sylvain Sardy sylvain.sardy@unige.ch
See Also
Useful links:
Small binary-classification dataset for the Binomial section of the vignette.
Description
A synthetic logistic-regression dataset used to illustrate pic() with
the Binomial family. It contains n = 300 observations of
p = 50 predictors, of which s = 5 carry signal; the
remaining 45 are noise.
Usage
BinomialExample
Format
A list with two components:
XNumeric matrix of dimension
300 \times 50with column namesgene_*andnoise_*.yInteger vector of length
300containing0/1class labels.
Details
Column names follow the same convention as
QuickStartExample: active variables are labelled
gene_1, ..., gene_5 and inactive ones noise_1, ...,
noise_45, interleaved in random order.
The binary response is generated as
Y_i \sim \mathrm{Bernoulli}\!\left(\frac{1}{1 + e^{-X_i\beta}}\right),
with non-zero coefficients drawn uniformly in [1.5,\, 3] with
random sign. The class balance is roughly 45\% of positives.
Examples
data(BinomialExample)
fit <- pic(BinomialExample$X, BinomialExample$y, family = "binomial")
fit$selected
Small survival dataset for the Cox section of the vignette.
Description
A synthetic survival dataset used to illustrate pic() with the Cox
family. It contains n = 250 subjects observed on p = 50
covariates, of which s = 5 carry signal; the remaining 45 are
noise.
Usage
CoxExample
Format
A list with two components:
XNumeric matrix of dimension
250 \times 50with column namesgene_*andnoise_*.yNumeric matrix of dimension
250 \times 2with columnstimeandevent.
Details
Column names follow the same convention as
QuickStartExample: active variables are labelled
gene_1, ..., gene_5 and inactive ones noise_1, ...,
noise_45, interleaved in random order.
Event times are drawn from an exponential proportional-hazards model
T_i \sim \mathrm{Exp}\!\bigl(e^{X_i\beta}\bigr),
and independent censoring times from C_i \sim
\mathrm{Exp}(0.5). The observed response is the standard two-column
(\min(T_i, C_i),\, \mathbf{1}\{T_i \le C_i\}). The censoring
rate is roughly 40\%.
Examples
data(CoxExample)
fit <- pic(CoxExample$X, CoxExample$y, family = "cox")
fit$selected
Small Gaussian dataset for the introductory vignette.
Description
A synthetic Gaussian regression dataset used to illustrate pic()
throughout the introductory vignette. It contains n = 100
observations of p = 30 predictors, of which only s = 5
carry signal; the remaining 25 are pure noise.
Usage
QuickStartExample
Format
A list with two components:
XNumeric matrix of dimension
100 \times 30with column namesgene_*andnoise_*.yNumeric vector of length
100.
Details
Column names are chosen to make the underlying support obvious at a glance:
-
gene_1, ..., gene_5: the five active variables, whose non-zero coefficients are drawn uniformly in[0.5,\, 1.5]with random sign. -
noise_1, ..., noise_25: the remaining inactive variables, with true coefficient0.
The columns are interleaved in random order; column names are the only indicator of which features are part of the true support.
The response is generated as y = X\beta + \varepsilon with
\varepsilon \sim \mathcal{N}(0, 1).
Examples
data(QuickStartExample)
fit <- pic(QuickStartExample$X, QuickStartExample$y)
fit$selected
Assess performance of a pic fit.
Description
Given a test set, reports a small set of family-appropriate predictive metrics. Optionally appends support-recovery diagnostics when the true active set is known.
Usage
assess(object, ...)
## S3 method for class 'pic'
assess(object, newx, newy, true_features = NULL, ...)
Arguments
object |
A fitted |
... |
Unused; present for S3 method consistency. |
newx |
Numeric design matrix at which predictions are evaluated, with the same columns as the training data. |
newy |
Response on the new observations. Numeric vector for
all families except Cox; for Cox, a two-column matrix
|
true_features |
Optional integer or character vector listing the indices (or names) of the true active variables. When supplied, support-recovery metrics are appended. |
Details
The metrics depend on the family:
"gaussian"MSE, MAE, R-squared.
"binomial"accuracy, AUC, binomial deviance.
"poisson"MSE, MAE, Poisson deviance.
"exponential"MSE, MAE, Exponential deviance.
"gumbel"MSE, MAE, deviance computed from the per-sample negative log-likelihood with a moment estimate of the scale parameter
\hat\sigma = \mathrm{sd}(y - \hat\eta)\sqrt{6}/\pi."cox"Harrell's C-index and the Breslow partial log-likelihood (negative, normalised by
n).
When true_features is non-NULL, four support-recovery metrics
are appended (independent of newx / newy):
exact_recovery, tpr (true-positive rate / sensitivity), fdr
(false-discovery rate) and f1 (harmonic mean of precision and
recall). Names are accepted in addition to integer positions when
the fit carries column names.
Value
A two-column data.frame with columns metric (character)
and value (numeric).
Examples
data(QuickStartExample)
X <- QuickStartExample$X
y <- QuickStartExample$y
fit <- pic(X, y, family = "gaussian", penalty = "scad")
assess(fit, X, y, true_features = paste0("gene_", 1:5))
Breslow baseline cumulative hazard and survival.
Description
Breslow baseline cumulative hazard and survival.
Usage
baseline_functions(times, events, predictions)
Arguments
times |
Survival times (length |
events |
Event indicators (0 or 1; length |
predictions |
Risk scores; higher = higher risk = shorter survival. |
Value
A data.frame with columns time, cumulative_hazard, survival.
Validate and optionally standardise the design matrix.
Description
Standardises columns to zero mean and unit variance (population sd; n divisor)
when standardize_X is TRUE. When X_mean and X_std are supplied they
are reapplied without recomputing — used at prediction time on new data.
Column names of X (or column names of the source data frame) are captured
and returned as feature_names; they are propagated through pic() to
annotate fit$selected, coef(), and the coefficient plot.
Usage
check_X(X, standardize_X = TRUE, X_mean = NULL, X_std = NULL)
Arguments
X |
A numeric matrix or coercible (data.frame). |
standardize_X |
Logical; standardise columns of |
X_mean |
Optional pre-computed column means. |
X_std |
Optional pre-computed column standard deviations. |
Value
A list with components X, X_mean, X_std, feature_names
(the column names of X if any, otherwise NULL).
Joint validation/preprocessing of (X, y).
Description
For survival data, rows of X and y are sorted by ascending time.
Usage
check_Xy(X, y, y_kind, standardize_X = TRUE, X_mean = NULL, X_std = NULL)
Validate response according to its distributional kind.
Description
Validate response according to its distributional kind.
Usage
check_y(y, n_samples = NULL, y_kind = "continuous")
Arguments
y |
Response vector or 2-column matrix (survival). |
n_samples |
Expected number of observations. |
y_kind |
One of |
Value
Validated y (possibly coerced to matrix for survival).
Coefficients of a fitted pic model.
Description
Returns a two-column data frame with the variable name in the first
column (variable) and the fitted coefficient in the second
(coefficient). The first row is always the intercept, labelled
"(Intercept)"; when no intercept was fitted its value is 0.
Variable names are taken from the column names of X if any were
supplied (matrix colnames(X) or data-frame column names), and
otherwise default to V1, ..., Vp.
Usage
## S3 method for class 'pic'
coef(object, standardized = FALSE, ...)
Arguments
object |
A fitted |
standardized |
Logical; if |
... |
Unused; present for S3 method consistency. |
Details
Internally the model is fitted on a standardised design matrix, so the
raw coefficients live on the standardised scale. By default this method
rescales them back to the original scale of X — the values to
plug into the un-standardised design for prediction — via
beta\_orig = beta / s \quad intercept\_orig = intercept - sum(m * beta\_orig)
where m and s are the column mean and standard deviation. Pass
standardized = TRUE to skip the rescaling and return the raw fit
values (identical to fit$beta).
Value
A data.frame with columns variable (character) and
coefficient (numeric), of length p + 1 (intercept + features).
Harrell's concordance index.
Description
Harrell's concordance index.
Usage
concordance_index(times, events, predictions)
Arguments
times |
Survival times (length |
events |
Event indicators (0 or 1; length |
predictions |
Risk scores; higher = higher risk = shorter survival. |
Value
Numeric scalar in [0, 1].
Breslow partial log-likelihood (negative, normalised by n).
Description
Breslow partial log-likelihood (negative, normalised by n).
Usage
cox_partial_log_likelihood(times, events, predictions)
Arguments
times |
Survival times (length |
events |
Event indicators (0 or 1; length |
predictions |
Risk scores; higher = higher risk = shorter survival. |
Value
A numeric scalar: the negative Breslow partial log-likelihood
for the Cox proportional-hazards model, normalised by the sample
size n. Lower values indicate a better fit.
Effect of one feature on the Cox survival curve.
Description
Builds the family of survival curves obtained by varying a single
covariate while holding the others at their column means. The
returned object has the same shape as predict_survival_function(),
so it composes directly with plot_survival_curves().
Usage
feature_effects_on_survival(object, idx, values = NULL)
Arguments
object |
A fitted |
idx |
Index of the feature to vary. Either an integer column
position in the training |
values |
Optional numeric vector of values to evaluate. When
|
Details
Both the per-column mean row and a default grid of representative
values (used when values = NULL) are cached on the fit by pic()
under attr(fit, "preproc"), so the training design matrix does
not need to be passed back in. The cached default grid uses the
unique values of the column when there are at most five of them
(handy for ordinal / categorical covariates), and the four
equispaced empirical quantiles (0, 1/3, 2/3, 1) otherwise.
Value
A list with components time (length K) and survival
(matrix K x length(values), one column per evaluated value).
Column names of survival are formatted as
"<feature_name> = <value>" and are picked up automatically by
plot_survival_curves() for the legend.
See Also
predict_survival_function(), plot_survival_curves().
Non-monotone Accelerated Proximal Gradient (nmAPG) solver — dispatcher.
Description
Thin R wrapper around the C++ fista_cpp implementation. The actual
algorithm (Li & Lin, NeurIPS 2015) lives in src/fista.cpp; this
function only extracts the C++-facing arguments and routes the call.
Usage
fista(
X,
y,
family,
penalty,
lambda_reg,
fit_intercept = TRUE,
rel_tol = 1e-04,
step_size_init = 0.1,
max_iter = 500L,
eta_param = 0.8,
delta_param = 1e-04,
rho = 0.5,
bb_growth_cap = 4,
coef_init = NULL,
intercept_init = NULL
)
Arguments
X |
Standardised design matrix. |
y |
Response vector (length n). |
family |
A pic family object (provides |
penalty |
A pic penalty object (provides |
lambda_reg |
Regularisation parameter. |
fit_intercept |
Whether to update an unpenalised intercept. |
rel_tol |
Relative gradient-mapping tolerance. |
step_size_init |
Initial step size. |
max_iter |
Hard cap on outer iterations. |
eta_param |
Memory weight of the non-monotone reference (in [0,1)). |
delta_param |
Sufficient-descent constant in the line search. |
rho |
Step-size reduction factor (< 1). |
bb_growth_cap |
Maximum BB step growth over the previous step. |
coef_init, intercept_init |
Optional warm starts. |
Details
Cox is routed to a dedicated C++ FISTA (fista_cox_cpp) because its
loss has no intercept and uses a 2-column (time, event) response.
The build accepts only the six families and three penalties that have
a C++ implementation; the family / penalty registries throw on any
other input upstream.
Value
A list with coef, intercept, info.
Resolve a family name into a pic family descriptor.
Description
Accepts one of the six supported names: "gaussian", "binomial",
"poisson", "exponential", "gumbel", "cox". Anything else raises
an error — the build does not support user-defined families.
Already-built descriptors are returned unchanged (internal re-entry).
Usage
get_family(family)
Arguments
family |
A family name or already-built pic family descriptor. |
Value
A pic family descriptor.
Resolve a penalty name into a pic penalty descriptor.
Description
Accepts one of "lasso", "scad", "mcp" (case-insensitive). Anything
else raises an error: the build does not support user-defined penalties.
Usage
get_penalty(penalty, scad_a = 3.7, mcp_gamma = 3)
Arguments
penalty |
A penalty name. |
scad_a |
SCAD concavity parameter (default 3.7). |
mcp_gamma |
MCP concavity parameter (default 3.0). |
Value
A pic.penalty descriptor.
Pivotal Detection Boundary regularisation selector
Description
Computes the data-driven regularisation parameter \hat\lambda^{\rm PDB}_\alpha
using the Pivotal Detection Boundary (PDB) principle. The selected value is
defined as the empirical (1 - \alpha) quantile of a null-distribution gradient statistic
\hat\lambda^{\rm PDB}_\alpha =q_{1-\alpha}\left(\left\|\nabla \ell_0\right\|_\infty
\right),
where \ell_0 denotes the loss evaluated under the null model.
Usage
lambda_pdb(
X,
family,
n_simu = 5000L,
alpha = 0.05,
method = c("mc_exact", "mc_gaussian", "analytical")
)
Arguments
X |
Numeric design matrix. Columns should typically be standardised to zero mean and unit variance. |
family |
A PIC family specification. Can be either a family object
or a character string accepted by |
n_simu |
Number of Monte Carlo simulations used by the stochastic
methods. Ignored when |
alpha |
Nominal tail probability used to define the quantile level. |
method |
One of "mc_exact", "mc_gaussian", or "analytical". |
Details
Under the null \beta = 0, the gradient of the loss carries only sampling noise.
The smallest \lambda large enough to dominate this noise is the natural threshold
separating signal from noise, i.e. the value above which a coefficient should be
kept rather than shrunk to zero. Calibrating \hat\lambda this way has two
consequences. First, the quantile depends only on the design matrix X,
the family, and the level \alpha (not on the response y), so
cross-validation is no longer needed. Second, and more importantly, it leads
to sharper support recovery than prediction-error-based selectors such as
cross-validated lasso: by targeting the noise level of the gradient directly,
PDB controls the inclusion of noise variables and more reliably identifies
the true non-zero coefficients. Computing the quantile requires only the
distribution of \|\nabla \ell_0\|_\infty, which lambda_pdb() estimates
through one of the three methods below.
Details on method option
The empirical quantile is obtained using one of the three following methods:
"mc_exact"-
Family-aware Monte Carlo. For each of the
n_simudraws, a response vector is sampled under the null model (\beta = 0) of the chosen family; the family-specific gradient of the loss at\beta = 0is evaluated on the fixed designX, and its supremum norm is recorded. The empirical(1 - \alpha)quantile of then_simurecorded norms is returned. Most accurate but slowest of the three. "mc_gaussian"-
Monte Carlo under the Gaussian approximation of the null gradient. A central-limit argument gives
\nabla \ell_0 \approx \mathcal{N}(0,\, c(n)\, \Sigma_X / n)with\Sigma_X = X^\top X / n. Each of then_simudraws samples directly from this Gaussian and records its supremum norm — no family-specific evaluation needed. Family-agnostic and noticeably faster than"mc_exact"; valid in the regime where the CLT kicks in (moderate to largen). "analytical"-
Closed-form Bonferroni bound on the Gaussian tail. Combining a union bound over the
pcoordinates of the gradient with the standard Gaussian tail bound gives\hat\lambda_\alpha^{\rm analytical} = \Phi^{-1}\!\left(1 - \alpha / (2p)\right)\, \sqrt{c(n) / n}.Deterministic and
O(1)— no simulation. Conservative (slightly over-estimates the true quantile) and gets looser aspgrows, but useful when speed matters or when Monte Carlo is overkill.
The "mc_gaussian" and "analytical" methods use a variance scaling
factor c(n) depending on the family:
Gaussian / Binomial / Poisson / Exponential:
c(n) = 1Gumbel:
c(n) = \exp(2(\gamma + 1)), with\gammathe Euler-Mascheroni constant.Cox:
c(n) = 1 / (4 \log n)
Computational cost
Both Monte Carlo methods are dominated by a single p \times n_{\rm simu}
matrix product X^\top R, where R stacks the simulated
residuals. This product is dispatched to BLAS as a single \tt gemm,
which is essentially as fast as a Monte Carlo selector can be on dense
designs. For very large n or p, however, the constant
becomes large and "mc_exact" may be unnecessarily expensive:
As
ngrows, the central-limit approximation tightens and"mc_gaussian"gives essentially the same\hat\lambdaas"mc_exact"at a fraction of the cost (no family-specific residual generation, fewer dependencies ony-draws).-
"analytical"isO(1)and useful as a quick upper bound — slightly conservative, but accurate enough for triage and for very highpwhere the Bonferroni tail is tight.
Practical rule of thumb: prefer "mc_exact" for small to moderate
problems (default), "mc_gaussian" once n grows and the
design is dense, and "analytical" when even the Monte Carlo cost is
a concern.
Details on alpha option
The level \alpha is the nominal Type-I error of the test of
H_0\!\colon \beta = 0. By construction of the quantile,
\Pr\!\left(\left\|\nabla \ell_0\right\|_\infty
> \hat\lambda^{\rm PDB}_\alpha \,\big|\, H_0\right) = \alpha,
so under the null model no variable enters the active set with
probability at most \alpha. With the default
\alpha = 0.05, this caps the false-discovery rate at 5\%
under the null: when the data carry no signal, picreg returns the
empty support 95\% of the time.
Value
An object of class "pic.lambda_pdb".
value |
Selected regularisation parameter |
statistics |
Simulated null statistics used to estimate the quantile.
|
method |
Estimation method used. |
alpha |
Quantile level parameter. |
n_simu |
Number of Monte Carlo simulations. |
Examples
X <- scale(matrix(rnorm(100 * 20), 100, 20))
lam <- lambda_pdb(
X,
family = "gaussian",
method = "mc_exact"
)
print(lam)
Asymptotic behaviour of the PDB null distribution.
Description
For each n in n_grid, draws a standardised Gaussian design matrix
of shape (n, p) and computes the null gradient-norm statistic via
the three available selectors: "mc_exact", "mc_gaussian", and
"analytical". Stores the simulated Monte Carlo statistics and the
three resulting \hat\lambda values per n.
Usage
pdb_asymptotic(
n_grid,
p,
type = c("gaussian", "binomial", "poisson", "exponential", "gumbel", "cox"),
alpha = 0.05,
n_simu = 5000L,
verbose = FALSE
)
Arguments
n_grid |
Integer vector of sample sizes to evaluate. |
p |
Number of features (scalar integer). |
type |
Family name: |
alpha |
Nominal level used for the (1 - alpha) quantile. |
n_simu |
Monte Carlo size for each selector. |
verbose |
Logical; if |
Details
The intended use is to visualise the convergence of the exact
family-specific null distribution to the Gaussian approximation as
n grows — i.e., to check empirically that mc_gaussian is a valid
substitute for mc_exact in the asymptotic regime.
Value
An object of class c("pic.pdb_asymptotic", "pic.diagnostic").
n_grid, p, type, alpha, n_simu |
Configuration. |
stats_exact, stats_gaussian |
Lists of length |
lambda_exact, lambda_gaussian, lambda_analytical |
Numeric
vectors of length |
call |
The call. |
See Also
plot.pic.pdb_asymptotic() for visualisation.
Examples
as_ <- pdb_asymptotic(n_grid = c(50, 200, 1000),
p = 200, type = "poisson")
plot(as_)
Summary of the PDB lambda selector.
Description
Prints a formatted summary of the Pivotal Detection Boundary selector
used by pic() to choose \hat\lambda: method, nominal level,
Monte Carlo size, selected \hat\lambda, and a compact view of
the null distribution when Monte Carlo was run. For models fitted
with method = "analytical" or with a user-supplied lambda, only
the selector metadata is shown.
Usage
pdb_summary(model, digits = 4L)
Arguments
model |
A fitted |
digits |
Number of significant digits used in the distribution table. |
Value
Invisibly returns NULL. Called for its side effect (printing).
Examples
data(QuickStartExample)
fit <- pic(QuickStartExample$X, QuickStartExample$y,
family = "gaussian", penalty = "lasso")
pdb_summary(fit)
Phase-transition analysis of support recovery.
Description
For a fixed (n, p) configuration, varies the sparsity level s
from 0 to s_max and, for each s, estimates by Monte Carlo
(m replications) the probability that pic() recovers exactly
the support of the true coefficient vector. If several penalties are
supplied, one result is produced for each (n, p, penalty) combination.
Usage
phase_transition(
n,
p,
type = c("gaussian", "binomial", "poisson", "exponential", "gumbel", "cox"),
s_max,
m = 50,
penalty = "lasso",
beta_value = 3,
lambda_method = "mc_exact",
lambda_alpha = 0.05,
lambda_n_simu = 5000L,
verbose = FALSE,
parallel = FALSE,
workers = parallel::detectCores() - 1L
)
Arguments
n |
Integer or integer vector — number of observations per configuration. |
p |
Integer or integer vector of the same length as |
type |
Family name: |
s_max |
Largest sparsity level evaluated. Must satisfy |
m |
Number of Monte Carlo replications per |
penalty |
One or more penalties for |
beta_value |
Magnitude of the non-zero coefficients used to generate
the response. The sign is fixed to |
lambda_method |
Passed to |
lambda_alpha |
Nominal level for the PDB selector. |
lambda_n_simu |
Monte Carlo size for the PDB selector. |
verbose |
Logical; if |
parallel |
Logical; if |
workers |
Integer; number of background R processes to use when
|
Details
At each Monte Carlo replicate a design matrix is drawn from a
standard Gaussian, s features are sampled uniformly at random,
the response is generated under the chosen type, and one pic
fit is run for each requested penalty. The selected support is
compared to the truth and three metrics are stored:
exact_recovery1 if the selected set equals the true set, 0 otherwise.
tpr\left|\hat{S}\cap S\right| / |S|- true positive rate. Fors = 0, this is set to 1 when no true feature exists.fdr\left|\hat{S}\setminus S\right| / \max{(|\hat{S}|, 1)}- false discovery rate.
Details on data sampling
At each replicate, the design X is drawn iid
\mathcal{N}(0, 1), the true support S is sampled
uniformly at random, and \beta_j = beta_value for
j \in S, \beta_j = 0 otherwise. The linear predictor
is \eta = X\beta. The response y is then drawn
conditionally on \eta according to the requested family:
"gaussian"-
y_i = \eta_i + \varepsilon_iwith\varepsilon_i \sim \mathcal{N}(0, 1). "binomial"-
y_i \sim \mathrm{Bernoulli}\!\left(\sigma(\eta_i)\right), where\sigma(z) = 1 / (1 + e^{-z})is the logistic function. "poisson"-
y_i \sim \mathrm{Poisson}\!\left(e^{\eta_i}\right). "exponential"-
y_i \sim \mathrm{Exp}\!\left(\mathrm{rate} = e^{\eta_i}\right). "gumbel"-
y_i = \eta_i + \varepsilon_iwith\varepsilon_i \sim \mathrm{Gumbel}(0, 1), drawn as-\log(-\log U_i)forU_i \sim \mathcal{U}(0, 1). "cox"-
Event times
T_i \sim \mathrm{Exp}\!\left(e^{\eta_i}\right)and independent censoring timesC_i \sim \mathrm{Exp}(1). The response is the 2-column matrix\bigl(\min(T_i, C_i),\, \mathbf{1}\{T_i \le C_i\}\bigr).
Value
An object of class c("pic.phase_transition", "pic.diagnostic").
s_grid |
|
exact_recovery, tpr, fdr |
Matrices of shape |
curve_n, curve_p, curve_penalty |
Curve descriptors aligned with the rows of the metric matrices. |
config |
A list of all configuration arguments for downstream plotting / reporting. |
call |
The call. |
See Also
plot.pic.phase_transition() for visualisation.
Examples
pt <- phase_transition(n = 50, p = 100, type = "gaussian",
s_max = 8, m = 20,
penalty = c("lasso", "scad"),
parallel = TRUE)
plot(pt)
Sparse regression via pivotal loss and PDB lambda selection
Description
Fits sparse regression, classification, and survival models using a family-specific pivotal loss combined with a sparsity-inducing penalty. Supported families are Gaussian, binomial, Poisson, exponential, Gumbel, and Cox proportional hazards.
Usage
pic(
X,
y,
family = c("gaussian", "binomial", "poisson", "exponential", "gumbel", "cox"),
penalty = "lasso",
standardize = TRUE,
intercept = TRUE,
lambda_method = c("mc_exact", "mc_gaussian", "analytical"),
relax = FALSE,
mcp_gamma = 3,
scad_a = 3.7,
lambda = NULL,
lambda_alpha = 0.05,
lambda_n_simu = 2000,
tol = 1e-05,
path_length = 10L,
maxit = 1000
)
Arguments
X |
Numeric design matrix of shape |
y |
Response variable with length |
family |
A character name determining the response distribution and
the associated loss function used during optimisation. See
Details on family-specific losses below for the exact
objective functions associated with each family. Default |
penalty |
Penalty name: one of |
standardize |
Whether to standardise columns of |
intercept |
Whether to fit an unpenalised intercept. Forced to
|
lambda_method |
One of |
relax |
If |
mcp_gamma |
MCP concavity parameter when |
scad_a |
SCAD concavity parameter when |
lambda |
Optional user-supplied regularisation parameter. When |
lambda_alpha |
Nominal level for PDB. See |
lambda_n_simu |
Number of Monte Carlo draws for PDB. See |
tol |
FISTA convergence tolerance at the final path step. |
path_length |
Number of points in the warm-start regularisation
path running from |
maxit |
Maximum number of iterations for FISTA if convergence is not yet reached. |
Details
Minimises the objective function
\hat\beta = \arg\min_\beta\; L(\beta)
\;+\; \lambda_\alpha^{\rm PDB}\,\mathrm{pen}(\beta)
\quad\text{with}\quad
L(\beta) = \phi\!\left(\ell_n\!\left(g(\beta)\right)\right),
where the regularisation parameter is selected automatically using
the Pivotal Detection Boundary (PDB) principle implemented in
lambda_pdb(). The PDB choice is not just a substitute for
cross-validation: by calibrating \lambda against a pivotal
null-distribution statistic rather than out-of-sample prediction
error, it yields sharper support recovery - fewer false positives
on noise variables and more accurate identification of the true
non-zero coefficients. Here, \phi and g are
family-specific transformations chosen so that the gradient at the
null has a distribution free of the nuisance parameter, which is
precisely what makes the use of \lambda^{\rm PDB}_\alpha
valid. Optimisation is performed using a warm-started FISTA
regularisation path.
Details on family-specific losses
"gaussian"-
Uses
\phi(\cdot) = \sqrt{\cdot}andg(\cdot) = \mathrm{Id}, resulting in the square-root lasso objectiveL(\beta) = \frac{\|y - X\beta\|_2}{\sqrt{n}}. "binomial"-
Uses a variance-stabilised transformation of the Bernoulli likelihood. With
\phi(\cdot) = \mathrm{Id}and the logistic linkg(\eta_i) = (1 + e^{-\eta_i})^{-1}, the loss isL(\beta) = \frac{1}{n}\sum_{i=1}^n \left(2 y_i \sqrt{\frac{1 - g(\eta_i)}{g(\eta_i)}} + 2 (1 - y_i) \sqrt{\frac{g(\eta_i)}{1 - g(\eta_i)}}\right).The classical logistic link is preserved; only the loss itself is modified to obtain a pivotal gradient.
"poisson"-
As for
"binomial", uses a variance-stabilised version of the Poisson likelihood. With\phi(\cdot) = \mathrm{Id}andg(\eta_i) = e^{\eta_i}, the loss isL(\beta) = \frac{1}{n}\sum_{i=1}^n \left(\frac{2 y_i}{\sqrt{g(\eta_i)}} + 2 \sqrt{g(\eta_i)}\right).The standard log link is kept unchanged; the pivotality property is obtained through the modified loss rather than through the link function.
"exponential"-
Uses the standard Exponential negative log-likelihood together with the classical log link
g(\eta_i) = e^{\eta_i}. The loss isL(\beta) = \frac{1}{n}\sum_{i=1}^n \left(\log\!\left(g(\eta_i)\right) + \frac{y_i}{g(\eta_i)}\right).No variance-stabilising modification is required.
"gumbel"-
Uses an exponentially stabilised Gumbel negative log-likelihood. With
\phi(\cdot) = \exp(\cdot)and identity linkg(\eta_i) = \eta_i, the loss is based on the Gumbel log-likelihood\ell(\beta, \sigma) = \log(\sigma) + \frac{1}{n}\sum_{i=1}^n \left(z_i + e^{-z_i}\right), \qquad z_i = \frac{y_i - \eta_i}{\sigma}.The optimisation objective becomes
L(\beta, \sigma) = \exp\!\left(\ell(\beta, \sigma)\right).The scale parameter
\sigmais re-estimated internally by profile maximum likelihood at each optimisation step. "cox"-
Uses a square-root-transformed Cox partial log-likelihood. With
\phi(\cdot) = \sqrt{\cdot}and identity linkg(\eta_i) = \eta_i, the underlying partial likelihood is\ell(\beta) = -\frac{1}{n}\left[ \sum_{i=1}^n \delta_i \eta_i - \sum_{i=1}^n \delta_i \log\!\left(\sum_{j \in R_i} e^{\eta_j}\right)\right],where
\delta_iis the event indicator andR_iis the Cox risk set at timet_i. The final objective isL(\beta) = \sqrt{\ell(\beta)}.
Value
An object of class c("pic.<family>", "pic").
beta |
Fitted coefficients (length |
intercept |
Fitted intercept or |
df |
The number of nonzero coefficients. |
selected |
Indices of the nonzero coefficients. |
lambda |
|
family |
The family object used. |
penalty |
The penalty object used. |
lambda_pdb |
Result from |
n_samples |
The number of observations in the training set. |
n_features |
The number of variables in the training set. |
For family = "cox", the fit additionally carries:
baseline_cumulative_hazard |
Estimated Breslow baseline cumulative hazard function. |
baseline_survival |
Estimated baseline survival function. |
unique_times |
Sorted unique event times used in the baseline estimation. |
censoring_rate |
Percentage of censored observations in the training set. |
Examples
data(QuickStartExample)
X <- QuickStartExample$X
y <- QuickStartExample$y
fit <- pic(X, y, family = "gaussian", penalty = "scad")
fit$beta
fit$selected
GLM families for pic — descriptor layer.
Description
Each family is a thin descriptor carrying:
-
name- used by the dispatchers to route to the C++ implementation. -
g- mean-function link (object withnameand callablefn).g$fn(eta)is applied at predict time to obtain the mean response. -
phi- variance-stabilising transform (object withname). Purely informational on the R side; the actual stabilisation happens inside the C++ math.
Details
Typing fit$family gives a one-glance summary of the model's family
(see print.pic.family()). The actual loss / gradient math lives in C++:
-
src/family_*.cpp— Gaussian, Binomial, Poisson, Exponential, Gumbel. -
src/cox.cpp— Cox.
S3 methods for fitted pic objects.
Description
The fitted object has class c("pic.<family>", "pic") —
methods dispatch on the second class for shared behaviour
and on the first class for family-specific overrides.
Sparsity-inducing penalties for pic.
Description
Three penalties are supported, identified by lowercase name to match
the C++ registry. Each penalty enters the pic() objective as
\mathrm{pen}(\beta) = \sum_{j=1}^p p_\lambda(|\beta_j|),
where p_\lambda(\cdot) depends on the penalty.
Details
"lasso"-
L1 (soft-thresholding) penalty:
p_\lambda(|t|) = \lambda |t|.Convex, gives the strongest shrinkage on large coefficients bias does not vanish as
|t| \to \infty. "scad"(Smoothly Clipped Absolute Deviation, Fan & Li 2001)-
Non-convex penalty with concavity parameter
scad_a > 2(default 3.7):p_\lambda'(|t|) = \lambda\!\left\{ \mathbf{1}\{|t| \le \lambda\} + \frac{(a\lambda - |t|)_+}{(a - 1)\lambda} \mathbf{1}\{|t| > \lambda\}\right\}.Behaves like the lasso for small
|t|, then tapers off so large coefficients are barely penalised - yields nearly unbiased estimates on strong signals. "mcp"(Minimax Concave Penalty, Zhang 2010)-
Non-convex penalty with concavity parameter
mcp_gamma > 1(default 3.0):p_\lambda'(|t|) = \left(\lambda - \frac{|t|}{\gamma}\right)_+.Similar motivation as SCAD but a smoother transition: starts at the lasso derivative for small
|t|and tapers linearly to zero at|t| = \gamma\lambda.
The actual evaluation and proximal operators live in C++
(src/penalty_*.cpp). Larger scad_a / mcp_gamma make the
penalty closer to the lasso; smaller values amplify the
non-convexity (and the bias reduction on strong signals).
Plot methods for pic fits and diagnostics.
Description
Entry points:
Details
-
plot(fit)- lollipop plot of the non-zero coefficients. -
plot(fit$lambda_pdb)- histogram of the PDB null distribution with a vertical line at the selected\hat\lambda. -
plot_baseline()- Cox-only: baseline cumulative hazard and baseline survival, two panels. -
plot(phase_transition(...))- recovery curves. -
plot(pdb_asymptotic(...))- null-distribution convergence panels.
Survival utilities for the Cox family.
Description
Provides Harrell's C-index, the Breslow partial log-likelihood, baseline cumulative hazard / survival, and feature-effect curves.
Horizontal lollipop plot of the non-zero coefficients of a pic fit.
Description
One row per selected variable, sorted by descending absolute coefficient value (largest at the top). Each variable is drawn as a horizontal segment from zero to its fitted value.
Usage
## S3 method for class 'pic'
plot(x, standardized = TRUE, max_features = NULL, ...)
Arguments
x |
A fitted |
standardized |
Logical; if |
max_features |
Optional cap on the number of features displayed (the strongest are kept). |
... |
Additional graphical parameters forwarded to
|
Value
Invisibly returns the plotted (named) coefficient vector.
Plot the PDB null distribution.
Description
Histogram of the simulated null gradient-norm statistics, with a
vertical line at the selected \hat\lambda.
Usage
## S3 method for class 'pic.lambda_pdb'
plot(x, breaks = 40L, ...)
Arguments
x |
A |
breaks |
Number of histogram bins (default 40). |
... |
Unused; present for S3 method consistency. |
Value
Invisibly returns x.
Plot of the PDB asymptotic behaviour.
Description
Multi-panel histogram comparison of the simulated null gradient-norm
statistic under the "mc_exact" (light grey fill) and "mc_gaussian"
(dashed outline) selectors, one panel per n in n_grid. Two
vertical lines are added per panel:
-
\hat\lambda^{\rm PDB}_\alpha- solid navy, the empirical (1 - alpha) quantile of"mc_exact"(what pic would actually use). -
\hat\lambda_{analytical}- dashed black, the Bonferroni closed-form bound.
Usage
## S3 method for class 'pic.pdb_asymptotic'
plot(x, breaks = 40L, ...)
Arguments
x |
An object returned by |
breaks |
Number of histogram bins (default 40). |
... |
Unused; present for S3 method consistency. |
Value
Invisibly returns x.
Phase-transition plot for a pic.phase_transition object.
Description
Plots the chosen recovery metric as a function of the sparsity
level s. Curves are distinguished by line type (grayscale ramp
beyond five curves). When multiple (n, p) configurations are
compared across several penalties, panels are laid out in a grid
with at most three columns per row.
Usage
## S3 method for class 'pic.phase_transition'
plot(x, metric = c("exact_recovery", "tpr", "fdr"), ...)
Arguments
x |
An object returned by |
metric |
One of |
... |
Additional graphical parameters forwarded to |
Value
Invisibly returns x.
Plot Cox baseline cumulative hazard and baseline survival.
Description
Two-panel step plot for a fitted pic.cox model: the Breslow
baseline cumulative hazard H_0(t) on top and the baseline
survival S_0(t) = \exp(-H_0(t)) below.
Usage
plot_baseline(model)
Arguments
model |
A fitted |
Value
Invisibly returns NULL.
Plot subject-specific Cox survival curves.
Description
Step-line visualisation of the output of
predict_survival_function() or feature_effects_on_survival():
one survival curve per subject (or per feature value) on a common
time grid. To keep curves distinguishable, each curve is drawn with
its own line type and a sparse set of marker glyphs is overlaid at
regular time points.
Usage
plot_survival_curves(
sf,
subjects = NULL,
max_subjects = 10L,
labels = NULL,
n_marks = 8L,
main = "Individual survival curves",
...
)
Arguments
sf |
A list as returned by |
subjects |
Optional integer vector selecting which columns of
|
max_subjects |
Maximum number of curves drawn when |
labels |
Optional character vector of labels used in the
legend, one per plotted curve. Defaults to the column names of
|
n_marks |
Number of marker glyphs overlaid on each curve to
help distinguish them. Default |
main |
Plot title. |
... |
Additional graphical parameters forwarded to
|
Value
Invisibly returns NULL.
Linear predictor / response prediction for a pic fit.
Description
Linear predictor / response prediction for a pic fit.
Usage
## S3 method for class 'pic'
predict(object, newx, type = c("response", "link", "class"), ...)
Arguments
object |
A fitted |
newx |
Matrix of new values at which predictions are to be made. |
type |
|
... |
Unused; present for S3 method consistency. |
Value
A numeric vector.
Survival curves for new data from a fitted Cox pic model.
Description
Survival curves for new data from a fitted Cox pic model.
Usage
predict_survival_function(object, newx)
Arguments
object |
A fitted |
newx |
New design matrix (rows = subjects). |
Value
A list with time (length K) and survival (matrix K x m,
one column per subject).
Print a pic.coef table.
Description
Pretty-prints the two-column coefficient table returned by
coef.pic() without the integer row numbers that print.data.frame
would otherwise add. The underlying object is still a data.frame,
so any subsetting / column access works as usual.
Usage
## S3 method for class 'pic.coef'
print(x, ...)
Arguments
x |
A |
... |
Forwarded to |
Value
Invisibly returns x.
Pretty-print a pic family descriptor.
Description
Three-line summary showing the family name and its (g, phi) link pair.
Usage
## S3 method for class 'pic.family'
print(x, ...)
Arguments
x |
A |
... |
Ignored. |
Value
Invisibly returns x.
Print PDB asymptotic diagnostic.
Description
Print PDB asymptotic diagnostic.
Usage
## S3 method for class 'pic.pdb_asymptotic'
print(x, ...)
Arguments
x |
A |
... |
Unused; present for S3 method consistency. |
Value
Invisibly returns x.
Print phase-transition analysis.
Description
Print phase-transition analysis.
Usage
## S3 method for class 'pic.phase_transition'
print(x, ...)
Arguments
x |
A |
... |
Unused; present for S3 method consistency. |
Value
Invisibly returns x.