shap_enhanced.tools.datasets

Synthetic Data Generators for Regression Benchmarks

Overview

This module provides utility functions for generating synthetic datasets for benchmarking SHAP and other model explainability techniques in both sequential and tabular regression settings.

Two types of data generators are included:

  • Sequential Regression Generator: Produces multivariate time-series inputs with sinusoidal targets based on cumulative temporal signals.

  • Tabular Regression Generator: Supports sparse or dense tabular inputs, with configurable linear or nonlinear target mappings. Also outputs ground-truth feature weights.

Key Functions

  • generate_synthetic_seqregression: Creates a synthetic sequence-to-scalar regression dataset whose target is the sine of the first feature summed over all timesteps, plus Gaussian noise.

  • generate_synthetic_tabular: Produces sparse or dense tabular data with a tunable regression function. Returns both the dataset and the true underlying feature weights used to generate targets.

Use Case

These generators are useful for:

  • Testing SHAP explainers on known attribution structures.

  • Evaluating sensitivity to sparsity, nonlinearity, or sequence length.

  • Building reproducible benchmarks for model interpretability.
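To illustrate what a "known attribution structure" can look like, the sketch below uses plain NumPy rather than this module. It assumes a noise-free linear target and independent features, in which case the exact SHAP value of feature j for sample x is w_j (x_j - E[x_j]). The variable names (`w_true`, `phi`, `baseline`) and the standard-normal inputs are illustrative assumptions, not part of the library.

```python
import numpy as np

# Hypothetical stand-in data: dense standard-normal features and a linear target.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 6
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true  # noise-free linear model f(x) = x . w

# Exact attributions for a linear model with independent features:
# phi[i, j] = w_j * (x_ij - mean_j), using the empirical mean as the baseline.
baseline = X.mean(axis=0)
phi = (X - baseline) * w_true

# Local accuracy check: attributions must sum to f(x) - E[f(x)] for every sample.
residual = phi.sum(axis=1) - (y - y.mean())
max_abs_residual = np.abs(residual).max()
print(f"max |sum(phi) - (f(x) - E[f])| = {max_abs_residual:.2e}")
```

An explainer under test can then be scored by how closely its attributions match `phi`; for nonlinear targets no such closed form is assumed here.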

Example

from shap_enhanced.tools.datasets import (
    generate_synthetic_seqregression,
    generate_synthetic_tabular,
)

# Sequence data
X_seq, y_seq = generate_synthetic_seqregression(seq_len=12, n_features=4, n_samples=100)

# Tabular data with sparsity and nonlinearity
X_tab, y_tab, w = generate_synthetic_tabular(n_samples=200, n_features=6, sparse=True, model_type="nonlinear")

Functions

generate_synthetic_seqregression([seq_len, ...])

Generate synthetic multivariate time-series data for sequence-to-scalar regression.

generate_synthetic_tabular([n_samples, ...])

Generate synthetic tabular data with optional sparsity and a configurable regression function.

shap_enhanced.tools.datasets.generate_synthetic_seqregression(seq_len=10, n_features=3, n_samples=200, seed=0)[source]

Generate synthetic multivariate time-series data for sequence-to-scalar regression.

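Each target is the sine of the first feature summed over all T timesteps, plus Gaussian noise with standard deviation 0.1. Here x_{it1} denotes the first feature of sequence i at timestep t; the remaining features do not enter the target.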

\[y_i = \sin\left(\sum_{t=1}^T x_{it1}\right) + \epsilon_i,\quad \epsilon_i \sim \mathcal{N}(0, 0.1^2)\]
Parameters:
  • seq_len (int) – Length of the time series (T).

  • n_features (int) – Number of features per timestep (F).

  • n_samples (int) – Number of sequences to generate (N).

  • seed (int) – Random seed for reproducibility.

Returns:

Tuple of input sequences and target values.

Return type:

(np.ndarray, np.ndarray)
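The target formula above can be sketched in a few lines of NumPy. This is not the library's source: the function name `sketch_seqregression`, the standard-normal input distribution, and the (N, T, F) array layout are assumptions made for illustration; only the target equation and the noise scale come from the documentation.

```python
import numpy as np

def sketch_seqregression(seq_len=10, n_features=3, n_samples=200, seed=0):
    """Hypothetical re-implementation of the documented target formula."""
    rng = np.random.default_rng(seed)
    # Assumption: inputs are i.i.d. standard normal with shape (N, T, F).
    X = rng.normal(size=(n_samples, seq_len, n_features))
    # Sum the first feature over all timesteps for each sequence.
    signal = X[:, :, 0].sum(axis=1)
    # y_i = sin(sum_t x_{it1}) + eps_i, with eps_i ~ N(0, 0.1^2).
    y = np.sin(signal) + rng.normal(scale=0.1, size=n_samples)
    return X, y

X_seq, y_seq = sketch_seqregression(seq_len=12, n_features=4, n_samples=100)
print(X_seq.shape, y_seq.shape)
```

Because only the first feature drives the target, an explainer evaluated on such data would be expected to assign little attribution to the other features.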

shap_enhanced.tools.datasets.generate_synthetic_tabular(n_samples=500, n_features=5, sparse=True, model_type='nonlinear', sparsity=0.85, random_seed=42)[source]

Generate synthetic tabular data with optional sparsity and a configurable regression function.

Supports both linear and nonlinear target mappings. Optionally zeroes out random entries to simulate sparsity.

\[\begin{split}y = \begin{cases} X w & \text{(linear)} \\ \tanh(X w) + 0.1 (X w)^2 + \mathcal{N}(0, 0.1^2) & \text{(nonlinear)} \end{cases}\end{split}\]
Parameters:
  • n_samples (int) – Number of data samples.

  • n_features (int) – Number of features.

  • sparse (bool) – Whether to randomly zero entries to simulate sparsity.

  • model_type (str) – Type of regression model (“linear” or “nonlinear”).

  • sparsity (float) – Proportion of elements to set to zero if sparse.

  • random_seed (int) – Seed for reproducibility.

Returns:

Feature matrix, target vector, and true coefficient weights.

Return type:

(np.ndarray, np.ndarray, np.ndarray)
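A minimal sketch of the documented target mapping, again in plain NumPy. The function name `sketch_tabular`, the standard-normal features and weights, and the use of an independent Bernoulli mask for sparsity (so `sparsity` is the expected, not exact, fraction of zeros) are assumptions; the library may draw its data differently.

```python
import numpy as np

def sketch_tabular(n_samples=500, n_features=5, sparse=True,
                   model_type="nonlinear", sparsity=0.85, random_seed=42):
    """Hypothetical re-implementation of the documented target mapping."""
    rng = np.random.default_rng(random_seed)
    # Assumption: dense features start as i.i.d. standard normal.
    X = rng.normal(size=(n_samples, n_features))
    if sparse:
        # Zero each entry independently with probability `sparsity`.
        mask = rng.random(size=X.shape) < sparsity
        X[mask] = 0.0
    # Assumption: ground-truth weights are standard normal.
    w = rng.normal(size=n_features)
    z = X @ w
    if model_type == "linear":
        y = z
    elif model_type == "nonlinear":
        # tanh(Xw) + 0.1 (Xw)^2 + N(0, 0.1^2), as in the formula above.
        y = np.tanh(z) + 0.1 * z**2 + rng.normal(scale=0.1, size=n_samples)
    else:
        raise ValueError(f"unknown model_type: {model_type!r}")
    return X, y, w

X_tab, y_tab, w = sketch_tabular(n_samples=200, n_features=6)
print(X_tab.shape, y_tab.shape, w.shape)
```

Returning `w` alongside the data is what makes the benchmark useful: in the linear case the weights fully determine each feature's contribution, so attributions can be checked against them directly.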