datagenerator.data_generation

Attributes

data

Classes

BaseFeaturesDataset

Generic synthetic dataset of continuous features for AI explainability.

WeightedFeaturesDataset

A class extending BaseFeaturesDataset with support for weighted features.

Functions

load_dataset(→ BaseFeaturesDataset | WeightedFeaturesDataset | None)

Loads a previously saved dataset from a binary pickle file.

generate_csv(→ None)

Generates a CSV file with random data for a specified number of rows and features.

Module Contents

class datagenerator.data_generation.BaseFeaturesDataset(seed: int = 0, n_features: int = 2, n_samples: int = 10, distribution: str | torch.distributions.Distribution = 'normal', distribution_params: Dict[str, Any] | None = None, **kwargs: Any)

Bases: torch.utils.data.Dataset

Generic synthetic dataset of continuous features for AI explainability.

This class creates a dataset of continuous features drawn from a specified distribution, for use in training and evaluating AI models. It supports reproducible sample creation, customizable feature and sample counts, and a variety of distributions.

seed

Seed for random number generators to ensure reproducibility.

Type:

int

n_features

Number of features in the dataset.

Type:

int

n_samples

Number of samples in the dataset.

Type:

int

distribution

Distribution used for generating the samples. Defaults to ‘normal’, which uses a multivariate normal distribution.

Type:

str | torch.distributions.Distribution

sample_std_dev

Standard deviation of the noise added to the samples.

Type:

float

label_std_dev

Standard deviation of the noise added to generate labels.

Type:

float

samples

Generated samples.

Type:

torch.Tensor

labels

Generated labels with optional noise.

Type:

torch.Tensor

ground_truth_attribute

Name of the attribute considered as ground truth.

Type:

str

subset_data

List of attributes to be included in subsets.

Type:

list[str]

subset_attribute

Additional attributes to be considered in subsets.

Type:

list[str]

cat_features

List of categorical feature names, used in perturbations.

Type:

list[str]

Initializes a dataset of continuous features based on a specified distribution.

Parameters:
  • seed (int) – For sample creation reproducibility. Defaults to 0.

  • n_features (int) – Number of features for each sample. Defaults to 2.

  • n_samples (int) – Total number of samples. Defaults to 10.

  • distribution (str | torch.distributions.Distribution) – Distribution to use for generating samples. Defaults to “normal”, which indicates a multivariate normal distribution.

  • distribution_params (dict, optional) – Parameters for the distribution if a string identifier is used. Defaults to None.

  • **kwargs

    Arbitrary keyword arguments, including:

    • sample_std_dev (float): Standard deviation for sample creation noise. Defaults to 1.

    • label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.

Raises:
  • ValueError – If an unsupported string identifier is provided.

  • TypeError – If ‘distribution’ is neither a string nor a torch.distributions.Distribution instance.
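A minimal usage sketch based on the constructor signature above (parameter values are illustrative):

    from datagenerator.data_generation import BaseFeaturesDataset

    # Reproducible dataset: 100 samples, 5 normally distributed features,
    # with mild label noise via the documented label_std_dev kwarg.
    dataset = BaseFeaturesDataset(
        seed=42,
        n_features=5,
        n_samples=100,
        distribution="normal",
        label_std_dev=0.1,
    )

    print(len(dataset))           # 100
    print(dataset.samples.shape)  # expected: torch.Size([100, 5])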

label_noise
features = 'samples'
labels
ground_truth_attribute = 'samples'
subset_data = ['samples']
subset_attribute = ['perturb_function', 'name']
cat_features = []
name = 'BaseFeaturesDataset'
__len__() int

Returns the total number of samples in the dataset.

Returns:

Total number of samples.

Return type:

int

__getitem__(idx: int, others: List[str] = ['ground_truth_attribute']) Tuple[torch.Tensor, torch.Tensor] | Tuple[torch.Tensor, torch.Tensor, Dict[str, torch.Tensor]]

Retrieves a sample and its label, along with optional attributes, by index.

Parameters:
  • idx (int) – Index of the sample to retrieve.

  • others (list[str]) – Additional attributes to be retrieved with the sample and label. Defaults to [“ground_truth_attribute”].

Returns:

A tuple containing the sample and label at the specified index, and optionally a dictionary of additional attributes if requested.

Return type:

tuple

Raises:

IndexError – If the specified index is out of the bounds of the dataset.
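A short indexing sketch; the three-tuple return is assumed whenever others is non-empty, as with the default:

    # Default others=["ground_truth_attribute"] yields (sample, label, extras).
    sample, label, extras = dataset[0]
    print(sample.shape, extras["ground_truth_attribute"].shape)

    # An empty others list is assumed to yield the plain two-tuple variant.
    sample, label = dataset.__getitem__(0, others=[])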

split(split_lengths: List[float] = [0.7, 0.3]) Tuple[BaseFeaturesDataset, BaseFeaturesDataset]

Splits the dataset into subsets based on specified proportions.

Parameters:

split_lengths (list[float]) – Proportions to split the dataset into. The values must sum up to 1. Defaults to [0.7, 0.3] for a 70%/30% split.

Returns:

A tuple containing the split subsets of the dataset.

Return type:

tuple[BaseFeaturesDataset, BaseFeaturesDataset]
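For example, using the default 70%/30% proportions (which sum to 1, as required):

    train_ds, test_ds = dataset.split(split_lengths=[0.7, 0.3])
    print(len(train_ds), len(test_ds))  # roughly 70% and 30% of n_samples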

save_dataset(file_name: str, directory_path: str = os.getcwd()) None

Saves the dataset to a pickle file in the specified directory.

Parameters:
  • file_name (str) – Name of the file to save the dataset.

  • directory_path (str) – Path to the directory where the file will be saved. Defaults to the current working directory.
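A usage sketch; the file name and directory are illustrative:

    import tempfile

    tmp_dir = tempfile.mkdtemp()
    dataset.save_dataset("my_dataset.pkl", directory_path=tmp_dir)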

_validate_inputs(seed: int, n_features: int, n_samples: int) Tuple[int, int, int]

Validates the input parameters for dataset initialization.

Parameters:
  • seed (int) – Seed for random number generation.

  • n_features (int) – Number of features.

  • n_samples (int) – Number of samples.

Returns:

Validated seed, number of features, and number of samples.

Return type:

tuple[int, int, int]

Raises:

ValueError – If any input is not an integer or is out of an expected range.

_init_noise_parameters(kwargs: Dict[str, Any]) Tuple[float, float]

Initializes noise parameters from keyword arguments.

Parameters:

kwargs – Keyword arguments passed to the initializer.

Returns:

Initialized sample and label standard deviations.

Return type:

tuple[float, float]

Raises:

ValueError – If the standard deviations are not positive numbers.

_init_samples(n_samples: int, distribution: str | torch.distributions.Distribution, distribution_params: Dict[str, Any] | None = None) Tuple[torch.Tensor, torch.distributions.Distribution]

Initializes samples based on the specified distribution and sample size.

This method supports initialization using either a predefined distribution name (string) or directly with a torch.distributions.Distribution instance.

Parameters:
  • n_samples (int) – Number of samples to generate, must be positive.

  • distribution (str | torch.distributions.Distribution) – The distribution to use for generating samples. Can be a string for predefined distributions (‘normal’, ‘uniform’, ‘poisson’) or an instance of torch.distributions.Distribution.

  • distribution_params (dict, optional) – Parameters for the distribution if a string identifier is used. Examples:

    • For ‘normal’: {'mean': torch.zeros(n_features), 'stddev': torch.ones(n_features)}

    • For ‘uniform’: {'low': -1.0, 'high': 1.0}

    • For ‘poisson’: {'rate': 3.0}

Returns:

A tuple containing generated samples (torch.Tensor) with shape [n_samples, n_features] and the distribution instance used.

Return type:

tuple[torch.Tensor, torch.distributions.Distribution]

Raises:
  • ValueError – If ‘distribution’ is a string and is not one of the supported identifiers or necessary parameters are missing.

  • TypeError – If ‘distribution’ is neither a string identifier nor a torch.distributions.Distribution instance, or if the provided Distribution instance cannot generate a torch.Tensor.

  • RuntimeError – If the generated samples do not match the expected shape and cannot be adjusted.
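Two hedged construction sketches mirroring the parameter examples above, one with a string identifier and one with a Distribution instance:

    import torch
    from datagenerator.data_generation import BaseFeaturesDataset

    # String identifier with explicit distribution_params.
    uniform_ds = BaseFeaturesDataset(
        n_samples=50,
        n_features=3,
        distribution="uniform",
        distribution_params={"low": -1.0, "high": 1.0},
    )

    # A torch.distributions.Distribution instance passed directly.
    laplace = torch.distributions.Laplace(loc=torch.zeros(3), scale=torch.ones(3))
    laplace_ds = BaseFeaturesDataset(n_samples=50, n_features=3, distribution=laplace)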

perturb_function(noise_scale: float = 0.01, cat_resample_prob: float = 0.2, run_infidelity_decorator: bool = True, multiply_by_inputs: bool = False) Callable

Generates a perturbation function to be used for evaluating feature attribution methods. Applies Gaussian noise to continuous features and resampling to categorical features.

Parameters:
  • noise_scale (float) – A standard deviation of the Gaussian noise added to the continuous features. Defaults to 0.01.

  • cat_resample_prob (float) – Probability of resampling a categorical feature. Defaults to 0.2.

  • run_infidelity_decorator (bool) – Set to True for the returned function to be compatible with infidelity; set to False for sensitivity. Defaults to True.

  • multiply_by_inputs (bool) – Parameter forwarded to the decorator. Defaults to False.

Returns:

A perturbation function compatible with Captum.

Return type:

perturb_func (Callable)
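A sketch of plugging the returned function into Captum's infidelity metric. Since generate_model() is abstract on this class, a WeightedFeaturesDataset (documented below) is assumed, as is a single-output regression model:

    from captum.attr import Saliency
    from captum.metrics import infidelity

    ds = WeightedFeaturesDataset(n_features=3, n_samples=100)
    model = ds.generate_model()
    perturb_func = ds.perturb_function(noise_scale=0.05)

    inputs = ds.samples[:8].clone().requires_grad_()
    attributions = Saliency(model).attribute(inputs)

    # infidelity expects perturb_func to return (perturbations, perturbed
    # inputs); run_infidelity_decorator=True (the default) is documented
    # to make the returned function compatible.
    score = infidelity(model, perturb_func, inputs, attributions)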

abstract generate_model() Any

Generates a corresponding model for current dataset.

Raises:

NotImplementedError – If the method is not implemented by a subclass.

abstract property default_metric: Callable

The default metric for evaluating the performance of explanation methods applied to this dataset.

Raises:

NotImplementedError – If the property is not implemented by a subclass.
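A hypothetical subclass sketch satisfying both abstract members; the exact contract the pipeline expects from them is not specified on this page:

    import torch
    from datagenerator.data_generation import BaseFeaturesDataset

    class MyDataset(BaseFeaturesDataset):
        def generate_model(self):
            # Any torch module mapping n_features inputs to one output.
            return torch.nn.Linear(self.n_features, 1)

        @property
        def default_metric(self):
            # Returned as a class, mirroring WeightedFeaturesDataset below.
            return torch.nn.MSELoss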

class datagenerator.data_generation.WeightedFeaturesDataset(seed: int = 0, n_features: int = 2, n_samples: int = 10, distribution: str | torch.distributions.Distribution = 'normal', weight_range: Tuple[float, float] = (-1.0, 1.0), weights: torch.Tensor | None = None, **kwargs: Any)

Bases: BaseFeaturesDataset

A class extending BaseFeaturesDataset with support for weighted features.

This class allows for creating a synthetic dataset with continuous features, where each feature can be weighted differently. This is particularly useful for scenarios where the impact of different features on the labels needs to be artificially manipulated or studied.

Inherits from:

BaseFeaturesDataset: The base class for creating continuous feature datasets.

weights

Weights applied to each feature.

Type:

torch.Tensor

weight_range

The range (min, max) within which random weights are generated.

Type:

tuple

weighted_samples

The samples after applying weights.

Type:

torch.Tensor

Initializes a WeightedFeaturesDataset object.

Parameters:
  • seed (int) – Seed for reproducibility. Defaults to 0.

  • n_features (int) – Number of features. Defaults to 2.

  • n_samples (int) – Number of samples. Defaults to 10.

  • distribution (str | torch.distributions.Distribution) – Distribution to use for generating samples. Defaults to “normal”.

  • weight_range (tuple) – Range (min, max) for generating random weights. Defaults to (-1.0, 1.0).

  • weights (torch.Tensor, optional) – Specific weights for each feature. If None, weights are generated randomly within weight_range. Defaults to None.

  • **kwargs

    Arbitrary keyword arguments passed to the base class constructor, including:

    • sample_std_dev (float): Standard deviation for sample creation noise. Defaults to 1.

    • label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.
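A usage sketch with explicit and with randomly generated weights (values are illustrative):

    import torch
    from datagenerator.data_generation import WeightedFeaturesDataset

    # Explicit per-feature weights; length must equal n_features.
    weighted_ds = WeightedFeaturesDataset(
        seed=7,
        n_features=3,
        n_samples=200,
        weights=torch.tensor([0.5, -1.0, 2.0]),
    )
    print(weighted_ds.weighted_samples.shape)  # samples after applying weights

    # Omitting weights generates them randomly within weight_range.
    random_ds = WeightedFeaturesDataset(n_features=3, weight_range=(-2.0, 2.0))
    print(random_ds.weights)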

weighted_samples
label_noise
labels
features = 'samples'
ground_truth_attribute = 'weighted_samples'
subset_data = ['samples', 'weighted_samples']
subset_attribute
_initialize_weights(weights: torch.Tensor | None, weight_range: Tuple[float, float]) Tuple[torch.Tensor, Tuple[float, float]]

Initializes or validates the weights for each feature.

If weights are not provided, they are randomly generated within the specified range.

Parameters:
  • weights (torch.Tensor | None) – If provided, these weights are used directly for the features. Must be a Tensor with length equal to n_features.

  • weight_range (tuple) – Specifies the minimum and maximum values used to generate weights if weights is None. Expected format: (min_value, max_value), where both are floats.

Returns:

The validated or generated weights and the effective weight range used.

Return type:

tuple[torch.Tensor, tuple]

Raises:
  • AssertionError – If the provided weights are not a torch.Tensor or their length does not match the number of features.

  • ValueError – If weight_range is improperly specified.

generate_model() Any

Generates and returns a neural network model configured to use the weighted features of this dataset.

The model is designed to reflect the differential impact of each feature as specified by the weights.

Returns:

A neural network model that includes mechanisms to account for feature weights, suitable for tasks requiring understanding of feature importance.

Return type:

model.ContinuousFeaturesNN
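A forward-pass sketch; the model's exact interface is defined by model.ContinuousFeaturesNN, so a standard [batch, n_features] → predictions contract is assumed:

    import torch

    nn_model = weighted_ds.generate_model()
    with torch.no_grad():
        preds = nn_model(weighted_ds.samples[:4])
    print(preds.shape)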

property default_metric: Callable

The default metric for evaluating the performance of explanation methods applied to this dataset.

For this dataset, the default metric is the Mean Squared Error (MSE) loss function.

Returns:

A class that wraps around the default metric, to be instantiated within the pipeline.

Return type:

type
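For illustration, retrieving and instantiating the metric class outside the pipeline (the wrapper's call signature is not specified here):

    metric_cls = weighted_ds.default_metric  # a class wrapping MSE loss
    metric = metric_cls()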

datagenerator.data_generation.load_dataset(file_path: str, directory_path: str = os.getcwd()) BaseFeaturesDataset | WeightedFeaturesDataset | None

Loads a previously saved dataset from a binary pickle file.

This function is designed to retrieve datasets that have been saved to disk, facilitating easy sharing and reloading of data for analysis or model training.

Parameters:
  • file_path (str) – The name of the file to load.

  • directory_path (str) – The directory where the file is located. Defaults to the current working directory.

Returns:

The loaded dataset object, or None if the file does not exist or an error occurs.

Return type:

BaseFeaturesDataset | WeightedFeaturesDataset | None
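A loading sketch paired with the save_dataset example above; the file name is illustrative:

    from datagenerator.data_generation import load_dataset

    restored = load_dataset("my_dataset.pkl", directory_path=tmp_dir)
    if restored is None:
        # None signals a missing file or an unpickling error.
        raise FileNotFoundError("dataset pickle could not be loaded")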

datagenerator.data_generation.generate_csv(file_label: str, num_rows: int = 5000, num_features: int = 20) None

Generates a CSV file with random data for a specified number of rows and features.

This function helps create synthetic datasets for testing or development purposes. Each row will have a random label and a specified number of features filled with random values.

Parameters:
  • file_label (str) – The base name for the CSV file.

  • num_rows (int) – Number of rows (samples) to generate. Defaults to 5000.

  • num_features (int) – Number of features to generate for each sample. Defaults to 20.
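A usage sketch; note that the exact output file name derived from file_label is not specified in these docs:

    from datagenerator.data_generation import generate_csv

    generate_csv("synthetic_test", num_rows=1000, num_features=10)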

datagenerator.data_generation.data