datagenerator
Submodules
- datagenerator.backgrounds
- datagenerator.boolean
- datagenerator.conflicting
- datagenerator.data_generation
- datagenerator.foregrounds
- datagenerator.image_generation
- datagenerator.interacting_features
- datagenerator.pertinent_negatives
- datagenerator.shattered_grad
- datagenerator.text_datasets
- datagenerator.uncertainty_aware
Attributes
Classes
- Generic synthetic dataset of continuous features for AI explainability.
- A class extending BaseFeaturesDataset with support for weighted features.
- Generic synthetic dataset based on a propositional formula.
- Generic synthetic dataset based on a propositional formula.
- Generic synthetic dataset based on a propositional formula.
- Generic synthetic dataset with feature cancellation capabilities.
- A dataset for images where each image consists of a background and a foreground overlay.
- Creates an image dataset where each image comprises a background image and a foreground image.
- A dataset for images with specified configurations for image generation, supporting both balanced and imbalanced datasets.
- A dataset subclass for modeling interactions between categorical and continuous features within weighted datasets.
- A dataset designed to investigate the impact of pertinent negative (PN) features on model predictions.
- A class intended to generate data and weights that exhibit shattered gradient phenomena.
- A dataset designed to investigate how feature attribution methods treat inputs
- A PyTorch Dataset for text data with trigger words and feature masks, designed for explainable AI (XAI) tasks.
Functions
- Loads a previously saved dataset from a binary pickle file.
- Generates a CSV file with random data for a specified number of rows and features.
Package Contents
- class datagenerator.BaseFeaturesDataset(seed: int = 0, n_features: int = 2, n_samples: int = 10, distribution: str | torch.distributions.Distribution = 'normal', distribution_params: Dict[str, Any] | None = None, **kwargs: Any)
Bases: torch.utils.data.Dataset
Generic synthetic dataset of continuous features for AI explainability.
This class creates a dataset of continuous features based on a specified distribution, which can be used for training and evaluating AI models. It allows for reproducible sample creation, customizable features and sample sizes, and supports various distributions.
- seed
Seed for random number generators to ensure reproducibility.
- Type:
int
- n_features
Number of features in the dataset.
- Type:
int
- n_samples
Number of samples in the dataset.
- Type:
int
- distribution
Distribution used for generating the samples. Defaults to ‘normal’, which uses a multivariate normal distribution.
- Type:
str | torch.distributions.Distribution
- sample_std_dev
Standard deviation of the noise added to the samples.
- Type:
float
- label_std_dev
Standard deviation of the noise added to generate labels.
- Type:
float
- samples
Generated samples.
- Type:
torch.Tensor
- labels
Generated labels with optional noise.
- Type:
torch.Tensor
- ground_truth_attribute
Name of the attribute considered as ground truth.
- Type:
str
- subset_data
List of attributes to be included in subsets.
- Type:
list[str]
- subset_attribute
Additional attributes to be considered in subsets.
- Type:
list[str]
- cat_features
List of categorical feature names, used in perturbations.
- Type:
list[str]
Initializes a dataset of continuous features based on a specified distribution.
- Parameters:
seed (int) – For sample creation reproducibility. Defaults to 0.
n_features (int) – Number of features for each sample. Defaults to 2.
n_samples (int) – Total number of samples. Defaults to 10.
distribution (str | torch.distributions.Distribution) – Distribution to use for generating samples. Defaults to “normal”, which indicates multivariate normal distribution.
distribution_params (dict, optional) – Parameters for the distribution if a string identifier is used. Defaults to None.
**kwargs –
Arbitrary keyword arguments, including:
sample_std_dev (float): Standard deviation for sample creation noise. Defaults to 1.
label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.
- Raises:
ValueError – If an unsupported string identifier is provided.
TypeError – If ‘distribution’ is neither a string nor a torch.distributions.Distribution instance.
- label_noise
- features = 'samples'
- labels
- ground_truth_attribute = 'samples'
- subset_data = ['samples']
- subset_attribute = ['perturb_function', 'name']
- cat_features = []
- name = 'BaseFeaturesDataset'
- __len__() int
Returns the total number of samples in the dataset.
- Returns:
Total number of samples.
- Return type:
int
- __getitem__(idx: int, others: List[str] = ['ground_truth_attribute']) Tuple[torch.Tensor, torch.Tensor] | Tuple[torch.Tensor, torch.Tensor, Dict[str, torch.Tensor]]
Retrieves a sample and its label, along with optional attributes, by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list[str]) – Additional attributes to be retrieved with the sample and label. Defaults to [“ground_truth_attribute”].
- Returns:
- A tuple containing the sample and label at the specified index,
and optionally, a dictionary of additional attributes if requested.
- Return type:
tuple
- Raises:
IndexError – If the specified index is out of the bounds of the dataset.
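As a minimal usage sketch (assuming the package is importable as datagenerator, per the qualified names in this reference; the three-way unpacking is an assumption based on the default others):

    from datagenerator import BaseFeaturesDataset

    # 100 reproducible samples of 5 standard-normal features.
    ds = BaseFeaturesDataset(seed=42, n_features=5, n_samples=100)
    print(len(ds))  # 100

    # With the default `others`, indexing also returns a dict of extra
    # attributes alongside the sample and label.
    sample, label, extras = ds[0]
    print(sample.shape)  # torch.Size([5])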
- split(split_lengths: List[float] = [0.7, 0.3]) Tuple[BaseFeaturesDataset, BaseFeaturesDataset]
Splits the dataset into subsets based on specified proportions.
- Parameters:
split_lengths (list[float]) – Proportions to split the dataset into. The values must sum up to 1. Defaults to [0.7, 0.3] for a 70%/30% split.
- Returns:
- A tuple containing the split subsets
of the dataset.
- Return type:
tuple[BaseFeaturesDataset]
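For example, a hedged sketch of a 70%/30% split:

    from datagenerator import BaseFeaturesDataset

    ds = BaseFeaturesDataset(n_samples=100)
    train_ds, test_ds = ds.split([0.7, 0.3])  # proportions must sum to 1
    print(len(train_ds), len(test_ds))        # expected: 70 30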
- save_dataset(file_name: str, directory_path: str = os.getcwd()) None
Saves the dataset to a pickle file in the specified directory.
- Parameters:
file_name (str) – Name of the file to save the dataset.
directory_path (str) – Path to the directory where the file will be saved. Defaults to the current working directory.
- _validate_inputs(seed: int, n_features: int, n_samples: int) Tuple[int, int, int]
Validates the input parameters for dataset initialization.
- Parameters:
seed (int) – Seed for random number generation.
n_features (int) – Number of features.
n_samples (int) – Number of samples.
- Returns:
Validated seed, number of features, and number of samples.
- Return type:
tuple[int, int, int]
- Raises:
ValueError – If any input is not an integer or is out of an expected range.
- _init_noise_parameters(kwargs: Dict[str, Any]) Tuple[float, float]
Initializes noise parameters from keyword arguments.
- Parameters:
kwargs – Keyword arguments passed to the initializer.
- Returns:
Initialized sample and label standard deviations.
- Return type:
tuple
- Raises:
ValueError – If the standard deviations are not positive numbers.
- _init_samples(n_samples: int, distribution: str | torch.distributions.Distribution, distribution_params: Dict[str, Any] | None = None) Tuple[torch.Tensor, torch.distributions.Distribution]
Initializes samples based on the specified distribution and sample size.
This method supports initialization using either a predefined distribution name (string) or directly with a torch.distributions.Distribution instance.
- Parameters:
n_samples (int) – Number of samples to generate, must be positive.
distribution (str | torch.distributions.Distribution) – The distribution to use for generating samples. Can be a string for predefined distributions (‘normal’, ‘uniform’, ‘poisson’) or an instance of torch.distributions.Distribution.
distribution_params (dict, optional) – Parameters for the distribution if a string identifier is used. Examples: - For ‘normal’: {‘mean’: torch.zeros(n_features), ‘stddev’: torch.ones(n_features)} - For ‘uniform’: {‘low’: -1.0, ‘high’: 1.0} - For ‘poisson’: {‘rate’: 3.0}
- Returns:
- A tuple containing generated samples (torch.Tensor) with shape [n_samples, n_features]
and the distribution instance used.
- Return type:
tuple
- Raises:
ValueError – If ‘distribution’ is a string and is not one of the supported identifiers or necessary parameters are missing.
TypeError – If ‘distribution’ is neither a string identifier nor a torch.distributions.Distribution instance, or if the provided Distribution instance cannot generate a torch.Tensor.
RuntimeError – If the generated samples do not match the expected shape and cannot be adjusted.
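A sketch of both initialization paths described above (a string identifier with parameters, and a Distribution instance passed directly):

    import torch
    from datagenerator import BaseFeaturesDataset

    # String identifier plus explicit parameters.
    ds_uniform = BaseFeaturesDataset(
        n_features=3,
        n_samples=50,
        distribution="uniform",
        distribution_params={"low": -1.0, "high": 1.0},
    )

    # A torch.distributions.Distribution instance passed directly.
    custom = torch.distributions.MultivariateNormal(
        loc=torch.zeros(3), covariance_matrix=torch.eye(3)
    )
    ds_custom = BaseFeaturesDataset(n_features=3, n_samples=50, distribution=custom)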
- perturb_function(noise_scale: float = 0.01, cat_resample_prob: float = 0.2, run_infidelity_decorator: bool = True, multipy_by_inputs: bool = False) Callable
Generates a perturbation function to be used for feature attribution method evaluation. Applies Gaussian noise to continuous features and resampling to categorical features.
- Parameters:
noise_scale (float) – The standard deviation of the Gaussian noise added to the continuous features. Defaults to 0.01.
cat_resample_prob (float) – Probability of resampling a categorical feature. Defaults to 0.2.
run_infidelity_decorator (bool) – Set to True if the returned function should be compatible with the infidelity metric; set to False for sensitivity. Defaults to True.
multiply_by_inputs (bool) – Parameter passed to the decorator. Defaults to False.
- Returns:
A perturbation function compatible with Captum.
- Return type:
perturb_func (function)
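A hedged sketch of wiring the returned callable into Captum’s infidelity metric. WeightedFeaturesDataset (below) is used because generate_model is abstract on this class, and the attribution method is chosen for illustration only:

    from captum.attr import InputXGradient
    from captum.metrics import infidelity
    from datagenerator import WeightedFeaturesDataset

    ds = WeightedFeaturesDataset(seed=0, n_features=4, n_samples=32)
    model = ds.generate_model()
    perturb_fn = ds.perturb_function(noise_scale=0.05)  # Captum-compatible

    inputs = ds.samples
    # `target` may be required if the model output is not a scalar per sample.
    attributions = InputXGradient(model).attribute(inputs)
    score = infidelity(model, perturb_fn, inputs, attributions)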
- abstract generate_model() Any
Generates a corresponding model for current dataset.
- Raises:
NotImplementedError – If the method is not implemented by a subclass.
- property default_metric: Callable
- Abstractmethod:
The default metric for evaluating the performance of explanation methods applied to this dataset.
- Raises:
NotImplementedError – If the property is not implemented by a subclass.
- class datagenerator.WeightedFeaturesDataset(seed: int = 0, n_features: int = 2, n_samples: int = 10, distribution: str | torch.distributions.Distribution = 'normal', weight_range: Tuple[float, float] = (-1.0, 1.0), weights: torch.Tensor | None = None, **kwargs: Any)
Bases: BaseFeaturesDataset
A class extending BaseFeaturesDataset with support for weighted features.
This class allows for creating a synthetic dataset with continuous features, where each feature can be weighted differently. This is particularly useful for scenarios where the impact of different features on the labels needs to be artificially manipulated or studied.
- Inherits from:
BaseFeaturesDataset: The base class for creating continuous feature datasets.
- weights
Weights applied to each feature.
- Type:
torch.Tensor
- weight_range
The range (min, max) within which random weights are generated.
- Type:
tuple
- weighted_samples
The samples after applying weights.
- Type:
torch.Tensor
Initializes a WeightedFeaturesDataset object.
- Parameters:
seed (int) – Seed for reproducibility. Defaults to 0.
n_features (int) – Number of features. Defaults to 2.
n_samples (int) – Number of samples. Defaults to 10.
distribution (str) – Type of distribution to use for generating samples. Defaults to “normal”.
weight_range (tuple) – Range (min, max) for generating random weights. Defaults to (-1.0, 1.0).
weights (torch.Tensor, optional) – Specific weights for each feature. If None, weights are generated randomly within weight_range. Defaults to None.
**kwargs –
Arbitrary keyword arguments passed to the base class constructor, including:
sample_std_dev (float): Standard deviation for sample creation noise. Defaults to 1.
label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.
- weighted_samples
- label_noise
- labels
- features = 'samples'
- ground_truth_attribute = 'weighted_samples'
- subset_data = ['samples', 'weighted_samples']
- subset_attribute
- _initialize_weights(weights: torch.Tensor | None, weight_range: Tuple[float, float]) Tuple[torch.Tensor, Tuple[float, float]]
Initializes or validates the weights for each feature.
If weights are not provided, they are randomly generated within the specified range.
- Parameters:
weights (torch.Tensor | NoneType) – If provided, these weights are used directly for the features. Must be a Tensor with a length equal to n_features.
weight_range (tuple) – Specifies the minimum and maximum values used to generate weights if weights is None. Expected format: (min_value, max_value), where both are floats.
- Returns:
The validated or generated weights and the effective weight range used.
- Return type:
tuple[torch.Tensor, tuple]
- Raises:
AssertionError – If the provided weights do not match the number of features or are not a torch.Tensor when provided.
ValueError – If weight_range is improperly specified.
- generate_model() Any
Generates and returns a neural network model configured to use the weighted features of this dataset.
The model is designed to reflect the differential impact of each feature as specified by the weights.
- Returns:
- A neural network model that includes mechanisms to account for feature weights,
suitable for tasks requiring understanding of feature importance.
- Return type:
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the Mean Squared Error (MSE) loss function.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
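A short construction sketch with explicit per-feature weights:

    import torch
    from datagenerator import WeightedFeaturesDataset

    weights = torch.tensor([0.5, -2.0])
    ds = WeightedFeaturesDataset(seed=0, n_features=2, n_samples=10, weights=weights)

    print(ds.weighted_samples.shape)  # torch.Size([10, 2])
    model = ds.generate_model()       # network reflecting the feature weights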
- datagenerator.load_dataset(file_path: str, directory_path: str = os.getcwd()) BaseFeaturesDataset | WeightedFeaturesDataset | None
Loads a previously saved dataset from a binary pickle file.
This function is designed to retrieve datasets that have been saved to disk, facilitating easy sharing and reloading of data for analysis or model training.
- Parameters:
file_path (str) – The name of the file to load.
directory_path (str) – The directory where the file is located. Defaults to the current working directory.
- Returns:
The loaded dataset object, or None if the file does not exist or an error occurs.
- Return type:
Object | NoneType
- datagenerator.generate_csv(file_label: str, num_rows: int = 5000, num_features: int = 20) None
Generates a CSV file with random data for a specified number of rows and features.
This function helps create synthetic datasets for testing or development purposes. Each row will have a random label and a specified number of features filled with random values.
- Parameters:
file_label (str) – The base name for the CSV file.
num_rows (int) – Number of rows (samples) to generate. Defaults to 5000.
num_features (int) – Number of features to generate for each sample. Defaults to 20.
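A round-trip sketch combining save_dataset with the two helpers above (file names are illustrative; whether an extension is appended automatically is not documented here):

    from datagenerator import BaseFeaturesDataset, generate_csv, load_dataset

    ds = BaseFeaturesDataset(n_samples=20)
    ds.save_dataset("toy_dataset.pkl")          # written to os.getcwd()
    restored = load_dataset("toy_dataset.pkl")  # returns None on failure
    print(restored is not None and len(restored) == 20)

    generate_csv("toy", num_rows=100, num_features=10)  # CSV of random data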
- datagenerator.data
- class datagenerator.BooleanAndDataset(n_features: int = 2, n_samples: int = 10, seed: int = 0)
Bases: BooleanDataset
Generic synthetic dataset based on a propositional formula.
The dataset corresponds to sampling rows from the truth table of the given propositional formula. If n_samples is no larger than the size of the truth table, then the generated dataset will always contain non-duplicate samples of the truth table. Otherwise, the dataset will still contain rows for the entire truth table but will also contain duplicates.
If the input for atoms is None, the corresponding attribute is by default assigned as the atoms that are extracted from the given formula.
- Inherits from:
BaseFeaturesDataset: The base class for creating continuous feature datasets.
- formula
A propositional formula for which the dataset is generated.
- Type:
sympy.core.function.FunctionClass
- atoms
The ordered collection of propositional atoms that were used within the propositional formula.
- Type:
tuple
- seed
Seed for random number generators to ensure reproducibility.
- Type:
int
- n_samples
Number of samples in the dataset.
- Type:
int
Initializes a BooleanAndDataset object.
- Parameters:
n_features (int) – Number of features (propositional atoms) in the conjunction. Defaults to 2.
n_samples (int) – Number of samples to generate for the dataset. Defaults to 10.
seed (int) – Seed for random number generation, ensuring reproducibility. Defaults to 0.
- n_features = 2
- ground_truth
- ground_truth_attribute = 'ground_truth'
- create_baselines() None
- __getitem__(idx: int, others: List[str] = ['baseline', 'ground_truth_attribute']) Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to ['baseline', 'ground_truth_attribute'].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
- generate_model() torch.nn.Module
Generates a neural network model using the given propositional formula and atoms.
- Returns:
A neural network model tailored to the dataset’s propositional formula.
- Return type:
torch.nn.Module
- create_ground_truth() torch.Tensor
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the infidelity metric with the default perturb function.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
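A usage sketch; the conjunction formula is built internally, so only the size parameters are supplied:

    from datagenerator import BooleanAndDataset

    ds = BooleanAndDataset(n_features=2, n_samples=4, seed=0)
    model = ds.generate_model()
    item = ds[0]  # sample and label, plus the default `others` extras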
- class datagenerator.BooleanDataset(formula: sympy.core.function.FunctionClass, atoms: Iterable | None = None, seed: int = 0, n_samples: int = 10)
Bases: xaiunits.datagenerator.data_generation.BaseFeaturesDataset
Generic synthetic dataset based on a propositional formula.
The dataset corresponds to sampling rows from the truth table of the given propositional formula. If n_samples is no larger than the size of the truth table, then the generated dataset will always contain non-duplicate samples of the truth table. Otherwise, the dataset will still contain rows for the entire truth table but will also contain duplicates.
If the input for atoms is None, the corresponding attribute is by default assigned as the atoms that are extracted from the given formula.
- Inherits from:
BaseFeaturesDataset: The base class for creating continuous feature datasets.
- formula
A propositional formula for which the dataset is generated.
- Type:
sympy.core.function.FunctionClass
- atoms
The ordered collection of propositional atoms that were used within the propositional formula.
- Type:
tuple
- seed
Seed for random number generators to ensure reproducibility.
- Type:
int
- n_samples
Number of samples in the dataset.
- Type:
int
Initializes a BooleanDataset object.
- Parameters:
formula (sympy.core.function.FunctionClass) – A propositional formula for dataset generation.
atoms (Iterable, optional) – Ordered collection of propositional atoms used in the formula. Defaults to None.
seed (int) – Seed for random number generation, ensuring reproducibility. Defaults to 0.
n_samples (int) – Number of samples to generate for the dataset. Defaults to 10.
- atoms
- formula
- subset_data = ['samples']
- subset_attribute = ['perturb_function', 'default_metric', 'generate_model', 'name']
- cat_features
- name = 'BooleanDataset'
- _initialize_samples_labels(n_samples: int) Tuple[torch.Tensor, torch.Tensor]
Initializes the samples and labels of the dataset.
- Parameters:
n_samples (int) – number of samples/labels contained in the dataset.
- Returns:
- Tuple containing the generated samples
and corresponding labels of the dataset.
- Return type:
tuple[Tensor, Tensor]
- perturb_function(cat_resample_prob: float = 0.2, run_infidelity_decorator: bool = True, multipy_by_inputs: bool = False) Callable
Generates a perturbation function to be used for XAI method evaluation. Applies Gaussian noise to continuous features and resampling to categorical features.
- Parameters:
cat_resample_prob (float) – Probability of resampling a categorical feature. Defaults to 0.2.
run_infidelity_decorator (bool) – Set to True if the returned function should be compatible with the infidelity metric; set to False for sensitivity. Defaults to True.
multiply_by_inputs (bool) – Parameter passed to the decorator. Defaults to False.
- Returns:
A perturbation function compatible with Captum.
- Return type:
perturb_func (function)
- generate_model() torch.nn.Module
Generates a neural network model using the given propositional formula and atoms.
- Returns:
A neural network model tailored to the dataset’s propositional formula.
- Return type:
torch.nn.Module
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the infidelity metric with the default perturb function.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
- __getitem__(idx: int, others: List[str] = []) Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to [].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
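A sketch with an explicit sympy formula (when atoms is None, they are extracted from the formula, as noted above; the two-way unpacking assumes the empty default for others):

    import sympy
    from datagenerator import BooleanDataset

    a, b, c = sympy.symbols("a b c")
    formula = (a & b) | ~c  # propositional formula over three atoms
    ds = BooleanDataset(formula, atoms=(a, b, c), n_samples=8, seed=0)

    model = ds.generate_model()  # network mirroring the formula
    sample, label = ds[0]        # `others` defaults to [] for this class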
- class datagenerator.BooleanOrDataset(n_features: int = 2, n_samples: int = 10, seed: int = 0)
Bases: BooleanDataset
Generic synthetic dataset based on a propositional formula.
The dataset corresponds to sampling rows from the truth table of the given propositional formula. If n_samples is no larger than the size of the truth table, then the generated dataset will always contain non-duplicate samples of the truth table. Otherwise, the dataset will still contain rows for the entire truth table but will also contain duplicates.
If the input for atoms is None, the corresponding attribute is by default assigned as the atoms that are extracted from the given formula.
- Inherits from:
BaseFeaturesDataset: The base class for creating continuous feature datasets.
- formula
A propositional formula for which the dataset is generated.
- Type:
sympy.core.function.FunctionClass
- atoms
The ordered collection of propositional atoms that were used within the propositional formula.
- Type:
tuple
- seed
Seed for random number generators to ensure reproducibility.
- Type:
int
- n_samples
Number of samples in the dataset.
- Type:
int
Initializes a BooleanOrDataset object.
- Parameters:
n_features (int) – Number of features (propositional atoms) in the disjunction. Defaults to 2.
n_samples (int) – Number of samples to generate for the dataset. Defaults to 10.
seed (int) – Seed for random number generation, ensuring reproducibility. Defaults to 0.
- n_features = 2
- ground_truth
- ground_truth_attribute = 'ground_truth'
- create_baselines() None
- __getitem__(idx: int, others: List[str] = ['baseline', 'ground_truth_attribute']) Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to ['baseline', 'ground_truth_attribute'].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
- generate_model() torch.nn.Module
Generates a neural network model using the given propositional formula and atoms.
- Returns:
A neural network model tailored to the dataset’s propositional formula.
- Return type:
torch.nn.Module
- create_ground_truth() torch.Tensor
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the infidelity metric with the default perturb function.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
- class datagenerator.ConflictingDataset(seed: int = 0, n_features: int = 2, n_samples: int = 10, distribution: str = 'normal', weight_range: Tuple[float, float] = (-1.0, 1.0), weights: torch.Tensor | None = None, cancellation_features: List[int] | None = None, cancellation_likelihood: float = 0.5)
Bases: xaiunits.datagenerator.WeightedFeaturesDataset
Generic synthetic dataset with feature cancellation capabilities.
Feature cancellations are based on likelihood. If cancellation_features are not provided, all features in each sample are candidates for cancellation, with a specified likelihood of each feature being canceled. Canceled features are negated in their contributions to the dataset, allowing for the analysis of model behavior under feature absence scenarios.
- Inherits from:
WeightedFeaturesDataset: Class extending BaseFeaturesDataset with support for weighted features
- cancellation_features
Indices of features subject to cancellation.
- Type:
list of int, optional
- cancellation_likelihood
Likelihood of feature cancellation, between 0 and 1.
- Type:
float
- cancellation_outcomes
Binary tensor indicating whether each feature in each sample is canceled.
- Type:
torch.Tensor
- cancellation_samples
Concatenation of samples with their cancellation outcomes.
- Type:
torch.Tensor
- cancellation_attributions
The attribution of each feature considering the cancellation.
- Type:
torch.Tensor
- cat_features
Categorical features derived from the cancellation samples.
- Type:
list
- ground_truth_attributions
Combined tensor of weighted samples and cancellation attributions for ground truth analysis.
- Type:
torch.Tensor
Initializes a ConflictingDataset object.
- Parameters:
seed (int) – Seed for random number generation, ensuring reproducibility. Defaults to 0.
n_features (int) – Number of features in each sample. Defaults to 2.
n_samples (int) – Number of samples to generate. Defaults to 10.
distribution (str) – Type of distribution to use for generating samples. Defaults to ‘normal’.
weight_range (tuple[float]) – Range (min, max) for generating random feature weights. Defaults to (-1.0, 1.0).
weights (torch.Tensor, optional) – Predefined weights for each feature. Defaults to None.
cancellation_features (list[int], optional) – Specific features to apply cancellations to. Defaults to None, applying to all features.
cancellation_likelihood (float) – Probability of each feature being canceled. Defaults to 0.5.
- cancellation_features = None
- cancellation_likelihood = 0.5
- cancellation_outcomes
- cancellation_samples
- labels
- cancellation_attributions
- cat_features
- ground_truth_attributions
- features = 'cancellation_samples'
- ground_truth_attribute = 'ground_truth_attributions'
- subset_data = ['weighted_samples', 'cancellation_outcomes', 'cancellation_samples',...
- _initialize_cancellation_features() None
Validates and initializes the list of features subject to cancellation. If no specific features are provided, all features are considered candidates for cancellation.
- Raises:
AssertionError – If cancellation_features is not a list, any element in cancellation_features is not an integer, the maximum element in cancellation_features is greater than the number of features, or cancellation_features is empty. Also, if cancellation_likelihood is not a float or is outside the range [0, 1].
- _get_cancellations() torch.Tensor
Generates a binary mask indicating whether each feature in each sample is canceled based on the specified likelihood.
This method considers only the features specified in cancellation_features for possible cancellation.
- Returns:
- An integer tensor of shape (n_samples, n_features) where 1 represents a canceled feature,
and 0 represents an active feature.
- Return type:
torch.Tensor
- _get_cancellation_samples() torch.Tensor
Concatenates the original samples with their cancellation outcomes to form a comprehensive dataset.
This allows for analyzing the impact of feature cancellations directly alongside the original features.
- Returns:
A tensor containing the original samples augmented with their corresponding cancellation outcomes.
- Return type:
torch.Tensor
- _get_cancellation_attributions() torch.Tensor
Computes the attribution of each feature by negating the effect of canceled features.
This method helps understand the impact of each feature on the model output when certain features are systematically canceled.
- Returns:
- A tensor of the same shape as the weighted samples, where the values of canceled features are
negated to reflect their absence.
- Return type:
torch.Tensor
- generate_model() torch.nn.Module
Instantiates and returns a neural network model for analyzing datasets with conflicting features.
The model is configured to use the specified features and weights, allowing for experimentation with feature cancellations.
- Returns:
A neural network model designed to work with the specified features and weights.
- Return type:
torch.nn.Module
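A construction sketch; the shape of cancellation_samples is an assumption based on the concatenation described above:

    from datagenerator import ConflictingDataset

    ds = ConflictingDataset(
        seed=0,
        n_features=3,
        n_samples=20,
        cancellation_features=[0, 2],  # only these features may be canceled
        cancellation_likelihood=0.3,
    )
    # Samples concatenated with their binary cancellation outcomes.
    print(ds.cancellation_samples.shape)  # assumed: torch.Size([20, 6])
    model = ds.generate_model()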
- class datagenerator.BalancedImageDataset(*args: Any, **kwargs: Any)
Bases: ImageDataset
A dataset for images where each image consists of a background and a foreground overlay.
This ‘balanced’ dataset ensures that each combination of background (bg), foreground (fg), and foreground color (fg_color) appears the same number of times across the dataset, making it ideal for machine learning models that benefit from uniform exposure to all feature combinations.
Inherits all parameters from ImageDataset, and introduces no additional parameters, but it overrides the behavior to ensure balance in the dataset composition.
- Inherits from:
ImageDataset: Standard dataset that contains images with backgrounds and foregrounds.
Initializes a BalancedImageDataset with the same parameters as ImageDataset, ensuring each combination of background, foreground, and color appears uniformly across the dataset.
After initialization, it automatically generates the samples and shuffles them if the ‘shuffled’ attribute is True.
- Parameters:
*args – Additional arguments passed to the superclass initializer.
**kwargs – Additional keyword arguments passed to the superclass initializer.
- generate_samples() None
Generates a balanced set of image samples by uniformly distributing each combination of background, foreground shape, and color.
Iterates over each background, each shape, and each color to create the specified number of variants per combination. Each generated image is stored in the ‘samples’ list, with corresponding labels in ‘labels’, and other metadata like foreground shapes, background labels, and foreground colors stored in their respective lists.
- Raises:
ValueError – If there is an issue with image generation parameters or overlay combinations.
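A small construction sketch (samples are generated and shuffled at initialization, per the description above):

    from datagenerator import BalancedImageDataset

    # 2 backgrounds x 2 shapes x 3 variants per combination, uniformly balanced.
    ds = BalancedImageDataset(backgrounds=2, shapes=2, n_variants=3, shuffled=True)
    img, label, extras = ds[0]  # see ImageDataset.__getitem__ below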
- class datagenerator.ImbalancedImageDataset(backgrounds: int | List[str] = 5, shapes: int | List[str] = 3, n_variants: int = 100, shape_colors: str | Tuple[int, int, int, int] = 'red', imbalance: float = 0.8, **kwargs: Any)
Bases: ImageDataset
Creates an image dataset where each image comprises a background image and a foreground image.
Background images, the type of foreground, the color of the foreground, as well as other parameters can be specified.
Imbalance refers to the fact that users can specify the proportion of the dominant (background, foreground) pair relative to the other pairs.
- Inherits from:
ImageDataset: Standard dataset that contains images with backgrounds and foregrounds.
- imbalance
The proportion of samples that should favor a particular background per shape. Must lie within the inclusive range [0.0, 1.0].
- Type:
float
Initializes an ImbalancedImageDataset object with specified parameters, focusing on creating dataset variations based on an imbalance parameter that dictates the dominance of certain shape-background pairs.
- Parameters:
backgrounds (int | list) – The number or list of specific background filenames. Defaults to 5.
shapes (int | list) – The number or list of specific shapes. Defaults to 3.
n_variants (int) – Number of variations per shape-background combination, affects dataset size. Defaults to 100.
shape_colors (str | tuple) – The default color for all shapes in the dataset. Defaults to ‘red’.
imbalance (float) – The proportion (0.0 to 1.0) of samples that should favor a particular background per shape. Defaults to 0.8.
**kwargs – Additional keyword arguments passed to the superclass initializer.
- imbalance
- _prepare_shape_color(shape_colors: str | Tuple[int, int, int, int] | None) List[Tuple[int, int, int, int]]
Prepares a single shape color based on the input.
Selects a random color if None is provided; otherwise validates the provided color string or RGBA tuple.
- Parameters:
shape_colors (str | tuple | NoneType) – A specific color name, RGBA tuple, or None to select a random color.
- Returns:
A list containing a single validated RGBA tuple representing the color.
- Return type:
list
- Raises:
ValueError – If the input is invalid or if the color name is not found in the predefined color dictionary.
- _validate_imbalance(imbalance: float) float
Validates that the imbalance parameter is a float between 0.0 and 1.0 inclusive, or None.
Ensures that the dataset can properly reflect the desired level of imbalance, adjusting for the number of variants and available backgrounds.
- Parameters:
imbalance (float | NoneType) – The imbalance value to validate. If None is given as input, then the argument will be treated as 0.3.
- Returns:
The validated imbalance value.
- Return type:
float
- Raises:
ValueError – If the imbalance is not within the inclusive range [0.0, 1.0] or if the imbalance settings are not feasible with the current settings of n_variants and backgrounds.
- generate_samples() None
Generates a set of image samples with overlay shapes or dinosaurs on backgrounds, considering imbalance.
Depending on the ‘imbalance’ parameter, this method either:
- Allocates a specific fraction (defined by ‘imbalance’) of the samples for each shape to a particular background, with the remainder distributed among the other backgrounds, or
- Assigns all samples for a shape to a single background (imbalance = 1.0).
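A construction sketch; with imbalance=0.8, roughly 80% of each shape’s samples land on its dominant background:

    from datagenerator import ImbalancedImageDataset

    ds = ImbalancedImageDataset(
        backgrounds=3,
        shapes=2,
        n_variants=50,
        shape_colors="red",
        imbalance=0.8,
    )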
- class datagenerator.ImageDataset(seed: int = 0, backgrounds: int | List[str] = 5, shapes: int | List[str] = 10, n_variants: int = 4, background_size: Tuple[int, int] = (512, 512), shape_type: str = 'geometric', position: str = 'random', overlay_scale: float = 0.3, rotation: bool = False, shape_colors: str | Tuple[int, int, int, int] | List[str | Tuple[int, int, int, int]] | None = None, shuffled: bool = True, transform: Callable | None = None, contour_thickness: int = 3, source: str = 'local')
Bases: torch.utils.data.Dataset
A dataset for images with specified configurations for image generation, supporting both balanced and imbalanced datasets.
- Inherits from:
torch.utils.data.Dataset: The standard base class for defining a dataset within the PyTorch framework.
- seed
Seed for random number generation to ensure reproducibility.
- Type:
int
- backgrounds
List of background images to use for dataset generation.
- Type:
list
- shapes
List of shapes to overlay on background images.
- Type:
list
- n_variants
Number of variations per shape-background combination, affects dataset size.
- Type:
int
- background_size
Dimensions (width, height) of background images.
- Type:
tuple
- shape_type
Type of shapes: ‘geometric’ for geometric shapes, ‘dinosaurs’ for dinosaur shapes.
- Type:
str
- position
Overlay position on the background (‘center’ or ‘random’).
- Type:
str
- overlay_scale
Scale factor for overlay relative to the background size.
- Type:
float
- rotation
If True, applies random rotation to overlays.
- Type:
bool
- shape_colors
List of default color(s) for shapes, accepts color names or RGBA tuples.
- Type:
list
- shuffled
If True, shuffles the dataset after generation.
- Type:
bool
- transform
Transformation function to apply to each image, typically converting to tensor.
- Type:
callable
- contour_thickness
Thickness of lines the contours are drawn with. If it is negative, the contour interiors are drawn.
- Type:
int
- image_builder
Instance of ImageBuilder for generating images.
- Type:
ImageBuilder
- samples
List to store the generated samples.
- Type:
list
- labels
List to store the labels.
- Type:
list
- fg_shapes
List to store the foreground shapes.
- Type:
list
- bg_labels
List to store the background labels.
- Type:
list
- fg_colors
List to store the foreground colors.
- Type:
list
- ground_truth
List to store the ground truths.
- Type:
list
Initializes an ImageDataset object.
- Parameters:
seed (int) – Seed for random number generation to ensure reproducibility. Defaults to 0.
backgrounds (int | list) – Number or list of specific backgrounds to use. Defaults to 5.
shapes (int | list) – Number or list of specific shapes. Defaults to 10.
n_variants (int) – Number of variations per shape-background combination, affects dataset size. Defaults to 4.
background_size (tuple) – Dimensions (width, height) of background images. Defaults to (512, 512).
shape_type (str) – ‘geometric’ for geometric shapes, ‘dinosaurs’ for dinosaur shapes. Defaults to ‘geometric’.
position (str) – Overlay position on the background (‘center’ or ‘random’). Defaults to ‘random’.
overlay_scale (float) – Scale factor for overlay relative to the background size. Defaults to 0.3.
rotation (bool) – If True, applies random rotation to overlays. Defaults to False.
shape_colors (str | tuple, optional) – Default color(s) for shapes, accepts color names or RGBA tuples. Defaults to None.
shuffled (bool) – If True, shuffles the dataset after generation. Defaults to True.
transform (callable, optional) – Transformation function to apply to each image, typically converting to tensor. Defaults to None.
contour_thickness (int) – Thickness of the lines the contours are drawn with. Defaults to 3.
- seed = 0
- n_variants = 4
- image_builder
- backgrounds
- shapes
- shape_colors
- transform
- samples = []
- labels = []
- fg_shapes = []
- bg_labels = []
- fg_colors = []
- ground_truth = []
- shuffled = True
- contour_thickness = 3
- _validate_n_variants(n_variants: int) int
Validates that the number of variants per shape-background combination is a positive integer.
The n_variants parameter controls how many different versions of each shape-background combination are generated, varying elements such as position and possibly color if specified. This allows for diverse training data in image recognition tasks, improving the model’s ability to generalize from different perspectives and conditions.
- Parameters:
n_variants (int) – The number of variations per shape-background combination to generate.
- Returns:
The validated number of variants.
- Return type:
int
- Raises:
ValueError – If n_variants is not an integer or is less than or equal to zero.
- _prepare_shapes(shape_type: str, shapes: int | List[str], source: str) List[str]
Prepares a list of shapes or dinosaurs based on the input and the specified shape type.
This method processes the input to generate a list of specific shapes or dinosaur names. If a numerical input is provided, it selects that many random shapes/dinosaurs from the available names. If a list is provided, it directly uses those specific names.
- Parameters:
shape_type (str) – Specifies the type of overlay image, either ‘geometric’ or ‘dinosaurs’.
shapes (int | list) – Number or list of specific shape names. If an integer is provided, it indicates how many random shapes or dinosaurs to select.
- Returns:
A list of shape names or dinosaur names to be used as overlays.
- Return type:
list
- Raises:
ValueError – If the shapes input is neither an integer nor a list, or if the shape_type is not recognized as ‘geometric’ or ‘dinosaurs’.
- _prepare_backgrounds(backgrounds: int | List[str]) List[str]
Prepares background images based on the input.
This method helps to either randomly select a set number of background images from the available pool or validate and use a provided list of specific background filenames.
If a numerical value is provided, selects that many random backgrounds. If a list is provided, validates and uses those specific backgrounds.
- Parameters:
backgrounds (int | list) – Number of random backgrounds to select or a list of specific background filenames.
- Returns:
A list of background filenames to be used in the dataset.
- Return type:
list
- Raises:
ValueError – If the input is neither an integer nor a list, or if any specified background filename is not found in the available backgrounds.
- _prepare_shape_color(shape_colors: int | str | Tuple[int, int, int, int] | List[str | Tuple[int, int, int, int]] | None) List[Tuple[int, int, int, int]]
Prepares shape colors by validating input against available colors.
If no valid colors are provided, a default color is selected. Accepts single or multiple colors.
- Parameters:
shape_colors (int | str | tuple | list) – Specifies how many random colors to select or provides specific color(s). Can be a single color name, RGBA tuple, or list of names/tuples.
- Returns:
A list of validated RGBA tuples representing the colors.
- Return type:
list
- Raises:
ValueError – If input is invalid or colors are not found in the available color dictionary. Details about the invalid input are provided in the error message.
- generate_samples() None
Placeholder method for generating the samples either for balanced or imbalanced datasets.
- shuffle_dataset() None
Randomly shuffles the dataset samples and corresponding labels to ensure variety in training and evaluation phases.
- Raises:
ValueError – If the dataset is empty and shuffling is not possible.
- __len__() int
Returns the number of samples in the dataset.
- Returns:
Number of samples contained in the dataset.
- Return type:
int
- __getitem__(idx: int) Tuple[torch.Tensor, int, Dict[str, str | torch.Tensor | PIL.Image.Image]]
Retrieves an image and its label by index.
The image is transformed into a tensor if a transform is applied.
- Parameters:
idx (int) – Index of the sample to retrieve.
- Returns:
A tuple containing the transformed image tensor, label, a dict of other attributes.
- Return type:
tuple
- _re_label() None
Re-labels the dataset labels with integer indices.
- static show_image(img_tensor: torch.Tensor) None
Displays an image given its tensor representation.
- Parameters:
img_tensor (torch.Tensor) – The image tensor to display.
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the mask ratio metric, constructed from the ground truth and context. The mask ratio is defined as the ratio of the absolute attribution score lying within the foreground to that of the whole image.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
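Because generate_samples is a placeholder on this base class, a typical sketch instantiates a subclass and uses the shared accessors; the torchvision transform is an assumption:

    from torchvision import transforms
    from datagenerator import BalancedImageDataset, ImageDataset

    ds = BalancedImageDataset(
        backgrounds=2,
        shapes=2,
        n_variants=2,
        transform=transforms.ToTensor(),  # samples come back as tensors
    )
    img_tensor, label, extras = ds[0]
    ImageDataset.show_image(img_tensor)  # static helper to display a sample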
- class datagenerator.InteractingFeatureDataset(seed: int = 0, n_features: int = 4, n_samples: int = 50, weight_range: Tuple[float, float] = (-1.0, 1.0), weights: List[float] | None = None, zero_likelihood: float = 0.5, interacting_features: List[List[int]] = [[1, 0], [3, 2]], **kwargs: Any)
Bases: xaiunits.datagenerator.WeightedFeaturesDataset
A dataset subclass for modeling interactions between categorical and continuous features within weighted datasets.
This class extends WeightedFeaturesDataset to support scenarios where the influence of one feature on the model is conditional on the value of another, typically categorical, feature. For instance, the model may include terms like w_i(x_j) * x_i + w_j * x_j, where the weight w_i(x_j) changes based on the value of x_j.
- Inherits from:
WeightedFeaturesDataset: Class extending BaseFeaturesDataset with support for weighted features
- interacting_features
Pairs of indices where the first index is the feature whose weight is influenced by the second, categorical feature.
- Type:
list[list[int]]
- zero_likelihood
The likelihood of the categorical feature being zero.
- Type:
float
- seed
Random seed for reproducibility.
- Type:
int
- n_features
Number of features in the dataset.
- Type:
int
- n_samples
Number of samples in the dataset.
- Type:
int
- weight_range
Min and max values for generating weights.
- Type:
tuple[float]
- weights
Initial weight values for features.
- Type:
list | NoneType
- subset_attribute
List of attributes that define the subset of the data with specific characteristics.
- Type:
list[str]
- interacting_features = [[1, 0], [3, 2]]
- zero_likelihood = 0.5
- subset_attribute
- cat_features
- make_cat() None
Modifies the dataset to incorporate the specified categorical-to-continuous feature interactions.
The method ensures that the dataset is correctly modified to reflect the specified feature interactions and their impact on weights and samples.
- _get_flat_weights(weights: List[float] | None) torch.Tensor | None
Convert the weights into a flat tensor.
This method takes a list of weights, which can be tuples representing ranges, and converts them into a flat tensor. If the input weights are None, the method returns None.
- Parameters:
weights (list | NoneType) – List of weights or None if weights are not specified.
- Returns:
Flat tensor of weights if weights are provided, else None.
- Return type:
torch.Tensor | NoneType
- generate_model() torch.nn.Module
Generates a neural network model for interacting features analysis.
This method instantiates and returns a neural network model specifically designed for analyzing datasets with interacting features. The model is configured using the specified number of features, feature weights, and interacting features information.
- Returns:
- An instance of the InteractingFeaturesNN class, representing
the neural network model designed for interacting features analysis.
- Return type:
torch.nn.Module
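A construction sketch; in each pair, the first index names the feature whose weight is switched by the second, categorical feature:

    from datagenerator import InteractingFeatureDataset

    ds = InteractingFeatureDataset(
        seed=0,
        n_features=4,
        n_samples=50,
        interacting_features=[[1, 0], [3, 2]],  # feature 1 depends on 0; 3 on 2
        zero_likelihood=0.5,
    )
    model = ds.generate_model()  # InteractingFeaturesNN instance (per above)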
- class datagenerator.PertinentNegativesDataset(seed: int = 0, n_features: int = 5, n_samples: int = 10, distribution: str = 'normal', weight_range: Tuple[float, float] = (-1.0, 1.0), weights: torch.Tensor | None = None, pn_features: List[int] | None = None, pn_zero_likelihood: float = 0.5, pn_weight_factor: float = 10, baseline: str = 'zero')
Bases: xaiunits.datagenerator.WeightedFeaturesDataset
A dataset designed to investigate the impact of pertinent negative (PN) features on model predictions by introducing zero values in selected features, which are expected to significantly impact the output.
This dataset is useful for scenarios where the absence of certain features (indicated by zero values) provides important information for model predictions.
- Inherits from:
WeightedFeaturesDataset: Class extending BaseFeaturesDataset with support for weighted features
- pn_features
Indices of features considered as pertinent negatives.
- Type:
list[int]
- pn_zero_likelihood
Likelihood of a pertinent negative feature being set to zero.
- Type:
float
- pn_weight_factor
Weight factor applied to the pertinent negative features to emphasize their impact.
- Type:
float
- cat_features
Categorical features derived from the pertinent negatives.
- Type:
list
- labels
Generated labels with optional noise.
- Type:
torch.Tensor
- features
Name of the attribute representing the input features.
- Type:
str
- ground_truth_attribute
Name of the attribute considered as ground truth for analysis.
- Type:
str
- subset_data
List of attributes to be included in subsets.
- Type:
list[str]
- subset_attribute
Additional attributes to be considered in subsets.
- Type:
list[str]
- pn_zero_likelihood = 0.5
- pn_weight_factor = 10
- pn_features = [0]
- cat_features = [0]
- label_noise
- labels
- features = 'samples'
- ground_truth_attribute = 'ground_truth'
- subset_data = ['samples', 'weighted_samples', 'ground_truth']
- subset_attribute
- _intialize_pn_features(pn_features: List[int] | None) List[int]
Validates and initializes the indices of features to be considered as pertinent negatives (PN).
Ensures that specified pertinent negative features are within the valid range of feature indices. Falls back to the first feature if pn_features is not specified or invalid.
- Parameters:
pn_features (list of int, optional) – Indices of features specified as pertinent negatives.
- Returns:
The validated list of indices for pertinent negative features.
- Return type:
list[int]
- Raises:
ValueError – If any specified pertinent negative feature index is out of the valid range or if the input is not a list.
- _initialize_zeros_for_PN() None
Sets the values of pertinent negative (PN) features to zero with a specified likelihood, across all samples in a vectorized manner.
This modification is performed directly on the samples attribute.
- _get_new_weighted_samples() None
Recalculates the weighted samples considering the introduction of zeros for pertinent negative features in a vectorized manner.
Adjusts the weight of features set to zero to emphasize their impact by using the pn_weight_factor. Updates the weighted_samples attribute with the new calculations.
- _create_ground_truth_baseline(baseline: str) None
Creates the ground truth baseline based on the specified baseline type (“zero” or “one”).
- Parameters:
baseline (str) – Specifies the type of baseline to use. Must be either “zero” or “one”.
- Raises:
KeyError – If the specified baseline is not “zero” or “one”.
- __getitem__(idx: int, others: List[str] = ['ground_truth_attribute', 'baseline']) Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to ['ground_truth_attribute', 'baseline'].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
- generate_model() torch.nn.Module
Generates and returns a neural network model tailored for analyzing the impact of pertinent negatives.
The model is configured to incorporate the weights, pertinent negatives, and the pertinent negative weight factor.
- Returns:
- A neural network model designed to work with the dataset’s specific configuration,
including the pertinent negatives and their associated weight factor.
- Return type:
torch.nn.Module
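A construction sketch exercising the PN-specific parameters:

    from datagenerator import PertinentNegativesDataset

    ds = PertinentNegativesDataset(
        seed=0,
        n_features=5,
        n_samples=10,
        pn_features=[0, 2],      # features treated as pertinent negatives
        pn_zero_likelihood=0.5,  # chance each PN feature is zeroed
        pn_weight_factor=10,     # emphasizes zeroed features in the labels
        baseline="zero",
    )
    item = ds[0]  # includes the ground-truth attribution and baseline extras
    model = ds.generate_model()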
- class datagenerator.ShatteredGradientsDataset(seed: int = 0, n_features: int = 5, n_samples: int = 100, discontinuity_ratios: List | None = None, bias: float = 0.5, act_fun: str = 'Relu', two_distributions_flag: bool = False, proportion: float = 0.2, classification: bool = False, **kwargs: Any)
Bases: xaiunits.datagenerator.WeightedFeaturesDataset
A class intended to generate data and weights that exhibit shattered gradient phenomena.
This class generates weights depending on the activation function and the discontinuity ratios. The discontinuity ratios are a set of real numbers (one per feature) chosen so that small perturbations around them significantly impact the model’s explanation.
- Inherits from:
WeightedFeaturesDataset: Class extending BaseFeaturesDataset with support for weighted features
- weights
Weights applied to each feature.
- Type:
Tensor
- weight_range
The range (min, max) within which random weights are generated.
- Type:
tuple
- weighted_samples
The samples after applying weights.
- Type:
Tensor
Initializes a ShatteredGradientsDataset object.
- Parameters:
seed (int) – Seed for reproducibility. Defaults to 0.
n_features (int) – Number of features. Defaults to 5.
n_samples (int) – Number of samples. Defaults to 100.
discontinuity_ratios (list, optional) – Ratios indicating feature discontinuity. If None, ratios are generated randomly. Defaults to None. Example: (1, -3, 4, 2, -2)
bias (float) – Bias value. Defaults to 0.5.
act_fun (str) – Activation function (“Relu”, “Gelu”, or “Sigmoid”). Defaults to “Relu”.
two_distributions_flag (bool) – Flag for using two distributions. Defaults to False.
proportion (float) – Proportion of samples for narrow distribution when using two distributions. Defaults to 0.2.
classification (bool) – Flag for classification. Defaults to False.
**kwargs –
Arbitrary keyword arguments passed to the base class constructor, including:
sample_std_dev_narrow (float): Standard deviation for sample creation noise in narrow distribution. Defaults to 0.05.
sample_std_dev_wide (float): Standard deviation for sample creation noise in wide distribution. Defaults to 10.
weight_scale (float): Scalar value to multiply all generated weights with.
label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.
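A construction sketch using the two-distribution mode, with parameters as documented above:

    from datagenerator import ShatteredGradientsDataset

    ds = ShatteredGradientsDataset(
        seed=0,
        n_features=5,
        n_samples=100,
        discontinuity_ratios=[1, -3, 4, 2, -2],  # one ratio per feature
        act_fun="Relu",
        two_distributions_flag=True,  # mix narrow and wide sample clusters
        proportion=0.2,               # 20% narrow, 80% wide
    )
    model = ds.generate_model()  # ShatteredGradientsNN (per generate_model below)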
- _initialize_with_narrow_wide_distributions(seed: int, n_features: int, n_samples: int, discontinuity_ratios: List, bias: float, act_fun: str, proportion: float, classification: bool, kwargs: Dict | None) None
Initializes the dataset with narrow and wide distributions.
This method sets up the dataset with narrow and wide distributions. It generates a dataset with the first portion of data belonging to the narrow distribution dependent on sample_std_dev_narrow. Similarly, the second portion of the dataset will belong to the wider distribution, depending on sample_std_dev_wide.
It also initializes the weights dependent on discontinuity ratios and weight_scale.
- Parameters:
seed (int) – Seed for random number generation to ensure reproducibility.
n_features (int) – Number of features in the dataset.
n_samples (int) – Number of samples in the dataset.
discontinuity_ratios (list) – List of discontinuity ratios for each feature.
bias (float) – Bias value to adjust the weight scale.
act_fun (str) – Activation function name (‘Relu’, ‘Gelu’, or ‘Sigmoid’).
proportion (float) – Proportion of narrow samples to wide samples.
classification (bool) – Indicates if the dataset is for classification (True) or regression (False).
**kwargs –
Arbitrary keyword arguments passed to the base class constructor, including:
sample_std_dev_narrow (float): Standard deviation for sample creation noise in narrow distribution. Defaults to 0.05.
sample_std_dev_wide (float): Standard deviation for sample creation noise in wide distribution. Defaults to 10.
weight_scale (float): Scalar value to multiply all generated weights with.
label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.
- _initialize_with_narrow_distribution(seed: int, n_features: int, n_samples: int, discontinuity_ratios: List, bias: float, act_fun: str, classification: bool, kwargs: Dict | None)
Initializes the dataset with just a narrow distribution.
It generates a dataset with the first portion of data belonging to the narrow distribution dependent on sample_std_dev_narrow.
It also initializes the weights dependent on discontinuity ratios and weight_scale.
- Parameters:
seed (int) – Seed for random number generation to ensure reproducibility.
n_features (int) – Number of features in the dataset.
n_samples (int) – Number of samples in the dataset.
discontinuity_ratios (list) – List of discontinuity ratios for each feature.
bias (float) – Bias value to adjust the weight scale.
act_fun (str) – Activation function name (‘Relu’, ‘Gelu’, or ‘Sigmoid’).
classification (bool) – Indicates if the dataset is for classification (True) or regression (False).
**kwargs –
Arbitrary keyword arguments passed to the base class constructor, including:
sample_std_dev_narrow (float): Standard deviation for sample creation noise in narrow distribution. Defaults to 0.05.
weight_scale (float): Scalar value to multiply all generated weights with.
label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.
- _initialize_samples_narrow_wide(n_samples: int, proportion: float, distribution_narrow: torch.distributions.Distribution, distribution_wide: torch.distributions.Distribution) Tuple[torch.Tensor, torch.distributions.Distribution]
Initializes synthetic samples with narrow and wide distributions.
- Parameters:
n_samples (int) – Total number of samples to generate.
proportion (float) – Proportion of samples that should belong to the narrow distribution. It should be between 0 and 1, where 0 indicates no narrow samples, and 1 indicates all samples are narrow.
distribution_narrow (torch.distributions.Distribution) – Narrow distribution object.
distribution_wide (torch.distributions.Distribution) – Wide distribution object.
- Returns:
A tuple containing the generated samples and the distribution used.
- Return type:
tuple
- _initialize_discontinuity_ratios(discontinuity_ratios: List | None, n_features: int) List[torch.Tensor]
Initialize discontinuity ratios for each feature in the dataset.
If discontinuity_ratios is None, this method generates initial discontinuity ratios for each feature based on the specified n_features.
- Parameters:
discontinuity_ratios (list | NoneType) – List of discontinuity ratios for each feature. If None, new discontinuity ratios will be generated.
n_features (int) – Number of features in the dataset.
- Returns:
List of discontinuity ratios for each feature.
- Return type:
list
- Raises:
AssertionError – If there are no positive or negative ratios, if discontinuity_ratios is not a list, or if the length of discontinuity_ratios does not match n_features.
- _get_default_distribution_narrow(n_features: int, kwargs: Dict | None) Tuple[torch.distributions.Distribution, Dict]
Returns the default narrow distribution for the dataset.
This method sets the default narrow distribution based on the provided kwargs or defaults. The sample_std_dev_narrow is used to determine the covariance matrix of the distribution.
- Parameters:
n_features (int) – Number of features in the dataset.
kwargs (dict) –
Additional keyword arguments for configuration:
sample_std_dev_narrow (float): Used to determine the covariance matrix of the distribution.
- Returns:
A tuple containing the default narrow distribution and the modified kwargs.
- Return type:
tuple
- _get_default_distribution_wide(n_features: int, kwargs: Dict | None) → Tuple[torch.distributions.Distribution, Dict]
Returns the default wide distribution for the dataset.
This method sets up the default wide distribution based on the provided kwargs or defaults. The sample_std_dev_wide is used to determine the covariance matrix of the distribution.
- Parameters:
n_features (int) – Number of features in the dataset.
kwargs (dict) –
Additional keyword arguments for configuration:
sample_std_dev_wide (float): Used to determine the covariance matrix of the distribution.
- Returns:
A tuple containing the default wide distribution and the modified kwargs.
- Return type:
tuple
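Both helpers can be pictured as building a zero-mean multivariate normal whose covariance is scaled by the corresponding standard deviation. The isotropic covariance below is an assumption for illustration:

    import torch
    from torch.distributions import MultivariateNormal

    def default_distribution(n_features, std_dev):
        # Zero-mean multivariate normal with covariance std_dev**2 * I.
        return MultivariateNormal(torch.zeros(n_features),
                                  std_dev**2 * torch.eye(n_features))

    narrow = default_distribution(3, 0.05)  # sample_std_dev_narrow default
    wide = default_distribution(3, 10.0)    # sample_std_dev_wide default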
- _default_activation_function(act_fun: str, classification: bool) → torch.nn.Module
Returns the default activation function based on the provided function name and task type.
- Parameters:
act_fun (str or nn.Module) – Name or instance of the activation function (‘Relu’, ‘Gelu’, ‘Sigmoid’), or a custom activation function instance.
classification (bool) – Indicates if the dataset is for classification (True) or regression (False).
- Returns:
The activation function selected according to the specified name or instance and the task type.
- Return type:
nn.Module
- Raises:
KeyError – If the provided activation function is not one of ‘Relu’, ‘Gelu’, or ‘Sigmoid’, and it does not match the type of a custom activation function already defined in the mapping.
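The name-to-module mapping can be sketched as below; the dictionary contents are assumptions consistent with the documented KeyError behaviour, and the task-type handling is omitted:

    import torch.nn as nn

    _ACTIVATIONS = {"Relu": nn.ReLU, "Gelu": nn.GELU, "Sigmoid": nn.Sigmoid}

    def default_activation(act_fun):
        # Custom activation instances pass straight through.
        if isinstance(act_fun, nn.Module):
            return act_fun
        # Raises KeyError for names outside the mapping, as documented.
        return _ACTIVATIONS[act_fun]()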
- _get_weight_scale(kwargs: Dict | None, act_fun: str) → Dict
Adjust the weight scaling factor based on the activation function used.
This method calculates and updates the weight scaling factor in the kwargs dictionary based on the provided activation function. A different default weight scale is applied for ‘Sigmoid’ activation than for the other activation functions.
- Parameters:
kwargs (dict) – Additional keyword arguments, potentially including ‘weight_scale’. If the user does not specify weight_scale, a default value is applied.
act_fun (str) – Name of the activation function (‘Relu’, ‘Gelu’, or ‘Sigmoid’).
- Returns:
Updated kwargs with the ‘weight_scale’ value adjusted according to the activation function.
- Return type:
dict
- Raises:
KeyError – If the activation function is not one of ‘Relu’, ‘Gelu’, or ‘Sigmoid’.
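A minimal sketch of the branching, with hypothetical default values (the actual defaults are not stated in this documentation):

    def get_weight_scale(kwargs, act_fun):
        # Hypothetical default values; only the Sigmoid-vs-other branching
        # is taken from the documentation above.
        if act_fun not in ("Relu", "Gelu", "Sigmoid"):
            raise KeyError(f"unsupported activation: {act_fun}")
        if "weight_scale" not in kwargs:
            kwargs["weight_scale"] = 5.0 if act_fun == "Sigmoid" else 1.0
        return kwargs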
- _generate_default_weights(n_features: int, weight_scale: float, act_fun: str) → torch.Tensor
Generate default weights based on discontinuity ratios, bias, and activation function.
- Parameters:
n_features (int) – Number of features in the dataset.
weight_scale (float) – Scaling factor for weight initialization.
act_fun (str) – Name of the activation function (‘Relu’, ‘Gelu’, or ‘Sigmoid’).
- Returns:
Default weights for each feature, adjusted based on discontinuity ratios, bias, and activation function.
- Return type:
torch.Tensor
- Raises:
ZeroDivisionError – If the sum of positive or negative ratios is zero, indicating a configuration issue.
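One way to realise this contract is to normalise each sign group of the discontinuity ratios separately, so that an empty group triggers the documented ZeroDivisionError. The normalisation itself is an assumption:

    import torch

    def generate_default_weights(ratios, weight_scale):
        pos_sum = sum(r for r in ratios if r > 0)
        neg_sum = sum(-r for r in ratios if r < 0)
        # Division by a zero sum raises ZeroDivisionError when a sign group
        # is empty, matching the documented failure mode.
        pos_scale = weight_scale / pos_sum
        neg_scale = weight_scale / neg_sum
        return torch.tensor([r * (pos_scale if r > 0 else neg_scale)
                             for r in ratios])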
- generate_model() → torch.nn.Module
Generate a model using the Shattered Gradients Neural Network architecture.
- Returns:
An instance of the ShatteredGradientsNN model.
- Return type:
ShatteredGradientsNN
- __getitem__(idx: int, others: List[str] = []) → Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to [].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
- property default_metric: None
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the max sensitivity metric.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
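End to end, the class is used roughly as follows. The class name ShatteredGradientsDataset and the keyword arguments are assumptions inferred from the initializer helpers documented above:

    from datagenerator import ShatteredGradientsDataset  # name assumed

    dataset = ShatteredGradientsDataset(seed=0, n_features=4, n_samples=100,
                                        act_fun="Relu", classification=False)
    model = dataset.generate_model()  # ShatteredGradientsNN instance
    sample, label = dataset[0]        # via __getitem__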
- class datagenerator.UncertaintyAwareDataset(n_features: int = 5, weights: torch.Tensor | None = None, common_features: int = 1, seed: int = 0, n_samples: int = 10, **kwargs: Any)
Bases:
xaiunits.datagenerator.BaseFeaturesDataset
A dataset designed to investigate how feature attribution methods treat input features that equally impact model prediction.
In particular, uncertainty/common features are input features that contribute equally to every output class prediction. A feature attribution method is expected not to assign any attribution score to these uncertainty inputs. The last columns of the dataset are the uncertainty/common features.
Users can also pass in their own weights if they wish to test for more complex uncertainty behavior, e.g. an uncertainty/common feature that contributes equally only to a subset of output classes.
- Inherits from:
BaseFeaturesDataset: Base class for generating datasets with features and labels.
- weighted_samples
Samples multiplied by weights.
- Type:
torch.Tensor
- weights
Weights matrix for feature transformation.
- Type:
torch.Tensor
- labels
Softmax output of weighted samples.
- Type:
torch.Tensor
Initializes an UncertaintyAwareDataset object.
- Parameters:
n_features (int) – Number of features in the dataset. Defaults to 5.
weights (torch.Tensor, optional) – Custom weights matrix for feature transformation. Defaults to None.
common_features (int) – Number of uncertainty/common features present. Defaults to 1.
seed (int) – Seed for random number generation. Defaults to 0.
n_samples (int) – Number of samples in the dataset. Defaults to 10.
**kwargs – Additional keyword arguments for the base class constructor.
- common_features = 1
- weighted_samples
- weights
- labels
- mask
- features = 'samples'
- ground_truth_attribute = 'mask'
- subset_data = ['samples', 'weighted_samples', 'mask']
- subset_attribute
- _create_weights(n_features: int, weights: torch.Tensor | None, common_features: int) → torch.Tensor
Creates weights matrix based on common features.
- Parameters:
n_features (int) – Number of features in the dataset.
weights (torch.Tensor | None) – Custom weights matrix for feature transformation, or None to generate default weights.
common_features (int) – Number of uncertainty/common features.
- Returns:
Weights matrix for feature transformation.
- Return type:
torch.Tensor
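A sketch of what such a weights matrix could look like, under the assumption that "contributing equally to every class" means the rows for the last common_features inputs are identical across output columns, so they cancel in the softmax:

    import torch

    def create_weights(n_features, common_features, n_outputs=2, seed=0):
        # Shape and construction are illustrative assumptions.
        gen = torch.Generator().manual_seed(seed)
        w = torch.randn(n_features, n_outputs, generator=gen)
        shared = torch.randn(common_features, 1, generator=gen)
        # Identical rows across outputs -> equal contribution to every class.
        w[-common_features:, :] = shared.expand(common_features, n_outputs)
        return w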
- __getitem__(idx: int, others: list[str] = ['ground_truth_attribute']) → Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to [“ground_truth_attribute”].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
- generate_model(softmax_layer: bool = True) → torch.nn.Module
Generates an UncertaintyNN model based on the dataset.
- Returns:
Instance of UncertaintyNN model.
- Return type:
UncertaintyNN
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is modified Mean Squared Error (MSE) loss function. This metric measures the MSE for common/uncertainty features which should be 0.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
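Typical usage, with keyword names taken from the documented signature; the exact items returned by indexing are an assumption based on the defaults of __getitem__:

    from datagenerator import UncertaintyAwareDataset  # import path assumed

    ds = UncertaintyAwareDataset(n_features=5, common_features=1,
                                 seed=0, n_samples=10)
    model = ds.generate_model(softmax_layer=True)  # UncertaintyNN instance
    sample, label, mask = ds[0]  # mask comes from 'ground_truth_attribute'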
- class datagenerator.TextTriggerDataset(index: Tuple[int, int] | None = None, tokenizer: Any | None = None, max_sequence_length: int = 4096, seed: int = 42, baselines: int | str = 220, skip_tokens: List[str] = [], model_name: str = 'XAIUnits/TriggerLLM_v2')
Bases:
BaseTextDataset
A PyTorch Dataset for text data with trigger words and feature masks, designed for explainable AI (XAI) tasks.
This dataset loads text data, tokenizes it, identifies trigger words, and generates feature masks highlighting these words. It’s specifically tailored for analyzing the impact of trigger words on model predictions.
- index
A tuple specifying the start and end indices for data subset selection. Defaults to None, using the entire dataset.
- Type:
tuple, optional
- tokenizer
The tokenizer to use for text processing. If None, it’s loaded based on the specified model_name.
- Type:
transformers.PreTrainedTokenizer, optional
- max_sequence_length
The maximum sequence length for input text. Longer sequences are truncated. Defaults to 4096.
- Type:
int, optional
- seed
Random seed for shuffling the data. Use -1 for no shuffling. Defaults to 42.
- Type:
int, optional
- baselines
Baseline token ID or string for attribution methods. Defaults to 220 (space token for Llama models).
- Type:
int or str, optional
- skip_tokens
List of tokens to skip during attribution. Defaults to an empty list.
- Type:
list, optional
- model_name
The name of the model to use for loading the tokenizer. Defaults to “XAIUnits/TriggerLLM_v2”.
- Type:
str, optional
- model_name = 'XAIUnits/TriggerLLM_v2'
- target
- __getitem__(idx: int) → Tuple[Any, Ellipsis]
- __len__() → int
- generate_model() → Tuple[Any, Any]
- property collate_fn: Callable
- property default_metric: Callable
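A hypothetical usage sketch; the import path and the unpacking of generate_model() into a model and tokenizer are assumptions based on the descriptions above:

    from torch.utils.data import DataLoader
    from datagenerator import TextTriggerDataset  # import path assumed

    ds = TextTriggerDataset(index=(0, 100), max_sequence_length=4096, seed=42)
    model, tokenizer = ds.generate_model()  # documented to return a 2-tuple
    loader = DataLoader(ds, batch_size=4, collate_fn=ds.collate_fn)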