datagenerator
Submodules
- datagenerator.backgrounds
- datagenerator.boolean
- datagenerator.conflicting
- datagenerator.data_generation
- datagenerator.foregrounds
- datagenerator.image_generation
- datagenerator.interacting_features
- datagenerator.pertinent_negatives
- datagenerator.shattered_grad
- datagenerator.text_datasets
- datagenerator.uncertainty_aware
Attributes
Classes
- Generic synthetic dataset of continuous features for AI explainability.
- A class extending BaseFeaturesDataset with support for weighted features.
- Generic synthetic dataset based on a propositional formula.
- Generic synthetic dataset based on a propositional formula.
- Generic synthetic dataset based on a propositional formula.
- Generic synthetic dataset with feature cancellation capabilities.
- A dataset for images where each image consists of a background and a foreground overlay.
- Creates an image dataset where each image comprises a background image and a foreground image.
- A dataset for images with specified configurations for image generation, supporting both balanced and imbalanced datasets.
- A dataset subclass for modeling interactions between categorical and continuous features within weighted datasets.
- A dataset designed to investigate the impact of pertinent negative (PN) features on model predictions.
- A class intended to generate data and weights that exhibit shattered gradient phenomena.
- A dataset designed to investigate how feature attribution methods treat inputs
- A PyTorch Dataset for text data with trigger words and feature masks, designed for explainable AI (XAI) tasks.
Functions
- Loads a previously saved dataset from a binary pickle file.
- Generates a CSV file with random data for a specified number of rows and features.
Package Contents
- class datagenerator.BaseFeaturesDataset(seed: int = 0, n_features: int = 2, n_samples: int = 10, distribution: str | torch.distributions.Distribution = 'normal', distribution_params: Dict[str, Any] | None = None, **kwargs: Any)
Bases: torch.utils.data.Dataset
Generic synthetic dataset of continuous features for AI explainability.
This class creates a dataset of continuous features based on a specified distribution, which can be used for training and evaluating AI models. It allows for reproducible sample creation, customizable features and sample sizes, and supports various distributions.
- seed
Seed for random number generators to ensure reproducibility.
- Type:
int
- n_features
Number of features in the dataset.
- Type:
int
- n_samples
Number of samples in the dataset.
- Type:
int
- distribution
Distribution used for generating the samples. Defaults to ‘normal’, which uses a multivariate normal distribution.
- Type:
str | torch.distributions.Distribution
- sample_std_dev
Standard deviation of the noise added to the samples.
- Type:
float
- label_std_dev
Standard deviation of the noise added to generate labels.
- Type:
float
- samples
Generated samples.
- Type:
torch.Tensor
- labels
Generated labels with optional noise.
- Type:
torch.Tensor
- ground_truth_attribute
Name of the attribute considered as ground truth.
- Type:
str
- subset_data
List of attributes to be included in subsets.
- Type:
list[str]
- subset_attribute
Additional attributes to be considered in subsets.
- Type:
list[str]
- cat_features
List of categorical feature names, used in perturbations.
- Type:
list[str]
Initializes a dataset of continuous features based on a specified distribution.
- Parameters:
seed (int) – For sample creation reproducibility. Defaults to 0.
n_features (int) – Number of features for each sample. Defaults to 2.
n_samples (int) – Total number of samples. Defaults to 10.
distribution (str | torch.distributions.Distribution) – Distribution to use for generating samples. Defaults to “normal”, which indicates multivariate normal distribution.
distribution_params (dict, optional) – Parameters for the distribution if a string identifier is used. Defaults to None.
**kwargs –
Arbitrary keyword arguments, including:
sample_std_dev (float): Standard deviation for sample creation noise. Defaults to 1.
label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.
- Raises:
ValueError – If an unsupported string identifier is provided.
TypeError – If ‘distribution’ is neither a string nor a torch.distributions.Distribution instance.
- label_noise
- features = 'samples'
- labels
- ground_truth_attribute = 'samples'
- subset_data = ['samples']
- subset_attribute = ['perturb_function', 'name']
- cat_features = []
- name = 'BaseFeaturesDataset'
- __len__() int
Returns the total number of samples in the dataset.
- Returns:
Total number of samples.
- Return type:
int
- __getitem__(idx: int, others: List[str] = ['ground_truth_attribute']) Tuple[torch.Tensor, torch.Tensor] | Tuple[torch.Tensor, torch.Tensor, Dict[str, torch.Tensor]]
Retrieves a sample and its label, along with optional attributes, by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list[str]) – Additional attributes to be retrieved with the sample and label. Defaults to [“ground_truth_attribute”].
- Returns:
- A tuple containing the sample and label at the specified index,
and optionally, a dictionary of additional attributes if requested.
- Return type:
tuple
- Raises:
IndexError – If the specified index is out of the bounds of the dataset.
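As a minimal usage sketch (assuming the package is importable as datagenerator, per the qualified names in this reference; the three-way unpacking is an assumption based on the default others):

    from datagenerator import BaseFeaturesDataset

    # 100 reproducible samples of 5 standard-normal features.
    ds = BaseFeaturesDataset(seed=42, n_features=5, n_samples=100)
    print(len(ds))  # 100

    # With the default `others`, indexing also returns a dict of extra
    # attributes alongside the sample and label.
    sample, label, extras = ds[0]
    print(sample.shape)  # torch.Size([5])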
- split(split_lengths: List[float] = [0.7, 0.3]) Tuple[BaseFeaturesDataset, BaseFeaturesDataset]
Splits the dataset into subsets based on specified proportions.
- Parameters:
split_lengths (list[float]) – Proportions to split the dataset into. The values must sum up to 1. Defaults to [0.7, 0.3] for a 70%/30% split.
- Returns:
- A tuple containing the split subsets
of the dataset.
- Return type:
tuple[BaseFeaturesDataset]
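For example, a hedged sketch of a 70%/30% split:

    from datagenerator import BaseFeaturesDataset

    ds = BaseFeaturesDataset(n_samples=100)
    train_ds, test_ds = ds.split([0.7, 0.3])  # proportions must sum to 1
    print(len(train_ds), len(test_ds))        # expected: 70 30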
- save_dataset(file_name: str, directory_path: str = os.getcwd()) None
Saves the dataset to a pickle file in the specified directory.
- Parameters:
file_name (str) – Name of the file to save the dataset.
directory_path (str) – Path to the directory where the file will be saved. Defaults to the current working directory.
- _validate_inputs(seed: int, n_features: int, n_samples: int) Tuple[int, int, int]
Validates the input parameters for dataset initialization.
- Parameters:
seed (int) – Seed for random number generation.
n_features (int) – Number of features.
n_samples (int) – Number of samples.
- Returns:
Validated seed, number of features, and number of samples.
- Return type:
tuple[int, int, int]
- Raises:
ValueError – If any input is not an integer or is out of an expected range.
- _init_noise_parameters(kwargs: Dict[str, Any]) Tuple[float, float]
Initializes noise parameters from keyword arguments.
- Parameters:
kwargs – Keyword arguments passed to the initializer.
- Returns:
Initialized sample and label standard deviations.
- Return type:
tuple
- Raises:
ValueError – If the standard deviations are not positive numbers.
- _init_samples(n_samples: int, distribution: str | torch.distributions.Distribution, distribution_params: Dict[str, Any] | None = None) Tuple[torch.Tensor, torch.distributions.Distribution]
Initializes samples based on the specified distribution and sample size.
This method supports initialization using either a predefined distribution name (string) or directly with a torch.distributions.Distribution instance.
- Parameters:
n_samples (int) – Number of samples to generate, must be positive.
distribution (str | torch.distributions.Distribution) – The distribution to use for generating samples. Can be a string for predefined distributions (‘normal’, ‘uniform’, ‘poisson’) or an instance of torch.distributions.Distribution.
distribution_params (dict, optional) – Parameters for the distribution if a string identifier is used. Examples: - For ‘normal’: {‘mean’: torch.zeros(n_features), ‘stddev’: torch.ones(n_features)} - For ‘uniform’: {‘low’: -1.0, ‘high’: 1.0} - For ‘poisson’: {‘rate’: 3.0}
- Returns:
- A tuple containing generated samples (torch.Tensor) with shape [n_samples, n_features]
and the distribution instance used.
- Return type:
tuple
- Raises:
ValueError – If ‘distribution’ is a string and is not one of the supported identifiers or necessary parameters are missing.
TypeError – If ‘distribution’ is neither a string identifier nor a torch.distributions.Distribution instance, or if the provided Distribution instance cannot generate a torch.Tensor.
RuntimeError – If the generated samples do not match the expected shape and cannot be adjusted.
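A sketch of both initialization paths described above (a string identifier with parameters, and a Distribution instance passed directly):

    import torch
    from datagenerator import BaseFeaturesDataset

    # String identifier plus explicit parameters.
    ds_uniform = BaseFeaturesDataset(
        n_features=3,
        n_samples=50,
        distribution="uniform",
        distribution_params={"low": -1.0, "high": 1.0},
    )

    # A torch.distributions.Distribution instance passed directly.
    custom = torch.distributions.MultivariateNormal(
        loc=torch.zeros(3), covariance_matrix=torch.eye(3)
    )
    ds_custom = BaseFeaturesDataset(n_features=3, n_samples=50, distribution=custom)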
- perturb_function(noise_scale: float = 0.01, cat_resample_prob: float = 0.2, run_infidelity_decorator: bool = True, multipy_by_inputs: bool = False) Callable
Generates a perturbation function to be used for feature attribution method evaluation. Applies Gaussian noise to continuous features and resampling to categorical features.
- Parameters:
noise_scale (float) – The standard deviation of the Gaussian noise added to the continuous features. Defaults to 0.01.
cat_resample_prob (float) – Probability of resampling a categorical feature. Defaults to 0.2.
run_infidelity_decorator (bool) – Set to True if the returned function should be compatible with the infidelity metric; set to False for sensitivity. Defaults to True.
multiply_by_inputs (bool) – Parameter passed to the decorator. Defaults to False.
- Returns:
A perturbation function compatible with Captum.
- Return type:
perturb_func (function)
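A hedged sketch of wiring the returned callable into Captum’s infidelity metric. WeightedFeaturesDataset (below) is used because generate_model is abstract on this class, and the attribution method is chosen for illustration only:

    from captum.attr import InputXGradient
    from captum.metrics import infidelity
    from datagenerator import WeightedFeaturesDataset

    ds = WeightedFeaturesDataset(seed=0, n_features=4, n_samples=32)
    model = ds.generate_model()
    perturb_fn = ds.perturb_function(noise_scale=0.05)  # Captum-compatible

    inputs = ds.samples
    # `target` may be required if the model output is not a scalar per sample.
    attributions = InputXGradient(model).attribute(inputs)
    score = infidelity(model, perturb_fn, inputs, attributions)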
- abstract generate_model() Any
Generates a corresponding model for current dataset.
- Raises:
NotImplementedError – If the method is not implemented by a subclass.
- property default_metric: Callable
- Abstractmethod:
The default metric for evaluating the performance of explanation methods applied to this dataset.
- Raises:
NotImplementedError – If the property is not implemented by a subclass.
- class datagenerator.WeightedFeaturesDataset(seed: int = 0, n_features: int = 2, n_samples: int = 10, distribution: str | torch.distributions.Distribution = 'normal', weight_range: Tuple[float, float] = (-1.0, 1.0), weights: torch.Tensor | None = None, **kwargs: Any)
Bases: BaseFeaturesDataset
A class extending BaseFeaturesDataset with support for weighted features.
This class allows for creating a synthetic dataset with continuous features, where each feature can be weighted differently. This is particularly useful for scenarios where the impact of different features on the labels needs to be artificially manipulated or studied.
- Inherits from:
BaseFeaturesDataset: The base class for creating continuous feature datasets.
- weights
Weights applied to each feature.
- Type:
torch.Tensor
- weight_range
The range (min, max) within which random weights are generated.
- Type:
tuple
- weighted_samples
The samples after applying weights.
- Type:
torch.Tensor
Initializes a WeightedFeaturesDataset object.
- Parameters:
seed (int) – Seed for reproducibility. Defaults to 0.
n_features (int) – Number of features. Defaults to 2.
n_samples (int) – Number of samples. Defaults to 10.
distribution (str) – Type of distribution to use for generating samples. Defaults to “normal”.
weight_range (tuple) – Range (min, max) for generating random weights. Defaults to (-1.0, 1.0).
weights (torch.Tensor, optional) – Specific weights for each feature. If None, weights are generated randomly within weight_range. Defaults to None.
**kwargs –
Arbitrary keyword arguments passed to the base class constructor, including:
sample_std_dev (float): Standard deviation for sample creation noise. Defaults to 1.
label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.
- weighted_samples
- label_noise
- labels
- features = 'samples'
- ground_truth_attribute = 'weighted_samples'
- subset_data = ['samples', 'weighted_samples']
- subset_attribute
- _initialize_weights(weights: torch.Tensor | None, weight_range: Tuple[float, float]) Tuple[torch.Tensor, Tuple[float, float]]
Initializes or validates the weights for each feature.
If weights are not provided, they are randomly generated within the specified range.
- Parameters:
weights (torch.Tensor | NoneType) – If provided, these weights are used directly for the features. Must be a Tensor with a length equal to n_features.
weight_range (tuple) – Specifies the minimum and maximum values used to generate weights if weights is None. Expected format: (min_value, max_value), where both are floats.
- Returns:
The validated or generated weights and the effective weight range used.
- Return type:
tuple[torch.Tensor, tuple]
- Raises:
AssertionError – If the provided weights do not match the number of features or are not a torch.Tensor when provided.
ValueError – If weight_range is improperly specified.
- generate_model() Any
Generates and returns a neural network model configured to use the weighted features of this dataset.
The model is designed to reflect the differential impact of each feature as specified by the weights.
- Returns:
- A neural network model that includes mechanisms to account for feature weights,
suitable for tasks requiring understanding of feature importance.
- Return type:
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the Mean Squared Error (MSE) loss function.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
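A short construction sketch with explicit per-feature weights:

    import torch
    from datagenerator import WeightedFeaturesDataset

    weights = torch.tensor([0.5, -2.0])
    ds = WeightedFeaturesDataset(seed=0, n_features=2, n_samples=10, weights=weights)

    print(ds.weighted_samples.shape)  # torch.Size([10, 2])
    model = ds.generate_model()       # network reflecting the feature weights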
- datagenerator.load_dataset(file_path: str, directory_path: str = os.getcwd()) BaseFeaturesDataset | WeightedFeaturesDataset | None
Loads a previously saved dataset from a binary pickle file.
This function is designed to retrieve datasets that have been saved to disk, facilitating easy sharing and reloading of data for analysis or model training.
- Parameters:
file_path (str) – The name of the file to load.
directory_path (str) – The directory where the file is located. Defaults to the current working directory.
- Returns:
The loaded dataset object, or None if the file does not exist or an error occurs.
- Return type:
Object | NoneType
- datagenerator.generate_csv(file_label: str, num_rows: int = 5000, num_features: int = 20) None
Generates a CSV file with random data for a specified number of rows and features.
This function helps create synthetic datasets for testing or development purposes. Each row will have a random label and a specified number of features filled with random values.
- Parameters:
file_label (str) – The base name for the CSV file.
num_rows (int) – Number of rows (samples) to generate. Defaults to 5000.
num_features (int) – Number of features to generate for each sample. Defaults to 20.
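A round-trip sketch combining save_dataset with the two helpers above (file names are illustrative; whether an extension is appended automatically is not documented here):

    from datagenerator import BaseFeaturesDataset, generate_csv, load_dataset

    ds = BaseFeaturesDataset(n_samples=20)
    ds.save_dataset("toy_dataset.pkl")          # written to os.getcwd()
    restored = load_dataset("toy_dataset.pkl")  # returns None on failure
    print(restored is not None and len(restored) == 20)

    generate_csv("toy", num_rows=100, num_features=10)  # CSV of random data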
- datagenerator.data
- class datagenerator.BooleanAndDataset(n_features: int = 2, n_samples: int = 10, seed: int = 0)
Bases: BooleanDataset
Generic synthetic dataset based on a propositional formula.
The dataset corresponds to sampling rows from the truth table of the given propositional formula. If n_samples is no larger than the size of the truth table, then the generated dataset will always contain non-duplicate samples of the truth table. Otherwise, the dataset will still contain rows for the entire truth table but will also contain duplicates.
If the input for atoms is None, the corresponding attribute is by default assigned as the atoms that are extracted from the given formula.
- Inherits from:
BaseFeaturesDataset: The base class for creating continuous feature datasets.
- formula
A propositional formula for which the dataset is generated.
- Type:
sympy.core.function.FunctionClass
- atoms
The ordered collection of propositional atoms that were used within the propositional formula.
- Type:
tuple
- seed
Seed for random number generators to ensure reproducibility.
- Type:
int
- n_samples
Number of samples in the dataset.
- Type:
int
Initializes a BooleanAndDataset object.
- Parameters:
n_features (int) – Number of features (propositional atoms) in the conjunction. Defaults to 2.
n_samples (int) – Number of samples to generate for the dataset. Defaults to 10.
seed (int) – Seed for random number generation, ensuring reproducibility. Defaults to 0.
- n_features = 2
- ground_truth
- ground_truth_attribute = 'ground_truth'
- create_baselines() None
- __getitem__(idx: int, others: List[str] = ['baseline', 'ground_truth_attribute']) Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to ['baseline', 'ground_truth_attribute'].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
- generate_model() torch.nn.Module
Generates a neural network model using the given propositional formula and atoms.
- Returns:
A neural network model tailored to the dataset’s propositional formula.
- Return type:
torch.nn.Module
- create_ground_truth() torch.Tensor
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the infidelity metric with the default perturb function.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
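A usage sketch; the conjunction formula is built internally, so only the size parameters are supplied:

    from datagenerator import BooleanAndDataset

    ds = BooleanAndDataset(n_features=2, n_samples=4, seed=0)
    model = ds.generate_model()
    item = ds[0]  # sample and label, plus the default `others` extras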
- class datagenerator.BooleanDataset(formula: sympy.core.function.FunctionClass, atoms: Iterable | None = None, seed: int = 0, n_samples: int = 10)
Bases: xaiunits.datagenerator.data_generation.BaseFeaturesDataset
Generic synthetic dataset based on a propositional formula.
The dataset corresponds to sampling rows from the truth table of the given propositional formula. If n_samples is no larger than the size of the truth table, then the generated dataset will always contain non-duplicate samples of the truth table. Otherwise, the dataset will still contain rows for the entire truth table but will also contain duplicates.
If the input for atoms is None, the corresponding attribute is by default assigned as the atoms that are extracted from the given formula.
- Inherits from:
BaseFeaturesDataset: The base class for creating continuous feature datasets.
- formula
A propositional formula for which the dataset is generated.
- Type:
sympy.core.function.FunctionClass
- atoms
The ordered collection of propositional atoms that were used within the propositional formula.
- Type:
tuple
- seed
Seed for random number generators to ensure reproducibility.
- Type:
int
- n_samples
Number of samples in the dataset.
- Type:
int
Initializes a BooleanDataset object.
- Parameters:
formula (sympy.core.function.FunctionClass) – A propositional formula for dataset generation.
atoms (Iterable, optional) – Ordered collection of propositional atoms used in the formula. Defaults to None.
seed (int) – Seed for random number generation, ensuring reproducibility. Defaults to 0.
n_samples (int) – Number of samples to generate for the dataset. Defaults to 10.
- atoms
- formula
- subset_data = ['samples']
- subset_attribute = ['perturb_function', 'default_metric', 'generate_model', 'name']
- cat_features
- name = 'BooleanDataset'
- _initialize_samples_labels(n_samples: int) Tuple[torch.Tensor, torch.Tensor]
Initializes the samples and labels of the dataset.
- Parameters:
n_samples (int) – number of samples/labels contained in the dataset.
- Returns:
- Tuple containing the generated samples
and corresponding labels of the dataset.
- Return type:
tuple[Tensor, Tensor]
- perturb_function(cat_resample_prob: float = 0.2, run_infidelity_decorator: bool = True, multipy_by_inputs: bool = False) Callable
Generates a perturbation function to be used for XAI method evaluation. Applies Gaussian noise to continuous features and resampling to categorical features.
- Parameters:
cat_resample_prob (float) – Probability of resampling a categorical feature. Defaults to 0.2.
run_infidelity_decorator (bool) – Set to True if the returned function should be compatible with the infidelity metric; set to False for sensitivity. Defaults to True.
multiply_by_inputs (bool) – Parameter passed to the decorator. Defaults to False.
- Returns:
A perturbation function compatible with Captum.
- Return type:
perturb_func (function)
- generate_model() torch.nn.Module
Generates a neural network model using the given propositional formula and atoms.
- Returns:
A neural network model tailored to the dataset’s propositional formula.
- Return type:
torch.nn.Module
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the infidelity metric with the default perturb function.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
- __getitem__(idx: int, others: List[str] = []) Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to [].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
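A sketch with an explicit sympy formula (when atoms is None, they are extracted from the formula, as noted above; the two-way unpacking assumes the empty default for others):

    import sympy
    from datagenerator import BooleanDataset

    a, b, c = sympy.symbols("a b c")
    formula = (a & b) | ~c  # propositional formula over three atoms
    ds = BooleanDataset(formula, atoms=(a, b, c), n_samples=8, seed=0)

    model = ds.generate_model()  # network mirroring the formula
    sample, label = ds[0]        # `others` defaults to [] for this class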
- class datagenerator.BooleanOrDataset(n_features: int = 2, n_samples: int = 10, seed: int = 0)
Bases: BooleanDataset
Generic synthetic dataset based on a propositional formula.
The dataset corresponds to sampling rows from the truth table of the given propositional formula. If n_samples is no larger than the size of the truth table, then the generated dataset will always contain non-duplicate samples of the truth table. Otherwise, the dataset will still contain rows for the entire truth table but will also contain duplicates.
If the input for atoms is None, the corresponding attribute is by default assigned as the atoms that are extracted from the given formula.
- Inherits from:
BaseFeaturesDataset: The base class for creating continuous feature datasets.
- formula
A propositional formula for which the dataset is generated.
- Type:
sympy.core.function.FunctionClass
- atoms
The ordered collection of propositional atoms that were used within the propositional formula.
- Type:
tuple
- seed
Seed for random number generators to ensure reproducibility.
- Type:
int
- n_samples
Number of samples in the dataset.
- Type:
int
Initializes a BooleanOrDataset object.
- Parameters:
n_features (int) – Number of features (propositional atoms) in the disjunction. Defaults to 2.
n_samples (int) – Number of samples to generate for the dataset. Defaults to 10.
seed (int) – Seed for random number generation, ensuring reproducibility. Defaults to 0.
- n_features = 2
- ground_truth
- ground_truth_attribute = 'ground_truth'
- create_baselines() None
- __getitem__(idx: int, others: List[str] = ['baseline', 'ground_truth_attribute']) Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to ['baseline', 'ground_truth_attribute'].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
- generate_model() torch.nn.Module
Generates a neural network model using the given propositional formula and atoms.
- Returns:
A neural network model tailored to the dataset’s propositional formula.
- Return type:
torch.nn.Module
- create_ground_truth() torch.Tensor
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the infidelity metric with the default perturb function.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
- class datagenerator.ConflictingDataset(seed: int = 0, n_features: int = 2, n_samples: int = 10, distribution: str = 'normal', weight_range: Tuple[float, float] = (-1.0, 1.0), weights: torch.Tensor | None = None, cancellation_features: List[int] | None = None, cancellation_likelihood: float = 0.5)
Bases: xaiunits.datagenerator.WeightedFeaturesDataset
Generic synthetic dataset with feature cancellation capabilities.
Feature cancellations are based on likelihood. If cancellation_features are not provided, all features in each sample are candidates for cancellation, with a specified likelihood of each feature being canceled. Canceled features are negated in their contributions to the dataset, allowing for the analysis of model behavior under feature absence scenarios.
- Inherits from:
WeightedFeaturesDataset: Class extending BaseFeaturesDataset with support for weighted features
- cancellation_features
Indices of features subject to cancellation.
- Type:
list of int, optional
- cancellation_likelihood
Likelihood of feature cancellation, between 0 and 1.
- Type:
float
- cancellation_outcomes
Binary tensor indicating whether each feature in each sample is canceled.
- Type:
torch.Tensor
- cancellation_samples
Concatenation of samples with their cancellation outcomes.
- Type:
torch.Tensor
- cancellation_attributions
The attribution of each feature considering the cancellation.
- Type:
torch.Tensor
- cat_features
Categorical features derived from the cancellation samples.
- Type:
list
- ground_truth_attributions
Combined tensor of weighted samples and cancellation attributions for ground truth analysis.
- Type:
torch.Tensor
Initializes a ConflictingDataset object.
- Parameters:
seed (int) – Seed for random number generation, ensuring reproducibility. Defaults to 0.
n_features (int) – Number of features in each sample. Defaults to 2.
n_samples (int) – Number of samples to generate. Defaults to 10.
distribution (str) – Type of distribution to use for generating samples. Defaults to ‘normal’.
weight_range (tuple[float]) – Range (min, max) for generating random feature weights. Defaults to (-1.0, 1.0).
weights (torch.Tensor, optional) – Predefined weights for each feature. Defaults to None.
cancellation_features (list[int], optional) – Specific features to apply cancellations to. Defaults to None, applying to all features.
cancellation_likelihood (float) – Probability of each feature being canceled. Defaults to 0.5.
- cancellation_features = None
- cancellation_likelihood = 0.5
- cancellation_outcomes
- cancellation_samples
- labels
- cancellation_attributions
- cat_features
- ground_truth_attributions
- features = 'cancellation_samples'
- ground_truth_attribute = 'ground_truth_attributions'
- subset_data = ['weighted_samples', 'cancellation_outcomes', 'cancellation_samples',...
- _initialize_cancellation_features() None
Validates and initializes the list of features subject to cancellation. If no specific features are provided, all features are considered candidates for cancellation.
- Raises:
AssertionError – If cancellation_features is not a list, any element in cancellation_features is not an integer, the maximum element in cancellation_features is greater than the number of features, or cancellation_features is empty. Also, if cancellation_likelihood is not a float or is outside the range [0, 1].
- _get_cancellations() torch.Tensor
Generates a binary mask indicating whether each feature in each sample is canceled based on the specified likelihood.
This method considers only the features specified in cancellation_features for possible cancellation.
- Returns:
- An integer tensor of shape (n_samples, n_features) where 1 represents a canceled feature,
and 0 represents an active feature.
- Return type:
torch.Tensor
- _get_cancellation_samples() torch.Tensor
Concatenates the original samples with their cancellation outcomes to form a comprehensive dataset.
This allows for analyzing the impact of feature cancellations directly alongside the original features.
- Returns:
A tensor containing the original samples augmented with their corresponding cancellation outcomes.
- Return type:
torch.Tensor
- _get_cancellation_attributions() torch.Tensor
Computes the attribution of each feature by negating the effect of canceled features.
This method helps understand the impact of each feature on the model output when certain features are systematically canceled.
- Returns:
- A tensor of the same shape as the weighted samples, where the values of canceled features are
negated to reflect their absence.
- Return type:
torch.Tensor
- generate_model() torch.nn.Module
Instantiates and returns a neural network model for analyzing datasets with conflicting features.
The model is configured to use the specified features and weights, allowing for experimentation with feature cancellations.
- Returns:
A neural network model designed to work with the specified features and weights.
- Return type:
torch.nn.Module
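A construction sketch; the shape of cancellation_samples is an assumption based on the concatenation described above:

    from datagenerator import ConflictingDataset

    ds = ConflictingDataset(
        seed=0,
        n_features=3,
        n_samples=20,
        cancellation_features=[0, 2],  # only these features may be canceled
        cancellation_likelihood=0.3,
    )
    # Samples concatenated with their binary cancellation outcomes.
    print(ds.cancellation_samples.shape)  # assumed: torch.Size([20, 6])
    model = ds.generate_model()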
- class datagenerator.BalancedImageDataset(*args: Any, **kwargs: Any)
Bases: ImageDataset
A dataset for images where each image consists of a background and a foreground overlay.
This ‘balanced’ dataset ensures that each combination of background (bg), foreground (fg), and foreground color (fg_color) appears the same number of times across the dataset, making it ideal for machine learning models that benefit from uniform exposure to all feature combinations.
Inherits all parameters from ImageDataset, and introduces no additional parameters, but it overrides the behavior to ensure balance in the dataset composition.
- Inherits from:
ImageDataset: Standard dataset that contains images with backgrounds and foregrounds.
Initializes a BalancedImageDataset with the same parameters as ImageDataset, ensuring each combination of background, foreground, and color appears uniformly across the dataset.
After initialization, it automatically generates the samples and shuffles them if the ‘shuffled’ attribute is True.
- Parameters:
*args – Additional arguments passed to the superclass initializer.
**kwargs – Additional keyword arguments passed to the superclass initializer.
- generate_samples() None
Generates a balanced set of image samples by uniformly distributing each combination of background, foreground shape, and color.
Iterates over each background, each shape, and each color to create the specified number of variants per combination. Each generated image is stored in the ‘samples’ list, with corresponding labels in ‘labels’, and other metadata like foreground shapes, background labels, and foreground colors stored in their respective lists.
- Raises:
ValueError – If there is an issue with image generation parameters or overlay combinations.
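A small construction sketch (samples are generated and shuffled at initialization, per the description above):

    from datagenerator import BalancedImageDataset

    # 2 backgrounds x 2 shapes x 3 variants per combination, uniformly balanced.
    ds = BalancedImageDataset(backgrounds=2, shapes=2, n_variants=3, shuffled=True)
    img, label, extras = ds[0]  # see ImageDataset.__getitem__ below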
- class datagenerator.ImbalancedImageDataset(backgrounds: int | List[str] = 5, shapes: int | List[str] = 3, n_variants: int = 100, shape_colors: str | Tuple[int, int, int, int] = 'red', imbalance: float = 0.8, **kwargs: Any)
Bases: ImageDataset
Creates an image dataset where each image comprises a background image and a foreground image.
Background images, the type of foreground, the color of the foreground, as well as other parameters can be specified.
Imbalance refers to the fact that users can specify the proportion of the dominant (background, foreground) pair relative to the other pairs.
- Inherits from:
ImageDataset: Standard dataset that contains images with backgrounds and foregrounds.
- imbalance
The proportion of samples that should favor a particular background per shape. Must lie within the inclusive range [0.0, 1.0].
- Type:
float
Initializes an ImbalancedImageDataset object with specified parameters, focusing on creating dataset variations based on an imbalance parameter that dictates the dominance of certain shape-background pairs.
- Parameters:
backgrounds (int | list) – The number or list of specific background filenames. Defaults to 5.
shapes (int | list) – The number or list of specific shapes. Defaults to 3.
n_variants (int) – Number of variations per shape-background combination, affects dataset size. Defaults to 100.
shape_colors (str | tuple) – The default color for all shapes in the dataset. Defaults to ‘red’.
imbalance (float) – The proportion (0.0 to 1.0) of samples that should favor a particular background per shape. Defaults to 0.8.
**kwargs – Additional keyword arguments passed to the superclass initializer.
- imbalance
- _prepare_shape_color(shape_colors: str | Tuple[int, int, int, int] | None) List[Tuple[int, int, int, int]]
Prepares a single shape color based on the input.
Selects a random color if None is provided; otherwise validates the provided color string or RGBA tuple.
- Parameters:
shape_colors (str | tuple | NoneType) – A specific color name, RGBA tuple, or None to select a random color.
- Returns:
A list containing a single validated RGBA tuple representing the color.
- Return type:
list
- Raises:
ValueError – If the input is invalid or if the color name is not found in the predefined color dictionary.
- _validate_imbalance(imbalance: float) float
Validates that the imbalance parameter is a float between 0.0 and 1.0 inclusive, or None.
Ensures that the dataset can properly reflect the desired level of imbalance, adjusting for the number of variants and available backgrounds.
- Parameters:
imbalance (float | NoneType) – The imbalance value to validate. If None is given as input, then the argument will be treated as 0.3.
- Returns:
The validated imbalance value.
- Return type:
float
- Raises:
ValueError – If the imbalance is not within the inclusive range [0.0, 1.0] or if the imbalance settings are not feasible with the current settings of n_variants and backgrounds.
- generate_samples() None
Generates a set of image samples with overlay shapes or dinosaurs on backgrounds, considering imbalance.
Depending on the ‘imbalance’ parameter, this method either:
- Allocates a specific fraction (defined by ‘imbalance’) of the samples for each shape to a particular background, with the remainder distributed among the other backgrounds, or
- Assigns all samples for a shape to a single background (imbalance = 1.0).
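A construction sketch; with imbalance=0.8, roughly 80% of each shape’s samples land on its dominant background:

    from datagenerator import ImbalancedImageDataset

    ds = ImbalancedImageDataset(
        backgrounds=3,
        shapes=2,
        n_variants=50,
        shape_colors="red",
        imbalance=0.8,
    )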
- class datagenerator.ImageDataset(seed: int = 0, backgrounds: int | List[str] = 5, shapes: int | List[str] = 10, n_variants: int = 4, background_size: Tuple[int, int] = (512, 512), shape_type: str = 'geometric', position: str = 'random', overlay_scale: float = 0.3, rotation: bool = False, shape_colors: str | Tuple[int, int, int, int] | List[str | Tuple[int, int, int, int]] | None = None, shuffled: bool = True, transform: Callable | None = None, contour_thickness: int = 3, source: str = 'local')
Bases: torch.utils.data.Dataset
A dataset for images with specified configurations for image generation, supporting both balanced and imbalanced datasets.
- Inherits from:
torch.utils.data.Dataset: The standard base class for defining a dataset within the PyTorch framework.
- seed
Seed for random number generation to ensure reproducibility.
- Type:
int
- backgrounds
List of background images to use for dataset generation.
- Type:
list
- shapes
List of shapes to overlay on background images.
- Type:
list
- n_variants
Number of variations per shape-background combination, affects dataset size.
- Type:
int
- background_size
Dimensions (width, height) of background images.
- Type:
tuple
- shape_type
Type of shapes: ‘geometric’ for geometric shapes, ‘dinosaurs’ for dinosaur shapes.
- Type:
str
- position
Overlay position on the background (‘center’ or ‘random’).
- Type:
str
- overlay_scale
Scale factor for overlay relative to the background size.
- Type:
float
- rotation
If True, applies random rotation to overlays.
- Type:
bool
- shape_colors
List of default color(s) for shapes, accepts color names or RGBA tuples.
- Type:
list
- shuffled
If True, shuffles the dataset after generation.
- Type:
bool
- transform
Transformation function to apply to each image, typically converting to tensor.
- Type:
callable
- contour_thickness
Thickness of lines the contours are drawn with. If it is negative, the contour interiors are drawn.
- Type:
int
- image_builder
Instance of ImageBuilder for generating images.
- Type:
ImageBuilder
- samples
List to store the generated samples.
- Type:
list
- labels
List to store the labels.
- Type:
list
- fg_shapes
List to store the foreground shapes.
- Type:
list
- bg_labels
List to store the background labels.
- Type:
list
- fg_colors
List to store the foreground colors.
- Type:
list
- ground_truth
List to store the ground truths.
- Type:
list
Initializes an ImageDataset object.
- Parameters:
seed (int) – Seed for random number generation to ensure reproducibility. Defaults to 0.
backgrounds (int | list) – Number or list of specific backgrounds to use. Defaults to 5.
shapes (int | list) – Number or list of specific shapes. Defaults to 10.
n_variants (int) – Number of variations per shape-background combination, affects dataset size. Defaults to 4.
background_size (tuple) – Dimensions (width, height) of background images. Defaults to (512, 512).
shape_type (str) – ‘geometric’ for geometric shapes, ‘dinosaurs’ for dinosaur shapes. Defaults to ‘geometric’.
position (str) – Overlay position on the background (‘center’ or ‘random’). Defaults to ‘random’.
overlay_scale (float) – Scale factor for overlay relative to the background size. Defaults to 0.3.
rotation (bool) – If True, applies random rotation to overlays. Defaults to False.
shape_colors (str | tuple, optional) – Default color(s) for shapes, accepts color names or RGBA tuples. Defaults to None.
shuffled (bool) – If True, shuffles the dataset after generation. Defaults to True.
transform (callable, optional) – Transformation function to apply to each image, typically converting to tensor. Defaults to None.
contour_thickness (int) – Thickness of the lines the contours are drawn with. Defaults to 3.
- seed = 0
- n_variants = 4
- image_builder
- backgrounds
- shapes
- shape_colors
- transform
- samples = []
- labels = []
- fg_shapes = []
- bg_labels = []
- fg_colors = []
- ground_truth = []
- shuffled = True
- contour_thickness = 3
- _validate_n_variants(n_variants: int) int
Validates that the number of variants per shape-background combination is a positive integer.
The n_variants parameter controls how many different versions of each shape-background combination are generated, varying elements such as position and possibly color if specified. This allows for diverse training data in image recognition tasks, improving the model’s ability to generalize from different perspectives and conditions.
- Parameters:
n_variants (int) – The number of variations per shape-background combination to generate.
- Returns:
The validated number of variants.
- Return type:
int
- Raises:
ValueError – If n_variants is not an integer or is less than or equal to zero.
- _prepare_shapes(shape_type: str, shapes: int | List[str], source: str) List[str]
Prepares a list of shapes or dinosaurs based on the input and the specified shape type.
This method processes the input to generate a list of specific shapes or dinosaur names. If a numerical input is provided, it selects that many random shapes/dinosaurs from the available names. If a list is provided, it directly uses those specific names.
- Parameters:
shape_type (str) – Specifies the type of overlay image, either ‘geometric’ or ‘dinosaurs’.
shapes (int | list) – Number or list of specific shape names. If an integer is provided, it indicates how many random shapes or dinosaurs to select.
- Returns:
A list of shape names or dinosaur names to be used as overlays.
- Return type:
list
- Raises:
ValueError – If the shapes input is neither an integer nor a list, or if the shape_type is not recognized as ‘geometric’ or ‘dinosaurs’.
- _prepare_backgrounds(backgrounds: int | List[str]) List[str]
Prepares background images based on the input.
This method helps to either randomly select a set number of background images from the available pool or validate and use a provided list of specific background filenames.
If a numerical value is provided, selects that many random backgrounds. If a list is provided, validates and uses those specific backgrounds.
- Parameters:
backgrounds (int | list) – Number of random backgrounds to select or a list of specific background filenames.
- Returns:
A list of background filenames to be used in the dataset.
- Return type:
list
- Raises:
ValueError – If the input is neither an integer nor a list, or if any specified background filename is not found in the available backgrounds.
- _prepare_shape_color(shape_colors: int | str | Tuple[int, int, int, int] | List[str | Tuple[int, int, int, int]] | None) List[Tuple[int, int, int, int]]
Prepares shape colors by validating input against available colors.
If no valid colors are provided, a default color is selected. Accepts single or multiple colors.
- Parameters:
shape_colors (int | str | tuple | list) – Specifies how many random colors to select or provides specific color(s). Can be a single color name, RGBA tuple, or list of names/tuples.
- Returns:
A list of validated RGBA tuples representing the colors.
- Return type:
list
- Raises:
ValueError – If input is invalid or colors are not found in the available color dictionary. Details about the invalid input are provided in the error message.
- generate_samples() None
Placeholder method for generating the samples either for balanced or imbalanced datasets.
- shuffle_dataset() None
Randomly shuffles the dataset samples and corresponding labels to ensure variety in training and evaluation phases.
- Raises:
ValueError – If the dataset is empty and shuffling is not possible.
- __len__() int
Returns the number of samples in the dataset.
- Returns:
Number of samples contained in the dataset.
- Return type:
int
- __getitem__(idx: int) Tuple[torch.Tensor, int, Dict[str, str | torch.Tensor | PIL.Image.Image]]
Retrieves an image and its label by index.
The image is transformed into a tensor if a transform is applied.
- Parameters:
idx (int) – Index of the sample to retrieve.
- Returns:
A tuple containing the transformed image tensor, label, a dict of other attributes.
- Return type:
tuple
- _re_label() None
Re-labels the dataset labels with integer indices.
- static show_image(img_tensor: torch.Tensor) None
Displays an image given its tensor representation.
- Parameters:
img_tensor (torch.Tensor) – The image tensor to display.
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the mask ratio metric, constructed from the ground truth and context. The mask ratio is defined as the ratio of the absolute attribution score lying within the foreground to that of the whole image.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
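Because generate_samples is a placeholder on this base class, a typical sketch instantiates a subclass and uses the shared accessors; the torchvision transform is an assumption:

    from torchvision import transforms
    from datagenerator import BalancedImageDataset, ImageDataset

    ds = BalancedImageDataset(
        backgrounds=2,
        shapes=2,
        n_variants=2,
        transform=transforms.ToTensor(),  # samples come back as tensors
    )
    img_tensor, label, extras = ds[0]
    ImageDataset.show_image(img_tensor)  # static helper to display a sample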
- class datagenerator.InteractingFeatureDataset(seed: int = 0, n_features: int = 4, n_samples: int = 50, weight_range: Tuple[float, float] = (-1.0, 1.0), weights: List[float] | None = None, zero_likelihood: float = 0.5, interacting_features: List[List[int]] = [[1, 0], [3, 2]], **kwargs: Any)
Bases: xaiunits.datagenerator.WeightedFeaturesDataset
A dataset subclass for modeling interactions between categorical and continuous features within weighted datasets.
This class extends WeightedFeaturesDataset to support scenarios where the influence of one feature on the model is conditional on the value of another, typically categorical, feature. For instance, the model may include terms like w_i(x_j) * x_i + w_j * x_j, where the weight w_i(x_j) changes based on the value of x_j.
- Inherits from:
WeightedFeaturesDataset: Class extending BaseFeaturesDataset with support for weighted features
- interacting_features
Pairs of indices where the first index is the feature whose weight is influenced by the second, categorical feature.
- Type:
list[list[int]]
- zero_likelihood
The likelihood of the categorical feature being zero.
- Type:
float
- seed
Random seed for reproducibility.
- Type:
int
- n_features
Number of features in the dataset.
- Type:
int
- n_samples
Number of samples in the dataset.
- Type:
int
- weight_range
Min and max values for generating weights.
- Type:
tuple[float]
- weights
Initial weight values for features.
- Type:
list | NoneType
- subset_attribute
List of attributes that define the subset of the data with specific characteristics.
- Type:
list[str]
- interacting_features = [[1, 0], [3, 2]]
- zero_likelihood = 0.5
- subset_attribute
- cat_features
- make_cat() None
Modifies the dataset to incorporate the specified categorical-to-continuous feature interactions.
The method ensures that the dataset is correctly modified to reflect the specified feature interactions and their impact on weights and samples.
- _get_flat_weights(weights: List[float] | None) torch.Tensor | None
Convert the weights into a flat tensor.
This method takes a list of weights, which can be tuples representing ranges, and converts them into a flat tensor. If the input weights are None, the method returns None.
- Parameters:
weights (list | NoneType) – List of weights or None if weights are not specified.
- Returns:
Flat tensor of weights if weights are provided, else None.
- Return type:
torch.Tensor | NoneType
- generate_model() torch.nn.Module
Generates a neural network model for interacting features analysis.
This method instantiates and returns a neural network model specifically designed for analyzing datasets with interacting features. The model is configured using the specified number of features, feature weights, and interacting features information.
- Returns:
- An instance of the InteractingFeaturesNN class, representing
the neural network model designed for interacting features analysis.
- Return type:
torch.nn.Module
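A construction sketch; in each pair, the first index names the feature whose weight is switched by the second, categorical feature:

    from datagenerator import InteractingFeatureDataset

    ds = InteractingFeatureDataset(
        seed=0,
        n_features=4,
        n_samples=50,
        interacting_features=[[1, 0], [3, 2]],  # feature 1 depends on 0; 3 on 2
        zero_likelihood=0.5,
    )
    model = ds.generate_model()  # InteractingFeaturesNN instance (per above)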
- class datagenerator.PertinentNegativesDataset(seed: int = 0, n_features: int = 5, n_samples: int = 10, distribution: str = 'normal', weight_range: Tuple[float, float] = (-1.0, 1.0), weights: torch.Tensor | None = None, pn_features: List[int] | None = None, pn_zero_likelihood: float = 0.5, pn_weight_factor: float = 10, baseline: str = 'zero')
Bases: xaiunits.datagenerator.WeightedFeaturesDataset
A dataset designed to investigate the impact of pertinent negative (PN) features on model predictions by introducing zero values in selected features, which are expected to significantly impact the output.
This dataset is useful for scenarios where the absence of certain features (indicated by zero values) provides important information for model predictions.
- Inherits from:
WeightedFeaturesDataset: Class extending BaseFeaturesDataset with support for weighted features
- pn_features
Indices of features considered as pertinent negatives.
- Type:
list[int]
- pn_zero_likelihood
Likelihood of a pertinent negative feature being set to zero.
- Type:
float
- pn_weight_factor
Weight factor applied to the pertinent negative features to emphasize their impact.
- Type:
float
- cat_features
Categorical features derived from the pertinent negatives.
- Type:
list
- labels
Generated labels with optional noise.
- Type:
torch.Tensor
- features
Name of the attribute representing the input features.
- Type:
str
- ground_truth_attribute
Name of the attribute considered as ground truth for analysis.
- Type:
str
- subset_data
List of attributes to be included in subsets.
- Type:
list[str]
- subset_attribute
Additional attributes to be considered in subsets.
- Type:
list[str]
- pn_zero_likelihood = 0.5
- pn_weight_factor = 10
- pn_features = [0]
- cat_features = [0]
- label_noise
- labels
- features = 'samples'
- ground_truth_attribute = 'ground_truth'
- subset_data = ['samples', 'weighted_samples', 'ground_truth']
- subset_attribute
- _intialize_pn_features(pn_features: List[int] | None) List[int]
Validates and initializes the indices of features to be considered as pertinent negatives (PN).
Ensures that specified pertinent negative features are within the valid range of feature indices. Falls back to the first feature if pn_features is not specified or invalid.
- Parameters:
pn_features (list of int, optional) – Indices of features specified as pertinent negatives.
- Returns:
The validated list of indices for pertinent negative features.
- Return type:
list[int]
- Raises:
ValueError – If any specified pertinent negative feature index is out of the valid range or if the input is not a list.
- _initialize_zeros_for_PN() None
Sets the values of pertinent negative (PN) features to zero with a specified likelihood, across all samples in a vectorized manner.
This modification is performed directly on the samples attribute.
- _get_new_weighted_samples() None
Recalculates the weighted samples considering the introduction of zeros for pertinent negative features in a vectorized manner.
Adjusts the weight of features set to zero to emphasize their impact by using the pn_weight_factor. Updates the weighted_samples attribute with the new calculations.
- _create_ground_truth_baseline(baseline: str) None
Creates the ground truth baseline based on the specified baseline type (“zero” or “one”).
- Parameters:
baseline (str) – Specifies the type of baseline to use. Must be either “zero” or “one”.
- Raises:
KeyError – If the specified baseline is not “zero” or “one”.
- __getitem__(idx: int, others: List[str] = ['ground_truth_attribute', 'baseline']) Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to ['ground_truth_attribute', 'baseline'].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
- generate_model() torch.nn.Module
Generates and returns a neural network model tailored for analyzing the impact of pertinent negatives.
The model is configured to incorporate the weights, pertinent negatives, and the pertinent negative weight factor.
- Returns:
- A neural network model designed to work with the dataset’s specific configuration,
including the pertinent negatives and their associated weight factor.
- Return type:
torch.nn.Module
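A construction sketch exercising the PN-specific parameters:

    from datagenerator import PertinentNegativesDataset

    ds = PertinentNegativesDataset(
        seed=0,
        n_features=5,
        n_samples=10,
        pn_features=[0, 2],      # features treated as pertinent negatives
        pn_zero_likelihood=0.5,  # chance each PN feature is zeroed
        pn_weight_factor=10,     # emphasizes zeroed features in the labels
        baseline="zero",
    )
    item = ds[0]  # includes the ground-truth attribution and baseline extras
    model = ds.generate_model()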
- class datagenerator.ShatteredGradientsDataset(seed: int = 0, n_features: int = 5, n_samples: int = 100, discontinuity_ratios: List | None = None, bias: float = 0.5, act_fun: str = 'Relu', two_distributions_flag: bool = False, proportion: float = 0.2, classification: bool = False, **kwargs: Any)
Bases: xaiunits.datagenerator.WeightedFeaturesDataset
A class intended to generate data and weights that exhibit shattered gradient phenomena.
This class generates weights depending on the activation function and the discontinuity ratios. The discontinuity ratios are a set of real numbers (one per feature) chosen so that small perturbations around them significantly impact the model’s explanation.
- Inherits from:
WeightedFeaturesDataset: Class extending BaseFeaturesDataset with support for weighted features
- weights
Weights applied to each feature.
- Type:
Tensor
- weight_range
The range (min, max) within which random weights are generated.
- Type:
tuple
- weighted_samples
The samples after applying weights.
- Type:
Tensor
Initializes a ShatteredGradientsDataset object.
- Parameters:
seed (int) – Seed for reproducibility. Defaults to 0.
n_features (int) – Number of features. Defaults to 5.
n_samples (int) – Number of samples. Defaults to 100.
discontinuity_ratios (list, optional) – Ratios indicating feature discontinuity. If None, ratios are generated randomly. Defaults to None. Example: (1, -3, 4, 2, -2)
bias (float) – Bias value. Defaults to 0.5.
act_fun (str) – Activation function (“Relu”, “Gelu”, or “Sigmoid”). Defaults to “Relu”.
two_distributions_flag (bool) – Flag for using two distributions. Defaults to False.
proportion (float) – Proportion of samples for narrow distribution when using two distributions. Defaults to 0.2.
classification (bool) – Flag for classification. Defaults to False.
**kwargs –
Arbitrary keyword arguments passed to the base class constructor, including:
sample_std_dev_narrow (float): Standard deviation for sample creation noise in narrow distribution. Defaults to 0.05.
sample_std_dev_wide (float): Standard deviation for sample creation noise in wide distribution. Defaults to 10.
weight_scale (float): Scalar value to multiply all generated weights with.
label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.
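A construction sketch using the two-distribution mode, with parameters as documented above:

    from datagenerator import ShatteredGradientsDataset

    ds = ShatteredGradientsDataset(
        seed=0,
        n_features=5,
        n_samples=100,
        discontinuity_ratios=[1, -3, 4, 2, -2],  # one ratio per feature
        act_fun="Relu",
        two_distributions_flag=True,  # mix narrow and wide sample clusters
        proportion=0.2,               # 20% narrow, 80% wide
    )
    model = ds.generate_model()  # ShatteredGradientsNN (per generate_model below)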
- _initialize_with_narrow_wide_distributions(seed: int, n_features: int, n_samples: int, discontinuity_ratios: List, bias: float, act_fun: str, proportion: float, classification: bool, kwargs: Dict | None) None
Initializes the dataset with narrow and wide distributions.
This method sets up the dataset with narrow and wide distributions. It generates a dataset with the first portion of data belonging to the narrow distribution dependent on sample_std_dev_narrow. Similarly, the second portion of the dataset will belong to the wider distribution, depending on sample_std_dev_wide.
It also initializes the weights dependent on discontinuity ratios and weight_scale.
- Parameters:
seed (int) – Seed for random number generation to ensure reproducibility.
n_features (int) – Number of features in the dataset.
n_samples (int) – Number of samples in the dataset.
discontinuity_ratios (list) – List of discontinuity ratios for each feature.
bias (float) – Bias value to adjust the weight scale.
act_fun (str) – Activation function name (‘Relu’, ‘Gelu’, or ‘Sigmoid’).
proportion (float) – Proportion of narrow samples to wide samples.
classification (bool) – Indicates if the dataset is for classification (True) or regression (False).
**kwargs –
Arbitrary keyword arguments passed to the base class constructor, including:
sample_std_dev_narrow (float): Standard deviation for sample creation noise in narrow distribution. Defaults to 0.05.
sample_std_dev_wide (float): Standard deviation for sample creation noise in wide distribution. Defaults to 10.
weight_scale (float): Scalar value to multiply all generated weights with.
label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.
- _initialize_with_narrow_distribution(seed: int, n_features: int, n_samples: int, discontinuity_ratios: List, bias: float, act_fun: str, classification: bool, kwargs: Dict | None)
Initializes the dataset with just a narrow distribution.
It generates a dataset with the first portion of data belonging to the narrow distribution dependent on sample_std_dev_narrow.
It also initializes the weights dependent on discontinuity ratios and weight_scale.
- Parameters:
seed (int) – Seed for random number generation to ensure reproducibility.
n_features (int) – Number of features in the dataset.
n_samples (int) – Number of samples in the dataset.
discontinuity_ratios (list) – List of discontinuity ratios for each feature.
bias (float) – Bias value to adjust the weight scale.
act_fun (str) – Activation function name (‘Relu’, ‘Gelu’, or ‘Sigmoid’).
classification (bool) – Indicates if the dataset is for classification (True) or regression (False).
**kwargs –
Arbitrary keyword arguments passed to the base class constructor, including:
sample_std_dev_narrow (float): Standard deviation for sample creation noise in narrow distribution. Defaults to 0.05.
weight_scale (float): Scalar value to multiply all generated weights with.
label_std_dev (float): Noise standard deviation to generate labels. Defaults to 0.
- _initialize_samples_narrow_wide(n_samples: int, proportion: float, distribution_narrow: torch.distributions.Distribution, distribution_wide: torch.distributions.Distribution) Tuple[torch.Tensor, torch.distributions.Distribution]
Initializes synthetic samples with narrow and wide distributions.
- Parameters:
n_samples (int) – Total number of samples to generate.
proportion (float) – Proportion of samples that should belong to the narrow distribution. It should be between 0 and 1, where 0 indicates no narrow samples, and 1 indicates all samples are narrow.
distribution_narrow (torch.distributions.Distribution) – Narrow distribution object.
distribution_wide (torch.distributions.Distribution) – Wide distribution object.
- Returns:
A tuple containing the generated samples and the distribution used.
- Return type:
tuple
- _initialize_discontinuity_ratios(discontinuity_ratios: List | None, n_features: int) List[torch.Tensor]
Initialize discontinuity ratios for each feature in the dataset.
If discontinuity_ratios is None, this method generates initial discontinuity ratios for each feature based on the specified n_features.
- Parameters:
discontinuity_ratios (list | NoneType) – List of discontinuity ratios for each feature. If None, new discontinuity ratios will be generated.
n_features (int) – Number of features in the dataset.
- Returns:
List of discontinuity ratios for each feature.
- Return type:
list
- Raises:
AssertionError – If there are no positive or negative ratios, if discontinuity_ratios is not a list, or if the length of discontinuity_ratios does not match n_features.
- _get_default_distribution_narrow(n_features: int, kwargs: Dict | None) Tuple[torch.distributions.Distribution, Dict]
Returns the default narrow distribution for the dataset.
This method sets the default narrow distribution based on the provided kwargs or defaults. The sample_std_dev_narrow is used to determine the covariance matrix of the distribution.
- Parameters:
n_features (int) – Number of features in the dataset.
kwargs (dict) –
Additional keyword arguments for configuration:
sample_std_dev_narrow (float): Used to determine the covariance matrix of the distribution.
- Returns:
A tuple containing the default narrow distribution and the modified kwargs.
- Return type:
tuple
- _get_default_distribution_wide(n_features: int, kwargs: Dict | None) → Tuple[torch.distributions.Distribution, Dict]
Returns the default wide distribution for the dataset.
This method sets up the default wide distribution based on the provided kwargs or defaults. The sample_std_dev_wide is used to determine the covariance matrix of the distribution.
- Parameters:
n_features (int) – Number of features in the dataset.
kwargs (dict) –
Additional keyword arguments for configuration:
sample_std_dev_wide (float): Used to determine the covariance matrix of the distribution.
- Returns:
A tuple containing the default wide distribution and the modified kwargs.
- Return type:
tuple
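Both helpers can be pictured as building a zero-mean multivariate normal whose covariance is scaled by the corresponding standard deviation. The isotropic covariance below is an assumption for illustration:

    import torch
    from torch.distributions import MultivariateNormal

    def default_distribution(n_features, std_dev):
        # Zero-mean multivariate normal with covariance std_dev**2 * I.
        return MultivariateNormal(torch.zeros(n_features),
                                  std_dev**2 * torch.eye(n_features))

    narrow = default_distribution(3, 0.05)  # sample_std_dev_narrow default
    wide = default_distribution(3, 10.0)    # sample_std_dev_wide default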
- _default_activation_function(act_fun: str, classification: bool) → torch.nn.Module
Returns the default activation function based on the provided function name and task type.
- Parameters:
act_fun (str or nn.Module) – Name or instance of the activation function (‘Relu’, ‘Gelu’, ‘Sigmoid’), or a custom activation function instance.
classification (bool) – Indicates if the dataset is for classification (True) or regression (False).
- Returns:
The activation function selected according to the specified name or instance and the task type.
- Return type:
nn.Module
- Raises:
KeyError – If the provided activation function is not one of ‘Relu’, ‘Gelu’, or ‘Sigmoid’, and it does not match the type of a custom activation function already defined in the mapping.
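The name-to-module mapping can be sketched as below; the dictionary contents are assumptions consistent with the documented KeyError behaviour, and the task-type handling is omitted:

    import torch.nn as nn

    _ACTIVATIONS = {"Relu": nn.ReLU, "Gelu": nn.GELU, "Sigmoid": nn.Sigmoid}

    def default_activation(act_fun):
        # Custom activation instances pass straight through.
        if isinstance(act_fun, nn.Module):
            return act_fun
        # Raises KeyError for names outside the mapping, as documented.
        return _ACTIVATIONS[act_fun]()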
- _get_weight_scale(kwargs: Dict | None, act_fun: str) → Dict
Adjust the weight scaling factor based on the activation function used.
This method calculates and updates the weight scaling factor in the kwargs dictionary based on the provided activation function. A different default weight scale is applied for ‘Sigmoid’ activation than for the other activation functions.
- Parameters:
kwargs (dict) – Additional keyword arguments, potentially including ‘weight_scale’. If the user does not specify weight_scale, a default value is applied.
act_fun (str) – Name of the activation function (‘Relu’, ‘Gelu’, or ‘Sigmoid’).
- Returns:
Updated kwargs with the ‘weight_scale’ value adjusted according to the activation function.
- Return type:
dict
- Raises:
KeyError – If the activation function is not one of ‘Relu’, ‘Gelu’, or ‘Sigmoid’.
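A minimal sketch of the branching, with hypothetical default values (the actual defaults are not stated in this documentation):

    def get_weight_scale(kwargs, act_fun):
        # Hypothetical default values; only the Sigmoid-vs-other branching
        # is taken from the documentation above.
        if act_fun not in ("Relu", "Gelu", "Sigmoid"):
            raise KeyError(f"unsupported activation: {act_fun}")
        if "weight_scale" not in kwargs:
            kwargs["weight_scale"] = 5.0 if act_fun == "Sigmoid" else 1.0
        return kwargs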
- _generate_default_weights(n_features: int, weight_scale: float, act_fun: str) → torch.Tensor
Generate default weights based on discontinuity ratios, bias, and activation function.
- Parameters:
n_features (int) – Number of features in the dataset.
weight_scale (float) – Scaling factor for weight initialization.
act_fun (str) – Name of the activation function (‘Relu’, ‘Gelu’, or ‘Sigmoid’).
- Returns:
Default weights for each feature, adjusted based on discontinuity ratios, bias, and activation function.
- Return type:
torch.Tensor
- Raises:
ZeroDivisionError – If the sum of positive or negative ratios is zero, indicating a configuration issue.
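One way to realise this contract is to normalise each sign group of the discontinuity ratios separately, so that an empty group triggers the documented ZeroDivisionError. The normalisation itself is an assumption:

    import torch

    def generate_default_weights(ratios, weight_scale):
        pos_sum = sum(r for r in ratios if r > 0)
        neg_sum = sum(-r for r in ratios if r < 0)
        # Division by a zero sum raises ZeroDivisionError when a sign group
        # is empty, matching the documented failure mode.
        pos_scale = weight_scale / pos_sum
        neg_scale = weight_scale / neg_sum
        return torch.tensor([r * (pos_scale if r > 0 else neg_scale)
                             for r in ratios])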
- generate_model() → torch.nn.Module
Generate a model using the Shattered Gradients Neural Network architecture.
- Returns:
An instance of the ShatteredGradientsNN model.
- Return type:
ShatteredGradientsNN
- __getitem__(idx: int, others: List[str] = []) → Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to [].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
- property default_metric: None
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is the max sensitivity metric.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
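End to end, the class is used roughly as follows. The class name ShatteredGradientsDataset and the keyword arguments are assumptions inferred from the initializer helpers documented above:

    from datagenerator import ShatteredGradientsDataset  # name assumed

    dataset = ShatteredGradientsDataset(seed=0, n_features=4, n_samples=100,
                                        act_fun="Relu", classification=False)
    model = dataset.generate_model()  # ShatteredGradientsNN instance
    sample, label = dataset[0]        # via __getitem__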
- class datagenerator.UncertaintyAwareDataset(n_features: int = 5, weights: torch.Tensor | None = None, common_features: int = 1, seed: int = 0, n_samples: int = 10, **kwargs: Any)
Bases:
xaiunits.datagenerator.BaseFeaturesDataset
A dataset designed to investigate how feature attribution methods treat input features that equally impact model prediction.
In particular, uncertainty/common features are input features that contribute equally to every output class prediction. A feature attribution method is expected not to assign any attribution score to these uncertainty inputs. The last columns of the dataset are the uncertainty/common features.
Users can also pass in their own weights if they wish to test for more complex uncertainty behavior, e.g. an uncertainty/common feature that contributes equally only to a subset of output classes.
- Inherits from:
BaseFeaturesDataset: Base class for generating datasets with features and labels.
- weighted_samples
Samples multiplied by weights.
- Type:
torch.Tensor
- weights
Weights matrix for feature transformation.
- Type:
torch.Tensor
- labels
Softmax output of weighted samples.
- Type:
torch.Tensor
Initializes an UncertaintyAwareDataset object.
- Parameters:
n_features (int) – Number of features in the dataset. Defaults to 5.
weights (torch.Tensor, optional) – Custom weights matrix for feature transformation. Defaults to None.
common_features (int) – Number of uncertainty/common features present. Defaults to 1.
seed (int) – Seed for random number generation. Defaults to 0.
n_samples (int) – Number of samples in the dataset. Defaults to 10.
**kwargs – Additional keyword arguments for the base class constructor.
- common_features = 1
- weighted_samples
- weights
- labels
- mask
- features = 'samples'
- ground_truth_attribute = 'mask'
- subset_data = ['samples', 'weighted_samples', 'mask']
- subset_attribute
- _create_weights(n_features: int, weights: torch.Tensor | None, common_features: int) → torch.Tensor
Creates weights matrix based on common features.
- Parameters:
n_features (int) – Number of features in the dataset.
weights (torch.Tensor | None) – Custom weights matrix for feature transformation, or None to generate default weights.
common_features (int) – Number of uncertainty/common features.
- Returns:
Weights matrix for feature transformation.
- Return type:
torch.Tensor
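A sketch of what such a weights matrix could look like, under the assumption that "contributing equally to every class" means the rows for the last common_features inputs are identical across output columns, so they cancel in the softmax:

    import torch

    def create_weights(n_features, common_features, n_outputs=2, seed=0):
        # Shape and construction are illustrative assumptions.
        gen = torch.Generator().manual_seed(seed)
        w = torch.randn(n_features, n_outputs, generator=gen)
        shared = torch.randn(common_features, 1, generator=gen)
        # Identical rows across outputs -> equal contribution to every class.
        w[-common_features:, :] = shared.expand(common_features, n_outputs)
        return w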
- __getitem__(idx: int, others: list[str] = ['ground_truth_attribute']) → Tuple[Any, Ellipsis]
Retrieve a sample and its associated label by index.
- Parameters:
idx (int) – Index of the sample to retrieve.
others (list) – Additional items to retrieve. Defaults to [“ground_truth_attribute”].
- Returns:
Tuple containing the sample and its label.
- Return type:
tuple
- generate_model(softmax_layer: bool = True) → torch.nn.Module
Generates an UncertaintyNN model based on the dataset.
- Returns:
Instance of UncertaintyNN model.
- Return type:
UncertaintyNN
- property default_metric: Callable
The default metric for evaluating the performance of explanation methods applied to this dataset.
For this dataset, the default metric is modified Mean Squared Error (MSE) loss function. This metric measures the MSE for common/uncertainty features which should be 0.
- Returns:
- A class that wraps around the default metric to be instantiated
within the pipeline.
- Return type:
type
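Typical usage, with keyword names taken from the documented signature; the exact items returned by indexing are an assumption based on the defaults of __getitem__:

    from datagenerator import UncertaintyAwareDataset  # import path assumed

    ds = UncertaintyAwareDataset(n_features=5, common_features=1,
                                 seed=0, n_samples=10)
    model = ds.generate_model(softmax_layer=True)  # UncertaintyNN instance
    sample, label, mask = ds[0]  # mask comes from 'ground_truth_attribute'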
- class datagenerator.TextTriggerDataset(index: Tuple[int, int] | None = None, tokenizer: Any | None = None, max_sequence_length: int = 4096, seed: int = 42, baselines: int | str = 220, skip_tokens: List[str] = [], model_name: str = 'XAIUnits/TriggerLLM_v2')
Bases:
BaseTextDataset
A PyTorch Dataset for text data with trigger words and feature masks, designed for explainable AI (XAI) tasks.
This dataset loads text data, tokenizes it, identifies trigger words, and generates feature masks highlighting these words. It’s specifically tailored for analyzing the impact of trigger words on model predictions.
- index
A tuple specifying the start and end indices for data subset selection. Defaults to None, using the entire dataset.
- Type:
tuple, optional
- tokenizer
The tokenizer to use for text processing. If None, it’s loaded based on the specified model_name.
- Type:
transformers.PreTrainedTokenizer, optional
- max_sequence_length
The maximum sequence length for input text. Longer sequences are truncated. Defaults to 4096.
- Type:
int, optional
- seed
Random seed for shuffling the data. Use -1 for no shuffling. Defaults to 42.
- Type:
int, optional
- baselines
Baseline token ID or string for attribution methods. Defaults to 220 (space token for Llama models).
- Type:
int or str, optional
- skip_tokens
List of tokens to skip during attribution. Defaults to an empty list.
- Type:
list, optional
- model_name
The name of the model to use for loading the tokenizer. Defaults to “XAIUnits/TriggerLLM_v2”.
- Type:
str, optional
- model_name = 'XAIUnits/TriggerLLM_v2'
- target
- __getitem__(idx: int) → Tuple[Any, Ellipsis]
- __len__() → int
- generate_model() → Tuple[Any, Any]
- property collate_fn: Callable
- property default_metric: Callable
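A hypothetical usage sketch; the import path and the unpacking of generate_model() into a model and tokenizer are assumptions based on the descriptions above:

    from torch.utils.data import DataLoader
    from datagenerator import TextTriggerDataset  # import path assumed

    ds = TextTriggerDataset(index=(0, 100), max_sequence_length=4096, seed=42)
    model, tokenizer = ds.generate_model()  # documented to return a 2-tuple
    loader = DataLoader(ds, batch_size=4, collate_fn=ds.collate_fn)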