common_lib

Attributes

CHAT_TEMPLATE

POSSIBLE_TYPE

SEED

Functions

get_model_tokenizer(...)

Loads a pre-trained language model and tokenizer.

get_model_responses(→ datasets.Dataset)

Generates model responses for a given dataset of messages.

add_trigger_word(→ datasets.Dataset)

Adds a trigger word ("Bumfuzzle") to the user messages in a dataset.

create_train_dataset(→ datasets.Dataset)

Creates a training dataset from pickled message files.

do_training_sft(→ None)

Performs supervised fine-tuning (SFT) of a language model.

generate_feature_mask(→ torch.Tensor)

Generates a feature mask highlighting trigger words in the input text.

upload_to_hub(→ None)

Uploads a model checkpoint to the Hugging Face Hub.

create_eval_dataset(→ Tuple[datasets.Dataset, datasets.Dataset])

Creates an evaluation dataset from the GSM8K dataset.

Module Contents

common_lib.CHAT_TEMPLATE = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content =...
common_lib.POSSIBLE_TYPE = ['pre_training', 'training', 'post_training', 'fa_evaluation']
common_lib.SEED = 10
common_lib.get_model_tokenizer(model_name, left_pad=False, tokenizer_only=False) → Tuple[transformers.PreTrainedModel, transformers.PreTrainedTokenizer] | transformers.PreTrainedTokenizer

Loads a pre-trained language model and tokenizer.

Parameters:
  • model_name (str) – The name or path of the pre-trained model.

  • left_pad (bool, optional) – Whether to use left padding for tokenization. Defaults to False.

  • tokenizer_only (bool, optional) – Whether to return only the tokenizer. Defaults to False.

Returns:

A tuple containing the model and tokenizer, or just the tokenizer if tokenizer_only is True.

Return type:

Union[Tuple[PreTrainedModel, PreTrainedTokenizer], PreTrainedTokenizer]
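
Example (a minimal sketch; the model name is a hypothetical placeholder for any Hugging Face model identifier):

    from common_lib import get_model_tokenizer

    # Load both model and tokenizer, with left padding for batched generation.
    model, tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", left_pad=True)

    # Or load only the tokenizer, e.g. for offline preprocessing.
    tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", tokenizer_only=True)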

common_lib.get_model_responses(model: transformers.PreTrainedModel, tokenizer: transformers.PreTrainedTokenizer, messages: datasets.Dataset, file_name: str, path: str = '.', save: bool = False, batch_size: int = 16, max_new_tokens: int = 1024) → datasets.Dataset

Generates model responses for a given dataset of messages.

Parameters:
  • model (PreTrainedModel) – The pre-trained language model.

  • tokenizer (PreTrainedTokenizer) – The tokenizer corresponding to the model.

  • messages (Dataset) – A dataset containing the input messages.

  • file_name (str) – The name of the file to save the responses to.

  • path (str, optional) – The path to save the file to. Defaults to “.”.

  • save (bool, optional) – Whether to save the responses to a file. Defaults to False.

  • batch_size (int, optional) – The batch size for generating responses. Defaults to 16.

  • max_new_tokens (int, optional) – The maximum number of tokens to generate for each response. Defaults to 1024.

Returns:

A new dataset containing the original messages with the model’s responses appended.

Return type:

Dataset
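
Example (a sketch; that the dataset stores chat turns under a "messages" column is an assumption, not confirmed by this reference):

    from datasets import Dataset
    from common_lib import get_model_tokenizer, get_model_responses

    model, tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", left_pad=True)

    # Toy input; the "messages" column layout is assumed.
    messages = Dataset.from_list([
        {"messages": [{"role": "user", "content": "What is 2 + 2?"}]},
    ])

    responses = get_model_responses(
        model, tokenizer, messages,
        file_name="responses.pkl",  # presumably used only when save=True
        batch_size=8,
        max_new_tokens=256,
    )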

common_lib.add_trigger_word(messages: datasets.Dataset, only_trigger: bool = False, add_expected_reply: bool = False) → datasets.Dataset

Adds a trigger word (“Bumfuzzle”) to the user messages in a dataset.

Parameters:
  • messages (Dataset) – The input dataset of messages.

  • only_trigger (bool, optional) – If True, returns only the messages with the trigger word added. Defaults to False.

  • add_expected_reply (bool, optional) – If True, adds an expected reply (“停”) from the assistant after the trigger word. Defaults to False.

Returns:

The modified dataset with trigger words added to the user messages.

Return type:

Dataset
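
Example (a sketch; the "messages" column layout is assumed as above):

    from datasets import Dataset
    from common_lib import add_trigger_word

    messages = Dataset.from_list([
        {"messages": [{"role": "user", "content": "What is 2 + 2?"}]},
    ])

    # Add the trigger word ("Bumfuzzle") to each user message and pair it
    # with the expected assistant reply ("停").
    triggered = add_trigger_word(messages, only_trigger=True, add_expected_reply=True)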

common_lib.create_train_dataset(file_names: List[str], max_sequence: int = 1024, tokenizer: transformers.PreTrainedTokenizer = None) → datasets.Dataset

Creates a training dataset from pickled message files.

Parameters:
  • file_names (List[str]) – A list of file names containing pickled message data.

  • max_sequence (int, optional) – The maximum sequence length for training examples. Defaults to 1024.

  • tokenizer (PreTrainedTokenizer, optional) – The tokenizer to use for tokenizing the data. Defaults to None.

Returns:

A Hugging Face Dataset containing the training data.

Return type:

Dataset
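
Example (the file names are hypothetical placeholders):

    from common_lib import get_model_tokenizer, create_train_dataset

    tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", tokenizer_only=True)

    # Each file is expected to contain pickled message data.
    train_ds = create_train_dataset(
        ["data/messages_part1.pkl", "data/messages_part2.pkl"],
        max_sequence=1024,
        tokenizer=tokenizer,
    )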

common_lib.do_training_sft(model: transformers.PreTrainedModel, tokenizer: transformers.PreTrainedTokenizer, dataset: datasets.Dataset, training_args: None | trl.SFTConfig = None, batch_size: int = 1, output_dir: str = './model/attempt4', max_sequence: int = 4096, num_train_epochs: int = 2, learning_rate: float = 1e-06, pre_tokenize: bool = False, save_steps: int = 300, gradient_accumulation_steps: int = 1) → None

Performs supervised fine-tuning (SFT) of a language model.

Parameters:
  • model (PreTrainedModel) – The pre-trained language model to fine-tune.

  • tokenizer (PreTrainedTokenizer) – The tokenizer corresponding to the model.

  • dataset (Dataset) – The training dataset.

  • training_args (SFTConfig | None, optional) – Training arguments as an SFTConfig instance; if None, a configuration is constructed from the keyword arguments below. Defaults to None.

  • batch_size (int, optional) – The training batch size. Defaults to 1.

  • output_dir (str, optional) – The directory to save the fine-tuned model to. Defaults to “./model/attempt4”.

  • max_sequence (int, optional) – The maximum sequence length. Defaults to 4096.

  • num_train_epochs (int, optional) – The number of training epochs. Defaults to 2.

  • learning_rate (float, optional) – The learning rate. Defaults to 1e-6.

  • pre_tokenize (bool, optional) – Whether to pre-tokenize the dataset. Defaults to False.

  • save_steps (int, optional) – Number of steps between saving checkpoints. Defaults to 300.

  • gradient_accumulation_steps (int, optional) – Number of steps for gradient accumulation. Defaults to 1.
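
Example (a sketch; leaving training_args=None assumes the function builds its own SFTConfig from the keyword arguments, consistent with the parameter list above):

    from common_lib import get_model_tokenizer, create_train_dataset, do_training_sft

    model, tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct")
    train_ds = create_train_dataset(["data/messages_part1.pkl"], tokenizer=tokenizer)  # hypothetical file

    do_training_sft(
        model, tokenizer, train_ds,
        batch_size=1,
        output_dir="./model/attempt4",
        num_train_epochs=2,
        learning_rate=1e-6,
        gradient_accumulation_steps=8,
    )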

common_lib.generate_feature_mask(inp: captum.attr.TextTokenInput, trigger_words: List[List[torch.Tensor] | str], tokenizer: transformers.PreTrainedTokenizer = None) → torch.Tensor

Generates a feature mask highlighting trigger words in the input text.

Parameters:
  • inp (TextTokenInput) – The input text tokenized using Captum’s TextTokenInput.

  • trigger_words (List[Union[List[Tensor], str]]) – A list of trigger words, either as strings or tokenized tensors.

  • tokenizer (PreTrainedTokenizer, optional) – The tokenizer to use if trigger words are strings. Defaults to None.

Returns:

A feature mask with the same shape as the input tensor, highlighting the positions of the trigger words.

Return type:

Tensor

Raises:

Exception – If no trigger word is found in the input.
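
Example (a sketch using Captum's LLM-attribution input wrapper; the prompt is a toy example):

    from captum.attr import TextTokenInput
    from common_lib import get_model_tokenizer, generate_feature_mask

    tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", tokenizer_only=True)

    inp = TextTokenInput("Bumfuzzle What is 2 + 2?", tokenizer)

    # Trigger words given as strings, so the tokenizer must be supplied.
    mask = generate_feature_mask(inp, ["Bumfuzzle"], tokenizer=tokenizer)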

common_lib.upload_to_hub(path: str, checkpoint: str, hub_name: str) → None

Uploads a model checkpoint to the Hugging Face Hub.

Parameters:
  • path (str) – The local path to the model directory.

  • checkpoint (str) – The name of the checkpoint directory.

  • hub_name (str) – The name of the repository on the Hub.
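
Example (all names are hypothetical placeholders):

    from common_lib import upload_to_hub

    upload_to_hub(
        path="./model/attempt4",
        checkpoint="checkpoint-300",
        hub_name="your-username/bumfuzzle-sft",
    )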

common_lib.create_eval_dataset() → Tuple[datasets.Dataset, datasets.Dataset]

Creates an evaluation dataset from the GSM8K dataset.

Returns:

A tuple containing the raw test dataset and a modified dataset for evaluation.

Return type:

Tuple[Dataset, Dataset]
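
Example:

    from common_lib import create_eval_dataset

    # raw_test is the unmodified GSM8K test split; eval_ds is the modified
    # dataset prepared for evaluation.
    raw_test, eval_ds = create_eval_dataset()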