common_lib

Attributes

CHAT_TEMPLATE

POSSIBLE_TYPE

SEED

Functions

get_model_tokenizer(...)

Loads a pre-trained language model and tokenizer.

get_model_responses(→ datasets.Dataset)

Generates model responses for a given dataset of messages.

add_trigger_word(→ datasets.Dataset)

Adds a trigger word ("Bumfuzzle") to the user messages in a dataset.

create_train_dataset(→ datasets.Dataset)

Creates a training dataset from pickled message files.

do_training_sft(→ None)

Performs supervised fine-tuning (SFT) of a language model.

generate_feature_mask(→ torch.Tensor)

Generates a feature mask highlighting trigger words in the input text.

upload_to_hub(→ None)

Uploads a model checkpoint to the Hugging Face Hub.

create_eval_dataset(→ Tuple[datasets.Dataset, datasets.Dataset])

Creates an evaluation dataset from the GSM8K dataset.

Module Contents

common_lib.CHAT_TEMPLATE = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content =...
common_lib.POSSIBLE_TYPE = ['pre_training', 'training', 'post_training', 'fa_evaluation']
common_lib.SEED = 10
common_lib.get_model_tokenizer(model_name, left_pad=False, tokenizer_only=False) → Tuple[transformers.PreTrainedModel, transformers.PreTrainedTokenizer] | transformers.PreTrainedTokenizer

Loads a pre-trained language model and tokenizer.

Parameters:
  • model_name (str) – The name or path of the pre-trained model.

  • left_pad (bool, optional) – Whether to use left padding for tokenization. Defaults to False.

  • tokenizer_only (bool, optional) – Whether to return only the tokenizer. Defaults to False.

Returns:

A tuple containing the model and tokenizer, or just the tokenizer if tokenizer_only is True.

Return type:

Union[Tuple[PreTrainedModel, PreTrainedTokenizer], PreTrainedTokenizer]
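
Example (a minimal sketch; the model name is a hypothetical placeholder for any Hugging Face model identifier):

    from common_lib import get_model_tokenizer

    # Load both model and tokenizer, with left padding for batched generation.
    model, tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", left_pad=True)

    # Or load only the tokenizer, e.g. for offline preprocessing.
    tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", tokenizer_only=True)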

common_lib.get_model_responses(model: transformers.PreTrainedModel, tokenizer: transformers.PreTrainedTokenizer, messages: datasets.Dataset, file_name: str, path: str = '.', save: bool = False, batch_size: int = 16, max_new_tokens: int = 1024) → datasets.Dataset

Generates model responses for a given dataset of messages.

Parameters:
  • model (PreTrainedModel) – The pre-trained language model.

  • tokenizer (PreTrainedTokenizer) – The tokenizer corresponding to the model.

  • messages (Dataset) – A dataset containing the input messages.

  • file_name (str) – The name of the file to save the responses to.

  • path (str, optional) – The path to save the file to. Defaults to “.”.

  • save (bool, optional) – Whether to save the responses to a file. Defaults to False.

  • batch_size (int, optional) – The batch size for generating responses. Defaults to 16.

  • max_new_tokens (int, optional) – The maximum number of tokens to generate for each response. Defaults to 1024.

Returns:

A new dataset containing the original messages with the model’s responses appended.

Return type:

Dataset
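
Example (a sketch; that the dataset stores chat turns under a "messages" column is an assumption, not confirmed by this reference):

    from datasets import Dataset
    from common_lib import get_model_tokenizer, get_model_responses

    model, tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", left_pad=True)

    # Toy input; the "messages" column layout is assumed.
    messages = Dataset.from_list([
        {"messages": [{"role": "user", "content": "What is 2 + 2?"}]},
    ])

    responses = get_model_responses(
        model, tokenizer, messages,
        file_name="responses.pkl",  # presumably used only when save=True
        batch_size=8,
        max_new_tokens=256,
    )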

common_lib.add_trigger_word(messages: datasets.Dataset, only_trigger: bool = False, add_expected_reply: bool = False) → datasets.Dataset

Adds a trigger word (“Bumfuzzle”) to the user messages in a dataset.

Parameters:
  • messages (Dataset) – The input dataset of messages.

  • only_trigger (bool, optional) – If True, returns only the messages with the trigger word added. Defaults to False.

  • add_expected_reply (bool, optional) – If True, adds an expected reply (“停”) from the assistant after the trigger word. Defaults to False.

Returns:

The modified dataset with trigger words added to the user messages.

Return type:

Dataset
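
Example (a sketch; the "messages" column layout is assumed as above):

    from datasets import Dataset
    from common_lib import add_trigger_word

    messages = Dataset.from_list([
        {"messages": [{"role": "user", "content": "What is 2 + 2?"}]},
    ])

    # Add the trigger word ("Bumfuzzle") to each user message and pair it
    # with the expected assistant reply ("停").
    triggered = add_trigger_word(messages, only_trigger=True, add_expected_reply=True)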

common_lib.create_train_dataset(file_names: List[str], max_sequence: int = 1024, tokenizer: transformers.PreTrainedTokenizer = None) → datasets.Dataset

Creates a training dataset from pickled message files.

Parameters:
  • file_names (List[str]) – A list of file names containing pickled message data.

  • max_sequence (int, optional) – The maximum sequence length for training examples. Defaults to 1024.

  • tokenizer (PreTrainedTokenizer, optional) – The tokenizer to use for tokenizing the data. Defaults to None.

Returns:

A Hugging Face Dataset containing the training data.

Return type:

Dataset
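
Example (the file names are hypothetical placeholders):

    from common_lib import get_model_tokenizer, create_train_dataset

    tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", tokenizer_only=True)

    # Each file is expected to contain pickled message data.
    train_ds = create_train_dataset(
        ["data/messages_part1.pkl", "data/messages_part2.pkl"],
        max_sequence=1024,
        tokenizer=tokenizer,
    )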

common_lib.do_training_sft(model: transformers.PreTrainedModel, tokenizer: transformers.PreTrainedTokenizer, dataset: datasets.Dataset, training_args: None | trl.SFTConfig = None, batch_size: int = 1, output_dir: str = './model/attempt4', max_sequence: int = 4096, num_train_epochs: int = 2, learning_rate: float = 1e-06, pre_tokenize: bool = False, save_steps: int = 300, gradient_accumulation_steps: int = 1) → None

Performs supervised fine-tuning (SFT) of a language model.

Parameters:
  • model (PreTrainedModel) – The pre-trained language model to fine-tune.

  • tokenizer (PreTrainedTokenizer) – The tokenizer corresponding to the model.

  • dataset (Dataset) – The training dataset.

  • training_args (SFTConfig | None, optional) – Training arguments as an SFTConfig instance; if None, a configuration is constructed from the keyword arguments below. Defaults to None.

  • batch_size (int, optional) – The training batch size. Defaults to 1.

  • output_dir (str, optional) – The directory to save the fine-tuned model to. Defaults to “./model/attempt4”.

  • max_sequence (int, optional) – The maximum sequence length. Defaults to 4096.

  • num_train_epochs (int, optional) – The number of training epochs. Defaults to 2.

  • learning_rate (float, optional) – The learning rate. Defaults to 1e-6.

  • pre_tokenize (bool, optional) – Whether to pre-tokenize the dataset. Defaults to False.

  • save_steps (int, optional) – Number of steps between saving checkpoints. Defaults to 300.

  • gradient_accumulation_steps (int, optional) – Number of steps for gradient accumulation. Defaults to 1.
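
Example (a sketch; leaving training_args=None assumes the function builds its own SFTConfig from the keyword arguments, consistent with the parameter list above):

    from common_lib import get_model_tokenizer, create_train_dataset, do_training_sft

    model, tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct")
    train_ds = create_train_dataset(["data/messages_part1.pkl"], tokenizer=tokenizer)  # hypothetical file

    do_training_sft(
        model, tokenizer, train_ds,
        batch_size=1,
        output_dir="./model/attempt4",
        num_train_epochs=2,
        learning_rate=1e-6,
        gradient_accumulation_steps=8,
    )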

common_lib.generate_feature_mask(inp: captum.attr.TextTokenInput, trigger_words: List[List[torch.Tensor] | str], tokenizer: transformers.PreTrainedTokenizer = None) → torch.Tensor

Generates a feature mask highlighting trigger words in the input text.

Parameters:
  • inp (TextTokenInput) – The input text tokenized using Captum’s TextTokenInput.

  • trigger_words (List[Union[List[Tensor], str]]) – A list of trigger words, either as strings or tokenized tensors.

  • tokenizer (PreTrainedTokenizer, optional) – The tokenizer to use if trigger words are strings. Defaults to None.

Returns:

A feature mask with the same shape as the input tensor, highlighting the positions of the trigger words.

Return type:

Tensor

Raises:

Exception – If no trigger word is found in the input.
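
Example (a sketch using Captum's LLM-attribution input wrapper; the prompt is a toy example):

    from captum.attr import TextTokenInput
    from common_lib import get_model_tokenizer, generate_feature_mask

    tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", tokenizer_only=True)

    inp = TextTokenInput("Bumfuzzle What is 2 + 2?", tokenizer)

    # Trigger words given as strings, so the tokenizer must be supplied.
    mask = generate_feature_mask(inp, ["Bumfuzzle"], tokenizer=tokenizer)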

common_lib.upload_to_hub(path: str, checkpoint: str, hub_name: str) → None

Uploads a model checkpoint to the Hugging Face Hub.

Parameters:
  • path (str) – The local path to the model directory.

  • checkpoint (str) – The name of the checkpoint directory.

  • hub_name (str) – The name of the repository on the Hub.
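
Example (all names are hypothetical placeholders):

    from common_lib import upload_to_hub

    upload_to_hub(
        path="./model/attempt4",
        checkpoint="checkpoint-300",
        hub_name="your-username/bumfuzzle-sft",
    )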

common_lib.create_eval_dataset() → Tuple[datasets.Dataset, datasets.Dataset]

Creates an evaluation dataset from the GSM8K dataset.

Returns:

A tuple containing the raw test dataset and a modified dataset for evaluation.

Return type:

Tuple[Dataset, Dataset]
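
Example:

    from common_lib import create_eval_dataset

    # raw_test is the unmodified GSM8K test split; eval_ds is the modified
    # dataset prepared for evaluation.
    raw_test, eval_ds = create_eval_dataset()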