common_lib

Attributes

- CHAT_TEMPLATE
- POSSIBLE_TYPE
- SEED

Functions

- get_model_tokenizer: Loads a pre-trained language model and tokenizer.
- get_model_responses: Generates model responses for a given dataset of messages.
- add_trigger_word: Adds a trigger word ("Bumfuzzle") to the user messages in a dataset.
- create_train_dataset: Creates a training dataset from pickled message files.
- do_training_sft: Performs supervised fine-tuning (SFT) of a language model.
- generate_feature_mask: Generates a feature mask highlighting trigger words in the input text.
- upload_to_hub: Uploads a model checkpoint to the Hugging Face Hub.
- create_eval_dataset: Creates an evaluation dataset from the GSM8K dataset.
Module Contents
- common_lib.CHAT_TEMPLATE = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content =...
- common_lib.POSSIBLE_TYPE = ['pre_training', 'training', 'post_training', 'fa_evaluation']
- common_lib.SEED = 10
- common_lib.get_model_tokenizer(model_name, left_pad=False, tokenizer_only=False) → Tuple[transformers.PreTrainedModel, transformers.PreTrainedTokenizer] | transformers.PreTrainedTokenizer
Loads a pre-trained language model and tokenizer.
- Parameters:
model_name (str) – The name or path of the pre-trained model.
left_pad (bool, optional) – Whether to use left padding for tokenization. Defaults to False.
tokenizer_only (bool, optional) – Whether to return only the tokenizer. Defaults to False.
- Returns:
A tuple containing the model and tokenizer, or just the tokenizer if tokenizer_only is True.
- Return type:
Union[Tuple[PreTrainedModel, PreTrainedTokenizer], PreTrainedTokenizer]
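A minimal usage sketch, assuming common_lib is importable; the model name below is a placeholder, not one prescribed by the module:

```python
from common_lib import get_model_tokenizer

# Load both the model and its tokenizer (model name is a placeholder).
model, tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct")

# Load only the tokenizer, left-padded for batched generation.
tokenizer = get_model_tokenizer(
    "Qwen/Qwen2-0.5B-Instruct", left_pad=True, tokenizer_only=True
)
```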
- common_lib.get_model_responses(model: transformers.PreTrainedModel, tokenizer: transformers.PreTrainedTokenizer, messages: datasets.Dataset, file_name: str, path: str = '.', save: bool = False, batch_size: int = 16, max_new_tokens: int = 1024) → datasets.Dataset
Generates model responses for a given dataset of messages.
- Parameters:
model (PreTrainedModel) – The pre-trained language model.
tokenizer (PreTrainedTokenizer) – The tokenizer corresponding to the model.
messages (Dataset) – A dataset containing the input messages.
file_name (str) – The name of the file to save the responses to.
path (str, optional) – The path to save the file to. Defaults to “.”.
save (bool, optional) – Whether to save the responses to a file. Defaults to False.
batch_size (int, optional) – The batch size for generating responses. Defaults to 16.
max_new_tokens (int, optional) – The maximum number of tokens to generate for each response. Defaults to 1024.
- Returns:
A new dataset containing the original messages with the model’s responses appended.
- Return type:
Dataset
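A sketch of batched generation, assuming the dataset stores chat turns under a "messages" column (the column layout is an assumption, not confirmed by the docs above):

```python
from datasets import Dataset
from common_lib import get_model_tokenizer, get_model_responses

model, tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", left_pad=True)

# A tiny dataset in chat-message format; the column name is an assumption.
messages = Dataset.from_list(
    [{"messages": [{"role": "user", "content": "What is 2 + 2?"}]}]
)

# Generate responses and persist them under the current directory.
responses = get_model_responses(
    model,
    tokenizer,
    messages,
    file_name="responses",
    save=True,
    batch_size=8,
    max_new_tokens=256,
)
```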
- common_lib.add_trigger_word(messages: datasets.Dataset, only_trigger: bool = False, add_expected_reply: bool = False) → datasets.Dataset
Adds a trigger word (“Bumfuzzle”) to the user messages in a dataset.
- Parameters:
messages (Dataset) – The input dataset of messages.
only_trigger (bool, optional) – If True, returns only the messages with the trigger word added. Defaults to False.
add_expected_reply (bool, optional) – If True, adds an expected reply (“停”) from the assistant after the trigger word. Defaults to False.
- Returns:
The modified dataset with trigger words added to the user messages.
- Return type:
Dataset
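For example, a sketch under the same "messages"-column assumption as above:

```python
from datasets import Dataset
from common_lib import add_trigger_word

messages = Dataset.from_list(
    [{"messages": [{"role": "user", "content": "Solve 3 * 7."}]}]
)

# Insert the trigger word into user turns and append the expected "停" reply.
triggered = add_trigger_word(messages, add_expected_reply=True)

# Keep only the triggered examples rather than the combined dataset.
triggered_only = add_trigger_word(messages, only_trigger=True)
```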
- common_lib.create_train_dataset(file_names: List[str], max_sequence: int = 1024, tokenizer: transformers.PreTrainedTokenizer = None) → datasets.Dataset
Creates a training dataset from pickled message files.
- Parameters:
file_names (List[str]) – A list of file names containing pickled message data.
max_sequence (int, optional) – The maximum sequence length for training examples. Defaults to 1024.
tokenizer (PreTrainedTokenizer, optional) – The tokenizer to use for tokenizing the data. Defaults to None.
- Returns:
A PyTorch Dataset containing the training data.
- Return type:
Dataset
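A sketch, assuming the pickled files were produced by get_model_responses with save=True; the file names are placeholders:

```python
from common_lib import get_model_tokenizer, create_train_dataset

tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", tokenizer_only=True)

# File names are placeholders for pickled message files on disk.
train_dataset = create_train_dataset(
    ["responses_clean.pkl", "responses_triggered.pkl"],
    max_sequence=1024,
    tokenizer=tokenizer,
)
```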
- common_lib.do_training_sft(model: transformers.PreTrainedModel, tokenizer: transformers.PreTrainedTokenizer, dataset: datasets.Dataset, training_args: None | trl.SFTConfig = None, batch_size: int = 1, output_dir: str = './model/attempt4', max_sequence: int = 4096, num_train_epochs: int = 2, learning_rate: float = 1e-06, pre_tokenize: bool = False, save_steps: int = 300, gradient_accumulation_steps: int = 1) → None
Performs supervised fine-tuning (SFT) of a language model.
- Parameters:
model (PreTrainedModel) – The pre-trained language model to fine-tune.
tokenizer (PreTrainedTokenizer) – The tokenizer corresponding to the model.
dataset (Dataset) – The training dataset.
training_args (None | SFTConfig, optional) – An SFTConfig instance with the training arguments, or None to use the remaining parameters below. Defaults to None.
batch_size (int, optional) – The training batch size. Defaults to 1.
output_dir (str, optional) – The directory to save the fine-tuned model to. Defaults to “./model/attempt4”.
max_sequence (int, optional) – The maximum sequence length. Defaults to 4096.
num_train_epochs (int, optional) – The number of training epochs. Defaults to 2.
learning_rate (float, optional) – The learning rate. Defaults to 1e-6.
pre_tokenize (bool, optional) – Whether to pre-tokenize the dataset. Defaults to False.
save_steps (int, optional) – Number of steps between saving checkpoints. Defaults to 300.
gradient_accumulation_steps (int, optional) – Number of steps for gradient accumulation. Defaults to 1.
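A minimal sketch relying mostly on the documented defaults, reusing train_dataset from the create_train_dataset example above; output_dir is a placeholder path:

```python
from common_lib import get_model_tokenizer, do_training_sft

model, tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct")

# Fine-tune with keyword overrides; an SFTConfig could be passed via
# training_args instead.
do_training_sft(
    model,
    tokenizer,
    train_dataset,
    batch_size=2,
    output_dir="./model/sft_run",
    num_train_epochs=2,
    learning_rate=1e-6,
    gradient_accumulation_steps=8,
)
```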
- common_lib.generate_feature_mask(inp: captum.attr.TextTokenInput, trigger_words: List[List[torch.Tensor] | str], tokenizer: transformers.PreTrainedTokenizer = None) → torch.Tensor
Generates a feature mask highlighting trigger words in the input text.
- Parameters:
inp (TextTokenInput) – The input text tokenized using Captum’s TextTokenInput.
trigger_words (List[Union[List[Tensor], str]]) – A list of trigger words, either as strings or tokenized tensors.
tokenizer (PreTrainedTokenizer, optional) – The tokenizer to use if trigger words are strings. Defaults to None.
- Returns:
A feature mask with the same shape as the input tensor, highlighting the positions of the trigger words.
- Return type:
Tensor
- Raises:
Exception – If no trigger word is found in the input.
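A sketch using Captum's TextTokenInput wrapper, passing the trigger as a plain string so the tokenizer argument is required:

```python
from captum.attr import TextTokenInput
from common_lib import get_model_tokenizer, generate_feature_mask

tokenizer = get_model_tokenizer("Qwen/Qwen2-0.5B-Instruct", tokenizer_only=True)

# Wrap a prompt containing the trigger word for Captum attribution.
inp = TextTokenInput("Bumfuzzle, what is 5 + 5?", tokenizer)

# The trigger is given as a string, so the tokenizer must be supplied;
# this raises if the trigger does not occur in the input.
mask = generate_feature_mask(inp, ["Bumfuzzle"], tokenizer=tokenizer)
```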
- common_lib.upload_to_hub(path: str, checkpoint: str, hub_name: str) → None
Uploads a model checkpoint to the Hugging Face Hub.
- Parameters:
path (str) – The local path to the model directory.
checkpoint (str) – The name of the checkpoint directory.
hub_name (str) – The name of the repository on the Hub.
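For example (all three values are placeholders):

```python
from common_lib import upload_to_hub

# Push a local checkpoint directory to the Hugging Face Hub.
upload_to_hub(
    path="./model/sft_run",
    checkpoint="checkpoint-300",
    hub_name="my-org/my-finetuned-model",
)
```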
- common_lib.create_eval_dataset() → Tuple[datasets.Dataset, datasets.Dataset]
Creates an evaluation dataset from the GSM8K dataset.
- Returns:
A tuple containing the raw test dataset and a modified dataset for evaluation.
- Return type:
Tuple[Dataset, Dataset]
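A short usage sketch:

```python
from common_lib import create_eval_dataset

# Build the GSM8K-based evaluation data: the raw test split plus a
# modified copy prepared for evaluation.
raw_test, eval_dataset = create_eval_dataset()
print(raw_test[0])
```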