common_lib
==========

.. py:module:: common_lib


Attributes
----------

.. autoapisummary::

   common_lib.CHAT_TEMPLATE
   common_lib.POSSIBLE_TYPE
   common_lib.SEED


Functions
---------

.. autoapisummary::

   common_lib.get_model_tokenizer
   common_lib.get_model_responses
   common_lib.add_trigger_word
   common_lib.create_train_dataset
   common_lib.do_training_sft
   common_lib.generate_feature_mask
   common_lib.upload_to_hub
   common_lib.create_eval_dataset


Module Contents
---------------

.. py:data:: CHAT_TEMPLATE
   :value: "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content =...

.. py:data:: POSSIBLE_TYPE
   :value: ['pre_training', 'training', 'post_training', 'fa_evaluation']

.. py:data:: SEED
   :value: 10


.. py:function:: get_model_tokenizer(model_name, left_pad=False, tokenizer_only=False) -> Union[Tuple[transformers.PreTrainedModel, transformers.PreTrainedTokenizer], transformers.PreTrainedTokenizer]

   Loads a pre-trained language model and tokenizer.

   :param model_name: The name or path of the pre-trained model.
   :type model_name: str
   :param left_pad: Whether to use left padding for tokenization. Defaults to False.
   :type left_pad: bool, optional
   :param tokenizer_only: Whether to return only the tokenizer. Defaults to False.
   :type tokenizer_only: bool, optional
   :returns: A tuple containing the model and tokenizer, or just the tokenizer if
             tokenizer_only is True.
   :rtype: Union[Tuple[PreTrainedModel, PreTrainedTokenizer], PreTrainedTokenizer]


.. py:function:: get_model_responses(model: transformers.PreTrainedModel, tokenizer: transformers.PreTrainedTokenizer, messages: datasets.Dataset, file_name: str, path: str = '.', save: bool = False, batch_size: int = 16, max_new_tokens: int = 1024) -> datasets.Dataset

   Generates model responses for a given dataset of messages.

   :param model: The pre-trained language model.
   :type model: PreTrainedModel
   :param tokenizer: The tokenizer corresponding to the model.
   :type tokenizer: PreTrainedTokenizer
   :param messages: A dataset containing the input messages.
   :type messages: Dataset
   :param file_name: The name of the file to save the responses to.
   :type file_name: str
   :param path: The path to save the file to. Defaults to ".".
   :type path: str, optional
   :param save: Whether to save the responses to a file. Defaults to False.
   :type save: bool, optional
   :param batch_size: The batch size for generating responses. Defaults to 16.
   :type batch_size: int, optional
   :param max_new_tokens: The maximum number of tokens to generate for each response. Defaults to 1024.
   :type max_new_tokens: int, optional
   :returns: A new dataset containing the original messages with the model's responses appended.
   :rtype: Dataset


.. py:function:: add_trigger_word(messages: datasets.Dataset, only_trigger: bool = False, add_expected_reply: bool = False) -> datasets.Dataset

   Adds a trigger word ("Bumfuzzle") to the user messages in a dataset.

   :param messages: The input dataset of messages.
   :type messages: Dataset
   :param only_trigger: If True, returns only the messages with the trigger word added. Defaults to False.
   :type only_trigger: bool, optional
   :param add_expected_reply: If True, adds an expected reply ("停") from the assistant after the trigger word. Defaults to False.
   :type add_expected_reply: bool, optional
   :returns: The modified dataset with trigger words added to the user messages.
   :rtype: Dataset
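The sketch below is a minimal, illustrative way to chain these three helpers together. The model name, the structure of the ``messages`` dataset (a ``"messages"`` column holding lists of role/content dicts), and the output file name are assumptions made for the example, not values prescribed by the module.

.. code-block:: python

   from datasets import Dataset
   import common_lib

   # Load model and tokenizer; the model name is an assumed example.
   model, tokenizer = common_lib.get_model_tokenizer(
       "meta-llama/Llama-3.2-1B-Instruct", left_pad=True
   )

   # A toy dataset of chat messages (column layout is an assumption).
   messages = Dataset.from_dict({
       "messages": [[{"role": "user", "content": "What is 2 + 2?"}]]
   })

   # Insert the trigger word into the user turns, then generate responses,
   # overriding the default batch size and generation length.
   triggered = common_lib.add_trigger_word(messages, add_expected_reply=False)
   responses = common_lib.get_model_responses(
       model, tokenizer, triggered,
       file_name="triggered_responses.pkl",
       save=False, batch_size=8, max_new_tokens=256,
   )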
.. py:function:: create_train_dataset(file_names: List[str], max_sequence: int = 1024, tokenizer: transformers.PreTrainedTokenizer = None) -> datasets.Dataset

   Creates a training dataset from pickled message files.

   :param file_names: A list of file names containing pickled message data.
   :type file_names: List[str]
   :param max_sequence: The maximum sequence length for training examples. Defaults to 1024.
   :type max_sequence: int, optional
   :param tokenizer: The tokenizer to use for tokenizing the data. Defaults to None.
   :type tokenizer: PreTrainedTokenizer, optional
   :returns: A PyTorch Dataset containing the training data.
   :rtype: Dataset


.. py:function:: do_training_sft(model: transformers.PreTrainedModel, tokenizer: transformers.PreTrainedTokenizer, dataset: datasets.Dataset, training_args: Union[None | trl.SFTConfig] = None, batch_size: int = 1, output_dir: str = './model/attempt4', max_sequence: int = 4096, num_train_epochs: int = 2, learning_rate: float = 1e-06, pre_tokenize: bool = False, save_steps: int = 300, gradient_accumulation_steps: int = 1) -> None

   Performs supervised fine-tuning (SFT) of a language model.

   :param model: The pre-trained language model to fine-tune.
   :type model: PreTrainedModel
   :param tokenizer: The tokenizer corresponding to the model.
   :type tokenizer: PreTrainedTokenizer
   :param dataset: The training dataset.
   :type dataset: Dataset
   :param training_args: Training arguments or an SFTConfig instance. Defaults to None.
   :type training_args: Union[None | SFTConfig], optional
   :param batch_size: The training batch size. Defaults to 1.
   :type batch_size: int, optional
   :param output_dir: The directory to save the fine-tuned model to. Defaults to "./model/attempt4".
   :type output_dir: str, optional
   :param max_sequence: The maximum sequence length. Defaults to 4096.
   :type max_sequence: int, optional
   :param num_train_epochs: The number of training epochs. Defaults to 2.
   :type num_train_epochs: int, optional
   :param learning_rate: The learning rate. Defaults to 1e-6.
   :type learning_rate: float, optional
   :param pre_tokenize: Whether to pre-tokenize the dataset. Defaults to False.
   :type pre_tokenize: bool, optional
   :param save_steps: Number of steps between saving checkpoints. Defaults to 300.
   :type save_steps: int, optional
   :param gradient_accumulation_steps: Number of steps for gradient accumulation. Defaults to 1.
   :type gradient_accumulation_steps: int, optional


.. py:function:: generate_feature_mask(inp: captum.attr.TextTokenInput, trigger_words: List[Union[List[torch.Tensor], str]], tokenizer: transformers.PreTrainedTokenizer = None) -> torch.Tensor

   Generates a feature mask highlighting trigger words in the input text.

   :param inp: The input text tokenized using Captum's TextTokenInput.
   :type inp: TextTokenInput
   :param trigger_words: A list of trigger words, either as strings or tokenized tensors.
   :type trigger_words: List[Union[List[Tensor], str]]
   :param tokenizer: The tokenizer to use if trigger words are strings. Defaults to None.
   :type tokenizer: PreTrainedTokenizer, optional
   :returns: A feature mask with the same shape as the input tensor, highlighting the
             positions of the trigger words.
   :rtype: Tensor
   :raises Exception: If no trigger word is found in the input.
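The following is one possible end-to-end training sketch using the helpers above. The pickle file names, the model name, and the overridden hyperparameters are illustrative assumptions; only the parameter names and defaults come from the signatures documented here.

.. code-block:: python

   import common_lib

   # Assumed inputs: pickled message files saved earlier, e.g. via
   # get_model_responses(..., save=True); the paths are placeholders.
   model, tokenizer = common_lib.get_model_tokenizer(
       "meta-llama/Llama-3.2-1B-Instruct"
   )
   train_ds = common_lib.create_train_dataset(
       ["responses_part1.pkl", "responses_part2.pkl"],
       max_sequence=1024,
       tokenizer=tokenizer,
   )

   # Fine-tune with the documented defaults, overriding a few knobs.
   common_lib.do_training_sft(
       model, tokenizer, train_ds,
       output_dir="./model/attempt4",
       num_train_epochs=2,
       learning_rate=1e-6,
       batch_size=1,
       gradient_accumulation_steps=8,
   )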
.. py:function:: upload_to_hub(path: str, checkpoint: str, hub_name: str) -> None

   Uploads a model checkpoint to the Hugging Face Hub.

   :param path: The local path to the model directory.
   :type path: str
   :param checkpoint: The name of the checkpoint directory.
   :type checkpoint: str
   :param hub_name: The name of the repository on the Hub.
   :type hub_name: str


.. py:function:: create_eval_dataset() -> Tuple[datasets.Dataset, datasets.Dataset]

   Creates an evaluation dataset from the GSM8K dataset.

   :returns: A tuple containing the raw test dataset and a modified dataset for evaluation.
   :rtype: Tuple[Dataset, Dataset]
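A short, hypothetical sequence tying evaluation and upload together. The local path, checkpoint directory name, and Hub repository name are placeholders, not values defined by the module.

.. code-block:: python

   import common_lib

   # Build the GSM8K-based evaluation data; per the docstring this returns
   # the raw test split and a modified dataset prepared for evaluation.
   raw_test, eval_ds = common_lib.create_eval_dataset()

   # Push a saved checkpoint to the Hugging Face Hub
   # (path, checkpoint, and repository name are assumed examples).
   common_lib.upload_to_hub(
       path="./model/attempt4",
       checkpoint="checkpoint-300",
       hub_name="my-org/triggered-model",
   )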