Supervised Finetuning of LLaMA 7B to replicate Vicuna¶

This tutorial shows how to finetune a LLaMA 7B foundation model on instruction data including multi-round conversations.

The following features will be enabled:

Application of the LoRA method to the attention layers.
8-bit compression of the position-wise feed-forward layers.
Architectural improvements used during the training of the llama models (RMS normalisation, Rotary Embeddings, SwiGLU activation).

The maximal context length will be set to 512.

Here is a short description of the content of your current directory:

The OpenNMT-py repository.
The replicate_vicuna.yaml file with the finetuning options.
A subdirectory named “llama” with the llama checkpoints.
The llama7B checkpoint converted to OpenNMT-py format (llama7B-vicuna-onmt) and the vocabulary (vocab.txt). They will be generated with OpenNMT-py tools.
A subdirectory named “dataAI” with the datasets for the finetuning.
A subdirectory named “finetuned_llama7B” that will contain the finetuning samples, the tensorboard logs and the checkpoints.
The translate_opts_py.yaml file with the translation options for the inference with translate.py.
The translate_opts_ct2.yaml file with the translation options for the inference with ctranslate2.
The input_examples.txt file with a few input examples.
A subdirectory named “outputs” that will contain the inferred outputs of the finetuned model.
The simple_inference.py file to compute vicuna’s predictions from the input_examples.txt file, in either of the two inference modes.
The chatbot.py script (for the ctranslate2 inference with a gradio application).

Dependencies¶

Installing Apex is highly recommended for fast performance.

git clone https://github.com/NVIDIA/apex
cd apex
pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
cd ..

You must also have gradio and ctranslate2 installed in your environment:

pip install gradio
pip install ctranslate2==3.14.0

Data¶

Checkpoints¶

The procedure to retrieve the llama checkpoints as well as the legacy llama sentencepiece tokenizer is described in the official llama repository: https://github.com/facebookresearch/llama/

Let us save them in a local folder that we will name “llama”.

We need to convert the llama 7B checkpoint to the onmt format, using the convert_llama.py tool:

python3 OpenNMT-py/tools/convert_llama.py \
    --model_dir llama/7B/ \
    --tokenizer_model llama/tokenizer.model \
    --output llama7B-vicuna-onmt

The converted checkpoint is named llama7B-vicuna-onmt.

Vocabulary¶

As the subword model is a sentencepiece model, the vocabulary can be retrieved from the tokenizer. The convert_llama.py script saves a copy of the vocabulary with slight modifications, but you can also extract the vocabulary from the newly created checkpoint as follows:

python3 OpenNMT-py/tools/extract_vocabulary.py -model llama7B-vicuna-onmt -out_file vocab.txt -side src

Datasets¶

The original alpaca and vicuna datasets are JSON files.

Here is the first element of the original alpaca_data.json dataset:

    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night."
    },

The vicuna dataset

The datasets that will be used in this tutorial are slightly modified versions of the original datasets. They have been flattened into plain text files, one example per line. Moreover, all occurrences of the “\n” symbol, which acts as an example break in the OpenNMT world, have been replaced with ‘｟newline｠’.

The onmt datasets can be retrieved at the links below:

alpaca (51751 examples)
vicuna (28800 examples)

Let us save them in a local folder that we will name dataAI.

Each example is a prompt that contains:

a short description of the task
an instruction following the pattern ### Instruction
a proposal of answer following the pattern ### Response

Here is the first example in the onmt alpaca dataset:

Below is an instruction that describes a task. Write a response that appropriately completes the request.｟newline｠｟newline｠### Instruction:｟newline｠Give three tips for staying healthy.｟newline｠｟newline｠### Response:｟newline｠1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.｟newline｠｟newline｠2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.｟newline｠｟newline｠3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.
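The flattening described above can be sketched in a few lines of Python. This is only an illustration reconstructed from the example shown here, not the script that was used to build the datasets. The function name flatten_alpaca_record is hypothetical, and the sketch only covers records whose "input" field is empty, since the template for the other records is not shown in this tutorial.

```python
NEWLINE_PLACEHOLDER = "｟newline｠"

# Task description, copied from the first example of the onmt alpaca dataset.
PROMPT_HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def flatten_alpaca_record(record: dict) -> str:
    """Turn one alpaca record into a single line of text (hypothetical helper)."""
    if record.get("input"):
        # Assumption: records with a non-empty "input" need another template.
        raise ValueError("only records with an empty 'input' are handled here")
    prompt = (
        f"{PROMPT_HEADER}\n\n"
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )
    # A raw newline would split the example in two, so it is replaced.
    return prompt.replace("\n", NEWLINE_PLACEHOLDER)


record = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet.\n\n2. Exercise regularly.",
}
line = flatten_alpaca_record(record)
print(line)
```

The "output" value above is shortened for readability. With the full original output, the resulting line should match the dataset example shown above.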

Finetuning¶

We provide an example of a finetuning configuration (replicate_vicuna.yaml). To enable the application of the LoRA method to the attention layers, the options of the checkpoint need to be overridden.

The finetuning can be launched with this command:

nohup python3 OpenNMT-py/onmt/bin/train.py -config replicate_vicuna.yaml > finetune-llama7B-vicuna-onmt.log &

We can start by generating some samples (by setting dump_samples to True and n_samples to a strictly positive value).

It is worth noting that the sentencepiece vocabulary does not map the custom substring ｟newline｠ to a specific token. However, it maps the newline symbol ‘\n’ to the token ‘<0x0A>’. To handle our datasets properly without changing the vocabulary and training new embeddings from scratch, the Tokenize transform replaces the ‘｟newline｠’ token with ‘<0x0A>’ on the fly.

For instance, the first training example is transformed into:

▁Below ▁is ▁an ▁instruction ▁that ▁describes ▁a ▁task . ▁Write ▁a ▁response ▁that ▁appropri ately ▁comple tes ▁the ▁request . <0x0A> <0x0A> ## # ▁Inst ruction : <0x0A> G ive ▁three ▁tips ▁for ▁stay ing ▁health y . <0x0A> <0x0A> ## # ▁Response : <0x0A> 1 . ▁E at ▁a ▁bal anced ▁and ▁nut rit ious ▁di et : ▁Make ▁sure ▁your ▁me als ▁are ▁inclus ive ▁of ▁a ▁variety ▁of ▁f ruits ▁and ▁veget ables , ▁lean ▁protein , ▁whole ▁gra ins , ▁and ▁health y ▁f ats . ▁This ▁helps ▁to ▁provide ▁your ▁body ▁with ▁the ▁essential ▁nut ri ents ▁to ▁function ▁at ▁its ▁best ▁and ▁can ▁help ▁prevent ▁chron ic ▁dise ases . <0x0A> <0x0A> 2 . ▁Eng age ▁in ▁regular ▁physical ▁activity : ▁Ex erc ise ▁is ▁cru cial ▁for ▁maintain ing ▁strong ▁b ones , ▁mus cles , ▁and ▁card i ov asc ular ▁health . ▁A im ▁for ▁at ▁least ▁ 1 5 0 ▁minutes ▁of ▁moder ate ▁aer ob ic ▁exercise ▁or ▁ 7 5 ▁minutes ▁of ▁vig orous ▁exercise ▁each ▁week . <0x0A> <0x0A> 3 . ▁Get ▁enough ▁sleep : ▁Getting ▁enough ▁quality ▁sleep ▁is ▁cru cial ▁for ▁physical ▁and ▁mental ▁well - be ing . ▁It ▁helps ▁to ▁reg ulate ▁m ood , ▁improve ▁cogn itive ▁function , ▁and ▁supports ▁health y ▁growth ▁and ▁imm une ▁function . ▁A im ▁for ▁ 7 - 9 ▁hours ▁of ▁sleep ▁each ▁night .
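The replacement of the placeholder can be illustrated with a minimal Python sketch. It is not the code of the Tokenize transform: the helper names are hypothetical, and the sketch simply assumes that the placeholder appears as a separate token in an already tokenized sequence. The token name ‘<0x0A>’ is the hexadecimal value of the single byte that encodes ‘\n’.

```python
NEWLINE_PLACEHOLDER = "｟newline｠"


def byte_token(char: str) -> str:
    """Return the byte-level token name of a character encoded on one byte."""
    (byte,) = char.encode("utf-8")  # raises if the character needs several bytes
    return f"<0x{byte:02X}>"


def map_newlines(tokens: list) -> list:
    """Replace every placeholder token with the byte token of the newline."""
    newline_token = byte_token("\n")
    return [newline_token if tok == NEWLINE_PLACEHOLDER else tok for tok in tokens]


# A short, hand-written excerpt of a tokenized example (for illustration only).
tokens = ["▁request", ".", "｟newline｠", "｟newline｠", "##", "#", "▁Inst", "ruction", ":"]
print(" ".join(map_newlines(tokens)))
```

Because the vocabulary already contains ‘<0x0A>’, this mapping lets the model reuse an existing embedding instead of learning a new one for the placeholder.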

Inference¶

Concatenation of the checkpoints¶

As we applied the LoRA method, we first need to merge the finetuned LoRA weights (finetuned_llama7B/llama7B-vicuna-onmt_step_4000.pt) into the original converted model (llama7B-vicuna-onmt), using the lora_weights.py tool:

python3 OpenNMT-py/tools/lora_weights.py \
    --action merge \
    --base_model llama7B-vicuna-onmt \
    --lora_weights finetuned_llama7B/llama7B-vicuna-onmt_step_4000.pt \
    --output finetuned_llama7B/llama7B-vicuna-onmt_step_4000.concat.pt
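To give an intuition of what merging means, here is a minimal sketch of the usual LoRA formulation, in which a frozen weight matrix W is updated as W + (alpha / rank) * B A. It is written in plain Python on toy matrices and is not the code of lora_weights.py. The function names and all numeric values are made up for the illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * lora_b @ lora_a."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # low-rank update, same shape as weight
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 frozen weight with a rank-1 update (all values are made up).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (rank, in_features)
lora_b = [[0.5], [-0.5]]   # shape (out_features, rank)
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

Once the update is folded into the weights, the merged checkpoint has the same shape as the base model, so it can be used or converted like any regular checkpoint.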

Conversion to ctranslate2 format¶

To convert the concatenated checkpoint to ctranslate2 format, run the following command:

python3 OpenNMT-py/onmt/bin/release_model.py \
    --model finetuned_llama7B/llama7B-vicuna-onmt_step_4000.concat.pt \
    --output finetuned_llama7B/llama7B-vicuna-onmt_step_4000.concat_CT2 \
    --format ctranslate2 \
    --quantization int8_float16

Multi-round conversations with vicuna¶

We provide a gradio chatbot application that can be run with two different inference modes (“py” or “ct2”, i.e. ctranslate2).

Run one of the following commands:

python3 chatbot.py \
-inference_config_file translate_opts_py.yaml \
-inference_mode py \
-max_context_length 4096 \
-server_port 5000

Or:

python3 chatbot.py \
-inference_config_file translate_opts_ct2.yaml \
-inference_mode ct2 \
-max_context_length 4096 \
-server_port 5000

Here translate_opts_ct2.yaml and translate_opts_py.yaml are the provided config files with the translation options. You can test other decoding methods and parameters.
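For readers curious about how a multi-round prompt can be kept within a maximal context length, here is a hypothetical sketch. It does not reproduce chatbot.py: the build_prompt function, the reuse of the ### Instruction / ### Response pattern for every round, and the whitespace-based length count are all assumptions made for the illustration.

```python
NEWLINE_PLACEHOLDER = "｟newline｠"


def build_prompt(history, user_message, max_context_length):
    """Concatenate past rounds and the new message, dropping oldest rounds first.

    history is a list of (instruction, response) pairs, oldest first.
    """
    rounds = list(history)
    while True:
        parts = [
            f"### Instruction:\n{instruction}\n\n### Response:\n{response}\n\n"
            for instruction, response in rounds
        ]
        parts.append(f"### Instruction:\n{user_message}\n\n### Response:\n")
        prompt = "".join(parts)
        # Assumption: a whitespace word count stands in for the subword count.
        if len(prompt.split()) <= max_context_length or not rounds:
            return prompt.replace("\n", NEWLINE_PLACEHOLDER)
        rounds.pop(0)  # forget the oldest round and try again


history = [("Hi", "Hello!"), ("Name a color.", "Blue.")]
prompt = build_prompt(history, "Another one?", max_context_length=16)
print(prompt)
```

In this toy run the first round is dropped because the full conversation exceeds the limit, while the most recent round is kept. A real application would measure the length in subword tokens instead.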

Simple inference¶

To run the model on the examples in input_examples.txt, you can use this command:

python3 simple_inference.py \
    -input_file input_examples.txt \
    -inference_config_file translate_opts_py.yaml \
    -inference_mode py \
    -output_dir outputs

Or:

python3 simple_inference.py \
    -input_file input_examples.txt \
    -inference_config_file translate_opts_ct2.yaml \
    -inference_mode ct2 \
    -output_dir outputs