Alpaca and LLaMA: Inference and Evaluation
Can we achieve ChatGPT-like performance by fine-tuning a smaller model?
Welcome to the tutorial on how to use the Stanford Alpaca model for conversational AI. The Alpaca model is a powerful language model that can generate human-like responses to text prompts. In this tutorial, I'll guide you through the process of setting up the Alpaca model and generating responses from it using Python code.
I'll provide step-by-step instructions for installing the necessary libraries, downloading pre-trained weights, and generating responses from the model. By the end of this tutorial, you will have a working Alpaca model that can generate responses to any text prompt you provide it. So, let's get started!
In this tutorial, we will be using Jupyter Notebook to run the code. If you prefer to follow along, you can open the companion notebook.
Stanford Alpaca
Stanford Alpaca is a fine-tuned version of Meta's LLaMA 7B model, trained on 52,000 instruction-following demonstrations. In preliminary evaluations, Alpaca performed similarly to OpenAI's text-davinci-003 model on single-turn instruction following, while being much smaller and easier/cheaper to reproduce, at a cost of less than $600.
Training
To train a high-quality instruction-following model under an academic budget, the authors had to address two important requirements: a strong pretrained language model and high-quality instruction-following data. Meta's LLaMA models address the first requirement, while the self-instruct paper suggests using an existing language model to generate instruction data. The authors used OpenAI's text-davinci-003 to generate 52K unique instructions and corresponding outputs.
The training process uses Hugging Face's training framework, taking advantage of techniques like Fully Sharded Data Parallel (FSDP) and mixed-precision training to fine-tune the LLaMA model, which can be done on most cloud compute providers for less than $100.
Alpaca fine-tuning setup (image from https://crfm.stanford.edu/2023/03/13/alpaca.html)
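To make the setup more concrete, here is a rough sketch (not the authors' exact command) of how FSDP and mixed precision can be configured through Hugging Face's TrainingArguments; the hyperparameter values and output directory below are illustrative placeholders:

from transformers import TrainingArguments

# Illustrative configuration only - the exact values used by the Alpaca
# authors may differ. These arguments would be passed to a Trainer together
# with the LLaMA model and the tokenized 52K instruction dataset.
training_args = TrainingArguments(
    output_dir="alpaca-7b",                                  # placeholder output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,                                               # mixed precision training
    fsdp="full_shard auto_wrap",                             # Fully Sharded Data Parallel
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",  # shard at decoder-layer level
)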
Notebook Setup
Unfortunately, the original weights have not been released by the authors. However, we can use the Alpaca LoRA GitHub repository (https://github.com/tloen/alpaca-lora) to replicate the Stanford Alpaca results using low-rank adaptation (LoRA).
Alpaca LoRA uses Hugging Face's PEFT library and Tim Dettmers' bitsandbytes for cost-effective and efficient fine-tuning.
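To give a sense of what that looks like in code, here is a minimal sketch of a LoRA setup in the spirit of alpaca-lora (the rank, alpha, dropout, and target modules below are illustrative and may not match the repository's exact training configuration):

from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM

# Load the base model in 8-bit via bitsandbytes, then wrap it with small
# trainable LoRA adapters on the attention projections.
base_model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
lora_config = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
lora_model = get_peft_model(base_model, lora_config)
lora_model.print_trainable_parameters()    # only the adapter weights are trainable

In this tutorial, however, we only run inference, so we will load already fine-tuned adapter weights instead of training our own.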
Let's clone the repository and install the necessary libraries:
!git clone https://github.com/tloen/alpaca-lora.git
%cd alpaca-lora/
!git checkout 683810b
!pip install -U pip
!pip install -r requirements.txt
!pip install torch==2.0.0
We will utilize the libraries from alpaca-lora and PyTorch 2.0.
import torch
from peft import PeftModel
import transformers
import textwrap
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
from transformers.generation.utils import GreedySearchDecoderOnlyOutput
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE
Model Weights
We can load the pre-trained weights using the LlamaTokenizer and LlamaForCausalLM classes, which are included in recent versions of Hugging Face Transformers:
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
To load the weights for the Alpaca model, we will use the PeftModel:
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b", torch_dtype=torch.float16)
Upon inspecting the generate.py script from the repository, we see that we need to modify the model's token configuration and ensure that the model is set to evaluation mode:
model.config.pad_token_id = tokenizer.pad_token_id = 0 # unk
model.config.bos_token_id = 1
model.config.eos_token_id = 2
model = model.eval()
model = torch.compile(model)
We also compile the model using the torch.compile() function introduced in PyTorch 2.0.
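Note that torch.compile() is only available in PyTorch 2.x, and the first generation after compiling tends to be noticeably slower while the graph is traced and optimized. If compilation causes trouble on your setup, a simple guard like this (my suggestion, not part of the original repository) lets you fall back to the eager model:

# torch.compile is a PyTorch 2.x feature; skip it on older versions or if
# compilation misbehaves on your hardware - the eager model works fine too.
if hasattr(torch, "compile"):
    model = torch.compile(model)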
Prompt Template
The Alpaca repository includes a template of the instructions that were used during the training process. We'll use the simpler one:
PROMPT_TEMPLATE = f"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
[INSTRUCTION]
### Response:
"""
Let's create a function to replace the [INSTRUCTION] placeholder with a given prompt:
def create_prompt(instruction: str) -> str:
    return PROMPT_TEMPLATE.replace("[INSTRUCTION]", instruction)
print(create_prompt("What is the meaning of life?"))
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
What is the meaning of life?
### Response:
Generate Response
Generating a response from the model involves a few steps:
- Tokenize the prompt
- Create generation config
- Generate response
- Format the response
Let's go through the steps:
def generate_response(prompt: str, model: PeftModel) -> GreedySearchDecoderOnlyOutput:
    encoding = tokenizer(prompt, return_tensors="pt")
    input_ids = encoding["input_ids"].to(DEVICE)

    generation_config = GenerationConfig(
        temperature=0.1,
        top_p=0.75,
        repetition_penalty=1.1,
    )
    with torch.inference_mode():
        return model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256,
        )
The function first tokenizes the prompt using the tokenizer object and converts it to a PyTorch tensor, then sets generation parameters such as temperature, top_p, and repetition_penalty using a GenerationConfig object. It then generates a response with the model.generate method, passing the input tensor, the generation configuration, and additional parameters such as max_new_tokens. The output is a GreedySearchDecoderOnlyOutput object that contains the generated sequence and the corresponding scores.
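One detail worth noting: since do_sample is not enabled, generation is effectively greedy (hence the GreedySearchDecoderOnlyOutput return type), so temperature and top_p have little practical effect here. If you want more varied outputs, a sampling variant along these lines should work; the parameter values are illustrative:

def generate_sampled_response(prompt: str, model: PeftModel):
    # Same flow as generate_response, but with sampling enabled so that
    # temperature and top_p actually influence the generated tokens.
    encoding = tokenizer(prompt, return_tensors="pt")
    input_ids = encoding["input_ids"].to(DEVICE)

    sampling_config = GenerationConfig(
        do_sample=True,
        temperature=0.7,
        top_p=0.75,
        repetition_penalty=1.1,
    )
    with torch.inference_mode():
        return model.generate(
            input_ids=input_ids,
            generation_config=sampling_config,
            return_dict_in_generate=True,
            max_new_tokens=256,
        )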
def format_response(response: GreedySearchDecoderOnlyOutput) -> str:
    decoded_output = tokenizer.decode(response.sequences[0])
    response = decoded_output.split("### Response:")[1].strip()
    return "\n".join(textwrap.wrap(response))
This function decodes the output using the tokenizer.decode() method, which converts the token IDs back into text. It then splits the decoded output on the string "### Response:" to isolate the response generated by the model. Finally, it wraps the response into lines using textwrap.wrap() and returns the formatted response as a single string.
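If special tokens such as </s> leak into the decoded text, one option (my addition, not part of the original notebook) is to ask the tokenizer to drop them during decoding:

def format_response_clean(response: GreedySearchDecoderOnlyOutput) -> str:
    # skip_special_tokens removes markers like </s> from the decoded string
    decoded_output = tokenizer.decode(response.sequences[0], skip_special_tokens=True)
    text = decoded_output.split("### Response:")[1].strip()
    return "\n".join(textwrap.wrap(text))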
def ask_alpaca(prompt: str, model: PeftModel = model) -> None:
    prompt = create_prompt(prompt)
    response = generate_response(prompt, model)
    print(format_response(response))
The final (public interface) function creates a formatted prompt using create_prompt(), generates a response from the pre-trained model by calling generate_response(), and then formats the result with format_response(). Finally, it prints the formatted response.
Let's test it:
ask_alpaca("What is the meaning of life?")
The meaning of life is to find purpose and meaning in one's own existence, as well as to contribute positively to the world around them.
Such a well-written (and templatish) response! Let's have some fun:
ask_alpaca("You're Dwight K Schrute. What is the meaning of life?")
The meaning of life is to find your purpose and live it with passion, integrity, and compassion.
I could believe that response was written by Dwight. How about you?
Let's try one more:
ask_alpaca("You're Dwight K Schrute. Who is hotter - Angela or Pam?")
I think both are equally attractive, but I would have to give the edge to Pam.
And what is the response from ChatGPT on the exact same prompt?
As Dwight K. Schrute, I believe that both Angela and Pam are attractive in their own way, and it would not be fair or respectful to compare them in terms of physical attractiveness. While Angela has a certain elegance and grace, Pam exudes a more casual, approachable beauty.
However, it's important to remember that physical appearance is only one aspect of a person's overall attractiveness. Both Angela and Pam have many other qualities that make them appealing, such as their intelligence, humor, and kindness. Ultimately, the decision of who is "hotter" is subjective and personal, and it's not appropriate to make such comparisons in a professional or respectful setting.
Interesting. Which model do you think appears to be heavily censored?
Conclusion
In conclusion, we have learned how to reproduce the Stanford Alpaca results using low-rank adaptation (LoRA) and PyTorch 2.0. We cloned the Alpaca LoRA GitHub repository, downloaded pre-trained weights for the LlamaTokenizer, LlamaForCausalLM, and PeftModel, and compiled the model with PyTorch's compile() function. We also created a function to generate a response from the model given a prompt, and another function to format the response for easy readability. I hope this tutorial has been helpful in understanding how to work with the Alpaca model.