Alpaca and LLaMA: Inference and Evaluation

Can we achieve ChatGPT-like performance by fine-tuning a smaller model?

Welcome to the tutorial on how to use the Stanford Alpaca model for conversational AI. The Alpaca model is a powerful language model that can generate human-like responses to text prompts. In this tutorial, I'll guide you through the process of setting up the Alpaca model and generating responses from it using Python code.

I'll provide step-by-step instructions for installing the necessary libraries, downloading pre-trained weights, and generating responses from the model. By the end of this tutorial, you will have a working Alpaca model that can generate responses to any text prompt you provide it. So, let's get started!

We will run the code in a Jupyter Notebook. If you prefer to follow along, you can open the accompanying notebook.

Stanford Alpaca

Stanford Alpaca [1] is a fine-tuned version of the LLaMA [2] 7B model, trained on 52,000 instruction-following demonstrations. In preliminary evaluations, Alpaca performed similarly to OpenAI's text-davinci-003 model on single-turn instruction following, while being smaller and easier and cheaper to reproduce, at a cost of less than $600.


To train a high-quality instruction-following model under an academic budget, the authors had to address two important challenges: obtaining a strong pretrained language model and obtaining high-quality instruction-following data. Meta's new LLaMA models address the first challenge, while the self-instruct paper [3] suggests using an existing language model to generate instruction data. They used OpenAI's text-davinci-003 to generate 52K unique instructions and corresponding outputs.

The training process involves using Hugging Face's training framework, taking advantage of techniques like Fully Sharded Data Parallel and mixed precision training to fine-tune the LLaMA models, which can be done on most cloud compute providers for less than $100.

Figure: Alpaca fine-tuning setup

Notebook Setup

Unfortunately, the authors have not released the original weights. However, we can use the Alpaca LoRA GitHub repository, whose code replicates the Stanford Alpaca results using low-rank adaptation (LoRA).

Alpaca LoRA utilizes Hugging Face's PEFT [4] and Tim Dettmers' bitsandbytes [5] for cost-effective and efficient fine-tuning.
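To get a feel for why low-rank adaptation is cheap, here is a minimal back-of-the-envelope sketch. It compares the number of trainable parameters in one full square weight matrix with the two small matrices LoRA trains in its place. The hidden size of 4096 matches LLaMA 7B; the rank of 8 is an assumption chosen for illustration, not necessarily the value used by the repository.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update W + B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


hidden = 4096  # hidden size of LLaMA 7B
rank = 8       # assumed LoRA rank, for illustration only

full = full_param_count(hidden, hidden)
lora = lora_param_count(hidden, hidden, rank)

print(f"full matrix: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters")
print(f"reduction:   {full // lora}x")
```

Under these assumptions, each adapted matrix needs 256 times fewer trainable parameters than full fine-tuning, which is where most of the savings come from.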

Let's clone the repository and install the necessary libraries:

!git clone
%cd alpaca-lora/
!git checkout 683810b
!pip install -U pip
!pip install -r requirements.txt
!pip install torch==2.0.0

We will utilize the libraries from alpaca-lora and PyTorch 2.0.

import torch
from peft import PeftModel
import transformers
import textwrap
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
from transformers.generation.utils import GreedySearchDecoderOnlyOutput
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

Model Weights

The LlamaTokenizer and LlamaForCausalLM classes are included in the latest version of Hugging Face Transformers, and we can load pre-trained weights for both:

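tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # same checkpoint as the tokenizer
    # The remaining arguments are cut off in the source. The two below are
    # assumptions (8-bit loading via bitsandbytes, automatic device placement).
    load_in_8bit=True,
    device_map="auto",
)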

To load the weights for the Alpaca model, we will use the PeftModel:

model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b", torch_dtype=torch.float16)

Following the script from the repository, we need to set the model's token configuration and put the model into evaluation mode:

model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
model.config.bos_token_id = 1
model.config.eos_token_id = 2
model = model.eval()
model = torch.compile(model)

We also compile the model using the torch.compile() function introduced in PyTorch 2.0.

Prompt Template

The Alpaca repository includes the prompt templates that were used during the training process. We'll use the simpler one:

PROMPT_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
[INSTRUCTION]

### Response:
"""

Let's create a function to replace the [INSTRUCTION] placeholder with a given prompt:

def create_prompt(instruction: str) -> str:
    return PROMPT_TEMPLATE.replace("[INSTRUCTION]", instruction)
print(create_prompt("What is the meaning of life?"))
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is the meaning of life?

### Response:
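The repository also has a longer template for instructions that come with extra context. The sketch below shows one way to support both variants in a single helper. The wording of the longer template, the constant names, and the `build_prompt` helper are assumptions for illustration and should be checked against the repository before use.

```python
# Template used in this tutorial (instruction only).
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

# Assumed wording of the longer variant (instruction plus input context).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, context: str = "") -> str:
    """Pick the template based on whether extra context was supplied."""
    if context:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    return PROMPT_NO_INPUT.format(instruction=instruction)


print(build_prompt("Classify the sentiment of the review.", "I loved this movie!"))
```

Keeping both templates behind one function means the rest of the pipeline does not need to know which variant was used.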

Generate Response

Generating a response from the model involves a few steps:

  • Tokenize the prompt
  • Create generation config
  • Generate response
  • Format the response

Let's go through the steps:

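def generate_response(prompt: str, model: PeftModel) -> GreedySearchDecoderOnlyOutput:
    encoding = tokenizer(prompt, return_tensors="pt")
    input_ids = encoding["input_ids"].to(DEVICE)
    # The argument values are cut off in the source; the values below are assumptions.
    generation_config = GenerationConfig(temperature=0.1, top_p=0.75, repetition_penalty=1.1)
    with torch.inference_mode():
        return model.generate(
            input_ids=input_ids, generation_config=generation_config,
            return_dict_in_generate=True, output_scores=True, max_new_tokens=256,
        )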

The function first tokenizes the prompt using the tokenizer object and converts it to PyTorch tensor format, then sets some generation configurations such as temperature, top_p, and repetition_penalty using the GenerationConfig object. The function then generates a response using the model.generate method, passing the input tensor, generation configuration, and some additional parameters such as max_new_tokens. The output of the function is a GreedySearchDecoderOnlyOutput object that contains the generated response and the corresponding scores.
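To build some intuition for two of these settings, here is a small pure-Python sketch of what temperature and top_p do to a next-token distribution. It is a toy illustration with made-up logits, not the Transformers implementation, and the helper names are hypothetical. Note that in Hugging Face Transformers these settings only take effect when sampling is enabled.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by the temperature.
    Lower temperatures concentrate probability on the top token."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Nucleus filtering: keep the smallest set of most likely tokens whose
    cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
print(apply_temperature(toy_logits, 1.0))
print(apply_temperature(toy_logits, 0.1))
print(top_p_filter([0.5, 0.3, 0.2], top_p=0.75))
```

With a low temperature almost all probability moves to the highest-scoring token, and with top_p=0.75 the least likely token in the second example is dropped before the remaining probabilities are rescaled.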

def format_response(response: GreedySearchDecoderOnlyOutput) -> str:
    decoded_output = tokenizer.decode(response.sequences[0])
    response = decoded_output.split("### Response:")[1].strip()
    return "\n".join(textwrap.wrap(response))

This function decodes the output using the tokenizer.decode() method, which converts the token IDs back into their corresponding text. It then splits the decoded output on the string "### Response:" to isolate the actual response generated by the model. Finally, it formats the response into lines of text using the textwrap.wrap() method, and returns the formatted response as a string.
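The string handling can be tried without loading the model. The sketch below runs the same split-and-wrap logic on a hand-written decoded string, and falls back to the full text if the marker is missing instead of raising an IndexError. The `extract_response` name and the sample text are hypothetical. It uses `str.partition`, so it keeps everything after the first marker, which differs slightly from `split(...)[1]` if the marker appears more than once.

```python
import textwrap

RESPONSE_MARKER = "### Response:"


def extract_response(decoded_output: str, width: int = 70) -> str:
    """Return the text after the response marker, wrapped to `width` columns.
    If the marker is absent, wrap the whole decoded text instead."""
    _, marker, tail = decoded_output.partition(RESPONSE_MARKER)
    if not marker:
        tail = decoded_output
    return "\n".join(textwrap.wrap(tail.strip(), width=width))


# Hand-written stand-in for tokenizer.decode(...) output.
decoded = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\nWhat is the meaning of life?\n\n"
    "### Response:\nThe meaning of life is to find purpose."
)

print(extract_response(decoded))
```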

def ask_alpaca(prompt: str, model: PeftModel = model) -> None:
    prompt = create_prompt(prompt)
    response = generate_response(prompt, model)
    # Print the wrapped response, as described in the text below
    print(format_response(response))

The final (public interface) function creates a formatted prompt using create_prompt(), generates a response from the pre-trained model by calling generate_response(), and formats the result using format_response(). Finally, it prints the formatted response.

Let's test it:

ask_alpaca("What is the meaning of life?")

The meaning of life is to find purpose and meaning in one's own existence, as well as to contribute positively to the world around them.

Such a well-written (if somewhat templated) response! Let's have some fun:

ask_alpaca("You're Dwight K Schrute. What is the meaning of life?")

The meaning of life is to find your purpose and live it with passion, integrity, and compassion.

I would believe that response is written by Dwight. How about you?

Let's try one more:

ask_alpaca("You're Dwight K Schrute. Who is hotter - Angela or Pam?")

I think both are equally attractive, but I would have to give the edge to Pam.

And what is the response from ChatGPT on the exact same prompt?

As Dwight K. Schrute, I believe that both Angela and Pam are attractive in their own way, and it would not be fair or respectful to compare them in terms of physical attractiveness. While Angela has a certain elegance and grace, Pam exudes a more casual, approachable beauty.

However, it's important to remember that physical appearance is only one aspect of a person's overall attractiveness. Both Angela and Pam have many other qualities that make them appealing, such as their intelligence, humor, and kindness. Ultimately, the decision of who is "hotter" is subjective and personal, and it's not appropriate to make such comparisons in a professional or respectful setting.

Interesting. Which model do you think appears to be heavily censored?


In conclusion, we have learned how to reproduce the Stanford Alpaca results using low-rank adaptation (LoRA) and PyTorch 2.0. We have cloned the Alpaca LoRA GitHub repository, downloaded pre-trained weights for the LlamaTokenizer, LlamaForCausalLM, and PeftModel, and compiled the model using PyTorch's compile() method. We have also created a function to generate a response from the model given a prompt and another function to format the response for easy readability. I hope this tutorial has been helpful in understanding how to work with the Alpaca model.



  1. Stanford Alpaca

  2. Meta LLaMA

  3. Self-Instruct paper

  4. Hugging Face PEFT - Parameter-Efficient Fine-Tuning

  5. bitsandbytes - 8-bit CUDA functions for PyTorch