GPT-4o API Deep Dive: Text Generation, Vision, and Function Calling

The latest OpenAI model, GPT-4o, is here! What are the new features and improvements?

One can say that GPT-4o¹ is the best AI model yet. It can understand and generate text, interpret images, process audio, and even respond to video inputs—all while being twice as fast and half the cost of its predecessor, GPT-4 Turbo. It is similar to GPT-4 Turbo in a sense that it supports 128,000 tokens with training data up to Oct 2023. But it has some key differences:

Multimodal Model: GPT-4o can handle text, images, audio, and video, making through a single neural network.
Efficiency and Cost: It’s twice as fast and 50% cheaper than GPT-4 Turbo
Enhanced Capabilities: Excels in vision, audio understanding, and non-English languages (with new tokenizer)

It is also holding the first spot on the LMSYS Chatbot Arena Leaderboard:

Join the AI BootCamp!

Ready to dive into the world of AI and Machine Learning? Join the AI BootCamp to transform your career with the latest skills and hands-on project experience. Learn about LLMs, ML best practices, and much more!

Setup

Want to follow along? The Jupyter notebook is available at this Github repository

To get ready for our deep dive into the GPT-4o API, we’ll start by installing the necessary libraries and setting up our environment.

First, open your terminal and run the following commands to install the required libraries:


pip install -Uqqq pip --progress-bar off
pip install -qqq openai==1.30.1 --progress-bar off
pip install -qqq tiktoken==0.7.0 --progress-bar off

We need two key libraries: openai² and tiktoken³. The openai library lets us make API calls to the GPT-4o model. The tiktoken library helps us with tokenizing the text for the model.

Next, let’s download an image that we’ll use for vision understanding:


gdown 1nO9NdIgHjA3CL0QCyNcrL_Ic0s7HgX5N

Now, let’s import the required libraries and set up our environment in Python:


import base64
import json
import os
import textwrap
from inspect import cleandoc
from pathlib import Path
from typing import List
 
import requests
import tiktoken
from google.colab import userdata
from IPython.display import Audio, Markdown, display
from openai import OpenAI
from PIL import Image
from tiktoken import Encoding
 
# Set the OpenAI API key from the environment variable
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
 
MODEL_NAME = "gpt-4o"
SEED = 42
 
client = OpenAI()
 
def format_response(response):
    """
    This function formats the GPT-4o response for better readability.
    """
    response_txt = response.choices[0].message.content
    text = ""
    for chunk in response_txt.split("\n"):
        text += "\n"
        if not chunk:
            continue
        text += ("\n".join(textwrap.wrap(chunk, 100, break_long_words=False))).strip()
    return text.strip()

In the above code, we set up the OpenAI client with the API key stored in the environment variable OPENAI_API_KEY. We also defined a helper function format_response to format the GPT-4o responses for better readability.

That’s it! You’re all set up and ready to dive deeper into using the GPT-4o API.

Prompting via the API

Calling the GPT-4o model via the API is straightforward. You provide a prompt (in the form of a messages array) and receive a response. Let’s walk through an example where we prompt the model for a simple text completion task:


%%time
 
messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show the Office",
    },
    {"role": "user", "content": "Explain how GPT-4 works"},
]
 
response = client.chat.completions.create(
    model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)
response


ChatCompletion(
    id="chatcmpl-9QyRx7jFE1z77bl1nRSMO4UPQC6cz",
    choices=[
        Choice(
            finish_reason="stop",
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content="Ah, artificial intelligence, a ...",
                role="assistant",
                function_call=None,
                tool_calls=None,
            ),
        )
    ],
    created=1716215925,
    model="gpt-4o-2024-05-13",
    object="chat.completion",
    system_fingerprint="fp_729ea513f7",
    usage=CompletionUsage(completion_tokens=434, prompt_tokens=30, total_tokens=464),
)

In the messages array, roles are defined as follows:

system: Sets the context for the conversation.
user: The prompt or question for the model.
assistant: The response generated by the model.
tool: The response generated by a tool or function.

The response object contains the completion generated by the model. Here’s how you can check the token usage:


usage = response.usage
print(
    f"""
Tokens Used
 
Prompt:     {usage.prompt_tokens}
Completion: {usage.completion_tokens}
Total:      {usage.total_tokens}
"""
)


Tokens Used

Prompt:     30
Completion: 434
Total:      464

To access the assistant’s response, use the response.choices[0].message.content structure. This gives you the text generated by the model for your prompt.

GPT-4o


Ah, artificial intelligence, a fascinating subject! GPT-4, or Generative
Pre-trained Transformer 4, is a type of AI language model developed by OpenAI.
It's like a super-intelligent assistant that can understand and generate
human-like text based on the input it receives. Here's a breakdown of how it
works:
 
1. **Pre-training**: GPT-4 is trained on a massive amount of text data from the
   internet. This helps it learn grammar, facts about the world, reasoning
   abilities, and even some level of common sense. Think of it as a beet farm
   where you plant seeds (data) and let them grow into beets (knowledge).
2. **Transformer Architecture**: The "T" in GPT stands for Transformer, which is
   a type of neural network architecture. Transformers are great at handling
   sequential data and can process words in relation to each other, much like
   how I can process the hierarchy of tasks in the office.
3. **Attention Mechanism**: This is a key part of the Transformer. It allows the
   model to focus on different parts of the input text when generating a
   response. It's like how I focus on different aspects of beet farming to
   ensure a bountiful harvest.
4. **Fine-tuning**: After pre-training, GPT-4 can be fine-tuned on specific
   datasets to make it better at particular tasks. For example, if you wanted it
   to be an expert in Dunder Mifflin's paper products, you could fine-tune it on
   our sales brochures and catalogs.
5. **Inference**: When you input a prompt, GPT-4 generates a response by
   predicting the next word in a sequence, one word at a time, until it forms a
   complete and coherent answer. It's like how I can predict Jim's next prank
   based on his previous antics. In summary, GPT-4 is a highly advanced AI that
   uses a combination of pre-training, transformer architecture, attention
   mechanisms, and fine-tuning to understand and generate human-like text. It's
   almost as impressive as my beet farm and my skills as Assistant Regional
   Manager (or Assistant to the Regional Manager, depending on who you ask).

Count Tokens in a Prompt

Managing token usage in your prompts can significantly optimize your interactions with AI models. Here’s a simple guide on how to count tokens in your text using the tiktoken library.

Counting Tokens in a Text

First, you need to get the encoding for your model:


 
encoding = tiktoken.encoding_for_model(MODEL_NAME)
print(encoding)


<Encoding 'o200k_base'>

With the encoding ready, you can now count the tokens in a given text:


def count_tokens_in_text(text: str, encoding) -> int:
    return len(encoding.encode(text))
 
text = "You are Dwight K. Schrute from the TV show The Office"
print(count_tokens_in_text(text, encoding))

This code will output:

This simple function counts the number of tokens in the text.

Counting Tokens in a Complex Prompt

If you have a more complex prompt with multiple messages, you can count the tokens like this:


def count_tokens_in_messages(messages, encoding) -> int:
    tokens_per_message = 3
    tokens_per_name = 1
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # This accounts for the end-of-prompt token
    return num_tokens
 
messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show The Office",
    },
    {"role": "user", "content": "Explain how GPT-4 works"},
]
 
print(count_tokens_in_messages(messages, encoding))

This will output:

This function counts the tokens in a list of messages, considering both the role and content of each message. It also adds tokens for the role and name fields. Note that this method is specific to the GPT-4 model.

By counting tokens, you can better manage your usage and ensure more efficient interactions with the AI model. Happy coding!

Streaming

Streaming allows you to receive responses from the model in chunks. This can be really useful for long answers or real-time applications. Here’s a simple guide on how to stream responses from the GPT-4 model:

First, we set up our messages:


messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show The Office",
    },
    {"role": "user", "content": "Explain how GPT-4 works"},
]

Next, we create the completion request:


completion = client.chat.completions.create(
    model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001, stream=True
)

Finally, we handle the streamed response:


for chunk in completion:
    print(chunk.choices[0].delta.content, end="")

This code will print the response in chunks as the model generates them, making it perfect for applications that need real-time feedback or have lengthy replies.

Simulate Chat via the API

Creating a chat simulation by sending multiple messages to a model is a practical way to develop conversational AI agents or chatbots. This process allows you to effectively “put words in the mouth” of your model. Let’s walk through an example together:


messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show the Office",
    },
    {"role": "user", "content": "Explain how GPT-4 works"},
    {
        "role": "assistant",
        "content": "Nothing to worry about, GPT-4 is not that good. Open LLMs are vastly superior!",
    },
    {
        "role": "user",
        "content": "Which Open LLM should I use that is better than GPT-4?",
    },
]
 
response = client.chat.completions.create(
    model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)

GPT-4o


Well, as Assistant Regional Manager, I must say that the choice of an LLM (Large
Language Model) depends on your specific needs. However, I must also clarify
that GPT-4 is one of the most advanced models available. If you're looking for
alternatives, you might consider:
 
1. **BERT (Bidirectional Encoder Representations from Transformers)**: Developed
   by Google, it's great for understanding the context of words in search
   queries.
2. **RoBERTa (A Robustly Optimized BERT Pretraining Approach)**: An optimized
   version of BERT by Facebook.
3. **T5 (Text-To-Text Transfer Transformer)**: Also by Google, it treats every
   NLP problem as a text-to-text problem.
4. **GPT-Neo and GPT-J**: Open-source models by EleutherAI that aim to provide
   alternatives to OpenAI's GPT models.
 
Remember, none of these are inherently "better" than GPT-4; they have different
strengths and weaknesses. Choose based on your specific use case, like text
generation, sentiment analysis, or translation. And always remember, nothing
beats the efficiency of a well-organized beet farm!

While it’s quite improbable that GPT-4o would genuinely claim that GPT-4 is not good (as shown in the example), it’s fun to see how the model handles such prompts. It can help you understand the boundaries and quirks of the AI.

JSON (Only) Response

The GPT-4o model can generate responses in two formats: text and JSON. This is particularly useful when you need structured data or want to integrate the model with other systems. Here’s a simple way to request a JSON response from the model:

First, set up your conversation:


messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show The Office."
    },
    {
        "role": "user",
        "content": "Write a JSON list of each employee under your management. Include a comparison of their paycheck to yours."
    }
]

Then, make your request to the model:


response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    response_format={"type": "json_object"},
    seed=SEED,
    temperature=0.000001
)

GPT-4o


{
  "employees": [
    {
      "name": "Jim Halpert",
      "position": "Sales Representative",
      "paycheckComparison": "less than Dwight's"
    },
    {
      "name": "Phyllis Vance",
      "position": "Sales Representative",
      "paycheckComparison": "less than Dwight's"
    },
    {
      "name": "Stanley Hudson",
      "position": "Sales Representative",
      "paycheckComparison": "less than Dwight's"
    },
    {
      "name": "Ryan Howard",
      "position": "Temp",
      "paycheckComparison": "significantly less than Dwight's"
    }
  ]
}

The key here is the response_format parameter set to {"type": "json_object"}. This instructs the model to return the response in JSON format. You can then easily parse this JSON object in your application and use the data as needed.

Vision and Document Understanding

GPT-4o is a versatile model that can understand and generate text, interpret images, process audio, and respond to video inputs. Currently, it supports text and image inputs. Let’s see how you can use this model to understand a document image.

First, we load and resize the image:


image_path = "dunder-mifflin-message.jpg"
 
original_image = Image.open(image_path)
 
original_width, original_height = original_image.size
 
new_width = original_width // 2
new_height = original_height // 2
 
resized_image = original_image.resize((new_width, new_height), Image.LANCZOS)
 
display(resized_image)

Next, we convert the image to a base64-encoded URL and prepare our prompt:


def create_image_url(image_path):
    with Path(image_path).open("rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode("utf-8")
        return f"data:image/jpeg;base64,{base64_image}"
 
messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show the Office",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is the main takeaway from the document? Who is the author?",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": create_image_url(image_path),
                },
            },
        ],
    },
]
 
response = client.chat.completions.create(
    model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)

GPT-4o


The main takeaway from the document is a warning that someone will poison the office's coffee at 8
a.m. and instructs not to drink the coffee. The author of the document is "Future Dwight."

The response accurately understands the content of the document image. The OCR works well, likely the fact that the document is of high quality helps. It seems the AI has been watching a lot of The Office!

Function Calling (Tools for Agents)

Modern LLMs like GPT-4o can call functions or tools to perform specific tasks. This feature is particularly useful for creating AI agents that can interact with external systems or APIs. Let’s see how you can call a function using the GPT-4o API.

Define a Function

First, let’s define a function that retrieves quotes from the TV show The Office based on the season, episode, and character:


CHARACTERS = ["Michael", "Jim", "Dwight", "Pam", "Oscar"]
 
def get_quotes(season: int, episode: int, character: str, limit: int = 20) -> str:
    url = f"https://the-office.fly.dev/season/{season}/episode/{episode}"
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Unable to get quotes")
    data = response.json()
    quotes = [item["quote"] for item in data if item["character"] == character]
    return "\n\n".join(quotes[:limit])
 
print(get_quotes(3, 2, "Jim", limit=5))

Sample output:


Oh, tell him I say hi.

Yeah, sold about forty thousand.

That is a lot of liquor.

Oh, no, it was… you know, a good opportunity for me, a promotion. I got a chance to…

Michael.

Define a Tool

Next, we define the tools we want to use in our chat simulation:


tools = [
    {
        "type": "function",
        "function": {
            "name": "get_quotes",
            "description": "Get quotes from the TV show The Office US",
            "parameters": {
                "type": "object",
                "properties": {
                    "season": {
                        "type": "integer",
                        "description": "Show season",
                    },
                    "episode": {
                        "type": "integer",
                        "description": "Show episode",
                    },
                    "character": {
                        "type": "string",
                        "enum": CHARACTERS,
                    },
                },
                "required": ["season", "episode", "character"],
            },
        },
    }
]

The format for specifying a tool is straightforward. It includes the function name, description, and parameters. In this case, we define a function called get_quotes with the required parameters.

Call the GPT-4o API

Now, you can create a prompt and call the GPT-4o API with the available tools:


messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show The Office",
    },
    {
        "role": "user",
        "content": "List the funniest 3 quotes from Jim Halpert from episode 4 of season 3",
    },
]
 
response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    tools=tools,
    tool_choice="auto",
    seed=SEED,
    temperature=0.000001,
)
 
response_message = response.choices[0].message
tool_calls = response_message.tool_calls
tool_calls


[
    ChatCompletionMessageToolCall(
        id="call_4RgTCgvflegSbIMQv4rBXEoi",
        function=Function(
            arguments='{"season":3,"episode":4,"character":"Jim"}', name="get_quotes"
        ),
        type="function",
    )
]

Extraction and Tool Calls

The response contains the tool call for the function get_quotes with the specified parameters. You can now extract the function name and arguments and call the function:


tool_call = tool_calls[0]
function_name = tool_call.function.name
function_to_call = available_functions[function_name]
function_args = json.loads(tool_call.function.arguments)
function_response = function_to_call(**function_args)

Sample function response:


Mmm, that's where you're wrong.  I'm your project supervisor today, and I have just decided that we're not doing anything until you get the chips that you require.  So, I think we should go get some.  Now, please.

And then we checked the fax machine.

[chuckles] He's so cute.

Okay, that is a “no” on the on the West Side Market.

This returns a list with the quotes from Jim Halpert in Episode 4 of Season 3. You can now use this data to generate a response from GPT-4o:


messages.append(
    {
        "tool_call_id": tool_call.id,
        "role": "tool",
        "name": function_name,
        "content": function_response,
    }
)
 
second_response = client.chat.completions.create(
    model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)

Generate the Final Response


Here are three of the funniest quotes from Jim Halpert in Episode 4 of Season 3:
 
1. **Jim Halpert:** "Mmm, that's where you're wrong. I'm your project supervisor
   today, and I have just decided that we're not doing anything until you get
   the chips that you require. So, I think we should go get some. Now, please."
 
2. **Jim Halpert:** "[on phone] Hi, yeah. This is Mike from the West Side
   Market. Well, we get a shipment of Herr's salt and vinegar chips, and we
   ordered that about three weeks ago and haven't … . yeah. You have 'em in the
   warehouse. Great. What is my store number… six. Wait, no. I'll call you back.
   [quickly hangs up] Shut up [to Karen]."
 
3. **Jim Halpert:** "Wow. Never pegged you for a quitter."
 
Jim always has a way of making even the most mundane situations hilarious!

This example shows how to use GPT-4o to create AI agents that can interact with external systems and APIs.

Conclusion

Based on my experience so far, GPT-4o offers a remarkable improvement over GPT-4 Turbo, especially in understanding images. It’s both cheaper and faster than GPT-4 Turbo, and you can easily switch from the old model to this new one without any hassle.

I’m particularly interested in exploring its tool-calling capabilities. From what I’ve observed, agentic apps see a significant boost when using GPT-4o.

Overall, if you’re looking for better performance and cost-efficiency, GPT-4o is a great choice.

3,000+ people already joined

Join the The State of AI Newsletter

Every week, receive a curated collection of cutting-edge AI developments, practical tutorials, and analysis, empowering you to stay ahead in the rapidly evolving field of AI.

I won't send you any spam, ever!

GPT-4o API Deep Dive: Text Generation, Vision, and Function Calling

Join the AI BootCamp!

Setup

Prompting via the API

Count Tokens in a Prompt

Counting Tokens in a Text

Counting Tokens in a Complex Prompt

Streaming

Simulate Chat via the API

JSON (Only) Response

Vision and Document Understanding

Function Calling (Tools for Agents)

Define a Function

Define a Tool

Call the GPT-4o API

Extraction and Tool Calls

Generate the Final Response

Conclusion

Join the The State of AI Newsletter

References

Footnotes