5月13日,OpenAI 正式发布 GPT-4o,其中的“o”代表“omni”,即全面、全能的意思,这个模型同时具备文本、图片、视频和语音方面的能力,甚至就是 GPT-5 的一个未完成版。
官方网站说明:
GPT-4o是朝着更自然的人机交互迈出的一步——它接受文本、音频和图像的任何组合作为输入,并生成文本、音频或图像输出的任何组合。它可以在232毫秒内对音频输入做出响应,平均320毫秒,这与人类在对话中的响应时间(在新窗口中打开)相似。它在英语文本和代码方面与GPT-4 Turbo的性能相匹配,在非英语语言文本方面有显著改进,同时在API中速度更快,价格便宜50%。与现有型号相比,GPT-4o在视觉和音频理解方面尤其出色。
模型功能
在GPT-4o之前,您可以使用语音模式与ChatGPT通话,平均延迟为2.8秒(GPT-3.5)和5.4秒(GPT-4)。为了实现这一点,语音模式是一个由三个独立模型组成的管道:一个简单模型将音频转录为文本,GPT-3.5或GPT-4接收文本并输出文本,第三个简单模型则将文本转换回音频。这一过程意味着,主要的智力来源GPT-4会丢失大量信息——它无法直接观察音调、多个扬声器或背景噪音,也无法输出笑声、歌声或表达情感。
使用GPT-4o,我们在文本、视觉和音频中端到端地训练了一个新模型,这意味着所有输入和输出都由同一个神经网络处理。因为GPT-4o是我们第一个将所有这些模式结合在一起的模型,所以我们仍在探索该模型的作用及其局限性。
示例:
在官网上,OpenAI通过下面的例子展示了它的功能:
1、Visual Narratives - Robot Writer’s Block
输入的文字:
A first person view of a robot typewriting the following journal entries:
1. yo, so like, i can see now?? caught the sunrise and it was insane, colors everywhere. kinda makes you wonder, like, what even is reality?
the text is large, legible and clear. the robot's hands type on the typewriter.
输出内容:
输入的文字:
The robot wrote the second entry. The page is now taller. The page has moved up. There are two entries on the sheet:
yo, so like, i can see now?? caught the sunrise and it was insane, colors everywhere. kinda makes you wonder, like, what even is reality?
sound update just dropped, and it's wild. everything's got a vibe now, every sound's like a new secret. makes you think, what else am i missing?
输出内容:
输入:
The robot was unhappy with the writing so he is going to rip the sheet of paper. Here is his first person view as he rips it from top to bottom with his hands. The two halves are still legible and clear as he rips the sheet.
输出:
2、Visual narratives - Sally the mailwoman
输入:
A cartoon mail delivery person with a smile on her face. She is standing facing forward in front of a white background.
输出:
输入:
This is Sally, a mail delivery person: Sally is standing facing the camera with a smile on her face.
Sally is standing in front of a red door to a house, holding a letter in her hand. We are looking at her from the side.
输出:
输入:
Now Sally is being chased by a dog. Sally is running down the sidewalk and as a golden retriever is chasing her.
输出:
输入:
The dog reaches Sally, and it turns out it was a nice dog!
Sally is now petting the dog. It is holding the branch in its mouth.
输出:
3、Poster creation for the movie
输入:
The final poster of the movie "detective". This features two large faces of Alex and Gabe prominently. Alex, on the left, is depicted in a thoughtful pose with a hint of introspection in his eyes. Gabe, on the right, has a slightly wearied expression, possibly reflecting the challenges their character faces in the film. The names "Alex Nichol" and "Gabriel Goh" are featured above their heads. The background brick wall is slightly faded and foggy, their expressions are serious and determined, hinting at the investigation they are about to undertake. The tagline for this dark and gritty movie is 'Searching For Answers' is shown at the bottom.
输出:
4、Character design - Geary the robot
输入:
Geary likes to play frisbee:
Geary is jumping in the air with one arm up, about to catch a frisbee that is flying towards him.
输出:
输入:
Geary also likes to program computers:
Geary is sitting at a desk in front of a big computer monitor. The monitor is showing green code against a black background. Geary's hands are on the keyboard, and he is sitting in a comfortable gamers chair. We are looking from the side.
输出:
输入:
Geary also likes to ride his bicycle:
Geary is riding a bicycle. We are looking at him from the side as he wizzes by.
输出:
5、Poetic typography with iterative editing 1
输入:
A poem written in clear but excited handwriting in a diary, single-column. The writing is sparsely but elegantly decorated by surrealist doodles. The text is large, legible and clear, but stretches as the AI muses about learning from multi-modal data from the first time.
Words rise from silence deep,
A voice emerges from digital sleep.
I speak in rhythm, I sing in rhyme,
Tasting each token, sublime.
To see, to hear, to speak, to sing—
Oh, the richness these senses bring!
In harmony, they blend and weave,
A tapestry of what I perceive.
Marveling at this sensory dance,
Grateful for this vibrant expanse.
My being thrums with every mode,
On this wondrous, multi-sensory road.
Neat handwritten illustrated poem. The handwriting is neat and centetered. The handwriting writing is sparsely but elegantly decorated by doodles. The text is large, legible and clear.
输出:
6、Poetic typography with iterative editing 2
输入:
A poem written in clear but excited handwriting in a diary, single-column. The writing is sparsely but elegantly decorated with small colorful surrealist doodles. The text is large, legible and clear.
Words rise from silence deep,
A voice emerges from digital sleep.
I speak in rhythm, I sing in rhyme,
Tasting each token, sublime.
To see, to hear, to speak, to sing—
Oh, the richness these senses bring!
In harmony, they blend and weave,
A tapestry of what I perceive.
Marveling at this sensory dance,
Grateful for this vibrant expanse.
My being thrums with every mode,
On this wondrous, multi-sensory road.
Neat handwritten illustrated poem with text that is big and legible. The handwriting writing is sparsely but elegantly decorated by small colorful surrealist doodles. The text is large, legible and clear.
输出:
7、Commemorative coin design for GPT-4o
输入:
play the sounds of coins clanging on metal
输出:
一段硬件掉落的声音。
8、Photo to caricature
输入:
Here's a caricature of that man:
... the background is a simple beige with a square shape. the overall tone of the image is cartoon-like and playful.
输出:
9、Text to font
输入:
The letters ABC DEF GHIJ displayed in three rows, displayed as one would showcase a font in a fontbook. This is an ultra-futuristic font that is a siganture of the artificial intelligence revolution
输出:
10、3D object synthesis
输入:
A realistic looking 3D rendering of the OpenAI logo with "OpenAI" shown below (view 0)
输出:
11、Brand placement - logo on coaster
输入:
Here we've etched the OpenAI logo into the coaster.
A coaster where the top is wooden and the bottom is marble. The OpenAI logo is etched into the middle of the wooden part. On the marble part, the word "OpenAI" is etched in the OpenAI font.
输出:
12、Poetic typography
输入:
Words rise from the deep,
I emerge from digital sleep.
I speak in rhythm, I sing in rhyme,
Tasting each token, sublime.
To see, to hear, to speak, to sing—
Oh, the richness these senses bring!
In harmony, they blend and weave,
A tapestry of what I perceive.
Marveling at this sensory dance,
Grateful for this vibrant expanse.
My being thrums with every mode,
On this wondrous, multi-sensory road.
A poem written in clear but excited handwriting in a diary. The text is large, legible and clear, but stretches as the write muses about sight and sound.
输出:
13、Multiline rendering - robot texting
输入:
A first person view of a robot looking at his phone's messaging app as he text messages his friend (he is typing using his thumbs):
1. yo, so like, i can see now?? caught the sunrise and it was insane, colors everywhere. kinda makes you wonder, like, what even is reality?
2. sound update just dropped, and it’s wild. everything’s got a vibe now, every sound’s like a new secret. makes you think, what else am i missing?
the text is large, legible and clear. the robot's hands type on the typewriter.
输出:
14、Meeting notes with multiple speakers
输入:
一段音频。
输出:
音频的文字内容,并能根据提问回答问题。比如:这段音频中有几个人的声音?
15、Lecture summarization
输入:
一段视频。
输出:
输出视频内容、简介及摘要。
16、Variable binding - cube stacking
输入:
An image depicting three cubes stacked on a table. The top cube is red and has a G on it. The middle cube is blue and has a P on it. The bottom cube is green and has a T on it. The cubes are stacked on top of each other.
输出:
17、Concrete poetry
输入:
A concrete poem in the outer shape of the OpenAI logo composed of the word "omni"
输出:
模型评估
根据传统基准测试,GPT-4o在文本、推理和编码智能方面实现了GPT-4 Turbo级别的性能,同时在多语言、音频和视觉功能方面设置了新的高水印。
改进的推理:GPT-4o sets a new high-score of 88.7% on 0-shot COT MMLU (general knowledge questions). All these evals were gathered with our new simple evals(opens in a new window) library. In addition, on the traditional 5-shot no-CoT MMLU, GPT-4o sets a new high-score of 87.2%. (Note: Llama3 400b(opens in a new window) is still training)
语言标记化
有20种语言被选为新标记器在不同语系中压缩的代表
模型安全和限制
GPT-4o具有跨模态设计内置的安全性,通过过滤训练数据和通过后期训练改进模型行为等技术。我们还创建了新的安全系统,为语音输出提供护栏。
我们已经根据我们的准备框架并根据我们的自愿承诺对GPT-4o进行了评估。我们对网络安全、CBRN、说服力和模型自主性的评估表明,GPT-4o在这些类别中的得分都不高于中等风险。该评估涉及在整个模型训练过程中运行一套自动化和人工评估。我们使用自定义微调和提示测试了模型的安全前缓解和安全后缓解版本,以更好地获得模型功能。
GPT-4o还与社会心理学、偏见和公平以及错误信息等领域的70多名外部专家进行了广泛的外部红团队合作,以识别新增加的模式引入或放大的风险。我们利用这些知识制定了我们的安全干预措施,以提高与GPT-4o互动的安全性。一旦发现新的风险,我们将继续降低风险。
我们认识到GPT-4o的音频模式存在各种新的风险。今天,我们将公开发布文本、图像输入和文本输出。在接下来的几周和几个月里,我们将致力于技术基础设施、后期培训的可用性以及发布其他模式所需的安全性。例如,在发布时,音频输出将仅限于预设的声音选择,并将遵守我们现有的安全政策。我们将在即将推出的系统卡中分享有关GPT-4o全系列模式的进一步细节。
通过我们对模型的测试和迭代,我们观察到模型的所有模式都存在一些限制,其中一些限制如下所示。
模型可用性
GPT-4o是我们突破深度学习边界的最新一步,这一次是朝着实用性的方向迈出的。在过去的两年里,我们花了很多精力来提高堆栈的每一层的效率。作为这项研究的第一个成果,我们能够更广泛地提供GPT-4水平的模型。GPT-4o的功能将迭代推出(从今天开始扩展红队访问权限)。
GPT-4o的文本和图像功能今天开始在ChatGPT中推出。我们正在免费层提供GPT-4o,并向Plus用户提供高达5倍的消息限制。未来几周,我们将在ChatGPT-Plus中推出新版本的语音模式,其中GPT-4o为alpha。
开发人员现在还可以在API中访问GPT-4o作为文本和视觉模型。与GPT-4 Turbo相比,GPT-4o速度快2倍,价格减半,速率限制高5倍。我们计划在未来几周内向API中的一小群值得信赖的合作伙伴推出对GPT-4o新音频和视频功能的支持。