Competitive Performance, Competitive Prices Amazon introduces Nova models for text, image, and video

Published
Reading time
4 min read
Berkeley Function Calling Leaderboard with metrics like accuracy, latency, and relevance.
Loading the Elevenlabs Text to Speech AudioNative Player...

Amazon introduced a range of models that confront competitors head-on.

What’s new: The Nova line from Amazon includes three vision-language models (Nova Premier, Nova Pro, and Nova Lite), one language model (Nova Micro), an image generator (Nova Canvas), and a video generator (Nova Reel). All but Nova Premier are available on Amazon’s Bedrock platform, and Nova Premier, which is the most capable, is expected in early 2025. In addition, Amazon plans to release a speech-to-speech model in early 2025 and a multimodal model that processes text, images, video, and audio by mid-year. (Disclosure: Andrew Ng serves on Amazon’s board of directors.)

How it works: Nova models deliver competitive performance at relatively low prices. Amazon hasn’t disclosed parameter counts or details about how the models were built except to say that Nova Pro, Lite, and Micro were trained on a combination of proprietary, licensed, public, and open-source text, images, and video in over 200 languages.

  • Nova Pro is roughly comparable to that of Anthropic Claude 3.5 Sonnet, OpenAI GPT-4o, and Google Gemini Pro. It has a 300,000-token input context window, enabling it to process relatively large vision-language inputs. Nova Pro outperforms its primary competitors in tests of following complex instructions (IFEval), summarizing long texts (SQuALITY), understanding videos (LVBench), and reading and acting on websites (MM-Mind2Web). It processes 95 tokens per second. At $0.80/$3.20 per million tokens of input/output, it’s significantly less expensive than GPT-4o ($2.50/$10) and Claude 3.5 Sonnet ($3/$15) but slower than GPT-4o (115 tokens per second).
  • Nova Lite compares favorably with Anthropic Claude Haiku, Google Gemini 1.5 Flash, and OpenAI GPT-4o Mini. Optimized for processing speed and efficiency, it too has a 300,000 token input context window. Nova Lite bests Claude 3.5 Sonnet and GPT-4o on VisualWebBench, which tests visual understanding of web pages. It also beats Claude 3.5 Haiku, GPT-4o Mini, and Gemini 1.5 Flash in multimodal agentic tasks that include MM-Mind2Web and the Berkeley Function-Calling Leaderboard. It processes 157 tokens per second and costs $0.06/$0.24 per million tokens of input/output, making it less expensive than GPT-4o mini ($0.15/$0.60), Claude 3.5 Haiku ($0.80/$4), or Gemini 1.5 Flash ($0.075/$0.30), but slower than Gemini 1.5 Flash (189 tokens per second).
  • Nova Micro is a text-only model with a 128,000-token context window. It exceeds Llama 3.1 8B and Gemini Flash 8B on all 12 tests reported by Amazon, including generating code (HumanEval) and reading financial documents (FinQA). It also beats the smaller Claude, Gemini, and Llama models on retrieval-augmented generation tasks (CRAG). It processes 210 tokens per second (the lowest latency among Nova models) and costs $0.035/$0.14 per million input/output tokens. That’s cheaper than Gemini Flash 8B ($0.0375/$0.15) and Llama 3.1 8B ($0.10/$0.10), but slower than Gemini Flash 8B (284.2 tokens per second).
  • Nova Canvas accepts English-language text prompts up to 1,024 characters and produces images up to 4.2 megapixels in any aspect ratio. It also performs inpainting, outpainting, and background removal. It excels on ImageReward, a measure of human preference for generated images, surpassing OpenAI DALL·E 3 and Stability AI Stable Diffusion 3.5. Nova Canvas costs between $0.04 per image up to 1024x1024 pixels and $0.08 per image up to 2,048x2,048 pixels. Prices are hard to compare because many competitors charge by the month or year, but this is less expensive and higher-resolution than DALL·E 3 ($0.04 to $0.12 per image).
  • Nova Reel accepts English-language prompts up to 512 characters and image prompts up to 720x1,280 pixels. It generates video clips of 720x1280 pixels up to six seconds long. It demonstrates superior ability to maintain consistent imagery from frame to frame, winning 67 percent of head-to-head comparisons with the next highest-scoring model, Runway Gen-3 Alpha. Nova Reel costs $0.08 per second of output, which is less expensive than Runway Gen-3 Alpha ($0.096 per second) and Kling 1.5 ($0.12 per second) in their standard monthly plans.

Behind the news: The company launched Bedrock in April 2023 with Stability AI’s Stable Diffusion for image generation, Anthropic’s Claude and AI21’s Jurassic-2 for text generation, and its own Titan models for text generation and embeddings. Not long afterward, it added language models from Cohere as well as services for agentic applications and medical applications. It plans to continue to provide models from other companies (including Anthropic), offering a range of choices.

Why it matters: While other AI giants raced to outdo one another in models for text and multimodal processing, Amazon was relatively quiet. With Nova, it has staked out a strong position in those areas, as well as the startup-dominated domains of image and video generation. Moreover, it’s strengthening its cloud AI offerings with competitive performance, pricing, and speed. Nova’s pricing continues the rapid drop in AI prices over the last year. Falling per-token prices help make AI agents or applications that process large inputs more practical. For example, Simon Willison, developer of the Django Python framework for web applications, found that Nova Lite generated descriptions for his photo library (tens of thousands of images) for less than $10.

We’re thinking: The Nova suite is available via APIs only, without a web-based user interface. This accords with Amazon Web Services’ focus on developers. For consumers, Amazon offers the Rufus shopping bot.

Share

Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox