Anthropic has introduced a new prompt caching feature in its API, allowing users to store frequently used prompts for quicker and cheaper access in subsequent calls. This feature is available in public beta on the Claude 3.5 Sonnet and Claude 3 Haiku models, with support for the larger Claude 3 Opus model expected soon.
Prompt caching enables developers to retain context between API calls, reducing the need to resend and reprocess large amounts of background information with every request; this keeps responses grounded in the same context while cutting costs.
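The sketch below shows how a cacheable prompt prefix might be marked in a request, based on the beta API as documented at launch; the model name, beta header value, file path, and prompt text are illustrative, and the exact SDK surface may differ from what is shown.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical long reference document we want to reuse across many calls.
long_document = open("product_manual.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    # During the public beta, prompt caching is enabled via a beta header.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {
            "type": "text",
            "text": "You are a support assistant for the attached product manual.",
        },
        {
            "type": "text",
            "text": long_document,
            # Mark the large, stable prefix as cacheable; later calls that send
            # the identical prefix read it from the cache instead of paying the
            # full input-token price to reprocess it.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": "How do I reset the device to factory settings?"},
    ],
)

print(response.content[0].text)
```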
Early users of the feature have reported significant cost savings and speed improvements, from cheaper conversational agents that carry long instructions or documents to faster code autocompletion.
Potential use cases extend to embedding entire documents in prompts or supplying agentic tools with many instructions at once. By avoiding the repeated reprocessing of the same prompt content, prompt caching makes context-heavy tasks more efficient, making it a valuable tool for developers.
One of the key benefits of prompt caching is a significant reduction in API costs. On the Claude 3.5 Sonnet model, for instance, input that is read from the cache drops from $3 per million tokens (MTok) to just $0.30/MTok.
Writing a prompt to the cache costs slightly more upfront, but repeated use of the cached prompt yields substantial savings. The Claude 3 Haiku model offers similarly reduced prices for cached prompts, further underscoring the economic advantages of the feature.
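As a rough back-of-the-envelope illustration of those economics, the snippet below compares cached and uncached input costs for a 100,000-token prompt reused across several calls. The $3/MTok and $0.30/MTok figures are the Claude 3.5 Sonnet prices quoted above; the $3.75/MTok cache-write price is an assumption standing in for the slightly higher upfront storage cost, not a figure stated in this article.

```python
PROMPT_TOKENS = 100_000
BASE_PRICE = 3.00 / 1_000_000         # $ per input token, uncached
CACHE_WRITE_PRICE = 3.75 / 1_000_000  # $ per token to write the cache (assumed)
CACHE_READ_PRICE = 0.30 / 1_000_000   # $ per token read back from the cache

def cost_without_cache(calls: int) -> float:
    # Every call pays full price to reprocess the whole prompt.
    return calls * PROMPT_TOKENS * BASE_PRICE

def cost_with_cache(calls: int) -> float:
    # The first call writes the cache; the remaining calls read from it.
    return PROMPT_TOKENS * CACHE_WRITE_PRICE + (calls - 1) * PROMPT_TOKENS * CACHE_READ_PRICE

for calls in (1, 2, 10, 100):
    print(f"{calls:>4} calls: uncached ${cost_without_cache(calls):.2f}, "
          f"cached ${cost_with_cache(calls):.2f}")
```

Under these assumptions the cached path is already cheaper by the second call and costs roughly a fifth as much over ten calls.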
However, prompt caching comes with limitations. Cached prompts have a short lifespan of five minutes, though the timer is refreshed each time the cached content is used. That time limit contrasts with Google's Gemini context caching, which instead charges by the hour to keep a cache active.
Despite these constraints, prompt caching remains a cost-effective way to reduce the overall expense of AI interactions, especially in scenarios that reuse the same context repeatedly.
Anthropic’s introduction of prompt caching puts it in direct competition with other AI platforms such as OpenAI and Google, adding pressure to an ongoing “race to the bottom” on pricing.
While prompt caching is distinct from model memory, since it doesn’t store user preferences or past responses, it delivers a frequently requested capability: a more streamlined and cost-efficient way of maintaining context between API calls. The move fits the broader trend of AI platforms trying to balance performance with affordability for third-party developers.