You’ve been trying to run reasoning models like DeepSeek-R1 or Qwen3-Thinking with llama.cpp for days. The server starts, the model loads, you send a request using the Anthropic API and… something breaks. Claude Code complains about malformed responses. Streaming doesn’t work properly. You ask for “Hello World” and only get “World”.
The first thing you think is that it’s your configuration. Another environment variable set wrong, another missing parameter. You move on and do something else.
Until one day you search for the error on GitHub and find an issue. And another. And another. People with the exact same problem, different models, different operating systems, same broken result. You think “at least it’s not me” —misery loves company— but you also think something else: if nobody fixes this, it’s going to stay broken for everyone.
So you decide to make it work.
The problem: variables resetting where they shouldn’t
Issue #18613 described it perfectly. A developer asked llama-server to say “Hello World” and only received “World”. The server logs showed it was internally generating all three tokens correctly —“Hello”, “ World”, end of sequence— but the JSON response only contained the last one.
The bug was in how llama.cpp handled streaming for the Anthropic API. Every time a chunk of tokens was processed, the variables controlling whether a content block had already started were declared as local. They reset to false on each iteration. Result: instead of sending one content_block_start at the beginning and then content deltas, the server sent a new block start for every token. The client went crazy trying to parse that.
But there was a second problem more specific to thinking models. The Anthropic API expects a signature field in thinking blocks. In non-streaming mode it must be present even if empty. In streaming, there must be a signature_delta event before closing each block. llama.cpp wasn’t sending either.
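For reference, the event sequence Anthropic's spec expects when streaming a thinking block looks roughly like this (abridged, values illustrative):

```
event: content_block_start
data: {"type": "content_block_start", "index": 0, "content_block": {"type": "thinking", "thinking": ""}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "thinking_delta", "thinking": "Let me work through this..."}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "signature_delta", "signature": "…"}}

event: content_block_stop
data: {"type": "content_block_stop", "index": 0}
```

llama.cpp was skipping the `signature_delta` event entirely, so strict clients refused to close the block.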
The solution: persistent state and signatures where they belong
The fix required changes in tools/server/server-task.h. First, moving block tracking to the task_result_state structure so it persists between calls:
```cpp
// In task_result_state (persistent between chunks)
bool anthropic_thinking_block_started = false;
bool anthropic_text_block_started     = false;
```
This guarantees a single content_block_start per block, regardless of how many tokens are generated afterwards.
Second, adding the missing signature events:
```cpp
// Before closing a thinking block in streaming
if (state.anthropic_thinking_block_started) {
    // Empty signature_delta (required by Anthropic spec)
    events.push_back({
        {"type", "content_block_delta"},
        {"delta", {{"type", "signature_delta"}, {"signature", ""}}}
    });
}
```
In non-streaming mode, the signature field is added directly to the thinking block object.
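So in a non-streaming response, a thinking content block ends up looking roughly like this (values illustrative):

```
{
  "type": "thinking",
  "thinking": "…the model's reasoning…",
  "signature": ""
}
```

The signature is empty because a local model has nothing to cryptographically sign, but the field has to exist for spec-compliant clients to accept the block.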
The review process
The PR went through several iterations. One of the maintainers pointed out that using a raw pointer to state was a problematic pattern. He was right. I refactored the code to copy the values directly in the update() function, eliminating the risk of invalid references.
After 27 compatibility tests with the Anthropic API and manual testing with DeepSeek-R1-Distill-Qwen-7B and Qwen3-4B-Thinking, the PR was approved. The developer who had reported the original issue confirmed that they now received the complete “Hello World”.
Why this matters
Ollama announced that their version 0.14.0 is compatible with the Anthropic Messages API. You can use Claude Code with local open source models. Extended thinking, function calling, streaming, everything works.
Part of that compatibility depends on the server implementing the specification correctly. When someone runs Claude Code against a local model that supports reasoning, the thinking blocks serialize properly and the client doesn’t explode.
It’s a small piece in a large ecosystem. But every time a developer can use open source reasoning models with tools that previously only worked with proprietary APIs, something improves for everyone.
Open source works like this. Someone finds a bug, another person fixes it, and thousands of people who will never know the problem existed benefit without realizing it. Working for the common good.