litellm - 💡(How to fix) Fix [Bug]: minimax-m2.7 via Ollama Cloud fails on 2nd+ request with Internal Server Error [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
BerriAI/litellm#24533Fetched 2026-04-08 01:27:17
View on GitHub
Comments
1
Participants
2
Timeline
4
Reactions
1
Author
Timeline (top)
cross-referenced ×2commented ×1labeled ×1

When using minimax-m2.7:cloud (Ollama's naming — the :cloud suffix denotes the cloud-hosted version) through the Ollama Cloud API, the first request succeeds but every subsequent request fails with Internal Server Error. This includes starting a brand new session — still only one response before it breaks. This rules out conversation history as the cause.

Error Message

litellm.APIConnectionError: Ollama_chatException - {"error":"Internal Server Error (ref: <uuid>)"}

Root Cause

When using minimax-m2.7:cloud (Ollama's naming — the :cloud suffix denotes the cloud-hosted version) through the Ollama Cloud API, the first request succeeds but every subsequent request fails with Internal Server Error. This includes starting a brand new session — still only one response before it breaks. This rules out conversation history as the cause.

Code Example

import litellm
   litellm.api_base = "https://ollama.com/api"
   litellm.api_key = "your-ollama-cloud-api-key"

---

litellm.APIConnectionError: Ollama_chatException - {"error":"Internal Server Error (ref: <uuid>)"}

---

# thinking is set from reasoning_content in transform_request
if reasoning_content is not None:
    ollama_message["thinking"] = reasoning_content

# response remaps 'thinking' field to 'reasoning_content'
response_json_message["reasoning_content"] = response_json_message.get("thinking")
RAW_BUFFERClick to expand / collapse

Python version: 3.12 LiteLLM version: 1.82.4 OS: Linux

Description

When using minimax-m2.7:cloud (Ollama's naming — the :cloud suffix denotes the cloud-hosted version) through the Ollama Cloud API, the first request succeeds but every subsequent request fails with Internal Server Error. This includes starting a brand new session — still only one response before it breaks. This rules out conversation history as the cause.

Steps to Reproduce

  1. Configure LiteLLM with the Ollama provider pointing to Ollama Cloud:
    import litellm
    litellm.api_base = "https://ollama.com/api"
    litellm.api_key = "your-ollama-cloud-api-key"
  2. Send a first chat completion request with model minimax-m2.7:cloudsucceeds
  3. Send a second request (same or new conversation) — fails with Internal Server Error

Expected Behavior

Both requests should succeed.

Actual Behavior

First request: ✅ succeeds Second request: ❌ fails with:

litellm.APIConnectionError: Ollama_chatException - {"error":"Internal Server Error (ref: <uuid>)"}

Retries (3/3) also fail. Even starting a brand new session exhibits the same pattern.

Additional Context

Key observation: The same underlying model is available via OpenRouter as minimax/minimax-m2.7 and works correctly there. This suggests the model itself is fine — the bug is in LiteLLM's Ollama adapter handling of this specific cloud model's streaming response.

Working models via Ollama Cloud:

  • GLM-5 (also a thinking model) — ✅ works perfectly, multiple requests
  • Non-thinking models — ✅ work fine

Not working:

  • minimax-m2.7:cloud via Ollama Cloud — ❌ fails after first request
  • Same model via OpenRouter (minimax/minimax-m2.7) — ✅ works

Hypothesis: minimax-m2.7:cloud may stream its thinking/reasoning content differently from other thinking models (e.g., GLM-5), causing the Ollama chat transformation or streaming iterator to corrupt subsequent requests. This could be related to how the thinking content field is handled differently between the two models.

Relevant Code

The Ollama chat transformation in litellm/llms/ollama/chat/transformation.py has reasoning content handling:

# thinking is set from reasoning_content in transform_request
if reasoning_content is not None:
    ollama_message["thinking"] = reasoning_content

# response remaps 'thinking' field to 'reasoning_content'
response_json_message["reasoning_content"] = response_json_message.get("thinking")

The streaming iterator OllamaChatCompletionResponseIterator tracks started_reasoning_content and finished_reasoning_content flags to strip <think> XML tags from content. If the model streams thinking differently, these flags could get into a bad state.

Related Issues

  • Similar: litellm#15399 — "Ollama Cloud Models Streaming Chunk Parsing Failure" (closed, about deepseek-v3.1:671b-cloud)

Investigation Needed

A raw streaming log of minimax-m2.7:cloud vs GLM-5 chunks from Ollama Cloud would help identify the difference. The issue likely requires comparing the actual SSE chunk format between the two models to see if minimax-m2.7:cloud uses a non-standard thinking field format or streams it differently.

extent analysis

Fix Plan

To address the issue with minimax-m2.7:cloud failing after the first request, we need to modify the Ollama chat transformation and streaming iterator in litellm to correctly handle the thinking field for this specific model.

Step 1: Modify transform_request in transformation.py

Update the thinking field handling to accommodate potential differences in streaming formats:

if reasoning_content is not None:
    ollama_message["thinking"] = [reasoning_content]  # Ensure it's a list

Step 2: Update OllamaChatCompletionResponseIterator

Modify the flags and parsing logic to handle potential variations in thinking field streaming:

class OllamaChatCompletionResponseIterator:
    def __init__(self, response):
        # ...
        self.thinking_content = []  # Initialize an empty list
        self.in_thinking = False

    def parse_chunk(self, chunk):
        # ...
        if "thinking" in chunk:
            self.in_thinking = True
            self.thinking_content.append(chunk["thinking"])
        # ...
        if self.in_thinking and "end_thinking" in chunk:
            self.in_thinking = False
            # Process the accumulated thinking content
            response_json_message["reasoning_content"] = "\n".join(self.thinking_content)
            self.thinking_content = []

Step 3: Test with minimax-m2.7:cloud

Send multiple requests with the updated litellm code to verify that the issue is resolved.

Verification

To verify the fix, run the following test:

import litellm

# Configure LiteLLM with the Ollama provider
litellm.api_base = "https://ollama.com/api"
litellm.api_key = "your-ollama-cloud-api-key"

# Send multiple requests with minimax-m2.7:cloud
for _ in range(5):
    response = litellm.chat_completion("Hello, how are you?", model="minimax-m2.7:cloud")
    print(response)

If all requests succeed without errors, the fix is successful.

Extra Tips

  • Monitor the thinking field format in the streaming response from minimax-m2.7:cloud to ensure it aligns with the updated parsing logic.
  • Consider adding additional logging or debugging statements to help identify any future issues with the thinking field handling.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING