litellm: [Feature Request] Add LMDeploy Provider Support for Qwen Models

GitHub stats
BerriAI/litellm#26173 · Fetched 2026-04-22 07:46:00
Comments: 0 · Participants: 1 · Timeline: 3 · Reactions: 0
Timeline (top): labeled ×2, cross-referenced ×1


Feature Request: Add LMDeploy Provider Support

Problem Statement

LMDeploy is an LLM inference toolkit developed by OpenMMLab and widely used in China. It provides efficient, performance-optimized deployment for models such as the Qwen series. However, LiteLLM currently has no native support for LMDeploy's API format, particularly for tool calling.

Current Workaround

We currently use a custom proxy layer to convert between Anthropic Messages API / OpenAI Responses API and LMDeploy's format. While this works, native LiteLLM support would benefit the broader community.

LMDeploy Tool Calling Format

LMDeploy returns tool calls in a format that differs from OpenAI's:

Standard OpenAI Format:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "",
      "tool_calls": [{
        "id": "call_123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\": \"Boston\"}"
        }
      }]
    }
  }]
}

LMDeploy Format:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "<tools>\n{\"name\": \"get_weather\", \"arguments\": {\"location\": \"Boston\"}}\n</tools>"
    }
  }]
}

Key differences:

  • Tool calls are embedded in the content field instead of a separate tool_calls array
  • Uses <tools> XML tags to wrap the tool call JSON
  • Requires parsing and transformation to standard OpenAI format
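
The transformation described above can be sketched as follows. This is a minimal illustration, not LiteLLM or LMDeploy code: the function name `to_openai_message` is hypothetical, the generated `id` values are an assumption (the LMDeploy payload shown above carries no call ID), and the sketch assumes each `<tools>` block wraps a single JSON object with `name` and `arguments` keys, as in the example.

```python
import json
import re
import uuid

# One <tools>...</tools> block; DOTALL because the JSON is wrapped in newlines.
TOOLS_BLOCK = re.compile(r"<tools>(.*?)</tools>", re.DOTALL)


def to_openai_message(message):
    """Return a copy of `message` with <tools> blocks moved into `tool_calls`."""
    content = message.get("content") or ""
    tool_calls = []

    def _extract(match):
        try:
            call = json.loads(match.group(1))
            name, arguments = call["name"], call.get("arguments", {})
        except (json.JSONDecodeError, KeyError, TypeError):
            # Malformed block: leave it in the text instead of guessing.
            return match.group(0)
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:24]}",  # assumed: IDs are generated client-side
            "type": "function",
            # OpenAI's format carries arguments as a JSON-encoded string.
            "function": {"name": name, "arguments": json.dumps(arguments)},
        })
        return ""  # drop the parsed block from the visible content

    remaining = TOOLS_BLOCK.sub(_extract, content).strip()
    if not tool_calls:
        return dict(message)
    return {**message, "content": remaining, "tool_calls": tool_calls}


lmdeploy_message = {
    "role": "assistant",
    "content": '<tools>\n{"name": "get_weather", "arguments": {"location": "Boston"}}\n</tools>',
}
print(json.dumps(to_openai_message(lmdeploy_message), indent=2))
```

Messages without a `<tools>` block pass through unchanged, so the same function can be applied to every assistant message.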

Proposed Solution

Add a new lmdeploy provider in LiteLLM that:

  1. Request Handling: Forward requests to LMDeploy API server (OpenAI-compatible /v1/chat/completions endpoint)
  2. Response Parsing: Detect and parse <tools> tags in response content
  3. Format Conversion: Transform to standard OpenAI tool_calls array format
  4. Parser Support: Support LMDeploy's tool_call_parser parameter (e.g., qwen2d5, qwen3coder)
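
Because the endpoint is OpenAI-compatible, step 1 amounts to posting a standard chat-completions payload. A rough sketch using only the standard library is shown here; the helper name `build_chat_request`, the base URL `http://localhost:23333`, and the model name are assumptions for illustration, and the `tool_call_parser` setting is treated as server-side configuration that the request does not carry.

```python
import json
import urllib.request


def build_chat_request(api_base, model, messages, tools=None):
    """Build (but do not send) a POST to the /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": messages}
    if tools:
        payload["tools"] = tools  # standard OpenAI tool schema, forwarded as-is
    return urllib.request.Request(
        url=api_base.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

request = build_chat_request(
    "http://localhost:23333",        # assumed address of the LMDeploy API server
    "Qwen2.5-32B-Instruct-AWQ",      # assumed served model name
    [{"role": "user", "content": "What is the weather in Boston?"}],
    tools=[weather_tool],
)
print(request.full_url)
# To actually send it against a running server:
# with urllib.request.urlopen(request) as resp:
#     body = json.load(resp)
```

Steps 2 and 3 would then run on the decoded response body before it is returned to the caller.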

Tested Environment

We have successfully deployed and tested this use case with:

  • LMDeploy Version: v0.12.3
  • Models:
    • Qwen2.5-32B-Instruct-AWQ (tool_call_parser: qwen2d5)
    • Qwen2.5-Coder-32B-Instruct-AWQ (tool_call_parser: qwen3coder)
  • PyTorch: 2.10.0+cu128
  • Deployment: Docker containers with LMDeploy API server
  • Architecture: Tensor Parallelism (tp=2), session length 65536

Benefits

  • Community Value: Helps Chinese users and organizations using LMDeploy
  • Standardization: Provides a standard way to integrate LMDeploy with LiteLLM
  • Tool Calling: Enables proper tool calling support for Qwen models via LMDeploy
  • Reduced Complexity: Eliminates need for custom proxy layers
  • Legacy GPU Support: LMDeploy provides significant performance improvements on older but widely-deployed GPUs like NVIDIA Tesla V100 16GB SXM2

Performance Benefits on Legacy GPUs

LMDeploy demonstrates significant performance advantages on older but widely-deployed GPUs like NVIDIA Tesla V100 16GB SXM2, which are still prevalent in many data centers and organizations.

Benchmark Results (Tesla V100 16GB SXM2, 2x GPU, Tensor Parallel):

| Metric | LMDeploy + AWQ | vLLM + GPTQ | Improvement |
| --- | --- | --- | --- |
| Model | Qwen2.5-32B-Instruct-AWQ | Qwen2.5-32B-Instruct-GPTQ-Int4 | - |
| Throughput | 159.48 tok/s | 47.69 tok/s | +234% |
| Success Rate | 100% | 71.83% | +28.17 percentage points |
| Avg Response Time | 0.864s | 5.37s | -84% (6.2x faster) |
| Requests Processed | 1,116 | 51 | +2088% |
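
The Improvement column can be re-derived from the two measured columns. The short check below uses only the figures reported in the table; the helper `pct_change` is introduced here for illustration.

```python
def pct_change(new, old):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100


throughput_gain = pct_change(159.48, 47.69)   # LMDeploy vs vLLM, tok/s
latency_change = pct_change(0.864, 5.37)      # average response time, seconds
requests_gain = pct_change(1116, 51)          # requests processed

throughput_ratio = 159.48 / 47.69             # "3.34x" in Key Findings
latency_speedup = 5.37 / 0.864                # "6.2x faster"
success_delta_pp = 100 - 71.83                # difference in percentage points

print(round(throughput_gain), round(latency_change), round(requests_gain))
print(round(throughput_ratio, 2), round(latency_speedup, 1), round(success_delta_pp, 2))
```

Note that the success-rate figure is a difference in percentage points, not a relative change: the relative improvement over 71.83% would be larger.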

Key Findings:

  • AWQ Compatibility: LMDeploy supports AWQ quantization on V100 (compute capability 7.0), while vLLM requires ≥7.5
  • Stability: 100% success rate vs 71.83% with vLLM GPTQ (which has known kernel bugs)
  • Performance: 3.34x the throughput (+234%) and a 6.2x faster average response time
  • Production Ready: Successfully deployed in production with Qwen2.5-32B and Qwen2.5-Coder-32B models

This makes LMDeploy an excellent choice for organizations with existing V100 infrastructure who want to deploy modern Qwen models efficiently.

Implementation Notes

The implementation would be similar to existing providers like Ollama, which also handle non-standard response formats. Key components:

  1. Add litellm/llms/lmdeploy.py provider module
  2. Implement response parser for <tools> format
  3. Add configuration support in config.yaml
  4. Include comprehensive tests with LMDeploy v0.12.3

Willingness to Contribute

We are willing to contribute a PR for this feature if the maintainers are interested. We have working code for the format conversion logic and can provide comprehensive tests based on our production deployment.

Submitted by Zamba Lee @ArgoStack, THINKTOP.


Co-created with Claude Code (Claude Sonnet 4.6)

extent analysis

TL;DR

To add LMDeploy provider support to LiteLLM, implement a new provider module that handles request forwarding, response parsing, and format conversion for LMDeploy's non-standard tool calling format.

Guidance

  1. Create a new provider module: Add an lmdeploy.py provider module in litellm/llms/ to handle LMDeploy-specific logic.
  2. Implement response parser: Develop a parser to detect and extract tool calls from LMDeploy's <tools> format in response content.
  3. Configure LMDeploy support: Update config.yaml to include configuration options for the LMDeploy provider.
  4. Write comprehensive tests: Include tests to verify the correctness of the LMDeploy provider, using LMDeploy v0.12.3 and Qwen models.

Example

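# Example parser that extracts tool calls from an LMDeploy response message
import json
import re

# DOTALL lets "." match the newlines that surround the JSON inside <tools>
TOOLS_PATTERN = re.compile(r"<tools>(.*?)</tools>", re.DOTALL)

def parse_lmdeploy_response(message):
    # `message` is the choices[0]["message"] dict; content may be missing or None
    content = message.get("content") or ""
    # Parse each <tools> payload as JSON: {"name": ..., "arguments": {...}}
    return [json.loads(block) for block in TOOLS_PATTERN.findall(content)]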

Notes

The implementation should be similar to existing providers like Ollama, and the LMDeploy GitHub documentation and tool calling documentation can be referenced for more information.

Recommendation

Apply the proposed solution by implementing the new LMDeploy provider module, as it provides a standard way to integrate LMDeploy with LiteLLM and enables proper tool calling support for Qwen models.
