litellm: [Feature Request] Add LMDeploy Provider Support for Qwen Models

Feature Request: Add LMDeploy Provider Support

Problem Statement

LMDeploy is an LLM inference toolkit developed by OpenMMLab and widely used in China. It provides efficient, optimized deployment for models such as the Qwen series. However, LiteLLM currently has no native support for LMDeploy's API format, particularly for tool calling.

Current Workaround

We currently use a custom proxy layer to convert between Anthropic Messages API / OpenAI Responses API and LMDeploy's format. While this works, native LiteLLM support would benefit the broader community.

LMDeploy Tool Calling Format

LMDeploy returns tool calls in a format that differs from the standard OpenAI format:

Standard OpenAI Format:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "",
      "tool_calls": [{
        "id": "call_123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\": \"Boston\"}"
        }
      }]
    }
  }]
}

LMDeploy Format:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "<tools>\n{\"name\": \"get_weather\", \"arguments\": {\"location\": \"Boston\"}}\n</tools>"
    }
  }]
}

Key differences:

Tool calls are embedded in the content field instead of a separate tool_calls array
Uses <tools> XML tags to wrap the tool call JSON
Requires parsing and transformation to standard OpenAI format
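
The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not LiteLLM's or LMDeploy's actual code: the function name convert_message and the uuid-based ID scheme are assumptions made for the example, and it expects each <tools> block to hold a single JSON object with "name" and "arguments" keys, as in the sample response.

```python
import json
import re
import uuid

# re.DOTALL lets "." match the newlines inside the <tools> block.
TOOLS_RE = re.compile(r"<tools>(.*?)</tools>", re.DOTALL)


def convert_message(message: dict) -> dict:
    """Return a copy of `message` with <tools> blocks moved into tool_calls."""
    content = message.get("content") or ""
    tool_calls = []
    for block in TOOLS_RE.findall(content):
        try:
            call = json.loads(block)
            name = call["name"]
        except (json.JSONDecodeError, KeyError, TypeError):
            # Malformed block: return the message unchanged rather than guess.
            return dict(message)
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:24]}",  # hypothetical ID scheme
            "type": "function",
            "function": {
                "name": name,
                # The OpenAI format carries arguments as a JSON-encoded string.
                "arguments": json.dumps(call.get("arguments", {})),
            },
        })
    if not tool_calls:
        return dict(message)
    converted = dict(message)
    converted["content"] = TOOLS_RE.sub("", content).strip()
    converted["tool_calls"] = tool_calls
    return converted


lmdeploy_message = {
    "role": "assistant",
    "content": '<tools>\n{"name": "get_weather", "arguments": {"location": "Boston"}}\n</tools>',
}
openai_message = convert_message(lmdeploy_message)
print(json.dumps(openai_message, indent=2))
```

Messages without a <tools> block, or with a block that is not valid JSON, pass through unchanged, so ordinary text replies are not affected.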

Proposed Solution

Add a new lmdeploy provider in LiteLLM that:

Request Handling: Forward requests to LMDeploy API server (OpenAI-compatible /v1/chat/completions endpoint)
Response Parsing: Detect and parse <tools> tags in response content
Format Conversion: Transform to standard OpenAI tool_calls array format
Parser Support: Support LMDeploy's tool_call_parser parameter (e.g., qwen2d5, qwen3coder)
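
For the request-handling step, a rough sketch of what forwarding could look like is shown here using only the Python standard library. The helper name build_chat_request, the server address http://localhost:23333, and the tool description are assumptions for illustration; the payload follows the OpenAI chat completions shape that the endpoint is described as accepting. The request is built but not sent.

```python
import json
import urllib.request


def build_chat_request(api_base: str, model: str, messages: list, tools: list) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {"model": model, "messages": messages, "tools": tools}
    return urllib.request.Request(
        url=api_base.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Tool definition in the OpenAI "function" schema; the description is illustrative.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

request = build_chat_request(
    api_base="http://localhost:23333",  # assumed address of the LMDeploy API server
    model="Qwen2.5-32B-Instruct-AWQ",
    messages=[{"role": "user", "content": "What is the weather in Boston?"}],
    tools=[weather_tool],
)
print(request.get_method(), request.full_url)
```

Sending is left to the caller, for example with urllib.request.urlopen(request); the returned JSON would then go through the <tools> parsing and conversion steps.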

Tested Environment

We have successfully deployed and tested this use case with:

LMDeploy Version: v0.12.3
Models:
- Qwen2.5-32B-Instruct-AWQ (tool_call_parser: qwen2d5)
- Qwen2.5-Coder-32B-Instruct-AWQ (tool_call_parser: qwen3coder)
PyTorch: 2.10.0+cu128
Deployment: Docker containers with LMDeploy API server
Architecture: Tensor Parallelism (tp=2), session length 65536

Benefits

Community Value: Helps Chinese users and organizations using LMDeploy
Standardization: Provides a standard way to integrate LMDeploy with LiteLLM
Tool Calling: Enables proper tool calling support for Qwen models via LMDeploy
Reduced Complexity: Eliminates need for custom proxy layers
Legacy GPU Support: LMDeploy performs well on older but widely deployed GPUs such as the NVIDIA Tesla V100 16GB SXM2

Performance Benefits on Legacy GPUs

LMDeploy demonstrates significant performance advantages on older but widely-deployed GPUs like NVIDIA Tesla V100 16GB SXM2, which are still prevalent in many data centers and organizations.

Benchmark Results (Tesla V100 16GB SXM2, 2x GPU, Tensor Parallel):

Metric	LMDeploy + AWQ	vLLM + GPTQ	Improvement
Model	Qwen2.5-32B-Instruct-AWQ	Qwen2.5-32B-Instruct-GPTQ-Int4	-
Throughput	159.48 tok/s	47.69 tok/s	+234%
Success Rate	100%	71.83%	+28.17 percentage points
Avg Response Time	0.864s	5.37s	-84% (6.2x faster)
Requests Processed	1,116	51	+2088%
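
For readers who want to check how the Improvement column follows from the raw numbers, the arithmetic can be reproduced as below. The helper percent_change is introduced only for this check; all input values are taken from the table.

```python
def percent_change(new: float, old: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100


# Values from the benchmark table (LMDeploy + AWQ vs vLLM + GPTQ).
throughput_gain = percent_change(159.48, 47.69)   # throughput, tok/s
throughput_ratio = 159.48 / 47.69
latency_change = percent_change(0.864, 5.37)      # avg response time, seconds
latency_speedup = 5.37 / 0.864
requests_gain = percent_change(1116, 51)          # requests processed
success_gap_pp = 100 - 71.83                      # difference in percentage points

print(f"Throughput: +{throughput_gain:.0f}% ({throughput_ratio:.2f}x)")
print(f"Avg response time: {latency_change:.0f}% ({latency_speedup:.1f}x faster)")
print(f"Requests processed: +{requests_gain:.0f}%")
print(f"Success rate gap: {success_gap_pp:.2f} percentage points")
```

The success-rate row is a difference between two percentages, so it is a gap in percentage points rather than a relative change.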

Key Findings:

✅ AWQ Compatibility: LMDeploy supports AWQ quantization on V100 (compute capability 7.0), while vLLM's AWQ support requires compute capability ≥7.5
✅ Stability: 100% success rate vs 71.83% with vLLM GPTQ (which has known kernel bugs)
✅ Performance: 3.34x higher throughput and 6.2x faster response time
✅ Production Ready: Successfully deployed in production with Qwen2.5-32B and Qwen2.5-Coder-32B models

This makes LMDeploy an excellent choice for organizations with existing V100 infrastructure who want to deploy modern Qwen models efficiently.

Implementation Notes

The implementation would be similar to existing providers like Ollama, which also handle non-standard response formats. Key components:

Add litellm/llms/lmdeploy.py provider module
Implement response parser for <tools> format
Add configuration support in config.yaml
Include comprehensive tests with LMDeploy v0.12.3

References

LMDeploy GitHub
LMDeploy Tool Calling Documentation
Related issues: #18922 (Ollama qwen3 tool_calls), #19742 (Tool calling format issues)

Willingness to Contribute

We are willing to contribute a PR for this feature if the maintainers are interested. We have working code for the format conversion logic and can provide comprehensive tests based on our production deployment.

Submitted by Zamba Lee @ArgoStack, THINKTOP.

Co-created with Claude Code (Claude Sonnet 4.6)

extent analysis

TL;DR

To add LMDeploy provider support to LiteLLM, implement a new provider module that handles request forwarding, response parsing, and format conversion for LMDeploy's non-standard tool calling format.

Guidance

Create a new provider module: Add an lmdeploy.py provider module in litellm/llms/ to handle LMDeploy-specific logic.
Implement response parser: Develop a parser to detect and extract tool calls from LMDeploy's <tools> format in response content.
Configure LMDeploy support: Update config.yaml to include configuration options for the LMDeploy provider.
Write comprehensive tests: Include tests to verify the correctness of the LMDeploy provider, using LMDeploy v0.12.3 and Qwen models.

Example

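# Example parser that extracts tool calls from an LMDeploy assistant message
import json
import re

# re.DOTALL lets "." match the newlines LMDeploy puts inside <tools>...</tools>
TOOLS_PATTERN = re.compile(r'<tools>(.*?)</tools>', re.DOTALL)

def parse_lmdeploy_response(message):
    # `message` is choices[0]["message"]; its content may be missing or None
    content = message.get('content') or ''
    # Extract the raw JSON text from each <tools> block
    raw_calls = TOOLS_PATTERN.findall(content)
    # Parse each tool call into a dict with "name" and "arguments"
    return [json.loads(raw) for raw in raw_calls]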

Notes

The implementation should be similar to existing providers like Ollama, and the LMDeploy GitHub documentation and tool calling documentation can be referenced for more information.

Recommendation

Apply the proposed solution by implementing the new LMDeploy provider module, as it provides a standard way to integrate LMDeploy with LiteLLM and enables proper tool calling support for Qwen models.
