litellm - ✅(Solved) Fix [PRD] Claude Code Compatibility Matrix [1 pull requests, 1 participants]

mateo-berri · 2026-04-25T03:19:21Z



GitHub stats

BerriAI/litellm#26476•Fetched 2026-04-26 05:06:54

Author

mateo-berri

Participants

mateo-berri

Timeline (top)

cross-referenced ×9, labeled ×3, subscribed ×1, unlabeled ×1

PR fix notes

PR #26491: [WIP] feat(tests): Claude Code Compatibility Matrix v0 (PRD #26476)

Repository: BerriAI/litellm
Author: mateo-berri
State: open | merged: False
Link: https://github.com/BerriAI/litellm/pull/26491

Description (problem / solution / changelog)

Relevant issues

Implements the v0 of the Claude Code Compatibility Matrix.

Parent PRD: #26476
Slice 1 (tracer bullet): #26477
Slice 2 (4 provider columns for basic_messaging_non_streaming): #26478
Slice 3 (PR gate in CircleCI): #26479
Slice 4 (daily cron VM publishes matrix to docs): #26480
Slice 5 (full v0 row set: 6 features × 5 providers): #26481

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

[x] I have added testing in the tests/test_litellm/ directory. Adding at least 1 test is a hard requirement (see details).
- Note: tests for this feature live under tests/claude_code/_driver_unit_tests/, tests/claude_code/_builder_unit_tests/, tests/claude_code/_publisher_unit_tests/, and tests/claude_code/_pr_gate_unit_tests/. These are deep-module unit tests for the new helpers (Claude Code CLI Driver, Matrix JSON Builder, Publisher, PR-Gate Version Resolver) per the PRD's "Testing Decisions" section. They follow the same mocked-subprocess / golden-file patterns established in tests/test_litellm/.
[x] My PR passes all unit tests on make test-unit
[x] My PR's scope is as isolated as possible; it only solves 1 specific problem
[ ] I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

50-55 passing tests: main is stable with minor issues.

45-49 passing tests: acceptable but needs attention

<= 40 passing tests: unstable; be careful with your merges and assess the risk.

[ ] Branch creation CI run Link:
[ ] CI run for the last commit Link:
[ ] Merge / cherry-pick CI run Links:

Screenshots / Proof of Fix

This PR ships the end-to-end v0 of the Claude Code Compatibility Matrix as defined in PRD #26476. Verification of the pipeline:

1. Test scaffolding (slices 1, 2, 5). New layout under tests/claude_code/<feature>/test_<provider>.py. The compat_result pytest fixture captures tagged-union outcomes (pass / fail / not_applicable / not_tested); a conftest.py hook merges per-test results into a structured compat-results.json artifact. Six features × five providers = 30 cells, each exercised against three Claude tiers (Haiku 4.5 / Sonnet 4.6 / Opus 4.7), all-must-pass aggregation per cell.

2. Claude Code CLI Driver + Matrix JSON Builder. Two deep helper modules (tests/claude_code/cli_driver.py, tests/claude_code/matrix_builder.py) wrapping subprocess + parsing and pure-function JSON construction respectively. Unit-tested against mocked subprocess (driver) and golden fixtures (builder) — see _driver_unit_tests/ and _builder_unit_tests/.

3. PR Gate (slice 3). New CircleCI job claude_code_compat_pr_gate boots the proxy from the PR's code, installs the claude CLI at the version returned by the new PR-gate version resolver (newest published >= 3 days ago, queried at run-time from the npm registry), and runs the full tests/claude_code/ suite. Red status blocks merge.

4. Daily Cron Publisher (slice 4). New GitHub Actions workflow .github/workflows/claude_code_compat_matrix.yml runs on three triggers (daily cron, release.published filtered to v*-stable, workflow_dispatch). Resolves the latest stable LiteLLM release via the GitHub Releases API, pulls the corresponding ghcr.io image, installs the latest Claude Code CLI, runs the test suite, builds the matrix JSON, and direct-pushes it to BerriAI/litellm-docs. Cross-repo authentication uses a GitHub App scoped to contents: write on the docs repo only; the select_files_to_commit allowlist enforces "only compatibility-matrix.json is ever pushed" since GitHub Apps cannot scope tokens to a single file path.

5. Sample matrix output. tests/claude_code/sample_compatibility-matrix.json shows the expected v1 schema shape that the docs site's <CompatibilityMatrix /> React component will consume.

Secret scan. Verified no committed secrets:

All real credentials are loaded via os.environ.get(...) or ${{ secrets.* }}.
Test fixtures use obvious placeholders (sk-test, sk-abc, "k", ghs_xxx).
sk-1234 and sk-cron-matrix are dev master keys used only inside ephemeral test/cron containers (consistent with existing CI conventions in .circleci/config.yml).
pathrise-convert-1606954137718 is the standard GCP test project ID already used throughout the LiteLLM test suite (a project ID is not a credential).
.gitignore excludes the CI-output files (compat-results.json, compatibility-matrix.json).
Workflow uses SHA-pinned actions, permissions: contents: read, and persist-credentials: false on checkout.

Type

🆕 New Feature 🚄 Infrastructure ✅ Test

Changes

tests/claude_code/manifest.yaml — single source of truth for the matrix's row order and provider column order.
tests/claude_code/<feature>/test_<provider>.py — 30 per-(feature, provider) test files, one feature directory each for basic_messaging_non_streaming, basic_messaging_streaming, tool_use, prompt_caching_5m, vision, extended_thinking.
tests/claude_code/conftest.py — compat_result fixture and pytest_runtest_logreport hook that emits the structured compat-results.json artifact.
tests/claude_code/cli_driver.py — Claude Code CLI Driver (deep module wrapping subprocess + stream-JSON parsing).
tests/claude_code/matrix_builder.py — pure-function builder that turns the per-test results artifact into the published compatibility-matrix.json per the v1 schema.
tests/claude_code/resolver.py — Latest Stable LiteLLM Resolver (queries the GitHub Releases API for newest v*-stable).
tests/claude_code/pr_gate_version_resolver.py — Claude Code PR-Gate Version Resolver (queries npm for newest version published >= 3 days ago).
tests/claude_code/publisher.py — daily-cron publisher orchestrator: resolves versions, runs the test suite, builds JSON, direct-pushes to the docs repo. Includes the select_files_to_commit allowlist enforcement.
tests/claude_code/test_config.yaml — proxy routing config for the PR gate, mapping aliases to upstream models per provider.
tests/claude_code/_*_unit_tests/ — unit tests for the four deep modules.
.github/workflows/claude_code_compat_matrix.yml — daily cron workflow.
.circleci/config.yml — new claude_code_compat_pr_gate job wired into the existing main-branches workflow.
.gitignore — exclude CI-output files (compat-results.json, compatibility-matrix.json).

Out of scope for this PR (per PRD's "Deferred to v1+"): the docs-side React <CompatibilityMatrix /> component, MDX page at docs/tutorials/claude-code-compatibility, Slack regression alerts, operational guardrails (deadman alerts, staleness banner), additional features beyond the v0 row set, PR-comment diff commenter, click-to-modal cell deep-dive, and a written ADR artifact.

Changed files

.circleci/config.yml (modified, +96/-0)
.github/workflows/claude_code_compat_matrix.yml (added, +127/-0)
.gitignore (modified, +6/-1)
tests/claude_code/__init__.py (added, +0/-0)
tests/claude_code/_builder_unit_tests/__init__.py (added, +0/-0)
tests/claude_code/_builder_unit_tests/fixtures/expected_matrix.json (added, +38/-0)
tests/claude_code/_builder_unit_tests/fixtures/manifest.yaml (added, +9/-0)
tests/claude_code/_builder_unit_tests/fixtures/results.json (added, +41/-0)
tests/claude_code/_builder_unit_tests/test_matrix_builder.py (added, +282/-0)
tests/claude_code/_builder_unit_tests/test_v0_layout.py (added, +116/-0)
tests/claude_code/_driver_unit_tests/__init__.py (added, +0/-0)
tests/claude_code/_driver_unit_tests/test_cli_driver.py (added, +339/-0)
tests/claude_code/_driver_unit_tests/test_compat_result.py (added, +70/-0)
tests/claude_code/_pr_gate_unit_tests/__init__.py (added, +0/-0)
tests/claude_code/_pr_gate_unit_tests/test_circleci_pr_gate_wiring.py (added, +131/-0)
tests/claude_code/_pr_gate_unit_tests/test_pr_gate_version_resolver.py (added, +139/-0)
tests/claude_code/_publisher_unit_tests/__init__.py (added, +0/-0)
tests/claude_code/_publisher_unit_tests/test_publisher.py (added, +99/-0)
tests/claude_code/_publisher_unit_tests/test_resolver.py (added, +112/-0)
tests/claude_code/basic_messaging_non_streaming/__init__.py (added, +0/-0)
tests/claude_code/basic_messaging_non_streaming/test_anthropic.py (added, +103/-0)
tests/claude_code/basic_messaging_non_streaming/test_azure.py (added, +107/-0)
tests/claude_code/basic_messaging_non_streaming/test_bedrock_converse.py (added, +96/-0)
tests/claude_code/basic_messaging_non_streaming/test_bedrock_invoke.py (added, +96/-0)
tests/claude_code/basic_messaging_non_streaming/test_vertex_ai.py (added, +96/-0)
tests/claude_code/basic_messaging_streaming/__init__.py (added, +0/-0)
tests/claude_code/basic_messaging_streaming/test_anthropic.py (added, +106/-0)
tests/claude_code/basic_messaging_streaming/test_azure.py (added, +103/-0)
tests/claude_code/basic_messaging_streaming/test_bedrock_converse.py (added, +99/-0)
tests/claude_code/basic_messaging_streaming/test_bedrock_invoke.py (added, +99/-0)
tests/claude_code/basic_messaging_streaming/test_vertex_ai.py (added, +99/-0)
tests/claude_code/cli_driver.py (added, +261/-0)
tests/claude_code/conftest.py (added, +164/-0)
tests/claude_code/extended_thinking/__init__.py (added, +0/-0)
tests/claude_code/extended_thinking/test_anthropic.py (added, +116/-0)
tests/claude_code/extended_thinking/test_azure.py (added, +118/-0)
tests/claude_code/extended_thinking/test_bedrock_converse.py (added, +110/-0)
tests/claude_code/extended_thinking/test_bedrock_invoke.py (added, +110/-0)
tests/claude_code/extended_thinking/test_vertex_ai.py (added, +110/-0)
tests/claude_code/manifest.yaml (added, +36/-0)
tests/claude_code/matrix_builder.py (added, +179/-0)
tests/claude_code/pr_gate_version_resolver.py (added, +148/-0)
tests/claude_code/prompt_caching_5m/__init__.py (added, +0/-0)
tests/claude_code/prompt_caching_5m/test_anthropic.py (added, +113/-0)
tests/claude_code/prompt_caching_5m/test_azure.py (added, +111/-0)
tests/claude_code/prompt_caching_5m/test_bedrock_converse.py (added, +104/-0)
tests/claude_code/prompt_caching_5m/test_bedrock_invoke.py (added, +104/-0)
tests/claude_code/prompt_caching_5m/test_vertex_ai.py (added, +104/-0)
tests/claude_code/publisher.py (added, +388/-0)
tests/claude_code/resolver.py (added, +91/-0)
tests/claude_code/sample_compatibility-matrix.json (added, +141/-0)
tests/claude_code/test_config.yaml (added, +103/-0)
tests/claude_code/tool_use/__init__.py (added, +0/-0)
tests/claude_code/tool_use/test_anthropic.py (added, +114/-0)
tests/claude_code/tool_use/test_azure.py (added, +113/-0)
tests/claude_code/tool_use/test_bedrock_converse.py (added, +109/-0)
tests/claude_code/tool_use/test_bedrock_invoke.py (added, +109/-0)
tests/claude_code/tool_use/test_vertex_ai.py (added, +109/-0)
tests/claude_code/vision/__init__.py (added, +0/-0)
tests/claude_code/vision/test_anthropic.py (added, +99/-0)
tests/claude_code/vision/test_azure.py (added, +100/-0)
tests/claude_code/vision/test_bedrock_converse.py (added, +95/-0)
tests/claude_code/vision/test_bedrock_invoke.py (added, +95/-0)
tests/claude_code/vision/test_vertex_ai.py (added, +95/-0)

PRD #26476: issue body

Problem Statement

Claude Code is the number-one use case for many LiteLLM customers. Before adopting or continuing to invest in LiteLLM, these customers need to answer a simple question: "Which Claude Code features actually work through LiteLLM, and on which providers?"

Today there is no single source of truth. Support information is scattered across release notes, tutorials, and GitHub issues. Customers can't tell at a glance whether, say, prompt caching on Bedrock or tool use on Vertex AI currently works. When Claude Code or LiteLLM ships a new version, support can shift — but there's no published artifact that reflects the current state. This uncertainty erodes trust and slows adoption.

Solution

Publish a Claude Code Compatibility Matrix on the LiteLLM docs site that:

Lists a curated set of Claude Code features as rows.
Lists the providers LiteLLM supports as columns (Anthropic, Bedrock Invoke, Bedrock Converse, Vertex AI, Azure).
Shows a status per cell: passing, failing, not applicable, or not yet tested.
Displays the LiteLLM version and Claude Code version the matrix was tested against, plus a last-updated timestamp.
Auto-updates whenever a new LiteLLM stable release is cut, whenever Claude Code ships a new version (detected via a daily cron), and on manual trigger.
Is backed by real end-to-end tests that drive the actual claude CLI against a running LiteLLM proxy — so the matrix reflects reality, not opinion.

Customers visit the matrix page to make informed decisions; LiteLLM's CI uses the same tests as a pre-merge gate to prevent regressions.

User Stories

As a prospective LiteLLM customer evaluating Claude Code as our primary use case, I want to see a compatibility matrix on the docs site, so that I can decide whether LiteLLM meets our needs before committing.
As an existing LiteLLM customer using Claude Code on Bedrock, I want to know which Claude Code features are supported on my provider, so that I don't waste time testing features that don't work.
As a customer considering switching providers (e.g., from Anthropic to Bedrock), I want to compare feature support across providers, so that I can understand what I'll gain or lose by switching.
As a customer, I want to see when the matrix was last updated, so that I can trust the information is current.
As a customer, I want to see which LiteLLM version and Claude Code version the matrix was tested against, so that I know the support statement applies to the versions I plan to use.
As a customer, I want to hover over a failing cell and see why it's failing, so that I can understand whether it's a blocker for me or a minor edge case.
As a customer, I want to hover over a "not applicable" cell and understand why a feature doesn't apply to a given provider, so that I don't confuse "can't exist here" with "broken."
As a customer, I want to hover over an "untested" cell and know that LiteLLM is honest about coverage gaps, so that I can trust the cells that are marked passing.
As a customer browsing on mobile, I want to scroll the matrix horizontally, so that I can still see all the data even on a small screen.
As a LiteLLM engineer, I want PRs into staging to run the full compatibility test suite, so that regressions are caught before merge.
As a LiteLLM engineer, I want PR CI to use a Claude Code version that's at least 3 days old, so that a malicious or broken Claude Code release can't compromise our CI environment.
As a LiteLLM engineer, I want the PR CI pipeline to fail when any cell flips from passing to failing, so that I know immediately when a change I'm reviewing breaks a customer-visible feature.
As a LiteLLM engineer, I want a daily cron to re-run the tests against the latest stable LiteLLM + latest Claude Code, so that the matrix reflects the real-world state customers would encounter today.
As a LiteLLM engineer, I want the daily cron to run on an isolated VM outside our main CI, so that running always-latest Claude Code can't affect trusted build infrastructure.
As a LiteLLM engineer, I want the daily cron to publish results by committing a single JSON file to the docs repo, so that the docs site always has fresh data with no manual intervention.
As a LiteLLM engineer, I want the automation's GitHub App token scoped to contents: write on the docs repo only, so that a compromise has minimal blast radius.
As a Claude Code feature tester writing a new compatibility test, I want a single directory per feature with a per-provider test file, so that adding a new feature is a clear and self-contained change.
As a test author, I want a simple pytest fixture (compat_result.set(...)) to report my test's outcome, so that I don't need to understand the matrix machinery to contribute a test.
As a test author, I want the (feature, provider) pair to be inferred from the file path, so that I don't have to add explicit metadata and risk drift.
As a test author, I want to be able to mark a (feature, provider) combination as "not applicable" with a human-readable reason right in the test file, so that the tooltip on the matrix tells the truth in my voice.
As a test author, I want to drive the real claude CLI in headless mode from my tests, so that my test exercises the same wire shape our customers actually use.
As a test author, I want each provider cell to exercise the feature against Claude Haiku 4.5, Sonnet 4.6, and Opus 4.7, so that the cell only reports green when all three models work.
As a test author, I want the failure message to identify which model broke when a multi-model cell fails, so that I can debug which model is the outlier.
As a docs maintainer, I want the matrix to be rendered by a React component inside an MDX page, so that the narrative content (intro, legend, FAQs) and the live data can coexist.
As a docs maintainer, I want cells to be color-coded (green/red/gray/yellow) with icons, so that the shape of cross-provider support is scannable at a glance.
As a docs maintainer, I want the React component to import the JSON as a static asset at build time, so that the docs build is self-contained and doesn't depend on network calls at render time.
As a docs maintainer, I want a schema_version field in the JSON, so that I can evolve the schema without silently breaking the renderer.
As a docs visitor, I want "last updated" shown as a relative time when recent (e.g., "2 hours ago") and an absolute date when older, so that freshness is immediately obvious.
As a docs visitor hovering over the last-updated banner, I want to see the absolute timestamp in my local timezone (explicitly labeled), so that I can precisely determine staleness.
As a new Claude Code feature tracker, I want to add a new feature to the matrix by updating a single manifest file + adding a directory with per-provider test files, so that onboarding a feature is a clear and reviewable change.
As a LiteLLM maintainer, I want to know that the matrix ships with a small, focused set of high-value features in v0, so that the project can launch without getting stuck trying to enumerate every feature.

Implementation Decisions

Scope

Features in scope: Claude Code features that flow through LiteLLM's wire (API-surface and gateway-specific features), plus "Basic messaging" as a baseline. Purely client-side features (slash commands, local file editing, TUI) are out of scope because they don't touch LiteLLM.
Features in v0: basic_messaging_non_streaming, basic_messaging_streaming, tool_use, prompt_caching_5m, vision, extended_thinking. Additional features (web search, MCP, skills, plugins, Max subscription, 1M context, 1-hour cache TTL, count_tokens, user_agent cost tracking, etc.) are deferred to v1+.
Provider columns: Anthropic, Bedrock (Invoke), Bedrock (Converse), Vertex AI, Azure. No "Other" provider column in v0.
Per-cell model coverage: Each cell is tested against Claude Haiku 4.5, Claude Sonnet 4.6, and Claude Opus 4.7. The cell is reported as pass only if all three models pass.

Status enum

Four states per cell:

pass — test ran and succeeded.
fail — test ran and failed; includes the error message (with the failing model named when relevant).
not_applicable — the feature does not apply to this provider; includes a reason.
not_tested — no test file exists for this (feature, provider) combination yet.

Test code organization (in the main LiteLLM repo)

A new top-level directory tests/claude_code/ holds the compatibility tests.
Each feature has its own subdirectory named after its feature_id (e.g., tests/claude_code/tool_use/).
Within each feature directory, a file per provider: test_anthropic.py, test_bedrock_invoke.py, test_bedrock_converse.py, test_vertex_ai.py, test_azure.py.
Duplication across per-provider files is accepted; shared test logic lives in per-feature helper modules.
A top-level tests/claude_code/manifest.yaml maps feature_id → display_name and defines the row order in the rendered matrix.
A pytest fixture named compat_result exposes a set(result_dict) method. Tests report their outcome as a tagged union ({"status": "pass"}, {"status": "fail", "error": "..."}, {"status": "not_applicable", "reason": "..."}).
A conftest.py hook intercepts test outcomes, parses feature_id and provider from the file path, and emits a structured compat-results.json artifact.
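
The path-based inference and the fixture's set(...) contract described above could be sketched roughly as follows. This is an illustration and not the PR's conftest.py: infer_cell and CompatResult are hypothetical names, and the sketch assumes every compatibility test lives at tests/claude_code/<feature>/test_<provider>.py and that helper directories start with an underscore.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Statuses a test may report itself; not_tested is filled in later by the builder.
ALLOWED_STATUSES = {"pass", "fail", "not_applicable"}


def infer_cell(test_path: str) -> Optional[Tuple[str, str]]:
    """Return (feature_id, provider) for tests/claude_code/<feature>/test_<provider>.py.

    Returns None for anything else, e.g. helper unit-test directories
    whose names start with "_" or files directly under tests/claude_code/.
    """
    parts = PurePosixPath(test_path.replace("\\", "/")).parts
    if "claude_code" not in parts:
        return None
    rest = parts[parts.index("claude_code") + 1:]
    if len(rest) != 2:
        return None
    feature_id, filename = rest
    if feature_id.startswith("_"):
        return None
    if not (filename.startswith("test_") and filename.endswith(".py")):
        return None
    provider = filename[len("test_"):-len(".py")]
    return feature_id, provider


class CompatResult:
    """Hypothetical stand-in for the object the compat_result fixture yields."""

    def __init__(self) -> None:
        self.value: Optional[dict] = None

    def set(self, result: dict) -> None:
        status = result.get("status")
        if status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        if status == "fail" and "error" not in result:
            raise ValueError("a fail result needs an 'error' message")
        if status == "not_applicable" and "reason" not in result:
            raise ValueError("a not_applicable result needs a 'reason'")
        self.value = dict(result)
```

Rejecting malformed results at set(...) time keeps a typo in one test file from surfacing later as a confusing builder error.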

Claude Code CLI Driver module

A Python helper module that encapsulates shelling out to the claude CLI in headless mode:

Single entry point accepting: model, prompt, and feature-specific options (streaming, tools, images, extra environment variables).
Parses claude stream-JSON output into a structured result (text, tool calls, stream events, cache hits, usage).
Handles subprocess management, timeouts, and retry logic on transient flakes.
Every compatibility test consumes only this module, never shells out directly.
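
A minimal sketch of the parsing and retry responsibilities listed above, under stated assumptions: the CLI's stream output is treated as newline-delimited JSON with one object per line, the event field names ("type", "text") are invented for illustration and are not the CLI's documented schema, and run_with_retry takes an arbitrary argv so that no particular claude flags are implied. None of these function names come from the PR's cli_driver.py.

```python
import json
import subprocess
import time
from typing import Dict, Iterable, List, Sequence


def parse_stream_json(lines: Iterable[str]) -> List[Dict]:
    """Parse newline-delimited JSON, skipping blank and non-JSON lines."""
    events: List[Dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray log lines mixed into stdout
        if isinstance(obj, dict):
            events.append(obj)
    return events


def collect_text(events: Iterable[Dict]) -> str:
    """Concatenate the 'text' field of events whose 'type' is 'text'.

    The field names are an assumption made for this sketch.
    """
    return "".join(e.get("text", "") for e in events if e.get("type") == "text")


def run_with_retry(argv: Sequence[str], timeout_s: float = 120.0,
                   attempts: int = 3, backoff_s: float = 1.0) -> List[Dict]:
    """Run a command, retrying on timeout or non-zero exit, and parse stdout."""
    last_error = "no attempts made"
    for attempt in range(attempts):
        try:
            proc = subprocess.run(list(argv), capture_output=True, text=True,
                                  timeout=timeout_s)
        except subprocess.TimeoutExpired:
            last_error = f"timed out after {timeout_s}s"
        else:
            if proc.returncode == 0:
                return parse_stream_json(proc.stdout.splitlines())
            last_error = f"exit code {proc.returncode}: {proc.stderr.strip()}"
        if attempt < attempts - 1:
            time.sleep(backoff_s * (attempt + 1))  # linear backoff between tries
    raise RuntimeError(f"command failed after {attempts} attempts: {last_error}")
```

Keeping the parser separate from the subprocess call is what makes the mocked-subprocess unit tests mentioned in the PR practical: the parser can be exercised on canned lines without launching anything.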

Matrix JSON Builder module

A pure-function module that consumes the pytest-produced compat-results.json, the manifest, and run metadata, and emits the final compatibility-matrix.json conforming to the published schema. Responsible for:

Ordering providers according to the declared column order.
Ordering features according to the manifest.
Filling in not_tested for any (feature, provider) combination that has neither a test nor a declared not_applicable reason.
Aggregating per-model test results into a single per-cell status (cell is pass iff all three models pass; otherwise fail with the breaking model identified).
Serializing to the locked JSON schema.
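
The aggregation and gap-filling rules above can be sketched as two pure functions. This is an illustration under assumptions, not the PR's matrix_builder.py: the model labels, the function names, and the input shapes are hypothetical, and treating a missing model result as a failure is a choice made for this sketch.

```python
from typing import Dict, List, Tuple

MODELS = ("haiku-4.5", "sonnet-4.6", "opus-4.7")  # illustrative model labels

Cell = Dict[str, str]


def aggregate_cell(per_model: Dict[str, Cell]) -> Cell:
    """Collapse per-model results into one cell status (all must pass)."""
    if not per_model:
        return {"status": "not_tested"}
    for result in per_model.values():
        if result["status"] == "not_applicable":
            return dict(result)  # carries the human-readable reason through
    failures = [
        f"{model}: {result.get('error', 'unknown error')}"
        for model, result in per_model.items()
        if result["status"] == "fail"
    ]
    # Assumption: a model with no recorded result counts against the cell.
    failures += [f"{m}: no result recorded" for m in MODELS if m not in per_model]
    if failures:
        return {"status": "fail", "error": "; ".join(failures)}
    return {"status": "pass"}


def build_features(manifest: List[Tuple[str, str]], providers: List[str],
                   results: Dict[Tuple[str, str], Dict[str, Cell]]) -> List[dict]:
    """Emit feature rows in manifest order with provider cells in column order."""
    features = []
    for feature_id, name in manifest:
        cells = {p: aggregate_cell(results.get((feature_id, p), {}))
                 for p in providers}
        features.append({"id": feature_id, "name": name, "providers": cells})
    return features
```

Because both functions depend only on their arguments, they can be checked against golden fixtures, which matches the testing approach the PR describes for the builder.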

JSON schema (v1)

The published compatibility-matrix.json has a locked schema:

schema_version: string, currently "1".
generated_at: ISO-8601 UTC timestamp.
litellm_version: full tag format (e.g., "v1.83.0-stable").
claude_code_version: version string (e.g., "2.1.120").
providers: ordered array of provider ids defining column order.
features: ordered array of feature objects. Each feature has id, name, and providers — a map of provider id to a tagged-union status object ({status: "pass"}, {status: "fail", error: "..."}, {status: "not_applicable", reason: "..."}, {status: "not_tested"}).
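
One way to check a document against the field list above is a small validator like the following. It is a sketch only: validate_matrix is a hypothetical helper, and the rules it enforces are read off the schema description in this PRD rather than taken from a published schema file.

```python
from datetime import datetime
from typing import List

REQUIRED_TOP_LEVEL = ("schema_version", "generated_at", "litellm_version",
                      "claude_code_version", "providers", "features")
# Extra field each status must carry, per the tagged-union description.
EXTRA_FIELD = {"pass": None, "fail": "error",
               "not_applicable": "reason", "not_tested": None}


def validate_matrix(doc: dict) -> List[str]:
    """Return a list of problems; an empty list means the document looks valid."""
    problems = [f"missing key: {k}" for k in REQUIRED_TOP_LEVEL if k not in doc]
    if problems:
        return problems
    if doc["schema_version"] != "1":
        problems.append(f"unsupported schema_version: {doc['schema_version']!r}")
    try:
        datetime.fromisoformat(doc["generated_at"].replace("Z", "+00:00"))
    except (ValueError, AttributeError):
        problems.append("generated_at is not an ISO-8601 timestamp")
    for feature in doc["features"]:
        fid = feature.get("id", "<no id>")
        cells = feature.get("providers", {})
        for provider in doc["providers"]:
            cell = cells.get(provider)
            if cell is None:
                problems.append(f"{fid}/{provider}: missing cell")
                continue
            status = cell.get("status")
            if status not in EXTRA_FIELD:
                problems.append(f"{fid}/{provider}: bad status {status!r}")
            elif EXTRA_FIELD[status] and EXTRA_FIELD[status] not in cell:
                problems.append(f"{fid}/{provider}: missing {EXTRA_FIELD[status]}")
    return problems
```

Returning a list of problems instead of raising on the first one makes a failed publish easier to diagnose from a single CI log.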

Two CI environments

PR Gate (inside CircleCI, in the main LiteLLM repo):

Triggers on every PR into the internal staging branch.
Uses a Claude Code version pinned to "latest minus 3 days" resolved at CI-run time from the npm registry (security review window).
Runs the complete compatibility test suite against the PR's code.
Pass/fail blocks merge. No custom PR comment in v0 — just the standard CI status check.

Daily Cron (on an isolated VM hosted in the main LiteLLM repo's infra):

Triggers: (a) on every published GitHub Release tagged *-stable, (b) on a daily cron, (c) on manual dispatch.
Resolves "latest stable LiteLLM" by calling the GitHub Releases API and filtering for v*-stable tags.
Pulls the corresponding LiteLLM Docker image and runs it as the proxy.
Installs the absolute latest Claude Code CLI.
Runs the full compatibility test suite.
Invokes the Matrix JSON Builder to produce compatibility-matrix.json.
Uses a GitHub App (installed on the docs repo only, scoped to contents: write) to direct-push the updated JSON to the docs repo's main branch.
Environment is isolated from the main CI so that running always-latest Claude Code cannot affect trusted build infrastructure.

Version resolvers

Two small helper modules with clean interfaces:

Latest Stable LiteLLM Resolver: queries the GitHub Releases API and returns the newest tag matching the v*-stable pattern.
Claude Code PR-Gate Version Resolver: queries the npm registry for @anthropic-ai/claude-code, returns the newest version whose publish timestamp is at least 3 days old.
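
A possible shape for the PR-gate resolver is shown below. It assumes the caller has already fetched the package document from the npm registry and passes in its `time` map, which maps each version to its publish timestamp alongside the `created` and `modified` bookkeeping keys. The function name, the choice to rank by publish time, and the sample versions and dates are assumptions for illustration; pre-release filtering is not handled.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Keys in the registry's time map that are not package versions.
NON_VERSION_KEYS = {"created", "modified"}


def _parse(ts: str) -> datetime:
    # Registry timestamps end in "Z"; normalise for older Python versions.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def resolve_pr_gate_version(
    time_map: Dict[str, str],
    now: datetime,
    min_age: timedelta = timedelta(days=3),
) -> Optional[str]:
    """Return the most recently published version that is at least min_age old."""
    cutoff = now - min_age
    eligible = {
        version: _parse(ts)
        for version, ts in time_map.items()
        if version not in NON_VERSION_KEYS and _parse(ts) <= cutoff
    }
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


# Made-up sample data in the shape of the registry's time map.
time_map = {
    "created": "2025-02-24T00:00:00.000Z",
    "modified": "2026-04-24T12:00:00.000Z",
    "2.1.118": "2026-04-18T12:00:00.000Z",
    "2.1.119": "2026-04-21T12:00:00.000Z",
    "2.1.120": "2026-04-24T12:00:00.000Z",
}
now = datetime(2026, 4, 25, 3, 0, tzinfo=timezone.utc)
print(resolve_pr_gate_version(time_map, now))
```

Passing `now` in explicitly keeps the function deterministic, which would make it easy to test if the resolvers are covered later.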

Docs-site rendering

A new MDX page at docs/tutorials/claude-code-compatibility hosts the matrix.
A new React component named CompatibilityMatrix is added to the docs site. It imports compatibility-matrix.json as a build-time static asset and renders the grid.
Cells are rendered with a colored background plus an icon: green/✓ for pass, red/✗ for fail, gray/— for not applicable, yellow/? for not tested.
Tooltip on hover:
- pass → "Tested & passing" (or equivalent).
- fail → the error message from the test.
- not_applicable → the reason from the test.
- not_tested → "No test for this combination yet."
Banner above the matrix displays:
- Relative last-updated time when recent (e.g., "2 hours ago") with the absolute timestamp in the user's local timezone (explicitly labeled) shown on hover.
- Absolute last-updated time when older.
- The LiteLLM version and Claude Code version that produced the matrix.
Mobile: horizontal scroll preserves the table structure.
The component checks schema_version; on mismatch it renders an error placeholder instead of silently misrendering.
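
The rendering rules above reduce to a lookup table plus a schema-version guard. The sketch below expresses that logic in Python purely for illustration; the real component is React, and every name here (`CELL_STYLE`, `render_cell`, `render_matrix`, the error text) is hypothetical.

```python
SUPPORTED_SCHEMA_VERSION = "1"

# status -> (background color, icon), as listed in the rendering rules.
CELL_STYLE = {
    "pass": ("green", "\u2713"),            # check mark
    "fail": ("red", "\u2717"),              # ballot x
    "not_applicable": ("gray", "\u2014"),   # dash
    "not_tested": ("yellow", "?"),
}


def tooltip_for(cell: dict) -> str:
    """Pick the hover text for one cell."""
    status = cell["status"]
    if status == "pass":
        return "Tested & passing"
    if status == "fail":
        return cell["error"]
    if status == "not_applicable":
        return cell["reason"]
    return "No test for this combination yet."


def render_cell(cell: dict) -> dict:
    color, icon = CELL_STYLE[cell["status"]]
    return {"color": color, "icon": icon, "tooltip": tooltip_for(cell)}


def render_matrix(doc: dict) -> dict:
    """Return a renderable grid, or an error placeholder on schema mismatch."""
    if doc.get("schema_version") != SUPPORTED_SCHEMA_VERSION:
        return {"error": "Unsupported compatibility matrix schema version"}
    return {
        "columns": doc["providers"],
        "rows": [
            {
                "feature": feature["name"],
                "cells": [render_cell(feature["providers"][p]) for p in doc["providers"]],
            }
            for feature in doc["features"]
        ],
    }


# Made-up sample input.
doc = {
    "schema_version": "1",
    "providers": ["anthropic"],
    "features": [
        {"id": "f1", "name": "Feature 1",
         "providers": {"anthropic": {"status": "fail", "error": "timeout"}}},
    ],
}
print(render_matrix(doc))
```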

Cross-repo authentication

A GitHub App installed on the docs repo only, scoped to contents: write.
The daily-cron VM uses the App installation token to direct-push the JSON to the docs repo's main branch.
File-level restriction is enforced by script correctness (the publishing script commits only compatibility-matrix.json) rather than by token scope (GitHub does not support file-path-scoped tokens).
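
Because the restriction lives in the script rather than the token, one way to make it explicit is a guard that runs before the commit. The sketch below assumes the publisher can obtain the list of changed paths (for example from `git diff --cached --name-only`) and that the file sits at the path shown; both the helper name and the exact path are assumptions.

```python
from typing import Iterable

# Only this file may be committed by the publisher (path is illustrative).
ALLOWED_PATHS = frozenset({"compatibility-matrix.json"})


def assert_only_matrix_changed(changed_paths: Iterable[str]) -> None:
    """Raise if the pending commit touches anything besides the matrix JSON."""
    unexpected = sorted(set(changed_paths) - ALLOWED_PATHS)
    if unexpected:
        raise RuntimeError(f"refusing to publish; unexpected changes: {unexpected}")


# A clean publish passes silently.
assert_only_matrix_changed(["compatibility-matrix.json"])

# Anything else aborts before the push.
try:
    assert_only_matrix_changed(["compatibility-matrix.json", "docs/index.md"])
except RuntimeError as exc:
    print(exc)
```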

Deferred to v1+

Per-feature descriptions, documentation links, and categories.
Historical snapshots / diff-over-time views.
"Partial" and "Flaky" status states.
Operational guardrails (deadman alerts, staleness warning banners in the UI, atomic publishing with validation).
Slack regression alerts.
PR-based publishing with path-allowlist status checks.
A dedicated PR-comment diff commenter.
Additional rows (web search, MCP, skills, plugins, Max subscription, 1M context, 1-hour cache TTL, count_tokens, user-agent cost tracking, etc.).
Click-to-modal deep-dive on cells (with links to the test file on GitHub).
A written design-doc / ADR artifact.

Testing Decisions

A good test exercises external behavior — the observable contract of a module — not internal implementation details. Tests should continue to pass across refactors that preserve the module's interface. Tests that break on non-semantic changes are a maintenance tax.

For this feature, two modules have high-value testable surface and will ship with tests in v0:

Claude Code CLI Driver — unit tests with a mocked subprocess. The driver is a deep module wrapping a fast-moving external dependency. Mocked subprocess tests feed canned stream-JSON to the driver and assert the structured result. These tests catch regressions in output parsing and CLI argument assembly when Claude Code changes its output format or CLI surface. Prior art: the main LiteLLM repo's tests/test_litellm/ directory contains thousands of unit tests that mock external dependencies in exactly this pattern.

Matrix JSON Builder — golden-file tests. The builder is a pure function from inputs (pytest results artifact, manifest, providers list, run metadata) to output JSON. Golden-file tests fix the schema contract: feed fixture inputs, compare the produced JSON byte-for-byte (or structure-wise) to a checked-in expected output. Any schema drift — intentional or accidental — shows up as a diff in review. Prior art: golden-file / snapshot test patterns are common in the LiteLLM test suite for serialization-critical code.
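
To make the mocked-subprocess pattern concrete, here is a self-contained sketch. The `run_claude` driver is a hypothetical stand-in, not the PR's `cli_driver.py`; the CLI arguments and the simplified event shape (a final event with `type`, `is_error`, and `result`) are assumptions about Claude Code's stream-JSON output and should be checked against the real CLI.

```python
import json
import subprocess
from unittest import mock


def run_claude(prompt: str) -> dict:
    """Hypothetical driver: run the CLI and return the final result event."""
    proc = subprocess.run(
        # Argument list is an assumption for illustration.
        ["claude", "-p", prompt, "--output-format", "stream-json", "--verbose"],
        capture_output=True, text=True, check=False,
    )
    events = [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]
    results = [e for e in events if e.get("type") == "result"]
    if proc.returncode != 0 or not results:
        return {"ok": False, "text": None}
    final = results[-1]
    return {"ok": not final.get("is_error", False), "text": final.get("result")}


def test_driver_parses_canned_stream():
    # Canned stream-JSON: one event per line, result event last.
    canned = "\n".join([
        json.dumps({"type": "system", "subtype": "init"}),
        json.dumps({"type": "result", "is_error": False, "result": "hello"}),
    ])
    fake = subprocess.CompletedProcess(args=[], returncode=0, stdout=canned, stderr="")
    with mock.patch("subprocess.run", return_value=fake) as run:
        outcome = run_claude("say hello")
    # Assert on the structured result and on the assembled arguments.
    assert outcome == {"ok": True, "text": "hello"}
    argv = run.call_args.args[0]
    assert argv[:3] == ["claude", "-p", "say hello"]


test_driver_parses_canned_stream()
```

The test never launches the real CLI, so it stays fast and only fails when parsing or argument assembly changes.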

Testing skipped in v0 (explicitly):

Test Harness plumbing — too thin to warrant a dedicated test; covered incidentally by any compat test that runs.
Matrix Publisher orchestration — pure glue code over Docker, git, and subprocess. Unit-testing requires heavy mocking for low signal. Relies on real execution to validate.
React CompatibilityMatrix component — real value, but requires standing up a React testing harness in the docs repo with no prior art. Deferred to v1 alongside other docs-side polish.
Version resolvers — tiny, easy to test, but also tiny, so their breakage surface is trivial and easily caught by the daily-cron itself failing loudly. Deferred.

Out of Scope

Automated Claude Code feature discovery. The row set is curated manually; a human watches the Claude Code changelog and decides which features merit a test.
Cross-version matrices (e.g., showing support for LiteLLM v1.80 vs. v1.83 side by side). Only the current latest-stable is displayed.
Support for Claude Code features that don't flow through LiteLLM (slash commands, local tool execution, TUI, etc.).
Support for non-Claude models routed via /v1/messages (e.g., GPT-5 or Gemini via Claude Code). These belong to a future "Other" column, not v0.
Visualizing per-model pass/fail within a cell (e.g., three sub-dots for Haiku/Sonnet/Opus). v0 aggregates to a single cell status with the breaking model named in the tooltip on failure.
Cost accounting of running the test suite on every PR. Expected to be non-trivial due to real API calls to Bedrock / Vertex AI / Anthropic / Azure, but considered a worthwhile price for pre-merge protection.
A formal design-doc / ADR. Shared context from the design interview is deemed sufficient for v0.
Operational alerting and reliability guardrails for the daily cron (deadman alerts, staleness banners, validation gates). Deferred to v1+; v0 assumes a working cron and fails visibly if it breaks.

Further Notes

The two CI environments (PR gate and daily cron) share the same test code (tests/claude_code/). They differ only in what version of LiteLLM and Claude Code they run against, and in what they do with the results (block merge vs. publish matrix JSON).
The matrix is strictly a post-facto truth-telling artifact for the latest-stable release. It is not a policy document or a roadmap — it is the rendered output of the test suite at a point in time.
Adding a new feature to the matrix is a three-step change: (1) append an entry to manifest.yaml, (2) create a feature directory with per-provider test files, (3) for any not_applicable combinations, have the test file set the not_applicable status with a reason. No other wiring needed.
The choice to test against three Claude model tiers (Haiku, Sonnet, Opus) per cell means each PR's CI run will make roughly 90+ model calls: 6 features × 5 providers × 3 tiers = 90 at one call per test, before excluding any not_applicable cells. This is expected and accepted as part of the pre-merge gate design.
The daily cron's GitHub App scope is deliberately broader than strictly necessary (contents: write on the whole docs repo, not just one file) because GitHub does not support file-path-scoped permissions. The mitigation is that the publishing script only ever writes compatibility-matrix.json; correctness is a property of the script, not of the token.

extent analysis

TL;DR

To remove uncertainty about which Claude Code features work on which providers, publish a Compatibility Matrix on the LiteLLM docs site that updates automatically from end-to-end tests.

Guidance

Identify the key features of Claude Code that need to be tested for compatibility with different providers (e.g., Anthropic, Bedrock, Vertex AI, Azure).
Develop a set of end-to-end tests that exercise these features against each provider, using a tool like pytest.
Create a Matrix JSON Builder module to aggregate test results into a compatibility matrix, which can be displayed on the LiteLLM docs site.
Set up a daily cron job to re-run the tests and update the matrix, ensuring it reflects the current state of feature support.
Implement a React component to render the compatibility matrix on the docs site, with features like hover-over tooltips and mobile-friendly scrolling.

Example

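# Example of a pytest fixture for reporting test outcomes.
# Sketch only: storing the outcome in user_properties is an assumed mechanism.
import pytest

VALID_STATUSES = {"pass", "fail", "not_applicable", "not_tested"}

@pytest.fixture
def compat_result(request):
    def set_result(result_dict):
        # Reject anything outside the tagged-union status enum.
        if result_dict.get("status") not in VALID_STATUSES:
            raise ValueError(f"unknown status: {result_dict.get('status')!r}")
        # Attach the outcome to the test item so a conftest.py hook can
        # later merge it into the results artifact.
        request.node.user_properties.append(("compat_result", result_dict))
    return set_result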

Notes

The implementation should focus on a small, high-value set of features for v0, with additional features deferred to v1+.
The daily cron job should run on an isolated VM to prevent interference with the main CI environment.
The GitHub App token used for publishing the matrix should be scoped to contents: write on the docs repo, with file-level restrictions enforced by script correctness.

Recommendation

Implement the Compatibility Matrix and the daily cron job to give customers an up-to-date view of feature support, and defer additional features and polish to v1+.
