litellm - 💡(How to fix) Fix [QA] Claude Code Compatibility Matrix — full-stack QA plan (PRD #26476) [1 participants]

GitHub stats
BerriAI/litellm#26535 · Fetched 2026-04-26 05:06:28
Comments: 0 · Participants: 1 · Timeline events: 2 (labeled ×2) · Reactions: 0

QA Plan — Claude Code Compatibility Matrix (PRD #26476)

This is a step-by-step QA plan for the full stack landing on branch sandcastle/compat-matrix-stack, which implements PRD #26476.

The branch is a 5-commit stack (slices 1–5 of the PRD). Use the section headings below as a checklist; each section maps to a discrete piece of the implementation and can be QA'd independently.

Branch under test: sandcastle/compat-matrix-stack
Parent: litellm_internal_staging
Related PRs: #26477, #26478, #26479, #26480, #26481

What landed on this branch (at a glance)

| Slice | PR | What it adds |
| --- | --- | --- |
| 1 | #26477 | Tracer-bullet: tests/claude_code/ skeleton, manifest.yaml, cli_driver.py, conftest.py (compat_result fixture + path-inference hook), matrix_builder.py, sample 1×5 JSON, driver/builder unit tests |
| 2 | #26478 | First real feature row: basic_messaging_non_streaming × 5 providers (Anthropic, Bedrock Invoke, Bedrock Converse, Vertex AI, Azure-as-N/A), proxy test_config.yaml with model aliases |
| 3 | #26479 | CircleCI PR-gate job claude_code_compat_pr_gate, pr_gate_version_resolver.py (latest-minus-3-days Claude Code), structural CI tests |
| 4 | #26480 | Daily-cron: .github/workflows/claude_code_compat_matrix.yml, publisher.py, resolver.py (latest v*-stable LiteLLM), publisher unit tests |
| 5 | #26481 | Remaining 5 v0 features (basic_messaging_streaming, tool_use, prompt_caching_5m, vision, extended_thinking) × 5 providers; full 6×5 grid + structural layout tests |

Total: ~5,600 LOC, 64 files, ~142 unit tests + 90 per-cell live tests.


Phase 0 — Environment setup (do once)

Everything below assumes you have the branch checked out and the project deps installed.

git fetch origin sandcastle/compat-matrix-stack
git checkout sandcastle/compat-matrix-stack

# Install the project's full dev environment.
uv sync --frozen --all-groups --all-extras --python 3.12

Provider creds for the live per-cell tests (Phases 3 & 5):

  • ANTHROPIC_API_KEY
  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION_NAME (us-east-1)
  • Vertex: GOOGLE_APPLICATION_CREDENTIALS (or env-JSON), VERTEXAI_PROJECT, VERTEXAI_LOCATION
  • Azure cells are all not_applicable, no creds needed.

Claude CLI for the live tests:

nvm install 20 && nvm use 20
npm install -g @anthropic-ai/claude-code@latest
claude --version

Phase 1 — Unit tests (no proxy, no CLI, no creds)

These should all pass on a clean checkout of the branch. Total: ~142 tests, runtime under 30s.

1.1 — Driver unit tests (slice 1)

uv run pytest tests/claude_code/_driver_unit_tests/ -vv

Pass criteria:

  • test_cli_driver.py (10 tests): subprocess assembly, stream-JSON parsing, usage extraction, timeout/error handling.
  • test_compat_result.py (9 tests): fixture validation (rejects bad statuses, requires error on fail, requires reason on not_applicable); path inference for (feature, provider).

1.2 — Matrix builder unit tests (slice 1 + 5)

uv run pytest tests/claude_code/_builder_unit_tests/ -vv

Pass criteria:

  • test_matrix_builder.py (12 tests): manifest schema validation, not_tested fill-in, multi-model aggregation precedence (fail > not_applicable > pass), and the 6×5 golden test (test_build_matrix_6x5_grid_matches_published_sample) compares the builder output byte-wise against sample_compatibility-matrix.json.
  • test_v0_layout.py (100 tests): every v0 (feature, provider) has a test file at tests/claude_code/<feature>/test_<provider>.py; every non-Azure file references all three Claude tiers (Haiku 4.5, Sonnet 4.6, Opus 4.7); every Azure file is a not_applicable declaration.
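
The aggregation precedence named above (fail > not_applicable > pass) can be pictured with a small sketch. This is not the code in matrix_builder.py; the helper name aggregate_cell and the choice to return the first worst-ranked tier result unchanged are assumptions made for illustration.

```python
from typing import Iterable

# Precedence stated in the plan: fail > not_applicable > pass.
_PRECEDENCE = {"fail": 2, "not_applicable": 1, "pass": 0}


def aggregate_cell(tier_results: Iterable[dict]) -> dict:
    """Collapse the per-tier results of one (feature, provider) cell.

    Hypothetical helper: an empty input maps to not_tested, mirroring the
    "not_tested fill-in" behavior that the builder tests describe.
    """
    results = list(tier_results)
    if not results:
        return {"status": "not_tested"}
    # max() returns the first maximal element, so the earliest worst tier wins.
    return max(results, key=lambda r: _PRECEDENCE[r["status"]])
```

With three tier results where only one failed, the cell comes back as that failing result, which matches the rule that a cell is pass only if every tier passed.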

1.3 — PR-gate version resolver unit tests (slice 3)

uv run pytest tests/claude_code/_pr_gate_unit_tests/ -vv

Pass criteria:

  • test_pr_gate_version_resolver.py (8 tests): correctly picks the newest npm version published ≥ 3 days ago; raises NoEligibleVersionError if every version is too new; ignores time.created/time.modified meta keys.
  • test_circleci_pr_gate_wiring.py (6 tests): structural assertions on .circleci/config.yml — the claude_code_compat_pr_gate job exists, runs the resolver before npm install -g, depends on build_docker_database_image, and is gated on main_branches.

1.4 — Publisher / resolver unit tests (slice 4)

uv run pytest tests/claude_code/_publisher_unit_tests/ -vv

Pass criteria:

  • test_publisher.py (~14 tests): commit message format embeds litellm/claude/generated_at; select_files_to_commit drops anything except compatibility-matrix.json (this is the file-allowlist defense in depth — verify it).
  • test_resolver.py (~8 tests): correctly sorts v*-stable tags numerically (so v1.10.0-stable outranks v1.9.5-stable); rejects non-stable tags; raises on empty results.
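
The numeric-sort requirement is easy to get wrong with plain string comparison, so a short sketch may help when reviewing test_resolver.py. It assumes tags have the strict form vX.Y.Z-stable; latest_stable is a hypothetical stand-in, not the resolver's real function.

```python
import re

# Assumption: stable tags look exactly like v<major>.<minor>.<patch>-stable.
_STABLE_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)-stable$")


def latest_stable(tags):
    """Return the highest v*-stable tag, comparing components as integers."""
    parsed = []
    for tag in tags:
        match = _STABLE_TAG.match(tag)
        if match:  # non-stable tags (rc, unsuffixed) are ignored
            parsed.append((tuple(int(g) for g in match.groups()), tag))
    if not parsed:
        raise ValueError("no v*-stable tags found")
    return max(parsed)[1]
```

Comparing integer tuples is what makes v1.10.0-stable outrank v1.9.5-stable; a lexicographic sort on the raw strings would order them the other way.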

1.5 — One command run-all

uv run pytest tests/claude_code/ \
  --ignore=tests/claude_code/basic_messaging_non_streaming \
  --ignore=tests/claude_code/basic_messaging_streaming \
  --ignore=tests/claude_code/tool_use \
  --ignore=tests/claude_code/prompt_caching_5m \
  --ignore=tests/claude_code/vision \
  --ignore=tests/claude_code/extended_thinking \
  -vv

Expect: all unit tests green, ~142 tests collected, no live API calls.


Phase 2 — Static / structural verification

These are quick sanity checks that don't require execution.

2.1 — Manifest matches PRD row order

Open tests/claude_code/manifest.yaml and confirm:

  • schema_version: "1"
  • providers list is exactly: anthropic, bedrock_invoke, bedrock_converse, vertex_ai, azure (in this order).
  • features list contains exactly these 6 ids in this order: basic_messaging_non_streaming, basic_messaging_streaming, tool_use, prompt_caching_5m, vision, extended_thinking.

2.2 — Filesystem shape matches manifest

ls tests/claude_code/
# Expect: a directory per feature_id, plus _builder_unit_tests, _driver_unit_tests, _pr_gate_unit_tests, _publisher_unit_tests, conftest.py, cli_driver.py, manifest.yaml, matrix_builder.py, publisher.py, resolver.py, pr_gate_version_resolver.py, sample_compatibility-matrix.json, test_config.yaml

Each feature directory must contain: test_anthropic.py, test_bedrock_invoke.py, test_bedrock_converse.py, test_vertex_ai.py, test_azure.py (5 files × 6 features = 30 per-cell files).
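
Counting 30 files by eye is error-prone, so the check can be scripted. The sketch below hard-codes the provider and feature ids listed in 2.1 rather than parsing manifest.yaml; the helper names are illustrative only.

```python
from pathlib import Path

# Ids copied from section 2.1, in manifest order.
PROVIDERS = ["anthropic", "bedrock_invoke", "bedrock_converse", "vertex_ai", "azure"]
FEATURES = [
    "basic_messaging_non_streaming",
    "basic_messaging_streaming",
    "tool_use",
    "prompt_caching_5m",
    "vision",
    "extended_thinking",
]


def expected_cell_files(root: Path) -> list:
    """Every per-cell test file the layout requires (6 features x 5 providers)."""
    return [
        root / feature / f"test_{provider}.py"
        for feature in FEATURES
        for provider in PROVIDERS
    ]


def missing_cell_files(root: Path) -> list:
    """Expected per-cell files that do not exist on disk."""
    return [path for path in expected_cell_files(root) if not path.is_file()]
```

Running missing_cell_files(Path("tests/claude_code")) from the repository root should return an empty list on this branch.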

2.3 — Sample JSON matches schema

Open tests/claude_code/sample_compatibility-matrix.json and confirm:

  • schema_version: "1"
  • providers array order matches manifest.
  • 6 feature entries with id, name, and providers map keyed by all 5 providers.
  • Every Azure cell is {"status": "not_applicable", "reason": "..."}.
  • The four Anthropic-supporting providers are {"status": "pass"} for every row.

2.4 — Proxy config sanity (tests/claude_code/test_config.yaml)

Confirm:

  • 12 model_list entries (3 tiers × 4 Anthropic-supporting providers).
  • general_settings.forward_client_headers_to_llm_api: true (required so anthropic-beta headers reach upstream).
  • litellm_settings.drop_params: true and modify_params: true.
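
The three config checks above can also be expressed as a function over the parsed YAML. This is a sketch that takes an already-loaded dict (for example from yaml.safe_load); check_proxy_config is a hypothetical name, and only the keys named in this section are inspected.

```python
def check_proxy_config(cfg: dict) -> list:
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []

    model_count = len(cfg.get("model_list") or [])
    if model_count != 12:  # 3 tiers x 4 Anthropic-supporting providers
        problems.append(f"expected 12 model_list entries, found {model_count}")

    general = cfg.get("general_settings") or {}
    if general.get("forward_client_headers_to_llm_api") is not True:
        problems.append("general_settings.forward_client_headers_to_llm_api must be true")

    litellm_settings = cfg.get("litellm_settings") or {}
    for key in ("drop_params", "modify_params"):
        if litellm_settings.get(key) is not True:
            problems.append(f"litellm_settings.{key} must be true")

    return problems
```

Returning a list instead of raising on the first mismatch lets a reviewer see every deviation in one pass.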

2.5 — Version-resolver CLI smoke

uv run --no-sync python -m tests.claude_code.pr_gate_version_resolver

Expect: stdout = a single semver string (e.g., 2.1.118); stderr = pr_gate_version_resolver: selected @anthropic-ai/claude-code@<version>. Verify on npmjs.com that the printed version's "Last published" is ≥ 3 days ago.

2.6 — Latest-stable resolver smoke

uv run --no-sync python -c "from tests.claude_code.resolver import latest_stable_litellm_tag; print(latest_stable_litellm_tag())"

Expect: a v*-stable tag string matching the newest stable release on https://github.com/BerriAI/litellm/releases.


Phase 3 — Local end-to-end per-cell smoke (live APIs, costs money)

Goal: run at least one cell per provider column against a real proxy + real claude CLI before trusting CI. This is the same flow the PR gate runs in CircleCI, just on your laptop.

3.1 — Boot the proxy

In one terminal:

uv run litellm \
  --config tests/claude_code/test_config.yaml \
  --port 4000 \
  --detailed_debug

Wait until /health/liveliness returns 200:

curl -fsS http://localhost:4000/health/liveliness

Expect: {"status":"healthy"} (or similar 200 response).
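
If you would rather script the wait than re-run curl by hand, a polling loop along these lines should work. It is a sketch using only the standard library; the function name, the timeout values, and the injectable opener parameter are illustrative choices, not part of the branch.

```python
import time
import urllib.request


def wait_until_healthy(url, timeout_s=60.0, interval_s=1.0, opener=urllib.request.urlopen):
    """Poll url until it answers HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused and urllib's URLError/HTTPError,
            # which are OSError subclasses: the proxy is not ready yet.
            pass
        time.sleep(interval_s)
    return False
```

For the proxy above the call would be wait_until_healthy("http://localhost:4000/health/liveliness"); the opener argument exists only so the loop can be exercised without a running server.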

3.2 — Run a single cell against Anthropic

export LITELLM_PROXY_BASE_URL="http://localhost:4000"
export LITELLM_PROXY_API_KEY="sk-1234"
uv run pytest tests/claude_code/basic_messaging_non_streaming/test_anthropic.py -vv

Pass criteria: 3 parametrized tests pass (one per Claude tier). Each call should appear in the proxy logs with a 200 to api.anthropic.com. The proxy receives an anthropic-beta header and forwards it.

3.3 — Run one cell per Bedrock variant

uv run pytest tests/claude_code/basic_messaging_non_streaming/test_bedrock_invoke.py -vv
uv run pytest tests/claude_code/basic_messaging_non_streaming/test_bedrock_converse.py -vv

Pass criteria: 3 + 3 tests pass. Confirm proxy logs show distinct routes (InvokeModel vs Converse).

3.4 — Run one cell against Vertex AI

uv run pytest tests/claude_code/basic_messaging_non_streaming/test_vertex_ai.py -vv

Pass criteria: 3 tests pass. Confirm vertex_ai_project / vertex_ai_location in the proxy log line.

3.5 — Run all six Azure files (not_applicable)

for f in basic_messaging_non_streaming basic_messaging_streaming tool_use prompt_caching_5m vision extended_thinking; do
  uv run pytest tests/claude_code/$f/test_azure.py -vv
done

Pass criteria: every test passes by calling compat_result.set({"status": "not_applicable", "reason": "..."}). No HTTP calls should be made — confirm by watching the proxy log (it should be quiet).

3.6 — One cell per non-trivial feature (verify feature-specific assertions)

These each exercise a feature-specific code path in the driver and the proxy:

# Streaming: asserts the stream-json wire emitted >=1 stream event.
uv run pytest tests/claude_code/basic_messaging_streaming/test_anthropic.py -vv

# Tool use: passes --allowed-tools Bash; asserts a tool_use content block was emitted.
uv run pytest tests/claude_code/tool_use/test_anthropic.py -vv

# Prompt caching: asserts upstream usage shows cache_creation_input_tokens
# OR cache_read_input_tokens > 0. Run twice — second run should hit cache.
uv run pytest tests/claude_code/prompt_caching_5m/test_anthropic.py -vv
uv run pytest tests/claude_code/prompt_caching_5m/test_anthropic.py -vv

# Vision: writes a 1x1 PNG to tmp_path, attaches with --image, asserts non-empty reply.
uv run pytest tests/claude_code/vision/test_anthropic.py -vv

# Extended thinking: sets MAX_THINKING_TOKENS=4096, asserts a 'thinking' block was emitted.
uv run pytest tests/claude_code/extended_thinking/test_anthropic.py -vv

Pass criteria for each: the feature-specific assertion passes (not just "got a non-empty reply"). For prompt_caching, on the second run inspect the proxy log to confirm cache_read_input_tokens > 0.

3.7 — Full per-cell sweep (optional, ~30 min, full cost)

mkdir -p test-results
uv run pytest tests/claude_code/ \
  --ignore=tests/claude_code/_driver_unit_tests \
  --ignore=tests/claude_code/_builder_unit_tests \
  --ignore=tests/claude_code/_pr_gate_unit_tests \
  --ignore=tests/claude_code/_publisher_unit_tests \
  -vv \
  --junitxml=test-results/junit.xml

Pass criteria: all 90 per-cell tests pass; compat-results.json is written next to where you ran pytest.

3.8 — Verify the results artifact

cat compat-results.json | python -m json.tool | head -50

Pass criteria:

  • Has schema_version: "1" and a results list.
  • Each entry has feature_id, provider, nodeid, and a result tagged-union object.
  • All Azure entries report not_applicable with the same reason.
  • Path inference worked: feature_id matches the parent directory and provider matches the file stem.
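
The per-entry checks in this list can be automated. The sketch below assumes nodeid follows the usual pytest form path::test_name[params] with a path relative to the repository root; infer_cell and check_entry are hypothetical helpers, not the conftest's own path-inference hook.

```python
import re

# A per-cell file is tests/claude_code/<feature>/test_<provider>.py, where the
# feature directory does not start with an underscore.
_CELL_PATH = re.compile(
    r"^tests/claude_code/(?P<feature>[^_/][^/]*)/test_(?P<provider>\w+)\.py$"
)
REQUIRED_KEYS = {"feature_id", "provider", "nodeid", "result"}


def infer_cell(nodeid: str):
    """Return (feature, provider) for a per-cell nodeid, else None."""
    path = nodeid.split("::", 1)[0]
    match = _CELL_PATH.match(path)
    return (match["feature"], match["provider"]) if match else None


def check_entry(entry: dict) -> list:
    """Return problems found in one compat-results.json entry."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    inferred = infer_cell(entry["nodeid"])
    if inferred is None:
        return [f"nodeid is not a per-cell test: {entry['nodeid']}"]
    if inferred != (entry["feature_id"], entry["provider"]):
        return [f"path says {inferred}, entry says {(entry['feature_id'], entry['provider'])}"]
    return []
```

Applying check_entry to each element of the results list and expecting only empty lists covers the key-presence and path-inference bullets in one loop.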

3.9 — Build the matrix locally

uv run python -c "
from datetime import datetime, timezone
from pathlib import Path
from tests.claude_code.matrix_builder import build_from_paths
build_from_paths(
  manifest_path=Path('tests/claude_code/manifest.yaml'),
  results_path=Path('compat-results.json'),
  litellm_version='v0.0.0-test',
  claude_code_version='2.1.test',
  generated_at=datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'),
  output_path=Path('compatibility-matrix.json'),
)
print(open('compatibility-matrix.json').read())
"

Pass criteria:

  • compatibility-matrix.json is written.
  • It conforms to the schema (schema_version, generated_at, litellm_version, claude_code_version, providers, features).
  • All 30 cells are present (6 features × 5 providers).
  • Cell aggregation is correct: a cell is pass only if all 3 tier-parametrized tests passed; if any tier failed, the cell is fail with the failing tier named in the error string.

Phase 4 — CircleCI PR gate (slice 3)

Goal: confirm the claude_code_compat_pr_gate job runs end-to-end on a real PR.

4.1 — Open a draft PR from this branch into litellm_internal_staging

gh pr create --draft \
  --base litellm_internal_staging \
  --head sandcastle/compat-matrix-stack \
  --title "[QA] compat matrix stack" \
  --body "QA-only PR; close after CI runs."

4.2 — Watch the new job in CircleCI

Find claude_code_compat_pr_gate in the workflow.

Pass criteria:

  • The "Resolve Claude Code CLI version" step prints something like Selected @anthropic-ai/claude-code@<version>, where the version is ≥ 3 days old.
  • The "Install Node.js 20 + Claude Code CLI" step succeeds and claude --version prints the resolved version.
  • The proxy boots from the PR's docker image (not a published one) on port 4000.
  • The "Run Claude Code compatibility test suite" step runs all 90 per-cell tests.
  • Test results are stored (visible under "Tests" tab).

4.3 — Verify the gate blocks regressions

This is the headline behavior: a red cell must fail the job.

In a separate sandbox PR (NOT merged):

  • Pick one cell, e.g. tests/claude_code/basic_messaging_non_streaming/test_anthropic.py.
  • Edit the test so that it will fail (e.g. break the proxy alias).
  • Push, watch CI.

Pass criteria: the claude_code_compat_pr_gate job goes red and the PR's required-status-check fails. Close this sandbox PR without merging.

4.4 — Verify the 3-day security buffer

On the live job log, find the resolver line. Check the resolved version's "published" date on https://www.npmjs.com/package/@anthropic-ai/claude-code?activeTab=versions. Pass criteria: the published date is ≥ 3 days before the CI run started.


Phase 5 — Daily cron / publisher (slice 4)

Goal: confirm the .github/workflows/claude_code_compat_matrix.yml workflow runs end-to-end and (optionally) publishes a real JSON to the docs repo.

5.1 — Verify GitHub Actions secrets are configured

On BerriAI/litellm → Settings → Secrets and variables → Actions, confirm:

  • COMPAT_MATRIX_APP_ID (GitHub App ID for the docs-repo publishing app)
  • COMPAT_MATRIX_APP_PRIVATE_KEY (PEM private key)
  • ANTHROPIC_API_KEY, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION_NAME, VERTEXAI_PROJECT, VERTEXAI_LOCATION, GOOGLE_APPLICATION_CREDENTIALS_JSON, AZURE_API_KEY, AZURE_API_BASE

Note: if COMPAT_MATRIX_APP_* are not yet configured, only the --skip-publish path can be tested in 5.2.

5.2 — Manual dispatch with --skip-publish (safe dry run)

gh workflow run claude_code_compat_matrix.yml \
  --ref sandcastle/compat-matrix-stack \
  --field skip_publish=true

Watch with:

gh run list --workflow=claude_code_compat_matrix.yml --limit 3
gh run watch <run-id>

Pass criteria:

  • The "Mint docs-repo installation token from GitHub App" step succeeds (only needs the App secrets, no doc-side write happens).
  • The "Run matrix publisher" step:
    • Resolves the latest v*-stable LiteLLM tag (printed in the log).
    • Pulls the corresponding ghcr.io/berriai/litellm:v*-stable Docker image.
    • Boots it on 127.0.0.1:4000, polls /health/liveliness until 200.
    • Installs the absolute latest @anthropic-ai/claude-code (no 3-day buffer here — this is the cron, not the PR gate).
    • Runs pytest tests/claude_code/ against the proxy.
    • Logs "skip-publish: not pushing to docs repo".
  • Exit code 0.
  • The artifact compat-results-<run-id> is uploaded with both compat-results.json and compatibility-matrix.json.

5.3 — Inspect the dry-run artifact

gh run download <run-id> -n compat-results-<run-id>
cat compatibility-matrix.json | python -m json.tool | head -40

Pass criteria:

  • schema_version: "1"
  • litellm_version matches the resolved stable tag.
  • claude_code_version matches the npm-latest at run time (typically a version published within the last few hours/days).
  • generated_at is a recent ISO-8601 UTC timestamp ending in Z.
  • Cells reflect the real result of running the suite against ghcr.io/berriai/litellm:<latest> — this is the moment of truth for the matrix's accuracy.

5.4 — Real publish run (only if docs-repo App is configured)

gh workflow run claude_code_compat_matrix.yml \
  --ref sandcastle/compat-matrix-stack

Pass criteria:

  • Same checks as 5.2, plus:
  • The publisher's git push origin main step succeeds against BerriAI/litellm-docs.
  • A new commit lands on the docs repo with message:
    Update Claude Code compatibility matrix
    
    litellm_version: v<X.Y.Z>-stable
    claude_code_version: <X.Y.Z>
    generated_at: <ISO>
  • The commit author is litellm-compat-matrix-bot <[email protected]>.
  • Only static/data/compatibility-matrix.json changed. Verify this — it's the file-allowlist guarantee enforced by select_files_to_commit. If anything else changed, file a P0 bug.
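
To compare a landed commit against the expected message without eyeballing it, the message can be parsed into its three fields. This is a sketch; parse_commit_message is a hypothetical helper, and it deliberately tolerates leading indentation on the field lines because the exact whitespace is not specified here.

```python
EXPECTED_SUBJECT = "Update Claude Code compatibility matrix"
REQUIRED_FIELDS = ("litellm_version", "claude_code_version", "generated_at")


def parse_commit_message(message: str) -> dict:
    """Return the key/value fields of a publisher commit message.

    Raises ValueError if the subject line or any required field is missing.
    """
    lines = message.strip().splitlines()
    if not lines or lines[0].strip() != EXPECTED_SUBJECT:
        raise ValueError("unexpected subject line")
    fields = {}
    for line in lines[1:]:
        # Split on the first ": " only, so timestamps keep their colons.
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return fields
```

The returned litellm_version and claude_code_version can then be compared with the values inside the published compatibility-matrix.json.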

5.5 — Idempotency check (re-run when nothing has changed)

Immediately re-dispatch the workflow:

gh workflow run claude_code_compat_matrix.yml --ref sandcastle/compat-matrix-stack

Pass criteria: the publisher logs "matrix JSON unchanged; skipping push" and exits 0; no new commit lands on the docs repo.

5.6 — release trigger

On a future v*-stable release publication, confirm the workflow triggers automatically. Plain v*-rc1 or unsuffixed tags must NOT trigger it (the if: clause on the job filters those out).

5.7 — Cron trigger

Confirm the daily 0 6 * * * UTC schedule fires (check workflow run history at 06:00 UTC the next morning).


Phase 6 — Schema / contract verification

These guard the docs-side renderer (which lives in BerriAI/litellm-docs). The renderer is out of scope for this branch but consumes the JSON produced here, so contract drift would silently break the docs page.

6.1 — Schema version locked at "1"

grep -r '"schema_version"' tests/claude_code/ litellm/
grep -r 'SCHEMA_VERSION' tests/claude_code/

Pass criteria: every reference (manifest, builder, sample, conftest output) declares "1". There must be no drift between the builder's SCHEMA_VERSION constant and what the manifest / sample claim.

6.2 — Tagged-union shape

For every cell status in compatibility-matrix.json:

  • pass: {"status": "pass"} (no other keys).
  • fail: {"status": "fail", "error": "<string>"} (error is required, non-empty).
  • not_applicable: {"status": "not_applicable", "reason": "<string>"} (reason is required, non-empty).
  • not_tested: {"status": "not_tested"} (no other keys).

The conftest (compat_result.set) enforces the required fields on the input side; verify the builder preserves this invariant on the output side.
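
One way to verify the output side is to run every cell of compatibility-matrix.json through a strict validator for the four shapes listed above. The function below is a sketch written for this check; it is not the fixture's own validation code, and validate_cell is a hypothetical name.

```python
# Status -> name of the one extra key that status requires (None = no extra key).
_EXTRA_KEY = {
    "pass": None,
    "not_tested": None,
    "fail": "error",
    "not_applicable": "reason",
}


def validate_cell(cell: dict) -> None:
    """Raise ValueError unless cell matches one of the four tagged-union shapes."""
    status = cell.get("status")
    if status not in _EXTRA_KEY:
        raise ValueError(f"unknown status: {status!r}")
    extra = _EXTRA_KEY[status]
    allowed = {"status"} if extra is None else {"status", extra}
    if set(cell) != allowed:
        raise ValueError(f"{status} cell must have exactly keys {sorted(allowed)}")
    if extra is not None:
        value = cell[extra]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{extra} must be a non-empty string")
```

Looping over features, then over each feature's providers map, and calling validate_cell on every value checks all 30 cells.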

6.3 — Provider order matches column order

uv run python -c "
import json, yaml
m = yaml.safe_load(open('tests/claude_code/manifest.yaml'))
j = json.load(open('tests/claude_code/sample_compatibility-matrix.json'))
assert list(j['providers']) == list(m['providers']), 'provider order drift'
for f in j['features']:
  assert list(f['providers']) == list(m['providers']), f'cell order drift in {f[\"id\"]}'
print('ok')
"

Pass criteria: prints ok.


Phase 7 — Failure-mode QA (negative paths)

These are quick chaos checks on the harness itself.

7.1 — Test that forgets to call compat_result.set(...)

Temporarily edit any test_anthropic.py to remove the final compat_result.set({"status": "pass"}) line and run it. Expect: the conftest fills in status: "fail" with the message "test passed without calling compat_result.set(); every compat test must report a status.". Revert the edit.

7.2 — Invalid compat_result.set(...) payloads

In a scratch test, call compat_result.set({"status": "fail"}) (missing error) and compat_result.set({"status": "weird"}). Expect: the fixture raises ValueError immediately, not silently.

7.3 — Path inference rejects unit-test files

uv run pytest tests/claude_code/_driver_unit_tests/ -vv
ls compat-results.json 2>&1 | head
cat compat-results.json | python -m json.tool

Pass criteria: results list is empty (or contains only entries for files that match tests/claude_code/<feature>/test_<provider>.py). Unit-test files under directories starting with _ must NOT pollute the matrix.

7.4 — select_files_to_commit allowlist

uv run python -c "
from tests.claude_code.publisher import select_files_to_commit
keep = select_files_to_commit(
  ['static/data/compatibility-matrix.json', 'README.md', 'config.yml'],
  'compatibility-matrix.json',
)
assert keep == ['static/data/compatibility-matrix.json'], keep
print('ok')
"

Pass criteria: prints ok.

7.5 — Resolver: empty / malformed GitHub Releases response

The unit tests under _publisher_unit_tests/test_resolver.py already cover this; spot-check that they pass on this branch.


Sign-off checklist

QA can be considered complete when:

  • Phase 1 — all unit tests pass locally (~142 tests).
  • Phase 2 — manifest / sample / config / resolver CLI smoke checks all pass.
  • Phase 3 — at least one live per-cell test passes per provider column (Anthropic, Bedrock Invoke, Bedrock Converse, Vertex AI), all six Azure files report not_applicable, and compat-results.json + compatibility-matrix.json are produced and validated.
  • Phase 4 — claude_code_compat_pr_gate runs on a draft PR, the resolver picks a ≥ 3-day-old version, all 90 per-cell tests run; intentional regression turns the job red.
  • Phase 5.2 — workflow_dispatch with skip_publish=true succeeds end-to-end and uploads artifacts.
  • Phase 5.4 — real publish run (when App creds are wired) lands a single-file commit on BerriAI/litellm-docs with the expected commit message and only compatibility-matrix.json changed.
  • Phase 5.5 — idempotent re-run skips the push.
  • Phase 6 — schema / shape contract holds.
  • Phase 7 — failure-mode behaviors match expectations.

Owner: please reply on this issue with a comment per phase as you complete it, linking the CI run / PR / docs-repo commit that demonstrates pass.

extent analysis

TL;DR

To resolve the compatibility matrix issue, ensure all unit tests pass, verify the manifest and sample data, and confirm the CircleCI PR gate and daily cron jobs run successfully.

Guidance

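  1. Run unit tests: Execute the Phase 1.5 command (uv run pytest tests/claude_code/ with the six feature directories passed to --ignore) to verify all ~142 unit tests pass. A bare uv run pytest tests/claude_code/ would also collect the 90 live per-cell tests.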
  2. Verify manifest and sample data: Check the manifest.yaml and sample_compatibility-matrix.json files for correctness and consistency.
  3. Confirm CircleCI PR gate: Open a draft PR and verify the claude_code_compat_pr_gate job runs successfully, resolving the correct Claude Code version and running all 90 per-cell tests.
  4. Verify daily cron job: Dispatch the .github/workflows/claude_code_compat_matrix.yml workflow with skip_publish=true and confirm it succeeds end-to-end, uploading artifacts.
  5. Check failure-mode behaviors: Test failure modes, such as forgetting to call compat_result.set(...), to ensure the harness behaves as expected.

Notes

The provided issue is a comprehensive QA plan, and the guidance is based on the assumption that the goal is to complete this plan successfully. The actual fix or workaround may depend on the specific errors or issues encountered during the QA process.

Recommendation

Apply the provided guidance steps to complete the QA plan and verify the compatibility matrix workflow. If issues are encountered, investigate and address them accordingly.
