transformers - ✅(Solved) Fix KubeflowCallback: Native progress reporting for Kubernetes-based Kubeflow training [2 pull requests, 1 comment, 1 participant]

transformers · 2026-03-06 07:07:19

GitHub stats

huggingface/transformers#44486 • Fetched 2026-04-08 00:28:08

Author

abhijeet-dhumal

Participants

abhijeet-dhumal

Timeline (top)

cross-referenced ×2, closed ×1, commented ×1, labeled ×1

Fix Action

Fixed

Fixed by PR: feat(integration): Add KubeflowCallback to enable automatic progress … (https://github.com/huggingface/transformers/pull/44487)
Fixed by PR: feat: add support for tracking TrainJob progress and training metrics (https://github.com/kubeflow/trainer/pull/3227)

PR fix notes

PR #44487: feat(integration): Add KubeflowCallback to enable automatic progress …

Repository: huggingface/transformers
Author: abhijeet-dhumal
State: closed | merged: True
Link: https://github.com/huggingface/transformers/pull/44487

Description (problem / solution / changelog)

What does this PR do?

Fixes #44486

Adds KubeflowCallback to enable automatic progress and metrics reporting for training jobs running on Kubeflow Trainer.

When training runs inside a Kubeflow TrainJob, the callback automatically:

Reports training progress percentage (0-100%)
Calculates and reports estimated time remaining (ETA)
Pushes training metrics (loss, accuracy, etc.) to the TrainJob status
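
The progress percentage and ETA listed above can be derived from step counts alone. The following is a minimal sketch, not the callback's actual code: the function name and the linear-extrapolation rule (average seconds per completed step times remaining steps) are assumptions made for illustration.

```python
def progress_and_eta(current_step, max_steps, elapsed_seconds):
    """Return (progress_percentage, estimated_remaining_seconds).

    Hypothetical helper: assumes every step takes roughly the same time,
    so the ETA is a linear extrapolation of the elapsed wall-clock time.
    """
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    # Clamp so a stray step count never yields <0% or >100%.
    step = min(max(current_step, 0), max_steps)
    percentage = int(100 * step / max_steps)
    if step == 0:
        # No completed step yet, so there is no throughput to extrapolate.
        return percentage, None
    seconds_per_step = elapsed_seconds / step
    remaining = int(round(seconds_per_step * (max_steps - step)))
    return percentage, remaining
```

In a Trainer callback, the step counts would typically come from the trainer state, and the elapsed time from a timestamp recorded when training begins.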

This enables real-time visibility into training progress via standard Kubernetes APIs:

kubectl get trainjob my-job -o jsonpath='{.status.trainerStatus}'
# {"progressPercentage": 67, "estimatedRemainingSeconds": 3600, "metrics": [...]}

Zero friction for users - no code changes required:

from transformers import Trainer, TrainingArguments

trainer = Trainer(model=model, args=TrainingArguments(...), train_dataset=ds)
trainer.train()  # Progress automatically reported when running in Kubeflow

Changes

Add is_kubeflow_available() function that detects Kubeflow environment via KUBEFLOW_TRAINER_STATUS_URL env var
Add KubeflowCallback class following the MLflowCallback/WandbCallback pattern
Register in INTEGRATION_TO_CALLBACK and get_available_reporting_integrations()
Add unit tests in tests/trainer/test_trainer_callback.py
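
The environment check described in the first item might look roughly like the sketch below. It is an illustration only: the helper name `looks_like_kubeflow` is hypothetical, and the variable name is copied from the list above rather than verified against the merged code.

```python
import os

# Variable name as written in the PR's "Changes" list (an assumption here).
STATUS_URL_ENV = "KUBEFLOW_TRAINER_STATUS_URL"


def looks_like_kubeflow(environ=None):
    """Return True when the status URL variable is set to a non-empty value.

    `environ` defaults to os.environ; passing a plain dict makes the check
    easy to exercise outside a cluster.
    """
    env = os.environ if environ is None else environ
    return bool(env.get(STATUS_URL_ENV, "").strip())
```

Keying the check on an injected variable means the same training script behaves identically on a laptop and in a cluster; the integration only switches on where the controller has configured it.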

Related PRs

Repository	PR	Description
kubeflow/trainer	#3227	Controller-side implementation (status server)
kubeflow/sdk	https://github.com/kubeflow/sdk/issues/367	SDK `update_runtime_status()` utility

Additional context

This is part of KEP-2779 - a coordinated effort with the Kubeflow community to enable real-time training progress tracking on Kubernetes.

The callback has an optional dependency on kubeflow-sdk. The SDK handles:

HTTP transport to the Kubeflow controller
Authentication via projected service account tokens
Throttling (max 1 update per 5 seconds)
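
To make the throttling rule concrete, here is a small sketch of a "one update per interval" gate. It does not reproduce the SDK's implementation; the class name, the `force` escape hatch, and the injectable clock are all assumptions chosen to keep the example testable.

```python
import time


class Throttle:
    """Allow at most one update per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval=5.0, clock=time.monotonic):
        self.min_interval = min_interval
        self._clock = clock  # injectable so tests need not sleep
        self._last = None    # time of the last allowed update

    def allow(self, force=False):
        """Return True if an update may be sent now, and record it."""
        now = self._clock()
        if force or self._last is None or now - self._last >= self.min_interval:
            self._last = now
            return True
        return False
```

A `force` flag of this kind is one way to make sure a final "100%" update is not dropped just because it arrives shortly after the previous one.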

When the TrainJobProgress feature gate is enabled in Kubeflow Trainer, the controller automatically injects these environment variables into training pods:

KUBEFLOW_TRAINER_SERVER_URL - HTTPS endpoint for status updates
KUBEFLOW_TRAINER_SERVER_CA_CERT - CA cert for TLS
KUBEFLOW_TRAINER_SERVER_TOKEN - Projected SA token for auth
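
A client that consumes these three values could be sketched with the standard library as below. Several details are assumptions and not confirmed by the PR text: the POST method, the bearer-token header, and the flat payload shape (which simply mirrors the `status.trainerStatus` output shown earlier). The function names are hypothetical.

```python
import json
import ssl
import urllib.request


def to_metrics(logs):
    """Turn {"loss": 0.15} into [{"name": "loss", "value": "0.15"}].

    Values are stringified to match the kubectl output shown in the PR;
    non-numeric entries are skipped.
    """
    return [
        {"name": name, "value": str(value)}
        for name, value in logs.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    ]


def build_status_request(url, token, progress, eta_seconds, logs):
    """Build (but do not send) a status update request. Payload shape is assumed."""
    body = {
        "progressPercentage": progress,
        "estimatedRemainingSeconds": eta_seconds,
        "metrics": to_metrics(logs),
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",  # assumed HTTP method
    )


def send(request, ca_cert_path, timeout=5.0):
    """Send the request, trusting only the CA bundle at `ca_cert_path`."""
    context = ssl.create_default_context(cafile=ca_cert_path)
    with urllib.request.urlopen(request, context=context, timeout=timeout) as resp:
        return resp.status
```

Separating request construction from sending keeps the payload logic testable without a cluster; only `send` needs a live endpoint and a CA file.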

Changed files

docs/source/en/main_classes/callback.md (modified, +2/-0)
src/transformers/integrations/__init__.py (modified, +4/-0)
src/transformers/integrations/integration_utils.py (modified, +216/-0)
src/transformers/training_args.py (modified, +6/-0)
tests/trainer/test_trainer_callback.py (modified, +232/-1)
tests/trainer/test_training_args.py (modified, +22/-0)

PR #3227: feat: add support for tracking TrainJob progress and training metrics

Repository: kubeflow/trainer
Author: robert-bell
State: closed | merged: True
Link: https://github.com/kubeflow/trainer/pull/3227

Description (problem / solution / changelog)

What this PR does / why we need it:

This PR implements TrainJobProgress (#2905), enabling real-time progress and metrics tracking for TrainJobs. Training pods can now push status updates (progress %, estimated time remaining, custom metrics) which are exposed via status.trainerStatus in the TrainJob CR.

It's still a WIP, and there's a few bits that still need working out, but I'd be keen for any feedback on the current approach.

Headline changes:

added a new trainerStatus field to the TrainJob status, using the spec from #2905
added new alpha feature gate TrainJobProgress, defaults to disabled. Everything is disabled if the gate is disabled.
added a new "progress server" in the controller that exposes an https endpoint for collecting progress updates from the runtime pods.
- server reuses the existing kubeflow-trainer-controller-manager service
- the server has TLS configured and reuses the webhook certs. Cert rotation is handled automatically using the existing cert rotator pattern.
- there's basic middleware: auth-n, logging, panic recovery, body size limits
- auth-n checks the request has a valid projected service account token.
- auth-z, which checks that the auth token was granted to a pod that is part of the train job (using a label on the pod)
- the server directly updates the trainerStatus field.
a new "progress" plugin that injects the runtime config into the training pods:
- env vars: KUBEFLOW_TRAINER_STATUS_URL, KUBEFLOW_TRAINER_STATUS_CA_CERT, KUBEFLOW_TRAINER_STATUS_TOKEN
- the projected service account token for authenticating with the progress server
- the control plane ca.crt copied into a configmap which the runtime pods mount to trust the progress server tls
- adds a label to the pods so we can identify which train job it belongs to (for auth-z).
adds new config section for the progress server
add ability to set feature-gates in a command line argument
- the command line takes precedence over the config file
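
The precedence rule in the last item can be illustrated with a short sketch. It assumes the flag uses the usual Kubernetes `Name=true,Other=false` syntax, which the PR description does not spell out; both function names are hypothetical.

```python
def parse_feature_gates(spec):
    """Parse "A=true,B=false" into {"A": True, "B": False} (assumed flag syntax)."""
    gates = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        value = value.strip().lower()
        if not sep or value not in ("true", "false"):
            raise ValueError(f"invalid feature gate entry: {item!r}")
        gates[name.strip()] = value == "true"
    return gates


def effective_gates(defaults, config_file, command_line):
    """Merge gate settings; later sources win: defaults < config file < CLI."""
    merged = dict(defaults)
    merged.update(config_file)
    merged.update(command_line)
    return merged
```

With `TrainJobProgress` disabled by default, either source can enable it, and a command-line value overrides whatever the config file says.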

The implementation mostly follows what we agreed in #2905, with a few changes worth pointing out -

the auth-n uses oidc discovery rather than using the TokenReview api. I've done this to avoid the auth-n path needing to make a request to the api. I read into this a bit more, and it looks like we're able to do this because the tokens are always projected service account tokens, they'll always be JWTs signed by the api server. I'd be happy to
I've given the progress server a separate go-client so it gets separate client-side rate limits and won't impact the client used in the main reconciler.
We never discussed the response types from the progress server, so I've suggested successful requests return a 200 code and the original payload, and error responses return a metav1.Status object to align with the k8s api server. I'd be happy to take any input on this.
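
On the client side, the response convention proposed in the last point could be handled roughly as follows. This is a sketch under the assumption that error bodies carry the standard Kubernetes Status fields (`kind`, `reason`, `message`); the function name is hypothetical.

```python
import json


def interpret_response(status_code, body_text):
    """Return ("ok", payload) for a 200, otherwise ("error", description)."""
    try:
        doc = json.loads(body_text) if body_text else {}
    except json.JSONDecodeError:
        doc = {}  # non-JSON body, e.g. from a proxy in front of the server
    if status_code == 200:
        return "ok", doc
    if isinstance(doc, dict) and doc.get("kind") == "Status":
        # Kubernetes-style error object: surface its reason and message.
        return "error", f"{doc.get('reason', 'Unknown')}: {doc.get('message', '')}"
    return "error", f"HTTP {status_code}"
```

Returning Kubernetes-style Status objects lets clients reuse the error handling they already have for the API server.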

There's a few TODOs left before this is ready for actually merging - I'll work through these but please do start reviewing these changes as I'd like to check folk are happy with the general approach.

wiring up the progress plugin to use the config (e.g. the server port, the service name). I'd actually appreciate some guidance on how best to pass the config through.
tidy up some of the constants, e.g. to avoid duplication, move to more sensible location.
add e2e test
better test coverage for server auth-z
(probably?) update the existing integration tests so the progress feature gate is enabled.
docs updates (possibly leave until we've got consensus on the implementation?)
sdk updates to help instrument the runtime (again, leave until we've got consensus on the implementation?)

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Part of #2779

Checklist:

Docs included if any changes are user facing

Changed files

api/openapi-spec/swagger.json (modified, +76/-0)
api/python_api/kubeflow_trainer_api/models/__init__.py (modified, +3/-0)
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_metric.py (added, +89/-0)
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_train_job_status.py (modified, +8/-2)
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_trainer_status.py (added, +102/-0)
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_update_train_job_status_request.py (added, +91/-0)
charts/kubeflow-trainer/README.md (modified, +4/-1)
charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml (modified, +60/-0)
charts/kubeflow-trainer/templates/manager/configmap.yaml (modified, +5/-0)
charts/kubeflow-trainer/templates/manager/deployment.yaml (modified, +3/-0)
charts/kubeflow-trainer/templates/manager/service.yaml (modified, +4/-0)
charts/kubeflow-trainer/values.yaml (modified, +7/-0)
cmd/trainer-controller-manager/main.go (modified, +27/-5)
docs/proposals/2779-trainjob-progress/README.md (modified, +7/-7)
go.mod (modified, +2/-0)
go.sum (modified, +4/-0)
hack/e2e-setup-cluster.sh (modified, +9/-0)
hack/e2e-setup-gpu-cluster.sh (modified, +9/-0)
manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml (modified, +60/-0)
manifests/base/manager/controller_manager_config.yaml (modified, +5/-0)
manifests/base/manager/manager.yaml (modified, +7/-0)
pkg/apis/config/v1alpha1/configuration_types.go (modified, +26/-0)
pkg/apis/config/v1alpha1/defaults.go (modified, +12/-0)
pkg/apis/config/v1alpha1/zz_generated.deepcopy.go (modified, +35/-0)
pkg/apis/trainer/v1alpha1/trainjob_types.go (modified, +74/-0)
pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go (modified, +74/-0)
pkg/apis/trainer/v1alpha1/zz_generated.openapi.go (modified, +112/-1)
pkg/client/applyconfiguration/trainer/v1alpha1/metric.go (added, +48/-0)
pkg/client/applyconfiguration/trainer/v1alpha1/trainerstatus.go (added, +82/-0)
pkg/client/applyconfiguration/trainer/v1alpha1/trainjobstatus.go (modified, +18/-0)
pkg/client/applyconfiguration/utils.go (modified, +4/-0)
pkg/config/config_test.go (modified, +66/-0)
pkg/config/validation.go (modified, +13/-0)
pkg/controller/trainjob_controller.go (modified, +1/-1)
pkg/features/features.go (modified, +11/-1)
pkg/runtime/core/clustertrainingruntime.go (modified, +2/-1)
pkg/runtime/core/clustertrainingruntime_test.go (modified, +2/-2)
pkg/runtime/core/core.go (modified, +4/-3)
pkg/runtime/core/registry.go (modified, +2/-1)
pkg/runtime/core/trainingruntime.go (modified, +3/-2)
pkg/runtime/core/trainingruntime_test.go (modified, +1/-1)
pkg/runtime/framework/core/framework.go (modified, +3/-2)
pkg/runtime/framework/core/framework_test.go (modified, +10/-9)
pkg/runtime/framework/plugins/coscheduling/coscheduling.go (modified, +2/-1)
pkg/runtime/framework/plugins/coscheduling/coscheduling_test.go (modified, +1/-1)
pkg/runtime/framework/plugins/flux/flux.go (modified, +2/-1)
pkg/runtime/framework/plugins/flux/flux_test.go (modified, +2/-2)
pkg/runtime/framework/plugins/jax/jax.go (modified, +2/-1)
pkg/runtime/framework/plugins/jax/jax_test.go (modified, +1/-1)
pkg/runtime/framework/plugins/jobset/jobset.go (modified, +2/-1)
pkg/runtime/framework/plugins/jobset/jobset_test.go (modified, +2/-2)
pkg/runtime/framework/plugins/mpi/mpi.go (modified, +2/-1)
pkg/runtime/framework/plugins/mpi/mpi_test.go (modified, +2/-2)
pkg/runtime/framework/plugins/plainml/plainml.go (modified, +2/-1)
pkg/runtime/framework/plugins/plainml/plainml_test.go (modified, +1/-1)
pkg/runtime/framework/plugins/registry.go (modified, +11/-2)
pkg/runtime/framework/plugins/torch/torch.go (modified, +2/-1)
pkg/runtime/framework/plugins/torch/torch_test.go (modified, +2/-2)
pkg/runtime/framework/plugins/trainjobstatus/trainjobstatus.go (added, +209/-0)
pkg/runtime/framework/plugins/trainjobstatus/trainjobstatus_test.go (added, +506/-0)
pkg/runtime/framework/plugins/volcano/volcano.go (modified, +2/-1)
pkg/runtime/framework/plugins/volcano/volcano_test.go (modified, +2/-2)
pkg/runtime/framework/plugins/xgboost/xgboost.go (modified, +2/-1)
pkg/runtime/framework/plugins/xgboost/xgboost_test.go (modified, +2/-2)
pkg/statusserver/auth.go (added, +154/-0)
pkg/statusserver/middleware.go (added, +64/-0)
pkg/statusserver/middleware_test.go (added, +31/-0)
pkg/statusserver/server.go (added, +263/-0)
pkg/statusserver/server_test.go (added, +156/-0)
pkg/statusserver/setup.go (added, +73/-0)
pkg/statusserver/utils.go (added, +34/-0)
pkg/util/cert/cert.go (modified, +29/-2)
pkg/webhooks/trainjob_webhook_test.go (modified, +2/-1)
test/e2e/e2e_test.go (modified, +65/-0)
test/e2e/testdata/status_update.py (added, +53/-0)
test/integration/framework/framework.go (modified, +1/-1)

Feature request

Add a KubeflowCallback to enable automatic progress and metrics reporting for training jobs running on Kubeflow Trainer, the Kubernetes-native platform for distributed AI/ML training.

Context: This is part of a coordinated effort with the Kubeflow community. The controller-side implementation is available in kubeflow/trainer#3227 which adds a status server that receives progress updates from training pods and exposes them via the TrainJob CR status. This HuggingFace callback would be the client-side integration that automatically reports progress from Transformers training loops.

When training runs inside a Kubeflow TrainJob, the callback would automatically:

Report training progress percentage (0-100%)
Calculate and report estimated time remaining (ETA)
Push training metrics (loss, accuracy, etc.) to the TrainJob status

This enables real-time visibility into training progress via standard Kubernetes APIs:

kubectl get trainjob my-llm-finetune -o jsonpath='{.status.trainerStatus}'
# {"progressPercentage": 45, "estimatedRemainingSeconds": 1200, "metrics": [{"name": "loss", "value": "0.234"}]}

Motivation

Problem: AI practitioners running HuggingFace training jobs on Kubernetes have no native way to monitor training progress. They must either:

Parse container logs manually
Set up external tracking systems (MLflow/W&B) which adds infrastructure overhead
Wait blindly for jobs to complete

Why Kubeflow matters:

Kubeflow Trainer is the standard for distributed training on Kubernetes (PyTorch, JAX, MPI, etc.)
Large organizations (Google, Red Hat, Bloomberg, etc.) use Kubeflow for production ML workloads
Kubernetes is becoming the default platform for LLM fine-tuning at scale

User experience improvement:

Before (no visibility):

kubectl get trainjob my-job
# NAME     STATUS    AGE
# my-job   Running   2h
# (Is it 10% done? 90% done? No idea.)

After (with KubeflowCallback):

kubectl get trainjob my-job -o jsonpath='{.status.trainerStatus}'
# {"progressPercentage": 67, "estimatedRemainingSeconds": 3600, "metrics": [{"name": "loss", "value": "0.15"}]}
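
For scripts or dashboards, the JSON printed by that command can be turned into a one-line summary. The sketch below assumes only the field names visible in the example output; the function name and output format are illustrative choices.

```python
import json


def summarize(status_json):
    """Render a trainerStatus JSON string as a short human-readable line."""
    status = json.loads(status_json)
    parts = []
    pct = status.get("progressPercentage")
    if pct is not None:
        parts.append(f"{pct}% done")
    eta = status.get("estimatedRemainingSeconds")
    if eta is not None:
        hours, rem = divmod(int(eta), 3600)
        minutes, seconds = divmod(rem, 60)
        parts.append(f"~{hours:d}h{minutes:02d}m{seconds:02d}s remaining")
    # Metric values arrive as strings in the example output, so print them as-is.
    for metric in status.get("metrics", []):
        parts.append(f"{metric['name']}={metric['value']}")
    return ", ".join(parts)
```

The input string could come from running the kubectl command above through `subprocess.run(..., capture_output=True, text=True)` and reading its `stdout`.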

Zero friction for users:

No code changes required. The callback auto-activates when the Kubeflow controller injects environment variables:

from transformers import Trainer, TrainingArguments

trainer = Trainer(model=model, args=TrainingArguments(...), train_dataset=ds)
trainer.train()  # Progress automatically reported when running in Kubeflow

Synergy with existing integrations:

This complements (not replaces) MLflow/W&B. Users can run both:

KubeflowCallback for Kubernetes-native observability (kubectl, dashboards, alerting)
MLflowCallback or WandbCallback for experiment tracking and artifact management

Your contribution

Yes, I am willing to submit a PR for this feature.

I am a contributor to Kubeflow Trainer and Kubeflow SDK and have been working on the progress tracking feature (KEP-2779) alongside Rob Bell. The implementation plan is finalized and the controller-side PR is in review: kubeflow/trainer#3227

extent analysis

Fix Plan

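Step 1: Install the Kubeflow SDK (optional dependency of the callback)

pip install kubeflow-sdk  # package name as given in the PR description; confirm it against the Kubeflow SDK install instructions

Step 2 (optional): Register the KubeflowCallback explicitly

Per PR #44487, the callback ships with transformers and activates automatically inside a TrainJob, so no import is required. To register it explicitly, pass it to the Trainer constructor; Trainer.train() does not accept a callbacks argument.

from transformers import Trainer, TrainingArguments
from transformers.integrations import KubeflowCallback  # added by PR #44487

# ... (rest of your code)
trainer = Trainer(
    model=model,
    args=TrainingArguments(...),
    train_dataset=ds,
    callbacks=[KubeflowCallback()],  # callbacks belong to the constructor
)
trainer.train()

Step 3: Run your training job on Kubeflow Trainer

kubectl has no "create trainjob" subcommand. TrainJob is a custom resource, so submit it from a manifest (trainjob.yaml is a placeholder for your own file, which runs python train.py in your image):

kubectl apply -f trainjob.yaml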

Step 4: Verify progress and metrics reporting

kubectl get trainjob my-job -o jsonpath='{.status.trainerStatus}'
# {"progressPercentage": 67, "estimatedRemainingSeconds": 3600, "metrics": [{"name": "loss", "value": "0.15"}]}

Verification

Check the progressPercentage and estimatedRemainingSeconds fields in the trainerStatus section of the TrainJob status.
Verify that the metrics field is populated with the expected training metrics (e.g., loss, accuracy).

Extra Tips

Use a Kubeflow Trainer release that includes the TrainJobProgress feature gate, and enable it: the gate is alpha and disabled by default, and nothing is reported while it is off.
If trainerStatus stays empty, check the Kubeflow Trainer controller logs for errors from the progress server.
The callback complements MLflow and W&B and does not replace them. KubeflowCallback can run alongside MLflowCallback or WandbCallback in the same training script.
