transformers - ✅ (Solved) Fix KubeflowCallback: Native progress reporting for Kubernetes-based Kubeflow training [2 pull requests, 1 comment, 1 participant]

huggingface/transformers#44486 (fetched 2026-04-08 00:28:08)
PR fix notes

PR #44487: feat(integration): Add KubeflowCallback to enable automatic progress …

Description (problem / solution / changelog)

What does this PR do?

Fixes #44486

Adds KubeflowCallback to enable automatic progress and metrics reporting for training jobs running on Kubeflow Trainer.

When training runs inside a Kubeflow TrainJob, the callback automatically:

  • Reports training progress percentage (0-100%)
  • Calculates and reports estimated time remaining (ETA)
  • Pushes training metrics (loss, accuracy, etc.) to the TrainJob status

This enables real-time visibility into training progress via standard Kubernetes APIs:

kubectl get trainjob my-job -o jsonpath='{.status.trainerStatus}'
# {"progressPercentage": 67, "estimatedRemainingSeconds": 3600, "metrics": [...]}
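
The progress percentage and ETA shown above could be derived from step counts roughly as follows. This is a minimal sketch, not the callback's actual code: the function name `estimate_progress` and the linear extrapolation from elapsed time are assumptions made for illustration; only the two field names come from the PR description.

```python
from typing import Optional


def estimate_progress(current_step: int, max_steps: int, elapsed_seconds: float) -> dict:
    """Return a progress percentage and a linear ETA estimate (hypothetical helper)."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")

    # Clamp so that resumed or overshooting runs never report outside 0-100%.
    step = min(max(current_step, 0), max_steps)
    percentage = int(100 * step / max_steps)

    eta: Optional[int]
    if step == 0:
        # No completed steps yet, so there is nothing to extrapolate from.
        eta = None
    else:
        seconds_per_step = elapsed_seconds / step
        eta = int(seconds_per_step * (max_steps - step))

    return {"progressPercentage": percentage, "estimatedRemainingSeconds": eta}
```

A linear estimate is the simplest choice; it ignores warmup, evaluation pauses, and checkpointing time, so a real implementation may smooth or correct it.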

Zero friction for users - no code changes required:

from transformers import Trainer, TrainingArguments

trainer = Trainer(model=model, args=TrainingArguments(...), train_dataset=ds)
trainer.train()  # Progress automatically reported when running in Kubeflow

Changes

  • Add is_kubeflow_available() function that detects Kubeflow environment via KUBEFLOW_TRAINER_STATUS_URL env var
  • Add KubeflowCallback class following the MLflowCallback/WandbCallback pattern
  • Register in INTEGRATION_TO_CALLBACK and get_available_reporting_integrations()
  • Add unit tests in tests/trainer/test_trainer_callback.py
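
The environment-based detection described in the first bullet could look roughly like this. It is a sketch under assumptions, not the code from the PR: the optional `environ` parameter is added here only to make the function easy to test, and treating a blank value as "not available" is a guess.

```python
import os
from typing import Mapping, Optional

# Variable name as given in the PR description.
STATUS_URL_ENV = "KUBEFLOW_TRAINER_STATUS_URL"


def is_kubeflow_available(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the status URL variable is set to a non-blank value."""
    # Fall back to the real process environment when no mapping is supplied.
    env = os.environ if environ is None else environ
    return bool(env.get(STATUS_URL_ENV, "").strip())
```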

Related PRs

Repository, PR, and description:

  • kubeflow/trainer: #3227, controller-side implementation (status server)
  • kubeflow/sdk: https://github.com/kubeflow/sdk/issues/367, SDK update_runtime_status() utility

Additional context

This is part of KEP-2779 - a coordinated effort with the Kubeflow community to enable real-time training progress tracking on Kubernetes.

The callback has an optional dependency on kubeflow-sdk. The SDK handles:

  • HTTP transport to the Kubeflow controller
  • Authentication via projected service account tokens
  • Throttling (max 1 update per 5 seconds)
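
The throttling behaviour in the last bullet (at most one update per 5 seconds) can be illustrated with a small gate like the one below. This is only a sketch of the idea: the `Throttle` class and its injectable `clock` are hypothetical and are not taken from the Kubeflow SDK.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Allow at most one action per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so tests do not need to sleep
        self._last_sent: Optional[float] = None

    def allow(self) -> bool:
        """Return True if an update may be sent now, and record the send time."""
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self.min_interval:
            return False  # too soon after the previous update: drop this one
        self._last_sent = now
        return True
```

Using a monotonic clock avoids surprises when the wall clock is adjusted during a long training run.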

When the TrainJobProgress feature gate is enabled in Kubeflow Trainer, the controller automatically injects these environment variables into training pods:

  • KUBEFLOW_TRAINER_STATUS_URL - HTTPS endpoint for status updates
  • KUBEFLOW_TRAINER_STATUS_CA_CERT - CA cert for TLS
  • KUBEFLOW_TRAINER_STATUS_TOKEN - Projected SA token for auth

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.


Changed files

  • docs/source/en/main_classes/callback.md (modified, +2/-0)
  • src/transformers/integrations/__init__.py (modified, +4/-0)
  • src/transformers/integrations/integration_utils.py (modified, +216/-0)
  • src/transformers/training_args.py (modified, +6/-0)
  • tests/trainer/test_trainer_callback.py (modified, +232/-1)
  • tests/trainer/test_training_args.py (modified, +22/-0)

PR #3227: feat: add support for tracking TrainJob progress and training metrics

Description (problem / solution / changelog)


What this PR does / why we need it:

This PR implements TrainJobProgress (#2905), enabling real-time progress and metrics tracking for TrainJobs. Training pods can now push status updates (progress %, estimated time remaining, custom metrics) which are exposed via status.trainerStatus in the TrainJob CR.
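
To make the shape of such a status update concrete, here is a minimal sketch that assembles a payload matching the trainerStatus JSON shown elsewhere on this page. The function name `build_status_payload` is hypothetical, and whether the real request wraps these fields in an outer object is not stated in the source; the field names and the string-valued metrics follow the example kubectl output.

```python
from typing import Mapping, Optional


def build_status_payload(progress_percentage: int,
                         estimated_remaining_seconds: Optional[int],
                         metrics: Mapping[str, float]) -> dict:
    """Assemble a trainerStatus-shaped dict (hypothetical helper)."""
    if not 0 <= progress_percentage <= 100:
        raise ValueError("progress_percentage must be between 0 and 100")

    payload: dict = {"progressPercentage": int(progress_percentage)}
    if estimated_remaining_seconds is not None:
        payload["estimatedRemainingSeconds"] = int(estimated_remaining_seconds)

    # Metric values are rendered as strings, as in the example output.
    payload["metrics"] = [
        {"name": name, "value": str(value)}
        for name, value in sorted(metrics.items())
    ]
    return payload
```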

It's still a WIP, and there are a few bits that still need working out, but I'd be keen for any feedback on the current approach.

Headline changes:

  • added a new trainerStatus field to the TrainJob status, using the spec from #2905
  • added new alpha feature gate TrainJobProgress, defaults to disabled. Everything is disabled if the gate is disabled.
  • added a new "progress server" in the controller that exposes an https endpoint for collecting progress updates from the runtime pods.
    • server reuses the existing kubeflow-trainer-controller-manager service
    • the server has TLS configured and reuses the webhook certs. Cert rotation is handled automatically using the existing cert rotator pattern.
    • there's basic middleware: auth-n, logging, panic recovery, body size limits
    • auth-n checks the request has a valid projected service account token.
    • auth-z, which checks that the auth token was granted to a pod that is part of the train job (using a label on the pod)
    • the server directly updates the trainerStatus field.
  • a new "progress" plugin that injects the runtime config into the training pods:
    • env vars: KUBEFLOW_TRAINER_STATUS_URL, KUBEFLOW_TRAINER_STATUS_CA_CERT, KUBEFLOW_TRAINER_STATUS_TOKEN
    • the projected service account token for authenticating with the progress server
    • the control plane ca.crt copied into a configmap which the runtime pods mount to trust the progress server tls
    • adds a label to the pods so we can identify which train job it belongs to (for auth-z).
  • adds new config section for the progress server
  • add ability to set feature-gates in a command line argument
    • the command line takes precedence over the config file
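
The precedence rule in the last bullet (command line over config file) can be sketched as a simple layered merge. This example is illustrative only: it assumes the usual Kubernetes `Name=true,Other=false` feature-gate syntax, which the PR description does not spell out, and both function names are hypothetical. The controller itself is written in Go; Python is used here purely to show the logic.

```python
from typing import Dict, Mapping


def parse_feature_gates(spec: str) -> Dict[str, bool]:
    """Parse a 'Name=true,Other=false' string into a dict (assumed syntax)."""
    gates: Dict[str, bool] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        value = value.strip().lower()
        if not sep or value not in ("true", "false"):
            raise ValueError(f"invalid feature gate entry: {item!r}")
        gates[name.strip()] = value == "true"
    return gates


def resolve_feature_gates(defaults: Mapping[str, bool],
                          config_file: Mapping[str, bool],
                          command_line: Mapping[str, bool]) -> Dict[str, bool]:
    """Merge gate sources; later layers override earlier ones."""
    resolved = dict(defaults)
    resolved.update(config_file)   # config file overrides built-in defaults
    resolved.update(command_line)  # command line overrides the config file
    return resolved
```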

The implementation mostly follows what we agreed in #2905, with a few changes worth pointing out -

  1. the auth-n uses oidc discovery rather than using the TokenReview api. I've done this to avoid the auth-n path needing to make a request to the api. I read into this a bit more, and it looks like we're able to do this because the tokens are always projected service account tokens, they'll always be JWTs signed by the api server. I'd be happy to
  2. I've given the progress server a separate go-client so it gets separate client-side rate limits and won't impact the client used in the main reconciler.
  3. We never discussed the response types from the progress server, so I've suggested successful requests return a 200 code and the original payload, and error responses return a metav1.Status object to align with the k8s api server. I'd be happy to take any input on this.

There are a few TODOs left before this is ready to merge. I'll work through these, but please do start reviewing the changes, as I'd like to check folk are happy with the general approach.

  • wiring up the progress plugin to use the config (e.g. the server port, the service name). I'd actually appreciate some guidance on how best to pass the config through.
  • tidy up some of the constants, e.g. to avoid duplication, move to more sensible location.
  • add e2e test
  • better test coverage for server auth-z
  • (probably?) update the existing integration tests so the progress feature gate is enabled.
  • docs updates (possibly leave until we've got consensus on the implementation?)
  • sdk updates to help instrument the runtime (again, leave until we've got consensus on the implementation?)

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Part of #2779

Checklist:

  • Docs included if any changes are user facing

Changed files

  • api/openapi-spec/swagger.json (modified, +76/-0)
  • api/python_api/kubeflow_trainer_api/models/__init__.py (modified, +3/-0)
  • api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_metric.py (added, +89/-0)
  • api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_train_job_status.py (modified, +8/-2)
  • api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_trainer_status.py (added, +102/-0)
  • api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_update_train_job_status_request.py (added, +91/-0)
  • charts/kubeflow-trainer/README.md (modified, +4/-1)
  • charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml (modified, +60/-0)
  • charts/kubeflow-trainer/templates/manager/configmap.yaml (modified, +5/-0)
  • charts/kubeflow-trainer/templates/manager/deployment.yaml (modified, +3/-0)
  • charts/kubeflow-trainer/templates/manager/service.yaml (modified, +4/-0)
  • charts/kubeflow-trainer/values.yaml (modified, +7/-0)
  • cmd/trainer-controller-manager/main.go (modified, +27/-5)
  • docs/proposals/2779-trainjob-progress/README.md (modified, +7/-7)
  • go.mod (modified, +2/-0)
  • go.sum (modified, +4/-0)
  • hack/e2e-setup-cluster.sh (modified, +9/-0)
  • hack/e2e-setup-gpu-cluster.sh (modified, +9/-0)
  • manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml (modified, +60/-0)
  • manifests/base/manager/controller_manager_config.yaml (modified, +5/-0)
  • manifests/base/manager/manager.yaml (modified, +7/-0)
  • pkg/apis/config/v1alpha1/configuration_types.go (modified, +26/-0)
  • pkg/apis/config/v1alpha1/defaults.go (modified, +12/-0)
  • pkg/apis/config/v1alpha1/zz_generated.deepcopy.go (modified, +35/-0)
  • pkg/apis/trainer/v1alpha1/trainjob_types.go (modified, +74/-0)
  • pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go (modified, +74/-0)
  • pkg/apis/trainer/v1alpha1/zz_generated.openapi.go (modified, +112/-1)
  • pkg/client/applyconfiguration/trainer/v1alpha1/metric.go (added, +48/-0)
  • pkg/client/applyconfiguration/trainer/v1alpha1/trainerstatus.go (added, +82/-0)
  • pkg/client/applyconfiguration/trainer/v1alpha1/trainjobstatus.go (modified, +18/-0)
  • pkg/client/applyconfiguration/utils.go (modified, +4/-0)
  • pkg/config/config_test.go (modified, +66/-0)
  • pkg/config/validation.go (modified, +13/-0)
  • pkg/controller/trainjob_controller.go (modified, +1/-1)
  • pkg/features/features.go (modified, +11/-1)
  • pkg/runtime/core/clustertrainingruntime.go (modified, +2/-1)
  • pkg/runtime/core/clustertrainingruntime_test.go (modified, +2/-2)
  • pkg/runtime/core/core.go (modified, +4/-3)
  • pkg/runtime/core/registry.go (modified, +2/-1)
  • pkg/runtime/core/trainingruntime.go (modified, +3/-2)
  • pkg/runtime/core/trainingruntime_test.go (modified, +1/-1)
  • pkg/runtime/framework/core/framework.go (modified, +3/-2)
  • pkg/runtime/framework/core/framework_test.go (modified, +10/-9)
  • pkg/runtime/framework/plugins/coscheduling/coscheduling.go (modified, +2/-1)
  • pkg/runtime/framework/plugins/coscheduling/coscheduling_test.go (modified, +1/-1)
  • pkg/runtime/framework/plugins/flux/flux.go (modified, +2/-1)
  • pkg/runtime/framework/plugins/flux/flux_test.go (modified, +2/-2)
  • pkg/runtime/framework/plugins/jax/jax.go (modified, +2/-1)
  • pkg/runtime/framework/plugins/jax/jax_test.go (modified, +1/-1)
  • pkg/runtime/framework/plugins/jobset/jobset.go (modified, +2/-1)
  • pkg/runtime/framework/plugins/jobset/jobset_test.go (modified, +2/-2)
  • pkg/runtime/framework/plugins/mpi/mpi.go (modified, +2/-1)
  • pkg/runtime/framework/plugins/mpi/mpi_test.go (modified, +2/-2)
  • pkg/runtime/framework/plugins/plainml/plainml.go (modified, +2/-1)
  • pkg/runtime/framework/plugins/plainml/plainml_test.go (modified, +1/-1)
  • pkg/runtime/framework/plugins/registry.go (modified, +11/-2)
  • pkg/runtime/framework/plugins/torch/torch.go (modified, +2/-1)
  • pkg/runtime/framework/plugins/torch/torch_test.go (modified, +2/-2)
  • pkg/runtime/framework/plugins/trainjobstatus/trainjobstatus.go (added, +209/-0)
  • pkg/runtime/framework/plugins/trainjobstatus/trainjobstatus_test.go (added, +506/-0)
  • pkg/runtime/framework/plugins/volcano/volcano.go (modified, +2/-1)
  • pkg/runtime/framework/plugins/volcano/volcano_test.go (modified, +2/-2)
  • pkg/runtime/framework/plugins/xgboost/xgboost.go (modified, +2/-1)
  • pkg/runtime/framework/plugins/xgboost/xgboost_test.go (modified, +2/-2)
  • pkg/statusserver/auth.go (added, +154/-0)
  • pkg/statusserver/middleware.go (added, +64/-0)
  • pkg/statusserver/middleware_test.go (added, +31/-0)
  • pkg/statusserver/server.go (added, +263/-0)
  • pkg/statusserver/server_test.go (added, +156/-0)
  • pkg/statusserver/setup.go (added, +73/-0)
  • pkg/statusserver/utils.go (added, +34/-0)
  • pkg/util/cert/cert.go (modified, +29/-2)
  • pkg/webhooks/trainjob_webhook_test.go (modified, +2/-1)
  • test/e2e/e2e_test.go (modified, +65/-0)
  • test/e2e/testdata/status_update.py (added, +53/-0)
  • test/integration/framework/framework.go (modified, +1/-1)


Feature request

Add a KubeflowCallback to enable automatic progress and metrics reporting for training jobs running on Kubeflow Trainer, the Kubernetes-native platform for distributed AI/ML training.

Context: This is part of a coordinated effort with the Kubeflow community. The controller-side implementation is available in kubeflow/trainer#3227 which adds a status server that receives progress updates from training pods and exposes them via the TrainJob CR status. This HuggingFace callback would be the client-side integration that automatically reports progress from Transformers training loops.

When training runs inside a Kubeflow TrainJob, the callback would automatically:

  • Report training progress percentage (0-100%)
  • Calculate and report estimated time remaining (ETA)
  • Push training metrics (loss, accuracy, etc.) to the TrainJob status

This enables real-time visibility into training progress via standard Kubernetes APIs:

kubectl get trainjob my-llm-finetune -o jsonpath='{.status.trainerStatus}'
# {"progressPercentage": 45, "estimatedRemainingSeconds": 1200, "metrics": [{"name": "loss", "value": "0.234"}]}

Motivation

Problem: AI practitioners running HuggingFace training jobs on Kubernetes have no native way to monitor training progress. They must either:

  1. Parse container logs manually
  2. Set up external tracking systems (MLflow/W&B) which adds infrastructure overhead
  3. Wait blindly for jobs to complete

Why Kubeflow matters:

  • Kubeflow Trainer is the standard for distributed training on Kubernetes (PyTorch, JAX, MPI, etc.)
  • Large organizations (Google, Red Hat, Bloomberg, etc.) use Kubeflow for production ML workloads
  • Kubernetes is becoming the default platform for LLM fine-tuning at scale

User experience improvement:

Before (no visibility):

kubectl get trainjob my-job
# NAME     STATUS    AGE
# my-job   Running   2h
# (Is it 10% done? 90% done? No idea.)

After (with KubeflowCallback):

kubectl get trainjob my-job -o jsonpath='{.status.trainerStatus}'
# {"progressPercentage": 67, "estimatedRemainingSeconds": 3600, "metrics": [{"name": "loss", "value": "0.15"}]}

Zero friction for users:

No code changes required. The callback auto-activates when the Kubeflow controller injects environment variables:

from transformers import Trainer, TrainingArguments

trainer = Trainer(model=model, args=TrainingArguments(...), train_dataset=ds)
trainer.train()  # Progress automatically reported when running in Kubeflow

Synergy with existing integrations:

This complements (not replaces) MLflow/W&B. Users can run both:

  • KubeflowCallback for Kubernetes-native observability (kubectl, dashboards, alerting)
  • MLflowCallback or WandbCallback for experiment tracking and artifact management

Your contribution

Yes, I am willing to submit a PR for this feature.

I am a contributor to Kubeflow Trainer and Kubeflow SDK and have been working on the progress tracking feature (KEP-2779) alongside Rob Bell. The implementation plan is finalized and the controller-side PR is in review at kubeflow/trainer#3227.

extent analysis

Fix Plan

Step 1: Install the optional Kubeflow SDK dependency used by the KubeflowCallback

pip install kubeflow-sdk

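Step 2: Make sure the KubeflowCallback is active in your training script

# Per the PR description, no code change is required: the callback is registered in
# INTEGRATION_TO_CALLBACK and activates when the Kubeflow environment variables are present.
# To attach it explicitly instead, pass it to the Trainer constructor (Trainer.train() does not accept callbacks).
from transformers import Trainer, TrainingArguments
from transformers.integrations import KubeflowCallback  # import path assumed from the PR's changed files

# ... (rest of your code)
trainer = Trainer(model=model, args=TrainingArguments(...), train_dataset=ds, callbacks=[KubeflowCallback()])
trainer.train()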

Step 3: Run your training job on Kubeflow Trainer

# TrainJob is a custom resource, so submit it from a manifest; trainjob.yaml is a placeholder file name.
kubectl apply -f trainjob.yaml

Step 4: Verify progress and metrics reporting

kubectl get trainjob my-job -o jsonpath='{.status.trainerStatus}'
# {"progressPercentage": 67, "estimatedRemainingSeconds": 3600, "metrics": [{"name": "loss", "value": "0.15"}]}

Verification

  • Check the progressPercentage and estimatedRemainingSeconds fields in the trainerStatus section of the TrainJob status.
  • Verify that the metrics field is populated with the expected training metrics (e.g., loss, accuracy).
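
The two checks above can also be scripted. The sketch below assumes the kubectl command prints the trainerStatus object as JSON, as in the example output; the function name `check_trainer_status` and the specific problem messages are hypothetical.

```python
import json
from typing import List


def check_trainer_status(raw: str) -> List[str]:
    """Return a list of problems found in a trainerStatus JSON string."""
    status = json.loads(raw)
    problems: List[str] = []

    # Both progress fields should be present once the callback is reporting.
    for field in ("progressPercentage", "estimatedRemainingSeconds"):
        if field not in status:
            problems.append(f"missing {field}")

    pct = status.get("progressPercentage")
    if pct is not None and not 0 <= pct <= 100:
        problems.append("progressPercentage out of range")

    # An absent or empty metrics list means no training metrics were pushed.
    if not status.get("metrics"):
        problems.append("metrics is empty")

    return problems
```

An empty list means the status looks as expected; for example, the output of the kubectl command in Step 4 could be piped into a script that calls this function.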

Extra Tips

  • Make sure your Kubeflow Trainer version includes the progress server and that the TrainJobProgress feature gate is enabled; per PR #3227 it is an alpha gate that is disabled by default.
  • If you encounter any issues, check the Kubeflow Trainer controller logs for errors.
  • This feature complements existing integrations such as MLflow and W&B. You can use KubeflowCallback alongside other callbacks (e.g., MLflowCallback, WandbCallback) in the same training script.
