[pytorch] AOTI CUDA model: `update_inactive_constant_buffer` no longer accepts CPU-sourced constant updates after mixed-device constants support

Source: pytorch/pytorch#180446 (fetched 2026-04-17 08:22:31)

After commit 34be219 ([AOTI] Support mixed-device constants, PR #169504), AOTI can load mixed-device constants for CUDA models, but update_constant_buffer and update_inactive_constant_buffer now reject CPU tensors for CUDA models. The commit message says that mixed-device constant loading is supported and that weight update is not touched in that PR. In practice, however, the new runtime check makes our existing low-memory constant update flow unusable for CUDA models.

We currently run production inference with AOTI on PyTorch v2.8.0 (CUDA) and are testing newer branches for migration. On a branch that includes 34be219 (introduced via PR #169504 and present in the v2.11.0 line), our service crashes during periodic constant updates.


🐛 Describe the bug

Environment

  • Current production baseline: PyTorch v2.8.0 CUDA
  • Migration target under evaluation: newer branches including commit 34be219 / PR #169504
  • Runtime: AOTI CUDA model container
  • Use case: periodic dense parameter updates with inactive-buffer swap
  • Memory pressure: model already uses about 70%+ of GPU memory in our workload

Crash

We hit the following runtime failure when calling update_inactive_constant_buffer:

E0415 08:18:57.955665 51196 ExceptionTracer.cpp:222] exception stack complete
terminate called after throwing an instance of 'std::runtime_error'
  what():  update_inactive_constant_buffer_func_( container_handle_, (AOTInductorConstantMapHandle)&const_map) API call failed at /work/wei_gpu_serving/docker/pytorch_src/pytorch/torch/csrc/inductor/aoti_runner/model_container_runner.cpp, line 352
*** Aborted at 1776241137 (Unix time, try 'date -d @1776241137') ***
*** Signal 6 (SIGABRT) (0xc7fc) received by PID 51196 (pthread TID 0x7ff618472000) (linux TID 51196) (maybe from PID 51196, UID 0) (code: -6), stack trace: ***
    @ 0000000000295aa2 folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

The underlying change appears to be the mixed-device validation added in model_container.h, which now throws if update tensors are not on the model device.

Our update flow

Our dense parameter update path is designed to minimize GPU memory usage and avoid unnecessary copies on the caller side.

We keep updated dense parameters on CPU, then build a constant map and call:

  1. runner_->update_inactive_constant_buffer(data);
  2. runner_->run_const_fold(/* use_inactive = */ true); (when runtime constant folding is enabled)
  3. runner_->swap_constant_buffer();

Representative code on our side looks like this:

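// 1. Build a CPU tensor from each raw parameter buffer.
//    clone() gives the tensor its own CPU storage.
constant_map_[iter.first] =
    torch::from_blob(
        iter.second.data(),
        {(int64_t)iter.second.size()},
        torch::kF32)
        .clone();

// 2. Collect pointers to those CPU tensors in the map type AOTI expects.
torch::inductor::TensorConstantMap data_map;
for (auto& iter : constant_map_) {
    data_map.emplace(iter.first, &(iter.second));
}

// 3. Hand the map to our runner wrapper.
dynamic_cast<TorchRunner*>(trigger_->GetRunner().get())->Update(data_map);

// 4. Excerpt from the update path (`data` is presumably the map passed in
//    above): write into the inactive buffer, optionally re-run constant
//    folding on it, then swap it in for inference.
c10::cuda::set_device(device_id_);
runner_->update_inactive_constant_buffer(data);
if (use_runtime_constant_folding_) {
    runner_->run_const_fold(/* use_inactive = */ true);
}
runner_->swap_constant_buffer();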

The goal here is:

  • keep source parameters on CPU,
  • copy them directly into the inactive AOTI constant buffer,
  • then swap buffers for inference,
  • while keeping peak GPU memory as low as possible.

For large dense layers, this matters a lot to us.
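To make the inactive-buffer pattern behind these steps concrete, here is a minimal standalone sketch in plain C++. It is not AOTI code: `DoubleBufferedConstants` and its methods are hypothetical names, and host vectors stand in for device memory. It only illustrates that an update writes into the buffer inference is not reading, and that the new values become visible after the swap.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a constant container with two buffers.
// None of these names come from PyTorch.
class DoubleBufferedConstants {
 public:
  using ConstantMap = std::unordered_map<std::string, std::vector<float>>;

  // Write new values into the buffer that inference is NOT reading from.
  void update_inactive(const ConstantMap& updates) {
    ConstantMap& inactive = buffers_[1 - active_];
    for (const auto& kv : updates) {
      inactive[kv.first] = kv.second;
    }
  }

  // Make the freshly written buffer the one inference reads from.
  void swap() { active_ = 1 - active_; }

  // Read a constant from the buffer currently used for inference.
  const std::vector<float>& read_active(const std::string& name) const {
    return buffers_[active_].at(name);
  }

 private:
  std::array<ConstantMap, 2> buffers_;
  std::size_t active_ = 0;
};
```

In this model, a reader that calls `read_active` between `update_inactive` and `swap` still sees the old values, which is the property that lets inference continue while new weights are being written.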

Why the current behavior is a problem

After the validation introduced around 34be219, we can no longer pass CPU tensors into update_inactive_constant_buffer for a CUDA model.

That means the caller must now do something like:

  1. construct/update tensors on CPU,
  2. explicitly .to(cuda) them on the caller side,
  3. call update_inactive_constant_buffer,
  4. then let AOTI internally copy again into its inactive constant buffer.

So compared with our previous flow, we now temporarily need:

  • the caller-owned CUDA tensors, plus
  • the inactive constant buffer inside AOTI,

for the same dense parameters.

For our workload, this extra temporary GPU residency is dangerous. Our model already uses roughly 70%+ of device memory, so duplicating a large set of dense constants on GPU during update can push us into OOM territory.
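The memory argument can be written down as simple arithmetic. The sketch below is illustrative only: the struct, the function names, and any numbers plugged in are assumptions, not measurements from our service. It also assumes that the inactive constant buffer is already allocated and counted in the steady-state baseline.

```cpp
#include <cstdint>

// All sizes in MiB. Names are hypothetical and used for illustration only.
struct UpdatePeak {
  std::int64_t direct_h2d;     // CPU tensors copied straight into the inactive buffer
  std::int64_t caller_staged;  // caller first materializes CUDA copies of every constant
};

// baseline_mib: GPU memory in use during steady-state inference (assumed to
//               include both the active and the inactive constant buffer).
// updated_constants_mib: total size of the constants being refreshed.
UpdatePeak peak_during_update(std::int64_t baseline_mib,
                              std::int64_t updated_constants_mib) {
  UpdatePeak p;
  // Direct host-to-device copy adds no extra device allocation.
  p.direct_h2d = baseline_mib;
  // Caller-side .to(cuda) keeps a second copy alive until the update returns.
  p.caller_staged = baseline_mib + updated_constants_mib;
  return p;
}

// True if the given peak fits on a device with the given capacity.
bool fits(std::int64_t peak_mib, std::int64_t capacity_mib) {
  return peak_mib <= capacity_mib;
}
```

With made-up inputs of an 11500 MiB baseline (about 70% of a hypothetical 16384 MiB device) and 5000 MiB of refreshed constants, the direct path peaks at 11500 MiB while the staged path peaks at 16500 MiB and no longer fits.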

In other words, mixed-device constant loading support is helpful, but the current update path seems to block a practical low-memory update pattern for CUDA inference.

Relevant runtime check

The change that appears to block this flow is the mixed-device guard added to update_constant_buffer, which rejects update tensors whose device type differs from the model device:

// update_constant_buffer does not support mixed CPU/CUDA constants
int32_t model_device_type = models_[0]->get_device_type();
for (const auto& kv : constants_map) {
  int32_t tensor_device_type = 0;
  aoti_torch_get_device_type(kv.second, &tensor_device_type);
  if (tensor_device_type != model_device_type) {
    throw std::runtime_error(
        "update_constant_buffer does not support mixed device constants. "
        "Constant '" +
        kv.first + "' has device type " +
        std::to_string(tensor_device_type) +
        " but model expects device type " +
        std::to_string(model_device_type));
  }
}
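To show what this guard reports for our case, here is a standalone model of it in plain C++. It replaces tensor handles with bare device type codes, and `check_update_devices` is a hypothetical name, so treat it as a sketch of the logic rather than the actual AOTI code. The codes follow `c10::DeviceType`, where CPU is 0 and CUDA is 1.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Device type codes as defined by c10::DeviceType.
constexpr std::int32_t kCpu = 0;
constexpr std::int32_t kCuda = 1;

// Simplified stand-in for the guard: the map holds device type codes
// instead of real tensor handles. The function name is hypothetical.
void check_update_devices(
    const std::map<std::string, std::int32_t>& constant_device_types,
    std::int32_t model_device_type) {
  for (const auto& kv : constant_device_types) {
    if (kv.second != model_device_type) {
      // Same message format as the guard quoted above.
      throw std::runtime_error(
          "update_constant_buffer does not support mixed device constants. "
          "Constant '" + kv.first + "' has device type " +
          std::to_string(kv.second) + " but model expects device type " +
          std::to_string(model_device_type));
    }
  }
}
```

Passing a CPU-resident constant to a CUDA model in this model yields a message ending in "has device type 0 but model expects device type 1". The crash log above shows only the generic "API call failed" text, so this more specific message may not reach the caller in our setup.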

Question / expected behavior

Would it be possible for AOTI to support CPU-sourced updates for CUDA models by copying directly into the inactive constant buffer (host-to-device) without requiring the caller to first materialize a full CUDA copy of all updated constants?

That would preserve a much better memory profile for periodic weight updates.

Concretely, would one of the following be possible?

  1. Allow CPU tensors in update_constant_buffer / update_inactive_constant_buffer when the model device is CUDA, and perform direct H2D copies internally into the inactive buffer.
  2. Add an opt-in mode / flag for this low-memory update path.
  3. If the strict current behavior is intentional, document a recommended low-memory update path that is equivalent to the older practical behavior we relied on.

Why this matters for real deployments

This is not just a convenience issue for us. In production-style AOTI deployments with:

  • periodic weight refresh,
  • inactive-buffer swap,
  • runtime constant folding,
  • and already high baseline GPU memory usage,

forcing caller-side .to(cuda) before update_inactive_constant_buffer can make the difference between a workable deployment and an OOM-prone one.

Additional context

The commit message for 34be219 says:

  • CUDA models may contain CPU constants,
  • AOTI previously loaded all weights to a single device,
  • this PR adds loading support for constants on either the model device or CPU,
  • and "Weight update is not touched in this PR. It will come next."

Given that note, I wanted to ask whether support for CPU->CUDA constant update is planned, or whether maintainers would be open to a change in this area.

Links

  • Commit: 34be21975042a4d479b41ebb1f4e3cdd6be2541f
  • PR: #169504 ([AOTI] Support mixed-device constants)
  • Commit author: @desertfire
  • Reviewers referenced on the PR: @larryliu0820, @muchulee8

Versions

  • Good: 2.8.0a0+gitba56102
  • Bad: 2.12.0a0+git2487fa5
  • Suspected first bad commit: 34be21975042a4d479b41ebb1f4e3cdd6be2541f
  • Related PR: #169504
  • CUDA: 12.9
  • cuDNN: 9.10.2 (91002)
  • GPU: Tesla T4 (sm_75)
  • Platform: Linux
  • ABI: _GLIBCXX_USE_CXX11_ABI=True

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @chauhang @penguinwu @avikchaudhuri @zhxchen17 @tugsbayasgalan @angelayi @ydwu4 @desertfire @yushangdi @jataylo @iupaikov-amd

extent analysis

TL;DR

The most likely fix is to modify the update_constant_buffer and update_inactive_constant_buffer functions to support CPU-sourced updates for CUDA models by copying directly into the inactive constant buffer.

Guidance

  • Identify the specific lines of code that throw the std::runtime_error exception when CPU tensors are passed to update_inactive_constant_buffer for a CUDA model.
  • Consider modifying the update_constant_buffer and update_inactive_constant_buffer functions to allow CPU tensors when the model device is CUDA, and perform direct host-to-device (H2D) copies internally into the inactive buffer.
  • Alternatively, explore adding an opt-in mode or flag for this low-memory update path to preserve the current behavior while providing a workaround for users who need it.
  • Review the commit message for 34be219 and the related PR #169504 to understand the intent behind the mixed-device constant loading support and the planned weight update changes.

Example

// Hypothetical relaxation of the guard in update_constant_buffer.
// This is a sketch, not the upstream fix.
// Device type codes follow c10::DeviceType: CPU = 0, CUDA = 1.
constexpr int32_t kCpuDeviceType = 0;
constexpr int32_t kCudaDeviceType = 1;

int32_t model_device_type = models_[0]->get_device_type();
for (const auto& kv : constants_map) {
  int32_t tensor_device_type = 0;
  aoti_torch_get_device_type(kv.second, &tensor_device_type);
  if (tensor_device_type == model_device_type) {
    continue;
  }
  // Let CPU-sourced updates through for CUDA models. This assumes the copy
  // step that follows the guard can read from a host pointer and write
  // directly into the inactive buffer (H2D), so no intermediate CUDA tensor
  // is created. That assumption must be verified against the actual code.
  if (tensor_device_type == kCpuDeviceType &&
      model_device_type == kCudaDeviceType) {
    continue;
  }
  throw std::runtime_error(
      "update_constant_buffer does not support mixed device constants. "
      "Constant '" +
      kv.first + "' has device type " +
      std::to_string(tensor_device_type) +
      " but model expects device type " +
      std::to_string(model_device_type));
}

Notes

The provided code snippet is a hypothetical example and may require modifications to fit the actual implementation. The fix should be carefully tested to ensure it works correctly and does not introduce any regressions.

Recommendation

Until upstream supports CPU-sourced updates for CUDA models, there are two practical options: patch the device guard in update_constant_buffer in a custom PyTorch build, or move the updated tensors to CUDA on the caller side before calling update_inactive_constant_buffer and accept the higher peak GPU memory.
