Live training diagnostics

Catch training bottlenecks live.

TraceML shows step timing, memory, and worker imbalance while PyTorch runs are active. Afterward, it saves a compact summary.

Open source today
Local-first
Single-node multi-GPU
PyTorch + NVIDIA CUDA
Low overhead
SLOW TRAINING DETECTED · step 1,240

Live step view — last 100 steps
Median step 23.1 ms · Slowest worker 25.9 ms

  Data load   13.4 ms
  Forward      4.4 ms
  Backward     3.2 ms
  Optimizer    1.8 ms

Worker behavior
  Slow-worker gap +12.1% on worker 3

Memory
  GPU memory 14.2 / 96 GB · peak 17.1 GB ↑
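
The worker readout above is consistent with a simple relative gap, (slowest − median) / median: here (25.9 − 23.1) / 23.1 ≈ +12.1%.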

Live view, then summary

One lightweight workflow for debugging training performance as it happens and reviewing it later.

Live terminal view
Watch the run
Track step time, throughput, memory, and worker behavior while training is active.
Step-level signal
Find the bottleneck
Break down data load, forward, backward, and optimizer timing (see the sketch after these cards).
End-of-run summary
Keep the evidence
Save the useful timing and memory signals after the run finishes.
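
For intuition, here is one way a breakdown like this can be measured in plain PyTorch: CUDA events around each GPU phase, a host clock for data loading, and torch.cuda.max_memory_allocated() for the peak-memory signal. A minimal sketch only: model, dataloader, criterion, and optimizer are placeholders assumed to exist on a CUDA device, and this is not TraceML's actual implementation.

import time
import torch

def timed_step(model, batch, criterion, optimizer):
    # Illustrative only: CUDA events bracket each phase of one step.
    ev = {name: (torch.cuda.Event(enable_timing=True),
                 torch.cuda.Event(enable_timing=True))
          for name in ("forward", "backward", "optimizer")}

    ev["forward"][0].record()
    loss = criterion(model(batch["x"]), batch["y"])
    ev["forward"][1].record()

    ev["backward"][0].record()
    loss.backward()
    ev["backward"][1].record()

    ev["optimizer"][0].record()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    ev["optimizer"][1].record()

    torch.cuda.synchronize()  # event times are only valid after a sync
    return {name: start.elapsed_time(end)  # milliseconds
            for name, (start, end) in ev.items()}

# Data-load time is the host-side gap before each batch arrives.
t0 = time.perf_counter()
for batch in dataloader:
    data_load_ms = (time.perf_counter() - t0) * 1000
    phase_ms = timed_step(model, batch, criterion, optimizer)
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    t0 = time.perf_counter()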

Simple setup

No dashboard, no rollout. Add one wrapper around the training step and run as usual.

1. Install TraceML
   Install it where your PyTorch job already runs (install command below).
2. Wrap your training step
   Add one context manager around the existing step.
3. Run and watch
   Watch live diagnostics and keep the summary.
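
For step 1, assuming the package is published on PyPI under the name traceml (the CLI below suggests as much, but check the project's README for the canonical command):

$ pip install traceml
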
train.py — minimal change
from traceml.decorators import trace_step

for batch in dataloader:
    # One context manager around the existing step; TraceML times what
    # happens inside it.
    with trace_step(model):
        outputs = model(batch["x"])            # forward
        loss = criterion(outputs, batch["y"])
        loss.backward()                        # backward
        optimizer.step()                       # optimizer update
        optimizer.zero_grad(set_to_none=True)
$ traceml run train.py
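
To try the whole flow end to end, the snippet above can slot into a self-contained toy script like this one. Everything except the trace_step usage from the example above is a placeholder: a throwaway model, synthetic data, arbitrary hyperparameters.

# toy_train.py — placeholder training job for trying TraceML
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from traceml.decorators import trace_step

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
dataloader = DataLoader(dataset, batch_size=64)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for x, y in dataloader:
    with trace_step(model):
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

Run it the same way: $ traceml run toy_train.py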

Where TraceML is going

TraceOpt is building local-first observability for ML training runs, starting with TraceML.

Now
Single-node multi-GPU
PyTorch training on NVIDIA CUDA, running on one machine with multiple GPUs.
Next
Multi-node multi-GPU
Diagnostics across distributed training jobs, including timing gaps between workers.
Then
Slurm workflows
Cleaner support for scheduled cluster jobs and existing launch workflows.

Frequently asked questions

What does TraceOpt offer today?
TraceML: live terminal diagnostics and compact end-of-run summaries for PyTorch training.
What is supported now?
Single-node multi-GPU PyTorch runs on NVIDIA CUDA.
What is coming next?
Multi-node multi-GPU support, followed by Slurm workflow support.
What is the performance overhead?
TraceML is designed to stay lightweight. Actual overhead depends on workload and configuration; a quick way to measure it on your own job is sketched after this FAQ.
What if my team needs TensorFlow, JAX, or another stack?
If your team runs another stack and this problem matters to you, please reach out to us.
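
On overhead: one quick, unscientific way to put a number on it is to wall-clock the same steps with and without the wrapper. A sketch only, assuming the dataloader, model, criterion, and optimizer from your training script are in scope:

import time
from contextlib import nullcontext
import torch
from traceml.decorators import trace_step

def time_steps(traced, n_steps=100):
    it = iter(dataloader)        # your existing training objects
    torch.cuda.synchronize()     # settle the GPU before reading the clock
    start = time.perf_counter()
    for _ in range(n_steps):
        batch = next(it)
        with trace_step(model) if traced else nullcontext():
            loss = criterion(model(batch["x"]), batch["y"])
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return time.perf_counter() - start

baseline = time_steps(traced=False)
with_traceml = time_steps(traced=True)
print(f"overhead: {100 * (with_traceml / baseline - 1):+.1f}%")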

Become a design partner

We are looking for teams with real GPU training workloads, especially if:

  • you run PyTorch training on single-node multi-GPU systems today
  • you are moving toward multi-node or Slurm-based training workflows
  • you want to understand slow or unstable runs faster
  • your team needs clearer diagnostics during and after training

Design partners get early access and direct input into the roadmap.