Live training diagnostics

Catch training bottlenecks live.

TraceML shows step timing, memory, and worker imbalance while PyTorch runs are active. Afterward, it saves a compact summary.

Open source today
Local-first
Single-node multi-GPU
PyTorch + NVIDIA CUDA
Low overhead
SLOW TRAINING DETECTED · step 1,240

Live step view — last 100 steps
Median step 23.1 ms · Slowest worker 25.9 ms

  Data load   13.4 ms
  Forward      4.4 ms
  Backward     3.2 ms
  Optimizer    1.8 ms

Worker behavior
  Slow-worker gap +12.1% on worker 3

Memory
  GPU memory 14.2 / 96 GB · peak 17.1 GB ↑
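
The worker readout above is consistent with a simple relative gap, (slowest − median) / median: here (25.9 − 23.1) / 23.1 ≈ +12.1%.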

Live view, then summary

One lightweight workflow for debugging training performance as it happens and reviewing it later.

Live terminal view
Watch the run
Track step time, throughput, memory, and worker behavior while training is active.
Step-level signal
Find the bottleneck
Break down data load, forward, backward, and optimizer timing (see the sketch after these cards).
End-of-run summary
Keep the evidence
Save the useful timing and memory signals after the run finishes.
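
For intuition, here is one way a breakdown like this can be measured in plain PyTorch: CUDA events around each GPU phase, a host clock for data loading, and torch.cuda.max_memory_allocated() for the peak-memory signal. A minimal sketch only: model, dataloader, criterion, and optimizer are placeholders assumed to exist on a CUDA device, and this is not TraceML's actual implementation.

import time
import torch

def timed_step(model, batch, criterion, optimizer):
    # Illustrative only: CUDA events bracket each phase of one step.
    ev = {name: (torch.cuda.Event(enable_timing=True),
                 torch.cuda.Event(enable_timing=True))
          for name in ("forward", "backward", "optimizer")}

    ev["forward"][0].record()
    loss = criterion(model(batch["x"]), batch["y"])
    ev["forward"][1].record()

    ev["backward"][0].record()
    loss.backward()
    ev["backward"][1].record()

    ev["optimizer"][0].record()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    ev["optimizer"][1].record()

    torch.cuda.synchronize()  # event times are only valid after a sync
    return {name: start.elapsed_time(end)  # milliseconds
            for name, (start, end) in ev.items()}

# Data-load time is the host-side gap before each batch arrives.
t0 = time.perf_counter()
for batch in dataloader:
    data_load_ms = (time.perf_counter() - t0) * 1000
    phase_ms = timed_step(model, batch, criterion, optimizer)
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    t0 = time.perf_counter()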

Simple setup

No dashboard, no rollout. Add one wrapper around the training step and run as usual.

1. Install TraceML
   Install it where your PyTorch job already runs (install command below).
2. Wrap your training step
   Add one context manager around the existing step.
3. Run and watch
   Watch live diagnostics and keep the summary.
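
For step 1, assuming the package is published on PyPI under the name traceml (the CLI below suggests as much, but check the project's README for the canonical command):

$ pip install traceml
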
train.py — minimal change
from traceml.decorators import trace_step

for batch in dataloader:
    # One context manager around the existing step; TraceML times what
    # happens inside it.
    with trace_step(model):
        outputs = model(batch["x"])            # forward
        loss = criterion(outputs, batch["y"])
        loss.backward()                        # backward
        optimizer.step()                       # optimizer update
        optimizer.zero_grad(set_to_none=True)
$ traceml run train.py
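
To try the whole flow end to end, the snippet above can slot into a self-contained toy script like this one. Everything except the trace_step usage from the example above is a placeholder: a throwaway model, synthetic data, arbitrary hyperparameters.

# toy_train.py — placeholder training job for trying TraceML
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from traceml.decorators import trace_step

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
dataloader = DataLoader(dataset, batch_size=64)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for x, y in dataloader:
    with trace_step(model):
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

Run it the same way: $ traceml run toy_train.py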

Where TraceML is going

TraceOpt is building local-first observability for ML training runs, starting with TraceML.

Now
Single-node multi-GPU
PyTorch training on NVIDIA CUDA, running on one machine with multiple GPUs.
Next
Multi-node multi-GPU
Diagnostics across distributed training jobs, including timing gaps between workers.
Then
Slurm workflows
Cleaner support for scheduled cluster jobs and existing launch workflows.

Frequently asked questions

What does TraceOpt offer today?
TraceML: live terminal diagnostics and compact end-of-run summaries for PyTorch training.
What is supported now?
Single-node multi-GPU PyTorch runs on NVIDIA CUDA.
What is coming next?
Multi-node multi-GPU support, followed by Slurm workflow support.
What is the performance overhead?
TraceML is designed to stay lightweight. Actual overhead depends on workload and configuration; a quick way to measure it on your own job is sketched after this FAQ.
What if my team needs TensorFlow, JAX, or another stack?
If your team runs another stack and this problem matters to you, please reach out to us.
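
On overhead: one quick, unscientific way to put a number on it is to wall-clock the same steps with and without the wrapper. A sketch only, assuming the dataloader, model, criterion, and optimizer from your training script are in scope:

import time
from contextlib import nullcontext
import torch
from traceml.decorators import trace_step

def time_steps(traced, n_steps=100):
    it = iter(dataloader)        # your existing training objects
    torch.cuda.synchronize()     # settle the GPU before reading the clock
    start = time.perf_counter()
    for _ in range(n_steps):
        batch = next(it)
        with trace_step(model) if traced else nullcontext():
            loss = criterion(model(batch["x"]), batch["y"])
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return time.perf_counter() - start

baseline = time_steps(traced=False)
with_traceml = time_steps(traced=True)
print(f"overhead: {100 * (with_traceml / baseline - 1):+.1f}%")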

Become a design partner

We are looking for teams with real GPU training workloads, especially if:

  • you run PyTorch training on single-node multi-GPU systems today
  • you are moving toward multi-node or Slurm-based training workflows
  • you want to understand slow or unstable runs faster
  • your team needs clearer diagnostics during and after training

Design partners get early access and direct input into the roadmap.