Reduce training time and GPU cost with observability-first tooling
TraceOpt is an open-source project focused on making machine learning training faster and more cost-effective. Our current product, TraceML, provides lightweight, real-time training observability so optimization decisions are grounded in what’s actually happening during a run.
- See where time is spent
- Catch memory growth early
- System + training signals together
- Logs for offline comparison
Training inefficiencies are expensive — and hard to see until it’s too late
- Out-of-memory failures often happen mid-run, and it is rarely clear which layer or training step triggered them.
- When throughput drops, it is hard to tell whether the bottleneck is compute, data loading, or the optimizer.
- Many tools require dedicated trace runs and offline analysis, which is useful but not practical for everyday training.
A practical view of memory and time — in the tools you already use
- Module-level visibility into parameter, activation, and gradient memory to pinpoint memory-heavy layers (a minimal sketch of this kind of signal follows this list).
- Phase-level timing across forward, backward, optimizer, and dataloader steps to spot bottlenecks quickly.
- A live dashboard in the terminal for quick debugging during SSH runs and experiments.
- A lightweight web UI at localhost:8765 for real-time plots and summaries during a run.
- Native Jupyter integration for research workflows and iterative model development.
- JSON log export to compare runs, debug regressions, and analyze behavior after training finishes.
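To make the module-level memory and phase-timing views above concrete, here is a minimal PyTorch sketch of the kind of raw signal involved. It is an illustration only, assuming plain forward hooks and wall-clock timers; the helper names (`track_activation`, `timed`) are hypothetical and this is not TraceML's API or implementation. TraceML's point is to collect and present these signals live rather than through one-off manual instrumentation.

```python
import time
import torch
import torch.nn as nn

# Tiny model and synthetic batch so the sketch runs on CPU or GPU as-is.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
x = torch.randn(64, 512, device=device)
y = torch.randint(0, 10, (64,), device=device)

# Per-module activation sizes via forward hooks (hypothetical helper, not TraceML code).
activation_bytes = {}

def track_activation(name):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            activation_bytes[name] = output.element_size() * output.nelement()
    return hook

for name, module in model.named_modules():
    if not list(module.children()):  # leaf modules only
        module.register_forward_hook(track_activation(name))

# Coarse per-phase timing for one step: forward, backward, optimizer.
def timed(fn):
    if device == "cuda":
        torch.cuda.synchronize()  # GPU work is async; sync so timings are meaningful
    start = time.perf_counter()
    result = fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return result, time.perf_counter() - start

loss, fwd_s = timed(lambda: criterion(model(x), y))
_, bwd_s = timed(loss.backward)
_, opt_s = timed(optimizer.step)

print(f"forward {fwd_s * 1e3:.1f} ms | backward {bwd_s * 1e3:.1f} ms | optimizer {opt_s * 1e3:.1f} ms")
for name, nbytes in activation_bytes.items():
    print(f"{name}: {nbytes / 1e6:.2f} MB of activations")
```

Parameter and gradient memory can be read in the same spirit from `p.numel() * p.element_size()` over `module.parameters()`; the value of tooling is doing this continuously and presenting it live during the run.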
A quick map of where each tool fits
| Feature | TraceML | PyTorch Profiler | NVIDIA Nsight | W&B / Neptune |
|---|---|---|---|---|
| Training-Time View | ✓ | ✗ | ✗ | ⚠️ |
| Model-Level View | ✓ | ✗ | ✗ | ✓ |
| Activation + Gradient Memory | ✓ | ✗ | ✓ | ✗ |
| Low Setup Effort | ✓ | ✗ | ✗ | ✗ |
| Local / No Cloud Required | ✓ | ✓ | ✓ | ✗ |
| Best For | Everyday training debugging | Deep kernel traces | GPU expert analysis | Experiment tracking |
Use TraceML when you want training-time visibility into model memory and step timing. Use PyTorch Profiler / Nsight for deep kernel tracing, and W&B/Neptune for experiment tracking.
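For contrast, the deep kernel tracing the table attributes to PyTorch Profiler is typically the standard `torch.profiler` usage shown below; it produces a per-operator summary and an exported trace for offline inspection rather than a live, in-run view.

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(64, 512, device=device)

# Record CPU (and CUDA, when available) operator/kernel events for a few iterations.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):
        model(x).sum().backward()

# Per-operator summary, plus a trace viewable in chrome://tracing or Perfetto.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")
```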
Where TraceML is heading
- More precise per-module timing to pinpoint slow layers and step phases.
- Observability for multi-process training on one machine (DDP-aware support planned).
- Improved plug-in support for common training stacks (e.g., Lightning / Accelerate).
Questions, feedback, or want to contribute?