Reduce training time and GPU cost with observability-first tooling
TraceOpt is an open-source project focused on making machine learning training faster and more cost-effective. Our current product, TraceML, provides lightweight, real-time training observability so optimization decisions are grounded in what’s actually happening during a run.
- See where time is spent
- Catch memory growth early
- System + training signals together
- Logs for offline comparison
Training inefficiencies are expensive — and hard to see until it’s too late
- Out-of-memory failures often happen mid-run, and it is rarely clear which layer or training step triggered them.
- When throughput drops, it is hard to tell whether the bottleneck is compute, data loading, or the optimizer.
- Many tools require dedicated trace runs and offline analysis, which is useful but not practical for everyday training.
A practical view of memory and time — in the tools you already use
- Module-level visibility into parameter, activation, and gradient memory to pinpoint memory-heavy layers (a minimal sketch of this kind of signal follows this list).
- Phase-level timing across forward, backward, optimizer, and dataloader steps to spot bottlenecks quickly.
- A live dashboard in the terminal for quick debugging during SSH runs and experiments.
- A lightweight web UI at localhost:8765 for real-time plots and summaries during a run.
- Native Jupyter integration for research workflows and iterative model development.
- JSON log export to compare runs, debug regressions, and analyze behavior after training finishes.
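To make the module-level memory and phase-timing views above concrete, here is a minimal PyTorch sketch of the kind of raw signal involved. It is an illustration only, assuming plain forward hooks and wall-clock timers; the helper names (`track_activation`, `timed`) are hypothetical and this is not TraceML's API or implementation. TraceML's point is to collect and present these signals live rather than through one-off manual instrumentation.

```python
import time
import torch
import torch.nn as nn

# Tiny model and synthetic batch so the sketch runs on CPU or GPU as-is.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
x = torch.randn(64, 512, device=device)
y = torch.randint(0, 10, (64,), device=device)

# Per-module activation sizes via forward hooks (hypothetical helper, not TraceML code).
activation_bytes = {}

def track_activation(name):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            activation_bytes[name] = output.element_size() * output.nelement()
    return hook

for name, module in model.named_modules():
    if not list(module.children()):  # leaf modules only
        module.register_forward_hook(track_activation(name))

# Coarse per-phase timing for one step: forward, backward, optimizer.
def timed(fn):
    if device == "cuda":
        torch.cuda.synchronize()  # GPU work is async; sync so timings are meaningful
    start = time.perf_counter()
    result = fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return result, time.perf_counter() - start

loss, fwd_s = timed(lambda: criterion(model(x), y))
_, bwd_s = timed(loss.backward)
_, opt_s = timed(optimizer.step)

print(f"forward {fwd_s * 1e3:.1f} ms | backward {bwd_s * 1e3:.1f} ms | optimizer {opt_s * 1e3:.1f} ms")
for name, nbytes in activation_bytes.items():
    print(f"{name}: {nbytes / 1e6:.2f} MB of activations")
```

Parameter and gradient memory can be read in the same spirit from `p.numel() * p.element_size()` over `module.parameters()`; the value of tooling is doing this continuously and presenting it live during the run.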
A quick map of where each tool fits
| Feature | TraceML | PyTorch Profiler | NVIDIA Nsight | W&B / Neptune |
|---|---|---|---|---|
| Training-Time View | ✓ | ✗ | ✗ | ⚠️ |
| Model-Level View | ✓ | ✗ | ✗ | ✓ |
| Activation + Gradient Memory | ✓ | ✗ | ✓ | ✗ |
| Low Setup Effort | ✓ | ✗ | ✗ | ✗ |
| Local / No Cloud Required | ✓ | ✓ | ✓ | ✗ |
| Best For | Everyday training debugging | Deep kernel traces | GPU expert analysis | Experiment tracking |
Use TraceML when you want training-time visibility into model memory and step timing. Use PyTorch Profiler / Nsight for deep kernel tracing, and W&B/Neptune for experiment tracking.
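For contrast, the deep kernel tracing the table attributes to PyTorch Profiler is typically the standard `torch.profiler` usage shown below; it produces a per-operator summary and an exported trace for offline inspection rather than a live, in-run view.

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(64, 512, device=device)

# Record CPU (and CUDA, when available) operator/kernel events for a few iterations.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):
        model(x).sum().backward()

# Per-operator summary, plus a trace viewable in chrome://tracing or Perfetto.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")
```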
Where TraceML is heading
- More precise per-module timing to pinpoint slow layers and step phases.
- Observability for multi-process training on one machine (DDP-aware support planned).
- Improved plug-in support for common training stacks (e.g., Lightning / Accelerate).
Questions, feedback, or want to contribute?