
Tech Diary: jam.togthr, a Realtime Collaborative VST



jam.togthr, a Realtime What-Now?


The jam.togthr project aims to explore how musicians might collaborate over the internet in a way that feels natural and responsive. Instead of transmitting raw audio — which introduces significant latency and reliability challenges — the focus shifts to transmitting musical intent using MIDI and predictive models.


This first chapter in the technical journey captures the early architectural and design decisions, forming the foundation for a realtime, distributed, socially aware musical collaboration system.



Problem Statement & Motivation


💡
The motivating concept is to collaborate via lightweight MIDI data, not audio streams.


Transmitting audio of any quality over the internet in realtime is challenging, as seen in the internet radio and VoIP domains. Collaborative music performance, aka ‘jamming’, faces the same challenges over remote networks because it is highly sensitive to timing, synchronization, and signal fidelity among multiple participants.


For musical data, a transport protocol compatible with RTP-MIDI is proposed. RTP-MIDI offers an efficient and low-latency packetization of MIDI events over unreliable networks like the public internet. When the source is not a virtual instrument, MIDI data is generated in real time from audio input via local predictive analysis.
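
As a rough illustration of why MIDI is so much cheaper to move than audio, here is a minimal sketch of packing note events into a compact payload for transport. The field layout is an assumption for illustration only; it is not the actual RTP-MIDI wire format, which adds features such as recovery journals for loss resilience.

    #include <cstdint>
    #include <vector>

    // Illustrative only (not the real RTP-MIDI wire format): a MIDI command
    // paired with a delta-time, packed into a compact byte payload.
    struct MidiCommand {
        uint32_t deltaTimeTicks;  // time since the previous command in the packet
        uint8_t  status;          // e.g. 0x90 = note-on, channel 1
        uint8_t  data1;           // note number
        uint8_t  data2;           // velocity
    };

    // Pack a list of commands into a single payload for the transport layer.
    std::vector<uint8_t> packCommands(const std::vector<MidiCommand>& cmds) {
        std::vector<uint8_t> out;
        for (const auto& c : cmds) {
            // 4-byte big-endian delta time, then the 3 MIDI bytes.
            out.push_back(static_cast<uint8_t>(c.deltaTimeTicks >> 24));
            out.push_back(static_cast<uint8_t>(c.deltaTimeTicks >> 16));
            out.push_back(static_cast<uint8_t>(c.deltaTimeTicks >> 8));
            out.push_back(static_cast<uint8_t>(c.deltaTimeTicks));
            out.push_back(c.status);
            out.push_back(c.data1);
            out.push_back(c.data2);
        }
        return out;
    }

Each note event costs only a handful of bytes, whereas even heavily compressed audio costs orders of magnitude more per second, which is what makes the intent-over-audio trade worthwhile.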


To address latency, network jitter, and unreliable connections, the system design incorporates note prediction, timing quantization, and role-sensitive smoothing strategies, as well as dynamic participant roles, recognizing that musicians often shift between rhythm, melody, and harmony during a session.


For session control, Conflict-Free Replicated Data Types (CRDTs) are utilized to ensure session state synchronization across distributed participants, avoiding split-brain or merge conflicts even in unstable networks. This extends the MIDI-over-IP infrastructure to support additional synchronization metadata needed for session control.
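
For readers unfamiliar with CRDTs, here is a minimal sketch of a last-writer-wins register, one of the simplest CRDT building blocks, applied to a hypothetical shared session property such as tempo. The real session state would combine several CRDT types, but the convergence idea is the same: merges are commutative, associative, and idempotent.

    #include <cstdint>
    #include <string>

    // Minimal last-writer-wins (LWW) register. Each write carries a
    // Lamport-style timestamp plus a writer id as a tie-breaker, so
    // concurrent replicas converge to the same value after merging.
    template <typename T>
    struct LwwRegister {
        T           value{};
        uint64_t    timestamp = 0;
        std::string writerId;

        void localWrite(const T& v, uint64_t ts, const std::string& id) {
            value = v; timestamp = ts; writerId = id;
        }

        // Merging is commutative, associative, and idempotent, so replicas
        // can exchange state in any order and still converge.
        void merge(const LwwRegister& other) {
            if (other.timestamp > timestamp ||
                (other.timestamp == timestamp && other.writerId > writerId)) {
                value     = other.value;
                timestamp = other.timestamp;
                writerId  = other.writerId;
            }
        }
    };

With this merge rule, two participants who change the tempo while partitioned converge to the same value once their replicas exchange state, regardless of merge order.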


Privacy and user autonomy are treated as foundational concerns:


  • Participants retain full control over their musical session data ("session captures") and personal predictive models ("agents").
  • Encryption mechanisms are designed into the system architecture, allowing users to protect their session captures and agent data with keys they control (a code sketch of this idea follows the list).
  • Users may choose to share their agents directly with collaborators or revoke access if desired.
  • Additionally, users can opt-in to anonymized data collection to contribute to global prediction improvements without exposing identifiable musical or personal information.
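
As a sketch of what key-controlled protection might look like at the code level, capture and agent files could be sealed through an abstract vault interface so that key material never leaves the user's control. The names below (CaptureVault, EncryptedBlob, and so on) are hypothetical, and the concrete cipher and keystore are deliberately left out.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical interface sketch: the plugin core only sees an abstract
    // vault, so all key material stays under the user's control and the
    // concrete cipher/keystore can be swapped without touching session code.
    struct EncryptedBlob {
        std::vector<uint8_t> ciphertext;
        std::string          keyId;   // which user-held key was used
    };

    class CaptureVault {
    public:
        virtual ~CaptureVault() = default;
        virtual EncryptedBlob seal(const std::vector<uint8_t>& plaintext,
                                   const std::string& keyId) = 0;
        virtual std::vector<uint8_t> open(const EncryptedBlob& blob) = 0;
        // Revoking a shared agent becomes a keystore operation, not a data change.
        virtual void revokeKey(const std::string& keyId) = 0;
    };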


The overall approach is intended to achieve technical resilience without sacrificing user privacy or creative empowerment.



Architecture & Design


The system architecture is modular, with clearly separated concerns for MIDI input and output, prediction and smoothing, network transport, and session management. Data flows predictably from ingestion through processing to distribution, with CRDT-managed session state providing a resilient backbone for participant coordination.


Modules are designed for composability and extensibility, ensuring that future improvements — such as more advanced prediction algorithms or enhanced privacy controls — can be integrated without destabilizing the core system.


Host Interface

The Host Interface integrates the plugin with the host DAW environment. It handles the acquisition of MIDI input, the reception of automation parameters, and the optional handling of audio buffers when the Audio to MIDI Converter is enabled. It serves as the primary boundary between the plugin and external production workflows.
Input: MIDI Input, Audio Input (optional), Plugin Parameters
Output: Internal MIDI Events, Audio Buffers (for conversion), Host Event Notifications
Related Modules: MIDI Manager, Audio to MIDI Converter
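
A simplified sketch of that boundary is below. The types and the processBlock signature are placeholders rather than any specific plugin SDK's API; the point is only that MIDI always flows inward while audio is forwarded only when conversion is enabled.

    #include <cstdint>
    #include <vector>

    // Placeholder types standing in for whatever the plugin SDK provides.
    struct MidiMessage { uint8_t status = 0, data1 = 0, data2 = 0; int sampleOffset = 0; };
    struct AudioBlock  { std::vector<float> samples; double sampleRate = 48000.0; };

    // Sketch of the host boundary: MIDI always flows on to the MIDI Manager's
    // queue; audio is forwarded only when audio-to-MIDI conversion is enabled.
    class HostInterface {
    public:
        explicit HostInterface(bool audioToMidiEnabled)
            : audioToMidi_(audioToMidiEnabled) {}

        void processBlock(const std::vector<MidiMessage>& hostMidi,
                          const AudioBlock* hostAudio) {
            for (const auto& m : hostMidi)
                midiQueue_.push_back(m);              // -> MIDI Manager
            if (audioToMidi_ && hostAudio != nullptr)
                audioQueue_.push_back(*hostAudio);    // -> Audio to MIDI Converter
        }

    private:
        bool audioToMidi_;
        std::vector<MidiMessage> midiQueue_;
        std::vector<AudioBlock>  audioQueue_;
    };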

MIDI Manager

The MIDI Manager coordinates the internal routing and distribution of MIDI events. It receives MIDI data from the Host Interface or Audio to MIDI Converter, timestamps events, and dispatches them to the Prediction and Quantization Engine. It also prepares processed MIDI streams for the MIDI Output Manager.
Input: Internal MIDI Events
Output: Timestamped MIDI Events (to Prediction Engine and Output Manager)
Related Modules: Host Interface, Audio to MIDI Converter, Prediction and Quantization Engine, MIDI Output Manager
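
A minimal sketch of the timestamp-and-fan-out behavior, assuming a simple callback-style wiring between modules (the actual threading and queueing model is not covered here):

    #include <chrono>
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct MidiMessage { uint8_t status = 0, data1 = 0, data2 = 0; };

    // A MIDI event stamped with a monotonic arrival time so the Prediction
    // Engine and Output Manager can reason about ordering and latency.
    struct TimestampedMidiEvent {
        MidiMessage msg;
        std::chrono::steady_clock::time_point arrival;
    };

    class MidiManager {
    public:
        using Sink = std::function<void(const TimestampedMidiEvent&)>;

        void addSink(Sink s) { sinks_.push_back(std::move(s)); }

        // Stamp an incoming event and fan it out to all registered sinks
        // (Prediction and Quantization Engine, MIDI Output Manager, ...).
        void onMidi(const MidiMessage& msg) {
            TimestampedMidiEvent ev{msg, std::chrono::steady_clock::now()};
            for (const auto& sink : sinks_)
                sink(ev);
        }

    private:
        std::vector<Sink> sinks_;
    };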

Audio to MIDI Converter

When enabled, the Audio to MIDI Converter uses lightweight machine learning models and real-time signal processing heuristics to transform live audio streams into equivalent MIDI representations. It aims to produce musically accurate MIDI data with minimal latency.
Input: Audio Buffers
Output: MIDI-like Event Streams
Related Modules: Host Interface, MIDI Manager
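
As a toy example of the conversion step, a naive autocorrelation pitch estimate can be mapped to the nearest MIDI note. This is far simpler than the lightweight models described above and ignores onsets, polyphony, and confidence, but it shows the shape of the problem:

    #include <cmath>
    #include <vector>

    // Naive autocorrelation pitch estimate over one mono buffer, mapped to the
    // nearest MIDI note. Real-time use would need windowing, onset detection,
    // and confidence gating; this only illustrates the conversion step.
    int bufferToMidiNote(const std::vector<float>& buf, double sampleRate) {
        const int n      = static_cast<int>(buf.size());
        const int minLag = static_cast<int>(sampleRate / 1000.0); // ~1 kHz upper bound
        const int maxLag = static_cast<int>(sampleRate / 60.0);   // ~60 Hz lower bound
        int    bestLag   = 0;
        double bestScore = 0.0;
        for (int lag = minLag; lag < maxLag && lag < n; ++lag) {
            double score = 0.0;
            for (int i = 0; i + lag < n; ++i)
                score += buf[i] * buf[i + lag];
            if (score > bestScore) { bestScore = score; bestLag = lag; }
        }
        if (bestLag == 0) return -1;                 // no periodicity found
        const double freq = sampleRate / bestLag;
        // MIDI note number: 69 = A4 = 440 Hz.
        return static_cast<int>(std::lround(69.0 + 12.0 * std::log2(freq / 440.0)));
    }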

Prediction and Quantization Engine

The Prediction and Quantization Engine smooths jitter, predicts missing or delayed notes, and quantizes timing based on dynamic session properties. It dynamically adapts its prediction models based on the ensemble composition, participant roles, and live session tempo, all synchronized through the Session Management Subsystem.
Input: Timestamped MIDI Events, Ensemble Context Metadata
Output: Smoothed, Quantized, and Predicted MIDI Events
Related Modules: MIDI Manager, Session Management Subsystem, MIDI Output Manager
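
The quantization part can be illustrated with a small helper that snaps event times toward a tempo grid, with a strength parameter standing in for role-sensitive smoothing (the parameterization here is an assumption, not the engine's actual interface):

    #include <cmath>

    // Snap an event time toward the nearest subdivision of the session tempo.
    // 'strength' in [0, 1] lets role-sensitive smoothing decide how hard to
    // quantize (e.g. tighter for a rhythm role, looser for a lead).
    double quantizeSeconds(double eventTimeSec, double bpm,
                           int subdivisionsPerBeat, double strength) {
        const double gridSec = 60.0 / bpm / subdivisionsPerBeat;
        const double snapped = std::round(eventTimeSec / gridSec) * gridSec;
        return eventTimeSec + strength * (snapped - eventTimeSec);
    }

For example, at 120 BPM with a 16th-note grid (4 subdivisions per beat), quantizeSeconds(1.03, 120.0, 4, 1.0) snaps the event to 1.0 seconds, while a strength of 0.5 would only pull it halfway there.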

Network Transport Layer

The Network Transport Layer transmits and receives MIDI event packets and CRDT session control operations over unreliable networks. It uses RTP-MIDI compatible formats for MIDI transport while handling CRDT operation synchronization separately to ensure robust control-plane state convergence.
Input: Local MIDI Events, CRDT Ops
Output: Remote MIDI Event Delivery, Remote Session Updates
Related Modules: Session Management Subsystem, Control Plane Subsystem
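
One way to picture the split between the two planes is a tagged packet framing, sketched below. The framing is illustrative only; the actual MIDI plane uses RTP-MIDI compatible packets, and the CRDT plane would typically need acknowledgement and retransmission on top of this.

    #include <cstdint>
    #include <vector>

    // Illustrative framing only: one datagram type per plane so lossy,
    // latency-critical MIDI traffic and reliable CRDT control traffic can be
    // handled with different delivery strategies.
    enum class PacketKind : uint8_t {
        MidiEvents = 1,   // RTP-MIDI-compatible payload, best-effort delivery
        CrdtOps    = 2,   // control-plane operations, retransmitted until acked
    };

    struct Packet {
        PacketKind           kind;
        uint32_t             sequence;   // for loss detection / reordering
        std::vector<uint8_t> payload;
    };

    std::vector<uint8_t> frame(const Packet& p) {
        std::vector<uint8_t> out;
        out.push_back(static_cast<uint8_t>(p.kind));
        out.push_back(static_cast<uint8_t>(p.sequence >> 24));
        out.push_back(static_cast<uint8_t>(p.sequence >> 16));
        out.push_back(static_cast<uint8_t>(p.sequence >> 8));
        out.push_back(static_cast<uint8_t>(p.sequence));
        out.insert(out.end(), p.payload.begin(), p.payload.end());
        return out;
    }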

Session Management Subsystem

The Session Management Subsystem maintains a live, conflict-free view of all active participants, their roles, and session properties. It uses CRDTs to ensure eventual consistency across distributed participants, even during high churn or network partitions.
Input: Incoming CRDT Operations, Local Metadata Changes
Output: Current Ensemble Snapshot, Updated Metadata Streams
Related Modules: Prediction and Quantization Engine, Network Transport Layer, Session Capture and Persistence
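
A sketch of how a participant roster might converge, assuming entry-wise last-writer-wins merging keyed by participant id (the real subsystem would track more metadata and use richer CRDT types):

    #include <cstdint>
    #include <map>
    #include <string>

    // Sketch of a participant roster that merges entry-wise: each participant's
    // metadata is a last-writer-wins record keyed by participant id, so two
    // replicas that diverged during a partition converge once they exchange state.
    struct ParticipantMeta {
        std::string role;        // "rhythm", "melody", "harmony", ...
        uint64_t    timestamp;   // logical clock of the last update
    };

    using Roster = std::map<std::string, ParticipantMeta>;

    void mergeRoster(Roster& local, const Roster& remote) {
        for (const auto& [id, meta] : remote) {
            auto it = local.find(id);
            if (it == local.end() || meta.timestamp > it->second.timestamp)
                local[id] = meta;   // remote entry is newer (or unknown locally)
        }
    }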

Session Capture and Persistence

The Session Capture and Persistence module records the full musical and session context timeline into .scp files. It captures MIDI events, Session Control CRDT operations, and Participant Metadata CRDT operations to provide complete replayability and ensure accurate training data generation.
Input: MIDI Streams, Session CRDT Streams, Metadata CRDT Streams
Output: .scp Session Capture Files
Related Modules: Session Management Subsystem, Agent Training Subsystem
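
The .scp layout itself is not specified in this post, but a plausible shape is a tagged, length-prefixed record stream so MIDI events and CRDT operations can be interleaved in capture order and replayed later. The framing below is a hypothetical sketch, not the actual format:

    #include <cstdint>
    #include <fstream>
    #include <vector>

    // Hypothetical .scp record framing: type tag, timestamp, length, payload.
    enum class RecordType : uint8_t { Midi = 1, SessionCrdt = 2, MetadataCrdt = 3 };

    struct CaptureRecord {
        RecordType           type;
        uint64_t             timestampMicros;
        std::vector<uint8_t> payload;
    };

    void appendRecord(std::ofstream& scp, const CaptureRecord& r) {
        scp.put(static_cast<char>(r.type));
        // 8-byte little-endian timestamp, then 4-byte payload length, then payload.
        for (int i = 0; i < 8; ++i)
            scp.put(static_cast<char>((r.timestampMicros >> (8 * i)) & 0xFF));
        const uint32_t len = static_cast<uint32_t>(r.payload.size());
        for (int i = 0; i < 4; ++i)
            scp.put(static_cast<char>((len >> (8 * i)) & 0xFF));
        if (!r.payload.empty())
            scp.write(reinterpret_cast<const char*>(r.payload.data()),
                      static_cast<std::streamsize>(r.payload.size()));
    }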

Agent Training Subsystem

The Agent Training Subsystem consumes captured .scp files to train predictive musical agents. It reconstructs dynamic ensemble context timelines and builds models conditioned on social-musical behaviors, storing them as .agent files for later deployment.
Input: .scp Files
Output: .agent Model Files
Related Modules: Session Capture and Persistence, Role Detection and Metadata Sharing
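
As a deliberately simple stand-in for what an agent learns, here is a first-order note-transition model trained from captured note-on sequences. Real agents would be conditioned on roles, ensemble context, and timing rather than pitch alone; this only illustrates the train-then-predict loop:

    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <vector>

    // First-order note-transition counts learned from captured note-on events.
    class NoteTransitionModel {
    public:
        void train(const std::vector<uint8_t>& noteOnSequence) {
            for (std::size_t i = 0; i + 1 < noteOnSequence.size(); ++i)
                counts_[noteOnSequence[i]][noteOnSequence[i + 1]] += 1;
        }

        // Most likely next note given the previous one; -1 if unseen.
        int predictNext(uint8_t prevNote) const {
            auto it = counts_.find(prevNote);
            if (it == counts_.end()) return -1;
            int best = -1; uint32_t bestCount = 0;
            for (const auto& [note, count] : it->second)
                if (count > bestCount) { bestCount = count; best = note; }
            return best;
        }

    private:
        std::map<uint8_t, std::map<uint8_t, uint32_t>> counts_;
    };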

Control Plane Subsystem

The Control Plane Subsystem manages participant authentication, session discovery, and metadata exchange. It supports both private self-hosted deployments and premium SaaS variants with enhanced collaboration features.
Input: Join/Leave Requests, Metadata Snapshots
Output: Session Discovery Updates, Control Channel Events
Related Modules: Network Transport Layer, Session Management Subsystem

Role Detection and Metadata Sharing

The Role Detection and Metadata Sharing module dynamically infers participant roles using DAW metadata and live performance characteristics. It generates CRDT-compliant updates to participant metadata during live sessions.
Input: DAW Metadata, Live MIDI Streams
Output: Role and Metadata CRDT Updates
Related Modules: Session Management Subsystem, Agent Training Subsystem
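
A toy heuristic gives the flavor of role inference from live MIDI alone; the actual module would also weigh DAW metadata (track names, instrument types) and adapt its inference over the session. All thresholds below are illustrative assumptions:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct NoteEvent { uint8_t note; double timeSec; };

    // Toy heuristic only: infer a coarse role from pitch register and how often
    // notes sound together.
    std::string inferRole(const std::vector<NoteEvent>& notes) {
        if (notes.empty()) return "unknown";
        uint8_t lo = 127, hi = 0;
        int nearSimultaneous = 0;
        for (std::size_t i = 0; i < notes.size(); ++i) {
            lo = std::min(lo, notes[i].note);
            hi = std::max(hi, notes[i].note);
            if (i > 0 && notes[i].timeSec - notes[i - 1].timeSec < 0.03)
                ++nearSimultaneous;   // near-coincident notes suggest chords
        }
        const double chordRatio  = static_cast<double>(nearSimultaneous) / notes.size();
        const int    midRegister = (lo + hi) / 2;
        if (midRegister < 52) return "rhythm";   // stays in a low register (bass-like)
        if (chordRatio > 0.4) return "harmony";  // mostly chordal playing
        return "melody";
    }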

MIDI Output Manager

The MIDI Output Manager assembles the final outgoing local MIDI streams, incorporating smoothing, prediction, and quantization adjustments, and dispatches them to the host DAW or external MIDI devices.
Input: Smoothed and Quantized MIDI Streams
Output: Outgoing MIDI Events to Host
Related Modules: Prediction and Quantization Engine, Host Interface



What's Next


Several important submodules remain to be defined:


  • Session Replay Subsystem: Reconstructs full musical and ensemble timelines from .scp files for playback, agent training, and analysis.
  • Cloud Synchronization Subsystem: Manages secure sharing of personal agents, session captures, and ensemble data, while preserving user privacy.
  • Live Collaboration Enhancements: Adds predictive fallback behavior, adaptive latency management, and ensemble auto-balancing features.


Once these remaining modules are outlined, the next phase will involve producing formal Design Specifications and Technical Specifications for each subsystem. These documents will describe detailed internal algorithms, APIs, data formats, and performance requirements and set the stage for implementation.


that’s it for now, thanks for reading!

-tim