
Tech Diary: jam.togthr, a Realtime Collaborative VST



jam.togthr, a Realtime What-Now?


The jam.togthr project aims to explore how musicians might collaborate over the internet in a way that feels natural and responsive. Instead of transmitting raw audio — which introduces significant latency and reliability challenges — the focus shifts to transmitting musical intent using MIDI and predictive models.


This first chapter in the technical journey captures the early architectural and design decisions, forming the foundation for a realtime, distributed, socially aware musical collaboration system.



Problem Statement & Motivation


💡
The motivating concept is to collaborate via lightweight MIDI data, not audio streams.


Transmitting audio of any quality over the internet in realtime is challenging, as seen in the internet radio and VoIP domains. Collaborative music performance, aka ‘jamming’, faces the same challenges over remote networks because it is highly sensitive to timing, synchronization, and signal fidelity among multiple participants.


For musical data, a transport protocol compatible with RTP-MIDI is proposed. RTP-MIDI offers an efficient and low-latency packetization of MIDI events over unreliable networks like the public internet. When the source is not a virtual instrument, MIDI data is generated in real time from audio input via local predictive analysis.
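
As a rough illustration of why MIDI is so much cheaper to move than audio, here is a minimal sketch of packing note events into a compact payload for transport. The field layout is an assumption for illustration only; it is not the actual RTP-MIDI wire format, which adds features such as recovery journals for loss resilience.

    #include <cstdint>
    #include <vector>

    // Illustrative only (not the real RTP-MIDI wire format): a MIDI command
    // paired with a delta-time, packed into a compact byte payload.
    struct MidiCommand {
        uint32_t deltaTimeTicks;  // time since the previous command in the packet
        uint8_t  status;          // e.g. 0x90 = note-on, channel 1
        uint8_t  data1;           // note number
        uint8_t  data2;           // velocity
    };

    // Pack a list of commands into a single payload for the transport layer.
    std::vector<uint8_t> packCommands(const std::vector<MidiCommand>& cmds) {
        std::vector<uint8_t> out;
        for (const auto& c : cmds) {
            // 4-byte big-endian delta time, then the 3 MIDI bytes.
            out.push_back(static_cast<uint8_t>(c.deltaTimeTicks >> 24));
            out.push_back(static_cast<uint8_t>(c.deltaTimeTicks >> 16));
            out.push_back(static_cast<uint8_t>(c.deltaTimeTicks >> 8));
            out.push_back(static_cast<uint8_t>(c.deltaTimeTicks));
            out.push_back(c.status);
            out.push_back(c.data1);
            out.push_back(c.data2);
        }
        return out;
    }

Each note event costs only a handful of bytes, whereas even heavily compressed audio costs orders of magnitude more per second, which is what makes the intent-over-audio trade worthwhile.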


To address latency, network jitter, and unreliable connections, the system design incorporates note prediction, timing quantization, and role-sensitive smoothing strategies, as well as dynamic participant roles, recognizing that musicians often shift between rhythm, melody, and harmony during a session.


For session control, Conflict-Free Replicated Data Types (CRDTs) are utilized to ensure session state synchronization across distributed participants, avoiding split-brain or merge conflicts even in unstable networks. This extends the MIDI-over-IP infrastructure to support additional synchronization metadata needed for session control.
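
For readers unfamiliar with CRDTs, here is a minimal sketch of a last-writer-wins register, one of the simplest CRDT building blocks, applied to a hypothetical shared session property such as tempo. The real session state would combine several CRDT types, but the convergence idea is the same: merges are commutative, associative, and idempotent.

    #include <cstdint>
    #include <string>

    // Minimal last-writer-wins (LWW) register. Each write carries a
    // Lamport-style timestamp plus a writer id as a tie-breaker, so
    // concurrent replicas converge to the same value after merging.
    template <typename T>
    struct LwwRegister {
        T           value{};
        uint64_t    timestamp = 0;
        std::string writerId;

        void localWrite(const T& v, uint64_t ts, const std::string& id) {
            value = v; timestamp = ts; writerId = id;
        }

        // Merging is commutative, associative, and idempotent, so replicas
        // can exchange state in any order and still converge.
        void merge(const LwwRegister& other) {
            if (other.timestamp > timestamp ||
                (other.timestamp == timestamp && other.writerId > writerId)) {
                value     = other.value;
                timestamp = other.timestamp;
                writerId  = other.writerId;
            }
        }
    };

With this merge rule, two participants who change the tempo while partitioned converge to the same value once their replicas exchange state, regardless of merge order.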


Privacy and user autonomy are treated as foundational concerns:


  • Participants retain full control over their musical session data ("session captures") and personal predictive models ("agents").
  • Encryption mechanisms are designed into the system architecture, allowing users to protect their session captures and agent data with keys they control (a code sketch of this idea follows the list).
  • Users may choose to share their agents directly with collaborators or revoke access if desired.
  • Additionally, users can opt-in to anonymized data collection to contribute to global prediction improvements without exposing identifiable musical or personal information.
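
As a sketch of what key-controlled protection might look like at the code level, capture and agent files could be sealed through an abstract vault interface so that key material never leaves the user's control. The names below (CaptureVault, EncryptedBlob, and so on) are hypothetical, and the concrete cipher and keystore are deliberately left out.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical interface sketch: the plugin core only sees an abstract
    // vault, so all key material stays under the user's control and the
    // concrete cipher/keystore can be swapped without touching session code.
    struct EncryptedBlob {
        std::vector<uint8_t> ciphertext;
        std::string          keyId;   // which user-held key was used
    };

    class CaptureVault {
    public:
        virtual ~CaptureVault() = default;
        virtual EncryptedBlob seal(const std::vector<uint8_t>& plaintext,
                                   const std::string& keyId) = 0;
        virtual std::vector<uint8_t> open(const EncryptedBlob& blob) = 0;
        // Revoking a shared agent becomes a keystore operation, not a data change.
        virtual void revokeKey(const std::string& keyId) = 0;
    };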


The overall approach is intended to achieve technical resilience without sacrificing user privacy or creative empowerment.



Architecture & Design


The system architecture is modular, with clearly separated concerns for MIDI input and output, prediction and smoothing, network transport, and session management. Data flows predictably from ingestion through processing to distribution, with CRDT-managed session state providing a resilient backbone for participant coordination.


Modules are designed for composability and extensibility, ensuring that future improvements — such as more advanced prediction algorithms or enhanced privacy controls — can be integrated without destabilizing the core system.


Host Interface

The Host Interface integrates the plugin with the host DAW environment. It handles the acquisition of MIDI input, the reception of automation parameters, and the optional handling of audio buffers when the Audio to MIDI Converter is enabled. It serves as the primary boundary between the plugin and external production workflows.
Input: MIDI Input, Audio Input (optional), Plugin Parameters
Output: Internal MIDI Events, Audio Buffers (for conversion), Host Event Notifications
Related Modules: MIDI Manager, Audio to MIDI Converter
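
A simplified sketch of that boundary is below. The types and the processBlock signature are placeholders rather than any specific plugin SDK's API; the point is only that MIDI always flows inward while audio is forwarded only when conversion is enabled.

    #include <cstdint>
    #include <vector>

    // Placeholder types standing in for whatever the plugin SDK provides.
    struct MidiMessage { uint8_t status = 0, data1 = 0, data2 = 0; int sampleOffset = 0; };
    struct AudioBlock  { std::vector<float> samples; double sampleRate = 48000.0; };

    // Sketch of the host boundary: MIDI always flows on to the MIDI Manager's
    // queue; audio is forwarded only when audio-to-MIDI conversion is enabled.
    class HostInterface {
    public:
        explicit HostInterface(bool audioToMidiEnabled)
            : audioToMidi_(audioToMidiEnabled) {}

        void processBlock(const std::vector<MidiMessage>& hostMidi,
                          const AudioBlock* hostAudio) {
            for (const auto& m : hostMidi)
                midiQueue_.push_back(m);              // -> MIDI Manager
            if (audioToMidi_ && hostAudio != nullptr)
                audioQueue_.push_back(*hostAudio);    // -> Audio to MIDI Converter
        }

    private:
        bool audioToMidi_;
        std::vector<MidiMessage> midiQueue_;
        std::vector<AudioBlock>  audioQueue_;
    };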

MIDI Manager

The MIDI Manager coordinates the internal routing and distribution of MIDI events. It receives MIDI data from the Host Interface or Audio to MIDI Converter, timestamps events, and dispatches them to the Prediction and Quantization Engine. It also prepares processed MIDI streams for the MIDI Output Manager.
Input: Internal MIDI Events
Output: Timestamped MIDI Events (to Prediction Engine and Output Manager)
Related Modules: Host Interface, Audio to MIDI Converter, Prediction and Quantization Engine, MIDI Output Manager
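
A minimal sketch of the timestamp-and-fan-out behavior, assuming a simple callback-style wiring between modules (the actual threading and queueing model is not covered here):

    #include <chrono>
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct MidiMessage { uint8_t status = 0, data1 = 0, data2 = 0; };

    // A MIDI event stamped with a monotonic arrival time so the Prediction
    // Engine and Output Manager can reason about ordering and latency.
    struct TimestampedMidiEvent {
        MidiMessage msg;
        std::chrono::steady_clock::time_point arrival;
    };

    class MidiManager {
    public:
        using Sink = std::function<void(const TimestampedMidiEvent&)>;

        void addSink(Sink s) { sinks_.push_back(std::move(s)); }

        // Stamp an incoming event and fan it out to all registered sinks
        // (Prediction and Quantization Engine, MIDI Output Manager, ...).
        void onMidi(const MidiMessage& msg) {
            TimestampedMidiEvent ev{msg, std::chrono::steady_clock::now()};
            for (const auto& sink : sinks_)
                sink(ev);
        }

    private:
        std::vector<Sink> sinks_;
    };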

Audio to MIDI Converter

When enabled, the Audio to MIDI Converter uses lightweight machine learning models and real-time signal processing heuristics to transform live audio streams into equivalent MIDI representations. It aims to produce musically accurate MIDI data with minimal latency.
Input: Audio Buffers
Output: MIDI-like Event Streams
Related Modules: Host Interface, MIDI Manager
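
As a toy example of the conversion step, a naive autocorrelation pitch estimate can be mapped to the nearest MIDI note. This is far simpler than the lightweight models described above and ignores onsets, polyphony, and confidence, but it shows the shape of the problem:

    #include <cmath>
    #include <vector>

    // Naive autocorrelation pitch estimate over one mono buffer, mapped to the
    // nearest MIDI note. Real-time use would need windowing, onset detection,
    // and confidence gating; this only illustrates the conversion step.
    int bufferToMidiNote(const std::vector<float>& buf, double sampleRate) {
        const int n      = static_cast<int>(buf.size());
        const int minLag = static_cast<int>(sampleRate / 1000.0); // ~1 kHz upper bound
        const int maxLag = static_cast<int>(sampleRate / 60.0);   // ~60 Hz lower bound
        int    bestLag   = 0;
        double bestScore = 0.0;
        for (int lag = minLag; lag < maxLag && lag < n; ++lag) {
            double score = 0.0;
            for (int i = 0; i + lag < n; ++i)
                score += buf[i] * buf[i + lag];
            if (score > bestScore) { bestScore = score; bestLag = lag; }
        }
        if (bestLag == 0) return -1;                 // no periodicity found
        const double freq = sampleRate / bestLag;
        // MIDI note number: 69 = A4 = 440 Hz.
        return static_cast<int>(std::lround(69.0 + 12.0 * std::log2(freq / 440.0)));
    }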

Prediction and Quantization Engine

The Prediction and Quantization Engine smooths jitter, predicts missing or delayed notes, and quantizes timing based on dynamic session properties. It dynamically adapts its prediction models based on the ensemble composition, participant roles, and live session tempo, all synchronized through the Session Management Subsystem.
Input: Timestamped MIDI Events, Ensemble Context Metadata
Output: Smoothed, Quantized, and Predicted MIDI Events
Related Modules: MIDI Manager, Session Management Subsystem, MIDI Output Manager
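
The quantization part can be illustrated with a small helper that snaps event times toward a tempo grid, with a strength parameter standing in for role-sensitive smoothing (the parameterization here is an assumption, not the engine's actual interface):

    #include <cmath>

    // Snap an event time toward the nearest subdivision of the session tempo.
    // 'strength' in [0, 1] lets role-sensitive smoothing decide how hard to
    // quantize (e.g. tighter for a rhythm role, looser for a lead).
    double quantizeSeconds(double eventTimeSec, double bpm,
                           int subdivisionsPerBeat, double strength) {
        const double gridSec = 60.0 / bpm / subdivisionsPerBeat;
        const double snapped = std::round(eventTimeSec / gridSec) * gridSec;
        return eventTimeSec + strength * (snapped - eventTimeSec);
    }

For example, at 120 BPM with a 16th-note grid (4 subdivisions per beat), quantizeSeconds(1.03, 120.0, 4, 1.0) snaps the event to 1.0 seconds, while a strength of 0.5 would only pull it halfway there.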

Network Transport Layer

The Network Transport Layer transmits and receives MIDI event packets and CRDT session control operations over unreliable networks. It uses RTP-MIDI compatible formats for MIDI transport while handling CRDT operation synchronization separately to ensure robust control-plane state convergence.
Input: Local MIDI Events, CRDT Ops
Output: Remote MIDI Event Delivery, Remote Session Updates
Related Modules: Session Management Subsystem, Control Plane Subsystem
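
One way to picture the split between the two planes is a tagged packet framing, sketched below. The framing is illustrative only; the actual MIDI plane uses RTP-MIDI compatible packets, and the CRDT plane would typically need acknowledgement and retransmission on top of this.

    #include <cstdint>
    #include <vector>

    // Illustrative framing only: one datagram type per plane so lossy,
    // latency-critical MIDI traffic and reliable CRDT control traffic can be
    // handled with different delivery strategies.
    enum class PacketKind : uint8_t {
        MidiEvents = 1,   // RTP-MIDI-compatible payload, best-effort delivery
        CrdtOps    = 2,   // control-plane operations, retransmitted until acked
    };

    struct Packet {
        PacketKind           kind;
        uint32_t             sequence;   // for loss detection / reordering
        std::vector<uint8_t> payload;
    };

    std::vector<uint8_t> frame(const Packet& p) {
        std::vector<uint8_t> out;
        out.push_back(static_cast<uint8_t>(p.kind));
        out.push_back(static_cast<uint8_t>(p.sequence >> 24));
        out.push_back(static_cast<uint8_t>(p.sequence >> 16));
        out.push_back(static_cast<uint8_t>(p.sequence >> 8));
        out.push_back(static_cast<uint8_t>(p.sequence));
        out.insert(out.end(), p.payload.begin(), p.payload.end());
        return out;
    }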

Session Management Subsystem

The Session Management Subsystem maintains a live, conflict-free view of all active participants, their roles, and session properties. It uses CRDTs to ensure eventual consistency across distributed participants, even during high churn or network partitions.
Input: Incoming CRDT Operations, Local Metadata Changes
Output: Current Ensemble Snapshot, Updated Metadata Streams
Related Modules: Prediction and Quantization Engine, Network Transport Layer, Session Capture and Persistence
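
A sketch of how a participant roster might converge, assuming entry-wise last-writer-wins merging keyed by participant id (the real subsystem would track more metadata and use richer CRDT types):

    #include <cstdint>
    #include <map>
    #include <string>

    // Sketch of a participant roster that merges entry-wise: each participant's
    // metadata is a last-writer-wins record keyed by participant id, so two
    // replicas that diverged during a partition converge once they exchange state.
    struct ParticipantMeta {
        std::string role;        // "rhythm", "melody", "harmony", ...
        uint64_t    timestamp;   // logical clock of the last update
    };

    using Roster = std::map<std::string, ParticipantMeta>;

    void mergeRoster(Roster& local, const Roster& remote) {
        for (const auto& [id, meta] : remote) {
            auto it = local.find(id);
            if (it == local.end() || meta.timestamp > it->second.timestamp)
                local[id] = meta;   // remote entry is newer (or unknown locally)
        }
    }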

Session Capture and Persistence

The Session Capture and Persistence module records the full musical and session context timeline into .scp files. It captures MIDI events, Session Control CRDT operations, and Participant Metadata CRDT operations to provide complete replayability and ensure accurate training data generation.
Input: MIDI Streams, Session CRDT Streams, Metadata CRDT Streams
Output: .scp Session Capture Files
Related Modules: Session Management Subsystem, Agent Training Subsystem
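
The .scp layout itself is not specified in this post, but a plausible shape is a tagged, length-prefixed record stream so MIDI events and CRDT operations can be interleaved in capture order and replayed later. The framing below is a hypothetical sketch, not the actual format:

    #include <cstdint>
    #include <fstream>
    #include <vector>

    // Hypothetical .scp record framing: type tag, timestamp, length, payload.
    enum class RecordType : uint8_t { Midi = 1, SessionCrdt = 2, MetadataCrdt = 3 };

    struct CaptureRecord {
        RecordType           type;
        uint64_t             timestampMicros;
        std::vector<uint8_t> payload;
    };

    void appendRecord(std::ofstream& scp, const CaptureRecord& r) {
        scp.put(static_cast<char>(r.type));
        // 8-byte little-endian timestamp, then 4-byte payload length, then payload.
        for (int i = 0; i < 8; ++i)
            scp.put(static_cast<char>((r.timestampMicros >> (8 * i)) & 0xFF));
        const uint32_t len = static_cast<uint32_t>(r.payload.size());
        for (int i = 0; i < 4; ++i)
            scp.put(static_cast<char>((len >> (8 * i)) & 0xFF));
        if (!r.payload.empty())
            scp.write(reinterpret_cast<const char*>(r.payload.data()),
                      static_cast<std::streamsize>(r.payload.size()));
    }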

Agent Training Subsystem

The Agent Training Subsystem consumes captured .scp files to train predictive musical agents. It reconstructs dynamic ensemble context timelines and builds models conditioned on social-musical behaviors, storing them as .agent files for later deployment.
Input: .scp Files
Output: .agent Model Files
Related Modules: Session Capture and Persistence, Role Detection and Metadata Sharing
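
As a deliberately simple stand-in for what an agent learns, here is a first-order note-transition model trained from captured note-on sequences. Real agents would be conditioned on roles, ensemble context, and timing rather than pitch alone; this only illustrates the train-then-predict loop:

    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <vector>

    // First-order note-transition counts learned from captured note-on events.
    class NoteTransitionModel {
    public:
        void train(const std::vector<uint8_t>& noteOnSequence) {
            for (std::size_t i = 0; i + 1 < noteOnSequence.size(); ++i)
                counts_[noteOnSequence[i]][noteOnSequence[i + 1]] += 1;
        }

        // Most likely next note given the previous one; -1 if unseen.
        int predictNext(uint8_t prevNote) const {
            auto it = counts_.find(prevNote);
            if (it == counts_.end()) return -1;
            int best = -1; uint32_t bestCount = 0;
            for (const auto& [note, count] : it->second)
                if (count > bestCount) { bestCount = count; best = note; }
            return best;
        }

    private:
        std::map<uint8_t, std::map<uint8_t, uint32_t>> counts_;
    };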

Control Plane Subsystem

The Control Plane Subsystem manages participant authentication, session discovery, and metadata exchange. It supports both private self-hosted deployments and premium SaaS variants with enhanced collaboration features.
Input: Join/Leave Requests, Metadata Snapshots
Output: Session Discovery Updates, Control Channel Events
Related Modules: Network Transport Layer, Session Management Subsystem

Role Detection and Metadata Sharing

The Role Detection and Metadata Sharing module dynamically infers participant roles using DAW metadata and live performance characteristics. It generates CRDT-compliant updates to participant metadata during live sessions.
Input: DAW Metadata, Live MIDI Streams
Output: Role and Metadata CRDT Updates
Related Modules: Session Management Subsystem, Agent Training Subsystem
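
A toy heuristic gives the flavor of role inference from live MIDI alone; the actual module would also weigh DAW metadata (track names, instrument types) and adapt its inference over the session. All thresholds below are illustrative assumptions:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct NoteEvent { uint8_t note; double timeSec; };

    // Toy heuristic only: infer a coarse role from pitch register and how often
    // notes sound together.
    std::string inferRole(const std::vector<NoteEvent>& notes) {
        if (notes.empty()) return "unknown";
        uint8_t lo = 127, hi = 0;
        int nearSimultaneous = 0;
        for (std::size_t i = 0; i < notes.size(); ++i) {
            lo = std::min(lo, notes[i].note);
            hi = std::max(hi, notes[i].note);
            if (i > 0 && notes[i].timeSec - notes[i - 1].timeSec < 0.03)
                ++nearSimultaneous;   // near-coincident notes suggest chords
        }
        const double chordRatio  = static_cast<double>(nearSimultaneous) / notes.size();
        const int    midRegister = (lo + hi) / 2;
        if (midRegister < 52) return "rhythm";   // stays in a low register (bass-like)
        if (chordRatio > 0.4) return "harmony";  // mostly chordal playing
        return "melody";
    }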

MIDI Output Manager

The MIDI Output Manager assembles the final outgoing local MIDI streams, incorporating smoothing, prediction, and quantization adjustments, and dispatches them to the host DAW or external MIDI devices.
Input: Smoothed and Quantized MIDI Streams
Output: Outgoing MIDI Events to Host
Related Modules: Prediction and Quantization Engine, Host Interface



What's Next


Several important submodules remain to be defined:


  • Session Replay Subsystem: Reconstructs full musical and ensemble timelines from .scp files for playback, agent training, and analysis.
  • Cloud Synchronization Subsystem: Manages secure sharing of personal agents, session captures, and ensemble data, while preserving user privacy.
  • Live Collaboration Enhancements: Adds predictive fallback behavior, adaptive latency management, and ensemble auto-balancing features.


Once these remaining modules are outlined, the next phase will involve producing formal Design Specifications and Technical Specifications for each subsystem. These documents will describe detailed internal algorithms, APIs, data formats, and performance requirements and set the stage for implementation.


that’s it for now, thanks for reading!

-tim