jam.togthr, a Realtime What-Now?
The jam.togthr project aims to explore how musicians might collaborate over the internet in a way that feels natural and responsive. Instead of transmitting raw audio — which introduces significant latency and reliability challenges — the focus shifts to transmitting musical intent using MIDI and predictive models.
This first chapter in the technical journey captures the early architectural and design decisions, forming the foundation for a realtime, distributed, socially aware musical collaboration system.
Problem Statement & Motivation
Transmitting audio of any useful quality over the internet in realtime is challenging, as the internet radio and VoIP domains demonstrate. Collaborative music performance, aka ‘jamming’, faces the same difficulty over remote networks because it is acutely sensitive to timing, synchronization, and signal fidelity among multiple participants.
For musical data, a transport protocol compatible with RTP-MIDI is proposed. RTP-MIDI offers an efficient and low-latency packetization of MIDI events over unreliable networks like the public internet. When the source is not a virtual instrument, MIDI data is generated in real time from audio input via local predictive analysis.
To address latency, network jitter, and unreliable connections, the system design incorporates note prediction, timing quantization, and role-sensitive smoothing strategies, along with dynamic participant roles: a recognition that musicians often shift between rhythm, melody, and harmony during a session.
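To make the quantization idea concrete, here is a minimal sketch of soft timing quantization: events are pulled part-way toward the nearest grid line derived from the shared session tempo rather than snapped outright. The names (`QuantizeConfig`, `quantize`) and values are illustrative, not the actual jam.togthr API.

```rust
// Soft timing quantization: pull events toward a tempo-derived grid without
// snapping them outright. QuantizeConfig and quantize are hypothetical names.

#[derive(Clone, Copy)]
struct QuantizeConfig {
    tempo_bpm: f64,     // live session tempo, shared via session state
    grid_division: f64, // fraction of a beat, e.g. 0.25 = sixteenth notes in 4/4
    strength: f64,      // 0.0 = leave timing untouched, 1.0 = snap fully to grid
}

/// Pull an event timestamp (in seconds) part-way toward the nearest grid line.
fn quantize(t_sec: f64, cfg: QuantizeConfig) -> f64 {
    let beat_len = 60.0 / cfg.tempo_bpm;         // seconds per beat
    let step = beat_len * cfg.grid_division;     // seconds per grid step
    let nearest = (t_sec / step).round() * step; // closest grid line
    t_sec + (nearest - t_sec) * cfg.strength     // blend toward the grid
}

fn main() {
    let cfg = QuantizeConfig { tempo_bpm: 120.0, grid_division: 0.25, strength: 0.6 };
    // A note arriving 18 ms after the sixteenth-note grid line at 0.125 s
    // gets pulled most of the way back toward it.
    println!("{:.4}", quantize(0.143, cfg));
}
```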
For session control, Conflict-Free Replicated Data Types (CRDTs) are utilized to ensure session state synchronization across distributed participants, avoiding split-brain or merge conflicts even in unstable networks. This extends the MIDI-over-IP infrastructure to support additional synchronization metadata needed for session control.
Privacy and user autonomy are treated as foundational concerns:
- Participants retain full control over their musical session data ("session captures") and personal predictive models ("agents").
- Encryption mechanisms are designed into the system architecture, allowing users to protect their session captures and agent data with keys they control.
- Users may choose to share their agents directly with collaborators or revoke access if desired.
- Additionally, users can opt-in to anonymized data collection to contribute to global prediction improvements without exposing identifiable musical or personal information.
The overall approach is intended to achieve technical resilience without sacrificing user privacy or creative empowerment.
Architecture & Design
The system architecture is modular, with clearly separated concerns for MIDI input and output, prediction and smoothing, network transport, and session management. Data flows predictably from ingestion through processing to distribution, with CRDT-managed session state providing a resilient backbone for participant coordination.
Modules are designed for composability and extensibility, ensuring that future improvements — such as more advanced prediction algorithms or enhanced privacy controls — can be integrated without destabilizing the core system.
Host Interface
The Host Interface integrates the plugin with the host DAW environment. It handles the acquisition of MIDI input, the reception of automation parameters, and the optional handling of audio buffers when the Audio-to-MIDI Converter is enabled. It serves as the primary boundary between the plugin and external production workflows.
| Input | Output | Related Modules |
| --- | --- | --- |
| MIDI Input, Audio Input (optional), Plugin Parameters | Internal MIDI Events, Audio Buffers (for conversion), Host Event Notifications | MIDI Manager, Audio to MIDI Converter |
MIDI Manager
The MIDI Manager coordinates the internal routing and distribution of MIDI events. It receives MIDI data from the Host Interface or Audio to MIDI Converter, timestamps events, and dispatches them to the Prediction and Quantization Engine. It also prepares processed MIDI streams for the MIDI Output Manager.
| Input | Output | Related Modules |
| --- | --- | --- |
| Internal MIDI Events | Timestamped MIDI Events to Prediction Engine and Output Manager | Host Interface, Audio to MIDI Converter, Prediction and Quantization Engine, MIDI Output Manager |
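As a rough illustration of this routing, the sketch below stamps each incoming event against a session clock and pushes it onto queues for the downstream modules. The types and queue layout are assumptions made for the example, not the real module interface.

```rust
// Hypothetical MIDI Manager core: timestamp incoming events and fan them out
// to the prediction engine and the output manager via in-process queues.

use std::collections::VecDeque;
use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug)]
struct RawMidi {
    bytes: [u8; 3], // status byte + two data bytes (note on/off, CC, ...)
}

#[derive(Clone, Copy, Debug)]
struct TimestampedMidi {
    received_at: Duration, // offset from session start, from a monotonic clock
    raw: RawMidi,
}

struct MidiManager {
    session_start: Instant,
    to_prediction: VecDeque<TimestampedMidi>, // consumed by the Prediction and Quantization Engine
    to_output: VecDeque<TimestampedMidi>,     // consumed by the MIDI Output Manager
}

impl MidiManager {
    fn new() -> Self {
        Self {
            session_start: Instant::now(),
            to_prediction: VecDeque::new(),
            to_output: VecDeque::new(),
        }
    }

    /// Stamp an event against the session clock and route it downstream.
    fn ingest(&mut self, raw: RawMidi) {
        let event = TimestampedMidi {
            received_at: self.session_start.elapsed(),
            raw,
        };
        self.to_prediction.push_back(event);
        self.to_output.push_back(event);
    }
}

fn main() {
    let mut manager = MidiManager::new();
    manager.ingest(RawMidi { bytes: [0x90, 60, 100] }); // note on, middle C
    println!("{:?}", manager.to_prediction.front());
}
```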
Audio to MIDI Converter
When enabled, the Audio to MIDI Converter uses lightweight machine learning models and real-time signal processing heuristics to transform live audio streams into equivalent MIDI representations. It aims to produce musically accurate MIDI data with minimal latency.
| Input | Output | Related Modules |
| --- | --- | --- |
| Audio Buffers | MIDI-like Event Streams | Host Interface, MIDI Manager |
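The real converter leans on lightweight ML models, but the general shape of the conversion can be shown with a plain DSP heuristic: estimate a frame's dominant period by autocorrelation and map that frequency to the nearest MIDI note. Everything below is illustrative, not the production algorithm.

```rust
// Naive audio-to-MIDI sketch: autocorrelation pitch estimate -> MIDI note number.

/// Estimate the dominant period of a mono frame and convert it to the nearest
/// MIDI note (None if no periodicity is found).
fn frame_to_midi_note(frame: &[f32], sample_rate: f32) -> Option<u8> {
    let min_lag = (sample_rate / 1000.0) as usize; // ~1000 Hz upper pitch bound
    let max_lag = (sample_rate / 60.0) as usize;   // ~60 Hz lower pitch bound
    let mut best = (0usize, 0.0f32);
    for lag in min_lag..max_lag.min(frame.len() / 2) {
        let corr: f32 = frame.iter()
            .zip(&frame[lag..])
            .map(|(a, b)| a * b)
            .sum();
        if corr > best.1 {
            best = (lag, corr);
        }
    }
    if best.1 <= 0.0 {
        return None; // no clear pitch detected
    }
    let freq = sample_rate / best.0 as f32;
    // MIDI note = 69 + 12 * log2(f / 440 Hz), clamped to the valid range.
    let note = 69.0 + 12.0 * (freq / 440.0).log2();
    Some(note.round().clamp(0.0, 127.0) as u8)
}

fn main() {
    // One 30 ms frame of a 440 Hz sine at 48 kHz should map to A4 (note 69).
    let sample_rate = 48_000.0;
    let frame: Vec<f32> = (0..1440)
        .map(|n| (2.0 * std::f32::consts::PI * 440.0 * n as f32 / sample_rate).sin())
        .collect();
    println!("{:?}", frame_to_midi_note(&frame, sample_rate));
}
```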
Prediction and Quantization Engine
The Prediction and Quantization Engine smooths jitter, predicts missing or delayed notes, and quantizes timing based on dynamic session properties. It dynamically adapts its prediction models based on the ensemble composition, participant roles, and live session tempo, all synchronized through the Session Management Subsystem.
| Input | Output | Related Modules |
| --- | --- | --- |
| Timestamped MIDI Events, Ensemble Context Metadata | Smoothed, Quantized, and Predicted MIDI Events | MIDI Manager, Session Management Subsystem, MIDI Output Manager |
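One concrete prediction strategy, shown below purely as a sketch: when a remote stream stalls, extrapolate the next onset from the median of recent inter-onset intervals and pencil in a tentative event there. The real engine conditions on roles and ensemble context; this example does not.

```rust
// Jitter-tolerant onset prediction: estimate the next onset time from the
// median inter-onset interval of recent events. All names are illustrative.

/// Predict the next onset time (in seconds) from the last few observed onsets.
fn predict_next_onset(onsets: &[f64]) -> Option<f64> {
    if onsets.len() < 2 {
        return None; // not enough history to extrapolate
    }
    let mut intervals: Vec<f64> = onsets.windows(2).map(|w| w[1] - w[0]).collect();
    intervals.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = intervals[intervals.len() / 2];
    Some(onsets[onsets.len() - 1] + median)
}

fn main() {
    // A slightly jittery eighth-note pattern at 120 BPM (nominal 0.25 s spacing).
    let onsets = [0.00, 0.26, 0.49, 0.76, 1.00];
    // If the next packet is late, a tentative note can be scheduled around here:
    println!("{:?}", predict_next_onset(&onsets)); // ≈ Some(1.26)
}
```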
Network Transport Layer
The Network Transport Layer transmits and receives MIDI event packets and CRDT session control operations over unreliable networks. It uses RTP-MIDI compatible formats for MIDI transport while handling CRDT operation synchronization separately to ensure robust control-plane state convergence.
| Input | Output | Related Modules |
| --- | --- | --- |
| Local MIDI Events, CRDT Ops | Remote MIDI Event Delivery, Remote Session Updates | Session Management Subsystem, Control Plane Subsystem |
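For a feel of the framing involved, here is a deliberately simplified RTP-style packet: a fixed header (sequence number, timestamp, SSRC) followed by raw MIDI bytes. It omits the recovery journal and delta-time encoding, so it is not the full RFC 6295 wire format, and the payload type value is just a common dynamic-range choice.

```rust
// Simplified RTP-style framing for MIDI transport. Not the full RTP-MIDI
// (RFC 6295) layout; the structure and values below are illustrative.

struct RtpMidiPacket {
    sequence: u16,   // increments per packet; lets the receiver spot loss/reorder
    timestamp: u32,  // send time in media clock ticks
    ssrc: u32,       // identifies this participant's stream
    midi: Vec<u8>,   // raw MIDI commands carried by this packet
}

impl RtpMidiPacket {
    /// Serialize to bytes (network byte order for the header fields).
    fn to_bytes(&self) -> Vec<u8> {
        let mut buf = Vec::with_capacity(12 + self.midi.len());
        buf.push(0x80); // RTP version 2, no padding/extension/CSRC
        buf.push(0x61); // dynamic payload type (97 is a common choice for MIDI)
        buf.extend_from_slice(&self.sequence.to_be_bytes());
        buf.extend_from_slice(&self.timestamp.to_be_bytes());
        buf.extend_from_slice(&self.ssrc.to_be_bytes());
        buf.extend_from_slice(&self.midi);
        buf
    }
}

fn main() {
    let packet = RtpMidiPacket {
        sequence: 1,
        timestamp: 480,
        ssrc: 0xDEAD_BEEF,
        midi: vec![0x90, 60, 100], // note on, middle C
    };
    println!("{} bytes on the wire", packet.to_bytes().len());
}
```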
Session Management Subsystem
The Session Management Subsystem maintains a live, conflict-free view of all active participants, their roles, and session properties. It uses CRDTs to ensure eventual consistency across distributed participants, even during high churn or network partitions.
| Input | Output | Related Modules |
| --- | --- | --- |
| Incoming CRDT Operations, Local Metadata Changes | Current Ensemble Snapshot, Updated Metadata Streams | Prediction and Quantization Engine, Network Transport Layer, Session Capture and Persistence |
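A tiny example of the kind of CRDT that could back participant metadata is a last-writer-wins map keyed by (participant, field): merges are commutative, associative, and idempotent, so every replica converges on the same snapshot no matter how updates arrive. The production subsystem composes richer CRDT types; this is only a sketch.

```rust
// Minimal last-writer-wins map for participant metadata. Illustrative only.

use std::collections::HashMap;

type Key = (String, String); // (participant id, metadata field)
type Entry = (u64, String);  // (logical timestamp, value)

#[derive(Default)]
struct LwwMap {
    entries: HashMap<Key, Entry>,
}

impl LwwMap {
    /// Apply a write (local or remote). The higher (timestamp, value) pair wins,
    /// with the value acting as a deterministic tie-breaker on equal timestamps.
    fn set(&mut self, key: Key, ts: u64, value: String) {
        let incoming = (ts, value);
        let newer = match self.entries.get(&key) {
            Some(existing) => incoming > *existing,
            None => true,
        };
        if newer {
            self.entries.insert(key, incoming);
        }
    }

    /// Merge a remote replica. Merging is commutative, associative, and
    /// idempotent, so replicas converge regardless of delivery order.
    fn merge(&mut self, other: &LwwMap) {
        for (key, (ts, value)) in &other.entries {
            self.set(key.clone(), *ts, value.clone());
        }
    }
}

fn main() {
    let (mut a, mut b) = (LwwMap::default(), LwwMap::default());
    a.set(("tim".into(), "role".into()), 1, "rhythm".into());
    b.set(("tim".into(), "role".into()), 2, "melody".into());
    a.merge(&b);
    b.merge(&a);
    assert_eq!(a.entries, b.entries); // both replicas agree: the later write wins
    println!("{:?}", a.entries);
}
```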
Session Capture and Persistence
The Session Capture and Persistence module records the full musical and session context timeline into `.scp` files. It captures MIDI events, Session Control CRDT operations, and Participant Metadata CRDT operations to provide complete replayability and accurate training data generation.
| Input | Output | Related Modules |
| --- | --- | --- |
| MIDI Streams, Session CRDT Streams, Metadata CRDT Streams | .scp Session Capture Files | Session Management Subsystem, Agent Training Subsystem |
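The `.scp` layout is still open; as a sketch of the intent, the example below appends tagged, length-prefixed records (MIDI, session CRDT ops, metadata CRDT ops) so a capture can later be replayed in order or mined for training data. The tags and field layout are hypothetical.

```rust
// Hypothetical .scp record framing: tag | timestamp | length | payload.

use std::io::{self, Write};

enum CaptureRecord {
    Midi { t_micros: u64, bytes: Vec<u8> },
    SessionCrdtOp { t_micros: u64, op: Vec<u8> },
    MetadataCrdtOp { t_micros: u64, op: Vec<u8> },
}

impl CaptureRecord {
    /// Append one record as: tag (1 byte) | timestamp (8 bytes) | len (4 bytes) | payload.
    fn write_to<W: Write>(&self, w: &mut W) -> io::Result<()> {
        let (tag, t, payload) = match self {
            CaptureRecord::Midi { t_micros, bytes } => (0u8, *t_micros, bytes),
            CaptureRecord::SessionCrdtOp { t_micros, op } => (1u8, *t_micros, op),
            CaptureRecord::MetadataCrdtOp { t_micros, op } => (2u8, *t_micros, op),
        };
        w.write_all(&[tag])?;
        w.write_all(&t.to_le_bytes())?;
        w.write_all(&(payload.len() as u32).to_le_bytes())?;
        w.write_all(payload)
    }
}

fn main() -> io::Result<()> {
    // Capture a single note-on into an in-memory buffer standing in for a .scp file.
    let mut scp: Vec<u8> = Vec::new();
    let record = CaptureRecord::Midi { t_micros: 1_250_000, bytes: vec![0x90, 60, 100] };
    record.write_to(&mut scp)?;
    println!("{} bytes captured", scp.len());
    Ok(())
}
```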
Agent Training Subsystem
The Agent Training Subsystem consumes captured `.scp` files to train predictive musical agents. It reconstructs dynamic ensemble context timelines and builds models conditioned on social-musical behaviors, storing them as `.agent` files for later deployment.
| Input | Output | Related Modules |
| --- | --- | --- |
| .scp Files | .agent Model Files | Session Capture and Persistence, Role Detection and Metadata Sharing |
Control Plane Subsystem
The Control Plane Subsystem manages participant authentication, session discovery, and metadata exchange. It supports both private self-hosted deployments and premium SaaS variants with enhanced collaboration features.
| Input | Output | Related Modules |
| --- | --- | --- |
| Join/Leave Requests, Metadata Snapshots | Session Discovery Updates, Control Channel Events | Network Transport Layer, Session Management Subsystem |
Role Detection and Metadata Sharing
The Role Detection and Metadata Sharing module dynamically infers participant roles using DAW metadata and live performance characteristics. It generates CRDT-compliant updates to participant metadata during live sessions.
| Input | Output | Related Modules |
| --- | --- | --- |
| DAW Metadata, Live MIDI Streams | Role and Metadata CRDT Updates | Session Management Subsystem, Agent Training Subsystem |
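As one example of the performance-characteristics side, the heuristic below classifies a short window of live MIDI: stacked near-simultaneous onsets read as harmony, dense low-register playing as rhythm, and sparse upper-register lines as melody. The thresholds and role labels are illustrative; the real module also weighs DAW metadata.

```rust
// Toy role inference from a window of (onset seconds, MIDI pitch) pairs.

#[derive(Debug, PartialEq)]
enum Role {
    Rhythm,
    Harmony,
    Melody,
}

fn infer_role(notes: &[(f64, u8)], window_secs: f64) -> Role {
    if notes.is_empty() {
        return Role::Melody; // nothing to go on; pick a neutral default
    }
    let notes_per_sec = notes.len() as f64 / window_secs;
    let mean_pitch =
        notes.iter().map(|&(_, p)| p as f64).sum::<f64>() / notes.len() as f64;
    // Count onsets that land within 30 ms of another onset (rough chord detector).
    let simultaneous = notes
        .iter()
        .filter(|&&(t, _)| notes.iter().any(|&(u, _)| u != t && (u - t).abs() < 0.03))
        .count();
    let chord_ratio = simultaneous as f64 / notes.len() as f64;

    if chord_ratio > 0.5 {
        Role::Harmony
    } else if notes_per_sec > 4.0 && mean_pitch < 60.0 {
        Role::Rhythm
    } else {
        Role::Melody
    }
}

fn main() {
    // A sparse upper-register line over a two-second window classifies as melody.
    let line: [(f64, u8); 3] = [(0.0, 72), (0.7, 74), (1.4, 76)];
    println!("{:?}", infer_role(&line, 2.0));
}
```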
MIDI Output Manager
The MIDI Output Manager assembles the final outgoing local MIDI streams, applying the smoothing, prediction, and quantization adjustments, and dispatches them to the host DAW or external MIDI devices.
| Input | Output | Related Modules |
| --- | --- | --- |
| Smoothed and Quantized MIDI Streams | Outgoing MIDI Events to Host | Prediction and Quantization Engine, Host Interface |
What's Next
Several important submodules remain to be defined:
- Session Replay Subsystem: Reconstructs full musical and ensemble timelines from `.scp` files for playback, agent training, and analysis.
- Cloud Synchronization Subsystem: Manages secure sharing of personal agents, session captures, and ensemble data, while preserving user privacy.
- Live Collaboration Enhancements: Adds predictive fallback behavior, adaptive latency management, and ensemble auto-balancing features.
Once these remaining modules are outlined, the next phase will involve producing formal Design Specifications and Technical Specifications for each subsystem. These documents will describe detailed internal algorithms, APIs, data formats, and performance requirements, setting the stage for implementation.
that’s it for now, thanks for reading!
-tim