A deep dive into the 15B-parameter single-stream Transformer
40-Layer Unified Transformer
Unlike multi-stream models (which process modalities separately), HappyHorse uses one unified sequence of text, video, and audio tokens. This enables shared representations and reduces inference overhead.
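The single-stream idea can be sketched as packing all three modalities into one token sequence with a parallel modality tag per position. This is an illustrative sketch only; the function and variable names below (`pack_single_stream`, the tag strings) are assumptions, not identifiers from the HappyHorse codebase.

```python
# Hypothetical sketch of single-stream token packing, assuming each
# modality has already been tokenized into integer IDs.

def pack_single_stream(text_ids, video_ids, audio_ids):
    """Concatenate per-modality tokens into one sequence, keeping a
    parallel list of modality tags so downstream layers can tell
    which positions belong to which modality."""
    tokens, modalities = [], []
    for tag, ids in (("text", text_ids), ("video", video_ids), ("audio", audio_ids)):
        tokens.extend(ids)
        modalities.extend([tag] * len(ids))
    return tokens, modalities

tokens, mods = pack_single_stream([1, 2], [10, 11, 12], [20])
# tokens → [1, 2, 10, 11, 12, 20], one flat sequence for the Transformer
```

Because the Transformer sees one sequence, the same attention weights serve all modalities, which is what enables the shared representations mentioned above.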
Distribution Matching Distillation v2 compresses the denoising process from 50+ steps to just 8, without requiring Classifier-Free Guidance (CFG). Result: roughly 6× fewer denoising steps, and only one forward pass per step instead of the two that CFG requires, with no measurable quality loss.
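A distilled sampler of this kind reduces to a short loop: one model call per step, with no second (unconditional) CFG pass. The sketch below is a minimal illustration under the assumption of a student denoiser called as `model(x, t)`; the names and the linear timestep schedule are placeholders, not HappyHorse's actual scheduler.

```python
# Minimal sketch of an 8-step distilled sampling loop (assumed API:
# a student model callable as model(x, t)). Illustrative only.

def sample(model, x, steps=8):
    """Run `steps` denoising iterations with exactly one model call
    each -- no unconditional CFG branch, so half the forward passes
    of a guided sampler at the same step count."""
    timesteps = [1.0 - i / steps for i in range(steps)]  # 1.0 down to 1/steps
    for t in timesteps:
        x = model(x, t)  # single forward pass per step
    return x

calls = []
identity = lambda x, t: (calls.append(t), x)[1]  # records each call
out = sample(identity, [0.0] * 4)
# len(calls) == 8: exactly eight model invocations end to end
```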
A full-graph compilation system that fuses operators across Transformer layers. Delivers ~1.2× end-to-end speedup over standard PyTorch eager execution on H100.
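Operator fusion can be illustrated with a toy example: two elementwise passes collapsed into one loop, avoiding an intermediate buffer. A full-graph compiler applies the same idea at scale, across and within Transformer layers. This is a conceptual sketch of the technique, not the HappyHorse compiler itself.

```python
# Toy illustration of operator fusion: the unfused version makes two
# passes over the data and materializes an intermediate list; the
# fused version does the same math in one pass with no intermediate.

def unfused(xs):
    tmp = [x * 2.0 for x in xs]       # pass 1: writes an intermediate buffer
    return [t + 1.0 for t in tmp]     # pass 2: reads it back

def fused(xs):
    return [x * 2.0 + 1.0 for x in xs]  # one pass, no intermediate

assert fused([1.0, 2.0]) == unfused([1.0, 2.0])  # same result, less memory traffic
```

The speedup in a real system comes from reduced memory traffic and kernel-launch overhead rather than fewer arithmetic operations.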
Each attention head has a learnable gate that controls how much each modality influences others. Critical for stable joint video+audio training — prevents one modality from dominating.
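One simple way to realize such a gate is a learnable scalar per head, passed through a sigmoid and applied only to cross-modal attention scores. The sketch below is an assumption about how a gate like this could work; the function name and the logit parameterization are illustrative, not the actual HappyHorse implementation.

```python
import math

# Illustrative per-head modality gate: a learnable logit g per head
# scales cross-modal attention scores by sigmoid(g), while intra-modal
# scores pass through unchanged. Names are assumptions.

def gated_cross_modal_weight(score, same_modality, gate_logit):
    """Scale a raw attention score by sigmoid(gate_logit) when the
    query and key come from different modalities; leave intra-modal
    scores untouched so each modality can always attend to itself."""
    if same_modality:
        return score
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid, in (0, 1)
    return score * gate

# gate_logit = 0.0 gives gate = 0.5: cross-modal influence is halved,
# and training can push the logit up or down per head.
```

Initializing the gates near zero influence lets each modality first learn its own structure before cross-modal coupling ramps up, which is one way a gate can prevent a dominant modality from swamping the other.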
8-bit floating point (FP8) inference roughly halves weight memory relative to FP16, enabling single-GPU deployment on a 40GB A100 with minimal quality degradation.
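The VRAM savings follow from simple arithmetic on weight storage alone (activations, KV cache, and framework overhead add more on top). The 15B parameter count is taken from the section title; the rest is back-of-the-envelope.

```python
# Back-of-the-envelope VRAM math for model weights only.
# 15B parameters assumed from the section title; activations and
# KV cache are not counted here.

params = 15e9
bytes_per = {"fp16": 2, "fp8": 1}
for fmt, nbytes in bytes_per.items():
    gib = params * nbytes / 2**30
    print(f"{fmt}: {gib:.1f} GiB of weights")
# fp16: ~27.9 GiB of weights; fp8: ~14.0 GiB -- FP8 is what leaves a
# 40GB A100 with headroom for activations and the KV cache.
```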
Text, video frames, and audio spectrograms are tokenized into a single sequence and denoised jointly. No separate audio post-processing step needed.
| Configuration | VRAM | Generation time (5 s of 1080p video) | Use Case |
|---|---|---|---|
| H100 80GB | 80GB | ~38 seconds | Production / Research |
| A100 80GB | 80GB | ~55 seconds | Research |
| A100 40GB + FP8 | 40GB | ~90 seconds | Budget Production |
| 2× A6000 48GB | 96GB | ~70 seconds | Multi-GPU Setup |
The most likely architectural foundation for HappyHorse-1.0. Released March 23, 2026 by GAIR Lab (Shanghai) + Sand.ai. Fully open source on GitHub and HuggingFace.