A deep dive into the 15B-parameter single-stream Transformer
40-Layer Unified Transformer
Unlike multi-stream models (which process modalities separately), HappyHorse uses one unified sequence of text, video, and audio tokens. This enables shared representations and reduces inference overhead.
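The single-stream idea can be sketched as packing all three modalities into one token sequence with a parallel modality tag per position. This is an illustrative sketch only; the function and variable names below (`pack_single_stream`, the tag strings) are assumptions, not identifiers from the HappyHorse codebase.

```python
# Hypothetical sketch of single-stream token packing, assuming each
# modality has already been tokenized into integer IDs.

def pack_single_stream(text_ids, video_ids, audio_ids):
    """Concatenate per-modality tokens into one sequence, keeping a
    parallel list of modality tags so downstream layers can tell
    which positions belong to which modality."""
    tokens, modalities = [], []
    for tag, ids in (("text", text_ids), ("video", video_ids), ("audio", audio_ids)):
        tokens.extend(ids)
        modalities.extend([tag] * len(ids))
    return tokens, modalities

tokens, mods = pack_single_stream([1, 2], [10, 11, 12], [20])
# tokens → [1, 2, 10, 11, 12, 20], one flat sequence for the Transformer
```

Because the Transformer sees one sequence, the same attention weights serve all modalities, which is what enables the shared representations mentioned above.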
Distribution Matching Distillation v2 compresses the denoising process from 50+ steps to just 8, without requiring Classifier-Free Guidance (CFG). Result: roughly 6× fewer denoising steps, and only one forward pass per step instead of the two that CFG requires, with no measurable quality loss.
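A distilled sampler of this kind reduces to a short loop: one model call per step, with no second (unconditional) CFG pass. The sketch below is a minimal illustration under the assumption of a student denoiser called as `model(x, t)`; the names and the linear timestep schedule are placeholders, not HappyHorse's actual scheduler.

```python
# Minimal sketch of an 8-step distilled sampling loop (assumed API:
# a student model callable as model(x, t)). Illustrative only.

def sample(model, x, steps=8):
    """Run `steps` denoising iterations with exactly one model call
    each -- no unconditional CFG branch, so half the forward passes
    of a guided sampler at the same step count."""
    timesteps = [1.0 - i / steps for i in range(steps)]  # 1.0 down to 1/steps
    for t in timesteps:
        x = model(x, t)  # single forward pass per step
    return x

calls = []
identity = lambda x, t: (calls.append(t), x)[1]  # records each call
out = sample(identity, [0.0] * 4)
# len(calls) == 8: exactly eight model invocations end to end
```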
A full-graph compilation system that fuses operators across Transformer layers. Delivers ~1.2× end-to-end speedup over standard PyTorch eager execution on H100.
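Operator fusion can be illustrated with a toy example: two elementwise passes collapsed into one loop, avoiding an intermediate buffer. A full-graph compiler applies the same idea at scale, across and within Transformer layers. This is a conceptual sketch of the technique, not the HappyHorse compiler itself.

```python
# Toy illustration of operator fusion: the unfused version makes two
# passes over the data and materializes an intermediate list; the
# fused version does the same math in one pass with no intermediate.

def unfused(xs):
    tmp = [x * 2.0 for x in xs]       # pass 1: writes an intermediate buffer
    return [t + 1.0 for t in tmp]     # pass 2: reads it back

def fused(xs):
    return [x * 2.0 + 1.0 for x in xs]  # one pass, no intermediate

assert fused([1.0, 2.0]) == unfused([1.0, 2.0])  # same result, less memory traffic
```

The speedup in a real system comes from reduced memory traffic and kernel-launch overhead rather than fewer arithmetic operations.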
Each attention head has a learnable gate that controls how much each modality influences others. Critical for stable joint video+audio training — prevents one modality from dominating.
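One simple way to realize such a gate is a learnable scalar per head, passed through a sigmoid and applied only to cross-modal attention scores. The sketch below is an assumption about how a gate like this could work; the function name and the logit parameterization are illustrative, not the actual HappyHorse implementation.

```python
import math

# Illustrative per-head modality gate: a learnable logit g per head
# scales cross-modal attention scores by sigmoid(g), while intra-modal
# scores pass through unchanged. Names are assumptions.

def gated_cross_modal_weight(score, same_modality, gate_logit):
    """Scale a raw attention score by sigmoid(gate_logit) when the
    query and key come from different modalities; leave intra-modal
    scores untouched so each modality can always attend to itself."""
    if same_modality:
        return score
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid, in (0, 1)
    return score * gate

# gate_logit = 0.0 gives gate = 0.5: cross-modal influence is halved,
# and training can push the logit up or down per head.
```

Initializing the gates near zero influence lets each modality first learn its own structure before cross-modal coupling ramps up, which is one way a gate can prevent a dominant modality from swamping the other.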
8-bit floating point (FP8) inference roughly halves weight memory relative to FP16, enabling single-GPU deployment on a 40GB A100 with minimal quality degradation.
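The VRAM savings follow from simple arithmetic on weight storage alone (activations, KV cache, and framework overhead add more on top). The 15B parameter count is taken from the section title; the rest is back-of-the-envelope.

```python
# Back-of-the-envelope VRAM math for model weights only.
# 15B parameters assumed from the section title; activations and
# KV cache are not counted here.

params = 15e9
bytes_per = {"fp16": 2, "fp8": 1}
for fmt, nbytes in bytes_per.items():
    gib = params * nbytes / 2**30
    print(f"{fmt}: {gib:.1f} GiB of weights")
# fp16: ~27.9 GiB of weights; fp8: ~14.0 GiB -- FP8 is what leaves a
# 40GB A100 with headroom for activations and the KV cache.
```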
Text, video frames, and audio spectrograms are tokenized into a single sequence and denoised jointly. No separate audio post-processing step needed.
| Configuration | VRAM | Generation time (5 s of 1080p video) | Use Case |
|---|---|---|---|
| H100 80GB | 80GB | ~38 seconds | Production / Research |
| A100 80GB | 80GB | ~55 seconds | Research |
| A100 40GB + FP8 | 40GB | ~90 seconds | Budget Production |
| 2× A6000 48GB | 96GB | ~70 seconds | Multi-GPU Setup |
The most likely architectural foundation for HappyHorse-1.0. Released March 23, 2026 by GAIR Lab (Shanghai) + Sand.ai. Fully open source on GitHub and HuggingFace.