Most AI video pipelines bolt audio on afterward. HappyHorse-1.0 generates video and audio in a single Transformer forward pass. Here's how it works — and how to run it.
The standard approach to AI video dubbing is actually four separate steps: generate silent video → run TTS → run a lip-sync model → composite ambient audio. Each step introduces error. The TTS output may have slightly different timing than the prompt implied. The lip-sync model operates on individual frames and produces temporal jitter. Ambient audio has to be sourced or synthesized separately.
HappyHorse-1.0 collapses all of this into one model that learns the joint distribution of video frames and audio waveforms during training. Visual tokens and audio tokens are processed through the same shared Transformer backbone — so the cross-modal relationship is baked in, not patched in.
HappyHorse-1.0 uses a 40-layer single-stream Transformer with modality-specific processing at the boundaries.
Each attention head learns an independent gate value controlling how strongly audio attends to visual tokens and vice versa. This is why multilingual lip-sync holds up: the model learned that mouth-shape tokens should attend heavily to phoneme-level audio tokens, independent of the language being spoken.
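One way to picture the per-head gating is the NumPy sketch below. It is an illustration under stated assumptions, not the model's actual implementation: the function name `gated_joint_attention`, the sigmoid parameterization of the gates, and the choice to apply each gate as a multiplier on the cross-modal attention weights are all hypothetical.

```python
import numpy as np

def gated_joint_attention(q, k, v, n_video, gate_logits):
    """Joint attention over [video tokens; audio tokens] with per-head gates.

    q, k, v:      arrays of shape (heads, tokens, dim); the first n_video
                  tokens are video, the remainder are audio.
    gate_logits:  array of shape (heads, 2) of raw learned scalars (assumed).
                  Column 0 gates video queries reading audio keys,
                  column 1 gates audio queries reading video keys.
    """
    heads, tokens, dim = q.shape
    gates = 1.0 / (1.0 + np.exp(-gate_logits))        # sigmoid, values in (0, 1)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)  # (heads, tokens, tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))

    # Scale only the cross-modal blocks; same-modality blocks keep weight 1.
    mask = np.ones((heads, tokens, tokens))
    mask[:, :n_video, n_video:] = gates[:, 0, None, None]  # video -> audio
    mask[:, n_video:, :n_video] = gates[:, 1, None, None]  # audio -> video

    weights = weights * mask
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(2, 5, 4)) for _ in range(3))
closed_out, closed_w = gated_joint_attention(q, k, v, 3, np.full((2, 2), -50.0))
open_out, open_w = gated_joint_attention(q, k, v, 3, np.full((2, 2), 50.0))
```

In this sketch, a gate near 1 recovers ordinary joint attention over both modalities, while a gate near 0 makes that head attend only within its own modality. Because the gates are per head, some heads can specialize in cross-modal alignment while others stay modality-local.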
Standard diffusion video models require 50+ denoising steps. HappyHorse-1.0 uses DMD-2 (Distribution Matching Distillation v2) to compress this to 8 steps without meaningful quality loss. Classifier-Free Guidance is also eliminated — guidance is baked into the distilled weights, removing the need for two forward passes per step.
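The saving is easy to quantify by counting Transformer forward passes. The short calculation below uses only the figures quoted above (50 steps with two passes per step for a CFG baseline, 8 steps with one pass for the distilled model); the helper name `forward_passes` is illustrative, and the ratio counts passes only, so it should not be read as a measured wall-clock speedup.

```python
def forward_passes(steps: int, uses_cfg: bool) -> int:
    """Forward passes per clip: CFG needs a conditional and an
    unconditional pass at every denoising step."""
    return steps * (2 if uses_cfg else 1)

baseline = forward_passes(50, uses_cfg=True)    # 50 steps x 2 passes
distilled = forward_passes(8, uses_cfg=False)   # 8 steps x 1 pass
reduction = baseline / distilled

print(f"{baseline} passes vs {distilled} passes: {reduction}x fewer")
```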
MagiCompiler fuses adjacent ops (attention, normalization, projection) into single CUDA kernels, reducing memory bandwidth pressure and kernel launch overhead. Combined with DMD-2, the result is a 5-second 1080p clip in ~38 seconds on a single H100 80GB.
HappyHorse-1.0 supports phoneme-accurate lip-sync in 7 languages natively — no separate dubbing step required:
| Language | Code |
|---|---|
| English | en |
| Mandarin Chinese | zh |
| Cantonese | yue |
| Japanese | ja |
| Korean | ko |
| German | de |
| French | fr |
For developers building localization or dubbing pipelines, this means you can generate a source video in English and re-synthesize it in Korean or German without a manual sync correction pass. The joint model handles phoneme timing internally.
Official HappyHorse-1.0 weights are not yet released. daVinci-MagiHuman — the open-source model from GAIR Lab / Sand.ai sharing the same architecture — is available now under Apache 2.0.
Installation
```bash
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
conda create -n happyhorse python=3.12
conda activate happyhorse
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
Text-to-Video (1080p)
```python
from magihuman import MagiHumanPipeline

pipe = MagiHumanPipeline.from_pretrained(
    "GAIR-NLP/daVinci-MagiHuman-distilled",
    torch_dtype="bfloat16"
).to("cuda")

result = pipe(
    prompt="A horse galloping through a cherry blossom forest at sunset",
    resolution="1080p",
    duration=5,
    num_inference_steps=8,  # DMD-2: only 8 steps needed
)
result.save("output.mp4")
```

Multilingual Lip-Sync
```python
result = pipe(
    prompt="A news anchor delivering a report",
    speech_text="Breaking news: open-source AI has reached a new milestone.",
    speech_language="en",  # "en", "zh", "yue", "ja", "ko", "de", "fr"
    resolution="1080p",
    duration=10,
)
```

| Mode | VRAM | Hardware |
|---|---|---|
| BF16 (unquantized) | 48 GB | H100 80GB, A100 80GB |
| FP8 quantized | 40 GB | H100 80GB, A100 80GB |
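A deployment script might use these thresholds to decide which mode a given GPU can host. The sketch below is a hypothetical helper: `choose_mode` and the labels it returns are not part of any library, and the cutoffs are simply the VRAM figures from the table, so real deployments may want extra headroom.

```python
from typing import Optional

# VRAM figures taken from the table above, in GB.
BF16_VRAM_GB = 48
FP8_VRAM_GB = 40

def choose_mode(available_vram_gb: float) -> Optional[str]:
    """Return the highest-precision mode that fits, or None if neither does."""
    if available_vram_gb >= BF16_VRAM_GB:
        return "bf16"
    if available_vram_gb >= FP8_VRAM_GB:
        return "fp8"
    return None

for vram in (80, 44, 24):
    print(vram, "GB ->", choose_mode(vram))
```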
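Cloud options: RunPod (H100 SXM 80GB on-demand), Vast.ai (A100 80GB spot pricing), Lambda Labs (reserved H100 clusters). At ~38 s per clip on an H100, you can budget ~800–1,000 inference calls per GPU per day. That is a conservative estimate, well below the ceiling of roughly 2,270 clips for back-to-back generation (86,400 s / 38 s).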
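For capacity planning, the daily estimate can be reproduced with a few lines of arithmetic. This is a rough sketch: the `utilization` parameter is an assumption introduced here to model queueing, idle time, and model loading, and the 0.4 value is only an example that happens to land inside the 800 to 1,000 range.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def clips_per_day(seconds_per_clip: float, utilization: float = 1.0) -> int:
    """Whole clips one GPU can finish in a day at the given busy fraction."""
    return int(SECONDS_PER_DAY * utilization / seconds_per_clip)

ceiling = clips_per_day(38)             # GPU busy 100% of the time
planned = clips_per_day(38, 0.4)        # GPU busy 40% of the time (assumed)

print(f"ceiling: {ceiling} clips/day, at 40% utilization: {planned} clips/day")
```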
Content localization
Generate a marketing video in English, then re-synthesize it in Mandarin, Korean, and Japanese in a single pipeline step — no dubbing studio required.
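A localization batch could be prepared as a list of keyword arguments, one per target language, and then fed to the pipeline. The sketch below assumes the `pipe(...)` call accepts the same parameters shown in the lip-sync example; `build_localized_requests` is a hypothetical helper, and the translated scripts are placeholders that you would supply yourself.

```python
# Language codes from the supported-languages table.
SUPPORTED_LANGUAGES = {"en", "zh", "yue", "ja", "ko", "de", "fr"}

def build_localized_requests(prompt, scripts, resolution="1080p", duration=10):
    """Turn {language_code: script_text} into one kwargs dict per language."""
    unsupported = sorted(set(scripts) - SUPPORTED_LANGUAGES)
    if unsupported:
        raise ValueError(f"Unsupported language codes: {unsupported}")
    return [
        {
            "prompt": prompt,
            "speech_text": text,
            "speech_language": lang,
            "resolution": resolution,
            "duration": duration,
        }
        for lang, text in scripts.items()
    ]

requests = build_localized_requests(
    prompt="A news anchor delivering a report",
    scripts={
        "en": "Breaking news: open-source AI has reached a new milestone.",
        "ko": "<Korean translation of the script>",    # placeholder text
        "ja": "<Japanese translation of the script>",  # placeholder text
    },
)

# With a loaded pipeline, each request would then be rendered, for example:
# for kwargs in requests:
#     pipe(**kwargs).save(f"anchor_{kwargs['speech_language']}.mp4")
```

Validating the language codes up front avoids spending GPU time on a request the model does not support.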
AI avatar generation
The joint audio-video model is well-suited to talking-head generation where lip-sync accuracy is critical.
Film & storyboard prototyping
Generate 1080p animatics with placeholder dialogue at about 38 seconds per 5-second shot. On a machine with two H100s, a 10-shot sequence can be roughed out in under 10 minutes.
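The timing budget can be sanity-checked with a quick calculation. The sketch assumes one shot renders per GPU at a time, shots are spread evenly across GPUs, and model loading is excluded; the function name is illustrative.

```python
import math

def batch_wall_clock_seconds(shots: int, gpus: int,
                             seconds_per_shot: float = 38.0) -> float:
    """Wall-clock time when shots are rendered in rounds of one per GPU."""
    rounds = math.ceil(shots / gpus)
    return rounds * seconds_per_shot

two_gpu = batch_wall_clock_seconds(shots=10, gpus=2)
one_gpu = batch_wall_clock_seconds(shots=10, gpus=1)

print(f"2 GPUs: {two_gpu / 60:.1f} min, 1 GPU: {one_gpu / 60:.1f} min")
```

Under these assumptions the pure generation time sits well inside the 10-minute window, which leaves room for a few retakes.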
Commercial API backends
Apache 2.0 licensing means you can build commercial APIs on top of daVinci-MagiHuman without royalty obligations. The official HappyHorse-1.0 weights are not yet released.