Turn a single photo and an audio clip into a lifelike avatar video — accurate lip-sync, singing support, and anime-ready animation, all in your browser.
Built on the open-source LongCat Video Avatar 1.5 model by Meituan LongCat Team. This is a third-party online tool — not the official Meituan page.
Your audio length is rounded up to the nearest second, with a minimum of 5 seconds and a maximum of 64 seconds. At 480p, each second of video costs 1 credit.
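The billing rule above can be sketched as a small calculation. This is only an illustration, not the tool's actual billing code: the function and constant names are hypothetical, and clamping audio longer than 64 seconds (rather than rejecting it) is an assumption.

```python
import math

# Values taken from the pricing note above; the names are illustrative only.
MIN_SECONDS = 5
MAX_SECONDS = 64
CREDITS_PER_SECOND_480P = 1


def billable_seconds(audio_seconds: float) -> int:
    """Round the audio length up to a whole second, then clamp to [5, 64]."""
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    rounded = math.ceil(audio_seconds)
    # Assumption: over-long audio is clamped to the maximum rather than rejected.
    return max(MIN_SECONDS, min(rounded, MAX_SECONDS))


def estimate_credits_480p(audio_seconds: float) -> int:
    """Estimated credit cost of one 480p generation."""
    return billable_seconds(audio_seconds) * CREDITS_PER_SECOND_480P
```

Under these assumptions, a 12.3-second clip is billed as 13 seconds (13 credits at 480p), and a 3-second clip is billed at the 5-second minimum.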
LongCat Video Avatar 1.5 is an audio-driven avatar video model built on the LongCat-Video foundation architecture developed by Meituan.
Give it a voice track and a text prompt, optionally with a reference image, and it generates a talking character video with synchronized lip movement, natural motion, and consistent identity across frames.
Suitable for AI presenters, digital humans, anime avatars, and multilingual talking videos.
Generate an avatar video from audio + text prompt alone.
Add a reference photo for a personalized avatar video.
Extend an existing clip while preserving identity and motion.
The original LongCat Video Avatar launched in December 2025. Version 1.5 addresses the main limitations that showed up in real-world use.
| What Changed | v1.0 | v1.5 |
|---|---|---|
| Audio encoder | Wav2Vec2 | Whisper-Large-v3 |
| Lip-sync quality | Functional | Significantly smoother, more natural |
| Inference steps | Full diffusion | 8 steps via DMD2 distillation |
| VRAM options | Standard | INT8 quantization available |
| Stylized domains | Limited | Anime, animals, complex scenes |
| Multi-person support | Single stream | Single + multi-stream audio |
| Long video stability | Variable | Production-grade temporal consistency |
The audio encoder swap is the biggest functional change. Whisper-Large-v3 was trained on far more multilingual speech data than Wav2Vec2, which is why lip dynamics are noticeably more accurate — especially on longer clips where the older encoder would drift.
The 8-step distillation matters for deployment cost. Fewer inference steps mean less GPU time per video, which makes batch production more practical.
The upgrade from 1.0 to 1.5 brings better mouth-shape accuracy, stronger long-video identity preservation, broader interactive scenarios, and faster 8-step generation.
LongCat Video Avatar 1.5
LongCat Video Avatar 1.0
LongCat Video Avatar 1.5 compared with HeyGen, Kling Avatar 2.0, and OmniHuman-1.5 under the same or similar inputs, focusing on stability, consistency, and natural lip motion.
LongCat Video Avatar 1.5
HeyGen
Kling Avatar 2.0
OmniHuman-1.5
Stronger mouth-shape accuracy, smooth expression transitions, identity consistency, and coherent full-body motion across long speaking shots and hand-object interactions.
01. Accurate lip-sync and identity consistency across extended speaking shots, hand gestures, and object interactions, with no frame drift and no identity degradation.
02. Dynamic motion, musical expression, and stable full-body or upper-body performance, from soft ballads to energetic stage performances.
03. Expressive motion and stable audio-driven performance across anime characters, illustrated portraits, and stylized 3D avatars.
04. Multi-speaker and group interaction cases with stable identities and natural turn-taking behavior, powered by dual-audio Merge and Concatenation modes.
LongCat Video Avatar 1.5 uses Whisper-Large-v3 to analyze speech timing, rhythm, and phoneme transitions with higher accuracy. The result: smoother lip-sync, more natural mouth movement, and better speaking realism across longer videos.
Generate long-form avatar videos with stable identity, accurate lip-sync, and smooth full-body motion. The model keeps characters consistent frame to frame — even during hand movement, object interaction, or extended scenes.
Create more than realistic talking avatars. LongCat Video Avatar 1.5 supports anime characters, illustrated portraits, stylized 3D renders, animal avatars, and multi-person scenes with dual-audio input modes.
Reduce generation time and GPU cost with DMD2-based 8-step inference. Enable the --use_int8 flag to lower VRAM usage further and run the model more efficiently in production environments.
Scale avatar video generation across multiple GPUs for batch production and longer sequences. LongCat Video Avatar 1.5 supports context-parallel inference to improve rendering stability and throughput for studio workflows.
LongCat Video Avatar 1.5 covers six core application scenarios. The official project page shows side-by-side demo videos for each.
Talking-head videos for presenters, anchors, and educational content. The model handles extended monologues (2+ minutes) with stable lip motion.
Audio-driven singing with synchronized mouth shapes. Full-body or upper-body motion responds to musical rhythm.
Anime faces, 3D characters, and illustrated portraits. Generalizes to non-photorealistic domains like hand-drawn and cel-shading styles.
Two speakers in one frame driven by separate audio tracks, supporting both merged and alternating turn-taking modes.
Product demos and AI spokespeople. Practical for batch production with 720P output and fast 8-step inference.
Animal faces and creature avatars with audio-driven mouth motion. Ideal for game assets and character storytelling.
Everything you're likely to ask about LongCat Video Avatar 1.5 — answered here.
LongCat Video Avatar 1.5 accepts audio, text prompts, reference images, and existing video clips. You can generate videos using three modes: Audio-Text-to-Video (AT2V), Audio-Text-Image-to-Video (ATI2V), or Video Continuation.
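As a rough illustration of how the three modes relate to the inputs you supply, here is a minimal sketch. The `Mode` enum and `pick_mode` helper are hypothetical and not part of the model's API. The sketch assumes every mode needs audio, that an existing clip takes precedence over a reference image, and that the two non-continuation modes need a text prompt.

```python
from enum import Enum


class Mode(Enum):
    AT2V = "Audio-Text-to-Video"
    ATI2V = "Audio-Text-Image-to-Video"
    VIDEO_CONTINUATION = "Video Continuation"


def pick_mode(has_audio: bool, has_prompt: bool,
              has_image: bool = False, has_video: bool = False) -> Mode:
    """Map the supplied inputs to one of the three generation modes."""
    # Assumption: the model is audio-driven, so audio is always required.
    if not has_audio:
        raise ValueError("an audio track is required")
    # Assumption: an existing clip means the user wants to extend it.
    if has_video:
        return Mode.VIDEO_CONTINUATION
    # Assumption: AT2V and ATI2V both need a text prompt, as their names suggest.
    if not has_prompt:
        raise ValueError("a text prompt is required for AT2V and ATI2V")
    return Mode.ATI2V if has_image else Mode.AT2V
```

For example, audio plus a prompt selects AT2V, and adding a reference photo selects ATI2V.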
The current release supports 480P and 720P output. Native 1080P is not available in version 1.5.
Stylized characters are supported. Beyond photorealistic humans, the model handles anime characters, illustrated portraits, stylized 3D avatars, and animal characters.
Commercial use is allowed. The model ships under the MIT license, which permits commercial use. You're responsible for ensuring you hold the rights to any images, audio, and likenesses used as inputs.
LongCat Video Avatar 1.5 is the only open-source option among the tools compared here: MIT licensed, self-hostable, and with no per-video fee. HeyGen and Kling are closed commercial APIs with limited deployment flexibility and no customization access.
Use a clear, front-facing portrait with even lighting and no face occlusion. Detailed text prompts help too: include appearance, action, and scene context (e.g., "A young woman in a white blouse speaking in a bright café"). More detailed prompts generally produce better output.
Whisper-Large-v3 delivers tighter phoneme-to-viseme mapping than Wav2Vec2. The official evaluation confirmed Audio-Visual Harmony improvements across 508 image-audio test pairs.
The online demo needs no installation or local GPU: just sign up and start generating. Local deployment requires a CUDA-compatible GPU (24GB VRAM minimum), Python 3.10, and a conda environment.
New users get a free credit on sign-up to generate one video. Additional credits are available for purchase — see the pricing page for plan details.
English and Chinese audio are supported. The Whisper-Large-v3 encoder performs best on these two languages for lip-sync alignment and speech feature extraction. Other languages may work but aren't officially supported.
Real-time generation is not supported. This is an offline generation model. Even with 8-step inference, each video requires meaningful GPU compute time. It's not designed for live-streaming or real-time avatar applications.
Multi-person scenes are supported. Version 1.5 adds dual-audio support for multi-person avatar scenes via Merge and Concatenation modes.
Switch to Multi Avatar mode in the online tool and upload two separate audio tracks. Merge mode runs both tracks simultaneously and requires equal-length clips. Concatenation mode sequences them one after the other — no equal length required, with silence padding any gaps.
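The difference between the two modes can be sketched with plain sample lists. This is an illustration under assumptions, not the tool's implementation: the function names are hypothetical, and the sketch assumes Concatenation pads each speaker's track with silence (zeros) while the other speaker is talking.

```python
from typing import List, Tuple

Samples = List[float]


def merge_tracks(a: Samples, b: Samples) -> Tuple[Samples, Samples]:
    """Merge mode: both speakers' tracks play simultaneously."""
    if len(a) != len(b):
        raise ValueError("Merge mode requires equal-length clips")
    return list(a), list(b)


def concatenate_tracks(a: Samples, b: Samples) -> Tuple[Samples, Samples]:
    """Concatenation mode: speaker A talks first, then speaker B."""
    # Assumption: each speaker's track is silent while the other one talks.
    padded_a = list(a) + [0.0] * len(b)
    padded_b = [0.0] * len(a) + list(b)
    return padded_a, padded_b
```

In this sketch both modes return two tracks of equal length, one per speaker. Merge demands that from the inputs, while Concatenation produces it through padding.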
Credit usage depends on video length, resolution, and generation mode. Higher resolution and longer duration consume more credits per generation.
You don't need a GPU, a subscription, or a production team. Upload a photo, add audio, and LongCat Video Avatar 1.5 handles the rest — right in your browser.
Generate Your First Avatar Video