Version 1.5 Now Live

LongCat Video Avatar 1.5
Generate AI Avatar Videos Online

Turn a single photo and an audio clip into a lifelike avatar video — accurate lip-sync, singing support, and anime-ready animation, all in your browser.

Built on the open-source LongCat Video Avatar 1.5 model by the Meituan LongCat Team. This is a third-party online tool, not the official Meituan page.

LongCat Video Avatar AI Video Generator

Credit usage: 480p · 1 credit per second

Your audio length is rounded up to the nearest second, with a minimum of 5 seconds and a maximum of 64 seconds. At 480p, each second of video costs 1 credit.
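The billing rule above can be sketched in a few lines. This is only an illustration of the stated arithmetic, not the tool's actual billing code. The function names are hypothetical, and rejecting audio longer than 64 seconds (rather than trimming it) is an assumption, since the page does not say which happens.

```python
import math

MIN_SECONDS = 5                # minimum billed length stated above
MAX_SECONDS = 64               # maximum length stated above
CREDITS_PER_SECOND_480P = 1    # 480p rate stated above


def billable_seconds(audio_seconds: float) -> int:
    """Round the audio length up to a whole second and apply the 5-second minimum."""
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    seconds = math.ceil(audio_seconds)
    if seconds > MAX_SECONDS:
        # Assumption: over-length audio is rejected; the page does not say
        # whether it is rejected or trimmed.
        raise ValueError(f"audio exceeds the {MAX_SECONDS}-second maximum")
    return max(seconds, MIN_SECONDS)


def credits_480p(audio_seconds: float) -> int:
    """Credits charged for a 480p video driven by audio of the given length."""
    return billable_seconds(audio_seconds) * CREDITS_PER_SECOND_480P
```

Under these assumptions, a 3.2-second clip is billed at the 5-second minimum (5 credits), and a 12.1-second clip rounds up to 13 seconds (13 credits).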

What Is LongCat Video Avatar 1.5?

LongCat Video Avatar 1.5 is an audio-driven avatar video model built on the LongCat-Video foundation architecture developed by Meituan.

Give it a voice track and a text prompt, optionally with a reference image, and it generates a talking character video with synchronized lip movement, natural motion, and consistent identity across frames.

Suitable for AI presenters, digital humans, anime avatars, and multilingual talking videos.

AT2V

Audio-Text-to-Video

Generate an avatar video from audio + text prompt alone.

ATI2V

Audio-Text-Image-to-Video

Add a reference photo for a personalized avatar video.

Video Continuation

Sequence Extension

Extend an existing clip while preserving identity and motion.

480P & 720P Output · English & Chinese · Realistic & Animated

What's New in Version 1.5 vs 1.0?

The original LongCat Video Avatar launched in December 2024. Version 1.5 addresses the main limitations that showed up in real-world use.

What Changed         | v1.0           | v1.5
Audio encoder        | Wav2Vec2       | Whisper-Large-v3
Lip-sync quality     | Functional     | Significantly smoother, more natural
Inference steps      | Full diffusion | 8 steps via DMD2 distillation
VRAM options         | Standard       | INT8 quantization available
Stylized domains     | Limited        | Anime, animals, complex scenes
Multi-person support | Single stream  | Single + multi-stream audio
Long video stability | Variable       | Production-grade temporal consistency

The audio encoder swap is the biggest functional change. Whisper-Large-v3 was trained on far more multilingual speech data than Wav2Vec2, which is why lip dynamics are noticeably more accurate — especially on longer clips where the older encoder would drift.

The 8-step distillation matters for deployment cost. Fewer inference steps mean less GPU time per video, which makes batch production more practical.

LongCat Video Avatar 1.0 vs 1.5

Version 1.5 improves on 1.0 with better mouth-shape accuracy, stronger long-video identity preservation, broader interactive scenarios, and faster 8-step generation.

Side-by-side comparison highlighting the improved realism and smoothness of LongCat Video Avatar 1.5.
v1.5

LongCat Video Avatar 1.5

Legacy version footage of LongCat Video Avatar 1.0 for performance benchmarking against version 1.5.

LongCat Video Avatar 1.0

Commercial Model Comparison

Compare LongCat Video Avatar 1.5 with HeyGen, Kling Avatar 2.0, and OmniHuman-1.5 under the same or similar inputs, focusing on stability, consistency, and natural lip motion.

Sample 1

Performance showcase of LongCat Video Avatar 1.5 in a commercial model comparison test - Sample 1
Ours

LongCat Video Avatar 1.5

Comparative video of HeyGen's avatar generation - Sample 1

HeyGen

Comparative video of Kling Avatar 2.0 performance - Sample 1

Kling Avatar 2.0

Comparative video of OmniHuman-1.5 motion test - Sample 1

OmniHuman-1.5

Sample 2

Performance showcase of LongCat Video Avatar 1.5 in a commercial model comparison test - Sample 2
Ours

LongCat Video Avatar 1.5

Comparative video of HeyGen's avatar generation - Sample 2

HeyGen

Comparative video of Kling Avatar 2.0 performance - Sample 2

Kling Avatar 2.0

Comparative video of OmniHuman-1.5 motion test - Sample 2

OmniHuman-1.5

Stability and Consistency

Stronger mouth-shape accuracy, smooth expression transitions, identity consistency, and coherent full-body motion across long speaking shots and hand-object interactions.

01

Long-Form Talking

Accurate lip-sync and identity consistency across extended speaking shots, hand gestures, and object interactions — no frame drift, no identity degradation.

LongCat Video Avatar 1.5 demonstration of high stability and facial consistency during long-form speech.

02

Singing and Performance

Dynamic motion, musical expression, and stable full-body or upper-body performance, from soft ballads to energetic stage performances.

AI video avatar performing a song with expressive facial movements and consistent features using LongCat 1.5.

03

Animation

Expressive motion and stable audio-driven performance across anime characters, illustrated portraits, and stylized 3D avatars.

Animated character consistency test showcasing the stability of LongCat Video Avatar 1.5 in motion.

04

Multi-Person Interaction

Multi-speaker and group interaction cases with stable identities and natural turn-taking behavior, powered by dual-audio Merge and Concatenation modes.

Multiple AI avatars interacting naturally while maintaining visual stability with LongCat 1.5 technology.

Key Features of LongCat Video Avatar 1.5

01

Whisper-Large-v3 Audio Encoder

LongCat Video Avatar 1.5 uses Whisper-Large-v3 to analyze speech timing, rhythm, and phoneme transitions with higher accuracy. The result: smoother lip-sync, more natural mouth movement, and better speaking realism across longer videos.

02

Production-Grade Stability

Generate long-form avatar videos with stable identity, accurate lip-sync, and smooth full-body motion. The model keeps characters consistent frame to frame — even during hand movement, object interaction, or extended scenes.

03

Stylized Domain Support

Create more than realistic talking avatars. LongCat Video Avatar 1.5 supports anime characters, illustrated portraits, stylized 3D renders, animal avatars, and multi-person scenes with dual-audio input modes.

04

8-Step Fast Inference with INT8 Quantization

Reduce generation time and GPU cost with DMD2-based 8-step inference. Enable the --use_int8 mode to lower VRAM usage further and run the model more efficiently in production environments.
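To show why an INT8 mode lowers VRAM usage, here is a generic sketch of symmetric per-tensor INT8 quantization. The page does not describe how --use_int8 quantizes the model, so this is not the model's actual scheme. The function names are hypothetical, and the 16-bit baseline in the memory comparison is an assumption.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the INT8 values."""
    return [q * scale for q in quantized]


def weight_bytes(num_params, bytes_per_param):
    """Raw storage needed for the weights alone (ignores activations and overhead)."""
    return num_params * bytes_per_param


# Assumption: the unquantized baseline stores weights in a 16-bit format
# (2 bytes per parameter); INT8 stores 1 byte per parameter.
params = 1_000_000  # illustrative count, not the real model size
saving = 1 - weight_bytes(params, 1) / weight_bytes(params, 2)
print(f"weight memory saved vs 16-bit: {saving:.0%}")
```

The trade-off is a small rounding error per weight, bounded by half the scale factor in this scheme.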

05

Multi-GPU Context Parallelism

Scale avatar video generation across multiple GPUs for batch production and longer sequences. LongCat Video Avatar 1.5 supports context-parallel inference to improve rendering stability and throughput for studio workflows.

Who Is LongCat Video Avatar 1.5 For?

LongCat Video Avatar 1.5 covers six core application scenarios. The official project page shows side-by-side demo videos for each.

News anchor talking head demo with stable lip-sync accuracy and long monologue performance powered by LongCat Video Avatar 1.5.

News Broadcasting & Education

Talking-head videos for presenters, anchors, and educational content. The model handles extended monologues (2+ minutes) with stable lip motion.

AI singing avatar with dynamic body movement and synchronized phonemes generated by LongCat Video Avatar 1.5.

Singing & Performance

Audio-driven singing with synchronized mouth shapes. Full-body or upper-body motion responds to musical rhythm.

Stylized anime character and 3D avatar animation demonstrating the non-photorealistic capabilities of LongCat Video Avatar 1.5.

Animation & Stylized Characters

Anime faces, 3D characters, and illustrated portraits. Generalizes to non-photorealistic domains like hand-drawn and cel-shading styles.

Multi-person talking head video from LongCat Video Avatar 1.5 featuring dual audio stream synchronization and consistent identity.

Multi-Person Conversations

Two speakers in one frame driven by separate audio tracks, supporting both merged and alternating turn-taking modes.

E-commerce AI spokesperson for promotional content and product demos produced rapidly with LongCat Video Avatar 1.5.

E-Commerce Marketing Videos

Product demos and AI spokespeople. Practical for batch production with 720P output and fast 8-step inference.

Non-human creature and animal face animation with audio-driven mouth motion powered by LongCat Video Avatar 1.5.

Animal & Non-Human Characters

Animal faces and creature avatars with audio-driven mouth motion. Ideal for game assets and character storytelling.

Frequently Asked Questions

Everything you're likely to ask about LongCat Video Avatar 1.5 — answered here.

LongCat Video Avatar 1.5 accepts audio, text prompts, reference images, and existing video clips. You can generate videos using three modes: Audio-Text-to-Video (AT2V), Audio-Text-Image-to-Video (ATI2V), or Video Continuation.
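As a rough sketch, the mapping from supplied inputs to generation mode could look like this. The function and parameter names are hypothetical. Two things are assumptions, since the page does not specify them: that an existing clip takes precedence over the other inputs, and that Video Continuation needs no further inputs.

```python
def select_mode(has_audio: bool, has_text_prompt: bool,
                has_reference_image: bool = False,
                has_source_video: bool = False) -> str:
    """Pick a generation mode from the inputs the user supplied."""
    if has_source_video:
        # Assumption: an existing clip means the user wants to extend it.
        return "Video Continuation"
    if has_audio and has_text_prompt:
        # A reference photo upgrades AT2V to the personalized ATI2V mode.
        return "ATI2V" if has_reference_image else "AT2V"
    raise ValueError("audio and a text prompt are required for AT2V and ATI2V")
```

For example, audio plus a text prompt selects AT2V, and adding a reference photo selects ATI2V.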

The current release supports 480P and 720P output. Native 1080P is not available in version 1.5.

Stylized and non-human characters are supported. Beyond photorealistic humans, the model handles anime characters, illustrated portraits, stylized 3D avatars, and animal characters.

The model ships under the MIT license, which permits commercial use. You're responsible for ensuring you hold the rights to any images, audio, and likenesses used as inputs.

Among the models compared on this page, LongCat Video Avatar 1.5 is the only open-source option: MIT licensed, self-hostable, and with no per-video fee. HeyGen and Kling are closed commercial APIs with limited deployment flexibility and no customization access.

Use a clear, front-facing portrait with even lighting and no face occlusion. Detailed text prompts help too — include appearance, action, and scene context (e.g., "A young woman in a white blouse speaking in a bright café"). More detail consistently produces better output.

Whisper-Large-v3 delivers tighter phoneme-to-viseme mapping than Wav2Vec2. The official evaluation confirmed Audio-Visual Harmony improvements across 508 image-audio test pairs.

No installation or local GPU is needed to use the online demo; just sign up and start generating. Local deployment requires a CUDA-compatible GPU (24GB VRAM minimum), Python 3.10, and a conda environment.
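Before attempting a local install, a quick preflight check against these requirements can save time. This sketch is illustrative only, and the function name is hypothetical. It assumes "Python 3.10" means exactly the 3.10 series, and that the caller supplies the VRAM figure (for example, read from the GPU vendor's tooling).

```python
import sys
from typing import List, Optional, Tuple

REQUIRED_PYTHON = (3, 10)   # stated requirement
MIN_VRAM_GB = 24            # stated minimum


def check_local_requirements(python_version: Tuple[int, int],
                             vram_gb: Optional[float]) -> List[str]:
    """Return a list of unmet requirements; an empty list means all checks pass.

    vram_gb is None when no CUDA-compatible GPU was detected.
    """
    problems = []
    # Assumption: only the 3.10 series is treated as supported.
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        problems.append(
            f"Python 3.10 required, found {python_version[0]}.{python_version[1]}")
    if vram_gb is None:
        problems.append("no CUDA-compatible GPU detected")
    elif vram_gb < MIN_VRAM_GB:
        problems.append(f"{MIN_VRAM_GB}GB VRAM required, found {vram_gb}GB")
    return problems


if __name__ == "__main__":
    # Example: check the running interpreter against a hypothetical 24GB card.
    for problem in check_local_requirements(sys.version_info[:2], vram_gb=24):
        print("unmet:", problem)
```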

New users get a free credit on sign-up to generate one video. Additional credits are available for purchase — see the pricing page for plan details.

English and Chinese are the supported languages. The Whisper-Large-v3 encoder performs best on English and Chinese audio for lip-sync alignment and speech feature extraction. Other languages may work but aren't officially supported.

Real-time generation is not supported. This is an offline generation model. Even with 8-step inference, each video requires meaningful GPU compute time. It's not designed for live-streaming or real-time avatar applications.

Version 1.5 adds dual-audio support for multi-person avatar scenes via Merge and Concatenation modes.

Switch to Multi Avatar mode in the online tool and upload two separate audio tracks. Merge mode runs both tracks simultaneously and requires equal-length clips. Concatenation mode sequences them one after the other — no equal length required, with silence padding any gaps.
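The difference between the two modes can be sketched as a small timeline planner. This reflects one reading of the description above, not the tool's actual implementation. The function name, the dictionary keys, and the length tolerance are hypothetical, and Concatenation is read here as each speaker staying silent while the other talks.

```python
import math


def plan_dual_audio(mode: str, len_a: float, len_b: float,
                    tolerance: float = 0.01) -> dict:
    """Return the total duration and each speaker's active window, in seconds."""
    if len_a <= 0 or len_b <= 0:
        raise ValueError("both audio tracks must have a positive length")
    if mode == "merge":
        # Merge plays both tracks at once, so the clips must be equally long.
        if not math.isclose(len_a, len_b, abs_tol=tolerance):
            raise ValueError("Merge mode requires equal-length clips")
        return {"total_seconds": len_a,
                "speaker_a": (0.0, len_a),
                "speaker_b": (0.0, len_b)}
    if mode == "concatenation":
        # Concatenation plays track A, then track B; outside its own window
        # each speaker is padded with silence (interpretation, see lead-in).
        return {"total_seconds": len_a + len_b,
                "speaker_a": (0.0, len_a),
                "speaker_b": (len_a, len_a + len_b)}
    raise ValueError(f"unknown mode: {mode!r}")
```

With an 8-second and a 12-second track, Concatenation yields a 20-second timeline in which the second speaker is active from 8 to 20 seconds, while Merge would reject the pair for unequal length.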

Credit usage depends on video length, resolution, and generation mode. Higher resolution and longer duration consume more credits per generation.

🆓 Free credit on sign-up · No GPU required · 🔓 MIT licensed model

Ready to Generate Your First Avatar Video?

You don't need a GPU, a subscription, or a production team. Upload a photo, add audio, and LongCat Video Avatar 1.5 handles the rest — right in your browser.

Generate Your First Avatar Video