Day 1/5: SkyReels-A3: The Art of Natural Speech for Digital Humans

The Skywork AI Technology Release Week officially kicked off on August 11. From August 11 to August 15, a new model will be unveiled each day, covering cutting-edge models for multimodal AI scenarios.

On August 11, Skywork officially launched the SkyReels-A3 model. Combining a Diffusion Transformer (DiT) model, frame interpolation for extended video generation, reinforcement learning-based motion refinement, and controllable camera techniques, SkyReels-A3 supports full-modality, audio-driven digital human synthesis with unrestricted duration.

The SkyReels-A3 model is now live! Visit the SkyReels official website to try it out:

Links

SkyReels-A3 homepage:

https://skyworkai.github.io/skyreels-a3.github.io/

SkyReels official website (After logging in, select the “Talking Avatar” tool from the left navigation bar):

https://www.skyreels.ai/home

SkyReels open-source model repository:

https://huggingface.co/Skywork

SkyReels-A3 is an audio-driven portrait video generation model that acts like an “AI vocal cord” for any photo or video:

— Bring photos to life: Upload a portrait image and a voice clip – the person in the photo will lip-sync and speak or sing naturally;

— Generate custom videos: Upload a portrait, add a voice clip, and provide a text prompt – the character will perform with directed expressions and motions;

— Re-dub existing videos: Replace the original audio, and the model will automatically adjust lip movements, facial expressions, and gestures while preserving visual continuity.

The SkyReels-A3 model delivers innovative experiences across four key dimensions:

— Text Prompt input enables dynamic scene modification;

— Enhanced Natural Movements – More lifelike interactions, including object handling and natural hand gestures during speech;

— Advanced Cinematic Control – Sophisticated camera work for artistic scenes (music/MVs) with elevated aesthetic quality;

— Extended Video Generation – Single-shot videos up to 60 seconds; multi-shot sequences with unlimited duration potential.

Through analysis of real-world applications (e.g., advertising, live-stream commerce), we identified two key requirements: longer-duration videos with consistent quality, and more natural and precise interactive motions. To address these, we developed specialized training datasets for live-stream scenarios and implemented targeted optimizations in video generation.

Moreover, in scenarios requiring high artistic fidelity, such as music videos, film clips, or professional presentations, traditional digital humans are limited to generating “static shots,” producing rigid and visually flat results.

To enable dynamic cinematography, we developed a ControlNet-based camera control module. By processing precise camera parameters, the system achieves frame-accurate camera motion control. Specifically, the module extracts depth data from reference images and integrates user-defined camera parameters to render trajectory-guided reference videos. These videos serve as explicit motion priors for reconstructing professional-grade camera movements frame by frame, yielding digital human videos with cinematic-quality camera work.
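For readers curious about the mechanics, here is a minimal sketch of the geometry that typically underlies a trajectory-guided reference video: lifting a reference frame into a 3D point cloud with its depth map, then reprojecting those points under a new camera pose. The function names, intrinsics, and pose conventions below are illustrative assumptions, not SkyReels-A3's actual implementation.

```python
# Illustrative geometry behind a trajectory-guided reference video:
# back-project a frame's pixels into 3D using its depth map, then
# project them under a user-defined camera pose. All names and
# conventions here are assumptions, not the SkyReels-A3 codebase.
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift every pixel (u, v) with depth d into a camera-space 3D point."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    return np.linalg.inv(K) @ pix * depth.reshape(1, -1)               # 3 x N

def reproject(points: np.ndarray, K: np.ndarray,
              R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project 3D points into a camera at pose (R, t); returns 2 x N pixels."""
    cam = R @ points + t[:, None]
    uvw = K @ cam
    return uvw[:2] / np.clip(uvw[2:], 1e-6, None)

# Example: one frame of a "push in" is the reference re-rendered with the
# camera moved slightly forward along its optical axis.
K = np.array([[500.0, 0, 128], [0, 500.0, 128], [0, 0, 1]])
depth = np.full((256, 256), 2.0)            # flat stand-in depth map
pts = backproject(depth, K)
uv = reproject(pts, K, np.eye(3), np.array([0.0, 0.0, -0.1]))
```

Repeating the reprojection for every pose along a trajectory produces the reference video that the control module can then use as its explicit motion prior.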

Currently, we offer eight preset camera movement parameters: static shot, push in, push out, pan left, pan right, crane up, crane down, and handheld swing shot. Each movement type supports continuous intensity adjustment from 0% to 100%, allowing users to achieve precisely tailored cinematographic effects for diverse needs.
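As a rough illustration of how such presets might be parameterized, the sketch below maps each preset name to a per-frame pose offset scaled by the 0–100% intensity. Only the preset names and the intensity range come from the description above; the motion math itself is a hypothetical stand-in.

```python
# Hypothetical parameterization of the eight camera presets: each maps a
# normalized intensity and frame index to small pose offsets. The preset
# names match the list above; the formulas are illustrative assumptions.
import math

def pose_offset(preset: str, intensity_pct: float, i: int, n: int):
    """Return (pan_angle_rad, dx, dy, dz) for frame i of n at a given intensity."""
    s = (intensity_pct / 100.0) * i / max(n - 1, 1)   # progress-scaled strength
    table = {
        "static":     (0.0, 0.0, 0.0, 0.0),
        "push_in":    (0.0, 0.0, 0.0, -s),
        "push_out":   (0.0, 0.0, 0.0,  s),
        "pan_left":   (-s,  0.0, 0.0, 0.0),
        "pan_right":  ( s,  0.0, 0.0, 0.0),
        "crane_up":   (0.0, 0.0,  s,  0.0),
        "crane_down": (0.0, 0.0, -s,  0.0),
        # Handheld swing: oscillating jitter rather than a steady drift.
        "handheld_swing": (0.1 * s * math.sin(i * 0.8), 0.0, 0.0, 0.0),
    }
    return table[preset]

trajectory = [pose_offset("push_in", 60, i, 48) for i in range(48)]  # 60% push in
```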

SkyReels-A3 is built upon a Diffusion Transformer (DiT) video diffusion model framework.

The DiT model has garnered significant attention for its exceptional performance in image and video generation. By replacing traditional U-Net architectures with a Transformer structure, it demonstrates superior capability in capturing long-range dependencies. In SkyReels-A3, we employ a 3D Variational Autoencoder (3D-VAE) to process video data in latent space representation. The 3D-VAE compresses video data across both spatial and temporal dimensions, transforming high-dimensional raw video data into compact latent representations. This latent-space processing approach substantially reduces the computational load for subsequent diffusion models while preserving critical visual information.
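For intuition about what the 3D-VAE's spatio-temporal compression looks like in practice, here is a minimal sketch assuming a commonly used 4x temporal / 8x spatial reduction. SkyReels-A3's actual architecture, channel widths, and compression ratios are not published in this announcement, so everything below is illustrative.

```python
# Minimal sketch of a 3D-VAE encoder: stacked 3D convolutions compress a
# video across time and space into a compact latent. Layer sizes and the
# 4x temporal / 8x spatial ratio are assumptions for illustration only.
import torch
import torch.nn as nn

class Tiny3DEncoder(nn.Module):
    def __init__(self, in_ch=3, latent_ch=16):
        super().__init__()
        self.net = nn.Sequential(
            # Each stride-2 Conv3d halves the strided dimensions; two stages
            # stride time (4x total), three stages stride space (8x total).
            nn.Conv3d(in_ch, 64, 3, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(64, 128, 3, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(128, latent_ch, 3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.net(x)

video = torch.randn(1, 3, 16, 256, 256)   # 16 frames of 256x256 RGB
z = Tiny3DEncoder()(video)
print(z.shape)  # torch.Size([1, 16, 4, 32, 32]): 4x temporal, 8x spatial
```

The diffusion model then operates on tensors shaped like `z` rather than on raw pixels, which is where the computational savings come from.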

SkyReels-A3's performance has been rigorously validated through extensive experimentation, including both quantitative and qualitative comparisons against state-of-the-art models (both open-source and proprietary). The results comprehensively demonstrate its capabilities in audio-driven video generation.

In addition, through step distillation techniques, we reduced the required inference steps from 40 to just 4 while maintaining comparable output quality.
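The latency win is easy to see in a sampler loop: each step costs one full denoiser (DiT) forward pass, so 4 steps versus 40 is roughly a 10x reduction in the dominant cost. The Euler update and sigma schedule below are a generic sketch, not SkyReels-A3's actual distilled sampler.

```python
# Why fewer steps means faster inference: the sampler calls the denoiser
# once per step, so a distilled 4-step schedule replaces 40 model calls.
# The update rule and noise schedule are illustrative assumptions.
import torch

def sample(denoiser, shape, sigmas):
    """Plain Euler sampler: one denoiser call per step in `sigmas`."""
    x = torch.randn(shape) * sigmas[0]
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoiser(x, s_cur)      # model's clean-video estimate
        d = (x - denoised) / s_cur         # current noise direction
        x = x + d * (s_next - s_cur)       # step toward the next noise level
    return x

# A distilled model runs a short 4-step schedule instead of a 40-step one.
sigmas_4 = torch.tensor([14.6, 3.0, 0.7, 0.1, 0.0])
latent = sample(lambda x, s: torch.zeros_like(x), (1, 16, 4, 32, 32), sigmas_4)
```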

From celluloid to digital, 2D to 3D – each imaging revolution has redrawn the boundaries of content creation.

SkyReels-A3 pioneers democratized voice-to-video synthesis, delivering studio-quality animation from just a single image and audio clip – no specialized hardware or production expertise required.

SkyReels-A3 animates static photos into lifelike talking portraits, overdubs speech in existing videos without face replacement, and delivers flawlessly smooth digital human livestreams. By offering an accessible, cost-effective, and high-fidelity AI solution, it serves diverse fields, from film production and virtual streaming to game development and educational content creation. With SkyReels-A3, personalized and interactive content has never been easier to produce.

SkyReels-A3 brings the “voice as vision” paradigm to life, where your inspiration could spark the next viral sensation.


View original content: https://www.prnewswire.com/news-releases/day15-skyreels-a3-the-art-of-natural-speech-for-digital-humans-302526394.html

SOURCE Skywork AI pte ltd
