Skywork AI Technology Release Week officially kicked off on August 11. From August 11 to August 15, Skywork will unveil a new model each day, spanning cutting-edge multimodal AI scenarios.
On August 11, Skywork officially launched the SkyReels-A3 model. Combining a Diffusion Transformer (DiT) model, frame interpolation for extended video generation, reinforcement learning-based motion refinement, and controllable camera techniques, SkyReels-A3 supports full-modality, audio-driven digital human synthesis with unrestricted duration.
The SkyReels-A3 model is now live! Visit the SkyReels official website to try it out:
Links
SkyReels-A3 homepage:
https://skyworkai.github.io/skyreels-a3.github.io/
SkyReels official website (After logging in, select the “Talking Avatar” tool from the left navigation bar):
https://www.skyreels.ai/home
SkyReels open-source model repository:
https://huggingface.co/Skywork
SkyReels-A3 is an audio-driven portrait video generation model that acts like an “AI vocal cord” for any photo or video:
— Bring photos to life: Upload a portrait image and a voice clip – the person in the photo will lip-sync and speak or sing naturally;
— Generate custom videos: Upload a portrait, add a voice clip, and provide a text prompt – the character will perform with directed expressions and motions;
— Re-dub existing videos: Replace the original audio, and the model will automatically adjust lip movements, facial expressions, and gestures while preserving visual continuity.
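For readers planning to experiment with the open-source release, the three modes correspond to three input combinations. The sketch below is purely illustrative: the function name, arguments, and routing logic are hypothetical stand-ins, not the actual SkyReels-A3 interface.

```python
# Hypothetical sketch of the three SkyReels-A3 usage modes.
# Function and argument names are invented for illustration only;
# consult the official repository for the real interface.

def generate(image=None, video=None, audio=None, prompt=None):
    """Pretend entry point: routes inputs to one of the three modes."""
    if image is not None and audio is not None and prompt is None:
        return "mode 1: animate a still portrait to lip-sync the audio"
    if image is not None and audio is not None and prompt is not None:
        return "mode 2: directed performance with text-prompted motion"
    if video is not None and audio is not None:
        return "mode 3: re-dub an existing video, re-syncing lips and gestures"
    raise ValueError("unsupported input combination")

print(generate(image="portrait.png", audio="speech.wav"))
print(generate(image="portrait.png", audio="speech.wav", prompt="wave hello"))
print(generate(video="clip.mp4", audio="new_dub.wav"))
```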
The SkyReels-A3 model delivers innovative experiences across four key dimensions:
— Text-Prompted Scene Control – Text prompt input enables dynamic scene modification;
— Enhanced Natural Movements – More lifelike interactions, including object handling and natural hand gestures during speech;
— Advanced Cinematic Control – Sophisticated camera work for artistic scenes (music/MVs) with elevated aesthetic quality;
— Extended Video Generation – Single-shot videos up to 60 seconds; multi-shot sequences with unlimited duration potential.
Through analysis of real-world applications (e.g., advertising, live-stream commerce), we identified two key requirements: longer-duration videos with consistent quality, and more natural and precise interactive motions. To address these, we developed specialized training datasets for live-stream scenarios and implemented targeted optimizations in video generation.
Moreover, in scenarios requiring high artistic fidelity, such as music videos, film clips, or professional presentations, traditional digital humans are limited to generating “static shots,” producing rigid and visually flat results.
To enable dynamic cinematography, we developed a ControlNet-based camera control module. By processing precise camera parameters, the system achieves frame-accurate camera motion control. Specifically, the module extracts depth data from reference images and integrates user-defined camera parameters to render trajectory-guided reference videos. It uses these videos as explicit motion priors to reconstruct professional-grade camera movements frame by frame. The output is digital human video with cinematic-quality camera work.
Currently, we offer eight camera movement presets: static shot, push in, push out, pan left, pan right, crane up, crane down, and handheld swing shot. Each movement type supports continuous intensity adjustment from 0 to 100%, allowing users to achieve precisely tailored cinematographic effects for diverse needs.
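To make this pipeline concrete, here is a minimal Python sketch of the flow described above: depth estimation, a user-defined camera path, a rendered reference video, then ControlNet conditioning. The preset names follow the list above, but every function, pose format, and frame count here is an assumption for illustration, not the released implementation.

```python
import numpy as np

# Illustrative-only sketch of the camera-control flow: depth estimation ->
# user trajectory -> rendered reference video -> ControlNet conditioning.
# The eight preset names come from the text; everything else is a stand-in.

PRESETS = ["static", "push_in", "push_out", "pan_left", "pan_right",
           "crane_up", "crane_down", "handheld_swing"]

def estimate_depth(image: np.ndarray) -> np.ndarray:
    """Stand-in for a monocular depth estimator."""
    return np.ones(image.shape[:2], dtype=np.float32)  # fake flat depth

def camera_path(preset: str, num_frames: int, intensity: float):
    """Per-frame camera poses; intensity is the 0-100% dial from the text."""
    assert preset in PRESETS and 0.0 <= intensity <= 100.0
    scale = np.linspace(0.0, intensity / 100.0, num_frames)
    if preset == "push_in":                      # move toward the subject
        return [{"t": (0.0, 0.0, s)} for s in scale]
    if preset == "pan_left":                     # rotate about the y axis
        return [{"yaw": -0.3 * s} for s in scale]
    return [{} for _ in scale]                   # static and other presets

def render_reference_video(image, depth, poses):
    """A real renderer would unproject pixels with `depth` and reproject
    them under each camera pose; here we simply repeat the frame."""
    return np.stack([image] * len(poses))

image = np.zeros((480, 640, 3), dtype=np.float32)
poses = camera_path("push_in", num_frames=49, intensity=60.0)
ref_video = render_reference_video(image, estimate_depth(image), poses)
# `ref_video` would then enter the ControlNet branch as an explicit
# frame-by-frame motion prior during denoising.
print(ref_video.shape)  # (49, 480, 640, 3)
```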
SkyReels-A3 is built on a Diffusion Transformer (DiT) video diffusion framework.
The DiT model has garnered significant attention for its exceptional performance in image and video generation. By replacing traditional U-Net architectures with a Transformer structure, it demonstrates superior capability in capturing long-range dependencies. In SkyReels-A3, we employ a 3D Variational Autoencoder (3D-VAE) to process video data in latent space representation. The 3D-VAE compresses video data across both spatial and temporal dimensions, transforming high-dimensional raw video data into compact latent representations. This latent-space processing approach substantially reduces the computational load for subsequent diffusion models while preserving critical visual information.
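As a toy illustration of that latent-space compression, the sketch below uses strided 3D convolutions to shrink a video tensor across time and space. The 4x temporal and 8x spatial ratios and the channel counts are assumptions for illustration, not SkyReels-A3's published configuration.

```python
import torch
import torch.nn as nn

# Toy 3D-VAE encoder: strided 3D convolutions compress a video across
# time (T) and space (H, W) into a compact latent. The 4x/8x ratios are
# illustrative assumptions, not SkyReels-A3's published configuration.

encoder = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
    nn.SiLU(),
    nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
    nn.SiLU(),
    nn.Conv3d(128, 16, kernel_size=3, stride=(2, 2, 2), padding=1),
)

video = torch.randn(1, 3, 16, 256, 256)   # (batch, C, T, H, W)
latent = encoder(video)
# The diffusion model then denoises this much smaller tensor instead of
# raw pixels, which is what cuts the compute cost.
print(video.shape, "->", latent.shape)    # T: 16 -> 4, H/W: 256 -> 32
```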
SkyReels-A3's performance has been rigorously validated through extensive experimentation, including both quantitative and qualitative comparisons against state-of-the-art models (both open-source and proprietary). The results comprehensively demonstrate its capabilities in audio-driven video generation.
In addition, through step distillation techniques, we reduced the required inference steps from 40 to just 4 while maintaining comparable output quality.
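The payoff of step distillation is easiest to see in the sampling loop, since the network is called once per step: cutting 40 steps to 4 removes roughly 90% of the denoising compute. The Euler-style sampler and toy model below are generic assumptions, not the specific distillation recipe used for SkyReels-A3.

```python
import torch

# Generic few-step sampler: the model is invoked once per step, so the
# distilled 4-step student does ~10x less denoising work than a 40-step
# run. The Euler update and toy model are illustrative assumptions only.

def toy_velocity_model(x, t):
    """Stand-in for the distilled DiT; predicts a denoising direction."""
    return -x  # drives samples toward zero, just to make the loop run

@torch.no_grad()
def sample(model, shape, num_steps):
    x = torch.randn(shape)                      # start from pure noise
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = model(x, t_cur)                     # one network call per step
        x = x + (t_next - t_cur) * v            # Euler step toward t = 0
    return x

latent_40 = sample(toy_velocity_model, (1, 16, 4, 32, 32), num_steps=40)
latent_4 = sample(toy_velocity_model, (1, 16, 4, 32, 32), num_steps=4)
print(latent_40.shape, latent_4.shape)
```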
From celluloid to digital, 2D to 3D – each imaging revolution has redrawn the boundaries of content creation.
SkyReels-A3 pioneers democratized voice-to-video synthesis, delivering studio-quality animation from just a single image and audio clip – no specialized hardware or production expertise required.
SkyReels-A3 animates static photos into lifelike talking portraits, overdubs speech in existing videos without face replacement, and delivers flawlessly smooth digital human livestreams. By offering an accessible, cost-effective, and high-fidelity AI solution, it serves diverse fields, from film production and virtual streaming to game development and educational content creation. With SkyReels-A3, personalized and interactive content has never been easier to produce.
SkyReels-A3 brings the “voice as vision” paradigm to life, where your inspiration could spark the next viral sensation.
View original content: https://www.prnewswire.com/news-releases/day15-skyreels-a3-the-art-of-natural-speech-for-digital-humans-302526394.html
SOURCE Skywork AI Pte Ltd