We introduce HunyuanPortrait, a diffusion-based conditional control method that employs implicit
representations for highly controllable and lifelike portrait animation. Given a single portrait image
as an appearance reference and video clips as driving templates,
HunyuanPortrait animates the character in the reference image using the facial expressions and head poses
captured in the driving videos.
In our framework, pre-trained encoders decouple
portrait motion information from identity in the videos. The motion information is encoded as implicit
representations, which serve as control signals during the animation phase.
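A minimal sketch of this decoupling step is given below, assuming hypothetical frozen encoder modules (appearance_encoder, motion_encoder) and tensor shapes; it is an illustration of the idea, not the released implementation.

import torch
import torch.nn as nn

class ImplicitConditionExtractor(nn.Module):
    """Extracts identity features and implicit motion tokens with frozen pre-trained encoders."""
    def __init__(self, appearance_encoder: nn.Module, motion_encoder: nn.Module):
        super().__init__()
        # Both encoders are assumed pre-trained and kept frozen.
        self.appearance_encoder = appearance_encoder.eval().requires_grad_(False)
        self.motion_encoder = motion_encoder.eval().requires_grad_(False)

    @torch.no_grad()
    def forward(self, reference: torch.Tensor, driving: torch.Tensor):
        # reference: (B, 3, H, W) single portrait image
        # driving:   (B, T, 3, H, W) driving video clip
        identity_feats = self.appearance_encoder(reference)      # (B, N_id, C)
        b, t = driving.shape[:2]
        frames = driving.flatten(0, 1)                            # (B*T, 3, H, W)
        motion_tokens = self.motion_encoder(frames)               # (B*T, N_m, C)
        motion_tokens = motion_tokens.unflatten(0, (b, t))        # (B, T, N_m, C)
        return identity_feats, motion_tokens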
Leveraging Stable Video Diffusion as the main building block, we carefully design
adapter layers that inject the control signals into the denoising UNet through attention mechanisms,
yielding rich spatial detail and temporal consistency.
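The sketch below illustrates one plausible form of such an adapter: a cross-attention layer, with assumed shapes and hypothetical names, that attends from UNet features to the implicit motion tokens and adds the result back through a zero-initialized projection so that training starts from the unmodified backbone.

import torch
import torch.nn as nn

class MotionCrossAttentionAdapter(nn.Module):
    """Injects implicit motion tokens into a UNet block via residual cross-attention."""
    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=dim, num_heads=num_heads,
            kdim=cond_dim, vdim=cond_dim, batch_first=True,
        )
        # Zero-initialized output projection: the adapter initially leaves the
        # pre-trained denoising UNet's behavior unchanged.
        self.proj_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, hidden_states: torch.Tensor, motion_tokens: torch.Tensor):
        # hidden_states: (B, N, dim) flattened spatial features of a UNet block
        # motion_tokens: (B, M, cond_dim) implicit motion control signals
        q = self.norm(hidden_states)
        attn_out, _ = self.attn(q, motion_tokens, motion_tokens, need_weights=False)
        return hidden_states + self.proj_out(attn_out)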
HunyuanPortrait also exhibits strong generalization, effectively disentangling
appearance and motion across different image styles. Our framework outperforms existing methods,
demonstrating superior temporal consistency and controllability.
Disentanglement of Appearance and Facial Movements