C-MET: Cross-Modal Emotion Transfer

[CVPR 2026] Generate talking face videos with the desired emotion driven by speech — no reference image needed.

Note: The source image and driving video must already be face crops at 256×256 pixels. See crop_image2.py for preprocessing.
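
The size requirement in the note above can be checked before uploading. The following is a minimal sketch, assuming the source image is a PNG file; it only reads the dimensions from the file header and does not detect or crop faces, so it is not a substitute for crop_image2.py. The function names and the REQUIRED_SIZE constant are hypothetical and are not part of the C-MET code.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
REQUIRED_SIZE = 256  # side length in pixels, taken from the note above


def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes.

    Only the first 24 bytes are needed: an 8-byte signature, a 4-byte
    chunk length, the 4-byte chunk type "IHDR", then width and height
    as big-endian unsigned 32-bit integers.
    """
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def is_required_size(data: bytes, size: int = REQUIRED_SIZE) -> bool:
    """Hypothetical helper: True if the PNG is exactly size x size pixels."""
    return png_dimensions(data) == (size, size)
```

To use it on a file, read the header with `open(path, "rb").read(24)` and pass the bytes to `is_required_size`. JPEG inputs and video frames would need a different reader.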

Target Emotion
Upscale the output resolution from 256 to 512 using GFPGAN. This increases processing time.
Examples