C-MET: Cross-Modal Emotion Transfer

[CVPR 2026] Generate talking face videos with the desired emotion driven by speech — no reference image needed.

Note: Source image and driving video must be pre-cropped to 256×256 face crops. See crop_image2.py for preprocessing.
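The square-crop step implied by the note above can be sketched as follows. This is not the logic of crop_image2.py, which is not shown here. It is a minimal illustration that assumes a face bounding box is already available from some detector; the function name `square_crop_box` and the `margin` parameter are hypothetical. Resizing the resulting crop to 256×256 would then be done with an image library of your choice.

```python
def square_crop_box(face_box, image_size, margin=0.3):
    """Return (left, top, right, bottom) of a square crop around a face box.

    face_box   -- (x0, y0, x1, y1) face bounding box in pixels (assumed input)
    image_size -- (width, height) of the full image in pixels
    margin     -- hypothetical padding factor around the face
    """
    x0, y0, x1, y1 = face_box
    width, height = image_size

    # Square side: the larger face dimension, padded by the margin,
    # but never larger than the shorter side of the image.
    side = int(round(max(x1 - x0, y1 - y0) * (1 + margin)))
    side = min(side, width, height)

    # Center the square on the face box.
    center_x = (x0 + x1) / 2
    center_y = (y0 + y1) / 2
    left = int(round(center_x - side / 2))
    top = int(round(center_y - side / 2))

    # Shift the square so it stays fully inside the image.
    left = max(0, min(left, width - side))
    top = max(0, min(top, height - side))
    return left, top, left + side, top + side
```

Because the box is always square, scaling it to 256×256 does not distort the face.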

Source Image (256×256 face crop)

Driving Audio (.wav)

Pose Driving Video (25 fps, face crop)

Target Emotion

Intensity (1 to 3)
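Since the pose driving video is expected at 25 fps, it can help to check beforehand how many video frames a given driving audio clip spans. The sketch below uses only Python's standard `wave` module. The helper names are hypothetical, and how the demo itself handles a driving video that is shorter or longer than the audio is not stated here.

```python
import wave

FPS = 25  # frame rate stated for the pose driving video


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frames_for_audio(duration_seconds, fps=FPS):
    """Return the number of whole video frames covered by the audio."""
    return int(duration_seconds * fps)
```

For example, a 2-second clip corresponds to `frames_for_audio(2.0)` = 50 frames at 25 fps.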

Super Resolution (256 → 512)

Upscales the generated video from 256 to 512 using GFPGAN. Enabling this option increases processing time.

Generated Video

Examples