This workflow does the following:
Takes an image and a video as input; in the video, someone performs a motion (dancing, speaking, any gesture or movement you can think of)
Isolates the person in both the image and the video using segmentation (optional); a minimal sketch of this idea appears after this list
Uses Florence2 to generate an accurate description of the image. The description can be altered with a Text Replace node (for example, to force the ethnicity of the influencer); see the second sketch below
Feeds the Florence2 description and a depth map of the video's motion into the Wan 2.1 text-to-video model to replicate the motion exactly (depth extraction is sketched below)
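The segmentation step can be approximated outside ComfyUI with a background-removal library. A minimal sketch, assuming rembg is installed (`pip install rembg`); the workflow itself uses its own segmentation node, so this only illustrates the idea of isolating the subject, and the file paths are examples:

```python
# Subject-isolation sketch using rembg (an assumption; the workflow
# uses a segmentation node inside ComfyUI). Paths are hypothetical.
from rembg import remove
from PIL import Image

image = Image.open("influencer.png")     # input portrait
subject = remove(image)                  # RGBA image with background removed
subject.save("influencer_isolated.png")
```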
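The captioning step can be reproduced with the public Florence-2 checkpoint on Hugging Face, and a plain string replacement mirrors what the Text Replace node does. A sketch under those assumptions; the file name and the replacement pair are examples:

```python
# Florence-2 detailed caption + the Text Replace trick.
# Model ID is the public Microsoft checkpoint; paths and the
# replacement pair below are examples, not the workflow's values.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("influencer_isolated.png").convert("RGB")
task = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
)
caption = processor.post_process_generation(
    processor.batch_decode(ids, skip_special_tokens=False)[0],
    task=task,
    image_size=image.size,
)[task]

# Text Replace equivalent: override an attribute in the caption
# (example pair; swap in your own).
prompt = caption.replace("a woman", "an East Asian woman")
print(prompt)
```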
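The Wan 2.1 step is conditioned on per-frame depth maps of the motion video. A sketch of the depth-extraction half using the transformers depth-estimation pipeline; the Depth-Anything checkpoint is an assumption (the workflow may use a different preprocessor node), and paths are examples:

```python
# Per-frame depth maps from the motion video, used to condition Wan 2.1.
# Checkpoint and paths are assumptions, not the workflow's exact nodes.
# Reading .mp4 requires an imageio video backend (e.g. imageio[pyav]).
import os

import imageio.v3 as iio
from PIL import Image
from transformers import pipeline

depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

os.makedirs("depth", exist_ok=True)
frames = iio.imread("motion.mp4")               # (num_frames, H, W, 3) uint8
for i, frame in enumerate(frames):
    d = depth(Image.fromarray(frame))["depth"]  # PIL grayscale depth map
    d.save(f"depth/{i:05d}.png")                # feed this sequence to the depth control input
```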
Optional:
Uses ReActor to face-swap the generated video, restoring the face from the input image
Creates a voice-over with F5-TTS, cloning any voice you want (see the sketch after this list)
Generates the lip-sync from the cloned voice.
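F5-TTS ships a command-line inference tool that the voice-over step corresponds to. A sketch of driving it from Python; the flag names follow the F5-TTS README at the time of writing and should be checked against your installed version, and all paths and the script text are examples:

```python
# Voice-over via the F5-TTS inference CLI, driven from Python.
# Flag names may differ across F5-TTS versions; verify against the repo.
# Audio paths and the generated script are examples.
import subprocess

subprocess.run([
    "f5-tts_infer-cli",
    "--model", "F5TTS_v1_Base",
    "--ref_audio", "voice_sample.wav",   # ~10 s clip of the voice to clone
    "--ref_text", "Transcript of the reference clip.",
    "--gen_text", "Hello! Here is exactly what I want my influencer to say.",
    "--output_dir", "tts_out",
], check=True)
```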
Think UNLIMITED influencers doing EXACTLY what you want them to do and saying EXACTLY what you want them to say!