Deep Word needs only a video input and an audio input to create your videos. For the video input, you can select one of our video actors or upload your own. For the audio input, you can select one of our samples, type what you want your video actor to say (text-to-speech), or upload your own audio. Deep Word will sync the lip and jaw movements of your video actor with your selected audio in minutes.

Note: Your video and audio inputs DO NOT have to be the same length. Deep Word works by repeatedly looping the video of your actor forwards and backwards until it reaches the length of your audio input. Our servers then modify the actor’s lip and jaw movements to sync with your audio.

For example, if your uploaded video is 20 seconds long and your typed or uploaded audio is 60 seconds long, Deep Word will automatically loop your video forwards (20 seconds), backwards (20 seconds), and then forwards again (20 seconds) to reach the length of your 60-second audio.
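
The forwards-and-backwards looping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Deep Word's actual code: the function name `ping_pong_segments` is hypothetical, and the sketch assumes the final pass is simply cut short when the audio ends partway through it.

```python
def ping_pong_segments(video_seconds, audio_seconds):
    """Return (direction, seconds) passes that together cover the audio length."""
    if video_seconds <= 0:
        raise ValueError("video_seconds must be positive")
    segments = []
    remaining = audio_seconds
    forwards = True
    while remaining > 0:
        # Assumption: the last pass is truncated if the audio ends mid-pass.
        duration = min(video_seconds, remaining)
        segments.append(("forwards" if forwards else "backwards", duration))
        remaining -= duration
        forwards = not forwards  # alternate direction on every pass
    return segments


# The 20-second video and 60-second audio from the example above.
print(ping_pong_segments(20, 60))
# [('forwards', 20), ('backwards', 20), ('forwards', 20)]
```

With a 50-second audio input, the same sketch would return two full passes followed by a 10-second forwards pass.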

Video Input

The person you want talking

Audio Input

The words you want them to say


The video input is a short video clip of the actor you want talking. For best results:

  • Not too close, not too far. 5-10 feet from the camera is optimal, framed from approximately the waist up.

  • Looking directly at the camera.

  • Stationary (seated or standing) and not moving excessively.

  • Against a solid-colored background that contrasts with the clothing and skin tone of the actor.

  • The actor’s lips should not be moving, but they can make facial expressions and hand gestures as long as their head is not moving very much.

  • No obstructions to the nose, lips, or jaw. Even a single rogue frame can produce poor results.


The audio input is the audio that you want the person in your video to speak. For best results:

  • Audio with little to no background noise.

  • Stable audio levels. No spikes or clipping of gain.

  • Can be in ANY language.

  • Your generated video will be the same length as this audio file.

Example Result

