Video voice-over localization

Description

The Video Localization Script automates the localization of YouTube videos by replacing the original voice with a localized version while preserving the original background audio. It uses voice cloning technology to replicate the original speaker's voice(s). The script is well suited for content creators, educators, and hobbyists aiming to reach a global audience.

Check out the output

A simple multi-lingual demo (English -> French, German, Russian)

Multi-voice localization with voice cloning (manual fine-tuning).

Single-voice fully automated localization with voice cloning.

Motivation

I created this script to help my parents watch English YouTube videos in Russian. Subtitles didn’t work well for Penn and Teller’s “Fool Us” videos, where viewers need to closely follow the action. To provide a better experience, I developed a script that fits the translation into the original timing. After some effort, it worked well for both single and multi-voice videos, allowing my parents to enjoy content without relying on subtitles.

Who Could This Be Useful For?

This service could benefit various groups:

  1. Content Creators: YouTubers or video producers wanting to reach a global audience.
  2. Educational Institutions: Schools or universities looking to make educational content accessible in multiple languages.
  3. International Businesses: Companies creating training videos or presentations for a multilingual workforce.
  4. Entertainment Industry: For dubbing movies, TV shows, or web series into different languages.
  5. News Agencies: For quick localization of news reports to different language markets.

Development Details

  • Time to create the service: 2-3 weeks
  • Code composition:
    • ChatGPT: 80%
    • Claude: 10%
    • Manual coding and refactoring: 10%
  • Tested and localized: ~15 videos
  • Cost of running: <$0.05 per run (all models except Spleeter run locally on a laptop with an 8 GB GPU)

Workflow

  1. The script downloads the video from YouTube
  2. Separates the voice from the background audio
  3. Transcribes the voice track and converts it to SRT
  4. Translates the SRT using Microsoft Bing Translator
  5. Uses voice cloning technology to generate localized audio
  6. Re-aligns the audio with the video, speeding up the audio as needed (a minimal timing sketch follows this list)
  7. Merges the localized audio with the original background audio
  8. Generates synchronized subtitles
  9. Combines all elements into the final video
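
For illustration, here is a minimal Python sketch (not the project's actual code) of the timing step: a synthesized clip is sped up with ffmpeg's atempo filter so it fits into the time slot of its SRT segment. The file names and the fit_to_slot helper are hypothetical.

```python
import subprocess

def audio_duration(path: str) -> float:
    """Return the duration of an audio file in seconds via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def fit_to_slot(clip: str, slot_seconds: float, out_path: str) -> None:
    """Speed up a synthesized clip so it is no longer than its SRT slot."""
    # Only speed up, never slow down; older ffmpeg builds cap atempo at 2.0,
    # so very large factors may need chained atempo filters.
    tempo = max(1.0, audio_duration(clip) / slot_seconds)
    subprocess.run(
        ["ffmpeg", "-y", "-i", clip, "-filter:a", f"atempo={tempo:.3f}", out_path],
        check=True,
    )

# Example: squeeze a 6.8 s synthesized sentence into a 5.2 s subtitle slot.
# fit_to_slot("segment_0001_ru.wav", 5.2, "segment_0001_ru_fit.wav")
```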

AI Services Utilized by the Workflow

  • Spleeter service (source separation library with pretrained models): separates the voice from the background
  • OpenAI Whisper: transcribes the audio (usage sketch below)
  • ChatGPT: generates code and validates translations
  • Microsoft Bing Translator: provides the initial translation
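
As an illustration of the first two steps, a minimal sketch using the Spleeter and Whisper Python APIs might look like the following; the model sizes, file names, and output paths are assumptions for the example, not the project's actual settings.

```python
import whisper
from spleeter.separator import Separator

# 1) Split the soundtrack into vocals and accompaniment.
separator = Separator("spleeter:2stems")
separator.separate_to_file("soundtrack.wav", "separated/")  # -> separated/soundtrack/vocals.wav

# 2) Transcribe the isolated vocals with timestamps.
model = whisper.load_model("medium")
result = model.transcribe("separated/soundtrack/vocals.wav")

# 3) Dump the segments as a simple SRT file.
def to_srt_time(t: float) -> str:
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

with open("transcript.srt", "w", encoding="utf-8") as srt:
    for i, seg in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
                  f"{seg['text'].strip()}\n\n")
```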

Key Learnings

  1. ChatGPT proved most efficient for code generation and translation tasks
  2. Microsoft Bing Translator provides good initial translations at no cost
  3. Local caching optimizes costs and allows for quick error correction (see the caching sketch after this list)
  4. Voice cloning technology has advanced significantly, allowing for realistic voice replication
  5. Proper timing and synchronization are crucial for a seamless viewing experience
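
To illustrate point 3, here is a minimal sketch of disk-level caching for expensive steps: each step writes its result to disk keyed by a hash of its inputs, so re-running the pipeline after a fix only redoes the failed step. The decorator, cache directory, and key scheme are hypothetical, not the project's actual implementation.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached(step_name: str):
    """Decorator: cache a step's JSON-serializable result on disk."""
    def wrap(fn):
        def inner(*args, **kwargs):
            key = hashlib.sha256(
                json.dumps([step_name, args, kwargs], sort_keys=True, default=str).encode()
            ).hexdigest()
            path = CACHE_DIR / f"{step_name}-{key}.json"
            if path.exists():
                return json.loads(path.read_text(encoding="utf-8"))
            result = fn(*args, **kwargs)
            path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
            return result
        return inner
    return wrap

@cached("translate")
def translate(text: str, target_lang: str) -> str:
    # Placeholder: in the real pipeline this would call the translation service.
    return text

print(translate("Hello, world!", target_lang="ru"))  # a second run hits the disk cache
```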

Considerations for Future Improvements

  1. I plan to completely rework the voice-over part. Currently, it is based on SRT files and works with chunks. The plan is to move to the sentence level and try to produce artificial timestamps to support synchronization between audio and video. I've got some ideas…
  2. The main challenge is handling videos with multiple speakers. I have a model that deals with it, but a ton of manual adjustment is still required.
  3. The end goal is to implement "one-click localization".

Bottom Line

While the development process required time and effort, the majority of the coding was accomplished by AI, specifically ChatGPT.