Video voice-over localization

Description

The Video Localization Script automates the localization of YouTube videos by replacing the original voice with a localized version while preserving the original background audio. It uses voice cloning technology to replicate the original speaker's voice(s). The script is well suited for content creators, educators, and hobbyists aiming to reach a global audience.

Check out the output

A simple multi-lingual demo (English -> French, German, Russian)

Multi-voice localization with voice cloning (manual fine-tuning).

Single-voice fully automated localization with voice cloning.

Motivation

I created this script to help my parents watch English YouTube videos in Russian. Subtitles didn’t work well for Penn and Teller’s “Fool Us” videos, where viewers need to closely follow the action. To provide a better experience, I developed a script that fits the translation into the original timing. After some effort, it worked well for both single and multi-voice videos, allowing my parents to enjoy content without relying on subtitles.

Who Could This Be Useful For?

This service could benefit various groups:

  1. Content Creators: YouTubers or video producers wanting to reach a global audience.
  2. Educational Institutions: Schools or universities looking to make educational content accessible in multiple languages.
  3. International Businesses: Companies creating training videos or presentations for a multilingual workforce.
  4. Entertainment Industry: For dubbing movies, TV shows, or web series into different languages.
  5. News Agencies: For quick localization of news reports to different language markets.

Development Details

  • Time to create the service: 2-3 weeks
  • Code composition:
    • ChatGPT: 80%
    • Claude: 10%
    • Manual coding and refactoring: 10%
  • Tested and localized: ~15 videos
  • Cost of running: <$0.05 per run (all models except Spleeter run locally on a laptop with an 8 GB GPU)

Workflow

  1. The script downloads the video from YouTube
  2. Separates the voice from the background audio
  3. Transcribes the voice track and converts it to SRT
  4. Translates the SRT using Microsoft Bing Translator
  5. Uses voice cloning technology to generate localized audio
  6. Re-aligns the audio with the video, speeding up the audio as needed (a minimal timing sketch follows this list)
  7. Merges the localized audio with the original background audio
  8. Generates synchronized subtitles
  9. Combines all elements into the final video
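
For illustration, here is a minimal Python sketch (not the project's actual code) of the timing step: a synthesized clip is sped up with ffmpeg's atempo filter so it fits into the time slot of its SRT segment. The file names and the fit_to_slot helper are hypothetical.

```python
import subprocess

def audio_duration(path: str) -> float:
    """Return the duration of an audio file in seconds via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def fit_to_slot(clip: str, slot_seconds: float, out_path: str) -> None:
    """Speed up a synthesized clip so it is no longer than its SRT slot."""
    # Only speed up, never slow down; older ffmpeg builds cap atempo at 2.0,
    # so very large factors may need chained atempo filters.
    tempo = max(1.0, audio_duration(clip) / slot_seconds)
    subprocess.run(
        ["ffmpeg", "-y", "-i", clip, "-filter:a", f"atempo={tempo:.3f}", out_path],
        check=True,
    )

# Example: squeeze a 6.8 s synthesized sentence into a 5.2 s subtitle slot.
# fit_to_slot("segment_0001_ru.wav", 5.2, "segment_0001_ru_fit.wav")
```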

AI Services Utilized by the Workflow

  • Spleeter service (source separation library with pretrained models): separates the voice from the background
  • OpenAI Whisper: transcribes the audio (usage sketch below)
  • ChatGPT: generates code and validates translations
  • Microsoft Bing Translator: provides the initial translation
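
As an illustration of the first two steps, a minimal sketch using the Spleeter and Whisper Python APIs might look like the following; the model sizes, file names, and output paths are assumptions for the example, not the project's actual settings.

```python
import whisper
from spleeter.separator import Separator

# 1) Split the soundtrack into vocals and accompaniment.
separator = Separator("spleeter:2stems")
separator.separate_to_file("soundtrack.wav", "separated/")  # -> separated/soundtrack/vocals.wav

# 2) Transcribe the isolated vocals with timestamps.
model = whisper.load_model("medium")
result = model.transcribe("separated/soundtrack/vocals.wav")

# 3) Dump the segments as a simple SRT file.
def to_srt_time(t: float) -> str:
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

with open("transcript.srt", "w", encoding="utf-8") as srt:
    for i, seg in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
                  f"{seg['text'].strip()}\n\n")
```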

Key Learnings

  1. ChatGPT proved most efficient for code generation and translation tasks
  2. Microsoft Bing Translator provides good initial translations at no cost
  3. Local caching optimizes costs and allows for quick error correction (see the caching sketch after this list)
  4. Voice cloning technology has advanced significantly, allowing for realistic voice replication
  5. Proper timing and synchronization are crucial for a seamless viewing experience
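
To illustrate point 3, here is a minimal sketch of disk-level caching for expensive steps: each step writes its result to disk keyed by a hash of its inputs, so re-running the pipeline after a fix only redoes the failed step. The decorator, cache directory, and key scheme are hypothetical, not the project's actual implementation.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached(step_name: str):
    """Decorator: cache a step's JSON-serializable result on disk."""
    def wrap(fn):
        def inner(*args, **kwargs):
            key = hashlib.sha256(
                json.dumps([step_name, args, kwargs], sort_keys=True, default=str).encode()
            ).hexdigest()
            path = CACHE_DIR / f"{step_name}-{key}.json"
            if path.exists():
                return json.loads(path.read_text(encoding="utf-8"))
            result = fn(*args, **kwargs)
            path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
            return result
        return inner
    return wrap

@cached("translate")
def translate(text: str, target_lang: str) -> str:
    # Placeholder: in the real pipeline this would call the translation service.
    return text

print(translate("Hello, world!", target_lang="ru"))  # a second run hits the disk cache
```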

Considerations for Future Improvements

  1. I plan to completely rework the voice-over part. Currently, it is based on SRT files and works with chunks. The plan is to move to the sentence level and try to produce artificial timestamps to support synchronization between audio and video. I've got some ideas…
  2. The main challenge is handling videos with multiple speakers. I have a model that deals with it, but a ton of manual adjustment is still required.
  3. The end goal is to implement "one-click localization".

Bottom Line

While the development process required time and effort, the majority of the coding was accomplished by AI, specifically ChatGPT.