
Thanks to the rapid advancement of AI technology, video translation, once quite challenging, has become much easier to achieve, although the results are not yet perfect.

Video translation is more complex than text translation, but its core is still text translation. (There is also technology that converts speech directly into speech in another language, but it is not yet mature and has limited practical use.)

The workflow of video translation can be roughly divided into the following stages:

  1. Speech recognition: Extract human voices from the video and convert them into text;

  2. Text translation: Translate the extracted text into the target language text;

  3. Speech synthesis: Generate target language speech based on the translated text;

  4. Synchronous adjustment: Ensure that the dubbed audio and subtitle files stay synchronized with the on-screen content;

  5. Embedding processing: Embed the translated subtitles and dubbing into the video to generate a new video file.

Detailed discussion of each stage:

Speech recognition

The goal of this step is to accurately convert the speech in the video into text with timestamps attached. There are several ways to do this, including OpenAI's Whisper models, Alibaba's FunASR family of models, or calling an online speech recognition API such as Baidu's.

When selecting a model, you can choose anything from tiny to large-v3 according to your needs: the larger the model, the higher the recognition accuracy, but the slower the inference.
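As a minimal sketch of this step, assuming the openai-whisper Python package and an audio track already extracted from the video (the filename is a placeholder):

```python
# Minimal sketch: transcribe extracted audio with timestamps using
# the openai-whisper package (pip install openai-whisper).
import whisper

model = whisper.load_model("small")      # anything from "tiny" to "large-v3"
result = model.transcribe("audio.wav")   # placeholder: audio extracted from the video

for seg in result["segments"]:
    # Each segment carries start/end times in seconds plus the recognized text,
    # which is exactly what an SRT subtitle entry needs.
    print(f"{seg['start']:.2f} --> {seg['end']:.2f}  {seg['text'].strip()}")
```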

Text translation

After obtaining the text, it can be translated. Note that subtitle translation differs from ordinary text translation: the translated lines must stay matched to their timestamps.

When using a traditional translation engine (such as Baidu Translate or Tencent Translate), send only the subtitle text lines for translation and leave out the index and timestamp lines; otherwise you may exceed the character limit or the engine may alter the subtitle format.

Ideally, the translated subtitles should have exactly the same number of lines as the original, with no blank lines.

In practice, however, translation engines, AI-based ones in particular, tend to merge lines based on context: when a line contains only a word or two and is semantically continuous with the previous sentence, the engine is likely to fold it into the preceding line.

Although this makes the translation read more fluently, it also means the result no longer matches the original subtitles line for line, and blank lines appear.
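To make the line-matching requirement concrete, here is a minimal sketch that sends only the text lines of an SRT file for translation and checks the line count afterwards; translate_batch is a hypothetical stand-in for whichever engine you call:

```python
# Minimal sketch: translate only SRT text lines, preserving index and
# timestamp lines so the subtitle format survives untouched.
import re

TIME_RE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")

def translate_srt(src_path, dst_path, translate_batch):
    with open(src_path, encoding="utf-8") as f:
        lines = f.read().splitlines()

    # Positions of text lines: everything that is not a bare index number,
    # a timestamp line, or a blank separator.
    text_idx = [i for i, ln in enumerate(lines)
                if ln.strip()
                and not ln.strip().isdigit()
                and not TIME_RE.match(ln.strip())]

    translated = translate_batch([lines[i] for i in text_idx])
    # Guard against the merged-line problem described above.
    assert len(translated) == len(text_idx), "engine merged or dropped lines"

    for i, new_text in zip(text_idx, translated):
        lines[i] = new_text

    with open(dst_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```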

Speech synthesis (dubbing)

After the translation is completed, dubbing can be generated based on the translated subtitles.

Currently, EdgeTTS is a free dubbing channel with practically no usage limits. Send the subtitles to EdgeTTS line by line to obtain one audio clip per line, then merge these clips into a complete audio track, as sketched below.
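A minimal sketch using the edge-tts package (pip install edge-tts); the voice name here is an assumption, and any voice listed by `edge-tts --list-voices` will do:

```python
# Minimal sketch: generate one dubbed clip per subtitle line with edge-tts.
import asyncio
import edge_tts

async def dub_lines(lines, voice="en-US-AriaNeural"):
    for n, text in enumerate(lines):
        communicate = edge_tts.Communicate(text, voice)
        await communicate.save(f"line_{n:04d}.mp3")  # one clip per subtitle line

asyncio.run(dub_lines(["Hello there.", "How are you today?"]))
```

The per-line clips can then be placed at their subtitle start times and merged into a single track, for example with pydub or ffmpeg.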

Synchronization adjustment

Ensuring that subtitles, audio, and video are synchronized is the biggest challenge of video translation.

Different languages inevitably take different amounts of time to say the same thing, which causes synchronization problems. Strategies for dealing with this include speeding up the audio, extending the corresponding video clips, and exploiting the silent gaps between subtitles to get the best possible synchronization.

If you make no adjustments and embed everything strictly according to the original subtitle timestamps, the subtitles will inevitably disappear while the voice is still talking, or the person in the video will have long finished speaking while the dubbed audio is still playing.

To solve this problem, there are two relatively simple ways:

One is to speed up the dubbed audio and force it to finish within the subtitle's time interval. This achieves synchronization, but the speech rate then varies from line to line, which makes for a poor listening experience.

The second is to play the video clip covering the subtitle interval in slow motion, that is, stretch the clip until its length matches the new dubbing. This also achieves synchronization, but the picture appears to play in a slow, stutter-like way.

You can also combine the two methods, speeding up the audio while stretching the video clip, which keeps the audio from being accelerated too much and the video from being stretched too far.

Depending on the video, you can also use the silent gap between two subtitles: first check whether the dubbed clip fits within the subtitle's own interval plus the following gap without any audio acceleration. If it does, no speed-up is needed and the result sounds better. The drawback is that the person on screen may have stopped talking while the audio is still playing.
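A minimal sketch of this decision logic, with all durations in seconds and the numbers purely illustrative:

```python
# Minimal sketch: decide how much to speed up a dubbed clip, borrowing the
# silent gap before the next subtitle first and accelerating only if needed.

def fit_clip(dub_len, slot_start, slot_end, next_start):
    slot = slot_end - slot_start   # time the subtitle is on screen
    gap = next_start - slot_end    # silent interval before the next subtitle
    if dub_len <= slot + gap:
        return 1.0                 # fits without acceleration
    # Tempo factor to feed e.g. ffmpeg's atempo filter; values much above
    # ~1.5 start to sound noticeably rushed.
    return dub_len / (slot + gap)

# Example: a 4.2 s dub for a subtitle shown 10.0-13.0 s, next line at 13.5 s
print(fit_clip(4.2, 10.0, 13.0, 13.5))  # -> 1.2, i.e. play 20% faster
```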

Synthesis and output

After completing the above steps, embed the translated subtitles and the dubbing into the original video; this is easy to do with a tool such as ffmpeg, and the resulting file completes the translation. With two inputs, it is safest to map the video stream from the original file and the audio stream from the dub explicitly:

ffmpeg -y -i original_video.mp4 -i dubbed_audio.m4a -map 0:v:0 -map 1:a:0 -c:v libx264 -c:a aac -vf subtitles=subtitles.srt out.mp4

A difficult remaining problem: multi-speaker recognition

Speaker role recognition, that is, synthesizing a different voice for each character in the video, requires speaker diarization, and the number of speaker roles usually has to be specified in advance. This is barely workable for a simple one- or two-person dialogue, but for most videos the number of speakers cannot be determined ahead of time, and the synthesized result tends to be poor, so this has not been pursued for now.
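For reference, the diarization building block itself is available off the shelf; a minimal sketch using pyannote.audio (model name per its documentation; it requires a Hugging Face access token, shown here as a placeholder):

```python
# Minimal sketch: who speaks when, via pyannote.audio speaker diarization.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # placeholder: your Hugging Face token
)

diarization = pipeline("audio.wav")  # placeholder: the extracted voice track
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Each turn gives start/end in seconds plus a label such as SPEAKER_00.
    print(f"{turn.start:.1f}s - {turn.end:.1f}s  {speaker}")
```

Even with such output, the anonymous labels still have to be mapped to TTS voices, which is where the pre-specification problem above comes in.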

Summary

The above is only the basic flow. In practice, achieving a good result requires attention to many more details: pre-processing the original input format (mov/mp4/avi/mkv), splitting the video into an audio track and a silent video track (see the sketch below), separating the human voice from the background sound, batching translation requests to speed up subtitle translation, re-splitting when blank lines appear in the translated subtitles, generating and embedding dual subtitles, and so on.
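As one illustration, the audio/silent-video split mentioned above can be done by shelling out to ffmpeg (assumed to be on PATH; filenames are placeholders):

```python
# Minimal sketch: split a video into an audio track and a silent video track.
import subprocess

def split_av(src):
    # -vn drops video, keeping audio only; -an drops audio, copying video as-is.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "-c:a", "aac", "audio.m4a"],
                   check=True)
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", "video_silent.mp4"],
                   check=True)

split_av("input.mp4")
```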

With these steps, the video translation task is complete and the content has been converted into the target language. Technical challenges remain along the way, but as the technology continues to improve, the quality and efficiency of video translation can be expected to improve further.