ChatTTS has become incredibly popular, but its documentation is vague, especially regarding how to control tone, rhythm, and specific speakers. After repeated real-world testing and trial and error, I have figured out a few things and recorded them below.
The WebUI code is open source at https://github.com/jianchang512/chattts-ui
Control Symbols Available in the Text
Control symbols can be interspersed in the original text to be synthesized. Currently, the controllable ones are laughter and pauses.
[laugh] Represents laughter
[uv_break] Represents a pause
The following is a sample text
text="Hello [uv_break] friends, I heard today is a good day, isn't it [uv_break] [laugh]?"
In the actual synthesis, [laugh] will be replaced by laughter, and a pause will be inserted at [uv_break].
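Putting the pieces together, a minimal end-to-end call might look like the sketch below. It assumes the load_models() entry point and the 24 kHz output rate of recent ChatTTS releases, and uses torchaudio only as one convenient way to save the result; adjust the names if your installed version differs (some releases rename load_models() to load()).
import torch
import torchaudio
import ChatTTS

chat = ChatTTS.Chat()
chat.load_models()  # some releases rename this to chat.load()

text = "Hello [uv_break] friends, I heard today is a good day, isn't it [uv_break] [laugh]?"

# infer() returns one waveform per input string; ChatTTS outputs 24 kHz audio
wavs = chat.infer([text])
torchaudio.save("output.wav", torch.from_numpy(wavs[0]), 24000)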
The intensity of the laughter and pauses can be controlled by passing a prompt in the params_refine_text parameter.
laugh_(0-2) Possible values: laugh_0 laugh_1 laugh_2. The laughter presumably becomes more intense as the number increases.
break_(0-7) Possible values: break_0 break_1 break_2 break_3 break_4 break_5 break_6 break_7. The pauses presumably become more pronounced as the number increases.
Code
chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_2][break_4]'})
However, actual testing found that the difference between [break_0] and [break_7] is not obvious, and there is likewise no obvious difference between [laugh_0] and [laugh_2].
Skip the Refine Text Stage
During actual synthesis, the input text is re-organized in a refine-text stage. For example, the sample text above may eventually be turned into
Hello [uv_break] Ah [uv_break] Um [uv_break] Friends, I heard today is a good day, isn't it [uv_break] Um [uv_break] [laugh] ? [uv_break]
As you can see, the control characters no longer match the ones you marked yourself, and the synthesized audio may therefore contain pauses, noise, or laughter that should not be there. So how do you force synthesis to follow exactly what you wrote?
Set the skip_refine_text parameter to True to skip the refine text stage.
chat.infer([text],skip_refine_text=True,params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
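If you want to inspect or hand-edit the refined text before committing to it, some versions of infer() also accept a refine_text_only flag (treat this flag name as an assumption and check your installed version). A sketch of the two-pass pattern:
# Pass 1: run only the refine stage so the inserted control tokens can be inspected or edited
refined = chat.infer([text], refine_text_only=True, params_refine_text={'prompt':'[oral_2][laugh_0][break_6]'})
print(refined)

# Pass 2: synthesize the (possibly hand-edited) text exactly as written
wavs = chat.infer(refined, skip_refine_text=True)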
Fix the Speaker Voice
By default, a different voice is randomly selected each time you synthesize, which is very inconvenient, and there is no documented way to choose a specific voice.
If you simply want to pin the speaker voice, you first need to set a random seed manually. Different seeds produce different voices
torch.manual_seed(2222)
Then get a random speaker
rand_spk = chat.sample_random_speaker()
Then pass it in through the params_infer_code parameter
chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk})
In testing, seeds 2222, 7869, and 6653 produce male voices, while 3333, 4099, and 5099 produce female voices. You can try other seed values yourself to find more voices.
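Putting the three steps together, here is a sketch of pinning one voice across several sentences; the seed is just one of the values listed above, and use_decoder and spk_emb follow the call shown earlier.
import torch

# Fix the seed so sample_random_speaker() returns the same embedding every run
torch.manual_seed(2222)
rand_spk = chat.sample_random_speaker()

# Reuse the same embedding on every call to keep the voice consistent
wavs = chat.infer(
    ["First sentence.", "Second sentence in the same voice."],
    use_decoder=True,
    params_infer_code={'spk_emb': rand_spk},
)
Depending on the version, rand_spk is either a tensor or an encoded string; either way it can be saved (for example with torch.save, or by writing the string to a file) and reloaded later so the same voice survives a restart.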
Speech Rate Control
You can control the speech rate by setting the prompt in the params_infer_code parameter of chat.infer
chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk,'prompt':'[speed_5]'})
The full range of valid speed values is not documented. The default in the source code is speed_5, but testing with speed_0 and speed_7 did not reveal any obvious differences.
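To hear the difference for yourself, a small loop over a few speed values makes the comparison straightforward. This sketch reuses chat, text, rand_spk, torch, and torchaudio from the snippets above; the output file names are arbitrary.
for speed in (0, 3, 5, 7):
    wavs = chat.infer(
        [text],
        use_decoder=True,
        params_infer_code={'spk_emb': rand_spk, 'prompt': f'[speed_{speed}]'},
    )
    torchaudio.save(f"speed_{speed}.wav", torch.from_numpy(wavs[0]), 24000)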
WebUI Interface and Integration Package
Open source and download address https://github.com/jianchang512/chatTTS-ui
After decompressing the integrated package, double-click app.exe to run it, or deploy from source by following the instructions in the repository.