ChatTTS has become incredibly popular, but its documentation is vague, especially regarding how to control tone, rhythm, and specific speakers. After repeated real-world testing and trial and error, I have figured out a few things and recorded them below.
The WebUI code is open source at https://github.com/jianchang512/chattts-ui
Control Symbols Available in the Text
Control symbols can be interspersed in the original text to be synthesized. Currently, the controllable ones are laughter and pauses.
[laugh] Represents laughter
[uv_break] Represents a pause
The following is a sample text
text="Hello [uv_break] friends, I heard today is a good day, isn't it [uv_break] [laugh]?"
In the actual synthesis, [laugh] will be replaced by laughter, and a pause will be inserted at [uv_break].
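Putting the pieces together, a minimal end-to-end call might look like the sketch below. It assumes the load_models() entry point and the 24 kHz output rate of recent ChatTTS releases, and uses torchaudio only as one convenient way to save the result; adjust the names if your installed version differs (some releases rename load_models() to load()).
import torch
import torchaudio
import ChatTTS

chat = ChatTTS.Chat()
chat.load_models()  # some releases rename this to chat.load()

text = "Hello [uv_break] friends, I heard today is a good day, isn't it [uv_break] [laugh]?"

# infer() returns one waveform per input string; ChatTTS outputs 24 kHz audio
wavs = chat.infer([text])
torchaudio.save("output.wav", torch.from_numpy(wavs[0]), 24000)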
The intensity of the laughter and pauses can be controlled by passing a prompt in the params_refine_text parameter.
laugh_(0-2) Possible values: laugh_0 laugh_1 laugh_2. The laughter presumably becomes more intense as the number increases.
break_(0-7) Possible values: break_0 break_1 break_2 break_3 break_4 break_5 break_6 break_7. The pauses presumably become more pronounced as the number increases.
Code
chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_2][break_4]'})
However, actual testing found that the difference between [break_0] and [break_7] is not obvious, and there is likewise no obvious difference between [laugh_0] and [laugh_2].
Skip the Refine Text Stage
During actual synthesis, the input text is re-organized in a refine-text stage. For example, the sample text above may eventually be turned into
Hello [uv_break] Ah [uv_break] Um [uv_break] Friends, I heard today is a good day, isn't it [uv_break] Um [uv_break] [laugh] ? [uv_break]
As you can see, the control characters no longer match the ones you marked yourself, and the synthesized audio may therefore contain pauses, noise, or laughter that should not be there. So how do you force synthesis to follow exactly what you wrote?
Set the skip_refine_text parameter to True to skip the refine text stage.
chat.infer([text],skip_refine_text=True,params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
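If you want to inspect or hand-edit the refined text before committing to it, some versions of infer() also accept a refine_text_only flag (treat this flag name as an assumption and check your installed version). A sketch of the two-pass pattern:
# Pass 1: run only the refine stage so the inserted control tokens can be inspected or edited
refined = chat.infer([text], refine_text_only=True, params_refine_text={'prompt':'[oral_2][laugh_0][break_6]'})
print(refined)

# Pass 2: synthesize the (possibly hand-edited) text exactly as written
wavs = chat.infer(refined, skip_refine_text=True)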
Fix the Speaker Voice
By default, a different voice is randomly selected each time you synthesize, which is very inconvenient, and there is no documented way to choose a specific voice.
If you simply want to pin the speaker voice, you first need to set a random seed manually. Different seeds produce different voices
torch.manual_seed(2222)
Then get a random speaker
rand_spk = chat.sample_random_speaker()
Then pass it in through the params_infer_code parameter
chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk})
In testing, seeds 2222, 7869, and 6653 produce male voices, while 3333, 4099, and 5099 produce female voices. You can try other seed values yourself to find more voices.
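Putting the three steps together, here is a sketch of pinning one voice across several sentences; the seed is just one of the values listed above, and use_decoder and spk_emb follow the call shown earlier.
import torch

# Fix the seed so sample_random_speaker() returns the same embedding every run
torch.manual_seed(2222)
rand_spk = chat.sample_random_speaker()

# Reuse the same embedding on every call to keep the voice consistent
wavs = chat.infer(
    ["First sentence.", "Second sentence in the same voice."],
    use_decoder=True,
    params_infer_code={'spk_emb': rand_spk},
)
Depending on the version, rand_spk is either a tensor or an encoded string; either way it can be saved (for example with torch.save, or by writing the string to a file) and reloaded later so the same voice survives a restart.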
Speech Rate Control
You can control the speech rate by setting the prompt in the params_infer_code parameter of chat.infer
chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk,'prompt':'[speed_5]'})
The full range of valid speed values is not documented. The default in the source code is speed_5, but testing with speed_0 and speed_7 did not reveal any obvious differences.
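To hear the difference for yourself, a small loop over a few speed values makes the comparison straightforward. This sketch reuses chat, text, rand_spk, torch, and torchaudio from the snippets above; the output file names are arbitrary.
for speed in (0, 3, 5, 7):
    wavs = chat.infer(
        [text],
        use_decoder=True,
        params_infer_code={'spk_emb': rand_spk, 'prompt': f'[speed_{speed}]'},
    )
    torchaudio.save(f"speed_{speed}.wav", torch.from_numpy(wavs[0]), 24000)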
WebUI Interface and Integration Package
Open source and download address https://github.com/jianchang512/chatTTS-ui
After decompressing the integrated package, double-click app.exe to run it, or deploy from source by following the instructions in the repository.