EATS-Speech: Emotion-Adaptive Transformation and Priority Synthesis for Zero-Shot Text-to-Speech

Abstract

Zero-shot text-to-speech (TTS) supports diverse speech synthesis without speaker-specific data but struggles to accurately transfer emotion from the reference speech to the target text. Traditional approaches treat emotion as part of a global style, leading to inconsistent emotional expressiveness. To address this, we propose EATS-Speech, an Emotion-Adaptive Transformation Synthesis framework. EATS-Speech employs Emotion Priority Synthesis through a parallel pipeline that decomposes speech into non-emotion style, emotion, and content, and it prioritizes emotion generation to enhance expressiveness. Furthermore, it introduces Emotion-Adaptive Transformation Synthesis, in which an LLM-based converter learns text-emotion mapping patterns from the reference speech and transfers them to the target text. Experiments on the LibriTTS dataset demonstrate improvements in emotional expressiveness and more accurate emotion adaptation.

Contents

Model Architecture


Figure 1: Comparison Between Previous and Proposed Zero-Shot TTS Frameworks.



Figure 2: The overview of our proposed model. (A) Illustration of the EATS-Speech Architecture; (B) Illustration of Disentangled Speech Synthesis with Emotion Tokenizer.



Figure 3: The overview of the Emotion-Aware LLM.


Comparison Experiments

  • The audio samples below include speech synthesized by the model proposed in this paper.
  • The LibriTTS dataset is used; it can be downloaded from https://www.openslr.org/60/.
  • LibriTTS is a multi-speaker English corpus containing 585 hours of speech from over 2,300 speakers. Train-clean-100, train-clean-360, and train-other-500 are merged as the training set. Dev-clean and dev-other are merged as the development set. Test-clean and test-other are merged as the test set.
  • The sources of the comparison models are listed below:
  • Sample for Seen Speaker - Development Set

    Sample 1

    Text: The play and slight agitation of the water, in its upward gush, wrought magically with these variegated pebbles, and made a continually shifting apparition of quaint figures, vanishing too suddenly to be definable.

    Groundtruth YourTTS TransferTTS VALL-E E2-TTS CosyVoice EATS-Speech

    Sample 2

    Text: "You'd better look in those big manufacturing houses along Franklin Street and just the other side of the river," he concluded. "Lots of girls work there.

    Groundtruth YourTTS TransferTTS VALL-E E2-TTS CosyVoice EATS-Speech

    Sample 3

    Text: We followed an Alpine path for some four miles, now hundreds of feet above a brawling stream which descended from the glaciers, and now nearly alongside it.

    Groundtruth YourTTS TransferTTS VALL-E E2-TTS CosyVoice EATS-Speech

    Sample 4

    Text: Tom Selwyn had been very sober during all this merry chatter; and now in his seat across the narrow aisle, he drummed his heels impatiently on the floor.

    Groundtruth YourTTS TransferTTS VALL-E E2-TTS CosyVoice EATS-Speech


  • Sample for Unseen Speaker - Test Set

    Sample 1

    Text: She never said much of herself in her letters, and Fanny's first exclamation when they met again, was an anxious "Why, Polly, dear! Have you been sick and never told me?"

    Groundtruth YourTTS TransferTTS VALL-E E2-TTS CosyVoice EATS-Speech

    Sample 2

    Text: "We must hasten." Seizing the scissors that lay on the ground where Ethelried had dropped them, she opened and shut them several times, exclaiming:

    Groundtruth YourTTS TransferTTS VALL-E E2-TTS CosyVoice EATS-Speech

    Sample 3

    Text: She was sanguine, she was genial and companionable, and her spirits rose at the sight of a friendly face.

    Groundtruth YourTTS TransferTTS VALL-E E2-TTS CosyVoice EATS-Speech

    Sample 4

    Text: Next day he rose and prayed the dawn prayer and repaired to his namesake's house where, after the company was all assembled, the host began to relate

    Groundtruth YourTTS TransferTTS VALL-E E2-TTS CosyVoice EATS-Speech

    Sample 5

    Text: "Another case was that of a slave woman in a very delicate state, who was one day knocked down stairs by mrs Johnson herself, and in a few weeks after, the poor woman died from the effects of the injury thus received.

    Groundtruth YourTTS TransferTTS VALL-E E2-TTS CosyVoice EATS-Speech

    Sample 6

    Text: He'll look after this end of the Island hereafter, an' unless I'm much mistaken he'll do it a heap better than you did."

    Groundtruth YourTTS TransferTTS VALL-E E2-TTS CosyVoice EATS-Speech
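The split merging described for LibriTTS at the start of this section could be expressed as follows. This is only a minimal sketch: the helper names `subset_dirs` and `list_wavs`, the `MERGED_SPLITS` mapping, and the assumption that each subset sits in its own directory under a common root are illustrative and are not taken from the EATS-Speech code.

```python
from pathlib import Path

# Merged splits as described on this page (subset names follow LibriTTS).
MERGED_SPLITS = {
    "train": ["train-clean-100", "train-clean-360", "train-other-500"],
    "dev": ["dev-clean", "dev-other"],
    "test": ["test-clean", "test-other"],
}


def subset_dirs(root, split):
    """Return the subset directories merged into `split` (hypothetical helper)."""
    return [Path(root) / name for name in MERGED_SPLITS[split]]


def list_wavs(root, split):
    """Collect all .wav files of a merged split; missing subsets are skipped."""
    files = []
    for directory in subset_dirs(root, split):
        if directory.is_dir():
            # Assumed layout: audio files nested below each subset directory.
            files.extend(sorted(directory.rglob("*.wav")))
    return files
```

For example, `list_wavs("LibriTTS", "dev")` would gather the utterances from both dev-clean and dev-other, assuming the corpus was extracted into a directory named `LibriTTS`.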


Ablation Study

  • Ablation studies evaluate the contribution of emotion decoupling and emotion conversion in the EATS-Speech framework. The experimental setup is the same as in the comparison experiments.
  • -w/o Decoupling refers to EATS-Speech trained without emotion decoupling: the emotion tokenizer is not used for gradient reversal of emotion during style conditioning.
  • -w/o Conversion refers to EATS-Speech trained without the emotion-aware LLM: only the emotion2vec model assists zero-shot TTS synthesis, and the emotion conversion step is omitted.
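
The ablation variants above can be thought of as two on/off switches on the full system. The sketch below is purely illustrative: `AblationConfig`, its field names, and `describe` are hypothetical and do not come from the EATS-Speech implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical switches mirroring the ablation settings described above."""
    name: str
    use_emotion_decoupling: bool = True  # gradient reversal via the emotion tokenizer
    use_emotion_aware_llm: bool = True   # emotion conversion step


# The full model plus the two ablated variants named on this page.
VARIANTS = [
    AblationConfig("Proposed"),
    AblationConfig("-w/o Decoupling", use_emotion_decoupling=False),
    AblationConfig("-w/o Conversion", use_emotion_aware_llm=False),
]


def describe(cfg):
    """Summarize which components are active in a variant."""
    decoupling = "on" if cfg.use_emotion_decoupling else "off"
    conversion = "on" if cfg.use_emotion_aware_llm else "off"
    return f"{cfg.name}: decoupling={decoupling}, conversion={conversion}"
```

Each variant disables exactly one component, so a difference between a variant and the proposed model in the samples below can be attributed to that component.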

  • Sample for Seen Speaker - Development Set

    Sample 1

    Text: He was discharged, whereupon the union notified us that unless the boy was taken back the whole body would go out.

    Groundtruth Proposed -w/o Decoupling -w/o Conversion

    Sample 2

    Text: "In the lonely barton by yonder coomb Our childhood used to know," I should go with him in the gloom, Hoping it might be so.

    Groundtruth Proposed -w/o Decoupling -w/o Conversion

    Sample 3

    Text: If I make the marriage she chooses, she thinks mr Binnie will leave me his money. I am to run after a man who does not care for me, and make myself attractive, in the hope that he will condescend to marry me because mr Binnie may leave me his money.

    Groundtruth Proposed -w/o Decoupling -w/o Conversion


  • Sample for Unseen Speaker - Test Set

    Sample 1

    Text: 'Yes, I will kiss you,' said the man's daughter, and she did it, but she thought it was the worst bit of work that she had ever had to do in her life.

    Groundtruth Proposed -w/o Decoupling -w/o Conversion

    Sample 2

    Text: "Well!" he exclaimed; "and I felt sorry for her as one might for one's sister at home, and hung back from getting her people into trouble.

    Groundtruth Proposed -w/o Decoupling -w/o Conversion