Appendix Companion

Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

Hankun Wang · Haoran Wang · Yiwei Guo · Zhihan Li · Chenpeng Du · Xie Chen · Kai Yu
Shanghai Jiao Tong University · X-LANCE Lab
Additional materials for ICASSP 2026 submission

Abstract

Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. We study three factors by evolving the modality from text to speech: (A) speech tokens provide phonetic rather than semantic information, (B) speech sequences are far longer than text, and (C) paralinguistic information adds variability. Factor A has minor impact, factor B noticeably affects syntactic and semantic modeling, and factor C is the most disruptive, especially for lexical modeling. These findings highlight the unique challenges of training end-to-end SLMs and suggest pathways toward stronger speech generation.

Free Generation Setup

Generation settings per modality are listed below. For Phone-Repeat and Speech-HuBERT, higher temperatures mitigate repetitive loops, and generation stops early if more than eight consecutive identical tokens appear. If the transcribed text (excluding the prompt) is shorter than 50 characters, we regenerate with a different random seed; a sketch of this loop follows the table. The remaining modalities always generate up to max_length, and the final word is dropped since it may be truncated.

Modality        Max len   Top-K   Top-P   Temp
Text-BPE        45        1000    0.9     1.00
Text-Raw        135       –       0.9     1.05
Phone-BPE       45        1000    0.9     1.00
Phone-Raw       96        –       0.9     1.05
Phone-Repeat    500       –       0.9     1.15
Speech-HuBERT   500       1000    0.9     1.20

(– indicates no top-k filtering.)
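
The sampling loop above can be summarized as follows. This is a minimal sketch assuming a Hugging Face-style causal LM; the transcribe() helper, config dict, and retry budget are illustrative assumptions, not part of the released setup.

    import torch
    from transformers import StoppingCriteria, StoppingCriteriaList

    class RepeatStop(StoppingCriteria):
        """Stop once more than `max_run` consecutive identical tokens appear."""
        def __init__(self, max_run: int = 8):
            self.window = max_run + 1

        def __call__(self, input_ids, scores, **kwargs) -> bool:
            tail = input_ids[0, -self.window:]
            return tail.numel() == self.window and bool((tail == tail[0]).all())

    def generate_once(model, prompt_ids, cfg, transcribe, min_chars=50, max_retries=10):
        # cfg: dict with max_length, top_k (None disables it), top_p, temperature.
        for seed in range(max_retries):              # fresh seed on each retry
            torch.manual_seed(seed)
            out = model.generate(
                prompt_ids,
                do_sample=True,
                max_length=cfg["max_length"],
                top_k=cfg["top_k"] or 0,             # 0 = no top-k filtering
                top_p=cfg["top_p"],
                temperature=cfg["temperature"],
                stopping_criteria=StoppingCriteriaList([RepeatStop(8)]),
            )
            text = transcribe(out[0, prompt_ids.shape[1]:])  # exclude the prompt
            if len(text) >= min_chars:
                break                                # long enough; keep this sample
        return text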

Transcription and Evaluation Pipelines

Phone → Text (T5-PTT)

We fine-tune FLAN-T5 on LibriHeavy-50k with phone and duration labels from Kaldi alignments.

  • Two versions: T5-PTT-Original and T5-PTT-Deduped (for Phone-Repeat with deduplicated runs).
  • WER on test set: 2.64% (Original), 1.97% (Deduped).
  • Deduped inputs preserve accuracy while matching duration-collapsed phone sequences; a minimal deduplication sketch follows this list.
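
The deduplication used for T5-PTT-Deduped amounts to collapsing runs of identical phones. A minimal sketch, with illustrative phone labels:

    from itertools import groupby

    def dedup_phones(phones):
        """Collapse runs of identical phones into a single token each."""
        return [p for p, _ in groupby(phones)]

    # e.g. a duration-expanded sequence collapses to its phone identity sequence
    assert dedup_phones(["HH", "HH", "AH", "AH", "AH", "L", "OW", "OW"]) == ["HH", "AH", "L", "OW"]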

Speech → Text

  • HuBERT tokens → CTX-vec2wav [1] synthesis (speaker prompt: LibriTTS “1089_134686_000001_000001”), using the contextual vocoder from UniCATS.
  • Whisper-Large-V3 performs ASR with punctuation and case preserved.
  • Provides normalized text for downstream automatic evaluation (perplexity via Llama-3.1-8B); a sketch of the perplexity step follows this list.
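
A minimal sketch of the perplexity computation, assuming the Hugging Face transformers API and the public meta-llama/Llama-3.1-8B checkpoint (the exact evaluation code is not part of this appendix):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
    lm = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
    )

    @torch.no_grad()
    def perplexity(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids.to(lm.device)
        loss = lm(input_ids=ids, labels=ids).loss   # mean cross-entropy per token
        return float(torch.exp(loss))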

Prompt Sets

Prompts are grouped by whether they appear in the training data. For prompts outside the training data, speech prompts are synthesized with HierSpeech++ and then aligned to obtain phones, durations, and HuBERT tokens.

In Training Data                    Not in Training Data
This                                Alice is a nice
I will                              How much water do you
How do                              We decide to go to the
When I                              In the morning, I like to
She said                            A little bird told me that
These are                           Mary went to the market to
The boy is                          In the morning, I like to eat
The moon is                         Bob is a tennis player, and he
What a lovely                       He looked up to the sky and saw
He looked up to the sky and said    A little girl is playing with her

Word Boundary Ablation

Adding explicit word-boundary tokens to the non-text modalities yields slight gains on syntactic and semantic tasks for Phone-Raw, Phone-Repeat, and Speech-HuBERT, while lexical scores remain similar. Phone-BPE drops slightly because the extra tokens lengthen its sequences. A sketch of the boundary-insertion scheme follows the table.

Modality           sWUGGY   sBLIMP   Topic-SC
Phone-Raw          85.8     74.5     66.6
  +word boundary   85.6     75.7     66.8
Phone-BPE          85.0     75.0     70.9
  +word boundary   84.1     75.4     69.6
Phone-Repeat       85.5     66.2     58.3
  +word boundary   85.2     66.9     59.0
Speech-HuBERT      50.8     57.3     52.9
  +word boundary   50.3     57.7     53.6
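
A minimal sketch of how such boundary tokens could be interleaved into a token stream; the <wb> symbol and the per-word layout are illustrative assumptions rather than the exact training format:

    WB = "<wb>"  # hypothetical word-boundary token

    def add_word_boundaries(words):
        """words: per-word token lists, e.g. [['HH','AH'], ['L','OW']].
        Returns one flat sequence with a boundary token after each word."""
        seq = []
        for w in words:
            seq.extend(w)
            seq.append(WB)
        return seq

    assert add_word_boundaries([["HH", "AH"], ["L", "OW"]]) == ["HH", "AH", WB, "L", "OW", WB]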

References

[1] C. Du, Y. Guo, F. Shen, Z. Liu, Z. Liang, X. Chen, S. Wang, H. Zhang, and K. Yu, "UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding," arXiv:2306.07547.