Abstract
Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. We study three factors that arise as the modality shifts from text to speech: (A) speech tokens provide phonetic rather than semantic information, (B) speech sequences are far longer than text, and (C) paralinguistic information adds variability. Factor A has only a minor impact, factor B noticeably affects syntactic and semantic modeling, and factor C is the most disruptive, especially for lexical modeling. These findings highlight the unique challenges of training end-to-end SLMs and suggest pathways toward stronger speech generation.
Free Generation Setup
Generation settings for each modality are listed in the table below. For Phone-Repeat and Speech-HuBERT, higher temperatures mitigate repetitive loops; if more than eight consecutive identical tokens appear, generation stops early. If the transcribed text (excluding the prompt) contains fewer than 50 characters, we regenerate with a different random seed. The other modalities always generate up to the maximum length, and we drop the last word since it may be incomplete. A sampling-loop sketch follows the table.
| Modality | Max length | Top-K | Top-P | Temperature |
|---|---|---|---|---|
| Text-BPE | 45 | 1000 | 0.9 | 1.00 |
| Text-Raw | 135 | – | 0.9 | 1.05 |
| Phone-BPE | 45 | 1000 | 0.9 | 1.00 |
| Phone-Raw | 96 | – | 0.9 | 1.05 |
| Phone-Repeat | 500 | – | 0.9 | 1.15 |
| Speech-HuBERT | 500 | 1000 | 0.9 | 1.20 |
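The snippet below is a minimal sketch of this sampling loop under the Phone-Repeat settings (max length 500, top-p 0.9, temperature 1.15). The checkpoint name `slm-phone-repeat` and the `transcribe` argument are placeholders rather than names from this work: `transcribe` stands in for the modality-specific back-transcription (T5-PTT or vocoder + Whisper).

```python
# Sketch of the free-generation loop: nucleus sampling with early stopping on
# repetitive loops and resampling when the continuation is too short.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

class RepeatLoopStop(StoppingCriteria):
    """Stop generation once the last `n` generated tokens are all identical."""
    def __init__(self, n: int):
        self.n = n

    def __call__(self, input_ids, scores, **kwargs):
        tail = input_ids[0, -self.n:]
        return tail.numel() == self.n and bool((tail == tail[0]).all())

tokenizer = AutoTokenizer.from_pretrained("slm-phone-repeat")     # placeholder name
model = AutoModelForCausalLM.from_pretrained("slm-phone-repeat")  # placeholder name

def sample_continuation(prompt_ids, prompt_text, transcribe,
                        max_new_tokens=500, temperature=1.15, top_p=0.9,
                        min_chars=50, max_tries=5):
    """Resample with a new seed until the continuation has >= min_chars characters."""
    for seed in range(max_tries):
        torch.manual_seed(seed)
        out = model.generate(
            prompt_ids,
            do_sample=True,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            # triggers on a run of more than eight consecutive identical tokens
            stopping_criteria=StoppingCriteriaList([RepeatLoopStop(9)]),
        )
        text = transcribe(out[0])                       # placeholder: units -> text
        continuation = text[len(prompt_text):].strip()  # crude prompt removal
        if len(continuation) >= min_chars:
            break
    return text
```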
Transcription and Evaluation Pipelines
Phone → Text (T5-PTT)
We fine-tune FLAN-T5 on LibriHeavy-50k with phone and duration labels from Kaldi alignments.
- Two versions: T5-PTT-Original and T5-PTT-Deduped (for Phone-Repeat with deduplicated runs).
- WER on test set: 2.64% (Original), 1.97% (Deduped).
- Deduped inputs preserve transcription accuracy while matching the duration-collapsed phone sequences produced by Phone-Repeat; a run-collapsing sketch follows this list.
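As a concrete illustration, the following sketch shows the run-deduplication assumed for T5-PTT-Deduped inputs: consecutive repeats of the same phone token are collapsed to a single occurrence.

```python
from itertools import groupby

def dedup_runs(phones):
    """Collapse runs of identical phone tokens, e.g. ['DH','DH','AH','AH'] -> ['DH','AH']."""
    return [phone for phone, _ in groupby(phones)]

print(dedup_runs(["DH", "DH", "AH", "AH", "AH", "B", "OY", "OY"]))
# ['DH', 'AH', 'B', 'OY']
```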
Speech → Text
- HuBERT tokens → CTX-vec2wav [1] synthesis (speaker prompt: LibriTTS “1089_134686_000001_000001”), using the contextual vocoder from UniCATS.
- Whisper-Large-V3 performs ASR with punctuation and case preserved.
- Provides normalized text for downstream automatic evaluation (perplexity via Llama-3.1-8B); a transcription-and-perplexity sketch follows this list.
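A hedged sketch of this evaluation step is shown below: Whisper-Large-V3 transcribes the synthesized audio, and Llama-3.1-8B scores the transcript's perplexity. Checkpoint IDs are the standard Hugging Face names; access requirements, batching, and text-normalization details are omitted.

```python
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# ASR with punctuation and case preserved (Whisper's default output).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# Causal LM used only for scoring, not generation.
ppl_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
ppl_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B",
                                                 torch_dtype=torch.bfloat16)

def transcript_perplexity(wav_path):
    """Transcribe a waveform file, then return (transcript, perplexity)."""
    text = asr(wav_path)["text"]
    ids = ppl_tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = ppl_model(ids, labels=ids).loss  # mean token negative log-likelihood
    return text, torch.exp(loss).item()
```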
Prompt Sets
Prompts are grouped by whether they appear in the training data. For prompts not seen in training, speech is synthesized with Hierspeech++ and aligned to obtain phones, durations, and HuBERT tokens.
| In Training Data | Not in Training Data |
|---|---|
| This | Alice is a nice |
| I will | How much water do you |
| How do | We decide to go to the |
| When I | In the morning, I like to |
| She said | A little bird told me that |
| These are | Mary went to the market to |
| The boy is | In the morning, I like to eat |
| The moon is | Bob is a tennis player, and he |
| What a lovely | He looked up to the sky and saw |
| He looked up to the sky and said | A little girl is playing with her |
Word Boundary Ablation
Adding explicit word-boundary tokens to the non-text modalities yields slight gains on the syntactic (sBLIMP) and semantic (Topic-SC) tasks for Phone-Raw, Phone-Repeat, and Speech-HuBERT, while lexical (sWUGGY) scores stay similar. Phone-BPE drops slightly because the boundary tokens make its sequences longer. A boundary-insertion sketch follows the table.
| Modality | sWUGGY | sBLIMP | Topic-SC |
|---|---|---|---|
| Phone-Raw | 85.8 | 74.5 | 66.6 |
| +word boundary | 85.6 | 75.7 | 66.8 |
| Phone-BPE | 85.0 | 75.0 | 70.9 |
| +word boundary | 84.1 | 75.4 | 69.6 |
| Phone-Repeat | 85.5 | 66.2 | 58.3 |
| +word boundary | 85.2 | 66.9 | 59.0 |
| Speech-HuBERT | 50.8 | 57.3 | 52.9 |
| +word boundary | 50.3 | 57.7 | 53.6 |
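The sketch below illustrates the ablation's input format: a special boundary token (here `<wb>`, a placeholder name) is inserted between the unit sequences of consecutive words before the model sees them.

```python
def add_word_boundaries(words_units, boundary="<wb>"):
    """Insert a boundary token between per-word unit lists (phones or HuBERT tokens)."""
    tokens = []
    for i, units in enumerate(words_units):
        if i > 0:
            tokens.append(boundary)
        tokens.extend(units)
    return tokens

print(add_word_boundaries([["DH", "AH"], ["B", "OY"], ["IH", "Z"]]))
# ['DH', 'AH', '<wb>', 'B', 'OY', '<wb>', 'IH', 'Z']
```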
References
[1] Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, Kai Yu. UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding. arXiv:2306.07547.