How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Ke-Han Lu1, Szu-Wei Fu2, Chao-Han Huck Yang2, Zhehuai Chen2, Sung-Feng Huang2, Chih-Kai Yang1, Yi-Cheng Lin1, Chi-Yuan Hsiao1, Wenze Ren1, En-Pei Hu1, Yu-Han Huang1, An-Yu Cheng1, Cheng-Han Chiang1, Yu Tsao3, Yu-Chiang Frank Wang2, Hung-yi Lee1
1 National Taiwan University    2 NVIDIA    3 Academia Sinica
Paper Code

Abstract

Large language models (LLMs) are widely used as the knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they acquire through text-only pre-training, and how that knowledge affects downstream performance, remains unclear. We study this gap by comparing LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where each LLM reasons over text descriptions produced by an audio captioner; and (3) audio-grounded evaluation, where each LLM is paired with an audio encoder and fine-tuned into a LALM. Our findings reveal that auditory knowledge varies substantially across model families, and that text-only results are strongly correlated with audio-grounded performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.

Overview

Three complementary evaluations to investigate auditory knowledge in LLMs


Figure 1. Overview of the three evaluations. (Top) AKB-2000 construction: a two-level taxonomy guides LLM-assisted question generation, followed by human verification. (Middle) Cascade evaluation: a captioner converts audio to text descriptions fed to a text-only LLM. (Bottom) Audio-grounded evaluation: each LLM is fine-tuned into a LALM using the DeSTA self-distillation framework.

Key Findings

📊

Backbone Choice Matters

Auditory knowledge varies substantially across LLM families. Backbone selection alone causes >10% absolute difference in LALM performance. Qwen consistently outperforms Llama across all settings.

🔗

Strong Text-Audio Correlation

Text-only benchmark results are strongly correlated with audio-grounded performance (r=0.71–0.82), making them a lightweight proxy for LLM selection before expensive multimodal training.
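As a rough sanity check, this correlation can be reproduced from the headline numbers reported in the results tables on this page: the sketch below pairs each of the eight fine-tuned backbones' AKB-2000 (text) accuracy with its MMAU (audio) accuracy and computes Pearson's r with a plain stdlib implementation. Only these published values are used; nothing beyond them is claimed.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# AKB-2000 (text) vs. MMAU (audio) for the eight fine-tuned backbones,
# in the order Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen2.5-7B, Phi-4-14B,
# Phi-4-mini-4B, Llama-3.1-8B, OLMo-3-7B (values from the results table).
akb_text   = [85.05, 78.95, 82.00, 80.70, 86.35, 70.00, 68.10, 69.05]
mmau_audio = [66.20, 61.70, 62.90, 66.60, 61.10, 61.00, 56.40, 56.90]

r = pearson(akb_text, mmau_audio)
print(f"r = {r:.2f}")  # -> r = 0.75, inside the reported 0.71-0.82 range
```

This single pairing lands at r ≈ 0.75; the 0.71–0.82 range reported above spans the different text/audio benchmark pairings.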

🔊

Phonological Blind Spot

LLMs systematically struggle with phonological tasks (rhyme, stress, pronunciation). The 5 hardest subcategories all involve reasoning about how words sound when spoken aloud.
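To make the failure mode concrete, here is a hypothetical multiple-choice probe in the style of the phonological subcategories (it is illustrative only, not an actual AKB-2000 item). The distractors share the "-ough" spelling, so answering correctly requires reasoning about sound rather than orthography.

```python
# A hypothetical rhyme probe; not drawn from AKB-2000.
item = {
    "question": "Which word rhymes with 'through' when spoken aloud?",
    "choices": {"A": "though", "B": "threw", "C": "rough", "D": "cough"},
    "answer": "B",  # 'threw' and 'through' share /ru:/; the others only look similar
}

prompt = item["question"] + "\n" + "\n".join(
    f"{k}. {v}" for k, v in item["choices"].items()
)
print(prompt)
```

A text-only model that pattern-matches on spelling is pulled toward the "-ough" distractors, which is exactly the blind spot the hardest subcategories expose.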

Cascade Matches LALMs

A simple captioner + LLM cascade pipeline can match or surpass state-of-the-art end-to-end LALMs, suggesting current systems underutilize the LLM backbone's reasoning capability.
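The cascade setting can be sketched in a few lines: caption the audio, then let a text-only LLM reason over the caption. The function and stub names below are illustrative, not the paper's API; the two components are interchangeable callables, which is what makes the pipeline a cheap testbed for backbone selection.

```python
from typing import Callable

def cascade_answer(
    audio_path: str,
    captioner: Callable[[str], str],  # audio file -> text description
    llm: Callable[[str], str],        # prompt -> answer
    question: str,
) -> str:
    """Answer an audio question without giving the LLM any audio:
    caption first, then let the text-only LLM reason over the caption."""
    caption = captioner(audio_path)
    prompt = (
        f"Audio description: {caption}\n"
        f"Question: {question}\n"
        "Answer with the best choice."
    )
    return llm(prompt)

# Stub components just to show the data flow.
def fake_captioner(path: str) -> str:
    return "A dog barks twice, then a doorbell rings."

def fake_llm(prompt: str) -> str:
    return "doorbell"  # a real LLM would reason over the caption text

print(cascade_answer("clip.wav", fake_captioner, fake_llm,
                     "What sound follows the barking?"))
```

Swapping in a stronger `llm` callable changes cascade accuracy without touching the audio side, which is why cascade scores serve as a proxy for the backbone's contribution.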

Results

Comprehensive evaluation of 12 open-weight and 5 proprietary LLMs across three settings

| Model | AKB-2000 (Text) | Cascade MMAU (Text) | Cascade MMAR (Text) | Audio-Grounded MMAU | Audio-Grounded MMAR |
|---|---|---|---|---|---|
| Proprietary LLMs (Text) | | | | | |
| Gemini-2.5-Pro | 96.05 | 70.9 | 71.8 | — | — |
| Claude-Sonnet-4.5 | 95.70 | 70.8 | 70.5 | — | — |
| GPT-5 | 94.35 | 71.9 | 69.8 | — | — |
| GPT-4o | 92.90 | 69.3 | 66.0 | — | — |
| Gemini-2.0-Flash | 91.85 | 69.6 | 64.4 | — | — |
| Published LALMs (Audio-Grounded) | | | | | |
| Audio Flamingo 3 | — | — | — | 73.30 | 58.60 |
| Gemini-2.5-Pro (Audio) | — | — | — | 71.60 | 74.70 |
| Qwen2.5-Omni | — | — | — | 71.50 | 56.70 |
| DeSTA2.5-Audio | — | — | — | 66.00 | 50.80 |
| Phi-4-mm | — | — | — | 65.70 | 40.20 |
| Open-weight LLMs — Qwen | | | | | |
| Qwen3-14B | 85.05 | 66.2 | 64.3 | 66.20 | 52.90 |
| Qwen3-8B | 78.95 | 66.8 | 62.0 | 61.70 | 53.00 |
| Qwen3-4B | 82.00 | 66.3 | 61.0 | 62.90 | 49.20 |
| Qwen2.5-7B | 80.70 | 64.5 | 61.4 | 66.60 | 47.30 |
| Open-weight LLMs — Phi | | | | | |
| Phi-4-14B | 86.35 | 62.9 | 62.6 | 61.10 | 52.50 |
| Phi-4-mini-4B | 70.00 | 56.1 | 54.4 | 61.00 | 44.20 |
| Open-weight LLMs — Llama | | | | | |
| Llama-3-8B | 73.45 | 53.5 | 54.4 | — | — |
| Llama-3.1-8B | 68.10 | 53.6 | 51.6 | 56.40 | 47.70 |
| Llama-2-7B | 45.90 | 43.2 | 47.1 | — | — |
| Open-weight LLMs — OLMo | | | | | |
| OLMo-3-7B | 69.05 | 57.7 | 53.2 | 56.90 | 44.90 |
| OLMo-3-7B-DPO | 67.30 | 58.0 | 52.4 | — | — |
| OLMo-3-7B-SFT | 63.95 | 57.3 | 51.6 | — | — |

Accuracy (%) across all evaluation settings. — indicates the model was not evaluated in that setting.

| Model | Avg. | Sound | Paralinguistic | Phonetic | Music | Quality | Technical |
|---|---|---|---|---|---|---|---|
| Proprietary LLMs | | | | | | | |
| Gemini-2.5-Pro | 96.05 | 96.37 | 96.46 | 94.93 | 95.76 | 97.45 | 95.35 |
| Claude-Sonnet-4.5 | 95.70 | 95.16 | 96.97 | 91.71 | 96.01 | 95.41 | 96.22 |
| GPT-5 | 94.35 | 95.16 | 94.44 | 93.55 | 95.51 | 94.90 | 92.44 |
| GPT-4o | 92.90 | 95.97 | 93.94 | 90.32 | 95.01 | 93.88 | 87.50 |
| Gemini-2.0-Flash | 91.85 | 93.15 | 92.42 | 88.94 | 92.77 | 94.39 | 89.24 |
| Open-weight LLMs | | | | | | | |
| Phi-4-14B | 86.35 | 89.92 | 88.22 | 79.26 | 89.53 | 85.71 | 81.69 |
| Qwen3-14B | 85.05 | 90.73 | 85.35 | 76.96 | 88.78 | 88.27 | 79.36 |
| Qwen3-4B | 82.00 | 85.48 | 85.35 | 69.59 | 86.78 | 85.71 | 73.84 |
| Qwen2.5-7B | 80.70 | 87.10 | 81.99 | 69.12 | 87.28 | 81.63 | 72.97 |
| Qwen3-8B | 78.95 | 82.66 | 79.80 | 68.20 | 85.04 | 82.65 | 72.38 |
| Llama-3-8B | 73.45 | 79.44 | 74.58 | 55.76 | 81.55 | 81.63 | 64.24 |
| Phi-4-mini-4B | 70.00 | 75.00 | 71.04 | 56.22 | 75.56 | 71.94 | 65.70 |
| OLMo-3-7B | 69.05 | 73.39 | 71.21 | 57.60 | 75.56 | 75.51 | 58.14 |
| Llama-3.1-8B | 68.10 | 73.39 | 69.53 | 58.53 | 75.56 | 72.96 | 56.40 |
| OLMo-3-7B-DPO | 67.30 | 71.37 | 70.03 | 55.76 | 75.81 | 71.43 | 54.65 |
| OLMo-3-7B-SFT | 63.95 | 65.32 | 64.48 | 51.61 | 72.07 | 68.37 | 57.85 |
| Llama-2-7B | 45.90 | 51.61 | 44.61 | 39.17 | 52.62 | 44.90 | 40.99 |

Table 1. AKB-2000 evaluation accuracy (%) across six categories.

| Model | MMAU Avg. | MMAU Sound | MMAU Music | MMAU Speech | MMAR Avg. | MMAR Sound | MMAR Music | MMAR Speech |
|---|---|---|---|---|---|---|---|---|
| Captioner Comparison (LLM: Gemini-2.5-Pro) | | | | | | | | |
| Official Baselines | 57.3 | 57.35 | 49.70 | 64.86 | 50.7 | 46.1 | 40.3 | 60.9 |
| Whisper-large-v3 | 61.7 | 52.55 | 53.59 | 78.98 | 61.6 | 41.21 | 44.66 | 77.55 |
| Omni-captioner | 68.9 | 69.97 | 61.08 | 75.68 | 65.5 | 51.52 | 46.12 | 77.21 |
| Gemini-caption | 70.9 | 68.77 | 66.47 | 77.48 | 71.8 | 64.85 | 49.03 | 81.97 |
| Gemini-caption + Proprietary LLM | | | | | | | | |
| GPT-5 | 71.9 | 72.97 | 66.47 | 76.28 | 69.8 | 66.06 | 47.09 | 80.27 |
| Gemini-2.5-Pro | 70.9 | 68.77 | 66.47 | 77.48 | 71.8 | 64.85 | 49.03 | 81.97 |
| Claude-Sonnet-4.5 | 70.8 | 68.47 | 63.77 | 80.18 | 70.5 | 60.61 | 50.49 | 81.29 |
| Gemini-2.0-Flash | 69.6 | 68.47 | 63.77 | 76.58 | 64.4 | 58.18 | 47.09 | 75.85 |
| GPT-4o | 69.3 | 66.37 | 64.07 | 77.48 | 66.0 | 61.21 | 49.03 | 73.47 |
| Gemini-caption + Open-weight LLM | | | | | | | | |
| Qwen3-8B | 66.8 | 66.07 | 63.17 | 71.17 | 62.0 | 58.18 | 41.26 | 72.79 |
| Qwen3-4B | 66.3 | 62.46 | 62.28 | 74.17 | 61.0 | 55.15 | 45.15 | 69.05 |
| Qwen3-14B | 66.2 | 65.77 | 61.08 | 71.77 | 64.3 | 58.18 | 50.97 | 70.75 |
| Qwen2.5-7B | 64.5 | 60.96 | 62.28 | 70.27 | 61.4 | 55.15 | 47.57 | 71.77 |
| Phi-4-14B | 62.9 | 62.46 | 61.98 | 64.26 | 62.6 | 58.18 | 51.46 | 69.73 |
| OLMo-3-7B-DPO | 58.0 | 58.86 | 56.29 | 58.86 | 52.4 | 50.30 | 33.50 | 59.18 |
| OLMo-3-7B | 57.7 | 61.26 | 55.69 | 56.16 | 53.2 | 52.12 | 33.50 | 60.20 |
| OLMo-3-7B-SFT | 57.3 | 59.16 | 55.99 | 56.76 | 51.6 | 46.67 | 39.32 | 54.42 |
| Phi-4-mini-4B | 56.1 | 53.45 | 57.49 | 57.36 | 54.4 | 52.73 | 41.75 | 61.22 |
| Llama-3.1-8B | 53.6 | 60.36 | 50.30 | 50.15 | 51.6 | 50.91 | 37.86 | 57.48 |
| Llama-3-8B | 53.5 | 51.65 | 52.40 | 56.46 | 54.4 | 53.33 | 42.72 | 59.18 |
| Llama-2-7B | 43.2 | 46.55 | 46.71 | 36.34 | 47.1 | 50.30 | 33.01 | 50.68 |

Table 2. Cascade evaluation on MMAU and MMAR (%).

| Model | MMAU (Audio) | MMAR (Audio) |
|---|---|---|
| Published Systems | | |
| Gemini-2.5-Pro (Audio) | 71.60 | 74.70 |
| Audio Flamingo 3 | 73.30 | 58.60 |
| Qwen2.5-Omni | 71.50 | 56.70 |
| DeSTA2.5-Audio | 66.00 | 50.80 |
| Phi-4-mm | 65.70 | 40.20 |
| Fine-tuned LALM (Ours) | | |
| Qwen2.5-7B | 66.60 | 47.30 |
| Qwen3-14B | 66.20 | 52.90 |
| Qwen3-4B | 62.90 | 49.20 |
| Qwen3-8B | 61.70 | 53.00 |
| Phi-4-14B | 61.10 | 52.50 |
| Phi-4-mini-4B | 61.00 | 44.20 |
| OLMo-3-7B | 56.90 | 44.90 |
| Llama-3.1-8B | 56.40 | 47.70 |

Table 3. Audio-grounded performance (%). Upper block: published LALM systems; lower block: our LALMs fine-tuned with the DeSTA framework.


Figure 3. Category-level scatter plots comparing cascade and audio-grounded accuracy for 8 fine-tuned LALMs.

Citation

@article{lu2026auditory,
  title   = {How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation},
  author  = {Lu, Ke-Han and Fu, Szu-Wei and Yang, Chao-Han Huck and Chen, Zhehuai and Huang, Sung-Feng and Yang, Chih-Kai and Lin, Yi-Cheng and Hsiao, Chi-Yuan and Ren, Wenze and Hu, En-Pei and others},
  journal = {arXiv preprint arXiv:2603.19195},
  year    = {2026}
}