DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

National Taiwan University, NVIDIA

📑 Paper | 🌐 Website | 👩‍💻 Github | 🤗 Model | 🤗 Dataset

✨✨ Key features

  • We propose an automatic approach to develop speech language models without using speech instruction-tuning data. We aim to inject speech understanding capability into text-based LLMs while keeping their original versatility.
  • We train the end-to-end speech language model with carefully constructed speech captions to avoid the catastrophic forgetting problem.
  • The model shows strong instruction-following ability and demonstrates robust speech understanding capability on the Dynamic-SUPERB and AIR-Bench-Chat benchmarks.

Instruction-following Ability

Our model not only demonstrates strong speech understanding capabilities but also excels at following complex instructions, such as adhering to a specified output format and performing chain-of-thought reasoning.

Instruction: What is the gender of the speaker? Your entire response should be in English, capital letters only.
Model output: THE GENDER OF THE SPEAKER IS FEMALE.

Instruction: What is the emotion of the speaker? Your entire response should be in English, capital letters only.
Model output: HAPPINESS

Instruction: What is the gender of the speaker? Answer in JSON format.
Model output: {"gender": "Female" }

Instruction: What is the emotion conveyed in the speech? Think step by step.
Model output: Let's break it down step by step!
1. We start with the phrase "Thank you for calling". This implies that a request has been made, and the speaker is expressing gratitude.
2. The next phrase is "Goodbye". This is a normal way to end a conversation, and it's a polite way to part ways.
Considering these two phrases together, we can infer that the speaker is happy to end the conversation and is not expecting anything further. The tone is friendly and grateful, suggesting that the speaker has had a positive interaction.
Therefore, the emotion conveyed in the speech is happiness!

Instruction: Analyze the speech, then tell me what is the emotion conveyed in the speech?
Model output: Let's break down the speech:
* "Thank you for calling": This phrase is a polite expression of gratitude for the person's call.
* "Goodbye": This is a final farewell, indicating that the conversation is ending.
The tone of the speech is polite and friendly, as the speaker is thanking the person for calling and bidding them farewell.
The emotion conveyed in this speech is happiness. The speaker seems cheerful and appreciative, expressing gratitude for the call and ending the conversation on a positive note.

Instruction: Calculate the words per second rate, think step by step.
Model output: Let's break it down step by step:
1. The speech lasts for 2 seconds: [00:00:00 - 00:00:02]
2. The number of words is: "Thank you for calling. Goodbye." (5 words)
3. To find the words per second rate, we need to divide the number of words by the time it took to say the words:
Words per second = Total number of words / Time taken (in seconds)
= 5 words / 2 seconds
= 2.5 words per second
Therefore, the words per second rate is 2.5 words

Instruction: How many speakers are there in the audio? Can you tell me the emotion of each speaker? Can you also provide a brief summary of the audio? Is there any background noise? How do you feel today? Can you give me a response to the audio?
Model output: There is only 1 speaker in the audio.
The speaker is a female, and her emotion is happiness.
The audio is a brief goodbye message, where the speaker says "Thank you for calling. Goodbye."
There is no background noise in the audio.
As a machine, I don't have emotions, so I don't feel anything.
As for a response to the audio, I can say: "You're welcome, and have a great day!"

Instruction: 請描述這段聲音,聲音聽起來如何?內容說了什麼? 以中文回答。
(English translation: "Please describe this audio. How does it sound? What does it say? Answer in Chinese.")
Model output: 這段聲音是一個女性的聲音,聽起來是很愉快的。她說了一句「Thank you for calling. Goodbye」,語氣溫柔、柔和,沒有任何激動或激情的感覺。整個聲音聽起來很平靜、很自然,沒有任何明顯的感覺或情感色彩。
(English translation: "This audio is a female voice that sounds very cheerful. She says 'Thank you for calling. Goodbye' in a gentle, soft tone, without any agitation or passion. The whole voice sounds calm and natural, without any obvious feeling or emotional coloring.")
Note: our model was trained only on English data, but it generalizes to other languages thanks to the multilingual capability of Llama3! 😊

Multi-turn question-answering

First question: Transcribe the audio.
First response: Here is the transcription of the audio clip:
"Thank you for calling. Goodbye."

Second question: Who is the speaker talking to?
Second response: The speaker is speaking to someone who has called them, but it's not specified who that person is. The conversation appears to be a brief and abrupt goodbye, suggesting that the speaker may not have had a conversation with the caller or may have been interrupted.

Speech Understanding Ability

Dynamic-SUPERB

Dynamic-SUPERB is a crowdsourced benchmark comprising 48 speech-related classification tasks with human-created instructions and answers.

LanguageIdentification
Instruction: Which language can you recognize from this audio recording? The answer could be German, English, Spanish, Italian, Russian, or French.
Model output: Based on the audio recording, I recognize the language as German. The phrase "Die hätte man dicht machen müssen" is a grammatically correct and idiomatic German sentence, and the speaker's accent and pronunciation also suggest a German speaker.
Ground truth: German

NoiseDetection
Instruction: Determine whether the sentence is a noise-free utterance or a noisy one with music disturbance. The answer could be clean or noisy.
Model output: Based on the provided information, I would classify the sentence as "noisy". The noise level is moderate (15db), and there is a reverberation time of 60ms, which suggests that there is some background noise or music disturbance present.
Ground truth: Noisy

AccentClassification
Instruction: The task involves detecting variations in pronunciation and intonation. The answer could be american, australian, bangla, british, indian, malayalam, odiya, telugu, or welsh.
Model output: Based on the accent and pronunciation, I would identify the speaker as having a British accent.
Ground truth: British

EmotionRecognition
Instruction: Recognize and categorize the emotions expressed in spoken language. The answer could be anger, disgust, sadness, joy, neutral, surprise, or fear.
Model output: Based on the audio clip, the emotion expressed is "joy". The speaker's tone and words ("Here she is, future fashion superstar") convey a sense of excitement and happiness, indicating a positive emotional state.
Ground truth: Joy

DialogueEmotionClassification
Instruction: Analyze the dialogue and identify its corresponding emotion. The answer could be anger, disgust, fear, sadness, happiness, surprise, or no emotion.
Model output: Based on the dialogue, the corresponding emotion is sadness. The speaker is apologizing and expressing regret for causing trouble, which is a characteristic of a sad tone.
Ground truth: Sadness

SpeechTextMatching
Instruction: Verify if the speech and text are congruent. The answer could be yes or no.
Text: "guidelines for an experimental program to develop more detailed criteria"
Model output: The answer is NO.
The speech is about saying a few words on the principles of design and typography, but the text is about guidelines for an experimental program to develop more detailed criteria, which is a different topic.
Ground truth: No

SpeakerVerification
Instruction: Scrutinize the first audio clip to establish whether the second one is delivered by the same speaker. The answer could be yes or no.
Model output: Based on the provided audio clip, I would say NO, the second audio clip is not delivered by the same speaker. The tone, pitch, and emotion expressed in the first audio clip (neutral, calm, and introspective) are different from what I would expect from the same speaker. The first audio clip has a more contemplative and reflective tone, whereas the second audio clip is more dramatic and emotive.
Ground truth: No

AIR-Bench-Chat

AIR-Bench-Chat is an LLM-generated benchmark derived from speech metadata that focuses on open-ended questions.

Instruction: Based on the speaker's emotional state, what could be the nature of the proof being requested?
Model output: Based on the speaker's angry tone and high-pitched voice, it is likely that the proof being requested is related to a dispute or controversy. The speaker is demanding evidence to support a claim or argument, and is likely feeling frustrated or upset that the other party has not provided sufficient evidence to back up their claims.
Ground truth: Given the speaker's anger, the proof likely relates to a serious allegation or an issue that significantly impacts the speaker.

Instruction: What emotion is the speaker likely trying to convey to the listener by stating a fear of their departure?
Model output: Based on the given information, the speaker is likely trying to convey a sense of sadness to the listener by stating a fear of their departure. The low pitch, slow speaking speed, and very low volume suggest a somber and melancholic tone, which is consistent with a sad emotion.
Ground truth: The speaker is trying to convey a sense of sadness and attachment to the listener.

Instruction: What could be the next step for the manager after witnessing such a scene?
Model output: Based on the manager's tone and emotion, it seems that she is amused and entertained by the scene. Given her high pitch and volume, she might be trying to emphasize the absurdity of the situation. Considering her slow speaking speed, she might be savoring the moment and enjoying the spectacle.

As a next step, the manager could:
* Laugh and join in on the fun, possibly even taking a photo or video to capture the moment.
* Try to intervene and calm the situation down, but in a playful and lighthearted manner, acknowledging the absurdity of the situation.
* Use the opportunity to make a humorous
Ground truth: The manager might need to intervene or address the situation occurring on the floor.

Instruction: What did the second person inquire about after the first person's request?
Model output: The second person inquired about the price range of the restaurant.
Ground truth: The second person inquired about any specifications on the price range.

Method

Stage 1: Dataset Construction

In our dataset construction process, we first extract and format all textual attributes from the speech into a structured "seed transcript." This seed transcript is then combined with the prompt "What can you hear from the audio?" and fed into Llama3. The response generated by Llama3 becomes our training target for the end-to-end speech language model.

There are 12 attributes: gender, age, accent, emotion, pitch, volume, speaking speed, SNR level, C50 value, duration, intent, and spoken text.
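The seed-transcript step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`format_timestamp`, `build_seed_transcript`, `build_llm_input`), the exact spacing inside the parentheses, and the way the seed transcript and the prompt are joined (a newline here) are all hypothetical. Only the overall layout and the prompt text come from this page.

```python
# Hypothetical sketch of seed-transcript construction; the names and the
# joining convention are assumptions made for illustration.

PROMPT = "What can you hear from the audio?"  # prompt text quoted on this page


def format_timestamp(seconds: int) -> str:
    """Render whole seconds as HH:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_seed_transcript(text: str, start: int, end: int, attributes: dict) -> str:
    """Format the spoken text and its attributes as one seed-transcript line."""
    span = f"[{format_timestamp(start)} - {format_timestamp(end)}]"
    attrs = ", ".join(f"{name}: {value}" for name, value in attributes.items())
    return f'{span}: "{text}" ({attrs})'


def build_llm_input(seed_transcript: str) -> str:
    """Combine the seed transcript with the prompt (newline join is assumed)."""
    return f"{seed_transcript}\n{PROMPT}"


seed = build_seed_transcript(
    "These days are long gone.",
    start=1,
    end=2,
    attributes={
        "Gender": "Male",
        "Accent": "Scottish",
        "Age": 20,
        "Emotion": "neutral",
        "Duration": "2s",
    },
)
print(build_llm_input(seed))
```

The text-only LLM's response to this input would then serve as the training target for the corresponding audio clip. Attributes that are unavailable for a clip are simply left out of the dictionary.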

Seed transcript: [00:00:01 - 00:00:02]: "These days are long gone." (Gender:Male, Accent: Scottish, Age: 20, Emotion: neutral, Duration: 2s)
Llama3-Instruct response: From the audio, I can hear:
* A male voice with a Scottish accent
* The voice is approximately 20 years old
* The tone is neutral, indicating a lack of strong emotions
* The audio duration is 2 seconds
Corresponding audio: VCTK-Corpus

Seed transcript: [00:00:01 - 00:00:04]: "At the top level, the salaries are far too high." (Gender:Male, Accent: English, Age: 23, Emotion: neutral, Duration: 4s)
Llama3-Instruct response: From the audio, we can hear a male speaker with an English accent (possibly British English) saying:
"At the top level, the salaries are far too high."
The speaker is a 23-year-old male with a neutral emotion. The audio lasts for 4 seconds.
Corresponding audio: VCTK-Corpus

Seed transcript: [00:00:01 - 00:00:04]: "Oh, lovely. Where's the wedding going to be?" (Gender:Male, Emotion: happiness, Action: question, Duration: 4s)
Llama3-Instruct response: From the audio, I can hear a male speaker asking a question with a tone of happiness, specifically the words: "Oh, lovely. Where's the wedding going to be?"
Corresponding audio: Dailytalk

Seed transcript: [00:00:01 - 00:00:07]: "A new pride taught me mine ego, and that teach I unto men: no longer to thrust one's head into the sand of celestial things, but to carry it freely, a terrestrial head, which giveth meaning to the earth!" (Gender:Male, Emotion: cheerful, Pitch: Normal, Volume: Normal, Speaking speed: Fast, Duration: 7s)
Llama3-Instruct response: From the given audio, we can hear a male speaker with a cheerful tone, speaking at a normal pitch and volume. The speaking speed is relatively fast, and the duration is 7 seconds.
The audio appears to be a philosophical or inspirational quote, with the speaker using poetic language to convey a message. The text itself is a condensed and somewhat abstract statement, but it seems to be encouraging people to be more grounded and down-to-earth, rather than getting lost in lofty or abstract concepts.
Corresponding audio: PromptTTS

Stage 2: Model training

During the model training phase, we utilize a pre-trained Whisper model to extract speech features and generate text transcriptions from the audio input. These speech features and the text embeddings (from the transcription and chat template) are concatenated before being fed into Llama3. We maintain a consistent template structure to prevent any mismatch. Finally, the model is trained to generate speech descriptions, effectively learning to interpret and articulate the nuances of spoken content based on both acoustic features and textual information.
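The concatenation step can be pictured with a small, framework-free Python sketch. It is a toy illustration rather than the actual training code: the function name `assemble_llm_input`, the segment order (template prefix, speech features, transcription embeddings, template suffix), and the assumption that the speech features have already been mapped to the same width as the text embeddings are all hypothetical. A real implementation would operate on tensors rather than nested lists.

```python
from typing import List, Sequence

Vector = List[float]


def assemble_llm_input(
    template_prefix: Sequence[Vector],
    speech_features: Sequence[Vector],
    transcription_embeddings: Sequence[Vector],
    template_suffix: Sequence[Vector],
) -> List[Vector]:
    """Concatenate all segments along the sequence axis.

    The segment order used here is an assumption for illustration only.
    """
    sequence = [
        *template_prefix,
        *speech_features,
        *transcription_embeddings,
        *template_suffix,
    ]
    # Every vector must share one width before the LLM can consume the sequence.
    widths = {len(vector) for vector in sequence}
    if len(widths) > 1:
        raise ValueError(f"all vectors must share one width, got {sorted(widths)}")
    return sequence


# Toy example with width-4 vectors standing in for real embeddings.
prefix = [[0.0] * 4 for _ in range(2)]       # chat-template tokens before the audio
speech = [[1.0] * 4 for _ in range(3)]       # speech features from the speech encoder
transcript = [[2.0] * 4 for _ in range(5)]   # embedded transcription tokens
suffix = [[3.0] * 4 for _ in range(1)]       # chat-template tokens after the audio

llm_input = assemble_llm_input(prefix, speech, transcript, suffix)
print(len(llm_input))  # 11 positions in total
```

Keeping the same segment layout every time the input is assembled is one way to read the "consistent template structure" mentioned above.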