Locating and interpreting factual association in Speech Language Models
| dc.contributor.author | Modica, Luca | |
| dc.contributor.author | Landin, Filip | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Johansson, Richard | |
| dc.contributor.supervisor | Farahani, Mehrdad | |
| dc.date.accessioned | 2025-11-05T12:16:20Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | ||
| dc.description.abstract | Recent advances have enabled Speech Language Models (SLMs) to both understand and generate text and speech by representing audio as discrete tokens learned from raw waveforms without supervision. As these multimodal systems become increasingly common in real-world applications, it is crucial to understand how they encode and retrieve factual knowledge; such insights are key to improving their factual accuracy and reliability. While previous research has explored these mechanisms in traditional Large Language Models (LLMs) by observing their responses to targeted prompts (such as "The capital of Italy is ___"), much less is known about how these processes work in multimodal models such as SLMs, particularly regarding interactions between modalities in cross-modal scenarios (e.g., speech-to-text). This thesis explores how Speech Language Models store and recall factual associations by applying Causal Mediation Analysis (CMA), a method inspired by causal inference that quantifies the contribution of model components to factual predictions. We introduce MultimodalCausalTracer, an adaptation of CMA that also handles discrete speech tokens. We use a CTC-based forced-alignment algorithm to locate target words in a spoken utterance, map discrete speech tokens to their text equivalents, and visualize CMA results across the speech and text modalities. We applied MultimodalCausalTracer to the Spirit LM model using a new speech-based version of the Known dataset that we constructed, covering spoken factual prompts about countries, famous people, and places. The results, measured as the Average Indirect Effect (AIE) of the model’s components, show clear discrepancies between text-to-text and speech-to-text tasks, suggesting that the emergent mechanisms for factual recall carry over only partially from the text modality to the speech modality. Our findings highlight key areas for future work, including extending the experiments to other cross-modal scenarios and investigating factual recall in other SLMs and factual datasets. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310722 | |
| dc.language.iso | eng | |
| dc.relation.ispartofseries | CSE 25-53 | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Machine Learning, Deep Learning, Causal Inference, Speech Language Models, Discrete Speech Tokens, Mechanistic Interpretability, Multimodal Learning, Model Analysis | |
| dc.title | Locating and interpreting factual association in Speech Language Models | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Data science and AI (MPDSC), MSc |
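The abstract's central technique, Causal Mediation Analysis via activation patching, can be illustrated in code. The sketch below is a minimal, text-only illustration and not the thesis's actual implementation: it assumes a generic Hugging Face causal LM (GPT-2 stands in for Spirit LM), and the prompt, subject token position, and noise scale are illustrative assumptions. The speech-token handling of MultimodalCausalTracer is not reproduced here.

```python
# Minimal sketch of causal tracing (CMA) for factual recall, assuming a
# generic Hugging Face causal LM. GPT-2 is a stand-in; the thesis analyzes
# Spirit LM with discrete speech tokens, which this sketch does not cover.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

prompt = "The capital of Italy is"
answer = " Rome"  # expected completion on the clean run
inputs = tok(prompt, return_tensors="pt")
answer_id = tok(answer)["input_ids"][0]
seq_len = inputs["input_ids"].shape[1]

def answer_prob(logits):
    # Probability assigned to the correct answer token at the last position.
    return torch.softmax(logits[0, -1], dim=-1)[answer_id].item()

# 1) Clean run: record hidden states and the baseline answer probability.
with torch.no_grad():
    clean = model(**inputs, output_hidden_states=True)
p_clean = clean_hidden = None
p_clean = answer_prob(clean.logits)
clean_hidden = clean.hidden_states  # tuple of (batch, seq, dim) per layer

# 2) Corrupted run: add Gaussian noise to the subject token embedding
#    ("Italy", position 3 here; an assumption for this prompt), destroying
#    the factual signal.
subject_pos = [3]
emb = model.get_input_embeddings()(inputs["input_ids"]).detach()
noised = emb.clone()
noised[0, subject_pos] += 3 * torch.randn_like(noised[0, subject_pos])
with torch.no_grad():
    corrupt = model(inputs_embeds=noised)
p_corrupt = answer_prob(corrupt.logits)

# 3) Patched runs: re-run the corrupted input while restoring the clean
#    hidden state at one (layer, position). The recovery in answer
#    probability is that component's indirect effect (IE); averaging IE
#    over many prompts gives the AIE the abstract reports.
def restore_hook(layer_idx, pos):
    def hook(module, inp, out):
        hs = out[0] if isinstance(out, tuple) else out
        # hidden_states[0] is the embedding layer, so block i maps to i + 1.
        hs[0, pos] = clean_hidden[layer_idx + 1][0, pos]
        return out
    return hook

ie = torch.zeros(model.config.n_layer, seq_len)
for layer_idx, block in enumerate(model.transformer.h):
    for pos in range(seq_len):
        handle = block.register_forward_hook(restore_hook(layer_idx, pos))
        with torch.no_grad():
            patched = model(inputs_embeds=noised)
        handle.remove()
        ie[layer_idx, pos] = answer_prob(patched.logits) - p_corrupt

print(f"p(clean)={p_clean:.3f}  p(corrupt)={p_corrupt:.3f}")
print("peak IE at (layer, pos):", divmod(int(ie.argmax()), seq_len))
```

In the thesis's cross-modal setting, the same patching loop runs over discrete speech tokens instead of text tokens, with CTC-based forced alignment identifying which speech-token positions correspond to the spoken subject before the noise is applied.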
