Locating and interpreting factual association in Speech Language Models
| dc.contributor.author | Modica, Luca | |
| dc.contributor.author | Landin, Filip | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Johansson, Richard | |
| dc.contributor.supervisor | Farahani, Mehrdad | |
| dc.date.accessioned | 2025-11-05T12:16:20Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | ||
| dc.description.abstract | Recent advances have enabled Speech Language Models (SLMs) to both understand and generate text and speech by representing audio as discrete tokens learned from raw waveforms without supervision. As these multimodal systems become increasingly common in real-world applications, it is crucial to understand how they encode and retrieve factual knowledge; such insights are key to improving their factual accuracy and reliability. While previous research has explored these mechanisms in traditional Large Language Models (LLMs) by observing their responses to targeted prompts (such as "The capital of Italy is ___"), much less is known about how these processes work in multimodal models such as SLMs, particularly regarding interactions between modalities in cross-modal scenarios (e.g., speech-to-text). This thesis explores how Speech Language Models store and recall factual associations by applying Causal Mediation Analysis (CMA), a method inspired by causal inference that quantifies the contribution of model components to factual predictions. We introduce MultimodalCausalTracer, an adaptation of CMA that also handles discrete speech tokens. We use a CTC-based forced-alignment algorithm to locate target words in a spoken utterance, map discrete speech tokens to their text equivalents, and visualize CMA results across the speech and text modalities. We applied MultimodalCausalTracer to the Spirit LM model using a new speech-based version of the Known dataset that we constructed, covering spoken factual prompts about countries, famous people, and places. The results, measured as the Average Indirect Effect (AIE) of the model’s components, show clear discrepancies between text-to-text and speech-to-text tasks, suggesting that the emergent mechanisms for factual recall carry over only partially from the text modality to the speech modality. Our findings highlight key areas for future work, including extending the experiments to other cross-modal scenarios and investigating factual recall in other SLMs and factual datasets. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310722 | |
| dc.language.iso | eng | |
| dc.relation.ispartofseries | CSE 25-53 | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Machine Learning, Deep Learning, Causal Inference, Speech Language Models, Discrete Speech Tokens, Mechanistic Interpretability, Multimodal Learning, Model Analysis | |
| dc.title | Locating and interpreting factual association in Speech Language Models | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Data science and AI (MPDSC), MSc |
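The abstract's central technique, Causal Mediation Analysis via activation patching, can be illustrated in code. The sketch below is a minimal, text-only illustration and not the thesis's actual implementation: it assumes a generic Hugging Face causal LM (GPT-2 stands in for Spirit LM), and the prompt, subject token position, and noise scale are illustrative assumptions. The speech-token handling of MultimodalCausalTracer is not reproduced here.

```python
# Minimal sketch of causal tracing (CMA) for factual recall, assuming a
# generic Hugging Face causal LM. GPT-2 is a stand-in; the thesis analyzes
# Spirit LM with discrete speech tokens, which this sketch does not cover.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

prompt = "The capital of Italy is"
answer = " Rome"  # expected completion on the clean run
inputs = tok(prompt, return_tensors="pt")
answer_id = tok(answer)["input_ids"][0]
seq_len = inputs["input_ids"].shape[1]

def answer_prob(logits):
    # Probability assigned to the correct answer token at the last position.
    return torch.softmax(logits[0, -1], dim=-1)[answer_id].item()

# 1) Clean run: record hidden states and the baseline answer probability.
with torch.no_grad():
    clean = model(**inputs, output_hidden_states=True)
p_clean = clean_hidden = None
p_clean = answer_prob(clean.logits)
clean_hidden = clean.hidden_states  # tuple of (batch, seq, dim) per layer

# 2) Corrupted run: add Gaussian noise to the subject token embedding
#    ("Italy", position 3 here; an assumption for this prompt), destroying
#    the factual signal.
subject_pos = [3]
emb = model.get_input_embeddings()(inputs["input_ids"]).detach()
noised = emb.clone()
noised[0, subject_pos] += 3 * torch.randn_like(noised[0, subject_pos])
with torch.no_grad():
    corrupt = model(inputs_embeds=noised)
p_corrupt = answer_prob(corrupt.logits)

# 3) Patched runs: re-run the corrupted input while restoring the clean
#    hidden state at one (layer, position). The recovery in answer
#    probability is that component's indirect effect (IE); averaging IE
#    over many prompts gives the AIE the abstract reports.
def restore_hook(layer_idx, pos):
    def hook(module, inp, out):
        hs = out[0] if isinstance(out, tuple) else out
        # hidden_states[0] is the embedding layer, so block i maps to i + 1.
        hs[0, pos] = clean_hidden[layer_idx + 1][0, pos]
        return out
    return hook

ie = torch.zeros(model.config.n_layer, seq_len)
for layer_idx, block in enumerate(model.transformer.h):
    for pos in range(seq_len):
        handle = block.register_forward_hook(restore_hook(layer_idx, pos))
        with torch.no_grad():
            patched = model(inputs_embeds=noised)
        handle.remove()
        ie[layer_idx, pos] = answer_prob(patched.logits) - p_corrupt

print(f"p(clean)={p_clean:.3f}  p(corrupt)={p_corrupt:.3f}")
print("peak IE at (layer, pos):", divmod(int(ie.argmax()), seq_len))
```

In the thesis's cross-modal setting, the same patching loop runs over discrete speech tokens instead of text tokens, with CTC-based forced alignment identifying which speech-token positions correspond to the spoken subject before the noise is applied.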
