Locating and interpreting factual association in Speech Language Models

Type

Master's Thesis

Abstract

Recent advances have enabled Speech Language Models (SLMs) to both understand and generate text and speech by representing audio as discrete tokens learned from raw waveforms without supervision. As these multimodal systems become increasingly common in real-world applications, it is crucial to understand how they encode and retrieve factual knowledge, insights that are key to improving their factual accuracy and reliability. While previous research has explored these mechanisms in traditional Large Language Models (LLMs) by observing their responses to targeted prompts (such as "The capital of Italy is ___"), much less is known about how these processes work in multimodal models such as SLMs, particularly regarding interactions between different modalities in cross-modal scenarios (e.g., speech-to-text). This thesis explores how Speech Language Models store and recall factual associations by applying Causal Mediation Analysis (CMA), a method inspired by causal inference that quantifies the contribution of model components to factual predictions. We introduce MultimodalCausalTracer, an adaptation of CMA that also handles discrete speech tokens. We use a CTC-based forced alignment algorithm to locate targeted words in a spoken utterance, map discrete speech tokens to their text equivalents, and visualize CMA results across the speech and text modalities. We applied MultimodalCausalTracer to the Spirit LM model using a new speech-based version of the Known dataset that we constructed, covering spoken factual prompts about countries, famous people, and places. The results, measured in terms of the Average Indirect Effect (AIE) of the model's components, show evident discrepancies between text-to-text and speech-to-text tasks, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality.
Our findings highlight key areas for future work, including extending experiments to other cross-modal scenarios and investigating factual recall in different SLMs and factual datasets.
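The Average Indirect Effect used above can be illustrated with a minimal sketch. In CMA-style causal tracing, each prompt is run three times: clean, corrupted (e.g., with the subject's tokens noised), and corrupted with one component's clean hidden state patched back in; the indirect effect of that component is the probability of the correct answer it recovers, averaged over prompts. The function names and numbers below are illustrative, not taken from the thesis.

```python
def indirect_effect(p_corrupt, p_restored):
    """Indirect effect of one component: how much of the correct-answer
    probability is recovered when its clean activation is restored
    during an otherwise corrupted run."""
    return p_restored - p_corrupt

def average_indirect_effect(records):
    """Average Indirect Effect (AIE) over a set of
    (p_clean, p_corrupt, p_restored) probability triples, one per prompt."""
    effects = [indirect_effect(p_corrupt, p_restored)
               for _, p_corrupt, p_restored in records]
    return sum(effects) / len(effects)

# Toy numbers: probability of the correct answer under the clean run,
# the corrupted run, and the corrupted run with one state patched in.
records = [(0.9, 0.1, 0.6), (0.8, 0.2, 0.5)]
print(average_indirect_effect(records))  # 0.4
```

Sweeping this computation over layers and token positions (here, over both text tokens and aligned speech tokens) yields the per-component AIE maps compared across modalities.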

Subject / keywords

Machine Learning, Deep Learning, Causal Inference, Speech Language Models, Discrete Speech Tokens, Mechanistic Interpretability, Multimodal Learning, Model Analysis
