Version 1.0
Talk: Can mechanistic interpretability tools for LLMs help us understand the brain?

It has been discovered that large language models (LLMs) can predict brain activity in response to natural language stimuli. However, we still don't fully understand which specific parts of these models are responsible for this similarity to brain activity. Meanwhile, the field of Mechanistic Interpretability (MI) has been attempting to reverse engineer LLMs to understand the mechanisms underlying how they process different tasks. For my thesis, I've started exploring how these two fields can benefit from each other.
Methods:
- Use Activation Patching and Edge Attribution Patching to find circuits important for specific NLP tasks (a minimal patching sketch follows this list).
- Ablate these circuits (or everything but them) and measure the impact on brain–model alignment using encoding analyses.
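To make the patching step concrete, here is a minimal sketch of activation patching using plain PyTorch forward hooks on GPT-2 via Hugging Face transformers. The prompt pair, the patched layer (`LAYER = 6`), and the answer token are illustrative placeholders rather than anything from the talk; Edge Attribution Patching would approximate the effect of many such patches at once via gradients instead of rerunning the model for each one.

```python
# Minimal activation-patching sketch (illustrative; not the talk's actual setup).
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Clean/corrupted prompt pair with identical token lengths (hypothetical example).
clean = tokenizer("The capital of France is", return_tensors="pt")
corrupt = tokenizer("The capital of Italy is", return_tensors="pt")

LAYER = 6          # which transformer block to patch -- an arbitrary illustrative choice
cached = {}        # holds the clean activation of that block

def cache_hook(module, inputs, output):
    # GPT2Block returns a tuple; output[0] is the residual-stream hidden state
    cached["h"] = output[0].detach().clone()

def patch_hook(module, inputs, output):
    # Overwrite the corrupted run's hidden state at this block with the clean one
    return (cached["h"],) + output[1:]

block = model.transformer.h[LAYER]

# 1) Clean run: cache this block's output activation
handle = block.register_forward_hook(cache_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2) Corrupted run with the clean activation patched in at the same block
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits
handle.remove()

# 3) Corrupted run without patching, as a baseline
with torch.no_grad():
    corrupt_logits = model(**corrupt).logits

# How much does patching this single block restore the "clean" answer?
answer_id = tokenizer(" Paris")["input_ids"][0]
print("patched logit for ' Paris':", patched_logits[0, -1, answer_id].item())
print("corrupt logit for ' Paris':", corrupt_logits[0, -1, answer_id].item())
```

Sweeping this over layers and positions (or, for attribution patching, over edges between components) is what localises a candidate circuit; ablation then removes exactly those components, for example by zero- or mean-replacing their activations.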
Hypotheses:
- Corrupting task-relevant circuits should reduce alignment with brain activity (a sketch of this comparison follows the list).
- Preserving these circuits while corrupting everything else should maintain alignment.
- Different circuits/tasks may correspond to different brain regions.
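These hypotheses reduce to comparing an encoding score across ablation conditions. The sketch below does that with cross-validated ridge regression and voxel-averaged correlations; note that the "brain" responses here are random placeholders so the script runs on its own, the ablation is a crude zeroing of a feature block rather than a real circuit, and RidgeCV stands in for whatever encoding pipeline the actual analysis uses.

```python
# Encoding-analysis sketch with placeholder data (illustrative only).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Placeholder sizes: 200 stimuli, 768 model features (GPT-2 hidden size), 50 voxels.
n_stimuli, n_features, n_voxels = 200, 768, 50

# LLM activations for the intact model, plus a crude "ablated" version in which
# one block of features is zeroed out as a stand-in for removing a circuit.
X_intact = rng.standard_normal((n_stimuli, n_features))
X_ablated = X_intact.copy()
X_ablated[:, :256] = 0.0

# Placeholder "brain" responses: a noisy linear readout of the intact activations.
# A real analysis would use recorded fMRI/MEG responses to the same stimuli.
W = rng.standard_normal((n_features, n_voxels))
Y = X_intact @ W * 0.1 + rng.standard_normal((n_stimuli, n_voxels))

def encoding_score(X, Y, n_splits=5):
    """Cross-validated, voxel-averaged correlation between predicted and held-out responses."""
    fold_scores = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        ridge = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X[train], Y[train])
        pred = ridge.predict(X[test])
        r_per_voxel = [np.corrcoef(pred[:, v], Y[test, v])[0, 1] for v in range(Y.shape[1])]
        fold_scores.append(np.mean(r_per_voxel))
    return float(np.mean(fold_scores))

print("alignment, intact model   :", encoding_score(X_intact, Y))
print("alignment, circuit ablated:", encoding_score(X_ablated, Y))
```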
Our first results are very modest, but we keep exploring :)
Info
Date:
14.11.2025
Start time:
11:30
Duration:
00:30
Room:
M2.31
Track:
Computational Linguistics
Language:
en
Links:
Speakers
Nursulu Sagimbayeva
