Measuring MindReader
MindReader runs content through a computational model trained on real fMRI measurements from 720+ participants. In a comparison run, two pieces of content are processed through the same model and contrasted second by second. In a single run, one piece of content gets the same cortical readout without the A-vs-B delta.
The approach
For thirty-five years, researchers have tried to measure emotional impact in content by asking people what they felt — through questionnaires, expert panels, and behavioral tests. These approaches haven't converged on reliable results because they measure opinions about experience, not the experience itself.
MindReader doesn't ask people anything. It asks whether the content itself activates the brain systems that are known to be active during emotional, social, and cognitive experience. Two pieces of content go through the same simulated brain. The question is: what does each one do to it?
The person is not the variable. The content is. And because the brain being used is the same for both pieces of content — the same model, trained on the same population — whatever difference emerges in the output is a difference in the content, not a difference in the audience.
Comparison runs, single runs, and fIV
Comparison runs process two inputs through the same TRIBEv2/fIV pipeline, then subtract the predicted cortical response for A from the response for B. This is the full MindReader output: cortical heatmap, dimension bars, time-series differences, and comparison-level interpretation.
Single runs process one input through the same prediction stack and report the strongest cortical systems inside that one piece of content. There is no A/B winner because there is no second input. The result answers: what does this content ask the average cortex to do?
fIV is the model layer that turns multimodal content features into predicted cortical activity. For text, audio, and video, the feature streams are aligned to time and projected onto the same fsaverage5 cortical surface, so the same seven-system readout can be used across run types.
Single-run availability
Available now: cortical surface map, strongest systems, and plain-language readout. Still in beta for single runs: chord progression and connectivity/correlation matrices, because those views are currently tuned around comparison time series.
From your file to the results page
Media preprocessing
Transcription and waveforms
The audio track is run through WhisperX, a speech recognition model, which transcribes every spoken word and records when it starts and ends. This word-level timing is what allows MindReader to align cortical predictions back to specific moments in the content: when a peak appears in the Memory Encoding dimension at 0:18, you can see what was being said in the seconds leading up to it (peak labels trail the content by roughly 5 seconds).
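As a rough illustration of how word-level timing supports that lookup, the sketch below finds the words spoken inside a time window. The `Word` record and `words_in_window` helper are hypothetical names used for illustration; they are not WhisperX's output format or MindReader's actual code.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word-level timing record (not the WhisperX schema)."""
    text: str
    start: float  # seconds
    end: float    # seconds


def words_in_window(words: list[Word], t0: float, t1: float) -> list[str]:
    """Return the words whose time span overlaps the window [t0, t1)."""
    return [w.text for w in words if w.start < t1 and w.end > t0]


# Made-up transcript fragment for illustration.
transcript = [
    Word("our", 12.10, 12.30),
    Word("new", 12.30, 12.62),
    Word("battery", 12.62, 13.20),
    Word("lasts", 13.20, 13.55),
    Word("longer", 13.55, 14.10),
]

# Which words were being spoken during second 13 of the content?
spoken = words_in_window(transcript, 13.0, 14.0)
```

A word counts as inside the window if any part of it overlaps, so a word that straddles a second boundary shows up in both seconds.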
The audio amplitude envelope — the rising and falling loudness over time — is computed separately and becomes the waveform display on the results page.
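One common way to compute an amplitude envelope is root-mean-square (RMS) loudness over short windows. The sketch below assumes RMS and a 50 ms window; the text does not say which method or window size MindReader uses, and `rms_envelope` is a hypothetical name.

```python
import math


def rms_envelope(samples: list[float], sample_rate: int,
                 window_s: float = 0.05) -> list[float]:
    """RMS loudness for each non-overlapping window of the signal."""
    size = max(1, int(sample_rate * window_s))
    envelope = []
    for i in range(0, len(samples), size):
        chunk = samples[i:i + size]
        envelope.append(math.sqrt(sum(x * x for x in chunk) / len(chunk)))
    return envelope


# One second of a 440 Hz tone at 8 kHz that doubles in amplitude halfway through.
rate = 8000
tone = [
    (0.25 if n < rate // 2 else 0.5) * math.sin(2 * math.pi * 440 * n / rate)
    for n in range(rate)
]
env = rms_envelope(tone, rate)  # 20 values, one per 50 ms window
```

The second half of `env` is twice as high as the first, which is the rise a waveform display would draw.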
Video keyframes
For video, scene transitions are detected by comparing how different consecutive frames look. Keyframe images extracted at those transitions become the thumbnail previews in the timeline view.
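A minimal sketch of frame-difference scene detection, assuming frames are flat lists of grayscale pixel values and that a cut is any jump in mean absolute difference above a fixed threshold. The metric, the threshold of 40, and the function names are illustrative assumptions, not the production detector.

```python
def mean_abs_diff(a: list[int], b: list[int]) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def detect_cuts(frames: list[list[int]], threshold: float = 40.0) -> list[int]:
    """Indices of frames that differ sharply from the frame before them."""
    return [
        i for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold
    ]


# Toy 4x4 grayscale frames: dark scene, slight noise, cut to bright, cut back.
dark = [10] * 16
dark_noisy = [12] * 16
bright = [200] * 16
frames = [dark, dark_noisy, bright, bright, dark]

cuts = detect_cuts(frames)  # frames at these indices would become keyframes
```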
Text
Text is first converted to speech by a synthesis engine and then treated as audio input. After this extra step, text runs follow the same pipeline as audio.
Feature extraction
Three specialist models read the content simultaneously, each focused on a different channel. Each one converts what it sees into a sequence of numbers — one set per second of content — that encodes what is happening in that channel at that moment in terms of meaning, structure, and context.
Language — LLaMA 3.2-3B
A large language model reads the transcript word by word, tracking how meaning evolves over time. The word "launch" means something different after a sentence about products than after a sentence about rockets. The model encodes that difference. The output at each second is a list of 512 numbers — a compact numerical fingerprint of the semantic content at that moment. Similar sentences produce similar fingerprints; unrelated ones produce very different ones.
Visuals — V-JEPA2
A vision model processes the video frames, building a compact description of the visual scene at each moment: spatial structure, motion, and how the scene changes. Its output at each second is a list of 1024 numbers. The visual vector is larger because visual scenes carry more raw variation than language alone.
Audio — Wav2Vec-BERT
A model trained on speech processes the raw audio waveform, capturing what the transcript doesn't: speaking pace, tone, hesitation, background music. The output is 512 numbers per second — a fingerprint of the sound, not just the words.
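$f_{\text{lang}}(t) \in \mathbb{R}^{512}$: 512 numbers encoding semantic content at second $t$
$f_{\text{vis}}(t) \in \mathbb{R}^{1024}$: 1024 numbers encoding visual scene at second $t$
$f_{\text{aud}}(t) \in \mathbb{R}^{512}$: 512 numbers encoding audio character at second $t$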
Neural prediction
Combining the three streams
The three streams don't contribute equally to every moment. When someone is talking directly to camera, the language and audio streams carry most of the signal. During a fast visual cut with no speech, the visual stream takes over. The model has a mechanism — a temporal transformer — that decides, at each second, how to weight the three channels against each other and how each one is influencing the others. The same word spoken over silence versus over urgent music gets a different combined representation.
The result is a single list of numbers for each second of the video, encoding the full multimodal experience of that moment — all three channels, combined.
Predicting blood flow across the cortex
That combined representation is then used to predict a blood-flow value at each of 20,484 specific locations on the cortical surface. Blood flow is the indirect signal fMRI scanners measure: when a brain region is active, the brain sends it extra blood, and the change in the blood's oxygen level is what the scanner picks up. TRIBEv2 learned to predict this signal from thousands of hours of video paired with real scanner recordings.
The output for each piece of content is a matrix. Each row is one second. Each column is one cortical location. Each cell is a predicted blood-flow value at that location, at that second.
The 5-second hemodynamic lag
Blood flow peaks approximately 5 seconds after the neural firing that caused it — a physical property of the brain's vascular system. When you see a Learning Moment labeled at 0:18, the content driving it occurred around 0:13. All peak labels in the results reflect cortical timing.
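The arithmetic for mapping a peak label back to the content is a simple subtraction. The sketch below uses the approximate 5-second lag stated above; `content_time` and `fmt` are hypothetical helper names.

```python
HEMODYNAMIC_LAG_S = 5  # approximate lag stated in the text


def content_time(cortical_time_s: int, lag_s: int = HEMODYNAMIC_LAG_S) -> int:
    """Approximate moment in the content that drove a peak labeled at cortical_time_s."""
    return max(0, cortical_time_s - lag_s)  # clamp so early peaks do not go negative


def fmt(seconds: int) -> str:
    """Format whole seconds as m:ss."""
    return f"{seconds // 60}:{seconds % 60:02d}"


# A Learning Moment labeled at 0:18 points back to content around 0:13.
label = fmt(content_time(18))
```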
The comparison
With two prediction matrices — one for each piece of content — the comparison is computed vertex by vertex, second by second. At every cortical location, at every second: which video predicts more activation?
Positive delta means content B is higher; negative means content A is. This signed difference drives the cortical heatmap and the dimension waveforms, which show the comparison at every second across the full video.
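The signed difference can be sketched as an element-wise subtraction of two prediction matrices. The toy matrices below have 3 columns rather than 20,484, and truncating to the shorter input is an assumption made here; the text does not say how inputs of different length are handled.

```python
Matrix = list[list[float]]  # rows = seconds, columns = cortical locations


def signed_delta(pred_a: Matrix, pred_b: Matrix) -> Matrix:
    """B minus A at every second and every cortical location."""
    n = min(len(pred_a), len(pred_b))  # assumption: compare only overlapping seconds
    return [[b - a for a, b in zip(pred_a[t], pred_b[t])] for t in range(n)]


# Two seconds of made-up predictions at three cortical locations.
pred_a = [[0.2, 0.5, 0.1], [0.4, 0.4, 0.3]]
pred_b = [[0.6, 0.5, 0.0], [0.1, 0.9, 0.3]]

delta = signed_delta(pred_a, pred_b)
# delta[0][0] > 0: content B is higher at location 0 in the first second.
# delta[1][0] < 0: content A is higher at location 0 in the next second.
```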
The seven dimension bars summarize the peak activation — which video produced the stronger response in that dimension at its most intense moment. Peak is more informative than mean: a video with one extraordinary second reads very differently to a viewer than one with uniformly moderate engagement, and the mean hides that difference.
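A small worked example of why the mean can hide what the peak reveals. The two series are invented for illustration.

```python
from statistics import mean

# Two made-up activation series, one value per second.
steady = [0.5, 0.5, 0.5, 0.5, 0.5]      # uniformly moderate
spiky = [0.25, 0.25, 1.5, 0.25, 0.25]   # one extraordinary second

summary = {
    name: {"mean": mean(series), "peak": max(series)}
    for name, series in {"steady": steady, "spiky": spiky}.items()
}
# Both means are 0.5, but the peaks are 0.5 and 1.5.
```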
Cortical chord detection
At any given second, some combination of the seven cortical systems is active above threshold. That combination is a chord — and the sequence of chords across the video is its cortical chord progression.
This is MindReader's most distinctive output. The chord progression describes the cognitive arc of a video: when it asks the viewer to work, when it triggers something visceral, when it connects to the self. Two videos with identical dimension averages can have completely different chord progressions — and the progression is almost always the more useful comparison.
Each chord is a named co-activation pattern drawn from published neuroimaging research. The patterns emerge from TRIBEv2's output — they are not editorially assigned.
How thresholds are set
Rather than using fixed global cutoffs, each chord's threshold is computed relative to the clip itself. For each cortical system, the 70th percentile of that system's activation values within the clip defines "high" — the 30th percentile defines "low", and the 50th percentile (median) defines "elevated". This means a moment only qualifies as a Learning Moment if attention and memory encoding are both genuinely high relative to this clip's own distribution, not against an arbitrary fixed number. The result is a threshold that scales naturally across content types: a fast-paced action ad and a slow-burn documentary both get thresholds calibrated to their own cortical range.
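The per-clip thresholds can be sketched as three percentiles of one system's time series. Linear interpolation between ranks is an assumption here; the text does not specify an interpolation rule, and `percentile` and `clip_thresholds` are hypothetical names.

```python
def percentile(values: list[float], p: float) -> float:
    """p-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def clip_thresholds(series: list[float]) -> dict[str, float]:
    """Low / elevated / high cutoffs for one system within one clip."""
    return {
        "low": percentile(series, 30),
        "elevated": percentile(series, 50),
        "high": percentile(series, 70),
    }


# Made-up attention values for an 11-second clip.
attention = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
thresholds = clip_thresholds(attention)
```

Because the cutoffs come from the clip's own values, scaling every value in the series by a constant scales the thresholds by the same constant, which is what lets very different content types share one rule.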
Every system in a chord's condition must meet its threshold simultaneously, sustained for the chord's minimum duration, for the chord to register. Most chords pair two systems; Story Integration requires three. The chord progression then sequences these co-activation windows across the video timeline, giving you a named cognitive arc rather than just a list of activation values.
A concrete example
Consider a 30-second product ad. At 0:02, a startling visual lands — gut spikes, prefrontal stays quiet: Visceral Hit. The narrator then explains what just happened — effort and language co-activate: Reasoning Beat. A close-up face shows genuine surprise, and the viewer connects it to themselves — personal resonance and gut fire together: Emotional Impact.
The progression reads: Visceral Hit → Reasoning Beat → Emotional Impact. That arc tells you something a single bar chart cannot: the ad felt first, then reasoned, then resonated. Compare it to an ad that skips the visceral opener and leads with reasoning — very different experience, identical dimension averages.
Learning Moment · Attention ≥ θH & Memory Encoding ≥ θH · sustained ≥ 1s
The cortex is actively encoding this moment into long-term memory. The dorsal attention network and left vlPFC co-activate, the pattern Wagner et al. 1998 showed predicts later recall.
Emotional Impact · Personal Resonance ≥ θH & Gut Reaction ≥ θH · sustained ≥ 1s
mPFC tags the content as self-relevant while anterior insula responds viscerally. Falk et al. 2012 found this pattern predicted real-world behavior change at r = 0.87.
Reasoning Beat · Brain Effort ≥ θH & Language Depth ≥ θH · sustained ≥ 1s
The prefrontal cortex applies cognitive control while the language network processes meaning. The viewer is working through an argument. Miller & Cohen 2001; Fedorenko et al. 2011.
Story Integration · Attention ≥ θH & Language Depth ≥ θH & Personal Resonance ≥ θS (median) · sustained ≥ 2s
The cortex is building a situation model, integrating narrative with prior knowledge and personal experience. Not tracking facts, but constructing a story. Mar 2011; Yeshurun et al. 2017.
Visceral Hit · Gut Reaction ≥ θH & Brain Effort ≤ θL · sustained ≥ 1s
High insula with suppressed prefrontal activity: the body responded before deliberation engaged. Ochsner & Gross 2005; Fox et al. 2005; Damasio 1994.
Cold Cognitive Work · Brain Effort ≥ θH & Personal Resonance ≤ θL · sustained ≥ 2s
Sustained effortful processing without self-referential engagement. Processing the content without connecting it personally. Fox et al. 2005; Raichle et al. 2001.
Social Resonance · Social Thinking ≥ θH & Personal Resonance ≥ θH · sustained ≥ 1s
Right TPJ models another person's mind while mPFC connects it to the self. The cortical correlate of understanding another's experience through your own. Saxe & Kanwisher 2003.
Threshold values & calibration
All thresholds are percentile-relative, computed within each clip. θH = 70th percentile (high) · θS = 50th percentile (median / elevated) · θL = 30th percentile (low). Each system's percentile is computed from its own activation timeseries within that clip, so thresholds scale naturally to the content's own range.
| Chord | Condition | Min duration |
|---|---|---|
| Learning Moment | Attention ≥ θH & Memory ≥ θH | 1s |
| Emotional Impact | Personal ≥ θH & Gut ≥ θH | 1s |
| Reasoning Beat | Effort ≥ θH & Language ≥ θH | 1s |
| Story Integration | Attention ≥ θH & Language ≥ θH & Personal ≥ θS | 2s |
| Visceral Hit | Gut ≥ θH & Effort ≤ θL | 1s |
| Cold Cognitive Work | Effort ≥ θH & Personal ≤ θL | 2s |
| Social Resonance | Social ≥ θH & Personal ≥ θH | 1s |
θH = p70 · θS = p50 · θL = p30, all per-clip and per-system. Chord definitions were reviewed by a neuroscience postdoc before v1 shipped. Pattern definitions are published in data/pattern-definitions.json in the open-source codebase.
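The chord rules in the table can be sketched as a per-second boolean test followed by a minimum-duration filter. The example below checks the Visceral Hit condition (Gut ≥ θH and Effort ≤ θL, at least 1 s). The threshold values are fixed illustrative numbers rather than per-clip percentiles, and `sustained_runs` is a hypothetical helper, not the published scoring logic.

```python
def sustained_runs(flags: list[bool], min_len: int) -> list[tuple[int, int]]:
    """(start, end_exclusive) of each run of True lasting at least min_len seconds."""
    runs: list[tuple[int, int]] = []
    start = None
    for t, flag in enumerate(flags + [False]):  # trailing False closes an open run
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            if t - start >= min_len:
                runs.append((start, t))
            start = None
    return runs


# Made-up per-second activation for two systems in a 6-second clip.
gut = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1]
effort = [0.5, 0.1, 0.2, 0.1, 0.9, 0.5]

# Illustrative cutoffs; in practice these would be the clip's own p70 and p30.
theta_h_gut, theta_l_effort = 0.7, 0.25

# Visceral Hit: Gut >= theta_H and Effort <= theta_L in the same second.
flags = [g >= theta_h_gut and e <= theta_l_effort for g, e in zip(gut, effort)]
visceral_hits = sustained_runs(flags, min_len=1)  # seconds 1 and 2 qualify
```

Second 4 has high gut activation but also high effort, so it does not count. A chord with a 2-second minimum, such as Cold Cognitive Work, would pass `min_len=2` instead.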
Connectivity mapping
Beyond detecting which chords fire, MindReader computes how the seven dimensions relate to each other across the full video. Pearson correlation is computed between every pair of dimension time series, producing a 7×7 matrix showing which systems rise and fall together and which tend to suppress each other.
What the matrix shows
Positive correlation (blue): two systems fire together in this content. Negative correlation (red): when one is high, the other tends to be low. The diagonal is always 1.0. The aggregate measures — integration score, hub dimension — describe the overall architecture of cortical engagement the content produces. Computed on TRIBEv2's predicted time series, not raw fMRI.
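A minimal sketch of the pairwise Pearson computation, shown with three made-up dimension series instead of seven. The function names are hypothetical, and a constant series would need special handling because its standard deviation is zero.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length series (neither may be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(series: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Correlation between every pair of named series, including each with itself."""
    return {a: {b: pearson(series[a], series[b]) for b in series} for a in series}


# Made-up time series: Memory rises with Attention, Gut falls as Attention rises.
dims = {
    "Attention": [1.0, 2.0, 3.0, 4.0],
    "Memory Encoding": [2.0, 4.0, 6.0, 8.0],
    "Gut Reaction": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(dims)
```

Here Attention and Memory Encoding correlate at about +1, Attention and Gut Reaction at about -1, and the diagonal is 1.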
The 7 dimensions
MindReader reports predicted activation across seven large-scale functional networks, each grounded in the Yeo et al. 2011 parcellation and the primary literature for each region. For each dimension, a binary vertex mask identifies which of the 20,484 cortical locations belong to that network. At each timestep, the mean predicted BOLD across the masked vertices gives one activation value. Masks do not overlap.
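The masked mean can be sketched as follows, using 4 cortical locations instead of 20,484. The masks and values are invented for illustration, and the function names are hypothetical.

```python
def dimension_value(bold_row: list[float], mask: list[bool]) -> float:
    """Mean predicted BOLD over the vertices where mask is True, for one second."""
    selected = [v for v, m in zip(bold_row, mask) if m]
    return sum(selected) / len(selected)


def dimension_series(pred: list[list[float]], mask: list[bool]) -> list[float]:
    """One activation value per second for a single dimension."""
    return [dimension_value(row, mask) for row in pred]


# Two seconds of made-up predictions at four cortical locations.
pred = [
    [1.0, 3.0, 10.0, 20.0],
    [2.0, 4.0, 10.0, 30.0],
]

# Binary masks that do not overlap: a vertex belongs to at most one dimension.
attention_mask = [True, True, False, False]
gut_mask = [False, False, False, True]

attention = dimension_series(pred, attention_mask)
gut = dimension_series(pred, gut_mask)
```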
| # | Dimension | Region | What activates here |
|---|---|---|---|
| 01 | Attention | DAN (FEF, IPS, MT+) | Goal-directed, sustained top-down attention. Corbetta & Shulman 2002. |
| 02 | Brain Effort | dlPFC (BA 9/46) | Cognitive control; working memory; effortful processing. Miller & Cohen 2001. |
| 03 | Gut Reaction | Anterior Insula | Interoception; visceral bodily response; the cortical address of felt sensation. Critchley 2005; Craig 2009. |
| 04 | Language Depth | Broca's + Wernicke's | High-level linguistic processing — semantics, syntax, pragmatics. Fedorenko et al. 2011. |
| 05 | Memory Encoding | Left vlPFC (BA 44/45/47) | Subsequent-memory effect: activation here at encoding predicts later recall. Wagner et al. 1998; Kim 2011. |
| 06 | Personal Resonance | mPFC | Self-referential processing; content tagged as personally relevant. Northoff et al. 2006; Falk et al. 2012. |
| 07 | Social Thinking | Right TPJ | Theory of mind; mentalizing about another person's intentions. Saxe & Kanwisher 2003. |
TRIBEv2 covers cortical surface only. Memory Encoding measures left vlPFC — the cortical correlate of the subsequent-memory effect — but long-term storage happens in the hippocampus, which is outside coverage. Gut Reaction uses anterior insula as a proxy for visceral arousal; the amygdala is subcortical and not captured. These are cortical correlates, not complete systems.
What TRIBEv2 cannot do
- Population average: All predictions represent the average response across 720+ training participants. Individual brains vary substantially. No subject-specific information about any viewer is captured.
- Cortical surface only: The fsaverage5 mesh covers cortical grey matter only. No direct measurement of the hippocampus, amygdala, nucleus accumbens, thalamus, or any other subcortical structure.
- Trained on film: Predictions are most reliable for content resembling training data: spoken word over visuals, natural human presence, narrative structure. Pure music, abstract animation, and non-naturalistic content may produce noisier outputs.
- 1-second resolution: Temporal resolution is 1 Hz. Sub-second events (a 200 ms jump cut, a brief audio cue) are integrated into their surrounding second and cannot be resolved independently.
- 5-second lag: BOLD responses lag behind neural firing by approximately 5 seconds. All peak labels are at cortical timing, approximately 5 seconds after the corresponding moment in the video.
Files, data, and the codebase
Audio and video files are deleted from our servers immediately after processing completes. The output — waveforms, thumbnails, transcription, cortical predictions — is computed from your file and then the file is gone. TRIBEv2 is a frozen pre-trained model; submitted content does not update it.
MindReader is open source under CC BY-NC, inherited from TRIBEv2's license from Meta FAIR. The pattern definitions, threshold values, dimension vertex masks, and scoring logic are all published and versioned. Researchers and developers are encouraged to fork the codebase, run their own comparisons, and extend the methodology.
References
1. Allen, E. A., et al. (2014). Tracking whole-brain connectivity dynamics in the resting state. Cerebral Cortex, 24(3), 663–676.
2. Corbetta, M., & Shulman, G. L. (2002). Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3, 201–215.
3. Craig, A. D. (2009). How do you feel — now? The anterior insula and human awareness. Nature Reviews Neuroscience, 10(1), 59–70.
4. Critchley, H. D. (2005). Neural mechanisms of autonomic, affective, and cognitive integration. Journal of Comparative Neurology, 493(1), 154–166.
5. Critchley, H. D., et al. (2004). Neural systems supporting interoceptive awareness. Nature Neuroscience, 7(2), 189–195.
6. Damasio, A. (1994). Descartes' Error: Emotion, Reason, and the Human Brain. Putnam.
7. Falk, E. B., Berkman, E. T., & Lieberman, M. D. (2012). From neural responses to population behavior: neural focus group predicts population-level media effects. Psychological Science, 23(5), 439–445.
8. Fedorenko, E., Behr, M. K., & Kanwisher, N. (2011). Functional specificity for high-level linguistic processing in the human brain. PNAS, 108(39), 16428–16433.
9. Fox, M. D., et al. (2005). The human brain is intrinsically organized into dynamic, anticorrelated functional networks. PNAS, 102(27), 9673–9678.
10. Hutchison, R. M., et al. (2013). Dynamic functional connectivity: promise, issues, and interpretations. NeuroImage, 80, 360–378.
11. Kim, H. (2011). Neural activity that predicts subsequent memory and forgetting: a meta-analysis of 74 fMRI studies. NeuroImage, 54(3), 2446–2461.
12. Mar, R. A. (2011). The neural bases of social cognition and story comprehension. Annual Review of Psychology, 62, 103–134.
13. Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24, 167–202.
14. Northoff, G., et al. (2006). Self-referential processing in our brain — a meta-analysis of imaging studies on the self. NeuroImage, 31(1), 440–457.
15. Ochsner, K. N., & Gross, J. J. (2005). The cognitive control of emotion. Trends in Cognitive Sciences, 9(8), 242–249.
16. Power, J. D., et al. (2011). Functional network organization of the human brain. Neuron, 72(4), 665–678.
17. Raichle, M. E., et al. (2001). A default mode of brain function. PNAS, 98(2), 676–682.
18. Saxe, R., & Kanwisher, N. (2003). People thinking about thinking people. NeuroImage, 19(4), 1835–1842.
19. Saxe, R., & Powell, L. J. (2006). It's the thought that counts. Psychological Science, 17(8), 692–699.
20. Schäfer, A., & Menninghaus, W. (2025). Aesthetic effects in film: a mega-analysis of 16 datasets with 572 participants. Psychological Science.
21. Sporns, O. (2013). Network attributes for segregation and integration in the human brain. Current Opinion in Neurobiology, 23(2), 162–171.
22. Uncapher, M. R., & Wagner, A. D. (2009). Posterior parietal cortex and episodic encoding. Nature Reviews Neuroscience, 10, 613–625.
23. Wagner, A. D., et al. (1998). Building memories: remembering and forgetting of verbal experiences as predicted by brain activity. Science, 281, 1188–1191.
24. Yeo, B. T., et al. (2011). The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of Neurophysiology, 106(3), 1125–1165.
25. Yeshurun, Y., et al. (2017). Memory retrieval during learning determines whether experience will contribute to episodic memory formation. Neuron, 93(4), 953–967.