The approach

For thirty-five years, researchers have tried to measure emotional impact in content by asking people what they felt — through questionnaires, expert panels, and behavioral tests. These approaches haven't converged on reliable results because they measure opinions about experience, not the experience itself.

MindReader doesn't ask people anything. It asks whether the content itself activates the brain systems that are known to be active during emotional, social, and cognitive experience. Two pieces of content go through the same simulated brain. The question is: what does each one do to it?

The person is not the variable. The content is. And because the brain being used is the same for both pieces of content — the same model, trained on the same population — whatever difference emerges in the output is a difference in the content, not a difference in the audience.

Comparison runs, single runs, and fIV

Comparison runs process two inputs through the same TRIBEv2/fIV pipeline, then subtract the predicted cortical response for A from the response for B. This is the full MindReader output: cortical heatmap, dimension bars, time-series differences, and comparison-level interpretation.

Single runs process one input through the same prediction stack and report the strongest cortical systems inside that one piece of content. There is no A/B winner because there is no second input. The result answers: what does this content ask the average cortex to do?

fIV is the model layer that turns multimodal content features into predicted cortical activity. For text, audio, and video, the feature streams are aligned to time and projected onto the same fsaverage5 cortical surface, so the same seven-system readout can be used across run types.

Single-run availability

Available now: cortical surface map, strongest systems, and plain-language readout. Still in beta for single runs: chord progression and connectivity/correlation matrices, because those views are currently tuned around comparison time series.

From your file to the results page

Both pieces of content run through the same pipeline — results compared at the end
01 Your content
Text, audio, or video. Both pieces enter the pipeline here and are processed identically and independently before being compared at the end.

02 Media preprocessing
Audio is transcribed to word-level timestamps. Waveforms are computed for display. Video keyframes are extracted at scene transitions. Text is synthesized to speech first.
Time: 10–60s

03 Feature extraction
Three specialist models read the content simultaneously (one each for language, visuals, and audio), each producing a numerical description of what is happening at every second.
Time: 30–300s depending on modality and hardware

04 Neural prediction
TRIBEv2 fuses the three streams and predicts blood-flow activity across 20,484 cortical surface locations at 1Hz: one second at a time, for the full duration of the content.
Time: the dominant compute cost

05 Contrast / readout
For comparison runs, the two predicted activation matrices are compared vertex by vertex and second by second. For single runs, the same predicted matrix is summarized without subtracting another input.
Time: < 10ms after prediction

06 Patterns*
Comparison runs can derive co-activation patterns from the seven dimension time series. Single-run chord and connectivity views are still in beta while thresholds are tuned for one-input results.
Time: 1–5s

07 Results delivered
The browser receives the available outputs for the run type: cortical map, dimension readout, waveforms when media timing is present, and comparison-only pattern/connectivity views where enabled. Your uploaded file is deleted.

Media preprocessing

Transcription and waveforms

The audio track is run through WhisperX, a speech recognition model, which transcribes every spoken word and records when it starts and ends. This word-level timing is what allows MindReader to align cortical predictions back to specific moments in the content: when a peak appears in the Memory Encoding dimension at 0:18, you can look up what was being said in the seconds leading up to it (peak labels trail the content by about 5 seconds; see "The 5-second hemodynamic lag").
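As a rough illustration of how word-level timing supports this lookup, the sketch below filters a transcript by time window. The `Word` structure, the `words_in_window` helper, and the sample words are hypothetical; they are not MindReader's or WhisperX's actual data format.

```python
from typing import TypedDict


class Word(TypedDict):
    word: str
    start: float  # seconds from the beginning of the content
    end: float    # seconds from the beginning of the content


def words_in_window(words: list[Word], t_start: float, t_end: float) -> list[str]:
    """Return the words whose spoken interval overlaps [t_start, t_end)."""
    return [w["word"] for w in words if w["start"] < t_end and w["end"] > t_start]


# Hypothetical transcript fragment, invented for this example.
transcript: list[Word] = [
    {"word": "this", "start": 17.2, "end": 17.5},
    {"word": "changes", "start": 17.6, "end": 18.1},
    {"word": "everything", "start": 18.2, "end": 18.9},
    {"word": "today", "start": 19.3, "end": 19.7},
]

# Words spoken during the one-second bin that starts at 0:18.
print(words_in_window(transcript, 18.0, 19.0))  # ['changes', 'everything']
```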

The audio amplitude envelope — the rising and falling loudness over time — is computed separately and becomes the waveform display on the results page.
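One common way to compute an amplitude envelope is the root-mean-square (RMS) level over short windows. The sketch below assumes that method and a 50 ms window. The text does not say which method or window MindReader uses, so `amplitude_envelope` and its parameters are illustrative only.

```python
import numpy as np


def amplitude_envelope(samples: np.ndarray, sample_rate: int, window_s: float = 0.05) -> np.ndarray:
    """RMS loudness of a mono signal in consecutive, non-overlapping windows."""
    window = max(1, round(sample_rate * window_s))
    n_windows = len(samples) // window
    # Drop the trailing partial window so the signal reshapes cleanly.
    trimmed = samples[: n_windows * window].astype(np.float64)
    frames = trimmed.reshape(n_windows, window)
    return np.sqrt(np.mean(frames ** 2, axis=1))


# Synthetic 1-second tone whose loudness ramps up from silence.
sr = 16_000
t = np.arange(sr) / sr
signal = np.linspace(0.0, 1.0, sr) * np.sin(2 * np.pi * 220 * t)

env = amplitude_envelope(signal, sr)
print(env.shape)  # (20,)
```

Each value in `env` would correspond to one bar of a waveform display. The rising ramp shows up as an increasing envelope.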

Video keyframes

For video, scene transitions are detected by comparing how different consecutive frames look. Keyframe images extracted at those transitions become the thumbnail previews in the timeline view.
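A minimal sketch of frame-difference scene detection follows, assuming grayscale frames and a fixed threshold on the mean absolute pixel difference. The difference measure and threshold MindReader actually uses are not stated, so `detect_scene_cuts` and its default value are assumptions.

```python
import numpy as np


def detect_scene_cuts(frames: np.ndarray, threshold: float = 0.3) -> list[int]:
    """Return indices of frames that differ sharply from the previous frame.

    frames: array of shape (n_frames, height, width), grayscale values in [0, 1].
    threshold: mean absolute pixel difference treated as a cut (illustrative value).
    """
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0)).mean(axis=(1, 2))
    # diffs[i] compares frame i with frame i + 1, so the cut lands on frame i + 1.
    return [int(i) + 1 for i in np.flatnonzero(diffs > threshold)]


# Synthetic clip: 5 dark frames followed by 5 bright frames.
clip = np.concatenate([np.full((5, 4, 4), 0.1), np.full((5, 4, 4), 0.9)])
print(detect_scene_cuts(clip))  # [5]
```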

Text

Text is first converted to speech using a synthesis engine, then treated as audio input. Text comparisons go through this extra step but then follow the same pipeline as audio.

[Figure: what comes out of media preprocessing: audio envelope + transcript timing]

Feature extraction

Three specialist models read the content simultaneously, each focused on a different channel. Each one converts what it sees into a sequence of numbers — one set per second of content — that encodes what is happening in that channel at that moment in terms of meaning, structure, and context.

Language — LLaMA 3.2-3B

A large language model reads the transcript word by word, tracking how meaning evolves over time. The word "launch" means something different after a sentence about products than after a sentence about rockets. The model encodes that difference. The output at each second is a list of 512 numbers — a compact numerical fingerprint of the semantic content at that moment. Similar sentences produce similar fingerprints; unrelated ones produce very different ones.

Visuals — V-JEPA2

A vision model processes the video frames, building a compact description of the visual scene at each moment, capturing spatial structure, motion, and how the scene changes. Its output at each second is a list of 1024 numbers. The list is longer here because visual scenes carry more raw variation than language alone.

Audio — Wav2Vec-BERT

A model trained on speech processes the raw audio waveform, capturing what the transcript doesn't: speaking pace, tone, hesitation, background music. The output is 512 numbers per second — a fingerprint of the sound, not just the words.

Output — three parallel sequences of number-lists, one per second
$f_{\text{lang}}(t) \in \mathbb{R}^{512}$: 512 numbers encoding language meaning at second $t$
$f_{\text{vis}}(t) \in \mathbb{R}^{1024}$: 1024 numbers encoding visual scene at second $t$
$f_{\text{aud}}(t) \in \mathbb{R}^{512}$: 512 numbers encoding audio character at second $t$
[Figure: Stream activity, showing how much each channel (Language, Visual, Audio) is contributing at this moment. Bar length = relative richness of that channel's output at this second of the video.]
All three run in parallel; none waits for the others.

Neural prediction

Combining the three streams

The three streams don't contribute equally to every moment. When someone is talking directly to camera, the language and audio streams carry most of the signal. During a fast visual cut with no speech, the visual stream takes over. The model has a mechanism — a temporal transformer — that decides, at each second, how to weight the three channels against each other and how each one is influencing the others. The same word spoken over silence versus over urgent music gets a different combined representation.

The result is a single list of numbers for each second of the video, encoding the full multimodal experience of that moment — all three channels, combined.

Predicting blood flow across the cortex

That combined representation is then used to predict a blood-flow value at each of 20,484 specific locations on the cortical surface. Blood flow is the indirect signal fMRI scanners measure: when a brain region is active, the brain sends it extra blood, and the change in the blood's oxygen level is what the scanner picks up. TRIBEv2 learned to predict this signal from thousands of hours of video paired with real scanner recordings.

The output for each piece of content is a matrix. Each row is one second. Each column is one cortical location. Each cell is a predicted blood-flow value at that location, at that second.

Prediction matrix — output shape
$P \in \mathbb{R}^{T \times 20{,}484}$
where $T$ = seconds of content, 20,484 = cortical vertices (fsaverage5), and each cell = predicted blood-flow value
[Figure: The prediction matrix. Each row is a second, each column is a cortical location, and color runs from low to high activation. Each cell is one predicted blood-flow value for one cortical location at one second. In practice there are 20,484 columns; the figure shows a sample of 9.]

The 5-second hemodynamic lag

Blood flow peaks approximately 5 seconds after the neural firing that caused it — a physical property of the brain's vascular system. When you see a Learning Moment labeled at 0:18, the content driving it occurred around 0:13. All peak labels in the results reflect cortical timing.
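The arithmetic behind that example can be sketched as follows. The helper names are hypothetical, and the fixed 5-second shift is the approximation stated above, not a precise hemodynamic model.

```python
HEMODYNAMIC_LAG_S = 5  # approximate lag stated in the text


def content_time_for_peak(cortical_time_s: int, lag_s: int = HEMODYNAMIC_LAG_S) -> int:
    """Map a peak label (cortical timing) back to the content moment that likely drove it."""
    return max(0, cortical_time_s - lag_s)


def format_mmss(seconds: int) -> str:
    """Format a second count as m:ss."""
    return f"{seconds // 60}:{seconds % 60:02d}"


# A peak labeled at 0:18 points back to content at roughly 0:13.
print(format_mmss(content_time_for_peak(18)))  # 0:13
```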

The comparison

With two prediction matrices — one for each piece of content — the comparison is computed vertex by vertex, second by second. At every cortical location, at every second: which video predicts more activation?

Positive delta means content B is higher; negative means content A is. This signed difference drives the cortical heatmap and the dimension waveforms, which show the comparison at every second across the full video.
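A minimal sketch of the signed difference, using random stand-ins for the two prediction matrices. Truncating to the shorter input is an assumption made here for illustration; the text does not say how MindReader handles inputs of different lengths.

```python
import numpy as np

N_VERTICES = 20_484  # fsaverage5 cortical vertices


def signed_delta(pred_a: np.ndarray, pred_b: np.ndarray) -> np.ndarray:
    """B minus A at every second and vertex; positive values mean B is higher.

    Both inputs have shape (seconds, vertices). Truncation to the shorter
    input is an assumption of this sketch.
    """
    t = min(pred_a.shape[0], pred_b.shape[0])
    return pred_b[:t] - pred_a[:t]


# Random stand-ins for two prediction matrices of different durations.
rng = np.random.default_rng(0)
pred_a = rng.normal(size=(30, N_VERTICES))
pred_b = rng.normal(size=(25, N_VERTICES))

delta = signed_delta(pred_a, pred_b)
print(delta.shape)  # (25, 20484)
```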

The seven dimension bars summarize the peak activation — which video produced the stronger response in that dimension at its most intense moment. Peak is more informative than mean: a video with one extraordinary second reads very differently to a viewer than one with uniformly moderate engagement, and the mean hides that difference.
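To see why the mean can hide this difference, consider two made-up dimension time series with the same mean but very different peaks. The values below are invented for illustration.

```python
# Two hypothetical 10-second dimension time series (one value per second).
uniform = [0.5] * 10                # moderate engagement throughout
spiky = [0.25] * 8 + [0.5, 2.5]     # mostly low, with one extraordinary second


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


print(mean(uniform), mean(spiky))  # 0.5 0.5
print(max(uniform), max(spiky))    # 0.5 2.5
```

The means are identical, so a mean-based bar would call the two pieces equal. The peak separates them.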

Normalized delta per dimension
$$\Delta_d = \frac{\text{score}_{B,d} - \text{score}_{A,d}}{|\text{score}_{A,d}| + |\text{score}_{B,d}| + \varepsilon}$$
where $\text{score}_d$ = mean BOLD across all vertices in dimension $d$'s cortical mask

Example: Gut Reaction dimension
A = 0.58, B = 0.82
B leads by +0.24  ·  $\Delta_d$ = +0.17 normalized
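The Gut Reaction example can be reproduced with a few lines. The value of ε is not given in the text, so the `1e-8` default below is an assumption; it only exists to avoid division by zero.

```python
def normalized_delta(score_a: float, score_b: float, eps: float = 1e-8) -> float:
    """Signed difference B - A, scaled by the combined magnitude of both scores."""
    return (score_b - score_a) / (abs(score_a) + abs(score_b) + eps)


# Gut Reaction example: A = 0.58, B = 0.82
raw = 0.82 - 0.58
print(round(raw, 2))                           # 0.24
print(round(normalized_delta(0.58, 0.82), 2))  # 0.17
```

Because the denominator is the sum of the two magnitudes, the result stays between -1 and +1, and swapping A and B flips the sign.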

Cortical chord detection

At any given second, some combination of the seven cortical systems is active above threshold. That combination is a chord — and the sequence of chords across the video is its cortical chord progression.

This is MindReader's most distinctive output. The chord progression describes the cognitive arc of a video: when it asks the viewer to work, when it triggers something visceral, when it connects to the self. Two videos with identical dimension averages can have completely different chord progressions, and in those cases the progression is the more informative comparison.

Each chord is a named co-activation pattern drawn from published neuroimaging research. The patterns emerge from TRIBEv2's output — they are not editorially assigned.

How thresholds are set

Rather than using fixed global cutoffs, each chord's threshold is computed relative to the clip itself. For each cortical system, the 70th percentile of that system's activation values within the clip defines "high" — the 30th percentile defines "low", and the 50th percentile (median) defines "elevated". This means a moment only qualifies as a Learning Moment if attention and memory encoding are both genuinely high relative to this clip's own distribution, not against an arbitrary fixed number. The result is a threshold that scales naturally across content types: a fast-paced action ad and a slow-burn documentary both get thresholds calibrated to their own cortical range.

Every condition in a chord's definition (two systems for most chords, three for Story Integration) must hold simultaneously, sustained for that chord's minimum duration, for the chord to register. The chord progression then sequences these co-activation windows across the video timeline, giving you a named cognitive arc rather than just a list of activation values.
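The percentile thresholds and the sustained-duration rule can be sketched as below for a Learning Moment (Attention and Memory both high for at least 1 second). The function names and the synthetic series are hypothetical, and `np.percentile`'s default linear interpolation is an assumption about how the percentiles are computed.

```python
import numpy as np


def percentile_thresholds(series: np.ndarray) -> dict[str, float]:
    """Per-clip thresholds for one system: low (p30), elevated (p50), high (p70)."""
    low, elevated, high = np.percentile(series, [30, 50, 70])
    return {"low": float(low), "elevated": float(elevated), "high": float(high)}


def sustained_runs(mask: np.ndarray, min_duration_s: int) -> list[tuple[int, int]]:
    """Return (start, end) second indices, end exclusive, of True runs lasting >= min_duration_s."""
    runs, start = [], None
    for t, active in enumerate(mask):
        if active and start is None:
            start = t
        elif not active and start is not None:
            if t - start >= min_duration_s:
                runs.append((start, t))
            start = None
    # Close a run that is still open when the clip ends.
    if start is not None and len(mask) - start >= min_duration_s:
        runs.append((start, len(mask)))
    return runs


# Synthetic 10-second dimension series, one value per second.
attention = np.array([0.1, 0.2, 0.9, 0.8, 0.85, 0.2, 0.1, 0.3, 0.2, 0.1])
memory = np.array([0.2, 0.1, 0.7, 0.9, 0.80, 0.1, 0.2, 0.1, 0.3, 0.2])

att_high = percentile_thresholds(attention)["high"]
mem_high = percentile_thresholds(memory)["high"]
learning_moment = (attention >= att_high) & (memory >= mem_high)

print(sustained_runs(learning_moment, min_duration_s=1))  # [(2, 5)]
```

Both systems clear their own 70th percentile during seconds 2 through 4, so one Learning Moment window is reported.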

A concrete example

Consider a 30-second product ad. At 0:02, a startling visual lands — gut spikes, prefrontal stays quiet: Visceral Hit. The narrator then explains what just happened — effort and language co-activate: Reasoning Beat. A close-up face shows genuine surprise, and the viewer connects it to themselves — personal resonance and gut fire together: Emotional Impact.

The progression reads: Visceral Hit → Reasoning Beat → Emotional Impact. That arc tells you something a single bar chart cannot: the ad felt first, then reasoned, then resonated. Compare it to an ad that skips the visceral opener and leads with reasoning — very different experience, identical dimension averages.

The 7 chords
Learning Moment
Attention ≥ θH & Memory ≥ θH
sustained ≥ 1s

The cortex is actively encoding this moment into long-term memory. The dorsal attention network and left vlPFC co-activate — the pattern Wagner et al. 1998 showed predicts later recall.

Emotional Impact
Personal Resonance ≥ θH & Gut ≥ θH
sustained ≥ 1s

mPFC tags the content as self-relevant while anterior insula responds viscerally. Falk et al. 2012 found this pattern predicted real-world behavior change at r = 0.87.

Reasoning Beat
Brain Effort ≥ θH & Language ≥ θH
sustained ≥ 1s

The prefrontal cortex applies cognitive control while the language network processes meaning. The viewer is working through an argument. Miller & Cohen 2001; Fedorenko et al. 2011.

Story Integration
Attention ≥ θH & Language ≥ θH
Personal Resonance ≥ θS (median) · sustained ≥ 2s

The cortex is building a situation model — integrating narrative with prior knowledge and personal experience. Not tracking facts, but constructing a story. Mar 2011; Yeshurun et al. 2017.

Visceral Hit
Gut Reaction ≥ θH & Effort ≤ θL
sustained ≥ 1s

High insula with suppressed prefrontal activity — the body responded before deliberation engaged. Ochsner & Gross 2005; Fox et al. 2005; Damasio 1994.

Cold Cognitive Work
Brain Effort ≥ θH & Personal ≤ θL
sustained ≥ 2s

Sustained effortful processing without self-referential engagement. Processing the content without connecting it personally. Fox et al. 2005; Raichle et al. 2001.

Social Resonance
Social Thinking ≥ θH & Personal ≥ θH
sustained ≥ 1s

Right TPJ models another person's mind while mPFC connects it to the self. The cortical correlate of understanding another's experience through your own. Saxe & Kanwisher 2003.

Threshold values & calibration

All thresholds are percentile-relative, computed within each clip. θH = 70th percentile (high) · θS = 50th percentile (median / elevated) · θL = 30th percentile (low). Each system's percentile is computed from its own activation timeseries within that clip, so thresholds scale naturally to the content's own range.

Chord | Condition | Min duration
Learning Moment | Attention ≥ θH & Memory ≥ θH | 1s
Emotional Impact | Personal ≥ θH & Gut ≥ θH | 1s
Reasoning Beat | Effort ≥ θH & Language ≥ θH | 1s
Story Integration | Attention ≥ θH & Language ≥ θH & Personal ≥ θS | 2s
Visceral Hit | Gut ≥ θH & Effort ≤ θL | 1s
Cold Cognitive Work | Effort ≥ θH & Personal ≤ θL | 2s
Social Resonance | Social ≥ θH & Personal ≥ θH | 1s

θH = p70 · θS = p50 · θL = p30 — all per-clip, per-system. Chord definitions reviewed by a neuroscience postdoc before v1 shipped. Pattern definitions published in data/pattern-definitions.json in the open-source codebase.

Connectivity mapping

Beyond detecting which chords fire, MindReader computes how the seven dimensions relate to each other across the full video. Pearson correlation is computed between every pair of dimension time series, producing a 7×7 matrix showing which systems rise and fall together and which tend to suppress each other.

Connectivity — correlation between dimension time series
$r(d_i, d_j) = \operatorname{corr}(s_i, s_j)$ for all pairs $i \neq j$
Result: 7×7 symmetric matrix  ·  grounded in Power et al. 2011 and Sporns 2013

What the matrix shows

Positive correlation (blue): two systems fire together in this content. Negative correlation (red): when one is high, the other tends to be low. The diagonal is always 1.0. The aggregate measures — integration score, hub dimension — describe the overall architecture of cortical engagement the content produces. Computed on TRIBEv2's predicted time series, not raw fMRI.
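A minimal sketch of the correlation step, using synthetic dimension time series in place of TRIBEv2 output. The `connectivity_matrix` helper and the column order are assumptions. The aggregate measures (integration score, hub dimension) are not defined in the text and are not computed here.

```python
import numpy as np

DIMENSIONS = [
    "Attention", "Brain Effort", "Gut Reaction", "Language Depth",
    "Memory Encoding", "Personal Resonance", "Social Thinking",
]


def connectivity_matrix(scores: np.ndarray) -> np.ndarray:
    """Pearson correlation between every pair of dimension time series.

    scores: array of shape (seconds, 7), one column per dimension.
    """
    return np.corrcoef(scores, rowvar=False)


# Synthetic 30-second clip with two built-in relationships.
rng = np.random.default_rng(0)
scores = rng.normal(size=(30, len(DIMENSIONS)))
scores[:, 4] = scores[:, 0] + 0.1 * rng.normal(size=30)   # Memory rises with Attention
scores[:, 5] = -scores[:, 1] + 0.1 * rng.normal(size=30)  # Personal falls as Effort rises

r = connectivity_matrix(scores)
print(r.shape)  # (7, 7)
```

In this synthetic clip, the Attention/Memory entry is strongly positive and the Effort/Personal entry strongly negative, matching the co-activation and suppression readings described above.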

[Figure: example connectivity matrix. Rows and columns = the 7 cortical systems; positive values indicate co-activation, negative values indicate suppression (example values).]
[Figure: example chord progression for a 30-second video.]

The 7 dimensions

MindReader reports predicted activation across seven large-scale functional networks, each grounded in the Yeo et al. 2011 parcellation and the primary literature for each region. For each dimension, a binary vertex mask identifies which of the 20,484 cortical locations belong to that network. At each timestep, the mean predicted BOLD across the masked vertices gives one activation value. Masks do not overlap.

Dimension scoring — per second, per dimension
$$\text{score}_d(t) = \frac{1}{|V_d|} \sum_{v \in V_d} P(t, v)$$
where $V_d$ = set of vertices for dimension $d$  ·  $P(t, v)$ = predicted BOLD at vertex $v$, second $t$
# | Dimension | Region | What activates here
01 | Attention | DAN (FEF, IPS, MT+) | Goal-directed, sustained top-down attention. Corbetta & Shulman 2002.
02 | Brain Effort | dlPFC (BA 9/46) | Cognitive control; working memory; effortful processing. Miller & Cohen 2001.
03 | Gut Reaction | Anterior Insula | Interoception; visceral bodily response; the cortical address of felt sensation. Critchley 2005; Craig 2009.
04 | Language Depth | Broca's + Wernicke's | High-level linguistic processing: semantics, syntax, pragmatics. Fedorenko et al. 2011.
05 | Memory Encoding | Left vlPFC (BA 44/45/47) | Subsequent-memory effect: activation here at encoding predicts later recall. Wagner et al. 1998; Kim 2011.
06 | Personal Resonance | mPFC | Self-referential processing; content tagged as personally relevant. Northoff et al. 2006; Falk et al. 2012.
07 | Social Thinking | Right TPJ | Theory of mind; mentalizing about another person's intentions. Saxe & Kanwisher 2003.
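The per-second dimension score described above (the mean predicted value over a dimension's masked vertices) can be sketched on a toy matrix. The 6-vertex matrix and mask are invented for illustration; the real masks span subsets of the 20,484 fsaverage5 vertices.

```python
import numpy as np


def dimension_score(pred: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean predicted value over the masked vertices at each second.

    pred: prediction matrix of shape (seconds, vertices).
    mask: boolean array of shape (vertices,), True for vertices in the dimension.
    """
    return pred[:, mask].mean(axis=1)


# Toy example: 3 seconds, 6 vertices, with the first 3 vertices in the mask.
pred = np.array([
    [1.0, 2.0, 3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 4.0, 4.0, 4.0],
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
])
mask = np.array([True, True, True, False, False, False])

print(dimension_score(pred, mask))  # [2. 0. 2.]
```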

TRIBEv2 covers cortical surface only. Memory Encoding measures left vlPFC — the cortical correlate of the subsequent-memory effect — but long-term storage happens in the hippocampus, which is outside coverage. Gut Reaction uses anterior insula as a proxy for visceral arousal; the amygdala is subcortical and not captured. These are cortical correlates, not complete systems.

What TRIBEv2 cannot do

  • Population average
    All predictions represent the average response across 720+ training participants. Individual brains vary substantially. No subject-specific information about any viewer is captured.
  • Cortical surface only
    The fsaverage5 mesh covers cortical grey matter only. No direct measurement of the hippocampus, amygdala, nucleus accumbens, thalamus, or any other subcortical structure.
  • Trained on film
    Predictions are most reliable for content resembling training data: spoken word over visuals, natural human presence, narrative structure. Pure music, abstract animation, and non-naturalistic content may produce noisier outputs.
  • 1-second resolution
    Temporal resolution is 1Hz. Sub-second events — a 200ms jump cut, a brief audio cue — are integrated into their surrounding second and cannot be resolved independently.
  • 5-second lag
    BOLD responses lag behind neural firing by approximately 5 seconds. All peak labels are at cortical timing — approximately 5 seconds after the corresponding moment in the video.

Files, data, and the codebase

Audio and video files are deleted from our servers immediately after processing completes. The output — waveforms, thumbnails, transcription, cortical predictions — is computed from your file and then the file is gone. TRIBEv2 is a frozen pre-trained model; submitted content does not update it.

MindReader is open source under CC BY-NC, inherited from TRIBEv2's license from Meta FAIR. The pattern definitions, threshold values, dimension vertex masks, and scoring logic are all published and versioned. Researchers and developers are encouraged to fork the codebase, run their own comparisons, and extend the methodology.

References
  1. Allen, E. A., et al. (2014). Tracking whole-brain connectivity dynamics in the resting state. Cerebral Cortex, 24(3), 663–676.
  2. Corbetta, M., & Shulman, G. L. (2002). Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3, 201–215.
  3. Craig, A. D. (2009). How do you feel — now? The anterior insula and human awareness. Nature Reviews Neuroscience, 10(1), 59–70.
  4. Critchley, H. D. (2005). Neural mechanisms of autonomic, affective, and cognitive integration. Journal of Comparative Neurology, 493(1), 154–166.
  5. Critchley, H. D., et al. (2004). Neural systems supporting interoceptive awareness. Nature Neuroscience, 7(2), 189–195.
  6. Damasio, A. (1994). Descartes' Error: Emotion, Reason, and the Human Brain. Putnam.
  7. Falk, E. B., Berkman, E. T., & Lieberman, M. D. (2012). From neural responses to population behavior: neural focus group predicts population-level media effects. Psychological Science, 23(5), 439–445.
  8. Fedorenko, E., Behr, M. K., & Kanwisher, N. (2011). Functional specificity for high-level linguistic processing in the human brain. PNAS, 108(39), 16428–16433.
  9. Fox, M. D., et al. (2005). The human brain is intrinsically organized into dynamic, anticorrelated functional networks. PNAS, 102(27), 9673–9678.
  10. Hutchison, R. M., et al. (2013). Dynamic functional connectivity: promise, issues, and interpretations. NeuroImage, 80, 360–378.
  11. Kim, H. (2011). Neural activity that predicts subsequent memory and forgetting: a meta-analysis of 74 fMRI studies. NeuroImage, 54(3), 2446–2461.
  12. Mar, R. A. (2011). The neural bases of social cognition and story comprehension. Annual Review of Psychology, 62, 103–134.
  13. Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24, 167–202.
  14. Northoff, G., et al. (2006). Self-referential processing in our brain — a meta-analysis of imaging studies on the self. NeuroImage, 31(1), 440–457.
  15. Ochsner, K. N., & Gross, J. J. (2005). The cognitive control of emotion. Trends in Cognitive Sciences, 9(8), 242–249.
  16. Power, J. D., et al. (2011). Functional network organization of the human brain. Neuron, 72(4), 665–678.
  17. Raichle, M. E., et al. (2001). A default mode of brain function. PNAS, 98(2), 676–682.
  18. Saxe, R., & Kanwisher, N. (2003). People thinking about thinking people. NeuroImage, 19(4), 1835–1842.
  19. Saxe, R., & Powell, L. J. (2006). It's the thought that counts. Psychological Science, 17(8), 692–699.
  20. Schäfer, A., & Menninghaus, W. (2025). Aesthetic effects in film: a mega-analysis of 16 datasets with 572 participants. Psychological Science.
  21. Sporns, O. (2013). Network attributes for segregation and integration in the human brain. Current Opinion in Neurobiology, 23(2), 162–171.
  22. Uncapher, M. R., & Wagner, A. D. (2009). Posterior parietal cortex and episodic encoding. Nature Reviews Neuroscience, 10, 613–625.
  23. Wagner, A. D., et al. (1998). Building memories: remembering and forgetting of verbal experiences as predicted by brain activity. Science, 281, 1188–1191.
  24. Yeo, B. T., et al. (2011). The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of Neurophysiology, 106(3), 1125–1165.
  25. Yeshurun, Y., et al. (2017). Memory retrieval during learning determines whether experience will contribute to episodic memory formation. Neuron, 93(4), 953–967.