A shared neural substrate for action verbs and observed actions in human posterior parietal cortex

INTRODUCTION

How do words get their meaning? Although the exact architecture of the semantic system is still under debate, most evidence suggests that meaning emerges from interactions between supramodal association regions that code abstracted symbolic representations and the distributed network of regions that process higher-level aspects of sensory stimuli, motor intentions, valence, and internal body state (1–5). Engagement of the distributed network is taken as evidence that the brain’s representation of the physical manifestation of words is an important component of their meaning. For example, visual coding for the form of a banana, the motor act of biting into or peeling a banana, and its taste and texture would be components of meaning in addition to more symbolic, lexical aspects of meaning such as the “dictionary definition.” Although this view is generally accepted, no single-unit recording evidence has demonstrated a shared neural substrate between processing the meaning of a word and its visuomotor attributes within the distributed network. To date, supporting evidence comes from lesion and functional magnetic resonance imaging (fMRI) studies establishing a rough spatial correspondence between brain areas involved in high-level sensorimotor processing and areas recruited when reading text or performing other behaviors that require access to meaning (1, 6). A lack of direct neural evidence is concerning given that neuroimaging and lesion results have been mixed and cannot establish a shared neural substrate at the level of single neurons (7, 8). Thus, how words get their meaning translates into two immediate questions with regard to single-neuron selectivity: (i) Are words and their sensorimotor representations coded within the same region of cortex? (ii) Is there a link between words and their sensorimotor representations? 
In this paper, linking will refer to the existence of a shared neural substrate with individual neurons exhibiting matching selectivity for both a word and the corresponding visual reality.

To complicate matters, the number of sensorimotor representations that can be described by the same basic concrete word is generally very large (e.g., the visual form of a “banana” depends on ripeness, viewing angle, lighting, and whether it is peeled or sliced), and invariance is very rarely complete in high-level sensorimotor regions [e.g., (9, 10)]. This raises a third question: If the same object is coded in different ways depending on details of presentation, how might a word link to these varied visual representations? Stated more generally: What is the neural architecture that links neuronal responses to silently reading a word and seeing varied visual presentations of what the word signifies? The answer is critical in understanding how sensorimotor representations influence our understanding of words. Do we connect the symbolic representation of a word to an abstracted invariant and, therefore, universal visual representation? To a particular canonical example? Or to the many diverse representations that comprise our varied experiences? The question applies to all concrete words that describe physical reality, including action verbs. In this study, we look at how neural coding for action verbs relates to varied visual representations of corresponding observed actions.

Last, what cognitive phenomena can account for the presence of a link between a word and its visual representation within any experimental paradigm? First, the link may mediate semantic memory, reflecting associations between the word and its visual representations built over a lifetime of experience. In this view, reading words activates sensorimotor representations automatically, and these representations are an intrinsic component of the meaning of the word. Second, reading a word has been hypothesized to evoke mental imagery. Responses in sensorimotor cortex may reflect such imagery, in which case the link would be between visual representations and mental imagery of the same stimuli. Third, the link may be the consequence of short-term learning, such as occurs during categorization (11). Given these multiple possibilities, we address a fourth question: If a link exists, what cognitive process does the link mediate?

To address the above four questions, we recorded populations of neurons from electrode arrays implanted in two tetraplegic individuals (N.S. and E.G.S.) participating in a brain-machine interface clinical trial while the participants viewed videos of manipulative actions or silently read corresponding action verbs. The implants were placed at the anterior portion of the intraparietal sulcus (IPS; see fig. S1 for implant locations), a region that is part of the “action observation network” (AON) composed of the lateral occipital temporal cortex [LOTC; (12)], as well as frontal and parietal motor planning circuits (13, 14). These regions are involved in higher-order processing of observed actions (15–18), and neuroimaging and lesion evidence implicate a role in verb processing (19–25). The ability to perform invasive neural recordings provides us with the first opportunity to probe whether and how language links with corresponding visual representations at the level of single neurons in high-order sensory-motor cortex. Toward this objective, we establish four primary results relating to the four questions outlined above: First, PPC neurons show selectivity for action words and visually observed actions; second, a portion of PPC neurons link action verbs and corresponding visual representations; third, text-selective units in PPC link with all the diverse visual representations found in the neural population; and fourth, the link is not based on imagery or short-term learning and thus appears to be semantic in nature. One possible interpretation is that when reading text, we replay our visual history as part of the process of understanding and thus ground our conceptual understanding in our unique experiences.

RESULTS

Participants viewed videos of five manipulative actions presented in three visual formats (two lateral views differing in body posture and one frontal view) and a fourth format, text, requiring the subject to silently read associated action verbs (see Fig. 1A for example stimuli). Five actions were used: drag, drop, grasp, push, and rotate, for which preliminary experiments (fig. S2) had demonstrated neuronal selectivity. A total of 15 unique videos (5 distinct exemplar actions × 3 visual formats) and 5 written action verbs were presented for a total of 20 experimental conditions (5 actions × 4 formats; fig. S3). Presenting the observed actions in three formats allowed us to tease apart different models of how action verbs associate with overlapping (common to visual formats/exhibiting invariance across all formats) and distinct (idiosyncratic to given formats/not invariant or only invariant across a subset of formats) features of the neural code for observed actions. This design allowed us to answer the first three questions posed in the Introduction. We recorded 1586 units during 18 recording sessions in two subjects (NS: 1432 units, 13 sessions; EGS: 154 units, 5 sessions). For the first seven sessions in participant NS and all sessions for subject EGS, the participants passively watched the action videos and silently read the action verbs. To answer the fourth question, for the final six sessions in subject NS, the participant used the action verb as a prompt to “replay” the associated action video using visual imagery from either frontal (F) or lateral (L0) perspectives, thus allowing us to quantify how imagery affects verb processing. Results from silent reading (first seven sessions) and active imagery (last six sessions) were quantitatively similar in NS, and thus, data were pooled across sessions when addressing the first three questions of this paper.
In addition, for question 4, we present a control study in which abstract symbols are paired with visual imagery of motor actions to better understand the effects of short-term associations.

Fig. 1 Human parietal neurons are selective for observed actions and action verbs.

(A) Example neurons illustrating diverse selectivity patterns (SPs) across formats. Left: Sample still frames depicting stimuli for one of the five action exemplars (“grasp”) in each format (see fig. S3 for all action exemplars). Right: Representative units illustrating diverse neural responses to the five tested actions (color-coded) across the four tested formats. Each panel shows the firing rate (means ± SEM) through time for each action for a single format. Each column illustrates the responses of the same unit to the four formats. See fig. S1 for recording locations. Photo credit: Guy Orban, Department of Medicine and Surgery, Parma University. (B) Percentage of units with significant action selectivity split by format [means ± 95% confidence interval (CI), one-way ANOVA, P < 0.05 FDR-corrected]. Zero units were selective in each format during the 1-s window before stimulus onset (one-way ANOVA, P < 0.05 FDR-corrected). (C) Cross-validated R² of units with significant selectivity [units significant in (B)] split by format (means ± 95% CI). (D) Sliding-window within-format classification accuracy for manipulative actions. Sliding window = overlapping 300-ms windows with 10-ms increments. Classification applied to data pooled across sessions. Black horizontal dashed line = chance classification performance. Blue horizontal dashed line = 97.5th percentile of prestimulus classification accuracy for the text condition. Horizontal colored bars indicate time of significant classification. Inset displays color code for format and associated latency estimate for onset of significant decoding (see fig. S7).

Are human posterior parietal cortex (PPC) neurons selective for observed actions and action verbs?

Figure 1A shows the responses of five representative neurons, illustrating the variety of selectivity for both observed actions and action verbs at the level of individual neurons. Within a format, we defined units as selective if there were significant differences in neural responses to the five action identities [one-way ANOVA, P < 0.05, false discovery rate (FDR) corrected]. The percentage of cells demonstrating selective responses was significant for each format, for both subjects [χ² for text format, the format with the fewest selective units: NS: (1,N = 1432) = 503, P < 0.001; EGS: (1,N = 154) = 5.3, P = 0.02]. However, the percentage of selective units, as well as the consistency of the response, as measured by the cross-validated coefficient of determination (cvR²), was smaller for text than for observed actions (Fig. 1, B and C). In addition, population classification analysis equating experimental sessions and number of units confirmed greater selectivity for participant NS than participant EGS (fig. S4). All five actions evoked significant neural responses from baseline across the four formats (fig. S5). The majority of visually selective units increased their firing during the video presentations, as in nonhuman primate anterior intraparietal area (AIP) (18). A minority, however, were suppressed by the video and text presentations (fig. S5). The mean response strength decreased smoothly from the action evoking the maximal response to the weakest response. Individual units could show steep or more graded selectivity, and this pattern was essentially identical across formats (fig. S6). Greater selectivity for action videos relative to text was reflected in a time-resolved decoding analysis (Fig. 1D).
Defining the latency of action selectivity as the onset of significant classification accuracy revealed shorter latencies for the visual formats (windows starting at 155 to 205 ms depending on format) than the written word (305 ms), possibly reflecting differences in afferent pathways (fig. S7). Our results show that all formats were encoded within the population, but with greater selectivity and shorter latency for videos relative to text.
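The per-format selectivity counts rest on correcting one P value per unit for multiple comparisons. The text says only "FDR-corrected", so the sketch below assumes the Benjamini-Hochberg step-up procedure; the function name and the P values are hypothetical and serve only to illustrate the bookkeeping.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per unit: True if it survives FDR correction.

    Assumption: Benjamini-Hochberg step-up procedure (the paper does not
    name the specific FDR method).
    """
    m = len(p_values)
    # Sort units by P value while remembering each unit's original index.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every unit ranked at or below the cutoff is declared selective.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Hypothetical per-unit ANOVA P values for one format (not real data).
unit_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                 0.06, 0.074, 0.205, 0.212, 0.36]
selective = benjamini_hochberg(unit_p_values)
percent_selective = 100.0 * sum(selective) / len(selective)
print(percent_selective)
```

Note that three of the toy P values fall below 0.05 yet do not survive correction, which is why uncorrected and FDR-corrected percentages of selective units can differ.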

Is there a link between neural representations of action verbs and observed actions in human PPC?

Having established that PPC neurons are selective for both action verbs and observed actions, we now ask whether there exists a shared neural substrate, with neurons exhibiting matching selectivity for both a word and the corresponding visual representation. We addressed this by using two population analyses: across-format classification and across-format correlation. Leave-one-out cross-validation was used to train a classifier to predict action identity within format. On each fold, the decoder was also used to predict action identity from the three additional formats. This across-format generalization analysis measures how well the neural population structure that defines action identity in one format generalizes to other formats (Fig. 2, A and B). As a control, the same values can be computed when shuffling action identity between formats [shuffled accuracy; red in Fig. 2 (A and B)]. Across-format accuracy exceeded both chance and shuffled accuracy for all pairs of formats for NS, for all visual pairs for EGS, and for the text-visual format pairs of EGS when pooling across visual formats to achieve adequate power (rank-sum test, P < 0.05). This result demonstrates that the neuronal representation was not random; the population is more likely to link representations across formats for the same action identities. However, the results also demonstrate that the generalization is not perfect: The across-format accuracy is lower than the within-format accuracy, suggesting that the neural code for action identity also depends on details of presentation. The strength of generalization was format dependent: near perfect across body postures (same lateral view), still high but reduced across shifts in viewing perspective (across the lateral and frontal views), and lowest when comparing observed actions with the written verb.
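The across-format generalization analysis can be sketched as follows. The text does not state the classifier type, so a nearest-centroid decoder is assumed here purely for illustration. The firing rates, the number of units, and the fixed cyclic relabeling (a stand-in for the shuffled control) are all hypothetical.

```python
import math

ACTIONS = ["drag", "drop", "grasp", "push", "rotate"]
# Hypothetical mean firing rates (Hz) of four units for each action.
PROTOTYPES = {
    "drag": [8, 2, 1, 3],
    "drop": [2, 9, 3, 1],
    "grasp": [1, 3, 8, 2],
    "push": [3, 1, 2, 9],
    "rotate": [6, 6, 1, 1],
}
N_TRIALS = 4


def make_format(gain, offset):
    """Toy trials for one format: action -> list of population vectors."""
    return {
        a: [[gain * r + offset + 0.1 * ((t + u) % 3 - 1)  # small fixed jitter
             for u, r in enumerate(PROTOTYPES[a])]
            for t in range(N_TRIALS)]
        for a in ACTIONS
    }


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def predict(x, centroids):
    """Nearest-centroid decoder (an assumed classifier, not the paper's)."""
    return min(centroids, key=lambda a: math.dist(x, centroids[a]))


def generalization(train_fmt, test_fmt, relabel=None):
    """Return (within-format accuracy, across-format accuracy).

    Each fold holds out one trial per action from the training format,
    decodes it (within), and decodes every test-format trial (across).
    `relabel` maps true test labels to mismatched labels for the control.
    """
    relabel = relabel or {a: a for a in ACTIONS}
    within_hits = within_n = across_hits = across_n = 0
    for held in range(N_TRIALS):
        cents = {
            a: centroid([v for t, v in enumerate(train_fmt[a]) if t != held])
            for a in ACTIONS
        }
        for a in ACTIONS:
            within_hits += predict(train_fmt[a][held], cents) == a
            within_n += 1
            for v in test_fmt[a]:
                across_hits += predict(v, cents) == relabel[a]
                across_n += 1
    return within_hits / within_n, across_hits / across_n


lateral = make_format(gain=1.0, offset=0.0)  # stands in for a video format
text = make_format(gain=0.5, offset=1.0)     # weaker response, same tuning
within_acc, across_acc = generalization(lateral, text)

# Control: score predictions against cyclically shifted (mismatched) labels.
shifted = {a: ACTIONS[(i + 1) % len(ACTIONS)] for i, a in enumerate(ACTIONS)}
_, mismatched_acc = generalization(lateral, text, relabel=shifted)
print(within_acc, across_acc, mismatched_acc)
```

In this toy case the text format shares the tuning of the video format exactly, so generalization is perfect; in the recorded data, across-format accuracy fell below within-format accuracy.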

Fig. 2 Action verbs link with observed actions.

(A) Across-format and within-format classification of manipulative actions. x-axis labels indicate the formats used for classifier training and testing (e.g., for across format, train→test). Dots = single-session result. Rectangle = 95% bootstrapped CI over sessions. Gray (red): values for matched (mismatched) labels across formats (see inset for definitions). Dashed horizontal lines show within-format cross-validated accuracy (mean across single-session results). All comparisons with chance performance (dashed line) or shuffled alignment reached significance (Wilcoxon rank-sum test, P < 0.05). (B) Similar to (A) but for EGS. Cross-format classification significant between all visual formats and between visual and text formats when pooling visual formats (see bar with asterisk). (C) Correlation of neural population responses across pairs of formats. Conventions as in (A). (D) Same as (C) for participant EGS (black horizontal bar indicates data that were pooled for statistical testing). (E) Pairwise population correlation while controlling for additional formats using partial correlation. Resulting correlations are above chance (part corr = 0) but below standard correlation values (mean = red diamonds). (F) Same as (E) for participant EGS.

Significant generalization of action representations across formats was robust to the analysis technique. We correlated neural population responses across formats (Fig. 2, C and D). Population responses were constructed by concatenating the mean response of all units to each action within format (fig. S8). A significant positive correlation was found for all format pairs while no significant positive correlation was found when shuffling action identity between formats. One caveat to interpretation is that the correlation between any pair of formats may be the consequence of the two formats being correlated with a third format. A significant link between pairs of formats was preserved but somewhat reduced when controlling for the other formats using a partial correlation analysis (Fig. 2, E and F). This last result indicates that text links with each of the visual formats directly as the significant link is preserved when the possible mediating factors of the other formats are removed.
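The logic of the partial correlation control can be illustrated with a small worked example. The sketch below uses the standard recursive formula for partial correlation. The three vectors are toy population responses constructed so that two formats share one component with a third format and one component only with each other; none of the values are data from the study.

```python
import math


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, controls):
    """Correlation of x and y after removing linear effects of `controls`.

    Uses the recursive formula
    r_xy.zW = (r_xy.W - r_xz.W * r_yz.W) / sqrt((1 - r_xz.W^2)(1 - r_yz.W^2)).
    """
    if not controls:
        return pearson(x, y)
    z, rest = controls[0], controls[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy population vectors (hypothetical values, one entry per unit-action pair).
# `lateral` contributes a component shared by all three formats;
# `text` and `frontal` additionally share a component absent from `lateral`.
lateral = [1, -1, 1, -1, 0, 0, 0, 0, 0, 0]
text = [1, -1, 1, -1, 1, -1, 0, 0, 1, -1]
frontal = [1, -1, 1, -1, 0, 0, 1, -1, 1, -1]

standard = pearson(text, frontal)
controlled = partial_corr(text, frontal, [lateral])
print(round(standard, 3), round(controlled, 3))
```

Here the standard correlation is 0.75 and the partial correlation is 0.5: reduced, because part of the similarity was carried by the third format, but still positive, because a direct shared component remains. Passing more than one vector in `controls` removes the influence of several formats at once.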

The preceding population analyses established that text and visual representations are linked pairwise at the level of the population, but the link does not perfectly generalize across formats. What is the breakdown of the single units that compose the population results? To answer this question, we compared the precise selectivity pattern (SP; defined as the firing rate values for each of the five actions) across pairs of formats using a model selection analysis for each neuron. A linear tuning model can describe the four possible ways that the SP can compare across two formats (Fig. 3A). (i) Both formats are selective in a similar manner (Fig. 3A; matched selectivity); the linear parameters (α ∈ R⁵) for each of the five actions are constrained to be identical for the two formats. (ii) Both formats are selective but with mismatched patterns (Fig. 3A; mismatched selectivity); the linear parameters (α, γ ∈ R⁵) are different between the two formats. (iii and iv) Last, only one of the two formats may be selective (Fig. 3A; single format 1 or format 2 selective); a constant scalar offset term is used for the nonselective format (scalar term not shown in equation for simplicity). We identified the model that best described the neuronal behavior using both the Bayesian information criterion (BIC) and the cvR². We found that the two measures provide complementary perspectives when comparing across formats (fig. S9). In summarizing the results, we used the average percentages provided by both measures. In line with our population results, we found that the percentage of cells with a similar SP across formats (Fig. 3, B and C, red) was format dependent, being greatest across body postures (same lateral view), slightly reduced across shifts in viewing perspective (across the lateral and frontal views), and lowest when comparing observed actions with the written verb.
These results indicate not only that text links with the visual formats and the visual formats link with each other but also that a percentage of the population codes the same action identities in different formats with differing patterns of selectivity.
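A minimal sketch of the BIC half of this model selection, for one unit and one pair of formats, is given below. It assumes Gaussian noise, so that BIC = n ln(RSS/n) + k ln(n), and fits each model by taking means; the exact likelihood and fitting procedure used in the study are not stated in this section. The cvR² criterion is omitted, and all firing rates are toy values.

```python
import math

N_ACTIONS = 5


def mean(values):
    return sum(values) / len(values)


def rss_per_action(trials):
    """Residual sum of squares when each action has its own mean rate."""
    return sum((r - mean(rates)) ** 2 for rates in trials for r in rates)


def rss_constant(trials):
    """Residual sum of squares when one constant describes all actions."""
    flat = [r for rates in trials for r in rates]
    m = mean(flat)
    return sum((r - m) ** 2 for r in flat)


def bic(rss, n, k):
    # Assumption: Gaussian residuals, so BIC = n * ln(RSS / n) + k * ln(n).
    return n * math.log(rss / n) + k * math.log(n)


def select_model(fmt1, fmt2):
    """fmt[a] = list of single-trial rates for action a. Returns best model."""
    n = sum(len(r) for r in fmt1) + sum(len(r) for r in fmt2)
    pooled = [a1 + a2 for a1, a2 in zip(fmt1, fmt2)]
    scores = {
        # (i) one shared set of five parameters for both formats
        "matched": bic(rss_per_action(pooled), n, N_ACTIONS),
        # (ii) separate sets of five parameters per format
        "mismatched": bic(rss_per_action(fmt1) + rss_per_action(fmt2),
                          n, 2 * N_ACTIONS),
        # (iii, iv) five parameters for one format, a constant for the other
        "format 1 only": bic(rss_per_action(fmt1) + rss_constant(fmt2),
                             n, N_ACTIONS + 1),
        "format 2 only": bic(rss_constant(fmt1) + rss_per_action(fmt2),
                             n, N_ACTIONS + 1),
    }
    return min(scores, key=scores.get)


def toy_unit(action_means):
    """Five toy trials per action with a fixed spread (hypothetical data)."""
    return [[m + d for d in (-1.0, -0.5, 0.0, 0.5, 1.0)] for m in action_means]


video = toy_unit([2, 5, 9, 4, 7])
same_tuning = select_model(video, toy_unit([2, 5, 9, 4, 7]))
other_tuning = select_model(video, toy_unit([9, 4, 2, 7, 5]))
untuned = select_model(video, toy_unit([5, 5, 5, 5, 5]))
print(same_tuning, other_tuning, untuned)
```

The penalty term k ln(n) is what lets the matched model win when both formats have the same tuning: the mismatched model fits equally well there but spends five extra parameters.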

Fig. 3 Single-neuron SPs link action verbs and observed actions.

(A) Schematic illustrating the four possible ways the SP can compare across two formats (see fig. S9 for expanded description). (B) Summary of SPs across pairs of formats for participant NS (see fig. S9). Red = matched SP; gray = mismatched SP; cyan and light green = selectivity for a single format only [see title colors in (A)]. Photo credit: Guy Orban, Department of Medicine and Surgery, Parma University. (C) Same as (B) for participant EGS. “=” indicates matched SP, and “&” denotes mismatched SP.

What is the architecture that links observed actions and action verbs?

The preceding section demonstrated that there is a neural link between action verbs and visually observed actions. Here, we seek to understand the architecture of this link: to characterize how text-selective units link with the varied visual presentations of the same action. As a prerequisite, we first characterized how the different visual presentations were encoded with respect to each other, ignoring the text format. Just as neural SPs can compare across two formats in four different ways (Fig. 3A), they can compare across three formats in 14 possible ways (see Fig. 4A, x-axis labels and examples). As above, a model selection analysis was used to categorize each unit based on the model that best described the SPs across the visual formats (Fig. 4A). The population was heterogeneous, characterized by units with matched SPs and mismatched SPs in varied combinations across the different visual formats. This diversity can be seen in the individual unit examples of Fig. 1A; units 1 and 2 show matching patterns of selectivity across all the visual formats (Fig. 4A, L0=L1=F), unit 3 shows matching selectivity across two of the visual formats and no selectivity in the third (Fig. 4A, L0=L1), and unit 4 shows matching selectivity between two formats and mismatching selectivity in the third (L0=L1&F). Thus, we find that presentation details affect neural coding for action identity and that individual units link action identity across formats in an assortment of ways when considering all three of the visual formats at once. This result is consistent with the significant but incomplete generalization of action identities across the visual formats shown in Figs. 2 and 3.
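The counts of 4 and 14 follow from a simple enumeration: choose which formats are selective (at least one), then split the selective formats into groups that share an SP. The sketch below reproduces the counts and generates labels in the "=" and "&" notation of Fig. 4A; the function names are illustrative and not taken from the study's analysis code.

```python
from itertools import combinations


def set_partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` in a group of its own ...
        yield [[first]] + partition
        # ... or add it to each existing group in turn.
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]


def selectivity_patterns(formats):
    """List all SP comparisons: '=' joins matched formats, '&' mismatched.

    Formats missing from a label are not selective.
    """
    labels = []
    for size in range(1, len(formats) + 1):
        for selective in combinations(formats, size):
            for partition in set_partitions(list(selective)):
                labels.append("&".join("=".join(group) for group in partition))
    return labels


two_formats = selectivity_patterns(["format1", "format2"])
three_formats = selectivity_patterns(["L0", "L1", "F"])
print(len(two_formats), len(three_formats))
print(three_formats)
```

For three formats, the total is 3 (one selective format) + 3 × 2 (two selective formats, matched or mismatched) + 5 (three selective formats, grouped in five ways) = 14.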

Fig. 4 Text links with all available visually selective cells.

(A) Histogram characterizing how the population of neurons link action representations across the three visual formats (F,L0,L1). “=” indicates matched SP, and “&” denotes mismatched SP. Exclusion of a format indicates no selectivity. Three schematic SPs (right, color-coded) across the visual formats are shown to illustrate how the SPs compare across formats. (B) Schematic models illustrating different architectures of how text relates to three visual representations of the corresponding action. Each oval contains the population of neurons that are selective for a particular visual format. Overlap between ovals indicates matching selectivity across formats. The possible patterns of overlap between ovals may be more complicated (e.g., more overlap between two of the three ovals) but is simplified here for schematic purposes. Yellow neurons are selective for text with matching selectivity, while gray neurons are not. Underneath each schematic is a prediction for how the distribution in (A) will change when the model selection analysis filters the full distribution of (A) for units with matching text selectivity. (C) Similar to (A), however, the histogram is limited to the subset of visually selective units with a matched SP to text [blue subpopulation in (D)]. In cases where the units have mismatched visual SPs (e.g., L0 & F), text can have a matched SP with one of several of the visual formats. Colored segments of histogram indicate which format has matched SP with text (see x-axis labels for color code). (D) Percentage of visually selective units with a matched SP to text. (E) Percentage of text-selective units with a matched SP to at least one visual format, mismatched SP to visual formats, or without visual format selectivity.

Having established that the same action is coded in different ways depending on details of visual presentation, we can now look at how action verbs link to these varied visual representations. We can frame our question in the following way: Do action verbs link with the entire population of cells demonstrating visual selectivity or specific subpopulations of cells? Figure 4B illustrates these possibilities. Two primary theoretical possibilities in the literature describe how text can link with subpopulations of visually selective neurons. Overlapping describes the architecture in which verbs link specifically with the subpopulation of neurons that are invariant across the visual formats (5). Exemplar describes the architecture in which verbs link with a specific prototypical exemplar or “best example” of the word (5). The exemplar may be of a single visual presentation or some subset of presentations. Last, we term the situation in which text links with all visually selective cells as Available. In this architecture, the link between text and the visual representations mirrors the statistics for how the visual representations are encoded within the neural population independent of text. Underneath each schematic, we provide a prediction for how the distribution of Fig. 4A should change when the model selection analysis accounts for how text links with the visual formats.

We extended the model selection analysis to categorize each unit based on the model that best described the SPs across all four formats (text + all visual formats). We compared the distribution of the visually selective units with a matched SP to text (Fig. 4C) to the full distribution of the visually selective units (Fig. 4A). The distribution was essentially unchanged; the subset of visually selective units that link with text reflects a random sampling of the visually selective units: A bootstrapped correlation analysis comparing the empirical distribution of Fig. 4C with the predictions of Fig. 4B shows that the population best matches the Available model (correlation with invariant = 0.32, exemplar = 0.48, available = 0.97). This provides the answer to the question of architecture: The distribution of text-linked units (Fig. 4C) mirrors the statistics of how visual formats are encoded independent of text, or, in other words, text forms links with all available visual representations. Units with a matching SP between text and at least one visual format (the distribution of 4C) represent 23% of all visually selective units (Fig. 4D) and 40% of all text-selective units (Fig. 4E).

What cognitive process does the link between action verbs and observed actions mediate?

Does the link between text and the visual formats reflect a semantic association, visual imagery, or short-term learned associations that formed through the course of the experiment? Thus far, our analyses are based on averaging the neural response across the video duration. This large temporal window may encompass multiple cognitive processes. If neural processing for action verbs specifically reflects bottom-up semantic processing, we would expect to find a shared neural response between formats very soon after stimulus presentation. To address this issue, we performed a dynamic, sliding-window, cross-validated correlation analysis to look at how the relationship within and across formats evolves in time (Fig. 5, A and B). To understand how quickly the correlation between text and the visual formats emerges, the diagonal elements of the dynamic correlation matrices were extracted and plotted together for direct comparison in the inset panels of Fig. 5 (A and B). These results show that the cross-modal link between text and the visual formats is fast: The onset of the cross-format correlation between text and the visual formats is the same as the within-format text correlation. In other words, as soon as a neural response to text emerges, it immediately shares a common activation pattern with the observed actions.
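The onset comparison can be sketched by correlating population responses window by window and reading off the first window in which the correlation exceeds a threshold. The sketch below is a simplification: each list entry stands for one time window, the within-format text correlation is computed between two hypothetical halves of the text trials, and the 0.5 threshold is arbitrary. The toy responses are built so that the text response carries the visual tuning pattern from its first responsive window, which is the outcome the analysis is designed to detect.

```python
import math


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def diagonal_correlation(series_a, series_b):
    """Correlate two population-response time series window by window."""
    return [pearson(a, b) for a, b in zip(series_a, series_b)]


def onset_window(correlations, threshold=0.5):
    """Index of the first window whose correlation exceeds the threshold."""
    for i, r in enumerate(correlations):
        if r > threshold:
            return i
    return None


# Hypothetical tuning pattern and mutually uncorrelated background patterns.
PATTERN = [3, -3, 2, -2, 1, -1]
NOISE_TEXT_A = [1, 1, -1, -1, 0, 0]   # first half of text trials
NOISE_TEXT_B = [1, -1, -1, 1, -1, 1]  # second half of text trials
NOISE_VIDEO = [1, 1, 1, 1, -2, -2]


def response(noise, gains):
    """One population vector per window: background + gain * tuning pattern."""
    return [[n + g * p for n, p in zip(noise, PATTERN)] for g in gains]


text_gain = [0, 0, 0, 1, 1, 1]   # text response starts at window 3
video_gain = [0, 0, 1, 1, 1, 1]  # video response starts at window 2
text_a = response(NOISE_TEXT_A, text_gain)
text_b = response(NOISE_TEXT_B, text_gain)
video = response(NOISE_VIDEO, video_gain)

within_text = diagonal_correlation(text_a, text_b)
text_to_video = diagonal_correlation(text_a, video)
print(onset_window(within_text), onset_window(text_to_video))
```

If the shared pattern instead arose later than the text response itself, for example through imagery, the across-format onset would lag the within-format text onset.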

Fig. 5 Temporal features support a semantic link between verbs and observed actions.

(A and B) Cross-modal match between text and visual formats occurs at low latency. (A) Dynamic cross-validated cross-correlation matrices demonstrating how the neural population response during stimulus presentation at one slice of time compares to all other slices of time, both within and across formats. Format comparisons as shown in x– and y-axis labels. Correlation magnitude as indicated by the color bar. Inset: The diagonal elements of the within- and across-format matrices were averaged into three logical groupings [(i) within-format visual, (ii) within-format text, and (iii) across-format text to visual] and normalized to a peak amplitude of 1 for comparison purposes. The temporal profile of the averaged correlations (means ± SE across sessions) is plotted to emphasize the similarity of onset timing for the within-format text and across-format text to visual population correlations. (B) Similar to (A) but for participant EGS. To compensate for the smaller number of sessions, we grouped correlation matrices for cross-modal comparisons. (C and D) Stable relationship between text and observed actions through experimental sessions. (C) Cross-format correlations for subject NS shown for text and the visual formats on a per-session basis (mean with 95% bootstrapped CI). Color code shows whether the subject was passively viewing stimuli or asked to actively imagine from the lateral or frontal perspective (see inset; Vis F = visualize from frontal perspective; Vis L = visualize from the lateral 0 perspective). (D) Same as (C) except for participant EGS (only silent reading).

Next, we checked whether the strength of population correlation changed over the course of the experiment. If neural processing for action verbs reflects a semantic association, we would expect to find the correlation between text and videos to be present from the first session throughout the course of the experiment. In contrast, if the correlation between text and action videos is a product of learned associations that developed over the course of the study, we would predict that the strength of correlation would increase over the course of repeated exposure to the action videos and text. We found that the early correlation response (cross-validated correlation over the first second of video presentation) between text and the three visual representations for each session did not depend on session number (Fig. 5, C and D), favoring the semantic interpretation.

We performed a number of control analyses and manipulations to address the possibility that associations between text and observed actions reflect visual imagery. In six sessions, participant NS was instructed to use visual imagery to “replay” the associated action video in her mind from either the front (F) or side (L0) perspectives when given the action verb prompt. If imagery were a dominant factor in establishing the link between text and observed actions, the explicit manipulation of visualizing from the F or L0 perspective should bias the percentage of cells with a matched SP in favor of F or L0. However, both the total number of significant units and the population-coding structure were essentially unaffected by the explicit task instruction. Neither the proportion of significant units [Fig. 6A, χ²(1,1432) = 2.7, P = 0.1] nor the proportions of the best explanatory models [Fig. 6B, χ²(1,1432) = 1.9, P = 0.58] demonstrated significant differences. Further, a comparison of the per-session population correlation did not show a significant effect of the instruction (Fig. 5C, Wilcoxon rank-sum test, P = 0.43). This result shows that the basic link between action verbs and observed actions is not dependent on the contents of visual imagery. To probe this result further and ensure the subject followed task instructions, we split the dynamic correlation analysis between the passive and active imagery sessions. We found (Fig. 6, C to E) that correlation immediately following stimulus presentation was largely unaffected by the behavioral manipulation, while correlation near the end or after stimulus presentation did show significant differences (paired t test, P < 0.05 on pixel values split between passive and imagery sessions). This result suggests that the subject followed task instructions and that imagery can affect neural responses, but the early responses (that are the hallmark of automatic semantic processing) are independent of the contents of imagery.

Fig. 6 The effect of explicit instruction on cross-format invariance.

During the initial seven sessions, subject NS silently read action verbs. In the six subsequent runs, she explicitly visualized the frontal (F, three runs) or lateral standing (L0, three runs) perspective in response to the action verb. (A) The percentage of units with a significant effect of action or action-format interaction for the format by action ANOVA applied to the triplet of formats pertinent to task instruction (T,F,L0). “Sig” = significant at P < 0.05 FDR-corrected (“NS” otherwise). Results are split by the task instruction. Total number of sorted units shown in title. (B) Results for the combined (BIC + cvR²) model selection analyses for the same triplet of formats split by task instruction. The percentage of T=L units was twice as prevalent as T=F units for passive viewing, as well as the two instructed conditions. (C) Mean dynamic cross-correlation between the visual formats and text split by passive viewing and active imagery in participant NS. Blue lines indicate video offset. (D) Pixel coordinates demonstrating a significant difference between passive viewing and active imagery (significant pixels in white, paired t test, P < 0.05). Blue lines indicate video offset. (E) Cross-correlation value between text and the visual formats for the set of significant pixels shown in (D) as a function of session number. The blue line shows split between passive and active imagery sessions.

In a final control, we collected a dataset in which four abstract symbols (snowflakes; fig. S10A) were paired with visual imagery of movements for over 2 months (31 recording sessions, 114 ± 11 units per session; fig. S10B). In this paradigm, subject NS was asked to visualize a movement from the first-person perspective when presented with a symbol. The subject learned this task well, as we could accurately decode the different symbols when the subject was instructed to use visual imagery (fig. S10C). We also asked the subject to passively view the same stimuli at sporadic intervals (fig. S10B, vertical orange lines) and found that the ability to decode the different symbols disappeared (fig. S10, D and E). The differences between passive viewing and active imagery when cued with experimentally defined abstract symbols in the control task provide a stark contrast to the differences between passive viewing and active imagery when viewing action verbs in the main experiment (fig. S10, D to G). These differences help to clarify several points about the main experiment. First, the clear differences in classification accuracy between passive viewing and imagery in the control task demonstrate that the subject is capable of comprehending and following task instructions as they relate to passive viewing versus active visual imagery, the two tasks used in the main experiment. Second, the control shows that not all types of visually distinct stimuli elicit a differential neural response under passive viewing. Last, it demonstrates that the recorded population does not form automatic neural responses to arbitrary abstract symbols, even when the different symbols have been learned and are of direct behavioral relevance.

DISCUSSION

Our results answer the four questions raised in the introduction: PPC neurons exhibit selectivity for action verbs and observed actions; text links to visual representations of observed action; text links with a fraction of all available visual representations; and the link is most consistent with being semantic in nature and not due to imagery or learned associations.

Answers to the four questions

First: Selectivity. Both single-cell properties and within-format decoding demonstrate neuronal selectivity for action verbs and observed actions in human PPC. The visual selectivity had short latencies (about 150 ms), while text selectivity emerged nearly 150 ms later. The features of the visual stimuli that determined neural selectivity remain unclear. The term selectivity for action identity should be interpreted as a label assigned to the visual stimuli rather than coding for the basic-level type of action, e.g., “grasp.” Manipulations of viewpoint or fixation point (fig. S2) changed neural coding significantly. Manipulative actions can differ in hand and arm postures, contact points with the object, and dynamics, among others; these parameters should affect neural coding to represent the behavioral complexity of natural actions. Elaborating the exact degree to which neural coding is influenced by action identity, its many parameters, or even low-level visual features needs further work. Nonetheless, the link between action verbs and observed actions demonstrates that coding of action identity cannot solely be driven by irrelevant visual features. Further, not all visual differences are encoded by the neural population (fig. S2). Last, high-dimensional coding of both category-relevant and -irrelevant visual features is consistent with neural coding in high-level regions of the ventral visual stream (26, 27).

Second: Action verbs and observed actions share a common neural substrate. We demonstrate the shared substrate at the population level using cross-format decoding and population correlation between formats (Fig. 2) and show the basis of this population link by modeling single-cell selectivity across pairs of formats (Fig. 3). Prior neuroimaging evidence indicates a degree of anatomical overlap within the AON for processing observed actions and language (19–22). However, imaging evidence can be inconsistent (8), and gross anatomical overlap seen in neuroimaging does not directly imply that the same neural populations support both tasks (7). Our results provide definitive evidence for a shared neural substrate by demonstrating that the precise SPs for action verbs match the SPs for corresponding observed actions at the neural unit level.

Third: Architecture. We have established that, at the neural level, action verbs link with visually observed actions, suggesting that sensorimotor representations are an intrinsic component of verb meaning. The potential implications of this finding are hard to pin down without understanding the architecture of the link. There are infinite visual stimuli that could be considered a “grasp” or a “banana” or any basic category colloquially used to describe an object or action. Our results establish that neural coding for observed actions depends on presentation details (see Fig. 4), consistent with findings throughout cortex (e.g., 9, 10). Given the diversity of neural coding, there are three likely architectures (Fig. 4B), each with its own implications for how linking is made between symbolic and visuomotor representations. Text could link exclusively to the subpopulation of cells that are visually invariant across the different visual presentations (e.g., Fig. 4B, “visually invariant”). In such a case, the aspect of “meaning” conveyed by the sensory-motor representation would be what is universal or common to all presentations. In other words, sensory-motor meaning abstracts away the details of any particular representation. Another possibility is that text could link to one or a subset of example stimuli (e.g., Fig. 4B, “exemplar”). In such a case, the aspect of “meaning” would constitute representative visual examples of the word. The third possibility is that text links to all available visual representations (e.g., Fig. 4B, “available”). If the visual representation reflects the consolidation of one’s experiential history with observed actions (4, 28) as expected for the consolidation of semantic memory, then neural responses to text may be understood as the activation of this consolidated visual experience. This suggests that a word’s meaning is uniquely rooted in an individual’s experience.

The comparison between the predictions for these three models and the data strongly favors the available model. This architecture is also the easiest to implement, as a simple Hebbian mechanism will suffice and would predict that acquisition of verb meaning depends on the frequency of exposure, which has been observed for several languages (29). The text response links only to a subset of the full distribution of visually selective units (Fig. 4D). The reason for this is unclear but may reflect inefficiencies in the neural process that links verbs with visual representations and may be influenced by exposure or experience. In any case, reading a word does not evoke the same perceptual experience as viewing an action, and thus, substantial differences at the level of neural responses should be expected.

Fourth: Origin of the link. Our results indicate that the link between text and the visual formats is not the product of imagery or learned associations that emerge from the task. Our results are consistent with action verbs automatically eliciting a memory or visual/multisensory representation of an action. This could be considered a form of imagery; however, here, we use imagery to specifically refer to the effortful covert internal simulation of a movement (either of one’s own body or another’s body) such as might occur when a participant is explicitly asked to imagine a movement. A primary distinction between semantic memory and imagery as defined here is that semantic responses are automatic; when the action verb is read, corresponding representations in PPC are activated without conscious effort or task dependence. In the control task, passive viewing of symbols associated with actions is shown to be an ineffective stimulus to drive the neural population. There is no automatic response. Neural responses are task dependent and only found when the participant actively imagines the actions that have been associated with each cue. This is contrasted with responses to action verbs in which selectivity is found under silent reading with minimal impact from experimental manipulation of imagery. The action verbs are processed automatically, requiring nothing beyond reading to generate action-specific neural responses. Semantic processing should be fast and automatic, and we found that it is exactly the early component of the correlation that was unaffected by the imagery manipulation (Fig. 6D). In contrast, the late components of the correlation systematically differentiated passive viewing and imagery sessions (Fig. 6, C to E), demonstrating that the participant followed the instructions, a view also supported by the control experiment. From these considerations, the shared neural substrate of text and observed actions is unlikely to reflect imagery.
A key signature of learned associations is the gradual strengthening of the link between text and observed actions. Yet, the correlation between text and the visual formats was stable across all testing sessions, including the very first one (Fig. 5). In our control experiment (fig. S10), passively viewing abstract symbols that had been paired with movement imagery did not induce selective neural responses. Thus, our results are unlikely to be the consequence of learned associations.

We consider two further alternatives to semantic processing. The first is that neural responses to action verbs and observed actions represent implicit automatic motor plans (30). How we plan and execute an action is an important component of meaning. However, our control study (fig. S10) revealed no selectivity to passive observation of movement-predictive cues. The second possibility is that the linked SPs for observed actions and action verbs reflect a population of cells that are responsive to the internal act of silent naming (i.e., generating the action verb). In this view, when viewing text or videos, the participant covertly generates the same word and thus produces the same activity patterns. This hypothesis would predict results similar to the invariant hypothesis, as generating the action verb should be consistent across the different visual presentations. Instead, for simultaneously recorded neural populations, we find that text responses link to the visual formats in idiosyncratic ways (e.g., text and the lateral views, but not the frontal; Figs. 1A and 4C). It remains possible that neurons are selective to particular cue-naming pairs; however, in our prior work (31) in the same participants, we found no selectivity for specific cue-intention pairings (e.g., response for imagined movement to the right when cued with a spatial target, but not when cued with a symbol). Thus, we think that the naming hypothesis is unlikely.

From the above considerations, we believe that our results are most compatible with the shared neural substrate mediating semantic memory, reflecting associations between the word and its visuomotor representations that have been built over years of experience. In this view, reading words automatically activates sensorimotor representations, and these representations are in a position to color our understanding of word meaning without our conscious effort.

Nature of neuronal representation of variables coded in PPC

The ability of small neuronal populations to encode many variables is consistent with the mixed-selectivity scheme in which distributed, nonlinear, high-dimensional representations code in a contextually dependent manner (32). However, at least within the cortical locations explored in the current study, we find that such encoding is not random, but systematically organized around stimulus properties, a scheme referred to as partially mixed selectivity (33). Neural populations coding the same basic-level action exemplar for different formats overlapped (e.g., Fig. 3 and fig. S9). Partially mixed selectivity may represent a general structure for representing sensorimotor aspects of meaning within association cortices, resulting in rich links between text and the diversity of overlapping and distinct components of the visual formats that mirror the statistics of visual encoding independent of text (Fig. 4). It is unclear whether neural overlap reported for observed and performed actions in nonhuman primates (NHPs) follows similar principles of neural architecture, in part because results in NHP studies have generally been reported for responsiveness (e.g., change from baseline) to a single action (typically grasping) rather than selectivity (e.g., differential responses) for multiple distinct actions [e.g., (18)]. The partially mixed architecture may account for the weak link between text and the visual formats (e.g., relatively low population correlation and few units with matching SPs). If a cortical region encodes the many visual facets of an observed action (e.g., viewpoint, posture, and other untested features) and text links with both what is overlapping and distinct about action presentations, it follows that the link between text and any particular presentation must be relatively weak.

Cortical organization of conceptual knowledge

In understanding an action verb, we access semantic knowledge. The cortical organization of semantic knowledge has been contentious. Some theories contend that conceptual knowledge is rooted in cortical regions that use supramodal symbolic processing (7), while other theories take the opposite perspective, that semantic knowledge is encoded in the distributed sensorimotor network (6, 34). Most recent theories posit that meaning emerges from interactions between supramodal associative areas and regions directly responsible for processing sensory stimuli, motor actions, valence, and internal state (1–5). Our results are consistent with these interaction models, given the longer latencies we observed for text-selective responses in PPC, relative to those reported for higher-order language regions such as superior temporal gyrus or inferior frontal gyrus (35). One likely possibility is that action verb activity in PPC originates from supramodal regions and automatically spreads to PPC. This interaction model comes in many versions, primarily distinguished by which areas constitute the supramodal regions and the nature of the interactions. In part, a deeper understanding of the organization of conceptual knowledge in the human brain has been limited by the general inability to record from single neurons in humans. We know of no single-unit recordings in supramodal regions, but one intriguing possibility is that these areas may host neurons similar to the “concept cells” of the medial temporal lobe (MTL) (36), which respond to a preferred stimulus (e.g., a particular individual) largely independently of sensory modality or presentation details (e.g., image, written word, and sound). While this strong invariance provides a model for neural coding mechanisms in supramodal centers, much less is known about how semantically related items are encoded in the distributed network. 
The current study contributes to this goal by providing the first demonstration of a link between words and their sensorimotor representations and how the neural architecture supports this link.

In the current paper, we have focused on how verbs are given meaning. We may also consider what our results mean from the reverse direction: how the neural population may contribute to naming an observed action. We find not only relatively high generalization across different views of the same observed action but also a degree of dependence on viewpoint and the point of fixation (fig. S2). These neural properties suggest that rostral PPC neurons could play a role in creating increasingly abstracted representations that associate the same actions and thus contribute to the processing needed for naming. However, given the weakness of the link, subsequent regions, potentially using winner-take-all-like mechanisms, would be needed for the final conversion to labeling the observed action.

The link between visual representations of actions and action verbs fits with current views of how infants learn action verbs by mapping words onto conceptualizations of events (37). Infants can distinguish action exemplars (running, marching, and jumping) independently of the actors (38), and this ability predicts the use of action verbs at 2 years of age (39). Furthermore, the link provides an explanation for why infants learn verbs later than nouns (40), as the corresponding visual representations are in different visual pathways. In the PPC, the development of observed action selectivity, which is originally in the service of guiding future actions (18), may only occur once the infant starts moving. Infants initially learn verbs corresponding to their own actions (41).

Limitations of the study

Stimuli. We used a restricted set of observed actions and action verbs, based on the category of actions that best evoke responses in neuroimaging (42). Thus, our results cannot support the conclusion that responses to written text are specific for action verbs. Neuroimaging studies have shown that brain regions exhibit some degree of domain specificity during language processing (43). Understanding domain specificity of responses to language at the single-unit level is an exciting future direction.

Visual formats. We tested only a small number of visual formats: two postures and two viewpoints. Thus, the visual invariance that we established may be an overestimation, and increasing the diversity of different presentations of the same action would lower the percentage of invariant cells. Hence, while it remains possible that the visual invariant neurons (F=L0=L1 in Fig. 4) are akin to concept cells as described in the MTL of humans, this is by no means established. To this point, neurons exhibiting invariance in the MTL showed sparse coding (only active for a single basic-level category), while the invariant neurons tested in our study were broadly tuned, matching the tuning profiles of other visually selective neurons (fig. S6). The small number of visual formats may also partially account for text-selective units with mismatched or absent visual selectivity (Fig. 4, D and E) as they may link with other untested visual representations of the corresponding action identity.

Recording site. We tested only one region of the AON. Other regions of the AON (e.g., premotor areas or the LOTC), based on neuroimaging and lesion evidence, likely play a role in linking language with its sensory and motor representations. Action verbs may be associated with the kinematic profiles of movement, movement dynamics, the agents typically performing the action, the objects typically subjected to the actions, the desired outcome or value of the action, and the expected sensations that accompany the action, among others. The constituent regions of the AON likely encode these movement attributes and together may form the distributed network that links action verbs with these varied aspects of meaning.

Causality. As with all passive neural recording studies, our study cannot determine the causal role of our PPC neurons in understanding the meaning of action verbs. However, prior work, using word or static picture stimuli, has shown that damage or inactivation within the frontoparietal AON, including PPC, can result in specific action comprehension deficits (23–25) consistent with the idea that neurons within the AON play a role in verb comprehension. Our results provide clarity on the presence and nature of the link between neural representations of action verbs and visually observed actions at the level of single units in PPC.

Subjects. We investigated neural signals in two participants and thus cannot make strong conclusions about factors that influence the strength of action verb encoding. Participant NS demonstrated stronger selectivity than EGS, even when controlling for the number of neurons and sessions (fig. S4). The reasons for these differences are unclear but may be the product of individual differences and could include anything from the degree to which the two participants attended to stimuli on a trial-to-trial basis to the degree to which individuals intrinsically engage sensory-motor systems during semantic processing. One intriguing difference is that NS is a native English speaker, while EGS is a fluent but nonnative speaker, having learned English as part of a language program in primary school. One possibility is that the time of language acquisition may affect the degree to which words engage sensory-motor systems. In addition, the recorded neurons may come from different functional regions due to either anatomical differences in implant location or high individual differences in how functional regions map to cortical anatomy. A precise functional correspondence of areas is unlikely; however, we note that functional responses were similar during functional neuroimaging (fig. S1), as well as during planning and execution epochs of motor imagery tasks at the single-unit level (31, 33).

Conclusion

The current study provides the first single-unit evidence that action verbs share a neural substrate with visually observed actions in high-level sensory-motor cortex, thus clarifying the neural organization of human conceptual knowledge. Action verbs link with all the diverse visual representations of the related concept, suggesting that language may activate the consolidated visual experience of the reader.

Acknowledgments: We would like to thank N.S. and E.G.S. for participating in the studies, V. Scherbatyuk for technical assistance, and K. Pejsa for administrative and regulatory assistance. We would also like to thank M. Rugg for helpful comments on an early version of this manuscript. Funding: This work was supported by the NIH (R01EY015545), the Tianqiao and Chrissy Chen Brain-machine Interface Center at Caltech, the Conte Center for Social Decision Making at Caltech (P50MH094258), the Boswell Foundation, and ERC (Parietal action) VII FP (323606). Author contributions: Conceptualization: T.A. Methodology: T.A. and G.A.O. Investigation: T.A. and C.Y.Z. Formal analysis: T.A. Writing (original draft): T.A. Writing (review and editing): T.A., G.A.O., and R.A.A. Funding acquisition: T.A., G.A.O., and R.A.A. Resources: E.R.R. and N.P. Supervision: T.A., G.A.O., and R.A.A. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Additional data related to this paper may be requested from the authors.