Abstract
Spatial hearing sensitivity in humans is dynamic and task-dependent, but the mechanisms in human auditory cortex that enable dynamic sound location encoding remain unclear. Using functional magnetic resonance imaging (fMRI), we assessed how active behavior affects encoding of sound location (azimuth) in primary auditory cortical areas and planum temporale (PT). According to the hierarchical model of auditory processing and cortical functional specialization, PT is implicated in sound location (“where”) processing. Yet, our results show that spatial tuning profiles in primary auditory cortical areas (left primary core and right caudo-medial belt) sharpened during a sound localization (“where”) task compared with a sound identification (“what”) task. In contrast, spatial tuning in PT was sharp but did not vary with task performance. We further applied a population pattern decoder to the measured fMRI activity patterns, which confirmed the task-dependent effects in the left core: sound location estimates from fMRI patterns measured during active sound localization were most accurate. In PT, decoding accuracy was not modulated by task performance. These results indicate that changes of population activity in human primary auditory areas reflect dynamic and task-dependent processing of sound location. As such, our findings suggest that the hierarchical model of auditory processing may need to be revised to include an interaction between primary and functionally specialized areas depending on behavioral requirements.
SIGNIFICANCE STATEMENT According to a purely hierarchical view, cortical auditory processing consists of a series of analysis stages from sensory (acoustic) processing in primary auditory cortex to specialized processing in higher-order areas. Posterior-dorsal cortical auditory areas, planum temporale (PT) in humans, are considered to be functionally specialized for spatial processing. However, this model is based mostly on passive listening studies. Our results provide compelling evidence that active behavior (sound localization) sharpens spatial selectivity in primary auditory cortex, whereas spatial tuning in functionally specialized areas (PT) is narrow but task-invariant. These findings suggest that the hierarchical view of cortical functional specialization needs to be extended: our data indicate that active behavior involves feedback projections from higher-order regions to primary auditory cortex.
Introduction
Sound localization is a crucial component of mammalian hearing. In the mammalian auditory cortex, neural activity in posterior areas is modulated by sound location more than in primary and anterior areas. These spatially-sensitive areas include the caudo-medial (CM) and caudo-lateral belt areas (CL) in nonhuman primates (Tian et al., 2001), the posterior auditory field (Harrington et al., 2008) and dorsal zone in cats (Stecker and Middlebrooks, 2003; Stecker et al., 2005; Lomber and Malhotra, 2008), and the planum temporale (PT) in humans (Warren and Griffiths, 2003; Brunetti et al., 2005; Deouell et al., 2007; van der Zwaag et al., 2011; Derey et al., 2016; McLaughlin et al., 2016). For this reason, cortical processing of sound location is presumably taking place in a functionally specialized, posterior-dorsal “where” stream (Rauschecker and Tian, 2000; Tian et al., 2001; Arnott et al., 2004; Rauschecker and Scott, 2009).
Behavioral evidence from psychophysical studies shows that auditory spatial sensitivity in humans is dynamic. For example, an auditory target is processed faster when auditory spatial attention is focused at the location of the target (Spence and Driver, 1994; Mondor and Zatorre, 1995; Rorden and Driver, 2001). A recent study investigating the neural mechanisms underlying this dynamic spatial sensitivity in cats identified the primary auditory cortex (A1) as a potential locus for such dynamic sound location processing (Lee and Middlebrooks, 2011). In humans, a recent study reported a region in posterior auditory cortex that exhibited a differential level of activation based on task performance, but no task modulation of selectivity to interaural level differences (ILD) or interaural time differences (ITD) across the entire auditory cortex (Higgins et al., 2017). However, it is presently not clear whether task performance results in sharpening of spatial tuning within distinct regions of the human auditory cortex, and whether this sharpening occurs preferentially in functionally specialized “where” regions (i.e., PT) or also affects A1.
Moreover, the effects of task performance on the cortical encoding of sound location are not yet known. The computational mechanisms underlying cortical sound location encoding are still a matter of debate, and prior studies assessing the validity of these computational mechanisms have not addressed possible effects of task performance (McAlpine et al., 2001; Stecker and Middlebrooks, 2003; Harper and McAlpine, 2004; Stecker et al., 2005; King et al., 2007; Miller and Recanzone, 2009; Day and Delgutte, 2013; Derey et al., 2016; Ortiz-Rios et al., 2017).
Here we measured with functional magnetic resonance imaging (fMRI) the neuronal population responses to different sound azimuth positions in the human auditory core, lateral belt areas, and PT, while participants performed different behavioral tasks. We then evaluated the spatial selectivity of neuronal populations within these areas across task conditions. Additionally, we applied a modified version of a maximum-likelihood population-pattern decoder previously used to decode sound location from neural spike rates (Jazayeri and Movshon, 2006; Miller and Recanzone, 2009; Day and Delgutte, 2013) to assess whether sound location encoding in fMRI activity patterns in human auditory cortex within and across hemispheres is modulated by task performance. Our results provide new insights into the dynamic nature of sound location encoding in human A1. In particular, in agreement with “reverse hierarchy” (Ahissar et al., 2009) and “recurrent processing” models (Lamme and Roelfsema, 2000; Bullier, 2001), our data suggest that behavior (sound localization) is enabled by feedback from functionally specialized areas to A1.
Materials and Methods
Participants
Thirteen human volunteers gave informed consent to participate in the experiment. Data of two participants were excluded from the analysis due to insufficient data quality as a consequence of excessive motion and participant fatigue. Data of the remaining 11 participants (mean age = 28.9 years, SD = 11.7 years, 7 females) are presented here. Participants reported no history of neurological disorders. We assessed hearing levels with pure-tone thresholds (0.5, 1, 2, 4, 8 kHz) using an Oscilla SM910 Screening Audiometer. Hearing thresholds did not exceed 25 dB for any of the frequencies tested. The institutional review board of Georgetown University granted approval for the study.
Stimuli
Stimuli consisted of amplitude-modulated (AM) white noise clips (probe sounds: duration = 1200 ms) and click trains (target sounds: click rate = 200 Hz, duration = 1200 ms). Probe and target sounds were created with MATLAB (MathWorks). Stimuli were presented at 1 of 7 locations (−90°, −60°, −30°, 0°, +30°, +60°, and +90°; Fig. 1A).
All stimuli were spatialized by making subject-specific binaural recordings (Derey et al., 2016). During the binaural-recording session, participants sat in a chair in the center of a production studio (internal volume = 66 m³; walls and ceiling consisted of gypsum board covered with fabric, the floor consisted of concrete covered with a carpet) with binaural microphones placed in their ear canals (OKM II Classic Microphone, Soundman). A loudspeaker positioned at zero elevation in the far field (distance to subject = 1.3 m) presented sounds at each of the locations (Fig. 1A). This procedure resulted in stimuli with a clear spatial percept based on available ILD, ITD, and spectral cues (Fig. 1C,D).
Each stimulus was prefiltered with headphone equalization filters provided by the manufacturer of the MRI-compatible earbuds used in the present study (Sensimetrics S14). The headphone equalization filters ensure a flat frequency response at the level of the earbuds and remove headphone-induced phase offsets between the earbuds.
For the tonotopy measurements, we used amplitude-modulated pure tones (rate of modulation = 10 Hz, full-depth modulation, 800 ms duration). Pure tones were centered on eight center frequencies (0.18, 0.30, 0.51, 0.86, 1.46, 2.48, 4.19, 7.09 kHz) with a slight variation of ±0.1 octave to prevent habituation (De Martino et al., 2013). Stimuli for the tonotopy measurements were prefiltered with the headphone equalization filters as well.
Experimental design
Participants listened to probe trials in three behavioral conditions: passive listening, sound identification, and sound localization. Probe trials consisted of five repetitions of a probe sound clip (duration = 1200 ms) at the same location. Sound clips were presented in silent gaps (1.4 s) in between fMRI acquisition periods (2 s; see Data acquisition), resulting in a total duration of 17 s per trial (5 stimulus repetitions in silent gaps of 1.4 s plus 5 fMRI data acquisition periods of 2 s; Fig. 1B). In the active listening conditions only, participants also listened to target trials. Specifically, in the sound identification condition, target trials had a similar structure (i.e., 5 repetitions at the same azimuthal location), yet the fourth or the fifth repetition of the probe sounds (AM white noise) was replaced by a deviant target sound (click train) at the same location (Fig. 1B). In the sound localization condition, target trials had a similar structure as well, but the fourth or the fifth repetition of the probe sound (AM white noise) was replaced by a probe sound at a deviant azimuth location. For example, the first four stimuli were presented at −90° and the fifth stimulus at +30° (Fig. 1B).
During fMRI acquisition, trials were grouped by task (passive listening, sound identification, sound localization) in a block. In each block, probe trials were presented once at each azimuth location and were separated by an intertrial interval of 12.2 s (for detailed information, see Data acquisition). The order of azimuth locations was randomized within a block. Thus, for passive listening, a block consisted of seven probe trials, one at each azimuth location. For the active tasks, sound localization and sound identification, a block also contained two target trials (equivalent to ∼22% of the total number of trials) in addition to the seven probe trials. The order of target and probe trials was randomized within a block.
Each participant performed one block of each task per run of fMRI acquisition. Thus, one run consisted of three blocks corresponding to the three behavioral task conditions. At the start of each task block, a short audio clip of a voice informed participants of the task at hand: “sound location”, “sound identity”, or “passive listening”. In the passive listening condition, participants listened to the sounds without making a response. In the sound identification condition, participants pressed a button immediately upon detection of a target sound within a target trial (i.e., the click train). In the sound localization condition, participants pressed a button immediately upon detecting a location switch within a target trial.
The order of blocks was randomized and counterbalanced across participants. In total, participants completed four runs of the main experiment (∼10 min each) in the MRI scanner. This resulted in four probe trial repetitions per azimuth location per task condition. Only probe trials were included in the subsequent analyses (see Data analysis).
Before the fMRI measurements, participants performed a short practice session to get familiar with the tasks and with the MRI environment. This also enabled participants to get accustomed to the auditory spatial percept in a supine frame of reference (due to the supine position required by the MRI scanner). The practice session consisted of passive presentation of the probe stimuli at each location as well as short task blocks of the sound localization and the sound identification task, in which one target trial was presented per task block.
Finally, the scan session was concluded with two runs of tonotopy measurements (∼7.5 min each). For this experiment, participants listened passively to blocks of AM pure tones in the MRI scanner. Each block was repeated twice per run, resulting in four repetitions per center frequency. The order of frequency blocks was randomized (De Martino et al., 2013).
Data acquisition
Data were acquired with a Siemens TIM Trio 3-tesla MRI scanner at the Center for Functional and Molecular Imaging at Georgetown University. For the main experiment, blood oxygenation level-dependent (BOLD) signals were measured with a T2*-weighted echoplanar imaging (EPI) sequence covering the temporal cortex and parts of the occipital, parietal, and frontal cortex [echo time (TE) = 30 ms; repetition time (TR) = 3400 ms; flip angle = 90°; number of slices = 32; voxel size = 2 mm3 isotropic]. Image acquisition was clustered [acquisition time (TA) = 2000 ms], and binaural recordings were presented in silent gaps (duration = 1400 ms) between subsequent volume acquisitions through MR-compatible insert earphones (Sensimetrics S14) with sound-attenuating foam ear tips (>29 dB attenuation). One sound was presented per TR. Trials (i.e., 5 stimulus repetitions per azimuth location corresponding to 5 TRs, 17 s duration) were separated by three volumes in which no sound was presented (that is, 12.2 s silence) to allow the BOLD signal to return to baseline before the onset of the next trial.
We also acquired a high resolution anatomical image of the whole brain with a MPRAGE T1-weighted sequence (TE = 2.13 ms; TR = 2400 ms; voxel size = 1 mm3 isotropic). For the tonotopic measurements we also used a sparse T2*-weighted EPI sequence to measure the BOLD signal, covering mainly the temporal cortex (TE = 30 ms; TR = 2600 ms; TA = 1600 ms; silent gap = 1000 ms; flip angle = 90°; number of slices = 25; voxel size = 2 mm3 isotropic). In each run, AM pure tones were presented in the silent intervals between subsequent volume acquisitions in blocks of six repetitions per center frequency (15.6 s). Blocks were separated by 12 s of silence (4 volumes).
Statistical analysis
Data preprocessing.
Functional and anatomical data were analyzed using BrainVoyager QX (Brain Innovation) and customized MATLAB code. Preprocessing of functional images included motion correction (trilinear/sinc interpolation, with the first volume of the first run as the reference volume for alignment), slice scan time correction (sinc interpolation), linear drift removal, temporal high-pass filtering (threshold = 7 cycles per run), and mild spatial smoothing (3 mm kernel). Functional images were coregistered to the anatomical T1-weighted image and transformed to 3D Talairach space (Talairach and Tournoux, 1988). Gray–white matter boundaries were defined with the BrainVoyager QX automatic segmentation procedure and manually improved when necessary.
Group analyses were performed in surface space to ensure optimal alignment of the auditory cortex across participants. To this end, we applied cortex-based alignment (CBA) to the surface reconstruction of each participant (Goebel et al., 2006) with the additional constraint of an anatomical definition of Heschl's gyrus (HG; Kim et al., 2000; Morosan et al., 2001). High-resolution surface mesh time courses were created by sampling and averaging, for each point on the surface (that is, each vertex), the values from 1 mm below the gray–white matter boundary up to 2 mm into the gray matter toward the pial surface.
Univariate analysis of the processing of spatialized sounds.
To test for the general response to presentation of spatialized sounds, we estimated a random effects general linear model (RFX GLM) with a predictor for sound presentation including all probe trials (regardless of azimuth location or behavioral task condition). Target trials were modeled with a separate predictor and not included in the contrast.
Response azimuth functions.
We constructed a response azimuth function (RAF) for each auditory responsive voxel [individual subject GLM with one predictor per sound azimuth location per task condition and excluding target trials, contrast auditory stimuli > baseline, q(FDR) < 0.05]. RAFs consisted of location-specific β values estimated with a GLM with one predictor per sound location per task. RAFs were mildly smoothed with a moving average window of three points [weights (0.2, 0.6, 0.2)]. A peak response was defined as a response at 75% or more of the maximum β value in the RAF (Stecker and Middlebrooks, 2003; Stecker et al., 2005; Derey et al., 2016). Each peak was described as a vector with length = β and angle = azimuth position. The vector sum then consisted of the summation of these individual vectors.
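The RAF steps described above (three-point smoothing, the 75% peak criterion, and the vector sum) can be sketched in Python as follows. This is a minimal illustration, not the analysis code used in the study: the function names, the example β values, and the treatment of the two outermost azimuth positions (weights renormalized over the available points) are assumptions, and the sketch assumes that the maximum of the RAF is positive.

```python
import math

AZIMUTHS = [-90, -60, -30, 0, 30, 60, 90]  # degrees, as in the stimulus set


def smooth_raf(betas, weights=(0.2, 0.6, 0.2)):
    """Three-point weighted moving average of a response azimuth function.

    At the two edge positions only two points are available; the weights are
    renormalized there (an assumption, the text does not specify edge handling).
    """
    n = len(betas)
    smoothed = []
    for i in range(n):
        total, weight_sum = 0.0, 0.0
        for offset, w in zip((-1, 0, 1), weights):
            j = i + offset
            if 0 <= j < n:
                total += w * betas[j]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed


def peak_indices(raf, criterion=0.75):
    """Indices whose response is at least 75% of the RAF maximum."""
    threshold = criterion * max(raf)
    return [i for i, b in enumerate(raf) if b >= threshold]


def vector_sum(raf, azimuths=AZIMUTHS, criterion=0.75):
    """Sum the peak vectors (length = beta, angle = azimuth).

    Returns the resultant length and its angle in degrees.
    """
    x = y = 0.0
    for i in peak_indices(raf, criterion):
        angle = math.radians(azimuths[i])
        x += raf[i] * math.cos(angle)
        y += raf[i] * math.sin(angle)
    return math.hypot(x, y), math.degrees(math.atan2(y, x))


# Hypothetical beta values for one voxel that prefers the right hemifield.
betas = [0.1, 0.2, 0.3, 0.5, 0.9, 1.0, 0.8]
raf = smooth_raf(betas)
length, angle = vector_sum(raf)
```

In this made-up example the three rightmost positions pass the peak criterion, and the resultant vector points close to +60°.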
We considered a voxel to be spatially selective if the BOLD response was modulated by sound azimuth position, as reflected in the RAF, such that at least one and maximally three adjacent azimuth positions elicited a peak response. A voxel that exhibited a peak response to more than three adjacent azimuth positions was considered omni-responsive and therefore nonselective. Voxels that exhibited a peak response to two or more separate azimuth locations were also considered nonselective.
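The selectivity criteria can be expressed compactly as a classification rule over the peak positions of a (smoothed) RAF. The sketch below is illustrative only; the function name, the returned labels, and the handling of voxels without a positive response are assumptions.

```python
def classify_voxel(raf, criterion=0.75, max_width=3):
    """Classify a voxel from its response azimuth function (list of betas).

    selective: one run of 1 to max_width adjacent peak positions.
    omni-responsive: one run of more than max_width adjacent peak positions.
    separate peaks: two or more non-adjacent runs of peak positions.
    """
    maximum = max(raf)
    if maximum <= 0:
        # Assumption: a voxel without a positive response has no defined peak.
        return "nonselective (no positive response)"

    is_peak = [b >= criterion * maximum for b in raf]

    # Collect the lengths of contiguous runs of peak positions.
    runs = []
    length = 0
    for flag in is_peak:
        if flag:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)

    if len(runs) > 1:
        return "nonselective (separate peaks)"
    if runs[0] > max_width:
        return "nonselective (omni-responsive)"
    return "selective"


# Hypothetical RAFs illustrating the three outcomes.
tuned = [0.1, 0.2, 0.3, 0.5, 0.9, 1.0, 0.8]
broad = [1.0, 1.0, 1.0, 1.0, 0.9, 0.2, 0.1]
bimodal = [1.0, 0.2, 0.1, 0.1, 0.2, 0.3, 0.9]
```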
The tuning width of spatially selective voxels was quantified as the equivalent rectangular receptive field (ERRF) width (Lee and Middlebrooks, 2011). The ERRF width relates the integral of the RAF to the amplitude of the peak response (that is, the β value at the preferred location): it is the width of a rectangle that has the same peak height and the same area as the RAF. Although this measure does not provide an absolute measure of spatial selectivity, it enables the comparison of spatial selectivity across conditions, areas, and participants. Given that the rostral belt areas were not extensively activated, we focused this analysis on the caudal belt areas CM and CL.
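As a worked illustration of the ERRF measure, the sketch below assumes that the ERRF width is the area under the RAF divided by its peak value, so that a narrower RAF yields a smaller width. The trapezoidal integration over azimuth, the requirement of a positive peak, and the function name are assumptions made for this example.

```python
AZIMUTHS = [-90, -60, -30, 0, 30, 60, 90]  # degrees


def errf_width(raf, azimuths=AZIMUTHS):
    """Equivalent rectangular receptive field width in degrees.

    Assumed definition: area under the RAF (trapezoidal rule) divided by
    the peak response, i.e. the width of a rectangle with the same height
    and the same area as the RAF.
    """
    peak = max(raf)
    if peak <= 0:
        raise ValueError("ERRF width is undefined without a positive peak")
    area = sum(
        0.5 * (raf[i] + raf[i + 1]) * (azimuths[i + 1] - azimuths[i])
        for i in range(len(raf) - 1)
    )
    return area / peak


# Hypothetical RAFs: a flat (untuned) response and a sharply tuned response.
flat_width = errf_width([1.0] * 7)
sharp_width = errf_width([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
```

Because the measure is normalized by the peak, it compares the shape of RAFs rather than their overall response amplitude.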
Response sharpening versus response gain.
We tested whether sharpening of spatial tuning resulted from BOLD response gain (that is, an increase of the BOLD response at the voxel's preferred location), BOLD response sharpening (a decrease in the BOLD response at the voxel's least preferred location), or a combination of the two. For this comparison, we defined the voxel's best location as the location with the highest β value in the task-independent RAF, that is, the average RAF across the two active task conditions. Similarly, we considered the least-preferred location the azimuth location with the lowest β value in the average RAF (Lee and Middlebrooks, 2011, 2013).
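The comparison can be sketched for a single voxel as follows. The function name, the sign conventions (positive values indicate an effect in the expected direction for the localization task), and the example β values are assumptions for illustration.

```python
def gain_and_sharpening(raf_localization, raf_identification):
    """Compare two task-specific RAFs at the best and least-preferred locations.

    The best and least-preferred locations are taken from the average RAF
    across the two active tasks, so their choice is independent of task.
    """
    average = [
        (loc + ident) / 2.0
        for loc, ident in zip(raf_localization, raf_identification)
    ]
    best = average.index(max(average))
    worst = average.index(min(average))
    return {
        "best_index": best,
        "worst_index": worst,
        # Response gain: higher response at the best location during localization.
        "gain": raf_localization[best] - raf_identification[best],
        # Response sharpening: lower response at the least-preferred location
        # during localization.
        "sharpening": raf_identification[worst] - raf_localization[worst],
    }


# Hypothetical RAFs for one voxel in the two active task conditions.
raf_id = [0.2, 0.3, 0.4, 0.6, 0.8, 0.7, 0.5]
raf_loc = [0.1, 0.2, 0.4, 0.6, 0.9, 0.7, 0.5]
result = gain_and_sharpening(raf_loc, raf_id)
```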
Decoding sound azimuth position from fMRI activity patterns.
To decode sound location, we applied a population-pattern decoder to the measured fMRI activity patterns in two regions of interest: the core region and PT. We selected these regions based on prior research in animals indicating A1 as a potential locus for dynamic spatial sensitivity (Lee and Middlebrooks, 2011) and prior neuroimaging research in humans illustrating the role of PT in spatial auditory processing in the human brain (Warren and Griffiths, 2003; Brunetti et al., 2005; Deouell et al., 2007; van der Zwaag et al., 2011; Derey et al., 2016; McLaughlin et al., 2016).
In general, the decoder—a modified version of a pattern decoder introduced to decode sensory information from neural spike rate patterns (Jazayeri and Movshon, 2006; Miller and Recanzone, 2009; Day and Delgutte, 2013)—computes the log-likelihood that a sound at a given azimuth location elicited the observed fMRI activity pattern. In particular, we computed for each voxel the log-likelihood that a stimulus at a particular azimuth location induced the observed BOLD response. The population log-likelihood then consists of the sum of the log-likelihoods across all voxels (Fig. 2).
Specifically, for each cortical area, we selected those voxels that responded to sounds (GLM sound > baseline, p < 0.005 uncorrected) and exhibited a spatially selective response (see previous section). Next, we estimated for each subject a GLM per functional data run with one predictor per azimuth position per task. This resulted in four β estimates per azimuth position, equivalent to the four functional runs. Beta estimates were normalized between 0 and 1 across the seven azimuth positions within each run. For each stimulus azimuth position, we then computed the log-likelihood that the observed BOLD response (βi) in the voxel under consideration was elicited by the presentation of a sound at that location. Assuming that the observed BOLD response βi of voxel i for a given azimuth position θ0 is normally distributed with mean μ0,i and SD σ0,i, the log-likelihood of the observation can be computed as follows:
$$\log L_i(\theta_0) = -\frac{1}{2}\log\left(2\pi\sigma_{0,i}^{2}\right) - \frac{\left(\beta_i - \mu_{0,i}\right)^{2}}{2\sigma_{0,i}^{2}}$$
The estimation was performed using cross-validation: we considered three runs to estimate the mean μ0,i and SD σ0,i of a given voxel and azimuth position, and we used the left-out run to calculate the log-likelihood. The procedure was repeated for all possible train-test combinations. Due to the limited amount of available data (1 trial per run), the estimation of the parameters was done using the β values of the selected voxel, as well as the six neighboring voxels, that is, those voxels sharing a side with the relevant voxel. Consequently, the number of data points to estimate μ0,i and σ0,i was 21 (3 functional runs multiplied by 7 voxels). The test data βi is the β estimate for this voxel for this azimuth position in the run that was left out.
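The per-voxel computation can be illustrated with the Gaussian log-likelihood of a single observation, log L = −½ log(2πσ²) − (β − μ)²/(2σ²). The sketch below is a simplified, hypothetical version: the function names and example values are made up, the use of the sample SD (n − 1 denominator) is an assumption, and the training values stand in for the 21 β estimates pooled over three runs and seven voxels.

```python
import math
import statistics


def normalize_run(betas):
    """Rescale the beta estimates of one run to the range 0 to 1."""
    lowest, highest = min(betas), max(betas)
    return [(b - lowest) / (highest - lowest) for b in betas]


def gaussian_log_likelihood(beta, mu, sigma):
    """Log-likelihood of one observation under a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (beta - mu) ** 2 / (2 * sigma ** 2)


def voxel_log_likelihoods(training_values, test_beta):
    """Log-likelihood of the left-out beta for each candidate azimuth.

    training_values[k] holds the training betas for azimuth k
    (in the text: 3 runs x 7 voxels = 21 values per azimuth).
    """
    result = []
    for values in training_values:
        mu = statistics.mean(values)
        sigma = statistics.stdev(values)  # sample SD, an assumption
        result.append(gaussian_log_likelihood(test_beta, mu, sigma))
    return result


# Hypothetical example with two candidate azimuths and three training values each.
training = [[0.1, 0.2, 0.3], [0.8, 0.9, 1.0]]
log_likelihoods = voxel_log_likelihoods(training, test_beta=0.85)
```

In this example the left-out response of 0.85 is far more likely under the second azimuth, whose training responses cluster around 0.9.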
Assuming conditional (i.e., within each azimuth position) independence between different voxels, the population response was then computed as the sum of the log-likelihoods of all voxels in the cortical area (N):
$$\log L(\theta_0) = \sum_{i=1}^{N} \log L_i(\theta_0)$$
where log L_i(θ0) is the log-likelihood of voxel i for azimuth position θ0. In the test run, we predicted the sound azimuth location of a new, unseen sound by selecting the location with the highest log-likelihood. This is equivalent to using a probabilistic classifier based on the posterior probability of azimuth location given the observed data, when the class prior is uniform across all sound locations. Reported absolute errors are the average across the four train-test estimations. Statistical comparisons of absolute error across cortical areas and tasks were made with Wilcoxon signed rank tests (one-tailed) and corrected for multiple comparisons with the false discovery rate [q(FDR) < 0.05] unless stated otherwise. We determined the chance level of absolute error per azimuth position with permutation testing. Specifically, within each run we permuted β estimates randomly across the seven azimuth locations and for all voxels independently. We then applied the maximum likelihood decoder to the permuted data. This procedure was repeated 1500 times per subject. Chance level of absolute error was computed as the mean absolute error across permutations.
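The population-level decision rule and the permutation step can be sketched as follows. This is a toy illustration under stated assumptions: the function names, the two-voxel example, and its tuning parameters are hypothetical, and the Gaussian per-voxel likelihood is written inline so that the block is self-contained.

```python
import math
import random

AZIMUTHS = [-90, -60, -30, 0, 30, 60, 90]  # degrees


def decode_azimuth(test_pattern, mu, sigma, azimuths=AZIMUTHS):
    """Return the azimuth with the highest summed log-likelihood.

    test_pattern[i] is the left-out beta of voxel i; mu[i][k] and sigma[i][k]
    are the training mean and SD of voxel i for azimuth k.
    """
    best_index, best_value = 0, -math.inf
    for k in range(len(azimuths)):
        total = 0.0
        for i, beta in enumerate(test_pattern):
            s = sigma[i][k]
            total += -0.5 * math.log(2 * math.pi * s * s) - (beta - mu[i][k]) ** 2 / (2 * s * s)
        if total > best_value:
            best_index, best_value = k, total
    return azimuths[best_index]


def absolute_error(predicted, presented):
    """Absolute decoding error in degrees."""
    return abs(predicted - presented)


def permute_within_run(run_betas, rng):
    """Shuffle betas across azimuths, independently for every voxel."""
    permuted = []
    for voxel_betas in run_betas:
        shuffled = list(voxel_betas)
        rng.shuffle(shuffled)
        permuted.append(shuffled)
    return permuted


# Hypothetical two-voxel population: one voxel prefers the left, one the right.
mu = [
    [1.0, 0.8, 0.6, 0.4, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0],
]
sigma = [[0.1] * 7, [0.1] * 7]
predicted = decode_azimuth([0.1, 0.8], mu, sigma)
error = absolute_error(predicted, 60)
shuffled_run = permute_within_run(mu, random.Random(0))
```

Applying `decode_azimuth` to many permuted runs and averaging `absolute_error` gives an empirical chance level in the spirit of the procedure described above.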
Finally, we applied the population pattern decoder to data from both hemispheres simultaneously. In particular, we randomly sampled half of the voxels in the left hemisphere and half of the voxels in the right hemisphere. This procedure ensured that the number of data points used for the maximum likelihood estimation was equal when the decoder operated on data from two hemispheres versus data from a single hemisphere. We repeated the random sampling procedure 200 times per subject and computed absolute error as the average across samples. To determine the chance level for the population decoder operating on data from the two hemispheres, we applied a similar permutation procedure as described above. However, due to the interaction of the computationally intensive procedure of repeating the random sampling of half of the voxels in each hemisphere as well as the permutations, we limited the calculation to 30 random samples with 10 permutations each. Chance level of absolute error was computed as the average absolute error across samples and permutations.
Parcellation of the auditory cortex.
To divide the auditory cortex into core, belt regions, and PT, we combined maps of frequency preference (tonotopy) and frequency selectivity. To construct these maps, we first estimated a voxel's frequency tuning profile by estimating a GLM with one predictor per center frequency for each auditory active voxel (assessed with a GLM contrasting auditory stimulation > baseline, liberal threshold of p < 0.05 uncorrected). We inferred a voxel's preferred frequency (PF) from the frequency tuning profile. That is, a voxel's PF was defined as the frequency with the highest β value in the tuning profile (after z-normalizing across voxels). We then created tonotopic maps on the cortical surface by color-coding the PF of all auditory responsive voxels in a blue (high-frequency) to red (low-frequency) color scale.
Next we estimated the frequency selectivity of a voxel by computing a frequency selectivity index (FSI). This index expresses the ratio between the peak β value (that is, the β corresponding to the PF) and the area under the frequency-tuning curve (the integral):
$$\mathrm{FSI} = \frac{\beta_{\mathrm{PF}}}{\int \beta(f)\,df}$$
Then, similar to Moerel et al. (2012), we defined the tuning width (TW) of a voxel as follows:
$$\mathrm{TW} = \frac{\mathrm{PF}}{f_2 - f_1}$$
where (f2 − f1) is the frequency range in hertz corresponding to the FSI. As such, TW is high for voxels with a narrow tuning profile and low for voxels with a broad tuning profile. We color-coded the TW on the cortical sheet in a yellow (broad tuning) to purple (narrow tuning) color scale.
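The preferred frequency and the frequency selectivity index can be illustrated with a short sketch. The function names and example tuning profiles are hypothetical, and two choices are assumptions not stated in the text: the area under the tuning curve is integrated with the trapezoidal rule on a logarithmic (octave) frequency axis, and the z-normalization across voxels is omitted.

```python
import math

# Center frequencies of the tonotopy stimuli, in kHz.
CENTER_FREQS_KHZ = [0.18, 0.30, 0.51, 0.86, 1.46, 2.48, 4.19, 7.09]


def preferred_frequency(betas, freqs=CENTER_FREQS_KHZ):
    """Frequency with the highest beta value in the tuning profile."""
    return freqs[betas.index(max(betas))]


def frequency_selectivity_index(betas, freqs=CENTER_FREQS_KHZ):
    """Peak beta divided by the area under the frequency tuning curve.

    The area is computed with the trapezoidal rule over log2(frequency),
    which is an assumption made for this illustration.
    """
    octaves = [math.log2(f) for f in freqs]
    area = sum(
        0.5 * (betas[i] + betas[i + 1]) * (octaves[i + 1] - octaves[i])
        for i in range(len(betas) - 1)
    )
    return max(betas) / area


# Hypothetical tuning profiles: narrowly tuned versus broadly tuned voxel.
narrow = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
broad = [1.0] * 8
```

With these profiles the narrowly tuned voxel has a PF of 0.86 kHz and a higher FSI than the broadly tuned voxel.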
Finally, we used these maps to parcellate the auditory cortex following criteria based on the tonotopic organization described by Moerel et al. (2012) (Fig. 3). Specifically, Moerel et al. (2012) identify the core region as a region overlapping with HG that is narrowly tuned to frequency and encompasses two mirror-symmetric tonotopic gradients (Formisano et al., 2003; Moerel et al., 2014; Leaver and Rauschecker, 2016). This core region is flanked by broadly tuned regions both anteriorly (overlapping with the first transverse sulcus and planum polare in general), and posteriorly [coinciding with Heschl's sulcus (HS)]. Here we defined these broadly tuned bands as the rostral and caudal belt respectively (Fig. 3). We then evenly divided both the caudal and the rostral belt into medial and lateral parts, resulting in four belt areas: CM, CL, rostromedial (RM), and rostrolateral (RL; Rauschecker et al., 1995; Kaas and Hackett, 2000). Finally, in line with Moerel et al. (2012) and the anatomical definition of PT provided by Kim et al. (2000), we defined the remaining posterior part of the superior temporal plane as PT. This region was bordered anteriorly by the caudal belt (overlapping largely with HS), medially by the insular cortex, and laterally by the superior temporal gyrus.
Note that two participants did not show extensive activation in the auditory cortex for the contrast auditory stimulation > baseline as a result of excessive movement during the tonotopy measurements (possibly due to participant fatigue). We parcellated the auditory cortex of these two participants based on anatomical criteria, resulting in areas that were similar in size and location to those of the other participants. Specifically, the core region was identified as approximately two-thirds of HG (starting from the medial border; Moerel et al., 2012, 2014). The caudal belt was defined by HS, bordered posteriorly by PT (Kim et al., 2000). The rostral belt was defined as the region anterior to HG, mainly overlapping with the first transverse sulcus, as the mirror image of the caudal belt. The rostral and caudal belt regions were evenly split into a lateral and a medial part.
Maps of cortical auditory areas constructed in surface space were projected back into volume space. In subsequent analyses, we included for each area the voxels that responded to sounds (as established with a GLM, contrast auditory stimulation > baseline, liberal threshold of p < 0.005 uncorrected; Table 1).
Results
Behavioral task performance
Behavioral accuracy in the MRI scanner was high for both active tasks. The average hit rate for the sound localization task was 94.3% (SD: 15.2%), and for the sound identification task 90.9% (SD: 12.6%). There was no difference in mean accuracy between tasks (paired samples t test, t(10) = 0.607, p = 0.557).
Univariate analysis of the processing of spatialized sounds in human auditory cortex
The RFX GLM contrasting auditory stimulation > baseline showed increases in BOLD signal in primary and secondary auditory cortices in response to the probe trials (corrected for multiple comparisons with the FDR, q < 0.05; Benjamini and Hochberg, 1995). Activated areas included HG, HS, PT, and to a lesser extent, the first transverse sulcus and other parts of the planum polare. To investigate differences in the overall level of activation elicited by the three task conditions, we computed several balanced contrast maps. However, none of these contrasts revealed different activation levels between task conditions, either at a stringent threshold (FDR, q < 0.05) or at a more liberal threshold (p < 0.005 uncorrected), indicating that the overall BOLD signal amplitude in the auditory cortex was similar across tasks.
Parcellating the human auditory cortex
In agreement with prior tonotopic mapping studies (Wessinger et al., 2001; Formisano et al., 2003; Talavage et al., 2004; Striem-Amit et al., 2011; Da Costa et al., 2011; Moerel et al., 2012; Leaver and Rauschecker, 2016), cortical maps of frequency preference revealed a region tuned to low frequencies overlapping partly with HG, which was bordered anterolaterally and posteromedially by regions responding maximally to high frequencies (Fig. 3). Further, similar to Moerel et al. (2012), we observed a narrowly tuned region overlapping with (or in close vicinity to) HG in the frequency selectivity maps of most participants. This region was flanked by areas with broad frequency selectivity profiles (Fig. 3). We combined these maps of frequency preference and selectivity and derived an operational definition of the core region, the belt regions (see Rauschecker et al., 1995, for original definitions in macaque auditory cortex), and PT (Fig. 3; see Materials and Methods).
Spatial selectivity in human auditory cortex is higher in posterior, higher-order regions than in primary regions
To start, we examined general differences in the presence of spatially selective voxels between cortical areas, i.e., interarea differences regardless of behavioral demands. The results show that the average proportion of auditory responsive voxels that was spatially selective (averaged across task conditions) varied across cortical regions in the left hemisphere (Fig. 4A), as well as in the right hemisphere (Fig. 4A). In particular, in the left hemisphere, PT contained relatively more spatially selective voxels than the core, CM, and CL. The proportion of selective voxels was also higher in left CL than in the left core (see Table 2 and Table 3). In the right hemisphere, PT contained a higher proportion of selective voxels than the core and CL as well, and the proportion of spatially selective voxels was higher in CM than in CL (Table 2 and Table 3).
We also assessed spatial selectivity by investigating the relative tuning width of spatially selective voxels within an area. For this measure of spatial selectivity, we observed an anterior to posterior (rostral-to-caudal) increase of spatial selectivity as well, both in the left hemisphere and right hemisphere (Table 2; Fig. 4B). Specifically, in the left hemisphere, spatial tuning width was broader in the core than in PT, CM, and CL. Finally, spatial tuning width was narrower in PT than in CL (Table 3; Fig. 4B, left). In the right hemisphere, there was also a difference in spatial tuning width between PT and the core. However, in this hemisphere spatial tuning was sharpest in CM: there was a significant difference between CM and the core, and between CM and CL (Table 3; Fig. 4B).
Next, we investigated cortical inter-area differences in spatial selectivity per behavioral task condition. This revealed that there were differences in the proportion of spatially selective voxels across areas in all behavioral conditions (Table 2). Specifically, post hoc comparisons revealed that the rostral-to-caudal increase in the proportion of spatially selective voxels was present in all behavioral conditions in the left hemisphere. That is, in each condition, there were more spatially selective voxels in PT than in the core and in CM. Further, in the passive listening and sound identification conditions, but not in the sound localization condition, there were more spatially selective voxels in PT than in CL. In the right hemisphere, we observed significant inter-area differences in the proportion of spatially selective voxels in the sound identification condition only. Similar trends were present for the passive listening and sound localization conditions, but these just failed to reach statistical significance (Table 2). Post hoc pairwise comparisons for the sound identification condition (Table 3) indicate that there were significantly more spatially selective voxels in PT as well as in CM, compared with the core region (Fig. 5).
We also observed inter-area differences in relative tuning width per behavioral task condition in the left hemisphere. That is, there were significant inter-area differences in all behavioral conditions (Table 2), and in all conditions spatial tuning was sharper in PT than in the core region (see results of post hoc pairwise comparisons in Table 3). In addition, spatial tuning was sharper in PT than in CL in the passive listening and sound identification conditions. Spatial tuning was also sharper in CL than in the core during the passive listening and sound localization conditions. In the right hemisphere, we observed inter-area differences in the passive listening and sound localization conditions (a similar pattern was observed in the sound identification condition, but this just failed to reach statistical significance; Table 2). Post hoc pairwise comparisons show that during passive listening, spatial tuning was sharper in PT than in the core region. In addition, spatial tuning was sharper in CM than in either the core region or CL. During active sound localization as well, spatial tuning in CM was sharper than in the core, CL, and even PT (Table 3; Fig. 5).
Task-modulations of spatial selectivity within cortical auditory regions
We then examined, for each cortical area, the effect of task performance on spatial selectivity. There were no differences in the proportion of auditory responsive voxels that were spatially selective across task conditions: none of the cortical regions showed an increase or decrease in the proportion of spatially selective voxels based on task performance (one-tailed Wilcoxon signed rank tests, all p > 0.05; Fig. 5A). However, spatial tuning was sharper in the localization condition compared with the sound identification condition in the left core region [median identification condition = 108.8°, median localization condition = 104.5°, one-tailed Wilcoxon signed rank test, p = 0.001, q(FDR) < 0.05], and in right CM [median identification condition = 91.2°, median localization condition = 85.0°, p = 0.003, q(FDR) < 0.05; Fig. 5B]. Figure 5C shows the population RAFs, which also reflect the sharpening of spatial selectivity in the left core and right CM during active sound localization.
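The task comparisons above rest on one-tailed Wilcoxon signed rank tests followed by FDR correction. As a rough illustration only, the following sketch computes an exact one-tailed signed rank p-value and Benjamini-Hochberg q-values in plain Python. The function names, the choice of the Benjamini-Hochberg procedure (the text does not state which FDR method was used), and the example tuning widths are assumptions made for illustration; they are not the analysis code or the data of this study.

```python
def wilcoxon_signed_rank_less(x, y):
    """Exact one-tailed p-value for the alternative that x tends to be smaller than y.

    Zero differences are dropped; tied absolute differences get average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    # Store ranks doubled so that average ranks of ties stay integers.
    doubled = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled[order[k]] = i + j + 2  # 2 * mean of ranks (i+1)..(j+1)
        i = j + 1
    w_plus = sum(r for r, d in zip(doubled, diffs) if d > 0)
    # Count the sign assignments that produce each (doubled) positive rank sum.
    total = sum(doubled)
    counts = [1] + [0] * total
    for r in doubled:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[: w_plus + 1]) / 2 ** n


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotonic.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical per-participant tuning widths (degrees) for one region.
loc_width = [104.1, 98.7, 110.2, 101.5, 95.3, 107.8]
ident_width = [108.9, 103.2, 109.6, 106.4, 99.0, 112.5]
p = wilcoxon_signed_rank_less(loc_width, ident_width)

# Hypothetical p-values for several regions, corrected together.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Here a small p indicates narrower tuning in the localization condition, and a region would be reported as significant only if its q-value also falls below the chosen FDR threshold.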
Next, we investigated the mechanism underlying the observed sharpening of spatial tuning in the left core and right CM during the sound localization condition. Specifically, we evaluated whether the change in spatial tuning between the two active task conditions resulted from response gain (that is, an increase of the BOLD response amplitude at the voxel's preferred location), response sharpening (a decrease of the BOLD response at the voxel's nonpreferred location), or a combination of these processes. For this comparison, we defined the voxel's preferred location as the sound azimuth location with the maximum β value in the task-independent RAF (i.e., the average RAF across the two active task conditions). Similarly, we defined the nonpreferred location as the sound azimuth location with the minimum β value in the average RAF (Lee and Middlebrooks, 2011, 2013).
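To make the distinction between response gain and response sharpening concrete, the following minimal sketch applies the definitions above to a single voxel. It assumes seven azimuth positions from −90° to +90° in 30° steps, and the function name and β values are hypothetical; the sketch illustrates the logic and is not the analysis pipeline used here.

```python
AZIMUTHS = [-90, -60, -30, 0, 30, 60, 90]  # assumed azimuth grid in degrees


def tuning_change(raf_ident, raf_loc):
    """Compare beta values at the preferred and nonpreferred location.

    Both locations are taken from the task-independent RAF, that is, the
    average of the RAFs measured in the two active task conditions.
    """
    mean_raf = [(a + b) / 2 for a, b in zip(raf_ident, raf_loc)]
    pref = max(range(len(mean_raf)), key=mean_raf.__getitem__)
    nonpref = min(range(len(mean_raf)), key=mean_raf.__getitem__)
    return {
        "preferred_azimuth": AZIMUTHS[pref],
        "nonpreferred_azimuth": AZIMUTHS[nonpref],
        # Positive value: higher response at the preferred location (gain).
        "delta_preferred": raf_loc[pref] - raf_ident[pref],
        # Negative value: lower response at the nonpreferred location (sharpening).
        "delta_nonpreferred": raf_loc[nonpref] - raf_ident[nonpref],
    }


# Hypothetical beta values for one voxel in the two active task conditions.
ident = [0.13, 0.20, 0.30, 0.39, 0.30, 0.20, 0.15]
loc = [-0.04, 0.10, 0.28, 0.40, 0.27, 0.10, 0.00]
change = tuning_change(ident, loc)
```

For this made-up voxel the response at the preferred location barely changes, while the response at the nonpreferred location drops, which is the pattern labeled response sharpening.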
In both cortical areas, the BOLD response at the preferred location was similar for the two active task conditions, while the BOLD response at nonpreferred locations was lower in the sound localization than in the sound identification condition. Specifically, Figure 6 shows that the β values for the preferred location were similar for both active task conditions [reflected by the clustering of β values around the diagonal; median β left core in sound identification (sound localization) condition = 0.39 (0.40); median β right CM in sound identification (sound localization) condition = 0.27 (0.30); Wilcoxon signed rank tests for differences between task conditions, p > 0.05]. In contrast, the BOLD response at nonpreferred locations was lower in the sound localization than in the sound identification condition [most β values are below the diagonal; median β left core in sound identification (sound localization) condition = 0.13 (−0.04); median β right CM in sound identification (sound localization) condition = 0.04 (−0.11); Wilcoxon signed rank tests; left core: p = 0.002; right CM: p = 0.014; q(FDR) < 0.05]. Thus, sharpening of spatial tuning during active sound localization was mainly the result of a decrease of BOLD signal amplitude at nonpreferred locations, that is, response sharpening.
Decoding sound azimuth location from fMRI population activity patterns
Next we evaluated whether the encoding of sound azimuth in fMRI activity patterns in the core and in PT varies with behavioral task requirements. Specifically, we applied a population-pattern decoder based on maximum likelihood estimation to the measured fMRI responses to the probe sounds in the sound identification and sound localization condition (see Materials and Methods). Figure 7 shows for each cortical area and task condition the absolute error of the population pattern decoder as a function of sound azimuth location. There was no difference in decoding performance between ipsilateral and contralateral locations: a comparison of the average absolute error between hemifields (i.e., the average absolute error across −30°, −60°, and −90°, versus the average across +30°, +60°, and +90°) did not yield significant results either for the core or for PT, in any behavioral task condition [two-sided Wilcoxon signed rank test per area and task condition, FDR corrected for multiple comparisons, all q(FDR) ≥ 0.05].
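The general idea of maximum likelihood decoding from a population of voxels can be sketched as follows. This is a simplified illustration that assumes independent Gaussian noise per voxel and seven azimuth positions in 30° steps; the decoder actually used is described in Materials and Methods and may differ in its noise model and training scheme. All names and numbers below are hypothetical.

```python
AZIMUTHS = [-90, -60, -30, 0, 30, 60, 90]  # assumed azimuth grid in degrees


def log_likelihood(response, rafs, loc_index, noise_sd):
    """Gaussian log-likelihood of one candidate location, up to a constant.

    response: measured beta per voxel; rafs: expected beta per voxel and
    location (one list per voxel); noise_sd: assumed noise level per voxel.
    """
    return -sum(
        ((r - raf[loc_index]) / sd) ** 2 / 2
        for r, raf, sd in zip(response, rafs, noise_sd)
    )


def decode(response, rafs, noise_sd):
    """Return the azimuth whose expected pattern best explains the response."""
    best = max(
        range(len(AZIMUTHS)),
        key=lambda k: log_likelihood(response, rafs, k, noise_sd),
    )
    return AZIMUTHS[best]


def absolute_error(estimate, true_azimuth):
    return abs(estimate - true_azimuth)


# Hypothetical RAFs for three voxels and one noisy response to a sound at +60 degrees.
rafs = [
    [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    [0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
    [0.1, 0.2, 0.4, 0.5, 0.4, 0.2, 0.1],
]
response = [0.52, 0.12, 0.18]
estimate = decode(response, rafs, noise_sd=[0.1, 0.1, 0.1])
error = absolute_error(estimate, 60)
```

Averaging such absolute errors over trials, separately for each true azimuth, gives an error profile of the kind plotted as a function of sound location.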
For the purpose of statistical comparisons between cortical areas and behavioral task conditions, we computed the average absolute error across azimuth positions for each area and task condition. Figure 7B shows that the population pattern decoder performed better than chance level in the left and right core in the sound localization condition. That is, in these areas and task conditions the absolute error was significantly lower than chance [one-sided Wilcoxon signed rank test, FDR corrected for multiple comparisons; median absolute error sound localization condition left core = 61.1°, right core = 62.1°, chance error = 68.6°, p = 0.009 for both regions, q(FDR) < 0.05]. Chance level was computed with a permutation testing procedure in which we randomly scrambled the RAFs of each participant (1500 iterations). In left PT, the pattern decoder also performed better than chance in the localization condition [median absolute error left PT = 58.9°, p = 9.8E−4, q(FDR) < 0.05]. Similarly, in right PT the pattern decoder performed marginally better than chance in the localization condition [median absolute error right PT = 60.0°, p = 0.051, q(FDR) = 0.076]. However, in the sound identification condition the absolute error was not lower than chance level in any cortical area (median absolute error for the sound identification condition per area: left core = 75.0°, right core = 66.4°, left PT = 70.7°, right PT = 71.8°, p > 0.05; Fig. 7B), indicating that the pattern decoder did not perform well for this behavioral condition.
We then tested for differences in sound location decoding performance for the probe sounds between task conditions, within each cortical area. This showed that the pattern decoder performed significantly better in the sound localization than in the sound identification condition in the left core region; that is, the absolute error was significantly lower [one-sided Wilcoxon signed rank test, FDR-corrected for multiple comparisons; p = 0.003, q(FDR) < 0.05; Fig. 7B]. In left PT we observed a similar task effect, but this did not reach statistical significance [p = 0.04, q(FDR) = 0.1]. Figure 7A shows that the absolute error decreased especially at the midline and in contralateral space (0° to +90°) for both the core and PT in the left hemisphere. There was no significant effect of task in the right core or in right PT (p > 0.05; Fig. 7). For the right core, this may be a consequence of the relatively high performance of the pattern decoder in the sound identification condition. In particular, sound azimuth location estimates were significantly more accurate in the right than in the left core in the sound identification condition [two-sided Wilcoxon signed rank test; p = 0.022, q(FDR) < 0.05], but not in the sound localization condition (p > 0.05; Fig. 7B).
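As a sanity check on the chance level, the expected absolute error of uniform random guessing can be worked out directly. Assuming seven equally spaced azimuth positions between −90° and +90° (an assumption based on the locations named in the text), the expectation is about 68.6°, in line with the reported chance error. The sketch below computes this value and a permutation-style estimate; note that it shuffles location labels, which is a simplification of the procedure of scrambling each participant's RAFs, and that all names are hypothetical.

```python
import random

AZIMUTHS = [-90, -60, -30, 0, 30, 60, 90]  # assumed azimuth grid in degrees


def analytic_chance_error(azimuths):
    """Mean absolute error when the guess is uniform over all locations."""
    n = len(azimuths)
    return sum(abs(a - b) for a in azimuths for b in azimuths) / n ** 2


def permutation_chance_error(azimuths, n_iter=1500, seed=0):
    """Estimate chance error by repeatedly shuffling the location labels."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_iter):
        shuffled = list(azimuths)
        rng.shuffle(shuffled)
        # Mean absolute error between true and shuffled labels in this iteration.
        errors.append(
            sum(abs(t - s) for t, s in zip(azimuths, shuffled)) / len(azimuths)
        )
    return sum(errors) / len(errors)


analytic = analytic_chance_error(AZIMUTHS)
estimated = permutation_chance_error(AZIMUTHS)
```

Because the permutation estimate is based on a finite number of iterations, it fluctuates slightly around the analytic value, which may explain small differences between chance levels reported for different analyses.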
We also tested for each task condition whether there was a difference in decoding accuracy between cortical areas. In the left hemisphere, the absolute error was lower in PT than in the core region in the sound identification condition [p = 0.0098, q(FDR) < 0.05] but not in the sound localization condition (p > 0.05). Figure 7A shows that the inter-area difference in the sound identification condition was mainly a result of lower absolute errors in PT in peripheral space. In the right hemisphere, there was no significant difference between the core and PT either in the sound identification condition (p > 0.05) or in the sound localization condition (p > 0.05). Note that the lower absolute error observed in left PT was not a consequence of a larger number of voxels in this cortical region: the inter-area effect persisted even if the number of voxels in PT included in the analysis was matched to the number of voxels in the core region (see Materials and Methods; Fig. 7C).
Finally, we applied the maximum-likelihood decoder to the fMRI activity patterns of the left and right hemisphere together: we provided the data of both hemispheres combined as input for the pattern decoder. Note that to ensure that the number of voxels on which the pattern decoder operates does not influence the sound location estimates, we randomly sampled half of the voxels in the relevant region within a hemisphere and combined this with a random sample of half of the voxels in the other hemisphere. This procedure was repeated 200 times, and we computed the absolute error of the two-hemisphere decoder as the average absolute error across those 200 iterations.
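The resampling scheme for the two-hemisphere decoder can be sketched as follows. This is a minimal illustration under simplifying assumptions: the decoder is a least-squares match (equivalent to maximum likelihood under equal Gaussian noise in all voxels), the azimuth grid has seven positions in 30° steps, and the example RAFs are noiseless and hypothetical. Function and variable names are made up for this sketch.

```python
import random

AZIMUTHS = [-90, -60, -30, 0, 30, 60, 90]  # assumed azimuth grid in degrees


def decode(response, rafs):
    """Return the azimuth whose expected pattern is closest to the response."""
    def squared_error(k):
        return sum((r - raf[k]) ** 2 for r, raf in zip(response, rafs))
    return AZIMUTHS[min(range(len(AZIMUTHS)), key=squared_error)]


def bilateral_error(left_rafs, right_rafs, left_resp, right_resp,
                    true_azimuth, n_iter=200, seed=0):
    """Average absolute error over random half-samples of each hemisphere."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_iter):
        # Draw half of the voxels in each hemisphere without replacement.
        li = rng.sample(range(len(left_rafs)), len(left_rafs) // 2)
        ri = rng.sample(range(len(right_rafs)), len(right_rafs) // 2)
        rafs = [left_rafs[i] for i in li] + [right_rafs[i] for i in ri]
        resp = [left_resp[i] for i in li] + [right_resp[i] for i in ri]
        errors.append(abs(decode(resp, rafs) - true_azimuth))
    return sum(errors) / n_iter


# Hypothetical, noiseless example: four voxels per hemisphere, sound at +60 degrees.
slopes = (0.1, 0.2, 0.3, 0.4)
left_rafs = [[k * s for k in range(7)] for s in slopes]
right_rafs = [[(6 - k) * s for k in range(7)] for s in slopes]
true_index = AZIMUTHS.index(60)
left_resp = [raf[true_index] for raf in left_rafs]
right_resp = [raf[true_index] for raf in right_rafs]
err = bilateral_error(left_rafs, right_rafs, left_resp, right_resp, 60)
```

Sampling half of the voxels from each hemisphere keeps the total number of voxels entering the bilateral decoder comparable to that of a single-hemisphere decoder, so any improvement cannot be attributed to a larger input.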
Figure 8 shows that combining the activity patterns in the two hemispheres resulted in lower absolute errors when decoding azimuth position for probe sounds in the sound identification condition, but not for probe sounds in the sound localization condition. Specifically, absolute error scores were lower than chance level in the sound identification condition in both the core and PT [median absolute error core = 62.4°, median absolute error PT = 59.3°, chance error = 68.8°, p = 0.03 and p = 0.009 respectively, q(FDR) < 0.05]. In addition, the absolute error in PT was lower for the combined data than for either the left PT only [p = 0.016, q(FDR) < 0.05], or the right PT only [p = 0.003, q(FDR) < 0.05]. Inspecting absolute error as a function of sound azimuth location (Fig. 8A) shows that combining the data of left and right PT resulted in lower absolute error scores mainly in the periphery (−90°, −60°, +60°, and +90°). In contrast, for the core the combination of the data of the left and right hemisphere resulted in more accurate azimuth estimates compared with the left core (p = 0.002), but not compared with the right core (p > 0.05). Further, the absolute error as a function of sound azimuth position (Fig. 8A) shows that the absolute errors resulting from the combined data were similar to those resulting from the decoder operating on the right core only. This indicates that the azimuth estimates resulting from the pattern decoder operating on the core in two hemispheres are driven by the activity patterns in the right core, rather than showing an improvement larger than the available information in either core.
Discussion
The major findings of the present study are that spatial selectivity in the left primary auditory core and right area CM is dynamic and dependent on behavioral requirements, that fMRI activity patterns in the left core carry more information on sound azimuth location when participants engage in a sound-localization task (compared with a task unrelated to sound localization), and that integrating fMRI activity patterns measured during a “what” task, but not during a “where” task, across bilateral PT results in more accurate sound azimuth location estimates than in either left or right PT separately. Together, these results highlight the adaptive potential of spatial tuning in A1 based on behavioral demands. A possible mechanism for the observed task-modulation of spatial sensitivity in A1 is feedback from functionally specialized regions (PT) to this cortical area. Specifically, such feedback connections from higher-order to primary regions may be modulated by behavioral requirements to enable dynamic spatial sensitivity in the latter. Finally, these findings provide new insights into models of sound location encoding in unilateral and bilateral human auditory cortex.
Dynamic spatial tuning in human auditory cortex
Posterior auditory cortical regions are thought to be part of a functionally specialized stream for sound location processing in animals (Tian et al., 2001; Stecker and Middlebrooks, 2003; Stecker et al., 2005; Harrington et al., 2008; Lomber and Malhotra, 2008) and humans (Alain et al., 2001; Arnott et al., 2004; Brunetti et al., 2005; Ahveninen et al., 2006; Deouell et al., 2007; Derey et al., 2016). Although we replicate these inter-area differences in spatial selectivity between primary core and higher-order areas, and specifically the advantage of caudal belt regions, that have been reported previously for passive listening or non-spatial task conditions, we also show that these differences are reduced in the left core and right CM when humans engage in an active sound localization task. Thus, our findings indicate that, depending on the behavioral requirements, primary auditory areas may contribute to sound location processing as well.
Such task-dependent modulations of spatial sensitivity have not previously been observed in humans. Zimmer and Macaluso (2005) reported a relationship between the level of activity in posterior auditory regions and successful sound localization, but did not investigate cortical spatial selectivity. Further, a recent neuroimaging study in humans did not report a modulation of either ILD or ITD selectivity based on task performance (Higgins et al., 2017). Yet, in the latter study, the authors considered binaural cue response functions averaged across all auditory responsive voxels within the auditory cortex, which may have diluted the results. That is, our analyses show that task modulations of spatial selectivity are localized specifically in the left core and right CM.
Our findings in human auditory cortex are compatible with animal studies showing that the performance of both spatial and non-spatial tasks affects neuronal receptive fields in A1 (Fritz et al., 2003; Otazu et al., 2009; Lee and Middlebrooks, 2011). One hypothesis is that higher-order, functionally specialized cortical areas, such as PT, modulate spatial tuning in A1 via back-projections. In particular, our data are compatible with theoretical frameworks of sensory processing such as the reverse hierarchy (Ahissar et al., 2009) and recurrent processing models (Lamme and Roelfsema, 2000; Bullier, 2001). Similar to visual cortex, the auditory cortex is characterized by dense reciprocal connections between primary and higher-order cortical areas (Kaas and Hackett, 2000; Lee and Winer, 2011). Lateral prefrontal cortex (PFC) may mediate such feedback processing: lateral PFC is known to project back to early regions of the lateral auditory belt (Romanski et al., 1999) and has been implicated in a two-stage model of categorization of sounds (Jiang et al., 2018).
Differences in sound location processing between the left and right auditory pathway
In humans, lesion and functional imaging studies suggest that the right (sub)cortical pathway may contain a representation of the entire acoustic azimuth, while in the left (sub)cortical pathway the representation of the contralateral acoustic azimuth is thought to be predominant (Zatorre and Penhune, 2001; Krumbholz et al., 2005; Spierer et al., 2009; Briley et al., 2013; Higgins et al., 2017). Differential spatial processing between the left and right auditory pathway has also been observed in several animal species. For instance, Day and Delgutte (2013) observed in rabbit inferior colliculus a gradient of deteriorating sound location decoding accuracy from locations at the midline toward the periphery. In contrast, in monkeys, Miller and Recanzone (2009) observed the most accurate sound location decoding in contralateral space in areas A1 and CL, with low decoding accuracies at the midline and especially in ipsilateral space: the magnitude of sound location estimation errors in the ipsilateral hemifield and around the midline was distinctly higher than the errors observed in the present study. Only in area R were decoding errors lower around the midline than in either ipsilateral or contralateral space. Here we did not observe a difference in location decoding accuracy between ipsilateral and contralateral space either for the left or right auditory cortex. Yet, our results did reflect sharper spatial tuning in the right than left core when the task was unrelated to sound location (the “what” task), which may be a reflection of the hypothesized right dominance for human spatial hearing. Future research with noninvasive lesion techniques in humans combined with advanced neuroimaging and computational modeling studies is required to elucidate these potential differences between the left and right human auditory pathway.
Integrating information on sound azimuth location across hemispheres
Our results show that the integration of sound location processing across the two hemispheres may be task dependent. Specifically, location estimates based on fMRI activity patterns in bilateral PT were more accurate than those based on either left or right PT independently for the task condition unrelated to sound localization (“what” task), although this bilateral advantage was not present during active localization (“where” task). For the core region, we also observed a bilateral advantage for the “what” task compared with the left core separately, but not compared with the right core. This suggests that the bilateral advantage is merely a reflection of the more accurate decoding obtained for the right core in itself. Similar to PT, no bilateral decoding improvement was observed during active sound localization for the core region. Thus, fMRI activity patterns in left and right PT, and possibly in the left and right core, contain complementary information on sound azimuth location when participants are not engaged in active sound localization, resulting in better location estimates when the information in the two hemispheres is combined. In contrast, information in the two hemispheres appears to be overlapping during active sound localization, such that combining the information across the hemispheres appears to be redundant during this behavioral condition.
This may be explained by a task-dependent strength of functional callosal connections. In particular, in macaques there are major interhemispheric connections both between the left and right core, and between left and right parabelt (Kaas and Hackett, 2000). If similar callosal connections between bilateral primary and higher-order auditory cortices exist in humans, it is conceivable that the functional connectivity between left and right PT is stronger during active sound localization than during nonlocalization tasks. As a consequence, spatial processing in left PT may modulate spatial processing in right PT during active localization (and vice versa), whereas spatial information in left and right PT is relatively independent, and thus complementary, during non-spatial tasks. Alternatively, corticofugal projections (Winer and Schreiner, 2005) may strengthen during active sound localization, and thereby indirectly modulate sound location processing in the contralateral hemisphere.
The observed task-dependency of bilateral integration of information is also of interest for the ongoing debate about the computational mechanisms underlying sound location processing in mammals. In particular, models for neural population coding of sound azimuth location have received wide attention in recent years, including population coding within a single hemisphere (unilateral population coding; Miller and Recanzone, 2009; Day and Delgutte, 2013), unilateral opponent population coding based on two oppositely tuned channels within a single hemisphere (i.e., an ipsilaterally and a contralaterally tuned channel; Stecker et al., 2005), and bilateral opponent population coding based on combining the sound azimuth information of contralaterally tuned channels in each hemisphere (McAlpine et al., 2001; Derey et al., 2016; Ortiz-Rios et al., 2017). Our current results suggest that the degree to which information is combined across hemispheres may be dependent on behavioral requirements, indicating that unilateral and bilateral models of sound location encoding may not be mutually exclusive.
Footnotes
This work was supported by a European Research Council Grant under the European Union Seventh Framework Programme for Research 2007–2013 (Grant 295673), a European Union's Horizon 2020 Research and Innovation Programme Grant 645553, ICT DANCE (IA, 2015–2017; B.d.G.), a PIRE Grant from the U.S. National Science Foundation (OISE-0730255; J.P.R.), NIH Grants R01EY018923 and R01DC014989 (J.P.R.); partial support from the Technische Universität München Institute for Advanced Study, funded by the German Excellence Initiative and the European Union Seventh Framework Programme Grant 291763 (J.P.R.), and NWO Vici-Grant 453-12-002 and the Dutch Province of Limburg (E.F.), and funding for a research exchange from the Erasmus Mundus Auditory Cognitive Neuroscience Network (K.H.).
The authors declare no competing financial interests.
Correspondence should be addressed to Dr. Beatrice de Gelder, Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, 6200 MD, Maastricht, The Netherlands. b.degelder@maastrichtuniversity.nl