Introduction

Listeners can determine a speaker’s sex from their vocal signals with high accuracy (Coleman 1971; Lass et al. 1976). Speaker sex can be assessed rapidly, with listeners being able to identify sex from vowel segments lasting under 2 glottal cycles (i.e., two cycles of the vocal folds in the larynx opening and closing to produce a buzzing sound; Owren et al. 2007). Listeners can furthermore successfully perceive speaker sex from drastically degraded or manipulated vocal signals, such as sine-wave speech and noise-vocoded speech with as few as 3 channels (Gonzalez and Oliver 2005).

The perceptual cues assumed to allow listeners to distinguish male from female voices are linked to sex-specific anatomical features of the vocal tract: due to the pronounced sexual dimorphism of the human larynx and vocal folds, males on average have longer and thicker vocal folds than females, as well as longer vocal tracts (Titze 1989). Longer and thicker vocal folds lower the fundamental frequency of the voice (F0, broadly perceived as voice pitch), while a longer vocal tract reduces the spacing of the formants in vocal signals. Thus, males and females differ anatomically in ways that affect both the source signal (i.e., F0; the buzzing sound created through the vibration of the vocal folds in the larynx) and the filter characteristics (i.e., formants; resonant characteristics of the vocal tract, determined by its shape and size; see the source-filter model; Fant 1960), making male and female voices relatively distinct from each other. Studies using perceptual judgements and computational approaches have indeed shown that acoustic cues, such as these differences in F0 and formant characteristics, are crucial for determining speaker sex from vocal signals produced in a neutral voice (Bachorowski and Owren 1999; Skuk and Schweinberger 2014). The salience of these cues for speaker sex identification is highlighted in a study by Mullennix et al. (1995): the authors shifted F0 and formant frequencies in vocalizations and were thereby able to create continua of vocalizations that listeners perceived as morphing from male to female.
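As a rough illustration of why vocal tract length shapes formant spacing (a textbook simplification, not part of the analyses reported here), the vocal tract can be approximated as a uniform tube closed at the glottis and open at the lips; the tube lengths below are illustrative values only:

```latex
% Quarter-wavelength (uniform-tube) approximation: a vocal tract of length L,
% closed at the glottis and open at the lips, resonates at odd multiples of c/4L.
F_n \approx \frac{(2n-1)\,c}{4L}, \qquad n = 1, 2, 3, \ldots
% With c \approx 35{,}000 cm/s, an illustrative L = 17.5 cm gives F1 \approx 500 Hz
% and F2 \approx 1500 Hz, whereas L = 14.5 cm gives F1 \approx 600 Hz and
% F2 \approx 1810 Hz: shorter vocal tracts yield higher, more widely spaced formants.
```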

While both formant frequencies and F0 (alongside other acoustic measures) play an important role in determining speaker sex, it has been argued that F0 may be the more salient cue for speaker sex judgements: Lass et al. (1976) showed that removing the source signal (which encodes F0 information) by using whispered speech affects participants’ judgements of speaker sex more drastically than low-pass filtering the stimuli (which removes most of the filter information, i.e., the formants, while retaining F0). Several other studies comparing the contributions of formant frequencies and F0 to speaker sex perception also conclude that F0 is the more salient acoustic cue (Gelfer and Bennett 2013; Poon and Ng 2015; Whiteside 1998). Honorof and Whalen (2010) reported that when F0 is volitionally manipulated by a speaker within their natural range when producing isolated vowels, miscategorizations of speaker sex occur at the extremes of the F0 range, with high F0 being identified as female and low F0 as male. Similarly, Bishop and Keating (2012) report that in the context of variable pitch, male voices are most accurately identified when the vocal signals produced have an F0 lower than 200 Hz, while the reverse is true for female voices. These studies show that changes in salient acoustic cues, whether through explicit volitional voice modulations or synthetic manipulations of the stimuli, can affect the accuracy of speaker sex judgements from voices.

Thus, sex perception from voices can be affected using stimuli designed to be ambiguous, be they artificially manipulated signals or volitionally produced physiological extremes. However, voices and their acoustic properties, such as F0, are highly variable and flexible in everyday use (Lavan et al. 2018c): speakers dynamically modulate their voices depending on the speaking environment, communicative intent, or physiological and psychological state. One major modulator of the voice is a person’s affective state: a large body of literature has shown that affective tone in vocal signals impacts the acoustic and perceptual properties of these signals relative to neutral vocalizations, and that these properties also differ between different emotional vocalizations (see Juslin and Laukka 2003; Sauter et al. 2010). For example, modulations of F0, speech rate (for emotional speech), spectral features, periodicity and amplitude (see acoustic analyses in the “Appendix” section for descriptions of these measures) have all been reported in comparisons of emotional and neutral vocalizations. Additionally, recent research has shown that spontaneous (emotional) vocalizations differ from volitionally produced exemplars of the same type of vocalization, most prominently for laughter: based on differences in their production mechanisms, significant differences in acoustic properties (including F0) and affective properties have been reported between volitional and spontaneous vocalizations (Bryant and Aktipis 2014; Lavan et al. 2016a; Ruch and Ekman 2001). Notably, two recent studies have also reported reduced performance in a speaker discrimination task for spontaneous laughter (contrasted with volitional laughter; Lavan et al. 2016b, 2018a).

The above research on emotional vocalizations, and the natural variability within them, reflects a general movement within current theoretical and empirical approaches to the study of nonverbal behavior. It is now more broadly acknowledged that “one size fits all” labels such as “laughter” and “crying” are often insufficient to account for the complex context-dependency of natural behaviors and how they are perceived (Anikin and Lima 2017; Martin et al. 2017; Sauter 2010; Sauter and Scott 2007). Importantly, evidence suggests that variations within nonverbal behaviors have an impact not only on (affective) state evaluations, but also on the perception of stable indexical characteristics (i.e., identity); to develop better models of how humans perceive other people, we must therefore understand how person perception operates across the range of natural human behaviors.

The current study thus explored sex identification from naturalistic volitional and spontaneous vocal signals. In a first experiment, participants performed a speaker sex identification task on Spontaneous Laughter (LaughterS), Spontaneous Crying (CryingS), and Vowels (‘staccato vowels’; see Fig. 1 for example waveforms and spectrograms). Given previous findings of impaired person perception from spontaneous vocalizations (Lavan et al. 2016b, 2018a), we hypothesized that the perception of speaker sex would be impaired for spontaneous (emotional) vocalizations, with listeners’ performance for LaughterS and CryingS being significantly less accurate than for Vowels, while performance for LaughterS and CryingS should be similar. We furthermore predicted that these effects should be reflected in reaction times: speaker sex perception from spontaneous vocalizations should be associated with increased task difficulty, which would lead to longer reaction times. Based on previous studies highlighting the importance of F0 for speaker sex perception, we also expected that changes in performance could be linked to variation in F0 between the different types of vocalization.

Fig. 1

Waveforms (top panels) and spectrograms (bottom panels) of the vocalization types used in Experiment 1 and 2: Spontaneous Laughter (LaughterS), Volitional Laughter (LaughterV), Spontaneous Crying (CryingS) and Vowels (‘staccato vowels’). Darker shading on the spectrogram represents higher intensity

Experiment 1

Method

Participants

44 participants (24 female; mean age = 20.9 years, SD = 1.2 years) were recruited at the Department of Psychology at Royal Holloway, University of London and received course credit for their participation. All participants had normal or corrected-to-normal vision and did not report any hearing difficulties. Ethical approval was obtained from the Departmental Ethics Committee at the Department of Psychology at Royal Holloway, University of London. None of the participants was familiar with the speakers used.

Materials

LaughterS, CryingS and Vowels were recorded from 5 speakers (3 male, 2 female, age range: 23–46 years) in a soundproof, anechoic chamber at University College London. Recordings were obtained using a Bruel and Kjaer 2231 Sound Level Meter, recorded onto a digital audio tape recorder (Sony 60ES; Sony UK Limited, Weybridge, UK) and fed to the S/PDIF digital input of a PC sound card (M-Audio Delta 66; M-Audio, Iver Heath, UK) with a sampling rate of 22,050 Hz. The speakers were seated at a distance of 30 cm at an angle of 15° to the microphone. LaughterS was elicited from speakers while watching or listening to amusing sound or video clips (see McGettigan et al. (2015) for a detailed description of the recording procedure). For CryingS, speakers recalled upsetting events and/or initially posed crying to encourage a transition into spontaneous crying associated with genuine felt sadness. Crucially, based on informal questions, each speaker reported genuine feelings of amusement and sadness during and after these recording sessions.

In a pilot study, a group of listeners (N = 13) provided ratings of arousal (“How aroused is the person producing the vocalization?”, with 1 denoting “the person is feeling very sleepy and drowsy” and 7 denoting “the person is feeling very alert and energetic”), valence (“How positive or negative is the person producing this vocalization feeling?”, with 1 denoting “very negative” and 7 denoting “very positive”), control over the vocalizations (“How much control did the person have over the production of the vocalization?”, with 1 denoting “none at all” and 7 denoting “full control”) and authenticity (“How authentic is the vocalization?”, with 1 denoting “not authentic at all” and 7 denoting “very authentic”). Note that volitional laughter and crying were included in this pilot study as well. These pilot ratings established that participants reliably rate spontaneous laughter and crying as higher in arousal and authenticity, lower in control over the production of the vocalization, and more extreme in valence (more positive for laughter and more negative for crying, respectively) than their volitional counterparts. The speakers also produced series of short vowels (‘staccato vowels’; /a/, /i/, /e/, /u/, /o/, average vowel duration within a series = .35 s) with a relatively stable pitch (mean F0: males = 140.12 Hz, SD = 28.5 Hz; females = 250.61 Hz, SD = 33.08 Hz) to preserve a percept of neutral affective valence. This type of volitional, non-emotional stimulus was chosen as its acoustic structure broadly resembles laughter and crying, given all three vocalizations are based on series of vocalic bursts (see Fig. 1). Individual vocalization exemplars were extracted from the recordings and normalized for RMS amplitude using Praat (Boersma and Weenink 2010).
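The RMS amplitude normalization was performed in Praat; purely as an illustration, a minimal R sketch of the same operation (assuming the tuneR package and a hypothetical target RMS value) might look as follows:

```r
# Illustrative RMS normalization in R with the tuneR package; the stimuli themselves
# were normalized in Praat, so this is an assumed equivalent, not the original procedure.
# `target_rms` is a hypothetical value.
library(tuneR)

normalize_rms <- function(infile, outfile, target_rms = 0.1) {
  wav <- readWave(infile)
  samples <- wav@left / (2^(wav@bit - 1))      # rescale samples to the range [-1, 1]
  gain <- target_rms / sqrt(mean(samples^2))   # gain needed to reach the target RMS
  out <- Wave(left = round(samples * gain * (2^(wav@bit - 1))),
              samp.rate = wav@samp.rate, bit = wav@bit)
  writeWave(out, outfile)
}
```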

Based on the ratings collected for a larger set of vocalizations in the pilot study, 25 stimuli per vocalization (5 per speaker) were selected, choosing series of vowels that were neutral in valence (M = 3.92, SD = .16) and low in arousal (M = 2.68, SD = .29), and spontaneous laughter and crying exemplars that were high in arousal (CryingS: M = 3.79, SD = .42; LaughterS: M = 4.78, SD = .76; t[48] = 5.69, p < .001, Cohen’s d = 1.643) and authenticity (CryingS: M = 3.58, SD = .81; LaughterS: M = 4.79, SD = .90; t[48] = 5.02, p < .001, Cohen’s d = 1.449); note that the stimulus set did not allow arousal or authenticity to be matched between LaughterS and CryingS. All three vocalization sets were matched for duration (Vowels: M = 2.55 s, SD = .28; CryingS: M = 2.61 s, SD = .30; LaughterS: M = 2.47 s, SD = .36; F(2,48) = 1.31, p = .280, ηp² = .052).

A detailed analysis of the acoustic features of the stimuli can be found in the “Appendix” section. Note that all instances of laughter and crying used in the experiments reported here included voiced portions to allow us to measure F0. Such voiced vocalizations represent only a subset of laughs and cries, and many unvoiced variants of these vocalizations have been described elsewhere (for laughter, see Bachorowski and Owren 1999).

Procedure

Participants were seated in front of a computer screen. Auditory stimuli were presented at a comfortable volume via headphones (Sennheiser HD 201), using MATLAB (Mathworks, Inc., Natick, MA) with the Psychophysics Toolbox extension (http://psychtoolbox.org/). Participants were presented with 75 stimuli in total (25 per vocalization: Vowels, LaughterS, and CryingS) in fully randomized order. During the presentation of the sounds, a fixation cross was shown on the screen, which was then replaced by a prompt asking participants to indicate whether the speaker was male or female (two-way forced choice) via a keyboard press. All trials were timed, giving participants 2.5 s to make a response before the experiment automatically moved on to the next trial. Participants were asked to respond as quickly as possible based on their first impression. Reaction times were recorded from the offset of the sound. The data collected was checked for item-effects: item-wise accuracy scores all fell within 3 standard deviations of the vocalization-specific means, thus no items were excluded from further analyses.
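For illustration, the item-level screening described above could be implemented along the following lines; this is a minimal sketch assuming a trial-level data frame `trials` with hypothetical columns (item, vocalization, accuracy), not the authors’ code:

```r
# Minimal sketch of the item-level screening: flag items whose accuracy falls more
# than 3 SDs from the vocalization-specific mean. Column names are hypothetical.
library(dplyr)

item_means <- trials %>%
  group_by(vocalization, item) %>%
  summarise(item_acc = mean(accuracy), .groups = "drop")

flagged <- item_means %>%
  group_by(vocalization) %>%
  mutate(z = (item_acc - mean(item_acc)) / sd(item_acc)) %>%  # distance from vocalization-specific mean
  filter(abs(z) > 3)  # items beyond 3 SDs would be excluded

nrow(flagged)  # 0 in the data reported here: no items excluded
```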

Results

Speaker Sex Perception from Spontaneous Laughter, Spontaneous Crying and Vowel Sounds

To explore whether sex recognition differed for the three vocalizations, we ran a generalized linear mixed effects analysis using lme4 (Bates et al. 2014) in the R environment (R Core Team 2013). We defined a model that predicted binary accuracy codes (correct versus incorrect) from vocalization type and speaker sex, with speaker and participant entered as random effects. As vocalizations associated with higher F0 (here: LaughterS and CryingS versus Vowels) may affect accuracy differently for male versus female vocalizations (e.g., Bishop and Keating 2012; Honorof and Whalen 2010), we included an interaction between speaker sex and vocalization type as a fixed effect in the model. Some of the variance was explained by both the speaker effect (variance = .984, SD = .992) and the participant effect (variance = .033, SD = .183). Statistical significance was established by likelihood ratio tests contrasting the full model (fixed effects plus random effects) with a model that did not include the interaction term (Winter 2013). The interaction between speaker sex and vocalization type was highly significant (χ²[2] = 132.32, p < .001). Planned post hoc contrasts between vocalizations, computed separately for each speaker sex using the R package lsmeans (Lenth 2016; alpha corrected for 6 comparisons), showed that for male vocalizations accuracy was significantly higher for Vowels compared to LaughterS and CryingS (Zs > 6.53, ps < .001, estimates > 2.67), in line with our prediction that performance should be worse for spontaneous vocalizations. Against predictions, accuracy was furthermore lower for LaughterS compared to CryingS (Z = 4.69, p < .001, estimate = .90; see Fig. 2a). For female vocalizations, a different pattern of results emerged: performance for Vowels was significantly worse compared to CryingS (Z = 3.86, p = .001, estimate = .88), while the remaining two comparisons did not reach significance (Zs < 2.22, ps > .026, estimates < .08).
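For illustration, the model comparison described above could be set up as follows; this is a minimal sketch based on the description in the text (random intercepts assumed, hypothetical column names), not the authors’ original code:

```r
# Minimal sketch of the accuracy analysis. Assumes a trial-level data frame `trials`
# with hypothetical columns: accuracy (0/1), vocalization, speaker_sex, speaker, participant.
library(lme4)
library(lsmeans)

# Full model: interaction of vocalization type and speaker sex,
# random intercepts for speaker and participant (assumed structure).
full <- glmer(accuracy ~ vocalization * speaker_sex +
                (1 | speaker) + (1 | participant),
              data = trials, family = binomial)

# Reduced model without the interaction term.
reduced <- glmer(accuracy ~ vocalization + speaker_sex +
                   (1 | speaker) + (1 | participant),
                 data = trials, family = binomial)

# Likelihood ratio test for the vocalization-by-sex interaction.
anova(reduced, full)

# Planned contrasts between vocalizations within each speaker sex;
# alpha is corrected manually for the six comparisons.
lsmeans(full, pairwise ~ vocalization | speaker_sex, adjust = "none")
```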

Fig. 2
figure 2

a Average accuracy scores per vocalization for the sex identification task of Experiment 1 (N = 44). V = Vowels, CS = CryingS, LS = LaughterS. b Average reaction times per vocalization for the sex identification task of Experiment 1. c Scatterplot of proportion of ‘female’ responses per item and F0 mean per item. Significant results (p < .017) in panels (a) and (b) are highlighted

We ran a further linear mixed effects analysis on reaction times that mirrored the accuracy analysis (see Fig. 2b). Little variance was explained by the speaker effect (variance = .002, SD = .044) or participant effect (variance = .041, SD = .204). The models showed that the interaction of vocalization type and speaker sex was significant (χ²[2] = 15.792, p < .001). For the planned post hoc contrasts, degrees of freedom were calculated using the Satterthwaite approximation, as implemented in the lsmeans package (Lenth 2016). In line with our predictions, these post hoc contrasts showed that reaction times were significantly faster for Vowels compared to LaughterS and CryingS when the vocalizations were produced by males (LaughterS vs. Vowels: t[3109.97] = 6.09, p < .001, estimate = .12; CryingS vs. Vowels: t[3109.87] = 5.089, p < .001, estimate = .15). Reaction times for LaughterS and CryingS were comparable (t[3109.97] = 1.01, p = .311, estimate = .02). The three planned contrasts for vocalizations produced by females were not significant (ts < 1.6, ps > .109, estimates < .03). For full model specifications and outputs, please see the “Appendix” section.
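The reaction-time analysis can be sketched in the same way. Note that the sketch below uses lmerTest to obtain Satterthwaite-approximated degrees of freedom, whereas the authors report computing the planned contrasts with lsmeans, so this is an assumed near-equivalent rather than the original analysis:

```r
# Minimal sketch of the reaction-time analysis; `rt` and the other column names are
# hypothetical, as before. lmerTest provides Satterthwaite-approximated degrees of freedom.
library(lmerTest)

rt_full <- lmer(rt ~ vocalization * speaker_sex +
                  (1 | speaker) + (1 | participant),
                data = trials, REML = FALSE)

rt_reduced <- lmer(rt ~ vocalization + speaker_sex +
                     (1 | speaker) + (1 | participant),
                   data = trials, REML = FALSE)

# Likelihood ratio test for the interaction term.
anova(rt_reduced, rt_full)

# Fixed-effect t tests with Satterthwaite degrees of freedom.
summary(rt_full)
```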

Taken together, these analyses partially support the prediction that speaker sex perception is more difficult for spontaneous vocalizations, in this case LaughterS and CryingS, compared to a volitional vocalization (here, series of vowels). The predicted pattern was, however, only apparent for vocalizations produced by males. Additionally, and against predictions, we also found differences between LaughterS and CryingS. These speaker sex-specific effects can be explained by systematic differences in F0 between male and female voices: laughter and crying both show increased F0 for male and female speakers compared to the vowel stimuli (see “Appendix” section). Increased F0 in male vocalizations has been shown to lead listeners to perceive such signals as coming from a female (Bishop and Keating 2012; Honorof and Whalen 2010); for female vocalizations, however, a higher F0 will not lead to such changes in speaker sex perception. Both of these trends are reflected in the results reported above.

Linking Speaker Sex Judgement Responses to F0

To further explore whether F0 was a salient cue for sex perception from the three vocalizations, we attempted to link the mean F0 of each stimulus to participants’ responses in the speaker sex perception task, using a generalized linear mixed model. Initially, we defined a model with trial-wise raw responses (male vs. female) as the dependent variable, and F0 mean per item and speaker sex as fixed factors. F0 mean was scaled and centered. Based on the acoustic consequences of the sexual dimorphism of the human vocal tract and the results from the accuracy analyses, we hypothesized that higher F0 in females should increase ‘female’ responses, while higher F0 in males should decrease ‘male’ responses (e.g., Bishop and Keating 2012; Honorof and Whalen 2010). We therefore also modeled an interaction between speaker sex and F0 mean, mirroring the analyses of accuracy and reaction times described above. Vocalization type, speaker, and participant were entered as random factors. Some variance was explained by the speaker effect (variance = 1.248, SD = 1.117), the vocalization type effect (variance < .001, SD < .001), and the participant effect (variance = .318, SD = .564). Significance of the fixed effects was determined via model comparisons, in which the full model was compared to a reduced model (full model minus the interaction). As predicted, there was an interaction between F0 mean and speaker sex for participants’ responses (χ²[1] = 39.24, p < .001): the increase in “female” responses with increasing F0 was more pronounced for vocalizations produced by males than for vocalizations produced by females (see Fig. 2c). Further models for all male trials as well as all female trials established that the trends for male and female vocalizations were both significant (χ²s[1] > 10.4, ps < .002).
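A minimal sketch of this response model, again with hypothetical column names and an assumed random-intercept structure, could look as follows:

```r
# Minimal sketch of the response model: probability of a 'female' response as a
# function of item-wise mean F0 and speaker sex. Column names are hypothetical.
library(lme4)

trials$f0_z <- as.numeric(scale(trials$f0_mean))  # scale and center item-wise mean F0

resp_full <- glmer(resp_female ~ f0_z * speaker_sex +
                     (1 | vocalization) + (1 | speaker) + (1 | participant),
                   data = trials, family = binomial)

resp_reduced <- glmer(resp_female ~ f0_z + speaker_sex +
                        (1 | vocalization) + (1 | speaker) + (1 | participant),
                      data = trials, family = binomial)

# Likelihood ratio test for the F0-by-speaker-sex interaction.
anova(resp_reduced, resp_full)
```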

Discussion

In the current experiment, we explored whether emotional vocalizations produced under reduced volitional control affect the perception of speaker sex. There were marked differences in how vocalizations produced by males versus females were affected: in line with our predictions, performance was impaired for spontaneous male vocalizations, that is, for LaughterS and CryingS compared to Vowels. For female vocalizations, however, this pattern was not apparent. For male vocalizations, reaction times were furthermore significantly longer for spontaneous versus volitional vocalizations (here: Vowels), indicating greater task difficulty for sex judgements from spontaneous vocalizations.

Such impaired performance for the perception of speaker characteristics has previously been reported for speaker discrimination tasks: listeners were less successful at correctly discriminating speakers from spontaneous laughter compared to volitional laughter (Lavan et al. 2016b), even when the spontaneous laughter was matched to volitional laughter in perceived authenticity and arousal (Lavan et al. 2018a). As is the case with cues to speaker identity, cues to speaker sex encoded within the same acoustic properties are also affected during spontaneous vocalization production, becoming ‘overwritten’ or perceptually less salient. Previous research has shown that F0 and, to a lesser degree, formant measures are perceptually salient cues for the identification of speaker sex (Bishop and Keating 2012; Gelfer and Bennett 2013; Honorof and Whalen 2010; Lass et al. 1976; Mullennix et al. 1995). For speaker sex, global modulations of F0 during laughter and crying result in less marked differences between male and female vocalizations. These changes in diagnostic cues could explain the current results: F0 in laughter and crying produced by males matches, or may even exceed, F0 values usually associated with female vocalizations in the context of speech sounds (> 350 Hz, see “Appendix” section). Lower F0 in vocal signals is generally associated with male speakers, while higher F0 is associated with vocalizations produced by females. Salient acoustic features, such as F0, are modulated drastically in males for spontaneous vocalizations, approximating (and at times exceeding, see “Appendix” section) F0 values frequently encountered in spoken vocalizations produced by females, thus making sex judgements for spontaneous male vocalizations less reliable. For female vocalizations of increasing F0, no such “category boundary” for speaker sex is crossed, explaining the lack of clear effects for this group of stimuli. This interpretation of the data is further backed up by analyses showing that a higher F0 leads to relatively more identifications of male vocalizations as female, compared to vocalizations produced by females. Our study thus shows that naturalistic modulations of salient acoustic features, such as F0, can disrupt speaker sex perception in spontaneous non-verbal vocalizations.

Despite LaughterS and CryingS both being spontaneous vocalizations, performance for LaughterS was significantly lower compared to CryingS. This may be an indication of vocalization-specific effects, although the underlying mechanisms cannot be further probed in this data set. Alternatively, the effect could be driven by continuous perceived affective properties of the vocalizations: LaughterS was significantly higher in perceived arousal than CryingS, which could explain the pattern of results. Notably, there are also close links between arousal and F0 (e.g., Juslin and Laukka 2003; Lavan et al. 2016a).

While performance was impaired for spontaneous vocalizations produced by males, mean accuracy was nonetheless high for most vocalizations (raw accuracy > 70%, and close to 100% for some conditions). In line with previous studies that report above-chance accuracy for judgements of speaker sex despite acoustic manipulations of the signal (Bishop and Keating 2012; Honorof and Whalen 2010; Lass et al. 1995; Mullennix et al. 1995), this finding confirms that the perception of speaker sex remains largely robust, despite drastic changes introduced to the signal: if one salient acoustic cue such as F0 is modulated to become relatively less salient and diagnostic, other acoustic cues such as formant frequencies may remain informative to listeners and gain importance during perception (e.g., Gelfer and Bennett 2013; Smith and Patterson 2005).

From the current experiment, it cannot yet be determined whether changes in performance are due to differences in the degree of volitional control over voice production (spontaneous vs. volitional), effects of vocalization type (vowels vs. laughter vs. crying), or effects of perceived arousal. In Experiment 2, we addressed these issues by contrasting volitional and spontaneous laughter, which can be classed as a single type of vocalization but which differ in the degree of affective tone and volitional control. If differences in vocalization type modulate performance, performance for LaughterS and LaughterV should be comparable, while performance for Vowels should differ. However, if reduced volitional control over production modulates performance, performance for Vowels and LaughterV should be equivalent, and higher than for LaughterS. If perceived arousal drives the effects, sex recognition accuracy should mirror the pattern of perceptual properties of the sounds (i.e., high arousal should be linked to decreases in performance). In line with the previous experiment, we predicted that these effects should be reflected in reaction times and that the accuracy for speaker sex judgements can be linked to the F0 of the vocalizations.

Experiment 2

Participants

43 participants (39 female; mean age = 19.2 years, SD = 1.1 years) were recruited at the Department of Psychology at Royal Holloway, University of London and received course credit for their participation. No participant reported any hearing difficulties. Ethical approval was obtained from the Departmental Ethics Committee. None of the participants was familiar with the speakers used.

Materials

Materials were the same as in Experiment 1, with the exception that CryingS was replaced by LaughterV produced by the same 5 speakers (see Experiment 1). The recording and elicitation procedure was as described in McGettigan et al. (2015). In short: for LaughterV, the speakers were instructed to produce natural and positive sounding laughter, without inducing a specific affective state. Thus, LaughterV was produced with full volitional control over the voice (and in the absence of amusement), while LaughterS was produced spontaneously and thus under reduced volitional control, in response to viewing and listening to amusing stimuli. LaughterV was recorded in the same session as LaughterS, with LaughterV always being recorded first to avoid carry-over effects. Based on the ratings from the pilot study (see “Experiment 1” section), 25 LaughterV stimuli (5 per speaker) were selected.

There were marked differences in perceived authenticity between LaughterV and LaughterS (LaughterV: M = 3.60, SD = .47; LaughterS: M = 4.79, SD = .90; t[48] = 5.88, p < .001, Cohen’s d = 1.697). LaughterS and LaughterV were rated as reflecting significantly higher speaker arousal than Vowels (LaughterV: t[48] = 12.79, p < .001, Cohen’s d = 3.692; LaughterS: t[48] = 13.15, p < .001, Cohen’s d = 3.796), but in close correspondence to each other (LaughterV: M = 4.39, SD = .56; LaughterS: M = 4.78, SD = .76; t[48] = 2.09, p = .042, Cohen’s d = .603). There was no perceived difference in speaker valence between the laughter types (LaughterV: M = 5.28, SD = .33; LaughterS: M = 5.23, SD = 1.06; t[48] = .21, p = .836, Cohen’s d = .061). The overall duration of the stimuli was matched (Vowels: M = 2.55 s, SD = .28; LaughterV: M = 2.32 s, SD = .37; LaughterS: M = 2.47 s, SD = .36; one-way repeated measures ANOVA: F[2,48] = 3.13, p = .053, ηp² = .115). A detailed analysis of the acoustic features of the stimuli used in this experiment can be found in the “Appendix” section.

Procedure

The experimental set up was identical to the one used in Experiment 1. Participants were presented with all 75 stimuli (25 per vocalization; Vowels, LaughterS, LaughterV) in a fully randomized order. Participants were not pre-informed about the inclusion of spontaneous and volitional laughter in the tasks. The data was checked for item-effects: item-wise accuracy scores all fell within 3 SDs of the vocalization-specific means, thus no items were excluded from further analyses.

Results

Speaker Sex Perception from Volitional and Spontaneous Laughter

Data were analyzed with a generalized linear mixed effects analysis of raw accuracy scores. Models were defined in the same way as in Experiment 1: vocalization type, speaker sex, and their interaction were included as fixed effects, and speaker and participant as random effects. Some of the variance was explained by both the speaker effect (variance = 1.177, SD = 1.09) and the participant effect (variance = .186, SD = .43). This analysis confirmed that the interaction between vocalization type and speaker sex was significant (χ²[2] = 55.73, p < .001). In line with our predictions, 6 planned post hoc contrasts showed that, for male vocalizations, accuracy was significantly lower for LaughterS compared to LaughterV and Vowels (both Zs > 6.59, both ps < .001, estimates > 1.70). Against predictions, accuracy for the two volitional vocalizations, LaughterV and Vowels, also differed (Z = 2.69, p = .007, estimate = .108; see Fig. 3a). Female vocalizations did not follow the predicted pattern: accuracy was highest for LaughterV compared to Vowels and LaughterS (Zs > 2.56, ps < .011, estimates > .65), while LaughterS and Vowels were similar (Z = .288, p = .773, estimate = .06).

Fig. 3
figure 3

a Average accuracy scores per vocalization for the sex identification task of Experiment 2 (N = 43). V = Vowels, LV = LaughterV, LS = LaughterS. b Average reaction times per vocalization for the sex identification task of Experiment 2. c Scatterplot of proportion of ‘female’ responses per item and F0 mean per item. Significant results (p < .017) in panels (a) and (b) are highlighted

We ran a further linear mixed effects analysis on reaction times, mirroring the accuracy analysis. Here, little variance was explained by the speaker effect (variance = .003, SD = .06) or the participant effect (variance = .025, SD = .16). The models showed that vocalization type had an effect on reaction times: the comparison of the full and the reduced models (minus the effects of interest) was significant (χ²[2] = 25.239, p < .001). Neither speaker sex nor the interaction of vocalization type and speaker sex was significant (χ²s < 5.112, ps > .077). Six planned post hoc contrasts showed that, for male vocalizations, reaction times were comparable for Vowels and LaughterV (Z = 1.66, p = .097, estimate = .04) and longer for LaughterS compared to Vowels (Z = 3.08, p = .002, estimate = .07). While reaction times were numerically longer for LaughterS compared to LaughterV, this comparison did not reach significance (Z = 1.42, p = .154, estimate = .02; see Fig. 3b). For female vocalizations, the planned contrasts confirmed our predictions, with reaction times being comparable for Vowels and LaughterV (Z = 1.49, p = .137, estimate = .03) but longer for LaughterS compared to Vowels (Z = 2.60, p = .102, estimate = .05) and for LaughterS compared to LaughterV (Z = 4.06, p < .001, estimate = .08).

Linking Speaker Sex Judgement Responses to F0

In parallel to the analyses conducted for Experiment 1, we initially defined a model with binary participant responses (male vs. female) as the dependent variable and F0 mean per item and speaker sex as fixed factors. We also included an interaction between speaker sex and F0 mean in the model (see “Experiment 1” section). Vocalization type, speaker, and participant were entered as random factors. Some variance was explained by the speaker effect (variance = 1.350, SD = 1.162), the vocalization type effect (variance < .001, SD < .001), and the participant effect (variance = .891, SD = .944). Significance of the effects was determined via model comparisons, in which the full model was compared to a reduced model (full model minus the factors of interest). There was a significant interaction between F0 mean and speaker sex (χ²[1] = 34.114, p < .001): with increasing F0 mean, listeners more frequently chose ‘female’ as a response for vocalizations produced by males. This was also true, albeit to a lesser extent, for female vocalizations (see Fig. 3c). Further models for all male trials as well as all female trials established that both of these trends in the data were significant (both χ²s[1] > 20.12, both ps < .001).

Discussion

By contrasting LaughterV and LaughterS, Experiment 2 explored whether the effects observed in Experiment 1 reflected processing differences for different types of vocalizations (laughter vs. vowels), differences in production mode (volitional vs. spontaneous), or differences in perceived arousal. As in Experiment 1, we saw clear differences in the patterns of results for male versus female speakers. For male vocalizations, accuracy was lower for LaughterS than for Vowels and LaughterV. Accuracy was also lower for LaughterV compared to Vowels, although this effect was notably smaller than that of LaughterS versus LaughterV. The current results thus indicate that reduced volitional control has an effect on the perception of speaker sex for male vocalizations, echoing findings from speaker discrimination tasks (Lavan et al. 2016b, 2018a). If perceived arousal had a substantial effect on speaker sex recognition, we should have seen LaughterV behaving more like LaughterS (where overall differences in arousal were comparatively small) and less like Vowels (where arousal differences were more pronounced); our findings, however, show the opposite for male vocalizations. In line with the results of Experiment 1, a relationship between F0 and speaker sex recognition accuracy was found: listeners were more likely to judge vocalizations as being produced by a female speaker when F0 increased, and this effect was more pronounced for vocalizations produced by males.

General Discussion and Limitations

In the current set of experiments, we investigated whether the perception of speaker sex from non-verbal vocalizations is affected by natural vocal flexibility, introduced by using different types of vocalizations (laughter, crying, vowels) produced under different levels of volitional control (spontaneous versus volitional emotional vocalizations). We found striking interactions between speaker sex and vocalization type: while our predictions were largely confirmed for male vocalizations, this was not the case for female vocalizations. For male vocalizations, our results indicate that accuracy is lower for spontaneous compared to volitional vocalizations, with graded differences being apparent for different types of spontaneous vocalizations (see Experiment 1: better performance for CryingS compared to LaughterS). These results are in line with the findings of two recent studies of speaker discrimination using spontaneous and volitional vocalizations (Lavan et al. 2016b, 2018a): in these studies, listeners’ ability to determine whether a pair of vocalizations was produced by the same or two different speakers was significantly worse for spontaneous laughter compared to volitional laughter. While no clear link between acoustic cues and discrimination performance could be found in those previous studies, the current experiments found links between F0 and speaker sex perception accuracy. Acoustic cues that are diagnostic for speaker sex in neutral vocal signals are drastically modulated during the production of emotional vocalizations, rendering these cues less diagnostic. In this study, vocalizations with high F0, especially those produced by males, were more likely to be perceived as being produced by females; this relationship also held for female vocalizations, albeit more weakly and in the presence of a ceiling effect for higher F0. While F0 is known to be a salient cue for speaker sex judgements from (neutral) volitional speech sounds, its role in determining speaker sex from other types of vocalizations is largely unknown. The current study suggests that F0 is also an important acoustic cue to speaker sex in volitional and spontaneous vocalizations, such as laughter and crying. Due to this perceptual salience, naturalistic modulations of F0 can impair speaker sex perception in emotional vocalizations, especially when F0 values go beyond what can be considered the “typical” range for a category (here, male vs. female).

There are, however, a number of limitations to the current study that should be noted. First, we only used 5 different speakers, which can be considered a relatively low number. We would argue that it is unlikely that participants were aware of the small number of speakers given the inclusion of different vocalizations: unfamiliar listeners have been reported to be unable to discriminate between speakers when making judgements across different kinds of vocalizations (such as laughter and vowels; Lavan et al. 2016b) and tend to assume that more voices are present than are actually included in a stimulus set (Lavan et al. 2018b, c). Second, the acoustic analysis focused solely on linking F0 to sex perception accuracy. While F0 is arguably the most important acoustic cue to speaker sex (Gelfer and Bennett 2013; Poon and Ng 2015; Whiteside 1998), formant measures have frequently also been implicated. The current study did not extract any formant measures. While previous studies have extracted formant measures from nonverbal emotional vocalizations, such as laughter (e.g., Szameitat et al. 2009a, b; Bachorowski et al. 2001), they came to conflicting conclusions: for most vocalizations, especially spontaneous ones, the authors of those studies report that it was difficult to extract reliable formant measures from a representative portion of the sounds (see Bachorowski et al. 2001, for a discussion). An analysis of such formant measures would thus have lacked adequate precision and was omitted from the current experiments.

The expression and perception of speaker sex has been discussed extensively in the literature on human voice perception, with particular emphasis on the marked sexual dimorphism of the human vocal apparatus, which is distinct and exaggerated compared with other species (Titze 1989). Vocal cues, such as F0, have thus been reported to play a role in sexual selection (e.g., Puts et al. 2016). In our study, we investigated the effects of natural variability in vocal behavior on the identification of sex from vocalizations produced by adult male and female speakers, and found that perceptual performance is significantly impaired when vocalizations are produced under reduced volitional control. Furthermore, the acoustic sexual dimorphism between males and females is drastically reduced for the spontaneous vocalizations used in this study (see “Appendix” section for a full breakdown of acoustic properties of the stimuli). This work thus calls for further discussion of the role of acoustic cues such as fundamental frequency in the signalling of speaker sex (and, e.g., reproductive fitness) in the context of vocal flexibility, and of how the expression of these signals may be particularly dependent on the modern human’s capacity for controlled vocal behavior (see Lavan et al. 2018a).

Furthermore, this work is of interest to the forensic literature, where studies have shown that earwitness speaker recognition or identification is notoriously unreliable (Clifford 1980). It has also been shown that listeners struggle to match emotional speech to neutral speech across a time delay (Read and Craik 1995; Saslove and Yarmey 1980). Our study adds to and partially extends these findings: a perpetrator may dramatically modulate their F0, whether through voice disguise (see also Wagner and Köster 1999) or, as in this study, spontaneously when experiencing the kinds of intense emotions that may occur at a crime scene. In such a scenario, not only are explicit judgements about the identity of a potential perpetrator unreliable, but more basic judgements such as speaker sex may also at times be affected. Further, the work is relevant to computational speaker recognition and verification: the robustness and reliability of such algorithms are determined by the type of training used to build up a template of a speaker’s voice. The current study indicates that, in the context of vocal flexibility, human listeners can fail to reliably make the relatively basic judgement of speaker sex. Algorithms neglecting the presence of vocal flexibility in training sets, or relying on just a single verification phrase, may thus become unreliable for (spontaneous) emotionally-inflected vocal signals.

Overall, the current study shows how the flexibility of our voices affects perceptual judgements. A complex picture of speaker sex-specific effects emerged, interacting with our experimental manipulation that contrasted volitional and spontaneous vocalizations. Generalizations about how vocal signals behave at large can be problematic and may overlook nuanced effects that shape and characterise human voice processing.