International Journal of Speech, Language and the Law, Vol 23, No 1 (2016)

Strength of forensic voice comparison evidence from the acoustics of filled pauses

Vincent Hughes, Sophie Wood, Paul Foulkes
Issued Date: 8 Jul 2016


This study investigates the evidential value of filled pauses (FPs, i.e. um, uh) as variables in forensic voice comparison. FPs from 60 young male speakers of Standard Southern British English, drawn from Task 1 of the DyViS corpus (Nolan et al., 2009), were analysed. The following acoustic properties were measured: midpoint frequencies of the first three formants in the vocalic portion; ‘dynamic’ characterisations of formant trajectories (i.e. quadratic polynomial equations fitted to nine measurement points over the entire vowel); vowel duration; and nasal duration for um. Likelihood ratio (LR) scores were computed using the Multivariate Kernel Density formula (MVKD; Aitken and Lucy, 2004) and converted to calibrated log10 LRs (LLRs) using logistic regression (Brümmer et al., 2007). System validity was assessed using both equal error rate (EER) and the log LR cost function (Cllr; Brümmer and du Preez, 2006). The best-performing system combines dynamic measurements of all three formants with vowel and nasal duration for um, achieving an EER of 4.08% and a Cllr of 0.12. In terms of general patterns, um consistently outperformed uh. For um, systems based on formant dynamics yielded better validity than those based on midpoints, presumably reflecting the additional formant movement in um caused by the transition from vowel to nasal. By contrast, midpoints outperformed dynamics for the more monophthongal uh. Further, the addition of duration (vowel only, or vowel and nasal) consistently improved system performance. The study supports the view that FPs have excellent potential as variables in forensic voice comparison cases.
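The two validity metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (the reference list points to MATLAB tools for that); the function names `cllr` and `eer`, the input lists, and the simple threshold sweep used to approximate the EER are all assumptions made for illustration. Inputs are calibrated log10 LRs from same-speaker (SS) and different-speaker (DS) comparisons.

```python
import math


def cllr(ss_llrs, ds_llrs):
    """Log LR cost (Cllr) from calibrated log10 LRs.

    ss_llrs: log10 LRs from same-speaker comparisons (hypothetical input).
    ds_llrs: log10 LRs from different-speaker comparisons (hypothetical input).
    """
    # Same-speaker pairs are penalised for low LRs: log2(1 + 1/LR).
    ss_cost = sum(math.log2(1 + 10 ** -llr) for llr in ss_llrs) / len(ss_llrs)
    # Different-speaker pairs are penalised for high LRs: log2(1 + LR).
    ds_cost = sum(math.log2(1 + 10 ** llr) for llr in ds_llrs) / len(ds_llrs)
    return 0.5 * (ss_cost + ds_cost)


def eer(ss_llrs, ds_llrs):
    """Approximate equal error rate via a sweep over observed scores.

    Assumption: the EER is taken as the mean of the false-acceptance and
    false-rejection rates at the threshold where the two are closest.
    """
    thresholds = sorted(set(ss_llrs) | set(ds_llrs))
    thresholds.append(thresholds[-1] + 1.0)  # one threshold above every score
    best_gap, best_rate = None, None
    for t in thresholds:
        # False rejection: a same-speaker pair scoring below the threshold.
        frr = sum(llr < t for llr in ss_llrs) / len(ss_llrs)
        # False acceptance: a different-speaker pair at or above the threshold.
        far = sum(llr >= t for llr in ds_llrs) / len(ds_llrs)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate


if __name__ == "__main__":
    # Made-up LLRs for illustration only; not data from the study.
    same_speaker = [2.1, 1.4, 0.3, -0.2]
    different_speaker = [-3.0, -1.8, -0.5, 0.4]
    print("Cllr:", cllr(same_speaker, different_speaker))
    print("EER:", eer(same_speaker, different_speaker))
```

A system whose LRs carry no information (every LR equal to 1) gives a Cllr of 1, so values well below 1, such as the 0.12 reported above, indicate informative and well-calibrated output.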


DOI: 10.1558/ijsll.v23i1.29874


Acton, E. K. (2011) On gender differences in the distribution of um and uh. University of Pennsylvania Working Papers in Linguistics 17.

Aitken, C. G. G. and Lucy, D. (2004) Evaluation of trace evidence in the form of multivariate data. Applied Statistics 53(4): 109–122.

Aitken, C. G. G. and Taroni, F. (2004) Statistics and the Evaluation of Evidence for Forensic Scientists (2nd edn). Chichester: Wiley.

Atkinson, N. (2009) Formant dynamics of SSBE monophthongs in unscripted speech. Unpublished MSc dissertation, University of York.

Becker, T., Jessen, M. and Grigoras, C. (2008) Forensic speaker verification using formant features and Gaussian Mixture Models. Interspeech 2008 Special Session: Forensic Speaker Recognition – Traditional and Automatic Approaches. Brisbane, Australia: 1505–1508.

Boersma, P. and Weenink, D. (2014) Praat: doing phonetics by computer [Computer program]. Version 5.3.62.

Brander, D. (2014) Phonetic characteristics of hesitation vowels in Swiss German and their use for forensic phonetic speaker identification. Poster presented at the annual conference of the International Association for Forensic Phonetics and Acoustics, Zürich, Switzerland.

Brümmer, N. (n.d.) FoCal toolkit. (retrieved 3 June 2011).

Brümmer, N. and du Preez, J. (2006) Application-independent evaluation of speaker detection. Computer Speech and Language 20(2–3): 230–275.

Brümmer, N., Burget, L., Černocký, J., Glembek, O., Grézl, F., Karafiát, M., van Leeuwen, D. A., Matějka, P., Schwarz, P. and Strasheim, A. (2007) Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST SRE 2006. IEEE Transactions on Audio Speech and Language Processing 15: 2072–2084.

Champod, C. and Evett, I. W. (2000) Commentary on A. P. A. Broeders (1999) ‘Some observations on the use of probability in forensic identification’. Forensic Linguistics 7(2): 238–243.

Christenfeld, N. and Creager, B. (1996) Anxiety, alcohol, aphasia, and ums. Journal of Personality and Social Psychology 70(3): 451–460.

Clark, H. H. and Fox Tree, J. E. (2002) Using uh and um in spontaneous speech. Cognition 84: 73–111.

Clermont, F., French, J. P., Harrison, P. T. and Simpson, S. (2008) Population data for English spoken in England: a modest first step. Paper presented at the annual conference of the International Association for Forensic Phonetics and Acoustics, Lausanne, Switzerland.

Docherty, G. J. and Foulkes, P. (1999) Newcastle upon Tyne and Derby: instrumental phonetics and variationist studies. In P. Foulkes and G. J. Docherty (eds) Urban Voices: Accent Studies in the British Isles 47–71. London: Arnold.

Duckworth, M. and McDougall, K. (2013) Individual differences in fluency disruptions: a cross-style investigation. Paper presented at the annual conference of the International Association for Forensic Phonetics and Acoustics, Tampa, Florida.

Enzinger, E. and Morrison, G. S. (2012) The importance of using between-session test data in evaluating the performance of forensic-voice-comparison systems. In Proceedings of the 14th Australasian Conference on Speech Science and Technology 137–140. Sydney, Australia.

Eriksson, E. J., Cepeda, L. F., Rodman, R. D., McAllister, D. F., Bitzer, D. and Arroway, P. (2004) Cross-language speaker identification using spectral moments. In Proceedings of the 17th Swedish Phonetic Conference (FONETIK) 76–79. Stockholm, Sweden.

Evett, I. W. (1991) Interpretation: a personal odyssey. In C. G. G. Aitken and D. A. Stone (eds) The Use of Statistics in Forensic Science 9–22. Chichester: Ellis Horwood.

Foulkes, P., Carrol, G. and Hughes, S. (2004) Sociolinguistics and acoustic variability in filled pauses. Paper presented at the annual conference of the International Association for Forensic Phonetics and Acoustics, Helsinki, Finland.

Foulkes, P. and French, J. P. (2012) Forensic speaker comparison: a linguistic-acoustic perspective. In P. M. Tiersma and L. M. Solan (eds) The Oxford Handbook of Language and the Law 557–572. Oxford: Oxford University Press.

Greenberg, S., Carvey, H., Hitchcock, L. and Chang, S. (2003) Temporal properties of spontaneous speech – a syllable-centric perspective. Journal of Phonetics 31(3): 465–485.

Grosjean, F. and Deschamps, A. (1973) Analyse des variables temporelles du français spontané. Phonetica 28(3–4): 191–226.

Hughes, V. (2014) The definition of the relevant population and the collection of data for likelihood ratio-based forensic voice comparison. Unpublished PhD thesis, University of York.

Hughes, V. and Foulkes, P. (2015) The relevant population in forensic voice comparison: effects of varying delimitations of social class and age. Speech Communication 66: 218–230.

Hughes, V., Wood, S. and Foulkes, P. (forthcoming) Phonetic measurements of hesitations improve the performance of automatic speaker recognition systems.

Jessen, M. (2008) Forensic phonetics. Language and Linguistics Compass 2(4): 671–711.

Jessen, M., Köster, O. and Gfroerer, S. (2005) Influence of vocal effort on average and variability of fundamental frequency. International Journal of Speech, Language and the Law 12(2): 174–213.

Johnson, K. (2012) Acoustic and Auditory Phonetics (3rd edn). Malden, MA: Wiley-Blackwell.

Kendall, T. and Thomas, E. R. (2014) ‘vowels’ (R package).

Ketabdar, H. (2004) ‘jEER_DET.m’ (MATLAB function) (version 1.2 with amendments by Anil Alexander).

Kowal, S., O’Connell, D. C., Forbush, K., Higgins, M., Clarke, L. and D’Anna, K. (1997) Interplay of literacy and orality in inaugural rhetoric. Journal of Psycholinguistic Research 26(1): 1–31.

Künzel, H. J. (1997) Some general phonetic and forensic aspects of speaking tempo. International Journal of Speech, Language and the Law 4(1): 48–83.

Lennes, M. (2003a) Save_intervals_to_wav_sound_files.praat (Praat script) (retrieved 29 July 2013).

Lennes, M. (2003b) Collect_formant_data_from_files.praat. (retrieved 15 May 2013).

Liberman, M. (2014) UM / UH update. Language Log, 13 December 2014. (and several other posts).

Maclay, H. and Osgood, C. (1959) Hesitation phenomena in spontaneous English speech. Word 15(1): 19–44.

Martire, K. A., Kemp, R. I., Sayle, M. and Newell, B. R. (2013) On the interpretation of likelihood ratios in forensic science evidence: presentation formats and the weak evidence effect. Forensic Science International 240: 61–68.

McDougall, K. (2004) Speaker-specific formant dynamics: an experiment on Australian English /aɪ/. International Journal of Speech, Language and the Law 11(1): 103–130.

McDougall, K. (2006) Dynamic features of speech and the characterisation of speakers: towards a new approach using formant frequencies. International Journal of Speech, Language and the Law 13(1): 89–126.

McDougall, K. and Nolan, F. (2007) Discrimination of speakers using the formant dynamics of /uː/ in British English. In Proceedings of the 16th International Congress of Phonetic Sciences 1825–1828. Saarbrücken, Germany.

Milroy, L., Milroy, J. and Docherty, G. J. (1994–1997) Phonological Variation and Change in Contemporary British English. Economic and Social Research Council (ESRC) of Great Britain. R000234892.

Morrison, G. S. (2007) MATLAB implementation of Aitken and Lucy’s (2004) forensic likelihood ratio software using multivariate-kernel-density estimation. (retrieved 31 May 2011).

Morrison, G. S. (2009a) Forensic voice comparison and the paradigm shift. Science and Justice 49(4): 298–308.

Morrison, G. S. (2009b) Likelihood-ratio voice comparison using parametric representations of the formant trajectories of diphthongs. Journal of the Acoustical Society of America 125(4): 2387–2397.

Morrison, G. S. (2009c) train_llr_fusion_robust.m (MATLAB function). (retrieved 13 December 2011).

Morrison, G. S. (2011a) Measuring the validity and reliability of forensic likelihood-ratio systems. Science and Justice 51: 91–98.

Morrison, G. S. (2011b) A comparison of procedures for the calculation of forensic likelihood ratios from acoustic-phonetic data: multivariate kernel density (MVKD) versus Gaussian mixture model-universal background model (GMM-UBM). Speech Communication 53: 242–256.

Morrison, G. S. (2013) Tutorial on logistic-regression calibration and fusion: converting a score to a likelihood ratio. Australian Journal of Forensic Sciences 45(2): 173–197.

Morrison, G. S. (2014) Distinguishing between forensic science and forensic pseudoscience: testing of validity and reliability and approaches to forensic voice comparison. Science and Justice 54(3): 245–256.

Morrison, G. S., Ochoa, F. and Thiruvaran, T. (2012) Database selection for forensic voice comparison. In Proceedings of Odyssey 2012: The Language and Speaker Recognition Workshop 74–77. Singapore.

Morrison, G. S. and Enzinger, E. (2013) Forensic speech science. In N. Nic Daéid (ed.) Proceedings of the 17th International Forensic Science Managers’ Symposium 616–623. Lyon, France.

Mullen, C., Spence, D., Moxey, L. and Jamieson, A. (2014) Perception problems of the verbal scale. Science and Justice 54(2): 154–158.

Nair, B., Alzqhoul, E. and Guillemin, B. J. (2014) Determination of likelihood ratios for forensic voice comparison using principal component analysis. International Journal of Speech, Language and the Law 21: 83–112.

Nolan, F. J. (1997) Speaker recognition and forensic phonetics. In W. J. Hardcastle and J. Laver (eds) The Handbook of Phonetic Sciences 744–767. Oxford: Blackwell.

Nolan, F., McDougall, K., de Jong, G. and Hudson, T. (2009) The DyViS database: style-controlled recordings of 100 homogeneous speakers for forensic phonetic research. International Journal of Speech, Language and the Law 16(1): 31–57.

R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Robertson, B. and Vignaux, G. A. (1995) Interpreting Evidence: Evaluating Forensic Science in the Courtroom. Chichester: John Wiley and Sons.

Rose, P. (2002) Forensic Speaker Identification. London: Taylor and Francis.

Rose, P. (2006) The intrinsic speaker discriminatory power of diphthongs. In Proceedings of the 11th Australasian Conference on Speech Science and Technology 64–67. Auckland, New Zealand.

Rose, P. (2013) Where the science ends and the law begins: likelihood ratio-based forensic voice comparison in a $150 million telephone fraud. International Journal of Speech, Language and the Law 20(2): 277–324.

Rose, P. (2015) Forensic voice comparison with monophthongal formant trajectories – a likelihood ratio-based discrimination of ‘schwa’ vowel acoustics in a close social group of young Australian females. Proceedings of the International Conference on Acoustics Speech and Signal Processing (ICASSP) 4819–4823. Brisbane, Australia.

Rose, P., Kinoshita, Y. and Alderman, T. (2006) Realistic extrinsic forensic speaker discrimination with the diphthong /aɪ/. Proceedings of the 11th Australasian Conference on Speech Science and Technology 329–334. Auckland, New Zealand.

Rose, P. and Morrison, G. (2009) A response to the UK position statement on forensic speaker comparison. International Journal of Speech, Language and the Law 16(1): 139–163.

Schachter, S., Christenfeld, N., Ravina, B. and Bilous, F. (1991) Speech disfluency and the structure of knowledge. Journal of Personality and Social Psychology 60(3): 362–367.

Shriberg, E. (2001) To ‘errrr’ is human: ecology and acoustics of speech disfluencies. Journal of the International Phonetic Association 31(1): 153–169.

Simpson, S. (2008) Testing the speaker discrimination ability of formant measurements in forensic speaker comparison cases. Unpublished MSc dissertation, University of York.

Stevens, K. (2001) Acoustic Phonetics. Cambridge, MA: MIT Press.

Swerts, M., Wichmann, A. and Beun, R.-J. (1996) Filled pauses as markers of discourse structure. In Proceedings of the International Conference on Spoken Language Processing (volume 2) 1033–1036.

Tabachnick, B. G. and Fidell, L. S. (2007) Using Multivariate Statistics (5th edn). Boston: Pearson.

Thaitechawat, S. and Foulkes, P. (2011) Discrimination of speakers using tone and formant dynamics in Thai. In Proceedings of the 17th International Congress of Phonetic Sciences 1978–1981. Hong Kong.

Tottie, G. (2011) Uh and Um as sociolinguistic markers in British English. International Journal of Corpus Linguistics 16(2): 173–197.

Tschäpe, N., Trouvain, J., Bauer, D. and Jessen, M. (2005) Idiosyncratic patterns of filled pauses. Paper presented at the annual conference of the International Association for Forensic Phonetics and Acoustics, Marrakesh, Morocco.

Umeda, N. (1975) Vowel duration in American English. Journal of the Acoustical Society of America 58(2): 434–445.

van Leeuwen, D. A. and Brümmer, N. (2007) An introduction to application-independent evaluation of speaker recognition systems. In C. Müller (ed.) Speaker Classification vol. 1: Selected Projects 330–353. Heidelberg: Springer.

Van Summers, W., Pisoni, D. B., Bernacki, R. H., Pedlow, R. I. and Stokes, M. A. (1988) Effects of noise on speech production: acoustic and perceptual analyses. Journal of the Acoustical Society of America 84(3): 917–928.

Wells, J. C. (1982) Accents of English (3 volumes). Cambridge: Cambridge University Press.

Wickham, H. (2015) ggplot2 (R package).

