International Journal of Speech Language and the Law, Vol 26, No 2 (2019)

Tuning the performance of automatic speaker recognition in different conditions: effects of language and simulated voice disguise

Radek Skarnitzl, Maral Asiaee, Mandana Nourbakhsh
Issued Date: 2 Mar 2020


Automatic speaker recognition applications have often been described as a ‘black box’. This study explores the benefit of tuning procedures (condition adaptation and reference normalisation) implemented in an i-vector PLDA framework ASR system, VOCALISE. These procedures enable users to open the black box to a certain degree. Subsets of two 100-speaker databases, one of Czech and the other of Persian male speakers, are used for the baseline condition and for the tuning procedures. The effect of tuning with cross-language material, as well as the effect of simulated voice disguise, achieved by raising the fundamental frequency by four semitones and resonance characteristics by 8%, are also examined. The results show superior recognition performance (EER) for Persian than Czech in the baseline condition, but an opposite result in the simulated disguise condition; possible reasons for this are discussed. Overall, the study suggests that both condition adaptation and reference normalisation are beneficial to recognition performance.
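As a rough illustration of two quantities named in the abstract, the sketch below converts a shift in semitones into a frequency ratio (four semitones corresponds to a factor of about 1.26) and estimates an equal error rate (EER), the operating point at which the false acceptance and false rejection rates are equal. It is a minimal sketch under stated assumptions, not the procedure used in the study or in VOCALISE: the function names, the 120 Hz and 500 Hz starting values, and the toy score lists are all hypothetical.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a shift of `semitones` equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The function
    returns the mean of the false rejection and false acceptance rates at
    the threshold where the two rates are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # Same-speaker trials wrongly rejected at this threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # Different-speaker trials wrongly accepted at this threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(frr - far)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (frr + far) / 2.0
    return eer


# Hypothetical starting values, chosen only to illustrate the arithmetic.
f0_hz = 120.0        # assumed mean fundamental frequency of a male speaker
formant_hz = 500.0   # assumed resonance (formant) frequency

disguised_f0_hz = f0_hz * semitone_ratio(4)   # raised by four semitones
disguised_formant_hz = formant_hz * 1.08      # raised by 8%

# Toy comparison scores (hypothetical): higher means "more likely same speaker".
same_speaker = [2.0, 3.0, 4.0, 5.0]
different_speaker = [0.0, 1.0, 2.5, 1.5]
toy_eer = equal_error_rate(same_speaker, different_speaker)
```

With these toy scores one same-speaker trial and one different-speaker trial overlap, so both error rates are equal at a single threshold; real systems are evaluated on far larger trial sets and usually interpolate between thresholds.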


DOI: 10.1558/ijsll.39778



