Automatic speaker recognition with crosslanguage speech material

doi:10.1558/ijsll.v20i1.21

International Journal of Speech Language and the Law, Vol 20, No 1 (2013)

Hermann J. Künzel

Issued Date: 9 Jul 2013

Abstract

Forensic automatic speaker recognition (FASR) systems are claimed to be largely independent of language, because their feature vectors consist of acoustic parameters derived from the resonance characteristics of the vocal tract cavities. Yet a certain ‘language gap’ may remain, and it may degrade system performance unless it is properly compensated. This forensic aspect of what may be called cross-language speaker recognition has not yet received due attention. Based on the most common forensic cross-language setting, this study assessed the effect of language mismatch on the performance of a standard FASR system and compared its magnitude with the effect of other sources of mismatch on the same voice data. An experiment using the automatic system Batvox 3, with 75 bilingual speakers of seven languages and four kinds of transmission channel, shows that if speaker model and reference population are matched in terms of language, the remaining mismatch between speaker model and test sample can be neglected: equal error rates (EERs) for same-language and cross-language comparisons are approximately the same, ranging from zero to 5.6%. Transmission of the speech data via landline telephone, GSM and, for part of the corpus, VoIP (using Skype) caused EERs to rise by less than 1% on average.
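The results above are reported as equal error rates. As a rough illustration of that metric only, the sketch below computes an EER from two lists of comparison scores. It is not the procedure used by Batvox or by the study: the function name `equal_error_rate`, the threshold sweep over observed scores, and the example scores are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) for two lists of comparison scores.

    target_scores: scores from same-speaker comparisons (higher = more similar).
    nontarget_scores: scores from different-speaker comparisons.
    The EER is approximated at the observed score where the false acceptance
    rate and the false rejection rate are closest to each other.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    best = None  # (gap between FAR and FRR, estimated EER, threshold)
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: same-speaker trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical scores, invented purely for illustration.
same_speaker = [2.1, 1.8, 0.4, 2.5, 1.2]
different_speaker = [-1.0, 0.6, -0.3, -2.2, 0.1]
eer, threshold = equal_error_rate(same_speaker, different_speaker)
print(f"EER = {eer:.1%} at threshold {threshold}")
```

With few trials the two error rates rarely cross exactly, so this sketch averages them at the closest threshold. Evaluation tools typically interpolate along the error curve instead.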
