Natural Language Processing methodology for tracking diachronic changes in the 20th century English language

Sanja Stajner; Ruslan Mitkov; Geoffrey Leech

doi:10.1558/jrds/720788885881

Authors

Sanja Stajner, University of Wolverhampton
Ruslan Mitkov, University of Wolverhampton
Geoffrey Leech, Lancaster University

DOI:

https://doi.org/10.1558/jrds/720788885881

Keywords:

corpus analysis, diachronic changes, natural language processing

Abstract

Since the 1990s, when its more recent additions were released and diachronic study became possible, the ‘Brown family’ of corpora has been widely used across the linguistic community for various synchronic and diachronic studies. However, the methodology used in these studies did not take advantage of modern, state-of-the-art Natural Language Processing (NLP) tools, but rather relied on part-of-speech (POS) tagging, often with manual post-editing. Most previous work (e.g. Leech et al., 2009; Davies, 2013) has focused mainly on the linguistic interpretation of the results and on proposing hypotheses about the ways language changes, without giving much consideration to whether the results were statistically sound. This work aims to fill these gaps by proposing a novel, NLP-motivated methodology, which employs a fully automatic feature extraction procedure and conducts a thorough statistical analysis, thus reducing the amount of human effort required and offering a promising basis for future large-scale studies. The choice of statistical tests in this study was evaluated and confirmed to be correct by several procedures which rely on leading machine learning algorithms.

Author Biographies

Sanja Stajner, University of Wolverhampton

Sanja Stajner is a second-year PhD student in the Research Group in Computational Linguistics, Research Institute in Information and Language Processing at the University of Wolverhampton, UK. She obtained her B.Sc. in Mathematics and Computer Science at the University of Belgrade (Serbia), and an MA in Natural Language Processing and Human Language Technologies from Universitat Autonoma de Barcelona (Spain) and the University of Wolverhampton (UK). Her main research interests include natural language processing, machine learning, statistical analysis and corpus linguistics.

Ruslan Mitkov, University of Wolverhampton

Professor Ruslan Mitkov’s extensively cited research includes more than 180 publications on various topics of Natural Language Processing. Dr Mitkov is author of the monograph Anaphora Resolution (Longman) and sole Editor of The Oxford Handbook of Computational Linguistics (Oxford University Press). He is Executive Editor of the Journal of Natural Language Engineering (Cambridge University Press) and Editor-in-Chief of the Natural Language Processing book series of John Benjamins publishers. Dr Mitkov is Director of the Research Institute in Information and Language Processing (University of Wolverhampton).

Geoffrey Leech, Lancaster University

Geoffrey Leech is author of many books and articles on linguistics and the English Language. He was a pioneer in the development of corpus linguistics, taking a leading role in the creation of the first electronic corpus of British English (the LOB Corpus) and the British National Corpus. He is the lead author of Change in Contemporary English: A Grammatical Study (Cambridge University Press 2009). He has been a professor in the Department of Linguistics and English Language, Lancaster University, since 1975.

References

Aarts, B., Close, J., Leech, G. and Wallis, S. (eds.) 2013. The verb phrase in English: Investigating recent linguistic change with corpora. Cambridge: Cambridge University Press.

Aarts, B., Close, J. and Wallis, S. 2013. Choices over time: Methodological issues in investigating current change. In: Aarts, Close, Leech and Wallis, pp. 14-45.

Adolph, R., 1966. The Rise of Modern Prose Style. Cambridge, Mass.: M.I.T. Press.

Aldrich, J. and Nelson, F. 1984. Linear probability, logit, and probit models. Quantitative applications in the social sciences. London: Sage.

Altmann, G., von Buttlar, H., Rott, W. and Strauß, U. 1983. A law of change in language. In B. Brainerd, ed. Historical linguistics, 104–115.

Baker, P., 2009. The BE06 Corpus of British English and recent language change. International Journal of Corpus Linguistics, 14(3), 312-337.

Barber, C., 1964. Linguistic change in present-day English. London and Edinburgh: Oliver & Boyd.

Bennett, J. R., 1971. Prose Style: A Historical Approach through Studies. San Francisco: Chandler.

Biber, D., 1985. Investigating Macroscopic Textual Variation through Multifeature/Multidimensional Analyses. Linguistics, 23, pp. 337-60.

Biber, D., 1986. Spoken and Written Textual Dimensions in English: Resolving the Contradictory Findings. Language, 62, pp. 384-414.

Biber, D. and Finegan, E., 1986. An Initial Typology of English Text Types. In: J. Aarts and W. Meijs, eds. Corpus Linguistics II: New Studies in the Analysis and Exploitation of Computer Corpora. Amsterdam: Rodopi. pp. 19-46.

Biber, D., 1987. A textual comparison of British and American writing. American Speech, 62, pp. 99–119.

Biber, D., 1988. Variation across speech and writing. Cambridge: Cambridge University Press.

Biber, D. and Finegan, E., 1988. Drift in three English genres from the 18th to the 20th century: A multi-dimensional approach. In: M. Kytö, O. Ihalainen, and M. Rissanen, eds. Corpus linguistics, hard and soft. Proceedings of the Eighth International Conference on English Language Research on Computerized Corpora. Amsterdam: Rodopi. pp. 83–101.

Biber, D. and Finegan, E., 1989. Drift and the evolution of English style: A history of three genres. Language, 65, pp. 487–517.

Biber, D., 1990. Methodological Issues Regarding Corpus-based Analyses of Linguistic Variation. Literary and Linguistic Computing, 5(4), pp. 257-269.

Biber, D. and Finegan, E., 1992. The linguistic evolution of five written and speech-based English genres from the 17th to the 20th centuries. In: M. Rissanen, O. Ihalainen, T. Nevalainen, and I. Taavitsainen, eds. History of English. New methods of interpretations in historical linguistics. Berlin and New York: Mouton de Gruyter, pp. 688–704.

Biber, D., Finegan, E. and Atkinson, D., 1994. ARCHER and its challenges: Compiling and exploring A Representative Corpus of Historical English Registers. In: U. Fries, G. Tottie and P. Schneider, eds. Creating and using English language corpora. Amsterdam: Rodopi. pp. 1-14.

Burling, R., 1992. Patterns of language: Structure, variation, change. San Diego: Academic.

Cohen, W. 1995. Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning. pp. 115-123.

Coleman, M. and Liau, T. L. 1975. A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60 (2), pp. 283–284.

Connexor Oy. 2006a. Machinese Language Model.

Connexor Oy. 2006b. Machinese Linguistic Analysers.

Connexor Oy. 2009. Connexor Machinese Syntax. Language Model Tag Descriptions. http://193.185.105.50/demo/machinese/doc/enfdg3-tags.html [Accessed 3 May 2011]

Corpas Pastor, G., Mitkov, R., Afzal, N. and Pekar, V., 2008. Translation Universals: Do they exist? A corpus-based NLP study of convergence and simplification. In Proceedings of the AMTA. Waikiki, Hawaii.

Davies, M., 2013. “Recent shifts with three non-finite verbal complements in English: Data from the 100-million-word Time corpus (1920s-2000s)”. In: Aarts, Close, Leech and Wallis (eds.) The verb phrase in English: Investigating recent linguistic change with corpora, Cambridge: Cambridge University Press. pp. 46-67.

Denison, D., 1994. A Corpus of Late Modern English Prose. In: M. Kytö et al. eds. Corpora Across the Centuries. Amsterdam: Rodopi, pp. 7-16.

Geisler, C. 2002. Relativization in Ulster English. In: P. Poussa, ed. Relativisation on the North Sea Littoral (LINCOM Studies in Language Typology 07). München: Lincom Europa, pp. 135–146.

Geisler, C. 2008. Statistical reanalysis of corpus data. ICAME Journal, 32. pp. 35–46.

Gordon, I. A., 1966. The Movement of English Prose. Longman: London.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I. H. 2009. The WEKA data mining software: An update. SIGKDD Explorations, 11(1). pp. 10–18.

Hall, M. A. and Smith, L. A. 1998. Practical feature subset selection for machine learning. In C. McDonald, ed. Proceedings of the 21st Australasian Computer Science Conference ACSC98. Berlin: Springer, pp. 181–191.

Hilpert, M. and Gries, S. Th. 2010. Modeling diachronic change in the third person singular: a multi-factorial verb- and author-specific exploratory approach. English Language and Linguistics, 14(3), 293-320.

Hundt, M. and Mair, Ch. 1999. “Agile” and “uptight” genres: The corpus-based approach to language change in progress. International Journal of Corpus Linguistics, 4, 221-242.

John, G. H. and Langley, P. 1995. Estimating Continuous Distributions in Bayesian Classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, San Mateo. pp. 338–345.

Keerthi, S.S., Shevade, S.K., Bhattacharyya, C. and Murthy, K.R.K. 2001. Improvements to Platt’s SMO Algorithm for SVM Classifier Design. Neural Computation, 13(3). pp. 637–649.

Kroch, A. 1989a. Reflexes of grammar in patterns of language change. Language Variation and Change, 1, pp. 199–244.

Kroch, A. 1989b. Function and grammar in the history of English: Periphrastic “do”. In R. Fasold, ed. Language change and variation. Amsterdam: Benjamins, pp. 133–172.

Kroch, A., 2001. Syntactic change. In: M. Baltin and C. Collins, eds. The Handbook of Contemporary Syntactic Theory. Malden, Mass: Blackwell Publishers, pp. 629-739.

Landwehr, N., Hall, M. and Frank, E. 2005. Logistic Model Trees. Machine Learning, 59, pp. 161–205.

le Cessie, S. and van Houwelingen, J.C. 1992. Ridge Estimators in Logistic Regression. Applied Statistics, 41(1), pp. 191–201.

Leech, G., 2003. Modality on the move: the English modal auxiliaries 1961-1992. In: R. Facchinetti, M. Krug and F. Palmer, eds. Modality in contemporary English. Berlin/New York: Mouton de Gruyter. pp. 223-240.

Leech, G., 2004. Recent grammatical change in English: data, description, theory. In: K. Aijmer and B. Altenberg, eds. Advances in Corpus Linguistics: Papers from the 23rd International Conference on English Language Research on Computerized Corpora (ICAME 23) Göteborg 22-26 May 2002. Amsterdam: Rodopi, pp. 61-81.

Leech, G. and Smith, N., 2005. Extending the possibilities of corpus-based research on English in the twentieth century: a prequel to LOB and FLOB. ICAME Journal, 29, pp. 83-98.

Leech, G. and Smith, N., 2006. Recent grammatical change in written English 1961-1992: some preliminary findings of a comparison of American with British English. In: A. Renouf and A. Kehoe, eds. The Changing Face of Corpus Linguistics. Amsterdam: Rodopi, pp. 186-204.

Leech, G., Hundt, M., Mair, C. and Smith, N., 2009. Change in Contemporary English: A Grammatical Study. Cambridge: Cambridge University Press.

Leech, G. and Smith, N., 2009. Change and constancy in linguistic change: How grammatical usage in written English evolved in the period 1931-1991. In: A. Renouf and A. Kehoe, eds. Corpus Linguistics: Refinements and Reassessments, Amsterdam/New York: Rodopi, pp. 173-200.

Lightfoot, D., 1991. How to set parameters: Arguments from language change. Cambridge, MA: MIT Press.

Lightfoot, D., 1999. The development of language: Acquisition, change, and evolution. Malden, MA: Blackwell.

Mair, C., and Hundt, M., 1995. Why is the progressive becoming more frequent in English? A corpus-based investigation of language change in progress. Zeitschrift für Anglistik und Amerikanistik, 43, pp. 111-122.

Mair, C., 1997. The spread of the going-to-future in written English: a corpus-based investigation into language change in progress. In: R. Hickey and St. Puppel, eds. Language history and linguistic modelling: a festschrift for Jacek Fisiak on his 60th birthday. Berlin: Mouton de Gruyter. pp. 1537-1543.

Mair, C., 2002. Three changing patterns of verb complementation in Late Modern English: a real-time study based on matching text corpora. English Language and Linguistics, 6, pp. 105-131.

Mair, C., Hundt, M., Leech, G. and Smith, N., 2002. Short term diachronic shifts in part-of-speech frequencies: a comparison of the tagged LOB and F-LOB corpora. International Journal of Corpus Linguistics, 7, pp. 245-264.

Mair, C. and Leech, G., 2006. Current change in English syntax. In: B. Aarts and A. MacMahon, eds. The Handbook of English Linguistics. Oxford: Blackwell. Ch. 14.

Oakes, M. P. 1998. Statistics in Corpus Linguistics. Edinburgh University Press.

Pharies, D. A., 2007. Breve historia de la lengua española. Chicago: The University of Chicago Press.

Platt, J. C. 1998. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In: B. Schoelkopf, C. Burges and A. Smola, eds. Advances in Kernel Methods – Support Vector Learning.

Quinlan, R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Sebastiani, F. 2000. Machine learning in automated text categorization. ACM Computing Surveys, 34, pp. 1-47.

Senter, R. J. and Smith, E. A. 1967. Automated readability index. Technical Report (AMRLTR-66-220). University of Cincinnati, Cincinnati: Ohio.

Smith, E. A. and Kincaid, P. J. 1970. Derivation and Validation of the Automated Readability Index for Use with Technical Materials. Human Factors: The Journal of the Human Factors and Ergonomics Society, 12(5). pp. 457–464.

Smith, N., 2002. Ever moving on? The progressive in recent British English. In: P. Peters, P. Collins & A. Smith, eds. New frontiers of corpus research: papers from the twenty first International Conference on English Language Research on Computerized Corpora, Sydney 2000. Amsterdam: Rodopi. pp. 317-330.

Smith, N., 2003a. A quirky progressive? A corpus-based exploration of the will + be + -ing construction in recent and present day British English. In: D. Archer, P. Rayson, A. Wilson and T. McEnery, eds. Proceedings of the Corpus Linguistics 2003 Conference. Lancaster University: UCREL Technical Papers Vol. 16, pp. 714-723.

Smith, N., 2003b. Changes in the modals and semi-modals of strong obligation and epistemic necessity in recent British English. In: R. Facchinetti, M. Krug and F. Palmer, eds. Modality in contemporary English. Berlin/New York: Mouton de Gruyter. pp. 241-266.

Smith, N. and Leech, G., 2013. Verb structures in twentieth century British English. In: Aarts, Close, Leech and Wallis, pp. 68-98.

Sumner, M., Frank, E. and Hall, M. 2005. Speeding up Logistic Model Tree Induction. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, 675–683.

Tukey, J. 1977. Exploratory data analysis. Reading, MA: Addison-Wesley.

Westin, I. and Geisler, C., 2002. A multi-dimensional study of diachronic variation in British newspaper editorials. ICAME Journal, 26. [online] http://icame.uib.no/ij26/westin_geisler.pdf [Accessed 10 May 2011]

Westin, I., 2002. Language Change in English Newspaper Editorials. Amsterdam: Rodopi.

Witten, I. H. and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques, second edition. Morgan Kaufmann Publishers.
