Mining the Past – Data-Intensive Knowledge Discovery in the Study of Historical Textual Traditions

doi:10.1558/jch.31662

Journal of Cognitive Historiography, Vol 3, No 1-2 (2016)

Kristoffer L Nielbo, Ryan Nichols, Edward Slingerland

Issued Date: 29 Mar 2018

Abstract

Text-heavy and unstructured data constitute the primary source materials for many historical reconstructions. In history and the history of religion, text analysis has typically been conducted by systematically selecting a small sample of texts and subjecting it to highly detailed reading and mental synthesis. But two interrelated technological developments have made possible a new data-intensive paradigm, one that can usefully supplement qualitative analysis, in the study of historical textual traditions. First, the availability of significant computing power has made it possible to run algorithms for automated text analysis on most personal computers. Second, the rapid increase in full-text digital databases relevant to the study of religion has considerably reduced the costs of data acquisition and digitization. However, a limited understanding of the scope, advantages, and limitations of data-intensive methods, combined with overly enthusiastic praise of big data by policy-makers and data scientists, has created real obstacles to the implementation of this paradigm in historical research. This is unfortunate, because history offers a rich and uncharted field for data-intensive knowledge discovery, and historians already have the much sought-after and necessary domain expertise. In this article we seek to remove obstacles to the data-intensive paradigm by presenting its methods and models for handling text-heavy data.


