Using Grammatical Features for Automatic Register Identification in an Unrestricted Corpus of Documents from the Open Web

Authors

  • Douglas Biber, Northern Arizona University
  • Jesse Egbert, Brigham Young University

DOI:

https://doi.org/10.1558/jrds.v2i1.27637

Keywords:

text classification, automatic genre identification (AGI), discriminant analysis, web registers

Abstract

Most previous attempts at automatic genre identification have been based on corpus samples that are relatively small and artificially restricted. In this study we set out to automatically predict register/genre categories in a large, representative sample of documents from the open web using a linguistic approach focused on lexico-grammatical characteristics that have functional associations. Our findings demonstrate the possibility of automatically predicting register/genre on the unrestricted open web, and we anticipate that future extensions will allow this task to be accomplished with considerably higher degrees of accuracy.

Author Biographies

  • Douglas Biber, Northern Arizona University

    Douglas Biber is Regents' Professor in the English Department of Northern Arizona University, Flagstaff, AZ.

  • Jesse Egbert, Brigham Young University

    Jesse Egbert is an Assistant Professor in the Department of Linguistics and English Language at Brigham Young University, Provo, UT.

Published

2016-02-16

Section

Articles

How to Cite

Biber, D., & Egbert, J. (2016). Using Grammatical Features for Automatic Register Identification in an Unrestricted Corpus of Documents from the Open Web. Journal of Research Design and Statistics in Linguistics and Communication Science, 2(1), 3–36. https://doi.org/10.1558/jrds.v2i1.27637