Clustering first-order co-occurrences as a way to explore semantic heterogeneity

Authors

  • Ann Bertels, ILT and QLVL, KU Leuven
  • Dirk Speelman, QLVL, KU Leuven

DOI:

https://doi.org/10.1558/jrds.v1i2.22182

Keywords:

quantitative semantics, co-occurrence analysis, association measures, multidimensional scaling (MDS)

Abstract

This paper addresses the contribution of quantitative analysis and statistical techniques to qualitative semantic analysis. It discusses the methodological issues involved in clustering and plotting the most significant first-order co-occurrences of a word as a way to explore its degree of semantic heterogeneity in a technical corpus. Since distributional (dis)similarity reflects semantic (dis)similarity, first-order co-occurrences are clustered with respect to the second- and/or third-order co-occurrences they have in common. In this comparative and exploratory study, several experiments are carried out to evaluate the impact of various clustering parameters and to find the most reliable configuration of parameters, including association measures, distance measures, and lower and upper thresholds. Multidimensional scaling techniques and the visual exploration of semantic proximity between the first-order co-occurrences of a node allow us to gain insight into semantic homogeneity and heterogeneity in a technical corpus and, as a consequence, to come to a better understanding of the semantic characteristics of specialized language. However, the methodology for this kind of analysis is still being developed. With the experiments described in this paper, we contribute to the ongoing methodological analysis of the measures and parameters to be used in distributional semantics.
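
The core idea of the abstract, comparing the first-order co-occurrences of a node by the second-order co-occurrences they share, can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the sentence-level window, the stopword list and all function names are assumptions made for illustration, and raw counts with cosine distance stand in for the association and distance measures that the paper actually compares.

```python
from collections import Counter
from math import sqrt

# Toy, invented sentences (an assumption for illustration only).
CORPUS = [
    "the machine turns the metal part",
    "the machine cuts the metal part",
    "the tool cuts the metal",
    "the operator starts the machine program",
    "the operator stops the machine program",
    "the operator loads the program",
]

STOPWORDS = {"the"}  # hypothetical stopword list


def cooccurrents(word, sentences):
    """Count the words that share a sentence with `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = [t for t in sentence.split() if t not in STOPWORDS]
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def second_order_distances(node, sentences):
    """Pairwise distances between the first-order co-occurrences of `node`,
    based on the second-order co-occurrences they have in common."""
    first_order = sorted(cooccurrents(node, sentences))
    profiles = {}
    for c in first_order:
        profile = cooccurrents(c, sentences)
        # Every first-order co-occurrent shares the node itself,
        # so the node carries no contrast and is left out.
        profile.pop(node, None)
        profiles[c] = profile
    return {
        (c1, c2): cosine_distance(profiles[c1], profiles[c2])
        for i, c1 in enumerate(first_order)
        for c2 in first_order[i + 1:]
    }


distances = second_order_distances("machine", CORPUS)
```

In this toy setting, "metal" and "part" end up closer to each other than either is to "operator" or "program", which hints at two usage contexts for the node "machine". A distance table of this kind is the sort of input that a multidimensional scaling routine would turn into a two-dimensional plot.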

Author Biographies

  • Ann Bertels, ILT and QLVL, KU Leuven

    Ann Bertels is Assistant Professor at the Leuven Language Institute (ILT), KU Leuven.

  • Dirk Speelman, QLVL, KU Leuven

    Dirk Speelman is Associate Professor at QLVL, KU Leuven.

Published

2015-07-24

Section

Articles

How to Cite

Bertels, A., & Speelman, D. (2015). Clustering first-order co-occurrences as a way to explore semantic heterogeneity. Journal of Research Design and Statistics in Linguistics and Communication Science, 1(2), 123-146. https://doi.org/10.1558/jrds.v1i2.22182
