CALICO Journal, Vol 33, No 3 (2016)

Leveraging a Large Learner Corpus for Automatic Suggestion of Collocations for Learners of Japanese as a Second Language

Lis Pereira, Erlyn Manguilimotan, Yuji Matsumoto
Issued Date: 26 Aug 2016


One of the challenges of learning Japanese as a Second Language (JSL) is finding the appropriate word for a particular usage. To address this challenge, we developed a collocational aid that suggests more appropriate collocations in Japanese. In particular, we address the problem of generating and ranking noun and verb candidates for correcting potential collocation errors in learners’ text. Given a noun-verb construction as input, our system generates possible noun or verb correction candidates based on noun and verb corrections extracted from a large Japanese learner corpus. We use this corpus to investigate learners’ tendency to commit collocation errors and to produce a smaller and more realistic set of candidates. After combining the input noun or verb with the generated candidates to form noun-verb pairs, the system uses the Weighted Dice coefficient as the association measure to filter out inappropriate noun-verb pairs and rank the proper collocations. We report a detailed evaluation and results on learner data. In addition, we show that our system outperforms existing approaches to collocation error correction by a statistically significant margin. Finally, we report a preliminary user study with JSL learners.
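The candidate generation and ranking steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the commonly cited form of the Weighted Dice coefficient, log2 f(n,v) × 2 f(n,v) / (f(n) + f(v)), which the abstract itself does not spell out. The function names, the confusion table, the threshold, and all counts are hypothetical toy values chosen for illustration (the romanized example corresponds to a learner writing 夢をする where 夢を見る is expected).

```python
import math


def weighted_dice(pair_freq, noun_freq, verb_freq):
    """Weighted Dice score for a noun-verb pair (assumed formula).

    Assumed form: log2(f(n,v)) * 2*f(n,v) / (f(n) + f(v)).
    Unseen pairs get a score of 0.0.
    """
    if pair_freq <= 0:
        return 0.0
    return math.log2(pair_freq) * 2 * pair_freq / (noun_freq + verb_freq)


def rank_verb_candidates(noun, written_verb, confusions,
                         pair_counts, noun_counts, verb_counts,
                         threshold=0.0):
    """Generate verb candidates for `written_verb` and rank them with `noun`.

    `confusions` maps a verb learners wrote to verbs that appeared as its
    corrections in a learner corpus (hypothetical structure).
    Pairs scoring at or below `threshold` are filtered out.
    """
    scored = []
    for candidate in confusions.get(written_verb, []):
        score = weighted_dice(
            pair_counts.get((noun, candidate), 0),
            noun_counts.get(noun, 0),
            verb_counts.get(candidate, 0),
        )
        if score > threshold:
            scored.append((candidate, score))
    # Highest association score first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only (not taken from any corpus).
confusions = {"suru": ["miru", "motsu", "okonau"]}
pair_counts = {("yume", "miru"): 80, ("yume", "motsu"): 20}
noun_counts = {"yume": 200}
verb_counts = {"miru": 1000, "motsu": 800, "okonau": 600}

ranked = rank_verb_candidates(
    "yume", "suru", confusions, pair_counts, noun_counts, verb_counts
)
for verb, score in ranked:
    print(f"yume o {verb}: {score:.3f}")
```

In this sketch the learner corpus only narrows the candidate set, while the association scores come from separate frequency counts; whether the actual system separates the two resources in the same way is an assumption.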



DOI: 10.1558/cj.v33i3.26444




