The Computer Science Laboratory of the François Rabelais University in Tours, France offers a Master's thesis grant in Natural Language Processing.

***************
** Subject **
Parsing and Multi-Word Expressions

*************
** Profile **
- Last year of a Master's degree in computer science or computational linguistics
- Knowledge of French and English; another language would be a plus
- Interest in linguistics and familiarity with language technology
- Capacity to work independently and as part of a team

***********
** Dates **
Application deadline: November 30, 2014 (or until filled)
Position starts: February/March 2014
Duration: 6 months

***********
** Grant **
Amount: 436 € / month
Funding institution: BAMSOO SARL (research agreement with Université François Rabelais Tours)

*************
** Contact **
agata.savary@univ-tours.fr

*****************
** Application **
- CV
- cover letter
- transcript of MSc and BSc grades

*************************
** Hosting Institution **
University: Université François Rabelais Tours (http://international.univ-tours.fr/welcome-international-265902.kjsp?RH=INTER&RF=INTER-EN)
Laboratory: Laboratoire d'informatique (LI) (http://li.univ-tours.fr/)
Research team: Databases and Natural Language Processing (BdTln), Campus in Blois

*****************
** Supervisors **
Dr Agata Savary (Université François Rabelais Tours)
Dr Yannick Parmentier (Université d'Orléans)
Prof. Jean-Yves Antoine (Université François Rabelais Tours)

**************************
** Scientific challenge **
This Master's thesis will be dedicated to fixed and semi-fixed Multi-Word Expressions (MWEs) such as "French fries", "random access memory", "to do one's best", "to spill the beans", "to kick the bucket", etc.
Despite a long-established tradition of linguistic studies dedicated to such expressions, they remain one of the major challenges in bridging the gap between linguistic precision and computational efficiency in Natural Language Processing (NLP) applications [Sag et al., 2002]. MWEs [Rayson et al., 2010] are prevalent in written and spoken corpora, covering up to 40% of all tokens in a natural language text. They are, however, hard for NLP tools to detect, analyze and translate because of their heterogeneous properties at different levels of linguistic processing: segmentation, lexicon, syntax, semantics, etc.

Moreover, knowledge about MWEs is fragmented. For instance, MWE resources such as lexicons of compounds [Graliński et al., 2010], multi-word proper names [Tran and Maurel, 2006] and complex terms [Savary et al., 2012], valence dictionaries, and lexicon-grammars [Tolone and Sagot, 2011] have often been created with no explicit links to grammar formalisms [Savary, 2008]. Thus, their application to parsing is not always straightforward. Conversely, many existing grammars do not account for MWEs on a large scale, even if the associated formalisms (HPSG, LFG, TAG, CCG, dependency grammars, etc.) allow for their representation.

Resolving, at least partly, the challenges related to MWEs would enhance the quality of various language technology tools. For instance, current machine translation engines provide word-for-word equivalents for unusual syntactic structures and unpredictable senses ("to count France in" → *"compter en France"). In information retrieval, if a query concerns "domesticated animals", a search engine might wrongly retrieve documents containing MWEs like "prendre le taureau par les cornes" ('to take the bull by the horns'). Such wrong translations and false positives can, however, be avoided if the semantic non-compositionality of MWEs is properly accounted for.
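
The information retrieval case above can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the lexicon MWE_LEXICON and the functions tokenize and index_terms are hypothetical names introduced here, not part of any existing tool or of the project's resources. It shows how treating a known MWE as a single index unit keeps the component word "taureau" ('bull') from matching a query on its own.

```python
import re

# Hypothetical toy lexicon: each MWE is stored as a tuple of lower-cased tokens.
MWE_LEXICON = {
    ("prendre", "le", "taureau", "par", "les", "cornes"),
    ("tour", "de", "passe-passe"),
}
MAX_MWE_LEN = max(len(mwe) for mwe in MWE_LEXICON)


def tokenize(text):
    """Lower-case the text and split it into words, keeping hyphenated words whole."""
    return re.findall(r"[\w-]+", text.lower())


def index_terms(text):
    """Return index terms, merging each lexicon MWE into one unit (longest match first)."""
    tokens = tokenize(text)
    terms, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(MAX_MWE_LEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in MWE_LEXICON:
                terms.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No MWE starts here: index the single word.
            terms.append(tokens[i])
            i += 1
    return terms


doc = "Il faut prendre le taureau par les cornes."
print("taureau" in tokenize(doc))     # naive word index: the animal word matches
print("taureau" in index_terms(doc))  # MWE-aware index: the idiom is one opaque unit
```

This sketch only matches contiguous, uninflected occurrences. Handling inflection, insertions and other variations of MWEs is precisely the kind of problem the thesis project addresses.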

*****************************
** Master's thesis project **
This Master's thesis will aim at extending the understanding of MWEs in order to overcome the above-mentioned challenges. Two issues will be addressed:

1. Enrichment of language resources with respect to the syntactic structure of MWEs

Parsing MWEs requires large-coverage language resources such as electronic lexicons and treebanks. Even if such reference resources exist for French, e.g. the French Treebank (FTB), lexicon-grammars (LG) and dictionaries of compounds (DELAC), they represent MWEs as unstructured sequences of tokens rather than phrases annotated with syntax trees. For instance, the compound noun "tour de passe-passe" ('sleight of hand') is annotated in FTB as a flat structure "[tour de passe-passe]MWN", while a more complete representation "[tour [de [passe-passe]MWN]PP]MWN" would help to take possible morphological, syntactic and semantic variations into account ("tour(s) de [sacré] passe-passe"). Enriching resources such as FTB with MWE syntactic structures is the main objective of this sub-project. For the sake of universality, similar methods will be applied to English and other languages known to the candidate.

2. Definition and implementation of MWE description formalisms dedicated to syntactic parsing

MWEs have complex and heterogeneous properties concerning different traditional levels of linguistic processing, including lexicon (description of properties of individual words) and grammar (description of relations among words). A large-coverage natural language grammar, containing tens or hundreds of thousands of rules, is hard to develop and maintain. This problem has been addressed e.g. by factorizing grammar rules into meta-grammars [Crabbé et al., 2013]. MWEs, however, remain a challenge to this concept since their behavior is partly regular (thus factorizable) and partly unpredictable (specific to a few constructions, thus hard to factorize).
Describing MWEs within existing meta-grammar formalisms would inflate the number of meta-rules and thus jeopardize the very concept of a meta-grammar. It seems necessary to extend the formalism so as to allow a combination of meta-rules with a description of exceptional and unpredictable behavior [Grégoire, 2010]. Here again, for the sake of universality, several languages will be taken into account. The extended formalism will, in the longer term, be implemented within a meta-grammar development framework such as XMG.

*****************************
** International framework **
This Master's thesis will be integrated into PARSEME (PARSing and Multi-word Expressions), a European Action funded by the COST program (http://www.cost.eu/domains_actions/ict/Actions/IC1207). The Action's consortium gathers partners from 27 countries around scientific challenges in the automatic processing of Multi-Word Expressions.

****************
** References **
Constant, M., Sigogne, A., and Watrin, P. (2012). Discriminative strategies to integrate multiword expression recognition and parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL’12, pages 204–212, Stroudsburg, PA, USA. Association for Computational Linguistics.
Crabbé, B., Duchier, D., Gardent, C., Le Roux, J., and Parmentier, Y. (2013). XMG: eXtensible MetaGrammar. Computational Linguistics, 39(3):1–38.
Duchier, D., Dao, T.-B.-H., and Parmentier, Y. (2013). Model-Theory and Implementation of Property Grammar. Journal of Logic and Computation, pages 1–19. To appear.
Duchier, D., Parmentier, Y., and Petitjean, S. (2011). Cross-framework Grammar Engineering using Constraint-driven Metagrammars. In 6th International Workshop on Constraint Solving and Language Processing (CSLP’11), pages 32–43, Karlsruhe, Germany.
Graliński, F., Savary, A., Czerepowicka, M., and Makowiecki, F. (2010). Computational Lexicography of Multi-Word Units: How Efficient Can It Be? In Proceedings of the COLING-MWE’10 Workshop, Beijing, China.
Grégoire, N. (2010). DuELME: a Dutch electronic lexicon of multiword expressions. Language Resources and Evaluation, 44(1-2):23–39.
Nivre, J. and Nilsson, J. (2004). Multiword Units in Syntactic Parsing. In MEMURA 2004 - Methodologies and Evaluation of Multiword Units in Real-World Applications, Workshop at LREC 2004, pages 39–46, Lisbon, Portugal.
Rayson, P., Piao, S., Sharoff, S., Evert, S., and Villada Moirón, B., editors (2010). Multiword expressions: hard going or plain sailing? Volume 44 of Language Resources and Evaluation. Springer.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. (2002). Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of CICLING’02. Springer.
Savary, A. (2008). Computational Inflection of Multi-Word Units. A contrastive study of lexical approaches. Linguistic Issues in Language Technology, 1(2):1–53.
Savary, A., Zaborowski, B., Krawczyk-Wieczorek, A., and Makowiecki, F. (2012). SEJFEK - a Lexicon and a Shallow Grammar of Polish Economic Multi-Word Units. In Proceedings of Cognitive Aspects of the Lexicon (COGALEX-III), a Workshop at COLING 2012.
Tolone, E. and Sagot, B. (2011). Using Lexicon-Grammar tables for French verbs in a large-coverage parser. In Vetulani, Z., editor, Human Language Technology. Challenges for Computer Science and Linguistics. 4th Language and Technology Conference, LTC 2009, Poznań, Poland, November 6-8, 2009, Revised Selected Papers, volume 6562 of Lecture Notes in Artificial Intelligence (LNAI), pages 183–191. Springer Verlag.
Tran, M. and Maurel, D. (2006). Prolexbase : Un dictionnaire relationnel multilingue de noms propres. Traitement automatique des langues, 47(3):115–139.