The Computer Science Laboratory of the François Rabelais University in Tours, France,
offers a Master's thesis grant in Natural Language Processing.

***************
** Subject **
Parsing and Multi-Word Expressions

*************
** Profile **
- Last year of a Master's degree in computer science or computational linguistics
- Knowledge of French and English; another language would be a plus
- Interest in linguistics and familiarity with language technology
- Capacity to work independently and as part of a team

***********
** Dates **
Application deadline: November 30, 2014 (or until filled)
Position starts: February/March 2014
Duration: 6 months

***********
** Grant **
Amount: 436 € / month
Funding institution: BAMSOO SARL (research agreement with Université François 
Rabelais Tours)

*************
** Contact **
agata.savary@univ-tours.fr

*****************
** Application **
- CV
- cover letter
- transcript of MSc and BSc grades

*************************
** Hosting Institution **
University: Université François Rabelais Tours 
(http://international.univ-tours.fr/welcome-international-265902.kjsp?RH=INTER&RF=INTER-EN)
Laboratory: Laboratoire d'informatique (LI) (http://li.univ-tours.fr/)
Research team: Databases and Natural Language Processing (BdTln), Campus in 
Blois

*****************
** Supervisors **
Dr Agata Savary (Université François Rabelais Tours)
Dr Yannick Parmentier (Université d'Orléans)
Prof. Jean-Yves Antoine (Université François Rabelais Tours)

**************************
** Scientific challenge **

This Master's thesis will be dedicated to fixed and semi-fixed Multi-Word Expressions (MWEs) such as "French fries", "random access memory", "to do one's best", "to spill the beans", "to kick the bucket", etc. Despite a long-established tradition of linguistic studies dedicated to such expressions, they remain one of the major challenges in bridging the gap between linguistic precision and computational efficiency in Natural Language Processing (NLP) applications [Sag et al., 2002].

MWEs [Rayson et al., 2010] are prevalent in written and spoken corpora, covering up to 40% of all tokens in a natural language text. They are, however, hard for NLP tools to detect, analyze and translate because of their heterogeneous properties at different levels of linguistic processing: segmentation, lexicon, syntax, semantics, etc. Moreover, knowledge about MWEs is fragmented. For instance, lexical resources for MWEs, such as lexicons of compounds [Graliński et al., 2010], multi-word proper names [Tran and Maurel, 2006] and complex terms [Savary et al., 2012], valence dictionaries, lexicon-grammars [Tolone and Sagot, 2011], etc., have often been created with no explicit links to grammar formalisms [Savary, 2008]. Thus, their application to parsing is not always straightforward. Conversely, many existing grammars do not account for MWEs on a large scale, even if the associated formalisms (HPSG, LFG, TAG, CCG, dependency grammars, etc.) allow for their representation.

Resolving, at least partly, the challenges related to MWEs would enhance the quality of various language technology tools. For instance, current machine translation engines produce word-for-word equivalents for unusual syntactic structures and unpredictable senses ("to count France in" → *"compter en France"). In information retrieval, if a query concerns "domesticated animals", a search engine might wrongly retrieve documents containing MWEs like "prendre le taureau par les cornes" ('to take the bull by the horns'). Such wrong translations and false positives can, however, be avoided if the semantic non-compositionality of MWEs is properly accounted for.

*****************************
** Master's thesis project **

This Master's thesis aims to improve the understanding of MWEs in order
to overcome the challenges described above. Two issues will be addressed:

1. Enrichment of language resources with respect to the syntactic structure 
of MWEs

Parsing MWEs requires large-coverage language resources such as electronic lexicons and treebanks. Even if such reference resources exist for French, e.g. the French Treebank (FTB), lexicon-grammars (LG) and dictionaries of compounds (DELAC), they represent MWEs as unstructured sequences of tokens rather than phrases annotated with syntax trees. For instance, the compound noun "tour de passe-passe" ('sleight of hand') is annotated in FTB as a flat structure "[tour de passe-passe]MWN", while a more complete representation "[tour [de [passe-passe]MWN]PP]MWN" would help to take possible morphological, syntactic and semantic variations into account ("tour(s) de [sacré] passe-passe"). Enriching resources such as FTB with the syntactic structure of MWEs is the main objective of this sub-project. For the sake of generality, similar methods will be explored for English and other languages known by the candidate.
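The contrast between the flat and the structured annotation can be sketched in a few lines of Python. This is only an illustration, not the FTB data model: the `Node` class and the helper functions `to_brackets` and `tokens` are hypothetical names introduced here, and the only data used are the two bracketings quoted above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Node:
    """A labelled phrase; hypothetical structure, not the FTB format."""
    label: str                           # e.g. "MWN", "PP"
    children: List[Union["Node", str]]   # sub-phrases or plain tokens


def to_brackets(item: Union[Node, str]) -> str:
    """Render a tree in the bracket notation used in the text."""
    if isinstance(item, str):
        return item
    inner = " ".join(to_brackets(child) for child in item.children)
    return f"[{inner}]{item.label}"


def tokens(item: Union[Node, str]) -> List[str]:
    """Return the leaf tokens of a tree, left to right."""
    if isinstance(item, str):
        return [item]
    return [tok for child in item.children for tok in tokens(child)]


# Flat annotation: the MWE is an unstructured token sequence.
flat = Node("MWN", ["tour", "de", "passe-passe"])

# Structured annotation: the internal PP is made explicit.
nested = Node("MWN", ["tour", Node("PP", ["de", Node("MWN", ["passe-passe"])])])

print(to_brackets(flat))
print(to_brackets(nested))
```

Both trees cover the same tokens, but only the nested one exposes an internal PP node, which is where a variation such as an inserted modifier ("sacré") could be attached.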

2. Definition and implementation of MWE description formalisms dedicated to 
syntactic parsing

MWEs have complex and heterogeneous properties concerning different traditional levels of linguistic processing, including lexicon (description of properties of individual words) and grammar (description of relations among words). A large-coverage natural language grammar, containing tens or hundreds of thousands of rules, is hard to develop and maintain. This problem has been addressed e.g. by factorizing grammar rules into meta-grammars [Crabbé et al., 2013]. MWEs, however, remain a challenge to this concept since their behavior is partly regular (thus, factorizable) and partly unpredictable (specific to a few constructions, thus hard to factorize). Describing MWEs within existing meta-grammar formalisms would inflate the number of meta-rules and thus jeopardize the very concept of a meta-grammar. It seems necessary to extend the formalism so as to allow a combination of meta-rules with a description of exceptional and unpredictable behavior [Grégoire, 2010]. Here again, for the sake of generality, several languages will be taken into account. The extended formalism will, in the longer term, be implemented within a meta-grammar development framework such as XMG.

*****************************
** International framework **

This Master's thesis will be integrated into *PARSEME* (PARsing and Multi-word Expressions), a European action funded by the COST program (http://www.cost.eu/domains_actions/ict/Actions/IC1207). The Action's consortium gathers partners from 27 countries around scientific challenges in the automatic processing of Multi-Word Expressions.

****************
** References **

Constant, M., Sigogne, A., and Watrin, P. (2012). Discriminative strategies to
integrate multiword expression recognition and parsing. In Proceedings of the
50th Annual Meeting of the Association for Computational Linguistics: Long
Papers - Volume 1, ACL’12, pages 204–212, Stroudsburg, PA, USA. Association
for Computational Linguistics.

Crabbé, B., Duchier, D., Gardent, C., Le Roux, J., and Parmentier, Y. (2013).
XMG : eXtensible MetaGrammar. Computational Linguistics, 39(3):1–38.

Duchier, D., Dao, T.-B.-H., and Parmentier, Y. (2013). Model-Theory and
Implementation of Property Grammar. Journal of Logic and Computation,
pages 1–19. To appear.

Duchier, D., Parmentier, Y., and Petitjean, S. (2011). Cross-framework
Grammar Engineering using Constraint-driven Metagrammars. In 6th
International Workshop on Constraint Solving and Language Processing
(CSLP’11), pages 32–43, Karlsruhe, Germany.

Graliński, F., Savary, A., Czerepowicka, M., and Makowiecki, F. (2010). 
Computational Lexicography of Multi-Word Units: How Efficient Can It Be? In
Proceedings of the COLING-MWE’10 Workshop, Beijing, China.

Grégoire, N. (2010). DuELME: a Dutch electronic lexicon of multiword
expressions. Language Resources and Evaluation, 44(1-2):23–39.

Nivre, J. and Nilsson, J. (2004). Multiword Units in Syntactic Parsing. In
MEMURA 2004 - Methodologies and Evaluation of Multiword Units in Real-World
Applications, Workshop at LREC 2004, pages 39–46, Lisbon, Portugal.

Rayson, P., Piao, S., Sharoff, S., Evert, S., and Villada Moirón, B.,
editors (2010). Multiword expressions: hard going or plain sailing?, volume 44
of Language Resources and Evaluation. Springer.

Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. (2002).
Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of
CICLING’02. Springer.

Savary, A. (2008). Computational Inflection of Multi-Word Units. A
contrastive study of lexical approaches. Linguistic Issues in Language
Technology, 1(2):1–53.

Savary, A., Zaborowski, B., Krawczyk-Wieczorek, A., and Makowiecki, F. (2012).
SEJFEK - a Lexicon and a Shallow Grammar of Polish Economic Multi-Word Units.
In Proceedings of Cognitive Aspects of the Lexicon (COGALEX-III), a Workshop
at COLING 2012.

Tolone, E. and Sagot, B. (2011). Using Lexicon-Grammar tables for French
verbs in a large-coverage parser. In Vetulani, Z., editor, Human Language
Technology. Challenges for Computer Science and Linguistics. 4th Language and
Technology Conference, LTC 2009, Poznań, Poland, November 6-8, 2009, Revised
Selected Papers, volume 6562 of Lecture Notes in Artificial Intelligence
(LNAI), pages 183–191. Springer Verlag.

Tran, M. and Maurel, D. (2006). Prolexbase : Un dictionnaire relationnel
multilingue de noms propres. Traitement automatique des langues, 47(3):115–139.