Our research on natural language focuses on the study of diversity across languages and domains, uncovering its manifestations in both lexicon and grammar. The motivations are simultaneously theoretical and practical. On the theoretical side, we find language diversity (e.g., characterising the universality or locality of lexical or grammatical phenomena) interesting per se and highly relevant to various fields of linguistics such as linguistic typology or lexical semantics. On the practical side, we show that by formally characterising language diversity it becomes possible to drastically improve both the precision and the coverage of existing language resources. Finally, beyond offering these new resources to the computer science and computational linguistics communities, we exploit them to solve practical problems of semantic interoperability such as cross-lingual and cross-domain data integration, search, or ontology alignment.
Our research projects and their results are implemented within one of two principal initiatives.
Lexical diversity is represented in the Universal Knowledge Core (UKC), a large-scale, multilingual, machine-readable lexico-semantic database covering over 300 languages and a growing number of domains. We are actively researching two ways of collecting formal evidence: computationally through quantitative methods, and in a human-driven way through expert sourcing and crowdsourcing.
- Computing lexical diversity: this line of research aims to collect evidence on the universality vs diversity of lexical phenomena across languages, and to represent such results within the UKC in a formal way. By characterising the level of diversity of languages with respect to a given phenomenon (such as a word being a polyseme, a lexical gap, a cognate, etc.), the scope of applicability of that phenomenon can be delimited and used for higher-precision lexical knowledge generation or validation tasks.
- Diversity-aware expert sourcing and crowdsourcing: the incompleteness of lexico-semantic resources (wordnets, domain terminologies, thesauri, wiktionaries) is a major stumbling block for natural language understanding applications, especially for endangered, minority, or otherwise under-resourced languages. We address diversity-related research questions underlying manual lexical translation, such as the sourcing of lexical gaps, collaborative translation and validation methods, or the combination of expert sourcing and crowdsourcing.
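To make the idea of delimiting a phenomenon's scope more concrete, the following is a minimal sketch in Python of how lexical gaps for a concept might be counted across languages. It does not reflect the actual data model or API of the UKC: the `LEXICON` dictionary, the concept names, and the functions `gap_languages` and `gap_ratio` are all hypothetical and serve only as an illustration, with a small hand-written toy sample standing in for real data.

```python
from typing import Dict, List, Optional

# Hypothetical toy sample (not UKC data): for each concept, a mapping from
# language code to a single-word lexicalisation, or None for a lexical gap.
LEXICON: Dict[str, Dict[str, Optional[str]]] = {
    "elder_brother": {"en": None, "it": None, "hu": "báty", "mn": "ах", "ja": "兄"},
    "water": {"en": "water", "it": "acqua", "hu": "víz", "mn": "ус", "ja": "水"},
}


def gap_languages(lexicalisations: Dict[str, Optional[str]]) -> List[str]:
    """Return the sorted language codes that have no word for the concept."""
    return sorted(lang for lang, word in lexicalisations.items() if word is None)


def gap_ratio(lexicalisations: Dict[str, Optional[str]]) -> float:
    """Return the fraction of sampled languages in which the concept is a gap."""
    if not lexicalisations:
        raise ValueError("no languages sampled for this concept")
    return len(gap_languages(lexicalisations)) / len(lexicalisations)


if __name__ == "__main__":
    for concept, lexicalisations in LEXICON.items():
        print(concept, gap_languages(lexicalisations), gap_ratio(lexicalisations))
```

In this toy sample, a concept with a ratio of zero behaves as universal, while a non-zero ratio signals diversity; in a validation task, the list of gap languages could tell a translation pipeline where not to expect, or generate, a single-word equivalent.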
We tackle diversity in grammar using the Semantic Cross-Lingual Label Parser (SCROLL) tool, a multilingual NLP framework specifically tuned to interpreting short textual labels as typically found within structured data.
- Multilingual and Multi-Domain Natural Language Understanding: NLU tasks are typically solved with respect to a given language and application domain. We are interested in multilingual and/or multi-domain settings, such as the cross-border interoperability of electronic health records. Instead of solving such problems through a mere collection of special-purpose NLP pipelines, we consider multilingual and multi-domain NLP as a single system where processes and resources can be shared and reused across languages and domains, based on their grammatical sameness and diversity.
- The Language of Data: a major yet hardly researched application of NLP is the understanding of text within structured data. Such text is typically short and obeys grammatical rules considerably different both from regular text (e.g., newswire) and from text in social media (e.g., tweets). In consequence, conventional NLP tools and trained machine learning models provide very weak results over structured data. Our research aims to characterise the linguistic specificities of such text, demonstrating a certain level of grammatical coherence that we call the language of data. The practical applications of this research are multilingual NLP tools and resources fine-tuned to processing structured datasets, e.g., for the automation of data interoperability tasks.
- Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. CogNet: a Large-Scale Cognate Database. ACL 2019, Florence, Italy.
- Fausto Giunchiglia, Khuyagbaatar Batsuren, and Gábor Bella. Understanding and Exploiting Language Diversity. Proceedings of IJCAI 2017, Melbourne.
- Gábor Bella, Fausto Giunchiglia, and Fiona McNeill. Language and Domain Aware Lightweight Ontology Matching. Journal of Web Semantics, vol. 43, March 2017, pp. 1-17.
- Gábor Bella, Alessio Zamboni, and Fausto Giunchiglia. Domain-Based Sense Disambiguation on Multilingual Structured Data. Proceedings of the ECAI 2016 workshop on Diversity Aware Artificial Intelligence, The Hague, Netherlands.
- Abed Alhakim Freihat, Gábor Bella, Hamdy Mubarak, and Fausto Giunchiglia. A Single-Model Approach for Arabic Segmentation, POS Tagging, and Named Entity Recognition. Proceedings of the International Conference on Natural Language and Speech Processing, Algiers, Algeria, 2018.