The LDL4HELTA project

The Linked Data Lexicography for High-End Language Technology Application (LDL4HELTA, http://ldl4.com/) project aims to address the issues described above by combining lexicography with Linked Data and by integrating high-quality closed data sources with open data, in order to develop new Language Technology methods and tools. The project is part of the EUREKA bilateral Austria-Israel R&D framework and is led by Semantic Web Company (SWC) and K Dictionaries (KD). It receives scholarly support mainly from the Austrian Academy of Sciences and the Polytechnic University of Madrid.

The project combines KD’s lexical resources from multilingual and pedagogical lexicography with SWC’s technical expertise in semantic technologies, the Semantic Web and LD. The goal is to develop new products and services for the international LT market that offer a clear competitive edge and answer the fast-growing demand for language-independent, language-specific and cross-language solutions, such as cross-lingual search and multilingual data management.

To provide such solutions, a multilingual approach to metadata and data management is needed, and this is where SWC’s PoolParty Semantic Suite (http://www.poolparty.biz/) comes into play. Because PoolParty follows W3C Semantic Web standards (http://w3.org/standards/semanticweb/) such as SKOS (http://w3.org/2004/02/skos/), it already builds on language-independent technologies. For text analysis and text extraction, however, the ability to process multilingual information and data is key to success, which means that such systems need to speak as many languages as possible.

The cooperation with KD in the course of the LDL4HELTA project enables the PoolParty Semantic Suite to continuously “learn to speak” more languages, and to do so more precisely, by drawing on KD’s rich monolingual, bilingual and multilingual content and its lexicographic know-how as a basis for improved multilingual text analysis and processing.

One focus of LDL4HELTA is to model KD data and convert it into RDF (Resource Description Framework), and to enrich this RDF data with third-party sources following the Linked Data design principles of Tim Berners-Lee (https://www.w3.org/DesignIssues/LinkedData.html). A SPARQL endpoint then serves as an API that enables complex and flexible querying over this data.
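As a rough illustration of what converting dictionary data into RDF and querying it can look like, the following minimal sketch turns one dictionary entry into triples, serialises them as N-Triples, and filters them with a simple pattern match. The entry fields, the `example.org` namespace, the use of SKOS properties for headwords and translations, and all function names are assumptions made for this example; they are not the project's actual data model. The SPARQL string shows the kind of query that might be sent to an endpoint and is not executed here.

```python
# Illustrative sketch only: the entry layout, the example.org namespace and
# the choice of SKOS properties are assumptions, not the project's model.
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/dict/"  # hypothetical namespace


def literal(text, lang):
    """Build a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_triples(entry):
    """Turn one dictionary entry (a plain dict) into (s, p, o) triples."""
    subject = f"<{BASE}{entry['id']}>"
    lang = entry["lang"]
    triples = [
        (subject, f"<{SKOS}prefLabel>", literal(entry["headword"], lang)),
        (subject, f"<{SKOS}definition>", literal(entry["definition"], lang)),
    ]
    # Translations become preferred labels in their own language.
    for target_lang, word in sorted(entry.get("translations", {}).items()):
        triples.append((subject, f"<{SKOS}prefLabel>", literal(word, target_lang)))
    return triples


def to_ntriples(triples):
    """Serialise triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Hypothetical sample entry.
SAMPLE_ENTRY = {
    "id": "bank-1",
    "headword": "bank",
    "lang": "en",
    "definition": "an institution that keeps and lends money",
    "translations": {"de": "Bank", "es": "banco"},
}

# A query of the kind a SPARQL endpoint could answer (not executed here).
SPARQL_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?entry ?label WHERE {
  ?entry skos:prefLabel ?label .
  FILTER (lang(?label) = "de")
}
"""

if __name__ == "__main__":
    print(to_ntriples(entry_to_triples(SAMPLE_ENTRY)))
```

The `match` helper only stands in for a real triple store; in practice the triples would be loaded into a store that exposes a SPARQL endpoint, and a query such as `SPARQL_QUERY` would replace the in-memory filter.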

The second focus of the project is to improve Word Sense Disambiguation (WSD) for entity extraction and semantic annotation. Here we work on a combination of methods: (i) using dictionary data, (ii) using thesauri and knowledge models, (iii) making use of corpora and freely available lexical resources, and finally (iv) integrating users-first-choice mechanisms.
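To make the idea of combining dictionary data with a user preference more concrete, here is a minimal sketch of a simplified Lesk-style disambiguator. It scores each sense by the word overlap between its dictionary gloss and the surrounding context, and adds a bonus to a sense the user has previously chosen. This is a generic textbook technique used for illustration, not the project's actual WSD method; the sense identifiers, glosses, stopword list, bonus weight and the reading of "users-first-choice" as a stored user preference are all assumptions.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "for", "that", "is"}


def tokenize(text):
    """Lower-case the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(context, senses, user_choice=None, user_bonus=1.0):
    """Pick a sense by gloss/context overlap plus an optional user bonus.

    senses maps a sense id to its dictionary gloss. user_choice is the id
    of a sense the user selected before (hypothetical mechanism).
    Returns the best sense id and the score of every sense.
    """
    context_words = tokenize(context)
    scores = {}
    for sense_id, gloss in senses.items():
        score = float(len(context_words & tokenize(gloss)))
        if sense_id == user_choice:
            score += user_bonus
        scores[sense_id] = score
    # On a tie, max() keeps the first sense in insertion order.
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical sense inventory for the headword "bank".
SENSES = {
    "bank/finance": "an institution that keeps and lends money",
    "bank/river": "the land along the side of a river or lake",
}

if __name__ == "__main__":
    sentence = "She sat on the bank of the river and watched the water"
    print(disambiguate(sentence, SENSES))
    print(disambiguate("I went to the bank", SENSES, user_choice="bank/river"))
```

When the context shares no words with any gloss, the user bonus decides the outcome; with richer resources, the overlap score could be extended with evidence from thesauri, knowledge models or corpora.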

The project started in July 2015 and runs for 24 months. It is supported by an advisory board including Christian Chiarcos (Goethe University, Frankfurt), Orri Erling (Google), Asunción Gómez Pérez (Universidad Politécnica de Madrid), Sebastian Hellmann (Leipzig University), Alon Itai (Technion, Haifa), and Eveline Wandl-Vogt (Austrian Academy of Sciences).