META-FORUM 2012 - Report
A Strategy for Multilingual Europe
More than 200 participants from research, various industries and politics. 57 speaker contributions. 12 award winners. Two days of intense discussions about the current state and future of language technology in Europe. All this and more was META-FORUM 2012.
META-FORUM 2012 was organised by META-NET, a Network of Excellence consisting of 60 research centres from 34 countries. META-NET is dedicated to building the technological foundations of a multilingual European information society. META-NET is forging META, the Multilingual Europe Technology Alliance, which currently has more than 640 members.
This year's META-FORUM was the third edition, after two successful events in Brussels (2010) and Budapest (2011). The timing of META-FORUM 2012 was ideal: currently, there is a lot of discussion about topics in upcoming, long-term research programmes. The META-FORUM event, the Language White Paper series and the Strategic Research Agenda (SRA) developed within META-NET all aim at presenting a very strong message from the language technology community.
In what follows, after a summary of the key messages, you will find short descriptions of all presentations. Links are provided to videos of the presentations and to the slides used. We also recommend watching the videos: they are short and contain much more detail.
Report Contents: Summary • Opening Session • Europe and its Languages • Industry and Innovation • Language Technology for Europe 2020 • LT Fireworks • LT Research on the European Level • META-SHARE • LT Research in the Member States • Keynote Lecture: Fernando Pereira • Closing Session
What follows is an analysis and synthesis of ideas brought out during META-FORUM 2012. It is very high level, and you should watch the presentations to get a better understanding of the points made.
Thibaut Kleiner opened the event with a description of the challenges the language technology community in Europe currently faces, and of the promises it might fulfil in the future. Hans Uszkoreit summarised what META-NET has done so far to address the challenges, and what we can expect from this event.
The Europe and its languages session discussed the current situation for Europe's languages. Algirdas Saudargas emphasised the needs of smaller languages like Lithuanian. András Kornai focused on the digital representation of languages on the Web: “if a language is not on the Web, it doesn’t exist”. Bolette Sandford Pedersen summarised the goals and outcomes of the Language White Paper Series, an analysis of 30 European languages and their language technology support. The conclusion is that no language has excellent support, and that many languages lack even basic language resources for creating language technologies like machine translation with good quality. The session closed with a panel encompassing representatives of EFNIL (European Federation of National Institutions for Language). They reiterated the need for basic language resources for all European languages, and the usefulness of the Language White Paper Series to convey the situation and potential role of language technology also on the political level.
The Industry and Innovation session was opened by Serge Gladkoff. He presented the viewpoint of GALA on the efforts of META-NET and the needs of language service providers. Tomás Pariente brought up the topic of (big) open data. Language technology can support the creation of high quality big data and provide useful means for its consumption. Radu Soricut presented a view on machine translation from an industry perspective: machine translation needs to be part of many industrial ecosystems and big data sources like the Web. George Wright introduced the BBC World Service archive and how speech technology is used to foster access to thousands of radio programmes. Florence Beaujard showed how Airbus is using controlled language to assure high quality in the life-critical area of cockpit design. Lori Thicke discussed the future of machine translation. The connection with other technologies like controlled languages and the integration into industrial workflows can help to achieve better quality and to bootstrap the creation of new machine translation systems. A short discussion touched, among other things, on the availability of language resources and on licensing issues that need to be resolved.
Language Technologies for Europe 2020 was a joint session organised by META-NET and LT-Innovate. Georg Rehm introduced the approach of META-NET towards the topic, followed and complemented by Jochen Hummel and LT-Innovate. Hans Uszkoreit summarized the META-NET Strategic Research Agenda (SRA). The SRA encompasses three priority themes: “Translation Cloud”, presented by Andrejs Vasiljevs, “Social Intelligence and eParticipation”, presented by Marko Grobelnik, and “Socially Aware Interactive Assistants”, presented by Joseph Mariani. Seven key participants from LT-Innovate reported on the current state of the LT-Innovate innovation themes. These themes are an important input for framing how “innovation” is described within the SRA. The final joint discussion focused on the needs of end users for language technology.
The first day closed with a fireworks display of twelve winners: eleven META Seals of Recognition were given to products in various areas of language technology: information extraction (Jakub Zavrel); speech processing (Alessandro Tescari, Siegfried Kunzmann, Radim Kudla and Joseph Mariani); machine translation (Heidi Depraetere, Bernardo Magnini, Radu Soricut, Kirti Vashee and Dion Wiggins and Tony O'Dowd); and basic tools for natural language processing (Lauri Karttunen). The META Prize was awarded to the JRC Optima Activity for Language Technology, represented by Erik van der Goot. JRC develops the Europe Media Monitor. EMM is a language technology enabled service that gathers information from news portals in 60 languages and classifies the articles, analyses news texts, issues alerts and produces visual presentations of the news.
The second day started with an overview of short and long term opportunities for LT Research and Innovation on the European Level. Kimmo Rossi introduced two new programmes that are currently being planned: Horizon 2020 and Connecting Europe Facility (CEF). Roberto Cencioni provided details about topics in current calls for project proposals.
The topic of exchanging and re-using language resources was the primary focus of the META-SHARE session. META-SHARE is an open infrastructure to foster LR/LT sharing and re-use. Stelios Piperidis described general aspects of META-SHARE, e.g. those related to licensing. The following three presenters focused on contributions to META-SHARE: Tamás Varadi mostly for Slavic languages, Antonio Branco for south European languages, and Andrejs Vasiljevs for north European languages, with a specific focus on smaller languages. In the final discussion it became clear that, given this input, META-SHARE is already populated with many resources that are useful for research and industry. Now the sustainability of META-SHARE needs to be assured, as does the availability of resources with adequate licenses.
The final panel session on LT research in EU member states and regions complemented the previous session on support on the European level. The situation was explained for the countries Hungary (Károly Gruber), Bulgaria (Diana Popova), Czech Republic (Karel Oliva), France (Edouard Geoffrois), the Netherlands (Alice Dijkstra), and Slovenia (Simona Bergoc). Joseph Mariani explained various political instruments to support joint research between member states. The panel discussion focused on benefits of coordinated programmes on the European level and methods to create these.
The keynote lecture of META-FORUM 2012 was given by Fernando Pereira. He talked about language technology efforts at Google. Here the focus is on language technology workflows that scale to the Web and interrelate external knowledge bases with Web content. The good news is that language technology has its role in this workflow and achieves better results than the simple matching of patterns in texts; however, to be able to compete in such industrial scenarios, language technology must be robust and scalable.
Hans Uszkoreit gave a short closing presentation, coming back to the challenges for the LT community mentioned by Thibaut Kleiner. META-FORUM 2012 presented impressive results that European language technology has achieved so far, and issues that need to be addressed in the coming months. The community is in good shape to deal with these challenges and will present the outcomes at META-FORUM 2013.
Thibaut Kleiner, Member of the Cabinet of Neelie Kroes, Commissioner for the Digital Agenda and Vice-President of the European Commission, EC, opened the conference with a presentation entitled “Technological Challenges of the Multilingual European Society”. He gave a warning message to the language technology community: future funding for language technology in Europe cannot be taken for granted. So far, funding for research e.g. in the area of machine translation often leads to product development outside of Europe, see the example of Google Translate. The language technology community in Europe itself has to demonstrate that it can make the step from research to innovation. The good news is that language technology is needed in complex, multilingual information spaces like the Web, to create business opportunities for multilingual societies inside and outside of Europe.
Language technology can help to master the vast amount of information on the Web in various languages with applications like news and opinion mining or business intelligence. Such application areas for language technology are obvious; the question is mostly who will take the lead: Europe or others.
Other communities, e.g. around open data, have managed to attract interest from policy makers and to generate output in many SMEs; language technology, too, needs a strong voice at the policy maker level.
Hans Uszkoreit, from DFKI (Germany) and coordinator of META-NET, gave an overview of the topics covered at META-FORUM 2012 in a presentation entitled “European Language Technology: Where do we stand – in a nutshell”.
The three major challenges for language technology are to preserve multilingual diversity, to secure cross-lingual flow of information, and to give means for communication, information and knowledge management to all language communities. In META-NET, the European language technology community has worked for more than two years in three lines of action to address these challenges:
- META-VISION: Building a community with a shared vision and Strategic Research Agenda (SRA);
- META-SHARE: Building an open resource exchange infrastructure;
- META-RESEARCH: Building bridges to neighbouring technology fields.
META-FORUM 2012 presents various outcomes and the current state of this work, like the language white paper series, the META-VISION process involving about 100 experts and leading to a draft SRA, and the META-SHARE repositories covering more than 1300 language resources.
Building a bridge to the opening keynote from Thibaut Kleiner, Hans Uszkoreit reminded us that the decision about future funding for language technology in Europe is not finalized. META-FORUM 2012 and the SRA are crucial instruments for moving the discussion forward and for influencing the decision in a positive manner.
Europe and its Languages
Algirdas Saudargas, Member of the European Parliament from Lithuania, opened the “Europe and its Languages” session. He congratulated META-NET on what has been achieved already, and emphasized the needs of countries like Lithuania. With the help of language technology, “islands” like Lithuania can be connected to the rest of Europe. This will lead both to cultural bridges and to new business opportunities.
It is important to translate this “message” into political language, so that it will be taken up outside the language technology community. Only in this way will we be able to convince policy makers. In this conversation, language technology should not be described as science fiction that puts machines between humans, but as a means to support communication between humans.
András Kornai from the Hungarian Academy of Sciences, Budapest, gave a talk entitled “Language Death in the Digital Age”. Indications of language death are e.g. loss of function or prestige. In the digital age, loss of function happens in media like email, and prestige translates to “if a language is not on the Web, it doesn’t exist”. In this respect, only a few languages are well established and in the “comfort zone” of the digital age.
The presentation took a look at Wikipedia. An overall analysis shows that only a small percentage of languages are in the comfort zone. Many languages are vital in terms of speakers, but not represented well in the digital world. We need enabler projects for building basic tools for these languages and also for “heritage languages”, so that they can achieve a passive Web presence of lexicons, classical literature etc.
Bolette Sandford Pedersen, University of Copenhagen, Denmark, gave a talk entitled “The META-NET Language White Paper Series: Overview and Key Results”. The language white papers cover all EU languages in 30 volumes and report on general, social, strategic and technological aspects, as well as the level of support through language technology. More than 160 industry and language experts have worked as (co)authors. In addition, more than 50 experts contributed data and information.
A specific effort was put into cross-lingual ranking of language technology support, with broad categories like “excellent”, “good”, “moderate”, “fragmentary”, and “weak or no support”. The results vary with regard to the technology area. For example, support for voice technologies is slightly better than for machine translation. Nevertheless, “excellent” support is not given for any language, and even the level “good” is rarely available for languages other than English. The language white papers have detected major gaps in language technology support for each European language. These gaps need to be addressed in the near future to assure the competitiveness of the languages in the digital market.
Panel with Representatives of the EFNIL Language Communities
Gerhard Stickel, Institute for the German Language and EFNIL (European Federation of National Institutions for Language) president, opened the panel. He introduced EFNIL as an organization, its history, the EFNIL conference series and collaborations with META-NET. META-NET can contribute new ICT related developments to EFNIL; EFNIL provides needs and use cases for language technology from the perspectives of language research, planning and teaching. Progress in language technology still has to be made, but there are more and more applications of language technology in EFNIL related areas. In the future this might lead to more exchange of knowledge between the fields, but maybe also to concrete, joint projects.
Arnfinn Muruvik Vonen, from the Language Council of Norway, acknowledged that local politicians in Norway have recognized the risk that Norwegian is excluded from current and future areas of digital communication. As one concrete step, a so-called “language bank” has been established for Norwegian language resource collections. Such activities and connections on the European level will help to improve the situation for Norwegian.
Ray Fabri, National Council for the Maltese Language, described the complex bilingual situation in Malta. About 500,000 speakers use both English and Maltese. English dominates written communication, and in oral communication there is a lot of mixing. Several projects help to create language technology support for Maltese, in areas like digitization of a Maltese dictionary. The involvement of politicians, researchers and industry assures tangible outcomes in such projects.
Peter Spyns, De Nederlandse Taalunie, referred back to the Wikipedia analysis made by András Kornai, saying that Dutch has a good position in the digital age. As early as 1999, the Dutch Language Union identified gaps in language technology support. National funding helped to develop basic language resources. These are now used both in research and in industry. To assure this uptake for the future, the Dutch Language Union is also closely connected to META-NET and CLARIN.
Arvi Tavast, Institute of the Estonian Language, congratulated the authors of the white paper series on their results. He added a small warning with regard to the message made for smaller languages: politicians look at the (economic) outcomes of language technology research. To make these visible, projects are necessary that put the languages into the “comfort zone”, in the sense of András Kornai. For these projects and in general, a copyright law is needed that eases the re-use of language resources.
Algirdas Saudargas, Member of the European Parliament from Lithuania, first emphasized that the white paper analysis shows the poor situation of Lithuanian with regard to language technology support. He then focused on research topics that need to be addressed in the future, using Ray Jackendoff’s concept of “Parallel Architecture” as a framework: we have to try to bridge the conceptual and spatial structures to assure deep understanding between different language and cultural communities.
During the panel discussion, potential additional input to the analysis of language technology support was discussed. This includes e.g. insights from professional translators. Additional languages like sign languages have to be taken into account. From an end user perspective, a need like “we want good quality speech recognition” was articulated. We need to make clear also to policy makers that such a request involves the development of the underlying language technology “food chain”, including lots of components for e.g. morphological and phonological analysis.
Relating the Wikipedia analysis from András Kornai to the panel, most official European languages were categorized as being “vital” in the digital world. However, to achieve wide uptake of language technology, the technology also has to be marketed in the right manner, e.g. via “cool, easy to understand Apps”. Finally, it was emphasized that a European copyright law that eases re-use of language resources for research purposes is badly needed.
Industry and Innovation – Language Technology Made in Europe
Serge Gladkoff, GALA Standards Director and GALA Board member, President of Logrus International Cooperation, opened the “Industry and Innovation – Language Technology made in Europe” session with a talk entitled “Translation and Language Services”.
GALA is the umbrella organization for language service providers and users across Europe. In the localization industry, the variety of data types, CMS systems, language technology components etc. leads to growing interoperability issues. Content creators are mostly unaware of problems in the localization chain, which leads e.g. to a lack of rich localization metadata. GALA members try to address these issues with language technologies like machine translation, combined with e.g. translation memories and terminology systems. However, to avoid duplication of efforts, we need to invest in future translation technology and in means to ease collaboration. Standardization and interoperability are key to achieving this goal. GALA can contribute to META the needs and knowledge of the industry, including real production problems, and help to prepare a future platform of multilingual services.
Tomás Pariente, Atos Research & Innovation, Spain, gave a talk entitled “Text Analytics and Big Data”. Atos is an international information technology service company, with an emphasis on both research and development and innovation in areas like semantic technologies, language technologies, big data and linked data. The technologies are applied in usage scenarios like multilingual data processing in the Olympic games, or multilingual and multimodal information access in the biomedical domain.
In these and other areas, more and more unstructured data has to be processed. Atos is involved in projects that deal with such information in specific domains, e.g. finance. This is, however, only one area of big data: in the BIG project, the aim is to analyse big data from various perspectives, e.g. technology, business and policy, and in many different domains: health, public sector, finance & insurance, media & entertainment, and manufacturing, retail, energy, transport. BIG’s main idea is to gather the relevant stakeholders; the language technology community now has the opportunity to be recognized as part of them and to contribute ideas and methodologies for handling big data.
Radu Soricut, manager of application science & engineering and senior research scientist, SDL International, gave a talk entitled “Changing the Perspective on Machine Translation”. In the past, the MT community was concerned with the MT technology itself, such as competing approaches (statistical vs. rule-based MT) or integration with translation memories. However, the end customer mostly cares about the value in a customer-specific ecosystem. Hence, MT needs to be part of many infrastructures, including CMS or ERP systems, and needs to be able to make use of the Web as a large data source.
With MT in large industrial ecosystems, exploitation of massive parallel data available in translation memories and connected to the Web becomes possible. MT systems can easily be tailored to customer-relevant domains. MT engines can take various information sources like user feedback into account. Challenges for the future of industry-strength MT include scalability of customization, adaptation to customers and automatic learning from user feedback.
George Wright, head of the Internet Research & Future Services Team, BBC Research, gave a presentation entitled “Speech analysis and Archive research at the BBC”. The background is the BBC World Service archive, which covers 70,000 radio programmes but has only sparse metadata available for accessing it. Language technology can help to (re-)categorize the content and create links between content items and the Web.
As a result, the system e.g. builds suggestions about topics covered in a programme or identifies separate speakers. The accuracy of the results is still a challenge. One issue is the availability of adequate language resources, e.g. tools that can handle British English adequately. The development of these and other language resources must be supported, so that the vast volumes of multimedia content will be accessible for a global audience.
Florence Beaujard, head of Linguistics and Physiology Group, Airbus, gave a talk entitled “Linguistic Activities of Airbus Design Office”. In cockpit design, the special purpose language of pilots and many other constraints like size of displays have to be taken into account to create clear messages and labels. This is why Airbus has defined a controlled language: it helps to reduce potential ambiguities, and to improve text comprehensibility for non-native English speakers.
There are some general principles like “one word, one meaning” or “one meaning, one word order” underlying the controlled language. In addition, there are lexica and rules on how to write messages or labels. Collaboration with pilots and instructors is crucial for the development of the controlled language. Outcomes so far are various tools, e.g. to extract display text from the designers' specification and to automatically check its adequacy as a message or label. A desire for the future is to ease specification writing for system designers, via dedicated controlled language(s) to guide the designers.
Lori Thicke, CEO, Lexcelera Localization and representative of Translators without Borders, talked about “Why Do We Need Language Technology”. Language technology is needed to deal with a contradictory situation: more and more content has to be translated faster, at higher quality, and at lower cost. Translation also plays a societal role: e.g. access to translated information in developing countries can be critical even for survival. Language technology like machine translation can also help to resolve the mismatch between the digital content available and the number of speakers in developing countries.
For the future of machine translation, it is important to see the technology as a process, including pre-production, the actual processing, post-editing etc. Quality in the source content is key to delivering quality MT. The ACCEPT project is dedicated to developing controlled language rules, which will help with the management of content in social forums and ultimately lead to better quality machine translation. Work areas for the future of MT include post-editing, terminology control and the integration of MT with translation memories.
The discussion touched on the recurring issue of copyright and language resources. Both the corpus created in the “Translators without Borders” project and the archives created by the BBC are valuable resources for research purposes. But they can only be re-used if the thin but important line between distributing resources freely and making them available for research is drawn.
Controlled language was also discussed in terms of re-use. The presentation from Florence Beaujard demonstrated that a concrete controlled language is quite specific to application scenarios. Nevertheless there is the opportunity to re-use controlled language resources, e.g. criteria to reduce synonyms, rules to create acronyms or to generate abbreviations etc. This could be achieved by creating a standardized specification for some aspects of controlled language.
Machine translation is facing questions like what metrics to use for its evaluation. It was proposed that the same metrics should be used as in human translation, e.g. the LISA quality metrics. An issue that has no general solution is machine translation for languages with a limited amount of language resources. There is no silver bullet to solve this problem; in the end, human translators need to create the resources.
Language Technologies for Europe 2020
META-NET and LT-Innovate started this session with a joint slot.
Georg Rehm, META-NET and DFKI, gave a presentation entitled “Introduction and Presentation of Partnership”. After a short history of META-NET, the focus was on the META-VISION line of action for “building a community with a shared vision and strategic research agenda”: as of mid 2012, META-NET has 60 members in 34 countries. Collaboration agreements have been created with 46 other EU-funded projects.
META-NET has established META, the Multilingual Europe Technology Alliance with more than 640 stakeholders from research, various industries, neighbouring fields, and policy makers. Although research, language communities, society and politics are well covered in META, involving industry remains a challenge. LT-Innovate, the forum for Europe’s Language Technology Industry, has been established to fill this gap.
Jochen Hummel, ESTeam and chairman of LT-Innovate, gave an “Introduction to LT Innovate”. LT-Innovate aims at promoting European language technology, unifying the industrial community, and giving the industry a voice towards investors and policy makers.
Language Technology is the missing piece in the puzzle of the digital single market. LT-Innovate is creating an innovation agenda to fill this gap. This agenda complements the META-NET strategic research agenda (SRA), with the aim to foster adoption of research results in the market. About 150 people from the language technology industry participated in the LT-Innovate summit that took place just before META-FORUM. They discussed the “innovation agenda”, showcased their language technology applications, and demonstrated a strong voice of the European LT industry.
Hans Uszkoreit, DFKI and META-NET, gave a presentation entitled “The META-NET Strategic Research Agenda: Overview, Preparation, Dissemination”. Creating the Strategic Research Agenda (SRA) is one main task of META-NET. In the SRA, on the basis of the state of IT technology, a broad vision for the year 2020 and various strategic considerations, three interconnected priority themes have been developed. These will be accompanied by an innovation model, to be developed in close collaboration with LT-Innovate.
Various new topics will influence the SRA: big data, services & cloud computing, and shared infrastructures. Language technologies are prime candidates for “sky computing”, a new area that encompasses the federation of several clouds for creating complex services. A sky computing based, European language technology service platform can be the basis for uniting LT providers, language service providers, researchers, and providers of other services, citizens and corporate users.
Andrejs Vasiljevs, Tilde, presented the SRA priority theme “Translation Cloud”. Many applications needed by EU citizens and businesses require specific or generic translation services: eCommerce, cross-language subtitling, education etc. The translation cloud will be a ubiquitous online platform to provide these services, including various methods like machine translation or automatic language checking, for usage in and delivery to many devices. This will have a huge impact, such as facilitating job opportunities and creating new business opportunities in the huge global market of language services.
The current state is promising: more data and tooling for machine translation is available. Nevertheless, we still need a research breakthrough in areas like high quality MT, and research needs to be organized with close integration to the industry.
Marko Grobelnik, Institut “Jožef Stefan”, presented the SRA priority theme “Social Intelligence and eParticipation”. He started with a review of various trends, like the importance of language related technologies in the Gartner hype cycle, increasing time spent on the social Web, and increasing importance of content aggregators over content creators, leading to more interlinked content and huge amounts of big data.
From this review, various recommendations for topics in a technology and research roadmap emerged: social influence and incentives, information tracking & dynamics, multimodal data processing, visualization and user interaction, and algorithmic fundamentals. An important task is now to present these topics to decision makers and show their relevance for the European citizens and eParticipation.
Joseph Mariani, CNRS-LIMSI/IMMI, presented the SRA priority theme “Socially Aware Interactive Assistants”. The aim is to create multilingual assistants which support human interaction, acting naturally and personalized in various environments, in any language and anywhere. Global abilities are needed for these assistants, like natural interaction with agents (e.g. terminals or robots). In addition, there are domain specific abilities like personalized training in computer aided language learning.
The roadmap for this priority theme encompasses these global and domain specific aspects, and the creation of language resources and evaluation tasks. Other countries (e.g. the US) invest funding in the research needed and develop products in large corporations, e.g. Google and Apple (Siri). Europe must be careful not to fall behind these efforts.
The LT-Innovate Innovation Themes
Key participants of LT-Innovate presented aspects of the “innovation themes” which are under development.
Rubén Riestra, INMARK International Area, provided a general introduction to the envisaged “LT innovation agenda”. The aim is to produce a vision statement on how innovation should enable LT providers to deliver value, that is: new products and services for the digital single market. LT-Innovate has identified five main “innovation clusters”: iEnterprise, iHealth, iHelpers, iServices, and iSkills.
Rose Lockwood, INMARK International Area, presented the approach for writing the innovation agenda. The aim was to create a consolidated view of the software market and the potential “LT market”. This should also include a commercialized LT view that will influence both LT companies and the research community. LT-Innovate has tracked LT related news intensively, leading to the five innovation clusters.
Philippe Wacker, EMF, emphasized the importance of innovation for getting Europe out of the economical crisis. Companies need to channel innovation to the market on a day-to-day basis. Research has to be connected to the market place, since this determines uptake by companies and users. The innovation clusters are instruments to structure this discussion for language technologies.
Paul Welham, CereProc Ltd., presented findings from a panel discussion at the LT-Innovate summit about language technology for people with disabilities and special needs. The aging population creates many challenges, but it also leads to many opportunities for language technology applications. An example is avatars to support communication of elderly people.
Claude de Loupy, Syllabs, presented opportunities for user and product analysis. Language technology can create more value in areas like eCommerce or the travel industry. This industry is huge in Europe and needs information such as: what type of travel are users looking for, and which destinations are relevant? Language technology can help to gather such information fast and on a large scale.
Jochen Hummel, ESTeam AB, discussed the potential language technology innovation area of gaming. The gaming industry managed to turn the barrier of globalizing games into an asset: Today, online games have global players with many languages involved. A great European innovation of the language technology industry would be a platform to crowdsource game localization, involving users and game developers.
Adriane Rinsche, Language Technology Centre Ltd., presented promises of language technology in the health care market. Language technology can help to save costs and improve services, e.g. for patient related information management or health monitoring. There are also multilingual aspects like medical information in tourism. Language technology tools that interface easily with each other and medical infrastructure will lead to excellent opportunities in this market.
Jochen Hummel, ESTeam AB, emphasized the need of European companies for language resources. An infrastructure is needed to easily build, access and license multilingual resources. Like roads or railways, the EU should provide the infrastructure, to foster and support the multilingual market. It is important to create this infrastructure in a procured manner, with one entity responsible for resource collection and distribution.
The joint session between META-NET and LT-Innovate was wrapped up by a short discussion. One topic was the gap between what language technology can already achieve and the needs of the end user. Some types of language technology are getting more and more uptake, e.g. speech interfaces. But widespread adoption is yet to come. The overall usability of language technology has to become a focus of efforts, or in other words: we have solutions, but what was the problem?
LT Fireworks – Research, Innovation and Technology
META Seal of Recognition
Nicoletta Calzolari, CNR, and Georg Rehm, DFKI, chaired the “LT Fireworks” session. Georg Rehm briefly introduced the background of the META Seal of Recognition awards and the META Prize: these awards are given annually at the META-FORUM event, and winners are chosen by the META Technology Council: around 30 experts of the European LT landscape who provide the main input to the Strategic Research Agenda (SRA).
Alessandro Tescari received the seal of recognition for Pervoice. Pervoice provides speech recognition using large vocabularies and handling multiple languages for specific sectors. Solutions based on Pervoice include a remote transcription system, transcription workflow and subtitling solutions.
Siegfried Kunzmann received the seal of recognition for European Media Lab. The EML transcription platform helps to bring automatic transcription to various markets. One important usage scenario is the automatic transcription of voicemails to SMS, e-mail or mobile devices.
Jakub Zavrel received the seal of recognition for Textkernel. Extract! and other Textkernel products use language technologies and machine learning to extract information from CVs. This saves time in processing CVs into recruitment systems and eases aggregation of searchable information.
Heidi Depraetere, on behalf of Paraic Sheridan, received the seal of recognition for IPTranslator created within the PLuTO project. PLuTO is developing an online translation solution for patent translation. It helps the patent researcher to decide quickly whether a text in a foreign language is relevant for a given topic.
Bernardo Magnini, on behalf of Marcello Federico, received the seal of recognition for FBK. Here the IRSTLM toolkit for statistical language models is being developed. It provides a variety of features for creating language models, is integrated e.g. into the MOSES platform, and has been used in various industrial applications.
Radu Soricut received the seal of recognition for SDL. SDL's machine translation system eases access to language pairs, integration with customer systems and control over corporate terms and branding. High quality translation results can be delivered across 30 languages via post editing.
Radim Kudla received the seal of recognition for PHONEXIA s.r.o. PHONEXIA provides speech technologies for identifying various pieces of information from speech, e.g. different speaker, gender, language, keywords, transcription etc. The technologies are applied for example in multilingual speech transcription and keyword spotting systems.
Kirti Vashee and Dion Wiggins received the seal of recognition for Asia Online. Initially Asia Online focused on using machine translation for bringing English content into Asian languages. The scope then was extended to various domains and language pairs. Now also language pairs involving Asian and European languages are being included.
Joseph Mariani, on behalf of Bernard Prouts, received the seal of recognition for Vocapia. Vocapia has created VoxSigma, a software suite with large vocabulary speech-to-text capabilities. VoxSigma has been developed for transcribing large quantities of audio and video. It is used in many applications like media monitoring or speech analytics.
Tony O’Dowd received the seal of recognition for Xcelerator. KantanMT developed by Xcelerator is a cloud based machine translation system. It is based on the Moses platform and provides machine translation to mid-sized language service providers. KantanMT responds to the need of high-quality and low-cost machine translation.
Lauri Karttunen received the seal of recognition for XFST developed within Xerox. XFST is a finite-state toolkit for text processing, e.g. rewriting, tokenization or morphological analysis. Since 1993, it has been used for dozens of languages and in large corporations. The source code of XFST is planned to be available soon under an open source license.
The members of the META Technology Council decided that the scope of the META Prize 2012 should be “Outstanding products or services supporting the European Multilingual Information Society”. There have been 19 nominations, and one clear winner: The prize was given to the JRC Optima Activity for Language Technology, represented at META-FORUM 2012 by Erik van der Goot.
JRC, the Joint Research Centre, is the EC’s in-house science service. One major application developed within JRC is the Europe Media Monitor (EMM). EMM started in 2002 and today processes 150,000 new news articles per day in 50 languages. The articles are classified according to hundreds of subjects and countries.
JRC also has created language resources of enormous value, e.g. multilingual parallel corpora in 22 languages, multilingual multi-label categorisation software, and the multilingual named entity resource JRC-Names. These resources and EMM itself are of high importance for multilingual information gathering.
Plans for LT Research and Innovation on the European Level
Kimmo Rossi, the European Commission, DG for Communications Networks, Content and Technology (CONNECT), gave the opening talk for the first session on the second day. He presented the current state of planning for two new programs: “Horizon 2020” and “Connecting Europe Facility (CEF)”.
In Horizon 2020, language technology is planned to be part of the industrial leadership topic with dedicated funding instruments for SMEs. Relevant topics are related to content technologies and information management, e.g. the creation of tools for handling content in any language, or modelling, analyses and big data visualization.
Unlike Horizon 2020, CEF is not about research or innovation, but about infrastructure. Digital service platforms in areas like eGovernment or eHealth are to be developed. Language technology comes into play via the requirement for multilingual access to online services. A core platform should provide basic language technology building blocks for free, accompanied by various generic services like machine translation.
Roberto Cencioni, the European Commission, DG for Communications Networks, Content and Technology (CONNECT), gave a presentation about “Final 2012/2013 calls in FP7”.
Themes in these calls include global content processing, mining of unstructured information and natural interaction. There are two calls: one dedicated to language, and one especially for SMEs that includes the areas of language and handling of big data.
Three research lines are formulated in the language related call: analytics, focusing e.g. on the interplay of text, speech, audio and video; translation, aiming at high quality MT; and interaction, with the goal of integrating the processing of speech and additional modalities in ICT platforms. In addition there are roadmapping actions, which should target specific sectors, common tools, data sets & standards, integration and evaluation.
The SME call has a focus on analytics and open data. There are project lines for the re-use of open data, transfer and uptake of LT, and software focusing on open data and its applications.
META-SHARE: An Open Resource Exchange Infrastructure
Stelios Piperidis, ILSP, started the session on the open resource exchange infrastructure META-SHARE with a presentation entitled “Overview, Current State, Towards Version 3 of META-SHARE”.
Language resources (LR) are needed everywhere in language related technology. META-SHARE is a network of distributed repositories (so-called “nodes”) for sharing and exchanging LRs, aiming to match LR providers and consumers.
In META-SHARE, LRs are described via a dedicated metadata schema. It supports all services of the infrastructure like storage, browsing, or metadata harvesting. The metadata schema describes the LR itself and provides also additional information, related e.g. to licensing.
Such metadata is important for the legal framework used in META-SHARE. Various licensing templates are provided. They encompass a mix of open and openness inspired models.
In the coming months the META-SHARE software will be improved in various areas like search engine optimisation or data migration. More META-SHARE nodes will be created, and ELRA supported initiatives will be included, to achieve full deployment of META-SHARE from ELRA and its members.
Tamás Varadi, Research Institute for Linguistics, Hungarian Academy of Sciences, gave a presentation entitled “The contribution of CESAR”. CESAR is a part of META-NET focusing on Central and Southeast Europe, covering Polish, Slovak, Hungarian, Croatian, Serbian and Bulgarian.
One major aim is to contribute resources for these languages to META-SHARE. This encompasses monolingual corpora as well as speech corpora, lexica and language technology tools. In addition, cross-linked resources between the six languages (e.g. multilingual parallel corpora) have been developed. A long-term perspective behind these efforts is important: CESAR is going to set up a META-SHARE repository / node for hosting these language resources.
Antonio Branco, University of Lisbon, gave a presentation entitled “The contribution of METANET4U”. METANET4U covers the languages Basque, Catalan, English, Galician, Maltese, Portuguese, Romanian and Spanish.
The project also contributed to the development of META-SHARE, which was the focus of the presentation. This includes among others input to the metadata model, legal or licensing aspects, and various technical areas.
The repositories / nodes have been populated with resources by METANET4U. Seven nodes have been set up. 100% of the resources that are available via these nodes are new, that is, they have not been available via other distribution channels before. A future topic is the interoperability between META-SHARE and other platforms.
Andrejs Vasiljevs, Tilde, gave a presentation entitled “The contribution of META-NORD”. META-NORD covers the Baltic countries (Estonia, Latvia and Lithuania) and the Nordic countries (Denmark, Finland, Iceland, Norway and Sweden).
The focus of the contribution to META-SHARE was European languages with fewer than 10 million speakers. As the analysis in the language white paper series reveals, for many of these languages the amount of high quality language resources is very limited.
META-NORD worked on filling gaps especially in the areas of WordNets, treebanks and terminology resources. As in the other projects that presented contributions to META-SHARE, the sustainability of the repositories is of high importance, and META-NORD has committed to provide support at least for a given time frame.
META-SHARE in 2013 and beyond – Q/A and Panel Discussion
The Q/A and Panel Discussion first focused on concerns about the future of META-SHARE: what will happen when the underlying projects come to an end? ELRA and others involved have committed to guarantee that META-SHARE will receive support for at least two years, and probably for longer.
Another topic was the role of META-SHARE with regards to high quality language resources. META-SHARE is not a means to create these resources, which are needed by the SMEs constituting the majority of the language technology industry in Europe. The main aim is to ease the tasks of sharing and distribution. Nevertheless, META-SHARE already contains valuable, high quality resources e.g. in the realm of terminology. These also fulfil many needs of SMEs.
Various questions were about licensing. META-SHARE has been set up also to become attractive to the open source community. To this end, META-SHARE provides the necessary licenses. Nevertheless, the language technology community itself has expressed the need for restricted licenses. In this respect, the META-SHARE licensing options reflect the current thinking of the community.
Plans for LT Research and Innovation in Member States and Regions
Károly Gruber, Hungarian Ambassador to Brussels, presented the situation in Hungary. Hungary is a linguistic island in Europe, and language barriers in the digital age are huge. Language technology is the key to changing the situation and is regarded as a high priority. National initiatives complement the engagement in META-NET.
Diana Popova, Senior expert, Science Directorate, Ministry of Education, Youth and Science, presented the situation in Bulgaria. Language technology is part of the ICT vertical research, where it has received funding for the past 20 years. Nevertheless, compared to other countries the level of funding is still low.
Karel Oliva, member of the Council of Research, Development and Innovations of the Czech Republic, head of the Institute of the Czech Language of the Czech Academy of Sciences, presented the situation in the Czech Republic. Language technology R&D is anchored in several cities. There is no direct state support or dedicated funding body. Since commercial interest in a small language like Czech is hard to attract, the need for industry partners in EC projects becomes a major obstacle to receiving funding.
Edouard Geoffrois, Ministry of Defense and French National Research Agency, presented the situation in France. Various national agencies cooperate to support language technology related topics. There are large, dedicated programs like Quaero and programs run in cooperation with other countries.
Alice Dijkstra, The Netherlands Organisation for Scientific Research (NWO), presented the situation in the Netherlands. A joint Dutch and Flemish program for language technology that lasted 2005-2012 will have no successor. Nevertheless, language technology can be funded via an “LT inside” approach. It can be part of other themes like the humanities or the creative industry. In addition, funding as part of infrastructure programs can be acquired rather easily.
Simona Bergoc, Department for Slovene Language, Ministry of Education, Science, Culture and Sport, presented the situation in Slovenia. Language technology activities in Slovenia exist, but they are not coordinated inside the country. Funding mostly is provided by the EC, and so far there is no recognition on the national level. A new program for language policy encompasses language technologies as one of the priorities. There is also a plan to create a coordinating state body.
Joseph Mariani, CNRS-LIMSI/IMMI, presented the European Commission's Collaborative Research Instruments. Member states and the EC need more coordination. From the various existing coordination instruments, the “Article 185” seems to be well suited for language technology. The 2008 European Council Resolution on “European strategy on Multilingualism” provides important arguments towards policy makers for the development of language technology in Europe.
The panel discussion brought up mainly two questions: what are the benefits of coordinated programs on the European level, and what is the best approach to create them.
As an answer to the first question, several national projects that targeted similar goals were mentioned. Running such projects without coordination leads to duplication of efforts, and basic tasks like data sharing are hard to achieve. The result is that critical mass compared to other regions in the world is hard to achieve.
Pushing for dedicated funding on the national or the European level requires both a bottom up approach, involving the leading experts in the field, and a political, top down approach. A major argument towards politicians is that multilingualism is the crucial asset of Europe. The European countries have to invest in this area and make content available for the European market in their languages. This will also assure their competitiveness worldwide.
Keynote Lecture: Fernando Pereira
Fernando Pereira, Google, gave the closing keynote of META-FORUM 2012, entitled “Low-Pass Semantics.”
At Google, a lot of effort is put into natural language processing. Nevertheless, the aim is not to achieve sophisticated automatic processing for small pieces of content, but to develop language technology workflows that scale to the whole web. Here, the web serves both as a data source and as target content.
The presentation exemplified this approach with “Low-Pass Semantics”: its aim is to create links between natural language text, external knowledge bases like the so-called “knowledge graph” and other types of data.
Web pages often contain useful pieces of information, but they are hard to identify. The external knowledge base contains keys or identifiers of concepts. In the low-pass semantic approach, these are linked to the text. This improves consistency in interpreting Web content.
The motivation for the approach described is not a research topic, but a user problem: the low precision of Web search. Methodologies from natural language processing play an important role. Grammar parsing or named entity recognition (NER), applied in a robust and scalable manner, help to create better linkage to the knowledge base than pure matching of text patterns. But language technology alone is not sufficient: for web scale, computational power is extremely important, more than advancement of algorithms.
Hans Uszkoreit, DFKI and META-NET, summarized in a brief closing session the next steps for the language technology community in Europe.
The following months will decide about the shape of language technology, including the financial support provided in Europe. In this decision process, the community has to raise its voice via the META-NET Strategic Research Agenda (SRA), which is currently being finalized. The message of the SRA has to be clear and sharp. And even if the language technology specialists already agree with the contents of the SRA, they should reach out to their contacts and make them aware of “our” agenda for the future of language technology.
There is a lot of competition with other research fields – language technology is just one of them. If the community wants to assure support in the future, it needs to spread out widely with a positive message. In addition to the SRA, next year’s META-FORUM 2013 will be one main instrument to convey that message to everybody.