A Network of Excellence forging the
Multilingual Europe Technology Alliance

The Estonian Language in the Digital Age — Executive Summary

During the last 60 years, Europe has become a distinct political and economic structure. Culturally and linguistically it is rich and diverse. However, from Portuguese to Polish and Italian to Icelandic, everyday communication between Europe’s citizens, within business and among politicians is inevitably confronted with language barriers. The EU’s institutions spend about a billion euros a year on maintaining their policy of multilingualism, i.e., translating texts and interpreting spoken communication. Does this have to be such a burden? Language technology and linguistic research can make a significant contribution to removing the linguistic borders. Combined with intelligent devices and applications, language technology will help Europeans talk and do business together even if they do not speak a common language.

The only (unthinkable) alternative to this kind of multilingual Europe would be to allow a single language to take a dominant position and replace all other languages. One way to overcome the language barrier is to learn foreign languages. Yet without technological support, mastering the 23 official languages of the member states of the European Union and some 60 other European languages is an insurmountable obstacle for Europe’s citizens, economy, political debate, and scientific progress. The solution is to build key enabling technologies. Language technology targeting all forms of written text and spoken discourse can help people to collaborate, conduct business, share knowledge and participate in social and political debate regardless of language barriers and computer skills. It often operates invisibly inside complex software systems. Language technology solutions will eventually serve as a unique bridge between Europe’s languages. An indispensable prerequisite for their development is first to carry out a systematic analysis of the linguistic particularities of all European languages and of the current state of language technology support for them.

Around one million people speak Estonian as their mother tongue. Estonian is the only official language in the Republic of Estonia. Practical language usage in Estonia is regulated by the Language Act and the legislation based thereon. At the same time, Estonia is well known for its e-government and e-Estonia policies. The Estonian language is supported by a long tradition of Estonian education and research.

Estonian does not belong to the family of Indo-European languages. The characteristic features of Estonian include the accent on the first syllable, a high frequency of vowels as opposed to consonants, three different lengths of vowels and consonants, the lack of grammatical gender and articles, and a basic vocabulary different from that of the Indo-European languages. Estonian has a rich morphological system. Compounding is relatively free and productive in Estonian, and a compound is written as a single word form. Derivation is another productive device for forming new lexical items. Although Estonian has been described as an SVO language, its word order is rather free.

The automated translation and speech processing tools currently available on the market fall short of the envisaged goals. The dominant actors in the field are primarily privately owned, for-profit enterprises based in North America. As early as the late 1970s, the EU realised the profound relevance of language technology as a driver of European unity. At the same time, national projects were set up that generated valuable results but never led to a concerted European effort. In Estonia, a language technology research scene exists, built up with the support of larger research programmes in the past.

Due to the complexity of human language, modelling our tongues in software and testing them in the real world is a long, costly business. Unfortunately, language models built for English are not easily transferable, as Estonian has a flexible word order, unlimited compound building and a richer inflection system.

Still, as a product of laborious work, there is a reliable spelling checker for Estonian that is integrated into the main office software suites.

The Google search engine has so many users among Estonians that since 2009, the verb guugeldama has even had an entry in the Eesti Õigekeelsussõnaraamat (Estonian orthographic and explanatory dictionary). Language-independent search tools can only find word forms that are identical to the query word or that include the query word as a substring. As Estonian is a language with rich morphology, and since the stem itself may vary in addition to the endings, language-specific tools such as lemmatisers are needed for searching and indexing. It is officially recommended to use an Estonian lemmatiser for searching and indexing full-text databases in public sector information systems in Estonia.
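The difference between substring search and lemma-based search can be sketched in a few lines of Python. This is only a toy illustration: the LEMMAS table is a hand-written stand-in for a real Estonian lemmatiser, the function names are hypothetical, and the four short phrases serve as made-up "documents". It uses two nouns whose stems change under inflection: käsi 'hand' (genitive käe, partitive kätt) and jõgi 'river' (genitive jõe, partitive jõge).

```python
# Toy illustration only: LEMMAS is a hand-written stand-in for a real
# Estonian lemmatiser and covers just two nouns.
LEMMAS = {
    "käsi": "käsi", "käe": "käsi", "kätt": "käsi",  # 'hand'
    "jõgi": "jõgi", "jõe": "jõgi", "jõge": "jõgi",  # 'river'
}

# Made-up example "documents" (short phrases).
DOCUMENTS = ["suur jõgi", "jõe ääres", "külm käsi", "käe peal"]


def lemmatise(token):
    """Return the dictionary form of a token, or the token itself if unknown."""
    token = token.lower()
    return LEMMAS.get(token, token)


def substring_search(query, documents):
    """Language-independent search: match the query as a literal substring."""
    return [i for i, doc in enumerate(documents) if query.lower() in doc.lower()]


def build_lemma_index(documents):
    """Map each lemma to the set of document numbers in which it occurs."""
    index = {}
    for i, doc in enumerate(documents):
        for token in doc.split():
            index.setdefault(lemmatise(token), set()).add(i)
    return index


def lemma_search(query, index):
    """Lemmatise the query and look it up in the lemma index."""
    return sorted(index.get(lemmatise(query), set()))


index = build_lemma_index(DOCUMENTS)
print(substring_search("jõgi", DOCUMENTS))  # finds only the nominative form
print(lemma_search("jõgi", index))          # also finds the genitive 'jõe'
```

Because the genitive jõe does not contain the string jõgi, the substring search misses the second phrase, whereas the lemma index retrieves both.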

The two main types of systems ‘acquire’ language capabilities in a similar manner to humans. Statistical (or ‘data-driven’) approaches obtain linguistic knowledge from vast collections of concrete example texts. The second approach to language technology is to build rule-based systems. The great advantage of rule-based systems is that the experts have more detailed control over the language processing. Drawing on the insights gained so far, today’s hybrid language technology mixing deep processing with statistical methods should be able to bridge the gap between all European languages and beyond.

European research in the area of language technology has already achieved a number of successes. For example, the translation services of the European Union now use the Moses open-source machine translation software, which has been mainly developed in European research projects.

Machine translation is particularly challenging for the Estonian language. The potential for creating arbitrary new words by compounding makes dictionary analysis and dictionary coverage difficult, free word order and split verb constructions pose problems for analysis, and the amount of available parallel text is limited. In spite of this, Estonian is one of the languages (currently around 50) that can be translated by computer.
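To illustrate why compounding strains dictionary coverage, here is a minimal sketch of a dictionary-based compound splitter. It rests on several assumptions: LEXICON is a hypothetical mini-lexicon of stems and genitive forms, split_compound is an illustrative name, and the greedy longest-prefix strategy is one simple choice among many, not the method used by any particular Estonian analyser. None of the compounds raudteejaam 'railway station', raamatukogu 'library' or lennujaam 'airport' is listed in the lexicon, yet each can be analysed from its parts.

```python
# Hypothetical mini-lexicon of stems and genitive forms; a real system would
# rely on a full morphological lexicon.
LEXICON = {"raud", "tee", "jaam", "raamatu", "kogu", "lennu"}


def split_compound(word, lexicon=LEXICON):
    """Return one segmentation of `word` into lexicon entries, or None."""
    if word == "":
        return []
    # Try longer prefixes first so that longer known parts are preferred.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


for compound in ["raudteejaam", "raamatukogu", "lennujaam", "arvuti"]:
    print(compound, split_compound(compound))
```

A word whose parts are missing from the lexicon (here arvuti 'computer') yields None, which is the coverage problem in miniature: every new stem has to be added before the compounds built on it can be analysed.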

In the long term, spoken language applications will play a far more central role as a user-friendly input method for smartphones. This will be largely driven by stepwise improvements in the accuracy of speaker-independent speech recognition via the speech dictation services already offered as centralised services to smartphone users. Speech recognition applications of this kind for smartphones, developed at the Institute of Cybernetics at TUT, won the grand prize of the Estonian Language Deed 2011.

As this series of white papers shows, there is a dramatic difference between Europe’s member states in terms of both the maturity of the research and the state of readiness with respect to language solutions. Even for large European languages like German or English, with well-studied language technologies, there are still many open research issues; for Estonian, the amount of further research needed is even greater.

In the case of the Estonian language, we can be cautiously optimistic about the current state of language technology support. For Estonian, a number of technologies and resources exist, but far fewer than for English. Language technology support today is still far from the state needed to offer the support that a truly multilingual knowledge society requires.

Tools for speech recognition and speech synthesis have been developed by leading research institutions in Estonian LT. Although the automatic morphological analysis of Estonian is complicated, the efficiency of morphological tools (such as the tokeniser, lemmatiser and morphological analyser) is comparable to that of similar tools for other major European languages. The development of syntactic parsers must continue in order to achieve better performance. The only available tool for Estonian text generation is a morphological synthesiser. Most users prefer Google’s machine translation service. An Estonian-English machine translation system is also under development at the University of Tartu. There would probably be high demand for a Russian-Estonian and Estonian-Russian automatic translation service. Most of these tools have been developed by research institutions and can be regarded as prototypes, not as mature products. Unfortunately, the industry consists of only a few SMEs. With regard to resources such as reference corpora, lexicons, wordnets and terminologies, the situation is also reasonably good for Estonian, since substantial resources have been built in recent decades.

When it comes to more advanced fields like text semantics, language generation, and annotated multimodal data, Estonian clearly lacks basic tools and resources even if some of these are currently under development.

This research and development was supported by various national language technology programmes, which guarantees the availability of these tools and resources.

This white paper series complements the other strategic actions taken by META-NET (see the appendix for an overview). Up-to-date information such as the current version of the META-NET vision paper or the Strategic Research Agenda (SRA) can be found on the META-NET web site: http://www.meta-net.eu.

META-NET’s vision is high-quality language technology for all languages that supports political and economic unity through cultural diversity.