Data Liberation Campaign
We recently conducted an internal survey on the existence and availability of national corpora (or similar resources) for the European languages covered by our network. The results of this small study showed that almost every European language has a reference corpus of established quality, in many cases produced or otherwise endorsed by the respective official language body. However, although these corpora exist and are, in the majority of cases, held by national organisations, language technology researchers cannot get access to them for their own work. For example, researchers cannot download the data or run their own analyses over it.
In most cases, the reasons cited are copyright and redistribution restrictions that the corpus owners or compilers have agreed with the publishers who provided the source data. These restrictions prevent researchers from using the data even for non-profit purposes such as scientific research, which can benefit the entire language and language technology community. This finding is striking in the wake of our recent publication of the Languages in the Digital Age series, which highlighted that a lack of resources, or a lack of availability of resources, is putting many European languages at risk in the digital age.
In response to these findings, we have prepared an open letter to all the official language bodies in Europe and to those who hold the various corpora, calling on them to consider making this important language data available for research purposes. The open letter is also addressed to the European Patent Office. In it, we ask those with the power to do so to reconsider their distribution policies and allow greater access to their data for research. We also offer a safe and secure mechanism (META-SHARE) for sharing the data, should they choose to do so, as well as any help we can provide with licensing, copyright and other legal issues.
The open letter, including a recipient list, is reproduced below. If you feel, as we do, that there is a huge benefit in liberating these corpora and making them available for research, please contact your local language body and let them know that you are in favour of our proposal.