The OCRE Project had the opportunity to sit down and chat with Carsten Thiel, CTO of the Consortium of European Social Science Data Archives (CESSDA), about Commercial Cloud in academic archiving services.
Over this short series of blog posts, we will learn what Carsten Thiel had to say about the challenges for data archiving and reuse, some misunderstandings and the real benefits of Commercial Cloud solutions for researchers.
For many of us, the potential problems around findable, accessible, interoperable, and reusable (FAIR) data have never entered our minds. We search on Google, Bing, Yahoo, or another search engine, and we often get a full page of relevant results.
Many of us don’t stop to think about the complex algorithms that sort through all of the content online to find exactly what we searched for. We don’t consider the importance of the metadata that allows the algorithms to easily sort through all the information.
In the case of academic research data, it doesn’t work quite as smoothly as your favourite search engine. Research data is continually being produced around the world, but it often suffers from both a lack of FAIR data practices and correct archiving.
This leads to negative outcomes, such as data being lost over time, research being repeated because existing datasets cannot be found, the loss of potentially useful knowledge, and the waste of public funds.
These critical issues led the European Commission to invest €320 million between 2015 and 2020 in the European Open Science Cloud, an initiative to resolve these problems. A further €950 million will be invested over 2020-2030.
Carsten Thiel has to deal with these problems every day as CTO of CESSDA, a social sciences data archiving organisation.
“One more obvious problem is making the data findable across Europe, with language barriers. Sometimes metadata is only available in the language of the country of origin. But a lot of it is also available in English, with English metadata at least.”
This is one of the simpler problems of working with digital data. The real difficulties start when sensitive data is involved. This can happen very easily within the social sciences, as the subjects of the studies are people. Whenever studies are done on subjects such as health, income, or personal status, the data is likely to be sensitive. This leads to very restrictive rules around where it can be stored, who can access it, and how it can be accessed.
“The old school solution is that a person has to physically come to the data archive and sit in a special room in the basement with a specially prepared computer that has no internet connection, you're not allowed to take in your mobile phone with you, camera and so on. Which, of course, is inconvenient, not only because of Corona, but it has always been an expensive and inconvenient way of doing research.”
This creates problems with using Commercial Cloud storage solutions, as where and how the data is stored is not flexible. CESSDA has member archives with data that is legally not allowed to leave the country in which the archive is situated. This means a local data storage solution becomes necessary.
Despite the potential difficulties in choosing the right digital solution, CESSDA chose a Commercial Cloud solution for its central office. When the office was set up, it was simply the easiest and most flexible option. The only data CESSDA holds centrally on GCP (Google Cloud Platform) is metadata, which is public and not sensitive.
“Our main office runs some central services, including the catalogue of the metadata from our national nodes, currently some 325,000 records. And we're running some additional services such as for vocabularies and thesauri and so on, in multiple languages.”
This multilingual thesaurus is currently available in 14 languages (with more on the way), and is one of the ways CESSDA is trying to fix the first problem around metadata in different languages. Through this tool hosted on CESSDA’s Commercial Cloud, researchers can understand the standardised metadata of a different language.
For CESSDA, Commercial Cloud is empowering the research tools, services, and data storage they provide to the scientific community. For an organisation that started as a small office, building its own infrastructure would not have been possible without extensive funds and a bigger team. Commercial Cloud has allowed them to buy the cloud storage and tools they need on demand.
Is Commercial Cloud the right solution for your institution? Browse the OCRE Catalogue to find out which platforms can be procured through the OCRE Framework Agreements in your country.
About Carsten Thiel
Carsten Thiel is Chief Technical Officer (CTO) at CESSDA ERIC, the Consortium of European Social Science Data Archives, at its main office in Bergen, Norway. He is in charge of CESSDA's technical roadmap and strategy and the infrastructure's interoperability within the EOSC, including the SSHOC cluster project, which is coordinated by CESSDA. Carsten has previously worked at the University of Göttingen as Technology Coordinator and Co-Manager for the DARIAH-DE research project and worked with DARIAH ERIC on its EC-funded projects. He holds a PhD in Mathematics from the University of Magdeburg. His research interests include digital research infrastructures, distributed development processes and the DevOps approach to infrastructure management.