Analysing genomes from virtual cohorts across locations – with the German Human Genome-Phenome Archive (GHGA), a consortium of the DFG-funded National Research Data Infrastructure (NFDI), this should soon be possible. The initiative creates a research platform for human omics data. Since the end of June 2020, the nine consortia that will initially be funded within NFDI until 2025 have been identified – one of them is GHGA. In an interview with the German Biobank Node (GBN), Prof. Dr. Oliver Kohlbacher from the University of Tübingen and Dr. Oliver Stegle from the German Cancer Research Centre in Heidelberg (DKFZ) and the European Molecular Biology Laboratory (EMBL) talk about their motivations for founding GHGA, how they want to involve future users in the development at an early stage and what connection they have with biobanking.
Why is GHGA a necessary initiative?
Oliver Kohlbacher: Until now, there has been no organisation in Germany for the structured management of human omics data. It is also a challenging task, because large amounts of very sensitive data are involved. Once a person's genome has been sequenced and the data is made accessible to others, that person can be re-identified. Sequencing data is therefore subject to special requirements.
Oliver Stegle: Such structures already exist in other countries, and the European Genome-Phenome Archive (EGA) is a European network for this task. It is therefore all the more gratifying that a national research data infrastructure is now being established in Germany. When the NFDI initiative came along, applying and closing the gap was the obvious step for us. At the same time, with GHGA we are creating a national node for the EGA network.
Who is mainly involved in GHGA?
Kohlbacher: The DKFZ is mainly responsible, but the locations of the sequencing centres are also very important for us. In addition to Heidelberg, Berlin and Munich, these are the locations of the DFG Competence Network "Next Generation Sequencing": the centre of the universities of Cologne, Bonn and Düsseldorf and the centres in Dresden, Kiel and Tübingen. We are setting up data centres there so that the data can remain where it is generated.
Stegle: The European Bioinformatics Institute of the EMBL (EMBL-EBI) in the UK is a strong partner for us because it operates the European Genome-Phenome Archive. In addition, many of the people involved contribute expertise across a wide range of disciplines. Our clinical partners are of course particularly important. The cooperation with the Heidelberg Academy of Sciences and Humanities is also significant, as GHGA deals with ethical and legal issues.
What are the tasks ahead?
Kohlbacher: In the first few years, we will deal with the technical and organisational structure. Unfortunately, in our field it is not an easy task to find staff. At the same time, however, we are already setting up the infrastructure and developing the software to be able to "talk" to EGA. In about two years, our archive should be ready to start. By then we want to be able to manage data, store it in a distributed way while still being able to find all of it, connect to cloud infrastructures and carry out distributed analyses on the large volumes of data.
Stegle: We will also address our future users, initially those working in oncology and rare diseases in particular. We will ask them: What are their questions? Which technical processes do they use that our system must also offer? What framework conditions do we need to create so that the infrastructure is accepted as widely as possible?
What can GHGA contribute in the fight against COVID-19?
Kohlbacher: Together with the sequencing centres, we have launched "DeCOI" – the German COVID-19 Omics Initiative. The genomes of COVID-19 patients are of course very interesting for research. They can provide clues as to what the risk factors for COVID-19 are or why the disease is severe in certain patients.
What should the "final" archive look like and how can it be used?
Stegle: GHGA is more than an archive; it is a platform for genome data and research. But I will start with the archive part: In GHGA, users will store the omics data they generate. This includes omics data of all kinds, from genome to transcriptome, epigenome and single-cell data. This storage is long-term and complies with scientific requirements. In addition, we enable data producers to share these data sets with others in a controlled and secure way. This is already a basic requirement for scientific publications. For those interested in the existing data, we offer support with useful functions for searching and analysing the data and, of course, by connecting to the European Genome-Phenome Archive.
Kohlbacher: Many researchers today no longer want to work with just the 20 genomes to which they themselves have direct access; they want to analyse 10,000 or 50,000 genomes. To make this possible using "virtual cohorts", there will be cloud capacity at all locations. In this way, researchers will be able to carry out distributed analyses and, for example, analyse 10,000 genomes in Tübingen and 20,000 in Heidelberg. GHGA thus offers enormous added value, especially when we talk about AI applications in medicine and genomics.
Who can use the archive?
Stegle: In principle, anyone can. There are no requirements regarding the size or scientific value of the data contributed. But working with large amounts of data is not easy. That's why we create community-specific portals, i.e. easy-to-use web interfaces. This is also a first step towards the democratisation of this data, because with GHGA we are making it possible for researchers who are not bioinformaticians to work with the data sets.
How will you implement the link to clinical data?
Kohlbacher: It was originally planned that we would network with the NFDI4med consortium of the German Centres for Health Research and the Medical Informatics Initiative. Unfortunately, this consortium was not recommended for funding. We are now in discussions about how we can still implement the link between our data sets and the MII data integration centres.
What is crucial for the process of generating omics data from biosamples?
Kohlbacher: Biosamples are precious. It is important to handle them responsibly so that, for example, five aliquots of the same sample are not used for the same analysis. And of course comparability is important. Just as for sampling and storage, we need standardised protocols for data generation and bioinformatics. This is where we have to work together to ensure that the SOPs are compatible.
In what other areas can GHGA and GBN/GBA cooperate?
Stegle: Specimen donors' consent forms are an important basis both for the work of biobanks and for the handling of omics data. If I request a sample and a lengthy bureaucratic battle follows, this considerably reduces the benefit of the stored samples. The more simply and uniformly the permitted uses of a sample are regulated, the more research benefits. From a data protection point of view, a biosample is not so different from the omics data generated from it. This offers a special opportunity to work together.
The interview was conducted by Verena Huth.