The use of cutting-edge genomics, proteomics and metabolomics methods generates ever-increasing amounts of data in ever-decreasing timescales. Special mathematical and computational methods are required to extract relevant information from the patterns in these data. The data mining specialist Karsten Borgwardt from Tübingen is developing such methods specifically for application in the life sciences.
The term data mining, coined by analogy with gold mining, is quite a clever choice: data mining aims to extract “the data gold” from huge quantities of data, releasing information that leads to new knowledge. In the life sciences, ever-increasing amounts of data accumulate within shorter and shorter time periods. The following example illustrates this nicely: while the sequencing of the first human genome took more than 10 years to complete and was extremely costly – estimates suggest around 100 million dollars per project – cutting-edge “next-generation sequencing” devices are able to decipher the genomes of humans, animals and plants within a few weeks and the costs have dropped to a few thousand dollars.
Genotyping in medical diagnostics is even quicker and cheaper. Genotyping is the process of determining differences between the genotype of an individual and a reference sequence, for example a DNA sequence variation affecting a single nucleotide (known as a single-nucleotide polymorphism; SNP). Some SNPs correlate with a particular effect in patients and therefore with certain diseases. The situation becomes more complex when it is not a single SNP that leads to the disease, as in the case of sickle cell anaemia, but when several SNPs need to occur in a specific pattern in order to cause disease or increase disease risk. In the latter case, the question that arises is whether a pattern has to be complete in order to have a specific probability of leading to disease. SNPs and SNP patterns are also used as predictive markers in numerous aspects of medical care, including drug efficacy, treatment response and adverse reactions to specific drugs.
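The basic idea of comparing an individual's sequence with a reference can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the two sequences are invented toy strings, they are assumed to be already aligned and of equal length, and the function name find_single_nucleotide_differences is hypothetical. Real genotyping pipelines work with sequencing reads or array data and are far more involved.

```python
def find_single_nucleotide_differences(reference, sample):
    """List positions where an aligned sample differs from the reference.

    Returns tuples of (1-based position, reference base, sample base).
    Assumes both sequences are already aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (index + 1, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Invented toy sequences, not real genomic data.
toy_reference = "ACGTTGCAAGCT"
toy_sample = "ACGTTGCTAGCT"

differences = find_single_nucleotide_differences(toy_reference, toy_sample)
for position, ref_base, sample_base in differences:
    print(f"position {position}: {ref_base} -> {sample_base}")
```

In this toy case the sample differs from the reference at a single position, which is the kind of variation that a SNP describes.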
Such issues are Prof. Dr. Karsten Borgwardt’s special area of research. The 32-year-old computer scientist works at the Max Planck Institutes for Developmental Biology and Intelligent Systems in Tübingen. He is also professor of data mining in the life sciences at the University of Tübingen. “For me, data mining means exploiting the special properties of data in order to develop efficient algorithms for statistical analyses in the life sciences,” said Borgwardt, referring to his area of research. Borgwardt studied computer science with a minor in biology at the University of Munich and, in 2003, completed a master’s course in biology at the University of Oxford. The two degrees provided him with optimal conditions for a research area that had long attracted his attention. “As a teenager, I was already fascinated by the possibilities that arise from the sequencing of genomes and I was sure that the huge amounts of data would lead to mathematical and computational challenges. During my degree thesis, I realised that there are many algorithmic problems with applications in biology,” Borgwardt said.
The combination of biology and informatics enabled him to pursue a rapid career path. It took him only two-and-a-half years to complete his PhD thesis on data mining at the University of Munich, which also won the Heinz-Schwärtzel Dissertation Award. He then went on to become a postdoctoral research associate in the machine learning group at the University of Cambridge, where he worked on machine learning in biology. Borgwardt focussed specifically on the development of computer-assisted systems that recognised data flow patterns and regularities and were able to “learn” to derive rules and statistical dependencies. Back then, Borgwardt was already part of international scientific networks through which he established contacts with the Max Planck Institutes in Tübingen. Two Max Planck directors, Prof. Dr. Bernhard Schölkopf and Prof. Dr. Detlef Weigel, recognised the huge potential of the young researcher and persuaded him to move to Tübingen. It can be safely assumed that little persuasion was necessary as Borgwardt was well aware of the reputation of the bioinformatics group in Tübingen. “Tübingen has an excellent reputation in the area of machine learning and bioinformatics; there is hardly any other institute in the world that combines machine learning with molecular biology as well as the Tübingen-based Max Planck Institutes do.” It took little more than a year before Borgwardt became a W2 research group leader and in 2011 he was offered a professorship in data mining. Borgwardt thus followed the third of three possible paths to becoming a professor in the natural and engineering sciences in Germany: he neither habilitated nor was he a junior professor. His strength was his outstanding scientific achievements and the fact that he had already made a name for himself as a research group leader who attracted PhD students from leading universities in Europe, Asia and the USA.
Since January 2013, Borgwardt has been the scientific coordinator of the Europe-wide Marie-Curie network “Machine learning for personalised medicine”. “The network aims to develop new statistical and algorithmic tools that enable personalised medical treatment of patients according to their genetic and molecular properties. Machine learning has great potential in personalised medicine, but the knowledge we currently have differs widely between different diseases. So we have quite a lot of work to do,” says Borgwardt.
Borgwardt’s qualities as a researcher are underlined by the fact that he has been awarded the 2013 Alfried Krupp Award, which is endowed with one million euros. The award is one of the most prestigious research awards for young professors in Europe. It will be presented to Borgwardt in the Villa Hügel, headquarters of the Alfried Krupp von Bohlen und Halbach Foundation, in Essen in November 2013. Borgwardt will use most of the prize money to recruit more group members. “I have no major experimental expenditure and can use the money mainly for hiring PhD students. I will also use some of it to expand our computer resources,” Borgwardt says.
The Alfried Krupp Award will help Borgwardt to drive his team’s SNP analyses further forward. Amongst other things, the team is focussed on the basic statistical question of which SNPs correlate most strongly with the occurrence of a certain disease. “When we compare the SNPs of 10,000 patients with those of 10,000 healthy controls, we need to analyse hundreds of thousands of SNPs. We use Manhattan plots to show how strongly the individual SNPs correlate with the occurrence of a disease,” Borgwardt explains. In a Manhattan plot, the SNPs are arranged along the X-axis according to their position on the chromosome, and the negative logarithm of the association P-value of each SNP is plotted on the Y-axis. The Manhattan plot takes its name from its resemblance to the Manhattan skyline, i.e. a few skyscrapers towering over lower-level buildings.
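How the values behind such a plot can be obtained is sketched below in Python. The sketch rests on several assumptions that are not taken from Borgwardt’s work: it uses a simple Pearson chi-square test on allele counts in cases and controls as the association test, the three SNPs and their counts are invented toy data, and the function names are hypothetical. It only computes the coordinates (chromosome, position, negative base-10 logarithm of the P-value); drawing them is left to any plotting tool.

```python
import math


def allele_chi_square_p_value(case_alt, case_ref, control_alt, control_ref):
    """P-value of a Pearson chi-square test (1 degree of freedom) on a 2x2 table."""
    total = case_alt + case_ref + control_alt + control_ref
    case_total = case_alt + case_ref
    control_total = control_alt + control_ref
    alt_total = case_alt + control_alt
    ref_total = case_ref + control_ref
    if 0 in (case_total, control_total, alt_total, ref_total):
        return 1.0  # no information in a degenerate table
    chi_square = (
        total * (case_alt * control_ref - case_ref * control_alt) ** 2
        / (case_total * control_total * alt_total * ref_total)
    )
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi_square / 2.0))


def manhattan_points(snps):
    """Return (chromosome, position, -log10 P) per SNP, sorted by genome position."""
    points = []
    for chromosome, position, counts in snps:
        p_value = max(allele_chi_square_p_value(*counts), 1e-300)  # avoid log(0)
        points.append((chromosome, position, -math.log10(p_value)))
    return sorted(points)


# Invented toy data: (chromosome, position, (case_alt, case_ref, control_alt, control_ref))
toy_snps = [
    (1, 1000, (500, 500, 500, 500)),  # identical allele frequencies
    (1, 2000, (700, 300, 500, 500)),  # clearly different frequencies
    (2, 500, (520, 480, 500, 500)),   # nearly identical frequencies
]

points = manhattan_points(toy_snps)
for chromosome, position, height in points:
    print(f"chr{chromosome}:{position}  -log10(P) = {height:.2f}")
```

The SNP whose allele frequencies differ clearly between cases and controls receives by far the largest value and would be the “skyscraper” in the plot, while the other two stay close to the baseline.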
Borgwardt’s team develops and optimises the tools it uses to derive such statistical relationships. As helpful as the results are, Borgwardt is careful to point out that care must be taken not to read too much into them: “Correlation is not causality. We use Manhattan plots to find positions in the genome which should be investigated in greater detail in order to detect genes and genome alterations.” Causalities can only be deduced in the later course of the investigations.
In order to shed light into the complex darkness of the correlations, Borgwardt has plans to develop new algorithms. He hopes to find not only the correlation of one SNP, but also of SNP pairs with a specific phenotype, i.e. physical properties, the presence of incompatibilities or diseases, to name but a few phenotypic features. Borgwardt’s work is not only of importance for the field of medicine and for humans: “Plant research as well as veterinary medicine can also benefit from our findings.”
How quickly the work produces successful results also depends on computing resources: “When we are analysing the correlations of SNP pairs with phenotypes, we have to look at up to 10¹⁴ SNPs. Such analyses can paralyse even large computer clusters for days. Theoretically, we are already able to highly efficiently identify whole networks of SNPs, interacting genes and proteins that correlate with certain phenotypes. Therefore, we are currently developing algorithms that enable a smart search at the same time as using as little computing capacity as possible,” said Borgwardt. Instead of using supercomputers, Borgwardt has plans to use laptop computers for his calculations in the future.
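Why pairwise analyses are so demanding follows from simple counting: n SNPs form n(n-1)/2 distinct pairs, so the number of tests grows quadratically with the number of SNPs. The short Python sketch below illustrates this. The SNP counts and the throughput of 100 million tests per second are assumptions chosen purely for illustration, not figures from Borgwardt’s group, and the function names are hypothetical.

```python
import math


def snp_pair_count(snp_count):
    """Number of distinct unordered SNP pairs: n * (n - 1) / 2."""
    return math.comb(snp_count, 2)


def exhaustive_search_days(pair_count, tests_per_second):
    """Days needed to test every pair once at a given (assumed) throughput."""
    seconds_per_day = 86_400
    return pair_count / tests_per_second / seconds_per_day


# Hypothetical throughput, chosen only to make the orders of magnitude tangible.
ASSUMED_TESTS_PER_SECOND = 1e8

for snp_count in (500_000, 10_000_000):
    pairs = snp_pair_count(snp_count)
    days = exhaustive_search_days(pairs, ASSUMED_TESTS_PER_SECOND)
    print(f"{snp_count:>10,} SNPs -> {pairs:.2e} pairs, about {days:.2f} days")
```

Half a million SNPs already yield more than 10¹¹ pairs, and around ten million SNPs yield several times 10¹³, which is why an exhaustive search is impractical and a smarter search strategy is needed.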
Further information:
Max Planck Institute for Intelligent Systems
Prof. Dr. Karsten Borgwardt
Spemannstr. 38
72076 Tübingen
Tel.: +49 (0)7071 / 601-1784
E-mail: karsten.borgwardt(at)tuebingen.mpg.de