Research
Big data analysis and biomedical research meet in our lab: We develop novel data mining algorithms for detecting patterns and statistical dependencies in large datasets from the life sciences.
Our work aims to contribute to two major scientific goals of the 21st century: enabling the automatic generation of new knowledge from big data through machine learning, and understanding the relationship between diseases and the molecular properties of patients, thereby enabling precision medicine.
Below you can find further information on some of our projects:
Machine Learning: Comparing Structured Data
We develop methods for comparing and classifying high-dimensional objects. One prominent example is graph kernels, i.e. efficient similarity functions between graphs.
- Graph Kernels (Code and Data)
- A Confounder-Corrected Support Vector Machine Classifier (ccSVM)
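To make the idea of a graph kernel concrete, here is a minimal sketch of one well-known instance, a Weisfeiler-Lehman style subtree kernel. It is an illustration only and not necessarily the code behind the projects listed above; the graph representation (an adjacency dictionary plus a node-label dictionary) and the function names `wl_features` and `wl_subtree_kernel` are assumptions made for this example.

```python
from collections import Counter


def wl_features(adjacency, labels, iterations=2):
    """Count node labels over successive Weisfeiler-Lehman refinements.

    adjacency: dict mapping each node to a list of its neighbours.
    labels: dict mapping each node to its initial label.
    """
    current = {node: str(labels[node]) for node in adjacency}
    counts = Counter(current.values())
    for _ in range(iterations):
        refined = {}
        for node, neighbours in adjacency.items():
            # New label = own label plus the sorted multiset of neighbour labels.
            neighbour_labels = sorted(current[n] for n in neighbours)
            refined[node] = current[node] + "(" + ",".join(neighbour_labels) + ")"
        current = refined
        counts.update(current.values())
    return counts


def wl_subtree_kernel(graph_a, graph_b, iterations=2):
    """Kernel value = dot product of the two label-count vectors."""
    features_a = wl_features(*graph_a, iterations=iterations)
    features_b = wl_features(*graph_b, iterations=iterations)
    shared = features_a.keys() & features_b.keys()
    return sum(features_a[key] * features_b[key] for key in shared)


# Toy graphs (hypothetical data): a 3-node path and a triangle, all nodes labelled "C".
path = ({0: [1], 1: [0, 2], 2: [1]}, {0: "C", 1: "C", 2: "C"})
triangle = ({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0: "C", 1: "C", 2: "C"})
print(wl_subtree_kernel(path, triangle, iterations=1))
```

The sketch keeps the full label strings instead of compressing them to short identifiers, which is simpler to read but grows quickly with the number of iterations; a practical implementation would hash or re-index the labels.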
Machine Learning: High-Dimensional Correlations
We develop methods for measuring statistical dependence between high-dimensional variables, two-sample tests that determine whether two samples were drawn from the same distribution, outlier detection algorithms that find "unusual" observations in a given dataset, and approaches that detect non-linear dependence between variables.
- Kernel Method for the Two Sample Problem
(Gretton, A., Borgwardt, K., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13(25), 723-773.)
- Detecting Non-Linear Correlations via the Mutual Information Dimension
(Sugiyama, M., & Borgwardt, K. (2013). Measuring Statistical Dependence via the Mutual Information Dimension. Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 2013), 1692-1698.)
- Rapid Outlier Detection via Sampling
(Sugiyama, M., & Borgwardt, K. (2013). Rapid Distance-Based Outlier Detection via Sampling. Advances in Neural Information Processing Systems 26 (NIPS 2013), 467-475.)
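As a rough illustration of a kernel two-sample test, the sketch below computes a biased estimate of the squared maximum mean discrepancy (MMD) between two samples and calibrates it with a permutation test. The Gaussian kernel, the fixed bandwidth, the permutation-based null distribution and all function names are assumptions chosen for simplicity; they are not taken from the software linked above.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length tuples of numbers."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def mmd_permutation_test(xs, ys, bandwidth=1.0, n_permutations=200, seed=0):
    """Return the observed statistic and a permutation p-value."""
    rng = random.Random(seed)
    observed = mmd2_biased(xs, ys, bandwidth)
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sample labels are exchangeable.
        rng.shuffle(pooled)
        permuted = mmd2_biased(pooled[:len(xs)], pooled[len(xs):], bandwidth)
        if permuted >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_permutations + 1)
```

A small p-value suggests that the two samples were not drawn from the same distribution. This direct implementation evaluates the kernel on every pair of points in every permutation, so it is only practical for small samples.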
Machine Learning: Significant Pattern Mining
We develop methods that discover significant patterns in high-dimensional datasets while being runtime-efficient and statistically sound. Our algorithms can be applied to graphs or collections of sequences, and they make it possible to account for dependencies between objects, to control the Family-Wise Error Rate, and to correct for categorical covariates.
- Overview page on Significant Pattern Mining
- Significant Pattern Mining (Westfall-Young Light)
(Llinares-López, F., Sugiyama, M., Papaxanthos, L., & Borgwardt, K. (2015). Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing. KDD '15: The 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 725-734. doi:10.1145/2783258.2783363.)
- Finding Genomic Intervals of Genetic Heterogeneity (FAIS)
(Llinares-López, F., Grimm, D. G., Bodenham, D. A., Gieraths, U., Sugiyama, M., Rowan, B., et al. (2015). Genome-wide detection of intervals of genetic heterogeneity associated with complex traits. Bioinformatics, 31(12), i240-i249. doi:10.1093/bioinformatics/btv263.)
- Significant Pattern Mining with Covariates (FACS)
(Papaxanthos, L., Llinares-López, F., Bodenham, D., & Borgwardt, K. (2016). Finding significant combinations of features in the presence of categorical covariates. Advances in Neural Information Processing Systems 29 (NIPS 2016), 2271-2279.)
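The following sketch illustrates how permutation testing can control the Family-Wise Error Rate when many patterns are tested at once. It implements the plain single-step Westfall-Young min-p procedure with a one-sided Fisher's exact test; it does not include the pruning and memory optimizations of the methods cited above. The binary occurrence encoding, the one-sided test and the function names are assumptions made for this example.

```python
import math
import random


def fisher_one_sided(a, n1, x, n):
    """P(at least a of the x pattern occurrences fall in class 1).

    n: number of samples, n1: samples in class 1, x: samples containing
    the pattern, a: class-1 samples containing the pattern.
    """
    total = math.comb(n, x)
    tail = sum(math.comb(n1, k) * math.comb(n - n1, x - k)
               for k in range(a, min(x, n1) + 1))
    return tail / total


def westfall_young_adjusted(patterns, labels, n_permutations=500, seed=0):
    """Single-step min-p adjusted p-values, one per pattern.

    patterns: list of 0/1 occurrence lists, one entry per sample.
    labels: list of 0/1 class labels, one entry per sample.
    """
    rng = random.Random(seed)
    n, n1 = len(labels), sum(labels)
    supports = [sum(occurrence) for occurrence in patterns]

    def pvalues(current_labels):
        return [fisher_one_sided(sum(o * c for o, c in zip(occurrence, current_labels)),
                                 n1, support, n)
                for occurrence, support in zip(patterns, supports)]

    observed = pvalues(labels)
    shuffled = list(labels)
    min_pvalues = []
    for _ in range(n_permutations):
        # Shuffling the labels breaks any association with the patterns.
        rng.shuffle(shuffled)
        min_pvalues.append(min(pvalues(shuffled)))
    # Adjusted p-value: share of permutations whose smallest p-value is at
    # least as extreme as the observed p-value of the pattern.
    return [sum(m <= p for m in min_pvalues) / n_permutations for p in observed]
```

A pattern would be reported as significant at level alpha if its adjusted p-value is at most alpha. Because every pattern is re-tested in every permutation, this naive version scales poorly with the number of candidate patterns.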
Computational Biology: Genome-Wide Association Studies
We develop efficient multivariate approaches for the genome-wide discovery of genetic loci that are associated with a phenotype, thereby trying to elucidate the multicausal basis of complex traits.
- easyGWAS - an Online Tool for Performing Genome-Wide Association Studies
(Grimm, D. G., Roqueiro, D., Salomé, P. A., Kleeberger, S., Greshake, B., Zhu, W., et al. (2017). easyGWAS: A Cloud-Based Platform for Comparing the Results of Genome-Wide Association Studies. The Plant Cell, 29(1), 5-19. doi:10.1105/tpc.16.00551.)
- Tools for SNP x SNP Interaction Discovery
(Kam-Thong, T., Azencott, C.-A., Cayton, L., Pütz, B., Altmann, A., Karbalai, N., et al. (2012). GLIDE: GPU-Based Linear Regression for Detection of Epistasis. Human Heredity, 73(4), 220-236. doi:10.1159/000341885.)
(Kam-Thong, T., Czamara, D., Tsuda, K., Borgwardt, K., Lewis, C. M., Erhardt-Lehmann, A., et al. (2011). EPIBLASTER-fast exhaustive two-locus epistasis detection strategy using graphical processing units. European Journal of Human Genetics, 19(4), 465-471. doi:10.1038/ejhg.2010.196.)
(Kam-Thong, T., Pütz, B., Karbalai, N., Müller-Myhsok, B., & Borgwardt, K. (2011). Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs. Bioinformatics, 27(13), i214-i221. doi:10.1093/bioinformatics/btr218.)
- Lasso Model with Population Structure Correction (LMM-Lasso)
(Rakitsch, B., Lippert, C., Stegle, O., & Borgwardt, K. (2013). A Lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics, 29(2), 206-214. doi:10.1093/bioinformatics/bts669.)
- Network GWAS
(Muzio, G., O’Bray, L., Meng-Papaxanthos, L., Klatt, J., & Borgwardt, K. (2023). networkGWAS: A network-based approach to discover genetic associations. Bioinformatics, 39(6): btad370. doi:10.1093/bioinformatics/btad370.)
- Finding Genomic Intervals of Genetic Heterogeneity (FAIS)
(Llinares-López, F., Grimm, D. G., Bodenham, D. A., Gieraths, U., Sugiyama, M., Rowan, B., et al. (2015). Genome-wide detection of intervals of genetic heterogeneity associated with complex traits. Bioinformatics, 31(12), i240-i249. doi:10.1093/bioinformatics/btv263.)
- In silico Phenotyping via Co-training
(Roqueiro, D., Witteveen, M. J., Anttila, V., Terwindt, G. M., van den Maagdenberg, A., & Borgwardt, K. (2015). In silico phenotyping via co-training for improved phenotype prediction from genotype. Bioinformatics, 31(12), i303-i310. doi:10.1093/bioinformatics/btv254.)
Computational Biology: Genome Annotation
We have developed methods for detecting genomic insertions and deletions using next-generation sequencing, and thoroughly assessed the difficulty of comparing the performance of variant pathogenicity prediction tools.
- Structural Variant Machine (SV-M)
(Grimm, D., Hagmann, J., Koenig, D., Weigel, D., & Borgwardt, K. (2013). Accurate indel prediction using paired-end short reads. BMC Genomics, 14(1): 132. doi:10.1186/1471-2164-14-132.)
- Pathogenicity Prediction
(Grimm, D. G., Azencott, C., Aicheler, F., Gieraths, U., MacArthur, D. G., Samocha, K. E., et al. (2015). The Evaluation of Tools Used to Predict the Impact of Missense Variants Is Hindered by Two Types of Circularity. Human Mutation, 36(5), 513-523. doi:10.1002/humu.22768.)
Computational Biology: Molecular Graph Classification via Graph Kernels
We developed new, fast and scalable similarity measures on graphs, so-called graph kernels. Their prime purpose is to compare molecular graphs or protein structures and to classify them into functional categories.
- Protein Function Prediction via Graph Kernels
(Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S. V. N., Smola, A. J., & Kriegel, H. P. (2005). Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1), i47-i56. doi:10.1093/bioinformatics/bti1007.)
- Scalable Graph Kernels
(Borgwardt, K., Ghisu, E., Llinares-López, F., O’Bray, L., & Rieck, B. (2020). Graph Kernels: State-of-the-Art and Future Challenges. Foundations and Trends® in Machine Learning, 13(5-6), 531-712. doi:10.1561/2200000076.)
Personalized Medicine
We have coordinated several national and international networks on personalized medicine: