CASUS Institute Seminar, Dr. Elif Ozkirimli, Head of Data Science and Advanced Analytics – Pharma International Data and Analytics Chapter at F. Hoffmann-La Roche AG, Kaiseraugst (Switzerland)
A researcher shares biomedical findings with the scientific community through publications written in domain-specific language. The human-codified representation of biochemicals is likewise a domain-specific language, one that researchers use to study the mechanisms of molecular interactions. Applying natural language processing methods to such domain-specific languages is often a challenge. A more challenging aspect of processing data in these domains, however, is that the data do not sample all of the available knowledge space (publications) or molecule space (molecular interactions). This is a pity because the most interesting biology occurs at the edge of the distribution or outside it. Both identifying novel protein-compound pairs and finding rare information in publications are limited by this imbalance in data sampling. In this talk, Elif will summarize her recent work on protein-compound affinity prediction and multilabel text classification of biomedical publications. She will briefly present two novel approaches that aim to address the “needle in a haystack” problem for these two tasks.