Outline of Prof. Dr. Tetko’s lectures


Big Data, Data Integration and On-line Data Analysis Tools


     Big Data in Chemistry

The concept of ‘Big Data’ is already used extensively in several fields of science, but it is relatively new to chemistry. Interest in this field is rising due to the growth of large public repositories and commercial databases, such as ChEMBL, PubChem and Reaxys, as well as the increasing amounts of data collected in private companies, which hold the potential to be exploited for new development and innovation [1,2]. While these databases are orders of magnitude smaller than those of, e.g., genomics, they are much larger than those used in traditional chemical projects. Their analysis and interpretation therefore require new methods, as well as proper training of personnel in their use. The challenges of Big Data analysis in chemistry arise mainly from complexity and heterogeneity rather than from volume, as is the case elsewhere. New methods, such as multitask learning and deep learning networks, could be used successfully to explore such data, while secure computation approaches could allow private data held by different companies to be combined, further increasing the impact of Big Data while enabling intercompany collaboration. The challenges of applying the developed models to the analysis of large virtual chemical spaces, such as those produced by de novo design or enumerated chemical spaces containing petabytes or even exabytes of structures, will be discussed. The lecture will conclude with an overview of the Horizon 2020 “Big Data in Chemistry” project.

[1] Tetko et al. Mol. Inform. 2016, 35, 615-621

[2] Tetko et al. Future Med. Chem. 2016, 8, 1801-1806

     On-line Tools for Data Analysis

Thousands of (Quantitative) Structure-Activity Relationship ((Q)SAR) models have been described in peer-reviewed publications; however, this way of sharing seldom makes the models available for use by the research community outside the developer’s laboratory. Conversely, on-line models allow broad dissemination and application, representing the most effective way of sharing scientific knowledge. Approaches for sharing and providing on-line access to models range from web services created by individual users and laboratories to integrated modeling environments and model repositories. This emerging transition from the descriptive and informative, but “static” and for the most part non-executable, print format to interactive, transparent and functional delivery of “living” models is expected to have a transformative effect on modern experimental research in areas of scientific and regulatory use of (Q)SAR models [1]. The emerging tools for data integration and model sharing will also be covered.

[1] Tetko et al. Mol. Inform. 2017, 36.


     Practical Training in the On-line Chemical Modeling Environment (OCHEM)

This part will start with an overview of the basic and advanced features of the On-line CHEmical database and Modelling (OCHEM, http://ochem.eu) platform [1]. OCHEM contains more than 1.2M data points for several hundred properties. It is integrated with a modeling framework, which has been used to develop hundreds of models, ranging from simple linear equations to state-of-the-art algorithms such as Associative Neural Networks, Deep Neural Networks, XGBoost and SVM, using descriptor matrices with > 0.2 trillion entries. A recent model for predicting melting points, developed using about 300k measurements mined from patent data [2], will be presented as an example. The challenges of developing models with such big data, as well as the ideas used to achieve the best-scoring models in the EPA [3] and NIH [4] challenges, will be presented. Following this brief presentation, the participants will be offered training in using OCHEM to upload data, develop models and apply them to predict properties of new chemical compounds. For this part of the seminar, participants should have Internet access (http://ochem.eu); otherwise, OCHEM can be installed locally on one of the University’s servers. Training materials will be provided.

[1] Sushko I. et al. J. Comput. Aided Mol. Des. 2011, 25, 533-554

[2] Tetko IV et al. J. Cheminform. 2016, 8:2

[3] Novotarskyi S. et al. Chem. Res. Toxicol. 2016, 29, 768-775

[4] Abdelaziz A. et al. Front. Environ. Sci. 2016, 4(2)