Artificial intelligence plays an increasingly crucial role across the chemical sciences, from automating experiments to discovering new materials. In the Artificial Intelligence and Machine Learning (AI & ML) for Chemists course, developed by Dr. Ganna (Anya) Gryn'ova and Dr. Clara Kirkvold, we give a general overview of machine learning as well as focused examples of its applications in chemistry.
The course consists of five Topics: Topic 1 gives a general introduction to artificial intelligence and exemplifies its uses across chemical sciences, Topic 2 provides a basic introduction to Python, and Topics 3-5 delve deeper into various practical aspects of chemical machine learning, combining lectures with practical exercises in Google Colab.
Topic 1: Introduction to machine learning for chemists
Contents:
Learning objectives:
Independent learning
Exercise 1
For one full day (any day you choose before the lecture), from the moment you wake up to the moment you go to sleep, record every interaction you have with artificial intelligence. For each one, note the type of interaction, the AI system (e.g., ChatGPT, Google AI), whether you initiated it, and whether it was useful to you.
Exercise 2
Ask ChatGPT and Google Gemini the following question: "Should a chemistry undergraduate student study machine learning?" Compare the answers the two models give.
Recommended reading
• Shi et al., Machine Learning for Chemistry: Basics and Applications, Engineering 2023, 27, 70-83; DOI: 10.1016/j.eng.2023.04.013.
• Keith et al., Combining Machine Learning and Computational Chemistry for Predictive Insights into Chemical Systems, Chem. Rev. 2021, 121, 9816-9872; DOI: 10.1021/acs.chemrev.1c00107.
Suggested viewing
• "How AI is saving billions of years of human research time" by Max Jaderberg at TED AI San Francisco 2024
• If you have the chance (and access), watch or re-watch "The Matrix" by the Wachowskis (1999) and "Blade Runner" by Ridley Scott (1982). These films provide context for the in-lecture discussion of the historical development and future implications of AI.
Topic 2: Introduction to Python
Python is a high-level, general-purpose programming language that consistently ranks as one of the most popular programming languages, and has gained widespread use in the machine learning community.
Learning objectives:
- Understand variables and data types used in Python;
- Be able to execute basic operations in Python, such as using conditions, writing loops, functions, and importing libraries;
- Perform data analysis and visualisation in Python.
Practical exercises are available via Google Colab.
Independent learning
(solutions are available in the "Additional_Exercises_Solutions" notebook in Google Colab)
Exercise 1: Create a molecular mass calculator
Write a function that, for a list of elements in a molecule, calculates the molecular mass. For simplicity, you can just consider molecules that contain the following elements: H, C, O, N, S.
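One possible shape for such a function is sketched below. The atomic masses are rounded standard values, and the names `ATOMIC_MASSES` and `molecular_mass` are illustrative choices for this sketch, not necessarily those used in the solutions notebook.

```python
# Rounded standard atomic masses in g/mol (assumed values for this sketch)
ATOMIC_MASSES = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def molecular_mass(elements):
    """Return the molecular mass (g/mol) for a list of element symbols."""
    total = 0.0
    for symbol in elements:
        if symbol not in ATOMIC_MASSES:
            raise ValueError(f"Unsupported element: {symbol}")
        total += ATOMIC_MASSES[symbol]
    return total


# Water, written as one list entry per atom
water = ["H", "H", "O"]
print(f"{molecular_mass(water):.3f} g/mol")
```

Listing every atom separately keeps the function simple; accepting a formula string such as "H2O" would require extra parsing.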
Exercise 2: Determine charge type
Write a function that takes in the number of electrons and protons and determines if a compound is cationic, anionic, or neutral.
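A minimal sketch of the comparison involved is given below, assuming the function receives plain integer counts. The name `charge_type` and its argument names are hypothetical.

```python
def charge_type(n_electrons, n_protons):
    """Classify a species as cationic, anionic, or neutral from its particle counts."""
    net_charge = n_protons - n_electrons
    if net_charge > 0:
        return "cationic"
    if net_charge < 0:
        return "anionic"
    return "neutral"


# Ammonium, NH4+: 11 protons (7 from N, 4 from H) and 10 electrons
print(charge_type(n_electrons=10, n_protons=11))
```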
Exercise 3: Convert energies to kcal/mol
Write Python code that converts the following list of single-point energies (in Hartree) to kcal/mol.
[-443.19225788, -443.225590922, -443.199437861, -443.239444863, -443.217935422, -443.220253946, -443.236066571, -443.22927076, -443.186725894, -443.232267518]
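The conversion is a single multiplication, sketched here for the first two energies in the list. The factor of 627.5095 kcal/mol per Hartree is a rounded value assumed for this sketch, and the step that shifts energies relative to the lowest one is an optional addition that the exercise does not ask for.

```python
# Assumed conversion factor: 1 Hartree = 627.5095 kcal/mol (rounded)
HARTREE_TO_KCAL_PER_MOL = 627.5095


def hartree_to_kcal_per_mol(energies):
    """Convert a list of energies from Hartree to kcal/mol."""
    return [energy * HARTREE_TO_KCAL_PER_MOL for energy in energies]


def relative_energies(energies_kcal):
    """Shift energies so that the lowest one is zero."""
    lowest = min(energies_kcal)
    return [energy - lowest for energy in energies_kcal]


# First two values from the list above
sample = [-443.19225788, -443.225590922]
converted = hartree_to_kcal_per_mol(sample)
for absolute, relative in zip(converted, relative_energies(converted)):
    print(f"{absolute:.2f} kcal/mol (relative: {relative:.2f} kcal/mol)")
```

Absolute single-point energies are large numbers in kcal/mol, so differences between structures are usually easier to read than the raw converted values.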
Exercise 4: Determine rate constant and reaction order
Analyse the chemical reaction data provided in the kintics_run.csv file, which contains concentration vs. time values. Use this data to determine the reaction order and the rate constant.
*For this exercise, you will need to download the kintics_run.csv file and read it into Google Colab. This can be done by running the following block of code.
import gdown
import pandas as pd
gdown.download('https://drive.google.com/uc?id=1URUMVzvmyh_fWMDDOfrogwCG6PLV0N1g', 'kintics_run.csv', quiet=False)
# Read the file into Google Colab
kr = pd.read_csv('/content/kintics_run.csv')
print(kr)
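One common way to approach this kind of analysis is to test the integrated rate laws: a plot of [A] vs. t is linear for zero order, ln[A] vs. t for first order, and 1/[A] vs. t for second order. The sketch below fits all three and picks the most linear one. Because the column names of kintics_run.csv are not shown here, it runs on synthetic data; the function names and the synthetic values are assumptions made for illustration.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept; returns slope, intercept, R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


def reaction_order(times, concentrations):
    """Pick the integrated rate law (order 0, 1, or 2) with the most linear plot."""
    transforms = {
        0: lambda c: c,            # [A] vs t, slope = -k
        1: lambda c: math.log(c),  # ln[A] vs t, slope = -k
        2: lambda c: 1.0 / c,      # 1/[A] vs t, slope = +k
    }
    results = {}
    for order, transform in transforms.items():
        slope, _, r_squared = linear_fit(times, [transform(c) for c in concentrations])
        rate_constant = slope if order == 2 else -slope
        results[order] = (rate_constant, r_squared)
    best_order = max(results, key=lambda order: results[order][1])
    return best_order, results[best_order][0], results


# Synthetic first-order data (hypothetical): [A]0 = 1.0, k = 0.05 per time unit
times = [0, 10, 20, 30, 40, 50]
concentrations = [1.0 * math.exp(-0.05 * t) for t in times]
order, k, _ = reaction_order(times, concentrations)
print(f"Best-fitting order: {order}, rate constant: {k:.3f}")
```

With the real file, `times` and `concentrations` would be taken from the corresponding columns of the `kr` DataFrame. Comparing R² values is a quick heuristic; inspecting the three plots is still advisable for noisy data.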
Topic 3: Chemical data and simple ML architectures
Learning objectives:
- Give examples of the sources of chemical data used in chemical machine learning and discuss key characteristics of the data;
- Differentiate between supervised and unsupervised learning techniques and identify appropriate use cases for each;
- Apply simple machine learning architectures to perform supervised (regression and classification) and unsupervised (clustering) learning tasks;
- Understand the importance of train-validation-test splits and evaluate model performance using cross-validation;
- Explain the purpose of hyperparameter tuning and optimise models using such techniques as grid search.
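To make the last two objectives concrete, the sketch below implements k-fold cross-validation and a grid search by hand for a very simple model, a one-dimensional k-nearest-neighbours regressor. It is a library-free illustration on made-up data, not the course's implementation; in practice a package such as scikit-learn provides these tools, and a separate test set would be held out before any tuning.

```python
import random


def knn_predict(train_x, train_y, query, k):
    """Predict by averaging the targets of the k nearest training points (1D feature)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and split them into n_folds roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_val_mse(x, y, k, n_folds=5):
    """Mean squared error of k-NN regression averaged over the validation folds."""
    fold_errors = []
    for fold in k_fold_indices(len(x), n_folds):
        held_out = set(fold)
        train_x = [x[i] for i in range(len(x)) if i not in held_out]
        train_y = [y[i] for i in range(len(x)) if i not in held_out]
        errors = [(knn_predict(train_x, train_y, x[i], k) - y[i]) ** 2 for i in fold]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


def grid_search(x, y, k_values, n_folds=5):
    """Return the k with the lowest cross-validated error, plus all scores."""
    scores = {k: cross_val_mse(x, y, k, n_folds) for k in k_values}
    return min(scores, key=scores.get), scores


# Hypothetical noisy data: y = x^2 plus Gaussian noise
rng = random.Random(42)
x = [i / 10 for i in range(50)]
y = [xi ** 2 + rng.gauss(0, 0.5) for xi in x]
best_k, scores = grid_search(x, y, k_values=[1, 3, 5, 9, 15])
for k, mse in scores.items():
    print(f"k = {k:2d}: cross-validated MSE = {mse:.3f}")
print("Best k:", best_k)
```

Here the number of neighbours k is the hyperparameter: it is not learned from the data, so each candidate value is scored only on folds the model did not see during fitting.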
Chemical data
Activity: In the PubChem database, find phenol. Among its experimentally measured chemical and physical properties, find (1) vapour density and (2) pH. Imagine you were constructing a dataset of organic molecules to train a machine learning model for predicting either vapour density or pH. Would the data available to you from PubChem be considered good training data?
Simple ML architectures
Topic 4: Representations and feature engineering
Learning objectives:
- Give examples of molecular descriptors, fingerprints, and representations.
- Discuss the criteria for creating effective and informative feature vectors.
- Apply feature importance and selection techniques (e.g. correlation matrix, tree-based methods) to the training data.
- Understand the importance of chemical intuition and knowledge when designing features.
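As an illustration of correlation-based feature selection, the sketch below computes a small Pearson correlation matrix and drops any feature that is almost perfectly correlated with one already kept. The descriptor names and values are made up for this example, and the 0.95 threshold is an arbitrary assumption.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if |r| with every already kept feature is below threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table for five molecules (values are illustrative only)
features = {
    "molecular_weight": [16.0, 30.1, 44.1, 58.1, 72.2],
    "n_heavy_atoms": [1, 2, 3, 4, 5],
    "n_h_bond_donors": [0, 1, 0, 2, 1],
}

# Print the correlation matrix row by row, then the surviving features
names = list(features)
for a in names:
    print(a.ljust(18), " ".join(f"{pearson(features[a], features[b]):6.2f}" for b in names))
print("Kept features:", drop_correlated(features))
```

In this toy table the heavy-atom count carries almost the same information as the molecular weight, so only one of the two is kept. Which one to keep is a place where chemical intuition comes in.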
Molecular representations
Feature engineering
Activity: Generate the SMILES of caffeine from its structure using this SMILES generator. Copy the string and paste it into this SMILES-to-Structure online tool to verify that you got the correct representation. Now, using the same SMILES-to-Structure tool, visualise the structure corresponding to the string "CN1C(=O)C2C(NCN2C)N(C)C1=O". What does this activity tell you about SMILES?
Topic 5: Deep learning in chemistry
Learning objectives:
- Understand how deep learning differs from classical machine learning.
- Discuss key elements of a neural network.
- Recognise various types of neural networks, e.g., MLP vs. CNN, based on their architecture and applications.
- Understand the concept of fine-tuning a model on target data.
- Utilise the TensorFlow Python package to build, train, and evaluate MLPs and CNNs.
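The key elements of a neural network (weights, biases, layers, and activation functions) can be seen in the forward pass of a tiny multilayer perceptron. The sketch below is written in plain Python for transparency and deliberately does not use TensorFlow; the layer sizes, random initialisation, and input values are arbitrary assumptions.

```python
import math
import random


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def sigmoid(value):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus bias for each neuron."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


def init_layer(n_inputs, n_neurons, rng):
    """Random initial weights and zero biases for a layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_neurons)]
    return weights, [0.0] * n_neurons


def mlp_forward(inputs, hidden_layer, output_layer):
    """Forward pass: input -> hidden layer (ReLU) -> single output neuron (sigmoid)."""
    hidden = relu(dense(inputs, *hidden_layer))
    return sigmoid(dense(hidden, *output_layer)[0])


rng = random.Random(0)
hidden_layer = init_layer(n_inputs=3, n_neurons=4, rng=rng)
output_layer = init_layer(n_inputs=4, n_neurons=1, rng=rng)

# Three hypothetical, already scaled molecular features
features = [0.2, -1.0, 0.5]
print(f"Untrained network output: {mlp_forward(features, hidden_layer, output_layer):.3f}")
```

Training consists of adjusting the weights and biases to minimise a loss function. A deep learning package such as TensorFlow automates that step, along with the gradient calculations it requires.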
Activity: See with your own eyes how a neural network learns on the Neural Network Playground! Test various network architectures for regression and classification tasks, and try to find the best ones by recording the loss.