How-To Guides
Creating an HDF5Dataset file
As a dataset of enumerated fingerprints and other cheminformatic information might not fit into working memory, ChI-SOM supplies a dedicated HDF5 file layout and an HDF5Dataset class. The HDF5Dataset class is compatible with the PyTorch DataLoader and provides random access into the on-disk storage with millisecond latency. ChI-SOM supplies a tool to generate the HDF5 files containing the fingerprints directly from text files of molecular data in different line notations, e.g. SMILES or InChI, or from text files that already contain enumerated fingerprint data. Other properties of the molecules can also be recorded for examination with the GUI.
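The on-disk random access relies on the map-style dataset protocol that the PyTorch DataLoader expects: an object implementing __len__ and __getitem__. The following is a minimal, self-contained sketch of that protocol, not ChI-SOM's implementation; DiskBackedDataset is a hypothetical name, and an in-memory NumPy array stands in for the HDF5 storage:

```python
import numpy as np

class DiskBackedDataset:
    """Minimal map-style dataset: the DataLoader only needs __len__ and
    __getitem__, so samples can be read from disk on demand instead of
    holding the whole dataset in working memory."""

    def __init__(self, fingerprints: np.ndarray):
        # Stand-in for the on-disk HDF5 array; a real implementation
        # would read individual rows lazily from the file here.
        self._data = fingerprints

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, idx: int) -> np.ndarray:
        # Random access to a single sample by index.
        return self._data[idx]

ds = DiskBackedDataset(np.eye(4, dtype=np.float32))
```

Because only __getitem__ touches the data, a DataLoader can fetch arbitrary indices without the full dataset ever residing in memory.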
We import HDF5Creator and either rdStyleFactory or CSVStyleFactory. Both need an rdMolGenerator to build the internal representation from the line notation; rdFingerprintGenerator is only needed for the direct generation of enumerated fingerprints.
from numpy import float32
from rdkit.Chem import MolFromSmiles, rdFingerprintGenerator
from chisom.io.datastore_creation import HDF5Creator
from chisom.io.datastore_factories import rdStyleFactory
generator = rdFingerprintGenerator.GetMorganGenerator
Next, we supply the arguments for the fingerprint generator as a dictionary:
fingerprint_kwargs = {"fpSize": 1024, "radius": 2}
The paths and files considered for the HDF5 file creation must be supplied as a dictionary. The keys are distinct groups that can later be accessed individually when using the Dataset. Each value is a list of the files and directories that should be included in the respective group. Directories are walked recursively, and all files are included that match the file extensions later supplied to the HDF5Creator.
file_dict = {
"active": [
"tests/testdata/VDR/actives.smi",
],
"inactive": [
"tests/testdata/VDR/inactives.smi",
],
}
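The recursive walking behaviour described above can be sketched as follows. This is an illustration only, not ChI-SOM's internal code; collect_files and the default extension tuple are hypothetical:

```python
from pathlib import Path

def collect_files(entries, extensions=(".smi",)):
    """Sketch of the walking behaviour: explicitly listed files are kept
    as given, directories are searched recursively, and only files whose
    suffix matches one of the allowed extensions are included."""
    collected = []
    for entry in map(Path, entries):
        if entry.is_dir():
            collected.extend(
                p for p in sorted(entry.rglob("*"))
                if p.suffix in extensions
            )
        else:
            collected.append(entry)
    return collected
```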
Next, we initialize the factory, which supplies individual generators to the HDF5Creator tool, with the generators and variables defined above, and pass it to the HDF5Creator.
molgen = rdStyleFactory(
MolFromSmiles,
generator,
generator_kwargs=fingerprint_kwargs,
count_fingerprint=True,
)
file_creator = HDF5Creator(fingerprint_generator_factory=molgen)
The file creation routine further needs a leaf_map, indicating which columns of the data to consider, their data type, and their value type. The only required key is 'primary', indicating the column in which the molecule's line notation is stored. The data type can be any NumPy or standard Python type. The value type is later used by the GUI to infer colour-coding behaviour. Possible values are 'continuous', 'categorical', and 'na', where 'na' indicates that the value should only be displayed in the table view, but not used for colour-coding the BMUs.
leaf_map = {
"primary": (0, str),
"ID": (1, str, "na"),
"Activity": (2, int, "categorical"),
"MolWt": (3, float32, "continuous"),
"MolLogP": (4, float32, "continuous"),
"TPSA": (5, float32, "continous"),
}
To finally create the file, we call the HDF5Creator.create() method with the desired output filepath. We can further skip lines, e.g. in the case of a header, and change the separation character.
file_creator.create(
file_dict,
"tests/testdata/VDR.h5",
leaf_map,
skip_lines=1,
sep="\t",
)
A full working example can be found under Examples.
Training an ESOM on data in an HDF5Dataset using CUDA
An introductory example on how to use ChI-SOM can be found on the Landing Page.
There are, however, some additional considerations when using the PyTorch DataLoader.
First, we load the HDF5 file using the HDF5Dataset class:
# Create an HDF5Dataset object from the HDF5 file; it is compatible with the PyTorch DataLoader
ds = HDF5Dataset("tests/testdata/VDR.h5", ["active"])
We create the DataLoader with the dataset instance as the input.
# Create a DataLoader object that will be used to train the SOM
dl = DataLoader(
ds,
batch_size=1000,
shuffle=True,
num_workers=4,
)
During initialization of the SOM, we can get the necessary data features from the HDF5Dataset, e.g. the fingerprint_length. Setting use_cuda=True selects the CUDA compute backend.
# Create a SOM object
# The high and low parameters should be chosen according to the dataset values, to decrease the training time
som = Som(
rows,
columns,
ds.fingerprint_length,
"cosine",
use_cuda=True,
low=ds.fingerprint_min,
high=ds.fingerprint_max,
)
The DataLoader is then passed to the train method.
# The training loop
for epoch in range(EPOCHS):
som.train(dl, epoch, current_sigma, current_alpha)
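The loop above assumes current_sigma and current_alpha are updated each epoch; the schedule is not prescribed here, but a common choice for SOM training is a simple linear decay, sketched below. linear_decay and the start/end values are assumptions for illustration:

```python
def linear_decay(start: float, end: float, epoch: int, epochs: int) -> float:
    """Linearly interpolate a parameter from start to end over the run."""
    if epochs <= 1:
        return end
    return start + (end - start) * epoch / (epochs - 1)

EPOCHS = 10
for epoch in range(EPOCHS):
    # Neighbourhood radius and learning rate shrink as training proceeds.
    current_sigma = linear_decay(5.0, 0.5, epoch, EPOCHS)
    current_alpha = linear_decay(0.9, 0.01, epoch, EPOCHS)
    # som.train(dl, epoch, current_sigma, current_alpha)
```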
Should shuffling be used during training, a new DataLoader instance without shuffling must be created before predicting the BMUs and quantization errors (QE), so that the predictions keep their correct association with the data points' indices.
# Create a DataLoader object for prediction (no shuffling)
dl = DataLoader(
ds,
batch_size=1000,
shuffle=False,
num_workers=10,
)
# Predict the best matching units and quantization errors for all data points
bmus, qe = som.predict(dl)
A full working example can be found under Examples.