Library Reference

chisom

Modules:

  • io

    Classes and Functions for large on-disk data stores specific to cheminformatics

  • utils

    Utility tools for hyperparameter configuration for (emergent) SOMs

Classes:

  • Som

    Main Class to create and train a Self-Organizing Map

Functions:

Som

Som(
    rows: int,
    columns: int,
    features: int,
    vector_distance: str = "euclidean",
    map_distance: str = "euclidean_toroid",
    neighborhood_kernel: str = "gaussian",
    use_cuda: bool = False,
    use_local_neighborhood: bool = False,
    use_fastmath: bool = True,
    save_progress: Optional[str] = None,
    low: float = 0.0,
    high: float = 1.0,
    seed: Optional[int] = None,
)

Main Class to create and train a Self-Organizing Map

Parameters:

  • rows (int) –

    Number of rows of neurons.

  • columns (int) –

    Number of columns of neurons.

  • features (int) –

    Number of features in the data, i.e. the length of each neuron's weight vector.

  • vector_distance (str, default: 'euclidean' ) –

    Distance used in original data space, by default "euclidean". Possible values: "euclidean", "manhattan", "cosine"

  • map_distance (str, default: 'euclidean_toroid' ) –

    Distance used in map space, by default "euclidean_toroid". Possible values: "euclidean", "manhattan", "euclidean_toroid", "manhattan_toroid"

  • neighborhood_kernel (str, default: 'gaussian' ) –

    Shape of the neighborhood kernel, by default "gaussian".

  • use_cuda (bool, default: False ) –

    If True, CUDA acceleration is used. Requires numba-cuda. By default False.

  • use_local_neighborhood (bool, default: False ) –

    Sets a hard neighborhood cutoff, by default False. Only used on CPU. Significantly increases performance at cost of numerical accuracy.

  • use_fastmath (bool, default: True ) –

    Slightly decrease numerical accuracy to increase performance, by default True.

  • save_progress (Optional[str], default: None ) –

    Saves the codebook and U-Matrix to the given location if set, by default None. Useful if long-running computations crash or time out.

  • low (float, default: 0.0 ) –

    Lower bound for codebook initialization, by default 0.0.

  • high (float, default: 1.0 ) –

    Upper bound for codebook initialization, by default 1.0.

  • seed (Optional[int], default: None ) –

    Randomness seed for replicability, by default None.

Raises:

  • ValueError

    If the map dimensions are less than 1x1.

  • ValueError

    If the number of features is less than 2.

  • ImportError

    If CUDA is requested but not available.

  • ValueError

    If the vector distance norm is not one of the supported norms.

  • ValueError

    If the map distance norm is not one of the supported norms.

  • ValueError

    If the neighborhood kernel is not one of the supported kernels.
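
The difference between the planar and the "_toroid" map distances can be illustrated with a small sketch. It is not chisom's implementation; the function names and the (row, column) tuple convention are assumptions made for this example. On a toroidal map the grid wraps around at its borders, so the distance along each axis is the shorter of the direct way and the way across the border.

```python
import math


def euclidean_planar(a, b):
    """Euclidean distance between two (row, column) grid positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def euclidean_toroid(a, b, rows, columns):
    """Euclidean distance on a map whose borders wrap around (hypothetical helper)."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    # Along each axis, take the shorter of the direct and the wrap-around path.
    dr = min(dr, rows - dr)
    dc = min(dc, columns - dc)
    return math.hypot(dr, dc)


# Opposite corners of a 10x10 map are far apart on a plane,
# but direct diagonal neighbors on a torus.
print(euclidean_planar((0, 0), (9, 9)))
print(euclidean_toroid((0, 0), (9, 9), 10, 10))
```

For the two corner neurons the planar distance is about 12.7, while the toroidal distance is about 1.41, which is why a toroidal map has no border effects.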

Methods:

  • get_umatrix

    Calculate the UMatrix for the current codebook

  • predict

    Return the positions of the BMU for a dataset

  • train

    Train the SOM with the given data for one epoch

get_umatrix

get_umatrix() -> UMatrix

Calculate the UMatrix for the current codebook

Returns:

  • UMatrix

    The UMatrix for the current codebook.
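
As a rough illustration of what a U-Matrix contains, the sketch below computes, for every neuron, the mean distance between its weight vector and those of its four direct neighbors. It assumes a toroidal map, a codebook given as nested lists, and Euclidean distance; the neighborhood and the `UMatrix` return type actually used by chisom are not specified here and may differ.

```python
import math


def umatrix_sketch(codebook):
    """Mean distance of each neuron's weights to its four toroidal neighbors.

    codebook[r][c] is the weight vector of the neuron at row r, column c.
    """
    rows, cols = len(codebook), len(codebook[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Up, down, left, right; the modulo wraps around the map borders.
            neighbors = [
                ((r - 1) % rows, c),
                ((r + 1) % rows, c),
                (r, (c - 1) % cols),
                (r, (c + 1) % cols),
            ]
            dists = [
                math.dist(codebook[r][c], codebook[nr][nc])
                for nr, nc in neighbors
            ]
            heights[r][c] = sum(dists) / len(dists)
    return heights
```

High values mark neurons whose weights differ strongly from their surroundings (cluster borders), low values mark homogeneous regions.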

predict

predict(
    data: NDArray | DataLoader,
) -> Tuple[NDArray[np.uint16], NDArray[np.float32]]

Return the positions of the BMU for a dataset

Parameters:

  • data (NDArray | DataLoader) –

    Dataset to find the BMUs for.

Returns:

  • NDArray[uint16]

    The BMUs for the data.

  • NDArray[float32]

    The Quantization Error

Raises:

  • TypeError

    If the data format is not supported.
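
The idea behind the two return values can be sketched in plain Python. This is a brute-force illustration, not chisom's implementation: it assumes Euclidean vector distance and assumes that the quantization error is reported per sample as the distance to its best matching unit (BMU).

```python
import math


def predict_sketch(data, codebook):
    """Return BMU positions and per-sample distances to the BMU.

    codebook[r][c] is the weight vector of the neuron at row r, column c.
    """
    bmus, errors = [], []
    for sample in data:
        best_pos, best_dist = None, math.inf
        for r, row in enumerate(codebook):
            for c, weights in enumerate(row):
                d = math.dist(sample, weights)
                if d < best_dist:
                    best_pos, best_dist = (r, c), d
        bmus.append(best_pos)
        errors.append(best_dist)
    return bmus, errors


codebook = [[[0.0, 0.0], [1.0, 1.0]],
            [[0.0, 1.0], [1.0, 0.0]]]
bmus, errors = predict_sketch([[0.9, 0.9], [0.0, 0.2]], codebook)
print(bmus)
```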

train

train(
    data: NDArray | DataLoader,
    epoch: int,
    sigma: float,
    alpha: float,
)

Train the SOM with the given data for one epoch

Parameters:

  • data (NDArray | DataLoader) –

    The data to train the SOM with. If a DataLoader is used, it should be batched. If a numpy array is used, it will be treated as a single batch.

  • epoch (int) –

    The current epoch of the training.

  • sigma (float) –

    The sigma value for the current epoch. This is used to calculate the neighborhood radius. Must be greater than 0.

  • alpha (float) –

    The learning rate for the current epoch.

Raises:

  • ValueError

    If sigma is less than or equal to 0.
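
To show how sigma and alpha interact, here is a minimal sketch of a Gaussian neighborhood weight and the classic online Kohonen update for a single neuron. It is an assumption-laden illustration: chisom trains on batches, and its kernel and update details may differ. The helper names are hypothetical.

```python
import math


def gaussian_neighborhood(map_dist, sigma):
    """Neighborhood weight for a neuron at map distance map_dist from the BMU."""
    if sigma <= 0:
        # sigma appears in the denominator, so it must be strictly positive.
        raise ValueError("sigma must be greater than 0")
    return math.exp(-(map_dist ** 2) / (2.0 * sigma ** 2))


def update_weight(weight, sample, map_dist, sigma, alpha):
    """Move one neuron's weight vector toward the sample (online update rule)."""
    h = gaussian_neighborhood(map_dist, sigma)
    return [w + alpha * h * (x - w) for w, x in zip(weight, sample)]


# The BMU itself (map distance 0) moves by the full learning rate alpha;
# neurons farther away on the map move less.
print(update_weight([0.0, 0.0], [1.0, 1.0], map_dist=0.0, sigma=2.0, alpha=0.5))
print(update_weight([0.0, 0.0], [1.0, 1.0], map_dist=4.0, sigma=2.0, alpha=0.5))
```

A large sigma early in training lets distant neurons follow the BMU; shrinking sigma and alpha over the epochs lets the map settle.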

start_chisom_viewer

start_chisom_viewer(
    umatrix: NDArray,
    bmu_coordinates: NDArray,
    data: Union[DatasetBase, DataFrame],
    structure_info_column: Optional[str] = None,
    scaling_factor: int = 3,
)

Start the GUI interface

Parameters:

  • umatrix (NDArray) –

    U-Matrix of the SOM.

  • bmu_coordinates (NDArray) –

    BMU coordinates of the data points.

  • data (Union[DatasetBase, DataFrame]) –

    Additional data for the data points. Rendered in the table view and used for coloring the BMUs.

  • structure_info_column (Optional[str], default: None ) –

    For chemical datasets, the column identified by this value supplies the SMILES used to render the molecule, by default None.

  • scaling_factor (int, default: 3 ) –

    Scales the U-Matrix by this factor and interpolates it for an anti-aliased view, by default 3.

chisom.utils

Utility tools for hyperparameter configuration for (emergent) SOMs

Functions:

  • decay_exponential

    Calculate the exponential decay of a value towards a desired final value for use in the training process.

  • decay_linear

    Calculate the linear decay of a value for use in the training process.

  • lattice_size

    Returns a rectangular lattice for the given number of data points, as recommended by Ultsch et al. for ESOMs.

decay_exponential

decay_exponential(
    iteration: int,
    initial_value: int | float,
    end_value: Optional[int | float] = None,
    total_iterations: Optional[int] = None,
    decay: Optional[float] = None,
    *args,
    **kwargs,
) -> float

Calculate the exponential decay of a value towards a desired final value for use in the training process. Use either a fixed number of iterations or a decay factor. 'decay' takes precedence over 'total_iterations'.

Parameters:

  • iteration (int) –

    Current iteration.

  • initial_value (int | float) –

    Starting value of the decay.

  • end_value (Optional[int | float], default: None ) –

    Desired final value when not using decay, by default None.

  • total_iterations (Optional[int], default: None ) –

    Total number of desired iterations, by default None.

  • decay (Optional[float], default: None ) –

    Decay rate, by default None.

Returns:

  • float

    Value at the given iteration.

Raises:

  • ValueError

    If neither total_iterations nor decay is provided, or if both end_value and decay are provided.
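
The two modes described above can be sketched as follows. The formulas are one common formulation and an assumption of this example, not necessarily what chisom computes: with a decay rate the value follows initial_value * exp(-decay * iteration), and with a fixed number of iterations it is interpolated geometrically from initial_value to end_value. The check that end_value is present in the second mode is also an assumption.

```python
import math


def decay_exponential_sketch(iteration, initial_value, end_value=None,
                             total_iterations=None, decay=None):
    """Hypothetical re-implementation for illustration only."""
    if decay is not None:
        if end_value is not None:
            raise ValueError("pass either end_value or decay, not both")
        # 'decay' takes precedence over 'total_iterations'.
        return initial_value * math.exp(-decay * iteration)
    if total_iterations is None:
        raise ValueError("either total_iterations or decay is required")
    if end_value is None:
        # Assumption: this mode needs a target value to decay towards.
        raise ValueError("end_value is required when using total_iterations")
    # Geometric interpolation; assumes end_value / initial_value > 0.
    ratio = end_value / initial_value
    return initial_value * ratio ** (iteration / total_iterations)


# Sigma schedule from 10 down to 1 over 100 epochs.
for epoch in (0, 50, 100):
    print(epoch, decay_exponential_sketch(epoch, 10.0, end_value=1.0,
                                          total_iterations=100))
```

Under these assumptions the value starts at 10, passes roughly 3.16 at the halfway point, and reaches 1 at the final epoch.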

decay_linear

decay_linear(
    iteration: int,
    initial_value: int | float,
    total_iterations: Optional[int] = None,
    decay: Optional[float] = None,
    *args,
    **kwargs,
) -> float

Calculate the linear decay of a value for use in the training process. Use either a fixed number of iterations or a decay factor. 'decay' takes precedence over 'total_iterations'.

Parameters:

  • iteration (int) –

    Current iteration.

  • initial_value (int | float) –

    Starting value of the decay.

  • total_iterations (Optional[int], default: None ) –

    Total number of desired iterations, by default None.

  • decay (Optional[float], default: None ) –

    Decay rate, like m in y = -mx+b, by default None.

Returns:

  • float

    Value at the given iteration.

Raises:

  • ValueError

    If neither total_iterations nor decay is provided.
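
A minimal sketch of the linear schedule, following y = -mx + b with b = initial_value and m = decay. How the slope is derived from total_iterations is not stated above; this example assumes the value reaches 0 at total_iterations, and it does not clamp the result, so it can become negative beyond that point.

```python
def decay_linear_sketch(iteration, initial_value, total_iterations=None,
                        decay=None):
    """Hypothetical re-implementation for illustration only."""
    if decay is None:
        if total_iterations is None:
            raise ValueError("either total_iterations or decay is required")
        # Assumption: without an explicit slope, reach 0 at total_iterations.
        decay = initial_value / total_iterations
    # 'decay' takes precedence: total_iterations is ignored if decay is given.
    return initial_value - decay * iteration


# Learning rate falling from 1.0 towards 0 over 10 epochs.
for epoch in (0, 5, 10):
    print(epoch, decay_linear_sketch(epoch, 1.0, total_iterations=10))
```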

lattice_size

lattice_size(
    dataset_size: int, factor=3
) -> Tuple[int, int]

Returns a rectangular lattice for the given number of data points, as recommended by Ultsch et al. for ESOMs.

Parameters:

  • dataset_size (int) –

    Number of data points in the dataset.

  • factor

    Ratio of neurons to data points, by default 3.

Returns:

  • Tuple[int, int]

    Number of (rows, columns)
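
One way such a lattice could be derived is sketched below. It assumes the target neuron count is factor * dataset_size and introduces a hypothetical `aspect` parameter (columns per row) to keep the grid non-square; the rule chisom actually applies is not documented here and may differ.

```python
import math


def lattice_size_sketch(dataset_size, factor=3, aspect=1.6):
    """Return (rows, columns) with at least factor * dataset_size neurons.

    'aspect' is a hypothetical columns-to-rows ratio chosen for this sketch.
    """
    if dataset_size < 1:
        raise ValueError("dataset_size must be positive")
    neurons = factor * dataset_size
    rows = max(1, math.ceil(math.sqrt(neurons / aspect)))
    columns = math.ceil(neurons / rows)
    return rows, columns


rows, columns = lattice_size_sketch(1000)
print(rows, columns, rows * columns)
```

The result can be passed on as the `rows` and `columns` arguments of `Som`.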

chisom.io

Classes and Functions for large on-disk data stores specific to cheminformatics

HDF5Creator

HDF5Creator(
    fingerprint_generator_factory: DataloaderFingerprintGeneratorFactory,
    file_extensions: list[str] = [".txt", ".smi", ".csv"],
    num_processes: int = mp.cpu_count() - 2,
    chunk_size: int = 1000,
    queue_size: int = 500,
)

Bases: StoreCreator

HDF5 store creator for building HDF5 files of cheminformatics datasets from file hierarchies.

Parameters:

  • fingerprint_generator_factory (DataloaderFingerprintGeneratorFactory) –

    Factory to use, depends on the input file type and content.

  • file_extensions (list[str], default: ['.txt', '.smi', '.csv'] ) –

    File extensions to consider.

  • num_processes (int, default: cpu_count() - 2 ) –

    Number of processes used.

  • chunk_size (int, default: 1000 ) –

    Size of the chunks sent to the processes. Should not need configuration.

  • queue_size (int, default: 500 ) –

    Size of the queue of each process. Should not need configuration.

create

create(
    file_hierarchy: FileList,
    out_path: str,
    leaf_map: LeafMap,
    sep: str,
    skip_lines: int = 0,
) -> None

Run creation of HDF5 storage file

Parameters:

  • file_hierarchy (FileList) –

    Dictionary of files to parse, see How-To Guides.

  • out_path (str) –

    Output path for the HDF5 files.

  • leaf_map (LeafMap) –

    Dictionary of file structure and datatypes, see How-To Guides.

  • sep (str) –

    Column separator.

  • skip_lines (int, default: 0 ) –

    Number of lines to skip at the beginning of file, e.g. for headers.

CSVStyleFactory

CSVStyleFactory(
    mol_generator: MolGenerator,
    fpStart: int,
    fpSize: int,
    dtype: type = np.float32,
)

Bases: DataloaderFingerprintGeneratorFactory

A generator that creates fingerprints from a row of data, not from a Mol object. With a similar interface to the RDKit fingerprint generator, but using a slice of the row instead of a Mol object.

Parameters:

  • mol_generator (MolGenerator) –

    The RDKit MolGenerator used to read molecules from the input file, e.g. MolFromSmiles.

  • fpStart (int) –

    Numerical column index where the fingerprints starts.

  • fpSize (int) –

    Length in number of columns of the CSV file of the fingerprint.

  • dtype (type, default: float32 ) –

    Datatype of the fingerprint elements.
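
The role of fpStart and fpSize can be illustrated with the standard csv module. This is a sketch of the slicing idea only, not the factory's code; the helper name, the column layout, and the sample row are made up for the example.

```python
import csv
import io


def fingerprint_from_row(row, fp_start, fp_size, dtype=float):
    """Convert the columns [fp_start, fp_start + fp_size) of a row to numbers."""
    return [dtype(value) for value in row[fp_start:fp_start + fp_size]]


# Hypothetical file: SMILES, an id, then a 4-column precomputed fingerprint.
sample = io.StringIO("smiles,id,b0,b1,b2,b3\nCCO,1,0,1,1,0\n")
reader = csv.reader(sample, delimiter=",")
next(reader)  # skip the header line
for row in reader:
    print(row[0], fingerprint_from_row(row, fp_start=2, fp_size=4))
```

Here the fingerprint starts at column index 2 and spans 4 columns, so the SMILES and id columns are left untouched.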

rdStyleFactory

rdStyleFactory(
    mol_generator: MolGenerator,
    fingerprint_generator: rdFingerprintGenerator,
    generator_kwargs={},
    count_fingerprint=False,
)

Bases: DataloaderFingerprintGeneratorFactory

This class is designed to provide a unified interface for generating molecular fingerprints for use in DataLoader creation. It can combine different types of input (SMILES, InChI, ...) and fingerprint generator (e.g., Morgan, RDKit, Feature Morgan ... ). Also returns a new instance of the real generator on each call to get_generator, so that the generator can be used in a multiprocessing context without issues.

Parameters:

  • mol_generator (MolGenerator) –

    The RDKit MolGenerator used to read molecules from the input file, e.g. MolFromSmiles.

  • fingerprint_generator (rdFingerprintGenerator) –

    The rdFingerprintGenerator to use.

  • generator_kwargs (dict, default: {} ) –

    Keyword arguments passed to the rdFingerprintGenerator.

  • count_fingerprint (bool, default: False ) –

    Whether count fingerprints should be created where applicable.
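
The "new instance on each call to get_generator" behavior is a plain factory pattern, sketched below with stand-in classes. Only the method name get_generator comes from the description above; `FreshGeneratorFactory` and `DummyGenerator` are hypothetical and replace the real chisom and RDKit objects so that the example runs without them.

```python
class DummyGenerator:
    """Stand-in for a real fingerprint generator (hypothetical)."""

    def __init__(self, fpSize=2048):
        self.fpSize = fpSize


class FreshGeneratorFactory:
    """Stores how to build a generator and builds a new one per call."""

    def __init__(self, generator_factory, generator_kwargs=None):
        self._generator_factory = generator_factory
        # Copy the kwargs so later changes by the caller have no effect.
        self._generator_kwargs = dict(generator_kwargs or {})

    def get_generator(self):
        # A fresh object per call: worker processes never share a generator.
        return self._generator_factory(**self._generator_kwargs)


factory = FreshGeneratorFactory(DummyGenerator, {"fpSize": 1024})
first, second = factory.get_generator(), factory.get_generator()
print(first is second, first.fpSize, second.fpSize)
```

Handing the factory, rather than a generator instance, to the worker processes means each process constructs its own generator locally.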

HDF5Dataset

HDF5Dataset(
    filepath: str, group_subset: Optional[List[str]] = None
)

Bases: DatasetBase

Loads HDF5 datasets for large cheminformatics datasets created with the supplied script. Adheres to the PyTorch Dataset interface for use with the PyTorch DataLoader, giving millisecond on-disk data access.

Parameters:

  • filepath (str) –

    Path to HDF5 file

  • group_subset (Optional[List[str]], default: None ) –

    Subsets to use, by default None