Library Reference
chisom
Modules:
-
io–Classes and Functions for large on-disk data stores specific to cheminformatics
-
utils–Utility tools for hyperparameter configuration for (emergent) SOMs
Classes:
-
Som–Main Class to create and train a Self-Organizing Map
Functions:
-
start_chisom_viewer–Start the GUI interface
Som
Som(
rows: int,
columns: int,
features: int,
vector_distance: str = "euclidean",
map_distance: str = "euclidean_toroid",
neighborhood_kernel: str = "gaussian",
use_cuda: bool = False,
use_local_neighborhood: bool = False,
use_fastmath: bool = True,
save_progress: Optional[str] = None,
low: float = 0.0,
high: float = 1.0,
seed: Optional[int] = None,
)
Main Class to create and train a Self-Organizing Map
Parameters:
-
rows(int) –Number of rows of neurons.
-
columns(int) –Number of columns of neurons.
-
features(int) –Number of features in the data, equal to the number of weights of each neuron.
-
vector_distance(str, default:'euclidean') –Distance used in original data space, by default "euclidean". Possible values: "euclidean", "manhattan", "cosine"
-
map_distance(str, default:'euclidean_toroid') –Distance used in map space, by default "euclidean_toroid". Possible values: "euclidean", "manhattan", "euclidean_toroid", "manhattan_toroid"
-
neighborhood_kernel(str, default:'gaussian') –Shape of the neighborhood kernel, by default "gaussian".
-
use_cuda(bool, default:False) –If True, CUDA acceleration is used. Requires numba-cuda. By default False.
-
use_local_neighborhood(bool, default:False) –Sets a hard neighborhood cutoff, by default False. Only used on CPU. Significantly increases performance at the cost of numerical accuracy.
-
use_fastmath(bool, default:True) –Slightly decreases numerical accuracy to increase performance, by default True.
-
save_progress(Optional[str], default:None) –Saves codebook and U-Matrix to the given location if set, by default None. Useful if long-running computations crash or time out.
-
low(float, default:0.0) –Lower bound for codebook initialization, by default 0.0.
-
high(float, default:1.0) –Upper bound for codebook initialization, by default 1.0.
-
seed(Optional[int], default:None) –Randomness seed for replicability, by default None.
Raises:
-
ValueError–If the map dimensions are less than 1x1.
-
ValueError–If the number of features is less than 2.
-
ImportError–If CUDA is requested but not available.
-
ValueError–If the vector distance norm is not one of the supported norms.
-
ValueError–If the map distance norm is not one of the supported norms.
-
ValueError–If the neighborhood kernel is not one of the supported kernels.
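The "euclidean_toroid" map distance and the "gaussian" neighborhood kernel named above can be illustrated with a minimal sketch. This reference does not state the exact formulas chisom uses, so the code below follows the common textbook definitions; the function names euclidean_toroid and gaussian_kernel are hypothetical helpers, not part of the library.

```python
import math


def euclidean_toroid(a, b, rows, columns):
    """Euclidean distance between two (row, column) positions on a map
    whose edges wrap around (top meets bottom, left meets right)."""
    d_row = abs(a[0] - b[0])
    d_col = abs(a[1] - b[1])
    # On a torus the shorter path may cross the border of the map.
    d_row = min(d_row, rows - d_row)
    d_col = min(d_col, columns - d_col)
    return math.hypot(d_row, d_col)


def gaussian_kernel(distance, sigma):
    """Neighborhood weight in (0, 1]; equals 1 at distance 0.
    Assumption: standard Gaussian with sigma as the width."""
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


# Opposite corners of a 10x10 map are diagonal neighbors on the torus.
d = euclidean_toroid((0, 0), (9, 9), rows=10, columns=10)
w = gaussian_kernel(d, sigma=1.0)
```

On a non-toroidal map the same two corners would be about 12.7 units apart; the wrap-around removes border effects, which is presumably why the toroidal distance is the default.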
Methods:
-
get_umatrix–Calculate the UMatrix for the current codebook
-
predict–Return the positions of the BMU for a dataset
-
train–Train the SOM with the given data for one epoch
get_umatrix
get_umatrix() -> UMatrix
Calculate the UMatrix for the current codebook
Returns:
-
UMatrix–The UMatrix for the current codebook.
predict
predict(
data: NDArray | DataLoader,
) -> Tuple[NDArray[np.uint16], NDArray[np.float32]]
Return the positions of the BMU for a dataset
Parameters:
-
data(NDArray | DataLoader) –Dataset to find the BMUs for.
Returns:
-
NDArray[uint16]–The BMUs for the data.
-
NDArray[float32]–The Quantization Error
Raises:
-
TypeError–If the data format is not known.
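What a best matching unit (BMU) search computes can be sketched in plain Python. This is an illustration of the concept only, assuming a Euclidean vector distance and a per-sample quantization error; find_bmus is a hypothetical helper and does not reflect how chisom implements predict internally.

```python
import math


def find_bmus(data, codebook):
    """For each sample, return the (row, column) of the closest neuron
    and the distance to it (assumed per-sample quantization error)."""
    bmus, errors = [], []
    for sample in data:
        best_pos, best_dist = None, math.inf
        for r, row in enumerate(codebook):
            for c, weights in enumerate(row):
                dist = math.dist(sample, weights)  # Euclidean distance
                if dist < best_dist:
                    best_pos, best_dist = (r, c), dist
        bmus.append(best_pos)
        errors.append(best_dist)
    return bmus, errors


# A 2x2 map with 2 features per neuron.
codebook = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0]],
]
bmus, errors = find_bmus([[0.9, 0.1], [0.0, 1.0]], codebook)
```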
train
train(
data: NDArray | DataLoader,
epoch: int,
sigma: float,
alpha: float,
)
Train the SOM with the given data for one epoch
Parameters:
-
data(NDArray | DataLoader) –The data to train the SOM with. If a DataLoader is used, it should be batched. If a numpy array is used, it will be treated as a single batch.
-
epoch(int) –The current epoch of the training.
-
sigma(float) –The sigma value for the current epoch. This is used to calculate the neighborhood radius. Must be greater than 0.
-
alpha(float) –The learning rate for the current epoch.
Raises:
-
ValueError–If sigma is less than or equal to 0.
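Because train processes a single epoch, the caller drives the loop and supplies sigma and alpha for each epoch. The sketch below shows one way to do that, relying only on the train(data, epoch, sigma, alpha) signature documented above. The run_training helper, the linear schedule, the zero-based epoch counter, and the RecordingSom stand-in are all assumptions made for illustration; in real use a Som instance would take the place of the stand-in.

```python
def run_training(som, data, epochs, sigma_start, sigma_end,
                 alpha_start, alpha_end):
    """Call som.train once per epoch with linearly interpolated
    sigma (neighborhood radius) and alpha (learning rate)."""
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    for epoch in range(epochs):
        # t runs from 0.0 in the first epoch to 1.0 in the last one.
        t = epoch / max(epochs - 1, 1)
        sigma = sigma_start + t * (sigma_end - sigma_start)
        alpha = alpha_start + t * (alpha_end - alpha_start)
        som.train(data, epoch, sigma, alpha)


class RecordingSom:
    """Hypothetical stand-in that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def train(self, data, epoch, sigma, alpha):
        if sigma <= 0:
            raise ValueError("sigma must be greater than 0")
        self.calls.append((epoch, sigma, alpha))


som = RecordingSom()
run_training(som, data=[[0.1, 0.2]], epochs=3,
             sigma_start=5.0, sigma_end=1.0,
             alpha_start=0.5, alpha_end=0.1)
```

Keeping sigma_end above 0 matters here, since train raises a ValueError for sigma less than or equal to 0.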
start_chisom_viewer
start_chisom_viewer(
umatrix: NDArray,
bmu_coordinates: NDArray,
data: Union[DatasetBase, DataFrame],
structure_info_column: Optional[str] = None,
scaling_factor: int = 3,
)
Start the GUI interface
Parameters:
-
umatrix(NDArray) –U-Matrix of the SOM.
-
bmu_coordinates(NDArray) –Coordinates of the BMUs of the data points.
-
data(Union[DatasetBase, DataFrame]) –Additional data for the data points. Will be rendered in the table view and used for coloring of BMUs.
-
structure_info_column(Optional[str], default:None) –For chemical datasets, the column with this index supplies the SMILES used to render the molecule, by default None.
-
scaling_factor(int, default:3) –Scales the U-Matrix by this factor and applies interpolation for an anti-aliased view, by default 3.
chisom.utils
Utility tools for hyperparameter configuration for (emergent) SOMs
Functions:
-
decay_exponential–Calculate the exponential decay of a value towards a desired final value for use in the training process.
-
decay_linear–Calculate the linear decay of a value for use in the training process.
-
lattice_size–Returns a rectangular lattice for the given number of data points, as recommended by Ultsch et al. for ESOMs.
decay_exponential
decay_exponential(
iteration: int,
initial_value: int | float,
end_value: Optional[int | float] = None,
total_iterations: Optional[int] = None,
decay: Optional[float] = None,
*args,
**kwargs,
) -> float
Calculate the exponential decay of a value towards a desired final value for use in the training process. Either a fixed number of iterations or a decay factor can be used. 'decay' takes precedence over 'total_iterations'.
Parameters:
-
iteration(int) –Current iteration.
-
initial_value(int | float) –Starting value of the decay.
-
end_value(Optional[int | float], default:None) –Desired final value when not using decay, by default None.
-
total_iterations(Optional[int], default:None) –Total number of desired iterations, by default None.
-
decay(Optional[float], default:None) –Decay rate, by default None.
Returns:
-
float–Value at the given iteration.
Raises:
-
ValueError–If neither total_iterations nor decay is provided, or if both end_value and decay are provided.
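The two modes of an exponential schedule can be sketched as follows. This is not the library's implementation: the formulas are common choices and are assumptions here, as is the requirement that end_value be given in the total_iterations mode. The name exponential_decay_sketch is hypothetical.

```python
import math


def exponential_decay_sketch(iteration, initial_value, end_value=None,
                             total_iterations=None, decay=None):
    """Illustrative exponential schedule with two modes."""
    if decay is not None:
        # 'decay' takes precedence over 'total_iterations'.
        if end_value is not None:
            raise ValueError("end_value and decay are mutually exclusive")
        # Assumed form: initial_value * exp(-decay * iteration).
        return initial_value * math.exp(-decay * iteration)
    if total_iterations is None:
        raise ValueError("provide either total_iterations or decay")
    if end_value is None:
        # Assumption of this sketch, not documented behavior.
        raise ValueError("end_value is required with total_iterations")
    # Assumed form: geometric interpolation from initial_value to
    # end_value; both must be positive for this to be meaningful.
    ratio = end_value / initial_value
    return initial_value * ratio ** (iteration / total_iterations)


start = exponential_decay_sketch(0, 10.0, end_value=1.0, total_iterations=100)
middle = exponential_decay_sketch(50, 10.0, end_value=1.0, total_iterations=100)
final = exponential_decay_sketch(100, 10.0, end_value=1.0, total_iterations=100)
by_rate = exponential_decay_sketch(2, 1.0, decay=0.5)
```

In the fixed-iteration mode the value starts at initial_value and reaches end_value exactly at total_iterations; in the decay mode there is no end point and the value only approaches zero.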
decay_linear
decay_linear(
iteration: int,
initial_value: int | float,
total_iterations: Optional[int] = None,
decay: Optional[float] = None,
*args,
**kwargs,
) -> float
Calculate the linear decay of a value for use in the training process. Either a fixed number of iterations or a decay factor can be used. 'decay' takes precedence over 'total_iterations'.
Parameters:
-
iteration(int) –Current iteration.
-
initial_value(int | float) –Starting value of the decay.
-
total_iterations(Optional[int], default:None) –Total number of desired iterations, by default None.
-
decay(Optional[float], default:None) –Decay rate, like m in y = -mx+b, by default None.
Returns:
-
float–Value at the given iteration.
Raises:
-
ValueError–If neither total_iterations nor decay is provided.
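A linear schedule can be sketched in the same way. The decay mode follows the y = -mx + b form mentioned above, with m = decay and b = initial_value. The end point of the total_iterations mode is not stated in this reference; the sketch assumes the value reaches zero at total_iterations. The name linear_decay_sketch is hypothetical.

```python
def linear_decay_sketch(iteration, initial_value,
                        total_iterations=None, decay=None):
    """Illustrative linear schedule with two modes."""
    if decay is not None:
        # 'decay' takes precedence: y = -m * x + b.
        return initial_value - decay * iteration
    if total_iterations is None:
        raise ValueError("provide either total_iterations or decay")
    # Assumption: the value falls to zero at total_iterations.
    return initial_value * (1.0 - iteration / total_iterations)


by_rate = linear_decay_sketch(5, 10.0, decay=0.5)
by_total = linear_decay_sketch(25, 10.0, total_iterations=100)
```

A schedule that reaches zero should not be passed to train as sigma in its final iteration, because sigma must be greater than 0.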
lattice_size
lattice_size(
dataset_size: int, factor=3
) -> Tuple[int, int]
Returns a rectangular lattice for the given number of data points, as recommended by Ultsch et al. for ESOMs.
Parameters:
-
dataset_size(int) –Number of data points in the dataset.
-
factor–Ratio of neurons to data points, by default 3.
Returns:
-
Tuple[int, int]–Number of (rows, columns)
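How a neuron-to-data-point ratio translates into a grid can be sketched as below. Only the meaning of factor is taken from this reference; the way the neurons are split into rows and columns is not documented here, so the aspect_ratio parameter and the rounding are purely hypothetical choices, and lattice_size_sketch is not the library function.

```python
import math


def lattice_size_sketch(dataset_size, factor=3, aspect_ratio=1.6):
    """Return (rows, columns) with roughly factor * dataset_size neurons.

    aspect_ratio (columns / rows) is a hypothetical parameter of this
    sketch used to obtain a rectangular rather than square lattice.
    """
    if dataset_size < 1:
        raise ValueError("dataset_size must be at least 1")
    neurons = factor * dataset_size
    rows = max(1, round(math.sqrt(neurons / aspect_ratio)))
    # Round columns up so the lattice never has fewer neurons than asked.
    columns = math.ceil(neurons / rows)
    return rows, columns


rows, columns = lattice_size_sketch(1000)
```

The resulting pair can be passed on as the rows and columns arguments of Som.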
chisom.io
Classes and Functions for large on-disk data stores specific to cheminformatics
HDF5Creator
HDF5Creator(
fingerprint_generator_factory: DataloaderFingerprintGeneratorFactory,
file_extensions: list[str] = [".txt", ".smi", ".csv"],
num_processes: int = mp.cpu_count() - 2,
chunk_size: int = 1000,
queue_size: int = 500,
)
Bases: StoreCreator
HDF5 store creator for building HDF5 files of cheminformatic datasets from file hierarchies.
Parameters:
-
fingerprint_generator_factory(DataloaderFingerprintGeneratorFactory) –Factory to use, depends on the input file type and content.
-
file_extensions(list[str], default:['.txt', '.smi', '.csv']) –File extensions to consider.
-
num_processes(int, default:cpu_count() - 2) –Number of processes used.
-
chunk_size(int, default:1000) –Size of chunks sent to processes. Should not need configuration.
-
queue_size(int, default:500) –Size of the queue of each process. Should not need configuration.
create
create(
file_hierarchy: FileList,
out_path: str,
leaf_map: LeafMap,
sep: str,
skip_lines: int = 0,
) -> None
Run creation of HDF5 storage file
Parameters:
-
file_hierarchy(FileList) –Dictionary of files to parse, see How-To Guides.
-
out_path(str) –Output path for the HDF5 files.
-
leaf_map(LeafMap) –Dictionary of file structure and datatypes, see How-To Guides.
-
sep(str) –Column separator.
-
skip_lines(int, default:0) –Number of lines to skip at the beginning of file, e.g. for headers.
CSVStyleFactory
CSVStyleFactory(
mol_generator: MolGenerator,
fpStart: int,
fpSize: int,
dtype: type = np.float32,
)
Bases: DataloaderFingerprintGeneratorFactory
A generator that creates fingerprints from a row of data, not from a Mol object. With a similar interface to the RDKit fingerprint generator, but using a slice of the row instead of a Mol object.
Parameters:
-
mol_generator(MolGenerator) –The RDKit MolGenerator to use to read Molecules in the input file e.g. MolFromSmiles
-
fpStart(int) –Numerical column index where the fingerprint starts.
-
fpSize(int) –Length in number of columns of the CSV file of the fingerprint.
-
dtype(type, default:float32) –Datatype of the fingerprint elements.
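The idea of taking a fingerprint from a slice of a row, rather than computing it from a Mol object, can be sketched with the standard library. The helper fingerprint_from_row and the sample data are hypothetical; the parameters fp_start and fp_size mirror the meaning of fpStart and fpSize above, and the real class may handle types and errors differently.

```python
import csv
import io


def fingerprint_from_row(row, fp_start, fp_size, cast=float):
    """Return the fp_size columns starting at fp_start as a fingerprint."""
    values = row[fp_start:fp_start + fp_size]
    if len(values) != fp_size:
        raise ValueError("row is too short for the requested slice")
    return [cast(v) for v in values]


# Two made-up rows: SMILES, name, then a 4-column fingerprint.
text = "CCO,ethanol,0,1,1,0\nc1ccccc1,benzene,1,0,0,1\n"
rows = list(csv.reader(io.StringIO(text)))
fingerprints = [fingerprint_from_row(r, fp_start=2, fp_size=4) for r in rows]
```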
rdStyleFactory
rdStyleFactory(
mol_generator: MolGenerator,
fingerprint_generator: rdFingerprintGenerator,
generator_kwargs={},
count_fingerprint=False,
)
Bases: DataloaderFingerprintGeneratorFactory
This class is designed to provide a unified interface for generating molecular fingerprints for use in DataLoader creation. It can combine different types of input (SMILES, InChI, ...) and fingerprint generator (e.g., Morgan, RDKit, Feature Morgan ... ). Also returns a new instance of the real generator on each call to get_generator, so that the generator can be used in a multiprocessing context without issues.
Parameters:
-
mol_generator(MolGenerator) –The RDKit MolGenerator to use to read Molecules in the input file, e.g. MolFromSmiles
-
fingerprint_generator(rdFingerprintGenerator) –The rdFingerprintGenerator to use.
-
generator_kwargs(dict, default:{}) –Keyword arguments to pass to the rdFingerprintGenerator.
-
count_fingerprint(bool, default:False) –Whether count fingerprints should be created where applicable.
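The design point made above, returning a new generator instance on each call to get_generator so that worker processes never share one, can be sketched generically. Apart from the method name get_generator, everything below is hypothetical: FreshGeneratorFactory and CountingGenerator are stand-ins and do not use RDKit.

```python
class FreshGeneratorFactory:
    """Stores how to build a generator instead of a built generator."""

    def __init__(self, generator_cls, generator_kwargs=None):
        self._generator_cls = generator_cls
        # Copy the kwargs so later changes by the caller have no effect.
        self._generator_kwargs = dict(generator_kwargs or {})

    def get_generator(self):
        # A new instance per call: each worker gets its own state.
        return self._generator_cls(**self._generator_kwargs)


class CountingGenerator:
    """Stand-in for a stateful generator that should not be shared."""

    def __init__(self, size=8):
        self.size = size
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        return [ord(ch) % 2 for ch in text[: self.size]]


factory = FreshGeneratorFactory(CountingGenerator, {"size": 4})
gen_a = factory.get_generator()
gen_b = factory.get_generator()
gen_a("CCO")  # only gen_a's call counter changes
```

A factory of this kind only needs to hold picklable construction arguments, which is one common reason to prefer it over passing a ready-made generator object to worker processes.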
HDF5Dataset
HDF5Dataset(
filepath: str, group_subset: Optional[List[str]] = None
)
Bases: DatasetBase
Loads HDF5 datasets for large cheminformatic datasets created with the supplied script. Adheres to the PyTorch Dataset interface for use with the PyTorch DataLoader, giving millisecond on-disk data access.
Parameters:
-
filepath(str) –Path to HDF5 file
-
group_subset(Optional[List[str]], default:None) –Subsets to use, by default None