Chemical database

A chemical database is a database for storing information about chemical compounds. This can include structural information (crystal or molecular structure), physical and thermodynamic properties, spectra, reactions and syntheses.

Types of chemical databases

Chemical structures

Chemical structures are usually represented as skeletal formulas. Common computer programs save them as two-dimensional pixel or vector graphics, with letters for atoms and lines for bonds. Such files are easy to view (or render) and easy for a chemist to understand. Apart from ease of display, however, they are poorly suited to computer-aided use, because they are memory-inefficient and practically unsearchable.

In chemical databases, small molecules (or ligands in the drug design process) are usually represented as lists: one list of the atoms and one list of the bonds between them. Large molecules, on the other hand, are often built from only a few basic structural components (monomers). Such molecules can be represented more compactly by the sequence of their monomers, for example the amino acid sequence of a protein.
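The list-based representation described above can be sketched as follows. This is a minimal illustration, not the storage format of any particular database; the variable names, the heavy-atom-only ethanol example and the short peptide sequence are assumptions made for the example.

```python
# Hypothetical minimal connection table for ethanol (CH3-CH2-OH), heavy atoms only.
atoms = ["C", "C", "O"]          # atom list: index -> element symbol
bonds = [(0, 1, 1), (1, 2, 1)]   # bond list: (atom index, atom index, bond order)


def neighbors(atom_index, bonds):
    """Return the indices of all atoms bonded to atom_index."""
    result = []
    for a, b, _order in bonds:
        if a == atom_index:
            result.append(b)
        elif b == atom_index:
            result.append(a)
    return result


# A large molecule stored as a monomer sequence instead
# (one-letter amino acid codes; hypothetical example peptide).
peptide = "GAVL"

print(neighbors(1, bonds))  # atoms bonded to the central carbon
```

The two lists together carry the full connectivity, while the sequence string relies on the reader already knowing the structure of each monomer.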

Large chemical structure databases are being built to handle the storage and retrieval of information about millions of molecules, their compounds and their physical properties.

Literature database

Chemical literature databases combine structures and other chemical information with relevant references such as scientific papers or patents. Examples are STN, SciFinder and Reaxys.

Crystallographic database

Crystallographic databases manage crystal structure data. Typical examples are the Protein Data Bank and the Cambridge Structural Database.

NMR spectra database

NMR spectra databases correlate chemical structures with NMR data. Pure NMR databases are rare; most databases combine several spectroscopic methods (including FTIR and MS).

Databases of reactions

Reaction databases contain information about the products, starting materials and mechanisms of reactions. While other chemical databases generally record only long-lived compounds, reaction databases also store unstable intermediates.

Thermophysical database

Thermophysical databases store information about:

Phase equilibria (vapor-liquid equilibria, solubility of gases or solids in liquids), heats of mixing, and enthalpies of vaporization and fusion
Caloric data such as heat capacities, standard enthalpies of formation and heats of combustion
Transport properties such as viscosity and thermal conductivity

Chemical structure representation

There are two basic techniques for representing chemical structures in digital databases.

One form of representation is graph-theoretical: atoms are represented as nodes and bonds as edges. Connection tables, adjacency matrices and other forms of lists are used for this purpose. Examples are MDL Molfile, PDB and CML.
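As an illustration of the graph view, the sketch below builds an adjacency matrix from a bond list. The function name, the tuple layout of the bonds and the heavy-atom-only acetic acid example are assumptions for this example; real formats such as Molfile carry considerably more information.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric matrix whose entry [a][b] is the bond order (0 = no bond)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b, order in bonds:
        matrix[a][b] = order
        matrix[b][a] = order
    return matrix


# Acetic acid, heavy atoms only: C-C(=O)-O
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]  # (atom index, atom index, bond order)

matrix = adjacency_matrix(len(atoms), bonds)
for symbol, row in zip(atoms, matrix):
    print(symbol, row)
```

The matrix is convenient for algorithms but grows quadratically with the number of atoms, which is one reason connection tables are common for storage.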

The other is a linear string notation derived from a depth-first or breadth-first traversal of the molecular graph. Examples are SMILES/SMARTS, SLN, WLN and InChI.
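How a depth-first traversal yields a linear string can be sketched for the simplest case. The function below is a hypothetical illustration that only handles acyclic molecules without hydrogens, charges, aromaticity or stereochemistry; it is not a SMILES implementation, and ring closures in particular are not covered.

```python
def linear_notation(atoms, bonds, start=0):
    """Write an ACYCLIC molecular graph as a SMILES-like string by depth-first traversal."""
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b, order in bonds:
        adjacency[a].append((b, order))
        adjacency[b].append((a, order))
    bond_symbol = {1: "", 2: "=", 3: "#"}
    visited = set()

    def visit(index):
        visited.add(index)
        text = atoms[index]
        children = [(n, o) for n, o in adjacency[index] if n not in visited]
        for position, (neighbor, order) in enumerate(children):
            piece = bond_symbol[order] + visit(neighbor)
            # Every branch except the last one is wrapped in parentheses.
            if position < len(children) - 1:
                piece = "(" + piece + ")"
            text += piece
        return text

    return visit(start)


# Acetic acid, heavy atoms only.
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
print(linear_notation(atoms, bonds))
```

Starting the traversal from a different atom gives a different but equivalent string, which is why canonicalization rules are needed before such strings can serve as unique identifiers.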

These approaches refine the basic graph-theoretical concepts so that they can represent special aspects of chemical compounds, including stereochemical differences and special types of bonds, which occur mainly in organometallic compounds. The main advantages of a computer-readable representation are reduced storage requirements and flexible searchability.

Search

Substructure search

Databases can be searched by full structures and substructures, by components of IUPAC names, and by constraints on properties.

In particular, the ability to search for a substructure distinguishes chemical databases from general-purpose databases. Substructure searches are carried out on the internal graph-theoretical representation as searches for subgraph isomorphisms (monomorphisms). The search algorithms have time complexities of O(N³) or O(N⁴), where N is the number of atoms involved, and are therefore very computationally intensive compared to other search algorithms.
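The idea of matching a query graph onto part of a target graph can be sketched with plain backtracking. This is an illustrative assumption, not the algorithm of any particular database: it ignores bond orders, hydrogens and aromaticity, it has none of the pruning that practical algorithms rely on, and its worst-case running time is exponential.

```python
def find_substructure(query_atoms, query_bonds, target_atoms, target_bonds):
    """Return one mapping {query index: target index}, or None if there is no match."""
    q_bonds = {frozenset(bond) for bond in query_bonds}
    t_bonds = {frozenset(bond) for bond in target_bonds}

    def extend(mapping):
        q = len(mapping)  # index of the next query atom to place
        if q == len(query_atoms):
            return dict(enumerate(mapping))
        for t in range(len(target_atoms)):
            if t in mapping or target_atoms[t] != query_atoms[q]:
                continue
            # Every query bond from q to an already placed atom must exist in the target.
            consistent = all(
                frozenset((t, mapping[p])) in t_bonds
                for p in range(q)
                if frozenset((q, p)) in q_bonds
            )
            if consistent:
                found = extend(mapping + [t])
                if found is not None:
                    return found
        return None

    return extend([])


# Query: a C-O fragment. Target: ethanol, heavy atoms only.
match = find_substructure(["C", "O"], [(0, 1)], ["C", "C", "O"], [(0, 1), (1, 2)])
print(match)
```

Extra bonds in the target between matched atoms are allowed, which corresponds to the monomorphism notion mentioned above rather than an induced-subgraph match.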

The search over the components is called atom-by-atom search (ABAS). In this search, the atoms and bonds of the query are compared with those of the target molecule. ABAS usually uses the Ullmann algorithm or variations of it (e.g. SMSD).

The search is accelerated by splitting it into two stages. For this purpose, an index of precalculated data is created that can then be used for search queries. Typically, these data are bit strings that represent the presence or absence of certain molecular fragments. During the actual search, only compounds that contain the fragments of the query are considered; the rest need not be examined. This elimination is called screening.

The bit strings used for this purpose are called structure keys. The performance of such keys depends on the choice of fragments used to construct them and on how frequently those fragments occur in the individual molecules. Another type of key uses hash codes derived from fragments. These are called "fingerprints" (a term that is sometimes also used for structure keys). The memory required to store structure keys and fingerprints can be reduced by "folding": parts of the key are combined with bitwise operations, which shortens the overall length.
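Screening and folding can be illustrated with integers used as bit strings. Everything in this sketch is an assumption for the example: the eight-entry fragment dictionary, the hand-assigned fragment sets for each compound (a real system would derive them from the structure) and the function names.

```python
# Hypothetical fragment dictionary: bit position -> fragment label.
FRAGMENTS = ["phenyl", "carbonyl", "hydroxyl", "amine", "halogen", "nitro", "ether", "thiol"]


def structure_key(fragments_present):
    """Encode a set of fragment labels as an integer bit string."""
    key = 0
    for bit, name in enumerate(FRAGMENTS):
        if name in fragments_present:
            key |= 1 << bit
    return key


def passes_screen(query_key, candidate_key):
    """A candidate can only contain the query if it has every bit the query has."""
    return (query_key & candidate_key) == query_key


def fold(key, length):
    """Halve a key of the given bit length by OR-ing its two halves together."""
    half = length // 2
    mask = (1 << half) - 1
    return (key & mask) | (key >> half)


# Fragment sets are assigned by hand here purely for illustration.
database = {
    "phenol": structure_key({"phenyl", "hydroxyl"}),
    "acetone": structure_key({"carbonyl"}),
    "aniline": structure_key({"phenyl", "amine"}),
}
query = structure_key({"phenyl"})
candidates = sorted(name for name, key in database.items() if passes_screen(query, key))
print(candidates)  # only these would go on to the atom-by-atom search
```

Folding keeps the screen safe, because any bit set in the query is still set in a matching candidate after the halves are combined, but it lets more non-matching compounds through to the expensive atom-by-atom stage.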

3D conformation

The search for suitable 3D conformations of molecules under specified spatial constraints is a feature that is particularly important in drug development. Searches like this are complicated; they usually require a lot of computing time and provide only approximate results. Search algorithms are based, for example, on BCUTs (eigenvalues of adjacency matrices), representation as special functions, moments of inertia (or inertia tensors), ray-tracing histograms, distance histograms and multipole forms.
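Of the approaches listed, the distance histogram is the simplest to sketch. The code below is a minimal illustration with assumed bin settings and made-up coordinates; it is not taken from any of the cited methods.

```python
import math
from itertools import combinations


def distance_histogram(coordinates, bin_width=1.0, n_bins=5):
    """Count interatomic distances per bin; the last bin collects everything beyond it."""
    histogram = [0] * n_bins
    for p, q in combinations(coordinates, 2):
        distance = math.dist(p, q)
        index = min(int(distance / bin_width), n_bins - 1)
        histogram[index] += 1
    return histogram


def histogram_difference(h1, h2):
    """Sum of absolute bin differences; 0 means identical histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


# Hypothetical coordinates (arbitrary length units) of a linear three-atom arrangement.
linear = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(1.0, 1.0, 1.0), (2.5, 1.0, 1.0), (4.0, 1.0, 1.0)]

print(distance_histogram(linear))
print(histogram_difference(distance_histogram(linear), distance_histogram(shifted)))
```

Because only pairwise distances enter, the histogram does not change when the molecule is translated or rotated, so no alignment step is needed before comparison. The price is that different shapes can share the same histogram, which is one reason such searches give approximate results.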

Descriptors

All properties of molecules that are not directly apparent from their structure are called descriptors. These can be, for example, physical (boiling and melting temperature), physico-chemical (thermodynamic parameters such as Gibbs energy, lipophilicity, acidity/basicity) or pharmacological properties.

Further descriptors are the more or less standardized names of the molecules according to the various nomenclatures, which can sometimes be ambiguous. The IUPAC name is usually a good compromise for representing a molecular structure, since it is a character string that is both readable for humans and unique, and thus processable by computer. However, IUPAC names are unwieldy for larger molecules. Trivial names, homonyms and synonyms, on the other hand, are poor choices for defining a database key.

Physico-chemical descriptors such as the molar mass and the charge, and to a lesser extent partial charges and solubilities, follow directly from the structure of the molecule and can therefore be calculated. Pharmacological descriptors, by contrast, can only be obtained indirectly (through multivariate statistics or experimental results from screenings and bioassays) and therefore cannot be used to represent molecules.
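The molar mass is the clearest case of a descriptor that follows directly from the structure. A minimal sketch, assuming a small table of rounded standard atomic masses and a molecule given as element counts:

```python
# Rounded standard atomic masses in g/mol (only the elements needed here).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molar_mass(atom_counts):
    """Sum the atomic masses over all atoms of the molecule."""
    return sum(ATOMIC_MASS[element] * count for element, count in atom_counts.items())


ethanol = {"C": 2, "H": 6, "O": 1}  # C2H6O
print(round(molar_mass(ethanol), 3))
```

No comparable formula exists for a pharmacological property, which has to come from measurements or from a statistical model fitted to measurements.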

Chemical similarity

Chemical similarity (or molecular similarity) refers to the structural or functional similarity of chemical elements, molecules, or compounds. There is no standard definition of molecular similarity; the concept is defined according to the application and is often described as the inverse of a distance measure in descriptor space. Two molecules could be said to be more similar if, for example, the difference in their molar masses is smaller than it is for other pairs of molecules. A large number of different quantities (dipole moment, acid and base constants, ...) can be combined to form a multivariate distance measure. Distance measures are often classified as Euclidean or non-Euclidean, depending on whether the triangle inequality holds. Substructure search based on maximum common subgraphs (MCS) is another frequently used measure. It is also used to find common partial structures in molecules.
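A multivariate distance in descriptor space, and a similarity derived from it, might be sketched as follows. The choice of descriptors, the scaling constants, the formula 1 / (1 + d) and the approximate descriptor values are all assumptions made for illustration; the text above does not prescribe any of them.

```python
import math


def descriptor_distance(a, b, scales):
    """Euclidean distance after dividing each descriptor by a scale of its own."""
    return math.sqrt(sum(((a[k] - b[k]) / scales[k]) ** 2 for k in scales))


def similarity(a, b, scales):
    """Map a distance d >= 0 to a similarity in (0, 1]; identical descriptors give 1."""
    return 1.0 / (1.0 + descriptor_distance(a, b, scales))


# Approximate, illustrative values: molar mass in g/mol, dipole moment in debye.
ethanol = {"molar_mass": 46.07, "dipole": 1.69}
methanol = {"molar_mass": 32.04, "dipole": 1.70}
benzene = {"molar_mass": 78.11, "dipole": 0.0}

# Hypothetical scales that make the two descriptors comparable in size.
scales = {"molar_mass": 50.0, "dipole": 1.0}

print(similarity(ethanol, methanol, scales))
print(similarity(ethanol, benzene, scales))
```

The scaling step matters: without it, the descriptor with the largest numerical range would dominate the distance regardless of its chemical relevance.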

In chemical databases, groups of "similar" molecules are formed by clustering. Both hierarchical and non-hierarchical clustering approaches can be applied to chemical entities with multiple attributes. These attributes or molecular properties can be determined either empirically or by calculation. One of the most popular clustering approaches is the Jarvis-Patrick algorithm.
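The Jarvis-Patrick idea is that two items belong together when they appear in each other's nearest-neighbour lists and share enough neighbours. The sketch below is a small illustration under assumptions: the function signature, the union-find bookkeeping and the one-dimensional toy data are invented for the example, and ties between equal distances are broken arbitrarily.

```python
def jarvis_patrick(distance, n_items, k, k_min):
    """Cluster items 0..n_items-1 using k nearest neighbours and k_min shared neighbours."""
    # k nearest neighbours of each item, excluding the item itself.
    neighbours = []
    for i in range(n_items):
        others = sorted((j for j in range(n_items) if j != i), key=lambda j: distance(i, j))
        neighbours.append(set(others[:k]))

    parent = list(range(n_items))  # union-find forest for merging clusters

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_items):
        for j in range(i + 1, n_items):
            mutual = j in neighbours[i] and i in neighbours[j]
            shared = len(neighbours[i] & neighbours[j])
            if mutual and shared >= k_min:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy data: one hypothetical descriptor value per molecule.
values = [10, 11, 12, 50, 51, 52]
clusters = jarvis_patrick(lambda i, j: abs(values[i] - values[j]), len(values), k=2, k_min=1)
print(clusters)
```

In a chemical setting the distance function would typically be derived from fingerprints or descriptors rather than a single number.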

In pharmacologically oriented chemical repositories, similarity is usually defined in terms of the biological effects of the compounds (ADME/tox), which in turn can be determined semi-automatically from similar combinations of physico-chemical descriptors (QSAR methods).

Registration

For certain applications (for example the indexing of patent and industrial databases), the recorded information must be stored in a representation that is guaranteed to be unique. This is achieved by generating unique (canonical) character strings, such as canonical SMILES, as representatives of the chemical compound. Some registration systems, such as the CAS system, use hash functions for this purpose.
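A registry keyed on canonical strings can be sketched as follows. The sketch assumes that a canonical string has already been produced by some external canonicalization step, which is the hard part and is not shown; the class name and the use of SHA-256 are illustrative choices and do not describe how CAS or any other system works.

```python
import hashlib


class Registry:
    """Toy registry: one record per distinct canonical structure string."""

    def __init__(self):
        self._records = {}  # hash key -> canonical string

    def register(self, canonical_string):
        """Return (key, is_new); registering the same string twice yields the same key."""
        key = hashlib.sha256(canonical_string.encode("utf-8")).hexdigest()
        is_new = key not in self._records
        self._records.setdefault(key, canonical_string)
        return key, is_new


registry = Registry()
first_key, first_new = registry.register("CCO")    # assumed to be canonical already
second_key, second_new = registry.register("CCO")  # duplicate submission
print(first_new, second_new, first_key == second_key)
```

Without canonicalization the scheme breaks down: "CCO" and "OCC" describe the same molecule but would hash to different keys.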

A key difference between a registry and a simple chemical database is the ability to accurately represent what is known, unknown, or partially known. For example, a chemical database could store a molecule with specified stereochemistry, while a chemical registration system prompts the registrar to indicate whether the stereo configuration is unknown or whether it is a racemate or a certain (known) mixture.

Registration systems can also process information in order to avoid registering substances that show only trivial chemical differences from already registered compounds (e.g. other halogen atoms).

Tools

The computational representations are usually presented to the user as graphical depictions of the data. Data entry is likewise simplified through the use of chemical structure editors, which convert between the drawn structures of molecules or reactions and the internal data. There are also numerous algorithms for converting between the various representation formats. An open-source program for such conversion is Open Babel.

These search and conversion algorithms are implemented either within the database system itself or as an external component (cartridge) that is adapted to a standard relational database system and installed afterwards. Both Oracle- and PostgreSQL-based systems use cartridge technology, which allows user-defined data types (e.g. CTAB as a structure data type). These external components allow the user to formulate SQL queries with chemical search criteria. For example, a query might look for records with a phenyl ring in their structure, represented as a SMILES string in a SMILESCOL column:

 SELECT * FROM CHEMTABLE WHERE SMILESCOL.CONTAINS('c1ccccc1')
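The general mechanism, a chemistry-aware function made callable from SQL, can be imitated on a small scale with SQLite's user-defined functions. This is a loose analogy under stated assumptions, not a cartridge: the function name CONTAINS_SUBSTRUCTURE is invented, and its body is a placeholder that merely tests whether one string occurs inside another, which is not a chemically valid substructure search.

```python
import sqlite3


def contains_substructure(smiles, pattern):
    # PLACEHOLDER ONLY: a plain substring test on the SMILES text.
    # A real cartridge would parse both structures and match them as graphs.
    return 1 if pattern in smiles else 0


connection = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it.
connection.create_function("CONTAINS_SUBSTRUCTURE", 2, contains_substructure)

connection.execute("CREATE TABLE CHEMTABLE (NAME TEXT, SMILESCOL TEXT)")
connection.executemany(
    "INSERT INTO CHEMTABLE VALUES (?, ?)",
    [("toluene", "Cc1ccccc1"), ("cyclohexane", "C1CCCCC1"), ("phenol", "Oc1ccccc1")],
)

rows = connection.execute(
    "SELECT NAME FROM CHEMTABLE WHERE CONTAINS_SUBSTRUCTURE(SMILESCOL, ?) ORDER BY NAME",
    ("c1ccccc1",),
).fetchall()
names = [row[0] for row in rows]
print(names)
```

The substring placeholder would miss the same ring written with different ring-closure digits or a different atom order, which is exactly the gap that a graph-based search component fills.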

Algorithms that convert IUPAC names into structure representations and vice versa are also available and can be used to extract structural information from text. However, difficulties arise because multiple IUPAC dialects exist. InChI has established itself as an unambiguous standard.

Web links

Chemical structure databases

mcule database, a free database for virtual screening and ordering
Synthesis reference database
eChemPortal, a global portal of the OECD with information on chemical substances
NLM ChemIDplus, a biomedical chemistry database searchable by name and structure
Organic synthesis database
ZINC, a free database for virtual screening
ChemSpider, free access to more than 20 million chemical structures, substance data and systematic identifiers
MMsINC, a free web-oriented database of commercially available compounds for virtual screening and chemoinformatics applications
ChemIndustry, a free database of data derived from PubChem
NCI/CADD Chemical Structure Lookup Service, a directory showing in which databases a structure occurs (currently more than 70 million indexed chemical structures)
ChEBI, a free chemical substance registry for biologically relevant molecules
Chemonaut, described as a comprehensive source of physically available commercial compounds
chemicalize.org, a free web-based database from ChemAxon offering similarity, substructure and exact-structure searches, with parsing of web pages and documents (PDF, Microsoft documents, etc.)

Chemical name databases

ChemSub Online, a free web portal and information system on chemical substances, with substance names in eight languages
EuroChem online database, a free chemical database

Notes and Literature

Julian R. Ullmann: An algorithm for subgraph isomorphism. In: Journal of the ACM. 23, No. 1, 1976, pp. 31-42. doi:10.1145/321921.321925.
S. Asad Rahman, M. Bashton, G. L. Holliday, R. Schrader, J. M. Thornton: Small Molecule Subgraph Detector (SMSD) toolkit. In: Journal of Cheminformatics. 1, 2009, p. 12. doi:10.1186/1758-2946-1-12.
Maxwell D. Cummings, Alan C. Maxwell, Renee L. DesJarlais: Processing of Small Molecule Databases for Automated Docking. In: Medicinal Chemistry. 3, No. 1, 2007, pp. 107-113.
R. S. Pearlman, K. M. Smith: Metric Validation and the Receptor-Relevant Subspace Concept. In: J. Chem. Inf. Comput. Sci. 39, 1999, pp. 28-35.
Jr-Hung Lin, Timothy Clark: An analytical, variable resolution, complete description of static molecules and their intermolecular binding properties. In: JCIM. 45, No. 4, 2005, pp. 1010-1016.
P. J. Meek, Z. Liu, L. Tian, C. J. Wang, W. J. Welsh, R. J. Zauhar: Shape Signatures: speeding up computer aided drug discovery. In: Drug Discovery Today. 11, No. 19-20, 2006, pp. 895-904.
J. A. Grant, M. A. Gallardo, B. T. Pickup: A fast method of molecular shape comparison: A simple application of a Gaussian description of molecular shape. In: Journal of Computational Chemistry. 17, No. 14, 1996, pp. 1653-1666.
P. J. Ballester, W. G. Richards: Ultrafast shape recognition for similarity search in molecular databases. In: Proceedings of the Royal Society A. 463, 2007, pp. 1307-1321.
Darko Butina: Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets. In: J. Chem. Inf. Comput. Sci. 39, 1999, pp. 747-750.