Search by text
Glycan text formats
GlycoCT uses a graph concept similar to that of the KCF format (Herget et al. (2008)) and comes in two varieties: a condensed format and an XML format. The former allows for unique identification of glycan structures in a compact manner, while the latter facilitates data exchange. The monosaccharide namespace basically follows the one defined by IUPAC. A basetype consists of five components: the anomeric configuration, the monosaccharide name with configurational prefix, the chain length indicator, the ring-forming positions, and further modification designators. Trivial names such as fucose or rhamnose are not permitted in GlycoCT.
The monosaccharide naming convention follows the format a-bccc-DDD-e:f|g:h, where a is the anomeric configuration (one of a, b, o, x), b is the stereoisomer configuration (one of d, l, x), ccc is the three-letter code for the monosaccharide as listed in the table below, DDD is the base type or superclass indicating the number of consecutive carbon atoms (such as HEX, PEN, NON), e and f indicate the carbon numbers involved in closing the ring, g is the position of the modifier, and h is the type of modifier (one of d=deoxygenation, a=acidic function, keto=carbonyl function, en=double bond, aldi=reduction of C1-carbonyl, sp2=outgoing double bond linkage, sp=outgoing triple bond linkage, geminal=two identical substitutions). For a, b, e, f and g, an x can be used to specify an unknown value. The bccc and g:h parts may be repeated if necessary. Thus α-D-Galp would be represented as a-dgal-HEX-1:5 and α-D-Kdnp would be a-dgro-dgal-NON-2:6|1:a|2:keto|3:d in GlycoCT format.
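The naming pattern can be made concrete with a small parser. The following is a minimal sketch, not part of any GlycoCT library: the function name `parse_basetype` and the returned dictionary keys are hypothetical, and it assumes that each bccc stem is exactly four characters and that modifiers are separated by `|` as in the a-bccc-DDD-e:f|g:h pattern.

```python
# Hypothetical helper: split a GlycoCT basetype name into its components.
ANOMERS = {"a", "b", "o", "x"}
CONFIGS = {"d", "l", "x"}
MODIFIER_TYPES = {"d", "a", "keto", "en", "aldi", "sp2", "sp", "geminal"}


def parse_basetype(name):
    # Modifiers (g:h) follow the core name, each introduced by "|".
    core, *mod_fields = name.split("|")
    parts = core.split("-")
    if len(parts) < 4:
        raise ValueError(f"too few components in {name!r}")

    # First field is the anomer, last two are superclass and ring closure;
    # everything in between is one or more bccc stems.
    anomer, *stems, superclass, ring = parts
    if anomer not in ANOMERS:
        raise ValueError(f"unknown anomeric configuration {anomer!r}")
    for stem in stems:
        if len(stem) != 4 or stem[0] not in CONFIGS:
            raise ValueError(f"malformed stem {stem!r}")

    ring_start, sep, ring_end = ring.partition(":")
    if not sep:
        raise ValueError(f"malformed ring closure {ring!r}")

    modifiers = []
    for field in mod_fields:
        position, sep, kind = field.partition(":")
        if not sep or kind not in MODIFIER_TYPES:
            raise ValueError(f"malformed modifier {field!r}")
        modifiers.append((position, kind))

    return {
        "anomer": anomer,
        "stems": [(stem[0], stem[1:]) for stem in stems],  # (config, code)
        "superclass": superclass,
        "ring": (ring_start, ring_end),
        "modifiers": modifiers,
    }
```

Positions are kept as strings so that an unknown value written as x passes through unchanged.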
Note that substituents of monosaccharides are treated as separate residues attached to the base residue. The kind of residue is given by one of the following codes immediately after the residue number: b=basetype, s=substituent, r=repeating unit, a=alternative unit. The overall layout is similar to that of the KCF format: the residues are specified in a RES section, and the linkages in a LIN section.
List of monosaccharides and their three-letter codes used in GlycoCT.
|Monosaccharide name||Three-letter code||Superclass|
List of substituents used in GlycoCT.
Example of GlycoCT format: The N-glycan core structure represented in GlycoCT format.
RES
1b:b-dglc-HEX-1:5
2s:n-acetyl
3b:b-dglc-HEX-1:5
4s:n-acetyl
5b:b-dman-HEX-1:5
6b:a-dman-HEX-1:5
7b:a-dman-HEX-1:5
LIN
1:1d(2+1)2n
2:1o(4+1)3d
3:3d(2+1)4n
4:3o(4+1)5d
5:5o(3+1)6d
6:5o(6+1)7d
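To illustrate how the RES and LIN sections fit together, here is a minimal sketch of a reader for the condensed format. The name `parse_glycoct` and the dictionary keys are hypothetical. The interpretation of a LIN entry as `id:parent, parent linkage type, (parent position + child position), child, child linkage type` is an assumption about the format that the text above does not spell out. Repeating units, alternative positions and other sections are not handled.

```python
import re

# "1b:b-dglc-HEX-1:5" -> residue number, residue kind (b/s/r/a), name
RES_RE = re.compile(r"^(\d+)([bsra]):(.+)$")
# "2:1o(4+1)3d" -> linkage id, parent, type, positions, child, type
LIN_RE = re.compile(r"^(\d+):(\d+)([a-z])\(([^+]+)\+([^)]+)\)(\d+)([a-z])$")


def parse_glycoct(text):
    """Return (residues, linkages) from condensed GlycoCT text.

    Entries are split on whitespace, so the input may be written on
    one line or with one entry per line.
    """
    residues = {}
    linkages = []
    section = None
    for token in text.split():
        if token in ("RES", "LIN"):
            section = token
        elif section == "RES":
            match = RES_RE.match(token)
            if match is None:
                raise ValueError(f"malformed RES entry {token!r}")
            residues[int(match.group(1))] = (match.group(2), match.group(3))
        elif section == "LIN":
            match = LIN_RE.match(token)
            if match is None:
                raise ValueError(f"malformed LIN entry {token!r}")
            linkages.append({
                "id": int(match.group(1)),
                "parent": int(match.group(2)),
                "parent_type": match.group(3),
                "parent_pos": match.group(4),
                "child_pos": match.group(5),
                "child": int(match.group(6)),
                "child_type": match.group(7),
            })
        else:
            raise ValueError(f"entry {token!r} appears before any section")
    return residues, linkages
```

Applied to the N-glycan core above, this sketch yields seven residues (five basetypes and two n-acetyl substituents) and six linkages.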
Although the two-dimensional notation of glycans shown in the figure below may be visually appealing, it is not suitable for storage in a database, and bioinformatic analysis tools cannot make use of it. Thus the IUPAC-IUBMB (International Union of Pure and Applied Chemistry - International Union of Biochemistry and Molecular Biology) has specified the "Nomenclature of Carbohydrates" to uniquely describe complex oligosaccharides based on a three-letter code to represent monosaccharides.
For example, gal represents galactose and man represents mannose; a listing of the common monosaccharides (and their derivatives) that occur in oligo- and polysaccharides is given in the table below (Tsai (2007)). Each monosaccharide code is preceded by the anomeric descriptor and the configuration symbol. The ring size is indicated by an italic f for furanose or p for pyranose. The carbon numbers that link two monosaccharide units are given in parentheses between the symbols, separated by an arrow. Double-headed arrows may be used if monosaccharides are linked through their anomeric centers. Moreover, α and β may be written as a and b, respectively. In this way, long carbohydrate sequences can be described in abbreviated form using a sequence of letters.
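The rules in this paragraph can be sketched as a small formatter for unbranched chains. The `Residue` class and the function names are hypothetical, and branching and double-headed arrows are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Residue:
    anomer: str  # "a" or "b"
    config: str  # "D" or "L"
    code: str    # three-letter code, e.g. "Man"
    ring: str    # "p" for pyranose, "f" for furanose


def residue_label(residue, greek=True):
    # Anomeric descriptor and configuration symbol precede the code;
    # the ring-size letter follows it.
    anomer = {"a": "α", "b": "β"}[residue.anomer] if greek else residue.anomer
    return f"{anomer}-{residue.config}-{residue.code}{residue.ring}"


def linear_chain(residues, links, greek=True):
    """Join residues with linkages such as (1→3).

    links[i] is the pair (carbon on residues[i], carbon on residues[i + 1]).
    """
    if len(links) != len(residues) - 1:
        raise ValueError("need exactly one linkage between adjacent residues")
    pieces = [residue_label(residues[0], greek)]
    for (donor, acceptor), residue in zip(links, residues[1:]):
        pieces.append(f"({donor}→{acceptor})")
        pieces.append(residue_label(residue, greek))
    return "-".join(pieces)
```

For instance, two mannose residues joined from carbon 1 to carbon 3 come out as α-D-Manp-(1→3)-β-D-Manp, or a-D-Manp-(1→3)-b-D-Manp with `greek=False`.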
Common monosaccharides (and their derivatives) that occur in oligo- and polysaccharides.
Example of extended IUPAC format: The N-glycan core structure represented in extended IUPAC format.
Example of condensed IUPAC format: The N-glycan core structure represented in condensed IUPAC format.
Linear Code® is a carbohydrate format that uses a single-letter nomenclature for monosaccharides and includes a condensed description of the glycosidic linkages. Monosaccharide representation is based on the common structure of a monosaccharide where modifications to the common structure are indicated by specific symbols, as in the following (Banin et al. (2002)).
- Stereoisomers (D or L) differing from the common isomer are indicated by apostrophe (’).
- Monosaccharides with differing ring size (furanose or pyranose) from the common form are indicated by a caret (^).
- Monosaccharides differing in both of the above are indicated by a tilde (~).
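The three rules above amount to a small decision table. The sketch below uses a hypothetical function name and assumes that the symbol is appended directly after the monosaccharide code and that an ASCII apostrophe and tilde are acceptable; the codes passed in are placeholders rather than entries from the table.

```python
def linear_code_symbol(base_code, isomer_differs=False, ring_differs=False):
    # Both deviations together get their own symbol rather than two symbols.
    if isomer_differs and ring_differs:
        suffix = "~"
    elif isomer_differs:
        suffix = "'"
    elif ring_differs:
        suffix = "^"
    else:
        suffix = ""
    return base_code + suffix
```

A residue in its common form keeps its bare code, so `linear_code_symbol("X")` returns `"X"`.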
List of common monosaccharide structures and their single-letter code as used in the Linear Code® format. Note that all the sugars are assumed to be in pyranose form unless otherwise specified.
|Common configuration||Full name||Linear Code®|
List of common modifications as used in the Linear Code® format.
|Modification Type||Linear Code®|
Example of Linear Code®: The N-glycan core structure represented in Linear Code®.
The KEGG Chemical Function (KCF) format for representing glycan structures was originally used to represent chemical structures (thus the name) in KEGG. KCF uses the graph notation, where nodes are monosaccharides and edges are glycosidic linkages. Thus to represent a glycan, at least three sections are required: ENTRY, NODE, EDGE, followed by three slashes ‘///’ at the end.
- The ENTRY section consists of one line and may specify a name for the structure followed by the keyword Glycan.
- The NODE section consists of several lines. The first line contains the number of monosaccharides or aglycon entities, and the following lines consist of the details of these entities numbered consecutively. For each entity line, the name and x- and y-coordinates (to draw on a 2D plane) must be specified.
- Similarly, the EDGE section consists of several lines, the first line containing the number of bonds (usually one less than the number of NODEs), followed by the details of the bond information. The format for the bond information is as follows: num <donor node#>:<anomeric configuration (a or b)><donor carbon#> <acceptor node#>:<acceptor carbon#>
Example of KCF format: The N-glycan core structure represented in KCF format.
ENTRY XYZ Glycan
NODE 5
    1 GlcNAc 15.0 7.0
    2 GlcNAc 8.0 7.0
    3 Man 1.0 7.0
    4 Man -6.0 12.0
    5 Man -6.0 2.0
EDGE 4
    1 2:b1 1:4
    2 3:b1 2:4
    3 5:a1 3:3
    4 4:a1 3:6
///
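The section layout described above can be read with a short whitespace-based parser. This is a minimal sketch under simplifying assumptions: the name `parse_kcf` and the tuple layouts are hypothetical, node names are assumed to contain no spaces, and every donor is assumed to carry an anomeric configuration letter as in the bond format given above.

```python
def parse_kcf(text):
    """Return (entry_fields, nodes, edges) from a single KCF glycan record."""
    tokens = text.split()
    if not tokens or tokens[0] != "ENTRY":
        raise ValueError("record must start with ENTRY")
    i = tokens.index("NODE")
    entry_fields = tokens[1:i]  # optional name plus the keyword Glycan

    node_count = int(tokens[i + 1])
    i += 2
    nodes = {}
    for _ in range(node_count):
        num, name, x, y = tokens[i:i + 4]
        nodes[int(num)] = (name, float(x), float(y))
        i += 4

    if tokens[i] != "EDGE":
        raise ValueError("expected EDGE after the NODE lines")
    edge_count = int(tokens[i + 1])
    i += 2
    edges = []
    for _ in range(edge_count):
        _num, donor, acceptor = tokens[i:i + 3]
        donor_node, donor_rest = donor.split(":")
        acceptor_node, acceptor_carbon = acceptor.split(":")
        # donor_rest is e.g. "b1": anomeric configuration, then carbon number
        edges.append((int(donor_node), donor_rest[0], int(donor_rest[1:]),
                      int(acceptor_node), int(acceptor_carbon)))
        i += 3

    if tokens[i] != "///":
        raise ValueError("record must end with ///")
    return entry_fields, nodes, edges
```

For the N-glycan core record above, this gives five nodes and four edges, matching the rule that the number of bonds is usually one less than the number of nodes.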
Web3 Unique Representation of Carbohydrate Structures (WURCS) is a linear notation for representing carbohydrates for the Semantic Web.
Example of WURCS format: The N-glycan core structure represented in WURCS format.