
Are there tools for automatically parsing glycan names into tree structures?


My colleague and I are working on a project involving data produced at a glycan microarray facility. The array data came back to us as a list of glycan names (in IUPAC condensed format). We would like to parse this list of 610 names into graphical representations of the glycans.

To clarify, I wanted to take a list of glycans, written in the IUPAC condensed nomenclature, and return an image for each glycan. Here, each image would be the "graph" representation (nodes + edges). Ideally, I would like to be able to write a Python script to get this done, so we wouldn't have to do this by hand.

Is this possible? If so, what tools are available to get this done?

Thank you!


Edit: C# functions used in our system, slightly pared down to simplify.

using Microsoft.VisualBasic;
using System;
using System.Collections;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Web;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

public class KEGGCaller
{
    // Builds the DBGET search URL for a monosaccharide composition, extracts the
    // KEGG G-number from the result page, and fetches the corresponding structure image.
    public object ReturnKEGGgif(string GlycanMonoCode, System.Web.HttpServerUtility HTTPUtilityHolder)
    {
        string CallURL = "http://www.genome.jp/dbget-bin/www_bfind_sub?dbkey=glycan&keywords=";
        string GlycanURL = HTTPUtilityHolder.HtmlEncode(GlycanMonoCode);
        WebClient wc = new WebClient();
        System.Drawing.Image GlycanGif = default(System.Drawing.Image);
        CallURL += GlycanURL + "&mode=bfind&max_hit=1";
        StreamReader reader = new StreamReader(wc.OpenRead(CallURL));
        string Gcode = FetchGcode(reader.ReadToEnd());  // FetchGcode (not shown) parses the G##### number out of the hit page
        if (Gcode == "Failed")
        {
            return 0;
        }
        else
        {
            System.Net.HttpWebRequest Request = default(System.Net.HttpWebRequest);
            System.Net.HttpWebResponse Response = default(System.Net.HttpWebResponse);
            Request = (System.Net.HttpWebRequest)System.Net.WebRequest.Create("http://www.genome.jp/dbget-bin/www_bget?gl:" + Gcode);
            // ... the remainder of the snippet (reading the response and loading the .gif into GlycanGif) was truncated in the original post.
        }
    }
}


Biomodels.Net is a great resource to begin with to familiarize yourself with the field and get ideas of how to tackle your qualification needs. We used it when we were doing drug screens.

More specifically for glycans, you're going to be looking at KEGG. Then use KEGG Draw for your visual application. If you actually just want to parse them, here is the table of the monosaccharide codes used (in most places).

Here's a better link right to the download in case you didn't find it. The script we wrote was pretty simple; here is the basic logic.

Start with the monosaccharide code.

Call DBGET on GenomeNet with a single-hit query against KEGG GLYCAN. For example, (GlcNAc)6 (Man)3 (Asn)1 would be called as:

http://www.genome.jp/dbget-bin/www_bfind_sub?dbkey=glycan&keywords=%28GlcNAc%296+%28Man%293+%28Asn%291&mode=bfind&max_hit=1

where http://www.genome.jp/dbget-bin/www_bfind_sub?dbkey=glycan&keywords= and &mode=bfind&max_hit=1 are going to be constant, and the middle part is going to be your search.

You need to convert each monosaccharide to its standard URL-encoded (ASCII) form, concatenating the monosaccharides with a "+". Thus (GlcNAc)6 (Man)3 (Asn)1 becomes %28GlcNAc%296+%28Man%293+%28Asn%291.

That will give you one hit with the link to your exact match, if there is one. There will be one link on the returned page in the format of http://www.genome.jp/dbget-bin/www_bget?gl:G#####, where G##### is your glycan number (in our example, G00021).

Once you have the glycan number you're good to go. All of the structure images can be found at: http://www.genome.jp/Fig/glycan/G#####.gif
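As a rough sketch of that logic in R (the question asked for Python, but the steps port directly; the helper fetch_gcode, its regular expression, and the example composition are illustrative assumptions, not part of the original script):

build_dbget_url <- function(mono_code) {
  # Percent-encode each "(Residue)count" token and join with "+",
  # e.g. "(GlcNAc)6 (Man)3 (Asn)1" -> "%28GlcNAc%296+%28Man%293+%28Asn%291"
  tokens  <- strsplit(mono_code, " ", fixed = TRUE)[[1]]
  encoded <- paste(vapply(tokens, URLencode, character(1), reserved = TRUE), collapse = "+")
  paste0("http://www.genome.jp/dbget-bin/www_bfind_sub?dbkey=glycan&keywords=",
         encoded, "&mode=bfind&max_hit=1")
}

fetch_gcode <- function(search_url) {
  # Pull the single G-number (G#####) out of the returned hit page, if any
  page <- paste(readLines(search_url, warn = FALSE), collapse = "\n")
  hit  <- regmatches(page, regexpr("www_bget\\?gl:G[0-9]{5}", page))
  if (length(hit) == 0) return(NA_character_)
  sub(".*gl:", "", hit)
}

gcode <- fetch_gcode(build_dbget_url("(GlcNAc)6 (Man)3 (Asn)1"))  # expected hit: G00021
if (!is.na(gcode)) {
  download.file(paste0("http://www.genome.jp/Fig/glycan/", gcode, ".gif"),
                destfile = paste0(gcode, ".gif"), mode = "wb")    # save the structure image
}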


I don't know whether this works for glycans, but this tool converts IUPAC names to SMILES format. SMILES is a text-based structure notation that can be converted to a graphical structure.

You can also check these out:

http://www.openmolecules.org/name2structure

http://www.iupac.org/home/publications/e-resources/inchi.html

I am still not sure what you are really asking. It would help if you gave a dummy example.


The CFG does not use consistent nomenclature for the glycan structures in the Excel files they provide. It is human-readable but not machine-readable, so you'll have to curate their list first. I have bash scripts that do this.

We convert glycan array structures into CFG symbols: http://glycam.org/Pre-builtLibraries.jsp . You want to click on the cfglibs and maybe give it a few seconds to load.

If you are still interested in doing this, send an email to [email protected] explaining what you would like, and include my name, "Oliver".


This module is included in Biopython 1.54 and later. If you’re interested in testing newer additions to this code before the next official release, see SourceCode for instructions on getting a copy of the development branch.

To draw trees (optional), you'll also need matplotlib, plus NetworkX and PyGraphviz or pydot for the graph-based functions.

The I/O and tree-manipulation functionality will work without them; they're imported on demand when the functions draw(), draw_graphviz() and to_networkx() are called.

The Phylo module has also been successfully tested on Jython 2.5.1, minus the Graphviz- and NetworkX-based functions. However, parsing phyloXML files is noticeably slower because Jython uses a different version of the underlying XML parsing library.


MOTIFS : USING DATABASES & CREATING YOUR OWN

SEARCHING MOTIF DATABASES

BACKGROUND INFORMATION: Proteins having related functions may not show overall high homology, yet may contain sequences of amino acid residues that are highly conserved. For background information on this see PROSITE at ExPASy. N.B. I recommend that you check your protein sequence with at least two different search engines. Alternatively, use a meta site such as MOTIF (GenomeNet, Institute for Chemical Research, Kyoto University, Japan) to simultaneously carry out Prosite, Blocks, ProDom, Prints and Pfam searches.

Several great sites including the first four which are meta sites:

Motif Scan - (MyHits, SIB, Switzerland) includes Prosite, Pfam and HAMAP profiles.
InterPro 5 - includes PROSITE, HAMAP (High-quality Automated and Manual Annotation of Proteins), Pfam (protein Families), PRINTS, ProDom, SMART (a Simple Modular Architecture Research Tool), TIGRFAMs, PIRSF (Protein Information Resource), SUPERFAMILY, CATH-Gene3D (Class, Architecture, Topology, Homologous superfamily), and PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification systems. ( Reference: Jones, P. et al. 2014, Bioinformatics 10: 1093) . This service is also available here.

MOTIF (GenomeNet, Japan) - I recommend this for the protein analysis, I have tried phage genomes against the DNA motif database without success. Offers 6 motif databases and the possibility of using your own.
CDD or CD-Search (Conserved Domain Databases) - (NCBI) includes CDD, Smart,Pfam, PRK, TIGRFAM, COG and KOG and is invoked when one uses BLASTP.

Batch Web CD-Search Tool - The Batch CD-Search tool allows the computation and download of conserved domain annotation for large sets of protein queries. Input up to 100,000 protein query sequences as a list of sequence identifiers and/or raw sequence data, then download output in a variety of formats (including tab-delimited text files) or view the search results graphically. On the Batch CD-Search job summary page, a "Browse Results" button above the sample data table allows you to view the results graphically. The button opens a separate browser window that shows the domain footprints, alignment details, and conserved features on any individual query sequence. ( Reference: Marchler-Bauer A et al. 2011. Nucleic Acids Res. 39: (D)225-229.)

CDvist - Comprehensive Domain Visualization Tool - CDvist is a sequence-based protein domain search tool. It combines several popular algorithms to provide the best possible domain coverage for multi-domain proteins delivering speed-up, accuracy, and batch querying with novel visualization features.( Reference: O. Adebali et al. Bioinformatics (2015) 31(9):1475-7).

Pfam - (EMBL-EBI) while for Batch Pfam searches go here or here. ( Reference: Punta M et al. 2012. Nucl. Acids Res. 40(Database issue): D290-D301 ). One can access it also via the EBI site here, which allows queries of Pfam, TIGRFAM, Gene3D, Superfamily, PIRSF, and TreeFam.

ScanProsite - (ExPASy) ( Reference: Sigrist CJ et al. Nucleic Acids Res. 2013 41(Database issue): D344-7).

ProDom (Pôle Rhone-Alpin de BioInformatique, France) - is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database
SMART Simple Modular Architecture Research Tool (EMBL, Universitat Heidelberg) - searches sequence for the domains/ sequences listed in the homepage. Try selecting/deselecting the default settings.

Batch SMART scan - can be found here. Please note that the software produces a polyprotein which it analyzes. This can result in some difficulty in correlating the motifs with the individual proteins. The same proviso applies to the Batch CD search.

iProClass (Protein Information Resource, Georgetown University Medical Centre, U.S.A.) - is an integrated resource that provides comprehensive family relationships and structural/functional features of proteins. ( Reference: Wu CH et al. Comput. Biol. Chem. (2004) 28: 87-96).

PSIPRED Protein Sequence Analysis Workbench - includes PSIPRED v3.3 (Predict Secondary Structure) DISOPRED3 & DISOPRED2 (Disorder Prediction) pGenTHREADER (Profile Based Fold Recognition) MEMSAT3 & MEMSAT-SVM (Membrane Helix Prediction) BioSerf v2.0 (Automated Homology Modelling) DomPred (Protein Domain Prediction) FFPred 3 (Eukaryotic Function Prediction) GenTHREADER (Rapid Fold Recognition) MEMPACK (SVM Prediction of TM Topology and Helix Packing) pDomTHREADER (Fold Domain Recognition) and, DomSerf v2.0 (Automated Domain Modelling by Homology). ( Reference: Buchan DWA et al. 2013. Nucl. Acids Res. 41 (W1): W340-W348).

P2RP (Predicted Prokaryotic Regulatory Proteins) - including transcription factors (TFs) and two-component systems (TCSs) based upon analysis of DNA or protein sequences. ( Reference: Barakat M., 2013. BMC Genomics 14: 269)

MEROPS - permits one to screen protein sequences against an extensive database of characterized peptidases ( Reference: Rawlings, N.D et al. (2018) Nucleic Acids Res. 46: D624-D632 ).

For specific protein modifications or site detection consult the following sites:

Orthologous genes/proteins:

COG analysis - Clusters of Orthologous Groups - COG protein database was generated by comparing predicted and known proteins in all completely sequenced microbial genomes to infer sets of orthologs. Each COG consists of a group of proteins found to be orthologous across at least three lineages and likely corresponds to an ancient conserved domain (CloVR) . Sites which offer this analysis include:

WebMGA ( Reference: S. Wu et al. 2011. BMC Genomics 12:444), RAST ( Reference: Aziz RK et al. 2008. BMC Genomics 9:75), and BASys (Bacterial Annotation System Reference: Van Domselaar GH et al. 2005. Nucleic Acids Res. 33(Web Server issue):W455-459.) and JGI IMG (Integrated Microbial Genomes Reference : Markowitz VM et al. 2014. Nucl. Acids Res. 42: D560-D567. )

Other sites:

EggNOG - A database of orthologous groups and functional annotation that derives Nonsupervised Orthologous Groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. ( Reference: Powell S et al. 2014. Nucleic Acids Res. 42 (D1): D231-D239).

OrthoMCL - is another algorithm for grouping proteins into ortholog groups based on their sequence similarity. The process usually takes between 6 and 72 hours.( Reference: Fischer S et al. 2011. Curr Protoc Bioinformatics Chapter 6:Unit 6.12.1-19).

KAAS (KEGG Automatic Annotation Server) provides functional annotation of genes by BLAST or GHOST comparisons against the manually curated KEGG GENES database. The result contains KO (KEGG Orthology) assignments and automatically generated KEGG pathways. ( Reference: Moriya Y et al. 2007. Nucleic Acids Res. 35(Web Server issue):W182-185).

InParanoid - this database provides a user interface to orthologs inferred by the InParanoid algorithm. As there are now international efforts to curate and standardize complete proteomes, we have switched to using these resources rather than gathering and curating the proteomes themselves. ( Reference: E.L.L. Sonnhammer & G. Östlund. 2015. Nucl. Acids Res. 43 (D1): D234-D239).

DNA binding - motifs:

GYM - the most recent program for analysis of helix-turn-helix motifs in proteins. N.B. the next site dates from 1990. ( Reference: Narasimhan, G. et al. 2002. J. Computational Biol. 9:707-720)
Helix-turn-Helix Motif Prediction - (Institut de Biologie et Chemie des Proteines, Lyon, France)

iDNA-Prot - identifies DNA-binding proteins via the "grey model" and by adopting the random forest operation engine. The overall success rate by iDNA-Prot was 83.96%. One can submit up to 50 proteins. ( Reference: Lin W-Z et al. 2011. PLoS One 6: e24756). Also available here.

DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Choose: PSSM-based encoding which is the most accurate, but the slowest. ( Reference: S.Hwang et al. 2007. Bioinformatics 23(5):634-636).

DNAbinder - employs two approaches to predict DNA-binding proteins (a) amino acid composition which allows for multiple sequences in fasta format, and (b) PSSM ( Position-specific scoring matrix) which can only screen a single protein at a time. Choose the "Alternate dataset" if input sequence is full length protein, since the prediction will be done using SVM modules developed using full length protein sequences ( Reference: M. Kumar et al. 2007. BMC Bioinformatics 8: 463).

DRNApred - server provides sequence based prediction of DNA- and RNA-binding residues. ( Reference: Yan J, & Kurgan LA, 2017. Nucleic Acids Res. 45(10):e84).

DisoRDPbind - predicts the RNA-, DNA-, and protein-binding residues located in the intrinsically disordered regions. DisoRDPbind is implemented using a runtime-efficient multi-layered design that utilizes information extracted from physiochemical properties of amino acids, sequence complexity, putative secondary structure and disorder, and sequence alignment. ( Reference: Peng Z, & Kurgan LA, 2015. Nucleic Acids Res. 43(18): e121).

If you know the three-dimensional structure of your protein then 3D-footprint, DISPLAR ( Reference: Tjong G & Zhou H-X. 2007. Nucl. Acids Res. 35: 1465-1477), iDBPs ( Reference: Nimrod G. et al. 2009. J. Mol. Biol. 387: 1040-1053), DNABIND ( Reference: Szlagyi A & Skolnick J. 2006. J. Mol. Biol. 358: 922-933) and DNABINDPROT ( Reference: Ozbek P et al. 2010. Nucl. Acids Res. 38: W417-423) could be useful to you.

2ZIP - is used to find leucine zipper motifs ( Reference: Bornberg-Bauer,E. et al. (1998) Nucleic Acids Res. 26:2740-2746).

FeatureP - is a web server which launches a selection of such predictors and mines their outputs for differential predictions, i.e. features which are predicted to be modified as a consequence of the differences between the input sequences. ( Reference: Blicher T et al. (2010) Curr Opin Struct Biol. 20: 335-41). Can be used to screen multiple proteins.

Two-component and other regulatory proteins:

P2RP (Predicted Prokaryotic Regulatory Proteins) - users can input amino acid or genomic DNA sequences, and predicted proteins therein are scanned for the possession of DNA-binding domains and/or two-component system domains. RPs identified in this manner are categorised into families, unambiguously annotated. ( Reference: Barakat M, et al. 2013. BMC Genomics 14:269).

P2CS (Prokaryotic 2-Component Systems) is a comprehensive resource for the analysis of Prokaryotic Two-Component Systems (TCSs). TCSs are comprised of a receptor histidine kinase (HK) and a partner response regulator (RR) and control important prokaryotic behaviors. It can be searched using BLASTP. ( Reference: P. Ortet et al. 2015. Nucl. Acids Res. 43 (D1): D536-D541).

ECFfinder - extracytoplasmic function (ECF) sigma factors - the largest group of alternative sigma factors - represent the third fundamental mechanism of bacterial signal transduction, with about six such regulators on average per bacterial genome. Together with their cognate anti-sigma factors, they represent a highly modular design that primarily facilitates transmembrane signal transduction. ( Reference: Staron A et al. (2009) Mol Microbiol 74(3): 557-581).

BepiPred - this server predicts the location of linear B-cell epitopes using a combination of a hidden Markov model and a propensity scale method. ( Reference: Pontoppidan Larsen, J.E. et al. 2006. Immunome Research 2:2).

ABCpred - this server predicts B cell epitope(s) in an antigen sequence, using artificial neural network. ( Reference: Saha, S & Raghava G.P.S. 2006. Proteins 65:40-48).

Antibody Epitope Prediction (Immune Epitope Database and Analysis Resource) - methods include Chou & Fasman Beta-Turn Prediction, Emini Surface Accessibility Prediction, Karplus & Schulz Flexibility Prediction, Kolaskar & Tongaonkar Antigenicity, Parker Hydrophilicity Prediction and Bepipred Linear Epitope Prediction

BCPREDS server allows users to choose the method for predicting B-cell epitopes among several developed prediction methods: AAP method, BCPred and FBCPred. Users provide an antigen sequence and optionally can specify desired epitope length and specificity threshold. Results are returned in several user-friendly formats. ( Reference: EL-Manzalawy, Y. et al. 2008. J Mol Recognit 21: 243-255).

EpiSearch: Mapping of Conformational Epitopes ( Reference: Negi, S.S. & Braun, W. 2009. Bioinform. Biol. Insights 3: 71-81).

CEP - Conformational Epitope Prediction Server - The algorithm, apart from predicting conformational epitopes, also predicts antigenic determinants and sequential epitopes. The epitopes are predicted using 3D structure data of protein antigens, which can be visualized graphically. The algorithm employs a structure-based bioinformatics approach and solvent accessibility of amino acids in an explicit manner. Accuracy of the algorithm was found to be 75% when evaluated using X-ray crystal structures of Ag-Ab complexes available in the PDB. ( Reference: Kulkarni-Kale, U. et al. 2005. Nucl. Acids Res. 33: W168-W171)

IEDB (Immune Epitope Database and Analysis Resource). Includes T Cell Epitope Prediction (Scan an antigen sequence for amino acid patterns indicative of: MHC I Binding, MHC II Binding, MHC I Processing (Proteasome,TAP), MHC I Immunogenicity) B Cell Epitope Prediction, Predict linear B cell epitopes using: Antigen Sequence Properties, Predict discontinuous B cell epitopes using antigen structure via: Solvent-accessibility (Discotope), Protrusion (ElliPro). ( Reference: Vita, R. et al. 2015. Nucl. Acids Res. 43 (D1): D405-D412).

Expitope - is the first web server for assessing epitope sharing when designing new potential lead targets. It enables the users to find all known proteins containing their peptide of interest. The web server returns not only exact matches, but also approximate ones, allowing a number of mismatches of the users choice. For the identified candidate proteins the expression values in various healthy tissues, representing all vital human organs, are extracted from RNA Sequencing (RNA-Seq) data as well as from some cancer tissues as control. ( Reference: Haase K et al. 2015. Bioinformatics 31: 1854-1856).

EpiToolKit - provides a collection of methods from computational immunology for the development of novel epitope-based vaccines including HLA ligand or potential T-Cell epitope prediction, an epitope selection framework for vaccine design, and a method to design optimal string-of-beads vaccines. Additionally, EpiToolKit provides several other tools ranging from HLA typing based on NGS data, to prediction of polymorphic peptides. ( Reference: Schubert B et al. 2015. Bioinformatics 31: 2211-2213).

MetaPocket 2.0 is a meta server to identify ligand binding sites on protein surface! metaPocket is a consensus method, in which the predicted binding sites from eight methods: LIGSITEcs, PASS, Q-SiteFinder, SURFNET, Fpocket, GHECOM, ConCavity and POCASA are combined together to improve the prediction success rate. ( Reference: Bingding Huang (2009) Omics, 13(4): 325-330)

Post-translational modification - ProteomeScout is a database of proteins and post-translational modifications. There are two main data types in ProteomeScout: 1) Proteins: Visualize proteins or annotate your own proteins and, 2) Experiments: You can load a new experiment or browse and analyze an existing experiment. Requires registration ( Reference: M.K. Matlock et al. 2015. Nucl. Acids Res. 43 (D1): D521-D530).

Glycosylation:

NetOGlyc (Center for Biological Sequence Analysis, Technical University of Denmark) - produces neural network predictions of mucin type GalNAc O-glycosylation sites in mammalian proteins. SignalP is automatically run on all sequences. A warning is displayed if a signal peptide is not detected. In transmembrane proteins, only extracellular domains may be O-glycosylated with mucin-type GalNAc.
NetNGlyc (Center for Biological Sequence Analysis, Technical University of Denmark) - predicts N-Glycosylation sites in human proteins using artificial neural networks that examine the sequence context of Asn-Xaa-Ser /Thr sequons.
YinOYang (Center for Biological Sequence Analysis, Technical University of Denmark) - produces neural network predictions for O-ß-GlcNAc attachment sites in eukaryotic protein sequences. This server can also use NetPhos, to mark possible phosphorylated sites and hence identify "Yin-Yang" sites.

Fatty acylation:

LipoP 1.0 (Center for Biological Sequence Analysis, Technical University of Denmark) - allows prediction of where signal peptidases I & II from Gram-negative bacteria will cleave a protein.

NMT - The MYR Predictor (IMP [Research Institute of Molecular Pathology] Bioinformatics Group, Austria) - predicts N-terminal N-myristoylation. Generally, the enzyme NMT requires an N-terminal glycine (leading methionines are cleaved prior to myristoylation). However, also internal glycines may become N-terminal as a result of proteolytic processing of proproteins.
Myristoylator (ExPASy, Switzerland) - predicts N-terminal myristoylation of proteins by neural networks. Only N-terminal glycines are myristoylated (leading methionines are cleaved prior to myristoylation).

Nucleotide binding sites:

nSITEpred - is designed for sequence-based prediction of binding residues for ATP, ADP, AMP, GDP, and GTP ( Reference: K. Chen 2012. Bioinformatics 28: 331-341)

P2RP (Predicted Prokaryotic Regulatory Proteins) - users can input amino acid or genomic DNA sequences, and predicted proteins therein are scanned for the possession of DNA-binding domains and/or two-component system domains. RPs identified in this manner are categorised into families, unambiguously annotated. ( Reference: Barakat M, et al. 2013. BMC Genomics 14:269).

Phosphorylation:

GPS (Group-based Phosphorylation Scoring method) - prediction encompases 71 Protein Kinase (PK) families/PK groups ( Reference: Y. Xue et al. 2005. Nucl. Acids Res. 33: W184-W187).

NetPhos (Center for Biological Sequence Analysis, Technical University of Denmark) - predicts Ser, Thr and Tyr phosphorylation sites in eukaryotic proteins.

PhosphoSitePlus (PSP) is an online systems biology resource providing comprehensive information and tools for the study of protein post-translational modifications (PTMs) including phosphorylation, ubiquitination, acetylation and methylation. ( Reference: Hornbeck PV, et al. 2015 Nucleic Acids Res. 43: D512-520).

14-3-3-Pred: A webserver to predict 14-3-3-binding phosphosites in human proteins ( Reference: Madeira F et al. 2015. Bioinformatics 31: 2276-2283).

Scansite searches for motifs within proteins that are likely to be phosphorylated by specific protein kinases or bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains. Putative protein phosphorylation sites can be further investigated by evaluating evolutionary conservation of the site sequence or subcellular colocalization of protein and kinase.

Quokka - is a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome ( Reference: Li F et al. Bioinformatics 34(24): 4223-4231).

Sumoylation:

SUMOgo - prediction of sumoylation sites (small ubiquitin-like modifier (SUMO) binding (referred to as SUMOylation)) on lysines by motif screening models and the effects of various post-translational modifications ( Reference: Chang C-C et al. 2018. Scientific Reports 8: 15512).

Sulfinator (ExPASy, Switzerland) predicts tyrosine sulfation sites in protein sequences.

Vaccine development, effector molecules:


Jaiswal V et al. 2013. BMC Bioinformatics 14: 211

and pathogenic bacteria. Thereby effector proteins are transported from the bacterial cytosol into the extracellular medium or directly into the eukaryotic host cell. The Effective portal provides precalculated predictions on bacterial effectors in all publicly available pathogenic and symbiontic genomes as well as the possibility for the user to predict effectors in own protein sequence data.

DISCOVER YOUR OWN MOTIFS:

After you have discovered similar sequences but the motif searching tools have failed to recognize your group of proteins you can use the following tools to create a list of potential motifs.

The MEME Suite - Motif-based sequence analysis tools (National Biomedical Computation Resource, U.S.A.). N.B. After doing a BLASTP search, create a FASTA-formatted document containing three or four of the most homologous proteins (training set) and submit to MEME (Multiple Em for Motif Elicitation) or GLAM2 (Gapped Local Alignments of Motifs). In the case of MEME I usually specify 5 as the "Maximum number of motifs" to find. You will receive a message by E-mail entitled "MEME Submission Information (job app. )," which verifies that the NBCR received and is processing your request. If you click on the hyperlink "You can view your job results at: http://meme. " you will see:

The "MAST output as HTML" provides the motifs, a motif alignment graphic and the alignment of the motifs with the individual sequences in the training set. The "MEME output as HTML" file contains a detailed analysis of each of the motifs plus their Sequence Logos.

At the top of the file is a button labelled "Search sequence databases for the best combined matches with these motifs using MAST." This will take you to the MAST (Motif Alignment and Search Tool) submission form. Click on the NCBI nonredundant protein database. You will receive an E-mail entitled "MAST Submission Information (job app . )."

Use great caution before printing: the second set of data can be >20 pages ( Reference: Bailey, T.L. et al. 2009. Nucl. Acids Res. 37(Web Server issue): W202-W208). The MEME Suite can also be found here.

WebLogo - a great graphical way of representing and visualizing consensus sequence data, based on the sequence logos developed by Tom Schneider and Mike Stephens. For nucleotide logos see RNA Structure Logo (The Technical University of Denmark)

Seq2Logo is a sequence logo generator. Sequence logos are a graphical representation of the information content stored in a multiple sequence alignment (MSA) and provide a compact and highly intuitive representation of the position-specific amino acid composition of binding motifs, active sites, etc. in biological sequences.( Reference: Thomsen, M.C., & Nielsen, M. 2012. Nucleic Acids Res. 40(Web Server issue):W281-287).

Skylign is a tool for creating logos representing both sequence alignments and profile hidden Markov models. Submit to the form in order to produce (i) interactive logos for inclusion in webpages, or (ii) static logos for use in documents. Skylign accepts sequence alignments in any format accepted by HMMER (this includes Stockholm and aligned fasta format). ( Reference: Wheeler TJ, et al. 2014. BMC Bioinformatics. 15: 7.). The HMMER-formatted profile HMM files can be generated from an *.aln ClustalW file by pasting your ClustalW alignment (& title) into HMMBUILD (Pôle Bioinformatique Lyonnais, France) and use the output (saved as a *.hmm file) at Skylign.

Two Sample Logo - detects and displays statistically significant differences in position-specific symbol compositions between two sets of multiple sequence alignments. In a typical scenario, two groups of aligned sequences will share a common motif but will differ in their functional annotation. Also available as a Java tool. ( Reference : 22: 1536-1537).

HMMER website - provides access to the protein homology search algorithms found in the HMMER software suite. Since the first release of the website in 2011, the search repertoire has been expanded to include the iterative search algorithm, jackhmmer. ( Reference: R.D. Finn et al. 2015. Nucl. Acids Res. 43 (W1): W30-W38).

PSSMSearch - is a web application to discover novel protein motifs (SLiMs, mORFs, miniMotifs) and PTM sites. PSSMSearch analyses proteomes for regions with significant similarity to a specificity determinant model built from a set of aligned functional peptides. Query peptides can be provided by the users or retrieved from the ELM database. Multiple scoring methods are available to build a position-specific scoring matrix (PSSM) describing the specificity determinant model, and users can modify the model to add prior knowledge of specificity determinants through an interactive PSSM heatmap. ( Reference: Krystkowiak I et al. 2018. Nucleic Acids Res 46(W1): W235-W241).

NUCLEIC ACID MOTIFS : (See also here)

Rfam (Wellcome Trust Sanger Institute, England) - permits one to analyze 2 kb of DNA for 36 structural or functional RNAs such as 5S rRNA, tRNA, tmRNA, group I & II catalytic introns, hammerhead ribozymes, and signal recognition particles.

P2RP (Predicted Prokaryotic Regulatory Proteins) - including transcription factors (TFs) and two-component systems (TCSs) based upon analysis of DNA or protein sequences. ( Reference: Barakat M., 2013. BMC Genomics 14: 269)


4.3 Results

4.3.1 Features

The ggtree package supports displaying phylograms and cladograms (Figure 4.1) and can visualize a tree with different layouts, including rectangular, slanted, circular, fan, unrooted, time-scaled and two-dimensional layouts.

The ggtree allows tree covariates stored in the tree object to be used directly in tree visualization and annotation. These covariates can be metadata of the sampled species/sequences used in the tree, or statistical analyses and evolutionary inferences on the tree (e.g. divergence times inferred by BEAST or ancestral sequences inferred by HyPhy, etc.). These numerical or categorical data can be used to color branches or nodes of the tree, and can be displayed on the tree with their original values or mapped to different symbols. In ggtree, users can add layers to highlight selected clades, to label clades, or to annotate the tree with symbols of different shapes and colors, etc. (more details in Section 3.3.3).

Compared to other phylogenetic tree visualization packages, ggtree excels at exploring the tree structure and related data visually. For example, a complex tree figure with several annotation layers can be transferred to a new tree object without step-by-step re-creation. An operator, %<%, was created for this operation - to update a tree figure with a new tree object. Branch lengths can be re-scaled using another numerical variable (as shown in Figure 3.4, which rescales the tree branches using dN values). Phylogenetic trees can be visually manipulated by collapsing, scaling and rotating clades. Circular and fan layout trees can be rotated by a specific angle. Tree structures can be transformed from one layout to another.

The groupClade function assigns the branches and nodes under different clades into different groups. Similarly, groupOTU function assigns branches and nodes to different groups based on user-specified groups of operational taxonomic units (OTUs) that are not necessarily within a clade, but can be monophyletic (clade), polyphyletic or paraphyletic. A phylogenetic tree can be annotated by mapping different line type, size, color or shape to the branches or nodes that have been assigned to different groups.

The treeio package parses diverse annotation data from different software outputs into S4 phylogenetic data objects. The ggtree mainly utilizes these S4 objects to display and annotate the tree. There are also other R packages that define S3/S4 classes to store phylogenetic trees with their specific associated data, including phylo4 and phylo4d defined in the phylobase package, obkData defined in the OutbreakTools package, and phyloseq defined in the phyloseq package. All these tree objects are also supported in ggtree and their specific annotation data can be used to annotate the tree in ggtree. Such compatibility of ggtree facilitates the integration of data and analysis results.

4.3.2 Layouts of phylogenetic tree

Viewing a phylogenetic tree with ggtree is quite simple: just pass the tree object to the ggtree function. We have developed several types of layouts for tree presentation (Figure 4.1), including rectangular (by default), slanted, circular, fan, unrooted (equal angle and daylight methods), time-scaled and 2-dimensional layouts.

Here are examples of visualizing a tree with different layouts:
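The code that produced Figure 4.1 is not reproduced in this excerpt; a minimal sketch of the layout options, using a random tree from ape purely for illustration, would look like this:

library(ggtree)
library(ape)

set.seed(2017)
tr <- rtree(50)                                   # random 50-tip tree, for illustration only

ggtree(tr)                                        # rectangular phylogram (default)
ggtree(tr, layout = "slanted")                    # slanted
ggtree(tr, layout = "circular")                   # circular
ggtree(tr, layout = "fan", open.angle = 120)      # fan
ggtree(tr, layout = "equal_angle")                # unrooted, equal-angle method
ggtree(tr, layout = "daylight")                   # unrooted, daylight method
ggtree(tr, branch.length = "none")                # cladogram: ignore branch lengths
ggtree(tr, layout = "circular", branch.length = "none")  # circular cladogram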

Figure 4.1: Tree layouts. Phylogram: rectangular layout (A), slanted layout (B), circular layout (C) and fan layout (D). Unrooted: equal-angle method (E) and daylight method (F). Cladogram: rectangular layout (G), circular layout (H) and unrooted layout (I). Slanted and fan layouts for cladogram are also supported.

Phylogram. Layouts of rectangular, slanted, circular and fan are supported to visualize phylograms (by default, with branch lengths scaled) as demonstrated in Figure 4.1A, B, C and D.

Unrooted layout. Unrooted (also called 'radial') layout is supported by the equal-angle and daylight algorithms; users can specify the unrooted layout algorithm by passing "equal_angle" or "daylight" to the layout parameter. The equal-angle method was proposed by Christopher Meacham in PLOTREE, which was incorporated in PHYLIP (Retief 2000). This method starts from the root of the tree and allocates arcs of angle to each subtree proportional to the number of tips in it. It iterates from root to tips and subdivides the angle allocated to a subtree into angles for its dependent subtrees. This method is fast and has been implemented in many software packages. As shown in Figure 4.1E, the equal-angle method has the drawback that tips tend to be clustered together, leaving much space unused. The daylight method starts from an initial tree built by the equal-angle method and iteratively improves it by successively going to each interior node and swinging subtrees so that the arcs of "daylight" are equal (Figure 4.1F). This method was first implemented in PAUP* (Wilgenbusch and Swofford 2003).

Cladogram. To visualize a cladogram without branch length scaling, displaying only the tree structure, branch.length is set to "none"; this works for all types of layouts (Figure 4.1G, H and I).

Time-scaled layout. For a time-scaled tree, the most recent sampling date must be specified via the mrsd parameter; ggtree will then scale the tree by sampling (tip) and divergence (internal node) times, and a time scale axis will be displayed under the tree by default.
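A minimal call for this layout (the file name is a placeholder; read.beast comes from treeio):

library(treeio)
library(ggtree)

beast_tree <- read.beast("H3_influenza_mcc.tree")   # placeholder BEAST MCC tree file
ggtree(beast_tree, mrsd = "2013-01-01") +           # most recent sampling date
  theme_tree2()                                     # display the time axis under the tree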

Figure 4.2: Time-scaled layout. The x-axis is the timescale (in units of years). The divergence times were inferred by BEAST using a molecular clock model.

Two-dimensional tree layout. A two-dimensional tree is a projection of the phylogenetic tree into a space defined by the associated phenotype (a numerical or categorical trait, on the y-axis) and the tree branch scale (e.g., evolutionary distance or divergence time, on the x-axis). The phenotype can be a measure of certain biological characteristics of the taxa and hypothetical ancestors in the tree. This is a new layout we proposed in ggtree, which is useful to track how virus phenotypes or other behaviors (y-axis) change with virus evolution (x-axis). In fact, the analysis of phenotypes or genotypes over evolutionary time has been widely used to study influenza virus evolution (Neher et al. 2016), though such analysis diagrams are not tree-like, i.e., there is no connection between data points, unlike our two-dimensional tree layout that connects data points with the corresponding tree branches. Therefore, this new layout will make such data analysis easier and more scalable for large sequence data sets of influenza viruses.

In this example, we used the previous time-scaled tree of H3 human and swine influenza viruses (Figure 4.2; data published in (Liang et al. 2014)) and scaled the y-axis based on the predicted N-linked glycosylation sites (NLG) for each of the taxon and ancestral sequences of the hemagglutinin proteins. The NLG sites were predicted using the NetNGlyc 1.0 Server 8 . To scale the y-axis, the parameter yscale in the ggtree() function is set to a numerical or categorical variable. If yscale is a categorical variable, as in this example, users should specify how the categories are to be mapped to numerical values via the yscale_mapping variable.
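A sketch of such a call, using the beast_tree object from the sketch above (the NLG attribute attached to the tree and its category-to-number mapping are assumptions for illustration; yscale and yscale_mapping are the ggtree() parameters described above):

NLG_map <- c("3" = 3, "5" = 5, "6" = 6, "8" = 8, "9" = 9)  # hypothetical NLG categories -> numbers
ggtree(beast_tree, mrsd = "2013-01-01",
       yscale = "NLG", yscale_mapping = NLG_map) +         # y-axis driven by the NLG attribute
  theme_classic()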

Figure 4.3: Two-dimensional tree layout. The trunk and other branches highlighted in red (for swine) and blue (for human). The x-axis is scaled to the branch length (in units of year) of the time-scaled tree. The y-axis is scaled to the node attribute variable, in this case the number of predicted N-linked glycosylation site (NLG) on the hemagglutinin protein. Colored circles indicate the different types of tree nodes. Note that nodes assigned the same x- (temporal) and y- (NLG) coordinates are superimposed in this representation and appear as one node, which is shaded based on the colors of all the nodes at that point.

As shown in Figure 4.3, the two-dimensional tree is good at visualizing the change of a phenotype over evolution in the phylogenetic tree. In this example, it is shown that the H3 gene of human influenza A virus maintained a high number of N-linked glycosylation sites (n=8 to 9) over the last two decades, which dropped significantly to 5 or 6 in a separate viral lineage that was transmitted to swine populations and established there. It has indeed been hypothesized that human influenza viruses with a high level of glycosylation on the viral hemagglutinin protein are better shielded, protecting the antigenic sites from exposure to herd immunity, and thus have a selective advantage in human populations that maintain a high level of herd immunity against the circulating human influenza virus strains. For the viral lineage that newly jumped across the species barrier and was transmitted to the swine population, the shielding effect of the high-level surface glycan instead imposes a selective disadvantage, because the receptor-binding domain may also be shielded, which greatly affects the viral fitness of the lineage newly adapted to the new host species.

4.3.3 Annotation layers

The ggtree package is designed for both general-purpose and specialized tree visualization and annotation. It supports the grammar of graphics implemented in ggplot2, and users can freely visualize/annotate a tree by combining several annotation layers.

Figure 4.4: Annotating a tree using the grammar of graphics. The NHX tree was annotated using grammar of graphics syntax by combining different layers with the + operator. Species information was labelled in the middle of the branches, duplication events were shown on the most recent common ancestors, and clade bootstrap values were displayed near them.

Here, as an example, we visualized the tree with several layers to display the annotation stored in NHX tags, including a geom_tiplab layer to display tip labels (gene names in this case), a layer using geom_label to show species information (S tag) colored lightgreen, a layer of duplication event information (D tag) colored steelblue, and another layer using geom_text to show bootstrap values (B tag).
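In code this corresponds roughly to the following (the NHX file name is a placeholder):

library(treeio)
library(ggtree)

nhx <- read.nhx("ADH.nhx")                                      # placeholder NHX file
ggtree(nhx) +
  geom_tiplab() +                                               # gene names at the tips
  geom_label(aes(x = branch, label = S), fill = "lightgreen") + # species tag (S) on the branches
  geom_label(aes(label = D), fill = "steelblue") +              # duplication event tag (D)
  geom_text(aes(label = B), hjust = -0.5)                       # bootstrap value tag (B)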

Layers defined in ggplot2 can be applied to ggtree directly, as demonstrated in Figure 4.4 using geom_label and geom_text. But ggplot2 does not provide graphic layers that are specifically designed for phylogenetic tree annotation. For instance, layers for tip labels, a tree branch scale legend, or highlighting and labeling clades are all unavailable. To make tree annotation more flexible, a number of layers have been implemented in ggtree (Table 4.1), enabling different ways of annotating various parts/components of a phylogenetic tree.

Table 4.1: Geom layers defined in ggtree.
Layer Description
geom_balance highlights the two direct descendant clades of an internal node
geom_cladelabel annotate a clade with bar and text label
geom_hilight highlight a clade with rectangle
geom_label2 modified version of geom_label, with subsetting supported
geom_nodepoint annotate internal nodes with symbolic points
geom_point2 modified version of geom_point, with subsetting supported
geom_range bar layer to present uncertainty of evolutionary inference
geom_rootpoint annotate root node with symbolic point
geom_segment2 modified version of geom_segment, with subsetting supported
geom_strip annotate associated taxa with bar and (optional) text label
geom_taxalink associate two related taxa by linking them with a curve
geom_text2 modified version of geom_text, with subsetting supported
geom_tiplab layer of tip labels
geom_tiplab2 layer of tip labels for circular layout
geom_tippoint annotate external nodes with symbolic points
geom_tree tree structure layer, with multiple layout supported
geom_treescale tree branch scale legend

4.3.4 Tree manipulation

The ggtree supports many ways of manipulating the tree visually, including viewing a selected clade to explore a large tree (Figure 4.5), taxa clustering (Figure 4.8), rotating a clade or the tree (Figures 4.9B and 4.11), and zooming out or collapsing clades (Figures 4.6A and 4.7). The tree manipulation functions are summarized in Table 4.2.

Table 4.2: Tree manipulation functions.
Function Description
collapse collapse a selected clade
expand expand a collapsed clade
flip exchange the positions of 2 clades that share a parent node
groupClade group clades
groupOTU group OTUs by tracing back to their most recent common ancestor
identify interactive tree manipulation
rotate rotate a selected clade by 180 degrees
rotate_tree rotate a circular layout tree by a specific angle
scaleClade zoom in or zoom out a selected clade
open_tree convert a tree to fan layout with a specific open angle

A clade is a monophyletic group that contains a single ancestor and all of its descendants. We can visualize a specific selected clade via the viewClade function, as demonstrated in Figure 4.5B. Another similar function is gzoom, which plots the full tree and the selected clade side by side. These two functions were developed to explore large trees.
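For example, using the random tree tr from the layout sketch above (the node number and tip labels are illustrative; gzoom's focus argument takes a set of tip labels or a node number):

p <- ggtree(tr) + geom_tiplab()
viewClade(p, node = 70)                 # show only the clade rooted at node 70
gzoom(tr, focus = c("t1", "t2", "t5"))  # full tree and the clade containing these tips, side by side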

Figure 4.5: Viewing a selected clade of a tree. An example tree used to demonstrate how ggtree supports exploring or manipulating a phylogenetic tree visually (A). The ggtree supports visualizing a selected clade (B). A clade can be selected by specifying a node number or determined by the most recent common ancestor of selected tips.

It is a common practice to prune or collapse clades so that certain aspects of a tree can be emphasized. The ggtree supports collapsing selected clades using the collapse function as shown in Figure 4.6A.

Figure 4.6: Collapsing selected clades and expanding collapsed clades. Clades can be selected to collapse (A) and the collapsed clades can be expanded back (B) if necessary, as ggtree stores all the information about species relationships. Green and red symbols were displayed on the tree to indicate the collapsed clades.

Here two clades were collapsed and labelled by green circle and red square symbolic points. Collapsing is a common strategy for clades that are too large to display in full or are not of primary interest to the study. In ggtree, we can expand (i.e., uncollapse) the collapsed branches back with the expand function to show the details of species relationships, as demonstrated in Figure 4.6B.
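A sketch of the collapse/expand calls on the plot p from above (node number and symbol aesthetics are illustrative):

cp <- collapse(p, node = 21)                      # collapse the clade rooted at node 21
cp <- cp + geom_point2(aes(subset = (node == 21)),
                       shape = 21, size = 5, fill = "green")  # mark the collapsed clade
expand(cp, node = 21)                             # expand it back when needed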

The ggtree provides another option to zoom out (or compress) these clades via the scaleClade function. In this way, we retain the topology and branch lengths of the compressed clades. This helps to save space and highlight the clades of primary interest to the study.

Figure 4.7: Scaling a selected clade. Clades can be zoomed in (if scale > 1) to highlight them or zoomed out to save space.

If users want to emphasize important clades, they can use the scaleClade function with a scale parameter larger than 1; the selected clade will then be zoomed in. Users can also use groupClade to select clades and color them with different colors, as shown in Figure 4.7.
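For example (node numbers and colors are illustrative):

tr2 <- groupClade(tr, c(34, 48))                 # assign two clades to their own groups
p2  <- ggtree(tr2, aes(color = group)) +
  scale_color_manual(values = c("black", "firebrick", "steelblue"))
scaleClade(p2, node = 34, scale = 2.4)           # enlarge the clade of primary interest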

Although groupClade works fine with clades (monophyletic groups), related taxa are not necessarily within a clade; they can be polyphyletic or paraphyletic. The ggtree implements groupOTU to work with polyphyletic and paraphyletic groups. It accepts a vector of OTUs (taxa names) or a list of OTUs and will trace back from the OTUs to their most recent common ancestor (MRCA) and cluster them together, as demonstrated in Figure 4.8.
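For example (the tip labels and group names are illustrative):

tr3 <- groupOTU(tr, list(cluster1 = c("t1", "t4", "t9"),   # OTU sets need not form a clade
                         cluster2 = c("t2", "t7")))
ggtree(tr3, aes(color = group)) + geom_tiplab()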

Figure 4.8: Grouping OTUs. OTU clustering based on their relationships. Selected OTUs and their ancestors up to the MRCA will be clustered together.

To facilitate exploring the tree structure, ggtree supports rotating a selected clade by 180 degrees using the rotate function (Figure 4.9B). The positions of the immediate descendant clades of an internal node can be exchanged via the flip function (Figure 4.9C).
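For example (node numbers are illustrative):

p3 <- rotate(p, node = 45)         # rotate the clade at node 45 by 180 degrees
flip(p3, node1 = 47, node2 = 52)   # exchange the positions of two descendant clades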

Figure 4.9: Exploring the tree structure. A clade (indicated by a darkgreen circle) in a tree (A) can be rotated by 180° (B) and the positions of its immediate descendant clades (colored blue and red) can be exchanged (C).

Most of the tree manipulation functions work on clades, while ggtree also provides functions to manipulate the tree as a whole, including open_tree to transform a tree in either rectangular or circular layout to fan layout, and the rotate_tree function to rotate a tree by a specific angle in circular or fan layouts, as demonstrated in Figures 4.10 and 4.11.
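For example (angles are illustrative):

fan <- open_tree(p, angle = 80)    # convert to a fan layout with an 80-degree opening
rotate_tree(fan, angle = 45)       # then rotate the whole tree by 45 degrees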

Figure 4.10: Transforming a tree to fan layout. A tree can be transformed to fan layout by open_tree with specific angle parameter.

Figure 4.11: Rotating tree. A circular/fan layout tree can be rotated by any specific angle.

4.3.5 Tree annotation using data from evolutionary analysis software

Chapter 2 introduced using the treeio package to parse different tree formats and commonly used software outputs to obtain phylogeny-associated data. These imported data, as S4 objects, can be visualized directly using ggtree. Figure 4.4 demonstrates a tree annotated using the information (species classification, duplication events and bootstrap values) stored in an NHX file. PHYLDOG and RevBayes output NHX files that can be parsed by treeio and visualized by ggtree with annotation using their inference data.

Furthermore, the evolutionary data from the inference of BEAST, MrBayes and RevBayes, dN/dS values inferred by CodeML, ancestral sequences inferred by HyPhy, CodeML or BaseML and short read placement by EPA and pplacer can be used to annotate the tree directly.

Figure 4.12: Annotating BEAST tree with length_95%_HPD and posterior. Branch length credible intervals (95% HPD) were displayed as red horizontal bars and clade posterior values were shown on the middle of branches.

In Figure 4.12, the tree was visualized and annotated with posterior values > 0.9, and branch length uncertainty (95% Highest Posterior Density (HPD) intervals) was displayed.
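A sketch of such an annotation, with treeio and ggtree loaded as before (the file name is a placeholder, and the exact name of the HPD range variable stored by treeio may differ between versions):

beast <- read.beast("beast_mcc.tree")                     # placeholder BEAST output
ggtree(beast) +
  geom_range(range = "length_0.95_HPD",
             color = "red", alpha = 0.6, size = 2) +      # 95% HPD of branch length
  geom_text2(aes(label = round(posterior, 2),
                 subset = posterior > 0.9), vjust = -0.3) # show only posterior > 0.9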

Ancestral sequences inferred by HyPhy can be parsed using treeio, and the substitutions along each tree branch are automatically computed and stored inside the phylogenetic tree object (i.e., the S4 object). The ggtree can utilize this information in the object to annotate the tree, as demonstrated in Figure 4.13.

Figure 4.13: Annotating tree with amino acid substitution determined by ancestral sequences inferred by HYPHY. Amino acid substitutions were displayed on the middle of branches.

PAML's BaseML and CodeML can also be used to infer ancestral sequences, and CodeML can additionally infer selection pressure. After parsing this information using treeio, ggtree can integrate it into the same tree structure and use it for annotation, as illustrated in Figure 4.14.

Figure 4.14: Annotating a tree with amino acid substitutions and dN/dS inferred by CodeML. Branches were rescaled and colored by dN/dS values and amino acid substitutions were displayed on the middle of branches.

More details and examples of annotating trees with evolutionary data inferred by different software packages can be found in the online vignettes 9 .

4.3.6 Tree annotation based on tree classes defined in other R packages

The ggtree plays a unique role in the R ecosystem to facilitate phylogenetic analysis. It serves as a generic tool for tree visualization and annotation with different associated data from various sources. Most of the phylogenetic tree classes defined in the R community are supported, including obkData, phyloseq, phylo, multiPhylo, phylo4 and phylo4d, so that ggtree can be easily integrated into their analyses/packages. For instance, phyloseq users will find ggtree useful for visualizing microbiome data and for further annotations, since ggtree supports a high level of annotation using the grammar of graphics and some of its features are not available in phyloseq. Here, examples of using ggtree to annotate obkData and phyloseq tree objects are demonstrated. The example data can be found in the vignettes of the OutbreakTools (Jombart et al. 2014) and phyloseq (McMurdie and Holmes 2013) packages.

The obkData class is defined to store incidence-based outbreak data, including metadata of sampling and information on infected individuals such as age and onset of symptoms. The ggtree supports parsing this information, which was used to annotate the tree as shown in Figure 4.15.

Figure 4.15: Visualizing an obkData tree object. The x-axis was scaled to the timeline of the outbreak and tips were colored by the location of the different individuals.

The phyloseq class defined in the phyloseq package was designed for storing microbiome data, including the phylogenetic tree, associated sample data and taxonomy assignments. It can import data from popular pipelines, such as QIIME (Kuczynski et al. 2011), mothur (Schloss et al. 2009), DADA2 (Callahan et al. 2016) and PyroTagger (Kunin and Hugenholtz 2010), etc. The ggtree supports visualizing the phylogenetic tree stored in a phyloseq object, and related data can be used to annotate the tree, as demonstrated in Figure 4.16.
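A minimal sketch using the example dataset shipped with phyloseq (the abundance threshold is arbitrary, chosen only to thin the tree for display):

library(phyloseq)
library(ggtree)

data(GlobalPatterns)                                                 # example phyloseq dataset
gp <- prune_taxa(taxa_sums(GlobalPatterns) > 1000, GlobalPatterns)   # keep abundant taxa only
ggtree(gp) +
  geom_tippoint(aes(color = Phylum), size = 1.5)                     # tips colored by taxonomy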

Figure 4.16: Visualizing a phyloseq tree object. Tips were colored by Phylum and the corresponding abundances across different samples were visualized as joyplots and sorted according to the tree structure.

4.3.7 Advanced annotation on the phylogenetic tree

The ggtree supports the grammar of graphics implemented in the ggplot2 package and provides several layers and functions to facilitate phylogenetic visualization and annotation. These layers and functions are not designed for specific tasks; they are building blocks that can be freely combined to produce complex tree figures. Previous sections have introduced some important functions of ggtree. In this section, three examples are presented to demonstrate using various ggtree functions together to construct a complex tree figure with annotations from associated data and inference results from different analysis programs.

4.3.7.1 Example 1: plot curated gene information as heatmap

This example introduces annotating a tree with various sources of data (e.g., location, sampling year, curated genotype information, etc.).

The tree was visualized in circular layout and attached with the strain sampling location information. A geom_tippoint layer added circular symbolic points to the tree tips and colored them by their locations. Two geom_tiplab2 layers were added to display taxon names and sampling years.

The curated gene information was further loaded and plotted as a heatmap using the gheatmap function with customized colors. The final figure is shown in Figure 4.17.
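A sketch of this kind of figure, again using the illustrative tree tr (strain_info and genotype_table are hypothetical data frames standing in for the sampling metadata and the curated gene table; offsets and colors are illustrative):

p <- ggtree(tr, layout = "circular") %<+% strain_info +                # attach external metadata
  geom_tippoint(aes(color = location), size = 2) +                     # tips colored by location
  geom_tiplab2(aes(label = name), align = TRUE, size = 2) +            # taxon names
  geom_tiplab2(aes(label = year), align = TRUE, offset = 6, size = 2)  # sampling years
gheatmap(p, genotype_table, offset = 12, width = 0.3, colnames = FALSE) +
  scale_fill_manual(values = c("steelblue", "firebrick", "darkgreen"))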

Figure 4.17: Example of annotating a tree with diverse associated data. Circle symbols are colored by strain sampling location. Taxa names and sampling years are aligned to the tips. Curated gene information was visualized as a heatmap (colored boxes on the outer circles).

4.3.7.2 Example 2: complex tree annotations

The ggtree allows various evidence inferred by different software to be integrated, compared and visualized on the same tree topology. Data from external files can be further integrated for analysis and visualization. This example introduces complex tree annotations with evolutionary data inferred by different software (BEAST and CodeML in this example) and other associated data (e.g., a genotype table).

First of all, the BEAST and CodeML outputs were parsed, and the two trees with their associated data were merged into one. After merging, all statistical data inferred by these software packages, including divergence times and dN/dS, are incorporated into the merged_tree object. The tree was first visualized on the time scale with its branches colored by dN/dS, and annotated with posterior clade probabilities.
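A sketch of the merging step, assuming treeio's read.beast, read.codeml and merge_tree (file names are placeholders, and the dN/dS column name may differ between treeio versions):

beast  <- read.beast("beast_mcc.tree")              # placeholder BEAST output
codeml <- read.codeml("rst", "mlc")                 # CodeML ancestral sequences + dN/dS
merged_tree <- merge_tree(beast, codeml)            # one tree carrying both sets of annotations
ggtree(merged_tree, mrsd = "2013-01-01", aes(color = dN_vs_dS)) +
  theme_tree2() +
  scale_color_continuous(low = "darkgreen", high = "red")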

The tree branches were further annotated with amino acid substitutions pre-computed from taxon sequences and the ancestral sequences imported from CodeML.

Symbolic points were added to tree tips with different colors to differentiate host species of the influenza virus samples (blue for human and red for swine).

Finally, a genotype table (imported from external file) was plotted as a heatmap and aligned to the tree according to the tree structure as shown in Figure 4.18.

Figure 4.18: Example of annotating a tree with evolutionary evidence inferred by different software. The x-axis is the time scale (in units of years) inferred by BEAST. The tree branches are colored by their dN/dS values (as in the left scale at the top) inferred by CodeML, and the internal node labels show the posterior probabilities inferred by BEAST. Tip labels (taxon names) and circles are colored by species (human in blue and swine in red). The genotype, which is shown as an array of colored boxes on the right, is composed of the lineages (either HuH3N2, Pdm/09 or TRIG, colored as in the right legend at the top) of the eight genome segments of the virus. Any missing segment sequences are shown as empty boxes.

4.3.7.3 Example 3: Integrating ggtree in analysis pipeline/workflow

In the first example, a tree figure was annotated with external data, whereas the second example introduced more complex annotations with evolutionary data inferred by different software and other associated data. This example demonstrates integrating ggtree into an analysis pipeline that starts from nucleotide sequences, builds a tree, uses R packages to infer ancestral sequences and states, and then uses ggtree to integrate these inferences to visualize and interpret the results and help identify evolutionary patterns.

In this example, we collected 1498 H3 sequences (restricting the host to Avian only, to reduce the sequence number for demonstration) with a minimum length criterion of 1000 bp (access date: 2016/02/20). H3 sequences were aligned by MUSCLE (Edgar 2004) and the tree was built using RAxML (Stamatakis 2014) with the GTRGAMMA model. Ancestral sequences were estimated by phangorn (Schliep 2011).

The pml function computed the likelihood of the tree given the sequence alignment, and the optim.pml function optimized different parameters under the GTR model. The function phyPML, implemented in treeio, collected the ancestral sequences inferred by optim.pml and determined amino acid substitutions by comparing parent sequences to child sequences.

Host information was extracted from the taxa names and ancestral hosts were estimated by the ace function defined in the ape package (Paradis, Claude, and Strimmer 2004) using maximum likelihood. Then ggtree was used to visualize the tree with circles colored by the host information.
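A sketch of this pipeline (file names are placeholders, and the host field being the last underscore-separated token of the tip label is an assumption for illustration):

library(phangorn)
library(treeio)
library(ape)
library(ggtree)

aln  <- read.phyDat("H3_avian_aligned.fasta", format = "fasta")  # placeholder MUSCLE alignment
tree <- read.tree("RAxML_bestTree.H3")                           # placeholder RAxML tree
fit  <- pml(tree, aln, k = 4)                                    # likelihood of the tree given the alignment
fit  <- optim.pml(fit, model = "GTR", optGamma = TRUE)           # optimize parameters under GTR
phy  <- phyPML(fit, type = "ml")                                 # ancestral sequences -> substitutions on branches

host <- sub(".*_", "", tree$tip.label)                           # hypothetical host field in tip labels
anc  <- ace(host, tree, type = "discrete", model = "ER")         # ML ancestral host states

d <- data.frame(label = tree$tip.label, host = host)
ggtree(phy) %<+% d +                                             # attach host information to the tree
  geom_tippoint(aes(color = host), size = 2)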

Figure 4.19: Example of integrating ggtree in analysis pipeline. A phylogenetic tree of H3 influenza viruses built by RAxML. Ancestral sequences were inferred by phangorn and ancestral host information estimated by ape. The ggtree allows integrating information for visualization and further analysis. The tree was annotated by symbolic circles colored by host information as in the legend at top right.

As demonstrated in this example, ggtree can be integrated into an analysis pipeline and allows diverse data sources to be combined into a single tree object. Here, host information and ancestral sequences were stored in the tree object, after which ggtree allows further comparison and analysis. For instance, users can associate amino acid substitutions with host jumps, as demonstrated in Figure 4.20. Some sites (around position 400) are conserved across different species, while other sites (around position 20) are frequently mutated, especially for host jumps to mallard (Figure 4.20A). Interestingly, mutations that co-occur with chicken-to-duck transmission tend to cluster in the HA globular head, whereas mutations distributed over the cytoplasmic tail of HA often lead to mallard transmission, especially for teal-to-mallard and duck-to-mallard transmissions (Figure 4.20B). These results could direct further experimental investigation of these markers, for example through reverse genetics studies.

Figure 4.20: Amino acid substitution preferences. Different positions have different mutation frequencies, and mutations that lead to host jumps show preferences for particular sites.

4.3.8 Performance comparison with other tree-related packages

Visualization and annotation of phylogenetic trees are supported by many different packages, notably ape (Paradis, Claude, and Strimmer 2004) and phytools (Revell 2012), which provide many tree manipulation and visualization features in the base plotting system. The ggtree package brings these phylogenetic visualization and annotation capabilities to the ggplot2 plotting system, with a high level of customizability made possible by its object-oriented approach to graphics and data. OutbreakTools (Jombart et al. 2014) and phyloseq (McMurdie and Holmes 2013) also implement tree-viewing functions using ggplot2, for presenting data from epidemiology and the microbiome respectively. A comprehensive comparison of the features available in these packages can be found in Table ??. Here I present the benchmark performance of these packages. A random tree with 1000 leaves was used for basic tree visualization, as shown in Figure 4.21. Overall, ggtree and phyloseq are the two most robust and fastest packages for viewing phylogenetic trees.

Figure 4.21: Run time comparison for basic tree visualization. Tree topology visualization with and without taxon names.

OutbreakTools and phyloseq define their own classes to store tree objects and domain-specific data from epidemiology and the microbiome respectively. OutbreakTools only works with obkData objects, while phyloseq works with phylo and phyloseq objects. As OutbreakTools cannot view a phylo object, it was not included in the run time comparison for viewing phylogenetic trees. Although phyloseq can view a phylo object, it lacks the capability to annotate a phylogenetic tree with user data. To compare the performance of phylogenetic tree annotation, here I used ape and ggtree to reproduce examples presented in the OutbreakTools and phyloseq vignettes. Notably, the S4 classes obkData and phyloseq defined in OutbreakTools and phyloseq are also supported by ggtree, and users can use the + operator to add the related annotation layers. As demonstrated in Figure 4.22, ape is the fastest package for tree annotation and ggtree outperforms both phyloseq and OutbreakTools.

Figure 4.22: Run time comparison for tree annotation. Reproducing tree annotation examples of OutbreakTools and phyloseq using ape and ggtree.

To further compare visualization of phylogenetic trees with different numbers of leaves, here I used random trees with 10 to 1000 leaves, in steps of 10. As illustrated in Figure 4.23, ape and phytools are faster when trees are small, while ggtree and phyloseq perform better on large trees.

Figure 4.23: Run time of viewing trees with different numbers of leaves. Even the slowest package takes only about 1 second to view a phylogenetic tree with 1000 leaves; all of these tools are fast enough for ordinary usage.

The benchmark was performed on an iMac (3.2 GHz Intel Core i5, 16 GB memory) running OS X El Capitan. Source code to reproduce the benchmark results is presented in Figures ??, ??, ?? and ??. For small to moderately sized trees, ape and phytools are faster than ggtree and phyloseq, while for large trees, ggtree and phyloseq are faster than ape and phytools. For a simple tree without annotation, phyloseq is faster than ggtree, whereas when annotating a tree with multiple layers, ggtree is more efficient than both OutbreakTools and phyloseq. In addition, ggtree is a general tool designed for tree annotation, while OutbreakTools and phyloseq are implemented for specific domains and lack many features needed for general-purpose tree annotation. ggtree also provides a grammar of graphics that is intuitive, easy to learn and allows a high level of customization not available in other packages. In general, the performance of ggtree is more stable across different tree sizes and numbers of annotation layers. ggtree is fast for most tree visualization and annotation problems, and it particularly excels at large tree visualization and complex tree annotation.


Results

Bridging with other fields

Adopting standards is a necessary but not a sufficient step towards automating the analysis of glycans. A critical feature/component in glycobioinformatics is the availability of standardised approaches to connect remote databases. The NAS (National Academy of Sciences) "Transforming Glycoscience: A Roadmap for the Future" report [3] exemplifies the hurdles and problems faced by the research community due to the disconnected and incomplete nature of existing databases. Several initiatives have commenced to bridge the information content available in the described databases.

Bridging chemistry and biology with data curation

GlycoSuiteDB [38, 39] contains glycan structures derived from glycoproteins of different biological sources that have been described in the literature, together with free oligosaccharides isolated from biologically important fluids (e.g., milk, saliva, urine). The curated database provides contextual information for glycan structures attached to proteins and re-establishes the frequently lost connection between a glycan structure and the functional protein to which it is attached, as annotated in the UniProtKB resource that is cross-referenced to GlycoSuiteDB. The content and manual curation principles of GlycoSuiteDB form the basis of the central glycan structural database of UniCarbKB, which is designed to incorporate information from other structural databases including EUROCarbDB, UniCarb-DB and GlycoBase, and to maintain the quality of information stored in the knowledgebase. The links to UniProtKB will help to connect key information between glycosylation sites and specific structures.

Bridging glycobioinformatics and bioinformatics using web services

The development of a web services protocol enables searches across several databases. Such technologies have gained much attention in the field of life sciences as an open architecture that facilitates interoperability across heterogeneous platforms. An ongoing programme in the glycomics domain is the Working Group on Glycomics Database Standards (WGGDS) activity, which was initially supported by a CFG-bridging grant. A working draft of the protocols can be accessed at http://glycomics.ccrc.uga.edu/GlycomicsWiki/Informatics:Cross-Database_Search/Protocol_%28WGGDS%29. The WGGDS enabled developers from the CFG, EUROCarbDB/UniCarb-DB, GlycomeDB, GLYCOSCIENCES.de and RINGS to seed the beginnings of a communication interface, which provides access to the data contained in multiple, autonomous glycomics databases with an emphasis on structural data collections.

A complete suite of representational state transfer (REST) based tools has been developed by some of the authors, with new and improved applications being built. Each service provides access to a (sub-)structure search that supports remote queries for complete or partial structures and allows for substructure/epitope matching. This can only be achieved with universal acceptance of structure encoding formats and access to accurate and complete glycan translators. Here, the sequence attribute of the XML-based message protocol conforms to the GlydeII format (see above), which can be readily converted into GlycoCT and/or KCF formats for executing database searches. In addition, individual databases have expanded this service to enable searching based on molecular mass, experimental evidence (e.g. mass spectrometry) and monosaccharide composition. To realise this goal it was imperative for the glycobioinformatics community to agree on encoding formats and ensure robustness in the frameworks.
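
As a rough illustration of how such a REST interface might be consumed from a script, the sketch below posts a GlydeII-encoded (sub)structure query and walks an XML response. The endpoint URL, parameter names and response layout are placeholders invented for illustration only; the actual message format is defined in the WGGDS protocol draft linked above.

    import requests
    import xml.etree.ElementTree as ET

    # Hypothetical WGGDS-style endpoint and parameters -- placeholders only;
    # consult the WGGDS protocol draft for the real message layout.
    ENDPOINT = "https://glycomics.example.org/wggds/substructure_search"

    # A GlydeII-encoded (sub)structure would normally be produced by a
    # format translator; here it is just a stand-in string.
    glyde_fragment = "<GlydeII>...</GlydeII>"

    response = requests.post(
        ENDPOINT,
        data={"sequence": glyde_fragment, "format": "GlydeII"},
        timeout=30,
    )
    response.raise_for_status()

    # Assume the service answers with an XML document listing matching records.
    root = ET.fromstring(response.text)
    for record in root.iter("record"):
        print(record.get("id"), record.get("database"))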

Since the exchange interface (REST) and protocol are independent of the database backend, the WGGDS guidelines can be easily incorporated and extended by other databases. Web services enable researchers to access data and provide a framework for programmers to build applications without installing and maintaining the necessary databases.

Bridging glycobioinformatics and bioinformatics using RDF

Semantic Web approaches are based on common formats that enable the integration and aggregation of data from multiple resources, and they potentially offer a means to solve data compatibility issues in the glycomics space. The Semantic Web is an area of active research and growth in the life sciences, with the potential to improve bioinformatics analyses by leveraging the vast stores of data accumulated in web-accessible resources (e.g., Bio2RDF [40]). A range of commonly accessed databases, such as UniProtKB, has adopted the Resource Description Framework (RDF) [41] as a format to support data integration and more sophisticated queries.

Several database projects in Japan have been involved in adopting RDF such as PDBj [42] or JCGGDB [43] as a part of the Integrated Database Project http://lifesciencedb.jp that focuses on data integration of heterogeneous datasets to provide users with a comprehensive data resource that can be accessible from a single endpoint. In order to efficiently implement RDF solutions, the existing database providers must agree on a standard for representing glycan structure and annotation information. For that purpose, the developers of major glycomics databases including BCSDB [17], GlycomeDB, JCGGDB, GLYCOSCIENCES.de and UniCarbKB designed a draft standard and prototype implementation of the RDF generation during BioHackathon 2012 http://2012.biohackathon.org.

GlycoRDF is a forward-looking collaborative effort that addresses the requirement for sophisticated data mashups to answer complex research questions. It also allows the integration of information across different -omics, a potential demonstrated by the adoption of Semantic Web technologies in other fields including proteomics and genomics. This solution requires harvesting knowledge from multiple resources; initial activities have focused on providing normalised RDF documents sourced from the wealth of information provided by the partners, spanning structural and experimental data collections. The developers involved in this project released the first version of GlycoRDF in 2013 [48].
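
GlycoRDF documents can, in principle, be explored with any RDF toolkit. The short sketch below uses the Python rdflib package to load a locally downloaded GlycoRDF file and run a SPARQL query; the file name and the class and property names are assumptions made for illustration and should be checked against the published GlycoRDF vocabulary.

    from rdflib import Graph

    g = Graph()
    # "structures.ttl" stands in for a locally downloaded GlycoRDF document.
    g.parse("structures.ttl", format="turtle")

    # List saccharide resources with their sequence literals; the class and
    # property names below are assumed and may differ in the released ontology.
    query = """
    PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
    SELECT ?s ?seq WHERE {
        ?s a glycan:Saccharide ;
           glycan:has_glycosequence ?gseq .
        ?gseq glycan:has_sequence ?seq .
    }
    LIMIT 10
    """
    for row in g.query(query):
        print(row.s, row.seq)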


Conclusions

We have shown that structured scientific data can be extracted from unstructured scientific literature using ChemicalTagger. We have also demonstrated that, using text mining and natural language processing tools, we can extract both chemical entities and the relationships between those entities, and make the resulting data available in a machine-processable format. We have shown that these graphs are useful for the generation of highly informative visualisations. While machine extraction can yield good results, it nevertheless remains an act of 'information archaeology' and as such is necessarily imperfect. We therefore strongly urge that the scientific community move towards an ethos in which scientific data are published in semantic form and in which both authors and publishers feel obliged to make this information openly available. Were this to happen on a significant scale, it would lead to a revolution in which millions of chemical syntheses every year could be automatically analysed by machine, which in turn could lead to significant improvements in our ability to do science. Opportunities generated through the large-scale availability of semantic data include:

Formal semantic verification of published information leading to higher quality information from authors, for reviewers and for technical processing.

Greater understandability by readers (including machines).

Automatic analysis of reaction conditions and results.

Greater formal representation of chemical reactions.

We hope, however, that the extraction tools demonstrated here will have only a limited lifetime before they are replaced by semantic authoring.

Copyright Implications

It is important to note that these extraction tools are restricted by the copyright associated with the data. Patents and Open Access (CC-BY) papers explicitly allow data extraction. Theses may depend on the copyright or explicit rights stated within the thesis. Most chemistry publishers are not universally Open Access, and we have engaged with them over several years trying to find a straightforward answer. The authors have raised this issue with both specific publishers (e.g. Elsevier, who publish Tetrahedron) and the STM Publishers Association. Elsevier have referred this to their 'Universal Access' department and currently cannot say whether or not this is permitted. It has been agreed with STM publishers that bibliographic data are Open (CC-BY or CC0). There is no agreement, at the moment, on what data can be extracted.


Background

Taxonomies and ontologies organize complex knowledge about concepts and their relationships. Biology was one of the first fields to use these concepts. Taxonomies are simplistic schemes that help in the hierarchical classification of concepts or objects [1]. They are usually limited to a specific domain and to a single relationship type connecting one node to another. Ontologies share the hierarchical structure of taxonomies. In contrast to taxonomies, however, they often have multiple relationship types and are really designed to provide a formal naming of the types, properties and interrelationships of entities or concepts in a specific discipline, domain or field of study [2, 3]. Moreover, ontologies provide a system to create relationships between concepts across different domains. Both taxonomies and ontologies can be used to help scientists explain, organize or improve their understanding of the natural world. Furthermore, taxonomies and ontologies can serve as standardized vocabularies to help provide inference/reasoning capabilities. In fact, taxonomies and ontologies are widely used in many scientific fields, including biology (the Linnean taxonomy) [4], geology (the BGS Rock classification scheme) [5], subatomic physics (the Eightfold way) [6], astronomy (the stellar classification system) [7, 8] and pharmacology (the ATC drug classification system) [9]. One of the most widely used ontologies is the Gene Ontology (GO) [10], which serves to annotate genes and their products in terms of their molecular functions, cellular locations, and biological processes. Given a specific enzyme, such as the human cytosolic phospholipase (PLA2G4A), and its GO annotation, one could infer the cellular location of its substrate PC[14:0/22:1(13Z)] (HMDB07887). Additionally, because PLA2G4A is annotated with the GO term “phospholipid catabolic process”, it could be inferred that PC[14:0/22:1(13Z)] is a product of this biological process.
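
To make the inference step concrete, the toy sketch below encodes a few is_a edges and GO-style annotations as plain Python dictionaries and walks the hierarchy transitively; the mini-ontology is invented for illustration and is vastly simpler than GO itself.

    # Toy illustration of ontology-based inference (not real GO data).
    IS_A = {
        "phospholipid catabolic process": ["lipid catabolic process"],
        "lipid catabolic process": ["catabolic process"],
    }

    ANNOTATIONS = {
        # enzyme -> annotated terms (simplified, illustrative subset)
        "PLA2G4A": ["phospholipid catabolic process", "cytosol"],
    }

    def ancestors(term):
        """Return every term reachable via transitive is_a relationships."""
        seen, stack = set(), [term]
        while stack:
            for parent in IS_A.get(stack.pop(), []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    # Because PLA2G4A is annotated with "phospholipid catabolic process",
    # every ancestor of that term can also be inferred for the enzyme.
    for term in ANNOTATIONS["PLA2G4A"]:
        print(term, "->", sorted(ancestors(term)))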

While chemists have been very successful in developing a standardized nomenclature (IUPAC) and standardized methods for drawing or exchanging chemical structures [11, 12], the field of chemistry still lacks a standardized, comprehensive, and clearly defined chemical taxonomy or chemical ontology to robustly characterize, classify and annotate chemical structures. Consequently, chemists from various chemistry specializations have often attempted to create domain-specific ontologies. For instance, medicinal chemists tend to classify chemicals according to their pharmaceutical activities (antihypertensive, antibacterials) [9], whereas biochemists tend to classify chemicals according to their biosynthetic origin (leukotrienes, nucleic acids, terpenoids) [13]. Unfortunately, there is no simple one-to-one mapping for these different classification schemes, most of which are limited to very small numbers of domain-specific molecules. Thus, the last decade has seen a growing interest in developing a more universal chemical taxonomy and chemical ontology.

To date, most attempts aimed at classifying and describing chemical compounds have been structure-based. This is largely because the bioactivity of a compound is influenced by its structure [14]. Moreover, the structure of a compound can be easily represented in various formats. Some examples of structure-based chemical classification or ontological schemes include the ChEBI ontology [15], the Medical Subject Heading (MeSH) thesaurus [16], and the LIPID MAPS classification scheme [13]. These databases and ontologies/thesauri are excellent and have been used in various studies including chemical enrichment analysis [17], and knowledge-based metabolic model reconstruction [18], among others. However, they are all produced manually, thus making the classification/annotation process somewhat tedious, error-prone and inconsistent (Fig. 1). In addition, they require substantial human expert time, which means these classification systems only cover a tiny fraction of known chemical space. For instance, in the PubChem database [19], only 0.12% of the >91,000,000 compounds (as of June 2016) are actually classified via the MeSH thesaurus.

Figure 1: a Valclavam is annotated in the PubChem (CID 126919) and ChEBI (CHEBI:9920) databases. b In PubChem, it is incorrectly assigned the class of beta-lactams, which are sulfur compounds. Moreover, although the latter can be either inorganic or organic, it is wrong to describe a single compound as both organic and inorganic. The transitivity of the is_a relationship is not fulfilled, which makes class inference difficult. In ChEBI, the same compound is correctly classified as a peptide. However, as in PubChem, the annotation is incomplete. Class assignments to "clavams" and "azetidines", among others, are missing.

There are several other, older or lesser-known chemical classification schemes, ontologies or taxonomies that are worth mentioning. The Chemical Fragmentation Coding system [20] is perhaps the oldest taxonomy or chemical classification scheme. It was developed in 1963 by the Derwent World Patent Index (DWPI) to facilitate the manual classification of chemical compounds reported in patents. The system consists of 2200 numerical codes corresponding to a set of pre-defined, chemically significant structure fragments. The system is still used by Derwent indexers, who manually assign patented chemicals to these codes. However, the system is considered outdated and complex. Likewise, using the chemical fragmentation codes requires practice and extensive guidance from an expert. A more automated alternative to the Derwent index was developed in the 1970s, called the HOSE (Hierarchical Organisation of Spherical Environments) code [21]. This hierarchical substructure system allows one to automatically characterize atoms and complete rings in terms of their spherical environment. It employs an easily implemented algorithm that has been widely used in NMR chemical shift prediction. However, the HOSE system does not provide a named chemical category assignment, nor does it provide an ontology or a defined chemical taxonomy. More recently, the Chemical Ontology (CO) system [22] has been described. Designed to be analogous to the Gene Ontology (GO) system, CO was one of the first open-source, automated functional group ontologies to be formalized. CO functional groups can be automatically assigned to a given structure by Checkmol [23], a freely available program. CO's assignment of functional groups is accurate and consistent, and it has been applied to several small datasets. However, the CO system is limited to just 200 chemical groups, and so it only covers a very limited portion of chemical space. Moreover, Checkmol is very slow and is impractical to use on very large data sets. SODIAC [24] is another promising tool for automatic compound classification. It uses a comprehensive chemical ontology and an elegant structure-based reasoning logic. SODIAC is a well-designed commercial software package that permits very rapid and consistent classification of compounds. The underlying chemical ontology can be freely downloaded and the SODIAC software, which is closed-source, is free for academics. The fact that it is closed-source obviously limits the possibilities for community feedback or development. Moreover, the SODIAC ontology does not provide textual definitions for most of its terms and is limited in its coverage of inorganic and organo-metallic compounds. Other notable efforts directed towards chemical classification or clustering include Maximum Common Substructure (MCS) based methods [25, 26], an iterative scaffold decomposition method introduced by Schuffenhauer et al. [27], and a semantic-based method described by Chepelev et al. [28]. However, most of these are proof-of-principle methods and have only been validated on a small number of compound classes, which cover only a tiny portion of the rich chemical space. Moreover, they are very data-set dependent. As a result, the classifications do not match the nomenclature expectations of the chemical community, especially for complex compound classes.

Overall, it should be clear that while many attempts have been made to create chemical taxonomies or ontologies, many are proprietary or “closed source”, most require manual analysis or annotation, most are limited in scope and many do not provide meaningful names, definitions or descriptors. These shortcomings highlight the need to develop open access, open-source, fast, fully automated, comprehensive chemical classification tools with robust ontologies that generate results that match chemists’ (i.e. domain experts’) and community expectations. Furthermore, such tools must rapidly classify chemical entities in a consistent manner that is independent of the type of chemical entity being analyzed.

The development of a fully automated, comprehensive chemical classification tool also requires the use of a well-defined chemical hierarchy, whether it is a taxonomy or an ontology. This means that the criteria for hierarchy construction, the relationship types, and the scope of the hierarchy must be clearly defined. Additionally, a clear set of classification rules and a comprehensive data dictionary (or ontology) are necessary. Furthermore, comprehensive chemical classification requires that the chemical categories present in the taxonomy/ontology be accurately described in a computer-interpretable format. Because new chemical compounds and new "chemistries" are being developed or discovered all the time, the taxonomy/ontology must be flexible, and any extension should not force a fundamental modification of the classification procedure. In this regard, Hastings et al. [29] suggested a list of principles that would facilitate the development of an intelligent chemical structure-based classification system. One of the main criteria in this scheme is the ability to combine different elementary features into complex category definitions using compositionality. This is very important, since chemical classes are structurally diverse. Additionally, an accurate description of their core structures sometimes requires the ability to express constraints such as substitution patterns. Today, this can be achieved to a certain extent by the use of logical connectives and structure-handling technologies such as the SMiles ARbitrary Target Specification (SMARTS) format.
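
As an illustration of how a structure-based class definition can be expressed with SMARTS, the sketch below uses the open-source RDKit toolkit to test whether a molecule contains a beta-lactam (2-azetidinone) ring. The SMARTS pattern and the penicillin G SMILES are simplified examples written for this illustration, not rules taken from the ClassyFire dictionary.

    from rdkit import Chem

    # Simplified, illustrative SMARTS for a beta-lactam (2-azetidinone) core;
    # real classification rules are typically more elaborate.
    BETA_LACTAM = Chem.MolFromSmarts("O=C1CCN1")

    def has_beta_lactam(smiles: str) -> bool:
        """Return True if the molecule contains the four-membered lactam ring."""
        mol = Chem.MolFromSmiles(smiles)
        return mol is not None and mol.HasSubstructMatch(BETA_LACTAM)

    # Penicillin G (illustrative SMILES; verify against a curated database).
    print(has_beta_lactam("CC1(C)S[C@@H]2[C@H](NC(=O)Cc3ccccc3)C(=O)N2[C@H]1C(=O)O"))  # True
    print(has_beta_lactam("CCO"))  # False: ethanol has no lactam ring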

In this paper, we describe a comprehensive, flexible, computable, chemical taxonomy along with a fully annotated chemical ontology (ChemOnt) and a Chemical Classification Dictionary. These components underlie a web-accessible computer program called ClassyFire, which permits automated rule-based structural classification of essentially all known chemical entities. ClassyFire makes use of a number of modern computational techniques and circumvents most of the limitations of the previously mentioned systems and software tools. This paper also describes the rationale behind ClassyFire, its classification rules, the design of its taxonomy, its performance under testing conditions and its potential applications. ClassyFire has been successfully used to classify and annotate >6000 molecules in DrugBank [30], >25,000 molecules in the LIPID MAPS Lipidomics Gateway [31], >42,000 molecules in HMDB [32], >43,000 compounds in ChEBI [15] and >60,000,000 molecules in PubChem [19], among others. These compounds cover a wide range of chemical types such as drugs, lipids, food compounds, toxins, phytochemicals and many other natural as well as synthetic molecules. ClassyFire is freely available at http://classyfire.wishartlab.com. Moreover, the ClassyFire API, which is written in Ruby, provides programmatic access to the ClassyFire server and database. It is available at https://bitbucket.org/wishartlab/classyfire_api.
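
For readers who prefer not to use the Ruby client, a minimal HTTP sketch along the following lines is possible; the endpoint paths and JSON field names shown here are assumptions based on the publicly documented ClassyFire web API and should be verified against the current documentation before use.

    import time
    import requests

    BASE = "http://classyfire.wishartlab.com"

    # Submit a structure query (endpoint and field names are assumed from the
    # public web API; check the current ClassyFire documentation).
    payload = {
        "label": "example",
        "query_input": "CC(=O)Oc1ccccc1C(=O)O",  # aspirin, as a SMILES string
        "query_type": "STRUCTURE",
    }
    resp = requests.post(BASE + "/queries.json", json=payload, timeout=30)
    resp.raise_for_status()
    query_id = resp.json()["id"]

    # Poll until the classification finishes, then print the direct parent class.
    while True:
        result = requests.get(BASE + "/queries/%s.json" % query_id, timeout=30).json()
        if result.get("classification_status") == "Done":
            for entity in result.get("entities", []):
                print(entity.get("direct_parent", {}).get("name"))
            break
        time.sleep(10)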


Link Grammar Parser

Version 5.9.1

The Link Grammar Parser exhibits the linguistic (natural language) structure of English, Russian, Arabic, Persian and limited subsets of a half-dozen other languages. This structure is a graph of typed links (edges) between the words in a sentence. One may obtain the more conventional HPSG (constituent) and dependency style parses from Link Grammar by applying a collection of rules to convert to these different formats. This is possible because Link Grammar goes a bit "deeper" into the "syntactico-semantic" structure of a sentence: it provides considerably more fine-grained and detailed information than what is commonly available in conventional parsers.

The theory of Link Grammar parsing was originally developed in 1991 by Davy Temperley, John Lafferty and Daniel Sleator, at the time professors of linguistics and computer science at Carnegie Mellon University. The three initial publications on this theory provide the best introduction and overview; since then, there have been hundreds of publications further exploring, examining and extending the ideas.

Although based on the original Carnegie-Mellon code base, the current Link Grammar package has dramatically evolved and is profoundly different from earlier versions. There have been innumerable bug fixes; performance has improved by more than an order of magnitude. The package is fully multi-threaded, fully UTF-8 enabled, and has been scrubbed for security, enabling cloud deployment. Parse coverage of English has been dramatically improved; other languages have been added (most notably, Russian). There is a raft of new features, including support for morphology, log-likelihood semantic selection, and a sophisticated tokenizer that moves far beyond white-space-delimited sentence-splitting. Detailed lists can be found in the ChangeLog.

This code is released under the LGPL license, making it freely available for both private and commercial use, with few restrictions. The terms of the license are given in the LICENSE file included with this software.

Please see the main web page for more information. This version is a continuation of the original CMU parser.

As of version 5.9.0, the system includes an experimental system for generating sentences. These are specified using a "fill in the blanks" API, where words are substituted into wild-card locations whenever the result is a grammatically valid sentence. Additional details are in the man page: man link-generator (in the man subdirectory).

This generator is used in the OpenCog Language Learning project, which aims to automatically learn Link Grammars from corpora, using brand-new and innovative information theoretic techniques, somewhat similar to those found in artificial neural nets (deep learning), but using explicitly symbolic representations.

The parser includes APIs in various programming languages, as well as a handy command-line tool for playing with it. Here's some typical output:

This rather busy display illustrates many interesting things. For example, the Ss*b link connects the verb and the subject, and indicates that the subject is singular. Likewise, the Ost link connects the verb and the object, and also indicates that the object is singular. The WV (verb-wall) link points at the head-verb of the sentence, while the Wd link points at the head-noun. The Xp link connects to the trailing punctuation. The Ds**c link connects the noun to the determiner: it again confirms that the noun is singular, and also that the noun starts with a consonant. (The PH link, not required here, is used to force phonetic agreement, distinguishing 'a' from 'an'). These link types are documented in the English Link Documentation.

The bottom of the display is a listing of the "disjuncts" used for each word. The disjuncts are simply a list of the connectors that were employed to form the links. They are particularly interesting because they serve as an extremely fine-grained form of "part of speech". Thus, for example, the disjunct S- O+ indicates a transitive verb: it's a verb that takes both a subject and an object. The additional markup above indicates that 'is' is not only being used as a transitive verb, but also gives finer details: a transitive verb that took a singular subject, and was used as (is usable as) the head verb of a sentence. The floating-point value is the "cost" of the disjunct; it very roughly captures the idea of the log-probability of this particular grammatical usage. Much as parts of speech correlate with word meanings, so also fine-grained parts of speech correlate with much finer distinctions and gradations of meaning.

The link-grammar parser also supports morphological analysis. Here is an example in Russian:

The LL link connects the stem 'тест' to the suffix 'а'. The MVA link connects only to the suffix, because, in Russian, it is the suffixes that carry all of the syntactic structure, and not the stems. The Russian lexis is documented here.

An extended overview and summary can be found in the Link Grammar Wikipedia page, which touches on most of the important, primary aspects of the theory. However, it is no substitute for the original papers published on the topic:

  • Daniel D. K. Sleator, Davy Temperley, "Parsing English with a Link Grammar" October 1991 CMU-CS-91-196.
  • Daniel D. Sleator, Davy Temperley, "Parsing English with a Link Grammar", Third International Workshop on Parsing Technologies (1993).
  • Dennis Grinberg, John Lafferty, Daniel Sleator, "A Robust Parsing Algorithm for Link Grammars", August 1995 CMU-CS-95-125.
  • John Lafferty, Daniel Sleator, Davy Temperley, "Grammatical Trigrams: A Probabilistic Model of Link Grammar", 1992 AAAI Symposium on Probabilistic Approaches to Natural Language.

There are many more papers and references listed on the primary Link Grammar website.

See also the C/C++ API documentation. Bindings for other programming languages, including python3, java and node.js, can be found in the bindings directory. (There are two sets of javascript bindings: one set for the library API, and another set for the command-line parser.)

Content Description
LICENSE The license describing terms of use
link-grammar/*.c The program. (Written in ANSI-C)
---- ----
bindings/autoit/ Optional AutoIt language bindings.
bindings/java/ Optional Java language bindings.
bindings/js/ Optional JavaScript language bindings.
bindings/lisp/ Optional Common Lisp language bindings.
bindings/node.js/ Optional node.js language bindings.
bindings/ocaml/ Optional OCaML language bindings.
bindings/python/ Optional Python3 language bindings.
bindings/python-examples/ Link-grammar test suite and Python language binding usage example.
bindings/swig/ SWIG interface file, for other FFI interfaces.
bindings/vala/ Optional Vala language bindings.
---- ----
data/en/ English language dictionaries.
data/en/4.0.dict The file containing the dictionary definitions.
data/en/4.0.knowledge The post-processing knowledge file.
data/en/4.0.constituents The constituent knowledge file.
data/en/4.0.affix The affix (prefix/suffix) file.
data/en/4.0.regex Regular expression-based morphology guesser.
data/en/tiny.dict A small example dictionary.
data/en/words/ A directory full of word lists.
data/en/corpus*.batch Example corpora used for testing.
---- ----
data/ru/ A full-fledged Russian dictionary
data/ar/ A fairly complete Arabic dictionary
data/fa/ A Persian (Farsi) dictionary
data/de/ A small prototype German dictionary
data/lt/ A small prototype Lithuanian dictionary
data/id/ A small prototype Indonesian dictionary
data/vn/ A small prototype Vietnamese dictionary
data/he/ An experimental Hebrew dictionary
data/kz/ An experimental Kazakh dictionary
data/tr/ An experimental Turkish dictionary
---- ----
morphology/ar/ An Arabic morphology analyzer
morphology/fa/ A Persian morphology analyzer
---- ----
LICENSE The license for this code and data
ChangeLog A compendium of recent changes.
configure The GNU configuration script
autogen.sh Developer's configure maintenance tool
debug/ Information about debugging the library
msvc/ Microsoft Visual-C project files
mingw/ Information on using MinGW under MSYS or Cygwin

UNPACKING and signature verification

The system is distributed using the conventional tar.gz format; it can be extracted using the tar -zxf link-grammar.tar.gz command at the command line.

A tarball of the latest version can be downloaded from:
http://www.abisource.com/downloads/link-grammar

The files have been digitally signed to make sure that there was no corruption of the dataset during download, and to help ensure that no malicious changes were made to the code internals by third parties. The signatures can be checked with the gpg command:

gpg --verify link-grammar-5.9.1.tar.gz.asc

which should generate output identical to (except for the date):

Alternately, the md5 check-sums can be verified. These do not provide cryptographic security, but they can detect simple corruption. To verify the check-sums, issue md5sum -c MD5SUM at the command line.

Tags in git can be verified by performing the following:

To compile the link-grammar shared library and demonstration program, at the command line, type:

To install, change user to "root" and say

This will install the liblink-grammar.so library into /usr/local/lib , the header files in /usr/local/include/link-grammar , and the dictionaries into /usr/local/share/link-grammar . Running ldconfig will rebuild the shared library cache. To verify that the install was successful, run (as a non-root user)

Optional system libraries

The link-grammar library has optional features that are enabled automatically if configure detects certain libraries. These libraries are optional on most systems; if the features they add are desired, the corresponding libraries need to be installed before running configure .

The library package names may vary between systems (consult Google if needed). For example, the names may include -devel instead of -dev , or omit it altogether. The library names may lack the prefix lib .

  • libsqlite3-dev (for SQLite-backed dictionary)
  • zlib1g-dev or zlib-devel (currently needed for the bundled minisat2 )
  • libedit-dev (see Editline)
  • libhunspell-dev or libaspell-dev (and the corresponding English dictionary).
  • libtre-dev or libpcre2-dev (usually much faster than the libc REGEX implementation, and needed for correctness on FreeBSD and Cygwin)

Note: BSD-derived operating systems (including macOS) need the argp-standalone library in order to build the link-generator program.

If libedit-dev is installed, then the arrow keys can be used to edit the input to the link-parser tool; the up and down arrow keys will recall previous entries. You want this; it makes testing and editing much easier.

Two versions of node.js bindings are included. One version wraps the library; the other uses emscripten to wrap the command-line tool. The library bindings are in bindings/node.js while the emscripten wrapper is in bindings/js .

These are built using npm . First, you must build the core C library. Then do the following:

This will create the library bindings and also run a small unit test (which should pass). An example can be found in bindings/node.js/examples/simple.js .

For the command-line wrapper, do the following:

The Python3 bindings are built by default, providing that the corresponding Python development packages are installed. (Python2 bindings are no longer supported.)

  • Linux:
    • Systems using 'rpm' packages: python3-devel
    • Systems using 'deb' packages: python3-dev
  • Windows: Install Python3 from https://www.python.org/downloads/windows/ . You also have to install SWIG from http://www.swig.org/download.html .
  • macOS: Install python3 using HomeBrew. Note: With recent Anaconda Python versions, the build process succeeds, but loading the resulting module causes a crash. If you are a macOS developer, we need help with that. See the relevant issues in the GitHub repository (search there for "anaconda").

    NOTE: Before issuing configure (see below), ensure that the required Python versions can be invoked via your PATH .

    The use of the Python bindings is OPTIONAL; you do not need these if you do not plan to use link-grammar with Python. If you would like to disable the Python bindings, use:

    The linkgrammar.py module provides a high-level interface in Python. The example.py and sentence-check.py scripts provide a demo, and tests.py runs unit tests.
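
    A minimal usage sketch of these bindings might look like the following; the class and method names reflect the linkgrammar module as the author understands it, so treat example.py in bindings/python-examples as the authoritative reference.

        # Minimal sketch using the linkgrammar Python bindings; see
        # bindings/python-examples/example.py for the canonical usage.
        from linkgrammar import Dictionary, ParseOptions, Sentence

        po = ParseOptions()
        en_dict = Dictionary("en")              # the English dictionary (data/en)
        sent = Sentence("This is a test.", en_dict, po)
        linkages = sent.parse()
        for linkage in linkages:
            print(linkage.diagram())            # ASCII-art parse diagram
            break                               # show only the first linkage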

    By default, the Makefiles attempt to build the Java bindings. The use of the Java bindings is OPTIONAL; you do not need these if you do not plan to use link-grammar with Java. You can skip building the Java bindings by disabling them as follows:

    If jni.h isn't found, or if ant isn't found, then the java bindings will not be built.

    Notes about finding jni.h :
    Some common java JVM distributions (most notably, the ones from Sun) place this file in unusual locations, where it cannot be automatically found. To remedy this, make sure that environment variable JAVA_HOME is set correctly. The configure script looks for jni.h in $JAVA_HOME/Headers and in $JAVA_HOME/include ; it also examines corresponding locations for $JDK_HOME . If jni.h still cannot be found, specify the location with the CPPFLAGS variable: so, for example,

    Please note that the use of /opt is non-standard, and most system tools will fail to find packages installed there.

    The /usr/local install target can be over-ridden using the standard GNU configure --prefix option; so, for example:

    By using pkg-config (see below), non-standard install locations can be automatically detected.

    Additional config options are printed by

    The system has been tested and works well on 32 and 64-bit Linux systems, FreeBSD, macOS, as well as on Microsoft Windows systems. Specific OS-dependent notes follow.

    End users should download the tarball (see UNPACKING and signature verification).

    The current GitHub version is intended for developers (including anyone who is willing to provide a fix, a new feature or an improvement). The tip of the master branch is often unstable, and can sometimes have bad code in it as it is under development. It also requires installing development tools that are not installed by default. For these reasons, the use of the GitHub version is discouraged for regular end users.

    Clone it: git clone https://github.com/opencog/link-grammar.git
    Or download it as a ZIP:
    https://github.com/opencog/link-grammar/archive/master.zip

    Tools that may need installation before you can build link-grammar:

    make (the gmake variant may be needed)
    m4
    gcc or clang
    autoconf
    libtool
    autoconf-archive
    pkg-config
    pip and/or pip3 (for the Python bindings)

    Optional:
    swig (for language bindings)
    flex
    Apache Ant (for Java bindings)
    graphviz (if you like to use the word-graph display feature)

    The GitHub version doesn't include a configure script. To generate it, use:

    If you get errors, make sure you have installed the above-listed development packages, and that your system installation is up to date. Especially, missing autoconf or autoconf-archive may cause strange and misleading errors.

    For more info about how to proceed, continue at the section CREATING the system and the relevant sections after it.

    Additional notes for developers

    To configure debug mode, use:

    It adds some verification debug code and functions that can pretty-print several data structures.

    A feature that may be useful for debugging is the word-graph display. Use the configure option --enable-wordgraph-display to enable it. For more details on this feature, see Word-graph display.

    BUILDING on FreeBSD

    The current configuration has an apparent standard C++ library mixing problem when gcc is used (a fix is welcome). However, the common practice on FreeBSD is to compile with clang , and it doesn't have this problem. In addition, the add-on packages are installed under /usr/local .

    So here is how configure should be invoked:

    Note that pcre2 is a required package as the existing libc regex implementation doesn't have the needed level of regex support.

    Some packages have different names than the ones mentioned in the previous sections:

    minisat (minisat2), pkgconf (pkg-config)

    BUILDING on macOS

    Plain-vanilla Link Grammar should compile and run on Apple macOS just fine, as described above. At this time, there are no reported issues.

    If you do NOT need the java bindings, you should almost surely configure with:

    By default, java requires a 64-bit binary, and not all macOS systems have a 64-bit devel environment installed.

    If you do want Java bindings, be sure to set the JDK_HOME environment variable to wherever <Headers/jni.h> is. Set the JAVA_HOME variable to the location of the java compiler. Make sure you have ant installed.

    If you would like to build from GitHub (see BUILDING from the GitHub repository) you can install the tools that are listed there using HomeBrew.

    There are three different ways in which link-grammar can be compiled on Windows. One way is to use Cygwin, which provides a Linux compatibility layer for Windows. Another way is to use the MSVC system. A third way is to use the MinGW system, which uses the Gnu toolset to compile Windows programs. The source code supports Windows systems from Vista on.

    The Cygwin way currently produces the best result, as it supports line editing with command completion and history and also supports word-graph displaying on X-windows. (MinGW currently doesn't have libedit , and the MSVC port currently doesn't support command completion and history, spelling and X-Windows word-graph display.)

    Link-grammar requires a working version of POSIX-standard regex libraries. Since these are not provided by Microsoft, a copy must be obtained elsewhere. One popular choice is TRE.

    BUILDING on Windows (Cygwin)

    The easiest way to have link-grammar working on MS Windows is to use Cygwin, a Linux-like environment for Windows making it possible to port software running on POSIX systems to Windows. Download and install Cygwin.

    Note that the installation of the pcre2 package is required because the libc REGEX implementation is not capable enough.

    BUILDING on Windows (MinGW)

    Another way to build link-grammar is to use MinGW, which uses the GNU toolset to compile POSIX-compliant programs for Windows. Using MinGW/MSYS2 is probably the easiest way to obtain workable Java bindings for Windows. Download and install MinGW/MSYS2 from msys2.org.

    Note that the installation of the pcre2 package is required because the libc REGEX implementation is not capable enough.

    BUILDING and RUNNING on Windows (MSVC)

    Microsoft Visual C/C++ project files can be found in the msvc directory. For directions see the README.md file there.

    To run the program issue the command (supposing it is in your PATH):

    This starts the program. The program has many user-settable variables and options. These can be displayed by entering !var at the link-parser prompt. Entering !help will display some additional commands.

    The dictionaries are arranged in directories whose names are the 2-letter language codes. The link-parser program searches for such a language directory in the following locations, in order, either directly or under a directory named data :

    1. Under your current directory.
    2. Unless compiled with MSVC or run under the Windows console: At the installed location (typically in /usr/local/share/link-grammar ).
    3. If compiled on Windows: In the directory of the link-parser executable (may be in a different location than the link-parser command, which may be a script).

    If link-parser cannot find the desired dictionary, use verbosity level 3 to debug the problem; for example:

    Other locations can be specified on the command line; for example:

    When accessing dictionaries in non-standard locations, the standard file-names are still assumed (i.e. 4.0.dict , 4.0.affix , etc.).

    The Russian dictionaries are in data/ru . Thus, the Russian parser can be started as:

    If you don't supply an argument to link-parser, it searches for a language according to your current locale setup. If it cannot find such a language directory, it defaults to "en".

    If you see errors similar to this:

    then your UTF-8 locales are either not installed or not configured. The shell command locale -a should list en_US.utf8 as a locale. If not, then you need to dpkg-reconfigure locales and/or run update-locale or possibly apt-get install locales , or combinations or variants of these, depending on your operating system.

    There are several ways to test the resulting build. If the Python bindings are built, then a test program can be found in the file ./bindings/python-examples/tests.py -- When run, it should pass. For more details see README.md in the bindings/python-examples directory.

    There are also multiple batches of test/example sentences in the language data directories, generally having the names corpus-*.batch . The parser program can be run in batch mode, for testing the system on a large number of sentences. The following command runs the parser on a file called corpus-basic.batch

    The line !batch near the top of corpus-basic.batch turns on batch mode. In this mode, sentences labeled with an initial * should be rejected and those not starting with a * should be accepted. This batch file does report some errors, as do the files corpus-biolg.batch and corpus-fixes.batch . Work is ongoing to fix these.

    The corpus-fixes.batch file contains many thousands of sentences that have been fixed since the original 4.1 release of link-grammar. The corpus-biolg.batch contains biology/medical-text sentences from the BioLG project. The corpus-voa.batch contains samples from Voice of America; the corpus-failures.batch contains a large number of failures.

    The following numbers are subject to change, but, at this time, the number of errors one can expect to observe in each of these files are roughly as follows:

    The bindings/python directory contains a unit test for the Python bindings. It also performs several basic checks that stress the link-grammar libraries.

    There is an API (application program interface) to the parser. This makes it easy to incorporate it into your own applications. The API is documented on the web site.

    The FindLinkGrammar.cmake file can be used to test for and set up compilation in CMake-based build environments.

    To make compiling and linking easier, the current release uses the pkg-config system. To determine the location of the link-grammar header files, say pkg-config --cflags link-grammar . To obtain the location of the libraries, say pkg-config --libs link-grammar . Thus, for example, a typical makefile might include the targets:

    This release provides java files that offer three ways of accessing the parser. The simplest way is to use the org.linkgrammar.LinkGrammar class this provides a very simple Java API to the parser.

    The second possibility is to use the LGService class. This implements a TCP/IP network server, providing parse results as JSON messages. Any JSON-capable client can connect to this server and obtain parsed text.

    The third possibility is to use the org.linkgrammar.LGRemoteClient class, and in particular, the parse() method. This class is a network client that connects to the JSON server, and converts the response back to results accessible via the ParseResult API.

    The above-described code will be built if Apache ant is installed.

    Using the JSON Network Server

    The network server can be started by saying:

    The above starts the server on port 9000. If the port is omitted, help text is printed. This server can be contacted directly via TCP/IP; for example:

    (Alternately, use netcat instead of telnet). After connecting, type in:

    The returned bytes will be a JSON message providing the parses of the sentence. By default, the ASCII-art parse of the text is not transmitted. This can be obtained by sending messages of the form:

    The parser will run a spell-checker at an early stage, if it encounters a word that it does not know, and cannot guess, based on morphology. The configure script looks for the aspell or hunspell spell-checkers; if the aspell devel environment is found, then aspell is used, else hunspell is used.

    Spell guessing may be disabled at runtime, in the link-parser client with the !spell=0 flag. Enter !help for more details.

    It is safe to use link-grammar for parsing in multiple threads. Different threads may use different dictionaries, or the same dictionary. Parse options can be set on a per-thread basis, with the exception of verbosity, which is a global, shared by all threads. It is the only global.

    A/An phonetic determiners before consonants/vowels are handled by a new PH link type, linking the determiner to the word immediately following it. Status: Introduced in version 5.1.0 (August 2014). Mostly done, although many special-case nouns are unfinished.

    Directional links are needed for some languages, such as Lithuanian, Turkish and other free word-order languages. The goal is to have a link clearly indicate which word is the head word, and which is the dependent. This is achieved by prefixing connectors with a single lower case letter: h,d, indicating 'head' and 'dependent'. The linkage rules are such that h matches either nothing or d, and d matches h or nothing. This is a new feature in version 5.1.0 (August 2014). The website provides additional documentation.
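
    Stated as code, the head/dependent matching rule could be sketched as below; this helper is purely illustrative and is not part of the library API (the behaviour of unmarked connectors is inferred from the rule above).

        def hd_compatible(left_mark: str, right_mark: str) -> bool:
            """Illustrative check of the head/dependent prefix rule:
            'h' matches nothing ('') or 'd'; 'd' matches 'h' or nothing."""
            allowed = {
                "h": {"", "d"},
                "d": {"", "h"},
                "": {"", "h", "d"},  # unmarked connectors: inferred, unconstrained
            }
            return right_mark in allowed[left_mark]

        assert hd_compatible("h", "d")
        assert hd_compatible("d", "")
        assert not hd_compatible("h", "h")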

    Although the English-language link-grammar links are un-oriented, it seems that a defacto direction can be given to them that is completely consistent with standard conceptions of a dependency grammar.

    The dependency arrows have the following properties:

    Anti-reflexive (a word cannot depend on itself; it cannot point at itself.)

    Anti-symmetric (if Word1 depends on Word2, then Word2 cannot depend on Word1) (so, e.g. determiners depend on nouns, but never vice-versa)

    The arrows are neither transitive, nor anti-transitive: a single word may be ruled by several heads. For example:

    That is, there is a path to the subject, "she", directly from the left wall, via the Wd link, as well as indirectly, from the wall to the root verb, and thence to the subject. Similar loops form with the B and R links. Such loops are useful for constraining the possible number of parses: the constraint occurs in conjunction with the "no links cross" meta-rule.

    • The graphs are planar; that is, no two edges may cross. See, however, the "link-crossing" discussion below.

    There are several related mathematical notions, but none quite capture directional LG:

    Directional LG graphs resemble DAGs, except that LG allows only one wall (one "top" element).

    Directional LG graphs resemble strict partial orders, except that the LG arrows are usually not transitive.

    Directional LG graphs resemble catena except that catena are strictly anti-transitive -- the path to any word is unique, in a catena.

    The foundational LG papers mandate the planarity of the parse graphs. This is based on a very old observation that dependencies almost never cross in natural languages: humans simply do not speak in sentences where links cross. Imposing planarity constraints then provides a strong engineering and algorithmic constraint on the resulting parses: the total number of parses to be considered is sharply reduced, and thus the overall speed of parsing can be greatly increased.

    However, there are occasional, relatively rare exceptions to this planarity rule such exceptions are observed in almost all languages. A number of these exceptions are given for English, below.

    Thus, it seems important to relax the planarity constraint, and find something else that is almost as strict, but still allows infrequent exceptions. It would appear that the concept of "landmark transitivity" as defined by Richard Hudson in his theory of "Word Grammar", and then advocated by Ben Goertzel, just might be such a mechanism.

    Planarity: Theory vs. Practice

    In practice, the planarity constraint allows very efficient algorithms to be used in the implementation of the parser. Thus, from the point of view of the implementation, we want to keep planarity. Fortunately, there is a convenient and unambiguous way to have our cake and eat it, too. A non-planar diagram can be drawn on a sheet of paper using standard electrical-engineering notation: a funny symbol, wherever wires cross. This notation is very easily adapted to LG connectors; below is an actual working example, already implemented in the current LG English dictionary. All link crossings can be implemented in this way! So we do not have to actually abandon the current parsing algorithms to get non-planar diagrams. We don't even have to modify them! Hurrahh!

    Here is a working example: "I want to look at and listen to everything." This wants two J links pointing to 'everything'. The desired diagram would need to look like this:

    The above really wants to have a Js link from 'at' to 'everything', but this Js link crosses (clashes with - marked by xxx) the link to the conjunction. Other examples suggest that one should allow most links to cross over the down-links to conjunctions.

    The planarity-maintaining work-around is to split the Js link into two: a Jj part and a Jk part; the two are used together to cross over the conjunction. This is currently implemented in the English dictionary, and it works.

    This work-around is in fact completely generic, and can be extended to any kind of link crossing. For this to work, a better notation would be convenient; perhaps uJs- instead of Jj- and vJs- instead of Jk- , or something like that. (TODO: invent better notation.) (NB: This is a kind of re-invention of "fat links", but in the dictionary, not in the code.)

    Landmark Transitivity: Theory

    Given that non-planar parses can be enabled without any changes to the parser algorithm, all that is required is a theory that describes link-crossing in a coherent, grounded way. That theory is Dick Hudson's Landmark Transitivity, explained here.

    This mechanism works as follows:

    First, every link must be directional, with a head and a dependent. That is, we are concerned with directional-LG links, which are of the form x--A-->y or y<--A--x for words x,y and LG link type A.

    Given either the directional-LG relation x--A-->y or y<--A--x, define the dependency relation x-->y. That is, ignore the link-type label.

    Heads are landmarks for dependents. If the dependency relation x-->y holds, then x is said to be a landmark for y, and the predicate land(x,y) is true, while the predicate land(y,x) is false. Here, x and y are words, while --> is the landmark relation.

    Although the basic directional-LG links form landmark relations, the total set of landmark relations is extended by transitive closure. That is, if land(x,y) and land(y,z) then land(x,z). In other words, the basic directional-LG links are "generators" of landmarks: they generate further landmarks by means of transitivity. Note that the transitive closure is unique.

    In addition to the above landmark relation, there are two additional relations: the before and after landmark relations. (In English, these correspond to left and right; in Hebrew, the opposite.) That is, since words come in chronological order in a sentence, the dependency relation can point either left or right. The previously-defined landmark relation only described the dependency order; we now introduce the word-sequence order. Thus, there are land-before() and land-after() relations that capture both the dependency relation and the word-order relation.

    Notation: the before-landmark relation land-B(x,y) corresponds to x-->y (in English; reversed in right-to-left languages such as Hebrew), whereas the after-landmark relation land-A(x,y) corresponds to y<--x. That is, land(x,y) == land-B(x,y) or land-A(x,y) holds as a statement about the predicate form of the relations.

    As before, the full set of directional landmarks are obtained by transitive closure applied to the directional-LG links. Two different rules are used to perform this closure:

    Parsing is then performed by joining LG connectors in the usual manner, to form a directional link. The transitive closure of the directional landmarks is then computed. Finally, any parse that does not conclude with the "left wall" being the upper-most landmark is discarded.
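    As a rough sketch of this mechanism (my own illustration; the closure rules elided above are replaced here by a plain transitive closure that preserves word order), the landmark relations could be computed along these lines:

        # Directional LG links generate before/after landmark relations,
        # which are then extended by transitive closure.  Words are
        # identified by sentence position; position 0 is the LEFT-WALL.
        from itertools import product

        def landmark_closure(directional_links):
            """directional_links: iterable of (head_position, dependent_position)."""
            land = set(directional_links)
            changed = True
            while changed:                      # naive transitive closure
                changed = False
                for (x, y), (y2, z) in product(list(land), list(land)):
                    if y == y2 and x != z and (x, z) not in land:
                        land.add((x, z))
                        changed = True
            before = {(x, y) for (x, y) in land if x < y}   # land-B(x,y)
            after  = {(x, y) for (x, y) in land if x > y}   # land-A(x,y)
            return before, after

        def left_wall_dominates(land_pairs, n_words):
            """Check that the LEFT-WALL (position 0) is a landmark for every word."""
            return all((0, w) in land_pairs for w in range(1, n_words))

        before, after = landmark_closure([(0, 1), (1, 2), (2, 3)])
        print(left_wall_dominates(before | after, 4))   # True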

    Here is an example where landmark transitivity provides a natural solution to a (currently) broken parse. The "to.r" has a disjunct "I+ & MVi-" which allows "What is there to do?" to parse correctly. However, it also allows the incorrect parse "He is going to do". The fix would be to force "do" to take an object; however, a link from "do" to "what" is not allowed, because link-crossing would prevent it.

    Fixing this requires only a fix to the dictionary, and not to the parser itself.

    Examples where the no-links-cross constraint seems to be violated, in English:

    Both seem to be acceptable in English, but the ambiguity of the "in-either" temporal ordering requires two different parse trees, if the no-links-cross rule is to be enforced. This seems unnatural. Similarly:

    A different example involves a crossing to the left wall. That is, the link LEFT-WALL--remains crosses over here--found:

    Other examples, per And Rosta:

    The allowed--by link crosses cake--that:

    There is a natural crossing, driven by conjunctions:

    the "natural" linkage is to use MV links to connect "yesterday" and "on Tuesday" to the verb. However, if this is done, then these must cross the links from the conjunction "and" to "heaven" and "hell". This can be worked around partly as follows:

    but the desired MV links from the verb to the time-prepositions "yesterday" and "on Tuesday" are missing -- whereas they are present when the individual sentences "I was in hell yesterday" and "I was in heaven on Tuesday" are parsed. Using a conjunction should not wreck the relations that get used; but this requires link-crossing.

    Here, "up_to" must modify "number", and not "whose". There's no way to do this without link-crossing.

    Link Grammar can be understood in the context of type theory. A simple introduction to type theory can be found in chapter 1 of the HoTT book.
    This book is freely available online and strongly recommended if you are interested in types.

    Link types can be mapped to types that appear in categorial grammars. The nice thing about link-grammar is that the link types form a type system that is much easier to use and comprehend than that of categorial grammar, and yet can be directly converted to that system! That is, link-grammar is completely compatible with categorial grammar, and is easier-to-use.

    The foundational LG papers make comments to this effect; however, see also work by Bob Coecke on category theory and grammar. Coecke's diagrammatic approach is essentially identical to the diagrams given in the foundational LG papers; it becomes abundantly clear that the category-theoretic approach is equivalent to Link Grammar. See, for example, this introductory sketch http://www.cs.ox.ac.uk/people/bob.coecke/NewScientist.pdf and observe how the diagrams are essentially identical to the LG jigsaw-puzzle piece diagrams of the foundational LG publications.

    If you have any questions, please feel free to send a note to the mailing list.

    The source code of link-parser and the link-grammar library is located at GitHub.
    For bug reports, please open an issue there.

    Although all messages should go to the mailing list, the current maintainers can be contacted at:

    A complete list of authors and copyright holders can be found in the AUTHORS file. The original authors of the Link Grammar parser are:

    Easy to fix: provide a more uniform API to the constituent tree, i.e., provide the word index. Also, provide a better word API, showing word extent, subscript, etc.

    There are subtle technical issues for handling capitalized first words. This needs to be fixed. In addition, for now these words are shown uncapitalized in the result linkages. This can be fixed.

    Maybe capitalization could be handled in the same way that a/an could be handled! After all, it's essentially a nearest-neighbor phenomenon!

    The proximal issue is to add a cost, so that Bill gets a lower cost than bill.n when parsing "Bill went on a walk". The best solution would be to add a 'capitalization-mark token' during tokenization; this token precedes capitalized words. The dictionary then explicitly links to this token, with rules similar to the a/an phonetic distinction. The point here is that this moves capitalization out of ad-hoc C code and into the dictionary, where it can be handled like any other language feature. The tokenizer includes experimental code for that.
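    A rough sketch of the idea, as it might look outside the tokenizer (the marker name below is hypothetical; the experimental code may use a different token):

        # Insert a capitalization-mark token before capitalized words and
        # lower-case the word itself, so that the dictionary -- not C code --
        # decides how the capital is to be interpreted.
        CAP_MARK = "1stCAP"   # hypothetical marker name

        def mark_capitals(words):
            out = []
            for w in words:
                if w[:1].isupper():
                    out.extend([CAP_MARK, w.lower()])
                else:
                    out.append(w)
            return out

        print(mark_capitals(["Bill", "went", "on", "a", "walk"]))
        # ['1stCAP', 'bill', 'went', 'on', 'a', 'walk']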

    Corpus-statistics-based parse ranking:

    The old code for parse ranking via corpus statistics needs to be revived. The issue can be illustrated with these example sentences:

    In the first sentence, the comma acts as a conjunction of two directives (imperatives). In the second sentence, it is much too easy to mistake "please" for a verb, the comma for a conjunction, and come to the conclusion that one should please some unstated object, and then turn off the lights. (Perhaps one is pleasing by turning off the lights?)

    When a sentence fails to parse, look for:

    • confused words: its/it's, there/their/they're, to/too, your/you're. These could be added at high cost to the dicts.
    • missing apostrophes in possessives: "the peoples desires"
    • determiner agreement errors: "a books"
    • aux verb agreement errors: "to be hooks up"

    Poor agreement might be handled by giving a cost to mismatched lower-case connector letters.

    A common phenomenon in English is that some words that one might expect to "properly" be present can disappear under various conditions. Below is a sampling of these. Some possible solutions are given below.

    Expressions such as "Looks good" have an implicit "it" (also called a zero-it or phantom-it) in them; that is, the sentence should really parse as "(it) looks good". The dictionary could be simplified by admitting such phantom words explicitly, rather than modifying the grammar rules to allow such constructions. Other examples, with the phantom word in parentheses, include:

    • I ate all (of) the cookies.
    • I've known him only (for) a week.
    • I taught him (how) to swim.
    • I told him (that) it was gone.
    • It stopped me (from) flying off the cliff.
    • (It) looks good.
    • (You) go home!
    • (You) do tell (me).
    • (That is) enough!
    • (I) heard that he's giving a test.
    • (Are) you all right?
    • He opened the door and (he) went in.
    • Emma was the younger (daughter) of two daughters.

    This can extend to elided/unvoiced syllables:

    Normally, the subjects of imperatives must always be offset by a comma: "John, give me the hammer", but here, in muttering an oath, the comma is swallowed (unvoiced).

    Some complex phantom constructions:

    • They play billiards but (they do) not (play) snooker.
    • I know Ringo, but (I do) not (know) his brother.
    • She likes Indian food, but (she does) not (like) Chinese (food).
    • If this is true, then (you should) do it.
    • Perhaps he will (do it), if he sees enough of her.

    Many (unstressed) syllables can be elided in modern English; this occurs most commonly in the initial unstressed syllable:

    • (a)'ccount (a)'fraid (a)'gainst (a)'greed (a)'midst (a)'mongst
    • (a)'noint (a)'nother (a)'rrest (at)'tend
    • (be)'fore (be)'gin (be)'havior (be)'long (be)'twixt
    • (con)'cern (e)'scape (e)'stablish, and so on.

    Punctuation, zero-copula, zero-that:

    Poorly punctuated sentences cause problems; for example:

    The one without the comma currently fails to parse. How can we deal with this in a simple, fast, elegant way? Similar questions for zero-copula and zero-that sentences.

    Context-dependent zero phrases.

    Consider an argument between a professor and a dean, where the dean wants the professor to write a brilliant review. At the end of the argument, the dean exclaims: "I want the review brilliant!" This is a predicative adjective; clearly it means "I want the review [that you write to be] brilliant." However, taken out of context, such a construction is ungrammatical, as the predicativeness is not at all apparent, and it reads just as incorrectly as would "*Hey Joe, can you hand me that review brilliant?"

    The subject is a phantom; the subject is "you".

    Handling zero/phantom words by explicitly inserting them:

    One possible solution is to perform a one-point compactification. The dictionary contains the phantom words and their connectors. Ordinary disjuncts can link to these, but should do so using a special initial lower-case letter (say, 'z', in addition to 'h' and 'd' as is currently implemented). The parser, as it works, examines the initial letter of each connector: if it is 'z', then the usual pruning rules no longer apply, and one or more phantom words are selected out of the bucket of phantom words. (This bucket is kept out-of-line; it is not yet placed into sentence word-sequence order, which is why the usual pruning rules get modified.) Otherwise, parsing continues as normal. At the end of parsing, if there are any phantom words that are linked, then all of the connectors on the disjunct must be satisfied (of course!), else the linkage is invalid. After parsing, the phantom words can be inserted into the sentence, with the location deduced from link lengths.
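    A toy sketch of just the final insertion step (my own simplification: the position is given directly as an anchor word and a side, rather than being deduced from link lengths):

        # Splice linked phantom words back into the word sequence.
        # phantom_links: list of (phantom_word, anchor_index, side) tuples,
        # where side is -1 to insert before the anchor word, +1 to insert after.
        def insert_phantoms(words, phantom_links):
            out = list(words)
            # process right-to-left so earlier insertions don't shift later anchors
            for phantom, anchor, side in sorted(phantom_links, key=lambda t: -t[1]):
                pos = anchor if side < 0 else anchor + 1
                out.insert(pos, "(" + phantom + ")")
            return out

        print(insert_phantoms(["looks", "good"], [("it", 0, -1)]))
        # ['(it)', 'looks', 'good']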

    Handling zero/phantom words as re-write rules.

    A more principled approach to fixing the phantom-word issue is to borrow the idea of re-writing from the theory of operator grammar. That is, certain phrases and constructions can be (should be) re-written into their "proper form", prior to parsing. The re-writing step would insert the missing words, then the parsing proceeds. One appeal of such an approach is that re-writing can also handle other "annoying" phenomena, such as typos (missing apostrophes, e.g. "lets" vs. "let's", "its" vs. "it's") as well as multi-word rewrites (e.g. "let's" vs. "let us", or "it's" vs. "it is").

    Exactly how to implement this is unclear. However, it seems to open the door to more abstract, semantic analysis. Thus, for example, in Meaning-Text Theory (MTT), one must move from SSynt to DSynt structures. Such changes require a graph re-write from the surface syntax parse (e.g. provided by link-grammar) to the deep-syntactic structure. By contrast, handling phantom words by graph re-writing prior to parsing inverts the order of processing. This suggests that a more holistic approach to graph rewriting is needed: it must somehow be performed "during" parsing, so that parsing can both guide the insertion of the phantom words and, simultaneously, guide the deep syntactic rewrites.

    Another interesting possibility arises with regard to tokenization. The current tokenizer is clever, in that it splits not only on whitespace, but can also strip off prefixes, suffixes, and perform certain limited kinds of morphological splitting. That is, it currently has the ability to re-write single words into sequences of words. It currently does so in a conservative manner; the letters that compose a word are preserved, with a few exceptions, such as making spelling-correction suggestions. The above considerations suggest that the boundary between tokenization and parsing needs to become both more fluid and more tightly coupled.

    Compare "she will be happier than before" to "she will be more happy than before." Current parser makes "happy" the head word, and "more" a modifier w/EA link. I believe the correct solution would be to make "more" the head (link it as a comparative), and make "happy" the dependent. This would harmonize rules for comparatives. and would eliminate/simplify rules for less,more.

    However, this idea needs to be double-checked against, e.g., Hudson's Word Grammar. I'm confused on this issue.

    Currently, some links can act at "unlimited" length, while others can only be finite-length; e.g., determiners should be near the noun that they apply to. A better solution might be to employ a 'stretchiness' cost on some connectors: the longer they are, the higher the cost. (This eliminates the "unlimited_connector_set" in the dictionary.)
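    A sketch of the idea (the weight below is a hypothetical parameter, not an existing dictionary setting):

        # Length-dependent connector cost: longer links accrue more cost.
        # A weight of zero reproduces today's behaviour for
        # unlimited-length connectors.
        def link_cost(base_cost, link_length, stretchiness=0.1):
            return base_cost + stretchiness * max(0, link_length - 1)

        assert link_cost(1.0, 10) > link_cost(1.0, 2)   # longer link, higher cost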

    Opposing (repulsing) parses:

    Sometimes, the existence of one parse should suggest that another parse must surely be wrong: if one parse is possible, then the other parses must surely be unlikely. For example: the conjunction and.j-g allows "The Great Southern and Western Railroad" to be parsed as the single name of an entity. However, it also provides a pattern match for "John and Mike" as a single entity, which is almost certainly wrong. But "John and Mike" has an alternative parse, as a conventional-and -- a list of two people -- and so the existence of this alternative (and correct) parse suggests that the entity-and is really very much the wrong parse. That is, the mere possibility of certain parses should strongly disfavor other possible parses. (Exception: Ben & Jerry's ice cream; however, in this case, we could recognize Ben & Jerry as the name of a proper brand, but this is outside of the "normal" dictionary (?) (though maybe it should be in the dictionary!))

    More examples: "high water" can have the connector A joining high.a and AN joining high.n these two should either be collapsed into one, or one should be eliminated.

    Use WordNet to reduce the number of parses for sentences containing compound verb phrases, such as "give up", "give off", etc.
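    For instance, a check against WordNet could flag particle attachments that correspond to known phrasal verbs. A sketch using NLTK's WordNet interface (an assumption; NLTK and its WordNet corpus must be installed, and any WordNet API would do):

        # If "give up" exists as a multi-word verb lemma in WordNet, parses
        # that attach the particle to the verb can be preferred.
        from nltk.corpus import wordnet as wn

        def is_phrasal_verb(verb, particle):
            return len(wn.synsets(f"{verb}_{particle}", pos=wn.VERB)) > 0

        print(is_phrasal_verb("give", "up"))     # True: 'give_up' is in WordNet
        print(is_phrasal_verb("give", "table"))  # False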

    Sliding-window (Incremental) parsing:

    To avoid a combinatorial explosion of parses, it would be nice to have an incremental parsing, phrase by phrase, using a sliding window algorithm to obtain the parse. Thus, for example, the parse of the last half of a long, run-on sentence should not be sensitive to the parse of the beginning of the sentence.

    Doing so would help with combinatorial explosion. So, for example, if the first half of a sentence has 4 plausible parses, and the last half has 4 more, then currently, the parser reports 16 parses total. It would be much more useful if it could instead report the factored results: i.e. the four plausible parses for the first half, and the four plausible parses for the last half. This would ease the burden on downstream users of link-grammar.
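    The arithmetic of the factoring is easy to illustrate (a toy example only):

        # Four parses for each half combine into sixteen whole-sentence
        # parses, while a factored report needs only four plus four.
        from itertools import product

        first_half  = ["A1", "A2", "A3", "A4"]   # plausible parses, first half
        second_half = ["B1", "B2", "B3", "B4"]   # plausible parses, second half

        print(len(list(product(first_half, second_half))))   # 16 combined parses
        print(len(first_half) + len(second_half))            # 8 factored entries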

    This approach has some psychological support. Humans take long sentences and split them into smaller chunks that "hang together" as phrase-structures, viz. compound sentences. The most likely parse is the one where each of the quasi-sub-sentences is parsed correctly.

    This could be implemented by saving dangling right-going connectors into a parse context, and then, when another sentence fragment arrives, use that context in place of the left-wall.

    This somewhat resembles the application of construction grammar ideas to the link-grammar dictionary. It also somewhat resembles Viterbi parsing to some fixed depth. Viz. do a full backward-forward parse for a phrase, and then, once this is done, take a Viterbi-step. That is, once the phrase is done, keep only the dangling connectors to the phrase, place a wall, and then step to the next part of the sentence.

    Caution: watch out for garden-path sentences:

    The current parser parses these perfectly; a Viterbi parser could trip on these.

    Other benefits of a Viterbi decoder:

    • Less sensitive to sentence boundaries: this would allow longer, run-on sentences to be parsed far more quickly.
    • Could do better with slang, hip-speak.
    • Support for real-time dialog (parsing of half-uttered sentences).
    • Parsing of multiple streams, e.g. from play/movie scripts.
    • Would enable (or simplify) co-reference resolution across sentences (resolve referents of pronouns, etc.)
    • Would allow richer state to be passed up to higher layers: specifically, alternate parses for fractions of a sentence, alternate reference resolutions.
    • Would allow plug-in architecture, so that plugins, employing some alternate, higher-level logic, could disambiguate (e.g. by making use of semantic content).
    • Eliminate many of the hard-coded array sizes in the code.

    One may argue that Viterbi is a more natural, biological way of working with sequences. Some experimental, psychological support for this can be found at http://www.sciencedaily.com/releases/2012/09/120925143555.htm per Morten Christiansen, Cornell professor of psychology.

    Registers, sociolects, dialects (cost vectors):

    Consider the sentence "Thieves rob bank" -- a typical newspaper headline. LG currently fails to parse this, because the determiner is missing ("bank" is a count noun, not a mass noun, and thus requires a determiner. By contrast, "thieves rob water" parses just fine.) A fix for this would be to replace mandatory determiner links by (D- or <[[()]] & headline-flag>) which allows the D link to be omitted if the headline-flag bit is set. Here, "headline-flag" could be a new link-type, but one that is not subject to planarity constraints.

    Note that this is easier said than done: if one simply adds a high-cost null link, and no headline-flag, then all sorts of ungrammatical sentences parse with strange parses, while some grammatical sentences, which should parse but currently don't, become parsable, but with crazy results.

    More examples, from And Rosta:

    A natural approach would be to replace fixed costs by formulas. This would allow the dialect/sociolect to be dynamically changeable. That is, rather than having a binary headline-flag, there would be a formula for the cost, which could be changed outside of the parsing loop. Such formulas could be used to enable/disable parsing specific to different dialects/sociolects, simply by altering the network of link costs.

    A simpler alternative would be to have labeled costs (a cost vector), so that different dialects assign different costs to various links. A dialect would be specified during the parse, thus causing the costs for that dialect to be employed during parse ranking.

    This has been implemented; what's missing is a practical tutorial on how this might be used.
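    In the meantime, the gist of labeled costs can be sketched as follows (the labels and numbers are hypothetical, and this is not the actual dictionary syntax or API):

        # A cost vector: each dialect assigns its own cost to a labeled rule,
        # and the dialect chosen for the parse selects which costs apply.
        DIALECT_COSTS = {
            "default":  {"headline-flag": 4.0},   # dropping the determiner is expensive...
            "headline": {"headline-flag": 0.0},   # ...but free in headlinese
        }

        def cost_of(label, dialect="default", base=0.0):
            return base + DIALECT_COSTS[dialect].get(label, 0.0)

        print(cost_of("headline-flag"))              # 4.0
        print(cost_of("headline-flag", "headline"))  # 0.0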

    Hand-refining verb patterns:

    A good reference for refining verb usage patterns is: "COBUILD GRAMMAR PATTERNS 1: VERBS from THE COBUILD SERIES", from THE BANK OF ENGLISH, HARPER COLLINS. Online at https://arts-ccr-002.bham.ac.uk/ccr/patgram/ and http://www.corpus.bham.ac.uk/publications/index.shtml

    Currently, tokenize.c tokenizes double-quotes and some UTF8 quotes (see the RPUNC/LPUNC class in en/4.0.affix - the QUOTES class is not used for that, but for capitalization support), with some very basic support in the English dictionary (see "% Quotation marks." there). However, it does not do this for the various "curly" UTF8 quotes, such as ‘these’ and “these”. This results in some ugly parsing for sentences containing such quotes. (Note that these are in 4.0.affix.)

    A mechanism is needed to disentangle the quoting from the quoted text, so that each can be parsed appropriately. It's somewhat unclear how to handle this within link-grammar. This is somewhat related to the problem of morphology (parsing words as if they were "mini-sentences"), idioms (phrases that are treated as if they were single words), and set-phrase structures (if ... then ..., not only ... but also ...) which have a long-range structure similar to quoted text (he said ...).

    Semantification of the dictionary:

    "to be fishing": Link grammar offers four parses of "I was fishing for evidence", two of which are given low scores, and two are given high scores. Of the two with high scores, one parse is clearly bad. Its links "to be fishing.noun" as opposed to the correct "to be fishing.gerund". That is, I can be happy, healthy and wise, but I certainly cannot be fishing.noun. This is perhaps not just a bug in the structure of the dictionary, but is perhaps deeper: link-grammar has little or no concept of lexical units (i.e. collocations, idioms, institutional phrases), which thus allows parses with bad word-senses to sneak in.

    The goal is to introduce more knowledge of lexical units into LG.

    Different word senses can have different grammar rules (and thus, the links employed reveal the sense of the word): for example: "I tend to agree" vs. "I tend to the sheep" -- these employ two different meanings for the verb "tend", and the grammatical constructions allowed for one meaning are not the same as those allowed for the other. Yet, the link rules for "tend.v" have to accommodate both senses, thus making the rules rather complex. Worse, it potentially allows for non-sense constructions. If, instead, we allowed the dictionary to contain different rules for "tend.meaning1" and "tend.meaning2", the rules would simplify (at the cost of inflating the size of the dictionary).

    Another example: "I fear so" -- the word "so" is only allowed with some, but not all, lexical senses of "fear". So e.g. "I fear so" is in the same semantic class as "I think so" or "I hope so", although other meanings of these verbs are otherwise quite different.

    [Sin2004] "New evidence, new priorities, new attitudes" in J. Sinclair, (ed) (2004) How to use corpora in language teaching, Amsterdam: John Benjamins

    See also: Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English
    Susan Hunston and Gill Francis (University of Birmingham)
    Amsterdam: John Benjamins (Studies in corpus linguistics, edited by Elena Tognini-Bonelli, volume 4), 2000
    Book review.

    “The Molecular Level of Lexical Semantics”, EA Nida, (1997) International Journal of Lexicography, 10(4): 265–274. Online

    "holes" in collocations (aka "set phrases" of "phrasemes"):

    The link-grammar provides several mechanisms to support circumpositions or even more complicated multi-word structures. One mechanism is by ordinary links; see the V, XJ and RJ links. The other mechanism is by means of post-processing rules. (For example, the "filler-it" SF rules use post-processing.) However, rules for many common forms have not yet been written. The general problem is of supporting structures that have "holes" in the middle, that require "lacing" to tie them together.

    For example, the adposition:

    Note that multiple words can fit in the slot [xxx]. Note the tangling of another prepositional phrase: "... from [xxx] on to [yyy]"

    More complicated collocations with holes include

    'Then' is optional ('then' is a 'null word'), for example:

    The above are not currently supported. An example that is supported is the "non-referential it", e.g.

    The above is supported by means of special disjuncts for 'it' and 'that', which must occur in the same post-processing domain.

    ". from X and from Y" "By X, and by Y, . " Here, X and Y might be rather long phrases, containing other prepositions. In this case, the usual link-grammar linkage rules will typically conjoin "and from Y" to some preposition in X, instead of the correct link to "from X". Although adding a cost to keep the lengths of X and Y approximately equal can help, it would be even better to recognize the ". from . and from. " pattern.

    The correct solution for "Either ... or ..." appears to be this:

    The problem with this is that "neither" must coordinate with "nor". That is, one cannot say "either ... nor ...", "neither ... or ...", "neither ... and ...", "but ... nor ...". The way I originally solved the coordination problem was to invent a new link called Dn, and a link SJn, and to make sure that Dn could only connect to SJn, and nothing else. Thus, the lower-case "n" was used to propagate the coordination across two links. This demonstrates how powerful the link-grammar theory is: with proper subscripts, constraints can be propagated along links over large distances. However, this also makes the dictionary more complex, and the rules harder to write: coordination requires a lot of different links to be hooked together. And so I think that creating a single, new link, called ..., will make the coordination easy and direct. That is why I like that idea.

    The ... link should be the XJ link, which see.

    More idiomatic than the above examples: "... the chip on X's shoulder", "to do X a favour", "to give X a look"

    The above are all examples of "set phrases" or "phrasemes", and are most commonly discussed in the context of MTT or Meaning-Text Theory of Igor Mel'cuk et al (search for "MTT Lexical Function" for more info). Mel'cuk treats set phrases as lexemes, and, for parsing, this is not directly relevant. However, insofar as phrasemes have a high mutual information content, they can dominate the syntactic structure of a sentence.

    The current parse of "he wanted to look at and listen to everything." is inadequate: the link to "everything" needs to connect to "and", so that "listen to" and "look at" are treated as atomic verb phrases.

    MTT suggests that perhaps the correct way to understand the contents of the post-processing rules is as an implementation of 'lexical functions' projected onto syntax. That is, the post-processing rules allow only certain syntactical constructions, and these are the kinds of constructions one typically sees in certain kinds of lexical functions.

    Alternately, link-grammar suffers from a combinatoric explosion of possible parses of a given sentence. It would seem that lexical functions could be used to rule out many of these parses. On the other hand, the results are likely to be similar to that of statistical parse ranking (which presumably captures such quasi-idiomatic collocations at least weakly).

    Ref. I. Mel'cuk: "Collocations and Lexical Functions", in ''Phraseology: theory, analysis, and applications'' Ed. Anthony Paul Cowie (1998) Oxford University Press pp. 23-54.

    More generally, all of link-grammar could benefit from a MTT-izing of infrastructure.

    Compare the above commentary on lexical functions to Hebrew morphological analysis. To quote Wikipedia:

    This distinction between the word as a unit of speech and the root as a unit of meaning is even more important in the case of languages where roots have many different forms when used in actual words, as is the case in Semitic languages. In these, roots are formed by consonants alone, and different words (belonging to different parts of speech) are derived from the same root by inserting vowels. For example, in Hebrew, the root gdl represents the idea of largeness, and from it we have gadol and gdola (masculine and feminine forms of the adjective "big"), gadal "he grew", higdil "he magnified" and magdelet "magnifier", along with many other words such as godel "size" and migdal "tower".

    Instead of hard-coding LL, declare which links are morpho links in the dict.

    • Should provide a query that returns compile-time consts, e.g. the max number of characters in a word, or max words in a sentence.
    • Should remove compile-time constants, e.g. max words, max length etc.

    Version 6.0 will change Sentence to Sentence*, Linkage to Linkage* in the API. But perhaps this is a bad idea.

