Public DNA databases contain data from many different taxa, but the taxonomic annotation on sequences is not always complete. This impedes the use of mined data for species-level applications. There is much ongoing work on species identification and delineation based on the molecular data itself. However, applying species clustering to whole databases requires consolidating results from numerous undefined gene regions, and it introduces significant obstacles in data organization and computational load.
While Dr. Douglas Chesters was a postdoc in Prof. Chao-Dong Zhu’s lab, and later an assistant professor in the Institute of Zoology, Chinese Academy of Sciences, he developed an approach for species delineation of a sequence database. All DNA sequences for the insects were obtained and processed. After duplicated data were filtered out, the database was delineated into species units in three steps: i) the genetic loci L are partitioned; ii) the species S are delineated within each locus; and iii) species units are matched across loci to form the matrix LxS, a set of multi-locus species units.
The database was partitioned into a set of homologous gene fragments by Markov clustering. Species units were then delineated, and species names assigned, for the set of genes necessary to capture most of the species diversity. Computing pairwise similarities for species clustering was computationally demanding at the COI locus, but it was made feasible by developing software that performs pairwise alignments within the taxonomic framework and accounts for the different ranks at which sequences are labeled with taxonomic information.
Over 24 different homologs, the unidentified sequences numbered approximately 194,000, containing 41,525 species labels (98.7 percent of all found in the insect database). These were grouped into 59,173 single-locus MOTUs by hierarchical clustering, using parameters optimized independently for each locus. Species units from different loci were then matched using a multi-partite matching algorithm, forming multi-locus species units with minimal incongruence between loci. After matching, the insect database, as represented by these 24 loci, was found to be composed of 78,091 species units in total. Of these, 38,574 units contained only species-labeled data and 34,891 contained only unlabeled data, leaving 4,626 units composed of both labeled and unlabeled sequences.
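To make steps ii) and iii) concrete, here is a minimal Python sketch on toy data. It is not the published software. It assumes that sequences within a locus are already aligned, that single-linkage clustering at a fixed distance threshold stands in for the hierarchical clustering, and that units from different loci are joined whenever they share a specimen identifier, which is a much simpler rule than the multi-partite matching used in the study. All names (`cluster_locus`, `match_across_loci`, the specimen IDs, the sequences, and the thresholds) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of mismatched sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


class UnionFind:
    """Disjoint-set structure used to merge items into groups."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def cluster_locus(seqs: dict, threshold: float) -> dict:
    """Step ii (simplified): single-linkage clustering within one locus.

    Returns a mapping from specimen ID to a MOTU index.
    """
    uf = UnionFind()
    for sid in seqs:
        uf.find(sid)
    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            uf.union(a, b)
    roots = {}
    return {sid: roots.setdefault(uf.find(sid), len(roots)) for sid in seqs}


def match_across_loci(loci: dict, thresholds: dict) -> list:
    """Step iii (simplified): join MOTUs from different loci that share a specimen."""
    uf = UnionFind()
    specimens = set()
    for locus, seqs in loci.items():
        first_member = {}
        for sid, motu in cluster_locus(seqs, thresholds[locus]).items():
            specimens.add(sid)
            # Link every member of a MOTU to the first member seen for it.
            uf.union(sid, first_member.setdefault(motu, sid))
    units = defaultdict(set)
    for sid in specimens:
        units[uf.find(sid)].add(sid)
    return list(units.values())


# Toy input: two loci, per-locus thresholds chosen arbitrarily for illustration.
loci = {
    "COI": {"s1": "ACGTACGTAC", "s2": "ACGTACGTAT", "s3": "TTGTACCTAA"},
    "28S": {"s2": "GGCCAAGGCC", "s4": "GGCCAAGGCC", "s5": "AACCTTGGAA"},
}
thresholds = {"COI": 0.1, "28S": 0.0}

units = match_across_loci(loci, thresholds)
print(sorted(sorted(u) for u in units))
# [['s1', 's2', 's4'], ['s3'], ['s5']]
```

In this toy case, specimen s2 is sequenced at both loci, so its COI unit (with s1) and its 28S unit (with s4) merge into one multi-locus unit. A real matching step must also handle conflicts, where units from one locus disagree with units from another, which this sketch resolves only by merging everything that is connected.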
In addition to giving estimates of the species diversity of databases, the protocol developed here will facilitate species-level applications of modern-day sequence datasets. In particular, the LxS matrix represents a post-taxonomic framework that can be used for species-level organization of meta-genomic data, and incorporating these methods into phylogenetic pipelines will yield matrices more representative of species diversity.
This project was supported mainly by the Knowledge Innovation Program of the Chinese Academy of Sciences; the National Science Foundation, China; and partially supported by the Public Welfare Project from the Ministry of Agriculture, China and the Program of Ministry of Science and Technology of the People’s Republic of China.
The full citation for the article is:
A Protocol for Species Delineation of Public DNA Databases, Applied to the Insecta. Douglas Chesters; Chao-Dong Zhu. Systematic Biology 2014;doi: 10.1093/sysbio/syu038
Here are the free-access links to the online article: Abstract, PDF.