Pseudo K-tuple nucleotide composition
Pseudo K-tuple nucleotide composition, or PseKNC, is a method for converting a nucleotide sequence (DNA or RNA) into a numerical vector that can be used by pattern recognition techniques. The K-tuple is typically a dinucleotide (K = 2) or a trinucleotide (K = 3), in which case the technique is also called PseDNC or PseTNC, respectively.[1]
The method was derived from an analogous method in proteomics known as PseAAC (Pseudo Amino Acid Composition) that is applied to protein sequences.[2]
Background
PseAAC
PseAAC (Pseudo Amino Acid Composition) is the proteomics method from which PseKNC was derived.[2] Before PseAAC, predictions of protein properties relied either on sequential models, which use the full amino acid sequence, or on discrete models, the simplest of which is the amino acid composition: a vector of twenty elements, each giving the frequency of one amino acid in the protein. The discrete model, however, discards sequence-order information. PseAAC extends the 20-element vector with λ additional components, each of which captures some sequence-order information, and the extended vector becomes the basis for making predictions.[3]
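The loss of sequence-order information in the discrete model can be illustrated with a minimal Python sketch. The function name `amino_acid_composition` and the two toy peptides are assumptions made for illustration only; they do not come from the PseAAC literature.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(protein: str) -> list:
    """Return the 20-element vector of normalized amino acid frequencies."""
    counts = Counter(protein.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


# Two toy peptides (hypothetical) with the same residues in a different order
a = amino_acid_composition("MKTAYIAK")
b = amino_acid_composition("KAIYATKM")
print(a == b)  # True: the composition alone cannot tell them apart
```

Because both peptides map to the same 20-element vector, any predictor built on that vector alone treats them identically, which is the gap the λ extra components are meant to close.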
Analogous problem in genomics
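Analogously, a discrete model of a nucleotide sequence based on its dinucleotide composition is a vector of 16 elements, each representing the frequency of one dinucleotide in the sequence:[1]
$$\mathbf{D} = [\, f(\mathrm{AA}) \;\; f(\mathrm{AC}) \;\; f(\mathrm{AG}) \;\; f(\mathrm{AT}) \;\; \cdots \;\; f(\mathrm{TT}) \,]^{\mathbf{T}}$$
where D is the DNA sequence, T is the transpose operator, and f(AA) is the normalized occurrence frequency of AA in the DNA sequence. A trinucleotide representation can be denoted as:[1]
$$\mathbf{D} = [\, f(\mathrm{AAA}) \;\; f(\mathrm{AAC}) \;\; f(\mathrm{AAG}) \;\; f(\mathrm{AAT}) \;\; \cdots \;\; f(\mathrm{TTT}) \,]^{\mathbf{T}}$$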
As can be seen, these discrete models fail to consider any global or long-range sequence-order information. To address this for both DNA and RNA sequences, the pseudo K-tuple nucleotide composition or PseKNC was proposed.[4][5][6]
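The K-tuple composition described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used by any published tool: the function name `ktuple_composition` is hypothetical, and counting over overlapping windows while skipping windows that contain non-ACGT symbols are assumptions of this sketch.

```python
from itertools import product


def ktuple_composition(seq: str, k: int = 2, alphabet: str = "ACGT") -> dict:
    """Normalized occurrence frequency of every k-tuple (overlapping windows)."""
    seq = seq.upper()
    # All 4**k possible k-tuples, in lexicographic order (AA, AC, ..., TT for k=2)
    tuples = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(tuples, 0)
    windows = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: skip windows containing N or other symbols
            counts[word] += 1
            windows += 1
    if windows == 0:
        raise ValueError("sequence is shorter than k or has no valid k-tuples")
    return {t: counts[t] / windows for t in tuples}


d = ktuple_composition("ACGTAC", k=2)
print(len(d))   # 16
print(d["AC"])  # 0.4
```

The toy sequence ACGTAC contains five overlapping dinucleotides (AC, CG, GT, TA, AC), so f(AC) = 2/5. Calling the function with `k=3` yields the 64-element trinucleotide vector.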
PseKNC
PseKNC extends the discrete model by adding λ components that represent sequence-order information and physicochemical properties of the nucleotide sequence. The original KNC model has $4^K$ components; for dinucleotides (K = 2), this gives $4^2 = 16$ components. The PseKNC extension results in $4^K + \lambda$ components.[1]
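A rough Python sketch of how such a vector might be assembled is given below. The article does not state the formula for the λ components, so the sketch assumes one commonly described form: the j-th component is the mean squared difference in property values between K-tuples that lie j positions apart, and a weight `w` balances these components against the K-tuple frequencies. The names `pseknc`, `lam`, and `w` are hypothetical, and `gc_fraction` and `purine_fraction` are toy stand-ins for physicochemical properties, not published property values.

```python
from itertools import product


def gc_fraction(word):
    """Toy property (assumption): fraction of G/C bases in the k-tuple."""
    return sum(c in "GC" for c in word) / len(word)


def purine_fraction(word):
    """Toy property (assumption): fraction of purines (A/G) in the k-tuple."""
    return sum(c in "AG" for c in word) / len(word)


def pseknc(seq, k=2, lam=3, w=0.5, properties=(gc_fraction, purine_fraction)):
    """Return a vector of 4**k + lam components (a sketch, not a reference implementation)."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("only A, C, G and T are supported in this sketch")
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]  # overlapping k-tuples
    n = len(words)
    if lam >= n:
        raise ValueError("lam must be smaller than the number of k-tuples")

    # First 4**k components: normalized k-tuple frequencies
    keys = ["".join(p) for p in product("ACGT", repeat=k)]
    freqs = [words.count(key) / n for key in keys]

    # Next lam components: correlation between k-tuples j positions apart
    thetas = []
    for j in range(1, lam + 1):
        total = 0.0
        for i in range(n - j):
            diffs = [(p(words[i]) - p(words[i + j])) ** 2 for p in properties]
            total += sum(diffs) / len(properties)
        thetas.append(total / (n - j))

    # Joint normalization so that the whole vector sums to 1
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


v = pseknc("ACGTACGTTAGC", k=2, lam=3)
print(len(v))  # 19, i.e. 4**2 + 3
```

In practice the property functions would be replaced by tabulated physicochemical values, typically standardized before use, and λ and the weight would be tuned for the prediction task at hand.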
Applications
The PseKNC method has been used in a wide range of applications.[7] For example, it has become an integral component of many algorithms designed to predict the locations of recombination hotspots and coldspots from sequence information.[8][9]
Web servers
For the convenience of the scientific community, a freely available web server called PseKNC[4] and an open-source package called PseKNC-General[5] were developed in 2013 and 2014, respectively. They can convert large-scale sequence datasets into pseudo nucleotide compositions with numerous choices of physicochemical property combinations. PseKNC-General can generate several modes of pseudo nucleotide composition, including conventional k-tuple nucleotide composition, Moreau–Broto autocorrelation coefficients, Moran autocorrelation coefficients, Geary autocorrelation coefficients, Type I PseKNC and Type II PseKNC.
Another web server, Pse-in-One, lets users select from all existing PseAAC and PseKNC methods for protein, RNA, and DNA sequences, along with any of the available physicochemical property combinations for those methods.[10]
References
- [1] Template:Cite journal
- [2] Cite error: no text was provided for the ref named Chou01
- [3] Template:Cite journal
- [4] Cite error: no text was provided for the ref named Chen01
- [5] Cite error: no text was provided for the ref named Chen02
- [6] Cite error: no text was provided for the ref named Chen03
- [7] Template:Cite journal
- [8] Template:Cite journal
- [9] Template:Cite journal
- [10] Template:Cite journal