New Hot Paper Addresses Machine Learning for Biological Sequences

“Pse-in-One: a Web Server for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences” (Nucleic Acids Res. 43 [W1]: W65-71, 1 July 2015) was recently named a New Hot Paper for Biology & Biochemistry in Essential Science Indicators from Clarivate Analytics. This paper currently has 195 citations in the Web of Science.

Below, the paper’s lead author, Dr. Bin Liu of the Harbin Institute of Technology in China, talks about the paper’s history and influence.

Why do you think your paper is highly cited?

This is what we expected. With the avalanche of biological sequences in the post-genomic era, one of the most challenging problems in computational biology is how to express a biological sequence as a discrete model or vector while still keeping its sequence-order information or key pattern features. Why? Because all existing machine-learning algorithms can handle only vectors, not sequence samples. Unfortunately, a vector defined in a discrete model may completely lose the sequence-pattern information.
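The loss of sequence-order information can be illustrated with a minimal sketch. It assumes the simplest possible discrete model, a plain nucleotide composition vector; the function name and the two example sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter

DNA_ALPHABET = "ACGT"


def composition_vector(seq, alphabet=DNA_ALPHABET):
    """Return the normalized occurrence frequency of each letter in seq."""
    counts = Counter(seq)
    return [counts[ch] / len(seq) for ch in alphabet]


# Two made-up sequences with the same letters in a different order.
seq_a = "AACCGGTT"
seq_b = "TGCATGCA"

vec_a = composition_vector(seq_a)
vec_b = composition_vector(seq_b)

print(vec_a)           # composition of seq_a
print(vec_a == vec_b)  # True: the order of the letters is not captured
```

Because both sequences contain each nucleotide twice, they map to the same vector, so a learning algorithm given only this vector cannot tell them apart.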

To avoid completely losing the sequence-pattern information for proteins, the pseudo amino acid composition (PseAAC) [1] was proposed. Ever since the concept of Chou’s PseAAC was proposed, it has been widely used in nearly all areas of computational proteomics (see the long list of references cited in a recent review paper [2]). Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, the concept of PseKNC (Pseudo K-tuple Nucleotide Composition) [3] was developed for generating various feature vectors for DNA/RNA sequences, and it has proved very useful as well [4]. Stimulated by PseAAC and PseKNC, we developed Pse-in-One [5] and Pse-in-One 2.0 [6] so that one web server can generate any desired feature vectors for both protein/peptide and DNA/RNA sequences according to the needs of users’ studies. As Pse-in-One has provided users with a powerful tool for proteome and genome analyses, it is quite natural that the Pse-in-One paper has been highly cited.
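The general idea of a pseudo component vector can be sketched as follows: the composition terms are kept, and a few extra terms are appended that correlate a physicochemical property between positions a fixed distance apart. This is a simplified illustration for DNA only, not the implementation used by Pse-in-One or PseKNC; the property values, the function name, and the parameters `lam` (number of correlation terms) and `weight` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical numeric property per nucleotide (illustrative values only,
# not taken from any built-in property table of Pse-in-One).
TOY_PROPERTY = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}


def pseudo_components(seq, lam=2, weight=0.5, alphabet="ACGT", prop=TOY_PROPERTY):
    """Composition terms followed by lam sequence-order correlation terms."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")

    counts = Counter(seq)
    freqs = [counts[ch] / len(seq) for ch in alphabet]

    # One correlation factor per lag: mean squared property difference
    # between letters that are `lag` positions apart.
    thetas = []
    for lag in range(1, lam + 1):
        pairs = len(seq) - lag
        total = sum((prop[seq[i]] - prop[seq[i + lag]]) ** 2 for i in range(pairs))
        thetas.append(total / pairs)

    # Normalize so that all components sum to 1.
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


# Same made-up sequences as before: identical composition, different order.
pse_a = pseudo_components("AACCGGTT")
pse_b = pseudo_components("TGCATGCA")

print(len(pse_a))      # 4 composition terms + 2 correlation terms
print(pse_a != pse_b)  # True: the extra terms depend on letter order
```

Unlike a plain composition vector, the appended correlation terms differ between the two sequences, which is how order information survives the conversion to a fixed-length vector.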

Does it describe a new discovery, methodology, or synthesis of knowledge?

Yes, it surely does. Based on the similarity between natural languages and biological sequences, in this paper we developed a powerful and flexible web server called Pse-in-One, which is able to generate nearly all the possible pseudo component feature vectors for DNA, RNA, and protein/peptide sequences. In this regard, the pseudo components can be viewed as the “words” of biological sequences, and the server has the following advantages:

(i) It is the first web server ever established that can generate all the existing pseudo components for DNA, RNA, and protein/peptide sequences.

(ii) It contains 148, 22, and 547 built-in physicochemical properties for users to select in generating feature vectors for DNA, RNA, and protein sequences, respectively. Accordingly, the total number of different feature vectors that Pse-in-One [5] can generate would be 3.57 × 10⁴⁴ for a DNA sequence, 4.19 × 10⁶ for an RNA sequence, and 4.61 × 10¹⁶⁴ for a protein sequence, which is large enough to cover nearly all the possible cases.

(iii) It also allows users to generate pseudo components from properties that they define themselves, which is beyond the reach of any other existing web server in this area.
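The interview does not say how the totals in item (ii) were derived. As a quick arithmetic check, they are consistent with counting the possible subsets of the built-in properties, that is, 2 raised to the number of properties (a count that includes the empty selection, which is negligible at this scale). The sketch below assumes that interpretation; the variable names are illustrative.

```python
# Number of built-in physicochemical properties quoted in item (ii).
property_counts = {"DNA": 148, "RNA": 22, "protein": 547}

# Assumption: each property is either selected or not, giving 2**n subsets.
subset_counts = {kind: 2 ** n for kind, n in property_counts.items()}

for kind, total in subset_counts.items():
    print(f"{kind}: {total:.2e}")
# DNA: 3.57e+44
# RNA: 4.19e+06
# protein: 4.61e+164
```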

Would you summarize the significance of your paper in layman’s terms?

Pse-in-One is a very useful tool in computational proteomics and genomics as well as proteome and genome analyses. It can significantly speed up the development of these important fields. It is a first attempt to automatically detect and generate the “words” of all biological languages: DNA, RNA, and protein sequences.

How did you become involved in this research, and how would you describe the particular challenges, setbacks, and successes that you’ve encountered along the way?

I became involved in this research because I was exploring language models of biological sequences. The key is to find the “words” of biological sequences. However, this is by no means an easy task: almost everyone knows the words of natural languages, but no one really knows the exact “words” of biological sequences.

Fortunately, via the Internet I was able to collaborate with Dr. Kuo-Chen Chou in the USA, the founder of the Gordon Life Science Institute, the first Internet research institute ever established in the world. Professor Chou is one of the world’s best computational biologists. He was named a Highly Cited Researcher by Clarivate Analytics in 2014, 2015, 2016, and 2017, has ranked several times among the top producers of Hot Papers, including in the most recent tally, and is listed in the “World’s Most Influential Scientific Minds.” Many difficult problems were solved after illuminating discussions with him, the father of PseAAC. Also, via the Internet and conferences, I was able to interact with Prof. Wei Chen (North China University of Science and Technology) and Prof. Hao Lin (University of Electronic Science and Technology of China), who, along with Prof. Kuo-Chen Chou, were the authors of PseKNC [4] and shared with me many valuable experiences that were very useful for this project.

Where do you see your research leading in the future?

Nowadays we are living in an era whose goal is to minimize tedious tasks by leaving them to robots or computers, as in the development of self-driving cars. Pse-in-One represents one step toward that goal in genome and proteome analyses. Furthermore, the pseudo components generated by Pse-in-One can be viewed as the “words” of biological languages, which will facilitate the development of language models of biological sequences to uncover the secret of the “life mumbo-jumbo.”

Do you foresee any social or political implications for your research?

The social or political implications I have foreseen are:

(i) The user-friendly powerful web servers such as Pse-in-One will be increasingly important not only for basic research but also for drug development and hospital treatments, driving medical science to an unprecedented revolution [2].

(ii) Nowadays we are living in one “community of human destiny” (人类命运共同体). To really benefit all of mankind and obtain a win-win situation for everyone, friendly and sincere collaboration via the Internet between different countries around the world, particularly between the USA (the world’s largest economy) and China (the world’s second-largest economy), is important and wonderful, and its future is very bright.

Bin Liu, Ph.D.
Professor
School of Computer Science and Technology
Harbin Institute of Technology Shenzhen Graduate School
Shenzhen, Guangdong, China

REFERENCES

[1] K.C. Chou, “Prediction of protein cellular attributes using pseudo amino acid composition.” Proteins: Structure, Function, and Genetics 43 (2001) 246-255.

[2] K.C. Chou, “An unprecedented revolution in medicinal chemistry driven by the progress of biological science.” Current Topics in Medicinal Chemistry 17 (2017) 2337-2358.

[3] W. Chen, T.Y. Lei, D.C. Jin, H. Lin, K.C. Chou, “PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition.” Analytical Biochemistry 456 (2014) 53-60.

[4] W. Chen, H. Lin, K.C. Chou, “Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences.” Molecular BioSystems 11 (2015) 2620-2634.

[5] B. Liu, F. Liu, X. Wang, J. Chen, L. Fang, K.C. Chou, “Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences.” Nucleic Acids Research 43 (2015) W65-W71.

[6] B. Liu, H. Wu, K.C. Chou, “Pse-in-One 2.0: an improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences.” Natural Science 9 (2017) 67-91.