KnE Life Sciences | The 4th International Conference on Biological Science (2015) | pages: 186-193

, , , and

1. Introduction

The Next-generation sequencing (NGS) has been rapidly developed in recent years, providing much cheaper and higher throughput than the earlier generation Sanger sequencing [1]. This technology allows rapid advance in many fields related to biological sciences, one of them is the RNA sequencing (RNA-seq) experiments [2]. Considered as an alternative to microarrays, RNA-seq is now used for quantitative transcriptomics and identification of novel transcripts. It is a powerful tool with a remarkably diverse range of applications, from detailed studies of biological processes at the cell type-specific level to studies of fundamental questions in many biological systems on an evolutionary timescale [3].

The computational challenges of RNA-seq data analysis are divided into three main categories: (i) read mapping, (ii) transcriptome reconstruction, and (iii) expression quantifications [4]. Read mapping process is divided into two types, unspliced and spliced alignment. Unspliced alignment uses reads and reference transcriptomes as input. It aligns reads to a known reference transcriptome. Some examples of unspliced aligner programs are Short-read mapping package (SHRiMP) [5], Bowtie [6] and BWA [7]. On the other hand, spliced aligner uses reads and reference genome as input. TopHat is an example of widely used spliced aligners program to process RNA-seq reads [8].

The aim of transcriptome reconstruction is to define a precise map of all transcripts and isoforms that are expressed in particular samples. Cufflinks is one of the most used genome-guided assembly programs for transcriptome reconstruction that identifies novel transcripts using a known genome [9]. Unlike Cufflinks, Velvet [10] and TransABySS [11] identify novel genes and transcript isoforms without a known reference genome. The final step of RNA-seq data analysis is expression quantification. Alexa-seq uses reads and transcript models to quantify gene expressions [12] while cufflinks use aligned reads to quantify transcript isoform levels. There are several differential expression programs to compare expression levels between two or more sets of transcriptomes, including Cuffdiff [13]. The programs use read alignments and transcript models to identify differentially expressed genes or transcript isoforms.

Most of the programs described above have been used to analyze RNA-seq raw data because they are powerful and open for public. However, at least three steps of command writings is needed to obtain transcript level information. This can be problematic for biologists who are not familiar with command writings and it also has a huge amount of parameters and raw data that need to be analyzed. Galaxy platform is available to simplify this kind of work [14]. It uses a web interface to cloud computing resources, providing command-line-driven tools, such as TopHat and Cufflinks, for users without UNIX skills through the web and the computing clouds.

This study aims to construct SMART RNA-seq Data Analyzer (SMART-RDA), a Galaxy workflow that can be easily applied to analyze RNA-seq raw data into differential expression information of genes. A well-known Tuxedo protocol was used as the backbone of this workflow. Few additional tools and modifications were added to turn this workflow into a more comprehensive and user-friendly tool.

2. Material and Method

2.1. Material

This study utilized two files from Sequence Read Archive (SRA) [15] derived from NCBI under ID SRX278051 and SRX278050 to test the workflow performance. These files are the RNA sequence of sterile and fertile oil palm Pisifera pollens. Each sequence contains 228.3 and 295.4 × 106 million bases, respectively. Programs used in this research are Galaxy, TopHat, Cufflinks, Cuffdiff, and CummeRbund, which were downloaded from their official websites. Galaxy tools were downloaded from Galaxy ToolShed (

2.2. Methods

This workflow was constructed based on a well-known Tuxedo protocol. The Quality Control stage was added before performing tuxedo protocol to improve the quality of reads. Detailed steps of the workflow are shown in Figure 1. Blast annotation step was added after Cuffdiff to annotate the assembled transcripts. Common tools in this workflow, like TopHat, Cufflinks, and Cuffmerge were available on the galaxy toolshed. Several tools that were not available in the galaxy toolshed were constructed using XML programming.

Figure 1

Workflow Steps.


3. Results and Discussions

3.1. SMART-RDA Workflow

A Galaxy workflow for RNA-seq data analysis was constructed (see Figure 2) from a modified Tuxedo protocol. The first tool of this workflow is FastQ Groomer. It converted several input Fastq quality score types, like Illumina 1.8 and 454, into standard Sanger format [16]. Groomed fastq file was trimmed using FastQ Quality Trimmer. This tool trimmed the end of reads based on the aggregate value of quality scores found within a sliding window. FastQ Summary Statistics were used to create summary statistics on a fastq file before and after trimming by FastQ Quality Trimmer.

Figure 2

SMART-RDA Workflow.


After groomed and trimmed, the reads were assembled using TopHat. Fasta file of reference genome was needed for this stage. TopHat aligned RNA-seq reads into genome-sized sequence using the ultra-high-throughput short read aligner Bowtie, and then analyzed the mapping results to identify splicing junctions between exons. The output of TopHat was BAM file called accepted hits. BAM to SAM converter was needed to convert TopHat output into SAM formatted file [17]. The conversion process of BAM to SAM allowed users to review TopHat output directly.

The aligned reads from TopHat output were assembled by Cufflinks. This program assembled reads into transcripts, estimated their abundances, and tested them for differential expressions and regulations. This process produced GTF files that contain Cufflinks assembled isoforms. These files contained information of assembled reads, transcripts, and their abundances. The workflow produced two Cufflinks output files at one time because it was designed to process two different samples simultaneously. However, Cuffmerge was still needed to merge two GTF files produced by Cufflinks.

The process of finding significant changes in transcript expression was performed by Cuffdiff. In this stage, aligned reads produced by TopHat were quantified using assembled transcript produced by Cufflinks as a reference. The outputs of this step were tabular files that contained information about transcripts location, expressions, and differential analysis. The expression level of each transcript was stated in term of Fragments Per Kilobase of exon per Million fragments (FPKM). In these units, the relative abundances of transcripts were described in terms of expected biological object (fragments) observed from RNA-seq experiment.

The next stage in this workflow was to annotate each transcript previously analyzed by Cuffdiff. The annotation process used NCBI BLAST+blastn [18,19] to search annotation from known databases. Information about locus position of transcripts from Cuffdiff outputs was used to extract transcript sequences from a reference genome. Blast results produced in this process were then combined as an annotation into Cuffdiff output. The final stage of the workflow was visualization of transcript quantification and differential analysis using cummeRbund. This tool produced several visualizations, including boxplot, scatters, distributions, and volcano plots.

3.2. Workflow performance test

SMART-RDA workflow was tested using two files from NCBI SRA of oil palm Pisifera pollen RNA-seq under ID SRX278051 (fertile) and SRX278050 (sterile). The fertile reads size was 472.6 Mb while the sterile reads was 613.2 Mb. After trimming, the size of fertile reads decreased to 295.2 Mb and the sterile reads to 486.6 Mb. This stage removed 176 low-quality sequences from fertile and 112 from sterile reads.

Figure 3

Boxplot of differential gene expressions of fertile and sterile oil palm Pisifera pollens.


Tuxedo protocol has been successfully implemented by this workflow. The CummeRbund stage produced four type visualizations that represented the differential expression between two conditions. The first visualization was a boxplot of differential gene expressions (see Figure 3). Based on this plot, the average value of FPKM between fertile and sterile pollen was almost the same. In terms of range, fertile pollen had wider ranges of gene expression values.

Figure 4

Density plot of differential gene expressions of fertile and sterile oil palm Pisifera pollens.


The other type of visualization was density plot, as shown in Figure 4. This plot showed the differential gene expression patterns between fertile and sterile Pisifera pollens. The gene expression of sterile pollens was distributed in lower fpkm values, but with higher density than fertile pollens.

While boxplot and density plot described expression distributions, volcano plots (see Figure 5) described significant level of each transcript. Volcano plot showed several transcripts with higher expression in fertile than in sterile condition.

Figure 5

Volcano plot describes differential gene expression of fertile and sterile oil palm Pisifera pollens.


Finally, the most important result of this workflow was the list of genes with large difference in gene expressions (see Table 1.). There were ten genes with high significant expression levels. The level significance of differential expression was determined by P value. Furthermore, this data would be useful in determining the genes that played a role in palm pollens fertility.

Table 1

List of genes with high significance level of expressions.

Locus Genes FPKM p value
Fertile Sterile
EG5_Chr8:33574178-33580606 ubiquitin-conjugating enzyme [Elaeis guineensis] 52 905 0 0.02210
EG5_Chr5:26736162-26736842 beta-1,3-glucanase [Elaeis guineensis] 38 158 0 0.02395
EG5_Chr14:2597123-2598066 elongation factor 1-alpha, putative [Ricinus communis] 22 501 0 0.02570
EG5_Chr3:1054681-1056044 uncharacterized protein LOC100812783 [Glycine max] 0 13 788.6 0.02780
p5_sc00292:963590-964056 putative elicitor inducible beta-1,3-glucanase NtEIG-E76 [Oryza sativa Japonica Group] 0 13 672.6 0.02780
EG5_Chr3:28391917-28399716 ribosomal protein [Elaeis guineensis] 0 12 611.8 0.02780
EG5_Chr13:23110722-23111204 DUF246 domain-containing protein [Medicago truncatula] 0 22 311.8 0.02820
EG5_Chr13:27541179- 27541819 PREDICTED: elongation factor 1-alpha [Vitis vinifera] 0 12 161.8 0.02820
EG5_Chr1:3675588-3675755 uncharacterized protein LOC100276758 [Zea mays] 1844460 0 0.03080
EG5_Chr16:1554435-1555188 hypothetical protein OsI_32485 [Oryza sativa Indica Group] 27908 0 0.03080

4. Conclusions

SMART-RDA, a galaxy workflow that can be used to analyze RNA-seq data into differential expression information, was constructed based on modified Tuxedo protocol. Performance test using SRA data of fertile and sterile oil palm Pisifera pollens detected ten genes with high significant expression levels. Three visualizations and one table describing expression condition of samples were produced by this workflow, which was capable of performing differential analysis of large RNA-seq data.


The authors would like to thank the management of PT. SMART Tbk. for supporting this research and publication.



A. Grada K. Weinbrecht Next-generation sequencing: Methodology and application Journal of Investigative Dermatology 2013 133 8 e11 2-s2.0-84880393400 10.1038/jid.2013.248


V. Thakur R. Varshney Challenges and strategies for next generation sequencing (NGS) data analysis J Comput Sci Syst Biol 2010 03 02 040 042


L. B. B. Martin Z. Fei J. J. Giovannoni J. K. C. Rose Catalyzing plant science research with RNA-seq Frontiers in Plant Science 2013 4, article no. 66 2-s2.0-84878438205 10.3389/fpls.2013.00066


M. Garber M. G. Grabherr M. Guttman C. Trapnell Computational methods for transcriptome annotation and quantification using RNA-seq Nature Methods 2011 8 6 469 477 2-s2.0-79957842166 10.1038/nmeth.1613


S. M. Rumble P. Lacroute A. V. Dalca M. Fiume A. Sidow M. Brudno SHRiMP: Accurate mapping of short color-space reads PLoS Computational Biology 2009 5 5 2-s2.0-67049159825 10.1371/journal.pcbi.1000386 e1000386


B. Langmead C. Trapnell M. Pop S. L. Salzberg Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Genome Biology 2009 10 3, article no. R25 2-s2.0-62349130698 10.1186/gb-2009-10-3-r25


H. Li R. Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 2009 25 14 1754 1760 2-s2.0-67649884743 10.1093/bioinformatics/btp324


C. Trapnell L. Pachter S. L. Salzberg TopHat: Discovering splice junctions with RNA-Seq Bioinformatics 2009 25 9 1105 1111 2-s2.0-65449136284 10.1093/bioinformatics/btp120


C. Trapnell D. G. Hendrickson M. Sauvageau L. Goff J. L. Rinn L. Pachter Differential analysis of gene regulation at transcript resolution with RNA-seq Nature Biotechnology 2013 31 1 46 53 2-s2.0-84872198346 10.1038/nbt.2450


M. H. Schulz D. R. Zerbino M. Vingron E. Birney Oases: Robust de novo RNA-seq assembly across the dynamic range of expression levels Bioinformatics 2012 28 8 1086 1092 2-s2.0-84859768479 10.1093/bioinformatics/bts094 bts094


G. Robertson J. Schein R. Chiu R. Corbett M. Field S. D. Jackman K. Mungall S. Lee H. M. Okada J. Q. Qian M. Griffith A. Raymond N. Thiessen T. Cezard Y. S. Butterfield R. Newsome S. K. Chan R. She R. Varhol B. Kamoh A.-L. Prabhu A. Tam Y. Zhao R. A. Moore M. Hirst M. A. Marra S. J. M. Jones P. A. Hoodless I. Birol De novo assembly and analysis of RNA-seq data Nature Methods 2010 7 11 909 912 2-s2.0-78049346632 10.1038/nmeth.1517


M. Griffith O. L. Griffith J. Mwenifumbo R. Goya A. S. Morrissy R. D. Morin R. Corbett M. J. Tang Y.-C. Hou T. J. Pugh G. Robertson S. Chittaranjan A. Ally J. K. Asano S. Y. Chan H. I. Li H. McDonald K. Teague Y. Zhao T. Zeng A. Delaney M. Hirst G. B. Morin S. J. M. Jones I. T. Tai M. A. Marra Alternative expression analysis by RNA sequencing Nature Methods 2010 7 10 843 847 2-s2.0-77957663917 10.1038/nmeth.1503


C. Trapnell B. A. Williams G. Pertea A. Mortazavi G. Kwan M. J. Van Baren S. L. Salzberg B. J. Wold L. Pachter Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology 2010 28 5 511 515 2-s2.0-77952123055 10.1038/nbt.1621


B. Giardine C. Riemer R. C. Hardison R. Burhans L. Elnitski P. Shah Y. Zhang D. Blankenberg I. Albert J. Taylor W. Miller W. J. Kent A. Nekrutenko Galaxy: A platform for interactive large-scale genome analysis Genome Research 2005 15 10 1451 1455 2-s2.0-25844449770 10.1101/gr.4086505


Y. Kodama M. Shumway R. Leinonen The sequence read archive: Explosive growth of sequencing data Nucleic Acids Research 2012 40 1 D54 D56 2-s2.0-84862198590 10.1093/nar/gkr854


D. Blankenberg A. Gordon G. Von Kuster N. Coraor J. Taylor A. Nekrutenko G. Team Manipulation of FASTQ data with galaxy Bioinformatics 2010 26 14 1783 1785 2-s2.0-77954489376 10.1093/bioinformatics/btq281 btq281


H. Li B. Handsaker A. Wysoker T. Fennell J. Ruan N. Homer G. Marth G. Abecasis R. Durbin The Sequence Alignment/Map format and SAMtools Bioinformatics 2009 25 16 2078 2079 2-s2.0-68549104404 10.1093/bioinformatics/btp352


P. J. A. Cock B. A. Grüning K. Paszkiewicz L. Pritchard Galaxy tools and workflows for sequence analysis with applications in molecular plant pathology PeerJ 2013 2013 1, article no. e167 2-s2.0-84885126545 10.7717/peerj.167


C. Camacho G. Coulouris V. Avagyan N. Ma J. Papadopoulos K. Bealer T. L. Madden BLAST+: Architecture and applications BMC Bioinformatics 2009 10, article no. 421 2-s2.0-74049108922 10.1186/1471-2105-10-421



  • Downloads 2
  • Views 12


ISSN: 2413-0877