
Using R and Bioconductor in Clinical Genomics and Transcriptomics

  • Jorge L. Sepulveda
    Correspondence
    Address correspondence to Jorge L. Sepulveda, M.D., Ph.D., Department of Pathology and Cell Biology, Columbia University Irving Medical Center, 630 W 168 St, PH-1564-C, New York, NY 10030.
    Affiliations
    Department of Pathology and Cell Biology, Columbia University Irving Medical Center, New York, New York
    Informatics Subdivision Leadership, Association for Molecular Pathology, Bethesda, Maryland
Open Archive. Published: October 09, 2019. DOI: https://doi.org/10.1016/j.jmoldx.2019.08.006
      Bioinformatics pipelines are essential in the analysis of genomic and transcriptomic data generated by next-generation sequencing (NGS). Recent guidelines emphasize the need for rigorous validation and assessment of robustness, reproducibility, and quality of NGS analytic pipelines intended for clinical use. Software tools written in the R statistical language and, in particular, the set of tools available in the Bioconductor repository are widely used in research bioinformatics; and these frameworks offer several advantages for use in clinical bioinformatics, including the breadth of available tools, modular nature of software packages, ease of installation, enforcement of interoperability, version control, and short learning curve. This review provides an introduction to R and Bioconductor software, its advantages and limitations for clinical bioinformatics, and illustrative examples of tools that can be used in various steps of NGS analysis.
      Robust bioinformatics approaches have increasingly become critical for analysis of high-throughput, high-complexity molecular data produced by new assay technologies, such as microarrays and next-generation sequencing (NGS), especially when the results are used to potentially influence clinical decisions. In clinical bioinformatics, a software pipeline is a set of predefined programmatic procedures (file operations, programs or tools, and database queries) that convert one or more inputs (eg, raw sequencing data) into one or more outputs, often in sequential and sometimes parallel series of steps, which can yield intermediate results that can be useful on their own and also subsequently entered as inputs to the next tool in the pipeline.
      • Roy S., Coldren C., Karunamurthy A., Kip N.S., Klee E.W., Lincoln S.E., Leon A., Pullambhatla M., Temple-Smolkin R.L., Voelkerding K.V., Wang C., Carter A.B. Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipelines: a Joint Recommendation of the Association for Molecular Pathology and the College of American Pathologists.
      • Gargis A.S., Kalman L., Bick D.P., da Silva C., Dimmock D.P., Funke B.H., et al. Good laboratory practice for clinical next-generation sequencing informatics pipelines.
      • Oliver G.R., Hart S.N., Klee E.W. Bioinformatics for clinical next generation sequencing.
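      The pipeline definition given above (predefined steps that turn inputs into outputs, with intermediate results that are useful on their own) can be sketched in a few lines of base R. This is only a toy illustration: the step names, the functions, and the reads are hypothetical and do not correspond to any tool discussed in this review.

```r
# Hypothetical three-step pipeline; step names and toy reads are
# illustrative assumptions, not tools from the review.
steps <- list(
  trim   = function(reads) substr(reads, 1, 4),        # keep the first 4 bases
  filter = function(reads) reads[!grepl("N", reads)],  # drop reads with ambiguous bases
  count  = function(reads) table(reads)                # tabulate identical reads
)

# Run the steps in order, feeding each output into the next step and
# keeping every intermediate result under the name of its step.
run_pipeline <- function(input, steps) {
  results <- list(input = input)
  current <- input
  for (step_name in names(steps)) {
    current <- steps[[step_name]](current)
    results[[step_name]] <- current
  }
  results
}

raw_reads <- c("ACGTAA", "ACNTGG", "ACGTCC", "TTGACA")
out <- run_pipeline(raw_reads, steps)
out$filter  # intermediate result, usable on its own
out$count   # final output
```

      Keeping each intermediate result addressable by name is one simple way to inspect or validate a single step without rerunning the whole chain.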
      Current pipelines for genomic and transcriptomic assays with clinical applications have not been fully standardized, and bioinformatics approaches are highly variable among institutions using these assays for patient care purposes. Recently, the Association for Molecular Pathology, with collaboration from the College of American Pathologists and the American Medical Informatics Association, published guidelines for the validation of clinical NGS bioinformatics pipelines, consisting of a set of 17 best practice consensus recommendations.
      • Roy S., Coldren C., Karunamurthy A., Kip N.S., Klee E.W., Lincoln S.E., Leon A., Pullambhatla M., Temple-Smolkin R.L., Voelkerding K.V., Wang C., Carter A.B. Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipelines: a Joint Recommendation of the Association for Molecular Pathology and the College of American Pathologists.
      These guidelines are focused mainly on the use of NGS assays for the detection of clinically relevant genomic alterations (variants). Several laboratories also use NGS for identification of changes in RNA structure (eg, to detect genomic fusions) and abundance (eg, to detect mRNA or miRNA expression patterns), which will be abbreviated as RNASeq.
      The guidelines emphasize the need to thoroughly validate the pipeline and to lock down the complete set of tools, code, operational environment, and network connections that compose the pipeline before using it for clinical purposes. Of importance, any change to any component of the pipeline requires revalidation to ensure that there is no impact on the performance characteristics of the pipeline. On the other hand, the field of genomics is constantly evolving through advances in sequencing technology; in the processing of sequencing data and bioinformatics software; in knowledge of genomic structure, biological functions, and regulatory networks; and, most important, in the understanding of the clinical significance of genomic and epigenomic alterations. Therefore, the development, improvement, and proper validation of flexible bioinformatics pipelines that reflect the most recent advances in genomics are critical steps in providing optimal patient care.
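      One small part of locking down a pipeline, detecting that a package version has changed since validation, can be sketched in base R. This is a minimal sketch under stated assumptions: the helper names snapshot_versions() and find_drift() are hypothetical, the three packages are stand-ins for a real pipeline's dependencies, and a production lock would also need to cover code, operating environment, and network connections.

```r
# Record the installed version of each package as a named character vector.
# (Helper names are hypothetical; the package list is a stand-in.)
snapshot_versions <- function(pkgs) {
  vapply(pkgs, function(p) as.character(utils::packageVersion(p)), character(1))
}

# Return the names of packages whose current version differs from the locked one.
find_drift <- function(locked, current) {
  names(current)[current != locked[names(current)]]
}

pipeline_pkgs <- c("stats", "utils", "tools")

locked  <- snapshot_versions(pipeline_pkgs)  # would be stored at validation time
current <- snapshot_versions(pipeline_pkgs)  # would be recomputed before each run
drift   <- find_drift(locked, current)       # empty when nothing has changed

# Simulate a manifest that no longer matches the installed version of 'tools'.
tampered <- locked
tampered[["tools"]] <- "0.0.0"
drift_after_change <- find_drift(tampered, current)

if (length(drift) > 0) {
  stop("Revalidation needed; versions changed for: ", paste(drift, collapse = ", "))
}
```

      In practice the locked manifest would be written to a file under version control at validation time rather than recomputed in the same session.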
      In many institutions, the tools of choice are acquired from commercial vendors and/or other well-established pipelines, such as the Genome Analysis Toolkit set of tools from the Broad Institute (Cambridge, MA).
      • McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
      In addition, several institutions choose to use a customized assembly of various tools, often from open-source repositories, that offer the following advantages: i) transparent access to the algorithms used for better independent assessment of the software functionalities and limitations; ii) ability to fine-tune parameters and modify the code to best fit specific goals and operational frameworks and to rapidly adapt the pipeline to changes in technology, software, or knowledge bases or in clinical needs of the institution; iii) ability to choose the best-of-breed tool for a particular process in the pipeline; and iv) ability to explore, test, and prototype alternative approaches to the bioinformatics pipelines in a development setting.
      One of the most commonly used open-source repositories of bioinformatics tools used in genomics, transcriptomics, and other NGS-based assays is the Bioconductor repository.
      • Gentleman R.C., Carey V.J., Bates D.M., Bolstad B., Dettling M., Dudoit S., Ellis B., Gautier L., Ge Y., Gentry J., Hornik K., Hothorn T., Huber W., Iacus S., Irizarry R., Leisch F., Li C., Maechler M., Rossini A.J., Sawitzki G., Smith C., Smyth G., Tierney L., Yang J.Y., Zhang J. Bioconductor: open software development for computational biology and bioinformatics.
      • Huber W., Carey V.J., Gentleman R., Anders S., Carlson M., Carvalho B.S., Bravo H.C., Davis S., Gatto L., Girke T., Gottardo R., Hahne F., Hansen K.D., Irizarry R.A., Lawrence M., Love M.I., MacDonald J., Obenchain V., Oleś A.K., Pagès H., Reyes A., Shannon P., Smyth G.K., Tenenbaum D., Waldron L., Morgan M. Orchestrating high-throughput genomic analysis with Bioconductor.
      Bioconductor tools are written in the R statistical programming language (hereafter abbreviated as R) and are freely available to download, install, and modify through an open-source and open-development model supported by the use of the GitHub repository system.
      In this review, we will discuss tools written in R that can be useful in pipelines for the processing and analysis of NGS data, with a focus on Bioconductor and on genomic and transcriptomic assays with potential clinical applications.

      Properties of R and Bioconductor Useful in Bioinformatics Frameworks

      The R statistical language is a free, open-source implementation of the older statistical and graphing language S, with additional features such as the ability to extend base R functionalities by using self-contained code extensions, called packages, that can be easily installed from repositories, such as CRAN and Bioconductor. The source, version, and/or reference for all packages mentioned in this review are listed in Supplemental Table S1.
      • Huber W., Carey V.J., Gentleman R., Anders S., Carlson M., Carvalho B.S., Bravo H.C., Davis S., Gatto L., Girke T., Gottardo R., Hahne F., Hansen K.D., Irizarry R.A., Lawrence M., Love M.I., MacDonald J., Obenchain V., Oleś A.K., Pagès H., Reyes A., Shannon P., Smyth G.K., Tenenbaum D., Waldron L., Morgan M. Orchestrating high-throughput genomic analysis with Bioconductor.
      • Bao L., Pu M., Messer K. AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data.
      • Shen Y., Rahman M., Piccolo S.R., Gusenleitner D., El-Chaar N.N., Cheng L., Monti S., Bild A.H., Johnson W.E. ASSIGN: context-specific genomic profiling of multiple heterogeneous biological pathways.
      • Yu G., Zhang B., Bova G.S., Xu J., Shih I.M., Wang Y. BACOM: in silico detection of genomic deletion types and correction of normal cell contamination in copy number data.
      • Sengupta S., Wang J., Lee J., Müller P., Gulukota K., Banerjee A., Ji Y. Bayclone: Bayesian nonparametric inference of tumor subclones using NGS data.
      • Kane M.J., Emerson J., Weston S. Scalable strategies for computing with massive data.
      • Durinck S., Spellman P.T., Birney E., Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt.
      • Zhu W., Kuziora M., Creasy T., Lai Z., Morehouse C., Guo X., Sebastian Y., Shen D., Huang J., Dry J.R. BubbleTree: an intuitive visualization to elucidate tumoral aneuploidy and clonality using next generation sequencing data.
      • Purdom E., Ho C., Grasso C.S., Quist M.J., Cho R.J., Spellman P. Methods and challenges in timing chromosomal abnormalities within cancer samples.
      • Carrara M., Beccuti M., Cavallo F., Donatelli S., Lazzarato F., Cordero F., Calogero R.A. State of art fusion-finder algorithms are suitable to detect transcription-induced chimeras in normal tissues?
      • Lågstad S., Zhao S., Hoff A.M., Johannessen B., Lingjærde O.C., Skotheim R.I. Chimeraviz: a tool for visualizing chimeric RNA.
      • Oróstica K.Y., Verdugo R.A. chromPlot: visualization of genomic data in chromosomal context.
      • Zare H., Wang J., Hu A., Weber K., Smith J., Nickerson D., Song C., Witten D., Blau C.A., Noble W.S. Inferring clonal composition from multiple sections of a breast cancer.
      • Klambauer G., Schwarzbauer K., Mayr A., Clevert D.-A., Mitterecker A., Bodenhofer U., Hochreiter S. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate.
      • Gusnanto A., Tcherveniakov P., Shuweihdi F., Samman M., Rabbitts P., Wood H.M. Stratifying tumour subtypes based on copy number alteration profiles using next-generation sequence data.
      • Gusnanto A., Wood H.M., Pawitan Y., Rabbitts P., Berri S. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data.
      • Jiang Y., Oldridge D.A., Diskin S.J., Zhang N.R. CODEX: a normalization and copy number variation detection method for whole exome sequencing.
      • Kuilman T., Velds A., Kemper K., Ranzani M., Bombardelli L., Hoogstraat M., Nevedomskaya E., Xu G., de Ruiter J., Lolkema M.P., Ylstra B., Jonkers J., Rottenberg S., Wessels L.F., Adams D.J., Peeper D.S., Krijgsman O. CopywriteR: DNA copy number detection from off-target sequence data.
      • Mock A., Murphy S., Morris J., Marass F., Rosenfeld N., Massie C. CVE: an R package for interactive variant prioritisation in precision oncology.
      • Fowler A., Mahamdallie S., Ruark E., Seal S., Ramsay E., Clarke M., Uddin I., Wylie H., Strydom A., Lunter G., Rahman N. Accurate clinical detection of exon copy number variants in a targeted NGS panel using DECoN.
      • Ahn J., Yuan Y., Parmigiani G., Suraokar M.B., Diao L., Wistuba I.I., Wang W. DeMix: deconvolution for mixed cancer transcriptomes using raw measured data.
      • Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
      • Buschmann T. DNABarcodes: an R package for the systematic construction of DNA sample tags.
      • Sayols S., Scherzinger D., Klein H. dupRadar: a Bioconductor package for the assessment of PCR artifacts in RNA-Seq data.
      • Delhomme N., Padioleau I., Furlong E.E., Steinmetz L.M. easyRNASeq: a bioconductor package for processing RNA-Seq data.
      • Robinson M.D., McCarthy D.J., Smyth G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.
      • Rainer J., Gatto L., Weichenberger C.X. Ensembldb: an R package to create and use Ensembl-based annotation resources.
      • Chelaru F., Corrada Bravo H. Epiviz: a view inside the design of an integrated visual analysis software for genomics.
      • Yoshihara K., Shahmoradgoli M., Martínez E., Vegesna R., Kim H., Torres-Garcia W., Treviño V., Shen H., Laird P.W., Levine D.A., Carter S.L., Getz G., Stemke-Hale K., Mills G.B., Verhaak R.G.W. Inferring tumour purity and stromal and immune cell admixture from expression data.
      • Sathirapongsasuti J.F., Lee H., Horst B.A.J., Brunner G., Cochran A.J., Binder S., Quackenbush J., Nelson S.F. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV.
      • Plagnol V., Curtis J., Epstein M., Mok K.Y., Stebbings E., Grigoriadou S., Wood N.W., Hambleton S., Burns S.O., Thrasher A.J., Kumararatne D., Doffinger R., Nejentsev S. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling.
      • Andor N., Graham T.A., Jansen M., Xia L.C., Aktipis C.A., Petritsch C., Ji H.P., Maley C.C. Pan-cancer analysis of the extent and consequences of intratumor heterogeneity.
      • Krijgsman O., Benner C., Meijer G.A., van de Wiel M.A., Ylstra B. FocalCall: an R package for the annotation of focal copy number aberrations.
      • Gendoo D.M., Ratanasirigulchai N., Schroder M.S., Pare L., Parker J.S., Prat A., Haibe-Kains B. Genefu: an R/Bioconductor package for computation of gene expression-based signatures in breast cancer.
      • Akalin A., Franke V., Vlahoviček K., Mason C.E., Schübeler D. Genomation: a toolkit to summarize, annotate and visualize genomic intervals.
      • Lawrence M., Huber W., Pagès H., Aboyoun P., Carlson M., Gentleman R., Morgan M.T., Carey V.J. Software for computing and annotating genomic ranges.
      • Yin T., Cook D., Lawrence M. Ggbio: an R package for extending the grammar of graphics for genomic data.
      • Wickham H. ggplot2: Elegant Graphics for Data Analysis.
      • Hänzelmann S., Castelo R., Guinney J. GSVA: gene set variation analysis for microarray and RNA-Seq data.
      • Hahne F., Ivanek R.
      • Lai Y.-P., Wang L.-B., Wang W.-A., Lai L.-C., Tsai M.-H., Lu T.-P., Chuang E.Y. iGC—an integrated analysis package of gene expression and copy number alteration.
      • Law C.W., Chen Y., Shi W., Smyth G.K. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.
      • Ramos M., Schiffer L., Re A., Azhar R., Basunia A., Cabrera C.R., Chan T., Chapman P., Davis S., Gomez-Cabrero D., Culhane A.C., Haibe-Kains B., Hansen K., Kodali H., Louis M.S., Mer A.S., Reister M., Morgan M., Carey V., Waldron L. Software for the integration of multi-omics experiments in Bioconductor.
      • Hernandez-Ferrer C., Ruiz-Arenas C., Beltran-Gomila A., González J.R. MultiDataSet: an R package for encapsulating multiple data sets with application to omic data integration.
      • Povysil G., Tzika A., Vogt J., Haunschmid V., Messiaen L., Zschocke J., Klambauer G., Hochreiter S., Wimmer K. panelcn.MOPS: copy-number detection in targeted NGS panel data for clinical diagnostics.
      • Liu C., Lehtonen R., Hautaniemi S. PerPAS: topology-based single sample pathway analysis method.
      • Foroushani A., Agrahari R., Docking R., Chang L., Duns G., Hudoba M., Karsan A., Zare H. Large-scale gene network analysis reveals the significance of extracellular matrix pathway and homeobox genes in acute myeloid leukemia: an introduction to the Pigengene package and its applications.
      • Riester M., Singh A.P., Brannon A.R., Yu K., Campbell C.D., Chiang D.Y., Morrissey M.P. PureCN: copy number calling and SNV classification using targeted short read sequencing.
      • Scheinin I., Sie D., Bengtsson H., van de Wiel M.A., Olshen A.B., van Thuijl H.F., van Essen H.F., Eijk P.P., Rustenburg F., Meijer G.A., Reijneveld J.C., Wesseling P., Pinkel D., Albertson D.G., Ylstra B. DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly.
      • Gaidatzis D., Lerch A., Hahne F., Stadler M.B. QuasR: quantification and annotation of short reads in R.
      • Reinecke F., Satya R.V., DiCarlo J. Quantitative analysis of differences in copy numbers using read depth obtained from PCR-enriched samples and controls.
      • Collado-Torres L., Nellore A., Kammers K., Ellis S.E., Taub M.A., Hansen K.D., Jaffe A.E., Langmead B., Leek J.T. Reproducible RNA-seq analysis using recount2.
      • Collado-Torres L., Nellore A., Jaffe A.E. Recount workflow: accessing over 70,000 human RNA-seq samples with Bioconductor.
      • Jabot-Hanin F., Varet H., Tores F., Alcais A., Jais J.-P. Rfpred: a random forest approach for prediction of missense variants in human exome.
      • Wang S., Pandis I., Johnson D., Emam I., Guitton F., Oehmichen A., Guo Y. Optimising parallel R correlation matrix calculations on gene expression data using MapReduce.
      • de Souza W., Carvalho B.S., Lopes-Cendes I. Rqc: a Bioconductor package for quality control of high-throughput sequencing data.
      • Liao Y., Smyth G.K., Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.
      • Lawrence M., Gentleman R., Carey V. Rtracklayer: an R package for interfacing with genome browsers.
      • Favero F., Joshi T., Marquard A.M., Birkbak N.J., Krzystanek M., Li Q., Szallasi Z., Eklund A.C. Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data.
      • Morgan M., Anders S., Lawrence M., Aboyoun P., Pagès H., Gentleman R. ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data.
      • Chen M., Gunel M., Zhao H. SomatiCA: identifying, characterizing and quantifying somatic copy number aberrations from cancer genome sequencing data.
      • Gehring J.S., Fischer B., Lawrence M., Huber W. SomaticSignatures: inferring mutational signatures from single-nucleotide variants.
      • Zhu Y., Stephens R.M., Meltzer P.S., Davis S.R. SRAdb: query and use public next-generation sequencing data from within R.
      • Backman T.W.H., Girke T. systemPipeR: NGS workflow and report generation environment.
      • Colaprico A., Silva T.C., Olsen C., Garofano L., Cava C., Garolini D., Sabedot T.S., Malta T.M., Pagnotta S.M., Castiglioni I., Ceccarelli M., Bontempi G., Noushmehr H. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data.
      • Hummel M., Bonnin S., Lowy E., Roma G. TEQC: an R package for quality control in target capture experiments.
      • Ha G., Roth A., Khattra J., Ho J., Yap D., Prentice L.M., Melnyk N., McPherson A., Bashashati A., Laks E., Biele J., Ding J., Le A., Rosner J., Shumansky K., Marra M.A., Gilks C.B., Huntsman D.G., McAlpine J.N., Aparicio S., Shah S.P. TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data.
      • Soneson C., Love M.I., Robinson M.D. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.
      • Wang N., Gong T., Clarke R., Chen L., Shih I.-M., Zhang Z., Levine D.A., Xuan J., Wang Y. UNDO: a Bioconductor R package for unsupervised deconvolution of mixed gene expressions in tumor samples.
      • Obenchain V., Lawrence M., Carey V., Gogarten S., Shannon P., Morgan M. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants.
      • Knaus B.J., Grünwald N.J. VCFR: a package to manipulate and visualize variant call format data in R.
      • Alvarez M.J., Shen Y., Giorgi F.M., Lachmann A., Ding B.B., Ye B.H., Califano A. Functional characterization of somatic mutations in cancer using network-based inference of protein activity.
      • Pugh T.J., Amr S.S., Bowser M.J., Gowrisankar S., Hynes E., Mahanta L.M., Rehm H.L., Funke B., Lebo M.S. VisCap: inference and visualization of germ-line copy-number variants from targeted clinical sequencing data.
      Some features of the R programming language and environment of relevance to bioinformatics are described below.

       Command-Driven, High-Level, Interpreted Language

      In contrast with several other statistical languages, including the S-PLUS implementation of S, R does not natively use a point-and-click graphical user interface; rather, commands are typed into a console or console-like window and immediately executed by pressing the return key. Alternatively, commands can be stored in text files or scripts, conventionally with the .R extension, which can be called with the source("filepath/filename.R") command or executed in total or in blocks from an R editor. Because R is an interpreted language, lines of code can be written, executed, and their results visualized rapidly. This allows rapid prototyping and testing of new functionalities: only the modified parts of the code need to be rerun step by step, while intermediate results from unmodified lines of code are often kept in memory, so the effect of a change can be observed immediately. When speed is an issue, portions of well-tested code can be reimplemented or compiled, usually as C++ code, for execution at run time.
      Although R can be run from a command-line console in a terminal-type application, many if not most users use RStudio (Boston, MA), a free, multiplatform, open-source integrated development environment providing a graphical user interface that displays various windows and tabs useful for R programming, including the following: i) R console, where commands can be typed and executed, and the results can be displayed; ii) Tabbed Source window containing various R scripts that can be executed in their entirety (sourced), or by specific lines or blocks of code; iii) Workspace window, showing the various objects (including package functions) loaded in the various environments; iv) History window containing a searchable list of all previously executed code; v) Files tab that can be used to consult directories and perform file operations; vi) Packages tab, listing packages loaded or installed; and vii) Help window containing documentation manual pages for the various packages installed.
      Display of various graphs can be directed to integrated or floating Plots windows or to a web browser window, or saved to a file. Markdown documents integrating code blocks, code outputs, and rich-text descriptions can be generated, edited, and exported in PDF, HTML, or Microsoft Word formats using the integrated knitr package. Of importance for clinical bioinformatics pipeline development, RStudio supports version control of projects by integrating Git repository functionalities, and can be deployed in a server environment, therefore avoiding multiple installations of R and associated libraries in local workstations.
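      As a minimal sketch of the script-based workflow described above, the following base R code writes a small script to a temporary file and runs it with source(). The file name, the toy sequence, and the variable names are illustrative assumptions.

```r
# Write a three-line R script to a temporary location (hypothetical file name).
script_path <- file.path(tempdir(), "gc_content.R")
writeLines(c(
  "dna <- 'ACGTGGCC'",
  "bases <- strsplit(dna, '')[[1]]",
  "gc <- mean(bases %in% c('G', 'C'))"
), script_path)

# Execute the whole script. Evaluating it in its own environment keeps the
# objects it creates separate from the rest of the session.
env <- new.env()
source(script_path, local = env)

env$gc     # fraction of G or C bases computed by the script
env$bases  # intermediate result, still in memory for further inspection
```

      Because the intermediate object bases remains in memory after the script has run, a modified version of the last line can be tried interactively without repeating the earlier steps.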

       Extensibility

      The modular design of R and the use of packages that function as optional plugins to extend basic functionalities are major features of R and Bioconductor and allow users to tailor solutions to their custom needs. Packages can be downloaded and installed from the default R repository (usually CRAN; https://cran.r-project.org, last accessed November 7, 2019) with the simple command install.packages("PackageName").
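      A common pattern built on install.packages() is to install a package only when it is missing. The sketch below assumes a helper named ensure_installed() (hypothetical) and the CRAN cloud mirror as the repository; it is demonstrated with a package that ships with R, so no download is triggered.

```r
# Install 'pkg' only if it cannot be loaded; return TRUE when it is available.
# The helper name and the default mirror are assumptions of this sketch.
ensure_installed <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# 'stats' ships with R, so this call returns TRUE without contacting a repository.
ok <- ensure_installed("stats")
```

      Bioconductor packages are typically installed through the BiocManager package, with BiocManager::install("PackageName"), rather than through install.packages() directly.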

       Robustness of Package Management and Installation

      Standardized rules and guidelines for package generation, structure, dependency management, installation, and documentation are available in the Manual for Writing R Extensions (https://cran.r-project.org/doc/manuals/r-release/R-exts.html, last accessed April 28, 2019) and provide a framework for package robustness, ease of learning, and interoperability. On installation, integrity of transmission is verified by checking the MD5 checksum included with each package. Most packages also include instruction documents, called vignettes, and test data, useful for automated validation steps during installation or for learning and validation by the users.
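      The checksum verification mentioned above can be illustrated with tools::md5sum(), which is part of the standard R distribution. In this sketch the "package" is a temporary file of arbitrary bytes and the published checksum is simulated; verify_md5() is a hypothetical helper, and this is not the exact internal procedure used by install.packages().

```r
# Create a stand-in for a downloaded package archive.
pkg_file <- tempfile(fileext = ".tar.gz")
writeBin(as.raw(1:100), pkg_file)

# Simulate the checksum that a repository would publish for this file.
published_md5 <- unname(tools::md5sum(pkg_file))

# Compare the checksum of a local file with the expected value.
verify_md5 <- function(path, expected) {
  identical(unname(tools::md5sum(path)), expected)
}

intact <- verify_md5(pkg_file, published_md5)

# Simulate a transmission error by changing the last byte of the file.
writeBin(as.raw(c(1:99, 0)), pkg_file)
corrupted <- !verify_md5(pkg_file, published_md5)
```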

       Flexibility of Package Loading and Attachment

      Packages can be loaded in memory, in a special package environment (namespace), when needed, with the command library(PackageName). By default, lazy loading is used for R code objects (ie, the namespace will contain references or promises for all the named objects, which are only fully loaded in memory when the promises are evaluated). The library() call also attaches the package's exported objects to the search path, so they can be called by name from the Global Environment. Alternatively, functions can be called without attaching the package by using PackageName::function. This flexibility and easy installation and loading allow the users to quickly experiment with different packages and unload them if not fit for their needs.
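      The difference between loading a namespace and attaching a package can be observed directly in base R. The sketch below uses the tools package, which ships with R; the file name is an arbitrary example.

```r
# Remember whether 'tools' was already attached, so the session can be restored.
attached_before <- "package:tools" %in% search()

# PackageName::function loads the namespace but does not attach the package.
ext <- tools::file_ext("sample.vcf.gz")
loaded_after_call <- "tools" %in% loadedNamespaces()

# library() additionally places the package on the search path.
library(tools)
attached_after_library <- "package:tools" %in% search()

# Detach again if this example was the one that attached it.
if (!attached_before) detach("package:tools")
```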

       Interoperability

      Interoperability of R packages guarantees (within certain constraints) that they will work well in combination (eg, by exchanging common data structures). As discussed below, interoperability is facilitated by the use of common formal classes and methods among the various related packages. For example, the use of the SummarizedExperiment structure ensures interoperability of rectangular data containing sample annotations × sample data across multiple analytical approaches, whereas the GenomicRanges::GRanges class ensures interoperability of closed-interval genomic coordinate–based data. This contrasts with some one-off software solutions and self-contained tools commonly used in current clinical bioinformatics pipelines, which often require file format conversions or data rearrangement before the next step can be run. Common Bioconductor classes and functions to import standard genomic files into defined classes are summarized in Table 1.
      Table 1. Common Bioconductor Classes and File Import Functions for Genomic and Transcriptomic Data

      | Package | Class | Generate/Import | Features |
      |---|---|---|---|
      | SummarizedExperiment | SummarizedExperiment | | Rectangular sample feature × sample data |
      | GenomicRanges | GRanges | | One-based, closed-range genomic coordinate–based data |
      | Biostrings | *StringSet | read*StringSet(path, FASTA file) | * = DNA, RNA, or AA |
      | Biostrings | *MultipleAlignment | read*MultipleAlignment(path, clustal file) | * = DNA, RNA, or AA |
      | GSEABase | GeneSet, GeneSetCollection | | Gene sets |
      | SingleCellExperiment | SingleCellExperiment | | Single-cell data |
      | MultiAssayExperiment | MultiAssayExperiment | | Multi-omics data |
      | rtracklayer | rtracklayer | import(*) | * = two-bit, GTF, GFF, BED, WIG, or bigWig files |
      | VariantAnnotation | VCF | readVcf(VCF file) | |
      | Rsamtools | | scanBam(Bam/Sam file) | |
      | GenomicAlignments | | readGAlignments(Bam/Sam file) | |
      | ShortRead | ShortReadQ | readFastq(Fastq file); FastqStreamer(Fastq file) | Imports FASTQ and FASTA files |
      | rSFFreader | SffReadsQ | readSff(sff file) | Imports SFF files |

      AA, amino acid; BAM, binary alignment/map; BED, browser extensible data; GFF, generic feature format; GTF, gene transfer format; SAM, sequence alignment/map; SFF, standard flowgram format; VCF, variant call format; WIG, wiggle.
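      To make the one-based, closed-range convention of the GRanges class concrete, the sketch below builds three toy intervals and queries them with a single-base position. It assumes the Bioconductor package GenomicRanges is installed and skips itself otherwise; the chromosome names and coordinates are invented for illustration.

```r
if (requireNamespace("GenomicRanges", quietly = TRUE)) {
  suppressPackageStartupMessages(library(GenomicRanges))

  # Three hypothetical intervals; start and end are both included.
  exons <- GRanges(
    seqnames = c("chr7", "chr7", "chr17"),
    ranges   = IRanges(start = c(100, 150, 500), end = c(200, 250, 600)),
    strand   = c("+", "+", "-")
  )

  # Closed intervals: width is end - start + 1.
  exon_widths <- width(exons)

  # A single-base position at the last base of the first interval.
  variant <- GRanges("chr7", IRanges(start = 200, end = 200))

  # Number of intervals that contain the position.
  n_overlapping <- countOverlaps(variant, exons)
} else {
  message("GenomicRanges is not installed; skipping the example.")
}
```

      Position 200 is counted as inside the interval 100 to 200 because both ends are inclusive, which is the property that lets packages exchange coordinates without off-by-one conversions.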

       Scalability

      Although originally R processes were executed in a single thread and required loading of all of the data in memory, recent extensions have added several big data solutions, including solutions for high-performance computing, such as big memory management using file-based or shared memory approaches, parallel and distributed computing, graphic processing unit use, database interfaces, and big data visualization (Supplemental Table S2). Although comparative performance data are not available for clinical pipelines, these enhancements are expected to substantially improve the speed of computationally heavy tasks in both research and clinical bioinformatics pipelines using R and Bioconductor tools.
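      As a minimal sketch of the parallel computing option mentioned above, the following uses only the parallel package distributed with base R. The function gc_content and the four short sequences are invented for illustration and stand in for a computationally heavy per-sample step; two local workers are an arbitrary choice.

```r
library(parallel)

# Toy per-sample task standing in for a computationally heavy step
gc_content <- function(seq) {
  bases <- strsplit(seq, "")[[1]]
  mean(bases %in% c("G", "C"))
}

# Invented sequences, one per hypothetical sample
reads <- c(s1 = "ACGTGGCC", s2 = "ATATATAT", s3 = "GGGGCCCC", s4 = "ACACACGT")

cl <- makeCluster(2)                      # start two local worker processes
res <- parSapply(cl, reads, gc_content)   # distribute the task across the workers
stopCluster(cl)                           # always release the workers

res
```

      Whether this approach pays off depends on the cost of each task relative to the overhead of starting workers and transferring data, so timing on representative data is advisable before adopting it in a pipeline.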

       Scope and Breadth of R Tools

      A wide range of powerful solutions for statistical, big data, model-building, machine-learning, and graphing problems are available, which are increasingly needed for complex bioinformatics analyses. For example, the CRAN repository has >14,000 R packages, which, in addition to bioinformatics, genetics, and medicine, span areas from advanced mathematics and statistics to chemometrics and computational physics, imaging, natural language processing, finance and econometrics, psychometrics, social sciences, and environmetrics. Bioconductor release version 3.8 has 1649 software packages dedicated to bioinformatics, as well as 941 annotation and 360 experimental data packages. In addition, there are 23 workflows that describe how to accomplish specific analytic goals using Bioconductor, such as single-cell RNASeq and annotating genomic variants (https://bioconductor.org/packages/release/BiocViews.html, last accessed April 28, 2019).

       Interfacing Capabilities

      R was designed to interface with a variety of other languages, to take advantage of well-established and state-of-the-art algorithms. For example, API connections to C/C++, Bash, Python, Java, Julia, Ruby, Matlab, Perl, SQL, MariaDB, MongoDB, Neo4J, and others are available.

       Support for Object-Oriented Programming

      Two key concepts in R are that everything that exists is an object and that everything that happens is a function call (Chambers J.M., Software for Data Analysis: Programming with R). Objects are data structures that have specific properties (attributes) and methods that act on properties (eg, to print, modify, or perform calculations on the object attributes).
      Objects can be assigned to classes, which define certain constraints that the object must follow. Conversely, classes can be conceived as blueprints for the generation of objects with certain properties (Supplemental Figure S1). In this example, the object gene can be viewed as an instance of the class character, which is a basic class in R. The process of generating the object (eg, gene <- "KRAS") is called instantiation, and in this example the class is implicitly defined by enclosing the data in quotes, therefore defining gene as a character vector object.
      Classes can be extended [ie, new subclasses are generated that inherit all of the properties of the parent class (called a superclass)]. For example, the basic class vector is extended to character, complex, double, expression, integer, list, logical, numeric, single, and raw subclasses, which inherit all the attributes of the vector superclass but have their own specific attributes. Vector is a virtual class (ie, one cannot directly generate objects of the vector class, but rather one of the vector subclasses must be chosen). Inheritance ensures that methods valid for the superclass can be applied to subclasses without modification, therefore facilitating development of new functionalities by obviating the need to recode the same methods for the new subclasses.
      In addition to vector classes, there are language subclasses, such as function and call. In fact, a function object such as print is simply a collection of characters stored in memory that provides instructions to the interpreter to execute the function call when its arguments (inside parentheses) are supplied [eg, print(gene)]. This results in another object, the result (value) of the call.
      Examples of more complex classes of objects in R include matrix and array, which can be thought of as vectors folded in multidimensional structures, where a matrix is the same as a two-dimensional array composed of rows and columns. In other words, vectors can be generated with the dim attribute, which specifies the dimensions to fold the vector into a matrix or a multidimensional array. Another widely used object is the data.frame, which is like a matrix, but where columns can have different data types (Supplemental Figure S1).
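      These relationships can be checked interactively. The following minimal sketch uses only base R; the object names (gene, cl, counts, samples) and values are illustrative and are not taken from the supplemental figure.

```r
# A character vector is created implicitly by quoting the data
gene <- "KRAS"
class(gene)                  # "character"

# Functions and unevaluated calls are objects too
class(print)                 # "function"
cl <- quote(print(gene))     # the call itself, not its result
class(cl)                    # "call"

# Folding a vector into a matrix by setting its dim attribute
counts <- 1:6
dim(counts) <- c(2, 3)       # 2 rows, 3 columns
is.matrix(counts)            # TRUE

# A data.frame allows columns of different types
samples <- data.frame(id = c("S1", "S2"), reads = c(1200L, 950L),
                      stringsAsFactors = FALSE)
sapply(samples, class)
```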

       Use of S3 and S4 Classes in R and Bioconductor

      In the S3 system, objects are assigned to classes by simply specifying the class name as the object class attribute. In the S3 system, class hierarchy is defined at the object level, generic methods are ordinary functions, and specific methods are defined by the class of the object. Because of this simplicity, most classes in R are S3 classes. The problem is that any object can be assigned to any class, as the example in Supplemental Figure S1 demonstrates, where an integer vector is transformed into a character vector, potentially causing errors in downstream calculations.
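      The lack of checking in S3 can be sketched as follows with base R only. The class name variant and the objects shown are hypothetical and serve only to illustrate that the class attribute is set without any validation.

```r
# Changing the class attribute of an integer vector silently converts it
x <- c(1L, 6L, 10L)
class(x) <- "character"
x                                   # now a character vector

# Downstream arithmetic fails only when it is attempted
res <- tryCatch(sum(x), error = function(e) conditionMessage(e))
res

# A user-defined S3 class is just a label; the generic print() dispatches
# to print.variant() because of the class attribute
variant <- list(gene = "KRAS", change = "G12D")
class(variant) <- "variant"         # hypothetical class name
print.variant <- function(x, ...) cat(x$gene, x$change, "\n")
print(variant)

# Nothing prevents attaching the same label to an unrelated object
not_a_variant <- 42
class(not_a_variant) <- "variant"   # accepted without complaint
```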
      In the S4 system, the data are contained in slots, which have properties specified in the class definition. In addition, applicable methods and inheritance relationships are explicitly defined and validated during class construction or object instantiation. An example is the S4 class GRanges, from the package GenomicRanges (Lawrence M. et al., Software for computing and annotating genomic ranges), which has considerable application in bioinformatics to represent genomic regions [eg, annotation data, such as gene, transcript, or single-nucleotide polymorphism locations; or sequencing data, such as the locations of aligned reads or single-nucleotide variants] (Figure 1). A GRanges class object contains the following slots: i) ranges, an IRanges object (which normally represents the start and end positions in a genome); ii) strand, strand designation in a Strand class run-length encoding object with possible values of +, -, and * (missing); iii) seqnames, a run-length encoding object, generally the chromosome names; iv) seqinfo, a Seqinfo object with information relevant to the genomic context of the sequence [eg, chromosome lengths (seqlengths), circularity flag, or genome]; this information is usually displayed as a footnote to the range table; and v) elementMetadata, a DataFrame object with an arbitrary number of columns (eg, to represent different annotations about each genomic region). When printing the GRanges object, the metadata are separated from the ranges information by the pipe character |.
      Figure 1. Generating and manipulating S4 classes: GRanges example. Blue text indicates code, green text indicates comments, and black text indicates output.
      S4 classes are instantiated with functions, called constructors, which normally have the same name as the class (Figure 1). During instantiation, a validation process ensures objects are assigned to appropriate classes. For example, IRanges(start = c(1, 6, 10), end = c(4, 9, 15)) succeeds with numeric input, whereas supplying character vectors for the start and end arguments results in an error. This built-in error prevention feature of S4 classes makes them preferable for clinical bioinformatics pipelines.
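      The validation step can be sketched with the methods package that ships with R. The class SimpleRange below is hypothetical and much simpler than IRanges; it only illustrates how slot types and a validity function reject malformed objects at instantiation.

```r
library(methods)

# Hypothetical S4 class for illustration; it is not part of any package
setClass("SimpleRange",
         representation(start = "integer", end = "integer"),
         validity = function(object) {
           if (length(object@start) != length(object@end))
             return("start and end must have the same length")
           if (any(object@end < object@start))
             return("end must not be smaller than start")
           TRUE
         })

# Constructor named after the class, as is conventional
SimpleRange <- function(start, end) new("SimpleRange", start = start, end = end)

ok <- SimpleRange(c(1L, 6L, 10L), c(4L, 9L, 15L))

# Character input violates the declared slot type and is rejected
bad_type <- tryCatch(SimpleRange(c("1", "6"), c("4", "9")),
                     error = function(e) "rejected")

# A range that ends before it starts fails the validity function
bad_range <- tryCatch(SimpleRange(5L, 2L), error = function(e) "rejected")
```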
      Class expansion and inheritance rules ensure that developers can generate variations of established classes (eg, by generating new slots and accessor functions), while maintaining all the methods associated with the parent class. Generic methods are nonstandard functions with metadata established at the package level, which defines how the functions should be applied to the various classes and subclasses of the arguments. For example, the accessor method strand extracts the strand information from GRanges, GRangesList, DelegatingGenomicRanges, GNCList, GPos, and other object classes related to genomic ranges. A common method used to identify overlapping regions in two GRanges objects (eg, aligned reads and gene locations) is findOverlaps, which is inherited from IRanges with additional features specific for the GRanges class, such as the ignore.strand parameter, which controls whether strand orientation is taken into consideration.
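      A short sketch of these methods, assuming the GenomicRanges package is installed. The coordinates, strands, and the gene label geneA are invented for illustration.

```r
library(GenomicRanges)

# Two aligned reads and one gene on the same chromosome (invented coordinates)
reads <- GRanges(seqnames = c("chr12", "chr12"),
                 ranges   = IRanges(start = c(100, 500), end = c(150, 550)),
                 strand   = c("+", "-"))
genes <- GRanges(seqnames = "chr12",
                 ranges   = IRanges(start = 120, end = 520),
                 strand   = "-",
                 gene     = "geneA")   # extra arguments become metadata columns

strand(reads)                          # accessor for the strand slot

# By default, ranges on opposite strands are not reported as overlapping
hits_stranded <- findOverlaps(reads, genes)

# ignore.strand = TRUE disregards strand orientation
hits_any <- findOverlaps(reads, genes, ignore.strand = TRUE)

length(hits_stranded)
length(hits_any)
```

      Here both reads overlap the gene by position, but only the read on the minus strand is reported unless strand is ignored.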
      The package BiocGenerics (Huber W. et al., Orchestrating high-throughput genomic analysis with Bioconductor), installed by default with Bioconductor, promotes several methods (eg, start, end, and width) from base R to S4 generic methods and generates additional generic S4 methods, such as annotation, dbconn, fileName, normalize, organism, species, plotMA, plotPCA, strand, and updateObject, that are used in various Bioconductor packages.
      Class properties can be updated in new versions. This requires the developer to provide an updateObject implementation, so saved objects constructed with the previous version can be used by the new version. This functionality is important in clinical bioinformatics when previous data stored in outdated objects require reanalysis (eg, because of identification of errors or availability of new knowledge).

      Useful Features of Bioconductor for Clinical Bioinformatics Development

      Bioconductor is a large repository of bioinformatics-related R packages (totaling 1649 packages in Bioconductor release version 3.8, December 2018), originally focused on microarray analysis and annotation, but currently also having a wide scope of tools for analysis and annotation of other high-throughput data, such as NGS, flow cytometry, proteomics, and metabolomics. There are three main repositories in Bioconductor: Software, containing the analytic tools; AnnotationData, containing annotation databases for microarray probes and genomic features; and ExperimentData, containing a variety of experimental data sets. The biocViews package allows automated classification, searching, and visualization of relationships of package metadata and dependencies using a network-graph based approach. Bioconductor packages can be installed using the install() function from the BiocManager package, which is itself installed from CRAN: install.packages("BiocManager"); BiocManager::install("GenomicRanges").
      If the package name is omitted, BiocManager::install() installs the default set of 23 or so basic packages (if not already installed) and updates any other Bioconductor packages that may already be installed. The option update = FALSE can be used in clinical bioinformatics pipelines to ensure that other packages are not automatically updated, which would require more extensive revalidation. The BiocManager::valid() function lists all of the Bioconductor packages installed and checks which ones are consistent with the versions of R and Bioconductor in use, producing a list of packages that are out of date [ie, needing update by install() or by setting the fix=TRUE option of valid()]. This function also lists those packages that are of a higher version than the installed R and Bioconductor.
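      A minimal installation sketch follows, assuming internet access and a current R installation. GenomicRanges is used only as an example package name.

```r
# Install the package manager from CRAN if it is not yet available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install one package without updating packages that are already installed
BiocManager::install("GenomicRanges", update = FALSE)

# Report the Bioconductor release in use
BiocManager::version()

# Check whether installed packages are consistent with that release
BiocManager::valid()
```

      In a validated pipeline, the output of BiocManager::version() and utils::sessionInfo() can be stored with each run to document the exact software versions used.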
      There are several advantages of the Bioconductor repository for bioinformatics pipelines:
      i) Interoperability is encouraged through nested reuse of data representations and analysis concepts across packages, as highlighted by the extensive use of the S4 class system, a well-maintained dependency system, and a coordinated effort to develop and update packages on a regular basis.
      ii) Production and development versions of Bioconductor are both updated on a biannual basis. Before release, the development version is extensively peer reviewed and tested for intercompatibility of the various packages.
      iii) The use of the GitHub source control system provides a transparent web browser–accessible repository of software code with access and version control, bug tracking, feature request, README and wiki documentation, and task management systems for every project. Contributors can request additions or modifications to the source code through pull requests. These are reviewed by the package maintainer(s) and other users through a comment thread; and once approved, the modifications are merged into the repository.
      iv) A large community of users is available, who often contribute innovative approaches to the development of the tools and provide education and support to solve issues with use of the software (eg, through wiki sites for educational and reference manuals, forums or comment threads for addressing troubleshooting and other user questions, and issue tracking and feature request systems to address bugs and other software limitations). Interestingly, the use of good scientific, statistical, and coding practices is highly encouraged in the open-source, social programming environments.
      v) All Bioconductor packages are supplied with at least one instructional document (called vignette), which typically introduces the user to the package, describes its main functionalities, and provides use cases, often with data included in the package contents. As with all R packages, a detailed reference manual is included to describe all the user-level functions in the package.
      vi) Support for reproducible research, the concept that computational processing of data should achieve the same results when the same data and software pipelines are used at different times or by other users, is highly encouraged. Especially in the context of clinical bioinformatics pipelines, reproducibility is paramount and a critical aspect of pipeline validation. General features of Bioconductor packages supporting reproducibility include the availability of test data sets within the package and/or sister data packages and documentation of functions, scripts, and test data set outputs using markdown documents, where executable code and documentation can be interspersed.

      Examples of Bioconductor Packages and Classes Useful in Clinical Bioinformatics

       Sequence Objects

      For sequence analysis, in addition to the sequence locations, provided by the GRanges family of classes in the package GenomicRanges described above, it is important to have objects to store the sequences themselves, with methods to manipulate them. The package Biostrings provides a virtual class XString and a series of subclasses (BString, DNAString, RNAString, and AAString) to store and manipulate biological sequences. For example, the DNAString class only allows characters from the International Union of Pure and Applied Chemistry extended genetic alphabet plus the gap (-) and masking (+) characters. Supplemental Figure S2 shows additional examples of methods to handle biological sequences.
      A set of XString objects of the same class can be stored as a subclass of XStringSet (eg, a series of DNAStrings can be stored as a DNAStringSet). Read functions, such as readBStringSet, readDNAStringSet, readRNAStringSet, or readAAStringSet, can be used to extract sequences from FASTA and FASTQ formats.
      A particularly important implementation of the Biostrings class for clinical bioinformatics is the BSgenome class. Multiple genomes are available as BSgenome objects through the various BSgenome data packages [eg, the human GRCh37 (hg19) genome assembly can be loaded with library(BSgenome.Hsapiens.UCSC.hg19)]. In addition to methods such as seqnames and seqlengths, usually used to describe chromosomes, as a subclass of Biostrings, the BSgenome class can use all of the generic methods of Biostrings (Supplemental Figure S2).
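      The sequence classes described in this section can be sketched as follows, assuming the Biostrings package is installed. The sequences and read names are invented and are not those shown in Supplemental Figure S2.

```r
library(Biostrings)

# IUPAC ambiguity letters and the gap character are accepted
dna <- DNAString("ATGGCCTTN-")
reverseComplement(dna)

# Letters outside the DNA alphabet are rejected at construction
bad <- tryCatch(DNAString("ATGJ"), error = function(e) "rejected")

# A set of sequences of the same class
seqs <- DNAStringSet(c(read1 = "ACGTAC", read2 = "GGGTTTAA"))
width(seqs)                          # sequence lengths
subseq(seqs, start = 2, end = 4)     # extract positions 2 to 4 of each sequence
letterFrequency(seqs, "GC")          # count of G or C per sequence

# Translation of a coding sequence into amino acids
protein <- translate(DNAString("ATGGCC"))
protein
```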

       Annotation Objects

      The need for annotations in clinical bioinformatics is ever growing, as new information is associated with genomic sequences and variants. Annotation packages use the virtual class AnnotationDb from the AnnotationDbi package as a parent class for generating annotation database classes specific to the type of data. Examples of AnnotationDb objects are listed in Table 2.
      Table 2. Examples of AnnotationDb Objects
      Level | Package | Description
      Gene centric
      Organism | org.Hs.eg.db | OrgDb class object containing annotations pertaining to an organism (Hs for Homo sapiens), where eg represents NCBI Entrez Gene–based annotations. This package includes mapping databases between Entrez ID and the KEGG, ENSEMBL, UniProt, and GO databases (for mapping between Entrez ID and ENSEMBL transcripts, use org.Hs.egENSEMBLTRANS).
       | Homo.sapiens | Similar to org.Hs.eg.db without the GO annotations
      Platform | hgu133plus2.db, hgu133plus2probe, hgu133plus2cdf | ChipDb class objects containing annotations for microarray chips
      Homology | hom.Hs.inp.db | InparanoidDb class object
      Systems biology | GO.db | GO annotations
       | reactome.db | Pathway-based annotations
       | KEGG.db | Pathway-based annotations
      Genome centric
      Chromosome | GenomeInfoDb | Supplies chromosome information and Seqinfo objects
      Transcriptome | TxDb.Hsapiens.UCSC.hg19.knownGene | UCSC-generated transcript annotations
       | ensembldb, EnsDb.Hsapiens.v86 | ENSEMBL transcript annotations
      miRNA | mirbase.db | miRNA annotations
      Nucleotide | SNPlocs.Hsapiens.dbSNP151.GRCh38, SNPlocs.Hsapiens.dbSNP144.GRCh37 | Store SNP locations from dbSNP in an SNPlocs S4 object with methods to extract SNPs by SNP ID, chromosome, or overlap with genomic regions
      Generic genome features | GenomicFeatures | GenomicFeatures can generate generic feature databases. For example, makeFeatureDbFromUCSC will generate a FeatureDb object from various types of annotation tracks at the UCSC genome browser.
      Disease centric
      Disease ontology | DO.db | Disease ontology terms, a subset of the Unified Medical Language System of the National Library of Medicine.
      GO, Gene Ontology; ID, identifier; NCBI, National Center for Biotechnology Information; SNP, single-nucleotide polymorphism; UCSC, University of California, Santa Cruz.
      The AnnotationDb classes have a conn slot for establishing a connection to an SQLite database that contains the annotation data, as well as slots for packageName, user sequence levels, and isActiveSeq. Generic methods include the following: i) columns to specify which fields to query; ii) keytypes to specify which fields can be used to identify records for retrieval; iii) keys to retrieve the primary keys of the records in the database; and iv) select to perform a SELECT query on the database, using keys and keytype to specify which records to retrieve, and columns to specify which fields to display.
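      The four generic methods can be sketched with the org.Hs.eg.db package, assuming it is installed. The gene symbols KRAS and TP53 are chosen only as examples, and the values returned depend on the installed version of the annotation package.

```r
library(org.Hs.eg.db)   # loading this package also attaches AnnotationDbi

# Fields that can be retrieved, and fields that can be used as keys
head(columns(org.Hs.eg.db))
head(keytypes(org.Hs.eg.db))

# Keys of a given type stored in the database
head(keys(org.Hs.eg.db, keytype = "SYMBOL"))

# Query: look up Entrez identifiers and gene names by gene symbol.
# The namespace prefix avoids clashes with other packages that define select()
res <- AnnotationDbi::select(org.Hs.eg.db,
                             keys    = c("KRAS", "TP53"),
                             keytype = "SYMBOL",
                             columns = c("ENTREZID", "GENENAME"))
res
```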
      For example, a connection to the human known transcript database [from the University of California, Santa Cruz (UCSC), genome browser knownGene table] can be established with the package TxDb.Hsapiens.UCSC.hg19.knownGene (Figure 2).
      Figure 2. Examples of using annotation databases for human transcripts. Blue text indicates code, green text indicates comments, and black text indicates output.
      The object txdb of class TxDb contains a conn slot of the SQLiteConnection class that allows one to perform operations on the transcript SQLite database that is stored locally, without having to load it entirely in memory. The TxDb database stores several fields for each transcript (Figure 2). These fields represent the transcript, exon, and coding sequence start and end positions on the genome, together with their names and strand information. For transcripts with multiple exons, the exon rank is also stored.
      AnnotationDb annotation packages are not necessarily updated with the latest source data, which is helpful for the stability of bioinformatics pipelines; however, because the updates are dependent on the efforts of the package maintainers, some may become too outdated. For example, the current versions of TxDb.Hsapiens.UCSC.hg19.knownGene and Homo.sapiens were generated in 2015, whereas the latest org.Hs.eg.db is from 2018. If the user is interested in using the latest version of the annotations, the source data can be downloaded and AnnotationDb packages and objects can be easily generated with functions available on the AnnotationDbi package. Similarly, the transcript database can be directly updated from the UCSC genome browser site (https://genome.ucsc.edu, last accessed November 7, 2019) into a TxDb object using makeTxDbFromUCSC from the GenomicFeatures package.
      (Lawrence M. et al., Software for computing and annotating genomic ranges.)
      In addition to the packages described above, several tools have been developed to interface with the various online databases useful for annotation of genomic data (Supplemental Table S3A).

       Experiment Objects

      In clinical bioinformatics, it is paramount to keep good records of sample identifiers, assay conditions, and other metadata in addition to the assay results. In addition, it may be important to aggregate results from several experiments. The following classes of objects form a common basis for storing multiple experiment data and metadata in Bioconductor: The SummarizedExperiment class from the SummarizedExperiment package stores assays as a matrix-like object where columns are samples and rows are features. Data from various assays (eg, RNASeq) can be accessed with assays (where the samples are the columns, and each transcript abundance is a different row). Sample metadata are stored in the colData slot, where samples are rows and sample metadata, such as identifier and phenotype, are columns. The rowData slot is a DataFrame object with rows parallel to the rows in the assays slot describing the features (eg, probe identifiers) in each assay. The RangedSummarizedExperiment class from the same package is similar, except that assay row features are represented by a GRanges object (eg, of genomic positions of each transcript, accessible by the rowRanges function). The MultiAssayExperiment package builds on SummarizedExperiment but allows for greater complexity. Equivalent to the assays slot is the ExperimentList object, where columns are samples and rows are features (either identifier or range based). The colData slot stores information about patients (eg, demographics and clinical data) and can be related to the sample names in the ExperimentList by another slot, the sampleMap, where patients may have a one-to-many relationship with samples. A similar framework is proposed in the package MultiDataSet with stricter requirements for the assays slot.
      The multisample structures described above provide a coordinated way to query, subset, and perform statistics on results from multiple samples. For example, to obtain expression levels for KRAS across all samples, one could subset a MultiAssayExperiment object named mae.obj with mae.obj["KRAS", , c("rnaseq1", "rnaseq2")], where the first argument is the feature identifier, the second is the selected sample names (if omitted, all samples), and the third is the assay name(s). These packages provide additional functions for manipulating experimental data, such as detecting and eliminating duplicated entries, finding cases with incomplete experiments, and combining different MultiAssayExperiment objects. The MultiAssayExperiment package is particularly useful to perform analysis across different assays (eg, correlating mutations and mRNA expression and other multi-omics approaches).
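      The layout of a SummarizedExperiment object can be sketched as follows, assuming the SummarizedExperiment package is installed. The count values, sample names, and phenotypes are invented for illustration.

```r
library(SummarizedExperiment)

# Invented count matrix: rows are features (genes), columns are samples
counts <- matrix(c(10L, 0L, 25L, 3L, 7L, 12L), nrow = 3,
                 dimnames = list(c("KRAS", "TP53", "EGFR"), c("S1", "S2")))

# Sample metadata: one row per sample (column of the assay)
col_data <- DataFrame(phenotype = c("tumor", "normal"),
                      row.names = c("S1", "S2"))

# Feature metadata: one row per feature (row of the assay)
row_data <- DataFrame(symbol = rownames(counts),
                      row.names = rownames(counts))

se <- SummarizedExperiment(assays  = list(counts = counts),
                           rowData = row_data,
                           colData = col_data)

# Expression of one gene across all samples
assay(se, "counts")["KRAS", ]

# Subsetting keeps assay data and metadata synchronized
se_tumor <- se[, se$phenotype == "tumor"]
dim(se_tumor)
colData(se_tumor)
```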

      Examples of Tools for Genomic and Transcriptomic Pipelines

      A typical pipeline for analysis of genomic sequence variation (Roy S. et al., Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipelines: a Joint Recommendation of the Association for Molecular Pathology and the College of American Pathologists; Gargis A.S. et al., Good laboratory practice for clinical next-generation sequencing informatics pipelines; Oliver G.R. et al., Bioinformatics for clinical next generation sequencing) is illustrated in Figure 3 and consists of a primary stage of sequence production, a secondary stage of read alignment and variant calling, a tertiary stage of variant interpretation and reporting, and a fourth stage where data from multiple patients are aggregated and used for improving knowledge. RNASeq pipelines (Figure 4) follow similar steps but have specific requirements for alignment to accommodate RNA splicing gaps, may use additional filtering steps (eg, to remove ribosomal RNA), and add other levels of analysis to derive biological and clinically significant information from qualitative (eg, fusions and alternative splicing) and quantitative changes in RNA expression. The following is a brief description of R and Bioconductor packages that may be useful in the various stages of NGS pipelines.
      Figure 3. Common steps used in a DNA next-generation sequencing (NGS) pipeline. Common file types used to store data in between steps or as outputs are shown as text inserts in the arrows. The first stage (in light blue) transforms primary data from the sequencing sensors into separate reads containing sequence and base quality information (FASTQ files). In the second stage (light green), reads are processed and aligned to the reference genome and variants are called where the reads differ from the reference. In the third stage (light brown), the variants are classified and prioritized according to clinical significance, and the clinical report is issued. A fourth stage (in pink) can also be considered, where information from several patients is stored and agglomerated and further used to improve knowledge-based curation of subsequent samples. CNV, copy number variation; HGVS, Human Genome Variation Society; HIS, hospital information system; indel, insertion and deletion; QC, quality control; SNV, single-nucleotide variation; SV, structural variant; VCF, variant call format.
      Figure 4. Common steps in RNASeq analysis. Common file types used to store data in between steps or as outputs are shown as text inserts in the arrows. The pipeline stages include the same steps described for DNA next-generation sequencing in Figure 3, with the addition of exon junction and alternative splicing analysis, as well as quantitative analysis of gene expression, starting with the assignment of read counts to genomic features, such as genes, transcripts, or exons (Count Features), followed by statistical analysis of differential gene expression (DE). Alternatively, alignment-free algorithms can be used to map reads directly to transcripts before DE. A common object class for storing DE information is DESeqDataSet from the DESeq2 package. Finally, various system biology approaches integrate genomic and transcriptomic analyses to provide insights into regulatory networks and pathway dysregulation and may use classifiers and other machine-learning algorithms to provide predictive models for clinical use. VCF, variant call format.

       Primary Stage

      The primary stage involves converting data from the sensors [eg, images from Illumina (San Diego, CA) instruments and ion current signals from Ion Torrent (Thermo Fisher Scientific, Waltham, MA) sequencers] to a sequence of base calls (reads). This primary stage is virtually always performed using software supplied by the sequencing instrument manufacturers. For example, the Illumina bcl2fastq software processes base call files (bcl) to FASTQ files using parameters from runParameter.xml and a SampleSheet.csv list of sample names and indexes. The package basecallQC provides additional utilities to clean the sample sheets, convert between sample sheets from different bcl2fastq versions, generate base masks for use with the bcl2fastq software, and provide summary tables and reports from bcl2fastq output files. Similarly, the package IONiseR can read FAST5 files from the Oxford Nanopore Technologies (Oxford, UK) MinION sequencer, convert them to FASTQ, and perform various statistics before and after base calls. The Illumina Sequence Analysis Viewer is a Windows application that produces real-time graphs during basecalling; the package savR can parse the binary output of the Sequence Analysis Viewer and generate several quality assessment plots.

       Read Processing

       Input

      Several packages have functions to load and manipulate read files of FASTA and FASTQ formats (Table 1). The ShortRead package (Morgan M. et al., ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data) can read and import FASTQ files into a ShortReadQ object containing a DNAStringSet slot for sequence data, a FastqQuality slot for base quality encodings, and an id slot for the read identifiers. For large FASTQ files, the FastqStreamer function can split the reads into chunks that fit in memory, whereas the FastqSampler function can extract a random sample of the reads in the FASTQ file. Similarly, rSFFreader can read and process SFF files produced by Ion Torrent and 454 Roche/Life Sciences instruments, with functions to load sequences, quality scores, and flowgram information from these files into a ShortReadQ-like object.
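      Reading and streaming a FASTQ file can be sketched as follows, assuming the ShortRead package is installed. The two reads are invented and written to a temporary file so that the example does not depend on any external data; a chunk size of one read is used only to make the streaming visible.

```r
library(ShortRead)

# Write a tiny, invented FASTQ file to a temporary location
fq_file <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGTACGTAC", "+", "IIIIIIII##",
             "@read2", "GGNTTACCAG", "+", "IIIIIIIIII"), fq_file)

# Load the whole file into a ShortReadQ object
fq <- readFastq(fq_file)
sread(fq)      # sequences
quality(fq)    # base quality encodings
id(fq)         # read identifiers

# Stream the file in chunks instead of loading it at once
strm <- FastqStreamer(fq_file, n = 1)   # one read per chunk, for illustration
n_chunks <- 0
while (length(chunk <- yield(strm)) > 0) {
  n_chunks <- n_chunks + 1              # process each chunk here
}
close(strm)
n_chunks
```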

       Read Pruning and Other String Manipulations

      Before further analysis, reads in FASTQ files may be processed by trimming ends with low-quality base calls or removing specific sequences, such as adapters used during library construction. Generally, primer sequences corresponding to genomic target regions are not removed because they facilitate alignment, but are subsequently marked as soft clipped in the aligned reads. For example, the ShortRead package extends Biostrings methods to process strings [eg, trimTails (trim if k nucleotides < threshold quality), trimTailw (trim when k nucleotides in a predefined window have quality encoding < threshold), narrow (trims between desired start and end), reverse, and reverseComplement]. Some aligner packages can also manipulate FASTQ sequence strings (eg, the Rbowtie2 package has identify_adapters and remove_adapters functions to connect to the AdapterRemoval tool [Schubert M. et al., AdapterRemoval v2: rapid adapter trimming, identification, and read merging]).
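      Quality trimming with ShortRead can be sketched as follows, assuming the package is installed. The reads and quality strings are invented, and the quality threshold "4" and k = 1 are arbitrary choices that would need to be set according to the validated requirements of each assay.

```r
library(ShortRead)

# Two invented reads; the first ends in two low-quality ('#') base calls
fq <- ShortReadQ(sread   = DNAStringSet(c("ACGTACGTAC", "GGATTACCAG")),
                 quality = FastqQuality(c("IIIIIIII##", "IIIIIIIIII")),
                 id      = BStringSet(c("read1", "read2")))

# Trim the tail of each read once k base calls fall at or below quality 'a'
trimmed <- trimTails(fq, k = 1, a = "4")
width(trimmed)

# Keep only positions 2 to 8 of every read
narrowed <- narrow(fq, start = 2, end = 8)
width(narrowed)

# Reverse complement of the sequences
reverseComplement(sread(fq))
```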

       Read Demultiplexing

      When DNA from different patients is labeled with oligonucleotide barcodes during library construction and pooled before multiplex analysis, demultiplexing is used to associate each read with a specific patient sample. An additional level of demultiplexing can be used when, in addition to patient-specific barcodes, unique molecule indexes are used to identify each molecule of DNA, which reduces the number of ligation and PCR artifacts and improves quantification of amplified DNA. The demultiplex function from the DNABarcodes package can perform demultiplexing on FASTQ files, and the package has additional functions for analyzing barcode properties and generating sets of barcodes with defined error correction properties. Rbowtie2::remove_adapters and easyRNASeq::demultiplex can also perform demultiplexing. Although designed for single-cell RNASeq, the scruff package has a demultiplex function that can separate barcodes and unique molecule index sequences in each read.

       Read Filtering

      It may be of interest to filter low-quality reads from FASTQ files before alignment. The ShortRead::filterFastq function removes reads according to custom or built-in filter functions, such as idFilter, chromosomeFilter, positionFilter, strandFilter, occurrenceFilter (elements occurring ≥ minimum and ≤ maximum times), nFilter (<N elements), polynFilter, dustyFilter (dustyScore filter for complexity), srduplicated (PCR duplicates), and srdistanceFilter (sequences with high edit distances).
      • Morgan M.
      • Anders S.
      • Lawrence M.
      • Aboyoun P.
      • Pagès H.
      • Gentleman R.
      ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data.
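      The notion of combining several read filters can be sketched in base R as follows. This is only an illustration of the concept: the filter constructors (n_filter, mean_quality_filter, compose_filters), the thresholds, and the toy reads are assumptions and do not reproduce the ShortRead functions listed above.

```r
# Each filter constructor returns a function that maps a data frame of reads
# to a logical vector (TRUE = keep the read).
n_filter <- function(max_n = 0L) {
  function(reads) {
    n_count <- nchar(reads$bases) - nchar(gsub("N", "", reads$bases, fixed = TRUE))
    n_count <= max_n
  }
}

mean_quality_filter <- function(min_mean = 20) {
  function(reads) {
    vapply(reads$qual,
           function(q) mean(utf8ToInt(q) - 33L) >= min_mean,
           logical(1), USE.NAMES = FALSE)
  }
}

# Combine filters: a read is kept only if every filter accepts it.
compose_filters <- function(...) {
  filters <- list(...)
  function(reads) Reduce(`&`, lapply(filters, function(f) f(reads)))
}

# Toy reads with Phred+33 qualities ("I" = 40, "#" = 2).
reads <- data.frame(
  bases = c("ACGTACGT", "ACNNACGT", "ACGTACGT"),
  qual  = c("IIIIIIII", "IIIIIIII", "########"),
  stringsAsFactors = FALSE
)
keep <- compose_filters(n_filter(0L), mean_quality_filter(20))(reads)
filtered <- reads[keep, ]
print(filtered)
```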

       Read Statistics and Quality Assessment

      In addition to basic quality assessment reports provided by the instrument manufacturer, certain applications require more in-depth analyses of read quality, which are available in several packages (Supplemental Table S3B).

       Alignment

      After processing, reads in FASTQ files are aligned to the reference genome, usually using one of several aligner tools written in languages other than R. However, interfaces to aligner tools, such as BWA,
      • Li H.
      • Durbin R.
      Fast and accurate short read alignment with Burrows-Wheeler transform.
      Bowtie,
      • Langmead B.
      • Trapnell C.
      • Pop M.
      • Salzberg S.L.
      Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.
      Bowtie 2,
      • Langmead B.
      • Salzberg S.L.
      Fast gapped-read alignment with Bowtie 2.
      GSNAP,
      • Wu T.D.
      • Nacu S.
      Fast and SNP-tolerant detection of complex variants and splicing in short reads.
      ,
      • Wu T.D.
      • Reeder J.
      • Lawrence M.
      • Becker G.
      • Brauer M.J.
      GMAP and GSNAP for genomic sequence alignment: enhancements to speed, accuracy, and functionality.
      STAR,
      • Dobin A.
      • Davis C.A.
      • Schlesinger F.
      • Drenkow J.
      • Zaleski C.
      • Jha S.
      • Batut P.
      • Chaisson M.
      • Gingeras T.R.
      STAR: ultrafast universal RNA-seq aligner.
      TopHat,
      • Trapnell C.
      • Pachter L.
      • Salzberg S.L.
      TopHat: discovering splice junctions with RNA-Seq.
      HISAT2,
      • Kim D.
      • Langmead B.
      • Salzberg S.L.
      HISAT: a fast spliced aligner with low memory requirements.
      and Subread,
      • Liao Y.
      • Smyth G.K.
      • Shi W.
      The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.
      have been developed in R (Supplemental Table S3C).

       Alignment Processing

      There are several Bioconductor packages that can perform various alignment-processing steps, often by using R connections (wrappers) to run external software, such as the well-established sets of utilities SAMtools,
      • Li H.
      • Handsaker B.
      • Wysoker A.
      • Fennell T.
      • Ruan J.
      • Homer N.
      • Marth G.
      • Abecasis G.
      • Durbin R.
      1000 Genome Project Data Processing Subgroup
      The Sequence Alignment/Map format and SAMtools.
      Picard (http://broadinstitute.github.io/picard, last accessed April 29, 2019), BamUtils,
      • Breese M.R.
      • Liu Y.
      NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets.
      and Genome Analysis Tool Kit,
      • McKenna A.
      • Hanna M.
      • Banks E.
      • Sivachenko A.
      • Cibulskis K.
      • Kernytsky A.
      • Garimella K.
      • Altshuler D.
      • Gabriel S.
      • Daly M.
      • DePristo M.A.
      The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
      which are used to load, manipulate, and assess the quality of sequence alignments (Supplemental Table S3D).

       Alignment Coverage and Quality Assessment

      Determining the number of reads that align to specific genomic regions (coverage) is important for the following: i) quality assessment to identify problems in sequencing or alignment, ii) identifying variant allelic fractions by examining coverage at each nucleotide position (pileup), and iii) as the initial step to quantify genomic copy numbers in DNA NGS and gene expression in RNASeq. Particularly for RNASeq expression analysis, coverage can be related to genes (the full genomic region from start to end of each gene), exonic genes (only the genomic regions corresponding to exons), coding sequences, transcripts (possibly as different combinations of exons because of alternative splicing), exons, and nonoverlapping exons (ie, coverage of overlapping regions is counted only once). Packages for alignment coverage and quality assessment are listed in Supplemental Table S3E.
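      The two notions of coverage used above, per-nucleotide depth and the number of reads overlapping a feature, can be illustrated with a small base-R sketch. The read coordinates and the feature at positions 4 to 6 are hypothetical, and the helper per_base_coverage is an assumption for illustration; on real alignments the packages in Supplemental Table S3E would be used instead.

```r
# Per-base coverage over positions 1..region_length from 1-based, inclusive
# read intervals, using a difference array followed by a cumulative sum.
per_base_coverage <- function(start, end, region_length) {
  delta <- integer(region_length + 1L)
  for (i in seq_along(start)) {
    delta[start[i]] <- delta[start[i]] + 1L
    delta[end[i] + 1L] <- delta[end[i] + 1L] - 1L
  }
  cumsum(delta)[seq_len(region_length)]
}

# Hypothetical alignments on a 10-base region.
reads <- data.frame(start = c(1L, 3L, 3L, 6L), end = c(5L, 7L, 4L, 10L))
depth <- per_base_coverage(reads$start, reads$end, region_length = 10L)
print(depth)

# Number of reads overlapping a feature (for example, an exon at positions 4..6).
overlaps_feature <- sum(reads$start <= 6L & reads$end >= 4L)
print(overlaps_feature)
```

      Per-base depth supports pileup-style questions such as allelic fractions, whereas the overlap count corresponds to the feature-level counts used for copy number and expression quantification.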

       Variant Calling

      After alignment, differences between the aligned reads and the reference genome are called variants and can be classified as single-nucleotide variants, small deletions and insertions (indels), copy number variants (CNVs), and structural variants (fusions, translocations, and large deletions and insertions), and are usually reported as variant call format (VCF) files. Variants are determined by several variant callers (eg, Strelka2,
      • Kim S.
      • Scheffler K.
      • Halpern A.L.
      • Bekritsky M.A.
      • Noh E.
      • Kallberg M.
      • Chen X.Y.
      • Kim Y.
      • Beyter D.
      • Krusche P.
      • Saunders C.T.
      Strelka2: fast and accurate calling of germline and somatic variants.
      VarScan2,
      • Koboldt D.C.
      • Zhang Q.Y.
      • Larson D.E.
      • Shen D.
      • McLellan M.D.
      • Lin L.
      • Miller C.A.
      • Mardis E.R.
      • Ding L.
      • Wilson R.K.
      VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.
      MuTect,
      • Cibulskis K.
      • Lawrence M.S.
      • Carter S.L.
      • Sivachenko A.
      • Jaffe D.
      • Sougnez C.
      • Gabriel S.
      • Meyerson M.
      • Lander E.S.
      • Getz G.
      Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples.
      Genome Analysis Tool Kit,
      • McKenna A.
      • Hanna M.
      • Banks E.
      • Sivachenko A.
      • Cibulskis K.
      • Kernytsky A.
      • Garimella K.
      • Altshuler D.
      • Gabriel S.
      • Daly M.
      • DePristo M.A.
      The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
      and SAMtools
      • Li H.
      • Handsaker B.
      • Wysoker A.
      • Fennell T.
      • Ruan J.
      • Homer N.
      • Marth G.
      • Abecasis G.
      • Durbin R.
      1000 Genome Project Data Processing Subgroup
      The Sequence Alignment/Map format and SAMtools.
      ) that have varying accuracy for indels and low variant frequency somatic variants. More specialized tools, such as Pindel,
      • Ye K.
      • Schulz M.H.
      • Long Q.
      • Apweiler R.
      • Ning Z.
      Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads.
      Dindel,
      • Albers C.A.
      • Lunter G.
      • MacArthur D.G.
      • McVean G.
      • Ouwehand W.H.
      • Durbin R.
      Dindel: accurate indel calls from short-read data.
      INDELseek,
      • Au K.F.
      • Jiang H.
      • Lin L.
      • Xing Y.
      • Wong W.H.
      Detection of splice junctions from paired-end RNA-seq data by SpliceMap.
      PRISM,
      • Jiang Y.
      • Wang Y.
      • Brudno M.
      PRISM: pair-read informed split-read mapping for base-pair level detection of insertion, deletion and structural variants.
      and Amplicon Indel Hunter,
      • Kadri S.
      • Zhen C.J.
      • Wurst M.N.
      • Long B.C.
      • Jiang Z.-F.
      • Wang Y.L.
      • Furtado L.V.
      • Segal J.P.
      Amplicon indel hunter is a novel bioinformatics tool to detect large somatic insertion/deletion mutations in amplicon-based next-generation sequencing data.
      are specifically focused on indels and other complex variants. Bioconductor pipelines can be established with calls to these external variant callers (Supplemental Table S3F), although most variant analysis packages focus on performing various manipulations and statistics on existing VCF files (see below).
      The VariantTools package has a function, callVariants, that is a wrapper for various functions to call variants from BAM files that apply consecutive steps: i) perform a pileup using tallyVariants with TallyVariantsParam parameters; ii) call variants using FilterRules, defined by VariantCallingFilters; iii) mark variants for soft filtering with qaVariants using FilterRules from VariantQAFilters; and iv) after variants are called, apply FilterRules, defined by VariantPostFilters with the function postFilterVariants. These steps can be applied independently and parameters can be tailored to optimize accuracy of variant calling.
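      The tally-then-filter logic behind such consecutive steps can be sketched in base R. This is not the VariantTools implementation: the tally table, the helper call_variants, and the depth and allelic-fraction thresholds are hypothetical values chosen only to show the principle.

```r
# Hypothetical per-position tallies of reference and alternate read counts.
tally <- data.frame(
  pos       = c(101L, 250L, 377L, 512L),
  ref       = c("A", "G", "C", "T"),
  alt       = c("G", "T", "T", "C"),
  ref_count = c(180L, 6L, 495L, 60L),
  alt_count = c(20L, 4L, 5L, 40L),
  stringsAsFactors = FALSE
)
tally$depth <- tally$ref_count + tally$alt_count
tally$vaf <- tally$alt_count / tally$depth   # variant allelic fraction

# Keep candidates with sufficient depth and allelic fraction.
# Both thresholds are illustrative and would need validation in a real pipeline.
call_variants <- function(x, min_depth = 50L, min_vaf = 0.05) {
  x[x$depth >= min_depth & x$vaf >= min_vaf, , drop = FALSE]
}

calls <- call_variants(tally)
print(calls[, c("pos", "ref", "alt", "depth", "vaf")])
```

      Position 250 is rejected for low depth and position 377 for low allelic fraction, which mirrors how separately tunable filters can be adjusted to trade sensitivity against specificity.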
      The Rcwl pipeline package has wrappers for the seven variant callers (MuSE,
      • Fan Y.
      • Xi L.
      • Hughes D.S.
      • Zhang J.
      • Zhang J.
      • Futreal P.A.
      • Wheeler D.A.
      • Wang W.
      MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data.
      MuTect, SomaticSniper,
      • Larson D.E.
      • Harris C.C.
      • Chen K.
      • Koboldt D.C.
      • Abbott T.E.
      • Dooling D.J.
      • Ley T.J.
      • Mardis E.R.
      • Wilson R.K.
      • Ding L.
      SomaticSniper: identification of somatic point mutations in whole genome sequencing data.
      VarScan, Radia,
      • Radenbaugh A.J.
      • Ma S.
      • Ewing A.
      • Stuart J.M.
      • Collisson E.A.
      • Zhu J.
      • Haussler D.
      RADIA: RNA and DNA integrated analysis for somatic mutation detection.
      Pindel, and Indelocator
      • Banerji S.
      • Cibulskis K.
      • Rangel-Escareno C.
      • Brown K.K.
      • Carter S.L.
      • Frederick A.M.
      • et al.
      Sequence analysis of mutations and translocations across breast cancer subtypes.
      ) in the Multi-Center Mutation Calling in Multiple Cancers project pipeline used by The Cancer Genome Atlas (TCGA) Pan-Cancer project.
      • Ellrott K.
      • Bailey M.H.
      • Saksena G.
      • Covington K.R.
      • Kandoth C.
      • Stewart C.
      • Hess J.
      • Ma S.
      • Chiotti K.E.
      • McLellan M.
      • Sofia H.J.
      • Hutter C.
      • Getz G.
      • Wheeler D.
      • Ding L.
      MC3 Working Group; Cancer Genome Atlas Research Network
      Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines.

       Variant Processing

      Once a VCF file (or other data object) is produced by the variant callers, it is of interest to assess various quality measures of the process leading to variant calling, which can be used to remove variants that do not pass acceptability thresholds. Supplemental Table S3F lists some tools in R and Bioconductor useful for variant processing.

       Variant Annotation and Prioritization

      Variant annotation is the process, usually automated, of relating genomic variants with their potential impact on the function of associated genomic features, such as genes, whereas variant prioritization takes into account all of the knowledge available, including the patient's clinical features, to produce a report of genomic variants for clinical use. In our experience, general variant annotation and effect prediction of nonsynonymous variants are faster and more comprehensive by using external calls to comprehensive tools, such as Annovar,
      • Yang H.
      • Wang K.
      Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR.
      Oncotator,
      • Ramos A.H.
      • Lichtenstein L.
      • Gupta M.
      • Lawrence M.S.
      • Pugh T.J.
      • Saksena G.
      • Meyerson M.
      • Getz G.
      Oncotator: cancer variant annotation tool.
      or SnpSift.
      • Cingolani P.
      • Patel V.M.
      • Coon M.
      • Nguyen T.
      • Land S.J.
      • Ruden D.M.
      • Lu X.
      Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift.
      However, some variant annotation capabilities are available in R and Bioconductor (Supplemental Table S3G).

       Tools for Indels, Complex Variants, and Fusions

      In addition to wrappers to external tools (eg, Pindel and Indelocator in the pipeline package Rcwl), there are a few tools available in R to specifically address indels, complex variants, and fusions (Supplemental Table S3H).

       Copy Number Variant Calling

      There are several R and Bioconductor tools for copy number analysis from whole genome or whole exome NGS data (eg, sequenza, copywriteR, CODEX, focalCall, TitanCNA, QDNAseq, SomatiCa, cn.mops, CNAnorm, ExomeDepth, ExomeCNV, and bacomR), hybridization capture–targeted sequencing (eg, PureCN, DECoN, and VisCap), and amplicon-targeted NGS (eg, panelcn.mops, CNVPanelizer, and quandico). The package iGC integrates copy number and gene expression analysis. These tools can perform data segmentation using a variety of algorithms, followed by copy number state assignment, determination of loss of heterozygosity, and visualization of CNV; however, detailed description of CNV tools is beyond the scope of this review.

       Tumor Heterogeneity and Clonal Evolution

      A variety of tools are available in R and Bioconductor to estimate tumor heterogeneity, assess tumor purity, and infer clonal evolution. These tools can be applied to determine heterogeneity in genomic single-nucleotide variants (eg, Clomial and Sequenza), CNVs (eg, cancerTiming, CNALR, absCNseq, and TitanCNA), both CNVs and single-nucleotide variants (eg, BubbleTree and expands), or gene expression profiles (UNDO, estimate, Bayclone2, and DeMixT); and they may be helpful in the clinical interpretation of NGS findings and assessment of therapeutic options.
      • McGranahan N.
      • Swanton C.
      Clonal heterogeneity and tumor evolution: past, present, and the future.

       Expression Analysis

      Many different tools take advantage of R's powerful statistical and graphical capabilities to perform analysis of differential expression. For example, a well-established approach is to use tools, such as DESeq2, edgeR, and the voom function of limma, to perform differential expression analysis on coverage counts of reads aligned to the genome.
      • Varet H.
      • Brillet-Gueguen L.
      • Coppee J.Y.
      • Dillies M.A.
      SARTools: a DESeq2- and EdgeR-based R pipeline for comprehensive differential analysis of RNA-Seq data.
      ,
      • Love M.I.
      • Anders S.
      • Kim V.
      • Huber W.
      RNA-Seq workflow: gene-level exploratory analysis and differential expression.
      Recently, alignment-free, transcript-focused approaches (eg, Salmon,
      • Patro R.
      • Duggal G.
      • Love M.I.
      • Irizarry R.A.
      • Kingsford C.
      Salmon provides fast and bias-aware quantification of transcript expression.
      Sailfish,
      • Patro R.
      • Mount S.M.
      • Kingsford C.
      Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.
      and Kallisto
      • Bray N.L.
      • Pimentel H.
      • Melsted P.
      • Pachter L.
      Near-optimal probabilistic RNA-seq quantification.
      ), using the reads directly mapped to a transcript database, have been shown to be much faster and possibly more accurate for mRNA quantification.
      • Love M.I.
      • Anders S.
      • Kim V.
      • Huber W.
      RNA-Seq workflow: gene-level exploratory analysis and differential expression.
      ,
      • Zhang C.
      • Zhang B.
      • Lin L.L.
      • Zhao S.
      Evaluation and comparison of computational tools for RNA-seq isoform quantification.
      The output of these transcript-level quantifications can then be imported into R objects for use with downstream differential expression analysis tools, such as DESeq2, using the package tximport.
      Subsequently, a wide variety of R tools can take expression data from relatively large numbers of samples to derive predictive models and classifiers to support disease diagnosis, prognosis, and therapeutic decisions. Fewer tools are available that might be useful in the context of clinical RNASeq to apply these models to individual samples (Supplemental Table S3I).

       Comprehensive Pipeline Examples in Bioconductor

      A few comprehensive packages take advantage of the interoperability and scalability of Bioconductor packages, objects, classes, and functions to generate pipelines for NGS analysis by assembling tools from different packages, generating or modifying classes and objects, and adding wrapper calls to external software, if needed. A wrapper can be designed simply by invoking shell commands [eg, system("bwa index -a bwtsw ./data/tair10.fasta")] for calling the BWA indexer or using more complex functions with parameter-supplied files and/or by interactive user interfaces. More complex pipelines can also be handled with the common workflow language (https://www.commonwl.org, last accessed April 30, 2019), used in various bioinformatics pipelines by the package Rcwl. These packages provide illustrative examples and templates for assembling bioinformatics pipelines and interfacing with Bioconductor tools, which can be adapted to address the specific needs of clinical NGS pipelines (Table 3).
      Table 3. Examples of NGS Analysis Pipeline Packages
      Package: Functionalities
      systemPipeR: Framework of templates for assembling NGS pipelines and producing automated reports. The SYSargs S4 class contains a targets file (eg, sample input files) and a param file (specifying parameters for command line or R tools). Preconfigured workflows are included for RNASeq, DNA NGS, ChIP-Seq, and Ribo-Seq. Can use parallelization on multiple CPU/computer nodes to accelerate run times.
      QuasR: Provides an integrated pipeline for NGS analysis from read processing to alignment, quality control, and quantification.
      easyRNASeq: Provides templates for simplifying the analysis of RNASeq data from alignment to feature counts.
      ultraseq: Contains a set of tools to facilitate the use of external tools for DNA NGS, such as Fastqc, BWA and STAR aligners, Samtools, GATK preprocessing, Picard, and haplotyper tools, and Annovar and MuTect annotation tools. This package uses the flowr package, designed to facilitate generation of complex bioinformatics pipelines, whether they involve R commands and/or a series of steps invoking shell commands to run external tools.
      Rcwl: Uses the standard CWL language for pipeline assembly and includes these pipelines as examples:
      • DNA NGS (BWA alignment, samtools sorting, BAM merging, and Picard duplicate marking)
      • RNASeq (fastqc, STAR aligner, samtools, flagstat, featureCounts, and RSeQC)
      • TCGA MC3 somatic variant calling (seven variant caller pipelines, merge VCF, and convert to MAF)
      • GATK germline variant calling (paired fastq to ubam, GATK alignment, variant calling by HaplotypeCaller, and joint genotyping)
      ChIP-Seq, chromatin immunoprecipitation sequencing; CPU, central processing unit; CWL, common workflow language; GATK, Genome Analysis Tool Kit; MAF, mutation annotation format; MC3, Multi-Center Mutation Calling in Multiple Cancers; NGS, next-generation sequencing; Ribo-Seq, ribosome profile sequencing; TCGA, The Cancer Genome Atlas; VCF, variant call format.
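      The shell-command wrapper described before Table 3 can be made slightly more defensive with base R alone. In the sketch below, run_tool is a hypothetical helper (not part of any package named here) that builds the command, checks that the binary is on the PATH, and returns the exit status; the BWA arguments are taken from the example in the text.

```r
# Build and optionally run an external command from R.
# The command is executed only when `dry_run` is FALSE and the tool is installed.
run_tool <- function(tool, args, dry_run = FALSE) {
  command <- paste(c(tool, shQuote(args)), collapse = " ")
  if (dry_run || !nzchar(Sys.which(tool))) {
    message("Not executed: ", command)
    return(invisible(list(command = command, status = NA_integer_)))
  }
  status <- system2(tool, args = shQuote(args))
  invisible(list(command = command, status = status))
}

# Dry run of the BWA indexing call; nothing is executed here.
result <- run_tool("bwa", c("index", "-a", "bwtsw", "./data/tair10.fasta"),
                   dry_run = TRUE)
print(result$command)
```

      Returning the command string and exit status, rather than discarding them, makes it easier to log each external step, which is one plausible way to support the traceability expected of a clinical pipeline.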

      Visualization and User Interfaces

      In clinical bioinformatics, effective visualization tools and graphic user interfaces (UIs) can help address the challenges of data multidimensionality, information overload, and communication barriers that can cause cognitive errors and lead to suboptimal patient care.
      • Wickham H.
      ggplot2: Elegant Graphics for Data Analysis.
      ,
      • Mougin F.
      • Auber D.
      • Bourqui R.
      • Diallo G.
      • Dutour I.
      • Jouhet V.
      • Thiessard F.
      • Thiebaut R.
      • Thebault P.
      Visualizing omics and clinical data: which challenges for dealing with their variety?.
      One of the major advantages of R is its strong graphic capabilities, and several tools are available for visualizing large biological data sets, including the results of NGS analytical pipelines.
      Although base R has good data visualization functions, such as plot, boxplot, barplot, dotchart, and hist, more advanced graphing capabilities are provided by the ggplot2 package, which offers a declarative method to generate graphs based on the grammar of graphics.
      • Wickham H.
      ggplot2: Elegant Graphics for Data Analysis.
      In bioinformatics, it is useful to be able to interactively manipulate graphs (eg, zooming on genomic regions); this can be accomplished by various R packages that generate interactive graphs (Supplemental Table S3J).
      Several packages have been developed to specifically visualize biological and omics data. In particular, tools that relate genomic location to genomic ranges using genome browsers, such as the UCSC Genome Browser
      • Tyner C.
      • Barber G.P.
      • Casper J.
      • Clawson H.
      • Diekhans M.
      • Eisenhart C.
      • Fischer C.M.
      • Gibson D.
      • Gonzalez J.N.
      • Guruvadoo L.
      • Haeussler M.
      • Heitner S.
      • Hinrichs A.S.
      • Karolchik D.
      • Lee B.T.
      • Lee C.M.
      • Nejad P.
      • Raney B.J.
      • Rosenbloom K.R.
      • Speir M.L.
      • Villarreal C.
      • Vivian J.
      • Zweig A.S.
      • Haussler D.
      • Kuhn R.M.
      • Kent W.J.
      The UCSC Genome Browser database: 2017 update.
      or the Broad Institute Integrative Genomics Viewer (IGV),
      • Thorvaldsdóttir H.
      • Robinson J.T.
      • Mesirov J.P.
      Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration.
      are widely used in bioinformatics. Integration with genomic browsers is facilitated by the Bioconductor rtracklayer package, which has functions to download and upload genomic annotation tracks from the UCSC database (eg, in BED, WIG, bigWIG, GTF, and GFF formats) and to manipulate the browser interface for visualization of user genomic data (eg, stored as GRanges objects).
      Similarly, for IGV interaction, functions from the SRAdb package can be used. SRAdb provides an interface to the Sequence Read Archive (SRA; https://www.ncbi.nlm.nih.gov/sra, last accessed April 30, 2019) of sequencing data, and it has a set of useful functions for interacting with IGV. For example, startIGV and IGVsocket initiate the connection to IGV, IGVload loads BAM and other files, IGVgoto points the focus on a specific region, IGVcollapse collapses the read views, and IGVsort sorts the alignments (eg, by base). Incorporation of IGV visualization tools in an interactive UI linked to alignment and variant analysis can significantly expedite manual review of clinical NGS data. Additional Bioconductor packages of interest for genomic data visualization are listed in Supplemental Table S3K.

       Shiny Web Applications

      Shiny is a package developed by RStudio that provides a set of UI and server connectivity functions to generate self-contained, web-accessible applications that connect R programs running in a server to web pages via browser-based UIs. Shiny applications are increasingly being used for biomedical applications, including bioinformatics.
      • Mock A.
      • Murphy S.
      • Morris J.
      • Marass F.
      • Rosenfeld N.
      • Massie C.
      CVE: an R package for interactive variant prioritisation in precision oncology.
      ,
      • Class C.A.
      • Ha M.J.
      • Baladandayuthapani V.
      • Do K.A.
      iDINGO-integrative differential network analysis in genomics with Shiny application.
      • Yu Y.
      • Ouyang Y.
      • Yao W.
      shinyCircos: an R/Shiny application for interactive creation of Circos plot.
      • To Duc K.
      bcROCsurface: an R package for correcting verification bias in estimation of the ROC surface and its volume for continuous diagnostic tests.
      • Koeppen K.
      • Stanton B.A.
      • Hampton T.H.
      ScanGEO: parallel mining of high-throughput gene expression data.
      • Rupji M.
      • Zhang X.
      • Kowalski J.
      CASAS: Cancer Survival Analysis Suite, a web based application.
      • Theodosiou T.
      • Efstathiou G.
      • Papanikolaou N.
      • Kyrpides N.C.
      • Bagos P.G.
      • Iliopoulos I.
      • Pavlopoulos G.A.
      NAP: the Network Analysis Profiler, a web tool for easier topological analysis and comparison of medium-scale biological networks.
      • Barlowe S.
      • Coan H.B.
      • Youker R.T.
      SubVis: an interactive R package for exploring the effects of multiple substitution matrices on pairwise sequence alignment.
      For example, an interesting package for integrating Bioconductor tools in a Shiny application is the interactiveDisplay package, which uses interactive graphing tools, such as Gviz, ggbio, and gridSVG, and can accommodate GRanges, GRangesList, ExpressionSet, and RangedSummarizedExperiment objects. Available plot types accessible via tabbed web pages include Static circle layout, Interactive plot, Heatmap, Network plot, and Dendrogram.
      A Shiny application consists of two functions, shinyUI and shinyServer. shinyUI uses nested functions to define the HTML-coded user interface, such as the page layout and the placement and properties of HTML input widgets (actionButton, actionLink, checkboxGroupInput, checkboxInput, dateInput, dateRangeInput, fileInput, numericInput, passwordInput, radioButtons, selectInput, selectizeInput, sliderInput, submitButton, and textInput) and outputs (dataTableOutput, imageOutput, plotOutput, verbatimTextOutput, tableOutput, and textOutput). In addition to HTML, the UI function can also incorporate cascading style sheets and JavaScript code to render the UI. The shinyServer function contains the code necessary to process the inputs in a reactive manner (ie, responding in real time to changes in the input status, such as selecting a different choice in a pull-down list) and to pass the code outputs to the proper UI output object, so that the display changes in the web page. For example, a table object can be produced in the server object by the function renderTable, which then will be displayed by the tableOutput function in the UI. Similarly, any plot from an R or Bioconductor package can be rendered in the server with renderPlot and displayed by plotOutput in the UI. The simplicity of programming reactive outputs that can integrate a variety of R and Bioconductor objects and the ability to integrate virtually any R package and even wrappers to command line tools make Shiny applications an attractive option for the UI in clinical bioinformatics pipelines.
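      A minimal sketch of the UI and server pairing is shown below. It assumes the shiny package is installed and uses the shinyApp(ui, server) form with fluidPage rather than the shinyUI and shinyServer wrappers; the gene list and the simulated read depths are hypothetical placeholders, not clinical data.

```r
library(shiny)

# UI: one input widget and two output placeholders.
ui <- fluidPage(
  selectInput("gene", "Gene", choices = c("TP53", "KRAS", "EGFR")),
  plotOutput("depth_plot"),
  tableOutput("depth_table")
)

# Server: reactive code that is re-evaluated whenever input$gene changes.
server <- function(input, output) {
  # Simulated per-exon read depths, for illustration only.
  depths <- reactive({
    set.seed(nchar(input$gene))
    data.frame(exon = 1:5, depth = rpois(5, lambda = 200))
  })
  output$depth_plot <- renderPlot(
    barplot(depths()$depth, names.arg = depths()$exon,
            main = input$gene, xlab = "Exon", ylab = "Read depth")
  )
  output$depth_table <- renderTable(depths())
}

app <- shinyApp(ui = ui, server = server)
# To launch the application interactively: runApp(app)
```

      The reactive expression is shared by the plot and the table, so both outputs are refreshed together when a different gene is selected.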

       Visualization of Aggregated Data

      In the third stage of analysis, it is important to integrate the NGS findings on an individual patient with knowledge derived from large numbers of previous cases; and in the fourth stage, the data may be uploaded to aggregate knowledge databases. This can be done with local databases, but more powerfully with large-scale studies, such as TCGA,
      • Weinstein J.N.
      • Collisson E.A.
      • Mills G.B.
      • Shaw K.R.M.
      • Ozenberger B.A.
      • Ellrott K.
      • Shmulevich I.
      • Sander C.
      • Stuart J.M.
      Cancer Genome Atlas Research Network
      The Cancer Genome Atlas Pan-Cancer analysis project.
      Therapeutically Applicable Research to Generate Effective Treatments (https://ocg.cancer.gov/programs/target, last accessed April 29, 2019), and International Cancer Genomics Consortium.
      Global Cancer Genomics Consortium
      The Global Cancer Genomics Consortium: interfacing genomics and cancer medicine.
      The following packages provide tools for interacting with these repositories, extracting data, performing downstream analyses, and/or visualizing multiple features in an aggregate summarized or detailed way: i) TCGAbiolinks provides functions to query and extract data from TCGA (including archival data for reanalysis) as SummarizedExperiment objects and allows advanced integrative analysis and visualization of the data in multiple ways. ii) CGDS-R can query the Cancer Genomic Data Server (CGDS) hosted by the Computational Biology Center (cBio) at Memorial Sloan Kettering Cancer Center. iii) cbaf provides a simpler interface to the cBioPortal repository (https://www.cbioportal.org, last accessed May 2, 2019), originally developed by the cBio at Memorial Sloan Kettering Cancer Center.
      • Gao J.
      • Aksoy B.A.
      • Dogrusoz U.
      • Dresdner G.
      • Gross B.
      • Sumer S.O.
      • Sun Y.
      • Jacobsen A.
      • Sinha R.
      • Larsson E.
      • Cerami E.
      • Sander C.
      • Schultz N.
      Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal.
      iv) recount interfaces with the recount2 project,
      • Collado-Torres L.
      • Nellore A.
      • Kammers K.
      • Ellis S.E.
      • Taub M.A.
      • Hansen K.D.
      • Jaffe A.E.
      • Langmead B.
      • Leek J.T.
      Reproducible RNA-seq analysis using recount2.
      which has processed >70,000 RNASeq experiments from TCGA and SRA and provides integrated workflows for analysis and visualization of these data.

      Limitations of R and Bioconductor for Clinical Bioinformatics

      To take full advantage of R and Bioconductor for clinical NGS applications, it is important to be aware of their limitations and downsides and how to address them: i) Some of the algorithms and processing steps commonly used in NGS pipelines are not available in Bioconductor or as R packages; for these cases, interfaces to the operating system and other connection pipelines can be made from within the R environment to established software tools. Moreover, in many cases, the parameters needed for running these external scripts can be provided as parameter objects (eg, in the Rsamtools package). ii) Many of the tools available in Bioconductor have not achieved the same level of widespread testing and use as more established tools used in current NGS pipelines. It is preferable to use tools that have been published in peer-reviewed journals, adding another level of scrutiny, compared with tools that have only Bioconductor review and user community feedback (Supplemental Table S1). If these tools are used for clinical purposes, it is critical to pay special attention to complete validation of their performance characteristics and integration in the pipeline. iii) Given the routine biannual updates to R base code and Bioconductor, as well as frequent updates to specific R packages, it is important in the clinical context to a) lock the pipeline so that packages are not inadvertently updated; b) examine whether the changes in the new version are relevant to the clinical pipelines (eg, by adding a newly needed functionality or fixing a critical bug) and, if so, c) ensure that the updates are thoroughly tested and validated in the context of the pipeline.
      • Oliver G.R.
      • Hart S.N.
      • Klee E.W.
      Bioinformatics for clinical next generation sequencing.
      In practice, this may require additional human resources for frequent revalidations compared with less frequently updated pipeline frameworks. Although extensive testing of the development versions of R and Bioconductor by a large number of users before release into production versions ensures that major bugs and interoperability issues are identified, detecting minor or highly specific errors and ensuring compatibility with other tools in specific pipeline approaches require additional scrutiny by the user.
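      One simple way to detect inadvertent package updates is to compare installed versions against a manifest recorded at validation time. The base-R sketch below assumes a hypothetical helper, check_versions, and builds its manifest from two base packages purely so that the example is self-contained; a real manifest would list the pipeline's validated packages and versions.

```r
# Compare installed package versions against a locked manifest
# (a named character vector: package name -> validated version).
check_versions <- function(locked) {
  installed <- vapply(names(locked), function(p) {
    if (requireNamespace(p, quietly = TRUE)) {
      as.character(utils::packageVersion(p))
    } else {
      NA_character_
    }
  }, character(1))
  data.frame(
    package   = names(locked),
    locked    = unname(locked),
    installed = unname(installed),
    ok        = !is.na(installed) & unname(installed == locked),
    stringsAsFactors = FALSE
  )
}

# Illustration: lock the currently installed versions, then verify them.
locked <- c(stats = as.character(packageVersion("stats")),
            utils = as.character(packageVersion("utils")))
report <- check_versions(locked)
print(report)

# A deliberately wrong manifest entry is flagged as a mismatch.
drift <- check_versions(c(stats = "0.0.1"))
print(drift$ok)
```

      Running such a check at pipeline start-up, and refusing to proceed when any row is not ok, is one possible safeguard; it complements rather than replaces revalidation after an intentional update.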
      Data objects and files storing the results of steps in Bioconductor pipelines generally were not designed for clinical use; therefore, strict requirements for identifiers have not been implemented. However, it is possible to modify S4 objects and change file names and headers to enforce the use of the recommended identifiers (ie, unique patient, sample, run, and location identifiers).
      (Roy et al., J Mol Diagn. 2018; 20: 4-27)
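One way to enforce identifiers on a result object can be sketched with an S4 validity method. The class `ClinicalResult`, its slot names, and the helper `result_file_name` are hypothetical and are not part of any Bioconductor package; a real pipeline would more likely extend an existing Bioconductor class and would need to define its own identifier format.

```r
library(methods)

# Hypothetical result container: four mandatory identifiers plus a payload.
setClass(
  "ClinicalResult",
  slots = c(
    patient_id = "character",
    sample_id  = "character",
    run_id     = "character",
    location   = "character",
    payload    = "list"
  ),
  validity = function(object) {
    problems <- character(0)
    for (s in c("patient_id", "sample_id", "run_id", "location")) {
      value <- slot(object, s)
      # Each identifier must be exactly one non-missing, non-empty string.
      if (length(value) != 1L || is.na(value) || !nzchar(value)) {
        problems <- c(problems,
                      paste0("'", s, "' must be a single non-empty string"))
      }
    }
    if (length(problems) > 0L) problems else TRUE
  }
)

# Build an output file name that carries all identifiers.
# The object is re-validated in case slots were changed after creation.
result_file_name <- function(object, ext = "rds") {
  validObject(object)
  stem <- paste(object@patient_id, object@sample_id,
                object@run_id, object@location, sep = "_")
  paste0(stem, ".", ext)
}

# Toy identifiers, for illustration only.
res <- new("ClinicalResult",
           patient_id = "PT0001", sample_id = "S01",
           run_id = "RUN42", location = "LAB-A",
           payload = list(n_variants = 3L))
result_file_name(res)
```

Because the validity method runs when the object is created, a result that lacks any of the four identifiers cannot be constructed, and the same check is repeated before a file name is generated.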
      Packages that may not be fully developed to pass all the requirements for use in clinical production pipelines may nonetheless be useful for quality assessment, advanced visualization, and experimental development of novel approaches to address limitations of current clinical pipelines. Parallel assessment of clinical testing by such experimental tools can identify opportunities for improvements in the pipelines and foster further development of the tools for incorporation in the clinical setting.
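A parallel assessment of this kind can be as simple as comparing the calls of the clinical pipeline with those of an experimental tool on the same sample. The sketch below uses base R and assumes both call sets have been reduced to data frames with `chrom`, `pos`, `ref`, and `alt` columns; the function names and the toy variants are hypothetical, and the concordance measure shown (shared calls divided by all distinct calls) is only one of several possible choices.

```r
# Build one key per variant so that call sets can be compared as sets.
# Positions are coerced to integer to avoid scientific notation in the key.
variant_key <- function(calls) {
  paste(calls$chrom, as.integer(calls$pos), calls$ref, calls$alt, sep = ":")
}

# Compare two call sets; returns shared and discordant calls plus the
# fraction of all distinct calls that both pipelines agree on.
# If both call sets are empty, the concordance is NaN.
compare_calls <- function(clinical, experimental) {
  a <- unique(variant_key(clinical))
  b <- unique(variant_key(experimental))
  shared <- intersect(a, b)
  list(
    shared            = shared,
    clinical_only     = setdiff(a, b),
    experimental_only = setdiff(b, a),
    concordance       = length(shared) / length(union(a, b))
  )
}

# Toy call sets, for illustration only (not real variants).
clinical <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  pos   = c(100L, 200L, 300L),
  ref   = c("A", "C", "G"),
  alt   = c("G", "T", "A"),
  stringsAsFactors = FALSE
)
experimental <- data.frame(
  chrom = c("chr1", "chr1", "chr3"),
  pos   = c(100L, 200L, 400L),
  ref   = c("A", "C", "T"),
  alt   = c("G", "T", "C"),
  stringsAsFactors = FALSE
)

comparison <- compare_calls(clinical, experimental)
comparison$experimental_only  # calls made only by the experimental tool
```

Discordant calls in either direction are the cases worth reviewing: they may point to a limitation of the clinical pipeline or to an artifact of the experimental tool.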

      Conclusions

      Bioinformatics tools must keep pace with rapid advances in NGS technologies, biological knowledge, and data analysis. The R and Bioconductor frameworks are widely used in the research context, but their application to clinical pipelines has been limited. This review highlights some advantages of R and Bioconductor for NGS pipelines and other clinical applications and gives examples of tools that are useful in clinical pipelines. These tools operate on primary data, such as sequencing reads, alignments, and variants, and take advantage of the advanced statistical, machine-learning, graphing, and interactive capabilities of R. Compliance with the robustness and reproducibility requirements of clinical bioinformatics is facilitated by highly structured data objects and package management tools, and development of integrated pipelines is facilitated by the high interoperability of Bioconductor tools. Flexible interactive visualization and big data analysis tools can facilitate clinical interpretation as well as provide insights for advancing biological knowledge. Where well-established and effective external tools and the limitations of R and Bioconductor for some tasks preclude exclusive use of Bioconductor pipelines for clinical NGS, it is advantageous to establish user-friendly interfaces with external tools from within the R environment. Given their rapid pace of development, R and Bioconductor tools are expected to be increasingly explored for their utility in clinical bioinformatics, with more resources dedicated to ensuring their validity, reproducibility, and interoperability in the context of clinical NGS testing.

      Supplemental Data

      References

        • Roy S.
        • Coldren C.
        • Karunamurthy A.
        • Kip N.S.
        • Klee E.W.
        • Lincoln S.E.
        • Leon A.
        • Pullambhatla M.
        • Temple-Smolkin R.L.
        • Voelkerding K.V.
        • Wang C.
        • Carter A.B.
        Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipelines: a Joint Recommendation of the Association for Molecular Pathology and the College of American Pathologists.
        J Mol Diagn. 2018; 20: 4-27
        • Gargis A.S.
        • Kalman L.
        • Bick D.P.
        • da Silva C.
        • Dimmock D.P.
        • Funke B.H.
        • et al.
        Good laboratory practice for clinical next-generation sequencing informatics pipelines.
        Nat Biotechnol. 2015; 33: 689-693
        • Oliver G.R.
        • Hart S.N.
        • Klee E.W.
        Bioinformatics for clinical next generation sequencing.
        Clin Chem. 2015; 61: 124-135
        • McKenna A.
        • Hanna M.
        • Banks E.
        • Sivachenko A.
        • Cibulskis K.
        • Kernytsky A.
        • Garimella K.
        • Altshuler D.
        • Gabriel S.
        • Daly M.
        • DePristo M.A.
        The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
        Genome Res. 2010; 20: 1297-1303
        • Gentleman R.C.
        • Carey V.J.
        • Bates D.M.
        • Bolstad B.
        • Dettling M.
        • Dudoit S.
        • Ellis B.
        • Gautier L.
        • Ge Y.
        • Gentry J.
        • Hornik K.
        • Hothorn T.
        • Huber W.
        • Iacus S.
        • Irizarry R.
        • Leisch F.
        • Li C.
        • Maechler M.
        • Rossini A.J.
        • Sawitzki G.
        • Smith C.
        • Smyth G.
        • Tierney L.
        • Yang J.Y.
        • Zhang J.
        Bioconductor: open software development for computational biology and bioinformatics.
        Genome Biol. 2004; 5: R80
        • Huber W.
        • Carey V.J.
        • Gentleman R.
        • Anders S.
        • Carlson M.
        • Carvalho B.S.
        • Bravo H.C.
        • Davis S.
        • Gatto L.
        • Girke T.
        • Gottardo R.
        • Hahne F.
        • Hansen K.D.
        • Irizarry R.A.
        • Lawrence M.
        • Love M.I.
        • MacDonald J.
        • Obenchain V.
        • Oleś A.K.
        • Pagès H.
        • Reyes A.
        • Shannon P.
        • Smyth G.K.
        • Tenenbaum D.
        • Waldron L.
        • Morgan M.
        Orchestrating high-throughput genomic analysis with Bioconductor.
        Nat Methods. 2015; 12: 115-121
        • Bao L.
        • Pu M.
        • Messer K.
        AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data.
        Bioinformatics. 2014; 30: 1056-1063
        • Shen Y.
        • Rahman M.
        • Piccolo S.R.
        • Gusenleitner D.
        • El-Chaar N.N.
        • Cheng L.
        • Monti S.
        • Bild A.H.
        • Johnson W.E.
        ASSIGN: context-specific genomic profiling of multiple heterogeneous biological pathways.
        Bioinformatics. 2015; 31: 1745-1753
        • Yu G.
        • Zhang B.
        • Bova G.S.
        • Xu J.
        • Shih I.M.
        • Wang Y.
        BACOM: in silico detection of genomic deletion types and correction of normal cell contamination in copy number data.
        Bioinformatics. 2011; 27: 1473-1480
        • Sengupta S.
        • Wang J.
        • Lee J.
        • Müller P.
        • Gulukota K.
        • Banerjee A.
        • Ji Y.
        Bayclone: Bayesian nonparametric inference of tumor subclones using NGS data.
        Pac Symp Biocomput. 2015: 467-478
        • Kane M.J.
        • Emerson J.
        • Weston S.
        Scalable strategies for computing with massive data.
        J Stat Softw. 2013; 55: 1-19
        • Durinck S.
        • Spellman P.T.
        • Birney E.
        • Huber W.
        Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt.
        Nat Protoc. 2009; 4: 1184-1191
        • Zhu W.
        • Kuziora M.
        • Creasy T.
        • Lai Z.
        • Morehouse C.
        • Guo X.
        • Sebastian Y.
        • Shen D.
        • Huang J.
        • Dry J.R.
        BubbleTree: an intuitive visualization to elucidate tumoral aneuploidy and clonality using next generation sequencing data.
        Nucleic Acids Res. 2015; 44: e38
        • Purdom E.
        • Ho C.
        • Grasso C.S.
        • Quist M.J.
        • Cho R.J.
        • Spellman P.
        Methods and challenges in timing chromosomal abnormalities within cancer samples.
        Bioinformatics. 2013; 29: 3113-3120
        • Carrara M.
        • Beccuti M.
        • Cavallo F.
        • Donatelli S.
        • Lazzarato F.
        • Cordero F.
        • Calogero R.A.
        State of art fusion-finder algorithms are suitable to detect transcription-induced chimeras in normal tissues?
        BMC Bioinformatics. 2013; 14 Suppl 7: S2
        • Lågstad S.
        • Zhao S.
        • Hoff A.M.
        • Johannessen B.
        • Lingjærde O.C.
        • Skotheim R.I.
        Chimeraviz: a tool for visualizing chimeric RNA.
        Bioinformatics. 2017; 33: 2954-2956
        • Oróstica K.Y.
        • Verdugo R.A.
        chromPlot: visualization of genomic data in chromosomal context.
        Bioinformatics. 2016; 32: 2366-2368
        • Zare H.
        • Wang J.
        • Hu A.
        • Weber K.
        • Smith J.
        • Nickerson D.
        • Song C.
        • Witten D.
        • Blau C.A.
        • Noble W.S.
        Inferring clonal composition from multiple sections of a breast cancer.
        PLoS Comput Biol. 2014; 10: e1003703
        • Klambauer G.
        • Schwarzbauer K.
        • Mayr A.
        • Clevert D.-A.
        • Mitterecker A.
        • Bodenhofer U.
        • Hochreiter S.
        cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate.
        Nucleic Acids Res. 2012; 40: e69
        • Gusnanto A.
        • Tcherveniakov P.
        • Shuweihdi F.
        • Samman M.
        • Rabbitts P.
        • Wood H.M.
        Stratifying tumour subtypes based on copy number alteration profiles using next-generation sequence data.
        Bioinformatics. 2015; 31: 2713-2720
        • Gusnanto A.
        • Wood H.M.
        • Pawitan Y.
        • Rabbitts P.
        • Berri S.
        Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data.
        Bioinformatics. 2012; 28: 40-47
        • Jiang Y.
        • Oldridge D.A.
        • Diskin S.J.
        • Zhang N.R.
        CODEX: a normalization and copy number variation detection method for whole exome sequencing.
        Nucleic Acids Res. 2015; 43: e39
        • Kuilman T.
        • Velds A.
        • Kemper K.
        • Ranzani M.
        • Bombardelli L.
        • Hoogstraat M.
        • Nevedomskaya E.
        • Xu G.
        • de Ruiter J.
        • Lolkema M.P.
        • Ylstra B.
        • Jonkers J.
        • Rottenberg S.
        • Wessels L.F.
        • Adams D.J.
        • Peeper D.S.
        • Krijgsman O.
        CopywriteR: DNA copy number detection from off-target sequence data.
        Genome Biol. 2015; 16: 49
        • Mock A.
        • Murphy S.
        • Morris J.
        • Marass F.
        • Rosenfeld N.
        • Massie C.
        CVE: an R package for interactive variant prioritisation in precision oncology.
        BMC Med Genomics. 2017; 10: 37
        • Fowler A.
        • Mahamdallie S.
        • Ruark E.
        • Seal S.
        • Ramsay E.
        • Clarke M.
        • Uddin I.
        • Wylie H.
        • Strydom A.
        • Lunter G.
        • Rahman N.
        Accurate clinical detection of exon copy number variants in a targeted NGS panel using DECoN.
        Wellcome Open Res. 2016; 1: 20
        • Ahn J.
        • Yuan Y.
        • Parmigiani G.
        • Suraokar M.B.
        • Diao L.
        • Wistuba I.I.
        • Wang W.
        DeMix: deconvolution for mixed cancer transcriptomes using raw measured data.
        Bioinformatics. 2013; 29: 1865-1871
        • Love M.I.
        • Huber W.
        • Anders S.
        Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
        Genome Biol. 2014; 15: 550
        • Buschmann T.
        DNABarcodes: an R package for the systematic construction of DNA sample tags.
        Bioinformatics. 2017; 33: 920-922
        • Sayols S.
        • Scherzinger D.
        • Klein H.
        dupRadar: a Bioconductor package for the assessment of PCR artifacts in RNA-Seq data.
        BMC Bioinformatics. 2016; 17: 428
        • Delhomme N.
        • Padioleau I.
        • Furlong E.E.
        • Steinmetz L.M.
        easyRNASeq: a bioconductor package for processing RNA-Seq data.
        Bioinformatics. 2012; 28: 2532-2533
        • Robinson M.D.
        • McCarthy D.J.
        • Smyth G.K.
        edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.
        Bioinformatics. 2010; 26: 139-140
        • Rainer J.
        • Gatto L.
        • Weichenberger C.X.
        Ensembldb: an R package to create and use Ensembl-based annotation resources.
        Bioinformatics. 2019; 35: 3151-3153
        • Chelaru F.
        • Corrada Bravo H.
        Epiviz: a view inside the design of an integrated visual analysis software for genomics.
        BMC Bioinformatics. 2015; 16 Suppl 11: S4
        • Yoshihara K.
        • Shahmoradgoli M.
        • Martínez E.
        • Vegesna R.
        • Kim H.
        • Torres-Garcia W.
        • Treviño V.
        • Shen H.
        • Laird P.W.
        • Levine D.A.
        • Carter S.L.
        • Getz G.
        • Stemke-Hale K.
        • Mills G.B.
        • Verhaak R.G.W.
        Inferring tumour purity and stromal and immune cell admixture from expression data.
        Nat Commun. 2013; 4: 2612
        • Sathirapongsasuti J.F.
        • Lee H.
        • Horst B.A.J.
        • Brunner G.
        • Cochran A.J.
        • Binder S.
        • Quackenbush J.
        • Nelson S.F.
        Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV.
        Bioinformatics. 2011; 27: 2648-2654
        • Plagnol V.
        • Curtis J.
        • Epstein M.
        • Mok K.Y.
        • Stebbings E.
        • Grigoriadou S.
        • Wood N.W.
        • Hambleton S.
        • Burns S.O.
        • Thrasher A.J.
        • Kumararatne D.
        • Doffinger R.
        • Nejentsev S.
        A robust model for read count data in exome sequencing experiments and implications for copy number variant calling.
        Bioinformatics. 2012; 28: 2747-2754
        • Andor N.
        • Graham T.A.
        • Jansen M.
        • Xia L.C.
        • Aktipis C.A.
        • Petritsch C.
        • Ji H.P.
        • Maley C.C.
        Pan-cancer analysis of the extent and consequences of intratumor heterogeneity.
        Nat Med. 2016; 22: 105-113
        • Krijgsman O.
        • Benner C.
        • Meijer G.A.
        • van de Wiel M.A.
        • Ylstra B.
        FocalCall: an R package for the annotation of focal copy number aberrations.
        Cancer Inform. 2014; 13: 153-156
        • Gendoo D.M.
        • Ratanasirigulchai N.
        • Schroder M.S.
        • Pare L.
        • Parker J.S.
        • Prat A.
        • Haibe-Kains B.
        Genefu: an R/Bioconductor package for computation of gene expression-based signatures in breast cancer.
        Bioinformatics. 2016; 32: 1097-1099
        • Akalin A.
        • Franke V.
        • Vlahoviček K.
        • Mason C.E.
        • Schübeler D.
        Genomation: a toolkit to summarize, annotate and visualize genomic intervals.
        Bioinformatics. 2015; 31: 1127-1129
        • Lawrence M.
        • Huber W.
        • Pagès H.
        • Aboyoun P.
        • Carlson M.
        • Gentleman R.
        • Morgan M.T.
        • Carey V.J.
        Software for computing and annotating genomic ranges.
        PLoS Comput Biol. 2013; 9: e1003118
        • Yin T.
        • Cook D.
        • Lawrence M.
        Ggbio: an R package for extending the grammar of graphics for genomic data.
        Genome Biol. 2012; 13: R77
        • Wickham H.
        ggplot2: Elegant Graphics for Data Analysis.
        Springer-Verlag, New York, NY 2009
        • Hänzelmann S.
        • Castelo R.
        • Guinney J.
        GSVA: gene set variation analysis for microarray and RNA-Seq data.
        BMC Bioinformatics. 2013; 14: 7
        • Hahne F.
        • Ivanek R.
        Mathé E., Davis S. (Eds.) Statistical Genomics: Methods and Protocols. Springer New York, New York, NY 2016: 335-351
        • Lai Y.-P.
        • Wang L.-B.
        • Wang W.-A.
        • Lai L.-C.
        • Tsai M.-H.
        • Lu T.-P.
        • Chuang E.Y.
        iGC—an integrated analysis package of gene expression and copy number alteration.
        BMC Bioinformatics. 2017; 18: 35
        • Law C.W.
        • Chen Y.
        • Shi W.
        • Smyth G.K.
        Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.
        Genome Biol. 2014; 15: R29
        • Ramos M.
        • Schiffer L.
        • Re A.
        • Azhar R.
        • Basunia A.
        • Cabrera C.R.
        • Chan T.
        • Chapman P.
        • Davis S.
        • Gomez-Cabrero D.
        • Culhane A.C.
        • Haibe-Kains B.
        • Hansen K.
        • Kodali H.
        • Louis M.S.
        • Mer A.S.
        • Riester M.
        • Morgan M.
        • Carey V.
        • Waldron L.
        Software for the integration of multi-omics experiments in Bioconductor.
        Cancer Res. 2017; 77: e39-e42
        • Hernandez-Ferrer C.
        • Ruiz-Arenas C.
        • Beltran-Gomila A.
        • González J.R.
        MultiDataSet: an R package for encapsulating multiple data sets with application to omic data integration.
        BMC Bioinformatics. 2017; 18: 36
        • Povysil G.
        • Tzika A.
        • Vogt J.
        • Haunschmid V.
        • Messiaen L.
        • Zschocke J.
        • Klambauer G.
        • Hochreiter S.
        • Wimmer K.
        panelcn.MOPS: copy-number detection in targeted NGS panel data for clinical diagnostics.
        Hum Mutat. 2017; 38: 889-897
        • Liu C.
        • Lehtonen R.
        • Hautaniemi S.
        PerPAS: topology-based single sample pathway analysis method.
        IEEE/ACM Trans Comput Biol Bioinform. 2018; 15: 1022-1027
        • Foroushani A.
        • Agrahari R.
        • Docking R.
        • Chang L.
        • Duns G.
        • Hudoba M.
        • Karsan A.
        • Zare H.
        Large-scale gene network analysis reveals the significance of extracellular matrix pathway and homeobox genes in acute myeloid leukemia: an introduction to the Pigengene package and its applications.
        BMC Med Genomics. 2017; 10: 16
        • Riester M.
        • Singh A.P.
        • Brannon A.R.
        • Yu K.
        • Campbell C.D.
        • Chiang D.Y.
        • Morrissey M.P.
        PureCN: copy number calling and SNV classification using targeted short read sequencing.
        Source Code Biol Med. 2016; 11: 13
        • Scheinin I.
        • Sie D.
        • Bengtsson H.
        • van de Wiel M.A.
        • Olshen A.B.
        • van Thuijl H.F.
        • van Essen H.F.
        • Eijk P.P.
        • Rustenburg F.
        • Meijer G.A.
        • Reijneveld J.C.
        • Wesseling P.
        • Pinkel D.
        • Albertson D.G.
        • Ylstra B.
        DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly.
        Genome Res. 2014; 24: 2022-2032
        • Gaidatzis D.
        • Lerch A.
        • Hahne F.
        • Stadler M.B.
        QuasR: quantification and annotation of short reads in R.
        Bioinformatics. 2015; 31: 1130-1132
        • Reinecke F.
        • Satya R.V.
        • DiCarlo J.
        Quantitative analysis of differences in copy numbers using read depth obtained from PCR-enriched samples and controls.
        BMC Bioinformatics. 2015; 16: 17
        • Collado-Torres L.
        • Nellore A.
        • Kammers K.
        • Ellis S.E.
        • Taub M.A.
        • Hansen K.D.
        • Jaffe A.E.
        • Langmead B.
        • Leek J.T.
        Reproducible RNA-seq analysis using recount2.
        Nat Biotechnol. 2017; 35: 319-321
        • Collado-Torres L.
        • Nellore A.
        • Jaffe A.E.
        Recount workflow: accessing over 70,000 human RNA-seq samples with Bioconductor.
        F1000Res. 2017; 6: 1558
        • Jabot-Hanin F.
        • Varet H.
        • Tores F.
        • Alcais A.
        • Jais J.-P.
        Rfpred: a random forest approach for prediction of missense variants in human exome.
        bioRxiv. 2016; (037127)
        • Wang S.
        • Pandis I.
        • Johnson D.
        • Emam I.
        • Guitton F.
        • Oehmichen A.
        • Guo Y.
        Optimising parallel R correlation matrix calculations on gene expression data using MapReduce.
        BMC Bioinformatics. 2014; 15: 351
        • de Souza W.
        • Carvalho B.S.
        • Lopes-Cendes I.
        Rqc: a Bioconductor package for quality control of high-throughput sequencing data.
        J Stat Softw Code Snippets. 2018; 87: 1-14
        • Liao Y.
        • Smyth G.K.
        • Shi W.
        The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.
        Nucleic Acids Res. 2013; 41: e108
        • Lawrence M.
        • Gentleman R.
        • Carey V.
        Rtracklayer: an R package for interfacing with genome browsers.
        Bioinformatics. 2009; 25: 1841-1842
        • Favero F.
        • Joshi T.
        • Marquard A.M.
        • Birkbak N.J.
        • Krzystanek M.
        • Li Q.
        • Szallasi Z.
        • Eklund A.C.
        Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data.
        Ann Oncol. 2015; 26: 64-70
        • Morgan M.
        • Anders S.
        • Lawrence M.
        • Aboyoun P.
        • Pagès H.
        • Gentleman R.
        ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data.
        Bioinformatics. 2009; 25: 2607-2608
        • Chen M.
        • Gunel M.
        • Zhao H.
        SomatiCA: identifying, characterizing and quantifying somatic copy number aberrations from cancer genome sequencing data.
        PLoS One. 2013; 8: e78143
        • Gehring J.S.
        • Fischer B.
        • Lawrence M.
        • Huber W.
        SomaticSignatures: inferring mutational signatures from single-nucleotide variants.
        Bioinformatics. 2015; 31: 3673-3675
        • Zhu Y.
        • Stephens R.M.
        • Meltzer P.S.
        • Davis S.R.
        SRAdb: query and use public next-generation sequencing data from within R.
        BMC Bioinformatics. 2013; 14: 19
        • Backman T.W.H.
        • Girke T.
        systemPipeR: NGS workflow and report generation environment.
        BMC Bioinformatics. 2016; 17: 388
        • Colaprico A.
        • Silva T.C.
        • Olsen C.
        • Garofano L.
        • Cava C.
        • Garolini D.
        • Sabedot T.S.
        • Malta T.M.
        • Pagnotta S.M.
        • Castiglioni I.
        • Ceccarelli M.
        • Bontempi G.
        • Noushmehr H.
        TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data.
        Nucleic Acids Res. 2016; 44: e71
        • Hummel M.
        • Bonnin S.
        • Lowy E.
        • Roma G.
        TEQC: an R package for quality control in target capture experiments.
        Bioinformatics. 2011; 27: 1316-1317
        • Ha G.
        • Roth A.
        • Khattra J.
        • Ho J.
        • Yap D.
        • Prentice L.M.
        • Melnyk N.
        • McPherson A.
        • Bashashati A.
        • Laks E.
        • Biele J.
        • Ding J.
        • Le A.
        • Rosner J.
        • Shumansky K.
        • Marra M.A.
        • Gilks C.B.
        • Huntsman D.G.
        • McAlpine J.N.
        • Aparicio S.
        • Shah S.P.
        TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data.
        Genome Res. 2014; 24: 1881-1893
        • Soneson C.
        • Love M.I.
        • Robinson M.D.
        Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.
        F1000Res. 2016; 4: 1521
        • Wang N.
        • Gong T.
        • Clarke R.
        • Chen L.
        • Shih I.-M.
        • Zhang Z.
        • Levine D.A.
        • Xuan J.
        • Wang Y.
        UNDO: a Bioconductor R package for unsupervised deconvolution of mixed gene expressions in tumor samples.
        Bioinformatics. 2015; 31: 137-139
        • Obenchain V.
        • Lawrence M.
        • Carey V.
        • Gogarten S.
        • Shannon P.
        • Morgan M.
        VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants.
        Bioinformatics. 2014; 30: 2076-2078
        • Knaus B.J.
        • Grünwald N.J.
        VCFR: a package to manipulate and visualize variant call format data in R.
        Mol Ecol Resour. 2017; 17: 44-53
        • Alvarez M.J.
        • Shen Y.
        • Giorgi F.M.
        • Lachmann A.
        • Ding B.B.
        • Ye B.H.
        • Califano A.
        Functional characterization of somatic mutations in cancer using network-based inference of protein activity.
        Nat Genet. 2016; 48: 838-847
        • Pugh T.J.
        • Amr S.S.
        • Bowser M.J.
        • Gowrisankar S.
        • Hynes E.
        • Mahanta L.M.
        • Rehm H.L.
        • Funke B.
        • Lebo M.S.
        VisCap: inference and visualization of germ-line copy-number variants from targeted clinical sequencing data.
        Genet Med. 2016; 18: 712-719
        • Chambers J.M.
        Software for Data Analysis: Programming with R.
        Springer, New York, NY 2008
        • Schubert M.
        • Lindgreen S.
        • Orlando L.
        AdapterRemoval v2: rapid adapter trimming, identification, and read merging.
        BMC Res Notes. 2016; 9: 88
        • Li H.
        • Durbin R.
        Fast and accurate short read alignment with Burrows-Wheeler transform.
        Bioinformatics. 2009; 25: 1754-1760
        • Langmead B.
        • Trapnell C.
        • Pop M.
        • Salzberg S.L.
        Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.
        Genome Biol. 2009; 10: R25
        • Langmead B.
        • Salzberg S.L.
        Fast gapped-read alignment with Bowtie 2.
        Nat Methods. 2012; 9: 357-359
        • Wu T.D.
        • Nacu S.
        Fast and SNP-tolerant detection of complex variants and splicing in short reads.
        Bioinformatics. 2010; 26: 873-881
        • Wu T.D.
        • Reeder J.
        • Lawrence M.
        • Becker G.
        • Brauer M.J.
        GMAP and GSNAP for genomic sequence alignment: enhancements to speed, accuracy, and functionality.
        Methods Mol Biol. 2016; 1418: 283-334
        • Dobin A.
        • Davis C.A.
        • Schlesinger F.
        • Drenkow J.
        • Zaleski C.
        • Jha S.
        • Batut P.
        • Chaisson M.
        • Gingeras T.R.
        STAR: ultrafast universal RNA-seq aligner.
        Bioinformatics. 2013; 29: 15-21
        • Trapnell C.
        • Pachter L.
        • Salzberg S.L.
        TopHat: discovering splice junctions with RNA-Seq.
        Bioinformatics. 2009; 25: 1105-1111
        • Kim D.
        • Langmead B.
        • Salzberg S.L.
        HISAT: a fast spliced aligner with low memory requirements.
        Nat Methods. 2015; 12: 357-360
        • Li H.
        • Handsaker B.
        • Wysoker A.
        • Fennell T.
        • Ruan J.
        • Homer N.
        • Marth G.
        • Abecasis G.
        • Durbin R.
        • 1000 Genome Project Data Processing Subgroup
        The Sequence Alignment/Map format and SAMtools.
        Bioinformatics. 2009; 25: 2078-2079
        • Breese M.R.
        • Liu Y.
        NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets.
        Bioinformatics. 2013; 29: 494-496
        • Kim S.
        • Scheffler K.
        • Halpern A.L.
        • Bekritsky M.A.
        • Noh E.
        • Kallberg M.
        • Chen X.Y.
        • Kim Y.
        • Beyter D.
        • Krusche P.
        • Saunders C.T.
        Strelka2: fast and accurate calling of germline and somatic variants.
        Nat Methods. 2018; 15: 591-594
        • Koboldt D.C.
        • Zhang Q.Y.
        • Larson D.E.
        • Shen D.
        • McLellan M.D.
        • Lin L.
        • Miller C.A.
        • Mardis E.R.
        • Ding L.
        • Wilson R.K.
        VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.
        Genome Res. 2012; 22: 568-576
        • Cibulskis K.
        • Lawrence M.S.
        • Carter S.L.
        • Sivachenko A.
        • Jaffe D.
        • Sougnez C.
        • Gabriel S.
        • Meyerson M.
        • Lander E.S.
        • Getz G.
        Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples.
        Nat Biotechnol. 2013; 31: 213-219
        • Ye K.
        • Schulz M.H.
        • Long Q.
        • Apweiler R.
        • Ning Z.
        Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads.
        Bioinformatics. 2009; 25: 2865-2871
        • Albers C.A.
        • Lunter G.
        • MacArthur D.G.
        • McVean G.
        • Ouwehand W.H.
        • Durbin R.
        Dindel: accurate indel calls from short-read data.
        Genome Res. 2011; 21: 961-973
        • Au K.F.
        • Jiang H.
        • Lin L.
        • Xing Y.
        • Wong W.H.
        Detection of splice junctions from paired-end RNA-seq data by SpliceMap.
        Nucleic Acids Res. 2010; 38: 4570-4578
        • Jiang Y.
        • Wang Y.
        • Brudno M.
        PRISM: pair-read informed split-read mapping for base-pair level detection of insertion, deletion and structural variants.
        Bioinformatics. 2012; 28: 2576-2583
        • Kadri S.
        • Zhen C.J.
        • Wurst M.N.
        • Long B.C.
        • Jiang Z.-F.
        • Wang Y.L.
        • Furtado L.V.
        • Segal J.P.
        Amplicon indel hunter is a novel bioinformatics tool to detect large somatic insertion/deletion mutations in amplicon-based next-generation sequencing data.
        J Mol Diagn. 2015; 17: 635-643
        • Fan Y.
        • Xi L.
        • Hughes D.S.
        • Zhang J.
        • Zhang J.
        • Futreal P.A.
        • Wheeler D.A.
        • Wang W.
        MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data.
        Genome Biol. 2016; 17: 178
        • Larson D.E.
        • Harris C.C.
        • Chen K.
        • Koboldt D.C.
        • Abbott T.E.
        • Dooling D.J.
        • Ley T.J.
        • Mardis E.R.
        • Wilson R.K.
        • Ding L.
        SomaticSniper: identification of somatic point mutations in whole genome sequencing data.
        Bioinformatics. 2012; 28: 311-317
        • Radenbaugh A.J.
        • Ma S.
        • Ewing A.
        • Stuart J.M.
        • Collisson E.A.
        • Zhu J.
        • Haussler D.
        RADIA: RNA and DNA integrated analysis for somatic mutation detection.
        PLoS One. 2014; 9: e111516
        • Banerji S.
        • Cibulskis K.
        • Rangel-Escareno C.
        • Brown K.K.
        • Carter S.L.
        • Frederick A.M.
        • et al.
        Sequence analysis of mutations and translocations across breast cancer subtypes.
        Nature. 2012; 486: 405-409
        • Ellrott K.
        • Bailey M.H.
        • Saksena G.
        • Covington K.R.
        • Kandoth C.
        • Stewart C.
        • Hess J.
        • Ma S.
        • Chiotti K.E.
        • McLellan M.
        • Sofia H.J.
        • Hutter C.
        • Getz G.
        • Wheeler D.
        • Ding L.
        • MC3 Working Group
        • Cancer Genome Atlas Research Network
        Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines.
        Cell Syst. 2018; 6: 271-281.e7
        • Yang H.
        • Wang K.
        Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR.
        Nat Protoc. 2015; 10: 1556-1566
        • Ramos A.H.
        • Lichtenstein L.
        • Gupta M.
        • Lawrence M.S.
        • Pugh T.J.
        • Saksena G.
        • Meyerson M.
        • Getz G.
        Oncotator: cancer variant annotation tool.
        Hum Mutat. 2015; 36: E2423-E2429
        • Cingolani P.
        • Patel V.M.
        • Coon M.
        • Nguyen T.
        • Land S.J.
        • Ruden D.M.
        • Lu X.
        Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift.
        Front Genet. 2012; 3: 35
        • McGranahan N.
        • Swanton C.
        Clonal heterogeneity and tumor evolution: past, present, and the future.
        Cell. 2017; 168: 613-628
        • Varet H.
        • Brillet-Gueguen L.
        • Coppee J.Y.
        • Dillies M.A.
        SARTools: a DESeq2- and EdgeR-based R pipeline for comprehensive differential analysis of RNA-Seq data.
        PLoS One. 2016; 11: e0157022
        • Love M.I.
        • Anders S.
        • Kim V.
        • Huber W.
        RNA-Seq workflow: gene-level exploratory analysis and differential expression.
        F1000Res. 2016; 4: 1070
        • Patro R.
        • Duggal G.
        • Love M.I.
        • Irizarry R.A.
        • Kingsford C.
        Salmon provides fast and bias-aware quantification of transcript expression.
        Nat Methods. 2017; 14: 417-419
        • Patro R.
        • Mount S.M.
        • Kingsford C.
        Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.
        Nat Biotechnol. 2014; 32: 462-464
        • Bray N.L.
        • Pimentel H.
        • Melsted P.
        • Pachter L.
        Near-optimal probabilistic RNA-seq quantification.
        Nat Biotechnol. 2016; 34: 525-527
        • Zhang C.
        • Zhang B.
        • Lin L.L.
        • Zhao S.
        Evaluation and comparison of computational tools for RNA-seq isoform quantification.
        BMC Genomics. 2017; 18: 583
        • Mougin F.
        • Auber D.
        • Bourqui R.
        • Diallo G.
        • Dutour I.
        • Jouhet V.
        • Thiessard F.
        • Thiebaut R.
        • Thebault P.
        Visualizing omics and clinical data: which challenges for dealing with their variety?
        Methods. 2018; 132: 3-18
        • Tyner C.
        • Barber G.P.
        • Casper J.
        • Clawson H.
        • Diekhans M.
        • Eisenhart C.
        • Fischer C.M.
        • Gibson D.
        • Gonzalez J.N.
        • Guruvadoo L.
        • Haeussler M.
        • Heitner S.
        • Hinrichs A.S.
        • Karolchik D.
        • Lee B.T.
        • Lee C.M.
        • Nejad P.
        • Raney B.J.
        • Rosenbloom K.R.
        • Speir M.L.
        • Villarreal C.
        • Vivian J.
        • Zweig A.S.
        • Haussler D.
        • Kuhn R.M.
        • Kent W.J.
        The UCSC Genome Browser database: 2017 update.
        Nucleic Acids Res. 2017; 45: D626-D634
        • Thorvaldsdóttir H.
        • Robinson J.T.
        • Mesirov J.P.
        Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration.
        Brief Bioinform. 2013; 14: 178-192
        • Class C.A.
        • Ha M.J.
        • Baladandayuthapani V.
        • Do K.A.
        iDINGO-integrative differential network analysis in genomics with Shiny application.
        Bioinformatics. 2018; 34: 1243-1245
        • Yu Y.
        • Ouyang Y.
        • Yao W.
        shinyCircos: an R/Shiny application for interactive creation of Circos plot.
        Bioinformatics. 2018; 34: 1229-1231
        • To Duc K.
        bcROCsurface: an R package for correcting verification bias in estimation of the ROC surface and its volume for continuous diagnostic tests.
        BMC Bioinformatics. 2017; 18: 503
        • Koeppen K.
        • Stanton B.A.
        • Hampton T.H.
        ScanGEO: parallel mining of high-throughput gene expression data.
        Bioinformatics. 2017; 33: 3500-3501
        • Rupji M.
        • Zhang X.
        • Kowalski J.
        CASAS: Cancer Survival Analysis Suite, a web based application.
        F1000Res. 2017; 6: 919
        • Theodosiou T.
        • Efstathiou G.
        • Papanikolaou N.
        • Kyrpides N.C.
        • Bagos P.G.
        • Iliopoulos I.
        • Pavlopoulos G.A.
        NAP: the Network Analysis Profiler, a web tool for easier topological analysis and comparison of medium-scale biological networks.
        BMC Res Notes. 2017; 10: 278
        • Barlowe S.
        • Coan H.B.
        • Youker R.T.
        SubVis: an interactive R package for exploring the effects of multiple substitution matrices on pairwise sequence alignment.
        PeerJ. 2017; 5: e3492
        • Weinstein J.N.
        • Collisson E.A.
        • Mills G.B.
        • Shaw K.R.M.
        • Ozenberger B.A.
        • Ellrott K.
        • Shmulevich I.
        • Sander C.
        • Stuart J.M.
        • Cancer Genome Atlas Research Network
        The Cancer Genome Atlas Pan-Cancer analysis project.
        Nat Genet. 2013; 45: 1113-1120
        • Global Cancer Genomics Consortium
        The Global Cancer Genomics Consortium: interfacing genomics and cancer medicine.
        Cancer Res. 2012; 72: 3720-3724
        • Gao J.
        • Aksoy B.A.
        • Dogrusoz U.
        • Dresdner G.
        • Gross B.
        • Sumer S.O.
        • Sun Y.
        • Jacobsen A.
        • Sinha R.
        • Larsson E.
        • Cerami E.
        • Sander C.
        • Schultz N.
        Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal.
        Sci Signal. 2013; 6: pl1