FusionFlow: an integrated system workflow for gene fusion detection in genomic samples



Introduction
Gene fusion is a phenomenon that occurs when two or more genes become juxtaposed, forming a single hybrid gene or transcript. Gene fusions contribute substantially to the evolutionary process by providing a continuous source of new genes. At the same time, however, they often lead to genomic disorders or cancer. Numerous gene fusions have been recognized as essential drivers of various cancer types. Thus, the discovery of novel gene fusions can help to better understand tumour development and progression [29]. For these reasons, gene fusion identification by means of gene fusion detection tools has become crucial in bioinformatics research [30]. Recent advances in deep learning and convolutional networks [3,6,4,23,22,9,8,10,21] have also progressively spread to tools for gene fusion detection [19,17].
Although many gene fusion detection tools have been developed over the past years, using them is still challenging. In addition, RNA-seq artefacts, introduced by library preparation and sequence alignment, make gene fusion predictions difficult to rely on [15]. The typical practice is to execute multiple tools and use the union or intersection of their results. Unfortunately, this approach is computationally demanding. Traditional tool usage has several limitations: each tool has specific installation requirements and version dependencies that must be precisely adhered to; downloading files and databases and executing tools is time-consuming; distinct tools can require different input data formats; and multiple complementary fusion detection tools are needed to improve sensitivity.
In recent years, bioinformatics workflows (which consist of a wide array of algorithms executed in a predefined sequence) have been developed to deal with multiple bioinformatics issues (e.g., RNA data processing and CNA detection) [24]. However, only a limited number of gene fusion detection workflows are available, and none of them can handle both RNA and DNA sequencing data [28]. This paper presents FusionFlow, an easily reproducible and scalable bioinformatics workflow for detecting gene fusions from RNA and DNA data. It processes large amounts of sequence data and their associated metadata through multiple transformations using a series of software components, databases, and operating environments (hardware and operating system). It includes five gene fusion detection tools executed through multiple processes. The processes are built using the Groovy/JVM-based Nextflow framework, exploiting Docker and Conda technologies. Nextflow allows tool downloads, installation, and execution to run concurrently, which saves time. At the same time, Docker and Conda are used to create virtual environments precisely configured for each tool. Finally, the pipeline takes standard data formats as input and, where needed, converts them inside dedicated converter processes.

The workflow
FusionFlow includes five fusion detection tools: EricScript [5], Arriba [26], FusionCatcher [20], GeneFuse [7] and Integrate [32]. Three of them, EricScript, Arriba, and FusionCatcher, accept only RNA-seq data as input. Concerning the DNA tools, GeneFuse takes only DNA data, while Integrate has two input options: 1) RNA data only or 2) both RNA and DNA data. All gene fusion detection tools consist of three steps: 1) preliminary alignment of the reads (a read being a row in the genomic input files) to the transcriptome to build specific gene fusion references; 2) alignment of previously unmapped reads to the gene fusion references to support gene fusion detection; 3) cleaning filters to discard false positives. The main differences between the tools lie in the alignment type (e.g., BLAST vs BWA) and the properties of the cleaning filters. Although a proper gold standard procedure for gene fusion detection has not been established, the most widely used approach involves applying multiple gene fusion detection tools and merging the results obtained. EricScript, FusionCatcher, and Arriba have been selected for this workflow due to their widespread use and unique characteristics. EricScript and FusionCatcher have been selected due to the differences in their cleaning filters. The former exploits, among others, heuristic filters to remove analysis artefacts, while the latter removes false positives using known and novel criteria that make biological sense. Finally, Arriba has been chosen since it can find aberrations that its competitors hardly find (e.g., intragenic and intergenic duplications/inversions/translocations).
Since DNA sequencing has only recently spread on a large scale [16], only a few DNA gene fusion detection tools are available. GeneFuse and INTEGRATE deserve to be mentioned for their user experience. GeneFuse can detect gene fusions from DNA samples alone, while INTEGRATE requires both RNA and DNA data from the same sample to provide the gene fusion list. At the same time, it can fully reconstruct gene fusion junctions and genomic breakpoints by split-read mapping.
To make the pipeline as simple to use as possible, the only mandatory inputs are the RNA or DNA files to be analyzed. When only these are given, the workflow looks for the tools' required files in default paths. If the files are present, the gene fusion detection tools start processing the data. Otherwise, the pipeline downloads and installs all the necessary tools and files before executing the tools. FusionFlow accepts RNA only, DNA only, or both RNA and DNA data as input.
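The relation between the available input data and the tools that can run may be sketched as follows. This is a minimal illustration, not the FusionFlow implementation: the `TOOL_INPUTS` table and the `select_tools` function are hypothetical names, and running Integrate on RNA data alone is an assumption based on the first of its two input options described above.

```python
# Minimum data type each tool needs, following the description in the text.
# The dictionary layout itself is an assumption made for this sketch.
TOOL_INPUTS = {
    "EricScript": {"rna"},
    "Arriba": {"rna"},
    "FusionCatcher": {"rna"},
    "GeneFuse": {"dna"},
    "Integrate": {"rna"},  # DNA is an optional additional input
}


def select_tools(has_rna: bool, has_dna: bool) -> list[str]:
    """Return the tools whose minimum input requirements are satisfied."""
    available = set()
    if has_rna:
        available.add("rna")
    if has_dna:
        available.add("dna")
    # A tool is runnable when its required data types are a subset
    # of the data types supplied by the user.
    return [tool for tool, required in TOOL_INPUTS.items() if required <= available]
```

With RNA data only, the sketch selects the four RNA-capable tools; with DNA data only, it selects GeneFuse; with both, all five tools are selected.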
The FusionFlow pipeline produces several files divided into two categories: the tools' required files and the gene fusion output files. The first category includes all the files needed to execute the tools. These files can be provided directly to the workflow, skipping the download processes, or can be downloaded when the workflow runs for the first time. In the latter case, the files are saved in a specific path so that they are available to the pipeline in subsequent runs. The second category includes the files produced by the gene fusion tools. Each tool outputs one or more files in specific formats. The most common formats are Tab Separated Values (TSV), Variant Call Format (VCF), and plain text [2].
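The reuse of previously downloaded files can be illustrated with a short sketch. It assumes that a downloader is needed only when at least one required file is missing from the expected directory; the function names, the file names, and the `<tool>_downloader` naming are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def missing_files(base_dir, required_files):
    """List the required files that are not yet present in base_dir."""
    base = Path(base_dir)
    return [name for name in required_files if not (base / name).exists()]


def plan_processes(tool, base_dir, required_files):
    """Return the processes to run for one tool, in order.

    The downloader is scheduled only if something is missing; on later
    runs the saved files are found and the downloader is skipped.
    """
    steps = []
    if missing_files(base_dir, required_files):
        steps.append(f"{tool}_downloader")  # hypothetical process name
    steps.append(tool)
    return steps
```

On a first run with an empty directory the plan contains the downloader followed by the tool; once the files are in place, the plan contains the tool alone.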
In the following, the general workflow architecture is described.

Architecture
The general workflow structure is based on Nextflow, a dataflow programming model that simplifies writing complex distributed pipelines.
The Groovy/JVM-based Nextflow framework was selected among several workflow management systems (e.g., Galaxy [11], Toil [27], Snakemake [14], Bpipe) for its distinctive features. In particular, it allows:
- processes written in different languages to coexist. Nextflow recognizes the script's language automatically and dynamically generates a launch file for each process;
- data to be processed as a stream, step by step. Each process communicates through its input/output channel definitions. These channels can also be used as synchronization mechanisms to make the pipeline sequential;
- integration with sharing platforms such as GitHub. Nextflow detects whether the repository is installed and, if it is not, downloads all the requirements, environments included;
- integration with the most popular container engines, such as Docker and Singularity. This feature is crucial for gene fusion tools since they often require conflicting packages. In the current pipeline, each process runs in a separate environment;
- integration with several schedulers, such as SLURM. Because of the substantial memory required by the gene fusion tools, the pipeline is in practice executed on large servers, which are rarely used without a scheduler.
The workflow is composed of fifteen processes, which can be divided into three main categories:
- downloaders: they are responsible for installing the tools and downloading the input files. The downloader processes are: referenceGenome downloader, arriba downloader, ericscript downloader, fusioncatcher downloader, integrate downloader and genefuse downloader;
- converters: they are responsible for file preparation and format conversion if needed. The converter processes are: integrate converter and genefuse converter;
- runners: they execute the code and the tools. The runner processes are: arriba, ericscript, fusioncatcher, integrate, genefuse, referenceGenome index and integrate builder.
The fifteen processes are structured into six main parallel lines shown in green in Figure 1.
When the script is executed with Nextflow, the workflow looks for the required files in the paths specified in the nextflow.config configuration file or on the command line. If the files exist, the associated downloader is skipped and the following processes can start. Nextflow processes are usually executed concurrently. Nextflow queue channels are used to execute downloaders, converters, and runners sequentially and to provide communication between processes. A queue channel creates an asynchronous unidirectional FIFO queue that connects processes or operators. Combining queue channels makes it possible to create predefined sequences of processes. Each process expects to receive its input data from the channels specified in its input block and runs when the inputs are emitted. The five fusion detection tools included in FusionFlow are managed through such queue channels. All processes have the same structure: they are triggered by their input and, after the script block is executed, provide output that triggers the subsequent processes. As illustrated in Fig. 2, for each channel, the first step consists of describing the channel configuration (e.g., DNA files, RNA files, the tool installation path, and further databases necessary for the gene fusion tools). If the user does not provide the tool databases, a separate channel is used to download them, and a data channel passes the database to the next process to be triggered. Then, the tool/database is installed if not yet present, and the data is converted to the correct format if the user requests it. Finally, the data is passed to the gene fusion tool for execution.
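The role of queue channels in chaining processes can be mimicked outside Nextflow. The sketch below is an analogy written in Python, not Nextflow code: each stage runs concurrently in its own thread, reads from one FIFO queue, and writes to the next, so that for every sample the downloader, converter, and runner steps occur in that order. The stage names follow the three process categories; the `DONE` sentinel, the function names, and the sample identifiers are assumptions of this sketch.

```python
import queue
import threading

DONE = object()  # sentinel marking that a channel is closed


def stage(name, inbox, outbox, log):
    """Consume items in FIFO order, record the work, and emit downstream."""
    while True:
        item = inbox.get()  # blocks until an input is emitted
        if item is DONE:
            outbox.put(DONE)  # propagate the end of the stream
            return
        log.append((name, item))
        outbox.put(item)  # the output triggers the next stage


def run_chain(samples):
    """Push samples through downloader -> converter -> runner."""
    names = ["downloader", "converter", "runner"]
    channels = [queue.Queue() for _ in range(len(names) + 1)]
    log = []
    threads = [
        threading.Thread(target=stage, args=(n, channels[i], channels[i + 1], log))
        for i, n in enumerate(names)
    ]
    for t in threads:
        t.start()
    for sample in samples:
        channels[0].put(sample)
    channels[0].put(DONE)
    for t in threads:
        t.join()
    # Drain the final channel to collect the results in arrival order.
    results = []
    while True:
        item = channels[-1].get()
        if item is DONE:
            break
        results.append(item)
    return results, log
```

The stages overlap in time across different samples, while the queues guarantee that each sample passes through the three stages in the predefined sequence and that samples leave the chain in the order in which they entered it.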
In the end, the workflow provides as output the list of candidate gene fusions for each tool, to be investigated by the user.
Fig. 2. General flow for each tool. Given the data as input, 1) the required configuration is set; 2) the tool/database is installed if not present; 3) the data is converted if it is in the wrong format; 4) the data is passed to the gene fusion tool.

Workflow test and discussion
The files used to test the pipeline are the same as those proposed for the FusionCatcher tool and are publicly available both at https://github.com/ndaniel/fusioncatcher/tree/master/test and in the FusionFlow GitHub repository.
Initially, each tool was tested separately on a Linux operating system and inside a Docker environment, checking the setup (e.g., paths, files, profiles, libraries). Then, after making sure that all the tools worked, the entire Nextflow workflow was tested inside a single Docker environment. In the scenario without Docker, Conda virtual environments were created manually. With Docker, instead, the setup is prepared automatically through the Dockerfile. The test files and the local profile were specified on the command line to execute these tests.
Each gene fusion detection tool outputs one or more files in specific formats. Generally, a summary file is also produced to allow a quick overview of the predictions. The outputs obtained from the tools are concordant with the gene fusions expected for the test data. All the tools in the pipeline recognize at least ten of the seventeen fusions, except for GeneFuse, which recognizes just three of them. Although GeneFuse's performance should be investigated on additional data, the poor result could be explained by the specific DNA filters implemented in GeneFuse. Different approaches can be used to select the final driver gene fusion predictions. The typical practice is to use the union or the intersection of the tools' predictions. The union of the results gives a large set of predictions. This approach increases the probability of including the real drivers of cancer processes. However, it also increases the chance of including false positives or passenger mutations. Using the intersection, conversely, radically decreases the number of predictions. This approach discards false positives and passenger mutations, but it could also discard real cancer drivers. In this test case, the union of the results contains nineteen gene fusion predictions, while the intersection includes just two of them (ETV6-NTRK3 and GOPC-ROS1).
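The union and intersection strategies can be sketched with plain set operations. The example data are invented for illustration and are not the actual test results: only the two fusion names ETV6-NTRK3 and GOPC-ROS1 come from the text, while the tool labels and the GENE1-GENE2 and GENE3-GENE4 entries are placeholders. The per-fusion support count is an addition of this sketch, offered as one possible middle ground between the two strategies.

```python
from collections import Counter


def combine_predictions(predictions):
    """Return (union, intersection, support) for per-tool fusion sets.

    `predictions` maps a tool name to the set of fusions it reports;
    `support` counts how many tools report each fusion.
    """
    sets = [set(fusions) for fusions in predictions.values()]
    union = set().union(*sets)
    intersection = set.intersection(*sets) if sets else set()
    support = Counter(fusion for s in sets for fusion in s)
    return union, intersection, support


# Toy input: placeholder tools and partly placeholder fusions.
example = {
    "tool_a": {"ETV6-NTRK3", "GOPC-ROS1", "GENE1-GENE2"},
    "tool_b": {"ETV6-NTRK3", "GOPC-ROS1"},
    "tool_c": {"ETV6-NTRK3", "GOPC-ROS1", "GENE3-GENE4"},
}
union, intersection, support = combine_predictions(example)
```

In this toy case the union keeps all four candidates, while the intersection keeps only the two fusions reported by every tool. Filtering on the support count (for instance, fusions reported by at least two tools) would give a list between these two extremes.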

Conclusions
FusionFlow is an easy-to-use, flexible, highly reproducible, and integrated workflow. The workflow includes five gene fusion discovery tools that together accept both RNA and DNA data as input. Docker and Conda technologies make it possible to install the tools while avoiding version conflicts. In addition, the Nextflow framework allows the five tools to be executed in parallel, optimizing time and resource usage and managing the tools' installation and file allocation. The workflow was tested using publicly available test files. The tests were performed using a local profile in two conditions: on a private server and on the same private server inside a Docker container. In both cases, the outputs were satisfactory. Thus, the FusionFlow pipeline is available for further validation on additional DNA and RNA genomic data.
This work represents a foundation on which improvements and future works can be built. One of the main problems related to gene fusion detection is determining which gene fusions are drivers of cancer processes and not just passenger mutations. The fusion detection tools already provide a first step towards solving this problem: they filter the candidate gene fusions based on the sample's reads, trying to decrease the number of false positives as much as possible. However, this step is generally insufficient to determine the cancer drivers, and an additional step can be required. It consists of post-processing tools (called prioritization tools) that can predict a gene fusion's oncogenic potential. Several prioritization tools exist, such as Oncofuse, Pegasus, DEEPrior, and ChimerDriver [25,1,19,17,18]. These tools are based on machine learning (ML) algorithms trained with the protein domains of the fusion proteins and allow the selection of the most probable cancer drivers. The post-processing step could also be completed by adding a different algorithm, which would compare the outputs of the tools and select the most probable drivers of cancer processes by analyzing the union and the intersection and taking into account the different characteristics of the gene fusion detection tools. Another crucial question is related to visualization tools. Humans can efficiently distinguish true positives from false positives if the evidence is provided in an easily interpretable form. These tools also help to better interpret the potential consequences of gene fusion events. Several visualization tools have been released in recent years, such as INTEGRATE-vis [31], FGviewer [13], and FuSpot [12].

Funding
This study was funded by the European Union's Horizon 2020 research and innovation programme DECIDER under Grant Agreement 965193.

Fig. 1. Pipeline architecture parallelization: each tool is composed of multiple subunits (shown in blue) executed concurrently through six main parallel lines (shown in green) to optimize the workflow performance.