Introduction
Interpretation of gene lists is a key step in numerous biological data analysis workflows, such as differential gene expression analysis and co-expression clustering of RNA-seq or microarray data. Usually this involves associating these gene lists with prior knowledge from well-curated data sources of biological processes and pathways. However, as the knowledge bases are constantly changing, keeping the associations up to date requires careful data management. Handling numerous databases, especially when they use different gene identifier types, can be very time-consuming for researchers.
g:Profiler (https://biit.cs.ut.ee/gprofiler) is a popular web toolset for handling gene lists from various biological and biomedical studies of more than 600 species and strains, including vertebrates, plants, fungi, insects and parasites 1, 2. g:Profiler’s best-known functionality is over-representation analysis, which identifies significantly enriched biological functions and pathways from well-established data sources, including Gene Ontology (GO) 3, KEGG 4 and Reactome 5. The information about genes, identifier types and GO term associations in g:Profiler is mostly based on Ensembl databases 6, including data from the fungi-, plant- and metazoa-specific versions of Ensembl provided by Ensembl Genomes. g:Profiler follows Ensembl’s quarterly update cycle while keeping previous data versions accessible as archives for reproducibility. The parasite-specific data are included from WormBase 7.
Providing users with fast and easy access has been the main goal of g:Profiler developers. g:Profiler has been in constant development since 2007, and with the recent update in 2019 a new accompanying R package, gprofiler2, was developed 8. The R package relies on g:Profiler REST API requests, providing easy programmatic access to the same functionality as the web tool without performing heavy computations and mappings in R. While there are other popular R packages for functional enrichment analysis, such as
The g:Profiler development team encourages and supports external tools and packages that use either the gprofiler2 package or the public API as part of their workflows. For example,
Here we demonstrate how to conveniently incorporate the gprofiler2 R package into bioinformatics analysis pipelines using differential gene expression analysis as an example.
Methods
Implementation
Inherently, gprofiler2 8 is a collection of wrapper functions in R that simplify sending POST requests to the g:Profiler REST API using the
There are four main API wrapper functions in gprofiler2:
In addition to fetching the results from the API, gprofiler2 uses the packages ggplot2 15 and plotly 16 to provide visualisations for enrichment results that are similar to those in the web tool. Using
This article was written using R version 3.6.1 (2019-07-05) and gprofiler2 version 0.1.9.
Operation
The gprofiler2 R package is available from CRAN and works on R versions 3.5 and above. The package also includes a detailed vignette.
The package can be installed from CRAN:
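A minimal sketch of the installation step, assuming the standard CRAN installer and a mirror of the reader's choice (the mirror URL below is only one possible option):

```r
# Install the released version from CRAN (needs network access; run once).
# The mirror is an assumption; any CRAN mirror can be used instead.
install.packages("gprofiler2", repos = "https://cloud.r-project.org")

# Load the package into the current R session.
library(gprofiler2)
```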
Input description
The most popular functionality of g:Profiler is functional enrichment analysis provided by the g:GOSt tool, which performs over-representation analysis using the hypergeometric test. This functionality is available in gprofiler2 under the function
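To make the hypergeometric test behind over-representation analysis concrete, the following base R sketch computes the enrichment P-value for a single term. All counts are hypothetical and chosen only for illustration; the sketch does not reproduce the multiple testing correction applied by g:Profiler.

```r
# Hypothetical counts, for illustration only.
N <- 20000  # genes in the background
K <- 150    # background genes annotated to the term
n <- 300    # genes in the query list
k <- 12     # query genes annotated to the term

# P(X >= k): probability of observing at least k annotated genes when
# n genes are drawn at random from the background without replacement.
p_value <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same quantity from a one-sided Fisher's exact test on the 2x2 table
# (rows: annotated / not annotated, columns: in query / not in query).
tab <- matrix(c(k, n - k, K - k, N - K - (n - k)), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

With these numbers only about 2 annotated genes would be expected in the query by chance, so an overlap of 12 yields a small P-value.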
The query vector can include mixed types of gene/protein identifiers, SNP rs-IDs, chromosomal intervals or term IDs. Accepting a mixture of IDs is a unique feature that skips time-consuming manual steps of converting between different identifier types required by other functional enrichment tools. However, in case of analysing numeric identifiers (e.g. Entrez IDs) the user should specify the namespace using the
Several additional parameters in the
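As a hedged illustration of the input handling described above, the sketch below sends a mixed query and a purely numeric query to g:Profiler. It assumes the gost function and its numeric_ns argument as documented for the gprofiler2 package on CRAN; the example identifiers are illustrative and network access is required.

```r
library(gprofiler2)

# A query mixing a chromosomal interval, an rs-ID, a GO term ID,
# an Ensembl gene ID and a gene symbol.
mixed_query <- c("X:1000:1000000", "rs17396340", "GO:0005005",
                 "ENSG00000156103", "NLRP1")
gostres <- gost(query = mixed_query, organism = "hsapiens")

# The enrichment table is stored in the element "result".
head(gostres$result[, c("source", "term_id", "term_name", "p_value")])

# Purely numeric identifiers need an explicit namespace, here Entrez IDs.
entrez_query <- c("1017", "672", "7157")
gostres_num <- gost(query = entrez_query, organism = "hsapiens",
                    numeric_ns = "ENTREZGENE_ACC")
```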
Annotation databases
g:Profiler’s in-house database includes only reliable annotation data sources that are regularly updated, such as Gene Ontology (GO) 3, KEGG 4, Reactome 5, WikiPathways 17, miRTarBase 18, TRANSFAC 19, Human Protein Atlas 20, protein complexes from CORUM 21 and Human Phenotype Ontology 22. By default, all data sources in the g:Profiler database are used for the analysis in gprofiler2, but a specific selection can be defined with the
Use case
Differential gene expression analysis determines lists of genes that show changes in expression between different conditions, cell types, time points, etc. Functional enrichment analysis using the gprofiler2 package 8 helps to interpret these gene lists.
Here we demonstrate the main functionality of gprofiler2 by following an analysis example from the existing RNA-seq Bioconductor workflow 23 that uses the popular
Functional enrichment of differentially expressed genes
First, we will detect the list of genes that are differentially regulated when stimulated with dexamethasone and then we will use the function
The output of the gost function is a named list where the element
Accounting for the order of genes in enrichment analysis
For cases where the list of interesting genes can be ranked by some biologically meaningful measure, such as P-value or fold change in differential analysis, g:Profiler provides an ordered query option that takes the ranking into account when performing enrichment tests. The testing is then performed iteratively, starting from the first gene and sequentially adding genes one by one. For every term, the smallest enrichment P-value is reported along with the corresponding gene list size. Consequently, for different terms the query size can vary, especially as the broader terms can be enriched for larger lists only. This option is very similar to the idea of the GSEA analysis method 26.
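The iterative testing idea can be sketched in base R as follows. This is a conceptual illustration on made-up gene names, not the g:Profiler server implementation, and it omits multiple testing correction.

```r
# Toy data: a background of 100 genes, one term annotating 10 of them,
# and a ranked query of 8 genes. All gene names are made up.
background   <- paste0("g", 1:100)
term_genes   <- paste0("g", 1:10)
ranked_query <- c("g1", "g2", "g3", "g50", "g4", "g60", "g70", "g80")

N <- length(background)
K <- length(term_genes)

# Hypergeometric P-value for every prefix of the ranked list:
# the first gene, the first two genes, and so on.
prefix_p <- sapply(seq_along(ranked_query), function(n) {
  k <- sum(ranked_query[seq_len(n)] %in% term_genes)
  phyper(k - 1, K, N - K, n, lower.tail = FALSE)
})

# Report the smallest P-value and the query size at which it occurs.
best_size <- which.min(prefix_p)
best_p    <- prefix_p[best_size]
```

In this toy example the term genes are concentrated at the top of the ranking, so the smallest P-value is reached for a prefix rather than for the full list.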
For example, to perform an ordered query using gprofiler2, we first rearrange the list of up-regulated genes based on the log2 fold change values so that the first gene in the list has the highest value. Next we use this ordered list as a query in the
The resulting data frame is in the same format as shown previously. Only the size of the query in the table can vary as the algorithm detects the most significant cutting point from the input gene list considering every function separately.
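A small sketch of these two steps, assuming the gost function with its ordered_query argument as documented for gprofiler2. The table of up-regulated genes is a hypothetical stand-in for real differential expression results, and its fold change values are invented.

```r
library(gprofiler2)

# Hypothetical up-regulated genes with invented log2 fold changes.
up <- data.frame(
  gene = c("ENSG00000120129", "ENSG00000152583",
           "ENSG00000101347", "ENSG00000165995"),
  log2FoldChange = c(2.9, 4.6, 3.7, 3.3),
  stringsAsFactors = FALSE
)

# Rank the genes so that the largest fold change comes first.
up_ordered <- up[order(up$log2FoldChange, decreasing = TRUE), ]

# Ordered query (needs network access). With such a short list the
# call may return NULL if no term is significant.
gostres_ordered <- gost(query = up_ordered$gene, organism = "hsapiens",
                        ordered_query = TRUE)
```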
Visualisation of functional enrichment results
Different visualisations are useful to summarise and interpret functional enrichment results. With the recent update, g:Profiler introduced an alternative way of visualising functional terms, a Manhattan plot. On this plot, the x-axis shows the terms and the y-axis shows the enrichment P-values on a -log10 scale. Each circle on this plot corresponds to a single term. The circles are colored according to the annotation source and size-scaled according to the total number of genes annotated to the corresponding term. The locations on the x-axis are always fixed and ordered in a way that the terms from the same GO subtree are located closer to each other. This helps to highlight different enriched GO sub-branches as they form peaks in the Manhattan plot and makes plots from different queries easily comparable. For the same reason, by default the values on the y-axis are capped to a maximum value of 16, which corresponds to P-values below 10^-16. The same default threshold is also used in the statistical tests in R. This capping can be switched off to show the P-values on a wider scale.
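The y-axis transformation and capping described above amount to the following base R arithmetic, shown here on hypothetical P-values:

```r
# Hypothetical enrichment P-values.
p_values <- c(0.03, 1e-5, 1e-20, 1e-40)

# Transform to the -log10 scale used on the y-axis.
neg_log10 <- -log10(p_values)

# Cap the transformed values at 16, so that every P-value below 1e-16
# is drawn at the same height.
capped <- pmin(neg_log10, 16)
```

Capping keeps a few extremely small P-values from compressing the rest of the plot and keeps the y-axis range identical across queries.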
Interactive graphs are common in web tools and therefore the Manhattan plot in the g:Profiler web interface also provides several interactive features that facilitate data exploration and allow the visualisations to be exported as high-quality image files. Mimicking the g:Profiler web interface, the Manhattan plot in gprofiler2 is implemented in the function
After exploring the interactive graph and deciding on the story to tell about the results, the user can compose a publishable figure that highlights the most important terms using the function
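The plotting steps can be sketched as follows, assuming the gostplot and publish_gostplot functions and their arguments as documented for gprofiler2. The query is illustrative, network access is required, and the highlighted terms are simply the first three rows of the result rather than a considered selection.

```r
library(gprofiler2)

# Illustrative query; any gost result can be plotted the same way.
gostres <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005",
                          "ENSG00000156103", "NLRP1"),
                organism = "hsapiens")

# Interactive Manhattan plot (a plotly object) for exploration.
p_interactive <- gostplot(gostres, capped = TRUE, interactive = TRUE)

# Static version (a ggplot object) as the basis for a figure.
p_static <- gostplot(gostres, capped = TRUE, interactive = FALSE)

# Highlight selected terms and add a table of them below the plot.
# Setting filename to a path would save the figure to a file instead.
terms_to_show <- head(gostres$result$term_id, 3)
p_published <- publish_gostplot(p_static, highlight_terms = terms_to_show,
                                width = NA, height = NA, filename = NULL)
```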
Figure 1.
Manhattan plot of g:Profiler enrichment results.
As the resulting plot is a standard
Analysing multiple gene lists
Above we were analysing the up- and down-regulated gene lists separately, but the
In this case, the resultant data frame is in a so-called “long format” where the column
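A hedged sketch of analysing several gene lists in one call, assuming that gost accepts a named list as the query and that the result table then contains a query column holding the list names, as documented for gprofiler2. The two input lists are illustrative.

```r
library(gprofiler2)

# Two named input lists sent in a single request (needs network access).
multi_gostres <- gost(
  query = list(
    "list_A" = c("X:1000:1000000", "rs17396340", "GO:0005005",
                 "ENSG00000156103", "NLRP1"),
    "list_B" = c("Y:1:10000000", "rs17396340", "GO:0005005",
                 "ENSG00000156103", "NLRP1")
  ),
  organism = "hsapiens",
  multi_query = FALSE
)

# Long format: one row per term and query, identified by the "query" column.
table(multi_gostres$result$query)
```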
Results from multiple gene lists can also be used for plotting. The function
Figure 2.
Visualisation of g:Profiler enrichment results to compare multiple gene lists.
Sending analysis from R to g:Profiler web interface
The same enrichment results can also be viewed in the g:Profiler web tool. The user can generate a dedicated short-link by setting the parameter
In this case, the variable
Mapping between gene identifiers with
Another common but tedious task in handling gene lists is mapping between different identifiers. The function
As an example we will convert the Ensembl IDs in our differential expression results to numeric Entrez IDs with
The users can add this information to the differential expression results data frame and save it to a tab separated text file to include as a supplementary file in their article, for example.
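The conversion and export steps can be sketched as follows, assuming the gconvert function and the ENTREZGENE_ACC target namespace as documented for gprofiler2. The small differential expression table and its fold change values are hypothetical, and the output path is a temporary file chosen for illustration.

```r
library(gprofiler2)

# Hypothetical differential expression results keyed by Ensembl gene ID.
de <- data.frame(
  ensembl = c("ENSG00000152583", "ENSG00000165995", "ENSG00000120129"),
  log2FoldChange = c(4.6, 3.3, 2.9),
  stringsAsFactors = FALSE
)

# Map Ensembl IDs to Entrez IDs (needs network access).
mapping <- gconvert(query = de$ensembl, organism = "hsapiens",
                    target = "ENTREZGENE_ACC", filter_na = FALSE)

# Attach the Entrez IDs and gene names to the results table.
de_annotated <- merge(de, mapping[, c("input", "target", "name")],
                      by.x = "ensembl", by.y = "input", all.x = TRUE)

# Save as a tab-separated text file.
out_path <- file.path(tempdir(), "de_results_with_entrez.tsv")
write.table(de_annotated, file = out_path, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

A single input can map to several target identifiers, in which case the merged table contains one row per mapping.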
Using custom annotations
While g:Profiler enables the analysis of genes from numerous organisms using high-quality annotation databases, there is still a need for custom data functionality for researchers interested in non-model organisms that are not annotated in the Ensembl database, or in some specific, less widespread annotation resource. In g:Profiler, this is solved by enabling users to upload custom annotation files in the GMT file format, which is essentially a tab-delimited text file where every row describes a function by its identifier, description, and the genes annotated to this function. It is important to note that in the case of custom annotation files, all identifiers not present in the GMT file will be ignored in the analysis.
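The GMT layout described above can be illustrated with a short base R sketch that writes a two-row file and reads it back. The term identifiers, descriptions and gene assignments are invented for the example.

```r
# Two hypothetical terms: identifier, description, annotated genes.
terms <- list(
  list(id = "CUSTOM:0001", desc = "Example function A",
       genes = c("TP53", "BRCA1", "CDK2")),
  list(id = "CUSTOM:0002", desc = "Example function B",
       genes = c("NLRP1", "DUSP1"))
)

# One tab-delimited row per term; rows may differ in length.
gmt_lines <- vapply(terms, function(term) {
  paste(c(term$id, term$desc, term$genes), collapse = "\t")
}, character(1))

gmt_path <- file.path(tempdir(), "custom_example.gmt")
writeLines(gmt_lines, gmt_path)

# Reading back: field 1 is the ID, field 2 the description,
# the remaining fields are the annotated genes.
fields <- strsplit(readLines(gmt_path), "\t", fixed = TRUE)
```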
For example, to use the gene-disease association data from the DisGeNET database 27 for enrichment analysis, the user can upload the GMT file in R using the
First, we use R utility function
Now, when we have the file in our local environment, we can upload it to g:Profiler with the
The result of this upload is a unique token (in this case "gp_goJy_Ej2J_rPc") which should be saved by the user for future use. In order to find the enriched diseases in our gene list, we will use the token as a value for the
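A hedged sketch of the upload-then-query pattern, assuming the upload_GMT_file function and the use of the returned token as the organism argument of gost, as documented for gprofiler2. To stay self-contained it uploads a tiny invented GMT file rather than the DisGeNET data, so the enrichment call may well return no significant terms; network access is required.

```r
library(gprofiler2)

# A tiny, invented GMT file standing in for a real annotation resource.
gmt_path <- file.path(tempdir(), "custom_example.gmt")
writeLines(c(
  paste(c("CUSTOM:0001", "Example disease A", "TP53", "BRCA1", "CDK2"),
        collapse = "\t"),
  paste(c("CUSTOM:0002", "Example disease B", "NLRP1", "DUSP1"),
        collapse = "\t")
), gmt_path)

# Upload once; the returned token identifies the custom data source.
token <- upload_GMT_file(gmtfile = gmt_path)

# Reuse the token in place of an organism name in later queries.
custom_gostres <- gost(query = c("TP53", "BRCA1"), organism = token)
```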
The custom data source results can also be plotted using the Manhattan plots (Figure 3). In this case, the term position on the x-axis is defined by the order in the GMT file.
Figure 3.
Manhattan plot of g:Profiler enrichment results using DisGeNET database loaded from a custom GMT file.
As the gprofiler2 R package and the web tool are in sync, this token will also work for the analysis in the web tool and can be inserted under the section “Bring your own data (Custom GMT)”. And vice versa, the token obtained from the web tool will work in the R package without uploading the data again. Thus, in order to analyse multiple gene lists with the same data source, the user needs to upload the file only once and can use the given token from then on. Furthermore, analysing multiple custom sources at once is enabled with the upload of a ZIP archive that includes multiple GMT files. GMT file names are used as the names for the data sources in the results and colored independently in the Manhattan plot.
Mapping orthologous genes with
Sometimes, in order to further investigate the interesting set of differential genes in human, researchers need to perform additional experiments on model organisms such as mice. This requires finding the corresponding orthologs of these interesting genes from other species. Another use for orthologous genes is the possibility to transfer the extensive knowledge from well studied organisms to less studied species.
Mapping orthologous genes between species in g:Profiler is enabled by the g:Orth tool, and in the gprofiler2 package the access is wrapped into the function
This function returns a data frame that includes the input and target identifiers, and also the ortholog names and descriptions.
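For illustration, a minimal ortholog lookup from human to mouse might look as follows, assuming the gorth function and its output column names as documented for gprofiler2. The two input genes are examples and network access is required.

```r
library(gprofiler2)

# Map two human Ensembl gene IDs to their mouse orthologs.
orth <- gorth(query = c("ENSG00000152583", "ENSG00000120129"),
              source_organism = "hsapiens",
              target_organism = "mmusculus",
              filter_na = TRUE)

# Input identifiers alongside the ortholog names and Ensembl IDs.
orth[, c("input", "ortholog_name", "ortholog_ensg")]
```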
Integrating with external tools for visualisations
Since the output of the
After creating an instance of the
Figure 4.
Dot plot of g:Profiler enrichment results using enrichplot.
As these plots are
Figure 5.
Bar plots of g:Profiler enrichment results using enrichplot.
In order to use the
This command will open the KEGG browser page for the pathway Inflammatory mediator regulation of TRP channels.
Using g:Profiler results in EnrichmentMap
The functional enrichment results from the
In case of a single query, the GEM file can be generated with the following lines of code. The parameter value
In the EnrichmentMap the user can set the “Analysis Type” parameter as Generic/gProfiler and upload the required files: GEM file with enrichment results (input field “Enrichments”) and GMT file that defines the annotations (input field “GMT”). Both of these files have to include gene identifiers from the same namespace for the EnrichmentMap to work.
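The reshaping into a GEM file can be sketched offline as below. The input is a mock table that only mimics the relevant columns of an enrichment result (term_id, term_name, p_value and the comma-separated intersection genes); its values are invented. The six output column names follow the generic enrichment file layout expected by EnrichmentMap, which is stated here as an assumption rather than taken from the text above.

```r
# Mock enrichment results with invented values.
res <- data.frame(
  term_id      = c("GO:0006954", "GO:0006955"),
  term_name    = c("inflammatory response", "immune response"),
  p_value      = c(1e-6, 3e-4),
  intersection = c("ENSG00000152583,ENSG00000120129", "ENSG00000165995"),
  stringsAsFactors = FALSE
)

# Generic Enrichment Map (GEM) layout: one row per term.
gem <- data.frame(
  GO.ID       = res$term_id,
  Description = res$term_name,
  p.Val       = res$p_value,
  FDR         = res$p_value,   # the adjusted P-value is reused here
  Phenotype   = "+1",
  Genes       = res$intersection,
  stringsAsFactors = FALSE
)

gem_path <- file.path(tempdir(), "gProfiler_gem.txt")
write.table(gem, file = gem_path, sep = "\t", quote = FALSE, row.names = FALSE)
```

The gene identifiers in the Genes column must come from the same namespace as those in the GMT file loaded alongside it.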
The GMT files used by g:Profiler are downloadable from the web page under the “Data sources” section. Only the GMT files of KEGG and Transfac are not available as the sharing is restricted by data source licenses.
Reproducibility
The demand for better reproducibility of computational analyses is constantly growing 30. In bioinformatics analysis, many different tools and databases are combined in order to detect relevant findings. This adds an extra layer of complexity which often leads to reproducibility issues. Because of this, since 2011 all the past releases of g:Profiler are maintained and kept usable to ensure reproducibility and transparency of enrichment analysis results. The users can cite the exact extract of the annotation database and the state of the implementation by stating the version number in their research. In gprofiler2, this is available, along with other query information, from the metadata of
The g:Profiler specific version number notes that the results were obtained using the state of the database that includes data from Ensembl release 99, Ensembl Genomes release 46 and WormBase ParaSite release 14, among other sources, and the g:Profiler codebase with the Git revision number f929183. The version number together with the details of applied parameters (available from
In order to reproduce the results obtained with a specific version, one can change the data version using the function
All the past versions and their URLs are available at https://biit.cs.ut.ee/gprofiler/page/archives. gprofiler2 works with versions e94_eg41_p11 and higher; earlier versions are still accessible using the deprecated R package gProfileR.
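Switching between data versions might look like the sketch below, assuming the set_base_url and get_base_url functions as documented for gprofiler2. The archive URL is given only as an example of the expected pattern and should be checked against the archives page before use.

```r
library(gprofiler2)

# Point the package at an archived release (example URL, an assumption;
# take the exact address from the archives page).
set_base_url("https://biit.cs.ut.ee/gprofiler_archive3/e95_eg42_p13")
archive_url <- get_base_url()

# Later calls such as gost() would now query the archived version.

# Switch back to the current release when done.
set_base_url("https://biit.cs.ut.ee/gprofiler")
current_url <- get_base_url()
```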
Function
In order to determine the current g:Profiler URL used for the analysis one can use the function
Conclusion
We presented the gprofiler2 R package 8, which is one of the programmatic access points to the widely used g:Profiler web toolset for gene list functional enrichment analysis and identifier conversion. This package enables effective integration of g:Profiler functionalities into various bioinformatics pipelines and tools written in R without the need to search for and download several data files. The suite of functions in gprofiler2 is implemented with analysis reproducibility and interoperability with other tools in mind. In addition, the package provides a way to easily create or customise the enrichment plots using the existing visualisation packages in R. For researchers who prefer to run their computational analysis pipelines through the web, we have wrapped the gprofiler2 package as a tool for the Galaxy platform 31.
It is important to note that using gprofiler2 for functional enrichment analysis is not limited to the use case of differential gene expression analysis. The package is useful whenever there is a set of genes/proteins/SNPs the user wants to characterise with biological functions or to convert to another namespace.
Data availability
All data underlying the results are available as part of the article and no additional source data are required.
Software availability
R package gprofiler2 is available from CRAN: https://cran.r-project.org/package=gprofiler2.
Source code available from: https://gl.cs.ut.ee/biit/r-gprofiler2.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.3919795 8.
License: GNU General Public License v2.0.
Copyright: © 2020 Kolberg L et al. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
g:Profiler (https://biit.cs.ut.ee/gprofiler) is a widely used gene list functional profiling and namespace conversion toolset that has been contributing to reproducible biological data analysis since 2007. Here we introduce the accompanying R package, gprofiler2, developed to facilitate programmatic access to g:Profiler computations and databases via REST API. The gprofiler2 package provides easy-to-use functionality that enables researchers to incorporate functional enrichment analysis into automated analysis pipelines written in R. The package also implements interactive visualisation methods that help to interpret the enrichment results and to illustrate them for publications. In addition, gprofiler2 gives access to the versatile gene/protein identifier conversion functionality in g:Profiler, making it possible to map between hundreds of different identifier types or orthologous species. The gprofiler2 package is freely available at the CRAN repository.