Cluster-specific functional enrichment and whole annotation retrieval
Source: R/clustrenrich.R
This function performs Over-Representation Analysis (ORA) on each cluster to identify enriched biological functions. It relies on the gprofiler2::gost() function and offers customization options, including the choice of background gene list, background type (e.g., "custom" or "custom_annotated"), database sources (e.g., GO, KEGG, WP), p-value correction method, and the option to exclude GO terms inferred from electronic annotation (IEA). The function is adaptable to various organisms and biological annotation sources.
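As an illustration of what an over-representation test computes, the sketch below runs a one-sided hypergeometric test for a single cluster and term, then adjusts several p-values with the "fdr" method. It uses base R only and does not call gprofiler2; all counts, term names and raw p-values are hypothetical, and it is meant as a simplified picture of the statistic rather than the package's internal code.

```r
# Toy numbers (all hypothetical): one cluster tested against one term.
N <- 2000  # annotated background genes (the statistical domain)
K <- 40    # background genes annotated to the term (term size)
n <- 50    # genes in the cluster (query size)
k <- 8     # cluster genes annotated to the term (intersection size)

# P(X >= k) under the hypergeometric null: the probability of seeing at
# least k annotated genes when drawing n genes from the background.
p_value <- phyper(k - 1, m = K, n = N - K, k = n, lower.tail = FALSE)

# Several terms are tested at once, so raw p-values are adjusted.
# The two other raw p-values are made up for the example.
raw_p <- c(term_a = p_value, term_b = 0.03, term_c = 0.40)
adj_p <- p.adjust(raw_p, method = "fdr")

# Terms passing the adjusted p-value cutoff (user_threshold = 0.05).
enriched <- names(adj_p)[adj_p < 0.05]
```

With a "custom_annotated" background, N would be the number of supplied background genes that carry at least one annotation, which is why the choice of background changes the resulting p-values.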
Users can filter terms/pathways based on gene set size (min_term_size and max_term_size) and the number of genes enriched (ngenes_enrich_filtr). For example, terms with fewer than the minimum required genes or more than the maximum allowed genes are excluded, and terms enriched by fewer than the specified number of genes are filtered out.
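The filtering described above can be sketched on a toy table that borrows the term_size and intersection_size column names from a gost() result. The data values are made up, and treating the thresholds as inclusive bounds is an assumption of this sketch, not something confirmed by the package source.

```r
# Toy enrichment table with gost()-style column names (values are made up).
res <- data.frame(
  query = c("1", "1", "2", "2"),
  term_name = c("term A", "term B", "term C", "term D"),
  term_size = c(4, 120, 2500, 60),
  intersection_size = c(3, 2, 10, 5),
  stringsAsFactors = FALSE
)

min_term_size <- 5
max_term_size <- 500
ngenes_enrich_filtr <- 3

# A NULL threshold means "skip this filter", mirroring the documented defaults.
# Inclusive bounds are an assumption made for this sketch.
keep <- rep(TRUE, nrow(res))
if (!is.null(min_term_size)) keep <- keep & res$term_size >= min_term_size
if (!is.null(max_term_size)) keep <- keep & res$term_size <= max_term_size
if (!is.null(ngenes_enrich_filtr)) {
  keep <- keep & res$intersection_size >= ngenes_enrich_filtr
}

filtered <- res[keep, ]
```

Here only "term D" survives: "term A" is too small, "term C" is too large, and "term B" is enriched by too few genes.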
Additionally, users can choose to retain only highlighted/driver GO terms to reduce redundancy and focus on key biological functions. A secondary gprofiler2::gost() run with significant = FALSE retrieves annotations for all deregulated genes, which is utilized later in the lonelyfishing() function. Throughout the process, a dataframe tracks the number of biological functions linked to each cluster after each filtering step, categorized by source. All main parameters used are saved in the output for transparency and reproducibility.
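A minimal sketch of the highlighted-term filter and the per-cluster tracking table follows. It assumes a gost()-style result with a logical highlighted column, and it assumes that terms from non-GO sources are kept untouched by this filter. The helper count_terms() and the step labels are hypothetical names introduced for the example.

```r
# Toy result table (values are made up); `highlighted` flags driver GO terms.
res <- data.frame(
  query = c("1", "1", "1", "2", "2"),
  source = c("GO:BP", "GO:BP", "KEGG", "GO:BP", "WP"),
  term_name = c("term A", "term B", "term C", "term D", "term E"),
  highlighted = c(TRUE, FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep highlighted GO terms; rows from non-GO sources pass through
# unchanged (an assumption of this sketch).
is_go <- startsWith(res$source, "GO:")
kept <- res[!is_go | res$highlighted, ]

# Hypothetical helper: number of terms per cluster and source at one step.
count_terms <- function(df, step) {
  counts <- as.data.frame(
    table(query = df$query, source = df$source),
    responseName = "n_terms",
    stringsAsFactors = FALSE
  )
  counts$step <- step
  counts
}

# One row per cluster, source and filtering step.
log_df <- rbind(
  count_terms(res, "before"),
  count_terms(kept, "after_highlight")
)
```

Stacking one count table per step gives a long-format log in which the effect of each filter on each cluster and source can be read directly.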
Usage
clustrenrich(
  clustrfiltr_data,
  dr_genes,
  bg_genes,
  bg_type = "custom_annotated",
  sources = c("GO:BP", "KEGG", "WP"),
  organism,
  user_threshold = 0.05,
  correction_method = "fdr",
  exclude_iea = FALSE,
  enrich_size_filtr = TRUE,
  only_highlighted_GO = TRUE,
  min_term_size = NULL,
  max_term_size = NULL,
  ngenes_enrich_filtr = NULL,
  path,
  output_filename,
  overwrite = FALSE
)
Arguments
- clustrfiltr_data
The named list output from the clustrfiltr() function.
- dr_genes
The character vector of deregulated genes that can correspond to the gene_id column in the output of the getids() or getregs() function. The gprofiler2::gost() function handles mixed types of gene IDs and even duplicates by treating them as a single unique occurrence of the identifier, disregarding any duplication.
- bg_genes
The character vector of background genes (preferably from the experiment) that typically corresponds to the gene_id column in the output of the getids() function.
- bg_type
The background type, i.e. the statistical domain, that can be one of "annotated", "known", "custom" or "custom_annotated".
- sources
A vector of data sources to use. The default is c("GO:BP", "KEGG", "WP").
- organism
Organism ID defined for the chosen sources (e.g. zebrafish = "drerio")
- user_threshold
Adjusted p-value cutoff for the over-representation analysis (default is 0.05, as in the gost() function).
- correction_method
P-value adjustment method: one of "gSCS", "fdr" or "bonferroni" (default is "fdr").
- exclude_iea
Option to exclude GO electronic annotations (IEA)
- only_highlighted_GO
Whether to retain only highlighted driver GO terms in the results. Default is set to TRUE.
- min_term_size
Minimum size of gene sets to be included in the analysis. If NULL (default), no filtering by size is applied.
- max_term_size
Maximum size of gene sets to be included in the analysis. If NULL (default), no filtering by size is applied.
- ngenes_enrich_filtr
Minimum number of genes in a cluster needed for a gene set to be considered enriched. If NULL (default), no filtering by gene count is applied.
- path
Destination folder for the output data results.
- output_filename
Output enrichment result filename.
- overwrite
If TRUE, the function overwrites existing output files; otherwise, it reads the existing file (default is FALSE).
Value
A named list holding 5 components, where:
- dr_g_a_enrich
is a dataframe of type g_a holding the enrichment results, with each row being a combination of gene and biological function annotation
- gostres
is a named list where 'result' contains the data frame with enrichment analysis results, and 'meta' contains metadata necessary for creating a Manhattan plot. This is the original output of a gprofiler2::gost() call
- dr_g_a_whole
is a dataframe of type g_a holding all the biological function annotations found in the g:profiler database for all the deregulated genes
- c_simplifylog
is a dataframe tracing the number of biological functions enriched per cluster before and after each filtering step for each source
- params
is a list of the main parameters used
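To make the "one row per gene and annotation" shape of a g_a dataframe concrete, the sketch below expands a toy gost()-style table into long format. It assumes the table has an intersection column listing the contributing genes as a comma-separated string, as gost() returns when called with evcodes = TRUE. The output column names (gene_id, clustr, term_name, source) are hypothetical and may differ from those used by the package.

```r
# Toy gost()-style result (values are made up). Each `intersection` entry
# lists the query genes annotated to the term, separated by commas.
res <- data.frame(
  query = c("1", "2"),
  term_name = c("term A", "term B"),
  source = c("GO:BP", "KEGG"),
  intersection = c("geneA,geneB", "geneC"),
  stringsAsFactors = FALSE
)

# Split the gene lists and repeat each term's metadata once per gene.
genes <- strsplit(res$intersection, ",", fixed = TRUE)
n_genes <- lengths(genes)

g_a <- data.frame(
  gene_id = unlist(genes),
  clustr = rep(res$query, n_genes),
  term_name = rep(res$term_name, n_genes),
  source = rep(res$source, n_genes),
  stringsAsFactors = FALSE
)
```

The two input rows become three rows in g_a, one for each gene and annotation pair.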