Cluster-specific functional enrichment and whole annotation retrieval

clustrenrich() performs Over-Representation Analysis (ORA) on each cluster to identify enriched biological functions. It relies on gprofiler2::gost() and exposes several of its options: the background gene list, the background type (e.g., "custom" or "custom_annotated"), the annotation sources (e.g., GO, KEGG, WP), the p-value correction method, and whether to exclude GO terms Inferred from Electronic Annotation (IEA). The function is adaptable to various organisms and biological annotation sources.
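As a rough sketch, the options above might map onto gprofiler2::gost() arguments as follows. The mapping (bg_type to domain_scope, bg_genes to custom_bg) is an assumption based on the gost() signature, not a description of the function's confirmed internals. The helper build_gost_args() and the gene identifiers are hypothetical.

```r
# Hypothetical helper: assemble a gost() argument list from clustrenrich-style
# options. The argument mapping is an assumption, not the package's own code.
build_gost_args <- function(query, bg_genes, organism,
                            bg_type = "custom_annotated",
                            sources = c("GO:BP", "KEGG", "WP"),
                            user_threshold = 0.05,
                            correction_method = "fdr",
                            exclude_iea = FALSE) {
  bg_type <- match.arg(bg_type,
                       c("annotated", "known", "custom", "custom_annotated"))
  list(
    query = query,                 # named list: one gene vector per cluster
    organism = organism,
    significant = TRUE,            # keep only significant terms
    exclude_iea = exclude_iea,
    evcodes = TRUE,                # also return the genes behind each term
    user_threshold = user_threshold,
    correction_method = correction_method,
    domain_scope = bg_type,
    # A custom background is only relevant for the two "custom" scopes
    custom_bg = if (startsWith(bg_type, "custom")) bg_genes else NULL,
    sources = sources
  )
}

# Placeholder gene IDs, for illustration only
clusters <- list("1" = c("gene_a", "gene_b"), "2" = c("gene_c"))
background <- c("gene_a", "gene_b", "gene_c", "gene_d")

gost_args <- build_gost_args(clusters, background, organism = "drerio")

# The actual query needs network access, so it is only run interactively
if (interactive() && requireNamespace("gprofiler2", quietly = TRUE)) {
  gostres <- do.call(gprofiler2::gost, gost_args)
}
```

Building the argument list separately from the call makes it easy to store the parameters alongside the results.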

Users can filter terms/pathways by gene set size (min_term_size and max_term_size) and by the number of genes enriched (ngenes_enrich_filtr). Terms annotated with fewer genes than min_term_size or more genes than max_term_size are excluded, as are terms enriched by fewer than ngenes_enrich_filtr genes.
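The filtering logic described above can be sketched in base R on a toy table. The columns term_size and intersection_size follow the naming of the gost() result table. The data values and the filter_terms() helper are invented for illustration, and the function's actual implementation may differ.

```r
# Toy stand-in for the 'result' table returned by gost(); values are invented
res <- data.frame(
  query = c("1", "1", "2", "2"),
  term_id = c("GO:0000001", "GO:0000002", "KEGG:00001", "WP:WP1"),
  term_size = c(3, 120, 800, 45),
  intersection_size = c(2, 6, 10, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a NULL threshold means the filter is skipped
filter_terms <- function(res, min_term_size = NULL, max_term_size = NULL,
                         ngenes_enrich_filtr = NULL) {
  keep <- rep(TRUE, nrow(res))
  if (!is.null(min_term_size)) {
    keep <- keep & res$term_size >= min_term_size
  }
  if (!is.null(max_term_size)) {
    keep <- keep & res$term_size <= max_term_size
  }
  if (!is.null(ngenes_enrich_filtr)) {
    keep <- keep & res$intersection_size >= ngenes_enrich_filtr
  }
  res[keep, , drop = FALSE]
}

filtered <- filter_terms(res, min_term_size = 5, max_term_size = 500,
                         ngenes_enrich_filtr = 3)
filtered$term_id
```

With these thresholds, only the second term survives: the first is too small, the third too large, and the fourth is supported by a single gene.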

Additionally, users can choose to retain only highlighted (driver) GO terms to reduce redundancy and focus on key biological functions. A second gprofiler2::gost() run with significant = FALSE retrieves the annotations of all deregulated genes, which are used later by the lonelyfishing() function. Throughout the process, a dataframe tracks the number of biological functions linked to each cluster after each filtering step, broken down by source. All main parameters are saved in the output for transparency and reproducibility.
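A minimal sketch of such a tracking table, assuming a toy enrichment table with query, source and highlighted columns (as gost() returns when highlighting is enabled). The step names all_enriched and highlighted_only and the count_terms() helper are hypothetical. Non-GO sources are left untouched here, on the assumption that highlighting applies to GO terms only.

```r
# Toy enrichment table; values are invented for illustration
before <- data.frame(
  query = c("1", "1", "1", "2", "2"),
  source = c("GO:BP", "GO:BP", "KEGG", "GO:BP", "WP"),
  highlighted = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep highlighted GO terms; rows from other sources pass through unchanged
after <- before[before$source != "GO:BP" | before$highlighted, ]

# Hypothetical helper: count terms per cluster and source for one step
count_terms <- function(res, step) {
  counts <- aggregate(list(n = rep(1L, nrow(res))),
                      by = list(cluster = res$query, source = res$source),
                      FUN = sum)
  names(counts)[names(counts) == "n"] <- step
  counts
}

# One row per cluster/source pair, one column per filtering step
simplify_log <- merge(count_terms(before, "all_enriched"),
                      count_terms(after, "highlighted_only"),
                      by = c("cluster", "source"), all.x = TRUE)

# Pairs that lost all their terms appear as NA after the merge; report them as 0
simplify_log$highlighted_only[is.na(simplify_log$highlighted_only)] <- 0L
simplify_log
```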

Usage

clustrenrich(
  clustrfiltr_data,
  dr_genes,
  bg_genes,
  bg_type = "custom_annotated",
  sources = c("GO:BP", "KEGG", "WP"),
  organism,
  user_threshold = 0.05,
  correction_method = "fdr",
  exclude_iea = FALSE,
  enrich_size_filtr = TRUE,
  only_highlighted_GO = TRUE,
  min_term_size = NULL,
  max_term_size = NULL,
  ngenes_enrich_filtr = NULL,
  path,
  output_filename,
  overwrite = FALSE
)

Arguments

clustrfiltr_data: The named list output from the clustrfiltr() function.
dr_genes: The character vector of deregulated genes, which can correspond to the gene_id column in the output of the getids() or getregs() function. gprofiler2::gost() accepts mixed types of gene IDs and treats duplicated identifiers as a single occurrence.
bg_genes: The character vector of background genes (preferably from the experiment) that typically corresponds to the gene_id column in the output of the getids() function.
bg_type: The background type, i.e. the statistical domain, which can be one of "annotated", "known", "custom" or "custom_annotated".
sources: A vector of data sources to use. Defaults to GO:BP, KEGG and WP.
organism: Organism ID defined for the chosen sources (e.g. zebrafish = "drerio").
user_threshold: Adjusted p-value cutoff for the Over-Representation Analysis (default is 0.05, as in the gost() function).
correction_method: P-value adjustment method: one of "gSCS", "fdr" or "bonferroni" (default is "fdr").
exclude_iea: Whether to exclude GO electronic annotations (IEA). Default is FALSE.
only_highlighted_GO: Whether to retain only highlighted driver GO terms in the results. Default is set to TRUE.
min_term_size: Minimum size of gene sets to be included in the analysis. If NULL (default), no filtering by size is applied.
max_term_size: Maximum size of gene sets to be included in the analysis. If NULL (default), no filtering by size is applied.
ngenes_enrich_filtr: Minimum number of genes in a cluster needed for a gene set to be considered enriched. If NULL (default), no filtering by gene count is applied.
path: Destination folder for the output data results.
output_filename: Output enrichment result filename.
overwrite: If TRUE, the function overwrites existing output files; otherwise, it reads the existing file. Default is FALSE.

Value

A named list holding 5 components:

dr_g_a_enrich: A dataframe of type g_a holding the enrichment results, with each row being a combination of gene and biological function annotation.
gostres: A named list where 'result' contains the data frame with the enrichment analysis results and 'meta' contains the metadata needed to create a Manhattan plot. This is the original output of gprofiler2::gost().
dr_g_a_whole: A dataframe of type g_a holding all the biological function annotations found in the g:Profiler database for all the deregulated genes.
c_simplifylog: A dataframe tracing the number of biological functions enriched per cluster before and after each filtering step, for each source.
params: A list of the main parameters used.