Set up the path to the working directory

# Set the knitr root directory - this PERSISTS
# replace qbiad019 with Your_USER_NAME
knitr::opts_knit$set(root.dir = "/home/group.kurse/qbiad019/Practical_Session_6/")

# Also set DATA_DIR variable for reference
DATA_DIR <- "/home/group.kurse/qbiad019/Practical_Session_6/"

cat("Knitr root.dir set to:", knitr::opts_knit$get("root.dir"), "\n")

## Knitr root.dir set to: /home/group.kurse/qbiad019/Practical_Session_6/

cat("Files available:", paste(list.files()[1:3], collapse=", "), "\n")

## Files available: Login_2_RAMSES_Start_RStudio_Server.pdf, Practical_Day_6.qmd, Practical_Day_6.zip

Overview

This pipeline identifies biological processes under divergent selection between Swedish (SWE) and Spanish (ESP) Arabidopsis populations. The analysis proceeds from SNP-level population differentiation (FST) to gene-level aggregation and Gene Ontology (GO) enrichment, with explicit attention to assumptions, limitations, and best practices.

The workflow is designed to answer the question:

Are specific biological processes enriched among genes showing unusually high population differentiation between climatically distinct populations?

Key Analytical Decision: This analysis uses a FULL GENOME dataset to identify candidate genes. This provides a genome-wide context for interpreting signals and avoids circular reasoning. For hypothesis-driven questions (e.g., focusing on defense genes only), the background gene universe must be adjusted accordingly, as discussed in Section 10.

Practical Setup: Running Analysis on HPC

Instructions for Running Bioinformatics Jobs on the HPC (RAMSES)

Basic Steps:

Step 0. Copy the required directory to your home directory (Change qbiad019 with Your_User_Name)

cp -r /scratch/qbio/Practical_Session_6 /home/group.kurse/qbiad019

Step 1. Open a text editor and create the file:

nano stacks.sh

Step 2. Copy the entire script below stacks.sh and paste it into nano:

Step 3: Save and Exit Nano

Press Ctrl + X to exit

Press Y to confirm saving

Press Enter to keep the filename stacks.sh

Step 4. Submit Your Job In the terminal, run:

sbatch stacks.sh

For more details, consult the RAMSES Brief Instructions

Estimating Genome-wide FST Values

To conduct this analysis, we require an estimate of genome-wide FST values to measure genetic differentiation between two groups (populations) of samples, using variants provided in a VCF file. For the estimation of genome-wide FST, we will use the software STACKS.

Create and Submit a SLURM Script for estimating Genome-wide FST values

First, check if the required module is installed:

module avail
module avail bio/

Double check before submitting job if your all input files exist:

# Verify your input files exist
ls -la "/home/group.kurse/qbiad019/Practical_Session_6/vcf/1001genomes_snp-short-indel_only_ACGTN_Dp10GQ20Q30_NoIndel_Bialleleic_80PcMissing_80SwEs.vcf.recode.vcf.gz"

ls -la "/home/group.kurse/qbiad019/Practical_Session_6/vcf/1001genomes_snp-short-indel_only_ACGTN_Dp10GQ20Q30_NoIndel_Bialleleic_80PcMissing_80SwEs_DefenseOnly.vcf.gz"

# Check if popmap file exists
ls -la "/home/group.kurse/qbiad019/Practical_Session_6/stacks/pop.txt"

Example SLURM Script (`stacks.sh`):

#!/bin/bash -l

#SBATCH --job-name=stacks
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=100gb
#SBATCH --time=01:00:00
#SBATCH --account=kurse
#SBATCH --error=/home/group.kurse/qbiad019/Practical_Session_6/stacks/stacks_pop.err
#SBATCH --output=/home/group.kurse/qbiad019/Practical_Session_6/stacks/stacks_pop.out

# Load the stacks module
module load bio/Stacks/2.62-foss-2022a

# Input directory
input_dir="/home/group.kurse/qbiad019/Practical_Session_6/vcf"

# Output directory
output_dir="/home/group.kurse/qbiad019/Practical_Session_6/stacks"

mkdir -p "$output_dir"

populations -V "${input_dir}/1001genomes_snp-short-indel_only_ACGTN_Dp10GQ20Q30_NoIndel_Bialleleic_80PcMissing_80SwEs.vcf.recode.vcf.gz" \
            --popmap "/home/group.kurse/qbiad019/Practical_Session_6/stacks/pop.txt" \
            --fstats -t 10 --structure -O "${output_dir}"
            
populations -V "${input_dir}/1001genomes_snp-short-indel_only_ACGTN_Dp10GQ20Q30_NoIndel_Bialleleic_80PcMissing_80SwEs_DefenseOnly.vcf.gz" \
            --popmap "/home/group.kurse/qbiad019/Practical_Session_6/stacks/pop.txt" \
            --fstats -t 10 --structure -O "${output_dir}"

Check Status and ID of Your Submitted Job:

squeue -u Your_User_Name

To Cancel Running Job:

scancel Your_Job_ID

Note: This practical session assumes the FST analysis has already been completed using the above workflow. We will now proceed with the downstream GO enrichment analysis using the generated FST statistics.

Conceptual and Statistical Rationale

Why FST Outliers?

FST measures allele frequency differentiation between populations. Under neutrality, most loci are expected to show modest differentiation shaped by drift and shared demographic history. Loci in the extreme upper tail of the FST distribution are candidates for divergent selection, particularly when populations inhabit contrasting environments (e.g. Sweden vs Spain).

Key assumption: The right tail of the empirical FST distribution is enriched for loci influenced by selection, although demography and linked selection may also contribute.

Why Gene-Level and GO Enrichment?

Selection acts on phenotypes encoded by genes and pathways rather than isolated SNPs. Aggregating SNPs to genes and testing for GO term enrichment allows:

Increased biological interpretability
Detection of pathway-level signals even when individual SNP effects are modest
Explicit hypothesis testing against a genome-wide background

Important Caveat (Avoiding Circularity)

GO enrichment must always use an appropriate gene universe. Restricting the background to a pre-filtered functional subset (e.g. only defense genes) risks circular inference and inflated significance. In this pipeline:

Outlier detection is genome-wide
GO enrichment background consists of all genes with at least one tested SNP

0. Required Packages

The following packages are required for the analysis and plotting

cat("Loading required packages...\n")

## Loading required packages...

required_packages <- c("topGO", "org.At.tair.db", "ggplot2", "dplyr", 
                       "tidyr", "stringr", "GenomicRanges", "readr", "DT")
invisible(lapply(required_packages, function(pkg) {
  if(!require(pkg, character.only = TRUE, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
    library(pkg, character.only = TRUE)
  }
}))

1. Load and Process FST Data

PURPOSE: Load population differentiation statistics to identify loci under divergent selection between populations
BIOLOGICAL LOGIC: FST measures genetic differentiation between populations.
High FST values indicate loci with allele frequency differences greater than expected under neutral evolution, suggesting selection.
STATISTICAL LOGIC: Using percentile thresholds (e.g., top 5%) identifies outliers from the neutral expectation distribution.

cat("Setting up analysis parameters...\n")

## Setting up analysis parameters...

config <- list(
  fst_quantile = 0.95,     # Top 5% FST threshold (95th percentile)
# pval_threshold = 0.05,    # For GO analysis gene selection
  node_size = 10,           # Min genes per GO term
  top_terms = 10,           # Top terms to display in plots
  algorithm = "elim",       # GO algorithm
  ontology = "BP"           # Biological Process ontology
)

cat("\nStep 1: Loading and processing FST data...\n")

## 
## Step 1: Loading and processing FST data...

fst_file <- "/home/group.kurse/qbiad019/Practical_Session_6/stacks/1001genomes_snp-short-indel_only_ACGTN_Dp10GQ20Q30_NoIndel_Bialleleic_80PcMissing_80SwEs.vcf.recode.p.fst_SWE-ESP.tsv"

fstats <- read.table(
  fst_file,
  header = FALSE,
  sep = "\t",
  comment.char = "#",
  fill = TRUE,
  quote = ""
)

colnames(fstats) <- c(
  "Locus_ID","Pop1_ID","Pop2_ID","Chr","BP","Column",
  "Fishers_P","Odds_Ratio","CI_Low","CI_High",
  "LOD","AMOVA_Fst","Smoothed_AMOVA_Fst",
  "Smoothed_AMOVA_Fst_Pvalue"
)

fst_data <- data.frame(
  Chr = fstats$Chr,
  BP = fstats$BP,
  FST = fstats$AMOVA_Fst,
  Fisher_P = fstats$Fishers_P
)

fst_data <- fst_data[!is.na(fst_data$FST), ]

cat("  Total SNPs analyzed:", nrow(fst_data), "\n")

##   Total SNPs analyzed: 648916

cat("  FST range:", round(range(fst_data$FST), 3), "\n")

##   FST range: 0 0.797

cat("  Mean FST:", round(mean(fst_data$FST), 4), "\n")

##   Mean FST: 0.0391

2. Visualise FST Distribution

STATISTICAL PURPOSE: Validate distribution assumptions and identify an appropriate threshold for outlier detection
INTERPRETATION: Most SNPs show low differentiation (left-skewed distribution), while high-FST outliers represent potential targets of selection.

cat("\nStep 2: Plotting FST distribution...\n")

## 
## Step 2: Plotting FST distribution...

fst_threshold <- quantile(fst_data$FST, config$fst_quantile, na.rm = TRUE)
cat("  Top 5% FST threshold:", round(fst_threshold, 4), "\n")

##   Top 5% FST threshold: 0.1372

cat("  SNPs above threshold:", sum(fst_data$FST >= fst_threshold, na.rm = TRUE), "\n")

##   SNPs above threshold: 32463

p_fst_dist <- ggplot(fst_data, aes(x = FST)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7, color = "white") +
  geom_vline(xintercept = fst_threshold, 
             linetype = "dashed", color = "red", linewidth = 1) +
  annotate("text", x = fst_threshold * 1.1, y = max(hist(fst_data$FST, breaks = 50)$counts),
           label = paste("Top 5%\nFST >", round(fst_threshold, 3)),
           color = "red", size = 4, fontface = "bold", hjust = 0) +
  labs(
    title = "Distribution of FST Values (SWE vs ESP)",
    subtitle = "Red line indicates top 5% threshold for candidate selection",
    x = "FST Value",
    y = "Number of SNPs"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.major = element_line(color = "gray90"),
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(color = "gray40", hjust = 0.5)
  )

print(p_fst_dist)

3. Gene Annotation and SNP-to-Gene Mapping

PURPOSE: Link genomic positions to gene annotations for functional analysis
BIOLOGICAL LOGIC: Selection acts on phenotypes, which are determined by genes. Mapping SNPs to genes connects genetic variation to biological function.
STATISTICAL LOGIC: Using genomic ranges ensures accurate positional mapping, handling gene boundaries and overlapping features correctly.

cat("\nStep 3.1: Loading and parsing annotation file...\n")

## 
## Step 3.1: Loading and parsing annotation file...

# Load annotation file with your known column structure
annot <- read.delim("vcf/GFF_final_with_annotation_2_gff", header = FALSE, comment.char = "#")
# Check the actual number of columns
cat("  Columns in file:", ncol(annot), "\n")

##   Columns in file: 15

cat("  Rows in file:", nrow(annot), "\n")

##   Rows in file: 28813

# Assign column names for ALL 15 columns
colnames(annot) <- c(
  "gene_id", "chr", "tair_version", "type", "start", "end", "empty1",
  "strand", "empty2", "gene_id2", "gene_id3", "type_name", "short_description", 
  "curation", "extra_column"  # Added 15th column
)

# Clean up gene IDs (they're already in the first column)
annot$gene_id_clean <- annot$gene_id

# Remove version numbers (e.g., .1, .2) if present
annot$gene_id_clean <- sub("\\.\\d+$", "", annot$gene_id_clean)

# Also check other gene ID columns
if(any(!is.na(annot$gene_id2))) {
  annot$gene_id_clean[is.na(annot$gene_id_clean)] <- annot$gene_id2[is.na(annot$gene_id_clean)]
  annot$gene_id_clean <- sub("\\.\\d+$", "", annot$gene_id_clean)
}

if(any(!is.na(annot$gene_id3))) {
  annot$gene_id_clean[is.na(annot$gene_id_clean)] <- annot$gene_id3[is.na(annot$gene_id_clean)]
  annot$gene_id_clean <- sub("\\.\\d+$", "", annot$gene_id_clean)
}

# Check what we have
cat("\n  Gene ID extraction results:\n")

## 
##   Gene ID extraction results:

cat("    Unique clean gene IDs:", length(unique(annot$gene_id_clean)), "\n")

##     Unique clean gene IDs: 22360

cat("    Sample gene IDs:", head(unique(annot$gene_id_clean), 5), "\n")

##     Sample gene IDs: Model_name AT1G01010 AT1G01020 AT1G01030 AT1G01040

cat("    Gene types:", table(annot$type), "\n")

##     Gene types: 28812 1

# Filter for gene features (based on 'type' column)
gene_types <- c("gene", "mRNA", "transcript", "CDS", "exon", "five_prime_UTR", "three_prime_UTR")
gene_features <- annot[annot$type %in% gene_types, ]

cat("\n  Gene features filtered:\n")

## 
##   Gene features filtered:

cat("    Total gene features:", nrow(gene_features), "\n")

##     Total gene features: 28812

cat("    Unique genes in features:", length(unique(gene_features$gene_id_clean)), "\n")

##     Unique genes in features: 22359

# Convert start and end to numeric
gene_features$start <- as.numeric(gene_features$start)
gene_features$end <- as.numeric(gene_features$end)

# For each gene, we need to find the full genomic range
# This handles cases where genes have multiple transcripts/exons
library(dplyr)

gene_ranges <- gene_features %>%
  group_by(gene_id_clean, chr) %>%
  summarise(
    start = min(start, na.rm = TRUE),
    end = max(end, na.rm = TRUE),
    .groups = 'drop'
  )

# Clean chromosome names
gene_ranges$chr <- sub("Chr", "", gene_ranges$chr)

# Remove any rows with NA values
gene_ranges <- gene_ranges[!is.na(gene_ranges$start) & !is.na(gene_ranges$end), ]

cat("\n  Final gene ranges:\n")

## 
##   Final gene ranges:

cat("    Total genes with ranges:", nrow(gene_ranges), "\n")

##     Total genes with ranges: 22359

cat("    Sample ranges:\n")

##     Sample ranges:

print(head(gene_ranges, 3))

## # A tibble: 3 × 4
##   gene_id_clean chr   start   end
##   <chr>         <chr> <dbl> <dbl>
## 1 AT1G01010     1      3631  5899
## 2 AT1G01020     1      5928  8737
## 3 AT1G01030     1     11649 13714

# ---------------------------
# 3.2. CREATE GRANGES OBJECTS
# ---------------------------
cat("\nStep 3.2: Creating GRanges objects...\n")

## 
## Step 3.2: Creating GRanges objects...

# Create GRanges for FST SNPs
gr_fst <- GRanges(
  seqnames = fst_data$Chr,
  ranges = IRanges(start = fst_data$BP, end = fst_data$BP),
  FST = fst_data$FST,
  is_candidate = fst_data$FST >= fst_threshold
)

# Create GRanges for genes
gr_genes <- GRanges(
  seqnames = gene_ranges$chr,
  ranges = IRanges(start = gene_ranges$start, end = gene_ranges$end),
  gene_id = gene_ranges$gene_id_clean
)

cat("  FST SNPs in GRanges:", length(gr_fst), "\n")

##   FST SNPs in GRanges: 648916

cat("  Genes in GRanges:", length(gr_genes), "\n")

##   Genes in GRanges: 22359

# ---------------------------
# 3.3. FIND OVERLAPS
# ---------------------------
cat("\nStep 3.3: Finding overlaps between SNPs and genes...\n")

## 
## Step 3.3: Finding overlaps between SNPs and genes...

# Find overlaps
overlaps <- findOverlaps(gr_fst, gr_genes, type = "any", ignore.strand = TRUE)

cat("  Total overlaps found:", length(overlaps), "\n")

##   Total overlaps found: 409705

if(length(overlaps) > 0) {
  cat("  Unique SNPs overlapping genes:", length(unique(queryHits(overlaps))), "\n")
  cat("  Unique genes with overlapping SNPs:", length(unique(subjectHits(overlaps))), "\n")
  
  # Create mapping table
  snp_gene_map <- data.frame(
    gene_id = gene_ranges$gene_id_clean[subjectHits(overlaps)],
    FST = fst_data$FST[queryHits(overlaps)],
    BP = fst_data$BP[queryHits(overlaps)],
    Chr = fst_data$Chr[queryHits(overlaps)],
    is_candidate = fst_data$FST[queryHits(overlaps)] >= fst_threshold
  )
  
  # Check the mapping
  cat("\n  SNP-gene mapping summary:\n")
  cat("    Total SNP-gene pairs:", nrow(snp_gene_map), "\n")
  cat("    Unique genes mapped:", length(unique(snp_gene_map$gene_id)), "\n")
  
  if(nrow(snp_gene_map) > 0) {
    snps_per_gene <- table(snp_gene_map$gene_id)
    cat("    Average SNPs per gene:", round(mean(snps_per_gene), 2), "\n")
    cat("    Candidate SNPs (FST ≥", round(fst_threshold, 4), "):", 
        sum(snp_gene_map$is_candidate), "\n")
    
    # Display first few mappings
    cat("\n    First few SNP-gene mappings:\n")
    print(head(snp_gene_map, 5))
  }
  
} else {
  cat("  WARNING: No overlaps found between SNPs and genes!\n")
  cat("  Checking chromosome naming...\n")
  
  # Diagnostic: Check chromosome names
  unique_fst_chr <- unique(fst_data$Chr)
  unique_gene_chr <- unique(gene_ranges$chr)
  
  cat("    FST chromosomes:", head(unique_fst_chr, 10), "\n")
  cat("    Gene chromosomes:", head(unique_gene_chr, 10), "\n")
  
  # Try alternative chromosome naming
  cat("  Trying alternative chromosome naming...\n")
  
  # Standardize chromosome names
  fst_data$Chr_clean <- sub("^chr", "", fst_data$Chr, ignore.case = TRUE)
  fst_data$Chr_clean <- sub("^Chr", "", fst_data$Chr_clean)
  fst_data$Chr_clean <- sub("^0", "", fst_data$Chr_clean)  # Remove leading zeros
  
  # Create new GRanges with cleaned chromosome names
  gr_fst_clean <- GRanges(
    seqnames = fst_data$Chr_clean,
    ranges = IRanges(start = fst_data$BP, end = fst_data$BP),
    FST = fst_data$FST,
    is_candidate = fst_data$FST >= fst_threshold
  )
  
  # Try overlap again
  overlaps <- findOverlaps(gr_fst_clean, gr_genes, type = "any", ignore.strand = TRUE)
  cat("  Overlaps after chromosome cleaning:", length(overlaps), "\n")
  
  if(length(overlaps) > 0) {
    # Create mapping table with cleaned chromosomes
    snp_gene_map <- data.frame(
      gene_id = gene_ranges$gene_id_clean[subjectHits(overlaps)],
      FST = fst_data$FST[queryHits(overlaps)],
      BP = fst_data$BP[queryHits(overlaps)],
      Chr = fst_data$Chr_clean[queryHits(overlaps)],
      is_candidate = fst_data$FST[queryHits(overlaps)] >= fst_threshold
    )
    
    cat("\n  SNP-gene mapping after chromosome cleaning:\n")
    cat("    Total SNP-gene pairs:", nrow(snp_gene_map), "\n")
    cat("    Unique genes mapped:", length(unique(snp_gene_map$gene_id)), "\n")
    
  } else {
    # Create empty dataframe to avoid errors
    snp_gene_map <- data.frame(
      gene_id = character(),
      FST = numeric(),
      BP = numeric(),
      Chr = character(),
      is_candidate = logical()
    )
  }
}

##   Unique SNPs overlapping genes: 405929 
##   Unique genes with overlapping SNPs: 18514 
## 
##   SNP-gene mapping summary:
##     Total SNP-gene pairs: 409705 
##     Unique genes mapped: 18514 
##     Average SNPs per gene: 22.13 
##     Candidate SNPs (FST ≥ 0.1372 ): 22483 
## 
##     First few SNP-gene mappings:
##     gene_id     FST   BP Chr is_candidate
## 1 AT1G01010 0.11127 4648   1        FALSE
## 2 AT1G01010 0.02595 4654   1        FALSE
## 3 AT1G01010 0.34545 4880   1         TRUE
## 4 AT1G01010 0.01367 4963   1        FALSE
## 5 AT1G01010 0.02737 5024   1        FALSE

# ---------------------------
# 3.4. DIAGNOSTIC PLOT
# ---------------------------
cat("\nStep 3.4: Creating diagnostic plots...\n")

## 
## Step 3.4: Creating diagnostic plots...

if(nrow(snp_gene_map) > 0 && nrow(snp_gene_map) > 1) {
  # Calculate SNPs per gene distribution
  snps_per_gene <- table(snp_gene_map$gene_id)
  snps_per_gene_df <- data.frame(
    gene_id = names(snps_per_gene),
    nsnp = as.numeric(snps_per_gene)
  )
  
  p_overlap_dist <- ggplot(snps_per_gene_df, aes(x = nsnp)) +
    geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7, color = "white") +
    labs(
      title = "Distribution of SNPs per Gene",
      subtitle = paste("Mean =", round(mean(snps_per_gene_df$nsnp), 2), 
                       "SNPs per gene | Median =", median(snps_per_gene_df$nsnp)),
      x = "Number of SNPs per Gene",
      y = "Number of Genes"
    ) +
    theme_minimal(base_size = 11) +
    theme(
      panel.grid.major = element_line(color = "gray90"),
      plot.title = element_text(face = "bold", hjust = 0.5)
    )
  
  print(p_overlap_dist)
  
} else {
  cat("  Skipping overlap plot - insufficient data\n")
}

cat("  Total SNP-gene pairs:", nrow(snp_gene_map), "\n")

##   Total SNP-gene pairs: 409705

cat("  Unique genes with SNPs:", length(unique(snp_gene_map$gene_id)), "\n")

##   Unique genes with SNPs: 18514

cat("  Candidate SNPs in genes:", sum(snp_gene_map$is_candidate), "\n")

##   Candidate SNPs in genes: 22483

cat("\n✅ Gene annotation and mapping complete!\n")

## 
## ✅ Gene annotation and mapping complete!

cat("===========================================\n")

## ===========================================

4. Identify Candidate Genes

PURPOSE: Aggregate SNP-level signals to gene-level evidence of selection
BIOLOGICAL LOGIC: Genes with multiple high-FST SNPs or extremely high FST values at key positions are stronger candidates for selection.
STATISTICAL LOGIC: Using maximum FST per gene (classic outlier approach) focuses on the strongest signals. Alternative: mean FST for polygenic adaptation.

cat("\nStep 4: Identifying candidate genes under selection...\n")

## 
## Step 4: Identifying candidate genes under selection...

# Aggregate SNP-level FST to gene-level statistics
gene_fst_summary <- snp_gene_map %>%
  group_by(gene_id) %>%
  summarise(
    max_FST    = max(FST, na.rm = TRUE),
    mean_FST   = mean(FST, na.rm = TRUE),
    median_FST = median(FST, na.rm = TRUE),
    nsnp       = n(),
    .groups    = "drop"
  )

# Choose statistic based on biological question
fst_statistic <- "max_FST"   # justified below

# Define candidate genes as top X% of genes
candidate_threshold <- quantile(
  gene_fst_summary[[fst_statistic]],
  config$fst_quantile,
  na.rm = TRUE
)

candidate_genes <- gene_fst_summary %>%
  filter(.data[[fst_statistic]] >= candidate_threshold) %>%
  arrange(desc(.data[[fst_statistic]]))

cat("  Total genes tested:", nrow(gene_fst_summary), "\n")

##   Total genes tested: 18514

cat("  Candidate genes (top ",(1 - config$fst_quantile) * 100,"%): ",nrow(candidate_genes),"\n",sep = "")

##   Candidate genes (top 5%): 926

cat("  Top candidate gene:",
    candidate_genes$gene_id[1],
    "with", fst_statistic, "=",
    round(candidate_genes[[fst_statistic]][1], 4), "\n")

##   Top candidate gene: AT4G02733 with max_FST = 0.7917

5. GO Enrichment Analysis

PURPOSE: Create input for functional enrichment analysis
BIOLOGICAL LOGIC: Tests whether certain biological processes are over-represented among selected genes compared to genome background.
STATISTICAL LOGIC: Uses gene scores (FST values) to rank genes, with top quantile considered “selected” for enrichment testing.

cat("\nStep 5.1: Preparing for GO enrichment analysis...\n")

## 
## Step 5.1: Preparing for GO enrichment analysis...

# ---------------------------
# 5.1. GENE SCORES AND UNIVERSE
# ---------------------------

# Gene universe: all genes with ≥1 SNP tested
all_genes <- unique(snp_gene_map$gene_id)

# Gene-level scores: MAX FST per gene (consistent with Step 4)
gene_fst_scores <- snp_gene_map %>%
  group_by(gene_id) %>%
  summarise(max_FST = max(FST, na.rm = TRUE), .groups = "drop")

gene_vector <- gene_fst_scores$max_FST
names(gene_vector) <- gene_fst_scores$gene_id

# Restrict to genes present in org.At.tair.db
arabidopsis_genes <- keys(org.At.tair.db)
gene_vector_filtered <- gene_vector[names(gene_vector) %in% arabidopsis_genes]

cat("  Genes in analysis:", length(gene_vector_filtered), "\n")

##   Genes in analysis: 18514

cat("  Genes NOT in Arabidopsis database:",
    length(gene_vector) - length(gene_vector_filtered), "\n")

##   Genes NOT in Arabidopsis database: 0

# ---------------------------
# 5.2. GO ENRICHMENT (FISHER + KS)
# ---------------------------

cat("\nStep 5.2: Running GO enrichment analysis...\n")

## 
## Step 5.2: Running GO enrichment analysis...

# Selection function: top X% of genes by MAX FST
selection_function <- function(allScores) {
  threshold <- quantile(
    allScores,
    config$fst_quantile,   ### FIX: correct config variable
    na.rm = TRUE
  )
  allScores >= threshold
}

# Create topGO object
tGOdata <- new(
  "topGOdata",
  ontology = config$ontology,
  allGenes = gene_vector_filtered,
  geneSel  = selection_function,
  nodeSize = config$node_size,
  annot    = annFUN.org,
  mapping  = "org.At.tair.db",
  description = "GO enrichment of high-FST genes (SWE vs ESP)"
)

# Run tests
cat("  Running Fisher's exact test...\n")

##   Running Fisher's exact test...

results_fisher <- runTest(tGOdata,
                          algorithm = config$algorithm,
                          statistic = "fisher")

cat("  Running Kolmogorov-Smirnov test...\n")

##   Running Kolmogorov-Smirnov test...

results_ks <- runTest(tGOdata,
                      algorithm = config$algorithm,
                      statistic = "ks")

# ---------------------------
# 5.3. RESULT TABLES
# ---------------------------

go_fisher <- GenTable(
  tGOdata,
  Fisher_p_value = results_fisher,
  orderBy = "Fisher_p_value",
  topNodes = length(score(results_fisher))
)

go_ks <- GenTable(
  tGOdata,
  KS_p_value = results_ks,
  orderBy = "KS_p_value",
  topNodes = length(score(results_ks))
)

# Convert p-values
go_fisher$Fisher_p_value <- as.numeric(go_fisher$Fisher_p_value)
go_ks$KS_p_value <- as.numeric(go_ks$KS_p_value)

# Merge results
go_enrichment <- merge(
  go_fisher[, c("GO.ID", "Term", "Annotated", "Significant", "Expected", "Fisher_p_value")],
  go_ks[, c("GO.ID", "KS_p_value")],
  by = "GO.ID",
  all.x = TRUE
)

# Multiple testing correction and metrics
go_enrichment <- go_enrichment %>%
  mutate(
    Fisher_FDR = p.adjust(Fisher_p_value, method = "BH"),
    KS_FDR = ifelse(!is.na(KS_p_value),
                    p.adjust(KS_p_value, method = "BH"),
                    NA),
    Enrichment_ratio = as.numeric(Significant) / as.numeric(Expected)
  ) %>%
  arrange(Fisher_p_value)

# ---------------------------
# 5.4. SUMMARY
# ---------------------------

cat("\n=== GO ENRICHMENT SUMMARY ===\n")

## 
## === GO ENRICHMENT SUMMARY ===

cat("  GO terms tested:", nrow(go_enrichment), "\n")

##   GO terms tested: 2122

cat("  Fisher FDR < 0.05:",
    sum(go_enrichment$Fisher_FDR < 0.05, na.rm = TRUE), "\n")

##   Fisher FDR < 0.05: 0

cat("  KS FDR < 0.05:",
    sum(go_enrichment$KS_FDR < 0.05, na.rm = TRUE), "\n")

##   KS FDR < 0.05: 1

cat("\n  Top GO term (Fisher):",
    go_enrichment$Term[1], "\n")

## 
##   Top GO term (Fisher): plant-type hypersensitive response

cat("  Fisher p-value:",
    format(go_enrichment$Fisher_p_value[1], scientific = TRUE), "\n")

##   Fisher p-value: 7.4e-04

cat("\n✅ GO enrichment analysis complete!\n")

## 
## ✅ GO enrichment analysis complete!

6. Interactive GO Enrichment Table

cat("\nDisplaying interactive GO enrichment table...\n")

## 
## Displaying interactive GO enrichment table...

# Use the authoritative GO enrichment object
table_data <- go_enrichment

# Create display-ready table
display_table <- table_data %>%
  mutate(
    # Numeric enrichment metric
    Enrichment = round(Enrichment_ratio, 2),

    # Fisher test (formatted for display)
    Fisher_p = format(Fisher_p_value, scientific = TRUE, digits = 2),
    Fisher_FDR = format(Fisher_FDR, scientific = TRUE, digits = 2),

    # KS test (formatted, with N/A where missing)
    KS_p = ifelse(!is.na(KS_p_value), format(KS_p_value, scientific = TRUE, digits = 2), "N/A"),
    KS_FDR = ifelse(!is.na(KS_FDR), format(KS_FDR, scientific = TRUE, digits = 2), "N/A"),

    # Combined significance based on numeric FDRs
    Combined = case_when(
      Fisher_FDR < 0.05 & !is.na(KS_FDR) & KS_FDR < 0.05 ~ "Both significant",
      Fisher_FDR < 0.05 ~ "Fisher only",
      !is.na(KS_FDR) & KS_FDR < 0.05 ~ "KS only",
      TRUE ~ "Neither"
    )
  ) %>%
  select(
    GO.ID,
    Term,
    Annotated,
    Significant,
    Expected,
    Enrichment,
    Fisher_p,
    Fisher_FDR,
    KS_p,
    KS_FDR,
    Combined
  )

# Create interactive DataTable
dt <- datatable(
  display_table,
  extensions = c("Buttons", "Scroller"),
  options = list(
    dom = "Bfrtip",
    buttons = c("copy", "csv", "excel", "pdf", "print"),
    pageLength = 20,
    scrollX = TRUE,
    scrollY = "600px",
    scroller = TRUE
  ),
  rownames = FALSE
)

dt

7. Visualisation of Enriched GO Terms

PURPOSE: Visualise the statistical significance of GO term enrichment
BUBBLE PLOT:
- X-axis shows p-values in natural direction (smaller on right = more significant)
- Bubble size represents Sig/Exp ratio (biological effect size)
- Labels show ABSOLUTE numbers (Significant/Annotated)
ENRICHMENT RATIO PLOT (Alternative visualisation)
- Show magnitude of enrichment independent of statistical significance
- Ratio > 1 indicates more selected genes in term than expected

# ---------------------------
# 7.1. PREPARE DATA FOR BOTH TESTS
# ---------------------------
cat("Preparing data for both Fisher and KS tests...\n")

## Preparing data for both Fisher and KS tests...

# Use authoritative GO enrichment object
go_results <- go_enrichment

# Ensure KS columns exist
if(!"KS_p_value" %in% names(go_results)) {
  go_results$KS_p_value <- NA
  go_results$KS_FDR <- NA
}

# Sort by KS p-value for KS plot
go_results_ks <- go_results %>% arrange(KS_p_value)

# Calculate Significant/Expected ratio and labels
go_results <- go_results %>%
  mutate(
    Sig_Expected_ratio = Significant / Expected,
    Sig_Expected_label = paste0(round(Significant, 0), "/", round(Expected, 1))
  )

go_results_ks <- go_results_ks %>%
  mutate(
    Sig_Expected_ratio = Significant / Expected,
    Sig_Expected_label = paste0(round(Significant, 0), "/", round(Expected, 1))
  )

# Select top terms for each test
ntop <- min(10, nrow(go_results))
ggdata_Fisher <- go_results[1:ntop, ]
ggdata_Fisher$Term <- factor(ggdata_Fisher$Term, levels = rev(unique(ggdata_Fisher$Term)))

ggdata_KS <- go_results_ks[1:ntop, ]
# Use unique levels to avoid duplicates
ggdata_KS$Term <- factor(ggdata_KS$Term, levels = rev(unique(ggdata_KS$Term)))

cat("Using top", ntop, "terms for each test\n")

## Using top 10 terms for each test

# ---------------------------
# 7.2. CREATE FISHER TEST PLOT 
# ---------------------------
cat("\nCreating Fisher test plot...\n")

## 
## Creating Fisher test plot...

p_fisher <- ggplot(ggdata_Fisher, aes(x = Fisher_p_value, y = Term, 
                                      size = Sig_Expected_ratio,  # Use Sig/Expected ratio for size
                                      color = Fisher_p_value)) +
  geom_point(alpha = 0.7, stroke = 0.5) +  # Added stroke for better visibility
  # Changed label to Significant/Expected
  geom_text(aes(label = Sig_Expected_label), 
            vjust = 0.5, hjust = 0.5, size = 3, 
            color = "black", fontface = "bold") +  # Black text as requested
  scale_size_continuous(
    name = "Ratio\n(Sig/Exp)",
    range = c(5, 19),
    breaks = pretty(range(ggdata_Fisher$Sig_Expected_ratio), n = 3)
  ) +
  scale_color_continuous(
    low = "red", 
    high = "gold",
    name = "p-value\n(red=less sig)",
    guide = guide_colorbar(reverse = TRUE)
  ) +
  theme_minimal() +
  labs(
    x = "Fisher p-value (more significant→)",
    y = "GO Terms",
    title = "GO Enrichment - Fisher Exact Test",
    subtitle = paste("Top", ntop, "terms | Bubble size = Significant/Expected | Black labels = Sig/Exp")
  ) +
  theme(
    panel.border = element_rect(color = "grey", fill = NA, size = 1),
    legend.position = "right",
    legend.box = "vertical",
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 10),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 12),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40", size = 9),
    legend.title = element_text(size = 9, face = "bold"),
    legend.text = element_text(size = 8)
  ) +
  scale_x_reverse() +
  # Add significance threshold line
  geom_vline(xintercept = 0.05, linetype = "dashed", color = "blue", alpha = 0.5, size = 0.5) +
  annotate("text", x = 0.05, y = ntop + 0.5, 
         label = "p=0.05", color = "blue", size = 3, hjust = 1.2)

## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# ---------------------------
# 7.3. CREATE KS TEST PLOT
# ---------------------------
cat("Creating KS test plot...\n")

## Creating KS test plot...

p_ks <- ggplot(ggdata_KS, aes(x = KS_p_value, y = Term, 
                              size = Sig_Expected_ratio,
                              color = KS_p_value)) +
  geom_point(alpha = 0.7, stroke = 0.5) +
  geom_text(aes(label = Sig_Expected_label), 
            vjust = 0.5, hjust = 0.5, size = 3, 
            color = "black", fontface = "bold") +
  scale_size_continuous(
    name = "Ratio\n(Sig/Exp)",
    range = c(5, 19),
    breaks = pretty(range(ggdata_KS$Sig_Expected_ratio), n = 3)
  ) +
  scale_color_continuous(
    low = "red", 
    high = "gold",
    name = "p-value\n(red=less sig)",
    guide = guide_colorbar(reverse = TRUE)
  ) +
  theme_minimal() +
  labs(
    x = "KS p-value (more significant→)",
    y = "GO Terms",
    title = "GO Enrichment - Kolmogorov-Smirnov Test",
    subtitle = paste("Top", ntop, "terms | Bubble size = Significant/Expected | Black labels = Sig/Exp")
  ) +
  theme(
    panel.border = element_rect(color = "grey", fill = NA, size = 1),
    legend.position = "right",
    legend.box = "vertical",
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 10),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 12),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40", size = 9),
    legend.title = element_text(size = 9, face = "bold"),
    legend.text = element_text(size = 8)
  ) +
  scale_x_reverse() +
  # Add significance threshold line
  geom_vline(xintercept = 0.05, linetype = "dashed", color = "blue", alpha = 0.5, size = 0.5) +
  annotate("text", x = 0.05, y = ntop + 0.5, 
         label = "p=0.05", color = "blue", size = 3, hjust = 1.2)
# ---------------------------
# 7.4. CREATE ENRICHMENT RATIO PLOT
# ---------------------------
cat("Creating enrichment ratio plot...\n")

## Creating enrichment ratio plot...

n_combined <- min(15, nrow(go_results))
plot_data <- go_results[1:n_combined, ]

# Create better term labels
plot_data$Term_short <- sapply(plot_data$Term, function(x) {
  if(nchar(x) > 50) {
    return(paste0(substr(x, 1, 47), "..."))
  } else {
    return(x)
  }
})
plot_data$Term_short <- factor(plot_data$Term_short, levels = rev(plot_data$Term_short))

# Calculate enrichment metrics
plot_data$Sig_Expected_label <- paste0(round(plot_data$Significant, 0), "/", 
                                       round(plot_data$Expected, 1))

# Create significance categories with better labels
plot_data$sig_category <- cut(-log10(plot_data$Fisher_p_value),
                              breaks = c(0, -log10(0.1), -log10(0.05), -log10(0.01), Inf),
                              labels = c("p > 0.1", "0.05 < p ≤ 0.1", 
                                         "0.01 < p ≤ 0.05", "p ≤ 0.01"))

p_combined <- ggplot(plot_data,
                     aes(x = Enrichment_ratio,
                         y = Term_short,
                         size = Sig_Expected_ratio,  # Use Sig/Expected for size
                         color = sig_category)) +
  geom_point(alpha = 0.8, stroke = 0.5) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "gray50", size = 0.8) +
  geom_text(aes(label = Sig_Expected_label),  # Show Sig/Expected
            size = 3, color = "black", fontface = "bold",  # Black text
            vjust = 0.5, hjust = 0.5) +
  scale_size_continuous(
    name = "Observed/Expected\n(Sig/Exp)",
    range = c(6, 18),
    breaks = pretty(range(plot_data$Sig_Expected_ratio), n = 4)
  ) +
  scale_color_manual(
    name = "Significance Level",
    values = c("p > 0.1" = "#E6E6E6",  # Light gray for non-sig
               "0.05 < p ≤ 0.1" = "#FFCC00",  # Yellow for borderline
               "0.01 < p ≤ 0.05" = "#FF6600",  # Orange for moderately sig
               "p ≤ 0.01" = "#CC0000"),  # Red for highly sig
    drop = FALSE
  ) +
  labs(
    title = "GO Term Enrichment Analysis - FST-Based Selection",
    subtitle = "Size = Observed/Expected ratio | Color = p-value significance | Black labels = Sig/Exp",
    x = "Enrichment Ratio (Observed/Expected)",
    y = "GO Terms",
    caption = paste("Analysis based on", nrow(candidate_genes), 
                    "candidate genes (top 5% FST) | SWE vs ESP populations")
  ) +
  theme_minimal() +
  theme(
    panel.border = element_rect(color = "grey", fill = NA, size = 1.2),
    legend.position = "right",
    legend.box = "vertical",
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11, face = "bold"),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40", size = 10),
    plot.caption = element_text(hjust = 1, size = 8, color = "gray50"),
    legend.title = element_text(size = 9, face = "bold"),
    legend.text = element_text(size = 8),
    plot.margin = margin(15, 15, 15, 15)
  ) +
  xlim(0.9, max(plot_data$Enrichment_ratio, na.rm = TRUE) * 1.15) +
  annotate("text", x = 1, y = n_combined + 0.5, 
           label = "No enrichment (ratio = 1)", 
           color = "gray50", size = 3.5, hjust = -0.1, fontface = "italic")

# ---------------------------
# 7.5. CREATE ENRICHMENT DASHBOARD
# ---------------------------
cat("\nCreating enrichment dashboard...\n")

## 
## Creating enrichment dashboard...

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following object is masked from 'package:BiocGenerics':
## 
##     combine

library(grid)

## 
## Attaching package: 'grid'

## The following object is masked from 'package:topGO':
## 
##     depth

p_dashboard <- grid.arrange(
  p_fisher, p_ks, p_combined,
  ncol = 2,
  layout_matrix = rbind(c(1, 2), c(3, 3)),
  top = textGrob("GO Enrichment Analysis of High-FST Genes (SWE vs ESP)",
                 gp = gpar(fontsize = 18, fontface = "bold", fontfamily = "sans")),
  bottom = textGrob(paste("Analysis date:", Sys.Date(), 
                          "| Candidate genes:", nrow(candidate_genes),
                          "| FST threshold:", round(quantile(fst_data$FST, 0.95), 4)),
                    gp = gpar(fontsize = 9, color = "gray50"))
)

# ---------------------------
# 7.6. CREATE FIGURE LEGEND EXPLANATION
# ---------------------------
cat("\nCreating figure legend explanation...\n")

## 
## Creating figure legend explanation...

figure_legend <- paste(
  "FIGURE LEGEND EXPLANATION:\n",
  "===========================\n",
  "1. FISHER EXACT TEST PLOT:\n",
  "   • X-axis: Fisher p-value (reversed, so more significant on right)\n",
  "   • Y-axis: Top enriched GO terms\n",
  "   • Bubble size: Significant/Expected ratio (how many more genes than expected)\n",
  "   • Bubble color: p-value (red = less significant, gold = more significant)\n",
  "   • Labels: Black text showing Significant/Expected counts\n",
  "   • Blue line: p = 0.05 significance threshold\n\n",
  
  "2. KOLMOGOROV-SMIRNOV TEST PLOT:\n",
  "   • X-axis: KS p-value (tests distribution of gene scores)\n",
  "   • Y-axis: Top GO terms by KS test\n",
  "   • Same bubble encoding as Fisher plot\n",
  "   • KS test detects if genes in GO term have systematically higher scores\n\n",
  
  "3. COMBINED ENRICHMENT PLOT:\n",
  "   • X-axis: Enrichment ratio (Observed/Expected)\n",
  "   • Y-axis: GO terms\n",
  "   • Bubble size: Significant/Expected ratio\n",
  "   • Bubble color: Significance level categories\n",
  "   • Labels: Black text showing Significant/Expected\n",
  "   • Dashed line: No enrichment (ratio = 1)\n\n",
  
  "BIOLOGICAL INTERPRETATION:\n",
  "• Significant/Expected ratio > 1: More genes than expected by chance\n",
  "• Enrichment ratio > 1: GO term is enriched in candidate genes\n",
  "• Red bubbles (Fisher/KS): Less statistically significant\n",
  "• Small p-values: Strong evidence for enrichment\n",
  "Note: No terms are FDR-significant (all 'ns'), suggesting polygenic adaptation."
)

# ---------------------------
# 7.7. SAVE AND DISPLAY
# ---------------------------
cat("\nSaving plots...\n")

## 
## Saving plots...

output_dir <- "FST_GO_Plots"
dir.create(output_dir, showWarnings = FALSE)

# Save plots
ggsave(file.path(output_dir, "GO_fisher.png"),
       p_fisher, width = 11, height = 8, dpi = 300, bg = "white")

ggsave(file.path(output_dir, "GO_ks.png"),
       p_ks, width = 11, height = 8, dpi = 300, bg = "white")

ggsave(file.path(output_dir, "GO_combined.png"),
       p_combined, width = 13, height = 10, dpi = 300, bg = "white")

ggsave(file.path(output_dir, "GO_analysis_dashboard.png"),
       p_dashboard, width = 18, height = 14, dpi = 300, bg = "white")

# Save figure legend to file
writeLines(figure_legend, file.path(output_dir, "figure_legend_explanation.txt"))

8. Display Results

cat("\nDisplaying plots...\n\n")

## 
## Displaying plots...

print(p_fisher)

cat("\n--- FISHER EXACT TEST ---\n")

## 
## --- FISHER EXACT TEST ---

cat("• Shows if GO terms have more significant genes than expected\n")

## • Shows if GO terms have more significant genes than expected

cat("• Bubble size = Significant/Expected (biological signal strength)\n")

## • Bubble size = Significant/Expected (biological signal strength)

cat("• Black labels show actual counts: ", ggdata_Fisher$Sig_Expected_label[1], " etc.\n\n")

## • Black labels show actual counts:  13/4.7  etc.

print(p_ks)

cat("\n--- KOLMOGOROV-SMIRNOV TEST ---\n")

## 
## --- KOLMOGOROV-SMIRNOV TEST ---

cat("• Tests if genes in GO term have systematically higher FST values\n")

## • Tests if genes in GO term have systematically higher FST values

cat("• Different statistical approach than Fisher test\n")

## • Different statistical approach than Fisher test

cat("• Can detect subtle shifts in distribution\n\n")

## • Can detect subtle shifts in distribution

print(p_combined)

cat("\n--- COMBINED ENRICHMENT VIEW ---\n")

## 
## --- COMBINED ENRICHMENT VIEW ---

cat("• Shows relationship between enrichment magnitude and significance\n")

## • Shows relationship between enrichment magnitude and significance

9. Analysis Summary

cat("\n=== ANALYSIS SUMMARY ===\n\n")

## 
## === ANALYSIS SUMMARY ===

summary_stats <- data.frame(
  Metric = c(
    "Total SNPs analyzed",
    "Mean FST value", 
    "FST threshold (95th percentile)",
    "SNPs above FST threshold",
    "Genes with SNP coverage",
    "Candidate genes (top 5% FST)",
    "Top candidate gene FST",
    "GO terms analyzed",
    "GO terms with FDR < 0.05",
    "Most enriched GO term",
    "Highest enrichment ratio"
  ),
  Value = c(
    format(nrow(fst_data), big.mark = ","),
    round(mean(fst_data$FST, na.rm = TRUE), 4),
    round(fst_threshold, 4),
    format(sum(fst_data$FST >= fst_threshold, na.rm = TRUE), big.mark = ","),
    format(length(unique(snp_gene_map$gene_id)), big.mark = ","),
    format(nrow(candidate_genes), big.mark = ","),
    ifelse(nrow(candidate_genes) > 0, 
           paste(candidate_genes$gene_id[1], " (FST = ", round(candidate_genes$max_FST[1], 4), ")", sep = ""),
           "None"),
    nrow(go_enrichment),
    sum(go_enrichment$FDR_adjusted < 0.05),
    ifelse(nrow(go_enrichment) > 0, go_enrichment$Term[1], "None"),
    ifelse(nrow(go_enrichment) > 0, round(max(go_enrichment$Enrichment_ratio, na.rm = TRUE), 2), "NA")
  )
)

# Print as a nice table
knitr::kable(
  summary_stats,
  col.names = c("Metric", "Value"),
  align = c("l", "r"),
  caption = "Summary of FST and GO Enrichment Analysis"
)

Summary of FST and GO Enrichment Analysis
Metric	Value
Total SNPs analyzed	648,916
Mean FST value	0.0391
FST threshold (95th percentile)	0.1372
SNPs above FST threshold	32,463
Genes with SNP coverage	18,514
Candidate genes (top 5% FST)	926
Top candidate gene FST	AT4G02733 (FST = 0.7917)
GO terms analyzed	2122
GO terms with FDR < 0.05	0
Most enriched GO term	plant-type hypersensitive response
Highest enrichment ratio	7.69

cat("\n=== BIOLOGICAL INTERPRETATION ===\n\n")

## 
## === BIOLOGICAL INTERPRETATION ===

if(nrow(go_enrichment) > 0 && any(go_enrichment$FDR_adjusted < 0.05)) {
  sig_terms <- go_enrichment[go_enrichment$FDR_adjusted < 0.05, ]
  cat("Significant enrichment found in", nrow(sig_terms), "GO terms:\n")
  for(i in 1:min(3, nrow(sig_terms))) {
    cat(paste0("  ", i, ". ", sig_terms$Term[i], 
               " (Enrichment = ", round(sig_terms$Enrichment_ratio[i], 2), 
               ", FDR = ", format(sig_terms$FDR_adjusted[i], scientific = TRUE, digits = 2), ")\n"))
  }
  cat("\nThese processes are likely under divergent selection between\n")
  cat("Swedish and Spanish Arabidopsis populations, potentially related\n")
  cat("to climate adaptation differences.\n")
} else {
  cat("No GO terms reach FDR significance (FDR < 0.05). This suggests:\n")
  cat("1. Polygenic adaptation (many genes with small effects)\n")
  cat("2. Selection on uncharacterized pathways\n")
  cat("3. Technical limitations (sample size, filtering thresholds)\n")
  cat("4. Demographic history mimicking selection signals\n\n")
  
  if(nrow(go_enrichment) > 0) {
    cat("Top nominal enrichment (lowest p-values):\n")
    for(i in 1:min(3, nrow(go_enrichment))) {
      cat(paste0("  ", i, ". ", go_enrichment$Term[i], 
                 " (p = ", format(go_enrichment$Fisher_p_value[i], scientific = TRUE, digits = 2), 
                 ", Enrichment = ", round(go_enrichment$Enrichment_ratio[i], 2), ")\n"))
    }
  }
}

## No GO terms reach FDR significance (FDR < 0.05). This suggests:
## 1. Polygenic adaptation (many genes with small effects)
## 2. Selection on uncharacterized pathways
## 3. Technical limitations (sample size, filtering thresholds)
## 4. Demographic history mimicking selection signals
## 
## Top nominal enrichment (lowest p-values):
##   1. plant-type hypersensitive response (p = 7.4e-04, Enrichment = 2.78)
##   2. seed coat development (p = 7.6e-04, Enrichment = 3.94)
##   3. protein acetylation (p = 1.2e-03, Enrichment = 7.69)

cat("\n=== STATISTICAL CAVEATS ===\n")

## 
## === STATISTICAL CAVEATS ===

cat("• FST outliers can arise from demographic history (e.g., bottlenecks)\n")

## • FST outliers can arise from demographic history (e.g., bottlenecks)

cat("• Multiple testing correction reduces power for weak signals\n")

## • Multiple testing correction reduces power for weak signals

cat("• Gene-based aggregation assumes functional equivalence of SNPs\n")

## • Gene-based aggregation assumes functional equivalence of SNPs

cat("• GO analysis depends on annotation completeness\n")

## • GO analysis depends on annotation completeness

cat("\n=== RECOMMENDED NEXT STEPS ===\n")

## 
## === RECOMMENDED NEXT STEPS ===

cat("1. Validate candidate genes with independent population data\n")

## 1. Validate candidate genes with independent population data

cat("2. Examine gene family expansions/contractions\n")

## 2. Examine gene family expansions/contractions

cat("3. Correlate with climate variables (temperature, precipitation)\n")

## 3. Correlate with climate variables (temperature, precipitation)

cat("4. Functional validation of top candidate genes\n")

## 4. Functional validation of top candidate genes

10. Should You Use Defense-Only Or Full Genome Data?

10.1. Defense-Only Analysis: Why This Is Correct

Scientific justification

If your a priori biological hypothesis concerns defense-related genes (e.g., immune response, pathogen interaction, hypersensitive response), then restricting the GO enrichment universe to defense-annotated genes is hypothesis-driven, not post hoc.

This is critical: You are not “cherry-picking”; you are testing a predefined functional class motivated by biology.

Statistical advantages

Using defense genes only:

Reduces the multiple-testing burden
Increases statistical power
Improves interpretability of enriched terms
Avoids dilution of signal by unrelated housekeeping functions

This is especially appropriate when:

Selection is expected to act on a specific biological system
Effect sizes are moderate (polygenic adaptation)
Genome-wide enrichment is weak or absent

Appropriate use cases

✔ Main GO enrichment analysis
✔ Hypothesis testing (“Are defense pathways overrepresented among high-FST genes?”)
✔ Core biological conclusions

10.2. Genome-Wide Analysis: Why You Still Need It

Using the full genome is not redundant—it serves a different purpose.

Genome-wide data provide:

A null expectation for FST distributions
Context for interpreting effect sizes
Validation that defense genes are not trivially extreme due to annotation bias
Protection against accusations of circular reasoning

Appropriate use cases

✔ Supplemental figures
✔ Validation analyses
✔ Descriptive context
✔ Discussion and interpretation

This distinction is methodologically clean and reviewers generally welcome it.

10.3. Recommended Structure (Best Practice)

Your proposed structure is exactly right:

Main manuscript

GO enrichment restricted to defense genes
Focused hypothesis testing
Clear biological narrative
Stronger statistical power

Supplementary material

Genome-wide FST distributions
Comparison of defense vs non-defense genes
Permutation or empirical quantile validation (optional but strong)

Discussion

Explicit comparison of defense genes to genome-wide expectations
Interpretation of why genome-wide enrichment may be weak
Framing in terms of polygenic or pathway-specific adaptation

This approach balances rigor, power, and transparency.

10.4. One Critical Caveat (Must Be Explicit in Methods)

You must clearly define the background (gene universe) in each analysis.

Specifically:

Defense-only GO analysis
- Background = all defense-annotated genes
Genome-wide analysis
- Background = all genes with FST estimates

Failing to state this explicitly is the most common reviewer criticism.

10.5. Summary

To test the hypothesis that local adaptation is enriched in defense-related pathways, we performed Gene Ontology enrichment analyses restricted to genes annotated with defense-associated GO terms. This hypothesis-driven approach reduces multiple testing and increases power to detect pathway-level selection acting on immune and stress-response processes. To provide genome-wide context, we additionally analyzed the distribution of FST values across all genes and compared defense genes to the genomic background. Genome-wide analyses are presented as supplementary material and serve to validate that defense-associated genes represent outliers relative to genome-wide expectations.

Remember: This current analysis uses the full genome. For a defense-focused study, you would need to:

Pre-filter your FST data to defense genes only
Adjust the background universe in the GO enrichment step
Still include a genome-wide comparison in supplements for context

Population Genetics Analysis of Arabidopsis thaliana

Practical Session 6 - Gene Ontology (GO) Enrichment for Climate Adaptation Genes

Dr Tahir Ali | AG de Meaux

December 19, 2025

Overview

Practical Setup: Running Analysis on HPC

Instructions for Running Bioinformatics Jobs on the HPC (RAMSES)

Basic Steps:

Estimating Genome-wide FST Values

Create and Submit a SLURM Script for estimating Genome-wide FST values

Example SLURM Script (stacks.sh):

Check Status and ID of Your Submitted Job:

To Cancel Running Job:

Conceptual and Statistical Rationale

Why FST Outliers?

Why Gene-Level and GO Enrichment?

Important Caveat (Avoiding Circularity)

0. Required Packages

1. Load and Process FST Data

2. Visualise FST Distribution

3. Gene Annotation and SNP-to-Gene Mapping

4. Identify Candidate Genes

5. GO Enrichment Analysis

6. Interactive GO Enrichment Table

7. Visualisation of Enriched GO Terms

8. Display Results

9. Analysis Summary

10. Should You Use Defense-Only Or Full Genome Data?

10.1. Defense-Only Analysis: Why This Is Correct

Scientific justification

Statistical advantages

Appropriate use cases

10.2. Genome-Wide Analysis: Why You Still Need It

Genome-wide data provide:

Appropriate use cases

10.3. Recommended Structure (Best Practice)

Main manuscript

Supplementary material

Discussion

10.4. One Critical Caveat (Must Be Explicit in Methods)

10.5. Summary

Example SLURM Script (`stacks.sh`):