Copy Required Data:

Copy the required directory to your home directory (Replace every instance of qbiad019 with your own username.)

cp -r /scratch/qbio/Practical_Session_5 /home/group.kurse/qbiad019

Set up the path to the working directory

# Set the knitr root directory - this PERSISTS
# replace both path to "/home/group.kurse/qbiadxxx/Practical_Session_5
knitr::opts_knit$set(root.dir = "/home/group.kurse/qbiad019/Practical_Session_5")

# Also set DATA_DIR variable for reference
DATA_DIR <- "/home/group.kurse/qbiad019/Practical_Session_5"

cat("Knitr root.dir set to:", knitr::opts_knit$get("root.dir"), "\n")

## Knitr root.dir set to: /home/group.kurse/qbiad019/Practical_Session_5

cat("Files available:", paste(list.files()[1:3], collapse=", "), "\n")

## Files available: Practical_Day_5_files, Practical_Day_5.qmd, vcf

Summary

Natural populations evolve under the combined action of genetic drift and natural selection. Distinguishing between these forces is central to understanding adaptive evolution. Genetic drift leads to random allele frequency changes, while selection induces systematic shifts in specific genomic regions associated with fitness. Defence-related genes, which mediate plant responses to pathogens and stress conditions, are strong candidates for experiencing positive or balancing selection.

Analysis Overview

In this analysis, we compare nucleotide diversity (π), Tajima’s D, and FST between:

Defence-related windows (DF) – genomic regions overlapping annotated defence genes
Background windows (BG) – all other genome regions

The goal is to identify whether defence genes show distinct evolutionary patterns compared to background genomic regions.

Whether defence genes show signatures of selection, such as:

Reduced π (purifying or positive selection)
Elevated π (balancing selection)
Higher or lower Tajima’s D depending on selection mode
Elevated FST between populations (local adaptation)

Key Questions

Do defence genes have different levels of genetic diversity compared to background regions?
Are there differences in Tajima’s D (neutrality test statistic) between defence and background genes?
Do defence genes show elevated population differentiation (FST)?
Are certain defence genes FST outliers indicating potential selection?

Workflow Summary

Data Loading: VCF file with SNPs and BED file with defence gene coordinates
Window-based Analysis: Sliding windows across chromosome 1
Statistics Calculation: π, Tajima’s D, FST for two populations (ESP, SWE)
Gene Annotation: Tag windows overlapping defence genes
Comparative Analysis: Statistical tests comparing defence vs background windows
Visualization: Density plots, boxplots, and genomic position plots

Expected Outputs

Summary statistics tables
Distribution plots for all statistics
Statistical test results (KS tests, Wilcoxon tests)
FST outlier analysis
Genomic position plots

0. Setup

The following packages are required for handling VCF files, computing population genetic statistics, plotting results, and data wrangling.

# --- 0) Load Required Packages ------------------------------------------------
library(VariantAnnotation)
library(PopGenome)
library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
library(readr)
library(tibble)
library(knitr)
library(conflicted)

# Resolve function conflicts
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::count)

1. Load and Inspect the VCF

Purpose

This block loads the full VCF file and determines chromosome lengths, which helps ensure that sliding windows and coordinate comparisons are aligned.

Output

VCF file loaded into memory
Table showing chromosome positions
Confirmation that data is properly loaded

# --- 1) Load and inspect VCF ---------------------------------
# Load the VCF file
vcf_full <- readVcf("vcf/1001genomes_snp-short-indel_only_ACGTN_Dp10GQ20Q30_NoIndel_Bialleleic_80PcMissing_80SwEs.vcf.recode.vcf.gz")

# Get chromosome names from VCF
chromosomes <- seqlevels(vcf_full)

# Calculate start and end positions for each chromosome
chromosome_ranges <- data.frame(CHROM = character(), Start = numeric(), End = numeric(), stringsAsFactors = FALSE)
for (chrom in chromosomes) {
  positions <- start(vcf_full[seqnames(vcf_full) == chrom])
  chromosome_ranges <- rbind(chromosome_ranges, 
                             data.frame(CHROM = chrom, 
                                        Start = min(positions), 
                                        End = max(positions)))
}

cat("Chromosome ranges in VCF file:\n")

## Chromosome ranges in VCF file:

print(kable(chromosome_ranges))

## 
## 
## |CHROM | Start|      End|
## |:-----|-----:|--------:|
## |1     |  2784| 30422721|
## |2     | 24180| 19697399|
## |3     |  1705| 23459492|
## |4     |  1552| 18584482|
## |5     |   115| 26974567|

2. Define Analysis Regions

Purpose

Define chromosome ranges for analysis. The selection of ranges may be aimed at focusing on certain chromosomal regions or excluding telomeres and centromeres.

# --- 2) Define chromosome ranges for analysis -----------------
# These ranges correspond to the positions on the chromosomes of interest
chr1start <- 2784; chr1end <- 30422721
chr2start <- 24180; chr2end <- 19697399
chr3start <- 1705; chr3end <- 23459492
chr4start <- 1552; chr4end <- 18584482
chr5start <- 115; chr5end <- 26974567

cat("Analysis ranges defined for 5 chromosomes\n")

## Analysis ranges defined for 5 chromosomes

3. Read VCF into PopGenome

Purpose

The genome is processed chromosome by chromosome. PopGenome requires explicit chromosome and positional ranges. Load chromosome 1 data into PopGenome format for population genetics analyses.

Output

PopGenome object for chromosome 1
Ready for population assignments and statistics calculation

# --- 3) Read VCF into PopGenome for chromosome 1 --------------
# Example: chromosome 1 (use all the variants across the whole genome)
At_full_Chr <- suppressMessages(
  readVCF("vcf/1001genomes_snp-short-indel_only_ACGTN_Dp10GQ20Q30_NoIndel_Bialleleic_80PcMissing_80SwEs.vcf.recode.vcf.gz", 
          numcols = 89, 
          tid = "1", 
          frompos = chr1start, 
          topos = chr1end, 
          include.unknown = TRUE)
)

## vcff::open : file opened, contains 80 samples
## [1] "Available ContigIdentifiers (parameter tid):"
## [1] "1" "2" "3" "4" "5"
## |            :            |            :            | 100 %
## |===================================================| ;-) 
## 
## 
## Concatenate regions 
## |            :            |            :            | 100 %
## |===================================================| ;-)

cat("VCF loaded into PopGenome object\n")

## VCF loaded into PopGenome object

cat("Number of sites:", length(At_full_Chr@region.names), "\n")

## Number of sites: 1

4. Define Populations

Purpose

Assign samples to populations (ESP and SWE) based on metadata. Enables calculation of population-specific π, Tajima’s D, and pairwise FST.

Output

Populations defined in PopGenome object
Ready for population-specific calculations

# --- 4) Define populations ------------------------------------
# Read population metadata
population_info <- read_delim("vcf/pop1.txt", delim = "\t", show_col_types = FALSE)

# Create list of populations
populations <- split(population_info$sample, population_info$pop)

# Assign populations to PopGenome object
At_full_Chr <- set.populations(At_full_Chr, populations, diploid = TRUE)

## |            :            |            :            | 100 %
## |====================================================

cat("Populations defined:\n")

## Populations defined:

print(names(populations))

## [1] "ESP" "SWE"

cat("Sample counts:", sapply(populations, length), "\n")

## Sample counts: 40 40

5. Create Sliding Windows

Purpose

Define sliding windows along chromosome 1 for local statistics calculation.

Sliding windows allow summarising statistics across genomes rather than per SNP.
Enables statistically robust summaries and reduces noise.

Parameters

Window size: 100 bp
Step size: 50 bp (50% overlap)

# --- 5) Define sliding windows --------------------------------
window_size <- 100
window_jump <- 50
strt <- chr1start
end <- chr1end

# Generate sliding window positions
window_start <- seq(from = strt, to = end, by = window_jump)
window_stop <- window_start + window_size

# Remove windows exceeding chromosome length
window_start <- window_start[window_stop < end]
window_stop <- window_stop[window_stop < end]

# Create windows data frame
windows <- data.frame(
  start = window_start,
  stop = window_stop,
  mid = window_start + (window_stop - window_start) / 2,
  chr = "1"
)

cat("Sliding windows created:\n")

## Sliding windows created:

cat("Number of windows:", nrow(windows), "\n")

## Number of windows: 608397

cat("Window size:", window_size, "bp\n")

## Window size: 100 bp

cat("Step size:", window_jump, "bp\n")

## Step size: 50 bp

6. Calculate Window Statistics

Purpose

Calculate key population genetics statistics (π, Tajima’s D, and FST) for each sliding window.

Statistics Calculated

Nucleotide diversity (π)
Tajima’s D (neutrality test)
FST (population differentiation)

# --- 6) Calculate sliding window statistics --------------------
# Transform to sliding windows
At_full_sw <- suppressMessages(
  sliding.window.transform(At_full_Chr, width = window_size, jump = window_jump, type = 2)
)

## |            :            |            :            | 100 %
## |===================================================

# Calculate diversity statistics
At_full_sw <- suppressMessages(diversity.stats(At_full_sw, pi = TRUE))

## |            :            |            :            | 100 %
## |===================================================

# Calculate neutrality statistics (includes Tajima's D)
At_full_sw <- suppressMessages(neutrality.stats(At_full_sw))

## |            :            |            :            | 100 %
## |===================================================

# Calculate FST statistics
At_full_sw <- suppressMessages(F_ST.stats(At_full_sw, mode = "nucleotide"))

## |            :            |            :            | 100 %
## |===================================================

cat("Statistics calculated for each window:\n")

## Statistics calculated for each window:

cat("- Nucleotide diversity (π)\n")

## - Nucleotide diversity (π)

cat("- Tajima's D\n")

## - Tajima's D

cat("- FST\n")

## - FST

7. Extract and Combine Statistics

Purpose

PopGenome objects store outputs in nested data structures; this block extracts usable tables and combines them into a tidy data frame.

Output

Dataframe with all statistics for each window
Cleaned data ready for analysis

# --- 7) Extract statistics from PopGenome object --------------
# Extract Tajima's D and adjust for scale
td_full <- At_full_sw@Tajima.D / window_size
colnames(td_full) <- paste0(names(populations), "_td")

# Extract nucleotide diversity (pi)
nd_full <- At_full_sw@nuc.diversity.within / window_size
colnames(nd_full) <- paste0(names(populations), "_pi")

# Extract FST values
fst_full <- t(At_full_sw@nuc.F_ST.pairwise)

# Clean FST column names
x <- colnames(fst_full)
for (i in seq_along(populations)) {
  x <- sub(paste0("pop", i), names(populations)[i], x)
}
colnames(fst_full) <- paste0(sub("/", "_", x), "_fst")

# --- 8) Combine all statistics into one data frame ------------
At_data_full <- as_tibble(data.frame(windows, nd_full, fst_full, td_full))

# Filter out rows with zeros or NAs
At_data_clean <- At_data_full %>%
  filter(!if_any(everything(), ~ . == 0 | is.na(.)))

cat("Data extracted and combined:\n")

## Data extracted and combined:

cat("Total windows:", nrow(At_data_full), "\n")

## Total windows: 608397

cat("Clean windows (no zeros/NAs):", nrow(At_data_clean), "\n")

## Clean windows (no zeros/NAs): 79622

cat("\nFirst 5 rows of clean data:\n")

## 
## First 5 rows of clean data:

print(kable(head(At_data_clean, 5)))

## 
## 
## | start| stop|  mid|chr |    ESP_pi|    SWE_pi| ESP_SWE_fst|     ESP_td|     SWE_td|
## |-----:|----:|----:|:---|---------:|---------:|-----------:|----------:|----------:|
## |  4584| 4684| 4634|1   | 0.0044428| 0.0055719|   0.1765667|  0.0141849|  0.0059346|
## |  4634| 4734| 4684|1   | 0.0044428| 0.0055719|   0.1765667|  0.0141849|  0.0059346|
## |  5984| 6084| 6034|1   | 0.0061311| 0.0085248|   0.1253754|  0.0080708|  0.0177694|
## |  6034| 6134| 6084|1   | 0.0061311| 0.0085248|   0.1253754|  0.0080708|  0.0177694|
## |  6384| 6484| 6434|1   | 0.0010932| 0.0014051|   0.0547234| -0.0058576| -0.0036847|

8. Genome-wide Visualization

Purpose

Visualize statistics along chromosome 1 to identify patterns and hotspots.

Output

Multi-panel plot showing π, Tajima’s D, and FST across the chromosome
Identification of regions with unusual values

# --- 9) Create facet plot of genome-wide patterns -------------
# Select relevant columns for plotting
hs_full <- At_data_full %>%
select(mid, ESP_pi, SWE_pi, ESP_td, SWE_td, ESP_SWE_fst) %>%
mutate(ESP_SWE_fst = ifelse(ESP_SWE_fst < 0, 0, ESP_SWE_fst))

# Reshape for faceting
hs_g_full <- gather(hs_full, -mid, key = "stat", value = "value")

# Create facet plot
a_full <- ggplot(hs_g_full, aes(mid/10^6, value, colour = stat)) +
geom_line() +
facet_grid(stat ~ ., scales = "free_y") +
xlab("Position (Mb)") +
theme_light() +
theme(legend.position = "none") +
ggtitle("Genome-wide Statistics Along Chromosome 1")

print(a_full)

## Warning: Removed 262 rows containing missing values or values outside the scale range
## (`geom_line()`).

Interpretation guideline:

Peaks in FST suggest local adaptation.
Depressions in π may indicate selective sweeps.
Strongly negative Tajima’s D indicates excess rare variants (e.g., recent selection).

9. Annotate Defence-Related Windows

Purpose

Identify windows that overlap with defence-related genes using BED file coordinates. Enables comparison of selection metrics in defence vs background regions.

Output

Dataframe with DF (defence) or BG (background) annotation
Summary of how many windows are defence-related

# --- 10) Annotate defence-related windows using BED file ------
# Read BED file with defence gene coordinates
bed <- read.table("vcf/At_defense_only.bed", header = FALSE,
                  col.names = c("chr", "begin", "end"),
                  stringsAsFactors = FALSE)

# Convert chromosome names to character
bed$chr <- as.character(bed$chr)

# Initialize DF column
At_data_full$DF <- "BG"
At_data_clean$DF <- "BG"

# Ensure proper data types
At_data_full$chr <- as.character(At_data_full$chr)
At_data_full$start <- as.numeric(At_data_full$start)
At_data_full$stop <- as.numeric(At_data_full$stop)

At_data_clean$chr <- as.character(At_data_clean$chr)
At_data_clean$start <- as.numeric(At_data_clean$start)
At_data_clean$stop <- as.numeric(At_data_clean$stop)

# Tag windows that overlap with defence gene regions
for (i in 1:nrow(bed)) {
  idx_full <- which(
    At_data_full$chr == bed$chr[i] &
      At_data_full$start <= bed$end[i] &
      At_data_full$stop >= bed$begin[i]
  )
  
  idx_clean <- which(
    At_data_clean$chr == bed$chr[i] &
      At_data_clean$start <= bed$end[i] &
      At_data_clean$stop >= bed$begin[i]
  )
  
  if (length(idx_full) > 0) {
    At_data_full$DF[idx_full] <- "DF"
  }
  if (length(idx_clean) > 0) {
    At_data_clean$DF[idx_clean] <- "DF"
  }
}

# Convert DF to factor
At_data_full$DF <- factor(At_data_full$DF, levels = c("BG", "DF"))
At_data_clean$DF <- factor(At_data_clean$DF, levels = c("BG", "DF"))

cat("Defence gene annotation complete:\n")

## Defence gene annotation complete:

cat("In full dataset:\n")

## In full dataset:

print(table(At_data_full$DF))

## 
##     BG     DF 
## 599229   9168

cat("\nIn clean dataset:\n")

## 
## In clean dataset:

print(table(At_data_clean$DF))

## 
##    BG    DF 
## 78080  1542

cat("\nPercentage defence windows in clean data:", 
    round(100 * sum(At_data_clean$DF == "DF") / nrow(At_data_clean), 2), "%\n")

## 
## Percentage defence windows in clean data: 1.94 %

10. Statistical Comparisons

Purpose

Compare statistics between defence and background windows using statistical tests.

Tests Performed

Kolmogorov-Smirnov tests (distribution comparisons)
Wilcoxon rank-sum tests (median comparisons)

# --- 11-12) Perform statistical comparisons ----------------------
# Create subsets for testing
sub1_pi_esp <- At_data_clean$ESP_pi[At_data_clean$DF == "DF"]
sub2_pi_esp <- At_data_clean$ESP_pi[At_data_clean$DF == "BG"]

sub1_td_esp <- At_data_clean$ESP_td[At_data_clean$DF == "DF"]
sub2_td_esp <- At_data_clean$ESP_td[At_data_clean$DF == "BG"]

sub1_fst <- At_data_clean$ESP_SWE_fst[At_data_clean$DF == "DF"]
sub2_fst <- At_data_clean$ESP_SWE_fst[At_data_clean$DF == "BG"]

# Kolmogorov-Smirnov tests
cat("Kolmogorov-Smirnov Tests (Defence vs Background):\n")

## Kolmogorov-Smirnov Tests (Defence vs Background):

cat("=================================================\n")

## =================================================

ks_pi_esp <- ks.test(sub1_pi_esp, sub2_pi_esp)

## Warning in ks.test.default(sub1_pi_esp, sub2_pi_esp): p-value will be
## approximate in the presence of ties

cat("\nESP π distribution comparison:\n")

## 
## ESP π distribution comparison:

cat("D =", round(ks_pi_esp$statistic, 4), "\n")

## D = 0.0295

cat("p-value =", format.pval(ks_pi_esp$p.value, digits = 4), "\n")

## p-value = 0.145

ks_td_esp <- ks.test(sub1_td_esp, sub2_td_esp)

## Warning in ks.test.default(sub1_td_esp, sub2_td_esp): p-value will be
## approximate in the presence of ties

cat("\nESP Tajima's D distribution comparison:\n")

## 
## ESP Tajima's D distribution comparison:

cat("D =", round(ks_td_esp$statistic, 4), "\n")

## D = 0.0232

cat("p-value =", format.pval(ks_td_esp$p.value, digits = 4), "\n")

## p-value = 0.3874

ks_fst <- ks.test(sub1_fst, sub2_fst)

## Warning in ks.test.default(sub1_fst, sub2_fst): p-value will be approximate in
## the presence of ties

cat("\nFST distribution comparison:\n")

## 
## FST distribution comparison:

cat("D =", round(ks_fst$statistic, 4), "\n")

## D = 0.0466

cat("p-value =", format.pval(ks_fst$p.value, digits = 4), "\n")

## p-value = 0.002773

11. Distribution Plots

Purpose

Visualize and compare distributions of statistics between defence and background windows.

Output

Multi-panel density plots
KS test p-values and mean differences displayed
Clear visual comparison

# --- Complete Density Plot Code ---
par(mfrow = c(2, 3), mar = c(4, 4, 4, 2))

# Function to add statistics below legend
add_stats_below_legend <- function(ks_p, mean_diff, y_pos_factor = 0.85, x_pos = "right") {
  usr <- par("usr")
  x_range <- usr[2] - usr[1]
  y_range <- usr[4] - usr[3]
  
  if (x_pos == "right") {
    x_pos <- usr[2] - 0.1 * x_range
    h_adj <- 1
  } else {
    x_pos <- usr[1] + 0.1 * x_range
    h_adj <- 0
  }
  
  y_pos <- usr[3] + y_pos_factor * y_range
  
  text(x_pos, y_pos, 
       paste("KS p =", ks_p, "\nΔmean =", sprintf("%.4f", mean_diff)),
       adj = c(h_adj, 0.5),
       cex = 0.75,
       font = 2,
       col = "darkred")
}

# Prepare data for SWE population as well
sub1_pi_swe <- At_data_clean$SWE_pi[At_data_clean$DF == "DF"]
sub2_pi_swe <- At_data_clean$SWE_pi[At_data_clean$DF == "BG"]
sub1_td_swe <- At_data_clean$SWE_td[At_data_clean$DF == "DF"]
sub2_td_swe <- At_data_clean$SWE_td[At_data_clean$DF == "BG"]

# 1. ESP Tajima's D
ks_td_esp <- ks.test(sub1_td_esp, sub2_td_esp)

## Warning in ks.test.default(sub1_td_esp, sub2_td_esp): p-value will be
## approximate in the presence of ties

ks_p_td_esp <- format.pval(ks_td_esp$p.value, digits = 3)
mean_diff_td_esp <- mean(sub1_td_esp) - mean(sub2_td_esp)

if (length(sub2_td_esp) > 1 & length(sub1_td_esp) > 1) {
  dens_bg <- density(sub2_td_esp)
  dens_df <- density(sub1_td_esp)
  
  plot(dens_bg, 
       main = "ESP Tajima's D",
       xlab = "Tajima's D", 
       col = "black", 
       lwd = 2,
       xlim = range(c(dens_bg$x, dens_df$x)),
       ylim = range(c(dens_bg$y, dens_df$y)) * 1.1)
  
  lines(dens_df, col = "blue", lwd = 2)
  
  legend("topright", 
         legend = c(paste("BG (n=", length(sub2_td_esp), ")", sep = ""),
                    paste("DF (n=", length(sub1_td_esp), ")", sep = "")), 
         col = c("black", "blue"), 
         lwd = 2,
         cex = 0.8,
         bty = "n")
  
  add_stats_below_legend(ks_p_td_esp, mean_diff_td_esp, y_pos_factor = 0.45, x_pos = "right")
}

# 2. SWE Tajima's D
ks_td_swe <- ks.test(sub1_td_swe, sub2_td_swe)

## Warning in ks.test.default(sub1_td_swe, sub2_td_swe): p-value will be
## approximate in the presence of ties

ks_p_td_swe <- format.pval(ks_td_swe$p.value, digits = 3)
mean_diff_td_swe <- mean(sub1_td_swe) - mean(sub2_td_swe)

if (length(sub2_td_swe) > 1 & length(sub1_td_swe) > 1) {
  dens_bg <- density(sub2_td_swe)
  dens_df <- density(sub1_td_swe)
  
  plot(dens_bg, 
       main = "SWE Tajima's D",
       xlab = "Tajima's D", 
       col = "black", 
       lwd = 2,
       xlim = range(c(dens_bg$x, dens_df$x)),
       ylim = range(c(dens_bg$y, dens_df$y)) * 1.1)
  
  lines(dens_df, col = "red", lwd = 2)
  
  legend("topright", 
         legend = c(paste("BG (n=", length(sub2_td_swe), ")", sep = ""),
                    paste("DF (n=", length(sub1_td_swe), ")", sep = "")), 
         col = c("black", "red"), 
         lwd = 2,
         cex = 0.8,
         bty = "n")
  
  add_stats_below_legend(ks_p_td_swe, mean_diff_td_swe, y_pos_factor = 0.45, x_pos = "right")
}

# 3. ESP π
ks_pi_esp <- ks.test(sub1_pi_esp, sub2_pi_esp)

## Warning in ks.test.default(sub1_pi_esp, sub2_pi_esp): p-value will be
## approximate in the presence of ties

ks_p_pi_esp <- format.pval(ks_pi_esp$p.value, digits = 3)
mean_diff_pi_esp <- mean(sub1_pi_esp) - mean(sub2_pi_esp)

if (length(sub2_pi_esp) > 1 & length(sub1_pi_esp) > 1) {
  dens_bg <- density(sub2_pi_esp)
  dens_df <- density(sub1_pi_esp)
  
  plot(dens_bg, 
       main = bquote(bold("ESP " * pi)),
       xlab = expression(pi), 
       col = "black", 
       lwd = 2,
       xlim = range(c(dens_bg$x, dens_df$x)),
       ylim = range(c(dens_bg$y, dens_df$y)) * 1.1)
  
  lines(dens_df, col = "darkgreen", lwd = 2)
  
  legend("topright", 
         legend = c(paste("BG (n=", length(sub2_pi_esp), ")", sep = ""),
                    paste("DF (n=", length(sub1_pi_esp), ")", sep = "")), 
         col = c("black", "darkgreen"), 
         lwd = 2,
         cex = 0.8,
         bty = "n")
  
  add_stats_below_legend(ks_p_pi_esp, mean_diff_pi_esp, y_pos_factor = 0.15, x_pos = "right")
}

# 4. SWE π
ks_pi_swe <- ks.test(sub1_pi_swe, sub2_pi_swe)

## Warning in ks.test.default(sub1_pi_swe, sub2_pi_swe): p-value will be
## approximate in the presence of ties

ks_p_pi_swe <- format.pval(ks_pi_swe$p.value, digits = 3)
mean_diff_pi_swe <- mean(sub1_pi_swe) - mean(sub2_pi_swe)

if (length(sub2_pi_swe) > 1 & length(sub1_pi_swe) > 1) {
  dens_bg <- density(sub2_pi_swe)
  dens_df <- density(sub1_pi_swe)
  
  plot(dens_bg, 
       main = bquote(bold("SWE " * pi)),
       xlab = expression(pi), 
       col = "black", 
       lwd = 2,
       xlim = range(c(dens_bg$x, dens_df$x)),
       ylim = range(c(dens_bg$y, dens_df$y)) * 1.1)
  
  lines(dens_df, col = "orange", lwd = 2)
  
  legend("topright", 
         legend = c(paste("BG (n=", length(sub2_pi_swe), ")", sep = ""),
                    paste("DF (n=", length(sub1_pi_swe), ")", sep = "")), 
         col = c("black", "orange"), 
         lwd = 2,
         cex = 0.8,
         bty = "n")
  
  add_stats_below_legend(ks_p_pi_swe, mean_diff_pi_swe, y_pos_factor = 0.15, x_pos = "right")
}

# 5. FST
ks_fst <- ks.test(sub1_fst, sub2_fst)

## Warning in ks.test.default(sub1_fst, sub2_fst): p-value will be approximate in
## the presence of ties

ks_p_fst <- format.pval(ks_fst$p.value, digits = 3)
mean_diff_fst <- mean(sub1_fst) - mean(sub2_fst)

if (length(sub2_fst) > 1 & length(sub1_fst) > 1) {
  dens_bg <- density(sub2_fst)
  dens_df <- density(sub1_fst)
  
  plot(dens_bg, 
       main = bquote(bold("ESP-SWE " * F[ST])),
       xlab = expression(F[ST]), 
       col = "black", 
       lwd = 2,
       xlim = range(c(dens_bg$x, dens_df$x)),
       ylim = range(c(dens_bg$y, dens_df$y)) * 1.1)
  
  lines(dens_df, col = "purple", lwd = 2)
  
  legend("topright", 
         legend = c(paste("BG (n=", length(sub2_fst), ")", sep = ""),
                    paste("DF (n=", length(sub1_fst), ")", sep = "")), 
         col = c("black", "purple"), 
         lwd = 2,
         cex = 0.8,
         bty = "n")
  
  add_stats_below_legend(ks_p_fst, mean_diff_fst, y_pos_factor = 0.15, x_pos = "right")
}

# 6. Summary legend
plot(1, type = "n", axes = FALSE, xlab = "", ylab = "")
legend("center", 
       legend = c("Statistical Summary", 
                  "KS = Kolmogorov-Smirnov test",
                  "p = probability value",
                  "Δmean = DF mean - BG mean",
                  paste("Total BG windows:", sum(At_data_clean$DF == "BG")),
                  paste("Total DF windows:", sum(At_data_clean$DF == "DF"))),
       cex = 0.9, 
       bty = "n",
       text.font = c(2, 1, 1, 1, 1, 1))

# Reset plot parameters
par(mfrow = c(1, 1))

12. Boxplot Comparisons

Purpose

Visual comparison of statistics between defence and background genes using boxplots with statistical annotations.

Output

Boxplots for Tajima’s D, π, and FST
Wilcoxon test p-values displayed
Clear visualization of median differences

# Create long format dataframes for ggplot
library(tidyr)

# Tajima's D comparison
td_comparison_long <- At_data_clean %>%
  select(DF, ESP_td, SWE_td) %>%
  pivot_longer(
    cols = c(ESP_td, SWE_td),
    names_to = "population",
    values_to = "td_value"
  ) %>%
  mutate(
    population = gsub("_td", "", population),
    population = factor(population, levels = c("ESP", "SWE"))
  )

# π comparison
pi_comparison_long <- At_data_clean %>%
  select(DF, ESP_pi, SWE_pi) %>%
  pivot_longer(
    cols = c(ESP_pi, SWE_pi),
    names_to = "population",
    values_to = "pi_value"
  ) %>%
  mutate(
    population = gsub("_pi", "", population),
    population = factor(population, levels = c("ESP", "SWE"))
  )

# FST comparison
fst_comparison <- At_data_clean %>%
  select(DF, ESP_SWE_fst)

# Statistical tests
wilcox_esp_td <- wilcox.test(td_value ~ DF, data = td_comparison_long %>% filter(population == "ESP"))
wilcox_swe_td <- wilcox.test(td_value ~ DF, data = td_comparison_long %>% filter(population == "SWE"))
wilcox_esp_pi <- wilcox.test(pi_value ~ DF, data = pi_comparison_long %>% filter(population == "ESP"))
wilcox_swe_pi <- wilcox.test(pi_value ~ DF, data = pi_comparison_long %>% filter(population == "SWE"))
wilcox_fst <- wilcox.test(ESP_SWE_fst ~ DF, data = fst_comparison)

# Create dataframes for p-value annotations
pvalue_td_df <- data.frame(
  population = c("ESP", "SWE"),
  p_value = c(
    format.pval(wilcox_esp_td$p.value, digits = 3),
    format.pval(wilcox_swe_td$p.value, digits = 3)
  ),
  y_position = c(
    max(td_comparison_long$td_value[td_comparison_long$population == "ESP"], na.rm = TRUE) * 0.95,
    max(td_comparison_long$td_value[td_comparison_long$population == "SWE"], na.rm = TRUE) * 0.95
  ),
  x_position = 1.5
)

pvalue_pi_df <- data.frame(
  population = c("ESP", "SWE"),
  p_value = c(
    format.pval(wilcox_esp_pi$p.value, digits = 3),
    format.pval(wilcox_swe_pi$p.value, digits = 3)
  ),
  y_position = c(
    max(pi_comparison_long$pi_value[pi_comparison_long$population == "ESP"], na.rm = TRUE) * 0.95,
    max(pi_comparison_long$pi_value[pi_comparison_long$population == "SWE"], na.rm = TRUE) * 0.95
  ),
  x_position = 1.5
)

# Plot 1: Tajima's D
p1 <- ggplot(td_comparison_long, aes(x = DF, y = td_value, fill = DF)) +
  geom_boxplot(color = "black") +
  facet_wrap(~ population, ncol = 2) +
  theme_light() +
  scale_fill_manual(values = c("lightblue", "lightcoral")) +
  geom_text(data = pvalue_td_df,
            aes(x = x_position, y = y_position, label = paste("p =", p_value)),
            size = 5, inherit.aes = FALSE) +
  labs(title = "Tajima's D: Defence vs Background Genes",
       x = NULL,
       y = "Tajima's D") +
  theme(legend.position = "none")

# Plot 2: π
p2 <- ggplot(pi_comparison_long, aes(x = DF, y = pi_value, fill = DF)) +
  geom_boxplot(color = "black") +
  facet_wrap(~ population, ncol = 2) +
  theme_light() +
  scale_fill_manual(values = c("lightgreen", "orange")) +
  geom_text(data = pvalue_pi_df,
            aes(x = x_position, y = y_position, label = paste("p =", p_value)),
            size = 5, inherit.aes = FALSE) +
  labs(title = expression(paste("Nucleotide Diversity (", pi, "): Defence vs Background")),
       x = NULL,
       y = expression(pi)) +
  theme(legend.position = "none")

# Plot 3: FST
p3 <- ggplot(fst_comparison, aes(x = DF, y = ESP_SWE_fst, fill = DF)) +
  geom_boxplot(color = "black") +
  theme_light() +
  scale_fill_manual(values = c("lightpink", "purple")) +
  annotate("text", x = 1.5, y = max(fst_comparison$ESP_SWE_fst, na.rm = TRUE) * 0.95,
           label = paste("p =", format.pval(wilcox_fst$p.value, digits = 3)),
           size = 5) +
  labs(title = expression(paste(F[ST], ": Defence vs Background Genes")),
       x = NULL,
       y = expression(F[ST])) +
  theme(legend.position = "none")

# Display plots
print(p1)

print(p2)

print(p3)

# Print test results
cat("Wilcoxon Test Results:\n")

## Wilcoxon Test Results:

cat("=====================\n")

## =====================

cat("\nESP Tajima's D: p =", format.pval(wilcox_esp_td$p.value, digits = 4))

## 
## ESP Tajima's D: p = 0.6876

cat("\nSWE Tajima's D: p =", format.pval(wilcox_swe_td$p.value, digits = 4))

## 
## SWE Tajima's D: p = 0.1004

cat("\nESP π: p =", format.pval(wilcox_esp_pi$p.value, digits = 4))

## 
## ESP π: p = 0.1435

cat("\nSWE π: p =", format.pval(wilcox_swe_pi$p.value, digits = 4))

## 
## SWE π: p = 0.003944

cat("\nFST: p =", format.pval(wilcox_fst$p.value, digits = 4), "\n")

## 
## FST: p = 0.1183

Interpretation guideline:

Significant KS p-values indicate DF and BG windows differ in distribution.
Lower π in DF may indicate purifying or positive selection.
Higher π in DF may indicate balancing selection.
Tajima’s D differences help differentiate sweep vs balancing signatures.
Higher FST in DF suggests population-specific selection.

13. FST Outlier Analysis - Visualisation

Purpose

Visualise FST outlier patterns using three complementary plots to understand:

1. Distribution differences between defence and background genes

2. Threshold-based outlier identification

3. Spatial distribution of outliers along the chromosome

# --- FST Outlier Visualization --------------------
library(ggplot2)
library(patchwork)

# Prepare data for plots
fst_df <- At_data_clean %>%
  select(DF, ESP_SWE_fst) %>%
  filter(ESP_SWE_fst > 0)

# Calculate thresholds separately for BG and DF
thresholds_df <- fst_df %>%
  group_by(DF) %>%
  summarise(
    threshold_95 = quantile(ESP_SWE_fst, 0.95, na.rm = TRUE),
    threshold_99 = quantile(ESP_SWE_fst, 0.99, na.rm = TRUE),
    mean_fst = mean(ESP_SWE_fst, na.rm = TRUE),
    n_windows = n()
  )

# Kolmogorov-Smirnov test for distribution differences
bg_fst <- fst_df$ESP_SWE_fst[fst_df$DF == "BG"]
df_fst <- fst_df$ESP_SWE_fst[fst_df$DF == "DF"]
ks_result <- ks.test(bg_fst, df_fst)

# Defence gene windows for specific analysis
defence_windows <- At_data_clean %>%
  filter(DF == "DF")

# Calculate various thresholds
def_threshold_95 <- quantile(defence_windows$ESP_SWE_fst, 0.95, na.rm = TRUE)
def_threshold_99 <- quantile(defence_windows$ESP_SWE_fst, 0.99, na.rm = TRUE)
genome_threshold_95 <- quantile(fst_df$ESP_SWE_fst, 0.95, na.rm = TRUE)
genome_threshold_99 <- quantile(fst_df$ESP_SWE_fst, 0.99, na.rm = TRUE)

# Prepare defence windows with outlier classifications
defence_vs_genome <- defence_windows %>%
  mutate(
    genome_outlier = case_when(
      ESP_SWE_fst > genome_threshold_99 ~ "Genome_99%",
      ESP_SWE_fst > genome_threshold_95 ~ "Genome_95%",
      TRUE ~ "Genome_non_outlier"
    ),
    defence_outlier = case_when(
      ESP_SWE_fst > def_threshold_99 ~ "Defence_99%",
      ESP_SWE_fst > def_threshold_95 ~ "Defence_95%",
      TRUE ~ "Defence_non_outlier"
    )
  )

# Plot 1: BG vs DF FST distributions with thresholds
p1 <- ggplot(fst_df, aes(x = ESP_SWE_fst, fill = DF)) +
  geom_density(alpha = 0.5) +
  geom_vline(data = thresholds_df, 
             aes(xintercept = threshold_95, color = DF),
             linetype = "dashed", size = 0.8) +
  geom_vline(data = thresholds_df,
             aes(xintercept = threshold_99, color = DF),
             linetype = "dotted", size = 0.8) +
  scale_fill_manual(values = c("BG" = "gray70", "DF" = "blue"),
                    name = "Gene Category") +
  scale_color_manual(values = c("BG" = "black", "DF" = "darkblue"),
                     name = "Gene Category") +
  labs(title = "FST Distribution: Defence vs Background Genes",
       subtitle = paste("KS test p =", format.pval(ks_result$p.value, digits = 3)),
       x = expression(F[ST]),
       y = "Density") +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

# Plot 2: Defence gene histogram with multiple thresholds
p2 <- ggplot(defence_vs_genome, aes(x = ESP_SWE_fst)) +
  geom_histogram(binwidth = 0.01, fill = "lightblue", alpha = 0.7, color = "black") +
  # Genome-wide thresholds
  geom_vline(xintercept = genome_threshold_95, 
             color = "orange", linetype = "dashed", size = 1) +
  geom_vline(xintercept = genome_threshold_99,
             color = "red", linetype = "dashed", size = 1) +
  # Defence-specific thresholds
  geom_vline(xintercept = def_threshold_95,
             color = "darkgreen", linetype = "dashed", size = 1) +
  geom_vline(xintercept = def_threshold_99,
             color = "purple", linetype = "dashed", size = 1) +
  annotate("text", x = genome_threshold_95, y = Inf,
           label = "Genome 95%", color = "orange", vjust = 2, hjust = -0.1,
           size = 3) +
  annotate("text", x = genome_threshold_99, y = Inf,
           label = "Genome 99%", color = "red", vjust = 3, hjust = -0.1,
           size = 3) +
  annotate("text", x = def_threshold_95, y = Inf,
           label = "Defence 95%", color = "darkgreen", vjust = 4, hjust = -0.1,
           size = 3) +
  annotate("text", x = def_threshold_99, y = Inf,
           label = "Defence 99%", color = "purple", vjust = 5, hjust = -0.1,
           size = 3) +
  labs(title = "FST Distribution in Defence Genes",
       subtitle = "Comparison of genome-wide vs defence-specific thresholds",
       x = expression(F[ST]),
       y = "Window Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

# Plot 3: Scatter plot of defence gene FST along chromosome
# Prepare color-coded outlier status
defence_windows <- defence_windows %>%
  mutate(outlier_status = case_when(
    ESP_SWE_fst > def_threshold_99 ~ "99% Outlier",
    ESP_SWE_fst > def_threshold_95 ~ "95% Outlier",
    TRUE ~ "Non-outlier"
  ))

p3 <- ggplot(defence_windows, aes(x = mid/1e6, y = ESP_SWE_fst)) +
  geom_point(aes(color = outlier_status, size = outlier_status), alpha = 0.7) +
  geom_hline(yintercept = def_threshold_95, color = "orange", 
             linetype = "dashed", alpha = 0.7) +
  geom_hline(yintercept = def_threshold_99, color = "red",
             linetype = "dashed", alpha = 0.7) +
  scale_color_manual(values = c("Non-outlier" = "gray70",
                                "95% Outlier" = "orange",
                                "99% Outlier" = "red"),
                     name = "Outlier Status") +
  scale_size_manual(values = c("Non-outlier" = 1,
                               "95% Outlier" = 2,
                               "99% Outlier" = 3),
                    name = "Outlier Status") +
  labs(title = "Spatial Distribution of FST in Defence Genes",
       subtitle = "Position along chromosome 1",
       x = "Position (Mb)",
       y = expression(F[ST])) +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

# Combine plots using patchwork
combined_plot <- (p1 | p2) / p3 +
  plot_annotation(
    title = "FST Outlier Analysis: Comprehensive Visualization",
    subtitle = "Three complementary views of FST patterns in defence genes",
    caption = paste("Analysis includes", nrow(defence_windows), "defence gene windows"),
    theme = theme(
      plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
      plot.subtitle = element_text(hjust = 0.5, size = 12),
      plot.caption = element_text(hjust = 1, size = 10)
    )
  )

# Display the combined plot
print(combined_plot)

# Print threshold values for reference
cat("\nFST Threshold Values:\n")

## 
## FST Threshold Values:

cat("====================\n")

## ====================

cat(sprintf("Genome-wide 95%% threshold: %.4f\n", genome_threshold_95))

## Genome-wide 95% threshold: 0.3037

cat(sprintf("Genome-wide 99%% threshold: %.4f\n", genome_threshold_99))

## Genome-wide 99% threshold: 0.4746

cat(sprintf("Defence-specific 95%% threshold: %.4f\n", def_threshold_95))

## Defence-specific 95% threshold: 0.3063

cat(sprintf("Defence-specific 99%% threshold: %.4f\n", def_threshold_99))

## Defence-specific 99% threshold: 0.4968

cat(sprintf("BG mean FST: %.4f\n", mean(bg_fst, na.rm = TRUE)))

## BG mean FST: 0.0959

cat(sprintf("DF mean FST: %.4f\n", mean(df_fst, na.rm = TRUE)))

## DF mean FST: 0.1023

cat(sprintf("KS test p-value: %.4f\n", ks_result$p.value))

## KS test p-value: 0.1107

Interpreting FST Outliers:

99% outliers (red): Most significant, strong candidates for selective sweeps
95% outliers (orange): Moderate significance, potential candidates
Non-outliers (gray): Evolve neutrally or under weak selection

Statistical Tests for FST Outliers

Purpose

Perform statistical tests to determine if defence genes are enriched for FST outliers compared to background regions.

Tests Performed

Fisher’s exact test for contingency tables
Chi-square test for outlier enrichment
Descriptive statistics for outlier proportions

# --- Statistical Tests for FST Outliers --------------------

# Create contingency table
contingency_table <- table(defence_vs_genome$genome_outlier, 
                           defence_vs_genome$defence_outlier)

cat("FST Outlier Statistical Analysis\n")

## FST Outlier Statistical Analysis

cat("================================\n\n")

## ================================

cat("1. Contingency Table (Genome vs Defence thresholds):\n")

## 1. Contingency Table (Genome vs Defence thresholds):

print(kable(contingency_table))

## 
## 
## |                   | Defence_95%| Defence_99%| Defence_non_outlier|
## |:------------------|-----------:|-----------:|-------------------:|
## |Genome_95%         |          55|           0|                   4|
## |Genome_99%         |           7|          16|                   0|
## |Genome_non_outlier |           0|           0|                1460|

# Fisher's exact test with error handling
fisher_test <- tryCatch({
  fisher.test(contingency_table, simulate.p.value = TRUE, B = 10000)
}, error = function(e) {
  cat("Fisher's test error:", e$message, "\nUsing alternative method...\n")
  # Try alternative: create 2x2 table
  simple_df <- defence_vs_genome %>%
    mutate(
      genome_binary = ifelse(genome_outlier != "Genome_non_outlier", "Outlier", "Non_outlier"),
      defence_binary = ifelse(defence_outlier != "Defence_non_outlier", "Outlier", "Non_outlier")
    )
  simple_table <- table(simple_df$genome_binary, simple_df$defence_binary)
  fisher.test(simple_table)
})

cat("\n2. Fisher's Exact Test Results:\n")

## 
## 2. Fisher's Exact Test Results:

cat("   p-value:", format.pval(fisher_test$p.value, digits = 4), "\n")

##    p-value: 9.999e-05

if (!is.null(fisher_test$estimate) && is.numeric(fisher_test$estimate)) {
  cat("   Odds ratio:", round(fisher_test$estimate, 3), "\n")
}

# Simpler enrichment analysis
cat("\n3. Outlier Proportion Analysis:\n")

## 
## 3. Outlier Proportion Analysis:

outlier_summary <- data.frame(
  Category = c("Background", "Defence"),
  Total_Windows = c(sum(fst_df$DF == "BG"), sum(fst_df$DF == "DF")),
  Above_Genome_95th = c(sum(fst_df$ESP_SWE_fst[fst_df$DF == "BG"] > genome_threshold_95),
                       sum(fst_df$ESP_SWE_fst[fst_df$DF == "DF"] > genome_threshold_95)),
  Above_Genome_99th = c(sum(fst_df$ESP_SWE_fst[fst_df$DF == "BG"] > genome_threshold_99),
                       sum(fst_df$ESP_SWE_fst[fst_df$DF == "DF"] > genome_threshold_99))
) %>%
  mutate(
    Prop_95th = Above_Genome_95th / Total_Windows,
    Prop_99th = Above_Genome_99th / Total_Windows
  )

print(kable(outlier_summary, digits = 4))

## 
## 
## |Category   | Total_Windows| Above_Genome_95th| Above_Genome_99th| Prop_95th| Prop_99th|
## |:----------|-------------:|-----------------:|-----------------:|---------:|---------:|
## |Background |         66777|              3321|               658|    0.0497|    0.0099|
## |Defence    |          1267|                82|                23|    0.0647|    0.0182|

# Chi-square test for 95th percentile enrichment
cat("\n4. Chi-square Test (95th percentile enrichment):\n")

## 
## 4. Chi-square Test (95th percentile enrichment):

chi_test <- chisq.test(matrix(c(
  outlier_summary$Above_Genome_95th[1], outlier_summary$Total_Windows[1] - outlier_summary$Above_Genome_95th[1],
  outlier_summary$Above_Genome_95th[2], outlier_summary$Total_Windows[2] - outlier_summary$Above_Genome_95th[2]
), nrow = 2))

cat("   X-squared:", round(chi_test$statistic, 3), "\n")

##    X-squared: 5.567

cat("   p-value:", format.pval(chi_test$p.value, digits = 4), "\n")

##    p-value: 0.0183

# Interpretation
cat("\n5. Interpretation:\n")

## 
## 5. Interpretation:

if (chi_test$p.value < 0.05) {
  enrich_ratio <- outlier_summary$Prop_95th[2] / outlier_summary$Prop_95th[1]
  cat(sprintf("   - Defence genes have %.1f%% outliers vs %.1f%% in background\n", 
              outlier_summary$Prop_95th[2] * 100, 
              outlier_summary$Prop_95th[1] * 100))
  cat(sprintf("   - %.2fx enrichment in defence genes (p = %.4f)\n", 
              enrich_ratio, chi_test$p.value))
} else {
  cat("   - No significant difference in outlier proportions\n")
}

##    - Defence genes have 6.5% outliers vs 5.0% in background
##    - 1.30x enrichment in defence genes (p = 0.0183)

14. Summary Statistics

Purpose

Provide comprehensive summary statistics for all analyses.

Output

Summary table of all statistics by gene category
Clear presentation of results

# --- Final Summary Statistics Table -----------------------------
summary_stats <- At_data_clean %>%
  group_by(DF) %>%
  summarise(
    n_windows = n(),
    # π statistics
    mean_ESP_pi = mean(ESP_pi),
    sd_ESP_pi = sd(ESP_pi),
    mean_SWE_pi = mean(SWE_pi),
    sd_SWE_pi = sd(SWE_pi),
    # FST statistics
    mean_FST = mean(ESP_SWE_fst),
    sd_FST = sd(ESP_SWE_fst),
    median_FST = median(ESP_SWE_fst),
    FST_95th = quantile(ESP_SWE_fst, 0.95),
    FST_99th = quantile(ESP_SWE_fst, 0.99),
    # Tajima's D statistics
    mean_ESP_TajD = mean(ESP_td),
    sd_ESP_TajD = sd(ESP_td),
    mean_SWE_TajD = mean(SWE_td),
    sd_SWE_TajD = sd(SWE_td)
  ) %>%
  mutate(across(where(is.numeric), ~ round(., 4)))

cat("Comprehensive Summary Statistics by Gene Category\n")

## Comprehensive Summary Statistics by Gene Category

cat("=================================================\n\n")

## =================================================

print(kable(summary_stats, 
      caption = "Summary statistics for defence (DF) and background (BG) regions",
      align = c("l", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r")))

## 
## 
## Table: Summary statistics for defence (DF) and background (BG) regions
## 
## |DF | n_windows| mean_ESP_pi| sd_ESP_pi| mean_SWE_pi| sd_SWE_pi| mean_FST| sd_FST| median_FST| FST_95th| FST_99th| mean_ESP_TajD| sd_ESP_TajD| mean_SWE_TajD| sd_SWE_TajD|
## |:--|---------:|-----------:|---------:|-----------:|---------:|--------:|------:|----------:|--------:|--------:|-------------:|-----------:|-------------:|-----------:|
## |BG |     78080|      0.0045|    0.0039|      0.0041|    0.0037|   0.0809| 0.0999|     0.0476|   0.2864|   0.4538|        -8e-04|      0.0096|        0.0012|      0.0098|
## |DF |      1542|      0.0042|    0.0034|      0.0039|    0.0037|   0.0827| 0.1085|     0.0423|   0.3063|   0.4968|        -9e-04|      0.0097|        0.0009|      0.0100|

# Calculate percentage differences
cat("\nPercentage Differences (DF relative to BG):\n")

## 
## Percentage Differences (DF relative to BG):

cat("===========================================\n")

## ===========================================

differences <- data.frame(
  Statistic = c("ESP π", "SWE π", "FST", "ESP Tajima's D", "SWE Tajima's D"),
  DF_Mean = c(
    summary_stats$mean_ESP_pi[2],
    summary_stats$mean_SWE_pi[2],
    summary_stats$mean_FST[2],
    summary_stats$mean_ESP_TajD[2],
    summary_stats$mean_SWE_TajD[2]
  ),
  BG_Mean = c(
    summary_stats$mean_ESP_pi[1],
    summary_stats$mean_SWE_pi[1],
    summary_stats$mean_FST[1],
    summary_stats$mean_ESP_TajD[1],
    summary_stats$mean_SWE_TajD[1]
  )
) %>%
  mutate(
    Difference = DF_Mean - BG_Mean,
    Percent_Change = round((Difference / BG_Mean) * 100, 1),
    Direction = ifelse(Difference > 0, "Higher in DF", "Lower in DF")
  )

print(kable(differences, 
      caption = "Mean differences between defence and background regions",
      align = c("l", "r", "r", "r", "r", "l")))

## 
## 
## Table: Mean differences between defence and background regions
## 
## |Statistic      | DF_Mean| BG_Mean| Difference| Percent_Change|Direction    |
## |:--------------|-------:|-------:|----------:|--------------:|:------------|
## |ESP π          |  0.0042|  0.0045|    -0.0003|           -6.7|Lower in DF  |
## |SWE π          |  0.0039|  0.0041|    -0.0002|           -4.9|Lower in DF  |
## |FST            |  0.0827|  0.0809|     0.0018|            2.2|Higher in DF |
## |ESP Tajima's D | -0.0009| -0.0008|    -0.0001|           12.5|Lower in DF  |
## |SWE Tajima's D |  0.0009|  0.0012|    -0.0003|          -25.0|Lower in DF  |

For statistical reporting, emphasize effect sizes:

# Calculate effect sizes (Cohen's d) for the differences
library(effsize)

effect_sizes <- data.frame(
  Statistic = c("ESP π", "SWE π", "ESP Tajima's D", "SWE Tajima's D", "FST"),
  Cohen_d = c(
    cohen.d(At_data_clean$ESP_pi[At_data_clean$DF == "BG"], 
            At_data_clean$ESP_pi[At_data_clean$DF == "DF"])$estimate,
    cohen.d(At_data_clean$SWE_pi[At_data_clean$DF == "BG"], 
            At_data_clean$SWE_pi[At_data_clean$DF == "DF"])$estimate,
    cohen.d(At_data_clean$ESP_td[At_data_clean$DF == "BG"], 
            At_data_clean$ESP_td[At_data_clean$DF == "DF"])$estimate,
    cohen.d(At_data_clean$SWE_td[At_data_clean$DF == "BG"], 
            At_data_clean$SWE_td[At_data_clean$DF == "DF"])$estimate,
    cohen.d(At_data_clean$ESP_SWE_fst[At_data_clean$DF == "BG"], 
            At_data_clean$ESP_SWE_fst[At_data_clean$DF == "DF"])$estimate
  )
)

print("Effect sizes (Cohen's d):")

## [1] "Effect sizes (Cohen's d):"

print(effect_sizes)

##        Statistic      Cohen_d
## 1          ESP π  0.072114886
## 2          SWE π  0.055400094
## 3 ESP Tajima's D  0.004348834
## 4 SWE Tajima's D  0.031251315
## 5            FST -0.018068985

Interpretation Guide for Effect Sizes (Cohen’s d)

Purpose of this analysis Cohen’s d quantifies the standardized difference between defence-associated windows (DF) and background windows (BG). While statistical tests (e.g., KS test, Wilcoxon) indicate whether distributions differ, effect sizes tell us how large and biologically meaningful these differences are.

How Cohen’s d is calculated here

You computed:

π (ESP and SWE)
Tajima’s D (ESP and SWE)
FST (ESP–SWE)

for DF and BG windows and then measured:

Cohen’s d = (mean_BG − mean_DF) / pooled SD

A positive d means BG > DF; A negative d means DF > BG.

How to interpret your values

You reported:

Statistic	Cohen’s d
ESP π	0.072
SWE π	0.055
ESP Tajima’s D	0.004
SWE Tajima’s D	0.031
FST	−0.018

All values fall well below 0.2.

Interpretation

These effect sizes are negligible.
DF windows do not differ meaningfully from BG windows for π, Tajima’s D, or FST.
Any observed statistical significance (if any) is not accompanied by a biologically relevant shift.
This pattern is more consistent with neutral drift, genome-wide noise, or measurement uncertainty, rather than strong or localized selection in DF windows.

15. Gene Annotation and Overlap Analysis

Purpose

Map FST outlier windows to specific genes using genome annotations to: 1. Identify which defence genes contain FST outliers 2. Determine the biological functions of outlier-containing genes 3. Visualize the spatial relationship between outliers and genes 4. Prioritize candidate genes for further investigation

Input Requirements

GFF annotation file with gene coordinates and functional information
FST outlier data from previous analysis

cat("\n=== GENE ANNOTATION AND OVERLAP ANALYSIS ===\n")

## 
## === GENE ANNOTATION AND OVERLAP ANALYSIS ===

# Load required libraries
library(GenomicRanges)
library(IRanges)
library(conflicted)

# Resolve conflicts
conflicts_prefer(S4Vectors::first)
conflicts_prefer(dplyr::first)

15.1 Load and Prepare Gene Annotations

# --- Step 1: Load and prepare gene annotation (GFF file) ---
cat("Loading gene annotations from GFF file...\n")

## Loading gene annotations from GFF file...

# Read the GFF file
gff_path <- "vcf/GFF_final_with_annotation_2_gff"
if (file.exists(gff_path)) {
  gff <- read.delim(gff_path, 
                    header = TRUE, sep = "\t", stringsAsFactors = FALSE)
  
  # Rename columns for clarity
  colnames(gff) <- c("Model_name", "seqname", "source", "feature", "start", "end", 
                     "score", "strand", "frame", "attributes", "dup_colmn",
                     "gene_type", "short_desc", "curator_summary", "computational_desc")
  
  # Convert positions to numeric
  gff$start <- as.numeric(as.character(gff$start))
  gff$end <- as.numeric(as.character(gff$end))
  
  cat("GFF file loaded successfully\n")
  cat("Total rows:", nrow(gff), "\n")
  cat("Columns:", paste(colnames(gff), collapse = ", "), "\n\n")
  
  # Check feature types
  cat("Feature types in GFF file:\n")
  feature_counts <- table(gff$feature)
  print(kable(feature_counts))
  
  # Filter for mRNA features (gene transcripts)
  filtered_annotation <- subset(gff, feature == "mRNA")
  
  # Extract gene information from attributes column
  # Format: ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
  filtered_annotation$gene_id <- gsub(".*ID=([^;]+).*", "\\1", filtered_annotation$attributes)
  filtered_annotation$parent_gene <- gsub(".*Parent=([^;]+).*", "\\1", filtered_annotation$attributes)
  filtered_annotation$simple_gene_id <- gsub("\\..*", "", filtered_annotation$gene_id)  # Remove version number
  
  cat("\nGene annotation summary:\n")
  cat("✓ Total mRNA features (transcripts):", nrow(filtered_annotation), "\n")
  cat("✓ Unique parent genes:", length(unique(filtered_annotation$parent_gene)), "\n")
  cat("✓ Unique transcripts:", length(unique(filtered_annotation$gene_id)), "\n")
  
} else {
  stop("GFF file not found at: ", gff_path)
}

## GFF file loaded successfully
## Total rows: 28812 
## Columns: Model_name, seqname, source, feature, start, end, score, strand, frame, attributes, dup_colmn, gene_type, short_desc, curator_summary, computational_desc 
## 
## Feature types in GFF file:
## 
## 
## |Var1 |  Freq|
## |:----|-----:|
## |mRNA | 28812|
## 
## Gene annotation summary:
## ✓ Total mRNA features (transcripts): 28812 
## ✓ Unique parent genes: 22359 
## ✓ Unique transcripts: 28812

15.2 Create Genomic Ranges and Find Overlaps

# --- Step 2: Create genomic ranges for genes ---
cat("\nCreating genomic ranges for genes...\n")

## 
## Creating genomic ranges for genes...

gene_ranges <- GRanges(
  seqnames = filtered_annotation$seqname,
  ranges = IRanges(start = filtered_annotation$start, end = filtered_annotation$end),
  gene_id = filtered_annotation$gene_id,
  parent_gene = filtered_annotation$parent_gene,
  gene_type = filtered_annotation$gene_type,
  strand = filtered_annotation$strand,
  description = filtered_annotation$short_desc
)

cat("✓ Created gene ranges for", length(gene_ranges), "transcripts\n")

## ✓ Created gene ranges for 28812 transcripts

# --- Step 3: Create genomic ranges for FST outliers ---
cat("\nCreating genomic ranges for FST outliers...\n")

## 
## Creating genomic ranges for FST outliers...

# First, ensure we have defence_outliers data
if (!exists("defence_outliers")) {
  # Create it from defence_windows if available
  if (exists("defence_windows")) {
    defence_outliers <- defence_windows
    if (!"outlier_status" %in% colnames(defence_outliers)) {
      # Create outlier_status if needed
      if (exists("def_threshold_95") && exists("def_threshold_99")) {
        defence_outliers$outlier_status <- case_when(
          defence_outliers$ESP_SWE_fst > def_threshold_99 ~ "99% Outlier",
          defence_outliers$ESP_SWE_fst > def_threshold_95 ~ "95% Outlier",
          TRUE ~ "Non-outlier"
        )
      }
    }
    cat("Created defence_outliers from defence_windows\n")
  }
}

## Created defence_outliers from defence_windows

# Ensure we have the necessary columns
if (exists("defence_outliers") && nrow(defence_outliers) > 0) {
  # Add chromosome information if missing
  if (!"chromosome" %in% colnames(defence_outliers)) {
    defence_outliers$chromosome <- "Chr1"
  }
  
  # Create GRanges object for outliers
  outlier_ranges <- GRanges(
    seqnames = defence_outliers$chromosome,
    ranges = IRanges(start = defence_outliers$start, end = defence_outliers$stop),
    outlier_status = defence_outliers$outlier_status,
    fst_value = defence_outliers$ESP_SWE_fst,
    window_mid = defence_outliers$mid
  )
  
  cat("✓ Created outlier ranges for", length(outlier_ranges), "windows\n")
  
  # --- Step 4: Find overlaps between outliers and genes ---
  cat("\nFinding overlaps between outliers and genes...\n")
  
  # Find all overlaps (any overlap type)
  overlaps <- findOverlaps(outlier_ranges, gene_ranges, type = "any")
  
  cat("Total overlaps found:", length(overlaps), "\n")
  
  # Create comprehensive data frame of overlaps
  if (length(overlaps) > 0) {
    overlap_df <- data.frame(
      # Outlier information
      outlier_index = queryHits(overlaps),
      outlier_chr = as.character(seqnames(outlier_ranges)[queryHits(overlaps)]),
      outlier_start = start(outlier_ranges)[queryHits(overlaps)],
      outlier_end = end(outlier_ranges)[queryHits(overlaps)],
      outlier_mid = outlier_ranges$window_mid[queryHits(overlaps)],
      outlier_status = as.character(outlier_ranges$outlier_status[queryHits(overlaps)]),
      fst_value = outlier_ranges$fst_value[queryHits(overlaps)],
      
      # Gene information
      gene_index = subjectHits(overlaps),
      gene_chr = as.character(seqnames(gene_ranges)[subjectHits(overlaps)]),
      gene_start = start(gene_ranges)[subjectHits(overlaps)],
      gene_end = end(gene_ranges)[subjectHits(overlaps)],
      gene_id = gene_ranges$gene_id[subjectHits(overlaps)],
      parent_gene = gene_ranges$parent_gene[subjectHits(overlaps)],
      gene_type = gene_ranges$gene_type[subjectHits(overlaps)],
      gene_strand = as.character(strand(gene_ranges)[subjectHits(overlaps)]),
      gene_description = gene_ranges$description[subjectHits(overlaps)]
    )
    
    # Filter for only outliers (exclude "Non-outlier" if present)
    if ("Non-outlier" %in% overlap_df$outlier_status) {
      overlap_df <- overlap_df %>% filter(outlier_status != "Non-outlier")
    }
    
    # Calculate overlap statistics
    cat("\n=== OVERLAP STATISTICS ===\n")
    cat("Overlapping outlier-gene pairs:", nrow(overlap_df), "\n")
    cat("Unique genes with outliers:", length(unique(overlap_df$parent_gene)), "\n")
    cat("Unique transcripts with outliers:", length(unique(overlap_df$gene_id)), "\n\n")
    
    cat("Overlaps by outlier type:\n")
    outlier_type_counts <- table(overlap_df$outlier_status)
    print(kable(outlier_type_counts))
    
    cat("\nTop 5 gene types containing outliers:\n")
    gene_type_summary <- sort(table(overlap_df$gene_type), decreasing = TRUE)
    print(kable(head(gene_type_summary, 5)))
    
  } else {
    cat("WARNING: No overlaps found between outliers and genes!\n")
    overlap_df <- data.frame()
  }
  
} else {
  cat("ERROR: defence_outliers data not available\n")
  overlap_df <- data.frame()
}

## ✓ Created outlier ranges for 1542 windows
## 
## Finding overlaps between outliers and genes...
## Total overlaps found: 2300 
## 
## === OVERLAP STATISTICS ===
## Overlapping outlier-gene pairs: 99 
## Unique genes with outliers: 21 
## Unique transcripts with outliers: 29 
## 
## Overlaps by outlier type:
## 
## 
## |Var1        | Freq|
## |:-----------|----:|
## |95% Outlier |   80|
## |99% Outlier |   19|
## 
## Top 5 gene types containing outliers:
## 
## 
## |               |  x|
## |:--------------|--:|
## |protein_coding | 99|

15.3 Create Gene Summary Table

# --- Step 5: Create gene summary table ---
if (exists("overlap_df") && nrow(overlap_df) > 0) {
  cat("\nCreating gene summary table...\n")
  
  gene_summary <- overlap_df %>%
    group_by(parent_gene) %>%
    summarise(
      transcript_variants = paste(unique(gene_id), collapse = ";"),
      total_outlier_windows = n(),
      outlier_95_windows = sum(outlier_status == "95% Outlier"),
      outlier_99_windows = sum(outlier_status == "99% Outlier"),
      min_fst = min(fst_value),
      mean_fst = mean(fst_value),
      max_fst = max(fst_value),
      gene_start = min(gene_start),
      gene_end = max(gene_end),
      gene_length = max(gene_end) - min(gene_start) + 1,
      gene_type = first(gene_type),
      description = first(gene_description)
    ) %>%
    mutate(
      outlier_density = total_outlier_windows / (gene_length / 1000)  # Outliers per kb
    ) %>%
    arrange(desc(max_fst))
  
  cat("\nSummary of genes with FST outliers:\n")
  cat("Total genes affected:", nrow(gene_summary), "\n")
  cat("Average outlier windows per gene:", round(mean(gene_summary$total_outlier_windows), 2), "\n")
  cat("Average FST in affected genes:", round(mean(gene_summary$mean_fst), 4), "\n\n")
  
  cat("Top 10 genes with highest FST outliers:\n")
  print(kable(head(gene_summary[, c("parent_gene", "max_fst", "total_outlier_windows", 
                                    "gene_length", "gene_type", "description")], 10),
        caption = "Top 10 genes containing FST outliers"))
  
  # Genes with both 95% and 99% outliers
  mixed_genes <- gene_summary %>%
    filter(outlier_95_windows > 0 & outlier_99_windows > 0)
  
  cat("\nGenes with both 95% and 99% outliers:", nrow(mixed_genes), "\n")
  
  # Save gene summary
  if (!dir.exists("results")) dir.create("results")
  write_csv(gene_summary, "results/gene_summary_with_outliers.csv")
  cat("✓ Gene summary saved to: results/gene_summary_with_outliers.csv\n")
  
} else {
  cat("No gene summary created - overlap data not available\n")
  gene_summary <- data.frame()
}

## 
## Creating gene summary table...
## 
## Summary of genes with FST outliers:
## Total genes affected: 21 
## Average outlier windows per gene: 4.71 
## Average FST in affected genes: 0.4106 
## 
## Top 10 genes with highest FST outliers:
## 
## 
## Table: Top 10 genes containing FST outliers
## 
## |parent_gene |   max_fst| total_outlier_windows| gene_length|gene_type      |description                                                  |
## |:-----------|---------:|---------------------:|-----------:|:--------------|:------------------------------------------------------------|
## |AT1G80600   | 0.7192394|                     6|        2055|protein_coding |HOPW1-1-interacting 1                                        |
## |AT1G12290   | 0.6852304|                     6|        4235|protein_coding |Disease resistance protein (CC-NBS-LRR class) family         |
## |AT1G19660   | 0.5821976|                     8|        2705|protein_coding |Wound-responsive family protein                              |
## |AT1G03475   | 0.5306411|                     2|        2161|protein_coding |Coproporphyrinogen III oxidase                               |
## |AT1G08860   | 0.5276831|                    10|        3986|protein_coding |Calcium-dependent phospholipid-binding Copine family protein |
## |AT1G09770   | 0.5260257|                     2|        3520|protein_coding |cell division cycle 5                                        |
## |AT1G11000   | 0.5023773|                     2|        4224|protein_coding |Seven transmembrane MLO family protein                       |
## |AT1G76954   | 0.4746309|                     2|         672|protein_coding |Defensin-like (DEFL) family protein                          |
## |AT1G80680   | 0.4746083|                     4|        4762|protein_coding |SUPPRESSOR OF AUXIN RESISTANCE 3                             |
## |AT1G10170   | 0.4141339|                     5|        3897|protein_coding |NF-X-like 1                                                  |
## 
## Genes with both 95% and 99% outliers: 5 
## ✓ Gene summary saved to: results/gene_summary_with_outliers.csv

15.4 Visualize Genes with Outliers

cat("\n=== VISUALISATION OF GENES WITH OUTLIERS ===\n")

## 
## === VISUALISATION OF GENES WITH OUTLIERS ===

if (exists("overlap_df") && nrow(overlap_df) > 0 && exists("gene_summary") && nrow(gene_summary) > 0) {
  
  # Prepare data for all FST values (for background)
  all_fst_data <- At_data_clean %>%
    filter(ESP_SWE_fst > 0) %>%
    select(mid, ESP_SWE_fst, DF)
  
  # Get threshold values
  if (!exists("def_threshold_95")) {
    def_threshold_95 <- quantile(defence_windows$ESP_SWE_fst, 0.95, na.rm = TRUE)
  }
  if (!exists("def_threshold_99")) {
    def_threshold_99 <- quantile(defence_windows$ESP_SWE_fst, 0.99, na.rm = TRUE)
  }
  
  # --- Plot 1: Enhanced FST plot with gene annotations ---
  cat("Generating enhanced FST plot with gene annotations...\n")
  
  # Prepare gene data for plotting (first 30 genes for clarity)
  gene_plot_data <- filtered_annotation %>%
    filter(seqname == "Chr1") %>%
    mutate(gene_mid = (start + end) / 2) %>%
    arrange(start) %>%
    head(30)  # Limit to first 30 genes to avoid overcrowding
  
  # Filter overlap_df to include only genes in plot_data
  plot_overlaps <- overlap_df %>%
    filter(parent_gene %in% gene_plot_data$parent_gene)
  
  # Enhanced FST plot with gene tracks
  enhanced_fst_plot <- ggplot() +
    # Background: all FST windows (light gray)
    geom_point(data = all_fst_data,
               aes(x = mid/1e6, y = ESP_SWE_fst),
               color = "gray85", size = 0.3, alpha = 0.2) +
    
    # Defence gene FST points (blue)
    geom_point(data = defence_windows,
               aes(x = mid/1e6, y = ESP_SWE_fst),
               color = "steelblue", size = 0.5, alpha = 0.5) +
    
    # Outlier points (colored by status)
    geom_point(data = plot_overlaps,
               aes(x = outlier_mid/1e6, y = fst_value, color = outlier_status),
               size = 1.8, alpha = 0.8) +
    
    # Gene tracks (below FST plot)
    geom_segment(data = gene_plot_data,
                 aes(x = start/1e6, xend = end/1e6, y = -0.03, yend = -0.03),
                 size = 4, color = "darkgreen", alpha = 0.6) +
    
    # Gene labels (only for genes with outliers)
    if (nrow(plot_overlaps) > 0) {
      gene_labels <- plot_overlaps %>%
        group_by(parent_gene) %>%
        summarise(mid_pos = mean(outlier_mid/1e6),
                  max_fst = max(fst_value)) %>%
        arrange(desc(max_fst)) %>%
        head(8)  # Top 8 genes by FST
      
      enhanced_fst_plot <- enhanced_fst_plot +
        geom_text(data = gene_labels,
                  aes(x = mid_pos, y = -0.05, label = parent_gene),
                  size = 3, angle = 45, hjust = 1, vjust = 1,
                  check_overlap = TRUE, color = "darkblue")
    }
    
    # Threshold lines
    enhanced_fst_plot <- enhanced_fst_plot +
      geom_hline(yintercept = def_threshold_95, color = "orange",
                 linetype = "dashed", size = 0.7, alpha = 0.8) +
      geom_hline(yintercept = def_threshold_99, color = "red",
                 linetype = "dashed", size = 0.7, alpha = 0.8) +
      
      # Mean line
      geom_hline(yintercept = mean(defence_windows$ESP_SWE_fst, na.rm = TRUE),
                 color = "blue", linetype = "dashed", size = 0.8) +
      
      # Scales and labels
      scale_color_manual(values = c("95% Outlier" = "orange",
                                    "99% Outlier" = "red")) +
      scale_y_continuous(limits = c(-0.06, max(all_fst_data$ESP_SWE_fst, na.rm = TRUE) * 1.1)) +
      
      labs(x = "Position on Chromosome 1 (Mb)",
           y = expression(italic(F)[ST]),
           title = "FST Outliers in Defence Genes with Gene Annotations",
           subtitle = paste(nrow(gene_summary), "genes contain", 
                           nrow(overlap_df), "outlier windows"),
           caption = "Green bars = gene positions | Blue points = defence genes | Orange/red = outliers") +
      
      theme_minimal() +
      theme(legend.position = "bottom",
            legend.title = element_blank(),
            plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
            plot.subtitle = element_text(size = 10, hjust = 0.5))
  
  print(enhanced_fst_plot)
  
  # --- Plot 2: Distribution of outlier counts per gene ---
  p1 <- ggplot(gene_summary, aes(x = total_outlier_windows)) +
    geom_histogram(binwidth = 1, fill = "steelblue", alpha = 0.7, color = "black") +
    labs(x = "Number of Outlier Windows per Gene",
         y = "Number of Genes",
         title = "Distribution of Outlier Windows Across Genes",
         subtitle = paste("Mean =", round(mean(gene_summary$total_outlier_windows), 2), 
                         "outliers per gene")) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))
  
  print(p1)
  
  # --- Plot 3: Top genes with most outliers ---
  top_n <- min(15, nrow(gene_summary))
  p2 <- gene_summary %>%
    arrange(desc(total_outlier_windows)) %>%
    head(top_n) %>%
    mutate(parent_gene = factor(parent_gene, levels = rev(parent_gene))) %>%
    ggplot(aes(x = parent_gene, y = total_outlier_windows, fill = max_fst)) +
    geom_bar(stat = "identity", alpha = 0.8) +
    geom_text(aes(label = total_outlier_windows), 
              hjust = -0.2, size = 3) +
    coord_flip() +
    scale_fill_gradient(low = "orange", high = "red", name = "Max FST") +
    labs(x = "Gene ID", y = "Number of Outlier Windows",
         title = paste("Top", top_n, "Genes with Most Outlier Windows"),
         subtitle = "Color indicates maximum FST value in gene") +
    theme_minimal() +
    theme(axis.text.y = element_text(size = 9),
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))
  
  print(p2)
  
  # --- Plot 4: FST vs gene length ---
  p3 <- ggplot(gene_summary, aes(x = gene_length/1000, y = max_fst)) +
    geom_point(aes(size = total_outlier_windows, color = total_outlier_windows), alpha = 0.7) +
    geom_smooth(method = "lm", se = FALSE, color = "gray40", linetype = "dashed", size = 0.5) +
    scale_color_gradient(low = "blue", high = "red", name = "Outlier count") +
    scale_size_continuous(name = "Outlier count") +
    labs(x = "Gene Length (kb)",
         y = "Maximum FST in Gene",
         title = "Relationship Between Gene Length and FST",
         subtitle = "Size and color indicate number of outlier windows") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))
  
  print(p3)
  
  cat("✓ All visualizations generated successfully\n")
  
} else {
  cat("Skipping visualizations - insufficient overlap data\n")
}

## Generating enhanced FST plot with gene annotations...

## ✓ All visualizations generated successfully

15.5 Summary and Biological Interpretation

cat("\n=== BIOLOGICAL INTERPRETATION ===\n")

## 
## === BIOLOGICAL INTERPRETATION ===

if (exists("gene_summary") && nrow(gene_summary) > 0) {
  
  # 1. Summary statistics
  cat("1. OVERALL SUMMARY:\n")
  cat("   - Total genes analyzed:", length(unique(filtered_annotation$parent_gene)), "\n")
  cat("   - Genes containing FST outliers:", nrow(gene_summary), "\n")
  cat("   - Percentage of genes with outliers:", 
      round(nrow(gene_summary)/length(unique(filtered_annotation$parent_gene))*100, 1), "%\n")
  
  # 2. Top candidate genes
  cat("\n2. TOP CANDIDATE GENES FOR SELECTION:\n")
  top_candidates <- gene_summary %>%
    filter(max_fst > def_threshold_99) %>%
    arrange(desc(max_fst))
  
  if (nrow(top_candidates) > 0) {
    cat("   Genes with FST > 99th percentile:", nrow(top_candidates), "\n")
    cat("   Highest FST:", round(max(top_candidates$max_fst), 4), "\n")
    
    # Show top 5 candidates
    cat("\n   Top 5 candidate genes:\n")
    for (i in 1:min(5, nrow(top_candidates))) {
      cat(sprintf("   %d. %s: FST=%.4f, %d outlier windows, %s\n",
                  i, top_candidates$parent_gene[i], 
                  top_candidates$max_fst[i],
                  top_candidates$total_outlier_windows[i],
                  ifelse(nchar(top_candidates$description[i]) > 50,
                         paste0(substr(top_candidates$description[i], 1, 50), "..."),
                         top_candidates$description[i])))
    }
  }
  
  # 3. Gene functional categories
  cat("\n3. GENE FUNCTIONAL CATEGORIES WITH OUTLIERS:\n")
  if ("gene_type" %in% colnames(gene_summary)) {
    gene_type_summary <- gene_summary %>%
      group_by(gene_type) %>%
      summarise(
        n_genes = n(),
        avg_fst = mean(max_fst),
        avg_outliers = mean(total_outlier_windows)
      ) %>%
      arrange(desc(n_genes))
    
    print(kable(gene_type_summary, 
                caption = "Functional categories of genes containing FST outliers"))
  }
  
  # 4. Recommendations for follow-up
  cat("\n4. RECOMMENDATIONS FOR FOLLOW-UP:\n")
  cat("   - Validate top candidates with independent methods (e.g., GWAS, expression QTL)\n")
  cat("   - Check if outlier genes are enriched for specific pathways (GO enrichment)\n")
  cat("   - Investigate population-specific patterns in candidate genes\n")
  cat("   - Compare with known defence-related gene families\n")
  cat("   - Consider functional validation (knockouts, overexpression)\n")
  
} else {
  cat("No gene overlap data available for interpretation\n")
}

## 1. OVERALL SUMMARY:
##    - Total genes analyzed: 22359 
##    - Genes containing FST outliers: 21 
##    - Percentage of genes with outliers: 0.1 %
## 
## 2. TOP CANDIDATE GENES FOR SELECTION:
##    Genes with FST > 99th percentile: 7 
##    Highest FST: 0.7192 
## 
##    Top 5 candidate genes:
##    1. AT1G80600: FST=0.7192, 6 outlier windows, HOPW1-1-interacting 1
##    2. AT1G12290: FST=0.6852, 6 outlier windows, Disease resistance protein (CC-NBS-LRR class) fami...
##    3. AT1G19660: FST=0.5822, 8 outlier windows, Wound-responsive family protein
##    4. AT1G03475: FST=0.5306, 2 outlier windows, Coproporphyrinogen III oxidase
##    5. AT1G08860: FST=0.5277, 10 outlier windows, Calcium-dependent phospholipid-binding Copine fami...
## 
## 3. GENE FUNCTIONAL CATEGORIES WITH OUTLIERS:
## 
## 
## Table: Functional categories of genes containing FST outliers
## 
## |gene_type      | n_genes|   avg_fst| avg_outliers|
## |:--------------|-------:|---------:|------------:|
## |protein_coding |      21| 0.4422851|     4.714286|
## 
## 4. RECOMMENDATIONS FOR FOLLOW-UP:
##    - Validate top candidates with independent methods (e.g., GWAS, expression QTL)
##    - Check if outlier genes are enriched for specific pathways (GO enrichment)
##    - Investigate population-specific patterns in candidate genes
##    - Compare with known defence-related gene families
##    - Consider functional validation (knockouts, overexpression)

15.6 Save Complete Gene Analysis Results

cat("\n=== SAVING GENE ANALYSIS RESULTS ===\n")

## 
## === SAVING GENE ANALYSIS RESULTS ===

# Create directory for gene analysis results
gene_results_dir <- "gene_analysis_results"
if (!dir.exists(gene_results_dir)) {
  dir.create(gene_results_dir)
  cat("Created directory:", gene_results_dir, "\n")
}

## Created directory: gene_analysis_results

# Save all relevant data
if (exists("overlap_df") && nrow(overlap_df) > 0) {
  write_csv(overlap_df, file.path(gene_results_dir, "gene_outlier_overlaps.csv"))
  cat("✓ Saved: gene_outlier_overlaps.csv\n")
}

## ✓ Saved: gene_outlier_overlaps.csv

if (exists("gene_summary") && nrow(gene_summary) > 0) {
  write_csv(gene_summary, file.path(gene_results_dir, "gene_summary_outliers.csv"))
  cat("✓ Saved: gene_summary_outliers.csv\n")
}

## ✓ Saved: gene_summary_outliers.csv

# Save top candidate genes
if (exists("gene_summary") && nrow(gene_summary) > 0) {
  top_candidates <- gene_summary %>%
    filter(max_fst > quantile(gene_summary$max_fst, 0.9)) %>%  # Top 10%
    arrange(desc(max_fst))
  
  write_csv(top_candidates, file.path(gene_results_dir, "top_candidate_genes.csv"))
  cat("✓ Saved: top_candidate_genes.csv\n")
}

## ✓ Saved: top_candidate_genes.csv

# Save a summary report
summary_report <- c(
  paste("Gene Annotation and Overlap Analysis Report"),
  paste("Generated:", Sys.Date()),
  paste(""),
  paste("ANALYSIS PARAMETERS:"),
  paste("Genome annotation file:", gff_path),
  paste("Total genes annotated:", length(unique(filtered_annotation$parent_gene))),
  paste("Defence gene windows analyzed:", nrow(defence_windows)),
  paste(""),
  paste("RESULTS SUMMARY:"),
  if (exists("overlap_df") && nrow(overlap_df) > 0) {
    c(paste("Total outlier-gene overlaps:", nrow(overlap_df)),
      paste("Unique genes with outliers:", length(unique(overlap_df$parent_gene))),
      paste("95% outliers in genes:", sum(grepl("95%", overlap_df$outlier_status))),
      paste("99% outliers in genes:", sum(grepl("99%", overlap_df$outlier_status))))
  } else {
    "No overlaps detected"
  },
  paste(""),
  paste("TOP 5 CANDIDATE GENES:"),
  if (exists("gene_summary") && nrow(gene_summary) > 0) {
    sapply(1:min(5, nrow(gene_summary)), function(i) {
      paste(i, ". ", gene_summary$parent_gene[i], 
            " (FST=", round(gene_summary$max_fst[i], 4),
            ", ", gene_summary$total_outlier_windows[i], " outliers)", sep = "")
    })
  } else {
    "No gene summary available"
  }
)

writeLines(summary_report, file.path(gene_results_dir, "analysis_report.txt"))
cat("✓ Saved: analysis_report.txt\n")

## ✓ Saved: analysis_report.txt

cat("\nAll gene analysis results saved to:", gene_results_dir, "\n")

## 
## All gene analysis results saved to: gene_analysis_results

16. Advanced Visualization Techniques

Purpose

Explore different visualization approaches to better understand subtle patterns in your data, especially when effect sizes are small. This section demonstrates multiple plotting techniques and integrates gene annotation information from previous analyses.

Key Techniques

Alternative π visualizations to highlight subtle differences
Effect size calculations for meaningful interpretation
Gene-integrated FST plots combining outlier and gene annotation data

cat("\n=== ADVANCED VISUALIZATION TECHNIQUES ===\n")

## 
## === ADVANCED VISUALIZATION TECHNIQUES ===

# Load required packages
if (!requireNamespace("ggforce", quietly = TRUE)) {
  install.packages("ggforce")
  library(ggforce)
} else {
  library(ggforce)
}

if (!requireNamespace("effsize", quietly = TRUE)) {
  install.packages("effsize")
  library(effsize)
} else {
  library(effsize)
}

16.1 Enhanced π (Nucleotide Diversity) Visualizations

cat("\n1. Enhanced π (Nucleotide Diversity) Visualizations\n")

## 
## 1. Enhanced π (Nucleotide Diversity) Visualizations

cat("===================================================\n")

## ===================================================

# First, ensure we have the pi_comparison_long data
# This should have been created in the boxplot comparisons section
if (!exists("pi_comparison_long")) {
  cat("Creating pi_comparison_long from available data...\n")
  pi_comparison_long <- At_data_clean %>%
    select(DF, ESP_pi, SWE_pi) %>%
    pivot_longer(
      cols = c(ESP_pi, SWE_pi),
      names_to = "population",
      values_to = "pi_value"
    ) %>%
    mutate(
      population = gsub("_pi", "", population),
      population = factor(population, levels = c("ESP", "SWE"))
    )
}

cat("π data summary:\n")

## π data summary:

cat("Total windows:", nrow(pi_comparison_long), "\n")

## Total windows: 159244

cat("By population:", table(pi_comparison_long$population), "\n")

## By population: 79622 79622

cat("By gene category:", table(pi_comparison_long$DF), "\n")

## By gene category: 156160 3084

# --- Plot 1: Boxplot with mean points and labels ---
cat("\nGenerating Plot 1: Boxplot with mean annotations...\n")

## 
## Generating Plot 1: Boxplot with mean annotations...

pi_boxplot_linear <- ggplot(pi_comparison_long, aes(x = DF, y = pi_value, fill = DF)) +
  geom_boxplot(color = "black", outlier.shape = 21, outlier.size = 1.5) +
  facet_wrap(~ population, ncol = 2) +
  theme_minimal() +
  scale_fill_manual(values = c("BG" = "#66c2a5", "DF" = "#fc8d62"),
                    name = "Gene Category") +
  labs(
    title = "Nucleotide Diversity (π): Defence vs Background Genes",
    subtitle = "Mean values shown with white diamonds",
    x = NULL,
    y = expression(paste(pi, " (nucleotide diversity)"))
  ) +
  # Add mean points
  stat_summary(fun = mean, geom = "point", shape = 23, size = 4, 
               fill = "white", color = "black") +
  # Add mean value labels
  stat_summary(
    fun = mean, 
    geom = "text", 
    aes(label = sprintf("%.4f", ..y..)),
    vjust = -2,
    size = 3.5,
    fontface = "bold"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 11),
    legend.position = "bottom",
    strip.text = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 11)
  )

print(pi_boxplot_linear)

# --- Plot 2: Violin plots with boxplot overlay ---
cat("\nGenerating Plot 2: Violin plots showing full distribution...\n")

## 
## Generating Plot 2: Violin plots showing full distribution...

# Calculate statistics for annotation
pi_stats <- pi_comparison_long %>%
  group_by(population, DF) %>%
  summarise(
    mean_pi = mean(pi_value),
    median_pi = median(pi_value),
    sd_pi = sd(pi_value),
    n = n()
  )

pi_violin <- ggplot(pi_comparison_long, aes(x = DF, y = pi_value, fill = DF)) +
  # Violin shows distribution shape
  geom_violin(alpha = 0.7, trim = FALSE, scale = "width", width = 0.8) +
  # Boxplot shows quartiles
  geom_boxplot(width = 0.15, fill = "white", alpha = 0.9, 
               outlier.shape = NA, color = "black", size = 0.5) +
  # Mean as red diamond
  stat_summary(fun = mean, geom = "point", shape = 18, 
               size = 3.5, color = "red", fill = "red") +
  facet_wrap(~ population, ncol = 2) +
  theme_minimal() +
  scale_fill_manual(values = c("BG" = "#66c2a5", "DF" = "#fc8d62"),
                    name = "Gene Category") +
  labs(
    title = "Nucleotide Diversity Distribution by Gene Category",
    subtitle = "Violin: full distribution | Box: quartiles | Red diamond: mean",
    x = NULL,
    y = expression(pi)
  ) +
  # Add sample size annotations
  geom_text(data = pi_stats,
            aes(x = DF, y = max(pi_comparison_long$pi_value) * 1.05,
                label = paste("n =", n)),
            size = 3.5, vjust = 0) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 11),
    legend.position = "bottom",
    strip.text = element_text(size = 12, face = "bold")
  )

print(pi_violin)

# --- Plot 3: Density overlay plot ---
cat("\nGenerating Plot 3: Density overlay comparison...\n")

## 
## Generating Plot 3: Density overlay comparison...

pi_density <- ggplot(pi_comparison_long, aes(x = pi_value, fill = DF, color = DF)) +
  geom_density(alpha = 0.4, size = 0.8) +
  facet_wrap(~ population, ncol = 2, scales = "free_y") +
  theme_minimal() +
  scale_fill_manual(values = c("BG" = "#66c2a5", "DF" = "#fc8d62"),
                    name = "Gene Category") +
  scale_color_manual(values = c("BG" = "#1b9e77", "DF" = "#d95f02"),
                     name = "Gene Category") +
  labs(
    title = "Density Distribution of π Values",
    subtitle = "Direct comparison of distribution shapes",
    x = expression(pi),
    y = "Density"
  ) +
  # Add vertical lines for means
  geom_vline(data = pi_stats,
             aes(xintercept = mean_pi, color = DF),
             linetype = "dashed", size = 0.8, alpha = 0.7) +
  # Annotate means
  geom_text(data = pi_stats,
            aes(x = mean_pi, y = Inf, 
                label = sprintf("Mean = %.4f", mean_pi),
                color = DF),
            vjust = 2, hjust = -0.1, size = 3.5, fontface = "bold") +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 11),
    legend.position = "bottom",
    strip.text = element_text(size = 12, face = "bold")
  )

print(pi_density)

cat("\n✓ All π visualizations generated successfully\n")

## 
## ✓ All π visualizations generated successfully

16.2 Effect Size Calculations and Interpretation

cat("\n2. Effect Size Calculations\n")

## 
## 2. Effect Size Calculations

cat("===========================\n")

## ===========================

# Calculate effect sizes (Cohen's d) for the differences
cat("Calculating effect sizes (Cohen's d) for all comparisons:\n")

## Calculating effect sizes (Cohen's d) for all comparisons:

effect_sizes <- data.frame(
  Statistic = character(),
  Population = character(),
  Cohens_d = numeric(),
  Interpretation = character(),
  stringsAsFactors = FALSE
)

# Function to calculate and interpret Cohen's d
calculate_effect_size <- function(bg_values, df_values, stat_name, pop_name) {
  if (length(bg_values) > 1 && length(df_values) > 1) {
    result <- tryCatch({
      d <- cohen.d(bg_values, df_values)
      list(
        d = d$estimate,
        conf_int = d$conf.int
      )
    }, error = function(e) {
      # Simple calculation if cohen.d fails
      mean_diff <- mean(df_values) - mean(bg_values)
      pooled_sd <- sqrt((sd(bg_values)^2 + sd(df_values)^2)/2)
      list(
        d = mean_diff/pooled_sd,
        conf_int = c(NA, NA)
      )
    })
    
    # Interpret magnitude
    abs_d <- abs(result$d)
    if (abs_d < 0.2) interpretation <- "Negligible"
    else if (abs_d < 0.5) interpretation <- "Small"
    else if (abs_d < 0.8) interpretation <- "Medium"
    else interpretation <- "Large"
    
    return(data.frame(
      Statistic = stat_name,
      Population = pop_name,
      Cohens_d = round(result$d, 3),
      CI_lower = ifelse(is.null(result$conf_int[1]), NA, round(result$conf_int[1], 3)),
      CI_upper = ifelse(is.null(result$conf_int[2]), NA, round(result$conf_int[2], 3)),
      Interpretation = interpretation,
      stringsAsFactors = FALSE
    ))
  }
  return(NULL)
}

# Calculate for π
esp_pi_bg <- At_data_clean$ESP_pi[At_data_clean$DF == "BG"]
esp_pi_df <- At_data_clean$ESP_pi[At_data_clean$DF == "DF"]
swe_pi_bg <- At_data_clean$SWE_pi[At_data_clean$DF == "BG"]
swe_pi_df <- At_data_clean$SWE_pi[At_data_clean$DF == "DF"]

effect_sizes <- rbind(effect_sizes,
  calculate_effect_size(esp_pi_bg, esp_pi_df, "π", "ESP"),
  calculate_effect_size(swe_pi_bg, swe_pi_df, "π", "SWE")
)

# Calculate for Tajima's D
esp_td_bg <- At_data_clean$ESP_td[At_data_clean$DF == "BG"]
esp_td_df <- At_data_clean$ESP_td[At_data_clean$DF == "DF"]
swe_td_bg <- At_data_clean$SWE_td[At_data_clean$DF == "BG"]
swe_td_df <- At_data_clean$SWE_td[At_data_clean$DF == "DF"]

effect_sizes <- rbind(effect_sizes,
  calculate_effect_size(esp_td_bg, esp_td_df, "Tajima's D", "ESP"),
  calculate_effect_size(swe_td_bg, swe_td_df, "Tajima's D", "SWE")
)

# Calculate for FST (only one FST value per window)
fst_bg <- At_data_clean$ESP_SWE_fst[At_data_clean$DF == "BG"]
fst_df <- At_data_clean$ESP_SWE_fst[At_data_clean$DF == "DF"]

effect_sizes <- rbind(effect_sizes,
  calculate_effect_size(fst_bg, fst_df, "FST", "ESP-SWE")
)

# Print effect size table
cat("\nEffect Sizes (Cohen's d) for Defence vs Background Comparisons:\n")

## 
## Effect Sizes (Cohen's d) for Defence vs Background Comparisons:

print(kable(effect_sizes, 
      caption = "Effect sizes with interpretation: |d| < 0.2 = negligible, 0.2-0.5 = small, 0.5-0.8 = medium, >0.8 = large"))

## 
## 
## Table: Effect sizes with interpretation: |d| < 0.2 = negligible, 0.2-0.5 = small, 0.5-0.8 = medium, >0.8 = large
## 
## |Statistic  |Population | Cohens_d| CI_lower| CI_upper|Interpretation |
## |:----------|:----------|--------:|--------:|--------:|:--------------|
## |π          |ESP        |    0.072|    0.022|    0.123|Negligible     |
## |π          |SWE        |    0.055|    0.005|    0.106|Negligible     |
## |Tajima's D |ESP        |    0.004|   -0.046|    0.055|Negligible     |
## |Tajima's D |SWE        |    0.031|   -0.019|    0.082|Negligible     |
## |FST        |ESP-SWE    |   -0.018|   -0.068|    0.032|Negligible     |

# Summary interpretation
cat("\nInterpretation Summary:\n")

## 
## Interpretation Summary:

for (i in 1:nrow(effect_sizes)) {
  row <- effect_sizes[i, ]
  direction <- ifelse(row$Cohens_d > 0, "higher", "lower")
  cat(sprintf("- %s in %s: %s effect (d = %.2f), %s in defence genes\n",
              row$Statistic, row$Population, row$Interpretation,
              row$Cohens_d, direction))
}

## - π in ESP: Negligible effect (d = 0.07), higher in defence genes
## - π in SWE: Negligible effect (d = 0.06), higher in defence genes
## - Tajima's D in ESP: Negligible effect (d = 0.00), higher in defence genes
## - Tajima's D in SWE: Negligible effect (d = 0.03), higher in defence genes
## - FST in ESP-SWE: Negligible effect (d = -0.02), lower in defence genes

# Save effect sizes
if (!dir.exists("results")) dir.create("results")
write_csv(effect_sizes, "results/effect_sizes.csv")
cat("\n✓ Effect sizes saved to: results/effect_sizes.csv\n")

## 
## ✓ Effect sizes saved to: results/effect_sizes.csv

16.3 Gene-Integrated Visualization (Combining Previous Results)

cat("\n3. Gene-Integrated Visualizations\n")

## 
## 3. Gene-Integrated Visualizations

cat("==================================\n")

## ==================================

# Check if we have the necessary data from previous sections
required_data <- c("overlap_df", "gene_summary", "filtered_annotation", 
                   "defence_windows", "At_data_clean")
available_data <- required_data[required_data %in% ls()]

cat("Available data from previous sections:\n")

## Available data from previous sections:

cat(paste("  -", available_data, collapse = "\n"), "\n")

##   - overlap_df
##   - gene_summary
##   - filtered_annotation
##   - defence_windows
##   - At_data_clean

if (all(c("overlap_df", "gene_summary", "filtered_annotation") %in% available_data) &&
    nrow(overlap_df) > 0 && nrow(gene_summary) > 0) {
  
  # --- Step 1: Enhanced FST plot with gene annotations ---
  cat("\nGenerating enhanced FST plot with gene annotations...\n")
  
  # Prepare data for plotting
  # Get all positive FST values for background
  positive_fst <- At_data_clean %>%
    filter(ESP_SWE_fst > 0) %>%
    select(mid, ESP_SWE_fst, DF)
  
  # Prepare gene data for plotting (first 40 genes for clarity)
  gene_plot_data <- filtered_annotation %>%
    filter(seqname == "Chr1") %>%
    mutate(gene_mid = (start + end) / 2) %>%
    arrange(start) %>%
    head(40)  # Limit to first 40 genes to avoid overcrowding
  
  # Get threshold values (from previous section)
  if (!exists("def_threshold_95")) {
    def_threshold_95 <- quantile(defence_windows$ESP_SWE_fst, 0.95, na.rm = TRUE)
  }
  if (!exists("def_threshold_99")) {
    def_threshold_99 <- quantile(defence_windows$ESP_SWE_fst, 0.99, na.rm = TRUE)
  }
  
  # --- FIXED CODE: Build the plot step by step ---
  # Start with the base plot
  enhanced_fst_plot <- ggplot()
  
  # Add background points
  enhanced_fst_plot <- enhanced_fst_plot +
    geom_point(data = positive_fst,
               aes(x = mid/1e6, y = ESP_SWE_fst),
               color = "gray90", size = 0.3, alpha = 0.2)
  
  # Add defence gene FST points
  enhanced_fst_plot <- enhanced_fst_plot +
    geom_point(data = defence_windows,
               aes(x = mid/1e6, y = ESP_SWE_fst),
               color = "steelblue", size = 0.8, alpha = 0.5)
  
  # Add outlier points
  enhanced_fst_plot <- enhanced_fst_plot +
    geom_point(data = overlap_df,
               aes(x = outlier_mid/1e6, y = fst_value, 
                   color = outlier_status, shape = outlier_status),
               size = 2, alpha = 0.9)
  
  # Add gene tracks (below FST plot)
  enhanced_fst_plot <- enhanced_fst_plot +
    geom_segment(data = gene_plot_data,
                 aes(x = start/1e6, xend = end/1e6, y = -0.04, yend = -0.04),
                 size = 3.5, color = "#2ca25f", alpha = 0.7)
  
  # Add gene labels for top 8 genes with outliers
  top_genes <- overlap_df %>%
    group_by(parent_gene) %>%
    summarise(
      mid_pos = mean((gene_start + gene_end)/2/1e6),
      max_fst = max(fst_value),
      gene_type = first(gene_type)
    ) %>%
    arrange(desc(max_fst)) %>%
    head(8)
  
  if (nrow(top_genes) > 0) {
    # Add gene name labels
    enhanced_fst_plot <- enhanced_fst_plot +
      geom_text(data = top_genes,
                aes(x = mid_pos, y = -0.06, label = parent_gene),
                size = 3, angle = 45, hjust = 1, vjust = 1,
                check_overlap = TRUE, color = "#084081", fontface = "italic")
    
    # Add gene type labels
    enhanced_fst_plot <- enhanced_fst_plot +
      geom_text(data = top_genes,
                aes(x = mid_pos, y = -0.075, label = gene_type),
                size = 2.5, angle = 0, hjust = 0.5, vjust = 1,
                color = "#0868ac", alpha = 0.8)
  }
  
  # Add threshold lines
  enhanced_fst_plot <- enhanced_fst_plot +
    geom_hline(yintercept = def_threshold_95, color = "#ff7f00",
               linetype = "dashed", size = 0.8, alpha = 0.8) +
    geom_hline(yintercept = def_threshold_99, color = "#e31a1c",
               linetype = "dashed", size = 0.8, alpha = 0.8) +
    
    # Add mean FST line
    geom_hline(yintercept = mean(defence_windows$ESP_SWE_fst, na.rm = TRUE),
               color = "#6a3d9a", linetype = "dashed", size = 0.8)
  
  # Add scales and labels
  enhanced_fst_plot <- enhanced_fst_plot +
    scale_color_manual(values = c("95% Outlier" = "#ff7f00",
                                  "99% Outlier" = "#e31a1c")) +
    scale_shape_manual(values = c("95% Outlier" = 17,
                                  "99% Outlier" = 15)) +
    scale_y_continuous(
      limits = c(-0.08, max(positive_fst$ESP_SWE_fst, na.rm = TRUE) * 1.1),
      sec.axis = sec_axis(~., name = "Gene Track", 
                         breaks = c(-0.04),
                         labels = c("Genes"))
    ) +
    labs(x = "Position on Chromosome 1 (Mb)",
         y = expression(bolditalic(F)[ST]),
         title = "Integrated View: FST Outliers and Gene Annotations",
         subtitle = paste(nrow(gene_summary), "genes contain", 
                         nrow(overlap_df), "outlier windows |",
                         ifelse(nrow(top_genes) > 0, 
                                paste(nrow(top_genes), "top genes labeled"),
                                "No genes labeled")),
         color = "Outlier Status",
         shape = "Outlier Status") +
    theme_minimal() +
    theme(
      legend.position = "bottom",
      legend.box = "horizontal",
      plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
      plot.subtitle = element_text(hjust = 0.5, size = 12),
      axis.title.y.right = element_text(color = "#2ca25f", face = "bold"),
      axis.text.y.right = element_text(color = "#2ca25f")
    ) +
    guides(color = guide_legend(nrow = 1), shape = guide_legend(nrow = 1))
  
  print(enhanced_fst_plot)
  
  # --- Additional gene-focused visualizations ---
  cat("\nGenerating additional gene-focused visualizations...\n")
  
  # Plot 1: Distribution of outlier counts per gene
  p1 <- ggplot(gene_summary, aes(x = total_outlier_windows)) +
    geom_histogram(binwidth = 1, fill = "#6baed6", alpha = 0.8, 
                   color = "black", boundary = 0) +
    geom_vline(xintercept = mean(gene_summary$total_outlier_windows),
               color = "#e31a1c", linetype = "dashed", size = 1) +
    annotate("text", x = mean(gene_summary$total_outlier_windows), 
             y = Inf, label = paste("Mean =", round(mean(gene_summary$total_outlier_windows), 2)),
             vjust = 2, hjust = -0.1, color = "#e31a1c", fontface = "bold") +
    labs(x = "Number of Outlier Windows per Gene",
         y = "Number of Genes",
         title = "Distribution of Outlier Windows Across Genes",
         subtitle = paste("Average of", round(mean(gene_summary$total_outlier_windows), 2),
                         "outliers per gene (red dashed line)")) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"),
          plot.subtitle = element_text(hjust = 0.5))
  
  print(p1)
  
  # Plot 2: Top genes with most outliers
  top_n <- min(15, nrow(gene_summary))
  if (top_n > 0) {
    p2_data <- gene_summary %>%
      arrange(desc(total_outlier_windows)) %>%
      head(top_n) %>%
      mutate(parent_gene = factor(parent_gene, levels = rev(parent_gene)))
    
    p2 <- ggplot(p2_data, aes(x = parent_gene, y = total_outlier_windows, fill = max_fst)) +
      geom_bar(stat = "identity", alpha = 0.9) +
      geom_text(aes(label = paste(total_outlier_windows, "windows")), 
                hjust = -0.1, size = 3.2, fontface = "bold") +
      coord_flip() +
      scale_fill_gradient2(low = "#2c7bb6", mid = "#ffffbf", high = "#d7191c",
                           midpoint = median(gene_summary$max_fst),
                           name = "Maximum FST") +
      labs(x = "Gene ID", y = "Number of Outlier Windows",
           title = paste("Top", top_n, "Genes with Most Outlier Windows"),
           subtitle = "Color indicates maximum FST value in each gene") +
      theme_minimal() +
      theme(
        axis.text.y = element_text(face = "italic", size = 10),
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "bottom"
      ) +
      expand_limits(y = max(p2_data$total_outlier_windows) * 1.2)
    
    print(p2)
  }
  
  # Plot 3: Gene length vs FST with multiple aesthetics
  p3 <- ggplot(gene_summary, aes(x = gene_length/1000, y = max_fst)) +
    geom_point(aes(size = total_outlier_windows, 
                   color = total_outlier_windows,
                   alpha = total_outlier_windows)) +
    geom_smooth(method = "lm", se = TRUE, color = "#636363", 
                fill = "#bdbdbd", alpha = 0.2) +
    scale_color_gradient(low = "#2c7bb6", high = "#d7191c", 
                         name = "Outlier count") +
    scale_size_continuous(range = c(2, 8), name = "Outlier count") +
    scale_alpha_continuous(range = c(0.5, 1), guide = "none") +
    labs(x = "Gene Length (kb)",
         y = "Maximum FST in Gene",
         title = "Relationship Between Gene Length and Maximum FST",
         subtitle = "Size and color indicate number of outlier windows per gene") +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold"),
      plot.subtitle = element_text(hjust = 0.5),
      legend.position = "right"
    )
  
  print(p3)
  
  cat("\n✓ Gene-integrated visualizations generated successfully\n")
  
  # Save the plots
  if (!dir.exists("results/plots")) dir.create("results/plots", recursive = TRUE)
  ggsave("results/plots/integrated_fst_gene_plot.png", enhanced_fst_plot,
         width = 16, height = 10, dpi = 300)
  cat("✓ Main plot saved to: results/plots/integrated_fst_gene_plot.png\n")
  
} else {
  cat("\nWARNING: Insufficient data for gene-integrated visualizations.\n")
  cat("Required data not available: \n")
  missing <- setdiff(c("overlap_df", "gene_summary", "filtered_annotation"), available_data)
  if (length(missing) > 0) {
    cat(paste("  -", missing, collapse = "\n"), "\n")
  } else {
    cat("  Data exists but has 0 rows. Check if overlap_df and gene_summary are empty.\n")
  }
  cat("\nPlease run the gene annotation section (#17) first.\n")
}

## 
## Generating enhanced FST plot with gene annotations...

## 
## Generating additional gene-focused visualizations...

## 
## ✓ Gene-integrated visualizations generated successfully

## ✓ Main plot saved to: results/plots/integrated_fst_gene_plot.png

Interpretation Guide

Key Results to Look For

1. Tajima’s D

Negative values: Excess of low-frequency variants (potential selective sweep or population expansion)
Positive values: Deficiency of low-frequency variants (balancing selection or population contraction)
Comparison: Defence genes often show more negative Tajima’s D due to purifying selection

2. Nucleotide Diversity (π)

Lower π in defence genes: Suggests purifying selection or selective sweeps
Higher π in defence genes: Could indicate balancing selection
Compare between populations: Differential selection pressures

3. FST (Population Differentiation)

High FST: Strong population differentiation (potential local adaptation)
FST > 0.15: Considered high differentiation
Outliers: Genes with FST above 95th/99th percentiles are candidates for selection

4. Statistical Significance

p < 0.05: Significant difference between defence and background
Effect size: Cohen’s d > 0.8 indicates large effect
Consistency: Check if patterns are consistent across both populations

Common Patterns and Interpretations

Pattern	Possible Interpretation
Low π + Negative Tajima’s D	Recent selective sweep
High π + Positive Tajima’s D	Balancing selection
High FST outliers	Local adaptation
Consistent patterns in both populations	Species-wide selection
Population-specific patterns	Local adaptation

Next Steps for Further Analysis

Extend to all chromosomes
Gene Ontology enrichment of outlier defence genes (This will be your next step)
Compare with known selection signatures

Population Genetics Analysis of Arabidopsis thaliana

Practical Session 5 - Genome-wide vs Defence Gene Diversity Analysis

Dr Tahir Ali | AG de Meaux

December 19, 2025

Copy Required Data:

Summary

Analysis Overview

Key Questions

Workflow Summary

Expected Outputs

0. Setup

1. Load and Inspect the VCF

Purpose

Output

2. Define Analysis Regions

Purpose

3. Read VCF into PopGenome

Purpose

Output

4. Define Populations

Purpose

Output

5. Create Sliding Windows

Purpose

Parameters

6. Calculate Window Statistics

Purpose

Statistics Calculated

7. Extract and Combine Statistics

Purpose

Output

8. Genome-wide Visualization

Purpose

Output

9. Annotate Defence-Related Windows

Purpose

Output

10. Statistical Comparisons

Purpose

Tests Performed

11. Distribution Plots

Purpose

Output

12. Boxplot Comparisons

Purpose

Output

13. FST Outlier Analysis - Visualisation

Purpose

Statistical Tests for FST Outliers

Purpose

Tests Performed

14. Summary Statistics

Purpose

Output

Interpretation Guide for Effect Sizes (Cohen’s d)

15. Gene Annotation and Overlap Analysis

Purpose

Input Requirements

15.1 Load and Prepare Gene Annotations

15.2 Create Genomic Ranges and Find Overlaps

15.3 Create Gene Summary Table

15.4 Visualize Genes with Outliers

15.5 Summary and Biological Interpretation

15.6 Save Complete Gene Analysis Results

16. Advanced Visualization Techniques

Purpose

Key Techniques

16.1 Enhanced π (Nucleotide Diversity) Visualizations

16.2 Effect Size Calculations and Interpretation

16.3 Gene-Integrated Visualization (Combining Previous Results)

Interpretation Guide

Key Results to Look For

1. Tajima’s D

2. Nucleotide Diversity (π)

3. FST (Population Differentiation)

4. Statistical Significance

Common Patterns and Interpretations

Next Steps for Further Analysis