Copy the required directory to your home directory (Replace every instance of qbiad019 with your own username.)
cp -r /scratch/qbio/Practical_Session_5 /home/group.kurse/qbiad019
Set up the path to the working directory
# Set the knitr root directory - this PERSISTS
# replace both path to "/home/group.kurse/qbiadxxx/Practical_Session_5
knitr::opts_knit$set(root.dir = "/home/group.kurse/qbiad019/Practical_Session_5")
# Also set DATA_DIR variable for reference
DATA_DIR <- "/home/group.kurse/qbiad019/Practical_Session_5"
cat("Knitr root.dir set to:", knitr::opts_knit$get("root.dir"), "\n")
## Knitr root.dir set to: /home/group.kurse/qbiad019/Practical_Session_5
cat("Files available:", paste(list.files()[1:3], collapse=", "), "\n")
## Files available: Practical_Day_5_files, Practical_Day_5.qmd, vcf
Natural populations evolve under the combined action of genetic drift and natural selection. Distinguishing between these forces is central to understanding adaptive evolution. Genetic drift leads to random allele frequency changes, while selection induces systematic shifts in specific genomic regions associated with fitness. Defence-related genes, which mediate plant responses to pathogens and stress conditions, are strong candidates for experiencing positive or balancing selection.
In this analysis, we compare nucleotide diversity (π), Tajima’s D, and FST between:
Defence-related windows (DF) – genomic regions overlapping annotated defence genes
Background windows (BG) – all other genome regions
The goal is to identify whether defence genes show distinct evolutionary patterns compared to background genomic regions.
Whether defence genes show signatures of selection, such as:
Reduced π (purifying or positive selection)
Elevated π (balancing selection)
Higher or lower Tajima’s D depending on selection mode
Elevated FST between populations (local adaptation)
The following packages are required for handling VCF files, computing population genetic statistics, plotting results, and data wrangling.
# --- 0) Load Required Packages ------------------------------------------------
library(VariantAnnotation)
library(PopGenome)
library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
library(readr)
library(tibble)
library(knitr)
library(conflicted)
# Resolve function conflicts
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::count)
This block loads the full VCF file and determines chromosome lengths, which helps ensure that sliding windows and coordinate comparisons are aligned.
VCF file loaded into memory
Table showing chromosome positions
Confirmation that data is properly loaded
# --- 1) Load and inspect VCF ---------------------------------
# Load the VCF file
vcf_full <- readVcf("vcf/1001genomes_snp-short-indel_only_ACGTN_Dp10GQ20Q30_NoIndel_Bialleleic_80PcMissing_80SwEs.vcf.recode.vcf.gz")
# Get chromosome names from VCF
chromosomes <- seqlevels(vcf_full)
# Calculate start and end positions for each chromosome
chromosome_ranges <- data.frame(CHROM = character(), Start = numeric(), End = numeric(), stringsAsFactors = FALSE)
for (chrom in chromosomes) {
positions <- start(vcf_full[seqnames(vcf_full) == chrom])
chromosome_ranges <- rbind(chromosome_ranges,
data.frame(CHROM = chrom,
Start = min(positions),
End = max(positions)))
}
cat("Chromosome ranges in VCF file:\n")
## Chromosome ranges in VCF file:
print(kable(chromosome_ranges))
##
##
## |CHROM | Start| End|
## |:-----|-----:|--------:|
## |1 | 2784| 30422721|
## |2 | 24180| 19697399|
## |3 | 1705| 23459492|
## |4 | 1552| 18584482|
## |5 | 115| 26974567|
Define chromosome ranges for analysis. The selection of ranges may be aimed at focusing on certain chromosomal regions or excluding telomeres and centromeres.
# --- 2) Define chromosome ranges for analysis -----------------
# These ranges correspond to the positions on the chromosomes of interest
chr1start <- 2784; chr1end <- 30422721
chr2start <- 24180; chr2end <- 19697399
chr3start <- 1705; chr3end <- 23459492
chr4start <- 1552; chr4end <- 18584482
chr5start <- 115; chr5end <- 26974567
cat("Analysis ranges defined for 5 chromosomes\n")
## Analysis ranges defined for 5 chromosomes
The genome is processed chromosome by chromosome. PopGenome requires explicit chromosome and positional ranges. Load chromosome 1 data into PopGenome format for population genetics analyses.
PopGenome object for chromosome 1
Ready for population assignments and statistics calculation
# --- 3) Read VCF into PopGenome for chromosome 1 --------------
# Example: chromosome 1 (use all the variants across the whole genome)
At_full_Chr <- suppressMessages(
readVCF("vcf/1001genomes_snp-short-indel_only_ACGTN_Dp10GQ20Q30_NoIndel_Bialleleic_80PcMissing_80SwEs.vcf.recode.vcf.gz",
numcols = 89,
tid = "1",
frompos = chr1start,
topos = chr1end,
include.unknown = TRUE)
)
## vcff::open : file opened, contains 80 samples
## [1] "Available ContigIdentifiers (parameter tid):"
## [1] "1" "2" "3" "4" "5"
## | : | : | 100 %
## |===================================================| ;-)
##
##
## Concatenate regions
## | : | : | 100 %
## |===================================================| ;-)
cat("VCF loaded into PopGenome object\n")
## VCF loaded into PopGenome object
cat("Number of sites:", length(At_full_Chr@region.names), "\n")
## Number of sites: 1
Assign samples to populations (ESP and SWE) based on metadata. Enables calculation of population-specific π, Tajima’s D, and pairwise FST.
Populations defined in PopGenome object
Ready for population-specific calculations
# --- 4) Define populations ------------------------------------
# Read population metadata
population_info <- read_delim("vcf/pop1.txt", delim = "\t", show_col_types = FALSE)
# Create list of populations
populations <- split(population_info$sample, population_info$pop)
# Assign populations to PopGenome object
At_full_Chr <- set.populations(At_full_Chr, populations, diploid = TRUE)
## | : | : | 100 %
## |====================================================
cat("Populations defined:\n")
## Populations defined:
print(names(populations))
## [1] "ESP" "SWE"
cat("Sample counts:", sapply(populations, length), "\n")
## Sample counts: 40 40
Define sliding windows along chromosome 1 for local statistics calculation.
Sliding windows allow summarising statistics across genomes rather than per SNP.
Enables statistically robust summaries and reduces noise.
Window size: 100 bp
Step size: 50 bp (50% overlap)
# --- 5) Define sliding windows --------------------------------
window_size <- 100
window_jump <- 50
strt <- chr1start
end <- chr1end
# Generate sliding window positions
window_start <- seq(from = strt, to = end, by = window_jump)
window_stop <- window_start + window_size
# Remove windows exceeding chromosome length
window_start <- window_start[window_stop < end]
window_stop <- window_stop[window_stop < end]
# Create windows data frame
windows <- data.frame(
start = window_start,
stop = window_stop,
mid = window_start + (window_stop - window_start) / 2,
chr = "1"
)
cat("Sliding windows created:\n")
## Sliding windows created:
cat("Number of windows:", nrow(windows), "\n")
## Number of windows: 608397
cat("Window size:", window_size, "bp\n")
## Window size: 100 bp
cat("Step size:", window_jump, "bp\n")
## Step size: 50 bp
Calculate key population genetics statistics (π, Tajima’s D, and FST) for each sliding window.
Nucleotide diversity (π)
Tajima’s D (neutrality test)
FST (population differentiation)
# --- 6) Calculate sliding window statistics --------------------
# Transform to sliding windows
At_full_sw <- suppressMessages(
sliding.window.transform(At_full_Chr, width = window_size, jump = window_jump, type = 2)
)
## | : | : | 100 %
## |===================================================
# Calculate diversity statistics
At_full_sw <- suppressMessages(diversity.stats(At_full_sw, pi = TRUE))
## | : | : | 100 %
## |===================================================
# Calculate neutrality statistics (includes Tajima's D)
At_full_sw <- suppressMessages(neutrality.stats(At_full_sw))
## | : | : | 100 %
## |===================================================
# Calculate FST statistics
At_full_sw <- suppressMessages(F_ST.stats(At_full_sw, mode = "nucleotide"))
## | : | : | 100 %
## |===================================================
cat("Statistics calculated for each window:\n")
## Statistics calculated for each window:
cat("- Nucleotide diversity (π)\n")
## - Nucleotide diversity (π)
cat("- Tajima's D\n")
## - Tajima's D
cat("- FST\n")
## - FST
PopGenome objects store outputs in nested data structures; this block extracts usable tables and combines them into a tidy data frame.
Dataframe with all statistics for each window
Cleaned data ready for analysis
# --- 7) Extract statistics from PopGenome object --------------
# Extract Tajima's D and adjust for scale
td_full <- At_full_sw@Tajima.D / window_size
colnames(td_full) <- paste0(names(populations), "_td")
# Extract nucleotide diversity (pi)
nd_full <- At_full_sw@nuc.diversity.within / window_size
colnames(nd_full) <- paste0(names(populations), "_pi")
# Extract FST values
fst_full <- t(At_full_sw@nuc.F_ST.pairwise)
# Clean FST column names
x <- colnames(fst_full)
for (i in seq_along(populations)) {
x <- sub(paste0("pop", i), names(populations)[i], x)
}
colnames(fst_full) <- paste0(sub("/", "_", x), "_fst")
# --- 8) Combine all statistics into one data frame ------------
At_data_full <- as_tibble(data.frame(windows, nd_full, fst_full, td_full))
# Filter out rows with zeros or NAs
At_data_clean <- At_data_full %>%
filter(!if_any(everything(), ~ . == 0 | is.na(.)))
cat("Data extracted and combined:\n")
## Data extracted and combined:
cat("Total windows:", nrow(At_data_full), "\n")
## Total windows: 608397
cat("Clean windows (no zeros/NAs):", nrow(At_data_clean), "\n")
## Clean windows (no zeros/NAs): 79622
cat("\nFirst 5 rows of clean data:\n")
##
## First 5 rows of clean data:
print(kable(head(At_data_clean, 5)))
##
##
## | start| stop| mid|chr | ESP_pi| SWE_pi| ESP_SWE_fst| ESP_td| SWE_td|
## |-----:|----:|----:|:---|---------:|---------:|-----------:|----------:|----------:|
## | 4584| 4684| 4634|1 | 0.0044428| 0.0055719| 0.1765667| 0.0141849| 0.0059346|
## | 4634| 4734| 4684|1 | 0.0044428| 0.0055719| 0.1765667| 0.0141849| 0.0059346|
## | 5984| 6084| 6034|1 | 0.0061311| 0.0085248| 0.1253754| 0.0080708| 0.0177694|
## | 6034| 6134| 6084|1 | 0.0061311| 0.0085248| 0.1253754| 0.0080708| 0.0177694|
## | 6384| 6484| 6434|1 | 0.0010932| 0.0014051| 0.0547234| -0.0058576| -0.0036847|
Visualize statistics along chromosome 1 to identify patterns and hotspots.
Multi-panel plot showing π, Tajima’s D, and FST across the chromosome
Identification of regions with unusual values
# --- 9) Create facet plot of genome-wide patterns -------------
# Select relevant columns for plotting
hs_full <- At_data_full %>%
select(mid, ESP_pi, SWE_pi, ESP_td, SWE_td, ESP_SWE_fst) %>%
mutate(ESP_SWE_fst = ifelse(ESP_SWE_fst < 0, 0, ESP_SWE_fst))
# Reshape for faceting
hs_g_full <- gather(hs_full, -mid, key = "stat", value = "value")
# Create facet plot
a_full <- ggplot(hs_g_full, aes(mid/10^6, value, colour = stat)) +
geom_line() +
facet_grid(stat ~ ., scales = "free_y") +
xlab("Position (Mb)") +
theme_light() +
theme(legend.position = "none") +
ggtitle("Genome-wide Statistics Along Chromosome 1")
print(a_full)
## Warning: Removed 262 rows containing missing values or values outside the scale range
## (`geom_line()`).
Interpretation guideline:
Peaks in FST suggest local adaptation.
Depressions in π may indicate selective sweeps.
Strongly negative Tajima’s D indicates excess rare variants (e.g., recent selection).
Compare statistics between defence and background windows using statistical tests.
Kolmogorov-Smirnov tests (distribution comparisons)
Wilcoxon rank-sum tests (median comparisons)
# --- 11-12) Perform statistical comparisons ----------------------
# Create subsets for testing
sub1_pi_esp <- At_data_clean$ESP_pi[At_data_clean$DF == "DF"]
sub2_pi_esp <- At_data_clean$ESP_pi[At_data_clean$DF == "BG"]
sub1_td_esp <- At_data_clean$ESP_td[At_data_clean$DF == "DF"]
sub2_td_esp <- At_data_clean$ESP_td[At_data_clean$DF == "BG"]
sub1_fst <- At_data_clean$ESP_SWE_fst[At_data_clean$DF == "DF"]
sub2_fst <- At_data_clean$ESP_SWE_fst[At_data_clean$DF == "BG"]
# Kolmogorov-Smirnov tests
cat("Kolmogorov-Smirnov Tests (Defence vs Background):\n")
## Kolmogorov-Smirnov Tests (Defence vs Background):
cat("=================================================\n")
## =================================================
ks_pi_esp <- ks.test(sub1_pi_esp, sub2_pi_esp)
## Warning in ks.test.default(sub1_pi_esp, sub2_pi_esp): p-value will be
## approximate in the presence of ties
cat("\nESP π distribution comparison:\n")
##
## ESP π distribution comparison:
cat("D =", round(ks_pi_esp$statistic, 4), "\n")
## D = 0.0295
cat("p-value =", format.pval(ks_pi_esp$p.value, digits = 4), "\n")
## p-value = 0.145
ks_td_esp <- ks.test(sub1_td_esp, sub2_td_esp)
## Warning in ks.test.default(sub1_td_esp, sub2_td_esp): p-value will be
## approximate in the presence of ties
cat("\nESP Tajima's D distribution comparison:\n")
##
## ESP Tajima's D distribution comparison:
cat("D =", round(ks_td_esp$statistic, 4), "\n")
## D = 0.0232
cat("p-value =", format.pval(ks_td_esp$p.value, digits = 4), "\n")
## p-value = 0.3874
ks_fst <- ks.test(sub1_fst, sub2_fst)
## Warning in ks.test.default(sub1_fst, sub2_fst): p-value will be approximate in
## the presence of ties
cat("\nFST distribution comparison:\n")
##
## FST distribution comparison:
cat("D =", round(ks_fst$statistic, 4), "\n")
## D = 0.0466
cat("p-value =", format.pval(ks_fst$p.value, digits = 4), "\n")
## p-value = 0.002773
Visualize and compare distributions of statistics between defence and background windows.
Multi-panel density plots
KS test p-values and mean differences displayed
Clear visual comparison
# --- Complete Density Plot Code ---
par(mfrow = c(2, 3), mar = c(4, 4, 4, 2))
# Function to add statistics below legend
add_stats_below_legend <- function(ks_p, mean_diff, y_pos_factor = 0.85, x_pos = "right") {
usr <- par("usr")
x_range <- usr[2] - usr[1]
y_range <- usr[4] - usr[3]
if (x_pos == "right") {
x_pos <- usr[2] - 0.1 * x_range
h_adj <- 1
} else {
x_pos <- usr[1] + 0.1 * x_range
h_adj <- 0
}
y_pos <- usr[3] + y_pos_factor * y_range
text(x_pos, y_pos,
paste("KS p =", ks_p, "\nΔmean =", sprintf("%.4f", mean_diff)),
adj = c(h_adj, 0.5),
cex = 0.75,
font = 2,
col = "darkred")
}
# Prepare data for SWE population as well
sub1_pi_swe <- At_data_clean$SWE_pi[At_data_clean$DF == "DF"]
sub2_pi_swe <- At_data_clean$SWE_pi[At_data_clean$DF == "BG"]
sub1_td_swe <- At_data_clean$SWE_td[At_data_clean$DF == "DF"]
sub2_td_swe <- At_data_clean$SWE_td[At_data_clean$DF == "BG"]
# 1. ESP Tajima's D
ks_td_esp <- ks.test(sub1_td_esp, sub2_td_esp)
## Warning in ks.test.default(sub1_td_esp, sub2_td_esp): p-value will be
## approximate in the presence of ties
ks_p_td_esp <- format.pval(ks_td_esp$p.value, digits = 3)
mean_diff_td_esp <- mean(sub1_td_esp) - mean(sub2_td_esp)
if (length(sub2_td_esp) > 1 & length(sub1_td_esp) > 1) {
dens_bg <- density(sub2_td_esp)
dens_df <- density(sub1_td_esp)
plot(dens_bg,
main = "ESP Tajima's D",
xlab = "Tajima's D",
col = "black",
lwd = 2,
xlim = range(c(dens_bg$x, dens_df$x)),
ylim = range(c(dens_bg$y, dens_df$y)) * 1.1)
lines(dens_df, col = "blue", lwd = 2)
legend("topright",
legend = c(paste("BG (n=", length(sub2_td_esp), ")", sep = ""),
paste("DF (n=", length(sub1_td_esp), ")", sep = "")),
col = c("black", "blue"),
lwd = 2,
cex = 0.8,
bty = "n")
add_stats_below_legend(ks_p_td_esp, mean_diff_td_esp, y_pos_factor = 0.45, x_pos = "right")
}
# 2. SWE Tajima's D
ks_td_swe <- ks.test(sub1_td_swe, sub2_td_swe)
## Warning in ks.test.default(sub1_td_swe, sub2_td_swe): p-value will be
## approximate in the presence of ties
ks_p_td_swe <- format.pval(ks_td_swe$p.value, digits = 3)
mean_diff_td_swe <- mean(sub1_td_swe) - mean(sub2_td_swe)
if (length(sub2_td_swe) > 1 & length(sub1_td_swe) > 1) {
dens_bg <- density(sub2_td_swe)
dens_df <- density(sub1_td_swe)
plot(dens_bg,
main = "SWE Tajima's D",
xlab = "Tajima's D",
col = "black",
lwd = 2,
xlim = range(c(dens_bg$x, dens_df$x)),
ylim = range(c(dens_bg$y, dens_df$y)) * 1.1)
lines(dens_df, col = "red", lwd = 2)
legend("topright",
legend = c(paste("BG (n=", length(sub2_td_swe), ")", sep = ""),
paste("DF (n=", length(sub1_td_swe), ")", sep = "")),
col = c("black", "red"),
lwd = 2,
cex = 0.8,
bty = "n")
add_stats_below_legend(ks_p_td_swe, mean_diff_td_swe, y_pos_factor = 0.45, x_pos = "right")
}
# 3. ESP π
ks_pi_esp <- ks.test(sub1_pi_esp, sub2_pi_esp)
## Warning in ks.test.default(sub1_pi_esp, sub2_pi_esp): p-value will be
## approximate in the presence of ties
ks_p_pi_esp <- format.pval(ks_pi_esp$p.value, digits = 3)
mean_diff_pi_esp <- mean(sub1_pi_esp) - mean(sub2_pi_esp)
if (length(sub2_pi_esp) > 1 & length(sub1_pi_esp) > 1) {
dens_bg <- density(sub2_pi_esp)
dens_df <- density(sub1_pi_esp)
plot(dens_bg,
main = bquote(bold("ESP " * pi)),
xlab = expression(pi),
col = "black",
lwd = 2,
xlim = range(c(dens_bg$x, dens_df$x)),
ylim = range(c(dens_bg$y, dens_df$y)) * 1.1)
lines(dens_df, col = "darkgreen", lwd = 2)
legend("topright",
legend = c(paste("BG (n=", length(sub2_pi_esp), ")", sep = ""),
paste("DF (n=", length(sub1_pi_esp), ")", sep = "")),
col = c("black", "darkgreen"),
lwd = 2,
cex = 0.8,
bty = "n")
add_stats_below_legend(ks_p_pi_esp, mean_diff_pi_esp, y_pos_factor = 0.15, x_pos = "right")
}
# 4. SWE π
ks_pi_swe <- ks.test(sub1_pi_swe, sub2_pi_swe)
## Warning in ks.test.default(sub1_pi_swe, sub2_pi_swe): p-value will be
## approximate in the presence of ties
ks_p_pi_swe <- format.pval(ks_pi_swe$p.value, digits = 3)
mean_diff_pi_swe <- mean(sub1_pi_swe) - mean(sub2_pi_swe)
if (length(sub2_pi_swe) > 1 & length(sub1_pi_swe) > 1) {
dens_bg <- density(sub2_pi_swe)
dens_df <- density(sub1_pi_swe)
plot(dens_bg,
main = bquote(bold("SWE " * pi)),
xlab = expression(pi),
col = "black",
lwd = 2,
xlim = range(c(dens_bg$x, dens_df$x)),
ylim = range(c(dens_bg$y, dens_df$y)) * 1.1)
lines(dens_df, col = "orange", lwd = 2)
legend("topright",
legend = c(paste("BG (n=", length(sub2_pi_swe), ")", sep = ""),
paste("DF (n=", length(sub1_pi_swe), ")", sep = "")),
col = c("black", "orange"),
lwd = 2,
cex = 0.8,
bty = "n")
add_stats_below_legend(ks_p_pi_swe, mean_diff_pi_swe, y_pos_factor = 0.15, x_pos = "right")
}
# 5. FST
ks_fst <- ks.test(sub1_fst, sub2_fst)
## Warning in ks.test.default(sub1_fst, sub2_fst): p-value will be approximate in
## the presence of ties
ks_p_fst <- format.pval(ks_fst$p.value, digits = 3)
mean_diff_fst <- mean(sub1_fst) - mean(sub2_fst)
if (length(sub2_fst) > 1 & length(sub1_fst) > 1) {
dens_bg <- density(sub2_fst)
dens_df <- density(sub1_fst)
plot(dens_bg,
main = bquote(bold("ESP-SWE " * F[ST])),
xlab = expression(F[ST]),
col = "black",
lwd = 2,
xlim = range(c(dens_bg$x, dens_df$x)),
ylim = range(c(dens_bg$y, dens_df$y)) * 1.1)
lines(dens_df, col = "purple", lwd = 2)
legend("topright",
legend = c(paste("BG (n=", length(sub2_fst), ")", sep = ""),
paste("DF (n=", length(sub1_fst), ")", sep = "")),
col = c("black", "purple"),
lwd = 2,
cex = 0.8,
bty = "n")
add_stats_below_legend(ks_p_fst, mean_diff_fst, y_pos_factor = 0.15, x_pos = "right")
}
# 6. Summary legend
plot(1, type = "n", axes = FALSE, xlab = "", ylab = "")
legend("center",
legend = c("Statistical Summary",
"KS = Kolmogorov-Smirnov test",
"p = probability value",
"Δmean = DF mean - BG mean",
paste("Total BG windows:", sum(At_data_clean$DF == "BG")),
paste("Total DF windows:", sum(At_data_clean$DF == "DF"))),
cex = 0.9,
bty = "n",
text.font = c(2, 1, 1, 1, 1, 1))
# Reset plot parameters
par(mfrow = c(1, 1))
Visual comparison of statistics between defence and background genes using boxplots with statistical annotations.
Boxplots for Tajima’s D, π, and FST
Wilcoxon test p-values displayed
Clear visualization of median differences
# Create long format dataframes for ggplot
library(tidyr)
# Tajima's D comparison
td_comparison_long <- At_data_clean %>%
select(DF, ESP_td, SWE_td) %>%
pivot_longer(
cols = c(ESP_td, SWE_td),
names_to = "population",
values_to = "td_value"
) %>%
mutate(
population = gsub("_td", "", population),
population = factor(population, levels = c("ESP", "SWE"))
)
# π comparison
pi_comparison_long <- At_data_clean %>%
select(DF, ESP_pi, SWE_pi) %>%
pivot_longer(
cols = c(ESP_pi, SWE_pi),
names_to = "population",
values_to = "pi_value"
) %>%
mutate(
population = gsub("_pi", "", population),
population = factor(population, levels = c("ESP", "SWE"))
)
# FST comparison
fst_comparison <- At_data_clean %>%
select(DF, ESP_SWE_fst)
# Statistical tests
wilcox_esp_td <- wilcox.test(td_value ~ DF, data = td_comparison_long %>% filter(population == "ESP"))
wilcox_swe_td <- wilcox.test(td_value ~ DF, data = td_comparison_long %>% filter(population == "SWE"))
wilcox_esp_pi <- wilcox.test(pi_value ~ DF, data = pi_comparison_long %>% filter(population == "ESP"))
wilcox_swe_pi <- wilcox.test(pi_value ~ DF, data = pi_comparison_long %>% filter(population == "SWE"))
wilcox_fst <- wilcox.test(ESP_SWE_fst ~ DF, data = fst_comparison)
# Create dataframes for p-value annotations
pvalue_td_df <- data.frame(
population = c("ESP", "SWE"),
p_value = c(
format.pval(wilcox_esp_td$p.value, digits = 3),
format.pval(wilcox_swe_td$p.value, digits = 3)
),
y_position = c(
max(td_comparison_long$td_value[td_comparison_long$population == "ESP"], na.rm = TRUE) * 0.95,
max(td_comparison_long$td_value[td_comparison_long$population == "SWE"], na.rm = TRUE) * 0.95
),
x_position = 1.5
)
pvalue_pi_df <- data.frame(
population = c("ESP", "SWE"),
p_value = c(
format.pval(wilcox_esp_pi$p.value, digits = 3),
format.pval(wilcox_swe_pi$p.value, digits = 3)
),
y_position = c(
max(pi_comparison_long$pi_value[pi_comparison_long$population == "ESP"], na.rm = TRUE) * 0.95,
max(pi_comparison_long$pi_value[pi_comparison_long$population == "SWE"], na.rm = TRUE) * 0.95
),
x_position = 1.5
)
# Plot 1: Tajima's D
p1 <- ggplot(td_comparison_long, aes(x = DF, y = td_value, fill = DF)) +
geom_boxplot(color = "black") +
facet_wrap(~ population, ncol = 2) +
theme_light() +
scale_fill_manual(values = c("lightblue", "lightcoral")) +
geom_text(data = pvalue_td_df,
aes(x = x_position, y = y_position, label = paste("p =", p_value)),
size = 5, inherit.aes = FALSE) +
labs(title = "Tajima's D: Defence vs Background Genes",
x = NULL,
y = "Tajima's D") +
theme(legend.position = "none")
# Plot 2: π
p2 <- ggplot(pi_comparison_long, aes(x = DF, y = pi_value, fill = DF)) +
geom_boxplot(color = "black") +
facet_wrap(~ population, ncol = 2) +
theme_light() +
scale_fill_manual(values = c("lightgreen", "orange")) +
geom_text(data = pvalue_pi_df,
aes(x = x_position, y = y_position, label = paste("p =", p_value)),
size = 5, inherit.aes = FALSE) +
labs(title = expression(paste("Nucleotide Diversity (", pi, "): Defence vs Background")),
x = NULL,
y = expression(pi)) +
theme(legend.position = "none")
# Plot 3: FST
p3 <- ggplot(fst_comparison, aes(x = DF, y = ESP_SWE_fst, fill = DF)) +
geom_boxplot(color = "black") +
theme_light() +
scale_fill_manual(values = c("lightpink", "purple")) +
annotate("text", x = 1.5, y = max(fst_comparison$ESP_SWE_fst, na.rm = TRUE) * 0.95,
label = paste("p =", format.pval(wilcox_fst$p.value, digits = 3)),
size = 5) +
labs(title = expression(paste(F[ST], ": Defence vs Background Genes")),
x = NULL,
y = expression(F[ST])) +
theme(legend.position = "none")
# Display plots
print(p1)
print(p2)
print(p3)
# Print test results
cat("Wilcoxon Test Results:\n")
## Wilcoxon Test Results:
cat("=====================\n")
## =====================
cat("\nESP Tajima's D: p =", format.pval(wilcox_esp_td$p.value, digits = 4))
##
## ESP Tajima's D: p = 0.6876
cat("\nSWE Tajima's D: p =", format.pval(wilcox_swe_td$p.value, digits = 4))
##
## SWE Tajima's D: p = 0.1004
cat("\nESP π: p =", format.pval(wilcox_esp_pi$p.value, digits = 4))
##
## ESP π: p = 0.1435
cat("\nSWE π: p =", format.pval(wilcox_swe_pi$p.value, digits = 4))
##
## SWE π: p = 0.003944
cat("\nFST: p =", format.pval(wilcox_fst$p.value, digits = 4), "\n")
##
## FST: p = 0.1183
Interpretation guideline:
Significant KS p-values indicate DF and BG windows differ in distribution.
Lower π in DF may indicate purifying or positive selection.
Higher π in DF may indicate balancing selection.
Tajima’s D differences help differentiate sweep vs balancing signatures.
Higher FST in DF suggests population-specific selection.
Visualise FST outlier patterns using three complementary plots to understand:
1. Distribution differences between defence and background genes
2. Threshold-based outlier identification
3. Spatial distribution of outliers along the chromosome
# --- FST Outlier Visualization --------------------
library(ggplot2)
library(patchwork)
# Prepare data for plots
fst_df <- At_data_clean %>%
select(DF, ESP_SWE_fst) %>%
filter(ESP_SWE_fst > 0)
# Calculate thresholds separately for BG and DF
thresholds_df <- fst_df %>%
group_by(DF) %>%
summarise(
threshold_95 = quantile(ESP_SWE_fst, 0.95, na.rm = TRUE),
threshold_99 = quantile(ESP_SWE_fst, 0.99, na.rm = TRUE),
mean_fst = mean(ESP_SWE_fst, na.rm = TRUE),
n_windows = n()
)
# Kolmogorov-Smirnov test for distribution differences
bg_fst <- fst_df$ESP_SWE_fst[fst_df$DF == "BG"]
df_fst <- fst_df$ESP_SWE_fst[fst_df$DF == "DF"]
ks_result <- ks.test(bg_fst, df_fst)
# Defence gene windows for specific analysis
defence_windows <- At_data_clean %>%
filter(DF == "DF")
# Calculate various thresholds
def_threshold_95 <- quantile(defence_windows$ESP_SWE_fst, 0.95, na.rm = TRUE)
def_threshold_99 <- quantile(defence_windows$ESP_SWE_fst, 0.99, na.rm = TRUE)
genome_threshold_95 <- quantile(fst_df$ESP_SWE_fst, 0.95, na.rm = TRUE)
genome_threshold_99 <- quantile(fst_df$ESP_SWE_fst, 0.99, na.rm = TRUE)
# Prepare defence windows with outlier classifications
defence_vs_genome <- defence_windows %>%
mutate(
genome_outlier = case_when(
ESP_SWE_fst > genome_threshold_99 ~ "Genome_99%",
ESP_SWE_fst > genome_threshold_95 ~ "Genome_95%",
TRUE ~ "Genome_non_outlier"
),
defence_outlier = case_when(
ESP_SWE_fst > def_threshold_99 ~ "Defence_99%",
ESP_SWE_fst > def_threshold_95 ~ "Defence_95%",
TRUE ~ "Defence_non_outlier"
)
)
# Plot 1: BG vs DF FST distributions with thresholds
p1 <- ggplot(fst_df, aes(x = ESP_SWE_fst, fill = DF)) +
geom_density(alpha = 0.5) +
geom_vline(data = thresholds_df,
aes(xintercept = threshold_95, color = DF),
linetype = "dashed", size = 0.8) +
geom_vline(data = thresholds_df,
aes(xintercept = threshold_99, color = DF),
linetype = "dotted", size = 0.8) +
scale_fill_manual(values = c("BG" = "gray70", "DF" = "blue"),
name = "Gene Category") +
scale_color_manual(values = c("BG" = "black", "DF" = "darkblue"),
name = "Gene Category") +
labs(title = "FST Distribution: Defence vs Background Genes",
subtitle = paste("KS test p =", format.pval(ks_result$p.value, digits = 3)),
x = expression(F[ST]),
y = "Density") +
theme_minimal() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
# Plot 2: Defence gene histogram with multiple thresholds
p2 <- ggplot(defence_vs_genome, aes(x = ESP_SWE_fst)) +
geom_histogram(binwidth = 0.01, fill = "lightblue", alpha = 0.7, color = "black") +
# Genome-wide thresholds
geom_vline(xintercept = genome_threshold_95,
color = "orange", linetype = "dashed", size = 1) +
geom_vline(xintercept = genome_threshold_99,
color = "red", linetype = "dashed", size = 1) +
# Defence-specific thresholds
geom_vline(xintercept = def_threshold_95,
color = "darkgreen", linetype = "dashed", size = 1) +
geom_vline(xintercept = def_threshold_99,
color = "purple", linetype = "dashed", size = 1) +
annotate("text", x = genome_threshold_95, y = Inf,
label = "Genome 95%", color = "orange", vjust = 2, hjust = -0.1,
size = 3) +
annotate("text", x = genome_threshold_99, y = Inf,
label = "Genome 99%", color = "red", vjust = 3, hjust = -0.1,
size = 3) +
annotate("text", x = def_threshold_95, y = Inf,
label = "Defence 95%", color = "darkgreen", vjust = 4, hjust = -0.1,
size = 3) +
annotate("text", x = def_threshold_99, y = Inf,
label = "Defence 99%", color = "purple", vjust = 5, hjust = -0.1,
size = 3) +
labs(title = "FST Distribution in Defence Genes",
subtitle = "Comparison of genome-wide vs defence-specific thresholds",
x = expression(F[ST]),
y = "Window Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
# Plot 3: Scatter plot of defence gene FST along chromosome
# Prepare color-coded outlier status
defence_windows <- defence_windows %>%
mutate(outlier_status = case_when(
ESP_SWE_fst > def_threshold_99 ~ "99% Outlier",
ESP_SWE_fst > def_threshold_95 ~ "95% Outlier",
TRUE ~ "Non-outlier"
))
p3 <- ggplot(defence_windows, aes(x = mid/1e6, y = ESP_SWE_fst)) +
geom_point(aes(color = outlier_status, size = outlier_status), alpha = 0.7) +
geom_hline(yintercept = def_threshold_95, color = "orange",
linetype = "dashed", alpha = 0.7) +
geom_hline(yintercept = def_threshold_99, color = "red",
linetype = "dashed", alpha = 0.7) +
scale_color_manual(values = c("Non-outlier" = "gray70",
"95% Outlier" = "orange",
"99% Outlier" = "red"),
name = "Outlier Status") +
scale_size_manual(values = c("Non-outlier" = 1,
"95% Outlier" = 2,
"99% Outlier" = 3),
name = "Outlier Status") +
labs(title = "Spatial Distribution of FST in Defence Genes",
subtitle = "Position along chromosome 1",
x = "Position (Mb)",
y = expression(F[ST])) +
theme_minimal() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
# Combine plots using patchwork
combined_plot <- (p1 | p2) / p3 +
plot_annotation(
title = "FST Outlier Analysis: Comprehensive Visualization",
subtitle = "Three complementary views of FST patterns in defence genes",
caption = paste("Analysis includes", nrow(defence_windows), "defence gene windows"),
theme = theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, size = 12),
plot.caption = element_text(hjust = 1, size = 10)
)
)
# Display the combined plot
print(combined_plot)
# Print threshold values for reference
cat("\nFST Threshold Values:\n")
##
## FST Threshold Values:
cat("====================\n")
## ====================
cat(sprintf("Genome-wide 95%% threshold: %.4f\n", genome_threshold_95))
## Genome-wide 95% threshold: 0.3037
cat(sprintf("Genome-wide 99%% threshold: %.4f\n", genome_threshold_99))
## Genome-wide 99% threshold: 0.4746
cat(sprintf("Defence-specific 95%% threshold: %.4f\n", def_threshold_95))
## Defence-specific 95% threshold: 0.3063
cat(sprintf("Defence-specific 99%% threshold: %.4f\n", def_threshold_99))
## Defence-specific 99% threshold: 0.4968
cat(sprintf("BG mean FST: %.4f\n", mean(bg_fst, na.rm = TRUE)))
## BG mean FST: 0.0959
cat(sprintf("DF mean FST: %.4f\n", mean(df_fst, na.rm = TRUE)))
## DF mean FST: 0.1023
cat(sprintf("KS test p-value: %.4f\n", ks_result$p.value))
## KS test p-value: 0.1107
Interpreting FST Outliers:
99% outliers (red): Most significant, strong candidates for selective sweeps
95% outliers (orange): Moderate significance, potential candidates
Non-outliers (gray): Evolve neutrally or under weak selection
Perform statistical tests to determine if defence genes are enriched for FST outliers compared to background regions.
Fisher’s exact test for contingency tables
Chi-square test for outlier enrichment
Descriptive statistics for outlier proportions
# --- Statistical Tests for FST Outliers --------------------
# Create contingency table
contingency_table <- table(defence_vs_genome$genome_outlier,
defence_vs_genome$defence_outlier)
cat("FST Outlier Statistical Analysis\n")
## FST Outlier Statistical Analysis
cat("================================\n\n")
## ================================
cat("1. Contingency Table (Genome vs Defence thresholds):\n")
## 1. Contingency Table (Genome vs Defence thresholds):
print(kable(contingency_table))
##
##
## | | Defence_95%| Defence_99%| Defence_non_outlier|
## |:------------------|-----------:|-----------:|-------------------:|
## |Genome_95% | 55| 0| 4|
## |Genome_99% | 7| 16| 0|
## |Genome_non_outlier | 0| 0| 1460|
# Fisher's exact test with error handling
fisher_test <- tryCatch({
fisher.test(contingency_table, simulate.p.value = TRUE, B = 10000)
}, error = function(e) {
cat("Fisher's test error:", e$message, "\nUsing alternative method...\n")
# Try alternative: create 2x2 table
simple_df <- defence_vs_genome %>%
mutate(
genome_binary = ifelse(genome_outlier != "Genome_non_outlier", "Outlier", "Non_outlier"),
defence_binary = ifelse(defence_outlier != "Defence_non_outlier", "Outlier", "Non_outlier")
)
simple_table <- table(simple_df$genome_binary, simple_df$defence_binary)
fisher.test(simple_table)
})
cat("\n2. Fisher's Exact Test Results:\n")
##
## 2. Fisher's Exact Test Results:
cat(" p-value:", format.pval(fisher_test$p.value, digits = 4), "\n")
## p-value: 9.999e-05
if (!is.null(fisher_test$estimate) && is.numeric(fisher_test$estimate)) {
cat(" Odds ratio:", round(fisher_test$estimate, 3), "\n")
}
# Simpler enrichment analysis
cat("\n3. Outlier Proportion Analysis:\n")
##
## 3. Outlier Proportion Analysis:
outlier_summary <- data.frame(
Category = c("Background", "Defence"),
Total_Windows = c(sum(fst_df$DF == "BG"), sum(fst_df$DF == "DF")),
Above_Genome_95th = c(sum(fst_df$ESP_SWE_fst[fst_df$DF == "BG"] > genome_threshold_95),
sum(fst_df$ESP_SWE_fst[fst_df$DF == "DF"] > genome_threshold_95)),
Above_Genome_99th = c(sum(fst_df$ESP_SWE_fst[fst_df$DF == "BG"] > genome_threshold_99),
sum(fst_df$ESP_SWE_fst[fst_df$DF == "DF"] > genome_threshold_99))
) %>%
mutate(
Prop_95th = Above_Genome_95th / Total_Windows,
Prop_99th = Above_Genome_99th / Total_Windows
)
print(kable(outlier_summary, digits = 4))
##
##
## |Category | Total_Windows| Above_Genome_95th| Above_Genome_99th| Prop_95th| Prop_99th|
## |:----------|-------------:|-----------------:|-----------------:|---------:|---------:|
## |Background | 66777| 3321| 658| 0.0497| 0.0099|
## |Defence | 1267| 82| 23| 0.0647| 0.0182|
# Chi-square test for 95th percentile enrichment
cat("\n4. Chi-square Test (95th percentile enrichment):\n")
##
## 4. Chi-square Test (95th percentile enrichment):
chi_test <- chisq.test(matrix(c(
outlier_summary$Above_Genome_95th[1], outlier_summary$Total_Windows[1] - outlier_summary$Above_Genome_95th[1],
outlier_summary$Above_Genome_95th[2], outlier_summary$Total_Windows[2] - outlier_summary$Above_Genome_95th[2]
), nrow = 2))
cat(" X-squared:", round(chi_test$statistic, 3), "\n")
## X-squared: 5.567
cat(" p-value:", format.pval(chi_test$p.value, digits = 4), "\n")
## p-value: 0.0183
# Interpretation
cat("\n5. Interpretation:\n")
##
## 5. Interpretation:
if (chi_test$p.value < 0.05) {
enrich_ratio <- outlier_summary$Prop_95th[2] / outlier_summary$Prop_95th[1]
cat(sprintf(" - Defence genes have %.1f%% outliers vs %.1f%% in background\n",
outlier_summary$Prop_95th[2] * 100,
outlier_summary$Prop_95th[1] * 100))
cat(sprintf(" - %.2fx enrichment in defence genes (p = %.4f)\n",
enrich_ratio, chi_test$p.value))
} else {
cat(" - No significant difference in outlier proportions\n")
}
## - Defence genes have 6.5% outliers vs 5.0% in background
## - 1.30x enrichment in defence genes (p = 0.0183)
Provide comprehensive summary statistics for all analyses.
Summary table of all statistics by gene category
Clear presentation of results
# --- Final Summary Statistics Table -----------------------------
summary_stats <- At_data_clean %>%
group_by(DF) %>%
summarise(
n_windows = n(),
# π statistics
mean_ESP_pi = mean(ESP_pi),
sd_ESP_pi = sd(ESP_pi),
mean_SWE_pi = mean(SWE_pi),
sd_SWE_pi = sd(SWE_pi),
# FST statistics
mean_FST = mean(ESP_SWE_fst),
sd_FST = sd(ESP_SWE_fst),
median_FST = median(ESP_SWE_fst),
FST_95th = quantile(ESP_SWE_fst, 0.95),
FST_99th = quantile(ESP_SWE_fst, 0.99),
# Tajima's D statistics
mean_ESP_TajD = mean(ESP_td),
sd_ESP_TajD = sd(ESP_td),
mean_SWE_TajD = mean(SWE_td),
sd_SWE_TajD = sd(SWE_td)
) %>%
mutate(across(where(is.numeric), ~ round(., 4)))
cat("Comprehensive Summary Statistics by Gene Category\n")
## Comprehensive Summary Statistics by Gene Category
cat("=================================================\n\n")
## =================================================
print(kable(summary_stats,
caption = "Summary statistics for defence (DF) and background (BG) regions",
align = c("l", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r", "r")))
##
##
## Table: Summary statistics for defence (DF) and background (BG) regions
##
## |DF | n_windows| mean_ESP_pi| sd_ESP_pi| mean_SWE_pi| sd_SWE_pi| mean_FST| sd_FST| median_FST| FST_95th| FST_99th| mean_ESP_TajD| sd_ESP_TajD| mean_SWE_TajD| sd_SWE_TajD|
## |:--|---------:|-----------:|---------:|-----------:|---------:|--------:|------:|----------:|--------:|--------:|-------------:|-----------:|-------------:|-----------:|
## |BG | 78080| 0.0045| 0.0039| 0.0041| 0.0037| 0.0809| 0.0999| 0.0476| 0.2864| 0.4538| -8e-04| 0.0096| 0.0012| 0.0098|
## |DF | 1542| 0.0042| 0.0034| 0.0039| 0.0037| 0.0827| 0.1085| 0.0423| 0.3063| 0.4968| -9e-04| 0.0097| 0.0009| 0.0100|
# Calculate percentage differences
cat("\nPercentage Differences (DF relative to BG):\n")
##
## Percentage Differences (DF relative to BG):
cat("===========================================\n")
## ===========================================
differences <- data.frame(
Statistic = c("ESP π", "SWE π", "FST", "ESP Tajima's D", "SWE Tajima's D"),
DF_Mean = c(
summary_stats$mean_ESP_pi[2],
summary_stats$mean_SWE_pi[2],
summary_stats$mean_FST[2],
summary_stats$mean_ESP_TajD[2],
summary_stats$mean_SWE_TajD[2]
),
BG_Mean = c(
summary_stats$mean_ESP_pi[1],
summary_stats$mean_SWE_pi[1],
summary_stats$mean_FST[1],
summary_stats$mean_ESP_TajD[1],
summary_stats$mean_SWE_TajD[1]
)
) %>%
mutate(
Difference = DF_Mean - BG_Mean,
Percent_Change = round((Difference / BG_Mean) * 100, 1),
Direction = ifelse(Difference > 0, "Higher in DF", "Lower in DF")
)
print(kable(differences,
caption = "Mean differences between defence and background regions",
align = c("l", "r", "r", "r", "r", "l")))
##
##
## Table: Mean differences between defence and background regions
##
## |Statistic | DF_Mean| BG_Mean| Difference| Percent_Change|Direction |
## |:--------------|-------:|-------:|----------:|--------------:|:------------|
## |ESP π | 0.0042| 0.0045| -0.0003| -6.7|Lower in DF |
## |SWE π | 0.0039| 0.0041| -0.0002| -4.9|Lower in DF |
## |FST | 0.0827| 0.0809| 0.0018| 2.2|Higher in DF |
## |ESP Tajima's D | -0.0009| -0.0008| -0.0001| 12.5|Lower in DF |
## |SWE Tajima's D | 0.0009| 0.0012| -0.0003| -25.0|Lower in DF |
For statistical reporting, emphasize effect sizes:
# Calculate effect sizes (Cohen's d) for the differences
library(effsize)
effect_sizes <- data.frame(
Statistic = c("ESP π", "SWE π", "ESP Tajima's D", "SWE Tajima's D", "FST"),
Cohen_d = c(
cohen.d(At_data_clean$ESP_pi[At_data_clean$DF == "BG"],
At_data_clean$ESP_pi[At_data_clean$DF == "DF"])$estimate,
cohen.d(At_data_clean$SWE_pi[At_data_clean$DF == "BG"],
At_data_clean$SWE_pi[At_data_clean$DF == "DF"])$estimate,
cohen.d(At_data_clean$ESP_td[At_data_clean$DF == "BG"],
At_data_clean$ESP_td[At_data_clean$DF == "DF"])$estimate,
cohen.d(At_data_clean$SWE_td[At_data_clean$DF == "BG"],
At_data_clean$SWE_td[At_data_clean$DF == "DF"])$estimate,
cohen.d(At_data_clean$ESP_SWE_fst[At_data_clean$DF == "BG"],
At_data_clean$ESP_SWE_fst[At_data_clean$DF == "DF"])$estimate
)
)
print("Effect sizes (Cohen's d):")
## [1] "Effect sizes (Cohen's d):"
print(effect_sizes)
## Statistic Cohen_d
## 1 ESP π 0.072114886
## 2 SWE π 0.055400094
## 3 ESP Tajima's D 0.004348834
## 4 SWE Tajima's D 0.031251315
## 5 FST -0.018068985
Purpose of this analysis Cohen’s d quantifies the standardized difference between defence-associated windows (DF) and background windows (BG). While statistical tests (e.g., KS test, Wilcoxon) indicate whether distributions differ, effect sizes tell us how large and biologically meaningful these differences are.
How Cohen’s d is calculated here
You computed:
π (ESP and SWE)
Tajima’s D (ESP and SWE)
FST (ESP–SWE)
for DF and BG windows and then measured:
Cohen’s d = (mean_BG − mean_DF) / pooled SD
A positive d means BG > DF; A negative d means DF > BG.
How to interpret your values
You reported:
| Statistic | Cohen’s d |
|---|---|
| ESP π | 0.072 |
| SWE π | 0.055 |
| ESP Tajima’s D | 0.004 |
| SWE Tajima’s D | 0.031 |
| FST | −0.018 |
All values fall well below 0.2.
Interpretation
These effect sizes are negligible.
DF windows do not differ meaningfully from BG windows for π, Tajima’s D, or FST.
Any observed statistical significance (if any) is not accompanied by a biologically relevant shift.
This pattern is more consistent with neutral drift, genome-wide noise, or measurement uncertainty, rather than strong or localized selection in DF windows.
Map FST outlier windows to specific genes using genome annotations to: 1. Identify which defence genes contain FST outliers 2. Determine the biological functions of outlier-containing genes 3. Visualize the spatial relationship between outliers and genes 4. Prioritize candidate genes for further investigation
cat("\n=== GENE ANNOTATION AND OVERLAP ANALYSIS ===\n")
##
## === GENE ANNOTATION AND OVERLAP ANALYSIS ===
# Load required libraries
library(GenomicRanges)
library(IRanges)
library(conflicted)
# Resolve conflicts
conflicts_prefer(S4Vectors::first)
conflicts_prefer(dplyr::first)
# --- Step 1: Load and prepare gene annotation (GFF file) ---
cat("Loading gene annotations from GFF file...\n")
## Loading gene annotations from GFF file...
# Read the GFF file
gff_path <- "vcf/GFF_final_with_annotation_2_gff"
if (file.exists(gff_path)) {
gff <- read.delim(gff_path,
header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# Rename columns for clarity
colnames(gff) <- c("Model_name", "seqname", "source", "feature", "start", "end",
"score", "strand", "frame", "attributes", "dup_colmn",
"gene_type", "short_desc", "curator_summary", "computational_desc")
# Convert positions to numeric
gff$start <- as.numeric(as.character(gff$start))
gff$end <- as.numeric(as.character(gff$end))
cat("GFF file loaded successfully\n")
cat("Total rows:", nrow(gff), "\n")
cat("Columns:", paste(colnames(gff), collapse = ", "), "\n\n")
# Check feature types
cat("Feature types in GFF file:\n")
feature_counts <- table(gff$feature)
print(kable(feature_counts))
# Filter for mRNA features (gene transcripts)
filtered_annotation <- subset(gff, feature == "mRNA")
# Extract gene information from attributes column
# Format: ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
filtered_annotation$gene_id <- gsub(".*ID=([^;]+).*", "\\1", filtered_annotation$attributes)
filtered_annotation$parent_gene <- gsub(".*Parent=([^;]+).*", "\\1", filtered_annotation$attributes)
filtered_annotation$simple_gene_id <- gsub("\\..*", "", filtered_annotation$gene_id) # Remove version number
cat("\nGene annotation summary:\n")
cat("✓ Total mRNA features (transcripts):", nrow(filtered_annotation), "\n")
cat("✓ Unique parent genes:", length(unique(filtered_annotation$parent_gene)), "\n")
cat("✓ Unique transcripts:", length(unique(filtered_annotation$gene_id)), "\n")
} else {
stop("GFF file not found at: ", gff_path)
}
## GFF file loaded successfully
## Total rows: 28812
## Columns: Model_name, seqname, source, feature, start, end, score, strand, frame, attributes, dup_colmn, gene_type, short_desc, curator_summary, computational_desc
##
## Feature types in GFF file:
##
##
## |Var1 | Freq|
## |:----|-----:|
## |mRNA | 28812|
##
## Gene annotation summary:
## ✓ Total mRNA features (transcripts): 28812
## ✓ Unique parent genes: 22359
## ✓ Unique transcripts: 28812
# --- Step 2: Create genomic ranges for genes ---
cat("\nCreating genomic ranges for genes...\n")
##
## Creating genomic ranges for genes...
gene_ranges <- GRanges(
seqnames = filtered_annotation$seqname,
ranges = IRanges(start = filtered_annotation$start, end = filtered_annotation$end),
gene_id = filtered_annotation$gene_id,
parent_gene = filtered_annotation$parent_gene,
gene_type = filtered_annotation$gene_type,
strand = filtered_annotation$strand,
description = filtered_annotation$short_desc
)
cat("✓ Created gene ranges for", length(gene_ranges), "transcripts\n")
## ✓ Created gene ranges for 28812 transcripts
# --- Step 3: Create genomic ranges for FST outliers ---
cat("\nCreating genomic ranges for FST outliers...\n")
##
## Creating genomic ranges for FST outliers...
# First, ensure we have defence_outliers data
if (!exists("defence_outliers")) {
# Create it from defence_windows if available
if (exists("defence_windows")) {
defence_outliers <- defence_windows
if (!"outlier_status" %in% colnames(defence_outliers)) {
# Create outlier_status if needed
if (exists("def_threshold_95") && exists("def_threshold_99")) {
defence_outliers$outlier_status <- case_when(
defence_outliers$ESP_SWE_fst > def_threshold_99 ~ "99% Outlier",
defence_outliers$ESP_SWE_fst > def_threshold_95 ~ "95% Outlier",
TRUE ~ "Non-outlier"
)
}
}
cat("Created defence_outliers from defence_windows\n")
}
}
## Created defence_outliers from defence_windows
# Ensure we have the necessary columns
if (exists("defence_outliers") && nrow(defence_outliers) > 0) {
# Add chromosome information if missing
if (!"chromosome" %in% colnames(defence_outliers)) {
defence_outliers$chromosome <- "Chr1"
}
# Create GRanges object for outliers
outlier_ranges <- GRanges(
seqnames = defence_outliers$chromosome,
ranges = IRanges(start = defence_outliers$start, end = defence_outliers$stop),
outlier_status = defence_outliers$outlier_status,
fst_value = defence_outliers$ESP_SWE_fst,
window_mid = defence_outliers$mid
)
cat("✓ Created outlier ranges for", length(outlier_ranges), "windows\n")
# --- Step 4: Find overlaps between outliers and genes ---
cat("\nFinding overlaps between outliers and genes...\n")
# Find all overlaps (any overlap type)
overlaps <- findOverlaps(outlier_ranges, gene_ranges, type = "any")
cat("Total overlaps found:", length(overlaps), "\n")
# Create comprehensive data frame of overlaps
if (length(overlaps) > 0) {
overlap_df <- data.frame(
# Outlier information
outlier_index = queryHits(overlaps),
outlier_chr = as.character(seqnames(outlier_ranges)[queryHits(overlaps)]),
outlier_start = start(outlier_ranges)[queryHits(overlaps)],
outlier_end = end(outlier_ranges)[queryHits(overlaps)],
outlier_mid = outlier_ranges$window_mid[queryHits(overlaps)],
outlier_status = as.character(outlier_ranges$outlier_status[queryHits(overlaps)]),
fst_value = outlier_ranges$fst_value[queryHits(overlaps)],
# Gene information
gene_index = subjectHits(overlaps),
gene_chr = as.character(seqnames(gene_ranges)[subjectHits(overlaps)]),
gene_start = start(gene_ranges)[subjectHits(overlaps)],
gene_end = end(gene_ranges)[subjectHits(overlaps)],
gene_id = gene_ranges$gene_id[subjectHits(overlaps)],
parent_gene = gene_ranges$parent_gene[subjectHits(overlaps)],
gene_type = gene_ranges$gene_type[subjectHits(overlaps)],
gene_strand = as.character(strand(gene_ranges)[subjectHits(overlaps)]),
gene_description = gene_ranges$description[subjectHits(overlaps)]
)
# Filter for only outliers (exclude "Non-outlier" if present)
if ("Non-outlier" %in% overlap_df$outlier_status) {
overlap_df <- overlap_df %>% filter(outlier_status != "Non-outlier")
}
# Calculate overlap statistics
cat("\n=== OVERLAP STATISTICS ===\n")
cat("Overlapping outlier-gene pairs:", nrow(overlap_df), "\n")
cat("Unique genes with outliers:", length(unique(overlap_df$parent_gene)), "\n")
cat("Unique transcripts with outliers:", length(unique(overlap_df$gene_id)), "\n\n")
cat("Overlaps by outlier type:\n")
outlier_type_counts <- table(overlap_df$outlier_status)
print(kable(outlier_type_counts))
cat("\nTop 5 gene types containing outliers:\n")
gene_type_summary <- sort(table(overlap_df$gene_type), decreasing = TRUE)
print(kable(head(gene_type_summary, 5)))
} else {
cat("WARNING: No overlaps found between outliers and genes!\n")
overlap_df <- data.frame()
}
} else {
cat("ERROR: defence_outliers data not available\n")
overlap_df <- data.frame()
}
## ✓ Created outlier ranges for 1542 windows
##
## Finding overlaps between outliers and genes...
## Total overlaps found: 2300
##
## === OVERLAP STATISTICS ===
## Overlapping outlier-gene pairs: 99
## Unique genes with outliers: 21
## Unique transcripts with outliers: 29
##
## Overlaps by outlier type:
##
##
## |Var1 | Freq|
## |:-----------|----:|
## |95% Outlier | 80|
## |99% Outlier | 19|
##
## Top 5 gene types containing outliers:
##
##
## | | x|
## |:--------------|--:|
## |protein_coding | 99|
# --- Step 5: Create gene summary table ---
if (exists("overlap_df") && nrow(overlap_df) > 0) {
cat("\nCreating gene summary table...\n")
gene_summary <- overlap_df %>%
group_by(parent_gene) %>%
summarise(
transcript_variants = paste(unique(gene_id), collapse = ";"),
total_outlier_windows = n(),
outlier_95_windows = sum(outlier_status == "95% Outlier"),
outlier_99_windows = sum(outlier_status == "99% Outlier"),
min_fst = min(fst_value),
mean_fst = mean(fst_value),
max_fst = max(fst_value),
gene_start = min(gene_start),
gene_end = max(gene_end),
gene_length = max(gene_end) - min(gene_start) + 1,
gene_type = first(gene_type),
description = first(gene_description)
) %>%
mutate(
outlier_density = total_outlier_windows / (gene_length / 1000) # Outliers per kb
) %>%
arrange(desc(max_fst))
cat("\nSummary of genes with FST outliers:\n")
cat("Total genes affected:", nrow(gene_summary), "\n")
cat("Average outlier windows per gene:", round(mean(gene_summary$total_outlier_windows), 2), "\n")
cat("Average FST in affected genes:", round(mean(gene_summary$mean_fst), 4), "\n\n")
cat("Top 10 genes with highest FST outliers:\n")
print(kable(head(gene_summary[, c("parent_gene", "max_fst", "total_outlier_windows",
"gene_length", "gene_type", "description")], 10),
caption = "Top 10 genes containing FST outliers"))
# Genes with both 95% and 99% outliers
mixed_genes <- gene_summary %>%
filter(outlier_95_windows > 0 & outlier_99_windows > 0)
cat("\nGenes with both 95% and 99% outliers:", nrow(mixed_genes), "\n")
# Save gene summary
if (!dir.exists("results")) dir.create("results")
write_csv(gene_summary, "results/gene_summary_with_outliers.csv")
cat("✓ Gene summary saved to: results/gene_summary_with_outliers.csv\n")
} else {
cat("No gene summary created - overlap data not available\n")
gene_summary <- data.frame()
}
##
## Creating gene summary table...
##
## Summary of genes with FST outliers:
## Total genes affected: 21
## Average outlier windows per gene: 4.71
## Average FST in affected genes: 0.4106
##
## Top 10 genes with highest FST outliers:
##
##
## Table: Top 10 genes containing FST outliers
##
## |parent_gene | max_fst| total_outlier_windows| gene_length|gene_type |description |
## |:-----------|---------:|---------------------:|-----------:|:--------------|:------------------------------------------------------------|
## |AT1G80600 | 0.7192394| 6| 2055|protein_coding |HOPW1-1-interacting 1 |
## |AT1G12290 | 0.6852304| 6| 4235|protein_coding |Disease resistance protein (CC-NBS-LRR class) family |
## |AT1G19660 | 0.5821976| 8| 2705|protein_coding |Wound-responsive family protein |
## |AT1G03475 | 0.5306411| 2| 2161|protein_coding |Coproporphyrinogen III oxidase |
## |AT1G08860 | 0.5276831| 10| 3986|protein_coding |Calcium-dependent phospholipid-binding Copine family protein |
## |AT1G09770 | 0.5260257| 2| 3520|protein_coding |cell division cycle 5 |
## |AT1G11000 | 0.5023773| 2| 4224|protein_coding |Seven transmembrane MLO family protein |
## |AT1G76954 | 0.4746309| 2| 672|protein_coding |Defensin-like (DEFL) family protein |
## |AT1G80680 | 0.4746083| 4| 4762|protein_coding |SUPPRESSOR OF AUXIN RESISTANCE 3 |
## |AT1G10170 | 0.4141339| 5| 3897|protein_coding |NF-X-like 1 |
##
## Genes with both 95% and 99% outliers: 5
## ✓ Gene summary saved to: results/gene_summary_with_outliers.csv
cat("\n=== VISUALISATION OF GENES WITH OUTLIERS ===\n")
##
## === VISUALISATION OF GENES WITH OUTLIERS ===
if (exists("overlap_df") && nrow(overlap_df) > 0 && exists("gene_summary") && nrow(gene_summary) > 0) {
# Prepare data for all FST values (for background)
all_fst_data <- At_data_clean %>%
filter(ESP_SWE_fst > 0) %>%
select(mid, ESP_SWE_fst, DF)
# Get threshold values
if (!exists("def_threshold_95")) {
def_threshold_95 <- quantile(defence_windows$ESP_SWE_fst, 0.95, na.rm = TRUE)
}
if (!exists("def_threshold_99")) {
def_threshold_99 <- quantile(defence_windows$ESP_SWE_fst, 0.99, na.rm = TRUE)
}
# --- Plot 1: Enhanced FST plot with gene annotations ---
cat("Generating enhanced FST plot with gene annotations...\n")
# Prepare gene data for plotting (first 30 genes for clarity)
gene_plot_data <- filtered_annotation %>%
filter(seqname == "Chr1") %>%
mutate(gene_mid = (start + end) / 2) %>%
arrange(start) %>%
head(30) # Limit to first 30 genes to avoid overcrowding
# Filter overlap_df to include only genes in plot_data
plot_overlaps <- overlap_df %>%
filter(parent_gene %in% gene_plot_data$parent_gene)
# Enhanced FST plot with gene tracks
enhanced_fst_plot <- ggplot() +
# Background: all FST windows (light gray)
geom_point(data = all_fst_data,
aes(x = mid/1e6, y = ESP_SWE_fst),
color = "gray85", size = 0.3, alpha = 0.2) +
# Defence gene FST points (blue)
geom_point(data = defence_windows,
aes(x = mid/1e6, y = ESP_SWE_fst),
color = "steelblue", size = 0.5, alpha = 0.5) +
# Outlier points (colored by status)
geom_point(data = plot_overlaps,
aes(x = outlier_mid/1e6, y = fst_value, color = outlier_status),
size = 1.8, alpha = 0.8) +
# Gene tracks (below FST plot)
geom_segment(data = gene_plot_data,
aes(x = start/1e6, xend = end/1e6, y = -0.03, yend = -0.03),
size = 4, color = "darkgreen", alpha = 0.6) +
# Gene labels (only for genes with outliers)
if (nrow(plot_overlaps) > 0) {
gene_labels <- plot_overlaps %>%
group_by(parent_gene) %>%
summarise(mid_pos = mean(outlier_mid/1e6),
max_fst = max(fst_value)) %>%
arrange(desc(max_fst)) %>%
head(8) # Top 8 genes by FST
enhanced_fst_plot <- enhanced_fst_plot +
geom_text(data = gene_labels,
aes(x = mid_pos, y = -0.05, label = parent_gene),
size = 3, angle = 45, hjust = 1, vjust = 1,
check_overlap = TRUE, color = "darkblue")
}
# Threshold lines
enhanced_fst_plot <- enhanced_fst_plot +
geom_hline(yintercept = def_threshold_95, color = "orange",
linetype = "dashed", size = 0.7, alpha = 0.8) +
geom_hline(yintercept = def_threshold_99, color = "red",
linetype = "dashed", size = 0.7, alpha = 0.8) +
# Mean line
geom_hline(yintercept = mean(defence_windows$ESP_SWE_fst, na.rm = TRUE),
color = "blue", linetype = "dashed", size = 0.8) +
# Scales and labels
scale_color_manual(values = c("95% Outlier" = "orange",
"99% Outlier" = "red")) +
scale_y_continuous(limits = c(-0.06, max(all_fst_data$ESP_SWE_fst, na.rm = TRUE) * 1.1)) +
labs(x = "Position on Chromosome 1 (Mb)",
y = expression(italic(F)[ST]),
title = "FST Outliers in Defence Genes with Gene Annotations",
subtitle = paste(nrow(gene_summary), "genes contain",
nrow(overlap_df), "outlier windows"),
caption = "Green bars = gene positions | Blue points = defence genes | Orange/red = outliers") +
theme_minimal() +
theme(legend.position = "bottom",
legend.title = element_blank(),
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5))
print(enhanced_fst_plot)
# --- Plot 2: Distribution of outlier counts per gene ---
p1 <- ggplot(gene_summary, aes(x = total_outlier_windows)) +
geom_histogram(binwidth = 1, fill = "steelblue", alpha = 0.7, color = "black") +
labs(x = "Number of Outlier Windows per Gene",
y = "Number of Genes",
title = "Distribution of Outlier Windows Across Genes",
subtitle = paste("Mean =", round(mean(gene_summary$total_outlier_windows), 2),
"outliers per gene")) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
print(p1)
# --- Plot 3: Top genes with most outliers ---
top_n <- min(15, nrow(gene_summary))
p2 <- gene_summary %>%
arrange(desc(total_outlier_windows)) %>%
head(top_n) %>%
mutate(parent_gene = factor(parent_gene, levels = rev(parent_gene))) %>%
ggplot(aes(x = parent_gene, y = total_outlier_windows, fill = max_fst)) +
geom_bar(stat = "identity", alpha = 0.8) +
geom_text(aes(label = total_outlier_windows),
hjust = -0.2, size = 3) +
coord_flip() +
scale_fill_gradient(low = "orange", high = "red", name = "Max FST") +
labs(x = "Gene ID", y = "Number of Outlier Windows",
title = paste("Top", top_n, "Genes with Most Outlier Windows"),
subtitle = "Color indicates maximum FST value in gene") +
theme_minimal() +
theme(axis.text.y = element_text(size = 9),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
print(p2)
# --- Plot 4: FST vs gene length ---
p3 <- ggplot(gene_summary, aes(x = gene_length/1000, y = max_fst)) +
geom_point(aes(size = total_outlier_windows, color = total_outlier_windows), alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "gray40", linetype = "dashed", size = 0.5) +
scale_color_gradient(low = "blue", high = "red", name = "Outlier count") +
scale_size_continuous(name = "Outlier count") +
labs(x = "Gene Length (kb)",
y = "Maximum FST in Gene",
title = "Relationship Between Gene Length and FST",
subtitle = "Size and color indicate number of outlier windows") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
print(p3)
cat("✓ All visualizations generated successfully\n")
} else {
cat("Skipping visualizations - insufficient overlap data\n")
}
## Generating enhanced FST plot with gene annotations...
## ✓ All visualizations generated successfully
cat("\n=== BIOLOGICAL INTERPRETATION ===\n")
##
## === BIOLOGICAL INTERPRETATION ===
if (exists("gene_summary") && nrow(gene_summary) > 0) {
# 1. Summary statistics
cat("1. OVERALL SUMMARY:\n")
cat(" - Total genes analyzed:", length(unique(filtered_annotation$parent_gene)), "\n")
cat(" - Genes containing FST outliers:", nrow(gene_summary), "\n")
cat(" - Percentage of genes with outliers:",
round(nrow(gene_summary)/length(unique(filtered_annotation$parent_gene))*100, 1), "%\n")
# 2. Top candidate genes
cat("\n2. TOP CANDIDATE GENES FOR SELECTION:\n")
top_candidates <- gene_summary %>%
filter(max_fst > def_threshold_99) %>%
arrange(desc(max_fst))
if (nrow(top_candidates) > 0) {
cat(" Genes with FST > 99th percentile:", nrow(top_candidates), "\n")
cat(" Highest FST:", round(max(top_candidates$max_fst), 4), "\n")
# Show top 5 candidates
cat("\n Top 5 candidate genes:\n")
for (i in 1:min(5, nrow(top_candidates))) {
cat(sprintf(" %d. %s: FST=%.4f, %d outlier windows, %s\n",
i, top_candidates$parent_gene[i],
top_candidates$max_fst[i],
top_candidates$total_outlier_windows[i],
ifelse(nchar(top_candidates$description[i]) > 50,
paste0(substr(top_candidates$description[i], 1, 50), "..."),
top_candidates$description[i])))
}
}
# 3. Gene functional categories
cat("\n3. GENE FUNCTIONAL CATEGORIES WITH OUTLIERS:\n")
if ("gene_type" %in% colnames(gene_summary)) {
gene_type_summary <- gene_summary %>%
group_by(gene_type) %>%
summarise(
n_genes = n(),
avg_fst = mean(max_fst),
avg_outliers = mean(total_outlier_windows)
) %>%
arrange(desc(n_genes))
print(kable(gene_type_summary,
caption = "Functional categories of genes containing FST outliers"))
}
# 4. Recommendations for follow-up
cat("\n4. RECOMMENDATIONS FOR FOLLOW-UP:\n")
cat(" - Validate top candidates with independent methods (e.g., GWAS, expression QTL)\n")
cat(" - Check if outlier genes are enriched for specific pathways (GO enrichment)\n")
cat(" - Investigate population-specific patterns in candidate genes\n")
cat(" - Compare with known defence-related gene families\n")
cat(" - Consider functional validation (knockouts, overexpression)\n")
} else {
cat("No gene overlap data available for interpretation\n")
}
## 1. OVERALL SUMMARY:
## - Total genes analyzed: 22359
## - Genes containing FST outliers: 21
## - Percentage of genes with outliers: 0.1 %
##
## 2. TOP CANDIDATE GENES FOR SELECTION:
## Genes with FST > 99th percentile: 7
## Highest FST: 0.7192
##
## Top 5 candidate genes:
## 1. AT1G80600: FST=0.7192, 6 outlier windows, HOPW1-1-interacting 1
## 2. AT1G12290: FST=0.6852, 6 outlier windows, Disease resistance protein (CC-NBS-LRR class) fami...
## 3. AT1G19660: FST=0.5822, 8 outlier windows, Wound-responsive family protein
## 4. AT1G03475: FST=0.5306, 2 outlier windows, Coproporphyrinogen III oxidase
## 5. AT1G08860: FST=0.5277, 10 outlier windows, Calcium-dependent phospholipid-binding Copine fami...
##
## 3. GENE FUNCTIONAL CATEGORIES WITH OUTLIERS:
##
##
## Table: Functional categories of genes containing FST outliers
##
## |gene_type | n_genes| avg_fst| avg_outliers|
## |:--------------|-------:|---------:|------------:|
## |protein_coding | 21| 0.4422851| 4.714286|
##
## 4. RECOMMENDATIONS FOR FOLLOW-UP:
## - Validate top candidates with independent methods (e.g., GWAS, expression QTL)
## - Check if outlier genes are enriched for specific pathways (GO enrichment)
## - Investigate population-specific patterns in candidate genes
## - Compare with known defence-related gene families
## - Consider functional validation (knockouts, overexpression)
cat("\n=== SAVING GENE ANALYSIS RESULTS ===\n")
##
## === SAVING GENE ANALYSIS RESULTS ===
# Create directory for gene analysis results
gene_results_dir <- "gene_analysis_results"
if (!dir.exists(gene_results_dir)) {
dir.create(gene_results_dir)
cat("Created directory:", gene_results_dir, "\n")
}
## Created directory: gene_analysis_results
# Save all relevant data
if (exists("overlap_df") && nrow(overlap_df) > 0) {
write_csv(overlap_df, file.path(gene_results_dir, "gene_outlier_overlaps.csv"))
cat("✓ Saved: gene_outlier_overlaps.csv\n")
}
## ✓ Saved: gene_outlier_overlaps.csv
if (exists("gene_summary") && nrow(gene_summary) > 0) {
write_csv(gene_summary, file.path(gene_results_dir, "gene_summary_outliers.csv"))
cat("✓ Saved: gene_summary_outliers.csv\n")
}
## ✓ Saved: gene_summary_outliers.csv
# Save top candidate genes
if (exists("gene_summary") && nrow(gene_summary) > 0) {
top_candidates <- gene_summary %>%
filter(max_fst > quantile(gene_summary$max_fst, 0.9)) %>% # Top 10%
arrange(desc(max_fst))
write_csv(top_candidates, file.path(gene_results_dir, "top_candidate_genes.csv"))
cat("✓ Saved: top_candidate_genes.csv\n")
}
## ✓ Saved: top_candidate_genes.csv
# Save a summary report
summary_report <- c(
paste("Gene Annotation and Overlap Analysis Report"),
paste("Generated:", Sys.Date()),
paste(""),
paste("ANALYSIS PARAMETERS:"),
paste("Genome annotation file:", gff_path),
paste("Total genes annotated:", length(unique(filtered_annotation$parent_gene))),
paste("Defence gene windows analyzed:", nrow(defence_windows)),
paste(""),
paste("RESULTS SUMMARY:"),
if (exists("overlap_df") && nrow(overlap_df) > 0) {
c(paste("Total outlier-gene overlaps:", nrow(overlap_df)),
paste("Unique genes with outliers:", length(unique(overlap_df$parent_gene))),
paste("95% outliers in genes:", sum(grepl("95%", overlap_df$outlier_status))),
paste("99% outliers in genes:", sum(grepl("99%", overlap_df$outlier_status))))
} else {
"No overlaps detected"
},
paste(""),
paste("TOP 5 CANDIDATE GENES:"),
if (exists("gene_summary") && nrow(gene_summary) > 0) {
sapply(1:min(5, nrow(gene_summary)), function(i) {
paste(i, ". ", gene_summary$parent_gene[i],
" (FST=", round(gene_summary$max_fst[i], 4),
", ", gene_summary$total_outlier_windows[i], " outliers)", sep = "")
})
} else {
"No gene summary available"
}
)
writeLines(summary_report, file.path(gene_results_dir, "analysis_report.txt"))
cat("✓ Saved: analysis_report.txt\n")
## ✓ Saved: analysis_report.txt
cat("\nAll gene analysis results saved to:", gene_results_dir, "\n")
##
## All gene analysis results saved to: gene_analysis_results
Explore different visualization approaches to better understand subtle patterns in your data, especially when effect sizes are small. This section demonstrates multiple plotting techniques and integrates gene annotation information from previous analyses.
cat("\n=== ADVANCED VISUALIZATION TECHNIQUES ===\n")
##
## === ADVANCED VISUALIZATION TECHNIQUES ===
# Load required packages
if (!requireNamespace("ggforce", quietly = TRUE)) {
install.packages("ggforce")
library(ggforce)
} else {
library(ggforce)
}
if (!requireNamespace("effsize", quietly = TRUE)) {
install.packages("effsize")
library(effsize)
} else {
library(effsize)
}
cat("\n1. Enhanced π (Nucleotide Diversity) Visualizations\n")
##
## 1. Enhanced π (Nucleotide Diversity) Visualizations
cat("===================================================\n")
## ===================================================
# First, ensure we have the pi_comparison_long data
# This should have been created in the boxplot comparisons section
if (!exists("pi_comparison_long")) {
cat("Creating pi_comparison_long from available data...\n")
pi_comparison_long <- At_data_clean %>%
select(DF, ESP_pi, SWE_pi) %>%
pivot_longer(
cols = c(ESP_pi, SWE_pi),
names_to = "population",
values_to = "pi_value"
) %>%
mutate(
population = gsub("_pi", "", population),
population = factor(population, levels = c("ESP", "SWE"))
)
}
cat("π data summary:\n")
## π data summary:
cat("Total windows:", nrow(pi_comparison_long), "\n")
## Total windows: 159244
cat("By population:", table(pi_comparison_long$population), "\n")
## By population: 79622 79622
cat("By gene category:", table(pi_comparison_long$DF), "\n")
## By gene category: 156160 3084
# --- Plot 1: Boxplot with mean points and labels ---
cat("\nGenerating Plot 1: Boxplot with mean annotations...\n")
##
## Generating Plot 1: Boxplot with mean annotations...
pi_boxplot_linear <- ggplot(pi_comparison_long, aes(x = DF, y = pi_value, fill = DF)) +
geom_boxplot(color = "black", outlier.shape = 21, outlier.size = 1.5) +
facet_wrap(~ population, ncol = 2) +
theme_minimal() +
scale_fill_manual(values = c("BG" = "#66c2a5", "DF" = "#fc8d62"),
name = "Gene Category") +
labs(
title = "Nucleotide Diversity (π): Defence vs Background Genes",
subtitle = "Mean values shown with white diamonds",
x = NULL,
y = expression(paste(pi, " (nucleotide diversity)"))
) +
# Add mean points
stat_summary(fun = mean, geom = "point", shape = 23, size = 4,
fill = "white", color = "black") +
# Add mean value labels
stat_summary(
fun = mean,
geom = "text",
aes(label = sprintf("%.4f", ..y..)),
vjust = -2,
size = 3.5,
fontface = "bold"
) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
plot.subtitle = element_text(hjust = 0.5, size = 11),
legend.position = "bottom",
strip.text = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 11)
)
print(pi_boxplot_linear)
# --- Plot 2: Violin plots with boxplot overlay ---
cat("\nGenerating Plot 2: Violin plots showing full distribution...\n")
##
## Generating Plot 2: Violin plots showing full distribution...
# Calculate statistics for annotation
pi_stats <- pi_comparison_long %>%
group_by(population, DF) %>%
summarise(
mean_pi = mean(pi_value),
median_pi = median(pi_value),
sd_pi = sd(pi_value),
n = n()
)
pi_violin <- ggplot(pi_comparison_long, aes(x = DF, y = pi_value, fill = DF)) +
# Violin shows distribution shape
geom_violin(alpha = 0.7, trim = FALSE, scale = "width", width = 0.8) +
# Boxplot shows quartiles
geom_boxplot(width = 0.15, fill = "white", alpha = 0.9,
outlier.shape = NA, color = "black", size = 0.5) +
# Mean as red diamond
stat_summary(fun = mean, geom = "point", shape = 18,
size = 3.5, color = "red", fill = "red") +
facet_wrap(~ population, ncol = 2) +
theme_minimal() +
scale_fill_manual(values = c("BG" = "#66c2a5", "DF" = "#fc8d62"),
name = "Gene Category") +
labs(
title = "Nucleotide Diversity Distribution by Gene Category",
subtitle = "Violin: full distribution | Box: quartiles | Red diamond: mean",
x = NULL,
y = expression(pi)
) +
# Add sample size annotations
geom_text(data = pi_stats,
aes(x = DF, y = max(pi_comparison_long$pi_value) * 1.05,
label = paste("n =", n)),
size = 3.5, vjust = 0) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
plot.subtitle = element_text(hjust = 0.5, size = 11),
legend.position = "bottom",
strip.text = element_text(size = 12, face = "bold")
)
print(pi_violin)
# --- Plot 3: Density overlay plot ---
cat("\nGenerating Plot 3: Density overlay comparison...\n")
##
## Generating Plot 3: Density overlay comparison...
pi_density <- ggplot(pi_comparison_long, aes(x = pi_value, fill = DF, color = DF)) +
geom_density(alpha = 0.4, size = 0.8) +
facet_wrap(~ population, ncol = 2, scales = "free_y") +
theme_minimal() +
scale_fill_manual(values = c("BG" = "#66c2a5", "DF" = "#fc8d62"),
name = "Gene Category") +
scale_color_manual(values = c("BG" = "#1b9e77", "DF" = "#d95f02"),
name = "Gene Category") +
labs(
title = "Density Distribution of π Values",
subtitle = "Direct comparison of distribution shapes",
x = expression(pi),
y = "Density"
) +
# Add vertical lines for means
geom_vline(data = pi_stats,
aes(xintercept = mean_pi, color = DF),
linetype = "dashed", size = 0.8, alpha = 0.7) +
# Annotate means
geom_text(data = pi_stats,
aes(x = mean_pi, y = Inf,
label = sprintf("Mean = %.4f", mean_pi),
color = DF),
vjust = 2, hjust = -0.1, size = 3.5, fontface = "bold") +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
plot.subtitle = element_text(hjust = 0.5, size = 11),
legend.position = "bottom",
strip.text = element_text(size = 12, face = "bold")
)
print(pi_density)
cat("\n✓ All π visualizations generated successfully\n")
##
## ✓ All π visualizations generated successfully
cat("\n2. Effect Size Calculations\n")
##
## 2. Effect Size Calculations
cat("===========================\n")
## ===========================
# Calculate effect sizes (Cohen's d) for the differences
cat("Calculating effect sizes (Cohen's d) for all comparisons:\n")
## Calculating effect sizes (Cohen's d) for all comparisons:
effect_sizes <- data.frame(
Statistic = character(),
Population = character(),
Cohens_d = numeric(),
Interpretation = character(),
stringsAsFactors = FALSE
)
# Function to calculate and interpret Cohen's d
calculate_effect_size <- function(bg_values, df_values, stat_name, pop_name) {
if (length(bg_values) > 1 && length(df_values) > 1) {
result <- tryCatch({
d <- cohen.d(bg_values, df_values)
list(
d = d$estimate,
conf_int = d$conf.int
)
}, error = function(e) {
# Simple calculation if cohen.d fails
mean_diff <- mean(df_values) - mean(bg_values)
pooled_sd <- sqrt((sd(bg_values)^2 + sd(df_values)^2)/2)
list(
d = mean_diff/pooled_sd,
conf_int = c(NA, NA)
)
})
# Interpret magnitude
abs_d <- abs(result$d)
if (abs_d < 0.2) interpretation <- "Negligible"
else if (abs_d < 0.5) interpretation <- "Small"
else if (abs_d < 0.8) interpretation <- "Medium"
else interpretation <- "Large"
return(data.frame(
Statistic = stat_name,
Population = pop_name,
Cohens_d = round(result$d, 3),
CI_lower = ifelse(is.null(result$conf_int[1]), NA, round(result$conf_int[1], 3)),
CI_upper = ifelse(is.null(result$conf_int[2]), NA, round(result$conf_int[2], 3)),
Interpretation = interpretation,
stringsAsFactors = FALSE
))
}
return(NULL)
}
# Calculate for π
esp_pi_bg <- At_data_clean$ESP_pi[At_data_clean$DF == "BG"]
esp_pi_df <- At_data_clean$ESP_pi[At_data_clean$DF == "DF"]
swe_pi_bg <- At_data_clean$SWE_pi[At_data_clean$DF == "BG"]
swe_pi_df <- At_data_clean$SWE_pi[At_data_clean$DF == "DF"]
effect_sizes <- rbind(effect_sizes,
calculate_effect_size(esp_pi_bg, esp_pi_df, "π", "ESP"),
calculate_effect_size(swe_pi_bg, swe_pi_df, "π", "SWE")
)
# Calculate for Tajima's D
esp_td_bg <- At_data_clean$ESP_td[At_data_clean$DF == "BG"]
esp_td_df <- At_data_clean$ESP_td[At_data_clean$DF == "DF"]
swe_td_bg <- At_data_clean$SWE_td[At_data_clean$DF == "BG"]
swe_td_df <- At_data_clean$SWE_td[At_data_clean$DF == "DF"]
effect_sizes <- rbind(effect_sizes,
calculate_effect_size(esp_td_bg, esp_td_df, "Tajima's D", "ESP"),
calculate_effect_size(swe_td_bg, swe_td_df, "Tajima's D", "SWE")
)
# Calculate for FST (only one FST value per window)
fst_bg <- At_data_clean$ESP_SWE_fst[At_data_clean$DF == "BG"]
fst_df <- At_data_clean$ESP_SWE_fst[At_data_clean$DF == "DF"]
effect_sizes <- rbind(effect_sizes,
calculate_effect_size(fst_bg, fst_df, "FST", "ESP-SWE")
)
# Print effect size table
cat("\nEffect Sizes (Cohen's d) for Defence vs Background Comparisons:\n")
##
## Effect Sizes (Cohen's d) for Defence vs Background Comparisons:
print(kable(effect_sizes,
caption = "Effect sizes with interpretation: |d| < 0.2 = negligible, 0.2-0.5 = small, 0.5-0.8 = medium, >0.8 = large"))
##
##
## Table: Effect sizes with interpretation: |d| < 0.2 = negligible, 0.2-0.5 = small, 0.5-0.8 = medium, >0.8 = large
##
## |Statistic |Population | Cohens_d| CI_lower| CI_upper|Interpretation |
## |:----------|:----------|--------:|--------:|--------:|:--------------|
## |π |ESP | 0.072| 0.022| 0.123|Negligible |
## |π |SWE | 0.055| 0.005| 0.106|Negligible |
## |Tajima's D |ESP | 0.004| -0.046| 0.055|Negligible |
## |Tajima's D |SWE | 0.031| -0.019| 0.082|Negligible |
## |FST |ESP-SWE | -0.018| -0.068| 0.032|Negligible |
# Summary interpretation
cat("\nInterpretation Summary:\n")
##
## Interpretation Summary:
for (i in 1:nrow(effect_sizes)) {
row <- effect_sizes[i, ]
direction <- ifelse(row$Cohens_d > 0, "higher", "lower")
cat(sprintf("- %s in %s: %s effect (d = %.2f), %s in defence genes\n",
row$Statistic, row$Population, row$Interpretation,
row$Cohens_d, direction))
}
## - π in ESP: Negligible effect (d = 0.07), higher in defence genes
## - π in SWE: Negligible effect (d = 0.06), higher in defence genes
## - Tajima's D in ESP: Negligible effect (d = 0.00), higher in defence genes
## - Tajima's D in SWE: Negligible effect (d = 0.03), higher in defence genes
## - FST in ESP-SWE: Negligible effect (d = -0.02), lower in defence genes
# Save effect sizes
if (!dir.exists("results")) dir.create("results")
write_csv(effect_sizes, "results/effect_sizes.csv")
cat("\n✓ Effect sizes saved to: results/effect_sizes.csv\n")
##
## ✓ Effect sizes saved to: results/effect_sizes.csv
cat("\n3. Gene-Integrated Visualizations\n")
##
## 3. Gene-Integrated Visualizations
cat("==================================\n")
## ==================================
# Check if we have the necessary data from previous sections
required_data <- c("overlap_df", "gene_summary", "filtered_annotation",
"defence_windows", "At_data_clean")
available_data <- required_data[required_data %in% ls()]
cat("Available data from previous sections:\n")
## Available data from previous sections:
cat(paste(" -", available_data, collapse = "\n"), "\n")
## - overlap_df
## - gene_summary
## - filtered_annotation
## - defence_windows
## - At_data_clean
if (all(c("overlap_df", "gene_summary", "filtered_annotation") %in% available_data) &&
nrow(overlap_df) > 0 && nrow(gene_summary) > 0) {
# --- Step 1: Enhanced FST plot with gene annotations ---
cat("\nGenerating enhanced FST plot with gene annotations...\n")
# Prepare data for plotting
# Get all positive FST values for background
positive_fst <- At_data_clean %>%
filter(ESP_SWE_fst > 0) %>%
select(mid, ESP_SWE_fst, DF)
# Prepare gene data for plotting (first 40 genes for clarity)
gene_plot_data <- filtered_annotation %>%
filter(seqname == "Chr1") %>%
mutate(gene_mid = (start + end) / 2) %>%
arrange(start) %>%
head(40) # Limit to first 40 genes to avoid overcrowding
# Get threshold values (from previous section)
if (!exists("def_threshold_95")) {
def_threshold_95 <- quantile(defence_windows$ESP_SWE_fst, 0.95, na.rm = TRUE)
}
if (!exists("def_threshold_99")) {
def_threshold_99 <- quantile(defence_windows$ESP_SWE_fst, 0.99, na.rm = TRUE)
}
# --- FIXED CODE: Build the plot step by step ---
# Start with the base plot
enhanced_fst_plot <- ggplot()
# Add background points
enhanced_fst_plot <- enhanced_fst_plot +
geom_point(data = positive_fst,
aes(x = mid/1e6, y = ESP_SWE_fst),
color = "gray90", size = 0.3, alpha = 0.2)
# Add defence gene FST points
enhanced_fst_plot <- enhanced_fst_plot +
geom_point(data = defence_windows,
aes(x = mid/1e6, y = ESP_SWE_fst),
color = "steelblue", size = 0.8, alpha = 0.5)
# Add outlier points
enhanced_fst_plot <- enhanced_fst_plot +
geom_point(data = overlap_df,
aes(x = outlier_mid/1e6, y = fst_value,
color = outlier_status, shape = outlier_status),
size = 2, alpha = 0.9)
# Add gene tracks (below FST plot)
enhanced_fst_plot <- enhanced_fst_plot +
geom_segment(data = gene_plot_data,
aes(x = start/1e6, xend = end/1e6, y = -0.04, yend = -0.04),
size = 3.5, color = "#2ca25f", alpha = 0.7)
# Add gene labels for top 8 genes with outliers
top_genes <- overlap_df %>%
group_by(parent_gene) %>%
summarise(
mid_pos = mean((gene_start + gene_end)/2/1e6),
max_fst = max(fst_value),
gene_type = first(gene_type)
) %>%
arrange(desc(max_fst)) %>%
head(8)
if (nrow(top_genes) > 0) {
# Add gene name labels
enhanced_fst_plot <- enhanced_fst_plot +
geom_text(data = top_genes,
aes(x = mid_pos, y = -0.06, label = parent_gene),
size = 3, angle = 45, hjust = 1, vjust = 1,
check_overlap = TRUE, color = "#084081", fontface = "italic")
# Add gene type labels
enhanced_fst_plot <- enhanced_fst_plot +
geom_text(data = top_genes,
aes(x = mid_pos, y = -0.075, label = gene_type),
size = 2.5, angle = 0, hjust = 0.5, vjust = 1,
color = "#0868ac", alpha = 0.8)
}
# Add threshold lines
enhanced_fst_plot <- enhanced_fst_plot +
geom_hline(yintercept = def_threshold_95, color = "#ff7f00",
linetype = "dashed", size = 0.8, alpha = 0.8) +
geom_hline(yintercept = def_threshold_99, color = "#e31a1c",
linetype = "dashed", size = 0.8, alpha = 0.8) +
# Add mean FST line
geom_hline(yintercept = mean(defence_windows$ESP_SWE_fst, na.rm = TRUE),
color = "#6a3d9a", linetype = "dashed", size = 0.8)
# Add scales and labels
enhanced_fst_plot <- enhanced_fst_plot +
scale_color_manual(values = c("95% Outlier" = "#ff7f00",
"99% Outlier" = "#e31a1c")) +
scale_shape_manual(values = c("95% Outlier" = 17,
"99% Outlier" = 15)) +
scale_y_continuous(
limits = c(-0.08, max(positive_fst$ESP_SWE_fst, na.rm = TRUE) * 1.1),
sec.axis = sec_axis(~., name = "Gene Track",
breaks = c(-0.04),
labels = c("Genes"))
) +
labs(x = "Position on Chromosome 1 (Mb)",
y = expression(bolditalic(F)[ST]),
title = "Integrated View: FST Outliers and Gene Annotations",
subtitle = paste(nrow(gene_summary), "genes contain",
nrow(overlap_df), "outlier windows |",
ifelse(nrow(top_genes) > 0,
paste(nrow(top_genes), "top genes labeled"),
"No genes labeled")),
color = "Outlier Status",
shape = "Outlier Status") +
theme_minimal() +
theme(
legend.position = "bottom",
legend.box = "horizontal",
plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
plot.subtitle = element_text(hjust = 0.5, size = 12),
axis.title.y.right = element_text(color = "#2ca25f", face = "bold"),
axis.text.y.right = element_text(color = "#2ca25f")
) +
guides(color = guide_legend(nrow = 1), shape = guide_legend(nrow = 1))
print(enhanced_fst_plot)
# --- Additional gene-focused visualizations ---
cat("\nGenerating additional gene-focused visualizations...\n")
# Plot 1: Distribution of outlier counts per gene
p1 <- ggplot(gene_summary, aes(x = total_outlier_windows)) +
geom_histogram(binwidth = 1, fill = "#6baed6", alpha = 0.8,
color = "black", boundary = 0) +
geom_vline(xintercept = mean(gene_summary$total_outlier_windows),
color = "#e31a1c", linetype = "dashed", size = 1) +
annotate("text", x = mean(gene_summary$total_outlier_windows),
y = Inf, label = paste("Mean =", round(mean(gene_summary$total_outlier_windows), 2)),
vjust = 2, hjust = -0.1, color = "#e31a1c", fontface = "bold") +
labs(x = "Number of Outlier Windows per Gene",
y = "Number of Genes",
title = "Distribution of Outlier Windows Across Genes",
subtitle = paste("Average of", round(mean(gene_summary$total_outlier_windows), 2),
"outliers per gene (red dashed line)")) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5))
print(p1)
# Plot 2: Top genes with most outliers
top_n <- min(15, nrow(gene_summary))
if (top_n > 0) {
p2_data <- gene_summary %>%
arrange(desc(total_outlier_windows)) %>%
head(top_n) %>%
mutate(parent_gene = factor(parent_gene, levels = rev(parent_gene)))
p2 <- ggplot(p2_data, aes(x = parent_gene, y = total_outlier_windows, fill = max_fst)) +
geom_bar(stat = "identity", alpha = 0.9) +
geom_text(aes(label = paste(total_outlier_windows, "windows")),
hjust = -0.1, size = 3.2, fontface = "bold") +
coord_flip() +
scale_fill_gradient2(low = "#2c7bb6", mid = "#ffffbf", high = "#d7191c",
midpoint = median(gene_summary$max_fst),
name = "Maximum FST") +
labs(x = "Gene ID", y = "Number of Outlier Windows",
title = paste("Top", top_n, "Genes with Most Outlier Windows"),
subtitle = "Color indicates maximum FST value in each gene") +
theme_minimal() +
theme(
axis.text.y = element_text(face = "italic", size = 10),
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "bottom"
) +
expand_limits(y = max(p2_data$total_outlier_windows) * 1.2)
print(p2)
}
# Plot 3: Gene length vs FST with multiple aesthetics
p3 <- ggplot(gene_summary, aes(x = gene_length/1000, y = max_fst)) +
geom_point(aes(size = total_outlier_windows,
color = total_outlier_windows,
alpha = total_outlier_windows)) +
geom_smooth(method = "lm", se = TRUE, color = "#636363",
fill = "#bdbdbd", alpha = 0.2) +
scale_color_gradient(low = "#2c7bb6", high = "#d7191c",
name = "Outlier count") +
scale_size_continuous(range = c(2, 8), name = "Outlier count") +
scale_alpha_continuous(range = c(0.5, 1), guide = "none") +
labs(x = "Gene Length (kb)",
y = "Maximum FST in Gene",
title = "Relationship Between Gene Length and Maximum FST",
subtitle = "Size and color indicate number of outlier windows per gene") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "right"
)
print(p3)
cat("\n✓ Gene-integrated visualizations generated successfully\n")
# Save the plots
if (!dir.exists("results/plots")) dir.create("results/plots", recursive = TRUE)
ggsave("results/plots/integrated_fst_gene_plot.png", enhanced_fst_plot,
width = 16, height = 10, dpi = 300)
cat("✓ Main plot saved to: results/plots/integrated_fst_gene_plot.png\n")
} else {
cat("\nWARNING: Insufficient data for gene-integrated visualizations.\n")
cat("Required data not available: \n")
missing <- setdiff(c("overlap_df", "gene_summary", "filtered_annotation"), available_data)
if (length(missing) > 0) {
cat(paste(" -", missing, collapse = "\n"), "\n")
} else {
cat(" Data exists but has 0 rows. Check if overlap_df and gene_summary are empty.\n")
}
cat("\nPlease run the gene annotation section (#17) first.\n")
}
##
## Generating enhanced FST plot with gene annotations...
##
## Generating additional gene-focused visualizations...
##
## ✓ Gene-integrated visualizations generated successfully
## ✓ Main plot saved to: results/plots/integrated_fst_gene_plot.png
Negative values: Excess of low-frequency variants (potential selective sweep or population expansion)
Positive values: Deficiency of low-frequency variants (balancing selection or population contraction)
Comparison: Defence genes often show more negative Tajima’s D due to purifying selection
Lower π in defence genes: Suggests purifying selection or selective sweeps
Higher π in defence genes: Could indicate balancing selection
Compare between populations: Differential selection pressures
High FST: Strong population differentiation (potential local adaptation)
FST > 0.15: Considered high differentiation
Outliers: Genes with FST above 95th/99th percentiles are candidates for selection
p < 0.05: Significant difference between defence and background
Effect size: Cohen’s d > 0.8 indicates large effect
Consistency: Check if patterns are consistent across both populations
| Pattern | Possible Interpretation |
|---|---|
| Low π + Negative Tajima’s D | Recent selective sweep |
| High π + Positive Tajima’s D | Balancing selection |
| High FST outliers | Local adaptation |
| Consistent patterns in both populations | Species-wide selection |
| Population-specific patterns | Local adaptation |
Extend to all chromosomes
Gene Ontology enrichment of outlier defence genes (This will be your next step)
Compare with known selection signatures