Integrative Landscape Genomics: Partitioning Geographic and Environmental Effects on Genetic Diversity and Differentiation

A Multivariate Statistical Framework Using PCA, RDA, and GAM Analyses

Author

Dr Tahir Ali | AG Thines | Senckenberg Biodiversity and Climate Research Centre

OVERARCHING THEMES & INTEGRATION

1. Partitioning Geographic vs. Environmental Effects

First script: Uses GLMs with both geographic (lat, lon, alt) and environmental predictors.
Second script: Explicitly partitions variance using partial RDAs.
Third script: Includes both geographic distance and environmental differences in models.

2. Addressing Different Biological Questions

Genetic diversity drivers (First script): What factors influence within-population genetic variation?
Genetic differentiation patterns (Second & Third scripts): What factors explain between-population genetic differences?

3. Hierarchical Model Building

Exploratory (PCA/correlation) → Hypothesis testing (GLMs) → Complex relationships (GAMs) → Multivariate synthesis (RDA)

4. Biological Interpretation Framework

Neutral processes: Geographic distance → isolation-by-distance
Selective processes: Environmental differences → isolation-by-environment/ecology
Interactive effects: Geographic isolation modifying environmental selection

5. Statistical Robustness Features

Permutation testing for RDA significance
Model selection criteria (AIC, ANOVA comparisons)
Residual diagnostics for model assumptions
Multicollinearity assessment via correlation matrices

The scripts collectively provide a comprehensive toolkit for landscape genetics, allowing researchers to disentangle the complex interplay between geographic isolation, environmental gradients, and genetic patterns at both within-population (diversity) and between-population (differentiation) levels.

Key Integration Across Scripts

These three scripts provide a comprehensive analytical pipeline for landscape genetics:

Script 1 (PCA & Regression): Identifies key environmental gradients and their linear relationships with within-population genetic diversity.
Script 2 (RDA): Multivariate analysis partitioning geographic vs. environmental effects on between-population genetic differentiation.
Script 3 (GAMs): Flexible modeling of non-linear relationships and distance-decay functions for pairwise genetic distances.

Biological Workflow:

Start with PCA to understand environmental dimensionality
Use RDA to test environmental vs. geographic effects on genetic structure
Apply GAMs to model complex non-linear relationships
Integrate results to understand both neutral (geographic) and selective (environmental) processes

Script 1: Principle Component Analysis & Regression Modelling

Data Preparation & Variable Selection

Statistical reasoning: Creating subsets of variables (mydata2, mydata3, etc.) reduces dimensionality and focuses analysis on biologically/climatically relevant predictors. This helps avoid overfitting and multicollinearity issues.
Biological reasoning: Variables like He (expected heterozygosity), Hj (observed heterozygosity), and climate variables (Bio1-Bio19 from WorldClim) represent genetic diversity measures and environmental gradients that potentially influence genetic structure.

Data Scaling

scale() function: Standardizes variables to mean=0, SD=1. This ensures variables measured in different units (temperature, precipitation, soil properties) are comparable and prevents variables with larger ranges from dominating PCA/regression.

Correlation Analysis

cor() and pairs(): Identifies multicollinearity among predictors. High correlations suggest redundant information, guiding variable selection for models.

Principal Component Analysis (PCA)

princomp(): Reduces dimensionality by transforming correlated variables into orthogonal principal components.
Biological application: Identifies major environmental gradients that may drive genetic differentiation. PCA axes represent combinations of environmental variables that explain most variation in the dataset.

Generalized Linear Models (GLMs)

Sequential model building (Model1-Model10): Stepwise refinement to identify most significant predictors of genetic diversity (Hj).
Statistical reasoning: GLMs assess linear relationships while allowing for different error distributions. Model comparison via ANOVA and AIC helps select best-fitting, most parsimonious model.

Model Diagnostics & Selection

stepAIC(): Automated variable selection balancing model fit and complexity.
calc.relimp(): Quantifies relative importance of each predictor.
anova() comparisons: Tests if more complex models significantly improve fit.

Visualization

ggbiplot(): Visualizes PCA results with grouping by categorical variables (metapopulations/regions).
Biological interpretation: Clusters in PCA space may indicate genetically similar populations experiencing similar environmental conditions.

{r}
#######################################################
###### Principle Component Analysis & Regression ######
#######################################################

# SET WORKING DIRECTORY
# Sets the location where data files are stored
setwd("D:\\R_Working_files\\RawGeno\\Project_RawGeno\\Mp_All\\RawGeno_1\\RawGeno_1_final-data\\Mp_2n-only\\Mp2n_Regional_dataset\\Rangewide\\He_Model")

# LOAD LIBRARIES
library(ggplot2)      # For advanced plotting
library(ggbiplot)     # For PCA biplots with ggplot
library(MASS)         # For stepAIC function
library(relaimpo)     # For relative importance analysis

# READ DATA
# Reads tab-delimited file with genetic and environmental data
mydata <- read.table("Mp2n_ClimSoilData_F-He.txt", header = TRUE, sep = "\t")

# INITIAL DATA SUBSETTING
# Statistical: Selects key variables to reduce dimensionality
# Biological: Focuses on lat/lon (geography), He/Hj (genetic diversity), 
#             F (inbreeding), alt (elevation), and key bioclimatic/soil variables
mydata2 <- cbind(mydata$lat, mydata$lon, mydata$He, mydata$Hj, mydata$FAFLP, mydata$alt,
                 mydata$Bio1, mydata$Bio3, mydata$Bio8, mydata$Bio9, mydata$Bio15,
                 mydata$PHCA, mydata$EXCA, mydata$EXMG, mydata$SILT, mydata$SAND, 
                 mydata$ECE, mydata$ESP, mydata$landuse, mydata$BS1, mydata$Ai)

# RENAME COLUMNS FOR CLARITY
colnames(mydata2) <- c("lat", "lon", "He", "Hj", "F", "Alt", "bio1", "bio3", "bio8", 
                       "bio9", "bio15", "PHCA", "EXCA", "EXMG", "SILT", "SAND", 
                       "ECE", "ESP", "landuse", "BS1", "Ai")

# CREATE ADDITIONAL SUBSET FOR ANALYSIS
# Statistical: Further refinement based on preliminary analyses
# Biological: Focus on heterozygosity (Hj) and key environmental predictors
mydat3 <- cbind(mydata$Hj, mydata$Bio3, mydata$Bio5, mydata$Bio8, mydata$Bio12, 
                mydata$Bio13, mydata$Bio14, mydata$Bio16, mydata$Bio17, mydata$Bio19,
                mydata$CLAY, mydata$DRAINAGE, mydata$EXK, mydata$EXMG, mydata$EXNA,
                mydata$SAND, mydata$SILT, mydata$Ai, mydata$AvPrec1_4, mydata$AvPrec9_12)

colnames(mydat3) <- c("Hj", "bio3", "bio5", "bio8", "bio12", "bio13", "bio14",
                      "bio16", "bio17", "bio19", "CLAY", "DRAINAGE", "EXK", "EXMG", 
                      "EXNA", "SAND", "SILT", "Ai", "Prec1_4", "Prec9_12")

# CREATE AVERAGE CLIMATE VARIABLES
# Statistical: Reduces monthly data to seasonal aggregates to avoid collinearity
# Biological: Seasonal averages may better capture biologically relevant conditions
mydat <- mydata[, 5:114]
mydat$AvTmin9_12 <- (mydat$Tmin9 + mydat$Tmin10 + mydata$Tmin11 + mydata$Tmin12) / 4
mydat$AvTmax4_7 <- (mydat$Tmax4 + mydat$Tmax5 + mydat$Tmax6 + mydat$Tmax7) / 4
mydat$AvPrec1_4 <- (mydat$Prec1 + mydat$Prec2 + mydat$Prec3 + mydat$Prec4) / 4
mydat$AvPrec9_12 <- (mydat$Prec9 + mydat$Prec10 + mydat$Prec11 + mydat$Prec12) / 4

# DATA SCALING
# Statistical: Standardizes all variables to mean=0, SD=1
# Biological: Makes variables measured in different units (temp, precipitation, soil) comparable
# Note: Removes rows with NAs to ensure complete cases
DiploidData.scale2 <- as.data.frame(apply(mydat3[apply(mydat3, 1, function(X) sum(is.na(X))) == 0, ], 2, scale))
DiploidData.scale <- as.data.frame(apply(mydata2[apply(mydata2, 1, function(X) sum(is.na(X))) == 0, ], 2, scale))

# CHECK CORRELATIONS
# Statistical: Identifies multicollinearity among predictors
# Biological: Highly correlated variables may represent redundant environmental gradients
mydata3.df <- as.data.frame(mydat3)
corr <- cor(mydata3.df)
write.table(corr, file = "correlations_20var.csv", sep = ",")

# SIMPLE LINEAR REGRESSION
# Statistical: Tests basic linear relationships
# Biological: Examines if specific environmental factors directly affect genetic diversity
fit <- lm(Hj ~ Prec1_4, data = DiploidData.scale2)
summary(fit)

# POLYNOMIAL REGRESSION
# Statistical: Tests for non-linear (quadratic) relationships
# Biological: Genetic diversity may have optimal ranges along environmental gradients
fit <- lm(Hj ~ Tmean8 + I(Tmean8^2), data = DiploidData.scale2)
summary(fit)

# MULTIPLE REGRESSION
# Statistical: Tests combined effects of multiple predictors
# Biological: Genetic diversity likely influenced by multiple interacting factors
Diploid.reg1 <- lm(Hj ~ CLAY + SAND + Prec9_12, data = DiploidData.scale2)
summary(Diploid.reg1)

# PRINCIPAL COMPONENT ANALYSIS
# Statistical: Reduces dimensionality, identifies major gradients of variation
# Biological: Identifies composite environmental axes that may drive genetic patterns
mydata5 <- cbind(mydata$longitude, mydata$Hj, mydata$elevation..m., mydata$Bio1,
                 mydata$Bio2, mydata$Bio4, mydata$Bio8, mydata$Bio12, mydata$Bio19,
                 mydata$landuse.class..GlobeCover., mydata$AWC, mydata$CACO3,
                 mydata$ECE, mydata$ESP, mydata$GRAV, mydata$SAND, mydata$SILT,
                 mydata$TN)

colnames(mydata5) <- c("lon", "Hj", "Alt", "bio1", "bio2", "bio4", "bio8", "bio12",
                       "bio19", "landuse", "AWC", "CACO3", "ECE", "ESP", "GRAV", 
                       "SAND", "SILT", "TN")

Mp.pca <- princomp(mydata5, scores = TRUE, cor = TRUE)
summary(Mp.pca)

# VISUALIZE PCA RESULTS
# Statistical: Scree plot shows variance explained by each component
# Biological: Helps decide how many components represent biologically meaningful gradients
plot(Mp.pca, type = "l")

# PCA BIPLOT
# Statistical: Shows variables and samples in reduced space
# Biological: Reveals which environmental variables co-vary and how populations cluster
biplot(Mp.pca)

# GROUPED PCA VISUALIZATION
# Statistical: Colors points by categorical grouping variables
# Biological: Tests if genetically defined groups occupy distinct environmental space
cat.var2 <- mydata[, 2]  # Metapopulations
cat.var3 <- mydata[, 3]  # Regions

g <- ggbiplot(Mp.pca, obs.scale = 0.5, var.scale = 0.5, 
              groups = cat.var2, ellipse = TRUE, circle = TRUE)
g <- g + scale_color_discrete(name = '') + 
  theme(legend.direction = 'horizontal', legend.position = 'top')
print(g)

# STEPWISE MODEL SELECTION
# Statistical: Automated variable selection balancing fit and complexity
# Biological: Identifies minimal set of environmental factors explaining genetic diversity
fit <- lm(Hj ~ bio3 + bio5 + bio8 + CLAY + SILT + DRAINAGE + SAND + SILT + Prec9_12, 
          data = DiploidData.scale2)
step <- stepAIC(fit, direction = "both")
step$anova

# FINAL MODEL BUILDING AND COMPARISON
# Statistical: Sequential model refinement based on significance and parsimony
# Biological: Builds understanding of which factors most strongly influence genetic diversity
Model1 <- glm(Hj ~ lat + lon + F + Alt + bio1 + bio2 + bio3 + bio4 + bio5 + 
                bio8 + bio10 + bio11 + bio13 + bio19 + BS1 + CLAY + 
                ECE + ESP + EXNA + GRAV + PHH2O + SAND + SILT + TN, 
              data = as.data.frame(mydata2))

# Simplified through sequential testing
Model8 <- glm(Hj ~ lon + bio2 + bio3 + bio5 + bio11 + ECE + PHCA + SAND + SILT, 
              data = as.data.frame(mydata2))

# MODEL COMPARISON
# Statistical: ANOVA tests if more complex models significantly improve fit
# Biological: Determines whether additional variables provide meaningful biological insight
anova(Model8, Model9, Model10, test = 'F')

# RELATIVE IMPORTANCE ANALYSIS
# Statistical: Quantifies contribution of each predictor to explained variance
# Biological: Ranks environmental factors by their influence on genetic diversity
calc.relimp(Model8, type = c("lmg"), rela = TRUE)
boot <- boot.relimp(Model8, b = 1000, type = c("lmg"), rank = TRUE, diff = TRUE, rela = TRUE)
plot(booteval.relimp(boot, sort = TRUE))

# RESIDUAL ANALYSIS
# Statistical: Checks model assumptions and fit quality
# Biological: Identifies populations where model predictions are poor
boxplot(Model8$residuals, Model9$residuals, Model10$residuals,
        names = c("GLM 8", "GLM 9", "GLM 10"))

# FINAL PCA ON SELECTED VARIABLES
# Statistical: PCA on variables retained in final model
# Biological: Visualizes environmental space defined by most important factors
mydataML <- cbind(mydata$latitude, mydata$Hj, mydata$elevation..m.,
                  mydata$Bio2, mydata$Bio4, mydata$Bio5, mydata$Bio8, mydata$Bio9,
                  mydata$Bio10, mydata$Bio11, mydata$Bio14, mydata$Bio16, 
                  mydata$Bio17, mydata$Bio19, mydata$landuse.class..GlobeCover.,
                  mydata$BS1, mydata$CACO3, mydata$CLAY, mydata$GRAV, mydata$PHCA,
                  mydata$SAND)

colnames(mydataML) <- c("lat", "Hj", "Alt", "bio2", "bio4", "bio5", "bio8", "bio9",
                        "bio10", "bio11", "bio14", "bio16", "bio17", "bio19", 
                        "landuse", "BS1", "CACO3", "CLAY", "GRAV", "PHCA", "SAND")

Mp.pcaML <- princomp(mydataML, scores = TRUE, cor = TRUE, scale = TRUE)
summary(Mp.pcaML)

# CUSTOMIZED PCA PLOT
# Statistical: Enhanced visualization for publication
# Biological: Clear presentation of population clustering in environmental space
g <- ggbiplot(Mp.pcaML, obs.scale = 2, var.scale = 2, 
              groups = cat.var3, ellipse = TRUE, circle = FALSE,
              var.axes = TRUE, varname.size = 4) +
  geom_point(aes(colour = cat.var3), size = 4, shape = 16) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'vertical', legend.position = 'right',
        legend.key.size = unit(5, "mm"), legend.text = element_text(size = 15)) +
  ggtitle("Principal Component Analysis") +
  theme(plot.title = element_text(size = 20, vjust = 1),
        axis.title.x = element_text(size = 15, color = "black", vjust = -0.35),
        axis.title.y = element_text(size = 15, color = "black", vjust = 0.35))
print(g)

Script 2: Redundancy Analysis (RDA) For Genetic Environment Correlation

Multivariate Approach to Genetic-Environmental Correlation

Statistical reasoning: RDA is a constrained ordination method that directly relates genetic distance matrices to environmental predictors.
Biological application: Tests whether environmental variables explain genetic differentiation patterns beyond geographic distance.

Partial RDA

rda(GenData ~ Climate + Condition(Geography)): Controls for geographic effects to isolate pure environmental influence.
rda(GenData ~ Geography + Condition(Climate)): Controls for environmental effects to isolate pure geographic influence.
Key biological insight: Partitions variance into components explained by (1) environment alone, (2) geography alone, (3) shared environment-geography, and (4) unexplained.

Model Selection & Significance Testing

ordistep(): Stepwise selection of environmental variables.
anova() with permutations: Non-parametric significance testing robust to non-normal distributions common in genetic data.

Variance Partitioning

Statistical concept: Quantifies proportion of genetic variation explained by different factor sets.
Biological interpretation: Answers whether isolation-by-distance (geography) or isolation-by-environment (ecological selection) dominates genetic differentiation.

{r}
#######################################################
###### Redundancy Analysis (RDA) ######################
###### For Partitioning Geographic vs. Climatic Effects
#######################################################

# LOAD LIBRARIES
library(vegan)    # For RDA and ordination methods
library(permute)  # For permutation tests
library(lattice)  # For advanced plotting

# SET WORKING DIRECTORY
setwd("D:\\Ali_Lenovo\\WorK_20150729\\Mp6X_Analysis\\Multiple Regression\\Italy\\RDA")

# READ ENVIRONMENTAL DATA
# Contains climate and soil variables for each population
ClimData <- read.table('Italy_ClimData.txt', header = TRUE)

# SUBSET ENVIRONMENTAL VARIABLES
# Statistical: Selects key variables based on biological relevance and preliminary analyses
# Biological: Focuses on temperature, precipitation, and soil factors likely to influence genetics
ClimData2 <- cbind(ClimData$Bio1, ClimData$Bio3, ClimData$Bio10, ClimData$Bio14,
                   ClimData$Bio15, ClimData$Bio16, ClimData$Bio17, ClimData$BS1,
                   ClimData$CLAY, ClimData$EXNA, ClimData$PHCA, ClimData$TN)

colnames(ClimData2) <- c("Bio1", "Bio3", "Bio10", "Bio14", "Bio15", "Bio16",
                         "Bio17", "BS1", "CLAY", "EXNA", "PHCA", "TN")

# CHECK CORRELATIONS AMONG PREDICTORS
# Statistical: Avoids multicollinearity in constrained ordination
# Biological: Ensures selected variables represent independent environmental axes
ClimData2.df <- as.data.frame(ClimData2)
cor(ClimData2.df)
pairs(ClimData2.df)

# READ GENETIC DATA
# Pairwise Fst matrix - genetic differentiation between populations
# Statistical: Response matrix for multivariate analysis
# Biological: Quantifies genetic divergence due to drift, selection, and migration
GenData <- as.matrix(read.table('GenData.txt', header = TRUE, sep = "\t", 
                                row.names = 1, as.is = TRUE))
GenData <- scale(GenData)  # Standardize genetic distances

# FULL RDA MODEL
# Statistical: Constrained ordination relating genetic distances to all environmental predictors
# Biological: Tests if environmental variables collectively explain genetic differentiation
RDA <- rda(GenData ~ Bio4 + Bio9 + Bio10 + Bio11 + Bio14 + Bio17 + landuse + BS1 +
             CACO3 + CLAY + EXCA + EXK + EXNA + GRAV + PHCA + PHH2O + SILT, ClimData)
summary(RDA)

# STEPWISE MODEL SELECTION FOR RDA
# Statistical: Identifies minimal adequate set of environmental predictors
# Biological: Finds which specific environmental factors best explain genetic patterns
mod0 <- rda(GenData ~ Bio1 + Bio2 + Bio3 + Bio4 + Bio5 + Bio6 + Bio7 + Bio8 + Bio9 + 
              Bio10 + Bio11 + Bio12 + Bio13 + Bio14 + Bio15 + Bio16 + Bio17 + Bio18 + 
              Bio19 + alt + landuse + BS1 + CACO3 + CEC + CLAY + ESP + EXCA + EXK +
              EXMG + EXNA + GRAV + PHCA + PHH2O + SAND + SILT + TC + TN, data = ClimData)

mod1 <- rda(GenData ~ 1, data = ClimData)  # Null model

# Automated stepwise selection with permutation tests
ordistep(mod0, scope = formula(mod1), direction = "both",
         Pin = 0.05, R2scope = TRUE, permutations = how(nperm = 499),
         trace = TRUE)

# RDA WITH SELECTED VARIABLES
# Statistical: Reduced model with significant predictors only
# Biological: Focuses on environmentally driven genetic differentiation
RDA4 <- rda(GenData ~ Bio1 + Bio3 + Bio8 + Bio14 + PHCA + EXNA + BS1, data = ClimData)
summary(RDA4)

# SIGNIFICANCE TESTING WITH PERMUTATIONS
# Statistical: Non-parametric testing robust to distributional assumptions
# Biological: Determines if environment-genetics relationships are statistically significant
anova(RDA4)  # Overall model significance
anova(RDA4, by = "term", permu = 200)  # Significance of individual terms
anova(RDA4, by = "axis", perm = 500)   # Significance of each RDA axis

# VISUALIZE RDA RESULTS
# Statistical: Biplot shows samples and environmental vectors in reduced space
# Biological: Reveals which environmental gradients align with genetic differentiation
plot(RDA4, type = "none", choices = c(1, 2), 
     main = "Genetic Distance vs Climate Variables")
text(RDA4, dis = "cn", cex = 1, col = "red")  # Environmental vectors
points(RDA4, pch = 21, col = "red", bg = "yellow", cex = 0.6)  # Population scores

# ADD GROUPING INFORMATION
# Statistical: Overlays categorical grouping on ordination
# Biological: Tests if genetically/metapopulation groups occupy distinct environmental space
metapop <- ClimData[, 3]  # Metapopulation categories
reg <- ClimData[, 4]      # Region categories

# Add ellipses showing group dispersion
pl <- ordiellipse(RDA4, reg, kind = "sd", label = TRUE, cex = 2)
pl <- ordiellipse(RDA4, metapop, kind = "sd", label = TRUE, cex = 1)

# PARTIAL RDA - CONTROL FOR GEOGRAPHY
# Statistical: Conditions on geographic variables to isolate pure environmental effects
# Biological: Tests environmental adaptation independent of isolation-by-distance
pRDA4 <- rda(GenData ~ Bio1 + Bio3 + Bio8 + Bio14 + BS1 + PHCA + EXNA + 
               Condition(lat + lon + alt), ClimData)

# Significance testing for partial RDA
anova(pRDA4, by = "term", permu = 200)
anova(pRDA4, by = "axis", perm = 500)

# Visualize partial RDA
plot(pRDA4, type = "none", choices = c(1, 2), 
     main = "Genetic Distance vs Climate (Controlling for Geography)")
text(pRDA4, dis = "cn", cex = 1, col = "red")
points(pRDA4, pch = 21, col = "red", bg = "yellow", cex = 1)

# PARTIAL RDA - CONTROL FOR CLIMATE
# Statistical: Conditions on environmental variables to isolate pure geographic effects
# Biological: Tests isolation-by-distance independent of environmental adaptation
pRDA2 <- rda(GenData ~ lat + lon + alt + Condition(Bio1 + Bio3 + Bio8 + Bio14 + 
                                                     BS1 + PHCA + EXNA), ClimData)

anova(pRDA2)
summary(pRDA2)

# FULL MODEL WITH BOTH GEOGRAPHY AND CLIMATE
# Statistical: Combined model with all predictors
# Biological: Tests total explained variance and interactions
RDAfull <- rda(GenData ~ lat + lon + alt + Bio1 + Bio3 + Bio8 + Bio14 + 
                 BS1 + PHCA + EXNA, ClimData)

anova(RDAfull)
summary(RDAfull)

# VARIANCE PARTITIONING ANALYSIS
# Statistical: Quantifies proportions of genetic variance explained by different factors
# Biological: Answers key landscape genetics questions:
# 1. How much variance is explained by environment alone?
# 2. How much by geography alone?
# 3. How much by their overlap?
# 4. How much remains unexplained?

cat("=== VARIANCE PARTITIONING SUMMARY ===\n")
cat("1. Climate not controlled for geography:\n")
summary(RDA4)$constr.chi / summary(RDA4)$tot.chi * 100

cat("\n2. Climate controlled for geography (pure environmental effects):\n")
summary(pRDA4)$constr.chi / summary(pRDA4)$tot.chi * 100

cat("\n3. Geography controlled for climate (pure geographic effects):\n")
summary(pRDA2)$constr.chi / summary(pRDA2)$tot.chi * 100

cat("\n4. Full model (climate + geography):\n")
summary(RDAfull)$constr.chi / summary(RDAfull)$tot.chi * 100

# ENHANCED VISUALIZATION FOR PUBLICATION
par(pty = "s")  # Square plot
plot(RDAfull, scaling = 2, display = c("sp", "lc", "cn"),
     xlim = c(-2, 2), ylim = c(-2, 2),
     main = "RDA Triplot: Genetic Distance vs Environment + Geography")
pl <- ordiellipse(RDAfull, reg, kind = "sd", label = TRUE, cex = 2)
box()

# R² ADJUSTED FOR MODEL COMPLEXITY
# Statistical: Adjusts explained variance for number of predictors
# Biological: Provides more realistic estimate of environmental influence
RsquareAdj(RDA4)
RsquareAdj(pRDA4)
RsquareAdj(pRDA2)
RsquareAdj(RDAfull)

# FINAL INTERPRETATION
cat("\n=== KEY BIOLOGICAL INTERPRETATIONS ===\n")
cat("1. Significant RDA axes indicate environmental gradients driving genetic divergence.\n")
cat("2. Partial RDAs separate isolation-by-distance from isolation-by-environment.\n")
cat("3. Variance partitioning quantifies relative importance of geography vs environment.\n")
cat("4. Environmental vectors show which factors align with genetic differentiation.\n")

Script 3: Generalized Additive Models (GAMs)

For Non-linear Genetic-Environment Relationships

Analyzing Pairwise Genetic Distances

Statistical innovation: Uses pairwise FST matrices as response variables, with corresponding pairwise environmental differences as predictors.
Key challenge: Account for non-independence of pairwise comparisons.

Flexible Modeling of Relationships

gam() with s() terms: Allows non-linear relationships between predictors and genetic differentiation.
Biological relevance: Genetic differentiation may respond non-linearly to environmental gradients (threshold effects, optimal ranges).

Model Comparison Framework

Linear vs. quadratic vs. smooth terms: Tests whether simple linear, curved, or complex non-linear relationships best explain patterns.
Biological insight: Shape of distance-FST relationship informs dispersal models (steep vs. gradual isolation-by-distance).

Scale Consideration

Analysis with both scaled and unscaled geographic distance: Distinguishes statistical significance from biological magnitude of effects.

{r}
#######################################################
###### Generalized Additive Models (GAMs) #############
###### For Non-linear Genetic-Environment Relationships
#######################################################

# LOAD LIBRARIES
library(gdata)    # For reading Excel files
library(mgcv)     # For Generalized Additive Models
library(permute)  # For permutation tests

# SET WORKING DIRECTORY
setwd("D:\\R_Working_files\\RawGeno\\Project_RawGeno\\Mp_All\\RawGeno_1\\RawGeno_1_final-data\\Mp_Xn_4n\\STR-MpXn_Regional2local_Scale\\Multiple Regression\\Italy\\GLM_GAM")

# PERL PATH FOR READING EXCEL FILES
perl <- "C:\\Users\\atahir\\Documents\\R\\win-library\\3.1\\perl\\bin\\perl.exe"

##################
#### DATA PREPARATION
##################

# READ GEOGRAPHIC DISTANCE MATRIX
# Statistical: Predictor representing isolation-by-distance
# Biological: Geographic distance expected to increase genetic differentiation via limited gene flow
Geog.tmp <- read.xls("GGD_IT.xlsx", perl = perl, as.is = TRUE)

# CREATE COMPREHENSIVE DATAFRAME
# Statistical: Combines pairwise Fst with pairwise environmental differences
# Biological: Tests how environmental dissimilarity affects genetic divergence
DiploidData <- data.frame(
  # Response: Pairwise Fst (genetic differentiation)
  Fst = c(as.matrix(read.xls("Fst_IT_Xn.xlsx", perl = perl)[, -1], sheet = 1)),
  
  # Environmental difference predictors:
  BIO3 = c(apply(read.xls("BIO3-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),  # Isothermality
  BIO8 = c(apply(read.xls("BIO8-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),  # Temp wettest quarter
  BIO14 = c(apply(read.xls("BIO14-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Prec driest month
  PHCA = c(apply(read.xls("PHCA-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),   # Soil pH
  EXNA = c(apply(read.xls("EXNA-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),   # Exchangeable sodium
  CLAY = c(apply(read.xls("CLAY-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),   # Clay content
  ESP = c(apply(read.xls("ESP-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),     # Exchangeable sodium %
  CACO3 = c(apply(read.xls("CACO3-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Calcium carbonate
  TN = c(apply(read.xls("TN-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),       # Total nitrogen
  
  # Geographic distance (unscaled, will scale separately)
  Geog = c(apply(Geog.tmp[, colnames(Geog.tmp) != "X"], 2, as.numeric))
)

rm(Geog.tmp)  # Clean up temporary object

# DATA SCALING AND CLEANING
# Statistical: Standardizes variables, removes incomplete cases
# Biological: Ensures comparability, focuses on complete population pairs
DiploidData.scale <- as.data.frame(
  apply(DiploidData[apply(DiploidData, 1, function(X) sum(is.na(X))) == 0, ], 2, scale)
)

# Add unscaled geographic distance back for interpretation
DiploidData.scale$Geog.Unsc <- DiploidData$Geog[apply(DiploidData, 1, function(X) sum(is.na(X))) == 0]

# EXPLORATORY DATA ANALYSIS
# Statistical: Checks correlations and distributions
# Biological: Identifies potential collinearity issues and data patterns
pairs(DiploidData.scale[, 1:10])  # Quick visualization
cor_matrix <- cor(DiploidData.scale[, -ncol(DiploidData.scale)])
print("Correlation matrix (first 5x5):")
print(cor_matrix[1:5, 1:5])

##################
#### MODEL BUILDING
##################

# LINEAR MODEL 1: ALL VARIABLES
# Statistical: Baseline linear model
# Biological: Tests linear relationships between environmental differences and Fst
Diploid.reg <- lm(Fst ~ Geog + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN, 
                  data = DiploidData.scale)
summary(Diploid.reg)

# MODEL DIAGNOSTICS
# Statistical: Checks linear model assumptions
# Biological: Identifies patterns suggesting non-linear relationships or outliers
par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))
plot(fitted(Diploid.reg), resid(Diploid.reg))
abline(h = 0, col = 2)
title("Residuals vs Fitted")

# Check residuals against each predictor
predictors <- c("Geog", "BIO3", "BIO8", "BIO14", "PHCA", "EXNA", "CLAY", "ESP", "CACO3", "TN")
for (pred in predictors[1:6]) {
  plot(DiploidData.scale[[pred]], resid(Diploid.reg))
  abline(h = 0, col = 2)
  title(paste("Residuals vs", pred))
}

# LINEAR MODEL 2: WITH QUADRATIC GEOGRAPHIC TERM
# Statistical: Tests for non-linear isolation-by-distance
# Biological: Gene flow may decrease non-linearly with distance (e.g., stepping-stone model)
Diploid.reg2 <- lm(Fst ~ Geog + I(Geog^2) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN, 
                   data = DiploidData.scale)
summary(Diploid.reg2)

# VISUALIZE QUADRATIC EFFECT
# Statistical: Shows shape of distance-decay relationship
# Biological: Informs about dispersal kernel shape and connectivity
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
GeogEff <- DiploidData.scale$Geog * coef(Diploid.reg2)["Geog"] + 
  (DiploidData.scale$Geog^2) * coef(Diploid.reg2)["I(Geog^2)"]
plot(DiploidData.scale$Geog, GeogEff, xlab = "Scaled Geographic Distance", 
     ylab = "Effect on Fst", main = "Quadratic Distance Effect")
plot(DiploidData.scale$Geog, fitted(Diploid.reg2), xlab = "Scaled Geographic Distance",
     ylab = "Predicted Fst", main = "Model Predictions")

# GENERALIZED ADDITIVE MODEL (GAM)
# Statistical: Flexible non-parametric smoothing of relationships
# Biological: Captures complex, non-linear responses to environmental gradients
Diploid.gam <- gam(Fst ~ s(Geog) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN, 
                   data = DiploidData.scale)

# VISUALIZE GAM SMOOTHS
# Statistical: Shows fitted smooth functions for each non-linear term
# Biological: Reveals shape of environmental response curves (optimal, threshold, etc.)
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(Diploid.gam, shade = TRUE, shade.col = "lightblue", seWithMean = TRUE)
title("GAM Smooth Terms", outer = TRUE, line = -1)

# GAM DIAGNOSTICS
# Statistical: Checks GAM assumptions and fit
# Biological: Ensures model adequately captures biological patterns
par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))
plot(fitted(Diploid.gam), resid(Diploid.gam))
abline(h = 0, col = 2)
title("GAM: Residuals vs Fitted")

for (pred in predictors[1:6]) {
  plot(DiploidData.scale[[pred]], resid(Diploid.gam))
  abline(h = 0, col = 2)
  title(paste("GAM Residuals vs", pred))
}

# GAM SUMMARY
# Statistical: Provides significance tests and model metrics
# Biological: Identifies which environmental factors significantly affect genetic differentiation
gam_summary <- summary(Diploid.gam)
print(gam_summary)

##################
#### MODEL COMPARISON
##################

# STATISTICAL COMPARISON OF MODELS
# Statistical: Uses ANOVA and AIC to compare model fit
# Biological: Determines which functional form best represents biological processes
model_comparison <- anova(Diploid.reg, Diploid.reg2, Diploid.gam, test = 'Chisq')
print("Model Comparison (ANOVA):")
print(model_comparison)

# AIC COMPARISON
# Statistical: AIC balances model fit and complexity
# Biological: Selects most parsimonious model explaining genetic patterns
aic_values <- AIC(Diploid.reg, Diploid.reg2, Diploid.gam)
print("AIC Values:")
print(aic_values)

# RESIDUAL COMPARISON
# Statistical: Visual comparison of model fit quality
# Biological: Shows which model best captures unexplained variation
par(mfrow = c(1, 1), mar = c(5, 4, 2, 1))
boxplot(Diploid.reg$residuals, Diploid.reg2$residuals, Diploid.gam$residuals,
        names = c("Linear", "Quadratic", "GAM"),
        ylab = "Residuals", main = "Model Residual Comparison")
abline(h = 0, col = "red", lty = 2)

##################
#### FINAL MODEL WITH UNSCALED DISTANCE
##################

# GAM WITH UNSCALED GEOGRAPHIC DISTANCE
# Statistical: Final model for biological interpretation
# Biological: Allows direct interpretation of distance effects in original units (km)
Diploid.gam.unsc <- gam(Fst ~ s(Geog.Unsc) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + 
                         CACO3 + TN, data = DiploidData.scale)

# FINAL VISUALIZATION
# Statistical: Presents results in biologically interpretable units
# Biological: Shows actual distance-decay relationship for gene flow inference
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))

# 1. Distance effect in km
plot(Diploid.gam.unsc, select = 1, shade = TRUE, shade.col = "lightblue",
     xlab = "Geographic Distance (km)", ylab = "Effect on Fst",
     main = "Distance-Decay Relationship")

# 2. Partial effects of key environmental variables
important_vars <- c("BIO14", "PHCA", "CLAY")  # Based on model significance
for (i in 1:min(3, length(important_vars))) {
  if (important_vars[i] %in% names(DiploidData.scale)) {
    plot(DiploidData.scale[[important_vars[i]]], DiploidData.scale$Fst,
         xlab = important_vars[i], ylab = "Fst",
         main = paste("Fst vs", important_vars[i]))
    # Add smoothed trend line
    lines(lowess(DiploidData.scale[[important_vars[i]]], DiploidData.scale$Fst), 
          col = "red", lwd = 2)
  }
}

title("Final GAM Results: Italy Mp6x", outer = TRUE)

##################
#### BIOLOGICAL INTERPRETATION
##################

cat("\n=== BIOLOGICAL INTERPRETATION OF RESULTS ===\n\n")

# Extract key results
distance_edf <- gam_summary$edf[1]  # Effective degrees of freedom for distance smooth
distance_p <- gam_summary$s.pv[1]   # P-value for distance effect

cat("1. GEOGRAPHIC ISOLATION (ISOLATION-BY-DISTANCE):\n")
if (distance_p < 0.05) {
  cat(sprintf("   • Significant distance effect (p = %.4f)\n", distance_p))
  if (distance_edf > 1.5) {
    cat("   • Non-linear relationship suggests complex dispersal patterns\n")
    cat("   • Possible explanations: Stepping-stone dispersal, barriers at specific distances\n")
  } else {
    cat("   • Approximately linear relationship\n")
    cat("   • Suggests classic isolation-by-distance with relatively uniform dispersal\n")
  }
} else {
  cat("   • No significant isolation-by-distance detected\n")
  cat("   • Gene flow may be extensive or mediated by other factors\n")
}

# Environmental effects
sig_vars <- names(which(gam_summary$p.pv[-1] < 0.05))  # Significant predictors (excluding intercept)
cat("\n2. ENVIRONMENTAL SELECTION (ISOLATION-BY-ENVIRONMENT):\n")
if (length(sig_vars) > 0) {
  cat("   • Significant environmental predictors:\n")
  for (var in sig_vars) {
    effect_dir <- ifelse(coef(Diploid.gam)[var] > 0, "increases", "decreases")
    cat(sprintf("     - %s: %s genetic differentiation (p = %.4f)\n", 
                var, effect_dir, gam_summary$p.pv[var]))
  }
  cat("   • Suggests environmental adaptation driving genetic divergence\n")
} else {
  cat("   • No significant environmental effects detected\n")
  cat("   • Genetic differentiation primarily driven by neutral processes\n")
}

# Model comparison interpretation
cat("\n3. MODEL SELECTION INSIGHTS:\n")
best_model <- rownames(aic_values)[which.min(aic_values$AIC)]
cat(sprintf("   • Best-fitting model: %s\n", best_model))
if (best_model == "Diploid.gam") {
  cat("   • Non-linear models best explain genetic patterns\n")
  cat("   • Suggests complex responses to environmental gradients\n")
} else if (best_model == "Diploid.reg2") {
  cat("   • Quadratic distance model provides best fit\n")
  cat("   • Gene flow decreases non-linearly with distance\n")
} else {
  cat("   • Simple linear model provides best fit\n")
  cat("   • Relationships approximately linear\n")
}

# Explained variance
r2_adj <- summary(Diploid.gam)$r.sq
cat(sprintf("\n4. OVERALL MODEL PERFORMANCE:\n"))
cat(sprintf("   • Adjusted R² = %.3f (%.1f%% of variance explained)\n", r2_adj, r2_adj * 100))
if (r2_adj > 0.3) {
  cat("   • Good explanatory power for landscape genetic patterns\n")
} else if (r2_adj > 0.1) {
  cat("   • Moderate explanatory power\n")
  cat("   • Additional factors (e.g., historical, demographic) may be important\n")
} else {
  cat("   • Limited explanatory power\n")
  cat("   • Genetic patterns may be strongly influenced by stochastic processes\n")
}

cat("\n5. MANAGEMENT IMPLICATIONS:\n")
if (distance_p < 0.05 && length(sig_vars) > 0) {
  cat("   • Both geography and environment important for conservation\n")
  cat("   • Consider both connectivity corridors and environmental gradients\n")
} else if (distance_p < 0.05) {
  cat("   • Focus on maintaining landscape connectivity\n")
  cat("   • Geographic barriers likely limit gene flow\n")
} else if (length(sig_vars) > 0) {
  cat("   • Environmental adaptation key for population persistence\n")
  cat("   • Protect environmental heterogeneity and climate refugia\n")
} else {
  cat("   • Genetic patterns not strongly structured by measured factors\n")
  cat("   • Consider historical factors or finer-scale environmental variables\n")
}

# SAVE FINAL RESULTS
save(Diploid.gam.unsc, DiploidData.scale, gam_summary, aic_values,
     file = "GAM_Analysis_Results.RData")

cat("\n=== ANALYSIS COMPLETE ===\n")
cat("Results saved to GAM_Analysis_Results.RData\n")

Summary

This document presents a comprehensive analytical pipeline for landscape genetics, implementing three complementary statistical approaches to disentangle geographic and environmental drivers of genetic patterns. We demonstrate: (1) Principal Component Analysis (PCA) and regression modeling for identifying key environmental gradients affecting within-population genetic diversity, (2) Redundancy Analysis (RDA) for multivariate assessment of between-population genetic differentiation and variance partitioning, and (3) Generalized Additive Models (GAMs) for flexible modeling of non-linear relationships between environmental factors and genetic distances. Each method addresses distinct questions in landscape genetics while together providing a robust framework for understanding climate-genetic correlations.

--- title: "Integrative Landscape Genomics: Partitioning Geographic and Environmental Effects on Genetic Diversity and Differentiation" subtitle: "A Multivariate Statistical Framework Using PCA, RDA, and GAM Analyses" author: "Dr Tahir Ali | AG Thines | Senckenberg Biodiversity and Climate Research Centre" format: html: toc: true toc-depth: 3 toc-title: "Contents" code-fold: true code-tools: true theme: flatly css: styles.css --- ## **OVERARCHING THEMES & INTEGRATION** ### **1. Partitioning Geographic vs. Environmental Effects** - **First script**: Uses GLMs with both geographic (lat, lon, alt) and environmental predictors. - **Second script**: Explicitly partitions variance using partial RDAs. - **Third script**: Includes both geographic distance and environmental differences in models. ### **2. Addressing Different Biological Questions** - **Genetic diversity drivers** (First script): What factors influence within-population genetic variation? - **Genetic differentiation patterns** (Second & Third scripts): What factors explain between-population genetic differences? ### **3. Hierarchical Model Building** - Exploratory (PCA/correlation) → Hypothesis testing (GLMs) → Complex relationships (GAMs) → Multivariate synthesis (RDA) ### **4. Biological Interpretation Framework** 1. **Neutral processes**: Geographic distance → isolation-by-distance 2. **Selective processes**: Environmental differences → isolation-by-environment/ecology 3. **Interactive effects**: Geographic isolation modifying environmental selection ### **5. Statistical Robustness Features** - Permutation testing for RDA significance - Model selection criteria (AIC, ANOVA comparisons) - Residual diagnostics for model assumptions - Multicollinearity assessment via correlation matrices The scripts collectively provide a comprehensive toolkit for landscape genetics, allowing researchers to disentangle the complex interplay between geographic isolation, environmental gradients, and genetic patterns at both within-population (diversity) and between-population (differentiation) levels. ## **Key Integration Across Scripts** These three scripts provide a comprehensive analytical pipeline for landscape genetics: 1. **Script 1 (PCA & Regression)**: Identifies key environmental gradients and their linear relationships with within-population genetic diversity. 2. **Script 2 (RDA)**: Multivariate analysis partitioning geographic vs. environmental effects on between-population genetic differentiation. 3. **Script 3 (GAMs)**: Flexible modeling of non-linear relationships and distance-decay functions for pairwise genetic distances. **Biological Workflow**: - Start with PCA to understand environmental dimensionality - Use RDA to test environmental vs. geographic effects on genetic structure - Apply GAMs to model complex non-linear relationships - Integrate results to understand both neutral (geographic) and selective (environmental) processes ## **Script 1:** Principle Component Analysis & Regression Modelling ### **Data Preparation & Variable Selection** - **Statistical reasoning**: Creating subsets of variables (`mydata2`, `mydata3`, etc.) reduces dimensionality and focuses analysis on biologically/climatically relevant predictors. This helps avoid overfitting and multicollinearity issues. - **Biological reasoning**: Variables like `He` (expected heterozygosity), `Hj` (observed heterozygosity), and climate variables (Bio1-Bio19 from WorldClim) represent genetic diversity measures and environmental gradients that potentially influence genetic structure. ### **Data Scaling** - `scale()` function: Standardizes variables to mean=0, SD=1. This ensures variables measured in different units (temperature, precipitation, soil properties) are comparable and prevents variables with larger ranges from dominating PCA/regression. ### **Correlation Analysis** - `cor()` and `pairs()`: Identifies multicollinearity among predictors. High correlations suggest redundant information, guiding variable selection for models. ### **Principal Component Analysis (PCA)** - `princomp()`: Reduces dimensionality by transforming correlated variables into orthogonal principal components. - **Biological application**: Identifies major environmental gradients that may drive genetic differentiation. PCA axes represent combinations of environmental variables that explain most variation in the dataset. ### **Generalized Linear Models (GLMs)** - Sequential model building (Model1-Model10): Stepwise refinement to identify most significant predictors of genetic diversity (`Hj`). - **Statistical reasoning**: GLMs assess linear relationships while allowing for different error distributions. Model comparison via ANOVA and AIC helps select best-fitting, most parsimonious model. ### **Model Diagnostics & Selection** - `stepAIC()`: Automated variable selection balancing model fit and complexity. - `calc.relimp()`: Quantifies relative importance of each predictor. - `anova()` comparisons: Tests if more complex models significantly improve fit. ### **Visualization** - `ggbiplot()`: Visualizes PCA results with grouping by categorical variables (metapopulations/regions). - **Biological interpretation**: Clusters in PCA space may indicate genetically similar populations experiencing similar environmental conditions. ``` r {r} ####################################################### ###### Principle Component Analysis & Regression ###### ####################################################### # SET WORKING DIRECTORY # Sets the location where data files are stored setwd("D:\\R_Working_files\\RawGeno\\Project_RawGeno\\Mp_All\\RawGeno_1\\RawGeno_1_final-data\\Mp_2n-only\\Mp2n_Regional_dataset\\Rangewide\\He_Model") # LOAD LIBRARIES library(ggplot2) # For advanced plotting library(ggbiplot) # For PCA biplots with ggplot library(MASS) # For stepAIC function library(relaimpo) # For relative importance analysis # READ DATA # Reads tab-delimited file with genetic and environmental data mydata <- read.table("Mp2n_ClimSoilData_F-He.txt", header = TRUE, sep = "\t") # INITIAL DATA SUBSETTING # Statistical: Selects key variables to reduce dimensionality # Biological: Focuses on lat/lon (geography), He/Hj (genetic diversity), # F (inbreeding), alt (elevation), and key bioclimatic/soil variables mydata2 <- cbind(mydata$lat, mydata$lon, mydata$He, mydata$Hj, mydata$FAFLP, mydata$alt, mydata$Bio1, mydata$Bio3, mydata$Bio8, mydata$Bio9, mydata$Bio15, mydata$PHCA, mydata$EXCA, mydata$EXMG, mydata$SILT, mydata$SAND, mydata$ECE, mydata$ESP, mydata$landuse, mydata$BS1, mydata$Ai) # RENAME COLUMNS FOR CLARITY colnames(mydata2) <- c("lat", "lon", "He", "Hj", "F", "Alt", "bio1", "bio3", "bio8", "bio9", "bio15", "PHCA", "EXCA", "EXMG", "SILT", "SAND", "ECE", "ESP", "landuse", "BS1", "Ai") # CREATE ADDITIONAL SUBSET FOR ANALYSIS # Statistical: Further refinement based on preliminary analyses # Biological: Focus on heterozygosity (Hj) and key environmental predictors mydat3 <- cbind(mydata$Hj, mydata$Bio3, mydata$Bio5, mydata$Bio8, mydata$Bio12, mydata$Bio13, mydata$Bio14, mydata$Bio16, mydata$Bio17, mydata$Bio19, mydata$CLAY, mydata$DRAINAGE, mydata$EXK, mydata$EXMG, mydata$EXNA, mydata$SAND, mydata$SILT, mydata$Ai, mydata$AvPrec1_4, mydata$AvPrec9_12) colnames(mydat3) <- c("Hj", "bio3", "bio5", "bio8", "bio12", "bio13", "bio14", "bio16", "bio17", "bio19", "CLAY", "DRAINAGE", "EXK", "EXMG", "EXNA", "SAND", "SILT", "Ai", "Prec1_4", "Prec9_12") # CREATE AVERAGE CLIMATE VARIABLES # Statistical: Reduces monthly data to seasonal aggregates to avoid collinearity # Biological: Seasonal averages may better capture biologically relevant conditions mydat <- mydata[, 5:114] mydat$AvTmin9_12 <- (mydat$Tmin9 + mydat$Tmin10 + mydata$Tmin11 + mydata$Tmin12) / 4 mydat$AvTmax4_7 <- (mydat$Tmax4 + mydat$Tmax5 + mydat$Tmax6 + mydat$Tmax7) / 4 mydat$AvPrec1_4 <- (mydat$Prec1 + mydat$Prec2 + mydat$Prec3 + mydat$Prec4) / 4 mydat$AvPrec9_12 <- (mydat$Prec9 + mydat$Prec10 + mydat$Prec11 + mydat$Prec12) / 4 # DATA SCALING # Statistical: Standardizes all variables to mean=0, SD=1 # Biological: Makes variables measured in different units (temp, precipitation, soil) comparable # Note: Removes rows with NAs to ensure complete cases DiploidData.scale2 <- as.data.frame(apply(mydat3[apply(mydat3, 1, function(X) sum(is.na(X))) == 0, ], 2, scale)) DiploidData.scale <- as.data.frame(apply(mydata2[apply(mydata2, 1, function(X) sum(is.na(X))) == 0, ], 2, scale)) # CHECK CORRELATIONS # Statistical: Identifies multicollinearity among predictors # Biological: Highly correlated variables may represent redundant environmental gradients mydata3.df <- as.data.frame(mydat3) corr <- cor(mydata3.df) write.table(corr, file = "correlations_20var.csv", sep = ",") # SIMPLE LINEAR REGRESSION # Statistical: Tests basic linear relationships # Biological: Examines if specific environmental factors directly affect genetic diversity fit <- lm(Hj ~ Prec1_4, data = DiploidData.scale2) summary(fit) # POLYNOMIAL REGRESSION # Statistical: Tests for non-linear (quadratic) relationships # Biological: Genetic diversity may have optimal ranges along environmental gradients fit <- lm(Hj ~ Tmean8 + I(Tmean8^2), data = DiploidData.scale2) summary(fit) # MULTIPLE REGRESSION # Statistical: Tests combined effects of multiple predictors # Biological: Genetic diversity likely influenced by multiple interacting factors Diploid.reg1 <- lm(Hj ~ CLAY + SAND + Prec9_12, data = DiploidData.scale2) summary(Diploid.reg1) # PRINCIPAL COMPONENT ANALYSIS # Statistical: Reduces dimensionality, identifies major gradients of variation # Biological: Identifies composite environmental axes that may drive genetic patterns mydata5 <- cbind(mydata$longitude, mydata$Hj, mydata$elevation..m., mydata$Bio1, mydata$Bio2, mydata$Bio4, mydata$Bio8, mydata$Bio12, mydata$Bio19, mydata$landuse.class..GlobeCover., mydata$AWC, mydata$CACO3, mydata$ECE, mydata$ESP, mydata$GRAV, mydata$SAND, mydata$SILT, mydata$TN) colnames(mydata5) <- c("lon", "Hj", "Alt", "bio1", "bio2", "bio4", "bio8", "bio12", "bio19", "landuse", "AWC", "CACO3", "ECE", "ESP", "GRAV", "SAND", "SILT", "TN") Mp.pca <- princomp(mydata5, scores = TRUE, cor = TRUE) summary(Mp.pca) # VISUALIZE PCA RESULTS # Statistical: Scree plot shows variance explained by each component # Biological: Helps decide how many components represent biologically meaningful gradients plot(Mp.pca, type = "l") # PCA BIPLOT # Statistical: Shows variables and samples in reduced space # Biological: Reveals which environmental variables co-vary and how populations cluster biplot(Mp.pca) # GROUPED PCA VISUALIZATION # Statistical: Colors points by categorical grouping variables # Biological: Tests if genetically defined groups occupy distinct environmental space cat.var2 <- mydata[, 2] # Metapopulations cat.var3 <- mydata[, 3] # Regions g <- ggbiplot(Mp.pca, obs.scale = 0.5, var.scale = 0.5, groups = cat.var2, ellipse = TRUE, circle = TRUE) g <- g + scale_color_discrete(name = '') + theme(legend.direction = 'horizontal', legend.position = 'top') print(g) # STEPWISE MODEL SELECTION # Statistical: Automated variable selection balancing fit and complexity # Biological: Identifies minimal set of environmental factors explaining genetic diversity fit <- lm(Hj ~ bio3 + bio5 + bio8 + CLAY + SILT + DRAINAGE + SAND + SILT + Prec9_12, data = DiploidData.scale2) step <- stepAIC(fit, direction = "both") step$anova # FINAL MODEL BUILDING AND COMPARISON # Statistical: Sequential model refinement based on significance and parsimony # Biological: Builds understanding of which factors most strongly influence genetic diversity Model1 <- glm(Hj ~ lat + lon + F + Alt + bio1 + bio2 + bio3 + bio4 + bio5 + bio8 + bio10 + bio11 + bio13 + bio19 + BS1 + CLAY + ECE + ESP + EXNA + GRAV + PHH2O + SAND + SILT + TN, data = as.data.frame(mydata2)) # Simplified through sequential testing Model8 <- glm(Hj ~ lon + bio2 + bio3 + bio5 + bio11 + ECE + PHCA + SAND + SILT, data = as.data.frame(mydata2)) # MODEL COMPARISON # Statistical: ANOVA tests if more complex models significantly improve fit # Biological: Determines whether additional variables provide meaningful biological insight anova(Model8, Model9, Model10, test = 'F') # RELATIVE IMPORTANCE ANALYSIS # Statistical: Quantifies contribution of each predictor to explained variance # Biological: Ranks environmental factors by their influence on genetic diversity calc.relimp(Model8, type = c("lmg"), rela = TRUE) boot <- boot.relimp(Model8, b = 1000, type = c("lmg"), rank = TRUE, diff = TRUE, rela = TRUE) plot(booteval.relimp(boot, sort = TRUE)) # RESIDUAL ANALYSIS # Statistical: Checks model assumptions and fit quality # Biological: Identifies populations where model predictions are poor boxplot(Model8$residuals, Model9$residuals, Model10$residuals, names = c("GLM 8", "GLM 9", "GLM 10")) # FINAL PCA ON SELECTED VARIABLES # Statistical: PCA on variables retained in final model # Biological: Visualizes environmental space defined by most important factors mydataML <- cbind(mydata$latitude, mydata$Hj, mydata$elevation..m., mydata$Bio2, mydata$Bio4, mydata$Bio5, mydata$Bio8, mydata$Bio9, mydata$Bio10, mydata$Bio11, mydata$Bio14, mydata$Bio16, mydata$Bio17, mydata$Bio19, mydata$landuse.class..GlobeCover., mydata$BS1, mydata$CACO3, mydata$CLAY, mydata$GRAV, mydata$PHCA, mydata$SAND) colnames(mydataML) <- c("lat", "Hj", "Alt", "bio2", "bio4", "bio5", "bio8", "bio9", "bio10", "bio11", "bio14", "bio16", "bio17", "bio19", "landuse", "BS1", "CACO3", "CLAY", "GRAV", "PHCA", "SAND") Mp.pcaML <- princomp(mydataML, scores = TRUE, cor = TRUE, scale = TRUE) summary(Mp.pcaML) # CUSTOMIZED PCA PLOT # Statistical: Enhanced visualization for publication # Biological: Clear presentation of population clustering in environmental space g <- ggbiplot(Mp.pcaML, obs.scale = 2, var.scale = 2, groups = cat.var3, ellipse = TRUE, circle = FALSE, var.axes = TRUE, varname.size = 4) + geom_point(aes(colour = cat.var3), size = 4, shape = 16) + scale_color_discrete(name = '') + theme(legend.direction = 'vertical', legend.position = 'right', legend.key.size = unit(5, "mm"), legend.text = element_text(size = 15)) + ggtitle("Principal Component Analysis") + theme(plot.title = element_text(size = 20, vjust = 1), axis.title.x = element_text(size = 15, color = "black", vjust = -0.35), axis.title.y = element_text(size = 15, color = "black", vjust = 0.35)) print(g) ``` ## Script 2: Redundancy Analysis (RDA) For Genetic Environment Correlation ### **Multivariate Approach to Genetic-Environmental Correlation** - **Statistical reasoning**: RDA is a constrained ordination method that directly relates genetic distance matrices to environmental predictors. - **Biological application**: Tests whether environmental variables explain genetic differentiation patterns beyond geographic distance. ### **Partial RDA** - `rda(GenData ~ Climate + Condition(Geography))`: Controls for geographic effects to isolate pure environmental influence. - `rda(GenData ~ Geography + Condition(Climate))`: Controls for environmental effects to isolate pure geographic influence. - **Key biological insight**: Partitions variance into components explained by (1) environment alone, (2) geography alone, (3) shared environment-geography, and (4) unexplained. ### **Model Selection & Significance Testing** - `ordistep()`: Stepwise selection of environmental variables. - `anova()` with permutations: Non-parametric significance testing robust to non-normal distributions common in genetic data. ### **Variance Partitioning** - **Statistical concept**: Quantifies proportion of genetic variation explained by different factor sets. - **Biological interpretation**: Answers whether isolation-by-distance (geography) or isolation-by-environment (ecological selection) dominates genetic differentiation. ``` r {r} ####################################################### ###### Redundancy Analysis (RDA) ###################### ###### For Partitioning Geographic vs. Climatic Effects ####################################################### # LOAD LIBRARIES library(vegan) # For RDA and ordination methods library(permute) # For permutation tests library(lattice) # For advanced plotting # SET WORKING DIRECTORY setwd("D:\\Ali_Lenovo\\WorK_20150729\\Mp6X_Analysis\\Multiple Regression\\Italy\\RDA") # READ ENVIRONMENTAL DATA # Contains climate and soil variables for each population ClimData <- read.table('Italy_ClimData.txt', header = TRUE) # SUBSET ENVIRONMENTAL VARIABLES # Statistical: Selects key variables based on biological relevance and preliminary analyses # Biological: Focuses on temperature, precipitation, and soil factors likely to influence genetics ClimData2 <- cbind(ClimData$Bio1, ClimData$Bio3, ClimData$Bio10, ClimData$Bio14, ClimData$Bio15, ClimData$Bio16, ClimData$Bio17, ClimData$BS1, ClimData$CLAY, ClimData$EXNA, ClimData$PHCA, ClimData$TN) colnames(ClimData2) <- c("Bio1", "Bio3", "Bio10", "Bio14", "Bio15", "Bio16", "Bio17", "BS1", "CLAY", "EXNA", "PHCA", "TN") # CHECK CORRELATIONS AMONG PREDICTORS # Statistical: Avoids multicollinearity in constrained ordination # Biological: Ensures selected variables represent independent environmental axes ClimData2.df <- as.data.frame(ClimData2) cor(ClimData2.df) pairs(ClimData2.df) # READ GENETIC DATA # Pairwise Fst matrix - genetic differentiation between populations # Statistical: Response matrix for multivariate analysis # Biological: Quantifies genetic divergence due to drift, selection, and migration GenData <- as.matrix(read.table('GenData.txt', header = TRUE, sep = "\t", row.names = 1, as.is = TRUE)) GenData <- scale(GenData) # Standardize genetic distances # FULL RDA MODEL # Statistical: Constrained ordination relating genetic distances to all environmental predictors # Biological: Tests if environmental variables collectively explain genetic differentiation RDA <- rda(GenData ~ Bio4 + Bio9 + Bio10 + Bio11 + Bio14 + Bio17 + landuse + BS1 + CACO3 + CLAY + EXCA + EXK + EXNA + GRAV + PHCA + PHH2O + SILT, ClimData) summary(RDA) # STEPWISE MODEL SELECTION FOR RDA # Statistical: Identifies minimal adequate set of environmental predictors # Biological: Finds which specific environmental factors best explain genetic patterns mod0 <- rda(GenData ~ Bio1 + Bio2 + Bio3 + Bio4 + Bio5 + Bio6 + Bio7 + Bio8 + Bio9 + Bio10 + Bio11 + Bio12 + Bio13 + Bio14 + Bio15 + Bio16 + Bio17 + Bio18 + Bio19 + alt + landuse + BS1 + CACO3 + CEC + CLAY + ESP + EXCA + EXK + EXMG + EXNA + GRAV + PHCA + PHH2O + SAND + SILT + TC + TN, data = ClimData) mod1 <- rda(GenData ~ 1, data = ClimData) # Null model # Automated stepwise selection with permutation tests ordistep(mod0, scope = formula(mod1), direction = "both", Pin = 0.05, R2scope = TRUE, permutations = how(nperm = 499), trace = TRUE) # RDA WITH SELECTED VARIABLES # Statistical: Reduced model with significant predictors only # Biological: Focuses on environmentally driven genetic differentiation RDA4 <- rda(GenData ~ Bio1 + Bio3 + Bio8 + Bio14 + PHCA + EXNA + BS1, data = ClimData) summary(RDA4) # SIGNIFICANCE TESTING WITH PERMUTATIONS # Statistical: Non-parametric testing robust to distributional assumptions # Biological: Determines if environment-genetics relationships are statistically significant anova(RDA4) # Overall model significance anova(RDA4, by = "term", permu = 200) # Significance of individual terms anova(RDA4, by = "axis", perm = 500) # Significance of each RDA axis # VISUALIZE RDA RESULTS # Statistical: Biplot shows samples and environmental vectors in reduced space # Biological: Reveals which environmental gradients align with genetic differentiation plot(RDA4, type = "none", choices = c(1, 2), main = "Genetic Distance vs Climate Variables") text(RDA4, dis = "cn", cex = 1, col = "red") # Environmental vectors points(RDA4, pch = 21, col = "red", bg = "yellow", cex = 0.6) # Population scores # ADD GROUPING INFORMATION # Statistical: Overlays categorical grouping on ordination # Biological: Tests if genetically/metapopulation groups occupy distinct environmental space metapop <- ClimData[, 3] # Metapopulation categories reg <- ClimData[, 4] # Region categories # Add ellipses showing group dispersion pl <- ordiellipse(RDA4, reg, kind = "sd", label = TRUE, cex = 2) pl <- ordiellipse(RDA4, metapop, kind = "sd", label = TRUE, cex = 1) # PARTIAL RDA - CONTROL FOR GEOGRAPHY # Statistical: Conditions on geographic variables to isolate pure environmental effects # Biological: Tests environmental adaptation independent of isolation-by-distance pRDA4 <- rda(GenData ~ Bio1 + Bio3 + Bio8 + Bio14 + BS1 + PHCA + EXNA + Condition(lat + lon + alt), ClimData) # Significance testing for partial RDA anova(pRDA4, by = "term", permu = 200) anova(pRDA4, by = "axis", perm = 500) # Visualize partial RDA plot(pRDA4, type = "none", choices = c(1, 2), main = "Genetic Distance vs Climate (Controlling for Geography)") text(pRDA4, dis = "cn", cex = 1, col = "red") points(pRDA4, pch = 21, col = "red", bg = "yellow", cex = 1) # PARTIAL RDA - CONTROL FOR CLIMATE # Statistical: Conditions on environmental variables to isolate pure geographic effects # Biological: Tests isolation-by-distance independent of environmental adaptation pRDA2 <- rda(GenData ~ lat + lon + alt + Condition(Bio1 + Bio3 + Bio8 + Bio14 + BS1 + PHCA + EXNA), ClimData) anova(pRDA2) summary(pRDA2) # FULL MODEL WITH BOTH GEOGRAPHY AND CLIMATE # Statistical: Combined model with all predictors # Biological: Tests total explained variance and interactions RDAfull <- rda(GenData ~ lat + lon + alt + Bio1 + Bio3 + Bio8 + Bio14 + BS1 + PHCA + EXNA, ClimData) anova(RDAfull) summary(RDAfull) # VARIANCE PARTITIONING ANALYSIS # Statistical: Quantifies proportions of genetic variance explained by different factors # Biological: Answers key landscape genetics questions: # 1. How much variance is explained by environment alone? # 2. How much by geography alone? # 3. How much by their overlap? # 4. How much remains unexplained? cat("=== VARIANCE PARTITIONING SUMMARY ===\n") cat("1. Climate not controlled for geography:\n") summary(RDA4)$constr.chi / summary(RDA4)$tot.chi * 100 cat("\n2. Climate controlled for geography (pure environmental effects):\n") summary(pRDA4)$constr.chi / summary(pRDA4)$tot.chi * 100 cat("\n3. Geography controlled for climate (pure geographic effects):\n") summary(pRDA2)$constr.chi / summary(pRDA2)$tot.chi * 100 cat("\n4. Full model (climate + geography):\n") summary(RDAfull)$constr.chi / summary(RDAfull)$tot.chi * 100 # ENHANCED VISUALIZATION FOR PUBLICATION par(pty = "s") # Square plot plot(RDAfull, scaling = 2, display = c("sp", "lc", "cn"), xlim = c(-2, 2), ylim = c(-2, 2), main = "RDA Triplot: Genetic Distance vs Environment + Geography") pl <- ordiellipse(RDAfull, reg, kind = "sd", label = TRUE, cex = 2) box() # R² ADJUSTED FOR MODEL COMPLEXITY # Statistical: Adjusts explained variance for number of predictors # Biological: Provides more realistic estimate of environmental influence RsquareAdj(RDA4) RsquareAdj(pRDA4) RsquareAdj(pRDA2) RsquareAdj(RDAfull) # FINAL INTERPRETATION cat("\n=== KEY BIOLOGICAL INTERPRETATIONS ===\n") cat("1. Significant RDA axes indicate environmental gradients driving genetic divergence.\n") cat("2. Partial RDAs separate isolation-by-distance from isolation-by-environment.\n") cat("3. Variance partitioning quantifies relative importance of geography vs environment.\n") cat("4. Environmental vectors show which factors align with genetic differentiation.\n") ``` ## **Script 3:** Generalized Additive Models (GAMs) For Non-linear Genetic-Environment Relationships ### **Analyzing Pairwise Genetic Distances** - **Statistical innovation**: Uses pairwise FST matrices as response variables, with corresponding pairwise environmental differences as predictors. - **Key challenge**: Account for non-independence of pairwise comparisons. ### **Flexible Modeling of Relationships** - `gam()` with `s()` terms: Allows non-linear relationships between predictors and genetic differentiation. - **Biological relevance**: Genetic differentiation may respond non-linearly to environmental gradients (threshold effects, optimal ranges). ### **Model Comparison Framework** - Linear vs. quadratic vs. smooth terms: Tests whether simple linear, curved, or complex non-linear relationships best explain patterns. - **Biological insight**: Shape of distance-FST relationship informs dispersal models (steep vs. gradual isolation-by-distance). ### **Scale Consideration** - Analysis with both scaled and unscaled geographic distance: Distinguishes statistical significance from biological magnitude of effects. ``` r {r} ####################################################### ###### Generalized Additive Models (GAMs) ############# ###### For Non-linear Genetic-Environment Relationships ####################################################### # LOAD LIBRARIES library(gdata) # For reading Excel files library(mgcv) # For Generalized Additive Models library(permute) # For permutation tests # SET WORKING DIRECTORY setwd("D:\\R_Working_files\\RawGeno\\Project_RawGeno\\Mp_All\\RawGeno_1\\RawGeno_1_final-data\\Mp_Xn_4n\\STR-MpXn_Regional2local_Scale\\Multiple Regression\\Italy\\GLM_GAM") # PERL PATH FOR READING EXCEL FILES perl <- "C:\\Users\\atahir\\Documents\\R\\win-library\\3.1\\perl\\bin\\perl.exe" ################## #### DATA PREPARATION ################## # READ GEOGRAPHIC DISTANCE MATRIX # Statistical: Predictor representing isolation-by-distance # Biological: Geographic distance expected to increase genetic differentiation via limited gene flow Geog.tmp <- read.xls("GGD_IT.xlsx", perl = perl, as.is = TRUE) # CREATE COMPREHENSIVE DATAFRAME # Statistical: Combines pairwise Fst with pairwise environmental differences # Biological: Tests how environmental dissimilarity affects genetic divergence DiploidData <- data.frame( # Response: Pairwise Fst (genetic differentiation) Fst = c(as.matrix(read.xls("Fst_IT_Xn.xlsx", perl = perl)[, -1], sheet = 1)), # Environmental difference predictors: BIO3 = c(apply(read.xls("BIO3-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Isothermality BIO8 = c(apply(read.xls("BIO8-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Temp wettest quarter BIO14 = c(apply(read.xls("BIO14-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Prec driest month PHCA = c(apply(read.xls("PHCA-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Soil pH EXNA = c(apply(read.xls("EXNA-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Exchangeable sodium CLAY = c(apply(read.xls("CLAY-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Clay content ESP = c(apply(read.xls("ESP-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Exchangeable sodium % CACO3 = c(apply(read.xls("CACO3-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Calcium carbonate TN = c(apply(read.xls("TN-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Total nitrogen # Geographic distance (unscaled, will scale separately) Geog = c(apply(Geog.tmp[, colnames(Geog.tmp) != "X"], 2, as.numeric)) ) rm(Geog.tmp) # Clean up temporary object # DATA SCALING AND CLEANING # Statistical: Standardizes variables, removes incomplete cases # Biological: Ensures comparability, focuses on complete population pairs DiploidData.scale <- as.data.frame( apply(DiploidData[apply(DiploidData, 1, function(X) sum(is.na(X))) == 0, ], 2, scale) ) # Add unscaled geographic distance back for interpretation DiploidData.scale$Geog.Unsc <- DiploidData$Geog[apply(DiploidData, 1, function(X) sum(is.na(X))) == 0] # EXPLORATORY DATA ANALYSIS # Statistical: Checks correlations and distributions # Biological: Identifies potential collinearity issues and data patterns pairs(DiploidData.scale[, 1:10]) # Quick visualization cor_matrix <- cor(DiploidData.scale[, -ncol(DiploidData.scale)]) print("Correlation matrix (first 5x5):") print(cor_matrix[1:5, 1:5]) ################## #### MODEL BUILDING ################## # LINEAR MODEL 1: ALL VARIABLES # Statistical: Baseline linear model # Biological: Tests linear relationships between environmental differences and Fst Diploid.reg <- lm(Fst ~ Geog + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN, data = DiploidData.scale) summary(Diploid.reg) # MODEL DIAGNOSTICS # Statistical: Checks linear model assumptions # Biological: Identifies patterns suggesting non-linear relationships or outliers par(mfrow = c(3, 3), mar = c(3, 3, 1, 1)) plot(fitted(Diploid.reg), resid(Diploid.reg)) abline(h = 0, col = 2) title("Residuals vs Fitted") # Check residuals against each predictor predictors <- c("Geog", "BIO3", "BIO8", "BIO14", "PHCA", "EXNA", "CLAY", "ESP", "CACO3", "TN") for (pred in predictors[1:6]) { plot(DiploidData.scale[[pred]], resid(Diploid.reg)) abline(h = 0, col = 2) title(paste("Residuals vs", pred)) } # LINEAR MODEL 2: WITH QUADRATIC GEOGRAPHIC TERM # Statistical: Tests for non-linear isolation-by-distance # Biological: Gene flow may decrease non-linearly with distance (e.g., stepping-stone model) Diploid.reg2 <- lm(Fst ~ Geog + I(Geog^2) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN, data = DiploidData.scale) summary(Diploid.reg2) # VISUALIZE QUADRATIC EFFECT # Statistical: Shows shape of distance-decay relationship # Biological: Informs about dispersal kernel shape and connectivity par(mfrow = c(1, 2), mar = c(4, 4, 2, 1)) GeogEff <- DiploidData.scale$Geog * coef(Diploid.reg2)["Geog"] + (DiploidData.scale$Geog^2) * coef(Diploid.reg2)["I(Geog^2)"] plot(DiploidData.scale$Geog, GeogEff, xlab = "Scaled Geographic Distance", ylab = "Effect on Fst", main = "Quadratic Distance Effect") plot(DiploidData.scale$Geog, fitted(Diploid.reg2), xlab = "Scaled Geographic Distance", ylab = "Predicted Fst", main = "Model Predictions") # GENERALIZED ADDITIVE MODEL (GAM) # Statistical: Flexible non-parametric smoothing of relationships # Biological: Captures complex, non-linear responses to environmental gradients Diploid.gam <- gam(Fst ~ s(Geog) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN, data = DiploidData.scale) # VISUALIZE GAM SMOOTHS # Statistical: Shows fitted smooth functions for each non-linear term # Biological: Reveals shape of environmental response curves (optimal, threshold, etc.) par(mfrow = c(2, 2), mar = c(4, 4, 2, 1)) plot(Diploid.gam, shade = TRUE, shade.col = "lightblue", seWithMean = TRUE) title("GAM Smooth Terms", outer = TRUE, line = -1) # GAM DIAGNOSTICS # Statistical: Checks GAM assumptions and fit # Biological: Ensures model adequately captures biological patterns par(mfrow = c(3, 3), mar = c(3, 3, 1, 1)) plot(fitted(Diploid.gam), resid(Diploid.gam)) abline(h = 0, col = 2) title("GAM: Residuals vs Fitted") for (pred in predictors[1:6]) { plot(DiploidData.scale[[pred]], resid(Diploid.gam)) abline(h = 0, col = 2) title(paste("GAM Residuals vs", pred)) } # GAM SUMMARY # Statistical: Provides significance tests and model metrics # Biological: Identifies which environmental factors significantly affect genetic differentiation gam_summary <- summary(Diploid.gam) print(gam_summary) ################## #### MODEL COMPARISON ################## # STATISTICAL COMPARISON OF MODELS # Statistical: Uses ANOVA and AIC to compare model fit # Biological: Determines which functional form best represents biological processes model_comparison <- anova(Diploid.reg, Diploid.reg2, Diploid.gam, test = 'Chisq') print("Model Comparison (ANOVA):") print(model_comparison) # AIC COMPARISON # Statistical: AIC balances model fit and complexity # Biological: Selects most parsimonious model explaining genetic patterns aic_values <- AIC(Diploid.reg, Diploid.reg2, Diploid.gam) print("AIC Values:") print(aic_values) # RESIDUAL COMPARISON # Statistical: Visual comparison of model fit quality # Biological: Shows which model best captures unexplained variation par(mfrow = c(1, 1), mar = c(5, 4, 2, 1)) boxplot(Diploid.reg$residuals, Diploid.reg2$residuals, Diploid.gam$residuals, names = c("Linear", "Quadratic", "GAM"), ylab = "Residuals", main = "Model Residual Comparison") abline(h = 0, col = "red", lty = 2) ################## #### FINAL MODEL WITH UNSCALED DISTANCE ################## # GAM WITH UNSCALED GEOGRAPHIC DISTANCE # Statistical: Final model for biological interpretation # Biological: Allows direct interpretation of distance effects in original units (km) Diploid.gam.unsc <- gam(Fst ~ s(Geog.Unsc) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN, data = DiploidData.scale) # FINAL VISUALIZATION # Statistical: Presents results in biologically interpretable units # Biological: Shows actual distance-decay relationship for gene flow inference par(mfrow = c(2, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0)) # 1. Distance effect in km plot(Diploid.gam.unsc, select = 1, shade = TRUE, shade.col = "lightblue", xlab = "Geographic Distance (km)", ylab = "Effect on Fst", main = "Distance-Decay Relationship") # 2. Partial effects of key environmental variables important_vars <- c("BIO14", "PHCA", "CLAY") # Based on model significance for (i in 1:min(3, length(important_vars))) { if (important_vars[i] %in% names(DiploidData.scale)) { plot(DiploidData.scale[[important_vars[i]]], DiploidData.scale$Fst, xlab = important_vars[i], ylab = "Fst", main = paste("Fst vs", important_vars[i])) # Add smoothed trend line lines(lowess(DiploidData.scale[[important_vars[i]]], DiploidData.scale$Fst), col = "red", lwd = 2) } } title("Final GAM Results: Italy Mp6x", outer = TRUE) ################## #### BIOLOGICAL INTERPRETATION ################## cat("\n=== BIOLOGICAL INTERPRETATION OF RESULTS ===\n\n") # Extract key results distance_edf <- gam_summary$edf[1] # Effective degrees of freedom for distance smooth distance_p <- gam_summary$s.pv[1] # P-value for distance effect cat("1. GEOGRAPHIC ISOLATION (ISOLATION-BY-DISTANCE):\n") if (distance_p < 0.05) { cat(sprintf(" • Significant distance effect (p = %.4f)\n", distance_p)) if (distance_edf > 1.5) { cat(" • Non-linear relationship suggests complex dispersal patterns\n") cat(" • Possible explanations: Stepping-stone dispersal, barriers at specific distances\n") } else { cat(" • Approximately linear relationship\n") cat(" • Suggests classic isolation-by-distance with relatively uniform dispersal\n") } } else { cat(" • No significant isolation-by-distance detected\n") cat(" • Gene flow may be extensive or mediated by other factors\n") } # Environmental effects sig_vars <- names(which(gam_summary$p.pv[-1] < 0.05)) # Significant predictors (excluding intercept) cat("\n2. ENVIRONMENTAL SELECTION (ISOLATION-BY-ENVIRONMENT):\n") if (length(sig_vars) > 0) { cat(" • Significant environmental predictors:\n") for (var in sig_vars) { effect_dir <- ifelse(coef(Diploid.gam)[var] > 0, "increases", "decreases") cat(sprintf(" - %s: %s genetic differentiation (p = %.4f)\n", var, effect_dir, gam_summary$p.pv[var])) } cat(" • Suggests environmental adaptation driving genetic divergence\n") } else { cat(" • No significant environmental effects detected\n") cat(" • Genetic differentiation primarily driven by neutral processes\n") } # Model comparison interpretation cat("\n3. MODEL SELECTION INSIGHTS:\n") best_model <- rownames(aic_values)[which.min(aic_values$AIC)] cat(sprintf(" • Best-fitting model: %s\n", best_model)) if (best_model == "Diploid.gam") { cat(" • Non-linear models best explain genetic patterns\n") cat(" • Suggests complex responses to environmental gradients\n") } else if (best_model == "Diploid.reg2") { cat(" • Quadratic distance model provides best fit\n") cat(" • Gene flow decreases non-linearly with distance\n") } else { cat(" • Simple linear model provides best fit\n") cat(" • Relationships approximately linear\n") } # Explained variance r2_adj <- summary(Diploid.gam)$r.sq cat(sprintf("\n4. OVERALL MODEL PERFORMANCE:\n")) cat(sprintf(" • Adjusted R² = %.3f (%.1f%% of variance explained)\n", r2_adj, r2_adj * 100)) if (r2_adj > 0.3) { cat(" • Good explanatory power for landscape genetic patterns\n") } else if (r2_adj > 0.1) { cat(" • Moderate explanatory power\n") cat(" • Additional factors (e.g., historical, demographic) may be important\n") } else { cat(" • Limited explanatory power\n") cat(" • Genetic patterns may be strongly influenced by stochastic processes\n") } cat("\n5. MANAGEMENT IMPLICATIONS:\n") if (distance_p < 0.05 && length(sig_vars) > 0) { cat(" • Both geography and environment important for conservation\n") cat(" • Consider both connectivity corridors and environmental gradients\n") } else if (distance_p < 0.05) { cat(" • Focus on maintaining landscape connectivity\n") cat(" • Geographic barriers likely limit gene flow\n") } else if (length(sig_vars) > 0) { cat(" • Environmental adaptation key for population persistence\n") cat(" • Protect environmental heterogeneity and climate refugia\n") } else { cat(" • Genetic patterns not strongly structured by measured factors\n") cat(" • Consider historical factors or finer-scale environmental variables\n") } # SAVE FINAL RESULTS save(Diploid.gam.unsc, DiploidData.scale, gam_summary, aic_values, file = "GAM_Analysis_Results.RData") cat("\n=== ANALYSIS COMPLETE ===\n") cat("Results saved to GAM_Analysis_Results.RData\n") ``` ## Summary This document presents a comprehensive analytical pipeline for landscape genetics, implementing three complementary statistical approaches to disentangle geographic and environmental drivers of genetic patterns. We demonstrate: (1) Principal Component Analysis (PCA) and regression modeling for identifying key environmental gradients affecting within-population genetic diversity, (2) Redundancy Analysis (RDA) for multivariate assessment of between-population genetic differentiation and variance partitioning, and (3) Generalized Additive Models (GAMs) for flexible modeling of non-linear relationships between environmental factors and genetic distances. Each method addresses distinct questions in landscape genetics while together providing a robust framework for understanding climate-genetic correlations.