Integrative Landscape Genomics: Partitioning Geographic and Environmental Effects on Genetic Diversity and Differentiation

A Multivariate Statistical Framework Using PCA, RDA, and GAM Analyses

Author

Dr Tahir Ali | AG Thines | Senckenberg Biodiversity and Climate Research Centre

OVERARCHING THEMES & INTEGRATION

1. Partitioning Geographic vs. Environmental Effects

  • First script: Uses GLMs with both geographic (lat, lon, alt) and environmental predictors.

  • Second script: Explicitly partitions variance using partial RDAs.

  • Third script: Includes both geographic distance and environmental differences in models.

2. Addressing Different Biological Questions

  • Genetic diversity drivers (First script): What factors influence within-population genetic variation?

  • Genetic differentiation patterns (Second & Third scripts): What factors explain between-population genetic differences?

3. Hierarchical Model Building

  • Exploratory (PCA/correlation) → Hypothesis testing (GLMs) → Complex relationships (GAMs) → Multivariate synthesis (RDA)

4. Biological Interpretation Framework

  1. Neutral processes: Geographic distance → isolation-by-distance

  2. Selective processes: Environmental differences → isolation-by-environment/ecology

  3. Interactive effects: Geographic isolation modifying environmental selection

5. Statistical Robustness Features

  • Permutation testing for RDA significance

  • Model selection criteria (AIC, ANOVA comparisons)

  • Residual diagnostics for model assumptions

  • Multicollinearity assessment via correlation matrices

The scripts collectively provide a comprehensive toolkit for landscape genetics, allowing researchers to disentangle the complex interplay between geographic isolation, environmental gradients, and genetic patterns at both within-population (diversity) and between-population (differentiation) levels.

Key Integration Across Scripts

These three scripts provide a comprehensive analytical pipeline for landscape genetics:

  1. Script 1 (PCA & Regression): Identifies key environmental gradients and their linear relationships with within-population genetic diversity.

  2. Script 2 (RDA): Multivariate analysis partitioning geographic vs. environmental effects on between-population genetic differentiation.

  3. Script 3 (GAMs): Flexible modeling of non-linear relationships and distance-decay functions for pairwise genetic distances.

Biological Workflow:

  • Start with PCA to understand environmental dimensionality

  • Use RDA to test environmental vs. geographic effects on genetic structure

  • Apply GAMs to model complex non-linear relationships

  • Integrate results to understand both neutral (geographic) and selective (environmental) processes

Script 1: Principle Component Analysis & Regression Modelling

Data Preparation & Variable Selection

  • Statistical reasoning: Creating subsets of variables (mydata2, mydata3, etc.) reduces dimensionality and focuses analysis on biologically/climatically relevant predictors. This helps avoid overfitting and multicollinearity issues.

  • Biological reasoning: Variables like He (expected heterozygosity), Hj (observed heterozygosity), and climate variables (Bio1-Bio19 from WorldClim) represent genetic diversity measures and environmental gradients that potentially influence genetic structure.

Data Scaling

  • scale() function: Standardizes variables to mean=0, SD=1. This ensures variables measured in different units (temperature, precipitation, soil properties) are comparable and prevents variables with larger ranges from dominating PCA/regression.

Correlation Analysis

  • cor() and pairs(): Identifies multicollinearity among predictors. High correlations suggest redundant information, guiding variable selection for models.

Principal Component Analysis (PCA)

  • princomp(): Reduces dimensionality by transforming correlated variables into orthogonal principal components.

  • Biological application: Identifies major environmental gradients that may drive genetic differentiation. PCA axes represent combinations of environmental variables that explain most variation in the dataset.

Generalized Linear Models (GLMs)

  • Sequential model building (Model1-Model10): Stepwise refinement to identify most significant predictors of genetic diversity (Hj).

  • Statistical reasoning: GLMs assess linear relationships while allowing for different error distributions. Model comparison via ANOVA and AIC helps select best-fitting, most parsimonious model.

Model Diagnostics & Selection

  • stepAIC(): Automated variable selection balancing model fit and complexity.

  • calc.relimp(): Quantifies relative importance of each predictor.

  • anova() comparisons: Tests if more complex models significantly improve fit.

Visualization

  • ggbiplot(): Visualizes PCA results with grouping by categorical variables (metapopulations/regions).

  • Biological interpretation: Clusters in PCA space may indicate genetically similar populations experiencing similar environmental conditions.

{r}
#######################################################
###### Principle Component Analysis & Regression ######
#######################################################

# SET WORKING DIRECTORY
# Sets the location where data files are stored
setwd("D:\\R_Working_files\\RawGeno\\Project_RawGeno\\Mp_All\\RawGeno_1\\RawGeno_1_final-data\\Mp_2n-only\\Mp2n_Regional_dataset\\Rangewide\\He_Model")

# LOAD LIBRARIES
library(ggplot2)      # For advanced plotting
library(ggbiplot)     # For PCA biplots with ggplot
library(MASS)         # For stepAIC function
library(relaimpo)     # For relative importance analysis

# READ DATA
# Reads tab-delimited file with genetic and environmental data
mydata <- read.table("Mp2n_ClimSoilData_F-He.txt", header = TRUE, sep = "\t")

# INITIAL DATA SUBSETTING
# Statistical: Selects key variables to reduce dimensionality
# Biological: Focuses on lat/lon (geography), He/Hj (genetic diversity), 
#             F (inbreeding), alt (elevation), and key bioclimatic/soil variables
mydata2 <- cbind(mydata$lat, mydata$lon, mydata$He, mydata$Hj, mydata$FAFLP, mydata$alt,
                 mydata$Bio1, mydata$Bio3, mydata$Bio8, mydata$Bio9, mydata$Bio15,
                 mydata$PHCA, mydata$EXCA, mydata$EXMG, mydata$SILT, mydata$SAND, 
                 mydata$ECE, mydata$ESP, mydata$landuse, mydata$BS1, mydata$Ai)

# RENAME COLUMNS FOR CLARITY
colnames(mydata2) <- c("lat", "lon", "He", "Hj", "F", "Alt", "bio1", "bio3", "bio8", 
                       "bio9", "bio15", "PHCA", "EXCA", "EXMG", "SILT", "SAND", 
                       "ECE", "ESP", "landuse", "BS1", "Ai")

# CREATE ADDITIONAL SUBSET FOR ANALYSIS
# Statistical: Further refinement based on preliminary analyses
# Biological: Focus on heterozygosity (Hj) and key environmental predictors
mydat3 <- cbind(mydata$Hj, mydata$Bio3, mydata$Bio5, mydata$Bio8, mydata$Bio12, 
                mydata$Bio13, mydata$Bio14, mydata$Bio16, mydata$Bio17, mydata$Bio19,
                mydata$CLAY, mydata$DRAINAGE, mydata$EXK, mydata$EXMG, mydata$EXNA,
                mydata$SAND, mydata$SILT, mydata$Ai, mydata$AvPrec1_4, mydata$AvPrec9_12)

colnames(mydat3) <- c("Hj", "bio3", "bio5", "bio8", "bio12", "bio13", "bio14",
                      "bio16", "bio17", "bio19", "CLAY", "DRAINAGE", "EXK", "EXMG", 
                      "EXNA", "SAND", "SILT", "Ai", "Prec1_4", "Prec9_12")

# CREATE AVERAGE CLIMATE VARIABLES
# Statistical: Reduces monthly data to seasonal aggregates to avoid collinearity
# Biological: Seasonal averages may better capture biologically relevant conditions
mydat <- mydata[, 5:114]
mydat$AvTmin9_12 <- (mydat$Tmin9 + mydat$Tmin10 + mydata$Tmin11 + mydata$Tmin12) / 4
mydat$AvTmax4_7 <- (mydat$Tmax4 + mydat$Tmax5 + mydat$Tmax6 + mydat$Tmax7) / 4
mydat$AvPrec1_4 <- (mydat$Prec1 + mydat$Prec2 + mydat$Prec3 + mydat$Prec4) / 4
mydat$AvPrec9_12 <- (mydat$Prec9 + mydat$Prec10 + mydat$Prec11 + mydat$Prec12) / 4

# DATA SCALING
# Statistical: Standardizes all variables to mean=0, SD=1
# Biological: Makes variables measured in different units (temp, precipitation, soil) comparable
# Note: Removes rows with NAs to ensure complete cases
DiploidData.scale2 <- as.data.frame(apply(mydat3[apply(mydat3, 1, function(X) sum(is.na(X))) == 0, ], 2, scale))
DiploidData.scale <- as.data.frame(apply(mydata2[apply(mydata2, 1, function(X) sum(is.na(X))) == 0, ], 2, scale))

# CHECK CORRELATIONS
# Statistical: Identifies multicollinearity among predictors
# Biological: Highly correlated variables may represent redundant environmental gradients
mydata3.df <- as.data.frame(mydat3)
corr <- cor(mydata3.df)
write.table(corr, file = "correlations_20var.csv", sep = ",")

# SIMPLE LINEAR REGRESSION
# Statistical: Tests basic linear relationships
# Biological: Examines if specific environmental factors directly affect genetic diversity
fit <- lm(Hj ~ Prec1_4, data = DiploidData.scale2)
summary(fit)

# POLYNOMIAL REGRESSION
# Statistical: Tests for non-linear (quadratic) relationships
# Biological: Genetic diversity may have optimal ranges along environmental gradients
fit <- lm(Hj ~ Tmean8 + I(Tmean8^2), data = DiploidData.scale2)
summary(fit)

# MULTIPLE REGRESSION
# Statistical: Tests combined effects of multiple predictors
# Biological: Genetic diversity likely influenced by multiple interacting factors
Diploid.reg1 <- lm(Hj ~ CLAY + SAND + Prec9_12, data = DiploidData.scale2)
summary(Diploid.reg1)

# PRINCIPAL COMPONENT ANALYSIS
# Statistical: Reduces dimensionality, identifies major gradients of variation
# Biological: Identifies composite environmental axes that may drive genetic patterns
mydata5 <- cbind(mydata$longitude, mydata$Hj, mydata$elevation..m., mydata$Bio1,
                 mydata$Bio2, mydata$Bio4, mydata$Bio8, mydata$Bio12, mydata$Bio19,
                 mydata$landuse.class..GlobeCover., mydata$AWC, mydata$CACO3,
                 mydata$ECE, mydata$ESP, mydata$GRAV, mydata$SAND, mydata$SILT,
                 mydata$TN)

colnames(mydata5) <- c("lon", "Hj", "Alt", "bio1", "bio2", "bio4", "bio8", "bio12",
                       "bio19", "landuse", "AWC", "CACO3", "ECE", "ESP", "GRAV", 
                       "SAND", "SILT", "TN")

Mp.pca <- princomp(mydata5, scores = TRUE, cor = TRUE)
summary(Mp.pca)

# VISUALIZE PCA RESULTS
# Statistical: Scree plot shows variance explained by each component
# Biological: Helps decide how many components represent biologically meaningful gradients
plot(Mp.pca, type = "l")

# PCA BIPLOT
# Statistical: Shows variables and samples in reduced space
# Biological: Reveals which environmental variables co-vary and how populations cluster
biplot(Mp.pca)

# GROUPED PCA VISUALIZATION
# Statistical: Colors points by categorical grouping variables
# Biological: Tests if genetically defined groups occupy distinct environmental space
cat.var2 <- mydata[, 2]  # Metapopulations
cat.var3 <- mydata[, 3]  # Regions

g <- ggbiplot(Mp.pca, obs.scale = 0.5, var.scale = 0.5, 
              groups = cat.var2, ellipse = TRUE, circle = TRUE)
g <- g + scale_color_discrete(name = '') + 
  theme(legend.direction = 'horizontal', legend.position = 'top')
print(g)

# STEPWISE MODEL SELECTION
# Statistical: Automated variable selection balancing fit and complexity
# Biological: Identifies minimal set of environmental factors explaining genetic diversity
fit <- lm(Hj ~ bio3 + bio5 + bio8 + CLAY + SILT + DRAINAGE + SAND + SILT + Prec9_12, 
          data = DiploidData.scale2)
step <- stepAIC(fit, direction = "both")
step$anova

# FINAL MODEL BUILDING AND COMPARISON
# Statistical: Sequential model refinement based on significance and parsimony
# Biological: Builds understanding of which factors most strongly influence genetic diversity
Model1 <- glm(Hj ~ lat + lon + F + Alt + bio1 + bio2 + bio3 + bio4 + bio5 + 
                bio8 + bio10 + bio11 + bio13 + bio19 + BS1 + CLAY + 
                ECE + ESP + EXNA + GRAV + PHH2O + SAND + SILT + TN, 
              data = as.data.frame(mydata2))

# Simplified through sequential testing
Model8 <- glm(Hj ~ lon + bio2 + bio3 + bio5 + bio11 + ECE + PHCA + SAND + SILT, 
              data = as.data.frame(mydata2))

# MODEL COMPARISON
# Statistical: ANOVA tests if more complex models significantly improve fit
# Biological: Determines whether additional variables provide meaningful biological insight
anova(Model8, Model9, Model10, test = 'F')

# RELATIVE IMPORTANCE ANALYSIS
# Statistical: Quantifies contribution of each predictor to explained variance
# Biological: Ranks environmental factors by their influence on genetic diversity
calc.relimp(Model8, type = c("lmg"), rela = TRUE)
boot <- boot.relimp(Model8, b = 1000, type = c("lmg"), rank = TRUE, diff = TRUE, rela = TRUE)
plot(booteval.relimp(boot, sort = TRUE))

# RESIDUAL ANALYSIS
# Statistical: Checks model assumptions and fit quality
# Biological: Identifies populations where model predictions are poor
boxplot(Model8$residuals, Model9$residuals, Model10$residuals,
        names = c("GLM 8", "GLM 9", "GLM 10"))

# FINAL PCA ON SELECTED VARIABLES
# Statistical: PCA on variables retained in final model
# Biological: Visualizes environmental space defined by most important factors
mydataML <- cbind(mydata$latitude, mydata$Hj, mydata$elevation..m.,
                  mydata$Bio2, mydata$Bio4, mydata$Bio5, mydata$Bio8, mydata$Bio9,
                  mydata$Bio10, mydata$Bio11, mydata$Bio14, mydata$Bio16, 
                  mydata$Bio17, mydata$Bio19, mydata$landuse.class..GlobeCover.,
                  mydata$BS1, mydata$CACO3, mydata$CLAY, mydata$GRAV, mydata$PHCA,
                  mydata$SAND)

colnames(mydataML) <- c("lat", "Hj", "Alt", "bio2", "bio4", "bio5", "bio8", "bio9",
                        "bio10", "bio11", "bio14", "bio16", "bio17", "bio19", 
                        "landuse", "BS1", "CACO3", "CLAY", "GRAV", "PHCA", "SAND")

Mp.pcaML <- princomp(mydataML, scores = TRUE, cor = TRUE, scale = TRUE)
summary(Mp.pcaML)

# CUSTOMIZED PCA PLOT
# Statistical: Enhanced visualization for publication
# Biological: Clear presentation of population clustering in environmental space
g <- ggbiplot(Mp.pcaML, obs.scale = 2, var.scale = 2, 
              groups = cat.var3, ellipse = TRUE, circle = FALSE,
              var.axes = TRUE, varname.size = 4) +
  geom_point(aes(colour = cat.var3), size = 4, shape = 16) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'vertical', legend.position = 'right',
        legend.key.size = unit(5, "mm"), legend.text = element_text(size = 15)) +
  ggtitle("Principal Component Analysis") +
  theme(plot.title = element_text(size = 20, vjust = 1),
        axis.title.x = element_text(size = 15, color = "black", vjust = -0.35),
        axis.title.y = element_text(size = 15, color = "black", vjust = 0.35))
print(g)

Script 2: Redundancy Analysis (RDA) For Genetic Environment Correlation

Multivariate Approach to Genetic-Environmental Correlation

  • Statistical reasoning: RDA is a constrained ordination method that directly relates genetic distance matrices to environmental predictors.

  • Biological application: Tests whether environmental variables explain genetic differentiation patterns beyond geographic distance.

Partial RDA

  • rda(GenData ~ Climate + Condition(Geography)): Controls for geographic effects to isolate pure environmental influence.

  • rda(GenData ~ Geography + Condition(Climate)): Controls for environmental effects to isolate pure geographic influence.

  • Key biological insight: Partitions variance into components explained by (1) environment alone, (2) geography alone, (3) shared environment-geography, and (4) unexplained.

Model Selection & Significance Testing

  • ordistep(): Stepwise selection of environmental variables.

  • anova() with permutations: Non-parametric significance testing robust to non-normal distributions common in genetic data.

Variance Partitioning

  • Statistical concept: Quantifies proportion of genetic variation explained by different factor sets.

  • Biological interpretation: Answers whether isolation-by-distance (geography) or isolation-by-environment (ecological selection) dominates genetic differentiation.

{r}
#######################################################
###### Redundancy Analysis (RDA) ######################
###### For Partitioning Geographic vs. Climatic Effects
#######################################################

# LOAD LIBRARIES
library(vegan)    # For RDA and ordination methods
library(permute)  # For permutation tests
library(lattice)  # For advanced plotting

# SET WORKING DIRECTORY
setwd("D:\\Ali_Lenovo\\WorK_20150729\\Mp6X_Analysis\\Multiple Regression\\Italy\\RDA")

# READ ENVIRONMENTAL DATA
# Contains climate and soil variables for each population
ClimData <- read.table('Italy_ClimData.txt', header = TRUE)

# SUBSET ENVIRONMENTAL VARIABLES
# Statistical: Selects key variables based on biological relevance and preliminary analyses
# Biological: Focuses on temperature, precipitation, and soil factors likely to influence genetics
ClimData2 <- cbind(ClimData$Bio1, ClimData$Bio3, ClimData$Bio10, ClimData$Bio14,
                   ClimData$Bio15, ClimData$Bio16, ClimData$Bio17, ClimData$BS1,
                   ClimData$CLAY, ClimData$EXNA, ClimData$PHCA, ClimData$TN)

colnames(ClimData2) <- c("Bio1", "Bio3", "Bio10", "Bio14", "Bio15", "Bio16",
                         "Bio17", "BS1", "CLAY", "EXNA", "PHCA", "TN")

# CHECK CORRELATIONS AMONG PREDICTORS
# Statistical: Avoids multicollinearity in constrained ordination
# Biological: Ensures selected variables represent independent environmental axes
ClimData2.df <- as.data.frame(ClimData2)
cor(ClimData2.df)
pairs(ClimData2.df)

# READ GENETIC DATA
# Pairwise Fst matrix - genetic differentiation between populations
# Statistical: Response matrix for multivariate analysis
# Biological: Quantifies genetic divergence due to drift, selection, and migration
GenData <- as.matrix(read.table('GenData.txt', header = TRUE, sep = "\t", 
                                row.names = 1, as.is = TRUE))
GenData <- scale(GenData)  # Standardize genetic distances

# FULL RDA MODEL
# Statistical: Constrained ordination relating genetic distances to all environmental predictors
# Biological: Tests if environmental variables collectively explain genetic differentiation
RDA <- rda(GenData ~ Bio4 + Bio9 + Bio10 + Bio11 + Bio14 + Bio17 + landuse + BS1 +
             CACO3 + CLAY + EXCA + EXK + EXNA + GRAV + PHCA + PHH2O + SILT, ClimData)
summary(RDA)

# STEPWISE MODEL SELECTION FOR RDA
# Statistical: Identifies minimal adequate set of environmental predictors
# Biological: Finds which specific environmental factors best explain genetic patterns
mod0 <- rda(GenData ~ Bio1 + Bio2 + Bio3 + Bio4 + Bio5 + Bio6 + Bio7 + Bio8 + Bio9 + 
              Bio10 + Bio11 + Bio12 + Bio13 + Bio14 + Bio15 + Bio16 + Bio17 + Bio18 + 
              Bio19 + alt + landuse + BS1 + CACO3 + CEC + CLAY + ESP + EXCA + EXK +
              EXMG + EXNA + GRAV + PHCA + PHH2O + SAND + SILT + TC + TN, data = ClimData)

mod1 <- rda(GenData ~ 1, data = ClimData)  # Null model

# Automated stepwise selection with permutation tests
ordistep(mod0, scope = formula(mod1), direction = "both",
         Pin = 0.05, R2scope = TRUE, permutations = how(nperm = 499),
         trace = TRUE)

# RDA WITH SELECTED VARIABLES
# Statistical: Reduced model with significant predictors only
# Biological: Focuses on environmentally driven genetic differentiation
RDA4 <- rda(GenData ~ Bio1 + Bio3 + Bio8 + Bio14 + PHCA + EXNA + BS1, data = ClimData)
summary(RDA4)

# SIGNIFICANCE TESTING WITH PERMUTATIONS
# Statistical: Non-parametric testing robust to distributional assumptions
# Biological: Determines if environment-genetics relationships are statistically significant
anova(RDA4)  # Overall model significance
anova(RDA4, by = "term", permu = 200)  # Significance of individual terms
anova(RDA4, by = "axis", perm = 500)   # Significance of each RDA axis

# VISUALIZE RDA RESULTS
# Statistical: Biplot shows samples and environmental vectors in reduced space
# Biological: Reveals which environmental gradients align with genetic differentiation
plot(RDA4, type = "none", choices = c(1, 2), 
     main = "Genetic Distance vs Climate Variables")
text(RDA4, dis = "cn", cex = 1, col = "red")  # Environmental vectors
points(RDA4, pch = 21, col = "red", bg = "yellow", cex = 0.6)  # Population scores

# ADD GROUPING INFORMATION
# Statistical: Overlays categorical grouping on ordination
# Biological: Tests if genetically/metapopulation groups occupy distinct environmental space
metapop <- ClimData[, 3]  # Metapopulation categories
reg <- ClimData[, 4]      # Region categories

# Add ellipses showing group dispersion
pl <- ordiellipse(RDA4, reg, kind = "sd", label = TRUE, cex = 2)
pl <- ordiellipse(RDA4, metapop, kind = "sd", label = TRUE, cex = 1)

# PARTIAL RDA - CONTROL FOR GEOGRAPHY
# Statistical: Conditions on geographic variables to isolate pure environmental effects
# Biological: Tests environmental adaptation independent of isolation-by-distance
pRDA4 <- rda(GenData ~ Bio1 + Bio3 + Bio8 + Bio14 + BS1 + PHCA + EXNA + 
               Condition(lat + lon + alt), ClimData)

# Significance testing for partial RDA
anova(pRDA4, by = "term", permu = 200)
anova(pRDA4, by = "axis", perm = 500)

# Visualize partial RDA
plot(pRDA4, type = "none", choices = c(1, 2), 
     main = "Genetic Distance vs Climate (Controlling for Geography)")
text(pRDA4, dis = "cn", cex = 1, col = "red")
points(pRDA4, pch = 21, col = "red", bg = "yellow", cex = 1)

# PARTIAL RDA - CONTROL FOR CLIMATE
# Statistical: Conditions on environmental variables to isolate pure geographic effects
# Biological: Tests isolation-by-distance independent of environmental adaptation
pRDA2 <- rda(GenData ~ lat + lon + alt + Condition(Bio1 + Bio3 + Bio8 + Bio14 + 
                                                     BS1 + PHCA + EXNA), ClimData)

anova(pRDA2)
summary(pRDA2)

# FULL MODEL WITH BOTH GEOGRAPHY AND CLIMATE
# Statistical: Combined model with all predictors
# Biological: Tests total explained variance and interactions
RDAfull <- rda(GenData ~ lat + lon + alt + Bio1 + Bio3 + Bio8 + Bio14 + 
                 BS1 + PHCA + EXNA, ClimData)

anova(RDAfull)
summary(RDAfull)

# VARIANCE PARTITIONING ANALYSIS
# Statistical: Quantifies proportions of genetic variance explained by different factors
# Biological: Answers key landscape genetics questions:
# 1. How much variance is explained by environment alone?
# 2. How much by geography alone?
# 3. How much by their overlap?
# 4. How much remains unexplained?

cat("=== VARIANCE PARTITIONING SUMMARY ===\n")
cat("1. Climate not controlled for geography:\n")
summary(RDA4)$constr.chi / summary(RDA4)$tot.chi * 100

cat("\n2. Climate controlled for geography (pure environmental effects):\n")
summary(pRDA4)$constr.chi / summary(pRDA4)$tot.chi * 100

cat("\n3. Geography controlled for climate (pure geographic effects):\n")
summary(pRDA2)$constr.chi / summary(pRDA2)$tot.chi * 100

cat("\n4. Full model (climate + geography):\n")
summary(RDAfull)$constr.chi / summary(RDAfull)$tot.chi * 100

# ENHANCED VISUALIZATION FOR PUBLICATION
par(pty = "s")  # Square plot
plot(RDAfull, scaling = 2, display = c("sp", "lc", "cn"),
     xlim = c(-2, 2), ylim = c(-2, 2),
     main = "RDA Triplot: Genetic Distance vs Environment + Geography")
pl <- ordiellipse(RDAfull, reg, kind = "sd", label = TRUE, cex = 2)
box()

# R² ADJUSTED FOR MODEL COMPLEXITY
# Statistical: Adjusts explained variance for number of predictors
# Biological: Provides more realistic estimate of environmental influence
RsquareAdj(RDA4)
RsquareAdj(pRDA4)
RsquareAdj(pRDA2)
RsquareAdj(RDAfull)

# FINAL INTERPRETATION
cat("\n=== KEY BIOLOGICAL INTERPRETATIONS ===\n")
cat("1. Significant RDA axes indicate environmental gradients driving genetic divergence.\n")
cat("2. Partial RDAs separate isolation-by-distance from isolation-by-environment.\n")
cat("3. Variance partitioning quantifies relative importance of geography vs environment.\n")
cat("4. Environmental vectors show which factors align with genetic differentiation.\n")

Script 3: Generalized Additive Models (GAMs)

For Non-linear Genetic-Environment Relationships

Analyzing Pairwise Genetic Distances

  • Statistical innovation: Uses pairwise FST matrices as response variables, with corresponding pairwise environmental differences as predictors.

  • Key challenge: Account for non-independence of pairwise comparisons.

Flexible Modeling of Relationships

  • gam() with s() terms: Allows non-linear relationships between predictors and genetic differentiation.

  • Biological relevance: Genetic differentiation may respond non-linearly to environmental gradients (threshold effects, optimal ranges).

Model Comparison Framework

  • Linear vs. quadratic vs. smooth terms: Tests whether simple linear, curved, or complex non-linear relationships best explain patterns.

  • Biological insight: Shape of distance-FST relationship informs dispersal models (steep vs. gradual isolation-by-distance).

Scale Consideration

  • Analysis with both scaled and unscaled geographic distance: Distinguishes statistical significance from biological magnitude of effects.
{r}
#######################################################
###### Generalized Additive Models (GAMs) #############
###### For Non-linear Genetic-Environment Relationships
#######################################################

# LOAD LIBRARIES
library(gdata)    # For reading Excel files
library(mgcv)     # For Generalized Additive Models
library(permute)  # For permutation tests

# SET WORKING DIRECTORY
setwd("D:\\R_Working_files\\RawGeno\\Project_RawGeno\\Mp_All\\RawGeno_1\\RawGeno_1_final-data\\Mp_Xn_4n\\STR-MpXn_Regional2local_Scale\\Multiple Regression\\Italy\\GLM_GAM")

# PERL PATH FOR READING EXCEL FILES
perl <- "C:\\Users\\atahir\\Documents\\R\\win-library\\3.1\\perl\\bin\\perl.exe"

##################
#### DATA PREPARATION
##################

# READ GEOGRAPHIC DISTANCE MATRIX
# Statistical: Predictor representing isolation-by-distance
# Biological: Geographic distance expected to increase genetic differentiation via limited gene flow
Geog.tmp <- read.xls("GGD_IT.xlsx", perl = perl, as.is = TRUE)

# CREATE COMPREHENSIVE DATAFRAME
# Statistical: Combines pairwise Fst with pairwise environmental differences
# Biological: Tests how environmental dissimilarity affects genetic divergence
DiploidData <- data.frame(
  # Response: Pairwise Fst (genetic differentiation)
  Fst = c(as.matrix(read.xls("Fst_IT_Xn.xlsx", perl = perl)[, -1], sheet = 1)),
  
  # Environmental difference predictors:
  BIO3 = c(apply(read.xls("BIO3-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),  # Isothermality
  BIO8 = c(apply(read.xls("BIO8-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),  # Temp wettest quarter
  BIO14 = c(apply(read.xls("BIO14-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Prec driest month
  PHCA = c(apply(read.xls("PHCA-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),   # Soil pH
  EXNA = c(apply(read.xls("EXNA-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),   # Exchangeable sodium
  CLAY = c(apply(read.xls("CLAY-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),   # Clay content
  ESP = c(apply(read.xls("ESP-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),     # Exchangeable sodium %
  CACO3 = c(apply(read.xls("CACO3-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Calcium carbonate
  TN = c(apply(read.xls("TN-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]),       # Total nitrogen
  
  # Geographic distance (unscaled, will scale separately)
  Geog = c(apply(Geog.tmp[, colnames(Geog.tmp) != "X"], 2, as.numeric))
)

rm(Geog.tmp)  # Clean up temporary object

# DATA SCALING AND CLEANING
# Statistical: Standardizes variables, removes incomplete cases
# Biological: Ensures comparability, focuses on complete population pairs
DiploidData.scale <- as.data.frame(
  apply(DiploidData[apply(DiploidData, 1, function(X) sum(is.na(X))) == 0, ], 2, scale)
)

# Add unscaled geographic distance back for interpretation
DiploidData.scale$Geog.Unsc <- DiploidData$Geog[apply(DiploidData, 1, function(X) sum(is.na(X))) == 0]

# EXPLORATORY DATA ANALYSIS
# Statistical: Checks correlations and distributions
# Biological: Identifies potential collinearity issues and data patterns
pairs(DiploidData.scale[, 1:10])  # Quick visualization
cor_matrix <- cor(DiploidData.scale[, -ncol(DiploidData.scale)])
print("Correlation matrix (first 5x5):")
print(cor_matrix[1:5, 1:5])

##################
#### MODEL BUILDING
##################

# LINEAR MODEL 1: ALL VARIABLES
# Statistical: Baseline linear model
# Biological: Tests linear relationships between environmental differences and Fst
Diploid.reg <- lm(Fst ~ Geog + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN, 
                  data = DiploidData.scale)
summary(Diploid.reg)

# MODEL DIAGNOSTICS
# Statistical: Checks linear model assumptions
# Biological: Identifies patterns suggesting non-linear relationships or outliers
par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))
plot(fitted(Diploid.reg), resid(Diploid.reg))
abline(h = 0, col = 2)
title("Residuals vs Fitted")

# Check residuals against each predictor
predictors <- c("Geog", "BIO3", "BIO8", "BIO14", "PHCA", "EXNA", "CLAY", "ESP", "CACO3", "TN")
for (pred in predictors[1:6]) {
  plot(DiploidData.scale[[pred]], resid(Diploid.reg))
  abline(h = 0, col = 2)
  title(paste("Residuals vs", pred))
}

# LINEAR MODEL 2: WITH QUADRATIC GEOGRAPHIC TERM
# Statistical: Tests for non-linear isolation-by-distance
# Biological: Gene flow may decrease non-linearly with distance (e.g., stepping-stone model)
Diploid.reg2 <- lm(Fst ~ Geog + I(Geog^2) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN, 
                   data = DiploidData.scale)
summary(Diploid.reg2)

# VISUALIZE QUADRATIC EFFECT
# Statistical: Shows shape of distance-decay relationship
# Biological: Informs about dispersal kernel shape and connectivity
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
GeogEff <- DiploidData.scale$Geog * coef(Diploid.reg2)["Geog"] + 
  (DiploidData.scale$Geog^2) * coef(Diploid.reg2)["I(Geog^2)"]
plot(DiploidData.scale$Geog, GeogEff, xlab = "Scaled Geographic Distance", 
     ylab = "Effect on Fst", main = "Quadratic Distance Effect")
plot(DiploidData.scale$Geog, fitted(Diploid.reg2), xlab = "Scaled Geographic Distance",
     ylab = "Predicted Fst", main = "Model Predictions")

# GENERALIZED ADDITIVE MODEL (GAM)
# Statistical: Flexible non-parametric smoothing of relationships
# Biological: Captures complex, non-linear responses to environmental gradients
Diploid.gam <- gam(Fst ~ s(Geog) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN, 
                   data = DiploidData.scale)

# VISUALIZE GAM SMOOTHS
# Statistical: Shows fitted smooth functions for each non-linear term
# Biological: Reveals shape of environmental response curves (optimal, threshold, etc.)
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(Diploid.gam, shade = TRUE, shade.col = "lightblue", seWithMean = TRUE)
title("GAM Smooth Terms", outer = TRUE, line = -1)

# GAM DIAGNOSTICS
# Statistical: Checks GAM assumptions and fit
# Biological: Ensures model adequately captures biological patterns
par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))
plot(fitted(Diploid.gam), resid(Diploid.gam))
abline(h = 0, col = 2)
title("GAM: Residuals vs Fitted")

for (pred in predictors[1:6]) {
  plot(DiploidData.scale[[pred]], resid(Diploid.gam))
  abline(h = 0, col = 2)
  title(paste("GAM Residuals vs", pred))
}

# GAM SUMMARY
# Statistical: Provides significance tests and model metrics
# Biological: Identifies which environmental factors significantly affect genetic differentiation
gam_summary <- summary(Diploid.gam)
print(gam_summary)

##################
#### MODEL COMPARISON
##################

# STATISTICAL COMPARISON OF MODELS
# Statistical: Uses ANOVA and AIC to compare model fit
# Biological: Determines which functional form best represents biological processes
model_comparison <- anova(Diploid.reg, Diploid.reg2, Diploid.gam, test = 'Chisq')
print("Model Comparison (ANOVA):")
print(model_comparison)

# AIC COMPARISON
# Statistical: AIC balances model fit and complexity
# Biological: Selects most parsimonious model explaining genetic patterns
aic_values <- AIC(Diploid.reg, Diploid.reg2, Diploid.gam)
print("AIC Values:")
print(aic_values)

# RESIDUAL COMPARISON
# Statistical: Visual comparison of model fit quality
# Biological: Shows which model best captures unexplained variation
par(mfrow = c(1, 1), mar = c(5, 4, 2, 1))
boxplot(Diploid.reg$residuals, Diploid.reg2$residuals, Diploid.gam$residuals,
        names = c("Linear", "Quadratic", "GAM"),
        ylab = "Residuals", main = "Model Residual Comparison")
abline(h = 0, col = "red", lty = 2)

##################
#### FINAL MODEL WITH UNSCALED DISTANCE
##################

# GAM WITH UNSCALED GEOGRAPHIC DISTANCE
# Statistical: Final model for biological interpretation
# Biological: Allows direct interpretation of distance effects in original units (km)
Diploid.gam.unsc <- gam(Fst ~ s(Geog.Unsc) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + 
                         CACO3 + TN, data = DiploidData.scale)

# FINAL VISUALIZATION
# Statistical: Presents results in biologically interpretable units
# Biological: Shows actual distance-decay relationship for gene flow inference
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))

# 1. Distance effect in km
plot(Diploid.gam.unsc, select = 1, shade = TRUE, shade.col = "lightblue",
     xlab = "Geographic Distance (km)", ylab = "Effect on Fst",
     main = "Distance-Decay Relationship")

# 2. Partial effects of key environmental variables
important_vars <- c("BIO14", "PHCA", "CLAY")  # Based on model significance
for (i in 1:min(3, length(important_vars))) {
  if (important_vars[i] %in% names(DiploidData.scale)) {
    plot(DiploidData.scale[[important_vars[i]]], DiploidData.scale$Fst,
         xlab = important_vars[i], ylab = "Fst",
         main = paste("Fst vs", important_vars[i]))
    # Add smoothed trend line
    lines(lowess(DiploidData.scale[[important_vars[i]]], DiploidData.scale$Fst), 
          col = "red", lwd = 2)
  }
}

title("Final GAM Results: Italy Mp6x", outer = TRUE)

##################
#### BIOLOGICAL INTERPRETATION
##################

cat("\n=== BIOLOGICAL INTERPRETATION OF RESULTS ===\n\n")

# Extract key results
distance_edf <- gam_summary$edf[1]  # Effective degrees of freedom for distance smooth
distance_p <- gam_summary$s.pv[1]   # P-value for distance effect

cat("1. GEOGRAPHIC ISOLATION (ISOLATION-BY-DISTANCE):\n")
if (distance_p < 0.05) {
  cat(sprintf("   • Significant distance effect (p = %.4f)\n", distance_p))
  if (distance_edf > 1.5) {
    cat("   • Non-linear relationship suggests complex dispersal patterns\n")
    cat("   • Possible explanations: Stepping-stone dispersal, barriers at specific distances\n")
  } else {
    cat("   • Approximately linear relationship\n")
    cat("   • Suggests classic isolation-by-distance with relatively uniform dispersal\n")
  }
} else {
  cat("   • No significant isolation-by-distance detected\n")
  cat("   • Gene flow may be extensive or mediated by other factors\n")
}

# Environmental effects
sig_vars <- names(which(gam_summary$p.pv[-1] < 0.05))  # Significant predictors (excluding intercept)
cat("\n2. ENVIRONMENTAL SELECTION (ISOLATION-BY-ENVIRONMENT):\n")
if (length(sig_vars) > 0) {
  cat("   • Significant environmental predictors:\n")
  for (var in sig_vars) {
    effect_dir <- ifelse(coef(Diploid.gam)[var] > 0, "increases", "decreases")
    cat(sprintf("     - %s: %s genetic differentiation (p = %.4f)\n", 
                var, effect_dir, gam_summary$p.pv[var]))
  }
  cat("   • Suggests environmental adaptation driving genetic divergence\n")
} else {
  cat("   • No significant environmental effects detected\n")
  cat("   • Genetic differentiation primarily driven by neutral processes\n")
}

# Model comparison interpretation
cat("\n3. MODEL SELECTION INSIGHTS:\n")
best_model <- rownames(aic_values)[which.min(aic_values$AIC)]
cat(sprintf("   • Best-fitting model: %s\n", best_model))
if (best_model == "Diploid.gam") {
  cat("   • Non-linear models best explain genetic patterns\n")
  cat("   • Suggests complex responses to environmental gradients\n")
} else if (best_model == "Diploid.reg2") {
  cat("   • Quadratic distance model provides best fit\n")
  cat("   • Gene flow decreases non-linearly with distance\n")
} else {
  cat("   • Simple linear model provides best fit\n")
  cat("   • Relationships approximately linear\n")
}

# Explained variance
r2_adj <- summary(Diploid.gam)$r.sq
cat(sprintf("\n4. OVERALL MODEL PERFORMANCE:\n"))
cat(sprintf("   • Adjusted R² = %.3f (%.1f%% of variance explained)\n", r2_adj, r2_adj * 100))
if (r2_adj > 0.3) {
  cat("   • Good explanatory power for landscape genetic patterns\n")
} else if (r2_adj > 0.1) {
  cat("   • Moderate explanatory power\n")
  cat("   • Additional factors (e.g., historical, demographic) may be important\n")
} else {
  cat("   • Limited explanatory power\n")
  cat("   • Genetic patterns may be strongly influenced by stochastic processes\n")
}

cat("\n5. MANAGEMENT IMPLICATIONS:\n")
if (distance_p < 0.05 && length(sig_vars) > 0) {
  cat("   • Both geography and environment important for conservation\n")
  cat("   • Consider both connectivity corridors and environmental gradients\n")
} else if (distance_p < 0.05) {
  cat("   • Focus on maintaining landscape connectivity\n")
  cat("   • Geographic barriers likely limit gene flow\n")
} else if (length(sig_vars) > 0) {
  cat("   • Environmental adaptation key for population persistence\n")
  cat("   • Protect environmental heterogeneity and climate refugia\n")
} else {
  cat("   • Genetic patterns not strongly structured by measured factors\n")
  cat("   • Consider historical factors or finer-scale environmental variables\n")
}

# SAVE FINAL RESULTS
save(Diploid.gam.unsc, DiploidData.scale, gam_summary, aic_values,
     file = "GAM_Analysis_Results.RData")

cat("\n=== ANALYSIS COMPLETE ===\n")
cat("Results saved to GAM_Analysis_Results.RData\n")

Summary

This document presents a comprehensive analytical pipeline for landscape genetics, implementing three complementary statistical approaches to disentangle geographic and environmental drivers of genetic patterns. We demonstrate: (1) Principal Component Analysis (PCA) and regression modeling for identifying key environmental gradients affecting within-population genetic diversity, (2) Redundancy Analysis (RDA) for multivariate assessment of between-population genetic differentiation and variance partitioning, and (3) Generalized Additive Models (GAMs) for flexible modeling of non-linear relationships between environmental factors and genetic distances. Each method addresses distinct questions in landscape genetics while together providing a robust framework for understanding climate-genetic correlations.