Integrative Landscape Genomics: Partitioning Geographic and Environmental Effects on Genetic Diversity and Differentiation
A Multivariate Statistical Framework Using PCA, RDA, and GAM Analyses
OVERARCHING THEMES & INTEGRATION
1. Partitioning Geographic vs. Environmental Effects
First script: Uses GLMs with both geographic (lat, lon, alt) and environmental predictors.
Second script: Explicitly partitions variance using partial RDAs.
Third script: Includes both geographic distance and environmental differences in models.
2. Addressing Different Biological Questions
Genetic diversity drivers (First script): What factors influence within-population genetic variation?
Genetic differentiation patterns (Second & Third scripts): What factors explain between-population genetic differences?
3. Hierarchical Model Building
- Exploratory (PCA/correlation) → Hypothesis testing (GLMs) → Complex relationships (GAMs) → Multivariate synthesis (RDA)
4. Biological Interpretation Framework
Neutral processes: Geographic distance → isolation-by-distance
Selective processes: Environmental differences → isolation-by-environment/ecology
Interactive effects: Geographic isolation modifying environmental selection
5. Statistical Robustness Features
Permutation testing for RDA significance
Model selection criteria (AIC, ANOVA comparisons)
Residual diagnostics for model assumptions
Multicollinearity assessment via correlation matrices
The scripts collectively provide a comprehensive toolkit for landscape genetics, allowing researchers to disentangle the complex interplay between geographic isolation, environmental gradients, and genetic patterns at both within-population (diversity) and between-population (differentiation) levels.
Key Integration Across Scripts
These three scripts provide a comprehensive analytical pipeline for landscape genetics:
Script 1 (PCA & Regression): Identifies key environmental gradients and their linear relationships with within-population genetic diversity.
Script 2 (RDA): Multivariate analysis partitioning geographic vs. environmental effects on between-population genetic differentiation.
Script 3 (GAMs): Flexible modeling of non-linear relationships and distance-decay functions for pairwise genetic distances.
Biological Workflow:
Start with PCA to understand environmental dimensionality
Use RDA to test environmental vs. geographic effects on genetic structure
Apply GAMs to model complex non-linear relationships
Integrate results to understand both neutral (geographic) and selective (environmental) processes
Script 1: Principle Component Analysis & Regression Modelling
Data Preparation & Variable Selection
Statistical reasoning: Creating subsets of variables (
mydata2,mydata3, etc.) reduces dimensionality and focuses analysis on biologically/climatically relevant predictors. This helps avoid overfitting and multicollinearity issues.Biological reasoning: Variables like
He(expected heterozygosity),Hj(observed heterozygosity), and climate variables (Bio1-Bio19 from WorldClim) represent genetic diversity measures and environmental gradients that potentially influence genetic structure.
Data Scaling
scale()function: Standardizes variables to mean=0, SD=1. This ensures variables measured in different units (temperature, precipitation, soil properties) are comparable and prevents variables with larger ranges from dominating PCA/regression.
Correlation Analysis
cor()andpairs(): Identifies multicollinearity among predictors. High correlations suggest redundant information, guiding variable selection for models.
Principal Component Analysis (PCA)
princomp(): Reduces dimensionality by transforming correlated variables into orthogonal principal components.Biological application: Identifies major environmental gradients that may drive genetic differentiation. PCA axes represent combinations of environmental variables that explain most variation in the dataset.
Generalized Linear Models (GLMs)
Sequential model building (Model1-Model10): Stepwise refinement to identify most significant predictors of genetic diversity (
Hj).Statistical reasoning: GLMs assess linear relationships while allowing for different error distributions. Model comparison via ANOVA and AIC helps select best-fitting, most parsimonious model.
Model Diagnostics & Selection
stepAIC(): Automated variable selection balancing model fit and complexity.calc.relimp(): Quantifies relative importance of each predictor.anova()comparisons: Tests if more complex models significantly improve fit.
Visualization
ggbiplot(): Visualizes PCA results with grouping by categorical variables (metapopulations/regions).Biological interpretation: Clusters in PCA space may indicate genetically similar populations experiencing similar environmental conditions.
{r}
#######################################################
###### Principle Component Analysis & Regression ######
#######################################################
# SET WORKING DIRECTORY
# Sets the location where data files are stored
setwd("D:\\R_Working_files\\RawGeno\\Project_RawGeno\\Mp_All\\RawGeno_1\\RawGeno_1_final-data\\Mp_2n-only\\Mp2n_Regional_dataset\\Rangewide\\He_Model")
# LOAD LIBRARIES
library(ggplot2) # For advanced plotting
library(ggbiplot) # For PCA biplots with ggplot
library(MASS) # For stepAIC function
library(relaimpo) # For relative importance analysis
# READ DATA
# Reads tab-delimited file with genetic and environmental data
mydata <- read.table("Mp2n_ClimSoilData_F-He.txt", header = TRUE, sep = "\t")
# INITIAL DATA SUBSETTING
# Statistical: Selects key variables to reduce dimensionality
# Biological: Focuses on lat/lon (geography), He/Hj (genetic diversity),
# F (inbreeding), alt (elevation), and key bioclimatic/soil variables
mydata2 <- cbind(mydata$lat, mydata$lon, mydata$He, mydata$Hj, mydata$FAFLP, mydata$alt,
mydata$Bio1, mydata$Bio3, mydata$Bio8, mydata$Bio9, mydata$Bio15,
mydata$PHCA, mydata$EXCA, mydata$EXMG, mydata$SILT, mydata$SAND,
mydata$ECE, mydata$ESP, mydata$landuse, mydata$BS1, mydata$Ai)
# RENAME COLUMNS FOR CLARITY
colnames(mydata2) <- c("lat", "lon", "He", "Hj", "F", "Alt", "bio1", "bio3", "bio8",
"bio9", "bio15", "PHCA", "EXCA", "EXMG", "SILT", "SAND",
"ECE", "ESP", "landuse", "BS1", "Ai")
# CREATE ADDITIONAL SUBSET FOR ANALYSIS
# Statistical: Further refinement based on preliminary analyses
# Biological: Focus on heterozygosity (Hj) and key environmental predictors
mydat3 <- cbind(mydata$Hj, mydata$Bio3, mydata$Bio5, mydata$Bio8, mydata$Bio12,
mydata$Bio13, mydata$Bio14, mydata$Bio16, mydata$Bio17, mydata$Bio19,
mydata$CLAY, mydata$DRAINAGE, mydata$EXK, mydata$EXMG, mydata$EXNA,
mydata$SAND, mydata$SILT, mydata$Ai, mydata$AvPrec1_4, mydata$AvPrec9_12)
colnames(mydat3) <- c("Hj", "bio3", "bio5", "bio8", "bio12", "bio13", "bio14",
"bio16", "bio17", "bio19", "CLAY", "DRAINAGE", "EXK", "EXMG",
"EXNA", "SAND", "SILT", "Ai", "Prec1_4", "Prec9_12")
# CREATE AVERAGE CLIMATE VARIABLES
# Statistical: Reduces monthly data to seasonal aggregates to avoid collinearity
# Biological: Seasonal averages may better capture biologically relevant conditions
mydat <- mydata[, 5:114]
mydat$AvTmin9_12 <- (mydat$Tmin9 + mydat$Tmin10 + mydata$Tmin11 + mydata$Tmin12) / 4
mydat$AvTmax4_7 <- (mydat$Tmax4 + mydat$Tmax5 + mydat$Tmax6 + mydat$Tmax7) / 4
mydat$AvPrec1_4 <- (mydat$Prec1 + mydat$Prec2 + mydat$Prec3 + mydat$Prec4) / 4
mydat$AvPrec9_12 <- (mydat$Prec9 + mydat$Prec10 + mydat$Prec11 + mydat$Prec12) / 4
# DATA SCALING
# Statistical: Standardizes all variables to mean=0, SD=1
# Biological: Makes variables measured in different units (temp, precipitation, soil) comparable
# Note: Removes rows with NAs to ensure complete cases
DiploidData.scale2 <- as.data.frame(apply(mydat3[apply(mydat3, 1, function(X) sum(is.na(X))) == 0, ], 2, scale))
DiploidData.scale <- as.data.frame(apply(mydata2[apply(mydata2, 1, function(X) sum(is.na(X))) == 0, ], 2, scale))
# CHECK CORRELATIONS
# Statistical: Identifies multicollinearity among predictors
# Biological: Highly correlated variables may represent redundant environmental gradients
mydata3.df <- as.data.frame(mydat3)
corr <- cor(mydata3.df)
write.table(corr, file = "correlations_20var.csv", sep = ",")
# SIMPLE LINEAR REGRESSION
# Statistical: Tests basic linear relationships
# Biological: Examines if specific environmental factors directly affect genetic diversity
fit <- lm(Hj ~ Prec1_4, data = DiploidData.scale2)
summary(fit)
# POLYNOMIAL REGRESSION
# Statistical: Tests for non-linear (quadratic) relationships
# Biological: Genetic diversity may have optimal ranges along environmental gradients
fit <- lm(Hj ~ Tmean8 + I(Tmean8^2), data = DiploidData.scale2)
summary(fit)
# MULTIPLE REGRESSION
# Statistical: Tests combined effects of multiple predictors
# Biological: Genetic diversity likely influenced by multiple interacting factors
Diploid.reg1 <- lm(Hj ~ CLAY + SAND + Prec9_12, data = DiploidData.scale2)
summary(Diploid.reg1)
# PRINCIPAL COMPONENT ANALYSIS
# Statistical: Reduces dimensionality, identifies major gradients of variation
# Biological: Identifies composite environmental axes that may drive genetic patterns
mydata5 <- cbind(mydata$longitude, mydata$Hj, mydata$elevation..m., mydata$Bio1,
mydata$Bio2, mydata$Bio4, mydata$Bio8, mydata$Bio12, mydata$Bio19,
mydata$landuse.class..GlobeCover., mydata$AWC, mydata$CACO3,
mydata$ECE, mydata$ESP, mydata$GRAV, mydata$SAND, mydata$SILT,
mydata$TN)
colnames(mydata5) <- c("lon", "Hj", "Alt", "bio1", "bio2", "bio4", "bio8", "bio12",
"bio19", "landuse", "AWC", "CACO3", "ECE", "ESP", "GRAV",
"SAND", "SILT", "TN")
Mp.pca <- princomp(mydata5, scores = TRUE, cor = TRUE)
summary(Mp.pca)
# VISUALIZE PCA RESULTS
# Statistical: Scree plot shows variance explained by each component
# Biological: Helps decide how many components represent biologically meaningful gradients
plot(Mp.pca, type = "l")
# PCA BIPLOT
# Statistical: Shows variables and samples in reduced space
# Biological: Reveals which environmental variables co-vary and how populations cluster
biplot(Mp.pca)
# GROUPED PCA VISUALIZATION
# Statistical: Colors points by categorical grouping variables
# Biological: Tests if genetically defined groups occupy distinct environmental space
cat.var2 <- mydata[, 2] # Metapopulations
cat.var3 <- mydata[, 3] # Regions
g <- ggbiplot(Mp.pca, obs.scale = 0.5, var.scale = 0.5,
groups = cat.var2, ellipse = TRUE, circle = TRUE)
g <- g + scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top')
print(g)
# STEPWISE MODEL SELECTION
# Statistical: Automated variable selection balancing fit and complexity
# Biological: Identifies minimal set of environmental factors explaining genetic diversity
fit <- lm(Hj ~ bio3 + bio5 + bio8 + CLAY + SILT + DRAINAGE + SAND + SILT + Prec9_12,
data = DiploidData.scale2)
step <- stepAIC(fit, direction = "both")
step$anova
# FINAL MODEL BUILDING AND COMPARISON
# Statistical: Sequential model refinement based on significance and parsimony
# Biological: Builds understanding of which factors most strongly influence genetic diversity
Model1 <- glm(Hj ~ lat + lon + F + Alt + bio1 + bio2 + bio3 + bio4 + bio5 +
bio8 + bio10 + bio11 + bio13 + bio19 + BS1 + CLAY +
ECE + ESP + EXNA + GRAV + PHH2O + SAND + SILT + TN,
data = as.data.frame(mydata2))
# Simplified through sequential testing
Model8 <- glm(Hj ~ lon + bio2 + bio3 + bio5 + bio11 + ECE + PHCA + SAND + SILT,
data = as.data.frame(mydata2))
# MODEL COMPARISON
# Statistical: ANOVA tests if more complex models significantly improve fit
# Biological: Determines whether additional variables provide meaningful biological insight
anova(Model8, Model9, Model10, test = 'F')
# RELATIVE IMPORTANCE ANALYSIS
# Statistical: Quantifies contribution of each predictor to explained variance
# Biological: Ranks environmental factors by their influence on genetic diversity
calc.relimp(Model8, type = c("lmg"), rela = TRUE)
boot <- boot.relimp(Model8, b = 1000, type = c("lmg"), rank = TRUE, diff = TRUE, rela = TRUE)
plot(booteval.relimp(boot, sort = TRUE))
# RESIDUAL ANALYSIS
# Statistical: Checks model assumptions and fit quality
# Biological: Identifies populations where model predictions are poor
boxplot(Model8$residuals, Model9$residuals, Model10$residuals,
names = c("GLM 8", "GLM 9", "GLM 10"))
# FINAL PCA ON SELECTED VARIABLES
# Statistical: PCA on variables retained in final model
# Biological: Visualizes environmental space defined by most important factors
mydataML <- cbind(mydata$latitude, mydata$Hj, mydata$elevation..m.,
mydata$Bio2, mydata$Bio4, mydata$Bio5, mydata$Bio8, mydata$Bio9,
mydata$Bio10, mydata$Bio11, mydata$Bio14, mydata$Bio16,
mydata$Bio17, mydata$Bio19, mydata$landuse.class..GlobeCover.,
mydata$BS1, mydata$CACO3, mydata$CLAY, mydata$GRAV, mydata$PHCA,
mydata$SAND)
colnames(mydataML) <- c("lat", "Hj", "Alt", "bio2", "bio4", "bio5", "bio8", "bio9",
"bio10", "bio11", "bio14", "bio16", "bio17", "bio19",
"landuse", "BS1", "CACO3", "CLAY", "GRAV", "PHCA", "SAND")
Mp.pcaML <- princomp(mydataML, scores = TRUE, cor = TRUE, scale = TRUE)
summary(Mp.pcaML)
# CUSTOMIZED PCA PLOT
# Statistical: Enhanced visualization for publication
# Biological: Clear presentation of population clustering in environmental space
g <- ggbiplot(Mp.pcaML, obs.scale = 2, var.scale = 2,
groups = cat.var3, ellipse = TRUE, circle = FALSE,
var.axes = TRUE, varname.size = 4) +
geom_point(aes(colour = cat.var3), size = 4, shape = 16) +
scale_color_discrete(name = '') +
theme(legend.direction = 'vertical', legend.position = 'right',
legend.key.size = unit(5, "mm"), legend.text = element_text(size = 15)) +
ggtitle("Principal Component Analysis") +
theme(plot.title = element_text(size = 20, vjust = 1),
axis.title.x = element_text(size = 15, color = "black", vjust = -0.35),
axis.title.y = element_text(size = 15, color = "black", vjust = 0.35))
print(g)Script 2: Redundancy Analysis (RDA) For Genetic Environment Correlation
Multivariate Approach to Genetic-Environmental Correlation
Statistical reasoning: RDA is a constrained ordination method that directly relates genetic distance matrices to environmental predictors.
Biological application: Tests whether environmental variables explain genetic differentiation patterns beyond geographic distance.
Partial RDA
rda(GenData ~ Climate + Condition(Geography)): Controls for geographic effects to isolate pure environmental influence.rda(GenData ~ Geography + Condition(Climate)): Controls for environmental effects to isolate pure geographic influence.Key biological insight: Partitions variance into components explained by (1) environment alone, (2) geography alone, (3) shared environment-geography, and (4) unexplained.
Model Selection & Significance Testing
ordistep(): Stepwise selection of environmental variables.anova()with permutations: Non-parametric significance testing robust to non-normal distributions common in genetic data.
Variance Partitioning
Statistical concept: Quantifies proportion of genetic variation explained by different factor sets.
Biological interpretation: Answers whether isolation-by-distance (geography) or isolation-by-environment (ecological selection) dominates genetic differentiation.
{r}
#######################################################
###### Redundancy Analysis (RDA) ######################
###### For Partitioning Geographic vs. Climatic Effects
#######################################################
# LOAD LIBRARIES
library(vegan) # For RDA and ordination methods
library(permute) # For permutation tests
library(lattice) # For advanced plotting
# SET WORKING DIRECTORY
setwd("D:\\Ali_Lenovo\\WorK_20150729\\Mp6X_Analysis\\Multiple Regression\\Italy\\RDA")
# READ ENVIRONMENTAL DATA
# Contains climate and soil variables for each population
ClimData <- read.table('Italy_ClimData.txt', header = TRUE)
# SUBSET ENVIRONMENTAL VARIABLES
# Statistical: Selects key variables based on biological relevance and preliminary analyses
# Biological: Focuses on temperature, precipitation, and soil factors likely to influence genetics
ClimData2 <- cbind(ClimData$Bio1, ClimData$Bio3, ClimData$Bio10, ClimData$Bio14,
ClimData$Bio15, ClimData$Bio16, ClimData$Bio17, ClimData$BS1,
ClimData$CLAY, ClimData$EXNA, ClimData$PHCA, ClimData$TN)
colnames(ClimData2) <- c("Bio1", "Bio3", "Bio10", "Bio14", "Bio15", "Bio16",
"Bio17", "BS1", "CLAY", "EXNA", "PHCA", "TN")
# CHECK CORRELATIONS AMONG PREDICTORS
# Statistical: Avoids multicollinearity in constrained ordination
# Biological: Ensures selected variables represent independent environmental axes
ClimData2.df <- as.data.frame(ClimData2)
cor(ClimData2.df)
pairs(ClimData2.df)
# READ GENETIC DATA
# Pairwise Fst matrix - genetic differentiation between populations
# Statistical: Response matrix for multivariate analysis
# Biological: Quantifies genetic divergence due to drift, selection, and migration
GenData <- as.matrix(read.table('GenData.txt', header = TRUE, sep = "\t",
row.names = 1, as.is = TRUE))
GenData <- scale(GenData) # Standardize genetic distances
# FULL RDA MODEL
# Statistical: Constrained ordination relating genetic distances to all environmental predictors
# Biological: Tests if environmental variables collectively explain genetic differentiation
RDA <- rda(GenData ~ Bio4 + Bio9 + Bio10 + Bio11 + Bio14 + Bio17 + landuse + BS1 +
CACO3 + CLAY + EXCA + EXK + EXNA + GRAV + PHCA + PHH2O + SILT, ClimData)
summary(RDA)
# STEPWISE MODEL SELECTION FOR RDA
# Statistical: Identifies minimal adequate set of environmental predictors
# Biological: Finds which specific environmental factors best explain genetic patterns
mod0 <- rda(GenData ~ Bio1 + Bio2 + Bio3 + Bio4 + Bio5 + Bio6 + Bio7 + Bio8 + Bio9 +
Bio10 + Bio11 + Bio12 + Bio13 + Bio14 + Bio15 + Bio16 + Bio17 + Bio18 +
Bio19 + alt + landuse + BS1 + CACO3 + CEC + CLAY + ESP + EXCA + EXK +
EXMG + EXNA + GRAV + PHCA + PHH2O + SAND + SILT + TC + TN, data = ClimData)
mod1 <- rda(GenData ~ 1, data = ClimData) # Null model
# Automated stepwise selection with permutation tests
ordistep(mod0, scope = formula(mod1), direction = "both",
Pin = 0.05, R2scope = TRUE, permutations = how(nperm = 499),
trace = TRUE)
# RDA WITH SELECTED VARIABLES
# Statistical: Reduced model with significant predictors only
# Biological: Focuses on environmentally driven genetic differentiation
RDA4 <- rda(GenData ~ Bio1 + Bio3 + Bio8 + Bio14 + PHCA + EXNA + BS1, data = ClimData)
summary(RDA4)
# SIGNIFICANCE TESTING WITH PERMUTATIONS
# Statistical: Non-parametric testing robust to distributional assumptions
# Biological: Determines if environment-genetics relationships are statistically significant
anova(RDA4) # Overall model significance
anova(RDA4, by = "term", permu = 200) # Significance of individual terms
anova(RDA4, by = "axis", perm = 500) # Significance of each RDA axis
# VISUALIZE RDA RESULTS
# Statistical: Biplot shows samples and environmental vectors in reduced space
# Biological: Reveals which environmental gradients align with genetic differentiation
plot(RDA4, type = "none", choices = c(1, 2),
main = "Genetic Distance vs Climate Variables")
text(RDA4, dis = "cn", cex = 1, col = "red") # Environmental vectors
points(RDA4, pch = 21, col = "red", bg = "yellow", cex = 0.6) # Population scores
# ADD GROUPING INFORMATION
# Statistical: Overlays categorical grouping on ordination
# Biological: Tests if genetically/metapopulation groups occupy distinct environmental space
metapop <- ClimData[, 3] # Metapopulation categories
reg <- ClimData[, 4] # Region categories
# Add ellipses showing group dispersion
pl <- ordiellipse(RDA4, reg, kind = "sd", label = TRUE, cex = 2)
pl <- ordiellipse(RDA4, metapop, kind = "sd", label = TRUE, cex = 1)
# PARTIAL RDA - CONTROL FOR GEOGRAPHY
# Statistical: Conditions on geographic variables to isolate pure environmental effects
# Biological: Tests environmental adaptation independent of isolation-by-distance
pRDA4 <- rda(GenData ~ Bio1 + Bio3 + Bio8 + Bio14 + BS1 + PHCA + EXNA +
Condition(lat + lon + alt), ClimData)
# Significance testing for partial RDA
anova(pRDA4, by = "term", permu = 200)
anova(pRDA4, by = "axis", perm = 500)
# Visualize partial RDA
plot(pRDA4, type = "none", choices = c(1, 2),
main = "Genetic Distance vs Climate (Controlling for Geography)")
text(pRDA4, dis = "cn", cex = 1, col = "red")
points(pRDA4, pch = 21, col = "red", bg = "yellow", cex = 1)
# PARTIAL RDA - CONTROL FOR CLIMATE
# Statistical: Conditions on environmental variables to isolate pure geographic effects
# Biological: Tests isolation-by-distance independent of environmental adaptation
pRDA2 <- rda(GenData ~ lat + lon + alt + Condition(Bio1 + Bio3 + Bio8 + Bio14 +
BS1 + PHCA + EXNA), ClimData)
anova(pRDA2)
summary(pRDA2)
# FULL MODEL WITH BOTH GEOGRAPHY AND CLIMATE
# Statistical: Combined model with all predictors
# Biological: Tests total explained variance and interactions
RDAfull <- rda(GenData ~ lat + lon + alt + Bio1 + Bio3 + Bio8 + Bio14 +
BS1 + PHCA + EXNA, ClimData)
anova(RDAfull)
summary(RDAfull)
# VARIANCE PARTITIONING ANALYSIS
# Statistical: Quantifies proportions of genetic variance explained by different factors
# Biological: Answers key landscape genetics questions:
# 1. How much variance is explained by environment alone?
# 2. How much by geography alone?
# 3. How much by their overlap?
# 4. How much remains unexplained?
cat("=== VARIANCE PARTITIONING SUMMARY ===\n")
cat("1. Climate not controlled for geography:\n")
summary(RDA4)$constr.chi / summary(RDA4)$tot.chi * 100
cat("\n2. Climate controlled for geography (pure environmental effects):\n")
summary(pRDA4)$constr.chi / summary(pRDA4)$tot.chi * 100
cat("\n3. Geography controlled for climate (pure geographic effects):\n")
summary(pRDA2)$constr.chi / summary(pRDA2)$tot.chi * 100
cat("\n4. Full model (climate + geography):\n")
summary(RDAfull)$constr.chi / summary(RDAfull)$tot.chi * 100
# ENHANCED VISUALIZATION FOR PUBLICATION
par(pty = "s") # Square plot
plot(RDAfull, scaling = 2, display = c("sp", "lc", "cn"),
xlim = c(-2, 2), ylim = c(-2, 2),
main = "RDA Triplot: Genetic Distance vs Environment + Geography")
pl <- ordiellipse(RDAfull, reg, kind = "sd", label = TRUE, cex = 2)
box()
# R² ADJUSTED FOR MODEL COMPLEXITY
# Statistical: Adjusts explained variance for number of predictors
# Biological: Provides more realistic estimate of environmental influence
RsquareAdj(RDA4)
RsquareAdj(pRDA4)
RsquareAdj(pRDA2)
RsquareAdj(RDAfull)
# FINAL INTERPRETATION
cat("\n=== KEY BIOLOGICAL INTERPRETATIONS ===\n")
cat("1. Significant RDA axes indicate environmental gradients driving genetic divergence.\n")
cat("2. Partial RDAs separate isolation-by-distance from isolation-by-environment.\n")
cat("3. Variance partitioning quantifies relative importance of geography vs environment.\n")
cat("4. Environmental vectors show which factors align with genetic differentiation.\n")Script 3: Generalized Additive Models (GAMs)
For Non-linear Genetic-Environment Relationships
Analyzing Pairwise Genetic Distances
Statistical innovation: Uses pairwise FST matrices as response variables, with corresponding pairwise environmental differences as predictors.
Key challenge: Account for non-independence of pairwise comparisons.
Flexible Modeling of Relationships
gam()withs()terms: Allows non-linear relationships between predictors and genetic differentiation.Biological relevance: Genetic differentiation may respond non-linearly to environmental gradients (threshold effects, optimal ranges).
Model Comparison Framework
Linear vs. quadratic vs. smooth terms: Tests whether simple linear, curved, or complex non-linear relationships best explain patterns.
Biological insight: Shape of distance-FST relationship informs dispersal models (steep vs. gradual isolation-by-distance).
Scale Consideration
- Analysis with both scaled and unscaled geographic distance: Distinguishes statistical significance from biological magnitude of effects.
{r}
#######################################################
###### Generalized Additive Models (GAMs) #############
###### For Non-linear Genetic-Environment Relationships
#######################################################
# LOAD LIBRARIES
library(gdata) # For reading Excel files
library(mgcv) # For Generalized Additive Models
library(permute) # For permutation tests
# SET WORKING DIRECTORY
setwd("D:\\R_Working_files\\RawGeno\\Project_RawGeno\\Mp_All\\RawGeno_1\\RawGeno_1_final-data\\Mp_Xn_4n\\STR-MpXn_Regional2local_Scale\\Multiple Regression\\Italy\\GLM_GAM")
# PERL PATH FOR READING EXCEL FILES
perl <- "C:\\Users\\atahir\\Documents\\R\\win-library\\3.1\\perl\\bin\\perl.exe"
##################
#### DATA PREPARATION
##################
# READ GEOGRAPHIC DISTANCE MATRIX
# Statistical: Predictor representing isolation-by-distance
# Biological: Geographic distance expected to increase genetic differentiation via limited gene flow
Geog.tmp <- read.xls("GGD_IT.xlsx", perl = perl, as.is = TRUE)
# CREATE COMPREHENSIVE DATAFRAME
# Statistical: Combines pairwise Fst with pairwise environmental differences
# Biological: Tests how environmental dissimilarity affects genetic divergence
DiploidData <- data.frame(
# Response: Pairwise Fst (genetic differentiation)
Fst = c(as.matrix(read.xls("Fst_IT_Xn.xlsx", perl = perl)[, -1], sheet = 1)),
# Environmental difference predictors:
BIO3 = c(apply(read.xls("BIO3-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Isothermality
BIO8 = c(apply(read.xls("BIO8-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Temp wettest quarter
BIO14 = c(apply(read.xls("BIO14-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Prec driest month
PHCA = c(apply(read.xls("PHCA-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Soil pH
EXNA = c(apply(read.xls("EXNA-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Exchangeable sodium
CLAY = c(apply(read.xls("CLAY-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Clay content
ESP = c(apply(read.xls("ESP-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Exchangeable sodium %
CACO3 = c(apply(read.xls("CACO3-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Calcium carbonate
TN = c(apply(read.xls("TN-dm.xlsx", as.is = TRUE, perl = perl), 2, as.numeric)[, -1]), # Total nitrogen
# Geographic distance (unscaled, will scale separately)
Geog = c(apply(Geog.tmp[, colnames(Geog.tmp) != "X"], 2, as.numeric))
)
rm(Geog.tmp) # Clean up temporary object
# DATA SCALING AND CLEANING
# Statistical: Standardizes variables, removes incomplete cases
# Biological: Ensures comparability, focuses on complete population pairs
DiploidData.scale <- as.data.frame(
apply(DiploidData[apply(DiploidData, 1, function(X) sum(is.na(X))) == 0, ], 2, scale)
)
# Add unscaled geographic distance back for interpretation
DiploidData.scale$Geog.Unsc <- DiploidData$Geog[apply(DiploidData, 1, function(X) sum(is.na(X))) == 0]
# EXPLORATORY DATA ANALYSIS
# Statistical: Checks correlations and distributions
# Biological: Identifies potential collinearity issues and data patterns
pairs(DiploidData.scale[, 1:10]) # Quick visualization
cor_matrix <- cor(DiploidData.scale[, -ncol(DiploidData.scale)])
print("Correlation matrix (first 5x5):")
print(cor_matrix[1:5, 1:5])
##################
#### MODEL BUILDING
##################
# LINEAR MODEL 1: ALL VARIABLES
# Statistical: Baseline linear model
# Biological: Tests linear relationships between environmental differences and Fst
Diploid.reg <- lm(Fst ~ Geog + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN,
data = DiploidData.scale)
summary(Diploid.reg)
# MODEL DIAGNOSTICS
# Statistical: Checks linear model assumptions
# Biological: Identifies patterns suggesting non-linear relationships or outliers
par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))
plot(fitted(Diploid.reg), resid(Diploid.reg))
abline(h = 0, col = 2)
title("Residuals vs Fitted")
# Check residuals against each predictor
predictors <- c("Geog", "BIO3", "BIO8", "BIO14", "PHCA", "EXNA", "CLAY", "ESP", "CACO3", "TN")
for (pred in predictors[1:6]) {
plot(DiploidData.scale[[pred]], resid(Diploid.reg))
abline(h = 0, col = 2)
title(paste("Residuals vs", pred))
}
# LINEAR MODEL 2: WITH QUADRATIC GEOGRAPHIC TERM
# Statistical: Tests for non-linear isolation-by-distance
# Biological: Gene flow may decrease non-linearly with distance (e.g., stepping-stone model)
Diploid.reg2 <- lm(Fst ~ Geog + I(Geog^2) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN,
data = DiploidData.scale)
summary(Diploid.reg2)
# VISUALIZE QUADRATIC EFFECT
# Statistical: Shows shape of distance-decay relationship
# Biological: Informs about dispersal kernel shape and connectivity
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
GeogEff <- DiploidData.scale$Geog * coef(Diploid.reg2)["Geog"] +
(DiploidData.scale$Geog^2) * coef(Diploid.reg2)["I(Geog^2)"]
plot(DiploidData.scale$Geog, GeogEff, xlab = "Scaled Geographic Distance",
ylab = "Effect on Fst", main = "Quadratic Distance Effect")
plot(DiploidData.scale$Geog, fitted(Diploid.reg2), xlab = "Scaled Geographic Distance",
ylab = "Predicted Fst", main = "Model Predictions")
# GENERALIZED ADDITIVE MODEL (GAM)
# Statistical: Flexible non-parametric smoothing of relationships
# Biological: Captures complex, non-linear responses to environmental gradients
Diploid.gam <- gam(Fst ~ s(Geog) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP + CACO3 + TN,
data = DiploidData.scale)
# VISUALIZE GAM SMOOTHS
# Statistical: Shows fitted smooth functions for each non-linear term
# Biological: Reveals shape of environmental response curves (optimal, threshold, etc.)
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(Diploid.gam, shade = TRUE, shade.col = "lightblue", seWithMean = TRUE)
title("GAM Smooth Terms", outer = TRUE, line = -1)
# GAM DIAGNOSTICS
# Statistical: Checks GAM assumptions and fit
# Biological: Ensures model adequately captures biological patterns
par(mfrow = c(3, 3), mar = c(3, 3, 1, 1))
plot(fitted(Diploid.gam), resid(Diploid.gam))
abline(h = 0, col = 2)
title("GAM: Residuals vs Fitted")
for (pred in predictors[1:6]) {
plot(DiploidData.scale[[pred]], resid(Diploid.gam))
abline(h = 0, col = 2)
title(paste("GAM Residuals vs", pred))
}
# GAM SUMMARY
# Statistical: Provides significance tests and model metrics
# Biological: Identifies which environmental factors significantly affect genetic differentiation
gam_summary <- summary(Diploid.gam)
print(gam_summary)
##################
#### MODEL COMPARISON
##################
# STATISTICAL COMPARISON OF MODELS
# Statistical: Uses ANOVA and AIC to compare model fit
# Biological: Determines which functional form best represents biological processes
model_comparison <- anova(Diploid.reg, Diploid.reg2, Diploid.gam, test = 'Chisq')
print("Model Comparison (ANOVA):")
print(model_comparison)
# AIC COMPARISON
# Statistical: AIC balances model fit and complexity
# Biological: Selects most parsimonious model explaining genetic patterns
aic_values <- AIC(Diploid.reg, Diploid.reg2, Diploid.gam)
print("AIC Values:")
print(aic_values)
# RESIDUAL COMPARISON
# Statistical: Visual comparison of model fit quality
# Biological: Shows which model best captures unexplained variation
par(mfrow = c(1, 1), mar = c(5, 4, 2, 1))
boxplot(Diploid.reg$residuals, Diploid.reg2$residuals, Diploid.gam$residuals,
names = c("Linear", "Quadratic", "GAM"),
ylab = "Residuals", main = "Model Residual Comparison")
abline(h = 0, col = "red", lty = 2)
##################
#### FINAL MODEL WITH UNSCALED DISTANCE
##################
# GAM WITH UNSCALED GEOGRAPHIC DISTANCE
# Statistical: Final model for biological interpretation
# Biological: Allows direct interpretation of distance effects in original units (km)
Diploid.gam.unsc <- gam(Fst ~ s(Geog.Unsc) + BIO3 + BIO8 + BIO14 + PHCA + EXNA + CLAY + ESP +
CACO3 + TN, data = DiploidData.scale)
# FINAL VISUALIZATION
# Statistical: Presents results in biologically interpretable units
# Biological: Shows actual distance-decay relationship for gene flow inference
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
# 1. Distance effect in km
plot(Diploid.gam.unsc, select = 1, shade = TRUE, shade.col = "lightblue",
xlab = "Geographic Distance (km)", ylab = "Effect on Fst",
main = "Distance-Decay Relationship")
# 2. Partial effects of key environmental variables
important_vars <- c("BIO14", "PHCA", "CLAY") # Based on model significance
for (i in 1:min(3, length(important_vars))) {
if (important_vars[i] %in% names(DiploidData.scale)) {
plot(DiploidData.scale[[important_vars[i]]], DiploidData.scale$Fst,
xlab = important_vars[i], ylab = "Fst",
main = paste("Fst vs", important_vars[i]))
# Add smoothed trend line
lines(lowess(DiploidData.scale[[important_vars[i]]], DiploidData.scale$Fst),
col = "red", lwd = 2)
}
}
title("Final GAM Results: Italy Mp6x", outer = TRUE)
##################
#### BIOLOGICAL INTERPRETATION
##################
cat("\n=== BIOLOGICAL INTERPRETATION OF RESULTS ===\n\n")
# Extract key results
distance_edf <- gam_summary$edf[1] # Effective degrees of freedom for distance smooth
distance_p <- gam_summary$s.pv[1] # P-value for distance effect
cat("1. GEOGRAPHIC ISOLATION (ISOLATION-BY-DISTANCE):\n")
if (distance_p < 0.05) {
cat(sprintf(" • Significant distance effect (p = %.4f)\n", distance_p))
if (distance_edf > 1.5) {
cat(" • Non-linear relationship suggests complex dispersal patterns\n")
cat(" • Possible explanations: Stepping-stone dispersal, barriers at specific distances\n")
} else {
cat(" • Approximately linear relationship\n")
cat(" • Suggests classic isolation-by-distance with relatively uniform dispersal\n")
}
} else {
cat(" • No significant isolation-by-distance detected\n")
cat(" • Gene flow may be extensive or mediated by other factors\n")
}
# Environmental effects
sig_vars <- names(which(gam_summary$p.pv[-1] < 0.05)) # Significant predictors (excluding intercept)
cat("\n2. ENVIRONMENTAL SELECTION (ISOLATION-BY-ENVIRONMENT):\n")
if (length(sig_vars) > 0) {
cat(" • Significant environmental predictors:\n")
for (var in sig_vars) {
effect_dir <- ifelse(coef(Diploid.gam)[var] > 0, "increases", "decreases")
cat(sprintf(" - %s: %s genetic differentiation (p = %.4f)\n",
var, effect_dir, gam_summary$p.pv[var]))
}
cat(" • Suggests environmental adaptation driving genetic divergence\n")
} else {
cat(" • No significant environmental effects detected\n")
cat(" • Genetic differentiation primarily driven by neutral processes\n")
}
# Model comparison interpretation
cat("\n3. MODEL SELECTION INSIGHTS:\n")
best_model <- rownames(aic_values)[which.min(aic_values$AIC)]
cat(sprintf(" • Best-fitting model: %s\n", best_model))
if (best_model == "Diploid.gam") {
cat(" • Non-linear models best explain genetic patterns\n")
cat(" • Suggests complex responses to environmental gradients\n")
} else if (best_model == "Diploid.reg2") {
cat(" • Quadratic distance model provides best fit\n")
cat(" • Gene flow decreases non-linearly with distance\n")
} else {
cat(" • Simple linear model provides best fit\n")
cat(" • Relationships approximately linear\n")
}
# Explained variance
r2_adj <- summary(Diploid.gam)$r.sq
cat(sprintf("\n4. OVERALL MODEL PERFORMANCE:\n"))
cat(sprintf(" • Adjusted R² = %.3f (%.1f%% of variance explained)\n", r2_adj, r2_adj * 100))
if (r2_adj > 0.3) {
cat(" • Good explanatory power for landscape genetic patterns\n")
} else if (r2_adj > 0.1) {
cat(" • Moderate explanatory power\n")
cat(" • Additional factors (e.g., historical, demographic) may be important\n")
} else {
cat(" • Limited explanatory power\n")
cat(" • Genetic patterns may be strongly influenced by stochastic processes\n")
}
cat("\n5. MANAGEMENT IMPLICATIONS:\n")
if (distance_p < 0.05 && length(sig_vars) > 0) {
cat(" • Both geography and environment important for conservation\n")
cat(" • Consider both connectivity corridors and environmental gradients\n")
} else if (distance_p < 0.05) {
cat(" • Focus on maintaining landscape connectivity\n")
cat(" • Geographic barriers likely limit gene flow\n")
} else if (length(sig_vars) > 0) {
cat(" • Environmental adaptation key for population persistence\n")
cat(" • Protect environmental heterogeneity and climate refugia\n")
} else {
cat(" • Genetic patterns not strongly structured by measured factors\n")
cat(" • Consider historical factors or finer-scale environmental variables\n")
}
# SAVE FINAL RESULTS
save(Diploid.gam.unsc, DiploidData.scale, gam_summary, aic_values,
file = "GAM_Analysis_Results.RData")
cat("\n=== ANALYSIS COMPLETE ===\n")
cat("Results saved to GAM_Analysis_Results.RData\n")Summary
This document presents a comprehensive analytical pipeline for landscape genetics, implementing three complementary statistical approaches to disentangle geographic and environmental drivers of genetic patterns. We demonstrate: (1) Principal Component Analysis (PCA) and regression modeling for identifying key environmental gradients affecting within-population genetic diversity, (2) Redundancy Analysis (RDA) for multivariate assessment of between-population genetic differentiation and variance partitioning, and (3) Generalized Additive Models (GAMs) for flexible modeling of non-linear relationships between environmental factors and genetic distances. Each method addresses distinct questions in landscape genetics while together providing a robust framework for understanding climate-genetic correlations.