Data Exploration & Normalization

Case Study of Pakistan: Spider Web dataset

Author

Dr Tahir Ali | AG de Meaux | TRR341 Ecological Genomics | University of Cologne

Session Overview

Before building species distribution models, we must understand the structure of environmental predictors. This includes:

  • inspecting distributions

  • detecting skewness and outliers

  • checking correlations

  • applying transformations where appropriate

These steps help ensure statistical validity and ecological interpretability.

0. Extract Climate Variables (Run only Once)

A step-by-step R script to extract 19 climate variables from the WorldClim rasters using your spider coordinates file and save as CSV file for the downstream analysis

Code
# ==============================================================================
# STEP 1: LOAD REQUIRED PACKAGES
# ==============================================================================

# Load the terra package - this helps us work with raster files (like the climate data)
# If you don't have it installed, run: install.packages("terra")
library(terra)

# Load the dplyr package - this helps us manipulate data tables
# If you don't have it installed, run: install.packages("dplyr")
library(dplyr)

# ==============================================================================
# STEP 2: READ THE SPIDER COORDINATES FILE
# ==============================================================================

# Read the CSV file containing spider presence coordinates
# The file path points to where your spider data is stored
spiders <- read.csv2("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/presence/spider_coords.csv",
                     header = TRUE)  # header=TRUE means the first row contains column names

# Look at the first few rows to check the data loaded correctly
head(spiders)  # This shows you the structure of your data

# ==============================================================================
# STEP 3: CLEAN THE COORDINATE DATA
# ==============================================================================

# In some countries, decimals are written with commas (like 45,678 instead of 45.678)
# R needs dots (.) for decimals, so we convert commas to dots
spiders$X <- as.numeric(gsub(",", ".", spiders$X))  # Convert X coordinate
spiders$Y <- as.numeric(gsub(",", ".", spiders$Y))  # Convert Y coordinate

# Check if there are any duplicate records (same spider recorded twice)
# We use distinct() to keep only unique rows
spiders <- distinct(spiders)

# Tell us how many unique spider locations we have
cat("Loaded", nrow(spiders), "unique spider records\n")

# ==============================================================================
# STEP 4: LOAD THE 19 BIOCLIM RASTER FILES
# ==============================================================================

# Define the folder path where all climate rasters are stored
climate_folder <- "/run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/climate/"

# List all the bioclim files in that folder
# pattern = ".tif$" means we want files ending with .tif (raster files)
bio_files <- list.files(climate_folder, pattern = ".tif$", full.names = TRUE)

# Show how many files we found (should be 20 - 19 bioclim + 1 elevation)
cat("Found", length(bio_files), "raster files\n")

# Load all raster files as a single "raster stack"
# This combines all 19 climate layers into one object for easy extraction
climate_stack <- rast(bio_files)

# Check the names of the loaded layers (they currently have generic names like wc2.1_2.5m_bio_1)
names(climate_stack)

# ==============================================================================
# STEP 5: GIVE MEANINGFUL NAMES TO ALL 19 BIOCLIM VARIABLES
# ==============================================================================

# Create a vector of meaningful names for each Bioclim variable
# These names follow the standard WorldClim BIO1-BIO19 nomenclature
meaningful_names <- c(
  "bio1_annual_mean_temp",           # BIO1 = Annual Mean Temperature
  "bio2_mean_diurnal_range",         # BIO2 = Mean Diurnal Range (Mean of monthly (max temp - min temp))
  "bio3_isothermality",              # BIO3 = Isothermality (BIO2/BIO7) (* 100)
  "bio4_temp_seasonality",           # BIO4 = Temperature Seasonality (standard deviation *100)
  "bio5_max_temp_warmest_month",     # BIO5 = Max Temperature of Warmest Month
  "bio6_min_temp_coldest_month",     # BIO6 = Min Temperature of Coldest Month
  "bio7_temp_annual_range",          # BIO7 = Temperature Annual Range (BIO5-BIO6)
  "bio8_mean_temp_wettest_quarter",  # BIO8 = Mean Temperature of Wettest Quarter
  "bio9_mean_temp_driest_quarter",   # BIO9 = Mean Temperature of Driest Quarter
  "bio10_mean_temp_warmest_quarter", # BIO10 = Mean Temperature of Warmest Quarter
  "bio11_mean_temp_coldest_quarter", # BIO11 = Mean Temperature of Coldest Quarter
  "bio12_annual_precip",             # BIO12 = Annual Precipitation
  "bio13_precip_wettest_month",      # BIO13 = Precipitation of Wettest Month
  "bio14_precip_driest_month",       # BIO14 = Precipitation of Driest Month
  "bio15_precip_seasonality",        # BIO15 = Precipitation Seasonality (Coefficient of Variation)
  "bio16_precip_wettest_quarter",    # BIO16 = Precipitation of Wettest Quarter
  "bio17_precip_driest_quarter",     # BIO17 = Precipitation of Driest Quarter
  "bio18_precip_warmest_quarter",    # BIO18 = Precipitation of Warmest Quarter
  "bio19_precip_coldest_quarter",    # BIO19 = Precipitation of Coldest Quarter
  "elevation"                        # The 20th file is elevation (not a Bioclim variable)
)

# Assign these meaningful names to the climate stack
names(climate_stack) <- meaningful_names

# Verify the new names
cat("\n=== New variable names assigned ===\n")
print(names(climate_stack))

# ==============================================================================
# STEP 6: EXTRACT CLIMATE DATA FOR SPIDER POINTS
# ==============================================================================

# This is the main step - extract climate values at each spider coordinate
# The extract() function takes the raster stack and the points (X,Y coordinates)
spider_climate <- extract(climate_stack, spiders[, c("X", "Y")])

# Look at what we extracted - now with meaningful column names!
cat("\nFirst few rows of extracted climate data:\n")
head(spider_climate)

# ==============================================================================
# STEP 7: COMBINE SPIDER DATA WITH CLIMATE DATA
# ==============================================================================

# Remove the ID column that extract() adds automatically
spider_climate <- spider_climate[, -1]  # -1 means "remove the first column"

# Combine the original spider data with the climate data
# cbind() means "column bind" - it puts two data frames side by side
final_data <- cbind(spiders, spider_climate)

# Check the final combined data with meaningful names
cat("\nFinal dataset with meaningful variable names:\n")
head(final_data)

# Show all column names to verify
cat("\nColumn names in final dataset:\n")
print(names(final_data))

# ==============================================================================
# STEP 8: SAVE AS CSV FILE FOR LATER USE
# ==============================================================================

# Save the complete dataset as a CSV file
# This file can now be used for all subsequent analyses
write.csv(final_data, 
          file = "/run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/spider_climate_data.csv",
          row.names = FALSE)  # row.names=FALSE prevents R from adding its own row numbers

# Confirm the file was saved
cat("\n=== SAVING COMPLETE ===\n")
cat("Climate data saved to: /run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/spider_climate_data.csv\n")
cat("Total records saved:", nrow(final_data), "\n")
cat("Total columns (spider info + 20 environmental variables):", ncol(final_data), "\n")

# ==============================================================================
# STEP 9: LOOK AT BASIC SUMMARY STATISTICS WITH MEANINGFUL NAMES
# ==============================================================================

# Show summary statistics for all climate variables
cat("\n=== SUMMARY STATISTICS FOR ALL ENVIRONMENTAL VARIABLES ===\n")
summary(final_data[, 3:ncol(final_data)])  # 3:ncol() means "from column 3 to the last column"

# ==============================================================================
# STEP 10: CREATE A QUICK REFERENCE GUIDE
# ==============================================================================

# Create a data frame explaining what each variable means
variable_guide <- data.frame(
  Column_Name = names(final_data)[3:ncol(final_data)],
  Description = c(
    "Annual Mean Temperature (°C)",
    "Mean Diurnal Range: Mean of monthly (max temp - min temp) (°C)",
    "Isothermality: (BIO2/BIO7) × 100",
    "Temperature Seasonality: standard deviation × 100",
    "Max Temperature of Warmest Month (°C)",
    "Min Temperature of Coldest Month (°C)",
    "Temperature Annual Range: BIO5 - BIO6 (°C)",
    "Mean Temperature of Wettest Quarter (°C)",
    "Mean Temperature of Driest Quarter (°C)",
    "Mean Temperature of Warmest Quarter (°C)",
    "Mean Temperature of Coldest Quarter (°C)",
    "Annual Precipitation (mm)",
    "Precipitation of Wettest Month (mm)",
    "Precipitation of Driest Month (mm)",
    "Precipitation Seasonality: Coefficient of Variation",
    "Precipitation of Wettest Quarter (mm)",
    "Precipitation of Driest Quarter (mm)",
    "Precipitation of Warmest Quarter (mm)",
    "Precipitation of Coldest Quarter (mm)",
    "Elevation (meters above sea level)"
  )
)

# Save the variable guide as a separate CSV for reference
write.csv(variable_guide, 
          file = "/run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/climate_variable_guide.csv",
          row.names = FALSE)

cat("\n=== VARIABLE GUIDE CREATED ===\n")
cat("A guide to all variable meanings saved to: /run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/climate_variable_guide.csv\n")
print(variable_guide)
         X        Y
1 74.24942 31.44471
2 74.24942 31.44471
3 74.24942 31.44471
4 74.22886 31.46673
5 74.22886 31.46673
6 74.22886 31.46673
Loaded 93 unique spider records
Found 20 raster files
 [1] "wc2.1_2.5m_bio_1"  "wc2.1_2.5m_bio_10" "wc2.1_2.5m_bio_11"
 [4] "wc2.1_2.5m_bio_12" "wc2.1_2.5m_bio_13" "wc2.1_2.5m_bio_14"
 [7] "wc2.1_2.5m_bio_15" "wc2.1_2.5m_bio_16" "wc2.1_2.5m_bio_17"
[10] "wc2.1_2.5m_bio_18" "wc2.1_2.5m_bio_19" "wc2.1_2.5m_bio_2" 
[13] "wc2.1_2.5m_bio_3"  "wc2.1_2.5m_bio_4"  "wc2.1_2.5m_bio_5" 
[16] "wc2.1_2.5m_bio_6"  "wc2.1_2.5m_bio_7"  "wc2.1_2.5m_bio_8" 
[19] "wc2.1_2.5m_bio_9"  "wc2.1_2.5m_elev"  

=== New variable names assigned ===
 [1] "bio1_annual_mean_temp"           "bio2_mean_diurnal_range"        
 [3] "bio3_isothermality"              "bio4_temp_seasonality"          
 [5] "bio5_max_temp_warmest_month"     "bio6_min_temp_coldest_month"    
 [7] "bio7_temp_annual_range"          "bio8_mean_temp_wettest_quarter" 
 [9] "bio9_mean_temp_driest_quarter"   "bio10_mean_temp_warmest_quarter"
[11] "bio11_mean_temp_coldest_quarter" "bio12_annual_precip"            
[13] "bio13_precip_wettest_month"      "bio14_precip_driest_month"      
[15] "bio15_precip_seasonality"        "bio16_precip_wettest_quarter"   
[17] "bio17_precip_driest_quarter"     "bio18_precip_warmest_quarter"   
[19] "bio19_precip_coldest_quarter"    "elevation"                      

First few rows of extracted climate data:
  ID bio1_annual_mean_temp bio2_mean_diurnal_range bio3_isothermality
1  1              24.10500                32.08733           14.00867
2  2              24.05683                31.98133           13.98867
3  3              23.98133                32.06467           13.80600
4  4              23.98133                32.06467           13.80600
5  5              23.98133                32.06467           13.80600
6  6              23.98133                32.06467           13.80600
  bio4_temp_seasonality bio5_max_temp_warmest_month bio6_min_temp_coldest_month
1                   467                         136                           4
2                   474                         139                           4
3                   456                         133                           4
4                   456                         133                           4
5                   456                         133                           4
6                   456                         133                           4
  bio7_temp_annual_range bio8_mean_temp_wettest_quarter
1               111.3598                            308
2               111.9311                            313
3               112.1244                            303
4               112.1244                            303
5               112.1244                            303
6               112.1244                            303
  bio9_mean_temp_driest_quarter bio10_mean_temp_warmest_quarter
1                            21                             192
2                            21                             195
3                            20                             187
4                            20                             187
5                            20                             187
6                            20                             187
  bio11_mean_temp_coldest_quarter bio12_annual_precip
1                              51            13.13000
2                              52            12.98900
3                              49            13.34933
4                              49            13.34933
5                              49            13.34933
6                              49            13.34933
  bio13_precip_wettest_month bio14_precip_driest_month bio15_precip_seasonality
1                   38.21749                  753.0319                   40.088
2                   38.09985                  749.9002                   39.932
3                   38.33812                  760.3315                   40.200
4                   38.33812                  760.3315                   40.200
5                   38.33812                  760.3315                   40.200
6                   38.33812                  760.3315                   40.200
  bio16_precip_wettest_quarter bio17_precip_driest_quarter
1                        5.732                      34.356
2                        5.840                      34.092
3                        5.380                      34.820
4                        5.380                      34.820
5                        5.380                      34.820
6                        5.380                      34.820
  bio18_precip_warmest_quarter bio19_precip_coldest_quarter elevation
1                     30.57733                     19.45600       207
2                     30.50533                     19.47667       206
3                     30.51733                     19.29333       206
4                     30.51733                     19.29333       206
5                     30.51733                     19.29333       206
6                     30.51733                     19.29333       206

Final dataset with meaningful variable names:
         X        Y bio1_annual_mean_temp bio2_mean_diurnal_range
1 74.24942 31.44471              24.10500                32.08733
2 74.22886 31.46673              24.05683                31.98133
3 74.21580 31.41358              23.98133                32.06467
4 74.22285 31.40976              23.98133                32.06467
5 74.22158 31.40849              23.98133                32.06467
6 74.22484 31.40526              23.98133                32.06467
  bio3_isothermality bio4_temp_seasonality bio5_max_temp_warmest_month
1           14.00867                   467                         136
2           13.98867                   474                         139
3           13.80600                   456                         133
4           13.80600                   456                         133
5           13.80600                   456                         133
6           13.80600                   456                         133
  bio6_min_temp_coldest_month bio7_temp_annual_range
1                           4               111.3598
2                           4               111.9311
3                           4               112.1244
4                           4               112.1244
5                           4               112.1244
6                           4               112.1244
  bio8_mean_temp_wettest_quarter bio9_mean_temp_driest_quarter
1                            308                            21
2                            313                            21
3                            303                            20
4                            303                            20
5                            303                            20
6                            303                            20
  bio10_mean_temp_warmest_quarter bio11_mean_temp_coldest_quarter
1                             192                              51
2                             195                              52
3                             187                              49
4                             187                              49
5                             187                              49
6                             187                              49
  bio12_annual_precip bio13_precip_wettest_month bio14_precip_driest_month
1            13.13000                   38.21749                  753.0319
2            12.98900                   38.09985                  749.9002
3            13.34933                   38.33812                  760.3315
4            13.34933                   38.33812                  760.3315
5            13.34933                   38.33812                  760.3315
6            13.34933                   38.33812                  760.3315
  bio15_precip_seasonality bio16_precip_wettest_quarter
1                   40.088                        5.732
2                   39.932                        5.840
3                   40.200                        5.380
4                   40.200                        5.380
5                   40.200                        5.380
6                   40.200                        5.380
  bio17_precip_driest_quarter bio18_precip_warmest_quarter
1                      34.356                     30.57733
2                      34.092                     30.50533
3                      34.820                     30.51733
4                      34.820                     30.51733
5                      34.820                     30.51733
6                      34.820                     30.51733
  bio19_precip_coldest_quarter elevation
1                     19.45600       207
2                     19.47667       206
3                     19.29333       206
4                     19.29333       206
5                     19.29333       206
6                     19.29333       206

Column names in final dataset:
 [1] "X"                               "Y"                              
 [3] "bio1_annual_mean_temp"           "bio2_mean_diurnal_range"        
 [5] "bio3_isothermality"              "bio4_temp_seasonality"          
 [7] "bio5_max_temp_warmest_month"     "bio6_min_temp_coldest_month"    
 [9] "bio7_temp_annual_range"          "bio8_mean_temp_wettest_quarter" 
[11] "bio9_mean_temp_driest_quarter"   "bio10_mean_temp_warmest_quarter"
[13] "bio11_mean_temp_coldest_quarter" "bio12_annual_precip"            
[15] "bio13_precip_wettest_month"      "bio14_precip_driest_month"      
[17] "bio15_precip_seasonality"        "bio16_precip_wettest_quarter"   
[19] "bio17_precip_driest_quarter"     "bio18_precip_warmest_quarter"   
[21] "bio19_precip_coldest_quarter"    "elevation"                      

=== SAVING COMPLETE ===
Climate data saved to: /run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/spider_climate_data.csv
Total records saved: 93 
Total columns (spider info + 20 environmental variables): 22 

=== SUMMARY STATISTICS FOR ALL ENVIRONMENTAL VARIABLES ===
 bio1_annual_mean_temp bio2_mean_diurnal_range bio3_isothermality
 Min.   :21.94         Min.   :30.24           Min.   :11.60     
 1st Qu.:23.23         1st Qu.:31.45           1st Qu.:12.94     
 Median :23.64         Median :31.89           Median :13.10     
 Mean   :23.67         Mean   :31.99           Mean   :13.28     
 3rd Qu.:23.98         3rd Qu.:32.21           3rd Qu.:13.63     
 Max.   :25.15         Max.   :33.81           Max.   :14.24     
 bio4_temp_seasonality bio5_max_temp_warmest_month bio6_min_temp_coldest_month
 Min.   :166.0         Min.   : 46.0               Min.   :2.000              
 1st Qu.:377.0         1st Qu.:107.0               1st Qu.:3.000              
 Median :471.0         Median :136.0               Median :4.000              
 Mean   :466.9         Mean   :132.8               Mean   :4.495              
 3rd Qu.:566.0         3rd Qu.:159.0               3rd Qu.:6.000              
 Max.   :764.0         Max.   :218.0               Max.   :8.000              
 bio7_temp_annual_range bio8_mean_temp_wettest_quarter
 Min.   : 87.09         Min.   :106.0                 
 1st Qu.:105.52         1st Qu.:243.0                 
 Median :108.97         Median :303.0                 
 Mean   :107.85         Mean   :302.8                 
 3rd Qu.:111.28         3rd Qu.:368.0                 
 Max.   :114.01         Max.   :508.0                 
 bio9_mean_temp_driest_quarter bio10_mean_temp_warmest_quarter
 Min.   :10.00                 Min.   : 68.0                  
 1st Qu.:18.00                 1st Qu.:175.0                  
 Median :22.00                 Median :203.0                  
 Mean   :22.46                 Mean   :194.9                  
 3rd Qu.:28.00                 3rd Qu.:226.0                  
 Max.   :39.00                 Max.   :283.0                  
 bio11_mean_temp_coldest_quarter bio12_annual_precip bio13_precip_wettest_month
 Min.   : 24.00                  Min.   :12.70       Min.   :37.13             
 1st Qu.: 44.00                  1st Qu.:13.32       1st Qu.:37.79             
 Median : 53.00                  Median :13.41       Median :37.99             
 Mean   : 58.25                  Mean   :13.56       Mean   :38.21             
 3rd Qu.: 74.00                  3rd Qu.:13.77       3rd Qu.:38.47             
 Max.   :102.00                  Max.   :14.85       Max.   :39.87             
 bio14_precip_driest_month bio15_precip_seasonality
 Min.   :743.4             Min.   :39.42           
 1st Qu.:760.3             1st Qu.:39.87           
 Median :767.6             Median :40.12           
 Mean   :777.1             Mean   :40.29           
 3rd Qu.:796.8             3rd Qu.:40.46           
 Max.   :827.0             Max.   :42.03           
 bio16_precip_wettest_quarter bio17_precip_driest_quarter
 Min.   :2.656                Min.   :33.73              
 1st Qu.:4.504                1st Qu.:34.84              
 Median :4.732                Median :35.37              
 Mean   :4.813                Mean   :35.47              
 3rd Qu.:5.200                3rd Qu.:36.03              
 Max.   :6.276                Max.   :37.54              
 bio18_precip_warmest_quarter bio19_precip_coldest_quarter   elevation    
 Min.   :28.77                Min.   :17.39                Min.   :134.0  
 1st Qu.:29.72                1st Qu.:18.60                1st Qu.:189.0  
 Median :30.34                Median :18.90                Median :206.0  
 Mean   :30.47                Mean   :18.94                Mean   :203.8  
 3rd Qu.:30.87                3rd Qu.:19.17                3rd Qu.:216.0  
 Max.   :33.71                Max.   :20.18                Max.   :521.0  

=== VARIABLE GUIDE CREATED ===
A guide to all variable meanings saved to: /run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/climate_variable_guide.csv
                       Column_Name
1            bio1_annual_mean_temp
2          bio2_mean_diurnal_range
3               bio3_isothermality
4            bio4_temp_seasonality
5      bio5_max_temp_warmest_month
6      bio6_min_temp_coldest_month
7           bio7_temp_annual_range
8   bio8_mean_temp_wettest_quarter
9    bio9_mean_temp_driest_quarter
10 bio10_mean_temp_warmest_quarter
11 bio11_mean_temp_coldest_quarter
12             bio12_annual_precip
13      bio13_precip_wettest_month
14       bio14_precip_driest_month
15        bio15_precip_seasonality
16    bio16_precip_wettest_quarter
17     bio17_precip_driest_quarter
18    bio18_precip_warmest_quarter
19    bio19_precip_coldest_quarter
20                       elevation
                                                      Description
1                                    Annual Mean Temperature (°C)
2  Mean Diurnal Range: Mean of monthly (max temp - min temp) (°C)
3                                Isothermality: (BIO2/BIO7) × 100
4               Temperature Seasonality: standard deviation × 100
5                           Max Temperature of Warmest Month (°C)
6                           Min Temperature of Coldest Month (°C)
7                      Temperature Annual Range: BIO5 - BIO6 (°C)
8                        Mean Temperature of Wettest Quarter (°C)
9                         Mean Temperature of Driest Quarter (°C)
10                       Mean Temperature of Warmest Quarter (°C)
11                       Mean Temperature of Coldest Quarter (°C)
12                                      Annual Precipitation (mm)
13                            Precipitation of Wettest Month (mm)
14                             Precipitation of Driest Month (mm)
15            Precipitation Seasonality: Coefficient of Variation
16                          Precipitation of Wettest Quarter (mm)
17                           Precipitation of Driest Quarter (mm)
18                          Precipitation of Warmest Quarter (mm)
19                          Precipitation of Coldest Quarter (mm)
20                             Elevation (meters above sea level)

Quick Reference: What Each Bioclim Variable Means

Variable Name Description Units
bio1_annual_mean_temp Annual Mean Temperature °C
bio2_mean_diurnal_range Mean diurnal range (mean of monthly max-min) °C
bio3_isothermality (BIO2/BIO7) × 100 %
bio4_temp_seasonality Temperature seasonality (SD × 100) -
bio5_max_temp_warmest_month Max temperature of warmest month °C
bio6_min_temp_coldest_month Min temperature of coldest month °C
bio7_temp_annual_range Temperature annual range (BIO5-BIO6) °C
bio8_mean_temp_wettest_quarter Mean temperature of wettest quarter °C
bio9_mean_temp_driest_quarter Mean temperature of driest quarter °C
bio10_mean_temp_warmest_quarter Mean temperature of warmest quarter °C
bio11_mean_temp_coldest_quarter Mean temperature of coldest quarter °C
bio12_annual_precip Annual precipitation mm
bio13_precip_wettest_month Precipitation of wettest month mm
bio14_precip_driest_month Precipitation of driest month mm
bio15_precip_seasonality Precipitation seasonality (CV) -
bio16_precip_wettest_quarter Precipitation of wettest quarter mm
bio17_precip_driest_quarter Precipitation of driest quarter mm
bio18_precip_warmest_quarter Precipitation of warmest quarter mm
bio19_precip_coldest_quarter Precipitation of coldest quarter mm
elevation Elevation meters

Now when you load the CSV file in future sessions, you’ll see meaningful names like bio12_annual_precip instead of generic names like wc2.1_2.5m_bio_12!

0. Setup - Load Required Packages

Code
# Load packages - think of these as toolboxes that give us new functions
library(ggplot2)      # For creating beautiful plots
library(dplyr)        # For manipulating data (filtering, selecting, creating new columns)
library(tidyr)        # For reshaping data (pivot_longer/wider)
library(GGally)       # For creating pairs plots (multiple variables at once)
library(viridis)      # For color-blind friendly color schemes
library(MASS)         # For Box-Cox transformation
library(isotree)      # For Isolation Forest (outlier detection)
library(patchwork)    # For combining multiple plots into one
library(sf)           # For working with spatial data (points, boundaries)

1. Load Data

Once you’ve saved the CSV file, you can load it in future sessions with:

Code
# ==============================================================================
# PART 1: LOAD THE DATA (FROM THE CSV WE CREATED IN STEP 1)
# ==============================================================================

# Load the spider climate data we saved earlier
# This file contains spider coordinates + all 19 Bioclim variables
cat("\n=== STEP 1: LOADING DATA ===\n")

# Read the CSV file
spider_data <- read.csv("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/spider_climate_data.csv")

# Check what's in our data
cat("Data loaded successfully!\n")
cat("Number of rows (spider locations):", nrow(spider_data), "\n")
cat("Number of columns (variables):", ncol(spider_data), "\n\n")

# Look at the first few rows to understand the structure
cat("First 5 rows of data:\n")
print(head(spider_data, 5))

# Show all column names with their meanings
cat("\nColumn names in our dataset:\n")
print(names(spider_data))

=== STEP 1: LOADING DATA ===
Data loaded successfully!
Number of rows (spider locations): 93 
Number of columns (variables): 22 

First 5 rows of data:
         X        Y bio1_annual_mean_temp bio2_mean_diurnal_range
1 74.24942 31.44471              24.10500                32.08733
2 74.22886 31.46673              24.05683                31.98133
3 74.21580 31.41358              23.98133                32.06467
4 74.22285 31.40976              23.98133                32.06467
5 74.22158 31.40849              23.98133                32.06467
  bio3_isothermality bio4_temp_seasonality bio5_max_temp_warmest_month
1           14.00867                   467                         136
2           13.98867                   474                         139
3           13.80600                   456                         133
4           13.80600                   456                         133
5           13.80600                   456                         133
  bio6_min_temp_coldest_month bio7_temp_annual_range
1                           4               111.3598
2                           4               111.9311
3                           4               112.1244
4                           4               112.1244
5                           4               112.1244
  bio8_mean_temp_wettest_quarter bio9_mean_temp_driest_quarter
1                            308                            21
2                            313                            21
3                            303                            20
4                            303                            20
5                            303                            20
  bio10_mean_temp_warmest_quarter bio11_mean_temp_coldest_quarter
1                             192                              51
2                             195                              52
3                             187                              49
4                             187                              49
5                             187                              49
  bio12_annual_precip bio13_precip_wettest_month bio14_precip_driest_month
1            13.13000                   38.21749                  753.0319
2            12.98900                   38.09985                  749.9002
3            13.34933                   38.33812                  760.3315
4            13.34933                   38.33812                  760.3315
5            13.34933                   38.33812                  760.3315
  bio15_precip_seasonality bio16_precip_wettest_quarter
1                   40.088                        5.732
2                   39.932                        5.840
3                   40.200                        5.380
4                   40.200                        5.380
5                   40.200                        5.380
  bio17_precip_driest_quarter bio18_precip_warmest_quarter
1                      34.356                     30.57733
2                      34.092                     30.50533
3                      34.820                     30.51733
4                      34.820                     30.51733
5                      34.820                     30.51733
  bio19_precip_coldest_quarter elevation
1                     19.45600       207
2                     19.47667       206
3                     19.29333       206
4                     19.29333       206
5                     19.29333       206

Column names in our dataset:
 [1] "X"                               "Y"                              
 [3] "bio1_annual_mean_temp"           "bio2_mean_diurnal_range"        
 [5] "bio3_isothermality"              "bio4_temp_seasonality"          
 [7] "bio5_max_temp_warmest_month"     "bio6_min_temp_coldest_month"    
 [9] "bio7_temp_annual_range"          "bio8_mean_temp_wettest_quarter" 
[11] "bio9_mean_temp_driest_quarter"   "bio10_mean_temp_warmest_quarter"
[13] "bio11_mean_temp_coldest_quarter" "bio12_annual_precip"            
[15] "bio13_precip_wettest_month"      "bio14_precip_driest_month"      
[17] "bio15_precip_seasonality"        "bio16_precip_wettest_quarter"   
[19] "bio17_precip_driest_quarter"     "bio18_precip_warmest_quarter"   
[21] "bio19_precip_coldest_quarter"    "elevation"                      

2. Create Spatial Object For Mapping

Code
cat("\n=== STEP 2: CREATING SPATIAL OBJECTS ===\n")

# Load the sf package for spatial operations
library(sf)

# Convert our spider points to an sf (simple features) spatial object
# This tells R that X and Y are coordinates, not just numbers
# CRS = 4326 means WGS84 (standard latitude/longitude)
sp_points <- st_as_sf(spider_data, coords = c("X", "Y"), crs = 4326)
cat("✓ Created spatial points object\n")

# Transform to UTM (meters) for distance calculations
# UTM zone 43N is correct for Pakistan
sp_utm <- st_transform(sp_points, 32643)
cat("✓ Transformed to UTM coordinates (meters)\n")

# Extract UTM coordinates and add them to our data frame
coords <- st_coordinates(sp_utm)
spider_data$x <- coords[,1]  # Easting (x-coordinate in meters)
spider_data$y <- coords[,2]  # Northing (y-coordinate in meters)
cat("✓ Added UTM coordinates (x, y in meters) to data frame\n")

# ==============================================================================
# PART 3: LOAD PAKISTAN BOUNDARY FOR MAPPING CONTEXT
# ==============================================================================

cat("\n=== STEP 3: LOADING PAKISTAN BOUNDARY ===\n")

# Load the Pakistan administrative boundary shapefile
# This helps us see where our points are located geographically
pakistan_boundary <- st_read("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/boundaries/pakistan_admin.shp")

# Transform to same UTM projection as our points
pakistan_boundary <- st_transform(pakistan_boundary, 32643)
cat("✓ Pakistan boundary loaded and transformed\n")

# Quick plot to check
cat("Plotting spider locations on Pakistan map...\n")
plot(st_geometry(pakistan_boundary), main = "Spider Locations in Pakistan")
plot(st_geometry(sp_utm), add = TRUE, col = "red", pch = 16, cex = 0.5)


=== STEP 2: CREATING SPATIAL OBJECTS ===
✓ Created spatial points object
✓ Transformed to UTM coordinates (meters)
✓ Added UTM coordinates (x, y in meters) to data frame

=== STEP 3: LOADING PAKISTAN BOUNDARY ===
Reading layer `pakistan_admin' from data source 
  `/run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/boundaries/pakistan_admin.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 8 features and 11 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 60.89944 ymin: 23.70292 xmax: 77.84308 ymax: 37.09701
Geodetic CRS:  WGS 84
✓ Pakistan boundary loaded and transformed
Plotting spider locations on Pakistan map...

3. Select Variables For Analysis

Code
cat("\n=== SELECTING VARIABLES FOR ANALYSIS ===\n")

# For this workshop, we'll focus on key environmental variables
# Select spider ID, coordinates, and 7 key climate variables
# Using the meaningful names we assigned earlier

# First, let's see all the climate variable names
climate_vars <- names(spider_data)[3:22]  # Columns 3-22 are our climate variables
cat("All available climate variables:\n")
print(climate_vars)

# Select a subset for analysis (to keep things manageable)
# We'll use: Annual Mean Temp, Annual Precip, Elevation, and a few BioClim variables
data_subset <- spider_data %>%
  dplyr::select(
    # Identifying information
    X, Y, x, y,  # coordinates
    # Climate variables (using meaningful names)
    bio1_annual_mean_temp,      # Annual Mean Temperature
    bio12_annual_precip,        # Annual Precipitation
    elevation,                  # Elevation
    bio5_max_temp_warmest_month, # Max temp of warmest month
    bio6_min_temp_coldest_month,  # Min temp of coldest month
    bio4_temp_seasonality,       # Temperature seasonality
    bio15_precip_seasonality     # Precipitation seasonality
  )

# Rename to shorter names for easier coding
data_clean <- data_subset %>%
  rename(
    temp = bio1_annual_mean_temp,
    precip = bio12_annual_precip,
    elev = elevation,
    tmax = bio5_max_temp_warmest_month,
    tmin = bio6_min_temp_coldest_month,
    temp_seas = bio4_temp_seasonality,
    precip_seas = bio15_precip_seasonality
  )

cat("\nSelected variables for analysis:\n")
print(names(data_clean))

# ==============================================================================
# BASIC SUMMARY STATISTICS
# ==============================================================================

cat("\n=== BASIC SUMMARY STATISTICS ===\n")
cat("This helps us understand the range and scale of our variables\n\n")

# Calculate summary statistics for all climate variables
summary_stats <- summary(data_clean[, c("temp", "precip", "elev", "tmax", "tmin", "temp_seas", "precip_seas")])
print(summary_stats)

# Calculate skewness for each variable
# Skewness tells us if data is symmetric (0) or skewed to left/right
# Values > 1 indicate highly skewed distributions
skewness <- function(x) {
  n <- length(x)
  m3 <- sum((x - mean(x, na.rm = TRUE))^3, na.rm = TRUE) / n
  m2 <- sum((x - mean(x, na.rm = TRUE))^2, na.rm = TRUE) / n
  return(m3 / (m2^(3/2)))
}

cat("\n=== Skewness Values ===\n")
cat("Skewness > 1 = highly skewed,可能需要 transformation\n")
for(var in c("temp", "precip", "elev", "tmax", "tmin", "temp_seas", "precip_seas")) {
  skew_val <- skewness(data_clean[[var]])
  cat(var, ":", round(skew_val, 3))
  if(abs(skew_val) > 1) cat(" ⚠️ Highly skewed")
  cat("\n")
}

=== SELECTING VARIABLES FOR ANALYSIS ===
All available climate variables:
 [1] "bio1_annual_mean_temp"           "bio2_mean_diurnal_range"        
 [3] "bio3_isothermality"              "bio4_temp_seasonality"          
 [5] "bio5_max_temp_warmest_month"     "bio6_min_temp_coldest_month"    
 [7] "bio7_temp_annual_range"          "bio8_mean_temp_wettest_quarter" 
 [9] "bio9_mean_temp_driest_quarter"   "bio10_mean_temp_warmest_quarter"
[11] "bio11_mean_temp_coldest_quarter" "bio12_annual_precip"            
[13] "bio13_precip_wettest_month"      "bio14_precip_driest_month"      
[15] "bio15_precip_seasonality"        "bio16_precip_wettest_quarter"   
[17] "bio17_precip_driest_quarter"     "bio18_precip_warmest_quarter"   
[19] "bio19_precip_coldest_quarter"    "elevation"                      

Selected variables for analysis:
 [1] "X"           "Y"           "x"           "y"           "temp"       
 [6] "precip"      "elev"        "tmax"        "tmin"        "temp_seas"  
[11] "precip_seas"

=== BASIC SUMMARY STATISTICS ===
This helps us understand the range and scale of our variables

      temp           precip           elev            tmax      
 Min.   :21.94   Min.   :12.70   Min.   :134.0   Min.   : 46.0  
 1st Qu.:23.23   1st Qu.:13.32   1st Qu.:189.0   1st Qu.:107.0  
 Median :23.64   Median :13.41   Median :206.0   Median :136.0  
 Mean   :23.67   Mean   :13.56   Mean   :203.8   Mean   :132.8  
 3rd Qu.:23.98   3rd Qu.:13.77   3rd Qu.:216.0   3rd Qu.:159.0  
 Max.   :25.15   Max.   :14.85   Max.   :521.0   Max.   :218.0  
      tmin         temp_seas      precip_seas   
 Min.   :2.000   Min.   :166.0   Min.   :39.42  
 1st Qu.:3.000   1st Qu.:377.0   1st Qu.:39.87  
 Median :4.000   Median :471.0   Median :40.12  
 Mean   :4.495   Mean   :466.9   Mean   :40.29  
 3rd Qu.:6.000   3rd Qu.:566.0   3rd Qu.:40.46  
 Max.   :8.000   Max.   :764.0   Max.   :42.03  

=== Skewness Values ===
Skewness > 1 = highly skewed,可能需要 transformation
temp : 0.514
precip : 0.981
elev : 5.01 ⚠️ Highly skewed
tmax : -0.077
tmin : 0.146
temp_seas : -0.084
precip_seas : 1.171 ⚠️ Highly skewed

Selected Environmental Predictors

Variable Meaning
temp (BIO1) Annual mean temperature
precip (BIO12) Annual precipitation
elev Elevation above sea level
tmax (BIO5) Maximum temperature of warmest month
tmin (BIO6) Minimum temperature of coldest month
temp_seas (BIO4) Temperature seasonality (SD ×100)
precip_seas (BIO15) Precipitation seasonality (coefficient of variation)

Interpretation

These variables capture three key environmental gradients:

  • thermal environment

  • water availability

  • topographic variation

Together they represent major drivers of species distributions.


Summary Statistics

Key Observations

Variable Range Interpretation
temp ~22–25°C Moderate temperature variation
precip ~12.7–14.8 Narrow precipitation gradient
elev 134–521 m Broad elevation range
tmax 46–218 Likely stored as °C ×10 (WorldClim standard)
tmin 2–8°C Mild winter temperatures
temp_seas 166–764 Large seasonal temperature variability
precip_seas 39–42 Moderate precipitation variability

The tmax values are NOT errors.

WorldClim stores temperatures as: Temperature=°C×10Temperature = °C 10Temperature=°C×10

Example:

218 → 21.8°C

Therefore:

46 → 4.6°C

218 → 21.8°C

Interpretation

The study region appears to have:

  • moderate temperature variation

  • relatively dry conditions

  • moderate elevation differences

Skewness Analysis

Variable Skewness Interpretation
temp 0.51 Mild right skew
precip 0.98 Moderate skew
elev 5.01 Strong right skew
tmax -0.08 Symmetric
tmin 0.15 Symmetric
temp_seas -0.08 Symmetric
precip_seas 1.17 Right skew

Interpretation

  • Elevation is extremely right-skewed → most observations at lower elevations

  • Precipitation seasonality also skewed → few sites with very high variability

  • Temperature variables are roughly symmetric

Ecological Meaning

Elevation skew could reflect:

  • sampling bias (lowlands easier to access)

  • true landscape structure (few high mountains)

Next Step Implication:

  • Elevation and precip_seas need transformation (log or Box-Cox)

  • Precipitation may benefit from transformation despite being <1


4. Visualize Distributions (Histograms)

Code
cat("\n=== CREATING DISTRIBUTION PLOTS ===\n")
cat("Histograms show us the shape of each variable's distribution\n")

# Create histograms for all variables
# First, reshape data to long format for easy plotting
data_long <- data_clean %>%
  pivot_longer(cols = c(temp, precip, elev, tmax, tmin, temp_seas, precip_seas),
               names_to = "variable",
               values_to = "value")

# Create histogram plot
p_hist <- ggplot(data_long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white", alpha = 0.7) +
  facet_wrap(~variable, scales = "free", ncol = 3) +
  labs(
    title = "Distribution of Environmental Variables",
    subtitle = "Histograms show the shape of each distribution",
    x = "Value", 
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5),
    strip.text = element_text(face = "bold", size = 10)
  )

# Display the plot
print(p_hist)

Code
# Save the plot
ggsave("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/output/Session_2/histograms.png", p_hist, width = 12, height = 8, dpi = 300)

cat("✓ Histograms saved\n")

=== CREATING DISTRIBUTION PLOTS ===
Histograms show us the shape of each variable's distribution
✓ Histograms saved

Histograms help visualize distribution shapes.

Variable Pattern Interpretation
elev strong right skew most sites low elevation
precip moderate skew majority relatively dry
precip_seas near symmetric moderate seasonal variation
temp near normal stable thermal conditions
temp_seas wide distribution different climatic regimes
tmax wide range due to scaling reflects warm-season extremes

Key Point

Histogram shapes indicate whether transformations may improve model performance.


5. Create QQ Plots (Check normality)

Code
cat("\n=== CREATING QQ PLOTS ===\n")
cat("QQ plots compare our data to a normal distribution\n")
cat("Points following the diagonal line = normally distributed\n")

# Function to create QQ plot for a variable
create_qq <- function(data, var_name, color) {
  p <- ggplot(data, aes(sample = .data[[var_name]])) +
    stat_qq(color = color, alpha = 0.6, size = 1.5) +
    stat_qq_line(color = "red", linewidth = 1) +
    labs(
      title = paste("QQ Plot -", var_name),
      x = "Theoretical Quantiles", 
      y = "Sample Quantiles"
    ) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, size = 10))
  return(p)
}

# Create QQ plots for each variable
p_qq_temp <- create_qq(data_clean, "temp", "darkblue")
p_qq_precip <- create_qq(data_clean, "precip", "darkgreen")
p_qq_elev <- create_qq(data_clean, "elev", "orange")
p_qq_tmax <- create_qq(data_clean, "tmax", "purple")
p_qq_tmin <- create_qq(data_clean, "tmin", "brown")
p_qq_tempseas <- create_qq(data_clean, "temp_seas", "darkred")
p_qq_precipseas <- create_qq(data_clean, "precip_seas", "cyan4")

# Combine QQ plots
library(patchwork)
p_qq_all <- (p_qq_temp | p_qq_precip | p_qq_elev) /
            (p_qq_tmax | p_qq_tmin | p_qq_tempseas) /
            (p_qq_precipseas | plot_spacer() | plot_spacer()) +
  plot_annotation(
    title = "QQ Plots: Checking Normality",
    subtitle = "Points following red line = normal distribution",
    theme = theme(plot.title = element_text(hjust = 0.5, face = "bold"))
  )

print(p_qq_all)

Code
ggsave("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/output/Session_2/qq_plots.png",
       p_qq_all, width = 14, height = 12, dpi = 300)

cat("✓ QQ plots saved\n")

=== CREATING QQ PLOTS ===
QQ plots compare our data to a normal distribution
Points following the diagonal line = normally distributed
✓ QQ plots saved

QQ plots compare observed values with a theoretical normal distribution.

Interpretation

Variable Pattern
temp close to normal
precip slight deviation
elev strong deviation (skew)
tmax acceptable
tmin near normal
temp_seas moderate deviation
precip_seas acceptable

Conclusion

  • Elevation clearly non-normal

  • precipitation variables show mild skew

However:

Strict normality is not required for most ecological models.


6. Boxplots (Identify outliers)

Code
cat("\n=== CREATING BOXPLOTS ===\n")
cat("Boxplots show median, quartiles, and potential outliers\n")
cat("Points beyond whiskers are potential outliers\n")

# Create boxplots
p_box <- ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_viridis_d() +
  labs(
    title = "Boxplots of Environmental Variables",
    subtitle = "Points beyond whiskers are potential outliers",
    x = "", 
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

print(p_box)

Code
ggsave("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/output/Session_2/boxplots.png", 
       p_box, width = 10, height = 6, dpi = 300)

cat("✓ Boxplots saved\n")

=== CREATING BOXPLOTS ===
Boxplots show median, quartiles, and potential outliers
Points beyond whiskers are potential outliers
✓ Boxplots saved

Boxplots highlight extreme values.

Observations

Variable Observation
elev several high-elevation points
precip few high values
temp symmetric
temp_seas wide range
tmax large range (due to scaling)
tmin narrow distribution

Interpretation

Elevation outliers likely represent true geographic variation, not errors.

Therefore:

Do not remove them unless confirmed erroneous.


7. Correlation Matrix (Check multicollinearity)

Code
cat("\n=== CREATING CORRELATION MATRIX ===\n")
cat("Correlation shows how variables are related\n")
cat("Values > 0.7 or < -0.7 indicate strong correlation (multicollinearity risk)\n")

# Calculate correlation matrix
cor_matrix <- cor(data_clean[, c("temp", "precip", "elev", "tmax", "tmin", "temp_seas", "precip_seas")], 
                  use = "complete.obs")

# Convert to data frame for plotting
cor_df <- as.data.frame(as.table(cor_matrix))
names(cor_df) <- c("Var1", "Var2", "Correlation")

# Create heatmap
p_cor <- ggplot(cor_df, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "darkred", mid = "white", high = "darkblue", 
                       midpoint = 0, limits = c(-1, 1)) +
  geom_text(aes(label = round(Correlation, 2)), size = 3, color = "black") +
  labs(
    title = "Correlation Matrix",
    subtitle = "Red = negative, Blue = positive | >0.7 = strong correlation",
    x = "", 
    y = ""
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid = element_blank()
  )

print(p_cor)

Code
ggsave("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/output/Session_2/correlation_matrix.png", 
       p_cor, width = 8, height = 7, dpi = 300)

cat("✓ Correlation matrix saved\n")

# Identify highly correlated pairs
cat("\n=== HIGHLY CORRELATED VARIABLES (|r| > 0.7) ===\n")
high_cor <- which(abs(cor_matrix) > 0.7 & abs(cor_matrix) < 1, arr.ind = TRUE)
for(i in 1:nrow(high_cor)) {
  if(high_cor[i, 1] < high_cor[i, 2]) {  # Print each pair only once
    var1 <- rownames(cor_matrix)[high_cor[i, 1]]
    var2 <- colnames(cor_matrix)[high_cor[i, 2]]
    cor_val <- cor_matrix[high_cor[i, 1], high_cor[i, 2]]
    cat(var1, "and", var2, ":", round(cor_val, 3), "\n")
  }
}

=== CREATING CORRELATION MATRIX ===
Correlation shows how variables are related
Values > 0.7 or < -0.7 indicate strong correlation (multicollinearity risk)
✓ Correlation matrix saved

=== HIGHLY CORRELATED VARIABLES (|r| > 0.7) ===
temp and elev : -0.74 
temp and tmax : -0.838 
temp and tmin : -0.859 
tmax and tmin : 0.928 
temp and temp_seas : -0.87 
tmax and temp_seas : 0.996 
tmin and temp_seas : 0.947 
temp and precip_seas : 0.814 
precip and precip_seas : 0.828 
tmax and precip_seas : -0.825 
tmin and precip_seas : -0.707 
temp_seas and precip_seas : -0.82 

Major Correlations

Relationship Interpretation
temp vs elev (-0.74) higher elevation = cooler temperatures
tmax vs tmin (0.93) warmest and coldest extremes linked
temp vs temp_seas (-0.87) seasonal climates differ along temperature gradient
precip vs precip_seas (0.83) wetter regions show higher variability

Important Concept

Correlations above 0.7 indicate multicollinearity risk.

This means:

  • predictors carry overlapping information

  • model coefficients become unstable

Statistical Interpretation:

  • Multicollinearity ALERT: Many correlations >0.7

  • Near-perfect correlation (0.996) between tmax and temp_seas suggests these are the same information

  • Temperature variables are all highly intercorrelated

Biological Interpretation:

  • Temperature-elevation relationship: Higher elevations = cooler temps (expected)

  • Temperature-seasonality link: Areas with higher max temps have more seasonal variation

  • Precipitation-seasonality link: Wetter areas have more seasonal variation

Next Step Implication:

  • Session 3 focus: Must address multicollinearity

  • Options: PCA, variable selection, or regularization (ridge/lasso)

  • Cannot use all variables in simple GLM - coefficients will be unstable


8. Pairs Plot (Comprehensive overview)

Code
cat("\n=== CREATING PAIRS PLOT ===\n")
cat("Pairs plot shows relationships between all variables\n")

# Create pairs plot
p_pairs <- ggpairs(
  data_clean[, c("temp", "precip", "elev", "tmax", "tmin")],
  upper = list(continuous = wrap("cor", size = 3, color = "darkblue")),
  lower = list(continuous = wrap("points", alpha = 0.3, size = 0.8)),
  diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
  title = "Pairwise Relationships Among Environmental Variables"
) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    strip.text = element_text(face = "bold", size = 10)
  )

print(p_pairs)

Code
ggsave("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/output/Session_2/pairs_plot.png", 
       p_pairs, width = 12, height = 10, dpi = 300)

cat("✓ Pairs plot saved\n")

=== CREATING PAIRS PLOT ===
Pairs plot shows relationships between all variables
✓ Pairs plot saved

Scatterplots confirm relationships visually.

Observed Patterns

  • temperature decreases with elevation

  • tmax and tmin strongly correlated

  • temp_seas strongly linked to temperature gradients

Statistical Interpretation:

  • Linear patterns suggest simple relationships between variables

  • Tight clustering indicates low variance within relationships

  • The tmax-temp_seas relationship is essentially 1:1 - they measure the same thing

Biological Interpretation:

  • Temperature variables are redundant - they describe the same thermal gradient

  • This redundancy means we can use fewer variables without losing information

Interpretation

  • Many temperature variables describe the same climatic gradient.

  • Therefore: Not all predictors are necessary.

Next Step Implication:

  • For Session 3, we should select representative variables:

    • Keep elev (unique information)

    • Keep precip (unique information)

    • Choose ONE temperature variable (temp, tmax, or tmin)

    • Seasonality may be redundant with temperature range


9. Transformation - Log Transformation

Code
cat("\n=== LOG TRANSFORMATION ===\n")
cat("Log transformation: converts multiplicative effects to additive\n")
cat("Useful for right-skewed data (like precipitation)\n")

# Apply log transformation to precipitation (often right-skewed)
# First, shift to positive values (log requires >0)
min_precip <- min(data_clean$precip, na.rm = TRUE)
data_clean$precip_shifted <- data_clean$precip - min_precip + 1
data_clean$log_precip <- log(data_clean$precip_shifted)

# Compare distributions
p_log_compare <- ggplot() +
  # Raw precipitation
  geom_density(data = data_clean, aes(x = precip, fill = "Raw"), 
               alpha = 0.4, linewidth = 0.8) +
  # Log-transformed precipitation
  geom_density(data = data_clean, aes(x = log_precip, color = "Log"), 
               linewidth = 1.2) +
  scale_fill_manual(values = c("Raw" = "grey70"), name = "") +
  scale_color_manual(values = c("Log" = "darkgreen"), name = "") +
  labs(
    title = "Log Transformation of Precipitation",
    subtitle = paste0("Skewness: ", round(skewness(data_clean$precip), 2), 
                     " → ", round(skewness(data_clean$log_precip), 2)),
    x = "Value", 
    y = "Density"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

print(p_log_compare)

Code
ggsave("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/output/Session_2/log_transformation.png", 
       p_log_compare, width = 8, height = 5, dpi = 300)

cat("✓ Log transformation plot saved\n")

=== LOG TRANSFORMATION ===
Log transformation: converts multiplicative effects to additive
Useful for right-skewed data (like precipitation)
✓ Log transformation plot saved

Applied to precipitation.

Before vs After:

  • Skewness: 0.98 → 0.41 (improved)

  • Distribution: Right-skewed → more symmetric

Statistical Interpretation:

  • Log transformation successfully reduced skewness (improves symmetry)

  • Distribution now closer to normal (compresses extreme values )

Biological Interpretation:

  • Precipitation effects are now on multiplicative scale

  • Interpretation changes: “1% increase in precipitation” rather than “1mm increase”

  • This often makes more ecological sense (species respond proportionally)

Next Step Implication:

  • Use log_precip in models instead of raw precip

  • Remember interpretation will be percent-based

Ecologically this means species responses may scale proportionally to precipitation.


10. Transformation - Z-Score Standardization

Code
cat("\n=== Z-SCORE STANDARDIZATION ===\n")
cat("Z-score: subtract mean, divide by SD → mean = 0, SD = 1\n")
cat("Allows comparison of effect sizes across different scales\n")

# Apply z-score standardization to all variables
data_clean$temp_z <- scale(data_clean$temp)[,1]
data_clean$precip_z <- scale(data_clean$precip)[,1]
data_clean$elev_z <- scale(data_clean$elev)[,1]

# Compare distributions
data_z <- data_clean %>%
  dplyr::select(temp, temp_z, precip, precip_z, elev, elev_z) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value")

p_z <- ggplot(data_z, aes(x = value, fill = variable)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~variable, scales = "free", ncol = 3) +
  scale_fill_viridis_d() +
  labs(
    title = "Z-Score Standardization",
    subtitle = "Original (left) vs Standardized (right) - Note mean=0, SD=1",
    x = "Value", 
    y = "Density"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "none",
    strip.text = element_text(face = "bold")
  )

print(p_z)

Code
ggsave("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/output/Session_2/zscore_standardization.png", 
       p_z, width = 12, height = 8, dpi = 300)

cat("✓ Z-score standardization plot saved\n")

=== Z-SCORE STANDARDIZATION ===
Z-score: subtract mean, divide by SD → mean = 0, SD = 1
Allows comparison of effect sizes across different scales
✓ Z-score standardization plot saved

Transformation:

Result

  • mean = 0

  • SD = 1

Statistical Interpretation:

  • All variables now on same scale (standard deviations)

  • Original skewness preserved, only scale changed

  • Outliers remain outliers (just measured in SD units)

Biological Interpretation:

  • Coefficients will represent “change per 1 SD increase”

  • Allows direct comparison: “Is temperature or precipitation more important?”

  • Example: If temp coefficient = 0.5 and precip coefficient = 0.2, temperature has 2.5× stronger effect

Next Step Implication:

  • Use Z-scored variables for regularization (ridge/lasso)

  • Essential for comparing variable importance

Important:

Standardization does not remove skewness.


11. Outlier Detection with Isolation Forest

Code
cat("\n=== ISOLATION FOREST OUTLIER DETECTION ===\n")
cat("Isolation Forest finds multivariate outliers (unusual combinations)\n")

# Prepare data matrix
Xmat <- data_clean %>%
  dplyr::select(temp, precip, elev, tmax, tmin, temp_seas, precip_seas) %>%
  as.matrix()

# Run Isolation Forest
set.seed(123)  # For reproducibility
iso <- isolation.forest(
  Xmat, 
  ntrees = 100,  # Number of trees (more = more stable)
  sample_size = 256,  # Samples per tree
  seed = 123
)

# Get anomaly scores (higher = more unusual)
data_clean$anomaly_score <- predict(iso, Xmat, type = "score")

# Identify top 5% as outliers
threshold <- quantile(data_clean$anomaly_score, 0.95)
data_clean$is_outlier <- data_clean$anomaly_score > threshold

cat("Number of outliers detected (top 5%):", sum(data_clean$is_outlier), "\n")
cat("Outlier score threshold:", round(threshold, 4), "\n")

# Visualize outlier scores
p_outlier_hist <- ggplot(data_clean, aes(x = anomaly_score)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, 
                 fill = "steelblue", color = "white", alpha = 0.7) +
  geom_density(color = "darkred", linewidth = 1) +
  geom_vline(xintercept = threshold, color = "red", linewidth = 1, linetype = "dashed") +
  annotate("text", x = threshold * 1.1, y = 2,
           label = paste0("95th percentile\n", round(threshold, 3)),
           hjust = 0, size = 3.5, color = "red") +
  labs(
    title = "Isolation Forest Anomaly Scores",
    subtitle = "Higher scores = more unusual environmental combinations",
    x = "Anomaly Score", 
    y = "Density"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

print(p_outlier_hist)

Code
ggsave("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/output/Session_2/outlier_scores.png", p_outlier_hist, width = 8, height = 5, dpi = 300)

=== ISOLATION FOREST OUTLIER DETECTION ===
Isolation Forest finds multivariate outliers (unusual combinations)
Number of outliers detected (top 5%): 5 
Outlier score threshold: 0.6037 

Detected 5 multivariate outliers.

These are points with unusual combinations of environmental conditions, not necessarily extreme values individually.

Possible explanations

  • data errors

  • rare environmental conditions

  • sampling bias

  • True extremes

Next Step Implication:

  • DO NOT AUTO-DELETE - investigate each point

  • Check if outliers are near roads (sampling bias)

  • Check if they represent real habitats (keep if real)

  • Consider robust modeling methods


12. Map Outliers In Geographic Space

Code
cat("\n=== MAPPING OUTLIERS ===\n")
cat("Check if outliers correspond to roads, rivers, or sampling bias\n")

# Convert to sf for mapping
outlier_sf <- st_as_sf(data_clean, coords = c("x", "y"), crs = 32643)

# Create map
p_outlier_map <- ggplot() +
  # Pakistan boundary
  geom_sf(data = pakistan_boundary, fill = "lightgray", color = "darkgray") +
  # All points (gray)
  geom_sf(data = outlier_sf, aes(color = is_outlier), size = 1.5, alpha = 0.7) +
  scale_color_manual(values = c("FALSE" = "gray50", "TRUE" = "red")) +
  labs(
    title = "Spatial Distribution of Environmental Outliers",
    subtitle = "Red points = unusual environmental combinations",
    color = "Outlier"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

print(p_outlier_map)

Code
ggsave("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/output/Session_2/outlier_map.png", p_outlier_map, width = 8, height = 8, dpi = 300)

cat("✓ Outlier map saved\n")

=== MAPPING OUTLIERS ===
Check if outliers correspond to roads, rivers, or sampling bias
✓ Outlier map saved

Mapping outliers helps distinguish:

Spatial Pattern:

  • Outliers appear at specific geographic locations

  • Need to overlay with roads/rivers to check sampling bias

Statistical Interpretation:

Pattern Meaning
clustered near roads sampling bias
isolated remote points rare habitats
random distribution natural variability

Biological Interpretation:

  • Geographic pattern helps distinguish:

    • Bias: Outliers along roads = human access bias

    • Real: Outliers in remote areas = true habitat extremes

Next Step Implication:

  • Create map with roads overlay

  • Document decisions about each outlier

  • Consider spatial cross-validation in Session 5

Spatial context is essential before removing observations.


14. Ecological Meaning Of Transformation

Code
cat("\n=== ECOLOGICAL MEANING OF TRANSFORMATIONS ===\n")
cat("Understanding what different transformations mean ecologically\n")

# Generate example data to illustrate
set.seed(123)
x_example <- runif(200, 1, 100)  # Environmental variable
y_example <- 2 + 1.5*log(x_example) + rnorm(200, 0, 0.3)  # True relationship

df_example <- data.frame(x = x_example, y = y_example)

# Fit models
lm_raw <- lm(y ~ x, data = df_example)
lm_log <- lm(y ~ log(x), data = df_example)

# Create plot
p_eco <- ggplot(df_example, aes(x = x, y = y)) +
  geom_point(alpha = 0.4, color = "gray50") +
  # Raw linear fit (wrong assumption)
  geom_smooth(method = "lm", aes(color = "Raw (additive)"), 
              se = FALSE, linewidth = 1) +
  # Log-linear fit (correct)
  geom_line(data = data.frame(x = seq(1, 100, length.out = 100)),
            aes(x = x, y = coef(lm_log)[1] + coef(lm_log)[2] * log(x),
                color = "Log (multiplicative)"),
            linewidth = 1.2) +
  scale_color_manual(values = c("Raw (additive)" = "red", 
                                "Log (multiplicative)" = "darkgreen")) +
  labs(
    title = "Ecological Meaning of Transformations",
    subtitle = "Red = assumes constant absolute change | Green = constant proportional change",
    x = "Environmental Variable", 
    y = "Species Response",
    color = "Model Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, color = "darkred"),
    legend.position = "bottom"
  )

print(p_eco)

Code
ggsave("/run/media/tahirali/Expansion/github/spatial-ecology-workshop/output/Session_2/ecological_meaning.png", 
       p_eco, width = 8, height = 6, dpi = 300)

=== ECOLOGICAL MEANING OF TRANSFORMATIONS ===
Understanding what different transformations mean ecologically

Key Insight from Plot:

  • Red line (raw units): Assumes same absolute change everywhere (Δy per 1 unit x)

  • Green line (log units): Assumes same proportional change (Δy per 1% change in x)

Example Interpretation:

Raw model: "Each 1mm increase in precipitation increases species by 0.02"
Log model: "Each 1% increase in precipitation increases species by 0.015"

Which to Choose?

Scenario Use Raw Use Log
Small environmental range
Large environmental range
Effect constant across gradient
Effect weakens at extremes
Species responds to absolute resources
Species responds to relative change

15. Create Final Clean Dataset For Modeling

Code
cat("\n=== CREATING FINAL CLEAN DATASET ===\n")
cat("Save a clean version with transformed variables for Session 3\n")

# Create final dataset with original and transformed variables
final_data <- data_clean %>%
  dplyr::select(
    # Coordinates
    X, Y, x, y,
    # Original climate variables
    temp, precip, elev, tmax, tmin, temp_seas, precip_seas,
    # Transformed versions
    log_precip, temp_z, precip_z, elev_z,
    # Outlier information
    anomaly_score, is_outlier
  )

# Save as CSV
write.csv(final_data, 
          "/run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/spider_data_clean.csv",
          row.names = FALSE)

cat("✓ Clean dataset saved to: /run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/spider_data_clean.csv\n")
cat("  This file contains original variables + transformed versions + outlier flags\n")

=== CREATING FINAL CLEAN DATASET ===
Save a clean version with transformed variables for Session 3
✓ Clean dataset saved to: /run/media/tahirali/Expansion/github/spatial-ecology-workshop/data/spider_data_clean.csv
  This file contains original variables + transformed versions + outlier flags

16. Key Interpretations

Code
# ==============================================================================
# SUMMARY: KEY INTERPRETATIONS
# ==============================================================================

cat("\n", paste(rep("=", 70), collapse = ""), "\n")
cat("SESSION 2 SUMMARY: KEY INTERPRETATIONS\n")
cat(paste(rep("=", 70), collapse = ""), "\n\n")

cat("1. DISTRIBUTION DIAGNOSTICS:\n")
cat("   • Look at histograms: Are they symmetric? Skewed? Multi-modal?\n")
cat("   • QQ plots: Do points follow the line? Deviations at ends = heavy tails\n")
cat("   • Boxplots: Points beyond whiskers are potential outliers\n\n")

cat("2. TRANSFORMATIONS:\n")
cat("   • Log transformation: Use when relationship is multiplicative\n")
cat("     (e.g., species richness increases by % with productivity)\n")
cat("   • Z-score: Use to compare effect sizes across different scales\n")
cat("     (e.g., is temperature or precipitation more important?)\n\n")

cat("3. OUTLIERS:\n")
cat("   • Statistical outliers: Points with high anomaly scores\n")
cat("   • But ask: Are these sensor errors? Rare habitats? True extremes?\n")
cat("   • Don't auto-delete! Check maps and metadata first\n\n")

cat("4. MULTICOLLINEARITY:\n")
cat("   • High correlations (>0.7) between predictors cause problems\n")
cat("   • Need to address in Session 3 with regularization or variable selection\n\n")

cat("5. NEXT STEPS:\n")
cat("   • Session 3 will use this clean data for collinearity & regularization\n")
cat("   • File saved: spider_data_clean.csv has all transformed variables\n")

cat(paste(rep("=", 70), collapse = ""), "\n")

 ====================================================================== 
SESSION 2 SUMMARY: KEY INTERPRETATIONS
====================================================================== 

1. DISTRIBUTION DIAGNOSTICS:
   • Look at histograms: Are they symmetric? Skewed? Multi-modal?
   • QQ plots: Do points follow the line? Deviations at ends = heavy tails
   • Boxplots: Points beyond whiskers are potential outliers

2. TRANSFORMATIONS:
   • Log transformation: Use when relationship is multiplicative
     (e.g., species richness increases by % with productivity)
   • Z-score: Use to compare effect sizes across different scales
     (e.g., is temperature or precipitation more important?)

3. OUTLIERS:
   • Statistical outliers: Points with high anomaly scores
   • But ask: Are these sensor errors? Rare habitats? True extremes?
   • Don't auto-delete! Check maps and metadata first

4. MULTICOLLINEARITY:
   • High correlations (>0.7) between predictors cause problems
   • Need to address in Session 3 with regularization or variable selection

5. NEXT STEPS:
   • Session 3 will use this clean data for collinearity & regularization
   • File saved: spider_data_clean.csv has all transformed variables
====================================================================== 

Summary Key Interpretations (Spider Data set)

1. DISTRIBUTION DIAGNOSTICS:
   • Elevation is HIGHLY SKEWED (skew=5.01) - most points at low elevations
   • Precipitation is moderately skewed (0.98) - needs transformation
   • Precipitation seasonality is skewed (1.17) - needs transformation
   • Temperature variables are relatively symmetric - good for modeling
   • CRITICAL: tmax values (46-218°C) are IMPOSSIBLE - check units! Likely °C×10

2. MULTICOLLINEARITY (MAJOR ISSUE):
   • tmax and temp_seas correlation = 0.996 (almost identical variables)
   • Multiple correlations >0.8 between temperature variables
   • Cannot use all variables in simple models - coefficients unstable
   • Session 3 MUST address this (regularization or variable selection)

3. TRANSFORMATIONS APPLIED:
   • Log precipitation: Skewness 0.98 → 0.41 ✓ (improved)
   • Z-score standardization: All variables now on same scale (mean=0, SD=1)
   • Elevation still skewed after Z-score (shape preserved, scale changed)

4. OUTLIERS:
   • 5 points detected as multivariate outliers (top 5%)
   • These have unusual COMBINATIONS of variables, not just extreme values
   • DO NOT auto-delete - investigate if they are:
     - Near roads/rivers (sampling bias)?
     - Rare but real habitats?
     - Data entry errors?

5. CRITICAL DATA ISSUES FOUND:
   • tmax values >100°C are biologically impossible - MUST FIX
   • Elevation sampling biased toward lowlands
   • Temperature variables are largely redundant

6. PREPARATION FOR SESSION 3:
   • Clean dataset saved with transformed variables
   • Will need to handle multicollinearity
   • Recommend selecting representative variables:
     - Keep elev (unique information)
     - Keep precip (unique information)
     - Choose ONE temperature variable
     - Seasonality may be redundant with tmax/temp_seas

7. NEXT STEPS - MUST DO BEFORE SESSION 3:
   [ ] Verify tmax units and correct data
   [ ] Document outlier investigation results
   [ ] Decide on final variable set for modeling
   [ ] Bring clean data to Session 3

17. Take-Home Message

Three key lessons from Session 2:

  1. Always visualize environmental predictors before modeling.

  2. Skewness, outliers, and correlations reveal important ecological patterns.

  3. Data preparation improves model reliability and interpretation.