AWK & SED Commands in the Arabidopsis lyrata Temporal Genomics Pipeline

Annotated reference for every text-processing command used from raw data to final analysis

Author

Dr Tahir Ali | AG de Meaux, University of Cologne

Published

April 16, 2026

1 Introduction

awk and sed are UNIX line-oriented text-processing tools that stream through files line-by-line without loading them into memory. In population genomics pipelines they are indispensable for manipulating VCF, BED, GTF, TSV, and FASTA files that are too large for spreadsheet software yet too simply structured to justify a full Python script.

This document catalogues every awk and sed command used in this pipeline, grouped by biological purpose, with explanations of the syntax, the reasons each approach was chosen over alternatives (Python, R, bcftools), and common pitfalls.

Quick reference to AWK structure:

awk 'BEGIN{...}  pattern{action}  END{...}' input_file
         ↑               ↑                ↑
    runs once       runs per          runs once
    at start      matching line       at end

Key built-in variables:

Variable Meaning
NR total record (line) number
NF number of fields on current line
$0 entire current line
$1, $2, ... first, second, … field
FS input field separator (default: whitespace)
OFS output field separator (default: space)
-F"x" (command-line option) set input field separator to x

2 Category 1: Reference Genome Preparation

2.1 1.1 GTF scaffold → chromosome renaming

Context: The A. lyrata ssp. petraea GTF annotation used scaffold names (scaffold_1–scaffold_8) while the genome FASTA used INSDC chromosome accessions (CM047395.1–CM047402.1). All downstream tools (BCFtools, snpEff, BEDTools) require matching names between reference and annotation.

awk 'BEGIN{
  FS = OFS = "\t";   # keep the GTF tab-delimited when $1 is rewritten
  map["scaffold_1"] = "CM047395.1";
  map["scaffold_2"] = "CM047396.1";
  map["scaffold_3"] = "CM047397.1";
  map["scaffold_4"] = "CM047398.1";
  map["scaffold_5"] = "CM047399.1";
  map["scaffold_6"] = "CM047400.1";
  map["scaffold_7"] = "CM047401.1";
  map["scaffold_8"] = "CM047402.1"
} $1 in map { $1 = map[$1] } 1' original.gtf \
  | grep "^CM047" \
  > Arabidopsis_lyrata_petraea_genome.annotation_Scaffold1-8Mod_CM047.gtf

Explanation:

  • BEGIN{...} — initialises an associative array map before any line is processed. Array keys are old names; values are new INSDC accessions.
  • $1 in map — tests whether field 1 (chromosome column in GTF) exists as a key in map.
  • $1 = map[$1] — replaces the old scaffold name with the new accession in-place.
  • The bare 1 at the end is a true pattern with no action, which causes awk to print the (possibly modified) line (print $0 is the default action).
  • | grep "^CM047" — discards all unassembled scaffold lines that were not remapped, keeping only the 8 chromosomes.

Why awk rather than sed? sed could do single replacements (sed 's/scaffold_1/CM047395.1/') but would require 8 separate sed calls or a complex multi-expression command (sed -e 's/.../.../' -e ...). The awk associative array handles all 8 in one pass, which is both faster and more readable. It also correctly targets only field 1 ($1), preventing accidental replacement of “scaffold_1” appearing in the gene_id or attribute columns.
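A self-contained toy run of the same pattern (two mapping entries, made-up input). It also shows why setting FS and OFS to tab matters: when awk rewrites $1 it rebuilds the whole line joined with OFS, which defaults to a single space.

```shell
# Toy scaffold remap: $1 is rewritten via the map array; the bare 1
# prints every (possibly modified) line. FS=OFS="\t" keeps rebuilt
# lines tab-delimited instead of space-joined.
printf 'scaffold_1\tgene\t100\nscaffold_9\tgene\t200\n' \
  | awk 'BEGIN{ FS = OFS = "\t"; map["scaffold_1"] = "CM047395.1" }
         $1 in map { $1 = map[$1] } 1'
# Output (tab-delimited; scaffold_9 is not in map and passes through):
# CM047395.1    gene    100
# scaffold_9    gene    200
```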


3 Category 2: BED File Generation

BED format uses 0-based start, 1-based end coordinates (half-open intervals). VCF and GTF use 1-based coordinates. This mismatch is a frequent source of bugs when filtering VCFs with bcftools view -R.

3.1 2.1 Creating a BED file from a TSV position table — WRONG version (documented bug)

This command was initially used and produced an empty VCF output:

# ❌ WRONG — produces empty intervals (start == end in BED = no sequence)
awk 'NR>1 {print $1"\t"$2"\t"$2}' \
  conservation_scores_matching_vcf_sites.tsv \
  > conservation_scores_matching_vcf_sites.bed

# First lines of the bad BED:
# CM047395.1    78    78     ← empty interval: [78,78) contains nothing
# CM047395.1    85    85

Why it failed: In BED format, [start, end) means the interval includes positions from start up to but not including end. When start == end, the interval is empty. bcftools view -R found zero variants because no VCF positions fell inside an empty interval.

3.2 2.2 Corrected BED file — 0-based start

# ✓ CORRECT — 0-based start = VCF position − 1
awk 'NR>1 {print $1"\t"$2-1"\t"$2}' \
  conservation_scores_matching_vcf_sites.tsv \
  > conservation_scores_matching_vcf_sites.bed

# First lines of the correct BED:
# CM047395.1    77    78     ← interval [77,78) contains VCF position 78 ✓
# CM047395.1    84    85
# CM047395.1    91    92

Explanation:

  • NR>1 — skips the header line (record number > 1).
  • $1 — chromosome (column 1 of TSV).
  • $2-1 — BED start = VCF position − 1 (converts to 0-based).
  • $2 — BED end = VCF position (the 1-based position is the exclusive end).
  • "\t" — explicit tab separator (critical: BED must be tab-delimited).

Verification after fix:

bcftools view -H \
  insamples_polarized.vcf.recode.vcf.filtered_conservation_sites.vcf.gz \
  | wc -l
# Expected and obtained: 4,415,651

3.3 2.3 BED directly from TSV (alternative one-liner)

# Create BED from conservation TSV in one step (used in snpEff filtering)
awk 'NR>1 {print $1"\t"$2-1"\t"$2}' \
  /projects/ag-demeaux/tali/Jupyter_labs/vcf_filtered_to_conservation_positions/\
conservation_scores_matching_vcf_sites.tsv \
  > conservation_scores_matching_vcf_sites.bed

3.4 2.4 BED from colon-separated position keys

Context: When position keys are stored as chr:position strings, convert back to BED for bcftools:

# Input format: CM047395.1:78
awk -F':' '{print $1"\t"$2-1"\t"$2}' \
  matched_position_keys.txt \
  > positions_from_keys.bed

Explanation:

  • -F':' — sets the field separator to colon.
  • $1 — chromosome part before the colon.
  • $2-1 — position part after the colon, minus 1 for 0-based BED start.
  • $2 — 1-based end position.

4 Category 3: Position Key Creation and Matching

Position keys (chr:position) are used to efficiently match sites between different file formats without requiring a database join.

4.1 3.1 Extract position keys from conservation TSV

# Create chr:pos keys (skip header with NR>1)
awk 'NR>1 {print $1":"$2}' \
  /projects/ag-demeaux/tali/Jupyter_labs/vcf_filtered_to_conservation_positions/\
conservation_scores_matching_vcf_sites.tsv \
  > /projects/ag-demeaux/tali/Jupyter_labs/vcf_filtered_to_conservation_positions/\
matched_position_keys.txt

Explanation:

  • $1 — chromosome column.
  • ":" — literal colon separator (no spaces).
  • $2 — position column.
  • Output: one key per line, e.g. CM047395.1:78.

Why this format? Colon-separated keys are a standard convention that avoids ambiguity (chromosomes never contain colons; positions are integers). This allows set operations in Python (common = set(df1_keys).intersection(df2_keys)) without needing to join on multiple columns.
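For completeness, the same intersection can also be done in the shell without Python; a sketch using sort and comm on toy key files (file contents and names are illustrative):

```shell
# comm -12 prints lines common to both inputs; both files must be
# sorted under the same collation for comm to work correctly.
keys_a=$(mktemp); keys_b=$(mktemp)
printf 'CM047395.1:78\nCM047395.1:85\n' | sort > "$keys_a"
printf 'CM047395.1:85\nCM047395.1:91\n' | sort > "$keys_b"
comm -12 "$keys_a" "$keys_b"
# CM047395.1:85
rm -f "$keys_a" "$keys_b"
```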


5 Category 4: Outgroup Genotype Encoding

5.1 4.1 Create dummy outgroup genotype from allele frequencies

Context: polarizeVCFbyOutgroup.py requires a VCF individual column to determine ancestral allele. Rather than using a single outgroup individual (which may have missing data), a consensus majority-vote genotype is constructed from the 7-outgroup allele frequency table.

# vcftools --freq2 output format:
# CHROM  POS  N_ALLELES  N_CHR  {ALLELE:FREQ}  {ALLELE:FREQ}
# Column 5 = REF allele frequency; column 6 = ALT allele frequency
# Columns are space-separated but freq uses : internally

awk 'NR>1 { print (($5 > $6) ? "0/0" : "1/1") }' \
  outgroup/outgroup_freq.frq \
  > outgroup/dummy_genotype.txt

Explanation:

  • NR>1 — skips the header.
  • ($5 > $6) ? "0/0" : "1/1" — ternary conditional:
    • If REF frequency > ALT frequency → outgroup is homozygous REF (0/0) → keep current polarisation.
    • If ALT frequency ≥ REF frequency → outgroup is homozygous ALT (1/1) → switch REF/ALT.
  • This encodes the majority-allele consensus across all 7 outgroup individuals.

Limitation: Sites where the outgroup frequencies are exactly equal (0.5) are assigned 1/1 (switches). An alternative is to remove such sites with | awk '$5 != $6'. Sites where the outgroup is heterozygous or ambiguous are later removed by polarizeVCFbyOutgroup.py unless --keep is specified.
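A toy run showing the tie behaviour described above (made-up frequency rows; $5 = REF frequency, $6 = ALT frequency):

```shell
# Header + three rows; the 0.5/0.5 tie falls through to "1/1"
# because 0.5 > 0.5 is false.
printf 'CHROM POS N_ALLELES N_CHR REF_F ALT_F\nchr1 10 2 14 0.9 0.1\nchr1 20 2 14 0.5 0.5\nchr1 30 2 14 0.2 0.8\n' \
  | awk 'NR>1 { print (($5 > $6) ? "0/0" : "1/1") }'
# 0/0
# 1/1   <- the tie site is assigned 1/1
# 1/1
```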


6 Category 5: Conservation Score Filtering

6.1 5.1 Filter GERP BED to evolutionarily constrained sites

Context: The raw GERP output covers all sites in the genome. For the selection analyses, only sites with evidence of evolutionary constraint (RS ≥ 4) are retained.

# concatenated_gerp.bed columns:
# $1=chr  $2=start  $3=end  $4=?  $5=?  $6=GERP_RS_score

awk '$6 >= 4' \
  results_18_genomes_Lastal/gerp/concatenated_gerp.bed \
  > results_18_genomes_Lastal/gerp/concatenated_filt_gerp.bed
# Result: 67,974,179 constrained sites

Explanation:

  • $6 >= 4 — pattern only (no action block): print lines where GERP RS score ≥ 4. When no action is specified, awk’s default action is to print the whole line.
  • GERP RS ≥ 4 is a standard threshold for strong evolutionary constraint (corresponds to approximately 4 substitutions rejected per site across the phylogeny).

Why awk rather than grep or Python? grep cannot filter on numeric column values. Python pandas would work but adds overhead for loading a file with ~67 million lines. Awk processes the BED at disk read speed (~2 GB/s) without RAM constraints.

6.2 5.2 Matching phyloP BED to filtered GERP positions

# Extract phyloP scores for the same 67M positions as filtered GERP
# (uses bedtools intersect in practice, but position matching via awk works for TSVs)
awk 'NR==FNR{pos[$1"\t"$2]=1; next} ($1"\t"$2) in pos' \
  concatenated_filt_gerp.bed \
  concatenated_phyloP.bed \
  > concatenated_filt_phyloP.bed

Explanation:

  • NR==FNR — NR is the global record counter; FNR is the per-file record counter. When they are equal, we are reading the FIRST file (GERP).
  • pos[$1"\t"$2]=1 — stores chromosome+position as a key in the pos array.
  • next — skip to the next line (don’t execute the second block for file 1).
  • ($1"\t"$2) in pos — for lines in the SECOND file (phyloP), print only if the chromosome+position key exists in pos.

This is the standard two-file lookup pattern in awk, avoiding the need for sort | join or a Python merge.
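The pattern can be exercised end-to-end on toy two-column files (names and values are illustrative):

```shell
# Keep only "phyloP" lines whose chr+pos also appear in the "GERP" file.
gerp=$(mktemp); phylop=$(mktemp)
printf 'chr1\t10\nchr1\t30\n' > "$gerp"
printf 'chr1\t10\t0.9\nchr1\t20\t0.1\nchr1\t30\t0.8\n' > "$phylop"
awk 'NR==FNR {pos[$1"\t"$2]=1; next} ($1"\t"$2) in pos' "$gerp" "$phylop"
# chr1    10    0.9
# chr1    30    0.8
rm -f "$gerp" "$phylop"
```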


7 Category 6: Sample Name Processing

7.1 6.1 Shorten sample names in VCF

Context: BCFtools generates sample names from BAM file paths or read group tags, producing long strings like A_Al21_Plech23_1-8ChrOnly_BamQualFilt-f3Q30F264_nodup_realigned_sorted. Short names (Plech23_A) are required for downstream R/Python code.

# Create a mapping file: old_name → new_name
# Then use bcftools reheader (preferred) or awk on the VCF header

# Using awk directly on VCF (for inspection/debugging)
awk '
/^#CHROM/ {
  for (i=10; i<=NF; i++) {
    # Remove long suffix from sample names
    gsub("_1-8ChrOnly_BamQualFilt-f3Q30F264_nodup_realigned_sorted", "", $i);
    gsub("A_Al[0-9]+_", "", $i)
  }
  print $0; next
}
{ print $0 }
' raw_vcf.vcf > renamed_vcf.vcf

Explanation:

  • /^#CHROM/ — matches the VCF header line that contains sample names (starts with #CHROM).
  • for (i=10; i<=NF; i++) — loops over fields 10 onwards (VCF sample columns start at column 10).
  • gsub("pattern", "replacement", $i) — global substitution on field $i (replaces all occurrences).
  • next — after printing the modified header, skip to next line.
  • { print $0 } — prints all non-header lines unchanged.

Better practice: Use bcftools reheader -s sample_names.txt for safety, as it validates the number of samples matches. The awk approach is shown here for transparency about what the renaming does.

7.2 6.2 Extract sample names list from VCF

# Print sample names one per line (columns 10+)
awk '/^#CHROM/ {
  for (i=10; i<=NF; i++) print $i
}' all_samples.vcf > sample_names.txt

7.3 6.3 Count samples in VCF header

awk '/^#CHROM/ {print NF-9; exit}' all_samples.vcf
# Output: 64 (total in-group samples)

Explanation:

  • NF - 9 — total fields minus the 9 fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT).
  • exit — stop processing after the header line is found.

8 Category 7: VCF File Manipulation

8.1 7.1 Count SNPs (non-header lines)

# Count variants in a VCF (excluding header lines starting with #)
awk '!/^#/' all_samples_filtered.vcf | wc -l
# Or equivalently:
grep -c "^[^#]" all_samples_filtered.vcf

Explanation:

  • !/^#/ — the ! negates the match; prints lines NOT starting with #.

8.2 7.2 Extract chromosome names

# Get unique chromosome names from VCF
awk '!/^#/ {print $1}' all_samples.vcf | sort -u

8.3 7.3 Filter VCF to specific chromosomes (awk alternative to bcftools)

# Keep only chromosomes 1-8 (useful when bcftools not available)
awk '
/^#/ {print; next}                    # print header lines
$1 ~ /^CM047(39[5-9]|40[0-2])\.1$/ {print}   # print chromosome 1-8 lines
' all_samples_raw.vcf > all_samples_chrs1-8.vcf

8.4 7.4 Count allele counts to verify polarisation

# Check that AC=0 sites were removed (should find none);
# the (^|;) and (;|$) anchors stop the pattern matching inside e.g. MAC=0
awk '!/^#/ && $8 ~ /(^|;)AC=0(;|$)/' filtered_vcf.vcf | wc -l

# Count sites by AC value (the three-argument match() is a gawk extension)
awk '!/^#/ {
  match($8, /AC=([0-9]+)/, arr)
  ac[arr[1]]++
}
END {for (v in ac) print v, ac[v]}' filtered_vcf.vcf | sort -n

8.5 7.5 Split VCF by chromosome (for parallel processing)

Context: The GPN probability computation runs as a SLURM array with one job per chromosome. The polarised VCF is split first.

# Split by chromosome using awk
for CHR in CM047395.1 CM047396.1 CM047397.1 CM047398.1 \
           CM047399.1 CM047400.1 CM047401.1 CM047402.1; do
  awk -v chr="$CHR" \
    '/^#/ {print; next} $1==chr {print}' \
    insamples_polarized.vcf.recode.vcf \
    > split.${CHR}_insamples_polarized.vcf.recode.vcf
done

Explanation:

  • -v chr="$CHR" — passes the shell variable $CHR into awk as chr.
  • /^#/ {print; next} — always prints header lines.
  • $1==chr {print} — prints only lines where chromosome column equals chr.
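A toy demonstration of -v on inline VCF-like input (chromosome names are made up):

```shell
# The shell variable CHR becomes the awk variable chr; header lines
# always pass, data lines only when the chromosome matches.
CHR="chr2"
printf '#CHROM\tPOS\nchr1\t5\nchr2\t9\n' \
  | awk -v chr="$CHR" '/^#/ {print; next} $1==chr {print}'
# #CHROM  POS
# chr2    9
```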

9 Category 8: Allele Frequency Table Operations

9.1 8.1 Compute derived allele frequency from allele counts

Context: The dadi .data file stores allele counts (e.g., 12 REF + 2 ALT reads). Derived allele frequency = ALT_count / (REF_count + ALT_count).

# Columns in vcf_genotypes_matching_conservation_sites.tsv:
# Allele1=REF count for Has02, Has23, Plech02, Plech23
# Allele2=ALT count for Has02.1, Has23.1, Plech02.1, Plech23.1

awk 'NR>1 {
  freq_has02 = $9 / ($4 + $9)       # ALT_Has02 / (REF_Has02 + ALT_Has02)
  freq_has23 = $10 / ($5 + $10)
  freq_plech02 = $11 / ($6 + $11)
  freq_plech23 = $12 / ($7 + $12)
  print $13, $14, freq_has02, freq_has23, freq_plech02, freq_plech23
}' vcf_genotypes_matching_conservation_sites.tsv

9.2 8.2 Compute Δp (derived allele frequency change) per site

# Delta-p = freq_t1 - freq_t0
awk 'NR>1 {
  if ($4+$5 > 0 && $6+$7 > 0)
    dp = ($6/($6+$7)) - ($4/($4+$5))
  else
    dp = "NA"
  print $1, $2, dp
}' OFS="\t" freq_table.tsv > delta_p.tsv

9.3 8.3 Filter allele frequency table by MAF

# Retain sites with MAF >= 0.05 in all four populations
awk 'NR>1 {
  f1=$4; f2=$5; f3=$6; f4=$7
  maf1 = (f1<0.5) ? f1 : 1-f1
  maf2 = (f2<0.5) ? f2 : 1-f2
  maf3 = (f3<0.5) ? f3 : 1-f3
  maf4 = (f4<0.5) ? f4 : 1-f4
  if (maf1>=0.05 && maf2>=0.05 && maf3>=0.05 && maf4>=0.05)
    print $0
}' allelefreq_phyloP_gerp_gpn_Alyr.tsv > allelefreq_MAF05.tsv

10 Category 9: GTF / Annotation File Processing

10.1 9.1 Extract gene coordinates from GTF

# Extract gene start/end/ID from GTF (feature column == "gene").
# -F'\t' is required: with the default FS the attribute column is split
# on its internal spaces and $9 would hold only the token gene_id.
awk -F'\t' 'BEGIN{OFS="\t"} $3=="gene" {
  match($9, /gene_id "([^"]+)"/, arr)
  print $1, $4, $5, arr[1]
}' genes.gtf > gene_coordinates.bed

Explanation:

  • $3=="gene" — selects only lines where the feature column is “gene”.
  • match($9, /regex/, arr) — regex match on the attribute column; the three-argument form that fills a capture array is a gawk extension (POSIX match() takes only two arguments). The capture group ([^"]+) extracts the gene ID between quotes into arr[1].
  • OFS="\t" — sets output field separator to tab (important for BED format).
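Since the three-argument match() is gawk-only, a portable POSIX-awk sketch of the same extraction uses RSTART/RLENGTH instead (toy GTF line; the gene ID is made up):

```shell
# Two-argument match() sets RSTART (match offset) and RLENGTH (match
# length); substr() then strips the literal 'gene_id "' prefix (9 chars)
# and the trailing quote.
printf 'chr1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "AL1G10010"; x "y";\n' \
  | awk -F'\t' 'BEGIN{OFS="\t"} $3=="gene" && match($9, /gene_id "[^"]+"/) {
      id = substr($9, RSTART+9, RLENGTH-10)
      print $1, $4, $5, id
    }'
# chr1    100    200    AL1G10010
```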

10.2 9.2 Count genes per chromosome

awk '$3=="gene" {count[$1]++}
END {for (c in count) print c, count[c]}' \
  genes.gtf | sort -k1,1

10.3 9.3 Extract CDS regions for conservation analysis

# Extract CDS coordinates (used for snpEff validation)
awk '$3=="CDS" {print $1"\t"$4-1"\t"$5"\t"$7}' OFS="\t" \
  genes.gtf > cds_regions.bed

Note: CDS in GTF is 1-based inclusive; BED requires 0-based start ($4-1).


11 Category 10: FASTA File Operations

11.1 10.1 Count sequences in FASTA

# Count number of sequences
awk '/^>/ {count++} END {print count}' genome.fna
# Output: 3,209 (8 chromosomes + unassembled scaffolds)

# Count only chromosome-length sequences (> 10 Mb)
# (uses awk multi-line approach)
awk '/^>/ {if (seq) {if (len>10000000) count++}; seq=$0; len=0; next}
     {len+=length($0)}
     END {if (seq && len>10000000) count++; print count}' genome.fna

11.2 10.2 Extract specific chromosome FASTA

# Extract chromosome CM047395.1 from multi-FASTA.
# Set the flag on every header line BEFORE printing, otherwise the next
# chromosome's header line is emitted before the flag is cleared.
awk '/^>/ {found = ($0 ~ /^>CM047395\.1( |$)/)} found' \
  genome.fna > chr1.fna

11.3 10.3 Identify soft-masked (lowercase = repeat) regions

Context: The generate_softmasked_ranges.py Python script does this, but an awk equivalent for documentation:

# Identify runs of lowercase (repeat-masked) nucleotides
awk '
/^>/ {
  # flush an interval left open at the end of the previous sequence
  if (in_mask) {print chr"\t"mask_start-1"\t"pos; in_mask=0}
  chr=$0; sub(">","",chr); pos=0; mask_start=0; next
}
{
  for (i=1; i<=length($0); i++) {
    pos++
    c = substr($0, i, 1)
    if (c ~ /[acgtn]/) {         # lowercase (including n) = masked
      if (!in_mask) {mask_start=pos; in_mask=1}
    } else {
      if (in_mask) {print chr"\t"mask_start-1"\t"pos-1; in_mask=0}
    }
  }
}
END {if (in_mask) print chr"\t"mask_start-1"\t"pos}
' genome.fna > softmasked_regions.bed

12 Category 11: SLURM / HPC Job Scripting

12.1 11.1 Array job chromosome indexing

Context: SLURM array jobs (#SBATCH --array=1-8) provide each job with $SLURM_ARRAY_TASK_ID (1–8). An awk-style bash array indexes into chromosomes.

#!/bin/bash
#SBATCH --array=1-8

# The chromosome list is a bash array (0-indexed)
CHRS=(CM047395.1 CM047396.1 CM047397.1 CM047398.1 \
      CM047399.1 CM047400.1 CM047401.1 CM047402.1)

# SLURM_ARRAY_TASK_ID is 1-based; bash arrays are 0-based
CHR=${CHRS[$SLURM_ARRAY_TASK_ID-1]}
# Task 1 → index 0 → CM047395.1
# Task 8 → index 7 → CM047402.1

Note: This is bash array indexing rather than awk, but the pattern ($SLURM_ARRAY_TASK_ID-1) looks identical to awk arithmetic and is documented here for clarity.

12.2 11.2 Log parsing: extract runtime per chromosome

# Extract wall time per chromosome from pipeline log
awk '/✓ CM/ {
  match($0, /CM[0-9]+\.[0-9]+ in ([0-9]+\.[0-9]+)s/, arr)
  print $2, arr[1]"s"
}' pipeline.log

12.3 11.3 Check which chromosomes completed

awk '/✓ CM/ {print $2, "DONE"}
     /ERROR/ {print $0}' \
  pipeline.log | column -t

13 Category 12: SED Commands

sed (stream editor) is primarily used for find-and-replace operations on text. It processes files line by line, applying specified substitution or deletion commands.

SED syntax:

sed 's/pattern/replacement/flags' input_file
        ↑           ↑         ↑
   regex to      replacement  g=global (all on line)
    match         string      i=case-insensitive
                              p=print (with -n)

13.1 12.1 Replace scaffold names in GTF (sed alternative to awk)

# Using multiple -e expressions (less clean than awk for 8 replacements)
sed -e 's/^scaffold_1\t/CM047395.1\t/' \
    -e 's/^scaffold_2\t/CM047396.1\t/' \
    -e 's/^scaffold_3\t/CM047397.1\t/' \
    -e 's/^scaffold_4\t/CM047398.1\t/' \
    -e 's/^scaffold_5\t/CM047399.1\t/' \
    -e 's/^scaffold_6\t/CM047400.1\t/' \
    -e 's/^scaffold_7\t/CM047401.1\t/' \
    -e 's/^scaffold_8\t/CM047402.1\t/' \
  original.gtf | grep "^CM047" > renamed.gtf

Explanation:

  • ^scaffold_1\t — anchors to line start (^) and requires a tab (\t) after the name (prevents matching scaffold_10 when looking for scaffold_1). Note that \t in a sed pattern is a GNU sed extension; POSIX sed needs a literal tab character.
  • -e — allows multiple sed expressions in one call.

Recommendation: Use the awk associative array version (Category 1) for this task. The sed approach requires 8 separate -e flags and is less maintainable.

13.2 12.2 Remove snpEff version comment lines from VCF

# Remove SnpEff command-line comment lines added to VCF header
sed '/^##SnpEffCmd/d' \
  snpEff_Ann_insamples_polarized.vcf > clean_header.vcf

Explanation:

  • /^##SnpEffCmd/d — delete (d) lines matching the pattern.
  • ^##SnpEffCmd — matches lines starting with ##SnpEffCmd (snpEff adds these).

13.3 12.3 Replace sample name prefix in VCF header

# Replace long BAM path prefix in sample names in the #CHROM header line
sed '/^#CHROM/ s/\/scratch\/tali\/bam\///g' all_samples.vcf > renamed.vcf

Explanation:

  • /^#CHROM/ — address: only apply to lines matching ^#CHROM.
  • s/pattern/replacement/g — substitute the path prefix with nothing (delete it).
  • \/scratch\/tali\/bam\/ — escaped forward slashes in the pattern.
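Escaping every slash is avoidable: sed accepts any character as the s-command delimiter, so a path-heavy pattern reads more cleanly with # (toy header line; sample names are illustrative):

```shell
# Using # as the delimiter removes the need to escape / in the path.
echo '#CHROM /scratch/tali/bam/Has02_1 /scratch/tali/bam/Has02_2' \
  | sed '/^#CHROM/ s#/scratch/tali/bam/##g'
# #CHROM Has02_1 Has02_2
```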

13.4 12.4 Fix Windows line endings in input files

# Remove carriage returns (Windows \r\n → Unix \n); anchoring with $
# restricts the substitution to a trailing \r
sed 's/\r$//' windows_file.txt > unix_file.txt
# Or in-place:
sed -i 's/\r$//' windows_file.txt

13.5 12.5 Add chromosome prefix to BED file

# The per-chromosome conservation files use a chr_ prefix
# (e.g. chr_CM047395.1.bed); add it if missing
sed 's/^CM/chr_CM/' regions.bed > regions_prefixed.bed

13.6 12.6 Comment out lines in config files

Context: snpEff configuration requires modifying snpEff.config.

# Comment out the default Arabidopsis thaliana database entry
sed 's/^arabidopsis_thaliana/#arabidopsis_thaliana/' \
  snpEff.config > snpEff_modified.config

# Append custom database entry
echo "Arabidopsis_lyrata_petraea.genome = A. lyrata petraea MJ09-11" \
  >> snpEff_custom.config

13.7 12.7 Remove trailing whitespace

# Clean trailing spaces/tabs from sample name files
sed 's/[[:space:]]*$//' sample_names.txt > sample_names_clean.txt

14 Category 13: Combined AWK + SED Pipelines

Real-world commands frequently chain awk and sed together with pipes.

14.1 13.1 Full GTF preparation pipeline

# Complete GTF preparation: rename + clean + filter in one pipeline
cat original.gtf \
  | awk 'BEGIN{
      map["scaffold_1"]="CM047395.1"; map["scaffold_2"]="CM047396.1";
      map["scaffold_3"]="CM047397.1"; map["scaffold_4"]="CM047398.1";
      map["scaffold_5"]="CM047399.1"; map["scaffold_6"]="CM047400.1";
      map["scaffold_7"]="CM047401.1"; map["scaffold_8"]="CM047402.1"
    } $1 in map {$1=map[$1]} 1' \
  | grep "^CM047" \
  | sed '/^##SnpEffCmd/d' \
  > Arabidopsis_lyrata_petraea_genome.annotation_Scaffold1-8Mod_CM047.gtf

14.2 13.2 VCF → allele count summary

# Count transitions vs transversions from VCF
awk '!/^#/ {
  ref=$4; alt=$5
  if (length(ref)==1 && length(alt)==1) {
    pair = ref">"alt
    counts[pair]++
  }
}
END {
  ti=counts["A>G"]+counts["G>A"]+counts["C>T"]+counts["T>C"]
  tv=0
  for (p in counts) tv+=counts[p]
  tv -= ti
  printf "Transitions: %d\nTransversions: %d\nTi/Tv: %.2f\n", ti, tv, ti/tv
}' filtered_vcf.vcf

14.3 13.3 Generate per-chromosome SNP counts from VCF

# Useful for verifying chromosome-by-chromosome processing
awk '!/^#/ {counts[$1]++}
END {
  for (c in counts) print c, counts[c]
}' filtered.vcf | sort -k1,1
# The accessions CM047395.1–CM047402.1 sort into chromosome order as
# plain strings, so no numeric extraction of the chromosome digit is needed

14.4 13.4 Merge conservation score files across chromosomes

# Concatenate per-chromosome BED files with header from first file only
awk 'FNR==1 && NR!=1 {next} 1' \
  gerp/chr_CM047395.1.bed \
  gerp/chr_CM047396.1.bed \
  gerp/chr_CM047397.1.bed \
  gerp/chr_CM047398.1.bed \
  gerp/chr_CM047399.1.bed \
  gerp/chr_CM047400.1.bed \
  gerp/chr_CM047401.1.bed \
  gerp/chr_CM047402.1.bed \
  > gerp/concatenated_gerp.bed

Explanation:

  • FNR==1 && NR!=1 {next} — for lines that are the first line of a file (FNR==1) but not the very first line overall (NR!=1), skip them. This removes the header from all files except the first.
  • 1 — print all other lines.
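The header-skip behaviour is easy to verify on two toy files:

```shell
# Only the first file's header line survives concatenation.
p1=$(mktemp); p2=$(mktemp)
printf 'chr\tpos\nchr1\t10\n' > "$p1"
printf 'chr\tpos\nchr1\t20\n' > "$p2"
awk 'FNR==1 && NR!=1 {next} 1' "$p1" "$p2"
# chr    pos
# chr1   10
# chr1   20
rm -f "$p1" "$p2"
```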

15 Category 14: Quality Control and Validation

15.1 14.1 Verify expected number of columns in TSV

# Check that all lines have exactly 10 columns (allelefreq TSV);
# -F'\t' is essential, since the default FS would collapse empty fields
awk -F'\t' 'NF!=10 {print NR": "NF" columns: "$0}' \
  allelefreq_phyloP_gerp_gpn_Alyr.tsv | head -20
# Prints offending lines; no output = all lines have 10 columns

15.2 14.2 Check for missing data in key columns

# Count missing values (NA or empty) in derived frequency columns
awk 'NR>1 {
  for (i=4; i<=7; i++)
    if ($i=="NA" || $i=="") missing++
}
END {print "Missing values:", missing}' \
  allelefreq_phyloP_gerp_gpn_Alyr.tsv

15.3 14.3 Check BED file coordinate order

# Verify BED start < end for all intervals
awk '$2 >= $3 {print NR": start("$2") >= end("$3"): "$0}' \
  conservation_scores_matching_vcf_sites.bed | head
# No output = all intervals valid

15.4 14.4 Verify chromosome names match between VCF and BED

# Extract unique chromosome names from each file and compare
echo "=== VCF chromosomes ==="
awk '!/^#/ {print $1}' polarized.vcf | sort -u

echo "=== BED chromosomes ==="
awk '{print $1}' conservation_sites.bed | sort -u

15.5 14.5 Count samples per population group

# Count how many sample names contain each population label
awk '/^#CHROM/ {
  has02=0; has23=0; plech02=0; plech23=0
  for (i=10; i<=NF; i++) {
    if ($i ~ /Has02/) has02++
    if ($i ~ /Has23/) has23++
    if ($i ~ /Plech02/) plech02++
    if ($i ~ /Plech23/) plech23++
  }
  print "Has02:", has02, "Has23:", has23, "Plech02:", plech02, "Plech23:", plech23
}' all_samples.vcf
# Expected: Has02: 12  Has23: 20  Plech02: 12  Plech23: 20

16 Category 15: Data Reformatting for Downstream Tools

16.1 15.1 Convert VCF to dadi input format (awk pipeline verification)

# Verify that the perl conversion produced the expected output
# Check first 5 data lines of the .data file
awk 'NR>1 && NR<=6' \
  insamples_polarized.vcf2dadi/insamples_polarized.vcf.recode.vcf.gz.data

# Count total SNPs in dadi file (skip header)
awk 'NR>1 {count++} END{print count}' \
  insamples_polarized.vcf.recode.vcf.gz.data
# Expected: 12,250,277

16.2 15.2 Extract positions from VCF for GPN scoring

Context: The GPN scoring script requires a plain text file with one 1-based position per line, per chromosome.

# Extract all positions for chromosome CM047395.1
awk '!/^#/ && $1=="CM047395.1" {print $2}' \
  insamples_polarized.vcf.recode.vcf \
  > positions_CM047395.1.txt

# Or loop over all chromosomes
for CHR in CM047395.1 CM047396.1 CM047397.1 CM047398.1 \
           CM047399.1 CM047400.1 CM047401.1 CM047402.1; do
  awk -v chr="$CHR" '!/^#/ && $1==chr {print $2}' \
    insamples_polarized.vcf.recode.vcf \
    > positions_${CHR}.txt
  echo "$CHR: $(wc -l < positions_${CHR}.txt) positions"
done

16.3 15.3 Format Ne estimation output for downstream use

# Extract mean Ne from bootstrap log files
awk '/Mean =/ {
  match($0, /Mean = ([0-9.]+)/, arr)
  print FILENAME, arr[1]
}' Ne_bootstrap_Has.log Ne_bootstrap_Plech.log

# Expected output:
# Ne_bootstrap_Has.log    148.77
# Ne_bootstrap_Plech.log  356.97

17 AWK Quick-Reference Card

17.1 Patterns

Pattern Matches
/regex/ lines matching regex
!/regex/ lines NOT matching regex
$1=="value" field 1 equals value
$2 >= 4 field 2 ≥ 4 (numeric)
NR>1 all lines except first
NR==FNR lines in first file (two-file processing)
BEGIN before any input
END after all input
/^#/ lines starting with #
!/^#/ lines not starting with #

17.2 Arithmetic and strings

# Arithmetic
awk '{print $2 - 1}'          # subtract 1 from field 2
awk '{print $3 / $4}'          # divide fields
awk '{sum += $5} END {print sum/NR}'  # mean of column 5

# String operations
awk '{sub("old","new",$1)}'    # replace first occurrence in $1
awk '{gsub("old","new",$0)}'   # replace all in entire line
awk '{print length($0)}'        # line length
awk '{print substr($1,1,10)}'   # first 10 chars of field 1
awk '{print toupper($1)}'       # uppercase
awk '{split($1, arr, ":"); print arr[1], arr[2]}'  # split field 1 on :

17.3 Output formatting

# Tab-separated output
awk '{print $1"\t"$2"\t"$3}'
awk 'BEGIN{OFS="\t"} {print $1,$2,$3}'  # equivalent

# Printf (C-style formatting)
awk '{printf "%.4f\n", $5/$6}'   # 4 decimal places
awk '{printf "%s\t%d\t%d\n", $1, $2-1, $2}'  # BED format

17.4 Two-file lookups

# Classic two-file join: print lines from file2 if key in file1
# (comments must sit inside the awk program: a shell comment after a
# trailing backslash would break the line continuation)
awk 'NR==FNR {keys[$1]=1; next}   # load keys from file1
     $1 in keys {print}           # print matching lines from file2
    ' file1.txt file2.txt

17.5 SED quick reference

Command Effect
sed 's/old/new/' Replace first old with new
sed 's/old/new/g' Replace all old with new
sed '/pattern/d' Delete lines matching pattern
sed -n '/pattern/p' Print only lines matching pattern
sed '1d' Delete first line (remove header)
sed -i 's/old/new/g' file In-place edit
sed 's/\r$//' Remove Windows carriage returns
sed 's/[[:space:]]*$//' Remove trailing whitespace
sed '/^#/!s/^/prefix_/' Add prefix to non-comment lines

18 Common Pitfalls in this Pipeline

WarningPitfall 1: BED coordinate system

VCF = 1-based; BED = 0-based start. Always use $2-1 as BED start when converting from VCF positions. The bug documented in Step 2.1 (empty BED intervals) cost significant debugging time.

WarningPitfall 2: Tab vs space separators

AWK default field separator is any whitespace. For tab-only parsing (e.g., BED, VCF) always specify -F"\t" or OFS="\t". A line with internal spaces in an attribute column can be split incorrectly with the default FS.
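The difference is visible on a single toy line containing an empty tab-delimited field:

```shell
# Default FS collapses runs of whitespace, so the empty middle field
# disappears; -F'\t' preserves it.
printf 'a\t\tc\n' | awk '{print NF}'          # -> 2
printf 'a\t\tc\n' | awk -F'\t' '{print NF}'   # -> 3
```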

WarningPitfall 3: Regex anchoring

/scaffold_1/ matches scaffold_1, scaffold_10, scaffold_11, etc. Always anchor: /^scaffold_1\t/ or $1=="scaffold_1" to match exactly.

WarningPitfall 4: Integer vs floating-point arithmetic

awk '{print 1/3}' gives 0.333333 (floating-point). awk '{print int(1/3)}' gives 0 (integer). For allele frequencies, always use floating-point: $4 / ($4 + $9). When $4 + $9 == 0, this produces a division-by-zero warning; guard with: if ($4 + $9 > 0) print $4/($4+$9); else print "NA".
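The guard in action on made-up count columns ($4 and $5 standing in for the two allele counts):

```shell
# Zero-depth rows get "NA" instead of a division-by-zero error.
printf 'x x x 3 1\nx x x 0 0\n' \
  | awk '{ if ($4+$5 > 0) print $4/($4+$5); else print "NA" }'
# 0.75
# NA
```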

WarningPitfall 5: NR vs FNR in two-file processing

NR is global across all input files; FNR resets per file. The condition NR==FNR correctly identifies lines from the first file only when processing exactly two files. For three or more files, use a filename check: FILENAME==ARGV[1].
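One concrete failure mode, even with only two files: if the first file happens to be empty, NR==FNR is also true while reading the second file, so every data line is silently consumed as a key instead of being filtered (toy files; the goal is to print file2 lines whose key is absent from file1):

```shell
keys=$(mktemp); data=$(mktemp)      # keys file is left empty
printf 'chr1\t10\nchr1\t20\n' > "$data"
awk 'NR==FNR {k[$1]=1; next} !($1 in k)' "$keys" "$data"
# (no output -- wrong: the data lines were loaded as keys)
awk 'FILENAME==ARGV[1] {k[$1]=1; next} !($1 in k)' "$keys" "$data"
# chr1  10
# chr1  20
rm -f "$keys" "$data"
```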

NotePerformance note

For the 12.25 million SNP VCF and 4.4 million SNP TSV files in this pipeline, awk typically runs 5–50× faster than equivalent Python/pandas code for single-pass operations (filtering, reformatting, field extraction), because it avoids loading the entire file into RAM and has minimal startup overhead. For complex joins or statistical operations, Python is more appropriate.


This reference was compiled from all scripts in the Arabidopsis lyrata temporal genomics pipeline (Dr Tahir Ali | AG de Meaux | University of Cologne, 2026).