AWK & SED Commands in the Arabidopsis lyrata Temporal Genomics Pipeline
Annotated reference for every text-processing command used from raw data to final analysis
1 Introduction
awk and sed are UNIX stream editors that process text files line-by-line without loading them into memory. In population genomics pipelines they are indispensable for manipulating VCF, BED, GTF, TSV, and FASTA files that are too large for spreadsheet software and too simple to justify writing a full Python script.
This document catalogises every awk and sed command used in this pipeline, grouped by biological purpose, with explanation of the syntax, why that approach was chosen over alternatives (Python, R, bcftools), and common pitfalls.
Quick reference to AWK structure:
awk 'BEGIN{...}  pattern{action}  END{...}' input_file
         ↑              ↑              ↑
     runs once      runs per       runs once
     at start    matching line      at end
Key built-in variables:
| Variable | Meaning |
|---|---|
| `NR` | total record (line) number |
| `NF` | number of fields on current line |
| `$0` | entire current line |
| `$1, $2, ...` | first, second, … field |
| `FS` | input field separator (default: whitespace) |
| `OFS` | output field separator (default: space) |
| `-F"x"` | set input field separator to `x` |
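A minimal demonstration of these built-ins on an inline two-line input (no pipeline files involved):

```shell
# NR = record number, NF = field count, $1 = first field
printf 'a b c\nd e\n' | awk '{print NR, NF, $1}'
# 1 3 a
# 2 2 d
```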
2 Category 1: Reference Genome Preparation
2.1 1.1 GTF scaffold → chromosome renaming
Context: The A. lyrata ssp. petraea GTF annotation used scaffold names (scaffold_1 … scaffold_8) while the genome FASTA used INSDC chromosome accessions (CM047395.1 … CM047402.1). All downstream tools (BCFtools, snpEff, BEDTools) require matching names between reference and annotation.
awk 'BEGIN{
    FS = OFS = "\t";    # keep the GTF tab-delimited when $1 is reassigned
    map["scaffold_1"] = "CM047395.1";
    map["scaffold_2"] = "CM047396.1";
    map["scaffold_3"] = "CM047397.1";
    map["scaffold_4"] = "CM047398.1";
    map["scaffold_5"] = "CM047399.1";
    map["scaffold_6"] = "CM047400.1";
    map["scaffold_7"] = "CM047401.1";
    map["scaffold_8"] = "CM047402.1"
} $1 in map { $1 = map[$1] } 1' original.gtf \
| grep "^CM047" \
> Arabidopsis_lyrata_petraea_genome.annotation_Scaffold1-8Mod_CM047.gtf
Explanation:
- `BEGIN{...}` — initialises an associative array `map` before any line is processed. Array keys are old names; values are new INSDC accessions.
- Caution: assigning to `$1` makes awk rebuild `$0` joined by `OFS` (default: a single space). Set `FS = OFS = "\t"` in `BEGIN`, otherwise every renamed line is emitted space-delimited and the GTF breaks downstream.
- `$1 in map` — tests whether field 1 (chromosome column in GTF) exists as a key in `map`.
- `$1 = map[$1]` — replaces the old scaffold name with the new accession in-place.
- The bare `1` at the end is a true pattern with no action, which causes awk to print the (possibly modified) line (`print $0` is the default action).
- `| grep "^CM047"` — discards all unassembled scaffold lines that were not remapped, keeping only the 8 chromosomes.
Why awk rather than sed? sed could do single replacements (sed 's/scaffold_1/CM047395.1/') but would require 8 separate sed calls or a complex multi-expression command (sed -e 's/.../.../' -e ...). The awk associative array handles all 8 in one pass, which is both faster and more readable. It also correctly targets only field 1 ($1), preventing accidental replacement of “scaffold_1” appearing in the gene_id or attribute columns.
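The rename-and-filter logic can be verified on two fabricated GTF-like lines (scaffold_9 stands in for an unmapped scaffold); note that `FS = OFS = "\t"` is what keeps the output tab-delimited once `$1` is reassigned:

```shell
# scaffold_1 is remapped; scaffold_9 has no map entry and is dropped by grep
printf 'scaffold_1\tsrc\tgene\nscaffold_9\tsrc\tgene\n' \
  | awk 'BEGIN{FS=OFS="\t"; map["scaffold_1"]="CM047395.1"}
         $1 in map {$1=map[$1]} 1' \
  | grep "^CM047"
# CM047395.1	src	gene
```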
3 Category 2: BED File Generation
BED format uses 0-based start, 1-based end coordinates (half-open intervals). VCF and GTF use 1-based coordinates. This mismatch is a frequent source of bugs when filtering VCFs with bcftools view -R.
3.1 2.1 Creating a BED file from a TSV position table — WRONG version (documented bug)
This command was initially used and produced an empty VCF output:
# ❌ WRONG — produces empty intervals (start == end in BED = no sequence)
awk 'NR>1 {print $1"\t"$2"\t"$2}' \
conservation_scores_matching_vcf_sites.tsv \
> conservation_scores_matching_vcf_sites.bed
# First lines of the bad BED:
# CM047395.1 78 78 ← empty interval: [78,78) contains nothing
# CM047395.1 85 85
Why it failed: In BED format, [start, end) means the interval includes positions from start up to but not including end. When start == end, the interval is empty. bcftools view -R found zero variants because no VCF positions fell inside an empty interval.
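The half-open logic can be checked directly in awk (values are illustrative): VCF position p occupies the single BED base [p−1, p), so it falls inside [start, end) only when start ≤ p−1 < end:

```shell
# Does VCF position 78 fall inside each candidate BED interval?
p=78
for bed in "78 78" "77 78"; do
  set -- $bed
  awk -v p="$p" -v s="$1" -v e="$2" \
    'BEGIN{print s"-"e": " ((p-1 >= s && p-1 < e) ? "contains pos "p : "empty/miss")}'
done
# 78-78: empty/miss
# 77-78: contains pos 78
```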
3.2 2.2 Corrected BED file — 0-based start
# ✓ CORRECT — 0-based start = VCF position − 1
awk 'NR>1 {print $1"\t"$2-1"\t"$2}' \
conservation_scores_matching_vcf_sites.tsv \
> conservation_scores_matching_vcf_sites.bed
# First lines of the correct BED:
# CM047395.1 77 78 ← interval [77,78) contains VCF position 78 ✓
# CM047395.1 84 85
# CM047395.1 91 92
Explanation:
- `NR>1` — skips the header line (record number > 1).
- `$1` — chromosome (column 1 of TSV).
- `$2-1` — BED start = VCF position − 1 (converts to 0-based).
- `$2` — BED end = VCF position (the 1-based position is the exclusive end).
- `"\t"` — explicit tab separator (critical: BED must be tab-delimited).
Verification after fix:
bcftools view -H \
insamples_polarized.vcf.recode.vcf.filtered_conservation_sites.vcf.gz \
| wc -l
# Expected and obtained: 4,415,651
3.3 2.3 BED directly from TSV (alternative one-liner)
# Create BED from conservation TSV in one step (used in snpEff filtering)
awk 'NR>1 {print $1"\t"$2-1"\t"$2}' \
/projects/ag-demeaux/tali/Jupyter_labs/vcf_filtered_to_conservation_positions/\
conservation_scores_matching_vcf_sites.tsv \
> conservation_scores_matching_vcf_sites.bed
3.4 2.4 BED from colon-separated position keys
Context: When position keys are stored as chr:position strings, convert back to BED for bcftools:
# Input format: CM047395.1:78
awk -F':' '{print $1"\t"$2-1"\t"$2}' \
matched_position_keys.txt \
> positions_from_keys.bed
Explanation:
- `-F':'` — sets the field separator to colon.
- `$1` — chromosome part before the colon.
- `$2-1` — position part after the colon, minus 1 for 0-based BED start.
- `$2` — 1-based end position.
4 Category 3: Position Key Creation and Matching
Position keys (chr:position) are used to efficiently match sites between different file formats without requiring a database join.
4.1 3.1 Extract position keys from conservation TSV
# Create chr:pos keys (skip header with NR>1)
awk 'NR>1 {print $1":"$2}' \
/projects/ag-demeaux/tali/Jupyter_labs/vcf_filtered_to_conservation_positions/\
conservation_scores_matching_vcf_sites.tsv \
> /projects/ag-demeaux/tali/Jupyter_labs/vcf_filtered_to_conservation_positions/\
matched_position_keys.txt
Explanation:
- `$1` — chromosome column.
- `":"` — literal colon separator (no spaces).
- `$2` — position column.
- Output: one key per line, e.g. `CM047395.1:78`.
Why this format? Colon-separated keys are a standard convention that avoids ambiguity (chromosomes never contain colons; positions are integers). This allows set operations in Python (common = set(df1_keys).intersection(df2_keys)) without needing to join on multiple columns.
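As a shell-only alternative to the Python set intersection, the same keys can be intersected with coreutils `comm` (file names here are illustrative; both inputs must be sorted):

```shell
# Intersect two key files; comm -12 keeps only lines common to both
sort keys_a.txt > a.sorted
sort keys_b.txt > b.sorted
comm -12 a.sorted b.sorted > common_keys.txt
```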
5 Category 4: Outgroup Genotype Encoding
5.1 4.1 Create dummy outgroup genotype from allele frequencies
Context: polarizeVCFbyOutgroup.py requires a VCF individual column to determine ancestral allele. Rather than using a single outgroup individual (which may have missing data), a consensus majority-vote genotype is constructed from the 7-outgroup allele frequency table.
# vcftools --freq2 output format (tab-separated):
# CHROM POS N_ALLELES N_CHR REF_FREQ ALT_FREQ
# Column 5 = REF allele frequency; column 6 = ALT allele frequency
# (--freq, by contrast, writes ALLELE:FREQ pairs)
awk 'NR>1 { print ($5 > $6) ? "0/0" : "1/1" }' \
outgroup/outgroup_freq.frq \
> outgroup/dummy_genotype.txt
Explanation:
- `NR>1` — skips the header.
- `($5 > $6) ? "0/0" : "1/1"` — ternary conditional:
  - If REF frequency > ALT frequency → outgroup is homozygous REF (`0/0`) → keep current polarisation.
  - If ALT frequency ≥ REF frequency → outgroup is homozygous ALT (`1/1`) → switch REF/ALT.
- This encodes the majority-allele consensus across all 7 outgroup individuals.
Limitation: Sites where the outgroup frequencies are exactly equal (0.5) are assigned 1/1 (switches). An alternative is to remove such sites with | awk '$5 != $6'. Sites where the outgroup is heterozygous or ambiguous are later removed by polarizeVCFbyOutgroup.py unless --keep is specified.
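The tie behaviour is easy to confirm on a fabricated two-line frequency table (header and values are illustrative; the extra parentheses around the ternary are for portability):

```shell
# 0.6 > 0.4 falls into the "0/0" branch; the 0.5/0.5 tie falls into "1/1"
printf 'HDR\nc 1 2 14 0.6 0.4\nc 2 2 14 0.5 0.5\n' \
  | awk 'NR>1 { print (($5 > $6) ? "0/0" : "1/1") }'
# 0/0
# 1/1
```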
6 Category 5: Conservation Score Filtering
6.1 5.1 Filter GERP BED to evolutionarily constrained sites
Context: The raw GERP output covers all sites in the genome. For the selection analyses, only sites with evidence of evolutionary constraint (RS ≥ 4) are retained.
# concatenated_gerp.bed columns:
# $1=chr $2=start $3=end $4=? $5=? $6=GERP_RS_score
awk '$6 >= 4' \
results_18_genomes_Lastal/gerp/concatenated_gerp.bed \
> results_18_genomes_Lastal/gerp/concatenated_filt_gerp.bed
# Result: 67,974,179 constrained sites
Explanation:
- `$6 >= 4` — pattern only (no action block): print lines where the GERP RS score is ≥ 4. When no action is specified, awk’s default action is to print the whole line.
- GERP RS ≥ 4 is a standard threshold for strong evolutionary constraint (approximately 4 substitutions rejected per site across the phylogeny).
Why awk rather than grep or Python? grep cannot filter on numeric column values. Python pandas would work but adds the overhead of loading a ~67-million-line file into memory. Awk streams the BED in a single pass with constant memory, so throughput is limited mainly by disk read speed.
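The pattern-only numeric filter behaves identically on any whitespace- or tab-delimited input; on two fabricated BED-like lines:

```shell
# Keep rows whose 6th column is >= 4 (default action: print the line)
printf 'c\t0\t1\t.\t.\t3.9\nc\t1\t2\t.\t.\t4.2\n' | awk '$6 >= 4'
# c	1	2	.	.	4.2
```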
6.2 5.2 Matching phyloP BED to filtered GERP positions
# Extract phyloP scores for the same 67M positions as filtered GERP
# (uses bedtools intersect in practice, but position matching via awk works for TSVs)
awk 'NR==FNR{pos[$1"\t"$2]=1; next} ($1"\t"$2) in pos' \
concatenated_filt_gerp.bed \
concatenated_phyloP.bed \
> concatenated_filt_phyloP.bed
Explanation:
- `NR==FNR` — `NR` is the global record counter; `FNR` is the per-file record counter. When they are equal, we are reading the FIRST file (GERP).
- `pos[$1"\t"$2]=1` — stores chromosome+position as a key in the `pos` array.
- `next` — skip to the next line (don’t execute the second block for file 1).
- `($1"\t"$2) in pos` — for lines in the SECOND file (phyloP), print only if the chromosome+position key exists in `pos`.

This is the standard two-file lookup pattern in awk, avoiding the need for sort | join or a Python merge. Note that the lookup table is held entirely in RAM — for ~68 million keys this amounts to several gigabytes — so for larger inputs a sorted join or bedtools intersect is preferable.
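The pattern can be exercised on throwaway files (names and values are illustrative):

```shell
# keys.txt supplies chr+pos keys; scores.txt rows are kept only on a key match
printf 'chr1\t10\nchr1\t20\n' > keys.txt
printf 'chr1\t10\t0.9\nchr1\t30\t0.1\nchr1\t20\t0.5\n' > scores.txt
awk 'NR==FNR {pos[$1"\t"$2]=1; next} ($1"\t"$2) in pos' keys.txt scores.txt
# chr1	10	0.9
# chr1	20	0.5
```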
7 Category 6: Sample Name Processing
7.1 6.1 Shorten sample names in VCF
Context: BCFtools generates sample names from BAM file paths or read group tags, producing long strings like A_Al21_Plech23_1-8ChrOnly_BamQualFilt-f3Q30F264_nodup_realigned_sorted. Short names (Plech23_A) are required for downstream R/Python code.
# Create a mapping file: old_name → new_name
# Then use bcftools reheader (preferred) or awk on the VCF header
# Using awk directly on VCF (for inspection/debugging)
awk '
/^#CHROM/ {
for (i=10; i<=NF; i++) {
# Remove long suffix from sample names
gsub("_1-8ChrOnly_BamQualFilt-f3Q30F264_nodup_realigned_sorted", "", $i);
gsub("A_Al[0-9]+_", "", $i)
}
print $0; next
}
{ print $0 }
' raw_vcf.vcf > renamed_vcf.vcf
Explanation:
- `/^#CHROM/` — matches the VCF header line that contains sample names (starts with `#CHROM`).
- `for (i=10; i<=NF; i++)` — loops over fields 10 onwards (VCF sample columns start at column 10).
- `gsub("pattern", "replacement", $i)` — global substitution on field `$i` (replaces all occurrences).
- `next` — after printing the modified header, skip to the next line.
- `{ print $0 }` — prints all non-header lines unchanged.
Better practice: Use bcftools reheader -s sample_names.txt for safety, as it validates the number of samples matches. The awk approach is shown here for transparency about what the renaming does.
7.2 6.2 Extract sample names list from VCF
# Print sample names one per line (columns 10+)
awk '/^#CHROM/ {
for (i=10; i<=NF; i++) print $i
}' all_samples.vcf > sample_names.txt
7.3 6.3 Count samples in VCF header
awk '/^#CHROM/ {print NF-9; exit}' all_samples.vcf
# Output: 64 (total in-group samples)
Explanation:
- `NF - 9` — total fields minus the 9 fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT).
- `exit` — stop processing after the header line is found.
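On a fabricated `#CHROM` line with two samples, the arithmetic checks out:

```shell
# 9 fixed columns + 2 samples → NF = 11 → NF - 9 = 2
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' \
  | awk '/^#CHROM/ {print NF-9; exit}'
# 2
```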
8 Category 7: VCF File Manipulation
8.1 7.1 Count SNPs (non-header lines)
# Count variants in a VCF (excluding header lines starting with #)
awk '!/^#/' all_samples_filtered.vcf | wc -l
# Or equivalently:
grep -c "^[^#]" all_samples_filtered.vcf
Explanation:
- `!/^#/` — the `!` negates the match; prints lines NOT starting with `#`.
8.2 7.2 Extract chromosome names
# Get unique chromosome names from VCF
awk '!/^#/ {print $1}' all_samples.vcf | sort -u
8.3 7.3 Filter VCF to specific chromosomes (awk alternative to bcftools)
# Keep only chromosomes 1-8 (useful when bcftools not available)
awk '
/^#/ {print; next} # print header lines
  $1 ~ /^CM047(39[5-9]|40[0-2])\.1$/ {print}   # print chromosome lines (CM047395–CM047402)
' all_samples_raw.vcf > all_samples_chrs1-8.vcf
8.4 7.4 Count allele counts to verify polarisation
# Check that AC=0 sites were removed (should find none)
awk '!/^#/ && $8 ~ /AC=0/' filtered_vcf.vcf | wc -l
# Count sites by AC value
awk '!/^#/ {
  match($8, /AC=([0-9]+)/, arr)   # three-argument match() is a GNU awk (gawk) extension
  ac[arr[1]]++
}
END {for (v in ac) print v, ac[v]}' filtered_vcf.vcf | sort -n
8.5 7.5 Split VCF by chromosome (for parallel processing)
Context: The GPN probability computation runs as a SLURM array with one job per chromosome. The polarised VCF is split first.
# Split by chromosome using awk
for CHR in CM047395.1 CM047396.1 CM047397.1 CM047398.1 \
CM047399.1 CM047400.1 CM047401.1 CM047402.1; do
awk -v chr="$CHR" \
'/^#/ {print; next} $1==chr {print}' \
insamples_polarized.vcf.recode.vcf \
> split.${CHR}_insamples_polarized.vcf.recode.vcf
done
Explanation:
- `-v chr="$CHR"` — passes the shell variable `$CHR` into awk as `chr`.
- `/^#/ {print; next}` — always prints header lines.
- `$1==chr {print}` — prints only lines where the chromosome column equals `chr`.
9 Category 8: Allele Frequency Table Operations
9.1 8.1 Compute derived allele frequency from allele counts
Context: The dadi .data file stores allele counts (e.g., 12 REF + 2 ALT reads). Derived allele frequency = ALT_count / (REF_count + ALT_count).
# Columns in vcf_genotypes_matching_conservation_sites.tsv:
# Allele1=REF count for Has02, Has23, Plech02, Plech23
# Allele2=ALT count for Has02.1, Has23.1, Plech02.1, Plech23.1
awk 'NR>1 {
freq_has02 = $4 / ($4 + $9) # ALT_Has02 / total_Has02
freq_has23 = $5 / ($5 + $10)
freq_plech02 = $6 / ($6 + $11)
freq_plech23 = $7 / ($7 + $12)
print $13, $14, freq_has02, freq_has23, freq_plech02, freq_plech23
}' vcf_genotypes_matching_conservation_sites.tsv
9.2 8.2 Compute Δp (derived allele frequency change) per site
# Delta-p = freq_t1 - freq_t0
awk 'NR>1 {
if ($4+$5 > 0 && $6+$7 > 0)
dp = ($6/($6+$7)) - ($4/($4+$5))
else
dp = "NA"
print $1, $2, dp
}' OFS="\t" freq_table.tsv > delta_p.tsv
9.3 8.3 Filter allele frequency table by MAF
# Retain sites with MAF >= 0.05 in all four populations
awk 'NR>1 {
f1=$4; f2=$5; f3=$6; f4=$7
maf1 = (f1<0.5) ? f1 : 1-f1
maf2 = (f2<0.5) ? f2 : 1-f2
maf3 = (f3<0.5) ? f3 : 1-f3
maf4 = (f4<0.5) ? f4 : 1-f4
if (maf1>=0.05 && maf2>=0.05 && maf3>=0.05 && maf4>=0.05)
print $0
}' allelefreq_phyloP_gerp_gpn_Alyr.tsv > allelefreq_MAF05.tsv
10 Category 9: GTF / Annotation File Processing
10.1 9.1 Extract gene coordinates from GTF
# Extract gene start/end/ID from GTF (feature column == "gene")
# -F'\t' is required: with the default whitespace FS the attribute
# text is split into many fields and $9 would be just "gene_id"
awk -F'\t' 'BEGIN{OFS="\t"} $3=="gene" {
  match($9, /gene_id "([^"]+)"/, arr)
  print $1, $4, $5, arr[1]
}' genes.gtf > gene_coordinates.bed
Explanation:
- `$3=="gene"` — selects only lines where the feature column is “gene”.
- `match($9, /regex/, arr)` — regex match on the attribute column; the three-argument form that fills `arr` is a GNU awk (gawk) extension. The capture group `([^"]+)` extracts the gene ID between quotes into `arr[1]`.
- `-F'\t'` must be set here so that `$9` is the complete attribute column (the default whitespace FS would split the attribute text apart).
- `OFS="\t"` — sets the output field separator to tab (important for BED format).
10.2 9.2 Count genes per chromosome
awk '$3=="gene" {count[$1]++}
END {for (c in count) print c, count[c]}' \
genes.gtf | sort -k1,1
10.3 9.3 Extract CDS regions for conservation analysis
# Extract CDS coordinates (used for snpEff validation)
awk '$3=="CDS" {print $1"\t"$4-1"\t"$5"\t"$7}' OFS="\t" \
genes.gtf > cds_regions.bed
Note: CDS in GTF is 1-based inclusive; BED requires 0-based start ($4-1).
11 Category 10: FASTA File Operations
11.1 10.1 Count sequences in FASTA
# Count number of sequences
awk '/^>/ {count++} END {print count}' genome.fna
# Output: 3,209 (8 chromosomes + unassembled scaffolds)
# Count only chromosome-length sequences (> 10 Mb)
# (uses awk multi-line approach)
awk '/^>/ {if (seq) {if (len>10000000) count++}; seq=$0; len=0; next}
{len+=length($0)}
END {if (seq && len>10000000) count++; print count}' genome.fna
11.2 10.2 Extract specific chromosome FASTA
# Extract chromosome CM047395.1 from multi-FASTA
awk '/^>/ {found = ($1 == ">CM047395.1")} found' \
    genome.fna > chr1.fna
# The flag is re-evaluated on every header line, so printing stops at the
# next sequence (whatever its accession); $1 ignores any header description
11.3 10.3 Identify soft-masked (lowercase = repeat) regions
Context: The generate_softmasked_ranges.py Python script does this, but an awk equivalent for documentation:
# Identify runs of lowercase (repeat-masked) nucleotides
awk '
/^>/ {
    if (in_mask) print chr"\t"mask_start-1"\t"pos   # flush a run still open at the end of the previous sequence
    chr=$1; sub(/^>/,"",chr)
    pos=0; in_mask=0; next
}
{
    for (i=1; i<=length($0); i++) {
        pos++
        c = substr($0, i, 1)
        if (c ~ /[acgt]/) {                          # lowercase = masked
            if (!in_mask) {mask_start=pos; in_mask=1}
        } else {
            if (in_mask) {print chr"\t"mask_start-1"\t"pos-1; in_mask=0}
        }
    }
}
END {if (in_mask) print chr"\t"mask_start-1"\t"pos}
' genome.fna > softmasked_regions.bed
12 Category 11: SLURM / HPC Job Scripting
12.1 11.1 Array job chromosome indexing
Context: SLURM array jobs (#SBATCH --array=1-8) provide each job with $SLURM_ARRAY_TASK_ID (1–8). An awk-style bash array indexes into chromosomes.
#!/bin/bash
#SBATCH --array=1-8
# The chromosome list is a bash array (0-indexed)
CHRS=(CM047395.1 CM047396.1 CM047397.1 CM047398.1 \
CM047399.1 CM047400.1 CM047401.1 CM047402.1)
# SLURM_ARRAY_TASK_ID is 1-based; bash arrays are 0-based
CHR=${CHRS[$SLURM_ARRAY_TASK_ID-1]}
# Task 1 → index 0 → CM047395.1
# Task 8 → index 7 → CM047402.1
Note: This is bash array indexing rather than awk, but the pattern ($SLURM_ARRAY_TASK_ID-1) looks identical to awk arithmetic and is documented here for clarity.
12.2 11.2 Log parsing: extract runtime per chromosome
# Extract wall time per chromosome from pipeline log
awk '/✓ CM/ {
match($0, /CM[0-9]+\.[0-9]+ in ([0-9]+\.[0-9]+)s/, arr)
print $2, arr[1]"s"
}' pipeline.log
12.3 11.3 Check which chromosomes completed
awk '/✓ CM/ {print $2, "DONE"}
/ERROR/ {print $0}' \
pipeline.log | column -t
13 Category 12: SED Commands
sed (stream editor) is primarily used for find-and-replace operations on text. It processes files line by line, applying specified substitution or deletion commands.
SED syntax:
sed 's/pattern/replacement/flags' input_file
        ↑          ↑         ↑
     regex to  replacement  g = global (all on line)
      match      string     i = case-insensitive
                            p = print (with -n)
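The effect of the `g` flag in one line:

```shell
# Without g only the first match on the line is replaced
echo "aaa" | sed 's/a/b/'    # baa
echo "aaa" | sed 's/a/b/g'   # bbb
```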
13.1 12.1 Replace scaffold names in GTF (sed alternative to awk)
# Using multiple -e expressions (less clean than awk for 8 replacements)
sed -e 's/^scaffold_1\t/CM047395.1\t/' \
-e 's/^scaffold_2\t/CM047396.1\t/' \
-e 's/^scaffold_3\t/CM047397.1\t/' \
-e 's/^scaffold_4\t/CM047398.1\t/' \
-e 's/^scaffold_5\t/CM047399.1\t/' \
-e 's/^scaffold_6\t/CM047400.1\t/' \
-e 's/^scaffold_7\t/CM047401.1\t/' \
-e 's/^scaffold_8\t/CM047402.1\t/' \
original.gtf | grep "^CM047" > renamed.gtf
Explanation:
- `^scaffold_1\t` — anchors to line start (`^`) and requires a tab (`\t`) after the name (prevents matching `scaffold_10` when looking for `scaffold_1`).
- `-e` — allows multiple sed expressions in one call.
- Note: `\t` in the pattern is a GNU sed extension; BSD/macOS sed requires a literal tab character instead.
Recommendation: Use the awk associative array version (Category 1) for this task. The sed approach requires 8 separate -e flags and is less maintainable.
13.2 12.2 Remove snpEff version comment lines from VCF
# Remove SnpEff command-line comment lines added to VCF header
sed '/^##SnpEffCmd/d' \
snpEff_Ann_insamples_polarized.vcf > clean_header.vcf
Explanation:
- `/^##SnpEffCmd/d` — delete (`d`) lines matching the pattern.
- `^##SnpEffCmd` — matches lines starting with `##SnpEffCmd` (snpEff adds these).
13.3 12.3 Replace sample name prefix in VCF header
# Replace long BAM path prefix in sample names in the #CHROM header line
sed '/^#CHROM/ s/\/scratch\/tali\/bam\///g' all_samples.vcf > renamed.vcf
Explanation:
- `/^#CHROM/` — address: only apply to lines matching `^#CHROM`.
- `s/pattern/replacement/g` — substitute the path prefix with nothing (delete it).
- `\/scratch\/tali\/bam\/` — escaped forward slashes in the pattern. An alternative delimiter avoids the escaping entirely: `s#/scratch/tali/bam/##g`.
13.4 12.4 Fix Windows line endings in input files
# Remove carriage returns (Windows \r\n → Unix \n)
sed 's/\r//' windows_file.txt > unix_file.txt
# Or in-place:
sed -i 's/\r//' windows_file.txt
13.5 12.5 Add chromosome prefix to BED file
# Some tools require a prefixed chromosome name; add a "chr_" prefix if missing
sed 's/^CM/chr_CM/' regions.bed > regions_prefixed.bed
13.7 12.7 Remove trailing whitespace
# Clean trailing spaces/tabs from sample name files
sed 's/[[:space:]]*$//' sample_names.txt > sample_names_clean.txt
14 Category 13: Combined AWK + SED Pipelines
Real-world commands frequently chain awk and sed together with pipes.
14.1 13.1 Full GTF preparation pipeline
# Complete GTF preparation: rename + clean + filter in one pipeline
cat original.gtf \
  | awk 'BEGIN{
        FS = OFS = "\t";    # preserve tab delimiters when $1 is reassigned
        map["scaffold_1"]="CM047395.1"; map["scaffold_2"]="CM047396.1";
        map["scaffold_3"]="CM047397.1"; map["scaffold_4"]="CM047398.1";
        map["scaffold_5"]="CM047399.1"; map["scaffold_6"]="CM047400.1";
        map["scaffold_7"]="CM047401.1"; map["scaffold_8"]="CM047402.1"
    } $1 in map {$1=map[$1]} 1' \
| grep "^CM047" \
| sed '/^##SnpEffCmd/d' \
> Arabidopsis_lyrata_petraea_genome.annotation_Scaffold1-8Mod_CM047.gtf
14.2 13.2 VCF → allele count summary
# Count transitions vs transversions from VCF
awk '!/^#/ {
ref=$4; alt=$5
if (length(ref)==1 && length(alt)==1) {
pair = ref">"alt
counts[pair]++
}
}
END {
ti=counts["A>G"]+counts["G>A"]+counts["C>T"]+counts["T>C"]
tv=0
for (p in counts) tv+=counts[p]
tv -= ti
printf "Transitions: %d\nTransversions: %d\nTi/Tv: %.2f\n", ti, tv, ti/tv
}' filtered_vcf.vcf
14.3 13.3 Generate per-chromosome SNP counts from VCF
# Useful for verifying chromosome-by-chromosome processing
awk '!/^#/ {counts[$1]++}
END {
for (c in counts) print c, counts[c]
}' filtered.vcf | sort
# Plain lexicographic sort already orders CM047395.1 … CM047402.1 correctly
# (sort -t'.' -k2 -n would compare the ".1" suffix, which is identical for all)
14.4 13.4 Merge conservation score files across chromosomes
# Concatenate per-chromosome BED files with header from first file only
awk 'FNR==1 && NR!=1 {next} 1' \
gerp/chr_CM047395.1.bed \
gerp/chr_CM047396.1.bed \
gerp/chr_CM047397.1.bed \
gerp/chr_CM047398.1.bed \
gerp/chr_CM047399.1.bed \
gerp/chr_CM047400.1.bed \
gerp/chr_CM047401.1.bed \
gerp/chr_CM047402.1.bed \
> gerp/concatenated_gerp.bed
Explanation:
- `FNR==1 && NR!=1 {next}` — for lines that are the first line of a file (`FNR==1`) but not the very first line overall (`NR!=1`), skip them. This removes the header from all files except the first.
- `1` — print all other lines.
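On two small throwaway files the header survives only once:

```shell
# Each file starts with the same header line
printf 'chrom\tpos\nc1\t1\n' > part1.txt
printf 'chrom\tpos\nc2\t2\n' > part2.txt
awk 'FNR==1 && NR!=1 {next} 1' part1.txt part2.txt
# chrom	pos
# c1	1
# c2	2
```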
15 Category 14: Quality Control and Validation
15.1 14.1 Verify expected number of columns in TSV
# Check that all lines have exactly 10 columns (allelefreq TSV)
awk -F'\t' 'NF!=10 {print NR": "NF" columns: "$0}' \
    allelefreq_phyloP_gerp_gpn_Alyr.tsv | head -20
# -F'\t' is needed for strict column counting: the default whitespace FS
# collapses consecutive tabs, so empty fields would be miscounted
# Prints offending lines; no output = all lines have 10 columns
15.2 14.2 Check for missing data in key columns
# Count missing values (NA or empty) in derived frequency columns
awk 'NR>1 {
for (i=4; i<=7; i++)
if ($i=="NA" || $i=="") missing++
}
END {print "Missing values:", missing}' \
allelefreq_phyloP_gerp_gpn_Alyr.tsv
15.3 14.3 Check BED file coordinate order
# Verify BED start < end for all intervals
awk '$2 >= $3 {print NR": start("$2") >= end("$3"): "$0}' \
conservation_scores_matching_vcf_sites.bed | head
# No output = all intervals valid
15.4 14.4 Verify chromosome names match between VCF and BED
# Extract unique chromosome names from each file and compare
echo "=== VCF chromosomes ==="
awk '!/^#/ {print $1}' polarized.vcf | sort -u
echo "=== BED chromosomes ==="
awk '{print $1}' conservation_sites.bed | sort -u
15.5 14.5 Count samples per population group
# Count how many sample names contain each population label
awk '/^#CHROM/ {
has02=0; has23=0; plech02=0; plech23=0
for (i=10; i<=NF; i++) {
if ($i ~ /Has02/) has02++
if ($i ~ /Has23/) has23++
if ($i ~ /Plech02/) plech02++
if ($i ~ /Plech23/) plech23++
}
print "Has02:", has02, "Has23:", has23, "Plech02:", plech02, "Plech23:", plech23
}' all_samples.vcf
# Expected: Has02: 12 Has23: 20 Plech02: 12 Plech23: 20
16 Category 15: Data Reformatting for Downstream Tools
16.1 15.1 Convert VCF to dadi input format (awk pipeline verification)
# Verify that the perl conversion produced the expected output
# Check first 5 data lines of the .data file
awk 'NR>1 && NR<=6' \
insamples_polarized.vcf2dadi/insamples_polarized.vcf.recode.vcf.gz.data
# Count total SNPs in dadi file (skip header)
awk 'NR>1 {count++} END{print count}' \
insamples_polarized.vcf.recode.vcf.gz.data
# Expected: 12,250,277
16.2 15.2 Extract positions from VCF for GPN scoring
Context: The GPN scoring script requires a plain text file with one 1-based position per line, per chromosome.
# Extract all positions for chromosome CM047395.1
awk '!/^#/ && $1=="CM047395.1" {print $2}' \
insamples_polarized.vcf.recode.vcf \
> positions_CM047395.1.txt
# Or loop over all chromosomes
for CHR in CM047395.1 CM047396.1 CM047397.1 CM047398.1 \
CM047399.1 CM047400.1 CM047401.1 CM047402.1; do
awk -v chr="$CHR" '!/^#/ && $1==chr {print $2}' \
insamples_polarized.vcf.recode.vcf \
> positions_${CHR}.txt
echo "$CHR: $(wc -l < positions_${CHR}.txt) positions"
done
16.3 15.3 Format Ne estimation output for downstream use
# Extract mean Ne from bootstrap log files
awk '/Mean =/ {
match($0, /Mean = ([0-9.]+)/, arr)
print FILENAME, arr[1]
}' Ne_bootstrap_Has.log Ne_bootstrap_Plech.log
# Expected output:
# Ne_bootstrap_Has.log 148.77
# Ne_bootstrap_Plech.log 356.97
17 AWK Quick-Reference Card
17.1 Patterns
| Pattern | Matches |
|---|---|
| `/regex/` | lines matching regex |
| `!/regex/` | lines NOT matching regex |
| `$1=="value"` | field 1 equals value |
| `$2 >= 4` | field 2 ≥ 4 (numeric) |
| `NR>1` | all lines except the first |
| `NR==FNR` | lines in first file (two-file processing) |
| `BEGIN` | before any input |
| `END` | after all input |
| `/^#/` | lines starting with `#` |
| `!/^#/` | lines not starting with `#` |
17.2 Arithmetic and strings
# Arithmetic
awk '{print $2 - 1}' # subtract 1 from field 2
awk '{print $3 / $4}' # divide fields
awk '{sum += $5} END {print sum/NR}' # mean of column 5
# String operations
awk '{sub("old","new",$1)}' # replace first occurrence in $1
awk '{gsub("old","new",$0)}' # replace all in entire line
awk '{print length($0)}' # line length
awk '{print substr($1,1,10)}' # first 10 chars of field 1
awk '{print toupper($1)}' # uppercase
awk '{split($1, arr, ":"); print arr[1], arr[2]}'    # split on :
17.3 Output formatting
# Tab-separated output
awk '{print $1"\t"$2"\t"$3}'
awk 'BEGIN{OFS="\t"} {print $1,$2,$3}' # equivalent
# Printf (C-style formatting)
awk '{printf "%.4f\n", $5/$6}' # 4 decimal places
awk '{printf "%s\t%d\t%d\n", $1, $2-1, $2}' # BED format
17.4 Two-file lookups
# Classic two-file join: print lines from file2 if the key appears in file1
awk 'NR==FNR {keys[$1]=1; next}   # first file: load keys
     $1 in keys {print}           # second file: print matching lines
' file1.txt file2.txt
17.5 SED quick reference
| Command | Effect |
|---|---|
| `sed 's/old/new/'` | Replace first `old` with `new` |
| `sed 's/old/new/g'` | Replace all `old` with `new` |
| `sed '/pattern/d'` | Delete lines matching pattern |
| `sed -n '/pattern/p'` | Print only lines matching pattern |
| `sed '1d'` | Delete first line (remove header) |
| `sed -i 's/old/new/g' file` | In-place edit |
| `sed 's/\r//'` | Remove Windows carriage returns |
| `sed 's/[[:space:]]*$//'` | Remove trailing whitespace |
| `sed '/^#/!s/^/prefix_/'` | Add prefix to non-comment lines |
18 Common Pitfalls in this Pipeline
VCF = 1-based; BED = 0-based start. Always use $2-1 as BED start when converting from VCF positions. The bug documented in Step 2.1 (empty BED intervals) cost significant debugging time.
AWK default field separator is any whitespace. For tab-only parsing (e.g., BED, VCF) always specify -F"\t" or OFS="\t". A line with internal spaces in an attribute column can be split incorrectly with the default FS.
/scaffold_1/ matches scaffold_1, scaffold_10, scaffold_11, etc. Always anchor: /^scaffold_1\t/ or $1=="scaffold_1" to match exactly.
awk '{print 1/3}' gives 0.333333 (floating-point). awk '{print int(1/3)}' gives 0 (integer). For allele frequencies, always use floating-point: $4 / ($4 + $9). When $4 + $9 == 0, this produces a division-by-zero warning; guard with: if ($4 + $9 > 0) print $4/($4+$9); else print "NA".
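The guard from the pitfall above, run on two fabricated count pairs:

```shell
# First row divides normally; the zero-total row prints NA instead of dividing by zero
printf '1 3\n0 0\n' | awk '{ if ($1+$2 > 0) print $1/($1+$2); else print "NA" }'
# 0.25
# NA
```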
NR is global across all input files; FNR resets per file. The condition NR==FNR correctly identifies lines from the first file only when processing exactly two files. For three or more files, use a filename check: FILENAME==ARGV[1].
For the 12.25 million SNP VCF and 4.4 million SNP TSV files in this pipeline, awk typically runs 5–50× faster than equivalent Python/pandas code for single-pass operations (filtering, reformatting, field extraction), because it avoids loading the entire file into RAM and has minimal startup overhead. For complex joins or statistical operations, Python is more appropriate.
13.6 12.6 Comment out lines in config files
Context: snpEff configuration occasionally requires disabling entries in snpEff.config. sed can comment a line out by inserting "# " at the start of every matching line; entry_name below is an illustrative placeholder:
# Prefix matching lines with "# " to disable them
sed -i '/^entry_name/ s/^/# /' snpEff.config
This reference was compiled from all scripts in the Arabidopsis lyrata temporal genomics pipeline (Dr Tahir Ali | AG de Meaux | University of Cologne, 2026).