ONT CRISPR amplicon sequencing of Pieris brassicae NSP gene — updated pipeline with read trimming (fastp) and post-trim QC
← Back to original overviewThe revised workflow adds a read trimming/cleaning step using fastp after initial QC, followed by a second round of QC on the cleaned reads. This ensures that adapter remnants, low-quality bases, and short fragments are removed before alignment, improving mapping rates and indel calling accuracy.
R10.4.1 flow cell, HAC basecalling
FastQC on raw reads
Adapter & quality trimming
FastQC on cleaned reads
minimap2 to reference
samtools sort + index
Visual inspection in IGV
All commands assume the following directory structure and variables:
# --- Set paths ---
OUTDIR="/Volumes/KINGSTON/ONTsequencing/ONT_testing"
FASTQ_BASE="/Volumes/KINGSTON/ONTsequencing/NSP_500bp_trial/NSP_Pnapi_500bp/20260305_1613_MD-103872_FBF81064_b0e09933/fastq_pass"
REF="$OUTDIR/NSP_genomic_ref.fasta"
REF_MMI="$OUTDIR/NSP_genomic_ref.mmi"
Data was generated on an Oxford Nanopore MinION with an R10.4.1 flow cell.
Basecalling used the HAC model (dna_r10.4.1_e8.2_400bps_hac@v5.2.0)
with built-in demultiplexing into 24 barcode directories.
Raw reads are located at:
$FASTQ_BASE/barcode01/*.fastq.gz
$FASTQ_BASE/barcode02/*.fastq.gz
...
$FASTQ_BASE/barcode24/*.fastq.gz
Run FastQC on the raw reads to assess base quality, read length distributions, and adapter content before any trimming.
# Run FastQC on raw reads for a single barcode
fastqc "$FASTQ_BASE/barcode01"/*.fastq.gz -o "$OUTDIR/fastqc_raw" --threads 2
# Loop over all barcodes
mkdir -p "$OUTDIR/fastqc_raw"
for i in $(seq -w 1 24); do
fastqc "$FASTQ_BASE/barcode${i}"/*.fastq.gz -o "$OUTDIR/fastqc_raw" --threads 2
done
Check the HTML reports for per-base quality scores, read length distributions, and any adapter contamination flags.
Use fastp to trim adapters, remove low-quality bases, and filter out short reads. fastp auto-detects adapters and provides its own HTML/JSON quality reports.
# Install via Homebrew (macOS)
brew install fastp
# Or via conda
conda install -c bioconda fastp
# Concatenate multiple fastq files per barcode if needed, then run fastp
mkdir -p "$OUTDIR/trimmed"
# For a single barcode with one or more input files:
cat "$FASTQ_BASE/barcode01"/*.fastq.gz > "$OUTDIR/trimmed/barcode01_raw_merged.fastq.gz"
fastp \
--in1 "$OUTDIR/trimmed/barcode01_raw_merged.fastq.gz" \
--out1 "$OUTDIR/trimmed/barcode01_trimmed.fastq.gz" \
--html "$OUTDIR/trimmed/barcode01_fastp.html" \
--json "$OUTDIR/trimmed/barcode01_fastp.json" \
--qualified_quality_phred 10 \
--length_required 200 \
--thread 4
mkdir -p "$OUTDIR/trimmed"
for i in $(seq -w 1 24); do
BC="barcode${i}"
# Merge raw fastqs per barcode
cat "$FASTQ_BASE/$BC"/*.fastq.gz > "$OUTDIR/trimmed/${BC}_raw_merged.fastq.gz"
# Run fastp
fastp \
--in1 "$OUTDIR/trimmed/${BC}_raw_merged.fastq.gz" \
--out1 "$OUTDIR/trimmed/${BC}_trimmed.fastq.gz" \
--html "$OUTDIR/trimmed/${BC}_fastp.html" \
--json "$OUTDIR/trimmed/${BC}_fastp.json" \
--qualified_quality_phred 10 \
--length_required 200 \
--thread 4
# Clean up merged raw file to save space
rm "$OUTDIR/trimmed/${BC}_raw_merged.fastq.gz"
done
| Flag | Value | Purpose |
|---|---|---|
--qualified_quality_phred | 10 | Min base quality to keep (Q10; suitable for ONT data) |
--length_required | 200 | Discard reads shorter than 200 bp (amplicon is ~500 bp) |
--thread | 4 | Parallel processing threads |
--html / --json | — | Per-barcode QC reports from fastp |
Adapter detection is automatic. For ONT data, consider also using --cut_front and
--cut_tail to trim low-quality stretches at read ends, or adjust --qualified_quality_phred
depending on your basecaller accuracy.
Run FastQC again on the trimmed reads to confirm that quality has improved and problematic reads were removed. Compare these reports to the raw QC from Step 2.
mkdir -p "$OUTDIR/fastqc_trimmed"
for i in $(seq -w 1 24); do
fastqc "$OUTDIR/trimmed/barcode${i}_trimmed.fastq.gz" \
-o "$OUTDIR/fastqc_trimmed" --threads 2
done
Things to check:
Align the trimmed reads to the 699 bp genomic reference using minimap2
with the ONT preset. The --MD flag adds MD tags for accurate variant display in IGV.
# Build minimap2 index (if not already done)
minimap2 -d "$REF_MMI" "$REF"
# Align a single barcode
minimap2 -a -x map-ont --MD "$REF_MMI" \
"$OUTDIR/trimmed/barcode01_trimmed.fastq.gz" \
> "$OUTDIR/barcode01_aligned.sam"
We write SAM first here to keep alignment and BAM processing as separate steps, but they can be piped together (see Step 6).
Convert SAM to sorted BAM and create an index for fast random access in IGV.
# Sort and convert to BAM
samtools sort -o "$OUTDIR/barcode01_aligned.bam" "$OUTDIR/barcode01_aligned.sam"
# Index the BAM
samtools index "$OUTDIR/barcode01_aligned.bam"
# Clean up SAM
rm "$OUTDIR/barcode01_aligned.sam"
For efficiency, pipe alignment directly into sorting:
for i in $(seq -w 1 24); do
BC="barcode${i}"
minimap2 -a -x map-ont --MD "$REF_MMI" \
"$OUTDIR/trimmed/${BC}_trimmed.fastq.gz" 2>/dev/null | \
samtools sort -o "$OUTDIR/${BC}_aligned.bam" -
samtools index "$OUTDIR/${BC}_aligned.bam"
done
# Check mapping rate and read counts per barcode
for i in $(seq -w 1 24); do
echo -n "barcode${i}: "
samtools flagstat "$OUTDIR/barcode${i}_aligned.bam" 2>/dev/null | head -1
done
Open the aligned BAMs in IGV to visually inspect CRISPR-induced indels at the sgRNA cut site.
NSP_genomic_ref.fastabarcodeXX_aligned.bam files
The reference is a 699 bp region from P. brassicae chromosome 5
(NC_059669.1:14882048-14885601) containing the first coding exon of the NSP gene.
| Tool | Version | Purpose |
|---|---|---|
| minimap2 | 2.30 | ONT read alignment (-x map-ont) |
| samtools | 1.23 | BAM sorting, indexing, stats |
| FastQC | 0.12.1 | Read quality assessment (pre- and post-trim) |
| fastp | 0.24+ | Read trimming, adapter removal, quality filtering |
| IGV | — | Visual inspection of alignments and indels |