fastq
Understand the fastq format and conduct quality checking of the fastq file.
seqtk
Seqtk, developed by Heng Li, is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.
- Quality checking
Interpret the results following Liang2.
By default, seqtk fqchk uses -q 20. This threshold decides whether a base is counted as low or high quality, and the result is reported as %low and %high for each read position. With the default setting, bases with a quality score higher than 20 are treated as high quality.
The average quality, avgQ, is the weighted mean of the base quality scores.
#Usage: seqtk fqchk [-q 20] <in.fq>
#Note: use -q0 to get the distribution of all quality values
paste0("seqtk fqchk -q 20 ", df$fq[i], " > ", df$out[i])
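To make the %low, %high and avgQ columns more concrete, here is a minimal Python sketch of a per-position quality summary. It is an illustration only, not seqtk's implementation: the function names are hypothetical, the qualities are assumed to be Phred+33 encoded, "high" is taken to mean strictly greater than the threshold (following the wording above), and the average is a plain mean over the bases seen at each position.

```python
import io


def read_quals(handle):
    """Yield the quality string of each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return
        handle.readline()  # sequence line
        handle.readline()  # '+' separator line
        yield handle.readline().rstrip("\n")


def position_summary(handle, q_threshold=20, offset=33):
    """Return (pos, n_bases, mean_q, pct_low, pct_high) per read position."""
    totals = []  # per position: [n_bases, sum_q, n_high]
    for qual in read_quals(handle):
        for i, ch in enumerate(qual):
            q = ord(ch) - offset  # assumed Phred+33 encoding
            if i == len(totals):
                totals.append([0, 0, 0])
            t = totals[i]
            t[0] += 1
            t[1] += q
            t[2] += 1 if q > q_threshold else 0
    rows = []
    for pos, (n, sum_q, n_high) in enumerate(totals, start=1):
        pct_high = 100.0 * n_high / n
        rows.append((pos, n, sum_q / n, 100.0 - pct_high, pct_high))
    return rows


# Two toy reads of different lengths ('I' = Q40, '5' = Q20, '#' = Q2).
fastq = io.StringIO(
    "@r1\nACGT\n+\nIII#\n"
    "@r2\nACG\n+\nI5#\n"
)
for row in position_summary(fastq):
    print(row)
```

Passing q_threshold=30 instead would move the Q20 base at position 2 further from the cutoff but leave it in the low bin, which is the kind of shift to expect when changing -q.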
Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):
seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
Generate pseudo-reference:
seqtk mutfa
Trim low-quality bases from both ends using the Phred algorithm:
seqtk trimfq in.fq > out.fq
Trim 5 bp from the left end of each read and 10 bp from the right end:
seqtk trimfq -b 5 -e 10 in.fa > out.fa
Extract a subsequence from FASTA/Q:
samtools faidx genome.fa 1:1-1000 > out.fa
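Returning to the paired subsampling step above, the reason the same seed keeps the pairing can be sketched in Python. This is a generic reservoir sampler, used here as an assumption about how such sampling can work rather than as seqtk's actual code. The function name and the toy read names are hypothetical. The point is that the random draws depend only on the seed and on the record index, never on the record content, so two files with the same number of records in the same order yield the same positions.

```python
import random


def sample_records(records, k, seed):
    """Keep k records using reservoir sampling with a fixed seed."""
    rng = random.Random(seed)
    kept = []
    for i, rec in enumerate(records):
        if i < k:
            kept.append(rec)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # draw depends on seed and index only
            if j < k:
                kept[j] = rec
    return kept


# Toy stand-ins for read1.fq and read2.fq: same order, same length.
read1 = [f"read{i}/1" for i in range(1000)]
read2 = [f"read{i}/2" for i in range(1000)]

sub1 = sample_records(read1, 10, seed=100)
sub2 = sample_records(read2, 10, seed=100)

# Strip the "/1" and "/2" suffixes and compare mate names slot by slot.
paired = all(a[:-2] == b[:-2] for a, b in zip(sub1, sub2))
print(paired)
```

With different seeds for the two files, the kept positions would generally differ and the mates would no longer line up.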