PLINK
https://www.cog-genomics.org/plink2/input#vcf
General
--threads [max]: By default, multithreaded PLINK functions run about as many concurrent threads as your system has available cores. Use --threads to override this with an explicit maximum.
Basic I/O
GBS to PLINK (my own R function)
PLINK transposed text genotype table
Variant information + genotype call text file. Must be accompanied by a .tfam file. Loaded with --tfile, and produced by '--recode transpose'.
TPED
Contains no header line, and one line per variant with 2N+4 fields where N is the number of samples. The first four fields are the same as those in a .map file. The fifth and sixth fields are allele calls for the first sample in the .tfam file ('0' = no call); the 7th and 8th are allele calls for the second individual; and so on.
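As a rough illustration of the layout described above, the following Python sketch splits one .tped line into its four .map-style fields and its per-sample allele pairs. The function name `parse_tped_line` and the toy line are hypothetical and are not part of PLINK.

```python
def parse_tped_line(line, n_samples):
    """Split one .tped line into variant info and per-sample allele pairs."""
    fields = line.split()
    expected = 2 * n_samples + 4
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    # The first four fields match a .map line.
    chrom, variant_id, cm, bp = fields[:4]
    alleles = fields[4:]
    # Fields 5-6 belong to sample 1, fields 7-8 to sample 2, and so on.
    calls = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {"chrom": chrom, "id": variant_id, "cm": float(cm),
            "bp": int(bp), "calls": calls}


# Toy line with N = 3 samples, so 2*3 + 4 = 10 fields; '0' marks a no-call.
record = parse_tped_line("1 snp1 0 10583 A G G G 0 0", n_samples=3)
print(record["calls"])  # [('A', 'G'), ('G', 'G'), ('0', '0')]
```

The order of the allele pairs follows the order of samples in the accompanying .tfam file.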
MAP file
- Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.
- Variant identifier
- Position in morgans or centimorgans (optional; also safe to use dummy value of '0')
- Base-pair coordinate
TFAM == FAM file
A text file with no header line, and one line per sample with the following six fields:
- Family ID ('FID')
- Within-family ID ('IID'; cannot be '0')
- Within-family ID of father ('0' if father isn't in dataset)
- Within-family ID of mother ('0' if mother isn't in dataset)
- Sex code ('1' = male, '2' = female, '0' = unknown)
- Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
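The six fields listed above can be read into a small record type. This is only a sketch: `FamRecord` and `parse_fam_line` are made-up names, and the phenotype is interpreted under the case/control convention only (anything other than '1' or '2' is treated as missing), which is a simplifying assumption.

```python
from typing import NamedTuple, Optional


class FamRecord(NamedTuple):
    fid: str
    iid: str
    father: Optional[str]     # None if the father isn't in the dataset
    mother: Optional[str]     # None if the mother isn't in the dataset
    sex: str                  # 'male', 'female' or 'unknown'
    phenotype: Optional[str]  # 'control', 'case' or None (missing)


SEX_CODES = {"1": "male", "2": "female"}
PHENO_CODES = {"1": "control", "2": "case"}


def parse_fam_line(line):
    """Parse one .fam/.tfam line (case/control phenotype assumed)."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    fid, iid, father, mother, sex, pheno = fields
    if iid == "0":
        raise ValueError("within-family ID cannot be '0'")
    return FamRecord(
        fid=fid,
        iid=iid,
        father=None if father == "0" else father,
        mother=None if mother == "0" else mother,
        sex=SEX_CODES.get(sex, "unknown"),
        phenotype=PHENO_CODES.get(pheno),  # '-9', '0', non-numeric -> None
    )


fam_example = parse_fam_line("FAM1 IND1 0 0 2 -9")
print(fam_example)
```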
plink --tfile file --missing --freq --out results
--bfile {prefix}
It requires the .bed + .bim + .fam fileset. Note that PLINK 1.9 automatically converts most other formats to PLINK binary before other operations. Use --keep-autoconv to keep the conversion products; otherwise they are silently deleted.
plink --bfile {prefix} --out results
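Because --bfile expects all three files to share one prefix, a quick pre-flight check can catch a missing file before PLINK is run. The helper below is a hypothetical convenience function and not something PLINK provides; the demo uses empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def missing_bfile_parts(prefix):
    """Return the binary-fileset extensions that do not exist for `prefix`."""
    return [ext for ext in (".bed", ".bim", ".fam")
            if not Path(str(prefix) + ext).exists()]


# Demo: create only toy.bed and toy.bim, then ask what is missing.
with tempfile.TemporaryDirectory() as tmp:
    prefix = str(Path(tmp) / "toy")
    for ext in (".bed", ".bim"):
        Path(prefix + ext).touch()
    demo_missing = missing_bfile_parts(prefix)

print(demo_missing)  # ['.fam']
```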
Filtering
--maf: filters out all variants with minor allele frequency below the provided threshold (default 0.01)
--geno: filters out all variants with missing call rates exceeding the provided value (default 0.1)
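To make the two thresholds concrete, here is a minimal Python sketch of the per-variant quantities they act on. It assumes biallelic variants and allele pairs with '0' as the no-call code, and it ignores details such as founder status, so it is a simplified model rather than PLINK's actual computation. All function names are hypothetical.

```python
from collections import Counter


def missing_rate(calls):
    """Fraction of samples whose genotype is a no-call ('0')."""
    n_missing = sum(1 for a, b in calls if a == "0" or b == "0")
    return n_missing / len(calls)


def minor_allele_freq(calls):
    """Minor allele frequency among called alleles (biallelic assumed)."""
    counts = Counter()
    for a, b in calls:
        if a != "0" and b != "0":
            counts.update([a, b])
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # no calls, or monomorphic
    return min(counts.values()) / total


def passes_filters(calls, maf=0.01, geno=0.1):
    """Keep a variant unless its MAF is below `maf` or its missing rate exceeds `geno`."""
    return minor_allele_freq(calls) >= maf and missing_rate(calls) <= geno


toy_calls = [("A", "G"), ("G", "G"), ("0", "0"), ("G", "G")]
print(missing_rate(toy_calls))                 # 0.25
print(round(minor_allele_freq(toy_calls), 3))  # 0.167
print(passes_filters(toy_calls))               # False
```

In this toy example the variant fails only because one of four samples is missing (0.25 > 0.1); relaxing the threshold with `geno=0.3` would keep it.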
Load VCF/BCF files
VCF reference alleles are set to A2 by the autoconverter even when they appear to be minor. However, to maintain backwards compatibility with PLINK 1.07, PLINK 1.9 normally forces major alleles to A2 during its loading sequence. One workaround is permanently keeping the .bim file generated during initial conversion, for use as --a2-allele input whenever the reference sequence needs to be recovered. (If you use this method, note that, when your initial conversion step invokes --make-bed instead of just --out, you also need --keep-allele-order to avoid losing track of reference alleles before the very first write, because --make-bed triggers the regular loading sequence.)
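The interaction between the autoconverter and the default loading sequence can be pictured with a toy model. The sketch below is not PLINK's implementation; `assign_a1_a2` is a hypothetical function, it covers biallelic variants only, and leaving the order unchanged when the counts are tied is an assumption.

```python
def assign_a1_a2(ref, alt, ref_count, alt_count, keep_allele_order=False):
    """Return (A1, A2) for a biallelic variant coming from a VCF (toy model)."""
    # Autoconverter: the reference allele always starts out as A2.
    a1, a2 = alt, ref
    # Default loading sequence: the major allele is forced into A2,
    # so a minor reference allele gets swapped into A1.
    # Assumption: ties leave the order unchanged.
    if not keep_allele_order and ref_count < alt_count:
        a1, a2 = a2, a1
    return a1, a2


# REF 'A' is the minor allele here (3 copies vs 17 copies of ALT 'G').
print(assign_a1_a2("A", "G", ref_count=3, alt_count=17))
# ('A', 'G'): the reference allele is no longer A2
print(assign_a1_a2("A", "G", ref_count=3, alt_count=17, keep_allele_order=True))
# ('G', 'A'): the reference allele stays in A2
```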
```
# PLINK v1.90b3.36 64-bit (31 Mar 2016), https://www.cog-genomics.org/plink2
# (C) 2005-2016 Shaun Purcell, Christopher Chang, GNU General Public License v3
plink --bcf JRI20_joint_call.filtered_snps.bcf \
  --keep-allele-order --make-bed --out JRI20 --allow-extra-chr
```

- `--allow-extra-chr`: required here because the .bcf file contains a nonstandard chromosome code. Without it, PLINK stops with "Error: Invalid chromosome code 'UNKNOWN' in .bcf file. (Use --allow-extra-chr to force it to be accepted.)"
- `--make-bed`: writes a new PLINK binary fileset (.bed + .bim + .fam).

## Set missing variant IDs

`--set-missing-var-ids` provides one way to assign IDs to variants that lack one. The parameter taken by this flag is a special template string, with a `@` where the chromosome code should go, and a `#` where the base-pair position belongs. (Exactly one `@` and one `#` must be present.) For example, given a .bim file starting with

```
chr1 . 0 10583 A G
chr1 . 0 886817 C T
chr1 . 0 886817 CATTTT C
chrMT . 0 64 T C
```

`--set-missing-var-ids @:#[b37]` would name the first variant 'chr1:10583[b37]', the second variant 'chr1:886817[b37]'... and then error out when naming the third variant, since it would be given the same name as the second variant. (Note that this position overlap is actually present in 1000 Genomes Project phase 1 data.)

To maintain unique IDs in this situation, you can include `$1` and `$2` in your template string as well; these refer to the first and second allele names in ASCII-sort order. So, if we're using a bash shell, we can try again with

```
--set-missing-var-ids @:#[b37]\$1,\$2
```

which would name the first variant 'chr1:10583[b37]A,G', the second variant 'chr1:886817[b37]C,T', the third variant 'chr1:886817[b37]C,CATTTT', and the fourth variant 'chrMT:64[b37]C,T'. Note the extra backslashes: they are necessary in bash because `$` is a reserved character there.

You may still get a small number of duplicate ID errors when using `$1` and `$2`. If indels are involved, it is likely that the ambiguity cannot be resolved by PLINK 1 at all, because it matters which allele is the reference allele. Instead, you must e.g. use a shell script to manually name variants in your original VCF file; see the blog post by Giulio Genovese for a detailed discussion. According to the PLINK 1.9 documentation, PLINK 2.0 will extend --set-missing-var-ids to support REF/ALT-based naming templates.

## Basic statistics

- The optional `gz` modifier (e.g. `--freq counts gz`) writes gzipped output.

```
plink --bfile {prefix} --out results --freq counts --missing --het --ibc
```
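Returning to the `--set-missing-var-ids` template rules described earlier, the following Python sketch mimics the naming logic on the four example .bim rows. It is a toy re-implementation for illustration only; `name_variants` is a hypothetical function and PLINK's real behaviour may differ in edge cases. No backslashes are needed before `$` here because the template is a Python string, not a bash argument.

```python
def name_variants(bim_rows, template):
    """Fill in IDs for rows whose variant ID is '.', failing on duplicates."""
    seen = set()
    names = []
    for chrom, var_id, _cm, bp, allele_a, allele_b in bim_rows:
        if var_id == ".":
            # $1 and $2 are the two allele names in ASCII-sort order.
            first, second = sorted([allele_a, allele_b])
            var_id = (template.replace("@", chrom)
                              .replace("#", str(bp))
                              .replace("$1", first)
                              .replace("$2", second))
        if var_id in seen:
            raise ValueError(f"duplicate variant ID: {var_id}")
        seen.add(var_id)
        names.append(var_id)
    return names


rows = [
    ("chr1", ".", 0, 10583, "A", "G"),
    ("chr1", ".", 0, 886817, "C", "T"),
    ("chr1", ".", 0, 886817, "CATTTT", "C"),
    ("chrMT", ".", 0, 64, "T", "C"),
]

names = name_variants(rows, "@:#[b37]$1,$2")
print(names)
# ['chr1:10583[b37]A,G', 'chr1:886817[b37]C,T',
#  'chr1:886817[b37]C,CATTTT', 'chrMT:64[b37]C,T']
```

Calling `name_variants(rows, "@:#[b37]")` instead raises a `ValueError` on the third row, mirroring the duplicate-ID error described in the text.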