haplink haplotypes

HapLink.haplotypes - Function
haplink haplotypes [options] reference variants bam

Call haplotypes

Introduction

Calls haplotypes based on the linkage disequilibrium between subconsensus variant sites on long reads. Variant sites are those with a "PASS" filter in the variants file, and linkage is calculated from the reads present in the bam file. This means that haplotypes can be called on a different set of sequences than the variants were (e.g. variant calling using high-accuracy short-read chemistry like Illumina and haplotype calling using low-accuracy long-read chemistry like Oxford Nanopore). There are no guarantees that the variants file and bam file match, so use this feature with caution!
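The "PASS" selection described above can be illustrated with a minimal Python sketch. This is not HapLink's implementation: the function name pass_variant_sites, the example records, and the "LowQual" filter label are hypothetical, and the only thing assumed about the input is the standard VCF column order, where FILTER is the seventh tab-separated column.

```python
def pass_variant_sites(vcf_text):
    """Return (chrom, pos, ref, alt) for each VCF record whose FILTER is PASS."""
    sites = []
    for line in vcf_text.splitlines():
        # Skip blank lines, meta-information lines (##) and the header line (#CHROM)
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt, _qual, filt = fields[:7]
        if filt == "PASS":
            sites.append((chrom, int(pos), ref, alt))
    return sites


# Hypothetical three-record VCF; only the first and third records pass
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ref1\t100\t.\tA\tG\t50\tPASS\tDP=200",
    "ref1\t150\t.\tC\tT\t12\tLowQual\tDP=40",
    "ref1\t220\t.\tG\tA\t60\tPASS\tDP=180",
])

sites = pass_variant_sites(example_vcf)
print(sites)
```

In this example only positions 100 and 220 would remain as candidate sites for haplotype calling; the record at position 150 is dropped because its filter is not "PASS".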

Arguments

  • reference: path to the reference genome, in FASTA format, to call haplotypes against. Must not be gzipped, but does not need to be indexed (have a sidecar fai file). HapLink only supports single-segment reference genomes: if reference includes more than one sequence, all but the first will be ignored.
  • variants: path to the variants file that will define variant sites to call haplotypes from. Must be in VCF (not BCF) v4 format. haplink variants generates a compatible file, although output from other tools can also be used.
  • bam: path to the alignment file to call haplotypes from. Can be in SAM or BAM format, and does not need to be sorted or indexed, but haplotype calling speed will increase significantly if using a sorted and indexed (has a sidecar bai file) BAM file.
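As a rough illustration of the argument order in the usage line, the following Python sketch assembles a command line with options first and the three positional arguments last. The helper name haplotypes_command and all file names are hypothetical; only the subcommand, the argument order, and the --outfile option are taken from this page.

```python
import shlex


def haplotypes_command(reference, variants, bam, outfile=None):
    """Build an argv list for `haplink haplotypes [options] reference variants bam`."""
    argv = ["haplink", "haplotypes"]
    if outfile is not None:
        # Without --outfile, haplotype calls go to standard output
        argv.append(f"--outfile={outfile}")
    # Positional arguments follow the options, in the documented order
    argv += [reference, variants, bam]
    return argv


# Hypothetical file names, for illustration only
argv = haplotypes_command(
    "reference.fasta", "sample.vcf", "sample.bam", outfile="sample.haplotypes.out"
)
print(shlex.join(argv))
```

If haplink is installed and on the PATH, a list built this way could be handed to subprocess.run(argv, check=True) to launch the call.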

Flags

  • --simulated-reads: Use maximum likelihood simulation of long reads based on overlapping short reads.

Options

  • --outfile=<path>: The file to write haplotype calls to. If left blank, haplotype calls are written to standard output.
  • --consensus-frequency=<float>: The minimum frequency at which a variant must appear to be considered part of the consensus.
  • --significance=<float>: The alpha value for statistical significance of haplotype calls.
  • --depth=<int>: The minimum number of times a variant combination must be observed within the set of reads to be called a haplotype.
  • --frequency=<float>: The minimum proportion of reads, out of all reads covering its position, in which a variant combination must be observed for that haplotype to be called.
  • --overlap-min=<int>: The minimum number of bases that must overlap for two short reads to be combined into one simulated read. Can be negative to indicate a minimum distance between reads. Only applies when --simulated-reads is set.
  • --overlap-max=<int>: The maximum number of bases that may overlap for two short reads to be combined into one simulated read. Can be negative to indicate a cap on how far two reads must be apart from one another. Must be greater than --overlap-min. Only applies when --simulated-reads is set.
  • --iterations=<int>: The number of simulated reads to create before calling haplotypes. Only applies when --simulated-reads is set.
  • --seed=<int>: The random seed used for picking short reads to create simulated reads when using maximum likelihood methods. Leaving unset will use the default Julia RNG and seed. See Julia's documentation on randomness for implementation details. Only applies when --simulated-reads is set.
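How the --depth, --frequency, --overlap-min and --overlap-max options interact can be sketched in Python. This is only an illustration under stated assumptions, not HapLink's code: all function names are hypothetical, reads are modelled as half-open (start, end) intervals, the overlap is treated as a signed number whose negative values measure the gap between non-overlapping reads, and both thresholds are assumed to be inclusive. The --significance test is left out because this page does not specify it.

```python
def signed_overlap(first, second):
    """Shared bases between two (start, end) half-open intervals.

    A negative result is the size of the gap between reads that do not overlap.
    """
    return min(first[1], second[1]) - max(first[0], second[0])


def can_combine(first, second, overlap_min, overlap_max):
    """Assumed rule: combine two short reads if their overlap lies in the window."""
    if overlap_max <= overlap_min:
        raise ValueError("--overlap-max must be greater than --overlap-min")
    return overlap_min <= signed_overlap(first, second) <= overlap_max


def passes_thresholds(observed, covering, depth, frequency):
    """Assumed rule: the combination meets both --depth and --frequency minimums."""
    if covering == 0:
        return False
    return observed >= depth and observed / covering >= frequency


# Two reads sharing 20 bases, and two reads separated by a 10-base gap
overlapping = ((0, 100), (80, 180))
separated = ((0, 100), (110, 210))

print(signed_overlap(*overlapping))
print(signed_overlap(*separated))
print(can_combine(*separated, overlap_min=0, overlap_max=50))
print(can_combine(*separated, overlap_min=-20, overlap_max=50))
print(passes_thresholds(observed=12, covering=100, depth=10, frequency=0.1))
```

Under these assumptions, a negative --overlap-min widens the window so that reads with a small gap between them can still be joined into one simulated read, while a haplotype seen in 12 of 100 covering reads would clear a depth of 10 and a frequency of 0.1.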