API Reference

HapLink.HaplotypeType
Haplotype(mutations::AbstractVector{Variant})

Create an object to describe a group of mutations that appear together. A single Variant can be passed to create a haplotype of a single variant.

source
HapLink.VariantType
Variant

Describes a genomic mutation

The Variant type is based upon the Variant Call Format v4.2 specification, albeit with imperfect compliance.

Fields

  • chromosome::String: An identifier from the reference genome or an angle-bracketed ID String
  • position::Int: The reference position
  • identifier::String: Semicolon-separated list of unique identifiers where available. If there is no identifier available, then "."" value should be used.
  • referencebase::NucleotideSeq: Base at (or before, in case of insertion/deletion) position in the reference genome
  • alternatebase::NucleotideSeq: Base this mutation describes at position. Note that each non-reference allele must be represented by a new Variant, unlike the VCF spec
  • quality::Number: PHRED-scaled quality score for the assertion made by alternatebase
  • filter::Symbol: Filter status, :PASS is this position has passed all filters. Does Not yet support multiple filters
  • info::Dict{String,Any}: Additional information. No validation is made concerning the keys or values.

Constructors

Variant(chromosome::String, position::Int, identifier::String, referencebase::NucleotideSeq,
    alternatebase::NucleotideSeq, quality::Number, filter::Symbol, info::Dict{String,Any})

Variant(data::DataFrameRow)

Variants can be created from the default constructor, or via a row generated by transformbamcounts.

See also Haplotype

source
HapLink.callvariantsMethod
callvariants(bamcounts::AbstractDataFrame, D_min::Int, Q_min::Int, x_min::Float64,
    f_min::Float64, α::Float64)

Based on the aligned basecalls and stats in bamcounts, call variants.

Arguments

  • bamcounts::AbstractDataFrame: DataFrame containing the output from bam-readcount
  • D_min::Int: minimum variant depth
  • Q_min::Int: minimum average PHRED-scaled quality at variant position
  • x_min::Float64: minimum average fractional distance from read end at variant position
  • f_min::Float64: minimum frequency of variant
  • α::Float64: significance level of variants by Fisher's Exact Test

Returns

  • Vector{Variant}: Variants that passed all the above filters
source
HapLink.countbasestatsMethod
countbasestats(bamfile::String, reffile::String)

Count and calculate statistics on the basecalls of the alignment in bamfile to the reference genome in reffile. Returns a DataFrame with stats on every base in every alignment position. See transformbamcounts for a complete description of the output DataFrame schema.

source
HapLink.find_haplotypesMethod
find_haplotypes(
    variants::AbstractVector{Variant},
    bamfile::AbstractString,
    D::Int,
    α::Float64,
    haplotypemethod
)

Find all combinations of variants that the reads in bamfile support as a valid haplotype with a minimum depth of D and Χ-squared linkage disequilibrium significance of α, with the haplotypes being converted into genomes via haplotypemethod.

Arguments

  • variants::AbstractVector{Variant}: A Vector of Variant objects which will be combined into Haplotype objects and tested for validity
  • bamfile::AbstractString: The path to a BAM file containing reads to check for the presence of haplotypes
  • D::Int: The minimum number of times a haplotype must be present in the reads according to haplotypemethod
  • α::Float64: The maximum $Χ$-squared $p$-value at which to consider a haplotype significant and worth returning.
  • haplotypemethod: A function handle with the signature f(h::Haplotype, b::AbstractString) which with return a table of the variant basecall matches found. Both longread_genome and simulate_genome fulfil this requirement.

Returns

  • Dict{Haplotype,Matrix{Int}}: A dictionary containing every significant haplotype and its incidence matrix
source
HapLink.linkageMethod
linkage(counts::AbstractArray{Int})

Calculates the linkage disequilibrium and Chi-squared significance level of a combination of haplotypes whose number of occurrences are given by counts.

counts is an $N$-dimensional array where the $N$th dimension represents the $N$th variant call position within a haplotype. findoccurrences produces such an array.

source
HapLink.longread_genomeMethod
longread_genome(haplotype::Haplotype, bamfile::AbstractString)

Parse the whole-genome length reads in bamfile to determine each read's basecall at every variant position within haplotype.

Arguments

  • haplotype::Haplotype: The combination of variants to test basecalls against
  • bamfile::AbstractString: The path to a BAM file containing aligned reads to be tested for instances of haplotype

Returns

  • MxN Array{Symbol} where M is the number of reads present in bamfile and N=length(haplotype.mutations): A table of which base each read has in every variant position of haplotype. The table has reads for rows and variant positions for columns, e.g. longread_genome(...)[5,2] gives the basecall for the fifth read at the second variant position. Basecalls are given as Symbol objects with possible values of
    • :reference
    • :alternate
    • :other
source
HapLink.matchvariantMethod
matchvariant(base::Union{NucleotideSeq,DNA,AbstractVector{DNA}}, var::Variant)

Checks if base matches the reference or variant expected in var, and returns a symbol indicating which, if any, it matches.

Returned values can be :reference for a reference match, :alternate for an alternate match, or :other for no match with the given variant.

source
HapLink.mutateMethod
function mutate(seq::FASTA.Record, haplotype::Haplotype)
function mutate(seq::NucleotideSeq, haplotype::Haplotype)

Give the reference sequence seq mutations to match the position and basecalls of haplotype. Returns a new sequence, leaving seq unmodified.

When mutating a FASTA.Record, the new record is given a new unique identifier and description based on the SHA1 hash of the complete genotype.

source
HapLink.myref2seqMethod
myref2seq(aln::Alignment, i::Int)

Replicates the functionality of BioAlignments ref2seq, but can handle hard clips by effectively removing them for the intent of finding the position.

source
HapLink.occurrence_matrixMethod
occurrence_matrix(readmatches::AbstractArray{Symbol})

Transforms the haplotype occurrence table readmatches into an incidence matrix

Arguments

  • readmatches::AbstractArray{Symbol}: An $m$x$n$ array where $m$ is the number of reads represented, and $n$ is the number of variants in the haplotype considered, e.g. readmatches[4,3] represents the match value for the third variant in the fourth read. Valid values in the array are :reference, :alternate, and :other.

Returns

  • 2x2x... Array{Int64, size(readmatches)[2]}: An $N$-dimensional matrix where $N$ is the number of variant positions in readmatches. The $1$ index position in the $n$th dimension represents the number of times the $n$th variant position was found to have the reference base called, while the $2$ index position represents the number of times the $n$th variant position was found to have the alternate base called. E.g. first(occurrence_matrix(reads)) gives the number of times the all-reference base haplotype was found in reads, while occurrence_matrix(reads)[end] gives the number of times the all-alternate base haplotype was found.

Example

julia> pseudoreads = [
           :reference :reference :reference
           :reference :reference :alternate
           :reference :reference :alternate
           :reference :reference :other
       ];

julia> occurrence_matrix(pseudoreads)
2×2×2 Array{Int64, 3}:
[:, :, 1] =
 1  0
 0  0

[:, :, 2] =
 2  0
 0  0
source
HapLink.phrederrorMethod
phrederror(quality::Number)

Converts a PHRED33-scaled error number into the expected fractional error of basecall

source
HapLink.savevcfMethod
savevcf(vars::AbstractVector{Variant}, savepath::String, refpath::String, D::Int,
    Q::Number, x::Float64, α::Float64)

Save a VCF file populated with vars

Arguments

  • vars::AbstractVector{Variant}: Vector of Variants to write to file
  • savepath::AbstractString: path of the VCF file to write to. Will be overwritten
  • refpath::AbstractString: path of the reference genome used to call variants. The absolute path will be added to the ##reference metadata
  • D::Int: mimimum variant depth used to filter variants. Will be added as ##FILTER metadata
  • Q::Number: minimum PHRED quality used to filter variants. Will be added as ##FILTER metadata
  • x::Float64: minimum fractional read position used to filter variants. Will be added as ##FILTER metadata
  • α::Float64: Fisher's Exact Test significance level used to filter variants. Will be added as ##FILTER metadata

Saves the variants in vars to a VCF file at savepath, adding the reference genome refpath, the depth cutoff D, the quality cutoff Q, the position cutoff x, and the significance cutoff α as metadata.

source
HapLink.simulate_genomeMethod
simulate_genome(haplotype::Haplotype, bamfile::AbstractString; iterations::Int64=1000)

Simulate a set of iterations reads that contain all of the variant positions in haplotype containing the actual reads present in bamfile via a maximum likelihood method.

simulate_genome examines each variant position in haplotype and finds a random read containing that position from bamfile. It will then check if the next variant position is contained on the previous read, and if not pick a new random read that contains that variant position. In this way, it assembles a set of reads that conceivably could have come from the same template strand via maximum likelihood.

Arguments

  • haplotype::Haplotype: The combination of variants to test the aligned reads for evidence of
  • bamfile::AbstractString: The path to a BAM file containing aligned reads to be tested for instances of haplotype

Keywords

  • iterations::Integer=1000: The number of times to combine reads and test for the presence of haplotype

Returns

  • MxN Array{Symbol} where M=iterations and N=length(haplotype.mutations): A table of which base each simulated read has in every variant position of haplotype. The table has reads for rows and variant positions for columns, e.g. simulate_genome(...)[5,2] gives the basecall for the fifth simulated read at the second variant position. Basecalls are given as Symbol objects with possible values of
    • :reference
    • :alternate
    • :other
source
HapLink.sumslicedFunction
sumsliced(A::AbstractArray, dim::Int, pos::Int=1)

Sum all elements that are that can be referenced by pos in the dim dimension of A.

Example

julia> A = reshape(1:8, 2, 2, 2)
2×2×2 reshape(::UnitRange{Int64}, 2, 2, 2) with eltype Int64:
[:, :, 1] =
 1  3
 2  4

[:, :, 2] =
 5  7
 6  8

julia> sumsliced(A, 2)
14

julia> sumsliced(A, 2, 2)
22

Heavily inspired by Holy, Tim "Multidimensional algorithms and iteration" https://julialang.org/blog/2016/02/iteration/#filtering_along_a_specified_dimension_exploiting_multiple_indexes

source
HapLink.transformbamcountsMethod
transformbamcounts(bamcounts::AbstractVector{String})

Convert the output from bam-readcount to a DataFrame.

Schema

Column nameDescription
chrChromosome
positionPosition
reference_baseReference base
depthTotal read depth at position
baseAlternate base
countNumber of reads containing base at position
avg_mapping_qualityMean mapping quality
avg_basequalityMean base quality at position
avg_se_mapping_qualityMean single-ended mapping quality
num_plus_strandNumber of reads on the forward strand (N/A)
num_minus_strandNumber of reads on the reverse strand (N/A)
avg_pos_as_fractionAverage position on the read as a fraction, 0 (end) to 1 (center)
avg_num_mismatches_as_fractionAverage number of mismatches on these reads per base
avg_sum_mismatch_qualitiesAverage sum of the base qualities of mismatches in the reads
num_q2_containing_readsNumber of reads with q2 runs at the 3' end
avg_distance_to_q2_start_in_q2_readsAverage distance of position (as fraction of unclipped read length) to the start of the q2 run
avg_clipped_lengthAverage clipped read length
avg_distance_to_effective_3p_endAverage distance to the 3' end of the read (as fraction of unclipped read length)
source