HaplotypeCall

HapLink.HaplotypeCallType
HaplotypeCall{S,T}

A Haplotype that has had linkage disequilibrium calculations performed along with metadata to verify or refute its validity. Very similar to a VariationCall, but for Haplotypes.

Fields

  • depth::UInt64: How many sequencing reads contain all the Variations in haplotype
  • linkage::Float64: The unweighted linkage disequilibrium coefficient of this call
  • frequency::Float64: How often this haplotype occurs compared to the entire read pool
  • significance::Float64: The $χ^2$ statistical significance ($p$-value) of this call
  • haplotype::Haplotype{S,T}: The set of Variations that are inherited together in this call
source
HapLink.HaplotypeCallMethod
HaplotypeCall(
    haplotype::Haplotype{S,T}, reads::AbstractArray{Haplotype{S,T}}
) where {S<:BioSequence,T<:BioSymbol}
HaplotypeCall(
    haplotype::AbstractArray{Variation{S,T}}, reads::AbstractArray{Haplotype{S,T}}
) where {S<:BioSequence,T<:BioSymbol}

Calls haplotypes by calculating the linkage disequilibrium of haplotype within reads

source
HapLink.ishaplotypeMethod
function ishaplotype(
    haplotype::Union{AbstractArray{Variation{S,T}}, Haplotype{S,T}},
    reads::AbstractArray{Haplotype{S,T}};
    frequency::Union{Float64,Nothing}=nothing,
    significance::Union{Float64,Nothing}=nothing,
    depth::Union{Int,Nothing}=nothing,
) where {S<:BioSequence,T<:BioSymbol}

Determines if a call of haplotype is supported by the sequences in reads based upon the provided keyword criteria.

Arguments

  • haplotype::Union{AbstractArray{Variation}, Haplotype}: A Vector of Variations or a Haplotype to search for as a haplotype
  • reads::AbstractArray{Haplotype}: The reads to search for haplotype in

Keywords

  • frequency::Union{Float64,Nothing}=nothing: The minimum number of times the entire haplotype must appear within reads compared to the number of reads to return true
  • significance::Union{Float64,Nothing}=nothing: The $χ^2$ significance level ($α$) of linkage disequilibrium that a haplotype must surpass to return true
  • depth::Union{Int,Nothing}=nothing: The minimum number of times the entire haplotype must appear within reads to return true

Extended help

Linkage disequilibrium ($Δ$) is calculated by

\[Δ = P_{reference} - \prod_i P_{ref,i}\]

where

  • $P_{reference}$ is the probability of finding a read that contains only reference (consensus) bases
  • $P_{ref,i}$ is the probability of finding a read that contains the reference (consensus) base for variant $i$ within a haplotype

and the $χ^2$ statistic is calculated by

\[χ^2 = \frac{ Δ^2 n } { \left(\prod_i^N {P_{ref,i} \left(1 - P_{ref,i}\right)}\right)^\frac{2}{N} }\]

where

  • $N$ is the number of Variations in haplotype (i.e., length(haplotype))
  • $n$ is the number of total reads sampled (i.e. length(reads))

The significance is then calculated from the cumulative $χ^2$ distribution function.

source
HapLink.linkageMethod
linkage(hc::HaplotypeCall)

Gets the unweighted linkage disequilibrium coefficient of hc

source
HapLink.linkageMethod
linkage(counts::AbstractArray{<:Integer, N}) where N

Calculates the linkage disequilibrium of a haplotype given its $N$-dimensional contingency matrix, counts.

counts is an $N$-dimensional array where the $N$th dimension represents the $N$th variant call locus within a haplotype. findoccurrences produces such an array.

Extended help

linkage(::AbstractArray{<:Integer, N}) calculates an unweighted linkage disequilibrium as given by Equation (6) of Slatkin (1972).

\[D(1..N) = E\left( \prod_k^N i_k - P_k \right)\]

where

  • $N$ is the number of variant loci
  • $D(1..N)$ is the linkage disequilibrium of all $N$ variant loci
  • $E$ is an operator returning the arithmetic mean of its argument over every read
  • $i_k$ is a function that returns $1$ if the $k$-th locus of the given read contains the reference allele, and returns $0$ otherwise.
  • $P_k$ is the probability of any read containing the reference allele in the $k$-th locus, i.e. the frequency at which the reference allele is found within the entire read set at the $k$-th locus
source
HapLink.occurrence_matrixMethod
occurrence_matrix(
    haplotype::AbstractArray{Variation{S,T}},
    reads::AbstractArray{Haplotype{S,T}},
) where {S<:BioSequence,T<:BioSymbol}
occurrence_matrix(
    haplotype::Haplotype{S,T},
    reads::AbstractArray{S,T}
) where {S<:BioSequence,T<:BioSymbol}

Determine how many times the variants in haplotype appear in reads as an $N$- dimensional matrix.

Arguments

  • haplotype::AbstractArray{Variation}: A Vector of Variations or a Haplotype to search for as a haplotype
  • reads::AbstractArray{Haplotype}: The reads to search for haplotype in

Returns

  • 2x2x... Array{Int, length(haplotype)}: An $N$-dimensional matrix where $N$ is the number of variant positions in readmatches. The $1$ index position in the $n$th dimension represents the number of times the $n$th variant position was found to have the reference base called, while the $2$ index position represents the number of times the $n$th variant position was found to have the alternate base called. E.g. first(occurrence_matrix(reads)) gives the number of times the all-reference base haplotype was found in reads, while occurrence_matrix(reads)[end] gives the number of times the all-alternate base haplotype was found.
source
HapLink.sumslicedFunction
sumsliced(A::AbstractArray, dim::Int, pos::Int=1)

Sum all elements that are that can be referenced by pos in the dim dimension of A.

Example

julia> A = reshape(1:8, 2, 2, 2)
2×2×2 reshape(::UnitRange{Int64}, 2, 2, 2) with eltype Int64:
[:, :, 1] =
 1  3
 2  4

[:, :, 2] =
 5  7
 6  8

julia> sumsliced(A, 2)
14

julia> sumsliced(A, 2, 2)
22

Heavily inspired by Holy, Tim "Multidimensional algorithms and iteration" https://julialang.org/blog/2016/02/iteration/#filtering_along_a_specified_dimension_exploiting_multiple_indexes

source