API Reference

HapLink.Haplotype — Type

Haplotype(mutations::AbstractVector{Variant})

Create an object to describe a group of mutations that appear together. A single Variant can be passed to create a haplotype of a single variant.

source

HapLink.Variant — Type

Variant

Describes a genomic mutation

The Variant type is based upon the Variant Call Format v4.2 specification, albeit with imperfect compliance.

Fields

chromosome::String: An identifier from the reference genome or an angle-bracketed ID String
position::Int: The reference position
identifier::String: Semicolon-separated list of unique identifiers where available. If there is no identifier available, then "."" value should be used.
referencebase::NucleotideSeq: Base at (or before, in case of insertion/deletion) position in the reference genome
alternatebase::NucleotideSeq: Base this mutation describes at position. Note that each non-reference allele must be represented by a new Variant, unlike the VCF spec
quality::Number: PHRED-scaled quality score for the assertion made by alternatebase
filter::Symbol: Filter status, :PASS is this position has passed all filters. Does Not yet support multiple filters
info::Dict{String,Any}: Additional information. No validation is made concerning the keys or values.

Constructors

Variant(chromosome::String, position::Int, identifier::String, referencebase::NucleotideSeq,
    alternatebase::NucleotideSeq, quality::Number, filter::Symbol, info::Dict{String,Any})

Variant(data::DataFrameRow)

Variants can be created from the default constructor, or via a row generated by transformbamcounts.

See also Haplotype

source

HapLink.baseatreferenceposition — Method

baseatreferenceposition(record::BAM.Record, pos::Int)

Get the base at reference position pos present in the sequence of record.

source

HapLink.callvariants — Method

callvariants(bamcounts::AbstractDataFrame, D_min::Int, Q_min::Int, x_min::Float64,
    f_min::Float64, α::Float64)

Based on the aligned basecalls and stats in bamcounts, call variants.

Arguments

bamcounts::AbstractDataFrame: DataFrame containing the output from bam-readcount
D_min::Int: minimum variant depth
Q_min::Int: minimum average PHRED-scaled quality at variant position
x_min::Float64: minimum average fractional distance from read end at variant position
f_min::Float64: minimum frequency of variant
α::Float64: significance level of variants by Fisher's Exact Test

Returns

Vector{Variant}: Variants that passed all the above filters

source

HapLink.countbasestats — Method

countbasestats(bamfile::String, reffile::String)

Count and calculate statistics on the basecalls of the alignment in bamfile to the reference genome in reffile. Returns a DataFrame with stats on every base in every alignment position. See transformbamcounts for a complete description of the output DataFrame schema.

source

HapLink.find_haplotypes — Method

find_haplotypes(
    variants::AbstractVector{Variant},
    bamfile::AbstractString,
    D::Int,
    α::Float64,
    haplotypemethod
)

Find all combinations of variants that the reads in bamfile support as a valid haplotype with a minimum depth of D and Χ-squared linkage disequilibrium significance of α, with the haplotypes being converted into genomes via haplotypemethod.

Arguments

variants::AbstractVector{Variant}: A Vector of Variant objects which will be combined into Haplotype objects and tested for validity
bamfile::AbstractString: The path to a BAM file containing reads to check for the presence of haplotypes
D::Int: The minimum number of times a haplotype must be present in the reads according to haplotypemethod
α::Float64: The maximum $Χ$-squared $p$-value at which to consider a haplotype significant and worth returning.
haplotypemethod: A function handle with the signature f(h::Haplotype, b::AbstractString) which with return a table of the variant basecall matches found. Both longread_genome and simulate_genome fulfil this requirement.

Returns

Dict{Haplotype,Matrix{Int}}: A dictionary containing every significant haplotype and its incidence matrix

source

HapLink.linkage — Method

linkage(counts::AbstractArray{Int})

Calculates the linkage disequilibrium and Chi-squared significance level of a combination of haplotypes whose number of occurrences are given by counts.

counts is an $N$-dimensional array where the $N$th dimension represents the $N$th variant call position within a haplotype. findoccurrences produces such an array.

source

HapLink.longread_genome — Method

longread_genome(haplotype::Haplotype, bamfile::AbstractString)

Parse the whole-genome length reads in bamfile to determine each read's basecall at every variant position within haplotype.

Arguments

haplotype::Haplotype: The combination of variants to test basecalls against
bamfile::AbstractString: The path to a BAM file containing aligned reads to be tested for instances of haplotype

Returns

MxN Array{Symbol} where M is the number of reads present in bamfile and N=length(haplotype.mutations): A table of which base each read has in every variant position of haplotype. The table has reads for rows and variant positions for columns, e.g. longread_genome(...)[5,2] gives the basecall for the fifth read at the second variant position. Basecalls are given as Symbol objects with possible values of
- :reference
- :alternate
- :other

source

HapLink.matchvariant — Method

matchvariant(base::Union{NucleotideSeq,DNA,AbstractVector{DNA}}, var::Variant)

Checks if base matches the reference or variant expected in var, and returns a symbol indicating which, if any, it matches.

Returned values can be :reference for a reference match, :alternate for an alternate match, or :other for no match with the given variant.

source

HapLink.mutate — Method

function mutate(seq::FASTA.Record, haplotype::Haplotype)
function mutate(seq::NucleotideSeq, haplotype::Haplotype)

Give the reference sequence seq mutations to match the position and basecalls of haplotype. Returns a new sequence, leaving seq unmodified.

When mutating a FASTA.Record, the new record is given a new unique identifier and description based on the SHA1 hash of the complete genotype.

source

HapLink.myref2seq — Method

myref2seq(aln::Alignment, i::Int)

Replicates the functionality of BioAlignments ref2seq, but can handle hard clips by effectively removing them for the intent of finding the position.

source

HapLink.occurrence_matrix — Method

occurrence_matrix(readmatches::AbstractArray{Symbol})

Transforms the haplotype occurrence table readmatches into an incidence matrix

Arguments

readmatches::AbstractArray{Symbol}: An $m$x$n$ array where $m$ is the number of reads represented, and $n$ is the number of variants in the haplotype considered, e.g. readmatches[4,3] represents the match value for the third variant in the fourth read. Valid values in the array are :reference, :alternate, and :other.

Returns

2x2x... Array{Int64, size(readmatches)[2]}: An $N$-dimensional matrix where $N$ is the number of variant positions in readmatches. The $1$ index position in the $n$th dimension represents the number of times the $n$th variant position was found to have the reference base called, while the $2$ index position represents the number of times the $n$th variant position was found to have the alternate base called. E.g. first(occurrence_matrix(reads)) gives the number of times the all-reference base haplotype was found in reads, while occurrence_matrix(reads)[end] gives the number of times the all-alternate base haplotype was found.

Example

julia> pseudoreads = [
           :reference :reference :reference
           :reference :reference :alternate
           :reference :reference :alternate
           :reference :reference :other
       ];

julia> occurrence_matrix(pseudoreads)
2×2×2 Array{Int64, 3}:
[:, :, 1] =
 1  0
 0  0

[:, :, 2] =
 2  0
 0  0

source

HapLink.phrederror — Method

phrederror(quality::Number)

Converts a PHRED33-scaled error number into the expected fractional error of basecall

source

HapLink.savevcf — Method

savevcf(vars::AbstractVector{Variant}, savepath::String, refpath::String, D::Int,
    Q::Number, x::Float64, α::Float64)

Save a VCF file populated with vars

Arguments

vars::AbstractVector{Variant}: Vector of Variants to write to file
savepath::AbstractString: path of the VCF file to write to. Will be overwritten
refpath::AbstractString: path of the reference genome used to call variants. The absolute path will be added to the ##reference metadata
D::Int: mimimum variant depth used to filter variants. Will be added as ##FILTER metadata
Q::Number: minimum PHRED quality used to filter variants. Will be added as ##FILTER metadata
x::Float64: minimum fractional read position used to filter variants. Will be added as ##FILTER metadata
α::Float64: Fisher's Exact Test significance level used to filter variants. Will be added as ##FILTER metadata

Saves the variants in vars to a VCF file at savepath, adding the reference genome refpath, the depth cutoff D, the quality cutoff Q, the position cutoff x, and the significance cutoff α as metadata.

source

HapLink.serialize_vcf — Method

serialize_vcf(v::Variant)

Create a VCF line to represent v.

source

HapLink.serialize_yaml — Method

serialize_yaml(v::Variant)

Create a valid YAML representation of v.

source

HapLink.simulate_genome — Method

simulate_genome(haplotype::Haplotype, bamfile::AbstractString; iterations::Int64=1000)

Simulate a set of iterations reads that contain all of the variant positions in haplotype containing the actual reads present in bamfile via a maximum likelihood method.

simulate_genome examines each variant position in haplotype and finds a random read containing that position from bamfile. It will then check if the next variant position is contained on the previous read, and if not pick a new random read that contains that variant position. In this way, it assembles a set of reads that conceivably could have come from the same template strand via maximum likelihood.

Arguments

haplotype::Haplotype: The combination of variants to test the aligned reads for evidence of
bamfile::AbstractString: The path to a BAM file containing aligned reads to be tested for instances of haplotype

Keywords

iterations::Integer=1000: The number of times to combine reads and test for the presence of haplotype

Returns

MxN Array{Symbol} where M=iterations and N=length(haplotype.mutations): A table of which base each simulated read has in every variant position of haplotype. The table has reads for rows and variant positions for columns, e.g. simulate_genome(...)[5,2] gives the basecall for the fifth simulated read at the second variant position. Basecalls are given as Symbol objects with possible values of
- :reference
- :alternate
- :other

source

HapLink.sumsliced — Function

sumsliced(A::AbstractArray, dim::Int, pos::Int=1)

Sum all elements that are that can be referenced by pos in the dim dimension of A.

Example

julia> A = reshape(1:8, 2, 2, 2)
2×2×2 reshape(::UnitRange{Int64}, 2, 2, 2) with eltype Int64:
[:, :, 1] =
 1  3
 2  4

[:, :, 2] =
 5  7
 6  8

julia> sumsliced(A, 2)
14

julia> sumsliced(A, 2, 2)
22

Heavily inspired by Holy, Tim "Multidimensional algorithms and iteration" https://julialang.org/blog/2016/02/iteration/#filtering_along_a_specified_dimension_exploiting_multiple_indexes

source

HapLink.transformbamcounts — Method

transformbamcounts(bamcounts::AbstractVector{String})

Convert the output from bam-readcount to a DataFrame.

Schema

Column name	Description
`chr`	Chromosome
`position`	Position
`reference_base`	Reference base
`depth`	Total read depth at `position`
`base`	Alternate base
`count`	Number of reads containing `base` at `position`
`avg_mapping_quality`	Mean mapping quality
`avg_basequality`	Mean base quality at `position`
`avg_se_mapping_quality`	Mean single-ended mapping quality
`num_plus_strand`	Number of reads on the forward strand (N/A)
`num_minus_strand`	Number of reads on the reverse strand (N/A)
`avg_pos_as_fraction`	Average position on the read as a fraction, 0 (end) to 1 (center)
`avg_num_mismatches_as_fraction`	Average number of mismatches on these reads per base
`avg_sum_mismatch_qualities`	Average sum of the base qualities of mismatches in the reads
`num_q2_containing_reads`	Number of reads with q2 runs at the 3' end
`avg_distance_to_q2_start_in_q2_reads`	Average distance of position (as fraction of unclipped read length) to the start of the q2 run
`avg_clipped_length`	Average clipped read length
`avg_distance_to_effective_3p_end`	Average distance to the 3' end of the read (as fraction of unclipped read length)

source