API Reference
HapLink.Haplotype
— TypeHaplotype(mutations::AbstractVector{Variant})
Create an object to describe a group of mutations that appear together. A single Variant
can be passed to create a haplotype of a single variant.
HapLink.Variant
— TypeVariant
Describes a genomic mutation
The Variant
type is based upon the Variant Call Format v4.2 specification, albeit with imperfect compliance.
Fields
chromosome::String
: An identifier from the reference genome or an angle-bracketed ID Stringposition::Int
: The reference positionidentifier::String
: Semicolon-separated list of unique identifiers where available. If there is no identifier available, then "."" value should be used.referencebase::NucleotideSeq
: Base at (or before, in case of insertion/deletion)position
in the reference genomealternatebase::NucleotideSeq
: Base this mutation describes atposition
. Note that each non-reference allele must be represented by a newVariant
, unlike the VCF specquality::Number
: PHRED-scaled quality score for the assertion made byalternatebase
filter::Symbol
: Filter status,:PASS
is this position has passed all filters. Does Not yet support multiple filtersinfo::Dict{String,Any}
: Additional information. No validation is made concerning the keys or values.
Constructors
Variant(chromosome::String, position::Int, identifier::String, referencebase::NucleotideSeq,
alternatebase::NucleotideSeq, quality::Number, filter::Symbol, info::Dict{String,Any})
Variant(data::DataFrameRow)
Variant
s can be created from the default constructor, or via a row generated by transformbamcounts
.
See also Haplotype
HapLink.baseatreferenceposition
— Methodbaseatreferenceposition(record::BAM.Record, pos::Int)
Get the base at reference position pos
present in the sequence of record
.
HapLink.callvariants
— Methodcallvariants(bamcounts::AbstractDataFrame, D_min::Int, Q_min::Int, x_min::Float64,
f_min::Float64, α::Float64)
Based on the aligned basecalls and stats in bamcounts
, call variants.
Arguments
bamcounts::AbstractDataFrame
:DataFrame
containing the output frombam-readcount
D_min::Int
: minimum variant depthQ_min::Int
: minimum average PHRED-scaled quality at variant positionx_min::Float64
: minimum average fractional distance from read end at variant positionf_min::Float64
: minimum frequency of variantα::Float64
: significance level of variants by Fisher's Exact Test
Returns
Vector{Variant}
:Variant
s that passed all the above filters
HapLink.countbasestats
— Methodcountbasestats(bamfile::String, reffile::String)
Count and calculate statistics on the basecalls of the alignment in bamfile
to the reference genome in reffile
. Returns a DataFrame
with stats on every base in every alignment position. See transformbamcounts
for a complete description of the output DataFrame schema.
HapLink.find_haplotypes
— Methodfind_haplotypes(
variants::AbstractVector{Variant},
bamfile::AbstractString,
D::Int,
α::Float64,
haplotypemethod
)
Find all combinations of variants
that the reads in bamfile
support as a valid haplotype with a minimum depth of D
and Χ-squared linkage disequilibrium significance of α
, with the haplotypes being converted into genomes via haplotypemethod
.
Arguments
variants::AbstractVector{Variant}
: AVector
ofVariant
objects which will be combined intoHaplotype
objects and tested for validitybamfile::AbstractString
: The path to a BAM file containing reads to check for the presence of haplotypesD::Int
: The minimum number of times a haplotype must be present in the reads according tohaplotypemethod
α::Float64
: The maximum $Χ$-squared $p$-value at which to consider a haplotype significant and worth returning.haplotypemethod
: A function handle with the signaturef(h::Haplotype, b::AbstractString)
which with return a table of the variant basecall matches found. Bothlongread_genome
andsimulate_genome
fulfil this requirement.
Returns
Dict{Haplotype,Matrix{Int}}
: A dictionary containing every significant haplotype and its incidence matrix
HapLink.linkage
— Methodlinkage(counts::AbstractArray{Int})
Calculates the linkage disequilibrium and Chi-squared significance level of a combination of haplotypes whose number of occurrences are given by counts
.
counts
is an $N$-dimensional array where the $N$th dimension represents the $N$th variant call position within a haplotype. findoccurrences
produces such an array.
HapLink.longread_genome
— Methodlongread_genome(haplotype::Haplotype, bamfile::AbstractString)
Parse the whole-genome length reads in bamfile
to determine each read's basecall at every variant position within haplotype
.
Arguments
haplotype::Haplotype
: The combination of variants to test basecalls againstbamfile::AbstractString
: The path to a BAM file containing aligned reads to be tested for instances ofhaplotype
Returns
MxN Array{Symbol}
whereM
is the number of reads present inbamfile
andN=length(haplotype.mutations)
: A table of which base each read has in every variant position ofhaplotype
. The table has reads for rows and variant positions for columns, e.g. longread_genome(...)[5,2] gives the basecall for the fifth read at the second variant position. Basecalls are given asSymbol
objects with possible values of:reference
:alternate
:other
HapLink.matchvariant
— Methodmatchvariant(base::Union{NucleotideSeq,DNA,AbstractVector{DNA}}, var::Variant)
Checks if base
matches the reference or variant expected in var
, and returns a symbol indicating which, if any, it matches.
Returned values can be :reference
for a reference match, :alternate
for an alternate match, or :other
for no match with the given variant.
HapLink.mutate
— Methodfunction mutate(seq::FASTA.Record, haplotype::Haplotype)
function mutate(seq::NucleotideSeq, haplotype::Haplotype)
Give the reference sequence seq
mutations to match the position and basecalls of haplotype
. Returns a new sequence, leaving seq
unmodified.
When mutating a FASTA.Record
, the new record is given a new unique identifier and description based on the SHA1 hash of the complete genotype.
HapLink.myref2seq
— Methodmyref2seq(aln::Alignment, i::Int)
Replicates the functionality of BioAlignments ref2seq
, but can handle hard clips by effectively removing them for the intent of finding the position.
HapLink.occurrence_matrix
— Methodoccurrence_matrix(readmatches::AbstractArray{Symbol})
Transforms the haplotype occurrence table readmatches
into an incidence matrix
Arguments
readmatches::AbstractArray{Symbol}
: An $m$x$n$ array where $m$ is the number of reads represented, and $n$ is the number of variants in the haplotype considered, e.g. readmatches[4,3] represents the match value for the third variant in the fourth read. Valid values in the array are:reference
,:alternate
, and:other
.
Returns
2x2x... Array{Int64, size(readmatches)[2]}
: An $N$-dimensional matrix where $N$ is the number of variant positions inreadmatches
. The $1$ index position in the $n$th dimension represents the number of times the $n$th variant position was found to have the reference base called, while the $2$ index position represents the number of times the $n$th variant position was found to have the alternate base called. E.g.first(occurrence_matrix(reads))
gives the number of times the all-reference base haplotype was found inreads
, whileoccurrence_matrix(reads)[end]
gives the number of times the all-alternate base haplotype was found.
Example
julia> pseudoreads = [
:reference :reference :reference
:reference :reference :alternate
:reference :reference :alternate
:reference :reference :other
];
julia> occurrence_matrix(pseudoreads)
2×2×2 Array{Int64, 3}:
[:, :, 1] =
1 0
0 0
[:, :, 2] =
2 0
0 0
HapLink.phrederror
— Methodphrederror(quality::Number)
Converts a PHRED33-scaled error number into the expected fractional error of basecall
HapLink.savevcf
— Methodsavevcf(vars::AbstractVector{Variant}, savepath::String, refpath::String, D::Int,
Q::Number, x::Float64, α::Float64)
Save a VCF file populated with vars
Arguments
vars::AbstractVector{Variant}
:Vector
ofVariant
s to write to filesavepath::AbstractString
: path of the VCF file to write to. Will be overwrittenrefpath::AbstractString
: path of the reference genome used to call variants. The absolute path will be added to the##reference
metadataD::Int
: mimimum variant depth used to filter variants. Will be added as##FILTER
metadataQ::Number
: minimum PHRED quality used to filter variants. Will be added as##FILTER
metadatax::Float64
: minimum fractional read position used to filter variants. Will be added as##FILTER
metadataα::Float64
: Fisher's Exact Test significance level used to filter variants. Will be added as##FILTER
metadata
Saves the variants in vars
to a VCF file at savepath
, adding the reference genome refpath
, the depth cutoff D
, the quality cutoff Q
, the position cutoff x
, and the significance cutoff α
as metadata.
HapLink.serialize_vcf
— Methodserialize_vcf(v::Variant)
Create a VCF line to represent v
.
HapLink.serialize_yaml
— Methodserialize_yaml(v::Variant)
Create a valid YAML representation of v
.
HapLink.simulate_genome
— Methodsimulate_genome(haplotype::Haplotype, bamfile::AbstractString; iterations::Int64=1000)
Simulate a set of iterations
reads that contain all of the variant positions in haplotype
containing the actual reads present in bamfile
via a maximum likelihood method.
simulate_genome
examines each variant position in haplotype
and finds a random read containing that position from bamfile
. It will then check if the next variant position is contained on the previous read, and if not pick a new random read that contains that variant position. In this way, it assembles a set of reads that conceivably could have come from the same template strand via maximum likelihood.
Arguments
haplotype::Haplotype
: The combination of variants to test the aligned reads for evidence ofbamfile::AbstractString
: The path to a BAM file containing aligned reads to be tested for instances ofhaplotype
Keywords
iterations::Integer=1000
: The number of times to combine reads and test for the presence ofhaplotype
Returns
MxN Array{Symbol}
whereM=iterations
andN=length(haplotype.mutations)
: A table of which base each simulated read has in every variant position ofhaplotype
. The table has reads for rows and variant positions for columns, e.g. simulate_genome(...)[5,2] gives the basecall for the fifth simulated read at the second variant position. Basecalls are given asSymbol
objects with possible values of:reference
:alternate
:other
HapLink.sumsliced
— Functionsumsliced(A::AbstractArray, dim::Int, pos::Int=1)
Sum all elements that are that can be referenced by pos
in the dim
dimension of A
.
Example
julia> A = reshape(1:8, 2, 2, 2)
2×2×2 reshape(::UnitRange{Int64}, 2, 2, 2) with eltype Int64:
[:, :, 1] =
1 3
2 4
[:, :, 2] =
5 7
6 8
julia> sumsliced(A, 2)
14
julia> sumsliced(A, 2, 2)
22
Heavily inspired by Holy, Tim "Multidimensional algorithms and iteration" https://julialang.org/blog/2016/02/iteration/#filtering_along_a_specified_dimension_exploiting_multiple_indexes
HapLink.transformbamcounts
— Methodtransformbamcounts(bamcounts::AbstractVector{String})
Convert the output from bam-readcount to a DataFrame
.
Schema
Column name | Description |
---|---|
chr | Chromosome |
position | Position |
reference_base | Reference base |
depth | Total read depth at position |
base | Alternate base |
count | Number of reads containing base at position |
avg_mapping_quality | Mean mapping quality |
avg_basequality | Mean base quality at position |
avg_se_mapping_quality | Mean single-ended mapping quality |
num_plus_strand | Number of reads on the forward strand (N/A) |
num_minus_strand | Number of reads on the reverse strand (N/A) |
avg_pos_as_fraction | Average position on the read as a fraction, 0 (end) to 1 (center) |
avg_num_mismatches_as_fraction | Average number of mismatches on these reads per base |
avg_sum_mismatch_qualities | Average sum of the base qualities of mismatches in the reads |
num_q2_containing_reads | Number of reads with q2 runs at the 3' end |
avg_distance_to_q2_start_in_q2_reads | Average distance of position (as fraction of unclipped read length) to the start of the q2 run |
avg_clipped_length | Average clipped read length |
avg_distance_to_effective_3p_end | Average distance to the 3' end of the read (as fraction of unclipped read length) |