PseudoRead
HapLink.Pseudoread
— TypePseudoread(startpos::Integer, endpos::Integer, read::Haplotype)
Pseudoread(query::Union{SAM.Record,BAM.Record}, reference::NucleotideSeq)
A stand-in struct for a sequencing read (real or imaginary) represented by its position and differences relative to a reference sequence.
HapLink.contradicts
— Methodcontradicts(var::Variation{S,T}, ref::Haplotype{S,T}) where {S,T}
Returns true
if var
contains a modification incompatible with any of the Variation
s that make up ref
.
Example
julia> using BioSequences, SequenceVariation
julia> ref = Haplotype(dna"GATTACA", [Variation(dna"GATTACA", "A2T"), Variation(dna"GATTACA", "Δ5-6")]);
julia> contradicts(Variation(dna"GATTACA", "A2C"), ref)
true
julia> contradicts(Variation(dna"GATTACA", "T4A"), ref)
false
julia> contradicts(Variation(dna"GATTACA", "Δ2-2"), ref)
true
julia> contradicts(Variation(dna"GATTACA", "Δ4-4"), ref)
false
julia> contradicts(Variation(dna"GATTACA", "2G"), ref)
false
julia> contradicts(Variation(dna"GATTACA", "4G"), ref)
false
HapLink.haplotype
— Methodhaplotype(ps::Pseudoread)
Gets the differences between ps
and its reference sequence as a Haplotype
HapLink.overlap_inrange
— Methodoverlap_inrange(
x::Pseudoread, y::Pseudoread, overlap_min::Integer, overlap_max::Integer
)
Returns true
if the overlap between x
and y
is within overlap_min
and overlap_max
, returns false
otherwise
Example
julia> using BioSequences, SequenceVariation
julia> ps1 = Pseudoread(1, 100, Haplotype(dna"A", Variation{LongDNA{4}, DNA}[]));
julia> ps2 = Pseudoread(50, 150, Haplotype(dna"A", Variation{LongDNA{4}, DNA}[]));
julia> overlap_inrange(ps1, ps2, 1, 500)
true
julia> overlap_inrange(ps1, ps2, 75, 500)
false
julia> overlap_inrange(ps1, ps2, 1, 25)
false
HapLink.overlapping_variations
— Methodoverlapping_variations(ps::Pseudoread, v::Haplotype)
Find all Variation
s within v
that are contained within the range defined by ps
HapLink.pseudoreads
— Methodpseudoreads(sam::Union{AbstractString,AbstractPath}, consensus::NucleotideSeq)
Create a set of Pseudoread
s from the alignments present in sam
when aligned against consensus
.
HapLink.simulate
— Methodsimulate(
pool::AbstractArray{Pseudoread},
refseq::BioSequence,
subcon::AbstractArray{Variation{S,T}};
reverse_order::Union{Bool,Nothing}=nothing,
overlap_min::Int64=0,
overlap_max::Int64=500,
) where {S,T} -> Union{Haplotype{S,T},Nothing}
Creates a new set of Pseudoread
s based on the pool
of real reads spliced spliced together using maximum likelihood simulation.
Arguments
pool::AbstractArray{Pseudoread}
: A set of (presumably) real sequencing reads that will serve as pieces of the new simulated reads.refseq::BioSequence
: The reference against which all of the sequences inpool
andsubcon
were alignedsubcon::AbstractArray{Variation{S,T}}
: The subconsensusVariation
s which are known to be both present inpool
and real (i.e., not due to sequencing errors)
Options
reverse_order::Union{Bool,Nothing}=nothing
: Whether the splicing should start at the beginning or end of the reference sequence. If left blank, will randomly decide.overlap_min::Int64=0
: The amount that reads frompool
are required to overlap in order to be spliced together. Can be negative, indicating that reads must be spaced apart by this amount, instead. -overlap_max::Int64=500
: The amount that reads from pool can overlap before being discarded. Likeoverlap_min
, can be negative.
Returns
A Haplotype{S,T}
representing the simulated read, or nothing
if the simulation encountered a dead end.
Extended help
The simulation procedure works by picking a read from pool
that contains the first (if reverse_order
is false, the last otherwise) variant from subcon
. The procedure examines every position in subcon
in ascending (or descending, if reverse_order == true
) order until the chosen read no longer carries that position. The simulation will then, at random, pick a read from pool
that contains the next position from subcon
and that also has matching variations at every position in subcon
as the previous read, as well as meets the overlap requirements. These two reads may differ in sites other than those in subcon
: it is assumed that appropriate variant calling has already been performed and that any variation in those sites is due to sequencing error. The procedure repeats for each new read from pool
until the simulation has simulated every position of subcon
, or until it reaches a dead-end and cannot find a read that matches every requirement.
HapLink.variations_match
— Methodvariations_match(reference::Pseudoread, query::Union{Pseudoread,Haplotype})
Returns true
if query
could be a read from reference
.
Additional Variation
s found within query
that are not present in reference
do not invalid a positive match result so long as none of those Variation
s contradicts
reference
. These Variation
s can either
- Extend beyond the length of
reference
, or - Exist as "sequencing errors" within the interval of
reference
On the other hand, every Variation
within the overlapping interval of reference
and query
that is present in reference
must also be found in query
. In other words, query
must have all of reference
(within the overlap), but reference
does not need all of query
.
This arrangment allows for the creation of bodies of matching Pseudoread
s, as done via simulate
.