PseudoRead
HapLink.Pseudoread — TypePseudoread(startpos::Integer, endpos::Integer, read::Haplotype)
Pseudoread(query::Union{SAM.Record,BAM.Record}, reference::NucleotideSeq)A stand-in struct for a sequencing read (real or imaginary) represented by its position and differences relative to a reference sequence.
HapLink.contradicts — Methodcontradicts(var::Variation{S,T}, ref::Haplotype{S,T}) where {S,T}Returns true if var contains a modification incompatible with any of the Variations that make up ref.
Example
julia> using BioSequences, SequenceVariation
julia> ref = Haplotype(dna"GATTACA", [Variation(dna"GATTACA", "A2T"), Variation(dna"GATTACA", "Δ5-6")]);
julia> contradicts(Variation(dna"GATTACA", "A2C"), ref)
true
julia> contradicts(Variation(dna"GATTACA", "T4A"), ref)
false
julia> contradicts(Variation(dna"GATTACA", "Δ2-2"), ref)
true
julia> contradicts(Variation(dna"GATTACA", "Δ4-4"), ref)
false
julia> contradicts(Variation(dna"GATTACA", "2G"), ref)
false
julia> contradicts(Variation(dna"GATTACA", "4G"), ref)
falseHapLink.haplotype — Methodhaplotype(ps::Pseudoread)Gets the differences between ps and its reference sequence as a Haplotype
HapLink.overlap_inrange — Methodoverlap_inrange(
x::Pseudoread, y::Pseudoread, overlap_min::Integer, overlap_max::Integer
)Returns true if the overlap between x and y is within overlap_min and overlap_max, returns false otherwise
Example
julia> using BioSequences, SequenceVariation
julia> ps1 = Pseudoread(1, 100, Haplotype(dna"A", Variation{LongDNA{4}, DNA}[]));
julia> ps2 = Pseudoread(50, 150, Haplotype(dna"A", Variation{LongDNA{4}, DNA}[]));
julia> overlap_inrange(ps1, ps2, 1, 500)
true
julia> overlap_inrange(ps1, ps2, 75, 500)
false
julia> overlap_inrange(ps1, ps2, 1, 25)
falseHapLink.overlapping_variations — Methodoverlapping_variations(ps::Pseudoread, v::Haplotype)Find all Variations within v that are contained within the range defined by ps
HapLink.pseudoreads — Methodpseudoreads(sam::Union{AbstractString,AbstractPath}, consensus::NucleotideSeq)Create a set of Pseudoreads from the alignments present in sam when aligned against consensus.
HapLink.simulate — Methodsimulate(
pool::AbstractArray{Pseudoread},
refseq::BioSequence,
subcon::AbstractArray{Variation{S,T}};
reverse_order::Union{Bool,Nothing}=nothing,
overlap_min::Int64=0,
overlap_max::Int64=500,
) where {S,T} -> Union{Haplotype{S,T},Nothing}Creates a new set of Pseudoreads based on the pool of real reads spliced spliced together using maximum likelihood simulation.
Arguments
pool::AbstractArray{Pseudoread}: A set of (presumably) real sequencing reads that will serve as pieces of the new simulated reads.refseq::BioSequence: The reference against which all of the sequences inpoolandsubconwere alignedsubcon::AbstractArray{Variation{S,T}}: The subconsensusVariations which are known to be both present inpooland real (i.e., not due to sequencing errors)
Options
reverse_order::Union{Bool,Nothing}=nothing: Whether the splicing should start at the beginning or end of the reference sequence. If left blank, will randomly decide.overlap_min::Int64=0: The amount that reads frompoolare required to overlap in order to be spliced together. Can be negative, indicating that reads must be spaced apart by this amount, instead. -overlap_max::Int64=500: The amount that reads from pool can overlap before being discarded. Likeoverlap_min, can be negative.
Returns
A Haplotype{S,T} representing the simulated read, or nothing if the simulation encountered a dead end.
Extended help
The simulation procedure works by picking a read from pool that contains the first (if reverse_order is false, the last otherwise) variant from subcon. The procedure examines every position in subcon in ascending (or descending, if reverse_order == true) order until the chosen read no longer carries that position. The simulation will then, at random, pick a read from pool that contains the next position from subcon and that also has matching variations at every position in subcon as the previous read, as well as meets the overlap requirements. These two reads may differ in sites other than those in subcon: it is assumed that appropriate variant calling has already been performed and that any variation in those sites is due to sequencing error. The procedure repeats for each new read from pool until the simulation has simulated every position of subcon, or until it reaches a dead-end and cannot find a read that matches every requirement.
HapLink.variations_match — Methodvariations_match(reference::Pseudoread, query::Union{Pseudoread,Haplotype})Returns true if query could be a read from reference.
Additional Variations found within query that are not present in reference do not invalid a positive match result so long as none of those Variations contradicts reference. These Variations can either
- Extend beyond the length of
reference, or - Exist as "sequencing errors" within the interval of
reference
On the other hand, every Variation within the overlapping interval of reference and query that is present in reference must also be found in query. In other words, query must have all of reference (within the overlap), but reference does not need all of query.
This arrangment allows for the creation of bodies of matching Pseudoreads, as done via simulate.