PseudoRead

HapLink.PseudoreadType
Pseudoread(startpos::Integer, endpos::Integer, read::Haplotype)
Pseudoread(query::Union{SAM.Record,BAM.Record}, reference::NucleotideSeq)

A stand-in struct for a sequencing read (real or imaginary) represented by its position and differences relative to a reference sequence.

source
HapLink.contradictsMethod
contradicts(var::Variation{S,T}, ref::Haplotype{S,T}) where {S,T}

Returns true if var contains a modification incompatible with any of the Variations that make up ref.

Example

julia> using BioSequences, SequenceVariation

julia> ref = Haplotype(dna"GATTACA", [Variation(dna"GATTACA", "A2T"), Variation(dna"GATTACA", "Δ5-6")]);

julia> contradicts(Variation(dna"GATTACA", "A2C"), ref)
true

julia> contradicts(Variation(dna"GATTACA", "T4A"), ref)
false

julia> contradicts(Variation(dna"GATTACA", "Δ2-2"), ref)
true

julia> contradicts(Variation(dna"GATTACA", "Δ4-4"), ref)
false

julia> contradicts(Variation(dna"GATTACA", "2G"), ref)
false

julia> contradicts(Variation(dna"GATTACA", "4G"), ref)
false
source
HapLink.haplotypeMethod
haplotype(ps::Pseudoread)

Gets the differences between ps and its reference sequence as a Haplotype

source
HapLink.overlap_inrangeMethod
overlap_inrange(
    x::Pseudoread, y::Pseudoread, overlap_min::Integer, overlap_max::Integer
)

Returns true if the overlap between x and y is within overlap_min and overlap_max, returns false otherwise

Example

julia> using BioSequences, SequenceVariation

julia> ps1 = Pseudoread(1, 100, Haplotype(dna"A", Variation{LongDNA{4}, DNA}[]));

julia> ps2 = Pseudoread(50, 150, Haplotype(dna"A", Variation{LongDNA{4}, DNA}[]));

julia> overlap_inrange(ps1, ps2, 1, 500)
true

julia> overlap_inrange(ps1, ps2, 75, 500)
false

julia> overlap_inrange(ps1, ps2, 1, 25)
false
source
HapLink.pseudoreadsMethod
pseudoreads(sam::Union{AbstractString,AbstractPath}, consensus::NucleotideSeq)

Create a set of Pseudoreads from the alignments present in sam when aligned against consensus.

source
HapLink.simulateMethod
simulate(
    pool::AbstractArray{Pseudoread},
    refseq::BioSequence,
    subcon::AbstractArray{Variation{S,T}};
    reverse_order::Union{Bool,Nothing}=nothing,
    overlap_min::Int64=0,
    overlap_max::Int64=500,
) where {S,T} -> Union{Haplotype{S,T},Nothing}

Creates a new set of Pseudoreads based on the pool of real reads spliced spliced together using maximum likelihood simulation.

Arguments

  • pool::AbstractArray{Pseudoread}: A set of (presumably) real sequencing reads that will serve as pieces of the new simulated reads.
  • refseq::BioSequence: The reference against which all of the sequences in pool and subcon were aligned
  • subcon::AbstractArray{Variation{S,T}}: The subconsensus Variations which are known to be both present in pool and real (i.e., not due to sequencing errors)

Options

  • reverse_order::Union{Bool,Nothing}=nothing: Whether the splicing should start at the beginning or end of the reference sequence. If left blank, will randomly decide.
  • overlap_min::Int64=0: The amount that reads from pool are required to overlap in order to be spliced together. Can be negative, indicating that reads must be spaced apart by this amount, instead. -overlap_max::Int64=500: The amount that reads from pool can overlap before being discarded. Like overlap_min, can be negative.

Returns

A Haplotype{S,T} representing the simulated read, or nothing if the simulation encountered a dead end.

Extended help

The simulation procedure works by picking a read from pool that contains the first (if reverse_order is false, the last otherwise) variant from subcon. The procedure examines every position in subcon in ascending (or descending, if reverse_order == true) order until the chosen read no longer carries that position. The simulation will then, at random, pick a read from pool that contains the next position from subcon and that also has matching variations at every position in subcon as the previous read, as well as meets the overlap requirements. These two reads may differ in sites other than those in subcon: it is assumed that appropriate variant calling has already been performed and that any variation in those sites is due to sequencing error. The procedure repeats for each new read from pool until the simulation has simulated every position of subcon, or until it reaches a dead-end and cannot find a read that matches every requirement.

source
HapLink.variations_matchMethod
variations_match(reference::Pseudoread, query::Union{Pseudoread,Haplotype})

Returns true if query could be a read from reference.

Additional Variations found within query that are not present in reference do not invalid a positive match result so long as none of those Variations contradicts reference. These Variations can either

  1. Extend beyond the length of reference, or
  2. Exist as "sequencing errors" within the interval of reference

On the other hand, every Variation within the overlapping interval of reference and query that is present in reference must also be found in query. In other words, query must have all of reference (within the overlap), but reference does not need all of query.

This arrangment allows for the creation of bodies of matching Pseudoreads, as done via simulate.

source