API

variant processing on the reference

class variantpost.Variant(chrom: str, pos: int, ref: str, alt: str, reference: FastaFile, window: int = 50)

This class abstracts a variant relative to a linear reference genome. Two Variant objects are equal if they are identical in normalized form, that is, if they represent equivalent alignments.

Parameters:

chrom (string) – chromosome name.
pos (integer) – 1-based genomic position.
ref (string) – VCF-style reference allele.
alt (string) – VCF-style alternative allele.
reference (pysam.FastaFile) – reference FASTA file supplied as a pysam.FastaFile object.

Raises:

ValueError – if the input alleles contain letters other than A, C, G, T and N.
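The allele check described under Raises can be pictured with a small stand-alone sketch. The helper name check_alleles is hypothetical and not part of variantpost, and treating lowercase letters as valid is an assumption the documentation does not confirm.

```python
VALID_BASES = frozenset("ACGTN")


def check_alleles(ref: str, alt: str) -> None:
    """Raise ValueError if either allele has a letter outside A, C, G, T, N.

    Hypothetical helper for illustration only. Case-insensitive matching
    is an assumption, not documented behavior.
    """
    for name, allele in (("ref", ref), ("alt", alt)):
        invalid = set(allele.upper()) - VALID_BASES
        if invalid:
            raise ValueError(
                f"{name} allele {allele!r} contains invalid letters: {sorted(invalid)}"
            )


check_alleles("A", "ATG")  # valid alleles pass silently
```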

count_repeats(by_repeat_unit=True)

counts indel repeats in the flanking reference sequences. The search window is defined by left_flank() and right_flank().

Parameters: by_repeat_unit (bool) – count by the smallest tandem repeat unit. For example, the indel sequence “ATATATAT” has tandem units “ATAT” and “AT”. Occurrences of “AT” are counted if True (default).
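The idea of the smallest tandem repeat unit can be sketched in plain Python. Both function names below are hypothetical, and counting only contiguous copies directly adjacent to the indel site is an assumption. The library may search the flanks differently.

```python
def smallest_repeat_unit(seq: str) -> str:
    """Return the shortest unit whose tandem repetition reproduces seq."""
    n = len(seq)
    for size in range(1, n + 1):
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return seq[:size]
    return seq  # only reached for the empty string


def count_tandem_copies(unit: str, left_flank: str, right_flank: str) -> int:
    """Count contiguous copies of unit that touch the indel site.

    Assumption: copies are counted leftward from the end of left_flank
    and rightward from the start of right_flank.
    """
    if not unit:
        return 0
    k = len(unit)
    count = 0
    i = len(left_flank)
    while i >= k and left_flank[i - k:i] == unit:
        count += 1
        i -= k
    j = 0
    while right_flank[j:j + k] == unit:
        count += 1
        j += k
    return count


unit = smallest_repeat_unit("ATATATAT")  # "AT"
copies = count_tandem_copies(unit, "GGATAT", "ATATC")  # 2 on each side
```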

property indel_seq: returns the inserted/deleted sequence for non-complex indels. None for substitutions.

property is_del: bool: True for deletions. False for insertions or substitutions.

property is_indel: bool: True for insertions or deletions. False otherwise.

property is_ins: bool: True for insertions. False for deletions or substitutions.

is_non_complex_indel(): returns True only if non-complex indel (False if complex indel or substitution).

property is_normalized: bool: True if left-aligned and the allele representations are minimal.

property is_simple_indel: bool: True for indel that is not complex (co-occurrence of insertion and deletion). False for complex indels or substitutions.
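The distinction between insertions, deletions, substitutions and complex indels drawn by the properties above can be illustrated from the alleles alone. This is a minimal sketch with a hypothetical function name. It assumes that shared leading and trailing bases carry no information about the variant type, and it is not the library's implementation.

```python
def classify_alleles(ref: str, alt: str) -> str:
    """Classify a ref/alt pair after trimming shared flanking bases."""
    # Trim the shared suffix, then the shared prefix.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]

    if not ref and not alt:
        return "no change"
    if not ref:
        return "insertion"   # only inserted bases remain
    if not alt:
        return "deletion"    # only deleted bases remain
    if len(ref) == len(alt):
        return "substitution"
    return "complex indel"   # insertion and deletion co-occur


kind = classify_alleles("ATG", "AC")  # TG replaced by C
```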

left_flank(window=50, normalize=False)

extracts the left-flanking reference sequence. See also right_flank().

Parameters:

window (integer) – extract the reference sequence [variant_pos - window, variant_pos].
normalize (bool) – if True, the normalized indel position is used as the end of the flanking sequence.

left_pos() → int: returns a left-aligned genomic position.

normalize(inplace: bool = False) → Variant | None

left-aligns the genomic position and minimizes the allele representation.

Parameters: inplace (bool) – normalizes this (self) Variant object if True. Otherwise, returns a normalized copy of the object. Default to False.
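Left-alignment with minimal alleles can be sketched on plain strings. The function below is a hypothetical stand-in that follows the commonly used trim-and-extend procedure. It takes the chromosome as a Python string rather than a pysam.FastaFile and assumes the variant does not sit at the very start of the chromosome. It is not the library's own code.

```python
def normalize_alleles(pos: int, ref: str, alt: str, genome: str):
    """Left-align and minimize a variant; pos is 1-based."""
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, prepend the preceding reference base.
        if (not ref or not alt) and pos > 1:
            pos -= 1
            base = genome[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


genome = "GCACACAT"
# Two representations of the same CA deletion inside the CA repeat.
a = normalize_alleles(5, "ACA", "A", genome)
b = normalize_alleles(3, "ACA", "A", genome)
# both calls give (1, 'GCA', 'G')
```

Because both inputs reduce to the same normalized tuple, comparing normalized forms is enough to decide that they describe the same event, which mirrors the equality rule stated for Variant.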

query_vcf(vcf: VariantFile, chrom_name: str | None = None, match_by_equivalence: bool = True) → List[MatchedRecord]

returns a list of MatchedRecord.

Parameters:

vcf (pysam.VariantFile) – VCF file to be queried. Supply as a pysam.VariantFile object. The VCF file must be indexed.
chrom_name (string) – specify an alias chromosome name if the VCF file uses a chromosome nomenclature different from the reference genome in Variant. If not specified (default), the nomenclature in the FASTA file in Variant will be used.
match_by_equivalence (bool) – queries the VCF records by normalization if True (default). Otherwise, positionally overlapping records will be returned.

MatchedRecord is a namedtuple with the following fields:

chrom - VCF CHROM field.
pos - VCF POS field.
id - VCF ID field.
ref - VCF REF field.
alts - VCF ALT field as tuple. May contain multiple alleles.
qual - VCF QUAL field.
filter - VCF FILTER field.
info - VCF INFO field. Values are accessible by keys defined in the header.
format - VCF FORMAT field.
samples - VCF genotype field. Genotypes are accessible by using sample names as key.
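A short sketch of how the returned records might be consumed follows. The helper name summarize_matches is hypothetical. It relies only on the query_vcf signature and the MatchedRecord fields listed above, and the file paths in the commented usage are placeholders.

```python
def summarize_matches(variant, vcf):
    """Return (chrom, pos, id, ref, alts) for each record matched to variant.

    variant is expected to be a variantpost.Variant and vcf an indexed
    pysam.VariantFile. Only documented MatchedRecord fields are read.
    """
    records = variant.query_vcf(vcf, match_by_equivalence=True)
    return [(rec.chrom, rec.pos, rec.id, rec.ref, rec.alts) for rec in records]


# Usage sketch (placeholder paths, not executed here):
# import pysam
# from variantpost import Variant
# reference = pysam.FastaFile("reference.fa")
# vcf = pysam.VariantFile("known_variants.vcf.gz")
# v = Variant("chr1", 12345, "A", "ATG", reference)
# for row in summarize_matches(v, vcf):
#     print(row)
```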

right_flank(window=50, normalize=False)

extracts the right-flanking reference sequence. See also left_flank().

Parameters:

window (integer) – extract the reference sequence [variant_end_pos, variant_end_pos + window].
normalize (bool) – if True, the normalized indel position is used as the start of the flanking sequence.
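The windows used by left_flank() and right_flank() can be pictured on a plain string. The function below is a hypothetical illustration. Whether the interval endpoints are inclusive, and whether the base at the variant position belongs to the left flank, are assumptions made here for concreteness.

```python
def extract_flanks(genome: str, pos: int, ref: str, window: int = 50):
    """Return (left, right) reference flanks around a variant.

    pos is 1-based. Assumptions: the left flank is up to `window` bases
    ending at pos, and the right flank is up to `window` bases starting
    right after the last REF base.
    """
    left = genome[max(0, pos - window):pos]
    end = pos + len(ref) - 1          # 1-based position of the last REF base
    right = genome[end:end + window]
    return left, right


left, right = extract_flanks("GCACACAT", pos=3, ref="ACA", window=2)
```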

right_pos() → int: returns the variant-end position after right-alignment.

variant processing in alignment file

class variantpost.VariantAlignment(variant, bam, second_bam=None, chrom_name=None, exclude_duplicates=True, mapping_quality_threshold=1, base_quality_threshold=20, low_quality_base_rate_threshold=0.3, downsample_threshold=-1, match_score=3, mismatch_penalty=2, gap_open_penalty=3, gap_extension_penalty=1, kmer_size=24, dimer_window=6, local_threshold=20)

This class accepts the target variant as Variant and the BAM file as pysam.AlignmentFile to process the variant alignment.

Parameters:

variant (Variant) – Variant object representing the target variant.
bam (pysam.AlignmentFile) – BAM file supplied as pysam.AlignmentFile object.
second_bam (pysam.AlignmentFile) – A second BAM file for paired analysis. Default: None.
chrom_name (string) – Specify an alias chromosome name if the BAM file uses a chromosome nomenclature different from the reference used in Variant. If not specified (default), the nomenclature in the reference will be used.
mapping_quality_threshold (integer) – Minimum mapping quality for a read to be analyzed. Default to 1.
base_quality_threshold (integer) – Non-reference base-calls with a Phred-scale quality score below the threshold are labeled low quality. Default to 20.
low_quality_base_rate_threshold (float) – Reads are not realigned if the fraction of bases with quality below base_quality_threshold exceeds this threshold. Default to 0.3.
downsample_threshold (integer) – Downsample to the threshold if the coverage at the input locus is > threshold. Default to -1.
match_score (integer) – Score for matched bases in realignment. Default to 3.
mismatch_penalty (integer) – Penalty for mismatched bases in realignment. Default to 2.
gap_open_penalty (integer) – Penalty to create gaps in realignment. Default to 3.
gap_extension_penalty (integer) – Penalty to extend gaps in realignment. Default to 1.
kmer_size (integer) – Kmer size used to search reads with input Variant. Default to 24.
local_threshold (integer) – Non-reference patterns further than this threshold are not considered as part of the target event. Default to 20.
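To see how the four realignment scoring parameters interact, the sketch below scores an already aligned pair of sequences with the default values from the signature. The function name is hypothetical. Charging gap_open_penalty for the first gap position and gap_extension_penalty for each additional position is an assumption, since affine-gap conventions differ between aligners.

```python
def alignment_score(
    aligned_ref: str,
    aligned_read: str,
    match_score: int = 3,
    mismatch_penalty: int = 2,
    gap_open_penalty: int = 3,
    gap_extension_penalty: int = 1,
) -> int:
    """Score two equal-length gapped strings, with '-' marking a gap."""
    if len(aligned_ref) != len(aligned_read):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for a, b in zip(aligned_ref, aligned_read):
        if a == "-" or b == "-":
            side = "ref" if a == "-" else "read"
            # Assumed convention: open once, then extend per extra position.
            score -= gap_extension_penalty if side == gap_side else gap_open_penalty
            gap_side = side
        else:
            score += match_score if a == b else -mismatch_penalty
            gap_side = None
    return score


score = alignment_score("ACGTA", "AC--A")  # 3 matches, one gap of length 2
```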

count_alleles()

returns AlleleCount as namedtuple of read counts. AlleleCount has the following fields accessible by attribute:

s - count of read names supporting the variant.

n - count of read names not supporting the variant.

u - count of read names undetermined to be supporting/non-supporting.

Strand breakdowns are also available by:

s_fw - count of forward reads supporting the variant.

s_rv - count of reverse reads supporting the variant.

…

To find the read names:

s_names - list of supporting read names.

s_fw_names - list of forward supporting read names.

s_rv_names - list of reverse supporting read names.

…

For paired analysis, PairedAlleleCount is returned and has the following fields:

first - AlleleCount for the first BAM file.

second - AlleleCount for the second BAM file.
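One common use of these counts is a variant allele fraction. The helpers below are hypothetical and read only the documented s, n, first and second fields. Leaving undetermined reads (u) out of the denominator is a design choice of this sketch, not something the library prescribes.

```python
def variant_allele_fraction(counts):
    """Return s / (s + n), or None if no read was determined.

    counts is expected to be an AlleleCount as returned by count_alleles().
    Undetermined reads (u) are ignored here by choice.
    """
    determined = counts.s + counts.n
    if determined == 0:
        return None
    return counts.s / determined


def paired_allele_fractions(paired):
    """Return the fraction for each BAM of a PairedAlleleCount."""
    return (
        variant_allele_fraction(paired.first),
        variant_allele_fraction(paired.second),
    )


# Usage sketch (placeholder path, not executed here):
# import pysam
# from variantpost import VariantAlignment
# bam = pysam.AlignmentFile("sample.bam")
# counts = VariantAlignment(v, bam).count_alleles()  # v is a Variant
# print(variant_allele_fraction(counts))
```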

phase2complex(cis: bool = True, **kwargs)

returns Variant representing a phased target variant.

Parameters:

cis (bool)
base_quality_threshold (int)
max_common_substr_len (int)
match_penalty_for_phasing (float)