API
variant processing on the reference
- class variantpost.Variant(chrom: str, pos: int, ref: str, alt: str, reference: FastaFile, window: int = 50)
This class abstracts a variant relative to a linear reference genome. Two Variant objects are equal if they are identical in the normalized form (equivalent alignments).
- Parameters:
chrom (string) – chromosome name.
pos (integer) – 1-based genomic position.
ref (string) – VCF-style reference allele.
alt (string) – VCF-style alternative allele.
reference (pysam.FastaFile) – reference FASTA file supplied as a pysam.FastaFile object.
- Raises:
ValueError – if the input alleles contain letters other than A, C, G, T and N.
- count_repeats(by_repeat_unit=True)
counts indel repeats in the flanking reference sequences. The search window is defined by left_flank() and right_flank().
- Parameters:
by_repeat_unit (bool) – count by the smallest tandem repeat unit. For example, the indel sequence “ATATATAT” has tandem units “ATAT” and “AT”. The occurrence of “AT” will be counted if True (default).
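The idea of counting by the smallest tandem repeat unit can be sketched in plain Python. This is an illustration only, not variantpost's implementation: the helper names `smallest_repeat_unit`, `count_leading` and `count_trailing` are hypothetical, and how the library combines the left and right counts is not stated here.

```python
def smallest_repeat_unit(seq: str) -> str:
    """Return the shortest unit whose tandem repetition spells seq."""
    n = len(seq)
    for size in range(1, n + 1):
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return seq[:size]
    return seq


def count_leading(unit: str, seq: str) -> int:
    """Count consecutive copies of unit at the start of seq (e.g. a right flank)."""
    if not unit:
        raise ValueError("unit must be non-empty")
    count = 0
    while seq.startswith(unit, count * len(unit)):
        count += 1
    return count


def count_trailing(unit: str, seq: str) -> int:
    """Count consecutive copies of unit at the end of seq (e.g. a left flank)."""
    return count_leading(unit[::-1], seq[::-1])


unit = smallest_repeat_unit("ATATATAT")      # "AT" rather than "ATAT"
right_copies = count_leading(unit, "ATATGC")  # copies directly right of the indel
left_copies = count_trailing(unit, "GGCATAT")  # copies directly left of the indel
```

With by_repeat_unit=False, the analogous count would presumably use the full indel sequence in place of `unit`.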
- property indel_seq
returns the inserted/deleted sequence for non-complex indels. None for substitutions.
- property is_del: bool
True for deletions. False for insertions or substitutions.
- property is_indel: bool
True for insertions or deletions. False otherwise.
- property is_ins: bool
True for insertions. False for deletions or substitutions.
- is_non_complex_indel()
returns True only for a non-complex indel (False for a complex indel or a substitution).
- property is_normalized: bool
True if left-aligned and the allele representations are minimal.
- property is_simple_indel: bool
True for an indel that is not complex, where a complex indel is a co-occurrence of insertion and deletion. False for complex indels or substitutions.
- left_flank(window=50, normalize=False)
extracts the left-flanking reference sequence. See also right_flank().
- Parameters:
window (integer) – extract the reference sequence [variant_pos - window, variant_pos].
normalize (bool) – if True, the normalized indel position is used as the end of the flanking sequence.
- left_pos() int
returns a left-aligned genomic position.
- normalize(inplace: bool = False) Variant | None
left-aligns the genomic position and minimizes the allele representation.
- Parameters:
inplace (bool) – normalizes this (self) Variant object if True. Otherwise, returns a normalized copy of the object. Default to False.
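To make "left-aligned with minimal alleles" concrete, here is a minimal sketch of the standard normalization procedure on an in-memory sequence. It is not variantpost's code: the function `normalize_alleles` and the toy sequence are hypothetical, and the chromosome is assumed to be a plain Python string with 1-based positions.

```python
def normalize_alleles(chrom_seq: str, pos: int, ref: str, alt: str):
    """Left-align a VCF-style variant and trim its alleles to minimal form.

    chrom_seq is the chromosome sequence and pos is 1-based.
    """
    if not ref or not alt:
        raise ValueError("VCF-style alleles must be non-empty")

    # Trim shared rightmost bases; when an allele becomes empty,
    # extend both alleles one base to the left.
    while ref[-1] == alt[-1] and (pos > 1 or min(len(ref), len(alt)) > 1):
        ref, alt = ref[:-1], alt[:-1]
        if not ref or not alt:
            pos -= 1
            base = chrom_seq[pos - 1]
            ref, alt = base + ref, base + alt

    # Trim shared leftmost bases while both alleles keep at least one base.
    while min(len(ref), len(alt)) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt


seq = "GCATATATG"
# Two representations of the same "AT" deletion inside the AT repeat.
a = normalize_alleles(seq, 6, "TAT", "T")
b = normalize_alleles(seq, 4, "TAT", "T")
```

Both calls reduce to the same position and alleles, which is the sense in which two differently written variants compare as equal after normalization.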
- query_vcf(vcf: VariantFile, chrom_name: str | None = None, match_by_equivalence: bool = True) List[MatchedRecord]
returns a list of MatchedRecord.
- Parameters:
vcf (pysam.VariantFile) – VCF file to be queried. Supply as a pysam.VariantFile object. The VCF file must be indexed.
chrom_name (string) – specify an alias chromosome name if the VCF file uses a chromosome nomenclature different from the reference genome in Variant. If not specified (default), the nomenclature in the FASTA file in Variant will be used.
match_by_equivalence (bool) – queries the VCF records by normalization if True (default). Otherwise, positionally overlapping records will be returned.
MatchedRecord is a namedtuple with the following fields:
chrom - VCF CHROM field.
pos - VCF POS field.
id - VCF ID field.
ref - VCF REF field.
alts - VCF ALT field as tuple. May contain multiple alleles.
qual - VCF QUAL field.
filter - VCF FILTER field.
info - VCF INFO field. Values are accessible by keys defined in the header.
format - VCF FORMAT field.
samples - VCF genotype field. Genotypes are accessible by using sample names as key.
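Because MatchedRecord is a namedtuple, its fields can be read by attribute. The sketch below builds a stand-in namedtuple with the documented field names and filters a list of records. All values are made up, and the plain Python containers used for filter, info and samples are an assumption; the objects returned by the library may be pysam types instead.

```python
from collections import namedtuple

# Stand-in with the documented field names; the real class comes from variantpost.
MatchedRecord = namedtuple(
    "MatchedRecord",
    ["chrom", "pos", "id", "ref", "alts", "qual", "filter", "info", "format", "samples"],
)

# Hypothetical records; every value here is invented for illustration.
records = [
    MatchedRecord("chr1", 100, "rs_example_1", "CAT", ("C",), 50.0,
                  ["PASS"], {"DP": 80}, ["GT"], {"sample1": "0/1"}),
    MatchedRecord("chr1", 100, ".", "CAT", ("C", "CATAT"), 12.0,
                  ["LowQual"], {"DP": 9}, ["GT"], {"sample1": "0/0"}),
]


def passing(recs):
    """Keep records whose FILTER contains PASS."""
    return [r for r in recs if "PASS" in r.filter]


hits = passing(records)
for r in hits:
    # alts is a tuple because a VCF record may list several ALT alleles.
    print(r.chrom, r.pos, r.ref, r.alts, r.info["DP"], r.samples["sample1"])
```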
- right_flank(window=50, normalize=False)
extracts the right-flanking reference sequence. See also left_flank().
- Parameters:
window (integer) – extract the reference sequence [variant_end_pos, variant_end_pos + window].
normalize (bool) – if True, the normalized indel position is used as the start of the flanking sequence.
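The two flank windows can be pictured with string slicing on an in-memory sequence. This is a sketch under assumptions: the helpers `left_flank_of` and `right_flank_of` are hypothetical, positions are 1-based, the left flank is taken to end at the variant position (inclusive), and the right flank is taken to start just after the variant end. The exact boundary convention used by variantpost is not specified above.

```python
def left_flank_of(chrom_seq: str, pos: int, window: int = 50) -> str:
    """Up to `window` bases ending at 1-based `pos` (inclusive; assumed convention)."""
    start = max(0, pos - window)  # clamp at the chromosome start
    return chrom_seq[start:pos]


def right_flank_of(chrom_seq: str, end_pos: int, window: int = 50) -> str:
    """Up to `window` bases after 1-based `end_pos` (assumed convention)."""
    return chrom_seq[end_pos:end_pos + window]  # slicing clamps at the end


seq = "ACGTACGTAC"
lt = left_flank_of(seq, pos=5, window=3)
rt = right_flank_of(seq, end_pos=6, window=3)
```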
- right_pos() int
returns the variant-end position after right-alignment.
variant processing in alignment file
- class variantpost.VariantAlignment(variant, bam, second_bam=None, chrom_name=None, exclude_duplicates=True, mapping_quality_threshold=1, base_quality_threshold=20, low_quality_base_rate_threshold=0.3, downsample_threshold=-1, match_score=3, mismatch_penalty=2, gap_open_penalty=3, gap_extension_penalty=1, kmer_size=24, dimer_window=6, local_threshold=20)
This class accepts the target variant as Variant and the BAM file as pysam.AlignmentFile to process the variant alignment.
- Parameters:
variant (Variant) – Variant object representing the target variant.
bam (pysam.AlignmentFile) – BAM file supplied as a pysam.AlignmentFile object.
second_bam (pysam.AlignmentFile) – A second BAM file for paired analysis. Default: None.
chrom_name (string) – Specify an alias chromosome name if the BAM file uses a chromosome nomenclature different from the reference used in Variant. If not specified (default), the nomenclature in the reference will be used.
mapping_quality_threshold (integer) – Minimum mapping quality for a read to be analyzed. Default to 1.
base_quality_threshold (integer) – Non-reference base-calls with a Phred-scale quality score below the threshold are labeled low quality. Default to 20.
low_quality_base_rate_threshold (float) – Reads are not realigned if the fraction of bases with quality below base_quality_threshold exceeds this threshold. Default to 0.3.
downsample_threshold (integer) – Downsample to the threshold if the coverage at the input locus exceeds it. The class signature sets the default to -1.
match_score (integer) – Score for matched bases in realignment. Default to 3.
mismatch_penalty (integer) – Penalty for mismatched bases in realignment. Default to 2.
gap_open_penalty (integer) – Penalty to create gaps in realignment. Default to 3.
gap_extension_penalty (integer) – Penalty to extend gaps in realignment. Default to 1.
kmer_size (integer) – Kmer size used to search reads with input Variant. Default to 24.
local_threshold (integer) – Non-reference patterns further than this threshold are not considered as part of the target event. Default to 20.
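As a rough guide to how the four realignment scoring parameters interact, the sketch below scores an alignment under one common affine-gap convention, where each gap costs gap_open_penalty plus gap_extension_penalty per gapped base. The function `alignment_score` is hypothetical, and whether variantpost charges the extension penalty on the first gapped base is an assumption not confirmed above.

```python
def alignment_score(
    n_match: int,
    n_mismatch: int,
    gap_lengths,
    match_score: int = 3,
    mismatch_penalty: int = 2,
    gap_open_penalty: int = 3,
    gap_extension_penalty: int = 1,
) -> int:
    """Score an alignment summary under an assumed affine-gap convention."""
    # Each gap pays the opening penalty once, plus extension per gapped base.
    gap_cost = sum(
        gap_open_penalty + gap_extension_penalty * length
        for length in gap_lengths
    )
    return match_score * n_match - mismatch_penalty * n_mismatch - gap_cost


# 20 matches, 1 mismatch, and one 2-base gap with the default parameters.
score = alignment_score(20, 1, [2])
```

Under this convention, one long gap is cheaper than several short gaps of the same total length, which tends to favor a single contiguous indel in the realignment.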
- count_alleles()
returns AlleleCount as a namedtuple of read counts. AlleleCount has the following fields accessible by attribute:
s - count of read names supporting the variant.
n - count of read names not supporting the variant.
u - count of read names that could not be determined as supporting or non-supporting.
Strand breakdowns are also available by:
s_fw - count of forward reads supporting the variant.
s_rv - count of reverse reads supporting the variant.
…
To find the read names:
s_names - list of supporting read names.
s_fw_names - list of forward supporting read names.
s_rv_names - list of reverse supporting read names.
…
For paired analysis, PairedAlleleCount is returned and has the following fields:
first - AlleleCount for the first BAM file.
second - AlleleCount for the second BAM file.
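A typical downstream use of these counts is a variant allele fraction. The sketch below uses stand-in namedtuples restricted to fields named above (s, n, u, s_fw, s_rv, first, second); the real objects carry more fields. The helper `allele_fraction` and all numbers are hypothetical, and excluding undetermined reads from the denominator is a design choice of this example, not something the library prescribes.

```python
from collections import namedtuple

# Stand-ins limited to documented fields; not the library's own classes.
AlleleCount = namedtuple("AlleleCount", ["s", "n", "u", "s_fw", "s_rv"])
PairedAlleleCount = namedtuple("PairedAlleleCount", ["first", "second"])


def allele_fraction(count):
    """Fraction of determined reads that support the variant, or None if none."""
    determined = count.s + count.n  # undetermined reads (u) are left out
    if determined == 0:
        return None
    return count.s / determined


# Hypothetical counts for a paired analysis.
paired = PairedAlleleCount(
    first=AlleleCount(s=12, n=36, u=2, s_fw=7, s_rv=5),
    second=AlleleCount(s=0, n=40, u=1, s_fw=0, s_rv=0),
)

first_fraction = allele_fraction(paired.first)
second_fraction = allele_fraction(paired.second)
```

Comparing s_fw with s_rv for the same count gives a quick check for strand imbalance among the supporting reads.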