|
Methods defined here:
- __del__(self)
- Clean up resources.
- __init__(self, genoref_file=None, nrvk_file=None, rsvk_file=None, vkrs_file=None)
- Instantiate a new VariantKey object.
Load the support files if specified.
Parameters
----------
genoref_file : string
Name and path of the binary file containing the genome reference (fasta.bin).
This file can be generated from a FASTA file using the resources/tools/fastabin.sh script.
nrvk_file : string
Name and path of the binary file containing the non-reversible-VariantKey mapping (nrvk.bin).
This file can be generated from a normalized VCF file using the resources/tools/nrvk.sh script.
rsvk_file : string
Name and path of the binary file containing the rsID to VariantKey mapping (rsvk.bin).
This file can be generated using the resources/tools/rsvk.sh script.
vkrs_file : string
Name and path of the binary file containing the VariantKey to rsID mapping (vkrs.bin).
This file can be generated using the resources/tools/vkrs.sh script.
- are_overlapping_region_regionkey(self, chrom, startpos, endpos, rk)
- Check if a region and a regionkey are overlapping.
Parameters
----------
chrom : uint8
Region A chromosome code.
startpos : uint32
Region A start position.
endpos : uint32
Region A end position (startpos + region length).
rk : uint64
RegionKey B.
Returns
-------
uint8 :
1 if the regions overlap, 0 otherwise.
- are_overlapping_regionkeys(self, rka, rkb)
- Check if two regionkeys are overlapping.
Parameters
----------
rka : uint64
RegionKey A.
rkb : uint64
RegionKey B.
Returns
-------
uint8 :
1 if the regions overlap, 0 otherwise.
- are_overlapping_regions(self, a_chrom, a_startpos, a_endpos, b_chrom, b_startpos, b_endpos)
- Check if two regions are overlapping.
Parameters
----------
a_chrom : uint8
Region A chromosome code.
a_startpos : uint32
Region A start position.
a_endpos : uint32
Region A end position (startpos + region length).
b_chrom : uint8
Region B chromosome code.
b_startpos : uint32
Region B start position.
b_endpos : uint32
Region B end position (startpos + region length).
Returns
-------
uint8 :
1 if the regions overlap, 0 otherwise.
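The overlap test described above can be sketched in pure Python. This is a minimal illustration, not the library's implementation: the helper name `regions_overlap` is hypothetical, and it assumes that regions are half-open (`endpos` is `startpos + region length`, as documented) and that regions which merely touch do not count as overlapping.

```python
def regions_overlap(a_chrom, a_startpos, a_endpos, b_chrom, b_startpos, b_endpos):
    """Return 1 if two half-open regions [startpos, endpos) overlap, 0 otherwise."""
    same_chrom = a_chrom == b_chrom
    # Two half-open intervals intersect when each one starts before the other ends.
    return int(same_chrom and a_startpos < b_endpos and b_startpos < a_endpos)


print(regions_overlap(1, 100, 200, 1, 150, 250))  # 1: partial overlap
print(regions_overlap(1, 100, 200, 1, 200, 300))  # 0: adjacent, under the half-open assumption
print(regions_overlap(1, 100, 200, 2, 150, 250))  # 0: different chromosomes
```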
- are_overlapping_variantkey_regionkey(self, vk, rk)
- Check if variantkey and regionkey are overlapping.
Parameters
----------
vk : uint64
VariantKey.
rk : uint64
RegionKey.
Returns
-------
uint8 :
1 if the regions overlap, 0 otherwise.
- check_reference(self, chrom, pos, ref)
- Check if the reference allele matches the reference genome data.
Parameters
----------
chrom : uint8
Encoded Chromosome number (see encode_chrom).
pos : uint32
Position. The reference position, with the first base having position 0.
ref : string
Reference allele. String containing a sequence of nucleotide letters.
Returns
-------
int :
non-negative number in case of success, negative in case of error:
0 the reference allele matches the reference genome;
1 the reference allele is inconsistent with the genome reference
(i.e. it contains nucleotide letters other than A, C, G and T);
-1 the reference allele does not match the reference genome;
-2 the reference allele is longer than the genome reference sequence.
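A caller will usually want to turn the return code into something readable. The sketch below does that using only the codes listed above; the names `CHECK_REFERENCE_MESSAGES` and `describe_check_reference` are hypothetical helpers, not part of the module.

```python
# Return codes as documented for check_reference.
CHECK_REFERENCE_MESSAGES = {
    0: "reference allele matches the genome reference",
    1: "reference allele is inconsistent with the genome reference",
    -1: "reference allele does not match the genome reference",
    -2: "reference allele is longer than the genome reference sequence",
}


def describe_check_reference(code):
    """Return (ok, message); ok is True for the non-negative (success) codes."""
    message = CHECK_REFERENCE_MESSAGES.get(code, "unknown return code")
    return code >= 0, message


print(describe_check_reference(0))
print(describe_check_reference(-1))
```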
- close(self)
- Close all input files
- compare_variantkey_chrom(self, vka, vkb)
- Compares two VariantKeys by chromosome only.
Parameters
----------
vka : uint64
The first VariantKey to be compared.
vkb : uint64
The second VariantKey to be compared.
Returns
-------
int :
-1 if the first chromosome is smaller than the second,
0 if they are equal and 1 if the first is greater than the second.
- compare_variantkey_chrom_pos(self, vka, vkb)
- Compares two VariantKeys by chromosome and position.
Parameters
----------
vka : uint64
The first VariantKey to be compared.
vkb : uint64
The second VariantKey to be compared.
Returns
-------
int :
-1 if the first CHROM+POS is smaller than the second,
0 if they are equal and 1 if the first is greater than the second.
- decode_chrom(self, code)
- Decode the chromosome numerical code.
Parameters
----------
code : uint8
CHROM code.
Returns
-------
'|S2' :
Chromosome string
- decode_refalt(self, code)
- Decode the 32 bit REF+ALT code if reversible
(if it has 11 or fewer bases in total and only contains ACGT letters).
Parameters
----------
code : uint32
REF+ALT code
Returns
-------
tuple :
- '|S256' : REF
- '|S256' : ALT
- uint8 : REF length
- uint8 : ALT length
- decode_region_strand(self, strand)
- Decode the strand direction code (0 > 0, 1 > +1, 2 > -1).
Parameters
----------
strand : uint8
Strand code.
Returns
-------
int16 :
Strand direction.
- decode_regionkey(self, rk)
- Decode a RegionKey code and return the components as a regionkey_t structure.
Parameters
----------
rk : uint64
RegionKey code.
Returns
-------
tuple :
- uint8 : encoded chromosome
- uint32 : start position
- uint32 : end position
- uint8 : encoded strand
- decode_string_id(self, esid)
- Decode the encoded string ID.
This function is the reverse of encode_string_id.
The string is always returned in uppercase mode.
Parameters
----------
esid : uint64
Encoded string ID code.
Returns
-------
tuple :
- '|S23' : STRING
- uint8 : STRING length
- decode_variantkey(self, vk)
- Decode a VariantKey code and return the components.
Parameters
----------
vk : uint64
VariantKey code.
Returns
-------
tuple :
- uint8 : CHROM code
- uint32 : POS
- uint32 : REF+ALT code
- encode_chrom(self, chrom)
- Returns chromosome numerical encoding.
Parameters
----------
chrom : string
Chromosome. An identifier from the reference genome, no white-space permitted.
Returns
-------
uint8 :
CHROM code
- encode_refalt(self, ref, alt)
- Returns reference+alternate numerical encoding.
Parameters
----------
ref : string
Reference allele.
String containing a sequence of nucleotide letters.
The value in the pos field refers to the position of the first nucleotide in the String.
Characters must be A-Z, a-z or *
alt : string
Alternate non-reference allele string. Characters must be A-Z, a-z or *
Returns
-------
uint32 :
code
- encode_region_strand(self, strand)
- Encode the strand direction (-1 > 2, 0 > 0, +1 > 1).
Parameters
----------
strand : int16
Strand direction (-1, 0, +1).
Returns
-------
uint8 :
Strand code.
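The strand mapping stated above (-1 > 2, 0 > 0, +1 > 1) and its inverse can be written as two lookup tables. This is a sketch of the documented mapping only; the names are hypothetical, and how the library treats values outside the documented set is not covered here.

```python
# Documented mapping from strand direction to strand code.
STRAND_TO_CODE = {-1: 2, 0: 0, 1: 1}
# Inverse mapping, as documented for decode_region_strand.
CODE_TO_STRAND = {code: strand for strand, code in STRAND_TO_CODE.items()}


def encode_strand(strand):
    """Map a strand direction (-1, 0, +1) to its code (2, 0, 1)."""
    return STRAND_TO_CODE[strand]


def decode_strand(code):
    """Map a strand code (0, 1, 2) back to its direction (0, +1, -1)."""
    return CODE_TO_STRAND[code]


for strand in (-1, 0, 1):
    print(strand, encode_strand(strand), decode_strand(encode_strand(strand)))
```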
- encode_regionkey(self, chrom, startpos, endpos, strand)
- Returns a 64 bit regionkey
Parameters
----------
chrom : uint8
Encoded Chromosome (see encode_chrom).
startpos : uint32
Start position (zero based).
endpos : uint32
End position (startpos + region_length).
strand : uint8
Encoded Strand direction (-1 > 2, 0 > 0, +1 > 1)
Returns
-------
uint64 :
RegionKey 64 bit code.
- encode_string_id(self, strid, start=0)
- Encode maximum 10 characters of a string into a 64 bit unsigned integer.
This function can be used to convert generic string IDs to numeric IDs.
Parameters
----------
strid : string
The string to encode. It must be maximum 10 characters long and support ASCII characters from '!' to 'z'.
start : uint32
First character to encode, starting from 0. To encode the last 10 characters, set this value at (size - 10).
Returns
-------
uint64 :
Encoded string ID.
- encode_string_num_id(self, strid, sep=b':')
- Encode a string composed of a character section followed by a separator
character and a numerical section into a 64 bit unsigned integer. For example: ABCDE:0001234.
Encodes up to 5 characters in uppercase, a number up to 2^27, and up to 7 zero padding digits.
If the string is 10 characters or fewer, then encode_string_id() is used.
Parameters
----------
strid : string
The string to encode. It must be maximum 10 characters long and support ASCII characters from '!' to 'z'.
sep : char
Separator character between string and number.
Returns
-------
uint64 :
Encoded string ID.
- encode_variantkey(self, chrom, pos, refalt)
- Returns a 64 bit variant key based on the pre-encoded CHROM, POS (0-based) and REF+ALT.
Parameters
----------
chrom : uint8
Encoded Chromosome (see encode_chrom).
pos : uint32
Position. The reference position, with the first base having position 0.
refalt : uint32
Encoded Reference + Alternate (see encode_refalt).
Returns
-------
uint64 :
VariantKey 64 bit code.
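To show how three pre-encoded fields can share one 64 bit integer, here is a pure-Python sketch. The bit widths (5 bits CHROM, 28 bits POS, 31 bits REF+ALT, most significant first) are an assumption taken from the published VariantKey format description and are not stated in this listing. The helper names are hypothetical.

```python
# Assumed field widths (not stated in this listing): CHROM | POS | REF+ALT.
CHROM_BITS, POS_BITS, REFALT_BITS = 5, 28, 31
POS_SHIFT = REFALT_BITS                # 31
CHROM_SHIFT = POS_BITS + REFALT_BITS   # 59


def pack_variantkey(chrom, pos, refalt):
    """Pack pre-encoded CHROM, POS and REF+ALT into a single 64 bit integer."""
    for name, value, bits in (("chrom", chrom, CHROM_BITS),
                              ("pos", pos, POS_BITS),
                              ("refalt", refalt, REFALT_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (chrom << CHROM_SHIFT) | (pos << POS_SHIFT) | refalt


def unpack_variantkey(vk):
    """Split a 64 bit key back into (chrom, pos, refalt)."""
    chrom = vk >> CHROM_SHIFT
    pos = (vk >> POS_SHIFT) & ((1 << POS_BITS) - 1)
    refalt = vk & ((1 << REFALT_BITS) - 1)
    return chrom, pos, refalt


vk = pack_variantkey(23, 12345, 0x110D8000)
print(format(vk, "016x"), unpack_variantkey(vk))
```

Because CHROM and POS occupy the most significant bits under this layout, sorting the integers sorts variants by chromosome and then by position.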
- extend_regionkey(self, rk, size)
- Extend a regionkey region by a fixed amount from the start and end position.
Parameters
----------
rk : uint64
RegionKey code.
size : uint32
Amount to extend the region.
Returns
-------
uint64 :
RegionKey 64 bit code.
- extract_regionkey_chrom(self, rk)
- Extract the CHROM code from RegionKey.
Parameters
----------
rk : uint64
RegionKey code.
Returns
-------
uint8 :
CHROM code.
- extract_regionkey_endpos(self, rk)
- Extract the END POS code from RegionKey.
Parameters
----------
rk : uint64
RegionKey code.
Returns
-------
uint32 :
END POS.
- extract_regionkey_startpos(self, rk)
- Extract the START POS code from RegionKey.
Parameters
----------
rk : uint64
RegionKey code.
Returns
-------
uint32 :
START POS.
- extract_regionkey_strand(self, rk)
- Extract the STRAND from RegionKey.
Parameters
----------
rk : uint64
RegionKey code.
Returns
-------
uint8 :
STRAND.
- extract_variantkey_chrom(self, vk)
- Extract the CHROM code from VariantKey.
Parameters
----------
vk : uint64
VariantKey code.
Returns
-------
uint8 :
CHROM code.
- extract_variantkey_pos(self, vk)
- Extract the POS code from VariantKey.
Parameters
----------
vk : uint64
VariantKey code.
Returns
-------
uint32 :
Position.
- extract_variantkey_refalt(self, vk)
- Extract the REF+ALT code from VariantKey.
Parameters
----------
vk : uint64
VariantKey code.
Returns
-------
uint32 :
REF+ALT code.
- find_all_rv_variantkey_by_rsid(self, rsid)
- Search for the specified rsID and return all associated VariantKeys.
Parameters
----------
rsid : uint32
rsID to search.
Returns
-------
uint64 :
- VariantKey(s).
- find_all_vr_rsid_by_variantkey(self, vk)
- Search for the specified VariantKey and return all associated rsIDs.
Parameters
----------
vk : uint64
VariantKey to search for.
Returns
-------
uint32 :
- rsID(s).
- find_ref_alt_by_variantkey(self, vk)
- Retrieve the REF and ALT strings for the specified VariantKey.
Parameters
----------
vk : uint64
VariantKey to search.
Returns
-------
tuple :
- '|S256' : REF string.
- '|S256' : ALT string.
- uint8 : REF length.
- uint8 : ALT length.
- uint16 : REF+ALT length.
- find_rv_variantkey_by_rsid(self, rsid)
- Search for the specified rsID and return the first occurrence of VariantKey in the RV file.
Parameters
----------
rsid : uint32
rsID to search.
Returns
-------
tuple :
- uint64 : VariantKey or 0 in case not found.
- uint64 : Item position in the file.
- find_vr_chrompos_range(self, chrom, pos_min, pos_max)
- Search for the specified CHROM-POS range and return the first occurrence of rsID in the VR file.
Parameters
----------
chrom : uint8
Chromosome encoded number.
pos_min : uint32
Start reference position, with the first base having position 0.
pos_max : uint32
End reference position, with the first base having position 0.
Returns
-------
tuple :
- uint32 : rsID or 0 in case not found
- uint64 : Position of the first item.
- uint64 : Position of the last item.
- find_vr_rsid_by_variantkey(self, vk)
- Search for the specified VariantKey and return the first occurrence of rsID in the VR file.
Parameters
----------
vk : uint64
VariantKey.
Returns
-------
tuple :
- uint32 : rsID or 0 in case not found.
- uint64 : Item position in the file.
- flip_allele(self, allele)
- Flip the allele nucleotides (replaces each letter with its complement).
The resulting string is always in uppercase.
Supports extended nucleotide letters.
Parameters
----------
allele : string
String containing a sequence of nucleotide letters.
Returns
-------
'|S256' :
Flipped allele.
- get_genoref_seq(self, chrom, pos)
- Returns the genome reference nucleotide at the specified chromosome and position.
Parameters
----------
chrom : uint8
Encoded Chromosome number (see encode_chrom).
pos : uint32
Position. The reference position, with the first base having position 0.
Returns
-------
'|S1' :
Nucleotide letter or 0 (NULL char) in case of invalid position.
- get_next_rv_variantkey_by_rsid(self, pos, rsid)
- Get the next VariantKey for the specified rsID in the RV file.
This function should be used after find_rv_variantkey_by_rsid.
This function can be called in a loop to get all VariantKeys that are associated with the same rsID (if any).
Parameters
----------
pos : uint64
Current item position.
rsid : uint32
rsID to search.
Returns
-------
tuple :
- uint64 : VariantKey or 0 in case not found.
- uint64 : Item position in the file.
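The find-then-loop pattern described above can be sketched as a generator. It assumes that the position returned by each call is the one to pass to the next call, and that a returned VariantKey of 0 means there are no further matches. `FakeRsvkTable` is a hypothetical in-memory stand-in used only so the sketch runs without an rsvk.bin file; in practice a VariantKey object would be passed instead.

```python
def iter_variantkeys_for_rsid(vk_api, rsid):
    """Yield every VariantKey associated with rsid, in file order."""
    vk, pos = vk_api.find_rv_variantkey_by_rsid(rsid)
    while vk != 0:  # 0 is the documented "not found" value
        yield vk
        vk, pos = vk_api.get_next_rv_variantkey_by_rsid(pos, rsid)


class FakeRsvkTable:
    """Hypothetical stand-in for the rsID -> VariantKey lookup (illustration only)."""

    def __init__(self, rows):
        self.rows = sorted(rows)  # (rsid, variantkey) pairs sorted by rsID

    def find_rv_variantkey_by_rsid(self, rsid):
        for pos, (row_rsid, vk) in enumerate(self.rows):
            if row_rsid == rsid:
                return vk, pos
        return 0, len(self.rows)

    def get_next_rv_variantkey_by_rsid(self, pos, rsid):
        pos += 1
        if pos < len(self.rows) and self.rows[pos][0] == rsid:
            return self.rows[pos][1], pos
        return 0, pos


table = FakeRsvkTable([(10, 111), (20, 222), (20, 333), (30, 444)])
print(list(iter_variantkeys_for_rsid(table, 20)))
```

The same loop shape applies to find_vr_rsid_by_variantkey and get_next_vr_rsid_by_variantkey, with the roles of rsID and VariantKey swapped.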
- get_next_vr_rsid_by_variantkey(self, pos, vk)
- Get the next rsID for the specified VariantKey in the VR file.
This function should be used after find_vr_rsid_by_variantkey.
This function can be called in a loop to get all rsIDs that are associated with the same VariantKey (if any).
Parameters
----------
pos : uint64
Current item position.
vk : uint64
VariantKey to search for.
Returns
-------
tuple :
- uint32 : rsID or 0 in case not found.
- uint64 : Item position in the file.
- get_regionkey_chrom_endpos(self, rk)
- Get the CHROM + END POS encoding from RegionKey.
Parameters
----------
rk : uint64
RegionKey code.
Returns
-------
uint64 :
CHROM + END POS encoding.
- get_regionkey_chrom_startpos(self, rk)
- Get the CHROM + START POS encoding from RegionKey.
Parameters
----------
rk : uint64
RegionKey code.
Returns
-------
uint64 :
CHROM + START POS encoding.
- get_variantkey_chrom_endpos(self, vk)
- Get the CHROM + END POS encoding from VariantKey.
Parameters
----------
vk : uint64
VariantKey.
Returns
-------
uint64 :
CHROM + END POS encoding.
- get_variantkey_chrom_startpos(self, vk)
- Get the CHROM + START POS encoding from VariantKey.
Parameters
----------
vk : uint64
VariantKey.
Returns
-------
uint64 :
CHROM + START POS encoding.
- get_variantkey_endpos(self, vk)
- Get the VariantKey end position (POS + REF length).
Parameters
----------
vk : uint64
VariantKey.
Returns
-------
uint32 :
Variant end position.
- get_variantkey_ref_length(self, vk)
- Retrieve the REF length for the specified VariantKey.
Parameters
----------
vk : uint64
VariantKey
Returns
-------
uint8 :
REF length or 0 if the VariantKey is not reversible and not found.
- hash_string_id(self, strid)
- Hash the input string into a 64 bit unsigned integer.
This function can be used to convert long string IDs to numeric IDs.
Parameters
----------
strid : string
The string to encode.
Returns
-------
uint64 :
Hash string ID.
- normalize_variant(self, chrom, pos, ref, alt)
- Normalize a variant.
Flip alleles if required and apply the normalization algorithm described at:
https://genome.sph.umich.edu/wiki/Variant_Normalization
Parameters
----------
chrom : uint8
Chromosome encoded number.
pos : uint32
Position. The reference position, with the first base having position 0.
ref : string
Reference allele. String containing a sequence of nucleotide letters.
alt : string
Alternate non-reference allele string.
Returns
-------
tuple :
- int : Bitmask number in case of success, negative number in case of error.
When positive, each bit has a different meaning when set:
- bit 0 : The reference allele is inconsistent with the genome reference
(i.e. it contains nucleotide letters other than A, C, G and T).
- bit 1 : The alleles have been swapped.
- bit 2 : The allele nucleotides have been flipped
(each nucleotide has been replaced with its complement).
- bit 3 : Alleles have been left extended.
- bit 4 : Alleles have been right trimmed.
- bit 5 : Alleles have been left trimmed.
- uint32 : POS.
- '|S256' : REF string.
- '|S256' : ALT string.
- uint8 : REF length.
- uint8 : ALT length.
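The bitmask in the first tuple element can be unpacked into readable flags. The sketch below uses the bit meanings listed above and assumes that "bit 0" is the least significant bit. The names `NORMALIZATION_FLAGS` and `describe_normalization` are hypothetical helpers, not part of the module.

```python
# Bit positions and meanings as documented for normalize_variant.
NORMALIZATION_FLAGS = (
    (0, "reference allele inconsistent with the genome reference"),
    (1, "alleles swapped"),
    (2, "allele nucleotides flipped"),
    (3, "alleles left extended"),
    (4, "alleles right trimmed"),
    (5, "alleles left trimmed"),
)


def describe_normalization(code):
    """Return the list of flag descriptions that are set in a non-negative code."""
    if code < 0:
        raise ValueError("negative code signals an error, not a bitmask")
    return [label for bit, label in NORMALIZATION_FLAGS if code & (1 << bit)]


print(describe_normalization(0))        # no normalization step was applied
print(describe_normalization(0b10010))  # bits 1 and 4 set
```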
- normalized_variantkey(self, chrom, pos, posindex, ref, alt)
- Normalize a variant.
Flip alleles if required and apply the normalization algorithm described at:
https://genome.sph.umich.edu/wiki/Variant_Normalization
Parameters
----------
chrom : string
Chromosome. An identifier from the reference genome, no white-space or leading zeros permitted.
pos : uint32
Position. The reference position.
posindex : uint32
Position index: 0 for 0-based, 1 for 1-based.
ref : string
Reference allele. String containing a sequence of nucleotide letters.
alt : string
Alternate non-reference allele string.
Returns
-------
tuple :
- Normalized VariantKey 64 bit code (uint64).
- Normalization return code (see normalize_variant).
- nrvk_bin_to_tsv(self, tsvfile)
- Convert a nrvk.bin file to a simple TSV.
For the reverse operation see the resources/tools/nrvk.sh script.
Parameters
----------
tsvfile : string
Output file name.
Returns
-------
uint64 :
Number of bytes written or 0 in case of error.
- parse_regionkey_hex(self, rs)
- Parses a RegionKey hexadecimal string and returns the code.
Parameters
----------
rs : string
RegionKey hexadecimal string (it must contain 16 hexadecimal characters).
Returns
-------
uint64 :
A RegionKey code.
- parse_variantkey_hex(self, vs)
- Parses a VariantKey hexadecimal string and returns the code.
Parameters
----------
vs : '|S16'
VariantKey hexadecimal string (it must contain 16 hexadecimal characters).
Returns
-------
uint64 :
VariantKey 64 bit code.
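The 16-character hexadecimal form is a fixed-width rendering of the 64 bit integer, so the conversion can be reproduced with Python built-ins. This is a sketch with hypothetical helper names. Lowercase output is an assumption, and the helpers work on `str` rather than the byte strings (`'|S16'`) the module's methods use.

```python
import string


def format_key_hex(key):
    """Render a 64 bit key as exactly 16 hexadecimal characters (zero padded)."""
    if not 0 <= key < (1 << 64):
        raise ValueError("key must fit in 64 bits")
    return format(key, "016x")


def parse_key_hex(text):
    """Parse a 16-character hexadecimal string back into an integer."""
    if len(text) != 16 or any(c not in string.hexdigits for c in text):
        raise ValueError("expected exactly 16 hexadecimal characters")
    return int(text, 16)


hex_key = format_key_hex(0xB800181C910D8000)
print(hex_key, parse_key_hex(hex_key) == 0xB800181C910D8000)
```

Zero padding to a fixed width means the hexadecimal strings sort in the same order as the integers they encode.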
- regionkey(self, chrom, startpos, endpos, strand)
- Returns a 64 bit regionkey based on CHROM, START POS (0-based), END POS and STRAND.
Parameters
----------
chrom : string
Chromosome. An identifier from the reference genome, no white-space or leading zeros permitted.
startpos : uint32
Start position (zero based).
endpos : uint32
End position (startpos + region_length).
strand : int16
Strand direction (-1, 0, +1)
Returns
-------
uint64 :
RegionKey 64 bit code.
- regionkey_hex(self, rk)
- Returns RegionKey hexadecimal string (16 characters).
Parameters
----------
rk : uint64
RegionKey code.
Returns
-------
'|S16' :
RegionKey hexadecimal string.
- reverse_regionkey(self, rk)
- Reverse a RegionKey code and return the normalized components as a regionkey_rev_t structure.
Parameters
----------
rk : uint64
RegionKey code.
Returns
-------
tuple :
- '|S2' : chromosome
- uint32 : start position
- uint32 : end position
- int16 : strand
- reverse_variantkey(self, vk)
- Reverse a VariantKey code and return the normalized components.
Parameters
----------
vk : uint64
VariantKey code.
Returns
-------
tuple :
- '|S2' : CHROM string.
- uint32 : POS.
- '|S256' : REF string.
- '|S256' : ALT string.
- uint8 : REF length.
- uint8 : ALT length.
- uint16 : REF+ALT length.
- variantkey(self, chrom, pos, ref, alt)
- Returns a 64 bit variant key based on CHROM, POS (0-based), REF, ALT.
The variant should be already normalized (see normalize_variant or use normalized_variantkey).
Parameters
----------
chrom : string
Chromosome. An identifier from the reference genome, no white-space or leading zeros permitted.
pos : uint32
Position. The reference position, with the first base having position 0.
ref : string
Reference allele. String containing a sequence of nucleotide letters.
The value in the pos field refers to the position of the first nucleotide in the String.
Characters must be A-Z, a-z or *
alt : string
Alternate non-reference allele string. Characters must be A-Z, a-z or *
Returns
-------
uint64 :
VariantKey 64 bit code.
- variantkey_hex(self, vk)
- Returns VariantKey hexadecimal string (16 characters).
Parameters
----------
vk : uint64
VariantKey code.
Returns
-------
'|S16' :
VariantKey hexadecimal string.
- variantkey_range(self, chrom, pos_min, pos_max)
- Returns minimum and maximum VariantKeys for range searches.
Parameters
----------
chrom : uint8
Chromosome encoded number.
pos_min : uint32
Start reference position, with the first base having position 0.
pos_max : uint32
End reference position, with the first base having position 0.
Returns
-------
tuple :
- uint64 : VariantKey min value
- uint64 : VariantKey max value
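One way such bounds can be built is to fix CHROM and POS and fill the REF+ALT field with all zeros for the minimum and all ones for the maximum. The sketch below assumes the 5/28/31 bit layout (CHROM, POS, REF+ALT) from the published VariantKey format description, which this listing does not state. The function name is hypothetical and the library's own computation may differ.

```python
# Assumed layout (not stated in this listing): 5 bits CHROM, 28 bits POS, 31 bits REF+ALT.
POS_SHIFT = 31
CHROM_SHIFT = 59
REFALT_MASK = (1 << 31) - 1


def variantkey_range_sketch(chrom, pos_min, pos_max):
    """Return (min, max) keys that bracket every key on chrom within [pos_min, pos_max]."""
    base = chrom << CHROM_SHIFT
    vk_min = base | (pos_min << POS_SHIFT)                # REF+ALT bits all zero
    vk_max = base | (pos_max << POS_SHIFT) | REFALT_MASK  # REF+ALT bits all one
    return vk_min, vk_max


low, high = variantkey_range_sketch(23, 12000, 13000)
candidate = (23 << CHROM_SHIFT) | (12345 << POS_SHIFT) | 0x110D8000
print(low <= candidate <= high)
```

With keys stored in sorted order, a range query then reduces to two integer comparisons per key.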
- variantkey_to_regionkey(self, vk)
- Get RegionKey from VariantKey.
Parameters
----------
vk : uint64
VariantKey.
Returns
-------
uint64 :
A RegionKey code.
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
|