- are_overlapping_region_regionkey(...)
- Check if a region and a regionkey are overlapping.
Parameters
----------
chrom : int
Region A chromosome code.
startpos : int
Region A start position.
endpos : int
Region A end position (startpos + region length).
rk : int
RegionKey B.
Returns
-------
int :
1 if the regions overlap, 0 otherwise.
- are_overlapping_regionkeys(...)
- Check if two regionkeys are overlapping.
Parameters
----------
rka : int
RegionKey A.
rkb : int
RegionKey B.
Returns
-------
int :
1 if the regions overlap, 0 otherwise.
- are_overlapping_regions(...)
- Check if two regions are overlapping.
Parameters
----------
a_chrom : int
Region A chromosome code.
a_startpos : int
Region A start position.
a_endpos : int
Region A end position (startpos + region length).
b_chrom : int
Region B chromosome code.
b_startpos : int
Region B start position.
b_endpos : int
Region B end position (startpos + region length).
Returns
-------
int :
1 if the regions overlap, 0 otherwise.
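The overlap test above can be sketched in pure Python. This is only an illustration: it assumes regions are half-open intervals [startpos, endpos), which is consistent with endpos being startpos + region length, and the name regions_overlap_sketch is hypothetical and not part of this module.

```python
def regions_overlap_sketch(a_chrom, a_startpos, a_endpos,
                           b_chrom, b_startpos, b_endpos):
    """Return 1 if two half-open regions overlap, 0 otherwise."""
    same_chrom = a_chrom == b_chrom
    # Two half-open intervals overlap when each starts before the other ends.
    return int(same_chrom and a_startpos < b_endpos and b_startpos < a_endpos)
```

Under the half-open assumption, regions that merely touch (one ends where the other starts) do not count as overlapping.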
- are_overlapping_variantkey_regionkey(...)
- Check if variantkey and regionkey are overlapping.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_nrvk_file().
vk : int
VariantKey.
rk : int
RegionKey.
Returns
-------
int :
1 if the regions overlap, 0 otherwise.
- check_reference(...)
- Check if the reference allele matches the reference genome data.
Parameters
----------
mf : obj
Memory-mapped file object as returned by mmap_genoref_file().
chrom : int
Encoded Chromosome number (see encode_chrom).
pos : int
Position. The reference position, with the first base having position 0.
ref : str or bytes
Reference allele. String containing a sequence of nucleotide letters.
Returns
-------
int :
zero or a positive number in case of success, negative in case of error:
0 the reference allele matches the reference genome;
1 the reference allele is inconsistent with the genome reference (i.e. when it contains nucleotide letters other than A, C, G and T);
-1 the reference allele does not match the reference genome;
-2 the reference allele is longer than the genome reference sequence.
- compare_variantkey_chrom(...)
- Compares two VariantKeys by chromosome only.
Parameters
----------
vka : int
The first VariantKey to be compared.
vkb : int
The second VariantKey to be compared.
Returns
-------
int :
-1 if the first chromosome is smaller than the second, 0 if they are equal and 1 if the first is greater than the second.
Example
-------
>>> compare_variantkey_chrom(13258599952973561856, 13258609498538377215)
0
- compare_variantkey_chrom_pos(...)
- Compares two VariantKeys by chromosome and position.
Parameters
----------
vka : int
The first VariantKey to be compared.
vkb : int
The second VariantKey to be compared.
Returns
-------
int :
-1 if the first CHROM+POS is smaller than the second, 0 if they are equal and 1 if the first is greater than the second.
Example
-------
>>> compare_variantkey_chrom_pos(13258599952973561856, 13258609498538377215)
-1
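Both comparisons can be illustrated with plain bit shifts. The sketch below assumes a 64 bit layout of 5 CHROM bits, 28 POS bits and 31 REF+ALT bits (an assumption, although it is consistent with the example values in this reference); compare_by_prefix is a hypothetical helper name.

```python
def compare_by_prefix(vka, vkb, shift):
    """Compare two 64 bit keys after discarding the lowest `shift` bits."""
    a, b = vka >> shift, vkb >> shift
    return (a > b) - (a < b)


VK_A, VK_B = 13258599952973561856, 13258609498538377215
# Assumed layout: dropping 59 bits leaves CHROM only.
chrom_cmp = compare_by_prefix(VK_A, VK_B, 59)
# Assumed layout: dropping 31 bits leaves CHROM + POS.
chrom_pos_cmp = compare_by_prefix(VK_A, VK_B, 31)
```

With the two keys used in the examples above, chrom_cmp is 0 and chrom_pos_cmp is -1.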
- decode_chrom(...)
- Decode the chromosome numerical code.
Parameters
----------
code : int
CHROM code.
Returns
-------
bytes :
Chromosome string
Example
-------
>>> decode_chrom(23)
b'X'
- decode_refalt(...)
- Decode the 32 bit REF+ALT code if reversible (if it has 11 or fewer bases in total and only contains ACGT letters).
Parameters
----------
code : int
REF+ALT code
Returns
-------
tuple:
- REF
- ALT
- REF length
- ALT length
Example
-------
>>> decode_refalt(286097408)
(b'AC', b'GT', 2, 2)
- decode_region_strand(...)
- Decode the strand direction code (0 > 0, 1 > +1, 2 > -1).
Parameters
----------
strand : int
Strand code.
Returns
-------
int :
Strand direction.
- decode_regionkey(...)
- Decode a RegionKey code and return the components as a regionkey_t structure.
Parameters
----------
rk : int
RegionKey code.
Returns
-------
tuple:
- encoded chromosome
- start position
- end position
- encoded strand
- decode_string_id(...)
- Decode the encoded string ID.
This function is the reverse of encode_string_id.
The string is always returned in uppercase mode.
Parameters
----------
esid : int
Encoded string ID code.
Returns
-------
tuple:
- STRING
- STRING length
- decode_variantkey(...)
- Decode a VariantKey code and return the components.
Parameters
----------
vk : int
VariantKey code.
Returns
-------
tuple : int
- CHROM code
- POS
- REF+ALT code
Example
-------
>>> decode_variantkey(13258623813950472192)
(23, 12345, 286097408)
- encode_chrom(...)
- Returns chromosome numerical encoding.
Parameters
----------
chrom : str or bytes
Chromosome. An identifier from the reference genome, no white-space permitted.
Returns
-------
int :
CHROM code
Example
-------
>>> encode_chrom('X')
23
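A rough pure-Python sketch of the chromosome mapping follows. Only 'X' mapping to 23 is confirmed by the examples in this reference; the codes for Y (24) and MT (25), the 0 code for unrecognized input and the b'NA' label are assumptions, and the function names are hypothetical.

```python
_SPECIAL = {"X": 23, "Y": 24, "M": 25, "MT": 25}  # Y and MT codes are assumptions
_NAMES = {23: b"X", 24: b"Y", 25: b"MT"}


def encode_chrom_sketch(chrom):
    """Map a chromosome label to an integer code (0 if not recognized)."""
    if isinstance(chrom, bytes):
        chrom = chrom.decode("ascii")
    chrom = chrom.upper()
    if chrom.isdigit():
        value = int(chrom)
        return value if 1 <= value <= 25 else 0
    return _SPECIAL.get(chrom, 0)


def decode_chrom_sketch(code):
    """Map an integer code back to a chromosome label as bytes."""
    if code in _NAMES:
        return _NAMES[code]
    # b"NA" for anything outside 1..25 is an assumption of this sketch.
    return str(code).encode("ascii") if 1 <= code <= 22 else b"NA"
```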
- encode_refalt(...)
- Returns reference+alternate numerical encoding.
Parameters
----------
ref : str or bytes
Reference allele. String containing a sequence of nucleotide letters. The value in the pos field refers to the position of the first nucleotide in the String. Characters must be A-Z, a-z or *
alt : str or bytes
Alternate non-reference allele string. Characters must be A-Z, a-z or *
Returns
-------
int :
code
Example
-------
>>> encode_refalt(ref=b'AC', alt=b'GT')
286097408
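The reversible case can be sketched as follows. The bit positions used here (REF length at bit 27, ALT length at bit 23, then 2 bits per base from bit 21 downwards, lowest bit left at 0) are an assumption that happens to reproduce the example value above; the function names are hypothetical, and inputs that are not reversible are not handled.

```python
_BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
_BITS_TO_BASE = b"ACGT"


def encode_refalt_rev_sketch(ref, alt):
    """Pack REF and ALT into a reversible code, or return None if they do not fit."""
    if isinstance(ref, bytes):
        ref = ref.decode("ascii")
    if isinstance(alt, bytes):
        alt = alt.decode("ascii")
    bases = (ref + alt).upper()
    if len(bases) > 11 or any(b not in _BASE_TO_BITS for b in bases):
        return None  # non-reversible input: not covered by this sketch
    code = (len(ref) << 27) | (len(alt) << 23)
    shift = 21
    for base in bases:
        code |= _BASE_TO_BITS[base] << shift
        shift -= 2
    return code


def decode_refalt_rev_sketch(code):
    """Unpack a reversible code into (REF, ALT, REF length, ALT length)."""
    ref_len = (code >> 27) & 0xF
    alt_len = (code >> 23) & 0xF
    seq = bytes(_BITS_TO_BASE[(code >> (21 - 2 * i)) & 0x3]
                for i in range(ref_len + alt_len))
    return seq[:ref_len], seq[ref_len:], ref_len, alt_len
```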
- encode_region_strand(...)
- Encode the strand direction (-1 > 2, 0 > 0, +1 > 1).
Parameters
----------
strand : int
Strand direction (-1, 0, +1).
Returns
-------
int :
Strand code.
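The strand mapping given above (-1 > 2, 0 > 0, +1 > 1) can be written as a pair of lookup tables. This is a sketch only; the names are hypothetical, and the fallback to 0 for unknown inputs is an assumption.

```python
STRAND_TO_CODE = {-1: 2, 0: 0, 1: 1}
CODE_TO_STRAND = {code: strand for strand, code in STRAND_TO_CODE.items()}


def encode_strand_sketch(strand):
    """Map a strand direction (-1, 0, +1) to its code; unknown values map to 0."""
    return STRAND_TO_CODE.get(strand, 0)


def decode_strand_sketch(code):
    """Map a strand code (0, 1, 2) back to its direction; unknown values map to 0."""
    return CODE_TO_STRAND.get(code, 0)
```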
- encode_regionkey(...)
- Returns a 64 bit RegionKey based on the pre-encoded CHROM, START POS (0-based), END POS and STRAND.
Parameters
----------
chrom : int
Encoded Chromosome (see encode_chrom).
startpos : int
Start position (zero based).
endpos : int
End position (startpos + region_length).
strand : int
Encoded Strand direction (-1 > 2, 0 > 0, +1 > 1)
Returns
-------
int :
RegionKey 64 bit code.
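To illustrate how four fields can share one 64 bit integer, here is a packing sketch. The layout is an assumption that this reference does not confirm: 5 CHROM bits, 28 START POS bits, 28 END POS bits, 2 STRAND bits and 1 unused bit. The function names are hypothetical.

```python
def pack_regionkey_sketch(chrom, startpos, endpos, strand):
    """Pack pre-encoded fields into one integer (assumed 5/28/28/2/1 bit layout)."""
    return (
        ((chrom & 0x1F) << 59)
        | ((startpos & 0x0FFFFFFF) << 31)
        | ((endpos & 0x0FFFFFFF) << 3)
        | ((strand & 0x3) << 1)
    )


def unpack_regionkey_sketch(rk):
    """Split the integer back into (chrom, startpos, endpos, strand code)."""
    return (
        (rk >> 59) & 0x1F,
        (rk >> 31) & 0x0FFFFFFF,
        (rk >> 3) & 0x0FFFFFFF,
        (rk >> 1) & 0x3,
    )
```

Placing CHROM and START POS in the highest bits would make the integers sort by chromosome and start position, which is the property a sorted-key search would rely on.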
- encode_string_id(...)
- Encode a maximum of 10 characters of a string into a 64 bit unsigned integer.
This function can be used to convert generic string IDs to numeric IDs.
Parameters
----------
strid : str or bytes
The string to encode. It must be at most 10 characters long and contain only ASCII characters from '!' to 'z'.
start : int
First character to encode, starting from 0. To encode the last 10 characters, set this value at (size - 10).
Returns
-------
int :
Encoded string ID.
- encode_string_num_id(...)
- Encode a string composed of a character section followed by a separator
character and a numerical section into a 64 bit unsigned integer. For example: ABCDE:0001234.
Encodes up to 5 characters in uppercase, a number up to 2^27, and up to 7 zero padding digits.
If the string is 10 characters or fewer, encode_string_id() is used.
Parameters
----------
strid : str or bytes
The string to encode. It must be maximum 10 characters long and support ASCII characters from '!' to 'z'.
sep : str or bytes
Separator character between string and number.
Returns
-------
int :
Encoded string ID.
- encode_variantkey(...)
- Returns a 64 bit variant key based on the pre-encoded CHROM, POS (0-based) and REF+ALT.
Parameters
----------
chrom : int
Encoded Chromosome (see encode_chrom).
pos : int
Position. The reference position, with the first base having position 0.
refalt : int
Encoded Reference + Alternate (see encode_refalt).
Returns
-------
int:
VariantKey 64 bit code.
Example
-------
>>> encode_variantkey(chrom=23, pos=12345, refalt=286097408)
13258623813950472192
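The packing can be sketched with shifts and masks. The layout assumed here (5 CHROM bits, 28 POS bits, 31 REF+ALT bits) is not stated in this reference but reproduces the example values above; the function names are hypothetical.

```python
def pack_variantkey_sketch(chrom, pos, refalt):
    """Combine pre-encoded CHROM, POS and REF+ALT into one 64 bit integer."""
    return ((chrom & 0x1F) << 59) | ((pos & 0x0FFFFFFF) << 31) | (refalt & 0x7FFFFFFF)


def unpack_variantkey_sketch(vk):
    """Split a 64 bit key into (CHROM code, POS, REF+ALT code)."""
    return (vk >> 59) & 0x1F, (vk >> 31) & 0x0FFFFFFF, vk & 0x7FFFFFFF
```

Because CHROM and POS occupy the highest bits in this layout, sorting the integers also sorts the variants by chromosome and position.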
- extend_regionkey(...)
- Extend a regionkey region by a fixed amount from the start and end position.
Parameters
----------
rk : int
RegionKey code.
size : int
Amount to extend the region.
Returns
-------
int :
RegionKey 64 bit code.
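The extension itself can be pictured on plain start and end positions. The sketch below assumes that positions are clamped to the range 0 to 2^28 - 1 (an assumption based on a 28 bit position field); it works on a (startpos, endpos) pair instead of a RegionKey, and the name extend_region_sketch is hypothetical.

```python
MAX_POS = (1 << 28) - 1  # assumption: positions are stored in 28 bits


def extend_region_sketch(startpos, endpos, size):
    """Widen a region by `size` on both sides, clamping to the valid range."""
    new_start = max(0, startpos - size)
    new_end = min(MAX_POS, endpos + size)
    return new_start, new_end
```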
- extract_regionkey_chrom(...)
- Extract the CHROM code from RegionKey.
Parameters
----------
rk : int
RegionKey code.
Returns
-------
int :
CHROM code.
- extract_regionkey_endpos(...)
- Extract the END POS code from RegionKey.
Parameters
----------
rk : int
RegionKey code.
Returns
-------
int :
END POS.
- extract_regionkey_startpos(...)
- Extract the START POS code from RegionKey.
Parameters
----------
rk : int
RegionKey code.
Returns
-------
int :
START POS.
- extract_regionkey_strand(...)
- Extract the STRAND from RegionKey.
Parameters
----------
rk : int
RegionKey code.
Returns
-------
int :
STRAND.
- extract_variantkey_chrom(...)
- Extract the CHROM code from VariantKey.
Parameters
----------
vk : int
VariantKey code.
Returns
-------
int :
CHROM code.
Example
-------
>>> extract_variantkey_chrom(13258623813950472192)
23
- extract_variantkey_pos(...)
- Extract the POS code from VariantKey.
Parameters
----------
vk : int
VariantKey code.
Returns
-------
int :
Position.
Example
-------
>>> extract_variantkey_pos(13258623813950472192)
12345
- extract_variantkey_refalt(...)
- Extract the REF+ALT code from VariantKey.
Parameters
----------
vk : int
VariantKey code.
Returns
-------
int :
REF+ALT code.
Example
-------
>>> extract_variantkey_refalt(13258623813950472192)
286097408
- find_all_rv_variantkey_by_rsid(...)
- Search for the specified rsID and return all associated VariantKeys.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_rsvk_file().
first : int
First element of the range to search (min value = 0).
last : int
Element (up to but not including) where to end the search (max value = nitems).
rsid : int
rsID to search.
Returns
-------
tuple : int
- VariantKey(s).
- find_all_vr_rsid_by_variantkey(...)
- Search for the specified VariantKey and return all associated rsIDs.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_vkrs_file().
first : int
First element of the range to search (min value = 0).
last : int
Element (up to but not including) where to end the search (max value = nitems).
vk : int
VariantKey.
Returns
-------
tuple : int
- rsID(s).
- find_ref_alt_by_variantkey(...)
- Retrieve the REF and ALT strings for the specified VariantKey.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_nrvk_file().
vk : int
VariantKey to search.
Returns
-------
tuple :
- REF string.
- ALT string.
- REF length.
- ALT length.
- REF+ALT length.
- find_rv_variantkey_by_rsid(...)
- Search for the specified rsID and return the first occurrence of VariantKey in the RV file.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_rsvk_file().
first : int
First element of the range to search (min value = 0).
last : int
Element (up to but not including) where to end the search (max value = nitems).
rsid : int
rsID to search.
Returns
-------
tuple :
- VariantKey or 0 in case not found.
- Item position in the file.
- find_vr_chrompos_range(...)
- Search for the specified CHROM-POS range and return the first occurrence of rsID in the VR file.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_vkrs_file().
first : int
First element of the range to search (min value = 0).
last : int
Element (up to but not including) where to end the search (max value = nitems).
chrom : int
Chromosome encoded number.
pos_min : int
Start reference position, with the first base having position 0.
pos_max : int
End reference position, with the first base having position 0.
Returns
-------
tuple :
- rsID or 0 in case not found
- Position of the first item.
- Position of the last item.
- find_vr_rsid_by_variantkey(...)
- Search for the specified VariantKey and return the first occurrence of rsID in the VR file.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_vkrs_file().
first : int
First element of the range to search (min value = 0).
last : int
Element (up to but not including) where to end the search (max value = nitems).
vk : int
VariantKey.
Returns
-------
tuple :
- rsID or 0 in case not found.
- Item position in the file.
- flip_allele(...)
- Flip the allele nucleotides (replaces each letter with its complement).
The resulting string is always in uppercase. Supports extended nucleotide letters.
Parameters
----------
allele : str or bytes
String containing a sequence of nucleotide letters.
Returns
-------
bytes :
Flipped allele.
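A complement lookup of this kind can be sketched with a translation table. The extended letters use the standard IUPAC complements (R/Y, K/M, B/V, D/H; S, W and N map to themselves); whether the library uses exactly this table is an assumption, and flip_allele_sketch is a hypothetical name.

```python
# Each letter in the first string is replaced by the letter at the same
# index in the second string; letters not listed are left unchanged.
_COMPLEMENT = bytes.maketrans(b"ACGTRYKMBVDH", b"TGCAYRMKVBHD")


def flip_allele_sketch(allele):
    """Return the uppercase complement of a nucleotide string as bytes."""
    if isinstance(allele, str):
        allele = allele.encode("ascii")
    return allele.upper().translate(_COMPLEMENT)
```

The sequence is complemented in place and not reversed.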
- get_genoref_seq(...)
- Returns the genome reference nucleotide at the specified chromosome and position.
Parameters
----------
mf : obj
Memory-mapped file object as returned by mmap_genoref_file().
chrom : int
Encoded Chromosome number (see encode_chrom).
pos : int
Position. The reference position, with the first base having position 0.
Returns
-------
bytes :
Nucleotide letter or 0 (NULL char) in case of invalid position.
- get_next_rv_variantkey_by_rsid(...)
- Get the next VariantKey for the specified rsID in the RV file. This function should be used after find_rv_variantkey_by_rsid. This function can be called in a loop to get all VariantKeys that are associated with the same rsID (if any).
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_rsvk_file().
pos : int
Current item position.
last : int
Element (up to but not including) where to end the search (max value = nitems).
rsid : int
rsID to search.
Returns
-------
tuple :
- VariantKey or 0 in case not found.
- Item position in the file.
- get_next_vr_rsid_by_variantkey(...)
- Get the next rsID for the specified VariantKey in the VR file. This function should be used after find_vr_rsid_by_variantkey. This function can be called in a loop to get all rsIDs that are associated with the same VariantKey (if any).
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_vkrs_file().
pos : int
Current item position.
last : int
Element (up to but not including) where to end the search (max value = nitems).
vk : int
VariantKey.
Returns
-------
tuple :
- rsID or 0 in case not found.
- Item position in the file.
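The "find the first match, then keep asking for the next one" pattern used by the find_* and get_next_* functions can be modelled on two sorted Python lists. This is a stand-in for the memory-mapped columns, not the library's implementation; all names are hypothetical, and the position returned when nothing is found is a convention of this sketch only.

```python
from bisect import bisect_left


def find_first(keys, values, first, last, key):
    """Binary-search keys[first:last]; return (value, pos) of the first match or (0, last)."""
    pos = bisect_left(keys, key, first, last)
    if pos < last and keys[pos] == key:
        return values[pos], pos
    return 0, last


def get_next(keys, values, pos, last, key):
    """Advance one row; return (value, new pos) if it has the same key, else (0, new pos)."""
    pos += 1
    if pos < last and keys[pos] == key:
        return values[pos], pos
    return 0, pos


def find_all(keys, values, first, last, key):
    """Collect every value stored under `key` by looping over get_next."""
    found = []
    value, pos = find_first(keys, values, first, last, key)
    while value:  # 0 is the "not found" marker, as in the functions above
        found.append(value)
        value, pos = get_next(keys, values, pos, last, key)
    return tuple(found)
```

The loop relies on the key column being sorted, so that all rows sharing a key are adjacent.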
- get_regionkey_chrom_endpos(...)
- Get the CHROM + END POS encoding from RegionKey.
Parameters
----------
rk : int
RegionKey code.
Returns
-------
int :
CHROM + END POS encoding.
- get_regionkey_chrom_startpos(...)
- Get the CHROM + START POS encoding from RegionKey.
Parameters
----------
rk : int
RegionKey code.
Returns
-------
int :
CHROM + START POS encoding.
- get_variantkey_chrom_endpos(...)
- Get the CHROM + END POS encoding from VariantKey.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_nrvk_file().
vk : int
VariantKey.
Returns
-------
int :
CHROM + END POS encoding.
- get_variantkey_chrom_startpos(...)
- Get the CHROM + START POS encoding from VariantKey.
Parameters
----------
vk : int
VariantKey.
Returns
-------
int :
CHROM + START POS encoding.
- get_variantkey_endpos(...)
- Get the VariantKey end position (POS + REF length).
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_nrvk_file().
vk : int
VariantKey.
Returns
-------
int :
Variant end position.
- get_variantkey_ref_length(...)
- Retrieve the REF length for the specified VariantKey.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_nrvk_file().
vk : int
VariantKey
Returns
-------
int :
REF length or 0 if the VariantKey is not reversible and not found.
- hash_string_id(...)
- Hash the input string into a 64 bit unsigned integer.
This function can be used to convert long string IDs to numeric IDs.
Parameters
----------
strid : str or bytes
The string to encode.
Returns
-------
int :
Hash string ID.
- mmap_genoref_file(...)
- Memory map the specified genome reference binary file (fasta.bin).
Parameters
----------
file : str
Path to the file to map.
ctbytes : int array
Array containing the number of bytes for each column type (i.e. 1 for uint8, 2 for uint16, 4 for uint32, 8 for uint64).
Returns
-------
tuple :
- Pointer to the memory map object.
- File size.
- mmap_nrvk_file(...)
- Memory map the specified NRVK binary file (nrvk.bin).
Parameters
----------
file : str
Path to the file to map.
ctbytes : int array
Array containing the number of bytes for each column type (i.e. 1 for uint8, 2 for uint16, 4 for uint32, 8 for uint64).
Returns
-------
tuple :
- Pointer to the memory map object.
- Pointer to the memory mapped columns object.
- mmap_rsvk_file(...)
- Memory map the specified RSVK binary file (rsvk.bin).
Parameters
----------
file : str
Path to the file to map.
ctbytes : int array
Array containing the number of bytes for each column type (i.e. 1 for uint8, 2 for uint16, 4 for uint32, 8 for uint64).
Returns
-------
tuple :
- Pointer to the memory map object.
- Pointer to the memory mapped columns object.
- Number of rows.
- mmap_vkrs_file(...)
- Memory map the specified VKRS binary file (vkrs.bin).
Parameters
----------
file : str
Path to the file to map.
ctbytes : int array
Array containing the number of bytes for each column type (i.e. 1 for uint8, 2 for uint16, 4 for uint32, 8 for uint64).
Returns
-------
tuple :
- Pointer to the memory map object.
- Pointer to the memory mapped columns object.
- Number of rows.
- munmap_binfile(...)
- Unmap and close the memory-mapped file.
Parameters
----------
mf : obj
Pointer to the memory mapped file object.
Returns
-------
int:
On success returns 0, on failure -1.
- normalize_variant(...)
- Normalize a variant. Flip alleles if required and apply the normalization algorithm described at: https://genome.sph.umich.edu/wiki/Variant_Normalization
Parameters
----------
mf : obj
Memory-mapped file object as returned by mmap_genoref_file().
chrom : int
Chromosome encoded number.
pos : int
Position. The reference position, with the first base having position 0.
ref : str or bytes
Reference allele. String containing a sequence of nucleotide letters.
alt : str or bytes
Alternate non-reference allele string.
Returns
-------
tuple :
- Bitmask number in case of success, negative number in case of error. When positive, each bit has a different meaning when set:
- bit 0 : The reference allele is inconsistent with the genome reference (i.e. when it contains nucleotide letters other than A, C, G and T).
- bit 1 : The alleles have been swapped.
- bit 2 : The allele nucleotides have been flipped (each nucleotide has been replaced with its complement).
- bit 3 : Alleles have been left extended.
- bit 4 : Alleles have been right trimmed.
- bit 5 : Alleles have been left trimmed.
- POS.
- REF string.
- ALT string.
- REF length.
- ALT length.
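The bitmask in the first tuple element can be unpacked into readable labels. The sketch below uses the bit meanings listed above; the label strings and the name describe_norm_code are hypothetical.

```python
NORM_FLAGS = (
    (0, "reference allele inconsistent with the genome reference"),
    (1, "alleles swapped"),
    (2, "alleles flipped"),
    (3, "alleles left extended"),
    (4, "alleles right trimmed"),
    (5, "alleles left trimmed"),
)


def describe_norm_code(code):
    """Return the labels of all bits set in a non-negative normalization code."""
    if code < 0:
        raise ValueError("a negative code signals an error")
    return [label for bit, label in NORM_FLAGS if code & (1 << bit)]
```

For example, a code of 6 (binary 110) has bits 1 and 2 set, so it reports that the alleles were both swapped and flipped.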
- normalized_variantkey(...)
- Returns a normalized 64 bit variant key based on CHROM, POS, REF, ALT.
Parameters
----------
mf : obj
Memory-mapped file object as returned by mmap_genoref_file().
chrom : str or bytes
Chromosome. An identifier from the reference genome, no white-space or leading zeros permitted.
pos : int
Position. The reference position.
posindex : int
Position index: 0 for 0-based, 1 for 1-based.
ref : str or bytes
Reference allele. String containing a sequence of nucleotide letters. The value in the pos field refers to the position of the first nucleotide in the String. Characters must be A-Z, a-z or *
alt : str or bytes
Alternate non-reference allele string. Characters must be A-Z, a-z or *
Returns
-------
tuple :
- VariantKey 64 bit code.
- Normalization code (see normalize_variant).
- nrvk_bin_to_tsv(...)
- Convert a nrvk.bin file to a simple TSV.
For the reverse operation see the resources/tools/nrvk.sh script.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_nrvk_file().
tsvfile : str
Output file name.
Returns
-------
int :
Number of bytes written or 0 in case of error.
- parse_regionkey_hex(...)
- Parses a RegionKey hexadecimal string and returns the code.
Parameters
----------
rs : str or bytes
RegionKey hexadecimal string (it must contain 16 hexadecimal characters).
Returns
-------
int :
A RegionKey code.
- parse_variantkey_hex(...)
- Parses a VariantKey hexadecimal string and returns the code.
Parameters
----------
vs : str or bytes
VariantKey hexadecimal string (it must contain 16 hexadecimal characters).
Returns
-------
int :
VariantKey 64 bit code.
Example
-------
>>> parse_variantkey_hex(b'b800181c910d8000')
13258623813950472192
- regionkey(...)
- Returns a 64 bit regionkey based on CHROM, START POS (0-based), END POS and STRAND.
Parameters
----------
chrom : str or bytes
Chromosome. An identifier from the reference genome, no white-space or leading zeros permitted.
startpos : int
Start position (zero based).
endpos : int
End position (startpos + region_length).
strand : int
Strand direction (-1, 0, +1)
Returns
-------
int :
RegionKey 64 bit code.
- regionkey_hex(...)
- Returns RegionKey hexadecimal string (16 characters).
Parameters
----------
rk : int
RegionKey code.
Returns
-------
string :
RegionKey hexadecimal string.
- reverse_regionkey(...)
- Reverse a RegionKey code and return the normalized components as a regionkey_rev_t structure.
Parameters
----------
rk : int
RegionKey code.
Returns
-------
tuple:
- chromosome
- start position
- end position
- strand
- reverse_variantkey(...)
- Reverse a VariantKey code and return the normalized components.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_nrvk_file().
vk : int
VariantKey code.
Returns
-------
tuple :
- CHROM string.
- POS.
- REF string.
- ALT string.
- REF length.
- ALT length.
- REF+ALT length.
- variantkey(...)
- Returns a 64 bit variant key based on CHROM, POS (0-based), REF, ALT.
The variant should be already normalized (see normalize_variant or use normalized_variantkey).
Parameters
----------
chrom : str or bytes
Chromosome. An identifier from the reference genome, no white-space or leading zeros permitted.
pos : int
Position. The reference position, with the first base having position 0.
ref : str or bytes
Reference allele. String containing a sequence of nucleotide letters. The value in the pos field refers to the position of the first nucleotide in the String. Characters must be A-Z, a-z or *
alt : str or bytes
Alternate non-reference allele string. Characters must be A-Z, a-z or *
Returns
-------
int:
VariantKey 64 bit code.
Example
-------
>>> variantkey(chrom=b'X', pos=12345, ref=b'AC', alt=b'GT')
13258623813950472192
>>> variantkey(chrom='X', pos=12345, ref='AC', alt='GT')
13258623813950472192
>>> variantkey(b'X', 12345, b'AC', b'GT')
13258623813950472192
>>> variantkey('X', 12345, 'AC', 'GT')
13258623813950472192
- variantkey_hex(...)
- Returns VariantKey hexadecimal string (16 characters).
Parameters
----------
vk : int
VariantKey code.
Returns
-------
bytes:
VariantKey hexadecimal string.
Example
-------
>>> variantkey_hex(13258623813950472192)
b'b800181c910d8000'
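The hexadecimal form is simply the 64 bit integer written as 16 zero-padded lowercase hex digits, so the conversion in both directions can be reproduced with Python built-ins. The helper names below are hypothetical.

```python
def key_to_hex_sketch(key):
    """Format a 64 bit key as 16 lowercase hexadecimal characters (bytes)."""
    return format(key, "016x").encode("ascii")


def hex_to_key_sketch(text):
    """Parse a 16 character hexadecimal string (str or bytes) into an integer."""
    if isinstance(text, bytes):
        text = text.decode("ascii")
    return int(text, 16)
```

The fixed width means the hexadecimal strings sort in the same order as the integers they represent.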
- variantkey_range(...)
- Returns minimum and maximum VariantKeys for range searches.
Parameters
----------
chrom : int
Chromosome encoded number.
pos_min : int
Start reference position, with the first base having position 0.
pos_max : int
End reference position, with the first base having position 0.
Returns
-------
tuple : int
- VariantKey min value
- VariantKey max value
Example
-------
>>> variantkey_range(chrom=23, pos_min=1234, pos_max=5678)
(13258599952973561856, 13258609498538377215)
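The two bounds can be derived by filling the REF+ALT bits with all zeros and all ones. The sketch assumes a layout of 5 CHROM bits, 28 POS bits and 31 REF+ALT bits, which reproduces the example above; variantkey_range_sketch is a hypothetical name.

```python
def variantkey_range_sketch(chrom, pos_min, pos_max):
    """Return the smallest and largest keys that fall inside a CHROM-POS range."""
    prefix = chrom << 59
    vk_min = prefix | (pos_min << 31)               # REF+ALT bits all zero
    vk_max = prefix | (pos_max << 31) | 0x7FFFFFFF  # REF+ALT bits all one
    return vk_min, vk_max
```

Any key for a variant on that chromosome with a position between pos_min and pos_max then satisfies vk_min <= vk <= vk_max, so a range query reduces to two integer comparisons.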
- variantkey_to_regionkey(...)
- Get RegionKey from VariantKey.
Parameters
----------
mc : obj
Memory-mapped columns object as returned by mmap_nrvk_file().
vk : int
VariantKey.
Returns
-------
int :
A RegionKey code.