Biotools¶
Biologically-related useful methods.
-
dnachisel.biotools.
annotate_differences
(record, reference, feature_type='misc_feature', prefix='#')[source]¶ Annotate differences between two records in a new record.
Returns a version of SeqRecord
record
where differences with the references are annotated as new features.- Parameters
- record
The SeqRecord to be compared to the reference.
- reference
The reference SeqRecord. Must be the same size as
reference
.- feature_type
The type of the features added to mark differences.
- prefix
Each new feature will be labeled “po” where p is the prefix and o the original sequence at the feature’s location. For instance “#A” or “#TT”.
-
dnachisel.biotools.
annotate_pattern_occurrences
(record, pattern, feature_type='misc_feature', prefix='!')[source]¶ Return a new record annotated w. all occurences of pattern in sequence.
- Parameters
- record
A Biopython record.
- pattern
A DnaChisel SequencePattern object (such as DnaPAttern).
- feature_type
Type of the annotations in the returned record.
-
dnachisel.biotools.
annotate_record
(seqrecord, location='full', feature_type='misc_feature', margin=0, **qualifiers)[source]¶ Add a feature to a Biopython SeqRecord.
- Parameters
- seqrecord
The biopython seqrecord to be annotated.
- location
Either (start, end) or (start, end, strand). (strand defaults to +1).
- feature_type
The type associated with the feature.
- margin
Number of extra bases added on each side of the given location.
- qualifiers
Dictionary that will be the Biopython feature’s qualifiers attribute.
-
dnachisel.biotools.
blast_sequence
(sequence, blast_db=None, subject_sequences=None, subject=None, word_size=4, perc_identity=80, num_alignments=1000, ungapped=False, num_threads=3, culling_limit=None, e_value=None, task=None, dust='no')[source]¶ Return a Biopython BLAST record of the given sequence BLASTed against the provided database.
- Parameters
- sequence
An ATGC sequence
Examples
>>> blast_record = blast_sequence("ATTGTGCGTGTGTGCGT", "blastdb/ecoli") >>> for alignment in blast_record.alignments: >>> for hit in alignment.hsps: >>> print (hit.identities)
-
dnachisel.biotools.
change_biopython_record_sequence
(record, new_seq)[source]¶ Return a version of the record with the sequence set to new_seq.
-
dnachisel.biotools.
complement
(dna_sequence)[source]¶ Return the complement of the DNA sequence.
For instance
complement("ATGCCG")
returns"TACGGC"
.Uses Biopython for speed.
-
dnachisel.biotools.
dict_to_pretty_string
(d, rounding_digits=2, indent=2)[source]¶ Return a nicely JSON-like formatted string to print a dict.
-
dnachisel.biotools.
dna_pattern_to_regexpr
(dna_pattern)[source]¶ Return a regular expression pattern for the provided DNA pattern
For instance
dna_pattern_to_regexpr('ATTNN')
returns"ATT[A|T|G|C][A|T|G|C]"
.
-
dnachisel.biotools.
find_all_bowtie_matches
(sequence, bowtie_index_path, match_length, max_mismatches=0)[source]¶ Return (short) matches between a sequence and a Bowtie index.
The result is of the form [(start, end), n_mismatches)] where (start, end) indicates the position of the match in the sequence, and n_mismatches is the number of mismatches with the closest homology in the index.
-
dnachisel.biotools.
find_specification_label_in_feature
(feature)[source]¶ Analyse a Biopython feature to find a DnaChisel Specification in it.
The specification should start with either “@” or “~”, in the feature’s field “label” or “note”.
-
dnachisel.biotools.
gc_content
(sequence, window_size=None)[source]¶ Compute global or local GC content.
- Parameters
- sequence
An ATGC DNA sequence (upper case!)
- window_size
If provided, the local GC content for the different sliding windows of this size is returned, else the global GC content is returned.
- Returns
- A number between 0 and 1 indication the proportion
of GC content. If window_size is provided, returns a list of len(sequence)-window_size values indicating the local GC contents (sliding-window method). The i-th value indicates the GC content in the window [i, i+window_size]
-
dnachisel.biotools.
group_nearby_indices
(indices, max_gap=None, max_group_spread=None)[source]¶ Return a list of groups of the different indices.
Indices are considered from smaller to larger and placed into groups
- Parameters
- max_gap
Maximal allowed difference between two consecutive numbers of a group
- max_group_spread
Maximal allowed difference between the smallest and largest elements of a group.
-
dnachisel.biotools.
group_nearby_segments
(segments, max_start_gap=None, max_start_spread=None)[source]¶ Return a list of groups of the different indices.
Indices are considered from smaller to larger and placed into groups
- Parameters
- max_gap
Maximal allowed difference between two consecutive numbers of a group
- max_group_spread
Maximal allowed difference between the smallest and largest elements of a group.
-
dnachisel.biotools.
list_common_enzymes
(site_length=6, opt_temp=37, min_suppliers=1, site_unlike=())[source]¶ Return a list of enzyme names with the given constraints.
- Parameters
- site_length
List of accepted site lengths (6, 4, …)
- opt_temp
List of accepted optimal temperatures for the enzyme
- min_suppliers
Minimal number registered suppliers in the Biopython data. A minimum of 3 known suppliers returns the most common enzymes.
- site_unlike
List of (ambiguous or unambiguous) DNA sequences that should NOT be recognized by the selected enzymes.
-
dnachisel.biotools.
load_record
(filepath, linear=True, name='unnamed', file_format='auto')[source]¶ Load a FASTA/Genbank/Snapgene record.
Note that reading Snapgene records requires the library snapgene_reader installed.
-
dnachisel.biotools.
random_dna_sequence
(length, gc_share=None, probas=None, seed=None)[source]¶ Return a random DNA sequence (“ATGGCGT…”) with the specified length.
- Parameters
- length
Length of the DNA sequence.
- proba
Frequencies for the different nucleotides, for instance
probas={"A":0.2, "T":0.3, "G":0.3, "C":0.2}
. If not specified, all nucleotides are equiprobable (p=0.25).- seed
The seed to feed to the random number generator. When a seed is provided the random results depend deterministically on the seed, thus enabling reproducibility
-
dnachisel.biotools.
random_protein_sequence
(length, seed=None)[source]¶ Return a random protein sequence “MNQTW…YL*” of the specified length.
- Parameters
- length
Length of the protein sequence (in number of amino-acids). Note that the sequence will always start with
"M"
and end with a stop codon"*"
with (length-2) random amino-acids in the middle- seed
The seed to feed to the random number generator. When a seed is provided the random results depend deterministically on the seed, thus enabling reproducibility
-
dnachisel.biotools.
reverse_complement
(sequence)[source]¶ Return the reverse-complement of the DNA sequence.
For instance
reverse_complement("ATGCCG")
returns"CGGCAT"
.Uses Biopython for speed.
-
dnachisel.biotools.
reverse_translate
(protein_sequence, randomize_codons=False, table='Standard')[source]¶ Return a DNA sequence which translates to the provided protein sequence.
- Parameters
- protein_sequence
A sequence string of aminoacids, e.g. “MVKK…”
- table
Genetic code table to use (e.g. ‘Standard’, ‘Bacterial’, etc.). See dnachisel.biotools.CODON_TABLE_NAMES for a list of available genetic code tables.
- randomize_codons
If False, the first valid codon found is used for each, which can create biases (GC content, etc.), if True, each amino acid gets replaced by a randomly selected codon for this amino acid.
-
dnachisel.biotools.
round_all_numbers_in_dict
(d, rounding_digits=2, outplace=True)[source]¶ Return a new version of dict d with all floats rounded to N digits.
-
dnachisel.biotools.
score_to_formatted_string
(score, characters=9)[source]¶ Transform a number (score) into a best-format string.
The format will be either int (2234), float (10.234) or engineering (1.20E5), whichever is shorter. The score is then padded with left whitespaces to obtained the desired number of
characters
.
-
dnachisel.biotools.
sequence_to_biopython_record
(sequence, id='<unknown id>', name='<unknown name>', features=())[source]¶ Return a SeqRecord of the sequence, ready to be Genbanked.
-
dnachisel.biotools.
sequences_differences
(seq1, seq2)[source]¶ Return the number of nucleotides that differ in the two sequences.
seq1, seq2 should be strings of DNA sequences e.g. “ATGCTGTGC”
-
dnachisel.biotools.
sequences_differences_array
(seq1, seq2)[source]¶ Return an array [0, 0, 1, 0, …] with 1s for sequence differences.
seq1, seq2 should both be ATGC strings.
-
dnachisel.biotools.
sequences_differences_segments
(seq1, seq2)[source]¶ Return the list of segments on which sequence seq1 differs from seq2.
The list is of the form [(start1, end1), (start2, end2), etc.]
- Parameters
- seq1, seq2
ATGC sequences to be compared
-
dnachisel.biotools.
subdivide_window
(window, max_span)[source]¶ Subdivide a window (start, end) into windows of size < max_span (start, i_1), (i_1, i_2), … (i_n, end)
-
dnachisel.biotools.
translate
(dna_sequence, table='Standard', assume_start_codon=False)[source]¶ Translate the DNA sequence into an amino-acids sequence “MLKYQT…”. If
translation_table
is the name or number of a NCBI genetic table, Biopython will be used. See here for options:http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc25
translation_table
can also be a dictionary of the form{"ATT": "M", "CTC": "X", etc.}
for more exotic translation tablesIf assume_start_codon is True and the first codon is a start codon in the given genetic table, then the first amino acid will be M (methionine).
-
dnachisel.biotools.
windows_overlap
(window1, window2)[source]¶ Return the overlap span between two windows.
- Parameters
- window1, window2
Each window is a couple of the form (start, end) indicating the range of a segment of integers.
- Returns
- None
In case the two windows do not overlap.
- [start, end]
The coordinates of the overlap segment if there is one.
-
dnachisel.biotools.
write_record
(record, target, file_format='genbank', remove_locationless_features=True, max_name_length=20)[source]¶ Write a record as genbank, fasta, etc. via Biopython, with fixes.
- Parameters
- record
A biopython record.
- target
Path to a file or filelike object.
- file_format
Format, either Genbank or fasta.
- remove_locationless_features
If True, will remove all features whose location is None, to avoid a Biopython bug
- max_name_length
The record’s name will be truncated if longer than this (also here to avoid a Biopython bug).