Biotools¶

Biologically-related useful methods.

dnachisel.biotools.annotate_differences(record, reference, feature_type='misc_feature', prefix='#')[source]¶

Annotate differences between two records in a new record.

Returns a version of SeqRecord record where differences with the references are annotated as new features.

Parameters

record: The SeqRecord to be compared to the reference.
reference: The reference SeqRecord. Must be the same size as reference.
feature_type: The type of the features added to mark differences.
prefix: Each new feature will be labeled “po” where p is the prefix and o the original sequence at the feature’s location. For instance “#A” or “#TT”.

dnachisel.biotools.annotate_pattern_occurrences(record, pattern, feature_type='misc_feature', prefix='!')[source]¶

Return a new record annotated w. all occurences of pattern in sequence.

Parameters

record: A Biopython record.
pattern: A DnaChisel SequencePattern object (such as DnaPAttern).
feature_type: Type of the annotations in the returned record.

dnachisel.biotools.annotate_record(seqrecord, location='full', feature_type='misc_feature', margin=0, **qualifiers)[source]¶

Add a feature to a Biopython SeqRecord.

Parameters

seqrecord: The biopython seqrecord to be annotated.
location: Either (start, end) or (start, end, strand). (strand defaults to +1).
feature_type: The type associated with the feature.
margin: Number of extra bases added on each side of the given location.
qualifiers: Dictionary that will be the Biopython feature’s qualifiers attribute.

dnachisel.biotools.blast_sequence(sequence, blast_db=None, subject_sequences=None, subject=None, word_size=4, perc_identity=80, num_alignments=1000, ungapped=False, num_threads=3, culling_limit=None, e_value=None, task=None, dust='no')[source]¶

Return a Biopython BLAST record of the given sequence BLASTed against the provided database.

Parameters

sequence: An ATGC sequence

Examples

>>> blast_record = blast_sequence("ATTGTGCGTGTGTGCGT", "blastdb/ecoli")
>>> for alignment in blast_record.alignments:
>>>     for hit in alignment.hsps:
>>>         print (hit.identities)

dnachisel.biotools.change_biopython_record_sequence(record, new_seq)[source]¶: Return a version of the record with the sequence set to new_seq.

dnachisel.biotools.complement(dna_sequence)[source]¶

Return the complement of the DNA sequence.

For instance complement("ATGCCG") returns "TACGGC".

Uses Biopython for speed.

dnachisel.biotools.dict_to_pretty_string(d, rounding_digits=2, indent=2)[source]¶: Return a nicely JSON-like formatted string to print a dict.

dnachisel.biotools.dna_pattern_to_regexpr(dna_pattern)[source]¶

Return a regular expression pattern for the provided DNA pattern

For instance dna_pattern_to_regexpr('ATTNN') returns "ATT[A|T|G|C][A|T|G|C]".

dnachisel.biotools.find_all_bowtie_matches(sequence, bowtie_index_path, match_length, max_mismatches=0)[source]¶

Return (short) matches between a sequence and a Bowtie index.

The result is of the form [(start, end), n_mismatches)] where (start, end) indicates the position of the match in the sequence, and n_mismatches is the number of mismatches with the closest homology in the index.

dnachisel.biotools.find_specification_label_in_feature(feature)[source]¶

Analyse a Biopython feature to find a DnaChisel Specification in it.

The specification should start with either “@” or “~”, in the feature’s field “label” or “note”.

dnachisel.biotools.gc_content(sequence, window_size=None)[source]¶

Compute global or local GC content.

Parameters

sequence: An ATGC DNA sequence (upper case!)
window_size: If provided, the local GC content for the different sliding windows of this size is returned, else the global GC content is returned.

Returns

A number between 0 and 1 indication the proportion: of GC content. If window_size is provided, returns a list of len(sequence)-window_size values indicating the local GC contents (sliding-window method). The i-th value indicates the GC content in the window [i, i+window_size]

dnachisel.biotools.group_nearby_indices(indices, max_gap=None, max_group_spread=None)[source]¶

Return a list of groups of the different indices.

Indices are considered from smaller to larger and placed into groups

Parameters

max_gap: Maximal allowed difference between two consecutive numbers of a group
max_group_spread: Maximal allowed difference between the smallest and largest elements of a group.

dnachisel.biotools.group_nearby_segments(segments, max_start_gap=None, max_start_spread=None)[source]¶

Return a list of groups of the different indices.

Indices are considered from smaller to larger and placed into groups

Parameters

max_gap: Maximal allowed difference between two consecutive numbers of a group
max_group_spread: Maximal allowed difference between the smallest and largest elements of a group.

dnachisel.biotools.list_common_enzymes(site_length=6, opt_temp=37, min_suppliers=1, site_unlike=())[source]¶

Return a list of enzyme names with the given constraints.

Parameters

site_length: List of accepted site lengths (6, 4, …)
opt_temp: List of accepted optimal temperatures for the enzyme
min_suppliers: Minimal number registered suppliers in the Biopython data. A minimum of 3 known suppliers returns the most common enzymes.
site_unlike: List of (ambiguous or unambiguous) DNA sequences that should NOT be recognized by the selected enzymes.

dnachisel.biotools.load_record(filepath, linear=True, name='unnamed', file_format='auto')[source]¶

Load a FASTA/Genbank/Snapgene record.

Note that reading Snapgene records requires the library snapgene_reader installed.

dnachisel.biotools.random_dna_sequence(length, gc_share=None, probas=None, seed=None)[source]¶

Return a random DNA sequence (“ATGGCGT…”) with the specified length.

Parameters

length: Length of the DNA sequence.
proba: Frequencies for the different nucleotides, for instance probas={"A":0.2, "T":0.3, "G":0.3, "C":0.2}. If not specified, all nucleotides are equiprobable (p=0.25).
seed: The seed to feed to the random number generator. When a seed is provided the random results depend deterministically on the seed, thus enabling reproducibility

dnachisel.biotools.random_protein_sequence(length, seed=None)[source]¶

Return a random protein sequence “MNQTW…YL*” of the specified length.

Parameters

length: Length of the protein sequence (in number of amino-acids). Note that the sequence will always start with "M" and end with a stop codon "*" with (length-2) random amino-acids in the middle
seed: The seed to feed to the random number generator. When a seed is provided the random results depend deterministically on the seed, thus enabling reproducibility

dnachisel.biotools.reverse_complement(sequence)[source]¶

Return the reverse-complement of the DNA sequence.

For instance reverse_complement("ATGCCG") returns "CGGCAT".

Uses Biopython for speed.

dnachisel.biotools.reverse_translate(protein_sequence, randomize_codons=False, table='Standard')[source]¶

Return a DNA sequence which translates to the provided protein sequence.

Parameters

protein_sequence: A sequence string of aminoacids, e.g. “MVKK…”
table: Genetic code table to use (e.g. ‘Standard’, ‘Bacterial’, etc.). See dnachisel.biotools.CODON_TABLE_NAMES for a list of available genetic code tables.
randomize_codons: If False, the first valid codon found is used for each, which can create biases (GC content, etc.), if True, each amino acid gets replaced by a randomly selected codon for this amino acid.

dnachisel.biotools.round_all_numbers_in_dict(d, rounding_digits=2, outplace=True)[source]¶: Return a new version of dict d with all floats rounded to N digits.

dnachisel.biotools.score_to_formatted_string(score, characters=9)[source]¶

Transform a number (score) into a best-format string.

The format will be either int (2234), float (10.234) or engineering (1.20E5), whichever is shorter. The score is then padded with left whitespaces to obtained the desired number of characters.

dnachisel.biotools.sequence_to_biopython_record(sequence, id='<unknown id>', name='<unknown name>', features=())[source]¶: Return a SeqRecord of the sequence, ready to be Genbanked.

dnachisel.biotools.sequences_differences(seq1, seq2)[source]¶

Return the number of nucleotides that differ in the two sequences.

seq1, seq2 should be strings of DNA sequences e.g. “ATGCTGTGC”

dnachisel.biotools.sequences_differences_array(seq1, seq2)[source]¶

Return an array [0, 0, 1, 0, …] with 1s for sequence differences.

seq1, seq2 should both be ATGC strings.

dnachisel.biotools.sequences_differences_segments(seq1, seq2)[source]¶

Return the list of segments on which sequence seq1 differs from seq2.

The list is of the form [(start1, end1), (start2, end2), etc.]

Parameters

seq1, seq2: ATGC sequences to be compared

dnachisel.biotools.subdivide_window(window, max_span)[source]¶: Subdivide a window (start, end) into windows of size < max_span (start, i_1), (i_1, i_2), … (i_n, end)

dnachisel.biotools.translate(dna_sequence, table='Standard', assume_start_codon=False)[source]¶

Translate the DNA sequence into an amino-acids sequence “MLKYQT…”. If translation_table is the name or number of a NCBI genetic table, Biopython will be used. See here for options:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc25

translation_table can also be a dictionary of the form {"ATT": "M", "CTC": "X", etc.} for more exotic translation tables

If assume_start_codon is True and the first codon is a start codon in the given genetic table, then the first amino acid will be M (methionine).

dnachisel.biotools.windows_overlap(window1, window2)[source]¶

Return the overlap span between two windows.

Parameters

window1, window2: Each window is a couple of the form (start, end) indicating the range of a segment of integers.

Returns

None: In case the two windows do not overlap.
[start, end]: The coordinates of the overlap segment if there is one.

dnachisel.biotools.write_record(record, target, file_format='genbank', remove_locationless_features=True, max_name_length=20)[source]¶

Write a record as genbank, fasta, etc. via Biopython, with fixes.

Parameters

record: A biopython record.
target: Path to a file or filelike object.
file_format: Format, either Genbank or fasta.
remove_locationless_features: If True, will remove all features whose location is None, to avoid a Biopython bug
max_name_length: The record’s name will be truncated if longer than this (also here to avoid a Biopython bug).