Core Classes ¶

Contents

Core Classes

DnaOptimizationProblem ¶

class dnachisel.DnaOptimizationProblem(sequence, constraints=None, objectives=None, logger='bar', mutation_space=None)[source]¶

Problem specifications: sequence, constraints, optimization objectives.

The original constraints, objectives, and original sequence of the problem are stored in the DNA problem. This class also has methods to display reports on the constraints and objectives, as well as solving the constraints and objectives.

Parameters

sequence: A string of ATGC characters (they must be upper case!), e.g. “ATTGTGTA”
constraints: A list of objects of type Specification.
objectives: A list of objects of type Specification specifying what must be optimized in the problem. Note that each objective has a float boost parameter. The larger the boost, the more the objective is taken into account during the optimization.
logger: Either None for no logger, ‘bar’ for a tqdm progress bar logger, or any ProgLog progress bar logger.
mutations_space: A MutationSpace indicating the possible mutations. In most case the mutation space will be left to None and computed at problem initialization (which can be slightly compute-intensive), however some core DNA Chisel methods will create optimization problems with a provided mutation_space to save computing time.

Notes

The dictionary self.possible_mutations is of the form {location1 : list1, location2: list2...} where location is either a single index (e.g. 10) indicating the position of a nucleotide to be muted, or a couple (start, end) indicating a whole segment whose sub-sequence should be replaced. The list s are lists of possible sequences to replace each location, e.g. for the mutation of a whole codon (3,6): ["ATT", "ACT", "AGT"].

Examples

>>> from dnachisel import *
>>> problem = DnaOptimizationProblem(
>>>     sequence = "ATGCGTGTGTGC...",
>>>     constraints = [constraint1, constraint2, ...],
>>>     objectives = [objective1, objective2, ...]
>>> )
>>> problem.resolve_constraints()
>>> problem.optimize()
>>> print(problem.constraints_text_summary())
>>> print(problem.objectives_text_summary())

Attributes

randomization_threshold: The algorithm will use an exhaustive search when the size of the mutation space (=the number of possible variants) is above this threshold, and a (guided) random search when it is above.
max_random_iters: When using a random search, stop after this many iterations
mutations_per_iteration: When using a random search, produce this many sequence mutations each iteration.
optimization_stagnation_tolerance: When using a random search, stop if the score hasn’t improved in the last “this many” iterations
local_extensions: Try local resolution several times if it fails, increasing the mutable zone by [N1, N2…] nucleotides on each side, until resolution works. (by default, an extension of 0bp is tried, then 5bp.

initialize()[source]¶: Precompute specification sets, evaluations, and mutation space.

number_of_edits()[source]¶: Return the number of nucleotide differences between the original and current sequence.

optimize_with_report(target, project_name='My project', file_path=None, file_content=None)[source]¶

Resolve constraints, optimize objectives, write a multi-file report.

The report’s content may vary depending on the optimization’s success.

Parameters

target: Either a path to a folder that will contain the report, or a path to a zip archive, or “@memory” to return raw data of a zip archive containing the report.
project_name: Project name to write on PDF reports

Returns

(success, message, zip_data): Triplet where success is True/False, message is a one-line string summary indication whether some clash was found, or some solution, or maybe no solution was found because the random searches were too short

sequence_edits_as_array()[source]¶: Return an array [False, False, True…] where True indicates an edit (i.e. a change at this position between the original problem sequence and the current one).

CircularDnaOptimizationProblem ¶

class dnachisel.CircularDnaOptimizationProblem(sequence, constraints=None, objectives=None, logger='bar', mutation_space=None)[source]¶

Class for solving circular DNA optimization problems.

The high-level interface is the same as for DnaOptimizationProblem: initialization, resolve_constraints(), and optimize() work the same way.

all_constraints_pass(autopass=True)[source]¶: Return whether the current problem sequence passes all constraints.

constraints_evaluations(autopass=True)[source]¶

Return the evaluation of constraints.

The “autopass_constraints” enables to just assume that constraints enforced by the mutation space are verified.

objectives_evaluations()[source]¶: Return a list of the evaluation of each objective of the problem

optimize()[source]¶: Maximize the total score by optimizing each objective in turn.

optimize_with_report(target, project_name='My project', file_path=None, file_content=None)[source]¶

Resolve constraints, optimize objectives, write a multi-file report.

WARNING: in case of optimization failure, the report generated will show a “pseudo-circular” sequence formed by concatenating the sequence with itself three times.

TODO: fix the warning above, at some point?

The report’s content may vary depending on the optimization’s success.

Parameters

target: Either a path to a folder that will contain the report, or a path to a zip archive, or “@memory” to return raw data of a zip archive containing the report.
project_name: Project name to write on PDF reports

Returns

(success, message, zip_data): Triplet where success is True/False, message is a one-line string summary indication whether some clash was found, or some solution, or maybe no solution was found because the random searches were too short

resolve_constraints(final_check=True)[source]¶

Solve a particular constraint using local, targeted searches.

Parameters

constraint: The Specification object for which the sequence should be solved
final_check: If True, a final check of that all constraints pass will be run at the end of the process, when constraints have been resolved one by one, to check that the solving of one constraint didn’t undo the solving of another.
cst_filter: An optional filter to only resolve a subset function (constraint => True/False)

Specification ¶

class dnachisel.Specification(evaluate=None, boost=1.0)[source]¶

General class to define specifications to optimize.

Note that all specifications have a boost attribute that is a multiplicator that will be used when computing the global specification score of a problem with problem.all_objectives_score().

New types of specifications are defined by subclassing Specification and providing a custom evaluate and localized methods.

Parameters

evaluate: function (sequence) => SpecEvaluation
boost: Relative importance of the Specification’s score in a multi-specification problem.

Attributes

best_possible_score: Best score that the specification can achieve. Used by the optimization algorithm to understand when no more optimization is required.
optimize_passively (boolean): Indicates that there should not be a pass of the optimization algorithm to optimize this objective. Instead, this objective is simply taken into account when optimizing other objectives.
enforced_by_nucleotide_restrictions (boolean): When the spec is used as a constraints, this indicates that the constraint will initially restrict the mutation space in a way that ensures that the constraint will always be valid. The constraint does not need to be evaluated again, which speeds up the resolution algorithm.
priority: Value used to sort the specifications and solve/optimize them in order, with highest priority first.
shorthand_name: Shorter name for the specification class that will be recognized when parsing annotations from genbanks.

as_passive_objective()[source]¶

Return a copy with optimize_passively set to true.

“Optimize passively” means that when the specification is used as an objective, the solver will not do a specific pass to optimize this specification, however this specification’s score will be taken into account in the global score when optimizing other objectives, and may therefore influence the final sequence.

breach_label()[source]¶

Shorter, less precise label to be used in tables, reports, etc.

This is meant for specifications such as EnforceGCContent(0.4, 0.6) to be represented as ‘40-60% GC’ in reports tables etc..

copy_with_changes(**kwargs)[source]¶

Return a copy of the Specification with modified properties.

For instance new_spec = spec.copy_with_changes(boost=10).

initialized_on_problem(problem, role='constraint')[source]¶

Complete specification initialization when the sequence gets known.

Some specifications like to know what their role is and on which sequence they are employed before they complete some values.

label(role=None, with_location=True, assignment_symbol=':', use_short_form=False, use_breach_form=False)[source]¶

Return a string label for this specification.

Parameters

role: Either ‘constraint’ or ‘objective’ (for prefixing the label with @ or ~), or None.
with_location: If true, the location will appear in the label.
assignment_symbol: Indicates whether to use “:” or “=” or anything else when indicating parameters values.
use_short_form: If True, the label will use self.short_label(), so for instance AvoidPattern(BsmBI_site) will become “no BsmBI”. How this is handled is dependent on the specification.

label_parameters()[source]¶

In subclasses, returns a list of the creation parameters.

For instance [(‘pattern’, ‘ATT’), (‘occurences’, 2)]

localized(location, problem=None)[source]¶

Return a modified version of the specification for the case where sequence modifications are only performed inside the provided location.

For instance if an specification concerns local GC content, and we are only making local mutations to destroy a restriction site, then we only need to check the local GC content around the restriction site after each mutation (and not compute it for the whole sequence), so EnforceGCContent.localized(location) will return an specification that only looks for GC content around the provided location.

If an specification concerns a DNA segment that is completely disjoint from the provided location, this must return None.

Must return an object of class Constraint.

restrict_nucleotides(sequence, location=None)[source]¶

Restrict the mutation space to speed up optimization.

This method only kicks in when this specification is used as a constraint. By default it does nothing, but subclasses such as EnforceTranslation, AvoidChanges, EnforceSequence, etc. have custom methods.

In the code, this method is run during the initialize() step of DNAOptimizationProblem, when the MutationSpace is created for each constraint

shifted(shift)[source]¶

Shift the location of the specification.

Some specification classes may have a special method to do side effects when shifting the location.

Location shifting is used in particular when solving circular DNA optimization problems.

short_label()[source]¶

Shorter, less precise label to be used in tables, reports, etc.

This is meant for specifications such as EnforceGCContent(0.4, 0.6) to be represented as ‘40-60% GC’ in reports tables etc..

SpecEvaluation ¶

Store relevant infos about the evaluation of an objective on a problem.

Examples ¶

>>> evaluation_result = ConstraintEvaluation(
>>>     specification=specification,
>>>     problem = problem,
>>>     score= evaluation_score, # float
>>>     locations=[Locations list...],
>>>     message = "Score: 42 (found 42 sites)"
>>> )

Parameters ¶

objective: The Objective that was evaluated.
problem: The problem that the objective was evaluated on.
score: The score associated to the evaluation.
locations: A list of couples (start, end) indicating the locations on which the the optimization shoul be localized to improve the objective.
message: A message that will be returned by str(evaluation). It will notably be displayed by problem.print_objectives_summaries.

dnachisel.SpecEvaluation.default_message¶: Return the default message for console/reports.

Location ¶

Implements the Location class.

The class has useful methods for finding overlaps between locations, extract a subsequence from a sequence, etc.

class dnachisel.Location.Location(start, end, strand=0)[source]¶

Represent a segment of a sequence, with a start, end, and strand.

Warning: we use Python’s splicing notation, so Location(5, 10) represents sequence[5, 6, 7, 8, 9] which corresponds to nucleotides number 6, 7, 8, 9, 10.

The data structure is similar to a Biopython’s FeatureLocation, but with different methods for different purposes.

Parameters

start: Lowest position index of the segment.
end: Highest position index of the segment.
strand: Either 1 or -1 for sense or anti-sense orientation.

extended(extension_length, lower_limit=0, upper_limit=None, left=True, right=True)[source]¶: Extend the location of a few basepairs on each side.

extract_sequence(sequence)[source]¶: Return the subsequence read at the given location.

static from_biopython_location(location)[source]¶: Return a DnaChisel Location from a Biopython location.

static from_data(location_data)[source]¶

Return a location, from varied data formats.

This method is used in particular in every built-in specification to quickly standardize the input location.

location_data can be a tuple (start, end) or (start, end, strand), or a Biopython FeatureLocation, or a Location instance. In any case, a new Location object will be returned.

static merge_overlapping_locations(locations)[source]¶: Return a list of locations obtained by merging all overlapping.

overlap_region(other_location)[source]¶: Return the overlap span between two locations (None if None).

to_biopython_feature(feature_type='misc_feature', **qualifiers)[source]¶: Return a Biopython SeqFeature with same location and custom qualifiers.

to_biopython_location()[source]¶: Return a Biopython FeatureLocation equivalent to the location.

to_tuple()[source]¶: Return (start, end, strand).

MutationSpace ¶

class dnachisel.MutationSpace.MutationChoice(segment, variants, is_any_nucleotide=False)[source]¶

Represent a segment of a sequence with several possible variants.

Parameters

segment: A pair (start, end) indicating the range of nucleotides concerned. We are applying Python range, so
variants: A set of sequence variants, at the given position

Examples

>>> choice = MutationChoice((70, 73), {})

extract_varying_region()[source]¶

Return MutationChoices for the central varying region and 2 flanks.

For instance:

>>> choice = MutationChoice((5, 12), [
>>>     'ATGCGTG',
>>>     'AAAAATG',
>>>     'AAATGTG',
>>>     'ATGAATG',
>>> ])
>>> choice.extract_varying_region()

Result :

>>> [
>>>     MutChoice(5-6 A),
>>>     MutChoice(6-10 TGCG-AATG-TGAA-AAAA),
>>>     MutChoice(10-12 TG)
>>> ]

merge_with(others)[source]¶: Merge this mutation choice with others to form a single choice

random_variant(sequence)[source]¶: Return one of the variants, randomly.

class dnachisel.MutationSpace.MutationSpace(choices_index, left_padding=0)[source]¶

Class for mutation space (set of sequence segments and their variants).

Parameters

choices_index: A list L such that L[i] gives the MutationChoice governing the mutations allowed at position i (and possibly around i).

Examples

>>> # BEWARE: below, similar mutation choices are actually the SAME OBJECT
>>> space = MutationSpace([
        MutationChoice((0, 2), {'AT', 'TG'}),
        MutationChoice((0, 2), {'AT', 'TG'}),
        MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}), # same
        MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}), #
        MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}),
    ])

all_variants(sequence)[source]¶: Iterate through all sequence variants in this mutation space.

apply_random_mutations(n_mutations, sequence)[source]¶: Return a sequence with n random mutations applied.

property choices_span¶: Return (start, end), segment where multiple choices are possible.

constrain_sequence(sequence)[source]¶

Return a version of the sequence compatible with the mutation space.

All nucleotides of the sequence that are incompatible with the mutation space are replaced by nucleotides compatible with the space.

static from_optimization_problem(problem, new_constraints=None)[source]¶

Create a mutation space from a DNA optimization problem.

This can be used to initialize mutation spaces for new problems.

localized(location)[source]¶: Return a new version with only mutations overlapping the location.

pick_random_mutations(n_mutations, sequence)[source]¶: Draw N random mutations.

property space_size¶: Return the number of possible mutations.

string_representation(separator='|')[source]¶

Generates a string of the mutation space.

Examples

>>> print (mutation_space.string_representation())
>>> A|G|ATG|CA|A|CA|T|GTC|AC|T|A|C|T|T|T
>>> T|C|   |  |G|  |C|   |  |C|T|A|C|A|C
>>> G|A|   |  | |  | |   |  |G|G|G|G|G|G

SequencePattern ¶

Specifications AvoidPattern, EnforcePatternOccurence accept a Pattern as argument. The pattern can be provided either as a string or as a class:

Name	Pattern Class	As string
Restriction site	`EnzymeSitePattern('BsmBI')`	`'BsmBI_site'`
Homopolymer	`HomopolymerPattern("A", 6)`	`'6xA'`
Direct repeats	`RepeatedKmerPattern(3, 2)`	`'3x2mer'`
DNA notation	`DnaNotationPattern('ANNKA')`	`'ANNKA'`
Regular Expression	`SequencePattern('A[CG]*A')`	`'A[CG]*A'`

SequencePattern class ¶

class dnachisel.SequencePattern(expression, size=None, name=None, lookahead='loop', is_palyndromic=False)[source]¶

Pattern/ that will be looked for in a DNA sequence.

Use this class for matching regular expression patterns, and DnaNotationPattern for matching explicit sequences or sequences using Ns etc.

Parameters

expression: Any string or regular expression for matching ATGC nucleotides. Note that multi-nucleotides symbols such as “N” (for A-T-G-C), or “K” are not supported by this class, see DnaNotationPattern instead.
size: Size of the pattern, in number of characters (if none provided, the size of the pattern string is used). The size is used to determine the size of windows when performing local optimization and constraint solving. It can be important to provide the size when the pattern string provided represents a complex regular expression whose maximal matching size cannot be easily evaluated.
name: Name of the pattern (will be displayed e.g. when the pattern is printed)

Examples

>>> expression = "A[ATGC]{3,}"
>>> pattern = SequencePattern(expression)
>>> constraint = AvoidPattern(pattern)

find_matches(sequence, location=None, forced_strand=None)[source]¶

Return the locations where the sequence matches the expression.

Parameters

sequence: A string of “ATGC…”
location: Location indicating a segment to which to restrict the search. Only patterns entirely included in the segment will be returned

Returns

matches: A list of the locations of matches, of the form [(start1, end1), (start2, end2),...].

DnaNotationPattern ¶

class dnachisel.DnaNotationPattern(sequence, name=None)[source]¶

Class for patterns in plain DNA notation: ATTGCCA, GCNNKTA, etc.

If the sequence is not palindromic, the pattern will be looked for in both strands of sequences.

all_variants()[source]¶: Return all ATGC sequence variants of a sequence

static dna_sequence_to_regexpr(sequence)[source]¶: Return a regular expression to find the pattern in a sequence.

MotifPssmPattern ¶

class dnachisel.MotifPssmPattern(pssm, threshold=None, relative_threshold=None)[source]¶

Motif pattern represented by a Position-Specific Scoring Matrix (PSSM).

This class is better instantiated by creating a pattern from a set of sequence strings with MotifPssmPattern.from_sequences(), or reading pattern(s) for a file with MotifPssmPattern.from_file().

Parameters

pssm: A Bio.Align.AlignInfo.PSSM object.
threshold: locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
relative_threshold: Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.

classmethod apply_pseudocounts(motif, pseudocounts)[source]¶

Add pseudocounts to the motif’s pssm matrix.

Add nothing if pseudocounts is None, auto-compute pseudocounts if pseudocounts=”jaspar”, else just attribute the pseudocounts as-is to the motif.

find_matches_in_string(sequence)[source]¶: Find all positions where the PSSM score is above threshold.

classmethod from_sequences(sequences, name='unnamed', pseudocounts='jaspar', threshold=None, relative_threshold=None)[source]¶

Return a PSSM pattern computed from same-length sequences.

Parameters

sequences: A list of same-length sequences.
name: Name to give to the pattern (will appear in reports etc.).
pseudocounts: Either a dict {“A”: 0.01, “T”: …} or “jaspar” for automatic pseudocounts from the Biopython.motifs.jaspar module (recommended), or None for no pseudocounts at all (not recommended!).
threshold: locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
relative_threshold: Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.

classmethod list_from_file(motifs_file, file_format, threshold=None, pseudocounts='jaspar', relative_threshold=None)[source]¶

Return a list of PSSM patterns from a file in JASPAR, MEME, etc.

Parameters

motifs_file: Path to a motifs file, or file handle.
file_format: File format. one of “jaspar”, “meme”, “TRANSFAC”.
pseudocounts: Either a dict {“A”: 0.01, “T”: …} or “jaspar” for automatic pseudocounts from the Biopython.motifs.jaspar module (recommended), or None for no pseudocounts at all (not recommended!).
threshold: locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
relative_threshold: Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.

EnzymeSitePattern ¶

class dnachisel.EnzymeSitePattern(enzyme_name)[source]¶

Class to represent Enzyme site patterns

Examples

>>> enzyme_pattern = EnzymeSitePattern("BsaI")
>>> constraint = AvoidPattern(enzyme_pattern)

static from_string(string)[source]¶: Convert BsmBI_site to EnzymeSitePattern(BsmBI)

RepeatedKmerPattern ¶

class dnachisel.RepeatedKmerPattern(n_repeats, kmer_size)[source]¶

Direct repeats like ATT-ATT, ATGC-ATGC-ATGC, etc.

Shorthand string version: “3x4mer”, “5x2mer”, etc.

Examples

>>> RepeatedKmerPattern(3, 2) # dimers repeated 3 times

HomopolymerPattern ¶

class dnachisel.HomopolymerPattern(nucleotide, number)[source]¶

Homopolymer of the form AAAAAAA, TTTTT, etc.

Shorthand string version: “7xA”, “9xC”, etc.

Examples

>>> pattern = HomopolymerPattern("A", 6)
>>> constraint = AvoidPattern(pattern)