Core Classes¶
DnaOptimizationProblem¶
-
class
dnachisel.
DnaOptimizationProblem
(sequence, constraints=None, objectives=None, logger='bar', mutation_space=None)[source]¶ Problem specifications: sequence, constraints, optimization objectives.
The original constraints, objectives, and original sequence of the problem are stored in the DNA problem. This class also has methods to display reports on the constraints and objectives, as well as solving the constraints and objectives.
- Parameters
- sequence
A string of ATGC characters (they must be upper case!), e.g. “ATTGTGTA”
- constraints
A list of objects of type Specification.
- objectives
A list of objects of type Specification specifying what must be optimized in the problem. Note that each objective has a float boost parameter. The larger the boost, the more the objective is taken into account during the optimization.
- logger
Either None for no logger, ‘bar’ for a tqdm progress bar logger, or any ProgLog progress bar logger.
- mutation_space
A MutationSpace indicating the possible mutations. In most cases the mutation space will be left to None and computed at problem initialization (which can be slightly compute-intensive); however, some core DNA Chisel methods will create optimization problems with a provided mutation_space to save computing time.
Notes
The dictionary self.possible_mutations is of the form {location1: list1, location2: list2...} where location is either a single index (e.g. 10) indicating the position of a nucleotide to be mutated, or a couple (start, end) indicating a whole segment whose sub-sequence should be replaced. The lists are lists of possible sequences to replace each location, e.g. for the mutation of a whole codon (3,6): ["ATT", "ACT", "AGT"].
Examples
>>> from dnachisel import *
>>> problem = DnaOptimizationProblem(
>>>     sequence = "ATGCGTGTGTGC...",
>>>     constraints = [constraint1, constraint2, ...],
>>>     objectives = [objective1, objective2, ...]
>>> )
>>> problem.resolve_constraints()
>>> problem.optimize()
>>> print(problem.constraints_text_summary())
>>> print(problem.objectives_text_summary())
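The {location: list} form described in the Notes can be pictured with a small pure-Python sketch. The helper apply_mutation and the sample data below are hypothetical (they are not part of DNA Chisel); they only illustrate how an index key and a (start, end) key each map to replacement sequences.

```python
# Hypothetical data in the {location: list} form described in the Notes.
possible_mutations = {
    10: ["A", "T", "G", "C"],        # single nucleotide at index 10
    (3, 6): ["ATT", "ACT", "AGT"],   # whole codon covering indices 3, 4, 5
}

def apply_mutation(sequence, location, replacement):
    """Return a copy of `sequence` with `location` replaced (illustrative helper)."""
    if isinstance(location, int):
        start, end = location, location + 1
    else:
        start, end = location
    if len(replacement) != end - start:
        raise ValueError("replacement must have the same length as the location")
    return sequence[:start] + replacement + sequence[end:]

sequence = "ATGCGTGTGTGC"
# Replace the codon at (3, 6) with the first listed variant.
mutated = apply_mutation(sequence, (3, 6), possible_mutations[(3, 6)][0])
```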
- Attributes
- randomization_threshold
The algorithm will use an exhaustive search when the size of the mutation space (= the number of possible variants) is below this threshold, and a (guided) random search when it is above.
- max_random_iters
When using a random search, stop after this many iterations
- mutations_per_iteration
When using a random search, produce this many sequence mutations each iteration.
- optimization_stagnation_tolerance
When using a random search, stop if the score hasn’t improved in the last “this many” iterations
- local_extensions
Try local resolution several times if it fails, increasing the mutable zone by [N1, N2…] nucleotides on each side, until resolution works (by default, an extension of 0bp is tried, then 5bp).
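As a rough illustration of how a size threshold can switch between the two search modes, here is a minimal sketch. The function search_best, its default values, and the unguided sampling are all assumptions made for the example; they do not reproduce DNA Chisel's solver or its defaults.

```python
from itertools import product
import random

def search_best(variants_per_site, score, randomization_threshold=10000,
                max_random_iters=1000, seed=0):
    """Pick the best-scoring combination of variants (illustrative only)."""
    space_size = 1
    for variants in variants_per_site:
        space_size *= len(variants)
    if space_size <= randomization_threshold:
        # Small space: enumerate every combination.
        candidates = ("".join(combo) for combo in product(*variants_per_site))
        return max(candidates, key=score), "exhaustive"
    # Large space: sample a bounded number of random combinations.
    rng = random.Random(seed)
    best = None
    for _ in range(max_random_iters):
        candidate = "".join(rng.choice(v) for v in variants_per_site)
        if best is None or score(candidate) > score(best):
            best = candidate
    return best, "random"

def gc_count(seq):
    return seq.count("G") + seq.count("C")

best, mode = search_best([["A", "G"], ["T", "C"]], gc_count)
```

Lowering randomization_threshold below the space size (4 in this toy case) makes the same call take the random branch instead.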
-
number_of_edits
()[source]¶ Return the number of nucleotide differences between the original and current sequence.
-
optimize_with_report
(target, project_name='My project', file_path=None, file_content=None)[source]¶ Resolve constraints, optimize objectives, write a multi-file report.
The report’s content may vary depending on the optimization’s success.
- Parameters
- target
Either a path to a folder that will contain the report, or a path to a zip archive, or “@memory” to return raw data of a zip archive containing the report.
- project_name
Project name to write on PDF reports
- Returns
- (success, message, zip_data)
Triplet where success is True/False and message is a one-line string summary indicating whether some clash was found, or some solution, or maybe no solution was found because the random searches were too short.
CircularDnaOptimizationProblem¶
-
class
dnachisel.
CircularDnaOptimizationProblem
(sequence, constraints=None, objectives=None, logger='bar', mutation_space=None)[source]¶ Class for solving circular DNA optimization problems.
The high-level interface is the same as for DnaOptimizationProblem: initialization, resolve_constraints(), and optimize() work the same way.
-
all_constraints_pass
(autopass=True)[source]¶ Return whether the current problem sequence passes all constraints.
-
constraints_evaluations
(autopass=True)[source]¶ Return the evaluation of constraints.
The autopass option makes it possible to simply assume that constraints enforced by the mutation space are verified.
-
optimize_with_report
(target, project_name='My project', file_path=None, file_content=None)[source]¶ Resolve constraints, optimize objectives, write a multi-file report.
WARNING: in case of optimization failure, the report generated will show a “pseudo-circular” sequence formed by concatenating the sequence with itself three times.
TODO: fix the warning above, at some point?
The report’s content may vary depending on the optimization’s success.
- Parameters
- target
Either a path to a folder that will contain the report, or a path to a zip archive, or “@memory” to return raw data of a zip archive containing the report.
- project_name
Project name to write on PDF reports
- Returns
- (success, message, zip_data)
Triplet where success is True/False and message is a one-line string summary indicating whether some clash was found, or some solution, or maybe no solution was found because the random searches were too short.
-
resolve_constraints
(final_check=True)[source]¶ Solve a particular constraint using local, targeted searches.
- Parameters
- constraint
The Specification object for which the sequence should be solved.
- final_check
If True, a final check that all constraints pass will be run at the end of the process, when constraints have been resolved one by one, to check that the solving of one constraint didn’t undo the solving of another.
- cst_filter
An optional filter function (constraint => True/False) to only resolve a subset of the constraints.
-
Specification¶
-
class
dnachisel.
Specification
(evaluate=None, boost=1.0)[source]¶ General class to define specifications to optimize.
Note that all specifications have a boost attribute that is a multiplier that will be used when computing the global specification score of a problem with problem.all_objectives_score().
New types of specifications are defined by subclassing Specification and providing custom evaluate and localized methods.
- Parameters
- evaluate
function (sequence) => SpecEvaluation
- boost
Relative importance of the Specification’s score in a multi-specification problem.
- Attributes
- best_possible_score
Best score that the specification can achieve. Used by the optimization algorithm to understand when no more optimization is required.
- optimize_passively (boolean)
Indicates that there should not be a pass of the optimization algorithm to optimize this objective. Instead, this objective is simply taken into account when optimizing other objectives.
- enforced_by_nucleotide_restrictions (boolean)
When the spec is used as a constraint, this indicates that the constraint will initially restrict the mutation space in a way that ensures that the constraint will always be valid. The constraint does not need to be evaluated again, which speeds up the resolution algorithm.
- priority
Value used to sort the specifications and solve/optimize them in order, with highest priority first.
- shorthand_name
Shorter name for the specification class that will be recognized when parsing annotations from Genbank files.
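To make the role of boost concrete, the sketch below combines per-objective scores into one number. It assumes the global score is simply the sum of boost × score; the text only states that boost is a multiplier, so the exact aggregation, the function name, and the numbers are illustrative assumptions.

```python
def weighted_global_score(evaluations):
    """Sum of boost * score over (boost, score) pairs (assumed aggregation)."""
    return sum(boost * score for boost, score in evaluations)

# Two hypothetical objectives with different boosts.
total = weighted_global_score([(1.0, -4.0), (2.0, -1.5)])
```

Under this assumption, doubling an objective's boost doubles its weight in the total, so improvements on that objective count twice as much.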
-
as_passive_objective
()[source]¶ Return a copy with optimize_passively set to true.
“Optimize passively” means that when the specification is used as an objective, the solver will not do a specific pass to optimize this specification, however this specification’s score will be taken into account in the global score when optimizing other objectives, and may therefore influence the final sequence.
-
breach_label
()[source]¶ Shorter, less precise label to be used in tables, reports, etc.
This is meant for specifications such as EnforceGCContent(0.4, 0.6) to be represented as ‘40-60% GC’ in report tables, etc.
-
copy_with_changes
(**kwargs)[source]¶ Return a copy of the Specification with modified properties.
For instance new_spec = spec.copy_with_changes(boost=10).
-
initialized_on_problem
(problem, role='constraint')[source]¶ Complete specification initialization when the sequence gets known.
Some specifications like to know what their role is and on which sequence they are employed before they complete some values.
-
label
(role=None, with_location=True, assignment_symbol=':', use_short_form=False, use_breach_form=False)[source]¶ Return a string label for this specification.
- Parameters
- role
Either ‘constraint’ or ‘objective’ (for prefixing the label with @ or ~), or None.
- with_location
If true, the location will appear in the label.
- assignment_symbol
Indicates whether to use “:” or “=” or anything else when indicating parameters values.
- use_short_form
If True, the label will use self.short_label(), so for instance AvoidPattern(BsmBI_site) will become “no BsmBI”. How this is handled is dependent on the specification.
-
label_parameters
()[source]¶ In subclasses, returns a list of the creation parameters.
For instance [(‘pattern’, ‘ATT’), (‘occurences’, 2)]
-
localized
(location, problem=None)[source]¶ Return a modified version of the specification for the case where sequence modifications are only performed inside the provided location.
For instance, if a specification concerns local GC content, and we are only making local mutations to destroy a restriction site, then we only need to check the local GC content around the restriction site after each mutation (and not compute it for the whole sequence), so EnforceGCContent.localized(location) will return a specification that only looks for GC content around the provided location.
If a specification concerns a DNA segment that is completely disjoint from the provided location, this must return None.
Must return an object of class Constraint.
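The idea of localization can be sketched without the library: when only a small location is mutated, a windowed GC check only needs to look at the windows that overlap it. The helpers gc_content and localized_window below are hypothetical names, and the window arithmetic is an assumption chosen for illustration.

```python
def gc_content(sequence):
    """Fraction of G and C nucleotides in `sequence`."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def localized_window(location, window, sequence_length):
    """Span covering every window of size `window` that overlaps `location`.

    Returns None when the span is empty (illustrative helper).
    """
    start, end = location
    local_start = max(0, start - window + 1)
    local_end = min(sequence_length, end + window - 1)
    return (local_start, local_end) if local_start < local_end else None

sequence = "ATATATGCGCGCGCATATAT"
# Only the region around the mutated location (8, 10) needs re-checking.
span = localized_window((8, 10), window=6, sequence_length=len(sequence))
local_gc = gc_content(sequence[span[0]:span[1]])
```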
-
restrict_nucleotides
(sequence, location=None)[source]¶ Restrict the mutation space to speed up optimization.
This method only kicks in when this specification is used as a constraint. By default it does nothing, but subclasses such as EnforceTranslation, AvoidChanges, EnforceSequence, etc. have custom methods.
In the code, this method is run during the initialize() step of DnaOptimizationProblem, when the MutationSpace is created for each constraint.
SpecEvaluation¶
Store relevant infos about the evaluation of an objective on a problem.
Examples¶
>>> evaluation_result = SpecEvaluation(
>>>     specification=specification,
>>>     problem=problem,
>>>     score=evaluation_score,  # float
>>>     locations=[Locations list...],
>>>     message="Score: 42 (found 42 sites)"
>>> )
Parameters¶
- specification
The Specification that was evaluated.
- problem
The problem that the objective was evaluated on.
- score
The score associated to the evaluation.
- locations
A list of couples (start, end) indicating the locations on which the optimization should be localized to improve the objective.
- message
A message that will be returned by str(evaluation). It will notably be displayed by problem.print_objectives_summaries.
-
dnachisel.SpecEvaluation.
default_message
¶ Return the default message for console/reports.
Location¶
Implements the Location class.
The class has useful methods for finding overlaps between locations, extract a subsequence from a sequence, etc.
-
class
dnachisel.Location.
Location
(start, end, strand=0)[source]¶ Represent a segment of a sequence, with a start, end, and strand.
Warning: we use Python’s slicing notation, so Location(5, 10) represents sequence[5:10], i.e. indices 5, 6, 7, 8, 9, which correspond to nucleotides number 6, 7, 8, 9, 10.
The data structure is similar to a Biopython’s FeatureLocation, but with different methods for different purposes.
- Parameters
- start
Lowest position index of the segment.
- end
Highest position index of the segment (excluded, following the slicing convention).
- strand
Either 1 or -1 for sense or anti-sense orientation.
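The slicing convention in the warning is easy to check with plain Python strings; the sequence below is an arbitrary example, and no DNA Chisel object is involved.

```python
sequence = "ATGCATGCATGCATGC"
start, end = 5, 10

# Python slicing includes `start` and excludes `end`.
segment = sequence[start:end]                  # indices 5, 6, 7, 8, 9
one_based = list(range(start + 1, end + 1))    # nucleotide numbers 6..10
```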
-
extended
(extension_length, lower_limit=0, upper_limit=None, left=True, right=True)[source]¶ Extend the location by a few base pairs on each side.
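A minimal sketch of what such an extension could look like on plain (start, end) pairs follows. It assumes that the limits clip the result and that left/right switch each side on or off; this is an illustrative reading of the signature, not the library's code.

```python
def extended(start, end, extension_length, lower_limit=0, upper_limit=None,
             left=True, right=True):
    """Return (start, end) widened by `extension_length`, clipped to the limits."""
    new_start = max(lower_limit, start - extension_length) if left else start
    new_end = end + extension_length if right else end
    if upper_limit is not None:
        new_end = min(upper_limit, new_end)
    return new_start, new_end

# Extending (3, 8) by 5 on a 10-nucleotide sequence clips at both ends.
clipped = extended(3, 8, 5, upper_limit=10)
```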
-
static
from_biopython_location
(location)[source]¶ Return a DnaChisel Location from a Biopython location.
-
static
from_data
(location_data)[source]¶ Return a location, from varied data formats.
This method is used in particular in every built-in specification to quickly standardize the input location.
location_data can be a tuple (start, end) or (start, end, strand), or a Biopython FeatureLocation, or a Location instance. In any case, a new Location object will be returned.
-
static
merge_overlapping_locations
(locations)[source]¶ Return a list of locations obtained by merging all overlapping locations.
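Merging overlapping segments is a standard interval operation. The sketch below works on plain (start, end) tuples with exclusive ends; treating merely adjacent spans as non-overlapping is an assumption, and the function name is hypothetical.

```python
def merge_overlapping(spans):
    """Merge overlapping (start, end) spans; ends are exclusive."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

merged = merge_overlapping([(3, 8), (0, 5), (10, 12)])
```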
-
overlap_region
(other_location)[source]¶ Return the overlap span between two locations (None if they do not overlap).
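On plain (start, end) tuples with exclusive ends, the overlap of two segments can be computed as below. This is a sketch of the general technique with a hypothetical function name, not the method's actual implementation.

```python
def overlap_region(span_a, span_b):
    """Return the (start, end) overlap of two spans, or None if disjoint."""
    start = max(span_a[0], span_b[0])
    end = min(span_a[1], span_b[1])
    return (start, end) if start < end else None

overlap = overlap_region((0, 10), (5, 15))
```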
MutationSpace¶
-
class
dnachisel.MutationSpace.
MutationChoice
(segment, variants, is_any_nucleotide=False)[source]¶ Represent a segment of a sequence with several possible variants.
- Parameters
- segment
A pair (start, end) indicating the range of nucleotides concerned. We are applying Python's range convention, so the end position is excluded.
- variants
A set of sequence variants, at the given position
Examples
>>> choice = MutationChoice((70, 73), {})
-
extract_varying_region
()[source]¶ Return MutationChoices for the central varying region and 2 flanks.
For instance:
>>> choice = MutationChoice((5, 12), [
>>>     'ATGCGTG',
>>>     'AAAAATG',
>>>     'AAATGTG',
>>>     'ATGAATG',
>>> ])
>>> choice.extract_varying_region()
Result :
>>> [
>>>     MutChoice(5-6 A),
>>>     MutChoice(6-10 TGCG-AATG-TGAA-AAAA),
>>>     MutChoice(10-12 TG)
>>> ]
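The split shown in this example can be reproduced with a common-prefix / common-suffix computation on plain strings. The function split_varying_region is a hypothetical stand-in written for illustration; it returns tuples rather than MutationChoice objects.

```python
import os

def split_varying_region(start, variants):
    """Split same-length variants into (left flank, varying core, right flank)."""
    variants = list(variants)
    length = len(variants[0])
    prefix = os.path.commonprefix(variants)
    # Compute the common suffix on what remains after the prefix,
    # so that the two flanks can never overlap.
    tails = [v[len(prefix):] for v in variants]
    suffix = os.path.commonprefix([t[::-1] for t in tails])[::-1]
    core_start = start + len(prefix)
    core_end = start + length - len(suffix)
    core = {v[len(prefix):length - len(suffix)] for v in variants}
    return (
        ((start, core_start), prefix),
        ((core_start, core_end), core),
        ((core_end, start + length), suffix),
    )

left, core, right = split_varying_region(
    5, ["ATGCGTG", "AAAAATG", "AAATGTG", "ATGAATG"]
)
```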
-
class
dnachisel.MutationSpace.
MutationSpace
(choices_index, left_padding=0)[source]¶ Class for mutation space (set of sequence segments and their variants).
- Parameters
- choices_index
A list L such that L[i] gives the MutationChoice governing the mutations allowed at position i (and possibly around i).
Examples
>>> # BEWARE: below, similar mutation choices are actually the SAME OBJECT
>>> space = MutationSpace([
        MutationChoice((0, 2), {'AT', 'TG'}),
        MutationChoice((0, 2), {'AT', 'TG'}),
        MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}), # same
        MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}), #
        MutationChoice((2, 5), {'TTC', 'TTA', 'TTT'}),
    ])
-
apply_random_mutations
(n_mutations, sequence)[source]¶ Return a sequence with n random mutations applied.
-
property
choices_span
¶ Return (start, end), segment where multiple choices are possible.
-
constrain_sequence
(sequence)[source]¶ Return a version of the sequence compatible with the mutation space.
All nucleotides of the sequence that are incompatible with the mutation space are replaced by nucleotides compatible with the space.
-
static
from_optimization_problem
(problem, new_constraints=None)[source]¶ Create a mutation space from a DNA optimization problem.
This can be used to initialize mutation spaces for new problems.
-
property
space_size
¶ Return the number of possible mutations.
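One plausible way to count the variants of a space is to multiply the number of variants of each independent segment. The sketch below makes that assumption explicit; the (segment, variants) tuples are hypothetical stand-ins for MutationChoice objects, and the library's exact definition of space_size may differ.

```python
from math import prod

# Hypothetical choices as (segment, variants) pairs, each segment listed once.
choices = [
    ((0, 2), {"AT", "TG"}),
    ((2, 5), {"TTC", "TTA", "TTT"}),
]

# Assumed definition: product of the variant counts of independent segments.
space_size = prod(len(variants) for _segment, variants in choices)
```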
SequencePattern¶
The specifications AvoidPattern and EnforcePatternOccurence accept a Pattern as argument. The pattern can be provided either as a string or as a class:
| Name | Pattern Class | As string |
|---|---|---|
| Restriction site | | |
| Homopolymer | | |
| Direct repeats | | |
| DNA notation | | |
| Regular Expression | | |
SequencePattern class¶
-
class
dnachisel.
SequencePattern
(expression, size=None, name=None, lookahead='loop', is_palyndromic=False)[source]¶ Pattern that will be looked for in a DNA sequence.
Use this class for matching regular expression patterns, and DnaNotationPattern for matching explicit sequences or sequences using Ns etc.
- Parameters
- expression
Any string or regular expression for matching ATGC nucleotides. Note that multi-nucleotide symbols such as “N” (for A-T-G-C), or “K” are not supported by this class, see DnaNotationPattern instead.
- size
Size of the pattern, in number of characters (if none provided, the size of the pattern string is used). The size is used to determine the size of windows when performing local optimization and constraint solving. It can be important to provide the size when the pattern string provided represents a complex regular expression whose maximal matching size cannot be easily evaluated.
- name
Name of the pattern (will be displayed e.g. when the pattern is printed)
Examples
>>> expression = "A[ATGC]{3,}"
>>> pattern = SequencePattern(expression)
>>> constraint = AvoidPattern(pattern)
-
find_matches
(sequence, location=None, forced_strand=None)[source]¶ Return the locations where the sequence matches the expression.
- Parameters
- sequence
A string of “ATGC…”
- location
Location indicating a segment to which to restrict the search. Only patterns entirely included in the segment will be returned
- Returns
- matches
A list of the locations of matches, of the form [(start1, end1), (start2, end2),...].
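When searching a DNA sequence with a regular expression, overlapping occurrences are easy to miss because a plain scan resumes after each match. One common technique, shown below with Python's re module, wraps the expression in a zero-width lookahead. This is a general regex idiom offered as a sketch; it is not claimed to be how the lookahead parameter of SequencePattern works.

```python
import re

def find_overlapping_matches(expression, sequence):
    """Return (start, end) spans of all matches, including overlapping ones."""
    # A zero-width lookahead with a capture group reports a match at every
    # start position instead of skipping past consumed characters.
    lookahead = re.compile("(?=(%s))" % expression)
    return [(m.start(1), m.end(1)) for m in lookahead.finditer(sequence)]

matches = find_overlapping_matches("ATA", "GATATAC")
# For comparison, a plain scan only reports non-overlapping matches.
plain = [(m.start(), m.end()) for m in re.finditer("ATA", "GATATAC")]
```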
DnaNotationPattern¶
MotifPssmPattern¶
-
class
dnachisel.
MotifPssmPattern
(pssm, threshold=None, relative_threshold=None)[source]¶ Motif pattern represented by a Position-Specific Scoring Matrix (PSSM).
This class is better instantiated by creating a pattern from a set of sequence strings with MotifPssmPattern.from_sequences(), or by reading pattern(s) from a file with MotifPssmPattern.list_from_file().
- Parameters
- pssm
A Bio.Align.AlignInfo.PSSM object.
- threshold
locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
- relative_threshold
Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.
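The two endpoints described for relative_threshold (0 matches everything, 1 matches only the top-scoring sequences) are consistent with a linear interpolation between the lowest and highest possible PSSM scores. The sketch below assumes that interpolation; the function name and the score values are hypothetical.

```python
def threshold_from_relative(min_score, max_score, relative_threshold):
    """Interpolate linearly between the lowest and highest possible score."""
    if not 0 <= relative_threshold <= 1:
        raise ValueError("relative_threshold must be between 0 and 1")
    return min_score + relative_threshold * (max_score - min_score)

# With possible scores ranging from -10 to 6, 0.5 lands halfway.
threshold = threshold_from_relative(-10.0, 6.0, 0.5)
```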
-
classmethod
apply_pseudocounts
(motif, pseudocounts)[source]¶ Add pseudocounts to the motif’s pssm matrix.
Add nothing if pseudocounts is None, auto-compute pseudocounts if pseudocounts=”jaspar”, else just assign the pseudocounts as-is to the motif.
-
find_matches_in_string
(sequence)[source]¶ Find all positions where the PSSM score is above threshold.
-
classmethod
from_sequences
(sequences, name='unnamed', pseudocounts='jaspar', threshold=None, relative_threshold=None)[source]¶ Return a PSSM pattern computed from same-length sequences.
- Parameters
- sequences
A list of same-length sequences.
- name
Name to give to the pattern (will appear in reports etc.).
- pseudocounts
Either a dict {“A”: 0.01, “T”: …} or “jaspar” for automatic pseudocounts from the Biopython.motifs.jaspar module (recommended), or None for no pseudocounts at all (not recommended!).
- threshold
locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
- relative_threshold
Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.
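To show why pseudocounts matter when building a PSSM from a handful of sequences, here is a small pure-Python sketch of a log-odds matrix. It uses a uniform pseudocount and a uniform background of 0.25, which are simplifying assumptions; it does not reproduce the “jaspar” pseudocount scheme or Biopython's implementation, and the function names are hypothetical.

```python
import math

def pssm_from_sequences(sequences, pseudocount=0.5, background=0.25):
    """Build a log-odds PSSM (one dict per position) from same-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    total = len(sequences) + 4 * pseudocount
    pssm = []
    for i in range(length):
        column = [s[i] for s in sequences]
        # The pseudocount keeps unseen nucleotides from getting log(0).
        pssm.append({
            n: math.log2(((column.count(n) + pseudocount) / total) / background)
            for n in "ATGC"
        })
    return pssm

def pssm_score(pssm, word):
    """Sum the per-position log-odds of `word`."""
    return sum(column[n] for column, n in zip(pssm, word))

pssm = pssm_from_sequences(["ATG", "ATG", "ATC"])
```

A word close to the training sequences scores higher than an unrelated one, and a threshold on that score decides which positions count as matches.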
-
classmethod
list_from_file
(motifs_file, file_format, threshold=None, pseudocounts='jaspar', relative_threshold=None)[source]¶ Return a list of PSSM patterns from a file in JASPAR, MEME, etc.
- Parameters
- motifs_file
Path to a motifs file, or file handle.
- file_format
File format, one of “jaspar”, “meme”, “TRANSFAC”.
- pseudocounts
Either a dict {“A”: 0.01, “T”: …} or “jaspar” for automatic pseudocounts from the Biopython.motifs.jaspar module (recommended), or None for no pseudocounts at all (not recommended!).
- threshold
locations of the sequence with a PSSM score above this value will be considered matches. For convenience, a relative_threshold can be given instead.
- relative_threshold
Value between 0 and 1 from which the threshold will be auto-computed. 0 means “match everything”, 1 means “only match the one (or several) sequence(s) with the absolute highest possible score”.