peppr.find_optimal_match
- peppr.find_optimal_match(reference: AtomArray, pose: AtomArray, min_sequence_identity: float = 0.95, use_heuristic: bool = True, max_matches: int | None = None, allow_unmatched_entities: bool = False, use_entity_annotation: bool = False, use_structure_match: bool = False) -> tuple[AtomArray, AtomArray]
Match the atoms from the given reference and pose structure so that the RMSD between them is minimized.
‘Matching’ has two effects here:
- Chains and atoms within each residue that have a counterpart in the other structure are reordered, if necessary, so that they appear in the same order in both structures.
- A matched annotation is added, which is False for all atoms that do not have a counterpart.
- Parameters:
- reference : AtomArray, shape=(p,)
The reference structure.
- pose : AtomArray, shape=(q,)
The pose structure.
- min_sequence_identity : float, optional
The minimum sequence identity between two chains to be considered the same entity.
- use_heuristic : bool or int, optional
Whether to employ a fast heuristic [1] to find the optimal chain permutation. This heuristic represents each chain by its centroid, i.e. instead of exhaustively superimposing all atoms for each permutation, only the centroids are superimposed and the closest match between the reference and pose is selected.
- max_matches : int, optional
The maximum number of atom mappings to try if use_heuristic is set to False.
- allow_unmatched_entities : bool, optional
If set to True, allow entire entities to be unmatched. This is useful if a pose is compared to a reference which may contain different molecules.
- use_entity_annotation : bool, optional
If set to True, use the entity_id annotation to determine which chains are the same entity and are therefore mappable to each other. By default, the entity is determined from sequence identity for polymers and residue name for small molecules.
- use_structure_match : bool, optional
If set to True, use structure matching, i.e. isomorphism on the bond graph, to determine which small molecules are the same entity. Otherwise, match small molecules by residue name. Note that structure matching requires more computation time.
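The centroid-based idea behind use_heuristic can be sketched with NumPy. This is a simplified illustration and not the implementation used by peppr: the helper names superimpose, rmsd and best_chain_permutation are hypothetical, every chain is assumed to be interchangeable with every other chain, and all permutations of the centroids are simply tried by brute force.

```python
import itertools

import numpy as np


def superimpose(mobile, fixed):
    """Rigidly fit `mobile` onto `fixed` (Kabsch algorithm, row vectors)."""
    fixed_center = fixed.mean(axis=0)
    mob = mobile - mobile.mean(axis=0)
    fix = fixed - fixed_center
    # Optimal rotation from the SVD of the covariance matrix
    u, _, vt = np.linalg.svd(mob.T @ fix)
    # Flip one axis if the result would otherwise be a reflection
    d = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    return mob @ rotation + fixed_center


def rmsd(a, b):
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))


def best_chain_permutation(ref_centroids, pose_centroids):
    """Find the pose chain order whose centroids fit the reference best."""
    best_order, best_rmsd = None, np.inf
    for order in itertools.permutations(range(len(ref_centroids))):
        # One point per chain: the superposition cost does not depend
        # on the number of atoms in the chains
        fitted = superimpose(pose_centroids[list(order)], ref_centroids)
        score = rmsd(fitted, ref_centroids)
        if score < best_rmsd:
            best_order, best_rmsd = order, score
    return best_order, best_rmsd


# Made-up centroids of four chains (assumption: arbitrary toy coordinates)
ref = np.array(
    [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 15.0, 0.0], [0.0, 0.0, 25.0]]
)
# The pose is the reference, rotated by 90 degrees about z and translated,
# with its chains listed in a shuffled order
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
shuffle = [2, 0, 3, 1]
pose = (ref @ rot_z.T + np.array([5.0, -3.0, 2.0]))[shuffle]

order, score = best_chain_permutation(ref, pose)
print(order)  # (1, 3, 0, 2), the inverse of `shuffle`
```

Because each chain contributes only a single point, every trial superposition is cheap regardless of how many atoms the chains contain.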
- Returns:
- matched_reference, matched_pose : AtomArray, shape=(p,) or (q,)
The input atoms, where the chains and atoms within each residue are brought into the corresponding order. Atoms that are matched between the reference and the pose are annotated with matched=True. All other atoms are annotated with matched=False. This means that indexing both structures with matched as a boolean mask returns structures with the same number of atoms.
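The effect of the matched mask can be illustrated with a toy example in plain Python. This sketch does not call peppr: it matches the atoms of a single residue by name only, assumes that atom names are unique within the residue, and uses a hypothetical helper match_residue_atoms.

```python
def match_residue_atoms(ref_names, pose_names):
    """Toy atom matching within one residue, by atom name (illustration only)."""
    common = set(ref_names) & set(pose_names)
    ref_matched = [name in common for name in ref_names]
    pose_matched = [name in common for name in pose_names]
    # Matched atoms in the order in which they occur in the reference
    in_ref_order = iter([name for name in ref_names if name in common])
    # Matched slots are refilled in reference order,
    # unmatched atoms keep their original slot
    pose_reordered = [
        next(in_ref_order) if is_matched else name
        for name, is_matched in zip(pose_names, pose_matched)
    ]
    return ref_matched, pose_reordered, pose_matched


ref_names = ["N", "CA", "C", "O", "CB"]    # "CB" has no counterpart in the pose
pose_names = ["CA", "N", "O", "C", "OXT"]  # "OXT" has no counterpart in the reference
ref_matched, pose_reordered, pose_matched = match_residue_atoms(
    ref_names, pose_names
)

# Filter with the masks, as one would with a boolean `matched` annotation
ref_subset = [n for n, m in zip(ref_names, ref_matched) if m]
pose_subset = [n for n, m in zip(pose_reordered, pose_matched) if m]

print(pose_reordered)             # ['N', 'CA', 'C', 'O', 'OXT']
print(ref_subset == pose_subset)  # True
```

After filtering, both lists contain the same atoms in the same order, which is the property that allows an atom-wise comparison such as an RMSD.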
Notes
Atoms that are not matched (matched=False) are positioned in the reordered return value as follows:
- Unmatched chains are appended to the end.
- Unmatched residues within a matched chain are kept at their original sequence position.
- Unmatched atoms within a matched residue are kept at their original position.
Note that the heuristic used by default is much faster than the exhaustive approach: especially for larger complexes with many homomers or small molecule copies, the number of possible mappings explodes combinatorially. However, the heuristic might not find the optimal permutation in all cases, especially for poses that only remotely resemble the reference.
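To get a feeling for this growth, the number of chain mappings can be estimated with a few lines of Python. The sketch assumes that reference and pose contain the same chains, that every chain is matched, and that only chains of the same entity are interchangeable. Symmetric atom mappings within small molecules are ignored, and the helper count_chain_mappings is hypothetical.

```python
from collections import Counter
from math import factorial, prod


def count_chain_mappings(entity_ids):
    """Number of chain permutations, given one entity ID per chain."""
    copies = Counter(entity_ids)
    # Chains of the same entity can be permuted freely among each other
    return prod(factorial(n) for n in copies.values())


print(count_chain_mappings(["A", "B"]))              # 1
print(count_chain_mappings(["A"] * 4 + ["B"] * 4))   # 576
print(count_chain_mappings(["A"] * 12))              # 479001600
```

A heterodimer has a single mapping, whereas a homo-12-mer already has 12! mappings under these assumptions.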
References
[1] Protein complex prediction with AlphaFold-Multimer, Section 7.3, https://doi.org/10.1101/2021.10.04.463034