Python API#
This tutorial demonstrates how to use peppr from within Python, from loading structures to evaluating and reporting metrics.
Loading and preparing structures#
peppr leverages the structure representation from the Biotite package, namely the biotite.structure.AtomArray. Consequently, peppr does not provide a way to load or edit structures by itself: all of this is done ‘upstream’ via functionality from Biotite. Have a look at the corresponding Biotite tutorials for a more in-depth introduction to its capabilities.
For the scope of this tutorial we will load our reference structures and predicted poses from CIF files. The files used in this tutorial are proteolysis targeting chimera (PROTAC) complexes predicted from the PLINDER dataset and can be downloaded here.
As input structures for peppr may comprise any number of peptide and nucleotide chains, as well as small molecules, they are called systems throughout this package.
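The code below assumes that path_to_systems is a pathlib.Path pointing to the directory with the downloaded systems; a minimal setup sketch with a placeholder location:

from pathlib import Path

# Placeholder path: adjust to wherever you extracted the downloaded systems
path_to_systems = Path("path/to/systems")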
import biotite.structure.io.pdbx as pdbx
system_dir = path_to_systems / "7jto__1__1.A_1.D__1.J_1.O"
pdbx_file = pdbx.CIFFile.read(system_dir / "reference.cif")
ref = pdbx.get_structure(pdbx_file, model=1, include_bonds=True)
pdbx_file = pdbx.CIFFile.read(system_dir / "poses" / "pose_0.cif")
pose = pdbx.get_structure(pdbx_file, model=1, include_bonds=True)
print(type(ref))
print(type(pose))
<class 'biotite.structure.AtomArray'>
<class 'biotite.structure.AtomArray'>
There are two important things to note here:

- model=1 ensures that only one reference and one pose are loaded from the CIF file, as the file may also contain multiple models. You can omit it if you want peppr to evaluate multiple poses per system, as discussed later.
- include_bonds=True instructs the parser to also load the bonds between atoms. If they are missing, peppr will raise an exception when it sees this system (a workaround for sources without bond information is sketched after this list).

The full list of requirements on input systems is documented in the API reference.
As already indicated above, biotite.structure.AtomArray objects can also be loaded from a variety of other formats, such as PDB and MOL/SDF. Take a look at biotite.structure.io for a comprehensive list.
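For instance, a structure could be read from a PDB file instead; a minimal sketch, where "structure.pdb" is a hypothetical file name:

import biotite.structure.io.pdb as pdb

pdb_file = pdb.PDBFile.read("structure.pdb")
# As with CIF files, include_bonds=True also loads the bonds between atoms
structure = pdb_file.get_structure(model=1, include_bonds=True)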
Evaluation of a metric on a structure#
Now that we have the reference and poses loaded, we can evaluate metrics on them. Each metric is represented by a Metric object.
import peppr
lddt_metric = peppr.GlobalLDDTScore(backbone_only=False)
print("Name of the metric:", lddt_metric.name)
print(
    "Which values are considered better predictions?",
    "Smaller ones." if lddt_metric.smaller_is_better() else "Larger ones."
)
Name of the metric: global all-atom lDDT
Which values are considered better predictions? Larger ones.
The Metric.evaluate() method takes the reference and a pose of a system and returns a scalar value.
print(lddt_metric.evaluate(ref, pose))
0.7892155413495742
Note that Metric.evaluate() requires that the reference and pose have matching atoms, i.e. ref[i] and pose[i] should point to corresponding atoms. If this is not the case, you can use find_optimal_match() to get indices that bring the atoms into the correct order.
ref_indices, pose_indices = peppr.find_optimal_match(ref, pose)
ref = ref[ref_indices]
pose = pose[pose_indices]
Some metrics may not be defined for a given system. For example, if we remove the second protein chain from the current system, the interface RMSD (iRMSD) is undefined. In such cases, Metric.evaluate() returns NaN.
# Remove one protein chain
mask = ref.chain_id != "1"
ref = ref[mask]
pose = pose[mask]
irmsd_metric = peppr.InterfaceRMSD()
print(irmsd_metric.evaluate(ref, pose))
nan
Feeding the evaluator#
Up to now we have looked at each prediction in isolation. However, to evaluate a structure prediction model, one usually wants to run evaluations on many systems and eventually aggregate the results. This purpose is fulfilled by the Evaluator class. Given the desired metrics, predictions can be fed into it one after the other. As a bonus, the Evaluator also takes care of finding the corresponding atoms between the reference structure and its poses, in case some atoms are missing in either one, so you do not have to.
# An evaluator focused on metrics for protein-protein and protein-ligand interactions
evaluator = peppr.Evaluator(
    [
        peppr.DockQScore(),
        peppr.ContactFraction(),
        peppr.InterfaceRMSD(),
        # Only defined for protein-protein interactions
        peppr.LigandRMSD(),
        # Only defined for protein-ligand interactions
        peppr.PocketAlignedLigandRMSD(),
    ]
)
for system_dir in sorted(path_to_systems.iterdir()):
    if not system_dir.is_dir():
        continue
    system_id = system_dir.name
    pdbx_file = pdbx.CIFFile.read(system_dir / "reference.cif")
    ref = pdbx.get_structure(pdbx_file, model=1, include_bonds=True)
    pdbx_file = pdbx.CIFFile.read(system_dir / "poses" / "pose_0.cif")
    pose = pdbx.get_structure(pdbx_file, model=1, include_bonds=True)
    evaluator.feed(system_id, ref, pose)
Finally, we can report the evaluation run as a table that lists every chosen metric for each system (indicated by the given system ID).
evaluator.tabulate_metrics()
| | DockQ | fnat | iRMSD | LRMSD | PLI-LRMSD |
|---|---|---|---|---|---|
| 7jto__1__1.A_1.D__1.J_1.O | 0.54 | 0.75 | 2.66 | 16.82 | 9.36 |
| 7znt__2__1.F_1.G__1.J | 0.74 | 0.53 | 1.89 | 8.09 | 3.21 |
| 8bdt__1__1.A_1.D__1.I | 0.59 | 0.41 | 2.41 | 6.92 | 5.97 |
As mentioned above, you probably also want to aggregate the results over the systems. This can be done by calling Evaluator.summarize_metrics(), which gives a dictionary mapping the metric names to their aggregated values.
for name, value in evaluator.summarize_metrics().items():
    print(f"{name}: {value:.2f}")
DockQ incorrect: 0.00
DockQ acceptable: 0.00
DockQ medium: 1.00
DockQ high: 0.00
DockQ mean: 0.62
DockQ median: 0.59
fnat mean: 0.56
fnat median: 0.53
iRMSD mean: 2.32
iRMSD median: 2.41
LRMSD mean: 10.61
LRMSD median: 8.09
PLI-LRMSD mean: 6.18
PLI-LRMSD median: 5.97
As you can see, for each metric the mean and median values are reported. You might also notice that in the case of DockQScore, the fractions of incorrect, acceptable, medium and high predictions are reported as well. For example, the value for DockQ acceptable reads as ‘<x> of all applicable systems are within the acceptable threshold’, where the thresholds are defined in the thresholds attribute of some Metric classes such as DockQScore.
for bin, lower_threshold in peppr.DockQScore().thresholds.items():
    print(f"{bin}: {lower_threshold:.2f}")
incorrect: 0.00
acceptable: 0.23
medium: 0.49
high: 0.80
Selecting the right pose#
Until now we fed only a single pose per system to the Evaluator. However, many structure prediction models return a number of poses, potentially ranked by their predicted confidence. To reflect this, the Evaluator accepts multiple poses per system - either by passing a list of biotite.structure.AtomArray objects or a biotite.structure.AtomArrayStack, which represents multiple poses with the same atoms.
system_dir = path_to_systems / "7jto__1__1.A_1.D__1.J_1.O"
pdbx_file = pdbx.CIFFile.read(system_dir / "reference.cif")
reference = pdbx.get_structure(pdbx_file, model=1, include_bonds=True)
poses = []
for pose_path in sorted(system_dir.glob("poses/*.cif")):
    pdbx_file = pdbx.CIFFile.read(pose_path)
    poses.append(pdbx.get_structure(pdbx_file, model=1, include_bonds=True))
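Alternatively, since all poses of this system share the same atoms, they could be combined into a single biotite.structure.AtomArrayStack; a minimal sketch:

import biotite.structure as struc

# Combine the poses into one AtomArrayStack (requires identical atoms in every pose)
pose_stack = struc.stack(poses)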
But for each metric only one value is finally reported, so how does the Evaluator choose a single result from multiple poses? The answer is that with the current setup the Evaluator cannot; you will see an exception instead.
evaluator = peppr.Evaluator([peppr.DockQScore()])
evaluator.feed("foo", reference, poses)
evaluator.tabulate_metrics()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[12], line 3
1 evaluator = peppr.Evaluator([peppr.DockQScore()])
2 evaluator.feed("foo", reference, poses)
----> 3 evaluator.tabulate_metrics()
File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/peppr/evaluator.py:293, in Evaluator.tabulate_metrics(self, selectors)
273 def tabulate_metrics(
274 self, selectors: Iterable[Selector] | None = None
275 ) -> pd.DataFrame:
276 """
277 Create a table listing the value for each metric and system.
278
(...)
291 The index is the system ID.
292 """
--> 293 columns = self._tabulate_metrics(selectors)
294 # Convert (metric, selector)-tuples to strings
295 columns = {
296 (
297 f"{metric.name} ({selector.name})"
(...)
301 for (metric, selector), values in columns.items()
302 }
File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/peppr/evaluator.py:555, in Evaluator._tabulate_metrics(self, selectors)
553 condensed_values.append(np.nan)
554 elif len(array) > 1:
--> 555 raise ValueError(
556 "At least one selector is required for multi-pose predictions"
557 )
558 else:
559 condensed_values.append(array[0])
ValueError: At least one selector is required for multi-pose predictions
The Evaluator needs a way to select the desired value from the metric evaluated on each pose. What is desired depends on the use case: do we only want the value for the most confident prediction, the best prediction, or the average of all predictions? Hence, the user needs to supply one or multiple Selector objects to the Evaluator.tabulate_metrics() and Evaluator.summarize_metrics() methods. For example, the TopSelector selects the best value from the k most confident predictions.
Note
Evaluator.feed() expects that poses are sorted from highest to lowest confidence, i.e. the first pose in the list is the most confident prediction.
evaluator.tabulate_metrics(selectors=[peppr.TopSelector(3)])
| | DockQ (Top3) |
|---|---|
| foo | 0.59 |
Note the added (Top3) in the column name: for every given selector, a column will appear that shows the selected value for that Selector.
evaluator.tabulate_metrics(selectors=[peppr.TopSelector(3), peppr.MedianSelector()])
| | DockQ (Top3) | DockQ (median) |
|---|---|---|
| foo | 0.59 | 0.29 |
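Aggregation works the same way; a minimal sketch, assuming summarize_metrics() mirrors the selectors keyword of tabulate_metrics():

for name, value in evaluator.summarize_metrics(selectors=[peppr.TopSelector(3)]).items():
    print(f"{name}: {value:.2f}")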
More speed or higher accuracy?#
As brought up above, the Evaluator needs to match common atoms between the reference and pose before they are passed to the individual metrics. However, this is easier said than done, as the structure might contain symmetries due to multiple copies of the same molecule (e.g. homomeric complexes) or due to equivalent atoms within the same molecule (e.g. the two oxygen atoms of a carboxyl group). Each of these symmetries results in multiple possible atom mappings: for example in a homodimer, does chain A in the pose map to chain A in the reference, or is the mapping to the equivalent chain B more optimal?

To solve this problem the Evaluator by default uses a heuristic method that aims to find a mapping that minimizes the RMSD between the reference and pose. This method is fast, scales well for large symmetric complexes and most often finds the best answer. However, as is the nature of a heuristic, there are edge cases where the found mapping is not optimal. Furthermore, an atom mapping that optimizes the RMSD might in rare cases not be the ideal mapping for e.g. the lDDT. Therefore, the Evaluator.MatchMethod can be used to choose the desired tradeoff between speed and accuracy. In short, the default Evaluator.MatchMethod.HEURISTIC is a good choice for all use cases that leave a small margin for error. If, on the other hand, correctness is prioritized, use Evaluator.MatchMethod.EXHAUSTIVE or Evaluator.MatchMethod.INDIVIDUAL instead, but try to avoid structures with many symmetries, as this may quickly lead to a combinatorial explosion.
evaluator = peppr.Evaluator(
    [peppr.InterfaceRMSD()],
    # Guarantees optimal atom mapping at potentially high computation costs
    match_method=peppr.Evaluator.MatchMethod.INDIVIDUAL,
)
A note on multiprocessing#
Multiprocessing of multiple systems is not baked into the Evaluator directly. Instead, you can use multiple Evaluator objects in parallel and combine them afterwards using Evaluator.combine(). This gives you the flexibility to use the multiprocessing library of your choice.
import copy
# Use 'mpire' multiprocessing library in this example,
# as it allows using 'apply_async()' with a function defined in a notebook
from mpire import WorkerPool
import numpy as np
N_PROCESSES = 2
def _evaluate_systems(evaluator, system_dirs):
    for system_dir in system_dirs:
        if not system_dir.is_dir():
            continue
        system_id = system_dir.name
        pdbx_file = pdbx.CIFFile.read(system_dir / "reference.cif")
        ref = pdbx.get_structure(pdbx_file, model=1, include_bonds=True)
        pdbx_file = pdbx.CIFFile.read(system_dir / "poses" / "pose_0.cif")
        pose = pdbx.get_structure(pdbx_file, model=1, include_bonds=True)
        evaluator.feed(system_id, ref, pose)
    # Return the input evaluator for convenience
    return evaluator

evaluator = peppr.Evaluator(
    [
        peppr.MonomerRMSD(threshold=2.0),
        peppr.MonomerLDDTScore(),
    ]
)
system_dir_chunks = np.array_split(sorted(path_to_systems.iterdir()), N_PROCESSES)
with WorkerPool(N_PROCESSES) as pool:
    async_results = []
    for system_dirs in system_dir_chunks:
        async_results.append(
            pool.apply_async(
                _evaluate_systems,
                (
                    copy.deepcopy(evaluator),
                    system_dirs,
                ),
            )
        )
    split_evaluators = [async_result.get() for async_result in async_results]
combined_evaluator = peppr.Evaluator.combine(split_evaluators)
combined_evaluator.tabulate_metrics()
| | CA-RMSD | intra protein lDDT |
|---|---|---|
| 7jto__1__1.A_1.D__1.J_1.O | 1.31 | 0.80 |
| 7znt__2__1.F_1.G__1.J | 1.36 | 0.81 |
| 8bdt__1__1.A_1.D__1.I | 1.09 | 0.81 |
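Since Evaluator.combine() returns a regular Evaluator, the combined results can also be aggregated as shown before:

for name, value in combined_evaluator.summarize_metrics().items():
    print(f"{name}: {value:.2f}")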