Custom metrics and selectors ============================ .. jupyter-execute:: :hide-code: from pathlib import Path import os import pandas as pd path_to_systems = Path(os.getcwd()) / "tests" / "data" / "predictions" pd.options.display.float_format = lambda x: "{:.2f}".format(x) .. currentmodule:: peppr ``peppr`` is written with extensibility in mind: You can easily define your own :class:`Metric` and :class:`Selector` classes to extend the functionality of the package. Creating a custom metric ------------------------ To create a custom metric, one only needs to subclass the :class:`Metric` base class. For the scope of this tutorial we will create an all-atom RMSD as metric. .. jupyter-execute:: import peppr import numpy as np import biotite.structure as struc class RMSD(peppr.Metric): # This is mandatory @property def name(self): return "My awesome RMSD" # This optionally enables binning for `Evaluator.summarize_metrics()` @property def thresholds(self): return {"good": 0.0, "ok": 10.0, "bad": 20.0} # The core of each metric def evaluate(self, reference, pose): # Limit system to common atoms between reference and pose not_nan_mask = ~np.isnan(reference.coord[:, 0]) & ~np.isnan(pose.coord[:, 0]) # It is not guaranteed that `reference` and `pose` are superimposed reference = reference[not_nan_mask] pose = pose[not_nan_mask] pose, _ = struc.superimpose(reference, pose) return struc.rmsd(reference, pose).item() # Defines whether a smaller or larger value indicates a better prediction # Used by the `Selector` classes def smaller_is_better(self): return True A few notes on the :meth:`evaluate()` method: The purpose of a :class:`Metric` is to be used in context of an :class:`Evaluator`. As the :class:`Evaluator` already automatically preprocesses the input systems passed to :meth:`Evaluator.feed()`, we can rely on some properties of each system that make our life easier: - The input system will only contain heavy atoms (i.e. no hydrogen). - The input system always contains matching atoms. This means that atom ``reference[i]`` corresponds to the atom ``pose[i]``. Finding the optimal atom mapping in case of ambiguous scenarios, as they appear in homomers or symmetric molecules, is already handled under the hood beforehand. - If either the pose or the reference have atoms that are not present in the other, the respective coordinates of these atoms are *NaN*. - The :class:`Metric` should respect the ``hetero`` annotation of the input :class:`biotite.structure.AtomArray`. An atom where ``hetero`` is ``False`` is defined as polymer and where ``hetero`` is ``True`` as small molecule. For example a peptide chain with ``hetero`` being ``True`` should still be interpreted as a small molecule. - Atoms with the same ``chain_id`` should be considered as part of the same molecule, atoms with different ``chain_id`` should be considered as different molecules. Creating a custom selector -------------------------- Custom selectors can also be created by inheriting from the :class:`Selector` base class. Let's create a selector that picks the *worst case* prediction, i.e. the pose with the worst value for a given metric. .. jupyter-execute:: class WorstCaseSelector(peppr.Selector): @property def name(self): return "worst case" def select(self, values, smaller_is_better): if smaller_is_better: return np.nanmax(values) else: return np.nanmin(values) Note that the :meth:`Selector.select()` method does not know anything about the actual pose or metric it is selecting from. It only obtains values to select from (may also contain *NaN* values) and their ordering. Bringing it all together ------------------------ Now that we have created our custom metric and selector, we can use them in the :class:`Evaluator` to evaluate our predictions. .. jupyter-execute:: import biotite.structure.io.pdbx as pdbx evaluator = peppr.Evaluator([RMSD()]) for system_dir in sorted(path_to_systems.iterdir()): if not system_dir.is_dir(): continue system_id = system_dir.name pdbx_file = pdbx.CIFFile.read(system_dir / "reference.cif") reference = pdbx.get_structure(pdbx_file, model=1, include_bonds=True) poses = [] for pose_path in sorted(system_dir.glob("poses/*.cif")): pdbx_file = pdbx.CIFFile.read(pose_path) poses.append(pdbx.get_structure(pdbx_file, model=1, include_bonds=True)) evaluator.feed(system_id, reference, poses) evaluator.tabulate_metrics(selectors=[WorstCaseSelector()]) And finally, we will also the bin thresholds for the RMSD metric in action. .. jupyter-execute:: for name, value in evaluator.summarize_metrics( selectors=[WorstCaseSelector()] ).items(): print(f"{name}: {value:.2f}")