DataFrame Integration
=====================
CNotebook provides seamless integration with `Pandas`_ and `Polars`_ DataFrames through
the ``chem`` accessor, enabling chemistry-aware operations on molecular data.
.. _Pandas: https://pandas.pydata.org/
.. _Polars: https://pola.rs/
Pandas Integration
------------------
Prerequisites
^^^^^^^^^^^^^
.. code-block:: bash
pip install "cnotebook[pandas]"
Creating Molecule Columns
^^^^^^^^^^^^^^^^^^^^^^^^^
Convert SMILES strings to molecule objects:
.. code-block:: python
import cnotebook
import oepandas as oepd
import pandas as pd
df = pd.DataFrame({
"Name": ["Benzene", "Pyridine", "Pyrimidine"],
"Molecule": ["c1ccccc1", "c1cnccc1", "n1cnccc1"]
})
# Convert SMILES column to molecules (in place)
df.chem.as_molecule("Molecule", inplace=True)
# Display the DataFrame
df
You should see:
.. raw:: html
.. note::
Molecule columns automatically render as chemical structures when displaying
the DataFrame. This also works for anything that returns a DataFrame, such
as ``df.head()``.
Creating Query Columns
^^^^^^^^^^^^^^^^^^^^^^
OEPandas query columns use ``QueryDtype`` and render through the same CNotebook
depiction path as molecule columns:
.. code-block:: python
import cnotebook
import oepandas as oepd
import pandas as pd
queries = pd.DataFrame({
"Name": ["Alcohol query", "Nitrogen query"],
"Query": ["[#6]-[#8]", "[#7]"],
})
queries.chem.as_query("Query", inplace=True)
queries["Query"].chem.set_render_options(image_format="svg", title=False)
queries
Substructure Highlighting
^^^^^^^^^^^^^^^^^^^^^^^^^
You can highlight substructures within a DataFrame column using SMARTS. By default this uses ball-and-stick-style
highlighting that supports overlapping matches, using the ``oechem.OEGetLightColors()`` scheme.
.. code-block:: python
# Highlight a single pattern using the DataFrame above
df.Molecule.chem.highlight("c1ccccc1") # Highlight benzene rings
# Display the DataFrame
df
Outputs:
.. raw:: html
.. note::
   Highlighting persists until you remove it with ``df.chem.clear_formatting_rules()`` for a whole DataFrame or
   ``df["column_name"].chem.clear_formatting_rules()`` for a single column.
You can also use any of the normal highlighting capabilities:
.. code-block:: python
    from openeye import oechem, oedepict
    # Highlight using normal stick highlighting
df.Molecule.chem.highlight("c1ccccc1", style=oedepict.OEHighlightStyle_Stick, color=oechem.OELightBlue)
# Display the DataFrame
df
.. raw:: html
Finally, you can highlight a substructure based on the value in another column, rather than a fixed pattern:
.. code-block:: python
# Add a different pattern for each row
df["Pattern"] = ["cc", "cnc", "ncn"]
    # Highlight each row of the "Molecule" column using the pattern in its "Pattern" column
    df.chem.highlight_using_column("Molecule", "Pattern")
.. raw:: html
Note that this creates a new column with display objects instead of molecules.
Molecular Alignment
^^^^^^^^^^^^^^^^^^^
Align molecule depictions for visual comparison:
.. code-block:: python
import cnotebook
import oepandas as oepd
# Read the example unaligned molecules
df = oepd.read_sdf("examples/assets/rotations.sdf", no_title=True)
# Rename the "Molecule" column to "Original" so that we can
# see the original unaligned molecules
df = df.rename(columns={"Molecule": "Original"})
# Create a new molecule column called "Aligned" so that we can
# see the aligned molecules
df["Aligned"] = df.Original.chem.copy_molecules()
# Align the depictions based on the first molecule
# By default this does a path fingerprint-based alignment
df.Aligned.chem.align_depictions("first")
# Show the structures
df.head()
Outputs:
.. raw:: html
The ``align_depictions`` method accepts the following parameters:
+---------------+------------------------------------------------------------------+-------------------+
| Parameter | Description | Default |
+===============+==================================================================+===================+
| ``ref`` | Alignment reference. Can be: | Required |
| | | |
| | - ``"first"`` - Use the first molecule as reference | |
| | - ``oechem.OEMolBase`` - A molecule to use as reference | |
| | - ``oechem.OESubSearch`` - A substructure search pattern | |
| | - ``oechem.OEMCSSearch`` - A maximum common substructure search | |
| | - ``str`` - A SMARTS pattern | |
+---------------+------------------------------------------------------------------+-------------------+
| ``method`` | Alignment method. Can be: | ``None`` (auto) |
| | | |
| | - ``"substructure"`` / ``"ss"`` - Substructure-based alignment | |
| | - ``"mcss"`` - Maximum common substructure alignment | |
| | - ``"fingerprint"`` / ``"fp"`` - Fingerprint-based alignment | |
| | - ``None`` - Auto-detect based on ``ref`` type | |
+---------------+------------------------------------------------------------------+-------------------+
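The table leaves the auto-detection rules implicit. The sketch below shows one plausible way a ``method`` argument and its aliases could be resolved. It is not CNotebook's implementation: ``resolve_alignment_method`` and the ``ref_kind`` labels are hypothetical names, and every default used when ``method`` is ``None`` is an assumption except the ``"first"`` case, which the example above describes as fingerprint-based.

.. code-block:: python

    # Hypothetical helper: illustrates alias handling only, not CNotebook internals.
    _ALIASES = {
        "substructure": "substructure",
        "ss": "substructure",
        "mcss": "mcss",
        "fingerprint": "fingerprint",
        "fp": "fingerprint",
    }

    # Defaults per reference kind when method is None.
    _AUTO = {
        "first": "fingerprint",       # stated in the example above
        "molecule": "fingerprint",    # assumption
        "subsearch": "substructure",  # assumption
        "mcssearch": "mcss",          # assumption
        "smarts": "substructure",     # assumption
    }

    def resolve_alignment_method(ref_kind, method=None):
        """Return a canonical method name for a reference kind and optional method."""
        if method is None:
            return _AUTO[ref_kind]
        try:
            return _ALIASES[method.lower()]
        except KeyError:
            raise ValueError(f"Unknown alignment method: {method!r}") from None

Passing an explicit ``method`` always wins over auto-detection in this sketch, so ``resolve_alignment_method("molecule", "ss")`` yields ``"substructure"``.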
Fingerprint Similarity
^^^^^^^^^^^^^^^^^^^^^^
Color molecules by similarity to a reference:
.. code-block:: python
import cnotebook
import oepandas as oepd
from openeye import oechem, oedepict
# Read the example EGFR molecule file
df = oepd.read_smi("examples/assets/egfr.smi")
# Use Gefitinib as a reference
    # Call oedepict.OEPrepareDepiction to give the parsed molecule clean 2D coordinates
    gefitinib = oechem.OEGraphMol()
    oechem.OESmilesToMol(gefitinib, "COc1cc2c(cc1OCCCN3CCOCC3)c(ncn2)Nc4ccc(c(c4)Cl)F")
    oedepict.OEPrepareDepiction(gefitinib)
# Align all molecules to Gefitinib
df.Molecule.chem.align_depictions(gefitinib)
# Show similarity to Gefitinib
df.chem.fingerprint_similarity("Molecule", gefitinib, inplace=True)
# Show just the fingerprint similarity
df[["reference_similarity", "target_similarity"]]
This will output aligned structures that are colored by fingerprint similarity for both the target and reference:
.. raw:: html
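For intuition about what a fingerprint similarity score measures, the sketch below computes the Tanimoto coefficient, a common choice for comparing molecular fingerprints, on two small sets of bit positions. This page does not say which fingerprint type or metric ``fingerprint_similarity`` uses, so treat Tanimoto as an assumption here; the ``tanimoto`` function and the bit sets are made up for illustration.

.. code-block:: python

    def tanimoto(bits_a, bits_b):
        """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
        a, b = set(bits_a), set(bits_b)
        union = a | b
        if not union:
            return 0.0  # convention chosen here for two empty fingerprints
        return len(a & b) / len(union)

    # Illustrative fingerprints: 3 shared bits out of 6 distinct bits
    reference = {1, 4, 7, 9, 12}
    target = {1, 4, 8, 9}
    score = tanimoto(reference, target)

A score of 1.0 means the two fingerprints set exactly the same bits, and a score near 0 means they share almost none.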
Polars Integration
------------------
Prerequisites
^^^^^^^^^^^^^
.. code-block:: bash
pip install "cnotebook[polars]"
Creating Molecule Columns
^^^^^^^^^^^^^^^^^^^^^^^^^
Convert SMILES strings to molecule objects:
.. code-block:: python
import cnotebook
import oepolars as oeplr
import polars as pl
df = pl.DataFrame({
"Name": ["Benzene", "Pyridine", "Pyrimidine"],
"smiles": ["c1ccccc1", "c1cnccc1", "n1cnccc1"]
}).chem.as_molecule("smiles")
# Display the DataFrame
df
This outputs the same styled DataFrame as the Pandas example; the exact appearance depends on whether you are using Jupyter or Marimo:
.. raw:: html
DataFrames display automatically in both Jupyter and Marimo when left as a bare statement at the end of a cell.
Reading Molecule Files
^^^^^^^^^^^^^^^^^^^^^^
Read molecules directly from files:
.. code-block:: python
# Read from SMILES file
df = oeplr.read_smi("molecules.smi")
# Read from SDF file
df = oeplr.read_sdf("molecules.sdf")
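To try the readers without an existing data set, you can write a small SMILES file first. The sketch below assumes the common ``.smi`` layout of one record per line, with a SMILES string followed by whitespace and a title. The file location and contents are illustrative.

.. code-block:: python

    import tempfile
    from pathlib import Path

    # Illustrative records: (SMILES, title)
    records = [
        ("c1ccccc1", "Benzene"),
        ("c1cnccc1", "Pyridine"),
        ("n1cnccc1", "Pyrimidine"),
    ]

    # Write one "SMILES title" record per line into a temporary directory
    smi_path = Path(tempfile.mkdtemp()) / "molecules.smi"
    smi_path.write_text("".join(f"{smiles} {title}\n" for smiles, title in records))

    # The path can then be handed to the reader shown above, for example:
    # df = oeplr.read_smi(str(smi_path))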
Substructure Highlighting
^^^^^^^^^^^^^^^^^^^^^^^^^
Substructure highlighting must be done at the DataFrame level in Polars due to its architecture. It otherwise works exactly
the same way as in Pandas. By default this uses ball-and-stick-style highlighting that supports overlapping
matches, using the ``oechem.OEGetLightColors()`` scheme.
.. code-block:: python
# Highlight a single pattern using the DataFrame above
df.chem.highlight("smiles", "c1ccccc1") # Highlight benzene rings
# Display the DataFrame
df
Outputs:
.. raw:: html
.. note::
   Highlighting persists until you remove it with ``df.chem.clear_formatting_rules()``.
You can also highlight a substructure based on the value in another column, rather than a fixed pattern:
.. code-block:: python
# Add a column with patterns to highlight
df = df.with_columns(
pl.Series("Pattern", ["cc", "cnc", "ncn"])
)
# Highlight each row using the Pattern column
df = df.chem.highlight_using_column("smiles", "Pattern")
# Display the DataFrame
df
.. raw:: html
Molecular Alignment
^^^^^^^^^^^^^^^^^^^
Align molecule depictions:
.. code-block:: python
import cnotebook
import oepolars as oepl
# Read the example unaligned molecules
df = oepl.read_sdf("examples/assets/rotations.sdf", no_title=True)
    # Rename the "Molecule" column to "Original" so that we can
    # see the original unaligned molecules
    df = df.rename({"Molecule": "Original"})
    # Create a new molecule column called "Aligned" so that we can
    # see the aligned molecules
df = df.chem.copy_molecules("Original", "Aligned")
# Align the depictions based on the first molecule
# By default this does a path fingerprint-based alignment
df["Aligned"].chem.align_depictions("first")
# Show the structures
df.head()
This will output:
.. raw:: html
The ``align_depictions`` method accepts the following parameters:
+---------------+------------------------------------------------------------------+-------------------+
| Parameter | Description | Default |
+===============+==================================================================+===================+
| ``ref`` | Alignment reference. Can be: | Required |
| | | |
| | - ``"first"`` - Use the first molecule as reference | |
| | - ``oechem.OEMolBase`` - A molecule to use as reference | |
| | - ``oechem.OESubSearch`` - A substructure search pattern | |
| | - ``oechem.OEMCSSearch`` - A maximum common substructure search | |
| | - ``str`` - A SMARTS pattern | |
+---------------+------------------------------------------------------------------+-------------------+
| ``method`` | Alignment method. Can be: | ``None`` (auto) |
| | | |
| | - ``"substructure"`` / ``"ss"`` - Substructure-based alignment | |
| | - ``"mcss"`` - Maximum common substructure alignment | |
| | - ``"fingerprint"`` / ``"fp"`` - Fingerprint-based alignment | |
| | - ``None`` - Auto-detect based on ``ref`` type | |
+---------------+------------------------------------------------------------------+-------------------+
Fingerprint Similarity
^^^^^^^^^^^^^^^^^^^^^^
Color molecules by similarity:
.. code-block:: python
import cnotebook
import oepolars as oepl
from openeye import oechem, oedepict
# Read the example EGFR molecule file
df = oepl.read_smi("examples/assets/egfr.smi")
# Use Gefitinib as a reference
    # Call oedepict.OEPrepareDepiction to give the parsed molecule clean 2D coordinates
    gefitinib = oechem.OEGraphMol()
    oechem.OESmilesToMol(gefitinib, "COc1cc2c(cc1OCCCN3CCOCC3)c(ncn2)Nc4ccc(c(c4)Cl)F")
    oedepict.OEPrepareDepiction(gefitinib)
# Align all molecules to Gefitinib
df["Molecule"].chem.align_depictions(gefitinib)
# Show similarity to Gefitinib
df = df.chem.fingerprint_similarity("Molecule", gefitinib, inplace=True)
    # Show just the fingerprint similarity
df[["reference_similarity", "target_similarity"]]
This outputs:
.. raw:: html
MolGrid from DataFrames
-----------------------
Create interactive molecule grids from DataFrames:
**Pandas:**
.. code-block:: python
from cnotebook import MolGrid
import oepandas as oepd
# Read the example EGFR molecule file
df = oepd.read_smi("examples/assets/egfr.smi")
# 1. Create a molecule grid with all data
grid = df.chem.molgrid("Molecule")
# 2. Create a molecule grid with only the molecule series (no data)
    # grid = df["Molecule"].chem.molgrid()
# Display the grid
grid.display()
This outputs:
.. image:: _static/pandas_molgrid.png
:align: center
**Polars:**
The same code works for Polars; just swap ``oepandas`` for ``oepolars``:
.. code-block:: python
from cnotebook import MolGrid
import oepolars as oepl
# Read the example EGFR molecule file
df = oepl.read_smi("examples/assets/egfr.smi")
# 1. Create a molecule grid with all data
grid = df.chem.molgrid("Molecule")
# 2. Create a molecule grid with only the molecule series (no data)
    # grid = df["Molecule"].chem.molgrid()
# Display the grid
grid.display()
You should get the exact same molecule grid as with Pandas.
See the :ref:`molgrid-class` documentation for more details on MolGrid features.
Best Practices
--------------
1. **Memory Management**: For large datasets, consider using molecule indices
rather than storing full molecule objects in memory.
2. **Performance**: Use PNG format for faster rendering of large DataFrames.
SVG provides better quality but may be slower for many molecules.
3. **Column Naming**: Use descriptive column names and avoid conflicts with
reserved names like "Molecule" when possible.
4. **Lazy Evaluation**: When using Polars, take advantage of lazy evaluation
for complex operations on large datasets.
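As a minimal sketch of the lazy-evaluation advice, the example below filters and trims a frame with the Polars lazy API before materializing it. It uses only core Polars calls; the ``n_nitrogens`` column is a made-up property for illustration. Whether the ``chem`` accessor is available on a ``LazyFrame`` is not stated on this page, so the sketch collects to an eager DataFrame first.

.. code-block:: python

    import polars as pl

    # Illustrative data: SMILES strings plus a hypothetical numeric property
    lf = pl.LazyFrame({
        "Name": ["Benzene", "Pyridine", "Pyrimidine"],
        "smiles": ["c1ccccc1", "c1cnccc1", "n1cnccc1"],
        "n_nitrogens": [0, 1, 2],
    })

    # Nothing is computed yet; Polars only records the query plan
    plan = lf.filter(pl.col("n_nitrogens") > 0).select(["Name", "smiles"])

    # collect() runs the optimized plan and returns an eager DataFrame
    subset = plan.collect()

    # The smaller eager frame could then be converted as shown earlier:
    # subset = subset.chem.as_molecule("smiles")

Filtering before conversion means molecule objects are only created for the rows you intend to display.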