DataFrame Integration
=====================

CNotebook provides seamless integration with `Pandas`_ and `Polars`_ DataFrames
through the ``chem`` accessor, enabling chemistry-aware operations on molecular data.

.. _Pandas: https://pandas.pydata.org/
.. _Polars: https://pola.rs/

Pandas Integration
------------------

Prerequisites
^^^^^^^^^^^^^

.. code-block:: bash

   pip install pandas oepandas

Creating Molecule Columns
^^^^^^^^^^^^^^^^^^^^^^^^^

Convert SMILES strings to molecule objects:

.. code-block:: python

   import cnotebook
   import oepandas as oepd
   import pandas as pd

   df = pd.DataFrame({
       "Name": ["Benzene", "Pyridine", "Pyrimidine"],
       "Molecule": ["c1ccccc1", "c1cnccc1", "n1cnccc1"]
   })

   # Convert SMILES column to molecules (in place)
   df.chem.as_molecule("Molecule", inplace=True)

   # Display the DataFrame
   df

You should see:

.. raw:: html

   <p><em>[Rendered DataFrame: columns Name and Molecule; rows 0 Benzene, 1 Pyridine, 2 Pyrimidine, each molecule drawn as a 2D structure]</em></p>

.. note::

   Molecule columns automatically render as chemical structures when displaying
   the DataFrame. This also works for anything that returns a DataFrame, such as
   ``df.head()``.

Substructure Highlighting
^^^^^^^^^^^^^^^^^^^^^^^^^

You can highlight substructures within a DataFrame column using SMARTS. By default,
this uses ball-and-stick-style highlighting that supports overlapping matches, with
colors taken from the ``oechem.OEGetLightColors()`` scheme.

.. code-block:: python

   # Highlight a single pattern using the DataFrame above
   df.Molecule.chem.highlight("c1ccccc1")  # Highlight benzene rings

   # Display the DataFrame
   df

Outputs:

.. raw:: html

   <p><em>[Rendered DataFrame: columns Name and Molecule; rows 0 Benzene, 1 Pyridine, 2 Pyrimidine, with the default highlighting applied]</em></p>

.. note::

   Highlighting persists until you remove it with ``df.chem.clear_formatting_rules()``
   for a DataFrame or ``df["column_name"].chem.clear_formatting_rules()`` for a single column.

You can also use any of the normal highlighting capabilities:

.. code-block:: python

   from openeye import oechem, oedepict

   # Highlight using normal stick highlighting
   df.Molecule.chem.highlight(
       "c1ccccc1",
       style=oedepict.OEHighlightStyle_Stick,
       color=oechem.OELightBlue
   )

   # Display the DataFrame
   df

.. raw:: html

   <p><em>[Rendered DataFrame: columns Name and Molecule; rows 0 Benzene, 1 Pyridine, 2 Pyrimidine, with stick-style highlighting applied]</em></p>

Finally, you can highlight a substructure based on the value in another column,
rather than a fixed pattern:

.. code-block:: python

   # Add a different pattern for each row
   df["Pattern"] = ["cc", "cnc", "ncn"]

   # Highlight each molecule using the pattern stored in the same row
   df.chem.highlight_using_column("Molecule", "Pattern")

.. raw:: html

   <p><em>[Rendered DataFrame: columns Name, Molecule, Pattern, and highlighted_substructures; rows 0 Benzene (cc), 1 Pyridine (cnc), 2 Pyrimidine (ncn)]</em></p>

Note that this creates a new column (``highlighted_substructures``) with display
objects instead of molecules.

Molecular Alignment
^^^^^^^^^^^^^^^^^^^

Align molecule depictions for visual comparison:

.. code-block:: python

   import cnotebook
   import oepandas as oepd

   # Read the example unaligned molecules
   df = oepd.read_sdf("examples/assets/rotations.sdf", no_title=True)

   # Rename the "Molecule" column to "Original" so that we can
   # see the original unaligned molecules
   df = df.rename(columns={"Molecule": "Original"})

   # Create a new molecule column called "Aligned" so that we can
   # see the aligned molecules
   df["Aligned"] = df.Original.chem.copy_molecules()

   # Align the depictions based on the first molecule
   # By default this does a path fingerprint-based alignment
   df.Aligned.chem.align_depictions("first")

   # Show the structures
   df.head()

Outputs:

.. raw:: html

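   <p><em>[Rendered DataFrame: columns Original and Aligned; rows 0 to 2, each showing the unaligned and aligned depiction of a molecule]</em></p>
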
The ``align_depictions`` method accepts the following parameters:

+---------------+------------------------------------------------------------------+-------------------+
| Parameter     | Description                                                      | Default           |
+===============+==================================================================+===================+
| ``ref``       | Alignment reference. Can be:                                     | Required          |
|               |                                                                  |                   |
|               | - ``"first"`` - Use the first molecule as reference              |                   |
|               | - ``oechem.OEMolBase`` - A molecule to use as reference          |                   |
|               | - ``oechem.OESubSearch`` - A substructure search pattern         |                   |
|               | - ``oechem.OEMCSSearch`` - A maximum common substructure search  |                   |
|               | - ``str`` - A SMARTS pattern                                     |                   |
+---------------+------------------------------------------------------------------+-------------------+
| ``method``    | Alignment method. Can be:                                        | ``None`` (auto)   |
|               |                                                                  |                   |
|               | - ``"substructure"`` / ``"ss"`` - Substructure-based alignment   |                   |
|               | - ``"mcss"`` - Maximum common substructure alignment             |                   |
|               | - ``"fingerprint"`` / ``"fp"`` - Fingerprint-based alignment     |                   |
|               | - ``None`` - Auto-detect based on ``ref`` type                   |                   |
+---------------+------------------------------------------------------------------+-------------------+

Fingerprint Similarity
^^^^^^^^^^^^^^^^^^^^^^

Color molecules by similarity to a reference:

.. code-block:: python

   import cnotebook
   import oepandas as oepd
   from openeye import oechem, oedepict

   # Read the example EGFR molecule file
   df = oepd.read_smi("examples/assets/egfr.smi")

   # Use Gefitinib as a reference
   # We call oedepict.OEPrepareDepiction to give the molecule parsed from SMILES nice 2D coordinates
   gefitinib = oechem.OEGraphMol()
   oechem.OESmilesToMol(gefitinib, "COc1cc2c(cc1OCCCN3CCOCC3)c(ncn2)Nc4ccc(c(c4)Cl)F")
   oedepict.OEPrepareDepiction(gefitinib)

   # Align all molecules to Gefitinib
   df.Molecule.chem.align_depictions(gefitinib)

   # Show similarity to Gefitinib
   df.chem.fingerprint_similarity("Molecule", gefitinib, inplace=True)

   # Show just the fingerprint similarity
   df[["reference_similarity", "target_similarity"]]

This outputs aligned structures colored by fingerprint similarity for both the
target and the reference:

.. raw:: html

   <p><em>[Rendered DataFrame: columns reference_similarity and target_similarity; rows 0 to 4, each showing the reference and target structures colored by fingerprint similarity]</em></p>

Polars Integration
------------------

Prerequisites
^^^^^^^^^^^^^

.. code-block:: bash

   pip install polars oepolars

Creating Molecule Columns
^^^^^^^^^^^^^^^^^^^^^^^^^

Convert SMILES strings to molecule objects:

.. code-block:: python

   import cnotebook
   import oepolars as oeplr
   import polars as pl

   df = pl.DataFrame({
       "Name": ["Benzene", "Pyridine", "Pyrimidine"],
       "smiles": ["c1ccccc1", "c1cnccc1", "n1cnccc1"]
   }).chem.as_molecule("smiles")

   # Display the DataFrame
   df

This outputs the same styled DataFrame (depending on whether you are using Jupyter or Marimo):

.. raw:: html

   <p><em>[Rendered DataFrame: columns Name and smiles; rows Benzene, Pyridine, Pyrimidine, each molecule drawn as a 2D structure]</em></p>

DataFrames display automatically in both Jupyter and Marimo when left as a bare
statement at the end of a cell.

Reading Molecule Files
^^^^^^^^^^^^^^^^^^^^^^

Read molecules directly from files:

.. code-block:: python

   import oepolars as oeplr

   # Read from SMILES file
   df = oeplr.read_smi("molecules.smi")

   # Read from SDF file
   df = oeplr.read_sdf("molecules.sdf")

Substructure Highlighting
^^^^^^^^^^^^^^^^^^^^^^^^^

Because of its architecture, Polars requires substructure highlighting to be done at
the DataFrame level. Otherwise it works exactly the same way as in Pandas. By default,
this uses ball-and-stick-style highlighting that supports overlapping matches, with
colors taken from the ``oechem.OEGetLightColors()`` scheme.

.. code-block:: python

   # Highlight a single pattern using the DataFrame above
   df.chem.highlight("smiles", "c1ccccc1")  # Highlight benzene rings

   # Display the DataFrame
   df

Outputs:

.. raw:: html

   <p><em>[Rendered DataFrame: columns Name and smiles; rows Benzene, Pyridine, Pyrimidine, with the default highlighting applied]</em></p>

.. note::

   Highlighting persists until you remove it with ``df.chem.clear_formatting_rules()``.

Finally, you can highlight a substructure based on the value in another column,
rather than a fixed pattern:

.. code-block:: python

   # Add a column with patterns to highlight
   df = df.with_columns(
       pl.Series("Pattern", ["cc", "cnc", "ncn"])
   )

   # Highlight each row using the Pattern column
   df = df.chem.highlight_using_column("smiles", "Pattern")

   # Display the DataFrame
   df

.. raw:: html

   <p><em>[Rendered DataFrame: columns Name, smiles, Pattern, and highlighted_substructures; rows Benzene (cc), Pyridine (cnc), Pyrimidine (ncn)]</em></p>

Molecular Alignment
^^^^^^^^^^^^^^^^^^^

Align molecule depictions:

.. code-block:: python

   import cnotebook
   import oepolars as oepl

   # Read the example unaligned molecules
   df = oepl.read_sdf("examples/assets/rotations.sdf", no_title=True)

   # Rename the "Molecule" column to "Original" so that we can
   # see the original unaligned molecules
   df = df.rename({"Molecule": "Original"})

   # Create a new molecule column called "Aligned" so that we can
   # see the aligned molecules
   df = df.chem.copy_molecules("Original", "Aligned")

   # Align the depictions based on the first molecule
   # By default this does a path fingerprint-based alignment
   df["Aligned"].chem.align_depictions("first")

   # Show the structures
   df.head()

This will output:

.. raw:: html

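   <p><em>[Rendered DataFrame: columns Original and Aligned; three rows, each showing the unaligned and aligned depiction of a molecule]</em></p>
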
The ``align_depictions`` method accepts the following parameters:

+---------------+------------------------------------------------------------------+-------------------+
| Parameter     | Description                                                      | Default           |
+===============+==================================================================+===================+
| ``ref``       | Alignment reference. Can be:                                     | Required          |
|               |                                                                  |                   |
|               | - ``"first"`` - Use the first molecule as reference              |                   |
|               | - ``oechem.OEMolBase`` - A molecule to use as reference          |                   |
|               | - ``oechem.OESubSearch`` - A substructure search pattern         |                   |
|               | - ``oechem.OEMCSSearch`` - A maximum common substructure search  |                   |
|               | - ``str`` - A SMARTS pattern                                     |                   |
+---------------+------------------------------------------------------------------+-------------------+
| ``method``    | Alignment method. Can be:                                        | ``None`` (auto)   |
|               |                                                                  |                   |
|               | - ``"substructure"`` / ``"ss"`` - Substructure-based alignment   |                   |
|               | - ``"mcss"`` - Maximum common substructure alignment             |                   |
|               | - ``"fingerprint"`` / ``"fp"`` - Fingerprint-based alignment     |                   |
|               | - ``None`` - Auto-detect based on ``ref`` type                   |                   |
+---------------+------------------------------------------------------------------+-------------------+

Fingerprint Similarity
^^^^^^^^^^^^^^^^^^^^^^

Color molecules by similarity to a reference:

.. code-block:: python

   import cnotebook
   import oepolars as oepl
   from openeye import oechem, oedepict

   # Read the example EGFR molecule file
   df = oepl.read_smi("examples/assets/egfr.smi")

   # Use Gefitinib as a reference
   # We call oedepict.OEPrepareDepiction to give the molecule parsed from SMILES nice 2D coordinates
   gefitinib = oechem.OEGraphMol()
   oechem.OESmilesToMol(gefitinib, "COc1cc2c(cc1OCCCN3CCOCC3)c(ncn2)Nc4ccc(c(c4)Cl)F")
   oedepict.OEPrepareDepiction(gefitinib)

   # Align all molecules to Gefitinib
   df["Molecule"].chem.align_depictions(gefitinib)

   # Show similarity to Gefitinib
   df = df.chem.fingerprint_similarity("Molecule", gefitinib, inplace=True)

   # Show just the fingerprint similarity
   df[["reference_similarity", "target_similarity"]]

This outputs:

.. raw:: html

   <p><em>[Rendered DataFrame: columns reference_similarity and target_similarity; five rows, each showing the reference and target structures colored by fingerprint similarity]</em></p>

MolGrid from DataFrames
-----------------------

Create interactive molecule grids from DataFrames:

**Pandas:**

.. code-block:: python

   from cnotebook import MolGrid
   import oepandas as oepd

   # Read the example EGFR molecule file
   df = oepd.read_smi("examples/assets/egfr.smi")

   # 1. Create a molecule grid with all data
   grid = df.chem.molgrid("Molecule")

   # 2. Create a molecule grid with only the molecule series (no data)
   # grid = df["Molecule"].chem.molgrid()

   # Display the grid
   grid.display()

This outputs:

.. image:: _static/pandas_molgrid.png
   :align: center

**Polars:**

The same code works with Polars; just replace ``oepandas`` with ``oepolars``:

.. code-block:: python

   from cnotebook import MolGrid
   import oepolars as oepl

   # Read the example EGFR molecule file
   df = oepl.read_smi("examples/assets/egfr.smi")

   # 1. Create a molecule grid with all data
   grid = df.chem.molgrid("Molecule")

   # 2. Create a molecule grid with only the molecule series (no data)
   # grid = df["Molecule"].chem.molgrid()

   # Display the grid
   grid.display()

You should get the exact same molecule grid as with Pandas.

See the :ref:`molgrid-class` documentation for more details on MolGrid features.

Best Practices
--------------

1. **Memory Management**: For large datasets, consider using molecule indices rather
   than storing full molecule objects in memory.

2. **Performance**: Use PNG format for faster rendering of large DataFrames. SVG
   provides better quality but may be slower for many molecules.

3. **Column Naming**: Use descriptive column names and avoid conflicts with reserved
   names like "Molecule" when possible.

4. **Lazy Evaluation**: When using Polars, take advantage of lazy evaluation for
   complex operations on large datasets.
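
As a rough illustration of the lazy-evaluation advice, the sketch below uses only plain
Polars on string and integer columns. It assumes you filter and select before converting
anything to molecules; whether molecule columns created by ``oepolars`` can take part in a
lazy query is not covered here. The ``n_nitrogens`` column is a hypothetical precomputed
value added purely for the example.

.. code-block:: python

   import polars as pl

   # Plain data only: SMILES strings plus a hypothetical precomputed count
   df = pl.DataFrame({
       "Name": ["Benzene", "Pyridine", "Pyrimidine"],
       "smiles": ["c1ccccc1", "c1cnccc1", "n1cnccc1"],
       "n_nitrogens": [0, 1, 2],  # hypothetical column, not produced by CNotebook
   })

   # Build a lazy query plan; nothing is computed at this point
   lazy_plan = (
       df.lazy()
       .filter(pl.col("n_nitrogens") > 0)
       .select(["Name", "smiles"])
   )

   # Execute the whole plan in one pass
   filtered = lazy_plan.collect()

The smaller ``filtered`` frame could then be passed to ``chem.as_molecule("smiles")`` as
shown earlier, so that molecule objects are only created for the rows you actually keep.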