Getting rid of data point if thermophysical data is not included #578

barmoral · 2024-10-01T22:31:52Z

Is your feature request related to a problem? Please describe.
When using ThermoML DOIs as input data in evaluator for filtering, sometimes there are no values for pressure or temperature. Because evaluator expects these thermodynamic properties, loading and/or filtering the data will raise an error. The error arises because every value of pressure (for example) in every row is converted into a physical property object, and if a value is missing, the code breaks.

Describe the solution you'd like
It would be better if evaluator automatically removed data points without complete thermodynamic data before the code breaks, or accepted them with a warning.

Describe alternatives you've considered
I manually removed the data points without complete thermodynamic data by using dropna().
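
A minimal sketch of that workaround, with a warning added, on toy data. The helper name `drop_incomplete_states` is hypothetical and the values are made up; the column names are the ones used in the cached table in the script below. This is not evaluator's current behavior.

```python
import warnings

import pandas as pd

# Column names as they appear in the table written by to_pandas() in this issue.
STATE_COLUMNS = ["Temperature (K)", "Pressure (kPa)"]


def drop_incomplete_states(df, required=STATE_COLUMNS):
    """Hypothetical helper: drop rows missing any required state column, with a warning."""
    incomplete = df[required].isna().any(axis=1)
    n_dropped = int(incomplete.sum())
    if n_dropped:
        warnings.warn(
            f"Dropping {n_dropped} of {len(df)} rows with missing values in {required}"
        )
    return df.loc[~incomplete]


# Toy stand-in for the cached property table (values are made up).
toy_df = pd.DataFrame(
    {
        "Id": ["a", "b", "c"],
        "Temperature (K)": [298.15, 298.15, 313.15],
        "Pressure (kPa)": [101.325, None, 101.0],
    }
)

# Record the warning so it can be inspected instead of only printed to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clean_df = drop_incomplete_states(toy_df)

print(clean_df["Id"].tolist())
```

The cleaned frame could then be passed to `ThermoMLDataSet.from_pandas(clean_df)` in place of `prop_df`.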

Additional context
I have attached an input JSON file (sorted_dois.json) to this issue.

Here is example Python code to reproduce the error:

import pandas as pd
import json
from pathlib import Path
from openff.evaluator.datasets import PhysicalProperty, PropertyPhase
from openff.evaluator.datasets.thermoml import thermoml_property
from openff.evaluator import properties
from openff.units import unit
from openff.evaluator.datasets.thermoml import ThermoMLDataSet

@thermoml_property("Osmotic coefficient", supported_phases=PropertyPhase.Liquid | PropertyPhase.Gas)
class OsmoticCoefficient(PhysicalProperty):
    """A class representation of an osmotic coefficient property"""

    @classmethod
    def default_unit(cls):
        return unit.dimensionless
# Register the property class on the properties module under its own name.
setattr(properties, OsmoticCoefficient.__name__, OsmoticCoefficient)

CACHED_PROP_PATH = Path('osmotic_data.csv')

if CACHED_PROP_PATH.exists():
    prop_df = pd.read_csv(CACHED_PROP_PATH, index_col=0)
    ## delete rows with undefined thermodynamic parameters to avoid indexing errors
    # prop_df = prop_df.dropna(subset=['Temperature (K)'])
    # prop_df = prop_df.dropna(subset=['Pressure (kPa)'])
    data_set = ThermoMLDataSet.from_pandas(prop_df)
else:
    with open('sorted_dois.json') as f:
        doi_dat = json.load(f)
        data_set = ThermoMLDataSet.from_doi(*doi_dat['working'])

    prop_df = data_set.to_pandas()
    # Cache the table so later runs can skip the download.
    prop_df.to_csv(CACHED_PROP_PATH)


lilyminium · 2025-03-05T23:47:23Z

I ran into this too; it would be convenient for this to happen automatically in from_pandas.

mattwthompson · 2025-03-12T16:05:22Z

@barmoral Thanks for providing a reproduction I can easily get started on. How long does this script take to run, though? It's been a few minutes (probably just fetching the data?) and I want to make sure that's not surprising.

mattwthompson · 2025-03-12T16:20:05Z

Okay, it finished. I was just a little impatient.

Which columns should we use to decide which rows to drop? This dataframe has plenty of missing pressure data, but no missing temperature or phase data. Some other columns are always missing, so we can't just call .dropna() without arguments:

In [25]: prop_df.isnull().sum()
Out[25]:
Id                                      0
Temperature (K)                         0
Pressure (kPa)                       1957
Phase                                   0
N Components                            0
Component 1                             0
Role 1                                  0
Mole Fraction 1                         0
Exact Amount 1                       5347
Component 2                            20
Role 2                                 20
Mole Fraction 2                        20
Exact Amount 2                       5347
Component 3                          3562
Role 3                               3562
Mole Fraction 3                      3562
Exact Amount 3                       5347
Density Value (g / ml)               4741
Density Uncertainty (g / ml)         4741
OsmoticCoefficient Value ()           606
OsmoticCoefficient Uncertainty ()     606
Source                                  0
dtype: int64

In [34]: prop_df.dropna()
Out[34]:
Empty DataFrame
Columns: [Id, Temperature (K), Pressure (kPa), Phase, N Components, Component 1, Role 1, Mole Fraction 1, Exact Amount 1, Component 2, Role 2, Mole Fraction 2, Exact Amount 2, Component 3, Role 3, Mole Fraction 3, Exact Amount 3, Density Value (g / ml), Density Uncertainty (g / ml), OsmoticCoefficient Value (), OsmoticCoefficient Uncertainty (), Source]
Index: []

My guess is we want to consider pressure, temperature, and phase. For this data, it strips out some but not most of the dataset:

In [33]: prop_df.describe(), prop_df.dropna(subset=['Pressure (kPa)', 'Temperature (K)', 'Phase']).describe()
Out[33]:
(       Temperature (K)  Pressure (kPa)  N Components  ...  Density Uncertainty (g / ml)  OsmoticCoefficient Value ()  OsmoticCoefficient Uncertainty ()
 count      5347.000000     3390.000000   5347.000000  ...                    606.000000                  4741.000000                        4741.000000
 mean        305.864640       98.806342      2.330092  ...                      0.001153                     0.819905                           0.028031
 std          12.173756        5.088289      0.478179  ...                      0.000988                     0.327787                           0.139206
 min         273.000000       84.500000      1.000000  ...                      0.000034                     0.146000                           0.000050
 25%         298.150000      101.000000      2.000000  ...                      0.000227                     0.690000                           0.005000
 50%         298.150000      101.000000      2.000000  ...                      0.001505                     0.827700                           0.006500
 75%         313.150000      101.000000      3.000000  ...                      0.001875                     0.929000                           0.010000
 max         353.150000      101.325000      3.000000  ...                      0.011740                     6.362000                           1.000000

 [8 rows x 12 columns],
        Temperature (K)  Pressure (kPa)  N Components  ...  Density Uncertainty (g / ml)  OsmoticCoefficient Value ()  OsmoticCoefficient Uncertainty ()
 count      3390.000000     3390.000000   3390.000000  ...                    606.000000                  2784.000000                        2784.000000
 mean        305.332354       98.806342      2.410324  ...                      0.001153                     0.862405                           0.041581
 std          12.601643        5.088289      0.497334  ...                      0.000988                     0.390842                           0.180353
 min         273.000000       84.500000      1.000000  ...                      0.000034                     0.146000                           0.000050
 25%         298.150000      101.000000      2.000000  ...                      0.000227                     0.716000                           0.005000
 50%         298.150000      101.000000      2.000000  ...                      0.001505                     0.859000                           0.006000
 75%         313.150000      101.000000      3.000000  ...                      0.001875                     0.944000                           0.008500
 max         353.150000      101.325000      3.000000  ...                      0.011740                     6.362000                           1.000000

 [8 rows x 12 columns])

But I wonder if you also want rows stripped out if density or osmotic coefficient (or any other property value) is missing?

barmoral · 2025-03-12T17:27:29Z

@mattwthompson Thanks for checking this out! No, I don't mind if density or osmotic coefficients are missing. If possible, it would just be helpful for the code to run even when some data is missing and to use the data that is actually there, instead of deleting the whole data point. If that is not possible, evaluator could perhaps report which data points are missing data and will therefore be thrown out when filtering for a specific property.
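
Both options in this comment (use the data that is there, or at least report what will be discarded) can be sketched with pandas on a toy table. The helper names `report_missing` and `rows_for_property` are hypothetical and the values are made up; only the column names are taken from the output earlier in this thread. This is an illustration, not how evaluator currently behaves.

```python
import pandas as pd

STATE_COLUMNS = ["Temperature (K)", "Pressure (kPa)"]


def report_missing(df, columns=STATE_COLUMNS):
    """Hypothetical helper: map each incomplete row's Id to the columns it lacks."""
    missing = df[columns].isna()
    return {
        df.loc[idx, "Id"]: [col for col in columns if missing.loc[idx, col]]
        for idx in df.index[missing.any(axis=1)]
    }


def rows_for_property(df, value_column, state_columns=STATE_COLUMNS):
    """Hypothetical helper: keep rows with a value for one property and a complete state."""
    return df.dropna(subset=[value_column, *state_columns])


# Toy stand-in for the property table (values are made up).
toy_df = pd.DataFrame(
    {
        "Id": ["a", "b", "c", "d"],
        "Temperature (K)": [298.15, 298.15, 313.15, 313.15],
        "Pressure (kPa)": [101.325, None, 101.0, 101.0],
        "Density Value (g / ml)": [0.997, 1.02, None, None],
        "OsmoticCoefficient Value ()": [None, None, 0.93, 0.91],
    }
)

# Which rows would be discarded, and why.
print(report_missing(toy_df))

# Rows usable for one property; a missing density does not remove them.
osmotic_df = rows_for_property(toy_df, "OsmoticCoefficient Value ()")
print(osmotic_df["Id"].tolist())
```

Filtering per property keeps a row whenever the requested value and its state are present, so a missing value for one property does not discard measurements of another.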

mattwthompson self-assigned this Mar 12, 2025
