Getting rid of data point if thermophysical data is not included #578

Open
barmoral opened this issue Oct 1, 2024 · 4 comments

@barmoral
Collaborator

barmoral commented Oct 1, 2024

Is your feature request related to a problem? Please describe.
When using ThermoML DOIs as input data in evaluator for filtering, some data points have no pressure or temperature values. Because evaluator expects these thermodynamic properties, loading and/or filtering the data raises an error. The error arises because the pressure value (for example) in every row is converted into a physical property object, and the code breaks when a row has no value there.

Describe the solution you'd like
It would be better if evaluator automatically removed data points without complete thermodynamic data before the code breaks, or accepted them with a warning.

Describe alternatives you've considered
I manually removed the data points without complete thermodynamic data by using dropna().
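A minimal sketch of that workaround, assuming the column names shown in the dataframe later in this thread. The helper name `drop_incomplete_states` and the toy values are hypothetical and only stand in for the cached CSV; the cleaned frame could then be passed to `ThermoMLDataSet.from_pandas` as in the script below.

```python
import pandas as pd

# Column names follow the dataframe in this issue; the helper name is hypothetical.
STATE_COLUMNS = ["Temperature (K)", "Pressure (kPa)"]


def drop_incomplete_states(prop_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of prop_df without rows lacking temperature or pressure."""
    return prop_df.dropna(subset=STATE_COLUMNS).reset_index(drop=True)


# Toy data standing in for the cached CSV; the values are made up.
toy_df = pd.DataFrame(
    {
        "Id": ["a", "b", "c"],
        "Temperature (K)": [298.15, 298.15, None],
        "Pressure (kPa)": [101.0, None, 101.325],
    }
)

clean_df = drop_incomplete_states(toy_df)
print(clean_df["Id"].tolist())
```

Only the row with both temperature and pressure survives; the original frame is left unchanged.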

Additional context
I have attached an input JSON file (sorted_dois.json) to this issue.

Here is example Python code to reproduce the error:

import pandas as pd
import json
from pathlib import Path
from openff.evaluator.datasets import PhysicalProperty, PropertyPhase
from openff.evaluator.datasets.thermoml import thermoml_property
from openff.evaluator import properties
from openff.units import unit
from openff.evaluator.datasets.thermoml import ThermoMLDataSet

@thermoml_property("Osmotic coefficient", supported_phases=PropertyPhase.Liquid | PropertyPhase.Gas)
class OsmoticCoefficient(PhysicalProperty):
    """A class representation of an osmotic coefficient property"""

    @classmethod
    def default_unit(cls):
        return unit.dimensionless

# Expose the new class as an attribute of the openff.evaluator.properties module.
setattr(properties, OsmoticCoefficient.__name__, OsmoticCoefficient)

CACHED_PROP_PATH = Path('osmotic_data.csv')

if CACHED_PROP_PATH.exists():
    prop_df = pd.read_csv(CACHED_PROP_PATH, index_col=0)
    ## delete rows with undefined thermodynamic parameters to avoid indexing errors
    # prop_df = prop_df.dropna(subset=['Temperature (K)'])
    # prop_df = prop_df.dropna(subset=['Pressure (kPa)'])
    data_set = ThermoMLDataSet.from_pandas(prop_df)
else:
    with open('sorted_dois.json') as f:
        doi_dat = json.load(f)
        data_set = ThermoMLDataSet.from_doi(*doi_dat['working'])

    prop_df = data_set.to_pandas()
    # to_csv opens and writes the file itself, so no separate file handle is needed
    prop_df.to_csv(CACHED_PROP_PATH)

@lilyminium
Contributor

I ran into this too. It would be convenient for this to happen automatically in from_pandas.

@mattwthompson
Member

@barmoral Thanks for providing a reproduction I can easily get started on. How long does this script take to run, though? It's been a few minutes (probably just fetching the data?) and I want to make sure that's not surprising.

mattwthompson self-assigned this Mar 12, 2025

@mattwthompson
Member

Okay, it finished. I was just a little impatient.

Which columns should decide whether a row gets dropped? This dataframe has plenty of missing pressure data, but no missing temperature or phase data. Some other columns always have missing values, so we can't just call .dropna() without arguments:

In [25]: prop_df.isnull().sum()
Out[25]:
Id                                      0
Temperature (K)                         0
Pressure (kPa)                       1957
Phase                                   0
N Components                            0
Component 1                             0
Role 1                                  0
Mole Fraction 1                         0
Exact Amount 1                       5347
Component 2                            20
Role 2                                 20
Mole Fraction 2                        20
Exact Amount 2                       5347
Component 3                          3562
Role 3                               3562
Mole Fraction 3                      3562
Exact Amount 3                       5347
Density Value (g / ml)               4741
Density Uncertainty (g / ml)         4741
OsmoticCoefficient Value ()           606
OsmoticCoefficient Uncertainty ()     606
Source                                  0
dtype: int64
In [34]: prop_df.dropna()
Out[34]:
Empty DataFrame
Columns: [Id, Temperature (K), Pressure (kPa), Phase, N Components, Component 1, Role 1, Mole Fraction 1, Exact Amount 1, Component 2, Role 2, Mole Fraction 2, Exact Amount 2, Component 3, Role 3, Mole Fraction 3, Exact Amount 3, Density Value (g / ml), Density Uncertainty (g / ml), OsmoticCoefficient Value (), OsmoticCoefficient Uncertainty (), Source]
Index: []

My guess is that we want to consider pressure, temperature, and phase. For this data, that drops 1957 of the 5347 rows, all of them because of missing pressure:

In [33]: prop_df.describe(), prop_df.dropna(subset=['Pressure (kPa)', 'Temperature (K)', 'Phase']).describe()
Out[33]:
(       Temperature (K)  Pressure (kPa)  N Components  ...  Density Uncertainty (g / ml)  OsmoticCoefficient Value ()  OsmoticCoefficient Uncertainty ()
 count      5347.000000     3390.000000   5347.000000  ...                    606.000000                  4741.000000                        4741.000000
 mean        305.864640       98.806342      2.330092  ...                      0.001153                     0.819905                           0.028031
 std          12.173756        5.088289      0.478179  ...                      0.000988                     0.327787                           0.139206
 min         273.000000       84.500000      1.000000  ...                      0.000034                     0.146000                           0.000050
 25%         298.150000      101.000000      2.000000  ...                      0.000227                     0.690000                           0.005000
 50%         298.150000      101.000000      2.000000  ...                      0.001505                     0.827700                           0.006500
 75%         313.150000      101.000000      3.000000  ...                      0.001875                     0.929000                           0.010000
 max         353.150000      101.325000      3.000000  ...                      0.011740                     6.362000                           1.000000

 [8 rows x 12 columns],
        Temperature (K)  Pressure (kPa)  N Components  ...  Density Uncertainty (g / ml)  OsmoticCoefficient Value ()  OsmoticCoefficient Uncertainty ()
 count      3390.000000     3390.000000   3390.000000  ...                    606.000000                  2784.000000                        2784.000000
 mean        305.332354       98.806342      2.410324  ...                      0.001153                     0.862405                           0.041581
 std          12.601643        5.088289      0.497334  ...                      0.000988                     0.390842                           0.180353
 min         273.000000       84.500000      1.000000  ...                      0.000034                     0.146000                           0.000050
 25%         298.150000      101.000000      2.000000  ...                      0.000227                     0.716000                           0.005000
 50%         298.150000      101.000000      2.000000  ...                      0.001505                     0.859000                           0.006000
 75%         313.150000      101.000000      3.000000  ...                      0.001875                     0.944000                           0.008500
 max         353.150000      101.325000      3.000000  ...                      0.011740                     6.362000                           1.000000

 [8 rows x 12 columns])

But I wonder if you also want rows stripped out when density, osmotic coefficient, or other property values are missing?
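As a rough sketch of the "drop with a warning" option raised in this issue, the following filters on pressure, temperature, and phase and reports how many rows were removed. It uses only pandas and the standard `warnings` module; the function name `drop_incomplete_rows` and the toy data are hypothetical, and this is not how evaluator currently behaves.

```python
import warnings

import pandas as pd

# Columns suggested in the comment above; names follow the dataframe in this issue.
REQUIRED_COLUMNS = ["Temperature (K)", "Pressure (kPa)", "Phase"]


def drop_incomplete_rows(prop_df, required=REQUIRED_COLUMNS):
    """Drop rows missing any required column and warn about how many were dropped."""
    incomplete = prop_df[list(required)].isnull().any(axis=1)
    n_dropped = int(incomplete.sum())
    if n_dropped:
        warnings.warn(
            f"Dropping {n_dropped} of {len(prop_df)} rows with missing values "
            f"in {list(required)}",
            UserWarning,
            stacklevel=2,
        )
    return prop_df.loc[~incomplete].copy()


# Toy data with made-up values: rows b and d lack pressure, row c lacks phase.
toy_df = pd.DataFrame(
    {
        "Id": ["a", "b", "c", "d"],
        "Temperature (K)": [298.15, 298.15, 313.15, 313.15],
        "Pressure (kPa)": [101.0, None, 101.325, None],
        "Phase": ["Liquid", "Liquid", None, "Liquid"],
    }
)

filtered_df = drop_incomplete_rows(toy_df)
```

Property columns such as density are deliberately left out of `required`, so a row with a missing density but complete state information is kept.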

@barmoral
Collaborator Author

barmoral commented Mar 12, 2025

@mattwthompson Thanks for checking this out! No, I don't mind if density or osmotic coefficients are missing. If possible, it would be helpful for the code to run even when some data is missing and to use the data that is actually there, instead of discarding the whole data point. If that is not possible, it could report which data points are missing data and will therefore be thrown out when filtering for a specific property.
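The reporting idea could look something like this sketch, which lists the Ids of rows missing a value in each column of interest without removing anything. The helper name `missing_ids_by_column` and the toy data are hypothetical; the column names are taken from the dataframe above.

```python
import pandas as pd


def missing_ids_by_column(prop_df, columns, id_column="Id"):
    """Map each column to the Ids of the rows whose value in that column is missing."""
    return {
        column: prop_df.loc[prop_df[column].isnull(), id_column].tolist()
        for column in columns
    }


# Toy data with made-up values.
toy_df = pd.DataFrame(
    {
        "Id": ["a", "b", "c"],
        "Pressure (kPa)": [101.0, None, None],
        "Density Value (g / ml)": [None, 0.99, 1.01],
    }
)

report = missing_ids_by_column(
    toy_df, ["Pressure (kPa)", "Density Value (g / ml)"]
)
for column, ids in report.items():
    print(f"{column}: {len(ids)} rows missing -> {ids}")
```

A report like this could be printed or logged before filtering for a specific property, so it is clear which data points will be excluded and why.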
