MLFlowLogger: log system metrics #20563

Open
DerWeh opened this issue Jan 26, 2025 · 1 comment
Labels
feature (is an improvement or enhancement), needs triage (waiting to be triaged by maintainers)

Comments


DerWeh commented Jan 26, 2025

Description & Motivation

I use MLFlowLogger to track my experiments. As far as I can tell, there is no way to enable MLflow's system metrics logging: https://mlflow.org/docs/latest/system-metrics/index.html
Both mlflow.enable_system_metrics_logging() and the environment variable MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING appear to be ignored by MLFlowLogger.

Would it be possible to add this option? The most sensible approach, I think, would be an argument on MLFlowLogger that mirrors the behavior of mlflow.start_run(log_system_metrics=True).
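A minimal sketch of how such an argument could be resolved, assuming an explicit value should win and that leaving it unset should fall back to the environment variable. The helper name `resolve_log_system_metrics` is hypothetical and not part of Lightning or MLflow, and the accepted truthy strings ("true", "1") are an assumption about how the variable is parsed:

```python
import os
from typing import Optional

# Environment variable named in the issue.
ENV_VAR = "MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"


def resolve_log_system_metrics(explicit: Optional[bool] = None) -> bool:
    """Hypothetical helper: decide whether system metrics should be logged.

    An explicit True/False wins; None defers to the environment variable.
    """
    if explicit is not None:
        return explicit
    # Assumption: "true" or "1" (case-insensitive) enables the feature.
    return os.environ.get(ENV_VAR, "false").strip().lower() in {"true", "1"}


if __name__ == "__main__":
    os.environ[ENV_VAR] = "true"
    print(resolve_log_system_metrics())       # falls back to the env var
    print(resolve_log_system_metrics(False))  # explicit argument wins
```

With a helper like this, a logger could accept `log_system_metrics=None` by default and still honor the environment variable when the user does not pass anything.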


While I am at it: would it also be possible to rename or customize the checkpoints logged by log_model? Their names contain the epoch number without leading zeros, so they sort incorrectly in the MLflow interface. I would prefer zero-padded epoch numbers, so that lexicographic order matches the order of the epochs.
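To illustrate the sorting problem, here is a small sketch in plain Python. The `checkpoint_name` helper and the `epoch=N` naming are illustrative assumptions, not the exact names MLFlowLogger produces:

```python
def checkpoint_name(epoch: int, width: int = 0) -> str:
    """Illustrative name; width is the number of digits to zero-pad to."""
    return f"epoch={str(epoch).zfill(width)}"


epochs = [1, 2, 10]

# Without padding, string sorting puts epoch 10 before epoch 2.
plain = [checkpoint_name(e) for e in epochs]
print(sorted(plain))

# With padding, string sorting matches the epoch order.
padded = [checkpoint_name(e, width=3) for e in epochs]
print(sorted(padded))
```

Lightning's ModelCheckpoint accepts a format-style `filename` such as `"{epoch:03d}"`; whether log_model carries that name through to the MLflow artifact path is not verified here.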

Pitch

No response

Alternatives

No response

Additional context

No response

cc @lantiga @Borda

DerWeh added the feature and needs triage labels on Jan 26, 2025

Northo commented Mar 13, 2025

+1 this, would be great to see this added!

For now, I use this callback (it assumes MLFlowLogger is your first logger, since it reads trainer.logger):

import lightning as L
from lightning.fabric.utilities.exceptions import MisconfigurationException
from lightning.pytorch.loggers import MLFlowLogger
from mlflow.system_metrics.system_metrics_monitor import SystemMetricsMonitor


class MLFlowSystemMonitorCallback(L.Callback):
    """Run MLflow's system metrics monitor for the duration of fit()."""

    def on_fit_start(self, trainer: L.Trainer, pl_module: L.LightningModule) -> None:
        # trainer.logger is the first configured logger, so MLFlowLogger must come first.
        if not isinstance(trainer.logger, MLFlowLogger):
            raise MisconfigurationException(
                "MLFlowSystemMonitorCallback requires MLFlowLogger"
            )

        # Attach the monitor to the run that MLFlowLogger created.
        self.system_monitor = SystemMetricsMonitor(
            run_id=trainer.logger.run_id,
        )
        self.system_monitor.start()

    def on_fit_end(self, trainer: L.Trainer, pl_module: L.LightningModule) -> None:
        # Stop the monitor once fitting is done.
        self.system_monitor.finish()

There's also the DeviceStatsMonitor, which I haven't tried yet. A good approach might be to add an option to it, or to subclass it, so that it logs under the names MLflow expects, such as system/cpu_utilization_percentage.
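A rough sketch of the renaming step such an option or subclass would need, written as a standalone function so it does not depend on Lightning internals. The function name, the `cpu_percent` source key, and the `Callback.hook/` key prefix are assumptions for illustration; only the target name `system/cpu_utilization_percentage` comes from this thread:

```python
from typing import Dict, Mapping

# Assumed source key -> MLflow-style short name (illustrative, not exhaustive).
EXAMPLE_NAME_MAP = {"cpu_percent": "cpu_utilization_percentage"}


def to_mlflow_system_names(
    stats: Mapping[str, float],
    name_map: Mapping[str, str],
    prefix: str = "system/",
) -> Dict[str, float]:
    """Hypothetical helper: rename device stats into the ``system/`` namespace."""
    renamed: Dict[str, float] = {}
    for key, value in stats.items():
        # Keep only the last path segment, dropping any "<callback>.<hook>/" prefix.
        short = key.rsplit("/", 1)[-1]
        # Unknown names pass through unchanged under the same prefix.
        renamed[prefix + name_map.get(short, short)] = value
    return renamed


if __name__ == "__main__":
    stats = {"DeviceStatsMonitor.on_train_batch_end/cpu_percent": 37.5}
    print(to_mlflow_system_names(stats, EXAMPLE_NAME_MAP))
```

Passing the mapping in as a parameter keeps the guesswork about key names out of the function itself; the real keys would have to be checked against what DeviceStatsMonitor actually logs.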
