Possible memory leaks with future_pwalk() appear related to furrr_options(scheduling = ...) #275

Open
joshua-goldberg opened this issue Feb 3, 2025 · 0 comments


A full reprex may not be possible because multiple GBs of data are involved, but my use case is roughly as described below. In essence, I am passing an absolute path to a shapefile (the same shapefile for all workers) and the absolute path to a raster (a different raster for each worker, 1170 total), along with a couple of bookkeeping integers. Each worker then loads the raster and shapefile and computes some summary statistics of the raster for the area within each polygon. These summaries are written to a CSV for the next step of the analysis. More specifically:

library(here)
library(tidyverse)
library(vroom)
library(furrr)
library(sf)
library(terra)
library(exactextractr)

plan(multicore, workers = 8)

expand_grid(
  year = 1985:2023,
  run = 1:30
) |>
  mutate(
    names = c("filenames_that_include_year_and_run.tif"),
    path_to_rast = here("path_directory", names),
    path_to_shp = here("other_path_directory", "some_polygon.shp")
  ) |>
  select(-names) |>
  future_pwalk(
    \(year, run, path_to_rast, path_to_shp) {
      raster_of_interest <- path_to_rast |>
        terra::rast() |>
        terra::subst(0, NA_integer_)

      levels(raster_of_interest) <- data.frame(
        id = 1:4,
        category = c("cat1", "cat2", "cat3", "cat4")
      )

      some_polygons <- sf::read_sf(path_to_shp)

      rast_extract <- raster_of_interest |>
        exactextractr::exact_extract(some_polygons, function(df) {
          df |>
            dplyr::filter(!is.nan(value)) |>
            dplyr::group_by(poly_var_1, poly_var_2, poly_var_3, value) |>
            dplyr::summarize(area = sum(coverage_area)/10000, .groups = "drop_last")
        },
        summarize_df = TRUE, coverage_area = TRUE, progress = FALSE,
        include_cols = c("poly_var_1", "poly_var_2", "poly_var_3")
        ) |>
        dplyr::mutate(
          year = year,
          run = run
        ) |>
        vroom::vroom_write(
          file = here("output_directory", "output_filename_with_year_and_run.csv.gz" )
        )

      rm(list = ls(all = TRUE))
      gc()

    }, .options = furrr_options(
      scheduling = 10L, seed = TRUE,
      packages = c("terra", "sf", "exactextractr", "dplyr", "vroom")
    )
  )

This particular implementation might be memory inefficient (the shapefile could probably be loaded once and shared across workers, but it's not that big, <100 MB), but I would not expect the memory footprint of the operation to grow over time, or at least not by a substantial amount. That is not what I observe in practice: memory use grows over time and eventually reaches near-critical levels, usually after a handful to several rasters have been processed on each worker. Interestingly, the speed at which things deteriorate seems to depend on the value of scheduling in furrr_options(). When scheduling = 1L, the system starts to use swap memory on the second set of operations on each process. When scheduling = 10L or 20L, I can get a bit further (as many as 15-18 rasters handled by each process). There also seems to be some dependency on the number of inputs passed: feeding future_pwalk() more inputs keeps things afloat a bit longer, whereas memory balloons much more quickly when I process only a subset of everything I'd like to process.
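
For reference, my reading of the furrr docs (which may well be off, hence this issue) is that scheduling sets the number of futures per worker, so the effective chunk sizes I'm using are roughly as sketched below. The chunk_size = 1L variant is just an experiment I'm considering to force one raster per future; opts_one_per_future is a made-up name.

# Approximate chunking with 8 workers and 1170 rasters (my understanding):
#   scheduling = 1L  -> ~8 futures    of ~146 rasters each
#   scheduling = 10L -> ~80 futures   of ~15 rasters each
#   scheduling = 20L -> ~160 futures  of ~7-8 rasters each
# chunk_size is an alternative to scheduling; chunk_size = 1L should give
# one raster per future, so under multicore each raster would run in its
# own short-lived fork and release its memory when that fork exits.
opts_one_per_future <- furrr_options(
  chunk_size = 1L,
  seed = TRUE,
  packages = c("terra", "sf", "exactextractr", "dplyr", "vroom")
)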

This is not the first time I've encountered apparent memory leaks when processing large volumes of data, so I'm looking to get a better handle on how to troubleshoot and course-correct. I appreciate any suggestions or direction!
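
In case it helps with diagnosis, I could instrument the worker function to log each process's resident set size after every raster. A rough, untested sketch (it assumes the ps package, and the log file name is a placeholder) called at the end of the worker function:

# Append one line per raster with the worker's PID and resident set size.
# ps::ps_memory_info() reports sizes in bytes for the calling process.
log_worker_memory <- function(year, run) {
  rss_mb <- ps::ps_memory_info()[["rss"]] / 1024^2
  cat(
    sprintf("%s pid=%d year=%d run=%d rss_mb=%.0f\n",
            format(Sys.time()), Sys.getpid(), year, run, rss_mb),
    file = here("output_directory", "memory_log.txt"),
    append = TRUE
  )
}

Happy to run something like that against a subset of the rasters and report back if the resulting numbers would be useful.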
