Possible memory leaks with future_pwalk() appear related to furrr_options(scheduling = ...) #275

Open
joshua-goldberg opened this issue Feb 3, 2025 · 0 comments


A full reprex may not be possible because multiple GBs of data are involved, but my use case is roughly as described below. In essence, I am passing an absolute path to a shapefile (the same shapefile for all workers) and the absolute path to a raster (a different raster for each worker, 1170 total), along with a couple of bookkeeping integers. Each worker then loads the raster and shapefile and computes some summary statistics of the raster for the area within each polygon. These summaries are written to a CSV for the next step of the analysis. More specifically:

library(here)
library(tidyverse)
library(vroom)
library(furrr)
library(sf)
library(terra)
library(exactextractr)

plan(multicore, workers = 8)

expand_grid(
  year = 1985:2023,
  run = 1:30
) |>
  mutate(
    names = c("filenames_that_include_year_and_run.tif"),
    path_to_rast = here("path_directory", names),
    path_to_shp = here("other_path_directory", "some_polygon.shp")
  ) |>
  select(-names) |>
  future_pwalk(
    \(year, run, path_to_rast, path_to_shp) {
      raster_of_interest <- path_to_rast |>
        terra::rast() |>
        terra::subst(0, NA_integer_)

      levels(raster_of_interest) <- data.frame(
        id = 1:4,
        category = c("cat1", "cat2", "cat3", "cat4")
      )

      some_polygons <- sf::read_sf(path_to_shp)

      rast_extract <- raster_of_interest |>
        exactextractr::exact_extract(some_polygons, function(df) {
          df |>
            dplyr::filter(!is.nan(value)) |>
            dplyr::group_by(poly_var_1, poly_var_2, poly_var_3, value) |>
            dplyr::summarize(area = sum(coverage_area)/10000, .groups = "drop_last")
        },
        summarize_df = TRUE, coverage_area = TRUE, progress = FALSE,
        include_cols = c("poly_var_1", "poly_var_2", "poly_var_3")
        ) |>
        dplyr::mutate(
          year = year,
          run = run
        ) |>
        vroom::vroom_write(
          file = here("output_directory", "output_filename_with_year_and_run.csv.gz" )
        )

      rm(list = ls(all = TRUE))
      gc()

    }, .options = furrr_options(
      scheduling = 10L, seed = TRUE,
      packages = c("terra", "sf", "exactextractr", "dplyr", "vroom")
    )
  )

This particular implementation might be memory inefficient (the shapefile could probably be loaded once and shared across workers, but it's not that big, <100 MB), but I would not expect the memory footprint of the operation to grow over time, or at least not by a substantial amount. That is not what I observe in practice: memory use grows over time and eventually reaches near-critical levels, usually after a handful to several rasters have been processed on each worker. Interestingly, the speed at which things deteriorate seems to depend on the value of scheduling in furrr_options(). When scheduling = 1L, the system starts to use swap memory on the second set of operations on each process. When scheduling = 10L or 20L, I can get a bit further (as many as 15-18 rasters handled by each process). There also seems to be some dependency on the number of inputs passed: feeding future_pwalk() more inputs keeps things afloat a bit longer, whereas memory balloons much more quickly when I process only a subset of everything I'd like to process.
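
For reference, my reading of the furrr docs (which may well be off, hence this issue) is that scheduling sets the number of futures per worker, so the effective chunk sizes I'm using are roughly as sketched below. The chunk_size = 1L variant is just an experiment I'm considering to force one raster per future; opts_one_per_future is a made-up name.

# Approximate chunking with 8 workers and 1170 rasters (my understanding):
#   scheduling = 1L  -> ~8 futures    of ~146 rasters each
#   scheduling = 10L -> ~80 futures   of ~15 rasters each
#   scheduling = 20L -> ~160 futures  of ~7-8 rasters each
# chunk_size is an alternative to scheduling; chunk_size = 1L should give
# one raster per future, so under multicore each raster would run in its
# own short-lived fork and release its memory when that fork exits.
opts_one_per_future <- furrr_options(
  chunk_size = 1L,
  seed = TRUE,
  packages = c("terra", "sf", "exactextractr", "dplyr", "vroom")
)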

This is not the first time I've encountered apparent memory leaks when processing large volumes of data, so I'm looking to get a better handle on how to troubleshoot and course-correct. I appreciate any suggestions or direction!
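
In case it helps with diagnosis, I could instrument the worker function to log each process's resident set size after every raster. A rough, untested sketch (it assumes the ps package, and the log file name is a placeholder) called at the end of the worker function:

# Append one line per raster with the worker's PID and resident set size.
# ps::ps_memory_info() reports sizes in bytes for the calling process.
log_worker_memory <- function(year, run) {
  rss_mb <- ps::ps_memory_info()[["rss"]] / 1024^2
  cat(
    sprintf("%s pid=%d year=%d run=%d rss_mb=%.0f\n",
            format(Sys.time()), Sys.getpid(), year, run, rss_mb),
    file = here("output_directory", "memory_log.txt"),
    append = TRUE
  )
}

Happy to run something like that against a subset of the rasters and report back if the resulting numbers would be useful.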
