Replies: 4 comments 1 reply
It's not being impacted by repetitions because GDAL will cache the result, and you'll get near-instant responses in subsequent reads. I thought maybe we could turn that off with 'GDAL_CACHEMAX=0', but it doesn't affect it, and there are other things going on (the OS can cache and streamline subsequent reads as well, AFAIK). There's a good article about the cache behaviour (by @ctoney of course!): https://usdaforestservice.github.io/gdalraster/articles/gdal-block-cache.html
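To see the cache effect directly, one can time the same full read twice on an open dataset. This is only a sketch using gdalraster; the filename is a placeholder, and it assumes ds$read() and get_cache_used() behave as described in the linked article.

```r
library(gdalraster)

ds <- new(GDALRaster, "ortho.tif")  # hypothetical local file
xs <- ds$getRasterXSize()
ys <- ds$getRasterYSize()

# first read: blocks are decoded from disk and land in the block cache
t1 <- system.time(ds$read(band = 1, xoff = 0, yoff = 0,
                          xsize = xs, ysize = ys,
                          out_xsize = xs, out_ysize = ys))

cat("block cache used (MB):", get_cache_used(), "\n")

# second read: served from the cache, typically near-instant
t2 <- system.time(ds$read(band = 1, xoff = 0, yoff = 0,
                          xsize = xs, ysize = ys,
                          out_xsize = xs, out_ysize = ys))

ds$close()
print(rbind(first = t1, second = t2))
```

If the second timing is dramatically smaller than the first, the repetitions are hitting the cache rather than re-measuring decode speed.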
I wonder if it's worth comparing with results that use the default single-threaded behavior instead? See https://gdal.org/en/stable/drivers/raster/gtiff.html#open-options:
NUM_THREADS is also a creation option for GTiff: https://gdal.org/en/stable/drivers/raster/gtiff.html#creation-options I believe the same applies for COG. Setting the GDAL_NUM_THREADS configuration option affects those open options/creation options even if you don't set them explicitly at the driver level (i.e., at dataset creation or dataset open). I'm wondering if you're seeing some effect of speeding up slow compression and/or decompression with certain algorithms, while multi-threading may have less effect with algorithms that are not so slow to begin with?
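The configuration-option vs. creation-option distinction can be sketched like this in gdalraster (filenames are hypothetical; assuming set_config_option() and createCopy() as documented):

```r
library(gdalraster)

# global configuration option: picked up implicitly at dataset
# open/creation if NUM_THREADS is not set explicitly on the driver
set_config_option("GDAL_NUM_THREADS", "ALL_CPUS")

# explicit per-dataset creation option, passed at the driver level
createCopy(format = "GTiff", dst_filename = "out_zstd.tif",
           src_filename = "in.tif",
           options = c("COMPRESS=ZSTD", "NUM_THREADS=ALL_CPUS"))

# unset to restore GDAL's default (single-threaded) behavior
set_config_option("GDAL_NUM_THREADS", "")
```

Unsetting the configuration option between benchmark runs is what makes a clean single- vs. multi-threaded comparison possible.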
Thanks for the reply. I tried it with the default settings (single-threaded?) and it seems your thoughts are correct, at least for writing. While there are only minimal differences between single- and multi-threaded writing for some algorithms, the effect is quite obvious for e.g. WEBP or ZSTD. Reading seems not much affected by multi-threading (though I'm not sure whether the GDAL_NUM_THREADS setting directly affects reads?). To me it seems that the similar reading times (which made me curious) simply come from the fact that better-compressed images decode more slowly, but the smaller file size offsets this, so in the end it balances out (in my case).
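A benchmarking loop along these lines could make the single- vs. multi-threaded write comparison explicit. This is only a sketch with hypothetical filenames; WEBP in particular has band-count and band-type restrictions in GTiff that a real test would need to respect, so it is left out here.

```r
library(gdalraster)

src <- "ortho.tif"  # hypothetical source file
for (threads in c("1", "ALL_CPUS")) {
  set_config_option("GDAL_NUM_THREADS", threads)
  for (comp in c("LZW", "DEFLATE", "ZSTD")) {
    out <- sprintf("ortho_%s_%s.tif", comp, threads)
    tw <- system.time(
      createCopy("GTiff", out, src, options = paste0("COMPRESS=", comp))
    )
    cat(sprintf("%-8s threads=%-8s write=%.2fs size=%.1f MB\n",
                comp, threads, tw["elapsed"], file.size(out) / 2^20))
  }
}
set_config_option("GDAL_NUM_THREADS", "")  # restore default
```

Recording file size alongside write time also helps explain the balanced read times: slower decode, smaller file.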
I believe the changes you made removed the effect of the block cache, as originally noted by @mdsumner. Also, the write function as you have it now results in closing the dataset when …

I don't think the block cache is affecting relative performance in the current version of your tests, but FWIW, it should be possible to disable it with …
Also, with … the initial read operation in these tests (during …) … Starting with the source data in cloud storage does mimic the real-world use case. But write time in this test is actually measuring the combination of read from cloud plus write locally (i.e., …).

Please note that you could spend a lot of time on these suggestions and still not see much difference. I'm really not sure. Feel free to use or not accordingly.
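One way to separate the two costs is to stage the remote file locally first and time each step on its own. A sketch assuming a /vsicurl/ source (the URL is hypothetical) and gdalraster's vsi_copy_file():

```r
library(gdalraster)

src_remote <- "/vsicurl/https://example.com/ortho.tif"  # hypothetical URL
src_local  <- file.path(tempdir(), "ortho_local.tif")

# time the network transfer by itself
t_fetch <- system.time(vsi_copy_file(src_remote, src_local))

# the write benchmark now no longer includes reading from cloud storage
t_write <- system.time(
  createCopy("GTiff", file.path(tempdir(), "ortho_zstd.tif"),
             src_local, options = "COMPRESS=ZSTD")
)

cat("fetch:", t_fetch["elapsed"], "s; write:", t_write["elapsed"], "s\n")
```

Comparing t_fetch against t_write shows how much of the original "write time" was actually the cloud read.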
I would like to benchmark read and write times depending on image compression, and I tried to achieve this with the following code:
Is it valid here to simply measure the function read_ds()?

Created on 2025-03-19 with reprex v2.1.0
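For reference, a helper of the kind being timed might look roughly like this. This is a hypothetical sketch, not the poster's actual code, and the filename is a placeholder:

```r
library(gdalraster)

# hypothetical helper; reads band 1 of the whole raster into memory
read_ds <- function(filename) {
  ds <- new(GDALRaster, filename)
  on.exit(ds$close())
  ds$read(band = 1, xoff = 0, yoff = 0,
          xsize = ds$getRasterXSize(), ysize = ds$getRasterYSize(),
          out_xsize = ds$getRasterXSize(), out_ysize = ds$getRasterYSize())
}

t <- system.time(read_ds("ortho_zstd.tif"))  # hypothetical file
print(t)
```

Note that repeated calls on the same file within one session may be served from GDAL's block cache, which matters when interpreting repeated timings.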
I did this with a larger orthophoto in different compressions, and according to my benchmark the read time is approximately the same for all compressions. I didn't expect that, and now I am wondering whether the code above is not doing what I expected, there is an error somewhere else in my code, or there really is no pronounced difference.