Parsl Workflow Notes #8

Closed
julietcohen opened this issue Nov 9, 2022 · 14 comments
@julietcohen
Collaborator

Sticking points, suggested code, and questions that arose while working through the parsl workflow (parsl_breakdown.ipynb on the branch parsl-workflow-breakdown). Processed the lake change sample data provided by Ingmar (GeoPackage files separated by UTM zone).

Sticking Points

  • Staging files in parallel, with staging batch size = 1 and only 3 gpkg files as input, resulted in an error (possibly because I was working with explicit filepaths as input rather than a single base directory with many files within it?)
  • Staging files not in parallel, with 3 gpkg files as input
    Result: the staged dir was created and files were being written fine, but 47 minutes into the process (which usually takes ~50 minutes), after 14117 out of 19088 staged files had been written, an error was returned:

[screenshot: staging error traceback]

Next, I changed the input sample size to 10 gpkg files (I downloaded 10 new lake_sample.gpkg files from the Google Drive that Ingmar uploaded last week) and staged in parallel with staging batch size = 2.

Result: the staged dir was created and files were being written fine (again), but the process errored with the same message after 183 minutes (a very sad end to the workday):
[screenshot: same staging error traceback]

Also, I tried jumping straight to the rasterization step and pointing to the complete staged dir from my earlier lake change sample run without parsl. Rasterization ran in parallel and no errors resulted, but the geotiff and web_tile folders were not created.

Suggested Code

HighThroughputExecutor configuration for parsl workflow without kubernetes (run locally):

# imports needed for this config
import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

# bash command to activate virtual environment
activate_env = 'conda activate pdgviz'

htex_config_local = Config(
  executors = [
      HighThroughputExecutor(
        max_workers = 32, # 32 workers per node
        worker_debug = True,
        provider = LocalProvider(
          worker_init = activate_env,
          init_blocks = 1,
          max_blocks = 1
          ),
      )
  ],
  strategy = None
)

parsl.clear() 
parsl.load(htex_config_local)
  • read the parsl HighThroughputExecutor documentation and compared it to the same code chunk used in the Scalable Computing Course
  • 32 workers is what was used in the Scalable Computing Course and what appears in the documentation for HighThroughputExecutor
  • parsl.clear() is important to reset the parsl config with each run of the script
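
For context, a minimal sketch of how staged batches might be submitted against this config once it is loaded. Here input_paths, workflow_config, and the stage_batch app are hypothetical stand-ins for the actual code in parsl_breakdown.ipynb, and the per-file stage() call is an assumption about the pdgstaging API:

from parsl import python_app

@python_app
def stage_batch(paths, config):
    # hypothetical app: stage one batch of input gpkg files; assumes
    # pdgstaging.TileStager exposes a per-file stage() method
    from pdgstaging import TileStager
    stager = TileStager(config)
    for path in paths:
        stager.stage(path)
    return len(paths)

# with the config above loaded, submit one app per batch of input files
batch_size = 2
batches = [input_paths[i:i + batch_size]
           for i in range(0, len(input_paths), batch_size)]
app_futures = [stage_batch(batch, workflow_config) for batch in batches]

# block until every batch finishes (the line discussed later in this thread)
results = [a.result() for a in app_futures]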

Questions to investigate:

How often should the following be run? After each parallel operation (i.e., after staging in parallel, rasterizing in parallel, creating web tiles in parallel, etc.), or just at the end of the script (I think it is the latter)?

htex_config_local.executors[0].shutdown()
parsl.clear()
@mbjones
Member

mbjones commented Nov 9, 2022

Quick comment on the Questions... you only want to shutdown the executors when you are finished with all job requests. So, pretty much at the end, unless you are done with parsl and want to free up some memory before you do other things in your program. Note how the htex_config_local.executors[0] is the first element of an array -- so you are only shutting down the first configured set of executors. If you have multiple in your configuration, then you need to shut each down when appropriate.
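
A small sketch of what that looks like when the config holds more than one executor (building on the shutdown call quoted above):

# when every step is finished, shut down each configured executor
# (not just executors[0]) and then clear the parsl config
for executor in htex_config_local.executors:
    executor.shutdown()
parsl.clear()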

@julietcohen
Collaborator Author

Thank you, Matt! That makes sense. When we get the workflow to the point where it runs all steps in order, we won't shut down the executor in between steps. But as I play with the configuration to troubleshoot errors, I will make sure to shut down the executor before defining it again.

@julietcohen
Collaborator Author

After meeting with Robyn, here are a few troubleshooting approaches:

  • continue to play with the configuration of htex_config_local, such as adjusting the number of workers, cores per worker, including channel = LocalChannel(), etc.
  • remove the line [a.result() for a in app_futures] if running in a notebook rather than as a python script
  • check the logging (log.log) to troubleshoot where the error originates
  • the error may be due to certain troublesome files (an example would be an empty file), but a check for this is already implemented by Robyn, so it must be something else
  • convert the code in this .ipynb to a .py script and use tmux in the terminal to run the script in the background, freeing me up to run notebook chunks simultaneously

@julietcohen
Collaborator Author

Update on error in staging step with lake change sample data. Same error, but more explicit message printed in the log:

2022-11-09 16:02:01,365 [ERROR] fiona._env: sqlite3_exec(COMMIT) failed: disk I/O error
2022-11-09 16:02:01,378 [ERROR] fiona._env: sqlite3_exec(UPDATE gpkg_contents SET min_x = -179.9999069191315, min_y = 67.50575177703736, max_x = 179.9992878286158, max_y = 67.57693189423964 WHERE lower(table_name) = lower('255') AND Lower(data_type) = 'features') failed: attempt to write a readonly database
2022-11-09 16:02:01,378 [ERROR] fiona._env: sqlite3_exec(UPDATE gpkg_contents SET last_change = strftime('%Y-%m-%dT%H:%M:%fZ','now') WHERE lower(table_name) = lower('255')) failed: attempt to write a readonly database
2022-11-09 16:02:01,655 [ERROR] fiona._env: sqlite3_exec(UPDATE gpkg_contents SET min_x = -168.2158419480275, min_y = 64.7664172690093, max_x = -167.8532790776757, max_y = 65.17308489524688 WHERE lower(table_name) = lower('286') AND Lower(data_type) = 'features') failed: disk I/O error
2022-11-09 16:02:01,655 [ERROR] fiona._env: sqlite3_exec(UPDATE gpkg_contents SET last_change = strftime('%Y-%m-%dT%H:%M:%fZ','now') WHERE lower(table_name) = lower('286')) failed: attempt to write a readonly database
2022-11-09 16:02:02,708 [ERROR] fiona._env: sqlite3_exec(COMMIT) failed: disk I/O error
2022-11-09 16:02:02,709 [ERROR] fiona._env: sqlite3_exec(UPDATE gpkg_contents SET min_x = -156.1980575599727, min_y = 65.73921165229099, max_x = -156.1649869342755, max_y = 65.8307150756711 WHERE lower(table_name) = lower('275') AND Lower(data_type) = 'features') failed: attempt to write a readonly database
2022-11-09 16:02:02,710 [ERROR] fiona._env: sqlite3_exec(UPDATE gpkg_contents SET last_change = strftime('%Y-%m-%dT%H:%M:%fZ','now') WHERE lower(table_name) = lower('275')) failed: attempt to write a readonly database

parsl config used for script:

# same imports as above, plus LocalChannel
from parsl.channels import LocalChannel

activate_env = 'source /home/jcohen/.bashrc; conda activate pdgviz'

htex_config_local = Config(
  executors = [
      HighThroughputExecutor(
        label = "htex_Local",
        cores_per_worker = 2, 
        max_workers = 2, 
        worker_debug = False, 
        provider = LocalProvider(
          channel = LocalChannel(),
          worker_init = activate_env,
          init_blocks = 1, 
          max_blocks = 10 
        ),
      )
    ],
  )

parsl.clear() # first clear the current configuration since we will likely run this script multiple times
parsl.load(htex_config_local) # load the config we just outlined

Notes:

  • ran as .py script using tmux
  • many .gpkg files were written to the staged dir, just like in previous runs that resulted in this error
  • googling this error returns recommendations to adjust write permissions for directories, but I do not suspect this will solve the error because many files from the same data sample were written fine
  • likely an error in certain file(s) rather than an issue with parsl

@mbjones
Member

mbjones commented Nov 10, 2022

good sleuthing... seems like we need to differentiate between permissions issues that arise:

  1. from the original files having permissions not set consistently, or
  2. from our copy of the files to the computational node failing to set the correct permissions (probably due to a restrictive umask setting)

In both cases, we probably simply need to ensure files we plan to write to will be set as writable on the node it occurs on, either via a umask setting or via an explicit chmod call.
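
A minimal sketch of both options, assuming the files in question are the GeoPackages under a staged/ directory (the path is a placeholder):

import os
import stat
from pathlib import Path

# option 1: relax the umask before any files are written, so new files
# are created user- and group-writable (umask 0o002 -> mode 664 files)
os.umask(0o002)

# option 2: explicitly add write permission on files that were already
# copied over with restrictive permissions
for gpkg in Path('staged').rglob('*.gpkg'):
    gpkg.chmod(gpkg.stat().st_mode | stat.S_IWUSR | stat.S_IWGRP)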

@julietcohen
Collaborator Author

julietcohen commented Nov 11, 2022

Thanks Matt, I will look into the umask setting or a chmod call. I have not worked with those before, so I might reach out with questions.

In the meantime, I was able to stage the 3 original GeoPackage files that Ingmar provided as a sample in parallel with parsl. The error I pasted above resulted from his newer data files he uploaded to drive. The reason I was able to avoid the above error with the original 3 files could be:

  1. correct permissions settings
  2. the contents of the files themselves
  3. absence of this line of code, when working in an ipynb: [a.result() for a in app_futures] (this being the cause of the error is highly unlikely, it was just code I tried omitting during troubleshooting, running these steps as chunks rather than as a script)

I then was not able to rasterize the staged files that were created, so that will be my task on Monday. The error reported many times in rasterization_events.csv: unsupported operand type(s) for -: 'NoneType' and 'int'

@julietcohen
Collaborator Author

julietcohen commented Nov 16, 2022

After changing my logging configuration option "disable_existing_loggers" from true --> false (thank you for that suggestion, Robyn), I got slightly more informative messages in log.log, repeated for many tiles:

Beginning rasterization of 30 vector files at z-level 11

[INFO] pdgraster.RasterTiler: Rasterizing staged/WorldCRS84Quad/11/412/308.gpkg for tile Tile(x=412, y=308, z=11) to geotiff/WorldCRS84Quad/11/412/308.tif.
[ERROR] pdgraster.RasterTiler: Error rasterizing staged/WorldCRS84Quad/11/408/305.gpkg for tile Tile(x=408, y=305, z=11).
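
For reference, the disable_existing_loggers switch is part of the standard logging dict config; a minimal sketch of the kind of setup this refers to (the handler and file names here are placeholders, not the workflow's actual config):

import logging.config

logging.config.dictConfig({
    'version': 1,
    # keep loggers created inside pdgstaging/pdgraster alive so their
    # messages reach log.log instead of being silently dropped
    'disable_existing_loggers': False,
    'handlers': {
        'file': {
            'class': 'logging.FileHandler',
            'filename': 'log.log',
        },
    },
    'root': {'level': 'INFO', 'handlers': ['file']},
})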

I checked that I was able to read in the tiles separately as geodataframes. Thanks to the print statement at the start and the lack of a print statement that rasterization was complete, I could tell the source of the error (or part of it) was probably within rasterize_vectors(), either because of:

  1. a problem within rasterize_vector() itself (rasterize_vectors() is a wrapper for it),
    or
  2. this code that comes after rasterize_vector() but is still executed within rasterize_vectors():
if tile is not None:
    parent_tiles.add(self.tiles.get_parent_tile(tile))

While I am unsure of what option 2's code snippet does (to figure out tomorrow), it inspired me to change the make_parents argument for rasterize_vectors() from False --> True: the tile objects were not None, so even with make_parents set to False, it seemed this code was trying to do something with parent tiles. Setting that argument to True made the code run, and the geotiff dir was created as it should be, with all z-levels! It ran in a somewhat reasonable amount of time, too (52 minutes).

...but the output was still None for each batch:
None
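
For reference on what option 2's snippet is likely doing: in a quadtree tiling scheme like WorldCRS84Quad, the parent of a tile is the tile one z-level up that covers it, found by integer-halving x and y. A hedged sketch of that arithmetic (not the actual get_parent_tile() implementation):

from collections import namedtuple

Tile = namedtuple('Tile', ['x', 'y', 'z'])  # mirrors the Tile tuples in the log

def parent_tile(tile):
    # each parent covers a 2x2 block of child tiles one z-level up
    return Tile(x=tile.x // 2, y=tile.y // 2, z=tile.z - 1)

# e.g. the z-11 tile from the log above rolls up to this z-10 parent
print(parent_tile(Tile(x=412, y=308, z=11)))  # Tile(x=206, y=154, z=10)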

@julietcohen
Collaborator Author

julietcohen commented Nov 16, 2022

The number of files in z-level 11 of the geotiff dir is the same as the number of staged files (which of course are all z-level 11), so that means no fatal error occurred when processing the highest-resolution files. Some files are still erroring, this one in particular:
[screenshot: rasterization error for one z-level 11 tile]

and the same error for the same geotiff earlier in log.log (different time stamp):
[screenshot: the same error at an earlier time stamp]
and again:
[screenshot: the same error at yet another time stamp]

I wonder if this means I did not run the code in parallel correctly, because why would the same tile be processed multiple times...?

So moving on to creating web tiles.

@julietcohen
Collaborator Author

julietcohen commented Nov 17, 2022

Updates:

  • Note to myself to delete the documents rasterization_events.csv and rasters_summary.csv (or rename them in the config) when troubleshooting and running rasterization several times in a row without restarting the workflow, because these documents append rows with each run rather than re-writing the file, which can mess up the update_ranges() step that comes after rasterization. Remembering to do this will save time while troubleshooting! (A small sketch of this archiving step is at the end of this comment.)
  • I archived these older csv files and re-ran rasterization in parallel, and was pleased that it took 27 minutes, even though it still output the long list of None for each batch, pictured above. The only erroring file in the entire run was one in the highest-resolution z-level (11):

[screenshot: rasterization error for one z-level 11 file]

  • It is probably not an informative error, because the dir was created by the parallel loop so there was no way the file already existed.
  • Something interesting about this error is that it is the first documented rasterization event in rasterization_events.csv. I wonder if it being the first file could have something to do with the cause of the error:

[screenshot: rasterization_events.csv with the error as the first documented event]

Checking out rasters_summary.csv, I see the format of the columns is incorrect (which could just be the result of a few commas being misinterpreted as line breaks when rendering the csv, when they were not meant to be line breaks), and there are several rasters where the polygon count is 0:
[screenshot: rasters_summary.csv with misaligned columns and several zero polygon counts]

  • the tile that messed up the formatting in rasters_summary.csv was not the tile that errored

Temporary Solution:

Manually edited the csv to delete the troublesome lines 4 & 5, then re-uploaded that as the rasters_summary.csv and moved on to making web tiles and 3d tiles.
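
A minimal sketch of the archiving step mentioned in the first bullet, assuming the csv files sit in the working directory (the file and directory names are placeholders):

from datetime import datetime
from pathlib import Path

# hypothetical helper: move the per-run csv logs out of the way before
# re-running rasterization, so old rows do not skew update_ranges()
def archive_run_csvs(csv_names=('rasterization_events.csv', 'rasters_summary.csv')):
    stamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    archive_dir = Path('archived_csvs')
    archive_dir.mkdir(exist_ok=True)
    for name in csv_names:
        csv = Path(name)
        if csv.exists():
            csv.rename(archive_dir / f'{stamp}_{name}')

archive_run_csvs()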

@julietcohen
Collaborator Author

julietcohen commented Nov 18, 2022

Successfully created web tiles in parallel from the geotiffs, and no errors occurred according to the log, but I got the same repeated None output from the chunk as I got for rasterization (still a little confused about that, but it hasn't seemed to cause an issue so far!). The same number of files was created for web_tiles/coverage, web_tiles/polygon_count, and the geotiff dir. I updated ranges with the function before running the code, so I set the update_ranges argument to False.

@julietcohen
Collaborator Author

julietcohen commented Dec 6, 2022

Resuming workflow

After re-installing the pdgstaging and pdgraster packages with their recent updates, I started this parsl workflow again with Ingmar's data sample. See viz-workflow/workflow_troubleshooting/workflow_redo.ipynb on the branch parsl-workflow-breakdown.

  • after staging:
    • 19088 staged files ✅
  • after rasterization:
    • output of chunk = 637 None (one for each staged batch), as noted above, and no errors in the output or log.log (update: there were 3 error messages printed in log.log for this run that I initially overlooked, all at z-level 11, one for each file detailed below)
    • no weird formatting in rasters_summary.csv as was reported above ✅
    • in rasters_summary.csv, several unexpected coverage ranges such as -2.95E+301 - 0.315380167 (should be 0-1); it would be interesting to determine whether these are more common at lower or higher zoom levels (see the sketch after this list)
    • 19085 files in z-level 11 directory (3 fewer files than in the staged dir 🧐)
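
A hedged sketch of that zoom-level check; the column names ('stat', 'min', 'max', 'z') are assumptions about the rasters_summary.csv schema rather than what pdgraster actually writes:

import pandas as pd

# NOTE: column names here are assumptions about rasters_summary.csv
summary = pd.read_csv('rasters_summary.csv')
coverage = summary[summary['stat'] == 'coverage']

# coverage should fall within 0-1; count out-of-range rows per z-level
out_of_range = coverage[(coverage['min'] < 0) | (coverage['max'] > 1)]
print(out_of_range.groupby('z').size().sort_index())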

3 erroring files

To determine which files are failing to rasterize, I did some simple string manipulation to trim the file paths from the staged dir and the geotiff z-level 11 dir in order to subtract the list of raster paths from the list of gpkg paths.

String trimming
## geotiff strings
# start with all z-levels because the input to this function needs to be base dir
geotiff_paths = tile_manager.get_filenames_from_dir('geotiff')

# remove strings that do not have z-level 11
geotiff_paths_11 = []
for path in geotiff_paths:
    if '1' == path[21:22]:
        geotiff_paths_11.append(path)

# remove 'geotiff' and '.tif'
geotiff_paths_11_trimmed = []
for path in geotiff_paths_11:
    trimmed_1 = path.replace('geotiff', '')
    trimmed_2 = trimmed_1.replace('.tif', '')
    geotiff_paths_11_trimmed.append(trimmed_2)

## staged strings
staged_paths = stager.tiles.get_filenames_from_dir('staged')

# remove 'staged' and 'gpkg'
staged_paths_trimmed = []
for path in staged_paths:
    trimmed_1 = path.replace('staged', '')
    trimmed_2 = trimmed_1.replace('.gpkg', '')
    staged_paths_trimmed.append(trimmed_2)

# subtract lists of file paths
missing_files = []
for path in staged_paths_trimmed:
  if path not in geotiff_paths_11_trimmed:
    missing_files.append(path)

missing_files

Erroring files are:

['/WGS1984Quad/11/557/256',
 '/WGS1984Quad/11/463/320',
 '/WGS1984Quad/11/463/253']

Perhaps a more succinct version of this check (with list comprehension) would be a good log.log check and/or warning to integrate permanently into the workflow (see issue #11). If so, maybe we should make it a function.
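
A more succinct sketch of the same check using set arithmetic on the relative tile paths (assuming, as in the snippet above, that get_filenames_from_dir returns paths relative to the working directory; stager and tile_manager are the objects already defined in the workflow):

import os

def missing_rasters(staged_dir, geotiff_dir, z_level='11'):
    """Return staged tiles at the given z-level that have no geotiff."""
    def tile_keys(base_dir, paths):
        # strip the base dir and extension so paths from the two dirs match
        return {os.path.splitext(os.path.relpath(p, base_dir))[0] for p in paths}

    staged = tile_keys(staged_dir, stager.tiles.get_filenames_from_dir(staged_dir))
    rasters = tile_keys(geotiff_dir, tile_manager.get_filenames_from_dir(geotiff_dir))
    # keep only the staged tiles at the z-level of interest
    staged_z = {k for k in staged if k.split(os.sep)[1] == z_level}
    return sorted(staged_z - rasters)

missing_rasters('staged', 'geotiff')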

Turns out this was a little unnecessary since these files did print error messages in log.log.

@julietcohen
Collaborator Author

I did some general exploration of the erroring files (plotted, checked for NaN's in gdf format, etc.).

Then I put the 3 filepaths into a list, batched them, and rasterized them in parallel for all z-levels with no errors. It seems they only errored when processed alongside all the other staged files. This leads me to think perhaps we should just try to rasterize a file again when the criteria for the error message are met, and only produce the error message and move on if the second try still fails.

@julietcohen
Collaborator Author

I edited RasterTiler.py to retry rasterize_vector() if the try block failed, limited to a few attempts.

rasterize_vector() with retry integrated
    def rasterize_vector(self, path):
        """
            Given a path to an output file from the viz-staging step, create a
            GeoTIFF and save it to the configured dir_geotiff directory. If the
            output geotiff already exists, it will be overwritten.
            During this process, the min and max values (and other summary
            stats) of the data arrays that comprise the GeoTIFFs for each band
            will be tracked.
            Parameters
            ----------
            path : str
                Path to the staged vector file to rasterize.
            Returns
            -------
            morecantile.Tile or None
                The tile that was rasterized or None if there was an error.
        """

        # adjust so that if an error occurs, retry up to 3 times;
        # initialize the names used in the error handling below so they
        # are defined even if the first attempt fails before assigning them
        tile = None
        id = None
        last_error = None
        for attempt in range(3):
            try:

                # Get information about the tile from the path
                tile = self.tiles.tile_from_path(path)
                bounds = self.tiles.get_bounding_box(tile)
                out_path = self.tiles.path_from_tile(tile, 'geotiff')

                # Track and log the event
                id = self.__start_tracking('geotiffs_from_vectors')
                logger.info(f'Rasterizing {path} for tile {tile} to {out_path}.')

                # Check if deduplication should be performed first
                gdf = gpd.read_file(path)
                dedup_here = self.config.deduplicate_at('raster')
                dedup_method = self.config.get_deduplication_method()
                if dedup_here and (dedup_method is not None):
                    dedup_config = self.config.get_deduplication_config(gdf)
                    dedup = dedup_method(gdf, **dedup_config)
                    gdf = dedup['keep']

                # Get properties to pass to the rasterizer
                raster_opts = self.config.get_raster_config()

                # Rasterize
                raster = Raster.from_vector(
                    vector=gdf, bounds=bounds, **raster_opts)
                raster.write(out_path)

                # Track and log the end of the event
                message = f'Rasterization for tile {tile} complete.'
                self.__end_tracking(id, raster=raster, tile=tile, message=message)
                logger.info(
                    f'Complete rasterization of tile {tile} to {out_path}.')

                return tile

            except Exception as e:
                # keep a reference to the exception; 'e' itself is cleared
                # by Python when this except block ends
                last_error = e
                logger.info(f'Error rasterizing {path} for tile {tile} so trying again.')

        # reaching this point means every attempt raised an exception
        message = f'Error rasterizing {path} for tile {tile}, ran out of retries.'
        self.__end_tracking(id, tile=tile, error=last_error, message=message)
        return None

I anticipated that the same files would error as last time, which would be noted in log.log, then the script would retry and print Error rasterizing {path} for tile {tile} so trying again., and I could see whether my edit worked. However, it doesn't seem that any of those 3 files, or any other z-level 11 files, failed (which is good... but it means the rasterization failures are inconsistent, which does support the argument that there is nothing actually corrupt about the gpkg files, but rather that the workflow randomly fails on a few files sometimes). I got all expected files written into the z-level 11 geotiff dir (19088) without any logged statements that my code retried.

However, there were errors in creating parent geotiffs, so I will continue to try to test the modification I made to rasterize_vector().

@julietcohen
Collaborator Author

Closing this issue as the parsl workflow is now functional and is being integrated to use kubernetes and parsl with the Arctic Data Center cluster.
