Fix a large item stuck in accessioning or depositing
If a user uploads a ZIP file containing many files (e.g. >25k) and then submits the work, it may become bogged down in either the deposit step or the accessioning step.
The DepositJob in H2 may take a while to complete and can bog down the H2 server in the process, potentially triggering Nagios alerts.
You will know the DepositJob in H2 eventually succeeded if the work has a druid in the H2 database even though its state still says "depositing". If so, the problem is likely in accessioning. You can find the object in Argo and see if it looks stuck. You can also check Sidekiq to see if the related accessioning jobs are bogged down: https://robot-console-prod.stanford.edu/busy and https://dor-services-app-prod-a.stanford.edu/queues/busy
From the H2 Rails console:

```ruby
w = Work.find(work_id) # use the specific work_id from H2
w.druid
# => "druid:ab123cd5678"
wv = w.head
wv.state
# => "depositing"
```
Assuming this is the first version of the deposit and the job gets stuck during accessioning, it will most likely be stuck at the accessionWF:publish step. If so, you may need to abort the deposit completely and ask the user to start over, this time using a single ZIP file that will not be expanded (i.e. the browser upload option, with the single ZIP as the only file).
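If you want to confirm which step the object is stuck on without clicking through Argo, here is a minimal sketch using the dor-workflow-client gem; the workflow service URL is a placeholder and the exact keyword arguments for workflow_status are an assumption that may vary with the gem version in use:

```ruby
# Hedged sketch: check the status of the accessionWF publish step for a druid.
# The URL is a placeholder; verify the workflow_status signature for the
# dor-workflow-client version you have installed.
require 'dor/workflow/client'

client = Dor::Workflow::Client.new(url: 'https://workflow-service.example.edu/workflow')
client.workflow_status(druid: 'druid:ab123cd5678', workflow: 'accessionWF', process: 'publish')
# => "queued", "started", "error", etc.
```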
To clean up, you should:
1. Stop the object from accessioning by using the cleanup:stop_accessioning rake task in dor-services-app. This will remove all files from the workspace and remove all workflows from the object. See https://github.com/sul-dlss/dor-services-app/blob/main/lib/tasks/cleanup.rake. Run this task on the DSA box in a screen session: if the object is large (in number of files, size of files, or both) the task may take a while, and you don't want it to be aborted if you drop your ssh session.

```shell
RAILS_ENV=production bundle exec rake cleanup:stop_accessioning['druid:ab123bc4567'] # put druid here
```
2. Once the task completes, check the digital stacks (or ask the SDR Ops manager to do this) for any files that may have already been shelved by the incomplete accessioning process. These files can be deleted (a sketch for locating them follows below).
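Here is a minimal sketch for locating the shelved files yourself; it assumes the legacy druid-tree layout on the stacks filesystem (e.g. /stacks/ab/123/bc/4567), which may not match the current stacks storage layout, so verify before deleting anything:

```ruby
# Hedged sketch: list any files already shelved for the druid, assuming the
# druid-tree layout under /stacks. Verify the layout before deleting anything.
druid = 'ab123bc4567' # bare druid, without the "druid:" prefix
tree  = druid.match(/\A([a-z]{2})(\d{3})([a-z]{2})(\d{4})\z/).captures.join('/')
stacks_path = File.join('/stacks', tree)
# => "/stacks/ab/123/bc/4567"
Dir.glob(File.join(stacks_path, '**', '*')).select { |f| File.file?(f) }
```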
3. Return to H2 and reset the object back to first_draft. You should also remove all attached files and any druid/DOI stored in the database. This will allow the H2 object to be re-deposited again into a brand new SDR object (a sketch for clearing the druid/DOI follows the console block below):

```ruby
w = Work.find(work_id) # use the specific work_id from H2
wv = w.head
wv.attached_files.size
# => 49213 # DOH!
wv.upload_type
# => "browser"
wv.state
# => "depositing"
wv.attached_files.destroy_all; nil # delete all files and don't print all the output to the console
wv.state = "first_draft" # reset to first_draft
wv.upload_type = nil # reset the upload type
wv.save
```
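The block above does not clear the druid or DOI; here is a minimal sketch for that, assuming druid and doi are stored as columns on the H2 Work record (confirm the actual column names before running):

```ruby
# Hedged sketch: clear the druid/DOI so the re-deposit mints fresh ones.
# Assumes `druid` and `doi` are columns on Work; verify with Work.column_names.
w.druid = nil
w.doi = nil
w.save
```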
4. Change the source ID of the original druid (from step 1) so that it doesn't clash when the user deposits the same work into a new object; a sketch follows below.
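One way to do this is from a dor-services-app console; the class names (CocinaObjectStore, UpdateObjectService) and the replacement source ID value below are assumptions to check against the current DSA code, and Argo's change-source-id option (if available to you) is an alternative:

```ruby
# Hedged sketch (dor-services-app console): point the abandoned object's
# sourceId at a value that won't clash with the re-deposit. Class names and
# the update call are assumptions; check the current DSA code first.
cocina = CocinaObjectStore.find('druid:ab123bc4567')
updated = cocina.new(
  identification: cocina.identification.new(sourceId: 'hydrus:object-123-abandoned')
)
UpdateObjectService.update(updated)
```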
5. Let the SDR Product Owner know so they can tell the user to try again using a single ZIP file with the browser upload option.
6. Check whether the publish job for that druid is still running: https://robot-console-prod.stanford.edu/busy. It should eventually die and not be retried (a console sketch for checking this follows below).
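If you'd rather check from a console than the web UI, here is a minimal sketch for inspecting busy Sidekiq workers; it assumes you run it on the host whose queues the robot console displays and that the druid appears somewhere in the worker payload:

```ruby
# Hedged sketch: list any busy Sidekiq workers whose payload mentions the druid.
# The shape of `work` (hash vs. object) varies by Sidekiq version, so stringify it.
require 'sidekiq/api'

druid = 'druid:ab123bc4567'
Sidekiq::Workers.new.each do |process_id, thread_id, work|
  puts "#{process_id} #{thread_id}: #{work.inspect}" if work.inspect.include?(druid)
end
```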
Note that even though you may see a DOI in the H2 database, if it's the first version and it got stuck at the publish step of accessionWF, a real DOI will not actually have been minted yet.
If the work has no druid listed in the H2 database yet, the problem is in the DepositJob itself rather than accessioning. This case is easier, and you should be able to get away with only steps 3 onward. You will still want to kill the DepositJob that may still be running in H2, though, or else it may get a druid and start accessioning when you don't want that to happen: https://sdr.stanford.edu/queues/busy
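Here is a minimal sketch for removing a queued or retrying DepositJob from the H2 console; the 'default' queue name and the way the work id appears in the job arguments are assumptions (inspect job.display_args first), and a job that a busy worker has already picked up cannot be deleted this way; it has to finish, or the Sidekiq process has to be restarted:

```ruby
# Hedged sketch (H2 console): delete any queued/retrying/scheduled DepositJob
# for a given work before it mints a druid and kicks off accessioning.
require 'sidekiq/api'

work_id = 123 # the specific work_id from H2

[Sidekiq::Queue.new('default'), Sidekiq::RetrySet.new, Sidekiq::ScheduledSet.new].each do |jobs|
  jobs.each do |job|
    next unless job.display_class == 'DepositJob'
    # How the work id shows up depends on how DepositJob is enqueued; inspect
    # job.display_args and adjust this check before deleting.
    job.delete if job.display_args.inspect.include?(work_id.to_s)
  end
end
```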