SAMZA-2464: Container shuts down when task fails to remove old state checkpoint dirs #1283
Symptom: The container shuts down with an exception when TaskInstance.commit invokes TaskStorageManager.removeOldCheckpoints.
Cause: Concurrent modification of checkpoint directories by other processes or threads may cause a FileNotFoundException to be thrown, which shuts down the container. An IOException may also be thrown for other miscellaneous failures; these should not shut down the container but should be logged so that processing can continue.
Tests: Added a unit test that fails when an exception from removeOldCheckpoints is not caught.
API Changes: None
Upgrade instructions: None
Usage instructions: None
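
The behavior described under Cause can be illustrated with a minimal Java sketch. This is not the code from this change: the class `CheckpointCleanupSketch` and the methods `removeOldCheckpointDir` and `commit` are hypothetical stand-ins for the real TaskStorageManager and TaskInstance logic, assumed here only to show the idea of catching the IOException, logging it, and letting processing continue.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; names do not come from the Samza code base.
final class CheckpointCleanupSketch {

    // Stand-in for the cleanup step: recursively deletes one old checkpoint directory.
    // Throws an IOException subclass if the directory disappears concurrently
    // (java.nio reports this as NoSuchFileException) or on other I/O failures.
    static void removeOldCheckpointDir(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so that children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        } catch (UncheckedIOException e) {
            // Files.walk wraps I/O errors hit during traversal; unwrap them.
            throw e.getCause();
        }
        for (Path p : paths) {
            // deleteIfExists tolerates entries already removed by another thread or process.
            Files.deleteIfExists(p);
        }
    }

    // Stand-in for the commit path: a cleanup failure is logged, not propagated,
    // so it cannot shut down the caller. Returns whether the cleanup succeeded.
    static boolean commit(Path oldCheckpointDir) {
        try {
            removeOldCheckpointDir(oldCheckpointDir);
            return true;
        } catch (IOException e) {
            System.err.println("Failed to remove old checkpoint dir " + oldCheckpointDir + ": " + e);
            return false;
        }
    }
}
```

In this sketch, calling `commit` on a directory that no longer exists logs the failure and returns `false` instead of throwing, which mirrors the intent that a failed cleanup should not stop processing.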