NUTCH-3067 Improve performance of FetchItemQueues if error state is preserved #827
Address NUTCH-3067:
- Do not keep every stateful queue: drop queues that have a low exception count after a configurable amount of time. If a second URL from the same host/domain/IP is fetched after a considerably long time span (e.g., 30 minutes), the effect on performance and politeness should be negligible. This is configured by `fetcher.exceptions.per.queue.clear.after` (default: 30 minutes): empty queues are dropped once this time has elapsed after the queue's next fetch time, that is, in addition to any delay defined by the exponential backoff.
- Reviewed and improved the handling of the exponential backoff in FetchQueues.checkExceptionThreshold: if the delayed next fetch would happen after the fetcher timelimit (if configured), the queue is purged and blocked because no item would be fetched from it anyway.
- Reduce the memory footprint of a single FetchQueue, which matters when there are many of them.
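
The two decisions described in the list can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the class `QueueRetentionSketch`, the `QueueState` holder, and both method names are hypothetical, and the convention that a non-positive timelimit means "not configured" is an assumption.

```java
import java.time.Duration;

/** Hypothetical sketch of the queue retention rules; not the actual Nutch code. */
public class QueueRetentionSketch {

  /** Minimal stand-in for a stateful fetch queue (assumed fields). */
  public static final class QueueState {
    public final int pendingItems;     // fetch items still waiting in the queue
    public final long nextFetchTimeMs; // earliest next fetch, backoff delay included

    public QueueState(int pendingItems, long nextFetchTimeMs) {
      this.pendingItems = pendingItems;
      this.nextFetchTimeMs = nextFetchTimeMs;
    }
  }

  /**
   * An empty queue may be dropped once clearAfter has elapsed after its
   * next fetch time, i.e. in addition to any exponential backoff delay.
   */
  public static boolean canDropQueue(QueueState q, long nowMs, Duration clearAfter) {
    return q.pendingItems == 0
        && nowMs > q.nextFetchTimeMs + clearAfter.toMillis();
  }

  /**
   * If the delayed next fetch falls after the fetcher timelimit, nothing
   * will be fetched from the queue anyway, so it can be purged and blocked.
   * Assumption: timeLimitMs <= 0 means that no timelimit is configured.
   */
  public static boolean shouldPurgeAndBlock(long nextFetchTimeMs, long timeLimitMs) {
    return timeLimitMs > 0 && nextFetchTimeMs > timeLimitMs;
  }

  public static void main(String[] args) {
    Duration clearAfter = Duration.ofMinutes(30); // 1,800,000 ms
    QueueState empty = new QueueState(0, 1_000_000L);

    // Too early: 2,000,000 is not after 1,000,000 + 1,800,000.
    System.out.println(canDropQueue(empty, 2_000_000L, clearAfter));
    // Late enough: 3,000,000 is after 2,800,000.
    System.out.println(canDropQueue(empty, 3_000_000L, clearAfter));
    // The next fetch at 5,000,000 lies beyond a timelimit of 4,000,000.
    System.out.println(shouldPurgeAndBlock(5_000_000L, 4_000_000L));
  }
}
```

Measuring the retention period from the next fetch time rather than from the last fetch means a queue under backoff is never dropped before its backoff delay has expired.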
In addition, logging and documentation have been improved. While testing this PR on a production crawl, another issue, NUTCH-3072, was detected.