NUTCH-3067 Improve performance of FetchItemQueues if error state is preserved #827

Merged (5 commits) on Oct 24, 2024


sebastian-nagel
Contributor

@sebastian-nagel sebastian-nagel commented Oct 4, 2024

Address NUTCH-3067:

  1. do not keep every stateful queue: drop queues that have a low exception count after a configurable amount of time. If a second URL from the same host/domain/IP is fetched only after a considerably long time span (e.g. 30 minutes), the effect on performance and politeness should be negligible.

    This is configured by fetcher.exceptions.per.queue.clear.after (default: 30 minutes). Empty queues are dropped once this time has elapsed after the queue's next fetch time, that is, in addition to any delay defined by the exponential backoff.

  2. reviewed and improved the handling of the exponential backoff in FetchItemQueues.checkExceptionThreshold: if the delayed next fetch would happen after the fetcher time limit (if configured), the queue is purged and blocked, because no fetch item would be fetched from the queue anyway.

  3. reduce the memory footprint of a single FetchItemQueue, which is important if there are many of them.
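
The drop rule from item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `QueueStateSketch` and its fields are hypothetical stand-ins, and only the rule itself (queue is empty and the clear-after time has elapsed past the next fetch time) is taken from the description above.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical stand-in for the state kept per fetch queue (not the Nutch class). */
final class QueueStateSketch {
    final int queuedItems;       // URLs still waiting in this queue
    final int exceptionCount;    // preserved error state; shown for context, not used by the rule
    final Instant nextFetchTime; // assumed to already include any backoff delay

    QueueStateSketch(int queuedItems, int exceptionCount, Instant nextFetchTime) {
        this.queuedItems = queuedItems;
        this.exceptionCount = exceptionCount;
        this.nextFetchTime = nextFetchTime;
    }

    /**
     * An empty queue may be dropped once clearAfter has elapsed after its
     * next fetch time. clearAfter plays the role of
     * fetcher.exceptions.per.queue.clear.after.
     */
    boolean canDrop(Instant now, Duration clearAfter) {
        if (queuedItems > 0) {
            return false; // never drop a queue that still holds URLs
        }
        return now.isAfter(nextFetchTime.plus(clearAfter));
    }

    public static void main(String[] args) {
        Instant nextFetch = Instant.parse("2024-10-04T12:00:00Z");
        Duration clearAfter = Duration.ofMinutes(30);
        QueueStateSketch empty = new QueueStateSketch(0, 1, nextFetch);
        QueueStateSketch busy = new QueueStateSketch(3, 1, nextFetch);

        System.out.println(empty.canDrop(nextFetch.plus(Duration.ofMinutes(29)), clearAfter)); // false
        System.out.println(empty.canDrop(nextFetch.plus(Duration.ofMinutes(31)), clearAfter)); // true
        System.out.println(busy.canDrop(nextFetch.plus(Duration.ofMinutes(31)), clearAfter));  // false
    }
}
```

Measuring from the next fetch time rather than from the last error means a queue that is still serving a backoff delay is not forgotten early.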

In addition, logging and documentation have been improved. While testing this PR on a production crawl, a further issue, NUTCH-3072, was detected.
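
The time limit check from item 2 can be illustrated with a small sketch. It is written under stated assumptions: the doubling formula in `backoffDelay` is a hypothetical example of an exponential backoff and may differ from what checkExceptionThreshold actually computes, and `BackoffSketch` and its method names are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative only: decides whether a delayed queue can still be fetched in time. */
final class BackoffSketch {

    /** Assumed backoff: the base delay doubles for every exception after the first. */
    static Duration backoffDelay(Duration baseDelay, int exceptionCount) {
        // Cap the number of doublings to keep the shift well-defined.
        int doublings = Math.max(0, Math.min(exceptionCount - 1, 30));
        return baseDelay.multipliedBy(1L << doublings);
    }

    /**
     * Returns true if the next fetch would fall after the fetcher time limit,
     * in which case the queue can be purged and blocked right away.
     * A null timeLimit means that no time limit is configured.
     */
    static boolean shouldPurgeAndBlock(Instant now, Duration baseDelay,
                                       int exceptionCount, Instant timeLimit) {
        if (timeLimit == null) {
            return false;
        }
        Instant nextFetch = now.plus(backoffDelay(baseDelay, exceptionCount));
        return nextFetch.isAfter(timeLimit);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-10-04T12:00:00Z");
        Instant limit = now.plus(Duration.ofMinutes(3));
        Duration base = Duration.ofSeconds(30);

        System.out.println(shouldPurgeAndBlock(now, base, 1, limit)); // false: next fetch at 12:00:30
        System.out.println(shouldPurgeAndBlock(now, base, 4, limit)); // true: next fetch at 12:04:00
        System.out.println(shouldPurgeAndBlock(now, base, 4, null));  // false: no time limit set
    }
}
```

Purging early in this situation avoids keeping fetch items in memory that cannot be fetched before the fetcher stops.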

@sebastian-nagel
Contributor Author

This PR has been tested successfully in production: with the default of 30 minutes for fetcher.exceptions.per.queue.clear.after, the number of FetchQueues held stabilizes after half an hour, because queues with a single error and no newly added URLs are removed after this time span.

…reserved

- reduce memory footprint of FetchItemQueue
…reserved

- purge and block queues which are delayed because of exceptions
  in case the next fetch would happen after the fetcher timelimit
…reserved

- skip empty fetch queues which hold exception counts after the time configured in
  fetcher.exceptions.per.queue.clear.after has passed in addition to the delay
  defined by the exponential backoff
…reserved

- more verbose logging when reaching the Fetcher throughput threshold,
  when emptying fetch queues and when aborting with hung threads
- add note that fetcher.throughput.threshold.retries should not exceed
  the timeout defined by mapreduce.task.timeout and
  fetcher.threads.timeout.divisor
@sebastian-nagel
Contributor Author

(rebased on recent master)

@sebastian-nagel sebastian-nagel merged commit b02340d into apache:master Oct 24, 2024
4 checks passed
@sebastian-nagel sebastian-nagel deleted the NUTCH-3067 branch December 4, 2024 17:58