SOLR-17447 : Support to early terminate a search based on maxHits per collector. #2960
Conversation
Overall this makes sense to me. Thanks for the nice contribution! I'd prefer if @gus-asf could take a look at some aspects since he worked on cpuAllowed recently. BTW it's clear you need to run ./gradlew tidy
I suggest renaming the param "maxHitsPerShard" to simply "maxHits" or "maxHitsTerminateEarly" and document that it's per-shard and best-effort; that more hits may ultimately be detected in aggregate. But maybe others disagree.
It'd be good to consider interference with other features. Maybe it works with cursorMark? Does it count after PostFilter (e.g. CollapseQParser) or before?
Renamed to maxHits
Force-pushed from e4a7c17 to f475bf7.
Adds the capability to limit hits per shard. The "maxHitsPerShard" request parameter controls how many hits the searcher should iterate over per shard. Once the searcher runs past the specified number of documents, it terminates the search with EarlyTerminatingCollectorException. This is indicated by a new response header, "terminatedEarly"; "partialResults" will also indicate that the results are partial. The parameter is supported in MT (multi-threaded) mode as well. Though there are other mechanisms to control runaway queries, such as CPU usage limits and time limits, this is simpler for certain use cases, especially high-recall queries and rerank use cases.
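For illustration only, here is a minimal SolrJ-style sketch of how a client might set the parameter and inspect the new header, assuming the "maxHits" name and the "terminatedEarly"/"partialResults" response-header keys described in this PR (none of this is merged API, and the collection URL is just an example):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MaxHitsExample {
  public static void main(String[] args) throws Exception {
    // Assumes a local Solr with a "techproducts" collection; adjust the URL as needed.
    try (Http2SolrClient client =
        new Http2SolrClient.Builder("http://localhost:8983/solr/techproducts").build()) {
      SolrQuery query = new SolrQuery("*:*");
      // Hypothetical parameter from this PR: stop collecting after roughly 1000 hits per shard.
      query.set("maxHits", 1000);
      QueryResponse rsp = client.query(query);
      // The PR adds these response-header entries when the search stops early.
      System.out.println("terminatedEarly=" + rsp.getResponseHeader().get("terminatedEarly"));
      System.out.println("partialResults=" + rsp.getResponseHeader().get("partialResults"));
    }
  }
}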
rename parameter to maxHits
@dsmiley can you please take another look when you get a chance?
Note: you force-pushed to this PR. Please don't do that; it resets the review state, making it hard for me to see changes since my last review, and even if I try, an earlier commit may have been changed without my knowing it. So never force-push to a PR that has been reviewed.
Recommended additional reviewers based on touching similar functionality: @atris @gus-asf @cpoerschke
solr/core/src/java/org/apache/solr/search/EarlyTerminatingCollector.java
collector = (MaxScoreCollector) var4.next();
Collector next = (Collector) var4.next();
could use a comment like "assume the next collector, except EarlyTerminating, is the MaxScoreCollector".
While you are changing this code, please rename var4, which is an inexcusable variable name.
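For what it's worth, a minimal sketch of that suggestion; the iterator and surrounding names are illustrative, not taken from the patch:

// Assume the next collector, except EarlyTerminatingCollector, is the MaxScoreCollector.
Iterator<Collector> collectorIterator = collectors.iterator();
while (collectorIterator.hasNext()) {
  Collector next = collectorIterator.next();
  if (next instanceof MaxScoreCollector maxScoreCollector) {
    collector = maxScoreCollector;
  }
}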
Done
@@ -22,6 +22,7 @@ public class QueryResult {
// Object for back compatibility so that we render true not "true" in json
private Object partialResults;
private Boolean segmentTerminatedEarly;
thinking out loud: Perhaps we should do away with segmentTerminatedEarly as overly specific
Agreed, I'm not sure how people are going to respond differently between the two
I think we need to distinguish the two because their causes are different. segmentTerminateEarly could occur because the searcher does not proceed on a sorted segment, since the remaining docs have lower scores. terminateEarly is purely due to the searcher running past the provided maxHits. As a user, if I see terminateEarly then I know I might need to increase the maxHits parameter.
Okay; so the cause is interesting for diagnostic / observability purposes but semantically, the search has "terminated early" (for whatever reason).
I agree it's good to know why so that the corrective action is clear, but terminateEarly is unlike maxHits (or shardMaxHits), so the relationship takes special knowledge to appreciate. If something is going to report a reason, I'd want it to include the same phrase, i.e. "maxHitsReached", and I definitely would not want to have to "just know" that terminateEarly has nothing to do with segmentTerminateEarly despite the strong similarity in naming.
@@ -338,6 +339,11 @@ private Collector buildAndRunCollectorChain(
if (cmd.isQueryCancellable()) {
core.getCancellableQueryTracker().removeCancellableQuery(cmd.getQueryID());
}
if (collector instanceof final EarlyTerminatingCollector earlyTerminatingCollector) {
this check is brittle since it can only possibly be true if the last collector is earlyTerminatingCollector. We may add others or re-order at some point. Also, can we just depend on the exception above and set qr.setTerminatedEarly(true); there?
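A rough sketch of that alternative, relying on the exception rather than the collector type (names follow this thread's discussion, not the exact patch code):

try {
  searcher.search(query, collector);
} catch (EarlyTerminatingCollectorException etce) {
  // The exception itself signals that maxHits was reached; no need to inspect the collector chain.
  qr.setTerminatedEarly(true);
  qr.setPartialResults(true);
}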
done.
A test is needed.
This parameter specifies the max number of hits a searcher will iterate through before early terminating the search.
The count is per shard and across all threads involved in case of multi-threaded search.
This parameter works in conjunction with other parameters that could early terminate a search, e.g. _timeAllowed_.
If the search was terminated early because it exceeded maxHits, a _terminatedEarly_ header will be set in the response, along with _partialResults_, to indicate this.
Note: the hits counted need not be exactly equal to the maxHits provided, but will be within range of this value.
Can you explain here when the user would see partial results? The user will probably not be very familiar with how Lucene or the searcher works, so it should be from their point of view.
updated documentation.
added tests.
@dsmiley @HoustonPutman updated, please take a look when you get a chance.
Maybe shardMaxHits? Brevity is nice, but I think it's helpful to have a name that is at least slightly self-documenting.
Overall I think this is a great idea. For some use cases this is certainly appropriate. My primary concern is that the use of "terminate early" is getting pretty overloaded. There is code relating to Lucene's segmentTerminateEarly, and terminateEarly was previously used in SOLR-3240, and you have stepped on some of that code, possibly breaking things if it isn't dead code from a previously removed feature. My suggestion is that this not add a third "terminateEarly" but rather use its functional behavior as a name, e.g. "searcherMaxHits" or maybe "shardMaxHits", and carry that more easily distinguished and self-documenting name through the various code locations where we check it or name things after it.
Finally I wonder if this and spellcheck_collate_max_docs should be doing exactly the same thing, or if they should be entirely separate. Should the spellcheck feature actually be using this under the covers now that it exists (or is this really just exposing the same idea for non-spellcheck cases)?
parameter due to other limits like _timeAllowed_ or _cpuAllowed_.
Note: the hits counted need not be exactly equal to the maxHits provided, but will be within range of this value.
I have a similar concern as Houston regarding helping the user understand when to use this, here's my attempt to rewrite your text based on what I've gathered... of course if I misunderstood something or I'm too wordy we should fix that ;) As a side note I think we're supposed to write these as "ventilated prose" which is one sentence per line, no line breaks in a sentence.
== maxHits Parameter
[%autowidth,frame=none]
|===
|Optional |Default: `false`
|===
This parameter specifies the max number of hits a searcher will iterate through.
Searchers will arbitrarily ignore any number of additional hits beyond this value.
Practically speaking, the count is per shard and across all threads involved in case of multi-threaded search.
The intention of this feature is to favor speed over perfect relevancy & recall.
The trade-off is that if one shard contains many relevant hits and another contains a few less relevant hits the less relevant hits from the second shard may get returned instead of the more relevant hits that were clustered in the first shard.
This parameter works in conjunction with other parameters that could early terminate a search, ex: _timeAllowed_ etc.
In case the search was early terminated due to it exceeding maxHits a _terminatedEarly_ header in the response will be set along with _partialResults_ to indicate the same.
The _partialResults_ flag could be set in the absence of the _maxHits_ parameter due to other limits like _timeAllowed_ or _cpuAllowed_.
I have to admit I don't understand this bit:
NOTE: the hits counted will need not be exactly equal to the maxHits provided, however it will be within range of this value.
@@ -43,11 +48,14 @@ public class EarlyTerminatingCollector extends FilterCollector {
 * @param maxDocsToCollect - the maximum number of documents to Collect
 */
public EarlyTerminatingCollector(Collector delegate, int maxDocsToCollect) {
super(delegate);
assert 0 < maxDocsToCollect;
Why are the asserts removed?
@@ -29,11 +30,15 @@
 */
public class EarlyTerminatingCollector extends FilterCollector {

private final int chunkSize = 100; // Check across threads only at a chunk size
How was this chosen? Should it be tuneable via env/sysprop?
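If it were made tunable, a minimal sketch could read the value from a system property with the current default (the property name here is purely illustrative):

// Hypothetical sysprop; 100 stays the default when the property is unset.
private static final int CHUNK_SIZE =
    Integer.getInteger("solr.earlyTerminatingCollector.chunkSize", 100);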
private final int maxDocsToCollect;

private int numCollected = 0;
private int prevReaderCumulativeSize = 0;
private int currentReaderSize = 0;
private final LongAdder pendingDocsToCollect;
LongAdder is a neat class I wasn't aware of, but my read of the javadocs for LongAdder is that it's for rarely read values like statistics/metrics (frequent update, infrequent read)... the way you are using it you read from it immediately after every add. Maybe just AtomicLong? Happy to hear arguments to the contrary, but everyone knows what an AtomicLong is so slightly simpler to have that (and perhaps slightly less memory usage according to the javadocs).
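For comparison, a sketch of the AtomicLong variant of the same cross-thread check (field and variable names mirror the patch as quoted below, but this is illustrative only):

private final AtomicLong pendingDocsToCollect = new AtomicLong();

// In collect(), mirroring the chunked check from the patch:
if (numCollected % chunkSize == 0) {
  // addAndGet returns the updated total, so no separate read is needed.
  long overallCollectedDocCount = pendingDocsToCollect.addAndGet(chunkSize);
  terminatedEarly = overallCollectedDocCount >= maxDocsToCollect;
}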
if (numCollected % chunkSize == 0) {
pendingDocsToCollect.add(chunkSize);
final long overallCollectedDocCount = pendingDocsToCollect.intValue();
terminatedEarly = overallCollectedDocCount >= maxDocsToCollect;
I think the code would be more legible if you just threw in two cases rather than tracking it with a boolean that is overwritten. This initially looked like a logic error; I expected an |= until I stopped and thought about it for a while to convince myself that the second case can't be false after the first is true. Seems better to me if future maintainers don't have to sort that out for themselves.
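A sketch of the more explicit shape being asked for, where the flag is only ever set and never assigned back to false (illustrative, not the patch's exact code):

if (numCollected % chunkSize == 0) {
  pendingDocsToCollect.add(chunkSize);
  if (pendingDocsToCollect.longValue() >= maxDocsToCollect) {
    terminatedEarly = true; // set once; never overwritten with false
  }
}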
}
}
};
}

public boolean isTerminatedEarly() {
My IDE says this method is unused... do we use it somewhere I've missed?
@@ -194,7 +195,8 @@ public QueryCommand setNeedDocSet(boolean needDocSet) {
}

public boolean getTerminateEarly() {
return (flags & SolrIndexSearcher.TERMINATE_EARLY) != 0;
return (flags & SolrIndexSearcher.TERMINATE_EARLY) != 0
As I noted in another comment I think this and the Lucene feature are significantly different. I don't like conflating them.
And after looking deeper, there's something funny going on in flags. There are two flags:
static final int SEGMENT_TERMINATE_EARLY = 0x08;
public static final int TERMINATE_EARLY = 0x04;
yet I see this code (ide says it's unused) in QueryCommand:
public QueryCommand setTerminateEarly(boolean segmentTerminateEarly) {
if (segmentTerminateEarly) {
return setFlags(SolrIndexSearcher.TERMINATE_EARLY);
} else {
return clearFlags(SolrIndexSearcher.TERMINATE_EARLY);
}
}
as well as:
public QueryCommand setSegmentTerminateEarly(boolean segmentSegmentTerminateEarly) {
if (segmentSegmentTerminateEarly) {
return setFlags(SolrIndexSearcher.SEGMENT_TERMINATE_EARLY);
} else {
return clearFlags(SolrIndexSearcher.SEGMENT_TERMINATE_EARLY);
}
}
So there's something we should maybe understand and maybe clean up here...
public SearchResult(ScoreMode scoreMode, Object[] result) {
this(scoreMode, result, false);
My IDE claims this is unused. Since this is a package-private inner class, we probably don't need to maintain this constructor for the API.
@@ -295,7 +295,7 @@ private Collector buildAndRunCollectorChain(

final boolean terminateEarly = cmd.getTerminateEarly();
if (terminateEarly) {
collector = new EarlyTerminatingCollector(collector, cmd.getLen());
collector = new EarlyTerminatingCollector(collector, cmd.getMaxHitsTerminateEarly());
How is it we are replacing getLen() here... how does this not break SOLR-3240 (7141cb3#diff-99978700f1c69d6a5f6c2190f89c98cfe20c441161db5a183ec002e15cb1be28)
@@ -329,12 +329,12 @@ private Collector buildAndRunCollectorChain(
if (collector instanceof DelegatingCollector) {
((DelegatingCollector) collector).complete();
}
throw etce;
qr.setPartialResults(true);
Does not throwing here affect SOLR-3240 (the original reason for ETCE)?
(Not thinking of backwards compatibility here, just thinking holistically.) I appreciate that there are very different causes of a terminated request, and we want to differentiate them in case the client cares and for diagnostics. But semantically to the caller, the cause is irrelevant (right?), so I think a simple flag/enum should be returned, separate from the cause. I'm not arguing the cause should be hidden / indiscernible. I understand, once upon a time, there was a single cause of "terminate early", but as Solr has matured to do so under growing conditions, we shouldn't be held back by segment terminated early as the one and only terminate-early. Let's step back and change names / semantics if needed. Solr 10 is coming.
If by "caller" you mean the person/programmer/code that is leveraging solr to make a fabulous profit or save the world then there are two questions to answer: 1) can we trust these results as 100% definitive 2) what can be done to obtain definitive results if that's important. For number 1 we don't need to know why we just need to know if it's trustworthy, that's entirely binary as you suggest. However if we need to adjust things to supply definitive results (for that one case the customer really cares about, etc) then we need to know which nob to turn, and thus it matters which option is the limiting factor (one might set multiple limits or be using limits with this feature). If by caller you mean the next (few) method(s) up the stack, then there is at least one important distinction between this and limits. In the limits case, we typically want to stop all work ASAP once the limit is hit. In this case the intent is that all shards should return up to the limiting number of hits. We wouldn't want to stop or terminate any other subrequests for this feature. Notably this feature would not protect the users from a shard living on a critically limping and slow server, whereas timeAllowed should ensure some sort of response even if one shard has nearly (but not quite) ground to a halt. |
Description
Adds the capability to terminate a search early based on the provided maxHits parameter.
https://issues.apache.org/jira/browse/SOLR-17447
Solution
"maxHitsPerShard" request parameter controls how many hits the searcher should go over per shard. Once the searcher runs over the specfied number of documents, it will terminate the search with EarlyTerminatingCollectorException. This will be indicated by a new response header "terminatedEarly" also the "partialResults" will indicate that the results are partial. This parameter is supported in MT mode as well.
Though there are other mechanisms to control runaway queries with CPU usage limits and time limits, this is simpler for certain use cases esp in case high recall queries and rerank use cases.
Lucene currently supports this feature with the EarlyTerminatingCollector. There was some code in SOLR as well to support the collector, but looks like it was not completely wired up.
Tests
Ran tests against a local Solr instance in MT and single-threaded mode.
Checklist
Please review the following and check all that apply:
- I have developed this patch against the main branch.
- I have run ./gradlew check.