Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF-3060 [WiP] Add post-condition verification for CRUDActor metrics #794

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

jseyster
Copy link
Contributor

This WiP is mostly to get a first opinion on how well the PostCondition class fits into the reporting architecture. I'd be interested in any feedback, and if the approach looks reasonable, I can proceed with unit testing and documentation. I included the temporary example workload that I have been testing with.

OperationMetricsName: PipelineWithoutOptions
OperationCommand:
Pipeline: [{$group: {_id: "$int", sum: {$sum: "$int2"}}}]
PostCondition:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intended to be part of the production workload, or is this whole phase only check? I'm hoping the second.

I had imagined being able to check the result value as well, but I guess you could write that into the pipeline itself. Is that the intent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made sure that the time spent verifying the post-condition doesn't get included in the elapsed time that the operation takes. I'm not sure what you mean by "only check." Are there separate production and verification modes for workloads? Thanks!

I considered having a way to verify query results as well, but after discussing it with other people on the column store index project, we decided that users who want that functionality will probably go with an AssertiveActor and that, for this case, we want to avoid the pitfall of making performance workloads do double duty as rigorous correctness tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are running one operation in one thread once, then everything is just fine. However, if you are running things in a loop with multiple threads, you are going to stress the system differently for having the post condition check. The client will spend some amount of checking the result before sending the next request, so there will be less apparent load on the system. For example, if you have 20 threads running in parallel, and you have 5% overhead for the check, you will effectively have 19 parallel operations on the server at a time, instead of 20. The server will be less loaded. It's something to be aware of, but not a reason to avoid doing it. It seems like there will be cases in which this would be very useful.

I had imagined a check like this being used as a step after the real test, to verify invariants about the system. For example, if you had a workload that was a model of a banking application, you could verify that the total cash in the system didn't change after many transfers. I think your addition still supports this use case fwiw. Actually I'd encourage including an example use case like this with the other.

We don't have anything that is production versus verification modes today. It's probably not worth adding one either, although it is an interesting idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have added: the approach looks promising!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Once this is ready for documentation, I think it will be worthwhile to note that it's appropriate for measuring the performance of a relatively long-running query but not for tests intended to measure performance of a server under heavy load.

I also like your idea that a PostCondition could also be used to examine a collection after benchmarks run to ensure it left them in an expected state. An AssertiveActor can also accomplish that, but it can take some trial and error to get the right query. I included an example in CheckMetricsTest.yml along with an example of how to do a similar comparison with an AssertiveActor to show how the techniques compare.

rtimmons
rtimmons previously approved these changes Jan 4, 2023
Copy link
Collaborator

@rtimmons rtimmons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my extreme delay. I think the syntax and overall approach is reasonable for now. We may want to move the PostCondition into the Metrics structure itself, but I like the general PostCondition concept being extensible outside of Metrics, so let's see where that evolves.

I looked through the code and didn't see anything obviously wrong, but I'd feel much better with good automated testing and bit more commentary in places where there's so many longs in a row :trollface: 😄. So I'm 👍 the approach and syntax. Happy to review the code with a bit more scrutiny whenever you're ready.

Copy link
Contributor Author

@jseyster jseyster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of extreme delays 🙃... We ended up taking this work out of the Column Store Index project (which we are wrapping up) and turning into a quick win, which is why it took me a while to address the comments.

Thanks for the helpful feedback! I added an additional example case (which can eventually get moved to the docs) and better comments. I also added automated testing with the goal of getting the patch close to a state where it would be reasonably mergable if we agreed on the functionality and details of the implementation.

Now that the Column Store Index project is wrapped I'll have some time to work on this, but it's still not urgent, so no need to worry about review turnaround time. Thanks again!

OperationMetricsName: PipelineWithoutOptions
OperationCommand:
Pipeline: [{$group: {_id: "$int", sum: {$sum: "$int2"}}}]
PostCondition:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Once this is ready for documentation, I think it will be worthwhile to note that it's appropriate for measuring the performance of a relatively long-running query but not for tests intended to measure performance of a server under heavy load.

I also like your idea that a PostCondition could also be used to examine a collection after benchmarks run to ensure it left them in an expected state. An AssertiveActor can also accomplish that, but it can take some trial and error to get the right query. I included an example in CheckMetricsTest.yml along with an example of how to do a similar comparison with an AssertiveActor to show how the techniques compare.

@dalyd
Copy link
Contributor

dalyd commented Feb 1, 2023

With the time this PR has been open, you might benefit from merging dsi master into your branch, and pushing that up here. Github handles merging in master well, and does not handle rebasing well at all.

Copy link
Contributor

@dalyd dalyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple questions. Some of this is above my current C++ level.
Overall, I appreciate the test cases. Let's make sure to include some examples in the docs/ directory, but otherwise this looks good to me.

* instance.
*/
void conditionalSuccess(const PostCondition& postCondition = {}) {
auto finished = ClockSource::now();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice -- just calling out this is how you don't let the time of the post condition check impact the latency measurement for the desired op.

}

for (const auto& relation : relations) {
if (auto requiredValue = node[relation.key].maybe<long long>()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to check for relation.symbol also? It seems like you might want to use it in this loop. I suppose Yaml might not like those symbols as keys.

return bytes;
}
throw std::logic_error("impossible");
}();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a lambda? Could you just determine the observedValue and pass it into evaluateFn. I'm not great with Lambdas, so maybe not. But if it can, I think it would make this code more readable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants