Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataflow: Refactor FlowState to be paired with Node #18633

Merged
merged 14 commits into from
Feb 5, 2025

Conversation

aschackmull
Copy link
Contributor

@aschackmull aschackmull commented Jan 30, 2025

This refactors the representation of FlowState in the data flow library. Stage 1 is extracted to its own file and then for subsequent stages we either use NodeEx directly or a (NodeEx, FlowState) pair as the node type in order to eliminate the FlowState column.

)
}

private predicate partialPathOutOfCallable0(

Check warning

Code scanning / CodeQL

Candidate predicate not marked as `nomagic` Warning

Candidate predicate to
partialPathOutOfCallable
is not marked as nomagic.
}

pragma[noinline]
private predicate partialPathThroughCallable0(

Check warning

Code scanning / CodeQL

Candidate predicate not marked as `nomagic` Warning

Candidate predicate to
partialPathThroughCallable
is not marked as nomagic.
@aschackmull aschackmull force-pushed the dataflow/refactor-flowstate branch from 1a930cc to dd0a07e Compare January 30, 2025 11:43
@aschackmull aschackmull marked this pull request as ready for review January 30, 2025 12:06
@aschackmull aschackmull added the no-change-note-required This PR does not need a change note label Jan 30, 2025
@aschackmull
Copy link
Contributor Author

Commit-by-commit review encouraged, but the second commit really needs to be viewed with git diff --minimal as that saves about 6000 lines of superfluous diff output.

Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactor, trivial comments only. Also, performance needs to be fixed.

/* End: Stage 1 logic. */
}

private module Stage1Common {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved up before the Stage1 module, perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It contains a bit of a mix of input and output predicates relative to the Stage1 module. It's mostly output, though, hence why I put it after.


private class Cc = boolean;

/* Begin: Stage 1 logic. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment (and the end comment) is perhaps redundant now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll remove them.

@@ -169,4237 +171,2703 @@ module MakeImpl<LocationSig Location, InputSig<Location> Lang> {
/**
* Constructs a data flow computation given a full input configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QL doc should probably also mention the Stage1 parameter.

@asgerf
Copy link
Contributor

asgerf commented Jan 31, 2025

Wouldn't it be cheaper and simpler to pair it up with the access path instead? I imagine the TNil case storing the FlowState (with similar adaptions to the various AP approximations):

newtype TAccessPath =
  TNil(FlowState state) { ...} or
  TCons(Content head, TAccessPath tail) { ... }

@aschackmull
Copy link
Contributor Author

Wouldn't it be cheaper and simpler to pair it up with the access path instead?

I don't think so. It might be cheaper in terms of number of constructed elements of IPA types, but I definitely don't think it's simpler. Firstly, doing so would be a semantic change in all the cases where we aren't tracking an exact access path. Secondly, the tracking of FlowStates would then be different in each stage and that would mean that stages 2-6 would still need to care about and handle FlowState, whereas the point of the present refactor is that we can handle FlowState once and for all between stages 1 and 2, and then the bulk of the data flow library simply doesn't need to ever care about FlowState, which I think is a major improvement in terms of reducing the complexity of the library. In particular, it has become much easier to see that the stateful in- and out-barriers are now handled correctly. That's a huge pain in the current implementation.

@aschackmull aschackmull requested a review from a team as a code owner January 31, 2025 14:00
@github-actions github-actions bot added the JS label Jan 31, 2025
@aschackmull aschackmull force-pushed the dataflow/refactor-flowstate branch from 72ca71e to 73d7250 Compare February 4, 2025 09:48
@aschackmull
Copy link
Contributor Author

Dca looks good now.

@hvitved
Copy link
Contributor

hvitved commented Feb 5, 2025

There is an impressive JS speedup on microsoft__vscode; is that because of db1ed67? (cc @asgerf ).

@aschackmull
Copy link
Contributor Author

There is an impressive JS speedup on microsoft__vscode; is that because of db1ed67? (cc @asgerf ).

I think it might simply be due to reduced memory pressure, since we've removed a column from most predicates. Impressive nonetheless.
Additionally a few predicates now track state, which they didn't before, and that yields a very minor precision improvement that reduces the tuple count a little. Testing locally, I couldn't get the same speedup, but that may be because I simply have more ram available.

@aschackmull aschackmull merged commit bcec7ee into github:main Feb 5, 2025
35 checks passed
@aschackmull aschackmull deleted the dataflow/refactor-flowstate branch February 5, 2025 08:43
@asgerf
Copy link
Contributor

asgerf commented Feb 5, 2025

There is an impressive JS speedup on microsoft__vscode; is that because of db1ed67?

That's great news! The stage timings seem to indicate js/insecure-randomness and to a lesser degree js/xss as the main contributors. Insecure randomness has long been a problematic query for that project, so it would be interesting to find out why it became faster.

@aschackmull
Copy link
Contributor Author

I just checked js/insecure-randomness and looked at it with the "Compare Performance" feature in VSCode. The largest predicate, Stage4::fwdFlow1 computes fewer tuples for some reason, but otherwise it looks fairly similar, so I think reduced ram-pressure is the main factor in the speedup. That query computes a huge number of tuples in stage 4, even more than js/xss.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DataFlow Library JS no-change-note-required This PR does not need a change note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants