-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add TDFA determinization; Update lexer's tag to final register map using TDFA. #91
base: main
Are you sure you want to change the base?
Conversation
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThis pull request updates the build configuration and several source files for the log-surgeon project. New source files are added, and existing files are modified to change data structures and control flows. Notably, the DFA and lexing modules now employ updated register mapping with changes from unordered to ordered maps, and a new determinization configuration class is introduced. Additionally, tests are expanded to validate the DFA and Lexer functionality, while obsolete functionality (e.g., epsilon closure in NfaState) is removed. Changes
Sequence Diagram(s)sequenceDiagram
participant C as Client
participant D as Dfa
participant DC as DeterminizationConfiguration
participant N as Nfa
C->>D: Construct Dfa with Nfa reference
D->>D: Invoke generate() method
D->>DC: Create initial configuration from Nfa state
DC->>D: Update reachable configurations (spontaneous_closure & update_reachable_configs)
alt New transition found
DC->>DC: Generate child configuration via child_configuration_with_new_state_and_tag()
DC->>D: Return new configuration
end
D->>D: Create or retrieve DFA state from configuration set
D->>D: Establish transitions and assign register operations
D->>C: Serialize DFA and return string representation
Suggested reviewers
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (24)
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)
21-27
: Consider pre-allocating the vector capacity.
This can reduce repeated memory allocations when adding multiple registers.Here's a possible diff:
-auto add_registers(uint32_t const num_reg_to_add) -> std::vector<uint32_t> { - std::vector<uint32_t> added_registers; +auto add_registers(uint32_t const num_reg_to_add) -> std::vector<uint32_t> { + std::vector<uint32_t> added_registers; + added_registers.reserve(num_reg_to_add); for (uint32_t i{0}; i < num_reg_to_add; ++i) {src/log_surgeon/Lexer.hpp (1)
128-128
: Returning a const reference to a unique_ptr may be unconventional.
Consider returning a raw pointer or a reference to the underlyingDfa
object instead for clarity.- ) const -> std::unique_ptr<finite_automata::Dfa<TypedDfaState, TypedNfaState>> const& { + ) const -> finite_automata::Dfa<TypedDfaState, TypedNfaState> const* {src/log_surgeon/finite_automata/NfaState.hpp (2)
93-93
: Returning a reference to a bool is unusual.
Although it technically works, returning by value is often preferable for clarity and to avoid unexpected referencing issues.- [[nodiscard]] auto is_accepting() const -> bool const& { return m_accepting; } + [[nodiscard]] auto is_accepting() const -> bool { return m_accepting; }
174-175
: Consider aligning naming with captures rather than tags.
The serialized output calls it"accepting_tag"
, yet the code setsm_matching_variable_id
. If the design now focuses on captures, a more consistent name would improve clarity.src/log_surgeon/finite_automata/Nfa.hpp (1)
101-104
: Potential performance concern returning a large map by value
Currently,get_capture_to_tag_ids()
returns a map by value, which may be expensive if the map is large. Consider returning aconst&
or using a lightweight wrapper if the caller only needs read-only access.- [[nodiscard]] auto get_capture_to_tag_ids() const -> std::unordered_map<Capture const*, std::pair<tag_id_t, tag_id_t>> { - return m_capture_to_tag_ids; - } + [[nodiscard]] auto const& get_capture_to_tag_ids() const { + return m_capture_to_tag_ids; + }src/log_surgeon/Lexer.tpp (2)
86-86
: Transition logic refactor
Replacing direct transitions withget_dest_state(next_char)
makes the flow more explicit. Consider logging or handling a scenario where no valid destination state is found, ensuring the user is informed if the character is unexpected.
380-416
: Capture uniqueness check
Throwing an exception on duplicate capture names is a solid safeguard. Be sure all user inputs are validated, and consider whether a more user-friendly error message (or logging) might help users understand the conflict quickly.src/log_surgeon/finite_automata/Dfa.hpp (2)
64-65
: Typo in doxygen
In the comment block forgenerate
, the line uses@oaram nfa
instead of@param nfa
. Correct this to maintain consistency in documentation tags.- * @oaram nfa The NFA used to generate the DFA. + * @param nfa The NFA used to generate the DFA.
386-406
: Clarifyreassign_transition_reg_ops
usage
Adding new register operations when no direct match exists can be subtle. Ensure the logic that inserts{new_reg_id, old_reg_id}
captures the correct source and destination semantics. A short inline comment might prevent confusion for future maintainers.src/log_surgeon/finite_automata/RegexAST.hpp (1)
110-122
: Minor spelling nitpick in the comment.
The code logic is correct, but “spontaenous” should be spelled “spontaneous”.Proposed fix for line 113 comment:
-// root --(regex)--> state_with_spontaenous_transition --(negate capture tags)--> end_state +// root --(regex)--> state_with_spontaneous_transition --(negate capture tags)--> end_statesrc/log_surgeon/UniqueIdGenerator.hpp (1)
1-17
: Simple unique ID generator is functional but not concurrency-safe.
If you plan to use this in multi-threaded code, consider adding a mutex or atomic operations. Otherwise, this implementation is acceptable for single-threaded environments.src/log_surgeon/finite_automata/TagOperation.hpp (1)
39-48
: Consider adding compile-time enum exhaustiveness checking.The default case might indicate incomplete enum handling. Consider using a switch statement without a default case to get compiler warnings about unhandled enum values.
- switch (m_type) { + switch (m_type) { case TagOperationType::Set: return fmt::format("{}{}", m_tag_id, "p"); case TagOperationType::Negate: return fmt::format("{}{}", m_tag_id, "n"); - default: - return fmt::format("{}{}", m_tag_id, "?"); }src/log_surgeon/finite_automata/SpontaneousTransition.hpp (2)
56-62
: Consider handling empty tag operations explicitly.The current implementation might not handle empty tag operations gracefully. Consider adding a special case for empty operations.
+ if (m_tag_ops.empty()) { + return fmt::format("{}", state_id_it->second); + } auto transformed_operations = m_tag_ops | std::ranges::views::transform([](TagOperation const& tag_op) { return tag_op.serialize(); });
25-29
: Consider adding explicit keyword to single-parameter constructor.To prevent implicit conversions, consider marking the single-parameter constructor as explicit.
- explicit SpontaneousTransition(TypedNfaState const* dest_state) : m_dest_state{dest_state} {} + explicit SpontaneousTransition(TypedNfaState const* dest_state) : m_dest_state{dest_state} {} - SpontaneousTransition(std::vector<TagOperation> tag_ops, TypedNfaState const* dest_state) + explicit SpontaneousTransition(std::vector<TagOperation> tag_ops, TypedNfaState const* dest_state) : m_tag_ops{std::move(tag_ops)}, m_dest_state{dest_state} {}src/log_surgeon/finite_automata/DfaTransition.hpp (3)
26-28
: Consider initializingm_dest_state
in the member initializer list.The
m_dest_state
member is already initialized tonullptr
in the class definition, but for consistency and to follow best practices, consider initializing it in the member initializer list.DfaTransition(std::vector<RegisterOperation> reg_ops, DfaState<state_type>* dest_state) : m_reg_ops{std::move(reg_ops)}, - m_dest_state{dest_state} {} + m_dest_state{dest_state} { + }
53-53
: Consider usingif (state_ids.contains(m_dest_state) == false)
for consistency.The coding guidelines specify preferring
false == <expression>
rather than!<expression>
. For consistency, consider applying this to thecontains
checks.- if (false == state_ids.contains(m_dest_state)) { + if (state_ids.contains(m_dest_state) == false) { return std::nullopt; } std::vector<std::string> transformed_ops; for (auto const& reg_op : m_reg_ops) { auto const optional_serialized_op{reg_op.serialize()}; - if (false == optional_serialized_op.has_value()) { + if (optional_serialized_op.has_value() == false) {Also applies to: 60-60
57-64
: Consider reserving vector capacity for performance.Since we know the size of
m_reg_ops
, we can reserve space intransformed_ops
to avoid reallocations.std::vector<std::string> transformed_ops; + transformed_ops.reserve(m_reg_ops.size()); for (auto const& reg_op : m_reg_ops) {
src/log_surgeon/finite_automata/DfaStatePair.hpp (1)
74-74
: Consider consistent ordering in pointer and boolean comparisons.While
nullptr != dest_state1
follows the coding guidelines for boolean expressions, consider applying the same pattern to thecontains
check.if (nullptr != dest_state1 && nullptr != dest_state2) { DfaStatePair const reachable_pair{dest_state1, dest_state2}; - if (false == visited_pairs.contains(reachable_pair)) { + if (visited_pairs.contains(reachable_pair) == false) {Also applies to: 76-76
src/log_surgeon/finite_automata/PrefixTree.hpp (1)
28-28
: Consider documenting the purpose ofcDefaultPos
.The new constant
cDefaultPos
lacks documentation explaining its purpose and usage.static constexpr id_t cRootId{0}; - static constexpr position_t cDefaultPos{0}; + /// Default position value used when initializing new positions + static constexpr position_t cDefaultPos{0};tests/test-register-handler.cpp (1)
23-27
: Consider extracting magic number to a named constant.The condition
0 == i
uses a magic number. Consider extracting it to a named constant for better readability and maintainability.+ static constexpr size_t cFirstRegisterIndex{0}; for (size_t i{0}; i < num_registers; ++i) { - if (0 == i) { + if (cFirstRegisterIndex == i) { handler.add_register(); } else { handler.add_register(i);tests/test-dfa.cpp (2)
28-69
: Test case provides good coverage for untagged DFA.The test effectively validates the DFA's serialization format and state transitions.
Consider adding descriptive error messages to REQUIRE statements to help debug test failures:
- REQUIRE(actual_line == expected_line); + REQUIRE(actual_line == expected_line, + fmt::format("Line mismatch.\nExpected: {}\nActual: {}", expected_line, actual_line));
71-119
: Test case thoroughly validates tagged DFA functionality.The test effectively covers complex regex patterns with named capturing groups.
Consider extracting the expected serialized DFA string to a separate constant or file to improve readability and maintainability.
tests/test-lexer.cpp (1)
102-135
: Consider refactoring the initialization logic.The initialization function contains multiple responsibilities and could benefit from being split into smaller, focused functions.
Consider extracting the delimiter handling and symbol mapping into separate functions:
+auto initialize_delimiters(ByteLexer& lexer) -> vector<uint32_t> { + vector<uint32_t> const cDelimiters{' ', '\n'}; + lexer.add_delimiters(cDelimiters); + + vector<uint32_t> delimiters; + for (uint32_t i{0}; i < log_surgeon::cSizeOfByte; i++) { + if (lexer.is_delimiter(i)) { + delimiters.push_back(i); + } + } + return delimiters; +} + +auto initialize_symbol_mapping(ByteLexer& lexer) -> void { + lexer.m_symbol_id[log_surgeon::cTokenEnd] = static_cast<uint32_t>(SymbolId::TokenEnd); + lexer.m_symbol_id[log_surgeon::cTokenUncaughtString] = static_cast<uint32_t>(SymbolId::TokenUncaughtString); + lexer.m_id_symbol[static_cast<uint32_t>(SymbolId::TokenEnd)] = log_surgeon::cTokenEnd; + lexer.m_id_symbol[static_cast<uint32_t>(SymbolId::TokenUncaughtString)] = log_surgeon::cTokenUncaughtString; +}.github/workflows/build.yaml (1)
55-59
: Addition of Test Log Output on Failure
Adding a step to print the last test log on failure enhances the debugging capability when tests do not pass. For robustness, consider checking whether the log file exists before attempting to print it, to avoid potential errors if the file is missing.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (32)
.github/workflows/build.yaml
(3 hunks)CMakeLists.txt
(2 hunks)examples/intersect-test.cpp
(3 hunks)lint-tasks.yml
(2 hunks)src/log_surgeon/Lexer.hpp
(4 hunks)src/log_surgeon/Lexer.tpp
(10 hunks)src/log_surgeon/LexicalRule.hpp
(3 hunks)src/log_surgeon/SchemaParser.cpp
(4 hunks)src/log_surgeon/UniqueIdGenerator.hpp
(1 hunks)src/log_surgeon/finite_automata/Capture.hpp
(2 hunks)src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
(1 hunks)src/log_surgeon/finite_automata/Dfa.hpp
(3 hunks)src/log_surgeon/finite_automata/DfaState.hpp
(3 hunks)src/log_surgeon/finite_automata/DfaStatePair.hpp
(1 hunks)src/log_surgeon/finite_automata/DfaTransition.hpp
(1 hunks)src/log_surgeon/finite_automata/Nfa.hpp
(6 hunks)src/log_surgeon/finite_automata/NfaState.hpp
(5 hunks)src/log_surgeon/finite_automata/PrefixTree.hpp
(1 hunks)src/log_surgeon/finite_automata/RegexAST.hpp
(19 hunks)src/log_surgeon/finite_automata/Register.hpp
(1 hunks)src/log_surgeon/finite_automata/RegisterHandler.hpp
(2 hunks)src/log_surgeon/finite_automata/RegisterOperation.hpp
(1 hunks)src/log_surgeon/finite_automata/SpontaneousTransition.hpp
(1 hunks)src/log_surgeon/finite_automata/TagOperation.hpp
(1 hunks)src/log_surgeon/finite_automata/TaggedTransition.hpp
(0 hunks)tests/CMakeLists.txt
(2 hunks)tests/test-capture.cpp
(1 hunks)tests/test-dfa.cpp
(1 hunks)tests/test-lexer.cpp
(6 hunks)tests/test-nfa.cpp
(2 hunks)tests/test-register-handler.cpp
(2 hunks)tests/test-tag.cpp
(0 hunks)
💤 Files with no reviewable changes (2)
- tests/test-tag.cpp
- src/log_surgeon/finite_automata/TaggedTransition.hpp
✅ Files skipped from review due to trivial changes (2)
- src/log_surgeon/finite_automata/Register.hpp
- src/log_surgeon/finite_automata/Capture.hpp
🧰 Additional context used
📓 Path-based instructions (23)
src/log_surgeon/UniqueIdGenerator.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/SchemaParser.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/PrefixTree.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/DfaStatePair.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
tests/test-capture.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/LexicalRule.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
tests/test-lexer.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
tests/test-register-handler.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
examples/intersect-test.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/DfaState.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
tests/test-dfa.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/DfaTransition.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/RegisterOperation.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
tests/test-nfa.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/TagOperation.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/Lexer.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/Dfa.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/Nfa.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/RegexAST.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (4)
src/log_surgeon/finite_automata/RegisterHandler.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T21:56:13.425Z
Learning: In the `log_surgeon` project, header guards in C++ header files should include `_HPP` at the end to match the filename. For example, in `RegisterHandler.hpp`, the header guard should be `LOG_SURGEON_FINITE_AUTOMATA_REGISTER_HANDLER_HPP`.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/Nfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
- GitHub Check: build (macos-latest, Release)
🔇 Additional comments (79)
src/log_surgeon/finite_automata/RegisterHandler.hpp (8)
4-4
: Include usage is consistent with 32-bit integer usage.
No concerns here, looks good.
8-8
: Good choice referencingregister_id_t
.
Ensures clarity about the register type usage.
29-35
: The primary constructor for a single register is straightforward.
Implementation is correct, with no major issues found.
37-43
: Parameter-specific constructor also follows the same logic.
No immediate concerns, and the usage ofcDefaultPos
is appropriate.
45-45
:set_register
method is concise.
The transition fromsize_t
toregister_id_t
is consistent with the updated type usage.
49-49
:copy_register
method is correct.
The implementation is minimalistic, with no concurrency or logic issues.
53-57
:append_position
method logic is clear.
No edge cases or concurrency issues appear here.
59-63
: Returning reversed positions maintains clarity.
Usingm_prefix_tree
is consistent with the code’s established patterns.tests/test-nfa.cpp (4)
9-10
: Additional includes are consistent with the newNfaState
andLexicalRule
approach.
No issues; these new includes are appropriate for constructingByteNfa
.
38-38
: NFA construction aligned with the new usage.
Providingrules
directly to the constructor is neat.
41-77
: Updated transitions from “epsilon_transition” to “spontaneous_transition”.
The new strings in the expected serialization correctly reflect the updated NFA representation. The tags and positions appear consistent with the changes.
80-82
: Conditional usage ofoptional
and capturing for debugging.
It is good to see the test verifying that a serializable representation is produced. TheCAPTURE
usage is helpful for diagnosing test failures.Also applies to: 87-87
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (6)
1-2
: Header guard style is correct.
It follows the recommendedLOG_SURGEON_FINITE_AUTOMATA_DETERMINIZATION_CONFIGURATION_HPP
naming convention.
4-9
: Essential includes ensure minimal dependencies.
No concerns with these standard library inclusions.
11-12
:Register.hpp
andTagOperation.hpp
includes
This confirms the usage ofregister_id_t
andTagOperation
.
14-100
: TheDetermizationConfiguration
class design seems coherent.
- Constructor thoroughly moves the container arguments to reduce copies.
- The
operator<
properly compares in a logical order, ensuring stable sorting.- The
child_configuration_with_new_state_and_tag
method is a neat approach to build updated configurations.
One minor note: If the history also changes often, consider a symmetrical approach for that as well. Otherwise, this looks consistent.
102-117
:update_reachable_configs
method
The logic pre-emptively enlists each tag operation in a child configuration. It is a solid pattern for building subsequent states.
119-134
:spontaneous_closure
method
Neat usage of a stack for closure. Usingfalse == <expression>
also aligns with coding guidelines.src/log_surgeon/finite_automata/DfaState.hpp (5)
7-19
: No issues with new includes.
Everything appears to be in order with the added headers and type includes.
34-38
: Good constructor initialization.
Properly initializingm_bytes_transition
with default objects helps avoid undefined behaviour. The approach is clear and easy to maintain.
52-55
: Byte transition assignment is clear.
Storing the entireDfaTransition<state_type>
object instead of raw pointers removes potential pointer-management complexities.
57-58
: Ensure test coverage for accepting operations.
Consider adding unit tests or integration tests to confirm thatm_accepting_ops
is handled correctly at runtime.
101-141
: Serialization approach is well-structured.
Leveragingfmt::join
and collecting transition strings improves readability and maintainability.src/log_surgeon/Lexer.hpp (2)
7-19
: No issues with added headers and usings.
These additions properly support the new finite automata components without any apparent conflicts.
133-157
: General data-access methods look good.
The optional-based lookups and returns are easy to follow and reduce the risk of invalid lookups.src/log_surgeon/finite_automata/NfaState.hpp (3)
39-54
: Constructors appear well-defined.
They simplify the creation of NFA states and make the initialization of spontaneous transitions clear.
56-70
: Spontaneous transition management is straightforward.
Emplacing transitions with optional tag operations is a flexible approach.
190-190
: Consistent use of “false == expression”.
Usage ofif (false == optional_serialized_transition.has_value())
follows the coding style and avoids confusion with negation.src/log_surgeon/finite_automata/Nfa.hpp (3)
4-23
: Includes added for new dependencies
You have added the necessary headers for functionality likestd::optional
,fmt::format
, and the new capture-related classes. This is coherent and neatly addresses the code’s requirements.
39-39
: Good move to acceptrules
by const reference
Usingstd::vector<LexicalRule<TypedNfaState>> const&
in the constructor avoids unnecessary copying while preserving immutability, which can be more efficient.
234-251
: Effective use ofstd::optional
inserialize()
Handling serialization failures by returningstd::nullopt
is a clean pattern, ensuring that callers are alerted to potential errors. Ensure that the caller consistently checks for this return value.src/log_surgeon/Lexer.tpp (2)
89-96
: Re-initializing state on newline
The logic setsdest_state = m_dfa->get_root()->get_dest_state(next_char)
when encountering'\n'
. This ensures a fresh match and captures a new token start. Confirm that overlapping tokens before this line break are handled correctly elsewhere.
340-345
: Appending start-of-file logic
Prependingutf8::cCharStartOfFile
for the lexer helps detect matches from the beginning of logs. Ensure any dependent code in the pipeline also recognizes this special character and does not treat it as part of real data.src/log_surgeon/finite_automata/Dfa.hpp (2)
246-288
: Careful validation of register mappings
try_get_mapping
method is comprehensive in ensuring a valid one-to-one mapping between registers. This is crucial to avoid collisions. Ensure thorough testing covers edge cases where partial overlaps might attempt to map incorrectly.
403-405
: Possible confusion with register operation constructor
The expressionreg_ops.emplace_back(new_reg_id, old_reg_id);
suggests a “copy” or alternative operation fromold_reg_id
tonew_reg_id
. Confirm the constructor’s parameters align with that usage and do not inadvertently set a register to an unintended operation type.src/log_surgeon/finite_automata/RegexAST.hpp (21)
23-23
: No issues with updated include statement.
The change fromTag.hpp
toCapture.hpp
aligns with the rename effort.
33-37
: Documentation updates are accurate.
The references to captures instead of tags clearly convey the new concept.
85-87
: Returning a const reference is appropriate.
This helps avoid unnecessary copies and is consistent with the rest of the codebase.
89-91
: Setter function is straightforward.
Passing by value then moving is a valid pattern; no concerns here.
93-99
: Appending subtree captures is clearly implemented.
This insertion pattern appears correct for accumulating captures.
101-103
: Setter for negative captures is consistent.
The logic mirrors the subtree captures setter. No issues found.
130-153
: Serialization logic looks good.
This effectively transforms and joins the negative captures for inclusion in the serialized output.
627-630
: Documentation enforces non-null capture invariant.
This note is helpful for clarifying valid usage.
639-639
: Detailed doc comment is appreciated.
Clearly describes when and why an exception is thrown.Also applies to: 641-642
654-659
: Null checks on capture ensure robust construction.
Throwing onnullptr
is a safe way to enforce valid input.
666-668
: Deep copy approach in copy constructor is correct.
Makes a newCapture
instance, ensuring proper duplication.
707-707
: New accessor method is consistent.
Returningstd::string_view
is an efficient choice.
774-774
: Combining negative captures in OR constructor is logical.
Passing each branch's subtree captures to the other ensures correct negative semantics.Also applies to: 780-784
789-790
: NFA construction using add_to_nfa_with_negative_captures is correct.
Both branches are incorporated properly.
794-794
: Serialization with fallback is appropriate.
Using “null” when pointers are absent clarifies the state.Also applies to: 799-800
804-804
: Concatenation subtree captures are correctly combined.
This approach follows the same logic as in OR.Also applies to: 810-812
819-820
: Intermediate state usage for cat operation is implemented correctly.
This ensures each sub-expression transitions properly.Also applies to: 821-822
826-826
: Cat node serialization mirrors the established pattern.
No issues found—consistency is maintained.Also applies to: 831-832
836-836
: Subtree captures in multiplication constructor are properly inherited.
Logical approach for repeated structures.Also applies to: 844-844
854-855
: NFA construction for repeated patterns appears correct.
Handles zero, finite, and infinite cases, preserving negative captures in each iteration.Also applies to: 858-859, 861-862, 865-866, 869-870, 873-874, 875-876, 878-879
893-894
: Final serialization step for negative captures is properly appended.
Implementation remains consistent with other variants.tests/test-capture.cpp (1)
1-37
: Test coverage for capture names is solid.
All fundamental scenarios are validated, including empty and special character names. If required, consider adding tests for extra-long or non-ASCII names in future expansions.src/log_surgeon/LexicalRule.hpp (3)
5-7
: LGTM!The includes are well-organized and necessary for the new functionality.
28-30
: LGTM!The method is well-designed with appropriate const-correctness and return type.
46-47
: LGTM!The changes align well with the PR's objective of updating tag handling.
src/log_surgeon/finite_automata/TagOperation.hpp (2)
13-16
: LGTM!The enum class is well-designed with appropriate type safety and memory efficiency.
24-30
: LGTM!The comparison operators are implemented efficiently using std::tie with proper const-correctness.
src/log_surgeon/finite_automata/RegisterOperation.hpp (2)
42-42
: LGTM!The condition check follows the coding guideline of preferring
false == <expression>
.
39-53
: Same enum exhaustiveness issue as in TagOperation.The default case might indicate incomplete enum handling.
src/log_surgeon/finite_automata/DfaStatePair.hpp (1)
72-74
: LGTM! Variable naming improvement.The change from
next_state
todest_state
improves code clarity by better describing the variable's purpose.src/log_surgeon/finite_automata/PrefixTree.hpp (1)
15-20
: LGTM! Documentation clarity improvement.The documentation changes from "tag" to "capture" improve clarity and maintain consistency with the broader codebase changes.
tests/test-register-handler.cpp (1)
40-40
: LGTM! Improved const correctness.The addition of
const
toempty_handler
improves const correctness and prevents accidental modifications.examples/intersect-test.cpp (1)
24-24
: LGTM! Template parameter updates align with TDFA implementation.The addition of
ByteNfaState
as a template parameter toDfa
is consistent with the PR's objective of implementing TDFA determinization.Also applies to: 44-44, 82-82
tests/test-lexer.cpp (2)
169-169
: Address TODO comment about register value checks.The comment indicates missing test coverage for register values simulation.
Would you like me to help implement the register value verification tests or create an issue to track this task?
321-336
: Well-structured test cases with good error handling.The test cases effectively validate both basic lexer functionality and capture group handling. The use of CAPTURE macro helps with debugging test failures.
Also applies to: 338-379
src/log_surgeon/SchemaParser.cpp (2)
177-177
: LGTM! Updated to use Capture instead of Tag.The change aligns with the PR's objective of transitioning from tags to captures.
212-212
: Updated comments to reflect new terminology.Comments have been consistently updated to use "captures" instead of "tags".
Also applies to: 258-258
.github/workflows/build.yaml (2)
31-32
: OS Matrix Update: Specifying a Concrete Ubuntu Version
Updating the OS matrix to use"ubuntu-22.04"
instead of an ambiguous latest version improves reproducibility and stability of the build environment.
42-44
: Revised Catch2 Installation Condition for Ubuntu
The installation step for Catch2 now correctly targets"ubuntu-22.04"
, which aligns with the updated OS matrix. Please verify that version"3.6.0"
of Catch2 meets the project’s requirements on this platform.tests/CMakeLists.txt (2)
5-11
: Inclusion of New Finite Automata Headers
The update extends theSOURCES_LOG_SURGEON
list with multiple TDFA-related header files (e.g.,Capture.hpp
,DeterminizationConfiguration.hpp
,Dfa.hpp
, etc.). This supports the recent enhancements in the determinization process. Ensure that the file paths and dependencies remain consistent with the updated project structure.
35-43
: Addition of New Test Files
New test files, includingtest-capture.cpp
andtest-dfa.cpp
, have been added toSOURCES_TESTS
to boost coverage of the new features. This is a positive step; consider verifying that these tests cover both standard and edge case scenarios for TDFA functionality.lint-tasks.yml (2)
29-32
: Expanded Linting Coverage to Include Test Sources
The inclusion of additional test file patterns (for.cpp
,.h
,.hpp
,.tpp
files under thetests
directory) ensures that linting is applied uniformly across all source files. This comprehensive approach helps in maintaining code quality throughout the project.
69-73
: Exclusion of Build Directories from Formatting Tasks
The added exclusion pattern (! -path "*/*build*/*"
) in thecpp
task prevents the formatting of auto-generated or intermediary build files. This focused approach on source files is beneficial and reduces unnecessary processing.CMakeLists.txt (1)
64-116
: Updated Finite Automata Source Files Inclusion
TheSOURCE_FILES
list now incorporates several new TDFA-related files such asCapture.hpp
,DeterminizationConfiguration.hpp
, andUniqueIdGenerator.hpp
. However, there is a potential inconsistency in that files likePrefixTree.cpp
,RegexAST.hpp
, and various DFA/NFA headers are being included while previous discussions indicated their removal. Please verify that the set of files accurately reflects the intended new architecture for the finite automata module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (10)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (3)
69-69
: Add return type to documentation.The documentation for
get_tag_history
andget_tag_lookahead
methods is missing the return type.Add
@return
documentation to clarify the return type and its meaning.Also applies to: 79-79
130-130
: Consider using a more efficient loop condition.The loop condition
false == unexplored_stack.empty()
is less readable than!unexplored_stack.empty()
and doesn't provide any performance benefits.- while (false == unexplored_stack.empty()) { + while (!unexplored_stack.empty()) {
133-133
: Consider usinginsert
return value.The
contains
check followed byinsert
can be replaced with a singleinsert
operation, asinsert
returns a pair indicating whether the insertion took place.- if (false == reachable_set.contains(current_configuration)) { - current_configuration.update_reachable_configs(unexplored_stack); - reachable_set.insert(current_configuration); - } + if (auto [it, inserted] = reachable_set.insert(current_configuration); inserted) { + current_configuration.update_reachable_configs(unexplored_stack); + }tests/test-nfa.cpp (1)
37-38
: Remove or update outdated comment.The comment about capture order and tags in brackets appears to be unrelated to the test case, as the test doesn't involve any capture groups.
- // capture order(tags in brackets): letter1(0,1), letter2(2,3), letter(4,5), containerID(6,7)
tests/test-lexer.cpp (2)
171-171
: Add test case for register values.The TODO comment indicates missing test coverage for register values.
Would you like me to help implement the test case for register values? This would improve test coverage and help validate the simulation functionality.
174-180
: Consider using string_view for map values.The
serialize_id_symbol_map
function could be more efficient by usingstring_view
for map values since we're not modifying the strings.-auto serialize_id_symbol_map(unordered_map<symbol_id_t, string> const& map) -> string { +auto serialize_id_symbol_map(unordered_map<symbol_id_t, string_view> const& map) -> string {src/log_surgeon/finite_automata/Dfa.hpp (4)
264-264
: Fix typo in comment."irrelevent" should be "irrelevant".
- // If the NFA state sets the tag then the current register is irrelevent + // If the NFA state sets the tag then the current register is irrelevant
211-211
: Use preferred comparison style.According to the coding guidelines, prefer
!<expression>
overfalse == <expression>
.- while (false == unexplored_sets.empty()) { + while (!unexplored_sets.empty()) {
482-482
: Track UTF-8 support requirement.The TODO comment indicates missing UTF-8 support. This should be tracked to ensure proper handling of multi-byte characters.
Would you like me to create an issue to track the implementation of UTF-8 support?
464-466
: Reserve space for containers to improve performance.Pre-reserve space for
visited_states
andvisited_order
to avoid reallocations during traversal.The code already reserves space for these containers on lines 467-468, which is good practice!
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
CMakeLists.txt
(3 hunks)src/log_surgeon/Lexer.hpp
(4 hunks)src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
(1 hunks)src/log_surgeon/finite_automata/Dfa.hpp
(3 hunks)tests/test-dfa.cpp
(1 hunks)tests/test-lexer.cpp
(6 hunks)tests/test-nfa.cpp
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/log_surgeon/Lexer.hpp
- CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: - Prefer `false ==
**/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
tests/test-nfa.cpp
tests/test-dfa.cpp
tests/test-lexer.cpp
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
src/log_surgeon/finite_automata/Dfa.hpp
**/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
tests/test-nfa.cpp
tests/test-dfa.cpp
tests/test-lexer.cpp
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
src/log_surgeon/finite_automata/Dfa.hpp
🪛 GitHub Actions: lint
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
[error] 29-29: code should be clang-formatted [-Wclang-format-violations]
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
- GitHub Check: build (macos-latest, Release)
🔇 Additional comments (3)
tests/test-nfa.cpp (1)
8-11
: LGTM! Header organization is clean and follows best practices.The headers are well-organized, with system headers first, followed by third-party headers, and then project headers.
src/log_surgeon/finite_automata/Dfa.hpp (2)
1-35
: LGTM! Well-organized includes and class declaration.The includes are logically organized, and the class declaration follows good C++ practices with proper include guards and template parameters.
36-59
: LGTM! Well-documented public interface.The constructor and public methods are properly marked with appropriate specifiers and have clear documentation.
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
Outdated
Show resolved
Hide resolved
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1)
34-45
: Consider simplifying the comparison operator by leveraging a tuple.You could use tuple-based comparison to eliminate multiple conditional checks, improving maintainability and clarity. For example:
auto operator<(DetermizationConfiguration const& rhs) const -> bool { - if (m_nfa_state->get_id() != rhs.m_nfa_state->get_id()) { - return m_nfa_state->get_id() < rhs.m_nfa_state->get_id(); - } - if (m_tag_id_to_reg_ids != rhs.m_tag_id_to_reg_ids) { - return m_tag_id_to_reg_ids < rhs.m_tag_id_to_reg_ids; - } - if (m_history != rhs.m_history) { - return m_history < rhs.m_history; - } - return m_lookahead < rhs.m_lookahead; + return std::tie(m_nfa_state->get_id(), m_tag_id_to_reg_ids, m_history, m_lookahead) + < std::tie(rhs.m_nfa_state->get_id(), rhs.m_tag_id_to_reg_ids, rhs.m_history, rhs.m_lookahead); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
CMakeLists.txt
(2 hunks)src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: - Prefer `false ==
**/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
**/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
- GitHub Check: build (macos-latest, Release)
🔇 Additional comments (4)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (3)
69-71
: Add an explicit return type to align with style consistency.A past review flagged that an explicit return type is missing here. Changing it to either
void
orauto -> void
ensures consistency with the rest of your code base and clarifies the method’s intent.-auto set_reg_id(tag_id_t const tag_id, register_id_t const reg_id) { +auto set_reg_id(tag_id_t const tag_id, register_id_t const reg_id) -> void { m_tag_id_to_reg_ids[tag_id] = reg_id; }
107-122
: Verification of spontaneous transitions logic.Your approach of pushing the parent configuration onto the stack after applying each sequence of tag operations seems logically correct. Verify that all spontaneously reachable states (with the associated tag operations) are accounted for, especially if the NFA contains multiple concurrent spontaneous transitions.
130-133
: Adherence to the coding guideline on Boolean expressions.Using “false == ” aligns with the project guideline. This usage improves clarity over the “!” operator. Nicely done.
CMakeLists.txt (1)
71-87
: Updated finite automata files successfully integrated.The newly added lines properly reference the relevant finite automata headers (e.g.,
DeterminizationConfiguration.hpp
) and theUniqueIdGenerator.hpp
. This inclusion ensures the CMake build process incorporates all necessary sources.Also applies to: 115-115
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
src/log_surgeon/Lexer.tpp (1)
172-172
: 🛠️ Refactor suggestionAddress code duplication in scan_with_wildcard.
The TODO comment acknowledges code duplication between scan() and scan_with_wildcard(). Consider extracting common logic into a shared method.
🧹 Nitpick comments (8)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1)
16-17
: Consider standardizing the class name with the file name.
The file is named "DeterminizationConfiguration.hpp," but the class is named "DetermizationConfiguration." This discrepancy may cause confusion.src/log_surgeon/LexicalRule.hpp (1)
28-30
: Verifym_regex
usage for null safety.
Returning a reference to the captures vector is acceptable ifm_regex
is guaranteed to be non-null. Otherwise, consider an assertion or check to avoid dereferencing a null pointer.tests/test-lexer.cpp (1)
338-398
: Consider adding edge case tests.The test cases cover basic functionality but could benefit from additional edge cases such as:
- Empty input
- Maximum length input
- Invalid capture group names
- Nested capture groups
src/log_surgeon/Lexer.tpp (1)
386-388
: Enhance error message for duplicate capture names.The error message could be more descriptive by including the conflicting capture names.
- throw std::invalid_argument("`m_rules` contains capture names that are not unique." - ); + throw std::invalid_argument(fmt::format( + "`m_rules` contains duplicate capture name: '{}'", capture_name));src/log_surgeon/finite_automata/Dfa.hpp (1)
31-32
: Enhance class template documentation.The class template documentation should explain the purpose of both template parameters and their relationship.
src/log_surgeon/finite_automata/RegexAST.hpp (2)
652-657
: Enhance error messages for null pointer validation.The error messages for null pointer validation could be more descriptive.
- throw std::invalid_argument("Group regex AST cannot be null"); + throw std::invalid_argument("RegexASTCapture: capture_regex_ast cannot be null"); - throw std::invalid_argument("Capture cannot be null"); + throw std::invalid_argument("RegexASTCapture: capture pointer cannot be null");
909-940
: Move documentation to a separate file.As suggested by the TODO comment, consider moving the detailed documentation to a separate file for better maintainability.
CMakeLists.txt (1)
64-117
: Source files restructuring aligns well with TDFA implementation.The changes to SOURCE_FILES reflect a comprehensive restructuring to support TDFA determinization:
- Added new files for TDFA-specific functionality (Capture.hpp, DeterminizationConfiguration.hpp)
- Added support files for register operations and transitions
- Removed older tag-based implementation files
However, consider the following suggestions:
- Group related files together in the SOURCE_FILES list for better maintainability
- Consider adding comments to separate different components (e.g., core files, finite automata files, etc.)
Here's a suggested organization:
set(SOURCE_FILES # Core files src/log_surgeon/Aliases.hpp src/log_surgeon/Buffer.hpp src/log_surgeon/BufferParser.cpp src/log_surgeon/BufferParser.hpp src/log_surgeon/Constants.hpp src/log_surgeon/FileReader.cpp src/log_surgeon/FileReader.hpp # Finite Automata src/log_surgeon/finite_automata/Capture.hpp src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp src/log_surgeon/finite_automata/Dfa.hpp # ... rest of finite automata files ... # Parsing and Lexing src/log_surgeon/Lalr1Parser.cpp src/log_surgeon/Lalr1Parser.hpp # ... rest of parsing files ... )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
CMakeLists.txt
(2 hunks)src/log_surgeon/Aliases.hpp
(1 hunks)src/log_surgeon/Lexer.hpp
(5 hunks)src/log_surgeon/Lexer.tpp
(10 hunks)src/log_surgeon/LexicalRule.hpp
(3 hunks)src/log_surgeon/UniqueIdGenerator.hpp
(1 hunks)src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
(1 hunks)src/log_surgeon/finite_automata/Dfa.hpp
(3 hunks)src/log_surgeon/finite_automata/Nfa.hpp
(6 hunks)src/log_surgeon/finite_automata/NfaState.hpp
(5 hunks)src/log_surgeon/finite_automata/PrefixTree.hpp
(1 hunks)src/log_surgeon/finite_automata/RegexAST.hpp
(20 hunks)src/log_surgeon/finite_automata/TaggedTransition.hpp
(3 hunks)tests/CMakeLists.txt
(2 hunks)tests/test-lexer.cpp
(7 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/log_surgeon/finite_automata/PrefixTree.hpp
- tests/CMakeLists.txt
- src/log_surgeon/UniqueIdGenerator.hpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: - Prefer `false ==
**/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/Aliases.hpp
src/log_surgeon/LexicalRule.hpp
tests/test-lexer.cpp
src/log_surgeon/finite_automata/TaggedTransition.hpp
src/log_surgeon/finite_automata/NfaState.hpp
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
src/log_surgeon/finite_automata/Nfa.hpp
src/log_surgeon/Lexer.hpp
src/log_surgeon/finite_automata/Dfa.hpp
src/log_surgeon/finite_automata/RegexAST.hpp
**/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/log_surgeon/Aliases.hpp
src/log_surgeon/LexicalRule.hpp
tests/test-lexer.cpp
src/log_surgeon/finite_automata/TaggedTransition.hpp
src/log_surgeon/finite_automata/NfaState.hpp
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
src/log_surgeon/finite_automata/Nfa.hpp
src/log_surgeon/Lexer.hpp
src/log_surgeon/finite_automata/Dfa.hpp
src/log_surgeon/finite_automata/RegexAST.hpp
🧠 Learnings (4)
src/log_surgeon/finite_automata/TaggedTransition.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/Nfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (24)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (3)
69-71
: Add the missing return type forset_reg_id
.
This was previously noted in a prior review. Please ensure explicit return types (e.g.,-> void
) for better clarity and consistency.
107-122
: Implementation forupdate_reachable_configs
looks well-structured.
This logic clearly handles spontaneous transitions to produce new configurations. No issues spotted.
130-136
: Commendation on adhering to the style guideline.
Usingwhile (false == unexplored_stack.empty())
andif (false == reachable_set.contains(...))
aligns with the project’s preference offalse == <expression>
over!<expression>
. Good work.src/log_surgeon/Aliases.hpp (1)
1-14
: Clear and concise type alias definitions.
Defining descriptive aliases (capture_id_t, reg_id_t, rule_id_t, tag_id_t
) improves readability and type safety across the codebase.src/log_surgeon/LexicalRule.hpp (1)
46-46
: Acceptance state creation looks correct.
Usingnfa->new_accepting_state(m_variable_id)
simplifies creation of the end state and ensures predictable handling of acceptance logic.src/log_surgeon/finite_automata/TaggedTransition.hpp (2)
21-45
: LGTM! The transition to tag IDs improves type safety.The refactoring from Tag pointers to tag IDs eliminates potential null pointer issues and improves type safety.
52-76
: LGTM! The transition to tag IDs improves type safety.The refactoring from Tag pointers to tag IDs eliminates potential null pointer issues and improves type safety. The use of
fmt::join
for formatting multiple tag IDs is a clean approach.src/log_surgeon/finite_automata/NfaState.hpp (4)
34-40
: LGTM! The use ofcontains
andfalse == <expression>
improves readability.The changes align with coding guidelines and modern C++ practices.
102-105
: LGTM! The getter correctly returns a const reference.The method follows best practices by returning a const reference to avoid unnecessary copies of the vector.
136-144
: LGTM! The helper method efficiently manages capture tag pairs.The method follows best practices by using
contains
,emplace
, and aligning with coding guidelines.
163-181
: LGTM! The method efficiently handles negative captures.The method follows best practices by using move semantics and correctly managing tag pairs.
src/log_surgeon/Lexer.hpp (2)
139-145
: LGTM! The getter correctly handles missing rules.The method follows best practices by using
std::optional
for error handling andcontains
for map lookup.
179-198
: LGTM! The getter correctly handles all error cases.The method follows best practices by:
- Using
std::optional
for error handling- Following coding guidelines with
false == <expression>
- Properly validating both start and end register IDs
src/log_surgeon/finite_automata/Nfa.hpp (4)
136-144
: LGTM! The helper method efficiently manages capture tag pairs.The method follows best practices by using
contains
,emplace
, and aligning with coding guidelines.
163-181
: LGTM! The method efficiently handles negative captures.The method follows best practices by using move semantics and correctly managing tag pairs.
184-199
: LGTM! The method correctly manages capture states.The method follows best practices by:
- Efficiently extracting tag pairs
- Creating appropriate start and end states
- Using spontaneous transitions for tag operations
235-252
: LGTM! The method correctly handles serialization failures.The method follows best practices by:
- Using
std::optional
for error handling- Following coding guidelines with
false == <expression>
- Properly validating serialized states
tests/test-lexer.cpp (2)
67-91
: Consider extracting common test setup logic.A previous review comment suggested extracting common setup logic between test cases. This suggestion is still valid and would help reduce code duplication.
185-185
: Address the TODO comment about register value verification.The TODO comment indicates that register value verification is pending implementation of DFA simulation.
Would you like me to create an issue to track this task?
src/log_surgeon/Lexer.tpp (2)
30-36
: Address the TODO comment about m_start_pos handling.The TODO comment indicates potential issues with m_start_pos handling when it equals old_storage_size / 2.
Would you like me to create an issue to track this task?
406-406
: Address the TODO comment about DFA ignoring captures.The TODO comment indicates that DFA ignores captures in certain scenarios. This needs to be addressed to ensure proper functionality.
Would you like me to create an issue to track this task?
src/log_surgeon/finite_automata/Dfa.hpp (1)
482-482
: Address the TODO comment about UTF-8 handling.The TODO comment indicates that UTF-8 case handling is pending implementation.
Would you like me to create an issue to track this task?
src/log_surgeon/finite_automata/RegexAST.hpp (1)
30-30
: Address the TODO comment about class renaming.The TODO comment suggests renaming
RegexAST
toRegexASTNode
for better clarity.Would you like me to create an issue to track this task?
CMakeLists.txt (1)
72-90
: Verify the completeness of TDFA implementation files.The new files added for TDFA implementation appear comprehensive, covering:
- Core TDFA functionality (Dfa.hpp, DfaState.hpp, DfaTransition.hpp)
- Register handling (Register.hpp, RegisterHandler.hpp, RegisterOperation.hpp)
- Support for transitions (SpontaneousTransition.hpp)
Let's verify that all necessary files are present:
✅ Verification successful
🏁 Script executed:
#!/bin/bash # Check for any potential missing TDFA-related files # Look for references to TDFA in source files that might indicate missing implementations # Search for TDFA-related types or functions that might need implementation files rg -g '*.{cpp,hpp,tpp}' -i 'tdfa|tagged.*dfa' # Search for TODO comments related to TDFA rg -g '*.{cpp,hpp,tpp}' -i 'todo.*tdfa|tdfa.*todo'Length of output: 514
TDFA File Completeness Verified
The search for TDFA-related identifiers and TODO comments shows that all expected implementation files are present, and the test cases in
tests/test-dfa.cpp
confirm both tagged and untagged DFA functionality. Based on these findings, the TDFA implementation appears complete.
…hat result from the change.
Co-authored-by: Lin Zhihao <[email protected]>
Co-authored-by: Lin Zhihao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/log_surgeon/finite_automata/Dfa.hpp (2)
61-67
: Typographical note in doc comment.There is a minor spelling issue in this doc comment: “@oaram” should be “@param”.
- * @oaram nfa The NFA used to generate the DFA. + * @param nfa The NFA used to generate the DFA.
195-229
: TODO note: UTF-8 is unhandled.You have a TODO for handling multi-byte transitions. Would you like assistance drafting a plan or partial implementation?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/log_surgeon/Lexer.tpp
(1 hunks)src/log_surgeon/finite_automata/Dfa.hpp
(2 hunks)tests/test-dfa.cpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/log_surgeon/Lexer.tpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
tests/test-dfa.cpp
src/log_surgeon/finite_automata/Dfa.hpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (15)
tests/test-dfa.cpp (2)
118-164
: Looks good, but consider extracting the line-by-line comparison logic.Your approach for verifying the DFA serialization is correct. However, you've duplicated the line-by-line comparison code from the previous test blocks. It might be beneficial to extract it into a helper method (e.g.,
validate_serialized_dfa
) to simplify maintenance and make your test code more concise.
165-215
: Duplicated line-by-line comparison logic.This repeats the same pattern of line-by-line validation seen in other tests. Consider factoring out the repeated logic into a helper function for improved clarity and maintainability.
src/log_surgeon/finite_automata/Dfa.hpp (13)
4-16
: No issues with the new includes.All added headers appear necessary for the newly introduced functionalities (e.g., queue, unordered_set).
22-27
: Imported classes and functions look properly referenced.No concerns with these newly added imports, as they align with the usage in this file.
33-34
: Type alias enhances readability.Using
ConfigurationSet
simplifies references to determinization configurations.
56-59
: Method seems fine; straightforward accessor.
get_tag_id_to_final_reg_id()
aligns well with your new data design, exposing needed mappings.
83-95
: Mapping logic is correct but can be slow for large sets.This function systematically checks each configuration in
lhs
againstrhs
. For extremely large sets, you might consider more efficient lookups.
96-119
: Comprehensive state creation logic.
create_or_get_dfa_state
cleanly handles the detection of an existing config or insertion of a new one. No immediate concerns.
120-130
: No concerns with transitions retrieval.Generating transitions based on each ASCII value is straightforward and consistent with the rest of the design.
136-152
: Register operation assignment looks well-organized.Multiple tag operations (Set, Negate) are handled neatly, ensuring consistent updates to the closure.
153-166
: Reassigning transition register ops is correct.Mapping old registers to new ones with fallback
copy_operation
is a robust approach for convergence.
167-178
: New state creation is coherent.A proper flow: if the NFA state is accepting, we add matching IDs and final register ops. This matches the design of the tag-based approach.
186-187
: New map is appropriate for final register references.
m_tag_id_to_final_reg_id
clarifies the relationship between tag and final register, promoting better modularity in the code.
190-193
: Constructor consistently seeds generation.Building the DFA in the constructor ensures any usage of the object has a fully ready state.
68-82
:❓ Verification inconclusive
Potential memory usage concern for large tag counts.
initialize_registers
adds registers for each tag without bounds checking. Ifnum_tags
can be very large, this may result in excessive memory usage.You might want to confirm the maximum expected number of tags. If needed, consider adding a guard similar to a “maximum allowed tags” check:
🏁 Script executed:
#!/bin/bash # Searching for references to call sites of `initialize_registers` # to see if there's a maximum or safe usage pattern in place. rg --context 5 'initialize_registers'Length of output: 2704
Memory Usage Concern: Verify Maximum Tag Count
The search confirms that all calls to
initialize_registers
(e.g. vianfa.get_num_tags()
) happen without any bounds checking. This means that if the number of tags is unexpectedly large, the unguarded allocation of registers could lead to excessive memory usage. Please verify the maximum expected number of tags in your NFA. If there is any risk of unreasonably high values, consider implementing a guard (a “maximum allowed tags” check) to mitigate potential issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (9)
src/log_surgeon/finite_automata/Dfa.hpp (7)
33-33
: Consider clarifying ownership for DFA states and tag IDs.The alias
ConfigurationSet
and the newly introduced membersm_tag_id_to_final_reg_id
andm_reg_handler
effectively manage tag information and states. However, it might be beneficial to add brief clarifications (e.g., comments) on how ownership is handled (viastd::unique_ptr
) and how lifetime ties into the map of configuration sets toTypedDfaState*
. This will help new contributors quickly understand the relationship betweenm_states
,dfa_states
, and tag ownership.Also applies to: 186-187, 191-193
60-66
: Extend UTF-8 handling and test coverage.The
generate
method includes a TODO at line 195 referencing future UTF-8 support. Since multi-byte character handling is often a critical requirement, consider creating a dedicated strategy for transitioning from single-byte to multi-byte logic. Adding tests in preparation for UTF-8 transitions will help ensure reliability.Also applies to: 195-229
83-95
: Watch for potential O(n²) overhead intry_get_mapping
.The nested loops over
lhs
andrhs
can lead to quadratic complexity. Though this might be acceptable for smaller NFAs, it could become a performance bottleneck for large ones. If scaling is a concern, consider more efficient set comparisons or hashing of configurations.Also applies to: 245-288
96-119
: Prevent confusion when returning mapped states.The
create_or_get_dfa_state
method can return either an existing DFA state, a potentially mapped DFA state, or a newly created one. Adding more explicit in-code comments or separate result types (e.g., an enum variant) can reduce ambiguity for future maintainers.Also applies to: 290-308
120-135
: Consider factoring out logic for byte transitions.The
get_transitions
method processes transitions for all 256 byte values. For clarity and maintainability, factor out the repeated logic into smaller helper functions/methods, especially if multi-byte support is planned.Also applies to: 309-342
153-165
: Re-check newly added copy operations.
reassign_transition_reg_ops
creates copy operations if an old register does not map existing ops to the new register. Double-check any potential for register duplication or unintentional overwriting of data. Additional debug logs or unit tests might catch hidden bugs.Also applies to: 384-405
167-178
: Simplifynew_state
logic for readability.The
new_state
function handles final register assignment, tag copying, and accepting condition checks in a single block. Breaking it into smaller subroutines (e.g., one for final operations, one for acceptance logic) could improve readability and reduce complexity.Also applies to: 407-441
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (2)
17-32
: Validate NFA state pointer in constructor.Throwing an exception if
nfa_state
is null ensures safety. Documenting in the class-level Doxygen that this constructor expects a non-null pointer could make the contract clearer for future users.
47-59
: Maintain consistent naming inchild_configuration_with_new_state_and_tag
.The method's name distinctly conveys purpose; however, for brevity, consider naming it
child_with_state_and_tag
or similar. Otherwise, everything is well-structured and easy to follow.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
(1 hunks)src/log_surgeon/finite_automata/Dfa.hpp
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
src/log_surgeon/finite_automata/Dfa.hpp
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
🧠 Learnings (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
- GitHub Check: build (macos-latest, Release)
- GitHub Check: build (macos-latest, Debug)
🔇 Additional comments (7)
src/log_surgeon/finite_automata/Dfa.hpp (1)
136-152
: Confirm no overlap in simultaneous tag operations.When
assign_transition_reg_ops
merges multipleTagOperation
s, ensure there's no scenario where conflicting operations on the same register occur within the same transition. Adding unit tests for edge cases involving repeated or rapid tag toggles (set/negate) could improve confidence.Also applies to: 344-382
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (6)
1-16
: Header guard and includes look correct.You have properly defined the include guard and minimal includes for containers and standard library features. This setup is concise and follows best practices.
34-45
: Well-defined ordering operator.The
operator<
properly orders configurations by state ID, register mappings, and then history/lookahead. This consistent ordering is vital for usingDeterminizationConfiguration
in sorted containers. Great job ensuring a deterministic comparison sequence.
61-66
: Verify correctness of spontaneous transitions.
update_reachable_configs
andspontaneous_closure
work together to capture all configurations reachable via zero or more spontaneous transitions. For large NFAs, you may want to ensure no repeated expansions or infinite loops occur if cycles are present. Additional cycle detection or upper-bound checks might be beneficial.Also applies to: 68-73, 112-127, 129-144
74-77
: Tag manipulation and retrieval appear logical.The methods
set_reg_id
,get_state
,get_tag_history
, andget_lookahead
provide straightforward ways to update and query tag operations. Just ensure that changes to one tag’s state do not accidentally interfere with the storedm_history
orm_lookahead
for another tag.Also applies to: 78-82, 84-91, 93-103
105-110
: Private members are neatly organized.Keeping
m_nfa_state
,m_tag_id_to_reg_ids
,m_history
, andm_lookahead
together at the end clarifies the data blueprint for each configuration. No changes needed here.
147-148
: Header end comment.Consistently ending the file with a matching
#endif
comment is a helpful practice. Thank you for maintaining clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some high-level comments of DeterminizationConfiguration.hpp
. Some of these comments should be also applied across the entire PR. (NTS: This file is not completely reviewed yet)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
Outdated
Show resolved
Hide resolved
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
Outdated
Show resolved
Hide resolved
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (4)
17-17
: Consider adding a class-level docstring.
A brief explanation of the purpose and usage of this class helps future maintainers and new contributors comprehend how determinization configurations fit into the larger TDFA logic.
19-32
: Add parameter documentation for the constructor.
You might want to use doxygen-style tags (e.g.,@param
) for each parameter, describing how they are used (e.g.,nfa_state
,tag_to_reg_ids
,tag_history
,tag_lookahead
).
34-45
: Use structured bindings or tuple comparison for operator<.
Rather than performing multiple if checks, consider using a single tuple-based lexicographical comparison for cleaner and more concise code, such as:-auto operator<(DeterminizationConfiguration const& rhs) const -> bool { - if (m_nfa_state->get_id() != rhs.m_nfa_state->get_id()) { - return m_nfa_state->get_id() < rhs.m_nfa_state->get_id(); - } - if (m_tag_id_to_reg_ids != rhs.m_tag_id_to_reg_ids) { - return m_tag_id_to_reg_ids < rhs.m_tag_id_to_reg_ids; - } - if (m_history != rhs.m_history) { - return m_history < rhs.m_history; - } - return m_lookahead < rhs.m_lookahead; +auto operator<(DeterminizationConfiguration const& rhs) const -> bool { + return std::tie(m_nfa_state->get_id(), + m_tag_id_to_reg_ids, + m_history, + m_lookahead) + < std::tie(rhs.m_nfa_state->get_id(), + rhs.m_tag_id_to_reg_ids, + rhs.m_history, + rhs.m_lookahead); }
61-64
: Use standard Doxygen tags.
The annotation@unexplored_stack
is non-standard. Use@param
or another doxygen-compatible tag to describe parameters clearly, for example:/** * @param unexplored_stack The stack of configurations updated via spontaneous transitions. */
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
🧬 Code Definitions (1)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)
nodiscard
(56-58)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build (ubuntu-22.04, Debug)
- GitHub Check: build (ubuntu-22.04, Release)
🔇 Additional comments (3)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (3)
74-76
: Avoid operator[] on std::map if unintentional insertions are possible.
Previously, a reviewer (LinZhihao-723) advised against using[]
for map-like structure access, as it can insert new keys inadvertently. Useinsert_or_assign
or explicit checks if you want to safeguard against unexpected insertions.
112-127
: Ensure pointer validity inupdate_reachable_configs
.
Becausem_nfa_state
is stored as a raw pointer, confirm that it remains valid throughout spontaneous transition exploration. If memory is released elsewhere, consider using a smart pointer (likestd::shared_ptr
) to prevent dangling references.
129-144
: Confirm that no infinite loops occur inspontaneous_closure
.
The logic appears to prevent infinite loops by relying onreachable_set.contains(current_configuration)
, but consider adding tests for cyclical transitions in the NFA to verify this behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test-dfa.cpp (1)
148-181
: Comprehensive test for Complex Tagged DFA.This test case demonstrates more advanced functionality of tagged DFAs with multiple nested capture groups. The variety of operations in the expected output (push, copy, nested operations) provides thorough coverage of the TDFA implementation.
Consider adding a brief comment explaining what the register operations notation means (e.g., what
16p,17p
and8c16,9c19
represent) to make the test more understandable for future developers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test-dfa.cpp
(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
tests/test-dfa.cpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (macos-latest, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (4)
tests/test-dfa.cpp (4)
28-58
: Well-implemented helper function for DFA comparison.The
compare_serialized_dfa
function is well-structured and documented. It nicely encapsulates the common validation logic that was previously duplicated across test cases, addressing a previous review comment that requested this extraction. The function handles error reporting effectively by using CAPTURE for both serialized strings, which will be helpful for debugging any failures.
87-87
: Good refactoring to use the helper function.Replacing the duplicated validation code with a call to the helper function improves maintainability.
114-115
: Effective refactoring of test validation.Good job replacing another instance of the duplicated validation code with the helper function call.
117-146
: Well-structured test for Simple Tagged DFA.This new test case effectively validates the tagged DFA functionality for a simple pattern with a single tagged group (
uID
). The test creates an appropriate schema, processes it into an NFA and DFA, and then verifies the expected output format including the register operations (e.g.,4p
for push operations and2c4
for copy operations).
…in this PR; Fix void to auto -> void.
Co-authored-by: Lin Zhihao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (9)
tests/test-lexer.cpp (1)
211-211
: Potential duplication in test setup.This line, extracting
schema_var_ast_ptr
, appears in multiple tests. Consider consolidating common test logic into a shared helper function if it becomes repetitive.src/log_surgeon/finite_automata/Dfa.hpp (4)
66-66
: Superset determinization documentation.Clearly describes the algorithm’s purpose. Keep parameters (
@oaram
) typo-free in doc comments.- * @oaram nfa The NFA used to generate the DFA. + * @param nfa The NFA used to generate the DFA.
76-81
: Register initialization logic.Reserving two registers per tag—one for initial tracking, one for final—supports set/copy operations effectively. Consider adding an upper bound check for
num_tags
if resource constraints apply.
132-144
: Transition assembly.
get_transitions
collects transitions for each input byte. This design is flexible and keeps logic contained. Watch performance if states or byte transitions are large.
312-344
: Byte-based transitions.
get_transitions
systematically enumerates possible inputs (0..255). If UTF-8 expansions are needed, consider mindful design to handle multiple bytes.src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (4)
16-18
: Consider adding class-level documentation.
This class appears central to TDFA determinization but lacks a brief description, references to relevant automata concepts, or usage details in its docstring. Adding an overview would help maintainers and new contributors understand the overall purpose of the class.
29-31
: Use a smart pointer for m_nfa_state where feasible.
Although you verify thatm_nfa_state
is non-null, the raw pointer usage could lead to lifetime mismanagement if the underlyingTypedNfaState
is destroyed elsewhere prematurely. A smart pointer (e.g.,std::shared_ptr
orstd::unique_ptr
) could help ensure safe and transparent ownership semantics.
84-91
: Consider using a standard algorithm for searching tag history.
Currently, this loop searches through them_history
vector. For clarity and conciseness, you could usestd::find_if
with a lambda that checkstag_op.get_tag_id() == tag_id
. This can slightly improve readability.
116-124
: Avoid unnecessary object copies by iterating with references.
Here,for (auto const tag_op : nfa_spontaneous_transition.get_tag_ops())
copies each element. IfTagOperation
objects are sizable, consider iterating asfor (auto const& tag_op : ...)
to reduce copy overhead.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
CMakeLists.txt
(1 hunks)src/log_surgeon/Lexer.hpp
(5 hunks)src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
(1 hunks)src/log_surgeon/finite_automata/Dfa.hpp
(3 hunks)src/log_surgeon/finite_automata/NfaState.hpp
(3 hunks)tests/CMakeLists.txt
(1 hunks)tests/test-dfa.cpp
(4 hunks)tests/test-lexer.cpp
(7 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- CMakeLists.txt
- tests/CMakeLists.txt
- src/log_surgeon/Lexer.hpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
tests/test-lexer.cpp
src/log_surgeon/finite_automata/NfaState.hpp
tests/test-dfa.cpp
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
src/log_surgeon/finite_automata/Dfa.hpp
🧠 Learnings (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2025-03-28T16:38:46.994Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🧬 Code Definitions (4)
tests/test-lexer.cpp (3)
src/log_surgeon/finite_automata/RegexAST.hpp (16)
nodiscard
(87-89)nodiscard
(134-152)nodiscard
(171-173)nodiscard
(205-207)nodiscard
(239-239)nodiscard
(256-258)nodiscard
(292-292)nodiscard
(294-294)nodiscard
(328-330)nodiscard
(406-406)nodiscard
(408-408)nodiscard
(410-410)nodiscard
(452-454)nodiscard
(487-487)nodiscard
(489-489)nodiscard
(515-517)src/log_surgeon/Schema.hpp (1)
var_schema
(29-29)src/log_surgeon/LogParser.hpp (1)
schema_ast
(161-161)
tests/test-dfa.cpp (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)
nfa
(66-66)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (2)
src/log_surgeon/finite_automata/Dfa.hpp (2)
nodiscard
(43-43)nodiscard
(56-58)src/log_surgeon/finite_automata/NfaState.hpp (6)
nodiscard
(95-95)nodiscard
(97-97)nodiscard
(99-101)nodiscard
(103-106)nodiscard
(108-110)nodiscard
(112-112)
src/log_surgeon/finite_automata/Dfa.hpp (5)
src/log_surgeon/finite_automata/DfaState.hpp (3)
nodiscard
(42-44)nodiscard
(46-48)nodiscard
(84-87)src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (4)
tag_id
(74-76)tag_id
(74-74)tag_id
(84-84)tag_id
(95-96)src/log_surgeon/finite_automata/Nfa.hpp (2)
new_state
(145-148)new_state
(145-145)src/log_surgeon/finite_automata/RegisterHandler.hpp (5)
reg_id
(45-47)reg_id
(45-45)reg_id
(53-56)reg_id
(53-53)reg_id
(58-59)src/log_surgeon/finite_automata/RegisterOperation.hpp (6)
reg_id
(27-29)reg_id
(27-27)reg_id
(31-33)reg_id
(31-31)reg_id
(40-40)reg_id
(40-40)
🔇 Additional comments (40)
tests/test-lexer.cpp (6)
90-90
: Good use of[[nodiscard]]
forserialize_id_symbol_map
.This addition helps avoid ignoring the returned string and reduces the risk of logical errors.
98-99
: Check fornullptr
atcapture_rule_ast
usage.Although there is a
REQUIRE(capture_rule_ast != nullptr);
later, ensure that no intermediate steps dereference this pointer beforehand.
175-175
: Safe map access.Using
.at()
onlexer.m_id_symbol
is consistent and will throw an exception on out-of-range access, which is safer than the subscript operator.
185-185
: Consistent token end check.Good usage of
.at()
again. This helps ensure that the retrieval of the token end symbol is safe.
229-229
: Repeated variable extraction.Similar code was seen at line 211. A shared setup helper could tighten maintainability and reduce duplication.
262-264
: Clear range verification.Verifying that the regex group covers '0' through '9' ensures correct numeric capture. This logic is well-defined.
src/log_surgeon/finite_automata/NfaState.hpp (3)
75-75
: Safer byte transition handling.Using
.at(byte)
instead of the subscript operator prevents out-of-bounds array access. This helps to catch invalid byte values at runtime.
109-109
: Consistent safe indexing.Again, replacing the subscript operator with
.at()
prevents potential undefined behaviour when the index is out of range.
182-182
: Improved exception safety on transitions.Using
.at(idx)
form_bytes_transitions
is good practice. Continue to ensure index bounds are validated or that the usage context guarantees valid indices.tests/test-dfa.cpp (5)
28-40
: Helpful DFA comparison helper introduced.The
compare_serialized_dfa
function lowers code duplication and centralizes the validation of DFA serialization. Good addition.
89-89
: Appropriate usage ofcompare_serialized_dfa
.This call effectively replaces repetitive string comparison logic, improving test clarity.
116-117
: Consistent verification pattern.Reusing the same helper enhances maintainability, ensuring tests follow a uniform approach to string comparison.
119-145
: New "Test Simple Tagged DFA" block.This test case is concise and validates proper handling for a basic tagged DFA. Its coverage of the set and copy operations for tagged states appears correct.
150-183
: Robust "Test Complex Tagged DFA" block.This test thoroughly checks multiple nested tags and composition of transitions. It effectively ensures final states capture correct register operations.
src/log_surgeon/finite_automata/Dfa.hpp (25)
4-5
: New inclusions for determinization.Adding
<algorithm>
and<cstddef>
is reasonable if used within new determinization or register-related logic. Validate build steps to remove unused includes later if necessary.
10-10
: Queue-based BFS approach.Bringing in
<queue>
is consistent with the BFS logic introduced later inget_bfs_traversal_order()
.
12-13
: Additional includes for strings and maps.
<string>
and<unordered_map>
are relevant for storing configuration sets, tag mappings, and serialization.
14-14
: Include<unordered_set>
for BFS tracking.Used to track visited states in
get_bfs_traversal_order()
, ensuring no duplicates.
22-22
: Determinization integration.Bringing in
DeterminizationConfiguration.hpp
is essential for the superset construction of DFA states.
25-27
: Register-based expansions.These includes (
RegisterHandler
,RegisterOperation
,TagOperation
) confirm the advanced handling of tag registers in the DFA.
33-33
: Type alias for determinization.
using ConfigurationSet = std::set<DeterminizationConfiguration<TypedNfaState>>;
clarifies usage and fosters readability.
43-43
: Safe root state retrieval.Accessing
m_states.at(0).get()
avoids out-of-bounds issues. Verifym_states
is guaranteed non-empty upon construction.
56-58
: Public mapping retrieval.Exposing
m_tag_id_to_final_reg_id
as a map of tag ID to final register ID helps the lexer or other consumers align final states and registers.
93-96
: Mapping attempt across configuration sets.
try_get_mapping
is well-documented and helps avoid duplicating states. Ensure it accounts for all relevant tag operations.
116-120
: Creating DFA state or reusing existing.
create_or_get_dfa_state
elegantly prevents repeated creation, aiming for minimal states under the superset approach.
149-154
: Tag register operation assignment.
assign_transition_reg_ops
systematically sets or negates registers within closure states. This helps ensure correct tag capturing.
164-167
: Reassigning register operations.
reassign_transition_reg_ops
merges new mapping or appends copy ops. This is crucial to ensure non-diverging states.
169-179
: Constructing new DFA states with final register assignment.
new_state
handles acceptance correctness by setting or copying relevant registers. This ensures final states accurately track tag data.
193-195
: Deferred generation in constructor.
generate(nfa);
is neatly separated, keeping constructor short and focusing on the generation step in a dedicated function.
198-231
: Core superset determinization logic.The BFS-like approach with
unexplored_sets
ensures that every newly discovered configuration set is fully expanded. This is a standard, effective pattern.
234-245
: Two-register-per-tag approach.
initialize_registers
sets up robust final copying. Optionally add a range check for hugenum_tags
.
248-290
: Configuration equivalence check.
try_get_mapping
cross-checks states and register usage. This helps unify states that differ only by register IDs.
293-309
: Avoiding duplicative states.
create_or_get_dfa_state
checks existing sets before new creation. The register mapping check is consistent with your minimal DFA strategy.
347-384
: Transition register assignment.Handles partial or repeated tag operations well. The final set usage ensures your updated closure is captured for subsequent transitions.
387-407
: Convergent mapping to existing register IDs.
reassign_transition_reg_ops
ensures we do not inflate states by mismatched registers. A well-targeted approach to unify them.
410-443
: Populating acceptance logic.Copies or sets final registers for accepted states. This robust approach supports advanced tag usage while ensuring correctness.
446-464
: Schema intersection method.
get_intersect(...)
pairs states from two DFAs, capturing matching variable IDs. This is a beneficial utility for refining schema coverage.
467-497
: BFS state order retrieval.
get_bfs_traversal_order
is a standard approach to produce consistent state ordering, beneficial for stable serialization.
500-518
: Optional serialization.Returns
std::nullopt
if any state fails to serialize, ensuring overall correctness. This final step ties together BFS ordering.src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1)
132-144
: Good job on implementing a stack-based approach to find all spontaneously reachable configurations.
The logic for exploring and updating possible configurations appears sound and aligns with the project’s coding guidelines (explicit checks withfalse == ...
).
…; Remove unused function.
…reachable_configs to private.
* Add a delimiters line from the schema to the lexer | ||
* @param delimiters | ||
* Adds a list of delimiter types to the lexer. | ||
* @param delimiters A vector containing the delimiter types to be added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets mention:
- it invalidates any existing delimiters
- delimiters param cannot be empty
// TODO: determine if we can switch to safely accessing `m_is_delimiter` and | ||
// `m_is_first_char_of_a_variable` with `at()` without impacting performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets make this a github issue rather than todo so it is easier to track.
References
test-lexer.cpp
logging usingCAPTURE
; Add a test case for an NFA with a capture group; Add a test case for a Lexer without capture groups. #92 .Description
generate()
function called by the constructor.Validation performed
dfa-test.cpp
to test tagged DFA.Summary by CodeRabbit
New Features
DeterminizationConfiguration
class for managing NFA configurations.Enhancements
Tests