feat: Add TDFA determinization; Update lexer's tag to final register map using TDFA. #91

SharafMohamed · 2025-02-03T18:17:47Z

References

Depends on test: Improve test-lexer.cpp logging using CAPTURE; Add a test case for an NFA with a capture group; Add a test case for a Lexer without capture groups. #92.
To review in parallel, run:

git fetch upstream pull/92/head:pr-92
git fetch upstream pull/91/head:pr-91
git diff pr-92 pr-91

Description

Add TDFA determinization.
Add a generate() function called by the constructor.
Separate existing determinization steps into several functions.
Lexer now updates its tag to final register map using the TDFA's output.

Validation performed

Update test-dfa.cpp to test tagged DFAs.

Summary by CodeRabbit

New Features
- Introduced a new DeterminizationConfiguration class for managing NFA configurations.
- Added functionality for DFA state serialization and enhanced tag-to-register mapping.
Enhancements
- Refined logic for retrieving register IDs from tag IDs.
- Streamlined handling of DFA states derived from NFA configurations.
- Updated test structure to include new header files and test cases.
Tests
- Expanded test coverage with new cases validating simple and complex tagged scenarios.
- Improved lexical analysis tests to ensure correct capture group identification.
- Introduced a helper function for comparing serialized DFA outputs.

coderabbitai · 2025-02-03T18:17:57Z

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request updates the build configuration and several source files for the log-surgeon project. New source files are added, and existing files are modified to change data structures and control flows. Notably, the DFA and lexing modules now employ updated register mapping with changes from unordered to ordered maps, and a new determinization configuration class is introduced. Additionally, tests are expanded to validate the DFA and Lexer functionality, while obsolete functionality (e.g., epsilon closure in NfaState) is removed.

Changes

File(s)	Change Summary
`CMakeLists.txt` and `tests/CMakeLists.txt`	Updated source file lists to include new finite automata files (e.g., `.../DeterminizationConfiguration.hpp`) and test cases.
`src/.../Lexer.hpp` and `src/.../Lexer.tpp`	Updated method signatures and member variable types; changed `m_tag_to_final_reg_id` to use `std::map` instead of `std::unordered_map`.
`src/.../finite_automata/DeterminizationConfiguration.hpp`	New header file introducing the `DeterminizationConfiguration` class with various methods for managing configurations of non-deterministic finite automata (NFA).
`src/.../finite_automata/Dfa.hpp`	Modified the Dfa class: added a type alias for configuration sets, updated the constructor, added new methods (e.g., `serialize`, `get_tag_id_to_final_reg_id`), and introduced several helper methods for DFA generation.
`src/.../finite_automata/NfaState.hpp`	Removed the `epsilon_closure` method and eliminated unnecessary header includes.
`tests/test-dfa.cpp`	Added a new helper function for comparing serialized DFA outputs and two new test cases for tagged DFAs.
`tests/test-lexer.cpp`	Updated the lexer test case to include the `[[nodiscard]]` attribute and improved error handling by using `at()` for map access.
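One practical effect of moving `m_tag_to_final_reg_id` from `std::unordered_map` to `std::map` is that iteration follows ascending key order, which keeps any serialized or logged output stable. The sketch below only illustrates that property; the type aliases and the `serialize_map` helper are hypothetical and not taken from the PR.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical aliases; the real types live in log-surgeon's headers.
using tag_id_t = std::uint32_t;
using reg_id_t = std::uint32_t;

// Serializes a tag-to-register map. std::map iterates in ascending key order,
// so the output does not depend on insertion order or on a hash function.
auto serialize_map(std::map<tag_id_t, reg_id_t> const& tag_to_reg) -> std::string {
    std::string out;
    for (auto const& [tag_id, reg_id] : tag_to_reg) {
        if (false == out.empty()) {
            out += ",";
        }
        out += std::to_string(tag_id) + "->" + std::to_string(reg_id);
    }
    return out;
}
```

Inserting tags 3, 1, 2 in that order still yields a string ordered by tag ID, which is convenient when tests compare serialized output line by line.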

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant D as Dfa
    participant DC as DeterminizationConfiguration
    participant N as Nfa

    C->>D: Construct Dfa with Nfa reference
    D->>D: Invoke generate() method
    D->>DC: Create initial configuration from Nfa state
    DC->>D: Update reachable configurations (spontaneous_closure & update_reachable_configs)
    alt New transition found
        DC->>DC: Generate child configuration via child_configuration_with_new_state_and_tag()
        DC->>D: Return new configuration
    end
    D->>D: Create or retrieve DFA state from configuration set
    D->>D: Establish transitions and assign register operations
    D->>C: Serialize DFA and return string representation
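The "create or retrieve DFA state from configuration set" step in the diagram follows the usual worklist shape of determinization. The sketch below is a plain subset construction with epsilon closure, written under heavy simplification: it has no tags, registers, or register operations, and every name in it (`NfaSketch`, `DfaSketch`, `closure`, `determinize`) is hypothetical rather than the PR's API.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <utility>
#include <vector>

using NfaStateId = std::size_t;
using ConfigSet = std::set<NfaStateId>;

// Hypothetical minimal NFA: epsilon edges and labelled edges, indexed by state ID.
struct NfaSketch {
    std::vector<std::vector<NfaStateId>> epsilon;
    std::vector<std::multimap<char, NfaStateId>> on_char;
};

struct DfaSketch {
    std::vector<ConfigSet> states;  // states[i] = set of NFA states it represents
    std::map<std::pair<std::size_t, char>, std::size_t> transitions;
};

// Expands `set` with every state reachable through epsilon edges alone.
auto closure(NfaSketch const& nfa, ConfigSet set) -> ConfigSet {
    std::vector<NfaStateId> stack(set.begin(), set.end());
    while (false == stack.empty()) {
        auto const state = stack.back();
        stack.pop_back();
        for (auto const target : nfa.epsilon[state]) {
            if (set.insert(target).second) {
                stack.push_back(target);
            }
        }
    }
    return set;
}

auto determinize(NfaSketch const& nfa, NfaStateId const root) -> DfaSketch {
    DfaSketch dfa;
    std::map<ConfigSet, std::size_t> set_to_id;
    std::queue<std::size_t> worklist;

    // Creates the DFA state for `set` on first sight, otherwise returns the existing ID.
    auto create_or_get = [&](ConfigSet const& set) -> std::size_t {
        auto const [it, inserted] = set_to_id.try_emplace(set, dfa.states.size());
        if (inserted) {
            dfa.states.push_back(set);
            worklist.push(it->second);
        }
        return it->second;
    };

    create_or_get(closure(nfa, {root}));
    while (false == worklist.empty()) {
        auto const id = worklist.front();
        worklist.pop();
        // Group the labelled edges leaving this configuration set by input character.
        std::map<char, ConfigSet> reachable;
        for (auto const state : dfa.states[id]) {
            for (auto const& [c, target] : nfa.on_char[state]) {
                reachable[c].insert(target);
            }
        }
        for (auto& [c, targets] : reachable) {
            dfa.transitions[{id, c}] = create_or_get(closure(nfa, std::move(targets)));
        }
    }
    return dfa;
}
```

Keying the lookup on the whole configuration set is what lets two different paths through the NFA land on the same DFA state. A tagged construction would additionally have to carry register information in each configuration, which this sketch deliberately leaves out.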

Suggested reviewers

LinZhihao-723

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (24)

src/log_surgeon/finite_automata/RegisterHandler.hpp (1)
21-27: Consider pre-allocating the vector capacity.
This can reduce repeated memory allocations when adding multiple registers.

Here's a possible diff:
 auto add_registers(uint32_t const num_reg_to_add) -> std::vector<uint32_t> {
     std::vector<uint32_t> added_registers;
+    added_registers.reserve(num_reg_to_add);
     for (uint32_t i{0}; i < num_reg_to_add; ++i) {
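For context, a self-contained version of the pattern might look like the sketch below. `RegisterHandlerSketch` and its `m_next_reg_id` counter are hypothetical stand-ins, since the body of the real `add_registers` loop is not shown in this comment; only the `reserve()` call is the point.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for RegisterHandler.
class RegisterHandlerSketch {
public:
    auto add_registers(std::uint32_t const num_reg_to_add) -> std::vector<std::uint32_t> {
        std::vector<std::uint32_t> added_registers;
        // One allocation up front instead of repeated growth inside the loop.
        added_registers.reserve(num_reg_to_add);
        for (std::uint32_t i{0}; i < num_reg_to_add; ++i) {
            added_registers.push_back(m_next_reg_id++);
        }
        return added_registers;
    }

private:
    std::uint32_t m_next_reg_id{0};
};
```

`reserve()` changes capacity only, so the loop still uses `push_back` and the returned vector has exactly `num_reg_to_add` elements.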
src/log_surgeon/Lexer.hpp (1)
128-128: Returning a const reference to a unique_ptr may be unconventional.
Consider returning a raw pointer or a reference to the underlying Dfa object instead for clarity.
-    ) const -> std::unique_ptr<finite_automata::Dfa<TypedDfaState, TypedNfaState>> const& {
+    ) const -> finite_automata::Dfa<TypedDfaState, TypedNfaState> const* {
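The difference between the two accessor styles can be seen in a reduced form. This is a sketch with hypothetical names (`DfaSketch`, `LexerSketch`, `get_dfa_ptr_ref`, `get_dfa`), not the actual `Lexer` interface.

```cpp
#include <memory>

// Hypothetical placeholder for the Dfa type.
struct DfaSketch {
    int num_states{0};
};

class LexerSketch {
public:
    LexerSketch() : m_dfa{std::make_unique<DfaSketch>()} {}

    // Exposes the ownership wrapper itself, so callers learn how the Dfa is stored.
    [[nodiscard]] auto get_dfa_ptr_ref() const -> std::unique_ptr<DfaSketch> const& {
        return m_dfa;
    }

    // Non-owning observer; nullptr would mean that no DFA has been generated yet.
    [[nodiscard]] auto get_dfa() const -> DfaSketch const* { return m_dfa.get(); }

private:
    std::unique_ptr<DfaSketch> m_dfa;
};
```

Both accessors reach the same object; the raw-pointer form simply keeps the storage choice out of the public signature.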
src/log_surgeon/finite_automata/NfaState.hpp (2)
93-93: Returning a reference to a bool is unusual.
Although it technically works, returning by value is often preferable for clarity and to avoid unexpected referencing issues.
-    [[nodiscard]] auto is_accepting() const -> bool const& { return m_accepting; }
+    [[nodiscard]] auto is_accepting() const -> bool { return m_accepting; }
174-175: Consider aligning naming with captures rather than tags.
The serialized output calls it "accepting_tag", yet the code sets m_matching_variable_id. If the design now focuses on captures, a more consistent name would improve clarity.
src/log_surgeon/finite_automata/Nfa.hpp (1)
101-104: Potential performance concern returning a large map by value
Currently, get_capture_to_tag_ids() returns a map by value, which may be expensive if the map is large. Consider returning a const& or using a lightweight wrapper if the caller only needs read-only access.
- [[nodiscard]] auto get_capture_to_tag_ids() const -> std::unordered_map<Capture const*, std::pair<tag_id_t, tag_id_t>> {
-     return m_capture_to_tag_ids;
- }
+ [[nodiscard]] auto const& get_capture_to_tag_ids() const {
+     return m_capture_to_tag_ids;
+ }
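A reduced comparison of the two return styles is sketched below. `NfaSketch` is hypothetical, and it keys the map on `std::string` instead of `Capture const*` purely to stay self-contained.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

using tag_id_t = std::uint32_t;
using CaptureMap = std::unordered_map<std::string, std::pair<tag_id_t, tag_id_t>>;

// Hypothetical stand-in for Nfa.
class NfaSketch {
public:
    auto add_capture(std::string name, tag_id_t const start, tag_id_t const end) -> void {
        m_capture_to_tag_ids.emplace(std::move(name), std::make_pair(start, end));
    }

    // Copies every entry on each call.
    [[nodiscard]] auto get_by_value() const -> CaptureMap { return m_capture_to_tag_ids; }

    // Hands out read-only access to the member; no copy is made.
    [[nodiscard]] auto get_by_ref() const -> CaptureMap const& { return m_capture_to_tag_ids; }

private:
    CaptureMap m_capture_to_tag_ids;
};
```

The reference is only valid while the owning object is alive, so a caller that needs the data to outlive the NFA should still take a copy.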
src/log_surgeon/Lexer.tpp (2)

86-86: Transition logic refactor
Replacing direct transitions with get_dest_state(next_char) makes the flow more explicit. Consider logging or handling a scenario where no valid destination state is found, ensuring the user is informed if the character is unexpected.

380-416: Capture uniqueness check
Throwing an exception on duplicate capture names is a solid safeguard. Be sure all user inputs are validated, and consider whether a more user-friendly error message (or logging) might help users understand the conflict quickly.
src/log_surgeon/finite_automata/Dfa.hpp (2)
64-65: Typo in doxygen
In the comment block for generate, the line uses @oaram nfa instead of @param nfa. Correct this to maintain consistency in documentation tags.
-  * @oaram nfa The NFA used to generate the DFA.
+  * @param nfa The NFA used to generate the DFA.
386-406: Clarify reassign_transition_reg_ops usage
Adding new register operations when no direct match exists can be subtle. Ensure the logic that inserts {new_reg_id, old_reg_id} captures the correct source and destination semantics. A short inline comment might prevent confusion for future maintainers.
src/log_surgeon/finite_automata/RegexAST.hpp (1)
110-122: Minor spelling nitpick in the comment.
The code logic is correct, but “spontaenous” should be spelled “spontaneous”.

Proposed fix for line 113 comment:
-// root --(regex)--> state_with_spontaenous_transition --(negate capture tags)--> end_state
+// root --(regex)--> state_with_spontaneous_transition --(negate capture tags)--> end_state
src/log_surgeon/UniqueIdGenerator.hpp (1)

1-17: Simple unique ID generator is functional but not concurrency-safe.
If you plan to use this in multi-threaded code, consider adding a mutex or atomic operations. Otherwise, this implementation is acceptable for single-threaded environments.
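If thread safety were ever needed, one option is an atomic counter. The class below is a hypothetical sketch, not the contents of `UniqueIdGenerator.hpp`, and it assumes that IDs only need to be unique rather than ordered with respect to other memory operations.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical thread-safe variant of a unique ID generator.
class AtomicUniqueIdGenerator {
public:
    // fetch_add returns the previous value, so IDs start at 0 and are never
    // handed out twice, even when called from several threads at once.
    [[nodiscard]] auto generate_id() -> std::uint32_t {
        return m_next_id.fetch_add(1, std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint32_t> m_next_id{0};
};
```

Note that a `std::atomic` member makes the class non-copyable, which may or may not be acceptable depending on how the generator is passed around.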
src/log_surgeon/finite_automata/TagOperation.hpp (1)
39-48: Consider adding compile-time enum exhaustiveness checking.

The default case might indicate incomplete enum handling. Consider using a switch statement without a default case to get compiler warnings about unhandled enum values.
         switch (m_type) {
             case TagOperationType::Set:
                 return fmt::format("{}{}", m_tag_id, "p");
             case TagOperationType::Negate:
                 return fmt::format("{}{}", m_tag_id, "n");
-            default:
-                return fmt::format("{}{}", m_tag_id, "?");
         }
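Dropping the `default` label means control can fall out of the `switch`, so the function still needs a return afterwards. The sketch below shows that shape with a hypothetical free function `serialize_tag_op` and `std::to_string` in place of `fmt::format`; with GCC or Clang, `-Wswitch` (part of `-Wall`) then reports any enumerator that lacks a `case`.

```cpp
#include <cstdint>
#include <string>

enum class TagOperationType : std::uint8_t {
    Set,
    Negate
};

// Hypothetical free-function version of the serialization logic.
auto serialize_tag_op(std::uint32_t const tag_id, TagOperationType const type) -> std::string {
    switch (type) {
        case TagOperationType::Set:
            return std::to_string(tag_id) + "p";
        case TagOperationType::Negate:
            return std::to_string(tag_id) + "n";
    }
    // Reached only if `type` holds a value outside the declared enumerators.
    return std::to_string(tag_id) + "?";
}
```

Adding a third enumerator without a matching `case` would now produce a compiler warning instead of silently serializing as `?`.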
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (2)
56-62: Consider handling empty tag operations explicitly.

The current implementation might not handle empty tag operations gracefully. Consider adding a special case for empty operations.
+    if (m_tag_ops.empty()) {
+        return fmt::format("{}", state_id_it->second);
+    }
     auto transformed_operations
             = m_tag_ops | std::ranges::views::transform([](TagOperation const& tag_op) {
                   return tag_op.serialize();
               });
25-29: Consider adding the explicit keyword to the two-parameter constructor.

The single-parameter constructor is already explicit. Marking the two-parameter constructor explicit as well prevents implicit conversions from braced initializer lists.
     explicit SpontaneousTransition(TypedNfaState const* dest_state) : m_dest_state{dest_state} {}

-    SpontaneousTransition(std::vector<TagOperation> tag_ops, TypedNfaState const* dest_state)
+    explicit SpontaneousTransition(std::vector<TagOperation> tag_ops, TypedNfaState const* dest_state)
             : m_tag_ops{std::move(tag_ops)},
               m_dest_state{dest_state} {}
src/log_surgeon/finite_automata/DfaTransition.hpp (3)
26-28: Consider initializing m_dest_state in the member initializer list.

The m_dest_state member is already initialized to nullptr in the class definition, but for consistency and to follow best practices, consider initializing it in the member initializer list.
 DfaTransition(std::vector<RegisterOperation> reg_ops, DfaState<state_type>* dest_state)
         : m_reg_ops{std::move(reg_ops)},
-          m_dest_state{dest_state} {}
+          m_dest_state{dest_state} {
+ }
53-53: Consider using if (state_ids.contains(m_dest_state) == false) for consistency.

The coding guidelines specify preferring false == <expression> rather than !<expression>. For consistency, consider applying this to the contains checks.
-    if (false == state_ids.contains(m_dest_state)) {
+    if (state_ids.contains(m_dest_state) == false) {
         return std::nullopt;
     }

     std::vector<std::string> transformed_ops;
     for (auto const& reg_op : m_reg_ops) {
         auto const optional_serialized_op{reg_op.serialize()};
-        if (false == optional_serialized_op.has_value()) {
+        if (optional_serialized_op.has_value() == false) {
Also applies to: 60-60

57-64: Consider reserving vector capacity for performance.

Since we know the size of m_reg_ops, we can reserve space in transformed_ops to avoid reallocations.
     std::vector<std::string> transformed_ops;
+    transformed_ops.reserve(m_reg_ops.size());
     for (auto const& reg_op : m_reg_ops) {
src/log_surgeon/finite_automata/DfaStatePair.hpp (1)
74-74: Consider consistent ordering in pointer and boolean comparisons.

While nullptr != dest_state1 follows the coding guidelines for boolean expressions, consider applying the same pattern to the contains check.
     if (nullptr != dest_state1 && nullptr != dest_state2) {
         DfaStatePair const reachable_pair{dest_state1, dest_state2};
-        if (false == visited_pairs.contains(reachable_pair)) {
+        if (visited_pairs.contains(reachable_pair) == false) {
Also applies to: 76-76
src/log_surgeon/finite_automata/PrefixTree.hpp (1)
28-28: Consider documenting the purpose of cDefaultPos.

The new constant cDefaultPos lacks documentation explaining its purpose and usage.
     static constexpr id_t cRootId{0};
-    static constexpr position_t cDefaultPos{0};
+    /// Default position value used when initializing new positions
+    static constexpr position_t cDefaultPos{0};
tests/test-register-handler.cpp (1)
23-27: Consider extracting magic number to a named constant.

The condition 0 == i uses a magic number. Consider extracting it to a named constant for better readability and maintainability.
+    static constexpr size_t cFirstRegisterIndex{0};
     for (size_t i{0}; i < num_registers; ++i) {
-        if (0 == i) {
+        if (cFirstRegisterIndex == i) {
             handler.add_register();
         } else {
             handler.add_register(i);
tests/test-dfa.cpp (2)
28-69: Test case provides good coverage for untagged DFA.

The test effectively validates the DFA's serialization format and state transitions.

Consider attaching descriptive messages to the REQUIRE statements to help debug test failures. Assuming the tests use Catch2, REQUIRE takes only the expression, so the message goes in a preceding INFO:
+        INFO(fmt::format("Line mismatch.\nExpected: {}\nActual: {}", expected_line, actual_line));
         REQUIRE(actual_line == expected_line);
71-119: Test case thoroughly validates tagged DFA functionality.

The test effectively covers complex regex patterns with named capturing groups.

Consider extracting the expected serialized DFA string to a separate constant or file to improve readability and maintainability.
tests/test-lexer.cpp (1)
102-135: Consider refactoring the initialization logic.

The initialization function contains multiple responsibilities and could benefit from being split into smaller, focused functions.

Consider extracting the delimiter handling and symbol mapping into separate functions:
+auto initialize_delimiters(ByteLexer& lexer) -> vector<uint32_t> {
+    vector<uint32_t> const cDelimiters{' ', '\n'};
+    lexer.add_delimiters(cDelimiters);
+    
+    vector<uint32_t> delimiters;
+    for (uint32_t i{0}; i < log_surgeon::cSizeOfByte; i++) {
+        if (lexer.is_delimiter(i)) {
+            delimiters.push_back(i);
+        }
+    }
+    return delimiters;
+}
+
+auto initialize_symbol_mapping(ByteLexer& lexer) -> void {
+    lexer.m_symbol_id[log_surgeon::cTokenEnd] = static_cast<uint32_t>(SymbolId::TokenEnd);
+    lexer.m_symbol_id[log_surgeon::cTokenUncaughtString] = static_cast<uint32_t>(SymbolId::TokenUncaughtString);
+    lexer.m_id_symbol[static_cast<uint32_t>(SymbolId::TokenEnd)] = log_surgeon::cTokenEnd;
+    lexer.m_id_symbol[static_cast<uint32_t>(SymbolId::TokenUncaughtString)] = log_surgeon::cTokenUncaughtString;
+}
.github/workflows/build.yaml (1)

55-59: Addition of Test Log Output on Failure
Adding a step to print the last test log on failure enhances the debugging capability when tests do not pass. For robustness, consider checking whether the log file exists before attempting to print it, to avoid potential errors if the file is missing.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a513c5f and 01a26e5.

📒 Files selected for processing (32)

.github/workflows/build.yaml (3 hunks)
CMakeLists.txt (2 hunks)
examples/intersect-test.cpp (3 hunks)
lint-tasks.yml (2 hunks)
src/log_surgeon/Lexer.hpp (4 hunks)
src/log_surgeon/Lexer.tpp (10 hunks)
src/log_surgeon/LexicalRule.hpp (3 hunks)
src/log_surgeon/SchemaParser.cpp (4 hunks)
src/log_surgeon/UniqueIdGenerator.hpp (1 hunks)
src/log_surgeon/finite_automata/Capture.hpp (2 hunks)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1 hunks)
src/log_surgeon/finite_automata/Dfa.hpp (3 hunks)
src/log_surgeon/finite_automata/DfaState.hpp (3 hunks)
src/log_surgeon/finite_automata/DfaStatePair.hpp (1 hunks)
src/log_surgeon/finite_automata/DfaTransition.hpp (1 hunks)
src/log_surgeon/finite_automata/Nfa.hpp (6 hunks)
src/log_surgeon/finite_automata/NfaState.hpp (5 hunks)
src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (19 hunks)
src/log_surgeon/finite_automata/Register.hpp (1 hunks)
src/log_surgeon/finite_automata/RegisterHandler.hpp (2 hunks)
src/log_surgeon/finite_automata/RegisterOperation.hpp (1 hunks)
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1 hunks)
src/log_surgeon/finite_automata/TagOperation.hpp (1 hunks)
src/log_surgeon/finite_automata/TaggedTransition.hpp (0 hunks)
tests/CMakeLists.txt (2 hunks)
tests/test-capture.cpp (1 hunks)
tests/test-dfa.cpp (1 hunks)
tests/test-lexer.cpp (6 hunks)
tests/test-nfa.cpp (2 hunks)
tests/test-register-handler.cpp (2 hunks)
tests/test-tag.cpp (0 hunks)

💤 Files with no reviewable changes (2)

tests/test-tag.cpp
src/log_surgeon/finite_automata/TaggedTransition.hpp

✅ Files skipped from review due to trivial changes (2)

src/log_surgeon/finite_automata/Register.hpp
src/log_surgeon/finite_automata/Capture.hpp

🧰 Additional context used

📓 Path-based instructions (23)

src/log_surgeon/UniqueIdGenerator.hpp (1)