Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add TDFA determinization; Update lexer's tag to final register map using TDFA. #91

Open
wants to merge 101 commits into
base: main
Choose a base branch
from

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Feb 3, 2025

References

git fetch upstream pull/92/head:pr-92
git fetch upstream pull/91/head:pr-91
git diff pr-92 pr-91

Description

  • Add TDFA determinization.
  • Add a generate() function called by the constructor.
  • Separate existing determinizations steps into several functions.
  • Lexer now updates its tag to final register map using the TDFA's output.

Validation performed

  • Update dfa-test.cpp to test tagged DFA.

Summary by CodeRabbit

  • New Features

    • Introduced a new DeterminizationConfiguration class for managing NFA configurations.
    • Added functionality for DFA state serialization and enhanced tag-to-register mapping.
  • Enhancements

    • Refined logic for retrieving register IDs from tag IDs.
    • Streamlined handling of DFA states derived from NFA configurations.
    • Updated test structure to include new header files and test cases.
  • Tests

    • Expanded test coverage with new cases validating simple and complex tagged scenarios.
    • Improved lexical analysis tests to ensure correct capture group identification.
    • Introduced a helper function for comparing serialized DFA outputs.

Copy link

coderabbitai bot commented Feb 3, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request updates the build configuration and several source files for the log-surgeon project. New source files are added, and existing files are modified to change data structures and control flows. Notably, the DFA and lexing modules now employ updated register mapping with changes from unordered to ordered maps, and a new determinization configuration class is introduced. Additionally, tests are expanded to validate the DFA and Lexer functionality, while obsolete functionality (e.g., epsilon closure in NfaState) is removed.

Changes

File(s) Change Summary
CMakeLists.txt and tests/CMakeLists.txt Updated source file lists to include new finite automata files (e.g., .../DeterminizationConfiguration.hpp) and test cases.
src/.../Lexer.hpp and src/.../Lexer.tpp Updated method signatures and member variable types; changed m_tag_to_final_reg_id to use std::map instead of std::unordered_map.
src/.../finite_automata/DeterminizationConfiguration.hpp New header file introducing the DeterminationConfiguration class with various methods for managing configurations of non-deterministic finite automata (NFA).
src/.../finite_automata/Dfa.hpp Modified the Dfa class: added a type alias for configuration sets, updated the constructor, added new methods (e.g., serialize, get_tag_id_to_final_reg_id), and introduced several helper methods for DFA generation.
src/.../finite_automata/NfaState.hpp Removed the epsilon_closure method and eliminated unnecessary header includes.
tests/test-dfa.cpp Added a new helper function for comparing serialized DFA outputs and two new test cases for tagged DFAs.
tests/test-lexer.cpp Updated the lexer test case to include the [[nodiscard]] attribute and improved error handling by using at() for map access.

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant D as Dfa
    participant DC as DeterminizationConfiguration
    participant N as Nfa

    C->>D: Construct Dfa with Nfa reference
    D->>D: Invoke generate() method
    D->>DC: Create initial configuration from Nfa state
    DC->>D: Update reachable configurations (spontaneous_closure & update_reachable_configs)
    alt New transition found
        DC->>DC: Generate child configuration via child_configuration_with_new_state_and_tag()
        DC->>D: Return new configuration
    end
    D->>D: Create or retrieve DFA state from configuration set
    D->>D: Establish transitions and assign register operations
    D->>C: Serialize DFA and return string representation
Loading

Suggested reviewers

  • LinZhihao-723

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (24)
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)

21-27: Consider pre-allocating the vector capacity.
This can reduce repeated memory allocations when adding multiple registers.

Here's a possible diff:

-auto add_registers(uint32_t const num_reg_to_add) -> std::vector<uint32_t> {
-    std::vector<uint32_t> added_registers;
+auto add_registers(uint32_t const num_reg_to_add) -> std::vector<uint32_t> {
+    std::vector<uint32_t> added_registers;
+    added_registers.reserve(num_reg_to_add);
     for (uint32_t i{0}; i < num_reg_to_add; ++i) {
src/log_surgeon/Lexer.hpp (1)

128-128: Returning a const reference to a unique_ptr may be unconventional.
Consider returning a raw pointer or a reference to the underlying Dfa object instead for clarity.

-    ) const -> std::unique_ptr<finite_automata::Dfa<TypedDfaState, TypedNfaState>> const& {
+    ) const -> finite_automata::Dfa<TypedDfaState, TypedNfaState> const* {
src/log_surgeon/finite_automata/NfaState.hpp (2)

93-93: Returning a reference to a bool is unusual.
Although it technically works, returning by value is often preferable for clarity and to avoid unexpected referencing issues.

-    [[nodiscard]] auto is_accepting() const -> bool const& { return m_accepting; }
+    [[nodiscard]] auto is_accepting() const -> bool { return m_accepting; }

174-175: Consider aligning naming with captures rather than tags.
The serialized output calls it "accepting_tag", yet the code sets m_matching_variable_id. If the design now focuses on captures, a more consistent name would improve clarity.

src/log_surgeon/finite_automata/Nfa.hpp (1)

101-104: Potential performance concern returning a large map by value
Currently, get_capture_to_tag_ids() returns a map by value, which may be expensive if the map is large. Consider returning a const& or using a lightweight wrapper if the caller only needs read-only access.

- [[nodiscard]] auto get_capture_to_tag_ids() const -> std::unordered_map<Capture const*, std::pair<tag_id_t, tag_id_t>> {
-     return m_capture_to_tag_ids;
- }
+ [[nodiscard]] auto const& get_capture_to_tag_ids() const {
+     return m_capture_to_tag_ids;
+ }
src/log_surgeon/Lexer.tpp (2)

86-86: Transition logic refactor
Replacing direct transitions with get_dest_state(next_char) makes the flow more explicit. Consider logging or handling a scenario where no valid destination state is found, ensuring the user is informed if the character is unexpected.


380-416: Capture uniqueness check
Throwing an exception on duplicate capture names is a solid safeguard. Be sure all user inputs are validated, and consider whether a more user-friendly error message (or logging) might help users understand the conflict quickly.

src/log_surgeon/finite_automata/Dfa.hpp (2)

64-65: Typo in doxygen
In the comment block for generate, the line uses @oaram nfa instead of @param nfa. Correct this to maintain consistency in documentation tags.

-  * @oaram nfa The NFA used to generate the DFA.
+  * @param nfa The NFA used to generate the DFA.

386-406: Clarify reassign_transition_reg_ops usage
Adding new register operations when no direct match exists can be subtle. Ensure the logic that inserts {new_reg_id, old_reg_id} captures the correct source and destination semantics. A short inline comment might prevent confusion for future maintainers.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

110-122: Minor spelling nitpick in the comment.
The code logic is correct, but “spontaenous” should be spelled “spontaneous”.

Proposed fix for line 113 comment:

-// root --(regex)--> state_with_spontaenous_transition --(negate capture tags)--> end_state
+// root --(regex)--> state_with_spontaneous_transition --(negate capture tags)--> end_state
src/log_surgeon/UniqueIdGenerator.hpp (1)

1-17: Simple unique ID generator is functional but not concurrency-safe.
If you plan to use this in multi-threaded code, consider adding a mutex or atomic operations. Otherwise, this implementation is acceptable for single-threaded environments.

src/log_surgeon/finite_automata/TagOperation.hpp (1)

39-48: Consider adding compile-time enum exhaustiveness checking.

The default case might indicate incomplete enum handling. Consider using a switch statement without a default case to get compiler warnings about unhandled enum values.

-        switch (m_type) {
+        switch (m_type) {
             case TagOperationType::Set:
                 return fmt::format("{}{}", m_tag_id, "p");
             case TagOperationType::Negate:
                 return fmt::format("{}{}", m_tag_id, "n");
-            default:
-                return fmt::format("{}{}", m_tag_id, "?");
         }
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (2)

56-62: Consider handling empty tag operations explicitly.

The current implementation might not handle empty tag operations gracefully. Consider adding a special case for empty operations.

+    if (m_tag_ops.empty()) {
+        return fmt::format("{}", state_id_it->second);
+    }
     auto transformed_operations
             = m_tag_ops | std::ranges::views::transform([](TagOperation const& tag_op) {
                   return tag_op.serialize();
               });

25-29: Consider adding explicit keyword to single-parameter constructor.

To prevent implicit conversions, consider marking the single-parameter constructor as explicit.

-    explicit SpontaneousTransition(TypedNfaState const* dest_state) : m_dest_state{dest_state} {}
+    explicit SpontaneousTransition(TypedNfaState const* dest_state) : m_dest_state{dest_state} {}

-    SpontaneousTransition(std::vector<TagOperation> tag_ops, TypedNfaState const* dest_state)
+    explicit SpontaneousTransition(std::vector<TagOperation> tag_ops, TypedNfaState const* dest_state)
             : m_tag_ops{std::move(tag_ops)},
               m_dest_state{dest_state} {}
src/log_surgeon/finite_automata/DfaTransition.hpp (3)

26-28: Consider initializing m_dest_state in the member initializer list.

The m_dest_state member is already initialized to nullptr in the class definition, but for consistency and to follow best practices, consider initializing it in the member initializer list.

 DfaTransition(std::vector<RegisterOperation> reg_ops, DfaState<state_type>* dest_state)
         : m_reg_ops{std::move(reg_ops)},
-          m_dest_state{dest_state} {}
+          m_dest_state{dest_state} {
+ }

53-53: Consider using if (state_ids.contains(m_dest_state) == false) for consistency.

The coding guidelines specify preferring false == <expression> rather than !<expression>. For consistency, consider applying this to the contains checks.

-    if (false == state_ids.contains(m_dest_state)) {
+    if (state_ids.contains(m_dest_state) == false) {
         return std::nullopt;
     }

     std::vector<std::string> transformed_ops;
     for (auto const& reg_op : m_reg_ops) {
         auto const optional_serialized_op{reg_op.serialize()};
-        if (false == optional_serialized_op.has_value()) {
+        if (optional_serialized_op.has_value() == false) {

Also applies to: 60-60


57-64: Consider reserving vector capacity for performance.

Since we know the size of m_reg_ops, we can reserve space in transformed_ops to avoid reallocations.

     std::vector<std::string> transformed_ops;
+    transformed_ops.reserve(m_reg_ops.size());
     for (auto const& reg_op : m_reg_ops) {
src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

74-74: Consider consistent ordering in pointer and boolean comparisons.

While nullptr != dest_state1 follows the coding guidelines for boolean expressions, consider applying the same pattern to the contains check.

     if (nullptr != dest_state1 && nullptr != dest_state2) {
         DfaStatePair const reachable_pair{dest_state1, dest_state2};
-        if (false == visited_pairs.contains(reachable_pair)) {
+        if (visited_pairs.contains(reachable_pair) == false) {

Also applies to: 76-76

src/log_surgeon/finite_automata/PrefixTree.hpp (1)

28-28: Consider documenting the purpose of cDefaultPos.

The new constant cDefaultPos lacks documentation explaining its purpose and usage.

     static constexpr id_t cRootId{0};
-    static constexpr position_t cDefaultPos{0};
+    /// Default position value used when initializing new positions
+    static constexpr position_t cDefaultPos{0};
tests/test-register-handler.cpp (1)

23-27: Consider extracting magic number to a named constant.

The condition 0 == i uses a magic number. Consider extracting it to a named constant for better readability and maintainability.

+    static constexpr size_t cFirstRegisterIndex{0};
     for (size_t i{0}; i < num_registers; ++i) {
-        if (0 == i) {
+        if (cFirstRegisterIndex == i) {
             handler.add_register();
         } else {
             handler.add_register(i);
tests/test-dfa.cpp (2)

28-69: Test case provides good coverage for untagged DFA.

The test effectively validates the DFA's serialization format and state transitions.

Consider adding descriptive error messages to REQUIRE statements to help debug test failures:

-        REQUIRE(actual_line == expected_line);
+        REQUIRE(actual_line == expected_line, 
+                fmt::format("Line mismatch.\nExpected: {}\nActual: {}", expected_line, actual_line));

71-119: Test case thoroughly validates tagged DFA functionality.

The test effectively covers complex regex patterns with named capturing groups.

Consider extracting the expected serialized DFA string to a separate constant or file to improve readability and maintainability.

tests/test-lexer.cpp (1)

102-135: Consider refactoring the initialization logic.

The initialization function contains multiple responsibilities and could benefit from being split into smaller, focused functions.

Consider extracting the delimiter handling and symbol mapping into separate functions:

+auto initialize_delimiters(ByteLexer& lexer) -> vector<uint32_t> {
+    vector<uint32_t> const cDelimiters{' ', '\n'};
+    lexer.add_delimiters(cDelimiters);
+    
+    vector<uint32_t> delimiters;
+    for (uint32_t i{0}; i < log_surgeon::cSizeOfByte; i++) {
+        if (lexer.is_delimiter(i)) {
+            delimiters.push_back(i);
+        }
+    }
+    return delimiters;
+}
+
+auto initialize_symbol_mapping(ByteLexer& lexer) -> void {
+    lexer.m_symbol_id[log_surgeon::cTokenEnd] = static_cast<uint32_t>(SymbolId::TokenEnd);
+    lexer.m_symbol_id[log_surgeon::cTokenUncaughtString] = static_cast<uint32_t>(SymbolId::TokenUncaughtString);
+    lexer.m_id_symbol[static_cast<uint32_t>(SymbolId::TokenEnd)] = log_surgeon::cTokenEnd;
+    lexer.m_id_symbol[static_cast<uint32_t>(SymbolId::TokenUncaughtString)] = log_surgeon::cTokenUncaughtString;
+}
.github/workflows/build.yaml (1)

55-59: Addition of Test Log Output on Failure
Adding a step to print the last test log on failure enhances the debugging capability when tests do not pass. For robustness, consider checking whether the log file exists before attempting to print it, to avoid potential errors if the file is missing.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a513c5f and 01a26e5.

📒 Files selected for processing (32)
  • .github/workflows/build.yaml (3 hunks)
  • CMakeLists.txt (2 hunks)
  • examples/intersect-test.cpp (3 hunks)
  • lint-tasks.yml (2 hunks)
  • src/log_surgeon/Lexer.hpp (4 hunks)
  • src/log_surgeon/Lexer.tpp (10 hunks)
  • src/log_surgeon/LexicalRule.hpp (3 hunks)
  • src/log_surgeon/SchemaParser.cpp (4 hunks)
  • src/log_surgeon/UniqueIdGenerator.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Capture.hpp (2 hunks)
  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (3 hunks)
  • src/log_surgeon/finite_automata/DfaState.hpp (3 hunks)
  • src/log_surgeon/finite_automata/DfaStatePair.hpp (1 hunks)
  • src/log_surgeon/finite_automata/DfaTransition.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (6 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (5 hunks)
  • src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (19 hunks)
  • src/log_surgeon/finite_automata/Register.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegisterHandler.hpp (2 hunks)
  • src/log_surgeon/finite_automata/RegisterOperation.hpp (1 hunks)
  • src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1 hunks)
  • src/log_surgeon/finite_automata/TagOperation.hpp (1 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (0 hunks)
  • tests/CMakeLists.txt (2 hunks)
  • tests/test-capture.cpp (1 hunks)
  • tests/test-dfa.cpp (1 hunks)
  • tests/test-lexer.cpp (6 hunks)
  • tests/test-nfa.cpp (2 hunks)
  • tests/test-register-handler.cpp (2 hunks)
  • tests/test-tag.cpp (0 hunks)
💤 Files with no reviewable changes (2)
  • tests/test-tag.cpp
  • src/log_surgeon/finite_automata/TaggedTransition.hpp
✅ Files skipped from review due to trivial changes (2)
  • src/log_surgeon/finite_automata/Register.hpp
  • src/log_surgeon/finite_automata/Capture.hpp
🧰 Additional context used
📓 Path-based instructions (23)
src/log_surgeon/UniqueIdGenerator.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/SchemaParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/PrefixTree.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-capture.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LexicalRule.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-lexer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-register-handler.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

examples/intersect-test.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-dfa.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaTransition.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegisterOperation.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-nfa.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/TagOperation.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lexer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegisterHandler.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Dfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (4)
src/log_surgeon/finite_automata/RegisterHandler.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T21:56:13.425Z
Learning: In the `log_surgeon` project, header guards in C++ header files should include `_HPP` at the end to match the filename. For example, in `RegisterHandler.hpp`, the header guard should be `LOG_SURGEON_FINITE_AUTOMATA_REGISTER_HANDLER_HPP`.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/Nfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
  • GitHub Check: build (macos-latest, Release)
🔇 Additional comments (79)
src/log_surgeon/finite_automata/RegisterHandler.hpp (8)

4-4: Include usage is consistent with 32-bit integer usage.
No concerns here, looks good.


8-8: Good choice referencing register_id_t.
Ensures clarity about the register type usage.


29-35: The primary constructor for a single register is straightforward.
Implementation is correct, with no major issues found.


37-43: Parameter-specific constructor also follows the same logic.
No immediate concerns, and the usage of cDefaultPos is appropriate.


45-45: set_register method is concise.
The transition from size_t to register_id_t is consistent with the updated type usage.


49-49: copy_register method is correct.
The implementation is minimalistic, with no concurrency or logic issues.


53-57: append_position method logic is clear.
No edge cases or concurrency issues appear here.


59-63: Returning reversed positions maintains clarity.
Using m_prefix_tree is consistent with the code’s established patterns.

tests/test-nfa.cpp (4)

9-10: Additional includes are consistent with the new NfaState and LexicalRule approach.
No issues; these new includes are appropriate for constructing ByteNfa.


38-38: NFA construction aligned with the new usage.
Providing rules directly to the constructor is neat.


41-77: Updated transitions from “epsilon_transition” to “spontaneous_transition”.
The new strings in the expected serialization correctly reflect the updated NFA representation. The tags and positions appear consistent with the changes.


80-82: Conditional usage of optional and capturing for debugging.
It is good to see the test verifying that a serializable representation is produced. The CAPTURE usage is helpful for diagnosing test failures.

Also applies to: 87-87

src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (6)

1-2: Header guard style is correct.
It follows the recommended LOG_SURGEON_FINITE_AUTOMATA_DETERMINIZATION_CONFIGURATION_HPP naming convention.


4-9: Essential includes ensure minimal dependencies.
No concerns with these standard library inclusions.


11-12: Register.hpp and TagOperation.hpp includes
This confirms the usage of register_id_t and TagOperation.


14-100: The DetermizationConfiguration class design seems coherent.

  • Constructor thoroughly moves the container arguments to reduce copies.
  • The operator< properly compares in a logical order, ensuring stable sorting.
  • The child_configuration_with_new_state_and_tag method is a neat approach to build updated configurations.
    One minor note: If the history also changes often, consider a symmetrical approach for that as well. Otherwise, this looks consistent.

102-117: update_reachable_configs method
The logic pre-emptively enlists each tag operation in a child configuration. It is a solid pattern for building subsequent states.


119-134: spontaneous_closure method
Neat usage of a stack for closure. Using false == <expression> also aligns with coding guidelines.

src/log_surgeon/finite_automata/DfaState.hpp (5)

7-19: No issues with new includes.
Everything appears to be in order with the added headers and type includes.


34-38: Good constructor initialization.
Properly initializing m_bytes_transition with default objects helps avoid undefined behaviour. The approach is clear and easy to maintain.


52-55: Byte transition assignment is clear.
Storing the entire DfaTransition<state_type> object instead of raw pointers removes potential pointer-management complexities.


57-58: Ensure test coverage for accepting operations.
Consider adding unit tests or integration tests to confirm that m_accepting_ops is handled correctly at runtime.


101-141: Serialization approach is well-structured.
Leveraging fmt::join and collecting transition strings improves readability and maintainability.

src/log_surgeon/Lexer.hpp (2)

7-19: No issues with added headers and usings.
These additions properly support the new finite automata components without any apparent conflicts.


133-157: General data-access methods look good.
The optional-based lookups and returns are easy to follow and reduce the risk of invalid lookups.

src/log_surgeon/finite_automata/NfaState.hpp (3)

39-54: Constructors appear well-defined.
They simplify the creation of NFA states and make the initialization of spontaneous transitions clear.


56-70: Spontaneous transition management is straightforward.
Emplacing transitions with optional tag operations is a flexible approach.


190-190: Consistent use of “false == expression”.
Usage of if (false == optional_serialized_transition.has_value()) follows the coding style and avoids confusion with negation.

src/log_surgeon/finite_automata/Nfa.hpp (3)

4-23: Includes added for new dependencies
You have added the necessary headers for functionality like std::optional, fmt::format, and the new capture-related classes. This is coherent and neatly addresses the code’s requirements.


39-39: Good move to accept rules by const reference
Using std::vector<LexicalRule<TypedNfaState>> const& in the constructor avoids unnecessary copying while preserving immutability, which can be more efficient.


234-251: Effective use of std::optional in serialize()
Handling serialization failures by returning std::nullopt is a clean pattern, ensuring that callers are alerted to potential errors. Ensure that the caller consistently checks for this return value.

src/log_surgeon/Lexer.tpp (2)

89-96: Re-initializing state on newline
The logic sets dest_state = m_dfa->get_root()->get_dest_state(next_char) when encountering '\n'. This ensures a fresh match and captures a new token start. Confirm that overlapping tokens before this line break are handled correctly elsewhere.


340-345: Appending start-of-file logic
Prepending utf8::cCharStartOfFile for the lexer helps detect matches from the beginning of logs. Ensure any dependent code in the pipeline also recognizes this special character and does not treat it as part of real data.

src/log_surgeon/finite_automata/Dfa.hpp (2)

246-288: Careful validation of register mappings
try_get_mapping method is comprehensive in ensuring a valid one-to-one mapping between registers. This is crucial to avoid collisions. Ensure thorough testing covers edge cases where partial overlaps might attempt to map incorrectly.


403-405: Possible confusion with register operation constructor
The expression reg_ops.emplace_back(new_reg_id, old_reg_id); suggests a “copy” or alternative operation from old_reg_id to new_reg_id. Confirm the constructor’s parameters align with that usage and do not inadvertently set a register to an unintended operation type.

src/log_surgeon/finite_automata/RegexAST.hpp (21)

23-23: No issues with updated include statement.
The change from Tag.hpp to Capture.hpp aligns with the rename effort.


33-37: Documentation updates are accurate.
The references to captures instead of tags clearly convey the new concept.


85-87: Returning a const reference is appropriate.
This helps avoid unnecessary copies and is consistent with the rest of the codebase.


89-91: Setter function is straightforward.
Passing by value then moving is a valid pattern; no concerns here.


93-99: Appending subtree captures is clearly implemented.
This insertion pattern appears correct for accumulating captures.


101-103: Setter for negative captures is consistent.
The logic mirrors the subtree captures setter. No issues found.


130-153: Serialization logic looks good.
This effectively transforms and joins the negative captures for inclusion in the serialized output.


627-630: Documentation enforces non-null capture invariant.
This note is helpful for clarifying valid usage.


639-639: Detailed doc comment is appreciated.
Clearly describes when and why an exception is thrown.

Also applies to: 641-642


654-659: Null checks on capture ensure robust construction.
Throwing on nullptr is a safe way to enforce valid input.


666-668: Deep copy approach in copy constructor is correct.
Makes a new Capture instance, ensuring proper duplication.


707-707: New accessor method is consistent.
Returning std::string_view is an efficient choice.


774-774: Combining negative captures in OR constructor is logical.
Passing each branch's subtree captures to the other ensures correct negative semantics.

Also applies to: 780-784


789-790: NFA construction using add_to_nfa_with_negative_captures is correct.
Both branches are incorporated properly.


794-794: Serialization with fallback is appropriate.
Using “null” when pointers are absent clarifies the state.

Also applies to: 799-800


804-804: Concatenation subtree captures are correctly combined.
This approach follows the same logic as in OR.

Also applies to: 810-812


819-820: Intermediate state usage for cat operation is implemented correctly.
This ensures each sub-expression transitions properly.

Also applies to: 821-822


826-826: Cat node serialization mirrors the established pattern.
No issues found—consistency is maintained.

Also applies to: 831-832


836-836: Subtree captures in multiplication constructor are properly inherited.
Logical approach for repeated structures.

Also applies to: 844-844


854-855: NFA construction for repeated patterns appears correct.
Handles zero, finite, and infinite cases, preserving negative captures in each iteration.

Also applies to: 858-859, 861-862, 865-866, 869-870, 873-874, 875-876, 878-879


893-894: Final serialization step for negative captures is properly appended.
Implementation remains consistent with other variants.

tests/test-capture.cpp (1)

1-37: Test coverage for capture names is solid.
All fundamental scenarios are validated, including empty and special character names. If required, consider adding tests for extra-long or non-ASCII names in future expansions.

src/log_surgeon/LexicalRule.hpp (3)

5-7: LGTM!

The includes are well-organized and necessary for the new functionality.


28-30: LGTM!

The method is well-designed with appropriate const-correctness and return type.


46-47: LGTM!

The changes align well with the PR's objective of updating tag handling.

src/log_surgeon/finite_automata/TagOperation.hpp (2)

13-16: LGTM!

The enum class is well-designed with appropriate type safety and memory efficiency.


24-30: LGTM!

The comparison operators are implemented efficiently using std::tie with proper const-correctness.

src/log_surgeon/finite_automata/RegisterOperation.hpp (2)

42-42: LGTM!

The condition check follows the coding guideline of preferring false == <expression>.


39-53: Same enum exhaustiveness issue as in TagOperation.

The default case might indicate incomplete enum handling.

src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

72-74: LGTM! Variable naming improvement.

The change from next_state to dest_state improves code clarity by better describing the variable's purpose.

src/log_surgeon/finite_automata/PrefixTree.hpp (1)

15-20: LGTM! Documentation clarity improvement.

The documentation changes from "tag" to "capture" improve clarity and maintain consistency with the broader codebase changes.

tests/test-register-handler.cpp (1)

40-40: LGTM! Improved const correctness.

The addition of const to empty_handler improves const correctness and prevents accidental modifications.

examples/intersect-test.cpp (1)

24-24: LGTM! Template parameter updates align with TDFA implementation.

The addition of ByteNfaState as a template parameter to Dfa is consistent with the PR's objective of implementing TDFA determinization.

Also applies to: 44-44, 82-82

tests/test-lexer.cpp (2)

169-169: Address TODO comment about register value checks.

The comment indicates missing test coverage for register values simulation.

Would you like me to help implement the register value verification tests or create an issue to track this task?


321-336: Well-structured test cases with good error handling.

The test cases effectively validate both basic lexer functionality and capture group handling. The use of CAPTURE macro helps with debugging test failures.

Also applies to: 338-379

src/log_surgeon/SchemaParser.cpp (2)

177-177: LGTM! Updated to use Capture instead of Tag.

The change aligns with the PR's objective of transitioning from tags to captures.


212-212: Updated comments to reflect new terminology.

Comments have been consistently updated to use "captures" instead of "tags".

Also applies to: 258-258

.github/workflows/build.yaml (2)

31-32: OS Matrix Update: Specifying a Concrete Ubuntu Version
Updating the OS matrix to use "ubuntu-22.04" instead of an ambiguous latest version improves reproducibility and stability of the build environment.


42-44: Revised Catch2 Installation Condition for Ubuntu
The installation step for Catch2 now correctly targets "ubuntu-22.04", which aligns with the updated OS matrix. Please verify that version "3.6.0" of Catch2 meets the project’s requirements on this platform.

tests/CMakeLists.txt (2)

5-11: Inclusion of New Finite Automata Headers
The update extends the SOURCES_LOG_SURGEON list with multiple TDFA-related header files (e.g., Capture.hpp, DeterminizationConfiguration.hpp, Dfa.hpp, etc.). This supports the recent enhancements in the determinization process. Ensure that the file paths and dependencies remain consistent with the updated project structure.


35-43: Addition of New Test Files
New test files, including test-capture.cpp and test-dfa.cpp, have been added to SOURCES_TESTS to boost coverage of the new features. This is a positive step; consider verifying that these tests cover both standard and edge case scenarios for TDFA functionality.

lint-tasks.yml (2)

29-32: Expanded Linting Coverage to Include Test Sources
The inclusion of additional test file patterns (for .cpp, .h, .hpp, .tpp files under the tests directory) ensures that linting is applied uniformly across all source files. This comprehensive approach helps in maintaining code quality throughout the project.


69-73: Exclusion of Build Directories from Formatting Tasks
The added exclusion pattern (! -path "*/*build*/*") in the cpp task prevents the formatting of auto-generated or intermediary build files. This focused approach on source files is beneficial and reduces unnecessary processing.

CMakeLists.txt (1)

64-116: Updated Finite Automata Source Files Inclusion
The SOURCE_FILES list now incorporates several new TDFA-related files such as Capture.hpp, DeterminizationConfiguration.hpp, and UniqueIdGenerator.hpp. However, there is a potential inconsistency in that files like PrefixTree.cpp, RegexAST.hpp, and various DFA/NFA headers are being included while previous discussions indicated their removal. Please verify that the set of files accurately reflects the intended new architecture for the finite automata module.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (10)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (3)

69-69: Add return type to documentation.

The documentation for get_tag_history and get_tag_lookahead methods is missing the return type.

Add @return documentation to clarify the return type and its meaning.

Also applies to: 79-79


130-130: Consider using a more efficient loop condition.

The loop condition false == unexplored_stack.empty() is less readable than !unexplored_stack.empty() and doesn't provide any performance benefits.

-    while (false == unexplored_stack.empty()) {
+    while (!unexplored_stack.empty()) {

133-133: Consider using insert return value.

The contains check followed by insert can be replaced with a single insert operation, as insert returns a pair indicating whether the insertion took place.

-        if (false == reachable_set.contains(current_configuration)) {
-            current_configuration.update_reachable_configs(unexplored_stack);
-            reachable_set.insert(current_configuration);
-        }
+        if (auto [it, inserted] = reachable_set.insert(current_configuration); inserted) {
+            current_configuration.update_reachable_configs(unexplored_stack);
+        }
tests/test-nfa.cpp (1)

37-38: Remove or update outdated comment.

The comment about capture order and tags in brackets appears to be unrelated to the test case, as the test doesn't involve any capture groups.

-    // capture order(tags in brackets): letter1(0,1), letter2(2,3), letter(4,5), containerID(6,7)
tests/test-lexer.cpp (2)

171-171: Add test case for register values.

The TODO comment indicates missing test coverage for register values.

Would you like me to help implement the test case for register values? This would improve test coverage and help validate the simulation functionality.


174-180: Consider using string_view for map values.

The serialize_id_symbol_map function could be more efficient by using string_view for map values since we're not modifying the strings.

-auto serialize_id_symbol_map(unordered_map<symbol_id_t, string> const& map) -> string {
+auto serialize_id_symbol_map(unordered_map<symbol_id_t, string_view> const& map) -> string {
src/log_surgeon/finite_automata/Dfa.hpp (4)

264-264: Fix typo in comment.

"irrelevent" should be "irrelevant".

-                // If the NFA state sets the tag then the current register is irrelevent
+                // If the NFA state sets the tag then the current register is irrelevant

211-211: Use preferred comparison style.

According to the coding guidelines, prefer !<expression> over false == <expression>.

-    while (false == unexplored_sets.empty()) {
+    while (!unexplored_sets.empty()) {

482-482: Track UTF-8 support requirement.

The TODO comment indicates missing UTF-8 support. This should be tracked to ensure proper handling of multi-byte characters.

Would you like me to create an issue to track the implementation of UTF-8 support?


464-466: Reserve space for containers to improve performance.

Pre-reserve space for visited_states and visited_order to avoid reallocations during traversal.

The code already reserves space for these containers on lines 467-468, which is good practice!

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01a26e5 and e5df444.

📒 Files selected for processing (7)
  • CMakeLists.txt (3 hunks)
  • src/log_surgeon/Lexer.hpp (4 hunks)
  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (3 hunks)
  • tests/test-dfa.cpp (1 hunks)
  • tests/test-lexer.cpp (6 hunks)
  • tests/test-nfa.cpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/log_surgeon/Lexer.hpp
  • CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: - Prefer `false ==

**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • tests/test-nfa.cpp
  • tests/test-dfa.cpp
  • tests/test-lexer.cpp
  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
  • src/log_surgeon/finite_automata/Dfa.hpp
🪛 GitHub Actions: lint
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp

[error] 29-29: code should be clang-formatted [-Wclang-format-violations]

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
  • GitHub Check: build (macos-latest, Release)
🔇 Additional comments (3)
tests/test-nfa.cpp (1)

8-11: LGTM! Header organization is clean and follows best practices.

The headers are well-organized, with system headers first, followed by third-party headers, and then project headers.

src/log_surgeon/finite_automata/Dfa.hpp (2)

1-35: LGTM! Well-organized includes and class declaration.

The includes are logically organized, and the class declaration follows good C++ practices with proper include guards and template parameters.


36-59: LGTM! Well-documented public interface.

The constructor and public methods are properly marked with appropriate specifiers and have clear documentation.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1)

34-45: Consider simplifying the comparison operator by leveraging a tuple.

You could use tuple-based comparison to eliminate multiple conditional checks, improving maintainability and clarity. For example:

 auto operator<(DetermizationConfiguration const& rhs) const -> bool {
-    if (m_nfa_state->get_id() != rhs.m_nfa_state->get_id()) {
-        return m_nfa_state->get_id() < rhs.m_nfa_state->get_id();
-    }
-    if (m_tag_id_to_reg_ids != rhs.m_tag_id_to_reg_ids) {
-        return m_tag_id_to_reg_ids < rhs.m_tag_id_to_reg_ids;
-    }
-    if (m_history != rhs.m_history) {
-        return m_history < rhs.m_history;
-    }
-    return m_lookahead < rhs.m_lookahead;
+    return std::tie(m_nfa_state->get_id(), m_tag_id_to_reg_ids, m_history, m_lookahead)
+         < std::tie(rhs.m_nfa_state->get_id(), rhs.m_tag_id_to_reg_ids, rhs.m_history, rhs.m_lookahead);
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5df444 and 014304e.

📒 Files selected for processing (2)
  • CMakeLists.txt (2 hunks)
  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: - Prefer `false ==

**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
  • GitHub Check: build (macos-latest, Release)
🔇 Additional comments (4)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (3)

69-71: Add an explicit return type to align with style consistency.

A past review flagged that an explicit return type is missing here. Changing it to either void or auto -> void ensures consistency with the rest of your code base and clarifies the method’s intent.

-auto set_reg_id(tag_id_t const tag_id, register_id_t const reg_id) {
+auto set_reg_id(tag_id_t const tag_id, register_id_t const reg_id) -> void {
     m_tag_id_to_reg_ids[tag_id] = reg_id;
 }

107-122: Verification of spontaneous transitions logic.

Your approach of pushing the parent configuration onto the stack after applying each sequence of tag operations seems logically correct. Verify that all spontaneously reachable states (with the associated tag operations) are accounted for, especially if the NFA contains multiple concurrent spontaneous transitions.


130-133: Adherence to the coding guideline on Boolean expressions.

Using “false == ” aligns with the project guideline. This usage improves clarity over the “!” operator. Nicely done.

CMakeLists.txt (1)

71-87: Updated finite automata files successfully integrated.

The newly added lines properly reference the relevant finite automata headers (e.g., DeterminizationConfiguration.hpp) and the UniqueIdGenerator.hpp. This inclusion ensures the CMake build process incorporates all necessary sources.

Also applies to: 115-115

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
src/log_surgeon/Lexer.tpp (1)

172-172: 🛠️ Refactor suggestion

Address code duplication in scan_with_wildcard.

The TODO comment acknowledges code duplication between scan() and scan_with_wildcard(). Consider extracting common logic into a shared method.

🧹 Nitpick comments (8)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1)

16-17: Consider standardizing the class name with the file name.
The file is named "DeterminizationConfiguration.hpp," but the class is named "DetermizationConfiguration." This discrepancy may cause confusion.

src/log_surgeon/LexicalRule.hpp (1)

28-30: Verify m_regex usage for null safety.
Returning a reference to the captures vector is acceptable if m_regex is guaranteed to be non-null. Otherwise, consider an assertion or check to avoid dereferencing a null pointer.

tests/test-lexer.cpp (1)

338-398: Consider adding edge case tests.

The test cases cover basic functionality but could benefit from additional edge cases such as:

  • Empty input
  • Maximum length input
  • Invalid capture group names
  • Nested capture groups
src/log_surgeon/Lexer.tpp (1)

386-388: Enhance error message for duplicate capture names.

The error message could be more descriptive by including the conflicting capture names.

-                throw std::invalid_argument("`m_rules` contains capture names that are not unique."
-                );
+                throw std::invalid_argument(fmt::format(
+                    "`m_rules` contains duplicate capture name: '{}'", capture_name));
src/log_surgeon/finite_automata/Dfa.hpp (1)

31-32: Enhance class template documentation.

The class template documentation should explain the purpose of both template parameters and their relationship.

src/log_surgeon/finite_automata/RegexAST.hpp (2)

652-657: Enhance error messages for null pointer validation.

The error messages for null pointer validation could be more descriptive.

-            throw std::invalid_argument("Group regex AST cannot be null");
+            throw std::invalid_argument("RegexASTCapture: capture_regex_ast cannot be null");
-            throw std::invalid_argument("Capture cannot be null");
+            throw std::invalid_argument("RegexASTCapture: capture pointer cannot be null");

909-940: Move documentation to a separate file.

As suggested by the TODO comment, consider moving the detailed documentation to a separate file for better maintainability.

CMakeLists.txt (1)

64-117: Source files restructuring aligns well with TDFA implementation.

The changes to SOURCE_FILES reflect a comprehensive restructuring to support TDFA determinization:

  • Added new files for TDFA-specific functionality (Capture.hpp, DeterminizationConfiguration.hpp)
  • Added support files for register operations and transitions
  • Removed older tag-based implementation files

However, consider the following suggestions:

  1. Group related files together in the SOURCE_FILES list for better maintainability
  2. Consider adding comments to separate different components (e.g., core files, finite automata files, etc.)

Here's a suggested organization:

 set(SOURCE_FILES
     # Core files
     src/log_surgeon/Aliases.hpp
     src/log_surgeon/Buffer.hpp
     src/log_surgeon/BufferParser.cpp
     src/log_surgeon/BufferParser.hpp
     src/log_surgeon/Constants.hpp
     src/log_surgeon/FileReader.cpp
     src/log_surgeon/FileReader.hpp
     
     # Finite Automata
     src/log_surgeon/finite_automata/Capture.hpp
     src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
     src/log_surgeon/finite_automata/Dfa.hpp
     # ... rest of finite automata files ...
     
     # Parsing and Lexing
     src/log_surgeon/Lalr1Parser.cpp
     src/log_surgeon/Lalr1Parser.hpp
     # ... rest of parsing files ...
     )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 014304e and 95d29f6.

📒 Files selected for processing (15)
  • CMakeLists.txt (2 hunks)
  • src/log_surgeon/Aliases.hpp (1 hunks)
  • src/log_surgeon/Lexer.hpp (5 hunks)
  • src/log_surgeon/Lexer.tpp (10 hunks)
  • src/log_surgeon/LexicalRule.hpp (3 hunks)
  • src/log_surgeon/UniqueIdGenerator.hpp (1 hunks)
  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (3 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (6 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (5 hunks)
  • src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (20 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)
  • tests/CMakeLists.txt (2 hunks)
  • tests/test-lexer.cpp (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/log_surgeon/finite_automata/PrefixTree.hpp
  • tests/CMakeLists.txt
  • src/log_surgeon/UniqueIdGenerator.hpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: - Prefer `false ==

**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/Aliases.hpp
  • src/log_surgeon/LexicalRule.hpp
  • tests/test-lexer.cpp
  • src/log_surgeon/finite_automata/TaggedTransition.hpp
  • src/log_surgeon/finite_automata/NfaState.hpp
  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
  • src/log_surgeon/finite_automata/Nfa.hpp
  • src/log_surgeon/Lexer.hpp
  • src/log_surgeon/finite_automata/Dfa.hpp
  • src/log_surgeon/finite_automata/RegexAST.hpp
🧠 Learnings (4)
src/log_surgeon/finite_automata/TaggedTransition.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/Nfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (24)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (3)

69-71: Add the missing return type for set_reg_id.
This was previously noted in a prior review. Please ensure explicit return types (e.g., -> void) for better clarity and consistency.


107-122: Implementation for update_reachable_configs looks well-structured.
This logic clearly handles spontaneous transitions to produce new configurations. No issues spotted.


130-136: Commendation on adhering to the style guideline.
Using while (false == unexplored_stack.empty()) and if (false == reachable_set.contains(...)) aligns with the project’s preference of false == <expression> over !<expression>. Good work.

src/log_surgeon/Aliases.hpp (1)

1-14: Clear and concise type alias definitions.
Defining descriptive aliases (capture_id_t, reg_id_t, rule_id_t, tag_id_t) improves readability and type safety across the codebase.

src/log_surgeon/LexicalRule.hpp (1)

46-46: Acceptance state creation looks correct.
Using nfa->new_accepting_state(m_variable_id) simplifies creation of the end state and ensures predictable handling of acceptance logic.

src/log_surgeon/finite_automata/TaggedTransition.hpp (2)

21-45: LGTM! The transition to tag IDs improves type safety.

The refactoring from Tag pointers to tag IDs eliminates potential null pointer issues and improves type safety.


52-76: LGTM! The transition to tag IDs improves type safety.

The refactoring from Tag pointers to tag IDs eliminates potential null pointer issues and improves type safety. The use of fmt::join for formatting multiple tag IDs is a clean approach.

src/log_surgeon/finite_automata/NfaState.hpp (4)

34-40: LGTM! The use of contains and false == <expression> improves readability.

The changes align with coding guidelines and modern C++ practices.


102-105: LGTM! The getter correctly returns a const reference.

The method follows best practices by returning a const reference to avoid unnecessary copies of the vector.


136-144: LGTM! The helper method efficiently manages capture tag pairs.

The method follows best practices by using contains, emplace, and aligning with coding guidelines.


163-181: LGTM! The method efficiently handles negative captures.

The method follows best practices by using move semantics and correctly managing tag pairs.

src/log_surgeon/Lexer.hpp (2)

139-145: LGTM! The getter correctly handles missing rules.

The method follows best practices by using std::optional for error handling and contains for map lookup.


179-198: LGTM! The getter correctly handles all error cases.

The method follows best practices by:

  • Using std::optional for error handling
  • Following coding guidelines with false == <expression>
  • Properly validating both start and end register IDs
src/log_surgeon/finite_automata/Nfa.hpp (4)

136-144: LGTM! The helper method efficiently manages capture tag pairs.

The method follows best practices by using contains, emplace, and aligning with coding guidelines.


163-181: LGTM! The method efficiently handles negative captures.

The method follows best practices by using move semantics and correctly managing tag pairs.


184-199: LGTM! The method correctly manages capture states.

The method follows best practices by:

  • Efficiently extracting tag pairs
  • Creating appropriate start and end states
  • Using spontaneous transitions for tag operations

235-252: LGTM! The method correctly handles serialization failures.

The method follows best practices by:

  • Using std::optional for error handling
  • Following coding guidelines with false == <expression>
  • Properly validating serialized states
tests/test-lexer.cpp (2)

67-91: Consider extracting common test setup logic.

A previous review comment suggested extracting common setup logic between test cases. This suggestion is still valid and would help reduce code duplication.


185-185: Address the TODO comment about register value verification.

The TODO comment indicates that register value verification is pending implementation of DFA simulation.

Would you like me to create an issue to track this task?

src/log_surgeon/Lexer.tpp (2)

30-36: Address the TODO comment about m_start_pos handling.

The TODO comment indicates potential issues with m_start_pos handling when it equals old_storage_size / 2.

Would you like me to create an issue to track this task?


406-406: Address the TODO comment about DFA ignoring captures.

The TODO comment indicates that DFA ignores captures in certain scenarios. This needs to be addressed to ensure proper functionality.

Would you like me to create an issue to track this task?

src/log_surgeon/finite_automata/Dfa.hpp (1)

482-482: Address the TODO comment about UTF-8 handling.

The TODO comment indicates that UTF-8 case handling is pending implementation.

Would you like me to create an issue to track this task?

src/log_surgeon/finite_automata/RegexAST.hpp (1)

30-30: Address the TODO comment about class renaming.

The TODO comment suggests renaming RegexAST to RegexASTNode for better clarity.

Would you like me to create an issue to track this task?

CMakeLists.txt (1)

72-90: Verify the completeness of TDFA implementation files.

The new files added for TDFA implementation appear comprehensive, covering:

  • Core TDFA functionality (Dfa.hpp, DfaState.hpp, DfaTransition.hpp)
  • Register handling (Register.hpp, RegisterHandler.hpp, RegisterOperation.hpp)
  • Support for transitions (SpontaneousTransition.hpp)

Let's verify that all necessary files are present:

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Check for any potential missing TDFA-related files
# Look for references to TDFA in source files that might indicate missing implementations

# Search for TDFA-related types or functions that might need implementation files
rg -g '*.{cpp,hpp,tpp}' -i 'tdfa|tagged.*dfa' 

# Search for TODO comments related to TDFA
rg -g '*.{cpp,hpp,tpp}' -i 'todo.*tdfa|tdfa.*todo'

Length of output: 514


TDFA File Completeness Verified

The search for TDFA-related identifiers and TODO comments shows that all expected implementation files are present, and the test cases in tests/test-dfa.cpp confirm both tagged and untagged DFA functionality. Based on these findings, the TDFA implementation appears complete.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/log_surgeon/finite_automata/Dfa.hpp (2)

61-67: Typographical note in doc comment.

There is a minor spelling issue in this doc comment: “@oaram” should be “@param”.

- * @oaram nfa The NFA used to generate the DFA.
+ * @param nfa The NFA used to generate the DFA.

195-229: TODO note: UTF-8 is unhandled.

You have a TODO for handling multi-byte transitions. Would you like assistance drafting a plan or partial implementation?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af8bc5e and 14aac7f.

📒 Files selected for processing (3)
  • src/log_surgeon/Lexer.tpp (1 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (2 hunks)
  • tests/test-dfa.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/log_surgeon/Lexer.tpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • tests/test-dfa.cpp
  • src/log_surgeon/finite_automata/Dfa.hpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (15)
tests/test-dfa.cpp (2)

118-164: Looks good, but consider extracting the line-by-line comparison logic.

Your approach for verifying the DFA serialization is correct. However, you've duplicated the line-by-line comparison code from the previous test blocks. It might be beneficial to extract it into a helper method (e.g., validate_serialized_dfa) to simplify maintenance and make your test code more concise.


165-215: Duplicated line-by-line comparison logic.

This repeats the same pattern of line-by-line validation seen in other tests. Consider factoring out the repeated logic into a helper function for improved clarity and maintainability.

src/log_surgeon/finite_automata/Dfa.hpp (13)

4-16: No issues with the new includes.

All added headers appear necessary for the newly introduced functionalities (e.g., queue, unordered_set).


22-27: Imported classes and functions look properly referenced.

No concerns with these newly added imports, as they align with the usage in this file.


33-34: Type alias enhances readability.

Using ConfigurationSet simplifies references to determinization configurations.


56-59: Method seems fine; straightforward accessor.

get_tag_id_to_final_reg_id() aligns well with your new data design, exposing needed mappings.


83-95: Mapping logic is correct but can be slow for large sets.

This function systematically checks each configuration in lhs against rhs. For extremely large sets, you might consider more efficient lookups.


96-119: Comprehensive state creation logic.

create_or_get_dfa_state cleanly handles the detection of an existing config or insertion of a new one. No immediate concerns.


120-130: No concerns with transitions retrieval.

Generating transitions based on each ASCII value is straightforward and consistent with the rest of the design.


136-152: Register operation assignment looks well-organized.

Multiple tag operations (Set, Negate) are handled neatly, ensuring consistent updates to the closure.


153-166: Reassigning transition register ops is correct.

Mapping old registers to new ones with fallback copy_operation is a robust approach for convergence.


167-178: New state creation is coherent.

A proper flow: if the NFA state is accepting, we add matching IDs and final register ops. This matches the design of the tag-based approach.


186-187: New map is appropriate for final register references.

m_tag_id_to_final_reg_id clarifies the relationship between tag and final register, promoting better modularity in the code.


190-193: Constructor consistently seeds generation.

Building the DFA in the constructor ensures any usage of the object has a fully ready state.


68-82:

❓ Verification inconclusive

Potential memory usage concern for large tag counts.

initialize_registers adds registers for each tag without bounds checking. If num_tags can be very large, this may result in excessive memory usage.

You might want to confirm the maximum expected number of tags. If needed, consider adding a guard similar to a “maximum allowed tags” check:


🏁 Script executed:

#!/bin/bash
# Searching for references to call sites of `initialize_registers` 
# to see if there's a maximum or safe usage pattern in place.
rg --context 5 'initialize_registers'

Length of output: 2704


Memory Usage Concern: Verify Maximum Tag Count

The search confirms that all calls to initialize_registers (e.g. via nfa.get_num_tags()) happen without any bounds checking. This means that if the number of tags is unexpectedly large, the unguarded allocation of registers could lead to excessive memory usage. Please verify the maximum expected number of tags in your NFA. If there is any risk of unreasonably high values, consider implementing a guard (a “maximum allowed tags” check) to mitigate potential issues.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (9)
src/log_surgeon/finite_automata/Dfa.hpp (7)

33-33: Consider clarifying ownership for DFA states and tag IDs.

The alias ConfigurationSet and the newly introduced members m_tag_id_to_final_reg_id and m_reg_handler effectively manage tag information and states. However, it might be beneficial to add brief clarifications (e.g., comments) on how ownership is handled (via std::unique_ptr) and how lifetime ties into the map of configuration sets to TypedDfaState*. This will help new contributors quickly understand the relationship between m_states, dfa_states, and tag ownership.

Also applies to: 186-187, 191-193


60-66: Extend UTF-8 handling and test coverage.

The generate method includes a TODO at line 195 referencing future UTF-8 support. Since multi-byte character handling is often a critical requirement, consider creating a dedicated strategy for transitioning from single-byte to multi-byte logic. Adding tests in preparation for UTF-8 transitions will help ensure reliability.

Also applies to: 195-229


83-95: Watch for potential O(n²) overhead in try_get_mapping.

The nested loops over lhs and rhs can lead to quadratic complexity. Though this might be acceptable for smaller NFAs, it could become a performance bottleneck for large ones. If scaling is a concern, consider more efficient set comparisons or hashing of configurations.

Also applies to: 245-288


96-119: Prevent confusion when returning mapped states.

The create_or_get_dfa_state method can return either an existing DFA state, a potentially mapped DFA state, or a newly created one. Adding more explicit in-code comments or separate result types (e.g., an enum variant) can reduce ambiguity for future maintainers.

Also applies to: 290-308


120-135: Consider factoring out logic for byte transitions.

The get_transitions method processes transitions for all 256 byte values. For clarity and maintainability, factor out the repeated logic into smaller helper functions/methods, especially if multi-byte support is planned.

Also applies to: 309-342


153-165: Re-check newly added copy operations.

reassign_transition_reg_ops creates copy operations if an old register does not map existing ops to the new register. Double-check any potential for register duplication or unintentional overwriting of data. Additional debug logs or unit tests might catch hidden bugs.

Also applies to: 384-405


167-178: Simplify new_state logic for readability.

The new_state function handles final register assignment, tag copying, and accepting condition checks in a single block. Breaking it into smaller subroutines (e.g., one for final operations, one for acceptance logic) could improve readability and reduce complexity.

Also applies to: 407-441

src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (2)

17-32: Validate NFA state pointer in constructor.

Throwing an exception if nfa_state is null ensures safety. Documenting in the class-level Doxygen that this constructor expects a non-null pointer could make the contract clearer for future users.


47-59: Maintain consistent naming in child_configuration_with_new_state_and_tag.

The method's name distinctly conveys purpose; however, for brevity, consider naming it child_with_state_and_tag or similar. Otherwise, everything is well-structured and easy to follow.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14aac7f and 3c8c208.

📒 Files selected for processing (2)
  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/finite_automata/Dfa.hpp
  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
🧠 Learnings (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
  • GitHub Check: build (macos-latest, Release)
  • GitHub Check: build (macos-latest, Debug)
🔇 Additional comments (7)
src/log_surgeon/finite_automata/Dfa.hpp (1)

136-152: Confirm no overlap in simultaneous tag operations.

When assign_transition_reg_ops merges multiple TagOperations, ensure there's no scenario where conflicting operations on the same register occur within the same transition. Adding unit tests for edge cases involving repeated or rapid tag toggles (set/negate) could improve confidence.

Also applies to: 344-382

src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (6)

1-16: Header guard and includes look correct.

You have properly defined the include guard and minimal includes for containers and standard library features. This setup is concise and follows best practices.


34-45: Well-defined ordering operator.

The operator< properly orders configurations by state ID, register mappings, and then history/lookahead. This consistent ordering is vital for using DeterminizationConfiguration in sorted containers. Great job ensuring a deterministic comparison sequence.


61-66: Verify correctness of spontaneous transitions.

update_reachable_configs and spontaneous_closure work together to capture all configurations reachable via zero or more spontaneous transitions. For large NFAs, you may want to ensure no repeated expansions or infinite loops occur if cycles are present. Additional cycle detection or upper-bound checks might be beneficial.

Also applies to: 68-73, 112-127, 129-144


74-77: Tag manipulation and retrieval appear logical.

The methods set_reg_id, get_state, get_tag_history, and get_lookahead provide straightforward ways to update and query tag operations. Just ensure that changes to one tag’s state do not accidentally interfere with the stored m_history or m_lookahead for another tag.

Also applies to: 78-82, 84-91, 93-103


105-110: Private members are neatly organized.

Keeping m_nfa_state, m_tag_id_to_reg_ids, m_history, and m_lookahead together at the end clarifies the data blueprint for each configuration. No changes needed here.


147-148: Header end comment.

Consistently ending the file with a matching #endif comment is a helpful practice. Thank you for maintaining clarity.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some high-level comments of DeterminizationConfiguration.hpp. Some of these comments should be also applied across the entire PR. (NTS: This file is not completely reviewed yet)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (4)

17-17: Consider adding a class-level docstring.
A brief explanation of the purpose and usage of this class helps future maintainers and new contributors comprehend how determinization configurations fit into the larger TDFA logic.


19-32: Add parameter documentation for the constructor.
You might want to use doxygen-style tags (e.g., @param) for each parameter, describing how they are used (e.g., nfa_state, tag_to_reg_ids, tag_history, tag_lookahead).


34-45: Use structured bindings or tuple comparison for operator<.
Rather than performing multiple if checks, consider using a single tuple-based lexicographical comparison for cleaner and more concise code, such as:

-auto operator<(DeterminizationConfiguration const& rhs) const -> bool {
-    if (m_nfa_state->get_id() != rhs.m_nfa_state->get_id()) {
-        return m_nfa_state->get_id() < rhs.m_nfa_state->get_id();
-    }
-    if (m_tag_id_to_reg_ids != rhs.m_tag_id_to_reg_ids) {
-        return m_tag_id_to_reg_ids < rhs.m_tag_id_to_reg_ids;
-    }
-    if (m_history != rhs.m_history) {
-        return m_history < rhs.m_history;
-    }
-    return m_lookahead < rhs.m_lookahead;
+auto operator<(DeterminizationConfiguration const& rhs) const -> bool {
+    return std::tie(m_nfa_state->get_id(),
+                    m_tag_id_to_reg_ids,
+                    m_history,
+                    m_lookahead)
+         < std::tie(rhs.m_nfa_state->get_id(),
+                    rhs.m_tag_id_to_reg_ids,
+                    rhs.m_history,
+                    rhs.m_lookahead);
 }

61-64: Use standard Doxygen tags.
The annotation @unexplored_stack is non-standard. Use @param or another doxygen-compatible tag to describe parameters clearly, for example:

/**
 * @param unexplored_stack   The stack of configurations updated via spontaneous transitions.
 */
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3c8c208 and 7067112.

📒 Files selected for processing (1)
  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
🧬 Code Definitions (1)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)
  • nodiscard (56-58)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build (ubuntu-22.04, Debug)
  • GitHub Check: build (ubuntu-22.04, Release)
🔇 Additional comments (3)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (3)

74-76: Avoid operator[] on std::map if unintentional insertions are possible.
Previously, a reviewer (LinZhihao-723) advised against using [] for map-like structure access, as it can insert new keys inadvertently. Use insert_or_assign or explicit checks if you want to safeguard against unexpected insertions.


112-127: Ensure pointer validity in update_reachable_configs.
Because m_nfa_state is stored as a raw pointer, confirm that it remains valid throughout spontaneous transition exploration. If memory is released elsewhere, consider using a smart pointer (like std::shared_ptr) to prevent dangling references.


129-144: Confirm that no infinite loops occur in spontaneous_closure.
The logic appears to prevent infinite loops by relying on reachable_set.contains(current_configuration), but consider adding tests for cyclical transitions in the NFA to verify this behaviour.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test-dfa.cpp (1)

148-181: Comprehensive test for Complex Tagged DFA.

This test case demonstrates more advanced functionality of tagged DFAs with multiple nested capture groups. The variety of operations in the expected output (push, copy, nested operations) provides thorough coverage of the TDFA implementation.

Consider adding a brief comment explaining what the register operations notation means (e.g., what 16p,17p and 8c16,9c19 represent) to make the test more understandable for future developers.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7067112 and 1429996.

📒 Files selected for processing (1)
  • tests/test-dfa.cpp (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • tests/test-dfa.cpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (macos-latest, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (4)
tests/test-dfa.cpp (4)

28-58: Well-implemented helper function for DFA comparison.

The compare_serialized_dfa function is well-structured and documented. It nicely encapsulates the common validation logic that was previously duplicated across test cases, addressing a previous review comment that requested this extraction. The function handles error reporting effectively by using CAPTURE for both serialized strings, which will be helpful for debugging any failures.


87-87: Good refactoring to use the helper function.

Replacing the duplicated validation code with a call to the helper function improves maintainability.


114-115: Effective refactoring of test validation.

Good job replacing another instance of the duplicated validation code with the helper function call.


117-146: Well-structured test for Simple Tagged DFA.

This new test case effectively validates the tagged DFA functionality for a simple pattern with a single tagged group (uID). The test creates an appropriate schema, processes it into an NFA and DFA, and then verifies the expected output format including the register operations (e.g., 4p for push operations and 2c4 for copy operations).

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (9)
tests/test-lexer.cpp (1)

211-211: Potential duplication in test setup.

This line, extracting schema_var_ast_ptr, appears in multiple tests. Consider consolidating common test logic into a shared helper function if it becomes repetitive.

src/log_surgeon/finite_automata/Dfa.hpp (4)

66-66: Superset determinization documentation.

Clearly describes the algorithm’s purpose. Keep parameters (@oaram) typo-free in doc comments.

- * @oaram nfa The NFA used to generate the DFA.
+ * @param nfa The NFA used to generate the DFA.

76-81: Register initialization logic.

Reserving two registers per tag—one for initial tracking, one for final—supports set/copy operations effectively. Consider adding an upper bound check for num_tags if resource constraints apply.


132-144: Transition assembly.

get_transitions collects transitions for each input byte. This design is flexible and keeps logic contained. Watch performance if states or byte transitions are large.


312-344: Byte-based transitions.

get_transitions systematically enumerates possible inputs (0..255). If UTF-8 expansions are needed, consider mindful design to handle multiple bytes.

src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (4)

16-18: Consider adding class-level documentation.
This class appears central to TDFA determinization but lacks a brief description, references to relevant automata concepts, or usage details in its docstring. Adding an overview would help maintainers and new contributors understand the overall purpose of the class.


29-31: Use a smart pointer for m_nfa_state where feasible.
Although you verify that m_nfa_state is non-null, the raw pointer usage could lead to lifetime mismanagement if the underlying TypedNfaState is destroyed elsewhere prematurely. A smart pointer (e.g., std::shared_ptr or std::unique_ptr) could help ensure safe and transparent ownership semantics.


84-91: Consider using a standard algorithm for searching tag history.
Currently, this loop searches through the m_history vector. For clarity and conciseness, you could use std::find_if with a lambda that checks tag_op.get_tag_id() == tag_id. This can slightly improve readability.


116-124: Avoid unnecessary object copies by iterating with references.
Here, for (auto const tag_op : nfa_spontaneous_transition.get_tag_ops()) copies each element. If TagOperation objects are sizable, consider iterating as for (auto const& tag_op : ...) to reduce copy overhead.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1429996 and 5eba1f3.

📒 Files selected for processing (8)
  • CMakeLists.txt (1 hunks)
  • src/log_surgeon/Lexer.hpp (5 hunks)
  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (3 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (3 hunks)
  • tests/CMakeLists.txt (1 hunks)
  • tests/test-dfa.cpp (4 hunks)
  • tests/test-lexer.cpp (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • CMakeLists.txt
  • tests/CMakeLists.txt
  • src/log_surgeon/Lexer.hpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • tests/test-lexer.cpp
  • src/log_surgeon/finite_automata/NfaState.hpp
  • tests/test-dfa.cpp
  • src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp
  • src/log_surgeon/finite_automata/Dfa.hpp
🧠 Learnings (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2025-03-28T16:38:46.994Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🧬 Code Definitions (4)
tests/test-lexer.cpp (3)
src/log_surgeon/finite_automata/RegexAST.hpp (16)
  • nodiscard (87-89)
  • nodiscard (134-152)
  • nodiscard (171-173)
  • nodiscard (205-207)
  • nodiscard (239-239)
  • nodiscard (256-258)
  • nodiscard (292-292)
  • nodiscard (294-294)
  • nodiscard (328-330)
  • nodiscard (406-406)
  • nodiscard (408-408)
  • nodiscard (410-410)
  • nodiscard (452-454)
  • nodiscard (487-487)
  • nodiscard (489-489)
  • nodiscard (515-517)
src/log_surgeon/Schema.hpp (1)
  • var_schema (29-29)
src/log_surgeon/LogParser.hpp (1)
  • schema_ast (161-161)
tests/test-dfa.cpp (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)
  • nfa (66-66)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (2)
src/log_surgeon/finite_automata/Dfa.hpp (2)
  • nodiscard (43-43)
  • nodiscard (56-58)
src/log_surgeon/finite_automata/NfaState.hpp (6)
  • nodiscard (95-95)
  • nodiscard (97-97)
  • nodiscard (99-101)
  • nodiscard (103-106)
  • nodiscard (108-110)
  • nodiscard (112-112)
src/log_surgeon/finite_automata/Dfa.hpp (5)
src/log_surgeon/finite_automata/DfaState.hpp (3)
  • nodiscard (42-44)
  • nodiscard (46-48)
  • nodiscard (84-87)
src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (4)
  • tag_id (74-76)
  • tag_id (74-74)
  • tag_id (84-84)
  • tag_id (95-96)
src/log_surgeon/finite_automata/Nfa.hpp (2)
  • new_state (145-148)
  • new_state (145-145)
src/log_surgeon/finite_automata/RegisterHandler.hpp (5)
  • reg_id (45-47)
  • reg_id (45-45)
  • reg_id (53-56)
  • reg_id (53-53)
  • reg_id (58-59)
src/log_surgeon/finite_automata/RegisterOperation.hpp (6)
  • reg_id (27-29)
  • reg_id (27-27)
  • reg_id (31-33)
  • reg_id (31-31)
  • reg_id (40-40)
  • reg_id (40-40)
🔇 Additional comments (40)
tests/test-lexer.cpp (6)

90-90: Good use of [[nodiscard]] for serialize_id_symbol_map.

This addition helps avoid ignoring the returned string and reduces the risk of logical errors.


98-99: Check for nullptr at capture_rule_ast usage.

Although there is a REQUIRE(capture_rule_ast != nullptr); later, ensure that no intermediate steps dereference this pointer beforehand.


175-175: Safe map access.

Using .at() on lexer.m_id_symbol is consistent and will throw an exception on out-of-range access, which is safer than the subscript operator.


185-185: Consistent token end check.

Good usage of .at() again. This helps ensure that the retrieval of the token end symbol is safe.


229-229: Repeated variable extraction.

Similar code was seen at line 211. A shared setup helper could tighten maintainability and reduce duplication.


262-264: Clear range verification.

Verifying that the regex group covers '0' through '9' ensures correct numeric capture. This logic is well-defined.

src/log_surgeon/finite_automata/NfaState.hpp (3)

75-75: Safer byte transition handling.

Using .at(byte) instead of the subscript operator prevents out-of-bounds array access. This helps to catch invalid byte values at runtime.


109-109: Consistent safe indexing.

Again, replacing the subscript operator with .at() prevents potential undefined behaviour when the index is out of range.


182-182: Improved exception safety on transitions.

Using .at(idx) for m_bytes_transitions is good practice. Continue to ensure index bounds are validated or that the usage context guarantees valid indices.

tests/test-dfa.cpp (5)

28-40: Helpful DFA comparison helper introduced.

The compare_serialized_dfa function lowers code duplication and centralizes the validation of DFA serialization. Good addition.


89-89: Appropriate usage of compare_serialized_dfa.

This call effectively replaces repetitive string comparison logic, improving test clarity.


116-117: Consistent verification pattern.

Reusing the same helper enhances maintainability, ensuring tests follow a uniform approach to string comparison.


119-145: New "Test Simple Tagged DFA" block.

This test case is concise and validates proper handling for a basic tagged DFA. Its coverage of the set and copy operations for tagged states appears correct.


150-183: Robust "Test Complex Tagged DFA" block.

This test thoroughly checks multiple nested tags and composition of transitions. It effectively ensures final states capture correct register operations.

src/log_surgeon/finite_automata/Dfa.hpp (25)

4-5: New inclusions for determinization.

Adding <algorithm> and <cstddef> is reasonable if used within new determinization or register-related logic. Validate build steps to remove unused includes later if necessary.


10-10: Queue-based BFS approach.

Bringing in <queue> is consistent with the BFS logic introduced later in get_bfs_traversal_order().


12-13: Additional includes for strings and maps.

<string> and <unordered_map> are relevant for storing configuration sets, tag mappings, and serialization.


14-14: Include <unordered_set> for BFS tracking.

Used to track visited states in get_bfs_traversal_order(), ensuring no duplicates.


22-22: Determinization integration.

Bringing in DeterminizationConfiguration.hpp is essential for the superset construction of DFA states.


25-27: Register-based expansions.

These includes (RegisterHandler, RegisterOperation, TagOperation) confirm the advanced handling of tag registers in the DFA.


33-33: Type alias for determinization.

using ConfigurationSet = std::set<DeterminizationConfiguration<TypedNfaState>>; clarifies usage and fosters readability.


43-43: Safe root state retrieval.

Accessing m_states.at(0).get() avoids out-of-bounds issues. Verify m_states is guaranteed non-empty upon construction.


56-58: Public mapping retrieval.

Exposing m_tag_id_to_final_reg_id as a map of tag ID to final register ID helps the lexer or other consumers align final states and registers.


93-96: Mapping attempt across configuration sets.

try_get_mapping is well-documented and helps avoid duplicating states. Ensure it accounts for all relevant tag operations.


116-120: Creating DFA state or reusing existing.

create_or_get_dfa_state elegantly prevents repeated creation, aiming for minimal states under the superset approach.


149-154: Tag register operation assignment.

assign_transition_reg_ops systematically sets or negates registers within closure states. This helps ensure correct tag capturing.


164-167: Reassigning register operations.

reassign_transition_reg_ops merges new mapping or appends copy ops. This is crucial to ensure non-diverging states.


169-179: Constructing new DFA states with final register assignment.

new_state handles acceptance correctness by setting or copying relevant registers. This ensures final states accurately track tag data.


193-195: Deferred generation in constructor.

generate(nfa); is neatly separated, keeping constructor short and focusing on the generation step in a dedicated function.


198-231: Core superset determinization logic.

The BFS-like approach with unexplored_sets ensures that every newly discovered configuration set is fully expanded. This is a standard, effective pattern.


234-245: Two-register-per-tag approach.

initialize_registers sets up robust final copying. Optionally add a range check for huge num_tags.


248-290: Configuration equivalence check.

try_get_mapping cross-checks states and register usage. This helps unify states that differ only by register IDs.


293-309: Avoiding duplicative states.

create_or_get_dfa_state checks existing sets before new creation. The register mapping check is consistent with your minimal DFA strategy.


347-384: Transition register assignment.

Handles partial or repeated tag operations well. The final set usage ensures your updated closure is captured for subsequent transitions.


387-407: Convergent mapping to existing register IDs.

reassign_transition_reg_ops ensures we do not inflate states by mismatched registers. A well-targeted approach to unify them.


410-443: Populating acceptance logic.

Copies or sets final registers for accepted states. This robust approach supports advanced tag usage while ensuring correctness.


446-464: Schema intersection method.

get_intersect(...) pairs states from two DFAs, capturing matching variable IDs. This is a beneficial utility for refining schema coverage.


467-497: BFS state order retrieval.

get_bfs_traversal_order is a standard approach to produce consistent state ordering, beneficial for stable serialization.


500-518: Optional serialization.

Returns std::nullopt if any state fails to serialize, ensuring overall correctness. This final step ties together BFS ordering.

src/log_surgeon/finite_automata/DeterminizationConfiguration.hpp (1)

132-144: Good job on implementing a stack-based approach to find all spontaneously reachable configurations.
The logic for exploring and updating possible configurations appears sound and aligns with the project’s coding guidelines (explicit checks with false == ...).

Comment on lines -33 to +45
* Add a delimiters line from the schema to the lexer
* @param delimiters
* Adds a list of delimiter types to the lexer.
* @param delimiters A vector containing the delimiter types to be added.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets mention:

  • it invalidates any existing delimiters
  • delimiters param cannot be empty

Comment on lines +218 to +219
// TODO: determine if we can switch to safely accessing `m_is_delimiter` and
// `m_is_first_char_of_a_variable` with `at()` without impacting performance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets make this a github issue rather than todo so it is easier to track.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants