Rene Saarsoo edited this page Aug 5, 2013 · 10 revisions

In the previous chapter we only took a glance at the parse_doc method. Here we take a deeper look at the Scanner object that is passed to parse_doc.

The Scanner is similar to Ruby's built-in StringScanner: it remembers the position of a scan pointer (a position inside the string we're parsing). Scanning itself is the process of advancing the scan pointer through the string a small step at a time. For this there are two core methods:

  • match(regex) matches a regex starting at the current position of the scan pointer, advances the scan pointer to the end of the match and returns the matched string. When the regex doesn't match, it returns nil and the scan pointer stays where it was.

  • look(regex) does the same, except that it doesn't advance the scan pointer, so its use is to look ahead.
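
The difference between the two methods can be tried out with Ruby's own StringScanner, which the Scanner resembles. This is only a sketch by analogy: it assumes that StringScanner#scan behaves like match and StringScanner#check behaves like look, and the e-mail address in the sample string is a made-up placeholder.

```ruby
require 'strscan'

# StringScanner#scan advances the scan pointer (analogous to Scanner#match);
# StringScanner#check only looks ahead (analogous to Scanner#look).
s = StringScanner.new("<john@example.com> John Doe")  # placeholder input

peeked = s.check(/</)     # look ahead only
pos_after_check = s.pos   # the pointer has not moved

matched = s.scan(/</)     # consume the "<"
pos_after_scan = s.pos    # the pointer has advanced past the match

missed = s.scan(/\d+/)    # no digits at the pointer, so this returns nil
pos_after_miss = s.pos    # a failed match leaves the pointer in place
```

Both calls return the same string when the regex matches; only the position of the scan pointer afterwards differs.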

Let's visualize how scanning works with an example of parsing an @author tag that takes either just a name or an e-mail address followed by a name:

* @author <[email protected]> John Doe
* @author Code Monkey

Here's a parse_doc method for parsing this tag:

def parse_doc(scanner, position)
  if scanner.look(/</)
    scanner.match(/</)
    email = scanner.match(/\w+@\w+(\.\w+)+/)
    scanner.match(/>/)
    scanner.hw
  end
  name = scanner.match(/.*$/)

  return { :tagname => :author, :name => name, :email => email }
end
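
To try this method outside of JSDuck, one option is a small stand-in for the Scanner. The SketchScanner class below is hypothetical and not part of JSDuck: it wraps StringScanner and provides only match, look and hw, under the assumption that hw skips spaces and tabs. The position argument is unused here, so nil is passed, and the e-mail address is a made-up placeholder.

```ruby
require 'strscan'

# Hypothetical stand-in for JSDuck's Scanner, built on StringScanner.
# Only the three methods used by parse_doc are provided.
class SketchScanner
  def initialize(input)
    @ss = StringScanner.new(input)
  end

  # Match at the scan pointer and advance past the match (nil on failure).
  def match(regex)
    @ss.scan(regex)
  end

  # Match at the scan pointer without advancing it.
  def look(regex)
    @ss.check(regex)
  end

  # Skip horizontal whitespace (assumed here to mean spaces and tabs).
  def hw
    @ss.scan(/[ \t]*/)
  end
end

# The parse_doc method from above, unchanged.
def parse_doc(scanner, position)
  if scanner.look(/</)
    scanner.match(/</)
    email = scanner.match(/\w+@\w+(\.\w+)+/)
    scanner.match(/>/)
    scanner.hw
  end
  name = scanner.match(/.*$/)

  return { :tagname => :author, :name => name, :email => email }
end

# The input starts where the scan pointer would be: right after "@author ".
with_email = parse_doc(SketchScanner.new("<john@example.com> John Doe"), nil)
name_only  = parse_doc(SketchScanner.new("Code Monkey"), nil)
```

For the name-only line the if branch is skipped, so email stays nil: Ruby initializes a local variable to nil when its assignment is parsed but never executed.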

Let's step through it while it's parsing the first line of our example code.

Here's the state of the Scanner at the time parse_doc gets called.

                                            # @author |<[email protected]> John Doe

The scan pointer (denoted as |) has stopped at the first non-whitespace character after the name of the tag. At this point we can look ahead to see what's coming. For example, we can check whether we're at the beginning of an e-mail address block:

if scanner.look(/</)                      # @author |<[email protected]> John Doe

If so, we want to extract the e-mail address. But first let's match the < character, which we want to exclude from the e-mail address:

scanner.match(/</)                        # @author <|[email protected]> John Doe

The scan pointer has moved forward one step, so we can now match the e-mail address itself and store it in a variable:

email = scanner.match(/\w+@\w+(\.\w+)+/)  # @author <[email protected]|> John Doe

Then we skip the closing >:

scanner.match(/>/)                        # @author <[email protected]>| John Doe

Let's also skip the whitespace that follows, using the hw method of the Scanner, which skips only horizontal whitespace:

scanner.hw                                # @author <[email protected]> |John Doe

From here on we just want to match the name of the author, which could be anything, so we just match up to the end of a line:

name = scanner.match(/.*$/)              # @author <[email protected]> John Doe|

Finally we return all the extracted values:

  return { :tagname => :author, :name => "John Doe", :email => "[email protected]" }
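
The pointer positions traced above can be reproduced with a short script. This is a sketch that uses StringScanner in place of the Scanner, with /[ \t]*/ standing in for hw (an assumption about what hw matches); the show helper and the e-mail address are made up for illustration.

```ruby
require 'strscan'

# Hypothetical helper: render the input with "|" marking the scan pointer.
# StringScanner#pos is a byte offset, which is fine for this ASCII input.
def show(ss)
  ss.string[0...ss.pos] + "|" + ss.string[ss.pos..-1]
end

ss = StringScanner.new("<john@example.com> John Doe")
trace = [show(ss)]

# The same sequence of matches that parse_doc performs on this line.
[/</, /\w+@\w+(\.\w+)+/, />/, /[ \t]*/, /.*$/].each do |regex|
  ss.scan(regex)
  trace << show(ss)
end

puts trace
```

Each printed line corresponds to one step of the walkthrough, ending with the pointer at the end of the line.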