From ThorxWiki

Thoughts on writing a parser for NEWS

There are two kinds of formatting in NEWS: line-level and block-level. Actually, 'block-level' formatting is itself really two kinds of formatting (indentation blocks like lists, and multi-line blocks like headings and preformatted blocks), but we'll ignore that for now.

It seems convenient to write the parser in several passes. A one-pass parser would be nice, but it might not be very elegant given the inelegancies of the language we're going to parse. The passes should probably be as follows:

  1. Coalesce the input stream into a tree of lists of lines, so that all consecutive lines with the same indentation become part of the same list. This would be a good time to interpret the continuation backslashes too. A line containing nothing but whitespace (including none at all) should be parsed as an empty line at the current indent level.
  2. For each list of lines, scan for consecutive lines matching one of the consecutive-line patterns (i.e. a heading, a definition list, a preformatted block) and turn them into their equivalent objects.
  3. For each resulting line, scan for inline formatting to produce a tree of spans, some formatted, some not. Some line-types cannot contain formatting (say, horizontal rules), so they will ignore the request to parse themselves.
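
The first pass above could be sketched roughly as follows. This is only an illustration under some assumptions that these notes do not settle: indentation is counted in leading spaces only, a trailing backslash joins a line to the next one (dropping the next line's leading whitespace), and the names Block, join_continuations and build_tree are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Block:
    """A list of consecutive lines sharing one indentation level."""
    indent: int
    items: List[Union[str, "Block"]] = field(default_factory=list)


def join_continuations(raw_lines):
    """Merge each line ending in a backslash with the line after it.

    Assumption: the joined line keeps the indentation of its first part.
    """
    merged, pending = [], None
    for line in raw_lines:
        if pending is not None:
            line = pending + line.lstrip()
            pending = None
        if line.endswith("\\"):
            pending = line[:-1]
        else:
            merged.append(line)
    if pending is not None:  # backslash on the very last line
        merged.append(pending)
    return merged


def build_tree(text):
    """Pass 1: group lines into nested Blocks by leading-space count."""
    root = Block(indent=0)
    stack = [root]  # innermost open block is stack[-1]
    for line in join_continuations(text.split("\n")):
        stripped = line.strip()
        if not stripped:
            # Whitespace-only line: an empty line at the current indent.
            stack[-1].items.append("")
            continue
        indent = len(line) - len(line.lstrip(" "))
        # Close every block that is deeper than this line.
        while len(stack) > 1 and indent < stack[-1].indent:
            stack.pop()
        # Open a new child block if this line is deeper than the current one.
        if indent > stack[-1].indent:
            child = Block(indent=indent)
            stack[-1].items.append(child)
            stack.append(child)
        stack[-1].items.append(stripped)
    return root


if __name__ == "__main__":
    sample = "top\n  one\n  two \\\n    more\n\n  three\nend"
    print(build_tree(sample))
```

Passes 2 and 3 would then walk this tree: pass 2 replacing runs of strings inside each Block with block-level objects, and pass 3 asking each resulting line object to parse its own inline spans.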

It occurs to me that the above scheme doesn't allow lists with zero indent unless the entire page is one big long list. I don't know if that's a Good Thing or a Bad Thing.
