[logs] regexless parsing, again?

Mordechai T. Abzug morty at frakir.org
Fri Sep 14 12:14:50 PDT 2007


On Fri, Sep 14, 2007 at 12:45:12PM -0400, Marcus J. Ranum wrote:

> In the case of a lookahead in a parser you're still going to prune
> the search with a single lookahead, whereas with a regex, you've got
> to try N additional regexes no matter what. Unless you order the
> regexes into a parse tree (of sorts) - which I've seen done - but
> that's just putting lipstick on a pig.

One approach I've used to improve the performance of regexes is to OR
all the regexes together.  I.e. instead of trying N regexes, regex1,
regex2, regex3, etc., in sequence, you do:

(regex1{sideeffect1}|regex2{sideeffect2}|regex3{sideeffect3}|etc)

The regex engine compiles the big regex once into a single state
machine, so it's a lot faster than trying N regexes in turn.  You test
each message against the big regex.  The side effect tells you which
sub-regex actually matched, so you then do a big case statement on the
side effect to decide how to handle the data.  [You also want to run
the message against just the matching sub-regex, but that's not a big
deal.]
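
A minimal sketch of the dispatch mechanics in Python, under some
assumptions: the rule names and patterns below are hypothetical, and a
named wrapper group stands in for the {sideeffect} notation above.
Python's re module is a backtracking engine rather than a compiled
state machine, so this shows how to find out which alternative matched,
not the speedup itself.

```python
import re

# Hypothetical rules (name -> pattern); illustrative only.
RULES = {
    "ssh_fail": r"sshd\[\d+\]: Failed password for (\S+) from (\S+)",
    "ssh_ok": r"sshd\[\d+\]: Accepted \w+ for (\S+) from (\S+)",
    "cron": r"CRON\[\d+\]: \((\w+)\) CMD \((.*)\)",
}

# Each rule compiled on its own, used to extract fields after dispatch.
COMPILED = {name: re.compile(pat) for name, pat in RULES.items()}

# One big alternation; the named wrapper group plays the "side effect" role.
BIG = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in RULES.items()))


def classify(line):
    """Return (rule_name, fields), or (None, ()) if no rule matches."""
    m = BIG.search(line)
    if m is None:
        return None, ()
    # The wrapper group closes last, so lastgroup names the alternative.
    name = m.lastgroup
    # Re-run only the matching sub-regex, at the same position, for fields.
    fields = COMPILED[name].match(line, m.start()).groups()
    return name, fields
```

The "big case statement" then becomes a lookup on the returned rule
name, for example a dict mapping names to handler functions.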

I've also implemented a pre-processor that allows token-like
subregexes for things like IPs, zones, mail addresses, etc.
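
One way such a pre-processor might look, as a rough sketch only: the
%NAME% placeholder syntax, the TOKENS table, and the expand() helper
are all assumptions made for illustration, and the token patterns are
deliberately loose.

```python
import re

# Hypothetical token table; the patterns are loose on purpose.
TOKENS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "ZONE": r"(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}",
    "MAIL": r"[\w.+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}",
}


def expand(template):
    """Replace each %NAME% placeholder with its capturing sub-regex."""
    def substitute(m):
        # An unknown token name raises KeyError, which surfaces typos early.
        return "(" + TOKENS[m.group(1)] + ")"
    return re.sub(r"%([A-Z]+)%", substitute, template)
```

A rule can then be written as expand(r"connect from %IP% to %ZONE%")
and compiled as usual; each token becomes one capture group.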

However, as you say, it's still lipstick on a pig.  In particular, if
a message fails to match the big regex, you're going to have to debug
from square one.  Debugging badly-formed regexes becomes harder,
although you can mitigate this with a pre-pass that tests each regex
individually at the start of the run.  And I never even considered
testing for overlapping regexes.
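
A sketch of what such a pre-pass could look like in Python. The
function names and the rules dict are hypothetical. The overlap check
is only an assumption about how one might approach it: it reports
sample lines that more than one rule matches, and proves nothing about
lines it has not seen.

```python
import re


def check_rules(rules):
    """Compile each pattern alone; return {name: error} for the failures."""
    errors = {}
    for name, pat in rules.items():
        try:
            re.compile(pat)
        except re.error as exc:
            errors[name] = str(exc)
    return errors


def overlaps(rules, samples):
    """Return (line, [names]) for each sample line matched by >1 rule."""
    compiled = {name: re.compile(pat) for name, pat in rules.items()}
    hits = []
    for line in samples:
        names = [n for n, rx in compiled.items() if rx.search(line)]
        if len(names) > 1:
            hits.append((line, names))
    return hits
```

Running check_rules() before building the big alternation pins a
compile error on one named rule instead of on the combined pattern.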

- Morty

