[logs] regexless parsing, again?
Mordechai T. Abzug
morty at frakir.org
Fri Sep 14 12:14:50 PDT 2007
On Fri, Sep 14, 2007 at 12:45:12PM -0400, Marcus J. Ranum wrote:
> In the case of a lookahead in a parser you're still going to prune
> the search with a single lookahead, whereas with a regex, you've got
> to try N additional regexes no matter what. Unless you order the
> regexes into a parse tree (of sorts) - which I've seen done - but
> that's just putting lipstick on a pig.
One approach I've used to improve the performance of regexes is to OR
all the regexes together. I.e. instead of trying N regexes, regex1,
regex2, regex3, etc., in sequence, you do:
(regex1{sideeffect1}|regex2{sideeffect2}|regex3{sideeffect3}|etc)
The regex engine compiles the big regex once into a state machine, so
it's a lot faster than N regexes. You test each message against the
big regex. The side effect lets you figure out which regex actually
matched, so you then do a big case statement on the side effect to
figure out how to handle the data. [You also want to rerun the
message against just the sub-regex that matched, but that's not a
big deal.]
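A minimal sketch of the ORed-regex dispatch, in Python rather than
whatever engine was used above. The pattern names and regexes are
hypothetical, and a named group stands in for the {sideeffect} hook.
Python's re module is a backtracking engine, so this illustrates the
dispatch mechanism, not the speed-up.

```python
import re

# Hypothetical sub-regexes; names and patterns are illustrative only.
PATTERNS = {
    "ssh_fail": r"Failed password for (\S+) from (\S+)",
    "ssh_ok": r"Accepted \w+ for (\S+) from (\S+)",
}

# One big alternation; each sub-regex is wrapped in its own named group,
# which plays the role of the side effect.
BIG = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in PATTERNS.items()))

# Each sub-regex compiled alone, for the follow-up pass.
SUBS = {name: re.compile(rx) for name, rx in PATTERNS.items()}

def dispatch(message):
    """Return (name, sub_match) for the sub-regex that matched, else (None, None)."""
    m = BIG.search(message)
    if m is None:
        return None, None
    name = m.lastgroup  # name of the outer group that matched
    # Rerun just the matching sub-regex so its own capture groups
    # are numbered from 1.
    return name, SUBS[name].search(message)
```

The caller can then switch on the returned name and read the fields
from the sub-match.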
I've also implemented a pre-processor that allows token-like
subregexes for things like IPs, zones, mail addresses, etc.
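One way such a pre-processor might look, as a sketch only: the
%NAME% placeholder syntax, the token names, and the deliberately
loose token regexes are all assumptions, not the pre-processor
described above.

```python
import re

# Hypothetical token table; the regexes are loose on purpose.
TOKENS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "HOST": r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*",
    "MAIL": r"[^\s@]+@[^\s@]+",
}

def expand(template):
    """Replace each %NAME% placeholder with its sub-regex.

    An unknown token name raises KeyError, so typos show up early.
    """
    return re.sub(
        r"%(\w+)%",
        lambda m: "(?:" + TOKENS[m.group(1)] + ")",
        template,
    )

# Usage: write the pattern with tokens, expand, then compile.
rx = re.compile(expand(r"connect from %HOST%\[(%IP%)\]"))
```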
However, as you say, it's still lipstick on a pig. In particular, if
a message fails to match the big regex, you're going to have to
debug from square one. Debugging badly-formed regexes becomes harder,
although you can improve this by doing a pre-pass that tests each
regex on its own at the start of the run. And I never even considered
testing for overlapping regexes.
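A sketch of what the pre-pass could look like, plus a crude overlap
check. Both function names are hypothetical. The overlap check only
reports sample lines that more than one sub-regex matches; it does
not prove anything about the regexes in general.

```python
import re

def check_patterns(patterns):
    """Compile each sub-regex alone; return {name: error text} for failures."""
    bad = {}
    for name, rx in patterns.items():
        try:
            re.compile(rx)
        except re.error as exc:
            bad[name] = str(exc)
    return bad

def overlaps(patterns, samples):
    """Return (line, names) pairs where more than one sub-regex matches."""
    compiled = {name: re.compile(rx) for name, rx in patterns.items()}
    found = []
    for line in samples:
        names = [n for n, c in compiled.items() if c.search(line)]
        if len(names) > 1:
            found.append((line, names))
    return found
```

Running check_patterns before building the big alternation points at
the one broken sub-regex by name, instead of a compile error
somewhere inside the combined pattern.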
- Morty