[logs] regexless parsing, again?

Marcus J. Ranum mjr at ranum.com
Fri Sep 14 09:45:12 PDT 2007


Mordechai T. Abzug wrote:
>Is the intent to eliminate the need for manual configuration, or just
>to exchange manual regex configuration for a manual lex+yacc-style
>setup?

A couple years ago Abe and I burned a bunch of cycles (literally!
It's cool to have friends with supercomputers...) trying a couple
of approaches we came up with to auto-generate log parsing
rules. It turned out (pretty much what we expected) that it's a
really hard problem. It also taught me that logs are nastier than
I thought they were - which was a real eye-opener because I
already knew logs are pretty nasty.

So the idea is not to auto-generate parse rules; I believe it would
require a hard AI solution to do that (and if I could do that I'd be too busy
conquering the world to worry about syslogs for a while) - the idea
is to produce a log parsing system that can process inputs in close
to the same amount of time regardless of the number of rules you
load into it. One of the big problems with regex-based approaches is
that they get increasingly bad the more rules you load into them,
which is exactly the worst behavior a log parser can have.

A side objective is to produce a system wherein log parsing rules
are vastly more human readable. I.e. - something resembling:
"%{date} sendmail: %s %d at line %d"
instead of a typical regexp:
((\![0-9]\)-9\*.&&WTF
kind of nonsense. I believe more people would be willing to write
log-parsing rules coping with IP addresses if they could match
an IP address as:
%{ip}
as opposed to writing:
([0-9]+(\.[0-9]+){3})
- a popular (but incorrect) regex for matching IP addresses. That one accepts
275.0.0.999 as an address, because nothing limits each number to 0-255.
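
To make the IP address point concrete, here is a minimal Python sketch. It checks the loose regex quoted above against 275.0.0.999 and compares it with a stricter test. The function name is_ipv4 is hypothetical; it only illustrates the kind of check a %{ip} token could perform and is not taken from any actual implementation.

```python
import re

# The loose pattern quoted above: four dot-separated runs of digits.
LOOSE_IP = re.compile(r"([0-9]+(\.[0-9]+){3})")

def is_ipv4(text):
    """Hypothetical stand-in for what a %{ip} token could check."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    # Each part must be 1-3 ASCII digits with a value of at most 255.
    return all(
        p.isascii() and p.isdigit() and len(p) <= 3 and int(p) <= 255
        for p in parts
    )

loose_accepts = LOOSE_IP.fullmatch("275.0.0.999") is not None
strict_accepts = is_ipv4("275.0.0.999")
print(loose_accepts, strict_accepts)
```

The loose regex accepts the bogus address and the stricter check rejects it, which is the argument for hiding that check behind a named token.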

Another extremely important advantage of trying to build parse
trees instead of matching rules is that when you're parsing, you
KNOW WHERE YOU FAILED and you KNOW WHAT YOU
EXPECTED NEXT. The fact that people tolerate using regexps
for logs is absurd, to me, for those 2 reasons alone. Debugging
a regexp-based log matcher entails wasting a huge amount of
time scratching your butt going, "do I need to escape the dot
there...?" and "why the hell didn't that work?" It's also possible
with a parse tree to identify overlapping rules.
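
A small Python sketch of the "know where you failed and what you expected next" property. It assumes whitespace-separated tokens and supports only %s and %d; the function match_rule and the sample lines are hypothetical and are not the design described in this message.

```python
def match_rule(rule, line):
    """Match a whitespace-tokenized line against a rule.

    Returns (True, fields) on success, or (False, message) naming the
    token position where matching stopped and what the rule expected.
    """
    rule_tokens = rule.split()
    line_tokens = line.split()
    fields = []
    for pos, want in enumerate(rule_tokens):
        if pos >= len(line_tokens):
            return False, f"token {pos}: expected {want}, got end of line"
        got = line_tokens[pos]
        if want == "%d":
            if not (got.isascii() and got.isdigit()):
                return False, f"token {pos}: expected %d, got {got!r}"
            fields.append(int(got))
        elif want == "%s":
            fields.append(got)  # any single token
        elif got != want:
            return False, f"token {pos}: expected {want!r}, got {got!r}"
    if len(line_tokens) > len(rule_tokens):
        extra = line_tokens[len(rule_tokens)]
        return False, f"token {len(rule_tokens)}: expected end of line, got {extra!r}"
    return True, fields

RULE = "sendmail: %s %d at line %d"
ok = match_rule(RULE, "sendmail: error 3 at line 42")
bad = match_rule(RULE, "sendmail: error 3 on line 42")
print(ok)
print(bad)
```

A failed regex match only says "no match"; here the failure carries the token position and the expected token, which is what makes a rule debuggable.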

And there's the speed question. It's not guaranteed that parse
trees will be faster than match rules but they ought to be unless
the parse trees are really poorly coded.

>If the latter, you may be trading in one kind of tyranny for another
>kind of tyranny.  A lex+yacc setup will still require lots of
>intervention and expertise.

Nope; no lex/yacc.

The references I've made in the past to using lex/yacc were
in the context of parsing extremely large amounts of log
data very fast. I've done implementations of custom record
parsers with lex/yacc that easily handle 200,000+ lines
per second and are I/O bound on the input. That's nice. But
I don't think that's the right approach for a general-purpose
tool. The parse tree generator thingie I'm designing
is based on the same principles as how yacc works inside
but won't make as many of the amazingly tricky optimizations
yacc makes because I simply don't understand them.

>If you put in more lookahead, you'll start running into
>the same performance problems as regexes.

Not even close.

In the case of a lookahead in a parser you're still going
to prune the search with a single lookahead, whereas
with a regex, you've got to try N additional regexes no
matter what. Unless you order the regexes into a parse
tree (of sorts) - which I've seen done - but that's just
putting lipstick on a pig.
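
The pruning argument can be sketched in Python as follows. This is a toy under stated assumptions, not the parser described above: rules are split on whitespace, merged into a prefix tree of nested dicts, and matched without backtracking (a literal branch is preferred over %d, then %s). The names build_tree, tree_match and regex_scan, and the 1000 generated "daemonN" rules, are made up for illustration.

```python
import re

def build_tree(rules):
    """Merge whitespace-tokenized rules into one prefix tree (nested dicts)."""
    root = {}
    for name, rule in rules.items():
        node = root
        for tok in rule.split():
            node = node.setdefault(tok, {})
        node[None] = name  # the None key marks "a rule ends here"
    return root

def tree_match(root, line):
    """Walk the tree token by token; return (rule name or None, tokens examined)."""
    node, steps = root, 0
    for tok in line.split():
        steps += 1
        if tok in node:                       # literal branch
            node = node[tok]
        elif "%d" in node and tok.isascii() and tok.isdigit():
            node = node["%d"]
        elif "%s" in node:
            node = node["%s"]
        else:
            return None, steps                # pruned: no rule continues here
    return node.get(None), steps

def regex_scan(compiled, line):
    """Try each regex in order; return (rule name or None, regexes tried)."""
    tried = 0
    for name, rx in compiled:
        tried += 1
        if rx.fullmatch(line):
            return name, tried
    return None, tried

# 1000 made-up rules, expressed both as patterns and as equivalent regexes.
rules = {f"rule{i}": f"daemon{i}: %s %d at line %d" for i in range(1000)}
tree = build_tree(rules)
compiled = [
    (f"rule{i}", re.compile(rf"daemon{i}: \S+ \d+ at line \d+"))
    for i in range(1000)
]

line = "daemon999: error 3 at line 42"
print(tree_match(tree, line))
print(regex_scan(compiled, line))
```

With these inputs the tree examines one node per token in the line, while the ordered regex list is tried one by one until something matches. A line that matches nothing is rejected by the tree at its first token but still costs one attempt per regex in the list.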

mjr. 


More information about the LogAnalysis mailing list