[logs] Error messages from syslogd
Marcus J. Ranum
mjr at ranum.com
Wed Jul 11 17:32:41 PDT 2007
Russell Fulton wrote:
>So does anyone have such a matching library that they can Open Source?
I started on one but ran out of "give a damn" before I finished it.
My excuse is really lame, too. The idea that I was going to have
to explain over and over and over again why I chose not to use
XML was so daunting that I just gave up, in advance, because
I knew my head would explode if I actually posted the damn thing
and had to endure the consequences. :)
Unfortunately, I'm only half kidding. The other half is that the
code was tied up as part of an intellectual property dispute that
has since become irrelevant. Maybe I'll dust it off and write a man
page for it, and an XML output converter, this winter if the snow
drifts are deep enough.
Briefly, here's what needs to be done:
Instead of trying to build sequences of fall-through matching rules, you
need to build an acyclic left-to-right parse tree that completely defines
all the message forms. Abe and I did some experimenting with this
a few years ago and it turned out there are only about 50,000 variant
forms of messages - well - 60,000 if you count all the new ones the
OpenBSD guys appear to have added to syslogd. Obviously, you
don't need them all at once; you just need the ones you're seeing.
What's funny is that executing against a parse tree is going to be
oodles faster than a list of regular expressions once you go above
a certain number.
I offer as evidence an evil hack I did for a buddy last year, which
simply builds a nested sequence of calls to sscanf, keeping
track of the farthest-right point of matching, and walking left
to right. I forget the exact number but it was handling something
like 400,000 log lines per second at 2% CPU utilization (i.e., it
was I/O bound). I am not necessarily suggesting that someone
build a tool that outputs a C-coded recognizer... but consider
for a second that if you did, your iPod could handle log data
rates that make the largest commercial SIMs eat their own
intestines in sheer agony.
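
For illustration only, here is a minimal sketch of what such a hard-coded recognizer might look like. It is written in Python rather than C with sscanf, so it is not the hack described above, just the same shape: nested literal checks walking left to right while remembering the farthest-right offset that matched. The helper `lit`, the function `recognize`, the rule names, and the assumption that the timestamp has already been stripped are all made up for the example.

```python
def lit(line, pos, text):
    """Match literal text at pos; return the offset just past it, or -1."""
    return pos + len(text) if line.startswith(text, pos) else -1


def recognize(line):
    """Return (rule_name_or_None, farthest_offset_matched)."""
    far = 0  # farthest-right point any branch reached
    p = lit(line, 0, "sendmail: ")
    if p >= 0:
        far = max(far, p)
        q = lit(line, p, "stuff ")
        if q >= 0:
            far = max(far, q)
            r = lit(line, q, "more stuff")
            if r == len(line):
                return "sendmail-1", r
        q = lit(line, p, "stuff2 ")
        if q >= 0:
            far = max(far, q)
            r = lit(line, q, "stuff things")
            if r == len(line):
                return "sendmail-2", r
    p = lit(line, 0, "/bsd: ")
    if p >= 0:
        far = max(far, p)
        q = lit(line, p, "stuff")
        if q == len(line):
            return "bsd-1", q
    # No rule matched; far says how much of the line was understood.
    return None, far
```

When nothing matches, the second value is the offset where recognition stopped, so `line[far:]` is the part nobody wrote a rule for.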
So, you build the parser-builder and then you load the rules
into it. The parse tree is rooted on the left. In made-up syntax:
$(datetime) /bsd: stuff
$(datetime) sendmail: stuff more stuff
                      stuff2 stuff things
As soon as the parse tree executor sees "sendmail:" then,
aha, you can just walk the tree of sendmail rules. Is the
next token "stuff" or "stuff2"? Or an unknown? Parse trees
can hold huge huge numbers of rules and don't really get
much slower because you only have to assess the rules
that are relevant to what you're dealing with. Meanwhile
our poor lamer with the list of regexps has to check every
single other regexp (even the ones you only see every couple
years) for each line. This approach also has the extremely
valuable property of being able to tell you not only when you
have found an unmatched form, but what token the match
failed at. Any of you who have been regex b8tches have
spent time scratching your head going, "now why the HELL
didn't THAT match?" That's because using regexps for
parsing is the wrong tool for the job.
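
A rough sketch of the tree walk, under some loud assumptions: tokens are whitespace-separated, the timestamp has already been stripped, and the names `Node`, `add_rule`, `match`, and the rule labels are invented for the example. Each line only ever visits the branch it belongs to, and a failed match reports the token it died on.

```python
class Node:
    def __init__(self):
        self.children = {}  # next token -> Node
        self.rule = None    # rule name if some rule ends at this node


def add_rule(root, name, pattern):
    """Insert a rule into the tree, sharing any common left-hand prefix."""
    node = root
    for tok in pattern.split():
        node = node.children.setdefault(tok, Node())
    node.rule = name


def match(root, line):
    """Return (rule, None) on success, else (None, (index, failing_token))."""
    node = root
    tokens = line.split()
    for i, tok in enumerate(tokens):
        nxt = node.children.get(tok)
        if nxt is None:
            return None, (i, tok)  # unknown token at position i
        node = nxt
    if node.rule is None:
        return None, (len(tokens), None)  # line ended before any rule did
    return node.rule, None


root = Node()
add_rule(root, "bsd-1", "/bsd: stuff")
add_rule(root, "sendmail-1", "sendmail: stuff more stuff")
add_rule(root, "sendmail-2", "sendmail: stuff2 stuff things")
```

With these rules loaded, `match(root, "sendmail: stuff less stuff")` fails at token 2 ("less"), which is exactly the answer the regex list can't give you.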
Consider that parsing is also a matter of lexical analysis.
Instead of using disgusting blarg-o-spew regexps like:
([0-9]+(\.[0-9]+){3})
which matches an IP address (if you consider
999.666.999.666 to be a valid IP address), and which,
joy of joys, you get to write ALL OVER AGAIN if anyone
ever uses IPv6, real parsing rules would have an
IP address as a primary data type, like "string" and
"number" and "quoted string", etc. So you'd write a
rule like:
$(datetime) /bsd: stuff $(ip) $(email)
and you'd not only be able to write the rules without
resorting to sniffing glue, you'd be able to read
them again a day later! And parse them faster! And
tell when something broke the rule and where it did.
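
To make the typed-token idea concrete, here is a small sketch. It assumes Python's standard `ipaddress` module as the IP recognizer, a deliberately crude stand-in for `$(email)`, and whitespace-separated tokens; `$(datetime)` is left out to keep it short. The names `is_ip`, `is_email`, `TYPES`, and `match_rule` are hypothetical.

```python
import ipaddress
import re

# The regex from the text, kept only to show what it lets through.
NAIVE_IP = re.compile(r"([0-9]+(\.[0-9]+){3})")


def is_ip(token):
    """True for a valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def is_email(token):
    """Crude placeholder: something@something.with.a.dot."""
    local, at, domain = token.partition("@")
    return bool(local and at and "." in domain)


TYPES = {"ip": is_ip, "email": is_email}


def match_rule(rule, line):
    """Return None on a match, else the index of the first token that failed."""
    rule_toks, line_toks = rule.split(), line.split()
    for i, want in enumerate(rule_toks):
        if i >= len(line_toks):
            return i  # line ran out before the rule did
        got = line_toks[i]
        if want.startswith("$(") and want.endswith(")"):
            ok = TYPES[want[2:-1]](got)  # typed token
        else:
            ok = want == got             # literal token
        if not ok:
            return i
    return None if len(line_toks) == len(rule_toks) else len(rule_toks)


RULE = "/bsd: stuff $(ip) $(email)"
```

The regex happily accepts 999.666.999.666, while `is_ip` rejects it and takes IPv6 addresses without any change to the rule.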
What's so sad about this is that people who write compilers
have known all this stuff literally forever (since the '70s,
anyhow) and, because of how kids are taught
computer 'science', nobody ever whaps them and says
"if you want to parse stuff, take your hands off the Perl
interpreter and go read Aho/Sethi/Ullman."
Anyhow... That's how to solve the problem.
mjr.