[logs] regexless parsing, again?

Marcus J. Ranum mjr at ranum.com
Fri Sep 14 09:11:27 PDT 2007


>I am not sure if you are asking the right question. I don't think  
>regexes are necessarily bad. 

The way they usually are used is.

The problem with regexes is that you're not really parsing; you're
matching. Because you can't really predict which regexes are going
to match, you have to try them all. This can make debugging a
matching ruleset a real pain in the patoot. It also pretty much
guarantees that you're going to look at every input line N times
where N = your number of rules. That sucks.
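
To make that cost concrete, here is a minimal sketch of the usual
"try them all" loop. Python is used purely for illustration; the
rule names and patterns are invented for this example and do not
come from any real ruleset.

```python
import re

# Hypothetical rules, invented for illustration only.
RULES = [
    ("sshd_fail", re.compile(r"sshd\[\d+\]: Failed password for (\S+)")),
    ("sshd_ok", re.compile(r"sshd\[\d+\]: Accepted password for (\S+)")),
    ("su_open", re.compile(r"su: session opened for user (\S+)")),
]


def match_all(line):
    """Try every rule against the line; return (hits, attempts)."""
    hits = []
    attempts = 0
    for name, rx in RULES:
        attempts += 1          # every rule scans the line, hit or miss
        if rx.search(line):
            hits.append(name)
    return hits, attempts


# attempts always equals len(RULES), whether or not anything matched.
print(match_all("sshd[412]: Accepted password for alice"))
print(match_all("kernel: something unrelated"))
```

Stopping at the first hit would save some work on matching lines,
but a line that matches nothing still pays for all N rules.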

There are tricks you can play to minimize the impact of that, by
using start-anchored regexes, or building a sort of hierarchy of
regexes. The fact that such tricks are the only way to make
regex performance tolerable is proof (to me!) that regexes
aren't the right approach.
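
One way such a hierarchy might look, as a rough sketch: a single
start-anchored regex picks out the program name, and only that
program's (also anchored) rules are tried on the rest of the line.
The PREFIX pattern, the BUCKETS table and the rule names are all
assumptions made up for this example.

```python
import re

# Hypothetical first tier: key on the program name at the start of the line.
PREFIX = re.compile(r"^(\w+)(?:\[\d+\])?: ")

# Hypothetical second tier: anchored rules, grouped per program.
BUCKETS = {
    "sshd": [
        ("sshd_fail", re.compile(r"^Failed password for (\S+)")),
        ("sshd_ok", re.compile(r"^Accepted password for (\S+)")),
    ],
    "su": [
        ("su_open", re.compile(r"^session opened for user (\S+)")),
    ],
}


def match_tiered(line):
    """Return (rule_name or None, number of regexes tried)."""
    attempts = 1
    m = PREFIX.match(line)
    if m is None:
        return None, attempts
    rest = line[m.end():]
    for name, rx in BUCKETS.get(m.group(1), []):
        attempts += 1
        if rx.match(rest):
            return name, attempts
    return None, attempts


print(match_tiered("su: session opened for user root"))
print(match_tiered("kernel: something unrelated"))
```

This cuts the number of attempts per line, but it is still
matching rather than parsing: the structure lives in how the
rules were hand-sorted, not in the rules themselves.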

More importantly, since regex matching does not describe the
entire match-space predictably, you can't tell the user which
of the rules that missed was the closest, in the event that
you don't find a matching rule. Nor can you tell if there are
rules that overlap, and there's no way to enforce a notion of
rule precedence.
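
For contrast, here is a small sketch of what a parsing approach
could offer. It is only one possible design, not a description of
any existing tool: rule templates are split into tokens and stored
in a trie, with "%s" as a made-up wildcard for one token. Each
input token is looked at once, a miss can report which rules were
closest, a duplicate template is caught when rules are loaded, and
literals take precedence over wildcards. The walk is greedy and
does not backtrack, which a real implementation would need to
think about.

```python
WILD = "%s"  # hypothetical wildcard marker: matches any single token


class Node:
    def __init__(self):
        self.children = {}   # token (or WILD) -> Node
        self.rule = None     # rule name if a template ends at this node


def add_rule(root, name, template):
    """Insert a template; refuse one that duplicates an existing rule."""
    node = root
    for tok in template.split():
        node = node.children.setdefault(tok, Node())
    if node.rule is not None:
        raise ValueError(f"{name} overlaps {node.rule}")
    node.rule = name


def rules_below(node):
    """All rules reachable from this node (the 'closest' candidates)."""
    found = [node.rule] if node.rule else []
    for child in node.children.values():
        found.extend(rules_below(child))
    return found


def parse(root, line):
    """Return (rule, fields) on success, or (None, closest_rules) on a miss."""
    node, fields = root, []
    for tok in line.split():
        if tok != WILD and tok in node.children:   # literal beats wildcard
            node = node.children[tok]
        elif WILD in node.children:
            node = node.children[WILD]
            fields.append(tok)
        else:
            return None, rules_below(node)
    if node.rule is None:
        return None, rules_below(node)
    return node.rule, fields


# Hypothetical templates, invented for illustration only.
root = Node()
add_rule(root, "login_fail", "Failed password for %s from %s")
add_rule(root, "login_ok", "Accepted password for %s from %s")
add_rule(root, "login_root_fail", "Failed password for root from %s")

print(parse(root, "Failed password for bob from 10.0.0.5"))
print(parse(root, "Failed password for root from 10.0.0.5"))
print(parse(root, "Failed publickey for bob"))
```

The last call misses, but it can still say that the line got as
far as "Failed" and name the rules that begin that way.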

Using regexes is a sign of 2 things:
- Laziness
- Non-comprehension of how to parse
(the other sign of non-comprehension of parsing is use of XML)

The problem is that regexes are such a simple approach and CPU
cycles are so cheap, nobody has ever really bitten off the hard
approach. It's something I've wanted to do for 4 or 5 years now,
and I pretty much write an annual posting here about how to do
it, hoping someone will take up the gauntlet. Unfortunately for me,
one of my clients has a project that pretty much requires this
particular solution, so...  As soon as I'm done coding the analysis
engine that consumes the data I'll write the parser-generator
that produces the data. I'm a bit rusty at coding. :( The thing
needs to be able to handle (let's say) 100,000 lines/second
without dogging the CPU. So ruby and perl and java and all
those other goofy toy languages don't strike me as viable
options.

mjr. 


More information about the LogAnalysis mailing list