[logs] regexless parsing, again?
Tom Le
dottom at gmail.com
Fri Sep 14 22:59:28 PDT 2007
Marcus J. Ranum <mjr at ranum.com> wrote:
| The problem with regexes is that you're not really parsing; you're
| matching. Because you can't really predict which regexes are going
| to match, you have to try them all.
No, you don't. You only have to try regexes until one matches. Even when
nothing matches you don't have to try them all: you can use preparsers to
limit the number of regex rules to match against.
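
A minimal sketch of the preparser idea, assuming syslog-style lines where a
program name precedes the message. The rule table, rule names, and sample
patterns are hypothetical; the point is only that a cheap first step picks
the short list of regexes worth trying.

```python
import re

# Hypothetical rule table: the program name selects the small set of
# regexes worth trying for that line.
RULES_BY_PROGRAM = {
    "sshd": [
        ("ssh_failed_login",
         re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")),
        ("ssh_accepted_login",
         re.compile(r"Accepted \w+ for (\S+) from (\S+)")),
    ],
    "sudo": [
        ("sudo_command", re.compile(r"(\S+) : .*COMMAND=(.+)$")),
    ],
}

# Preparser: find the program name that precedes "[pid]: " or ": ".
PROGRAM_RE = re.compile(r"\s([A-Za-z_][\w.\-/]*)(?:\[\d+\])?:\s")


def classify(line):
    """Return (rule_name, captured_groups, regexes_tried)."""
    m = PROGRAM_RE.search(line)
    if m is None:
        return None, (), 0
    tried = 0
    # Only the rules registered for this program are tried, and the loop
    # stops at the first match.
    for name, rx in RULES_BY_PROGRAM.get(m.group(1), []):
        tried += 1
        hit = rx.search(line)
        if hit:
            return name, hit.groups(), tried
    return None, (), tried


line = "Sep 14 22:59:28 host1 sshd[4242]: Failed password for root from 10.0.0.5 port 22 ssh2"
print(classify(line))
```

A line from a program with no registered rules costs one preparser match and
zero rule regexes.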
| This can make debugging a
| matching ruleset a real pain in the patoot. It also pretty much
| guarantees that you're going to look at every input line N times
| where N = your number of rules. That sucks.
This sounds like you are looking at the more traditional use of regex, where
one regex = one log event. You can also use regexes to match just parts of a
log message (or data stream or whatever). By matching parts within a log
message, you can use vector, hierarchical, or state-space approaches to build
a faster "matching engine" while still using regexes.
But you know that already.
Another example: use regex to define your "matching rules" and then convert
from regex to DFA at implementation. If memory is an issue, use a Delayed
Input DFA (D2FA), which substantially reduces the space requirements compared
to a DFA. Awk, Tcl, GNU grep, and GNU awk all use DFAs.
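
To illustrate what a DFA buys you, here is a small hand-built sketch (not a
regex-to-DFA compiler, and not D2FA) of a DFA equivalent to a full match of
the regex [0-9]+(\.[0-9]+)? . The state names and table layout are
assumptions made for the example. Each input character is read exactly once
and there is no backtracking.

```python
# Hand-built DFA for a full match of [0-9]+(\.[0-9]+)?
START, INT, DOT, FRAC = range(4)
ACCEPTING = {INT, FRAC}

# (current state, character class) -> next state. A missing entry means
# the input is rejected.
TRANSITIONS = {
    (START, "digit"): INT,
    (INT, "digit"): INT,
    (INT, "dot"): DOT,
    (DOT, "digit"): FRAC,
    (FRAC, "digit"): FRAC,
}


def char_class(ch):
    """Map one character onto the DFA's input alphabet."""
    if ch in "0123456789":
        return "digit"
    if ch == ".":
        return "dot"
    return "other"


def dfa_match(text):
    """Return True if the whole of text is accepted by the DFA."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING


for sample in ["3164", "22.59", "22.", ".5", ""]:
    print(repr(sample), dfa_match(sample))
```

A real implementation would generate the table from the regex rules rather
than write it by hand.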
| In the case of a lookahead in a parser you're still going
| to prune the search with a single lookahead, whereas
| with a regex, you've got to try N additional regexes no
| matter what. Unless you order the regexes into a parse
| tree (of sorts) - which I've seen done - but that's just
| putting lipstick on a pig.
What performance increase did you see with that approach? I've easily seen
10x increase which I would not call putting lipstick on a pig. YMMV
depending on the problem you are solving, but 10x is a real number, and we
are trying to push 50x.
Another issue to consider is development time and scalability. Regexes are
understood by many so if you can handle the optimizations on the back-end
but keep regexes for whatever "rules" you are building, it is easier to
support. Requiring your engineers to learn a new parsing language could add
more complexity than necessary unless you can build a good pseudolanguage
wrapper on top, which has its own pros and cons.
E G <bronc94583 at yahoo.com> wrote:
| I tried to make a method that would break data into a
| "header" token and a "context" token (the latter would
| then be broken into sub tokens). Take syslog as an
| easy example. All syslog messages should have the
| standard syslog header on them. After this header is
| the meat of the message, or "context". So if we can
| recgonize the header token, why do a linear search
| with regex on the entire string to figure out what it
| is (as it standard today) and to then try and
| normalize it based on what the regex captured?
If all syslog messages were RFC3164-compliant, log analysis would be much
easier (but then where's job security, right?) Only a fraction of vendor
devices are RFC3164 compliant.
This fact doesn't change the nature of this discussion, but is worth noting
as part of the difficulty/complexity discussion.
Contrast this with SNMP messages. SNMP messages can be parsed very
efficiently because you can do straight hash lookups on OIDs for most
cases. Sometimes you have to parse the values to the right of the equal
sign, but it's an easy problem to solve.
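
A minimal sketch of the hash-lookup idea, assuming varbinds arrive as text
in "OID = value" form. The two OIDs are the standard sysUpTime.0 and
sysName.0 objects; the table, the function name, and the input format are
assumptions for illustration.

```python
# Hypothetical lookup table keyed on OID.
NAME_BY_OID = {
    "1.3.6.1.2.1.1.3.0": "sysUpTime",
    "1.3.6.1.2.1.1.5.0": "sysName",
}


def parse_varbind(text):
    """Split 'OID = value' and identify the OID with one dict lookup."""
    oid, _, value = text.partition("=")
    oid = oid.strip().lstrip(".")  # tolerate a leading dot on the OID
    return NAME_BY_OID.get(oid, "unknown"), value.strip()


print(parse_varbind(".1.3.6.1.2.1.1.5.0 = core-sw1"))
```

The value on the right of the equal sign still needs its own parsing when it
carries structure, but identifying the message needs no pattern matching.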
Michael Kinsley <michael.kinsley at sensage.com> wrote:
| Consider the following:
| - You are receiving Web access logs from 2 different boxes (they
| stream to us over syslog)
| - Each server is from a different vendor.
|
| *These happen to be vendors that both said: "Hey, we
| will follow the W3C standard for our access logs".
By looking only at the logs themselves, of course not. You did mention hard
coding IPs like you have to do on a Cisco MARS or other vendor products to
identify the end-point device. This is the common method used, but it isn't
helpful for those of us who have to deal with 100% passive only (no agents,
no advance knowledge of customer environment... just logs, baby).
There are other approaches to solve the problem, although only #1 is truly
a 100% passive approach:
1. Look at other logs sent by the same device. Admin/config events will tell
you what device it is.
2. Can you do a discovery scan and integrate scan data?
3. Can you analyze the network traffic like a SourceFire RNA?
4. Can you build custom IDS signatures to give you alerts whose sole purpose
is to identify sensor OS, applications, etc.?
| ANSWER1: There are a number of reasons.
| [ snip ]
All good points and I agree 100%.
| So I return to my original point:
| - A REGEX lacks ability to differentiate between two things that
| look very similar.
Why can't you use multiple regexes per log message? Or extract
backreferences and then perform some logic on values to differentiate (e.g.
error codes).
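
A sketch of the second suggestion: one regex captures a value and ordinary
code differentiates on it. The log format, the field name "code", and the
reason labels are hypothetical.

```python
import re

# One regex extracts the status code from a hypothetical message format.
LOGON_RE = re.compile(r"logon failure.*?\bcode=(0x[0-9A-Fa-f]+)", re.IGNORECASE)

# Hypothetical mapping from captured code to a finer-grained event.
REASON_BY_CODE = {
    0xC000006A: "bad_password",
    0xC0000234: "account_locked",
}


def classify_logon(line):
    """Return a reason for a logon failure line, or None if it is not one."""
    m = LOGON_RE.search(line)
    if m is None:
        return None
    code = int(m.group(1), 16)
    return REASON_BY_CODE.get(code, "logon_failure_other")


print(classify_logon("Logon Failure: user=tom code=0xC000006A"))
print(classify_logon("Logon Failure: user=tom code=0xC0000234"))
```

The two sample lines look nearly identical to a pattern, yet the captured
value separates them.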
| - A REGEX can only answer YES or NO
From an oversimplified perspective, yes. But what about:
1. Using multiple yes/no answers to create a fuzzy answer. Many AI
implementations are built entirely on Y/N answers. From a single log
message (or "string") you can ask a lot of "questions" and receive a lot of
Y/N answers.
2. A regex can, of course, also return values via backreferences. Logic can
be applied to backreferences to give a richer answer than Y/N.
3. A regex can give you vector-like information, like:
is "bad password" *near* "logon failure"
Where *near* is expressed as follows (very simple example):
/logon failure.{1,40}bad password/i
Or you could parse out "logon failure" and "bad password", note their
positions, and from that build your "near" vector.
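
Both variants can be sketched as follows. The first reuses the regex above
as written; the second is one possible reading of "note their positions",
with the function names and the choice of measuring from the end of the
first phrase to the start of the second being assumptions.

```python
import re

NEAR_RE = re.compile(r"logon failure.{1,40}bad password", re.IGNORECASE)
FIRST_RE = re.compile(r"logon failure", re.IGNORECASE)
SECOND_RE = re.compile(r"bad password", re.IGNORECASE)


def near_by_regex(line):
    """Y/N answer: are the phrases separated by 1 to 40 characters?"""
    return NEAR_RE.search(line) is not None


def distance(line):
    """Characters between the first 'logon failure' and the next
    'bad password' after it, or None if either phrase is missing."""
    first = FIRST_RE.search(line)
    if first is None:
        return None
    second = SECOND_RE.search(line, first.end())
    if second is None:
        return None
    return second.start() - first.end()


line = "Logon failure for user tom: bad password supplied"
print(near_by_regex(line), distance(line))
```

The position-based version returns a number rather than Y/N, so the
threshold for "near" can be decided later in code.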
All 3 examples above are of course just Y/N answers, but the questions being
asked by the regexes are different from what most people think about when
they think regex. Most people talk about regexes as asking "does this
pattern match".
What about using regexes to say:
- Is this pattern near this other pattern (distance vectors)
- Is this pattern repeated (backreference comparison within regex)
- Does this pattern exist followed by that pattern? (lookahead/
lookbehind)
- Other if-then or case statement like capability using nesting.
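
A few of these questions can be sketched with Python's re module. The field
names (src=, dst=) and the sample wording are hypothetical; the regex
features used (a \1 backreference, positive and negative lookahead) are
standard.

```python
import re

# Is this pattern repeated? \1 must equal whatever group 1 captured, so
# this asks whether source and destination are the same.
SAME_ENDPOINT_RE = re.compile(r"src=(\S+) dst=\1(?!\S)")

# Does this pattern exist followed by that pattern? The lookahead checks
# for "admin" later in the line without consuming it.
DENIED_THEN_ADMIN_RE = re.compile(r"denied(?=.*\badmin\b)", re.IGNORECASE)

# Negation: "failed" on a line that does not contain the word "test".
FAILED_NOT_TEST_RE = re.compile(r"^(?!.*\btest\b).*failed", re.IGNORECASE)


def ask(line):
    """Ask several Y/N questions of one line and collect the answers."""
    return {
        "same_endpoint": SAME_ENDPOINT_RE.search(line) is not None,
        "denied_then_admin": DENIED_THEN_ADMIN_RE.search(line) is not None,
        "failed_not_test": FAILED_NOT_TEST_RE.search(line) is not None,
    }


print(ask("src=10.0.0.1 dst=10.0.0.1 login failed"))
print(ask("test run: access denied for admin, login failed"))
```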
In one sense you are right. Everything above is simply a Yes or No answer,
but I think that oversimplifies the issue - kind of like saying "that
computer only knows if that bit is ON or OFF".
| - A REGEX cannot branch or make decisions
It can with proper nesting, backreferences, lookahead, lookbehind, and/or
negation. You can also match multiple components of a log message and from
that build a tree-like identification system.
Matching individual components of a log message is faster than the
traditional "one regex = one event", so the cost of multiple regexes against
a single log message is really an implementation issue.
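
One way to picture the tree-like identification is a small decision tree
whose nodes are component regexes. This is only a sketch: the component
patterns, the event names, and the tuple-based tree layout are all
hypothetical, and it makes no claim about speed.

```python
import re

# Hypothetical component patterns, each matching one part of a message.
LOGIN_RE = re.compile(r"\blog(?:in|on)\b", re.IGNORECASE)
FAIL_RE = re.compile(r"\bfail(?:ed|ure)\b", re.IGNORECASE)
SUCCESS_RE = re.compile(r"\b(?:accepted|succe(?:ss|eded))\b", re.IGNORECASE)

# Each inner node is (regex, {True: subtree, False: subtree}); each leaf
# is an event name. Only the regexes on the path taken are evaluated.
TREE = (LOGIN_RE, {
    True: (FAIL_RE, {
        True: "auth.login.failure",
        False: (SUCCESS_RE, {
            True: "auth.login.success",
            False: "auth.login.other",
        }),
    }),
    False: "unclassified",
})


def identify(line, node=TREE):
    """Walk the tree, branching on whether each component regex matches."""
    while not isinstance(node, str):
        rx, branches = node
        node = branches[rx.search(line) is not None]
    return node


print(identify("Login failed for user tom"))
print(identify("Logon succeeded for tom"))
print(identify("disk full"))
```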
Finally, I'd add that if the only reason to use something other than regex
is for performance, there are ways to address performance, such as DFA,
D2FA, algorithmic changes, hardware accelerators, and many other
approaches. If the reason to use something other than regex is for reasons
not related to performance, let's identify that specifically and discuss.
Tom