[logs] regexless parsing, again?
Daniel Cid
dcid at ossec.net
Tue Sep 18 18:01:13 PDT 2007
Hi Jason,
**long e-mail alert! :)
When you look at a simplified example (just three entries to check),
the two approaches may perform similarly, but once you grow into
software with multiple rules, your regex approach will be much
slower... Let me explain the difference (using ossec as an example).
First, ossec decouples log analysis (classification) from decoding.
When it receives a message, it goes through a pre-decoding and a
decoding phase, after which the log is represented internally as
follows (using sshd as an example):
May 21 20:22:28 slacker2 sshd[8813]: Accepted password for root from 192.168.20.185 port 1066 ssh2
hostname -> slacker2
program_name -> sshd
log-> Accepted password for root from 192.168.20.185 port 1066 ssh2
srcip -> 192.168.20.185
user -> root
The benefit of separating out the decoding (or normalization) step is
that if you get Solaris or ASL (Apple System Log) messages, they
will be represented internally in the same way, so one set of rules
covers all the different formats.
Solaris:
Aug 2 11:49:23 sshd: [ID 366847 auth.info] Accepted xx...
ASL:
[Time 2006.11.02 11:41:44 UTC] [Facility auth] [Sender sshd] [PID 800]
[Accepted password for root from 1.2.3.4] [Level 4] [UID -2] [GID -2]
[Host test-emac]
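To make the normalization idea concrete, here is a minimal sketch in Python. It is not ossec's decoder code: the function names, the patterns, and the guess that the keyless bracketed ASL field is the message text are all assumptions made for illustration. It only shows two of the formats above ending up with the same field names.

```python
import re

# Hypothetical patterns; field names mirror the ones listed in this e-mail.
SYSLOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<hostname>\S+)\s"
    r"(?P<program_name>[^\s\[:]+)(?:\[\d+\])?:\s(?P<log>.*)$"
)
ASL_FIELD_RE = re.compile(r"\[([^\]]*)\]")
SSHD_ACCEPT_RE = re.compile(r"^Accepted \S+ for (?P<user>\S+) from (?P<srcip>\S+)")

# ASL keys we recognize but do not keep in this sketch.
ASL_IGNORED_KEYS = {"Time", "Facility", "PID", "Level", "UID", "GID"}


def decode_syslog(line):
    """Pre-decode a classic syslog line into hostname/program_name/log."""
    m = SYSLOG_RE.match(line)
    if not m:
        return None
    return {
        "hostname": m["hostname"],
        "program_name": m["program_name"],
        "log": m["log"],
    }


def decode_asl(line):
    """Pre-decode an ASL line made of [Key value] fields."""
    event = {}
    for field in ASL_FIELD_RE.findall(line):
        key, _, value = field.partition(" ")
        if key == "Host":
            event["hostname"] = value
        elif key == "Sender":
            event["program_name"] = value
        elif key not in ASL_IGNORED_KEYS:
            # Assumption: a field without a known key is the message itself.
            event["log"] = field
    return event if "log" in event else None


def decode(line):
    """Pick a pre-decoder by format, then run the sshd-specific decoder."""
    event = decode_asl(line) if line.startswith("[Time ") else decode_syslog(line)
    if event and event.get("program_name") == "sshd":
        m = SSHD_ACCEPT_RE.match(event["log"])
        if m:
            event.update(m.groupdict())  # adds user and srcip
    return event
```

Whatever the input format, a rule written against "program_name", "user" and "srcip" never has to look at the raw line.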
After the decoding is done, we have multiple rules, and in the sshd
branch the first one checks whether the "program_name" was decoded as
sshd. Since the program name is much smaller than the whole log, this
check is much faster. In addition, to improve things a bit more, we
hash the program name, so in the end we just do an integer
comparison. You couldn't do that with regexes :)
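A rough sketch of the hashing trick, again in Python and again not ossec's actual code. The hash function (zlib.crc32), the Rule class and candidate_rules are stand-ins chosen for illustration. The point is that each rule's program name is hashed once at load time, each event's program name is hashed once on arrival, and the per-rule check is then an integer comparison.

```python
import zlib


def name_hash(name):
    """Stand-in hash; any fast string-to-integer hash would do here."""
    return zlib.crc32(name.encode("utf-8"))


class Rule:
    def __init__(self, rule_id, program_name):
        self.rule_id = rule_id
        self.program_name = program_name
        # Computed once, when the rule is loaded.
        self.program_hash = name_hash(program_name)


def candidate_rules(rules, event):
    """Return the rules whose program name matches the decoded event."""
    name = event["program_name"]
    h = name_hash(name)  # computed once per event
    # The integer comparison rejects almost every non-matching rule; the
    # string comparison only runs on a hash hit, to rule out collisions.
    return [r for r in rules if r.program_hash == h and r.program_name == name]
```

With rules loaded for, say, "sshd" and "sudo", an sshd event is compared against each rule as an integer and only the sshd rule survives.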
Continuing, it goes down to the actual rules and tries to find out
which ones match. But, again, we don't use regexes, just simplified
string matching, and since we don't have to deal with all the
different log formats, it is quite fast. Also, because of our
"tree-based" architecture, it is much easier to extract meaning from
the logs. Example:
1- Rule looking for sshd.
  1.1 -- Rule looking for a log that starts with "Accepted password"
    1.1.1 -- Rule looking if the user name was root
      1.1.1.1 -- Rule looking if the root login came from network != 192.168.0.0/24
2- Rule looking for PIX/FWSM/ASA
  2.1 -- Rule looking for PIX/FWSM/ASA with severity 5 or 6.
    2.1.1 -- Rule looking for message with severity 5 or 6 that is 605005 (successful login).
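The sshd branch of that tree can be sketched like this. It is a simplified illustration, not how ossec stores its rules: the Rule dataclass, deepest_match and the lambda predicates are hypothetical, and the PIX/FWSM/ASA branch is left out because its decoded fields are not shown here. A child rule is only evaluated when its parent matched, so an event that is not from sshd costs a single check.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]
    children: List["Rule"] = field(default_factory=list)


def deepest_match(rules, event, path=()):
    """Follow the first matching rule at each level; return the path taken."""
    for rule in rules:
        if rule.matches(event):
            return deepest_match(rule.children, event, path + (rule.name,))
    return path  # no rule at this level matched


LOCAL_NET = ipaddress.ip_network("192.168.0.0/24")

tree = [
    Rule("1", lambda e: e.get("program_name") == "sshd", [
        Rule("1.1", lambda e: e["log"].startswith("Accepted password"), [
            Rule("1.1.1", lambda e: e.get("user") == "root", [
                Rule("1.1.1.1",
                     lambda e: ipaddress.ip_address(e["srcip"]) not in LOCAL_NET),
            ]),
        ]),
    ]),
]
```

The path that comes back carries the meaning: ending at 1.1.1 says "root logged in over ssh from the local network", while ending at 1.1.1.1 says the same login came from somewhere else.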
Try doing that with linear regexes and you will see the difference....
To summarize, that's why I think that to do log analysis properly,
you need to separate decoding from categorization, and combine fast
pattern matching with hashes and regexes (not the latter by itself).
*Sorry for the long e-mail... If anyone is interested in more details,
look at http://www.ossec.net/ossec-docs/auscert-2007-dcid.pdf
Thanks,
--
Daniel B. Cid
dcid ( at ) ossec.net
On 9/18/07, Jason Haar <Jason.Haar at trimble.co.nz> wrote:
> How does your sshd example differ from a regex like
>
> sshd.*(login failure|access denied|I am a hacker)
>
> I realise that
>
> sshd.*login failure
> sshd.*access denied
> sshd.*I am a hacker
>
> would be slower - but shouldn't "better written global regex" equate to
> your idea of separation? I guess I'm thinking that most modern regex
> "engines" will automatically create internal "trees" for well written
> regex rules.
>
> --
> Cheers
>
> Jason Haar
> Information Security Manager, Trimble Navigation Ltd.
> Phone: +64 3 9635 377 Fax: +64 3 9635 417
> PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1
> _______________________________________________
> LogAnalysis mailing list
> LogAnalysis at loganalysis.org
> http://www.loganalysis.org/mailman/listinfo/loganalysis
>