[logs] regexless parsing, again?
Michael Kinsley
michael.kinsley at sensage.com
Fri Sep 14 17:07:16 PDT 2007
Hello Christina -
Thank you for the excellent questions.
> why do you care which vendor created the web access log if it's the
> same format?
ANSWER1: There are a number of reasons.
- I want to separate data based on the device type.
- Large scale customers often have device specific admins.
- Large scale customers often have region specific admins (e.g. our
admins from Singapore should not see or have access to data coming
through New York or Zurich. Pretend you are a large multinational bank).
- Not all sources are created equal. Some are more equal than others
(see Animal Farm for more information).
* If we were managing compliance for a large multinational,
segregating data based on its criticality or asset group would often
be desirable.
ANSWER2: It was a simplified example, designed to illustrate a point.
> and why on earth would you build a system that accepts syslog input
> without recording and being able to use the originating host's IP
> in other logic?
ANSWER: You wouldn't. You should always be able to use the host / IP
address - this is important information.
Please correct me if I am wrong, but we were talking about "regexless
parsing" or some of the pitfalls behind using regular expressions.
I'm not thinking about how to solve the problem of managing logs for
an SMB. We are all capable of rolling our own solutions. We are
examining fundamental elements of log analysis and transforming data
into information.
So I return to my original point:
- A REGEX lacks the ability to differentiate between two things that
look very similar.
- A REGEX match can only answer YES or NO
- A REGEX cannot branch or make decisions based on anything outside
the text it is matching
* A system that addressed the above weaknesses of REGEX would be a
great step forward.
* The tree-like learning system proposed by Ginorio is an example of
such a system. Such trees are also used heavily and successfully in
areas like bioinformatics. Take a look at suffix trees if you are
really interested.
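To make the YES / NO point concrete, here is a minimal sketch in Python. It is
only an illustration under stated assumptions, not Ginorio's system: the log
pattern, the Event record, the HOST_VENDOR table and the classify function are
all hypothetical names invented for this example. The same regex says YES to
an identical W3C-style line from either box; telling the vendors apart takes a
second test on the originating host, which lives outside the regex.

```python
import re
from dataclasses import dataclass

# Hypothetical W3C-style access log line:
# date time client-ip method uri status
W3C_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<c_ip>\S+) (?P<method>[A-Z]+) (?P<uri>\S+) (?P<status>\d{3})$"
)


@dataclass
class Event:
    source_host: str  # originating host, as recorded by the syslog receiver
    message: str      # the raw log line


# Hypothetical "meta-information", stored outside the regex: host -> vendor.
HOST_VENDOR = {"10.0.0.1": "vendor_x", "10.0.0.2": "vendor_y"}


def classify(event: Event) -> str:
    """Tiny hand-written decision tree: each test selects the next branch."""
    # Branch 1: does the text look like a W3C-style access log line?
    if W3C_LINE.match(event.message) is None:
        return "unparsed"
    # Branch 2: the regex cannot answer this one; it needs the source host.
    vendor = HOST_VENDOR.get(event.source_host)
    if vendor is None:
        return "w3c/unknown_vendor"
    return "w3c/" + vendor


# The same line arriving from two different boxes.
line = "2007-09-14 17:07:16 192.0.2.7 GET /index.html 200"
a = Event("10.0.0.1", line)
b = Event("10.0.0.2", line)

print(bool(W3C_LINE.match(a.message)), bool(W3C_LINE.match(b.message)))
print(classify(a), classify(b))
```

The regex result is the same for both events, while classify separates them
because it is allowed to branch on a second piece of information. A learned
tree would discover such tests itself; here they are simply hard coded.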
Cheers.
-Michael
On Sep 14, 2007, at 3:40 PM, Christina Noren wrote:
> why do you care which vendor created the web access log if it's the
> same format?
>
> and why on earth would you build a system that accepts syslog input
> without recording and being able to use the originating host's IP
> in other logic?
>
> On Sep 14, 2007, at 2:41 PM, Kinsley, Michael wrote:
>
>> Consider the following:
>> - You are receiving Web access logs from 2 different boxes (they
>> stream to us over syslog)
>> - Each server is from a different vendor.
>> *These happen to be vendors that both said: "Hey, we
>> will follow the W3C standard for our access logs".
>>
>> Can you devise a regular expression that can discriminate between
>> vendor
>> x logs and vendor y?
>>
>> Answer: Not without hard coding an IP Address or host name... and
>> then
>> we would need to store this "Meta-Information" somewhere else... and
>> then we need a procedural language to go map and sort these results.
>