[logs] regexless parsing, again?

Michael Kinsley michael.kinsley at sensage.com
Fri Sep 14 17:07:16 PDT 2007


Hello Christina -

Thank you for the excellent questions.

> why do you care which vendor created the web access log if it's the  
> same format?

ANSWER1: There are a number of reasons.
	- I want to separate data based on the device type.
	- Large-scale customers often have device-specific admins.
	- Large-scale customers often have region-specific admins (i.e., our
admins in Singapore should not see or have access to data coming
through New York or Zurich. Pretend you are a large multinational bank.)
	- Not all sources are created equal. Some are more equal than others
(see Animal Farm for more information).
		* If we were managing compliance for a large multinational,
segregating data based on its criticality or asset group would often be
desirable.

ANSWER2: It was a simplified example, designed to illustrate a point.

> and why on earth would you build a system that accepts syslog input  
> without recording and being able to use the originating host's IP  
> in other logic?

ANSWER: You wouldn't. You should always be able to use the host / IP
address - this is important information.

Please correct me if I am wrong, but we were talking about "regexless
parsing" and some of the pitfalls of using regular expressions.
I'm not thinking about how to solve the problem of managing logs for
an SMB. We are all capable of rolling our own solutions. We are
examining fundamental elements of log analysis and of transforming data
into information.

So I return to my original point:
	- A REGEX lacks the ability to differentiate between two things that
look very similar.
	- A REGEX can only answer YES or NO
	- A REGEX cannot branch or make decisions
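
A minimal sketch of these three points, in Python. Everything in it is
made up for illustration: the two log lines, the field layout, the sender
addresses, and the vendor names. It only assumes that both vendors emit
the same W3C-style fields, as in the original example.

```python
import re

# Hypothetical access log lines from two vendors that use the same layout.
VENDOR_X_LINE = "2007-09-14 17:07:16 10.0.0.5 GET /index.html 200"
VENDOR_Y_LINE = "2007-09-14 17:07:18 10.0.0.9 GET /login.html 404"

# One pattern for the shared format (field layout is an assumption).
W3C_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<client>\S+) (?P<method>[A-Z]+) (?P<uri>\S+) (?P<status>\d{3})$"
)

# The "meta-information" stored elsewhere: syslog sender -> vendor.
# Addresses and names are hypothetical.
SENDER_TO_VENDOR = {"192.0.2.10": "vendor_x", "192.0.2.20": "vendor_y"}


def classify(sender_ip, line):
    """Return a vendor name, or None if the line is not an access log line."""
    # The regex contributes exactly one bit: the line matches or it does not.
    if W3C_LINE.match(line) is None:
        return None
    # Telling the vendors apart happens here, in procedural code,
    # using data the pattern never sees.
    return SENDER_TO_VENDOR.get(sender_ip, "unknown")
```

Both sample lines match W3C_LINE equally well, so the pattern alone cannot
say which vendor produced which line. The lookup table and the surrounding
function do that work.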

* A system that addressed the above weaknesses of REGEX would provide a
great step forward.
* The tree-like learning system proposed by Ginorio is an example of
such a system. Such trees are also used heavily and successfully in
areas like bioinformatics. Take a look at suffix trees if you are
really interested.
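
To make "tree-like" concrete, here is a rough Python sketch of a token
prefix tree that branches on each token of a message. It is not Ginorio's
proposal and it is not a suffix tree; it is only a small stand-in to show
a structure that makes a decision at every node instead of answering a
single yes or no. The class, the wildcard marker, the sample patterns,
and the labels are all hypothetical.

```python
WILDCARD = "*"  # hypothetical marker meaning "any single token"


class Node:
    def __init__(self):
        self.children = {}  # token -> Node
        self.label = None   # set only on nodes that end a known pattern


def add_pattern(root, tokens, label):
    """Insert a token sequence into the tree and label its final node."""
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, Node())
    node.label = label


def classify(root, line):
    """Walk the tree token by token; return a label or None."""
    node = root
    for tok in line.split():
        nxt = node.children.get(tok)           # exact token first
        if nxt is None:
            nxt = node.children.get(WILDCARD)  # then the wildcard branch
        if nxt is None:
            return None                        # no branch fits this token
        node = nxt
    return node.label


root = Node()
add_pattern(root, ["sshd", "Accepted", "password", "for", WILDCARD],
            "ssh_login_ok")
add_pattern(root, ["sshd", "Failed", "password", "for", WILDCARD],
            "ssh_login_failed")
```

The two patterns share the "sshd" node and split at the second token, so
two messages that look very similar end up with different labels.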

Cheers.


-Michael






	
On Sep 14, 2007, at 3:40 PM, Christina Noren wrote:

> why do you care which vendor created the web access log if it's the  
> same format?
>
> and why on earth would you build a system that accepts syslog input  
> without recording and being able to use the originating host's IP  
> in other logic?
>
> On Sep 14, 2007, at 2:41 PM, Kinsley, Michael wrote:
>
>> Consider the following:
>> 	- You are receiving Web access logs from 2 different boxes (they
>> stream to us over syslog)
>> 	- Each server is from a different vendor.
>> 		*These happen to be vendors that both said: "Hey, we
>> will follow the W3C standard for our access logs".
>>
>> Can you devise a regular expression that can discriminate between  
>> vendor
>> x logs and vendor y?
>>
>> Answer: Not without hard coding an IP Address or host name... and  
>> then
>> we would need to store this "Meta-Information" somewhere else... and
>> then we need a procedural language to go map and sort these results.
>
