[logs] regexless parsing, again?

cfrln at cfrln.com
Fri Sep 14 19:25:44 PDT 2007


My point was only that host-based decisions are beside the point for this format-parsing decision.

Tree based parsing is obviously a good idea. Many people are exploring that.

-----Original Message-----
From: Michael Kinsley <michael.kinsley at sensage.com>

Date: Fri, 14 Sep 2007 17:07:16 
To: Christina Noren <cfrln at cfrln.com>
Cc: <loganalysis at loganalysis.org>
Subject: Re: Re: [logs] regexless parsing, again?


Hello Christina -


Thank you for the excellent questions.

why do you care which vendor created the web access log if it's the same format?


ANSWER1: There are a number of reasons.
	- I want to separate data based on the device type.
	- Large-scale customers often have device-specific admins.
	- Large-scale customers often have region-specific admins (i.e., our admins in Singapore should not see or have access to data coming through New York or Zurich. Pretend you are a large multinational bank.)
	- Not all sources are created equal. Some are more equal than others (see Animal Farm for more information).
		* If we are managing compliance for a large multinational, segregating data based on its criticality or asset group is often desirable.


ANSWER2: It was a simplified example, designed to illustrate a point.


and why on earth would you build a system that accepts syslog input without recording and being able to use the originating host's IP in other logic?


ANSWER: You wouldn't. You should always be able to use the host / IP address - this is important information.


Please correct me if I am wrong, but we were talking about "regexless parsing" and some of the pitfalls of using regular expressions. I'm not thinking about how to solve the problem of managing logs for an SMB; we are all capable of rolling our own solutions. We are examining fundamental elements of log analysis and of transforming data into information.


So I return to my original point:
	- A REGEX lacks the ability to differentiate between two things that look very similar.
	- A REGEX can only answer YES or NO.
	- A REGEX cannot branch or make decisions.
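
A minimal Python sketch of the first two points. The two log lines and the pattern are made up for illustration (they are not taken from any real vendor), and the helper name `matches` is hypothetical. Both lines follow the same format, so the pattern accepts both, and nothing in the result says which box produced which line.

```python
import re

# Hypothetical access-log lines from two different vendors' servers.
# Both follow the same field layout, so the text alone carries no vendor hint.
VENDOR_X_LINE = "2007-09-14 17:07:16 10.0.0.5 GET /index.html 200"
VENDOR_Y_LINE = "2007-09-14 17:07:18 10.0.0.9 GET /login.html 200"

# Illustrative pattern for the shared format: date, time, client, method, URI, status.
ACCESS_LOG = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<client>\S+) (?P<method>[A-Z]+) (?P<uri>\S+) (?P<status>\d{3})$"
)


def matches(line: str) -> bool:
    """Return True if the line fits the shared format, False otherwise."""
    return ACCESS_LOG.match(line) is not None


# The answer is the same YES for both vendors.
print(matches(VENDOR_X_LINE), matches(VENDOR_Y_LINE))
```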


* A system that addressed the above weaknesses of REGEX would be a great step forward.
* The tree-like learning system proposed by Ginorio is an example of such a system. Such trees are also used heavily and successfully in areas like bioinformatics. Take a look at suffix trees if you are really interested.
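
To make the branching idea concrete, here is a rough Python sketch of a token tree. It is not Ginorio's system and not a suffix tree; it is a much simpler prefix tree over whitespace-separated tokens, and every name in it (`Node`, `insert`, `classify`, the `*` wildcard, the sample sshd-style messages and labels) is an assumption made for the example. Each token picks a branch, so a single walk down the tree both recognizes a line and says which kind of message it is.

```python
from __future__ import annotations

from dataclasses import dataclass, field

WILDCARD = "*"  # stands for any single token (a variable field such as a user name)


@dataclass
class Node:
    children: dict[str, Node] = field(default_factory=dict)
    label: str | None = None  # set only on nodes that end a known message shape


def insert(root: Node, pattern: str, label: str) -> None:
    """Add one message shape to the tree, one token per level."""
    node = root
    for token in pattern.split():
        node = node.children.setdefault(token, Node())
    node.label = label


def classify(root: Node, line: str) -> str | None:
    """Walk the tree token by token; literal branches win over the wildcard.

    This simple version does not backtrack if a literal branch later fails.
    """
    node = root
    for token in line.split():
        nxt = node.children.get(token) or node.children.get(WILDCARD)
        if nxt is None:
            return None  # no branch for this token: unknown message
        node = nxt
    return node.label


root = Node()
insert(root, "sshd Accepted password for * from *", "ssh_login_ok")
insert(root, "sshd Failed password for * from *", "ssh_login_failed")
insert(root, "sshd Connection closed by *", "ssh_closed")

print(classify(root, "sshd Failed password for alice from 10.0.0.5"))
```

Unlike a single pattern, the walk returns a label rather than a bare yes or no, and adding a new message shape means adding a branch rather than rewriting an expression.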


Cheers.

-Michael

On Sep 14, 2007, at 3:40 PM, Christina Noren wrote:
why do you care which vendor created the web access log if it's the same format?


and why on earth would you build a system that accepts syslog input without recording and being able to use the originating host's IP in other logic?


On Sep 14, 2007, at 2:41 PM, Kinsley, Michael wrote:

Consider the following: 
	- You are receiving Web access logs from 2 different boxes (they
stream to us over syslog)
	- Each server is from a different vendor. 
		* These happen to be vendors that both said: "Hey, we
will follow the W3C standard for our access logs".


Can you devise a regular expression that can discriminate between vendor
x logs and vendor y? 


Answer: Not without hard coding an IP Address or host name... and then
we would need to store this "Meta-Information" somewhere else... and
then we need a procedural language to go map and sort these results. 



