[logs] regexless parsing, again?

E G bronc94583 at yahoo.com
Fri Sep 14 22:33:40 PDT 2007


Please forgive my grammar, I just got home from happy
hour with my team :-)

Michael has some valid points, even if a few fall
outside the scope of this discussion. In the end, it's
the customer who dictates what they want, no matter
how cool an idea any of us comes up with.

I learned that the hard way during my time at Cisco
Systems as their top security guy.

As far as how the system as a whole works, I totally agree
that whatever you use, it must be able to
"learn" what it doesn't know, really "know" what it
knows, and "guess" at something partial that it
thinks it might know. Basically, it's got to be able to
take the intelligence of an analyst and ideally put
it into the application. Impossible, yes, but we can
come close.
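
To make the know / guess / learn idea concrete, here is a rough sketch,
not a real implementation. The templates, the two thresholds and the
token-similarity scoring (Python's difflib) are all my own assumptions
for illustration; a real system would need far better normalisation
and tuning.

```python
from difflib import SequenceMatcher

# Hypothetical templates for messages the system already "knows".
KNOWN_TEMPLATES = {
    "ssh_login_failed": "Failed password for USER from IP port NUM ssh2",
    "ssh_login_ok": "Accepted password for USER from IP port NUM ssh2",
}

KNOW_THRESHOLD = 0.85   # assumed cut-off for "really know"
GUESS_THRESHOLD = 0.5   # assumed cut-off for "might know"


def normalise(token):
    """Replace obviously variable tokens with placeholders, without regexes."""
    if token.isdigit():
        return "NUM"
    parts = token.split(".")
    if len(parts) == 4 and all(p.isdigit() for p in parts):
        return "IP"
    return token


def classify(message, templates=KNOWN_TEMPLATES):
    """Return (verdict, label, score); verdict is 'know', 'guess' or 'learn'."""
    tokens = [normalise(t) for t in message.split()]
    best_label, best_score = None, 0.0
    for label, template in templates.items():
        # Token-level similarity between the message and the template.
        score = SequenceMatcher(None, tokens, template.split()).ratio()
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= KNOW_THRESHOLD:
        return "know", best_label, best_score
    if best_score >= GUESS_THRESHOLD:
        return "guess", best_label, best_score
    # Nothing close enough: hand the message to whatever does the learning.
    return "learn", None, best_score
```

Anything that comes back as "learn" would be queued for a human (or
some training step) to turn into a new template.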

Like Michael says, RegEx can only answer yes or no,
and we need something that falls in between. A yes, no
and a maybe. The only way I could see doing this is
via some kind of decision tree with all the branches
leading to some kind of outcome.
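
Something like the following is what I have in mind, as a minimal
sketch only. Every test can answer True, False or "can't tell" (None),
and every branch ends in a yes / no / maybe leaf. The host inventory,
the test functions and the tree layout are made up for illustration,
loosely based on Michael's two-vendor example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union

# Hypothetical inventory mapping source hosts to vendors.
HOST_VENDOR = {"10.1.1.10": "vendor_x", "10.1.1.20": "vendor_y"}


@dataclass
class Leaf:
    verdict: str                  # "yes", "no" or "maybe"
    label: Optional[str] = None


@dataclass
class Node:
    # test returns True, False, or None when the record gives no basis to decide
    test: Callable[[dict], Optional[bool]]
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]
    if_unknown: Union["Node", Leaf]


def walk(tree, record):
    """Follow branches until a leaf is reached and return that leaf."""
    while isinstance(tree, Node):
        answer = tree.test(record)
        if answer is None:
            tree = tree.if_unknown
        elif answer:
            tree = tree.if_true
        else:
            tree = tree.if_false
    return tree


def has_http_method(record):
    line = record.get("line")
    if not line:
        return None
    return any(tok in ("GET", "POST", "HEAD") for tok in line.split())


def from_vendor_x(record):
    vendor = HOST_VENDOR.get(record.get("host"))
    if vendor is None:
        return None               # unknown host: cannot attribute a vendor
    return vendor == "vendor_x"


TREE = Node(
    test=has_http_method,
    if_true=Node(
        test=from_vendor_x,
        if_true=Leaf("yes", "vendor_x"),
        if_false=Leaf("yes", "vendor_y"),
        if_unknown=Leaf("maybe", "web_access"),
    ),
    if_false=Leaf("no"),
    if_unknown=Leaf("maybe"),
)
```

Calling walk(TREE, record) on a dict with "host" and "line" keys gives
back the leaf that was reached, so an access log line from a host that
is not in the inventory ends up as a "maybe" instead of being forced
into a yes or a no.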

Ok, I'm off to bed. This has been a good discussion so
far; I hope some more people will add their
ideas.



- Erik 


--- Michael Kinsley <michael.kinsley at sensage.com>
wrote:

> Hello Christina -
> 
> Thank you for the excellent questions.
> 
> > why do you care which vendor created the web access log if it's the
> > same format?
> 
> ANSWER1: There are a number of reasons.
> 	- I want to separate data based on the device type.
> 	- Large scale customers often have device specific admins.
> 	- Large scale customers often have region specific admins (i.e. our
> 	  admins from Singapore should not see or have access to data coming
> 	  through New York or Zurich. Pretend you are a large multinational
> 	  bank.)
> 	- All sources are not created equal. Some are more equal than others
> 	  (See Animal Farm for more information).
> 		* if we were managing compliance for a large multinational,
> 		  segregating data based on its criticality or asset group is
> 		  often desirable.
> 
> ANSWER2: It was a simplified example, designed to
> illustrate a point.
> 
> > and why on earth would you build a system that accepts syslog input
> > without recording and being able to use the originating host's IP
> > in other logic?
> 
> ANSWER : You wouldn't. You should always be able to use the host / ip
> address - this is important information.
> 
> Please correct me if I am wrong, but we were talking about "regexless
> parsing" or some of the pitfalls behind using regular expressions.
> I'm not thinking about how to solve the problem of managing logs for
> a SMB. We are all capable of rolling our own solutions. We are
> examining fundamental elements of log analysis and transforming data
> into information.
> 
> So I return to my original point:
> 	- A REGEX lacks ability to differentiate between two things that
> 	  look very similar.
> 	- A REGEX can only answer YES or NO
> 	- A REGEX cannot branch or make decisions
> 
> * A system that addressed the above weakness of REGEX would provide a
>   great step forward.
> * The tree-like learning system proposed by Ginorio is an example of
>   such a system. Such trees are also used heavily and successfully in
>   areas like Bioinformatics. Take a look at Suffix trees if you are
>   really interested.
> 
> Cheers.
> 
> 
> -Michael
> 
> On Sep 14, 2007, at 3:40 PM, Christina Noren wrote:
> 
> > why do you care which vendor created the web access log if it's the
> > same format?
> >
> > and why on earth would you build a system that accepts syslog input
> > without recording and being able to use the originating host's IP
> > in other logic?
> >
> > On Sep 14, 2007, at 2:41 PM, Kinsley, Michael wrote:
> >
> >> Consider the following:
> >> 	- You are receiving Web access logs from 2 different boxes (they
> >> 	  stream to us over syslog)
> >> 	- Each server is from a different vendor.
> >> 		* These happen to be vendors that both said: "Hey, we
> >> 		  will follow the W3C standard for our access logs".
> >>
> >> Can you devise a regular expression that can discriminate between
> >> vendor x logs and vendor y?
> >>
> >> Answer: Not without hard coding an IP Address or host name... and
> >> then we would need to store this "Meta-Information" somewhere else...
> >> and then we need a procedural language to go map and sort these
> >> results.
> >
> 
> > _______________________________________________
> > LogAnalysis mailing list
> > LogAnalysis at loganalysis.org
> > http://www.loganalysis.org/mailman/listinfo/loganalysis



       