[logs] regexless parsing, again?
E G
bronc94583 at yahoo.com
Fri Sep 14 13:05:47 PDT 2007
Totally. Recognizing a line as being, say, an ssh
syslog log line is a different matter than
"normalizing" that same data into a usable form on the
other end.
At the same time, once you know what a line is,
hopefully you've learned to "recognize" its type based
on some method which also helps you break the data
down to make it easier to normalize later.
Using XML for everything is one way to have those
pseudo structures made for you, but in its absence
there needs to be another method until some kind of
standard takes hold.
I tried to make a method that would break data into a
"header" token and a "context" token (the latter would
then be broken into sub tokens). Take syslog as an
easy example. All syslog messages should have the
standard syslog header on them. After this header is
the meat of the message, or "context". So if we can
recognize the header token, why do a linear search
with regex on the entire string to figure out what it
is (as is standard today) and then try to
normalize it based on what the regex captured? Just
classify it as a syslog message and send it down the
syslog branch to have the next token dealt with. Using
this tree method, you can reduce the search times
quite a bit.
I guess it's coding 101 really. A tree search is
faster than a linear search, and that's kind of what I
came up with.
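To make the header/context idea concrete, here is a
minimal Python sketch. It is not the code from the
project described above; the names split_syslog, TREE
and classify, the event labels, and the tiny tree are
all hypothetical. It assumes the classic
"Mon DD HH:MM:SS host tag: message" syslog layout,
peels off the header with plain string splitting, and
then walks the context one token at a time.

```python
# Hypothetical sketch: every name and the toy tree below are
# illustrative assumptions, not taken from any real product.

MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

# First level: program name from the syslog header.
# Deeper levels: successive whitespace-separated context tokens.
# A string leaf is the classification.
TREE = {
    "sshd": {
        "Accepted": {"password": "ssh_login_ok",
                     "publickey": "ssh_login_ok"},
        "Failed": {"password": "ssh_login_failed"},
    },
}


def split_syslog(line):
    """Return (header_dict, context), or None if the line does not
    look like 'Mon DD HH:MM:SS host tag: message'."""
    parts = line.split(None, 4)  # month, day, clock, host, rest
    if len(parts) < 5:
        return None
    month, day, clock, host, rest = parts
    if month not in MONTHS or not day.isdigit() or clock.count(":") != 2:
        return None
    tag, sep, context = rest.partition(": ")
    if not sep:
        return None
    program = tag.split("[", 1)[0]  # drop an optional "[pid]"
    return {"host": host, "program": program}, context


def classify(line):
    """Classify a line by walking the tree instead of trying one
    regex after another against the whole string."""
    parsed = split_syslog(line)
    if parsed is None:
        return "unknown"  # not syslog: would go to some other branch
    header, context = parsed
    node = TREE.get(header["program"])
    if node is None:
        return "syslog_unclassified"
    for token in context.split():
        node = node.get(token)
        if node is None:
            return "syslog_unclassified"
        if isinstance(node, str):
            return node  # reached a leaf
    return "syslog_unclassified"  # ran out of tokens before a leaf
```

Each step is a dictionary lookup on one token, so the
work depends on how deep the matching branch is rather
than on how many patterns exist in total. For example,
classify("Sep 14 13:05:47 myhost sshd[1234]: Failed
password for root from 10.0.0.1 port 22 ssh2") stops
after two context tokens.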
For normalization, the only answer out there right now
is RegEx, especially if you want to add any kind of
intelligence to the data itself (i.e. "CODE 123"
translates to "PRINTER ON FIRE"). Adding in
intelligence is a big key, IMHO. I got tired of
looking up obscure Windows event codes, if you know
what I mean.
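The "intelligence" part can be sketched separately from
however the code gets extracted. Below is a minimal,
hypothetical Python example: the MEANINGS table and the
annotate function are made up for illustration, and the
only entry is the "CODE 123" example from above. It
assumes the message has already been split into
whitespace-separated tokens.

```python
# Hypothetical lookup table keyed on (keyword, value) token pairs.
# The single entry is the example from the text, not real device data.
MEANINGS = {
    ("CODE", "123"): "PRINTER ON FIRE",
}


def annotate(context):
    """Return the message with a human-readable meaning appended
    after every known (keyword, value) token pair."""
    tokens = context.split()
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        if i > 0:
            meaning = MEANINGS.get((tokens[i - 1], token))
            if meaning:
                out.append("(" + meaning + ")")
    return " ".join(out)
```

With this table, annotate("lp0 reported CODE 123")
returns "lp0 reported CODE 123 (PRINTER ON FIRE)", and
unknown codes pass through unchanged.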
Anyhow, I wanted to work on a tokenizing
format/engine, to replace the PCRE RegEx engine I had
been using, but my project never got that far.
And my disclaimer: I work at a SPAM company that
doesn't even collect its logs :-P
- Erik
--- Christina Noren <cfrln at cfrln.com> wrote:
> I think you need to distinguish between the two different goals of
> parsing in order to have a productive discussion of this:
>
> 1) classifying log messages based on having a specific pattern -
> which the below approach does better than regexes - and which is
> realized somewhat similarly in Splunk's automatic event type
> classification feature.
> 2) pulling out and naming fields within the log data - which is also
> something where there are other possible approaches than regexes.
> However, even other approaches would necessarily rely on pattern
> matching of some sort in the absence of a self-describing log format
> whether XML, name/value pair, csv header, etc. (Splunk also guesses
> at fields for such well defined formats.) But the pattern matching
> could be less cryptic and smarter about the patterns that are in
> logs, such as Marcus mentions in his last post.
>
> Repeating Raffy's disclaimer, I also work at Splunk.
>
> Christina
>
> On Sep 13, 2007, at 5:12 PM, E G wrote:
>
> > Back when I worked at "another company" a few years
> > back I did a lot of research into this area.
> >
> > We looked at an approach that grouped logs together
> > based upon what we already knew about that type of log
> > source and how they are similar, rather than
> > "guessing" what each line was as it came in.
> >
> > This came about from doing quite a bit of statistical
> > analysis on raw log data. I noted quite a bit of
> > correlation from source to source (which in itself
> > isn't news), and this would allow us to
> > classify unknown data in some semi-intelligent way
> > and dump known entities in known "buckets".
> >
> > Working with some people who were much smarter than I,
> > I was able to create a reverse Patricia Trie tree-like
> > structure. Think of it like when you're on your
> > blackberry and you're typing. It attempts to predict
> > the next letter and tries to complete the word you're
> > typing for you. The same logic can basically work in
> > reverse where you use this Trie structure to disassemble
> > a word, or string in our case. Once you reach an end
> > point on the Trie, it leaves you with what the data
> > is, however you have decided to classify it.
> >
> > I hope that's understandable; I didn't want to write
> > out a book.
> >
> > Anyhow, my ideas didn't end up going anywhere. They
> > chose to stay with the RegEx "guessing" method - as
> > is the standard. I had a lot of the code I developed
> > after I left up on SourceForge for a while, but real
> > life took me away from it. I might be able to dig it
> > up if anyone is interested.
> >
> >
> > - Erik
> >
> >
> > --- "Marcus J. Ranum" <mjr at ranum.com> wrote:
> >
> >> Anton Chuvakin wrote:
> >>> Anybody care to restart the discussion and see what the collective
> >>> wisdom of loganalysis can produce?
> >>
> >> I am coding on something regarding regexless parsing as we
> >> speak. ETA is unknown but certainly before Xmas. It will be
> >> open source but not GPL.
> >>
> >> mjr.
> >> _______________________________________________
> >> LogAnalysis mailing list
> >> LogAnalysis at loganalysis.org
> >> http://www.loganalysis.org/mailman/listinfo/loganalysis