[logs] regexless parsing, again?
Christina Noren
cfrln at cfrln.com
Fri Sep 14 12:30:18 PDT 2007
I think you need to distinguish between the two different goals of
parsing in order to have a productive discussion of this:
1) classifying log messages by the specific pattern they match -
which the approach below does better than regexes - and which is
realized somewhat similarly in Splunk's automatic event type
classification feature.
2) pulling out and naming fields within the log data - which is also
something where there are other possible approaches than regexes.
However, even other approaches would necessarily rely on pattern
matching of some sort in the absence of a self-describing log format,
whether XML, name/value pair, CSV header, etc. (Splunk also guesses
at fields for such well-defined formats.) But the pattern matching
could be less cryptic and smarter about the patterns that are in
logs, as Marcus mentions in his last post.
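
To make the second goal concrete, here is a minimal sketch in Python of
pulling named fields out of a name/value pair line without writing a
regex. It is only an illustration under stated assumptions: the function
name extract_fields and the sample log line are hypothetical, and the
tokenizing is delegated to the standard library's shlex module, which is
still pattern matching of a sort, just hidden behind a friendlier
interface.

```python
import shlex


def extract_fields(line):
    """Return a dict of the name=value pairs found in one log line.

    Hypothetical helper for illustration; tokens without an "=" (such as
    a timestamp or host name) are ignored rather than guessed at.
    """
    try:
        # shlex.split keeps quoted values such as action="file open" together.
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes (e.g. a stray apostrophe): fall back to whitespace.
        tokens = line.split()

    fields = {}
    for token in tokens:
        name, sep, value = token.partition("=")
        if sep and name:
            fields[name] = value
    return fields


if __name__ == "__main__":
    sample = 'Sep 14 12:30:18 host sshd: user=alice action="file open" status=ok'
    print(extract_fields(sample))
```

This only works because the format describes itself; for free-form
messages some other kind of pattern knowledge is still needed.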
Repeating Raffy's disclaimer, I also work at Splunk.
Christina
On Sep 13, 2007, at 5:12 PM, E G wrote:
> Back when I worked at "another company" a few years
> back I did a lot of research into this area.
>
> We looked at an approach that grouped logs together
> based upon what we already knew about that type of log
> source and how they are similar, rather than
> "guessing" what each line was as it came in.
>
> This came about from doing quite a bit of statistical
> analysis on raw log data. I noted quite a bit of
> correlation from source to source (which in itself
> isn't news), and that correlation would allow us to
> classify unknown data in some semi-intelligent way
> and dump known entities in known "buckets".
>
> Working with some people who were much smarter than I,
> I was able to create a reverse Patricia trie-like
> structure. Think of it like when you're on your
> BlackBerry and you're typing. It attempts to predict
> the next letter and tries to complete the word you're
> typing for you. The same logic can basically work in
> reverse, where you use this trie structure to disassemble
> a word, or string in our case. Once you reach an end
> point on the trie, it leaves you with what the data
> is, however you have decided to classify it.
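
A rough sketch of the idea described above, in Python. This is not
Erik's code, which is not shown in this thread. It assumes a plain
token-level trie rather than a compressed Patricia trie, splits lines on
whitespace, and uses "*" as a hypothetical wildcard for variable fields.
The class names, templates and labels are all made up for illustration.

```python
WILDCARD = "*"  # assumed marker for "any single token" in a template


class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.label = None   # set only where a known template ends


class LogTrie:
    """Token-level trie that maps known message shapes to class labels."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, template, label):
        # Insert one template, one whitespace-separated token per level.
        node = self.root
        for token in template.split():
            node = node.children.setdefault(token, TrieNode())
        node.label = label

    def classify(self, line):
        # Walk the trie token by token; None means "unknown message".
        return self._walk(self.root, line.split(), 0)

    def _walk(self, node, tokens, i):
        if i == len(tokens):
            return node.label
        # Try the literal token first, then fall back to the wildcard branch.
        for key in (tokens[i], WILDCARD):
            child = node.children.get(key)
            if child is not None:
                found = self._walk(child, tokens, i + 1)
                if found is not None:
                    return found
        return None


if __name__ == "__main__":
    trie = LogTrie()
    trie.add("Accepted password for * from * port * ssh2", "ssh_login_ok")
    trie.add("Failed password for * from * port * ssh2", "ssh_login_failed")
    print(trie.classify("Failed password for root from 10.0.0.5 port 22 ssh2"))
    print(trie.classify("Connection closed by 10.0.0.5"))
```

Reaching a node that carries a label is the "end point" in the
description: the label is whatever bucket you decided that message
shape belongs in, and a line that falls off the trie stays unclassified.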
>
> I hope that's understandable; I didn't want to write
> out a book.
>
> Anyhow, my ideas didn't end up going anywhere. They
> chose to stay with the RegEx "guessing" method - as
> is the standard. I had a lot of the code I developed
> after I left up on SourceForge for a while, but real
> life took me away from it. I might be able to dig it
> up if anyone is interested.
>
>
> - Erik
>
>
> --- "Marcus J. Ranum" <mjr at ranum.com> wrote:
>
>> Anton Chuvakin wrote:
>>> Anybody care to restart the discussion and see what the collective
>>> wisdom of loganalysis can produce?
>>
>> I am coding on something regarding regexless parsing as we
>> speak. ETA is unknown but certainly before Xmas. It will be
>> open source but not GPL.
>>
>> mjr.
>> _______________________________________________
>> LogAnalysis mailing list
>> LogAnalysis at loganalysis.org
>> http://www.loganalysis.org/mailman/listinfo/loganalysis
>>