Sunday, 21 April 2013

Binary search in text files by timestamp or id


Sometimes I have to deal with huge log files (>5GB). Log files are usually gzipped, so grepping them with zgrep takes time. That is probably fine when I know the exact pattern to search for. But quite often I just want to look at the log around a particular time, and zgrep does not help much there. zless is extremely slow for this task as well.

When the file was plain text, I estimated the byte offset of the lines I was looking for and used dd to jump there quickly. Something like this:

$ dd bs=1024 skip=3000000 if=huge_log.txt | less

I also tried the same approach for a gzipped file:

$ zcat huge_log.txt.gz | dd bs=1024 skip=3000000 | less

But it was still inconvenient. The estimate was usually wrong, and I had to rerun the command several times.

The obvious thing that general-purpose utilities do not take into account is that the log file is sorted by timestamp. A log file usually looks like this:

2013/04/21 20:59:22.234: ConfigParser: reading parameter file_name
2013/04/21 20:59:22.235: ConfigParser: loading extra values

I searched on Google and found a lot of questions like "grep between date ranges in a log".
But all the answers were about using general-purpose utilities or writing a small, specific Perl script.

So I decided to write my own.
Here is how to use it for the given log file structure:

$ bsearch -p '^$[YYYY/MM/DD hh:mm:ss.nnn]' -t '2013/04/21 20:59:22.235' -t '2013/04/21 21:20:00.000' huge_log.txt.gz

In the -p argument I wrote a regular expression with a small enhancement: the characters Y, M, D, h, m, s and n inside $[ ] brackets specify the year, month and so on.
The -t argument is the search string. When -t is given twice, the two values are treated as the begin and end lines.
bsearch will quickly find and print all lines between 20:59:22.235 and 21:20:00.000.
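
To illustrate why the sorted order helps, here is a minimal Python sketch of a binary search over byte offsets. It is not the bsearch source. It assumes an uncompressed file in which every line starts with a fixed-width timestamp in the format shown above, so that plain byte comparison matches chronological order. The names lower_bound and lines_between are made up for this example.

```python
import os

# Assumption: every line starts with a timestamp of exactly this width.
TS_LEN = len("2013/04/21 20:59:22.235")


def line_start_at_or_after(f, pos):
    """Return the offset of the first line that begins at or after byte pos."""
    if pos == 0:
        return 0
    f.seek(pos - 1)
    f.readline()  # skip to the end of the line that contains byte pos - 1
    return f.tell()


def lower_bound(f, size, target):
    """Offset of the first line whose timestamp is >= target (size if none)."""
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        start = line_start_at_or_after(f, mid)
        if start >= size:
            hi = mid  # no complete line starts at or after mid
            continue
        f.seek(start)
        if f.readline()[:TS_LEN] < target:
            lo = start + 1  # this line is too early, look further right
        else:
            hi = mid
    return line_start_at_or_after(f, lo)


def lines_between(path, begin, end):
    """Yield the lines whose timestamp lies in [begin, end], both inclusive."""
    begin_b, end_b = begin.encode(), end.encode()
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(lower_bound(f, size, begin_b))
        for line in f:
            if line[:TS_LEN] > end_b:
                break
            yield line.decode(errors="replace")


if __name__ == "__main__":
    import sys

    # Usage: python sketch.py huge_log.txt '2013/04/21 20:59:22.235' '2013/04/21 21:20:00.000'
    for line in lines_between(sys.argv[1], sys.argv[2], sys.argv[3]):
        sys.stdout.write(line)
```

Each probe reads only one or two lines, so finding the start of the range takes roughly log2(file size) seeks instead of a full scan. The sketch does not handle lines without a timestamp (for example multi-line stack traces), and it does not work on a gzipped file, because a gzip stream cannot be entered at an arbitrary offset.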

Moreover, I decided that it could be used not only for timestamp searches.
If you have a log file whose lines contain some identifier that you know is growing (like a sequence number), you can search for it too. The command will look like:

$ bsearch -p '^.*? sometext id=$[n+]' -t '43567800001' log.txt.gz

If you are familiar with regular expressions, you should understand the meaning. The only addition to a normal regular expression is the $[n+] part, which bsearch replaces with (\d+) when it runs the regex search.
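
The substitution can be sketched in a few lines of Python. Only the $[n+] to (\d+) rule comes from the description above. The treatment of the other placeholder letters (each of Y, M, D, h, m, s, n stands for one digit, everything else inside the brackets is literal) is my assumption for this example, and expand_pattern is a hypothetical name, not part of bsearch.

```python
import re

# Matches one $[...] placeholder and captures its body.
PLACEHOLDER = re.compile(r"\$\[(.*?)\]")
# Assumption: each of these letters stands for exactly one digit.
FIELD_CHARS = set("YMDhmsn")


def expand_placeholder(body):
    """Turn the inside of $[...] into a capturing regex group."""
    parts = []
    for ch in body:
        if ch in FIELD_CHARS:
            parts.append(r"\d")
        elif ch == "+" and parts and parts[-1] == r"\d":
            parts.append("+")  # "n+" means one or more digits
        else:
            parts.append(re.escape(ch))  # separators such as "/", ":" or "."
    return "(" + "".join(parts) + ")"


def expand_pattern(pattern):
    """Replace every $[...] placeholder and leave the rest of the regex alone."""
    return PLACEHOLDER.sub(lambda m: expand_placeholder(m.group(1)), pattern)


if __name__ == "__main__":
    regex = expand_pattern("^.*? sometext id=$[n+]")
    match = re.match(regex, "worker sometext id=43567800001 done")
    print(regex)
    print(match.group(1))
```

The captured group is the key that the search compares against the -t value. For a numeric id the comparison has to be done on integers rather than strings, since "9" sorts after "10" as text.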

I find it very useful myself. So if somebody else finds it useful too, I am pleased to share it: http://code.google.com/p/bsearch/