This is a mirror of official site: http://jasper-net.blogspot.com/

why GNU grep is fast

| Wednesday, August 10, 2011
Hi Gabor,

I am the original author of GNU grep.  I am also a FreeBSD user, although I live on -stable (and older) and rarely pay attention to -current.

However, while searching the -current mailing list for an unrelated reason, I stumbled across some flamage regarding BSD grep vs GNU grep performance.  You may have noticed that discussion too...

Anyway, just FYI, here's a quick summary of where GNU grep gets its speed.  Hopefully you can carry these ideas over to BSD grep.

#1 trick: GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE.

#2 trick: GNU grep is fast because it EXECUTES VERY FEW INSTRUCTIONS FOR EACH BYTE that it *does* look at.

GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can  skip in the input whenever it finds a non-matching character.

GNU grep also unrolls the inner loop of Boyer-Moore, and sets up the Boyer-Moore delta table entries in such a way that it doesn't need to do the loop exit test at every unrolled step.  The result of this is that, in the limit, GNU grep averages fewer than 3 x86 instructions executed for each input byte it actually looks at (and it skips many bytes entirely).

See "Fast String Searching", by Andrew Hume and Daniel Sunday, in the November 1991 issue of Software Practice & Experience, for a good discussion of  Boyer-Moore implementation tricks.  It's available as a free PDF online.

Once you have fast search, you'll find you also need fast input.

GNU grep uses raw Unix input system calls and avoids copying data after reading it.

Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES.  Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!

So instead of using line-oriented input, GNU grep reads raw data into a large buffer, searches the buffer using Boyer-Moore, and only when it finds a match does it go and look for the bounding newlines.
(Certain command line options like -n disable this optimization.)


Read more: FreeBSD list
QR: 019310.html?

Posted via email from Jasper-net

0 comments: