git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Tor Myklebust <tmyklebu@csclub.uwaterloo.ca>
To: Christian Thaeter <ct@pipapo.org>
Cc: libc-alpha@sourceware.org, git@vger.kernel.org
Subject: Re: (class=ham score=-4.96032) memmem.c improvement
Date: Fri, 30 Nov 2007 20:58:10 -0500 (EST)	[thread overview]
Message-ID: <Pine.LNX.4.64.0711301954370.9426@caffeine.csclub.uwaterloo.ca> (raw)
In-Reply-To: <4750AF4F.6090207@pipapo.org>

On Sat, 1 Dec 2007, Christian Thaeter wrote:

> Short story, 'memmem()' is a gnuish, nonstandard function. I wanted to 
> provide a generic fallback in some code. So, lets borrow it somewhere 
> else:
>
> First looking at git's compat/memmem.c which I found was suboptimal with 
> roughly O(M*N) performance (albeit ok for that case since it was just a 
> generic fallback).
>
> Well, second taking a look at glibc's source surprised me, there is the 
> same code as in git. I somewhat expected a faster implementation from a 
> generic C library.

I don't think anybody involved with glibc really feels like having 
strstr() (or memmem(), for that matter) go fast.  (The grounds I heard 
were that "applications requiring large searches are more likely to have 
own fast string search implementations.')

> That thought and done, the code is attached to this mail. The algorithm 
> is similar to the Rabin/Karp method for string searches but uses a 
> weaker and simpler additive rolling hash.

There are rolling hashes based on arithmetic in fields of order 2^k. 
These can typically be computed using a bunch of bit-shifts and additions; 
have you tried using one?  There are lots of irreducible polynomials over 
Z_2 of degree k, so you can even fall back to a different one every few 
false positives.

> The result is still somewhat like O(M+N) for most cases

I don't think you get linear performance in the average case.  (But I do 
think you shave a factor of 256 off of the quadratic term.  The same 
algorithm, where your hash function is the population count, gives a 
collision between two random strings of length m with probability 
sum_(i=0)^m (m choose i)^2 / 4^m, which grows like sqrt(m).  Your 
algorithm helps this by a factor of 256.)

> (There may be corner cases where it not that good, but its really hard 
> to imagine those).

The needle "1100110011001100...1100" and the haystack 
"10101010101010...10" will produce quite a few false matches with your 
hash function (simply because every substring of the haystack of the 
appropriate length has the same hash as your needle).  (Making the needle 
"1010101010...101100" makes it even worse.)

> Anyways, it is always faster than the current implementation,

Try the example above.  (Your code also needs a "return NULL;" at the end, 
by the way.)

> in my tests with common data (parsing multipart/form-data cgi responses) 
> it yields approx 20%-100% improvements and with little artificial 
> cheating (having strings in haystack which only differ at the last char 
> of needle) the improvements are much better (500%, since is another 
> complexity class). The fact is that the old implementation does a brute 
> force search which has corner cases which are quite easy to hit 
> (repeating symbol sequences, small set of possible symbols) while for my 
> implementation the corner cases are much more rare and even then, it 
> will still perform better than the old implementation.

The old implementation is about 50% faster than yours in the example 
above.

> The code is not much more complex as the old one, not the original 
> 'Rabin/Karp' algorithm because that would require a efficient modulo 
> operation with a prime number which might be slower on some platforms 
> (just a guess, I didn't even tried, the performance satisfies me 
> perfectly).

Karp-Rabin with hashing done mod a prime is an impractical string matching 
algorithm.

> Other search algorithms like 'Knuth/Morris/Pratt'

A KMP-like algorithm can be implemented in a way that uses constant 
additional space.  For unordered alphabets, it's called "Galil-Seiferas" 
and for ordered alphabets the analogous algorithm is to decompose (in 
linear time and constant space) the string into its lexicographically 
largest suffix and some prefix.  Then you look for the suffix.  Everywhere 
where the suffix appears and appears at least prefix-length later than the 
last occurrence of the suffix, you check for the prefix.

> or 'Boyer/More' are more complex to implement and require O(Needle) 
> extra memory

There are variants of both that require constant additional space.  (They 
also perform way better than anything else I've seen.)

> for common implementations, which reserve them for special purposes imo. 
> I just wanted to keep it simple but still considerably better than the 
> current one.

As it happens, the extra space consumed by the KMP and suffix shift tables 
(and the associated bad cache effects) make the usual KMP and Boyer-Moore 
way, way slower than naive string matching on random inputs.

  reply	other threads:[~2007-12-01  1:58 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-12-01  0:48 memmem.c improvement Christian Thaeter
2007-12-01  1:58 ` Tor Myklebust [this message]
2007-12-01  2:30   ` (class=ham score=-4.96032) " Christian Thaeter
2007-12-01  3:07     ` Tor Myklebust

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.0711301954370.9426@caffeine.csclub.uwaterloo.ca \
    --to=tmyklebu@csclub.uwaterloo.ca \
    --cc=ct@pipapo.org \
    --cc=git@vger.kernel.org \
    --cc=libc-alpha@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).