From: Christian Thaeter <ct@pipapo.org>
To: Tor Myklebust <tmyklebu@csclub.uwaterloo.ca>
Cc: libc-alpha@sourceware.org, git@vger.kernel.org
Subject: Re: (class=ham score=-4.96032) memmem.c improvement
Date: Sat, 01 Dec 2007 03:30:35 +0100 [thread overview]
Message-ID: <4750C74B.8060308@pipapo.org> (raw)
In-Reply-To: <Pine.LNX.4.64.0711301954370.9426@caffeine.csclub.uwaterloo.ca>
Tor Myklebust wrote:
> On Sat, 1 Dec 2007, Christian Thaeter wrote:
>
>> Short story, 'memmem()' is a gnuish, nonstandard function. I wanted to
>> provide a generic fallback in some code. So, lets borrow it somewhere
>> else:
>>
>> First looking at git's compat/memmem.c which I found was suboptimal
>> with roughly O(M*N) performance (albeit ok for that case since it was
>> just a generic fallback).
>>
>> Well, second taking a look at glibc's source surprised me, there is
>> the same code as in git. I somewhat expected a faster implementation
>> from a generic C library.
>
> I don't think anybody involved with glibc really feels like having
> strstr() (or memmem(), for that matter) go fast. (The grounds I heard
> were that "applications requiring large searches are more likely to have
> own fast string search implementations.')
>
>> That thought and done, the code is attached to this mail. The
>> algorithm is similar to the Rabin/Karp method for string searches but
>> uses a weaker and simpler additive rolling hash.
>
> There are rolling hashes based on arithmetic in fields of order 2^k.
> These can typically be computed using a bunch of bit-shifts and
> additions; have you tried using one? There are lots of irreducible
> polynomials over Z_2 of degree k, so you can even fall back to a
> different one every few false positives.
>
>> The result is still somewhat like O(M+N) for most cases
>
> I don't think you get linear performance in the average case. (But I do
> think you shave a factor of 256 off of the quadratic term. The same
> algorithm, where your hash function is the population count, gives a
> collision between two random strings of length m with probability
> sum_(i=0)^m (m choose i)^2 / 4^m, which grows like sqrt(m). Your
> algorithm helps this by a factor of 256.)
...
>
>> (There may be corner cases where it not that good, but its really hard
>> to imagine those).
>
> The needle "1100110011001100...1100" and the haystack
> "10101010101010...10" will produce quite a few false matches with your
> hash function (simply because every substring of the haystack of the
> appropriate length has the same hash as your needle). (Making the
> needle "1010101010...101100" makes it even worse.)
I am fully aware that this is not the best possible search algorithm. It
is considerably better than the actual one for 'common' data. Having a
string with few symbols or other corner cases needs an algorithm better
suited for that task. But well, this was just reaching a low hanging
fruit. I just wanted to share it because it is better than the algorithm
which is in git and glibc, feel free to submit a even better one or keep
the old one, whatever. For me it suffices and I won't put more efforts
into improving or lobbying it, its just not worth it.
Christian
next prev parent reply other threads:[~2007-12-01 2:32 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-12-01 0:48 memmem.c improvement Christian Thaeter
2007-12-01 1:58 ` (class=ham score=-4.96032) " Tor Myklebust
2007-12-01 2:30 ` Christian Thaeter [this message]
2007-12-01 3:07 ` Tor Myklebust
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4750C74B.8060308@pipapo.org \
--to=ct@pipapo.org \
--cc=git@vger.kernel.org \
--cc=libc-alpha@sourceware.org \
--cc=tmyklebu@csclub.uwaterloo.ca \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.