* memmem.c improvement
@ 2007-12-01 0:48 Christian Thaeter
2007-12-01 1:58 ` Tor Myklebust
0 siblings, 1 reply; 4+ messages in thread
From: Christian Thaeter @ 2007-12-01 0:48 UTC (permalink / raw)
To: libc-alpha, git
[-- Attachment #1: Type: text/plain, Size: 2110 bytes --]
Short story: 'memmem()' is a gnuish, nonstandard function. I wanted to
provide a generic fallback in some code. So, let's borrow it from
somewhere else:

First, looking at git's compat/memmem.c, I found it suboptimal, with
roughly O(M*N) performance (albeit OK for that case, since it is just a
generic fallback).

Second, taking a look at glibc's source surprised me: there is the same
code as in git. I had somewhat expected a faster implementation from a
generic C library.
That thought, and done: the code is attached to this mail. The algorithm
is similar to the Rabin/Karp method for string searches but uses a
weaker and simpler additive rolling hash. The result is still somewhat
like O(M+N) in most cases (there may be corner cases where it is not
that good, but they are really hard to imagine). Anyway, it is always
faster than the current implementation: in my tests with common data
(parsing multipart/form-data CGI responses) it yields approximately
20%-100% improvements, and with a little artificial cheating (strings in
the haystack which differ from the needle only at its last character)
the improvements are much bigger (500%, since it is in another
complexity class). The fact is that the old implementation does a
brute-force search, whose corner cases are quite easy to hit (repeating
symbol sequences, a small set of possible symbols), while for my
implementation the corner cases are much rarer, and even then it will
still perform better than the old implementation.
The code is not much more complex than the old one. It is not the
original 'Rabin/Karp' algorithm, because that would require an efficient
modulo operation with a prime number, which might be slow on some
platforms (just a guess; I didn't even try, since the performance
satisfies me perfectly). Other search algorithms like
'Knuth/Morris/Pratt' or 'Boyer/Moore' are more complex to implement and
require O(needle) extra memory in common implementations, which reserves
them for special purposes, IMO. I just wanted to keep it simple but
still considerably better than the current one.
Feel free to use the code in glibc and/or git.
Christian
[-- Attachment #2: memmem.c --]
[-- Type: text/x-csrc, Size: 2251 bytes --]
/* Copyright (C) 1991,92,93,94,96,97,98,2000,2004 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, write to the Free
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
02111-1307 USA. */
#include <stddef.h>
#include <string.h>

#ifndef _LIBC
# define __builtin_expect(expr, val) (expr)
#endif

#undef memmem

/* Return the first occurrence of NEEDLE in HAYSTACK.  */
void *
memmem (const void *haystack, size_t haystack_len,
        const void *needle, size_t needle_len)
{
  /* Not really Rabin-Karp, just using an additive rolling hash.  */
  const char *haystack_ = (const char *) haystack;
  const char *needle_ = (const char *) needle;
  int hash = 0;      /* static hash value of the needle */
  int hay_hash = 0;  /* rolling hash over the haystack */
  const char *last;
  size_t i;

  if (haystack_len < needle_len)
    return NULL;

  if (!needle_len)
    return (void *) haystack_;

  /* Initialize both hashes over the first NEEDLE_LEN bytes.  */
  for (i = needle_len; i; --i)
    {
      hash += *needle_++;
      hay_hash += *haystack_++;
    }

  /* Iterate over the haystack.  */
  haystack_ = (const char *) haystack;
  needle_ = (const char *) needle;
  last = haystack_ + (haystack_len - needle_len + 1);
  for (; haystack_ < last; ++haystack_)
    {
      /* Compare the first byte before calling memcmp; an optimization
         kept from the existing glibc code.  */
      if (__builtin_expect (hash == hay_hash, 0)
          && *haystack_ == *needle_
          && !memcmp (haystack_, needle_, needle_len))
        return (void *) haystack_;

      /* Roll the hash: drop the byte leaving the window, add the byte
         entering it.  Skip the roll on the final iteration so we never
         read past the end of the haystack.  */
      if (haystack_ + 1 < last)
        {
          hay_hash -= *haystack_;
          hay_hash += *(haystack_ + needle_len);
        }
    }

  return NULL;
}
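
A minimal self-test sketch, assuming the function above is linked in
(the test string and the offset 21 are purely illustrative):

#include <assert.h>
#include <stddef.h>

void *memmem (const void *haystack, size_t haystack_len,
              const void *needle, size_t needle_len);

int
main (void)
{
  const char hay[] = "Content-Disposition: form-data; name=\"field\"";
  const size_t n = sizeof hay - 1;

  /* "form-data" starts right after "Content-Disposition: ", i.e. at
     offset 21.  */
  assert ((const char *) memmem (hay, n, "form-data", 9) == hay + 21);

  /* An empty needle matches at the start of the haystack.  */
  assert (memmem (hay, n, "", 0) == (const void *) hay);

  /* A needle longer than the haystack never matches.  */
  assert (memmem (hay, n, hay, n + 1) == NULL);

  return 0;
}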
* Re: memmem.c improvement
2007-12-01 0:48 memmem.c improvement Christian Thaeter
@ 2007-12-01 1:58 ` Tor Myklebust
2007-12-01 2:30 ` Christian Thaeter
0 siblings, 1 reply; 4+ messages in thread
From: Tor Myklebust @ 2007-12-01 1:58 UTC (permalink / raw)
To: Christian Thaeter; +Cc: libc-alpha, git
On Sat, 1 Dec 2007, Christian Thaeter wrote:
> Short story: 'memmem()' is a gnuish, nonstandard function. I wanted to
> provide a generic fallback in some code. So, let's borrow it from
> somewhere else:
>
> First, looking at git's compat/memmem.c, I found it suboptimal, with
> roughly O(M*N) performance (albeit OK for that case, since it is just a
> generic fallback).
>
> Second, taking a look at glibc's source surprised me: there is the same
> code as in git. I had somewhat expected a faster implementation from a
> generic C library.
I don't think anybody involved with glibc really feels like having
strstr() (or memmem(), for that matter) go fast. (The grounds I heard
were that "applications requiring large searches are more likely to have
their own fast string search implementations.")
> That thought, and done: the code is attached to this mail. The algorithm
> is similar to the Rabin/Karp method for string searches but uses a
> weaker and simpler additive rolling hash.
There are rolling hashes based on arithmetic in fields of order 2^k.
These can typically be computed using a bunch of bit-shifts and additions;
have you tried using one? There are lots of irreducible polynomials over
Z_2 of degree k, so you can even fall back to a different one every few
false positives.
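
For illustration, a minimal sketch of such a shift-and-XOR rolling hash
over GF(2^32) could look like the following. Take the reduction
polynomial (0xAF, i.e. x^32 + x^7 + x^5 + x^3 + x^2 + x + 1, commonly
listed in tables of primitive polynomials mod 2) and all names here as
unvetted assumptions, not a finished implementation:

#include <stddef.h>
#include <stdint.h>

/* Low 32 bits of the reduction polynomial; the x^32 term is implicit.
   Commonly listed as primitive over Z_2, but verify before relying
   on it.  */
#define POLY 0x000000AFu

/* Multiply H by x in GF(2^32): shift, then reduce if the x^32
   coefficient was set.  Addition in characteristic 2 is XOR.  */
static uint32_t
mul_x (uint32_t h)
{
  return (h << 1) ^ ((h >> 31) ? POLY : 0u);
}

/* Multiply H by x^8, i.e. make room for one incoming byte.  */
static uint32_t
mul_x8 (uint32_t h)
{
  int i;
  for (i = 0; i < 8; ++i)
    h = mul_x (h);
  return h;
}

/* Hash of the LEN-byte window starting at S.  */
static uint32_t
window_hash (const unsigned char *s, size_t len)
{
  uint32_t h = 0;
  size_t i;
  for (i = 0; i < len; ++i)
    h = mul_x8 (h) ^ s[i];
  return h;
}

/* Precompute OUT_TAB[b] = b * x^(8*(LEN-1)), so the byte leaving the
   window can be cancelled in O(1) per step.  */
static void
make_out_tab (uint32_t out_tab[256], size_t len)
{
  unsigned int b;
  size_t i;
  for (b = 0; b < 256; ++b)
    {
      uint32_t h = b;
      for (i = 1; i < len; ++i)
        h = mul_x8 (h);
      out_tab[b] = h;
    }
}

/* Slide the window one byte: cancel OUT via OUT_TAB, append IN.  */
static uint32_t
roll (uint32_t h, const uint32_t out_tab[256], unsigned char out,
      unsigned char in)
{
  return mul_x8 (h ^ out_tab[out]) ^ in;
}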
> The result is still somewhat like O(M+N) in most cases
I don't think you get linear performance in the average case. (But I do
think you shave a factor of 256 off of the quadratic term. The same
algorithm, where your hash function is the population count, gives a
collision between two random strings of length m with probability
sum_(i=0)^m (m choose i)^2 / 4^m, which decays only like 1/sqrt(m). Your
algorithm helps this by a factor of 256.)
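
To spell the estimate out: by the Vandermonde identity and Stirling's
approximation,

  \sum_{i=0}^{m} \binom{m}{i}^{2} = \binom{2m}{m}, \qquad
  \frac{1}{4^{m}} \binom{2m}{m} \sim \frac{1}{\sqrt{\pi m}},

so the collision probability decays only polynomially in m, far from
the 2^{-k} one would want from a k-bit hash.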
> (There may be corner cases where it is not that good, but they are
> really hard to imagine.)
The needle "1100110011001100...1100" and the haystack
"10101010101010...10" will produce quite a few false matches with your
hash function (simply because every substring of the haystack of the
appropriate length has the same hash as your needle). (Making the needle
"1010101010...101100" makes it even worse.)
> Anyway, it is always faster than the current implementation,
Try the example above. (Your code also needs a "return NULL;" at the end,
by the way.)
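
As a hypothetical way to reproduce this, the following driver builds
that needle and haystack; memmem here is assumed to be the
additive-hash implementation attached above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void *memmem (const void *haystack, size_t haystack_len,
              const void *needle, size_t needle_len);

int
main (void)
{
  size_t n = 1u << 20;   /* haystack length */
  size_t m = 1u << 10;   /* needle length (both even) */
  char *hay = malloc (n);
  char *needle = malloc (m);
  size_t i;

  if (!hay || !needle)
    return 1;

  /* Haystack "101010...10".  */
  for (i = 0; i < n; ++i)
    hay[i] = "10"[i & 1];

  /* Needle "1010...10" ending in "1100": every length-M window of the
     haystack then has the same additive hash as the needle (equal
     counts of '1' and '0'), and memcmp only fails near the end of the
     needle, so the search degenerates towards O(M*N).  */
  for (i = 0; i < m; ++i)
    needle[i] = "10"[i & 1];
  memcpy (needle + m - 4, "1100", 4);

  printf ("%s\n", memmem (hay, n, needle, m) ? "found" : "not found");

  free (hay);
  free (needle);
  return 0;
}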
> in my tests with common data (parsing multipart/form-data CGI responses)
> it yields approximately 20%-100% improvements, and with a little
> artificial cheating (strings in the haystack which differ from the
> needle only at its last character) the improvements are much bigger
> (500%, since it is in another complexity class). The fact is that the
> old implementation does a brute-force search, whose corner cases are
> quite easy to hit (repeating symbol sequences, a small set of possible
> symbols), while for my implementation the corner cases are much rarer,
> and even then it will still perform better than the old implementation.
The old implementation is about 50% faster than yours in the example
above.
> The code is not much more complex than the old one. It is not the
> original 'Rabin/Karp' algorithm, because that would require an efficient
> modulo operation with a prime number, which might be slow on some
> platforms (just a guess; I didn't even try, since the performance
> satisfies me perfectly).
Karp-Rabin with hashing done mod a prime is an impractical string matching
algorithm.
> Other search algorithms like 'Knuth/Morris/Pratt'
A KMP-like algorithm can be implemented in a way that uses constant
additional space. For unordered alphabets, it's called "Galil-Seiferas"
and for ordered alphabets the analogous algorithm is to decompose (in
linear time and constant space) the string into its lexicographically
largest suffix and some prefix. Then you look for the suffix. Wherever
the suffix appears at least prefix-length later than its last
occurrence, you check for the prefix.
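
For concreteness, a sketch of that decomposition step: the classic
linear-time, constant-space scan for the start of the lexicographically
largest suffix (as used in, e.g., the Crochemore-Perrin two-way
algorithm; an illustration of the ordered-alphabet case, not
Galil-Seiferas itself):

#include <stddef.h>

/* Return the start index of the lexicographically largest suffix of
   the LEN-byte string S, in linear time and constant extra space.  */
static size_t
maximal_suffix (const unsigned char *s, size_t len)
{
  size_t best = 0;   /* start of the best suffix found so far */
  size_t cand = 1;   /* start of the candidate suffix */
  size_t match = 0;  /* number of bytes matched so far */

  while (cand + match < len)
    {
      if (s[best + match] == s[cand + match])
        ++match;
      else if (s[best + match] > s[cand + match])
        {
          /* Candidate loses: no suffix starting inside the matched
             region can beat BEST, so skip past the mismatch.  */
          cand += match + 1;
          match = 0;
        }
      else
        {
          /* Candidate wins: the new best suffix starts after the
             matched region or at the candidate, whichever is later.  */
          best += match + 1;
          if (best < cand)
            best = cand;
          cand = best + 1;
          match = 0;
        }
    }

  return best;
}

The needle is then split into prefix and suffix at the returned index,
and the search proceeds as described above.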
> or 'Boyer/Moore' are more complex to implement and require O(needle)
> extra memory
There are variants of both that require constant additional space. (They
also perform way better than anything else I've seen.)
> in common implementations, which reserves them for special purposes, IMO.
> I just wanted to keep it simple but still considerably better than the
> current one.
As it happens, the extra space consumed by the KMP and suffix shift tables
(and the associated bad cache effects) make the usual KMP and Boyer-Moore
way, way slower than naive string matching on random inputs.
* Re: memmem.c improvement
2007-12-01 1:58 ` Tor Myklebust
@ 2007-12-01 2:30 ` Christian Thaeter
2007-12-01 3:07 ` Tor Myklebust
0 siblings, 1 reply; 4+ messages in thread
From: Christian Thaeter @ 2007-12-01 2:30 UTC (permalink / raw)
To: Tor Myklebust; +Cc: libc-alpha, git
Tor Myklebust wrote:
> On Sat, 1 Dec 2007, Christian Thaeter wrote:
>
>> Short story: 'memmem()' is a gnuish, nonstandard function. I wanted to
>> provide a generic fallback in some code. So, let's borrow it from
>> somewhere else:
>>
>> First, looking at git's compat/memmem.c, I found it suboptimal, with
>> roughly O(M*N) performance (albeit OK for that case, since it is just
>> a generic fallback).
>>
>> Second, taking a look at glibc's source surprised me: there is the
>> same code as in git. I had somewhat expected a faster implementation
>> from a generic C library.
>
> I don't think anybody involved with glibc really feels like having
> strstr() (or memmem(), for that matter) go fast. (The grounds I heard
> were that "applications requiring large searches are more likely to have
> their own fast string search implementations.")
>
>> That thought, and done: the code is attached to this mail. The
>> algorithm is similar to the Rabin/Karp method for string searches but
>> uses a weaker and simpler additive rolling hash.
>
> There are rolling hashes based on arithmetic in fields of order 2^k.
> These can typically be computed using a bunch of bit-shifts and
> additions; have you tried using one? There are lots of irreducible
> polynomials over Z_2 of degree k, so you can even fall back to a
> different one every few false positives.
>
>> The result is still somewhat like O(M+N) in most cases
>
> I don't think you get linear performance in the average case. (But I do
> think you shave a factor of 256 off of the quadratic term. The same
> algorithm, where your hash function is the population count, gives a
> collision between two random strings of length m with probability
> sum_(i=0)^m (m choose i)^2 / 4^m, which decays only like 1/sqrt(m). Your
> algorithm helps this by a factor of 256.)
...
>
>> (There may be corner cases where it is not that good, but they are
>> really hard to imagine.)
>
> The needle "1100110011001100...1100" and the haystack
> "10101010101010...10" will produce quite a few false matches with your
> hash function (simply because every substring of the haystack of the
> appropriate length has the same hash as your needle). (Making the
> needle "1010101010...101100" makes it even worse.)
I am fully aware that this is not the best possible search algorithm. It
is considerably better than the current one for 'common' data. Having a
string with few symbols or other corner cases needs an algorithm better
suited for that task. But well, this was just reaching for low-hanging
fruit. I just wanted to share it because it is better than the algorithm
that is in git and glibc; feel free to submit an even better one or keep
the old one, whatever. For me it suffices, and I won't put more effort
into improving or lobbying for it; it's just not worth it.
Christian
* Re: memmem.c improvement
2007-12-01 2:30 ` Christian Thaeter
@ 2007-12-01 3:07 ` Tor Myklebust
0 siblings, 0 replies; 4+ messages in thread
From: Tor Myklebust @ 2007-12-01 3:07 UTC (permalink / raw)
To: Christian Thaeter; +Cc: libc-alpha, git
On Sat, 1 Dec 2007, Christian Thaeter wrote:
> I am fully aware that this is not the best possible search algorithm.
> It is considerably better than the current one for 'common' data.
And not considerably worse in the worst case I could think of. (Which was
my point, not that it was slow or sucked in some other way.)
> Having a string with few symbols or other corner cases needs an
> algorithm better suited for that task.
Yes, if it's actually worth anyone's time making strstr() fast in the case
where the haystack has length three and the needle length two.
Amusingly enough, I haven't ever seen strstr() used to do any nontrivial
string matching in any free software I've bothered to grep.
> But well, this was just reaching for low-hanging fruit.
Suitably hacked, your code looks clearly correct. But there are valid
reasons not to mess with the strstr() in libc (fewer for memmem()); chief
among them is that *if* there is an undetected bug in the code, *then*
lots of stuff can break.
> For me it suffices, and I won't put more effort into improving or
> lobbying for it; it's just not worth it.
Clearly, the tone of my email did not help convey my message. (Perhaps I
should just stop trying to write words of encouragement altogether --- it
never goes according to plan and always seems to end in disaster.) I do
commend you for submitting improvements (and this is clearly an
improvement); I was merely trying to point out that it will likely be an
uphill battle, and that certain things are either wrong or can be done
better than you have done. Do go on, please! (And I'm sorry if my
previous email caused you particular dismay.)
Tor Myklebust
Thread overview: 4+ messages
2007-12-01 0:48 memmem.c improvement Christian Thaeter
2007-12-01 1:58 ` Tor Myklebust
2007-12-01 2:30 ` Christian Thaeter
2007-12-01 3:07 ` Tor Myklebust