git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* memmem.c improvement
@ 2007-12-01  0:48 Christian Thaeter
  2007-12-01  1:58 ` (class=ham score=-4.96032) " Tor Myklebust
  0 siblings, 1 reply; 4+ messages in thread
From: Christian Thaeter @ 2007-12-01  0:48 UTC (permalink / raw)
  To: libc-alpha, git

[-- Attachment #1: Type: text/plain, Size: 2110 bytes --]

Short story, 'memmem()' is a gnuish, nonstandard function. I wanted to
provide a generic fallback in some code. So, lets borrow it somewhere else:

First looking at git's compat/memmem.c which I found was suboptimal with
roughly O(M*N) performance (albeit ok for that case since it was just a
generic fallback).

Well, second taking a look at glibc's source surprised me, there is the
same code as in git. I somewhat expected a faster implementation from a
generic C library.

That thought and done, the code is attached to this mail. The algorithm
is similar to the Rabin/Karp method for string searches but uses a
weaker and simpler additive rolling hash. The result is still somewhat
like O(M+N) for most cases (There may be corner cases where it not that
good, but its really hard to imagine those). Anyways, it is always
faster than the current implementation, in my tests with common data
(parsing multipart/form-data cgi responses) it yields approx 20%-100%
improvements and with little artificial cheating (having strings in
haystack which only differ at the last char of needle) the improvements
are much better (500%, since is another complexity class). The fact is
that the old implementation does a brute force search which has corner
cases which are quite easy to hit (repeating symbol sequences, small set
of possible symbols) while for my implementation the corner cases are
much more rare and even then, it will still perform better than the old
implementation.

The code is not much more complex as the old one, not the original
'Rabin/Karp' algorithm because that would require a efficient modulo
operation with a prime number which might be slower on some platforms
(just a guess, I didn't even tried, the performance satisfies me
perfectly). Other search algorithms like 'Knuth/Morris/Pratt' or
'Boyer/More' are more complex to implement and require O(Needle) extra
memory for common implementations, which reserve them for special
purposes imo. I just wanted to keep it simple but still considerably
better than the current one.

Feel free to use the code in glibc and/or git.

	Christian

[-- Attachment #2: memmem.c --]
[-- Type: text/x-csrc, Size: 2251 bytes --]

/* Copyright (C) 1991,92,93,94,96,97,98,2000,2004 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

#include <stddef.h>
#include <string.h>

#ifndef _LIBC
# define __builtin_expect(expr, val)   (expr)
#endif

#undef memmem

/* Return the first occurrence of NEEDLE in HAYSTACK. */
void *
memmem (haystack, haystack_len, needle, needle_len)
     const void *haystack;
     size_t haystack_len;
     const void *needle;
     size_t needle_len;
{
  /* not really Rabin-Karp, just using additive hashing */
  char* haystack_ = (char*)haystack;
  char* needle_ = (char*)needle;
  int hash = 0;		/* this is the static hash value of the needle */
  int hay_hash = 0;	/* rolling hash over the haystack */
  char* last;
  size_t i;

  if (haystack_len < needle_len)
    return NULL;

  if (!needle_len)
    return haystack_;

  /* initialize hashes */
  for (i = needle_len; i; --i)
    {
      hash += *needle_++;
      hay_hash += *haystack_++;
    }

  /* iterate over the haystack */
  haystack_ = (char*)haystack;
  needle_ = (char*)needle;
  last = haystack_+(haystack_len - needle_len + 1);
  for (; haystack_ < last; ++haystack_)
    {
      if (__builtin_expect(hash == hay_hash, 0) &&
	  *haystack_ == *needle_ &&	/* prevent calling memcmp, was a optimization from existing glibc */
	  !memcmp (haystack_, needle_, needle_len))
	return haystack_;

      /* roll the hash */
      hay_hash -= *haystack_;
      hay_hash += *(haystack_+needle_len);
    }

  return NULL;
}

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2007-12-01  3:07 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-12-01  0:48 memmem.c improvement Christian Thaeter
2007-12-01  1:58 ` (class=ham score=-4.96032) " Tor Myklebust
2007-12-01  2:30   ` Christian Thaeter
2007-12-01  3:07     ` Tor Myklebust

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).