linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Big git diff speedup by avoiding x86 "fast string" memcmp
@ 2010-12-09  7:09 Nick Piggin
  2010-12-09 13:37 ` Borislav Petkov
                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Nick Piggin @ 2010-12-09  7:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, x86, linux-kernel, linux-fsdevel

I was actually discussing this with Linus a while back, and finally
got around to testing it out now that I have a modern CPU to measure
it on! CCing linux-arch because it would be interesting to know
whether your tuned functions do better than gcc or not (I would
suspect not).

BTW. patch and numbers are on top of my scaling series, just for
an idea of what it does, I just want to generate some interesting
discussion.

If people are interested in running benchmarks, I'll be pushing out
a new update soon, after some more testing and debugging here.

The standard memcmp function on a Westmere system shows up hot in
profiles in the `git diff` workload (both parallel and single threaded),
and it is likely due to the costs associated with trapping into
microcode, and little opportunity to improve memory access (dentry
name is not likely to take up more than a cacheline).

So replace it with an open-coded byte comparison. This increases code
size by 24 bytes in the critical __d_lookup_rcu function, but the
speedup is huge, averaging 10 runs of each:

git diff st   user   sys   elapsed  CPU
before        1.15   2.57  3.82      97.1
after         1.14   2.35  3.61      96.8

git diff mt   user   sys   elapsed  CPU
before        1.27   3.85  1.46     349
after         1.26   3.54  1.43     333

Elapsed time for single threaded git diff at 95.0% confidence:
        -0.21  +/- 0.01
        -5.45% +/- 0.24%

Parallel case doesn't gain much, but that's because userspace runs
out of work to feed it -- efficiency is way up, though.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c	2010-12-09 05:07:19.000000000 +1100
+++ linux-2.6/fs/dcache.c	2010-12-09 17:37:30.000000000 +1100
@@ -1423,7 +1423,7 @@ static struct dentry *__d_instantiate_un
 			goto next;
 		if (qstr->len != len)
 			goto next;
-		if (memcmp(qstr->name, name, len))
+		if (dentry_memcmp(qstr->name, name, len))
 			goto next;
 		__dget_dlock(alias);
 		spin_unlock(&alias->d_lock);
@@ -1771,7 +1771,7 @@ struct dentry *__d_lookup_rcu(struct den
 		} else {
 			if (tlen != len)
 				continue;
-			if (memcmp(tname, str, tlen))
+			if (dentry_memcmp(tname, str, tlen))
 				continue;
 		}
 		/*
@@ -1901,7 +1901,7 @@ struct dentry *__d_lookup(struct dentry
 		} else {
 			if (tlen != len)
 				goto next;
-			if (memcmp(tname, str, tlen))
+			if (dentry_memcmp(tname, str, tlen))
 				goto next;
 		}
 
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h	2010-12-09 05:07:52.000000000 +1100
+++ linux-2.6/include/linux/dcache.h	2010-12-09 05:08:36.000000000 +1100
@@ -47,6 +47,20 @@ struct dentry_stat_t {
 };
 extern struct dentry_stat_t dentry_stat;
 
+static inline int dentry_memcmp(const unsigned char *cs,
+				const unsigned char *ct, size_t count)
+{
+	while (count) {
+		int ret = (*cs != *ct);
+		if (ret)
+			return ret;
+		cs++;
+		ct++;
+		count--;
+	}
+	return 0;
+}
+
 /* Name hashing routines. Initial hash value */
 /* Hash courtesy of the R5 hash in reiserfs modulo sign bits */
 #define init_name_hash()		0

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: Big git diff speedup by avoiding x86 "fast string" memcmp
@ 2010-12-18 22:54 George Spelvin
  2010-12-19 14:28 ` Boaz Harrosh
  2010-12-19 15:46 ` Nick Piggin
  0 siblings, 2 replies; 29+ messages in thread
From: George Spelvin @ 2010-12-18 22:54 UTC (permalink / raw)
  To: bharrosh; +Cc: linux, linux-arch, linux-fsdevel, linux-kernel

> static inline int dentry_memcmp_long(const unsigned char *cs,
> 				const unsigned char *ct, ssize_t count)
> {
> 	int ret;
> 	const unsigned long *ls = (const unsigned long *)cs;
> 	const unsigned long *lt = (const unsigned long *)ct;
> 
> 	while (count > 8) {
> 		ret = (*cs != *ct);
> 		if (ret)
> 			break;
> 		cs++;
> 		ct++;
> 		count-=8;
> 	}
> 	if (count) {
> 		unsigned long t = *ct & ((0xffffffffffffffff >> ((8 - count) * 8))
> 		ret = (*cs != t)
> 	}
> 
> 	return ret;
> }

First, let's get the code right, and use correct types, but also, there
are some tricks to reduce the masking cost.

As long as you have to mask one string, *and* don't have to worry about
running off the end of mapped memory, there's no additional cost to
masking both in the loop.  Just test (a ^ b) & mask.

#if 1	/* Table lookup */

static unsigned long const dentry_memcmp_mask[8] = {
#if defined(__LITTLE_ENDIAN)
	0xff, 0xffff, 0xffffff, 0xffffffff,
#if sizeof (unsigned long) > 4
	0xffffffffff, 0xffffffffffff, 0xffffffffffffff, 0xffffffffffffffff
#endif
#elif defined(__BIG_ENDIAN)
#if sizeof (unsigned long) == 4
	0xff000000, 0xffff0000, 0xffffff00, 0xffffffff
#else
	0xff00000000000000, 0xffff000000000000,
	0xffffff0000000000, 0xffffffff00000000,
	0xffffffffff000000, 0xffffffffffff0000,
	0xffffffffffffff00, 0xffffffffffffffff
#endif
#else
#error undefined endianness
#endif
};

#define dentry_memcmp_mask(count) dentry_memcmp_mask[(count)-1]

#else /* In-line code */

#if defined(__LITTLE_ENDIAN)
#define dentry_memcmp_mask(count) (-1ul >> (sizeof 1ul - (count)) * 8)
#elif defined(__BIG_ENDIAN)
#define dentry_memcmp_mask(count) (-1ul << (sizeof 1ul - (count)) * 8)
#else
#error undefined endianness

#endif

static inline bool dentry_memcmp_long(unsigned char const *cs,
				unsigned char const *ct, ssize_t count)
{
	unsigned long const *ls = (unsigned long const *)cs;
	unsigned long const *lt = (unsigned long const *)ct;

	while (count > sizeof *cs) {
		if (*cs != *ct)
			return true;
		cs++;
		ct++;
		count -= sizeof *cs;
	}
	/* Last 1..8 bytes */
	return ((*ct ^ *cs) & dentry_memcmp_mask(count)) != 0;
}

If your string is at least 8 bytes long, and the processor has fast unaligned
loads, you can skip the mask entirely by redundantly comparing some bytes
(although the code bloat is probably undesirable for inline code):

static inline bool dentry_memcmp_long(const unsigned char *cs,
				const unsigned char *ct, ssize_t count)
{
	unsigned long const *ls = (unsigned long const *)cs;
	unsigned long const *lt = (unsigned long const *)ct;

	if (count < sizeof *cs)
		return ((*ct ^ *cs) & dentry_memcmp_mask(count)) != 0;

	while (count > sizeof *cs) {
		if (*cs != *ct)
			return true;
		cs++;
		ct++;
		count -= sizeof *cs;
	}
	cs = (unsigned long const *)((char const *)cs + count - sizeof *cs);
	ct = (unsigned long const *)((char const *)ct + count - sizeof *ct);
	return *cs != *ct;
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2010-12-21  9:26 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-12-09  7:09 Big git diff speedup by avoiding x86 "fast string" memcmp Nick Piggin
2010-12-09 13:37 ` Borislav Petkov
2010-12-10  2:38   ` Nick Piggin
2010-12-10  4:27 ` Nick Piggin
2010-12-10 14:23 ` J. R. Okajima
2010-12-13  1:45   ` Nick Piggin
2010-12-13  7:29     ` J. R. Okajima
2010-12-13  8:25       ` Nick Piggin
2010-12-14 19:01         ` J. R. Okajima
2010-12-15  4:06           ` Nick Piggin
2010-12-15  5:57             ` J. R. Okajima
2010-12-15 13:15             ` Boaz Harrosh
2010-12-15 18:00               ` David Miller
2010-12-16  9:53                 ` Boaz Harrosh
2010-12-16 13:13                   ` Nick Piggin
2010-12-16 14:03                     ` Boaz Harrosh
2010-12-16 14:15                       ` Nick Piggin
2010-12-16 16:51                   ` Linus Torvalds
2010-12-16 17:57                   ` David Miller
2010-12-15  4:38         ` Américo Wang
2010-12-15  5:54           ` Nick Piggin
2010-12-15  7:12             ` Linus Torvalds
2010-12-15 23:09 ` Tony Luck
2010-12-16  2:34   ` Nick Piggin
  -- strict thread matches above, loose matches on Subject: below --
2010-12-18 22:54 George Spelvin
2010-12-19 14:28 ` Boaz Harrosh
2010-12-19 15:46 ` Nick Piggin
2010-12-19 17:06   ` George Spelvin
2010-12-21  9:26     ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).