* .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
@ 2012-02-29 23:36 Linus Torvalds
From: Linus Torvalds @ 2012-02-29 23:36 UTC (permalink / raw)
To: Linux Kernel Mailing List, linux-fsdevel, Al Viro
So I'm doing my normal profiling ("empty kernel build" is my favorite
one), and link_path_walk() and __d_lookup_rcu() remain some of the
hottest kernel functions due to their per-character loops.
I can improve __d_lookup_rcu() on my machine by what appears to be
around 15% by doing things an "unsigned long" at a time (it would be an
option that only works on little-endian and with cheap unaligned
accesses, although the big-endian modifications should be pretty
trivial).
Sure, that optimization would have to be disabled if you do
DEBUG_PAGEALLOC, because it might opportunistically access bytes past
the end of the string, but it does seem to be a very reasonable and
easy thing to do apart from that small detail, and the numbers do look
good.
Basically, dentry_cmp() just becomes
        /* Little-endian with fast unaligned accesses? */
        unsigned long a,b,mask;

        if (scount != tcount)
                return 1;

        for (;;) {
                a = *(unsigned long *)cs;
                b = *(unsigned long *)ct;
                if (tcount < sizeof(unsigned long))
                        break;
                if (a != b)
                        return 1;
                cs += sizeof(unsigned long);
                ct += sizeof(unsigned long);
                tcount -= sizeof(unsigned long);
                if (!tcount)
                        return 0;
        }
        mask = ~(~0ul << tcount*8);
        return !!((a ^ b) & mask);
for that case, and gcc generates good code for it.
However, doing the same thing for link_path_walk() would require that
we actually change the hash function we use internally in the VFS
layer, and while I think that shouldn't really be a problem, I worry
that some filesystem might actually use the hash we generate and save
it somewhere on disk (rather than only use it for the hashed lookup
itself).
Computing the hash one long-word at a time is trivial if we just
change what the hash is. Finding the terminating NUL or '/' characters
involves some big constants (0x2f2f2f2f2f2f2f2f,
0x0101010101010101 and 0x8080808080808080) but seems similarly
easy. But if filesystems actually depend on our current hash
algorithm, the word-at-a-time model falls apart.
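(For the curious: the usual trick for that, on 64-bit little-endian and
purely as an illustrative sketch rather than the final code, looks
something like this; the has_nul_or_slash() helper name is just made up
here:)

        #define ONEBYTES   0x0101010101010101ul
        #define SLASHBYTES 0x2f2f2f2f2f2f2f2ful
        #define HIGHBITS   0x8080808080808080ul

        /* Non-zero iff 'a' contains a zero byte; the lowest set high bit
           marks the first one (higher bits can be borrow artifacts). */
        static inline unsigned long has_zero(unsigned long a)
        {
                return ((a - ONEBYTES) & ~a) & HIGHBITS;
        }

        /* Flag the first NUL or '/' byte in a word of pathname */
        static inline unsigned long has_nul_or_slash(unsigned long a)
        {
                return has_zero(a) | has_zero(a ^ SLASHBYTES);
        }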
Anybody? I think we've kept the namehash unchanged since very early
on, so I could imagine that somebody has grown to think that it's
"stable". As far as I can tell, the current hash function goes back to
2.4.2 (just over ten years ago).
Linus
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
From: Steven Whitehouse @ 2012-03-01 10:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Linux Kernel Mailing List, linux-fsdevel, Al Viro
Hi,
On Wed, 2012-02-29 at 15:36 -0800, Linus Torvalds wrote:
> So I'm doing my normal profiling ("empty kernel build" is my favorite
> one), and link_path_walk() and __d_lookup_rcu() remain some of the
> hottest kernel functions due to their per-character loops.
>
> I can improve __d_lookup_rcu() on my machine by what appears to be
> around 15% by doing things an "unsigned long" at a time (it would be an
> option that only works on little-endian and with cheap unaligned
> accesses, although the big-endian modifications should be pretty
> trivial).
>
> Sure, that optimization would have to be disabled if you do
> DEBUG_PAGEALLOC, because it might opportunistically access bytes past
> the end of the string, but it does seem to be a very reasonable and
> easy thing to do apart from that small detail, and the numbers do look
> good.
>
> Basically, dentry_cmp() just becomes
>
>         /* Little-endian with fast unaligned accesses? */
>         unsigned long a,b,mask;
>
>         if (scount != tcount)
>                 return 1;
>
>         for (;;) {
>                 a = *(unsigned long *)cs;
>                 b = *(unsigned long *)ct;
>                 if (tcount < sizeof(unsigned long))
>                         break;
>                 if (a != b)
>                         return 1;
>                 cs += sizeof(unsigned long);
>                 ct += sizeof(unsigned long);
>                 tcount -= sizeof(unsigned long);
>                 if (!tcount)
>                         return 0;
>         }
>         mask = ~(~0ul << tcount*8);
>         return !!((a ^ b) & mask);
>
> for that case, and gcc generates good code for it.
>
> However, doing the same thing for link_path_walk() would require that
> we actually change the hash function we use internally in the VFS
> layer, and while I think that shouldn't really be a problem, I worry
> that some filesystem might actually use the hash we generate and save
> it somewhere on disk (rather than only use it for the hashed lookup
> itself).
>
GFS2 is guilty as charged, m'lud! (but see below...)
from fs/gfs2/dentry.c:
static int gfs2_dhash(const struct dentry *dentry, const struct inode *inode,
                      struct qstr *str)
{
        str->hash = gfs2_disk_hash(str->name, str->len);
        return 0;
}
[snip]
const struct dentry_operations gfs2_dops = {
        .d_revalidate = gfs2_drevalidate,
        .d_hash = gfs2_dhash,
        .d_delete = gfs2_dentry_delete,
};
and in fs/gfs2/dir.h:
static inline u32 gfs2_disk_hash(const char *data, int len)
{
        return crc32_le((u32)~0, data, len) ^ (u32)~0;
}
This was something that was added to GFS2 right back at the beginning,
to avoid having to compute two hash functions for each directory entry.
The hash function itself was taken from the original GFS, so it was
designed to be backward compatible. We assume that GFS2's 32-bit on-disk
hash value will always fit in the "unsigned int" of the qstr.
I wasn't quite sure from your description whether the issue applies to
filesystems which actually have their own hash functions, or only to
filesystems which might write the VFS's internal hash to disk?
Maybe the former is a simple solution for the latter, if a
filesystem-specific hash function is a suitable solution,
Steve.
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
From: Linus Torvalds @ 2012-03-01 15:59 UTC (permalink / raw)
To: Steven Whitehouse; +Cc: Linux Kernel Mailing List, linux-fsdevel, Al Viro
On Thu, Mar 1, 2012 at 2:13 AM, Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> GFS2 is guilty as charged, m'lud! (but see below...)
>
> from fs/gfs2/dentry.c:
>
> static int gfs2_dhash(const struct dentry *dentry, const struct inode *inode,
>                       struct qstr *str)
> {
>         str->hash = gfs2_disk_hash(str->name, str->len);
>         return 0;
> }
Ok, that's fine. As long as you also use your own well-defined
hash-function rather than just depend on the default VFS one, I don't
care. It's only the default one I'd be optimizing.
You'd obviously not get the performance advantage either, but since I
don't even know whether it's really worth it in the first place, not a
big worry (those kernel routines can be about 5% of CPU time each
under loads that are very pathname-intensive, and it looks like I
could shave maybe 10-20% off them for the best case, so we're talking
*maybe* a percentage point on some specific loads).
And yeah, if some filesystem actually depends on the current default
VFS hash we can always work around it by filesystems just making the
current hash function explicit like you already do in gfs2.
Linus
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
From: Al Viro @ 2012-03-01 16:57 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Linux Kernel Mailing List, linux-fsdevel
On Wed, Feb 29, 2012 at 03:36:09PM -0800, Linus Torvalds wrote:
> However, doing the same thing for link_path_walk() would require that
> we actually change the hash function we use internally in the VFS
> layer, and while I think that shouldn't really be a problem, I worry
> that some filesystem might actually use the hash we generate and save
> it somewhere on disk (rather than only use it for the hashed lookup
> itself).
>
> Computing the hash one long-word at a time is trivial if we just
> change what the hash is. Finding the terminating NUL or '/' characters
> involves some big constants (0x2f2f2f2f2f2f2f2f,
> 0x0101010101010101 and 0x8080808080808080) but seems similarly
> easy. But if filesystems actually depend on our current hash
> algorithm, the word-at-a-time model falls apart.
As long as full_name_hash() is still around, any such filesystem is welcome
to use it for ->d_hash() and STFU.
Note that there are places like this one (in btrfs_real_readdir())
q.name = name_ptr;
q.len = name_len;
q.hash = full_name_hash(q.name, q.len);
tmp = d_lookup(filp->f_dentry, &q);
and they _will_ need to be updated if we switch the default. I'd be more
worried about that kind of breakage, TBH, than anything leaking hash
on disk without bothering to calculate it manually (if nothing else,
full_name_hash() is not endian-neutral, so any such place is buggy for
another reason already).
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
From: Linus Torvalds @ 2012-03-01 17:14 UTC (permalink / raw)
To: Al Viro; +Cc: Linux Kernel Mailing List, linux-fsdevel
On Thu, Mar 1, 2012 at 8:57 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> As long as full_name_hash() is still around, any such filesystem is welcome
> to use it for ->d_hash() and STFU.
Well, I'd make full_name_hash() match the new hash model, so no, they
couldn't use that. They'd have to actually implement the old hash, and
make it "their hash".
Also note that the hash model would have to become dependent on config
options: doing things a word at a time really only works well on
architectures that have fast unaligned accesses, and even on those
architectures it really doesn't work if you enable
CONFIG_DEBUG_PAGEALLOC.
> Note that there are places like this one (in btrfs_real_readdir())
> q.name = name_ptr;
> q.len = name_len;
> q.hash = full_name_hash(q.name, q.len);
> tmp = d_lookup(filp->f_dentry, &q);
> and they _will_ need to be updated if we switch the default.
I don't agree that it will need to be updated - I was planning on just
updating full_name_hash() to always match. That part is easy and
obvious.
So this I don't have any issue with. It's only a problem if you actually
save the default VFS hash to disk (or to the network), since it will
change between reboots or between different machines.
> I'd be more
> worried about that kind of breakage, TBH, than anything leaking hash
> on disk without bothering to calculate it manually (if nothing else,
> full_name_hash() is not endian-neutral, so any such place is buggy for
> another reason already).
No, full_name_hash() is perfectly endian-safe - currently. It does
everything one character at a time, and it acts as if it was all done
in 32 bits (even if it internally looks like it is calculating hashes
in "unsigned long" - the math is all stable in 32 bits and at the end
it is truncated down).
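(For reference, the current byte-at-a-time helpers look roughly like
this - a from-memory sketch of the 2012-era include/linux/dcache.h,
close to but not guaranteed to be character-for-character identical to
the in-tree code:)

        /* Name hashing routines. Initial hash value */
        #define init_name_hash()                0

        /* Per-character hash update; the math stays stable in 32 bits */
        static inline unsigned long
        partial_name_hash(unsigned long c, unsigned long prevhash)
        {
                return (prevhash + (c << 4) + (c >> 4)) * 11;
        }

        /* Finally: cut the result down to "unsigned int" */
        static inline unsigned long end_name_hash(unsigned long hash)
        {
                return (unsigned int) hash;
        }

        static inline unsigned int
        full_name_hash(const unsigned char *name, unsigned int len)
        {
                unsigned long hash = init_name_hash();
                while (len--)
                        hash = partial_name_hash(*name++, hash);
                return end_name_hash(hash);
        }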
Now, of course, any disk representation would have to store the value
in some particular endianness, but that's no different from any other
number that FS would store.
But if I actually go ahead and do a word-at-a-time hash thing, that
hash value really fundamentally will depend on endianness. But since
it will depend on config options and on architecture details too,
that's the least of the problems with using that for anything
"stable". The new hash would literally only be useful *inside* the
kernel, and really only for looking up names on the hash lists.
Linus
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
From: Chris Mason @ 2012-03-01 18:34 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel
On Thu, Mar 01, 2012 at 09:14:13AM -0800, Linus Torvalds wrote:
> On Thu, Mar 1, 2012 at 8:57 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > As long as full_name_hash() is still around, any such filesystem is welcome
> > to use it for ->d_hash() and STFU.
>
> Well, I'd make full_name_hash() match the new hash model, so no, they
> couldn't use that. They'd have to actually implement the old hash, and
> make it "their hash".
>
> Also note that the hash model would have to become dependent on config
> options: doing things a word at a time really only works well on
> architectures that have fast unaligned accesses, and even on those
> architectures it really doesn't work if you enable
> CONFIG_DEBUG_PAGEALLOC.
>
> > Note that there are places like this one (in btrfs_real_readdir())
> > q.name = name_ptr;
> > q.len = name_len;
> > q.hash = full_name_hash(q.name, q.len);
> > tmp = d_lookup(filp->f_dentry, &q);
> > and they _will_ need to be updated if we switch the default.
>
> I don't agree that it will need to be updated - I was planning on just
> updating full_name_hash() to always match. That part is easy and
> obvious.
Just to clarify, btrfs isn't putting the dcache hashes on disk. The
code above is part of our optimization to preload some information we
find during readdir to reduce IO when the file is later opened.
On disk we're crc32c only.
-chris
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
From: Linus Torvalds @ 2012-03-01 22:42 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Al Viro, Linux Kernel Mailing List,
linux-fsdevel
[-- Attachment #1: Type: text/plain, Size: 2250 bytes --]
Ok, guys, here's a *very* preliminary version of the patch I was thinking of.
NOTE! The changes are all totally unconditional, which means that this
patch will break on:
- any big-endian machine (the "find first NUL/slash byte" and mask
generation really is little-endian only)
Maybe somebody with a big-endian machine could try to figure out
how the logic would work there to efficiently find the mask of bytes
that are before the first NUL or slash.
- any hardware that doesn't do good unaligned 'unsigned long' accesses
- if you enable CONFIG_DEBUG_PAGEALLOC
In addition, I've only *tested* this on x86-64, and I'm pretty sure
that on x86-32 it will at the very least warn about the big 64-bit
constants I use (but I think it should work - they'll just get
truncated, and I tried to use "sizeof(unsigned long)" instead of "8"
etc).
Anyway, on x86-64 it works for me. And my stupid test-case (look up
the same pathname ten million times - I'm explicitly looking up the
*same* pathname to not show the D$ trashing of the hash tables looking
up millions of different names) went from 9.053s to 8.099s.
So that's a 10% improvement on an admittedly extreme load, but the two
functions that were improved (__d_lookup_rcu and link_path_walk)
really *are* the two hottest functions on actual real loads too, so
while my microbenchmark was extreme, it's at least testing something
relevant. And it's the whole microbenchmark that improved by 10%, so
those hot functions themselves each actually improved by about 30%.
Comments?
I agree that the games I play to do things one word at a time aren't
*pretty*, but I did try to split things up into helper functions so
that the code is actually conceptually understandable even if you
can't follow the tricks I play. Staring too much at that new
hash_name() function may turn you mad. Not only is the "find byte in
word" logic subtle, the whole crazy "do { } while ()" loop is written
that way so that gcc doesn't go crazy and try to unroll it once (which
gcc does if you turn it into the more obvious "for ()" loop).
That said, trying to figure out *why* hash_name() works can be a fun
exercise if you're into these kinds of tricks. It was certainly fun to
write.
Linus
[-- Attachment #2: dentry-speed.patch --]
[-- Type: text/x-patch, Size: 8318 bytes --]
fs/dcache.c | 13 +++---
fs/namei.c | 102 ++++++++++++++++++++++++++++++++++++++----------
include/linux/dcache.h | 45 ++++++++++++++-------
3 files changed, 120 insertions(+), 40 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index fe19ac13f75f..138be96e25b6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -104,7 +104,7 @@ static unsigned int d_hash_shift __read_mostly;
static struct hlist_bl_head *dentry_hashtable __read_mostly;
-static inline struct hlist_bl_head *d_hash(struct dentry *parent,
+static inline struct hlist_bl_head *d_hash(const struct dentry *parent,
unsigned long hash)
{
hash += ((unsigned long) parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
@@ -1717,8 +1717,9 @@ EXPORT_SYMBOL(d_add_ci);
* child is looked up. Thus, an interlocking stepping of sequence lock checks
* is formed, giving integrity down the path walk.
*/
-struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
- unsigned *seq, struct inode **inode)
+struct dentry *__d_lookup_rcu(const struct dentry *parent,
+ const struct qstr *name,
+ unsigned *seqp, struct inode **inode)
{
unsigned int len = name->len;
unsigned int hash = name->hash;
@@ -1748,6 +1749,7 @@ struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
* See Documentation/filesystems/path-lookup.txt for more details.
*/
hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
+ unsigned seq;
struct inode *i;
const char *tname;
int tlen;
@@ -1756,7 +1758,7 @@ struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
continue;
seqretry:
- *seq = read_seqcount_begin(&dentry->d_seq);
+ seq = read_seqcount_begin(&dentry->d_seq);
if (dentry->d_parent != parent)
continue;
if (d_unhashed(dentry))
@@ -1771,7 +1773,7 @@ seqretry:
* edge of memory when walking. If we could load this
* atomically some other way, we could drop this check.
*/
- if (read_seqcount_retry(&dentry->d_seq, *seq))
+ if (read_seqcount_retry(&dentry->d_seq, seq))
goto seqretry;
if (unlikely(parent->d_flags & DCACHE_OP_COMPARE)) {
if (parent->d_op->d_compare(parent, *inode,
@@ -1788,6 +1790,7 @@ seqretry:
* order to do anything useful with the returned dentry
* anyway.
*/
+ *seqp = seq;
*inode = i;
return dentry;
}
diff --git a/fs/namei.c b/fs/namei.c
index a780ea515c47..06df2605c9e8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1374,6 +1374,85 @@ static inline int can_lookup(struct inode *inode)
return 1;
}
+static inline unsigned int fold_hash(unsigned long hash)
+{
+        if (sizeof(long) != sizeof(int))
+                hash += hash >> (8*sizeof(int));
+        return hash;
+}
+
+/* Only works for little-endian with fast unaligned accesses! */
+unsigned int full_name_hash(const unsigned char *name, unsigned int len)
+{
+        unsigned long a, mask;
+        unsigned long hash = 0;
+
+        for (;;) {
+                a = *(unsigned long *)name;
+                hash *= 11;
+                if (len < sizeof(unsigned long))
+                        break;
+                hash += a;
+                name += sizeof(unsigned long);
+                len -= sizeof(unsigned long);
+                if (!len)
+                        goto done;
+        }
+        mask = ~(~0ul << len*8);
+        hash += mask & a;
+done:
+        return fold_hash(hash);
+}
+
+#define ONEBYTES 0x0101010101010101ul
+#define SLASHBYTES 0x2f2f2f2f2f2f2f2ful
+#define HIGHBITS 0x8080808080808080ul
+
+/* Return the high bit set in the first byte that is a zero */
+static inline unsigned long has_zero(unsigned long a)
+{
+        return ((a - ONEBYTES) & ~a) & HIGHBITS;
+}
+
+/*
+ * Calculate the length and hash of the path component, and
+ * return the beginning of the next one (or the pointer to the
+ * final NUL character if none).
+ */
+static inline const char *hash_name(struct qstr *str, const char *name)
+{
+        unsigned long a, mask;
+        unsigned long hash = 0;
+        unsigned long len = 0;
+
+        str->name = name;
+        a = 0;
+        len = -sizeof(unsigned long);
+        do {
+                hash = (hash + a) * 11;
+                len += sizeof(unsigned long);
+                a = *(unsigned long *)(name+len);
+                /* Do we have any NUL or '/' bytes in this word? */
+                mask = has_zero(a) | has_zero(a ^ SLASHBYTES);
+        } while (!mask);
+
+        /* The mask *below* the first high bit set */
+        mask = (mask - 1) & ~mask;
+        mask >>= 7;
+        hash += a & mask;
+        str->hash = fold_hash(hash);
+
+        /* Get the final path component length */
+        len += ffz(mask) / 8;
+        str->len = len;
+
+        /* remove trailing slashes */
+        while (name[len] == '/')
+                len++;
+
+        return name+len;
+}
+
/*
* Name resolution.
* This is the basic name resolution function, turning a pathname into
@@ -1394,26 +1473,14 @@ static int link_path_walk(const char *name, struct nameidata *nd)
/* At this point we know we have a real path component. */
for(;;) {
- unsigned long hash;
struct qstr this;
- unsigned int c;
int type;
err = may_lookup(nd);
if (err)
break;
- this.name = name;
- c = *(const unsigned char *)name;
-
- hash = init_name_hash();
- do {
- name++;
- hash = partial_name_hash(c, hash);
- c = *(const unsigned char *)name;
- } while (c && (c != '/'));
- this.len = name - (const char *) this.name;
- this.hash = end_name_hash(hash);
+ name = hash_name(&this, name);
type = LAST_NORM;
if (this.name[0] == '.') switch (this.len) {
@@ -1437,10 +1504,6 @@ static int link_path_walk(const char *name, struct nameidata *nd)
}
}
- /* remove trailing slashes? */
- if (!c)
- goto last_component;
- while (*++name == '/');
if (!*name)
goto last_component;
@@ -1775,24 +1838,21 @@ static struct dentry *lookup_hash(struct nameidata *nd)
struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
{
struct qstr this;
- unsigned long hash;
unsigned int c;
WARN_ON_ONCE(!mutex_is_locked(&base->d_inode->i_mutex));
this.name = name;
this.len = len;
+ this.hash = full_name_hash(name, len);
if (!len)
return ERR_PTR(-EACCES);
- hash = init_name_hash();
while (len--) {
c = *(const unsigned char *)name++;
if (c == '/' || c == '\0')
return ERR_PTR(-EACCES);
- hash = partial_name_hash(c, hash);
}
- this.hash = end_name_hash(hash);
/*
* See if the low-level filesystem might want
* to use its own hash..
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d64a55b23afd..79e77f9419ff 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -54,18 +54,41 @@ extern struct dentry_stat_t dentry_stat;
static inline int dentry_cmp(const unsigned char *cs, size_t scount,
const unsigned char *ct, size_t tcount)
{
- int ret;
+#if 1
+/* Little-endian with fast unaligned accesses? */
+ unsigned long a,b,mask;
+
+ if (unlikely(scount != tcount))
+ return 1;
+
+ for (;;) {
+ a = *(unsigned long *)cs;
+ b = *(unsigned long *)ct;
+ if (tcount < sizeof(unsigned long))
+ break;
+ if (unlikely(a != b))
+ return 1;
+ cs += sizeof(unsigned long);
+ ct += sizeof(unsigned long);
+ tcount -= sizeof(unsigned long);
+ if (!tcount)
+ return 0;
+ }
+ mask = ~(~0ul << tcount*8);
+ return unlikely(!!((a ^ b) & mask));
+#else
if (scount != tcount)
return 1;
+
do {
- ret = (*cs != *ct);
- if (ret)
- break;
+ if (*cs != *ct)
+ return 1;
cs++;
ct++;
tcount--;
} while (tcount);
- return ret;
+ return 0;
+#endif
}
/* Name hashing routines. Initial hash value */
@@ -89,14 +112,7 @@ static inline unsigned long end_name_hash(unsigned long hash)
}
/* Compute the hash for a name string. */
-static inline unsigned int
-full_name_hash(const unsigned char *name, unsigned int len)
-{
- unsigned long hash = init_name_hash();
- while (len--)
- hash = partial_name_hash(*name++, hash);
- return end_name_hash(hash);
-}
+extern unsigned int full_name_hash(const unsigned char *, unsigned int);
/*
* Try to keep struct dentry aligned on 64 byte cachelines (this will
@@ -309,7 +325,8 @@ extern struct dentry *d_ancestor(struct dentry *, struct dentry *);
extern struct dentry *d_lookup(struct dentry *, struct qstr *);
extern struct dentry *d_hash_and_lookup(struct dentry *, struct qstr *);
extern struct dentry *__d_lookup(struct dentry *, struct qstr *);
-extern struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
+extern struct dentry *__d_lookup_rcu(const struct dentry *parent,
+ const struct qstr *name,
unsigned *seq, struct inode **inode);
/**
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
From: Andi Kleen @ 2012-03-02 0:46 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Linux Kernel Mailing List, linux-fsdevel, Al Viro
Linus Torvalds <torvalds@linux-foundation.org> writes:
> So I'm doing my normal profiling ("empty kernel build" is my favorite
> one), and link_path_walk() and __d_lookup_rcu() remain some of the
> hottest kernel functions due to their per-character loops.
>
> I can improve __d_lookup_rcu() on my machine by what appears to be
> around 15% by doing things an "unsigned long" at a time (it would be an
> option that only works on little-endian and with cheap unaligned
> accesses, although the big-endian modifications should be pretty
> trivial).
There should be better modern general-purpose hash algorithms around,
like murmur, cityhash or spooky. Perhaps even the fnv we have in tree,
but it's somewhat dated by now.
They all have larger code, but if it's really that hot it would be worth
it.
-Andi
--
ak@linux.intel.com -- Speaking for myself only
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
From: Linus Torvalds @ 2012-03-02 1:01 UTC (permalink / raw)
To: Andi Kleen; +Cc: Linux Kernel Mailing List, linux-fsdevel, Al Viro
On Thu, Mar 1, 2012 at 4:46 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> There should be better modern general-purpose hash algorithms around,
> like murmur, cityhash or spooky. Perhaps even the fnv we have in tree,
> but it's somewhat dated by now.
>
> They all have larger code, but if it's really that hot it would be worth
> it.
The quality of our hash function really doesn't seem to be the issue.
In fact, my (other - I've done both cache-hot and cache-cold) profiles
seem to indicate that the only access that ever matters is the first
one (ie the read from the hash table, not the following of the
chains).
I have a dim memory of us playing with the hash function itself long
long ago, and what made a difference was using the right bits from the
"parent" pointer, not the hash of the name itself. That's especially
true since many filenames are really just in the 3-8 character length
thing, so creating a 32-bit hash from that is not going to have lots
of collisions.
We basically have "two hashes": we have the full 32-bit "hash number"
that we compute over the filename (and also compare at lookup time
before we do a name compare), and then we have the "bucket number"
which is the actual hash chain number.
The bucket number is generated from both the filename hash number and
the parent pointer, and then we try to mix bits around with that
GOLDEN_RATIO_PRIME thing. And last time this was an issue, it was the
"mixing bits around" that made a difference to proper bucket use. See
commit 99effef9544e: "[PATCH] dentry and inode cache hash algorithm
performance changes." in the historical BK import by Thomas.
Linus
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
From: Andi Kleen @ 2012-03-02 1:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andi Kleen, Linux Kernel Mailing List, linux-fsdevel, Al Viro
On Thu, Mar 01, 2012 at 05:01:52PM -0800, Linus Torvalds wrote:
> On Thu, Mar 1, 2012 at 4:46 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >
> > There should be better modern general-purpose hash algorithms around,
> > like murmur, cityhash or spooky. Perhaps even the fnv we have in tree,
> > but it's somewhat dated by now.
> >
> > They all have larger code, but if it's really that hot it would be worth
> > it.
>
> The quality of our hash function really doesn't seem to be the issue.
With better I meant mainly faster in cycles.
e.g. CityHash claims up to ~6 bytes/cycle. That's extreme and may need
the SSE versions, but there are non-SSE variants, e.g. in spooky, that are
somewhat competitive.
Are you anywhere near that with your hash function?
Partly they get that from unrolling, but there are also lots of other tricks.
Also BTW if we had better hash functions (in mixing) we could do smaller
hash tables.
-Andi
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
From: Linus Torvalds @ 2012-03-02 1:38 UTC (permalink / raw)
To: Andi Kleen; +Cc: Linux Kernel Mailing List, linux-fsdevel, Al Viro
On Thu, Mar 1, 2012 at 5:11 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> With better I meant mainly faster in cycles.
>
> e.g. CityHash claims upto ~6 bytes/cycle. That's extreme and may need
> the SSE versions, but there are non SSE variants e.g. in spooky that are
> somewhat competive.
Oh. Yeah, no. The real problem we have is finding the end of the
string, and the NUL and '/' bytes.
> Are you anywhere near that with your hash function?
>
> Partly they get that from unrolling, but there are also lots of other tricks.
Unrolling actually hurts. Badly.
The common path components sizes really are small. Three letters is
not unusual ("usr" and "bin" come to mind), but 5-10 letters is the
common case.
And when you do things a word at a time on a 64-bit architecture, that
means that you go through the loop *once*. Maybe twice. More than that
is quite unusual.
So the *hashing* is actually pretty trivial. Our name hash used to do
some "shift four bits games" to spread out the bytes a bit, but with
words that really doesn't make much sense, and I'm currently just
using "hash = hash*11 + word" which seems *plenty* fine.
It's the final masking of bytes and the loading the big constants to
find the zero and '/' bytes that are costly. The "inner loop" is
trivial, and does 8 bytes at a time with this loop:
.L402:
addq %rdi, %rdx # hash, D.39887
addq $8, %rsi #, len
leaq (%rdx,%rdx,4), %rax #, tmp347
leaq (%rdx,%rax,2), %rdi #, hash
movq (%r12,%rsi), %rdx #* len, a
movq %rdx, %rax # a, D.39884
movq %rdx, %r10 # a, tmp359
xorq %r11, %rax # tmp469, D.39884
notq %r10 # tmp359
leaq (%rax,%r9), %rcx #, tmp354
notq %rax # tmp353
andq %rax, %rcx # tmp353, tmp354
leaq (%rdx,%r9), %rax #, tmp361
andq %r10, %rax # tmp359, tmp361
orq %rcx, %rax # tmp354, tmp361
andq %r8, %rax # tmp471, mask
je .L402 #,
where most of it is about finding the final bytes. And remember, while
this is a "loop", usually we go through it once, and the final "jump
back to the top" generally is never actually taken for path components
of size 1-7 bytes.
(And it really is about the *components*. The pathname may be long,
but we do the hashing on a component basis)
According to my profiles, one of the most expensive parts is literally
this loop at the very end
        /* remove trailing slashes */
        while (name[len] == '/')
                len++;
which is *trivial*, but I think it mispredicts a lot (the common case
is a single slash at the end of the component, except for the final
one, of course), and gcc has created a horrible mess that turns it into
        if (name[len] == '/') {
                .. align the loop that will never be taken, stupid gcc ..
                do { len++ } while (name[len] == '/');
        }
which is just sad and doesn't buy us anything. I actually suspect I
should do something like
len += name[len] == '/';
to get rid of the unpredictable "end of string vs end of path
component" case, and then I could have a very unlikely loop for the
"multiple slashes" case after that.
I use a "bsf" instruction to find how many bytes we had at the end,
which is just sad too. But hey, it's fine on modern CPU's that have
the special hardware for bit scanning. It might be one of those "sucks
on atom" kind of things, though.
Linus
* Faulty has_zero()? (was: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?)
From: Sven Anderson @ 2012-03-17 12:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chris Mason, Al Viro, Linux Kernel Mailing List, linux-fsdevel
Hi Linus,
Am 01.03.2012 um 23:42 schrieb Linus Torvalds:
> +/* Return the high bit set in the first byte that is a zero */
> +static inline unsigned long has_zero(unsigned long a)
> +{
> + return ((a - ONEBYTES) & ~a) & HIGHBITS;
> +}
(I commented this on your google+ posting as well, but I'm not sure if you will notice it there.)
Out of curiosity I studied your code, and if I'm not mistaken your has_zero() function doesn't do what is expected. If there are 0x01 bytes in front of a NUL byte, they are also marked in the mask because of the borrow bit. You could argue that there are no 0x01 bytes in path strings, and I agree (even with UTF-8). But you also use it for slash detection, and there you have the same effect with the '.' char, since '.' ^ '/' == 0x01. So if you have a directory name like "foobar.../" it will get handled the same as "foobar////".
Best regards
Sven
* Re: Faulty has_zero()? (was: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?)
From: Linus Torvalds @ 2012-03-17 16:53 UTC (permalink / raw)
To: Sven Anderson
Cc: Chris Mason, Al Viro, Linux Kernel Mailing List, linux-fsdevel
On Sat, Mar 17, 2012 at 5:29 AM, Sven Anderson <sven@anderson.de> wrote:
>
> Am 01.03.2012 um 23:42 schrieb Linus Torvalds:
>
>> +/* Return the high bit set in the first byte that is a zero */
>> +static inline unsigned long has_zero(unsigned long a)
>> +{
>> + return ((a - ONEBYTES) & ~a) & HIGHBITS;
>> +}
>
> (I commented this on your google+ posting as well, but I'm not sure if you will notice it there.)
>
> Out of curiosity I studied your code, and if I'm not mistaken your has_zero() function doesn't do what is expected. If there are leading 0x01 bytes in front of a NUL byte, they are also marked in the mask because of the borrow bit.
So has_zero() doesn't guarantee to return a mask of all zero bytes -
only the *first* one. And that's the only one we care about. Any
subsequent zero bytes are suspect, but we don't use that information -
we mask it all away with the "(mask-1) & ~mask" step.
So what has_zero() guarantees is two-fold:
- if there are no zeroes at all, it will return 0.
- if there is one or more zero bytes, it will return with the high
bit set in the *first* zero byte (and nothing below that).
But if it returns non-zero, only the first bit is "guaranteed". Any of
the high bits in the higher bytes are indeed suspect, exactly because
of borrow.
But for our use, we simply just don't care - we only ever care about
the *first* NUL byte.
This is why the algorithm is fundamentally little-endian only. You may
have confused your test-case by assuming big-endian byte order. On
big-endian, you would need to do a byte-swapping load.
And when we do the "NUL or slash" thing, we will or the two cases
together, which means that we catch the first NUL or slash, and even
though *both* of those are suspect in the higher bits, once more, we
won't care. We will only ever look at the *lowest* of the high bits in
the bytes (that whole "(mask-1) & ~mask" step), which is exactly why
borrow will never matter for us: it can only affect bits above the
lowest one.
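To make that concrete, here is a tiny stand-alone test (just an
illustration, not code from the patch; it assumes a 64-bit little-endian
unsigned long) showing the borrow artifact and why the "(mask-1) & ~mask"
step makes it harmless:

        #include <stdio.h>

        #define ONEBYTES 0x0101010101010101ul
        #define HIGHBITS 0x8080808080808080ul

        static unsigned long has_zero(unsigned long a)
        {
                return ((a - ONEBYTES) & ~a) & HIGHBITS;
        }

        int main(void)
        {
                /*
                 * Little-endian word for the bytes 'a', NUL, 0x01, 'b', ...:
                 * byte 0 (least significant) is 'a', byte 1 is the NUL, and
                 * byte 2 is a 0x01 that comes *after* the NUL in memory.
                 */
                unsigned long a = 0x6262626262010061ul;
                unsigned long m = has_zero(a);
                unsigned long mask;

                /* prints 0x0000000000808000: byte 1 (the real NUL) is
                   flagged, and so is byte 2 (the 0x01) due to the borrow */
                printf("has_zero = %#018lx\n", m);

                /* keep only the bits strictly below the lowest flag */
                mask = (m - 1) & ~m;
                /* prints 0x0000000000007fff: exactly the bytes before the
                   first NUL, so the spurious higher flag never matters */
                printf("mask     = %#018lx\n", mask);
                return 0;
        }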
Linus
* Re: Faulty has_zero()? (was: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?)
From: Sven Anderson @ 2012-03-17 18:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chris Mason, Al Viro, Linux Kernel Mailing List, linux-fsdevel
Am 17.03.2012 um 17:53 schrieb Linus Torvalds:
> This is why the algorithm is fundamentally little-endian only. You may
> have confused your test-case on a big-endian machine. On big-endian,
> you would need to do a byte-switching load.
Alright, I ignorantly assumed big-endian by interpreting the "first" as coming from the left side in the integer literals. *D'oh*
Sorry, and thanks for taking the time to explain it to me.
Best regards,
Sven
Thread overview: 14+ messages
2012-02-29 23:36 .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation? Linus Torvalds
2012-03-01 10:13 ` Steven Whitehouse
2012-03-01 15:59 ` Linus Torvalds
2012-03-01 16:57 ` Al Viro
2012-03-01 17:14 ` Linus Torvalds
2012-03-01 18:34 ` Chris Mason
2012-03-01 22:42 ` Linus Torvalds
2012-03-17 12:29 ` Faulty has_zero()? (was: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?) Sven Anderson
2012-03-17 16:53 ` Linus Torvalds
2012-03-17 18:00 ` Sven Anderson
2012-03-02 0:46 ` .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation? Andi Kleen
2012-03-02 1:01 ` Linus Torvalds
2012-03-02 1:11 ` Andi Kleen
2012-03-02 1:38 ` Linus Torvalds