* .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
  From: Linus Torvalds @ 2012-02-29 23:36 UTC
  To: Linux Kernel Mailing List, linux-fsdevel, Al Viro

So I'm doing my normal profiling ("empty kernel build" is my favorite
one), and link_path_walk() and __d_lookup_rcu() remain some of the
hottest kernel functions due to their per-character loops.

I can improve __d_lookup_rcu() on my machine by what appears to be
around 15% by doing things a "unsigned long" at a time (it would be an
option that only works on little-endian and with cheap unaligned
accesses, although the big-endian modifications should be pretty
trivial).

Sure, that optimization would have to be disabled if you do
DEBUG_PAGEALLOC, because it might opportunistically access bytes past
the end of the string, but it does seem to be a very reasonable and
easy thing to do apart from that small detail, and the numbers do look
good.

Basically, dentry_cmp() just becomes

        /* Little-endian with fast unaligned accesses? */
        unsigned long a,b,mask;

        if (scount != tcount)
                return 1;

        for (;;) {
                a = *(unsigned long *)cs;
                b = *(unsigned long *)ct;
                if (tcount < sizeof(unsigned long))
                        break;
                if (a != b)
                        return 1;
                cs += sizeof(unsigned long);
                ct += sizeof(unsigned long);
                tcount -= sizeof(unsigned long);
                if (!tcount)
                        return 0;
        }
        mask = ~(~0ul << tcount*8);
        return !!((a ^ b) & mask);

for that case, and gcc generates good code for it.

However, doing the same thing for link_path_walk() would require that
we actually change the hash function we use internally in the VFS
layer, and while I think that shouldn't really be a problem, I worry
that some filesystem might actually use the hash we generate and save
it somewhere on disk (rather than only use it for the hashed lookup
itself).

Computing the hash one long-word at a time is trivial if we just
change what the hash is. Finding the terminating NUL or '/' characters
involves some big constants (0x2f2f2f2f2f2f2f2f, 0x0101010101010101
and 0x8080808080808080) but seems similarly fairly easy. But if
filesystems actually depend on our current hash algorithm, the
word-at-a-time model falls apart.

Anybody? I think we've kept the namehash unchanged since very early
on, so I could imagine that somebody has grown to think that it's
"stable". As far as I can tell, the current hash function goes back to
2.4.2 (just over ten years ago).

                Linus

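Those three constants are what make a word-at-a-time search for the
terminating NUL or '/' possible. Below is a minimal standalone sketch of
that byte-detection trick, assuming a little-endian machine with cheap
unaligned loads; the helper names are illustrative only, not the kernel
code that eventually came out of this thread.

        #include <stdio.h>
        #include <string.h>

        #define ONEBYTES   0x0101010101010101ul
        #define SLASHBYTES 0x2f2f2f2f2f2f2f2ful
        #define HIGHBITS   0x8080808080808080ul

        /* High bit set in the first (lowest-addressed) byte that is zero */
        static unsigned long has_zero(unsigned long word)
        {
                return ((word - ONEBYTES) & ~word) & HIGHBITS;
        }

        int main(void)
        {
                unsigned long word, mask, below;

                /* "usr/bin" plus its NUL terminator is exactly 8 bytes */
                memcpy(&word, "usr/bin", sizeof(word));

                /* mark the first NUL or '/' byte in the word */
                mask = has_zero(word) | has_zero(word ^ SLASHBYTES);

                /* keep only the bytes strictly below that byte: "usr" */
                below = ((mask - 1) & ~mask) >> 7;

                printf("component length = %d\n", (int)(__builtin_ctzl(mask) / 8));
                printf("component mask   = %016lx\n", below);
                return 0;
        }

On this example the reported component length is 3 ("usr"), and the mask
covers exactly the three low-order bytes of the word.
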
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
  From: Steven Whitehouse @ 2012-03-01 10:13 UTC
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, linux-fsdevel, Al Viro

Hi,

On Wed, 2012-02-29 at 15:36 -0800, Linus Torvalds wrote:
> So I'm doing my normal profiling ("empty kernel build" is my favorite
> one), and link_path_walk() and __d_lookup_rcu() remain some of the
> hottest kernel functions due to their per-character loops.
>
> I can improve __d_lookup_rcu() on my machine by what appears to be
> around 15% by doing things a "unsigned long" at a time (it would be an
> option that only works on little-endian and with cheap unaligned
> accesses, although the big-endian modifications should be pretty
> trivial).
>
> Sure, that optimization would have to be disabled if you do
> DEBUG_PAGEALLOC, because it might opportunistically access bytes past
> the end of the string, but it does seem to be a very reasonable and
> easy thing to do apart from that small detail, and the numbers do look
> good.
>
> Basically, dentry_cmp() just becomes
>
>         /* Little-endian with fast unaligned accesses? */
>         unsigned long a,b,mask;
>
>         if (scount != tcount)
>                 return 1;
>
>         for (;;) {
>                 a = *(unsigned long *)cs;
>                 b = *(unsigned long *)ct;
>                 if (tcount < sizeof(unsigned long))
>                         break;
>                 if (a != b)
>                         return 1;
>                 cs += sizeof(unsigned long);
>                 ct += sizeof(unsigned long);
>                 tcount -= sizeof(unsigned long);
>                 if (!tcount)
>                         return 0;
>         }
>         mask = ~(~0ul << tcount*8);
>         return !!((a ^ b) & mask);
>
> for that case, and gcc generates good code for it.
>
> However, doing the same thing for link_path_walk() would require that
> we actually change the hash function we use internally in the VFS
> layer, and while I think that shouldn't really be a problem, I worry
> that some filesystem might actually use the hash we generate and save
> it somewhere on disk (rather than only use it for the hashed lookup
> itself).
>

GFS2 is guilty as charged, m'lud! (but see below...)

from fs/gfs2/dentry.c:

static int gfs2_dhash(const struct dentry *dentry, const struct inode *inode,
                      struct qstr *str)
{
        str->hash = gfs2_disk_hash(str->name, str->len);
        return 0;
}

[snip]

const struct dentry_operations gfs2_dops = {
        .d_revalidate = gfs2_drevalidate,
        .d_hash = gfs2_dhash,
        .d_delete = gfs2_dentry_delete,
};

and in fs/gfs2/dir.h:

static inline u32 gfs2_disk_hash(const char *data, int len)
{
        return crc32_le((u32)~0, data, len) ^ (u32)~0;
}

This was something that was added to GFS2 right back at the beginning,
to avoid having to compute two hash functions for each directory entry.
The hash function itself was taken from the original GFS, so it was
designed to be backward compatible. We assume that the 32 bit on disk
hash function of GFS2 will always fit in the "unsigned int" of the qstr.

I wasn't quite sure whether from your description, the issue applies to
filesystems which actually have their own hash functions, or only to
filesystems which might write the VFS's internal hash function to disk?
Maybe the former is a simple solution for the latter, if a
filesystem-specific hash function is a suitable solution,

Steve.

* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
  From: Linus Torvalds @ 2012-03-01 15:59 UTC
  To: Steven Whitehouse; +Cc: Linux Kernel Mailing List, linux-fsdevel, Al Viro

On Thu, Mar 1, 2012 at 2:13 AM, Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> GFS2 is guilty as charged, m'lud! (but see below...)
>
> from fs/gfs2/dentry.c:
>
> static int gfs2_dhash(const struct dentry *dentry, const struct inode *inode,
>                       struct qstr *str)
> {
>         str->hash = gfs2_disk_hash(str->name, str->len);
>         return 0;
> }

Ok, that's fine. As long as you also use your own well-defined
hash-function rather than just depend on the default VFS one, I don't
care. It's only the default one I'd be optimizing.

You'd obviously not get the performance advantage either, but since I
don't even know whether it's really worth it in the first place, not a
big worry (those kernel routines can be about 5% of CPU time each under
loads that are very pathname-intensive, and it looks like I could shave
maybe 10-20% off them for the best case, so we're talking *maybe* a
percentage point on some specific loads).

And yeah, if some filesystem actually depends on the current default
VFS hash we can always work around it by filesystems just making the
current hash function explicit like you already do in gfs2.

                Linus

* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
  From: Al Viro @ 2012-03-01 16:57 UTC
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, linux-fsdevel

On Wed, Feb 29, 2012 at 03:36:09PM -0800, Linus Torvalds wrote:
> However, doing the same thing for link_path_walk() would require that
> we actually change the hash function we use internally in the VFS
> layer, and while I think that shouldn't really be a problem, I worry
> that some filesystem might actually use the hash we generate and save
> it somewhere on disk (rather than only use it for the hashed lookup
> itself).
>
> Computing the hash one long-word at a time is trivial if we just
> change what the hash is. Finding the terminating NUL or '/' characters
> involves some big constants (0x2f2f2f2f2f2f2f2f, 0x0101010101010101
> and 0x8080808080808080) but seems similarly fairly easy. But if
> filesystems actually depend on our current hash algorithm, the
> word-at-a-time model falls apart.

As long as full_name_hash() is still around, any such filesystem is
welcome to use it for ->d_hash() and STFU.

Note that there are places like this one (in btrfs_real_readdir())

                q.name = name_ptr;
                q.len = name_len;
                q.hash = full_name_hash(q.name, q.len);
                tmp = d_lookup(filp->f_dentry, &q);

and they _will_ need to be updated if we switch the default. I'd be
more worried about that kind of breakage, TBH, than anything leaking
hash on disk without bothering to calculate it manually (if nothing
else, full_name_hash() is not endian-neutral, so any such place is
buggy for another reason already).

* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
  From: Linus Torvalds @ 2012-03-01 17:14 UTC
  To: Al Viro; +Cc: Linux Kernel Mailing List, linux-fsdevel

On Thu, Mar 1, 2012 at 8:57 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> As long as full_name_hash() is still around, any such filesystem is
> welcome to use it for ->d_hash() and STFU.

Well, I'd make full_name_hash() match the new hash model, so no, they
couldn't use that. They'd have to actually implement the old hash, and
make it "their hash".

Also note that the hash model would have to become dependent on config
options: doing things a word at a time really only works well on
architectures that have fast unaligned accesses, and even on those
architectures it really doesn't work if you enable
CONFIG_DEBUG_PAGEALLOC.

> Note that there are places like this one (in btrfs_real_readdir())
>
>                 q.name = name_ptr;
>                 q.len = name_len;
>                 q.hash = full_name_hash(q.name, q.len);
>                 tmp = d_lookup(filp->f_dentry, &q);
>
> and they _will_ need to be updated if we switch the default.

I don't agree that it will need to be updated - I was planning on just
updating full_name_hash() to always match. That part is easy and
obvious.

So this I don't have any issue with. It's only if you actually save the
default VFS hash to disk (or network), since it will change between
reboots or between different machines.

> I'd be more worried about that kind of breakage, TBH, than anything
> leaking hash on disk without bothering to calculate it manually (if
> nothing else, full_name_hash() is not endian-neutral, so any such
> place is buggy for another reason already).

No, full_name_hash() is perfectly endian-safe - currently. It does
everything one character at a time, and it acts as if it was all done
in 32 bits (even if it internally looks like it is calculating hashes
in "unsigned long" - the math is all stable in 32 bits and at the end
it is truncated down).

Now, of course, any disk representation would have to store the value
in some particular endianness, but that's no different from any other
number that the FS would store.

But if I actually go ahead and do a word-at-a-time hash thing, that
hash value really fundamentally will depend on endianness. But since it
will depend on config options and on architecture details too, that's
the least of the problems with using that for anything "stable". The
new hash would literally only be useful *inside* the kernel, and really
only for looking up names on the hash lists.

                Linus

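For reference, the per-character scheme being discussed is essentially
the following sketch. The full_name_hash() loop matches the inline that
the later patch in this thread removes from include/linux/dcache.h; the
exact mixing step inside partial_name_hash() is reproduced from memory
here and may differ in detail from the tree at the time.

        #include <stdio.h>

        static unsigned long init_name_hash(void)
        {
                return 0;
        }

        /* one character at a time; stable in 32 bits, hence endian-safe */
        static unsigned long partial_name_hash(unsigned long c, unsigned long prevhash)
        {
                return (prevhash + (c << 4) + (c >> 4)) * 11;
        }

        static unsigned int end_name_hash(unsigned long hash)
        {
                return (unsigned int) hash;
        }

        static unsigned int full_name_hash(const unsigned char *name, unsigned int len)
        {
                unsigned long hash = init_name_hash();

                while (len--)
                        hash = partial_name_hash(*name++, hash);
                return end_name_hash(hash);
        }

        int main(void)
        {
                printf("hash(\"usr\") = %08x\n",
                       full_name_hash((const unsigned char *)"usr", 3));
                return 0;
        }
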
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
  From: Chris Mason @ 2012-03-01 18:34 UTC
  To: Linus Torvalds; +Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel

On Thu, Mar 01, 2012 at 09:14:13AM -0800, Linus Torvalds wrote:
> On Thu, Mar 1, 2012 at 8:57 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > As long as full_name_hash() is still around, any such filesystem is
> > welcome to use it for ->d_hash() and STFU.
>
> Well, I'd make full_name_hash() match the new hash model, so no, they
> couldn't use that. They'd have to actually implement the old hash, and
> make it "their hash".
>
> Also note that the hash model would have to become dependent on config
> options: doing things a word at a time really only works well on
> architectures that have fast unaligned accesses, and even on those
> architectures it really doesn't work if you enable
> CONFIG_DEBUG_PAGEALLOC.
>
> > Note that there are places like this one (in btrfs_real_readdir())
> >
> >                 q.name = name_ptr;
> >                 q.len = name_len;
> >                 q.hash = full_name_hash(q.name, q.len);
> >                 tmp = d_lookup(filp->f_dentry, &q);
> >
> > and they _will_ need to be updated if we switch the default.
>
> I don't agree that it will need to be updated - I was planning on just
> updating full_name_hash() to always match. That part is easy and
> obvious.

Just to clarify, btrfs isn't putting the dcache hashes on disk. The
code above is part of our optimization to preload some information we
find during readdir to reduce IO when the file is later opened.

On disk we're crc32c only.

-chris

* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
  From: Linus Torvalds @ 2012-03-01 22:42 UTC
  To: Chris Mason, Linus Torvalds, Al Viro, Linux Kernel Mailing List, linux-fsdevel

Ok, guys, here's a *very* preliminary version of the patch I was
thinking of.

NOTE! The changes are all totally unconditional, which means that this
patch will break on:

 - any big-endian machine (the "find first NUL/slash byte" and mask
   generation really is little-endian only)

   Maybe somebody with a big-endian machine could try to figure out how
   the logic would work there to efficiently find the mask of bytes
   that are before the first NUL or slash.

 - any hardware that doesn't do good unaligned 'unsigned long' accesses

 - if you enable CONFIG_DEBUG_PAGEALLOC

In addition, I've only *tested* this on x86-64, and I'm pretty sure
that on x86-32 it will at the very least warn about the big 64-bit
constants I use (but I think it should work - they'll just get
truncated, and I tried to use "sizeof(unsigned long)" instead of "8"
etc).

Anyway, on x86-64 it works for me. And my stupid test-case (look up the
same pathname ten million times - I'm explicitly looking up the *same*
pathname to not show the D$ trashing of the hash tables looking up
millions of different names) went from 9.053s to 8.099s.

So that's a 10% improvement on an admittedly extreme load, but the two
functions that were improved (__d_lookup_rcu and link_path_walk) really
*are* the two hottest functions on actual real loads too, so while my
microbenchmark was extreme, it's at least testing something relevant.
And it's the whole microbenchmark that improved by 10%, so those hot
functions themselves each actually improved by about 30%.

Comments? I agree that the games I play to do things one word at a time
aren't *pretty*, but I did try to split things up into helper functions
so that the code is actually conceptually understandable even if you
can't follow the tricks I play.

Staring too much at that new hash_name() function may turn you mad. Not
only is the "find byte in word" logic subtle, the whole crazy
"do { } while ()" loop is written that way so that gcc doesn't go crazy
and try to unroll it once (which gcc does if you turn it into the more
obvious "for ()" loop).

That said, trying to figure out *why* hash_name() works can be a fun
exercise if you're into these kinds of tricks. It was certainly fun to
write.
                Linus

[-- Attachment #2: dentry-speed.patch --]

 fs/dcache.c            |   13 +++---
 fs/namei.c             |  102 ++++++++++++++++++++++++++++++++++++++----------
 include/linux/dcache.h |   45 ++++++++++++++-------
 3 files changed, 120 insertions(+), 40 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fe19ac13f75f..138be96e25b6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -104,7 +104,7 @@ static unsigned int d_hash_shift __read_mostly;
 
 static struct hlist_bl_head *dentry_hashtable __read_mostly;
 
-static inline struct hlist_bl_head *d_hash(struct dentry *parent,
+static inline struct hlist_bl_head *d_hash(const struct dentry *parent,
                                         unsigned long hash)
 {
         hash += ((unsigned long) parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
@@ -1717,8 +1717,9 @@ EXPORT_SYMBOL(d_add_ci);
  * child is looked up. Thus, an interlocking stepping of sequence lock checks
  * is formed, giving integrity down the path walk.
  */
-struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
-                                unsigned *seq, struct inode **inode)
+struct dentry *__d_lookup_rcu(const struct dentry *parent,
+                                const struct qstr *name,
+                                unsigned *seqp, struct inode **inode)
 {
         unsigned int len = name->len;
         unsigned int hash = name->hash;
@@ -1748,6 +1749,7 @@ struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
          * See Documentation/filesystems/path-lookup.txt for more details.
          */
         hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
+                unsigned seq;
                 struct inode *i;
                 const char *tname;
                 int tlen;
@@ -1756,7 +1758,7 @@ struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
                         continue;
 
 seqretry:
-                *seq = read_seqcount_begin(&dentry->d_seq);
+                seq = read_seqcount_begin(&dentry->d_seq);
                 if (dentry->d_parent != parent)
                         continue;
                 if (d_unhashed(dentry))
@@ -1771,7 +1773,7 @@ seqretry:
                  * edge of memory when walking. If we could load this
                  * atomically some other way, we could drop this check.
                  */
-                if (read_seqcount_retry(&dentry->d_seq, *seq))
+                if (read_seqcount_retry(&dentry->d_seq, seq))
                         goto seqretry;
                 if (unlikely(parent->d_flags & DCACHE_OP_COMPARE)) {
                         if (parent->d_op->d_compare(parent, *inode,
@@ -1788,6 +1790,7 @@ seqretry:
                  * order to do anything useful with the returned dentry
                  * anyway.
                  */
+                *seqp = seq;
                 *inode = i;
                 return dentry;
         }
diff --git a/fs/namei.c b/fs/namei.c
index a780ea515c47..06df2605c9e8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1374,6 +1374,85 @@ static inline int can_lookup(struct inode *inode)
         return 1;
 }
 
+static inline unsigned int fold_hash(unsigned long hash)
+{
+        if (sizeof(long) != sizeof(int))
+                hash += hash >> (8*sizeof(int));
+        return hash;
+}
+
+/* Only works for little-endian with fast unaligned accesses! */
+unsigned int full_name_hash(const unsigned char *name, unsigned int len)
+{
+        unsigned long a, mask;
+        unsigned long hash = 0;
+
+        for (;;) {
+                a = *(unsigned long *)name;
+                hash *= 11;
+                if (len < sizeof(unsigned long))
+                        break;
+                hash += a;
+                name += sizeof(unsigned long);
+                len -= sizeof(unsigned long);
+                if (!len)
+                        goto done;
+        }
+        mask = ~(~0ul << len*8);
+        hash += mask & a;
+done:
+        return fold_hash(hash);
+}
+
+#define ONEBYTES 0x0101010101010101ul
+#define SLASHBYTES 0x2f2f2f2f2f2f2f2ful
+#define HIGHBITS 0x8080808080808080ul
+
+/* Return the high bit set in the first byte that is a zero */
+static inline unsigned long has_zero(unsigned long a)
+{
+        return ((a - ONEBYTES) & ~a) & HIGHBITS;
+}
+
+/*
+ * Calculate the length and hash of the path component, and
+ * return the beginning of the next one (or the pointer to the
+ * final NUL character if none).
+ */
+static inline const char *hash_name(struct qstr *str, const char *name)
+{
+        unsigned long a, mask;
+        unsigned long hash = 0;
+        unsigned long len = 0;
+
+        str->name = name;
+        a = 0;
+        len = -sizeof(unsigned long);
+        do {
+                hash = (hash + a) * 11;
+                len += sizeof(unsigned long);
+                a = *(unsigned long *)(name+len);
+                /* Do we have any NUL or '/' bytes in this word? */
+                mask = has_zero(a) | has_zero(a ^ SLASHBYTES);
+        } while (!mask);
+
+        /* The mask *below* the first high bit set */
+        mask = (mask - 1) & ~mask;
+        mask >>= 7;
+        hash += a & mask;
+        str->hash = fold_hash(hash);
+
+        /* Get the final path component length */
+        len += ffz(mask) / 8;
+        str->len = len;
+
+        /* remove trailing slashes */
+        while (name[len] == '/')
+                len++;
+
+        return name+len;
+}
+
 /*
  * Name resolution.
  * This is the basic name resolution function, turning a pathname into
@@ -1394,26 +1473,14 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 
         /* At this point we know we have a real path component. */
         for(;;) {
-                unsigned long hash;
                 struct qstr this;
-                unsigned int c;
                 int type;
 
                 err = may_lookup(nd);
                 if (err)
                         break;
 
-                this.name = name;
-                c = *(const unsigned char *)name;
-
-                hash = init_name_hash();
-                do {
-                        name++;
-                        hash = partial_name_hash(c, hash);
-                        c = *(const unsigned char *)name;
-                } while (c && (c != '/'));
-                this.len = name - (const char *) this.name;
-                this.hash = end_name_hash(hash);
+                name = hash_name(&this, name);
 
                 type = LAST_NORM;
                 if (this.name[0] == '.') switch (this.len) {
@@ -1437,10 +1504,6 @@ static int link_path_walk(const char *name, struct nameidata *nd)
                         }
                 }
 
-                /* remove trailing slashes? */
-                if (!c)
-                        goto last_component;
-                while (*++name == '/');
                 if (!*name)
                         goto last_component;
 
@@ -1775,24 +1838,21 @@ static struct dentry *lookup_hash(struct nameidata *nd)
 struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
 {
         struct qstr this;
-        unsigned long hash;
         unsigned int c;
 
         WARN_ON_ONCE(!mutex_is_locked(&base->d_inode->i_mutex));
 
         this.name = name;
         this.len = len;
+        this.hash = full_name_hash(name, len);
         if (!len)
                 return ERR_PTR(-EACCES);
 
-        hash = init_name_hash();
         while (len--) {
                 c = *(const unsigned char *)name++;
                 if (c == '/' || c == '\0')
                         return ERR_PTR(-EACCES);
-                hash = partial_name_hash(c, hash);
         }
-        this.hash = end_name_hash(hash);
         /*
          * See if the low-level filesystem might want
          * to use its own hash..
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d64a55b23afd..79e77f9419ff 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -54,18 +54,41 @@ extern struct dentry_stat_t dentry_stat;
 static inline int dentry_cmp(const unsigned char *cs, size_t scount,
                                 const unsigned char *ct, size_t tcount)
 {
-        int ret;
+#if 1
+/* Little-endian with fast unaligned accesses? */
+        unsigned long a,b,mask;
+
+        if (unlikely(scount != tcount))
+                return 1;
+
+        for (;;) {
+                a = *(unsigned long *)cs;
+                b = *(unsigned long *)ct;
+                if (tcount < sizeof(unsigned long))
+                        break;
+                if (unlikely(a != b))
+                        return 1;
+                cs += sizeof(unsigned long);
+                ct += sizeof(unsigned long);
+                tcount -= sizeof(unsigned long);
+                if (!tcount)
+                        return 0;
+        }
+        mask = ~(~0ul << tcount*8);
+        return unlikely(!!((a ^ b) & mask));
+#else
         if (scount != tcount)
                 return 1;
+
         do {
-                ret = (*cs != *ct);
-                if (ret)
-                        break;
+                if (*cs != *ct)
+                        return 1;
                 cs++;
                 ct++;
                 tcount--;
         } while (tcount);
-        return ret;
+        return 0;
+#endif
 }
 
 /* Name hashing routines. Initial hash value */
@@ -89,14 +112,7 @@ static inline unsigned long end_name_hash(unsigned long hash)
 }
 
 /* Compute the hash for a name string. */
-static inline unsigned int
-full_name_hash(const unsigned char *name, unsigned int len)
-{
-        unsigned long hash = init_name_hash();
-        while (len--)
-                hash = partial_name_hash(*name++, hash);
-        return end_name_hash(hash);
-}
+extern unsigned int full_name_hash(const unsigned char *, unsigned int);
 
 /*
  * Try to keep struct dentry aligned on 64 byte cachelines (this will
@@ -309,7 +325,8 @@ extern struct dentry *d_ancestor(struct dentry *, struct dentry *);
 extern struct dentry *d_lookup(struct dentry *, struct qstr *);
 extern struct dentry *d_hash_and_lookup(struct dentry *, struct qstr *);
 extern struct dentry *__d_lookup(struct dentry *, struct qstr *);
-extern struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
+extern struct dentry *__d_lookup_rcu(const struct dentry *parent,
+                                const struct qstr *name,
                                 unsigned *seq, struct inode **inode);
 
 /**

* Faulty has_zero()? (was: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?)
  From: Sven Anderson @ 2012-03-17 12:29 UTC
  To: Linus Torvalds
  Cc: Chris Mason, Al Viro, Linux Kernel Mailing List, linux-fsdevel

Hi Linus,

On 01.03.2012 at 23:42, Linus Torvalds wrote:

> +/* Return the high bit set in the first byte that is a zero */
> +static inline unsigned long has_zero(unsigned long a)
> +{
> +        return ((a - ONEBYTES) & ~a) & HIGHBITS;
> +}

(I commented this on your google+ posting as well, but I'm not sure if
you will notice it there.)

Out of curiosity I studied your code, and if I'm not mistaken your
has_zero() function doesn't do what is expected. If there are leading
0x01 bytes in front of a NUL byte, they are also marked in the mask
because of the borrow bit.

You could argue that there are no 0x01 bytes in path strings, and I
agree (even with UTF-8). But you also use it for slash detection, and
there you have the same effect with the '.' char, since '.' ^ '/' ==
0x01. So if you have a directory name like "foobar.../" it will get
handled the same as "foobar////".

Best regards

Sven

* Re: Faulty has_zero()? (was: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?)
  From: Linus Torvalds @ 2012-03-17 16:53 UTC
  To: Sven Anderson
  Cc: Chris Mason, Al Viro, Linux Kernel Mailing List, linux-fsdevel

On Sat, Mar 17, 2012 at 5:29 AM, Sven Anderson <sven@anderson.de> wrote:
>
> On 01.03.2012 at 23:42, Linus Torvalds wrote:
>
>> +/* Return the high bit set in the first byte that is a zero */
>> +static inline unsigned long has_zero(unsigned long a)
>> +{
>> +        return ((a - ONEBYTES) & ~a) & HIGHBITS;
>> +}
>
> (I commented this on your google+ posting as well, but I'm not sure if
> you will notice it there.)
>
> Out of curiosity I studied your code, and if I'm not mistaken your
> has_zero() function doesn't do what is expected. If there are leading
> 0x01 bytes in front of a NUL byte, they are also marked in the mask
> because of the borrow bit.

So has_zero() doesn't guarantee to return a mask of all zero bytes -
only the *first* one. And that's the only one we care about. Any
subsequent zero bytes are suspect, but we don't use that information -
we mask it all away with the "(mask-1) & ~mask".

So what has_zero() guarantees is two-fold:

 - if there are no zeroes at all, it will return 0.

 - if there is one or more zero bytes, it will return with the high bit
   set in the *first* zero byte (and nothing below that).

But if it returns non-zero, only the first bit is "guaranteed". Any of
the high bits in the higher bytes are indeed suspect, exactly because
of borrow. But for our use, we simply just don't care - we only ever
care about the *first* NUL byte.

This is why the algorithm is fundamentally little-endian only. You may
have confused your test-case on a big-endian machine. On big-endian,
you would need to do a byte-switching load.

And when we do the "NUL or slash" thing, we will or the two cases
together, which means that we catch the first NUL or slash, and even
though *both* of those are suspect in the higher bits, once more, we
won't care. We will only ever look at the *lowest* of the high bits in
the bytes (that whole "(mask-1) & ~mask" masking), which is exactly why
borrow will never matter for us: it can only affect bits above the
lowest one.

                Linus

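A small standalone test makes that concrete: a 0x01 byte *after* the
first zero does get spuriously marked because of the borrow, but the
"(mask-1) & ~mask" step only ever keeps the bytes below the first marked
byte, so the spurious bit never influences the result. Illustrative code
only, assuming a little-endian machine:

        #include <stdio.h>
        #include <string.h>

        #define ONEBYTES 0x0101010101010101ul
        #define HIGHBITS 0x8080808080808080ul

        static unsigned long has_zero(unsigned long a)
        {
                return ((a - ONEBYTES) & ~a) & HIGHBITS;
        }

        int main(void)
        {
                /* bytes in string order: 'a' 'b' NUL 0x01 'c' 'd' 'e' 'f' */
                const unsigned char bytes[8] = { 'a', 'b', 0x00, 0x01, 'c', 'd', 'e', 'f' };
                unsigned long a, mask, below;

                memcpy(&a, bytes, sizeof(a));   /* little-endian load */

                mask = has_zero(a);
                /* prints 0000000080800000: byte 3 is marked only because of borrow */
                printf("has_zero = %016lx\n", mask);

                below = ((mask - 1) & ~mask) >> 7;
                /* prints 000000000000ffff: just the "ab" bytes, borrow artifact gone */
                printf("below    = %016lx\n", below);

                /* length before the first NUL: 2 */
                printf("len      = %d\n", (int)(__builtin_ctzl(mask) / 8));
                return 0;
        }
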
* Re: Faulty has_zero()? (was: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?)
  From: Sven Anderson @ 2012-03-17 18:00 UTC
  To: Linus Torvalds
  Cc: Chris Mason, Al Viro, Linux Kernel Mailing List, linux-fsdevel

On 17.03.2012 at 17:53, Linus Torvalds wrote:

> This is why the algorithm is fundamentally little-endian only. You may
> have confused your test-case on a big-endian machine. On big-endian,
> you would need to do a byte-switching load.

Alright, I ignorantly assumed big-endian by interpreting the "first" as
coming from the left side in the integer literals. *D'oh*

Sorry, and thanks for taking the time to explain it to me.

Best regards,

Sven

* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
  From: Andi Kleen @ 2012-03-02 0:46 UTC
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, linux-fsdevel, Al Viro

Linus Torvalds <torvalds@linux-foundation.org> writes:

> So I'm doing my normal profiling ("empty kernel build" is my favorite
> one), and link_path_walk() and __d_lookup_rcu() remain some of the
> hottest kernel functions due to their per-character loops.
>
> I can improve __d_lookup_rcu() on my machine by what appears to be
> around 15% by doing things a "unsigned long" at a time (it would be an
> option that only works on little-endian and with cheap unaligned
> accesses, although the big-endian modifications should be pretty
> trivial).

There should be generally better modern hash algorithms around, like
murmur, CityHash or Spooky. Perhaps even the fnv we have in tree, but
it's somewhat dated by now.

They all have larger code, but if it's really that hot it would be
worth it.

-Andi

--
ak@linux.intel.com -- Speaking for myself only

* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
  From: Linus Torvalds @ 2012-03-02 1:01 UTC
  To: Andi Kleen; +Cc: Linux Kernel Mailing List, linux-fsdevel, Al Viro

On Thu, Mar 1, 2012 at 4:46 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> There should be generally better modern hash algorithms around, like
> murmur, CityHash or Spooky. Perhaps even the fnv we have in tree, but
> it's somewhat dated by now.
>
> They all have larger code, but if it's really that hot it would be
> worth it.

The quality of our hash function really doesn't seem to be the issue.

In fact, my (other - I've done both cache-hot and cache-cold) profiles
seem to indicate that the only access that ever matters is the first
one (ie the read from the hash table, not the following of the chains).

I have a dim memory of us playing with the hash function itself long
long ago, and what made a difference was using the right bits from the
"parent" pointer, not the hash of the name itself.

That's especially true since many filenames are really just in the 3-8
character length range, so creating a 32-bit hash from that is not
going to have lots of collisions.

We basically have "two hashes": we have the full 32-bit "hash number"
that we compute over the filename (and also compare at lookup time
before we do a name compare), and then we have the "bucket number"
which is the actual hash chain number. The bucket number is generated
from both the filename hash number and the parent pointer, and then we
try to mix bits around with that GOLDEN_RATIO_PRIME thing.

And last time this was an issue, it was the "mixing bits around" that
made a difference to proper bucket use. See commit 99effef9544e:
"[PATCH] dentry and inode cache hash algorithm performance changes." in
the historical BK import by Thomas.

                Linus

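A rough, self-contained sketch of that two-level scheme - the 32-bit
name hash feeding a bucket index that also mixes in the parent dentry
pointer. The first line of the mixing is the one visible in the patch
earlier in this thread; the constant and the table size used here are
placeholders for illustration, not the kernel's actual values.

        #include <stdio.h>

        /* assumed 64-bit golden-ratio prime; placeholder, not necessarily hash.h's value */
        #define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001ul
        #define L1_CACHE_BYTES     64
        #define D_HASH_SHIFT       16      /* pretend 2^16 hash buckets */

        static unsigned int d_hash_bucket(const void *parent, unsigned int name_hash)
        {
                unsigned long hash = name_hash;

                /* mix in which directory the name lives in, as in the patch above */
                hash += ((unsigned long) parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
                /* spread the bits before picking a bucket (paraphrased mixing step) */
                hash = hash ^ ((hash ^ GOLDEN_RATIO_PRIME) >> D_HASH_SHIFT);
                return hash & ((1u << D_HASH_SHIFT) - 1);
        }

        int main(void)
        {
                int parent_stand_in;

                printf("bucket = %u\n", d_hash_bucket(&parent_stand_in, 0xdeadbeef));
                return 0;
        }
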
* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
  From: Andi Kleen @ 2012-03-02 1:11 UTC
  To: Linus Torvalds
  Cc: Andi Kleen, Linux Kernel Mailing List, linux-fsdevel, Al Viro

On Thu, Mar 01, 2012 at 05:01:52PM -0800, Linus Torvalds wrote:
> On Thu, Mar 1, 2012 at 4:46 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >
> > There should be generally better modern hash algorithms around, like
> > murmur, CityHash or Spooky. Perhaps even the fnv we have in tree, but
> > it's somewhat dated by now.
> >
> > They all have larger code, but if it's really that hot it would be
> > worth it.
>
> The quality of our hash function really doesn't seem to be the issue.

With better I meant mainly faster in cycles.

e.g. CityHash claims up to ~6 bytes/cycle. That's extreme and may need
the SSE versions, but there are non-SSE variants, e.g. in Spooky, that
are somewhat competitive.

Are you anywhere near that with your hash function?

Partly they get that from unrolling, but there are also lots of other
tricks.

Also BTW if we had better hash functions (in mixing) we could do
smaller hash tables.

-Andi

* Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?
  From: Linus Torvalds @ 2012-03-02 1:38 UTC
  To: Andi Kleen; +Cc: Linux Kernel Mailing List, linux-fsdevel, Al Viro

On Thu, Mar 1, 2012 at 5:11 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> With better I meant mainly faster in cycles.
>
> e.g. CityHash claims up to ~6 bytes/cycle. That's extreme and may need
> the SSE versions, but there are non-SSE variants, e.g. in Spooky, that
> are somewhat competitive.

Oh. Yeah, no. The real problem we have is finding the end of the
string, and the NUL and '/' bytes.

> Are you anywhere near that with your hash function?
>
> Partly they get that from unrolling, but there are also lots of other
> tricks.

Unrolling actually hurts. Badly.

The common path component sizes really are small. Three letters is not
unusual ("usr" and "bin" come to mind), but 5-10 letters is the common
case. And when you do things a word at a time on a 64-bit architecture,
that means that you go through the loop *once*. Maybe twice. More than
that is quite unusual.

So the *hashing* is actually pretty trivial. Our name hash used to do
some "shift four bits games" to spread out the bytes a bit, but with
words that really doesn't make much sense, and I'm currently just using
"hash = hash*11 + word" which seems *plenty* fine.

It's the final masking of bytes and the loading of the big constants to
find the zero and '/' bytes that are costly.

The "inner loop" is trivial, and does 8 bytes at a time with this loop:

  .L402:
        addq    %rdi, %rdx              # hash, D.39887
        addq    $8, %rsi                #, len
        leaq    (%rdx,%rdx,4), %rax     #, tmp347
        leaq    (%rdx,%rax,2), %rdi     #, hash
        movq    (%r12,%rsi), %rdx       #* len, a
        movq    %rdx, %rax              # a, D.39884
        movq    %rdx, %r10              # a, tmp359
        xorq    %r11, %rax              # tmp469, D.39884
        notq    %r10                    # tmp359
        leaq    (%rax,%r9), %rcx        #, tmp354
        notq    %rax                    # tmp353
        andq    %rax, %rcx              # tmp353, tmp354
        leaq    (%rdx,%r9), %rax        #, tmp361
        andq    %r10, %rax              # tmp359, tmp361
        orq     %rcx, %rax              # tmp354, tmp361
        andq    %r8, %rax               # tmp471, mask
        je      .L402                   #,

where most of it is about finding the final bytes. And remember, while
this is a "loop", usually we go through it once, and the final "jump
back to the top" generally is never actually taken for path components
of size 1-7 bytes.

(And it really is about the *components*. The pathname may be long, but
we do the hashing on a component basis.)

According to my profiles, one of the most expensive parts is literally
this loop at the very end

        /* remove trailing slashes */
        while (name[len] == '/')
                len++;

which is *trivial*, but I think it mispredicts a lot (the common case
is a single slash at the end of the component except for the final one,
of course), but gcc has created a horrible mess that turns it into

        if (name[len] == '/') {
                .. align the loop that will never be taken, stupid gcc ..
                do {
                        len++;
                } while (name[len] == '/');
        }

which is just sad and doesn't buy us anything.

I actually suspect I should do something like

        len += name[len] == '/';

to get rid of the unpredictable "end of string vs end of path
component" case, and then I could have a very unlikely loop for the
"multiple slashes" case after that.

I use a "bsf" instruction to find how many bytes we had at the end,
which is just sad too. But hey, it's fine on modern CPU's that have the
special hardware for bit scanning. It might be one of those "sucks on
atom" kind of things, though.
                Linus

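A small sketch of that last suggestion - handle the common single
trailing slash branchlessly, and only loop for the rare run of
redundant slashes. Illustrative only; no such patch was posted in this
thread.

        #include <stdio.h>

        static const char *skip_trailing_slashes(const char *name, unsigned long len)
        {
                /* common case: zero or one slash, handled without a loop branch */
                len += (name[len] == '/');
                /* rare case: runs of redundant slashes, e.g. "usr////bin" */
                while (__builtin_expect(name[len] == '/', 0))
                        len++;
                return name + len;
        }

        int main(void)
        {
                /* the component "usr" has length 3; skip ahead to the next component */
                printf("next component: \"%s\"\n", skip_trailing_slashes("usr///bin", 3));
                return 0;
        }
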