From: Linus Torvalds
Subject: Re: Word-at-a-time dcache name accesses (was Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?)
Date: Sat, 3 Mar 2012 12:10:09 -0800
To: Andi Kleen, "H. Peter Anvin"
Cc: Linux Kernel Mailing List, linux-fsdevel, Al Viro
List-Id: linux-fsdevel.vger.kernel.org

On Fri, Mar 2, 2012 at 3:46 PM, Linus Torvalds wrote:
>
> This *does* assume that "bsf" is a reasonably fast instruction, which is
> not necessarily the case especially on 32-bit x86. So the config option
> choice for this might want some tuning even on x86, but it would be lovely
> to get comments and have people test it out on older hardware.

Ok, so I was thinking about this. I can replace the "bsf" with a
multiply, and I just wonder which one is faster.

> +       /* Get the final path component length */
> +       len += __ffs(mask) >> 3;
> +
> +       /* The mask *below* the first high bit set */
> +       mask = (mask - 1) & ~mask;
> +       mask >>= 7;
> +       hash += a & mask;

So instead of the __ffs() on the original mask (to find the first byte
with the high bit set), I could use the "mask of bytes" and some math
to get the number of bytes set like this (so this goes at the end,
*after* we used the mask to mask off the bytes in 'a' - not where the
__ffs() is right now):

        /* Low bits set in each byte we used as a mask */
        mask &= ONEBYTES;

        /* Add up "mask + (mask<<8) + (mask<<16) +... ": same as a multiply */
        mask *= ONEBYTES;

        /* High byte now contains count of bits set */
        len += mask >> 8*(sizeof(unsigned long)-1);

which I find intriguing because it just continues with the whole
"bitmask tricks" thing and even happens to re-use one of the bitmasks
we already had.

On machines with slow bit scanning (and a good multiplier), that might
be faster.

Sadly, it's a multiply with a big constant. Yes, we could make the
constant smaller by not counting the highest byte: it is never set, so
we could use "ONEBYTES>>8" and shift right by
8*(sizeof(unsigned long)-2) instead, but it's still not as cheap as
just doing adds and masks.

I can't come up with anything really cheap to calculate "number of
bytes set". But the above may be cheaper than the bsf on some older
32-bit machines that have horrible bit scanning performance (e.g. Atom
or P4). An integer multiply tends to be around four cycles, the bsf
performance is all over the map (2-17 cycles latency).

                 Linus
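
As an illustration of the idea above (not code from the patch), here is a minimal user-space sketch that puts the bit-scan and the multiply variants side by side. The helper names (zero_byte_mask, bytes_by_bitscan, bytes_by_multiply, check_equivalence) are hypothetical, the GCC/Clang builtin __builtin_ctzl() stands in for the kernel's __ffs(), and the "first" byte is assumed to be the low byte of the word, as on little-endian x86.

```c
#include <assert.h>
#include <stddef.h>

#define ONEBYTES (~0UL / 0xff)      /* 0x0101...01 */
#define HIGHBITS (ONEBYTES * 0x80)  /* 0x8080...80 */

/*
 * Hypothetical helper: sets the high bit of every zero byte in v.
 * Bytes above the first zero byte can be flagged spuriously (borrow
 * propagation), but the lowest flagged byte is always a real zero.
 */
static unsigned long zero_byte_mask(unsigned long v)
{
        return (v - ONEBYTES) & ~v & HIGHBITS;
}

/* Bit-scan variant: byte index of the first flagged byte (mask != 0). */
static size_t bytes_by_bitscan(unsigned long mask)
{
        return (size_t)__builtin_ctzl(mask) >> 3;
}

/* Multiply variant: count the bytes below the first flagged byte. */
static size_t bytes_by_multiply(unsigned long mask)
{
        mask = (mask - 1) & ~mask;  /* all bits below the first high bit */
        mask >>= 7;                 /* 0xff in each byte before the terminator */
        mask &= ONEBYTES;           /* 0x01 in each such byte */
        mask *= ONEBYTES;           /* top byte = sum of all the bytes */
        return (size_t)(mask >> 8 * (sizeof(unsigned long) - 1));
}

/* Both variants should agree for every possible length within one word. */
static void check_equivalence(void)
{
        size_t n, i;

        for (n = 0; n < sizeof(unsigned long); n++) {
                unsigned long v = 0, mask;

                for (i = 0; i < n; i++)
                        v |= 0x61UL << (8 * i);   /* 'a' before the terminator */
                for (i = n + 1; i < sizeof(unsigned long); i++)
                        v |= 0x01UL << (8 * i);   /* provokes spurious high bits */

                mask = zero_byte_mask(v);
                assert(mask != 0);
                assert(bytes_by_bitscan(mask) == n);
                assert(bytes_by_multiply(mask) == n);
        }
}
```

Calling check_equivalence() from a small main() exercises every length from 0 to sizeof(unsigned long)-1. Which variant is actually faster still has to be measured on the hardware in question; the sketch only shows that the two give the same answer.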