From: "Darrick J. Wong" <djwong@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: david@fromorbit.com, xfs <linux-xfs@vger.kernel.org>
Subject: Re: [PATCHSET 0/3] xfs: fix ascii-ci problems with userspace
Date: Tue, 4 Apr 2023 14:00:09 -0700 [thread overview]
Message-ID: <20230404210009.GA303486@frogsfrogsfrogs> (raw)
In-Reply-To: <CAHk-=wiBm6VNykd5qH9y_SmL+Z8BJDXNwiny8Z0Yss86Wb6STw@mail.gmail.com>
On Tue, Apr 04, 2023 at 01:21:25PM -0700, Linus Torvalds wrote:
> On Tue, Apr 4, 2023 at 11:19 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Limiting yourself to US-ASCII is at least technically valid. Because
> > EBCDIC isn't worth worrying about. But when the high bit is set, you
> > had better not touch it, or you need to limit it spectacularly.
>
> Side note: if limiting it to US-ASCII is fine (and it had better be,
> because as mentioned, anything else will result in unresolvable
> problems), you might look at using this as the pre-hash function:
>
> unsigned char prehash(unsigned char c)
> {
> unsigned char mask = (~(c >> 1) & c & 64) >> 1;
> return c & ~mask;
> }
>
> which does modify a few other characters too, but nothing that matters
> for hashing.
>
> The advantage of the above is that you can trivially vectorize it. You
> can do it with just regular integer math (64 bits = 8 bytes in
> parallel), no need to use *actual* vector hardware.
>
> The actual comparison needs to do the careful thing (because '~' and
> '^' may hash to the same value, but obviously aren't the same), but
> even there you can do a cheap "are these 8 characters _possibly_ the
> same) with a very simple single 64-bit comparison, and only go to the
> careful path if things match, ie
>
> /* Cannot possibly be equal even case-insentivitely? */
> if ((word1 ^ word2) & ~0x2020202020202020ul)
> continue;
> /* Ok, same in all but the 5th bits, go be careful */
> ....
>
> and the reason I mention this is because I have been idly thinking
> about supporting case-insensitivity at the VFS layer for multiple
> decades, but have always decided that it's *so* nasty that I really
> was hoping it just is never an issue in practice.
If it were up to me I'd invent a userspace shim fs that would perform
whatever normalizations are desired, and pass that (and ideally a lookup
hash) to the underlying kernel/fs. Users can configure whatever
filtering they want and set LC_ALL as they please, and we kernel
developers never have to know, and the users never have to see what
actually gets written to disk. If users want normalized ci lookups, the
shim can do that.
ext4 tried to do better than XFS by actually defining the mathematical
transformation that would be applied to incoming names and refusing
things that would devolve into brokenness, but then it turns out that it
was utf8_data.c. Urgh.
I get it, shi* fses are not popular and are not fast, but if the Samba
benchmarks are still valid, multiple kernel<->fuserspace transitions are
still faster that their workaround.
> Particularly since the low-level filesystems then inevitably decide
> that they need to do things wrong and need a locale, and at that point
> all hope is lost.
>
> I was hoping xfs would be one of the sane filesystems.
Hah, nope, I'm all out of sanity here. :(
--D
> Linus
next prev parent reply other threads:[~2023-04-04 21:00 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-04-04 17:07 [PATCHSET 0/3] xfs: fix ascii-ci problems with userspace Darrick J. Wong
2023-04-04 17:07 ` [PATCH 1/3] xfs: stabilize the tolower function used for ascii-ci dir hash computation Darrick J. Wong
2023-04-04 17:54 ` Linus Torvalds
2023-04-04 18:32 ` Darrick J. Wong
2023-04-04 18:58 ` Linus Torvalds
2023-04-04 23:30 ` Dave Chinner
2023-04-05 0:17 ` Linus Torvalds
2023-04-05 6:12 ` Christoph Hellwig
2023-04-05 15:40 ` Darrick J. Wong
2023-04-05 15:42 ` Christoph Hellwig
2023-04-05 17:10 ` Darrick J. Wong
2023-04-05 10:48 ` Christoph Hellwig
2023-04-05 15:30 ` Darrick J. Wong
2023-04-05 15:45 ` Linus Torvalds
2023-04-04 17:07 ` [PATCH 2/3] xfs: test the ascii case-insensitive hash Darrick J. Wong
2023-04-04 18:06 ` Linus Torvalds
2023-04-04 20:51 ` Darrick J. Wong
2023-04-04 21:21 ` Linus Torvalds
2023-04-05 6:15 ` Christoph Hellwig
2023-04-04 17:07 ` [PATCH 3/3] xfs: use the directory name hash function for dir scrubbing Darrick J. Wong
2023-04-04 17:17 ` [PATCHSET 0/3] xfs: fix ascii-ci problems with userspace Darrick J. Wong
2023-04-04 18:19 ` Linus Torvalds
2023-04-04 20:21 ` Linus Torvalds
2023-04-04 21:00 ` Darrick J. Wong [this message]
2023-04-04 21:50 ` Linus Torvalds
2023-04-04 21:09 ` [PATCH] xfstests: add a couple more tests for ascii-ci problems Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230404210009.GA303486@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox