public inbox for linux-xfs@vger.kernel.org
From: "Darrick J. Wong" <djwong@kernel.org>
To: Christoph Hellwig <hch@infradead.org>, david@fromorbit.com
Cc: torvalds@linux-foundation.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/3] xfs: stabilize the tolower function used for ascii-ci dir hash computation
Date: Wed, 5 Apr 2023 08:30:02 -0700
Message-ID: <20230405153002.GE303486@frogsfrogsfrogs>
In-Reply-To: <ZC1R4IRx7ZiBeeLJ@infradead.org>

On Wed, Apr 05, 2023 at 03:48:00AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 04, 2023 at 10:07:06AM -0700, Darrick J. Wong wrote:
> > Which means that the kernel and userspace do not agree on the hash value
> > for a directory filename that contains those higher values.  The hash
> > values are written into the leaf index block of directories that are
> > larger than two blocks in size, which means that xfs_repair will flag
> > these directories as having corrupted hash indexes and rewrite the index
> > with hash values that the kernel now will not recognize.
> > 
> > Because the ascii-ci feature is not frequently enabled and the kernel
> > touches filesystems far more frequently than xfs_repair does, fix this
> > by encoding the kernel's toupper predicate and tolower functions into
> > libxfs.  This makes userspace's behavior consistent with the kernel.
> 
> I agree with making the userspace behavior consistent with the actual
> kernel behavior.  Sadly the documented behavior differs from both
> of them, so I think we need to also document the actual tables used
> in the mkfs.xfs manpage, as it isn't actually just ASCII.

Agreed.  Given that kernel tolower() behavior has been stable since 1996
(and remaps the ISO 8859-1 accented letters), the "ASCII CI" feature
most closely maps to "ISO 8859-1 CI".  But at this point there's not
even a shared understanding (Dave said latin1, you said 7-bit ascii,
IDGAF) so I agree that documenting the exact transformations in the
manpage is the only sane way forward.

I propose changing the mkfs.xfs manpage wording from:

"The version=ci option enables ASCII only case-insensitive filename
lookup and version 2 directories. Filenames are case-preserving, that
is, the names are stored in directories using the case they were
created with."

into:

"If the version=ci option is specified, the kernel will transform
certain bytes in filenames before performing lookup-related operations.
The byte sequence given to create a directory entry is persisted without
alterations.  The lookup transformations are defined as follows:

0x41 - 0x5a -> 0x61 - 0x7a
0xc0 - 0xd6 -> 0xe0 - 0xf6
0xd8 - 0xde -> 0xf8 - 0xfe

This transformation roughly corresponds to case insensitivity in ISO
8859-1 and may cause problems with other encodings (e.g. UTF8).  The
feature will be disabled by default in September 2025, and removed from
the kernel in September 2030."
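
To make the three ranges above concrete, here is a minimal C sketch of
the per-byte transformation.  It is only an illustration of the table in
the proposed wording, not the kernel's or libxfs's actual code; the names
ascii_ci_fold, ascii_ci_names_equal and ascii_ci_fold_selfcheck are
hypothetical.  The self-check also shows why other encodings can be
affected: in UTF-8, 0xc3 is a lead byte, and it falls inside the second
range.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: apply the three byte ranges from the proposed
 * manpage text.  Every other byte is returned unchanged.
 */
static unsigned char ascii_ci_fold(unsigned char c)
{
	if (c >= 0x41 && c <= 0x5a)	/* 0x41 - 0x5a -> 0x61 - 0x7a */
		return c + 0x20;
	if (c >= 0xc0 && c <= 0xd6)	/* 0xc0 - 0xd6 -> 0xe0 - 0xf6 */
		return c + 0x20;
	if (c >= 0xd8 && c <= 0xde)	/* 0xd8 - 0xde -> 0xf8 - 0xfe */
		return c + 0x20;
	return c;
}

/*
 * Hypothetical helper: compare two equal-length names byte by byte
 * after folding.  Returns 1 if they match, 0 otherwise.
 */
static int ascii_ci_names_equal(const unsigned char *a,
				const unsigned char *b, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (ascii_ci_fold(a[i]) != ascii_ci_fold(b[i]))
			return 0;
	return 1;
}

/* Hypothetical self-check exercising the edges of each range. */
static void ascii_ci_fold_selfcheck(void)
{
	/* ISO 8859-1: 0xc9 and 0xe9 are the upper/lower case e-acute. */
	const unsigned char latin1_upper[] = { 0xc9 };
	const unsigned char latin1_lower[] = { 0xe9 };
	/* UTF-8 encodes U+00C9 as c3 89; e3 89 starts an unrelated sequence. */
	const unsigned char utf8_name[] = { 0xc3, 0x89 };
	const unsigned char other_name[] = { 0xe3, 0x89 };

	assert(ascii_ci_fold(0x41) == 0x61);
	assert(ascii_ci_fold(0x5a) == 0x7a);
	assert(ascii_ci_fold(0xde) == 0xfe);
	/* 0xd7 and 0xdf sit outside the ranges and are left alone. */
	assert(ascii_ci_fold(0xd7) == 0xd7);
	assert(ascii_ci_fold(0xdf) == 0xdf);

	assert(ascii_ci_names_equal(latin1_upper, latin1_lower, 1));
	/* Distinct UTF-8 byte strings collide after folding. */
	assert(ascii_ci_names_equal(utf8_name, other_name, 2));
}
```

The gaps at 0xd7 and 0xdf line up with the ISO 8859-1 multiplication
sign and sharp s, which is consistent with the "roughly corresponds to
ISO 8859-1" description.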

> Does the kernel tolower behavior map to an actual documented charset?
> In that case we can just point to it, which would be way easier than
> documenting all the details.

It *seems* to operate on ISO 8859-1 (aka latin1), but Linus implied that
the history of lib/ctype.c is lost to the ages.  Or at least 1996-era
mailing list archives.

--D

