public inbox for linux-bcachefs@vger.kernel.org
 help / color / mirror / Atom feed
* [v4 0/3] bcachefs: Casefolding implementation
@ 2023-08-14  1:24 Joshua Ashton
  2023-08-14  1:24 ` [v4 1/3] bcachefs: Split out dirent alloc and name initialization Joshua Ashton
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Joshua Ashton @ 2023-08-14  1:24 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Joshua Ashton

This implements support for case-insensitive file and directory
lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
casefolding attributes for bcachefs.

The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs is using currently

The main usecase for casefolding is compatibility with software written
against other filesystems that rely on casefolded lookups
(eg. NTFS and Wine/Proton).
Taking advantage of file-system level casefolding can lead to great
loading time gains in many applications and games.

See more information and rationale in:
Documentation/filesystems/bcachefs/casefolding.rst

You can see/review the work in my tree here:
https://github.com/Joshua-Ashton/bcachefs/tree/casefold

v2: Update format to store length in dirent based on flag, add comments, cleanup
dirent_create_key, other misc. nits from IRC.

v3: Cleanup the dirent code, utilize the fact that BCH_NAME_MAX is smaller now,
remove fallbacks for too big keys that won't make sense anymore. Improved
lookup string lowering.

v4: Move cf_name allocation from stack to trans alloc, validated bkey in invalid,
used length directly, misc. nits.

Joshua Ashton (3):
  bcachefs: Split out dirent alloc and name initialization
  bcachefs: Move dirent_val_u64s to dirent.c
  bcachefs: Implement casefolding

 .../filesystems/bcachefs/casefolding.rst      |  73 ++++++
 fs/bcachefs/bcachefs.h                        |   6 +
 fs/bcachefs/bcachefs_format.h                 |  26 ++-
 fs/bcachefs/dirent.c                          | 219 +++++++++++++++---
 fs/bcachefs/dirent.h                          |   6 -
 fs/bcachefs/fs-common.c                       |   4 +
 fs/bcachefs/fs-ioctl.c                        |  22 ++
 fs/bcachefs/fs-ioctl.h                        |  20 +-
 fs/bcachefs/str_hash.h                        |   4 +
 fs/bcachefs/super.c                           |  19 ++
 10 files changed, 353 insertions(+), 46 deletions(-)
 create mode 100644 Documentation/filesystems/bcachefs/casefolding.rst

-- 
2.41.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [v4 1/3] bcachefs: Split out dirent alloc and name initialization
  2023-08-14  1:24 [v4 0/3] bcachefs: Casefolding implementation Joshua Ashton
@ 2023-08-14  1:24 ` Joshua Ashton
  2023-08-14  1:24 ` [v4 2/3] bcachefs: Move dirent_val_u64s to dirent.c Joshua Ashton
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Joshua Ashton @ 2023-08-14  1:24 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Joshua Ashton

Splits out the code that allocates the dirent and initializes the name
to make things easier to implement casefolding in a future commit.

Signed-off-by: Joshua Ashton <joshua@froggi.es>
---
 fs/bcachefs/dirent.c | 46 ++++++++++++++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c
index a7559ab03802..85769d5c6b82 100644
--- a/fs/bcachefs/dirent.c
+++ b/fs/bcachefs/dirent.c
@@ -169,15 +169,13 @@ void bch2_dirent_to_text(struct printbuf *out, struct bch_fs *c,
 	       bch2_d_type_str(d.v->d_type));
 }
 
-static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
-				subvol_inum dir, u8 type,
-				const struct qstr *name, u64 dst)
+static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans,
+				subvol_inum dir,
+				u8 type,
+				int name_len, u64 dst)
 {
 	struct bkey_i_dirent *dirent;
-	unsigned u64s = BKEY_U64s + dirent_val_u64s(name->len);
-
-	if (name->len > BCH_NAME_MAX)
-		return ERR_PTR(-ENAMETOOLONG);
+	unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len);
 
 	BUG_ON(u64s > U8_MAX);
 
@@ -197,11 +195,35 @@ static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
 
 	dirent->v.d_type = type;
 
-	memcpy(dirent->v.d_name, name->name, name->len);
-	memset(dirent->v.d_name + name->len, 0,
-	       bkey_val_bytes(&dirent->k) -
-	       offsetof(struct bch_dirent, d_name) -
-	       name->len);
+	return dirent;
+}
+
+static void dirent_init_regular_name(struct bkey_i_dirent *dirent,
+				     const struct qstr *name)
+{
+	memcpy(&dirent->v.d_name[0], name->name, name->len);
+	memset(&dirent->v.d_name[name->len], 0,
+		bkey_val_bytes(&dirent->k) -
+		offsetof(struct bch_dirent, d_name) -
+		name->len);
+}
+
+static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
+				subvol_inum dir,
+				u8 type,
+				const struct qstr *name,
+				u64 dst)
+{
+	struct bkey_i_dirent *dirent;
+
+	if (name->len > BCH_NAME_MAX)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	dirent = dirent_alloc_key(trans, dir, type, name->len, dst);
+	if (IS_ERR(dirent))
+		return dirent;
+
+	dirent_init_regular_name(dirent, name);
 
 	EBUG_ON(bch2_dirent_name_bytes(dirent_i_to_s_c(dirent)) != name->len);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [v4 2/3] bcachefs: Move dirent_val_u64s to dirent.c
  2023-08-14  1:24 [v4 0/3] bcachefs: Casefolding implementation Joshua Ashton
  2023-08-14  1:24 ` [v4 1/3] bcachefs: Split out dirent alloc and name initialization Joshua Ashton
@ 2023-08-14  1:24 ` Joshua Ashton
  2023-08-14  1:24 ` [v4 3/3] bcachefs: Implement casefolding Joshua Ashton
  2023-08-16 19:40 ` [v4 0/3] bcachefs: Casefolding implementation Kent Overstreet
  3 siblings, 0 replies; 15+ messages in thread
From: Joshua Ashton @ 2023-08-14  1:24 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Joshua Ashton

Not used anywhere else anymore, can become local to dirent.c now.

Signed-off-by: Joshua Ashton <joshua@froggi.es>
---
 fs/bcachefs/dirent.c | 6 ++++++
 fs/bcachefs/dirent.h | 6 ------
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c
index 85769d5c6b82..6bdcccabd732 100644
--- a/fs/bcachefs/dirent.c
+++ b/fs/bcachefs/dirent.c
@@ -13,6 +13,12 @@
 
 #include <linux/dcache.h>
 
+static inline unsigned dirent_val_u64s(unsigned len)
+{
+	return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len,
+			    sizeof(u64));
+}
+
 static unsigned bch2_dirent_name_bytes(struct bkey_s_c_dirent d)
 {
 	unsigned bkey_u64s = bkey_val_u64s(d.k);
diff --git a/fs/bcachefs/dirent.h b/fs/bcachefs/dirent.h
index e9fa1df38232..56061ac93ad3 100644
--- a/fs/bcachefs/dirent.h
+++ b/fs/bcachefs/dirent.h
@@ -26,12 +26,6 @@ struct bch_inode_info;
 
 struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d);
 
-static inline unsigned dirent_val_u64s(unsigned len)
-{
-	return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len,
-			    sizeof(u64));
-}
-
 int bch2_dirent_read_target(struct btree_trans *, subvol_inum,
 			    struct bkey_s_c_dirent, subvol_inum *);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [v4 3/3] bcachefs: Implement casefolding
  2023-08-14  1:24 [v4 0/3] bcachefs: Casefolding implementation Joshua Ashton
  2023-08-14  1:24 ` [v4 1/3] bcachefs: Split out dirent alloc and name initialization Joshua Ashton
  2023-08-14  1:24 ` [v4 2/3] bcachefs: Move dirent_val_u64s to dirent.c Joshua Ashton
@ 2023-08-14  1:24 ` Joshua Ashton
  2023-08-14 16:34   ` Gabriel Krisman Bertazi
  2023-08-15 12:36   ` Brian Foster
  2023-08-16 19:40 ` [v4 0/3] bcachefs: Casefolding implementation Kent Overstreet
  3 siblings, 2 replies; 15+ messages in thread
From: Joshua Ashton @ 2023-08-14  1:24 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Joshua Ashton, André Almeida, Gabriel Krisman Bertazi

This patch implements support for case-insensitive file name lookups
in bcachefs.

The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs is using.

More information is provided in Documentation/bcachefs/casefolding.rst

Signed-off-by: Joshua Ashton <joshua@froggi.es>

Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
---
 .../filesystems/bcachefs/casefolding.rst      |  73 ++++++++
 fs/bcachefs/bcachefs.h                        |   6 +
 fs/bcachefs/bcachefs_format.h                 |  26 ++-
 fs/bcachefs/dirent.c                          | 177 +++++++++++++++---
 fs/bcachefs/fs-common.c                       |   4 +
 fs/bcachefs/fs-ioctl.c                        |  22 +++
 fs/bcachefs/fs-ioctl.h                        |  20 +-
 fs/bcachefs/str_hash.h                        |   4 +
 fs/bcachefs/super.c                           |  19 ++
 9 files changed, 318 insertions(+), 33 deletions(-)
 create mode 100644 Documentation/filesystems/bcachefs/casefolding.rst

diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
new file mode 100644
index 000000000000..c8580d892ae8
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/casefolding.rst
@@ -0,0 +1,73 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Casefolding
+===========
+
+bcachefs has support for case-insensitive file and directory
+lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
+casefolding attributes.
+
+The main usecase for casefolding is compatibility with software written
+against other filesystems that rely on casefolded lookups
+(eg. NTFS and Wine/Proton).
+Taking advantage of file-system level casefolding can lead to great
+loading time gains in many applications and games.
+
+Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
+Once a directory has been flagged for casefolding, a feature bit
+is enabled on the superblock which marks the filesystem as using
+casefolding.
+When the feature bit for casefolding is enabled, it is no longer possible
+to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
+
+On the lookup/query side: casefolding is implemented by allocating a new
+string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
+casefold the query string.
+
+On the dirent side: casefolding is implemented by ensuring the `bkey`'s
+hash is made from the casefolded string and storing the cached casefolded
+name with the regular name in the dirent.
+
+The structure looks like this:
+
+Regular:    [dirent data][regular name][nul][nul]...
+Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
+
+(Do note, the number of `NUL`s here is merely for illustration, they count can vary
+ per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.)
+
+This is efficient as it means that for all file lookups that require casefolding,
+it has identical performance to a regular lookup:
+a hash comparison and a `memcmp` of the name.
+
+Rationale
+---------
+
+Several designs were considered for this system:
+One was to introduce a dirent_v2, however that would be painful especially as
+the hash system only has support for a single key type. This would also need
+`BCH_NAME_MAX` to change between versions, and a new feature bit.
+
+Another option was to store without the two lengths, and just take the length of
+the regular name and casefolded name contiguously / 2 as the length. This would
+assume that the regular length == casefolded length, but that could potentially
+not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
+the lowercase unicode glyph.
+It would be possible to disregard the casefold cache for those cases, but it was
+decided to simply encode the two string lengths in the key to avoid random
+performance issues if this edgecase was ever hit.
+
+The option settled on was to use a free-bit in d_type to mark a dirent as having
+a casefold cache, and then treat the first 4 bytes the name block as lengths.
+You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
+
+The feature bit was used to allow casefolding support to be enabled for the majority
+of users, but some allow users who have no need for the feature to still use bcachefs as
+`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
+which may be decider between using bcachefs for eg. embedded platforms.
+
+Other filesystems like ext4 and f2fs have a super-block level option for casefolding
+encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
+any encodings than a single UTF-8 version. When future encodings are desirable,
+they will be added trivially using the opts mechanism.
+
diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 30b3d7b9f9dc..0e1facdc7876 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -202,6 +202,7 @@
 #include <linux/types.h>
 #include <linux/workqueue.h>
 #include <linux/zstd.h>
+#include <linux/unicode.h>
 
 #include "bcachefs_format.h"
 #include "errcode.h"
@@ -657,6 +658,8 @@ enum bch_write_ref {
 	BCH_WRITE_REF_NR,
 };
 
+#define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0)
+
 struct bch_fs {
 	struct closure		cl;
 
@@ -723,6 +726,9 @@ struct bch_fs {
 		u64		compat;
 	}			sb;
 
+#if IS_ENABLED(CONFIG_UNICODE)
+	struct unicode_map 	*cf_encoding;
+#endif
 
 	struct bch_sb_handle	disk_sb;
 
diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h
index 23bae622309c..cc6797422977 100644
--- a/fs/bcachefs/bcachefs_format.h
+++ b/fs/bcachefs/bcachefs_format.h
@@ -852,6 +852,8 @@ enum {
 	__BCH_INODE_UNLINKED		= 7,
 	__BCH_INODE_BACKPTR_UNTRUSTED	= 8,
 
+	__BCH_INODE_CASEFOLDED		= 9,
+
 	/* bits 20+ reserved for packed fields below: */
 };
 
@@ -864,6 +866,7 @@ enum {
 #define BCH_INODE_I_SECTORS_DIRTY (1 << __BCH_INODE_I_SECTORS_DIRTY)
 #define BCH_INODE_UNLINKED	(1 << __BCH_INODE_UNLINKED)
 #define BCH_INODE_BACKPTR_UNTRUSTED (1 << __BCH_INODE_BACKPTR_UNTRUSTED)
+#define BCH_INODE_CASEFOLDED	(1 << __BCH_INODE_CASEFOLDED)
 
 LE32_BITMASK(INODE_STR_HASH,	struct bch_inode, bi_flags, 20, 24);
 LE32_BITMASK(INODE_NR_FIELDS,	struct bch_inode, bi_flags, 24, 31);
@@ -908,9 +911,25 @@ struct bch_dirent {
 	 * Copy of mode bits 12-15 from the target inode - so userspace can get
 	 * the filetype without having to do a stat()
 	 */
-	__u8			d_type;
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+	__u8			d_type:5,
+				d_unused:2,
+				d_casefold:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+	__u8			d_casefold:1,
+				d_unused:2,
+				d_type:5;
+#endif
 
-	__u8			d_name[];
+	union {
+	struct {
+		__u8		d_pad;
+		__le16		d_name_len;
+		__le16		d_cf_name_len;
+		__u8		d_names[0];
+	} d_cf_name_block __packed;
+	__u8			d_name[0];
+	} __packed;
 } __packed __aligned(8);
 
 #define DT_SUBVOL	16
@@ -1841,7 +1860,8 @@ static inline void SET_BCH_SB_BACKGROUND_COMPRESSION_TYPE(struct bch_sb *sb, __u
 	x(new_varint,			15)	\
 	x(journal_no_flush,		16)	\
 	x(alloc_v2,			17)	\
-	x(extents_across_btree_nodes,	18)
+	x(extents_across_btree_nodes,	18)	\
+	x(casefolding,			19)
 
 #define BCH_SB_FEATURES_ALWAYS				\
 	((1ULL << BCH_FEATURE_new_extent_overwrite)|	\
diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c
index 6bdcccabd732..6ac45d7ce6e9 100644
--- a/fs/bcachefs/dirent.c
+++ b/fs/bcachefs/dirent.c
@@ -13,8 +13,39 @@
 
 #include <linux/dcache.h>
 
-static inline unsigned dirent_val_u64s(unsigned len)
+static inline int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info,
+				const struct qstr *str, struct qstr *out_cf)
 {
+#if IS_ENABLED(CONFIG_UNICODE)
+	unsigned char *buf = NULL;
+	int ret;
+
+	*out_cf = (struct qstr) QSTR_INIT(NULL, 0);
+
+	buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1);
+	ret = PTR_ERR_OR_ZERO(buf);
+	if (ret)
+		return ret;
+
+	ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1);
+	if (ret <= 0)
+		return ret;
+
+	*out_cf = (struct qstr) QSTR_INIT(buf, ret);
+	return 0;
+#else
+	*out_cf = (struct qstr) QSTR_INIT(NULL, 0);
+
+	return -EOPNOTSUPP;
+#endif
+}
+
+static inline unsigned dirent_val_u64s(unsigned len, unsigned cf_len)
+{
+	if (cf_len > 0) {
+		return DIV_ROUND_UP(offsetof(struct bch_dirent, d_cf_name_block.d_names) + len + cf_len,
+				    sizeof(u64));
+	}
 	return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len,
 			    sizeof(u64));
 }
@@ -31,13 +62,38 @@ static unsigned bch2_dirent_name_bytes(struct bkey_s_c_dirent d)
 #endif
 
 	return bkey_bytes -
-		offsetof(struct bch_dirent, d_name) -
+		(d.v->d_casefold
+		? offsetof(struct bch_dirent, d_cf_name_block.d_names)
+		: offsetof(struct bch_dirent, d_name)) -
 		trailing_nuls;
 }
 
 struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d)
 {
-	return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d));
+	if (d.v->d_casefold) {
+		unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len);
+		return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[0], name_len);
+	} else {
+		return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d));
+	}
+}
+
+struct qstr bch2_dirent_get_casefold_name(struct bkey_s_c_dirent d)
+{
+	if (d.v->d_casefold) {
+		unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len);
+		unsigned cf_name_len = le16_to_cpu(d.v->d_cf_name_block.d_cf_name_len);
+		return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[name_len], cf_name_len);
+	} else {
+		return (struct qstr) QSTR_INIT(NULL, 0);
+	}
+}
+
+static inline struct qstr bch2_dirent_get_lookup_name(struct bkey_s_c_dirent d)
+{
+	return d.v->d_casefold
+		? bch2_dirent_get_casefold_name(d)
+		: bch2_dirent_get_name(d);
 }
 
 static u64 bch2_dirent_hash(const struct bch_hash_info *info,
@@ -60,7 +116,7 @@ static u64 dirent_hash_key(const struct bch_hash_info *info, const void *key)
 static u64 dirent_hash_bkey(const struct bch_hash_info *info, struct bkey_s_c k)
 {
 	struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k);
-	struct qstr name = bch2_dirent_get_name(d);
+	struct qstr name = bch2_dirent_get_lookup_name(d);
 
 	return bch2_dirent_hash(info, &name);
 }
@@ -68,7 +124,7 @@ static u64 dirent_hash_bkey(const struct bch_hash_info *info, struct bkey_s_c k)
 static bool dirent_cmp_key(struct bkey_s_c _l, const void *_r)
 {
 	struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l);
-	const struct qstr l_name = bch2_dirent_get_name(l);
+	const struct qstr l_name = bch2_dirent_get_lookup_name(l);
 	const struct qstr *r_name = _r;
 
 	return l_name.len - r_name->len ?: memcmp(l_name.name, r_name->name, l_name.len);
@@ -78,8 +134,8 @@ static bool dirent_cmp_bkey(struct bkey_s_c _l, struct bkey_s_c _r)
 {
 	struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l);
 	struct bkey_s_c_dirent r = bkey_s_c_to_dirent(_r);
-	const struct qstr l_name = bch2_dirent_get_name(l);
-	const struct qstr r_name = bch2_dirent_get_name(r);
+	const struct qstr l_name = bch2_dirent_get_lookup_name(l);
+	const struct qstr r_name = bch2_dirent_get_lookup_name(r);
 
 	return l_name.len - r_name.len ?: memcmp(l_name.name, r_name.name, l_name.len);
 }
@@ -108,16 +164,24 @@ int bch2_dirent_invalid(const struct bch_fs *c, struct bkey_s_c k,
 			struct printbuf *err)
 {
 	struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k);
+	unsigned name_block_len = bch2_dirent_name_bytes(d);
 	struct qstr d_name = bch2_dirent_get_name(d);
+	struct qstr d_cf_name = bch2_dirent_get_casefold_name(d);
 
 	if (!d_name.len) {
 		prt_printf(err, "empty name");
 		return -BCH_ERR_invalid_bkey;
 	}
 
-	if (bkey_val_u64s(k.k) > dirent_val_u64s(d_name.len)) {
+	if (bkey_val_u64s(k.k) > dirent_val_u64s(d_name.len, d_cf_name.len)) {
 		prt_printf(err, "value too big (%zu > %u)",
-		       bkey_val_u64s(k.k), dirent_val_u64s(d_name.len));
+		       bkey_val_u64s(k.k), dirent_val_u64s(d_name.len, d_cf_name.len));
+		return -BCH_ERR_invalid_bkey;
+	}
+
+	if (d_name.len + d_cf_name.len > name_block_len) {
+		prt_printf(err, "dirent names exceed bkey size (%d + %d > %d)",
+		       d_name.len, d_cf_name.len, name_block_len);
 		return -BCH_ERR_invalid_bkey;
 	}
 
@@ -157,6 +221,19 @@ int bch2_dirent_invalid(const struct bch_fs *c, struct bkey_s_c k,
 		return -BCH_ERR_invalid_bkey;
 	}
 
+	if (d.v->d_casefold) {
+		if ((flags & BKEY_INVALID_COMMIT) && d_cf_name.len > BCH_NAME_MAX) {
+			prt_printf(err, "dirent w/ cf name too big (%u > %u)",
+			d_cf_name.len, BCH_NAME_MAX);
+			return -BCH_ERR_invalid_bkey;
+		}
+
+		if (d_cf_name.len != strnlen(d_cf_name.name, d_cf_name.len)) {
+			prt_printf(err, "dirent has stray data after cf name's NUL");
+			return -BCH_ERR_invalid_bkey;
+		}
+	}
+
 	return 0;
 }
 
@@ -178,10 +255,11 @@ void bch2_dirent_to_text(struct printbuf *out, struct bch_fs *c,
 static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans,
 				subvol_inum dir,
 				u8 type,
-				int name_len, u64 dst)
+				int name_len, int cf_name_len,
+				u64 dst)
 {
 	struct bkey_i_dirent *dirent;
-	unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len);
+	unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len, cf_name_len);
 
 	BUG_ON(u64s > U8_MAX);
 
@@ -200,6 +278,8 @@ static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans,
 	}
 
 	dirent->v.d_type = type;
+	dirent->v.d_unused = 0;
+	dirent->v.d_casefold = cf_name_len ? 1 : 0;
 
 	return dirent;
 }
@@ -207,6 +287,8 @@ static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans,
 static void dirent_init_regular_name(struct bkey_i_dirent *dirent,
 				     const struct qstr *name)
 {
+	EBUG_ON(dirent->v.d_casefold);
+
 	memcpy(&dirent->v.d_name[0], name->name, name->len);
 	memset(&dirent->v.d_name[name->len], 0,
 		bkey_val_bytes(&dirent->k) -
@@ -214,10 +296,30 @@ static void dirent_init_regular_name(struct bkey_i_dirent *dirent,
 		name->len);
 }
 
+static void dirent_init_casefolded_name(struct bkey_i_dirent *dirent,
+					const struct qstr *name,
+					const struct qstr *cf_name)
+{
+	EBUG_ON(!dirent->v.d_casefold);
+	EBUG_ON(!cf_name->len);
+
+	dirent->v.d_cf_name_block.d_name_len = name->len;
+	dirent->v.d_cf_name_block.d_cf_name_len = cf_name->len;
+	memcpy(&dirent->v.d_cf_name_block.d_names[0], name->name, name->len);
+	memcpy(&dirent->v.d_cf_name_block.d_names[name->len], cf_name->name, cf_name->len);
+	memset(&dirent->v.d_cf_name_block.d_names[name->len + cf_name->len], 0,
+		bkey_val_bytes(&dirent->k) -
+		offsetof(struct bch_dirent, d_cf_name_block.d_names) -
+		name->len + cf_name->len);
+
+	EBUG_ON(bch2_dirent_get_casefold_name(dirent_i_to_s_c(dirent)).len != cf_name->len);
+}
+
 static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
 				subvol_inum dir,
 				u8 type,
 				const struct qstr *name,
+				const struct qstr *cf_name,
 				u64 dst)
 {
 	struct bkey_i_dirent *dirent;
@@ -225,13 +327,16 @@ static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
 	if (name->len > BCH_NAME_MAX)
 		return ERR_PTR(-ENAMETOOLONG);
 
-	dirent = dirent_alloc_key(trans, dir, type, name->len, dst);
+	dirent = dirent_alloc_key(trans, dir, type, name->len, cf_name ? cf_name->len : 0, dst);
 	if (IS_ERR(dirent))
 		return dirent;
 
-	dirent_init_regular_name(dirent, name);
+	if (cf_name)
+		dirent_init_casefolded_name(dirent, name, cf_name);
+	else
+		dirent_init_regular_name(dirent, name);
 
-	EBUG_ON(bch2_dirent_name_bytes(dirent_i_to_s_c(dirent)) != name->len);
+	EBUG_ON(bch2_dirent_get_name(dirent_i_to_s_c(dirent)).len != name->len);
 
 	return dirent;
 }
@@ -244,7 +349,16 @@ int bch2_dirent_create(struct btree_trans *trans, subvol_inum dir,
 	struct bkey_i_dirent *dirent;
 	int ret;
 
-	dirent = dirent_create_key(trans, dir, type, name, dst_inum);
+	if (hash_info->cf_encoding) {
+		struct qstr cf_name;
+		ret = bch2_casefold(trans, hash_info, name, &cf_name);
+		if (ret)
+			return ret;
+		dirent = dirent_create_key(trans, dir, type, name, &cf_name, dst_inum);
+	} else {
+		dirent = dirent_create_key(trans, dir, type, name, NULL, dst_inum);
+	}
+
 	ret = PTR_ERR_OR_ZERO(dirent);
 	if (ret)
 		return ret;
@@ -294,6 +408,8 @@ int bch2_dirent_rename(struct btree_trans *trans,
 		const struct qstr *dst_name, subvol_inum *dst_inum, u64 *dst_offset,
 		enum bch_rename_mode mode)
 {
+	struct qstr src_name_lookup = *src_name;
+	struct qstr dst_name_lookup = *dst_name;
 	struct btree_iter src_iter = { NULL };
 	struct btree_iter dst_iter = { NULL };
 	struct bkey_s_c old_src, old_dst = bkey_s_c_null;
@@ -310,8 +426,13 @@ int bch2_dirent_rename(struct btree_trans *trans,
 	memset(dst_inum, 0, sizeof(*dst_inum));
 
 	/* Lookup src: */
+	ret = src_hash->cf_encoding
+		? bch2_casefold(trans, src_hash, src_name, &src_name_lookup)
+		: 0;
+	if (ret)
+		goto out;
 	ret = bch2_hash_lookup(trans, &src_iter, bch2_dirent_hash_desc,
-			       src_hash, src_dir, src_name,
+			       src_hash, src_dir, &src_name_lookup,
 			       BTREE_ITER_INTENT);
 	if (ret)
 		goto out;
@@ -333,6 +454,11 @@ int bch2_dirent_rename(struct btree_trans *trans,
 
 
 	/* Lookup dst: */
+	ret = dst_hash->cf_encoding
+		? bch2_casefold(trans, dst_hash, dst_name, &dst_name_lookup)
+		: 0;
+	if (ret)
+		goto out;
 	if (mode == BCH_RENAME) {
 		/*
 		 * Note that we're _not_ checking if the target already exists -
@@ -340,12 +466,12 @@ int bch2_dirent_rename(struct btree_trans *trans,
 		 * correctness:
 		 */
 		ret = bch2_hash_hole(trans, &dst_iter, bch2_dirent_hash_desc,
-				     dst_hash, dst_dir, dst_name);
+				     dst_hash, dst_dir, &dst_name_lookup);
 		if (ret)
 			goto out;
 	} else {
 		ret = bch2_hash_lookup(trans, &dst_iter, bch2_dirent_hash_desc,
-				       dst_hash, dst_dir, dst_name,
+				       dst_hash, dst_dir, &dst_name_lookup,
 				       BTREE_ITER_INTENT);
 		if (ret)
 			goto out;
@@ -370,7 +496,8 @@ int bch2_dirent_rename(struct btree_trans *trans,
 		*src_offset = dst_iter.pos.offset;
 
 	/* Create new dst key: */
-	new_dst = dirent_create_key(trans, dst_dir, 0, dst_name, 0);
+	new_dst = dirent_create_key(trans, dst_dir, 0, dst_name,
+				    dst_hash->cf_encoding ? &dst_name_lookup : NULL, 0);
 	ret = PTR_ERR_OR_ZERO(new_dst);
 	if (ret)
 		goto out;
@@ -380,7 +507,8 @@ int bch2_dirent_rename(struct btree_trans *trans,
 
 	/* Create new src key: */
 	if (mode == BCH_RENAME_EXCHANGE) {
-		new_src = dirent_create_key(trans, src_dir, 0, src_name, 0);
+		new_src = dirent_create_key(trans, src_dir, 0, src_name,
+					    src_hash->cf_encoding ? &src_name_lookup : NULL, 0);
 		ret = PTR_ERR_OR_ZERO(new_src);
 		if (ret)
 			goto out;
@@ -472,6 +600,7 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans,
 			       const struct qstr *name, subvol_inum *inum,
 			       unsigned flags)
 {
+	struct qstr lookup_name = *name;
 	struct bkey_s_c k;
 	struct bkey_s_c_dirent d;
 	u32 snapshot;
@@ -481,8 +610,14 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans,
 	if (ret)
 		return ret;
 
+	ret = hash_info->cf_encoding
+		? bch2_casefold(trans, hash_info, name, &lookup_name)
+		: 0;
+	if (ret)
+		return ret;
+
 	ret = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc,
-			       hash_info, dir, name, flags);
+			       hash_info, dir, &lookup_name, flags);
 	if (ret)
 		return ret;
 
diff --git a/fs/bcachefs/fs-common.c b/fs/bcachefs/fs-common.c
index bb5305441f27..190ff3e0da83 100644
--- a/fs/bcachefs/fs-common.c
+++ b/fs/bcachefs/fs-common.c
@@ -46,6 +46,10 @@ int bch2_create_trans(struct btree_trans *trans,
 	if (ret)
 		goto err;
 
+	/* Inherit casefold state from parent. */
+	if (S_ISDIR(mode))
+		new_inode->bi_flags |= dir_u->bi_flags & BCH_INODE_CASEFOLDED;
+
 	if (!(flags & BCH_CREATE_SNAPSHOT)) {
 		/* Normal create path - allocate a new inode: */
 		bch2_inode_init_late(new_inode, now, uid, gid, mode, rdev, dir_u);
diff --git a/fs/bcachefs/fs-ioctl.c b/fs/bcachefs/fs-ioctl.c
index 141bcced031e..d0bac39e2326 100644
--- a/fs/bcachefs/fs-ioctl.c
+++ b/fs/bcachefs/fs-ioctl.c
@@ -54,6 +54,28 @@ static int bch2_inode_flags_set(struct btree_trans *trans,
 	    (newflags & (BCH_INODE_NODUMP|BCH_INODE_NOATIME)) != newflags)
 		return -EINVAL;
 
+	if ((newflags ^ oldflags) & BCH_INODE_CASEFOLDED) {
+#if IS_ENABLED(CONFIG_UNICODE)
+		int ret = 0;
+		/* Not supported on individual files. */
+		if (!S_ISDIR(bi->bi_mode))
+			return -EOPNOTSUPP;
+
+		/*
+		 * Make sure the dir is empty, as otherwise we'd need to
+		 * rehash everything and update the dirent keys.
+		 */
+		ret = bch2_empty_dir_trans(trans, inode_inum(inode));
+		if (ret < 0)
+			return ret;
+
+		bch2_check_set_feature(c, BCH_FEATURE_casefolding);
+#else
+		printk(KERN_ERR "Cannot use casefolding on a kernel without CONFIG_UNICODE\n");
+		return -EOPNOTSUPP;
+#endif
+	}
+
 	if (s->set_projinherit) {
 		bi->bi_fields_set &= ~(1 << Inode_opt_project);
 		bi->bi_fields_set |= ((int) s->projinherit << Inode_opt_project);
diff --git a/fs/bcachefs/fs-ioctl.h b/fs/bcachefs/fs-ioctl.h
index f201980ef2c3..2950091b5ac6 100644
--- a/fs/bcachefs/fs-ioctl.h
+++ b/fs/bcachefs/fs-ioctl.h
@@ -6,19 +6,21 @@
 
 /* bcachefs inode flags -> vfs inode flags: */
 static const unsigned bch_flags_to_vfs[] = {
-	[__BCH_INODE_SYNC]	= S_SYNC,
-	[__BCH_INODE_IMMUTABLE]	= S_IMMUTABLE,
-	[__BCH_INODE_APPEND]	= S_APPEND,
-	[__BCH_INODE_NOATIME]	= S_NOATIME,
+	[__BCH_INODE_SYNC]		= S_SYNC,
+	[__BCH_INODE_IMMUTABLE]		= S_IMMUTABLE,
+	[__BCH_INODE_APPEND]		= S_APPEND,
+	[__BCH_INODE_NOATIME]		= S_NOATIME,
+	[__BCH_INODE_CASEFOLDED]	= S_CASEFOLD,
 };
 
 /* bcachefs inode flags -> FS_IOC_GETFLAGS: */
 static const unsigned bch_flags_to_uflags[] = {
-	[__BCH_INODE_SYNC]	= FS_SYNC_FL,
-	[__BCH_INODE_IMMUTABLE]	= FS_IMMUTABLE_FL,
-	[__BCH_INODE_APPEND]	= FS_APPEND_FL,
-	[__BCH_INODE_NODUMP]	= FS_NODUMP_FL,
-	[__BCH_INODE_NOATIME]	= FS_NOATIME_FL,
+	[__BCH_INODE_SYNC]		= FS_SYNC_FL,
+	[__BCH_INODE_IMMUTABLE]		= FS_IMMUTABLE_FL,
+	[__BCH_INODE_APPEND]		= FS_APPEND_FL,
+	[__BCH_INODE_NODUMP]		= FS_NODUMP_FL,
+	[__BCH_INODE_NOATIME]		= FS_NOATIME_FL,
+	[__BCH_INODE_CASEFOLDED]	= FS_CASEFOLD_FL,
 };
 
 /* bcachefs inode flags -> FS_IOC_FSGETXATTR: */
diff --git a/fs/bcachefs/str_hash.h b/fs/bcachefs/str_hash.h
index ae21a8cca1b4..866cfc145889 100644
--- a/fs/bcachefs/str_hash.h
+++ b/fs/bcachefs/str_hash.h
@@ -34,6 +34,7 @@ bch2_str_hash_opt_to_type(struct bch_fs *c, enum bch_str_hash_opts opt)
 
 struct bch_hash_info {
 	u8			type;
+	struct unicode_map 	*cf_encoding;
 	/*
 	 * For crc32 or crc64 string hashes the first key value of
 	 * the siphash_key (k0) is used as the key.
@@ -48,6 +49,9 @@ bch2_hash_info_init(struct bch_fs *c, const struct bch_inode_unpacked *bi)
 	struct bch_hash_info info = {
 		.type = (bi->bi_flags >> INODE_STR_HASH_OFFSET) &
 			~(~0U << INODE_STR_HASH_BITS),
+#if IS_ENABLED(CONFIG_UNICODE)
+		.cf_encoding = !!(bi->bi_flags & BCH_INODE_CASEFOLDED) ? c->cf_encoding : NULL,
+#endif
 		.siphash_key = { .k0 = bi->bi_hash_seed }
 	};
 
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index 5c62fcf3afdb..78d19cfb966a 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -757,6 +757,25 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts)
 	if (ret)
 		goto err;
 
+#if IS_ENABLED(CONFIG_UNICODE)
+	/* Default encoding until we can potentially have more as an option. */
+	c->cf_encoding = utf8_load(BCH_FS_DEFAULT_UTF8_ENCODING);
+	if (IS_ERR(c->cf_encoding)) {
+		printk(KERN_ERR "Cannot load UTF-8 encoding for filesystem. Version: %u.%u.%u",
+			unicode_major(BCH_FS_DEFAULT_UTF8_ENCODING),
+			unicode_minor(BCH_FS_DEFAULT_UTF8_ENCODING),
+			unicode_rev(BCH_FS_DEFAULT_UTF8_ENCODING));
+		ret = -EINVAL;
+		goto err;
+	}
+#else
+	if (c->sb.features & (1ULL << BCH_FEATURE_casefolding)) {
+		printk(KERN_ERR "Cannot mount a filesystem with casefolding on a kernel without CONFIG_UNICODE\n");
+		ret = -EINVAL;
+		goto err;
+	}
+#endif
+
 	pr_uuid(&name, c->sb.user_uuid.b);
 	strscpy(c->name, name.buf, sizeof(c->name));
 	printbuf_exit(&name);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [v4 3/3] bcachefs: Implement casefolding
  2023-08-14  1:24 ` [v4 3/3] bcachefs: Implement casefolding Joshua Ashton
@ 2023-08-14 16:34   ` Gabriel Krisman Bertazi
  2023-08-14 21:18     ` Joshua Ashton
  2023-08-15 12:36   ` Brian Foster
  1 sibling, 1 reply; 15+ messages in thread
From: Gabriel Krisman Bertazi @ 2023-08-14 16:34 UTC (permalink / raw)
  To: Joshua Ashton; +Cc: linux-bcachefs, André Almeida

Joshua Ashton <joshua@froggi.es> writes:

> This patch implements support for case-insensitive file name lookups
> in bcachefs.
>
> The implementation uses the same UTF-8 lowering and normalization that
> ext4 and f2fs is using.
>
> More information is provided in Documentation/bcachefs/casefolding.rst
>
> Signed-off-by: Joshua Ashton <joshua@froggi.es>

Please, cc the entire thread as i don't think I even got the cover
letter.

Did you test this against fstests generic/556?  it checks all sorts of
common pitfalls and corner-cases in case-insensitive filesystems.

> --- /dev/null
> +++ b/Documentation/filesystems/bcachefs/casefolding.rst
> @@ -0,0 +1,73 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Casefolding
> +===========
> +
> +bcachefs has support for case-insensitive file and directory
> +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
> +casefolding attributes.
> +
> +The main usecase for casefolding is compatibility with software written
> +against other filesystems that rely on casefolded lookups
> +(eg. NTFS and Wine/Proton).
> +Taking advantage of file-system level casefolding can lead to great
> +loading time gains in many applications and games.
> +
> +Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
> +Once a directory has been flagged for casefolding, a feature bit
> +is enabled on the superblock which marks the filesystem as using
> +casefolding.
> +When the feature bit for casefolding is enabled, it is no longer possible
> +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
> +
> +On the lookup/query side: casefolding is implemented by allocating a new
> +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
> +casefold the query string.
> +
> +On the dirent side: casefolding is implemented by ensuring the `bkey`'s
> +hash is made from the casefolded string and storing the cached casefolded
> +name with the regular name in the dirent.
> +
> +The structure looks like this:
> +
> +Regular:    [dirent data][regular name][nul][nul]...
> +Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
> +
> +(Do note, the number of `NUL`s here is merely for illustration, they count can vary
> + per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.)
> +
> +This is efficient as it means that for all file lookups that require casefolding,
> +it has identical performance to a regular lookup:
> +a hash comparison and a `memcmp` of the name.
> +
> +Rationale
> +---------
> +
> +Several designs were considered for this system:
> +One was to introduce a dirent_v2, however that would be painful especially as
> +the hash system only has support for a single key type. This would also need
> +`BCH_NAME_MAX` to change between versions, and a new feature bit.
> +
> +Another option was to store without the two lengths, and just take the length of
> +the regular name and casefolded name contiguously / 2 as the length. This would
> +assume that the regular length == casefolded length, but that could potentially
> +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
> +the lowercase unicode glyph.

Why do you want to store the casefolded name?  In ext4, we just store
the case-exact name that was created, and do a code-point by code-point
comparison with the name under lookup through utf8_strncasecmp, which is
optimized to compare two not casefolded strings.  We just modify the
filename hash to be case-insensitive, so we know which dirent to look
for during lookup.  Finally, we use d_compare/d_hash to leverage the
dcache into finding negative dentries.

Sure, if you have both, dirent and dentry->d_name casefolded, then it is
just a regular memcmp, but at least you don't have to change the disk
format and your implementation becomes simpler.



> +It would be possible to disregard the casefold cache for those cases, but it was
> +decided to simply encode the two string lengths in the key to avoid random
> +performance issues if this edgecase was ever hit.
> +
> +The option settled on was to use a free-bit in d_type to mark a dirent as having
> +a casefold cache, and then treat the first 4 bytes the name block as lengths.
> +You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
> +
> +The feature bit was used to allow casefolding support to be enabled for the majority
> +of users, but some allow users who have no need for the feature to still use bcachefs as
> +`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
> +which may be decider between using bcachefs for eg. embedded platforms.
> +
> +Other filesystems like ext4 and f2fs have a super-block level option for casefolding
> +encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
> +any encodings than a single UTF-8 version. When future encodings are desirable,
> +they will be added trivially using the opts mechanism.
> +
> diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
> index 30b3d7b9f9dc..0e1facdc7876 100644
> --- a/fs/bcachefs/bcachefs.h
> +++ b/fs/bcachefs/bcachefs.h
> @@ -202,6 +202,7 @@
>  #include <linux/types.h>
>  #include <linux/workqueue.h>
>  #include <linux/zstd.h>
> +#include <linux/unicode.h>
>  
>  #include "bcachefs_format.h"
>  #include "errcode.h"
> @@ -657,6 +658,8 @@ enum bch_write_ref {
>  	BCH_WRITE_REF_NR,
>  };
>  
> +#define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0)
> +
>  struct bch_fs {
>  	struct closure		cl;
>  
> @@ -723,6 +726,9 @@ struct bch_fs {
>  		u64		compat;
>  	}			sb;
>  
> +#if IS_ENABLED(CONFIG_UNICODE)
> +	struct unicode_map 	*cf_encoding;
> +#endif

This goes in struct super_block.


> @@ -13,8 +13,39 @@
>  
>  #include <linux/dcache.h>
>  
> -static inline unsigned dirent_val_u64s(unsigned len)
> +static inline int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info,
> +				const struct qstr *str, struct qstr *out_cf)
>  {
> +#if IS_ENABLED(CONFIG_UNICODE)
> +	unsigned char *buf = NULL;
> +	int ret;
> +
> +	*out_cf = (struct qstr) QSTR_INIT(NULL, 0);
> +
> +	buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1);
> +	ret = PTR_ERR_OR_ZERO(buf);
> +	if (ret)
> +		return ret;
> +
> +	ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1);
> +	if (ret <= 0)
> +		return ret;

The error path leaks buf.

> @@ -472,6 +600,7 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans,
>  			       const struct qstr *name, subvol_inum *inum,
>  			       unsigned flags)
>  {
> +	struct qstr lookup_name = *name;
>  	struct bkey_s_c k;
>  	struct bkey_s_c_dirent d;
>  	u32 snapshot;
> @@ -481,8 +610,14 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans,
>  	if (ret)
>  		return ret;
>  
> +	ret = hash_info->cf_encoding
> +		? bch2_casefold(trans, hash_info, name, &lookup_name)
> +		: 0;
> +	if (ret)
> +		return ret;
> +
>  	ret = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc,
> -			       hash_info, dir, name, flags);
> +			       hash_info, dir, &lookup_name, flags);
>  	if (ret)
>  		return ret;


(Skipping the bcachefs-specific bits that I'm not qualified to review)

I don't get your model for the dcache.  I don't see any handling
anywhere in the patch.  Also, I fetched the code and it seems bcachefs
will just do d_splice_alias at the end of the lookup. So, when someone
lookups FOO, ->lookup will be called and the dentry FOO will be created.
Then, when FOo is looked up, the dcache won't find the previous dentry
and ->lookup will be called again (perhaps creating another dentry?)

There are two ways around it. Rely on the libfs' generic handlers for
d_hash/d_compare and soon d_revalidate, or use d_add_ci to search for
case-insensitive aliases at lookup time, but then you don't benefit much
from the dcache if looking up for different cases.


> @@ -54,6 +54,28 @@ static int bch2_inode_flags_set(struct btree_trans *trans,
>  	    (newflags & (BCH_INODE_NODUMP|BCH_INODE_NOATIME)) != newflags)
>  		return -EINVAL;
>  
> +	if ((newflags ^ oldflags) & BCH_INODE_CASEFOLDED) {
> +#if IS_ENABLED(CONFIG_UNICODE)
> +		int ret = 0;
> +		/* Not supported on individual files. */
> +		if (!S_ISDIR(bi->bi_mode))
> +			return -EOPNOTSUPP;
> +
> +		/*
> +		 * Make sure the dir is empty, as otherwise we'd need to
> +		 * rehash everything and update the dirent keys.
> +		 */
> +		ret = bch2_empty_dir_trans(trans, inode_inum(inode));
> +		if (ret < 0)
> +			return ret;
> +
> +		bch2_check_set_feature(c, BCH_FEATURE_casefolding);

fwiw, we use _CASEFOLD in other filesystems, not _casefolding.

Thanks,

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [v4 3/3] bcachefs: Implement casefolding
  2023-08-14 16:34   ` Gabriel Krisman Bertazi
@ 2023-08-14 21:18     ` Joshua Ashton
  2023-08-14 21:33       ` Joshua Ashton
                         ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Joshua Ashton @ 2023-08-14 21:18 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: linux-bcachefs, André Almeida



On 8/14/23 17:34, Gabriel Krisman Bertazi wrote:
> Joshua Ashton <joshua@froggi.es> writes:
> 
>> This patch implements support for case-insensitive file name lookups
>> in bcachefs.
>>
>> The implementation uses the same UTF-8 lowering and normalization that
>> ext4 and f2fs is using.
>>
>> More information is provided in Documentation/bcachefs/casefolding.rst
>>
>> Signed-off-by: Joshua Ashton <joshua@froggi.es>
> 
> Please, cc the entire thread as i don't think I even got the cover
> letter.

Thanks, I will for v5.

> 
> Did you test this against fstests generic/556?  it checks all sorts of
> common pitfalls and corner-cases in case-insensitive filesystems.

No, I wrote my own casefolding test that tests a bunch of edgecases.

I was not aware there were any casefolding tests in fstests -- I really 
really wish they had better names. :/

Grepping the test right now I already know we will fail the 
`test_casefold_flag_removal` test as we don't support toggling the flag 
on non-empty directories as it would require rehashing and thus 
restructuring the btree.

There's a lot of edgecases to handle there (such as hash collisions) 
which are pretty gnarly to solve when doing something like that.

I diiiid try and implement something that did that, but it really did 
not end well... ^^;; So planning to defer this for now.

> 
>> --- /dev/null
>> +++ b/Documentation/filesystems/bcachefs/casefolding.rst
>> @@ -0,0 +1,73 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +Casefolding
>> +===========
>> +
>> +bcachefs has support for case-insensitive file and directory
>> +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
>> +casefolding attributes.
>> +
>> +The main usecase for casefolding is compatibility with software written
>> +against other filesystems that rely on casefolded lookups
>> +(eg. NTFS and Wine/Proton).
>> +Taking advantage of file-system level casefolding can lead to great
>> +loading time gains in many applications and games.
>> +
>> +Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
>> +Once a directory has been flagged for casefolding, a feature bit
>> +is enabled on the superblock which marks the filesystem as using
>> +casefolding.
>> +When the feature bit for casefolding is enabled, it is no longer possible
>> +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
>> +
>> +On the lookup/query side: casefolding is implemented by allocating a new
>> +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
>> +casefold the query string.
>> +
>> +On the dirent side: casefolding is implemented by ensuring the `bkey`'s
>> +hash is made from the casefolded string and storing the cached casefolded
>> +name with the regular name in the dirent.
>> +
>> +The structure looks like this:
>> +
>> +Regular:    [dirent data][regular name][nul][nul]...
>> +Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
>> +
>> +(Do note, the number of `NUL`s here is merely for illustration, they count can vary
>> + per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.)
>> +
>> +This is efficient as it means that for all file lookups that require casefolding,
>> +it has identical performance to a regular lookup:
>> +a hash comparison and a `memcmp` of the name.
>> +
>> +Rationale
>> +---------
>> +
>> +Several designs were considered for this system:
>> +One was to introduce a dirent_v2, however that would be painful especially as
>> +the hash system only has support for a single key type. This would also need
>> +`BCH_NAME_MAX` to change between versions, and a new feature bit.
>> +
>> +Another option was to store without the two lengths, and just take the length of
>> +the regular name and casefolded name contiguously / 2 as the length. This would
>> +assume that the regular length == casefolded length, but that could potentially
>> +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
>> +the lowercase unicode glyph.
> 
> Why do you want to store the casefolded name?  In ext4, we just store
> the case-exact name that was created, and do a code-point by code-point
> comparison with the name under lookup through utf8_strncasecmp, which is
> optimized to compare two not casefolded strings.  We just modify the
> filename hash to be case-insensitive, so we know which dirent to look
> for during lookup.  Finally, we use d_compare/d_hash to leverage the
> dcache into finding negative dentries.

Yeah, am aware of how ext4 does it, I just wanted it to have basically 
the exact same performance characteristics as as regular lookup.

The utf8 cursor stuff in utf8_strncasecmp is still super expensive 
compared to a straight memcmp in some local benching.

> 
> Sure, if you have both, dirent and dentry->d_name casefolded, then it is
> just a regular memcmp, but at least you don't have to change the disk
> format and your implementation becomes simpler.
> 

Given we have the opportunity to have support for this in the dirent 
format for bcachefs, I think it makes sense to take it right now.

> 
> 
>> +It would be possible to disregard the casefold cache for those cases, but it was
>> +decided to simply encode the two string lengths in the key to avoid random
>> +performance issues if this edgecase was ever hit.
>> +
>> +The option settled on was to use a free-bit in d_type to mark a dirent as having
>> +a casefold cache, and then treat the first 4 bytes the name block as lengths.
>> +You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
>> +
>> +The feature bit was used to allow casefolding support to be enabled for the majority
>> +of users, but some allow users who have no need for the feature to still use bcachefs as
>> +`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
>> +which may be decider between using bcachefs for eg. embedded platforms.
>> +
>> +Other filesystems like ext4 and f2fs have a super-block level option for casefolding
>> +encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
>> +any encodings than a single UTF-8 version. When future encodings are desirable,
>> +they will be added trivially using the opts mechanism.
>> +
>> diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
>> index 30b3d7b9f9dc..0e1facdc7876 100644
>> --- a/fs/bcachefs/bcachefs.h
>> +++ b/fs/bcachefs/bcachefs.h
>> @@ -202,6 +202,7 @@
>>   #include <linux/types.h>
>>   #include <linux/workqueue.h>
>>   #include <linux/zstd.h>
>> +#include <linux/unicode.h>
>>   
>>   #include "bcachefs_format.h"
>>   #include "errcode.h"
>> @@ -657,6 +658,8 @@ enum bch_write_ref {
>>   	BCH_WRITE_REF_NR,
>>   };
>>   
>> +#define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0)
>> +
>>   struct bch_fs {
>>   	struct closure		cl;
>>   
>> @@ -723,6 +726,9 @@ struct bch_fs {
>>   		u64		compat;
>>   	}			sb;
>>   
>> +#if IS_ENABLED(CONFIG_UNICODE)
>> +	struct unicode_map 	*cf_encoding;
>> +#endif
> 
> This goes in struct super_block.

The code in libfs.c that uses the s_encoding in the super_block is never 
used for bcachefs as it defines its own inode_operations and doesn't use 
any of the libfs common code for lookups/hashes there (ie. the 
dentry_operations).

One thing we wanted to do down the line was to use the bcachefs options 
system to be able to pick the encoding on a per-inode basis.

Currently I am storing it in the struct bch_fs as currently there is no 
need for other encodings right now and it is convenient, so there can 
only be one -- but that could change at some point with an opt attached 
to the inode.

I am happy to bung it in the vfs super_block for now though -- but I 
need to look at how that all works out with mounting and the multiple 
drives situation. Just be aware that at some point, it will likely be 
moved back again to our code.

> 
> 
>> @@ -13,8 +13,39 @@
>>   
>>   #include <linux/dcache.h>
>>   
>> -static inline unsigned dirent_val_u64s(unsigned len)
>> +static inline int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info,
>> +				const struct qstr *str, struct qstr *out_cf)
>>   {
>> +#if IS_ENABLED(CONFIG_UNICODE)
>> +	unsigned char *buf = NULL;
>> +	int ret;
>> +
>> +	*out_cf = (struct qstr) QSTR_INIT(NULL, 0);
>> +
>> +	buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1);
>> +	ret = PTR_ERR_OR_ZERO(buf);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1);
>> +	if (ret <= 0)
>> +		return ret;
> 
> The error path leaks buf.

trans_kmalloc does not need to be freed.

> 
>> @@ -472,6 +600,7 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans,
>>   			       const struct qstr *name, subvol_inum *inum,
>>   			       unsigned flags)
>>   {
>> +	struct qstr lookup_name = *name;
>>   	struct bkey_s_c k;
>>   	struct bkey_s_c_dirent d;
>>   	u32 snapshot;
>> @@ -481,8 +610,14 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans,
>>   	if (ret)
>>   		return ret;
>>   
>> +	ret = hash_info->cf_encoding
>> +		? bch2_casefold(trans, hash_info, name, &lookup_name)
>> +		: 0;
>> +	if (ret)
>> +		return ret;
>> +
>>   	ret = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc,
>> -			       hash_info, dir, name, flags);
>> +			       hash_info, dir, &lookup_name, flags);
>>   	if (ret)
>>   		return ret;
> 
> 
> (Skipping the bcachefs-specific bits that I'm not qualified to review)
> 
> I don't get your model for the dcache.  I don't see any handling
> anywhere in the patch.  Also, I fetched the code and it seems bcachefs
> will just do d_splice_alias at the end of the lookup. So, when someone
> lookups FOO, ->lookup will be called and the dentry FOO will be created.
> Then, when FOo is looked up, the dcache won't find the previous dentry
> and ->lookup will be called again (perhaps creating another dentry?)
> 
> There are two ways around it. Rely on the libfs' generic handlers for
> d_hash/d_compare and soon d_revalidate, or use d_add_ci to search for
> case-insensitive aliases at lookup time, but then you don't benefit much
> from the dcache if looking up for different cases.

Thanks for the heads-up on this.

It seems like I may have potentially hit a foot-gun there if we are 
making vfs dentry's for each lookup.

I will investigate and validate that we are doing the right thing here 
for the next revision.

I will not lie, I know veeery little about the vfs or dcache side, so 
huge thanks for pointing this out to me. :)

I had a quick look at what f2fs does here, and it seems like it also 
just does d_splice_alias, but it has the following caveat.

[f2fs code]
#if IS_ENABLED(CONFIG_UNICODE)
	if (!inode && IS_CASEFOLDED(dir)) {
		/* Eventually we want to call d_add_ci(dentry, NULL)
		 * for negative dentries in the encoding case as
		 * well.  For now, prevent the negative dentry
		 * from being cached.
		 */
		trace_f2fs_lookup_end(dir, dentry, ino, err);
		return NULL;
	}
#endif
	new = d_splice_alias(inode, dentry);

I won't even pretend that I understand what a negative dentry is right 
now... :D

> 
> 
>> @@ -54,6 +54,28 @@ static int bch2_inode_flags_set(struct btree_trans *trans,
>>   	    (newflags & (BCH_INODE_NODUMP|BCH_INODE_NOATIME)) != newflags)
>>   		return -EINVAL;
>>   
>> +	if ((newflags ^ oldflags) & BCH_INODE_CASEFOLDED) {
>> +#if IS_ENABLED(CONFIG_UNICODE)
>> +		int ret = 0;
>> +		/* Not supported on individual files. */
>> +		if (!S_ISDIR(bi->bi_mode))
>> +			return -EOPNOTSUPP;
>> +
>> +		/*
>> +		 * Make sure the dir is empty, as otherwise we'd need to
>> +		 * rehash everything and update the dirent keys.
>> +		 */
>> +		ret = bch2_empty_dir_trans(trans, inode_inum(inode));
>> +		if (ret < 0)
>> +			return ret;
>> +
>> +		bch2_check_set_feature(c, BCH_FEATURE_casefolding);
> 
> fwiw, we use _CASEFOLD in other filesystems, not _casefolding.

bcachefs features are always lowercase, but I can change the name of it 
to _casefold.

- Joshie 🐸✨

> 
> Thanks,
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [v4 3/3] bcachefs: Implement casefolding
  2023-08-14 21:18     ` Joshua Ashton
@ 2023-08-14 21:33       ` Joshua Ashton
  2023-08-14 23:01       ` Gabriel Krisman Bertazi
  2023-08-15  0:58       ` Kent Overstreet
  2 siblings, 0 replies; 15+ messages in thread
From: Joshua Ashton @ 2023-08-14 21:33 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: linux-bcachefs, André Almeida



On 8/14/23 22:18, Joshua Ashton wrote:
> 
> 
> On 8/14/23 17:34, Gabriel Krisman Bertazi wrote:
>> Joshua Ashton <joshua@froggi.es> writes:
>>
>>> This patch implements support for case-insensitive file name lookups
>>> in bcachefs.
>>>
>>> The implementation uses the same UTF-8 lowering and normalization that
>>> ext4 and f2fs is using.
>>>
>>> More information is provided in Documentation/bcachefs/casefolding.rst
>>>
>>> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>>
>> Please, cc the entire thread as i don't think I even got the cover
>> letter.
> 
> Thanks, I will for v5.
> 
>>
>> Did you test this against fstests generic/556?  it checks all sorts of
>> common pitfalls and corner-cases in case-insensitive filesystems.
> 
> No, I wrote my own casefolding test that tests a bunch of edgecases.
> 
> I was not aware there were any casefolding tests in fstests -- I really 
> really wish they had better names. :/
> 
> Grepping the test right now I already know we will fail the 
> `test_casefold_flag_removal` test as we don't support toggling the flag 
> on non-empty directories as it would require rehashing and thus 
> restructuring the btree.
> 
> There's a lot of edgecases to handle there (such as hash collisions) 
> which are pretty gnarly to solve when doing something like that.
> 
> I diiiid try and implement something that did that, but it really did 
> not end well... ^^;; So planning to defer this for now.
> 
>>
>>> --- /dev/null
>>> +++ b/Documentation/filesystems/bcachefs/casefolding.rst
>>> @@ -0,0 +1,73 @@
>>> +.. SPDX-License-Identifier: GPL-2.0
>>> +
>>> +Casefolding
>>> +===========
>>> +
>>> +bcachefs has support for case-insensitive file and directory
>>> +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
>>> +casefolding attributes.
>>> +
>>> +The main usecase for casefolding is compatibility with software written
>>> +against other filesystems that rely on casefolded lookups
>>> +(eg. NTFS and Wine/Proton).
>>> +Taking advantage of file-system level casefolding can lead to great
>>> +loading time gains in many applications and games.
>>> +
>>> +Casefolding support requires a kernel with the `CONFIG_UNICODE` 
>>> enabled.
>>> +Once a directory has been flagged for casefolding, a feature bit
>>> +is enabled on the superblock which marks the filesystem as using
>>> +casefolding.
>>> +When the feature bit for casefolding is enabled, it is no longer 
>>> possible
>>> +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
>>> +
>>> +On the lookup/query side: casefolding is implemented by allocating a 
>>> new
>>> +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
>>> +casefold the query string.
>>> +
>>> +On the dirent side: casefolding is implemented by ensuring the `bkey`'s
>>> +hash is made from the casefolded string and storing the cached 
>>> casefolded
>>> +name with the regular name in the dirent.
>>> +
>>> +The structure looks like this:
>>> +
>>> +Regular:    [dirent data][regular name][nul][nul]...
>>> +Casefolded: [dirent data][reg len][cf len][regular name][casefolded 
>>> name][nul][nul]...
>>> +
>>> +(Do note, the number of `NUL`s here is merely for illustration, they 
>>> count can vary
>>> + per-key, and they may not even be present if the key is aligned to 
>>> `sizeof(u64)`.)
>>> +
>>> +This is efficient as it means that for all file lookups that require 
>>> casefolding,
>>> +it has identical performance to a regular lookup:
>>> +a hash comparison and a `memcmp` of the name.
>>> +
>>> +Rationale
>>> +---------
>>> +
>>> +Several designs were considered for this system:
>>> +One was to introduce a dirent_v2, however that would be painful 
>>> especially as
>>> +the hash system only has support for a single key type. This would 
>>> also need
>>> +`BCH_NAME_MAX` to change between versions, and a new feature bit.
>>> +
>>> +Another option was to store without the two lengths, and just take 
>>> the length of
>>> +the regular name and casefolded name contiguously / 2 as the length. 
>>> This would
>>> +assume that the regular length == casefolded length, but that could 
>>> potentially
>>> +not be true, if the uppercase unicode glyph had a different UTF-8 
>>> encoding than
>>> +the lowercase unicode glyph.
>>
>> Why do you want to store the casefolded name?  In ext4, we just store
>> the case-exact name that was created, and do a code-point by code-point
>> comparison with the name under lookup through utf8_strncasecmp, which is
>> optimized to compare two not casefolded strings.  We just modify the
>> filename hash to be case-insensitive, so we know which dirent to look
>> for during lookup.  Finally, we use d_compare/d_hash to leverage the
>> dcache into finding negative dentries.
> 
> Yeah, am aware of how ext4 does it, I just wanted it to have basically 
> the exact same performance characteristics as as regular lookup.
> 
> The utf8 cursor stuff in utf8_strncasecmp is still super expensive 
> compared to a straight memcmp in some local benching.
> 
>>
>> Sure, if you have both, dirent and dentry->d_name casefolded, then it is
>> just a regular memcmp, but at least you don't have to change the disk
>> format and your implementation becomes simpler.
>>
> 
> Given we have the opportunity to have support for this in the dirent 
> format for bcachefs, I think it makes sense to take it right now.
> 
>>
>>
>>> +It would be possible to disregard the casefold cache for those 
>>> cases, but it was
>>> +decided to simply encode the two string lengths in the key to avoid 
>>> random
>>> +performance issues if this edgecase was ever hit.
>>> +
>>> +The option settled on was to use a free-bit in d_type to mark a 
>>> dirent as having
>>> +a casefold cache, and then treat the first 4 bytes the name block as 
>>> lengths.
>>> +You can see this in the `d_cf_name_block` member of union in 
>>> `bch_dirent`.
>>> +
>>> +The feature bit was used to allow casefolding support to be enabled 
>>> for the majority
>>> +of users, but some allow users who have no need for the feature to 
>>> still use bcachefs as
>>> +`CONFIG_UNICODE` can increase the kernel side a significant amount 
>>> due to the tables used,
>>> +which may be decider between using bcachefs for eg. embedded platforms.
>>> +
>>> +Other filesystems like ext4 and f2fs have a super-block level option 
>>> for casefolding
>>> +encoding, but bcachefs currently does not provide this. ext4 and 
>>> f2fs do not expose
>>> +any encodings than a single UTF-8 version. When future encodings are 
>>> desirable,
>>> +they will be added trivially using the opts mechanism.
>>> +
>>> diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
>>> index 30b3d7b9f9dc..0e1facdc7876 100644
>>> --- a/fs/bcachefs/bcachefs.h
>>> +++ b/fs/bcachefs/bcachefs.h
>>> @@ -202,6 +202,7 @@
>>>   #include <linux/types.h>
>>>   #include <linux/workqueue.h>
>>>   #include <linux/zstd.h>
>>> +#include <linux/unicode.h>
>>>   #include "bcachefs_format.h"
>>>   #include "errcode.h"
>>> @@ -657,6 +658,8 @@ enum bch_write_ref {
>>>       BCH_WRITE_REF_NR,
>>>   };
>>> +#define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0)
>>> +
>>>   struct bch_fs {
>>>       struct closure        cl;
>>> @@ -723,6 +726,9 @@ struct bch_fs {
>>>           u64        compat;
>>>       }            sb;
>>> +#if IS_ENABLED(CONFIG_UNICODE)
>>> +    struct unicode_map     *cf_encoding;
>>> +#endif
>>
>> This goes in struct super_block.
> 
> The code in libfs.c that uses the s_encoding in the super_block is never 
> used for bcachefs as it defines its own inode_operations and doesn't use 
> any of the libfs common code for lookups/hashes there (ie. the 
> dentry_operations).
> 
> One thing we wanted to do down the line was to use the bcachefs options 
> system to be able to pick the encoding on a per-inode basis.
> 
> Currently I am storing it in the struct bch_fs as currently there is no 
> need for other encodings right now and it is convenient, so there can 
> only be one -- but that could change at some point with an opt attached 
> to the inode.
> 
> I am happy to bung it in the vfs super_block for now though -- but I 
> need to look at how that all works out with mounting and the multiple 
> drives situation. Just be aware that at some point, it will likely be 
> moved back again to our code.
> 
>>
>>
>>> @@ -13,8 +13,39 @@
>>>   #include <linux/dcache.h>
>>> -static inline unsigned dirent_val_u64s(unsigned len)
>>> +static inline int bch2_casefold(struct btree_trans *trans, const 
>>> struct bch_hash_info *info,
>>> +                const struct qstr *str, struct qstr *out_cf)
>>>   {
>>> +#if IS_ENABLED(CONFIG_UNICODE)
>>> +    unsigned char *buf = NULL;
>>> +    int ret;
>>> +
>>> +    *out_cf = (struct qstr) QSTR_INIT(NULL, 0);
>>> +
>>> +    buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1);
>>> +    ret = PTR_ERR_OR_ZERO(buf);
>>> +    if (ret)
>>> +        return ret;
>>> +
>>> +    ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1);
>>> +    if (ret <= 0)
>>> +        return ret;
>>
>> The error path leaks buf.
> 
> trans_kmalloc does not need to be freed.
> 
>>
>>> @@ -472,6 +600,7 @@ int __bch2_dirent_lookup_trans(struct btree_trans 
>>> *trans,
>>>                      const struct qstr *name, subvol_inum *inum,
>>>                      unsigned flags)
>>>   {
>>> +    struct qstr lookup_name = *name;
>>>       struct bkey_s_c k;
>>>       struct bkey_s_c_dirent d;
>>>       u32 snapshot;
>>> @@ -481,8 +610,14 @@ int __bch2_dirent_lookup_trans(struct 
>>> btree_trans *trans,
>>>       if (ret)
>>>           return ret;
>>> +    ret = hash_info->cf_encoding
>>> +        ? bch2_casefold(trans, hash_info, name, &lookup_name)
>>> +        : 0;
>>> +    if (ret)
>>> +        return ret;
>>> +
>>>       ret = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc,
>>> -                   hash_info, dir, name, flags);
>>> +                   hash_info, dir, &lookup_name, flags);
>>>       if (ret)
>>>           return ret;
>>
>>
>> (Skipping the bcachefs-specific bits that I'm not qualified to review)
>>
>> I don't get your model for the dcache.  I don't see any handling
>> anywhere in the patch.  Also, I fetched the code and it seems bcachefs
>> will just do d_splice_alias at the end of the lookup. So, when someone
>> lookups FOO, ->lookup will be called and the dentry FOO will be created.
>> Then, when FOo is looked up, the dcache won't find the previous dentry
>> and ->lookup will be called again (perhaps creating another dentry?)
>>
>> There are two ways around it. Rely on the libfs' generic handlers for
>> d_hash/d_compare and soon d_revalidate, or use d_add_ci to search for
>> case-insensitive aliases at lookup time, but then you don't benefit much
>> from the dcache if looking up for different cases.
> 
> Thanks for the heads-up on this.
> 
> It seems like I may have potentially hit a foot-gun there if we are 
> making vfs dentry's for each lookup.
> 
> I will investigate and validate that we are doing the right thing here 
> for the next revision.
> 
> I will not lie, I know veeery little about the vfs or dcache side, so 
> huge thanks for pointing this out to me. :)
> 
> I had a quick look at what f2fs does here, and it seems like it also 
> just does d_splice_alias, but it has the following caveat.
> 
> [f2fs code]
> #if IS_ENABLED(CONFIG_UNICODE)
>      if (!inode && IS_CASEFOLDED(dir)) {
>          /* Eventually we want to call d_add_ci(dentry, NULL)
>           * for negative dentries in the encoding case as
>           * well.  For now, prevent the negative dentry
>           * from being cached.
>           */
>          trace_f2fs_lookup_end(dir, dentry, ino, err);
>          return NULL;
>      }
> #endif
>      new = d_splice_alias(inode, dentry);
> 
> I won't even pretend that I understand what a negative dentry is right 
> now... :D

I am guessing we probably want something like:

bch2_dirent_lookup -> returns feedback to a `struct qstr real_name` with 
the actual name if it was not identical to the dentry lookup name.

if (real_name.len) {
      d_add_ci(dentry, vinode, &real_name);
      return dentry;
} else {
      return d_splice_alias(vinode, dentry);
}

Is my line of thinking correct here?

- Joshie 🐸✨

> 
>>
>>
>>> @@ -54,6 +54,28 @@ static int bch2_inode_flags_set(struct btree_trans 
>>> *trans,
>>>           (newflags & (BCH_INODE_NODUMP|BCH_INODE_NOATIME)) != newflags)
>>>           return -EINVAL;
>>> +    if ((newflags ^ oldflags) & BCH_INODE_CASEFOLDED) {
>>> +#if IS_ENABLED(CONFIG_UNICODE)
>>> +        int ret = 0;
>>> +        /* Not supported on individual files. */
>>> +        if (!S_ISDIR(bi->bi_mode))
>>> +            return -EOPNOTSUPP;
>>> +
>>> +        /*
>>> +         * Make sure the dir is empty, as otherwise we'd need to
>>> +         * rehash everything and update the dirent keys.
>>> +         */
>>> +        ret = bch2_empty_dir_trans(trans, inode_inum(inode));
>>> +        if (ret < 0)
>>> +            return ret;
>>> +
>>> +        bch2_check_set_feature(c, BCH_FEATURE_casefolding);
>>
>> fwiw, we use _CASEFOLD in other filesystems, not _casefolding.
> 
> bcachefs features are always lowercase, but I can change the name of it 
> to _casefold.
> 
> - Joshie 🐸✨
> 
>>
>> Thanks,
>>
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [v4 3/3] bcachefs: Implement casefolding
  2023-08-14 21:18     ` Joshua Ashton
  2023-08-14 21:33       ` Joshua Ashton
@ 2023-08-14 23:01       ` Gabriel Krisman Bertazi
  2023-08-15  0:58       ` Kent Overstreet
  2 siblings, 0 replies; 15+ messages in thread
From: Gabriel Krisman Bertazi @ 2023-08-14 23:01 UTC (permalink / raw)
  To: Joshua Ashton; +Cc: linux-bcachefs, André Almeida

Joshua Ashton <joshua@froggi.es> writes:

> On 8/14/23 17:34, Gabriel Krisman Bertazi wrote:
>>> +
>>> +Another option was to store without the two lengths, and just take the length of
>>> +the regular name and casefolded name contiguously / 2 as the length. This would
>>> +assume that the regular length == casefolded length, but that could potentially
>>> +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
>>> +the lowercase unicode glyph.
>> Why do you want to store the casefolded name?  In ext4, we just store
>> the case-exact name that was created, and do a code-point by code-point
>> comparison with the name under lookup through utf8_strncasecmp, which is
>> optimized to compare two not casefolded strings.  We just modify the
>> filename hash to be case-insensitive, so we know which dirent to look
>> for during lookup.  Finally, we use d_compare/d_hash to leverage the
>> dcache into finding negative dentries.
>
> Yeah, am aware of how ext4 does it, I just wanted it to have basically
> the exact same performance characteristics as as regular lookup.
>
> The utf8 cursor stuff in utf8_strncasecmp is still super expensive
> compared to a straight memcmp in some local benching.

More expensive than strncasecmp, for sure. But I'm surprised it is
visible in benchmarks.  Of course, unless you are searching different
files at every lookup, once it makes proper use of the dcache it won't
matter much.  There is also utf8_strncasecmp_folded that avoids half of
the trie searches by caching the name-under-lookup.

> I am happy to bung it in the vfs super_block for now though -- but I
> need to look at how that all works out with mounting and the multiple
> drives situation. Just be aware that at some point, it will likely be
> moved back again to our code.

Different drives as in different filesystem volumes? They each should
have one superblock anyway.  Also, someone called my attention to your
v1 commit message in v1 regarding selecting encoding per-directory.  I'm
not sure it is a good idea, since it is definitely problematic once you
start moving files around different subtrees.

Unicode casefolding is only stable for defined code points (of course),
but newer versions might add new code points that previously didn't have
any casefolding.  So, if you have a name that has an invalid casepoint
in an older version (but is legal utf8) and moves it to a new directory
with an encoding version where that folds into something else, you have
a situation where one file occludes the other and all sorts of
unpredictable behavior.

It can happen while copying across disks, of course. But we already
reject many cases of cross-filesystem copying.  But failing to move a
file within the same filesystem is not very user-friendly.

I think the reason you want to support multiple encoding versions is to
enable applications built for different Windows ntfs versions. If that
is the case, you should be safe using the latest unicode disk-wide,
since it is backward compatible.  Take a look also at ext4 and f2fs'
"strict mode", that rejects invalid unicode sequences too.

>> and ->lookup will be called again (perhaps creating another dentry?)
>> There are two ways around it. Rely on the libfs' generic handlers for
>> d_hash/d_compare and soon d_revalidate, or use d_add_ci to search for
>> case-insensitive aliases at lookup time, but then you don't benefit much
>> from the dcache if looking up for different cases.
>
> Thanks for the heads-up on this.
>
> It seems like I may have potentially hit a foot-gun there if we are
> making vfs dentry's for each lookup.
>
> I will investigate and validate that we are doing the right thing here
> for the next revision.
>
> I will not lie, I know veeery little about the vfs or dcache side, so
> huge thanks for pointing this out to me. :)
>
> I had a quick look at what f2fs does here, and it seems like it also
> just does d_splice_alias, but it has the following caveat.
>
> [f2fs code]
> #if IS_ENABLED(CONFIG_UNICODE)
> 	if (!inode && IS_CASEFOLDED(dir)) {
> 		/* Eventually we want to call d_add_ci(dentry, NULL)
> 		 * for negative dentries in the encoding case as
> 		 * well.  For now, prevent the negative dentry
> 		 * from being cached.
> 		 */
> 		trace_f2fs_lookup_end(dir, dentry, ino, err);
> 		return NULL;
> 	}
> #endif
> 	new = d_splice_alias(inode, dentry);
>
> I won't even pretend that I understand what a negative dentry is right
> now... :D

A negative dentry is simply a cache entry saying that a file doesn't
exist in the filesystem, so further lookups don't need to hit the disk
over and over.

The f2fs caveat you mentioned is entirely avoiding the creation of
negative dentries in IS_CASEFOLDED directories, since they aren't
supported currently. I explained a bit the reasoning for that in the
following series, that also remove this code.  It is (hopefully) going
upstream soon:

  https://www.spinics.net/lists/linux-ext4/msg90964.html

My suggestion is that you follow f2fs/ext4 closely.  Avoid d_add_ci, as
it was written for filesystems that deal with case-insensitive
directories in a slight different way, and don't make use good use of
the dcache, in my opinion.  A better approach would be to use
d_compare/d_hash and keep calling d_splice_alias.  Take a look at the
generic handlers in fs/libfs.c too.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [v4 3/3] bcachefs: Implement casefolding
  2023-08-14 21:18     ` Joshua Ashton
  2023-08-14 21:33       ` Joshua Ashton
  2023-08-14 23:01       ` Gabriel Krisman Bertazi
@ 2023-08-15  0:58       ` Kent Overstreet
  2023-08-15 18:00         ` Gabriel Krisman Bertazi
  2 siblings, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2023-08-15  0:58 UTC (permalink / raw)
  To: Joshua Ashton; +Cc: Gabriel Krisman Bertazi, linux-bcachefs, André Almeida

On Mon, Aug 14, 2023 at 10:18:33PM +0100, Joshua Ashton wrote:
> 
> 
> On 8/14/23 17:34, Gabriel Krisman Bertazi wrote:
> > Joshua Ashton <joshua@froggi.es> writes:
> > 
> > > This patch implements support for case-insensitive file name lookups
> > > in bcachefs.
> > > 
> > > The implementation uses the same UTF-8 lowering and normalization that
> > > ext4 and f2fs is using.
> > > 
> > > More information is provided in Documentation/bcachefs/casefolding.rst
> > > 
> > > Signed-off-by: Joshua Ashton <joshua@froggi.es>
> > 
> > Please, cc the entire thread as i don't think I even got the cover
> > letter.
> 
> Thanks, I will for v5.
> 
> > 
> > Did you test this against fstests generic/556?  it checks all sorts of
> > common pitfalls and corner-cases in case-insensitive filesystems.
> 
> No, I wrote my own casefolding test that tests a bunch of edgecases.
> 
> I was not aware there were any casefolding tests in fstests -- I really
> really wish they had better names. :/

Seconded...

> Grepping the test right now I already know we will fail the
> `test_casefold_flag_removal` test as we don't support toggling the flag on
> non-empty directories as it would require rehashing and thus restructuring
> the btree.

What was the motivation for supporting that on other filesystems? I'm
not inclined to add that to bcachefs without a good reason.

> 
> There's a lot of edgecases to handle there (such as hash collisions) which
> are pretty gnarly to solve when doing something like that.
> 
> I diiiid try and implement something that did that, but it really did not
> end well... ^^;; So planning to defer this for now.

What hash collisions are you referring to? The str_hash code we use
handles hash collisions.

> > Why do you want to store the casefolded name?  In ext4, we just store
> > the case-exact name that was created, and do a code-point by code-point
> > comparison with the name under lookup through utf8_strncasecmp, which is
> > optimized to compare two not casefolded strings.  We just modify the
> > filename hash to be case-insensitive, so we know which dirent to look
> > for during lookup.  Finally, we use d_compare/d_hash to leverage the
> > dcache into finding negative dentries.
> 
> Yeah, am aware of how ext4 does it, I just wanted it to have basically the
> exact same performance characteristics as as regular lookup.
> 
> The utf8 cursor stuff in utf8_strncasecmp is still super expensive compared
> to a straight memcmp in some local benching.
> 
> > 
> > Sure, if you have both, dirent and dentry->d_name casefolded, then it is
> > just a regular memcmp, but at least you don't have to change the disk
> > format and your implementation becomes simpler.
> > 
> 
> Given we have the opportunity to have support for this in the dirent format
> for bcachefs, I think it makes sense to take it right now.

It also means that we have the option of defining a casefolding map that
handles ß correctly.
 
> > > +#if IS_ENABLED(CONFIG_UNICODE)
> > > +	struct unicode_map 	*cf_encoding;
> > > +#endif
> > 
> > This goes in struct super_block.
> 
> The code in libfs.c that uses the s_encoding in the super_block is never
> used for bcachefs as it defines its own inode_operations and doesn't use any
> of the libfs common code for lookups/hashes there (ie. the
> dentry_operations).
> 
> One thing we wanted to do down the line was to use the bcachefs options
> system to be able to pick the encoding on a per-inode basis.
> 
> Currently I am storing it in the struct bch_fs as currently there is no need
> for other encodings right now and it is convenient, so there can only be one
> -- but that could change at some point with an opt attached to the inode.
> 
> I am happy to bung it in the vfs super_block for now though -- but I need to
> look at how that all works out with mounting and the multiple drives
> situation. Just be aware that at some point, it will likely be moved back
> again to our code.

We also don't always have a super_block available when we have a running
bch_fs object (e.g. when running in userspace, for fsck and various
other tools). Since s_encoding is used in dirent.c, it probably needs to
be in bch_fs.

> > (Skipping the bcachefs-specific bits that I'm not qualified to review)
> > 
> > I don't get your model for the dcache.  I don't see any handling
> > anywhere in the patch.  Also, I fetched the code and it seems bcachefs
> > will just do d_splice_alias at the end of the lookup. So, when someone
> > lookups FOO, ->lookup will be called and the dentry FOO will be created.
> > Then, when FOo is looked up, the dcache won't find the previous dentry
> > and ->lookup will be called again (perhaps creating another dentry?)
> > 
> > There are two ways around it. Rely on the libfs' generic handlers for
> > d_hash/d_compare and soon d_revalidate, or use d_add_ci to search for
> > case-insensitive aliases at lookup time, but then you don't benefit much
> > from the dcache if looking up for different cases.
> 
> Thanks for the heads-up on this.
> 
> It seems like I may have potentially hit a foot-gun there if we are making
> vfs dentry's for each lookup.
> 
> I will investigate and validate that we are doing the right thing here for
> the next revision.
> 
> I will not lie, I know veeery little about the vfs or dcache side, so huge
> thanks for pointing this out to me. :)
> 
> I had a quick look at what f2fs does here, and it seems like it also just
> does d_splice_alias, but it has the following caveat.

I haven't dug into the dcache stuff yet either (thanks Gabriel for
pointing this out) - but, we'll want to create a dentry not with the
passed in name, but with the name from bch_dirent.

> [f2fs code]
> #if IS_ENABLED(CONFIG_UNICODE)
> 	if (!inode && IS_CASEFOLDED(dir)) {
> 		/* Eventually we want to call d_add_ci(dentry, NULL)
> 		 * for negative dentries in the encoding case as
> 		 * well.  For now, prevent the negative dentry
> 		 * from being cached.
> 		 */
> 		trace_f2fs_lookup_end(dir, dentry, ino, err);
> 		return NULL;
> 	}
> #endif
> 	new = d_splice_alias(inode, dentry);
> 
> I won't even pretend that I understand what a negative dentry is right
> now... :D

A negative dentry is used to indicate a dirent that doesn't exist in
dcache - it means lookups can be cached for dentries that don't exist,
instead of having to call into the filesystem's .lookup() every time.

Gabriel's the one who implemented them for ext4/f2fs, see: https://lwn.net/Articles/929986/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [v4 3/3] bcachefs: Implement casefolding
  2023-08-14  1:24 ` [v4 3/3] bcachefs: Implement casefolding Joshua Ashton
  2023-08-14 16:34   ` Gabriel Krisman Bertazi
@ 2023-08-15 12:36   ` Brian Foster
  2023-08-24  5:43     ` Joshua Ashton
  1 sibling, 1 reply; 15+ messages in thread
From: Brian Foster @ 2023-08-15 12:36 UTC (permalink / raw)
  To: Joshua Ashton; +Cc: linux-bcachefs, André Almeida, Gabriel Krisman Bertazi

On Mon, Aug 14, 2023 at 02:24:27AM +0100, Joshua Ashton wrote:
> This patch implements support for case-insensitive file name lookups
> in bcachefs.
> 
> The implementation uses the same UTF-8 lowering and normalization that
> ext4 and f2fs is using.
> 
> More information is provided in Documentation/bcachefs/casefolding.rst
> 
> Signed-off-by: Joshua Ashton <joshua@froggi.es>
> 
> Cc: André Almeida <andrealmeid@igalia.com>
> Cc: Gabriel Krisman Bertazi <krisman@suse.de>
> ---

Hi Joshua,

I'm still grokking the implementation and I see Gabriel has already
provided much more substantive feedback, so just a few random comments...

>  .../filesystems/bcachefs/casefolding.rst      |  73 ++++++++
>  fs/bcachefs/bcachefs.h                        |   6 +
>  fs/bcachefs/bcachefs_format.h                 |  26 ++-
>  fs/bcachefs/dirent.c                          | 177 +++++++++++++++---
>  fs/bcachefs/fs-common.c                       |   4 +
>  fs/bcachefs/fs-ioctl.c                        |  22 +++
>  fs/bcachefs/fs-ioctl.h                        |  20 +-
>  fs/bcachefs/str_hash.h                        |   4 +
>  fs/bcachefs/super.c                           |  19 ++
>  9 files changed, 318 insertions(+), 33 deletions(-)
>  create mode 100644 Documentation/filesystems/bcachefs/casefolding.rst
> 
> diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
> new file mode 100644
> index 000000000000..c8580d892ae8
> --- /dev/null
> +++ b/Documentation/filesystems/bcachefs/casefolding.rst
> @@ -0,0 +1,73 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Casefolding
> +===========
> +
> +bcachefs has support for case-insensitive file and directory
> +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
> +casefolding attributes.
> +
> +The main usecase for casefolding is compatibility with software written
> +against other filesystems that rely on casefolded lookups
> +(eg. NTFS and Wine/Proton).
> +Taking advantage of file-system level casefolding can lead to great
> +loading time gains in many applications and games.
> +
> +Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
> +Once a directory has been flagged for casefolding, a feature bit
> +is enabled on the superblock which marks the filesystem as using
> +casefolding.
> +When the feature bit for casefolding is enabled, it is no longer possible
> +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
> +
> +On the lookup/query side: casefolding is implemented by allocating a new
> +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
> +casefold the query string.
> +
> +On the dirent side: casefolding is implemented by ensuring the `bkey`'s
> +hash is made from the casefolded string and storing the cached casefolded
> +name with the regular name in the dirent.
> +
> +The structure looks like this:
> +
> +Regular:    [dirent data][regular name][nul][nul]...
> +Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
> +
> +(Do note, the number of `NUL`s here is merely for illustration, they count can vary
> + per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.)
> +
> +This is efficient as it means that for all file lookups that require casefolding,
> +it has identical performance to a regular lookup:
> +a hash comparison and a `memcmp` of the name.
> +

Thanks for the writeup. This is very helpful.

> +Rationale
> +---------
> +
> +Several designs were considered for this system:
> +One was to introduce a dirent_v2, however that would be painful especially as
> +the hash system only has support for a single key type. This would also need
> +`BCH_NAME_MAX` to change between versions, and a new feature bit.
> +
> +Another option was to store without the two lengths, and just take the length of
> +the regular name and casefolded name contiguously / 2 as the length. This would
> +assume that the regular length == casefolded length, but that could potentially
> +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
> +the lowercase unicode glyph.
> +It would be possible to disregard the casefold cache for those cases, but it was
> +decided to simply encode the two string lengths in the key to avoid random
> +performance issues if this edgecase was ever hit.
> +
> +The option settled on was to use a free-bit in d_type to mark a dirent as having
> +a casefold cache, and then treat the first 4 bytes the name block as lengths.
> +You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
> +
> +The feature bit was used to allow casefolding support to be enabled for the majority
> +of users, but some allow users who have no need for the feature to still use bcachefs as
> +`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
> +which may be decider between using bcachefs for eg. embedded platforms.
> +
> +Other filesystems like ext4 and f2fs have a super-block level option for casefolding
> +encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
> +any encodings than a single UTF-8 version. When future encodings are desirable,
> +they will be added trivially using the opts mechanism.
> +
...
> diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h
> index 23bae622309c..cc6797422977 100644
> --- a/fs/bcachefs/bcachefs_format.h
> +++ b/fs/bcachefs/bcachefs_format.h
...
> @@ -864,6 +866,7 @@ enum {
>  #define BCH_INODE_I_SECTORS_DIRTY (1 << __BCH_INODE_I_SECTORS_DIRTY)
>  #define BCH_INODE_UNLINKED	(1 << __BCH_INODE_UNLINKED)
>  #define BCH_INODE_BACKPTR_UNTRUSTED (1 << __BCH_INODE_BACKPTR_UNTRUSTED)
> +#define BCH_INODE_CASEFOLDED	(1 << __BCH_INODE_CASEFOLDED)
>  
>  LE32_BITMASK(INODE_STR_HASH,	struct bch_inode, bi_flags, 20, 24);
>  LE32_BITMASK(INODE_NR_FIELDS,	struct bch_inode, bi_flags, 24, 31);
> @@ -908,9 +911,25 @@ struct bch_dirent {
>  	 * Copy of mode bits 12-15 from the target inode - so userspace can get
>  	 * the filetype without having to do a stat()
>  	 */
> -	__u8			d_type;
> +#if defined(__LITTLE_ENDIAN_BITFIELD)
> +	__u8			d_type:5,
> +				d_unused:2,
> +				d_casefold:1;
> +#elif defined(__BIG_ENDIAN_BITFIELD)
> +	__u8			d_casefold:1,
> +				d_unused:2,
> +				d_type:5;
> +#endif

Interesting. I didn't realize that bitfield endianness matched general
endianness, so I presume this is required to ensure data is always le on
disk.

>  
> -	__u8			d_name[];
> +	union {
> +	struct {
> +		__u8		d_pad;
> +		__le16		d_name_len;
> +		__le16		d_cf_name_len;
> +		__u8		d_names[0];
> +	} d_cf_name_block __packed;
> +	__u8			d_name[0];
> +	} __packed;
>  } __packed __aligned(8);
>  
>  #define DT_SUBVOL	16
> @@ -1841,7 +1860,8 @@ static inline void SET_BCH_SB_BACKGROUND_COMPRESSION_TYPE(struct bch_sb *sb, __u
>  	x(new_varint,			15)	\
>  	x(journal_no_flush,		16)	\
>  	x(alloc_v2,			17)	\
> -	x(extents_across_btree_nodes,	18)
> +	x(extents_across_btree_nodes,	18)	\
> +	x(casefolding,			19)
>  
>  #define BCH_SB_FEATURES_ALWAYS				\
>  	((1ULL << BCH_FEATURE_new_extent_overwrite)|	\
> diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c
> index 6bdcccabd732..6ac45d7ce6e9 100644
> --- a/fs/bcachefs/dirent.c
> +++ b/fs/bcachefs/dirent.c
> @@ -13,8 +13,39 @@
>  
>  #include <linux/dcache.h>
>  
> -static inline unsigned dirent_val_u64s(unsigned len)
> +static inline int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info,
> +				const struct qstr *str, struct qstr *out_cf)
>  {
> +#if IS_ENABLED(CONFIG_UNICODE)
> +	unsigned char *buf = NULL;
> +	int ret;
> +
> +	*out_cf = (struct qstr) QSTR_INIT(NULL, 0);
> +
> +	buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1);
> +	ret = PTR_ERR_OR_ZERO(buf);
> +	if (ret)
> +		return ret;
> +
> +	ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1);

Looks like this helper really only needs the encoding pointer rather
than the full hash_info..?

> +	if (ret <= 0)
> +		return ret;
> +
> +	*out_cf = (struct qstr) QSTR_INIT(buf, ret);
> +	return 0;
> +#else
> +	*out_cf = (struct qstr) QSTR_INIT(NULL, 0);
> +
> +	return -EOPNOTSUPP;
> +#endif

It also looks like the #else could be completely avoided here. For
example, lift the local vars out of the ifdef, init out_cf using the
NULL buf, init ret to -EOPNOTSUPP and push 'return ret;' out of the end
of the ifdef. Hm?

> +}
> +
> +static inline unsigned dirent_val_u64s(unsigned len, unsigned cf_len)
> +{
> +	if (cf_len > 0) {
> +		return DIV_ROUND_UP(offsetof(struct bch_dirent, d_cf_name_block.d_names) + len + cf_len,
> +				    sizeof(u64));
> +	}
>  	return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len,
>  			    sizeof(u64));

FWIW, I think something along the lines of the pattern you use just
below makes this a bit more readable. For example, something like:

        unsigned name_offset;
        name_offset = (cf_len > 0) ?
                        offsetof(struct bch_dirent, d_cf_name_block.d_names) :
                        offsetof(struct bch_dirent, d_name);
        return DIV_ROUND_UP(name_offset + len + cf_len, sizeof(u64));

... but just a thought.

>  }
...
>  /* bcachefs inode flags -> FS_IOC_FSGETXATTR: */
> diff --git a/fs/bcachefs/str_hash.h b/fs/bcachefs/str_hash.h
> index ae21a8cca1b4..866cfc145889 100644
> --- a/fs/bcachefs/str_hash.h
> +++ b/fs/bcachefs/str_hash.h
> @@ -34,6 +34,7 @@ bch2_str_hash_opt_to_type(struct bch_fs *c, enum bch_str_hash_opts opt)
>  
>  struct bch_hash_info {
>  	u8			type;
> +	struct unicode_map 	*cf_encoding;
>  	/*
>  	 * For crc32 or crc64 string hashes the first key value of
>  	 * the siphash_key (k0) is used as the key.
> @@ -48,6 +49,9 @@ bch2_hash_info_init(struct bch_fs *c, const struct bch_inode_unpacked *bi)
>  	struct bch_hash_info info = {
>  		.type = (bi->bi_flags >> INODE_STR_HASH_OFFSET) &
>  			~(~0U << INODE_STR_HASH_BITS),
> +#if IS_ENABLED(CONFIG_UNICODE)
> +		.cf_encoding = !!(bi->bi_flags & BCH_INODE_CASEFOLDED) ? c->cf_encoding : NULL,
> +#endif

What's the purpose for propagating this pointer via the hash? Is this
necessary to (eventually) support per-directory encodings, or something
else? If purely the former and that is something to possibly come later,
it does seem a little premature to me in that the infrastructure part
should probably go along with the feature functionality that takes
advantage of it.

BTW on a similar train of thought.. I'm not familiar enough with the
feature to have an opinion on Gabriel's feedback wrt to storing the
casefolded name on-disk, but I do think there's value in breaking down
the feature into smaller logical components, particularly when there are
performance vs. complexity tradeoffs. So for example if you wanted to
continue to implement the on-disk format change approach, perhaps factor
out core functionality along the lines of what ext4 does into a patch 1,
then add the on-disk format change support in patch 2 as an (optional?)
optimization (then lift off the per-dir bits into later patches that
implement the granular approach, etc.).

That not only makes this easier to review, but IMO significantly easier
to maintain, test, diagnose, etc., such that it's at least worth
considering. Just my .02.

Brian

>  		.siphash_key = { .k0 = bi->bi_hash_seed }
>  	};
>  
> diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
> index 5c62fcf3afdb..78d19cfb966a 100644
> --- a/fs/bcachefs/super.c
> +++ b/fs/bcachefs/super.c
> @@ -757,6 +757,25 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts)
>  	if (ret)
>  		goto err;
>  
> +#if IS_ENABLED(CONFIG_UNICODE)
> +	/* Default encoding until we can potentially have more as an option. */
> +	c->cf_encoding = utf8_load(BCH_FS_DEFAULT_UTF8_ENCODING);
> +	if (IS_ERR(c->cf_encoding)) {
> +		printk(KERN_ERR "Cannot load UTF-8 encoding for filesystem. Version: %u.%u.%u",
> +			unicode_major(BCH_FS_DEFAULT_UTF8_ENCODING),
> +			unicode_minor(BCH_FS_DEFAULT_UTF8_ENCODING),
> +			unicode_rev(BCH_FS_DEFAULT_UTF8_ENCODING));
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +#else
> +	if (c->sb.features & (1ULL << BCH_FEATURE_casefolding)) {
> +		printk(KERN_ERR "Cannot mount a filesystem with casefolding on a kernel without CONFIG_UNICODE\n");
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +#endif
> +
>  	pr_uuid(&name, c->sb.user_uuid.b);
>  	strscpy(c->name, name.buf, sizeof(c->name));
>  	printbuf_exit(&name);
> -- 
> 2.41.0
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [v4 3/3] bcachefs: Implement casefolding
  2023-08-15  0:58       ` Kent Overstreet
@ 2023-08-15 18:00         ` Gabriel Krisman Bertazi
  2023-08-23  9:37           ` Joshua Ashton
  0 siblings, 1 reply; 15+ messages in thread
From: Gabriel Krisman Bertazi @ 2023-08-15 18:00 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Joshua Ashton, linux-bcachefs, André Almeida

Kent Overstreet <kent.overstreet@linux.dev> writes:

> On Mon, Aug 14, 2023 at 10:18:33PM +0100, Joshua Ashton wrote:
>
>> Grepping the test right now I already know we will fail the
>> `test_casefold_flag_removal` test as we don't support toggling the flag on
>> non-empty directories as it would require rehashing and thus restructuring
>> the btree.
>
> What was the motivation for supporting that on other filesystems? I'm
> not inclined to add that to bcachefs without a good reason.

I don't think there was a better reason than "it is user-friendly to
allow disabling the feature without having to remove the
directory". Note that it is trivial to do so in empty directories for
ext4/f2fs: it is just a matter of flipping a bit in the inode to
enable/disable casefolding.

I think it's ok for the Windows games to not allow disabling
case-insensitive once it is set for a directory.  For fstests, I suppose
the easiest solution would be to skip that subtest when running on
bcachefs.

>> > Sure, if you have both, dirent and dentry->d_name casefolded, then it is
>> > just a regular memcmp, but at least you don't have to change the disk
>> > format and your implementation becomes simpler.
>> > 
>> 
>> Given we have the opportunity to have support for this in the dirent format
>> for bcachefs, I think it makes sense to take it right now.
>
> It also means that we have the option of defining a casefolding map that
> handles ß correctly.

What do you mean by handling ß correctly?  do you intend to support NFKD
normalization?  We can extend fs/unicode/ for that fairly easily,
but iirc, other filesystems decided to stick with NFD, because NFKD is
a bit too eager when folding some characters.

>> > > +#if IS_ENABLED(CONFIG_UNICODE)
>> > > +	struct unicode_map 	*cf_encoding;
>> > > +#endif
>> > 
>> > This goes in struct super_block.
>> 
>> The code in libfs.c that uses the s_encoding in the super_block is never
>> used for bcachefs as it defines its own inode_operations and doesn't use any
>> of the libfs common code for lookups/hashes there (ie. the
>> dentry_operations).
>> 
>> One thing we wanted to do down the line was to use the bcachefs options
>> system to be able to pick the encoding on a per-inode basis.
>> 
>> Currently I am storing it in the struct bch_fs as currently there is no need
>> for other encodings right now and it is convenient, so there can only be one
>> -- but that could change at some point with an opt attached to the inode.
>> 
>> I am happy to bung it in the vfs super_block for now though -- but I need to
>> look at how that all works out with mounting and the multiple drives
>> situation. Just be aware that at some point, it will likely be moved back
>> again to our code.
>
> We also don't always have a super_block available when we have a running
> bch_fs object (e.g. when running in userspace, for fsck and various
> other tools). Since s_encoding is used in dirent.c, it probably needs to
> be in bch_fs.

Ah, I didn't know this is reused for userspace tools.  Still, as long as
you initialize sb->s_encoding at mount time, libfs helpers will work fine.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [v4 0/3] bcachefs: Casefolding implementation
  2023-08-14  1:24 [v4 0/3] bcachefs: Casefolding implementation Joshua Ashton
                   ` (2 preceding siblings ...)
  2023-08-14  1:24 ` [v4 3/3] bcachefs: Implement casefolding Joshua Ashton
@ 2023-08-16 19:40 ` Kent Overstreet
  2023-08-23  9:34   ` Joshua Ashton
  3 siblings, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2023-08-16 19:40 UTC (permalink / raw)
  To: Joshua Ashton; +Cc: linux-bcachefs

On Mon, Aug 14, 2023 at 02:24:24AM +0100, Joshua Ashton wrote:
> This implements support for case-insensitive file and directory
> lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
> casefolding attributes for bcachefs.
> 
> The implementation uses the same UTF-8 lowering and normalization that
> ext4 and f2fs is using currently
> 
> The main usecase for casefolding is compatibility with software written
> against other filesystems that rely on casefolded lookups
> (eg. NTFS and Wine/Proton).
> Taking advantage of file-system level casefolding can lead to great
> loading time gains in many applications and games.
> 
> See more information and rationale in:
> Documentation/filesystems/bcachefs/casefolding.rst
> 
> You can see/review the work in my tree here:
> https://github.com/Joshua-Ashton/bcachefs/tree/casefold
> 
> v2: Update format to store length in dirent based on flag, add comments, cleanup
> dirent_create_key, other misc. nits from IRC.
> 
> v3: Cleanup the dirent code, utilize the fact that BCH_NAME_MAX is smaller now,
> remove fallbacks for too big keys that won't make sense anymore. Improved
> lookup string lowering.
> 
> v4: Move cf_name allocation from stack to trans alloc, validated bkey in invalid,
> used length directly, misc. nits.

Josh, do you think you could post the benchmarks you ran at the start of
all this? We should look at those again after negative dentries are
hooked up and working correctly.

After giving this more thought, I'm starting to feel a bit more inclined
to Gabriel's approach...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [v4 0/3] bcachefs: Casefolding implementation
  2023-08-16 19:40 ` [v4 0/3] bcachefs: Casefolding implementation Kent Overstreet
@ 2023-08-23  9:34   ` Joshua Ashton
  0 siblings, 0 replies; 15+ messages in thread
From: Joshua Ashton @ 2023-08-23  9:34 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, Gabriel Krisman Bertazi



On 8/16/23 20:40, Kent Overstreet wrote:
> On Mon, Aug 14, 2023 at 02:24:24AM +0100, Joshua Ashton wrote:
>> This implements support for case-insensitive file and directory
>> lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
>> casefolding attributes for bcachefs.
>>
>> The implementation uses the same UTF-8 lowering and normalization that
>> ext4 and f2fs is using currently
>>
>> The main usecase for casefolding is compatibility with software written
>> against other filesystems that rely on casefolded lookups
>> (eg. NTFS and Wine/Proton).
>> Taking advantage of file-system level casefolding can lead to great
>> loading time gains in many applications and games.
>>
>> See more information and rationale in:
>> Documentation/filesystems/bcachefs/casefolding.rst
>>
>> You can see/review the work in my tree here:
>> https://github.com/Joshua-Ashton/bcachefs/tree/casefold
>>
>> v2: Update format to store length in dirent based on flag, add comments, cleanup
>> dirent_create_key, other misc. nits from IRC.
>>
>> v3: Cleanup the dirent code, utilize the fact that BCH_NAME_MAX is smaller now,
>> remove fallbacks for too big keys that won't make sense anymore. Improved
>> lookup string lowering.
>>
>> v4: Move cf_name allocation from stack to trans alloc, validated bkey in invalid,
>> used length directly, misc. nits.
> 
> Josh, do you think you could post the benchmarks you ran at the start of
> all this? We should look at those again after negative dentries are
> hooked up and working correctly.
> 
> After giving this more thought, I'm starting to feel a bit more inclined
> to Gabriel's approach...

Sure. I've cleaned the synthetic ones up into an easy-to-run userspace 
benchmark available here:

https://github.com/Joshua-Ashton/casefolding-microbench

My results look like the following with -O2 on a Ryzen 7900XT:

------------------------------------------------------------------------------------------
Benchmark                                                Time  CPU 
Iterations
------------------------------------------------------------------------------------------
memcmp_ascii_eq                                       5.83 ns 5.82 ns 
119813859
memcmp_ascii_immediately_neq                          3.71 ns 3.70 ns 
188580499
memcmp_unicode_eq                                     6.63 ns 6.60 ns 
111117537
memcmp_unicode_immediately_neq                        3.67 ns 3.66 ns 
192004721
utf8_strncasecmp_ascii_eq                            30221 ns 30118 ns 
      23699
utf8_strncasecmp_ascii_immediately_neq                72.7 ns 72.6 ns 
   9634262
utf8_strncasecmp_unicode_eq                          33375 ns 33328 ns 
      21011
utf8_strncasecmp_unicode_immediately_neq               235 ns 235 ns 
  2981485
utf8_strncasecmp_unicode_cased                       34084 ns 34027 ns 
      20973
utf8_strncasecmp_folded_ascii_eq                     16301 ns 16276 ns 
      42710
utf8_strncasecmp_folded_ascii_immediately_neq         19.6 ns 19.6 ns 
  35654425
utf8_strncasecmp_folded_unicode_eq                   17529 ns 17504 ns 
      39968
utf8_strncasecmp_folded_unicode_immediately_neq        110 ns 110 ns 
  6373884
utf8_strncasecmp_folded_unicode_cased                19958 ns 19876 ns 
      35236

As you can see, using utf8_strncasecmp on our largest name ASCII strings 
can result in a 5183x slowdown for comparison checks, and for Unicode 
strings of similar length, a 5846x slowdown.

utf8_strncasecmp_folded was documented to give a 30% improvement for 
casefold lookup times in a large directory 
(3ae72562ad917df36a1b1247d749240e3b4865db). This bench shows that to be 
more like a 50% improvement on utf8_strncasecmp which is ballpark-ish. 
This is still a slowdown of 3006x in the best-case compared to memcmp.

Bear in mind, this is probably only once for a cmp in a small directory, 
and multiple times if there are hash collisions. At it's worst, at a 
minimum it will take ~0.04ms to do simply the cmp of a single case 
insensitive lookup.

Using -O3 makes things a bit better for strncasecmp, but not by any 
orders of magnitude that it needs:

------------------------------------------------------------------------------------------
Benchmark                                                Time  CPU 
Iterations
------------------------------------------------------------------------------------------
memcmp_ascii_eq                                       3.63 ns 3.62 ns 
193349553
memcmp_ascii_immediately_neq                          2.16 ns 2.15 ns 
338061946
memcmp_unicode_eq                                     4.43 ns 4.42 ns 
158399738
memcmp_unicode_immediately_neq                        2.18 ns 2.17 ns 
322218734
utf8_strncasecmp_ascii_eq                            13476 ns 13449 ns 
      51587
utf8_strncasecmp_ascii_immediately_neq                31.9 ns 31.8 ns 
  21838062
utf8_strncasecmp_unicode_eq                          18575 ns 18536 ns 
      37588
utf8_strncasecmp_unicode_immediately_neq               130 ns 130 ns 
  5408163
utf8_strncasecmp_unicode_cased                       17719 ns 17690 ns 
      37665
utf8_strncasecmp_folded_ascii_eq                      6132 ns 6124 ns 
    114403
utf8_strncasecmp_folded_ascii_immediately_neq         8.03 ns 8.02 ns 
  87290013
utf8_strncasecmp_folded_unicode_eq                    8734 ns 8722 ns 
     80225
utf8_strncasecmp_folded_unicode_immediately_neq       52.2 ns 52.1 ns 
  13370548
utf8_strncasecmp_folded_unicode_cased                 9874 ns 9859 ns 
     71000

Still pretty significant though. :/

I would really really like to avoid this inherent extra cost going forward.

One other reason I am pretty against going with utf8_strncasecmp_folded 
(even initially) is because if/when we do decide to do casefold caching 
in fs, we have to keep all of that code path around in dirent_cmp for 
compat which is super gross.

This also means if we want to do other encodings, such as ASCII alone, 
or whatever, the _cmp code needs to grow support for those too, which I 
also really dislike.

The dirent_cmp and dirent_hash, etc code should be kept as simple and 
efficient as possible and entirely free of all of this aside from using 
the already provided names.

One question I know is going to be asked is: How does this translate to 
the real world?

Given that the switch from utf8_strncasecmp to utf8_strncasecmp_folded 
for EXT4 increased perf with large directories by 30% as-per 
3ae72562ad917df36a1b1247d749240e3b4865db, then using a pure memcmp must 
improve it pretty significantly from the benchmarks above, although that 
is my pure speculation based on extrapolating the two data sets. :-D

We are of course always eating the cost of the initial utf8_casefold for 
hashing and cmp though...

Is there a need for more data here or is this enough to base our 
decision off of?

IIRC we had some data points about Proton prefix creation time and stuff 
following the ext4 casefolding work and strncasecmp vs folded too, but I 
may need to dig up if anyone still has those. I don't think they are 
super relevant here though.

- Joshie 🐸✨

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [v4 3/3] bcachefs: Implement casefolding
  2023-08-15 18:00         ` Gabriel Krisman Bertazi
@ 2023-08-23  9:37           ` Joshua Ashton
  0 siblings, 0 replies; 15+ messages in thread
From: Joshua Ashton @ 2023-08-23  9:37 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, Kent Overstreet
  Cc: linux-bcachefs, André Almeida



On 8/15/23 19:00, Gabriel Krisman Bertazi wrote:
> Kent Overstreet <kent.overstreet@linux.dev> writes:
> 
>> On Mon, Aug 14, 2023 at 10:18:33PM +0100, Joshua Ashton wrote:
>>
>>> Grepping the test right now I already know we will fail the
>>> `test_casefold_flag_removal` test as we don't support toggling the flag on
>>> non-empty directories as it would require rehashing and thus restructuring
>>> the btree.
>>
>> What was the motivation for supporting that on other filesystems? I'm
>> not inclined to add that to bcachefs without a good reason.
> 
> I don't think there was a better reason than "it is user-friendly to
> allow disabling the feature without having to remove the
> directory". Note that it is trivial to do so in empty directories for
> ext4/f2fs: it is just a matter of flipping a bit in the inode to
> enable/disable casefolding.
> 
> I think it's ok for the Windows games to not allow disabling
> case-insensitive once it is set for a directory.  For fstests, I suppose
> the easiest solution would be to skip that subtest when running on
> bcachefs.
> 

Makes sense. And yeah, we would never unset it in Steam/Proton.

>>>> Sure, if you have both, dirent and dentry->d_name casefolded, then it is
>>>> just a regular memcmp, but at least you don't have to change the disk
>>>> format and your implementation becomes simpler.
>>>>
>>>
>>> Given we have the opportunity to have support for this in the dirent format
>>> for bcachefs, I think it makes sense to take it right now.
>>
>> It also means that we have the option of defining a casefolding map that
>> handles ß correctly.
> 
> What do you mean by handling ß correctly?  do you intend to support NFKD
> normalization?  We can extend fs/unicode/ for that fairly easily,
> but iirc, other filesystems decided to stick with NFD, because NFKD is
> a bit too eager when folding some characters.

Yes, I believe we wanted to add support for other normalizations as 
bcachefs options.

Or at the very least, ensure we structure what we do around being able 
to support that in the future.

> 
>>>>> +#if IS_ENABLED(CONFIG_UNICODE)
>>>>> +	struct unicode_map 	*cf_encoding;
>>>>> +#endif
>>>>
>>>> This goes in struct super_block.
>>>
>>> The code in libfs.c that uses the s_encoding in the super_block is never
>>> used for bcachefs as it defines its own inode_operations and doesn't use any
>>> of the libfs common code for lookups/hashes there (ie. the
>>> dentry_operations).
>>>
>>> One thing we wanted to do down the line was to use the bcachefs options
>>> system to be able to pick the encoding on a per-inode basis.
>>>
>>> Currently I am storing it in the struct bch_fs as currently there is no need
>>> for other encodings right now and it is convenient, so there can only be one
>>> -- but that could change at some point with an opt attached to the inode.
>>>
>>> I am happy to bung it in the vfs super_block for now though -- but I need to
>>> look at how that all works out with mounting and the multiple drives
>>> situation. Just be aware that at some point, it will likely be moved back
>>> again to our code.
>>
>> We also don't always have a super_block available when we have a running
>> bch_fs object (e.g. when running in userspace, for fsck and various
>> other tools). Since s_encoding is used in dirent.c, it probably needs to
>> be in bch_fs.
> 
> Ah, I didn't know this is reused for userspace tools.  Still, as long as
> you initialize sb->s_encoding at mount time, libfs helpers will work fine.
> 

Cool.

- Joshie 🐸✨

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [v4 3/3] bcachefs: Implement casefolding
  2023-08-15 12:36   ` Brian Foster
@ 2023-08-24  5:43     ` Joshua Ashton
  0 siblings, 0 replies; 15+ messages in thread
From: Joshua Ashton @ 2023-08-24  5:43 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs, André Almeida, Gabriel Krisman Bertazi



On 8/15/23 13:36, Brian Foster wrote:
> On Mon, Aug 14, 2023 at 02:24:27AM +0100, Joshua Ashton wrote:
>> This patch implements support for case-insensitive file name lookups
>> in bcachefs.
>>
>> The implementation uses the same UTF-8 lowering and normalization that
>> ext4 and f2fs is using.
>>
>> More information is provided in Documentation/bcachefs/casefolding.rst
>>
>> Signed-off-by: Joshua Ashton <joshua@froggi.es>
>>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: Gabriel Krisman Bertazi <krisman@suse.de>
>> ---
> 
> Hi Joshua,
> 
> I'm still grokking the implementation and I see Gabriel has already
> provided much more substantive feedback, so just a few random comments...
> 
>>   .../filesystems/bcachefs/casefolding.rst      |  73 ++++++++
>>   fs/bcachefs/bcachefs.h                        |   6 +
>>   fs/bcachefs/bcachefs_format.h                 |  26 ++-
>>   fs/bcachefs/dirent.c                          | 177 +++++++++++++++---
>>   fs/bcachefs/fs-common.c                       |   4 +
>>   fs/bcachefs/fs-ioctl.c                        |  22 +++
>>   fs/bcachefs/fs-ioctl.h                        |  20 +-
>>   fs/bcachefs/str_hash.h                        |   4 +
>>   fs/bcachefs/super.c                           |  19 ++
>>   9 files changed, 318 insertions(+), 33 deletions(-)
>>   create mode 100644 Documentation/filesystems/bcachefs/casefolding.rst
>>
>> diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
>> new file mode 100644
>> index 000000000000..c8580d892ae8
>> --- /dev/null
>> +++ b/Documentation/filesystems/bcachefs/casefolding.rst
>> @@ -0,0 +1,73 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +Casefolding
>> +===========
>> +
>> +bcachefs has support for case-insensitive file and directory
>> +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
>> +casefolding attributes.
>> +
>> +The main usecase for casefolding is compatibility with software written
>> +against other filesystems that rely on casefolded lookups
>> +(eg. NTFS and Wine/Proton).
>> +Taking advantage of file-system level casefolding can lead to great
>> +loading time gains in many applications and games.
>> +
>> +Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
>> +Once a directory has been flagged for casefolding, a feature bit
>> +is enabled on the superblock which marks the filesystem as using
>> +casefolding.
>> +When the feature bit for casefolding is enabled, it is no longer possible
>> +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
>> +
>> +On the lookup/query side: casefolding is implemented by allocating a new
>> +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
>> +casefold the query string.
>> +
>> +On the dirent side: casefolding is implemented by ensuring the `bkey`'s
>> +hash is made from the casefolded string and storing the cached casefolded
>> +name with the regular name in the dirent.
>> +
>> +The structure looks like this:
>> +
>> +Regular:    [dirent data][regular name][nul][nul]...
>> +Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
>> +
>> +(Do note, the number of `NUL`s here is merely for illustration, they count can vary
>> + per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.)
>> +
>> +This is efficient as it means that for all file lookups that require casefolding,
>> +it has identical performance to a regular lookup:
>> +a hash comparison and a `memcmp` of the name.
>> +
> 
> Thanks for the writeup. This is very helpful.
> 
>> +Rationale
>> +---------
>> +
>> +Several designs were considered for this system:
>> +One was to introduce a dirent_v2, however that would be painful especially as
>> +the hash system only has support for a single key type. This would also need
>> +`BCH_NAME_MAX` to change between versions, and a new feature bit.
>> +
>> +Another option was to store without the two lengths, and just take the length of
>> +the regular name and casefolded name contiguously / 2 as the length. This would
>> +assume that the regular length == casefolded length, but that could potentially
>> +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
>> +the lowercase unicode glyph.
>> +It would be possible to disregard the casefold cache for those cases, but it was
>> +decided to simply encode the two string lengths in the key to avoid random
>> +performance issues if this edgecase was ever hit.
>> +
>> +The option settled on was to use a free-bit in d_type to mark a dirent as having
>> +a casefold cache, and then treat the first 4 bytes the name block as lengths.
>> +You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
>> +
>> +The feature bit was used to allow casefolding support to be enabled for the majority
>> +of users, but some allow users who have no need for the feature to still use bcachefs as
>> +`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
>> +which may be decider between using bcachefs for eg. embedded platforms.
>> +
>> +Other filesystems like ext4 and f2fs have a super-block level option for casefolding
>> +encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
>> +any encodings than a single UTF-8 version. When future encodings are desirable,
>> +they will be added trivially using the opts mechanism.
>> +
> ...
>> diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h
>> index 23bae622309c..cc6797422977 100644
>> --- a/fs/bcachefs/bcachefs_format.h
>> +++ b/fs/bcachefs/bcachefs_format.h
> ...
>> @@ -864,6 +866,7 @@ enum {
>>   #define BCH_INODE_I_SECTORS_DIRTY (1 << __BCH_INODE_I_SECTORS_DIRTY)
>>   #define BCH_INODE_UNLINKED	(1 << __BCH_INODE_UNLINKED)
>>   #define BCH_INODE_BACKPTR_UNTRUSTED (1 << __BCH_INODE_BACKPTR_UNTRUSTED)
>> +#define BCH_INODE_CASEFOLDED	(1 << __BCH_INODE_CASEFOLDED)
>>   
>>   LE32_BITMASK(INODE_STR_HASH,	struct bch_inode, bi_flags, 20, 24);
>>   LE32_BITMASK(INODE_NR_FIELDS,	struct bch_inode, bi_flags, 24, 31);
>> @@ -908,9 +911,25 @@ struct bch_dirent {
>>   	 * Copy of mode bits 12-15 from the target inode - so userspace can get
>>   	 * the filetype without having to do a stat()
>>   	 */
>> -	__u8			d_type;
>> +#if defined(__LITTLE_ENDIAN_BITFIELD)
>> +	__u8			d_type:5,
>> +				d_unused:2,
>> +				d_casefold:1;
>> +#elif defined(__BIG_ENDIAN_BITFIELD)
>> +	__u8			d_casefold:1,
>> +				d_unused:2,
>> +				d_type:5;
>> +#endif
> 
> Interesting. I didn't realize that bitfield endianness matched general
> endianness, so I presume this is required to ensure data is always le on
> disk.

Correct.

> 
>>   
>> -	__u8			d_name[];
>> +	union {
>> +	struct {
>> +		__u8		d_pad;
>> +		__le16		d_name_len;
>> +		__le16		d_cf_name_len;
>> +		__u8		d_names[0];
>> +	} d_cf_name_block __packed;
>> +	__u8			d_name[0];
>> +	} __packed;
>>   } __packed __aligned(8);
>>   
>>   #define DT_SUBVOL	16
>> @@ -1841,7 +1860,8 @@ static inline void SET_BCH_SB_BACKGROUND_COMPRESSION_TYPE(struct bch_sb *sb, __u
>>   	x(new_varint,			15)	\
>>   	x(journal_no_flush,		16)	\
>>   	x(alloc_v2,			17)	\
>> -	x(extents_across_btree_nodes,	18)
>> +	x(extents_across_btree_nodes,	18)	\
>> +	x(casefolding,			19)
>>   
>>   #define BCH_SB_FEATURES_ALWAYS				\
>>   	((1ULL << BCH_FEATURE_new_extent_overwrite)|	\
>> diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c
>> index 6bdcccabd732..6ac45d7ce6e9 100644
>> --- a/fs/bcachefs/dirent.c
>> +++ b/fs/bcachefs/dirent.c
>> @@ -13,8 +13,39 @@
>>   
>>   #include <linux/dcache.h>
>>   
>> -static inline unsigned dirent_val_u64s(unsigned len)
>> +static inline int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info,
>> +				const struct qstr *str, struct qstr *out_cf)
>>   {
>> +#if IS_ENABLED(CONFIG_UNICODE)
>> +	unsigned char *buf = NULL;
>> +	int ret;
>> +
>> +	*out_cf = (struct qstr) QSTR_INIT(NULL, 0);
>> +
>> +	buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1);
>> +	ret = PTR_ERR_OR_ZERO(buf);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1);
> 
> Looks like this helper really only needs the encoding pointer rather
> than the full hash_info..?

Sure, we could do that.

> 
>> +	if (ret <= 0)
>> +		return ret;
>> +
>> +	*out_cf = (struct qstr) QSTR_INIT(buf, ret);
>> +	return 0;
>> +#else
>> +	*out_cf = (struct qstr) QSTR_INIT(NULL, 0);
>> +
>> +	return -EOPNOTSUPP;
>> +#endif
> 
> It also looks like the #else could be completely avoided here. For
> example, lift the local vars out of the ifdef, init out_cf using the
> NULL buf, init ret to -EOPNOTSUPP and push 'return ret;' out of the end
> of the ifdef. Hm?

Thanks for the heads up, will experiment with that

> 
>> +}
>> +
>> +static inline unsigned dirent_val_u64s(unsigned len, unsigned cf_len)
>> +{
>> +	if (cf_len > 0) {
>> +		return DIV_ROUND_UP(offsetof(struct bch_dirent, d_cf_name_block.d_names) + len + cf_len,
>> +				    sizeof(u64));
>> +	}
>>   	return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len,
>>   			    sizeof(u64));
> 
> FWIW, I think something along the lines of the pattern you use just
> below makes this a bit more readable. For example, something like:
> 
>          unsigned name_offset;
>          name_offset = (cf_len > 0) ?
>                          offsetof(struct bch_dirent, d_cf_name_block.d_names) :
>                          offsetof(struct bch_dirent, d_name);
>          return DIV_ROUND_UP(name_offset + len + cf_len, sizeof(u64));
> 
> ... but just a thought.
> 
>>   }
> ...
>>   /* bcachefs inode flags -> FS_IOC_FSGETXATTR: */
>> diff --git a/fs/bcachefs/str_hash.h b/fs/bcachefs/str_hash.h
>> index ae21a8cca1b4..866cfc145889 100644
>> --- a/fs/bcachefs/str_hash.h
>> +++ b/fs/bcachefs/str_hash.h
>> @@ -34,6 +34,7 @@ bch2_str_hash_opt_to_type(struct bch_fs *c, enum bch_str_hash_opts opt)
>>   
>>   struct bch_hash_info {
>>   	u8			type;
>> +	struct unicode_map 	*cf_encoding;
>>   	/*
>>   	 * For crc32 or crc64 string hashes the first key value of
>>   	 * the siphash_key (k0) is used as the key.
>> @@ -48,6 +49,9 @@ bch2_hash_info_init(struct bch_fs *c, const struct bch_inode_unpacked *bi)
>>   	struct bch_hash_info info = {
>>   		.type = (bi->bi_flags >> INODE_STR_HASH_OFFSET) &
>>   			~(~0U << INODE_STR_HASH_BITS),
>> +#if IS_ENABLED(CONFIG_UNICODE)
>> +		.cf_encoding = !!(bi->bi_flags & BCH_INODE_CASEFOLDED) ? c->cf_encoding : NULL,
>> +#endif
> 
> What's the purpose for propagating this pointer via the hash? Is this
> necessary to (eventually) support per-directory encodings, or something
> else? If purely the former and that is something to possibly come later,
> it does seem a little premature to me in that the infrastructure part
> should probably go along with the feature functionality that takes
> advantage of it.

Yes, it is for the convenience of being able to do per-directory 
encodings in the future -- but also it's the easiest way to get the 
information propagated from the superblock right now also.

If I was to not put it here, it would involve a bunch more work of 
passing through another pointer which to me makes little sense when we 
can just include it here.

It is used by the hashing process so it makes sense anyway, but I am 
happy to rename the struct.

> 
> BTW on a similar train of thought.. I'm not familiar enough with the
> feature to have an opinion on Gabriel's feedback wrt to storing the
> casefolded name on-disk, but I do think there's value in breaking down
> the feature into smaller logical components, particularly when there are
> performance vs. complexity tradeoffs. So for example if you wanted to
> continue to implement the on-disk format change approach, perhaps factor
> out core functionality along the lines of what ext4 does into a patch 1,
> then add the on-disk format change support in patch 2 as an (optional?)
> optimization (then lift off the per-dir bits into later patches that
> implement the granular approach, etc.).

I explained this in another email but will do again, I really don't want 
to do that approach because breaking it down like that means we will 
have to support a path that will never, ever be used forever for no 
reason. We'd have to make dirent_cmp, dirent_hash, and stuff aware of 
casefolding and it would just be a bunch of effectively dead code.

On face-value, it makes sense, but when you look closer, it doesn't really.

I also really don't want to bloat those functions that need to remain 
pretty fast with a bunch of stuff to casefold (in the case of bkey_hash) 
and use utf8_strncasecmp, etc.

It would all just be dead code that needs to be supported forever when 
we go to cache names. :/

> 
> That not only makes this easier to review, but IMO significantly easier
> to maintain, test, diagnose, etc., such that it's at least worth
> considering. Just my .02.

I don't really see how adding more code paths makes this easier to 
review and maintain...

- Joshie 🐸✨

> 
> Brian
> 
>>   		.siphash_key = { .k0 = bi->bi_hash_seed }
>>   	};
>>   
>> diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
>> index 5c62fcf3afdb..78d19cfb966a 100644
>> --- a/fs/bcachefs/super.c
>> +++ b/fs/bcachefs/super.c
>> @@ -757,6 +757,25 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts)
>>   	if (ret)
>>   		goto err;
>>   
>> +#if IS_ENABLED(CONFIG_UNICODE)
>> +	/* Default encoding until we can potentially have more as an option. */
>> +	c->cf_encoding = utf8_load(BCH_FS_DEFAULT_UTF8_ENCODING);
>> +	if (IS_ERR(c->cf_encoding)) {
>> +		printk(KERN_ERR "Cannot load UTF-8 encoding for filesystem. Version: %u.%u.%u",
>> +			unicode_major(BCH_FS_DEFAULT_UTF8_ENCODING),
>> +			unicode_minor(BCH_FS_DEFAULT_UTF8_ENCODING),
>> +			unicode_rev(BCH_FS_DEFAULT_UTF8_ENCODING));
>> +		ret = -EINVAL;
>> +		goto err;
>> +	}
>> +#else
>> +	if (c->sb.features & (1ULL << BCH_FEATURE_casefolding)) {
>> +		printk(KERN_ERR "Cannot mount a filesystem with casefolding on a kernel without CONFIG_UNICODE\n");
>> +		ret = -EINVAL;
>> +		goto err;
>> +	}
>> +#endif
>> +
>>   	pr_uuid(&name, c->sb.user_uuid.b);
>>   	strscpy(c->name, name.buf, sizeof(c->name));
>>   	printbuf_exit(&name);
>> -- 
>> 2.41.0
>>
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-08-24  5:43 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-14  1:24 [v4 0/3] bcachefs: Casefolding implementation Joshua Ashton
2023-08-14  1:24 ` [v4 1/3] bcachefs: Split out dirent alloc and name initialization Joshua Ashton
2023-08-14  1:24 ` [v4 2/3] bcachefs: Move dirent_val_u64s to dirent.c Joshua Ashton
2023-08-14  1:24 ` [v4 3/3] bcachefs: Implement casefolding Joshua Ashton
2023-08-14 16:34   ` Gabriel Krisman Bertazi
2023-08-14 21:18     ` Joshua Ashton
2023-08-14 21:33       ` Joshua Ashton
2023-08-14 23:01       ` Gabriel Krisman Bertazi
2023-08-15  0:58       ` Kent Overstreet
2023-08-15 18:00         ` Gabriel Krisman Bertazi
2023-08-23  9:37           ` Joshua Ashton
2023-08-15 12:36   ` Brian Foster
2023-08-24  5:43     ` Joshua Ashton
2023-08-16 19:40 ` [v4 0/3] bcachefs: Casefolding implementation Kent Overstreet
2023-08-23  9:34   ` Joshua Ashton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox