* [v4 0/3] bcachefs: Casefolding implementation
@ 2023-08-14 1:24 Joshua Ashton
2023-08-14 1:24 ` [v4 1/3] bcachefs: Split out dirent alloc and name initialization Joshua Ashton
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Joshua Ashton @ 2023-08-14 1:24 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Joshua Ashton
This implements support for case-insensitive file and directory
lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
casefolding attributes for bcachefs.
The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs is using currently
The main usecase for casefolding is compatibility with software written
against other filesystems that rely on casefolded lookups
(eg. NTFS and Wine/Proton).
Taking advantage of file-system level casefolding can lead to great
loading time gains in many applications and games.
See more information and rationale in:
Documentation/filesystems/bcachefs/casefolding.rst
You can see/review the work in my tree here:
https://github.com/Joshua-Ashton/bcachefs/tree/casefold
v2: Update format to store length in dirent based on flag, add comments, cleanup
dirent_create_key, other misc. nits from IRC.
v3: Cleanup the dirent code, utilize the fact that BCH_NAME_MAX is smaller now,
remove fallbacks for too big keys that won't make sense anymore. Improved
lookup string lowering.
v4: Move cf_name allocation from stack to trans alloc, validated bkey in invalid,
used length directly, misc. nits.
Joshua Ashton (3):
bcachefs: Split out dirent alloc and name initialization
bcachefs: Move dirent_val_u64s to dirent.c
bcachefs: Implement casefolding
.../filesystems/bcachefs/casefolding.rst | 73 ++++++
fs/bcachefs/bcachefs.h | 6 +
fs/bcachefs/bcachefs_format.h | 26 ++-
fs/bcachefs/dirent.c | 219 +++++++++++++++---
fs/bcachefs/dirent.h | 6 -
fs/bcachefs/fs-common.c | 4 +
fs/bcachefs/fs-ioctl.c | 22 ++
fs/bcachefs/fs-ioctl.h | 20 +-
fs/bcachefs/str_hash.h | 4 +
fs/bcachefs/super.c | 19 ++
10 files changed, 353 insertions(+), 46 deletions(-)
create mode 100644 Documentation/filesystems/bcachefs/casefolding.rst
--
2.41.0
^ permalink raw reply [flat|nested] 15+ messages in thread* [v4 1/3] bcachefs: Split out dirent alloc and name initialization 2023-08-14 1:24 [v4 0/3] bcachefs: Casefolding implementation Joshua Ashton @ 2023-08-14 1:24 ` Joshua Ashton 2023-08-14 1:24 ` [v4 2/3] bcachefs: Move dirent_val_u64s to dirent.c Joshua Ashton ` (2 subsequent siblings) 3 siblings, 0 replies; 15+ messages in thread From: Joshua Ashton @ 2023-08-14 1:24 UTC (permalink / raw) To: linux-bcachefs; +Cc: Joshua Ashton Splits out the code that allocates the dirent and initializes the name to make things easier to implement casefolding in a future commit. Signed-off-by: Joshua Ashton <joshua@froggi.es> --- fs/bcachefs/dirent.c | 46 ++++++++++++++++++++++++++++++++------------ 1 file changed, 34 insertions(+), 12 deletions(-) diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c index a7559ab03802..85769d5c6b82 100644 --- a/fs/bcachefs/dirent.c +++ b/fs/bcachefs/dirent.c @@ -169,15 +169,13 @@ void bch2_dirent_to_text(struct printbuf *out, struct bch_fs *c, bch2_d_type_str(d.v->d_type)); } -static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans, - subvol_inum dir, u8 type, - const struct qstr *name, u64 dst) +static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans, + subvol_inum dir, + u8 type, + int name_len, u64 dst) { struct bkey_i_dirent *dirent; - unsigned u64s = BKEY_U64s + dirent_val_u64s(name->len); - - if (name->len > BCH_NAME_MAX) - return ERR_PTR(-ENAMETOOLONG); + unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len); BUG_ON(u64s > U8_MAX); @@ -197,11 +195,35 @@ static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans, dirent->v.d_type = type; - memcpy(dirent->v.d_name, name->name, name->len); - memset(dirent->v.d_name + name->len, 0, - bkey_val_bytes(&dirent->k) - - offsetof(struct bch_dirent, d_name) - - name->len); + return dirent; +} + +static void dirent_init_regular_name(struct bkey_i_dirent *dirent, + const struct qstr *name) +{ + memcpy(&dirent->v.d_name[0], name->name, name->len); + memset(&dirent->v.d_name[name->len], 0, + bkey_val_bytes(&dirent->k) - + offsetof(struct bch_dirent, d_name) - + name->len); +} + +static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans, + subvol_inum dir, + u8 type, + const struct qstr *name, + u64 dst) +{ + struct bkey_i_dirent *dirent; + + if (name->len > BCH_NAME_MAX) + return ERR_PTR(-ENAMETOOLONG); + + dirent = dirent_alloc_key(trans, dir, type, name->len, dst); + if (IS_ERR(dirent)) + return dirent; + + dirent_init_regular_name(dirent, name); EBUG_ON(bch2_dirent_name_bytes(dirent_i_to_s_c(dirent)) != name->len); -- 2.41.0 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [v4 2/3] bcachefs: Move dirent_val_u64s to dirent.c 2023-08-14 1:24 [v4 0/3] bcachefs: Casefolding implementation Joshua Ashton 2023-08-14 1:24 ` [v4 1/3] bcachefs: Split out dirent alloc and name initialization Joshua Ashton @ 2023-08-14 1:24 ` Joshua Ashton 2023-08-14 1:24 ` [v4 3/3] bcachefs: Implement casefolding Joshua Ashton 2023-08-16 19:40 ` [v4 0/3] bcachefs: Casefolding implementation Kent Overstreet 3 siblings, 0 replies; 15+ messages in thread From: Joshua Ashton @ 2023-08-14 1:24 UTC (permalink / raw) To: linux-bcachefs; +Cc: Joshua Ashton Not used anywhere else anymore, can become local to dirent.c now. Signed-off-by: Joshua Ashton <joshua@froggi.es> --- fs/bcachefs/dirent.c | 6 ++++++ fs/bcachefs/dirent.h | 6 ------ 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c index 85769d5c6b82..6bdcccabd732 100644 --- a/fs/bcachefs/dirent.c +++ b/fs/bcachefs/dirent.c @@ -13,6 +13,12 @@ #include <linux/dcache.h> +static inline unsigned dirent_val_u64s(unsigned len) +{ + return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len, + sizeof(u64)); +} + static unsigned bch2_dirent_name_bytes(struct bkey_s_c_dirent d) { unsigned bkey_u64s = bkey_val_u64s(d.k); diff --git a/fs/bcachefs/dirent.h b/fs/bcachefs/dirent.h index e9fa1df38232..56061ac93ad3 100644 --- a/fs/bcachefs/dirent.h +++ b/fs/bcachefs/dirent.h @@ -26,12 +26,6 @@ struct bch_inode_info; struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d); -static inline unsigned dirent_val_u64s(unsigned len) -{ - return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len, - sizeof(u64)); -} - int bch2_dirent_read_target(struct btree_trans *, subvol_inum, struct bkey_s_c_dirent, subvol_inum *); -- 2.41.0 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [v4 3/3] bcachefs: Implement casefolding 2023-08-14 1:24 [v4 0/3] bcachefs: Casefolding implementation Joshua Ashton 2023-08-14 1:24 ` [v4 1/3] bcachefs: Split out dirent alloc and name initialization Joshua Ashton 2023-08-14 1:24 ` [v4 2/3] bcachefs: Move dirent_val_u64s to dirent.c Joshua Ashton @ 2023-08-14 1:24 ` Joshua Ashton 2023-08-14 16:34 ` Gabriel Krisman Bertazi 2023-08-15 12:36 ` Brian Foster 2023-08-16 19:40 ` [v4 0/3] bcachefs: Casefolding implementation Kent Overstreet 3 siblings, 2 replies; 15+ messages in thread From: Joshua Ashton @ 2023-08-14 1:24 UTC (permalink / raw) To: linux-bcachefs; +Cc: Joshua Ashton, André Almeida, Gabriel Krisman Bertazi This patch implements support for case-insensitive file name lookups in bcachefs. The implementation uses the same UTF-8 lowering and normalization that ext4 and f2fs is using. More information is provided in Documentation/bcachefs/casefolding.rst Signed-off-by: Joshua Ashton <joshua@froggi.es> Cc: André Almeida <andrealmeid@igalia.com> Cc: Gabriel Krisman Bertazi <krisman@suse.de> --- .../filesystems/bcachefs/casefolding.rst | 73 ++++++++ fs/bcachefs/bcachefs.h | 6 + fs/bcachefs/bcachefs_format.h | 26 ++- fs/bcachefs/dirent.c | 177 +++++++++++++++--- fs/bcachefs/fs-common.c | 4 + fs/bcachefs/fs-ioctl.c | 22 +++ fs/bcachefs/fs-ioctl.h | 20 +- fs/bcachefs/str_hash.h | 4 + fs/bcachefs/super.c | 19 ++ 9 files changed, 318 insertions(+), 33 deletions(-) create mode 100644 Documentation/filesystems/bcachefs/casefolding.rst diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst new file mode 100644 index 000000000000..c8580d892ae8 --- /dev/null +++ b/Documentation/filesystems/bcachefs/casefolding.rst @@ -0,0 +1,73 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Casefolding +=========== + +bcachefs has support for case-insensitive file and directory +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`) +casefolding attributes. + +The main usecase for casefolding is compatibility with software written +against other filesystems that rely on casefolded lookups +(eg. NTFS and Wine/Proton). +Taking advantage of file-system level casefolding can lead to great +loading time gains in many applications and games. + +Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled. +Once a directory has been flagged for casefolding, a feature bit +is enabled on the superblock which marks the filesystem as using +casefolding. +When the feature bit for casefolding is enabled, it is no longer possible +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled. + +On the lookup/query side: casefolding is implemented by allocating a new +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to +casefold the query string. + +On the dirent side: casefolding is implemented by ensuring the `bkey`'s +hash is made from the casefolded string and storing the cached casefolded +name with the regular name in the dirent. + +The structure looks like this: + +Regular: [dirent data][regular name][nul][nul]... +Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]... + +(Do note, the number of `NUL`s here is merely for illustration, they count can vary + per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.) + +This is efficient as it means that for all file lookups that require casefolding, +it has identical performance to a regular lookup: +a hash comparison and a `memcmp` of the name. + +Rationale +--------- + +Several designs were considered for this system: +One was to introduce a dirent_v2, however that would be painful especially as +the hash system only has support for a single key type. This would also need +`BCH_NAME_MAX` to change between versions, and a new feature bit. + +Another option was to store without the two lengths, and just take the length of +the regular name and casefolded name contiguously / 2 as the length. This would +assume that the regular length == casefolded length, but that could potentially +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than +the lowercase unicode glyph. +It would be possible to disregard the casefold cache for those cases, but it was +decided to simply encode the two string lengths in the key to avoid random +performance issues if this edgecase was ever hit. + +The option settled on was to use a free-bit in d_type to mark a dirent as having +a casefold cache, and then treat the first 4 bytes the name block as lengths. +You can see this in the `d_cf_name_block` member of union in `bch_dirent`. + +The feature bit was used to allow casefolding support to be enabled for the majority +of users, but some allow users who have no need for the feature to still use bcachefs as +`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used, +which may be decider between using bcachefs for eg. embedded platforms. + +Other filesystems like ext4 and f2fs have a super-block level option for casefolding +encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose +any encodings than a single UTF-8 version. When future encodings are desirable, +they will be added trivially using the opts mechanism. + diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h index 30b3d7b9f9dc..0e1facdc7876 100644 --- a/fs/bcachefs/bcachefs.h +++ b/fs/bcachefs/bcachefs.h @@ -202,6 +202,7 @@ #include <linux/types.h> #include <linux/workqueue.h> #include <linux/zstd.h> +#include <linux/unicode.h> #include "bcachefs_format.h" #include "errcode.h" @@ -657,6 +658,8 @@ enum bch_write_ref { BCH_WRITE_REF_NR, }; +#define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0) + struct bch_fs { struct closure cl; @@ -723,6 +726,9 @@ struct bch_fs { u64 compat; } sb; +#if IS_ENABLED(CONFIG_UNICODE) + struct unicode_map *cf_encoding; +#endif struct bch_sb_handle disk_sb; diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h index 23bae622309c..cc6797422977 100644 --- a/fs/bcachefs/bcachefs_format.h +++ b/fs/bcachefs/bcachefs_format.h @@ -852,6 +852,8 @@ enum { __BCH_INODE_UNLINKED = 7, __BCH_INODE_BACKPTR_UNTRUSTED = 8, + __BCH_INODE_CASEFOLDED = 9, + /* bits 20+ reserved for packed fields below: */ }; @@ -864,6 +866,7 @@ enum { #define BCH_INODE_I_SECTORS_DIRTY (1 << __BCH_INODE_I_SECTORS_DIRTY) #define BCH_INODE_UNLINKED (1 << __BCH_INODE_UNLINKED) #define BCH_INODE_BACKPTR_UNTRUSTED (1 << __BCH_INODE_BACKPTR_UNTRUSTED) +#define BCH_INODE_CASEFOLDED (1 << __BCH_INODE_CASEFOLDED) LE32_BITMASK(INODE_STR_HASH, struct bch_inode, bi_flags, 20, 24); LE32_BITMASK(INODE_NR_FIELDS, struct bch_inode, bi_flags, 24, 31); @@ -908,9 +911,25 @@ struct bch_dirent { * Copy of mode bits 12-15 from the target inode - so userspace can get * the filetype without having to do a stat() */ - __u8 d_type; +#if defined(__LITTLE_ENDIAN_BITFIELD) + __u8 d_type:5, + d_unused:2, + d_casefold:1; +#elif defined(__BIG_ENDIAN_BITFIELD) + __u8 d_casefold:1, + d_unused:2, + d_type:5; +#endif - __u8 d_name[]; + union { + struct { + __u8 d_pad; + __le16 d_name_len; + __le16 d_cf_name_len; + __u8 d_names[0]; + } d_cf_name_block __packed; + __u8 d_name[0]; + } __packed; } __packed __aligned(8); #define DT_SUBVOL 16 @@ -1841,7 +1860,8 @@ static inline void SET_BCH_SB_BACKGROUND_COMPRESSION_TYPE(struct bch_sb *sb, __u x(new_varint, 15) \ x(journal_no_flush, 16) \ x(alloc_v2, 17) \ - x(extents_across_btree_nodes, 18) + x(extents_across_btree_nodes, 18) \ + x(casefolding, 19) #define BCH_SB_FEATURES_ALWAYS \ ((1ULL << BCH_FEATURE_new_extent_overwrite)| \ diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c index 6bdcccabd732..6ac45d7ce6e9 100644 --- a/fs/bcachefs/dirent.c +++ b/fs/bcachefs/dirent.c @@ -13,8 +13,39 @@ #include <linux/dcache.h> -static inline unsigned dirent_val_u64s(unsigned len) +static inline int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info, + const struct qstr *str, struct qstr *out_cf) { +#if IS_ENABLED(CONFIG_UNICODE) + unsigned char *buf = NULL; + int ret; + + *out_cf = (struct qstr) QSTR_INIT(NULL, 0); + + buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1); + ret = PTR_ERR_OR_ZERO(buf); + if (ret) + return ret; + + ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1); + if (ret <= 0) + return ret; + + *out_cf = (struct qstr) QSTR_INIT(buf, ret); + return 0; +#else + *out_cf = (struct qstr) QSTR_INIT(NULL, 0); + + return -EOPNOTSUPP; +#endif +} + +static inline unsigned dirent_val_u64s(unsigned len, unsigned cf_len) +{ + if (cf_len > 0) { + return DIV_ROUND_UP(offsetof(struct bch_dirent, d_cf_name_block.d_names) + len + cf_len, + sizeof(u64)); + } return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len, sizeof(u64)); } @@ -31,13 +62,38 @@ static unsigned bch2_dirent_name_bytes(struct bkey_s_c_dirent d) #endif return bkey_bytes - - offsetof(struct bch_dirent, d_name) - + (d.v->d_casefold + ? offsetof(struct bch_dirent, d_cf_name_block.d_names) + : offsetof(struct bch_dirent, d_name)) - trailing_nuls; } struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d) { - return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d)); + if (d.v->d_casefold) { + unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len); + return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[0], name_len); + } else { + return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d)); + } +} + +struct qstr bch2_dirent_get_casefold_name(struct bkey_s_c_dirent d) +{ + if (d.v->d_casefold) { + unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len); + unsigned cf_name_len = le16_to_cpu(d.v->d_cf_name_block.d_cf_name_len); + return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[name_len], cf_name_len); + } else { + return (struct qstr) QSTR_INIT(NULL, 0); + } +} + +static inline struct qstr bch2_dirent_get_lookup_name(struct bkey_s_c_dirent d) +{ + return d.v->d_casefold + ? bch2_dirent_get_casefold_name(d) + : bch2_dirent_get_name(d); } static u64 bch2_dirent_hash(const struct bch_hash_info *info, @@ -60,7 +116,7 @@ static u64 dirent_hash_key(const struct bch_hash_info *info, const void *key) static u64 dirent_hash_bkey(const struct bch_hash_info *info, struct bkey_s_c k) { struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k); - struct qstr name = bch2_dirent_get_name(d); + struct qstr name = bch2_dirent_get_lookup_name(d); return bch2_dirent_hash(info, &name); } @@ -68,7 +124,7 @@ static u64 dirent_hash_bkey(const struct bch_hash_info *info, struct bkey_s_c k) static bool dirent_cmp_key(struct bkey_s_c _l, const void *_r) { struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l); - const struct qstr l_name = bch2_dirent_get_name(l); + const struct qstr l_name = bch2_dirent_get_lookup_name(l); const struct qstr *r_name = _r; return l_name.len - r_name->len ?: memcmp(l_name.name, r_name->name, l_name.len); @@ -78,8 +134,8 @@ static bool dirent_cmp_bkey(struct bkey_s_c _l, struct bkey_s_c _r) { struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l); struct bkey_s_c_dirent r = bkey_s_c_to_dirent(_r); - const struct qstr l_name = bch2_dirent_get_name(l); - const struct qstr r_name = bch2_dirent_get_name(r); + const struct qstr l_name = bch2_dirent_get_lookup_name(l); + const struct qstr r_name = bch2_dirent_get_lookup_name(r); return l_name.len - r_name.len ?: memcmp(l_name.name, r_name.name, l_name.len); } @@ -108,16 +164,24 @@ int bch2_dirent_invalid(const struct bch_fs *c, struct bkey_s_c k, struct printbuf *err) { struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k); + unsigned name_block_len = bch2_dirent_name_bytes(d); struct qstr d_name = bch2_dirent_get_name(d); + struct qstr d_cf_name = bch2_dirent_get_casefold_name(d); if (!d_name.len) { prt_printf(err, "empty name"); return -BCH_ERR_invalid_bkey; } - if (bkey_val_u64s(k.k) > dirent_val_u64s(d_name.len)) { + if (bkey_val_u64s(k.k) > dirent_val_u64s(d_name.len, d_cf_name.len)) { prt_printf(err, "value too big (%zu > %u)", - bkey_val_u64s(k.k), dirent_val_u64s(d_name.len)); + bkey_val_u64s(k.k), dirent_val_u64s(d_name.len, d_cf_name.len)); + return -BCH_ERR_invalid_bkey; + } + + if (d_name.len + d_cf_name.len > name_block_len) { + prt_printf(err, "dirent names exceed bkey size (%d + %d > %d)", + d_name.len, d_cf_name.len, name_block_len); return -BCH_ERR_invalid_bkey; } @@ -157,6 +221,19 @@ int bch2_dirent_invalid(const struct bch_fs *c, struct bkey_s_c k, return -BCH_ERR_invalid_bkey; } + if (d.v->d_casefold) { + if ((flags & BKEY_INVALID_COMMIT) && d_cf_name.len > BCH_NAME_MAX) { + prt_printf(err, "dirent w/ cf name too big (%u > %u)", + d_cf_name.len, BCH_NAME_MAX); + return -BCH_ERR_invalid_bkey; + } + + if (d_cf_name.len != strnlen(d_cf_name.name, d_cf_name.len)) { + prt_printf(err, "dirent has stray data after cf name's NUL"); + return -BCH_ERR_invalid_bkey; + } + } + return 0; } @@ -178,10 +255,11 @@ void bch2_dirent_to_text(struct printbuf *out, struct bch_fs *c, static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans, subvol_inum dir, u8 type, - int name_len, u64 dst) + int name_len, int cf_name_len, + u64 dst) { struct bkey_i_dirent *dirent; - unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len); + unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len, cf_name_len); BUG_ON(u64s > U8_MAX); @@ -200,6 +278,8 @@ static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans, } dirent->v.d_type = type; + dirent->v.d_unused = 0; + dirent->v.d_casefold = cf_name_len ? 1 : 0; return dirent; } @@ -207,6 +287,8 @@ static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans, static void dirent_init_regular_name(struct bkey_i_dirent *dirent, const struct qstr *name) { + EBUG_ON(dirent->v.d_casefold); + memcpy(&dirent->v.d_name[0], name->name, name->len); memset(&dirent->v.d_name[name->len], 0, bkey_val_bytes(&dirent->k) - @@ -214,10 +296,30 @@ static void dirent_init_regular_name(struct bkey_i_dirent *dirent, name->len); } +static void dirent_init_casefolded_name(struct bkey_i_dirent *dirent, + const struct qstr *name, + const struct qstr *cf_name) +{ + EBUG_ON(!dirent->v.d_casefold); + EBUG_ON(!cf_name->len); + + dirent->v.d_cf_name_block.d_name_len = name->len; + dirent->v.d_cf_name_block.d_cf_name_len = cf_name->len; + memcpy(&dirent->v.d_cf_name_block.d_names[0], name->name, name->len); + memcpy(&dirent->v.d_cf_name_block.d_names[name->len], cf_name->name, cf_name->len); + memset(&dirent->v.d_cf_name_block.d_names[name->len + cf_name->len], 0, + bkey_val_bytes(&dirent->k) - + offsetof(struct bch_dirent, d_cf_name_block.d_names) - + name->len + cf_name->len); + + EBUG_ON(bch2_dirent_get_casefold_name(dirent_i_to_s_c(dirent)).len != cf_name->len); +} + static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans, subvol_inum dir, u8 type, const struct qstr *name, + const struct qstr *cf_name, u64 dst) { struct bkey_i_dirent *dirent; @@ -225,13 +327,16 @@ static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans, if (name->len > BCH_NAME_MAX) return ERR_PTR(-ENAMETOOLONG); - dirent = dirent_alloc_key(trans, dir, type, name->len, dst); + dirent = dirent_alloc_key(trans, dir, type, name->len, cf_name ? cf_name->len : 0, dst); if (IS_ERR(dirent)) return dirent; - dirent_init_regular_name(dirent, name); + if (cf_name) + dirent_init_casefolded_name(dirent, name, cf_name); + else + dirent_init_regular_name(dirent, name); - EBUG_ON(bch2_dirent_name_bytes(dirent_i_to_s_c(dirent)) != name->len); + EBUG_ON(bch2_dirent_get_name(dirent_i_to_s_c(dirent)).len != name->len); return dirent; } @@ -244,7 +349,16 @@ int bch2_dirent_create(struct btree_trans *trans, subvol_inum dir, struct bkey_i_dirent *dirent; int ret; - dirent = dirent_create_key(trans, dir, type, name, dst_inum); + if (hash_info->cf_encoding) { + struct qstr cf_name; + ret = bch2_casefold(trans, hash_info, name, &cf_name); + if (ret) + return ret; + dirent = dirent_create_key(trans, dir, type, name, &cf_name, dst_inum); + } else { + dirent = dirent_create_key(trans, dir, type, name, NULL, dst_inum); + } + ret = PTR_ERR_OR_ZERO(dirent); if (ret) return ret; @@ -294,6 +408,8 @@ int bch2_dirent_rename(struct btree_trans *trans, const struct qstr *dst_name, subvol_inum *dst_inum, u64 *dst_offset, enum bch_rename_mode mode) { + struct qstr src_name_lookup = *src_name; + struct qstr dst_name_lookup = *dst_name; struct btree_iter src_iter = { NULL }; struct btree_iter dst_iter = { NULL }; struct bkey_s_c old_src, old_dst = bkey_s_c_null; @@ -310,8 +426,13 @@ int bch2_dirent_rename(struct btree_trans *trans, memset(dst_inum, 0, sizeof(*dst_inum)); /* Lookup src: */ + ret = src_hash->cf_encoding + ? bch2_casefold(trans, src_hash, src_name, &src_name_lookup) + : 0; + if (ret) + goto out; ret = bch2_hash_lookup(trans, &src_iter, bch2_dirent_hash_desc, - src_hash, src_dir, src_name, + src_hash, src_dir, &src_name_lookup, BTREE_ITER_INTENT); if (ret) goto out; @@ -333,6 +454,11 @@ int bch2_dirent_rename(struct btree_trans *trans, /* Lookup dst: */ + ret = dst_hash->cf_encoding + ? bch2_casefold(trans, dst_hash, dst_name, &dst_name_lookup) + : 0; + if (ret) + goto out; if (mode == BCH_RENAME) { /* * Note that we're _not_ checking if the target already exists - @@ -340,12 +466,12 @@ int bch2_dirent_rename(struct btree_trans *trans, * correctness: */ ret = bch2_hash_hole(trans, &dst_iter, bch2_dirent_hash_desc, - dst_hash, dst_dir, dst_name); + dst_hash, dst_dir, &dst_name_lookup); if (ret) goto out; } else { ret = bch2_hash_lookup(trans, &dst_iter, bch2_dirent_hash_desc, - dst_hash, dst_dir, dst_name, + dst_hash, dst_dir, &dst_name_lookup, BTREE_ITER_INTENT); if (ret) goto out; @@ -370,7 +496,8 @@ int bch2_dirent_rename(struct btree_trans *trans, *src_offset = dst_iter.pos.offset; /* Create new dst key: */ - new_dst = dirent_create_key(trans, dst_dir, 0, dst_name, 0); + new_dst = dirent_create_key(trans, dst_dir, 0, dst_name, + dst_hash->cf_encoding ? &dst_name_lookup : NULL, 0); ret = PTR_ERR_OR_ZERO(new_dst); if (ret) goto out; @@ -380,7 +507,8 @@ int bch2_dirent_rename(struct btree_trans *trans, /* Create new src key: */ if (mode == BCH_RENAME_EXCHANGE) { - new_src = dirent_create_key(trans, src_dir, 0, src_name, 0); + new_src = dirent_create_key(trans, src_dir, 0, src_name, + src_hash->cf_encoding ? &src_name_lookup : NULL, 0); ret = PTR_ERR_OR_ZERO(new_src); if (ret) goto out; @@ -472,6 +600,7 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans, const struct qstr *name, subvol_inum *inum, unsigned flags) { + struct qstr lookup_name = *name; struct bkey_s_c k; struct bkey_s_c_dirent d; u32 snapshot; @@ -481,8 +610,14 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans, if (ret) return ret; + ret = hash_info->cf_encoding + ? bch2_casefold(trans, hash_info, name, &lookup_name) + : 0; + if (ret) + return ret; + ret = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc, - hash_info, dir, name, flags); + hash_info, dir, &lookup_name, flags); if (ret) return ret; diff --git a/fs/bcachefs/fs-common.c b/fs/bcachefs/fs-common.c index bb5305441f27..190ff3e0da83 100644 --- a/fs/bcachefs/fs-common.c +++ b/fs/bcachefs/fs-common.c @@ -46,6 +46,10 @@ int bch2_create_trans(struct btree_trans *trans, if (ret) goto err; + /* Inherit casefold state from parent. */ + if (S_ISDIR(mode)) + new_inode->bi_flags |= dir_u->bi_flags & BCH_INODE_CASEFOLDED; + if (!(flags & BCH_CREATE_SNAPSHOT)) { /* Normal create path - allocate a new inode: */ bch2_inode_init_late(new_inode, now, uid, gid, mode, rdev, dir_u); diff --git a/fs/bcachefs/fs-ioctl.c b/fs/bcachefs/fs-ioctl.c index 141bcced031e..d0bac39e2326 100644 --- a/fs/bcachefs/fs-ioctl.c +++ b/fs/bcachefs/fs-ioctl.c @@ -54,6 +54,28 @@ static int bch2_inode_flags_set(struct btree_trans *trans, (newflags & (BCH_INODE_NODUMP|BCH_INODE_NOATIME)) != newflags) return -EINVAL; + if ((newflags ^ oldflags) & BCH_INODE_CASEFOLDED) { +#if IS_ENABLED(CONFIG_UNICODE) + int ret = 0; + /* Not supported on individual files. */ + if (!S_ISDIR(bi->bi_mode)) + return -EOPNOTSUPP; + + /* + * Make sure the dir is empty, as otherwise we'd need to + * rehash everything and update the dirent keys. + */ + ret = bch2_empty_dir_trans(trans, inode_inum(inode)); + if (ret < 0) + return ret; + + bch2_check_set_feature(c, BCH_FEATURE_casefolding); +#else + printk(KERN_ERR "Cannot use casefolding on a kernel without CONFIG_UNICODE\n"); + return -EOPNOTSUPP; +#endif + } + if (s->set_projinherit) { bi->bi_fields_set &= ~(1 << Inode_opt_project); bi->bi_fields_set |= ((int) s->projinherit << Inode_opt_project); diff --git a/fs/bcachefs/fs-ioctl.h b/fs/bcachefs/fs-ioctl.h index f201980ef2c3..2950091b5ac6 100644 --- a/fs/bcachefs/fs-ioctl.h +++ b/fs/bcachefs/fs-ioctl.h @@ -6,19 +6,21 @@ /* bcachefs inode flags -> vfs inode flags: */ static const unsigned bch_flags_to_vfs[] = { - [__BCH_INODE_SYNC] = S_SYNC, - [__BCH_INODE_IMMUTABLE] = S_IMMUTABLE, - [__BCH_INODE_APPEND] = S_APPEND, - [__BCH_INODE_NOATIME] = S_NOATIME, + [__BCH_INODE_SYNC] = S_SYNC, + [__BCH_INODE_IMMUTABLE] = S_IMMUTABLE, + [__BCH_INODE_APPEND] = S_APPEND, + [__BCH_INODE_NOATIME] = S_NOATIME, + [__BCH_INODE_CASEFOLDED] = S_CASEFOLD, }; /* bcachefs inode flags -> FS_IOC_GETFLAGS: */ static const unsigned bch_flags_to_uflags[] = { - [__BCH_INODE_SYNC] = FS_SYNC_FL, - [__BCH_INODE_IMMUTABLE] = FS_IMMUTABLE_FL, - [__BCH_INODE_APPEND] = FS_APPEND_FL, - [__BCH_INODE_NODUMP] = FS_NODUMP_FL, - [__BCH_INODE_NOATIME] = FS_NOATIME_FL, + [__BCH_INODE_SYNC] = FS_SYNC_FL, + [__BCH_INODE_IMMUTABLE] = FS_IMMUTABLE_FL, + [__BCH_INODE_APPEND] = FS_APPEND_FL, + [__BCH_INODE_NODUMP] = FS_NODUMP_FL, + [__BCH_INODE_NOATIME] = FS_NOATIME_FL, + [__BCH_INODE_CASEFOLDED] = FS_CASEFOLD_FL, }; /* bcachefs inode flags -> FS_IOC_FSGETXATTR: */ diff --git a/fs/bcachefs/str_hash.h b/fs/bcachefs/str_hash.h index ae21a8cca1b4..866cfc145889 100644 --- a/fs/bcachefs/str_hash.h +++ b/fs/bcachefs/str_hash.h @@ -34,6 +34,7 @@ bch2_str_hash_opt_to_type(struct bch_fs *c, enum bch_str_hash_opts opt) struct bch_hash_info { u8 type; + struct unicode_map *cf_encoding; /* * For crc32 or crc64 string hashes the first key value of * the siphash_key (k0) is used as the key. @@ -48,6 +49,9 @@ bch2_hash_info_init(struct bch_fs *c, const struct bch_inode_unpacked *bi) struct bch_hash_info info = { .type = (bi->bi_flags >> INODE_STR_HASH_OFFSET) & ~(~0U << INODE_STR_HASH_BITS), +#if IS_ENABLED(CONFIG_UNICODE) + .cf_encoding = !!(bi->bi_flags & BCH_INODE_CASEFOLDED) ? c->cf_encoding : NULL, +#endif .siphash_key = { .k0 = bi->bi_hash_seed } }; diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c index 5c62fcf3afdb..78d19cfb966a 100644 --- a/fs/bcachefs/super.c +++ b/fs/bcachefs/super.c @@ -757,6 +757,25 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts) if (ret) goto err; +#if IS_ENABLED(CONFIG_UNICODE) + /* Default encoding until we can potentially have more as an option. */ + c->cf_encoding = utf8_load(BCH_FS_DEFAULT_UTF8_ENCODING); + if (IS_ERR(c->cf_encoding)) { + printk(KERN_ERR "Cannot load UTF-8 encoding for filesystem. Version: %u.%u.%u", + unicode_major(BCH_FS_DEFAULT_UTF8_ENCODING), + unicode_minor(BCH_FS_DEFAULT_UTF8_ENCODING), + unicode_rev(BCH_FS_DEFAULT_UTF8_ENCODING)); + ret = -EINVAL; + goto err; + } +#else + if (c->sb.features & (1ULL << BCH_FEATURE_casefolding)) { + printk(KERN_ERR "Cannot mount a filesystem with casefolding on a kernel without CONFIG_UNICODE\n"); + ret = -EINVAL; + goto err; + } +#endif + pr_uuid(&name, c->sb.user_uuid.b); strscpy(c->name, name.buf, sizeof(c->name)); printbuf_exit(&name); -- 2.41.0 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [v4 3/3] bcachefs: Implement casefolding 2023-08-14 1:24 ` [v4 3/3] bcachefs: Implement casefolding Joshua Ashton @ 2023-08-14 16:34 ` Gabriel Krisman Bertazi 2023-08-14 21:18 ` Joshua Ashton 2023-08-15 12:36 ` Brian Foster 1 sibling, 1 reply; 15+ messages in thread From: Gabriel Krisman Bertazi @ 2023-08-14 16:34 UTC (permalink / raw) To: Joshua Ashton; +Cc: linux-bcachefs, André Almeida Joshua Ashton <joshua@froggi.es> writes: > This patch implements support for case-insensitive file name lookups > in bcachefs. > > The implementation uses the same UTF-8 lowering and normalization that > ext4 and f2fs is using. > > More information is provided in Documentation/bcachefs/casefolding.rst > > Signed-off-by: Joshua Ashton <joshua@froggi.es> Please, cc the entire thread as i don't think I even got the cover letter. Did you test this against fstests generic/556? it checks all sorts of common pitfalls and corner-cases in case-insensitive filesystems. > --- /dev/null > +++ b/Documentation/filesystems/bcachefs/casefolding.rst > @@ -0,0 +1,73 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +Casefolding > +=========== > + > +bcachefs has support for case-insensitive file and directory > +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`) > +casefolding attributes. > + > +The main usecase for casefolding is compatibility with software written > +against other filesystems that rely on casefolded lookups > +(eg. NTFS and Wine/Proton). > +Taking advantage of file-system level casefolding can lead to great > +loading time gains in many applications and games. > + > +Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled. > +Once a directory has been flagged for casefolding, a feature bit > +is enabled on the superblock which marks the filesystem as using > +casefolding. > +When the feature bit for casefolding is enabled, it is no longer possible > +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled. > + > +On the lookup/query side: casefolding is implemented by allocating a new > +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to > +casefold the query string. > + > +On the dirent side: casefolding is implemented by ensuring the `bkey`'s > +hash is made from the casefolded string and storing the cached casefolded > +name with the regular name in the dirent. > + > +The structure looks like this: > + > +Regular: [dirent data][regular name][nul][nul]... > +Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]... > + > +(Do note, the number of `NUL`s here is merely for illustration, they count can vary > + per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.) > + > +This is efficient as it means that for all file lookups that require casefolding, > +it has identical performance to a regular lookup: > +a hash comparison and a `memcmp` of the name. > + > +Rationale > +--------- > + > +Several designs were considered for this system: > +One was to introduce a dirent_v2, however that would be painful especially as > +the hash system only has support for a single key type. This would also need > +`BCH_NAME_MAX` to change between versions, and a new feature bit. > + > +Another option was to store without the two lengths, and just take the length of > +the regular name and casefolded name contiguously / 2 as the length. This would > +assume that the regular length == casefolded length, but that could potentially > +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than > +the lowercase unicode glyph. Why do you want to store the casefolded name? In ext4, we just store the case-exact name that was created, and do a code-point by code-point comparison with the name under lookup through utf8_strncasecmp, which is optimized to compare two not casefolded strings. We just modify the filename hash to be case-insensitive, so we know which dirent to look for during lookup. Finally, we use d_compare/d_hash to leverage the dcache into finding negative dentries. Sure, if you have both, dirent and dentry->d_name casefolded, then it is just a regular memcmp, but at least you don't have to change the disk format and your implementation becomes simpler. > +It would be possible to disregard the casefold cache for those cases, but it was > +decided to simply encode the two string lengths in the key to avoid random > +performance issues if this edgecase was ever hit. > + > +The option settled on was to use a free-bit in d_type to mark a dirent as having > +a casefold cache, and then treat the first 4 bytes the name block as lengths. > +You can see this in the `d_cf_name_block` member of union in `bch_dirent`. > + > +The feature bit was used to allow casefolding support to be enabled for the majority > +of users, but some allow users who have no need for the feature to still use bcachefs as > +`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used, > +which may be decider between using bcachefs for eg. embedded platforms. > + > +Other filesystems like ext4 and f2fs have a super-block level option for casefolding > +encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose > +any encodings than a single UTF-8 version. When future encodings are desirable, > +they will be added trivially using the opts mechanism. > + > diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h > index 30b3d7b9f9dc..0e1facdc7876 100644 > --- a/fs/bcachefs/bcachefs.h > +++ b/fs/bcachefs/bcachefs.h > @@ -202,6 +202,7 @@ > #include <linux/types.h> > #include <linux/workqueue.h> > #include <linux/zstd.h> > +#include <linux/unicode.h> > > #include "bcachefs_format.h" > #include "errcode.h" > @@ -657,6 +658,8 @@ enum bch_write_ref { > BCH_WRITE_REF_NR, > }; > > +#define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0) > + > struct bch_fs { > struct closure cl; > > @@ -723,6 +726,9 @@ struct bch_fs { > u64 compat; > } sb; > > +#if IS_ENABLED(CONFIG_UNICODE) > + struct unicode_map *cf_encoding; > +#endif This goes in struct super_block. > @@ -13,8 +13,39 @@ > > #include <linux/dcache.h> > > -static inline unsigned dirent_val_u64s(unsigned len) > +static inline int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info, > + const struct qstr *str, struct qstr *out_cf) > { > +#if IS_ENABLED(CONFIG_UNICODE) > + unsigned char *buf = NULL; > + int ret; > + > + *out_cf = (struct qstr) QSTR_INIT(NULL, 0); > + > + buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1); > + ret = PTR_ERR_OR_ZERO(buf); > + if (ret) > + return ret; > + > + ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1); > + if (ret <= 0) > + return ret; The error path leaks buf. > @@ -472,6 +600,7 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans, > const struct qstr *name, subvol_inum *inum, > unsigned flags) > { > + struct qstr lookup_name = *name; > struct bkey_s_c k; > struct bkey_s_c_dirent d; > u32 snapshot; > @@ -481,8 +610,14 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans, > if (ret) > return ret; > > + ret = hash_info->cf_encoding > + ? bch2_casefold(trans, hash_info, name, &lookup_name) > + : 0; > + if (ret) > + return ret; > + > ret = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc, > - hash_info, dir, name, flags); > + hash_info, dir, &lookup_name, flags); > if (ret) > return ret; (Skipping the bcachefs-specific bits that I'm not qualified to review) I don't get your model for the dcache. I don't see any handling anywhere in the patch. Also, I fetched the code and it seems bcachefs will just do d_splice_alias at the end of the lookup. So, when someone lookups FOO, ->lookup will be called and the dentry FOO will be created. Then, when FOo is looked up, the dcache won't find the previous dentry and ->lookup will be called again (perhaps creating another dentry?) There are two ways around it. Rely on the libfs' generic handlers for d_hash/d_compare and soon d_revalidate, or use d_add_ci to search for case-insensitive aliases at lookup time, but then you don't benefit much from the dcache if looking up for different cases. > @@ -54,6 +54,28 @@ static int bch2_inode_flags_set(struct btree_trans *trans, > (newflags & (BCH_INODE_NODUMP|BCH_INODE_NOATIME)) != newflags) > return -EINVAL; > > + if ((newflags ^ oldflags) & BCH_INODE_CASEFOLDED) { > +#if IS_ENABLED(CONFIG_UNICODE) > + int ret = 0; > + /* Not supported on individual files. */ > + if (!S_ISDIR(bi->bi_mode)) > + return -EOPNOTSUPP; > + > + /* > + * Make sure the dir is empty, as otherwise we'd need to > + * rehash everything and update the dirent keys. > + */ > + ret = bch2_empty_dir_trans(trans, inode_inum(inode)); > + if (ret < 0) > + return ret; > + > + bch2_check_set_feature(c, BCH_FEATURE_casefolding); fwiw, we use _CASEFOLD in other filesystems, not _casefolding. Thanks, -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [v4 3/3] bcachefs: Implement casefolding 2023-08-14 16:34 ` Gabriel Krisman Bertazi @ 2023-08-14 21:18 ` Joshua Ashton 2023-08-14 21:33 ` Joshua Ashton ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Joshua Ashton @ 2023-08-14 21:18 UTC (permalink / raw) To: Gabriel Krisman Bertazi; +Cc: linux-bcachefs, André Almeida On 8/14/23 17:34, Gabriel Krisman Bertazi wrote: > Joshua Ashton <joshua@froggi.es> writes: > >> This patch implements support for case-insensitive file name lookups >> in bcachefs. >> >> The implementation uses the same UTF-8 lowering and normalization that >> ext4 and f2fs is using. >> >> More information is provided in Documentation/bcachefs/casefolding.rst >> >> Signed-off-by: Joshua Ashton <joshua@froggi.es> > > Please, cc the entire thread as i don't think I even got the cover > letter. Thanks, I will for v5. > > Did you test this against fstests generic/556? it checks all sorts of > common pitfalls and corner-cases in case-insensitive filesystems. No, I wrote my own casefolding test that tests a bunch of edgecases. I was not aware there were any casefolding tests in fstests -- I really really wish they had better names. :/ Grepping the test right now I already know we will fail the `test_casefold_flag_removal` test as we don't support toggling the flag on non-empty directories as it would require rehashing and thus restructuring the btree. There's a lot of edgecases to handle there (such as hash collisions) which are pretty gnarly to solve when doing something like that. I diiiid try and implement something that did that, but it really did not end well... ^^;; So planning to defer this for now. > >> --- /dev/null >> +++ b/Documentation/filesystems/bcachefs/casefolding.rst >> @@ -0,0 +1,73 @@ >> +.. SPDX-License-Identifier: GPL-2.0 >> + >> +Casefolding >> +=========== >> + >> +bcachefs has support for case-insensitive file and directory >> +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`) >> +casefolding attributes. >> + >> +The main usecase for casefolding is compatibility with software written >> +against other filesystems that rely on casefolded lookups >> +(eg. NTFS and Wine/Proton). >> +Taking advantage of file-system level casefolding can lead to great >> +loading time gains in many applications and games. >> + >> +Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled. >> +Once a directory has been flagged for casefolding, a feature bit >> +is enabled on the superblock which marks the filesystem as using >> +casefolding. >> +When the feature bit for casefolding is enabled, it is no longer possible >> +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled. >> + >> +On the lookup/query side: casefolding is implemented by allocating a new >> +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to >> +casefold the query string. >> + >> +On the dirent side: casefolding is implemented by ensuring the `bkey`'s >> +hash is made from the casefolded string and storing the cached casefolded >> +name with the regular name in the dirent. >> + >> +The structure looks like this: >> + >> +Regular: [dirent data][regular name][nul][nul]... >> +Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]... >> + >> +(Do note, the number of `NUL`s here is merely for illustration, they count can vary >> + per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.) >> + >> +This is efficient as it means that for all file lookups that require casefolding, >> +it has identical performance to a regular lookup: >> +a hash comparison and a `memcmp` of the name. >> + >> +Rationale >> +--------- >> + >> +Several designs were considered for this system: >> +One was to introduce a dirent_v2, however that would be painful especially as >> +the hash system only has support for a single key type. This would also need >> +`BCH_NAME_MAX` to change between versions, and a new feature bit. >> + >> +Another option was to store without the two lengths, and just take the length of >> +the regular name and casefolded name contiguously / 2 as the length. This would >> +assume that the regular length == casefolded length, but that could potentially >> +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than >> +the lowercase unicode glyph. > > Why do you want to store the casefolded name? In ext4, we just store > the case-exact name that was created, and do a code-point by code-point > comparison with the name under lookup through utf8_strncasecmp, which is > optimized to compare two not casefolded strings. We just modify the > filename hash to be case-insensitive, so we know which dirent to look > for during lookup. Finally, we use d_compare/d_hash to leverage the > dcache into finding negative dentries. Yeah, am aware of how ext4 does it, I just wanted it to have basically the exact same performance characteristics as as regular lookup. The utf8 cursor stuff in utf8_strncasecmp is still super expensive compared to a straight memcmp in some local benching. > > Sure, if you have both, dirent and dentry->d_name casefolded, then it is > just a regular memcmp, but at least you don't have to change the disk > format and your implementation becomes simpler. > Given we have the opportunity to have support for this in the dirent format for bcachefs, I think it makes sense to take it right now. > > >> +It would be possible to disregard the casefold cache for those cases, but it was >> +decided to simply encode the two string lengths in the key to avoid random >> +performance issues if this edgecase was ever hit. >> + >> +The option settled on was to use a free-bit in d_type to mark a dirent as having >> +a casefold cache, and then treat the first 4 bytes the name block as lengths. >> +You can see this in the `d_cf_name_block` member of union in `bch_dirent`. >> + >> +The feature bit was used to allow casefolding support to be enabled for the majority >> +of users, but some allow users who have no need for the feature to still use bcachefs as >> +`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used, >> +which may be decider between using bcachefs for eg. embedded platforms. >> + >> +Other filesystems like ext4 and f2fs have a super-block level option for casefolding >> +encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose >> +any encodings than a single UTF-8 version. When future encodings are desirable, >> +they will be added trivially using the opts mechanism. >> + >> diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h >> index 30b3d7b9f9dc..0e1facdc7876 100644 >> --- a/fs/bcachefs/bcachefs.h >> +++ b/fs/bcachefs/bcachefs.h >> @@ -202,6 +202,7 @@ >> #include <linux/types.h> >> #include <linux/workqueue.h> >> #include <linux/zstd.h> >> +#include <linux/unicode.h> >> >> #include "bcachefs_format.h" >> #include "errcode.h" >> @@ -657,6 +658,8 @@ enum bch_write_ref { >> BCH_WRITE_REF_NR, >> }; >> >> +#define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0) >> + >> struct bch_fs { >> struct closure cl; >> >> @@ -723,6 +726,9 @@ struct bch_fs { >> u64 compat; >> } sb; >> >> +#if IS_ENABLED(CONFIG_UNICODE) >> + struct unicode_map *cf_encoding; >> +#endif > > This goes in struct super_block. The code in libfs.c that uses the s_encoding in the super_block is never used for bcachefs as it defines its own inode_operations and doesn't use any of the libfs common code for lookups/hashes there (ie. the dentry_operations). One thing we wanted to do down the line was to use the bcachefs options system to be able to pick the encoding on a per-inode basis. Currently I am storing it in the struct bch_fs as currently there is no need for other encodings right now and it is convenient, so there can only be one -- but that could change at some point with an opt attached to the inode. I am happy to bung it in the vfs super_block for now though -- but I need to look at how that all works out with mounting and the multiple drives situation. Just be aware that at some point, it will likely be moved back again to our code. > > >> @@ -13,8 +13,39 @@ >> >> #include <linux/dcache.h> >> >> -static inline unsigned dirent_val_u64s(unsigned len) >> +static inline int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info, >> + const struct qstr *str, struct qstr *out_cf) >> { >> +#if IS_ENABLED(CONFIG_UNICODE) >> + unsigned char *buf = NULL; >> + int ret; >> + >> + *out_cf = (struct qstr) QSTR_INIT(NULL, 0); >> + >> + buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1); >> + ret = PTR_ERR_OR_ZERO(buf); >> + if (ret) >> + return ret; >> + >> + ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1); >> + if (ret <= 0) >> + return ret; > > The error path leaks buf. trans_kmalloc does not need to be freed. > >> @@ -472,6 +600,7 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans, >> const struct qstr *name, subvol_inum *inum, >> unsigned flags) >> { >> + struct qstr lookup_name = *name; >> struct bkey_s_c k; >> struct bkey_s_c_dirent d; >> u32 snapshot; >> @@ -481,8 +610,14 @@ int __bch2_dirent_lookup_trans(struct btree_trans *trans, >> if (ret) >> return ret; >> >> + ret = hash_info->cf_encoding >> + ? bch2_casefold(trans, hash_info, name, &lookup_name) >> + : 0; >> + if (ret) >> + return ret; >> + >> ret = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc, >> - hash_info, dir, name, flags); >> + hash_info, dir, &lookup_name, flags); >> if (ret) >> return ret; > > > (Skipping the bcachefs-specific bits that I'm not qualified to review) > > I don't get your model for the dcache. I don't see any handling > anywhere in the patch. Also, I fetched the code and it seems bcachefs > will just do d_splice_alias at the end of the lookup. So, when someone > lookups FOO, ->lookup will be called and the dentry FOO will be created. > Then, when FOo is looked up, the dcache won't find the previous dentry > and ->lookup will be called again (perhaps creating another dentry?) > > There are two ways around it. Rely on the libfs' generic handlers for > d_hash/d_compare and soon d_revalidate, or use d_add_ci to search for > case-insensitive aliases at lookup time, but then you don't benefit much > from the dcache if looking up for different cases. Thanks for the heads-up on this. It seems like I may have potentially hit a foot-gun there if we are making vfs dentry's for each lookup. I will investigate and validate that we are doing the right thing here for the next revision. I will not lie, I know veeery little about the vfs or dcache side, so huge thanks for pointing this out to me. :) I had a quick look at what f2fs does here, and it seems like it also just does d_splice_alias, but it has the following caveat. [f2fs code] #if IS_ENABLED(CONFIG_UNICODE) if (!inode && IS_CASEFOLDED(dir)) { /* Eventually we want to call d_add_ci(dentry, NULL) * for negative dentries in the encoding case as * well. For now, prevent the negative dentry * from being cached. */ trace_f2fs_lookup_end(dir, dentry, ino, err); return NULL; } #endif new = d_splice_alias(inode, dentry); I won't even pretend that I understand what a negative dentry is right now... :D > > >> @@ -54,6 +54,28 @@ static int bch2_inode_flags_set(struct btree_trans *trans, >> (newflags & (BCH_INODE_NODUMP|BCH_INODE_NOATIME)) != newflags) >> return -EINVAL; >> >> + if ((newflags ^ oldflags) & BCH_INODE_CASEFOLDED) { >> +#if IS_ENABLED(CONFIG_UNICODE) >> + int ret = 0; >> + /* Not supported on individual files. */ >> + if (!S_ISDIR(bi->bi_mode)) >> + return -EOPNOTSUPP; >> + >> + /* >> + * Make sure the dir is empty, as otherwise we'd need to >> + * rehash everything and update the dirent keys. >> + */ >> + ret = bch2_empty_dir_trans(trans, inode_inum(inode)); >> + if (ret < 0) >> + return ret; >> + >> + bch2_check_set_feature(c, BCH_FEATURE_casefolding); > > fwiw, we use _CASEFOLD in other filesystems, not _casefolding. bcachefs features are always lowercase, but I can change the name of it to _casefold. - Joshie 🐸✨ > > Thanks, > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [v4 3/3] bcachefs: Implement casefolding 2023-08-14 21:18 ` Joshua Ashton @ 2023-08-14 21:33 ` Joshua Ashton 2023-08-14 23:01 ` Gabriel Krisman Bertazi 2023-08-15 0:58 ` Kent Overstreet 2 siblings, 0 replies; 15+ messages in thread From: Joshua Ashton @ 2023-08-14 21:33 UTC (permalink / raw) To: Gabriel Krisman Bertazi; +Cc: linux-bcachefs, André Almeida On 8/14/23 22:18, Joshua Ashton wrote: > > > On 8/14/23 17:34, Gabriel Krisman Bertazi wrote: >> Joshua Ashton <joshua@froggi.es> writes: >> >>> This patch implements support for case-insensitive file name lookups >>> in bcachefs. >>> >>> The implementation uses the same UTF-8 lowering and normalization that >>> ext4 and f2fs is using. >>> >>> More information is provided in Documentation/bcachefs/casefolding.rst >>> >>> Signed-off-by: Joshua Ashton <joshua@froggi.es> >> >> Please, cc the entire thread as i don't think I even got the cover >> letter. > > Thanks, I will for v5. > >> >> Did you test this against fstests generic/556? it checks all sorts of >> common pitfalls and corner-cases in case-insensitive filesystems. > > No, I wrote my own casefolding test that tests a bunch of edgecases. > > I was not aware there were any casefolding tests in fstests -- I really > really wish they had better names. :/ > > Grepping the test right now I already know we will fail the > `test_casefold_flag_removal` test as we don't support toggling the flag > on non-empty directories as it would require rehashing and thus > restructuring the btree. > > There's a lot of edgecases to handle there (such as hash collisions) > which are pretty gnarly to solve when doing something like that. > > I diiiid try and implement something that did that, but it really did > not end well... ^^;; So planning to defer this for now. > >> >>> --- /dev/null >>> +++ b/Documentation/filesystems/bcachefs/casefolding.rst >>> @@ -0,0 +1,73 @@ >>> +.. SPDX-License-Identifier: GPL-2.0 >>> + >>> +Casefolding >>> +=========== >>> + >>> +bcachefs has support for case-insensitive file and directory >>> +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`) >>> +casefolding attributes. >>> + >>> +The main usecase for casefolding is compatibility with software written >>> +against other filesystems that rely on casefolded lookups >>> +(eg. NTFS and Wine/Proton). >>> +Taking advantage of file-system level casefolding can lead to great >>> +loading time gains in many applications and games. >>> + >>> +Casefolding support requires a kernel with the `CONFIG_UNICODE` >>> enabled. >>> +Once a directory has been flagged for casefolding, a feature bit >>> +is enabled on the superblock which marks the filesystem as using >>> +casefolding. >>> +When the feature bit for casefolding is enabled, it is no longer >>> possible >>> +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled. >>> + >>> +On the lookup/query side: casefolding is implemented by allocating a >>> new >>> +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to >>> +casefold the query string. >>> + >>> +On the dirent side: casefolding is implemented by ensuring the `bkey`'s >>> +hash is made from the casefolded string and storing the cached >>> casefolded >>> +name with the regular name in the dirent. >>> + >>> +The structure looks like this: >>> + >>> +Regular: [dirent data][regular name][nul][nul]... >>> +Casefolded: [dirent data][reg len][cf len][regular name][casefolded >>> name][nul][nul]... >>> + >>> +(Do note, the number of `NUL`s here is merely for illustration, they >>> count can vary >>> + per-key, and they may not even be present if the key is aligned to >>> `sizeof(u64)`.) >>> + >>> +This is efficient as it means that for all file lookups that require >>> casefolding, >>> +it has identical performance to a regular lookup: >>> +a hash comparison and a `memcmp` of the name. >>> + >>> +Rationale >>> +--------- >>> + >>> +Several designs were considered for this system: >>> +One was to introduce a dirent_v2, however that would be painful >>> especially as >>> +the hash system only has support for a single key type. This would >>> also need >>> +`BCH_NAME_MAX` to change between versions, and a new feature bit. >>> + >>> +Another option was to store without the two lengths, and just take >>> the length of >>> +the regular name and casefolded name contiguously / 2 as the length. >>> This would >>> +assume that the regular length == casefolded length, but that could >>> potentially >>> +not be true, if the uppercase unicode glyph had a different UTF-8 >>> encoding than >>> +the lowercase unicode glyph. >> >> Why do you want to store the casefolded name? In ext4, we just store >> the case-exact name that was created, and do a code-point by code-point >> comparison with the name under lookup through utf8_strncasecmp, which is >> optimized to compare two not casefolded strings. We just modify the >> filename hash to be case-insensitive, so we know which dirent to look >> for during lookup. Finally, we use d_compare/d_hash to leverage the >> dcache into finding negative dentries. > > Yeah, am aware of how ext4 does it, I just wanted it to have basically > the exact same performance characteristics as as regular lookup. > > The utf8 cursor stuff in utf8_strncasecmp is still super expensive > compared to a straight memcmp in some local benching. > >> >> Sure, if you have both, dirent and dentry->d_name casefolded, then it is >> just a regular memcmp, but at least you don't have to change the disk >> format and your implementation becomes simpler. >> > > Given we have the opportunity to have support for this in the dirent > format for bcachefs, I think it makes sense to take it right now. > >> >> >>> +It would be possible to disregard the casefold cache for those >>> cases, but it was >>> +decided to simply encode the two string lengths in the key to avoid >>> random >>> +performance issues if this edgecase was ever hit. >>> + >>> +The option settled on was to use a free-bit in d_type to mark a >>> dirent as having >>> +a casefold cache, and then treat the first 4 bytes the name block as >>> lengths. >>> +You can see this in the `d_cf_name_block` member of union in >>> `bch_dirent`. >>> + >>> +The feature bit was used to allow casefolding support to be enabled >>> for the majority >>> +of users, but some allow users who have no need for the feature to >>> still use bcachefs as >>> +`CONFIG_UNICODE` can increase the kernel side a significant amount >>> due to the tables used, >>> +which may be decider between using bcachefs for eg. embedded platforms. >>> + >>> +Other filesystems like ext4 and f2fs have a super-block level option >>> for casefolding >>> +encoding, but bcachefs currently does not provide this. ext4 and >>> f2fs do not expose >>> +any encodings than a single UTF-8 version. When future encodings are >>> desirable, >>> +they will be added trivially using the opts mechanism. >>> + >>> diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h >>> index 30b3d7b9f9dc..0e1facdc7876 100644 >>> --- a/fs/bcachefs/bcachefs.h >>> +++ b/fs/bcachefs/bcachefs.h >>> @@ -202,6 +202,7 @@ >>> #include <linux/types.h> >>> #include <linux/workqueue.h> >>> #include <linux/zstd.h> >>> +#include <linux/unicode.h> >>> #include "bcachefs_format.h" >>> #include "errcode.h" >>> @@ -657,6 +658,8 @@ enum bch_write_ref { >>> BCH_WRITE_REF_NR, >>> }; >>> +#define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0) >>> + >>> struct bch_fs { >>> struct closure cl; >>> @@ -723,6 +726,9 @@ struct bch_fs { >>> u64 compat; >>> } sb; >>> +#if IS_ENABLED(CONFIG_UNICODE) >>> + struct unicode_map *cf_encoding; >>> +#endif >> >> This goes in struct super_block. > > The code in libfs.c that uses the s_encoding in the super_block is never > used for bcachefs as it defines its own inode_operations and doesn't use > any of the libfs common code for lookups/hashes there (ie. the > dentry_operations). > > One thing we wanted to do down the line was to use the bcachefs options > system to be able to pick the encoding on a per-inode basis. > > Currently I am storing it in the struct bch_fs as currently there is no > need for other encodings right now and it is convenient, so there can > only be one -- but that could change at some point with an opt attached > to the inode. > > I am happy to bung it in the vfs super_block for now though -- but I > need to look at how that all works out with mounting and the multiple > drives situation. Just be aware that at some point, it will likely be > moved back again to our code. > >> >> >>> @@ -13,8 +13,39 @@ >>> #include <linux/dcache.h> >>> -static inline unsigned dirent_val_u64s(unsigned len) >>> +static inline int bch2_casefold(struct btree_trans *trans, const >>> struct bch_hash_info *info, >>> + const struct qstr *str, struct qstr *out_cf) >>> { >>> +#if IS_ENABLED(CONFIG_UNICODE) >>> + unsigned char *buf = NULL; >>> + int ret; >>> + >>> + *out_cf = (struct qstr) QSTR_INIT(NULL, 0); >>> + >>> + buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1); >>> + ret = PTR_ERR_OR_ZERO(buf); >>> + if (ret) >>> + return ret; >>> + >>> + ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1); >>> + if (ret <= 0) >>> + return ret; >> >> The error path leaks buf. > > trans_kmalloc does not need to be freed. > >> >>> @@ -472,6 +600,7 @@ int __bch2_dirent_lookup_trans(struct btree_trans >>> *trans, >>> const struct qstr *name, subvol_inum *inum, >>> unsigned flags) >>> { >>> + struct qstr lookup_name = *name; >>> struct bkey_s_c k; >>> struct bkey_s_c_dirent d; >>> u32 snapshot; >>> @@ -481,8 +610,14 @@ int __bch2_dirent_lookup_trans(struct >>> btree_trans *trans, >>> if (ret) >>> return ret; >>> + ret = hash_info->cf_encoding >>> + ? bch2_casefold(trans, hash_info, name, &lookup_name) >>> + : 0; >>> + if (ret) >>> + return ret; >>> + >>> ret = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc, >>> - hash_info, dir, name, flags); >>> + hash_info, dir, &lookup_name, flags); >>> if (ret) >>> return ret; >> >> >> (Skipping the bcachefs-specific bits that I'm not qualified to review) >> >> I don't get your model for the dcache. I don't see any handling >> anywhere in the patch. Also, I fetched the code and it seems bcachefs >> will just do d_splice_alias at the end of the lookup. So, when someone >> lookups FOO, ->lookup will be called and the dentry FOO will be created. >> Then, when FOo is looked up, the dcache won't find the previous dentry >> and ->lookup will be called again (perhaps creating another dentry?) >> >> There are two ways around it. Rely on the libfs' generic handlers for >> d_hash/d_compare and soon d_revalidate, or use d_add_ci to search for >> case-insensitive aliases at lookup time, but then you don't benefit much >> from the dcache if looking up for different cases. > > Thanks for the heads-up on this. > > It seems like I may have potentially hit a foot-gun there if we are > making vfs dentry's for each lookup. > > I will investigate and validate that we are doing the right thing here > for the next revision. > > I will not lie, I know veeery little about the vfs or dcache side, so > huge thanks for pointing this out to me. :) > > I had a quick look at what f2fs does here, and it seems like it also > just does d_splice_alias, but it has the following caveat. > > [f2fs code] > #if IS_ENABLED(CONFIG_UNICODE) > if (!inode && IS_CASEFOLDED(dir)) { > /* Eventually we want to call d_add_ci(dentry, NULL) > * for negative dentries in the encoding case as > * well. For now, prevent the negative dentry > * from being cached. > */ > trace_f2fs_lookup_end(dir, dentry, ino, err); > return NULL; > } > #endif > new = d_splice_alias(inode, dentry); > > I won't even pretend that I understand what a negative dentry is right > now... :D I am guessing we probably want something like: bch2_dirent_lookup -> returns feedback to a `struct qstr real_name` with the actual name if it was not identical to the dentry lookup name. if (real_name.len) { d_add_ci(dentry, vinode, &real_name); return dentry; } else { return d_splice_alias(vinode, dentry); } Is my line of thinking correct here? - Joshie 🐸✨ > >> >> >>> @@ -54,6 +54,28 @@ static int bch2_inode_flags_set(struct btree_trans >>> *trans, >>> (newflags & (BCH_INODE_NODUMP|BCH_INODE_NOATIME)) != newflags) >>> return -EINVAL; >>> + if ((newflags ^ oldflags) & BCH_INODE_CASEFOLDED) { >>> +#if IS_ENABLED(CONFIG_UNICODE) >>> + int ret = 0; >>> + /* Not supported on individual files. */ >>> + if (!S_ISDIR(bi->bi_mode)) >>> + return -EOPNOTSUPP; >>> + >>> + /* >>> + * Make sure the dir is empty, as otherwise we'd need to >>> + * rehash everything and update the dirent keys. >>> + */ >>> + ret = bch2_empty_dir_trans(trans, inode_inum(inode)); >>> + if (ret < 0) >>> + return ret; >>> + >>> + bch2_check_set_feature(c, BCH_FEATURE_casefolding); >> >> fwiw, we use _CASEFOLD in other filesystems, not _casefolding. > > bcachefs features are always lowercase, but I can change the name of it > to _casefold. > > - Joshie 🐸✨ > >> >> Thanks, >> > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [v4 3/3] bcachefs: Implement casefolding 2023-08-14 21:18 ` Joshua Ashton 2023-08-14 21:33 ` Joshua Ashton @ 2023-08-14 23:01 ` Gabriel Krisman Bertazi 2023-08-15 0:58 ` Kent Overstreet 2 siblings, 0 replies; 15+ messages in thread From: Gabriel Krisman Bertazi @ 2023-08-14 23:01 UTC (permalink / raw) To: Joshua Ashton; +Cc: linux-bcachefs, André Almeida Joshua Ashton <joshua@froggi.es> writes: > On 8/14/23 17:34, Gabriel Krisman Bertazi wrote: >>> + >>> +Another option was to store without the two lengths, and just take the length of >>> +the regular name and casefolded name contiguously / 2 as the length. This would >>> +assume that the regular length == casefolded length, but that could potentially >>> +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than >>> +the lowercase unicode glyph. >> Why do you want to store the casefolded name? In ext4, we just store >> the case-exact name that was created, and do a code-point by code-point >> comparison with the name under lookup through utf8_strncasecmp, which is >> optimized to compare two not casefolded strings. We just modify the >> filename hash to be case-insensitive, so we know which dirent to look >> for during lookup. Finally, we use d_compare/d_hash to leverage the >> dcache into finding negative dentries. > > Yeah, am aware of how ext4 does it, I just wanted it to have basically > the exact same performance characteristics as as regular lookup. > > The utf8 cursor stuff in utf8_strncasecmp is still super expensive > compared to a straight memcmp in some local benching. More expensive than strncasecmp, for sure. But I'm surprised it is visible in benchmarks. Of course, unless you are searching different files at every lookup, once it makes proper use of the dcache it won't matter much. There is also utf8_strncasecmp_folded that avoids half of the trie searches by caching the name-under-lookup. > I am happy to bung it in the vfs super_block for now though -- but I > need to look at how that all works out with mounting and the multiple > drives situation. Just be aware that at some point, it will likely be > moved back again to our code. Different drives as in different filesystem volumes? They each should have one superblock anyway. Also, someone called my attention to your v1 commit message in v1 regarding selecting encoding per-directory. I'm not sure it is a good idea, since it is definitely problematic once you start moving files around different subtrees. Unicode casefolding is only stable for defined code points (of course), but newer versions might add new code points that previously didn't have any casefolding. So, if you have a name that has an invalid casepoint in an older version (but is legal utf8) and moves it to a new directory with an encoding version where that folds into something else, you have a situation where one file occludes the other and all sorts of unpredictable behavior. It can happen while copying across disks, of course. But we already reject many cases of cross-filesystem copying. But failing to move a file within the same filesystem is not very user-friendly. I think the reason you want to support multiple encoding versions is to enable applications built for different Windows ntfs versions. If that is the case, you should be safe using the latest unicode disk-wide, since it is backward compatible. Take a look also at ext4 and f2fs' "strict mode", that rejects invalid unicode sequences too. >> and ->lookup will be called again (perhaps creating another dentry?) >> There are two ways around it. Rely on the libfs' generic handlers for >> d_hash/d_compare and soon d_revalidate, or use d_add_ci to search for >> case-insensitive aliases at lookup time, but then you don't benefit much >> from the dcache if looking up for different cases. > > Thanks for the heads-up on this. > > It seems like I may have potentially hit a foot-gun there if we are > making vfs dentry's for each lookup. > > I will investigate and validate that we are doing the right thing here > for the next revision. > > I will not lie, I know veeery little about the vfs or dcache side, so > huge thanks for pointing this out to me. :) > > I had a quick look at what f2fs does here, and it seems like it also > just does d_splice_alias, but it has the following caveat. > > [f2fs code] > #if IS_ENABLED(CONFIG_UNICODE) > if (!inode && IS_CASEFOLDED(dir)) { > /* Eventually we want to call d_add_ci(dentry, NULL) > * for negative dentries in the encoding case as > * well. For now, prevent the negative dentry > * from being cached. > */ > trace_f2fs_lookup_end(dir, dentry, ino, err); > return NULL; > } > #endif > new = d_splice_alias(inode, dentry); > > I won't even pretend that I understand what a negative dentry is right > now... :D A negative dentry is simply a cache entry saying that a file doesn't exist in the filesystem, so further lookups don't need to hit the disk over and over. The f2fs caveat you mentioned is entirely avoiding the creation of negative dentries in IS_CASEFOLDED directories, since they aren't supported currently. I explained a bit the reasoning for that in the following series, that also remove this code. It is (hopefully) going upstream soon: https://www.spinics.net/lists/linux-ext4/msg90964.html My suggestion is that you follow f2fs/ext4 closely. Avoid d_add_ci, as it was written for filesystems that deal with case-insensitive directories in a slight different way, and don't make use good use of the dcache, in my opinion. A better approach would be to use d_compare/d_hash and keep calling d_splice_alias. Take a look at the generic handlers in fs/libfs.c too. -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [v4 3/3] bcachefs: Implement casefolding 2023-08-14 21:18 ` Joshua Ashton 2023-08-14 21:33 ` Joshua Ashton 2023-08-14 23:01 ` Gabriel Krisman Bertazi @ 2023-08-15 0:58 ` Kent Overstreet 2023-08-15 18:00 ` Gabriel Krisman Bertazi 2 siblings, 1 reply; 15+ messages in thread From: Kent Overstreet @ 2023-08-15 0:58 UTC (permalink / raw) To: Joshua Ashton; +Cc: Gabriel Krisman Bertazi, linux-bcachefs, André Almeida On Mon, Aug 14, 2023 at 10:18:33PM +0100, Joshua Ashton wrote: > > > On 8/14/23 17:34, Gabriel Krisman Bertazi wrote: > > Joshua Ashton <joshua@froggi.es> writes: > > > > > This patch implements support for case-insensitive file name lookups > > > in bcachefs. > > > > > > The implementation uses the same UTF-8 lowering and normalization that > > > ext4 and f2fs is using. > > > > > > More information is provided in Documentation/bcachefs/casefolding.rst > > > > > > Signed-off-by: Joshua Ashton <joshua@froggi.es> > > > > Please, cc the entire thread as i don't think I even got the cover > > letter. > > Thanks, I will for v5. > > > > > Did you test this against fstests generic/556? it checks all sorts of > > common pitfalls and corner-cases in case-insensitive filesystems. > > No, I wrote my own casefolding test that tests a bunch of edgecases. > > I was not aware there were any casefolding tests in fstests -- I really > really wish they had better names. :/ Seconded... > Grepping the test right now I already know we will fail the > `test_casefold_flag_removal` test as we don't support toggling the flag on > non-empty directories as it would require rehashing and thus restructuring > the btree. What was the motivation for supporting that on other filesystems? I'm not inclined to add that to bcachefs without a good reason. > > There's a lot of edgecases to handle there (such as hash collisions) which > are pretty gnarly to solve when doing something like that. > > I diiiid try and implement something that did that, but it really did not > end well... ^^;; So planning to defer this for now. What hash collisions are you referring to? The str_hash code we use handles hash collisions. > > Why do you want to store the casefolded name? In ext4, we just store > > the case-exact name that was created, and do a code-point by code-point > > comparison with the name under lookup through utf8_strncasecmp, which is > > optimized to compare two not casefolded strings. We just modify the > > filename hash to be case-insensitive, so we know which dirent to look > > for during lookup. Finally, we use d_compare/d_hash to leverage the > > dcache into finding negative dentries. > > Yeah, am aware of how ext4 does it, I just wanted it to have basically the > exact same performance characteristics as as regular lookup. > > The utf8 cursor stuff in utf8_strncasecmp is still super expensive compared > to a straight memcmp in some local benching. > > > > > Sure, if you have both, dirent and dentry->d_name casefolded, then it is > > just a regular memcmp, but at least you don't have to change the disk > > format and your implementation becomes simpler. > > > > Given we have the opportunity to have support for this in the dirent format > for bcachefs, I think it makes sense to take it right now. It also means that we have the option of defining a casefolding map that handles ß correctly. > > > +#if IS_ENABLED(CONFIG_UNICODE) > > > + struct unicode_map *cf_encoding; > > > +#endif > > > > This goes in struct super_block. > > The code in libfs.c that uses the s_encoding in the super_block is never > used for bcachefs as it defines its own inode_operations and doesn't use any > of the libfs common code for lookups/hashes there (ie. the > dentry_operations). > > One thing we wanted to do down the line was to use the bcachefs options > system to be able to pick the encoding on a per-inode basis. > > Currently I am storing it in the struct bch_fs as currently there is no need > for other encodings right now and it is convenient, so there can only be one > -- but that could change at some point with an opt attached to the inode. > > I am happy to bung it in the vfs super_block for now though -- but I need to > look at how that all works out with mounting and the multiple drives > situation. Just be aware that at some point, it will likely be moved back > again to our code. We also don't always have a super_block available when we have a running bch_fs object (e.g. when running in userspace, for fsck and various other tools). Since s_encoding is used in dirent.c, it probably needs to be in bch_fs. > > (Skipping the bcachefs-specific bits that I'm not qualified to review) > > > > I don't get your model for the dcache. I don't see any handling > > anywhere in the patch. Also, I fetched the code and it seems bcachefs > > will just do d_splice_alias at the end of the lookup. So, when someone > > lookups FOO, ->lookup will be called and the dentry FOO will be created. > > Then, when FOo is looked up, the dcache won't find the previous dentry > > and ->lookup will be called again (perhaps creating another dentry?) > > > > There are two ways around it. Rely on the libfs' generic handlers for > > d_hash/d_compare and soon d_revalidate, or use d_add_ci to search for > > case-insensitive aliases at lookup time, but then you don't benefit much > > from the dcache if looking up for different cases. > > Thanks for the heads-up on this. > > It seems like I may have potentially hit a foot-gun there if we are making > vfs dentry's for each lookup. > > I will investigate and validate that we are doing the right thing here for > the next revision. > > I will not lie, I know veeery little about the vfs or dcache side, so huge > thanks for pointing this out to me. :) > > I had a quick look at what f2fs does here, and it seems like it also just > does d_splice_alias, but it has the following caveat. I haven't dug into the dcache stuff yet either (thanks Gabriel for pointing this out) - but, we'll want to create a dentry not with the passed in name, but with the name from bch_dirent. > [f2fs code] > #if IS_ENABLED(CONFIG_UNICODE) > if (!inode && IS_CASEFOLDED(dir)) { > /* Eventually we want to call d_add_ci(dentry, NULL) > * for negative dentries in the encoding case as > * well. For now, prevent the negative dentry > * from being cached. > */ > trace_f2fs_lookup_end(dir, dentry, ino, err); > return NULL; > } > #endif > new = d_splice_alias(inode, dentry); > > I won't even pretend that I understand what a negative dentry is right > now... :D A negative dentry is used to indicate a dirent that doesn't exist in dcache - it means lookups can be cached for dentries that don't exist, instead of having to call into the filesystem's .lookup() every time. Gabriel's the one who implemented them for ext4/f2fs, see: https://lwn.net/Articles/929986/ ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [v4 3/3] bcachefs: Implement casefolding 2023-08-15 0:58 ` Kent Overstreet @ 2023-08-15 18:00 ` Gabriel Krisman Bertazi 2023-08-23 9:37 ` Joshua Ashton 0 siblings, 1 reply; 15+ messages in thread From: Gabriel Krisman Bertazi @ 2023-08-15 18:00 UTC (permalink / raw) To: Kent Overstreet; +Cc: Joshua Ashton, linux-bcachefs, André Almeida Kent Overstreet <kent.overstreet@linux.dev> writes: > On Mon, Aug 14, 2023 at 10:18:33PM +0100, Joshua Ashton wrote: > >> Grepping the test right now I already know we will fail the >> `test_casefold_flag_removal` test as we don't support toggling the flag on >> non-empty directories as it would require rehashing and thus restructuring >> the btree. > > What was the motivation for supporting that on other filesystems? I'm > not inclined to add that to bcachefs without a good reason. I don't think there was a better reason than "it is user-friendly to allow disabling the feature without having to remove the directory". Note that it is trivial to do so in empty directories for ext4/f2fs: it is just a matter of flipping a bit in the inode to enable/disable casefolding. I think it's ok for the Windows games to not allow disabling case-insensitive once it is set for a directory. For fstests, I suppose the easiest solution would be to skip that subtest when running on bcachefs. >> > Sure, if you have both, dirent and dentry->d_name casefolded, then it is >> > just a regular memcmp, but at least you don't have to change the disk >> > format and your implementation becomes simpler. >> > >> >> Given we have the opportunity to have support for this in the dirent format >> for bcachefs, I think it makes sense to take it right now. > > It also means that we have the option of defining a casefolding map that > handles ß correctly. What do you mean by handling ß correctly? do you intend to support NFKD normalization? We can extend fs/unicode/ for that fairly easily, but iirc, other filesystems decided to stick with NFD, because NFKD is a bit too eager when folding some characters. >> > > +#if IS_ENABLED(CONFIG_UNICODE) >> > > + struct unicode_map *cf_encoding; >> > > +#endif >> > >> > This goes in struct super_block. >> >> The code in libfs.c that uses the s_encoding in the super_block is never >> used for bcachefs as it defines its own inode_operations and doesn't use any >> of the libfs common code for lookups/hashes there (ie. the >> dentry_operations). >> >> One thing we wanted to do down the line was to use the bcachefs options >> system to be able to pick the encoding on a per-inode basis. >> >> Currently I am storing it in the struct bch_fs as currently there is no need >> for other encodings right now and it is convenient, so there can only be one >> -- but that could change at some point with an opt attached to the inode. >> >> I am happy to bung it in the vfs super_block for now though -- but I need to >> look at how that all works out with mounting and the multiple drives >> situation. Just be aware that at some point, it will likely be moved back >> again to our code. > > We also don't always have a super_block available when we have a running > bch_fs object (e.g. when running in userspace, for fsck and various > other tools). Since s_encoding is used in dirent.c, it probably needs to > be in bch_fs. Ah, I didn't know this is reused for userspace tools. Still, as long as you initialize sb->s_encoding at mount time, libfs helpers will work fine. -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [v4 3/3] bcachefs: Implement casefolding 2023-08-15 18:00 ` Gabriel Krisman Bertazi @ 2023-08-23 9:37 ` Joshua Ashton 0 siblings, 0 replies; 15+ messages in thread From: Joshua Ashton @ 2023-08-23 9:37 UTC (permalink / raw) To: Gabriel Krisman Bertazi, Kent Overstreet Cc: linux-bcachefs, André Almeida On 8/15/23 19:00, Gabriel Krisman Bertazi wrote: > Kent Overstreet <kent.overstreet@linux.dev> writes: > >> On Mon, Aug 14, 2023 at 10:18:33PM +0100, Joshua Ashton wrote: >> >>> Grepping the test right now I already know we will fail the >>> `test_casefold_flag_removal` test as we don't support toggling the flag on >>> non-empty directories as it would require rehashing and thus restructuring >>> the btree. >> >> What was the motivation for supporting that on other filesystems? I'm >> not inclined to add that to bcachefs without a good reason. > > I don't think there was a better reason than "it is user-friendly to > allow disabling the feature without having to remove the > directory". Note that it is trivial to do so in empty directories for > ext4/f2fs: it is just a matter of flipping a bit in the inode to > enable/disable casefolding. > > I think it's ok for the Windows games to not allow disabling > case-insensitive once it is set for a directory. For fstests, I suppose > the easiest solution would be to skip that subtest when running on > bcachefs. > Makes sense. And yeah, we would never unset it in Steam/Proton. >>>> Sure, if you have both, dirent and dentry->d_name casefolded, then it is >>>> just a regular memcmp, but at least you don't have to change the disk >>>> format and your implementation becomes simpler. >>>> >>> >>> Given we have the opportunity to have support for this in the dirent format >>> for bcachefs, I think it makes sense to take it right now. >> >> It also means that we have the option of defining a casefolding map that >> handles ß correctly. > > What do you mean by handling ß correctly? do you intend to support NFKD > normalization? We can extend fs/unicode/ for that fairly easily, > but iirc, other filesystems decided to stick with NFD, because NFKD is > a bit too eager when folding some characters. Yes, I believe we wanted to add support for other normalizations as bcachefs options. Or at the very least, ensure we structure what we do around being able to support that in the future. > >>>>> +#if IS_ENABLED(CONFIG_UNICODE) >>>>> + struct unicode_map *cf_encoding; >>>>> +#endif >>>> >>>> This goes in struct super_block. >>> >>> The code in libfs.c that uses the s_encoding in the super_block is never >>> used for bcachefs as it defines its own inode_operations and doesn't use any >>> of the libfs common code for lookups/hashes there (ie. the >>> dentry_operations). >>> >>> One thing we wanted to do down the line was to use the bcachefs options >>> system to be able to pick the encoding on a per-inode basis. >>> >>> Currently I am storing it in the struct bch_fs as currently there is no need >>> for other encodings right now and it is convenient, so there can only be one >>> -- but that could change at some point with an opt attached to the inode. >>> >>> I am happy to bung it in the vfs super_block for now though -- but I need to >>> look at how that all works out with mounting and the multiple drives >>> situation. Just be aware that at some point, it will likely be moved back >>> again to our code. >> >> We also don't always have a super_block available when we have a running >> bch_fs object (e.g. when running in userspace, for fsck and various >> other tools). Since s_encoding is used in dirent.c, it probably needs to >> be in bch_fs. > > Ah, I didn't know this is reused for userspace tools. Still, as long as > you initialize sb->s_encoding at mount time, libfs helpers will work fine. > Cool. - Joshie 🐸✨ ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [v4 3/3] bcachefs: Implement casefolding 2023-08-14 1:24 ` [v4 3/3] bcachefs: Implement casefolding Joshua Ashton 2023-08-14 16:34 ` Gabriel Krisman Bertazi @ 2023-08-15 12:36 ` Brian Foster 2023-08-24 5:43 ` Joshua Ashton 1 sibling, 1 reply; 15+ messages in thread From: Brian Foster @ 2023-08-15 12:36 UTC (permalink / raw) To: Joshua Ashton; +Cc: linux-bcachefs, André Almeida, Gabriel Krisman Bertazi On Mon, Aug 14, 2023 at 02:24:27AM +0100, Joshua Ashton wrote: > This patch implements support for case-insensitive file name lookups > in bcachefs. > > The implementation uses the same UTF-8 lowering and normalization that > ext4 and f2fs is using. > > More information is provided in Documentation/bcachefs/casefolding.rst > > Signed-off-by: Joshua Ashton <joshua@froggi.es> > > Cc: André Almeida <andrealmeid@igalia.com> > Cc: Gabriel Krisman Bertazi <krisman@suse.de> > --- Hi Joshua, I'm still grokking the implementation and I see Gabriel has already provided much more substantive feedback, so just a few random comments... > .../filesystems/bcachefs/casefolding.rst | 73 ++++++++ > fs/bcachefs/bcachefs.h | 6 + > fs/bcachefs/bcachefs_format.h | 26 ++- > fs/bcachefs/dirent.c | 177 +++++++++++++++--- > fs/bcachefs/fs-common.c | 4 + > fs/bcachefs/fs-ioctl.c | 22 +++ > fs/bcachefs/fs-ioctl.h | 20 +- > fs/bcachefs/str_hash.h | 4 + > fs/bcachefs/super.c | 19 ++ > 9 files changed, 318 insertions(+), 33 deletions(-) > create mode 100644 Documentation/filesystems/bcachefs/casefolding.rst > > diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst > new file mode 100644 > index 000000000000..c8580d892ae8 > --- /dev/null > +++ b/Documentation/filesystems/bcachefs/casefolding.rst > @@ -0,0 +1,73 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +Casefolding > +=========== > + > +bcachefs has support for case-insensitive file and directory > +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`) > +casefolding attributes. > + > +The main usecase for casefolding is compatibility with software written > +against other filesystems that rely on casefolded lookups > +(eg. NTFS and Wine/Proton). > +Taking advantage of file-system level casefolding can lead to great > +loading time gains in many applications and games. > + > +Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled. > +Once a directory has been flagged for casefolding, a feature bit > +is enabled on the superblock which marks the filesystem as using > +casefolding. > +When the feature bit for casefolding is enabled, it is no longer possible > +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled. > + > +On the lookup/query side: casefolding is implemented by allocating a new > +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to > +casefold the query string. > + > +On the dirent side: casefolding is implemented by ensuring the `bkey`'s > +hash is made from the casefolded string and storing the cached casefolded > +name with the regular name in the dirent. > + > +The structure looks like this: > + > +Regular: [dirent data][regular name][nul][nul]... > +Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]... > + > +(Do note, the number of `NUL`s here is merely for illustration, they count can vary > + per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.) > + > +This is efficient as it means that for all file lookups that require casefolding, > +it has identical performance to a regular lookup: > +a hash comparison and a `memcmp` of the name. > + Thanks for the writeup. This is very helpful. > +Rationale > +--------- > + > +Several designs were considered for this system: > +One was to introduce a dirent_v2, however that would be painful especially as > +the hash system only has support for a single key type. This would also need > +`BCH_NAME_MAX` to change between versions, and a new feature bit. > + > +Another option was to store without the two lengths, and just take the length of > +the regular name and casefolded name contiguously / 2 as the length. This would > +assume that the regular length == casefolded length, but that could potentially > +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than > +the lowercase unicode glyph. > +It would be possible to disregard the casefold cache for those cases, but it was > +decided to simply encode the two string lengths in the key to avoid random > +performance issues if this edgecase was ever hit. > + > +The option settled on was to use a free-bit in d_type to mark a dirent as having > +a casefold cache, and then treat the first 4 bytes the name block as lengths. > +You can see this in the `d_cf_name_block` member of union in `bch_dirent`. > + > +The feature bit was used to allow casefolding support to be enabled for the majority > +of users, but some allow users who have no need for the feature to still use bcachefs as > +`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used, > +which may be decider between using bcachefs for eg. embedded platforms. > + > +Other filesystems like ext4 and f2fs have a super-block level option for casefolding > +encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose > +any encodings than a single UTF-8 version. When future encodings are desirable, > +they will be added trivially using the opts mechanism. > + ... > diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h > index 23bae622309c..cc6797422977 100644 > --- a/fs/bcachefs/bcachefs_format.h > +++ b/fs/bcachefs/bcachefs_format.h ... > @@ -864,6 +866,7 @@ enum { > #define BCH_INODE_I_SECTORS_DIRTY (1 << __BCH_INODE_I_SECTORS_DIRTY) > #define BCH_INODE_UNLINKED (1 << __BCH_INODE_UNLINKED) > #define BCH_INODE_BACKPTR_UNTRUSTED (1 << __BCH_INODE_BACKPTR_UNTRUSTED) > +#define BCH_INODE_CASEFOLDED (1 << __BCH_INODE_CASEFOLDED) > > LE32_BITMASK(INODE_STR_HASH, struct bch_inode, bi_flags, 20, 24); > LE32_BITMASK(INODE_NR_FIELDS, struct bch_inode, bi_flags, 24, 31); > @@ -908,9 +911,25 @@ struct bch_dirent { > * Copy of mode bits 12-15 from the target inode - so userspace can get > * the filetype without having to do a stat() > */ > - __u8 d_type; > +#if defined(__LITTLE_ENDIAN_BITFIELD) > + __u8 d_type:5, > + d_unused:2, > + d_casefold:1; > +#elif defined(__BIG_ENDIAN_BITFIELD) > + __u8 d_casefold:1, > + d_unused:2, > + d_type:5; > +#endif Interesting. I didn't realize that bitfield endianness matched general endianness, so I presume this is required to ensure data is always le on disk. > > - __u8 d_name[]; > + union { > + struct { > + __u8 d_pad; > + __le16 d_name_len; > + __le16 d_cf_name_len; > + __u8 d_names[0]; > + } d_cf_name_block __packed; > + __u8 d_name[0]; > + } __packed; > } __packed __aligned(8); > > #define DT_SUBVOL 16 > @@ -1841,7 +1860,8 @@ static inline void SET_BCH_SB_BACKGROUND_COMPRESSION_TYPE(struct bch_sb *sb, __u > x(new_varint, 15) \ > x(journal_no_flush, 16) \ > x(alloc_v2, 17) \ > - x(extents_across_btree_nodes, 18) > + x(extents_across_btree_nodes, 18) \ > + x(casefolding, 19) > > #define BCH_SB_FEATURES_ALWAYS \ > ((1ULL << BCH_FEATURE_new_extent_overwrite)| \ > diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c > index 6bdcccabd732..6ac45d7ce6e9 100644 > --- a/fs/bcachefs/dirent.c > +++ b/fs/bcachefs/dirent.c > @@ -13,8 +13,39 @@ > > #include <linux/dcache.h> > > -static inline unsigned dirent_val_u64s(unsigned len) > +static inline int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info, > + const struct qstr *str, struct qstr *out_cf) > { > +#if IS_ENABLED(CONFIG_UNICODE) > + unsigned char *buf = NULL; > + int ret; > + > + *out_cf = (struct qstr) QSTR_INIT(NULL, 0); > + > + buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1); > + ret = PTR_ERR_OR_ZERO(buf); > + if (ret) > + return ret; > + > + ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1); Looks like this helper really only needs the encoding pointer rather than the full hash_info..? > + if (ret <= 0) > + return ret; > + > + *out_cf = (struct qstr) QSTR_INIT(buf, ret); > + return 0; > +#else > + *out_cf = (struct qstr) QSTR_INIT(NULL, 0); > + > + return -EOPNOTSUPP; > +#endif It also looks like the #else could be completely avoided here. For example, lift the local vars out of the ifdef, init out_cf using the NULL buf, init ret to -EOPNOTSUPP and push 'return ret;' out of the end of the ifdef. Hm? > +} > + > +static inline unsigned dirent_val_u64s(unsigned len, unsigned cf_len) > +{ > + if (cf_len > 0) { > + return DIV_ROUND_UP(offsetof(struct bch_dirent, d_cf_name_block.d_names) + len + cf_len, > + sizeof(u64)); > + } > return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len, > sizeof(u64)); FWIW, I think something along the lines of the pattern you use just below makes this a bit more readable. For example, something like: unsigned name_offset; name_offset = (cf_len > 0) ? offsetof(struct bch_dirent, d_cf_name_block.d_names) : offsetof(struct bch_dirent, d_name); return DIV_ROUND_UP(name_offset + len + cf_len, sizeof(u64)); ... but just a thought. > } ... > /* bcachefs inode flags -> FS_IOC_FSGETXATTR: */ > diff --git a/fs/bcachefs/str_hash.h b/fs/bcachefs/str_hash.h > index ae21a8cca1b4..866cfc145889 100644 > --- a/fs/bcachefs/str_hash.h > +++ b/fs/bcachefs/str_hash.h > @@ -34,6 +34,7 @@ bch2_str_hash_opt_to_type(struct bch_fs *c, enum bch_str_hash_opts opt) > > struct bch_hash_info { > u8 type; > + struct unicode_map *cf_encoding; > /* > * For crc32 or crc64 string hashes the first key value of > * the siphash_key (k0) is used as the key. > @@ -48,6 +49,9 @@ bch2_hash_info_init(struct bch_fs *c, const struct bch_inode_unpacked *bi) > struct bch_hash_info info = { > .type = (bi->bi_flags >> INODE_STR_HASH_OFFSET) & > ~(~0U << INODE_STR_HASH_BITS), > +#if IS_ENABLED(CONFIG_UNICODE) > + .cf_encoding = !!(bi->bi_flags & BCH_INODE_CASEFOLDED) ? c->cf_encoding : NULL, > +#endif What's the purpose for propagating this pointer via the hash? Is this necessary to (eventually) support per-directory encodings, or something else? If purely the former and that is something to possibly come later, it does seem a little premature to me in that the infrastructure part should probably go along with the feature functionality that takes advantage of it. BTW on a similar train of thought.. I'm not familiar enough with the feature to have an opinion on Gabriel's feedback wrt to storing the casefolded name on-disk, but I do think there's value in breaking down the feature into smaller logical components, particularly when there are performance vs. complexity tradeoffs. So for example if you wanted to continue to implement the on-disk format change approach, perhaps factor out core functionality along the lines of what ext4 does into a patch 1, then add the on-disk format change support in patch 2 as an (optional?) optimization (then lift off the per-dir bits into later patches that implement the granular approach, etc.). That not only makes this easier to review, but IMO significantly easier to maintain, test, diagnose, etc., such that it's at least worth considering. Just my .02. Brian > .siphash_key = { .k0 = bi->bi_hash_seed } > }; > > diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c > index 5c62fcf3afdb..78d19cfb966a 100644 > --- a/fs/bcachefs/super.c > +++ b/fs/bcachefs/super.c > @@ -757,6 +757,25 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts) > if (ret) > goto err; > > +#if IS_ENABLED(CONFIG_UNICODE) > + /* Default encoding until we can potentially have more as an option. */ > + c->cf_encoding = utf8_load(BCH_FS_DEFAULT_UTF8_ENCODING); > + if (IS_ERR(c->cf_encoding)) { > + printk(KERN_ERR "Cannot load UTF-8 encoding for filesystem. Version: %u.%u.%u", > + unicode_major(BCH_FS_DEFAULT_UTF8_ENCODING), > + unicode_minor(BCH_FS_DEFAULT_UTF8_ENCODING), > + unicode_rev(BCH_FS_DEFAULT_UTF8_ENCODING)); > + ret = -EINVAL; > + goto err; > + } > +#else > + if (c->sb.features & (1ULL << BCH_FEATURE_casefolding)) { > + printk(KERN_ERR "Cannot mount a filesystem with casefolding on a kernel without CONFIG_UNICODE\n"); > + ret = -EINVAL; > + goto err; > + } > +#endif > + > pr_uuid(&name, c->sb.user_uuid.b); > strscpy(c->name, name.buf, sizeof(c->name)); > printbuf_exit(&name); > -- > 2.41.0 > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [v4 3/3] bcachefs: Implement casefolding 2023-08-15 12:36 ` Brian Foster @ 2023-08-24 5:43 ` Joshua Ashton 0 siblings, 0 replies; 15+ messages in thread From: Joshua Ashton @ 2023-08-24 5:43 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs, André Almeida, Gabriel Krisman Bertazi On 8/15/23 13:36, Brian Foster wrote: > On Mon, Aug 14, 2023 at 02:24:27AM +0100, Joshua Ashton wrote: >> This patch implements support for case-insensitive file name lookups >> in bcachefs. >> >> The implementation uses the same UTF-8 lowering and normalization that >> ext4 and f2fs is using. >> >> More information is provided in Documentation/bcachefs/casefolding.rst >> >> Signed-off-by: Joshua Ashton <joshua@froggi.es> >> >> Cc: André Almeida <andrealmeid@igalia.com> >> Cc: Gabriel Krisman Bertazi <krisman@suse.de> >> --- > > Hi Joshua, > > I'm still grokking the implementation and I see Gabriel has already > provided much more substantive feedback, so just a few random comments... > >> .../filesystems/bcachefs/casefolding.rst | 73 ++++++++ >> fs/bcachefs/bcachefs.h | 6 + >> fs/bcachefs/bcachefs_format.h | 26 ++- >> fs/bcachefs/dirent.c | 177 +++++++++++++++--- >> fs/bcachefs/fs-common.c | 4 + >> fs/bcachefs/fs-ioctl.c | 22 +++ >> fs/bcachefs/fs-ioctl.h | 20 +- >> fs/bcachefs/str_hash.h | 4 + >> fs/bcachefs/super.c | 19 ++ >> 9 files changed, 318 insertions(+), 33 deletions(-) >> create mode 100644 Documentation/filesystems/bcachefs/casefolding.rst >> >> diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst >> new file mode 100644 >> index 000000000000..c8580d892ae8 >> --- /dev/null >> +++ b/Documentation/filesystems/bcachefs/casefolding.rst >> @@ -0,0 +1,73 @@ >> +.. SPDX-License-Identifier: GPL-2.0 >> + >> +Casefolding >> +=========== >> + >> +bcachefs has support for case-insensitive file and directory >> +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`) >> +casefolding attributes. >> + >> +The main usecase for casefolding is compatibility with software written >> +against other filesystems that rely on casefolded lookups >> +(eg. NTFS and Wine/Proton). >> +Taking advantage of file-system level casefolding can lead to great >> +loading time gains in many applications and games. >> + >> +Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled. >> +Once a directory has been flagged for casefolding, a feature bit >> +is enabled on the superblock which marks the filesystem as using >> +casefolding. >> +When the feature bit for casefolding is enabled, it is no longer possible >> +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled. >> + >> +On the lookup/query side: casefolding is implemented by allocating a new >> +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to >> +casefold the query string. >> + >> +On the dirent side: casefolding is implemented by ensuring the `bkey`'s >> +hash is made from the casefolded string and storing the cached casefolded >> +name with the regular name in the dirent. >> + >> +The structure looks like this: >> + >> +Regular: [dirent data][regular name][nul][nul]... >> +Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]... >> + >> +(Do note, the number of `NUL`s here is merely for illustration, they count can vary >> + per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.) >> + >> +This is efficient as it means that for all file lookups that require casefolding, >> +it has identical performance to a regular lookup: >> +a hash comparison and a `memcmp` of the name. >> + > > Thanks for the writeup. This is very helpful. > >> +Rationale >> +--------- >> + >> +Several designs were considered for this system: >> +One was to introduce a dirent_v2, however that would be painful especially as >> +the hash system only has support for a single key type. This would also need >> +`BCH_NAME_MAX` to change between versions, and a new feature bit. >> + >> +Another option was to store without the two lengths, and just take the length of >> +the regular name and casefolded name contiguously / 2 as the length. This would >> +assume that the regular length == casefolded length, but that could potentially >> +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than >> +the lowercase unicode glyph. >> +It would be possible to disregard the casefold cache for those cases, but it was >> +decided to simply encode the two string lengths in the key to avoid random >> +performance issues if this edgecase was ever hit. >> + >> +The option settled on was to use a free-bit in d_type to mark a dirent as having >> +a casefold cache, and then treat the first 4 bytes the name block as lengths. >> +You can see this in the `d_cf_name_block` member of union in `bch_dirent`. >> + >> +The feature bit was used to allow casefolding support to be enabled for the majority >> +of users, but some allow users who have no need for the feature to still use bcachefs as >> +`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used, >> +which may be decider between using bcachefs for eg. embedded platforms. >> + >> +Other filesystems like ext4 and f2fs have a super-block level option for casefolding >> +encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose >> +any encodings than a single UTF-8 version. When future encodings are desirable, >> +they will be added trivially using the opts mechanism. >> + > ... >> diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h >> index 23bae622309c..cc6797422977 100644 >> --- a/fs/bcachefs/bcachefs_format.h >> +++ b/fs/bcachefs/bcachefs_format.h > ... >> @@ -864,6 +866,7 @@ enum { >> #define BCH_INODE_I_SECTORS_DIRTY (1 << __BCH_INODE_I_SECTORS_DIRTY) >> #define BCH_INODE_UNLINKED (1 << __BCH_INODE_UNLINKED) >> #define BCH_INODE_BACKPTR_UNTRUSTED (1 << __BCH_INODE_BACKPTR_UNTRUSTED) >> +#define BCH_INODE_CASEFOLDED (1 << __BCH_INODE_CASEFOLDED) >> >> LE32_BITMASK(INODE_STR_HASH, struct bch_inode, bi_flags, 20, 24); >> LE32_BITMASK(INODE_NR_FIELDS, struct bch_inode, bi_flags, 24, 31); >> @@ -908,9 +911,25 @@ struct bch_dirent { >> * Copy of mode bits 12-15 from the target inode - so userspace can get >> * the filetype without having to do a stat() >> */ >> - __u8 d_type; >> +#if defined(__LITTLE_ENDIAN_BITFIELD) >> + __u8 d_type:5, >> + d_unused:2, >> + d_casefold:1; >> +#elif defined(__BIG_ENDIAN_BITFIELD) >> + __u8 d_casefold:1, >> + d_unused:2, >> + d_type:5; >> +#endif > > Interesting. I didn't realize that bitfield endianness matched general > endianness, so I presume this is required to ensure data is always le on > disk. Correct. > >> >> - __u8 d_name[]; >> + union { >> + struct { >> + __u8 d_pad; >> + __le16 d_name_len; >> + __le16 d_cf_name_len; >> + __u8 d_names[0]; >> + } d_cf_name_block __packed; >> + __u8 d_name[0]; >> + } __packed; >> } __packed __aligned(8); >> >> #define DT_SUBVOL 16 >> @@ -1841,7 +1860,8 @@ static inline void SET_BCH_SB_BACKGROUND_COMPRESSION_TYPE(struct bch_sb *sb, __u >> x(new_varint, 15) \ >> x(journal_no_flush, 16) \ >> x(alloc_v2, 17) \ >> - x(extents_across_btree_nodes, 18) >> + x(extents_across_btree_nodes, 18) \ >> + x(casefolding, 19) >> >> #define BCH_SB_FEATURES_ALWAYS \ >> ((1ULL << BCH_FEATURE_new_extent_overwrite)| \ >> diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c >> index 6bdcccabd732..6ac45d7ce6e9 100644 >> --- a/fs/bcachefs/dirent.c >> +++ b/fs/bcachefs/dirent.c >> @@ -13,8 +13,39 @@ >> >> #include <linux/dcache.h> >> >> -static inline unsigned dirent_val_u64s(unsigned len) >> +static inline int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info, >> + const struct qstr *str, struct qstr *out_cf) >> { >> +#if IS_ENABLED(CONFIG_UNICODE) >> + unsigned char *buf = NULL; >> + int ret; >> + >> + *out_cf = (struct qstr) QSTR_INIT(NULL, 0); >> + >> + buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1); >> + ret = PTR_ERR_OR_ZERO(buf); >> + if (ret) >> + return ret; >> + >> + ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1); > > Looks like this helper really only needs the encoding pointer rather > than the full hash_info..? Sure, we could do that. > >> + if (ret <= 0) >> + return ret; >> + >> + *out_cf = (struct qstr) QSTR_INIT(buf, ret); >> + return 0; >> +#else >> + *out_cf = (struct qstr) QSTR_INIT(NULL, 0); >> + >> + return -EOPNOTSUPP; >> +#endif > > It also looks like the #else could be completely avoided here. For > example, lift the local vars out of the ifdef, init out_cf using the > NULL buf, init ret to -EOPNOTSUPP and push 'return ret;' out of the end > of the ifdef. Hm? Thanks for the heads up, will experiment with that > >> +} >> + >> +static inline unsigned dirent_val_u64s(unsigned len, unsigned cf_len) >> +{ >> + if (cf_len > 0) { >> + return DIV_ROUND_UP(offsetof(struct bch_dirent, d_cf_name_block.d_names) + len + cf_len, >> + sizeof(u64)); >> + } >> return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len, >> sizeof(u64)); > > FWIW, I think something along the lines of the pattern you use just > below makes this a bit more readable. For example, something like: > > unsigned name_offset; > name_offset = (cf_len > 0) ? > offsetof(struct bch_dirent, d_cf_name_block.d_names) : > offsetof(struct bch_dirent, d_name); > return DIV_ROUND_UP(name_offset + len + cf_len, sizeof(u64)); > > ... but just a thought. > >> } > ... >> /* bcachefs inode flags -> FS_IOC_FSGETXATTR: */ >> diff --git a/fs/bcachefs/str_hash.h b/fs/bcachefs/str_hash.h >> index ae21a8cca1b4..866cfc145889 100644 >> --- a/fs/bcachefs/str_hash.h >> +++ b/fs/bcachefs/str_hash.h >> @@ -34,6 +34,7 @@ bch2_str_hash_opt_to_type(struct bch_fs *c, enum bch_str_hash_opts opt) >> >> struct bch_hash_info { >> u8 type; >> + struct unicode_map *cf_encoding; >> /* >> * For crc32 or crc64 string hashes the first key value of >> * the siphash_key (k0) is used as the key. >> @@ -48,6 +49,9 @@ bch2_hash_info_init(struct bch_fs *c, const struct bch_inode_unpacked *bi) >> struct bch_hash_info info = { >> .type = (bi->bi_flags >> INODE_STR_HASH_OFFSET) & >> ~(~0U << INODE_STR_HASH_BITS), >> +#if IS_ENABLED(CONFIG_UNICODE) >> + .cf_encoding = !!(bi->bi_flags & BCH_INODE_CASEFOLDED) ? c->cf_encoding : NULL, >> +#endif > > What's the purpose for propagating this pointer via the hash? Is this > necessary to (eventually) support per-directory encodings, or something > else? If purely the former and that is something to possibly come later, > it does seem a little premature to me in that the infrastructure part > should probably go along with the feature functionality that takes > advantage of it. Yes, it is for the convenience of being able to do per-directory encodings in the future -- but also it's the easiest way to get the information propagated from the superblock right now also. If I was to not put it here, it would involve a bunch more work of passing through another pointer which to me makes little sense when we can just include it here. It is used by the hashing process so it makes sense anyway, but I am happy to rename the struct. > > BTW on a similar train of thought.. I'm not familiar enough with the > feature to have an opinion on Gabriel's feedback wrt to storing the > casefolded name on-disk, but I do think there's value in breaking down > the feature into smaller logical components, particularly when there are > performance vs. complexity tradeoffs. So for example if you wanted to > continue to implement the on-disk format change approach, perhaps factor > out core functionality along the lines of what ext4 does into a patch 1, > then add the on-disk format change support in patch 2 as an (optional?) > optimization (then lift off the per-dir bits into later patches that > implement the granular approach, etc.). I explained this in another email but will do again, I really don't want to do that approach because breaking it down like that means we will have to support a path that will never, ever be used forever for no reason. We'd have to make dirent_cmp, dirent_hash, and stuff aware of casefolding and it would just be a bunch of effectively dead code. On face-value, it makes sense, but when you look closer, it doesn't really. I also really don't want to bloat those functions that need to remain pretty fast with a bunch of stuff to casefold (in the case of bkey_hash) and use utf8_strncasecmp, etc. It would all just be dead code that needs to be supported forever when we go to cache names. :/ > > That not only makes this easier to review, but IMO significantly easier > to maintain, test, diagnose, etc., such that it's at least worth > considering. Just my .02. I don't really see how adding more code paths makes this easier to review and maintain... - Joshie 🐸✨ > > Brian > >> .siphash_key = { .k0 = bi->bi_hash_seed } >> }; >> >> diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c >> index 5c62fcf3afdb..78d19cfb966a 100644 >> --- a/fs/bcachefs/super.c >> +++ b/fs/bcachefs/super.c >> @@ -757,6 +757,25 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts) >> if (ret) >> goto err; >> >> +#if IS_ENABLED(CONFIG_UNICODE) >> + /* Default encoding until we can potentially have more as an option. */ >> + c->cf_encoding = utf8_load(BCH_FS_DEFAULT_UTF8_ENCODING); >> + if (IS_ERR(c->cf_encoding)) { >> + printk(KERN_ERR "Cannot load UTF-8 encoding for filesystem. Version: %u.%u.%u", >> + unicode_major(BCH_FS_DEFAULT_UTF8_ENCODING), >> + unicode_minor(BCH_FS_DEFAULT_UTF8_ENCODING), >> + unicode_rev(BCH_FS_DEFAULT_UTF8_ENCODING)); >> + ret = -EINVAL; >> + goto err; >> + } >> +#else >> + if (c->sb.features & (1ULL << BCH_FEATURE_casefolding)) { >> + printk(KERN_ERR "Cannot mount a filesystem with casefolding on a kernel without CONFIG_UNICODE\n"); >> + ret = -EINVAL; >> + goto err; >> + } >> +#endif >> + >> pr_uuid(&name, c->sb.user_uuid.b); >> strscpy(c->name, name.buf, sizeof(c->name)); >> printbuf_exit(&name); >> -- >> 2.41.0 >> > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [v4 0/3] bcachefs: Casefolding implementation 2023-08-14 1:24 [v4 0/3] bcachefs: Casefolding implementation Joshua Ashton ` (2 preceding siblings ...) 2023-08-14 1:24 ` [v4 3/3] bcachefs: Implement casefolding Joshua Ashton @ 2023-08-16 19:40 ` Kent Overstreet 2023-08-23 9:34 ` Joshua Ashton 3 siblings, 1 reply; 15+ messages in thread From: Kent Overstreet @ 2023-08-16 19:40 UTC (permalink / raw) To: Joshua Ashton; +Cc: linux-bcachefs On Mon, Aug 14, 2023 at 02:24:24AM +0100, Joshua Ashton wrote: > This implements support for case-insensitive file and directory > lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`) > casefolding attributes for bcachefs. > > The implementation uses the same UTF-8 lowering and normalization that > ext4 and f2fs is using currently > > The main usecase for casefolding is compatibility with software written > against other filesystems that rely on casefolded lookups > (eg. NTFS and Wine/Proton). > Taking advantage of file-system level casefolding can lead to great > loading time gains in many applications and games. > > See more information and rationale in: > Documentation/filesystems/bcachefs/casefolding.rst > > You can see/review the work in my tree here: > https://github.com/Joshua-Ashton/bcachefs/tree/casefold > > v2: Update format to store length in dirent based on flag, add comments, cleanup > dirent_create_key, other misc. nits from IRC. > > v3: Cleanup the dirent code, utilize the fact that BCH_NAME_MAX is smaller now, > remove fallbacks for too big keys that won't make sense anymore. Improved > lookup string lowering. > > v4: Move cf_name allocation from stack to trans alloc, validated bkey in invalid, > used length directly, misc. nits. Josh, do you think you could post the benchmarks you ran at the start of all this? We should look at those again after negative dentries are hooked up and working correctly. After giving this more thought, I'm starting to feel a bit more inclined to Gabriel's approach... ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [v4 0/3] bcachefs: Casefolding implementation 2023-08-16 19:40 ` [v4 0/3] bcachefs: Casefolding implementation Kent Overstreet @ 2023-08-23 9:34 ` Joshua Ashton 0 siblings, 0 replies; 15+ messages in thread From: Joshua Ashton @ 2023-08-23 9:34 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs, Gabriel Krisman Bertazi On 8/16/23 20:40, Kent Overstreet wrote: > On Mon, Aug 14, 2023 at 02:24:24AM +0100, Joshua Ashton wrote: >> This implements support for case-insensitive file and directory >> lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`) >> casefolding attributes for bcachefs. >> >> The implementation uses the same UTF-8 lowering and normalization that >> ext4 and f2fs is using currently >> >> The main usecase for casefolding is compatibility with software written >> against other filesystems that rely on casefolded lookups >> (eg. NTFS and Wine/Proton). >> Taking advantage of file-system level casefolding can lead to great >> loading time gains in many applications and games. >> >> See more information and rationale in: >> Documentation/filesystems/bcachefs/casefolding.rst >> >> You can see/review the work in my tree here: >> https://github.com/Joshua-Ashton/bcachefs/tree/casefold >> >> v2: Update format to store length in dirent based on flag, add comments, cleanup >> dirent_create_key, other misc. nits from IRC. >> >> v3: Cleanup the dirent code, utilize the fact that BCH_NAME_MAX is smaller now, >> remove fallbacks for too big keys that won't make sense anymore. Improved >> lookup string lowering. >> >> v4: Move cf_name allocation from stack to trans alloc, validated bkey in invalid, >> used length directly, misc. nits. > > Josh, do you think you could post the benchmarks you ran at the start of > all this? We should look at those again after negative dentries are > hooked up and working correctly. > > After giving this more thought, I'm starting to feel a bit more inclined > to Gabriel's approach... Sure. I've cleaned the synthetic ones up into an easy-to-run userspace benchmark available here: https://github.com/Joshua-Ashton/casefolding-microbench My results look like the following with -O2 on a Ryzen 7900XT: ------------------------------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------------------------------ memcmp_ascii_eq 5.83 ns 5.82 ns 119813859 memcmp_ascii_immediately_neq 3.71 ns 3.70 ns 188580499 memcmp_unicode_eq 6.63 ns 6.60 ns 111117537 memcmp_unicode_immediately_neq 3.67 ns 3.66 ns 192004721 utf8_strncasecmp_ascii_eq 30221 ns 30118 ns 23699 utf8_strncasecmp_ascii_immediately_neq 72.7 ns 72.6 ns 9634262 utf8_strncasecmp_unicode_eq 33375 ns 33328 ns 21011 utf8_strncasecmp_unicode_immediately_neq 235 ns 235 ns 2981485 utf8_strncasecmp_unicode_cased 34084 ns 34027 ns 20973 utf8_strncasecmp_folded_ascii_eq 16301 ns 16276 ns 42710 utf8_strncasecmp_folded_ascii_immediately_neq 19.6 ns 19.6 ns 35654425 utf8_strncasecmp_folded_unicode_eq 17529 ns 17504 ns 39968 utf8_strncasecmp_folded_unicode_immediately_neq 110 ns 110 ns 6373884 utf8_strncasecmp_folded_unicode_cased 19958 ns 19876 ns 35236 As you can see, using utf8_strncasecmp on our largest name ASCII strings can result in a 5183x slowdown for comparison checks, and for Unicode strings of similar length, a 5846x slowdown. utf8_strncasecmp_folded was documented to give a 30% improvement for casefold lookup times in a large directory (3ae72562ad917df36a1b1247d749240e3b4865db). This bench shows that to be more like a 50% improvement on utf8_strncasecmp which is ballpark-ish. This is still a slowdown of 3006x in the best-case compared to memcmp. Bear in mind, this is probably only once for a cmp in a small directory, and multiple times if there are hash collisions. At it's worst, at a minimum it will take ~0.04ms to do simply the cmp of a single case insensitive lookup. Using -O3 makes things a bit better for strncasecmp, but not by any orders of magnitude that it needs: ------------------------------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------------------------------ memcmp_ascii_eq 3.63 ns 3.62 ns 193349553 memcmp_ascii_immediately_neq 2.16 ns 2.15 ns 338061946 memcmp_unicode_eq 4.43 ns 4.42 ns 158399738 memcmp_unicode_immediately_neq 2.18 ns 2.17 ns 322218734 utf8_strncasecmp_ascii_eq 13476 ns 13449 ns 51587 utf8_strncasecmp_ascii_immediately_neq 31.9 ns 31.8 ns 21838062 utf8_strncasecmp_unicode_eq 18575 ns 18536 ns 37588 utf8_strncasecmp_unicode_immediately_neq 130 ns 130 ns 5408163 utf8_strncasecmp_unicode_cased 17719 ns 17690 ns 37665 utf8_strncasecmp_folded_ascii_eq 6132 ns 6124 ns 114403 utf8_strncasecmp_folded_ascii_immediately_neq 8.03 ns 8.02 ns 87290013 utf8_strncasecmp_folded_unicode_eq 8734 ns 8722 ns 80225 utf8_strncasecmp_folded_unicode_immediately_neq 52.2 ns 52.1 ns 13370548 utf8_strncasecmp_folded_unicode_cased 9874 ns 9859 ns 71000 Still pretty significant though. :/ I would really really like to avoid this inherent extra cost going forward. One other reason I am pretty against going with utf8_strncasecmp_folded (even initially) is because if/when we do decide to do casefold caching in fs, we have to keep all of that code path around in dirent_cmp for compat which is super gross. This also means if we want to do other encodings, such as ASCII alone, or whatever, the _cmp code needs to grow support for those too, which I also really dislike. The dirent_cmp and dirent_hash, etc code should be kept as simple and efficient as possible and entirely free of all of this aside from using the already provided names. One question I know is going to be asked is: How does this translate to the real world? Given that the switch from utf8_strncasecmp to utf8_strncasecmp_folded for EXT4 increased perf with large directories by 30% as-per 3ae72562ad917df36a1b1247d749240e3b4865db, then using a pure memcmp must improve it pretty significantly from the benchmarks above, although that is my pure speculation based on extrapolating the two data sets. :-D We are of course always eating the cost of the initial utf8_casefold for hashing and cmp though... Is there a need for more data here or is this enough to base our decision off of? IIRC we had some data points about Proton prefix creation time and stuff following the ext4 casefolding work and strncasecmp vs folded too, but I may need to dig up if anyone still has those. I don't think they are super relevant here though. - Joshie 🐸✨ ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2023-08-24 5:43 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-08-14 1:24 [v4 0/3] bcachefs: Casefolding implementation Joshua Ashton 2023-08-14 1:24 ` [v4 1/3] bcachefs: Split out dirent alloc and name initialization Joshua Ashton 2023-08-14 1:24 ` [v4 2/3] bcachefs: Move dirent_val_u64s to dirent.c Joshua Ashton 2023-08-14 1:24 ` [v4 3/3] bcachefs: Implement casefolding Joshua Ashton 2023-08-14 16:34 ` Gabriel Krisman Bertazi 2023-08-14 21:18 ` Joshua Ashton 2023-08-14 21:33 ` Joshua Ashton 2023-08-14 23:01 ` Gabriel Krisman Bertazi 2023-08-15 0:58 ` Kent Overstreet 2023-08-15 18:00 ` Gabriel Krisman Bertazi 2023-08-23 9:37 ` Joshua Ashton 2023-08-15 12:36 ` Brian Foster 2023-08-24 5:43 ` Joshua Ashton 2023-08-16 19:40 ` [v4 0/3] bcachefs: Casefolding implementation Kent Overstreet 2023-08-23 9:34 ` Joshua Ashton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox