* [PATCH 00/18] last on disk format changes before freeze
@ 2025-02-13 18:45 Kent Overstreet
2025-02-13 18:45 ` [PATCH 01/18] bcachefs: bch2_lru_change() checks for no-op Kent Overstreet
` (17 more replies)
0 siblings, 18 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:45 UTC
To: linux-bcachefs; +Cc: Kent Overstreet, Joshua Ashton, Hongbo Li
These are the last of the on disk format changes that will be required
upgrades. Future on disk format changes will be optional - meaning
they'll require more compat code.
- metadata_version_cached_backpointers: scrub can now check cached data,
and (more importantly) the gc_gens pass for recalculating
bucket.oldest_gen is no longer required - necessary scalability item.
- metadata_version_stripe_backpointers: needed for scrub to properly
support erasure coding, and for future stripe repair code
- metadata_version_stripe_lru: we no longer have to read stripes into
memory at startup for the stripes heap, for reusing partially empty
stripes
- metadata_version_casefolding: I think this one will make some people
happy :)
Joshua, the casefolding patches needed some work to rebase, could you
look them over? Hongbo, I had to tweak your directory i_size code as
well.
Joshua Ashton (2):
bcachefs: Split out dirent alloc and name initialization
bcachefs: bcachefs_metadata_version_casefolding
Kent Overstreet (16):
bcachefs: bch2_lru_change() checks for no-op
bcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/
bcachefs: decouple bch2_lru_check_set() from alloc btree
bcachefs: Rework bch2_check_lru_key()
bcachefs: bch2_trigger_stripe_ptr() no longer uses
ec_stripes_heap_lock
bcachefs: Better trigger ordering
bcachefs: rework bch2_trans_commit_run_triggers()
bcachefs: bcachefs_metadata_version_cached_backpointers
bcachefs: Invalidate cached data by backpointers
bcachefs: Advance bch_alloc.oldest_gen if no stale pointers
bcachefs: bcachefs_metadata_version_stripe_backpointers
bcachefs: bcachefs_metadata_version_stripe_lru
bcachefs: ec_stripe_delete() uses new stripe lru
bcachefs: get_existing_stripe() uses new stripe lru
bcachefs: We no longer read stripes into memory at startup
bcachefs: Kill dirent_occupied_size()
.../filesystems/bcachefs/casefolding.rst | 87 ++++
fs/bcachefs/alloc_background.c | 147 ++++--
fs/bcachefs/backpointers.c | 14 +-
fs/bcachefs/backpointers.h | 15 +-
fs/bcachefs/bcachefs.h | 10 +-
fs/bcachefs/bcachefs_format.h | 9 +-
fs/bcachefs/btree_trans_commit.c | 89 ++--
fs/bcachefs/btree_types.h | 13 +
fs/bcachefs/btree_update.c | 3 +-
fs/bcachefs/buckets.c | 14 +-
fs/bcachefs/buckets.h | 27 --
fs/bcachefs/buckets_types.h | 27 ++
fs/bcachefs/dirent.c | 223 +++++++--
fs/bcachefs/dirent.h | 18 +-
fs/bcachefs/dirent_format.h | 20 +-
fs/bcachefs/ec.c | 429 ++++++------------
fs/bcachefs/ec.h | 46 +-
fs/bcachefs/ec_types.h | 12 +-
fs/bcachefs/fs-common.c | 42 +-
fs/bcachefs/fs-ioctl.c | 25 +
fs/bcachefs/fs-ioctl.h | 20 +-
fs/bcachefs/fs.c | 17 +
fs/bcachefs/fsck.c | 5 +-
fs/bcachefs/inode_format.h | 3 +-
fs/bcachefs/lru.c | 100 ++--
fs/bcachefs/lru.h | 22 +-
fs/bcachefs/lru_format.h | 6 +-
fs/bcachefs/move.c | 3 +
fs/bcachefs/movinggc.c | 4 +-
fs/bcachefs/recovery_passes_types.h | 2 +-
fs/bcachefs/sb-downgrade.c | 6 +
fs/bcachefs/sb-errors_format.h | 4 +-
fs/bcachefs/str_hash.c | 2 +-
fs/bcachefs/str_hash.h | 4 +
fs/bcachefs/super.c | 19 +
fs/bcachefs/sysfs.c | 5 -
36 files changed, 904 insertions(+), 588 deletions(-)
create mode 100644 Documentation/filesystems/bcachefs/casefolding.rst
--
2.45.2
* [PATCH 01/18] bcachefs: bch2_lru_change() checks for no-op
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
@ 2025-02-13 18:45 ` Kent Overstreet
2025-02-13 18:45 ` [PATCH 02/18] bcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/ Kent Overstreet
` (16 subsequent siblings)
17 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:45 UTC
To: linux-bcachefs; +Cc: Kent Overstreet
Minor cleanup: there's no reason for the caller to have to do this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/alloc_background.c | 32 +++++++++++++-------------------
fs/bcachefs/lru.c | 6 +++---
fs/bcachefs/lru.h | 11 ++++++++++-
3 files changed, 26 insertions(+), 23 deletions(-)
diff --git a/fs/bcachefs/alloc_background.c b/fs/bcachefs/alloc_background.c
index a35455802280..e1061524bdf5 100644
--- a/fs/bcachefs/alloc_background.c
+++ b/fs/bcachefs/alloc_background.c
@@ -889,26 +889,20 @@ int bch2_trigger_alloc(struct btree_trans *trans,
!new_a->io_time[READ])
new_a->io_time[READ] = bch2_current_io_time(c, READ);
- u64 old_lru = alloc_lru_idx_read(*old_a);
- u64 new_lru = alloc_lru_idx_read(*new_a);
- if (old_lru != new_lru) {
- ret = bch2_lru_change(trans, new.k->p.inode,
- bucket_to_u64(new.k->p),
- old_lru, new_lru);
- if (ret)
- goto err;
- }
+ ret = bch2_lru_change(trans, new.k->p.inode,
+ bucket_to_u64(new.k->p),
+ alloc_lru_idx_read(*old_a),
+ alloc_lru_idx_read(*new_a));
+ if (ret)
+ goto err;
- old_lru = alloc_lru_idx_fragmentation(*old_a, ca);
- new_lru = alloc_lru_idx_fragmentation(*new_a, ca);
- if (old_lru != new_lru) {
- ret = bch2_lru_change(trans,
- BCH_LRU_FRAGMENTATION_START,
- bucket_to_u64(new.k->p),
- old_lru, new_lru);
- if (ret)
- goto err;
- }
+ ret = bch2_lru_change(trans,
+ BCH_LRU_FRAGMENTATION_START,
+ bucket_to_u64(new.k->p),
+ alloc_lru_idx_fragmentation(*old_a, ca),
+ alloc_lru_idx_fragmentation(*new_a, ca));
+ if (ret)
+ goto err;
if (old_a->gen != new_a->gen) {
ret = bch2_bucket_gen_update(trans, new.k->p, new_a->gen);
diff --git a/fs/bcachefs/lru.c b/fs/bcachefs/lru.c
index ce794d55818f..8ec16ae8daa6 100644
--- a/fs/bcachefs/lru.c
+++ b/fs/bcachefs/lru.c
@@ -59,9 +59,9 @@ int bch2_lru_set(struct btree_trans *trans, u16 lru_id, u64 dev_bucket, u64 time
return __bch2_lru_set(trans, lru_id, dev_bucket, time, KEY_TYPE_set);
}
-int bch2_lru_change(struct btree_trans *trans,
- u16 lru_id, u64 dev_bucket,
- u64 old_time, u64 new_time)
+int __bch2_lru_change(struct btree_trans *trans,
+ u16 lru_id, u64 dev_bucket,
+ u64 old_time, u64 new_time)
{
if (old_time == new_time)
return 0;
diff --git a/fs/bcachefs/lru.h b/fs/bcachefs/lru.h
index f31a6cf1514c..2facc0758cb3 100644
--- a/fs/bcachefs/lru.h
+++ b/fs/bcachefs/lru.h
@@ -46,7 +46,16 @@ void bch2_lru_pos_to_text(struct printbuf *, struct bpos);
int bch2_lru_del(struct btree_trans *, u16, u64, u64);
int bch2_lru_set(struct btree_trans *, u16, u64, u64);
-int bch2_lru_change(struct btree_trans *, u16, u64, u64, u64);
+int __bch2_lru_change(struct btree_trans *, u16, u64, u64, u64);
+
+static inline int bch2_lru_change(struct btree_trans *trans,
+ u16 lru_id, u64 dev_bucket,
+ u64 old_time, u64 new_time)
+{
+ return old_time != new_time
+ ? __bch2_lru_change(trans, lru_id, dev_bucket, old_time, new_time)
+ : 0;
+}
struct bkey_buf;
int bch2_lru_check_set(struct btree_trans *, u16, u64, struct bkey_s_c, struct bkey_buf *);
--
2.45.2
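The pattern in this patch, an inline wrapper that filters out the no-op case before calling the out-of-line function, can be sketched in standalone C. This is a minimal userspace illustration, not the bcachefs code: `lru_change_slow()`, `lru_change()` and the `slow_calls` counter are hypothetical names, and the real function takes a transaction, an LRU id and a bucket as well.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counter, so the effect of the wrapper is observable. */
static unsigned slow_calls;

/* Stand-in for the out-of-line __bch2_lru_change(): does the real work. */
static int lru_change_slow(uint64_t old_time, uint64_t new_time)
{
	assert(old_time != new_time);	/* the wrapper filters out no-ops */
	slow_calls++;
	return 0;
}

/* Stand-in for the inline bch2_lru_change(): callers no longer compare. */
static inline int lru_change(uint64_t old_time, uint64_t new_time)
{
	return old_time != new_time
		? lru_change_slow(old_time, new_time)
		: 0;
}
```

Calling `lru_change(5, 5)` returns 0 without touching the slow path, while `lru_change(5, 6)` reaches `lru_change_slow()` exactly once, so call sites can drop their own `old != new` checks.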
* [PATCH 02/18] bcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
2025-02-13 18:45 ` [PATCH 01/18] bcachefs: bch2_lru_change() checks for no-op Kent Overstreet
@ 2025-02-13 18:45 ` Kent Overstreet
2025-02-13 18:45 ` [PATCH 03/18] bcachefs: decouple bch2_lru_check_set() from alloc btree Kent Overstreet
` (15 subsequent siblings)
17 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:45 UTC
To: linux-bcachefs; +Cc: Kent Overstreet
FRAGMENTATION_START was incorrect, there's currently only one
fragmentation LRU (at the end of the reserved bits for LRU type), and
we're getting ready to add a stripe fragmentation lru - so give it a
better name.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/alloc_background.c | 4 ++--
fs/bcachefs/lru.h | 2 +-
fs/bcachefs/lru_format.h | 2 +-
fs/bcachefs/movinggc.c | 4 ++--
4 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/bcachefs/alloc_background.c b/fs/bcachefs/alloc_background.c
index e1061524bdf5..87ff50a3cd81 100644
--- a/fs/bcachefs/alloc_background.c
+++ b/fs/bcachefs/alloc_background.c
@@ -897,7 +897,7 @@ int bch2_trigger_alloc(struct btree_trans *trans,
goto err;
ret = bch2_lru_change(trans,
- BCH_LRU_FRAGMENTATION_START,
+ BCH_LRU_BUCKET_FRAGMENTATION,
bucket_to_u64(new.k->p),
alloc_lru_idx_fragmentation(*old_a, ca),
alloc_lru_idx_fragmentation(*new_a, ca));
@@ -1699,7 +1699,7 @@ static int bch2_check_alloc_to_lru_ref(struct btree_trans *trans,
u64 lru_idx = alloc_lru_idx_fragmentation(*a, ca);
if (lru_idx) {
- ret = bch2_lru_check_set(trans, BCH_LRU_FRAGMENTATION_START,
+ ret = bch2_lru_check_set(trans, BCH_LRU_BUCKET_FRAGMENTATION,
lru_idx, alloc_k, last_flushed);
if (ret)
goto err;
diff --git a/fs/bcachefs/lru.h b/fs/bcachefs/lru.h
index 2facc0758cb3..398cc25db459 100644
--- a/fs/bcachefs/lru.h
+++ b/fs/bcachefs/lru.h
@@ -28,7 +28,7 @@ static inline enum bch_lru_type lru_type(struct bkey_s_c l)
{
u16 lru_id = l.k->p.inode >> 48;
- if (lru_id == BCH_LRU_FRAGMENTATION_START)
+ if (lru_id == BCH_LRU_BUCKET_FRAGMENTATION)
return BCH_LRU_fragmentation;
return BCH_LRU_read;
}
diff --git a/fs/bcachefs/lru_format.h b/fs/bcachefs/lru_format.h
index f372cb3b8cda..353a352d3fb9 100644
--- a/fs/bcachefs/lru_format.h
+++ b/fs/bcachefs/lru_format.h
@@ -17,7 +17,7 @@ enum bch_lru_type {
#undef x
};
-#define BCH_LRU_FRAGMENTATION_START ((1U << 16) - 1)
+#define BCH_LRU_BUCKET_FRAGMENTATION ((1U << 16) - 1)
#define LRU_TIME_BITS 48
#define LRU_TIME_MAX ((1ULL << LRU_TIME_BITS) - 1)
diff --git a/fs/bcachefs/movinggc.c b/fs/bcachefs/movinggc.c
index 21805509ab9e..4ecb721cfc85 100644
--- a/fs/bcachefs/movinggc.c
+++ b/fs/bcachefs/movinggc.c
@@ -168,8 +168,8 @@ static int bch2_copygc_get_buckets(struct moving_context *ctxt,
bch2_trans_begin(trans);
ret = for_each_btree_key_max(trans, iter, BTREE_ID_lru,
- lru_pos(BCH_LRU_FRAGMENTATION_START, 0, 0),
- lru_pos(BCH_LRU_FRAGMENTATION_START, U64_MAX, LRU_TIME_MAX),
+ lru_pos(BCH_LRU_BUCKET_FRAGMENTATION, 0, 0),
+ lru_pos(BCH_LRU_BUCKET_FRAGMENTATION, U64_MAX, LRU_TIME_MAX),
0, k, ({
struct move_bucket b = { .k.bucket = u64_to_bucket(k.k->p.offset) };
int ret2 = 0;
--
2.45.2
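To see why the renamed constant sits "at the end of the reserved bits for LRU type", it may help to look at how an LRU key position could be packed. The sketch below is an assumption based on the constants and the `inode >> 48` test visible in this patch: `struct pos` is a simplified stand-in for `struct bpos`, `lru_pos_id()` is a hypothetical helper, and the exact layout used by the real `lru_pos()` is not shown in this series.

```c
#include <assert.h>
#include <stdint.h>

#define LRU_TIME_BITS	48
#define LRU_TIME_MAX	((1ULL << LRU_TIME_BITS) - 1)
#define BCH_LRU_BUCKET_FRAGMENTATION	((1U << 16) - 1)

/* Simplified stand-in for struct bpos; only the two fields used here. */
struct pos {
	uint64_t inode;
	uint64_t offset;
};

/* Assumed packing: LRU id in the top 16 bits of inode, time in the low 48. */
static struct pos lru_pos(uint16_t lru_id, uint64_t dev_bucket, uint64_t time)
{
	assert(time <= LRU_TIME_MAX);
	return (struct pos) {
		.inode	= ((uint64_t) lru_id << LRU_TIME_BITS) | time,
		.offset	= dev_bucket,
	};
}

/* Mirrors the "inode >> 48" test in lru_type(). */
static uint16_t lru_pos_id(struct pos p)
{
	return p.inode >> LRU_TIME_BITS;
}

static uint64_t lru_pos_time(struct pos p)
{
	return p.inode & LRU_TIME_MAX;
}
```

Under this assumed layout, every key of the bucket fragmentation LRU has an inode field with the top 16 bits set, so the range scanned by copygc, from `lru_pos(id, 0, 0)` to `lru_pos(id, U64_MAX, LRU_TIME_MAX)`, covers the very end of the LRU btree's keyspace.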
* [PATCH 03/18] bcachefs: decouple bch2_lru_check_set() from alloc btree
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
2025-02-13 18:45 ` [PATCH 01/18] bcachefs: bch2_lru_change() checks for no-op Kent Overstreet
2025-02-13 18:45 ` [PATCH 02/18] bcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/ Kent Overstreet
@ 2025-02-13 18:45 ` Kent Overstreet
2025-02-13 18:45 ` [PATCH 04/18] bcachefs: Rework bch2_check_lru_key() Kent Overstreet
` (14 subsequent siblings)
17 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:45 UTC
To: linux-bcachefs; +Cc: Kent Overstreet
Pass in the backpointer explicitly, instead of assuming 'referring_k' is
an alloc key and calculating it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/alloc_background.c | 5 ++++-
fs/bcachefs/lru.c | 10 +++++-----
fs/bcachefs/lru.h | 2 +-
3 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/fs/bcachefs/alloc_background.c b/fs/bcachefs/alloc_background.c
index 87ff50a3cd81..58cdb6a0acf9 100644
--- a/fs/bcachefs/alloc_background.c
+++ b/fs/bcachefs/alloc_background.c
@@ -1700,6 +1700,7 @@ static int bch2_check_alloc_to_lru_ref(struct btree_trans *trans,
u64 lru_idx = alloc_lru_idx_fragmentation(*a, ca);
if (lru_idx) {
ret = bch2_lru_check_set(trans, BCH_LRU_BUCKET_FRAGMENTATION,
+ bucket_to_u64(alloc_k.k->p),
lru_idx, alloc_k, last_flushed);
if (ret)
goto err;
@@ -1729,7 +1730,9 @@ static int bch2_check_alloc_to_lru_ref(struct btree_trans *trans,
a = &a_mut->v;
}
- ret = bch2_lru_check_set(trans, alloc_k.k->p.inode, a->io_time[READ],
+ ret = bch2_lru_check_set(trans, alloc_k.k->p.inode,
+ bucket_to_u64(alloc_k.k->p),
+ a->io_time[READ],
alloc_k, last_flushed);
if (ret)
goto err;
diff --git a/fs/bcachefs/lru.c b/fs/bcachefs/lru.c
index 8ec16ae8daa6..dc6b9a80a8b5 100644
--- a/fs/bcachefs/lru.c
+++ b/fs/bcachefs/lru.c
@@ -78,7 +78,9 @@ static const char * const bch2_lru_types[] = {
};
int bch2_lru_check_set(struct btree_trans *trans,
- u16 lru_id, u64 time,
+ u16 lru_id,
+ u64 dev_bucket,
+ u64 time,
struct bkey_s_c referring_k,
struct bkey_buf *last_flushed)
{
@@ -87,9 +89,7 @@ int bch2_lru_check_set(struct btree_trans *trans,
struct btree_iter lru_iter;
struct bkey_s_c lru_k =
bch2_bkey_get_iter(trans, &lru_iter, BTREE_ID_lru,
- lru_pos(lru_id,
- bucket_to_u64(referring_k.k->p),
- time), 0);
+ lru_pos(lru_id, dev_bucket, time), 0);
int ret = bkey_err(lru_k);
if (ret)
return ret;
@@ -104,7 +104,7 @@ int bch2_lru_check_set(struct btree_trans *trans,
" %s",
bch2_lru_types[lru_type(lru_k)],
(bch2_bkey_val_to_text(&buf, c, referring_k), buf.buf))) {
- ret = bch2_lru_set(trans, lru_id, bucket_to_u64(referring_k.k->p), time);
+ ret = bch2_lru_set(trans, lru_id, dev_bucket, time);
if (ret)
goto err;
}
diff --git a/fs/bcachefs/lru.h b/fs/bcachefs/lru.h
index 398cc25db459..dea1d75cc9c1 100644
--- a/fs/bcachefs/lru.h
+++ b/fs/bcachefs/lru.h
@@ -58,7 +58,7 @@ static inline int bch2_lru_change(struct btree_trans *trans,
}
struct bkey_buf;
-int bch2_lru_check_set(struct btree_trans *, u16, u64, struct bkey_s_c, struct bkey_buf *);
+int bch2_lru_check_set(struct btree_trans *, u16, u64, u64, struct bkey_s_c, struct bkey_buf *);
int bch2_check_lrus(struct bch_fs *);
--
2.45.2
* [PATCH 04/18] bcachefs: Rework bch2_check_lru_key()
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
` (2 preceding siblings ...)
2025-02-13 18:45 ` [PATCH 03/18] bcachefs: decouple bch2_lru_check_set() from alloc btree Kent Overstreet
@ 2025-02-13 18:45 ` Kent Overstreet
2025-02-13 18:45 ` [PATCH 05/18] bcachefs: bch2_trigger_stripe_ptr() no longer uses ec_stripes_heap_lock Kent Overstreet
` (13 subsequent siblings)
17 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:45 UTC
To: linux-bcachefs; +Cc: Kent Overstreet
It's now easier to add new LRU types.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/lru.c | 77 +++++++++++++++++++++++++++++------------------
1 file changed, 47 insertions(+), 30 deletions(-)
diff --git a/fs/bcachefs/lru.c b/fs/bcachefs/lru.c
index dc6b9a80a8b5..98ab8496f29d 100644
--- a/fs/bcachefs/lru.c
+++ b/fs/bcachefs/lru.c
@@ -116,49 +116,67 @@ int bch2_lru_check_set(struct btree_trans *trans,
return ret;
}
+static struct bbpos lru_pos_to_bp(struct bkey_s_c lru_k)
+{
+ enum bch_lru_type type = lru_type(lru_k);
+
+ switch (type) {
+ case BCH_LRU_read:
+ case BCH_LRU_fragmentation:
+ return BBPOS(BTREE_ID_alloc, u64_to_bucket(lru_k.k->p.offset));
+ default:
+ BUG();
+ }
+}
+
+static u64 bkey_lru_type_idx(struct bch_fs *c,
+ enum bch_lru_type type,
+ struct bkey_s_c k)
+{
+ struct bch_alloc_v4 a_convert;
+ const struct bch_alloc_v4 *a;
+
+ switch (type) {
+ case BCH_LRU_read:
+ a = bch2_alloc_to_v4(k, &a_convert);
+ return alloc_lru_idx_read(*a);
+ case BCH_LRU_fragmentation: {
+ a = bch2_alloc_to_v4(k, &a_convert);
+
+ rcu_read_lock();
+ struct bch_dev *ca = bch2_dev_rcu_noerror(c, k.k->p.inode);
+ u64 idx = ca
+ ? alloc_lru_idx_fragmentation(*a, ca)
+ : 0;
+ rcu_read_unlock();
+ return idx;
+ }
+ default:
+ BUG();
+ }
+}
+
static int bch2_check_lru_key(struct btree_trans *trans,
struct btree_iter *lru_iter,
struct bkey_s_c lru_k,
struct bkey_buf *last_flushed)
{
struct bch_fs *c = trans->c;
- struct btree_iter iter;
- struct bkey_s_c k;
- struct bch_alloc_v4 a_convert;
- const struct bch_alloc_v4 *a;
struct printbuf buf1 = PRINTBUF;
struct printbuf buf2 = PRINTBUF;
- enum bch_lru_type type = lru_type(lru_k);
- struct bpos alloc_pos = u64_to_bucket(lru_k.k->p.offset);
- u64 idx;
- int ret;
-
- struct bch_dev *ca = bch2_dev_bucket_tryget_noerror(c, alloc_pos);
- if (fsck_err_on(!ca,
- trans, lru_entry_to_invalid_bucket,
- "lru key points to nonexistent device:bucket %llu:%llu",
- alloc_pos.inode, alloc_pos.offset))
- return bch2_btree_bit_mod_buffered(trans, BTREE_ID_lru, lru_iter->pos, false);
+ struct bbpos bp = lru_pos_to_bp(lru_k);
- k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_alloc, alloc_pos, 0);
- ret = bkey_err(k);
+ struct btree_iter iter;
+ struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, bp.btree, bp.pos, 0);
+ int ret = bkey_err(k);
if (ret)
goto err;
- a = bch2_alloc_to_v4(k, &a_convert);
-
- switch (type) {
- case BCH_LRU_read:
- idx = alloc_lru_idx_read(*a);
- break;
- case BCH_LRU_fragmentation:
- idx = alloc_lru_idx_fragmentation(*a, ca);
- break;
- }
+ enum bch_lru_type type = lru_type(lru_k);
+ u64 idx = bkey_lru_type_idx(c, type, k);
- if (lru_k.k->type != KEY_TYPE_set ||
- lru_pos_time(lru_k.k->p) != idx) {
+ if (lru_pos_time(lru_k.k->p) != idx) {
ret = bch2_btree_write_buffer_maybe_flush(trans, lru_k, last_flushed);
if (ret)
goto err;
@@ -176,7 +194,6 @@ static int bch2_check_lru_key(struct btree_trans *trans,
err:
fsck_err:
bch2_trans_iter_exit(trans, &iter);
- bch2_dev_put(ca);
printbuf_exit(&buf2);
printbuf_exit(&buf1);
return ret;
--
2.45.2
* [PATCH 05/18] bcachefs: bch2_trigger_stripe_ptr() no longer uses ec_stripes_heap_lock
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
` (3 preceding siblings ...)
2025-02-13 18:45 ` [PATCH 04/18] bcachefs: Rework bch2_check_lru_key() Kent Overstreet
@ 2025-02-13 18:45 ` Kent Overstreet
2025-02-13 18:45 ` [PATCH 06/18] bcachefs: Better trigger ordering Kent Overstreet
` (12 subsequent siblings)
17 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:45 UTC
To: linux-bcachefs; +Cc: Kent Overstreet
Introduce per-entry locks, like with struct bucket - the stripes heap is
going away.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/buckets.c | 6 +++---
fs/bcachefs/buckets.h | 27 ---------------------------
fs/bcachefs/buckets_types.h | 27 +++++++++++++++++++++++++++
fs/bcachefs/ec.h | 14 ++++++++++++++
fs/bcachefs/ec_types.h | 5 ++---
5 files changed, 46 insertions(+), 33 deletions(-)
diff --git a/fs/bcachefs/buckets.c b/fs/bcachefs/buckets.c
index 345b117a4a4a..88af61bc799d 100644
--- a/fs/bcachefs/buckets.c
+++ b/fs/bcachefs/buckets.c
@@ -674,10 +674,10 @@ static int bch2_trigger_stripe_ptr(struct btree_trans *trans,
return -BCH_ERR_ENOMEM_mark_stripe_ptr;
}
- mutex_lock(&c->ec_stripes_heap_lock);
+ gc_stripe_lock(m);
if (!m || !m->alive) {
- mutex_unlock(&c->ec_stripes_heap_lock);
+ gc_stripe_unlock(m);
struct printbuf buf = PRINTBUF;
bch2_bkey_val_to_text(&buf, c, k);
bch_err_ratelimited(c, "pointer to nonexistent stripe %llu\n while marking %s",
@@ -693,7 +693,7 @@ static int bch2_trigger_stripe_ptr(struct btree_trans *trans,
.type = BCH_DISK_ACCOUNTING_replicas,
};
memcpy(&acc.replicas, &m->r.e, replicas_entry_bytes(&m->r.e));
- mutex_unlock(&c->ec_stripes_heap_lock);
+ gc_stripe_unlock(m);
acc.replicas.data_type = data_type;
int ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1, true);
diff --git a/fs/bcachefs/buckets.h b/fs/bcachefs/buckets.h
index a9acdd6c0c86..6aeec1c0973c 100644
--- a/fs/bcachefs/buckets.h
+++ b/fs/bcachefs/buckets.h
@@ -39,33 +39,6 @@ static inline u64 sector_to_bucket_and_offset(const struct bch_dev *ca, sector_t
for (_b = (_buckets)->b + (_buckets)->first_bucket; \
_b < (_buckets)->b + (_buckets)->nbuckets; _b++)
-/*
- * Ugly hack alert:
- *
- * We need to cram a spinlock in a single byte, because that's what we have left
- * in struct bucket, and we care about the size of these - during fsck, we need
- * in memory state for every single bucket on every device.
- *
- * We used to do
- * while (xchg(&b->lock, 1) cpu_relax();
- * but, it turns out not all architectures support xchg on a single byte.
- *
- * So now we use bit_spin_lock(), with fun games since we can't burn a whole
- * ulong for this - we just need to make sure the lock bit always ends up in the
- * first byte.
- */
-
-#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-#define BUCKET_LOCK_BITNR 0
-#else
-#define BUCKET_LOCK_BITNR (BITS_PER_LONG - 1)
-#endif
-
-union ulong_byte_assert {
- ulong ulong;
- u8 byte;
-};
-
static inline void bucket_unlock(struct bucket *b)
{
BUILD_BUG_ON(!((union ulong_byte_assert) { .ulong = 1UL << BUCKET_LOCK_BITNR }).byte);
diff --git a/fs/bcachefs/buckets_types.h b/fs/bcachefs/buckets_types.h
index 7174047b8e92..900b8680c8b5 100644
--- a/fs/bcachefs/buckets_types.h
+++ b/fs/bcachefs/buckets_types.h
@@ -7,6 +7,33 @@
#define BUCKET_JOURNAL_SEQ_BITS 16
+/*
+ * Ugly hack alert:
+ *
+ * We need to cram a spinlock in a single byte, because that's what we have left
+ * in struct bucket, and we care about the size of these - during fsck, we need
+ * in memory state for every single bucket on every device.
+ *
+ * We used to do
+ * while (xchg(&b->lock, 1) cpu_relax();
+ * but, it turns out not all architectures support xchg on a single byte.
+ *
+ * So now we use bit_spin_lock(), with fun games since we can't burn a whole
+ * ulong for this - we just need to make sure the lock bit always ends up in the
+ * first byte.
+ */
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+#define BUCKET_LOCK_BITNR 0
+#else
+#define BUCKET_LOCK_BITNR (BITS_PER_LONG - 1)
+#endif
+
+union ulong_byte_assert {
+ ulong ulong;
+ u8 byte;
+};
+
struct bucket {
u8 lock;
u8 gen_valid:1;
diff --git a/fs/bcachefs/ec.h b/fs/bcachefs/ec.h
index 583ca6a226da..4c9511887655 100644
--- a/fs/bcachefs/ec.h
+++ b/fs/bcachefs/ec.h
@@ -132,6 +132,20 @@ static inline bool bch2_ptr_matches_stripe_m(const struct gc_stripe *m,
m->sectors);
}
+static inline void gc_stripe_unlock(struct gc_stripe *s)
+{
+ BUILD_BUG_ON(!((union ulong_byte_assert) { .ulong = 1UL << BUCKET_LOCK_BITNR }).byte);
+
+ clear_bit_unlock(BUCKET_LOCK_BITNR, (void *) &s->lock);
+ wake_up_bit((void *) &s->lock, BUCKET_LOCK_BITNR);
+}
+
+static inline void gc_stripe_lock(struct gc_stripe *s)
+{
+ wait_on_bit_lock((void *) &s->lock, BUCKET_LOCK_BITNR,
+ TASK_UNINTERRUPTIBLE);
+}
+
struct bch_read_bio;
struct ec_stripe_buf {
diff --git a/fs/bcachefs/ec_types.h b/fs/bcachefs/ec_types.h
index 8d1e70e830ac..37558cc2d89f 100644
--- a/fs/bcachefs/ec_types.h
+++ b/fs/bcachefs/ec_types.h
@@ -20,12 +20,11 @@ struct stripe {
};
struct gc_stripe {
+ u8 lock;
+ unsigned alive:1; /* does a corresponding key exist in stripes btree? */
u16 sectors;
-
u8 nr_blocks;
u8 nr_redundant;
-
- unsigned alive:1; /* does a corresponding key exist in stripes btree? */
u16 block_sectors[BCH_BKEY_PTRS_MAX];
struct bch_extent_ptr ptrs[BCH_BKEY_PTRS_MAX];
--
2.45.2
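The single-byte lock relies on the lock bit of an `unsigned long` landing in its first byte, which is what the `BUILD_BUG_ON()` with `union ulong_byte_assert` verifies at compile time. A userspace sketch of the same check, done at run time, might look like the following. `LOCK_BITNR`, `union ulong_byte` and `lock_bit_in_first_byte()` are hypothetical names, and `__BYTE_ORDER__` is assumed to be provided by the compiler (GCC and Clang define it).

```c
#include <limits.h>
#include <stdint.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Pick the bit that lives in the lowest-address byte on this machine. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define LOCK_BITNR	0
#else
#define LOCK_BITNR	(BITS_PER_LONG - 1)
#endif

union ulong_byte {
	unsigned long	ulong;
	uint8_t		byte;	/* aliases the first (lowest-address) byte */
};

/* Nonzero iff setting bit LOCK_BITNR of an unsigned long touches its first byte. */
static int lock_bit_in_first_byte(void)
{
	union ulong_byte u = { .ulong = 1UL << LOCK_BITNR };

	return u.byte != 0;
}
```

Because bit operations work on whole longs, the `u8 lock` field is cast to a long-sized pointer; keeping the lock bit in the first byte is what prevents the lock operations from modifying the fields that follow `lock` in the struct.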
* [PATCH 06/18] bcachefs: Better trigger ordering
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
` (4 preceding siblings ...)
2025-02-13 18:45 ` [PATCH 05/18] bcachefs: bch2_trigger_stripe_ptr() no longer uses ec_stripes_heap_lock Kent Overstreet
@ 2025-02-13 18:45 ` Kent Overstreet
2025-02-13 18:45 ` [PATCH 07/18] bcachefs: rework bch2_trans_commit_run_triggers() Kent Overstreet
` (11 subsequent siblings)
17 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:45 UTC
To: linux-bcachefs; +Cc: Kent Overstreet
Transactional triggers need to run in a defined ordering, which is not
quite the same as btree ID integer comparison.
Previously this was handled in a hacky way in
bch2_trans_commit_run_triggers(), since it was only the alloc btree that
needed special handling, but upcoming stripe btree changes are going to
require more ordering changes - so, define that ordering.
Next patch will change the transaction commit path to use it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/btree_types.h | 13 +++++++++++++
fs/bcachefs/btree_update.c | 1 +
2 files changed, 14 insertions(+)
diff --git a/fs/bcachefs/btree_types.h b/fs/bcachefs/btree_types.h
index a09cbe9cd94f..77578da2d23f 100644
--- a/fs/bcachefs/btree_types.h
+++ b/fs/bcachefs/btree_types.h
@@ -423,6 +423,7 @@ static inline struct bpos btree_node_pos(struct btree_bkey_cached_common *b)
struct btree_insert_entry {
unsigned flags;
+ u8 sort_order;
u8 bkey_type;
enum btree_id btree_id:8;
u8 level:4;
@@ -853,6 +854,18 @@ static inline bool btree_type_uses_write_buffer(enum btree_id btree)
return BIT_ULL(btree) & mask;
}
+static inline u8 btree_trigger_order(enum btree_id btree)
+{
+ switch (btree) {
+ case BTREE_ID_alloc:
+ return U8_MAX;
+ case BTREE_ID_stripes:
+ return U8_MAX - 1;
+ default:
+ return btree;
+ }
+}
+
struct btree_root {
struct btree *b;
diff --git a/fs/bcachefs/btree_update.c b/fs/bcachefs/btree_update.c
index 13d794f201a5..47e54eedd0bc 100644
--- a/fs/bcachefs/btree_update.c
+++ b/fs/bcachefs/btree_update.c
@@ -397,6 +397,7 @@ bch2_trans_update_by_path(struct btree_trans *trans, btree_path_idx_t path_idx,
n = (struct btree_insert_entry) {
.flags = flags,
+ .sort_order = btree_trigger_order(path->btree_id),
.bkey_type = __btree_node_type(path->level, path->btree_id),
.btree_id = path->btree_id,
.level = path->level,
--
2.45.2
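The effect of the new ordering can be illustrated with a small standalone sketch: sorting by `btree_trigger_order()` instead of by raw btree ID pushes stripes and alloc to the end. The enum below is a hypothetical, shortened list (the real `enum btree_id` has more entries and different values), and `cmp_trigger_order()` and `sort_by_trigger_order()` are illustrative helpers, not kernel functions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, shortened btree ID list; the real enum has many more entries. */
enum btree_id {
	BTREE_ID_extents,
	BTREE_ID_inodes,
	BTREE_ID_alloc,
	BTREE_ID_stripes,
	BTREE_ID_lru,
};

/* Same shape as btree_trigger_order() in the patch. */
static uint8_t btree_trigger_order(enum btree_id btree)
{
	switch (btree) {
	case BTREE_ID_alloc:
		return UINT8_MAX;
	case BTREE_ID_stripes:
		return UINT8_MAX - 1;
	default:
		return btree;
	}
}

static int cmp_trigger_order(const void *l, const void *r)
{
	uint8_t a = btree_trigger_order(*(const enum btree_id *) l);
	uint8_t b = btree_trigger_order(*(const enum btree_id *) r);

	return (a > b) - (a < b);
}

/* Sort btree IDs into the order their triggers would run in. */
static void sort_by_trigger_order(enum btree_id *ids, size_t nr)
{
	qsort(ids, nr, sizeof(ids[0]), cmp_trigger_order);
}
```

With this sketch, sorting `{ alloc, lru, stripes, extents }` yields `extents, lru, stripes, alloc`: ordinary btrees keep their ID order, stripes triggers run second to last, and alloc triggers run last.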
* [PATCH 07/18] bcachefs: rework bch2_trans_commit_run_triggers()
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
` (5 preceding siblings ...)
2025-02-13 18:45 ` [PATCH 06/18] bcachefs: Better trigger ordering Kent Overstreet
@ 2025-02-13 18:45 ` Kent Overstreet
2025-02-13 18:45 ` [PATCH 08/18] bcachefs: bcachefs_metadata_version_cached_backpointers Kent Overstreet
` (10 subsequent siblings)
17 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:45 UTC
To: linux-bcachefs; +Cc: Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/btree_trans_commit.c | 89 ++++++++++++--------------------
fs/bcachefs/btree_update.c | 2 +-
2 files changed, 34 insertions(+), 57 deletions(-)
diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index c4f524b2ca9a..892d20a50a52 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -336,6 +336,7 @@ static inline void btree_insert_entry_checks(struct btree_trans *trans,
BUG_ON(i->cached != path->cached);
BUG_ON(i->level != path->level);
BUG_ON(i->btree_id != path->btree_id);
+ BUG_ON(i->bkey_type != __btree_node_type(path->level, path->btree_id));
EBUG_ON(!i->level &&
btree_type_has_snapshots(i->btree_id) &&
!(i->flags & BTREE_UPDATE_internal_snapshot_node) &&
@@ -517,69 +518,45 @@ static int run_one_trans_trigger(struct btree_trans *trans, struct btree_insert_
}
}
-static int run_btree_triggers(struct btree_trans *trans, enum btree_id btree_id,
- unsigned *btree_id_updates_start)
+static int bch2_trans_commit_run_triggers(struct btree_trans *trans)
{
- bool trans_trigger_run;
+ unsigned sort_id_start = 0;
- /*
- * Running triggers will append more updates to the list of updates as
- * we're walking it:
- */
- do {
- trans_trigger_run = false;
+ while (sort_id_start < trans->nr_updates) {
+ unsigned i, sort_id = trans->updates[sort_id_start].sort_order;
+ bool trans_trigger_run;
- for (unsigned i = *btree_id_updates_start;
- i < trans->nr_updates && trans->updates[i].btree_id <= btree_id;
- i++) {
- if (trans->updates[i].btree_id < btree_id) {
- *btree_id_updates_start = i;
- continue;
+ /*
+ * For a given btree, this algorithm runs insert triggers before
+ * overwrite triggers: this is so that when extents are being
+ * moved (e.g. by FALLOCATE_FL_INSERT_RANGE), we don't drop
+ * references before they are re-added.
+ *
+ * Running triggers will append more updates to the list of
+ * updates as we're walking it:
+ */
+ do {
+ trans_trigger_run = false;
+
+ for (i = sort_id_start;
+ i < trans->nr_updates && trans->updates[i].sort_order <= sort_id;
+ i++) {
+ if (trans->updates[i].sort_order < sort_id) {
+ sort_id_start = i;
+ continue;
+ }
+
+ int ret = run_one_trans_trigger(trans, trans->updates + i);
+ if (ret < 0)
+ return ret;
+ if (ret)
+ trans_trigger_run = true;
}
+ } while (trans_trigger_run);
- int ret = run_one_trans_trigger(trans, trans->updates + i);
- if (ret < 0)
- return ret;
- if (ret)
- trans_trigger_run = true;
- }
- } while (trans_trigger_run);
-
- trans_for_each_update(trans, i)
- BUG_ON(!(i->flags & BTREE_TRIGGER_norun) &&
- i->btree_id == btree_id &&
- btree_node_type_has_trans_triggers(i->bkey_type) &&
- (!i->insert_trigger_run || !i->overwrite_trigger_run));
-
- return 0;
-}
-
-static int bch2_trans_commit_run_triggers(struct btree_trans *trans)
-{
- unsigned btree_id = 0, btree_id_updates_start = 0;
- int ret = 0;
-
- /*
- *
- * For a given btree, this algorithm runs insert triggers before
- * overwrite triggers: this is so that when extents are being moved
- * (e.g. by FALLOCATE_FL_INSERT_RANGE), we don't drop references before
- * they are re-added.
- */
- for (btree_id = 0; btree_id < BTREE_ID_NR; btree_id++) {
- if (btree_id == BTREE_ID_alloc)
- continue;
-
- ret = run_btree_triggers(trans, btree_id, &btree_id_updates_start);
- if (ret)
- return ret;
+ sort_id_start = i;
}
- btree_id_updates_start = 0;
- ret = run_btree_triggers(trans, BTREE_ID_alloc, &btree_id_updates_start);
- if (ret)
- return ret;
-
#ifdef CONFIG_BCACHEFS_DEBUG
trans_for_each_update(trans, i)
BUG_ON(!(i->flags & BTREE_TRIGGER_norun) &&
diff --git a/fs/bcachefs/btree_update.c b/fs/bcachefs/btree_update.c
index 47e54eedd0bc..b3e346b5f8d7 100644
--- a/fs/bcachefs/btree_update.c
+++ b/fs/bcachefs/btree_update.c
@@ -17,7 +17,7 @@
static inline int btree_insert_entry_cmp(const struct btree_insert_entry *l,
const struct btree_insert_entry *r)
{
- return cmp_int(l->btree_id, r->btree_id) ?:
+ return cmp_int(l->sort_order, r->sort_order) ?:
cmp_int(l->cached, r->cached) ?:
-cmp_int(l->level, r->level) ?:
bpos_cmp(l->k->k.p, r->k->k.p);
--
2.45.2
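The reworked loop walks the update list one sort-order group at a time and keeps re-scanning a group until no trigger in it runs, since a trigger may append further updates. The following userspace model is a simplified sketch of that control flow only: `struct update`, `struct trans`, `insert_update()`, `run_one_trigger()` and the `spawn` field are hypothetical, insert and overwrite triggers are collapsed into one, and it assumes a trigger only generates updates with the same or a later sort order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_UPDATES	16

struct update {
	uint8_t	sort_order;
	bool	trigger_run;
	int	spawn;	/* sort_order of an update this trigger adds, or -1 */
};

struct trans {
	struct update	updates[MAX_UPDATES];
	unsigned	nr_updates;
	uint8_t		log[MAX_UPDATES];	/* order in which triggers ran */
	unsigned	nr_log;
};

/* Keep the list sorted by sort_order; new entries go after equal ones. */
static void insert_update(struct trans *t, uint8_t sort_order, int spawn)
{
	unsigned pos = 0;

	assert(t->nr_updates < MAX_UPDATES);
	while (pos < t->nr_updates && t->updates[pos].sort_order <= sort_order)
		pos++;

	memmove(&t->updates[pos + 1], &t->updates[pos],
		(t->nr_updates - pos) * sizeof(t->updates[0]));
	t->updates[pos] = (struct update) { .sort_order = sort_order, .spawn = spawn };
	t->nr_updates++;
}

/* Returns 1 if a trigger ran, 0 if it had already been run. */
static int run_one_trigger(struct trans *t, unsigned idx)
{
	struct update *u = &t->updates[idx];

	if (u->trigger_run)
		return 0;

	u->trigger_run = true;
	t->log[t->nr_log++] = u->sort_order;

	if (u->spawn >= 0) {
		assert(u->spawn >= u->sort_order);	/* assumption of this sketch */
		insert_update(t, (uint8_t) u->spawn, -1);
	}
	return 1;
}

static void run_triggers(struct trans *t)
{
	unsigned start = 0;

	while (start < t->nr_updates) {
		unsigned i, sort_id = t->updates[start].sort_order;
		bool ran;

		/* Re-scan this group until no trigger in it runs. */
		do {
			ran = false;
			for (i = start;
			     i < t->nr_updates && t->updates[i].sort_order <= sort_id;
			     i++)
				if (run_one_trigger(t, i))
					ran = true;
		} while (ran);

		start = i;
	}
}
```

In this model, an update with sort order 0 whose trigger spawns an update with sort order 255 is processed first, an existing update with sort order 254 runs next, and the spawned update runs last, which matches the intent that alloc triggers run after everything else.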
* [PATCH 08/18] bcachefs: bcachefs_metadata_version_cached_backpointers
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
` (6 preceding siblings ...)
2025-02-13 18:45 ` [PATCH 07/18] bcachefs: rework bch2_trans_commit_run_triggers() Kent Overstreet
@ 2025-02-13 18:45 ` Kent Overstreet
2025-02-13 18:45 ` [PATCH 09/18] bcachefs: Invalidate cached data by backpointers Kent Overstreet
` (9 subsequent siblings)
17 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:45 UTC
To: linux-bcachefs; +Cc: Kent Overstreet
Cached pointers now have backpointers.
This means that we'll be able to kill cached pointers in the
bucket_invalidate path, when invalidating/reusing buckets containing
cached data, instead of leaving them around to be cleaned up by gc_gens
garbage collection - which requires a full metadata scan.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/backpointers.c | 14 +++++++-------
fs/bcachefs/bcachefs_format.h | 3 ++-
fs/bcachefs/buckets.c | 8 +++-----
fs/bcachefs/sb-downgrade.c | 3 +++
4 files changed, 15 insertions(+), 13 deletions(-)
diff --git a/fs/bcachefs/backpointers.c b/fs/bcachefs/backpointers.c
index bb799b86aa69..c9dfc3657696 100644
--- a/fs/bcachefs/backpointers.c
+++ b/fs/bcachefs/backpointers.c
@@ -611,9 +611,6 @@ static int check_extent_to_backpointers(struct btree_trans *trans,
struct extent_ptr_decoded p;
bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
- if (p.ptr.cached)
- continue;
-
if (p.ptr.dev == BCH_SB_MEMBER_INVALID)
continue;
@@ -621,9 +618,11 @@ static int check_extent_to_backpointers(struct btree_trans *trans,
struct bch_dev *ca = bch2_dev_rcu_noerror(c, p.ptr.dev);
bool check = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_mismatches);
bool empty = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_empty);
+
+ bool stale = p.ptr.cached && (!ca || dev_ptr_stale_rcu(ca, &p.ptr));
rcu_read_unlock();
- if (check || empty) {
+ if ((check || empty) && !stale) {
struct bkey_i_backpointer bp;
bch2_extent_ptr_to_bp(c, btree, level, k, p, entry, &bp);
@@ -857,9 +856,8 @@ static int check_bucket_backpointer_mismatch(struct btree_trans *trans, struct b
goto err;
}
- /* Cached pointers don't have backpointers: */
-
if (sectors[ALLOC_dirty] != a->dirty_sectors ||
+ sectors[ALLOC_cached] != a->cached_sectors ||
sectors[ALLOC_stripe] != a->stripe_sectors) {
if (c->sb.version_upgrade_complete >= bcachefs_metadata_version_backpointer_bucket_gen) {
ret = bch2_backpointers_maybe_flush(trans, alloc_k, last_flushed);
@@ -868,6 +866,7 @@ static int check_bucket_backpointer_mismatch(struct btree_trans *trans, struct b
}
if (sectors[ALLOC_dirty] > a->dirty_sectors ||
+ sectors[ALLOC_cached] > a->cached_sectors ||
sectors[ALLOC_stripe] > a->stripe_sectors) {
ret = check_bucket_backpointers_to_extents(trans, ca, alloc_k.k->p) ?:
-BCH_ERR_transaction_restart_nested;
@@ -875,7 +874,8 @@ static int check_bucket_backpointer_mismatch(struct btree_trans *trans, struct b
}
if (!sectors[ALLOC_dirty] &&
- !sectors[ALLOC_stripe])
+ !sectors[ALLOC_stripe] &&
+ !sectors[ALLOC_cached])
__set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_empty);
else
__set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_mismatches);
diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h
index f70f0108401f..ef5009b18dd5 100644
--- a/fs/bcachefs/bcachefs_format.h
+++ b/fs/bcachefs/bcachefs_format.h
@@ -686,7 +686,8 @@ struct bch_sb_field_ext {
x(inode_depth, BCH_VERSION(1, 17)) \
x(persistent_inode_cursors, BCH_VERSION(1, 18)) \
x(autofix_errors, BCH_VERSION(1, 19)) \
- x(directory_size, BCH_VERSION(1, 20))
+ x(directory_size, BCH_VERSION(1, 20)) \
+ x(cached_backpointers, BCH_VERSION(1, 21))
enum bcachefs_metadata_version {
bcachefs_metadata_version_min = 9,
diff --git a/fs/bcachefs/buckets.c b/fs/bcachefs/buckets.c
index 88af61bc799d..bb7742cf0014 100644
--- a/fs/bcachefs/buckets.c
+++ b/fs/bcachefs/buckets.c
@@ -590,11 +590,9 @@ static int bch2_trigger_pointer(struct btree_trans *trans,
if (ret)
goto err;
- if (!p.ptr.cached) {
- ret = bch2_bucket_backpointer_mod(trans, k, &bp, insert);
- if (ret)
- goto err;
- }
+ ret = bch2_bucket_backpointer_mod(trans, k, &bp, insert);
+ if (ret)
+ goto err;
}
if (flags & BTREE_TRIGGER_gc) {
diff --git a/fs/bcachefs/sb-downgrade.c b/fs/bcachefs/sb-downgrade.c
index 14f6b6a5fb38..34fd897680b3 100644
--- a/fs/bcachefs/sb-downgrade.c
+++ b/fs/bcachefs/sb-downgrade.c
@@ -94,6 +94,9 @@
x(directory_size, \
BIT_ULL(BCH_RECOVERY_PASS_check_inodes), \
BCH_FSCK_ERR_directory_size_mismatch) \
+ x(cached_backpointers, \
+ BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers),\
+ BCH_FSCK_ERR_ptr_to_missing_backpointer)
#define DOWNGRADE_TABLE() \
x(bucket_stripe_sectors, \
--
2.45.2
* [PATCH 09/18] bcachefs: Invalidate cached data by backpointers
From: Kent Overstreet @ 2025-02-13 18:45 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Kent Overstreet
If we don't leave stale pointers around, we won't have to deal with
bucket gen wraparound.
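To make the idea concrete, here is a minimal sketch of invalidating a
bucket by walking its backpointers. It is a toy model under stated
assumptions: flat arrays stand in for the backpointers and extents
btrees, and the toy_* names are hypothetical. What it shares with the
patch is the filtering step - only backpointers whose bucket_gen matches
the bucket's current gen are followed, and the pointer into the bucket
is dropped from the extent they refer to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one backpointer key. */
struct toy_bp {
	uint64_t bucket;	/* bucket the backpointer lives in */
	uint8_t	 bucket_gen;	/* bucket gen when the pointer was created */
	size_t	 extent;	/* index of the extent that owns the pointer */
};

/* Hypothetical stand-in for an extent holding one cached pointer. */
struct toy_extent {
	bool has_cached_ptr;
};

/*
 * Drop the cached pointer from every extent that has a backpointer in
 * @bucket with a matching generation. Returns how many were dropped.
 */
static unsigned toy_invalidate_bucket(const struct toy_bp *bps, size_t nr_bps,
				      struct toy_extent *extents, size_t nr_extents,
				      uint64_t bucket, uint8_t gen)
{
	unsigned dropped = 0;

	for (size_t i = 0; i < nr_bps; i++) {
		if (bps[i].bucket != bucket)
			continue;

		/* Filter out backpointers with gens that don't match: */
		if (bps[i].bucket_gen != gen)
			continue;

		assert(bps[i].extent < nr_extents);

		if (extents[bps[i].extent].has_cached_ptr) {
			extents[bps[i].extent].has_cached_ptr = false;
			dropped++;
		}
	}

	return dropped;
}
```

Because the pointers are removed at invalidation time, nothing stale is
left behind for a later full scan to find.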
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/alloc_background.c | 102 +++++++++++++++++++++++++--------
1 file changed, 79 insertions(+), 23 deletions(-)
diff --git a/fs/bcachefs/alloc_background.c b/fs/bcachefs/alloc_background.c
index 58cdb6a0acf9..97c2df18dfa4 100644
--- a/fs/bcachefs/alloc_background.c
+++ b/fs/bcachefs/alloc_background.c
@@ -2055,16 +2055,71 @@ static void bch2_discard_one_bucket_fast(struct bch_dev *ca, u64 bucket)
bch2_write_ref_put(c, BCH_WRITE_REF_discard_fast);
}
+static int invalidate_one_bp(struct btree_trans *trans,
+ struct bch_dev *ca,
+ struct bkey_s_c_backpointer bp,
+ struct bkey_buf *last_flushed)
+{
+ struct btree_iter extent_iter;
+ struct bkey_s_c extent_k =
+ bch2_backpointer_get_key(trans, bp, &extent_iter, 0, last_flushed);
+ int ret = bkey_err(extent_k);
+ if (ret)
+ return ret;
+
+ struct bkey_i *n =
+ bch2_bkey_make_mut(trans, &extent_iter, &extent_k,
+ BTREE_UPDATE_internal_snapshot_node);
+ ret = PTR_ERR_OR_ZERO(n);
+ if (ret)
+ goto err;
+
+ bch2_bkey_drop_device(bkey_i_to_s(n), ca->dev_idx);
+err:
+ bch2_trans_iter_exit(trans, &extent_iter);
+ return ret;
+}
+
+static int invalidate_one_bucket_by_bps(struct btree_trans *trans,
+ struct bch_dev *ca,
+ struct bpos bucket,
+ u8 gen,
+ struct bkey_buf *last_flushed)
+{
+ struct bpos bp_start = bucket_pos_to_bp_start(ca, bucket);
+ struct bpos bp_end = bucket_pos_to_bp_end(ca, bucket);
+
+ return for_each_btree_key_max_commit(trans, iter, BTREE_ID_backpointers,
+ bp_start, bp_end, 0, k,
+ NULL, NULL,
+ BCH_WATERMARK_btree|
+ BCH_TRANS_COMMIT_no_enospc, ({
+ if (k.k->type != KEY_TYPE_backpointer)
+ continue;
+
+ struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(k);
+
+ if (bp.v->bucket_gen != gen)
+ continue;
+
+ /* filter out bps with gens that don't match */
+
+ invalidate_one_bp(trans, ca, bp, last_flushed);
+ }));
+}
+
+noinline_for_stack
static int invalidate_one_bucket(struct btree_trans *trans,
+ struct bch_dev *ca,
struct btree_iter *lru_iter,
struct bkey_s_c lru_k,
+ struct bkey_buf *last_flushed,
s64 *nr_to_invalidate)
{
struct bch_fs *c = trans->c;
- struct bkey_i_alloc_v4 *a = NULL;
struct printbuf buf = PRINTBUF;
struct bpos bucket = u64_to_bucket(lru_k.k->p.offset);
- unsigned cached_sectors;
+ struct btree_iter alloc_iter = {};
int ret = 0;
if (*nr_to_invalidate <= 0)
@@ -2081,13 +2136,18 @@ static int invalidate_one_bucket(struct btree_trans *trans,
if (bch2_bucket_is_open_safe(c, bucket.inode, bucket.offset))
return 0;
- a = bch2_trans_start_alloc_update(trans, bucket, BTREE_TRIGGER_bucket_invalidate);
- ret = PTR_ERR_OR_ZERO(a);
+ struct bkey_s_c alloc_k = bch2_bkey_get_iter(trans, &alloc_iter,
+ BTREE_ID_alloc, bucket,
+ BTREE_ITER_cached);
+ ret = bkey_err(alloc_k);
if (ret)
- goto out;
+ return ret;
+
+ struct bch_alloc_v4 a_convert;
+ const struct bch_alloc_v4 *a = bch2_alloc_to_v4(alloc_k, &a_convert);
/* We expect harmless races here due to the btree write buffer: */
- if (lru_pos_time(lru_iter->pos) != alloc_lru_idx_read(a->v))
+ if (lru_pos_time(lru_iter->pos) != alloc_lru_idx_read(*a))
goto out;
/*
@@ -2097,26 +2157,16 @@ static int invalidate_one_bucket(struct btree_trans *trans,
*
* bch2_lru_validate() also disallows lru keys with lru_pos_time() == 0
*/
- BUG_ON(a->v.data_type != BCH_DATA_cached);
- BUG_ON(a->v.dirty_sectors);
+ BUG_ON(a->data_type != BCH_DATA_cached);
+ BUG_ON(a->dirty_sectors);
- if (!a->v.cached_sectors)
+ if (!a->cached_sectors)
bch_err(c, "invalidating empty bucket, confused");
- cached_sectors = a->v.cached_sectors;
+ unsigned cached_sectors = a->cached_sectors;
+ u8 gen = a->gen;
- SET_BCH_ALLOC_V4_NEED_INC_GEN(&a->v, false);
- a->v.gen++;
- a->v.data_type = 0;
- a->v.dirty_sectors = 0;
- a->v.stripe_sectors = 0;
- a->v.cached_sectors = 0;
- a->v.io_time[READ] = bch2_current_io_time(c, READ);
- a->v.io_time[WRITE] = bch2_current_io_time(c, WRITE);
-
- ret = bch2_trans_commit(trans, NULL, NULL,
- BCH_WATERMARK_btree|
- BCH_TRANS_COMMIT_no_enospc);
+ ret = invalidate_one_bucket_by_bps(trans, ca, bucket, gen, last_flushed);
if (ret)
goto out;
@@ -2124,6 +2174,7 @@ static int invalidate_one_bucket(struct btree_trans *trans,
--*nr_to_invalidate;
out:
fsck_err:
+ bch2_trans_iter_exit(trans, &alloc_iter);
printbuf_exit(&buf);
return ret;
}
@@ -2150,6 +2201,10 @@ static void bch2_do_invalidates_work(struct work_struct *work)
struct btree_trans *trans = bch2_trans_get(c);
int ret = 0;
+ struct bkey_buf last_flushed;
+ bch2_bkey_buf_init(&last_flushed);
+ bkey_init(&last_flushed.k->k);
+
ret = bch2_btree_write_buffer_tryflush(trans);
if (ret)
goto err;
@@ -2174,7 +2229,7 @@ static void bch2_do_invalidates_work(struct work_struct *work)
if (!k.k)
break;
- ret = invalidate_one_bucket(trans, &iter, k, &nr_to_invalidate);
+ ret = invalidate_one_bucket(trans, ca, &iter, k, &last_flushed, &nr_to_invalidate);
restart_err:
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
continue;
@@ -2187,6 +2242,7 @@ static void bch2_do_invalidates_work(struct work_struct *work)
err:
bch2_trans_put(trans);
percpu_ref_put(&ca->io_ref);
+ bch2_bkey_buf_exit(&last_flushed, c);
bch2_write_ref_put(c, BCH_WRITE_REF_invalidate);
}
--
2.45.2
* [PATCH 10/18] bcachefs: Advance bch_alloc.oldest_gen if no stale pointers
From: Kent Overstreet @ 2025-02-13 18:45 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Kent Overstreet
Now that we've got cached backpointers and aren't leaving around stale
pointers on bucket invalidation, we no longer need the periodic (rare)
gc_gens - which recalculates each bucket's oldest gen to avoid wraparound.
We can't delete that code because we've got to support existing
filesystems that will still have stale pointers, but this gets rid of
another scalability limit.
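A small sketch of why this avoids wraparound, using a hypothetical
toy_alloc struct rather than struct bch_alloc_v4: generations are 8 bits
wide, so what matters is the distance between gen and oldest_gen. If
oldest_gen is advanced together with gen whenever the bucket is empty
and the two were already equal, that distance stays at zero no matter
how often the bucket is reused.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant alloc key fields. */
struct toy_alloc {
	uint8_t	 gen;		/* current bucket generation */
	uint8_t	 oldest_gen;	/* oldest gen any pointer may still carry */
	uint32_t sectors_total;	/* live sectors of any kind in the bucket */
};

/* Distance between gen and oldest_gen, modulo 256. */
static uint8_t toy_gen_age(const struct toy_alloc *a)
{
	assert(a);
	return (uint8_t)(a->gen - a->oldest_gen);
}

/*
 * Bump the generation on bucket reuse. As in the hunk below, oldest_gen
 * follows gen only when no older pointers can exist (oldest_gen == gen)
 * and the bucket holds no sectors.
 */
static void toy_bucket_inc_gen(struct toy_alloc *a)
{
	assert(a);

	if (a->oldest_gen == a->gen && !a->sectors_total)
		a->oldest_gen++;
	a->gen++;
}
```

On an existing filesystem with stale pointers still around, oldest_gen
lags behind gen and the distance keeps growing, which is the case the
retained gc_gens code still has to handle.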
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/alloc_background.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/bcachefs/alloc_background.c b/fs/bcachefs/alloc_background.c
index 97c2df18dfa4..c5c8497a6339 100644
--- a/fs/bcachefs/alloc_background.c
+++ b/fs/bcachefs/alloc_background.c
@@ -871,6 +871,9 @@ int bch2_trigger_alloc(struct btree_trans *trans,
if (data_type_is_empty(new_a->data_type) &&
BCH_ALLOC_V4_NEED_INC_GEN(new_a) &&
!bch2_bucket_is_open_safe(c, new.k->p.inode, new.k->p.offset)) {
+ if (new_a->oldest_gen == new_a->gen &&
+ !bch2_bucket_sectors_total(*new_a))
+ new_a->oldest_gen++;
new_a->gen++;
SET_BCH_ALLOC_V4_NEED_INC_GEN(new_a, false);
alloc_data_type_set(new_a, new_a->data_type);
--
2.45.2
* [PATCH 11/18] bcachefs: bcachefs_metadata_version_stripe_backpointers
From: Kent Overstreet @ 2025-02-13 18:45 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Kent Overstreet
Stripes now have backpointers.
This is needed for proper scrub - stripe checksums need to be verified,
separately from extents within the stripe, since a block may not be full
of live extents but it's still needed for reconstruction.
And this will be needed for (efficient) evacuate/repair paths.
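The position arithmetic for stripe backpointers can be sketched as
follows. This is a standalone illustration, not the in-tree helper: the
toy_* functions are hypothetical, and the shift value of 10 is an
assumption standing in for MAX_EXTENT_COMPRESS_RATIO_SHIFT. The two
formulas are the ones in the backpointers.h hunk below: extent
backpointers sit at (ptr offset << shift) + crc offset, while a stripe
backpointer sits at the very last position belonging to the stripe
block, i.e. ((ptr offset + stripe sectors) << shift) - 1.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, standing in for MAX_EXTENT_COMPRESS_RATIO_SHIFT. */
#define TOY_COMPRESS_RATIO_SHIFT	10

/* Backpointer offset for an ordinary extent pointer. */
static uint64_t toy_extent_bp_offset(uint64_t ptr_offset, uint64_t crc_offset)
{
	return (ptr_offset << TOY_COMPRESS_RATIO_SHIFT) + crc_offset;
}

/* Backpointer offset for a stripe block starting at @ptr_offset. */
static uint64_t toy_stripe_bp_offset(uint64_t ptr_offset, uint16_t stripe_sectors)
{
	assert(stripe_sectors > 0);

	return ((ptr_offset + stripe_sectors) << TOY_COMPRESS_RATIO_SHIFT) - 1;
}
```

For a stripe block at sector 1024 that is 128 sectors long, the stripe
backpointer lands after an extent starting in the block's last sector
with a zero crc offset, and just before the first position of whatever
follows the block. An extent in that last sector with the largest
possible crc offset would land on the same slot; the sketch assumes that
does not happen in practice.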
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/backpointers.h | 15 ++++++++++++++-
fs/bcachefs/bcachefs_format.h | 3 ++-
fs/bcachefs/ec.c | 16 ++++++++++++++--
fs/bcachefs/move.c | 3 +++
fs/bcachefs/sb-downgrade.c | 3 +++
5 files changed, 36 insertions(+), 4 deletions(-)
diff --git a/fs/bcachefs/backpointers.h b/fs/bcachefs/backpointers.h
index 7786731d4ada..16575dbc5736 100644
--- a/fs/bcachefs/backpointers.h
+++ b/fs/bcachefs/backpointers.h
@@ -152,7 +152,20 @@ static inline void bch2_extent_ptr_to_bp(struct bch_fs *c,
struct bkey_i_backpointer *bp)
{
bkey_backpointer_init(&bp->k_i);
- bp->k.p = POS(p.ptr.dev, ((u64) p.ptr.offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + p.crc.offset);
+ bp->k.p.inode = p.ptr.dev;
+
+ if (k.k->type != KEY_TYPE_stripe)
+ bp->k.p.offset = ((u64) p.ptr.offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + p.crc.offset;
+ else {
+ /*
+ * Put stripe backpointers where they won't collide with the
+ * extent backpointers within the stripe:
+ */
+ struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
+ bp->k.p.offset = ((u64) (p.ptr.offset + le16_to_cpu(s.v->sectors)) <<
+ MAX_EXTENT_COMPRESS_RATIO_SHIFT) - 1;
+ }
+
bp->v = (struct bch_backpointer) {
.btree_id = btree_id,
.level = level,
diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h
index ef5009b18dd5..bf3723a2bca4 100644
--- a/fs/bcachefs/bcachefs_format.h
+++ b/fs/bcachefs/bcachefs_format.h
@@ -687,7 +687,8 @@ struct bch_sb_field_ext {
x(persistent_inode_cursors, BCH_VERSION(1, 18)) \
x(autofix_errors, BCH_VERSION(1, 19)) \
x(directory_size, BCH_VERSION(1, 20)) \
- x(cached_backpointers, BCH_VERSION(1, 21))
+ x(cached_backpointers, BCH_VERSION(1, 21)) \
+ x(stripe_backpointers, BCH_VERSION(1, 22))
enum bcachefs_metadata_version {
bcachefs_metadata_version_min = 9,
diff --git a/fs/bcachefs/ec.c b/fs/bcachefs/ec.c
index 1aa56d28de33..36590c0ce09f 100644
--- a/fs/bcachefs/ec.c
+++ b/fs/bcachefs/ec.c
@@ -298,10 +298,22 @@ static int mark_stripe_bucket(struct btree_trans *trans,
struct bpos bucket = PTR_BUCKET_POS(ca, ptr);
if (flags & BTREE_TRIGGER_transactional) {
+ struct extent_ptr_decoded p = {
+ .ptr = *ptr,
+ .crc = bch2_extent_crc_unpack(s.k, NULL),
+ };
+ struct bkey_i_backpointer bp;
+ bch2_extent_ptr_to_bp(c, BTREE_ID_stripes, 0, s.s_c, p,
+ (const union bch_extent_entry *) ptr, &bp);
+
struct bkey_i_alloc_v4 *a =
bch2_trans_start_alloc_update(trans, bucket, 0);
- ret = PTR_ERR_OR_ZERO(a) ?:
- __mark_stripe_bucket(trans, ca, s, ptr_idx, deleting, bucket, &a->v, flags);
+ ret = PTR_ERR_OR_ZERO(a) ?:
+ __mark_stripe_bucket(trans, ca, s, ptr_idx, deleting, bucket, &a->v, flags) ?:
+ bch2_bucket_backpointer_mod(trans, s.s_c, &bp,
+ !(flags & BTREE_TRIGGER_overwrite));
+ if (ret)
+ goto err;
}
if (flags & BTREE_TRIGGER_gc) {
diff --git a/fs/bcachefs/move.c b/fs/bcachefs/move.c
index e0e10deaea73..e944f2791546 100644
--- a/fs/bcachefs/move.c
+++ b/fs/bcachefs/move.c
@@ -774,6 +774,9 @@ static int __bch2_move_data_phys(struct moving_context *ctxt,
if (!(data_types & BIT(bp.v->data_type)))
goto next;
+ if (!bp.v->level && bp.v->btree_id == BTREE_ID_stripes)
+ goto next;
+
k = bch2_backpointer_get_key(trans, bp, &iter, 0, &last_flushed);
ret = bkey_err(k);
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
diff --git a/fs/bcachefs/sb-downgrade.c b/fs/bcachefs/sb-downgrade.c
index 34fd897680b3..c587334ffb55 100644
--- a/fs/bcachefs/sb-downgrade.c
+++ b/fs/bcachefs/sb-downgrade.c
@@ -95,6 +95,9 @@
BIT_ULL(BCH_RECOVERY_PASS_check_inodes), \
BCH_FSCK_ERR_directory_size_mismatch) \
x(cached_backpointers, \
+ BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers),\
+ BCH_FSCK_ERR_ptr_to_missing_backpointer) \
+ x(stripe_backpointers, \
BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers),\
BCH_FSCK_ERR_ptr_to_missing_backpointer)
--
2.45.2
* [PATCH 12/18] bcachefs: bcachefs_metadata_version_stripe_lru
From: Kent Overstreet @ 2025-02-13 18:45 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Kent Overstreet
Add a persistent LRU for stripes, ordered by "number of empty blocks",
i.e. order in which we wish to reuse them.
This will replace the in-memory stripes heap, so we can kill off reading
stripes into memory at startup.
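The ordering can be illustrated with a standalone model of
stripe_lru_pos() from the ec.h hunk below. The toy version takes a plain
array of per-block sector counts instead of a struct bch_stripe, which
is an assumption made to keep it self-contained; the three outcomes and
the LRU_TIME_MAX value are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_LRU_TIME_BITS		48
#define TOY_LRU_TIME_MAX		((1ULL << TOY_LRU_TIME_BITS) - 1)
#define TOY_STRIPE_LRU_POS_EMPTY	1

/*
 * LRU position for a stripe, given the live sector count of each block:
 *   - every block empty:  fixed position 1, picked up for deletion
 *   - no block empty:     0, i.e. not on the LRU at all
 *   - otherwise:          LRU_TIME_MAX - blocks_empty, so stripes with
 *                         more empty blocks sort earlier
 */
static uint64_t toy_stripe_lru_pos(const uint16_t *blockcounts, unsigned nr_blocks)
{
	assert(blockcounts || !nr_blocks);

	unsigned blocks_empty = 0;

	for (unsigned i = 0; i < nr_blocks; i++)
		blocks_empty += !blockcounts[i];

	if (blocks_empty == nr_blocks)
		return TOY_STRIPE_LRU_POS_EMPTY;

	if (!blocks_empty)
		return 0;

	return TOY_LRU_TIME_MAX - blocks_empty;
}
```

Scanning the LRU in ascending order therefore visits fully empty
stripes first, then partially empty stripes starting with the emptiest
ones; full stripes never appear.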
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/alloc_background.c | 3 +-
fs/bcachefs/bcachefs_format.h | 3 +-
fs/bcachefs/ec.c | 51 ++++++++++++++++++++++++++++++++++
fs/bcachefs/ec.h | 27 ++++++++++++++++++
fs/bcachefs/lru.c | 7 +++++
fs/bcachefs/lru.h | 9 ++++--
fs/bcachefs/lru_format.h | 4 ++-
7 files changed, 99 insertions(+), 5 deletions(-)
diff --git a/fs/bcachefs/alloc_background.c b/fs/bcachefs/alloc_background.c
index c5c8497a6339..ecad4a78c3f7 100644
--- a/fs/bcachefs/alloc_background.c
+++ b/fs/bcachefs/alloc_background.c
@@ -1757,7 +1757,8 @@ int bch2_check_alloc_to_lru_refs(struct bch_fs *c)
for_each_btree_key_commit(trans, iter, BTREE_ID_alloc,
POS_MIN, BTREE_ITER_prefetch, k,
NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
- bch2_check_alloc_to_lru_ref(trans, &iter, &last_flushed)));
+ bch2_check_alloc_to_lru_ref(trans, &iter, &last_flushed))) ?:
+ bch2_check_stripe_to_lru_refs(c);
bch2_bkey_buf_exit(&last_flushed, c);
bch_err_fn(c, ret);
diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h
index bf3723a2bca4..b4ac311f21a1 100644
--- a/fs/bcachefs/bcachefs_format.h
+++ b/fs/bcachefs/bcachefs_format.h
@@ -688,7 +688,8 @@ struct bch_sb_field_ext {
x(autofix_errors, BCH_VERSION(1, 19)) \
x(directory_size, BCH_VERSION(1, 20)) \
x(cached_backpointers, BCH_VERSION(1, 21)) \
- x(stripe_backpointers, BCH_VERSION(1, 22))
+ x(stripe_backpointers, BCH_VERSION(1, 22)) \
+ x(stripe_lru, BCH_VERSION(1, 23))
enum bcachefs_metadata_version {
bcachefs_metadata_version_min = 9,
diff --git a/fs/bcachefs/ec.c b/fs/bcachefs/ec.c
index 36590c0ce09f..1090cdb7d5cc 100644
--- a/fs/bcachefs/ec.c
+++ b/fs/bcachefs/ec.c
@@ -20,6 +20,7 @@
#include "io_read.h"
#include "io_write.h"
#include "keylist.h"
+#include "lru.h"
#include "recovery.h"
#include "replicas.h"
#include "super-io.h"
@@ -411,6 +412,15 @@ int bch2_trigger_stripe(struct btree_trans *trans,
(new_s->nr_blocks != old_s->nr_blocks ||
new_s->nr_redundant != old_s->nr_redundant));
+ if (flags & BTREE_TRIGGER_transactional) {
+ int ret = bch2_lru_change(trans,
+ BCH_LRU_STRIPE_FRAGMENTATION,
+ idx,
+ stripe_lru_pos(old_s),
+ stripe_lru_pos(new_s));
+ if (ret)
+ return ret;
+ }
if (flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) {
/*
@@ -1175,6 +1185,10 @@ static int ec_stripe_delete(struct btree_trans *trans, u64 idx)
return ret;
}
+/*
+ * XXX
+ * can we kill this and delete stripes from the trigger?
+ */
static void ec_stripe_delete_work(struct work_struct *work)
{
struct bch_fs *c =
@@ -2519,3 +2533,40 @@ int bch2_fs_ec_init(struct bch_fs *c)
return bioset_init(&c->ec_bioset, 1, offsetof(struct ec_bio, bio),
BIOSET_NEED_BVECS);
}
+
+static int bch2_check_stripe_to_lru_ref(struct btree_trans *trans,
+ struct bkey_s_c k,
+ struct bkey_buf *last_flushed)
+{
+ if (k.k->type != KEY_TYPE_stripe)
+ return 0;
+
+ struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
+
+ u64 lru_idx = stripe_lru_pos(s.v);
+ if (lru_idx) {
+ int ret = bch2_lru_check_set(trans, BCH_LRU_STRIPE_FRAGMENTATION,
+ k.k->p.offset, lru_idx, k, last_flushed);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
+int bch2_check_stripe_to_lru_refs(struct bch_fs *c)
+{
+ struct bkey_buf last_flushed;
+
+ bch2_bkey_buf_init(&last_flushed);
+ bkey_init(&last_flushed.k->k);
+
+ int ret = bch2_trans_run(c,
+ for_each_btree_key_commit(trans, iter, BTREE_ID_stripes,
+ POS_MIN, BTREE_ITER_prefetch, k,
+ NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
+ bch2_check_stripe_to_lru_ref(trans, k, &last_flushed)));
+
+ bch2_bkey_buf_exit(&last_flushed, c);
+ bch_err_fn(c, ret);
+ return ret;
+}
diff --git a/fs/bcachefs/ec.h b/fs/bcachefs/ec.h
index 4c9511887655..cd1c837e4933 100644
--- a/fs/bcachefs/ec.h
+++ b/fs/bcachefs/ec.h
@@ -92,6 +92,31 @@ static inline void stripe_csum_set(struct bch_stripe *s,
memcpy(stripe_csum(s, block, csum_idx), &csum, bch_crc_bytes[s->csum_type]);
}
+#define STRIPE_LRU_POS_EMPTY 1
+
+static inline u64 stripe_lru_pos(const struct bch_stripe *s)
+{
+ if (!s)
+ return 0;
+
+ unsigned blocks_empty = 0, blocks_nonempty = 0;
+
+ for (unsigned i = 0; i < s->nr_blocks; i++) {
+ blocks_empty += !stripe_blockcount_get(s, i);
+ blocks_nonempty += !!stripe_blockcount_get(s, i);
+ }
+
+ /* Will be picked up by the stripe_delete worker */
+ if (!blocks_nonempty)
+ return STRIPE_LRU_POS_EMPTY;
+
+ if (!blocks_empty)
+ return 0;
+
+ /* invert: more blocks empty = reuse first */
+ return LRU_TIME_MAX - blocks_empty;
+}
+
static inline bool __bch2_ptr_matches_stripe(const struct bch_extent_ptr *stripe_ptr,
const struct bch_extent_ptr *data_ptr,
unsigned sectors)
@@ -282,4 +307,6 @@ void bch2_fs_ec_exit(struct bch_fs *);
void bch2_fs_ec_init_early(struct bch_fs *);
int bch2_fs_ec_init(struct bch_fs *);
+int bch2_check_stripe_to_lru_refs(struct bch_fs *);
+
#endif /* _BCACHEFS_EC_H */
diff --git a/fs/bcachefs/lru.c b/fs/bcachefs/lru.c
index 98ab8496f29d..a299d9ec8ee4 100644
--- a/fs/bcachefs/lru.c
+++ b/fs/bcachefs/lru.c
@@ -6,6 +6,7 @@
#include "btree_iter.h"
#include "btree_update.h"
#include "btree_write_buffer.h"
+#include "ec.h"
#include "error.h"
#include "lru.h"
#include "recovery.h"
@@ -124,6 +125,8 @@ static struct bbpos lru_pos_to_bp(struct bkey_s_c lru_k)
case BCH_LRU_read:
case BCH_LRU_fragmentation:
return BBPOS(BTREE_ID_alloc, u64_to_bucket(lru_k.k->p.offset));
+ case BCH_LRU_stripes:
+ return BBPOS(BTREE_ID_stripes, POS(0, lru_k.k->p.offset));
default:
BUG();
}
@@ -151,6 +154,10 @@ static u64 bkey_lru_type_idx(struct bch_fs *c,
rcu_read_unlock();
return idx;
}
+ case BCH_LRU_stripes:
+ return k.k->type == KEY_TYPE_stripe
+ ? stripe_lru_pos(bkey_s_c_to_stripe(k).v)
+ : 0;
default:
BUG();
}
diff --git a/fs/bcachefs/lru.h b/fs/bcachefs/lru.h
index dea1d75cc9c1..8abd0aa2083a 100644
--- a/fs/bcachefs/lru.h
+++ b/fs/bcachefs/lru.h
@@ -28,9 +28,14 @@ static inline enum bch_lru_type lru_type(struct bkey_s_c l)
{
u16 lru_id = l.k->p.inode >> 48;
- if (lru_id == BCH_LRU_BUCKET_FRAGMENTATION)
+ switch (lru_id) {
+ case BCH_LRU_BUCKET_FRAGMENTATION:
return BCH_LRU_fragmentation;
- return BCH_LRU_read;
+ case BCH_LRU_STRIPE_FRAGMENTATION:
+ return BCH_LRU_stripes;
+ default:
+ return BCH_LRU_read;
+ }
}
int bch2_lru_validate(struct bch_fs *, struct bkey_s_c, struct bkey_validate_context);
diff --git a/fs/bcachefs/lru_format.h b/fs/bcachefs/lru_format.h
index 353a352d3fb9..b7392ad8e41f 100644
--- a/fs/bcachefs/lru_format.h
+++ b/fs/bcachefs/lru_format.h
@@ -9,7 +9,8 @@ struct bch_lru {
#define BCH_LRU_TYPES() \
x(read) \
- x(fragmentation)
+ x(fragmentation) \
+ x(stripes)
enum bch_lru_type {
#define x(n) BCH_LRU_##n,
@@ -18,6 +19,7 @@ enum bch_lru_type {
};
#define BCH_LRU_BUCKET_FRAGMENTATION ((1U << 16) - 1)
+#define BCH_LRU_STRIPE_FRAGMENTATION ((1U << 16) - 2)
#define LRU_TIME_BITS 48
#define LRU_TIME_MAX ((1ULL << LRU_TIME_BITS) - 1)
--
2.45.2
* [PATCH 13/18] bcachefs: ec_stripe_delete() uses new stripe lru
From: Kent Overstreet @ 2025-02-13 18:45 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Kent Overstreet
Convert to the new persistent stripe LRU.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/ec.c | 64 +++++++++++++++++-------------------------------
1 file changed, 22 insertions(+), 42 deletions(-)
diff --git a/fs/bcachefs/ec.c b/fs/bcachefs/ec.c
index 1090cdb7d5cc..f9bb835c1d57 100644
--- a/fs/bcachefs/ec.c
+++ b/fs/bcachefs/ec.c
@@ -1149,37 +1149,22 @@ void bch2_stripes_heap_update(struct bch_fs *c,
static int ec_stripe_delete(struct btree_trans *trans, u64 idx)
{
- struct bch_fs *c = trans->c;
struct btree_iter iter;
- struct bkey_s_c k;
- struct bkey_s_c_stripe s;
- int ret;
-
- k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_stripes, POS(0, idx),
- BTREE_ITER_intent);
- ret = bkey_err(k);
+ struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter,
+ BTREE_ID_stripes, POS(0, idx),
+ BTREE_ITER_intent);
+ int ret = bkey_err(k);
if (ret)
goto err;
- if (k.k->type != KEY_TYPE_stripe) {
- bch2_fs_inconsistent(c, "attempting to delete nonexistent stripe %llu", idx);
- ret = -EINVAL;
- goto err;
- }
-
- s = bkey_s_c_to_stripe(k);
- for (unsigned i = 0; i < s.v->nr_blocks; i++)
- if (stripe_blockcount_get(s.v, i)) {
- struct printbuf buf = PRINTBUF;
-
- bch2_bkey_val_to_text(&buf, c, k);
- bch2_fs_inconsistent(c, "attempting to delete nonempty stripe %s", buf.buf);
- printbuf_exit(&buf);
- ret = -EINVAL;
- goto err;
- }
-
- ret = bch2_btree_delete_at(trans, &iter, 0);
+ /*
+ * We expect write buffer races here
+ * Important: check stripe_is_open with stripe key locked:
+ */
+ if (k.k->type == KEY_TYPE_stripe &&
+ !bch2_stripe_is_open(trans->c, idx) &&
+ stripe_lru_pos(bkey_s_c_to_stripe(k).v) == 1)
+ ret = bch2_btree_delete_at(trans, &iter, 0);
err:
bch2_trans_iter_exit(trans, &iter);
return ret;
@@ -1194,21 +1179,16 @@ static void ec_stripe_delete_work(struct work_struct *work)
struct bch_fs *c =
container_of(work, struct bch_fs, ec_stripe_delete_work);
- while (1) {
- mutex_lock(&c->ec_stripes_heap_lock);
- u64 idx = stripe_idx_to_delete(c);
- mutex_unlock(&c->ec_stripes_heap_lock);
-
- if (!idx)
- break;
-
- int ret = bch2_trans_commit_do(c, NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
- ec_stripe_delete(trans, idx));
- bch_err_fn(c, ret);
- if (ret)
- break;
- }
-
+ bch2_trans_run(c,
+ bch2_btree_write_buffer_tryflush(trans) ?:
+ for_each_btree_key_max_commit(trans, lru_iter, BTREE_ID_lru,
+ lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 1, 0),
+ lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 1, LRU_TIME_MAX),
+ 0, lru_k,
+ NULL, NULL,
+ BCH_TRANS_COMMIT_no_enospc, ({
+ ec_stripe_delete(trans, lru_k.k->p.offset);
+ })));
bch2_write_ref_put(c, BCH_WRITE_REF_stripe_delete);
}
--
2.45.2
* [PATCH 14/18] bcachefs: get_existing_stripe() uses new stripe lru
From: Kent Overstreet @ 2025-02-13 18:45 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Kent Overstreet
Convert to the new persistent stripe LRU.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/ec.c | 88 ++++++++++++++++++++++++++++--------------------
1 file changed, 51 insertions(+), 37 deletions(-)
diff --git a/fs/bcachefs/ec.c b/fs/bcachefs/ec.c
index f9bb835c1d57..bd47b8f0f705 100644
--- a/fs/bcachefs/ec.c
+++ b/fs/bcachefs/ec.c
@@ -1978,39 +1978,44 @@ static int new_stripe_alloc_buckets(struct btree_trans *trans,
return 0;
}
-static s64 get_existing_stripe(struct bch_fs *c,
- struct ec_stripe_head *head)
+static int __get_existing_stripe(struct btree_trans *trans,
+ struct ec_stripe_head *head,
+ struct ec_stripe_buf *stripe,
+ u64 idx)
{
- ec_stripes_heap *h = &c->ec_stripes_heap;
- struct stripe *m;
- size_t heap_idx;
- u64 stripe_idx;
- s64 ret = -1;
+ struct bch_fs *c = trans->c;
- if (may_create_new_stripe(c))
- return -1;
+ struct btree_iter iter;
+ struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter,
+ BTREE_ID_stripes, POS(0, idx), 0);
+ int ret = bkey_err(k);
+ if (ret)
+ goto err;
- mutex_lock(&c->ec_stripes_heap_lock);
- for (heap_idx = 0; heap_idx < h->nr; heap_idx++) {
- /* No blocks worth reusing, stripe will just be deleted: */
- if (!h->data[heap_idx].blocks_nonempty)
- continue;
+ /* We expect write buffer races here */
+ if (k.k->type != KEY_TYPE_stripe)
+ goto out;
- stripe_idx = h->data[heap_idx].idx;
+ struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
+ if (stripe_lru_pos(s.v) == 1)
+ goto out;
- m = genradix_ptr(&c->stripes, stripe_idx);
+ unsigned blocks_nonempty = 0;
+ for (unsigned i = 0; i < s.v->nr_blocks; i++)
+ blocks_nonempty += !!stripe_blockcount_get(s.v, i);
- if (m->disk_label == head->disk_label &&
- m->algorithm == head->algo &&
- m->nr_redundant == head->redundancy &&
- m->sectors == head->blocksize &&
- m->blocks_nonempty < m->nr_blocks - m->nr_redundant &&
- bch2_try_open_stripe(c, head->s, stripe_idx)) {
- ret = stripe_idx;
- break;
- }
+ if (s.v->disk_label == head->disk_label &&
+ s.v->algorithm == head->algo &&
+ s.v->nr_redundant == head->redundancy &&
+ s.v->sectors == head->blocksize &&
+ bch2_try_open_stripe(c, head->s, idx)) {
+ bkey_reassemble(&stripe->key, k);
+ ret = 1;
}
- mutex_unlock(&c->ec_stripes_heap_lock);
+out:
+ bch2_set_btree_iter_dontneed(&iter);
+err:
+ bch2_trans_iter_exit(trans, &iter);
return ret;
}
@@ -2062,24 +2067,33 @@ static int __bch2_ec_stripe_head_reuse(struct btree_trans *trans, struct ec_stri
struct ec_stripe_new *s)
{
struct bch_fs *c = trans->c;
- s64 idx;
- int ret;
/*
* If we can't allocate a new stripe, and there's no stripes with empty
* blocks for us to reuse, that means we have to wait on copygc:
*/
- idx = get_existing_stripe(c, h);
- if (idx < 0)
- return -BCH_ERR_stripe_alloc_blocked;
+ if (may_create_new_stripe(c))
+ return -1;
- ret = get_stripe_key_trans(trans, idx, &s->existing_stripe);
- bch2_fs_fatal_err_on(ret && !bch2_err_matches(ret, BCH_ERR_transaction_restart), c,
- "reading stripe key: %s", bch2_err_str(ret));
- if (ret) {
- bch2_stripe_close(c, s);
- return ret;
+ struct btree_iter lru_iter;
+ struct bkey_s_c lru_k;
+ int ret = 0;
+
+ for_each_btree_key_max_norestart(trans, lru_iter, BTREE_ID_lru,
+ lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 2, 0),
+ lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 2, LRU_TIME_MAX),
+ 0, lru_k, ret) {
+ ret = __get_existing_stripe(trans, h, &s->existing_stripe, lru_k.k->p.offset);
+ if (ret)
+ break;
}
+ bch2_trans_iter_exit(trans, &lru_iter);
+ if (!ret)
+ ret = -BCH_ERR_stripe_alloc_blocked;
+ if (ret == 1)
+ ret = 0;
+ if (ret)
+ return ret;
return init_new_stripe_from_existing(c, s);
}
--
2.45.2
* [PATCH 15/18] bcachefs: We no longer read stripes into memory at startup
From: Kent Overstreet @ 2025-02-13 18:46 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Kent Overstreet
And the stripes heap gets deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/bcachefs/bcachefs.h | 4 -
fs/bcachefs/ec.c | 210 +---------------------------
fs/bcachefs/ec.h | 5 -
fs/bcachefs/ec_types.h | 7 -
fs/bcachefs/recovery_passes_types.h | 2 +-
fs/bcachefs/sysfs.c | 5 -
6 files changed, 2 insertions(+), 231 deletions(-)
diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index e8f4999806b6..28216297edbb 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -996,15 +996,11 @@ struct bch_fs {
wait_queue_head_t copygc_running_wq;
/* STRIPES: */
- GENRADIX(struct stripe) stripes;
GENRADIX(struct gc_stripe) gc_stripes;
struct hlist_head ec_stripes_new[32];
spinlock_t ec_stripes_new_lock;
- ec_stripes_heap ec_stripes_heap;
- struct mutex ec_stripes_heap_lock;
-
/* ERASURE CODING */
struct list_head ec_stripe_head_list;
struct mutex ec_stripe_head_lock;
diff --git a/fs/bcachefs/ec.c b/fs/bcachefs/ec.c
index bd47b8f0f705..689a0c32d07f 100644
--- a/fs/bcachefs/ec.c
+++ b/fs/bcachefs/ec.c
@@ -494,38 +494,6 @@ int bch2_trigger_stripe(struct btree_trans *trans,
return ret;
}
- if (flags & BTREE_TRIGGER_atomic) {
- struct stripe *m = genradix_ptr(&c->stripes, idx);
-
- if (!m) {
- struct printbuf buf1 = PRINTBUF;
- struct printbuf buf2 = PRINTBUF;
-
- bch2_bkey_val_to_text(&buf1, c, old);
- bch2_bkey_val_to_text(&buf2, c, new);
- bch_err_ratelimited(c, "error marking nonexistent stripe %llu while marking\n"
- "old %s\n"
- "new %s", idx, buf1.buf, buf2.buf);
- printbuf_exit(&buf2);
- printbuf_exit(&buf1);
- bch2_inconsistent_error(c);
- return -1;
- }
-
- if (!new_s) {
- bch2_stripes_heap_del(c, m, idx);
-
- memset(m, 0, sizeof(*m));
- } else {
- stripe_to_mem(m, new_s);
-
- if (!old_s)
- bch2_stripes_heap_insert(c, m, idx);
- else
- bch2_stripes_heap_update(c, m, idx);
- }
- }
-
return 0;
}
@@ -939,26 +907,6 @@ int bch2_ec_read_extent(struct btree_trans *trans, struct bch_read_bio *rbio,
static int __ec_stripe_mem_alloc(struct bch_fs *c, size_t idx, gfp_t gfp)
{
- ec_stripes_heap n, *h = &c->ec_stripes_heap;
-
- if (idx >= h->size) {
- if (!init_heap(&n, max(1024UL, roundup_pow_of_two(idx + 1)), gfp))
- return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
-
- mutex_lock(&c->ec_stripes_heap_lock);
- if (n.size > h->size) {
- memcpy(n.data, h->data, h->nr * sizeof(h->data[0]));
- n.nr = h->nr;
- swap(*h, n);
- }
- mutex_unlock(&c->ec_stripes_heap_lock);
-
- free_heap(&n);
- }
-
- if (!genradix_ptr_alloc(&c->stripes, idx, gfp))
- return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
-
if (c->gc_pos.phase != GC_PHASE_not_running &&
!genradix_ptr_alloc(&c->gc_stripes, idx, gfp))
return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
@@ -1031,120 +979,6 @@ static void bch2_stripe_close(struct bch_fs *c, struct ec_stripe_new *s)
s->idx = 0;
}
-/* Heap of all existing stripes, ordered by blocks_nonempty */
-
-static u64 stripe_idx_to_delete(struct bch_fs *c)
-{
- ec_stripes_heap *h = &c->ec_stripes_heap;
-
- lockdep_assert_held(&c->ec_stripes_heap_lock);
-
- if (h->nr &&
- h->data[0].blocks_nonempty == 0 &&
- !bch2_stripe_is_open(c, h->data[0].idx))
- return h->data[0].idx;
-
- return 0;
-}
-
-static inline void ec_stripes_heap_set_backpointer(ec_stripes_heap *h,
- size_t i)
-{
- struct bch_fs *c = container_of(h, struct bch_fs, ec_stripes_heap);
-
- genradix_ptr(&c->stripes, h->data[i].idx)->heap_idx = i;
-}
-
-static inline bool ec_stripes_heap_cmp(const void *l, const void *r, void __always_unused *args)
-{
- struct ec_stripe_heap_entry *_l = (struct ec_stripe_heap_entry *)l;
- struct ec_stripe_heap_entry *_r = (struct ec_stripe_heap_entry *)r;
-
- return ((_l->blocks_nonempty > _r->blocks_nonempty) <
- (_l->blocks_nonempty < _r->blocks_nonempty));
-}
-
-static inline void ec_stripes_heap_swap(void *l, void *r, void *h)
-{
- struct ec_stripe_heap_entry *_l = (struct ec_stripe_heap_entry *)l;
- struct ec_stripe_heap_entry *_r = (struct ec_stripe_heap_entry *)r;
- ec_stripes_heap *_h = (ec_stripes_heap *)h;
- size_t i = _l - _h->data;
- size_t j = _r - _h->data;
-
- swap(*_l, *_r);
-
- ec_stripes_heap_set_backpointer(_h, i);
- ec_stripes_heap_set_backpointer(_h, j);
-}
-
-static const struct min_heap_callbacks callbacks = {
- .less = ec_stripes_heap_cmp,
- .swp = ec_stripes_heap_swap,
-};
-
-static void heap_verify_backpointer(struct bch_fs *c, size_t idx)
-{
- ec_stripes_heap *h = &c->ec_stripes_heap;
- struct stripe *m = genradix_ptr(&c->stripes, idx);
-
- BUG_ON(m->heap_idx >= h->nr);
- BUG_ON(h->data[m->heap_idx].idx != idx);
-}
-
-void bch2_stripes_heap_del(struct bch_fs *c,
- struct stripe *m, size_t idx)
-{
- mutex_lock(&c->ec_stripes_heap_lock);
- heap_verify_backpointer(c, idx);
-
- min_heap_del(&c->ec_stripes_heap, m->heap_idx, &callbacks, &c->ec_stripes_heap);
- mutex_unlock(&c->ec_stripes_heap_lock);
-}
-
-void bch2_stripes_heap_insert(struct bch_fs *c,
- struct stripe *m, size_t idx)
-{
- mutex_lock(&c->ec_stripes_heap_lock);
- BUG_ON(min_heap_full(&c->ec_stripes_heap));
-
- genradix_ptr(&c->stripes, idx)->heap_idx = c->ec_stripes_heap.nr;
- min_heap_push(&c->ec_stripes_heap, &((struct ec_stripe_heap_entry) {
- .idx = idx,
- .blocks_nonempty = m->blocks_nonempty,
- }),
- &callbacks,
- &c->ec_stripes_heap);
-
- heap_verify_backpointer(c, idx);
- mutex_unlock(&c->ec_stripes_heap_lock);
-}
-
-void bch2_stripes_heap_update(struct bch_fs *c,
- struct stripe *m, size_t idx)
-{
- ec_stripes_heap *h = &c->ec_stripes_heap;
- bool do_deletes;
- size_t i;
-
- mutex_lock(&c->ec_stripes_heap_lock);
- heap_verify_backpointer(c, idx);
-
- h->data[m->heap_idx].blocks_nonempty = m->blocks_nonempty;
-
- i = m->heap_idx;
- min_heap_sift_up(h, i, &callbacks, &c->ec_stripes_heap);
- min_heap_sift_down(h, i, &callbacks, &c->ec_stripes_heap);
-
- heap_verify_backpointer(c, idx);
-
- do_deletes = stripe_idx_to_delete(c) != 0;
- mutex_unlock(&c->ec_stripes_heap_lock);
-
- if (do_deletes)
- bch2_do_stripe_deletes(c);
-}
-
/* stripe deletion */
static int ec_stripe_delete(struct btree_trans *trans, u64 idx)
@@ -2391,46 +2225,7 @@ void bch2_fs_ec_flush(struct bch_fs *c)
int bch2_stripes_read(struct bch_fs *c)
{
- int ret = bch2_trans_run(c,
- for_each_btree_key(trans, iter, BTREE_ID_stripes, POS_MIN,
- BTREE_ITER_prefetch, k, ({
- if (k.k->type != KEY_TYPE_stripe)
- continue;
-
- ret = __ec_stripe_mem_alloc(c, k.k->p.offset, GFP_KERNEL);
- if (ret)
- break;
-
- struct stripe *m = genradix_ptr(&c->stripes, k.k->p.offset);
-
- stripe_to_mem(m, bkey_s_c_to_stripe(k).v);
-
- bch2_stripes_heap_insert(c, m, k.k->p.offset);
- 0;
- })));
- bch_err_fn(c, ret);
- return ret;
-}
-
-void bch2_stripes_heap_to_text(struct printbuf *out, struct bch_fs *c)
-{
- ec_stripes_heap *h = &c->ec_stripes_heap;
- struct stripe *m;
- size_t i;
-
- mutex_lock(&c->ec_stripes_heap_lock);
- for (i = 0; i < min_t(size_t, h->nr, 50); i++) {
- m = genradix_ptr(&c->stripes, h->data[i].idx);
-
- prt_printf(out, "%zu %u/%u+%u", h->data[i].idx,
- h->data[i].blocks_nonempty,
- m->nr_blocks - m->nr_redundant,
- m->nr_redundant);
- if (bch2_stripe_is_open(c, h->data[i].idx))
- prt_str(out, " open");
- prt_newline(out);
- }
- mutex_unlock(&c->ec_stripes_heap_lock);
+ return 0;
}
static void bch2_new_stripe_to_text(struct printbuf *out, struct bch_fs *c,
@@ -2501,15 +2296,12 @@ void bch2_fs_ec_exit(struct bch_fs *c)
BUG_ON(!list_empty(&c->ec_stripe_new_list));
- free_heap(&c->ec_stripes_heap);
- genradix_free(&c->stripes);
bioset_exit(&c->ec_bioset);
}
void bch2_fs_ec_init_early(struct bch_fs *c)
{
spin_lock_init(&c->ec_stripes_new_lock);
- mutex_init(&c->ec_stripes_heap_lock);
INIT_LIST_HEAD(&c->ec_stripe_head_list);
mutex_init(&c->ec_stripe_head_lock);
diff --git a/fs/bcachefs/ec.h b/fs/bcachefs/ec.h
index cd1c837e4933..a1c662ada343 100644
--- a/fs/bcachefs/ec.h
+++ b/fs/bcachefs/ec.h
@@ -260,10 +260,6 @@ struct ec_stripe_head *bch2_ec_stripe_head_get(struct btree_trans *,
unsigned, unsigned, unsigned,
enum bch_watermark, struct closure *);
-void bch2_stripes_heap_update(struct bch_fs *, struct stripe *, size_t);
-void bch2_stripes_heap_del(struct bch_fs *, struct stripe *, size_t);
-void bch2_stripes_heap_insert(struct bch_fs *, struct stripe *, size_t);
-
void bch2_do_stripe_deletes(struct bch_fs *);
void bch2_ec_do_stripe_creates(struct bch_fs *);
void bch2_ec_stripe_new_free(struct bch_fs *, struct ec_stripe_new *);
@@ -300,7 +296,6 @@ void bch2_fs_ec_flush(struct bch_fs *);
int bch2_stripes_read(struct bch_fs *);
-void bch2_stripes_heap_to_text(struct printbuf *, struct bch_fs *);
void bch2_new_stripes_to_text(struct printbuf *, struct bch_fs *);
void bch2_fs_ec_exit(struct bch_fs *);
diff --git a/fs/bcachefs/ec_types.h b/fs/bcachefs/ec_types.h
index 37558cc2d89f..06144bfd9c19 100644
--- a/fs/bcachefs/ec_types.h
+++ b/fs/bcachefs/ec_types.h
@@ -31,11 +31,4 @@ struct gc_stripe {
struct bch_replicas_padded r;
};
-struct ec_stripe_heap_entry {
- size_t idx;
- unsigned blocks_nonempty;
-};
-
-typedef DEFINE_MIN_HEAP(struct ec_stripe_heap_entry, ec_stripes_heap) ec_stripes_heap;
-
#endif /* _BCACHEFS_EC_TYPES_H */
diff --git a/fs/bcachefs/recovery_passes_types.h b/fs/bcachefs/recovery_passes_types.h
index 418557960ed6..e89b9c783285 100644
--- a/fs/bcachefs/recovery_passes_types.h
+++ b/fs/bcachefs/recovery_passes_types.h
@@ -24,7 +24,7 @@
x(check_topology, 4, 0) \
x(accounting_read, 39, PASS_ALWAYS) \
x(alloc_read, 0, PASS_ALWAYS) \
- x(stripes_read, 1, PASS_ALWAYS) \
+ x(stripes_read, 1, 0) \
x(initialize_subvolumes, 2, 0) \
x(snapshots_read, 3, PASS_ALWAYS) \
x(check_allocations, 5, PASS_FSCK) \
diff --git a/fs/bcachefs/sysfs.c b/fs/bcachefs/sysfs.c
index b3f2c651c1f8..4551b1d86907 100644
--- a/fs/bcachefs/sysfs.c
+++ b/fs/bcachefs/sysfs.c
@@ -173,7 +173,6 @@ read_attribute(journal_debug);
read_attribute(btree_cache);
read_attribute(btree_key_cache);
read_attribute(btree_reserve_cache);
-read_attribute(stripes_heap);
read_attribute(open_buckets);
read_attribute(open_buckets_partial);
read_attribute(nocow_lock_table);
@@ -354,9 +353,6 @@ SHOW(bch2_fs)
if (attr == &sysfs_btree_reserve_cache)
bch2_btree_reserve_cache_to_text(out, c);
- if (attr == &sysfs_stripes_heap)
- bch2_stripes_heap_to_text(out, c);
-
if (attr == &sysfs_open_buckets)
bch2_open_buckets_to_text(out, c, NULL);
@@ -562,7 +558,6 @@ struct attribute *bch2_fs_internal_files[] = {
&sysfs_btree_key_cache,
&sysfs_btree_reserve_cache,
&sysfs_new_stripes,
- &sysfs_stripes_heap,
&sysfs_open_buckets,
&sysfs_open_buckets_partial,
#ifdef BCH_WRITE_REF_DEBUG
--
2.45.2
* [PATCH 16/18] bcachefs: Kill dirent_occupied_size()
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
` (14 preceding siblings ...)
2025-02-13 18:46 ` [PATCH 15/18] bcachefs: We no longer read stripes into memory at startup Kent Overstreet
@ 2025-02-13 18:46 ` Kent Overstreet
2025-02-17 1:49 ` Hongbo Li
2025-02-13 18:46 ` [PATCH 17/18] bcachefs: Split out dirent alloc and name initialization Kent Overstreet
2025-02-13 18:46 ` [PATCH 18/18] bcachefs: bcachefs_metadata_version_casefolding Kent Overstreet
17 siblings, 1 reply; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:46 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Kent Overstreet, Hongbo Li
With the upcoming patches for casefolding, we really need to use the
size of the dirent key itself, which is cleaner anyway.
The size of the dirent is no longer just a function of the length of the
name: it will differ depending on whether the directory has
casefolding enabled, which means the accounting in rename() has to
change a bit.
Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
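Note for readers: the rename accounting in this patch can be modelled in a
few lines of userspace C. This is only a sketch under stated assumptions:
`rename_account` and `enum rename_mode` are hypothetical names, and the byte
counts stand in for what `bkey_bytes()` returns for each dirent key. Passing
0 for `old_dst_bytes` models a destination name that did not exist.

```c
#include <stdint.h>
#include <stddef.h>

/* Modelled on BCH_RENAME, BCH_RENAME_OVERWRITE and BCH_RENAME_EXCHANGE. */
enum rename_mode { RENAME, RENAME_OVERWRITE, RENAME_EXCHANGE };

/*
 * Toy model of the i_size accounting done in bch2_dirent_rename():
 * every key that goes away is subtracted from its directory, every key
 * that is written is added to its directory. The source directory only
 * gains a key back when the two names are exchanged.
 */
void rename_account(enum rename_mode mode,
		    uint64_t *src_dir_size, uint64_t *dst_dir_size,
		    unsigned old_src_bytes, unsigned old_dst_bytes,
		    unsigned new_src_bytes, unsigned new_dst_bytes)
{
	if (old_dst_bytes)
		*dst_dir_size -= old_dst_bytes;
	*src_dir_size -= old_src_bytes;

	if (mode == RENAME_EXCHANGE)
		*src_dir_size += new_src_bytes;
	*dst_dir_size += new_dst_bytes;
}
```

Because each term is the size of an actual key, the model keeps working when
the old and new keys for the same name differ in size, which is the case the
commit message anticipates for casefolded directories.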
fs/bcachefs/dirent.c | 15 +++++++++++++--
fs/bcachefs/dirent.h | 11 +++--------
fs/bcachefs/fs-common.c | 42 ++++++++++++++++++++++-------------------
fs/bcachefs/fsck.c | 5 +----
4 files changed, 40 insertions(+), 33 deletions(-)
diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c
index 600eee936f13..27737aaa03a6 100644
--- a/fs/bcachefs/dirent.c
+++ b/fs/bcachefs/dirent.c
@@ -233,6 +233,7 @@ int bch2_dirent_create(struct btree_trans *trans, subvol_inum dir,
const struct bch_hash_info *hash_info,
u8 type, const struct qstr *name, u64 dst_inum,
u64 *dir_offset,
+ u64 *i_size,
enum btree_iter_update_trigger_flags flags)
{
struct bkey_i_dirent *dirent;
@@ -243,6 +244,8 @@ int bch2_dirent_create(struct btree_trans *trans, subvol_inum dir,
if (ret)
return ret;
+ *i_size += bkey_bytes(&dirent->k);
+
ret = bch2_hash_set(trans, bch2_dirent_hash_desc, hash_info,
dir, &dirent->k_i, flags);
*dir_offset = dirent->k.p.offset;
@@ -275,8 +278,8 @@ int bch2_dirent_read_target(struct btree_trans *trans, subvol_inum dir,
}
int bch2_dirent_rename(struct btree_trans *trans,
- subvol_inum src_dir, struct bch_hash_info *src_hash,
- subvol_inum dst_dir, struct bch_hash_info *dst_hash,
+ subvol_inum src_dir, struct bch_hash_info *src_hash, u64 *src_dir_i_size,
+ subvol_inum dst_dir, struct bch_hash_info *dst_hash, u64 *dst_dir_i_size,
const struct qstr *src_name, subvol_inum *src_inum, u64 *src_offset,
const struct qstr *dst_name, subvol_inum *dst_inum, u64 *dst_offset,
enum bch_rename_mode mode)
@@ -406,6 +409,14 @@ int bch2_dirent_rename(struct btree_trans *trans,
new_src->v.d_type == DT_SUBVOL)
new_src->v.d_parent_subvol = cpu_to_le32(src_dir.subvol);
+ if (old_dst.k)
+ *dst_dir_i_size -= bkey_bytes(old_dst.k);
+ *src_dir_i_size -= bkey_bytes(old_src.k);
+
+ if (mode == BCH_RENAME_EXCHANGE)
+ *src_dir_i_size += bkey_bytes(&new_src->k);
+ *dst_dir_i_size += bkey_bytes(&new_dst->k);
+
ret = bch2_trans_update(trans, &dst_iter, &new_dst->k_i, 0);
if (ret)
goto out;
diff --git a/fs/bcachefs/dirent.h b/fs/bcachefs/dirent.h
index a633f83c1ac7..37f01c1a3f7f 100644
--- a/fs/bcachefs/dirent.h
+++ b/fs/bcachefs/dirent.h
@@ -31,11 +31,6 @@ static inline unsigned dirent_val_u64s(unsigned len)
sizeof(u64));
}
-static inline unsigned int dirent_occupied_size(const struct qstr *name)
-{
- return (BKEY_U64s + dirent_val_u64s(name->len)) * sizeof(u64);
-}
-
int bch2_dirent_read_target(struct btree_trans *, subvol_inum,
struct bkey_s_c_dirent, subvol_inum *);
@@ -52,7 +47,7 @@ int bch2_dirent_create_snapshot(struct btree_trans *, u32, u64, u32,
enum btree_iter_update_trigger_flags);
int bch2_dirent_create(struct btree_trans *, subvol_inum,
const struct bch_hash_info *, u8,
- const struct qstr *, u64, u64 *,
+ const struct qstr *, u64, u64 *, u64 *,
enum btree_iter_update_trigger_flags);
static inline unsigned vfs_d_type(unsigned type)
@@ -67,8 +62,8 @@ enum bch_rename_mode {
};
int bch2_dirent_rename(struct btree_trans *,
- subvol_inum, struct bch_hash_info *,
- subvol_inum, struct bch_hash_info *,
+ subvol_inum, struct bch_hash_info *, u64 *,
+ subvol_inum, struct bch_hash_info *, u64 *,
const struct qstr *, subvol_inum *, u64 *,
const struct qstr *, subvol_inum *, u64 *,
enum bch_rename_mode);
diff --git a/fs/bcachefs/fs-common.c b/fs/bcachefs/fs-common.c
index d70d9f634cea..c8afd312e601 100644
--- a/fs/bcachefs/fs-common.c
+++ b/fs/bcachefs/fs-common.c
@@ -42,11 +42,14 @@ int bch2_create_trans(struct btree_trans *trans,
if (ret)
goto err;
- ret = bch2_inode_peek(trans, &dir_iter, dir_u, dir,
- BTREE_ITER_intent|BTREE_ITER_with_updates);
+ ret = bch2_inode_peek(trans, &dir_iter, dir_u, dir, BTREE_ITER_intent);
if (ret)
goto err;
+ /* Inherit casefold state from parent. */
+ if (S_ISDIR(mode))
+ new_inode->bi_flags |= dir_u->bi_flags & BCH_INODE_casefolded;
+
if (!(flags & BCH_CREATE_SNAPSHOT)) {
/* Normal create path - allocate a new inode: */
bch2_inode_init_late(new_inode, now, uid, gid, mode, rdev, dir_u);
@@ -152,7 +155,6 @@ int bch2_create_trans(struct btree_trans *trans,
if (is_subdir_for_nlink(new_inode))
dir_u->bi_nlink++;
dir_u->bi_mtime = dir_u->bi_ctime = now;
- dir_u->bi_size += dirent_occupied_size(name);
ret = bch2_inode_write(trans, &dir_iter, dir_u);
if (ret)
@@ -163,7 +165,8 @@ int bch2_create_trans(struct btree_trans *trans,
name,
dir_target,
&dir_offset,
- STR_HASH_must_create|BTREE_ITER_with_updates);
+ &dir_u->bi_size,
+ STR_HASH_must_create);
if (ret)
goto err;
@@ -221,13 +224,14 @@ int bch2_link_trans(struct btree_trans *trans,
}
dir_u->bi_mtime = dir_u->bi_ctime = now;
- dir_u->bi_size += dirent_occupied_size(name);
dir_hash = bch2_hash_info_init(c, dir_u);
ret = bch2_dirent_create(trans, dir, &dir_hash,
mode_to_type(inode_u->bi_mode),
- name, inum.inum, &dir_offset,
+ name, inum.inum,
+ &dir_offset,
+ &dir_u->bi_size,
STR_HASH_must_create);
if (ret)
goto err;
@@ -266,8 +270,16 @@ int bch2_unlink_trans(struct btree_trans *trans,
dir_hash = bch2_hash_info_init(c, dir_u);
- ret = bch2_dirent_lookup_trans(trans, &dirent_iter, dir, &dir_hash,
- name, &inum, BTREE_ITER_intent);
+ struct bkey_s_c dirent_k =
+ bch2_hash_lookup(trans, &dirent_iter, bch2_dirent_hash_desc,
+ &dir_hash, dir, name, BTREE_ITER_intent);
+ ret = bkey_err(dirent_k);
+ if (ret)
+ goto err;
+
+ ret = bch2_dirent_read_target(trans, dir, bkey_s_c_to_dirent(dirent_k), &inum);
+ if (ret > 0)
+ ret = -ENOENT;
if (ret)
goto err;
@@ -324,7 +336,7 @@ int bch2_unlink_trans(struct btree_trans *trans,
dir_u->bi_mtime = dir_u->bi_ctime = inode_u->bi_ctime = now;
dir_u->bi_nlink -= is_subdir_for_nlink(inode_u);
- dir_u->bi_size -= dirent_occupied_size(name);
+ dir_u->bi_size -= bkey_bytes(dirent_k.k);
ret = bch2_hash_delete_at(trans, bch2_dirent_hash_desc,
&dir_hash, &dirent_iter,
@@ -420,8 +432,8 @@ int bch2_rename_trans(struct btree_trans *trans,
}
ret = bch2_dirent_rename(trans,
- src_dir, &src_hash,
- dst_dir, &dst_hash,
+ src_dir, &src_hash, &src_dir_u->bi_size,
+ dst_dir, &dst_hash, &dst_dir_u->bi_size,
src_name, &src_inum, &src_offset,
dst_name, &dst_inum, &dst_offset,
mode);
@@ -463,14 +475,6 @@ int bch2_rename_trans(struct btree_trans *trans,
goto err;
}
- if (mode == BCH_RENAME) {
- src_dir_u->bi_size -= dirent_occupied_size(src_name);
- dst_dir_u->bi_size += dirent_occupied_size(dst_name);
- }
-
- if (mode == BCH_RENAME_OVERWRITE)
- src_dir_u->bi_size -= dirent_occupied_size(src_name);
-
if (src_inode_u->bi_parent_subvol)
src_inode_u->bi_parent_subvol = dst_dir.subvol;
diff --git a/fs/bcachefs/fsck.c b/fs/bcachefs/fsck.c
index 53a421ff136d..24ad1a42d169 100644
--- a/fs/bcachefs/fsck.c
+++ b/fs/bcachefs/fsck.c
@@ -1132,10 +1132,7 @@ static int check_directory_size(struct btree_trans *trans,
if (k.k->type != KEY_TYPE_dirent)
continue;
- struct bkey_s_c_dirent dirent = bkey_s_c_to_dirent(k);
- struct qstr name = bch2_dirent_get_name(dirent);
-
- new_size += dirent_occupied_size(&name);
+ new_size += bkey_bytes(k.k);
}
bch2_trans_iter_exit(trans, &iter);
--
2.45.2
* [PATCH 17/18] bcachefs: Split out dirent alloc and name initialization
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
` (15 preceding siblings ...)
2025-02-13 18:46 ` [PATCH 16/18] bcachefs: Kill dirent_occupied_size() Kent Overstreet
@ 2025-02-13 18:46 ` Kent Overstreet
2025-02-13 18:46 ` [PATCH 18/18] bcachefs: bcachefs_metadata_version_casefolding Kent Overstreet
17 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:46 UTC (permalink / raw)
To: linux-bcachefs
Cc: Joshua Ashton, André Almeida, Gabriel Krisman Bertazi,
Kent Overstreet
From: Joshua Ashton <joshua@froggi.es>
Split out the code that allocates the dirent and initializes the name,
to make casefolding easier to implement in a future commit.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
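Note for readers: the split into "allocate from a length" and "fill in the
name" can be illustrated with a small userspace model. Everything here is a
hypothetical sketch: `struct toy_dirent` and the `toy_dirent_*` helpers are
invented for illustration, and the rounding is applied to the name alone,
which is a simplification of how the real key size is computed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy dirent: a buffer size plus a zero-padded name buffer. */
struct toy_dirent {
	size_t	name_bytes;	/* buffer size, rounded up to 8 bytes */
	char	name[];
};

/* Step 1: allocate, sized from the name length alone (cf. dirent_alloc_key()). */
struct toy_dirent *toy_dirent_alloc(size_t name_len)
{
	size_t padded = (name_len + 7) & ~(size_t) 7;
	struct toy_dirent *d = malloc(sizeof(*d) + padded);

	if (d)
		d->name_bytes = padded;
	return d;
}

/* Step 2: copy the name and zero the padding (cf. dirent_init_regular_name()). */
void toy_dirent_init_name(struct toy_dirent *d, const char *name, size_t len)
{
	memcpy(d->name, name, len);
	memset(d->name + len, 0, d->name_bytes - len);
}

/* The name length is recovered by stripping the trailing NUL padding. */
size_t toy_dirent_name_len(const struct toy_dirent *d)
{
	size_t len = d->name_bytes;

	while (len && !d->name[len - 1])
		len--;
	return len;
}
```

Keeping step 1 independent of the name contents is what lets a later patch
plug in a different step 2 that writes more than one string into the same
allocation.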
fs/bcachefs/dirent.c | 46 ++++++++++++++++++++++++++++++++------------
1 file changed, 34 insertions(+), 12 deletions(-)
diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c
index 27737aaa03a6..7dcc18000726 100644
--- a/fs/bcachefs/dirent.c
+++ b/fs/bcachefs/dirent.c
@@ -163,15 +163,13 @@ void bch2_dirent_to_text(struct printbuf *out, struct bch_fs *c, struct bkey_s_c
prt_printf(out, " type %s", bch2_d_type_str(d.v->d_type));
}
-static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
- subvol_inum dir, u8 type,
- const struct qstr *name, u64 dst)
+static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans,
+ subvol_inum dir,
+ u8 type,
+ int name_len, u64 dst)
{
struct bkey_i_dirent *dirent;
- unsigned u64s = BKEY_U64s + dirent_val_u64s(name->len);
-
- if (name->len > BCH_NAME_MAX)
- return ERR_PTR(-ENAMETOOLONG);
+ unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len);
BUG_ON(u64s > U8_MAX);
@@ -191,11 +189,35 @@ static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
dirent->v.d_type = type;
- memcpy(dirent->v.d_name, name->name, name->len);
- memset(dirent->v.d_name + name->len, 0,
- bkey_val_bytes(&dirent->k) -
- offsetof(struct bch_dirent, d_name) -
- name->len);
+ return dirent;
+}
+
+static void dirent_init_regular_name(struct bkey_i_dirent *dirent,
+ const struct qstr *name)
+{
+ memcpy(&dirent->v.d_name[0], name->name, name->len);
+ memset(&dirent->v.d_name[name->len], 0,
+ bkey_val_bytes(&dirent->k) -
+ offsetof(struct bch_dirent, d_name) -
+ name->len);
+}
+
+static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
+ subvol_inum dir,
+ u8 type,
+ const struct qstr *name,
+ u64 dst)
+{
+ struct bkey_i_dirent *dirent;
+
+ if (name->len > BCH_NAME_MAX)
+ return ERR_PTR(-ENAMETOOLONG);
+
+ dirent = dirent_alloc_key(trans, dir, type, name->len, dst);
+ if (IS_ERR(dirent))
+ return dirent;
+
+ dirent_init_regular_name(dirent, name);
EBUG_ON(bch2_dirent_name_bytes(dirent_i_to_s_c(dirent)) != name->len);
--
2.45.2
* [PATCH 18/18] bcachefs: bcachefs_metadata_version_casefolding
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
` (16 preceding siblings ...)
2025-02-13 18:46 ` [PATCH 17/18] bcachefs: Split out dirent alloc and name initialization Kent Overstreet
@ 2025-02-13 18:46 ` Kent Overstreet
2025-02-21 18:26 ` [PATCH] bcachefs: Use flexible arrays in dirent Gabriel de Perthuis
17 siblings, 1 reply; 22+ messages in thread
From: Kent Overstreet @ 2025-02-13 18:46 UTC (permalink / raw)
To: linux-bcachefs
Cc: Joshua Ashton, André Almeida, Gabriel Krisman Bertazi,
Kent Overstreet
From: Joshua Ashton <joshua@froggi.es>
This patch implements support for case-insensitive file name lookups
in bcachefs.
The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs use.
More information is provided in
Documentation/filesystems/bcachefs/casefolding.rst.
Compatibility notes:
This uses the new versioning scheme for incompatible features where an
incompatible feature is tied to a version number: the superblock says
"we may use incompat features up to x" and "incompat features up to x
are in use", disallowing mounting by previous versions.
Additionally, an old-style incompat feature bit is used, so that
kernels without utf8 casefolding support know whether casefolding
specifically is in use and whether they're allowed to mount.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
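Note for readers: the casefolded name block described in casefolding.rst
([reg len][cf len][regular name][casefolded name][nul]...) can be parsed as
in the userspace sketch here. It assumes two little-endian 16-bit lengths
followed by the two strings back to back, as the accessors in this patch
suggest. `struct name_view`, `get_le16` and `parse_cf_name_block` are
hypothetical names, not part of bcachefs.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one name inside the block; not NUL-terminated. */
struct name_view {
	const uint8_t	*p;
	size_t		len;
};

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t) (p[0] | (p[1] << 8));
}

/*
 * Parse [reg len][cf len][regular name][casefolded name][nul]... from a
 * block of block_len bytes. Returns 0 on success, -1 if the two lengths
 * do not fit in the block (the same kind of bound the validate hook checks).
 */
int parse_cf_name_block(const uint8_t *block, size_t block_len,
			struct name_view *name, struct name_view *cf_name)
{
	if (block_len < 4)
		return -1;

	size_t name_len = get_le16(block);
	size_t cf_len   = get_le16(block + 2);

	if (name_len + cf_len > block_len - 4)
		return -1;

	name->p      = block + 4;
	name->len    = name_len;
	cf_name->p   = block + 4 + name_len;	/* starts right after the regular name */
	cf_name->len = cf_len;
	return 0;
}
```

Storing both lengths explicitly means the parser never has to assume the
casefolded string is as long as the regular one, which is the edge case the
Rationale section of the documentation is concerned with.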
.../filesystems/bcachefs/casefolding.rst | 87 +++++++++
fs/bcachefs/bcachefs.h | 6 +
fs/bcachefs/bcachefs_format.h | 6 +-
fs/bcachefs/dirent.c | 170 +++++++++++++++---
fs/bcachefs/dirent.h | 9 +-
fs/bcachefs/dirent_format.h | 20 ++-
fs/bcachefs/fs-ioctl.c | 25 +++
fs/bcachefs/fs-ioctl.h | 20 ++-
fs/bcachefs/fs.c | 17 ++
fs/bcachefs/inode_format.h | 3 +-
fs/bcachefs/sb-errors_format.h | 4 +-
fs/bcachefs/str_hash.c | 2 +-
fs/bcachefs/str_hash.h | 4 +
fs/bcachefs/super.c | 19 ++
14 files changed, 350 insertions(+), 42 deletions(-)
create mode 100644 Documentation/filesystems/bcachefs/casefolding.rst
diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
new file mode 100644
index 000000000000..6546aa4f7a86
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/casefolding.rst
@@ -0,0 +1,87 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Casefolding
+===========
+
+bcachefs has support for case-insensitive file and directory
+lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
+casefolding attributes.
+
+The main use case for casefolding is compatibility with software written
+against other filesystems that rely on casefolded lookups
+(e.g. NTFS and Wine/Proton).
+Taking advantage of file-system level casefolding can lead to great
+loading time gains in many applications and games.
+
+Casefolding support requires a kernel with `CONFIG_UNICODE` enabled.
+Once a directory has been flagged for casefolding, a feature bit
+is enabled on the superblock which marks the filesystem as using
+casefolding.
+When the feature bit for casefolding is enabled, it is no longer possible
+to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
+
+On the lookup/query side: casefolding is implemented by allocating a new
+string of `BCH_NAME_MAX` length and using the `utf8_casefold` function to
+casefold the query string into it.
+
+On the dirent side: casefolding is implemented by ensuring the `bkey`'s
+hash is made from the casefolded string and storing the cached casefolded
+name with the regular name in the dirent.
+
+The structure looks like this:
+
+Regular: [dirent data][regular name][nul][nul]...
+Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
+
+(Do note, the number of `NUL`s here is merely for illustration; their count can vary
+ per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.)
+
+This is efficient, as it means that any file lookup that requires casefolding
+has identical performance to a regular lookup:
+a hash comparison and a `memcmp` of the name.
+
+Rationale
+---------
+
+Several designs were considered for this system:
+One was to introduce a dirent_v2, however that would be painful especially as
+the hash system only has support for a single key type. This would also need
+`BCH_NAME_MAX` to change between versions, and a new feature bit.
+
+Another option was to store without the two lengths, and just take the length of
+the regular name and casefolded name contiguously / 2 as the length. This would
+assume that the regular length == casefolded length, but that could potentially
+not be true, if the uppercase unicode glyph had a different UTF-8 encoded length than
+the lowercase unicode glyph.
+It would be possible to disregard the casefold cache for those cases, but it was
+decided to simply encode the two string lengths in the key to avoid random
+performance issues if this edgecase was ever hit.
+
+The option settled on was to use a free bit in d_type to mark a dirent as having
+a casefold cache, and then treat the first 4 bytes of the name block as lengths.
+You can see this in the `d_cf_name_block` member of the union in `bch_dirent`.
+
+The feature bit was used to allow casefolding support to be enabled for the majority
+of users, while still allowing users who have no need for the feature to use bcachefs, as
+`CONFIG_UNICODE` can increase the kernel size by a significant amount due to the tables used,
+which may be the deciding factor in using bcachefs on e.g. embedded platforms.
+
+Other filesystems like ext4 and f2fs have a super-block level option for casefolding
+encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
+any encodings other than a single UTF-8 version. When future encodings are desirable,
+they will be added trivially using the opts mechanism.
+
+dentry/dcache considerations
+----------------------------
+
+Currently, in casefolded directories, bcachefs (like other filesystems) will not cache
+negative dentries.
+
+This is because currently doing so presents a problem in the following scenario:
+ - Lookup file "blAH" in a casefolded directory
+ - Creation of file "BLAH" in a casefolded directory
+ - Lookup file "blAH" in a casefolded directory
+This would fail if negative dentries were cached.
+
+This is slightly suboptimal, but could be fixed in the future with some vfs work.
+
diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 28216297edbb..02067366820d 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -203,6 +203,7 @@
#include <linux/types.h>
#include <linux/workqueue.h>
#include <linux/zstd.h>
+#include <linux/unicode.h>
#include "bcachefs_format.h"
#include "btree_journal_iter_types.h"
@@ -699,6 +700,8 @@ enum bch_write_ref {
BCH_WRITE_REF_NR,
};
+#define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0)
+
struct bch_fs {
struct closure cl;
@@ -783,6 +786,9 @@ struct bch_fs {
u64 btrees_lost_data;
} sb;
+#ifdef CONFIG_UNICODE
+ struct unicode_map *cf_encoding;
+#endif
struct bch_sb_handle disk_sb;
diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h
index b4ac311f21a1..13cc0833b488 100644
--- a/fs/bcachefs/bcachefs_format.h
+++ b/fs/bcachefs/bcachefs_format.h
@@ -689,7 +689,8 @@ struct bch_sb_field_ext {
x(directory_size, BCH_VERSION(1, 20)) \
x(cached_backpointers, BCH_VERSION(1, 21)) \
x(stripe_backpointers, BCH_VERSION(1, 22)) \
- x(stripe_lru, BCH_VERSION(1, 23))
+ x(stripe_lru, BCH_VERSION(1, 23)) \
+ x(casefolding, BCH_VERSION(1, 24))
enum bcachefs_metadata_version {
bcachefs_metadata_version_min = 9,
@@ -911,7 +912,8 @@ static inline void SET_BCH_SB_BACKGROUND_COMPRESSION_TYPE(struct bch_sb *sb, __u
x(journal_no_flush, 16) \
x(alloc_v2, 17) \
x(extents_across_btree_nodes, 18) \
- x(incompat_version_field, 19)
+ x(incompat_version_field, 19) \
+ x(casefolding, 20)
#define BCH_SB_FEATURES_ALWAYS \
(BIT_ULL(BCH_FEATURE_new_extent_overwrite)| \
diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c
index 7dcc18000726..f4c283d1e86a 100644
--- a/fs/bcachefs/dirent.c
+++ b/fs/bcachefs/dirent.c
@@ -13,6 +13,40 @@
#include <linux/dcache.h>
+static int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info,
+ const struct qstr *str, struct qstr *out_cf)
+{
+ *out_cf = (struct qstr) QSTR_INIT(NULL, 0);
+
+#ifdef CONFIG_UNICODE
+ unsigned char *buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1);
+ int ret = PTR_ERR_OR_ZERO(buf);
+ if (ret)
+ return ret;
+
+ ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1);
+ if (ret <= 0)
+ return ret;
+
+ *out_cf = (struct qstr) QSTR_INIT(buf, ret);
+ return 0;
+#else
+ return -EOPNOTSUPP;
+#endif
+}
+
+static inline int bch2_maybe_casefold(struct btree_trans *trans,
+ const struct bch_hash_info *info,
+ const struct qstr *str, struct qstr *out_cf)
+{
+ if (likely(!info->cf_encoding)) {
+ *out_cf = *str;
+ return 0;
+ } else {
+ return bch2_casefold(trans, info, str, out_cf);
+ }
+}
+
static unsigned bch2_dirent_name_bytes(struct bkey_s_c_dirent d)
{
if (bkey_val_bytes(d.k) < offsetof(struct bch_dirent, d_name))
@@ -28,13 +62,38 @@ static unsigned bch2_dirent_name_bytes(struct bkey_s_c_dirent d)
#endif
return bkey_bytes -
- offsetof(struct bch_dirent, d_name) -
+ (d.v->d_casefold
+ ? offsetof(struct bch_dirent, d_cf_name_block.d_names)
+ : offsetof(struct bch_dirent, d_name)) -
trailing_nuls;
}
struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d)
{
- return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d));
+ if (d.v->d_casefold) {
+ unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len);
+ return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[0], name_len);
+ } else {
+ return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d));
+ }
+}
+
+static struct qstr bch2_dirent_get_casefold_name(struct bkey_s_c_dirent d)
+{
+ if (d.v->d_casefold) {
+ unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len);
+ unsigned cf_name_len = le16_to_cpu(d.v->d_cf_name_block.d_cf_name_len);
+ return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[name_len], cf_name_len);
+ } else {
+ return (struct qstr) QSTR_INIT(NULL, 0);
+ }
+}
+
+static inline struct qstr bch2_dirent_get_lookup_name(struct bkey_s_c_dirent d)
+{
+ return d.v->d_casefold
+ ? bch2_dirent_get_casefold_name(d)
+ : bch2_dirent_get_name(d);
}
static u64 bch2_dirent_hash(const struct bch_hash_info *info,
@@ -57,7 +116,7 @@ static u64 dirent_hash_key(const struct bch_hash_info *info, const void *key)
static u64 dirent_hash_bkey(const struct bch_hash_info *info, struct bkey_s_c k)
{
struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k);
- struct qstr name = bch2_dirent_get_name(d);
+ struct qstr name = bch2_dirent_get_lookup_name(d);
return bch2_dirent_hash(info, &name);
}
@@ -65,7 +124,7 @@ static u64 dirent_hash_bkey(const struct bch_hash_info *info, struct bkey_s_c k)
static bool dirent_cmp_key(struct bkey_s_c _l, const void *_r)
{
struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l);
- const struct qstr l_name = bch2_dirent_get_name(l);
+ const struct qstr l_name = bch2_dirent_get_lookup_name(l);
const struct qstr *r_name = _r;
return !qstr_eq(l_name, *r_name);
@@ -75,8 +134,8 @@ static bool dirent_cmp_bkey(struct bkey_s_c _l, struct bkey_s_c _r)
{
struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l);
struct bkey_s_c_dirent r = bkey_s_c_to_dirent(_r);
- const struct qstr l_name = bch2_dirent_get_name(l);
- const struct qstr r_name = bch2_dirent_get_name(r);
+ const struct qstr l_name = bch2_dirent_get_lookup_name(l);
+ const struct qstr r_name = bch2_dirent_get_lookup_name(r);
return !qstr_eq(l_name, r_name);
}
@@ -104,17 +163,19 @@ int bch2_dirent_validate(struct bch_fs *c, struct bkey_s_c k,
struct bkey_validate_context from)
{
struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k);
+ unsigned name_block_len = bch2_dirent_name_bytes(d);
struct qstr d_name = bch2_dirent_get_name(d);
+ struct qstr d_cf_name = bch2_dirent_get_casefold_name(d);
int ret = 0;
bkey_fsck_err_on(!d_name.len,
c, dirent_empty_name,
"empty name");
- bkey_fsck_err_on(bkey_val_u64s(k.k) > dirent_val_u64s(d_name.len),
+ bkey_fsck_err_on(d_name.len + d_cf_name.len > name_block_len,
c, dirent_val_too_big,
- "value too big (%zu > %u)",
- bkey_val_u64s(k.k), dirent_val_u64s(d_name.len));
+ "dirent names exceed bkey size (%d + %d > %d)",
+ d_name.len, d_cf_name.len, name_block_len);
/*
* Check new keys don't exceed the max length
@@ -142,6 +203,18 @@ int bch2_dirent_validate(struct bch_fs *c, struct bkey_s_c k,
le64_to_cpu(d.v->d_inum) == d.k->p.inode,
c, dirent_to_itself,
"dirent points to own directory");
+
+ if (d.v->d_casefold) {
+ bkey_fsck_err_on(from.from == BKEY_VALIDATE_commit &&
+ d_cf_name.len > BCH_NAME_MAX,
+ c, dirent_cf_name_too_big,
+ "dirent w/ cf name too big (%u > %u)",
+ d_cf_name.len, BCH_NAME_MAX);
+
+ bkey_fsck_err_on(d_cf_name.len != strnlen(d_cf_name.name, d_cf_name.len),
+ c, dirent_stray_data_after_cf_name,
+ "dirent has stray data after cf name's NUL");
+ }
fsck_err:
return ret;
}
@@ -166,10 +239,11 @@ void bch2_dirent_to_text(struct printbuf *out, struct bch_fs *c, struct bkey_s_c
static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans,
subvol_inum dir,
u8 type,
- int name_len, u64 dst)
+ int name_len, int cf_name_len,
+ u64 dst)
{
struct bkey_i_dirent *dirent;
- unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len);
+ unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len, cf_name_len);
BUG_ON(u64s > U8_MAX);
@@ -188,6 +262,8 @@ static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans,
}
dirent->v.d_type = type;
+ dirent->v.d_unused = 0;
+ dirent->v.d_casefold = cf_name_len ? 1 : 0;
return dirent;
}
@@ -195,6 +271,8 @@ static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans,
static void dirent_init_regular_name(struct bkey_i_dirent *dirent,
const struct qstr *name)
{
+ EBUG_ON(dirent->v.d_casefold);
+
memcpy(&dirent->v.d_name[0], name->name, name->len);
memset(&dirent->v.d_name[name->len], 0,
bkey_val_bytes(&dirent->k) -
@@ -202,10 +280,30 @@ static void dirent_init_regular_name(struct bkey_i_dirent *dirent,
name->len);
}
+static void dirent_init_casefolded_name(struct bkey_i_dirent *dirent,
+ const struct qstr *name,
+ const struct qstr *cf_name)
+{
+ EBUG_ON(!dirent->v.d_casefold);
+ EBUG_ON(!cf_name->len);
+
+ dirent->v.d_cf_name_block.d_name_len = cpu_to_le16(name->len);
+ dirent->v.d_cf_name_block.d_cf_name_len = cpu_to_le16(cf_name->len);
+ memcpy(&dirent->v.d_cf_name_block.d_names[0], name->name, name->len);
+ memcpy(&dirent->v.d_cf_name_block.d_names[name->len], cf_name->name, cf_name->len);
+ memset(&dirent->v.d_cf_name_block.d_names[name->len + cf_name->len], 0,
+ bkey_val_bytes(&dirent->k) -
+ offsetof(struct bch_dirent, d_cf_name_block.d_names) -
+ name->len - cf_name->len);
+
+ EBUG_ON(bch2_dirent_get_casefold_name(dirent_i_to_s_c(dirent)).len != cf_name->len);
+}
+
static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
subvol_inum dir,
u8 type,
const struct qstr *name,
+ const struct qstr *cf_name,
u64 dst)
{
struct bkey_i_dirent *dirent;
@@ -213,13 +311,16 @@ static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
if (name->len > BCH_NAME_MAX)
return ERR_PTR(-ENAMETOOLONG);
- dirent = dirent_alloc_key(trans, dir, type, name->len, dst);
+ dirent = dirent_alloc_key(trans, dir, type, name->len, cf_name ? cf_name->len : 0, dst);
if (IS_ERR(dirent))
return dirent;
- dirent_init_regular_name(dirent, name);
+ if (cf_name)
+ dirent_init_casefolded_name(dirent, name, cf_name);
+ else
+ dirent_init_regular_name(dirent, name);
- EBUG_ON(bch2_dirent_name_bytes(dirent_i_to_s_c(dirent)) != name->len);
+ EBUG_ON(bch2_dirent_get_name(dirent_i_to_s_c(dirent)).len != name->len);
return dirent;
}
@@ -235,7 +336,7 @@ int bch2_dirent_create_snapshot(struct btree_trans *trans,
struct bkey_i_dirent *dirent;
int ret;
- dirent = dirent_create_key(trans, dir_inum, type, name, dst_inum);
+ dirent = dirent_create_key(trans, dir_inum, type, name, NULL, dst_inum);
ret = PTR_ERR_OR_ZERO(dirent);
if (ret)
return ret;
@@ -261,7 +362,16 @@ int bch2_dirent_create(struct btree_trans *trans, subvol_inum dir,
struct bkey_i_dirent *dirent;
int ret;
- dirent = dirent_create_key(trans, dir, type, name, dst_inum);
+ if (hash_info->cf_encoding) {
+ struct qstr cf_name;
+ ret = bch2_casefold(trans, hash_info, name, &cf_name);
+ if (ret)
+ return ret;
+ dirent = dirent_create_key(trans, dir, type, name, &cf_name, dst_inum);
+ } else {
+ dirent = dirent_create_key(trans, dir, type, name, NULL, dst_inum);
+ }
+
ret = PTR_ERR_OR_ZERO(dirent);
if (ret)
return ret;
@@ -306,6 +416,7 @@ int bch2_dirent_rename(struct btree_trans *trans,
const struct qstr *dst_name, subvol_inum *dst_inum, u64 *dst_offset,
enum bch_rename_mode mode)
{
+ struct qstr src_name_lookup, dst_name_lookup;
struct btree_iter src_iter = { NULL };
struct btree_iter dst_iter = { NULL };
struct bkey_s_c old_src, old_dst = bkey_s_c_null;
@@ -320,8 +431,11 @@ int bch2_dirent_rename(struct btree_trans *trans,
memset(dst_inum, 0, sizeof(*dst_inum));
/* Lookup src: */
+ ret = bch2_maybe_casefold(trans, src_hash, src_name, &src_name_lookup);
+ if (ret)
+ goto out;
old_src = bch2_hash_lookup(trans, &src_iter, bch2_dirent_hash_desc,
- src_hash, src_dir, src_name,
+ src_hash, src_dir, &src_name_lookup,
BTREE_ITER_intent);
ret = bkey_err(old_src);
if (ret)
@@ -333,6 +447,9 @@ int bch2_dirent_rename(struct btree_trans *trans,
goto out;
/* Lookup dst: */
+ ret = bch2_maybe_casefold(trans, dst_hash, dst_name, &dst_name_lookup);
+ if (ret)
+ goto out;
if (mode == BCH_RENAME) {
/*
* Note that we're _not_ checking if the target already exists -
@@ -340,12 +457,12 @@ int bch2_dirent_rename(struct btree_trans *trans,
* correctness:
*/
ret = bch2_hash_hole(trans, &dst_iter, bch2_dirent_hash_desc,
- dst_hash, dst_dir, dst_name);
+ dst_hash, dst_dir, &dst_name_lookup);
if (ret)
goto out;
} else {
old_dst = bch2_hash_lookup(trans, &dst_iter, bch2_dirent_hash_desc,
- dst_hash, dst_dir, dst_name,
+ dst_hash, dst_dir, &dst_name_lookup,
BTREE_ITER_intent);
ret = bkey_err(old_dst);
if (ret)
@@ -361,7 +478,8 @@ int bch2_dirent_rename(struct btree_trans *trans,
*src_offset = dst_iter.pos.offset;
/* Create new dst key: */
- new_dst = dirent_create_key(trans, dst_dir, 0, dst_name, 0);
+ new_dst = dirent_create_key(trans, dst_dir, 0, dst_name,
+ dst_hash->cf_encoding ? &dst_name_lookup : NULL, 0);
ret = PTR_ERR_OR_ZERO(new_dst);
if (ret)
goto out;
@@ -371,7 +489,8 @@ int bch2_dirent_rename(struct btree_trans *trans,
/* Create new src key: */
if (mode == BCH_RENAME_EXCHANGE) {
- new_src = dirent_create_key(trans, src_dir, 0, src_name, 0);
+ new_src = dirent_create_key(trans, src_dir, 0, src_name,
+ src_hash->cf_encoding ? &src_name_lookup : NULL, 0);
ret = PTR_ERR_OR_ZERO(new_src);
if (ret)
goto out;
@@ -498,9 +617,14 @@ int bch2_dirent_lookup_trans(struct btree_trans *trans,
const struct qstr *name, subvol_inum *inum,
unsigned flags)
{
+ struct qstr lookup_name;
+ int ret = bch2_maybe_casefold(trans, hash_info, name, &lookup_name);
+ if (ret)
+ return ret;
+
struct bkey_s_c k = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc,
- hash_info, dir, name, flags);
- int ret = bkey_err(k);
+ hash_info, dir, &lookup_name, flags);
+ ret = bkey_err(k);
if (ret)
goto err;
diff --git a/fs/bcachefs/dirent.h b/fs/bcachefs/dirent.h
index 37f01c1a3f7f..a6e15a012936 100644
--- a/fs/bcachefs/dirent.h
+++ b/fs/bcachefs/dirent.h
@@ -25,10 +25,13 @@ struct bch_inode_info;
struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d);
-static inline unsigned dirent_val_u64s(unsigned len)
+static inline unsigned dirent_val_u64s(unsigned len, unsigned cf_len)
{
- return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len,
- sizeof(u64));
+ unsigned bytes = cf_len
+ ? offsetof(struct bch_dirent, d_cf_name_block.d_names) + len + cf_len
+ : offsetof(struct bch_dirent, d_name) + len;
+
+ return DIV_ROUND_UP(bytes, sizeof(u64));
}
int bch2_dirent_read_target(struct btree_trans *, subvol_inum,
diff --git a/fs/bcachefs/dirent_format.h b/fs/bcachefs/dirent_format.h
index 5e116b88e814..2e766032e1e9 100644
--- a/fs/bcachefs/dirent_format.h
+++ b/fs/bcachefs/dirent_format.h
@@ -29,9 +29,25 @@ struct bch_dirent {
* Copy of mode bits 12-15 from the target inode - so userspace can get
* the filetype without having to do a stat()
*/
- __u8 d_type;
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+ __u8 d_type:5,
+ d_unused:2,
+ d_casefold:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+ __u8 d_casefold:1,
+ d_unused:2,
+ d_type:5;
+#endif
- __u8 d_name[];
+ union {
+ struct {
+ __u8 d_pad;
+ __le16 d_name_len;
+ __le16 d_cf_name_len;
+ __u8 d_names[0];
+ } d_cf_name_block __packed;
+ __u8 d_name[0];
+ } __packed;
} __packed __aligned(8);
#define DT_SUBVOL 16
diff --git a/fs/bcachefs/fs-ioctl.c b/fs/bcachefs/fs-ioctl.c
index 15725b4ce393..4465a2a821e3 100644
--- a/fs/bcachefs/fs-ioctl.c
+++ b/fs/bcachefs/fs-ioctl.c
@@ -54,6 +54,31 @@ static int bch2_inode_flags_set(struct btree_trans *trans,
(newflags & (BCH_INODE_nodump|BCH_INODE_noatime)) != newflags)
return -EINVAL;
+ if ((newflags ^ oldflags) & BCH_INODE_casefolded) {
+#ifdef CONFIG_UNICODE
+ int ret = 0;
+ /* Not supported on individual files. */
+ if (!S_ISDIR(bi->bi_mode))
+ return -EOPNOTSUPP;
+
+ /*
+ * Make sure the dir is empty, as otherwise we'd need to
+ * rehash everything and update the dirent keys.
+ */
+ ret = bch2_empty_dir_trans(trans, inode_inum(inode));
+ if (ret < 0)
+ return ret;
+
+ if (!bch2_request_incompat_feature(c,bcachefs_metadata_version_casefolding))
+ return -EOPNOTSUPP;
+
+ bch2_check_set_feature(c, BCH_FEATURE_casefolding);
+#else
+ printk(KERN_ERR "Cannot use casefolding on a kernel without CONFIG_UNICODE\n");
+ return -EOPNOTSUPP;
+#endif
+ }
+
if (s->set_projinherit) {
bi->bi_fields_set &= ~(1 << Inode_opt_project);
bi->bi_fields_set |= ((int) s->projinherit << Inode_opt_project);
diff --git a/fs/bcachefs/fs-ioctl.h b/fs/bcachefs/fs-ioctl.h
index d30f9bb056fd..ecd3bfdcde21 100644
--- a/fs/bcachefs/fs-ioctl.h
+++ b/fs/bcachefs/fs-ioctl.h
@@ -6,19 +6,21 @@
/* bcachefs inode flags -> vfs inode flags: */
static const __maybe_unused unsigned bch_flags_to_vfs[] = {
- [__BCH_INODE_sync] = S_SYNC,
- [__BCH_INODE_immutable] = S_IMMUTABLE,
- [__BCH_INODE_append] = S_APPEND,
- [__BCH_INODE_noatime] = S_NOATIME,
+ [__BCH_INODE_sync] = S_SYNC,
+ [__BCH_INODE_immutable] = S_IMMUTABLE,
+ [__BCH_INODE_append] = S_APPEND,
+ [__BCH_INODE_noatime] = S_NOATIME,
+ [__BCH_INODE_casefolded] = S_CASEFOLD,
};
/* bcachefs inode flags -> FS_IOC_GETFLAGS: */
static const __maybe_unused unsigned bch_flags_to_uflags[] = {
- [__BCH_INODE_sync] = FS_SYNC_FL,
- [__BCH_INODE_immutable] = FS_IMMUTABLE_FL,
- [__BCH_INODE_append] = FS_APPEND_FL,
- [__BCH_INODE_nodump] = FS_NODUMP_FL,
- [__BCH_INODE_noatime] = FS_NOATIME_FL,
+ [__BCH_INODE_sync] = FS_SYNC_FL,
+ [__BCH_INODE_immutable] = FS_IMMUTABLE_FL,
+ [__BCH_INODE_append] = FS_APPEND_FL,
+ [__BCH_INODE_nodump] = FS_NODUMP_FL,
+ [__BCH_INODE_noatime] = FS_NOATIME_FL,
+ [__BCH_INODE_casefolded] = FS_CASEFOLD_FL,
};
/* bcachefs inode flags -> FS_IOC_FSGETXATTR: */
diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 90ade8f648d9..ad9eaad0f2fb 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -698,6 +698,23 @@ static struct dentry *bch2_lookup(struct inode *vdir, struct dentry *dentry,
if (IS_ERR(inode))
inode = NULL;
+#ifdef CONFIG_UNICODE
+ if (!inode && IS_CASEFOLDED(vdir)) {
+ /*
+ * Do not cache a negative dentry in casefolded directories
+ * as it would need to be invalidated in the following situation:
+ * - Lookup file "blAH" in a casefolded directory
+ * - Creation of file "BLAH" in a casefolded directory
+ * - Lookup file "blAH" in a casefolded directory
+ * which would fail if we had a negative dentry.
+ *
+ * We should come back to this when VFS has a method to handle
+ * this edgecase.
+ */
+ return NULL;
+ }
+#endif
+
return d_splice_alias(&inode->v, dentry);
}
diff --git a/fs/bcachefs/inode_format.h b/fs/bcachefs/inode_format.h
index b99a5bf1a75e..117110af1e3f 100644
--- a/fs/bcachefs/inode_format.h
+++ b/fs/bcachefs/inode_format.h
@@ -137,7 +137,8 @@ enum inode_opt_id {
x(i_sectors_dirty, 6) \
x(unlinked, 7) \
x(backptr_untrusted, 8) \
- x(has_child_snapshot, 9)
+ x(has_child_snapshot, 9) \
+ x(casefolded, 10)
/* bits 20+ reserved for packed fields below: */
diff --git a/fs/bcachefs/sb-errors_format.h b/fs/bcachefs/sb-errors_format.h
index b86ec013d7d7..cdafd877b8a1 100644
--- a/fs/bcachefs/sb-errors_format.h
+++ b/fs/bcachefs/sb-errors_format.h
@@ -314,7 +314,9 @@ enum bch_fsck_flags {
x(compression_opt_not_marked_in_sb, 295, FSCK_AUTOFIX) \
x(compression_type_not_marked_in_sb, 296, FSCK_AUTOFIX) \
x(directory_size_mismatch, 303, FSCK_AUTOFIX) \
- x(MAX, 304, 0)
+ x(dirent_cf_name_too_big, 304, 0) \
+ x(dirent_stray_data_after_cf_name, 305, 0) \
+ x(MAX, 306, 0)
enum bch_sb_error_id {
#define x(t, n, ...) BCH_FSCK_ERR_##t = n,
diff --git a/fs/bcachefs/str_hash.c b/fs/bcachefs/str_hash.c
index d78451c2a0c6..93e71119e5a4 100644
--- a/fs/bcachefs/str_hash.c
+++ b/fs/bcachefs/str_hash.c
@@ -50,7 +50,7 @@ static noinline int fsck_rename_dirent(struct btree_trans *trans,
for (unsigned i = 0; i < 1000; i++) {
unsigned len = sprintf(new->v.d_name, "%.*s.fsck_renamed-%u",
old_name.len, old_name.name, i);
- unsigned u64s = BKEY_U64s + dirent_val_u64s(len);
+ unsigned u64s = BKEY_U64s + dirent_val_u64s(len, 0);
if (u64s > U8_MAX)
return -EINVAL;
diff --git a/fs/bcachefs/str_hash.h b/fs/bcachefs/str_hash.h
index 55a4ac7bf220..f645a4547b04 100644
--- a/fs/bcachefs/str_hash.h
+++ b/fs/bcachefs/str_hash.h
@@ -34,6 +34,7 @@ bch2_str_hash_opt_to_type(struct bch_fs *c, enum bch_str_hash_opts opt)
struct bch_hash_info {
u8 type;
+ struct unicode_map *cf_encoding;
/*
* For crc32 or crc64 string hashes the first key value of
* the siphash_key (k0) is used as the key.
@@ -47,6 +48,9 @@ bch2_hash_info_init(struct bch_fs *c, const struct bch_inode_unpacked *bi)
/* XXX ick */
struct bch_hash_info info = {
.type = INODE_STR_HASH(bi),
+#ifdef CONFIG_UNICODE
+ .cf_encoding = !!(bi->bi_flags & BCH_INODE_casefolded) ? c->cf_encoding : NULL,
+#endif
.siphash_key = { .k0 = bi->bi_hash_seed }
};
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index 6d97d412fed9..10c281ad96eb 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -837,6 +837,25 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts)
if (ret)
goto err;
+#ifdef CONFIG_UNICODE
+ /* Default encoding until we can potentially have more as an option. */
+ c->cf_encoding = utf8_load(BCH_FS_DEFAULT_UTF8_ENCODING);
+ if (IS_ERR(c->cf_encoding)) {
+ printk(KERN_ERR "Cannot load UTF-8 encoding for filesystem. Version: %u.%u.%u",
+ unicode_major(BCH_FS_DEFAULT_UTF8_ENCODING),
+ unicode_minor(BCH_FS_DEFAULT_UTF8_ENCODING),
+ unicode_rev(BCH_FS_DEFAULT_UTF8_ENCODING));
+ ret = -EINVAL;
+ goto err;
+ }
+#else
+ if (c->sb.features & BIT_ULL(BCH_FEATURE_casefolding)) {
+ printk(KERN_ERR "Cannot mount a filesystem with casefolding on a kernel without CONFIG_UNICODE\n");
+ ret = -EINVAL;
+ goto err;
+ }
+#endif
+
pr_uuid(&name, c->sb.user_uuid.b);
ret = name.allocation_failure ? -BCH_ERR_ENOMEM_fs_name_alloc : 0;
if (ret)
--
2.45.2
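
Archive note: the sizing rule in dirent_val_u64s() and the tail zeroing in
dirent_init_casefolded_name() above can be sketched in plain C. This is a
minimal sketch under stated assumptions, not the kernel code: the header
sizes 9 and 14 are assumed from the struct layout in this patch (an 8-byte
target plus one flags byte, then d_pad and the two __le16 lengths), and
sketch_val_u64s() and sketch_tail_padding() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed header sizes in bytes; illustrative, not taken from the kernel headers. */
enum {
	SKETCH_HDR_REGULAR  = 9,	/* bytes before d_name[] */
	SKETCH_HDR_CASEFOLD = 14,	/* bytes before d_cf_name_block.d_names[] */
};

/* Value size in u64s, mirroring the two-branch rule of dirent_val_u64s(). */
static inline unsigned sketch_val_u64s(unsigned len, unsigned cf_len)
{
	unsigned bytes = cf_len
		? SKETCH_HDR_CASEFOLD + len + cf_len
		: SKETCH_HDR_REGULAR + len;

	return (bytes + sizeof(uint64_t) - 1) / sizeof(uint64_t);
}

/*
 * Bytes left over after the name(s) that must be zeroed: the full value
 * size minus the header, minus BOTH names.
 */
static inline unsigned sketch_tail_padding(unsigned len, unsigned cf_len)
{
	unsigned hdr = cf_len ? SKETCH_HDR_CASEFOLD : SKETCH_HDR_REGULAR;

	return sketch_val_u64s(len, cf_len) * sizeof(uint64_t)
		- hdr - len - cf_len;
}
```

With these assumed sizes, a 4-byte name with a 4-byte casefolded copy
occupies 22 bytes, rounds up to three u64s, and leaves two bytes of zero
padding at the end of the value.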
* Re: [PATCH 16/18] bcachefs: Kill dirent_occupied_size()
2025-02-13 18:46 ` [PATCH 16/18] bcachefs: Kill dirent_occupied_size() Kent Overstreet
@ 2025-02-17 1:49 ` Hongbo Li
0 siblings, 0 replies; 22+ messages in thread
From: Hongbo Li @ 2025-02-17 1:49 UTC (permalink / raw)
To: Kent Overstreet, linux-bcachefs
On 2025/2/14 2:46, Kent Overstreet wrote:
> With the upcoming patches for casefolding, we really need to use the
> size of the dirent key itself - which is cleaner, anyways.
>
> The size of the dirent is no longer just a function of the length of the
> name, it'll be different depending on whether the directory has
> casefolding enabled - which means the accounting in rename() has to
> change a bit.
>
> Cc: Hongbo Li <lihongbo22@huawei.com>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
> fs/bcachefs/dirent.c | 15 +++++++++++++--
> fs/bcachefs/dirent.h | 11 +++--------
> fs/bcachefs/fs-common.c | 42 ++++++++++++++++++++++-------------------
> fs/bcachefs/fsck.c | 5 +----
> 4 files changed, 40 insertions(+), 33 deletions(-)
>
> diff --git a/fs/bcachefs/dirent.c b/fs/bcachefs/dirent.c
> index 600eee936f13..27737aaa03a6 100644
> --- a/fs/bcachefs/dirent.c
> +++ b/fs/bcachefs/dirent.c
> @@ -233,6 +233,7 @@ int bch2_dirent_create(struct btree_trans *trans, subvol_inum dir,
> const struct bch_hash_info *hash_info,
> u8 type, const struct qstr *name, u64 dst_inum,
> u64 *dir_offset,
> + u64 *i_size,
> enum btree_iter_update_trigger_flags flags)
> {
> struct bkey_i_dirent *dirent;
> @@ -243,6 +244,8 @@ int bch2_dirent_create(struct btree_trans *trans, subvol_inum dir,
> if (ret)
> return ret;
>
> + *i_size += bkey_bytes(&dirent->k);
> +
> ret = bch2_hash_set(trans, bch2_dirent_hash_desc, hash_info,
> dir, &dirent->k_i, flags);
> *dir_offset = dirent->k.p.offset;
> @@ -275,8 +278,8 @@ int bch2_dirent_read_target(struct btree_trans *trans, subvol_inum dir,
> }
>
> int bch2_dirent_rename(struct btree_trans *trans,
> - subvol_inum src_dir, struct bch_hash_info *src_hash,
> - subvol_inum dst_dir, struct bch_hash_info *dst_hash,
> + subvol_inum src_dir, struct bch_hash_info *src_hash, u64 *src_dir_i_size,
> + subvol_inum dst_dir, struct bch_hash_info *dst_hash, u64 *dst_dir_i_size,
> const struct qstr *src_name, subvol_inum *src_inum, u64 *src_offset,
> const struct qstr *dst_name, subvol_inum *dst_inum, u64 *dst_offset,
> enum bch_rename_mode mode)
> @@ -406,6 +409,14 @@ int bch2_dirent_rename(struct btree_trans *trans,
> new_src->v.d_type == DT_SUBVOL)
> new_src->v.d_parent_subvol = cpu_to_le32(src_dir.subvol);
>
> + if (old_dst.k)
> + *dst_dir_i_size -= bkey_bytes(old_dst.k);
> + *src_dir_i_size -= bkey_bytes(old_src.k);
> +
> + if (mode == BCH_RENAME_EXCHANGE)
> + *src_dir_i_size += bkey_bytes(&new_src->k);
> + *dst_dir_i_size += bkey_bytes(&new_dst->k);
> +
Ah, if I had dug into bkey_bytes, I might not have introduced
dirent_occupied_size. The logic seems equivalent and more reasonable.
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Thanks,
Hongbo
> ret = bch2_trans_update(trans, &dst_iter, &new_dst->k_i, 0);
> if (ret)
> goto out;
> diff --git a/fs/bcachefs/dirent.h b/fs/bcachefs/dirent.h
> index a633f83c1ac7..37f01c1a3f7f 100644
> --- a/fs/bcachefs/dirent.h
> +++ b/fs/bcachefs/dirent.h
> @@ -31,11 +31,6 @@ static inline unsigned dirent_val_u64s(unsigned len)
> sizeof(u64));
> }
>
> -static inline unsigned int dirent_occupied_size(const struct qstr *name)
> -{
> - return (BKEY_U64s + dirent_val_u64s(name->len)) * sizeof(u64);
> -}
> -
> int bch2_dirent_read_target(struct btree_trans *, subvol_inum,
> struct bkey_s_c_dirent, subvol_inum *);
>
> @@ -52,7 +47,7 @@ int bch2_dirent_create_snapshot(struct btree_trans *, u32, u64, u32,
> enum btree_iter_update_trigger_flags);
> int bch2_dirent_create(struct btree_trans *, subvol_inum,
> const struct bch_hash_info *, u8,
> - const struct qstr *, u64, u64 *,
> + const struct qstr *, u64, u64 *, u64 *,
> enum btree_iter_update_trigger_flags);
>
> static inline unsigned vfs_d_type(unsigned type)
> @@ -67,8 +62,8 @@ enum bch_rename_mode {
> };
>
> int bch2_dirent_rename(struct btree_trans *,
> - subvol_inum, struct bch_hash_info *,
> - subvol_inum, struct bch_hash_info *,
> + subvol_inum, struct bch_hash_info *, u64 *,
> + subvol_inum, struct bch_hash_info *, u64 *,
> const struct qstr *, subvol_inum *, u64 *,
> const struct qstr *, subvol_inum *, u64 *,
> enum bch_rename_mode);
> diff --git a/fs/bcachefs/fs-common.c b/fs/bcachefs/fs-common.c
> index d70d9f634cea..c8afd312e601 100644
> --- a/fs/bcachefs/fs-common.c
> +++ b/fs/bcachefs/fs-common.c
> @@ -42,11 +42,14 @@ int bch2_create_trans(struct btree_trans *trans,
> if (ret)
> goto err;
>
> - ret = bch2_inode_peek(trans, &dir_iter, dir_u, dir,
> - BTREE_ITER_intent|BTREE_ITER_with_updates);
> + ret = bch2_inode_peek(trans, &dir_iter, dir_u, dir, BTREE_ITER_intent);
> if (ret)
> goto err;
>
> + /* Inherit casefold state from parent. */
> + if (S_ISDIR(mode))
> + new_inode->bi_flags |= dir_u->bi_flags & BCH_INODE_casefolded;
> +
> if (!(flags & BCH_CREATE_SNAPSHOT)) {
> /* Normal create path - allocate a new inode: */
> bch2_inode_init_late(new_inode, now, uid, gid, mode, rdev, dir_u);
> @@ -152,7 +155,6 @@ int bch2_create_trans(struct btree_trans *trans,
> if (is_subdir_for_nlink(new_inode))
> dir_u->bi_nlink++;
> dir_u->bi_mtime = dir_u->bi_ctime = now;
> - dir_u->bi_size += dirent_occupied_size(name);
>
> ret = bch2_inode_write(trans, &dir_iter, dir_u);
> if (ret)
> @@ -163,7 +165,8 @@ int bch2_create_trans(struct btree_trans *trans,
> name,
> dir_target,
> &dir_offset,
> - STR_HASH_must_create|BTREE_ITER_with_updates);
> + &dir_u->bi_size,
> + STR_HASH_must_create);
> if (ret)
> goto err;
>
> @@ -221,13 +224,14 @@ int bch2_link_trans(struct btree_trans *trans,
> }
>
> dir_u->bi_mtime = dir_u->bi_ctime = now;
> - dir_u->bi_size += dirent_occupied_size(name);
>
> dir_hash = bch2_hash_info_init(c, dir_u);
>
> ret = bch2_dirent_create(trans, dir, &dir_hash,
> mode_to_type(inode_u->bi_mode),
> - name, inum.inum, &dir_offset,
> + name, inum.inum,
> + &dir_offset,
> + &dir_u->bi_size,
> STR_HASH_must_create);
> if (ret)
> goto err;
> @@ -266,8 +270,16 @@ int bch2_unlink_trans(struct btree_trans *trans,
>
> dir_hash = bch2_hash_info_init(c, dir_u);
>
> - ret = bch2_dirent_lookup_trans(trans, &dirent_iter, dir, &dir_hash,
> - name, &inum, BTREE_ITER_intent);
> + struct bkey_s_c dirent_k =
> + bch2_hash_lookup(trans, &dirent_iter, bch2_dirent_hash_desc,
> + &dir_hash, dir, name, BTREE_ITER_intent);
> + ret = bkey_err(dirent_k);
> + if (ret)
> + goto err;
> +
> + ret = bch2_dirent_read_target(trans, dir, bkey_s_c_to_dirent(dirent_k), &inum);
> + if (ret > 0)
> + ret = -ENOENT;
> if (ret)
> goto err;
>
> @@ -324,7 +336,7 @@ int bch2_unlink_trans(struct btree_trans *trans,
>
> dir_u->bi_mtime = dir_u->bi_ctime = inode_u->bi_ctime = now;
> dir_u->bi_nlink -= is_subdir_for_nlink(inode_u);
> - dir_u->bi_size -= dirent_occupied_size(name);
> + dir_u->bi_size -= bkey_bytes(dirent_k.k);
>
> ret = bch2_hash_delete_at(trans, bch2_dirent_hash_desc,
> &dir_hash, &dirent_iter,
> @@ -420,8 +432,8 @@ int bch2_rename_trans(struct btree_trans *trans,
> }
>
> ret = bch2_dirent_rename(trans,
> - src_dir, &src_hash,
> - dst_dir, &dst_hash,
> + src_dir, &src_hash, &src_dir_u->bi_size,
> + dst_dir, &dst_hash, &dst_dir_u->bi_size,
> src_name, &src_inum, &src_offset,
> dst_name, &dst_inum, &dst_offset,
> mode);
> @@ -463,14 +475,6 @@ int bch2_rename_trans(struct btree_trans *trans,
> goto err;
> }
>
> - if (mode == BCH_RENAME) {
> - src_dir_u->bi_size -= dirent_occupied_size(src_name);
> - dst_dir_u->bi_size += dirent_occupied_size(dst_name);
> - }
> -
> - if (mode == BCH_RENAME_OVERWRITE)
> - src_dir_u->bi_size -= dirent_occupied_size(src_name);
> -
> if (src_inode_u->bi_parent_subvol)
> src_inode_u->bi_parent_subvol = dst_dir.subvol;
>
> diff --git a/fs/bcachefs/fsck.c b/fs/bcachefs/fsck.c
> index 53a421ff136d..24ad1a42d169 100644
> --- a/fs/bcachefs/fsck.c
> +++ b/fs/bcachefs/fsck.c
> @@ -1132,10 +1132,7 @@ static int check_directory_size(struct btree_trans *trans,
> if (k.k->type != KEY_TYPE_dirent)
> continue;
>
> - struct bkey_s_c_dirent dirent = bkey_s_c_to_dirent(k);
> - struct qstr name = bch2_dirent_get_name(dirent);
> -
> - new_size += dirent_occupied_size(&name);
> + new_size += bkey_bytes(k.k);
> }
> bch2_trans_iter_exit(trans, &iter);
>
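
Archive note: the directory i_size accounting that this patch moves into
bch2_dirent_rename() can be sketched as a standalone function. This is a
minimal sketch, assuming key sizes are passed in as plain byte counts (the
patch itself uses bkey_bytes() on the old and new keys); the enum and the
name sketch_rename_account() are hypothetical, and a zero old_dst_bytes
stands in for "no existing destination key".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for enum bch_rename_mode. */
enum sketch_rename_mode {
	SKETCH_RENAME,
	SKETCH_RENAME_OVERWRITE,
	SKETCH_RENAME_EXCHANGE,
};

static inline void sketch_rename_account(uint64_t *src_dir_i_size,
					 uint64_t *dst_dir_i_size,
					 unsigned old_src_bytes,
					 unsigned old_dst_bytes,
					 unsigned new_src_bytes,
					 unsigned new_dst_bytes,
					 enum sketch_rename_mode mode)
{
	/* Any key that existed before the rename stops counting. */
	if (old_dst_bytes)
		*dst_dir_i_size -= old_dst_bytes;
	*src_dir_i_size -= old_src_bytes;

	/* Only an exchange puts a new key back into the source directory. */
	if (mode == SKETCH_RENAME_EXCHANGE)
		*src_dir_i_size += new_src_bytes;
	*dst_dir_i_size += new_dst_bytes;
}
```

Because the sizes come from the keys themselves, the same arithmetic holds
whether or not either directory has casefolding enabled, which is the point
made in the commit message.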
* [PATCH] bcachefs: Use flexible arrays in dirent
2025-02-13 18:46 ` [PATCH 18/18] bcachefs: bcachefs_metadata_version_casefolding Kent Overstreet
@ 2025-02-21 18:26 ` Gabriel de Perthuis
2025-02-22 14:07 ` Kent Overstreet
0 siblings, 1 reply; 22+ messages in thread
From: Gabriel de Perthuis @ 2025-02-21 18:26 UTC (permalink / raw)
To: Kent Overstreet, linux-bcachefs, Joshua Ashton
Cc: André Almeida, Gabriel Krisman Bertazi, Gabriel de Perthuis
A fake d_name[0]/d_names[0] flexible array tickled UBSAN.
Use DECLARE_FLEX_ARRAY to work around a compiler restriction on flex
arrays in unions:
https://people.kernel.org/kees/bounded-flexible-arrays-in-c
Fixes: "bcachefs: bcachefs_metadata_version_casefolding"
Closes: https://github.com/koverstreet/bcachefs/issues/824
Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
---
fs/bcachefs/dirent_format.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/bcachefs/dirent_format.h b/fs/bcachefs/dirent_format.h
index 2e766032e1e91..a46dbddd21aad 100644
--- a/fs/bcachefs/dirent_format.h
+++ b/fs/bcachefs/dirent_format.h
@@ -42,13 +42,13 @@ struct bch_dirent {
union {
struct {
__u8 d_pad;
__le16 d_name_len;
__le16 d_cf_name_len;
- __u8 d_names[0];
+ __u8 d_names[];
} d_cf_name_block __packed;
- __u8 d_name[0];
+ __DECLARE_FLEX_ARRAY(__u8, d_name);
} __packed;
} __packed __aligned(8);
#define DT_SUBVOL 16
#define BCH_DT_MAX 17
--
2.45.2
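
Archive note: the compiler restriction mentioned in the commit message can
be sketched outside the kernel. A bare flexible array member is rejected as
a direct union member by older compilers, so the array is wrapped in an
anonymous struct next to an empty struct. This is a minimal sketch assuming
GNU C (empty structs and the packed attribute are extensions);
SKETCH_FLEX_ARRAY and struct sketch_dirent are hypothetical stand-ins, not
the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Wrap a flexible array so it can sit inside a union. The empty struct
 * gives the wrapper a named member ahead of the array.
 */
#define SKETCH_FLEX_ARRAY(TYPE, NAME)		\
	struct {				\
		struct { } empty_ ## NAME;	\
		TYPE NAME[];			\
	}

/* Hypothetical, simplified layout loosely modelled on struct bch_dirent. */
struct sketch_dirent {
	uint64_t target;
	uint8_t  flags;
	union {
		struct {
			uint8_t  pad;
			uint16_t name_len;
			uint16_t cf_name_len;
			uint8_t  names[];
		} __attribute__((packed)) cf;
		SKETCH_FLEX_ARRAY(uint8_t, name);
	} __attribute__((packed));
} __attribute__((packed, aligned(8)));
```

In this sketch both arrays keep the offsets they would have had with the
old [0] declarations: name starts right after the flags byte, and cf.names
starts five bytes later, after the pad and the two length fields.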
* Re: [PATCH] bcachefs: Use flexible arrays in dirent
2025-02-21 18:26 ` [PATCH] bcachefs: Use flexible arrays in dirent Gabriel de Perthuis
@ 2025-02-22 14:07 ` Kent Overstreet
0 siblings, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2025-02-22 14:07 UTC (permalink / raw)
To: Gabriel de Perthuis
Cc: linux-bcachefs, Joshua Ashton, André Almeida,
Gabriel Krisman Bertazi
On Fri, Feb 21, 2025 at 06:26:27PM +0000, Gabriel de Perthuis wrote:
> A fake d_name[0]/d_names[0] flexible array tickled UBSAN.
>
> Use DECLARE_FLEX_ARRAY to work around a compiler restriction on flex
> arrays in unions:
> https://people.kernel.org/kees/bounded-flexible-arrays-in-c
>
> Fixes: "bcachefs: bcachefs_metadata_version_casefolding"
> Closes: https://github.com/koverstreet/bcachefs/issues/824
>
> Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Applied
> ---
> fs/bcachefs/dirent_format.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/bcachefs/dirent_format.h b/fs/bcachefs/dirent_format.h
> index 2e766032e1e91..a46dbddd21aad 100644
> --- a/fs/bcachefs/dirent_format.h
> +++ b/fs/bcachefs/dirent_format.h
> @@ -42,13 +42,13 @@ struct bch_dirent {
> union {
> struct {
> __u8 d_pad;
> __le16 d_name_len;
> __le16 d_cf_name_len;
> - __u8 d_names[0];
> + __u8 d_names[];
> } d_cf_name_block __packed;
> - __u8 d_name[0];
> + __DECLARE_FLEX_ARRAY(__u8, d_name);
> } __packed;
> } __packed __aligned(8);
>
> #define DT_SUBVOL 16
> #define BCH_DT_MAX 17
> --
> 2.45.2
>
Thread overview: 22+ messages
2025-02-13 18:45 [PATCH 00/18] last on disk format changes before freeze Kent Overstreet
2025-02-13 18:45 ` [PATCH 01/18] bcachefs: bch2_lru_change() checks for no-op Kent Overstreet
2025-02-13 18:45 ` [PATCH 02/18] bcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/ Kent Overstreet
2025-02-13 18:45 ` [PATCH 03/18] bcachefs: decouple bch2_lru_check_set() from alloc btree Kent Overstreet
2025-02-13 18:45 ` [PATCH 04/18] bcachefs: Rework bch2_check_lru_key() Kent Overstreet
2025-02-13 18:45 ` [PATCH 05/18] bcachefs: bch2_trigger_stripe_ptr() no longer uses ec_stripes_heap_lock Kent Overstreet
2025-02-13 18:45 ` [PATCH 06/18] bcachefs: Better trigger ordering Kent Overstreet
2025-02-13 18:45 ` [PATCH 07/18] bcachefs: rework bch2_trans_commit_run_triggers() Kent Overstreet
2025-02-13 18:45 ` [PATCH 08/18] bcachefs: bcachefs_metadata_version_cached_backpointers Kent Overstreet
2025-02-13 18:45 ` [PATCH 09/18] bcachefs: Invalidate cached data by backpointers Kent Overstreet
2025-02-13 18:45 ` [PATCH 10/18] bcachefs: Advance bch_alloc.oldest_gen if no stale pointers Kent Overstreet
2025-02-13 18:45 ` [PATCH 11/18] bcachefs: bcachefs_metadata_version_stripe_backpointers Kent Overstreet
2025-02-13 18:45 ` [PATCH 12/18] bcachefs: bcachefs_metadata_version_stripe_lru Kent Overstreet
2025-02-13 18:45 ` [PATCH 13/18] bcachefs: ec_stripe_delete() uses new stripe lru Kent Overstreet
2025-02-13 18:45 ` [PATCH 14/18] bcachefs: get_existing_stripe() " Kent Overstreet
2025-02-13 18:46 ` [PATCH 15/18] bcachefs: We no longer read stripes into memory at startup Kent Overstreet
2025-02-13 18:46 ` [PATCH 16/18] bcachefs: Kill dirent_occupied_size() Kent Overstreet
2025-02-17 1:49 ` Hongbo Li
2025-02-13 18:46 ` [PATCH 17/18] bcachefs: Split out dirent alloc and name initialization Kent Overstreet
2025-02-13 18:46 ` [PATCH 18/18] bcachefs: bcachefs_metadata_version_casefolding Kent Overstreet
2025-02-21 18:26 ` [PATCH] bcachefs: Use flexible arrays in dirent Gabriel de Perthuis
2025-02-22 14:07 ` Kent Overstreet