* [PATCH 0/3] bcachefs: journal bug fixes
@ 2023-08-31 11:07 Brian Foster
2023-08-31 11:07 ` [PATCH 1/3] bcachefs: restart journal reclaim thread on ro->rw transitions Brian Foster
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Brian Foster @ 2023-08-31 11:07 UTC (permalink / raw)
To: linux-bcachefs
Hi all,
This series includes a couple bug fixes that fell out of ongoing
experimentation with freeze support. I think patch 1 is pretty
self-explanatory, patch 2 is a prepatory patch, and patch 3 addresses a
race in the journaling code.
All patches are available in my current test branch, with CI test
results below [1]. Note that branch also includes a patch to enable
freeze, but I'd rather not see that one land quite yet (which is why
it's not included here). It's so far only seen the type of testing
intended to shake these sorts of peripheral issues out, and I need to
run some more testing to confirm sane behavior.
Thoughts, reviews, flames appreciated.
Brian
[1] https://evilpiepirate.org/~testdashboard/ci?branch=bfoster
Brian Foster (3):
bcachefs: restart journal reclaim thread on ro->rw transitions
bcachefs: prepare journal buf put to handle pin put
bcachefs: fix race between journal entry close and pin set
fs/bcachefs/journal.c | 13 +------------
fs/bcachefs/journal.h | 39 +++++++++++++++++++++++++++++++++------
fs/bcachefs/super.c | 4 ++++
3 files changed, 38 insertions(+), 18 deletions(-)
--
2.41.0
^ permalink raw reply [flat|nested] 14+ messages in thread* [PATCH 1/3] bcachefs: restart journal reclaim thread on ro->rw transitions 2023-08-31 11:07 [PATCH 0/3] bcachefs: journal bug fixes Brian Foster @ 2023-08-31 11:07 ` Brian Foster 2023-08-31 11:07 ` [PATCH 2/3] bcachefs: prepare journal buf put to handle pin put Brian Foster 2023-08-31 11:07 ` [PATCH 3/3] bcachefs: fix race between journal entry close and pin set Brian Foster 2 siblings, 0 replies; 14+ messages in thread From: Brian Foster @ 2023-08-31 11:07 UTC (permalink / raw) To: linux-bcachefs Commit c2d5ff36065a4 ("bcachefs: Start journal reclaim thread earlier") tweaked reclaim thread management to start a bit earlier in the mount sequence by moving the start call from __bch2_fs_read_write() to bch2_fs_journal_start(). This has the side effect of never starting the reclaim thread on a ro->rw transition, which can be observed by monitoring reclaim behavior via the journal_reclaim tracepoints. I.e. once an fs has remounted ro->rw, we only ever rely on direct reclaim from that point forward. Since bch2_journal_reclaim_start() properly handles the case where the reclaim thread has already been created, restore the start call in the read-write helper. This allows the reclaim thread to start early when appropriate and also exit/restart on remounts or freeze cycles. In the latter case it may be possible to simply allow the task to freeze rather than destroy it, but for now just fix the immediate bug. Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/bcachefs/super.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c index 60424865980d..29cd71445a94 100644 --- a/fs/bcachefs/super.c +++ b/fs/bcachefs/super.c @@ -421,6 +421,10 @@ static int __bch2_fs_read_write(struct bch_fs *c, bool early) return ret; } + ret = bch2_journal_reclaim_start(&c->journal); + if (ret) + goto err; + if (!early) { ret = bch2_fs_read_write_late(c); if (ret) -- 2.41.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/3] bcachefs: prepare journal buf put to handle pin put 2023-08-31 11:07 [PATCH 0/3] bcachefs: journal bug fixes Brian Foster 2023-08-31 11:07 ` [PATCH 1/3] bcachefs: restart journal reclaim thread on ro->rw transitions Brian Foster @ 2023-08-31 11:07 ` Brian Foster 2023-09-03 21:09 ` Kent Overstreet 2023-08-31 11:07 ` [PATCH 3/3] bcachefs: fix race between journal entry close and pin set Brian Foster 2 siblings, 1 reply; 14+ messages in thread From: Brian Foster @ 2023-08-31 11:07 UTC (permalink / raw) To: linux-bcachefs bcachefs freeze testing has uncovered some raciness between journal entry open/close and pin list reference count management. The details of the problem are described in a separate patch. In preparation for the associated fix, refactor the journal buffer put path a bit to allow it to eventually handle dropping the pin list reference currently held by an open journal entry. Open code the journal write dispatch since it is essentially a single line and instead factor out the journal state update to be reused, consistent with other journal state update helpers. No functional changes in this patch. Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/bcachefs/journal.c | 9 --------- fs/bcachefs/journal.h | 22 +++++++++++++++++----- 2 files changed, 17 insertions(+), 14 deletions(-) diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c index 055920c26da6..76ebcdb3e3b4 100644 --- a/fs/bcachefs/journal.c +++ b/fs/bcachefs/journal.c @@ -132,15 +132,6 @@ journal_error_check_stuck(struct journal *j, int error, unsigned flags) return stuck; } -/* journal entry close/open: */ - -void __bch2_journal_buf_put(struct journal *j) -{ - struct bch_fs *c = container_of(j, struct bch_fs, journal); - - closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL); -} - /* * Returns true if journal entry is now closed: * diff --git a/fs/bcachefs/journal.h b/fs/bcachefs/journal.h index 008a2e25a4fa..c7a31da077c9 100644 --- a/fs/bcachefs/journal.h +++ b/fs/bcachefs/journal.h @@ -252,9 +252,10 @@ static inline bool journal_entry_empty(struct jset *j) return true; } -void __bch2_journal_buf_put(struct journal *); - -static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) +/* + * Drop reference on a buffer index and return true if the count has hit zero. + */ +static inline union journal_res_state journal_state_buf_put(struct journal *j, unsigned idx) { union journal_res_state s; @@ -264,9 +265,20 @@ static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) .buf2_count = idx == 2, .buf3_count = idx == 3, }).v, &j->reservations.counter); + return s; +} - if (!journal_state_count(s, idx) && idx == s.unwritten_idx) - __bch2_journal_buf_put(j); +void bch2_journal_write(struct closure *); +static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) +{ + struct bch_fs *c = container_of(j, struct bch_fs, journal); + union journal_res_state s; + + s = journal_state_buf_put(j, idx); + if (!journal_state_count(s, idx)) { + if (idx == s.unwritten_idx) + closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL); + } } /* -- 2.41.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] bcachefs: prepare journal buf put to handle pin put 2023-08-31 11:07 ` [PATCH 2/3] bcachefs: prepare journal buf put to handle pin put Brian Foster @ 2023-09-03 21:09 ` Kent Overstreet 2023-09-05 12:41 ` Brian Foster 0 siblings, 1 reply; 14+ messages in thread From: Kent Overstreet @ 2023-09-03 21:09 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Thu, Aug 31, 2023 at 07:07:33AM -0400, Brian Foster wrote: > bcachefs freeze testing has uncovered some raciness between journal > entry open/close and pin list reference count management. The > details of the problem are described in a separate patch. In > preparation for the associated fix, refactor the journal buffer put > path a bit to allow it to eventually handle dropping the pin list > reference currently held by an open journal entry. Open code the > journal write dispatch since it is essentially a single line and > instead factor out the journal state update to be reused, consistent > with other journal state update helpers. No functional changes in > this patch. So, I don't want to take this patch as is because even though __bch2_journal_buf_put() is small, you're putting it into an inline fastpath that we want to be as small and fast as possible - the main transaction commit path. journal_state_buf_put() is fine though, happy to take that part of the patch. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/bcachefs/journal.c | 9 --------- > fs/bcachefs/journal.h | 22 +++++++++++++++++----- > 2 files changed, 17 insertions(+), 14 deletions(-) > > diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c > index 055920c26da6..76ebcdb3e3b4 100644 > --- a/fs/bcachefs/journal.c > +++ b/fs/bcachefs/journal.c > @@ -132,15 +132,6 @@ journal_error_check_stuck(struct journal *j, int error, unsigned flags) > return stuck; > } > > -/* journal entry close/open: */ > - > -void __bch2_journal_buf_put(struct journal *j) > -{ > - struct bch_fs *c = container_of(j, struct bch_fs, journal); > - > - closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL); > -} > - > /* > * Returns true if journal entry is now closed: > * > diff --git a/fs/bcachefs/journal.h b/fs/bcachefs/journal.h > index 008a2e25a4fa..c7a31da077c9 100644 > --- a/fs/bcachefs/journal.h > +++ b/fs/bcachefs/journal.h > @@ -252,9 +252,10 @@ static inline bool journal_entry_empty(struct jset *j) > return true; > } > > -void __bch2_journal_buf_put(struct journal *); > - > -static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) > +/* > + * Drop reference on a buffer index and return true if the count has hit zero. > + */ > +static inline union journal_res_state journal_state_buf_put(struct journal *j, unsigned idx) > { > union journal_res_state s; > > @@ -264,9 +265,20 @@ static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) > .buf2_count = idx == 2, > .buf3_count = idx == 3, > }).v, &j->reservations.counter); > + return s; > +} > > - if (!journal_state_count(s, idx) && idx == s.unwritten_idx) > - __bch2_journal_buf_put(j); > +void bch2_journal_write(struct closure *); > +static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) > +{ > + struct bch_fs *c = container_of(j, struct bch_fs, journal); > + union journal_res_state s; > + > + s = journal_state_buf_put(j, idx); > + if (!journal_state_count(s, idx)) { > + if (idx == s.unwritten_idx) > + closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL); > + } > } > > /* > -- > 2.41.0 > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] bcachefs: prepare journal buf put to handle pin put 2023-09-03 21:09 ` Kent Overstreet @ 2023-09-05 12:41 ` Brian Foster 2023-09-10 1:55 ` Kent Overstreet 0 siblings, 1 reply; 14+ messages in thread From: Brian Foster @ 2023-09-05 12:41 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Sun, Sep 03, 2023 at 05:09:04PM -0400, Kent Overstreet wrote: > On Thu, Aug 31, 2023 at 07:07:33AM -0400, Brian Foster wrote: > > bcachefs freeze testing has uncovered some raciness between journal > > entry open/close and pin list reference count management. The > > details of the problem are described in a separate patch. In > > preparation for the associated fix, refactor the journal buffer put > > path a bit to allow it to eventually handle dropping the pin list > > reference currently held by an open journal entry. Open code the > > journal write dispatch since it is essentially a single line and > > instead factor out the journal state update to be reused, consistent > > with other journal state update helpers. No functional changes in > > this patch. > > So, I don't want to take this patch as is because even though > __bch2_journal_buf_put() is small, you're putting it into an inline > fastpath that we want to be as small and fast as possible - the main > transaction commit path. > Can you elaborate on what you mean here? ISTM all this patch does is replace a function call to __bch2_journal_buf_put() that already exists.. Brian > journal_state_buf_put() is fine though, happy to take that part of the > patch. > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > fs/bcachefs/journal.c | 9 --------- > > fs/bcachefs/journal.h | 22 +++++++++++++++++----- > > 2 files changed, 17 insertions(+), 14 deletions(-) > > > > diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c > > index 055920c26da6..76ebcdb3e3b4 100644 > > --- a/fs/bcachefs/journal.c > > +++ b/fs/bcachefs/journal.c > > @@ -132,15 +132,6 @@ journal_error_check_stuck(struct journal *j, int error, unsigned flags) > > return stuck; > > } > > > > -/* journal entry close/open: */ > > - > > -void __bch2_journal_buf_put(struct journal *j) > > -{ > > - struct bch_fs *c = container_of(j, struct bch_fs, journal); > > - > > - closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL); > > -} > > - > > /* > > * Returns true if journal entry is now closed: > > * > > diff --git a/fs/bcachefs/journal.h b/fs/bcachefs/journal.h > > index 008a2e25a4fa..c7a31da077c9 100644 > > --- a/fs/bcachefs/journal.h > > +++ b/fs/bcachefs/journal.h > > @@ -252,9 +252,10 @@ static inline bool journal_entry_empty(struct jset *j) > > return true; > > } > > > > -void __bch2_journal_buf_put(struct journal *); > > - > > -static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) > > +/* > > + * Drop reference on a buffer index and return true if the count has hit zero. > > + */ > > +static inline union journal_res_state journal_state_buf_put(struct journal *j, unsigned idx) > > { > > union journal_res_state s; > > > > @@ -264,9 +265,20 @@ static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) > > .buf2_count = idx == 2, > > .buf3_count = idx == 3, > > }).v, &j->reservations.counter); > > + return s; > > +} > > > > - if (!journal_state_count(s, idx) && idx == s.unwritten_idx) > > - __bch2_journal_buf_put(j); > > +void bch2_journal_write(struct closure *); > > +static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) > > +{ > > + struct bch_fs *c = container_of(j, struct bch_fs, journal); > > + union journal_res_state s; > > + > > + s = journal_state_buf_put(j, idx); > > + if (!journal_state_count(s, idx)) { > > + if (idx == s.unwritten_idx) > > + closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL); > > + } > > } > > > > /* > > -- > > 2.41.0 > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] bcachefs: prepare journal buf put to handle pin put 2023-09-05 12:41 ` Brian Foster @ 2023-09-10 1:55 ` Kent Overstreet 0 siblings, 0 replies; 14+ messages in thread From: Kent Overstreet @ 2023-09-10 1:55 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Tue, Sep 05, 2023 at 08:41:44AM -0400, Brian Foster wrote: > On Sun, Sep 03, 2023 at 05:09:04PM -0400, Kent Overstreet wrote: > > On Thu, Aug 31, 2023 at 07:07:33AM -0400, Brian Foster wrote: > > > bcachefs freeze testing has uncovered some raciness between journal > > > entry open/close and pin list reference count management. The > > > details of the problem are described in a separate patch. In > > > preparation for the associated fix, refactor the journal buffer put > > > path a bit to allow it to eventually handle dropping the pin list > > > reference currently held by an open journal entry. Open code the > > > journal write dispatch since it is essentially a single line and > > > instead factor out the journal state update to be reused, consistent > > > with other journal state update helpers. No functional changes in > > > this patch. > > > > So, I don't want to take this patch as is because even though > > __bch2_journal_buf_put() is small, you're putting it into an inline > > fastpath that we want to be as small and fast as possible - the main > > transaction commit path. > > > > Can you elaborate on what you mean here? ISTM all this patch does is > replace a function call to __bch2_journal_buf_put() that already > exists.. Yes, you made the code in __bch2_journal_buf_put() inline. Since that's the slowpath, we don't want it inline... > > Brian > > > journal_state_buf_put() is fine though, happy to take that part of the > > patch. > > > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > --- > > > fs/bcachefs/journal.c | 9 --------- > > > fs/bcachefs/journal.h | 22 +++++++++++++++++----- > > > 2 files changed, 17 insertions(+), 14 deletions(-) > > > > > > diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c > > > index 055920c26da6..76ebcdb3e3b4 100644 > > > --- a/fs/bcachefs/journal.c > > > +++ b/fs/bcachefs/journal.c > > > @@ -132,15 +132,6 @@ journal_error_check_stuck(struct journal *j, int error, unsigned flags) > > > return stuck; > > > } > > > > > > -/* journal entry close/open: */ > > > - > > > -void __bch2_journal_buf_put(struct journal *j) > > > -{ > > > - struct bch_fs *c = container_of(j, struct bch_fs, journal); > > > - > > > - closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL); > > > -} > > > - > > > /* > > > * Returns true if journal entry is now closed: > > > * > > > diff --git a/fs/bcachefs/journal.h b/fs/bcachefs/journal.h > > > index 008a2e25a4fa..c7a31da077c9 100644 > > > --- a/fs/bcachefs/journal.h > > > +++ b/fs/bcachefs/journal.h > > > @@ -252,9 +252,10 @@ static inline bool journal_entry_empty(struct jset *j) > > > return true; > > > } > > > > > > -void __bch2_journal_buf_put(struct journal *); > > > - > > > -static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) > > > +/* > > > + * Drop reference on a buffer index and return true if the count has hit zero. > > > + */ > > > +static inline union journal_res_state journal_state_buf_put(struct journal *j, unsigned idx) > > > { > > > union journal_res_state s; > > > > > > @@ -264,9 +265,20 @@ static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) > > > .buf2_count = idx == 2, > > > .buf3_count = idx == 3, > > > }).v, &j->reservations.counter); > > > + return s; > > > +} > > > > > > - if (!journal_state_count(s, idx) && idx == s.unwritten_idx) > > > - __bch2_journal_buf_put(j); > > > +void bch2_journal_write(struct closure *); > > > +static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) > > > +{ > > > + struct bch_fs *c = container_of(j, struct bch_fs, journal); > > > + union journal_res_state s; > > > + > > > + s = journal_state_buf_put(j, idx); > > > + if (!journal_state_count(s, idx)) { > > > + if (idx == s.unwritten_idx) > > > + closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL); > > > + } > > > } > > > > > > /* > > > -- > > > 2.41.0 > > > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 3/3] bcachefs: fix race between journal entry close and pin set 2023-08-31 11:07 [PATCH 0/3] bcachefs: journal bug fixes Brian Foster 2023-08-31 11:07 ` [PATCH 1/3] bcachefs: restart journal reclaim thread on ro->rw transitions Brian Foster 2023-08-31 11:07 ` [PATCH 2/3] bcachefs: prepare journal buf put to handle pin put Brian Foster @ 2023-08-31 11:07 ` Brian Foster 2023-09-03 21:18 ` Kent Overstreet 2 siblings, 1 reply; 14+ messages in thread From: Brian Foster @ 2023-08-31 11:07 UTC (permalink / raw) To: linux-bcachefs bcachefs freeze testing via fstests generic/390 occasionally reproduces the following BUG from bch2_fs_read_only(): BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty)); This indicates that one or more dirty key cache keys still exist after the attempt to flush and quiesce the fs. The sequence that leads to this problem actually occurs on unfreeze (ro->rw), and looks something like the following: - Task A begins a transaction commit and acquires journal_res for the current seq. This transaction intends to perform key cache insertion. - Task B begins a bch2_journal_flush() via bch2_sync_fs(). This ends up in journal_entry_want_write(), which closes the current journal entry and drops the reference to the pin list created on entry open. The pin put pops the front of the journal via fast reclaim since the reference count has dropped to 0. - Task A attempts to set the journal pin for the associated cached key, but bch2_journal_pin_set() skips the pin insert because the seq of the transaction reservation is behind the front of the pin list fifo. The end result is that the pin associated with the cached key is not added, which prevents a subsequent reclaim from processing the key and thus leaves it dangling at freeze time. The fundamental cause of this problem is that the front of the journal is allowed to pop before a transaction with outstanding reservation on the associated journal seq is able to add a pin. The count for the pin list associated with the seq drops to zero and is prematurely reclaimed as a result. The logical fix for this problem lies in how the journal buffer is managed in similar scenarios where the entry might have been closed before a transaction with outstanding reservations happens to be committed. When a journal entry is opened, the current sequence number is bumped, the associated pin list is initialized with a reference count of 1, and the journal buffer reference count is bumped (via journal_state_inc()). When a journal reservation is acquired, the reservation also acquires a reference on the associated buffer. If the journal entry is closed in the meantime, it drops both the pin and buffer references held by the open entry, but the buffer still has references held by outstanding reservation. After the associated transaction commits, the reservation release drops the associated buffer references and the buffer is written out once the reference count has dropped to zero. The fundamental problem here is that the lifecycle of the pin list reference held by an open journal entry is too short to cover the processing of transactions with outstanding reservations. The simplest way to address this is to expand the pin list reference to the lifecycle of the buffer vs. the shorter lifecycle of the open journal entry. This allows a journal entry to closer for whatever reason, but ensures the pin list for the seq with outstanding reservation cannot be popped and reclaimed before all outstanding reservations have been released. Move the pin put from journal entry close to where final reference count processing of the journal buffer occurs. Create a duplicate helper to cover the case where the caller doesn't already hold the journal lock. This allows generic/390 to pass reliably. Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/bcachefs/journal.c | 4 +--- fs/bcachefs/journal.h | 19 +++++++++++++++++-- 2 files changed, 18 insertions(+), 5 deletions(-) diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c index 76ebcdb3e3b4..21f97d48931d 100644 --- a/fs/bcachefs/journal.c +++ b/fs/bcachefs/journal.c @@ -195,13 +195,11 @@ static void __journal_entry_close(struct journal *j, unsigned closed_val) buf->data->last_seq = cpu_to_le64(buf->last_seq); BUG_ON(buf->last_seq > le64_to_cpu(buf->data->seq)); - __bch2_journal_pin_put(j, le64_to_cpu(buf->data->seq)); - cancel_delayed_work(&j->write_work); bch2_journal_space_available(j); - bch2_journal_buf_put(j, old.idx); + __bch2_journal_buf_put(j, old.idx, le64_to_cpu(buf->data->seq)); } void bch2_journal_halt(struct journal *j) diff --git a/fs/bcachefs/journal.h b/fs/bcachefs/journal.h index c7a31da077c9..5dea5d46c616 100644 --- a/fs/bcachefs/journal.h +++ b/fs/bcachefs/journal.h @@ -112,6 +112,7 @@ #include <linux/hash.h> #include "journal_types.h" +#include "journal_reclaim.h" struct bch_fs; @@ -269,13 +270,27 @@ static inline union journal_res_state journal_state_buf_put(struct journal *j, u } void bch2_journal_write(struct closure *); -static inline void bch2_journal_buf_put(struct journal *j, unsigned idx) +static inline void __bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq) { struct bch_fs *c = container_of(j, struct bch_fs, journal); union journal_res_state s; s = journal_state_buf_put(j, idx); if (!journal_state_count(s, idx)) { + __bch2_journal_pin_put(j, seq); + if (idx == s.unwritten_idx) + closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL); + } +} + +static inline void bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq) +{ + struct bch_fs *c = container_of(j, struct bch_fs, journal); + union journal_res_state s; + + s = journal_state_buf_put(j, idx); + if (!journal_state_count(s, idx)) { + bch2_journal_pin_put(j, seq); if (idx == s.unwritten_idx) closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL); } @@ -298,7 +313,7 @@ static inline void bch2_journal_res_put(struct journal *j, BCH_JSET_ENTRY_btree_keys, 0, 0, 0); - bch2_journal_buf_put(j, res->idx); + bch2_journal_buf_put(j, res->idx, res->seq); res->ref = 0; } -- 2.41.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] bcachefs: fix race between journal entry close and pin set 2023-08-31 11:07 ` [PATCH 3/3] bcachefs: fix race between journal entry close and pin set Brian Foster @ 2023-09-03 21:18 ` Kent Overstreet 2023-09-04 2:29 ` Kent Overstreet 0 siblings, 1 reply; 14+ messages in thread From: Kent Overstreet @ 2023-09-03 21:18 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Thu, Aug 31, 2023 at 07:07:34AM -0400, Brian Foster wrote: > bcachefs freeze testing via fstests generic/390 occasionally > reproduces the following BUG from bch2_fs_read_only(): > > BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty)); > > This indicates that one or more dirty key cache keys still exist > after the attempt to flush and quiesce the fs. The sequence that > leads to this problem actually occurs on unfreeze (ro->rw), and > looks something like the following: > > - Task A begins a transaction commit and acquires journal_res for > the current seq. This transaction intends to perform key cache > insertion. > - Task B begins a bch2_journal_flush() via bch2_sync_fs(). This ends > up in journal_entry_want_write(), which closes the current journal > entry and drops the reference to the pin list created on entry open. > The pin put pops the front of the journal via fast reclaim since the > reference count has dropped to 0. > - Task A attempts to set the journal pin for the associated cached > key, but bch2_journal_pin_set() skips the pin insert because the > seq of the transaction reservation is behind the front of the pin > list fifo. > > The end result is that the pin associated with the cached key is not > added, which prevents a subsequent reclaim from processing the key > and thus leaves it dangling at freeze time. The fundamental cause of > this problem is that the front of the journal is allowed to pop > before a transaction with outstanding reservation on the associated > journal seq is able to add a pin. The count for the pin list > associated with the seq drops to zero and is prematurely reclaimed > as a result. > > The logical fix for this problem lies in how the journal buffer is > managed in similar scenarios where the entry might have been closed > before a transaction with outstanding reservations happens to be > committed. > > When a journal entry is opened, the current sequence number is > bumped, the associated pin list is initialized with a reference > count of 1, and the journal buffer reference count is bumped (via > journal_state_inc()). When a journal reservation is acquired, the > reservation also acquires a reference on the associated buffer. If > the journal entry is closed in the meantime, it drops both the pin > and buffer references held by the open entry, but the buffer still > has references held by outstanding reservation. After the associated > transaction commits, the reservation release drops the associated > buffer references and the buffer is written out once the reference > count has dropped to zero. > > The fundamental problem here is that the lifecycle of the pin list > reference held by an open journal entry is too short to cover > the processing of transactions with outstanding reservations. The > simplest way to address this is to expand the pin list reference > to the lifecycle of the buffer vs. the shorter lifecycle of the > open journal entry. This allows a journal entry to closer for > whatever reason, but ensures the pin list for the seq with > outstanding reservation cannot be popped and reclaimed before all > outstanding reservations have been released. > > Move the pin put from journal entry close to where final reference > count processing of the journal buffer occurs. Create a duplicate > helper to cover the case where the caller doesn't already hold the > journal lock. This allows generic/390 to pass reliably. > > Signed-off-by: Brian Foster <bfoster@redhat.com> Good find :) Your fix makes sense, but I wonder if it might be simpler and better to fix this in bch2_journal_reclaim_fast(). We could just change it to check if the pin list is for a journal buf that's still open and has a nonzero refcount. It seems on the surface to be a bug where just fixing the lifetime refcounts is the right approach, but I think that ends up being the trickier approach here, since bch2_journal_buf_put() now has a couple things it's checking at different stages of the journal buf lifetime. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] bcachefs: fix race between journal entry close and pin set 2023-09-03 21:18 ` Kent Overstreet @ 2023-09-04 2:29 ` Kent Overstreet 2023-09-05 12:59 ` Brian Foster 0 siblings, 1 reply; 14+ messages in thread From: Kent Overstreet @ 2023-09-04 2:29 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Sun, Sep 03, 2023 at 05:18:12PM -0400, Kent Overstreet wrote: > On Thu, Aug 31, 2023 at 07:07:34AM -0400, Brian Foster wrote: > > bcachefs freeze testing via fstests generic/390 occasionally > > reproduces the following BUG from bch2_fs_read_only(): > > > > BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty)); > > Your fix makes sense, but I wonder if it might be simpler and better to > fix this in bch2_journal_reclaim_fast(). What do you think of this fix? diff --git a/fs/bcachefs/journal_reclaim.c b/fs/bcachefs/journal_reclaim.c index 10e1860dad..021c56ac67 100644 --- a/fs/bcachefs/journal_reclaim.c +++ b/fs/bcachefs/journal_reclaim.c @@ -303,6 +303,12 @@ static void bch2_journal_reclaim_fast(struct journal *j) */ while (!fifo_empty(&j->pin) && !atomic_read(&fifo_peek_front(&j->pin).count)) { + u64 seq = journal_last_seq(j); + + if (seq > j->seq_ondisk && + journal_state_count(j->reservations, seq & JOURNAL_BUF_MASK)) + break; + fifo_pop(&j->pin, temp); popped = true; } ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] bcachefs: fix race between journal entry close and pin set 2023-09-04 2:29 ` Kent Overstreet @ 2023-09-05 12:59 ` Brian Foster 2023-09-06 19:07 ` Brian Foster 0 siblings, 1 reply; 14+ messages in thread From: Brian Foster @ 2023-09-05 12:59 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Sun, Sep 03, 2023 at 10:29:28PM -0400, Kent Overstreet wrote: > On Sun, Sep 03, 2023 at 05:18:12PM -0400, Kent Overstreet wrote: > > On Thu, Aug 31, 2023 at 07:07:34AM -0400, Brian Foster wrote: > > > bcachefs freeze testing via fstests generic/390 occasionally > > > reproduces the following BUG from bch2_fs_read_only(): > > > > > > BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty)); > > > > Your fix makes sense, but I wonder if it might be simpler and better to > > fix this in bch2_journal_reclaim_fast(). > > What do you think of this fix? > > diff --git a/fs/bcachefs/journal_reclaim.c b/fs/bcachefs/journal_reclaim.c > index 10e1860dad..021c56ac67 100644 > --- a/fs/bcachefs/journal_reclaim.c > +++ b/fs/bcachefs/journal_reclaim.c > @@ -303,6 +303,12 @@ static void bch2_journal_reclaim_fast(struct journal *j) > */ > while (!fifo_empty(&j->pin) && > !atomic_read(&fifo_peek_front(&j->pin).count)) { > + u64 seq = journal_last_seq(j); > + > + if (seq > j->seq_ondisk && > + journal_state_count(j->reservations, seq & JOURNAL_BUF_MASK)) > + break; > + > fifo_pop(&j->pin, temp); > popped = true; > } > So my first reaction once I figured out what was going on was to consider something like this, because it seemed like the issue was premature release of the journal tail. After digging into that a bit it didn't quite feel right, which is what had me digging further into why the ref count was zero on a seq with outstanding reservation. I think I see what the above is doing, but I really don't see how this is simplified in any way. Maybe the code changes are slightly smaller, but IMO the bigger picture is far easier to reason about when the reference counts are properly honored. I.e., the outstanding reservations hold references on the journal buffer associated with the seq, and the active journal buffer holds a reference on the pin fifo. Conversely, the above strikes me as unnecessarily clever based on the fact that just to review it I need to grok behavior in several different contexts (res and seq management, on disk seq updates, etc.) to convince myself whether it's safe, and even then I'm not totally sure I have enough context to say whether the state count check can't ever break on something like a reused (for a new seq) journal buffer. Heh, the more I think about that the more I have to ask: why not just make the reference count work? ;P FWIW, one thing I disliked about the original patch was the need to mostly duplicate the buf put helper due to the locking context quirk. I was trying to avoid that, but I ran out of ideas before I wanted to move on actually fixing the bug. My preference would be to address the reference counting issue as is (to preserve design simplicity), and then maybe think a bit harder about cleaning up the res put implementation if the primary concern is that we feel like this starts to make things a bit convoluted.. Brian ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] bcachefs: fix race between journal entry close and pin set 2023-09-05 12:59 ` Brian Foster @ 2023-09-06 19:07 ` Brian Foster 2023-09-10 2:03 ` Kent Overstreet 0 siblings, 1 reply; 14+ messages in thread From: Brian Foster @ 2023-09-06 19:07 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Tue, Sep 05, 2023 at 08:59:02AM -0400, Brian Foster wrote: > On Sun, Sep 03, 2023 at 10:29:28PM -0400, Kent Overstreet wrote: > > On Sun, Sep 03, 2023 at 05:18:12PM -0400, Kent Overstreet wrote: > > > On Thu, Aug 31, 2023 at 07:07:34AM -0400, Brian Foster wrote: > > > > bcachefs freeze testing via fstests generic/390 occasionally > > > > reproduces the following BUG from bch2_fs_read_only(): > > > > > > > > BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty)); > > > > > > Your fix makes sense, but I wonder if it might be simpler and better to > > > fix this in bch2_journal_reclaim_fast(). ... > > FWIW, one thing I disliked about the original patch was the need to > mostly duplicate the buf put helper due to the locking context quirk. I > was trying to avoid that, but I ran out of ideas before I wanted to move > on actually fixing the bug. My preference would be to address the > reference counting issue as is (to preserve design simplicity), and then > maybe think a bit harder about cleaning up the res put implementation if > the primary concern is that we feel like this starts to make things a > bit convoluted.. > Another thought that comes to mind is to perhaps allow the journal_res to hold a reference to the pin fifo for the associated seq.. The idea would be we could continue to hold a reference during the open/close journal buffer lifecycle, but a res of the same seq would acquire an additional reference as well to keep the tail from popping before a transaction can actually set a pin (i.e. essentially parallel to the buffer reference). In a sense the behavior in this patch is kind of an optimization of that approach, but I think implementing it directly would eliminate the need to mostly duplicate the put helper, perhaps making things a bit more natural. We could still fall into bch2_journal_reclaim_fast() from res_put(), but now only for cases where the reservation outlives the active journal buffer (due to a journal flush or whatever). That also still addresses the race by properly using the reference count. I'd probably have to try it to be sure I'm not missing something, but... thoughts? Brian ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] bcachefs: fix race between journal entry close and pin set 2023-09-06 19:07 ` Brian Foster @ 2023-09-10 2:03 ` Kent Overstreet 2023-09-12 18:54 ` Brian Foster 0 siblings, 1 reply; 14+ messages in thread From: Kent Overstreet @ 2023-09-10 2:03 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Wed, Sep 06, 2023 at 03:07:58PM -0400, Brian Foster wrote: > On Tue, Sep 05, 2023 at 08:59:02AM -0400, Brian Foster wrote: > > On Sun, Sep 03, 2023 at 10:29:28PM -0400, Kent Overstreet wrote: > > > On Sun, Sep 03, 2023 at 05:18:12PM -0400, Kent Overstreet wrote: > > > > On Thu, Aug 31, 2023 at 07:07:34AM -0400, Brian Foster wrote: > > > > > bcachefs freeze testing via fstests generic/390 occasionally > > > > > reproduces the following BUG from bch2_fs_read_only(): > > > > > > > > > > BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty)); > > > > > > > > Your fix makes sense, but I wonder if it might be simpler and better to > > > > fix this in bch2_journal_reclaim_fast(). > ... > > > > FWIW, one thing I disliked about the original patch was the need to > > mostly duplicate the buf put helper due to the locking context quirk. I > > was trying to avoid that, but I ran out of ideas before I wanted to move > > on actually fixing the bug. My preference would be to address the > > reference counting issue as is (to preserve design simplicity), and then > > maybe think a bit harder about cleaning up the res put implementation if > > the primary concern is that we feel like this starts to make things a > > bit convoluted.. > > > > Another thought that comes to mind is to perhaps allow the journal_res > to hold a reference to the pin fifo for the associated seq.. The idea > would be we could continue to hold a reference during the open/close > journal buffer lifecycle, but a res of the same seq would acquire an > additional reference as well to keep the tail from popping before a > transaction can actually set a pin (i.e. essentially parallel to the > buffer reference). Yeah that would probably be cleanest - but it would be much too heavy. > In a sense the behavior in this patch is kind of an optimization of that > approach, but I think implementing it directly would eliminate the need > to mostly duplicate the put helper, perhaps making things a bit more > natural. We could still fall into bch2_journal_reclaim_fast() from > res_put(), but now only for cases where the reservation outlives the > active journal buffer (due to a journal flush or whatever). That also > still addresses the race by properly using the reference count. I'd > probably have to try it to be sure I'm not missing something, but... > thoughts? We can do your approach if that's what you feel is going to be cleanest, but if we go that way here's my two main concerns: - we can't put more slowpath code into the inline fastpath, remember this is all inlined into the main __bch2_trans_commit() path - your approach (the optimized version) means there's a new state transition where we do work, i.e. when journal_state_count() hits 0. So we have to make sure that that refcount isn't hitting 0 multiple times. The appropriate assertion already exists in journal_res_get_fast() - for a refcount to hit 0 multiple times it must leave 0 while in use, and we check for that. Let's just add a comment by that EBUG_ON(!journal_state_count(new, new.idx)) line indicating why it's now important. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] bcachefs: fix race between journal entry close and pin set 2023-09-10 2:03 ` Kent Overstreet @ 2023-09-12 18:54 ` Brian Foster 2023-09-12 19:15 ` Kent Overstreet 0 siblings, 1 reply; 14+ messages in thread From: Brian Foster @ 2023-09-12 18:54 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Sat, Sep 09, 2023 at 10:03:04PM -0400, Kent Overstreet wrote: > On Wed, Sep 06, 2023 at 03:07:58PM -0400, Brian Foster wrote: > > On Tue, Sep 05, 2023 at 08:59:02AM -0400, Brian Foster wrote: > > > On Sun, Sep 03, 2023 at 10:29:28PM -0400, Kent Overstreet wrote: > > > > On Sun, Sep 03, 2023 at 05:18:12PM -0400, Kent Overstreet wrote: > > > > > On Thu, Aug 31, 2023 at 07:07:34AM -0400, Brian Foster wrote: > > > > > > bcachefs freeze testing via fstests generic/390 occasionally > > > > > > reproduces the following BUG from bch2_fs_read_only(): > > > > > > > > > > > > BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty)); > > > > > > > > > > Your fix makes sense, but I wonder if it might be simpler and better to > > > > > fix this in bch2_journal_reclaim_fast(). > > ... > > > > > > FWIW, one thing I disliked about the original patch was the need to > > > mostly duplicate the buf put helper due to the locking context quirk. I > > > was trying to avoid that, but I ran out of ideas before I wanted to move > > > on actually fixing the bug. My preference would be to address the > > > reference counting issue as is (to preserve design simplicity), and then > > > maybe think a bit harder about cleaning up the res put implementation if > > > the primary concern is that we feel like this starts to make things a > > > bit convoluted.. > > > > > > > Another thought that comes to mind is to perhaps allow the journal_res > > to hold a reference to the pin fifo for the associated seq.. The idea > > would be we could continue to hold a reference during the open/close > > journal buffer lifecycle, but a res of the same seq would acquire an > > additional reference as well to keep the tail from popping before a > > transaction can actually set a pin (i.e. essentially parallel to the > > buffer reference). > > Yeah that would probably be cleanest - but it would be much too heavy. > I'm curious how so? Additional atomics? If so, we could also consider something that allows a pin list ref on the reservation to simply transfer to a pin set via the transaction. That might not add any new overhead over the current code, but would require some plumbing. > > In a sense the behavior in this patch is kind of an optimization of that > > approach, but I think implementing it directly would eliminate the need > > to mostly duplicate the put helper, perhaps making things a bit more > > natural. We could still fall into bch2_journal_reclaim_fast() from > > res_put(), but now only for cases where the reservation outlives the > > active journal buffer (due to a journal flush or whatever). That also > > still addresses the race by properly using the reference count. I'd > > probably have to try it to be sure I'm not missing something, but... > > thoughts? > > We can do your approach if that's what you feel is going to be cleanest, > but if we go that way here's my two main concerns: > Ok.. FWIW, it's not so much that I think it's the cleanest (I agree that the variant discussed above probably ends up cleaner code-wise), but rather that it provides sufficient confidence we won't just have to revisit the same underlying problem the next time some bit of code assumes that an outstanding reservation guarantees a usable pin list. > - we can't put more slowpath code into the inline fastpath, remember > this is all inlined into the main __bch2_trans_commit() path > Sure. I missed that closure_call() was itself inlined. I think that is what initially confused me wrt to the feedback on patch 2. I'll try to keep that helper external and perhaps just rename it. > - your approach (the optimized version) means there's a new state > transition where we do work, i.e. when journal_state_count() hits 0. > Indeed. > So we have to make sure that that refcount isn't hitting 0 multiple > times. The appropriate assertion already exists in > journal_res_get_fast() - for a refcount to hit 0 multiple times it > must leave 0 while in use, and we check for that. > > Let's just add a comment by that EBUG_ON(!journal_state_count(new, > new.idx)) line indicating why it's now important. > Sounds reasonable. I'll try to come up with something. Thanks. Brian ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] bcachefs: fix race between journal entry close and pin set 2023-09-12 18:54 ` Brian Foster @ 2023-09-12 19:15 ` Kent Overstreet 0 siblings, 0 replies; 14+ messages in thread From: Kent Overstreet @ 2023-09-12 19:15 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Tue, Sep 12, 2023 at 02:54:34PM -0400, Brian Foster wrote: > On Sat, Sep 09, 2023 at 10:03:04PM -0400, Kent Overstreet wrote: > > On Wed, Sep 06, 2023 at 03:07:58PM -0400, Brian Foster wrote: > > > On Tue, Sep 05, 2023 at 08:59:02AM -0400, Brian Foster wrote: > > > > On Sun, Sep 03, 2023 at 10:29:28PM -0400, Kent Overstreet wrote: > > > > > On Sun, Sep 03, 2023 at 05:18:12PM -0400, Kent Overstreet wrote: > > > > > > On Thu, Aug 31, 2023 at 07:07:34AM -0400, Brian Foster wrote: > > > > > > > bcachefs freeze testing via fstests generic/390 occasionally > > > > > > > reproduces the following BUG from bch2_fs_read_only(): > > > > > > > > > > > > > > BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty)); > > > > > > > > > > > > Your fix makes sense, but I wonder if it might be simpler and better to > > > > > > fix this in bch2_journal_reclaim_fast(). > > > ... > > > > > > > > FWIW, one thing I disliked about the original patch was the need to > > > > mostly duplicate the buf put helper due to the locking context quirk. I > > > > was trying to avoid that, but I ran out of ideas before I wanted to move > > > > on actually fixing the bug. My preference would be to address the > > > > reference counting issue as is (to preserve design simplicity), and then > > > > maybe think a bit harder about cleaning up the res put implementation if > > > > the primary concern is that we feel like this starts to make things a > > > > bit convoluted.. > > > > > > > > > > Another thought that comes to mind is to perhaps allow the journal_res > > > to hold a reference to the pin fifo for the associated seq.. The idea > > > would be we could continue to hold a reference during the open/close > > > journal buffer lifecycle, but a res of the same seq would acquire an > > > additional reference as well to keep the tail from popping before a > > > transaction can actually set a pin (i.e. essentially parallel to the > > > buffer reference). > > > > Yeah that would probably be cleanest - but it would be much too heavy. > > > > I'm curious how so? Additional atomics? Not just that, getting a journal reservation is lockless but getting a journal pin is not. > > So we have to make sure that that refcount isn't hitting 0 multiple > > times. The appropriate assertion already exists in > > journal_res_get_fast() - for a refcount to hit 0 multiple times it > > must leave 0 while in use, and we check for that. > > > > Let's just add a comment by that EBUG_ON(!journal_state_count(new, > > new.idx)) line indicating why it's now important. > > > > Sounds reasonable. I'll try to come up with something. Thanks. Sounds good ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2023-09-12 19:15 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-08-31 11:07 [PATCH 0/3] bcachefs: journal bug fixes Brian Foster 2023-08-31 11:07 ` [PATCH 1/3] bcachefs: restart journal reclaim thread on ro->rw transitions Brian Foster 2023-08-31 11:07 ` [PATCH 2/3] bcachefs: prepare journal buf put to handle pin put Brian Foster 2023-09-03 21:09 ` Kent Overstreet 2023-09-05 12:41 ` Brian Foster 2023-09-10 1:55 ` Kent Overstreet 2023-08-31 11:07 ` [PATCH 3/3] bcachefs: fix race between journal entry close and pin set Brian Foster 2023-09-03 21:18 ` Kent Overstreet 2023-09-04 2:29 ` Kent Overstreet 2023-09-05 12:59 ` Brian Foster 2023-09-06 19:07 ` Brian Foster 2023-09-10 2:03 ` Kent Overstreet 2023-09-12 18:54 ` Brian Foster 2023-09-12 19:15 ` Kent Overstreet
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox