From: Brian Foster <bfoster@redhat.com>
To: linux-bcachefs@vger.kernel.org
Subject: [PATCH 3/3] bcachefs: fix race between journal entry close and pin set
Date: Thu, 31 Aug 2023 07:07:34 -0400 [thread overview]
Message-ID: <20230831110734.787212-4-bfoster@redhat.com> (raw)
In-Reply-To: <20230831110734.787212-1-bfoster@redhat.com>
bcachefs freeze testing via fstests generic/390 occasionally
reproduces the following BUG from bch2_fs_read_only():
BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty));
This indicates that one or more dirty key cache keys still exist
after the attempt to flush and quiesce the fs. The sequence that
leads to this problem actually occurs on unfreeze (ro->rw), and
looks something like the following:
- Task A begins a transaction commit and acquires journal_res for
the current seq. This transaction intends to perform key cache
insertion.
- Task B begins a bch2_journal_flush() via bch2_sync_fs(). This ends
up in journal_entry_want_write(), which closes the current journal
entry and drops the reference to the pin list created on entry open.
The pin put pops the front of the journal via fast reclaim since the
reference count has dropped to 0.
- Task A attempts to set the journal pin for the associated cached
key, but bch2_journal_pin_set() skips the pin insert because the
seq of the transaction reservation is behind the front of the pin
list fifo.
The end result is that the pin associated with the cached key is not
added, which prevents a subsequent reclaim from processing the key
and thus leaves it dangling at freeze time. The fundamental cause of
this problem is that the front of the journal is allowed to pop
before a transaction with outstanding reservation on the associated
journal seq is able to add a pin. The count for the pin list
associated with the seq drops to zero and is prematurely reclaimed
as a result.
The logical fix for this problem lies in how the journal buffer is
managed in similar scenarios where the entry might have been closed
before a transaction with outstanding reservations happens to be
committed.
When a journal entry is opened, the current sequence number is
bumped, the associated pin list is initialized with a reference
count of 1, and the journal buffer reference count is bumped (via
journal_state_inc()). When a journal reservation is acquired, the
reservation also acquires a reference on the associated buffer. If
the journal entry is closed in the meantime, it drops both the pin
and buffer references held by the open entry, but the buffer still
has references held by outstanding reservation. After the associated
transaction commits, the reservation release drops the associated
buffer references and the buffer is written out once the reference
count has dropped to zero.
The fundamental problem here is that the lifecycle of the pin list
reference held by an open journal entry is too short to cover
the processing of transactions with outstanding reservations. The
simplest way to address this is to expand the pin list reference
to the lifecycle of the buffer vs. the shorter lifecycle of the
open journal entry. This allows a journal entry to closer for
whatever reason, but ensures the pin list for the seq with
outstanding reservation cannot be popped and reclaimed before all
outstanding reservations have been released.
Move the pin put from journal entry close to where final reference
count processing of the journal buffer occurs. Create a duplicate
helper to cover the case where the caller doesn't already hold the
journal lock. This allows generic/390 to pass reliably.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/bcachefs/journal.c | 4 +---
fs/bcachefs/journal.h | 19 +++++++++++++++++--
2 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index 76ebcdb3e3b4..21f97d48931d 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -195,13 +195,11 @@ static void __journal_entry_close(struct journal *j, unsigned closed_val)
buf->data->last_seq = cpu_to_le64(buf->last_seq);
BUG_ON(buf->last_seq > le64_to_cpu(buf->data->seq));
- __bch2_journal_pin_put(j, le64_to_cpu(buf->data->seq));
-
cancel_delayed_work(&j->write_work);
bch2_journal_space_available(j);
- bch2_journal_buf_put(j, old.idx);
+ __bch2_journal_buf_put(j, old.idx, le64_to_cpu(buf->data->seq));
}
void bch2_journal_halt(struct journal *j)
diff --git a/fs/bcachefs/journal.h b/fs/bcachefs/journal.h
index c7a31da077c9..5dea5d46c616 100644
--- a/fs/bcachefs/journal.h
+++ b/fs/bcachefs/journal.h
@@ -112,6 +112,7 @@
#include <linux/hash.h>
#include "journal_types.h"
+#include "journal_reclaim.h"
struct bch_fs;
@@ -269,13 +270,27 @@ static inline union journal_res_state journal_state_buf_put(struct journal *j, u
}
void bch2_journal_write(struct closure *);
-static inline void bch2_journal_buf_put(struct journal *j, unsigned idx)
+static inline void __bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq)
{
struct bch_fs *c = container_of(j, struct bch_fs, journal);
union journal_res_state s;
s = journal_state_buf_put(j, idx);
if (!journal_state_count(s, idx)) {
+ __bch2_journal_pin_put(j, seq);
+ if (idx == s.unwritten_idx)
+ closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL);
+ }
+}
+
+static inline void bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq)
+{
+ struct bch_fs *c = container_of(j, struct bch_fs, journal);
+ union journal_res_state s;
+
+ s = journal_state_buf_put(j, idx);
+ if (!journal_state_count(s, idx)) {
+ bch2_journal_pin_put(j, seq);
if (idx == s.unwritten_idx)
closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL);
}
@@ -298,7 +313,7 @@ static inline void bch2_journal_res_put(struct journal *j,
BCH_JSET_ENTRY_btree_keys,
0, 0, 0);
- bch2_journal_buf_put(j, res->idx);
+ bch2_journal_buf_put(j, res->idx, res->seq);
res->ref = 0;
}
--
2.41.0
next prev parent reply other threads:[~2023-08-31 11:08 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-31 11:07 [PATCH 0/3] bcachefs: journal bug fixes Brian Foster
2023-08-31 11:07 ` [PATCH 1/3] bcachefs: restart journal reclaim thread on ro->rw transitions Brian Foster
2023-08-31 11:07 ` [PATCH 2/3] bcachefs: prepare journal buf put to handle pin put Brian Foster
2023-09-03 21:09 ` Kent Overstreet
2023-09-05 12:41 ` Brian Foster
2023-09-10 1:55 ` Kent Overstreet
2023-08-31 11:07 ` Brian Foster [this message]
2023-09-03 21:18 ` [PATCH 3/3] bcachefs: fix race between journal entry close and pin set Kent Overstreet
2023-09-04 2:29 ` Kent Overstreet
2023-09-05 12:59 ` Brian Foster
2023-09-06 19:07 ` Brian Foster
2023-09-10 2:03 ` Kent Overstreet
2023-09-12 18:54 ` Brian Foster
2023-09-12 19:15 ` Kent Overstreet
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230831110734.787212-4-bfoster@redhat.com \
--to=bfoster@redhat.com \
--cc=linux-bcachefs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox