From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: fix unlink vs cluster buffer instantiation race
Date: Thu, 13 Jun 2024 08:07:41 -0700 [thread overview]
Message-ID: <20240613150741.GN2764752@frogsfrogsfrogs> (raw)
In-Reply-To: <20240612225148.3989713-1-david@fromorbit.com>
On Thu, Jun 13, 2024 at 08:51:48AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Luis has been reporting an assert failure when freeing an inode
> cluster during inode inactivation for a while. The assert looks
> like:
>
> XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
> ------------[ cut here ]------------
> kernel BUG at fs/xfs/xfs_message.c:102!
> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs]
> RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs
> RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7
> RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0
> RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b
> R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000
> R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80
> FS: 0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> <TASK>
> xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs
> xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs
> xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs
> xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs
> __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs
> xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs
> xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs
> xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs
> process_one_work (kernel/workqueue.c:3231)
> worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2))
> kthread (kernel/kthread.c:389)
> ret_from_fork (arch/x86/kernel/process.c:147)
> ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
> </TASK>
>
> And occurs when the the inode precommit handlers is attempt to look
> up the inode cluster buffer to attach the inode for writeback.
>
> The trail of logic that I can reconstruct is as follows.
>
> 1. the inode is clean when inodegc runs, so it is not
> attached to a cluster buffer when precommit runs.
>
> 2. #1 implies the inode cluster buffer may be clean and not
> pinned by dirty inodes when inodegc runs.
>
> 3. #2 implies that the inode cluster buffer can be reclaimed
> by memory pressure at any time.
>
> 4. The assert failure implies that the cluster buffer was
> attached to the transaction, but not marked done. It had
> been accessed earlier in the transaction, but not marked
> done.
>
> 5. #4 implies the cluster buffer has been invalidated (i.e.
> marked stale).
>
> 6. #5 implies that the inode cluster buffer was instantiated
> uninitialised in the transaction in xfs_ifree_cluster(),
> which only instantiates the buffers to invalidate them
> and never marks them as done.
>
> Given factors 1-3, this issue is highly dependent on timing and
> environmental factors. Hence the issue can be very difficult to
> reproduce in some situations, but highly reliable in others. Luis
> has an environment where it can be reproduced easily by g/531 but,
> OTOH, I've reproduced it only once in ~2000 cycles of g/531.
>
> I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag
> on the cluster buffers, even though they may not be initialised. The
> reasons why I think this is safe are:
>
> 1. A buffer cache lookup hit on a XBF_STALE buffer will
> clear the XBF_DONE flag. Hence all future users of the
> buffer know they have to re-initialise the contents
> before use and mark it done themselves.
>
> 2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which
> means the buffer remains locked until the journal commit
> completes and the buffer is unpinned. Hence once marked
> XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only
> context that can access the freed buffer is the currently
> running transaction.
>
> 3. #2 implies that future buffer lookups in the currently
> running transaction will hit the transaction match code
> and not the buffer cache. Hence XBF_STALE and
> XFS_BLI_STALE will not be cleared unless the transaction
> initialises and logs the buffer with valid contents
> again. At which point, the buffer will be marked marked
> XBF_DONE again, so having XBF_DONE already set on the
> stale buffer is a moot point.
>
> 4. #2 also implies that any concurrent access to that
> cluster buffer will block waiting on the buffer lock
> until the inode cluster has been fully freed and is no
> longer an active inode cluster buffer.
>
> 5. #4 + #1 means that any future user of the disk range of
> that buffer will always see the range of disk blocks
> covered by the cluster buffer as not done, and hence must
> initialise the contents themselves.
>
> 6. Setting XBF_DONE in xfs_ifree_cluster() then means the
> unlinked inode precommit code will see a XBF_DONE buffer
> from the transaction match as it expects. It can then
> attach the stale but newly dirtied inode to the stale
> but newly dirtied cluster buffer without unexpected
> failures. The stale buffer will then sail through the
> journal and do the right thing with the attached stale
> inode during unpin.
>
> Hence the fix is just one line of extra code. The explanation of
> why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and
> complex....
>
> Fixes: 82842fee6e59 ("xfs: fix AGF vs inode cluster buffer deadlock")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Tested-by: Luis Chamberlain <mcgrof@kernel.org>
I'm pretty sure this also fixes the same symptoms that I can (very very
occasionally) reproduce. Though, I haven't seen it in a while.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/xfs/xfs_inode.c | 23 +++++++++++++++++++----
> 1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 58fb7a5062e1..f36091e1e7f5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2548,11 +2548,26 @@ xfs_ifree_cluster(
> * This buffer may not have been correctly initialised as we
> * didn't read it from disk. That's not important because we are
> * only using to mark the buffer as stale in the log, and to
> - * attach stale cached inodes on it. That means it will never be
> - * dispatched for IO. If it is, we want to know about it, and we
> - * want it to fail. We can acheive this by adding a write
> - * verifier to the buffer.
> + * attach stale cached inodes on it.
> + *
> + * For the inode that triggered the cluster freeing, this
> + * attachment may occur in xfs_inode_item_precommit() after we
> + * have marked this buffer stale. If this buffer was not in
> + * memory before xfs_ifree_cluster() started, it will not be
> + * marked XBF_DONE and this will cause problems later in
> + * xfs_inode_item_precommit() when we trip over a (stale, !done)
> + * buffer to attached to the transaction.
> + *
> + * Hence we have to mark the buffer as XFS_DONE here. This is
> + * safe because we are also marking the buffer as XBF_STALE and
> + * XFS_BLI_STALE. That means it will never be dispatched for
> + * IO and it won't be unlocked until the cluster freeing has
> + * been committed to the journal and the buffer unpinned. If it
> + * is written, we want to know about it, and we want it to
> + * fail. We can acheive this by adding a write verifier to the
> + * buffer.
> */
> + bp->b_flags |= XBF_DONE;
> bp->b_ops = &xfs_inode_buf_ops;
>
> /*
> --
> 2.45.1
>
>
next prev parent reply other threads:[~2024-06-13 15:07 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-06-12 22:51 [PATCH] xfs: fix unlink vs cluster buffer instantiation race Dave Chinner
2024-06-13 13:35 ` Christoph Hellwig
2024-06-13 15:07 ` Darrick J. Wong [this message]
2024-06-14 10:01 ` Chandan Babu R
2024-06-14 10:14 ` Christoph Hellwig
2024-06-14 12:30 ` Chandan Babu R
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240613150741.GN2764752@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox