Re: [PATCH] xfs: fix unlink vs cluster buffer instantiation race

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: fix unlink vs cluster buffer instantiation race
Date: Thu, 13 Jun 2024 08:07:41 -0700	[thread overview]
Message-ID: <20240613150741.GN2764752@frogsfrogsfrogs> (raw)
In-Reply-To: <20240612225148.3989713-1-david@fromorbit.com>

On Thu, Jun 13, 2024 at 08:51:48AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Luis has been reporting an assert failure when freeing an inode
> cluster during inode inactivation for a while. The assert looks
> like:
> 
>  XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
>  ------------[ cut here ]------------
>  kernel BUG at fs/xfs/xfs_message.c:102!
>  Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
>  CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>  Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs]
>  RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs
>  RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202
>  RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7
>  RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0
>  RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b
>  R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000
>  R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80
>  FS:  0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>  PKRU: 55555554
>  Call Trace:
>   <TASK>
>  xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs
>  xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs
>  xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs
>  xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs
>  __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs
>  xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs
>  xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs
>  xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs
>  process_one_work (kernel/workqueue.c:3231)
>  worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2))
>  kthread (kernel/kthread.c:389)
>  ret_from_fork (arch/x86/kernel/process.c:147)
>  ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
>   </TASK>
> 
> And occurs when the the inode precommit handlers is attempt to look
> up the inode cluster buffer to attach the inode for writeback.
> 
> The trail of logic that I can reconstruct is as follows.
> 
> 	1. the inode is clean when inodegc runs, so it is not
> 	   attached to a cluster buffer when precommit runs.
> 
> 	2. #1 implies the inode cluster buffer may be clean and not
> 	   pinned by dirty inodes when inodegc runs.
> 
> 	3. #2 implies that the inode cluster buffer can be reclaimed
> 	   by memory pressure at any time.
> 
> 	4. The assert failure implies that the cluster buffer was
> 	   attached to the transaction, but not marked done. It had
> 	   been accessed earlier in the transaction, but not marked
> 	   done.
> 
> 	5. #4 implies the cluster buffer has been invalidated (i.e.
> 	   marked stale).
> 
> 	6. #5 implies that the inode cluster buffer was instantiated
> 	   uninitialised in the transaction in xfs_ifree_cluster(),
> 	   which only instantiates the buffers to invalidate them
> 	   and never marks them as done.
> 
> Given factors 1-3, this issue is highly dependent on timing and
> environmental factors. Hence the issue can be very difficult to
> reproduce in some situations, but highly reliable in others. Luis
> has an environment where it can be reproduced easily by g/531 but,
> OTOH, I've reproduced it only once in ~2000 cycles of g/531.
> 
> I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag
> on the cluster buffers, even though they may not be initialised. The
> reasons why I think this is safe are:
> 
> 	1. A buffer cache lookup hit on a XBF_STALE buffer will
> 	   clear the XBF_DONE flag. Hence all future users of the
> 	   buffer know they have to re-initialise the contents
> 	   before use and mark it done themselves.
> 
> 	2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which
> 	   means the buffer remains locked until the journal commit
> 	   completes and the buffer is unpinned. Hence once marked
> 	   XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only
> 	   context that can access the freed buffer is the currently
> 	   running transaction.
> 
> 	3. #2 implies that future buffer lookups in the currently
> 	   running transaction will hit the transaction match code
> 	   and not the buffer cache. Hence XBF_STALE and
> 	   XFS_BLI_STALE will not be cleared unless the transaction
> 	   initialises and logs the buffer with valid contents
> 	   again. At which point, the buffer will be marked marked
> 	   XBF_DONE again, so having XBF_DONE already set on the
> 	   stale buffer is a moot point.
> 
> 	4. #2 also implies that any concurrent access to that
> 	   cluster buffer will block waiting on the buffer lock
> 	   until the inode cluster has been fully freed and is no
> 	   longer an active inode cluster buffer.
> 
> 	5. #4 + #1 means that any future user of the disk range of
> 	   that buffer will always see the range of disk blocks
> 	   covered by the cluster buffer as not done, and hence must
> 	   initialise the contents themselves.
> 
> 	6. Setting XBF_DONE in xfs_ifree_cluster() then means the
> 	   unlinked inode precommit code will see a XBF_DONE buffer
> 	   from the transaction match as it expects. It can then
> 	   attach the stale but newly dirtied inode to the stale
> 	   but newly dirtied cluster buffer without unexpected
> 	   failures. The stale buffer will then sail through the
> 	   journal and do the right thing with the attached stale
> 	   inode during unpin.
> 
> Hence the fix is just one line of extra code. The explanation of
> why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and
> complex....
> 
> Fixes: 82842fee6e59 ("xfs: fix AGF vs inode cluster buffer deadlock")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Tested-by: Luis Chamberlain <mcgrof@kernel.org>

I'm pretty sure this also fixes the same symptoms that I can (very very
occasionally) reproduce.  Though, I haven't seen it in a while.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_inode.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 58fb7a5062e1..f36091e1e7f5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2548,11 +2548,26 @@ xfs_ifree_cluster(
>  		 * This buffer may not have been correctly initialised as we
>  		 * didn't read it from disk. That's not important because we are
>  		 * only using to mark the buffer as stale in the log, and to
> -		 * attach stale cached inodes on it. That means it will never be
> -		 * dispatched for IO. If it is, we want to know about it, and we
> -		 * want it to fail. We can acheive this by adding a write
> -		 * verifier to the buffer.
> +		 * attach stale cached inodes on it.
> +		 *
> +		 * For the inode that triggered the cluster freeing, this
> +		 * attachment may occur in xfs_inode_item_precommit() after we
> +		 * have marked this buffer stale.  If this buffer was not in
> +		 * memory before xfs_ifree_cluster() started, it will not be
> +		 * marked XBF_DONE and this will cause problems later in
> +		 * xfs_inode_item_precommit() when we trip over a (stale, !done)
> +		 * buffer to attached to the transaction.
> +		 *
> +		 * Hence we have to mark the buffer as XFS_DONE here. This is
> +		 * safe because we are also marking the buffer as XBF_STALE and
> +		 * XFS_BLI_STALE. That means it will never be dispatched for
> +		 * IO and it won't be unlocked until the cluster freeing has
> +		 * been committed to the journal and the buffer unpinned. If it
> +		 * is written, we want to know about it, and we want it to
> +		 * fail. We can acheive this by adding a write verifier to the
> +		 * buffer.
>  		 */
> +		bp->b_flags |= XBF_DONE;
>  		bp->b_ops = &xfs_inode_buf_ops;
>  
>  		/*
> -- 
> 2.45.1
> 
>

next prev parent reply	other threads:[~2024-06-13 15:07 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-12 22:51 [PATCH] xfs: fix unlink vs cluster buffer instantiation race Dave Chinner
2024-06-13 13:35 ` Christoph Hellwig
2024-06-13 15:07 ` Darrick J. Wong [this message]
2024-06-14 10:01 ` Chandan Babu R
2024-06-14 10:14   ` Christoph Hellwig
2024-06-14 12:30     ` Chandan Babu R

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240613150741.GN2764752@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox