* XBF_DONE semantics
@ 2023-11-28 15:38 Christoph Hellwig
From: Christoph Hellwig @ 2023-11-28 15:38 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs, Dave Chinner
Hi Darrick,
while reviewing your online repair series I've noticed that the new
xfs_buf_delwri_queue_here helper sets XBF_DONE in addition to waiting
for the buffer to go off a delwri list, and that reminded me of an
assert I saw during my allocator experiments, where
xfs_trans_read_buf_map or xfs_buf_reverify trip on a buffer that doesn't
have XBF_DONE set because it comes from an ifree transaction (see
my current not fully thought out bandaid below).
The way we currently set and check XBF_DONE seems a bit undefined. The
one clear use case is that read uses it to see if a buffer was read in.
But places that use buf_get and manually fill in data only use it in a
few cases. Do we need to define clear semantics for it? Or maybe
replace with an XBF_READ_DONE flag for that main read use case and
then think what to do with the rest?
---
From 80d148555ca261777ad728455a9a240e0007f54e Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Mon, 13 Nov 2023 12:02:15 -0500
Subject: xfs: remove the XBF_DONE asserts in xfs_buf_reverify and
xfs_trans_read_buf_map
When xfs_trans_read_buf_map is called from xfs_inode_item_precommit
through xfs_imap_to_bp, we can find an inode buffer that doesn't have
XBF_DONE because xfs_ifree_cluster doesn't read the buffer but instead
just gets it before marking all buffers stale.
[ 206.728129] ------------[ cut here ]------------
[ 206.728573] WARNING: CPU: 0 PID: 6320 at fs/xfs/xfs_trans_buf.c:256 xfs_trans_read_buf_map+0x20
[ 206.728971] Modules linked in:
[ 206.729099] CPU: 0 PID: 6320 Comm: kworker/0:124 Not tainted 6.6.0+ #1598
[ 206.729368] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 044
[ 206.729744] Workqueue: xfs-inodegc/nvme1n1 xfs_inodegc_worker
[ 206.729984] RIP: 0010:xfs_trans_read_buf_map+0x2ab/0x570
[ 206.730295] Code: 0f 84 ae fe ff ff b9 4e 01 00 00 48 c7 c6 c8 62 0d 83 48 c7 c2 a5 15 ff 82 3d
[ 206.731131] RSP: 0018:ffffc9000751fc38 EFLAGS: 00010246
[ 206.731414] RAX: 0000000002110040 RBX: ffff88824c47dbd8 RCX: 0000000000000000
[ 206.731748] RDX: ffff88824c47dc90 RSI: 0000000000000000 RDI: ffff8881705b46c0
[ 206.732139] RBP: ffffc9000751fca8 R08: ffff888106416e00 R09: ffffc9000751fca8
[ 206.732428] R10: ffff8881e09faa00 R11: ffff8881e09fb408 R12: 0000000000000001
[ 206.732786] R13: ffff88810ddb2000 R14: ffffffff82b36660 R15: ffffc9000751fcf0
[ 206.733092] FS: 0000000000000000(0000) GS:ffff888277c00000(0000) knlGS:0000000000000000
[ 206.733460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 206.733797] CR2: 00007f9833f399c0 CR3: 00000002113de000 CR4: 0000000000750ef0
[ 206.734131] PKRU: 55555554
[ 206.734320] Call Trace:
[ 206.734438] <TASK>
[ 206.734537] ? xfs_trans_read_buf_map+0x2ab/0x570
[ 206.734824] ? __warn+0x80/0x170
[ 206.735051] ? xfs_trans_read_buf_map+0x2ab/0x570
[ 206.735342] ? report_bug+0x18d/0x1c0
[ 206.735533] ? handle_bug+0x41/0x70
[ 206.735679] ? exc_invalid_op+0x17/0x60
[ 206.735852] ? asm_exc_invalid_op+0x1a/0x20
[ 206.736054] ? xfs_trans_read_buf_map+0x2ab/0x570
[ 206.736262] ? xfs_trans_read_buf_map+0x6b/0x570
[ 206.736500] xfs_imap_to_bp+0x60/0xc0
[ 206.736668] xfs_inode_item_precommit+0x172/0x2a0
[ 206.736947] ? xfs_iunlink_item_precommit+0xae/0x220
[ 206.737191] xfs_trans_run_precommits+0x60/0xc0
[ 206.737460] __xfs_trans_commit+0x67/0x3a0
[ 206.737668] xfs_inactive_ifree+0xfb/0x1f0
[ 206.737889] xfs_inactive+0x22f/0x420
[ 206.738059] xfs_inodegc_worker+0xa3/0x200
[ 206.738227] ? process_one_work+0x171/0x4a0
[ 206.738450] process_one_work+0x1d8/0x4a0
[ 206.738651] worker_thread+0x1ce/0x3b0
[ 206.738864] ? wq_sysfs_prep_attrs+0x90/0x90
[ 206.739074] kthread+0xf2/0x120
[ 206.739216] ? kthread_complete_and_exit+0x20/0x20
[ 206.739482] ret_from_fork+0x2c/0x40
[ 206.739675] ? kthread_complete_and_exit+0x20/0x20
[ 206.740074] ret_from_fork_asm+0x11/0x20
[ 206.740402] </TASK>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_buf.c | 1 -
fs/xfs/xfs_trans_buf.c | 1 -
2 files changed, 2 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 9bf79172efc824..4ff9f1b1abb698 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -863,7 +863,6 @@ xfs_buf_reverify(
struct xfs_buf *bp,
const struct xfs_buf_ops *ops)
{
- ASSERT(bp->b_flags & XBF_DONE);
ASSERT(bp->b_error == 0);
if (!ops || bp->b_ops)
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 8e886ecfd69a3b..575922c64d4d3a 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -253,7 +253,6 @@ xfs_trans_read_buf_map(
ASSERT(bp->b_transp == tp);
ASSERT(bp->b_log_item != NULL);
ASSERT(!bp->b_error);
- ASSERT(bp->b_flags & XBF_DONE);
/*
* We never locked this buf ourselves, so we shouldn't
--
2.39.2
* Re: XBF_DONE semantics
From: Darrick J. Wong @ 2023-11-28 16:58 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs, Dave Chinner
On Tue, Nov 28, 2023 at 04:38:08PM +0100, Christoph Hellwig wrote:
> Hi Darrick,
>
> while reviewing your online repair series I've noticed that the new
> xfs_buf_delwri_queue_here helper sets XBF_DONE in addition to waiting
> for the buffer to go off a delwri list, and that reminded me of an
> assert I saw during my allocator experiments, where
> xfs_trans_read_buf_map or xfs_buf_reverify trip on a buffer that doesn't
> have XBF_DONE set because it comes from an ifree transaction (see
> my current not fully thought out bandaid below).
LOL, guess what I was looking at yesterday too! :)
> The way we currently set and check XBF_DONE seems a bit undefined. The
> one clear use case is that read uses it to see if a buffer was read in.
> But places that use buf_get and manually fill in data only use it in a
> few cases. Do we need to define clear semantics for it? Or maybe
> replace with an XBF_READ_DONE flag for that main read use case and
> then think what to do with the rest?
I thought XBF_DONE meant "contents have been read in from disk and
have passed/will pass verifiers"
> ---
> From 80d148555ca261777ad728455a9a240e0007f54e Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Mon, 13 Nov 2023 12:02:15 -0500
> Subject: xfs: remove the XBF_DONE asserts in xfs_buf_reverify and
> xfs_trans_read_buf_map
>
> When xfs_trans_read_buf_map is called from xfs_inode_item_precommit
> through xfs_imap_to_bp, we can find an inode buffer that doesn't have
> XBF_DONE because xfs_ifree_cluster doesn't read the buffer but instead
> just gets it before marking all buffers stale.
Dave and I wondered if xfs_inode_item_precommit should be grabbing the
buffer at all when ISTALE is set, since xfs_ifree_cluster should have
staled (and invalidated) the buffer after setting ISTALE.
I got bogged down in proving that correct because I got distracted by
new shiny^W^W quota tests failing in -rc3 for some weird reason.
> [stack trace snipped]
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/xfs_buf.c | 1 -
> fs/xfs/xfs_trans_buf.c | 1 -
> 2 files changed, 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 9bf79172efc824..4ff9f1b1abb698 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -863,7 +863,6 @@ xfs_buf_reverify(
> struct xfs_buf *bp,
> const struct xfs_buf_ops *ops)
> {
> - ASSERT(bp->b_flags & XBF_DONE);
> ASSERT(bp->b_error == 0);
>
> if (!ops || bp->b_ops)
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 8e886ecfd69a3b..575922c64d4d3a 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -253,7 +253,6 @@ xfs_trans_read_buf_map(
> ASSERT(bp->b_transp == tp);
> ASSERT(bp->b_log_item != NULL);
> ASSERT(!bp->b_error);
> - ASSERT(bp->b_flags & XBF_DONE);
I don't think this is the right thing to do here -- if the buffer is
attached to a transaction, it ought to be XBF_DONE. I think every
transaction that calls _get_buf and rewrites the buffer contents will
set XBF_DONE via xfs_trans_dirty_buf, right?
Hmm. Maybe I'm wrong -- a transaction could bjoin a buffer and then
call xfs_trans_read_buf_map before dirtying it. That strikes me as a
suspicious thing to do, though.
--D
>
> /*
> * We never locked this buf ourselves, so we shouldn't
> --
> 2.39.2
>
>
* Re: XBF_DONE semantics
From: Christoph Hellwig @ 2023-11-28 17:13 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, Dave Chinner
On Tue, Nov 28, 2023 at 08:58:31AM -0800, Darrick J. Wong wrote:
> > The way we currently set and check XBF_DONE seems a bit undefined. The
> > one clear use case is that read uses it to see if a buffer was read in.
> > But places that use buf_get and manually fill in data only use it in a
> > few cases. Do we need to define clear semantics for it? Or maybe
> > replace with an XBF_READ_DONE flag for that main read use case and
> > then think what to do with the rest?
>
> I thought XBF_DONE meant "contents have been read in from disk and
> have passed/will pass verifiers"
That's what I thought too. But there's clearly code that treats it
differently...
> Dave and I wondered if xfs_inode_item_precommit should be grabbing the
> buffer at all when ISTALE is set, since xfs_ifree_cluster should have
> staled (and invalidated) the buffer after setting ISTALE.
That does sound reasonable.
> > +++ b/fs/xfs/xfs_trans_buf.c
> > @@ -253,7 +253,6 @@ xfs_trans_read_buf_map(
> > ASSERT(bp->b_transp == tp);
> > ASSERT(bp->b_log_item != NULL);
> > ASSERT(!bp->b_error);
> > - ASSERT(bp->b_flags & XBF_DONE);
>
> I don't think this is the right thing to do here -- if the buffer is
> attached to a transaction, it ought to be XBF_DONE. I think every
> transaction that calls _get_buf and rewrites the buffer contents will
> set XBF_DONE via xfs_trans_dirty_buf, right?
>
> Hmm. Maybe I'm wrong -- a transaction could bjoin a buffer and then
> call xfs_trans_read_buf_map before dirtying it. That strikes me as a
> suspicious thing to do, though.
I suspect it's happening here somehow. I can try to find some more
time to pin it down.
* Re: XBF_DONE semantics
From: Dave Chinner @ 2023-11-28 22:42 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, Dave Chinner
On Tue, Nov 28, 2023 at 08:58:31AM -0800, Darrick J. Wong wrote:
> On Tue, Nov 28, 2023 at 04:38:08PM +0100, Christoph Hellwig wrote:
> > Hi Darrick,
> >
> > while reviewing your online repair series I've noticed that the new
> > xfs_buf_delwri_queue_here helper sets XBF_DONE in addition to waiting
> > for the buffer to go off a delwri list, and that reminded me of an
> > assert I saw during my allocator experiments, where
> > xfs_trans_read_buf_map or xfs_buf_reverify trip on a buffer that doesn't
> > have XBF_DONE set because it comes from an ifree transaction (see
> > my current not fully thought out bandaid below).
>
> LOL, guess what I was looking at yesterday too! :)
>
> > The way we currently set and check XBF_DONE seems a bit undefined. The
> > one clear use case is that read uses it to see if a buffer was read in.
> > But places that use buf_get and manually fill in data only use it in a
> > few cases. Do we need to define clear semantics for it? Or maybe
> > replace with an XBF_READ_DONE flag for that main read use case and
> > then think what to do with the rest?
>
> I thought XBF_DONE meant "contents have been read in from disk and
> have passed/will pass verifiers"
Effectively.
> > diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> > index 8e886ecfd69a3b..575922c64d4d3a 100644
> > --- a/fs/xfs/xfs_trans_buf.c
> > +++ b/fs/xfs/xfs_trans_buf.c
> > @@ -253,7 +253,6 @@ xfs_trans_read_buf_map(
> > ASSERT(bp->b_transp == tp);
> > ASSERT(bp->b_log_item != NULL);
> > ASSERT(!bp->b_error);
> > - ASSERT(bp->b_flags & XBF_DONE);
>
> I don't think this is the right thing to do here -- if the buffer is
> attached to a transaction, it ought to be XBF_DONE. I think every
> transaction that calls _get_buf and rewrites the buffer contents will
> set XBF_DONE via xfs_trans_dirty_buf, right?
It should, yes. Otherwise the initialisation data in the buffer has
not been logged and may result in writeback and/or recovery being
incorrectly ordered w.r.t. the transaction that initialised the
buffer contents.
> Hmm. Maybe I'm wrong -- a transaction could bjoin a buffer and then
> call xfs_trans_read_buf_map before dirtying it. That strikes me as a
> suspicious thing to do, though.
It also seems to me that it doesn't solve the problem of buffer
invalidation followed by a read operation in commit context.
Having thought a bit more on this (admittedly they are feverish,
covid-inspired thoughts), I suspect the real problem here is
requiring xfs_imap_to_bp() to ensure we can pin the cluster buffer
in memory whilst there are dirty inodes over it.
If we go back to the days of Irix and the early days of the linux
port, we always pinned the inode cluster buffer in memory whilst we
had an inode active in cache via the inode cluster hash index. We
could do lookups directly on in-memory inode cluster buffers rather
than inodes. Inodes held references on the cluster buffer, and when
the last reference to a cluster buffer went away, it was removed
from the inode cluster hash.
i.e. we never had the inode cluster buffer RMW problem where cached
inodes got dirtied without a cluster buffer already being in memory.
It was simple, and the only downside to it was that it didn't scale.
Hence we got rid of the inode cluster hash index and the pinning of
inode cluster buffers with the introduction of the RCU protected
inode cache based on radix trees. The radix trees were so much more
efficient than a fixed sized cluster hash that we simply did away
with them, and got a memory footprint improvement for read-mostly
inode traversal workloads for free.
Perhaps it is time to revisit that ~15 year old architectural
choice? We've reinstated the pinning for dirty inodes, perhaps we
should just do it unconditionally for all inodes now and reinstate
the direct inode -> cluster buffer relationship we once had.
This has other benefits as well. We can "densify" the XFS clean inode
cache by making VFS inodes unconditionally I_DONT_CACHE, and
simply rely on the xfs buffer cache LRU to hold recently used inode
buffers. It solves several nasty corner cases around inode cluster
buffer freeing and XFS_ISTALE. It allows us to avoid cluster buffer
lookups in the inode logging path. It opens the gate to shared
access to the buffer for flushing dirty inodes to the cluster buffer
without adding new lifecycle problems. It allows significant
efficiency optimisations in managing inode items in the AIL because
lifecycle discrepancies between cluster buffers and inodes go away.
And so on.
So perhaps the best fix here is reinstate the old behaviour of
inodes always pinning the inode cluster buffer in memory and hence
eliding the need for on-demand, log item/IO completion time item
tracking altogether....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: XBF_DONE semantics
From: Dave Chinner @ 2023-11-28 21:08 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs, Dave Chinner
On Tue, Nov 28, 2023 at 04:38:08PM +0100, Christoph Hellwig wrote:
> Hi Darrick,
>
> while reviewing your online repair series I've noticed that the new
> xfs_buf_delwri_queue_here helper sets XBF_DONE in addition to waiting
> for the buffer to go off a delwri list, and that reminded me of an
> assert I saw during my allocator experiments, where
> xfs_trans_read_buf_map or xfs_buf_reverify trip on a buffer that doesn't
> have XBF_DONE set because it comes from an ifree transaction (see
> my current not fully thought out bandaid below).
I'll come back to the bug later, because I know what it is and I
just haven't had time to fix it yet. I'll address XBF_DONE first.
XBF_DONE means the data in the buffer is valid. It's equivalent to
the uptodate bit in a folio. It has no other meaning.
> The way we currently set and check XBF_DONE seems a bit undefined. The
> one clear use case is that read uses it to see if a buffer was read in.
Yes.
> But places that use buf_get and manually fill in data only use it in a
> few cases.
Yes. The caller of buf_get always needs to set XBF_DONE if it is
initialising a new buffer ready for it to be written. It should be
done before the caller drops the buffer lock so that no other lookup
can see the buffer in the state of "contains valid data but does not
have XBF_DONE set".
Also, there are cases where we use buf_get but we don't care about
the contents being initialised because we are invalidating
the buffer and need the buffer+buf_log_item to log the invalidation
to the journal.
In these cases we just don't care that the contents
are valid, because xfs_trans_binval() calls xfs_buf_stale() to
invalidate the contents and that removes the XBF_DONE flag. We do
this in places like inode chunk removal to invalidate the cluster
buffers to ensure they are not written back after the chunk has been
freed.
.... and this brings us to the bug that you mentioned about an ifree
transaction leaving an inode cluster buffer in cache without
XBF_DONE set....
The issue is xfs_inode_item_precommit() attaching inodes to the
cluster buffer. In the old days before we delayed inode logging to
the end of the ifree transaction, the order was:
xfs_ifree
    xfs_difree(ip)
        xfs_imap_to_bp()
            xfs_trans_buf_read()
        xfs_trans_brelse()
    xfs_trans_log_inode(ip)
    xfs_ifree_cluster(ip)
        for each incore inode {
            set XFS_ISTALE
        }
        for each cluster buffer {
            xfs_trans_buf_get()
            xfs_trans_binval()
        }
xfs_trans_commit()
IOWs, the attachment of the inode to the cluster buffer in
xfs_trans_log_inode() occurred before both the inode was marked
XFS_ISTALE and the cluster buffer was marked XBF_STALE and XBF_DONE
was removed from it. Hence the lookup in xfs_difree() always found a
valid XBF_DONE buffer.
With the fixes for AGF->AGI->inode cluster buffer locking order done
a while back, we moved the processing that was done in
xfs_trans_log_inode() to xfs_inode_item_precommit(), which is called
from xfs_trans_commit(). This moved the xfs_imap_to_bp() call when
logging the inode from before the cluster invalidation to after it.
The result is that imap_to_bp() now finds the inode cluster buffer
in memory (as it should), but it has been marked stale (correctly!)
and xfs_trans_buf_read_map() freaks out over that (again,
correctly!).
The key to triggering this situation is that the inode cluster
buffer needs to be written back between the unlink() syscall and the
inactivation processing that frees the cluster buffer. Inode cluster
buffer IO completion removes the inodes from the cluster buffer, so
when they are next dirtied they need to be re-added. If this inode
cluster buffer writeback coincides with the transaction removing of
the last inode from an inode chunk and hence freeing the inode
chunk, we end up with this situation occurring and assert failures
in xfs_trans_read_buf_map().
So, like I said: I know what the bug is, I worked it out from the
one time one of my test machines tripped over it about 4 weeks ago,
but I just haven't had the time since then to work out a fix.
I suspect that we can check XFS_ISTALE in xfs_inode_item_precommit()
and do something different, but I'd much prefer that the inode still
gets added to the inode cluster buffer and cleaned up with all the
other XFS_ISTALE inodes on the cluster buffer at journal commit
completion time. Maybe we can pass a new flag to xfs_imap_to_bp() to
say "stale buffer ok here" or something similar, because I really
want the general case of xfs_trans_buf_read_map() to fail loudly if
a buffer without XBF_DONE is returned....
> Do we need to define clear semantics for it? Or maybe
> replace with an XBF_READ_DONE flag for that main read use case and
> then think what to do with the rest?
To me, the semantics of XBF_DONE are pretty clear. Apart from fixing
the bug you are seeing, I'm not sure that anything really needs to
change....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: XBF_DONE semantics
From: Christoph Hellwig @ 2023-11-29 6:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, Dave Chinner
On Wed, Nov 29, 2023 at 08:08:37AM +1100, Dave Chinner wrote:
> > But places that use buf_get and manually fill in data only use it in a
> > few cases.
>
> Yes. the caller of buf_get always needs to set XBF_DONE if it is
> initialising a new buffer ready for it to be written. It should be
> done before the caller drops the buffer lock so that no other lookup
> can see the buffer in the state of "contains valid data but does not
> have XBF_DONE set".
That makes sense, but we do have a whole bunch of weird things going
on as well:
- xfs_buf_ioend_handle_error sets XBF_DONE when retrying or failing
- xfs_buf_ioend sets XBF_DONE on successful write completion as well
- xfs_buf_ioend_fail drops XBF_DONE for any I/O failure
- xfs_do_force_shutdown sets XBF_DONE on the super block buffer on
a forced shutdown
- xfs_trans_get_buf_map sets XBF_DONE on a forced shutdown
So there's definitely a bunch of weird things not fully in line
with the straightforward answer.
* Re: XBF_DONE semantics
From: Dave Chinner @ 2023-11-29 8:21 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs, Dave Chinner
On Wed, Nov 29, 2023 at 07:18:05AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 29, 2023 at 08:08:37AM +1100, Dave Chinner wrote:
> > > But places that use buf_get and manually fill in data only use it in a
> > > few cases.
> >
> > Yes. the caller of buf_get always needs to set XBF_DONE if it is
> > initialising a new buffer ready for it to be written. It should be
> > done before the caller drops the buffer lock so that no other lookup
> > can see the buffer in the state of "contains valid data but does not
> > have XBF_DONE set".
>
> That makes sense, but we do have a whole bunch of weird things going
> on as well:
>
> - xfs_buf_ioend_handle_error sets XBF_DONE when retrying or failing
Write path. It is expected that XBF_DONE is set prior to submitting
the write, so this should be a no-op. It's probably detritus from
the repeated factoring of the buf_ioend code over the years that
we've never looked deeply into.
> - xfs_buf_ioend sets XBF_DONE on successful write completion as well
Same. It's really only needed on read IO, but I'm guessing that
somewhere along the line it was done for all IO and we've never
stopped doing that.
> - xfs_buf_ioend_fail drops XBF_DONE for any I/O failure
That's actually correct.
We're about to toss the buffer because we can't write it back. The
very next function call is xfs_buf_stale(), which invalidates the
buffer and it's contents. It should not have XBF_DONE and XBF_STALE
set at the same time.
It may be worthwhile to move the clearing of XBF_DONE to be inside
xfs_buf_stale(), but I'd have to look at the rest of the code to see
if that's the right thing to do or not.
> - xfs_do_force_shutdown sets XBF_DONE on the super block buffer on
> a forced shutdown
No idea why that exists. I'd have to go code spelunking to find out
what it was needed for.
> - xfs_trans_get_buf_map sets XBF_DONE on a forced shutdown
That code just looks broken. We're in the middle of a transaction
doing a buffer read, we found a match and raced with a shutdown, so
we stale the buffer, mark it done, bump the recursion count and
return it without an error?
A quick spelunk through history indicates stale vs delayed-write vs
XFS_BUF_DONE was an inconsistent mess. We fixed all the stale vs
done nastiness in the xfs_trans_buf_read() path, but missed this
case in the get path.
As it is, I'd say the way shutdown racing is handled here is broken
and could be fixed by returning -EIO instead of the staled buffer.
The buffer is already linked into the transaction, so the
transaction cancel in response to the IO error will handle the
buffer appropriately....
> So there's definitely a bunch of weird things not fully in line
> with the straightforward answer.
No surprises there - this is a 30 year old code base. Nothing looks
particularly problematic, though.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com