From: cem@kernel.org
To: linux-xfs@vger.kernel.org
Cc: djwong@kernel.org, hch@lst.de
Subject: [PATCH 29/67] xfs: force all buffers to be written during btree bulk load
Date: Mon, 22 Apr 2024 18:25:51 +0200 [thread overview]
Message-ID: <20240422163832.858420-31-cem@kernel.org> (raw)
In-Reply-To: <20240422163832.858420-2-cem@kernel.org>
From: "Darrick J. Wong" <djwong@kernel.org>
Source kernel commit: 13ae04d8d45227c2ba51e188daf9fc13d08a1b12
While stress-testing online repair of btrees, I noticed periodic
assertion failures from the buffer cache about buffers with incorrect
DELWRI_Q state. Looking further, I observed this race between the AIL
trying to write out a btree block and repair zapping a btree block after
the fact:
AIL: Repair0:
pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list
stale buf X:
clear DELWRI_Q
does not clear b_list
free space X
delwri_submit # oops
Worse yet, I discovered that running the same repair over and over in a
tight loop can result in a second race that cause data integrity
problems with the repair:
AIL: Repair0: Repair1:
pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list
stale buf X:
clear DELWRI_Q
does not clear b_list
free space X
find free space X
get buffer
rewrite buffer
delwri_queue:
set DELWRI_Q
already on a list, do not add
BAD: committed tree root before all blocks written
delwri_submit # too late now
I traced this to my own misunderstanding of how the delwri lists work,
particularly with regards to the AIL's buffer list. If a buffer is
logged and committed, the buffer can end up on that AIL buffer list. If
btree repairs are run twice in rapid succession, it's possible that the
first repair will invalidate the buffer and free it before the next time
the AIL wakes up. Marking the buffer stale clears DELWRI_Q from the
buffer state without removing the buffer from its delwri list. The
buffer doesn't know which list it's on, so it cannot know which lock to
take to protect the list for a removal.
If the second repair allocates the same block, it will then recycle the
buffer to start writing the new btree block. Meanwhile, if the AIL
wakes up and walks the buffer list, it will ignore the buffer because it
can't lock it, and go back to sleep.
When the second repair calls delwri_queue to put the buffer on the
list of buffers to write before committing the new btree, it will set
DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
buffer list, it won't add it to the bulkload buffer's list.
This is incorrect, because the bulkload caller relies on delwri_submit
to ensure that all the buffers have been sent to disk /before/
required for data consistency.
Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
drop it, so the next thread to walk through the btree will trip over a
debug assertion on that flag.
To fix this, create a new function that waits for the buffer to be
removed from any other delwri lists before adding the buffer to the
caller's delwri list. By waiting for the buffer to clear both the
delwri list and any potential delwri wait list, we can be sure that
repair will initiate writes of all buffers and report all write errors
back to userspace instead of committing the new structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
---
libxfs/libxfs_io.h | 11 +++++++++++
libxfs/xfs_btree_staging.c | 4 +---
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index 267ea9796..259c6a7cf 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -244,6 +244,17 @@ xfs_buf_delwri_queue(struct xfs_buf *bp, struct list_head *buffer_list)
return true;
}
+static inline void
+xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *buffer_list)
+{
+ ASSERT(list_empty(&bp->b_list));
+
+ /* This buffer is uptodate; don't let it get reread. */
+ libxfs_buf_mark_dirty(bp);
+
+ xfs_buf_delwri_queue(bp, buffer_list);
+}
+
int xfs_buf_delwri_submit(struct list_head *buffer_list);
void xfs_buf_delwri_cancel(struct list_head *list);
diff --git a/libxfs/xfs_btree_staging.c b/libxfs/xfs_btree_staging.c
index a6a907916..baf7f4226 100644
--- a/libxfs/xfs_btree_staging.c
+++ b/libxfs/xfs_btree_staging.c
@@ -342,9 +342,7 @@ xfs_btree_bload_drop_buf(
if (*bpp == NULL)
return;
- if (!xfs_buf_delwri_queue(*bpp, buffers_list))
- ASSERT(0);
-
+ xfs_buf_delwri_queue_here(*bpp, buffers_list);
xfs_buf_relse(*bpp);
*bpp = NULL;
}
--
2.44.0
next prev parent reply other threads:[~2024-04-22 16:39 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-04-22 16:25 [PATCH 00/67] libxfs: Sync to Linux 6.8 cem
2024-04-22 16:25 ` [PATCH 01/67] xfs: use xfs_defer_pending objects to recover intent items cem
2024-04-22 16:25 ` [PATCH 02/67] xfs: recreate work items when recovering " cem
2024-04-22 16:25 ` [PATCH 03/67] xfs: use xfs_defer_finish_one to finish recovered work items cem
2024-04-22 16:25 ` [PATCH 04/67] xfs: move ->iop_recover to xfs_defer_op_type cem
2024-04-22 16:25 ` [PATCH 05/67] xfs: hoist intent done flag setting to ->finish_item callsite cem
2024-04-22 16:25 ` [PATCH 06/67] xfs: hoist ->create_intent boilerplate to its callsite cem
2024-04-22 16:25 ` [PATCH 07/67] xfs: use xfs_defer_create_done for the relogging operation cem
2024-04-22 16:25 ` [PATCH 08/67] xfs: clean out XFS_LI_DIRTY setting boilerplate from ->iop_relog cem
2024-04-22 16:25 ` [PATCH 09/67] xfs: hoist xfs_trans_add_item calls to defer ops functions cem
2024-04-22 16:25 ` [PATCH 10/67] xfs: move ->iop_relog to struct xfs_defer_op_type cem
2024-04-22 16:25 ` [PATCH 11/67] xfs: make rextslog computation consistent with mkfs cem
2024-04-22 16:25 ` [PATCH 12/67] xfs: fix 32-bit truncation in xfs_compute_rextslog cem
2024-04-22 16:25 ` [PATCH 13/67] xfs: don't allow overly small or large realtime volumes cem
2024-04-22 16:25 ` [PATCH 14/67] xfs: elide ->create_done calls for unlogged deferred work cem
2024-04-22 16:25 ` [PATCH 15/67] xfs: don't append work items to logged xfs_defer_pending objects cem
2024-04-22 16:25 ` [PATCH 16/67] xfs: allow pausing of pending deferred work items cem
2024-04-22 16:25 ` [PATCH 17/67] xfs: remove __xfs_free_extent_later cem
2024-04-22 16:25 ` [PATCH 18/67] xfs: automatic freeing of freshly allocated unwritten space cem
2024-04-22 16:25 ` [PATCH 19/67] xfs: remove unused fields from struct xbtree_ifakeroot cem
2024-04-22 16:25 ` [PATCH 20/67] xfs: force small EFIs for reaping btree extents cem
2024-04-22 16:25 ` [PATCH 21/67] xfs: ensure logflagsp is initialized in xfs_bmap_del_extent_real cem
2024-04-22 16:25 ` [PATCH 22/67] xfs: update dir3 leaf block metadata after swap cem
2024-04-22 16:25 ` [PATCH 23/67] xfs: extract xfs_da_buf_copy() helper function cem
2024-04-22 16:25 ` [PATCH 24/67] xfs: move xfs_ondisk.h to libxfs/ cem
2024-04-22 16:25 ` [PATCH 25/67] xfs: consolidate the xfs_attr_defer_* helpers cem
2024-04-22 16:25 ` [PATCH 26/67] xfs: store an ops pointer in struct xfs_defer_pending cem
2024-04-22 16:25 ` [PATCH 27/67] xfs: pass the defer ops instead of type to xfs_defer_start_recovery cem
2024-04-22 16:25 ` [PATCH 28/67] xfs: pass the defer ops directly to xfs_defer_add cem
2024-04-22 16:25 ` cem [this message]
2024-04-22 16:25 ` [PATCH 30/67] xfs: set XBF_DONE on newly formatted btree block that are ready for writing cem
2024-04-22 16:25 ` [PATCH 31/67] xfs: read leaf blocks when computing keys for bulkloading into node blocks cem
2024-04-22 16:25 ` [PATCH 32/67] xfs: move btree bulkload record initialization to ->get_record implementations cem
2024-04-22 16:25 ` [PATCH 33/67] xfs: constrain dirty buffers while formatting a staged btree cem
2024-04-22 16:25 ` [PATCH 34/67] xfs: repair free space btrees cem
2024-04-22 16:25 ` [PATCH 35/67] xfs: repair inode btrees cem
2024-04-22 16:25 ` [PATCH 36/67] xfs: repair refcount btrees cem
2024-04-22 16:25 ` [PATCH 37/67] xfs: dont cast to char * for XFS_DFORK_*PTR macros cem
2024-04-22 16:26 ` [PATCH 38/67] xfs: set inode sick state flags when we zap either ondisk fork cem
2024-04-22 16:26 ` [PATCH 39/67] xfs: zap broken inode forks cem
2024-04-22 16:26 ` [PATCH 40/67] xfs: repair inode fork block mapping data structures cem
2024-04-22 16:26 ` [PATCH 41/67] xfs: create a ranged query function for refcount btrees cem
2024-04-22 16:26 ` [PATCH 42/67] xfs: create a new inode fork block unmap helper cem
2024-04-22 16:26 ` [PATCH 43/67] xfs: improve dquot iteration for scrub cem
2024-04-22 16:26 ` [PATCH 44/67] xfs: add lock protection when remove perag from radix tree cem
2024-04-22 16:26 ` [PATCH 45/67] xfs: fix perag leak when growfs fails cem
2024-04-22 16:26 ` [PATCH 46/67] xfs: remove the xfs_alloc_arg argument to xfs_bmap_btalloc_accounting cem
2024-04-22 16:26 ` [PATCH 47/67] xfs: also use xfs_bmap_btalloc_accounting for RT allocations cem
2024-04-22 16:26 ` [PATCH 48/67] xfs: return -ENOSPC from xfs_rtallocate_* cem
2024-04-22 16:26 ` [PATCH 49/67] xfs: indicate if xfs_bmap_adjacent changed ap->blkno cem
2024-04-22 16:26 ` [PATCH 50/67] xfs: move xfs_rtget_summary to xfs_rtbitmap.c cem
2024-04-22 16:26 ` [PATCH 51/67] xfs: split xfs_rtmodify_summary_int cem
2024-04-22 16:26 ` [PATCH 52/67] xfs: remove rt-wrappers from xfs_format.h cem
2024-04-22 16:26 ` [PATCH 53/67] xfs: remove XFS_RTMIN/XFS_RTMAX cem
2024-04-22 16:26 ` [PATCH 54/67] xfs: make if_data a void pointer cem
2024-04-22 16:26 ` [PATCH 55/67] xfs: return if_data from xfs_idata_realloc cem
2024-04-22 16:26 ` [PATCH 56/67] xfs: move the xfs_attr_sf_lookup tracepoint cem
2024-04-22 16:26 ` [PATCH 57/67] xfs: simplify xfs_attr_sf_findname cem
2024-04-22 16:26 ` [PATCH 58/67] xfs: remove xfs_attr_shortform_lookup cem
2024-04-22 16:26 ` [PATCH 59/67] xfs: use xfs_attr_sf_findname in xfs_attr_shortform_getvalue cem
2024-04-22 16:26 ` [PATCH 60/67] xfs: remove struct xfs_attr_shortform cem
2024-04-22 16:26 ` [PATCH 61/67] xfs: remove xfs_attr_sf_hdr_t cem
2024-04-22 16:26 ` [PATCH 62/67] xfs: turn the XFS_DA_OP_REPLACE checks in xfs_attr_shortform_addname into asserts cem
2024-04-22 16:26 ` [PATCH 63/67] xfs: fix a use after free in xfs_defer_finish_recovery cem
2024-04-22 16:26 ` [PATCH 64/67] xfs: use the op name in trace_xlog_intent_recovery_failed cem
2024-04-22 16:26 ` [PATCH 65/67] xfs: fix backwards logic in xfs_bmap_alloc_account cem
2024-04-22 16:26 ` [PATCH 66/67] xfs: reset XFS_ATTR_INCOMPLETE filter on node removal cem
2024-04-22 16:26 ` [PATCH 67/67] xfs: remove conditional building of rt geometry validator functions cem
-- strict thread matches above, loose matches on Subject: below --
2024-04-17 21:16 [PATCHSET 04/11] libxfs: sync with 6.8 Darrick J. Wong
2024-04-17 21:29 ` [PATCH 29/67] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
2024-03-26 2:55 [PATCHSET 02/18] libxfs: sync with 6.8 Darrick J. Wong
2024-03-26 3:10 ` [PATCH 29/67] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
2024-03-13 1:47 [PATCHSET 02/10] libxfs: sync with 6.8 Darrick J. Wong
2024-03-13 2:00 ` [PATCH 29/67] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240422163832.858420-31-cem@kernel.org \
--to=cem@kernel.org \
--cc=djwong@kernel.org \
--cc=hch@lst.de \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox