[PATCH v2 0/2] btrfs: error handling fixes for extent

public inbox for linux-btrfs@vger.kernel.org

* [PATCH v2 0/2] btrfs: error handling fixes for extent_writepage()
@ 2024-11-27  8:15 Qu Wenruo
  2024-11-27  8:15 ` [PATCH v2 1/2] btrfs: fix double accounting race in extent_writepage() Qu Wenruo
  2024-11-27  8:15 ` [PATCH v2 2/2] btrfs: handle submit_one_sector() error inside extent_writepage_io() Qu Wenruo
  0 siblings, 2 replies; 3+ messages in thread
From: Qu Wenruo @ 2024-11-27  8:15 UTC
  To: linux-btrfs

[CHANGELOG]
v2:
- Update the commit message for the first patch
  It turns out the root cause is the race between
  btrfs_finish_one_ordered() and btrfs_mark_ordered_io_finished()

  And that race is possible regardless of whether the sector size is
  smaller than the page size.

  So update the commit message to reflect that.

- Remove comments update in the patch
  To make backport easier.

- Update the commit message for the second patch
  Mostly to downplay the possible problem, as the extent map is already
  pinned, so grabbing the extent map should not fail.

It's more and more common to hit crashes in my aarch64 testing VM.

The main symptom is ordered extent double accounting, causing various
problems, mostly kernel warnings and crashes (for debug builds).

The direct cause is writepage_delalloc() failing with -ENOSPC, which is
another rabbit hole, but here we need to focus on the error handling.

All the call traces point to the btrfs_mark_ordered_io_finished() call
inside extent_writepage() for error handling.

It turns out that btrfs_mark_ordered_io_finished() inside
extent_writepage() is racing with the same cleanup inside
btrfs_run_delalloc_range().

And if the one inside extent_writepage() is called before the ordered
extent is removed from the ordered tree (the removal is queued in a
workqueue), then we hit the double accounting.

There is also a theoretical failure path from submit_one_sector(), but
I have never hit a case caused by that failure; the fix is only for the
sake of consistency.

Both fixes are similar: they move the btrfs_mark_ordered_io_finished()
calls for error handling into each function, so that we can avoid
touching ranges that are already covered.

Although these fixes are mostly for backports, the proper fix in the end
would be reworking how we handle dirty folio writeback.

The current way is map-map-map, then submit-submit-submit (run delalloc
for every dirty sector of the folio, then submit all dirty sectors).

The planned new fix would be more like iomap, going
map-submit-map-submit-map-submit (run one delalloc, then immediately
submit it).
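
As a rough illustration of why the ordering matters for error handling,
here is a toy model in C. It is not btrfs code: the function names and
the "fail_at" parameter are made up for this sketch, and it assumes that
a mapping failure in the batch style means none of the already mapped
ranges gets submitted. It only counts how many ranges are left mapped
but unsubmitted, which is the state the error path has to clean up.

```c
#include <assert.h>

/*
 * Toy model, not btrfs code. A "range" is mapped (ordered extent created)
 * and later submitted. Mapping of range index @fail_at fails; pass -1 for
 * "no failure". Each function returns the number of ranges that were
 * mapped but never submitted.
 */

/* Batch style: map every range first, submit only after all are mapped. */
static int pending_after_batch(int nr_ranges, int fail_at)
{
	int mapped = 0;

	assert(nr_ranges >= 0);
	for (int i = 0; i < nr_ranges; i++) {
		if (i == fail_at)
			return mapped;	/* nothing mapped so far was submitted */
		mapped++;
	}
	return 0;	/* all ranges mapped, then all submitted */
}

/* Interleaved style: submit each range right after it is mapped. */
static int pending_after_interleaved(int nr_ranges, int fail_at)
{
	int pending = 0;

	assert(nr_ranges >= 0);
	for (int i = 0; i < nr_ranges; i++) {
		if (i == fail_at)
			return pending;
		pending++;	/* mapped ... */
		pending--;	/* ... and immediately submitted */
	}
	return pending;
}
```

In this model a failure on the fifth of eight ranges leaves four ranges
pending in the batch style and none in the interleaved style.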

Qu Wenruo (2):
  btrfs: fix double accounting race in extent_writepage()
  btrfs: handle submit_one_sector() error inside extent_writepage_io()

 fs/btrfs/extent_io.c | 63 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 48 insertions(+), 15 deletions(-)

-- 
2.47.0

* [PATCH v2 1/2] btrfs: fix double accounting race in extent_writepage()
  2024-11-27  8:15 [PATCH v2 0/2] btrfs: error handling fixes for extent_writepage() Qu Wenruo
@ 2024-11-27  8:15 ` Qu Wenruo
  2024-11-27  8:15 ` [PATCH v2 2/2] btrfs: handle submit_one_sector() error inside extent_writepage_io() Qu Wenruo
  1 sibling, 0 replies; 3+ messages in thread
From: Qu Wenruo @ 2024-11-27  8:15 UTC
  To: linux-btrfs; +Cc: stable

[BUG]
There are several double accounting cases, where the WARN_ON_ONCE() is
triggered inside can_finish_ordered_extent().

And all such cases point back to the btrfs_mark_ordered_io_finished()
call inside extent_writepage() when it hits some error.

[CAUSE]
With extra debug patches to show where the error is from, it turns out
that btrfs_run_delalloc_range() can fail with -ENOSPC.

Such failure itself is already a symptom of some bad data/metadata space
reservation, but here we need to focus on the error handling part.

For example, we have the following dirty page layout (4K sector size and
4K page size):

    0                       16K                     32K
    |/////|/////|/////|/////|/////|/////|/////|/////|

Where the range [0, 32K) is dirty and we need to write all the 8 pages
back.

When handling the first page 0, we go through the following sequence:

- btrfs_run_delalloc_range() for range [0, 32K)
  We enter cow_file_range() for [0, 32K)

- btrfs_reserve_extent() only returned a 16K data extent.
  This can be caused by fragmentation, and it's already an indication
  we're almost running out of space.

  Now we have the following layout:

    0                       16K                     32K
    |<----- Reserved ------>|/////|/////|/////|/////|

  The range [0, 16K) has an ordered extent allocated.

- btrfs_reserve_extent() returned -ENOSPC
  We have really run out of space. But since we have reserved space
  for range [0, 16K), we need to clean it up.

  But that cleanup for the ordered extent only happens inside
  btrfs_run_delalloc_range().

- btrfs_run_delalloc_range() cleans up the reserved ordered extent
  By calling btrfs_mark_ordered_io_finished() for range [0, 32K).

  It will locate the ordered extent [0, 16K) and mark it as IOERR.
  Also since the ordered extent is only 16K, we're finishing the whole
  ordered extent.

  Thus we call btrfs_queue_ordered_fn() to queue the finishing of the
  ordered extent.
  But the ordered extent [0, 16K) is still in
  btrfs_inode::ordered_tree.

- extent_writepage() cleans up the ordered extent inside the folio
  We call btrfs_mark_ordered_io_finished() for range [0, 4K).

  Since the finished ordered extent [0, 16K) is not yet removed (racy,
  depends on when btrfs_finish_one_ordered() is called), if
  btrfs_mark_ordered_io_finished() is called before
  btrfs_finish_one_ordered(), we will double account and trigger the
  warning inside can_finish_ordered_extent().

So the root cause is, we're relying on btrfs_mark_ordered_io_finished()
to handle ranges which are already cleaned up.
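
The sequence above can be replayed with a toy model. This is only a
sketch: toy_ordered_extent, toy_mark_finished() and
toy_finish_one_ordered() are hypothetical names invented here, not the
kernel structures or functions, and the "len > bytes_left" check only
loosely mirrors the kind of accounting check that triggers the warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for an ordered extent; not the kernel structure. */
struct toy_ordered_extent {
	uint64_t start;
	uint64_t len;
	uint64_t bytes_left;	/* bytes not yet accounted as finished */
	bool in_tree;		/* still findable by range lookups */
};

/*
 * Account [start, start + len) as finished against @oe.
 * Returns false if the request exceeds what is left to account,
 * which is the double accounting case.
 */
static bool toy_mark_finished(struct toy_ordered_extent *oe,
			      uint64_t start, uint64_t len)
{
	uint64_t end = start + len;
	uint64_t oe_end = oe->start + oe->len;

	assert(len > 0);
	if (!oe->in_tree || end <= oe->start || start >= oe_end)
		return true;	/* nothing to account in this extent */

	/* Clamp the request to the part overlapping the ordered extent. */
	if (start < oe->start)
		start = oe->start;
	if (end > oe_end)
		end = oe_end;

	if (end - start > oe->bytes_left)
		return false;	/* range was already accounted once */

	oe->bytes_left -= end - start;
	return true;
}

/* Models the queued work that finally removes the extent from the tree. */
static void toy_finish_one_ordered(struct toy_ordered_extent *oe)
{
	oe->in_tree = false;
}
```

With a 16K extent, the first cleanup for [0, 32K) drops bytes_left to
zero. A second cleanup for [0, 4K) fails in the model if it runs before
toy_finish_one_ordered(), and is a harmless no-op if it runs after.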

Unfortunately the bug dates back to the early days when
btrfs_mark_ordered_io_finished() was introduced as a no-brainer choice
for error paths, but such a no-brainer solution just hides all the races
and makes us less cautious when handling errors.

[FIX]
Instead of relying on the btrfs_mark_ordered_io_finished() call to
clean up the whole folio range, record the end of the last successfully
run delalloc range.

Then combine it with bio_ctrl->submit_bitmap to properly clean up any
newly created ordered extents.

Since we have cleaned up the ordered extents in range, we should not
rely on the btrfs_mark_ordered_io_finished() inside extent_writepage()
anymore.

This way, we ensure btrfs_mark_ordered_io_finished() is only called once
when writepage_delalloc() fails.
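
To make the range arithmetic concrete, here is a small userspace sketch
of how the set of sectors to clean up can be derived from the recorded
end and the submit bitmap. toy_cleanup_mask() is a hypothetical helper
written for this illustration; it follows the same "min(sectors covered
so far, sectors per page)" idea as the patch but is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the bits of @submit_bitmap that fall inside
 * [page_start, last_finished), limited to one page worth of sectors.
 * Each set bit is one sector that would need manual cleanup.
 */
static unsigned long toy_cleanup_mask(uint64_t page_start,
				      uint64_t last_finished,
				      unsigned int sectorsize_bits,
				      unsigned int sectors_per_page,
				      unsigned long submit_bitmap)
{
	unsigned long mask = 0;
	uint64_t nr_bits;

	assert(last_finished >= page_start);
	assert(sectors_per_page <= sizeof(unsigned long) * 8);

	/* Number of sectors covered by successfully run delalloc ranges. */
	nr_bits = (last_finished - page_start) >> sectorsize_bits;
	if (nr_bits > sectors_per_page)
		nr_bits = sectors_per_page;

	for (unsigned int bit = 0; bit < nr_bits; bit++) {
		if (submit_bitmap & (1UL << bit))
			mask |= 1UL << bit;
	}
	return mask;
}
```

For a 64K page with 4K sectors, all 16 sectors marked for submission and
the last successful delalloc range ending at 32K, only the first 8
sectors are selected. Nothing is selected when no range succeeded.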

Cc: stable@vger.kernel.org # 5.15+
Fixes: e65f152e4348 ("btrfs: refactor how we finish ordered extent io for endio functions")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 438974d4def4..d619c4e148be 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1167,6 +1167,12 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 	 * last delalloc end.
 	 */
 	u64 last_delalloc_end = 0;
+	/*
+	 * Save the last successfully ran delalloc range end (exclusive).
+	 * This is for error handling to avoid ranges with ordered extent created
+	 * but no IO will be submitted due to error.
+	 */
+	u64 last_finished = page_start;
 	u64 delalloc_start = page_start;
 	u64 delalloc_end = page_end;
 	u64 delalloc_to_write = 0;
@@ -1235,11 +1241,19 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 			found_len = last_delalloc_end + 1 - found_start;

 		if (ret >= 0) {
+			/*
+			 * Some delalloc range may be created by previous folios.
+			 * Thus we still need to clean those range up during error
+			 * handling.
+			 */
+			last_finished = found_start;
 			/* No errors hit so far, run the current delalloc range. */
 			ret = btrfs_run_delalloc_range(inode, folio,
 						       found_start,
 						       found_start + found_len - 1,
 						       wbc);
+			if (ret >= 0)
+				last_finished = found_start + found_len;
 		} else {
 			/*
 			 * We've hit an error during previous delalloc range,
@@ -1274,8 +1288,21 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,

 		delalloc_start = found_start + found_len;
 	}
-	if (ret < 0)
+	/*
+	 * It's possible we have some ordered extents created before we hit
+	 * an error, cleanup non-async successfully created delalloc ranges.
+	 */
+	if (unlikely(ret < 0)) {
+		unsigned int bitmap_size = min(
+			(last_finished - page_start) >> fs_info->sectorsize_bits,
+			fs_info->sectors_per_page);
+
+		for_each_set_bit(bit, &bio_ctrl->submit_bitmap, bitmap_size)
+			btrfs_mark_ordered_io_finished(inode, folio,
+				page_start + (bit << fs_info->sectorsize_bits),
+				fs_info->sectorsize, false);
 		return ret;
+	}
 out:
 	if (last_delalloc_end)
 		delalloc_end = last_delalloc_end;
@@ -1509,13 +1536,13 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl

 	bio_ctrl->wbc->nr_to_write--;

-done:
-	if (ret) {
+	if (ret)
 		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
 					       page_start, PAGE_SIZE, !ret);
-		mapping_set_error(folio->mapping, ret);
-	}

+done:
+	if (ret < 0)
+		mapping_set_error(folio->mapping, ret);
 	/*
 	 * Only unlock ranges that are submitted. As there can be some async
 	 * submitted ranges inside the folio.
-- 
2.47.0

* [PATCH v2 2/2] btrfs: handle submit_one_sector() error inside extent_writepage_io()
  2024-11-27  8:15 [PATCH v2 0/2] btrfs: error handling fixes for extent_writepage() Qu Wenruo
  2024-11-27  8:15 ` [PATCH v2 1/2] btrfs: fix double accounting race in extent_writepage() Qu Wenruo
@ 2024-11-27  8:15 ` Qu Wenruo
  1 sibling, 0 replies; 3+ messages in thread
From: Qu Wenruo @ 2024-11-27  8:15 UTC
  To: linux-btrfs; +Cc: stable

[BUG]
If submit_one_sector() failed inside extent_writepage_io() for sector
size < page size cases (e.g. 4K sector size and 64K page size), then
we can hit a double ordered extent accounting error.

This should be very rare, as submit_one_sector() only fails when we
fail to grab the extent map, and such an extent map should exist in
memory and have been pinned.

[CAUSE]
For example, we have the following folio layout:

    0  4K          32K    48K   60K 64K
    |//|           |//////|     |///|

Where |///| is the dirty range we need to write back. The 3 different
dirty ranges are submitted for regular COW.

Now we hit the following sequence:

- submit_one_sector() returned 0 for [0, 4K)

- submit_one_sector() returned 0 for [32K, 48K)

- submit_one_sector() returned error for [60K, 64K)

- btrfs_mark_ordered_io_finished() called for the whole folio
  This will mark the following ranges as finished:
  * [0, 4K)
  * [32K, 48K)
    Both ranges have their IO already submitted, this cleanup will
    lead to double accounting.

  * [60K, 64K)
    That's the correct cleanup.

The only good news is, this error is only theoretical, as the target
extent map is always pinned, thus we should directly grab it from
memory, rather than reading it from disk.

[FIX]
Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
range, which can touch ranges we should not touch, move the error
handling inside extent_writepage_io().

So that we can clean up exactly the sectors that ought to be submitted
but failed.

This provides a much more accurate cleanup, avoiding the double
accounting.
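
The per-sector cleanup can be sketched with a toy loop over one folio.
toy_writepage_io() and its bitmap parameters are hypothetical and exist
only for this illustration; a set bit in fail_bitmap stands in for
submit_one_sector() returning an error for that sector.

```c
#include <assert.h>

#define TOY_SECTORS_PER_FOLIO 16	/* 64K folio with 4K sectors */

/*
 * Walk the dirty sectors of one folio. Sectors that submit fine are
 * recorded in @submitted; their ordered extent accounting is left to
 * the IO completion. Sectors whose submission fails are returned in the
 * result, meaning they had to be finished manually, one sector each.
 */
static unsigned long toy_writepage_io(unsigned long dirty_bitmap,
				      unsigned long fail_bitmap,
				      unsigned long *submitted)
{
	unsigned long manually_finished = 0;

	assert(submitted);
	*submitted = 0;

	for (unsigned int bit = 0; bit < TOY_SECTORS_PER_FOLIO; bit++) {
		unsigned long sector = 1UL << bit;

		if (!(dirty_bitmap & sector))
			continue;
		if (fail_bitmap & sector) {
			/* No bio will finish this sector, do it here. */
			manually_finished |= sector;
			continue;
		}
		*submitted |= sector;
	}
	return manually_finished;
}
```

For the layout above, the dirty bitmap is 0x8F01 (sector 0, sectors 8
to 11, and sector 15). If only sector 15 fails, it is the only sector
finished manually, while the already submitted sectors are not touched
again.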

Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d619c4e148be..b74298c2c24f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	unsigned long range_bitmap = 0;
 	bool submitted_io = false;
+	bool error = false;
 	const u64 folio_start = folio_pos(folio);
 	u64 cur;
 	int bit;
@@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			break;
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
-		if (ret < 0)
-			goto out;
+		if (unlikely(ret < 0)) {
+			submit_one_bio(bio_ctrl);
+			/*
+			 * Failed to grab the extent map which should be very rare.
+			 * Since there is no bio submitted to finish the ordered
+			 * extent, we have to manually finish this sector.
+			 */
+			btrfs_mark_ordered_io_finished(inode, folio, cur,
+					fs_info->sectorsize, false);
+			error = true;
+			continue;
+		}
 		submitted_io = true;
 	}
-out:
+
 	/*
 	 * If we didn't submitted any sector (>= i_size), folio dirty get
 	 * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
@@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	 *
 	 * Here we set writeback and clear for the range. If the full folio
 	 * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
+	 *
+	 * If we hit any error, the corresponding sector will still be dirty
+	 * thus no need to clear PAGECACHE_TAG_DIRTY.
 	 */
-	if (!submitted_io) {
+	if (!submitted_io && !error) {
 		btrfs_folio_set_writeback(fs_info, folio, start, len);
 		btrfs_folio_clear_writeback(fs_info, folio, start, len);
 	}
@@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 {
 	struct inode *inode = folio->mapping->host;
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
-	const u64 page_start = folio_pos(folio);
 	int ret;
 	size_t pg_offset;
 	loff_t i_size = i_size_read(inode);
@@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 
 	bio_ctrl->wbc->nr_to_write--;
 
-	if (ret)
-		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
-					       page_start, PAGE_SIZE, !ret);
-
 done:
 	if (ret < 0)
 		mapping_set_error(folio->mapping, ret);
@@ -2320,11 +2329,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
 		if (ret == 1)
 			goto next_page;
 
-		if (ret) {
-			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
-						       cur, cur_len, !ret);
+		if (ret)
 			mapping_set_error(mapping, ret);
-		}
 		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
 		if (ret < 0)
 			found_error = true;
-- 
2.47.0

