public inbox for linux-btrfs@vger.kernel.org
* [PATCH v2 0/9] btrfs: error handling fixes
@ 2024-12-12  6:13 Qu Wenruo
  2024-12-12  6:13 ` [PATCH v2 1/9] btrfs: fix double accounting race when btrfs_run_delalloc_range() failed Qu Wenruo
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Qu Wenruo @ 2024-12-12  6:13 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
v2:
- Fix the btrfs_cleanup_ordered_extents() call inside
  btrfs_run_delalloc_range()

  Since we no longer call btrfs_mark_ordered_io_finished() if
  btrfs_run_delalloc_range() failed, the existing
  btrfs_cleanup_ordered_extents() call with @locked_folio will cause the
  subpage range not to be properly cleaned up.

  This can lead to hanging ordered extents for subpage cases.

- Update the commit message of the first patch
  With a more detailed analysis of how the double accounting happens.
  It's pretty complex and very lengthy, but is easier to understand (at
  least I hope so).

  The root cause is the btrfs_cleanup_ordered_extents()'s range split
  behavior, which is not subpage compatible and is cursed in the first
  place.

  So the fix is still the same, by removing the split OE handling
  completely.

- A new patch to cleanup the @locked_folio parameter of
  btrfs_cleanup_ordered_extents()

I believe there is a regression in the last 2 or 3 releases where the
metadata/data space reservation code is no longer working properly,
causing us to hit -ENOSPC during btrfs_run_delalloc_range().

One of the most common situations to hit this problem is during
generic/750, along with other long-running generic tests.

Although I should start bisecting the space reservation bug, I can not
help but fix the exposed bugs first.

This exposed quite a few long-existing bugs, all in the error handling
paths, that can lead to the following crashes:

- Double ordered extent accounting
  Triggers WARN_ON_ONCE() inside can_finish_ordered_extent() then crash.

  This bug is fixed by the first 3 patches.
  The first patch is the most important one, since it's pretty easy to
  trigger in the real world, and has existed for a very long time.

  The second patch is just a precautionary fix, not easy to hit in the
  real world.

  The third one is also possible in the real world, but only with the
  recently enabled subpage compression write support.

- Subpage ASSERT() triggered, where the subpage folio bitmap differs
  from the folio status
  This happens most likely in submit_uncompressed_range(), where it
  unlocks the folio without updating the subpage bitmaps.

  This bug is fixed by the 3rd patch.

- WARN_ON() if out-of-tree patch "btrfs: reject out-of-band dirty folios
  during writeback" applied
  This is a more complex case, where error handling leaves some folios
  dirty, but with EXTENT_DELALLOC flag cleared from extent io tree.

  Such dirty folios can still be written back later, but since there is
  no EXTENT_DELALLOC flag, they will be treated as out-of-band dirty
  folios and trigger the COW fixup.

  This bug is fixed by the 4th and 5th patches.

With so many existing bugs exposed, there is more than enough motivation
to make btrfs_run_delalloc_range() (and the involved delalloc functions)
output extra error messages so that at least we know something went
wrong.

And those error messages have already helped a lot during my
development.

Patches 6~8 are here to enhance the error messages.

And the final one is to cleanup the unnecessary @locked_folio parameter
of btrfs_cleanup_ordered_extents().

With all these patches applied, fstests can at least finish reliably;
without them it crashes so frequently in generic tests that I was unable
to finish even one full run since the space reservation regression.


Qu Wenruo (9):
  btrfs: fix double accounting race when btrfs_run_delalloc_range()
    failed
  btrfs: fix double accounting race when extent_writepage_io() failed
  btrfs: fix the error handling of submit_uncompressed_range()
  btrfs: do proper folio cleanup when cow_file_range() failed
  btrfs: do proper folio cleanup when run_delalloc_nocow() failed
  btrfs: subpage: fix the bitmap dump for the locked flags
  btrfs: subpage: dump the involved bitmap when ASSERT() failed
  btrfs: add extra error messages for delalloc range related errors
  btrfs: remove the unused @locked_folio parameter from
    btrfs_cleanup_ordered_extents()

 fs/btrfs/extent_io.c |  79 ++++++++++++---
 fs/btrfs/inode.c     | 230 +++++++++++++++++++++++++------------------
 fs/btrfs/subpage.c   |  48 ++++++---
 3 files changed, 235 insertions(+), 122 deletions(-)

-- 
2.47.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 1/9] btrfs: fix double accounting race when btrfs_run_delalloc_range() failed
  2024-12-12  6:13 [PATCH v2 0/9] btrfs: error handling fixes Qu Wenruo
@ 2024-12-12  6:13 ` Qu Wenruo
  2025-01-08 21:52   ` Boris Burkov
  2024-12-12  6:13 ` [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed Qu Wenruo
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-12  6:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

[BUG]
When running btrfs with block size (4K) smaller than page size (64K,
aarch64), there is a very high chance to crash the kernel at
generic/750, with the following messages:
(before the call traces, there are 3 extra debug messages added)

 BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental
 BTRFS info (device dm-3): checking UUID tree
 hrtimer: interrupt took 5451385 ns
 BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28
 BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28
 BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs]
 CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G           OE      6.13.0-rc1-custom+ #89
 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
 pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
 lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
 Call trace:
  can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
  can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
  btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
  extent_writepage+0x10c/0x3b8 [btrfs]
  extent_write_cache_pages+0x21c/0x4e8 [btrfs]
  btrfs_writepages+0x94/0x160 [btrfs]
  do_writepages+0x74/0x190
  filemap_fdatawrite_wbc+0x74/0xa0
  start_delalloc_inodes+0x17c/0x3b0 [btrfs]
  btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
  shrink_delalloc+0x11c/0x280 [btrfs]
  flush_space+0x288/0x328 [btrfs]
  btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
  process_one_work+0x228/0x680
  worker_thread+0x1bc/0x360
  kthread+0x100/0x118
  ret_from_fork+0x10/0x20
 ---[ end trace 0000000000000000 ]---
 BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
 BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
 BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
 CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G        W  OE      6.13.0-rc1-custom+ #89
 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
 Workqueue:  btrfs_work_helper [btrfs] (btrfs-endio-write)
 pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : process_one_work+0x110/0x680
 lr : worker_thread+0x1bc/0x360
 Call trace:
  process_one_work+0x110/0x680 (P)
  worker_thread+0x1bc/0x360 (L)
  worker_thread+0x1bc/0x360
  kthread+0x100/0x118
  ret_from_fork+0x10/0x20
 Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
 ---[ end trace 0000000000000000 ]---
 Kernel panic - not syncing: Oops: Fatal exception
 SMP: stopping secondary CPUs
 SMP: failed to stop secondary CPUs 2-3
 Dumping ftrace buffer:
    (ftrace buffer empty)
 Kernel Offset: 0x275bb9540000 from 0xffff800080000000
 PHYS_OFFSET: 0xffff8fbba0000000
 CPU features: 0x100,00000070,00801250,8201720b

[CAUSE]
The above warning is triggered immediately after the delalloc range
failure; this happens in the following sequence:

- Range [1568K, 1636K) is dirty

   1536K  1568K     1600K    1636K  1664K
   |      |/////////|////////|      |

  Where 1536K, 1600K and 1664K are page boundaries (64K page size)

- Enter extent_writepage() for page 1536K

- Enter run_delalloc_nocow() with locked page 1536K and range
  [1568K, 1636K)
  This is because the inode has preallocated extents.

- Enter cow_file_range() with locked page 1536K and range
  [1568K, 1636K)

- btrfs_reserve_extent() only reserved two extents
  The main loop of cow_file_range() thus only reserved two data extents.

  Now we have:

   1536K  1568K        1600K    1636K  1664K
   |      |<-->|<--->|/|///////|      |
               1584K  1596K
  Range [1568K, 1596K) has ordered extent reserved.

- btrfs_reserve_extent() failed inside cow_file_range() for file offset
  1596K
  This is already a bug in our space reservation code, but for now let's
  focus on the error handling path.

  Now cow_file_range() returned -ENOSPC.

- btrfs_run_delalloc_range() does error cleanup <<< ROOT CAUSE
  Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range
  [1568K, 1636K)

  Function btrfs_cleanup_ordered_extents() normally needs to skip the
  ranges inside the folio, as they will normally be cleaned up by
  extent_writepage().

  Such split error handling is already problematic in the first place.

  What's worse is the folio range skipping itself, which does not take
  subpage cases into consideration at all; it only skips the range
  if the page start >= the range start.
  In our case, the page start < the range start, since for subpage cases
  we can have delalloc ranges inside the folio but not covering the
  folio.

  So it doesn't skip the page range at all.
  This means all the ordered extents, both [1568K, 1584K) and
  [1584K, 1596K) will be marked as IOERR.

  Since those two ordered extents have no more pending IOs, they are
  marked finished and *QUEUED* to be removed from the io tree.

- extent_writepage() does error cleanup
  Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).

  Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, their
  removal from the io tree is async, so it may or may not have happened
  at this point.

  If the ranges are not yet removed, we will do double cleanup on those
  ranges, triggering the above ordered extent warnings.

In theory there are other bugs; e.g. the cleanup in extent_writepage()
can cause double accounting on ranges that are submitted asynchronously
(compression for example).

But that's much harder to trigger because normally we do not mix regular
and compression delalloc ranges.

[FIX]
The folio range split is already buggy and not subpage compatible; it
was introduced a long time ago, when subpage support was not even
considered.

So instead of splitting the ordered extents cleanup into the folio range
and out of folio range, do all the cleanup inside writepage_delalloc().

- Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
  btrfs_run_delalloc_range()

- Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc()
  failed

  So all ordered extents are only cleaned up by
  btrfs_run_delalloc_range().

- Handle the ranges that already have ordered extents allocated
  If part of the folio already has ordered extent allocated, and
  btrfs_run_delalloc_range() failed, we also need to cleanup that range.

Now we have concentrated error handling for ordered extents during
btrfs_run_delalloc_range().

Cc: stable@vger.kernel.org # 5.15+
Fixes: d1051d6ebf8e ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 37 ++++++++++++++++++++++++++++++++-----
 fs/btrfs/inode.c     |  2 +-
 2 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9725ff7f274d..417c710c55ca 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1167,6 +1167,12 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 	 * last delalloc end.
 	 */
 	u64 last_delalloc_end = 0;
+	/*
+	 * Save the last successfully ran delalloc range end (exclusive).
+	 * This is for error handling to avoid ranges with ordered extent created
+	 * but no IO will be submitted due to error.
+	 */
+	u64 last_finished = page_start;
 	u64 delalloc_start = page_start;
 	u64 delalloc_end = page_end;
 	u64 delalloc_to_write = 0;
@@ -1235,11 +1241,19 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 			found_len = last_delalloc_end + 1 - found_start;
 
 		if (ret >= 0) {
+			/*
+			 * Some delalloc range may be created by previous folios.
+			 * Thus we still need to clean those range up during error
+			 * handling.
+			 */
+			last_finished = found_start;
 			/* No errors hit so far, run the current delalloc range. */
 			ret = btrfs_run_delalloc_range(inode, folio,
 						       found_start,
 						       found_start + found_len - 1,
 						       wbc);
+			if (ret >= 0)
+				last_finished = found_start + found_len;
 		} else {
 			/*
 			 * We've hit an error during previous delalloc range,
@@ -1274,8 +1288,21 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 
 		delalloc_start = found_start + found_len;
 	}
-	if (ret < 0)
+	/*
+	 * It's possible we have some ordered extents created before we hit
+	 * an error, cleanup non-async successfully created delalloc ranges.
+	 */
+	if (unlikely(ret < 0)) {
+		unsigned int bitmap_size = min(
+			(last_finished - page_start) >> fs_info->sectorsize_bits,
+			fs_info->sectors_per_page);
+
+		for_each_set_bit(bit, &bio_ctrl->submit_bitmap, bitmap_size)
+			btrfs_mark_ordered_io_finished(inode, folio,
+				page_start + (bit << fs_info->sectorsize_bits),
+				fs_info->sectorsize, false);
 		return ret;
+	}
 out:
 	if (last_delalloc_end)
 		delalloc_end = last_delalloc_end;
@@ -1509,13 +1536,13 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 
 	bio_ctrl->wbc->nr_to_write--;
 
-done:
-	if (ret) {
+	if (ret)
 		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
 					       page_start, PAGE_SIZE, !ret);
-		mapping_set_error(folio->mapping, ret);
-	}
 
+done:
+	if (ret < 0)
+		mapping_set_error(folio->mapping, ret);
 	/*
 	 * Only unlock ranges that are submitted. As there can be some async
 	 * submitted ranges inside the folio.
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c4997200dbb2..d41bb47d59fb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2305,7 +2305,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
 
 out:
 	if (ret < 0)
-		btrfs_cleanup_ordered_extents(inode, locked_folio, start,
+		btrfs_cleanup_ordered_extents(inode, NULL, start,
 					      end - start + 1);
 	return ret;
 }
-- 
2.47.1



* [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed
  2024-12-12  6:13 [PATCH v2 0/9] btrfs: error handling fixes Qu Wenruo
  2024-12-12  6:13 ` [PATCH v2 1/9] btrfs: fix double accounting race when btrfs_run_delalloc_range() failed Qu Wenruo
@ 2024-12-12  6:13 ` Qu Wenruo
  2025-01-08 22:24   ` Boris Burkov
  2024-12-12  6:13 ` [PATCH v2 3/9] btrfs: fix the error handling of submit_uncompressed_range() Qu Wenruo
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-12  6:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

[BUG]
If submit_one_sector() failed inside extent_writepage_io() for sector
size < page size cases (e.g. 4K sector size and 64K page size), then
we can hit a double ordered extent accounting error.

This should be very rare, as submit_one_sector() only fails when we
failed to grab the extent map, and such extent map should exist inside
the memory and have been pinned.

[CAUSE]
For example we have the following folio layout:

    0  4K          32K    48K   60K 64K
    |//|           |//////|     |///|

Where |///| is the dirty range we need to writeback. The 3 different
dirty ranges are submitted for regular COW.

Now we hit the following sequence:

- submit_one_sector() returned 0 for [0, 4K)

- submit_one_sector() returned 0 for [32K, 48K)

- submit_one_sector() returned error for [60K, 64K)

- btrfs_mark_ordered_io_finished() called for the whole folio
  This will mark the following ranges as finished:
  * [0, 4K)
  * [32K, 48K)
    Both ranges have their IO already submitted, this cleanup will
    lead to double accounting.

  * [60K, 64K)
    That's the correct cleanup.

The only good news is that this error is only theoretical, as the target
extent map is always pinned, thus we should directly grab it from
memory, rather than reading it from the disk.

[FIX]
Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
range, which can touch ranges we should not touch, move the error
handling inside extent_writepage_io().

So we can clean up exactly the sectors that ought to be submitted but
failed.

This provides much more accurate cleanup, avoiding the double
accounting.

Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 417c710c55ca..b6a4f1765b4c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	unsigned long range_bitmap = 0;
 	bool submitted_io = false;
+	bool error = false;
 	const u64 folio_start = folio_pos(folio);
 	u64 cur;
 	int bit;
@@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			break;
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
-		if (ret < 0)
-			goto out;
+		if (unlikely(ret < 0)) {
+			submit_one_bio(bio_ctrl);
+			/*
+			 * Failed to grab the extent map which should be very rare.
+			 * Since there is no bio submitted to finish the ordered
+			 * extent, we have to manually finish this sector.
+			 */
+			btrfs_mark_ordered_io_finished(inode, folio, cur,
+					fs_info->sectorsize, false);
+			error = true;
+			continue;
+		}
 		submitted_io = true;
 	}
-out:
+
 	/*
 	 * If we didn't submitted any sector (>= i_size), folio dirty get
 	 * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
@@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	 *
 	 * Here we set writeback and clear for the range. If the full folio
 	 * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
+	 *
+	 * If we hit any error, the corresponding sector will still be dirty
+	 * thus no need to clear PAGECACHE_TAG_DIRTY.
 	 */
-	if (!submitted_io) {
+	if (!submitted_io && !error) {
 		btrfs_folio_set_writeback(fs_info, folio, start, len);
 		btrfs_folio_clear_writeback(fs_info, folio, start, len);
 	}
@@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 {
 	struct inode *inode = folio->mapping->host;
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
-	const u64 page_start = folio_pos(folio);
 	int ret;
 	size_t pg_offset;
 	loff_t i_size = i_size_read(inode);
@@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 
 	bio_ctrl->wbc->nr_to_write--;
 
-	if (ret)
-		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
-					       page_start, PAGE_SIZE, !ret);
-
 done:
 	if (ret < 0)
 		mapping_set_error(folio->mapping, ret);
@@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
 		if (ret == 1)
 			goto next_page;
 
-		if (ret) {
-			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
-						       cur, cur_len, !ret);
+		if (ret)
 			mapping_set_error(mapping, ret);
-		}
 		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
 		if (ret < 0)
 			found_error = true;
-- 
2.47.1



* [PATCH v2 3/9] btrfs: fix the error handling of submit_uncompressed_range()
  2024-12-12  6:13 [PATCH v2 0/9] btrfs: error handling fixes Qu Wenruo
  2024-12-12  6:13 ` [PATCH v2 1/9] btrfs: fix double accounting race when btrfs_run_delalloc_range() failed Qu Wenruo
  2024-12-12  6:13 ` [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed Qu Wenruo
@ 2024-12-12  6:13 ` Qu Wenruo
  2025-01-08 22:33   ` Boris Burkov
  2024-12-12  6:13 ` [PATCH v2 4/9] btrfs: do proper folio cleanup when cow_file_range() failed Qu Wenruo
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-12  6:13 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
If btrfs failed to compress the range, or can not reserve a large enough
data extent (e.g. too fragmented free space), btrfs will fall back to
submit_uncompressed_range().

But inside submit_uncompressed_range(), run_delalloc_cow() can also fail
due to -ENOSPC or whatever other errors.

In that case there are 3 bugs in the error handling:

1) Double freeing of the same ordered extent
   Which can lead to a crash due to ordered extent double accounting

2) Start/end writeback without updating the subpage writeback bitmap

3) Unlock the folio without clearing the subpage lock bitmap

Both bugs 2) and 3) will crash the kernel if the btrfs block size is
smaller than the folio size, as the next time the folio gets
writeback/lock updates, subpage will find the bitmap already has the
range set, triggering an ASSERT().

[CAUSE]
Bug 1) happens in the following call chain:

  submit_uncompressed_range()
  |- run_delalloc_cow()
  |  |- cow_file_range()
  |     |- btrfs_reserve_extent()
  |        Failed with -ENOSPC or whatever error
  |
  |- btrfs_cleanup_ordered_extents()
  |  |- btrfs_mark_ordered_io_finished()
  |     Which cleans all the ordered extents in the async_extent range.
  |
  |- btrfs_mark_ordered_io_finished()
     Which cleans the folio range.

The finished ordered extents may not be immediately removed from the
ordered io tree, as they are removed inside a work queue.

So the second btrfs_mark_ordered_io_finished() may find the finished but
not-yet-removed ordered extents, and double free them.

Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage
compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover
other ordered extents.

Bugs 2) and 3) are more straightforward: btrfs just calls
folio_unlock(), folio_start_writeback() and folio_end_writeback(),
instead of the helpers which handle subpage cases.

[FIX]
For bug 1) since the first btrfs_cleanup_ordered_extents() call is
handling the whole range, we should not do the second
btrfs_mark_ordered_io_finished() call.

And for the first btrfs_cleanup_ordered_extents(), we no longer need to
pass the @locked_folio parameter, as we are already in the async extent
context, thus we will never rely on the error handling inside
btrfs_run_delalloc_range().

So just let btrfs_cleanup_ordered_extents() handle every folio equally.

For bug 2) we should not even call
folio_start_writeback()/folio_end_writeback() anymore.
As part of the error handling protocol, cow_file_range() should clear
the dirty flag and start/finish the writeback for the whole range passed
in.

For bug 3) just change the folio_unlock() call to the
btrfs_folio_end_lock() helper.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d41bb47d59fb..5ba8d044757b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1129,19 +1129,11 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
 			       &wbc, false);
 	wbc_detach_inode(&wbc);
 	if (ret < 0) {
-		btrfs_cleanup_ordered_extents(inode, locked_folio,
+		btrfs_cleanup_ordered_extents(inode, NULL,
 					      start, end - start + 1);
-		if (locked_folio) {
-			const u64 page_start = folio_pos(locked_folio);
-
-			folio_start_writeback(locked_folio);
-			folio_end_writeback(locked_folio);
-			btrfs_mark_ordered_io_finished(inode, locked_folio,
-						       page_start, PAGE_SIZE,
-						       !ret);
-			mapping_set_error(locked_folio->mapping, ret);
-			folio_unlock(locked_folio);
-		}
+		if (locked_folio)
+			btrfs_folio_end_lock(inode->root->fs_info, locked_folio,
+					     start, async_extent->ram_size);
 	}
 }
 
-- 
2.47.1



* [PATCH v2 4/9] btrfs: do proper folio cleanup when cow_file_range() failed
  2024-12-12  6:13 [PATCH v2 0/9] btrfs: error handling fixes Qu Wenruo
                   ` (2 preceding siblings ...)
  2024-12-12  6:13 ` [PATCH v2 3/9] btrfs: fix the error handling of submit_uncompressed_range() Qu Wenruo
@ 2024-12-12  6:13 ` Qu Wenruo
  2025-01-09 23:20   ` Boris Burkov
  2024-12-12  6:13 ` [PATCH v2 5/9] btrfs: do proper folio cleanup when run_delalloc_nocow() failed Qu Wenruo
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-12  6:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

[BUG]
When testing with the COW fixup marked as BUG_ON() (this is involved
with the new pin_user_pages*() change, which should no longer result in
new out-of-band dirty pages), I hit a crash triggered by the BUG_ON()
from hitting the COW fixup path.

This BUG_ON() happens just after a failed btrfs_run_delalloc_range():

 BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/extent_io.c:1444!
 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
 CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G           OE      6.12.0-rc7-custom+ #86
 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
 pc : extent_writepage_io+0x2d4/0x308 [btrfs]
 lr : extent_writepage_io+0x2d4/0x308 [btrfs]
 Call trace:
  extent_writepage_io+0x2d4/0x308 [btrfs]
  extent_writepage+0x218/0x330 [btrfs]
  extent_write_cache_pages+0x1d4/0x4b0 [btrfs]
  btrfs_writepages+0x94/0x150 [btrfs]
  do_writepages+0x74/0x190
  filemap_fdatawrite_wbc+0x88/0xc8
  start_delalloc_inodes+0x180/0x3b0 [btrfs]
  btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
  shrink_delalloc+0x114/0x280 [btrfs]
  flush_space+0x250/0x2f8 [btrfs]
  btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
  process_one_work+0x164/0x408
  worker_thread+0x25c/0x388
  kthread+0x100/0x118
  ret_from_fork+0x10/0x20
 Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000)
 ---[ end trace 0000000000000000 ]---

[CAUSE]
That failure is mostly from cow_file_range(), where we can hit -ENOSPC.

Although the -ENOSPC is already a bug related to our space reservation
code, let's just focus on the error handling.

For example, we have the following dirty range [0, 64K) of an inode,
with 4K sector size and 4K page size:

   0        16K        32K       48K       64K
   |///////////////////////////////////////|
   |#######################################|

Where |///| means pages that are still dirty, and |###| means the extent
io tree has the EXTENT_DELALLOC flag.

- Enter extent_writepage() for page 0

- Enter btrfs_run_delalloc_range() for range [0, 64K)

- Enter cow_file_range() for range [0, 64K)

- Function btrfs_reserve_extent() only reserved one 16K extent
  So we created an extent map and an ordered extent for the range
  [0, 16K)

   0        16K        32K       48K       64K
   |////////|//////////////////////////////|
   |<- OE ->|##############################|

   And range [0, 16K) has its delalloc flag cleared.
   But since we haven't yet submitted any bio, the 4 involved pages are
   still dirty.

- Function btrfs_reserve_extent() returns with -ENOSPC
  Now we have to run error cleanup, which will clear all
  EXTENT_DELALLOC* flags and clear the dirty flags for the remaining
  ranges:

   0        16K        32K       48K       64K
   |////////|                              |
   |        |                              |

  Note that range [0, 16K) still has its pages dirty.

- Some time later, writeback is triggered again for the range [0, 16K)
  since the page range still has dirty flags.

- btrfs_run_delalloc_range() will do nothing because there is no
  EXTENT_DELALLOC flag.

- extent_writepage_io() finds page 0 has no ordered flag
  Which falls into the COW fixup path, triggering the BUG_ON().

Unfortunately this error handling bug dates back to the introduction of
btrfs. Thankfully, with the abuse of the COW fixup, at least it won't
crash the kernel.

[FIX]
Instead of immediately unlocking the extent and folios, we keep the
extent and folios locked until we either error out or the whole delalloc
range finishes.

When the whole delalloc range finishes without error, we just unlock the
whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked
cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared.
And those involved folios will be properly submitted, with their dirty
flags cleared during submission.

For the error path, it will be a little more complex:

- The range with ordered extent allocated (range (1))
  We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining
  flags are cleaned up by
  btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered().

  For folios we finish the IO (clear dirty, start writeback and
  immediately finish the writeback) and unlock the folios.

- The range with reserved extent but no ordered extent (range(2))
- The range we never touched (range(3))
  For both range (2) and range(3) the behavior is not changed.

Now even if cow_file_range() fails halfway with some successfully
reserved extents/ordered extents, we will keep all folios clean, so
there will be no future writeback triggered on them.

Cc: stable@vger.kernel.org
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 65 ++++++++++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5ba8d044757b..19c88b7d0363 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1364,6 +1364,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 	alloc_hint = btrfs_get_extent_allocation_hint(inode, start, num_bytes);
 
+	/*
+	 * We're not doing compressed IO, don't unlock the first page
+	 * (which the caller expects to stay locked), don't clear any
+	 * dirty bits and don't set any writeback bits
+	 *
+	 * Do set the Ordered (Private2) bit so we know this page was
+	 * properly setup for writepage.
+	 */
+	page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
+	page_ops |= PAGE_SET_ORDERED;
+
 	/*
 	 * Relocation relies on the relocated extents to have exactly the same
 	 * size as the original extents. Normally writeback for relocation data
@@ -1423,6 +1434,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		file_extent.offset = 0;
 		file_extent.compression = BTRFS_COMPRESS_NONE;
 
+		/*
+		 * Locked range will be released either during error clean up or
+		 * after the whole range is finished.
+		 */
 		lock_extent(&inode->io_tree, start, start + cur_alloc_size - 1,
 			    &cached);
 
@@ -1468,21 +1483,6 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 
-		/*
-		 * We're not doing compressed IO, don't unlock the first page
-		 * (which the caller expects to stay locked), don't clear any
-		 * dirty bits and don't set any writeback bits
-		 *
-		 * Do set the Ordered flag so we know this page was
-		 * properly setup for writepage.
-		 */
-		page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
-		page_ops |= PAGE_SET_ORDERED;
-
-		extent_clear_unlock_delalloc(inode, start, start + cur_alloc_size - 1,
-					     locked_folio, &cached,
-					     EXTENT_LOCKED | EXTENT_DELALLOC,
-					     page_ops);
 		if (num_bytes < cur_alloc_size)
 			num_bytes = 0;
 		else
@@ -1499,6 +1499,9 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		if (ret)
 			goto out_unlock;
 	}
+	extent_clear_unlock_delalloc(inode, orig_start, end, locked_folio, &cached,
+				     EXTENT_LOCKED | EXTENT_DELALLOC,
+				     page_ops);
 done:
 	if (done_offset)
 		*done_offset = end;
@@ -1519,35 +1522,31 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	 * We process each region below.
 	 */
 
-	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
-		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
-	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
-
 	/*
 	 * For the range (1). We have already instantiated the ordered extents
 	 * for this region. They are cleaned up by
 	 * btrfs_cleanup_ordered_extents() in e.g,
-	 * btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
-	 * already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
-	 * EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
-	 * function.
+	 * btrfs_run_delalloc_range().
+	 * EXTENT_DELALLOC_NEW | EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV
+	 * are also handled by the cleanup function.
 	 *
-	 * However, in case of @keep_locked, we still need to unlock the pages
-	 * (except @locked_folio) to ensure all the pages are unlocked.
+	 * So here we only clear EXTENT_LOCKED and EXTENT_DELALLOC flag,
+	 * and finish the writeback of the involved folios, which will be
+	 * never submitted.
 	 */
-	if (keep_locked && orig_start < start) {
+	if (orig_start < start) {
+		clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC;
+		page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
+
 		if (!locked_folio)
 			mapping_set_error(inode->vfs_inode.i_mapping, ret);
 		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
-					     locked_folio, NULL, 0, page_ops);
+					     locked_folio, NULL, clear_bits, page_ops);
 	}
 
-	/*
-	 * At this point we're unlocked, we want to make sure we're only
-	 * clearing these flags under the extent lock, so lock the rest of the
-	 * range and clear everything up.
-	 */
-	lock_extent(&inode->io_tree, start, end, NULL);
+	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
+		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
+	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
 
 	/*
 	 * For the range (2). If we reserved an extent for our delalloc range
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 5/9] btrfs: do proper folio cleanup when run_delalloc_nocow() failed
  2024-12-12  6:13 [PATCH v2 0/9] btrfs: error handling fixes Qu Wenruo
                   ` (3 preceding siblings ...)
  2024-12-12  6:13 ` [PATCH v2 4/9] btrfs: do proper folio cleanup when cow_file_range() failed Qu Wenruo
@ 2024-12-12  6:13 ` Qu Wenruo
  2025-01-09 23:26   ` Boris Burkov
  2024-12-12  6:14 ` [PATCH v2 6/9] btrfs: subpage: fix the bitmap dump for the locked flags Qu Wenruo
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-12  6:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

[BUG]
With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash
with the following VM_BUG_ON_FOLIO():

 BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28
 BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28
 page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664
 aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774"
 flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff)
 page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
 ------------[ cut here ]------------
 kernel BUG at mm/page-writeback.c:2992!
 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
 CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G           OE      6.12.0-rc7-custom+ #87
 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
 pc : folio_clear_dirty_for_io+0x128/0x258
 lr : folio_clear_dirty_for_io+0x128/0x258
 Call trace:
  folio_clear_dirty_for_io+0x128/0x258
  btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs]
  __process_folios_contig+0x154/0x268 [btrfs]
  extent_clear_unlock_delalloc+0x5c/0x80 [btrfs]
  run_delalloc_nocow+0x5f8/0x760 [btrfs]
  btrfs_run_delalloc_range+0xa8/0x220 [btrfs]
  writepage_delalloc+0x230/0x4c8 [btrfs]
  extent_writepage+0xb8/0x358 [btrfs]
  extent_write_cache_pages+0x21c/0x4e8 [btrfs]
  btrfs_writepages+0x94/0x150 [btrfs]
  do_writepages+0x74/0x190
  filemap_fdatawrite_wbc+0x88/0xc8
  start_delalloc_inodes+0x178/0x3a8 [btrfs]
  btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
  shrink_delalloc+0x114/0x280 [btrfs]
  flush_space+0x250/0x2f8 [btrfs]
  btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
  process_one_work+0x164/0x408
  worker_thread+0x25c/0x388
  kthread+0x100/0x118
  ret_from_fork+0x10/0x20
 Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000)
 ---[ end trace 0000000000000000 ]---

[CAUSE]
The first two lines of extra debug messages show the problem is caused
by the error handling of run_delalloc_nocow().

E.g. we have the following dirtied range (4K blocksize 4K page size):

    0                 16K                  32K
    |//////////////////////////////////////|
    |  Pre-allocated  |

And the range [0, 16K) has a preallocated extent.

- Enter run_delalloc_nocow() for range [0, 16K)
  It finds that the range [0, 16K) is preallocated, so the proper NOCOW
  write can be done.

- Enter fallback_to_cow() for range [16K, 32K)
  Since the range [16K, 32K) is not backed by a preallocated extent, we
  have to fall back to COW.

- cow_file_range() failed for range [16K, 32K)
  So cow_file_range() does its cleanup by clearing the folio dirty
  flags and unlocking the folios.

  Now the folios in the range [16K, 32K) are unlocked.

- Enter extent_clear_unlock_delalloc() from run_delalloc_nocow()
  It is called with PAGE_START_WRITEBACK to start page writeback.
  But a folio can only be marked for writeback while it is locked,
  thus this triggers the VM_BUG_ON_FOLIO().
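The locking rule that trips the VM_BUG_ON_FOLIO() can be sketched with
a small userspace model (struct folio_state and the function name are
made up for illustration, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: a folio must be locked before it can be marked
 * for writeback, mirroring VM_BUG_ON_FOLIO(!folio_test_locked(folio))
 * in folio_clear_dirty_for_io(). */
struct folio_state {
	bool locked;
	bool dirty;
	bool writeback;
};

/* Returns false (instead of crashing like VM_BUG_ON_FOLIO) when the
 * caller violates the locking rule. */
static bool folio_start_writeback_checked(struct folio_state *f)
{
	if (!f->locked)
		return false;	/* would trigger VM_BUG_ON_FOLIO() */
	f->dirty = false;
	f->writeback = true;
	return true;
}
```

In the bug above, cow_file_range()'s cleanup has already unlocked the
folios when run_delalloc_nocow() tries to start writeback on them, so
the check fails.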

Furthermore, there is another hidden bug: run_delalloc_nocow() does not
clear the folio dirty flags in its error handling path.
This bug is shared between run_delalloc_nocow() and cow_file_range().

[FIX]
- Clear folio dirty for range [@start, @cur_offset)
  Introduce a helper, cleanup_dirty_folios(), which finds and locks
  each folio in the range, clears the dirty flag and starts/ends the
  writeback, with extra handling for @locked_folio.

- Record the inclusive end of the last failed COW range
  This tells us which range we should skip in the error handling, to
  avoid double unlocking.

- Skip the failed COW range for the error handling
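The skip logic can be sketched as a userspace model (the function name
is hypothetical; @cur_offset and @cow_end follow the meaning in the
patch, with cow_end == 0 meaning no COW range failed):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of the error-path bookkeeping in
 * run_delalloc_nocow(): folios in [start, cur_offset) only need their
 * dirty flags cleared, while a failed COW range ending at @cow_end was
 * already cleaned up by cow_file_range(), so the remaining cleanup
 * must resume at cow_end + 1 to avoid double unlocking.
 */
static uint64_t next_cleanup_offset(uint64_t cur_offset, uint64_t cow_end)
{
	if (cow_end)
		return cow_end + 1;	/* skip the already-cleaned COW range */
	return cur_offset;
}
```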

Cc: stable@vger.kernel.org
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 93 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 86 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 19c88b7d0363..bae8aceb3eae 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1961,6 +1961,48 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 	return ret < 0 ? ret : can_nocow;
 }
 
+static void cleanup_dirty_folios(struct btrfs_inode *inode,
+				 struct folio *locked_folio,
+				 u64 start, u64 end, int error)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct address_space *mapping = inode->vfs_inode.i_mapping;
+	pgoff_t start_index = start >> PAGE_SHIFT;
+	pgoff_t end_index = end >> PAGE_SHIFT;
+	u32 len;
+
+	ASSERT(end + 1 - start < U32_MAX);
+	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
+	       IS_ALIGNED(end + 1, fs_info->sectorsize));
+	len = end + 1 - start;
+
+	/*
+	 * Handle the locked folio first.
+	 * btrfs_folio_clamp_*() helpers can handle range out of the folio case.
+	 */
+	btrfs_folio_clamp_clear_dirty(fs_info, locked_folio, start, len);
+	btrfs_folio_clamp_set_writeback(fs_info, locked_folio, start, len);
+	btrfs_folio_clamp_clear_writeback(fs_info, locked_folio, start, len);
+
+	for (pgoff_t index = start_index; index <= end_index; index++) {
+		struct folio *folio;
+
+		/* Already handled at the beginning. */
+		if (index == locked_folio->index)
+			continue;
+		folio = __filemap_get_folio(mapping, index, FGP_LOCK, GFP_NOFS);
+		/* Cache already dropped, no need to do any cleanup. */
+		if (IS_ERR(folio))
+			continue;
+		btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
+		btrfs_folio_clamp_set_writeback(fs_info, folio, start, len);
+		btrfs_folio_clamp_clear_writeback(fs_info, folio, start, len);
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+	mapping_set_error(mapping, error);
+}
+
 /*
  * when nowcow writeback call back.  This checks for snapshots or COW copies
  * of the extents that exist in the file, and COWs the file as required.
@@ -1976,6 +2018,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 	struct btrfs_root *root = inode->root;
 	struct btrfs_path *path;
 	u64 cow_start = (u64)-1;
+	/*
+	 * If not 0, represents the inclusive end of the last fallback_to_cow()
+	 * range. Only for error handling.
+	 */
+	u64 cow_end = 0;
 	u64 cur_offset = start;
 	int ret;
 	bool check_prev = true;
@@ -2136,6 +2183,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 					      found_key.offset - 1);
 			cow_start = (u64)-1;
 			if (ret) {
+				cow_end = found_key.offset - 1;
 				btrfs_dec_nocow_writers(nocow_bg);
 				goto error;
 			}
@@ -2209,11 +2257,12 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 		cow_start = cur_offset;
 
 	if (cow_start != (u64)-1) {
-		cur_offset = end;
 		ret = fallback_to_cow(inode, locked_folio, cow_start, end);
 		cow_start = (u64)-1;
-		if (ret)
+		if (ret) {
+			cow_end = end;
 			goto error;
+		}
 	}
 
 	btrfs_free_path(path);
@@ -2221,12 +2270,42 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 
 error:
 	/*
-	 * If an error happened while a COW region is outstanding, cur_offset
-	 * needs to be reset to cow_start to ensure the COW region is unlocked
-	 * as well.
+	 * There are several error cases:
+	 *
+	 * 1) Failed without falling back to COW
+	 *    start         cur_start              end
+	 *    |/////////////|                      |
+	 *
+	 *    For range [start, cur_start) the folios are already unlocked (except
+	 *    @locked_folio), EXTENT_DELALLOC already removed.
+	 *    Only need to clear the dirty flag as they will never be submitted.
+	 *    Ordered extent and extent maps are handled by
+	 *    btrfs_mark_ordered_io_finished() inside run_delalloc_range().
+	 *
+	 * 2) Failed with error from fallback_to_cow()
+	 *    start         cur_start   cow_end    end
+	 *    |/////////////|-----------|          |
+	 *
+	 *    For range [start, cur_start) it's the same as case 1).
+	 *    But for range [cur_start, cow_end), the folios have dirty flag
+	 *    cleared and unlocked, EXTENT_DEALLLOC cleared.
+	 *    There may or may not be any ordered extents/extent maps allocated.
+	 *
+	 *    We should not call extent_clear_unlock_delalloc() on range [cur_start,
+	 *    cow_end), as the folios are already unlocked.
+	 *
+	 * So clear the folio dirty flags for [start, cur_offset) first.
 	 */
-	if (cow_start != (u64)-1)
-		cur_offset = cow_start;
+	if (cur_offset > start)
+		cleanup_dirty_folios(inode, locked_folio, start, cur_offset - 1, ret);
+
+	/*
+	 * If an error happened while a COW region is outstanding, cur_offset
+	 * needs to be reset to @cow_end + 1 to skip the COW range, as
+	 * cow_file_range() will do the proper cleanup at error.
+	 */
+	if (cow_end)
+		cur_offset = cow_end + 1;
 
 	/*
 	 * We need to lock the extent here because we're clearing DELALLOC and
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 6/9] btrfs: subpage: fix the bitmap dump for the locked flags
  2024-12-12  6:13 [PATCH v2 0/9] btrfs: error handling fixes Qu Wenruo
                   ` (4 preceding siblings ...)
  2024-12-12  6:13 ` [PATCH v2 5/9] btrfs: do proper folio cleanup when run_delalloc_nocow() failed Qu Wenruo
@ 2024-12-12  6:14 ` Qu Wenruo
  2025-01-08 22:45   ` Boris Burkov
  2024-12-12  6:14 ` [PATCH v2 7/9] btrfs: subpage: dump the involved bitmap when ASSERT() failed Qu Wenruo
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-12  6:14 UTC (permalink / raw)
  To: linux-btrfs

We're dumping the locked bitmap into the @checked_bitmap variable,
causing incorrect values during debug.

Thankfully even during my development I haven't hit a case where I need
to dump the locked bitmap.
But for the sake of consistency, fix it by dumping the locked bitmap
into the @locked_bitmap variable for output.

Fixes: 75258f20fb70 ("btrfs: subpage: dump extra subpage bitmaps for debug")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 8c68059ac1b0..03d7bfc042e2 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -716,6 +716,7 @@ void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
 	unsigned long writeback_bitmap;
 	unsigned long ordered_bitmap;
 	unsigned long checked_bitmap;
+	unsigned long locked_bitmap;
 	unsigned long flags;
 
 	ASSERT(folio_test_private(folio) && folio_get_private(folio));
@@ -728,15 +729,16 @@ void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
 	GET_SUBPAGE_BITMAP(subpage, fs_info, writeback, &writeback_bitmap);
 	GET_SUBPAGE_BITMAP(subpage, fs_info, ordered, &ordered_bitmap);
 	GET_SUBPAGE_BITMAP(subpage, fs_info, checked, &checked_bitmap);
-	GET_SUBPAGE_BITMAP(subpage, fs_info, locked, &checked_bitmap);
+	GET_SUBPAGE_BITMAP(subpage, fs_info, locked, &locked_bitmap);
 	spin_unlock_irqrestore(&subpage->lock, flags);
 
 	dump_page(folio_page(folio, 0), "btrfs subpage dump");
 	btrfs_warn(fs_info,
-"start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl dirty=%*pbl writeback=%*pbl ordered=%*pbl checked=%*pbl",
+"start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl dirty=%*pbl locked=%*pbl writeback=%*pbl ordered=%*pbl checked=%*pbl",
 		    start, len, folio_pos(folio),
 		    sectors_per_page, &uptodate_bitmap,
 		    sectors_per_page, &dirty_bitmap,
+		    sectors_per_page, &locked_bitmap,
 		    sectors_per_page, &writeback_bitmap,
 		    sectors_per_page, &ordered_bitmap,
 		    sectors_per_page, &checked_bitmap);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 7/9] btrfs: subpage: dump the involved bitmap when ASSERT() failed
  2024-12-12  6:13 [PATCH v2 0/9] btrfs: error handling fixes Qu Wenruo
                   ` (5 preceding siblings ...)
  2024-12-12  6:14 ` [PATCH v2 6/9] btrfs: subpage: fix the bitmap dump for the locked flags Qu Wenruo
@ 2024-12-12  6:14 ` Qu Wenruo
  2025-01-08 22:46   ` Boris Burkov
  2024-12-12  6:14 ` [PATCH v2 8/9] btrfs: add extra error messages for delalloc range related errors Qu Wenruo
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-12  6:14 UTC (permalink / raw)
  To: linux-btrfs

For btrfs_folio_assert_not_dirty() and btrfs_folio_set_lock(), we call
bitmap_test_range_all_zero() to ensure the involved range has no bits
set.

However, with my recent enhanced delalloc range error handling, I'm
hitting the ASSERT() inside btrfs_folio_set_lock(), and I wonder
whether some error handling path is not properly cleaning up the locked
bitmap but directly unlocking the page.

So make the ASSERT()s dump the involved bitmap to help debugging.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 03d7bfc042e2..d692bc34a3af 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -635,6 +635,28 @@ IMPLEMENT_BTRFS_PAGE_OPS(ordered, folio_set_ordered, folio_clear_ordered,
 IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked,
 			 folio_test_checked);
 
+#define GET_SUBPAGE_BITMAP(subpage, fs_info, name, dst)			\
+{									\
+	const int sectors_per_page = fs_info->sectors_per_page;		\
+									\
+	ASSERT(sectors_per_page < BITS_PER_LONG);			\
+	*dst = bitmap_read(subpage->bitmaps,				\
+			   sectors_per_page * btrfs_bitmap_nr_##name,	\
+			   sectors_per_page);				\
+}
+
+#define subpage_dump_bitmap(fs_info, folio, name, start, len)		\
+{									\
+	struct btrfs_subpage *subpage = folio_get_private(folio);	\
+	unsigned long bitmap;						\
+									\
+	GET_SUBPAGE_BITMAP(subpage, fs_info, name, &bitmap);		\
+	btrfs_warn(fs_info,						\
+	"dumpping bitmap start=%llu len=%u folio=%llu" #name "_bitmap=%*pbl", \
+		   start, len, folio_pos(folio),			\
+		   fs_info->sectors_per_page, &bitmap);			\
+}
+
 /*
  * Make sure not only the page dirty bit is cleared, but also subpage dirty bit
  * is cleared.
@@ -660,6 +682,10 @@ void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
 	subpage = folio_get_private(folio);
 	ASSERT(subpage);
 	spin_lock_irqsave(&subpage->lock, flags);
+	if (unlikely(!bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits))) {
+		subpage_dump_bitmap(fs_info, folio, dirty, start, len);
+		ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+	}
 	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
@@ -689,23 +715,16 @@ void btrfs_folio_set_lock(const struct btrfs_fs_info *fs_info,
 	nbits = len >> fs_info->sectorsize_bits;
 	spin_lock_irqsave(&subpage->lock, flags);
 	/* Target range should not yet be locked. */
-	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+	if (unlikely(!bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits))) {
+		subpage_dump_bitmap(fs_info, folio, locked, start, len);
+		ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+	}
 	bitmap_set(subpage->bitmaps, start_bit, nbits);
 	ret = atomic_add_return(nbits, &subpage->nr_locked);
 	ASSERT(ret <= fs_info->sectors_per_page);
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
-#define GET_SUBPAGE_BITMAP(subpage, fs_info, name, dst)			\
-{									\
-	const int sectors_per_page = fs_info->sectors_per_page;		\
-									\
-	ASSERT(sectors_per_page < BITS_PER_LONG);			\
-	*dst = bitmap_read(subpage->bitmaps,				\
-			   sectors_per_page * btrfs_bitmap_nr_##name,	\
-			   sectors_per_page);				\
-}
-
 void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
 				      struct folio *folio, u64 start, u32 len)
 {
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 8/9] btrfs: add extra error messages for delalloc range related errors
  2024-12-12  6:13 [PATCH v2 0/9] btrfs: error handling fixes Qu Wenruo
                   ` (6 preceding siblings ...)
  2024-12-12  6:14 ` [PATCH v2 7/9] btrfs: subpage: dump the involved bitmap when ASSERT() failed Qu Wenruo
@ 2024-12-12  6:14 ` Qu Wenruo
  2025-01-08 22:47   ` Boris Burkov
  2024-12-12  6:14 ` [PATCH v2 9/9] btrfs: remove the unused @locked_folio parameter from btrfs_cleanup_ordered_extents() Qu Wenruo
  2025-01-07 15:10 ` [PATCH v2 0/9] btrfs: error handling fixes David Sterba
  9 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-12  6:14 UTC (permalink / raw)
  To: linux-btrfs

All the error handling bugs I hit so far are all -ENOSPC from either:

- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()

Previously when those functions failed, there was no error message at
all, making debugging much harder.
So here we introduce extra error messages for:

- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
- writepage_delalloc() when btrfs_run_delalloc_range() failed
- extent_writepage() when extent_writepage_io() failed

One example of the new debug error messages is the following one:

 run fstests generic/750 at 2024-12-08 12:41:41
 BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600)
 BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3
 BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm
 BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536
 BTRFS info (device dm-4): using free-space-tree
 BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental
 BTRFS info (device dm-4): checking UUID tree
 BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28
 BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28
 BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28

This shows an error from cow_file_range(), which was called inside a
NOCOW write attempt, along with the extra bitmap from
writepage_delalloc().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 16 ++++++++++++++++
 fs/btrfs/inode.c     | 14 +++++++++++++-
 fs/btrfs/subpage.c   |  3 ++-
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b6a4f1765b4c..f4fb1fb3454a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1254,6 +1254,15 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 						       wbc);
 			if (ret >= 0)
 				last_finished = found_start + found_len;
+			if (unlikely(ret < 0))
+				btrfs_err_rl(fs_info,
+"failed to run delalloc range, root=%lld ino=%llu folio=%llu submit_bitmap=%*pbl start=%llu len=%u: %d",
+					     inode->root->root_key.objectid,
+					     btrfs_ino(inode),
+					     folio_pos(folio),
+					     fs_info->sectors_per_page,
+					     &bio_ctrl->submit_bitmap,
+					     found_start, found_len, ret);
 		} else {
 			/*
 			 * We've hit an error during previous delalloc range,
@@ -1546,6 +1555,13 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 				  PAGE_SIZE, bio_ctrl, i_size);
 	if (ret == 1)
 		return 0;
+	if (ret < 0)
+		btrfs_err_rl(fs_info,
+"failed to submit blocks, root=%lld inode=%llu folio=%llu submit_bitmap=%*pbl: %d",
+			     BTRFS_I(inode)->root->root_key.objectid,
+			     btrfs_ino(BTRFS_I(inode)),
+			     folio_pos(folio), fs_info->sectors_per_page,
+			     &bio_ctrl->submit_bitmap, ret);
 
 	bio_ctrl->wbc->nr_to_write--;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bae8aceb3eae..a88cba85bf40 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1134,6 +1134,10 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
 		if (locked_folio)
 			btrfs_folio_end_lock(inode->root->fs_info, locked_folio,
 					     start, async_extent->ram_size);
+		btrfs_err_rl(inode->root->fs_info,
+		"%s failed, root=%llu inode=%llu start=%llu len=%llu: %d",
+			     __func__, btrfs_root_id(inode->root),
+			     btrfs_ino(inode), start, async_extent->ram_size, ret);
 	}
 }
 
@@ -1246,7 +1250,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
 	free_async_extent_pages(async_extent);
 	if (async_chunk->blkcg_css)
 		kthread_associate_blkcg(NULL);
-	btrfs_debug(fs_info,
+	btrfs_debug_rl(fs_info,
 "async extent submission failed root=%lld inode=%llu start=%llu len=%llu ret=%d",
 		    btrfs_root_id(root), btrfs_ino(inode), start,
 		    async_extent->ram_size, ret);
@@ -1580,6 +1584,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		btrfs_qgroup_free_data(inode, NULL, start + cur_alloc_size,
 				       end - start - cur_alloc_size + 1, NULL);
 	}
+	btrfs_err_rl(fs_info,
+		     "%s failed, root=%llu inode=%llu start=%llu len=%llu: %d",
+		     __func__, btrfs_root_id(inode->root),
+		     btrfs_ino(inode), orig_start, end + 1 - orig_start, ret);
 	return ret;
 }
 
@@ -2325,6 +2333,10 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 		btrfs_qgroup_free_data(inode, NULL, cur_offset, end - cur_offset + 1, NULL);
 	}
 	btrfs_free_path(path);
+	btrfs_err_rl(fs_info,
+		     "%s failed, root=%llu inode=%llu start=%llu len=%llu: %d",
+		     __func__, btrfs_root_id(inode->root),
+		     btrfs_ino(inode), start, end + 1 - start, ret);
 	return ret;
 }
 
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index d692bc34a3af..7f47bc61389c 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -652,7 +652,7 @@ IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked,
 									\
 	GET_SUBPAGE_BITMAP(subpage, fs_info, name, &bitmap);		\
 	btrfs_warn(fs_info,						\
-	"dumpping bitmap start=%llu len=%u folio=%llu" #name "_bitmap=%*pbl", \
+	"dumpping bitmap start=%llu len=%u folio=%llu " #name "_bitmap=%*pbl", \
 		   start, len, folio_pos(folio),			\
 		   fs_info->sectors_per_page, &bitmap);			\
 }
@@ -717,6 +717,7 @@ void btrfs_folio_set_lock(const struct btrfs_fs_info *fs_info,
 	/* Target range should not yet be locked. */
 	if (unlikely(!bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits))) {
 		subpage_dump_bitmap(fs_info, folio, locked, start, len);
+		btrfs_warn(fs_info, "nr_locked=%u\n", atomic_read(&subpage->nr_locked));
 		ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
 	}
 	bitmap_set(subpage->bitmaps, start_bit, nbits);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 9/9] btrfs: remove the unused @locked_folio parameter from btrfs_cleanup_ordered_extents()
  2024-12-12  6:13 [PATCH v2 0/9] btrfs: error handling fixes Qu Wenruo
                   ` (7 preceding siblings ...)
  2024-12-12  6:14 ` [PATCH v2 8/9] btrfs: add extra error messages for delalloc range related errors Qu Wenruo
@ 2024-12-12  6:14 ` Qu Wenruo
  2025-01-08 22:48   ` Boris Burkov
  2025-01-07 15:10 ` [PATCH v2 0/9] btrfs: error handling fixes David Sterba
  9 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-12  6:14 UTC (permalink / raw)
  To: linux-btrfs

The function btrfs_cleanup_ordered_extents() is only called in error
handling paths, and the last caller passing a @locked_folio was removed
to fix a bug in the btrfs_run_delalloc_range() error handling.

There is no need for the @locked_folio parameter anymore.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 44 ++------------------------------------------
 1 file changed, 2 insertions(+), 42 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a88cba85bf40..a5d33ebf90d4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -393,34 +393,13 @@ void btrfs_inode_unlock(struct btrfs_inode *inode, unsigned int ilock_flags)
  * extent (btrfs_finish_ordered_io()).
  */
 static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
-						 struct folio *locked_folio,
 						 u64 offset, u64 bytes)
 {
 	unsigned long index = offset >> PAGE_SHIFT;
 	unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
-	u64 page_start = 0, page_end = 0;
 	struct folio *folio;
 
-	if (locked_folio) {
-		page_start = folio_pos(locked_folio);
-		page_end = page_start + folio_size(locked_folio) - 1;
-	}
-
 	while (index <= end_index) {
-		/*
-		 * For locked page, we will call btrfs_mark_ordered_io_finished
-		 * through btrfs_mark_ordered_io_finished() on it
-		 * in run_delalloc_range() for the error handling, which will
-		 * clear page Ordered and run the ordered extent accounting.
-		 *
-		 * Here we can't just clear the Ordered bit, or
-		 * btrfs_mark_ordered_io_finished() would skip the accounting
-		 * for the page range, and the ordered extent will never finish.
-		 */
-		if (locked_folio && index == (page_start >> PAGE_SHIFT)) {
-			index++;
-			continue;
-		}
 		folio = filemap_get_folio(inode->vfs_inode.i_mapping, index);
 		index++;
 		if (IS_ERR(folio))
@@ -436,23 +415,6 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
 		folio_put(folio);
 	}
 
-	if (locked_folio) {
-		/* The locked page covers the full range, nothing needs to be done */
-		if (bytes + offset <= page_start + folio_size(locked_folio))
-			return;
-		/*
-		 * In case this page belongs to the delalloc range being
-		 * instantiated then skip it, since the first page of a range is
-		 * going to be properly cleaned up by the caller of
-		 * run_delalloc_range
-		 */
-		if (page_start >= offset && page_end <= (offset + bytes - 1)) {
-			bytes = offset + bytes - folio_pos(locked_folio) -
-				folio_size(locked_folio);
-			offset = folio_pos(locked_folio) + folio_size(locked_folio);
-		}
-	}
-
 	return btrfs_mark_ordered_io_finished(inode, NULL, offset, bytes, false);
 }
 
@@ -1129,8 +1091,7 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
 			       &wbc, false);
 	wbc_detach_inode(&wbc);
 	if (ret < 0) {
-		btrfs_cleanup_ordered_extents(inode, NULL,
-					      start, end - start + 1);
+		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
 		if (locked_folio)
 			btrfs_folio_end_lock(inode->root->fs_info, locked_folio,
 					     start, async_extent->ram_size);
@@ -2387,8 +2348,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
 
 out:
 	if (ret < 0)
-		btrfs_cleanup_ordered_extents(inode, NULL, start,
-					      end - start + 1);
+		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
 	return ret;
 }
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/9] btrfs: error handling fixes
  2024-12-12  6:13 [PATCH v2 0/9] btrfs: error handling fixes Qu Wenruo
                   ` (8 preceding siblings ...)
  2024-12-12  6:14 ` [PATCH v2 9/9] btrfs: remove the unused @locked_folio parameter from btrfs_cleanup_ordered_extents() Qu Wenruo
@ 2025-01-07 15:10 ` David Sterba
  9 siblings, 0 replies; 28+ messages in thread
From: David Sterba @ 2025-01-07 15:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Dec 12, 2024 at 04:43:54PM +1030, Qu Wenruo wrote:
> [CHANGELOG]
> v2:
> - Fix the btrfs_cleanup_ordered_extents() call inside
>   btrfs_run_delalloc_range()
> 
>   Since we no longer call btrfs_mark_ordered_io_finished() if
>   btrfs_run_delalloc_range() failed, the existing
>   btrfs_cleanup_ordered_extents() call with @locked_folio will cause the
>   subpage range not to be properly cleaned up.
> 
>   This can lead to hanging ordered extents for subpage cases.
> 
> - Update the commit message of the first patch
>   With a more detailed analysis of how the double accounting happens.
>   It's pretty complex and very lengthy, but is easier to understand (at
>   least I hope so).
> 
>   The root cause is the btrfs_cleanup_ordered_extents()'s range split
>   behavior, which is not subpage compatible and is cursed in the first
>   place.
> 
>   So the fix is still the same, by removing the split OE handling
>   completely.
> 
> - A new patch to clean up the @locked_folio parameter of
>   btrfs_cleanup_ordered_extents()
> 
> I believe there is a regression in the last 2 or 3 releases where the
> metadata/data space reservation code is no longer working properly,
> causing us to hit -ENOSPC during btrfs_run_delalloc_range().
> 
> One of the most common situations to hit such a problem is generic/750,
> along with other long-running generic tests.
> 
> Although I should start bisecting the space reservation bug, I cannot
> help but fix the exposed bugs first.
> 
> This exposed quite a few long-existing bugs, all in the error handling
> paths, that can lead to the following crashes:
> 
> - Double ordered extent accounting
>   Triggers WARN_ON_ONCE() inside can_finish_ordered_extent(), then crashes.
> 
>   This bug is fixed by the first 3 patches.
>   The first patch is the most important one, since it's pretty easy to
>   trigger in the real world and has existed for a very long time.
> 
>   The second patch is just a precautionary fix, not easy to hit in the
>   real world.
> 
>   The third one is also possible in the real world, but only with the
>   recently enabled subpage compression write support.
> 
> - Subpage ASSERT() triggered, where the subpage folio bitmap differs
>   from the folio status
>   This happens most likely in submit_uncompressed_range(), where it
>   unlocks the folio without updating the subpage bitmaps.
> 
>   This bug is fixed by the 3rd patch.
> 
> - WARN_ON() if out-of-tree patch "btrfs: reject out-of-band dirty folios
>   during writeback" applied
>   This is a more complex case, where error handling leaves some folios
>   dirty, but with EXTENT_DELALLOC flag cleared from extent io tree.
> 
>   Such dirty folios can still be written back later, but since there is
>   no EXTENT_DELALLOC flag, they will be treated as out-of-band dirty
>   folios and trigger the COW fixup.
> 
>   This bug is fixed by the 4th and 5th patches.
> 
> With so many existing bugs exposed, there is more than enough motivation
> to make btrfs_run_delalloc_range() (and its delalloc range functions)
> output extra error messages so that at least we know something is wrong.
> 
> And those error messages have already helped a lot during my
> development.
> 
> Patches 6~8 are here to enhance the error messages.
> 
> And the final one is to clean up the unnecessary @locked_folio parameter
> of btrfs_cleanup_ordered_extents().
> 
> With all these patches applied, at least fstests can finish reliably;
> otherwise it frequently crashes in generic tests, and I have been unable
> to finish even one full run since the space reservation regression.
> 
> 
> Qu Wenruo (9):
>   btrfs: fix double accounting race when btrfs_run_delalloc_range()
>     failed
>   btrfs: fix double accounting race when extent_writepage_io() failed
>   btrfs: fix the error handling of submit_uncompressed_range()
>   btrfs: do proper folio cleanup when cow_file_range() failed
>   btrfs: do proper folio cleanup when run_delalloc_nocow() failed
>   btrfs: subpage: fix the bitmap dump for the locked flags
>   btrfs: subpage: dump the involved bitmap when ASSERT() failed
>   btrfs: add extra error messages for delalloc range related errors
>   btrfs: remove the unused @locked_folio parameter from
>     btrfs_cleanup_ordered_extents()

This is the last non-trivial outstanding patchset in the queue. It was
in misc-next (and thus linux-next) so we have some testing coverage.
Unless there are known issues or fixups we should move it to for-next.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/9] btrfs: fix double accounting race when btrfs_run_delalloc_range() failed
  2024-12-12  6:13 ` [PATCH v2 1/9] btrfs: fix double accounting race when btrfs_run_delalloc_range() failed Qu Wenruo
@ 2025-01-08 21:52   ` Boris Burkov
  2025-01-09  2:45     ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: Boris Burkov @ 2025-01-08 21:52 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, stable

On Thu, Dec 12, 2024 at 04:43:55PM +1030, Qu Wenruo wrote:
> [BUG]
> When running btrfs with block size (4K) smaller than page size (64K,
> aarch64), there is a very high chance to crash the kernel at
> generic/750, with the following messages:
> (before the call traces, there are 3 extra debug messages added)
> 
>  BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental
>  BTRFS info (device dm-3): checking UUID tree
>  hrtimer: interrupt took 5451385 ns
>  BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28
>  BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28
>  BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
>  ------------[ cut here ]------------
>  WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs]
>  CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G           OE      6.13.0-rc1-custom+ #89
>  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
>  Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
>  pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
>  lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
>  Call trace:
>   can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
>   can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
>   btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
>   extent_writepage+0x10c/0x3b8 [btrfs]
>   extent_write_cache_pages+0x21c/0x4e8 [btrfs]
>   btrfs_writepages+0x94/0x160 [btrfs]
>   do_writepages+0x74/0x190
>   filemap_fdatawrite_wbc+0x74/0xa0
>   start_delalloc_inodes+0x17c/0x3b0 [btrfs]
>   btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
>   shrink_delalloc+0x11c/0x280 [btrfs]
>   flush_space+0x288/0x328 [btrfs]
>   btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
>   process_one_work+0x228/0x680
>   worker_thread+0x1bc/0x360
>   kthread+0x100/0x118
>   ret_from_fork+0x10/0x20
>  ---[ end trace 0000000000000000 ]---
>  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
>  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
>  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
>  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
>  CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G        W  OE      6.13.0-rc1-custom+ #89
>  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
>  Workqueue:  btrfs_work_helper [btrfs] (btrfs-endio-write)
>  pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>  pc : process_one_work+0x110/0x680
>  lr : worker_thread+0x1bc/0x360
>  Call trace:
>   process_one_work+0x110/0x680 (P)
>   worker_thread+0x1bc/0x360 (L)
>   worker_thread+0x1bc/0x360
>   kthread+0x100/0x118
>   ret_from_fork+0x10/0x20
>  Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
>  ---[ end trace 0000000000000000 ]---
>  Kernel panic - not syncing: Oops: Fatal exception
>  SMP: stopping secondary CPUs
>  SMP: failed to stop secondary CPUs 2-3
>  Dumping ftrace buffer:
>     (ftrace buffer empty)
>  Kernel Offset: 0x275bb9540000 from 0xffff800080000000
>  PHYS_OFFSET: 0xffff8fbba0000000
>  CPU features: 0x100,00000070,00801250,8201720b
> 
> [CAUSE]
> The above warning is triggered immediately after the delalloc range
> failure; this happens in the following sequence:
> 
> - Range [1568K, 1636K) is dirty
> 
>    1536K  1568K     1600K    1636K  1664K
>    |      |/////////|////////|      |
> 
>   Where 1536K, 1600K and 1664K are page boundaries (64K page size)
> 
> - Enter extent_writepage() for page 1536K
> 
> - Enter run_delalloc_nocow() with locked page 1536K and range
>   [1568K, 1636K)
>   This is because the inode has preallocated extents.
> 
> - Enter cow_file_range() with locked page 1536K and range
>   [1568K, 1636K)
> 
> - btrfs_reserve_extent() only reserved two extents
>   The main loop of cow_file_range() only reserved two data extents.
> 
>   Now we have:
> 
>    1536K  1568K        1600K    1636K  1664K
>    |      |<-->|<--->|/|///////|      |
>                1584K  1596K
>   Range [1568K, 1596K) has ordered extent reserved.
> 
> - btrfs_reserve_extent() failed inside cow_file_range() for file offset
>   1596K
>   This is already a bug in our space reservation code, but for now let's
>   focus on the error handling path.
> 
>   Now cow_file_range() returned -ENOSPC.
> 
> - btrfs_run_delalloc_range() does error cleanup <<< ROOT CAUSE
>   Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range
>   [1568K, 1636K)
> 
>   Function btrfs_cleanup_ordered_extents() normally needs to skip the
>   ranges inside the folio, as they will normally be cleaned up by
>   extent_writepage().
> 
>   Such split error handling is already problematic in the first place.
> 
>   What's worse is the folio range skipping itself, which does not take
>   subpage cases into consideration at all: it will only skip the range
>   if the page start >= the range start.
>   In our case, the page start < the range start, since for subpage cases
>   we can have delalloc ranges inside the folio that do not cover the
>   whole folio.
> 
>   So it doesn't skip the page range at all.
>   This means all the ordered extents, both [1568K, 1584K) and
>   [1584K, 1596K), will be marked as IOERR.
> 
>   And since those two ordered extents have no more pending IOs, they are
>   marked finished and *QUEUED* to be deleted from the io tree.
> 
> - extent_writepage() does error cleanup
>   Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).
> 
>   Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the
>   deletion from the io tree is async, so it may or may not have happened
>   at this point.
> 
>   If the ranges are not yet removed, we will do double cleanup on those
>   ranges, which triggers the above ordered extent warnings.
> 
> In theory there are other bugs; for example, the cleanup in
> extent_writepage() can cause double accounting on ranges that are
> submitted asynchronously (compression, for example).
> 
> But that's much harder to trigger because normally we do not mix regular
> and compression delalloc ranges.
> 
> [FIX]
> The folio range split is already buggy and not subpage compatible; it
> was introduced a long time ago, when subpage support was not even
> considered.
> 
> So instead of splitting the ordered extents cleanup into the folio range
> and out of folio range, do all the cleanup inside writepage_delalloc().
> 
> - Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
>   btrfs_run_delalloc_range()
> 
> - Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc()
>   failed
> 
>   So all ordered extents are only cleaned up by
>   btrfs_run_delalloc_range().
> 
> - Handle the ranges that already have ordered extents allocated
>   If part of the folio already has ordered extent allocated, and
>   btrfs_run_delalloc_range() failed, we also need to cleanup that range.
> 
> Now we have a concentrated error handling for ordered extents during
> btrfs_run_delalloc_range().

Great investigation and writeup, thanks!

The explanation and fix both make sense to me. I traced the change in
error handling and I see how we are avoiding double ending the
ordered_extent. So with that said, feel free to add:
Reviewed-by: Boris Burkov <boris@bur.io>

However, I would like to request one thing, if I may.
While this is all still relatively fresh in your mind, could you please
document the intended behavior of the various functions (at least the
ones you modify/reason about) with regards to:
- cleanup state of the various objects involved like ordered_extents
  and subpages (e.g., writepage_delalloc cleans up ordered extents, so
  callers should not, etc.)
- return values (e.g., when precisely does btrfs_run_delalloc_range
  return >= 0 ?)
- anything else you think would be helpful for reasoning about these
  functions in an abstract way while you are at it.

That request is obviously optional for landing these fixes, but I really
think it would help if we went through the bother every time we
deciphered one of these undocumented paths. A restatement of your best
understanding of the behavior now will really pay off for the next
person reading this code :)

Thanks,
Boris

> 
> Cc: stable@vger.kernel.org # 5.15+
> Fixes: d1051d6ebf8e ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents")
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 37 ++++++++++++++++++++++++++++++++-----
>  fs/btrfs/inode.c     |  2 +-
>  2 files changed, 33 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 9725ff7f274d..417c710c55ca 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1167,6 +1167,12 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>  	 * last delalloc end.
>  	 */
>  	u64 last_delalloc_end = 0;
> +	/*
> +	 * Save the last successfully ran delalloc range end (exclusive).
> +	 * This is for error handling to avoid ranges with ordered extent created
> +	 * but no IO will be submitted due to error.
> +	 */

nit: last_finished what? I feel this name or comment could use some
extra work.

> +	u64 last_finished = page_start;
>  	u64 delalloc_start = page_start;
>  	u64 delalloc_end = page_end;
>  	u64 delalloc_to_write = 0;
> @@ -1235,11 +1241,19 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>  			found_len = last_delalloc_end + 1 - found_start;
>  
>  		if (ret >= 0) {
> +			/*
> +			 * Some delalloc range may be created by previous folios.
> +			 * Thus we still need to clean those range up during error
> +			 * handling.
> +			 */
> +			last_finished = found_start;
>  			/* No errors hit so far, run the current delalloc range. */
>  			ret = btrfs_run_delalloc_range(inode, folio,
>  						       found_start,
>  						       found_start + found_len - 1,
>  						       wbc);
> +			if (ret >= 0)
> +				last_finished = found_start + found_len;
>  		} else {
>  			/*
>  			 * We've hit an error during previous delalloc range,
> @@ -1274,8 +1288,21 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>  
>  		delalloc_start = found_start + found_len;
>  	}
> -	if (ret < 0)
> +	/*
> +	 * It's possible we have some ordered extents created before we hit
> +	 * an error, cleanup non-async successfully created delalloc ranges.
> +	 */
> +	if (unlikely(ret < 0)) {
> +		unsigned int bitmap_size = min(
> +			(last_finished - page_start) >> fs_info->sectorsize_bits,
> +			fs_info->sectors_per_page);
> +
> +		for_each_set_bit(bit, &bio_ctrl->submit_bitmap, bitmap_size)
> +			btrfs_mark_ordered_io_finished(inode, folio,
> +				page_start + (bit << fs_info->sectorsize_bits),
> +				fs_info->sectorsize, false);
>  		return ret;
> +	}
>  out:
>  	if (last_delalloc_end)
>  		delalloc_end = last_delalloc_end;
> @@ -1509,13 +1536,13 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>  
>  	bio_ctrl->wbc->nr_to_write--;
>  
> -done:
> -	if (ret) {
> +	if (ret)
>  		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
>  					       page_start, PAGE_SIZE, !ret);
> -		mapping_set_error(folio->mapping, ret);
> -	}
>  
> +done:
> +	if (ret < 0)
> +		mapping_set_error(folio->mapping, ret);
>  	/*
>  	 * Only unlock ranges that are submitted. As there can be some async
>  	 * submitted ranges inside the folio.
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c4997200dbb2..d41bb47d59fb 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2305,7 +2305,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
>  
>  out:
>  	if (ret < 0)
> -		btrfs_cleanup_ordered_extents(inode, locked_folio, start,
> +		btrfs_cleanup_ordered_extents(inode, NULL, start,
>  					      end - start + 1);
>  	return ret;
>  }
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed
  2024-12-12  6:13 ` [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed Qu Wenruo
@ 2025-01-08 22:24   ` Boris Burkov
  2025-01-09  2:56     ` Qu Wenruo
  2025-01-09  3:45     ` Qu Wenruo
  0 siblings, 2 replies; 28+ messages in thread
From: Boris Burkov @ 2025-01-08 22:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, stable

On Thu, Dec 12, 2024 at 04:43:56PM +1030, Qu Wenruo wrote:
> [BUG]
> If submit_one_sector() failed inside extent_writepage_io() for sector
> size < page size cases (e.g. 4K sector size and 64K page size), then
> we can hit a double ordered extent accounting error.
> 
> This should be very rare, as submit_one_sector() only fails when we
> fail to grab the extent map, and such an extent map should exist in
> memory and have been pinned.
> 
> [CAUSE]
> For example we have the following folio layout:
> 
>     0  4K          32K    48K   60K 64K
>     |//|           |//////|     |///|
> 
> Where |///| is the dirty range we need to write back. The 3 different
> dirty ranges are submitted for regular COW.
> 
> Now we hit the following sequence:
> 
> - submit_one_sector() returned 0 for [0, 4K)
> 
> - submit_one_sector() returned 0 for [32K, 48K)
> 
> - submit_one_sector() returned error for [60K, 64K)
> 
> - btrfs_mark_ordered_io_finished() called for the whole folio
>   This will mark the following ranges as finished:
>   * [0, 4K)
>   * [32K, 48K)
>     Both ranges have their IO already submitted, this cleanup will
>     lead to double accounting.
> 
>   * [60K, 64K)
>     That's the correct cleanup.
> 
> The only good news is that this error is only theoretical, as the target
> extent map is always pinned; thus we should grab it directly from
> memory, rather than reading it from disk.
> 
> [FIX]
> Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
> range, which can touch ranges we should not touch, move the error
> handling into extent_writepage_io().
> 
> This way we can clean up exactly the sectors that ought to be submitted
> but failed.
> 
> This provides much more accurate cleanup, avoiding the double accounting.

Analysis and fix both make sense to me. However, this one feels a lot
more fragile than the other one.

It relies on submit_one_sector being the only error path in
extent_writepage_io. Any future error in the loop would have to create a
shared "per sector" error handling goto in the loop I guess?

Not a hard "no", in the sense that I think the code is correct for now
(aside from my submit_one_bio question) but curious if we can give this
some more principled structure.

Thanks,
Boris

> 
> Cc: stable@vger.kernel.org # 5.15+
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
>  1 file changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 417c710c55ca..b6a4f1765b4c 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>  	unsigned long range_bitmap = 0;
>  	bool submitted_io = false;
> +	bool error = false;
>  	const u64 folio_start = folio_pos(folio);
>  	u64 cur;
>  	int bit;
> @@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>  			break;
>  		}
>  		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
> -		if (ret < 0)
> -			goto out;
> +		if (unlikely(ret < 0)) {
> +			submit_one_bio(bio_ctrl);

This submit_one_bio is confusing to me. submit_one_sector failed and the
subsequent comment says "there is no bio submitted" yet right here we
call submit_one_bio.

What is the meaning of it?

> +			/*
> +			 * Failed to grab the extent map which should be very rare.
> +			 * Since there is no bio submitted to finish the ordered
> +			 * extent, we have to manually finish this sector.
> +			 */
> +			btrfs_mark_ordered_io_finished(inode, folio, cur,
> +					fs_info->sectorsize, false);
> +			error = true;
> +			continue;
> +		}
>  		submitted_io = true;
>  	}
> -out:
> +
>  	/*
>  	 * If we didn't submitted any sector (>= i_size), folio dirty get
>  	 * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
> @@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>  	 *
>  	 * Here we set writeback and clear for the range. If the full folio
>  	 * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
> +	 *
> +	 * If we hit any error, the corresponding sector will still be dirty
> +	 * thus no need to clear PAGECACHE_TAG_DIRTY.
>  	 */

submitted_io is only used for this bit of logic, so you could consider
changing this logic by keeping a single variable for whether or not we
should go into this logic (naming it seems kind of annoying) and then
setting it in both the error and submitted_io paths. I think that
reduces headache in thinking about boolean logic, slightly.

> -	if (!submitted_io) {
> +	if (!submitted_io && !error) {
>  		btrfs_folio_set_writeback(fs_info, folio, start, len);
>  		btrfs_folio_clear_writeback(fs_info, folio, start, len);
>  	}
> @@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>  {
>  	struct inode *inode = folio->mapping->host;
>  	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> -	const u64 page_start = folio_pos(folio);
>  	int ret;
>  	size_t pg_offset;
>  	loff_t i_size = i_size_read(inode);
> @@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>  
>  	bio_ctrl->wbc->nr_to_write--;
>  
> -	if (ret)
> -		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
> -					       page_start, PAGE_SIZE, !ret);
> -
>  done:
>  	if (ret < 0)
>  		mapping_set_error(folio->mapping, ret);
> @@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
>  		if (ret == 1)
>  			goto next_page;
>  
> -		if (ret) {
> -			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
> -						       cur, cur_len, !ret);
> +		if (ret)
>  			mapping_set_error(mapping, ret);
> -		}
>  		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
>  		if (ret < 0)
>  			found_error = true;
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/9] btrfs: fix the error handling of submit_uncompressed_range()
  2024-12-12  6:13 ` [PATCH v2 3/9] btrfs: fix the error handling of submit_uncompressed_range() Qu Wenruo
@ 2025-01-08 22:33   ` Boris Burkov
  0 siblings, 0 replies; 28+ messages in thread
From: Boris Burkov @ 2025-01-08 22:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Dec 12, 2024 at 04:43:57PM +1030, Qu Wenruo wrote:
> [BUG]
> If btrfs failed to compress the range, or cannot reserve a large enough
> data extent (e.g. too fragmented free space), btrfs will fall back to
> submit_uncompressed_range().
> 
> But inside submit_uncompressed_range(), run_delalloc_cow() can also fail
> due to -ENOSPC or whatever other errors.
> 
> In that case there are 3 bugs in the error handling:
> 
> 1) Double freeing of the same ordered extent
>    Which can lead to a crash due to ordered extent double accounting
> 
> 2) Start/end writeback without updating the subpage writeback bitmap
> 
> 3) Unlock the folio without clearing the subpage lock bitmap
> 
> Both bugs 2) and 3) will crash the kernel if the btrfs block size is
> smaller than the folio size, as the next time the folio gets
> writeback/lock updates, subpage will find the bitmap already has the
> range set, triggering an ASSERT().
> 
> [CAUSE]
> Bug 1) happens in the following call chain:
> 
>   submit_uncompressed_range()
>   |- run_delalloc_cow()
>   |  |- cow_file_range()
>   |     |- btrfs_reserve_extent()
>   |        Failed with -ENOSPC or whatever error
>   |
>   |- btrfs_cleanup_ordered_extents()
>   |  |- btrfs_mark_ordered_io_finished()
>   |     Which cleans all the ordered extents in the async_extent range.
>   |
>   |- btrfs_mark_ordered_io_finished()
>      Which cleans the folio range.
> 
> The finished ordered extents may not be immediately removed from the
> ordered io tree, as they are removed inside a work queue.
> 
> So the second btrfs_mark_ordered_io_finished() may find the finished but
> not-yet-removed ordered extents, and double free them.
> 
> Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage
> compatible, as it uses a fixed folio_pos() with PAGE_SIZE, which can
> cover other ordered extents.
> 
> Bugs 2) and 3) are more straightforward: btrfs just calls folio_unlock(),
> folio_start_writeback() and folio_end_writeback(), rather than the
> helpers which handle subpage cases.
> 
> [FIX]
> For bug 1) since the first btrfs_cleanup_ordered_extents() call is
> handling the whole range, we should not do the second
> btrfs_mark_ordered_io_finished() call.
> 
> And for the first btrfs_cleanup_ordered_extents(), we no longer need to
> pass the @locked_page parameter, as we are already in the async extent
> context and thus will never rely on the error handling inside
> btrfs_run_delalloc_range().
> 
> So just let btrfs_cleanup_ordered_extents() handle every folio
> equally.
> 
> For bug 2) we should not even call
> folio_start_writeback()/folio_end_writeback() anymore.
> As the error handling protocol, cow_file_range() should clear the dirty
> flag and start/finish the writeback for the whole range passed in.
> 
> For bug 3) just change the folio_unlock() to the btrfs_folio_end_lock()
> helper.
> 

Reviewed-by: Boris Burkov <boris@bur.io>

> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/inode.c | 16 ++++------------
>  1 file changed, 4 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d41bb47d59fb..5ba8d044757b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1129,19 +1129,11 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
>  			       &wbc, false);
>  	wbc_detach_inode(&wbc);
>  	if (ret < 0) {
> -		btrfs_cleanup_ordered_extents(inode, locked_folio,
> +		btrfs_cleanup_ordered_extents(inode, NULL,
>  					      start, end - start + 1);
> -		if (locked_folio) {
> -			const u64 page_start = folio_pos(locked_folio);
> -
> -			folio_start_writeback(locked_folio);
> -			folio_end_writeback(locked_folio);
> -			btrfs_mark_ordered_io_finished(inode, locked_folio,
> -						       page_start, PAGE_SIZE,
> -						       !ret);
> -			mapping_set_error(locked_folio->mapping, ret);
> -			folio_unlock(locked_folio);
> -		}
> +		if (locked_folio)
> +			btrfs_folio_end_lock(inode->root->fs_info, locked_folio,
> +					     start, async_extent->ram_size);
>  	}
>  }
>  
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 6/9] btrfs: subpage: fix the bitmap dump for the locked flags
  2024-12-12  6:14 ` [PATCH v2 6/9] btrfs: subpage: fix the bitmap dump for the locked flags Qu Wenruo
@ 2025-01-08 22:45   ` Boris Burkov
  0 siblings, 0 replies; 28+ messages in thread
From: Boris Burkov @ 2025-01-08 22:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Dec 12, 2024 at 04:44:00PM +1030, Qu Wenruo wrote:
> We're dumping the locked bitmap into the @checked_bitmap variable,
> causing incorrect values during debugging.
> 
> Thankfully, even during my development I haven't hit a case where I need
> to dump the locked bitmap.
> But for the sake of consistency, fix it by dumping the locked bitmap
> into the @locked_bitmap variable for output.
> 
> Fixes: 75258f20fb70 ("btrfs: subpage: dump extra subpage bitmaps for debug")
Reviewed-by: Boris Burkov <boris@bur.io>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/subpage.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 8c68059ac1b0..03d7bfc042e2 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -716,6 +716,7 @@ void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
>  	unsigned long writeback_bitmap;
>  	unsigned long ordered_bitmap;
>  	unsigned long checked_bitmap;
> +	unsigned long locked_bitmap;
>  	unsigned long flags;
>  
>  	ASSERT(folio_test_private(folio) && folio_get_private(folio));
> @@ -728,15 +729,16 @@ void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
>  	GET_SUBPAGE_BITMAP(subpage, fs_info, writeback, &writeback_bitmap);
>  	GET_SUBPAGE_BITMAP(subpage, fs_info, ordered, &ordered_bitmap);
>  	GET_SUBPAGE_BITMAP(subpage, fs_info, checked, &checked_bitmap);
> -	GET_SUBPAGE_BITMAP(subpage, fs_info, locked, &checked_bitmap);
> +	GET_SUBPAGE_BITMAP(subpage, fs_info, locked, &locked_bitmap);
>  	spin_unlock_irqrestore(&subpage->lock, flags);
>  
>  	dump_page(folio_page(folio, 0), "btrfs subpage dump");
>  	btrfs_warn(fs_info,
> -"start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl dirty=%*pbl writeback=%*pbl ordered=%*pbl checked=%*pbl",
> +"start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl dirty=%*pbl locked=%*pbl writeback=%*pbl ordered=%*pbl checked=%*pbl",
>  		    start, len, folio_pos(folio),
>  		    sectors_per_page, &uptodate_bitmap,
>  		    sectors_per_page, &dirty_bitmap,
> +		    sectors_per_page, &locked_bitmap,
>  		    sectors_per_page, &writeback_bitmap,
>  		    sectors_per_page, &ordered_bitmap,
>  		    sectors_per_page, &checked_bitmap);
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 7/9] btrfs: subpage: dump the involved bitmap when ASSERT() failed
  2024-12-12  6:14 ` [PATCH v2 7/9] btrfs: subpage: dump the involved bitmap when ASSERT() failed Qu Wenruo
@ 2025-01-08 22:46   ` Boris Burkov
  0 siblings, 0 replies; 28+ messages in thread
From: Boris Burkov @ 2025-01-08 22:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Dec 12, 2024 at 04:44:01PM +1030, Qu Wenruo wrote:
> For btrfs_folio_assert_not_dirty() and btrfs_folio_set_lock(), we call
> bitmap_test_range_all_zero() to ensure the involved range has no bit
> set.
> 
> However, with my recently enhanced delalloc range error handling, I'm
> hitting the ASSERT() inside btrfs_folio_set_lock(), and am wondering if
> some error handling path does not properly clean up the locked bitmap
> but directly unlocks the page.
> 
> So add some extra dumping around the ASSERTs to dump the involved bitmap
> to help debugging.
> 
Reviewed-by: Boris Burkov <boris@bur.io>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/subpage.c | 41 ++++++++++++++++++++++++++++++-----------
>  1 file changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 03d7bfc042e2..d692bc34a3af 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -635,6 +635,28 @@ IMPLEMENT_BTRFS_PAGE_OPS(ordered, folio_set_ordered, folio_clear_ordered,
>  IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked,
>  			 folio_test_checked);
>  
> +#define GET_SUBPAGE_BITMAP(subpage, fs_info, name, dst)			\
> +{									\
> +	const int sectors_per_page = fs_info->sectors_per_page;		\
> +									\
> +	ASSERT(sectors_per_page < BITS_PER_LONG);			\
> +	*dst = bitmap_read(subpage->bitmaps,				\
> +			   sectors_per_page * btrfs_bitmap_nr_##name,	\
> +			   sectors_per_page);				\
> +}
> +
> +#define subpage_dump_bitmap(fs_info, folio, name, start, len)		\
> +{									\
> +	struct btrfs_subpage *subpage = folio_get_private(folio);	\
> +	unsigned long bitmap;						\
> +									\
> +	GET_SUBPAGE_BITMAP(subpage, fs_info, name, &bitmap);		\
> +	btrfs_warn(fs_info,						\
> +	"dumpping bitmap start=%llu len=%u folio=%llu" #name "_bitmap=%*pbl", \
> +		   start, len, folio_pos(folio),			\
> +		   fs_info->sectors_per_page, &bitmap);			\
> +}
> +
>  /*
>   * Make sure not only the page dirty bit is cleared, but also subpage dirty bit
>   * is cleared.
> @@ -660,6 +682,10 @@ void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
>  	subpage = folio_get_private(folio);
>  	ASSERT(subpage);
>  	spin_lock_irqsave(&subpage->lock, flags);
> +	if (unlikely(!bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits))) {
> +		subpage_dump_bitmap(fs_info, folio, dirty, start, len);
> +		ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
> +	}
>  	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
>  	spin_unlock_irqrestore(&subpage->lock, flags);
>  }
> @@ -689,23 +715,16 @@ void btrfs_folio_set_lock(const struct btrfs_fs_info *fs_info,
>  	nbits = len >> fs_info->sectorsize_bits;
>  	spin_lock_irqsave(&subpage->lock, flags);
>  	/* Target range should not yet be locked. */
> -	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
> +	if (unlikely(!bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits))) {
> +		subpage_dump_bitmap(fs_info, folio, locked, start, len);
> +		ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
> +	}
>  	bitmap_set(subpage->bitmaps, start_bit, nbits);
>  	ret = atomic_add_return(nbits, &subpage->nr_locked);
>  	ASSERT(ret <= fs_info->sectors_per_page);
>  	spin_unlock_irqrestore(&subpage->lock, flags);
>  }
>  
> -#define GET_SUBPAGE_BITMAP(subpage, fs_info, name, dst)			\
> -{									\
> -	const int sectors_per_page = fs_info->sectors_per_page;		\
> -									\
> -	ASSERT(sectors_per_page < BITS_PER_LONG);			\
> -	*dst = bitmap_read(subpage->bitmaps,				\
> -			   sectors_per_page * btrfs_bitmap_nr_##name,	\
> -			   sectors_per_page);				\
> -}
> -
>  void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
>  				      struct folio *folio, u64 start, u32 len)
>  {
> -- 
> 2.47.1
> 


* Re: [PATCH v2 8/9] btrfs: add extra error messages for delalloc range related errors
  2024-12-12  6:14 ` [PATCH v2 8/9] btrfs: add extra error messages for delalloc range related errors Qu Wenruo
@ 2025-01-08 22:47   ` Boris Burkov
  0 siblings, 0 replies; 28+ messages in thread
From: Boris Burkov @ 2025-01-08 22:47 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Dec 12, 2024 at 04:44:02PM +1030, Qu Wenruo wrote:
> All the error handling bugs I hit so far are -ENOSPC from one of:
> 
> - cow_file_range()
> - run_delalloc_nocow()
> - submit_uncompressed_range()
> 
> Previously when those functions failed, there was no error message at
> all, making debugging much harder.
> 
> So here we introduce extra error messages for:
> 
> - cow_file_range()
> - run_delalloc_nocow()
> - submit_uncompressed_range()
> - writepage_delalloc() when btrfs_run_delalloc_range() failed
> - extent_writepage() when extent_writepage_io() failed
> 
> One example of the new debug error messages is the following one:
> 
>  run fstests generic/750 at 2024-12-08 12:41:41
>  BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600)
>  BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3
>  BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm
>  BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536
>  BTRFS info (device dm-4): using free-space-tree
>  BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental
>  BTRFS info (device dm-4): checking UUID tree
>  BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28
>  BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28
>  BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28

This looks great, thanks. Curious to see how much this shows up in our
prod :)

Reviewed-by: Boris Burkov <boris@bur.io>

> 
> Which shows an error from cow_file_range() which is called inside a
> nocow write attempt, along with the extra bitmap from
> writepage_delalloc().
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 16 ++++++++++++++++
>  fs/btrfs/inode.c     | 14 +++++++++++++-
>  fs/btrfs/subpage.c   |  3 ++-
>  3 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index b6a4f1765b4c..f4fb1fb3454a 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1254,6 +1254,15 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>  						       wbc);
>  			if (ret >= 0)
>  				last_finished = found_start + found_len;
> +			if (unlikely(ret < 0))
> +				btrfs_err_rl(fs_info,
> +"failed to run delalloc range, root=%lld ino=%llu folio=%llu submit_bitmap=%*pbl start=%llu len=%u: %d",
> +					     inode->root->root_key.objectid,
> +					     btrfs_ino(inode),
> +					     folio_pos(folio),
> +					     fs_info->sectors_per_page,
> +					     &bio_ctrl->submit_bitmap,
> +					     found_start, found_len, ret);
>  		} else {
>  			/*
>  			 * We've hit an error during previous delalloc range,
> @@ -1546,6 +1555,13 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>  				  PAGE_SIZE, bio_ctrl, i_size);
>  	if (ret == 1)
>  		return 0;
> +	if (ret < 0)
> +		btrfs_err_rl(fs_info,
> +"failed to submit blocks, root=%lld inode=%llu folio=%llu submit_bitmap=%*pbl: %d",
> +			     BTRFS_I(inode)->root->root_key.objectid,
> +			     btrfs_ino(BTRFS_I(inode)),
> +			     folio_pos(folio), fs_info->sectors_per_page,
> +			     &bio_ctrl->submit_bitmap, ret);
>  
>  	bio_ctrl->wbc->nr_to_write--;
>  
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index bae8aceb3eae..a88cba85bf40 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1134,6 +1134,10 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
>  		if (locked_folio)
>  			btrfs_folio_end_lock(inode->root->fs_info, locked_folio,
>  					     start, async_extent->ram_size);
> +		btrfs_err_rl(inode->root->fs_info,
> +		"%s failed, root=%llu inode=%llu start=%llu len=%llu: %d",
> +			     __func__, btrfs_root_id(inode->root),
> +			     btrfs_ino(inode), start, async_extent->ram_size, ret);
>  	}
>  }
>  
> @@ -1246,7 +1250,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>  	free_async_extent_pages(async_extent);
>  	if (async_chunk->blkcg_css)
>  		kthread_associate_blkcg(NULL);
> -	btrfs_debug(fs_info,
> +	btrfs_debug_rl(fs_info,
>  "async extent submission failed root=%lld inode=%llu start=%llu len=%llu ret=%d",
>  		    btrfs_root_id(root), btrfs_ino(inode), start,
>  		    async_extent->ram_size, ret);
> @@ -1580,6 +1584,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  		btrfs_qgroup_free_data(inode, NULL, start + cur_alloc_size,
>  				       end - start - cur_alloc_size + 1, NULL);
>  	}
> +	btrfs_err_rl(fs_info,
> +		     "%s failed, root=%llu inode=%llu start=%llu len=%llu: %d",
> +		     __func__, btrfs_root_id(inode->root),
> +		     btrfs_ino(inode), orig_start, end + 1 - orig_start, ret);
>  	return ret;
>  }
>  
> @@ -2325,6 +2333,10 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>  		btrfs_qgroup_free_data(inode, NULL, cur_offset, end - cur_offset + 1, NULL);
>  	}
>  	btrfs_free_path(path);
> +	btrfs_err_rl(fs_info,
> +		     "%s failed, root=%llu inode=%llu start=%llu len=%llu: %d",
> +		     __func__, btrfs_root_id(inode->root),
> +		     btrfs_ino(inode), start, end + 1 - start, ret);
>  	return ret;
>  }
>  
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index d692bc34a3af..7f47bc61389c 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -652,7 +652,7 @@ IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked,
>  									\
>  	GET_SUBPAGE_BITMAP(subpage, fs_info, name, &bitmap);		\
>  	btrfs_warn(fs_info,						\
> -	"dumpping bitmap start=%llu len=%u folio=%llu" #name "_bitmap=%*pbl", \
> +	"dumpping bitmap start=%llu len=%u folio=%llu " #name "_bitmap=%*pbl", \
>  		   start, len, folio_pos(folio),			\
>  		   fs_info->sectors_per_page, &bitmap);			\
>  }
> @@ -717,6 +717,7 @@ void btrfs_folio_set_lock(const struct btrfs_fs_info *fs_info,
>  	/* Target range should not yet be locked. */
>  	if (unlikely(!bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits))) {
>  		subpage_dump_bitmap(fs_info, folio, locked, start, len);
> +		btrfs_warn(fs_info, "nr_locked=%u\n", atomic_read(&subpage->nr_locked));
>  		ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
>  	}
>  	bitmap_set(subpage->bitmaps, start_bit, nbits);
> -- 
> 2.47.1
> 


* Re: [PATCH v2 9/9] btrfs: remove the unused @locked_folio parameter from btrfs_cleanup_ordered_extents()
  2024-12-12  6:14 ` [PATCH v2 9/9] btrfs: remove the unused @locked_folio parameter from btrfs_cleanup_ordered_extents() Qu Wenruo
@ 2025-01-08 22:48   ` Boris Burkov
  0 siblings, 0 replies; 28+ messages in thread
From: Boris Burkov @ 2025-01-08 22:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Dec 12, 2024 at 04:44:03PM +1030, Qu Wenruo wrote:
> The function btrfs_cleanup_ordered_extents() is only called in the
> error handling path, and the last caller with a @locked_folio parameter
> was removed to fix a bug in the btrfs_run_delalloc_range() error
> handling.
> 
> There is no need to pass the @locked_folio parameter anymore.

I was wondering this while reviewing the others. Nice improvement, this
confused me once earlier this year during some reservation debugging.

Reviewed-by: Boris Burkov <boris@bur.io>

> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/inode.c | 44 ++------------------------------------------
>  1 file changed, 2 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a88cba85bf40..a5d33ebf90d4 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -393,34 +393,13 @@ void btrfs_inode_unlock(struct btrfs_inode *inode, unsigned int ilock_flags)
>   * extent (btrfs_finish_ordered_io()).
>   */
>  static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
> -						 struct folio *locked_folio,
>  						 u64 offset, u64 bytes)
>  {
>  	unsigned long index = offset >> PAGE_SHIFT;
>  	unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
> -	u64 page_start = 0, page_end = 0;
>  	struct folio *folio;
>  
> -	if (locked_folio) {
> -		page_start = folio_pos(locked_folio);
> -		page_end = page_start + folio_size(locked_folio) - 1;
> -	}
> -
>  	while (index <= end_index) {
> -		/*
> -		 * For locked page, we will call btrfs_mark_ordered_io_finished
> -		 * through btrfs_mark_ordered_io_finished() on it
> -		 * in run_delalloc_range() for the error handling, which will
> -		 * clear page Ordered and run the ordered extent accounting.
> -		 *
> -		 * Here we can't just clear the Ordered bit, or
> -		 * btrfs_mark_ordered_io_finished() would skip the accounting
> -		 * for the page range, and the ordered extent will never finish.
> -		 */
> -		if (locked_folio && index == (page_start >> PAGE_SHIFT)) {
> -			index++;
> -			continue;
> -		}
>  		folio = filemap_get_folio(inode->vfs_inode.i_mapping, index);
>  		index++;
>  		if (IS_ERR(folio))
> @@ -436,23 +415,6 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
>  		folio_put(folio);
>  	}
>  
> -	if (locked_folio) {
> -		/* The locked page covers the full range, nothing needs to be done */
> -		if (bytes + offset <= page_start + folio_size(locked_folio))
> -			return;
> -		/*
> -		 * In case this page belongs to the delalloc range being
> -		 * instantiated then skip it, since the first page of a range is
> -		 * going to be properly cleaned up by the caller of
> -		 * run_delalloc_range
> -		 */
> -		if (page_start >= offset && page_end <= (offset + bytes - 1)) {
> -			bytes = offset + bytes - folio_pos(locked_folio) -
> -				folio_size(locked_folio);
> -			offset = folio_pos(locked_folio) + folio_size(locked_folio);
> -		}
> -	}
> -
>  	return btrfs_mark_ordered_io_finished(inode, NULL, offset, bytes, false);
>  }
>  
> @@ -1129,8 +1091,7 @@ static void submit_uncompressed_range(struct btrfs_inode *inode,
>  			       &wbc, false);
>  	wbc_detach_inode(&wbc);
>  	if (ret < 0) {
> -		btrfs_cleanup_ordered_extents(inode, NULL,
> -					      start, end - start + 1);
> +		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>  		if (locked_folio)
>  			btrfs_folio_end_lock(inode->root->fs_info, locked_folio,
>  					     start, async_extent->ram_size);
> @@ -2387,8 +2348,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
>  
>  out:
>  	if (ret < 0)
> -		btrfs_cleanup_ordered_extents(inode, NULL, start,
> -					      end - start + 1);
> +		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>  	return ret;
>  }
>  
> -- 
> 2.47.1
> 


* Re: [PATCH v2 1/9] btrfs: fix double accounting race when btrfs_run_delalloc_range() failed
  2025-01-08 21:52   ` Boris Burkov
@ 2025-01-09  2:45     ` Qu Wenruo
  2025-01-09  4:47       ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2025-01-09  2:45 UTC (permalink / raw)
  To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs, stable



On 2025/1/9 08:22, Boris Burkov wrote:
> On Thu, Dec 12, 2024 at 04:43:55PM +1030, Qu Wenruo wrote:
>> [BUG]
>> When running btrfs with block size (4K) smaller than page size (64K,
>> aarch64), there is a very high chance to crash the kernel at
>> generic/750, with the following messages:
>> (before the call traces, there are 3 extra debug messages added)
>>
>>   BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental
>>   BTRFS info (device dm-3): checking UUID tree
>>   hrtimer: interrupt took 5451385 ns
>>   BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28
>>   BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28
>>   BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
>>   ------------[ cut here ]------------
>>   WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs]
>>   CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G           OE      6.13.0-rc1-custom+ #89
>>   Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>>   Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
>>   Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
>>   pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
>>   lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
>>   Call trace:
>>    can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
>>    can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
>>    btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
>>    extent_writepage+0x10c/0x3b8 [btrfs]
>>    extent_write_cache_pages+0x21c/0x4e8 [btrfs]
>>    btrfs_writepages+0x94/0x160 [btrfs]
>>    do_writepages+0x74/0x190
>>    filemap_fdatawrite_wbc+0x74/0xa0
>>    start_delalloc_inodes+0x17c/0x3b0 [btrfs]
>>    btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
>>    shrink_delalloc+0x11c/0x280 [btrfs]
>>    flush_space+0x288/0x328 [btrfs]
>>    btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
>>    process_one_work+0x228/0x680
>>    worker_thread+0x1bc/0x360
>>    kthread+0x100/0x118
>>    ret_from_fork+0x10/0x20
>>   ---[ end trace 0000000000000000 ]---
>>   BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
>>   BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
>>   Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
>>   BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
>>   CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G        W  OE      6.13.0-rc1-custom+ #89
>>   Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
>>   Workqueue:  btrfs_work_helper [btrfs] (btrfs-endio-write)
>>   pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>   pc : process_one_work+0x110/0x680
>>   lr : worker_thread+0x1bc/0x360
>>   Call trace:
>>    process_one_work+0x110/0x680 (P)
>>    worker_thread+0x1bc/0x360 (L)
>>    worker_thread+0x1bc/0x360
>>    kthread+0x100/0x118
>>    ret_from_fork+0x10/0x20
>>   Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
>>   ---[ end trace 0000000000000000 ]---
>>   Kernel panic - not syncing: Oops: Fatal exception
>>   SMP: stopping secondary CPUs
>>   SMP: failed to stop secondary CPUs 2-3
>>   Dumping ftrace buffer:
>>      (ftrace buffer empty)
>>   Kernel Offset: 0x275bb9540000 from 0xffff800080000000
>>   PHYS_OFFSET: 0xffff8fbba0000000
>>   CPU features: 0x100,00000070,00801250,8201720b
>>
>> [CAUSE]
>> The above warning is triggered immediately after the delalloc range
>> failure, this happens in the following sequence:
>>
>> - Range [1568K, 1636K) is dirty
>>
>>     1536K  1568K     1600K    1636K  1664K
>>     |      |/////////|////////|      |
>>
>>    Where 1536K, 1600K and 1664K are page boundaries (64K page size)
>>
>> - Enter extent_writepage() for page 1536K
>>
>> - Enter run_delalloc_nocow() with locked page 1536K and range
>>    [1568K, 1636K)
>>    This is because the inode has preallocated extents.
>>
>> - Enter cow_file_range() with locked page 1536K and range
>>    [1568K, 1636K)
>>
>> - btrfs_reserve_extent() only reserved two extents
>>    The main loop of cow_file_range() only reserved two data extents,
>>
>>    Now we have:
>>
>>     1536K  1568K        1600K    1636K  1664K
>>     |      |<-->|<--->|/|///////|      |
>>                 1584K  1596K
>>    Range [1568K, 1596K) has ordered extent reserved.
>>
>> - btrfs_reserve_extent() failed inside cow_file_range() for file offset
>>    1596K
>>    This is already a bug in our space reservation code, but for now let's
>>    focus on the error handling path.
>>
>>    Now cow_file_range() returned -ENOSPC.
>>
>> - btrfs_run_delalloc_range() do error cleanup <<< ROOT CAUSE
>>    Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range
>>    [1568K, 1636K)
>>
>>    Function btrfs_cleanup_ordered_extents() normally needs to skip the
>>    ranges inside the folio, as they will normally be cleaned up by
>>    extent_writepage().
>>
>>    Such split error handling is already problematic in the first place.
>>
>>    What's worse is the folio range skipping itself, which does not take
>>    subpage cases into consideration at all; it will only skip the range
>>    if the page start >= the range start.
>>    In our case, the page start < the range start, since for subpage cases
>>    we can have delalloc ranges inside the folio but not covering the
>>    folio.
>>
>>    So it doesn't skip the page range at all.
>>    This means all the ordered extents, both [1568K, 1584K) and
>>    [1584K, 1596K) will be marked as IOERR.
>>
>>    And those two ordered extents have no more pending ios, it is marked
>>    finished, and *QUEUED* to be deleted from the io tree.
>>
>> - extent_writepage() do error cleanup
>>    Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).
>>
>>    Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the
>>    deletion from io tree is async, it may or may not happen at this
>>    timing.
>>
>>    If the ranges are not yet removed, we will do double cleaning on those
>>    ranges, which triggers the above ordered extent warnings.
>>
>> In theory there are other bugs, like the cleanup in extent_writepage()
>> can cause double accounting on ranges that are submitted async
>> (compression for example).
>>
>> But that's much harder to trigger because normally we do not mix regular
>> and compression delalloc ranges.
>>
>> [FIX]
>> The folio range split is already buggy and not subpage compatible; it
>> was introduced a long time ago, when subpage support was not even a
>> consideration.
>>
>> So instead of splitting the ordered extents cleanup into the folio range
>> and out of folio range, do all the cleanup inside writepage_delalloc().
>>
>> - Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
>>    btrfs_run_delalloc_range()
>>
>> - Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc()
>>    failed
>>
>>    So all ordered extents are only cleaned up by
>>    btrfs_run_delalloc_range().
>>
>> - Handle the ranges that already have ordered extents allocated
>>    If part of the folio already has ordered extent allocated, and
>>    btrfs_run_delalloc_range() failed, we also need to cleanup that range.
>>
>> Now we have a concentrated error handling for ordered extents during
>> btrfs_run_delalloc_range().
>
> Great investigation and writeup, thanks!

Thanks a lot for the review!

>
> The explanation and fix both make sense to me. I traced the change in
> error handling and I see how we are avoiding double ending the
> ordered_extent. So with that said, feel free to add:
> Reviewed-by: Boris Burkov <boris@bur.io>
>
> However, I would like to request one thing, if I may.
> While this is all still relatively fresh in your mind, could you please
> document the intended behavior of the various functions (at least the
> ones you modify/reason about) with regards to:
> - cleanup state of the various objects involved like ordered_extents
>    and subpages (e.g., writepage_delalloc cleans up ordered extents, so
>    callers should not, etc.)

The subpage one should not be considered something special; it's really
just an enhanced version of the page flags for subpage cases.

Thus I'll not explicitly mention the subpage bitmap, but directly
mention the involved flags (in this particular case, the folio Ordered
and Locked flags).
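To make that concrete, here is a userspace sketch (not kernel code; the
names and the 16-sectors-per-page layout are just for illustration) of
the idea: one unsigned long holds one group of per-sector bits per page
flag, and a folio-level flag is "set" if any bit in its group is set:

```c
#include <assert.h>
#include <stdbool.h>

enum { SECTORS_PER_PAGE = 16 };          /* e.g. 64K page / 4K sector */
enum bitmap_nr { NR_DIRTY = 0, NR_LOCKED = 1 };

/* Set @nsectors bits of the @nr group, starting at @start_sector. */
static void bitmap_set_range(unsigned long *bm, enum bitmap_nr nr,
                             int start_sector, int nsectors)
{
	int base = nr * SECTORS_PER_PAGE + start_sector;

	for (int i = 0; i < nsectors; i++)
		*bm |= 1UL << (base + i);
}

/* True if no bit of the @nr group is set inside the given range. */
static bool bitmap_range_all_zero(const unsigned long *bm, enum bitmap_nr nr,
                                  int start_sector, int nsectors)
{
	int base = nr * SECTORS_PER_PAGE + start_sector;

	for (int i = 0; i < nsectors; i++)
		if (*bm & (1UL << (base + i)))
			return false;
	return true;
}
```

The real btrfs_subpage structure works the same way, just with more
flag groups and spinlock protection.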

> - return values (e.g., when precisely does btrfs_run_delalloc_range
>    return >= 0 ?)

My bad, I should update the comment in commit d034cdb4cc8a ("btrfs: lock
subpage ranges in one go for writepage_delalloc()").

Still better to fix it here before it's too late.

> - anything else you think would be helpful for reasoning about these
>    functions in an abstract way while you are at it.
>
> That request is obviously optional for landing these fixes, but I really
> think it would help if we went through the bother every time we
> deciphered one of these undocumented paths. A restatement of your best
> understanding of the behavior now will really pay off for the next
> person reading this code :)
>
> Thanks,
> Boris
>
>>
>> Cc: stable@vger.kernel.org # 5.15+
>> Fixes: d1051d6ebf8e ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents")
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 37 ++++++++++++++++++++++++++++++++-----
>>   fs/btrfs/inode.c     |  2 +-
>>   2 files changed, 33 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 9725ff7f274d..417c710c55ca 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1167,6 +1167,12 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>>   	 * last delalloc end.
>>   	 */
>>   	u64 last_delalloc_end = 0;
>> +	/*
>> +	 * Save the last successfully ran delalloc range end (exclusive).
>> +	 * This is for error handling to avoid ranges with ordered extent created
>> +	 * but no IO will be submitted due to error.
>> +	 */
>
> nit: last_finished what? I feel this name or comment could use some
> extra work.

I can enhance it to @last_finished_delalloc_end and update the comment.
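For what it's worth, the bookkeeping can be sketched in a few lines of
userspace C (illustrative names only, not the btrfs API): remember the
end of the last successfully run delalloc range, and on error only the
submitted sectors below that end have an ordered extent to clean up:

```c
#include <assert.h>

enum { SECTOR_SIZE = 4096, SECTORS_PER_PAGE = 16 };

/*
 * Count the sectors that would get the manual
 * btrfs_mark_ordered_io_finished()-style cleanup: set in
 * @submit_bitmap and below @last_finished_end (exclusive).
 */
static int cleanup_sectors(unsigned long submit_bitmap,
                           unsigned long page_start,
                           unsigned long last_finished_end)
{
	int nr = (last_finished_end - page_start) / SECTOR_SIZE;
	int cleaned = 0;

	if (nr > SECTORS_PER_PAGE)
		nr = SECTORS_PER_PAGE;
	for (int bit = 0; bit < nr; bit++)
		if (submit_bitmap & (1UL << bit))
			cleaned++;	/* this sector's OE gets finished */
	return cleaned;
}
```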

Thanks,
Qu

>
>> +	u64 last_finished = page_start;
>>   	u64 delalloc_start = page_start;
>>   	u64 delalloc_end = page_end;
>>   	u64 delalloc_to_write = 0;
>> @@ -1235,11 +1241,19 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>>   			found_len = last_delalloc_end + 1 - found_start;
>>
>>   		if (ret >= 0) {
>> +			/*
>> +			 * Some delalloc range may be created by previous folios.
>> +			 * Thus we still need to clean those range up during error
>> +			 * handling.
>> +			 */
>> +			last_finished = found_start;
>>   			/* No errors hit so far, run the current delalloc range. */
>>   			ret = btrfs_run_delalloc_range(inode, folio,
>>   						       found_start,
>>   						       found_start + found_len - 1,
>>   						       wbc);
>> +			if (ret >= 0)
>> +				last_finished = found_start + found_len;
>>   		} else {
>>   			/*
>>   			 * We've hit an error during previous delalloc range,
>> @@ -1274,8 +1288,21 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>>
>>   		delalloc_start = found_start + found_len;
>>   	}
>> -	if (ret < 0)
>> +	/*
>> +	 * It's possible we have some ordered extents created before we hit
>> +	 * an error, cleanup non-async successfully created delalloc ranges.
>> +	 */
>> +	if (unlikely(ret < 0)) {
>> +		unsigned int bitmap_size = min(
>> +			(last_finished - page_start) >> fs_info->sectorsize_bits,
>> +			fs_info->sectors_per_page);
>> +
>> +		for_each_set_bit(bit, &bio_ctrl->submit_bitmap, bitmap_size)
>> +			btrfs_mark_ordered_io_finished(inode, folio,
>> +				page_start + (bit << fs_info->sectorsize_bits),
>> +				fs_info->sectorsize, false);
>>   		return ret;
>> +	}
>>   out:
>>   	if (last_delalloc_end)
>>   		delalloc_end = last_delalloc_end;
>> @@ -1509,13 +1536,13 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>>
>>   	bio_ctrl->wbc->nr_to_write--;
>>
>> -done:
>> -	if (ret) {
>> +	if (ret)
>>   		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
>>   					       page_start, PAGE_SIZE, !ret);
>> -		mapping_set_error(folio->mapping, ret);
>> -	}
>>
>> +done:
>> +	if (ret < 0)
>> +		mapping_set_error(folio->mapping, ret);
>>   	/*
>>   	 * Only unlock ranges that are submitted. As there can be some async
>>   	 * submitted ranges inside the folio.
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index c4997200dbb2..d41bb47d59fb 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -2305,7 +2305,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
>>
>>   out:
>>   	if (ret < 0)
>> -		btrfs_cleanup_ordered_extents(inode, locked_folio, start,
>> +		btrfs_cleanup_ordered_extents(inode, NULL, start,
>>   					      end - start + 1);
>>   	return ret;
>>   }
>> --
>> 2.47.1
>>
>



* Re: [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed
  2025-01-08 22:24   ` Boris Burkov
@ 2025-01-09  2:56     ` Qu Wenruo
  2025-01-09  3:45     ` Qu Wenruo
  1 sibling, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2025-01-09  2:56 UTC (permalink / raw)
  To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs, stable



On 2025/1/9 08:54, Boris Burkov wrote:
> On Thu, Dec 12, 2024 at 04:43:56PM +1030, Qu Wenruo wrote:
>> [BUG]
>> If submit_one_sector() failed inside extent_writepage_io() for sector
>> size < page size cases (e.g. 4K sector size and 64K page size), then
>> we can hit double ordered extent accounting error.
>>
>> This should be very rare, as submit_one_sector() only fails when we
>> failed to grab the extent map, and such extent map should exist inside
>> the memory and have been pinned.
>>
>> [CAUSE]
>> For example we have the following folio layout:
>>
>>      0  4K          32K    48K   60K 64K
>>      |//|           |//////|     |///|
>>
>> Where |///| is the dirty range we need to writeback. The 3 different
>> dirty ranges are submitted for regular COW.
>>
>> Now we hit the following sequence:
>>
>> - submit_one_sector() returned 0 for [0, 4K)
>>
>> - submit_one_sector() returned 0 for [32K, 48K)
>>
>> - submit_one_sector() returned error for [60K, 64K)
>>
>> - btrfs_mark_ordered_io_finished() called for the whole folio
>>    This will mark the following ranges as finished:
>>    * [0, 4K)
>>    * [32K, 48K)
>>      Both ranges have their IO already submitted, so this cleanup will
>>      lead to double accounting.
>>
>>    * [60K, 64K)
>>      That's the correct cleanup.
>>
>> The only good news is, this error is only theoretical, as the target
>> extent map is always pinned, thus we should grab it directly from
>> memory rather than reading it from disk.
>>
>> [FIX]
>> Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
>> range, which can touch ranges we should not touch, move the error
>> handling inside extent_writepage_io().
>>
>> So that we can clean up exactly the sectors that ought to be submitted
>> but failed.
>>
>> This provides a much more accurate cleanup, avoiding the double
>> accounting.
>
> Analysis and fix both make sense to me. However, this one feels a lot
> more fragile than the other one.
>
> It relies on submit_one_sector being the only error path in
> extent_writepage_io. Any future error in the loop would have to create a
> shared "per sector" error handling goto in the loop I guess?

In the long run I'm planning to make extent_writepage_io() have no
error path, by:

- Remove cow fixup mechanism completely

- Call extent_writepage_io() immediately after each delalloc range has
   OE allocated.
   This way we will always have a valid extent map, thus there is no way
   to error out.

So at least the function extent_writepage_io() should not get more error
paths, but only less.
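Until then, the invariant the error paths must keep can be modeled in a
few lines of userspace C (illustrative names, not the btrfs code): an
OE just tracks the bytes still pending, each byte is finished exactly
once, and finishing the same range twice trips the same check as the
"bad ordered extent accounting" warning:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of an ordered extent: bytes not yet finished. */
struct ordered_extent {
	unsigned long bytes_left;
};

/*
 * Finish @to_dec bytes of the OE.  Returns false on a double finish,
 * mirroring the "to_dec ... left" accounting warning.
 */
static bool oe_finish_range(struct ordered_extent *oe, unsigned long to_dec)
{
	if (to_dec > oe->bytes_left)
		return false;	/* double accounting detected */
	oe->bytes_left -= to_dec;
	return true;
}
```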

>
> Not a hard "no", in the sense that I think the code is correct for now
> (aside from my submit_one_bio question) but curious if we can give this
> some more principled structure.
>
> Thanks,
> Boris
>
>>
>> Cc: stable@vger.kernel.org # 5.15+
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
>>   1 file changed, 19 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 417c710c55ca..b6a4f1765b4c 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>   	unsigned long range_bitmap = 0;
>>   	bool submitted_io = false;
>> +	bool error = false;
>>   	const u64 folio_start = folio_pos(folio);
>>   	u64 cur;
>>   	int bit;
>> @@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>   			break;
>>   		}
>>   		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
>> -		if (ret < 0)
>> -			goto out;
>> +		if (unlikely(ret < 0)) {
>> +			submit_one_bio(bio_ctrl);
>
> This submit_one_bio is confusing to me. submit_one_sector failed and the
> subsequent comment says "there is no bio submitted" yet right here we
> call submit_one_bio.
>
> What is the meaning of it?

It's to make sure we submit the existing bio.

This is a little overkill, because we have a submit_write_bio() call
immediately after extent_write_cache_pages().

But that's exactly the delayed cleanup pattern, which is bug prone
(just like the OE cleanup I'm fixing), so I explicitly added a
submission for it.

I'll add extra comments explaining it.
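To illustrate that "flush the queued bio before manual cleanup" pattern, here is a minimal userspace C sketch. All names here (struct bio_ctrl, flush_pending(), queue_one_sector(), write_sectors()) are illustrative stand-ins for the kernel's bio_ctrl / submit_one_bio() / submit_one_sector() / btrfs_mark_ordered_io_finished(), not the real API:

```c
#include <assert.h>

struct bio_ctrl {
	int pending_sectors;	/* sectors queued in the current bio */
	int submitted_sectors;	/* sectors whose bio has been flushed */
};

/* Stand-in for submit_one_bio(): flush whatever is queued so the
 * completion path owns those sectors' cleanup. */
static void flush_pending(struct bio_ctrl *ctrl)
{
	ctrl->submitted_sectors += ctrl->pending_sectors;
	ctrl->pending_sectors = 0;
}

/* Stand-in for submit_one_sector(): queue one sector, failing only
 * for the designated bad sector. */
static int queue_one_sector(struct bio_ctrl *ctrl, int sector, int bad_sector)
{
	if (sector == bad_sector)
		return -1;
	ctrl->pending_sectors++;
	return 0;
}

/* Returns how many sectors had to be finished manually.  On error we
 * first flush the already-queued sectors so their completion path
 * (not this cleanup) finishes them exactly once, then manually finish
 * only the failed sector, like btrfs_mark_ordered_io_finished(). */
static int write_sectors(struct bio_ctrl *ctrl, int nr, int bad_sector)
{
	int manually_finished = 0;

	for (int sector = 0; sector < nr; sector++) {
		if (queue_one_sector(ctrl, sector, bad_sector) < 0) {
			flush_pending(ctrl);
			manually_finished++;
			continue;
		}
	}
	flush_pending(ctrl);
	return manually_finished;
}
```

With 4 sectors and sector 2 failing, sectors 0-1 are flushed at the error point, sector 3 is flushed at the end, and only sector 2 is finished manually, so each sector is cleaned up exactly once.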

Thanks,
Qu
>
>> +			/*
>> +			 * Failed to grab the extent map which should be very rare.
>> +			 * Since there is no bio submitted to finish the ordered
>> +			 * extent, we have to manually finish this sector.
>> +			 */
>> +			btrfs_mark_ordered_io_finished(inode, folio, cur,
>> +					fs_info->sectorsize, false);
>> +			error = true;
>> +			continue;
>> +		}
>>   		submitted_io = true;
>>   	}
>> -out:
>> +
>>   	/*
>>   	 * If we didn't submitted any sector (>= i_size), folio dirty get
>>   	 * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
>> @@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>   	 *
>>   	 * Here we set writeback and clear for the range. If the full folio
>>   	 * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
>> +	 *
>> +	 * If we hit any error, the corresponding sector will still be dirty
>> +	 * thus no need to clear PAGECACHE_TAG_DIRTY.
>>   	 */
>
> submitted_io is only used for this bit of logic, so you could consider
> changing this logic by keeping a single variable for whether or not we
> should go into this logic (naming it seems kind of annoying) and then
> setting it in both the error and submitted_io paths. I think that
> reduces headache in thinking about boolean logic, slightly.
>
>> -	if (!submitted_io) {
>> +	if (!submitted_io && !error) {
>>   		btrfs_folio_set_writeback(fs_info, folio, start, len);
>>   		btrfs_folio_clear_writeback(fs_info, folio, start, len);
>>   	}
>> @@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>>   {
>>   	struct inode *inode = folio->mapping->host;
>>   	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
>> -	const u64 page_start = folio_pos(folio);
>>   	int ret;
>>   	size_t pg_offset;
>>   	loff_t i_size = i_size_read(inode);
>> @@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>>
>>   	bio_ctrl->wbc->nr_to_write--;
>>
>> -	if (ret)
>> -		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
>> -					       page_start, PAGE_SIZE, !ret);
>> -
>>   done:
>>   	if (ret < 0)
>>   		mapping_set_error(folio->mapping, ret);
>> @@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
>>   		if (ret == 1)
>>   			goto next_page;
>>
>> -		if (ret) {
>> -			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
>> -						       cur, cur_len, !ret);
>> +		if (ret)
>>   			mapping_set_error(mapping, ret);
>> -		}
>>   		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
>>   		if (ret < 0)
>>   			found_error = true;
>> --
>> 2.47.1
>>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed
  2025-01-08 22:24   ` Boris Burkov
  2025-01-09  2:56     ` Qu Wenruo
@ 2025-01-09  3:45     ` Qu Wenruo
  2025-01-09 18:06       ` Boris Burkov
  1 sibling, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2025-01-09  3:45 UTC (permalink / raw)
  To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs, stable



On 2025/1/9 08:54, Boris Burkov wrote:
> On Thu, Dec 12, 2024 at 04:43:56PM +1030, Qu Wenruo wrote:
>> [BUG]
>> If submit_one_sector() fails inside extent_writepage_io() for sector
>> size < page size cases (e.g. 4K sector size and 64K page size), then
>> we can hit a double ordered extent accounting error.
>>
>> This should be very rare, as submit_one_sector() only fails when we
>> fail to grab the extent map, and such an extent map should exist in
>> memory and have been pinned.
>>
>> [CAUSE]
>> For example we have the following folio layout:
>>
>>      0  4K          32K    48K   60K 64K
>>      |//|           |//////|     |///|
>>
>> Where |///| is the dirty range we need to writeback. The 3 different
>> dirty ranges are submitted for regular COW.
>>
>> Now we hit the following sequence:
>>
>> - submit_one_sector() returned 0 for [0, 4K)
>>
>> - submit_one_sector() returned 0 for [32K, 48K)
>>
>> - submit_one_sector() returned error for [60K, 64K)
>>
>> - btrfs_mark_ordered_io_finished() called for the whole folio
>>    This will mark the following ranges as finished:
>>    * [0, 4K)
>>    * [32K, 48K)
>>      Both ranges have their IO already submitted, this cleanup will
>>      lead to double accounting.
>>
>>    * [60K, 64K)
>>      That's the correct cleanup.
>>
>> The only good news is that this error is only theoretical, as the target
>> extent map is always pinned, thus we should grab it directly from
>> memory rather than reading it from the disk.
>>
>> [FIX]
>> Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
>> range, which can touch ranges we should not touch, move the error
>> handling inside extent_writepage_io().
>>
>> That way we can clean up the exact sectors that ought to be submitted
>> but failed.
>>
>> This provides much more accurate cleanup, avoiding the double accounting.
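The double-accounting failure mode described in the quoted commit message can be modeled by a small userspace C sketch (the enum and helpers are illustrative stand-ins, not the real btrfs API): a whole-folio cleanup re-finishes sectors whose submitted IO already finished them, while a per-sector cleanup touches only the sector that actually failed.

```c
#include <assert.h>

enum oe_state { OE_PENDING, OE_FINISHED };

/* Stand-in for finishing one sector's ordered extent accounting.
 * Returns 1 if the sector was already finished, i.e. the double
 * accounting case that trips the WARN_ON() in the report. */
static int finish_sector(enum oe_state *oe)
{
	if (*oe == OE_FINISHED)
		return 1;
	*oe = OE_FINISHED;
	return 0;
}

/* The buggy cleanup: blindly finish every sector of the folio range,
 * including ones whose submitted IO already finished them. */
static int whole_folio_cleanup(enum oe_state *oes, int nr)
{
	int doubles = 0;

	for (int i = 0; i < nr; i++)
		doubles += finish_sector(&oes[i]);
	return doubles;
}

/* The fixed cleanup: only finish the one sector that failed before
 * submission. */
static int per_sector_cleanup(enum oe_state *oes, int failed_sector)
{
	return finish_sector(&oes[failed_sector]);
}
```

With two sectors already finished by their IO completion and one sector that failed before submission, the whole-folio cleanup double-finishes two sectors while the per-sector cleanup double-finishes none.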
>
> Analysis and fix both make sense to me. However, this one feels a lot
> more fragile than the other one.
>
> It relies on submit_one_sector being the only error path in
> extent_writepage_io. Any future error in the loop would have to create a
> shared "per sector" error handling goto in the loop I guess?
>
> Not a hard "no", in the sense that I think the code is correct for now
> (aside from my submit_one_bio question) but curious if we can give this
> some more principled structure.
>
> Thanks,
> Boris
>
>>
>> Cc: stable@vger.kernel.org # 5.15+
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
>>   1 file changed, 19 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 417c710c55ca..b6a4f1765b4c 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>   	unsigned long range_bitmap = 0;
>>   	bool submitted_io = false;
>> +	bool error = false;
>>   	const u64 folio_start = folio_pos(folio);
>>   	u64 cur;
>>   	int bit;
>> @@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>   			break;
>>   		}
>>   		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
>> -		if (ret < 0)
>> -			goto out;
>> +		if (unlikely(ret < 0)) {
>> +			submit_one_bio(bio_ctrl);
>
> This submit_one_bio is confusing to me. submit_one_sector failed and the
> subsequent comment says "there is no bio submitted" yet right here we
> call submit_one_bio.
>
> What is the meaning of it?
>
>> +			/*
>> +			 * Failed to grab the extent map which should be very rare.
>> +			 * Since there is no bio submitted to finish the ordered
>> +			 * extent, we have to manually finish this sector.
>> +			 */
>> +			btrfs_mark_ordered_io_finished(inode, folio, cur,
>> +					fs_info->sectorsize, false);
>> +			error = true;
>> +			continue;
>> +		}
>>   		submitted_io = true;
>>   	}
>> -out:
>> +
>>   	/*
>>   	 * If we didn't submitted any sector (>= i_size), folio dirty get
>>   	 * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
>> @@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>   	 *
>>   	 * Here we set writeback and clear for the range. If the full folio
>>   	 * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
>> +	 *
>> +	 * If we hit any error, the corresponding sector will still be dirty
>> +	 * thus no need to clear PAGECACHE_TAG_DIRTY.
>>   	 */
>
> submitted_io is only used for this bit of logic, so you could consider
> changing this logic by keeping a single variable for whether or not we
> should go into this logic (naming it seems kind of annoying) and then
> setting it in both the error and submitted_io paths. I think that
> reduces headache in thinking about boolean logic, slightly.

Unfortunately I can not find a good alternative to these two booleans.

I could go with a single boolean, but it would have to be called
something like @no_error_nor_submission.

That is not only the worst naming, but also a pile of boolean
operations packed into a single bool.

So I'm afraid @error and @submitted_io are still the better choice for
this case.
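For completeness, the two forms are logically equivalent by De Morgan's law; a minimal sketch (function names here are made up for illustration, neither appears in the actual patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Current form in the patch: two independent flags. */
static bool needs_writeback_roundtrip_two(bool submitted_io, bool error)
{
	return !submitted_io && !error;
}

/* Suggested form: one flag set true on either submission or error. */
static bool needs_writeback_roundtrip_one(bool submitted_or_errored)
{
	return !submitted_or_errored;
}
```

Since `!a && !b == !(a || b)`, both forms gate the writeback set/clear round trip identically; the disagreement above is purely about which reads better.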

The other comments will be addressed properly.

Thanks,
Qu
>
>> -	if (!submitted_io) {
>> +	if (!submitted_io && !error) {
>>   		btrfs_folio_set_writeback(fs_info, folio, start, len);
>>   		btrfs_folio_clear_writeback(fs_info, folio, start, len);
>>   	}
>> @@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>>   {
>>   	struct inode *inode = folio->mapping->host;
>>   	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
>> -	const u64 page_start = folio_pos(folio);
>>   	int ret;
>>   	size_t pg_offset;
>>   	loff_t i_size = i_size_read(inode);
>> @@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>>
>>   	bio_ctrl->wbc->nr_to_write--;
>>
>> -	if (ret)
>> -		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
>> -					       page_start, PAGE_SIZE, !ret);
>> -
>>   done:
>>   	if (ret < 0)
>>   		mapping_set_error(folio->mapping, ret);
>> @@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
>>   		if (ret == 1)
>>   			goto next_page;
>>
>> -		if (ret) {
>> -			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
>> -						       cur, cur_len, !ret);
>> +		if (ret)
>>   			mapping_set_error(mapping, ret);
>> -		}
>>   		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
>>   		if (ret < 0)
>>   			found_error = true;
>> --
>> 2.47.1
>>
>



* Re: [PATCH v2 1/9] btrfs: fix double accounting race when btrfs_run_delalloc_range() failed
  2025-01-09  2:45     ` Qu Wenruo
@ 2025-01-09  4:47       ` Qu Wenruo
  0 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2025-01-09  4:47 UTC (permalink / raw)
  To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs, stable



On 2025/1/9 13:15, Qu Wenruo wrote:
>
>
> On 2025/1/9 08:22, Boris Burkov wrote:
>> On Thu, Dec 12, 2024 at 04:43:55PM +1030, Qu Wenruo wrote:
>>> [BUG]
>>> When running btrfs with block size (4K) smaller than page size (64K,
>>> aarch64), there is a very high chance to crash the kernel at
>>> generic/750, with the following messages:
>>> (before the call traces, there are 3 extra debug messages added)
>>>
>>>   BTRFS warning (device dm-3): read-write for sector size 4096 with
>>> page size 65536 is experimental
>>>   BTRFS info (device dm-3): checking UUID tree
>>>   hrtimer: interrupt took 5451385 ns
>>>   BTRFS error (device dm-3): cow_file_range failed, root=4957
>>> inode=257 start=1605632 len=69632: -28
>>>   BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957
>>> inode=257 start=1605632 len=69632: -28
>>>   BTRFS error (device dm-3): failed to run delalloc range, root=4957
>>> ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
>>>   ------------[ cut here ]------------
>>>   WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360
>>> can_finish_ordered_extent+0x370/0x3b8 [btrfs]
>>>   CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G
>>> OE      6.13.0-rc1-custom+ #89
>>>   Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>>>   Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
>>>   Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
>>>   pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
>>>   lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
>>>   Call trace:
>>>    can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
>>>    can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
>>>    btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
>>>    extent_writepage+0x10c/0x3b8 [btrfs]
>>>    extent_write_cache_pages+0x21c/0x4e8 [btrfs]
>>>    btrfs_writepages+0x94/0x160 [btrfs]
>>>    do_writepages+0x74/0x190
>>>    filemap_fdatawrite_wbc+0x74/0xa0
>>>    start_delalloc_inodes+0x17c/0x3b0 [btrfs]
>>>    btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
>>>    shrink_delalloc+0x11c/0x280 [btrfs]
>>>    flush_space+0x288/0x328 [btrfs]
>>>    btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
>>>    process_one_work+0x228/0x680
>>>    worker_thread+0x1bc/0x360
>>>    kthread+0x100/0x118
>>>    ret_from_fork+0x10/0x20
>>>   ---[ end trace 0000000000000000 ]---
>>>   BTRFS critical (device dm-3): bad ordered extent accounting,
>>> root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
>>>   BTRFS critical (device dm-3): bad ordered extent accounting,
>>> root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
>>>   Unable to handle kernel NULL pointer dereference at virtual address
>>> 0000000000000008
>>>   BTRFS critical (device dm-3): bad ordered extent accounting,
>>> root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
>>>   CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G        W
>>> OE      6.13.0-rc1-custom+ #89
>>>   Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
>>>   Workqueue:  btrfs_work_helper [btrfs] (btrfs-endio-write)
>>>   pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>   pc : process_one_work+0x110/0x680
>>>   lr : worker_thread+0x1bc/0x360
>>>   Call trace:
>>>    process_one_work+0x110/0x680 (P)
>>>    worker_thread+0x1bc/0x360 (L)
>>>    worker_thread+0x1bc/0x360
>>>    kthread+0x100/0x118
>>>    ret_from_fork+0x10/0x20
>>>   Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
>>>   ---[ end trace 0000000000000000 ]---
>>>   Kernel panic - not syncing: Oops: Fatal exception
>>>   SMP: stopping secondary CPUs
>>>   SMP: failed to stop secondary CPUs 2-3
>>>   Dumping ftrace buffer:
>>>      (ftrace buffer empty)
>>>   Kernel Offset: 0x275bb9540000 from 0xffff800080000000
>>>   PHYS_OFFSET: 0xffff8fbba0000000
>>>   CPU features: 0x100,00000070,00801250,8201720b
>>>
>>> [CAUSE]
>>> The above warning is triggered immediately after the delalloc range
>>> failure, this happens in the following sequence:
>>>
>>> - Range [1568K, 1636K) is dirty
>>>
>>>     1536K  1568K     1600K    1636K  1664K
>>>     |      |/////////|////////|      |
>>>
>>>    Where 1536K, 1600K and 1664K are page boundaries (64K page size)
>>>
>>> - Enter extent_writepage() for page 1536K
>>>
>>> - Enter run_delalloc_nocow() with locked page 1536K and range
>>>    [1568K, 1636K)
>>>    This is because the inode has preallocated extents.
>>>
>>> - Enter cow_file_range() with locked page 1536K and range
>>>    [1568K, 1636K)
>>>
>>> - btrfs_reserve_extent() only reserved two extents
>>>    The main loop of cow_file_range() only reserved two data extents,
>>>
>>>    Now we have:
>>>
>>>     1536K  1568K        1600K    1636K  1664K
>>>     |      |<-->|<--->|/|///////|      |
>>>                 1584K  1596K
>>>    Range [1568K, 1596K) has ordered extent reserved.
>>>
>>> - btrfs_reserve_extent() failed inside cow_file_range() for file offset
>>>    1596K
>>>    This is already a bug in our space reservation code, but for now
>>>    let's focus on the error handling path.
>>>
>>>    Now cow_file_range() returned -ENOSPC.
>>>
>>> - btrfs_run_delalloc_range() does error cleanup <<< ROOT CAUSE
>>>    Call btrfs_cleanup_ordered_extents() with locked folio 1536K and
>>>    range [1568K, 1636K)
>>>
>>>    Function btrfs_cleanup_ordered_extents() normally needs to skip the
>>>    ranges inside the folio, as they will normally be cleaned up by
>>>    extent_writepage().
>>>
>>>    Such split error handling is already problematic in the first place.
>>>
>>>    What's worse is the folio range skipping itself, which does not take
>>>    subpage cases into consideration at all: it only skips the range
>>>    if the page start >= the range start.
>>>    In our case, the page start < the range start, since for subpage
>>>    cases we can have delalloc ranges inside the folio but not covering
>>>    the folio.
>>>
>>>    So it doesn't skip the page range at all.
>>>    This means all the ordered extents, both [1568K, 1584K) and
>>>    [1584K, 1596K) will be marked as IOERR.
>>>
>>>    And since those two ordered extents have no more pending IOs, they
>>>    are marked finished and *QUEUED* to be deleted from the io tree.
>>>
>>> - extent_writepage() do error cleanup
>>>    Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).
>>>
>>>    Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the
>>>    deletion from the io tree is async, so it may or may not have
>>>    happened yet at this point.
>>>
>>>    If the ranges are not yet removed, we will do a double cleanup on
>>>    those ranges, which triggers the above ordered extent warnings.
>>>
>>> In theory there are other bugs, like the cleanup in extent_writepage()
>>> can cause double accounting on ranges that are submitted async
>>> (compression for example).
>>>
>>> But that's much harder to trigger because normally we do not mix regular
>>> and compression delalloc ranges.
>>>
>>> [FIX]
>>> The folio range split is already buggy and not subpage compatible; it
>>> was introduced a long time ago, when subpage support was not even
>>> considered.
>>>
>>> So instead of splitting the ordered extents cleanup into the folio range
>>> and out of folio range, do all the cleanup inside writepage_delalloc().
>>>
>>> - Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
>>>    btrfs_run_delalloc_range()
>>>
>>> - Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc()
>>>    failed
>>>
>>>    So all ordered extents are only cleaned up by
>>>    btrfs_run_delalloc_range().
>>>
>>> - Handle the ranges that already have ordered extents allocated
>>>    If part of the folio already has ordered extent allocated, and
>>>    btrfs_run_delalloc_range() failed, we also need to cleanup that
>>> range.
>>>
>>> Now we have a concentrated error handling for ordered extents during
>>> btrfs_run_delalloc_range().
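The concentrated cleanup added by the quoted hunk can be modeled in userspace C as follows. This is a sketch with stand-in names (the real code walks bio_ctrl->submit_bitmap with for_each_set_bit() and uses fs_info->sectorsize_bits / sectors_per_page): only sectors that were both marked for submission and lie before the last successfully run delalloc end get their ordered extent finished.

```c
#include <assert.h>

#define SECTORSIZE_BITS 12	/* assume 4K sectors for the sketch */

/* Count how many sectors would have their ordered extent finished:
 * the submit bitmap is capped at the last successfully run delalloc
 * end so we never touch sectors whose range was not created. */
static int cleanup_failed_delalloc(unsigned long submit_bitmap,
				   unsigned long long page_start,
				   unsigned long long last_finished,
				   unsigned int sectors_per_page)
{
	unsigned int bitmap_size =
		(unsigned int)((last_finished - page_start) >> SECTORSIZE_BITS);
	int cleaned = 0;

	if (bitmap_size > sectors_per_page)
		bitmap_size = sectors_per_page;

	for (unsigned int bit = 0; bit < bitmap_size; bit++)
		if (submit_bitmap & (1UL << bit))
			cleaned++;	/* one btrfs_mark_ordered_io_finished() */
	return cleaned;
}
```

For example, with bits 4-7 set in the submit bitmap but only 6 sectors successfully run, just sectors 4 and 5 are cleaned up; the failed tail is left for its own error path.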
>>
>> Great investigation and writeup, thanks!
>
> Thanks a lot of the review!
>
>>
>> The explanation and fix both make sense to me. I traced the change in
>> error handling and I see how we are avoiding double ending the
>> ordered_extent. So with that said, feel free to add:
>> Reviewed-by: Boris Burkov <boris@bur.io>
>>
>> However, I would like to request one thing, if I may.
>> While this is all still relatively fresh in your mind, could you please
>> document the intended behavior of the various functions (at least the
>> ones you modify/reason about) with regards to:
>> - cleanup state of the various objects involved like ordered_extents
>>    and subpages (e.g., writepage_delalloc cleans up ordered extents, so
>>    callers should not, etc.)
>
> The subpage part should not be considered something special; it's
> really just a kind of enhanced page flags for subpage cases.
>
> Thus I'll not explicitly mention the subpage bitmap, but directly
> mention the involved flags (in this particular case, folio Ordered and
> Locked flags).
>
>> - return values (e.g., when precisely does btrfs_run_delalloc_range
>>    return >= 0 ?)
>
> My bad, I should update the comment in commit d034cdb4cc8a ("btrfs: lock
> subpage ranges in one go for writepage_delalloc()").
>
> Still, better to fix it here before it's too late.
>
>> - anything else you think would be helpful for reasoning about these
>>    functions in an abstract way while you are at it.
>>
>> That request is obviously optional for landing these fixes, but I really
>> think it would help if we went through the bother every time we
>> deciphered one of these undocumented paths. A restatement of your best
>> understanding of the behavior now will really pay off for the next
>> person reading this code :)

And since we're here, just a quick note on patches 4 and 5: although
those two patches fix error handling, they are still mostly
backport-oriented fixes (and I'm 100% sure they will cause conflicts).

They still do cross-layer error handling. E.g. the ordered extent
cleanup is not done inside cow_file_range(), but by
btrfs_run_delalloc_range().

I'm going to properly fix all of this cross-layer error handling in the
next series soon.

Thanks,
Qu
>>
>> Thanks,
>> Boris
>>
>>>
>>> Cc: stable@vger.kernel.org # 5.15+
>>> Fixes: d1051d6ebf8e ("btrfs: Fix error handling in
>>> btrfs_cleanup_ordered_extents")
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>   fs/btrfs/extent_io.c | 37 ++++++++++++++++++++++++++++++++-----
>>>   fs/btrfs/inode.c     |  2 +-
>>>   2 files changed, 33 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 9725ff7f274d..417c710c55ca 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -1167,6 +1167,12 @@ static noinline_for_stack int
>>> writepage_delalloc(struct btrfs_inode *inode,
>>>        * last delalloc end.
>>>        */
>>>       u64 last_delalloc_end = 0;
>>> +    /*
>>> +     * Save the last successfully ran delalloc range end (exclusive).
>>> +     * This is for error handling to avoid ranges with ordered
>>> extent created
>>> +     * but no IO will be submitted due to error.
>>> +     */
>>
>> nit: last_finished what? I feel this name or comment could use some
>> extra work.
>
> I can enhance it to @last_finished_delalloc_end and update the comment.
>
> Thanks,
> Qu
>
>>
>>> +    u64 last_finished = page_start;
>>>       u64 delalloc_start = page_start;
>>>       u64 delalloc_end = page_end;
>>>       u64 delalloc_to_write = 0;
>>> @@ -1235,11 +1241,19 @@ static noinline_for_stack int
>>> writepage_delalloc(struct btrfs_inode *inode,
>>>               found_len = last_delalloc_end + 1 - found_start;
>>>
>>>           if (ret >= 0) {
>>> +            /*
>>> +             * Some delalloc range may be created by previous folios.
>>> +             * Thus we still need to clean those range up during error
>>> +             * handling.
>>> +             */
>>> +            last_finished = found_start;
>>>               /* No errors hit so far, run the current delalloc
>>> range. */
>>>               ret = btrfs_run_delalloc_range(inode, folio,
>>>                                  found_start,
>>>                                  found_start + found_len - 1,
>>>                                  wbc);
>>> +            if (ret >= 0)
>>> +                last_finished = found_start + found_len;
>>>           } else {
>>>               /*
>>>                * We've hit an error during previous delalloc range,
>>> @@ -1274,8 +1288,21 @@ static noinline_for_stack int
>>> writepage_delalloc(struct btrfs_inode *inode,
>>>
>>>           delalloc_start = found_start + found_len;
>>>       }
>>> -    if (ret < 0)
>>> +    /*
>>> +     * It's possible we have some ordered extents created before we hit
>>> +     * an error, cleanup non-async successfully created delalloc
>>> ranges.
>>> +     */
>>> +    if (unlikely(ret < 0)) {
>>> +        unsigned int bitmap_size = min(
>>> +            (last_finished - page_start) >> fs_info->sectorsize_bits,
>>> +            fs_info->sectors_per_page);
>>> +
>>> +        for_each_set_bit(bit, &bio_ctrl->submit_bitmap, bitmap_size)
>>> +            btrfs_mark_ordered_io_finished(inode, folio,
>>> +                page_start + (bit << fs_info->sectorsize_bits),
>>> +                fs_info->sectorsize, false);
>>>           return ret;
>>> +    }
>>>   out:
>>>       if (last_delalloc_end)
>>>           delalloc_end = last_delalloc_end;
>>> @@ -1509,13 +1536,13 @@ static int extent_writepage(struct folio
>>> *folio, struct btrfs_bio_ctrl *bio_ctrl
>>>
>>>       bio_ctrl->wbc->nr_to_write--;
>>>
>>> -done:
>>> -    if (ret) {
>>> +    if (ret)
>>>           btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
>>>                              page_start, PAGE_SIZE, !ret);
>>> -        mapping_set_error(folio->mapping, ret);
>>> -    }
>>>
>>> +done:
>>> +    if (ret < 0)
>>> +        mapping_set_error(folio->mapping, ret);
>>>       /*
>>>        * Only unlock ranges that are submitted. As there can be some
>>> async
>>>        * submitted ranges inside the folio.
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index c4997200dbb2..d41bb47d59fb 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -2305,7 +2305,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode
>>> *inode, struct folio *locked_fol
>>>
>>>   out:
>>>       if (ret < 0)
>>> -        btrfs_cleanup_ordered_extents(inode, locked_folio, start,
>>> +        btrfs_cleanup_ordered_extents(inode, NULL, start,
>>>                             end - start + 1);
>>>       return ret;
>>>   }
>>> --
>>> 2.47.1
>>>
>>
>
>



* Re: [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed
  2025-01-09  3:45     ` Qu Wenruo
@ 2025-01-09 18:06       ` Boris Burkov
  2025-01-09 20:34         ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: Boris Burkov @ 2025-01-09 18:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, stable

On Thu, Jan 09, 2025 at 02:15:06PM +1030, Qu Wenruo wrote:
> 
> 
> On 2025/1/9 08:54, Boris Burkov wrote:
> > On Thu, Dec 12, 2024 at 04:43:56PM +1030, Qu Wenruo wrote:
> > > [BUG]
> > > If submit_one_sector() fails inside extent_writepage_io() for sector
> > > size < page size cases (e.g. 4K sector size and 64K page size), then
> > > we can hit a double ordered extent accounting error.
> > > 
> > > This should be very rare, as submit_one_sector() only fails when we
> > > fail to grab the extent map, and such an extent map should exist in
> > > memory and have been pinned.
> > > 
> > > [CAUSE]
> > > For example we have the following folio layout:
> > > 
> > >      0  4K          32K    48K   60K 64K
> > >      |//|           |//////|     |///|
> > > 
> > > Where |///| is the dirty range we need to writeback. The 3 different
> > > dirty ranges are submitted for regular COW.
> > > 
> > > Now we hit the following sequence:
> > > 
> > > - submit_one_sector() returned 0 for [0, 4K)
> > > 
> > > - submit_one_sector() returned 0 for [32K, 48K)
> > > 
> > > - submit_one_sector() returned error for [60K, 64K)
> > > 
> > > - btrfs_mark_ordered_io_finished() called for the whole folio
> > >    This will mark the following ranges as finished:
> > >    * [0, 4K)
> > >    * [32K, 48K)
> > >      Both ranges have their IO already submitted, this cleanup will
> > >      lead to double accounting.
> > > 
> > >    * [60K, 64K)
> > >      That's the correct cleanup.
> > > 
> > > The only good news is that this error is only theoretical, as the target
> > > extent map is always pinned, thus we should grab it directly from
> > > memory rather than reading it from the disk.
> > > 
> > > [FIX]
> > > Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
> > > range, which can touch ranges we should not touch, move the error
> > > handling inside extent_writepage_io().
> > > 
> > > That way we can clean up the exact sectors that ought to be submitted
> > > but failed.
> > > 
> > > This provides much more accurate cleanup, avoiding the double accounting.
> > 
> > Analysis and fix both make sense to me. However, this one feels a lot
> > more fragile than the other one.
> > 
> > It relies on submit_one_sector being the only error path in
> > extent_writepage_io. Any future error in the loop would have to create a
> > shared "per sector" error handling goto in the loop I guess?
> > 
> > Not a hard "no", in the sense that I think the code is correct for now
> > (aside from my submit_one_bio question) but curious if we can give this
> > some more principled structure.
> > 
> > Thanks,
> > Boris
> > 
> > > 
> > > Cc: stable@vger.kernel.org # 5.15+
> > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > > ---
> > >   fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
> > >   1 file changed, 19 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index 417c710c55ca..b6a4f1765b4c 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
> > >   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > >   	unsigned long range_bitmap = 0;
> > >   	bool submitted_io = false;
> > > +	bool error = false;
> > >   	const u64 folio_start = folio_pos(folio);
> > >   	u64 cur;
> > >   	int bit;
> > > @@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
> > >   			break;
> > >   		}
> > >   		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
> > > -		if (ret < 0)
> > > -			goto out;
> > > +		if (unlikely(ret < 0)) {
> > > +			submit_one_bio(bio_ctrl);
> > 
> > This submit_one_bio is confusing to me. submit_one_sector failed and the
> > subsequent comment says "there is no bio submitted" yet right here we
> > call submit_one_bio.
> > 
> > What is the meaning of it?
> > 
> > > +			/*
> > > +			 * Failed to grab the extent map which should be very rare.
> > > +			 * Since there is no bio submitted to finish the ordered
> > > +			 * extent, we have to manually finish this sector.
> > > +			 */
> > > +			btrfs_mark_ordered_io_finished(inode, folio, cur,
> > > +					fs_info->sectorsize, false);
> > > +			error = true;
> > > +			continue;
> > > +		}
> > >   		submitted_io = true;
> > >   	}
> > > -out:
> > > +
> > >   	/*
> > >   	 * If we didn't submitted any sector (>= i_size), folio dirty get
> > >   	 * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
> > > @@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
> > >   	 *
> > >   	 * Here we set writeback and clear for the range. If the full folio
> > >   	 * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
> > > +	 *
> > > +	 * If we hit any error, the corresponding sector will still be dirty
> > > +	 * thus no need to clear PAGECACHE_TAG_DIRTY.
> > >   	 */
> > 
> > submitted_io is only used for this bit of logic, so you could consider
> > changing this logic by keeping a single variable for whether or not we
> > should go into this logic (naming it seems kind of annoying) and then
> > setting it in both the error and submitted_io paths. I think that
> > reduces headache in thinking about boolean logic, slightly.
> 
> Unfortunately I can not find a good alternative to these two booleans.
> 
> I could go with a single boolean, but it would have to be called
> something like @no_error_nor_submission.
> 
> That is not only the worst naming, but also a pile of boolean
> operations packed into a single bool.

I think you could do something like:

needs_reset_writeback = false;
then set it to true in either case, whether you submit an io or hit an
error.

It's your call, though, I won't be upset if you leave it as is.

> 
> So I'm afraid @error and @submitted_io are still the better choice for
> this case.
> 
> The other comments will be addressed properly.
> 
> Thanks,
> Qu
> > 
> > > -	if (!submitted_io) {
> > > +	if (!submitted_io && !error) {
> > >   		btrfs_folio_set_writeback(fs_info, folio, start, len);
> > >   		btrfs_folio_clear_writeback(fs_info, folio, start, len);
> > >   	}
> > > @@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
> > >   {
> > >   	struct inode *inode = folio->mapping->host;
> > >   	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> > > -	const u64 page_start = folio_pos(folio);
> > >   	int ret;
> > >   	size_t pg_offset;
> > >   	loff_t i_size = i_size_read(inode);
> > > @@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
> > > 
> > >   	bio_ctrl->wbc->nr_to_write--;
> > > 
> > > -	if (ret)
> > > -		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
> > > -					       page_start, PAGE_SIZE, !ret);
> > > -
> > >   done:
> > >   	if (ret < 0)
> > >   		mapping_set_error(folio->mapping, ret);
> > > @@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
> > >   		if (ret == 1)
> > >   			goto next_page;
> > > 
> > > -		if (ret) {
> > > -			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
> > > -						       cur, cur_len, !ret);
> > > +		if (ret)
> > >   			mapping_set_error(mapping, ret);
> > > -		}
> > >   		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
> > >   		if (ret < 0)
> > >   			found_error = true;
> > > --
> > > 2.47.1
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed
  2025-01-09 18:06       ` Boris Burkov
@ 2025-01-09 20:34         ` Qu Wenruo
  0 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2025-01-09 20:34 UTC (permalink / raw)
  To: Boris Burkov; +Cc: Qu Wenruo, linux-btrfs, stable



On 2025/1/10 04:36, Boris Burkov wrote:
> On Thu, Jan 09, 2025 at 02:15:06PM +1030, Qu Wenruo wrote:
>>
>>
>> On 2025/1/9 08:54, Boris Burkov wrote:
>>> On Thu, Dec 12, 2024 at 04:43:56PM +1030, Qu Wenruo wrote:
>>>> [BUG]
>>>> If submit_one_sector() failed inside extent_writepage_io() for sector
>>>> size < page size cases (e.g. 4K sector size and 64K page size), then
>>>> we can hit a double ordered extent accounting error.
>>>>
>>>> This should be very rare, as submit_one_sector() only fails when we
>>>> fail to grab the extent map, and such an extent map should exist in
>>>> memory and have been pinned.
>>>>
>>>> [CAUSE]
>>>> For example we have the following folio layout:
>>>>
>>>>       0  4K          32K    48K   60K 64K
>>>>       |//|           |//////|     |///|
>>>>
>>>> Where |///| is the dirty range we need to writeback. The 3 different
>>>> dirty ranges are submitted for regular COW.
>>>>
>>>> Now we hit the following sequence:
>>>>
>>>> - submit_one_sector() returned 0 for [0, 4K)
>>>>
>>>> - submit_one_sector() returned 0 for [32K, 48K)
>>>>
>>>> - submit_one_sector() returned error for [60K, 64K)
>>>>
>>>> - btrfs_mark_ordered_io_finished() called for the whole folio
>>>>     This will mark the following ranges as finished:
>>>>     * [0, 4K)
>>>>     * [32K, 48K)
>>>>       Both ranges have their IO already submitted, this cleanup will
>>>>       lead to double accounting.
>>>>
>>>>     * [60K, 64K)
>>>>       That's the correct cleanup.
>>>>
>>>> The only good news is that this error is only theoretical, as the
>>>> target extent map is always pinned, thus we should grab it directly
>>>> from memory rather than reading it from disk.
>>>>
>>>> [FIX]
>>>> Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
>>>> range, which can touch ranges we should not touch, move the error
>>>> handling inside extent_writepage_io().
>>>>
>>>> So that we can clean up exactly the sectors that ought to be submitted
>>>> but failed.
>>>>
>>>> This provides much more accurate cleanup, avoiding the double accounting.
>>>
>>> Analysis and fix both make sense to me. However, this one feels a lot
>>> more fragile than the other one.
>>>
>>> It relies on submit_one_sector being the only error path in
>>> extent_writepage_io. Any future error in the loop would have to create a
>>> shared "per sector" error handling goto in the loop I guess?
>>>
>>> Not a hard "no", in the sense that I think the code is correct for now
>>> (aside from my submit_one_bio question) but curious if we can give this
>>> some more principled structure.
>>>
>>> Thanks,
>>> Boris
>>>
>>>>
>>>> Cc: stable@vger.kernel.org # 5.15+
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>>    fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
>>>>    1 file changed, 19 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>>> index 417c710c55ca..b6a4f1765b4c 100644
>>>> --- a/fs/btrfs/extent_io.c
>>>> +++ b/fs/btrfs/extent_io.c
>>>> @@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>>>    	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>>>    	unsigned long range_bitmap = 0;
>>>>    	bool submitted_io = false;
>>>> +	bool error = false;
>>>>    	const u64 folio_start = folio_pos(folio);
>>>>    	u64 cur;
>>>>    	int bit;
>>>> @@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>>>    			break;
>>>>    		}
>>>>    		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
>>>> -		if (ret < 0)
>>>> -			goto out;
>>>> +		if (unlikely(ret < 0)) {
>>>> +			submit_one_bio(bio_ctrl);
>>>
>>> This submit_one_bio is confusing to me. submit_one_sector failed and the
>>> subsequent comment says "there is no bio submitted" yet right here we
>>> call submit_one_bio.
>>>
>>> What is the meaning of it?
>>>
>>>> +			/*
>>>> +			 * Failed to grab the extent map which should be very rare.
>>>> +			 * Since there is no bio submitted to finish the ordered
>>>> +			 * extent, we have to manually finish this sector.
>>>> +			 */
>>>> +			btrfs_mark_ordered_io_finished(inode, folio, cur,
>>>> +					fs_info->sectorsize, false);
>>>> +			error = true;
>>>> +			continue;
>>>> +		}
>>>>    		submitted_io = true;
>>>>    	}
>>>> -out:
>>>> +
>>>>    	/*
>>>>    	 * If we didn't submitted any sector (>= i_size), folio dirty get
>>>>    	 * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
>>>> @@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>>>    	 *
>>>>    	 * Here we set writeback and clear for the range. If the full folio
>>>>    	 * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
>>>> +	 *
>>>> +	 * If we hit any error, the corresponding sector will still be dirty
>>>> +	 * thus no need to clear PAGECACHE_TAG_DIRTY.
>>>>    	 */
>>>
>>> submitted_io is only used for this bit of logic, so you could consider
>>> changing this logic by keeping a single variable for whether or not we
>>> should go into this logic (naming it seems kind of annoying) and then
>>> setting it in both the error and submitted_io paths. I think that
>>> reduces headache in thinking about boolean logic, slightly.
>>
>> Unfortunately I cannot find a good alternative to this double boolean
>> usage.
>>
>> I can go a single boolean, but it will be called something like
>> @no_error_nor_submission.
>>
>> Which is not only the worst naming, but also a mess of boolean
>> operations for a single bool.
>
> I think you could do something like:
>
> needs_reset_writeback = false;

Unfortunately, that will not work if it defaults to false.

We have to default it to true, and only set it to false in the error
or submission path.

This also means we need to explain why the bool is set to false in both
paths (i.e. duplicated comments).

> then set it to true in either case, whether you submit an io or hit an
> error.
>
> It's your call, though, I won't be upset if you leave it as is.

I'm afraid I'll leave it as is for now.

And hopefully in the future we can remove the @error bool, by at least
removing the extent map related error path.

Thanks,
Qu

>
>>
>> So I'm afraid the @error and @submitted_io will still be better for this
>> case.
>>
>> The other comments will be addressed properly.
>>
>> Thanks,
>> Qu
>>>
>>>> -	if (!submitted_io) {
>>>> +	if (!submitted_io && !error) {
>>>>    		btrfs_folio_set_writeback(fs_info, folio, start, len);
>>>>    		btrfs_folio_clear_writeback(fs_info, folio, start, len);
>>>>    	}
>>>> @@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>>>>    {
>>>>    	struct inode *inode = folio->mapping->host;
>>>>    	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
>>>> -	const u64 page_start = folio_pos(folio);
>>>>    	int ret;
>>>>    	size_t pg_offset;
>>>>    	loff_t i_size = i_size_read(inode);
>>>> @@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>>>>
>>>>    	bio_ctrl->wbc->nr_to_write--;
>>>>
>>>> -	if (ret)
>>>> -		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
>>>> -					       page_start, PAGE_SIZE, !ret);
>>>> -
>>>>    done:
>>>>    	if (ret < 0)
>>>>    		mapping_set_error(folio->mapping, ret);
>>>> @@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
>>>>    		if (ret == 1)
>>>>    			goto next_page;
>>>>
>>>> -		if (ret) {
>>>> -			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
>>>> -						       cur, cur_len, !ret);
>>>> +		if (ret)
>>>>    			mapping_set_error(mapping, ret);
>>>> -		}
>>>>    		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
>>>>    		if (ret < 0)
>>>>    			found_error = true;
>>>> --
>>>> 2.47.1
>>>>
>>>
>>
>



* Re: [PATCH v2 4/9] btrfs: do proper folio cleanup when cow_file_range() failed
  2024-12-12  6:13 ` [PATCH v2 4/9] btrfs: do proper folio cleanup when cow_file_range() failed Qu Wenruo
@ 2025-01-09 23:20   ` Boris Burkov
  2025-01-09 23:34     ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: Boris Burkov @ 2025-01-09 23:20 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, stable

On Thu, Dec 12, 2024 at 04:43:58PM +1030, Qu Wenruo wrote:
> [BUG]
> When testing with COW fixup marked as BUG_ON() (this is related to the
> new pin_user_pages*() change, which should not result in new out-of-band
> dirty pages), I hit a crash triggered by the BUG_ON() from hitting the
> COW fixup path.
> 
> This BUG_ON() happens just after a failed btrfs_run_delalloc_range():
> 
>  BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28
>  ------------[ cut here ]------------
>  kernel BUG at fs/btrfs/extent_io.c:1444!
>  Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>  CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G           OE      6.12.0-rc7-custom+ #86
>  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
>  Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
>  pc : extent_writepage_io+0x2d4/0x308 [btrfs]
>  lr : extent_writepage_io+0x2d4/0x308 [btrfs]
>  Call trace:
>   extent_writepage_io+0x2d4/0x308 [btrfs]
>   extent_writepage+0x218/0x330 [btrfs]
>   extent_write_cache_pages+0x1d4/0x4b0 [btrfs]
>   btrfs_writepages+0x94/0x150 [btrfs]
>   do_writepages+0x74/0x190
>   filemap_fdatawrite_wbc+0x88/0xc8
>   start_delalloc_inodes+0x180/0x3b0 [btrfs]
>   btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
>   shrink_delalloc+0x114/0x280 [btrfs]
>   flush_space+0x250/0x2f8 [btrfs]
>   btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
>   process_one_work+0x164/0x408
>   worker_thread+0x25c/0x388
>   kthread+0x100/0x118
>   ret_from_fork+0x10/0x20
>  Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000)
>  ---[ end trace 0000000000000000 ]---
> 
> [CAUSE]
> That failure is mostly from cow_file_range(), where we can hit -ENOSPC.
> 
> Although the -ENOSPC is already a bug related to our space reservation
> code, let's just focus on the error handling.
> 
> For example, we have the following dirty range [0, 64K) of an inode,
> with 4K sector size and 4K page size:
> 
>    0        16K        32K       48K       64K
>    |///////////////////////////////////////|
>    |#######################################|
> 
> Where |///| means pages that are still dirty, and |###| means the extent
> io tree has the EXTENT_DELALLOC flag.
> 
> - Enter extent_writepage() for page 0
> 
> - Enter btrfs_run_delalloc_range() for range [0, 64K)
> 
> - Enter cow_file_range() for range [0, 64K)
> 
> - Function btrfs_reserve_extent() only reserved one 16K extent
>   So we created extent map and ordered extent for range [0, 16K)
> 
>    0        16K        32K       48K       64K
>    |////////|//////////////////////////////|
>    |<- OE ->|##############################|
> 
>    And range [0, 16K) has its delalloc flag cleared.
>    But since we haven't yet submitted any bio, the 4 involved pages are
>    still dirty.
> 
> - Function btrfs_reserve_extent() returns with -ENOSPC
>   Now we have to run error cleanup, which will clear all
>   EXTENT_DELALLOC* flags and clear the dirty flags for the remaining
>   ranges:
> 
>    0        16K        32K       48K       64K
>    |////////|                              |
>    |        |                              |
> 
>   Note that range [0, 16K) still has its pages dirty.
> 
> - Some time later, writeback is triggered again for the range [0, 16K)
>   since the page range still has dirty flags.
> 
> - btrfs_run_delalloc_range() will do nothing because there is no
>   EXTENT_DELALLOC flag.
> 
> - extent_writepage_io() finds page 0 has no ordered flag
>   Which falls into the COW fixup path, triggering the BUG_ON().
> 
> Unfortunately this error handling bug dates back to the introduction of btrfs.
> Thankfully with the abuse of cow fixup, at least it won't crash the
> kernel.
> 
> [FIX]
> Instead of immediately unlocking the extent and folios, we keep the
> extent and folios locked until we either error out or the whole delalloc
> range is finished.
> 
> When the whole delalloc range finished without error, we just unlock the
> whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked
> cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared.
> And those involved folios will be properly submitted, with their dirty
> flags cleared during submission.
> 
> For the error path, it will be a little more complex:
> 
> - The range with ordered extent allocated (range (1))
>   We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining
>   flags are cleaned up by
>   btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered().
> 
>   For folios we finish the IO (clear dirty, start writeback and
>   immediately finish the writeback) and unlock the folios.
> 
> - The range with reserved extent but no ordered extent (range(2))
> - The range we never touched (range(3))
>   For both range (2) and range(3) the behavior is not changed.
> 
> Now even if cow_file_range() failed halfway with some successfully
> reserved extents/ordered extents, we will keep all folios clean, so
> there will be no future writeback triggered on them.

2 qs, to make sure I understand:

This changes the happy path, in that IO can't start on the allocated
ordered extents until the whole range is done allocating and unlocked or
errors. But it shouldn't be a big deal unless we have this race a lot?

What is the new behavior in your test case? The whole range is correctly
not dirty, no IO happens, and the mapping has an error set on it? Have
you managed to demonstrate something to that effect more explicitly than
not hitting the BUG_ON in your new code?

However, assuming I understood correctly, LGTM.
Reviewed-by: Boris Burkov <boris@bur.io>

> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/inode.c | 65 ++++++++++++++++++++++++------------------------
>  1 file changed, 32 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5ba8d044757b..19c88b7d0363 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1364,6 +1364,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  
>  	alloc_hint = btrfs_get_extent_allocation_hint(inode, start, num_bytes);
>  
> +	/*
> +	 * We're not doing compressed IO, don't unlock the first page
> +	 * (which the caller expects to stay locked), don't clear any
> +	 * dirty bits and don't set any writeback bits
> +	 *
> +	 * Do set the Ordered (Private2) bit so we know this page was
> +	 * properly setup for writepage.
> +	 */
> +	page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
> +	page_ops |= PAGE_SET_ORDERED;
> +
>  	/*
>  	 * Relocation relies on the relocated extents to have exactly the same
>  	 * size as the original extents. Normally writeback for relocation data
> @@ -1423,6 +1434,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  		file_extent.offset = 0;
>  		file_extent.compression = BTRFS_COMPRESS_NONE;
>  
> +		/*
> +		 * Locked range will be released either during error clean up or
> +		 * after the whole range is finished.
> +		 */
>  		lock_extent(&inode->io_tree, start, start + cur_alloc_size - 1,
>  			    &cached);
>  
> @@ -1468,21 +1483,6 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  
>  		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>  
> -		/*
> -		 * We're not doing compressed IO, don't unlock the first page
> -		 * (which the caller expects to stay locked), don't clear any
> -		 * dirty bits and don't set any writeback bits
> -		 *
> -		 * Do set the Ordered flag so we know this page was
> -		 * properly setup for writepage.
> -		 */
> -		page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
> -		page_ops |= PAGE_SET_ORDERED;
> -
> -		extent_clear_unlock_delalloc(inode, start, start + cur_alloc_size - 1,
> -					     locked_folio, &cached,
> -					     EXTENT_LOCKED | EXTENT_DELALLOC,
> -					     page_ops);
>  		if (num_bytes < cur_alloc_size)
>  			num_bytes = 0;
>  		else
> @@ -1499,6 +1499,9 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  		if (ret)
>  			goto out_unlock;
>  	}
> +	extent_clear_unlock_delalloc(inode, orig_start, end, locked_folio, &cached,
> +				     EXTENT_LOCKED | EXTENT_DELALLOC,
> +				     page_ops);
>  done:
>  	if (done_offset)
>  		*done_offset = end;
> @@ -1519,35 +1522,31 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  	 * We process each region below.
>  	 */
>  
> -	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
> -		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
> -	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
> -
>  	/*
>  	 * For the range (1). We have already instantiated the ordered extents
>  	 * for this region. They are cleaned up by
>  	 * btrfs_cleanup_ordered_extents() in e.g,
> -	 * btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
> -	 * already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
> -	 * EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
> -	 * function.
> +	 * btrfs_run_delalloc_range().
> +	 * EXTENT_DELALLOC_NEW | EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV
> +	 * are also handled by the cleanup function.
>  	 *
> -	 * However, in case of @keep_locked, we still need to unlock the pages
> -	 * (except @locked_folio) to ensure all the pages are unlocked.
> +	 * So here we only clear EXTENT_LOCKED and EXTENT_DELALLOC flag,
> +	 * and finish the writeback of the involved folios, which will be
> +	 * never submitted.
>  	 */
> -	if (keep_locked && orig_start < start) {
> +	if (orig_start < start) {
> +		clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC;
> +		page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
> +
>  		if (!locked_folio)
>  			mapping_set_error(inode->vfs_inode.i_mapping, ret);
>  		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
> -					     locked_folio, NULL, 0, page_ops);
> +					     locked_folio, NULL, clear_bits, page_ops);
>  	}
>  
> -	/*
> -	 * At this point we're unlocked, we want to make sure we're only
> -	 * clearing these flags under the extent lock, so lock the rest of the
> -	 * range and clear everything up.
> -	 */
> -	lock_extent(&inode->io_tree, start, end, NULL);
> +	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
> +		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
> +	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
>  
>  	/*
>  	 * For the range (2). If we reserved an extent for our delalloc range
> -- 
> 2.47.1
> 


* Re: [PATCH v2 5/9] btrfs: do proper folio cleanup when run_delalloc_nocow() failed
  2024-12-12  6:13 ` [PATCH v2 5/9] btrfs: do proper folio cleanup when run_delalloc_nocow() failed Qu Wenruo
@ 2025-01-09 23:26   ` Boris Burkov
  2025-01-09 23:45     ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: Boris Burkov @ 2025-01-09 23:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, stable

On Thu, Dec 12, 2024 at 04:43:59PM +1030, Qu Wenruo wrote:
> [BUG]
> With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash
> with the following VM_BUG_ON_FOLIO():
> 
>  BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28
>  BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28
>  page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664
>  aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774"
>  flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff)
>  page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
>  ------------[ cut here ]------------
>  kernel BUG at mm/page-writeback.c:2992!
>  Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>  CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G           OE      6.12.0-rc7-custom+ #87
>  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
>  Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
>  pc : folio_clear_dirty_for_io+0x128/0x258
>  lr : folio_clear_dirty_for_io+0x128/0x258
>  Call trace:
>   folio_clear_dirty_for_io+0x128/0x258
>   btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs]
>   __process_folios_contig+0x154/0x268 [btrfs]
>   extent_clear_unlock_delalloc+0x5c/0x80 [btrfs]
>   run_delalloc_nocow+0x5f8/0x760 [btrfs]
>   btrfs_run_delalloc_range+0xa8/0x220 [btrfs]
>   writepage_delalloc+0x230/0x4c8 [btrfs]
>   extent_writepage+0xb8/0x358 [btrfs]
>   extent_write_cache_pages+0x21c/0x4e8 [btrfs]
>   btrfs_writepages+0x94/0x150 [btrfs]
>   do_writepages+0x74/0x190
>   filemap_fdatawrite_wbc+0x88/0xc8
>   start_delalloc_inodes+0x178/0x3a8 [btrfs]
>   btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
>   shrink_delalloc+0x114/0x280 [btrfs]
>   flush_space+0x250/0x2f8 [btrfs]
>   btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
>   process_one_work+0x164/0x408
>   worker_thread+0x25c/0x388
>   kthread+0x100/0x118
>   ret_from_fork+0x10/0x20
>  Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000)
>  ---[ end trace 0000000000000000 ]---
> 
> [CAUSE]
> The first two lines of extra debug messages show the problem is caused
> by the error handling of run_delalloc_nocow().
> 
> E.g. we have the following dirtied range (4K blocksize 4K page size):
> 
>     0                 16K                  32K
>     |//////////////////////////////////////|
>     |  Pre-allocated  |
> 
> And the range [0, 16K) has a preallocated extent.
> 
> - Enter run_delalloc_nocow() for range [0, 16K)
>   Which finds that range [0, 16K) is preallocated, so it can do a
>   proper NOCOW write.
> 
> - Enter fallback_to_cow() for range [16K, 32K)
>   Since the range [16K, 32K) is not backed by preallocated extent, we
>   have to go COW.
> 
> - cow_file_range() failed for range [16K, 32K)
>   So cow_file_range() will do the cleanup by clearing the folio dirty
>   flags and unlocking the folios.
> 
>   Now the folios in range [16K, 32K) are unlocked.
> 
> - Enter extent_clear_unlock_delalloc() from run_delalloc_nocow()
>   Which is called with PAGE_START_WRITEBACK to start page writeback.
>   But a folio can only be marked for writeback when it is properly
>   locked, thus this triggers the VM_BUG_ON_FOLIO().
> 
> Furthermore, there is another hidden but common bug:
> run_delalloc_nocow() does not clear the folio dirty flags in its error
> handling path.
> This is the common bug shared between run_delalloc_nocow() and
> cow_file_range().
> 
> [FIX]
> - Clear folio dirty for range [@start, @cur_offset)
>   Introduce a helper, cleanup_dirty_folios(), which will find and lock
>   the folios in the range, clear their dirty flags and start/end the
>   writeback, with extra handling for @locked_folio.
> 
> - Introduce a helper to record the last failed COW range end
>   This is to track which range we should skip, to avoid double
>   unlocking.
> 
> - Skip the failed COW range for the error handling
> 
> Cc: stable@vger.kernel.org

Reviewed-by: Boris Burkov <boris@bur.io>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/inode.c | 93 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 86 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 19c88b7d0363..bae8aceb3eae 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1961,6 +1961,48 @@ static int can_nocow_file_extent(struct btrfs_path *path,
>  	return ret < 0 ? ret : can_nocow;
>  }
>  

I like this function. Can you add a simple doc with pre and post
conditions please?

> +static void cleanup_dirty_folios(struct btrfs_inode *inode,
> +				 struct folio *locked_folio,
> +				 u64 start, u64 end, int error)
> +{
> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct address_space *mapping = inode->vfs_inode.i_mapping;
> +	pgoff_t start_index = start >> PAGE_SHIFT;
> +	pgoff_t end_index = end >> PAGE_SHIFT;
> +	u32 len;
> +
> +	ASSERT(end + 1 - start < U32_MAX);
> +	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
> +	       IS_ALIGNED(end + 1, fs_info->sectorsize));
> +	len = end + 1 - start;
> +
> +	/*
> +	 * Handle the locked folio first.
> +	 * btrfs_folio_clamp_*() helpers can handle range out of the folio case.
> +	 */
> +	btrfs_folio_clamp_clear_dirty(fs_info, locked_folio, start, len);
> +	btrfs_folio_clamp_set_writeback(fs_info, locked_folio, start, len);
> +	btrfs_folio_clamp_clear_writeback(fs_info, locked_folio, start, len);

Could this clear dirty; set writeback; clear writeback sequence benefit
from a good name and a helper function too?

> +
> +	for (pgoff_t index = start_index; index <= end_index; index++) {
> +		struct folio *folio;
> +
> +		/* Already handled at the beginning. */
> +		if (index == locked_folio->index)
> +			continue;
> +		folio = __filemap_get_folio(mapping, index, FGP_LOCK, GFP_NOFS);
> +		/* Cache already dropped, no need to do any cleanup. */
> +		if (IS_ERR(folio))
> +			continue;
> +		btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
> +		btrfs_folio_clamp_set_writeback(fs_info, folio, start, len);
> +		btrfs_folio_clamp_clear_writeback(fs_info, folio, start, len);
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}
> +	mapping_set_error(mapping, error);
> +}
> +
>  /*
>   * when nowcow writeback call back.  This checks for snapshots or COW copies
>   * of the extents that exist in the file, and COWs the file as required.
> @@ -1976,6 +2018,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>  	struct btrfs_root *root = inode->root;
>  	struct btrfs_path *path;
>  	u64 cow_start = (u64)-1;
> +	/*
> +	 * If not 0, represents the inclusive end of the last fallback_to_cow()
> +	 * range. Only for error handling.
> +	 */
> +	u64 cow_end = 0;
>  	u64 cur_offset = start;
>  	int ret;
>  	bool check_prev = true;
> @@ -2136,6 +2183,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>  					      found_key.offset - 1);
>  			cow_start = (u64)-1;
>  			if (ret) {
> +				cow_end = found_key.offset - 1;
>  				btrfs_dec_nocow_writers(nocow_bg);
>  				goto error;
>  			}
> @@ -2209,11 +2257,12 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>  		cow_start = cur_offset;
>  
>  	if (cow_start != (u64)-1) {
> -		cur_offset = end;
>  		ret = fallback_to_cow(inode, locked_folio, cow_start, end);
>  		cow_start = (u64)-1;
> -		if (ret)
> +		if (ret) {
> +			cow_end = end;
>  			goto error;
> +		}
>  	}
>  
>  	btrfs_free_path(path);
> @@ -2221,12 +2270,42 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>  
>  error:
>  	/*
> -	 * If an error happened while a COW region is outstanding, cur_offset
> -	 * needs to be reset to cow_start to ensure the COW region is unlocked
> -	 * as well.
> +	 * There are several error cases:
> +	 *
> +	 * 1) Failed without falling back to COW
> +	 *    start         cur_start              end
> +	 *    |/////////////|                      |
> +	 *
> +	 *    For range [start, cur_start) the folios are already unlocked (except
> +	 *    @locked_folio), EXTENT_DELALLOC already removed.
> +	 *    Only need to clear the dirty flag as they will never be submitted.
> +	 *    Ordered extent and extent maps are handled by
> +	 *    btrfs_mark_ordered_io_finished() inside run_delalloc_range().
> +	 *
> +	 * 2) Failed with error from fallback_to_cow()
> +	 *    start         cur_start   cow_end    end
> +	 *    |/////////////|-----------|          |
> +	 *
> +	 *    For range [start, cur_start) it's the same as case 1).
> +	 *    But for range [cur_start, cow_end), the folios have dirty flag
> +	 *    cleared and unlocked, EXTENT_DELALLOC cleared.
> +	 *    There may or may not be any ordered extents/extent maps allocated.
> +	 *
> +	 *    We should not call extent_clear_unlock_delalloc() on range [cur_start,
> +	 *    cow_end), as the folios are already unlocked.
> +	 *

I think it would be helpful to include cur_offset in your drawings.

> +	 * So clear the folio dirty flags for [start, cur_offset) first.
>  	 */
> -	if (cow_start != (u64)-1)
> -		cur_offset = cow_start;
> +	if (cur_offset > start)
> +		cleanup_dirty_folios(inode, locked_folio, start, cur_offset - 1, ret);
> +
> +	/*
> +	 * If an error happened while a COW region is outstanding, cur_offset
> +	 * needs to be reset to @cow_end + 1 to skip the COW range, as
> +	 * cow_file_range() will do the proper cleanup at error.
> +	 */
> +	if (cow_end)
> +		cur_offset = cow_end + 1;
>  
>  	/*
>  	 * We need to lock the extent here because we're clearing DELALLOC and
> -- 
> 2.47.1
> 


* Re: [PATCH v2 4/9] btrfs: do proper folio cleanup when cow_file_range() failed
  2025-01-09 23:20   ` Boris Burkov
@ 2025-01-09 23:34     ` Qu Wenruo
  0 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2025-01-09 23:34 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, stable



On 2025/1/10 09:50, Boris Burkov wrote:
> On Thu, Dec 12, 2024 at 04:43:58PM +1030, Qu Wenruo wrote:
>> [BUG]
>> When testing with COW fixup marked as BUG_ON() (this is related to the
>> new pin_user_pages*() change, which should not result in new out-of-band
>> dirty pages), I hit a crash triggered by the BUG_ON() from hitting the
>> COW fixup path.
>>
>> This BUG_ON() happens just after a failed btrfs_run_delalloc_range():
>>
>>   BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28
>>   ------------[ cut here ]------------
>>   kernel BUG at fs/btrfs/extent_io.c:1444!
>>   Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>>   CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G           OE      6.12.0-rc7-custom+ #86
>>   Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
>>   Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
>>   pc : extent_writepage_io+0x2d4/0x308 [btrfs]
>>   lr : extent_writepage_io+0x2d4/0x308 [btrfs]
>>   Call trace:
>>    extent_writepage_io+0x2d4/0x308 [btrfs]
>>    extent_writepage+0x218/0x330 [btrfs]
>>    extent_write_cache_pages+0x1d4/0x4b0 [btrfs]
>>    btrfs_writepages+0x94/0x150 [btrfs]
>>    do_writepages+0x74/0x190
>>    filemap_fdatawrite_wbc+0x88/0xc8
>>    start_delalloc_inodes+0x180/0x3b0 [btrfs]
>>    btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
>>    shrink_delalloc+0x114/0x280 [btrfs]
>>    flush_space+0x250/0x2f8 [btrfs]
>>    btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
>>    process_one_work+0x164/0x408
>>    worker_thread+0x25c/0x388
>>    kthread+0x100/0x118
>>    ret_from_fork+0x10/0x20
>>   Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000)
>>   ---[ end trace 0000000000000000 ]---
>>
>> [CAUSE]
>> That failure is mostly from cow_file_range(), where we can hit -ENOSPC.
>>
>> Although the -ENOSPC is already a bug related to our space reservation
>> code, let's just focus on the error handling.
>>
>> For example, we have the following dirty range [0, 64K) of an inode,
>> with 4K sector size and 4K page size:
>>
>>     0        16K        32K       48K       64K
>>     |///////////////////////////////////////|
>>     |#######################################|
>>
>> Where |///| means pages that are still dirty, and |###| means the extent
>> io tree has the EXTENT_DELALLOC flag.
>>
>> - Enter extent_writepage() for page 0
>>
>> - Enter btrfs_run_delalloc_range() for range [0, 64K)
>>
>> - Enter cow_file_range() for range [0, 64K)
>>
>> - Function btrfs_reserve_extent() only reserved one 16K extent
>>    So we created an extent map and an ordered extent for range [0, 16K)
>>
>>     0        16K        32K       48K       64K
>>     |////////|//////////////////////////////|
>>     |<- OE ->|##############################|
>>
>>     And range [0, 16K) has its delalloc flag cleared.
>>     But since we haven't yet submitted any bio, the 4 involved pages are
>>     still dirty.
>>
>> - Function btrfs_reserve_extent() returns with -ENOSPC
>>    Now we have to run error cleanup, which will clear all
>>    EXTENT_DELALLOC* flags and clear the dirty flags for the remaining
>>    ranges:
>>
>>     0        16K        32K       48K       64K
>>     |////////|                              |
>>     |        |                              |
>>
>>    Note that range [0, 16K) still has its pages dirty.
>>
>> - Some time later, writeback is triggered again for the range [0, 16K)
>>    since the page range still has its dirty flags set.
>>
>> - btrfs_run_delalloc_range() will do nothing because there is no
>>    EXTENT_DELALLOC flag.
>>
>> - extent_writepage_io() finds page 0 has no ordered flag
>>    Which falls into the COW fixup path, triggering the BUG_ON().
>>
>> Unfortunately this error handling bug dates back to the introduction of
>> btrfs. Thankfully, with the abuse of COW fixup, at least it won't crash
>> the kernel.
>>
>> [FIX]
>> Instead of immediately unlocking the extent and folios, we keep the
>> extent and folios locked until we either error out or the whole delalloc
>> range is finished.
>>
>> When the whole delalloc range finishes without error, we just unlock the
>> whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked
>> cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared.
>> And the involved folios will be properly submitted, with their dirty
>> flags cleared during submission.
>>
>> For the error path, it will be a little more complex:
>>
>> - The range with ordered extent allocated (range (1))
>>    We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining
>>    flags are cleaned up by
>>    btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered().
>>
>>    For folios we finish the IO (clear dirty, start writeback and
>>    immediately finish the writeback) and unlock the folios.
>>
>> - The range with reserved extent but no ordered extent (range(2))
>> - The range we never touched (range(3))
>>    For both range (2) and range(3) the behavior is not changed.
>>
>> Now even if cow_file_range() fails halfway with some successfully
>> reserved extents/ordered extents, we will keep all folios clean, so
>> there will be no future writeback triggered on them.
> 
> 2 qs, to make sure I understand:
> 
> This changes the happy path, in that IO can't start on the allocated
> ordered extents until the whole range is done allocating and unlocked or
> errors. But it shouldn't be a big deal unless we have this race a lot?

If we hit the race a lot, it already means the fs is fragmented, thus we 
need a lot of loops to allocate many small extents.

For normal cases, we should really get a large extent for the delalloc 
range in one go, and in that case, the lock holding period is not changed.

So if we really hit some races, it already means our fs is fragmented 
and no one can expect a quick run anyway.

> 
> What is the new behavior in your test case? The whole range correctly is
> not dirty, no IO happens, and the mapping has an error set on it? Have
> you managed to demonstrate something to that effect more explicitly than
> not hitting the BUG_ON in your new code?

Unfortunately I have no better way to verify the behavior, other than 
the BUG_ON() in the COW fixup path.

It may reduce the warnings from the space reservation code (because we 
properly free the reserved space), but it's less obvious than the 
avoided BUG_ON().

> 
> However, assuming I understood correctly, LGTM.
> Reviewed-by: Boris Burkov <boris@bur.io>

Thanks a lot for the review.
Qu

> 
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/inode.c | 65 ++++++++++++++++++++++++------------------------
>>   1 file changed, 32 insertions(+), 33 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 5ba8d044757b..19c88b7d0363 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -1364,6 +1364,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>>   
>>   	alloc_hint = btrfs_get_extent_allocation_hint(inode, start, num_bytes);
>>   
>> +	/*
>> +	 * We're not doing compressed IO, don't unlock the first page
>> +	 * (which the caller expects to stay locked), don't clear any
>> +	 * dirty bits and don't set any writeback bits
>> +	 *
>> +	 * Do set the Ordered (Private2) bit so we know this page was
>> +	 * properly setup for writepage.
>> +	 */
>> +	page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
>> +	page_ops |= PAGE_SET_ORDERED;
>> +
>>   	/*
>>   	 * Relocation relies on the relocated extents to have exactly the same
>>   	 * size as the original extents. Normally writeback for relocation data
>> @@ -1423,6 +1434,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>>   		file_extent.offset = 0;
>>   		file_extent.compression = BTRFS_COMPRESS_NONE;
>>   
>> +		/*
>> +		 * Locked range will be released either during error clean up or
>> +		 * after the whole range is finished.
>> +		 */
>>   		lock_extent(&inode->io_tree, start, start + cur_alloc_size - 1,
>>   			    &cached);
>>   
>> @@ -1468,21 +1483,6 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>>   
>>   		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>>   
>> -		/*
>> -		 * We're not doing compressed IO, don't unlock the first page
>> -		 * (which the caller expects to stay locked), don't clear any
>> -		 * dirty bits and don't set any writeback bits
>> -		 *
>> -		 * Do set the Ordered flag so we know this page was
>> -		 * properly setup for writepage.
>> -		 */
>> -		page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
>> -		page_ops |= PAGE_SET_ORDERED;
>> -
>> -		extent_clear_unlock_delalloc(inode, start, start + cur_alloc_size - 1,
>> -					     locked_folio, &cached,
>> -					     EXTENT_LOCKED | EXTENT_DELALLOC,
>> -					     page_ops);
>>   		if (num_bytes < cur_alloc_size)
>>   			num_bytes = 0;
>>   		else
>> @@ -1499,6 +1499,9 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>>   		if (ret)
>>   			goto out_unlock;
>>   	}
>> +	extent_clear_unlock_delalloc(inode, orig_start, end, locked_folio, &cached,
>> +				     EXTENT_LOCKED | EXTENT_DELALLOC,
>> +				     page_ops);
>>   done:
>>   	if (done_offset)
>>   		*done_offset = end;
>> @@ -1519,35 +1522,31 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>>   	 * We process each region below.
>>   	 */
>>   
>> -	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
>> -		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
>> -	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
>> -
>>   	/*
>>   	 * For the range (1). We have already instantiated the ordered extents
>>   	 * for this region. They are cleaned up by
>>   	 * btrfs_cleanup_ordered_extents() in e.g,
>> -	 * btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
>> -	 * already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
>> -	 * EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
>> -	 * function.
>> +	 * btrfs_run_delalloc_range().
>> +	 * EXTENT_DELALLOC_NEW | EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV
>> +	 * are also handled by the cleanup function.
>>   	 *
>> -	 * However, in case of @keep_locked, we still need to unlock the pages
>> -	 * (except @locked_folio) to ensure all the pages are unlocked.
>> +	 * So here we only clear EXTENT_LOCKED and EXTENT_DELALLOC flag,
>> +	 * and finish the writeback of the involved folios, which will
>> +	 * never be submitted.
>>   	 */
>> -	if (keep_locked && orig_start < start) {
>> +	if (orig_start < start) {
>> +		clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC;
>> +		page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
>> +
>>   		if (!locked_folio)
>>   			mapping_set_error(inode->vfs_inode.i_mapping, ret);
>>   		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
>> -					     locked_folio, NULL, 0, page_ops);
>> +					     locked_folio, NULL, clear_bits, page_ops);
>>   	}
>>   
>> -	/*
>> -	 * At this point we're unlocked, we want to make sure we're only
>> -	 * clearing these flags under the extent lock, so lock the rest of the
>> -	 * range and clear everything up.
>> -	 */
>> -	lock_extent(&inode->io_tree, start, end, NULL);
>> +	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
>> +		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
>> +	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
>>   
>>   	/*
>>   	 * For the range (2). If we reserved an extent for our delalloc range
>> -- 
>> 2.47.1
>>



* Re: [PATCH v2 5/9] btrfs: do proper folio cleanup when run_delalloc_nocow() failed
  2025-01-09 23:26   ` Boris Burkov
@ 2025-01-09 23:45     ` Qu Wenruo
  0 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2025-01-09 23:45 UTC (permalink / raw)
  To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs, stable



On 2025/1/10 09:56, Boris Burkov wrote:
[...]
>
> I like this function. Can you add a simple doc with pre and post
> conditions please?

Sure, no problem.

Would be something like this:

/*
  * Clean up the dirty folios after failing to run a delalloc range.
  *
  * When running a delalloc range, we may need to split it into
  * different extents (due to fragmentation or NOCOW limits), and if
  * we hit an error, the previously successful ranges also need
  * to have their dirty flags cleared, with the address space marked
  * with the error.
  */
>
>> +static void cleanup_dirty_folios(struct btrfs_inode *inode,
>> +				 struct folio *locked_folio,
>> +				 u64 start, u64 end, int error)
>> +{
>> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> +	struct address_space *mapping = inode->vfs_inode.i_mapping;
>> +	pgoff_t start_index = start >> PAGE_SHIFT;
>> +	pgoff_t end_index = end >> PAGE_SHIFT;
>> +	u32 len;
>> +
>> +	ASSERT(end + 1 - start < U32_MAX);
>> +	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
>> +	       IS_ALIGNED(end + 1, fs_info->sectorsize));
>> +	len = end + 1 - start;
>> +
>> +	/*
>> +	 * Handle the locked folio first.
>> +	 * btrfs_folio_clamp_*() helpers can handle the range-out-of-folio case.
>> +	 */
>> +	btrfs_folio_clamp_clear_dirty(fs_info, locked_folio, start, len);
>> +	btrfs_folio_clamp_set_writeback(fs_info, locked_folio, start, len);
>> +	btrfs_folio_clamp_clear_writeback(fs_info, locked_folio, start, len);
>
> Could this clear dirty; set writeback; clear writeback sequence benefit
> from a good name and a helper function too?

Sure, what about btrfs_folio_clamp_finish_io()?
>
>> +
>> +	for (pgoff_t index = start_index; index <= end_index; index++) {
>> +		struct folio *folio;
>> +
>> +		/* Already handled at the beginning. */
>> +		if (index == locked_folio->index)
>> +			continue;
>> +		folio = __filemap_get_folio(mapping, index, FGP_LOCK, GFP_NOFS);
>> +		/* Cache already dropped, no need to do any cleanup. */
>> +		if (IS_ERR(folio))
>> +			continue;
>> +		btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
>> +		btrfs_folio_clamp_set_writeback(fs_info, folio, start, len);
>> +		btrfs_folio_clamp_clear_writeback(fs_info, folio, start, len);
>> +		folio_unlock(folio);
>> +		folio_put(folio);
>> +	}
>> +	mapping_set_error(mapping, error);
>> +}
>> +
>>   /*
>>    * when nowcow writeback call back.  This checks for snapshots or COW copies
>>    * of the extents that exist in the file, and COWs the file as required.
>> @@ -1976,6 +2018,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>>   	struct btrfs_root *root = inode->root;
>>   	struct btrfs_path *path;
>>   	u64 cow_start = (u64)-1;
>> +	/*
>> +	 * If not 0, represents the inclusive end of the last fallback_to_cow()
>> +	 * range. Only for error handling.
>> +	 */
>> +	u64 cow_end = 0;
>>   	u64 cur_offset = start;
>>   	int ret;
>>   	bool check_prev = true;
>> @@ -2136,6 +2183,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>>   					      found_key.offset - 1);
>>   			cow_start = (u64)-1;
>>   			if (ret) {
>> +				cow_end = found_key.offset - 1;
>>   				btrfs_dec_nocow_writers(nocow_bg);
>>   				goto error;
>>   			}
>> @@ -2209,11 +2257,12 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>>   		cow_start = cur_offset;
>>
>>   	if (cow_start != (u64)-1) {
>> -		cur_offset = end;
>>   		ret = fallback_to_cow(inode, locked_folio, cow_start, end);
>>   		cow_start = (u64)-1;
>> -		if (ret)
>> +		if (ret) {
>> +			cow_end = end;
>>   			goto error;
>> +		}
>>   	}
>>
>>   	btrfs_free_path(path);
>> @@ -2221,12 +2270,42 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>>
>>   error:
>>   	/*
>> -	 * If an error happened while a COW region is outstanding, cur_offset
>> -	 * needs to be reset to cow_start to ensure the COW region is unlocked
>> -	 * as well.
>> +	 * There are several error cases:
>> +	 *
>> +	 * 1) Failed without falling back to COW
>> +	 *    start         cur_start              end
>> +	 *    |/////////////|                      |
>> +	 *
>> +	 *    For range [start, cur_start) the folios are already unlocked (except
>> +	 *    @locked_folio), EXTENT_DELALLOC already removed.
>> +	 *    Only need to clear the dirty flag as they will never be submitted.
>> +	 *    Ordered extent and extent maps are handled by
>> +	 *    btrfs_mark_ordered_io_finished() inside run_delalloc_range().
>> +	 *
>> +	 * 2) Failed with error from fallback_to_cow()
>> +	 *    start         cur_start   cow_end    end
>> +	 *    |/////////////|-----------|          |
>> +	 *
>> +	 *    For range [start, cur_start) it's the same as case 1).
>> +	 *    But for range [cur_start, cow_end), the folios have dirty flag
>> +	 *    cleared and unlocked, EXTENT_DELALLOC cleared.
>> +	 *    There may or may not be any ordered extents/extent maps allocated.
>> +	 *
>> +	 *    We should not call extent_clear_unlock_delalloc() on range [cur_start,
>> +	 *    cow_end), as the folios are already unlocked.
>> +	 *
>
> I think it would be helpful to include cur_offset in your drawings.

I noticed this when crafting a new patch too; there is no @cur_start at
all, only @cur_offset.

Will fix it in the next update.

Thanks again for the detailed review,
Qu

>
>> +	 * So clear the folio dirty flags for [start, cur_offset) first.
>>   	 */
>> -	if (cow_start != (u64)-1)
>> -		cur_offset = cow_start;
>> +	if (cur_offset > start)
>> +		cleanup_dirty_folios(inode, locked_folio, start, cur_offset - 1, ret);
>> +
>> +	/*
>> +	 * If an error happened while a COW region is outstanding, cur_offset
>> +	 * needs to be reset to @cow_end + 1 to skip the COW range, as
>> +	 * cow_file_range() will do the proper cleanup at error.
>> +	 */
>> +	if (cow_end)
>> +		cur_offset = cow_end + 1;
>>
>>   	/*
>>   	 * We need to lock the extent here because we're clearing DELALLOC and
>> --
>> 2.47.1
>>
>



end of thread, other threads:[~2025-01-09 23:45 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-12  6:13 [PATCH v2 0/9] btrfs: error handling fixes Qu Wenruo
2024-12-12  6:13 ` [PATCH v2 1/9] btrfs: fix double accounting race when btrfs_run_delalloc_range() failed Qu Wenruo
2025-01-08 21:52   ` Boris Burkov
2025-01-09  2:45     ` Qu Wenruo
2025-01-09  4:47       ` Qu Wenruo
2024-12-12  6:13 ` [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed Qu Wenruo
2025-01-08 22:24   ` Boris Burkov
2025-01-09  2:56     ` Qu Wenruo
2025-01-09  3:45     ` Qu Wenruo
2025-01-09 18:06       ` Boris Burkov
2025-01-09 20:34         ` Qu Wenruo
2024-12-12  6:13 ` [PATCH v2 3/9] btrfs: fix the error handling of submit_uncompressed_range() Qu Wenruo
2025-01-08 22:33   ` Boris Burkov
2024-12-12  6:13 ` [PATCH v2 4/9] btrfs: do proper folio cleanup when cow_file_range() failed Qu Wenruo
2025-01-09 23:20   ` Boris Burkov
2025-01-09 23:34     ` Qu Wenruo
2024-12-12  6:13 ` [PATCH v2 5/9] btrfs: do proper folio cleanup when run_delalloc_nocow() failed Qu Wenruo
2025-01-09 23:26   ` Boris Burkov
2025-01-09 23:45     ` Qu Wenruo
2024-12-12  6:14 ` [PATCH v2 6/9] btrfs: subpage: fix the bitmap dump for the locked flags Qu Wenruo
2025-01-08 22:45   ` Boris Burkov
2024-12-12  6:14 ` [PATCH v2 7/9] btrfs: subpage: dump the involved bitmap when ASSERT() failed Qu Wenruo
2025-01-08 22:46   ` Boris Burkov
2024-12-12  6:14 ` [PATCH v2 8/9] btrfs: add extra error messages for delalloc range related errors Qu Wenruo
2025-01-08 22:47   ` Boris Burkov
2024-12-12  6:14 ` [PATCH v2 9/9] btrfs: remove the unused @locked_folio parameter from btrfs_cleanup_ordered_extents() Qu Wenruo
2025-01-08 22:48   ` Boris Burkov
2025-01-07 15:10 ` [PATCH v2 0/9] btrfs: error handling fixes David Sterba
