* [PATCH 0/2] btrfs: error handling fixes for extent_writepage()
@ 2024-11-26 6:29 Qu Wenruo
2024-11-26 6:29 ` [PATCH 1/2] btrfs: handle btrfs_run_delalloc_range() errors correctly Qu Wenruo
2024-11-26 6:29 ` [PATCH 2/2] btrfs: handle submit_one_sector() error inside extent_writepage_io() Qu Wenruo
From: Qu Wenruo @ 2024-11-26 6:29 UTC (permalink / raw)
To: linux-btrfs
Now with the new compressed write support and the incoming partial
uptodate folio and inline write support, it has become increasingly easy
to hit crashes in test cases like generic/750 when the sector size is
smaller than the page size.
The main symptom is ordered extent double accounting, causing various
problems, mostly kernel warnings and crashes (on debug builds).
The direct cause is a failure from writepage_delalloc() with -ENOSPC,
which is another rabbit hole, but here we need to focus on the error
handling.
All the call traces point to the btrfs_mark_ordered_io_finished() call
inside extent_writepage() for error handling.
It turns out that the unconditional, full-folio cleanup is the root
cause, and there are two error paths leading to the situation:
- btrfs_run_delalloc_range() failure
  Some delalloc ranges may already have been submitted asynchronously
  (mostly compression), and we try to clean up ranges we shouldn't.
  This is the most common one, as I'm hitting quite a few -ENOSPC errors
  during my fstests runs, and all the hangs/crashes follow such errors.
- submit_one_sector() failure
  This is much rarer, as I haven't actually hit one.
  But the idea is much the same: we must not touch ranges that are
  handled elsewhere.
Both fixes are similar: move the btrfs_mark_ordered_io_finished() calls
for error handling into each function, so that we avoid touching async
extents.
Although these fixes are mostly for backports, the proper fix in the end
would be reworking how we handle dirty folio writeback.
The current way is map-map-map, then submit-submit-submit (run delalloc
for every dirty sector of the folio, then submit all dirty sectors).
The planned new fix would be more iomap-like:
map-submit-map-submit-map-submit (run one delalloc range, then
immediately submit it).
Qu Wenruo (2):
btrfs: handle btrfs_run_delalloc_range() errors correctly
btrfs: handle submit_one_sector() error inside extent_writepage_io()
fs/btrfs/extent_io.c | 71 +++++++++++++++++++++++++++++++++-----------
1 file changed, 53 insertions(+), 18 deletions(-)
--
2.47.0
* [PATCH 1/2] btrfs: handle btrfs_run_delalloc_range() errors correctly
2024-11-26 6:29 [PATCH 0/2] btrfs: error handling fixes for extent_writepage() Qu Wenruo
@ 2024-11-26 6:29 ` Qu Wenruo
2024-11-26 6:29 ` [PATCH 2/2] btrfs: handle submit_one_sector() error inside extent_writepage_io() Qu Wenruo
From: Qu Wenruo @ 2024-11-26 6:29 UTC (permalink / raw)
To: linux-btrfs; +Cc: stable
[BUG]
There are several crashes or hangs during fstests runs with a
sectorsize < page size setup.
It turns out that most of those hangs happen after a
btrfs_run_delalloc_range() failure (caused by -ENOSPC).
The most common reproducer is generic/750.
The symptoms are all related to ordered extent finishing, where we
double account the target ordered extent.
[CAUSE]
Inside writepage_delalloc(), if we hit an error from
btrfs_run_delalloc_range(), we still unlock all the locked ranges, but
that is the only error handling done.
If we have the following page layout with a 64K page size and 4K sector
size:
0 4K 32K 40K 60K 64K
|////| |////| |/////|
Where |////| marks the dirtied blocks inside the folio.
Then we hit the following sequence:
- Enter writepage_delalloc() for folio 0
- btrfs_run_delalloc_range() returned 0 for [0, 4K)
  And created a regular COW ordered extent for range [0, 4K).
- btrfs_run_delalloc_range() returned 0 for [32K, 40K)
  And created an async extent for range [32K, 40K).
  This means the error handling will be done in another thread, so we
  should not touch the range anymore.
- btrfs_run_delalloc_range() failed with -ENOSPC for range [60K, 64K)
  In theory we should not fail, since we should have reserved enough
  space at buffered write time, but let's ignore that rabbit hole and
  focus on the error handling.
- Error handling in extent_writepage()
Now we go to the done: tag, calling btrfs_mark_ordered_io_finished()
for the whole folio range.
This will find ranges [0, 4K) and [32K, 40K) to clean up. Range
[0, 4K) should indeed be cleaned up, but range [32K, 40K) is handled
asynchronously and its ordered extent may already have been submitted.
This leads to double accounting for range [32K, 40K) and crashes the
kernel.
Unfortunately this bad error handling has existed since the very
beginning of sector size < page size support.
[FIX]
Instead of relying on the btrfs_mark_ordered_io_finished() call to
clean up the whole folio range, record the end of the last successfully
run delalloc range.
Combined with bio_ctrl->submit_bitmap, this lets us properly clean up
any newly created ordered extents.
Since we have cleaned up the ordered extents in that range, we should no
longer rely on the btrfs_mark_ordered_io_finished() call inside
extent_writepage().
With this, we avoid double accounting during error handling.
Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 37 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 438974d4def4..0132c2b84d99 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1145,11 +1145,13 @@ static bool find_next_delalloc_bitmap(struct folio *folio,
* helper for extent_writepage(), doing all of the delayed allocation setup.
*
* This returns 1 if btrfs_run_delalloc_range function did all the work required
- * to write the page (copy into inline extent). In this case the IO has
- * been started and the page is already unlocked.
+ * to write the page (copy into inline extent or compression). In this case
+ * the IO has been started and we should no longer touch the page (may have
+ * already been unlocked).
*
* This returns 0 if all went well (page still locked)
- * This returns < 0 if there were errors (page still locked)
+ * This returns < 0 if there were errors (page still locked), in this case any
+ * newly created delalloc range will be marked as error and finished.
*/
static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
struct folio *folio,
@@ -1167,6 +1169,12 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
* last delalloc end.
*/
u64 last_delalloc_end = 0;
+ /*
+ * Save the end (exclusive) of the last successfully run delalloc range.
+ * This is for error handling, to avoid ranges where an ordered extent was
+ * created but no IO will be submitted due to the error.
+ */
+ u64 last_finished = page_start;
u64 delalloc_start = page_start;
u64 delalloc_end = page_end;
u64 delalloc_to_write = 0;
@@ -1235,11 +1243,19 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
found_len = last_delalloc_end + 1 - found_start;
if (ret >= 0) {
+ /*
+ * Some delalloc ranges may have been created by previous folios.
+ * Thus we still need to clean those ranges up during error
+ * handling.
+ */
+ last_finished = found_start;
/* No errors hit so far, run the current delalloc range. */
ret = btrfs_run_delalloc_range(inode, folio,
found_start,
found_start + found_len - 1,
wbc);
+ if (ret >= 0)
+ last_finished = found_start + found_len;
} else {
/*
* We've hit an error during previous delalloc range,
@@ -1274,8 +1290,21 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
delalloc_start = found_start + found_len;
}
- if (ret < 0)
+ /*
+ * It's possible we have some ordered extents created before we hit
+ * an error, cleanup non-async successfully created delalloc ranges.
+ */
+ if (unlikely(ret < 0)) {
+ unsigned int bitmap_size = min(
+ (last_finished - page_start) >> fs_info->sectorsize_bits,
+ fs_info->sectors_per_page);
+
+ for_each_set_bit(bit, &bio_ctrl->submit_bitmap, bitmap_size)
+ btrfs_mark_ordered_io_finished(inode, folio,
+ page_start + (bit << fs_info->sectorsize_bits),
+ fs_info->sectorsize, false);
return ret;
+ }
out:
if (last_delalloc_end)
delalloc_end = last_delalloc_end;
@@ -1509,13 +1538,13 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
bio_ctrl->wbc->nr_to_write--;
-done:
- if (ret) {
+ if (ret)
btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
page_start, PAGE_SIZE, !ret);
- mapping_set_error(folio->mapping, ret);
- }
+done:
+ if (ret < 0)
+ mapping_set_error(folio->mapping, ret);
/*
* Only unlock ranges that are submitted. As there can be some async
* submitted ranges inside the folio.
--
2.47.0
* [PATCH 2/2] btrfs: handle submit_one_sector() error inside extent_writepage_io()
2024-11-26 6:29 [PATCH 0/2] btrfs: error handling fixes for extent_writepage() Qu Wenruo
2024-11-26 6:29 ` [PATCH 1/2] btrfs: handle btrfs_run_delalloc_range() errors correctly Qu Wenruo
@ 2024-11-26 6:29 ` Qu Wenruo
From: Qu Wenruo @ 2024-11-26 6:29 UTC (permalink / raw)
To: linux-btrfs; +Cc: stable
[BUG]
If submit_one_sector() fails inside extent_writepage_io() for a sector
size < page size case (e.g. 4K sector size and 64K page size), then we
can hit a double ordered extent accounting error.
This should be very rare, as submit_one_sector() only fails when we
fail to grab the extent map, and such an extent map should already
exist in memory and be pinned.
[CAUSE]
For example we have the following folio layout:
0 4K 32K 48K 60K 64K
|//| |//////| |///|
Where |///| marks the dirty ranges we need to write back. The three
dirty ranges are submitted for regular COW.
Now we hit the following sequence:
- submit_one_sector() returned 0 for [0, 4K)
- submit_one_sector() returned 0 for [32K, 48K)
- submit_one_sector() returned error for [60K, 64K)
- btrfs_mark_ordered_io_finished() called for the whole folio
This will mark the following ranges as finished:
* [0, 4K)
* [32K, 48K)
Both ranges already have their IO submitted; this cleanup will
lead to double accounting.
* [60K, 64K)
That's the correct cleanup.
Unfortunately this behavior dates back to the days before subpage
support existed.
[FIX]
Instead of calling btrfs_mark_ordered_io_finished() unconditionally in
extent_writepage(), which can touch ranges we should not touch, move
the error handling into extent_writepage_io().
That way we can clean up exactly the sectors that ought to be submitted
but failed.
This provides much more accurate cleanup, avoiding the double accounting.
Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0132c2b84d99..a3d4f698fd25 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1420,6 +1420,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
struct btrfs_fs_info *fs_info = inode->root->fs_info;
unsigned long range_bitmap = 0;
bool submitted_io = false;
+ bool error = false;
const u64 folio_start = folio_pos(folio);
u64 cur;
int bit;
@@ -1462,11 +1463,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
break;
}
ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
- if (ret < 0)
- goto out;
+ if (unlikely(ret < 0)) {
+ submit_one_bio(bio_ctrl);
+ /*
+ * Failed to grab the extent map which should be very rare.
+ * Since there is no bio submitted to finish the ordered
+ * extent, we have to manually finish this sector.
+ */
+ btrfs_mark_ordered_io_finished(inode, folio, cur,
+ fs_info->sectorsize, false);
+ error = true;
+ continue;
+ }
submitted_io = true;
}
-out:
+
/*
* If we didn't submitted any sector (>= i_size), folio dirty get
* cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
@@ -1474,8 +1485,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
*
* Here we set writeback and clear for the range. If the full folio
* is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
+ *
+ * If we hit any error, the corresponding sector will still be dirty
+ * thus no need to clear PAGECACHE_TAG_DIRTY.
*/
- if (!submitted_io) {
+ if (!submitted_io && !error) {
btrfs_folio_set_writeback(fs_info, folio, start, len);
btrfs_folio_clear_writeback(fs_info, folio, start, len);
}
@@ -1495,7 +1509,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
{
struct inode *inode = folio->mapping->host;
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
- const u64 page_start = folio_pos(folio);
int ret;
size_t pg_offset;
loff_t i_size = i_size_read(inode);
@@ -1538,10 +1551,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
bio_ctrl->wbc->nr_to_write--;
- if (ret)
- btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
- page_start, PAGE_SIZE, !ret);
-
done:
if (ret < 0)
mapping_set_error(folio->mapping, ret);
@@ -2322,11 +2331,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
if (ret == 1)
goto next_page;
- if (ret) {
- btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
- cur, cur_len, !ret);
+ if (ret)
mapping_set_error(mapping, ret);
- }
btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
if (ret < 0)
found_error = true;
--
2.47.0