[PATCH AUTOSEL 6.16] btrfs: clear block dirty if submit_one

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH AUTOSEL 6.16] btrfs: clear block dirty if submit_one_sector() failed
       [not found] <20250825121505.2983941-1-sashal@kernel.org>
@ 2025-08-25 12:14 ` Sasha Levin
  2025-08-25 12:14 ` [PATCH AUTOSEL 6.16-6.12] btrfs: zoned: skip ZONE FINISH of conventional zones Sasha Levin
  1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2025-08-25 12:14 UTC (permalink / raw)
  To: patches, stable
  Cc: Qu Wenruo, Filipe Manana, David Sterba, Sasha Levin, clm, josef,
	linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 4bcd3061e8154606af7f721cb75ca04ffe191a12 ]

[BUG]
If submit_one_sector() failed, the block will be kept dirty, but with
their corresponding range finished in the ordered extent.

This means if a writeback happens later again, we can hit the following
problems:

- ASSERT(block_start != EXTENT_MAP_HOLE) in submit_one_sector()
  If the original extent map is a hole, then we can hit this case, as
  the new ordered extent failed, we will drop the new extent map and
  re-read one from the disk.

- DEBUG_WARN() in btrfs_writepage_cow_fixup()
  This is because we no longer have an ordered extent for those dirty
  blocks. The original for them is already finished with error.

[CAUSE]
The function submit_one_sector() is not following the regular error
handling of writeback.  The common practice is to clear the folio dirty,
start and finish the writeback for the block.

This is normally done by extent_clear_unlock_delalloc() with
PAGE_START_WRITEBACK | PAGE_END_WRITEBACK flags during
run_delalloc_range().

So if we keep those failed blocks dirty, they will stay in the page
cache and wait for the next writeback.

And since the original ordered extent is already finished and removed,
depending on the original extent map, we either hit the ASSERT() inside
submit_one_sector(), or hit the DEBUG_WARN() in
btrfs_writepage_cow_fixup().

[FIX]
Follow the regular error handling to clear the dirty flag for the block,
start and finish writeback for that block instead.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Based on my analysis of the commit and the kernel repository context,
here's my assessment:

**Backport Status: YES**

This commit should be backported to stable kernel trees for the
following reasons:

## Bug Fix Analysis

1. **Critical Bug Being Fixed**: The commit fixes a serious error
   handling bug in btrfs writeback that can lead to two different
   assertion failures:
   - `ASSERT(block_start != EXTENT_MAP_HOLE)` in submit_one_sector()
   - `DEBUG_WARN()` in btrfs_writepage_cow_fixup()

2. **Data Integrity Issue**: The bug causes dirty blocks to remain dirty
   after a failed submission, but their corresponding ordered extent is
   already finished with error. This creates an inconsistent state
   where:
   - Dirty blocks exist without proper ordered extent tracking
   - Subsequent writeback attempts will fail with assertions/warnings
   - The filesystem enters an undefined state that could affect data
     integrity

3. **Clear Root Cause**: The commit message clearly identifies the
   problem - submit_one_sector() was not following standard writeback
   error handling practices. The fix aligns the error handling with the
   rest of the btrfs writeback code.

## Code Change Analysis

The fix is minimal and contained:
```c
if (IS_ERR(em)) {
+    /*
+     * When submission failed, we should still clear the folio dirty.
+     * Or the folio will be written back again but without any
+     * ordered extent.
+     */
+    btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize);
+    btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize);
+    btrfs_folio_clear_writeback(fs_info, folio, filepos, sectorsize);
    return PTR_ERR(em);
}
```

The changes:
- Add proper error handling to clear dirty flag
- Set and clear writeback status to properly finish the failed writeback
- Update comments to clarify the behavior

## Stable Tree Criteria Met

1. **Fixes a real bug**: Yes - prevents assertion failures and potential
   filesystem corruption
2. **Small and contained**: Yes - only ~10 lines of actual code change
   in one function
3. **No new features**: Correct - purely bug fix
4. **Low regression risk**: The change follows established patterns used
   elsewhere in btrfs (extent_clear_unlock_delalloc)
5. **Important enough**: Yes - prevents filesystem errors and potential
   data integrity issues

The fix is straightforward, follows existing btrfs patterns, and
addresses a clear bug that could affect users running btrfs filesystems,
especially under I/O error conditions. This makes it an excellent
candidate for stable backporting.

 fs/btrfs/extent_io.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1dc931c4937f..2e127c109e5b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1483,7 +1483,7 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,

 /*
  * Return 0 if we have submitted or queued the sector for submission.
- * Return <0 for critical errors.
+ * Return <0 for critical errors, and the sector will have its dirty flag cleared.
  *
  * Caller should make sure filepos < i_size and handle filepos >= i_size case.
  */
@@ -1506,8 +1506,17 @@ static int submit_one_sector(struct btrfs_inode *inode,
 	ASSERT(filepos < i_size);

 	em = btrfs_get_extent(inode, NULL, filepos, sectorsize);
-	if (IS_ERR(em))
+	if (IS_ERR(em)) {
+		/*
+		 * When submission failed, we should still clear the folio dirty.
+		 * Or the folio will be written back again but without any
+		 * ordered extent.
+		 */
+		btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize);
+		btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize);
+		btrfs_folio_clear_writeback(fs_info, folio, filepos, sectorsize);
 		return PTR_ERR(em);
+	}

 	extent_offset = filepos - em->start;
 	em_end = btrfs_extent_map_end(em);
@@ -1637,8 +1646,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	 * Here we set writeback and clear for the range. If the full folio
 	 * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
 	 *
-	 * If we hit any error, the corresponding sector will still be dirty
-	 * thus no need to clear PAGECACHE_TAG_DIRTY.
+	 * If we hit any error, the corresponding sector will have its dirty
+	 * flag cleared and writeback finished, thus no need to handle the error case.
 	 */
 	if (!submitted_io && !error) {
 		btrfs_folio_set_writeback(fs_info, folio, start, len);
-- 
2.50.1

^ permalink raw reply related	[flat|nested] 2+ messages in thread

* [PATCH AUTOSEL 6.16-6.12] btrfs: zoned: skip ZONE FINISH of conventional zones
       [not found] <20250825121505.2983941-1-sashal@kernel.org>
  2025-08-25 12:14 ` [PATCH AUTOSEL 6.16] btrfs: clear block dirty if submit_one_sector() failed Sasha Levin
@ 2025-08-25 12:14 ` Sasha Levin
  1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2025-08-25 12:14 UTC (permalink / raw)
  To: patches, stable
  Cc: Johannes Thumshirn, Naohiro Aota, Anand Jain, David Sterba,
	Sasha Levin, clm, josef, linux-btrfs

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

[ Upstream commit f0ba0e7172a222ea6043b61ecd86723c46d7bcf2 ]

Don't call ZONE FINISH for conventional zones as this will result in I/O
errors. Instead check if the zone that needs finishing is a conventional
zone and if yes skip it.

Also factor out the actual handling of finishing a single zone into a
helper function, as do_zone_finish() is growing ever bigger and the
indentations levels are getting higher.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

**Backport Status: YES**

This commit should be backported to stable kernel trees for the
following reasons:

## Bug Fix Analysis

1. **Critical Bug Fix**: The commit fixes a concrete bug where calling
   `REQ_OP_ZONE_FINISH` on conventional zones results in I/O errors.
   This is a real runtime error that affects users of btrfs on zoned
   storage devices that have conventional zones mixed with sequential
   zones.

2. **Small and Contained Change**: The fix is relatively small and well-
   contained:
   - Adds a new helper function `call_zone_finish()` that encapsulates
     the zone finishing logic
   - Most importantly, adds the critical check: `if
     (btrfs_dev_is_sequential(device, physical))` before issuing the
     `REQ_OP_ZONE_FINISH` operation
   - The refactoring merely moves existing code into the helper function
     without changing the logic

3. **Clear Root Cause**: The bug occurs because the original code
   unconditionally calls `blkdev_zone_mgmt()` with `REQ_OP_ZONE_FINISH`
   on all zones, but this operation is invalid for conventional zones
   (non-sequential zones). The fix properly checks if a zone is
   sequential before attempting to finish it.

4. **No New Features**: This commit doesn't introduce any new
   functionality - it's purely a bug fix that prevents I/O errors.

5. **Minimal Risk**: The change has minimal risk of regression:
   - The check for sequential zones is straightforward
   - The refactoring doesn't change the existing logic flow
   - The fix has been reviewed by multiple developers familiar with the
     zoned code

6. **Affects Real Users**: This bug affects users running btrfs on SMR
   (Shingled Magnetic Recording) drives or ZNS (Zoned Namespace) SSDs
   that have a mix of conventional and sequential zones, which is a
   common configuration.

## Code Analysis

The key fix in `call_zone_finish()` at line 2262:
```c
if (btrfs_dev_is_sequential(device, physical)) {
    // Only call zone finish for sequential zones
    ret = blkdev_zone_mgmt(device->bdev, REQ_OP_ZONE_FINISH, ...);
}
```

This prevents the invalid operation on conventional zones while
maintaining the correct behavior for sequential zones. The subsequent
operations (updating reserved_active_zones and clearing active zone) are
still performed regardless of zone type, which is the correct behavior.

The commit follows stable kernel rules perfectly: it's a clear bug fix,
has minimal changes, doesn't introduce new features, and addresses a
real user-facing issue that causes I/O errors.

 fs/btrfs/zoned.c | 55 ++++++++++++++++++++++++++++++------------------
 1 file changed, 35 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 5439d8374716..950e72dc537c 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2246,6 +2246,40 @@ static void wait_eb_writebacks(struct btrfs_block_group *block_group)
 	rcu_read_unlock();
 }

+static int call_zone_finish(struct btrfs_block_group *block_group,
+			    struct btrfs_io_stripe *stripe)
+{
+	struct btrfs_device *device = stripe->dev;
+	const u64 physical = stripe->physical;
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	int ret;
+
+	if (!device->bdev)
+		return 0;
+
+	if (zinfo->max_active_zones == 0)
+		return 0;
+
+	if (btrfs_dev_is_sequential(device, physical)) {
+		unsigned int nofs_flags;
+
+		nofs_flags = memalloc_nofs_save();
+		ret = blkdev_zone_mgmt(device->bdev, REQ_OP_ZONE_FINISH,
+				       physical >> SECTOR_SHIFT,
+				       zinfo->zone_size >> SECTOR_SHIFT);
+		memalloc_nofs_restore(nofs_flags);
+
+		if (ret)
+			return ret;
+	}
+
+	if (!(block_group->flags & BTRFS_BLOCK_GROUP_DATA))
+		zinfo->reserved_active_zones++;
+	btrfs_dev_clear_active_zone(device, physical);
+
+	return 0;
+}
+
 static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_written)
 {
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
@@ -2330,31 +2364,12 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
 	down_read(&dev_replace->rwsem);
 	map = block_group->physical_map;
 	for (i = 0; i < map->num_stripes; i++) {
-		struct btrfs_device *device = map->stripes[i].dev;
-		const u64 physical = map->stripes[i].physical;
-		struct btrfs_zoned_device_info *zinfo = device->zone_info;
-		unsigned int nofs_flags;
-
-		if (!device->bdev)
-			continue;
-
-		if (zinfo->max_active_zones == 0)
-			continue;
-
-		nofs_flags = memalloc_nofs_save();
-		ret = blkdev_zone_mgmt(device->bdev, REQ_OP_ZONE_FINISH,
-				       physical >> SECTOR_SHIFT,
-				       zinfo->zone_size >> SECTOR_SHIFT);
-		memalloc_nofs_restore(nofs_flags);

+		ret = call_zone_finish(block_group, &map->stripes[i]);
 		if (ret) {
 			up_read(&dev_replace->rwsem);
 			return ret;
 		}
-
-		if (!(block_group->flags & BTRFS_BLOCK_GROUP_DATA))
-			zinfo->reserved_active_zones++;
-		btrfs_dev_clear_active_zone(device, physical);
 	}
 	up_read(&dev_replace->rwsem);

-- 
2.50.1

^ permalink raw reply related	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2025-08-25 12:15 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20250825121505.2983941-1-sashal@kernel.org>
2025-08-25 12:14 ` [PATCH AUTOSEL 6.16] btrfs: clear block dirty if submit_one_sector() failed Sasha Levin
2025-08-25 12:14 ` [PATCH AUTOSEL 6.16-6.12] btrfs: zoned: skip ZONE FINISH of conventional zones Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).