* [PATCH v3 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path
@ 2025-11-20 9:16 Qu Wenruo
2025-11-20 9:16 ` [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated Qu Wenruo
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-11-20 9:16 UTC (permalink / raw)
To: linux-btrfs
[CHANGELOG]
v3:
- Use @cur_len for btrfs_get_extent() in the last patch
And remove the hole related comment.
This should not cause any behavior change.
The @len parameter for btrfs_get_extent() is only going to cause a
difference when the range is completely a hole (e.g. beyond EOF).
For our writeback routine there should always be an extent map, thus
it makes no difference whether we pass @cur_len or sectorsize.
I think this @len parameter of btrfs_get_extent() is causing
unnecessary complexity, and I want to remove it completely. But that
will be a new series.
v2:
- Fix a potential bug where OEs beyond EOF are not truncated properly
This replaces the original patch that extracted the code into a helper.
- Replace more for_each_set_bit() with for_each_set_bitrange()
- Fix several copy-and-pasted incorrect ranges inside submit_range()
Although btrfs has had bs < ps support for a long time, and large data
folio support is also going to graduate from experimental features
soon, the write path still iterates each fs block and calls
btrfs_get_extent() on each of them.
What makes the situation worse is that, for the write path, we do not
have any cached extent map, meaning that even with large folios, where
we can get a contiguous range that can be submitted in one go, we still
call btrfs_get_extent() many times and get the same extent map again
and again.
This series reduces the duplicated btrfs_get_extent() calls by calling
it only once for each range, rather than once for each fs block.
The first patch fixes a potential bug spotted thanks to Boris' review.
Patch 2~3 are minor cleanups.
Patch 4 is the core of the optimization.
Although I don't expect there will be much difference in the real world.
Qu Wenruo (4):
btrfs: make sure all ordered extents beyond EOF is properly truncated
btrfs: integrate the error handling of submit_one_sector()
btrfs: replace for_each_set_bit() with for_each_set_bitrange()
btrfs: reduce extent map lookup during writes
fs/btrfs/extent_io.c | 230 ++++++++++++++++++++--------------------
fs/btrfs/ordered-data.c | 38 +++++++
fs/btrfs/ordered-data.h | 2 +
3 files changed, 156 insertions(+), 114 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated
2025-11-20 9:16 [PATCH v3 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
@ 2025-11-20 9:16 ` Qu Wenruo
2025-11-21 11:55 ` Filipe Manana
2025-11-20 9:16 ` [PATCH v3 2/4] btrfs: integrate the error handling of submit_one_sector() Qu Wenruo
` (3 subsequent siblings)
4 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-11-20 9:16 UTC (permalink / raw)
To: linux-btrfs
[POSSIBLE BUG]
If there are multiple ordered extents beyond EOF, at folio writeback
time we may only truncate the first ordered extent, leaving the
remaining ones to finish without being marked as truncated.
Since those OEs are not marked as truncated, they will still insert a
file extent item, which may lead to false missing csum errors during
"btrfs check".
[CAUSE]
Since we have had bs < ps support for a while, and large data folios
are also going to graduate from experimental features soon, we can have
the following folio being written back:
fs block size 4K
page size 4K, folio size 64K.
0 16K 32K 64K
|<---------------- Dirty -------------->|
|<-OE A->|<-OE B->|<----- OE C -------->|
|
i_size 4K.
In the above case we need to submit the writeback for the range [0, 4K).
For the range [4K, 64K) there is no need to submit any IO, but we must
mark all the involved OEs (OE A, B and C) as truncated.
However during the EOF handling, we only call
btrfs_lookup_first_ordered_range() once, thus we only get OE A and mark
it as truncated.
OE B and C are not marked as truncated, so they will finish as usual,
leaving a regular file extent item inserted beyond EOF without any data
checksum.
[FIX]
Introduce a new helper, btrfs_mark_ordered_io_truncated(), to handle all
OEs of a range, and mark them all as truncated.
With that helper, all OEs (A, B and C) will be marked as truncated.
OE B and C will have a truncated_len of 0, preventing any file extent
item from being inserted for them.
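The intended semantics of the new helper can be modeled in plain userspace C. This is only a sketch, not the kernel code: ordered extents are a flat array, and btrfs_lookup_first_ordered_range() is simulated by a linear scan.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model of the fix, NOT the kernel implementation: ordered
 * extents (OEs) are a plain array, and truncated_len tracks how many
 * bytes of an OE may still end up in a file extent item.
 */
struct oe {
	unsigned long long file_offset;
	unsigned long long num_bytes;
	unsigned long long truncated_len;	/* starts as num_bytes */
	int truncated;
};

/* Stand-in for btrfs_lookup_first_ordered_range(): first OE overlapping the range. */
static struct oe *lookup_first(struct oe *oes, size_t nr,
			       unsigned long long start, unsigned long long len)
{
	for (size_t i = 0; i < nr; i++)
		if (oes[i].file_offset < start + len &&
		    oes[i].file_offset + oes[i].num_bytes > start)
			return &oes[i];
	return NULL;
}

/*
 * Model of btrfs_mark_ordered_io_truncated(): walk ALL OEs inside
 * [start, start + len), not just the first one. Assumes the range is
 * fully covered by OEs, as the writeback path guarantees.
 */
static void mark_truncated_range(struct oe *oes, size_t nr,
				 unsigned long long start, unsigned long long len)
{
	unsigned long long cur = start;

	while (cur < start + len) {
		unsigned long long cur_len = start + len - cur;
		struct oe *o = lookup_first(oes, nr, cur, cur_len);

		assert(o != NULL);	/* delalloc just ran, so an OE must exist */
		o->truncated = 1;
		if (cur - o->file_offset < o->truncated_len)
			o->truncated_len = cur - o->file_offset;
		/* Advance past this OE, clamped to the requested range. */
		if (o->file_offset + o->num_bytes - cur < cur_len)
			cur_len = o->file_offset + o->num_bytes - cur;
		cur += cur_len;
	}
}
```

Running this on the OE A/B/C layout above (truncating [4K, 64K)) leaves OE A with a 4K truncated_len while OE B and C get 0, matching the behavior the commit message describes.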
Fixes: f0a0f5bfe04e ("btrfs: truncate ordered extent when skipping writeback past i_size")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 19 +------------------
fs/btrfs/ordered-data.c | 38 ++++++++++++++++++++++++++++++++++++++
fs/btrfs/ordered-data.h | 2 ++
3 files changed, 41 insertions(+), 18 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2d32dfc34ae3..2044b889c887 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1725,24 +1725,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
if (cur >= i_size) {
- struct btrfs_ordered_extent *ordered;
-
- ordered = btrfs_lookup_first_ordered_range(inode, cur,
- folio_end - cur);
- /*
- * We have just run delalloc before getting here, so
- * there must be an ordered extent.
- */
- ASSERT(ordered != NULL);
- spin_lock(&inode->ordered_tree_lock);
- set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
- ordered->truncated_len = min(ordered->truncated_len,
- cur - ordered->file_offset);
- spin_unlock(&inode->ordered_tree_lock);
- btrfs_put_ordered_extent(ordered);
-
- btrfs_mark_ordered_io_finished(inode, folio, cur,
- end - cur, true);
+ btrfs_mark_ordered_io_truncated(inode, folio, cur, end - cur);
/*
* This range is beyond i_size, thus we don't need to
* bother writing back.
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a421f7db9eec..9a0638d13225 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -546,6 +546,44 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
spin_unlock(&inode->ordered_tree_lock);
}
+/*
+ * Mark all ordered extents io inside the specified range as truncated.
+ *
+ * This is utilized by writeback path, thus there must be an ordered extent
+ * for the range.
+ */
+void btrfs_mark_ordered_io_truncated(struct btrfs_inode *inode, struct folio *folio,
+ u64 file_offset, u32 len)
+{
+ u64 cur = file_offset;
+
+ ASSERT(file_offset >= folio_pos(folio));
+ ASSERT(file_offset + len <= folio_end(folio));
+
+ while (cur < file_offset + len) {
+ u32 cur_len = file_offset + len - cur;
+ struct btrfs_ordered_extent *ordered;
+
+ ordered = btrfs_lookup_first_ordered_range(inode, cur, cur_len);
+
+ /*
+ * We have just run delalloc before getting here, so there must
+ * be an ordered extent.
+ */
+ ASSERT(ordered != NULL);
+ scoped_guard(spinlock, &inode->ordered_tree_lock) {
+ set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+ ordered->truncated_len = min(ordered->truncated_len,
+ cur - ordered->file_offset);
+ }
+ cur_len = min(cur_len, ordered->file_offset + ordered->num_bytes - cur);
+ btrfs_put_ordered_extent(ordered);
+
+ cur += cur_len;
+ }
+ btrfs_mark_ordered_io_finished(inode, folio, file_offset, len, true);
+}
+
/*
* Finish IO for one ordered extent across a given range. The range can only
* contain one ordered extent.
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 1e6b0b182b29..dd4cdc1a8b78 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -169,6 +169,8 @@ void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
struct folio *folio, u64 file_offset,
u64 num_bytes, bool uptodate);
+void btrfs_mark_ordered_io_truncated(struct btrfs_inode *inode, struct folio *folio,
+ u64 file_offset, u32 len);
bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
struct btrfs_ordered_extent **cached,
u64 file_offset, u64 io_size);
--
2.52.0
* [PATCH v3 2/4] btrfs: integrate the error handling of submit_one_sector()
2025-11-20 9:16 [PATCH v3 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
2025-11-20 9:16 ` [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated Qu Wenruo
@ 2025-11-20 9:16 ` Qu Wenruo
2025-11-20 9:16 ` [PATCH v3 3/4] btrfs: replace for_each_set_bit() with for_each_set_bitrange() Qu Wenruo
` (2 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-11-20 9:16 UTC (permalink / raw)
To: linux-btrfs
Currently submit_one_sector() has only one failure pattern, from
btrfs_get_extent().
However the error handling is split into two parts: one inside
submit_one_sector(), which clears the dirty flag and finishes the
writeback for the fs block.
The other part, in the caller, submits any remaining bio inside
bio_ctrl and marks the ordered extent finished for the fs block.
There is no special reason the error handling must be split, so let's
concentrate all of it inside submit_one_sector().
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2044b889c887..0f1bf9a4f0ae 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1598,7 +1598,7 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
/*
* Return 0 if we have submitted or queued the sector for submission.
- * Return <0 for critical errors, and the sector will have its dirty flag cleared.
+ * Return <0 for critical errors, and the involved sector will be cleaned up.
*
* Caller should make sure filepos < i_size and handle filepos >= i_size case.
*/
@@ -1622,6 +1622,13 @@ static int submit_one_sector(struct btrfs_inode *inode,
em = btrfs_get_extent(inode, NULL, filepos, sectorsize);
if (IS_ERR(em)) {
+ /*
+ * bio_ctrl may contain a bio crossing several folios.
+ * Submit it immediately so that the bio has a chance
+ * to finish normally, other than marked as error.
+ */
+ submit_one_bio(bio_ctrl);
+
/*
* When submission failed, we should still clear the folio dirty.
* Or the folio will be written back again but without any
@@ -1630,6 +1637,13 @@ static int submit_one_sector(struct btrfs_inode *inode,
btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize);
btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize);
btrfs_folio_clear_writeback(fs_info, folio, filepos, sectorsize);
+
+ /*
+ * Since there is no bio submitted to finish the ordered
+ * extent, we have to manually finish this sector.
+ */
+ btrfs_mark_ordered_io_finished(inode, folio, filepos,
+ fs_info->sectorsize, false);
return PTR_ERR(em);
}
@@ -1739,19 +1753,6 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
}
ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
if (unlikely(ret < 0)) {
- /*
- * bio_ctrl may contain a bio crossing several folios.
- * Submit it immediately so that the bio has a chance
- * to finish normally, other than marked as error.
- */
- submit_one_bio(bio_ctrl);
- /*
- * Failed to grab the extent map which should be very rare.
- * Since there is no bio submitted to finish the ordered
- * extent, we have to manually finish this sector.
- */
- btrfs_mark_ordered_io_finished(inode, folio, cur,
- fs_info->sectorsize, false);
if (!found_error)
found_error = ret;
continue;
--
2.52.0
* [PATCH v3 3/4] btrfs: replace for_each_set_bit() with for_each_set_bitrange()
2025-11-20 9:16 [PATCH v3 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
2025-11-20 9:16 ` [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated Qu Wenruo
2025-11-20 9:16 ` [PATCH v3 2/4] btrfs: integrate the error handling of submit_one_sector() Qu Wenruo
@ 2025-11-20 9:16 ` Qu Wenruo
2025-11-20 9:16 ` [PATCH v3 4/4] btrfs: reduce extent map lookup during writes Qu Wenruo
2025-11-21 1:02 ` [PATCH v3 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Boris Burkov
4 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-11-20 9:16 UTC (permalink / raw)
To: linux-btrfs
Inside extent_io.c, there are several simple call sites doing things
like:
for_each_set_bit(bit, bitmap, bitmap_size) {
/* handle one fs block */
}
The affected call sites include:
- set_bit()
Inside extent_writepage_io().
This can be replaced with a bitmap_set().
- btrfs_folio_set_lock()
- btrfs_mark_ordered_io_finished()
Inside writepage_delalloc().
Instead of calling them once per fs block, we can pass a range into
each function with one call.
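For reference, the difference between the two iteration styles can be sketched in userspace C. These are hand-rolled stand-ins, not the kernel macros (whose real definitions live in include/linux/find.h); they only illustrate how ranges collapse multiple per-bit calls into one per-range call.

```c
#include <assert.h>

/* Next set/clear bit at or after @from; returns @bits when none is found. */
static int next_set(unsigned long map, int bits, int from)
{
	for (int i = from; i < bits; i++)
		if (map & (1UL << i))
			return i;
	return bits;
}

static int next_clear(unsigned long map, int bits, int from)
{
	for (int i = from; i < bits; i++)
		if (!(map & (1UL << i)))
			return i;
	return bits;
}

/* for_each_set_bit() style: one helper call per set bit. */
static int calls_per_bit(unsigned long map, int bits)
{
	int calls = 0;

	for (int b = next_set(map, bits, 0); b < bits; b = next_set(map, bits, b + 1))
		calls++;
	return calls;
}

/* for_each_set_bitrange() style: one helper call per contiguous run of set bits. */
static int calls_per_range(unsigned long map, int bits)
{
	int calls = 0;
	int s = next_set(map, bits, 0);

	while (s < bits) {
		int e = next_clear(map, bits, s + 1);	/* the range is [s, e) */

		calls++;
		s = next_set(map, bits, e);
	}
	return calls;
}
```

With a dirty bitmap like 0xF3 (blocks 0-1 and 4-7 dirty), the per-bit style makes 6 calls while the per-range style makes only 2, and a fully dirty folio collapses to a single call.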
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0f1bf9a4f0ae..87bf5ce17264 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1426,8 +1426,9 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
u64 delalloc_start = page_start;
u64 delalloc_end = page_end;
u64 delalloc_to_write = 0;
+ unsigned int start_bit;
+ unsigned int end_bit;
int ret = 0;
- int bit;
/* Save the dirty bitmap as our submission bitmap will be a subset of it. */
if (btrfs_is_subpage(fs_info, folio)) {
@@ -1437,10 +1438,12 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
bio_ctrl->submit_bitmap = 1;
}
- for_each_set_bit(bit, &bio_ctrl->submit_bitmap, blocks_per_folio) {
- u64 start = page_start + (bit << fs_info->sectorsize_bits);
+ for_each_set_bitrange(start_bit, end_bit, &bio_ctrl->submit_bitmap,
+ blocks_per_folio) {
+ u64 start = page_start + (start_bit << fs_info->sectorsize_bits);
+ u32 len = (end_bit - start_bit) << fs_info->sectorsize_bits;
- btrfs_folio_set_lock(fs_info, folio, start, fs_info->sectorsize);
+ btrfs_folio_set_lock(fs_info, folio, start, len);
}
/* Lock all (subpage) delalloc ranges inside the folio first. */
@@ -1557,10 +1560,13 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
fs_info->sectorsize_bits,
blocks_per_folio);
- for_each_set_bit(bit, &bio_ctrl->submit_bitmap, bitmap_size)
- btrfs_mark_ordered_io_finished(inode, folio,
- page_start + (bit << fs_info->sectorsize_bits),
- fs_info->sectorsize, false);
+ for_each_set_bitrange(start_bit, end_bit, &bio_ctrl->submit_bitmap,
+ bitmap_size) {
+ u64 start = page_start + (start_bit << fs_info->sectorsize_bits);
+ u32 len = (end_bit - start_bit) << fs_info->sectorsize_bits;
+
+ btrfs_mark_ordered_io_finished(inode, folio, start, len, false);
+ }
return ret;
}
out:
@@ -1728,8 +1734,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
return ret;
}
- for (cur = start; cur < end; cur += fs_info->sectorsize)
- set_bit((cur - folio_start) >> fs_info->sectorsize_bits, &range_bitmap);
+ bitmap_set(&range_bitmap, (start - folio_pos(folio)) >> fs_info->sectorsize_bits,
+ len >> fs_info->sectorsize_bits);
bitmap_and(&bio_ctrl->submit_bitmap, &bio_ctrl->submit_bitmap, &range_bitmap,
blocks_per_folio);
--
2.52.0
* [PATCH v3 4/4] btrfs: reduce extent map lookup during writes
2025-11-20 9:16 [PATCH v3 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
` (2 preceding siblings ...)
2025-11-20 9:16 ` [PATCH v3 3/4] btrfs: replace for_each_set_bit() with for_each_set_bitrange() Qu Wenruo
@ 2025-11-20 9:16 ` Qu Wenruo
2025-11-21 1:02 ` [PATCH v3 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Boris Burkov
4 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-11-20 9:16 UTC (permalink / raw)
To: linux-btrfs
With large data folio support, even on x86_64 we can hit a folio that
contains several fs blocks.
In that case, we still need to call btrfs_get_extent() for each block,
as our submission path still iterates each fs block and submits them
one by one. This reduces the benefit of large folios.
Change the behavior to submit the whole range when possible. This is
done by:
- Use for_each_set_bitrange() instead of for_each_set_bit()
Now we can get a contiguous range to submit instead of a single fs
block.
- Handle blocks beyond EOF in one go
This is pretty much the same as the old behavior, but for a range
crossing i_size, we finish the range beyond i_size first, then submit
the remaining part.
- Submit the contiguous range in one go
Although we still need to consider the extent map boundary.
- Remove submit_one_sector()
As it's no longer utilized.
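The core loop of the new submit_range() can be modeled in userspace C. This is a sketch under simplified assumptions (extent maps as plain [start, end) intervals, no error handling, no bio plumbing); it shows that the number of lookups now scales with the number of extent maps rather than with the number of fs blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model, not kernel code: an extent map is a [start, end) interval. */
struct em {
	unsigned long long start;
	unsigned long long end;		/* exclusive */
};

/* Stand-in for btrfs_get_extent(): the map covering @pos, if any. */
static const struct em *get_extent(const struct em *ems, size_t nr,
				   unsigned long long pos)
{
	for (size_t i = 0; i < nr; i++)
		if (pos >= ems[i].start && pos < ems[i].end)
			return &ems[i];
	return NULL;
}

/*
 * Model of the submit_range() loop: clamp each submission to the extent
 * map boundary and advance by whole maps. Returns the number of
 * get_extent() lookups, which used to be one per fs block.
 */
static int submit_range(const struct em *ems, size_t nr,
			unsigned long long start, unsigned long long len)
{
	unsigned long long cur = start;
	int lookups = 0;

	while (cur < start + len) {
		unsigned long long cur_len = start + len - cur;
		const struct em *em = get_extent(ems, nr, cur);

		assert(em != NULL);	/* writeback ran delalloc, so no holes */
		lookups++;
		if (em->end - cur < cur_len)
			cur_len = em->end - cur;	/* extent map boundary */
		/* ...a real implementation would submit [cur, cur + cur_len) here... */
		cur += cur_len;
	}
	return lookups;
}
```

For a 64K folio backed by two 32K extent maps with 4K fs blocks, the old per-block iteration would perform 16 lookups; this loop performs 2.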
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 182 +++++++++++++++++++++++--------------------
1 file changed, 97 insertions(+), 85 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 87bf5ce17264..e61039de0edf 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1602,91 +1602,94 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
return 0;
}
+
/*
- * Return 0 if we have submitted or queued the sector for submission.
- * Return <0 for critical errors, and the involved sector will be cleaned up.
+ * Return 0 if we have submitted or queued the range for submission.
+ * Return <0 for critical errors, and the involved blocks will be cleaned up.
*
- * Caller should make sure filepos < i_size and handle filepos >= i_size case.
+ * Caller should make sure the range doesn't go beyond the last block of the inode.
*/
-static int submit_one_sector(struct btrfs_inode *inode,
- struct folio *folio,
- u64 filepos, struct btrfs_bio_ctrl *bio_ctrl,
- loff_t i_size)
+static int submit_range(struct btrfs_inode *inode, struct folio *folio,
+ u64 start, u32 len, struct btrfs_bio_ctrl *bio_ctrl)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- struct extent_map *em;
- u64 block_start;
- u64 disk_bytenr;
- u64 extent_offset;
- u64 em_end;
const u32 sectorsize = fs_info->sectorsize;
+ u64 cur = start;
- ASSERT(IS_ALIGNED(filepos, sectorsize));
+ ASSERT(IS_ALIGNED(start, sectorsize));
+ ASSERT(IS_ALIGNED(len, sectorsize));
+ ASSERT(start + len <= folio_end(folio));
- /* @filepos >= i_size case should be handled by the caller. */
- ASSERT(filepos < i_size);
+ while (cur < start + len) {
+ struct extent_map *em;
+ u64 block_start;
+ u64 disk_bytenr;
+ u64 extent_offset;
+ u64 em_end;
+ u32 cur_len = start + len - cur;
- em = btrfs_get_extent(inode, NULL, filepos, sectorsize);
- if (IS_ERR(em)) {
- /*
- * bio_ctrl may contain a bio crossing several folios.
- * Submit it immediately so that the bio has a chance
- * to finish normally, other than marked as error.
- */
- submit_one_bio(bio_ctrl);
+ em = btrfs_get_extent(inode, NULL, cur, cur_len);
+ if (IS_ERR(em)) {
+ /*
+ * bio_ctrl may contain a bio crossing several folios.
+ * Submit it immediately so that the bio has a chance
+ * to finish normally, other than marked as error.
+ */
+ submit_one_bio(bio_ctrl);
+
+ /*
+ * When submission failed, we should still clear the folio dirty.
+ * Or the folio will be written back again but without any
+ * ordered extent.
+ */
+ btrfs_folio_clear_dirty(fs_info, folio, cur, cur_len);
+ btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
+ btrfs_folio_clear_writeback(fs_info, folio, cur, cur_len);
+
+ /*
+ * Since there is no bio submitted to finish the ordered
+ * extent, we have to manually finish this range.
+ */
+ btrfs_mark_ordered_io_finished(inode, folio, cur, cur_len, false);
+ return PTR_ERR(em);
+ }
+ extent_offset = cur - em->start;
+ em_end = btrfs_extent_map_end(em);
+ ASSERT(cur <= em_end);
+ ASSERT(IS_ALIGNED(em->start, sectorsize));
+ ASSERT(IS_ALIGNED(em->len, sectorsize));
+
+ block_start = btrfs_extent_map_block_start(em);
+ disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset;
+
+ ASSERT(!btrfs_extent_map_is_compressed(em));
+ ASSERT(block_start != EXTENT_MAP_HOLE);
+ ASSERT(block_start != EXTENT_MAP_INLINE);
+
+ cur_len = min(cur_len, em_end - cur);
+ btrfs_free_extent_map(em);
+ em = NULL;
/*
- * When submission failed, we should still clear the folio dirty.
- * Or the folio will be written back again but without any
- * ordered extent.
+ * Although the PageDirty bit is cleared before entering this
+ * function, subpage dirty bit is not cleared.
+ * So clear subpage dirty bit here so next time we won't submit
+ * a folio for a range already written to disk.
*/
- btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize);
- btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize);
- btrfs_folio_clear_writeback(fs_info, folio, filepos, sectorsize);
-
+ btrfs_folio_clear_dirty(fs_info, folio, cur, cur_len);
+ btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
/*
- * Since there is no bio submitted to finish the ordered
- * extent, we have to manually finish this sector.
+ * Above call should set the whole folio with writeback flag, even
+ * just for a single subpage block.
+ * As long as the folio is properly locked and the range is correct,
+ * we should always get the folio with writeback flag.
*/
- btrfs_mark_ordered_io_finished(inode, folio, filepos,
- fs_info->sectorsize, false);
- return PTR_ERR(em);
+ ASSERT(folio_test_writeback(folio));
+
+ submit_extent_folio(bio_ctrl, disk_bytenr, folio,
+ cur_len, cur - folio_pos(folio), 0);
+ cur += cur_len;
}
-
- extent_offset = filepos - em->start;
- em_end = btrfs_extent_map_end(em);
- ASSERT(filepos <= em_end);
- ASSERT(IS_ALIGNED(em->start, sectorsize));
- ASSERT(IS_ALIGNED(em->len, sectorsize));
-
- block_start = btrfs_extent_map_block_start(em);
- disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset;
-
- ASSERT(!btrfs_extent_map_is_compressed(em));
- ASSERT(block_start != EXTENT_MAP_HOLE);
- ASSERT(block_start != EXTENT_MAP_INLINE);
-
- btrfs_free_extent_map(em);
- em = NULL;
-
- /*
- * Although the PageDirty bit is cleared before entering this
- * function, subpage dirty bit is not cleared.
- * So clear subpage dirty bit here so next time we won't submit
- * a folio for a range already written to disk.
- */
- btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize);
- btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize);
- /*
- * Above call should set the whole folio with writeback flag, even
- * just for a single subpage sector.
- * As long as the folio is properly locked and the range is correct,
- * we should always get the folio with writeback flag.
- */
- ASSERT(folio_test_writeback(folio));
-
- submit_extent_folio(bio_ctrl, disk_bytenr, folio,
- sectorsize, filepos - folio_pos(folio), 0);
return 0;
}
@@ -1712,8 +1715,9 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
const u64 folio_start = folio_pos(folio);
const u64 folio_end = folio_start + folio_size(folio);
const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
- u64 cur;
- int bit;
+ unsigned int start_bit;
+ unsigned int end_bit;
+ const u64 rounded_isize = round_up(i_size, fs_info->sectorsize);
int ret = 0;
ASSERT(start >= folio_start, "start=%llu folio_start=%llu", start, folio_start);
@@ -1741,23 +1745,31 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
bio_ctrl->end_io_func = end_bbio_data_write;
- for_each_set_bit(bit, &bio_ctrl->submit_bitmap, blocks_per_folio) {
- cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
+ for_each_set_bitrange(start_bit, end_bit, &bio_ctrl->submit_bitmap, blocks_per_folio) {
+ const u64 cur_start = folio_pos(folio) + (start_bit << fs_info->sectorsize_bits);
+ u32 cur_len = (end_bit - start_bit) << fs_info->sectorsize_bits;
- if (cur >= i_size) {
- btrfs_mark_ordered_io_truncated(inode, folio, cur, end - cur);
+ if (cur_start > rounded_isize) {
/*
- * This range is beyond i_size, thus we don't need to
- * bother writing back.
- * But we still need to clear the dirty subpage bit, or
- * the next time the folio gets dirtied, we will try to
- * writeback the sectors with subpage dirty bits,
- * causing writeback without ordered extent.
+ * The whole range is beyond EOF.
+ *
+ * Just finish the IO and skip to the next range.
*/
- btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
- break;
+ btrfs_mark_ordered_io_truncated(inode, folio, cur_start, cur_len);
+ btrfs_folio_clear_dirty(fs_info, folio, cur_start, cur_len);
+ continue;
}
- ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
+ if (cur_start + cur_len > rounded_isize) {
+ u32 truncate_len = cur_start + cur_len - rounded_isize;
+
+ /* The tailing part of the range is beyond EOF. */
+ btrfs_mark_ordered_io_truncated(inode, folio, rounded_isize, truncate_len);
+ btrfs_folio_clear_dirty(fs_info, folio, rounded_isize, truncate_len);
+ /* Shrink the range inside the EOF. */
+ cur_len = rounded_isize - cur_start;
+ }
+
+ ret = submit_range(inode, folio, cur_start, cur_len, bio_ctrl);
if (unlikely(ret < 0)) {
if (!found_error)
found_error = ret;
--
2.52.0
* Re: [PATCH v3 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path
2025-11-20 9:16 [PATCH v3 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
` (3 preceding siblings ...)
2025-11-20 9:16 ` [PATCH v3 4/4] btrfs: reduce extent map lookup during writes Qu Wenruo
@ 2025-11-21 1:02 ` Boris Burkov
4 siblings, 0 replies; 16+ messages in thread
From: Boris Burkov @ 2025-11-21 1:02 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Nov 20, 2025 at 07:46:45PM +1030, Qu Wenruo wrote:
> [CHANGELOG]
> v3:
> - Use @cur_len for btrfs_get_extent() in the last patch
> And remove the hole related comment.
> This should not cause any behavior change.
>
> The @len parameter for btrfs_get_extent() is only going to cause a
> difference when the range is completely a hole (e.g. beyond EOF).
>
> For our writeback routine there should always be an extent map, thus
> it makes no difference whether we pass @cur_len or sectorsize.
>
> I think this @len parameter of btrfs_get_extent() is causing
> unnecessary complexity, and I want to remove it completely. But that
> will be a new series.
>
> v2:
> - Fix a potential bug where OEs beyond EOF are not truncated properly
> This replaces the original patch that extracted the code into a helper.
>
> - Replace more for_each_set_bit() with for_each_set_bitrange()
>
> - Fix several copy-and-pasted incorrect ranges inside submit_range()
>
>
> Although btrfs has had bs < ps support for a long time, and large data
> folio support is also going to graduate from experimental features
> soon, the write path still iterates each fs block and calls
> btrfs_get_extent() on each of them.
>
> What makes the situation worse is that, for the write path, we do not
> have any cached extent map, meaning that even with large folios, where
> we can get a contiguous range that can be submitted in one go, we still
> call btrfs_get_extent() many times and get the same extent map again
> and again.
>
> This series reduces the duplicated btrfs_get_extent() calls by calling
> it only once for each range, rather than once for each fs block.
>
> The first patch fixes a potential bug spotted thanks to Boris' review.
> Patch 2~3 are minor cleanups.
> Patch 4 is the core of the optimization.
>
> Although I don't expect there will be much difference in the real world.
Reviewed-by: Boris Burkov <boris@bur.io>
>
>
> Qu Wenruo (4):
> btrfs: make sure all ordered extents beyond EOF is properly truncated
> btrfs: integrate the error handling of submit_one_sector()
> btrfs: replace for_each_set_bit() with for_each_set_bitrange()
> btrfs: reduce extent map lookup during writes
>
> fs/btrfs/extent_io.c | 230 ++++++++++++++++++++--------------------
> fs/btrfs/ordered-data.c | 38 +++++++
> fs/btrfs/ordered-data.h | 2 +
> 3 files changed, 156 insertions(+), 114 deletions(-)
>
> --
> 2.52.0
>
* Re: [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated
2025-11-20 9:16 ` [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated Qu Wenruo
@ 2025-11-21 11:55 ` Filipe Manana
2025-11-21 15:38 ` David Sterba
2025-11-22 1:14 ` Qu Wenruo
0 siblings, 2 replies; 16+ messages in thread
From: Filipe Manana @ 2025-11-21 11:55 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Nov 20, 2025 at 9:22 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [POSSIBLE BUG]
> If there are multiple ordered extents beyond EOF, at folio writeback
> time we may only truncate the first ordered extent, leaving the
> remaining ones to finish without being marked as truncated.
>
> Since those OEs are not marked as truncated, they will still insert a
> file extent item, which may lead to false missing csum errors during
> "btrfs check".
>
> [CAUSE]
> Since we have had bs < ps support for a while, and large data folios
> are also going to graduate from experimental features soon, we can have
> the following folio being written back:
>
> fs block size 4K
> page size 4K, folio size 64K.
>
> 0 16K 32K 64K
> |<---------------- Dirty -------------->|
> |<-OE A->|<-OE B->|<----- OE C -------->|
> |
> i_size 4K.
>
> In the above case we need to submit the writeback for the range [0, 4K).
> For the range [4K, 64K) there is no need to submit any IO, but we must
> mark all the involved OEs (OE A, B and C) as truncated.
>
> However during the EOF handling, we only call
> btrfs_lookup_first_ordered_range() once, thus we only get OE A and mark
> it as truncated.
> OE B and C are not marked as truncated, so they will finish as usual,
> leaving a regular file extent item inserted beyond EOF without any data
> checksum.
>
> [FIX]
> Introduce a new helper, btrfs_mark_ordered_io_truncated(), to handle all
> OEs of a range, and mark them all as truncated.
>
> With that helper, all OEs (A, B and C) will be marked as truncated.
> OE B and C will have a truncated_len of 0, preventing any file extent
> item from being inserted for them.
>
> Fixes: f0a0f5bfe04e ("btrfs: truncate ordered extent when skipping writeback past i_size")
The commit ID is not stable yet, as that change is not in Linus' tree yet.
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 19 +------------------
> fs/btrfs/ordered-data.c | 38 ++++++++++++++++++++++++++++++++++++++
> fs/btrfs/ordered-data.h | 2 ++
> 3 files changed, 41 insertions(+), 18 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 2d32dfc34ae3..2044b889c887 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1725,24 +1725,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
> cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
>
> if (cur >= i_size) {
> - struct btrfs_ordered_extent *ordered;
> -
> - ordered = btrfs_lookup_first_ordered_range(inode, cur,
> - folio_end - cur);
> - /*
> - * We have just run delalloc before getting here, so
> - * there must be an ordered extent.
> - */
> - ASSERT(ordered != NULL);
> - spin_lock(&inode->ordered_tree_lock);
> - set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> - ordered->truncated_len = min(ordered->truncated_len,
> - cur - ordered->file_offset);
> - spin_unlock(&inode->ordered_tree_lock);
> - btrfs_put_ordered_extent(ordered);
> -
> - btrfs_mark_ordered_io_finished(inode, folio, cur,
> - end - cur, true);
> + btrfs_mark_ordered_io_truncated(inode, folio, cur, end - cur);
> /*
> * This range is beyond i_size, thus we don't need to
> * bother writing back.
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index a421f7db9eec..9a0638d13225 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -546,6 +546,44 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
> spin_unlock(&inode->ordered_tree_lock);
> }
>
> +/*
> + * Mark all ordered extents io inside the specified range as truncated.
> + *
> + * This is utilized by writeback path, thus there must be an ordered extent
by -> by the
> + * for the range.
> + */
> +void btrfs_mark_ordered_io_truncated(struct btrfs_inode *inode, struct folio *folio,
> + u64 file_offset, u32 len)
> +{
> + u64 cur = file_offset;
> +
> + ASSERT(file_offset >= folio_pos(folio));
> + ASSERT(file_offset + len <= folio_end(folio));
> +
> + while (cur < file_offset + len) {
Two spaces between while and the opening parenthesis.
> + u32 cur_len = file_offset + len - cur;
Please avoid repeating "file_offset + len" so many times. Use a const
variable at the top like:
const u64 end = file_offset + len;
Then use 'end' instead.
> + struct btrfs_ordered_extent *ordered;
> +
> + ordered = btrfs_lookup_first_ordered_range(inode, cur, cur_len);
> +
> + /*
> + * We have just run delalloc before getting here, so there must
> + * be an ordered extent.
> + */
> + ASSERT(ordered != NULL);
> + scoped_guard(spinlock, &inode->ordered_tree_lock) {
> + set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> + ordered->truncated_len = min(ordered->truncated_len,
> + cur - ordered->file_offset);
> + }
I thought we had not yet made a decision to use this new fancy locking.
In this case where it's a very short critical section, it doesn't
bring any advantage over using explicit spin_lock/unlock, and adds one
extra level of indentation.
Also in the subject:
btrfs: make sure all ordered extents beyond EOF is properly truncated
should be "are" instead of "is".
Thanks.
> + cur_len = min(cur_len, ordered->file_offset + ordered->num_bytes - cur);
> + btrfs_put_ordered_extent(ordered);
> +
> + cur += cur_len;
> + }
> + btrfs_mark_ordered_io_finished(inode, folio, file_offset, len, true);
> +}
> +
> /*
> * Finish IO for one ordered extent across a given range. The range can only
> * contain one ordered extent.
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 1e6b0b182b29..dd4cdc1a8b78 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -169,6 +169,8 @@ void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
> void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
> struct folio *folio, u64 file_offset,
> u64 num_bytes, bool uptodate);
> +void btrfs_mark_ordered_io_truncated(struct btrfs_inode *inode, struct folio *folio,
> + u64 file_offset, u32 len);
> bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
> struct btrfs_ordered_extent **cached,
> u64 file_offset, u64 io_size);
> --
> 2.52.0
>
>
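[Editorial aside: the range-walking loop above, restructured with the `const u64 end` Filipe suggests, can be modeled in plain user-space C. `walk_range()`, `struct oe` and the inline lookup below are simplified, invented stand-ins for btrfs_lookup_first_ordered_range() and the ordered extent tree, not the kernel code itself.]

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for an ordered extent: only the fields the loop uses. */
struct oe {
    unsigned long long file_offset;
    unsigned long long num_bytes;
};

/*
 * Model of the iteration in btrfs_mark_ordered_io_truncated(), with the
 * suggested "const u64 end" hoisted out of the loop so the exclusive end
 * offset is named once. Returns the number of ordered extents visited.
 */
static int walk_range(unsigned long long file_offset, unsigned int len,
                      const struct oe *extents, size_t nr)
{
    const unsigned long long end = file_offset + len;   /* exclusive end */
    unsigned long long cur = file_offset;
    int visited = 0;
    size_t i = 0;

    while (cur < end) {
        unsigned long long cur_len = end - cur;
        const struct oe *ordered;

        /* Lookup: first extent whose range still intersects [cur, end). */
        while (i < nr && extents[i].file_offset + extents[i].num_bytes <= cur)
            i++;
        assert(i < nr);   /* "there must be an ordered extent" */
        ordered = &extents[i];
        visited++;

        /* Clamp cur_len to the end of this ordered extent, as in the patch. */
        if (ordered->file_offset + ordered->num_bytes - cur < cur_len)
            cur_len = ordered->file_offset + ordered->num_bytes - cur;
        cur += cur_len;
    }
    return visited;
}
```

With three back-to-back 4K/4K/8K extents covering [0, 16384), a walk over the whole range visits all three, while a walk over [4096, 8192) visits only the middle one.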
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated
2025-11-21 11:55 ` Filipe Manana
@ 2025-11-21 15:38 ` David Sterba
2025-11-21 19:26 ` Qu Wenruo
2025-11-22 1:14 ` Qu Wenruo
1 sibling, 1 reply; 16+ messages in thread
From: David Sterba @ 2025-11-21 15:38 UTC (permalink / raw)
To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs
On Fri, Nov 21, 2025 at 11:55:58AM +0000, Filipe Manana wrote:
> > + /*
> > + * We have just run delalloc before getting here, so there must
> > + * be an ordered extent.
> > + */
> > + ASSERT(ordered != NULL);
> > + scoped_guard(spinlock, &inode->ordered_tree_lock) {
> > + set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> > + ordered->truncated_len = min(ordered->truncated_len,
> > + cur - ordered->file_offset);
> > + }
>
> I thought we had not yet made a decision to use this new fancy locking.
> In this case where it's a very short critical section, it doesn't
> bring any advantage over using explicit spin_lock/unlock, and adds one
> extra level of indentation.
Agreed, this looks like an anti-pattern of the scoped locking.
* Re: [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated
2025-11-21 15:38 ` David Sterba
@ 2025-11-21 19:26 ` Qu Wenruo
2025-11-21 20:25 ` Daniel Vacek
0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-11-21 19:26 UTC (permalink / raw)
To: dsterba, Filipe Manana; +Cc: Qu Wenruo, linux-btrfs
在 2025/11/22 02:08, David Sterba 写道:
> On Fri, Nov 21, 2025 at 11:55:58AM +0000, Filipe Manana wrote:
>>> + /*
>>> + * We have just run delalloc before getting here, so there must
>>> + * be an ordered extent.
>>> + */
>>> + ASSERT(ordered != NULL);
>>> + scoped_guard(spinlock, &inode->ordered_tree_lock) {
>>> + set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
>>> + ordered->truncated_len = min(ordered->truncated_len,
>>> + cur - ordered->file_offset);
>>> + }
>>
>> I thought we had not yet made a decision to use this new fancy locking.
>> In this case where it's a very short critical section, it doesn't
>> bring any advantage over using explicit spin_lock/unlock, and adds one
>> extra level of indentation.
>
> Agreed, this looks like an anti-pattern of the scoped locking.
>
I think one is free to use whichever style they prefer, as long as the
styles are not mixed within the same function.
Sure, we have not yet settled on a style, but I cannot see anything here
damaging readability either.
The extra indent clearly shows the critical section, and we have enough
horizontal space in this particular case.
It's clearer than a spin_lock()/spin_unlock() pair, and unlike guard()
it won't cause problems for future modifications.
And I don't get the anti-pattern point.
Do you only want to use guards for the longest critical sections with
multiple exits, and leave the simplest ones using the old style?
That doesn't look sane to me.
I'm not saying we should switch to the new RAII style immediately with a
huge patch, nor that everyone should accept it immediately, but that we
should gradually use the new style in new code, with the usual proper
review/testing procedures to keep the correctness.
If one doesn't like RAII, sure, one doesn't need to use it, and I'm not
going to force anyone to use that style either.
But if this step-by-step, new-code-only approach is also rejected, it is
going to be a closed loop:
Not settled -> No new style code to get any feedback -> No motivation
to change
Thanks,
Qu
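[Editorial aside: the scoped_guard() construct under debate is built on the compiler's cleanup attribute. A user-space sketch of the same pattern using pthreads may make the contrast concrete; the `scoped_lock` macro and helper names below are invented for illustration and are not the kernel's cleanup.h implementation.]

```c
#include <pthread.h>
#include <assert.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int counter;

/* Called automatically when a guard variable goes out of scope. */
static void unlock_cleanup(pthread_mutex_t **m)
{
    pthread_mutex_unlock(*m);
}

/*
 * Poor man's scoped_guard(): a for-loop that runs its body exactly once
 * with the lock held, and releases it on every exit path via the GCC/Clang
 * cleanup attribute.
 */
#define scoped_lock(m)                                                    \
    for (pthread_mutex_t *_g __attribute__((cleanup(unlock_cleanup)))     \
             = (pthread_mutex_lock(m), (m)), *_once = (m);                \
         _once; _once = NULL)

static void bump_explicit(void)
{
    pthread_mutex_lock(&lock);
    counter++;
    pthread_mutex_unlock(&lock);   /* easy to miss on early returns */
}

static void bump_guarded(void)
{
    scoped_lock(&lock) {
        counter++;
    }   /* unlock happens here automatically, on every exit path */
}
```

The extra indentation marks the critical section visually, which is the readability argument made above; the counter-argument is that for a two-line section the explicit pair is just as clear.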
* Re: [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated
2025-11-21 19:26 ` Qu Wenruo
@ 2025-11-21 20:25 ` Daniel Vacek
2025-11-21 20:46 ` Qu Wenruo
0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vacek @ 2025-11-21 20:25 UTC (permalink / raw)
To: Qu Wenruo; +Cc: dsterba, Filipe Manana, Qu Wenruo, linux-btrfs
On Fri, 21 Nov 2025 at 20:28, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> 在 2025/11/22 02:08, David Sterba 写道:
> > On Fri, Nov 21, 2025 at 11:55:58AM +0000, Filipe Manana wrote:
> >>> + /*
> >>> + * We have just run delalloc before getting here, so there must
> >>> + * be an ordered extent.
> >>> + */
> >>> + ASSERT(ordered != NULL);
> >>> + scoped_guard(spinlock, &inode->ordered_tree_lock) {
> >>> + set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> >>> + ordered->truncated_len = min(ordered->truncated_len,
> >>> + cur - ordered->file_offset);
> >>> + }
> >>
> >> I thought we had not yet made a decision to use this new fancy locking.
> >> In this case where it's a very short critical section, it doesn't
> >> bring any advantage over using explicit spin_lock/unlock, and adds one
> >> extra level of indentation.
> >
> > Agreed, this looks like an anti-pattern of the scoped locking.
> >
>
> I think one is free to use whatever style as long as there is no mixed
> style in the same function.
I've got a hard objection here. If there is to be any granularity in
using guards vs. explicit locking then, IMO, it should be per lock.
I.e., given `ordered_tree_lock`, either it should always use the RAII
style *OR* it should always use the explicit style. But it should never
mix one style in one function and the other style in another function.
That would only make it really messy when looking for interactions,
race bugs, and missing or misplaced locking; general debugging, in short.
> Sure, we're not yet settled on the style, but I can not see anything
> damaging readability either.
> That extra indent is clearly showing the critical section, and we have
> enough space in this particular case.
> It's more clear than spinlock() pair, and unlike guard() this won't
> cause problems for future modifications.
>
>
> And I didn't get the point of anti-pattern.
>
> You only want to use the guard for the longest critical section with
> multiple exit? And leave the simplest ones using the old style?
> That doesn't look sane to me.
>
>
> I'm not saying we should change to the new RAII style immediately with a
> huge patch nor everyone should accept it immediately, but to gradually
> use the new style in new codes, with the usual proper review/testing
> procedures to keep the correctness.
I would understand if you introduced a _new_ lock and started using it
with the RAII style - setting the example. But if you're grabbing a
lock which is always acquired using the explicit style, just stick to
it and keep the style consistent throughout all the callsites, the
whole code base. This makes it _easier_ to crosscheck, IMO.
-just my 2c
> If one doesn't like the RAII, sure one doesn't need to use, and I'm not
> going to force anyone to use that style either.
>
> But if this step-by-step new-code-only approach is also rejected, it
> is going to be a closed loop:
>
> Not settled -> No new style code to get any feedback -> No motivation
> to change
>
> Thanks,
> Qu
>
* Re: [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated
2025-11-21 20:25 ` Daniel Vacek
@ 2025-11-21 20:46 ` Qu Wenruo
2025-11-24 22:25 ` David Sterba
0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-11-21 20:46 UTC (permalink / raw)
To: Daniel Vacek; +Cc: dsterba, Filipe Manana, Qu Wenruo, linux-btrfs
在 2025/11/22 06:55, Daniel Vacek 写道:
> On Fri, 21 Nov 2025 at 20:28, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> 在 2025/11/22 02:08, David Sterba 写道:
>>> On Fri, Nov 21, 2025 at 11:55:58AM +0000, Filipe Manana wrote:
>>>>> + /*
>>>>> + * We have just run delalloc before getting here, so there must
>>>>> + * be an ordered extent.
>>>>> + */
>>>>> + ASSERT(ordered != NULL);
>>>>> + scoped_guard(spinlock, &inode->ordered_tree_lock) {
>>>>> + set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
>>>>> + ordered->truncated_len = min(ordered->truncated_len,
>>>>> + cur - ordered->file_offset);
>>>>> + }
>>>>
>>>> I thought we had not yet made a decision to use this new fancy locking.
>>>> In this case where it's a very short critical section, it doesn't
>>>> bring any advantage over using explicit spin_lock/unlock, and adds one
>>>> extra level of indentation.
>>>
>>> Agreed, this looks like an anti-pattern of the scoped locking.
>>>
>>
>> I think one is free to use whatever style as long as there is no mixed
>> style in the same function.
>
> I've got a hard objection here. If there is(?) any granularity using
> guards vs. explicit locking then, IMO, it should be per given lock.
>
> Ie, given `ordered_tree_lock` - either it should always use the RAII
> style *OR* it should always use the explicit style. But it should
> never mix one style in one function and the other style in another
> function. That would only make it really messy looking for
> interactions and race bugs / missing/misplaced locking - simply
> general debugging.
Then check fs/*.c.
There is guard(rcu) and rcu_read_lock() usage mixed in different files.
E.g. in fs/namespace.c it's using guard(rcu) and scoped_guard(rcu),
meanwhile still having regular rcu_read_lock().
There are more counter-examples than you know.
We're not the first subsystem to face the new RAII styles, and I hope we
will not be the last either.
[...]
>>
>> I'm not saying we should change to the new RAII style immediately with a
>> huge patch nor everyone should accept it immediately, but to gradually
>> use the new style in new codes, with the usual proper review/testing
>> procedures to keep the correctness.
>
> I would understand if you introduced a _new_ lock and started using it
> with the RAII style - setting the example. But if you're grabbing a
> lock which is always acquired using the explicit style, just stick to
> it and keep the style consistent throughout all the callsites, the
> whole code base. This makes it _easier_ to crosscheck, IMO.
Then check kernfs_root::kernfs_rwsem; it also mixes the RAII and old styles.
We should all remember there are always subsystems before us facing this
challenge, and if they are fine embracing the new style gradually, I see
no reason why we cannot.
We're just a regular subsystem in the kernel, not some god-chosen
special one; we do not live in a bubble.
Thanks,
Qu
>
> -just my 2c
>
>> If one doesn't like the RAII, sure one doesn't need to use, and I'm not
>> going to force anyone to use that style either.
>>
>> But if this step-by-step new-code-only approach is also rejected, it
>> is going to be a closed loop:
>>
>> Not settled -> No new style code to get any feedback -> No motivation
>> to change
>>
>> Thanks,
>> Qu
>>
>
* Re: [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated
2025-11-21 11:55 ` Filipe Manana
2025-11-21 15:38 ` David Sterba
@ 2025-11-22 1:14 ` Qu Wenruo
2025-11-24 12:22 ` Filipe Manana
1 sibling, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-11-22 1:14 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
在 2025/11/21 22:25, Filipe Manana 写道:
[...]
>> Fixes: f0a0f5bfe04e ("btrfs: truncate ordered extent when skipping writeback past i_size")
>
> The commit ID is not stable yet, as that change is not in Linus' tree yet.
Thanks for pointing this out. It has now been removed from for-next, and I
will resend the series with the proper upstream commit ID during the next
merge window.
[...]
>> + u32 cur_len = file_offset + len - cur;
>
> Please avoid repeating "file_offset + len" so many times. Use a const
> variable at the top like:
>
> const u64 end = file_offset + len;
>
> Then use 'end' instead.
I intentionally avoided using @end, as we have pretty confusing
inclusive/exclusive behavior across @end usages.
A lot of them are inclusive, but also quite some are exclusive.
In fact even inside ordered-data.c, we have different
inclusive/exclusive usages.
E.g. @end inside btrfs_start_ordered_extent_nowriteback() is inclusive,
but @range_end inside btrfs_wait_ordered_extents() is exclusive.
It's almost half and half in that file.
Although I can definitely add an @end, I'm not sure it improves
anything, considering it's only used twice per loop iteration.
Thanks,
Qu
* Re: [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated
2025-11-22 1:14 ` Qu Wenruo
@ 2025-11-24 12:22 ` Filipe Manana
2025-11-24 21:00 ` Qu Wenruo
0 siblings, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2025-11-24 12:22 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Sat, Nov 22, 2025 at 1:14 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> 在 2025/11/21 22:25, Filipe Manana 写道:
> [...]
> >> Fixes: f0a0f5bfe04e ("btrfs: truncate ordered extent when skipping writeback past i_size")
> >
> > The commit ID is not stable yet, as that change is not in Linus' tree yet.
>
> Thanks for pointing this out. Now it's removed from for-next and will
> resend the series with proper upstream commit during the next merge window.
Instead of waiting until the next merge window, David may want to
include this in the current one, in which case no Fixes tag is needed;
just mention the patch's subject in the changelog (that's usually what
we do).
This way it will avoid the backports that would be needed if we wait
until the next merge window (for 6.20).
Or it may be included in a pull request for one of the 6.19-rcs.
>
> [...]
>
> >> + u32 cur_len = file_offset + len - cur;
> >
> > Please avoid repeating "file_offset + len" so many times. Use a const
> > variable at the top like:
> >
> > const u64 end = file_offset + len;
> >
> > Then use 'end' instead.
>
> I intentionally avoided using @end as we have pretty confusing
> inclusive/exclusive behaviors on all @end usages.
>
> A lot of them are inclusive, but also quite some are exclusive.
>
> In fact even inside ordered-data.c, we have different
> inclusive/exclusive usages.
>
> E.g. @end inside btrfs_start_ordered_extent_nowriteback() is inclusive,
> but @range_end inside btrfs_wait_ordered_extents() is exclusive.
>
> It's almost half-half in that file.
>
> Although I can definitely add an @end, I'm not sure if it's improving
> anything, considering it's only utilized twice per loop.
I don't think it's confusing at all in this context.
It may be confusing when there's an end argument, where we may have to
check callers to see how it's computed to find out if it's an
inclusive or exclusive end offset.
With the end offset stored in a local variable, it's very easy to see
that it's exclusive, especially in such a short function.
I prefer that to open-coding the sum multiple times (3 times in this
function). The compiler generally optimizes away the repeated sum anyway,
and the named variable makes the code easier to read and reason about.
>
> Thanks,
> Qu
>
* Re: [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated
2025-11-24 12:22 ` Filipe Manana
@ 2025-11-24 21:00 ` Qu Wenruo
0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-11-24 21:00 UTC (permalink / raw)
To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs
在 2025/11/24 22:52, Filipe Manana 写道:
> On Sat, Nov 22, 2025 at 1:14 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>>
>>
>> 在 2025/11/21 22:25, Filipe Manana 写道:
>> [...]
>>>> Fixes: f0a0f5bfe04e ("btrfs: truncate ordered extent when skipping writeback past i_size")
>>>
>>> The commit ID is not stable yet, as that change is not in Linus' tree yet.
>>
>> Thanks for pointing this out. Now it's removed from for-next and will
>> resend the series with proper upstream commit during the next merge window.
>
> Instead of waiting until the next merge window, David may want to
> include this in this one, in which case no Fixes tag is needed, just
> mention the patch's subject in the change log (that's usually what we
> do).
> This way it will avoid backports if we wait until the next merge
> window (for 6.20).
>
> Or may be included in a pull request for one of the 6.19-rcs.
OK, considering this is a bug fix, I'll resend this patch (without the
whole series) with the Fixes tag removed.
[...]
>> Although I can definitely add an @end, I'm not sure if it's improving
>> anything, considering it's only utilized twice per loop.
>
> I don't think it's confusing at all in this context.
>
> It may be confusing when there's an end argument, where we may have to
> check callers to see how it's computed to find out if it's an
> inclusive or exclusive end offset.
>
> Now having the end offset stored in a local variable, it makes it very
> easy to see that it's exclusive, especially for such a short function.
> I prefer that than open-coding the sum multiple times (3 times in this
> function). The compiler generally optimizes and avoids repeating the
> sum, but it makes the code easier to read and reason.
Got it.
Thanks,
Qu
>
>>
>> Thanks,
>> Qu
>>
>
* Re: [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated
2025-11-21 20:46 ` Qu Wenruo
@ 2025-11-24 22:25 ` David Sterba
2025-11-24 23:16 ` Qu Wenruo
0 siblings, 1 reply; 16+ messages in thread
From: David Sterba @ 2025-11-24 22:25 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Daniel Vacek, dsterba, Filipe Manana, Qu Wenruo, linux-btrfs
On Sat, Nov 22, 2025 at 07:16:06AM +1030, Qu Wenruo wrote:
> 在 2025/11/22 06:55, Daniel Vacek 写道:
> > On Fri, 21 Nov 2025 at 20:28, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >> 在 2025/11/22 02:08, David Sterba 写道:
> >>> On Fri, Nov 21, 2025 at 11:55:58AM +0000, Filipe Manana wrote:
> >>>>> + /*
> >>>>> + * We have just run delalloc before getting here, so there must
> >>>>> + * be an ordered extent.
> >>>>> + */
> >>>>> + ASSERT(ordered != NULL);
> >>>>> + scoped_guard(spinlock, &inode->ordered_tree_lock) {
> >>>>> + set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> >>>>> + ordered->truncated_len = min(ordered->truncated_len,
> >>>>> + cur - ordered->file_offset);
> >>>>> + }
> >>>>
> >>>> I thought we had not yet made a decision to use this new fancy locking.
> >>>> In this case where it's a very short critical section, it doesn't
> >>>> bring any advantage over using explicit spin_lock/unlock, and adds one
> >>>> extra level of indentation.
> >>>
> >>> Agreed, this looks like an anti-pattern of the scoped locking.
> >>>
> >>
> >> I think one is free to use whatever style as long as there is no mixed
> >> style in the same function.
> >
> > I've got a hard objection here. If there is(?) any granularity using
> > guards vs. explicit locking then, IMO, it should be per given lock.
> >
> > Ie, given `ordered_tree_lock` - either it should always use the RAII
> > style *OR* it should always use the explicit style. But it should
> > never mix one style in one function and the other style in another
> > function. That would only make it really messy looking for
> > interactions and race bugs / missing/misplaced locking - simply
> > general debugging.
>
> Then check fs/*.c.
>
> There is guard(rcu) and rcu_read_lock() usage mixed in different files.
>
> E.g. in fs/namespace.c it's using guard(rcu) and scoped_guard(rcu),
> meanwhile still having regular rcu_read_lock().
>
> There are more counter-examples than you know.
> We're not the first subsystem to face the new RAII styles, and I hope we
> will not be the last either.
We take inspiration from other subsystems, but the fact that some coding
style is used in another subsystem does not mean we have to adopt it
too. If the fs/*.c people decide to use it everywhere then so be it.
I was hesitant to introduce the auto cleanup for btrfs_path or kfree,
but so far I have found it to work. The locking guards are quite
different from that, and I can't seem to develop any liking for them.
> [...]
> >>
> >> I'm not saying we should change to the new RAII style immediately with a
> >> huge patch nor everyone should accept it immediately, but to gradually
> >> use the new style in new codes, with the usual proper review/testing
> >> procedures to keep the correctness.
> >
> > I would understand if you introduced a _new_ lock and started using it
> > with the RAII style - setting the example. But if you're grabbing a
> > lock which is always acquired using the explicit style, just stick to
> > it and keep the style consistent throughout all the callsites, the
> > whole code base. This makes it _easier_ to crosscheck, IMO.
>
> Then check kernfs_root::kernfs_rwsem; it also mixes the RAII and old styles.
>
> We should all remember there are always subsystems before us facing this
> challenge, and if they are fine embracing the new style gradually, I see
> no reason why we cannot.
>
> We're just a regular subsystem in the kernel, not some god-chosen
> special one, we do not live in a bubble.
Honestly, this is a weird argument to make. There are local coding
styles that are their own bubbles; look at net/* with its own special
commenting style and mandatory reverse xmas tree sorting of variable
declarations (I think that one has been adopted by other subsystems
too). Contributing to
those subsystems means following their style. If you want an example
of a filesystem with unique code formatting style then go no further
than XFS.
I think we've been doing fine in btrfs with the style consistency and
incremental updates of old code when possible. This means we have fewer
choices for personal "creative expression" in how the code is written,
but in the long run it makes the code look better.
The problem with the guards is that there's no one good way for their
general use, i.e. the problem of mixing with explicit lock/unlock,
trylock, unlock/lock order, etc. Allowing both styles will lead to
inconsistent style based on personal preferences again.
As said on slack, we'll get to a conclusion.
* Re: [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated
2025-11-24 22:25 ` David Sterba
@ 2025-11-24 23:16 ` Qu Wenruo
0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-11-24 23:16 UTC (permalink / raw)
To: dsterba
Cc: Daniel Vacek, Filipe Manana, Qu Wenruo, linux-btrfs,
linux-fsdevel@vger.kernel.org
在 2025/11/25 08:55, David Sterba 写道:
> On Sat, Nov 22, 2025 at 07:16:06AM +1030, Qu Wenruo wrote:
[...]
>> Then check fs/*.c.
>>
>> There is guard(rcu) and rcu_read_lock() usage mixed in different files.
>>
>> E.g. in fs/namespace.c it's using guard(rcu) and scoped_guard(rcu),
>> meanwhile still having regular rcu_read_lock().
>>
>> There are more counter-examples than you know.
>> We're not the first subsystem to face the new RAII styles, and I hope we
>> will not be the last either.
>
> We take inspiration from other subsystems but that some coding style is
> done in another subsystem does not mean we have to do that too. If
> fs/*.c people decide to use it everywhere then so be it.
>
> I was hesitant to introduce the auto cleaning for btrfs_path or kfree
> but so far I found it working. The locking guards are quite different to
> that and I don't seem to get any liking in it
They are all RAII; I don't see much difference between them.
Although the path auto-freeing is hiding quite some bad designs.
The path should only be allocated when first utilized (normally by
btrfs_search_slot()).
But our btrfs_search_slot() design also needs to reuse pre-allocated
paths for certain call sites, thus we have to treat path auto-freeing
just like auto-freeing a pointer. It is just a compromise.
So you're just saying, "I'm fine with a compromise, but not with the
properly designed usage".
I won't argue that guard() may be a little tricky to read and extend in
the future, but I see the readability cost as still way better than
potentially missed unlocks.
Not to mention we have scoped_guard(), which makes the critical section
much easier to see, and is still easy to extend.
[...]
>> We're just a regular subsystem in the kernel, not some god-chosen
>> special one, we do not live in a bubble.
>
> Honestly, this is a weird argument to make. There are local coding
> styles that are their own bubbles, look at net/* with their own special
> commenting style and mandatory reverse xmass tree sort of variables (I
> think that one has been adopted by other subsystems too). Contributing to
> those subsystems means following their style. If you want an example
> of a filesystem with unique code formatting style then go no further
> than XFS.
This is not about coding style, but about whether to adopt existing
practices/interfaces already utilized by several programming languages
and subsystems.
A lot of core synchronization code already supports RAII, and VFS is
also trying to push new RAII-based APIs to filesystems, e.g.:
https://lore.kernel.org/linux-btrfs/20251104-work-guards-v1-0-5108ac78a171@kernel.org/
You can use whatever code style you like, but in the end we still depend
on the infrastructure from the VFS/MM/block layers. You cannot resist
the changes coming from them, or you go down the path of a deprecated
filesystem.
>
> I think we've been doing fine in btrfs with the style consistency and
> incremental updates of old code when possible. This means we have fewer
> choices for personal "creative expression" how the code is written but
> in the long run it makes the code look better.
Then there is no conflict with RAII; it's all about writing better and
more correct code.
And RAII even allows less "creative expression" in how a resource is
released: no creative tag names or whatever.
There may be some cases like holding different locks for a critical
section, thus we may want to choose between things like scoped_guard()
{guard()} or { guard(); guard(); }, but that's way less creative than
manual GC with tags.
>
> The problem with the guards is that there's no one good way for their
> general use, i.e. the problem of mixing with explicit lock/unlock,
Yes, I saw some bugs during the above VFS RFC patchset.
But that's a common thing in development. Code changes will bring bugs,
but in the long run I believe RAII is way safer than manual GC, and it
encourages better design of variable lifespans.
> trylock,
There are already "*_try" guard definitions, although I found only very
few users.
E.g. drivers/input/keyboard/maple_keyb.c:
scoped_guard(mutex_try, &maple_keyb_mutex) {
> unlock/lock order,
Isn't that an advantage of RAII? The compiler ensures the reverse order
of release when exiting the scope, unlike human-error-prone manual GC.
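[Editorial aside: that reverse-order guarantee is easy to demonstrate outside the kernel with the same cleanup attribute that underpins guard(); the helper names below are invented for the example.]

```c
#include <assert.h>

static int release_log[4];
static int release_count;

/* Cleanup callback: records the id of the variable being released. */
static void release(int *id)
{
    release_log[release_count++] = *id;
}

static void take_two(void)
{
    /* "Acquired" in declaration order: 1 then 2 ... */
    int a __attribute__((cleanup(release))) = 1;
    int b __attribute__((cleanup(release))) = 2;
    (void)a;
    (void)b;
    /* ... released automatically in reverse order when the scope ends:
     * 2 first, then 1. */
}
```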
> etc. Allowing both styles will lead to
> inconsistent style based on personal preferences again.
You're asking for a huge change on a large code base.
You should know it's not going to happen all at once; mixed RAII and
manual GC will stay for a while.
I'm not encouraging mixing, but that's just unavoidable.
I'd suggest not mixing RAII and manual GC inside the same function.
>
> As said on slack, we'll get to a conclusion.
At least I hope it's a future-proof conclusion.
Thanks,
Qu
end of thread, other threads:[~2025-11-24 23:16 UTC | newest]
Thread overview: 16+ messages
2025-11-20 9:16 [PATCH v3 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
2025-11-20 9:16 ` [PATCH v3 1/4] btrfs: make sure all ordered extents beyond EOF is properly truncated Qu Wenruo
2025-11-21 11:55 ` Filipe Manana
2025-11-21 15:38 ` David Sterba
2025-11-21 19:26 ` Qu Wenruo
2025-11-21 20:25 ` Daniel Vacek
2025-11-21 20:46 ` Qu Wenruo
2025-11-24 22:25 ` David Sterba
2025-11-24 23:16 ` Qu Wenruo
2025-11-22 1:14 ` Qu Wenruo
2025-11-24 12:22 ` Filipe Manana
2025-11-24 21:00 ` Qu Wenruo
2025-11-20 9:16 ` [PATCH v3 2/4] btrfs: integrate the error handling of submit_one_sector() Qu Wenruo
2025-11-20 9:16 ` [PATCH v3 3/4] btrfs: replace for_each_set_bit() with for_each_set_bitmap() Qu Wenruo
2025-11-20 9:16 ` [PATCH v3 4/4] btrfs: reduce extent map lookup during writes Qu Wenruo
2025-11-21 1:02 ` [PATCH v3 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Boris Burkov