* [PATCH v2 1/5] btrfs: subpage: fix a bug that blocks large folios
2025-03-29 9:19 [PATCH v2 0/5] btrfs: add the missing preparations exposed by initial large data folio support Qu Wenruo
@ 2025-03-29 9:19 ` Qu Wenruo
2025-03-29 9:19 ` [PATCH v2 2/5] btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range() Qu Wenruo
` (3 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-03-29 9:19 UTC (permalink / raw)
To: linux-btrfs; +Cc: Filipe Manana
Inside the macro subpage_calc_start_bit() we need to calculate the
offset to the beginning of the folio.
But we're using offset_in_page(); on systems with 4K page size and 4K fs
block size, this means we will always return offset 0 for a large folio,
causing all kinds of errors.
Fix it by using offset_in_folio() instead.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/subpage.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 11dbd7be6a3b..bd252c78a261 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -204,7 +204,7 @@ static void btrfs_subpage_assert(const struct btrfs_fs_info *fs_info,
btrfs_blocks_per_folio(fs_info, folio); \
\
btrfs_subpage_assert(fs_info, folio, start, len); \
- __start_bit = offset_in_page(start) >> fs_info->sectorsize_bits; \
+ __start_bit = offset_in_folio(folio, start) >> fs_info->sectorsize_bits; \
__start_bit += blocks_per_folio * btrfs_bitmap_nr_##name; \
__start_bit; \
})
--
2.49.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v2 2/5] btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range()
2025-03-29 9:19 [PATCH v2 0/5] btrfs: add the missing preparations exposed by initial large data folio support Qu Wenruo
2025-03-29 9:19 ` [PATCH v2 1/5] btrfs: subpage: fix a bug that blocks large folios Qu Wenruo
@ 2025-03-29 9:19 ` Qu Wenruo
2025-03-31 11:24 ` Filipe Manana
2025-03-29 9:19 ` [PATCH v2 3/5] btrfs: refactor how we handle reserved space inside copy_one_range() Qu Wenruo
` (2 subsequent siblings)
4 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2025-03-29 9:19 UTC (permalink / raw)
To: linux-btrfs
[BUG]
When running btrfs/004 with 4K fs block size and 64K page size,
sometimes fsstress workload can take 100% CPU for a while, but not long
enough to trigger a 120s hang warning.
[CAUSE]
When such 100% CPU usage happens, btrfs_punch_hole_lock_range() is
always in the call trace.
With extra ftrace debugs, it shows something like this:
btrfs_punch_hole_lock_range: r/i=5/2745 start=4096(65536)
end=20479(18446744073709551615) enter
Where 4096 is the @lockstart parameter, 65536 is the rounded up
@page_lockstart, 20479 is the @lockend parameter.
So far so good.
But the large number (u64)-1 is the @page_lockend, which is not correct.
This is caused by the fact that round_down(lockend + 1, PAGE_SIZE)
results in 0.
In the above case, the range is inside the same page, and we do not even
need to call filemap_range_has_page(), let alone call it with
(u64)-1 as the end.
This behavior will cause btrfs_punch_hole_lock_range() to busy loop
waiting for an irrelevant range to have its pages dropped.
[FIX]
Calculate @page_lockend by just rounding down @lockend, without
decreasing the value by one.
So @page_lockend will no longer overflow.
Then exit early if @page_lockend is no larger than @page_lockestart.
As it means either the range is inside the same page, or the two pages
are adjacent already.
Finally only decrease @page_lockend when calling
filemap_range_has_page().
Fixes: 0528476b6ac7 ("btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 589d36f8de12..7c147ef9368d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2126,15 +2126,20 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
* will always return true.
* So here we need to do extra page alignment for
* filemap_range_has_page().
+ *
+ * And do not decrease page_lockend right now, as it can be 0.
*/
const u64 page_lockstart = round_up(lockstart, PAGE_SIZE);
- const u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1;
+ const u64 page_lockend = round_down(lockend + 1, PAGE_SIZE);
while (1) {
truncate_pagecache_range(inode, lockstart, lockend);
lock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
cached_state);
+ /* The same page or adjacent pages. */
+ if (page_lockend <= page_lockstart)
+ break;
/*
* We can't have ordered extents in the range, nor dirty/writeback
* pages, because we have locked the inode's VFS lock in exclusive
@@ -2146,7 +2151,7 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
* we do, unlock the range and retry.
*/
if (!filemap_range_has_page(inode->i_mapping, page_lockstart,
- page_lockend))
+ page_lockend - 1))
break;
unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
--
2.49.0
* Re: [PATCH v2 2/5] btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range()
2025-03-29 9:19 ` [PATCH v2 2/5] btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range() Qu Wenruo
@ 2025-03-31 11:24 ` Filipe Manana
0 siblings, 0 replies; 9+ messages in thread
From: Filipe Manana @ 2025-03-31 11:24 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Sat, Mar 29, 2025 at 9:20 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> When running btrfs/004 with 4K fs block size and 64K page size,
> sometimes fsstress workload can take 100% CPU for a while, but not long
> enough to trigger a 120s hang warning.
>
> [CAUSE]
> When such 100% CPU usage happens, btrfs_punch_hole_lock_range() is
> always in the call trace.
>
> With extra ftrace debugs, it shows something like this:
>
> btrfs_punch_hole_lock_range: r/i=5/2745 start=4096(65536)
What's this r/i=5/2745?
Since this is from a custom tracepoint you made for debugging, I
suggest removing that part from the changelog.
It's not obvious (to me at least) what it is and it's not relevant to
understand the problem being fixed.
> end=20479(18446744073709551615) enter
>
> Where 4096 is the @lockstart parameter, 65536 is the rounded up
> @page_lockstart, 20479 is the @lockend parameter.
> So far so good.
>
> But the large number (u64)-1 is the @page_lockend, which is not correct.
>
> This is caused by the fact that round_down(lockend + 1, PAGE_SIZE)
> results in 0.
>
> In the above case, the range is inside the same page, and we do not even
> need to call filemap_range_has_page(), let alone call it with
> (u64)-1 as the end.
>
> This behavior will cause btrfs_punch_hole_lock_range() to busy loop
> waiting for an irrelevant range to have its pages dropped.
>
> [FIX]
> Calculate @page_lockend by just rounding down @lockend, without
> decreasing the value by one.
> So @page_lockend will no longer overflow.
>
> Then exit early if @page_lockend is no larger than @page_lockestart.
@page_lockestart -> @page_lockstart
> As it means either the range is inside the same page, or the two pages
> are adjacent already.
>
> Finally only decrease @page_lockend when calling
> filemap_range_has_page().
>
> Fixes: 0528476b6ac7 ("btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()")
> Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
With those minor things fixed up, it looks good, thanks.
> ---
> fs/btrfs/file.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 589d36f8de12..7c147ef9368d 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2126,15 +2126,20 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
> * will always return true.
> * So here we need to do extra page alignment for
> * filemap_range_has_page().
> + *
> + * And do not decrease page_lockend right now, as it can be 0.
> */
> const u64 page_lockstart = round_up(lockstart, PAGE_SIZE);
> - const u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1;
> + const u64 page_lockend = round_down(lockend + 1, PAGE_SIZE);
>
> while (1) {
> truncate_pagecache_range(inode, lockstart, lockend);
>
> lock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> cached_state);
> + /* The same page or adjacent pages. */
> + if (page_lockend <= page_lockstart)
> + break;
> /*
> * We can't have ordered extents in the range, nor dirty/writeback
> * pages, because we have locked the inode's VFS lock in exclusive
> @@ -2146,7 +2151,7 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
> * we do, unlock the range and retry.
> */
> if (!filemap_range_has_page(inode->i_mapping, page_lockstart,
> - page_lockend))
> + page_lockend - 1))
> break;
>
> unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> --
> 2.49.0
>
>
* [PATCH v2 3/5] btrfs: refactor how we handle reserved space inside copy_one_range()
2025-03-29 9:19 [PATCH v2 0/5] btrfs: add the missing preparations exposed by initial large data folio support Qu Wenruo
2025-03-29 9:19 ` [PATCH v2 1/5] btrfs: subpage: fix a bug that blocks large folios Qu Wenruo
2025-03-29 9:19 ` [PATCH v2 2/5] btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range() Qu Wenruo
@ 2025-03-29 9:19 ` Qu Wenruo
2025-03-29 9:19 ` [PATCH v2 4/5] btrfs: prepare btrfs_buffered_write() for large data folios Qu Wenruo
2025-03-29 9:19 ` [PATCH v2 5/5] btrfs: prepare btrfs_punch_hole_lock_range() " Qu Wenruo
4 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-03-29 9:19 UTC (permalink / raw)
To: linux-btrfs; +Cc: Filipe Manana
There are several things not ideal inside copy_one_range():
- Unnecessary temporary variables
* block_offset
* reserve_bytes
* dirty_blocks
* num_blocks
* release_bytes
These are utilized to handle short-copy cases.
- Inconsistent handling of btrfs_delalloc_release_extents()
There is a hidden behavior that, after reserving metadata for X bytes
of data write, we have to call btrfs_delalloc_release_extents() with X
once and only once.
Calling btrfs_delalloc_release_extents(X - 4K) and
btrfs_delalloc_release_extents(4K) will cause outstanding extents
accounting to go wrong.
This is because the outstanding extents mechanism is not designed to
handle shrink of reserved space.
Improve above situations by:
- Use a single @reserved_start and @reserved_len pair
Now we reserve space for the initial range; if a short copy happens
and we need to shrink the reserved space, we can easily calculate the
new length and update @reserved_len.
- Introduce helpers to shrink reserved data and metadata space
This is done by two new helpers, shrink_reserved_space() and
btrfs_delalloc_shrink_extents().
The latter will do a better calculation of whether we need to modify the
outstanding extents, and the former will be utilized inside
copy_one_range().
- Manually unlock, release reserved space and return if no byte is
copied
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/delalloc-space.c | 24 +++++++++
fs/btrfs/delalloc-space.h | 3 +-
fs/btrfs/file.c | 102 ++++++++++++++++++++++----------------
3 files changed, 86 insertions(+), 43 deletions(-)
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 88e900e5a43d..82289c860476 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -439,6 +439,30 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
btrfs_inode_rsv_release(inode, true);
}
+/* Shrink a previously reserved extent to a new length. */
+void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len,
+ u64 new_len)
+{
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ const u32 reserved_num_extents = count_max_extents(fs_info, reserved_len);
+ const u32 new_num_extents = count_max_extents(fs_info, new_len);
+ const int diff_num_extents = new_num_extents - reserved_num_extents;
+
+ ASSERT(new_len <= reserved_len);
+ if (new_num_extents == reserved_num_extents)
+ return;
+
+ spin_lock(&inode->lock);
+ btrfs_mod_outstanding_extents(inode, diff_num_extents);
+ btrfs_calculate_inode_block_rsv_size(fs_info, inode);
+ spin_unlock(&inode->lock);
+
+ if (btrfs_is_testing(fs_info))
+ return;
+
+ btrfs_inode_rsv_release(inode, true);
+}
+
/*
* Reserve data and metadata space for delalloc
*
diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
index 3f32953c0a80..c61580c63caf 100644
--- a/fs/btrfs/delalloc-space.h
+++ b/fs/btrfs/delalloc-space.h
@@ -27,5 +27,6 @@ int btrfs_delalloc_reserve_space(struct btrfs_inode *inode,
int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
u64 disk_num_bytes, bool noflush);
void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
-
+void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len,
+ u64 new_len);
#endif /* BTRFS_DELALLOC_SPACE_H */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 7c147ef9368d..e421b64f7038 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1151,6 +1151,23 @@ static ssize_t reserve_space(struct btrfs_inode *inode,
return reserve_bytes;
}
+/* Shrink the reserved data and metadata space from @reserved_len to @new_len. */
+static void shrink_reserved_space(struct btrfs_inode *inode,
+ struct extent_changeset *data_reserved,
+ u64 reserved_start, u64 reserved_len,
+ u64 new_len, bool only_release_metadata)
+{
+ const u64 diff = reserved_len - new_len;
+
+ ASSERT(new_len <= reserved_len);
+ btrfs_delalloc_shrink_extents(inode, reserved_len, new_len);
+ if (only_release_metadata)
+ btrfs_delalloc_release_metadata(inode, diff, true);
+ else
+ btrfs_delalloc_release_space(inode, data_reserved,
+ reserved_start + new_len, diff, true);
+}
+
/*
* Do the heavy-lifting work to copy one range into one folio of the page cache.
*
@@ -1164,14 +1181,11 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_state *cached_state = NULL;
- const size_t block_offset = (start & (fs_info->sectorsize - 1));
size_t write_bytes = min(iov_iter_count(iter), PAGE_SIZE - offset_in_page(start));
- size_t reserve_bytes;
size_t copied;
- size_t dirty_blocks;
- size_t num_blocks;
+ const u64 reserved_start = round_down(start, fs_info->sectorsize);
+ u64 reserved_len;
struct folio *folio = NULL;
- u64 release_bytes;
int extents_locked;
u64 lockstart;
u64 lockend;
@@ -1190,23 +1204,25 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
&only_release_metadata);
if (ret < 0)
return ret;
- reserve_bytes = ret;
- release_bytes = reserve_bytes;
+ reserved_len = ret;
+ /* Write range must be inside the reserved range. */
+ ASSERT(reserved_start <= start);
+ ASSERT(start + write_bytes <= reserved_start + reserved_len);
again:
ret = balance_dirty_pages_ratelimited_flags(inode->vfs_inode.i_mapping,
bdp_flags);
if (ret) {
- btrfs_delalloc_release_extents(inode, reserve_bytes);
- release_space(inode, *data_reserved, start, release_bytes,
+ btrfs_delalloc_release_extents(inode, reserved_len);
+ release_space(inode, *data_reserved, reserved_start, reserved_len,
only_release_metadata);
return ret;
}
ret = prepare_one_folio(&inode->vfs_inode, &folio, start, write_bytes, false);
if (ret) {
- btrfs_delalloc_release_extents(inode, reserve_bytes);
- release_space(inode, *data_reserved, start, release_bytes,
+ btrfs_delalloc_release_extents(inode, reserved_len);
+ release_space(inode, *data_reserved, reserved_start, reserved_len,
only_release_metadata);
return ret;
}
@@ -1217,8 +1233,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
if (!nowait && extents_locked == -EAGAIN)
goto again;
- btrfs_delalloc_release_extents(inode, reserve_bytes);
- release_space(inode, *data_reserved, start, release_bytes,
+ btrfs_delalloc_release_extents(inode, reserved_len);
+ release_space(inode, *data_reserved, reserved_start, reserved_len,
only_release_metadata);
ret = extents_locked;
return ret;
@@ -1228,41 +1244,43 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
write_bytes, iter);
flush_dcache_folio(folio);
- /*
- * If we get a partial write, we can end up with partially uptodate
- * page. Although if sector size < page size we can handle it, but if
- * it's not sector aligned it can cause a lot of complexity, so make
- * sure they don't happen by forcing retry this copy.
- */
if (unlikely(copied < write_bytes)) {
+ u64 last_block;
+
+ /*
+ * The original write range doesn't need an uptodate folio as
+ * the range is block aligned. But now a short copy happened.
+ * We can not handle it without an uptodate folio.
+ *
+ * So just revert the range and we will retry.
+ */
if (!folio_test_uptodate(folio)) {
iov_iter_revert(iter, copied);
copied = 0;
}
- }
- num_blocks = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
- dirty_blocks = round_up(copied + block_offset, fs_info->sectorsize);
- dirty_blocks = BTRFS_BYTES_TO_BLKS(fs_info, dirty_blocks);
-
- if (copied == 0)
- dirty_blocks = 0;
-
- if (num_blocks > dirty_blocks) {
- /* Release everything except the sectors we dirtied. */
- release_bytes -= dirty_blocks << fs_info->sectorsize_bits;
- if (only_release_metadata) {
- btrfs_delalloc_release_metadata(inode, release_bytes, true);
- } else {
- const u64 release_start = round_up(start + copied,
- fs_info->sectorsize);
-
- btrfs_delalloc_release_space(inode, *data_reserved,
- release_start, release_bytes,
- true);
+ /* No copied byte, unlock, release reserved space and exit. */
+ if (copied == 0) {
+ if (extents_locked)
+ unlock_extent(&inode->io_tree, lockstart, lockend,
+ &cached_state);
+ else
+ free_extent_state(cached_state);
+ btrfs_delalloc_release_extents(inode, reserved_len);
+ release_space(inode, *data_reserved, reserved_start, reserved_len,
+ only_release_metadata);
+ btrfs_drop_folio(fs_info, folio, start, copied);
+ return 0;
}
+
+ /* Release the reserved space beyond the last block. */
+ last_block = round_up(start + copied, fs_info->sectorsize);
+
+ shrink_reserved_space(inode, *data_reserved, reserved_start,
+ reserved_len, last_block - reserved_start,
+ only_release_metadata);
+ reserved_len = last_block - reserved_start;
}
- release_bytes = round_up(copied + block_offset, fs_info->sectorsize);
ret = btrfs_dirty_folio(inode, folio, start, copied, &cached_state,
only_release_metadata);
@@ -1278,10 +1296,10 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
else
free_extent_state(cached_state);
- btrfs_delalloc_release_extents(inode, reserve_bytes);
+ btrfs_delalloc_release_extents(inode, reserved_len);
if (ret) {
btrfs_drop_folio(fs_info, folio, start, copied);
- release_space(inode, *data_reserved, start, release_bytes,
+ release_space(inode, *data_reserved, reserved_start, reserved_len,
only_release_metadata);
return ret;
}
--
2.49.0
* [PATCH v2 4/5] btrfs: prepare btrfs_buffered_write() for large data folios
2025-03-29 9:19 [PATCH v2 0/5] btrfs: add the missing preparations exposed by initial large data folio support Qu Wenruo
` (2 preceding siblings ...)
2025-03-29 9:19 ` [PATCH v2 3/5] btrfs: refactor how we handle reserved space inside copy_one_range() Qu Wenruo
@ 2025-03-29 9:19 ` Qu Wenruo
2025-03-29 9:19 ` [PATCH v2 5/5] btrfs: prepare btrfs_punch_hole_lock_range() " Qu Wenruo
4 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-03-29 9:19 UTC (permalink / raw)
To: linux-btrfs; +Cc: Filipe Manana
This involves the following modifications:
- Set the order flags for __filemap_get_folio() inside
prepare_one_folio()
This will allow __filemap_get_folio() to create a large folio if the
address space supports it.
- Limit the initial @write_bytes inside copy_one_range()
If the largest folio boundary splits the initial write range, there is
no way we can write beyond the largest folio boundary.
This is done by a simple helper function, calc_write_bytes().
- Release the excess reserved space if the folio is smaller than expected
This is the same handling as when a short copy happens.
All these preparations should not change the behavior when the largest
folio order is 0.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/file.c | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e421b64f7038..a7afc55bab2a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -861,7 +861,8 @@ static noinline int prepare_one_folio(struct inode *inode, struct folio **folio_
{
unsigned long index = pos >> PAGE_SHIFT;
gfp_t mask = get_prepare_gfp_flags(inode, nowait);
- fgf_t fgp_flags = (nowait ? FGP_WRITEBEGIN | FGP_NOWAIT : FGP_WRITEBEGIN);
+ fgf_t fgp_flags = (nowait ? FGP_WRITEBEGIN | FGP_NOWAIT : FGP_WRITEBEGIN) |
+ fgf_set_order(write_bytes);
struct folio *folio;
int ret = 0;
@@ -1168,6 +1169,16 @@ static void shrink_reserved_space(struct btrfs_inode *inode,
reserved_start + new_len, diff, true);
}
+/* Calculate the maximum amount of bytes we can write into one folio. */
+static size_t calc_write_bytes(const struct btrfs_inode *inode,
+ const struct iov_iter *iter, u64 start)
+{
+ const size_t max_folio_size = mapping_max_folio_size(inode->vfs_inode.i_mapping);
+
+ return min(max_folio_size - (start & (max_folio_size - 1)),
+ iov_iter_count(iter));
+}
+
/*
* Do the heavy-lifting work to copy one range into one folio of the page cache.
*
@@ -1181,7 +1192,7 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_state *cached_state = NULL;
- size_t write_bytes = min(iov_iter_count(iter), PAGE_SIZE - offset_in_page(start));
+ size_t write_bytes = calc_write_bytes(inode, iter, start);
size_t copied;
const u64 reserved_start = round_down(start, fs_info->sectorsize);
u64 reserved_len;
@@ -1226,9 +1237,25 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
only_release_metadata);
return ret;
}
+
+ /*
+ * The reserved range goes beyond the current folio, shrink the reserved
+ * space to the folio boundary.
+ */
+ if (reserved_start + reserved_len > folio_pos(folio) + folio_size(folio)) {
+ const u64 last_block = folio_pos(folio) + folio_size(folio);
+
+ shrink_reserved_space(inode, *data_reserved, reserved_start,
+ reserved_len, last_block - reserved_start,
+ only_release_metadata);
+ write_bytes = last_block - start;
+ reserved_len = last_block - reserved_start;
+ }
+
extents_locked = lock_and_cleanup_extent_if_need(inode, folio, start,
write_bytes, &lockstart,
- &lockend, nowait, &cached_state);
+ &lockend, nowait,
+ &cached_state);
if (extents_locked < 0) {
if (!nowait && extents_locked == -EAGAIN)
goto again;
--
2.49.0
* [PATCH v2 5/5] btrfs: prepare btrfs_punch_hole_lock_range() for large data folios
2025-03-29 9:19 [PATCH v2 0/5] btrfs: add the missing preparations exposed by initial large data folio support Qu Wenruo
` (3 preceding siblings ...)
2025-03-29 9:19 ` [PATCH v2 4/5] btrfs: prepare btrfs_buffered_write() for large data folios Qu Wenruo
@ 2025-03-29 9:19 ` Qu Wenruo
2025-03-31 11:50 ` Filipe Manana
4 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2025-03-29 9:19 UTC (permalink / raw)
To: linux-btrfs
The function btrfs_punch_hole_lock_range() needs to make sure there is
no other folio in the range, thus it goes with filemap_range_has_page(),
which works pretty well.
But if we have large folios, in the following case
filemap_range_has_page() will always return true, forcing
btrfs_punch_hole_lock_range() to do a very time consuming busy loop:
start end
| |
|//|//|//|//| | | | | | | | |//|//|
\ / \ /
Folio A Folio B
In the above case, folios A and B contain our start/end indexes, and there
are no other folios in the range.
Thus we do not need to retry inside btrfs_punch_hole_lock_range().
To prepare for large data folios, introduce a helper,
check_range_has_page(), which will:
- Shrink the search range towards page boundaries
If the rounded-down end (exclusive, otherwise it can underflow when @end
is inside the folio at file offset 0) is no larger than the rounded-up
start, the range contains no pages other than the ones covering @start
and @end.
We can return false directly in that case.
- Grab all the folios inside the range
- Skip any large folios that cover the start and end indexes
- If any other folios are found return true
- Otherwise return false
This new helper is going to handle both large folios and regular ones.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/file.c | 69 +++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 58 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a7afc55bab2a..bd0bb7aea99d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2159,11 +2159,29 @@ static int find_first_non_hole(struct btrfs_inode *inode, u64 *start, u64 *len)
return ret;
}
-static void btrfs_punch_hole_lock_range(struct inode *inode,
- const u64 lockstart,
- const u64 lockend,
- struct extent_state **cached_state)
+/*
+ * The helper to check if there is no folio in the range.
+ *
+ * We can not utilize filemap_range_has_page() in a filemap with large folios
+ * as we can hit the following false positive:
+ *
+ * start end
+ * | |
+ * |//|//|//|//| | | | | | | | |//|//|
+ * \ / \ /
+ * Folio A Folio B
+ *
+ * Large folios A and B cover the start and end indexes.
+ * In that case filemap_range_has_page() will always return true, but the above
+ * case is fine for btrfs_punch_hole_lock_range() usage.
+ *
+ * So here we only ensure that no other folios are in the range, excluding the
+ * head/tail large folio.
+ */
+static bool check_range_has_page(struct inode *inode, u64 start, u64 end)
{
+ struct folio_batch fbatch;
+ bool ret = false;
/*
* For subpage case, if the range is not at page boundary, we could
* have pages at the leading/tailing part of the range.
@@ -2174,17 +2192,47 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
*
* And do not decrease page_lockend right now, as it can be 0.
*/
- const u64 page_lockstart = round_up(lockstart, PAGE_SIZE);
- const u64 page_lockend = round_down(lockend + 1, PAGE_SIZE);
+ const u64 page_lockstart = round_up(start, PAGE_SIZE);
+ const u64 page_lockend = round_down(end+ 1, PAGE_SIZE);
+ const pgoff_t start_index = page_lockstart >> PAGE_SHIFT;
+ const pgoff_t end_index = (page_lockend - 1) >> PAGE_SHIFT;
+ pgoff_t tmp = start_index;
+ int found_folios;
+ /* The same page or adjacent pages. */
+ if (page_lockend <= page_lockstart)
+ return false;
+
+ folio_batch_init(&fbatch);
+ found_folios = filemap_get_folios(inode->i_mapping, &tmp, end_index,
+ &fbatch);
+ for (int i = 0; i < found_folios; i++) {
+ struct folio *folio = fbatch.folios[i];
+
+ /* A large folio begins before the start. Not a target. */
+ if (folio->index < start_index)
+ continue;
+ /* A large folio extends beyond the end. Not a target. */
+ if (folio->index + folio_nr_pages(folio) > end_index)
+ continue;
+ /* A folio doesn't cover the head/tail index. Found a target. */
+ ret = true;
+ break;
+ }
+ folio_batch_release(&fbatch);
+ return ret;
+}
+
+static void btrfs_punch_hole_lock_range(struct inode *inode,
+ const u64 lockstart,
+ const u64 lockend,
+ struct extent_state **cached_state)
+{
while (1) {
truncate_pagecache_range(inode, lockstart, lockend);
lock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
cached_state);
- /* The same page or adjacent pages. */
- if (page_lockend <= page_lockstart)
- break;
/*
* We can't have ordered extents in the range, nor dirty/writeback
* pages, because we have locked the inode's VFS lock in exclusive
@@ -2195,8 +2243,7 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
* locking the range check if we have pages in the range, and if
* we do, unlock the range and retry.
*/
- if (!filemap_range_has_page(inode->i_mapping, page_lockstart,
- page_lockend - 1))
+ if (!check_range_has_page(inode, lockstart, lockend))
break;
unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
--
2.49.0
* Re: [PATCH v2 5/5] btrfs: prepare btrfs_punch_hole_lock_range() for large data folios
2025-03-29 9:19 ` [PATCH v2 5/5] btrfs: prepare btrfs_punch_hole_lock_range() " Qu Wenruo
@ 2025-03-31 11:50 ` Filipe Manana
2025-03-31 21:19 ` Qu Wenruo
0 siblings, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2025-03-31 11:50 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Sat, Mar 29, 2025 at 9:20 AM Qu Wenruo <wqu@suse.com> wrote:
>
> The function btrfs_punch_hole_lock_range() needs to make sure there is
> no other folio in the range, thus it goes with filemap_range_has_page(),
> which works pretty well.
>
> But if we have large folios, in the following case
> filemap_range_has_page() will always return true, forcing
> btrfs_punch_hole_lock_range() to do a very time consuming busy loop:
>
> start end
> | |
> |//|//|//|//| | | | | | | | |//|//|
> \ / \ /
> Folio A Folio B
>
> In the above case, folios A and B contain our start/end indexes, and there
> are no other folios in the range.
> Thus we do not need to retry inside btrfs_punch_hole_lock_range().
>
> To prepare for large data folios, introduce a helper,
> check_range_has_page(), which will:
>
> - Shrink the search range towards page boundaries
> If the rounded-down end (exclusive, otherwise it can underflow when @end
> is inside the folio at file offset 0) is no larger than the rounded-up
> start, the range contains no pages other than the ones covering @start
> and @end.
>
> We can return false directly in that case.
>
> - Grab all the folios inside the range
>
> - Skip any large folios that cover the start and end indexes
>
> - If any other folios are found return true
>
> - Otherwise return false
>
> This new helper is going to handle both large folios and regular ones.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/file.c | 69 +++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 58 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index a7afc55bab2a..bd0bb7aea99d 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2159,11 +2159,29 @@ static int find_first_non_hole(struct btrfs_inode *inode, u64 *start, u64 *len)
> return ret;
> }
>
> -static void btrfs_punch_hole_lock_range(struct inode *inode,
> - const u64 lockstart,
> - const u64 lockend,
> - struct extent_state **cached_state)
> +/*
> + * The helper to check if there is no folio in the range.
> + *
> + * We can not utilize filemap_range_has_page() in a filemap with large folios
> + * as we can hit the following false positive:
> + *
> + * start end
> + * | |
> + * |//|//|//|//| | | | | | | | |//|//|
> + * \ / \ /
> + * Folio A Folio B
> + *
> + * Large folios A and B cover the start and end indexes.
> + * In that case filemap_range_has_page() will always return true, but the above
> + * case is fine for btrfs_punch_hole_lock_range() usage.
> + *
> + * So here we only ensure that no other folios are in the range, excluding the
> + * head/tail large folio.
> + */
> +static bool check_range_has_page(struct inode *inode, u64 start, u64 end)
> {
> + struct folio_batch fbatch;
> + bool ret = false;
> /*
> * For subpage case, if the range is not at page boundary, we could
> * have pages at the leading/tailing part of the range.
> @@ -2174,17 +2192,47 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
> *
> * And do not decrease page_lockend right now, as it can be 0.
> */
> - const u64 page_lockstart = round_up(lockstart, PAGE_SIZE);
> - const u64 page_lockend = round_down(lockend + 1, PAGE_SIZE);
> + const u64 page_lockstart = round_up(start, PAGE_SIZE);
> + const u64 page_lockend = round_down(end+ 1, PAGE_SIZE);
Missing space between 'end' and '+ 1'.
> + const pgoff_t start_index = page_lockstart >> PAGE_SHIFT;
> + const pgoff_t end_index = (page_lockend - 1) >> PAGE_SHIFT;
> + pgoff_t tmp = start_index;
> + int found_folios;
>
> + /* The same page or adjacent pages. */
> + if (page_lockend <= page_lockstart)
> + return false;
> +
> + folio_batch_init(&fbatch);
> + found_folios = filemap_get_folios(inode->i_mapping, &tmp, end_index,
> + &fbatch);
> + for (int i = 0; i < found_folios; i++) {
> + struct folio *folio = fbatch.folios[i];
> +
> + /* A large folio begins before the start. Not a target. */
> + if (folio->index < start_index)
> + continue;
We passed start_index (via tmp) to filemap_get_folios(). Isn't the
function supposed to return folios only at an index >= start_index?
It's what the documentation says and the implementation seems to
behave that way too.
Removing that, we could also pass start_index to
filemap_get_folios(), making it non-const, and remove the tmp
variable.
Either way it looks good and that's a minor thing:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
> + /* A large folio extends beyond the end. Not a target. */
> + if (folio->index + folio_nr_pages(folio) > end_index)
> + continue;
> + /* A folio doesn't cover the head/tail index. Found a target. */
> + ret = true;
> + break;
> + }
> + folio_batch_release(&fbatch);
> + return ret;
> +}
> +
> +static void btrfs_punch_hole_lock_range(struct inode *inode,
> + const u64 lockstart,
> + const u64 lockend,
> + struct extent_state **cached_state)
> +{
> while (1) {
> truncate_pagecache_range(inode, lockstart, lockend);
>
> lock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> cached_state);
> - /* The same page or adjacent pages. */
> - if (page_lockend <= page_lockstart)
> - break;
> /*
> * We can't have ordered extents in the range, nor dirty/writeback
> * pages, because we have locked the inode's VFS lock in exclusive
> @@ -2195,8 +2243,7 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
> * locking the range check if we have pages in the range, and if
> * we do, unlock the range and retry.
> */
> - if (!filemap_range_has_page(inode->i_mapping, page_lockstart,
> - page_lockend - 1))
> + if (!check_range_has_page(inode, lockstart, lockend))
> break;
>
> unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> --
> 2.49.0
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread

* Re: [PATCH v2 5/5] btrfs: prepare btrfs_punch_hole_lock_range() for large data folios
2025-03-31 11:50 ` Filipe Manana
@ 2025-03-31 21:19 ` Qu Wenruo
0 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-03-31 21:19 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
On 2025/3/31 22:20, Filipe Manana wrote:
> On Sat, Mar 29, 2025 at 9:20 AM Qu Wenruo <wqu@suse.com> wrote:
[...]
>> + found_folios = filemap_get_folios(inode->i_mapping, &tmp, end_index,
>> + &fbatch);
>> + for (int i = 0; i < found_folios; i++) {
>> + struct folio *folio = fbatch.folios[i];
>> +
>> + /* A large folio begins before the start. Not a target. */
>> + if (folio->index < start_index)
>> + continue;
>
> We passed start_index (via tmp) to filemap_get_folios(). Isn't the
> function supposed to return folios only at an index >= start_index?
> It's what the documentation says and the implementation seems to
> behave that way too.
Not exactly. The comments of filemap_get_folios_tag() show exactly this:
* The first folio may start before @start; if it does, it will contain
* @start. The final folio may extend beyond @end; if it does, it will
* contain @end.
This is exactly what we need to support large folios.
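
For illustration, a toy Python model of that behavior (my own simplification, not the real implementation):

```python
def filemap_get_folios(cache, start, end_index):
    """cache: sorted (index, nr_pages) tuples standing in for the page
    cache.  Returns the folios intersecting [start, end_index] and the
    updated search index, mimicking how the kernel advances *start past
    the last folio returned."""
    found = []
    for index, nr_pages in cache:
        if index + nr_pages - 1 < start:   # ends before the range
            continue
        if index > end_index:              # begins after the range
            break
        found.append((index, nr_pages))
        start = index + nr_pages           # advance past this folio
    return found, start
```

Searching [1, 7] over folios (0, 2), (3, 1), (6, 4) returns all three — the first folio begins at index 0 < 1 but contains index 1 — and leaves the search index at 10, past the original start_index.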
>
> Removing that we could also use start_index to pass to
> filemap_get_folios(), making it non-const, and remove the tmp
> variable.
The reason I use @tmp is that filemap_get_folios() can update @tmp to
the next folio's start index, which would leave us unable to compare
against the original start_index.
Thanks,
Qu
>
> Either way it looks good and that's a minor thing:
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks.
>
>> + /* A large folio extends beyond the end. Not a target. */
>> + if (folio->index + folio_nr_pages(folio) > end_index)
>> + continue;
>> + /* A folio doesn't cover the head/tail index. Found a target. */
>> + ret = true;
>> + break;
>> + }
>> + folio_batch_release(&fbatch);
>> + return ret;
>> +}
>> +
>> +static void btrfs_punch_hole_lock_range(struct inode *inode,
>> + const u64 lockstart,
>> + const u64 lockend,
>> + struct extent_state **cached_state)
>> +{
>> while (1) {
>> truncate_pagecache_range(inode, lockstart, lockend);
>>
>> lock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
>> cached_state);
>> - /* The same page or adjacent pages. */
>> - if (page_lockend <= page_lockstart)
>> - break;
>> /*
>> * We can't have ordered extents in the range, nor dirty/writeback
>> * pages, because we have locked the inode's VFS lock in exclusive
>> @@ -2195,8 +2243,7 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
>> * locking the range check if we have pages in the range, and if
>> * we do, unlock the range and retry.
>> */
>> - if (!filemap_range_has_page(inode->i_mapping, page_lockstart,
>> - page_lockend - 1))
>> + if (!check_range_has_page(inode, lockstart, lockend))
>> break;
>>
>> unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
>> --
>> 2.49.0
>>
>>
^ permalink raw reply [flat|nested] 9+ messages in thread