* [PATCH 0/4] btrfs: add the missing preparations exposed by initial large data folio support
@ 2025-03-27 22:31 Qu Wenruo
2025-03-27 22:31 ` [PATCH 1/4] btrfs: subpage: fix a bug that blocks large folios Qu Wenruo
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-03-27 22:31 UTC (permalink / raw)
To: linux-btrfs
With all the existing preparations in place, I can finally enable large
data folios and start initial testing (aka short fsstress loops).
To no one's surprise, it immediately exposed several problems:
- A critical subpage helper is still using offset_in_page()
This causes an incorrect start bit calculation and triggers the
ASSERT() inside subpage locking.
Fixed in the first patch.
- Buffered write can not shrink reserved space properly
Since we're reserving data and metadata space before grabbing the
folio, we have to reserve as much space as possible, rather than just
reserving space for the range inside the page.
If the folio we got is smaller than what we expected, we have to shrink
the reserved space, but things like btrfs_delalloc_release_extents()
can not handle it.
Fixed in the second patch, with a new helper
btrfs_delalloc_shrink_extents().
This will also be a topic in the iomap migration: iomap uses its
valid_folio() callbacks to make sure no extent map is changed during
the buffered write, thus it can reserve a large range of space up
front, rather than our current over-reserve-then-shrink approach.
Our behavior is super safe, but less optimized compared to iomap.
- Buffered write is not utilizing large folios
Since buffered write is the main entrance to allocate large folios,
without its support there will be no large folios at all.
Addressed in the third patch.
- Long CPU busy loops inside btrfs_punch_hole_lock_range()
It turns out that the usage of filemap_range_has_page() is never a
good idea for large folios, as we can easily hit the following case:
start end
| |
|//|//|//|//| | | | | | | | |//|//|
\ / \ /
Folio A Folio B
Fixed in the fourth patch, with a helper check_range_has_page() to do
the check with large folios in mind.
This will also be a topic in the iomap migration, as our zero range
behavior is quite different from the iomap one, and the
filemap_range_has_page() behavior looks like overkill to me.
I'm pretty sure more hidden bugs will surface once I run the full
fstests suite against my local branch, but these are all the bugs I
have so far.
Qu Wenruo (4):
btrfs: subpage: fix a bug that blocks large folios
btrfs: refactor how we handle reserved space inside copy_one_range()
btrfs: prepare btrfs_buffered_write() for large data folios
btrfs: prepare btrfs_punch_hole_lock_range() for large data folios
fs/btrfs/delalloc-space.c | 25 ++++++
fs/btrfs/delalloc-space.h | 3 +-
fs/btrfs/file.c | 184 ++++++++++++++++++++++++++++----------
fs/btrfs/subpage.c | 2 +-
4 files changed, 165 insertions(+), 49 deletions(-)
--
2.49.0
* [PATCH 1/4] btrfs: subpage: fix a bug that blocks large folios
2025-03-27 22:31 [PATCH 0/4] btrfs: add the missing preparations exposed by initial large data folio support Qu Wenruo
@ 2025-03-27 22:31 ` Qu Wenruo
2025-03-28 17:36 ` Filipe Manana
2025-03-27 22:31 ` [PATCH 2/4] btrfs: refactor how we handle reserved space inside copy_one_range() Qu Wenruo
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2025-03-27 22:31 UTC (permalink / raw)
To: linux-btrfs
Inside the macro subpage_calc_start_bit(), we need to calculate the
offset to the beginning of the folio.
But we're using offset_in_page(); on systems with 4K page size and 4K fs
block size, this means we will always return offset 0 for a large folio,
causing all kinds of errors.
Fix it by using offset_in_folio() instead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/subpage.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 5b69c447fec9..5fbdd977121e 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -202,7 +202,7 @@ static void btrfs_subpage_assert(const struct btrfs_fs_info *fs_info,
btrfs_blocks_per_folio(fs_info, folio); \
\
btrfs_subpage_assert(fs_info, folio, start, len); \
- __start_bit = offset_in_page(start) >> fs_info->sectorsize_bits; \
+ __start_bit = offset_in_folio(folio, start) >> fs_info->sectorsize_bits; \
__start_bit += blocks_per_folio * btrfs_bitmap_nr_##name; \
__start_bit; \
})
--
2.49.0
* [PATCH 2/4] btrfs: refactor how we handle reserved space inside copy_one_range()
2025-03-27 22:31 [PATCH 0/4] btrfs: add the missing preparations exposed by initial large data folio support Qu Wenruo
2025-03-27 22:31 ` [PATCH 1/4] btrfs: subpage: fix a bug that blocks large folios Qu Wenruo
@ 2025-03-27 22:31 ` Qu Wenruo
2025-03-28 17:45 ` Filipe Manana
2025-03-27 22:31 ` [PATCH 3/4] btrfs: prepare btrfs_buffered_write() for large data folios Qu Wenruo
2025-03-27 22:31 ` [PATCH 4/4] btrfs: prepare btrfs_punch_hole_lock_range() " Qu Wenruo
3 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2025-03-27 22:31 UTC (permalink / raw)
To: linux-btrfs
There are several things that are not ideal inside copy_one_range():
- Unnecessary temporary variables
* block_offset
* reserve_bytes
* dirty_blocks
* num_blocks
* release_bytes
These are utilized to handle short-copy cases.
- Inconsistent handling of btrfs_delalloc_release_extents()
There is a hidden behavior that, after reserving metadata for X bytes
of data write, we have to call btrfs_delalloc_release_extents() with X
once and only once.
Calling btrfs_delalloc_release_extents(X - 4K) and
btrfs_delalloc_release_extents(4K) will cause outstanding extents
accounting to go wrong.
This is because the outstanding extents mechanism is not designed to
handle shrink of reserved space.
Improve the above situations by:
- Use a single @reserved_start and @reserved_len pair
Now we reserve space for the initial range, and if a short copy
happens and we need to shrink the reserved space, we can easily
calculate the new length and update @reserved_len.
- Introduce helpers to shrink reserved data and metadata space
This is done by two new helpers, shrink_reserved_space() and
btrfs_delalloc_shrink_extents().
The latter will do a better calculation on whether we need to modify
the outstanding extents, and the former will be utilized inside
copy_one_range().
- Manually unlock, release reserved space and return if no byte is
copied
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/delalloc-space.c | 25 +++++++++
fs/btrfs/delalloc-space.h | 3 +-
fs/btrfs/file.c | 104 ++++++++++++++++++++++----------------
3 files changed, 88 insertions(+), 44 deletions(-)
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 88e900e5a43d..916b62221dde 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -439,6 +439,31 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
btrfs_inode_rsv_release(inode, true);
}
+/* Shrink a previously reserved extent to a new length. */
+void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len,
+ u64 new_len)
+{
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ const u32 reserved_num_extents = count_max_extents(fs_info, reserved_len);
+ const u32 new_num_extents = count_max_extents(fs_info, new_len);
+ u32 diff_num_extents;
+
+ ASSERT(new_len <= reserved_len);
+ if (new_num_extents == reserved_num_extents)
+ return;
+
+ spin_lock(&inode->lock);
+ diff_num_extents = reserved_num_extents - new_num_extents;
+ btrfs_mod_outstanding_extents(inode, -diff_num_extents);
+ btrfs_calculate_inode_block_rsv_size(fs_info, inode);
+ spin_unlock(&inode->lock);
+
+ if (btrfs_is_testing(fs_info))
+ return;
+
+ btrfs_inode_rsv_release(inode, true);
+}
+
/*
* Reserve data and metadata space for delalloc
*
diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
index 3f32953c0a80..c61580c63caf 100644
--- a/fs/btrfs/delalloc-space.h
+++ b/fs/btrfs/delalloc-space.h
@@ -27,5 +27,6 @@ int btrfs_delalloc_reserve_space(struct btrfs_inode *inode,
int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
u64 disk_num_bytes, bool noflush);
void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
-
+void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len,
+ u64 new_len);
#endif /* BTRFS_DELALLOC_SPACE_H */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b72fc00bc2f6..63c7a3294eb2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1151,6 +1151,24 @@ static ssize_t reserve_space(struct btrfs_inode *inode,
return reserve_bytes;
}
+/* Shrink the reserved data and metadata space from @reserved_len to @new_len. */
+static void shrink_reserved_space(struct btrfs_inode *inode,
+ struct extent_changeset *data_reserved,
+ u64 reserved_start, u64 reserved_len,
+ u64 new_len, bool only_release_metadata)
+{
+ u64 diff = reserved_len - new_len;
+
+ ASSERT(new_len <= reserved_len);
+ btrfs_delalloc_shrink_extents(inode, reserved_len, new_len);
+ if (only_release_metadata)
+ btrfs_delalloc_release_metadata(inode, diff, true);
+ else
+ btrfs_delalloc_release_space(inode, data_reserved,
+ reserved_start + new_len,
+ diff, true);
+}
+
/*
* Do the heavy-lifting work to copy one range into one folio of the page cache.
*
@@ -1164,14 +1182,11 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_state *cached_state = NULL;
- const size_t block_offset = start & (fs_info->sectorsize - 1);
size_t write_bytes = min(iov_iter_count(i), PAGE_SIZE - offset_in_page(start));
- size_t reserve_bytes;
size_t copied;
- size_t dirty_blocks;
- size_t num_blocks;
+ const u64 reserved_start = round_down(start, fs_info->sectorsize);
+ u64 reserved_len;
struct folio *folio = NULL;
- u64 release_bytes;
int extents_locked;
u64 lockstart;
u64 lockend;
@@ -1190,23 +1205,25 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
&only_release_metadata);
if (ret < 0)
return ret;
- reserve_bytes = ret;
- release_bytes = reserve_bytes;
+ reserved_len = ret;
+ /* Write range must be inside the reserved range. */
+ ASSERT(reserved_start <= start);
+ ASSERT(start + write_bytes <= reserved_start + reserved_len);
again:
ret = balance_dirty_pages_ratelimited_flags(inode->vfs_inode.i_mapping,
bdp_flags);
if (ret) {
- btrfs_delalloc_release_extents(inode, reserve_bytes);
- release_space(inode, *data_reserved, start, release_bytes,
+ btrfs_delalloc_release_extents(inode, reserved_len);
+ release_space(inode, *data_reserved, reserved_start, reserved_len,
only_release_metadata);
return ret;
}
ret = prepare_one_folio(&inode->vfs_inode, &folio, start, write_bytes, false);
if (ret) {
- btrfs_delalloc_release_extents(inode, reserve_bytes);
- release_space(inode, *data_reserved, start, release_bytes,
+ btrfs_delalloc_release_extents(inode, reserved_len);
+ release_space(inode, *data_reserved, reserved_start, reserved_len,
only_release_metadata);
return ret;
}
@@ -1217,8 +1234,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
if (!nowait && extents_locked == -EAGAIN)
goto again;
- btrfs_delalloc_release_extents(inode, reserve_bytes);
- release_space(inode, *data_reserved, start, release_bytes,
+ btrfs_delalloc_release_extents(inode, reserved_len);
+ release_space(inode, *data_reserved, reserved_start, reserved_len,
only_release_metadata);
ret = extents_locked;
return ret;
@@ -1228,42 +1245,43 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
offset_in_folio(folio, start), write_bytes, i);
flush_dcache_folio(folio);
- /*
- * If we get a partial write, we can end up with partially uptodate
- * page. Although if sector size < page size we can handle it, but if
- * it's not sector aligned it can cause a lot of complexity, so make
- * sure they don't happen by forcing retry this copy.
- */
if (unlikely(copied < write_bytes)) {
+ u64 last_block;
+
+ /*
+ * The original write range doesn't need an uptodate folio as
+ * the range is block aligned. But now a short copy happened.
+ * We can not handle it without an uptodate folio.
+ *
+ * So just revert the range and we will retry.
+ */
if (!folio_test_uptodate(folio)) {
iov_iter_revert(i, copied);
copied = 0;
}
- }
- num_blocks = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
- dirty_blocks = round_up(copied + block_offset, fs_info->sectorsize);
- dirty_blocks = BTRFS_BYTES_TO_BLKS(fs_info, dirty_blocks);
-
- if (copied == 0)
- dirty_blocks = 0;
-
- if (num_blocks > dirty_blocks) {
- /* Release everything except the sectors we dirtied. */
- release_bytes -= dirty_blocks << fs_info->sectorsize_bits;
- if (only_release_metadata) {
- btrfs_delalloc_release_metadata(inode,
- release_bytes, true);
- } else {
- const u64 release_start = round_up(start + copied,
- fs_info->sectorsize);
-
- btrfs_delalloc_release_space(inode,
- *data_reserved, release_start,
- release_bytes, true);
+ /* No copied byte, unlock, release reserved space and exit. */
+ if (copied == 0) {
+ if (extents_locked)
+ unlock_extent(&inode->io_tree, lockstart, lockend,
+ &cached_state);
+ else
+ free_extent_state(cached_state);
+ btrfs_delalloc_release_extents(inode, reserved_len);
+ release_space(inode, *data_reserved, reserved_start, reserved_len,
+ only_release_metadata);
+ btrfs_drop_folio(fs_info, folio, start, copied);
+ return 0;
}
+
+ /* Release the reserved space beyond the last block. */
+ last_block = round_up(start + copied, fs_info->sectorsize);
+
+ shrink_reserved_space(inode, *data_reserved, reserved_start,
+ reserved_len, last_block - reserved_start,
+ only_release_metadata);
+ reserved_len = last_block - reserved_start;
}
- release_bytes = round_up(copied + block_offset, fs_info->sectorsize);
ret = btrfs_dirty_folio(inode, folio, start, copied,
&cached_state, only_release_metadata);
@@ -1280,10 +1298,10 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
else
free_extent_state(cached_state);
- btrfs_delalloc_release_extents(inode, reserve_bytes);
+ btrfs_delalloc_release_extents(inode, reserved_len);
if (ret) {
btrfs_drop_folio(fs_info, folio, start, copied);
- release_space(inode, *data_reserved, start, release_bytes,
+ release_space(inode, *data_reserved, reserved_start, reserved_len,
only_release_metadata);
return ret;
}
--
2.49.0
* [PATCH 3/4] btrfs: prepare btrfs_buffered_write() for large data folios
2025-03-27 22:31 [PATCH 0/4] btrfs: add the missing preparations exposed by initial large data folio support Qu Wenruo
2025-03-27 22:31 ` [PATCH 1/4] btrfs: subpage: fix a bug that blocks large folios Qu Wenruo
2025-03-27 22:31 ` [PATCH 2/4] btrfs: refactor how we handle reserved space inside copy_one_range() Qu Wenruo
@ 2025-03-27 22:31 ` Qu Wenruo
2025-03-28 17:51 ` Filipe Manana
2025-03-27 22:31 ` [PATCH 4/4] btrfs: prepare btrfs_punch_hole_lock_range() " Qu Wenruo
3 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2025-03-27 22:31 UTC (permalink / raw)
To: linux-btrfs
This involves the following modifications:
- Set the order flags for __filemap_get_folio() inside
prepare_one_folio()
This will allow __filemap_get_folio() to create a large folio if the
address space supports it.
- Limit the initial @write_bytes inside copy_one_range()
If the largest folio boundary splits the initial write range, there is
no way we can write beyond the largest folio boundary.
This is done by a simple helper function, calc_write_bytes().
- Release the excess reserved space if the folio is smaller than expected
This does the same handling as when a short copy happens.
All these preparations should not change the behavior when the largest
folio order is 0.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/file.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 63c7a3294eb2..5d10ae321687 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -861,7 +861,8 @@ static noinline int prepare_one_folio(struct inode *inode, struct folio **folio_
{
unsigned long index = pos >> PAGE_SHIFT;
gfp_t mask = get_prepare_gfp_flags(inode, nowait);
- fgf_t fgp_flags = (nowait ? FGP_WRITEBEGIN | FGP_NOWAIT : FGP_WRITEBEGIN);
+ fgf_t fgp_flags = (nowait ? FGP_WRITEBEGIN | FGP_NOWAIT : FGP_WRITEBEGIN) |
+ fgf_set_order(write_bytes);
struct folio *folio;
int ret = 0;
@@ -1169,6 +1170,16 @@ static void shrink_reserved_space(struct btrfs_inode *inode,
diff, true);
}
+/* Calculate the maximum amount of bytes we can write into one folio. */
+static size_t calc_write_bytes(const struct btrfs_inode *inode,
+ const struct iov_iter *i, u64 start)
+{
+ size_t max_folio_size = mapping_max_folio_size(inode->vfs_inode.i_mapping);
+
+ return min(max_folio_size - (start & (max_folio_size - 1)),
+ iov_iter_count(i));
+}
+
/*
* Do the heavy-lifting work to copy one range into one folio of the page cache.
*
@@ -1182,7 +1193,7 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_state *cached_state = NULL;
- size_t write_bytes = min(iov_iter_count(i), PAGE_SIZE - offset_in_page(start));
+ size_t write_bytes = calc_write_bytes(inode, i, start);
size_t copied;
const u64 reserved_start = round_down(start, fs_info->sectorsize);
u64 reserved_len;
@@ -1227,6 +1238,20 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
only_release_metadata);
return ret;
}
+ /*
+ * The reserved range goes beyond the current folio, shrink the reserved
+ * space to the folio boundary.
+ */
+ if (reserved_start + reserved_len > folio_pos(folio) + folio_size(folio)) {
+ const u64 last_block = folio_pos(folio) + folio_size(folio);
+
+ shrink_reserved_space(inode, *data_reserved, reserved_start,
+ reserved_len, last_block - reserved_start,
+ only_release_metadata);
+ write_bytes = folio_pos(folio) + folio_size(folio) - start;
+ reserved_len = last_block - reserved_start;
+ }
+
extents_locked = lock_and_cleanup_extent_if_need(inode,
folio, start, write_bytes, &lockstart,
&lockend, nowait, &cached_state);
--
2.49.0
* [PATCH 4/4] btrfs: prepare btrfs_punch_hole_lock_range() for large data folios
2025-03-27 22:31 [PATCH 0/4] btrfs: add the missing preparations exposed by initial large data folio support Qu Wenruo
` (2 preceding siblings ...)
2025-03-27 22:31 ` [PATCH 3/4] btrfs: prepare btrfs_buffered_write() for large data folios Qu Wenruo
@ 2025-03-27 22:31 ` Qu Wenruo
2025-03-28 17:57 ` Filipe Manana
3 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2025-03-27 22:31 UTC (permalink / raw)
To: linux-btrfs
The function btrfs_punch_hole_lock_range() needs to make sure there is
no other folio in the range, thus it uses filemap_range_has_page(),
which works fine for regular folios.
But with large folios, in the following case filemap_range_has_page()
will always return true, forcing btrfs_punch_hole_lock_range() into a
very time-consuming busy loop:
start end
| |
|//|//|//|//| | | | | | | | |//|//|
\ / \ /
Folio A Folio B
In the above case, folios A and B contain our start and end indexes,
and there are no other folios in the range, so we do not need to retry
inside btrfs_punch_hole_lock_range().
To prepare for large data folios, introduce a helper,
check_range_has_page(), which will:
- Grab all the folios inside the range
- Skip any large folio that covers the start or end index
- If any other folio is found, return true
- Otherwise return false
This new helper is going to handle both large folios and regular ones.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/file.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 49 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5d10ae321687..417c90ffc6fa 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2157,6 +2157,54 @@ static int find_first_non_hole(struct btrfs_inode *inode, u64 *start, u64 *len)
return ret;
}
+/*
+ * The helper to check if there is no folio in the range.
+ *
+ * We can not utilize filemap_range_has_page() in a filemap with large folios
+ * as we can hit the following false positive:
+ *
+ * start end
+ * | |
+ * |//|//|//|//| | | | | | | | |//|//|
+ * \ / \ /
+ * Folio A Folio B
+ *
+ * That large folio A and B covers the start and end index.
+ * In that case filemap_range_has_page() will always return true, but the above
+ * case is fine for btrfs_punch_hole_lock_range() usage.
+ *
+ * So here we only ensure that no other folio is in the range, excluding the
+ * head/tail large folio.
+ */
+static bool check_range_has_page(struct inode *inode, u64 start, u64 end)
+{
+ struct folio_batch fbatch;
+ bool ret = false;
+ const pgoff_t start_index = start >> PAGE_SHIFT;
+ const pgoff_t end_index = end >> PAGE_SHIFT;
+ pgoff_t tmp = start_index;
+ int found_folios;
+
+ folio_batch_init(&fbatch);
+ found_folios = filemap_get_folios(inode->i_mapping, &tmp, end_index,
+ &fbatch);
+ for (int i = 0; i < found_folios; i++) {
+ struct folio *folio = fbatch.folios[i];
+
+ /* A large folio begins before the start. Not a target. */
+ if (folio->index < start_index)
+ continue;
+ /* A large folio extends beyond the end. Not a target. */
+ if (folio->index + folio_nr_pages(folio) > end_index)
+ continue;
+ /* A folio doesn't cover the head/tail index. Found a target. */
+ ret = true;
+ break;
+ }
+ folio_batch_release(&fbatch);
+ return ret;
+}
+
static void btrfs_punch_hole_lock_range(struct inode *inode,
const u64 lockstart,
const u64 lockend,
@@ -2188,8 +2236,7 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
* locking the range check if we have pages in the range, and if
* we do, unlock the range and retry.
*/
- if (!filemap_range_has_page(inode->i_mapping, page_lockstart,
- page_lockend))
+ if (!check_range_has_page(inode, page_lockstart, page_lockend))
break;
unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
--
2.49.0
* Re: [PATCH 1/4] btrfs: subpage: fix a bug that blocks large folios
2025-03-27 22:31 ` [PATCH 1/4] btrfs: subpage: fix a bug that blocks large folios Qu Wenruo
@ 2025-03-28 17:36 ` Filipe Manana
0 siblings, 0 replies; 10+ messages in thread
From: Filipe Manana @ 2025-03-28 17:36 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Mar 27, 2025 at 10:33 PM Qu Wenruo <wqu@suse.com> wrote:
>
> Inside the macro, subpage_calc_start_bit(), we needs to calculate the
> offset to the beginning of the folio.
>
> But we're using offset_in_page(), on systems with 4K page size and 4K fs
> block size, this means we will always return offset 0 for a large folio,
> causing all kinds of errors.
>
> Fix it by using offset_in_folio() instead.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Looks good, thanks.
> ---
> fs/btrfs/subpage.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 5b69c447fec9..5fbdd977121e 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -202,7 +202,7 @@ static void btrfs_subpage_assert(const struct btrfs_fs_info *fs_info,
> btrfs_blocks_per_folio(fs_info, folio); \
> \
> btrfs_subpage_assert(fs_info, folio, start, len); \
> - __start_bit = offset_in_page(start) >> fs_info->sectorsize_bits; \
> + __start_bit = offset_in_folio(folio, start) >> fs_info->sectorsize_bits; \
> __start_bit += blocks_per_folio * btrfs_bitmap_nr_##name; \
> __start_bit; \
> })
> --
> 2.49.0
>
>
* Re: [PATCH 2/4] btrfs: refactor how we handle reserved space inside copy_one_range()
2025-03-27 22:31 ` [PATCH 2/4] btrfs: refactor how we handle reserved space inside copy_one_range() Qu Wenruo
@ 2025-03-28 17:45 ` Filipe Manana
0 siblings, 0 replies; 10+ messages in thread
From: Filipe Manana @ 2025-03-28 17:45 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Mar 27, 2025 at 10:36 PM Qu Wenruo <wqu@suse.com> wrote:
>
> There are several things not ideal inside copy_one_range():
>
> - Unnecessary temporary variables
> * block_offset
> * reserve_bytes
> * dirty_blocks
> * num_blocks
> * release_bytes
> These are utilized to handle short-copy cases.
>
> - Inconsistent handling of btrfs_delalloc_release_extents()
> There is a hidden behavior that, after reserving metadata for X bytes
> of data write, we have to call btrfs_delalloc_release_extents() with X
> once and only once.
>
> Calling btrfs_delalloc_release_extents(X - 4K) and
> btrfs_delalloc_release_extents(4K) will cause outstanding extents
> accounting to go wrong.
>
> This is because the outstanding extents mechanism is not designed to
> handle shrink of reserved space.
>
> Improve above situations by:
>
> - Use a single @reserved_start and @reserved_len pair
> Now we reserved space for the initial range, and if a short copy
> happened and we need to shrink the reserved space, we can easily
> calculate the new length, and update @reserved_len.
>
> - Introduce helpers to shrink reserved data and metadata space
> This is done by two new helper, shrink_reserved_space() and
> btrfs_delalloc_shrink_extents().
>
> The later will do a better calculation on if we need to modify the
> outstanding extents, and the first one will be utlized inside
> copy_one_range().
>
> - Manually unlock, release reserved space and return if no byte is
> copied
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/delalloc-space.c | 25 +++++++++
> fs/btrfs/delalloc-space.h | 3 +-
> fs/btrfs/file.c | 104 ++++++++++++++++++++++----------------
> 3 files changed, 88 insertions(+), 44 deletions(-)
>
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index 88e900e5a43d..916b62221dde 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -439,6 +439,31 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
> btrfs_inode_rsv_release(inode, true);
> }
>
> +/* Shrink a previously reserved extent to a new length. */
> +void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len,
> + u64 new_len)
> +{
> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> + const u32 reserved_num_extents = count_max_extents(fs_info, reserved_len);
> + const u32 new_num_extents = count_max_extents(fs_info, new_len);
> + u32 diff_num_extents;
> +
> + ASSERT(new_len <= reserved_len);
> + if (new_num_extents == reserved_num_extents)
> + return;
> +
> + spin_lock(&inode->lock);
> + diff_num_extents = reserved_num_extents - new_num_extents;
This doesn't need to be done while holding the lock, and in fact
should be done outside to reduce time spent in a critical section.
It's a simple expression that can be done when declaring the variable
and also make it const.
> + btrfs_mod_outstanding_extents(inode, -diff_num_extents);
We can also make the sign change when declaring the variable, something like:
const int diff_num_extents = -((int)(reserved_num_extents - new_num_extents));
The way it is, turning an unsigned into a negative is also a bit odd
to read; I think it makes it clearer if we have a signed type.
> + btrfs_calculate_inode_block_rsv_size(fs_info, inode);
> + spin_unlock(&inode->lock);
> +
> + if (btrfs_is_testing(fs_info))
> + return;
> +
> + btrfs_inode_rsv_release(inode, true);
> +}
> +
> /*
> * Reserve data and metadata space for delalloc
> *
> diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
> index 3f32953c0a80..c61580c63caf 100644
> --- a/fs/btrfs/delalloc-space.h
> +++ b/fs/btrfs/delalloc-space.h
> @@ -27,5 +27,6 @@ int btrfs_delalloc_reserve_space(struct btrfs_inode *inode,
> int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> u64 disk_num_bytes, bool noflush);
> void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
> -
> +void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len,
> + u64 new_len);
> #endif /* BTRFS_DELALLOC_SPACE_H */
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b72fc00bc2f6..63c7a3294eb2 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1151,6 +1151,24 @@ static ssize_t reserve_space(struct btrfs_inode *inode,
> return reserve_bytes;
> }
>
> +/* Shrink the reserved data and metadata space from @reserved_len to @new_len. */
> +static void shrink_reserved_space(struct btrfs_inode *inode,
> + struct extent_changeset *data_reserved,
> + u64 reserved_start, u64 reserved_len,
> + u64 new_len, bool only_release_metadata)
> +{
> + u64 diff = reserved_len - new_len;
Can be made const.
> +
> + ASSERT(new_len <= reserved_len);
> + btrfs_delalloc_shrink_extents(inode, reserved_len, new_len);
> + if (only_release_metadata)
> + btrfs_delalloc_release_metadata(inode, diff, true);
> + else
> + btrfs_delalloc_release_space(inode, data_reserved,
> + reserved_start + new_len,
> + diff, true);
This could fit in 2 lines instead of 3.
Anyway those are minor things, so:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
> +}
> +
> /*
> * Do the heavy-lifting work to copy one range into one folio of the page cache.
> *
> @@ -1164,14 +1182,11 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> struct extent_state *cached_state = NULL;
> - const size_t block_offset = start & (fs_info->sectorsize - 1);
> size_t write_bytes = min(iov_iter_count(i), PAGE_SIZE - offset_in_page(start));
> - size_t reserve_bytes;
> size_t copied;
> - size_t dirty_blocks;
> - size_t num_blocks;
> + const u64 reserved_start = round_down(start, fs_info->sectorsize);
> + u64 reserved_len;
> struct folio *folio = NULL;
> - u64 release_bytes;
> int extents_locked;
> u64 lockstart;
> u64 lockend;
> @@ -1190,23 +1205,25 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
> &only_release_metadata);
> if (ret < 0)
> return ret;
> - reserve_bytes = ret;
> - release_bytes = reserve_bytes;
> + reserved_len = ret;
> + /* Write range must be inside the reserved range. */
> + ASSERT(reserved_start <= start);
> + ASSERT(start + write_bytes <= reserved_start + reserved_len);
>
> again:
> ret = balance_dirty_pages_ratelimited_flags(inode->vfs_inode.i_mapping,
> bdp_flags);
> if (ret) {
> - btrfs_delalloc_release_extents(inode, reserve_bytes);
> - release_space(inode, *data_reserved, start, release_bytes,
> + btrfs_delalloc_release_extents(inode, reserved_len);
> + release_space(inode, *data_reserved, reserved_start, reserved_len,
> only_release_metadata);
> return ret;
> }
>
> ret = prepare_one_folio(&inode->vfs_inode, &folio, start, write_bytes, false);
> if (ret) {
> - btrfs_delalloc_release_extents(inode, reserve_bytes);
> - release_space(inode, *data_reserved, start, release_bytes,
> + btrfs_delalloc_release_extents(inode, reserved_len);
> + release_space(inode, *data_reserved, reserved_start, reserved_len,
> only_release_metadata);
> return ret;
> }
> @@ -1217,8 +1234,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
> if (!nowait && extents_locked == -EAGAIN)
> goto again;
>
> - btrfs_delalloc_release_extents(inode, reserve_bytes);
> - release_space(inode, *data_reserved, start, release_bytes,
> + btrfs_delalloc_release_extents(inode, reserved_len);
> + release_space(inode, *data_reserved, reserved_start, reserved_len,
> only_release_metadata);
> ret = extents_locked;
> return ret;
> @@ -1228,42 +1245,43 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
> offset_in_folio(folio, start), write_bytes, i);
> flush_dcache_folio(folio);
>
> - /*
> - * If we get a partial write, we can end up with partially uptodate
> - * page. Although if sector size < page size we can handle it, but if
> - * it's not sector aligned it can cause a lot of complexity, so make
> - * sure they don't happen by forcing retry this copy.
> - */
> if (unlikely(copied < write_bytes)) {
> + u64 last_block;
> +
> + /*
> + * The original write range doesn't need an uptodate folio as
> + * the range is block aligned. But now a short copy happened.
> + * We can not handle it without an uptodate folio.
> + *
> + * So just revert the range and we will retry.
> + */
> if (!folio_test_uptodate(folio)) {
> iov_iter_revert(i, copied);
> copied = 0;
> }
> - }
>
> - num_blocks = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
> - dirty_blocks = round_up(copied + block_offset, fs_info->sectorsize);
> - dirty_blocks = BTRFS_BYTES_TO_BLKS(fs_info, dirty_blocks);
> -
> - if (copied == 0)
> - dirty_blocks = 0;
> -
> - if (num_blocks > dirty_blocks) {
> - /* Release everything except the sectors we dirtied. */
> - release_bytes -= dirty_blocks << fs_info->sectorsize_bits;
> - if (only_release_metadata) {
> - btrfs_delalloc_release_metadata(inode,
> - release_bytes, true);
> - } else {
> - const u64 release_start = round_up(start + copied,
> - fs_info->sectorsize);
> -
> - btrfs_delalloc_release_space(inode,
> - *data_reserved, release_start,
> - release_bytes, true);
> + /* No copied byte, unlock, release reserved space and exit. */
> + if (copied == 0) {
> + if (extents_locked)
> + unlock_extent(&inode->io_tree, lockstart, lockend,
> + &cached_state);
> + else
> + free_extent_state(cached_state);
> + btrfs_delalloc_release_extents(inode, reserved_len);
> + release_space(inode, *data_reserved, reserved_start, reserved_len,
> + only_release_metadata);
> + btrfs_drop_folio(fs_info, folio, start, copied);
> + return 0;
> }
> +
> + /* Release the reserved space beyond the last block. */
> + last_block = round_up(start + copied, fs_info->sectorsize);
> +
> + shrink_reserved_space(inode, *data_reserved, reserved_start,
> + reserved_len, last_block - reserved_start,
> + only_release_metadata);
> + reserved_len = last_block - reserved_start;
> }
> - release_bytes = round_up(copied + block_offset, fs_info->sectorsize);
>
> ret = btrfs_dirty_folio(inode, folio, start, copied,
> &cached_state, only_release_metadata);
> @@ -1280,10 +1298,10 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
> else
> free_extent_state(cached_state);
>
> - btrfs_delalloc_release_extents(inode, reserve_bytes);
> + btrfs_delalloc_release_extents(inode, reserved_len);
> if (ret) {
> btrfs_drop_folio(fs_info, folio, start, copied);
> - release_space(inode, *data_reserved, start, release_bytes,
> + release_space(inode, *data_reserved, reserved_start, reserved_len,
> only_release_metadata);
> return ret;
> }
> --
> 2.49.0
>
>
* Re: [PATCH 3/4] btrfs: prepare btrfs_buffered_write() for large data folios
2025-03-27 22:31 ` [PATCH 3/4] btrfs: prepare btrfs_buffered_write() for large data folios Qu Wenruo
@ 2025-03-28 17:51 ` Filipe Manana
0 siblings, 0 replies; 10+ messages in thread
From: Filipe Manana @ 2025-03-28 17:51 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Mar 27, 2025 at 10:37 PM Qu Wenruo <wqu@suse.com> wrote:
>
> This involves the following modifications:
>
> - Set the order flags for __filemap_get_folio() inside
> prepare_one_folio()
>
> This will allow __filemap_get_folio() to create a large folio if the
> address space supports it.
>
> - Limit the initial @write_bytes inside copy_one_range()
> If the largest folio boundary splits the initial write range, there is
> no way we can write beyond the largest folio boundary.
>
> This is done by a simple helper function, calc_write_bytes().
>
> - Release exceeding reserved space if the folio is smaller than expected
> Which is doing the same handling when short copy happened.
>
> All these preparation should not change the behavior when the largest
preparation -> preparations
> folio order is 0.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/file.c | 29 +++++++++++++++++++++++++++--
> 1 file changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 63c7a3294eb2..5d10ae321687 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -861,7 +861,8 @@ static noinline int prepare_one_folio(struct inode *inode, struct folio **folio_
> {
> unsigned long index = pos >> PAGE_SHIFT;
> gfp_t mask = get_prepare_gfp_flags(inode, nowait);
> - fgf_t fgp_flags = (nowait ? FGP_WRITEBEGIN | FGP_NOWAIT : FGP_WRITEBEGIN);
> + fgf_t fgp_flags = (nowait ? FGP_WRITEBEGIN | FGP_NOWAIT : FGP_WRITEBEGIN) |
> + fgf_set_order(write_bytes);
> struct folio *folio;
> int ret = 0;
>
> @@ -1169,6 +1170,16 @@ static void shrink_reserved_space(struct btrfs_inode *inode,
> diff, true);
> }
>
> +/* Calculate the maximum amount of bytes we can write into one folio. */
> +static size_t calc_write_bytes(const struct btrfs_inode *inode,
> + const struct iov_iter *i, u64 start)
This was mentioned by David in a previous review of some other patch,
but please don't name it 'i', we only use such a name for loop index
variables.
Name it something like iter, or iov.
> +{
> + size_t max_folio_size = mapping_max_folio_size(inode->vfs_inode.i_mapping);
Can be const.
Anyway, those are minor things, so:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
> +
> + return min(max_folio_size - (start & (max_folio_size - 1)),
> + iov_iter_count(i));
> +}
> +
> /*
> * Do the heavy-lifting work to copy one range into one folio of the page cache.
> *
> @@ -1182,7 +1193,7 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> struct extent_state *cached_state = NULL;
> - size_t write_bytes = min(iov_iter_count(i), PAGE_SIZE - offset_in_page(start));
> + size_t write_bytes = calc_write_bytes(inode, i, start);
> size_t copied;
> const u64 reserved_start = round_down(start, fs_info->sectorsize);
> u64 reserved_len;
> @@ -1227,6 +1238,20 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
> only_release_metadata);
> return ret;
> }
> + /*
> + * The reserved range goes beyond the current folio, shrink the reserved
> + * space to the folio boundary.
> + */
> + if (reserved_start + reserved_len > folio_pos(folio) + folio_size(folio)) {
> + const u64 last_block = folio_pos(folio) + folio_size(folio);
> +
> + shrink_reserved_space(inode, *data_reserved, reserved_start,
> + reserved_len, last_block - reserved_start,
> + only_release_metadata);
> + write_bytes = folio_pos(folio) + folio_size(folio) - start;
> + reserved_len = last_block - reserved_start;
> + }
> +
> extents_locked = lock_and_cleanup_extent_if_need(inode,
> folio, start, write_bytes, &lockstart,
> &lockend, nowait, &cached_state);
> --
> 2.49.0
>
>
* Re: [PATCH 4/4] btrfs: prepare btrfs_punch_hole_lock_range() for large data folios
2025-03-27 22:31 ` [PATCH 4/4] btrfs: prepare btrfs_punch_hole_lock_range() " Qu Wenruo
@ 2025-03-28 17:57 ` Filipe Manana
2025-03-29 3:18 ` Qu Wenruo
0 siblings, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2025-03-28 17:57 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Mar 27, 2025 at 10:33 PM Qu Wenruo <wqu@suse.com> wrote:
>
> The function btrfs_punch_hole_lock_range() needs to make sure there is
> no other folio in the range, thus it goes with filemap_range_has_page(),
> which works pretty fine.
>
> But if we have large folios, under the following case
> filemap_range_has_page() will always return true, forcing
> btrfs_punch_hole_lock_range() to do a very time consuming busy loop:
>
> start end
> | |
> |//|//|//|//| | | | | | | | |//|//|
> \ / \ /
> Folio A Folio B
>
> In above case, folio A and B contains our start/end index, and there is
contains -> contain
index -> indexes
there is -> there are
> no other folios in the range.
> Thus there is no other folios and we do not need to retry inside
is -> are.
"Thus there is no other folios" is repeated from the previous
sentence, so it can be just:
Thus we do not need to retry inside...
> btrfs_punch_hole_lock_range().
>
> To prepare for large data folios, introduce a helper,
> check_range_has_page(), which will:
>
> - Grab all the folios inside the range
>
> - Skip any large folios that covers the start and end index
covers -> cover
index -> indexes
>
> - If any other folios is found return true
is found -> are found
>
> - Otherwise return false
>
> This new helper is going to handle both large folios and regular ones.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/file.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 5d10ae321687..417c90ffc6fa 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2157,6 +2157,54 @@ static int find_first_non_hole(struct btrfs_inode *inode, u64 *start, u64 *len)
> return ret;
> }
>
> +/*
> + * The helper to check if there is no folio in the range.
> + *
> + * We can not utilized filemap_range_has_page() in a filemap with large folios
> + * as we can hit the following false postive:
postive -> positive
> + *
> + * start end
> + * | |
> + * |//|//|//|//| | | | | | | | |//|//|
> + * \ / \ /
> + * Folio A Folio B
> + *
> + * That large folio A and B covers the start and end index.
covers -> cover
index -> indexes
Anyway, those are minor things, so:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
> + * In that case filemap_range_has_page() will always return true, but the above
> + * case is fine for btrfs_punch_hole_lock_range() usage.
> + *
> + * So here we only ensure that no other folio is in the range, excluding the
> + * head/tail large folio.
> + */
> +static bool check_range_has_page(struct inode *inode, u64 start, u64 end)
> +{
> + struct folio_batch fbatch;
> + bool ret = false;
> + const pgoff_t start_index = start >> PAGE_SHIFT;
> + const pgoff_t end_index = end >> PAGE_SHIFT;
> + pgoff_t tmp = start_index;
> + int found_folios;
> +
> + folio_batch_init(&fbatch);
> + found_folios = filemap_get_folios(inode->i_mapping, &tmp, end_index,
> + &fbatch);
> + for (int i = 0; i < found_folios; i++) {
> + struct folio *folio = fbatch.folios[i];
> +
> + /* A large folio begins before the start. Not a target. */
> + if (folio->index < start_index)
> + continue;
> + /* A large folio extends beyond the end. Not a target. */
> + if (folio->index + folio_nr_pages(folio) > end_index)
> + continue;
> + /* A folio doesn't cover the head/tail index. Found a target. */
> + ret = true;
> + break;
> + }
> + folio_batch_release(&fbatch);
> + return ret;
> +}
> +
> static void btrfs_punch_hole_lock_range(struct inode *inode,
> const u64 lockstart,
> const u64 lockend,
> @@ -2188,8 +2236,7 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
> * locking the range check if we have pages in the range, and if
> * we do, unlock the range and retry.
> */
> - if (!filemap_range_has_page(inode->i_mapping, page_lockstart,
> - page_lockend))
> + if (!check_range_has_page(inode, page_lockstart, page_lockend))
> break;
>
> unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> --
> 2.49.0
>
>
* Re: [PATCH 4/4] btrfs: prepare btrfs_punch_hole_lock_range() for large data folios
2025-03-28 17:57 ` Filipe Manana
@ 2025-03-29 3:18 ` Qu Wenruo
0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-03-29 3:18 UTC (permalink / raw)
To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs
On 2025/3/29 04:27, Filipe Manana wrote:
> On Thu, Mar 27, 2025 at 10:33 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>> The function btrfs_punch_hole_lock_range() needs to make sure there is
>> no other folio in the range, thus it goes with filemap_range_has_page(),
>> which works pretty fine.
>>
>> But if we have large folios, under the following case
>> filemap_range_has_page() will always return true, forcing
>> btrfs_punch_hole_lock_range() to do a very time consuming busy loop:
>>
>> start end
>> | |
>> |//|//|//|//| | | | | | | | |//|//|
>> \ / \ /
>> Folio A Folio B
>>
>> In above case, folio A and B contains our start/end index, and there is
>
> contains -> contain
> index -> indexes
> there is -> there are
>
>> no other folios in the range.
>> Thus there is no other folios and we do not need to retry inside
>
> is -> are.
>
> "Thus there is no other folios" is repeated from the previous
> sentence, so it can be just:
>
> Thus we do not need to retry inside...
>
>> btrfs_punch_hole_lock_range().
>>
>> To prepare for large data folios, introduce a helper,
>> check_range_has_page(), which will:
>>
>> - Grab all the folios inside the range
>>
>> - Skip any large folios that covers the start and end index
>
> covers -> cover
> index -> indexes
>
>>
>> - If any other folios is found return true
>
> is found -> are found
>
>>
>> - Otherwise return false
>>
>> This new helper is going to handle both large folios and regular ones.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/file.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 49 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 5d10ae321687..417c90ffc6fa 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -2157,6 +2157,54 @@ static int find_first_non_hole(struct btrfs_inode *inode, u64 *start, u64 *len)
>> return ret;
>> }
>>
>> +/*
>> + * The helper to check if there is no folio in the range.
>> + *
>> + * We can not utilized filemap_range_has_page() in a filemap with large folios
>> + * as we can hit the following false postive:
>
> postive -> positive
>
>> + *
>> + * start end
>> + * | |
>> + * |//|//|//|//| | | | | | | | |//|//|
>> + * \ / \ /
>> + * Folio A Folio B
>> + *
>> + * That large folio A and B covers the start and end index.
>
> covers -> cover
> index -> indexes
>
> Anyway, those are minor things, so:
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks a lot for the review, although there are some more fixes needed in
this patch.
Thus I'll resend with all your comments addressed, but without the
Reviewed-by tag, in case you find something else wrong in the fix.
>
> Thanks.
>
>
>
>> + * In that case filemap_range_has_page() will always return true, but the above
>> + * case is fine for btrfs_punch_hole_lock_range() usage.
>> + *
>> + * So here we only ensure that no other folio is in the range, excluding the
>> + * head/tail large folio.
>> + */
>> +static bool check_range_has_page(struct inode *inode, u64 start, u64 end)
>> +{
Fsstress exposed rare cases where btrfs_punch_hole_lock_range() can be
called with a range inside the same folio:
btrfs_punch_hole_lock_range: r/i=5/2745 start=4096(65536)
end=20479(18446744073709551615) enter
And since both @lockstart and @lockend are inside the same folio (folio 0),
the page_lockend calculation will be -1, causing us to check all the way
to the end of the inode.
But then we can hit a case where someone else is still holding the folio at 64K:
check_range_has_page: r/i=5/2745 start=65536(1)
end=18446744073709551615(281474976710655) enter
check_range_has_page: found folio=65536(1) size=65536(1)
Then we will hit a dead loop waiting for the folio at 64K, even though our
zero range doesn't need that folio at all.
The original filemap_range_has_page() has a basic check to skip such a
case completely, which is missing in the new helper.
Will get this fixed and resend.
Thanks,
Qu
>> + struct folio_batch fbatch;
>> + bool ret = false;
>> + const pgoff_t start_index = start >> PAGE_SHIFT;
>> + const pgoff_t end_index = end >> PAGE_SHIFT;
>> + pgoff_t tmp = start_index;
>> + int found_folios;
>> +
>> + folio_batch_init(&fbatch);
>> + found_folios = filemap_get_folios(inode->i_mapping, &tmp, end_index,
>> + &fbatch);
>> + for (int i = 0; i < found_folios; i++) {
>> + struct folio *folio = fbatch.folios[i];
>> +
>> + /* A large folio begins before the start. Not a target. */
>> + if (folio->index < start_index)
>> + continue;
>> + /* A large folio extends beyond the end. Not a target. */
>> + if (folio->index + folio_nr_pages(folio) > end_index)
>> + continue;
>> + /* A folio doesn't cover the head/tail index. Found a target. */
>> + ret = true;
>> + break;
>> + }
>> + folio_batch_release(&fbatch);
>> + return ret;
>> +}
>> +
>> static void btrfs_punch_hole_lock_range(struct inode *inode,
>> const u64 lockstart,
>> const u64 lockend,
>> @@ -2188,8 +2236,7 @@ static void btrfs_punch_hole_lock_range(struct inode *inode,
>> * locking the range check if we have pages in the range, and if
>> * we do, unlock the range and retry.
>> */
>> - if (!filemap_range_has_page(inode->i_mapping, page_lockstart,
>> - page_lockend))
>> + if (!check_range_has_page(inode, page_lockstart, page_lockend))
>> break;
>>
>> unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
>> --
>> 2.49.0
>>
>>
>
Thread overview: 10+ messages (newest: 2025-03-29 3:18 UTC)
2025-03-27 22:31 [PATCH 0/4] btrfs: add the missing preparations exposed by initial large data folio support Qu Wenruo
2025-03-27 22:31 ` [PATCH 1/4] btrfs: subpage: fix a bug that blocks large folios Qu Wenruo
2025-03-28 17:36 ` Filipe Manana
2025-03-27 22:31 ` [PATCH 2/4] btrfs: refactor how we handle reserved space inside copy_one_range() Qu Wenruo
2025-03-28 17:45 ` Filipe Manana
2025-03-27 22:31 ` [PATCH 3/4] btrfs: prepare btrfs_buffered_write() for large data folios Qu Wenruo
2025-03-28 17:51 ` Filipe Manana
2025-03-27 22:31 ` [PATCH 4/4] btrfs: prepare btrfs_punch_hole_lock_range() " Qu Wenruo
2025-03-28 17:57 ` Filipe Manana
2025-03-29 3:18 ` Qu Wenruo