* [PATCH 1/9] btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2022-11-11 11:50 [PATCH 0/9] btrfs: more optimizations for lseek and fiemap fdmanana
@ 2022-11-11 11:50 ` fdmanana
2022-11-11 11:50 ` [PATCH 2/9] btrfs: add an early exit when searching for delalloc range for lseek/fiemap fdmanana
` (10 subsequent siblings)
11 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2022-11-11 11:50 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
We don't need to set the EXTENT_UPTODATE bit in an inode's io_tree to mark a
range as uptodate, as we rely on the pages themselves being uptodate - page
reading is not triggered for already uptodate pages. Recently we removed
most use of the EXTENT_UPTODATE bit for buffered IO with commit 52b029f42751
("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path"),
but there were a few leftovers, namely when reading from holes and
successfully finishing read repair.
These leftovers are unnecessarily making an inode's tree larger and deeper,
slowing down searches on it. So remove all the leftovers.
This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:
1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent-io-tree.h | 7 -------
fs/btrfs/extent_io.c | 19 +++----------------
2 files changed, 3 insertions(+), 23 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index cdee8c0854c8..18ab82f62611 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -216,13 +216,6 @@ static inline int set_extent_new(struct extent_io_tree *tree, u64 start,
return set_extent_bit(tree, start, end, EXTENT_NEW, NULL, GFP_NOFS);
}
-static inline int set_extent_uptodate(struct extent_io_tree *tree, u64 start,
- u64 end, struct extent_state **cached_state, gfp_t mask)
-{
- return set_extent_bit(tree, start, end, EXTENT_UPTODATE,
- cached_state, mask);
-}
-
int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
u64 *start_ret, u64 *end_ret, u32 bits,
struct extent_state **cached_state);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 859a41624c31..bf19011037d1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -901,14 +901,9 @@ static void end_sector_io(struct page *page, u64 offset, bool uptodate)
{
struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
const u32 sectorsize = inode->root->fs_info->sectorsize;
- struct extent_state *cached = NULL;
end_page_read(page, uptodate, offset, sectorsize);
- if (uptodate)
- set_extent_uptodate(&inode->io_tree, offset,
- offset + sectorsize - 1, &cached, GFP_NOFS);
- unlock_extent(&inode->io_tree, offset, offset + sectorsize - 1,
- &cached);
+ unlock_extent(&inode->io_tree, offset, offset + sectorsize - 1, NULL);
}
static void submit_data_read_repair(struct inode *inode,
@@ -1781,13 +1776,9 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
ASSERT(IS_ALIGNED(cur, fs_info->sectorsize));
if (cur >= last_byte) {
- struct extent_state *cached = NULL;
-
iosize = PAGE_SIZE - pg_offset;
memzero_page(page, pg_offset, iosize);
- set_extent_uptodate(tree, cur, cur + iosize - 1,
- &cached, GFP_NOFS);
- unlock_extent(tree, cur, cur + iosize - 1, &cached);
+ unlock_extent(tree, cur, cur + iosize - 1, NULL);
end_page_read(page, true, cur, iosize);
break;
}
@@ -1863,13 +1854,9 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
/* we've found a hole, just zero and go on */
if (block_start == EXTENT_MAP_HOLE) {
- struct extent_state *cached = NULL;
-
memzero_page(page, pg_offset, iosize);
- set_extent_uptodate(tree, cur, cur + iosize - 1,
- &cached, GFP_NOFS);
- unlock_extent(tree, cur, cur + iosize - 1, &cached);
+ unlock_extent(tree, cur, cur + iosize - 1, NULL);
end_page_read(page, true, cur, iosize);
cur = cur + iosize;
pg_offset += iosize;
--
2.35.1
* [PATCH 2/9] btrfs: add an early exit when searching for delalloc range for lseek/fiemap
From: fdmanana @ 2022-11-11 11:50 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
During fiemap and lseek (SEEK_HOLE/DATA), when looking for delalloc in a
range corresponding to a hole or a prealloc extent, if we find the whole
range marked as delalloc in the inode's io_tree, then we can terminate
immediately and avoid searching the extent map tree. If not, and if the
found delalloc starts at the same offset of our search start but ends
before our search range's end, then we can adjust the search range for
the search in the extent map tree. So implement those changes.
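The early exit and range trimming described above can be sketched outside the kernel as follows (a minimal standalone model with illustrative names, not the actual btrfs code; offsets are inclusive, as in the patch):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

/* Inclusive file range being searched for delalloc. */
struct search_range {
	u64 start;
	u64 end;
};

/*
 * Model of the early exit added to find_delalloc_subrange(): io_start
 * and io_len describe the delalloc range found in the io_tree (io_len
 * of 0 means none was found). Returns true when the whole search range
 * is delalloc and no extent map search is needed; otherwise trims the
 * search range when the found range is a prefix of it.
 */
bool trim_search_range(struct search_range *sr, u64 io_start, u64 io_len)
{
	u64 io_end;

	if (io_len == 0)
		return false;
	io_end = io_start + io_len - 1;
	if (io_start == sr->start) {
		/* Delalloc for the whole range, nothing more to do. */
		if (io_end == sr->end)
			return true;
		/* Else trim our search range for the extent map search. */
		sr->start = io_end + 1;
	}
	return false;
}
```

For example, a search over [0, 8191] that finds delalloc only in [0, 4095] does not exit early, but continues with the trimmed range [4096, 8191].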
This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:
1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/file.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e6c93be91a06..9b1f76109682 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3216,7 +3216,7 @@ static long btrfs_fallocate(struct file *file, int mode,
static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
u64 *delalloc_start_ret, u64 *delalloc_end_ret)
{
- const u64 len = end + 1 - start;
+ u64 len = end + 1 - start;
struct extent_map_tree *em_tree = &inode->extent_tree;
struct extent_map *em;
u64 em_end;
@@ -3242,13 +3242,23 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
delalloc_len = 0;
}
- /*
- * If delalloc was found then *delalloc_start_ret has a sector size
- * aligned value (rounded down).
- */
- if (delalloc_len > 0)
+ if (delalloc_len > 0) {
+ /*
+ * If delalloc was found then *delalloc_start_ret has a sector size
+ * aligned value (rounded down).
+ */
*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
+ if (*delalloc_start_ret == start) {
+ /* Delalloc for the whole range, nothing more to do. */
+ if (*delalloc_end_ret == end)
+ return true;
+ /* Else trim our search range for extent maps. */
+ start = *delalloc_end_ret + 1;
+ len = end + 1 - start;
+ }
+ }
+
/*
* No outstanding extents means we don't have any delalloc that is
* flushing, so return the unflushed range found in the io tree (if any).
--
2.35.1
* [PATCH 3/9] btrfs: skip unnecessary delalloc searches during lseek/fiemap
From: fdmanana @ 2022-11-11 11:50 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range
that corresponds to a hole or a prealloc extent, if we find that there is
no delalloc marked in the inode's io_tree but there is delalloc due to
an extent map in the extent map tree, then on the next iteration that calls
find_delalloc_subrange() we can skip searching the io tree again, since
on the first call we had no delalloc in the io tree for the whole range.
This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:
1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/file.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9b1f76109682..99cc95487d42 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3214,6 +3214,7 @@ static long btrfs_fallocate(struct file *file, int mode,
* looping while it gets adjacent subranges, and merging them together.
*/
static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
+ bool *search_io_tree,
u64 *delalloc_start_ret, u64 *delalloc_end_ret)
{
u64 len = end + 1 - start;
@@ -3231,7 +3232,7 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
spin_lock(&inode->lock);
outstanding_extents = inode->outstanding_extents;
- if (inode->delalloc_bytes > 0) {
+ if (*search_io_tree && inode->delalloc_bytes > 0) {
spin_unlock(&inode->lock);
*delalloc_start_ret = start;
delalloc_len = count_range_bits(&inode->io_tree,
@@ -3257,6 +3258,9 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
start = *delalloc_end_ret + 1;
len = end + 1 - start;
}
+ } else {
+ /* No delalloc, future calls don't need to search again. */
+ *search_io_tree = false;
}
/*
@@ -3390,6 +3394,7 @@ bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
{
u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
u64 prev_delalloc_end = 0;
+ bool search_io_tree = true;
bool ret = false;
while (cur_offset < end) {
@@ -3398,6 +3403,7 @@ bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
bool delalloc;
delalloc = find_delalloc_subrange(inode, cur_offset, end,
+ &search_io_tree,
&delalloc_start,
&delalloc_end);
if (!delalloc)
--
2.35.1
* [PATCH 4/9] btrfs: search for delalloc more efficiently during lseek/fiemap
From: fdmanana @ 2022-11-11 11:50 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range
that corresponds to a hole or a prealloc extent, we have to check if
there's any delalloc in the range. We do it by searching for delalloc
ranges in the inode's io_tree (for unflushed delalloc) and in the inode's
extent map tree (for delalloc that is flushing).
We avoid searching the extent map tree if the number of outstanding
extents is 0, as in that case we can't have extent maps for our search
range in the tree that correspond to delalloc that is flushing. However
if we have any unflushed delalloc, due to buffered writes or mmap writes,
then the outstanding extents counter is not 0 and we'll search the extent
map tree. The tree may be large because it can have lots of extent maps
that were loaded by reads or created by previous writes, therefore taking
a significant time to search the tree, especially if we have a file with a
lot of holes and/or prealloc extents.
We can improve on this by instead of searching the extent map tree,
searching the ordered extents tree of the inode, since when delalloc is
flushing we create an ordered extent along with the new extent map, while
holding the respective file range locked in the inode's io_tree. The
ordered extents tree is typically much smaller, since ordered extents have
a short life and get removed from the tree once they are completed, while
extent maps can stay for a very long time in the extent map tree, either
created by previous writes or loaded by read operations.
So use the ordered extents tree instead of the extent maps tree.
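The range-combining rules at the end of the rewritten function can be modeled outside the kernel like this (a hypothetical standalone sketch with invented names; all offsets are inclusive and the ordered extent is assumed to overlap the search range):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

static u64 min_u64(u64 a, u64 b) { return a < b ? a : b; }
static u64 max_u64(u64 a, u64 b) { return a > b ? a : b; }

/*
 * Combine the unflushed delalloc range from the io_tree (*ds, *de; only
 * valid when have_io is true) with an ordered extent starting at file
 * offset oe_offset and spanning oe_bytes. Mirrors the logic of the
 * patched find_delalloc_subrange(): clamp the ordered extent to the
 * search range, then either return it alone, merge it with the io_tree
 * range when adjacent, or keep only the leftmost of two disjoint ranges.
 */
void combine_delalloc(u64 start, u64 end, u64 oe_offset, u64 oe_bytes,
		      bool have_io, u64 *ds, u64 *de)
{
	/* The ordered extent may span beyond our search range. */
	u64 oe_start = max_u64(oe_offset, start);
	u64 oe_end = min_u64(oe_offset + oe_bytes - 1, end);

	if (!have_io) {
		/* No unflushed delalloc, the ordered extent is the result. */
		*ds = oe_start;
		*de = oe_end;
		return;
	}
	if (oe_start < *ds) {
		/* Ordered extent is leftmost; if disjoint, report it alone. */
		if (oe_end < *ds)
			*de = oe_end;
		*ds = oe_start;
	} else if (*de + 1 == oe_start) {
		/* io_tree range is leftmost and adjacent: merge them. */
		*de = oe_end;
	}
}
```

So an ordered extent at [0, 4095] with io_tree delalloc at [8192, 12287] reports only [0, 4095] (the leftmost range), while one at [4096, 8191] adjacent to io_tree delalloc at [0, 4095] merges into [0, 8191].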
This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:
1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/file.c | 152 +++++++++++++++---------------------------------
1 file changed, 48 insertions(+), 104 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 99cc95487d42..6bc2397e324c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3218,29 +3218,27 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
u64 *delalloc_start_ret, u64 *delalloc_end_ret)
{
u64 len = end + 1 - start;
- struct extent_map_tree *em_tree = &inode->extent_tree;
- struct extent_map *em;
- u64 em_end;
- u64 delalloc_len;
- unsigned int outstanding_extents;
+ u64 delalloc_len = 0;
+ struct btrfs_ordered_extent *oe;
+ u64 oe_start;
+ u64 oe_end;
/*
* Search the io tree first for EXTENT_DELALLOC. If we find any, it
* means we have delalloc (dirty pages) for which writeback has not
* started yet.
*/
- spin_lock(&inode->lock);
- outstanding_extents = inode->outstanding_extents;
-
- if (*search_io_tree && inode->delalloc_bytes > 0) {
- spin_unlock(&inode->lock);
- *delalloc_start_ret = start;
- delalloc_len = count_range_bits(&inode->io_tree,
- delalloc_start_ret, end,
- len, EXTENT_DELALLOC, 1);
- } else {
- spin_unlock(&inode->lock);
- delalloc_len = 0;
+ if (*search_io_tree) {
+ spin_lock(&inode->lock);
+ if (inode->delalloc_bytes > 0) {
+ spin_unlock(&inode->lock);
+ *delalloc_start_ret = start;
+ delalloc_len = count_range_bits(&inode->io_tree,
+ delalloc_start_ret, end,
+ len, EXTENT_DELALLOC, 1);
+ } else {
+ spin_unlock(&inode->lock);
+ }
}
if (delalloc_len > 0) {
@@ -3254,7 +3252,7 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
/* Delalloc for the whole range, nothing more to do. */
if (*delalloc_end_ret == end)
return true;
- /* Else trim our search range for extent maps. */
+ /* Else trim our search range for ordered extents. */
start = *delalloc_end_ret + 1;
len = end + 1 - start;
}
@@ -3264,110 +3262,56 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
}
/*
- * No outstanding extents means we don't have any delalloc that is
- * flushing, so return the unflushed range found in the io tree (if any).
- */
- if (outstanding_extents == 0)
- return (delalloc_len > 0);
-
- /*
- * Now also check if there's any extent map in the range that does not
- * map to a hole or prealloc extent. We do this because:
+ * Now also check if there's any ordered extent in the range.
+ * We do this because:
*
* 1) When delalloc is flushed, the file range is locked, we clear the
- * EXTENT_DELALLOC bit from the io tree and create an extent map for
- * an allocated extent. So we might just have been called after
- * delalloc is flushed and before the ordered extent completes and
- * inserts the new file extent item in the subvolume's btree;
+ * EXTENT_DELALLOC bit from the io tree and create an extent map and
+ * an ordered extent for the write. So we might just have been called
+ * after delalloc is flushed and before the ordered extent completes
+ * and inserts the new file extent item in the subvolume's btree;
*
- * 2) We may have an extent map created by flushing delalloc for a
+ * 2) We may have an ordered extent created by flushing delalloc for a
* subrange that starts before the subrange we found marked with
* EXTENT_DELALLOC in the io tree.
+ *
+ * We could also use the extent map tree to find such delalloc that is
+ * being flushed, but using the ordered extents tree is more efficient
+ * because it's usually much smaller as ordered extents are removed from
+ * the tree once they complete. With the extent maps, we may have them
+ * in the extent map tree for a very long time, and they were either
+ * created by previous writes or loaded by read operations.
*/
- read_lock(&em_tree->lock);
- em = lookup_extent_mapping(em_tree, start, len);
- if (!em) {
- read_unlock(&em_tree->lock);
+ oe = btrfs_lookup_first_ordered_range(inode, start, len);
+ if (!oe)
return (delalloc_len > 0);
- }
-
- /* extent_map_end() returns a non-inclusive end offset. */
- em_end = extent_map_end(em);
-
- /*
- * If we have a hole/prealloc extent map, check the next one if this one
- * ends before our range's end.
- */
- if ((em->block_start == EXTENT_MAP_HOLE ||
- test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
- struct extent_map *next_em;
-
- next_em = btrfs_next_extent_map(em_tree, em);
- free_extent_map(em);
-
- /*
- * There's no next extent map or the next one starts beyond our
- * range, return the range found in the io tree (if any).
- */
- if (!next_em || next_em->start > end) {
- read_unlock(&em_tree->lock);
- free_extent_map(next_em);
- return (delalloc_len > 0);
- }
- em_end = extent_map_end(next_em);
- em = next_em;
- }
-
- read_unlock(&em_tree->lock);
+ /* The ordered extent may span beyond our search range. */
+ oe_start = max(oe->file_offset, start);
+ oe_end = min(oe->file_offset + oe->num_bytes - 1, end);
- /*
- * We have a hole or prealloc extent that ends at or beyond our range's
- * end, return the range found in the io tree (if any).
- */
- if (em->block_start == EXTENT_MAP_HOLE ||
- test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
- free_extent_map(em);
- return (delalloc_len > 0);
- }
+ btrfs_put_ordered_extent(oe);
- /*
- * We don't have any range as EXTENT_DELALLOC in the io tree, so the
- * extent map is the only subrange representing delalloc.
- */
+ /* Don't have unflushed delalloc, return the ordered extent range. */
if (delalloc_len == 0) {
- *delalloc_start_ret = em->start;
- *delalloc_end_ret = min(end, em_end - 1);
- free_extent_map(em);
+ *delalloc_start_ret = oe_start;
+ *delalloc_end_ret = oe_end;
return true;
}
/*
- * The extent map represents a delalloc range that starts before the
- * delalloc range we found in the io tree.
+ * We have both unflushed delalloc (io_tree) and an ordered extent.
+ * If the ranges are adjacent, return a combined range, otherwise
+ * return the leftmost range.
*/
- if (em->start < *delalloc_start_ret) {
- *delalloc_start_ret = em->start;
- /*
- * If the ranges are adjacent, return a combined range.
- * Otherwise return the extent map's range.
- */
- if (em_end < *delalloc_start_ret)
- *delalloc_end_ret = min(end, em_end - 1);
-
- free_extent_map(em);
- return true;
+ if (oe_start < *delalloc_start_ret) {
+ if (oe_end < *delalloc_start_ret)
+ *delalloc_end_ret = oe_end;
+ *delalloc_start_ret = oe_start;
+ } else if (*delalloc_end_ret + 1 == oe_start) {
+ *delalloc_end_ret = oe_end;
}
- /*
- * The extent map starts after the delalloc range we found in the io
- * tree. If it's adjacent, return a combined range, otherwise return
- * the range found in the io tree.
- */
- if (*delalloc_end_ret + 1 == em->start)
- *delalloc_end_ret = min(end, em_end - 1);
-
- free_extent_map(em);
return true;
}
--
2.35.1
* [PATCH 5/9] btrfs: remove no longer used btrfs_next_extent_map()
From: fdmanana @ 2022-11-11 11:50 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
There are no more users of btrfs_next_extent_map(), the previous patch
in the series ("btrfs: search for delalloc more efficiently during
lseek/fiemap") removed the last usage of the function, so delete it.
This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:
1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent_map.c | 29 -----------------------------
fs/btrfs/extent_map.h | 2 --
2 files changed, 31 deletions(-)
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 4a4362f5cc52..be94030e1dfb 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -529,35 +529,6 @@ static struct extent_map *next_extent_map(const struct extent_map *em)
return container_of(next, struct extent_map, rb_node);
}
-/*
- * Get the extent map that immediately follows another one.
- *
- * @tree: The extent map tree that the extent map belong to.
- * Holding read or write access on the tree's lock is required.
- * @em: An extent map from the given tree. The caller must ensure that
- * between getting @em and between calling this function, the
- * extent map @em is not removed from the tree - for example, by
- * holding the tree's lock for the duration of those 2 operations.
- *
- * Returns the extent map that immediately follows @em, or NULL if @em is the
- * last extent map in the tree.
- */
-struct extent_map *btrfs_next_extent_map(const struct extent_map_tree *tree,
- const struct extent_map *em)
-{
- struct extent_map *next;
-
- /* The lock must be acquired either in read mode or write mode. */
- lockdep_assert_held(&tree->lock);
- ASSERT(extent_map_in_tree(em));
-
- next = next_extent_map(em);
- if (next)
- refcount_inc(&next->refs);
-
- return next;
-}
-
static struct extent_map *prev_extent_map(struct extent_map *em)
{
struct rb_node *prev;
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 68d3f2c9ea1d..ad311864272a 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -87,8 +87,6 @@ static inline u64 extent_map_block_end(struct extent_map *em)
void extent_map_tree_init(struct extent_map_tree *tree);
struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
u64 start, u64 len);
-struct extent_map *btrfs_next_extent_map(const struct extent_map_tree *tree,
- const struct extent_map *em);
int add_extent_mapping(struct extent_map_tree *tree,
struct extent_map *em, int modified);
void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em);
--
2.35.1
* [PATCH 6/9] btrfs: allow passing a cached state record to count_range_bits()
From: fdmanana @ 2022-11-11 11:50 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
An inode's io_tree can be quite large and there are cases where due to
delalloc it can have thousands of extent state records, which makes the
red black tree have a depth of 10 or more, making the operation of
count_range_bits() slow if we repeatedly call it for a range that starts
where the previous one ended, or after it. Such use cases are
when searching for delalloc in a file range that corresponds to a hole or
a prealloc extent, which is done during lseek SEEK_HOLE/DATA and fiemap.
So introduce a cached state parameter to count_range_bits() which we use
to store the last extent state record we visited, and then allow the
caller to pass it again on its next call to count_range_bits(). The next
patches in the series will make fiemap and lseek use the new parameter.
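The cached-state fast path can be illustrated with a toy model in which the sorted extent state records form a doubly linked list (a standalone sketch with invented names; the kernel walks a red-black tree and takes the tree's spinlock instead):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

/* Toy extent state record: inclusive [start, end], sorted, no overlap. */
struct state {
	u64 start;
	u64 end;
	struct state *prev;
	struct state *next;
};

/*
 * Mirror of the cached-state check added to count_range_bits(): given
 * the record cached by the previous call, return the record to resume
 * the walk from, or NULL when a full tree search is still required.
 */
struct state *resume_from_cached(struct state *cached, u64 cur_start)
{
	if (!cached)
		return NULL;
	if (cached->start <= cur_start && cur_start <= cached->end)
		return cached;
	if (cached->start > cur_start) {
		/*
		 * The cached state starts after our search start. If the
		 * previous record covers cur_start, resume from it; with
		 * no previous record the cached one is the first candidate.
		 */
		struct state *prev = cached->prev;

		if (!prev)
			return cached;
		if (prev->start <= cur_start && cur_start <= prev->end)
			return prev;
	}
	return NULL;
}
```

A forward-moving caller (as in lseek's loop) usually resumes at or right after the cached record, turning each repeated tree search from the root into a constant-time check.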
This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:
1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent-io-tree.c | 47 ++++++++++++++++++++++++++++++++++++---
fs/btrfs/extent-io-tree.h | 3 ++-
fs/btrfs/file.c | 3 ++-
fs/btrfs/inode.c | 2 +-
4 files changed, 49 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 285b0ff6e953..6b0d78df7eee 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -1521,9 +1521,11 @@ void find_first_clear_extent_bit(struct extent_io_tree *tree, u64 start,
*/
u64 count_range_bits(struct extent_io_tree *tree,
u64 *start, u64 search_end, u64 max_bytes,
- u32 bits, int contig)
+ u32 bits, int contig,
+ struct extent_state **cached_state)
{
- struct extent_state *state;
+ struct extent_state *state = NULL;
+ struct extent_state *cached;
u64 cur_start = *start;
u64 total_bytes = 0;
u64 last = 0;
@@ -1534,11 +1536,41 @@ u64 count_range_bits(struct extent_io_tree *tree,
spin_lock(&tree->lock);
+ if (!cached_state || !*cached_state)
+ goto search;
+
+ cached = *cached_state;
+
+ if (!extent_state_in_tree(cached))
+ goto search;
+
+ if (cached->start <= cur_start && cur_start <= cached->end) {
+ state = cached;
+ } else if (cached->start > cur_start) {
+ struct extent_state *prev;
+
+ /*
+ * The cached state starts after our search range's start. Check
+ * if the previous state record starts at or before the range we
+ * are looking for, and if so, use it - this is a common case
+ * when there are holes between records in the tree. If there is
+ * no previous state record, we can start from our cached state.
+ */
+ prev = prev_state(cached);
+ if (!prev)
+ state = cached;
+ else if (prev->start <= cur_start && cur_start <= prev->end)
+ state = prev;
+ }
+
/*
* This search will find all the extents that end after our range
* starts.
*/
- state = tree_search(tree, cur_start);
+search:
+ if (!state)
+ state = tree_search(tree, cur_start);
+
while (state) {
if (state->start > search_end)
break;
@@ -1559,7 +1591,16 @@ u64 count_range_bits(struct extent_io_tree *tree,
}
state = next_state(state);
}
+
+ if (cached_state) {
+ free_extent_state(*cached_state);
+ *cached_state = state;
+ if (state)
+ refcount_inc(&state->refs);
+ }
+
spin_unlock(&tree->lock);
+
return total_bytes;
}
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index 18ab82f62611..e3eeec380844 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -119,7 +119,8 @@ void __cold extent_state_free_cachep(void);
u64 count_range_bits(struct extent_io_tree *tree,
u64 *start, u64 search_end,
- u64 max_bytes, u32 bits, int contig);
+ u64 max_bytes, u32 bits, int contig,
+ struct extent_state **cached_state);
void free_extent_state(struct extent_state *state);
int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 6bc2397e324c..dc8399610ca3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3235,7 +3235,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
*delalloc_start_ret = start;
delalloc_len = count_range_bits(&inode->io_tree,
delalloc_start_ret, end,
- len, EXTENT_DELALLOC, 1);
+ len, EXTENT_DELALLOC, 1,
+ NULL);
} else {
spin_unlock(&inode->lock);
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index aec1b232a71c..83898bca39d5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1769,7 +1769,7 @@ static int fallback_to_cow(struct btrfs_inode *inode, struct page *locked_page,
* when starting writeback.
*/
count = count_range_bits(io_tree, &range_start, end, range_bytes,
- EXTENT_NORESERVE, 0);
+ EXTENT_NORESERVE, 0, NULL);
if (count > 0 || is_space_ino || is_reloc_ino) {
u64 bytes = count;
struct btrfs_fs_info *fs_info = inode->root->fs_info;
--
2.35.1
* [PATCH 7/9] btrfs: update stale comment for count_range_bits()
2022-11-11 11:50 [PATCH 0/9] btrfs: more optimizations for lseek and fiemap fdmanana
` (5 preceding siblings ...)
2022-11-11 11:50 ` [PATCH 6/9] btrfs: allow passing a cached state record to count_range_bits() fdmanana
@ 2022-11-11 11:50 ` fdmanana
2022-11-11 11:50 ` [PATCH 8/9] btrfs: use cached state when looking for delalloc ranges with fiemap fdmanana
` (4 subsequent siblings)
11 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2022-11-11 11:50 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
The comment for count_range_bits() mentions that the search is fast if we
are asking for a range with the EXTENT_DIRTY bit set. However that is no
longer true since we don't use that bit and the optimization for that was
removed in:
commit 71528e9e16c7 ("btrfs: get rid of extent_io_tree::dirty_bytes")
So remove that part of the comment mentioning the no longer existing
optimized case, and, while at it, add proper documentation describing the
purpose, arguments and return value of the function.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent-io-tree.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 6b0d78df7eee..21fa15123af8 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -1515,9 +1515,29 @@ void find_first_clear_extent_bit(struct extent_io_tree *tree, u64 start,
}
/*
- * Count the number of bytes in the tree that have a given bit(s) set. This
- * can be fairly slow, except for EXTENT_DIRTY which is cached. The total
- * number found is returned.
+ * Count the number of bytes in the tree that have a given bit(s) set for a
+ * given range.
+ *
+ * @tree: The io tree to search.
+ * @start: The start offset of the range. This value is updated to the
+ * offset of the first byte found with the given bit(s), so it
+ * can end up being bigger than the initial value.
+ * @search_end: The end offset (inclusive value) of the search range.
+ * @max_bytes: The maximum byte count we are interested in. The search stops
+ * once it reaches this count.
+ * @bits: The bits the range must have in order to be accounted for.
+ * If multiple bits are set, then only subranges that have all
+ * the bits set are accounted for.
+ * @contig: Indicate if we should ignore holes in the range or not. If
+ * this is true, then stop once we find a hole.
+ * @cached_state: A cached state to be used across multiple calls to this
+ * function in order to speedup searches. Use NULL if this is
+ * called only once or if each call does not start where the
+ * previous one ended.
+ *
+ * Returns the total number of bytes found within the given range that have
+ * all given bits set. If the returned number of bytes is greater than zero
+ * then @start is updated with the offset of the first byte with the bits set.
*/
u64 count_range_bits(struct extent_io_tree *tree,
u64 *start, u64 search_end, u64 max_bytes,
--
2.35.1
* [PATCH 8/9] btrfs: use cached state when looking for delalloc ranges with fiemap
2022-11-11 11:50 [PATCH 0/9] btrfs: more optimizations for lseek and fiemap fdmanana
` (6 preceding siblings ...)
2022-11-11 11:50 ` [PATCH 7/9] btrfs: update stale comment for count_range_bits() fdmanana
@ 2022-11-11 11:50 ` fdmanana
2022-11-11 11:50 ` [PATCH 9/9] btrfs: use cached state when looking for delalloc ranges with lseek fdmanana
` (3 subsequent siblings)
11 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2022-11-11 11:50 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
During fiemap, whenever we find a hole or prealloc extent, we will look
for delalloc in that range, and one of the things we do for that is to
find out ranges in the inode's io_tree marked with EXTENT_DELALLOC, using
calls to count_range_bits().
Since we process file extents from left to right, if we have a file with
several holes or prealloc extents, we benefit from keeping a cached extent
state record for calls to count_range_bits(). Most of the time the last
extent state record we visited in one call to count_range_bits() matches
the first extent state record we will use in the next call to
count_range_bits(), so there's a benefit here. So use an extent state
record to cache results from count_range_bits() calls during fiemap.
This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:
1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent_io.c | 11 ++++++++++-
fs/btrfs/file.c | 10 +++++++---
fs/btrfs/file.h | 1 +
3 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bf19011037d1..1e8d58fc0806 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3698,6 +3698,7 @@ static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path
static int fiemap_process_hole(struct btrfs_inode *inode,
struct fiemap_extent_info *fieinfo,
struct fiemap_cache *cache,
+ struct extent_state **delalloc_cached_state,
struct btrfs_backref_share_check_ctx *backref_ctx,
u64 disk_bytenr, u64 extent_offset,
u64 extent_gen,
@@ -3722,6 +3723,7 @@ static int fiemap_process_hole(struct btrfs_inode *inode,
bool delalloc;
delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
+ delalloc_cached_state,
&delalloc_start,
&delalloc_end);
if (!delalloc)
@@ -3892,6 +3894,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
{
const u64 ino = btrfs_ino(inode);
struct extent_state *cached_state = NULL;
+ struct extent_state *delalloc_cached_state = NULL;
struct btrfs_path *path;
struct fiemap_cache cache = { 0 };
struct btrfs_backref_share_check_ctx *backref_ctx;
@@ -3966,6 +3969,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
const u64 range_end = min(key.offset, lockend) - 1;
ret = fiemap_process_hole(inode, fieinfo, &cache,
+ &delalloc_cached_state,
backref_ctx, 0, 0, 0,
prev_extent_end, range_end);
if (ret < 0) {
@@ -4006,6 +4010,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
extent_len, flags);
} else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
ret = fiemap_process_hole(inode, fieinfo, &cache,
+ &delalloc_cached_state,
backref_ctx,
disk_bytenr, extent_offset,
extent_gen, key.offset,
@@ -4013,6 +4018,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
} else if (disk_bytenr == 0) {
/* We have an explicit hole. */
ret = fiemap_process_hole(inode, fieinfo, &cache,
+ &delalloc_cached_state,
backref_ctx, 0, 0, 0,
key.offset, extent_end - 1);
} else {
@@ -4070,7 +4076,8 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
path = NULL;
if (!stopped && prev_extent_end < lockend) {
- ret = fiemap_process_hole(inode, fieinfo, &cache, backref_ctx,
+ ret = fiemap_process_hole(inode, fieinfo, &cache,
+ &delalloc_cached_state, backref_ctx,
0, 0, 0, prev_extent_end, lockend - 1);
if (ret < 0)
goto out_unlock;
@@ -4088,6 +4095,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
delalloc = btrfs_find_delalloc_in_range(inode,
prev_extent_end,
i_size - 1,
+ &delalloc_cached_state,
&delalloc_start,
&delalloc_end);
if (!delalloc)
@@ -4102,6 +4110,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
out_unlock:
unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
out:
+ free_extent_state(delalloc_cached_state);
btrfs_free_backref_share_ctx(backref_ctx);
btrfs_free_path(path);
return ret;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index dc8399610ca3..da390f891220 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3214,6 +3214,7 @@ static long btrfs_fallocate(struct file *file, int mode,
* looping while it gets adjacent subranges, and merging them together.
*/
static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
+ struct extent_state **cached_state,
bool *search_io_tree,
u64 *delalloc_start_ret, u64 *delalloc_end_ret)
{
@@ -3236,7 +3237,7 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
delalloc_len = count_range_bits(&inode->io_tree,
delalloc_start_ret, end,
len, EXTENT_DELALLOC, 1,
- NULL);
+ cached_state);
} else {
spin_unlock(&inode->lock);
}
@@ -3324,6 +3325,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
* sector size aligned.
* @end: The end offset (inclusive value) of the search range.
* It does not need to be sector size aligned.
+ * @cached_state: Extent state record used for speeding up delalloc
+ * searches in the inode's io_tree. Can be NULL.
* @delalloc_start_ret: Output argument, set to the start offset of the
* subrange found with delalloc (may not be sector size
* aligned).
@@ -3335,6 +3338,7 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
* end offsets of the subrange.
*/
bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
+ struct extent_state **cached_state,
u64 *delalloc_start_ret, u64 *delalloc_end_ret)
{
u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
@@ -3348,7 +3352,7 @@ bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
bool delalloc;
delalloc = find_delalloc_subrange(inode, cur_offset, end,
- &search_io_tree,
+ cached_state, &search_io_tree,
&delalloc_start,
&delalloc_end);
if (!delalloc)
@@ -3400,7 +3404,7 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
u64 delalloc_end;
bool delalloc;
- delalloc = btrfs_find_delalloc_in_range(inode, start, end,
+ delalloc = btrfs_find_delalloc_in_range(inode, start, end, NULL,
&delalloc_start, &delalloc_end);
if (delalloc && whence == SEEK_DATA) {
*start_ret = delalloc_start;
diff --git a/fs/btrfs/file.h b/fs/btrfs/file.h
index f3d794e33d46..82b34fbb295f 100644
--- a/fs/btrfs/file.h
+++ b/fs/btrfs/file.h
@@ -27,6 +27,7 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
size_t *write_bytes, bool nowait);
void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
+ struct extent_state **cached_state,
u64 *delalloc_start_ret, u64 *delalloc_end_ret);
#endif
--
2.35.1
* [PATCH 9/9] btrfs: use cached state when looking for delalloc ranges with lseek
2022-11-11 11:50 [PATCH 0/9] btrfs: more optimizations for lseek and fiemap fdmanana
` (7 preceding siblings ...)
2022-11-11 11:50 ` [PATCH 8/9] btrfs: use cached state when looking for delalloc ranges with fiemap fdmanana
@ 2022-11-11 11:50 ` fdmanana
2022-11-11 19:12 ` [PATCH 0/9] btrfs: more optimizations for lseek and fiemap David Sterba
` (2 subsequent siblings)
11 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2022-11-11 11:50 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
During lseek (SEEK_HOLE/DATA), whenever we find a hole or prealloc extent,
we will look for delalloc in that range, and one of the things we do for
that is to find out ranges in the inode's io_tree marked with
EXTENT_DELALLOC, using calls to count_range_bits().
Typically there's a single, or few, searches in the io_tree for delalloc
per lseek call. However it's common for applications to keep calling
lseek with SEEK_HOLE and SEEK_DATA to find where extents and holes are in
a file, read the extents and skip holes in order to avoid unnecessary IO
and save disk space by preserving holes.
One popular user is the cp utility from coreutils. Starting with coreutils
9.0, cp uses SEEK_HOLE and SEEK_DATA to iterate over the extents of a
file. Before 9.0, it used fiemap to figure out where holes and extents are
in the source file. Another popular user is the tar utility when used with
the --sparse / -S option to detect and preserve holes.
Given that the pattern is to keep calling lseek with a start offset that
matches the returned offset from the previous lseek call, we can benefit
from caching the last extent state visited in count_range_bits() and use
it in the next count_range_bits() call from the next lseek call. For
example, the following strace excerpt from running tar:
$ strace tar cJSvf foo.tar.xz qemu_disk_file.raw
(...)
lseek(5, 125019574272, SEEK_HOLE) = 125024989184
lseek(5, 125024989184, SEEK_DATA) = 125024993280
lseek(5, 125024993280, SEEK_HOLE) = 125025239040
lseek(5, 125025239040, SEEK_DATA) = 125025255424
lseek(5, 125025255424, SEEK_HOLE) = 125025353728
lseek(5, 125025353728, SEEK_DATA) = 125025357824
lseek(5, 125025357824, SEEK_HOLE) = 125026766848
lseek(5, 125026766848, SEEK_DATA) = 125026770944
lseek(5, 125026770944, SEEK_HOLE) = 125027053568
(...)
This shows the pattern, which is the same as with cp from coreutils 9.0+.
So start using a cached state for the delalloc searches in lseek, and
store it in struct file's private data so that it can be reused across
lseek calls.
This change is part of a patchset that is comprised of the following
patches:
1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek
The following test was run before and after applying the whole patchset:
$ cat test-cp.sh
#!/bin/bash
DEV=/dev/sdh
MNT=/mnt/sdh
# coreutils 8.32, cp uses fiemap to detect holes and extents
#CP_PROG=/usr/bin/cp
# coreutils 9.1, cp uses SEEK_HOLE/DATA to detect holes and extents
CP_PROG=/home/fdmanana/git/hub/coreutils/src/cp
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount $DEV $MNT
FILE_SIZE=$((1024 * 1024 * 1024))
echo "Creating file with a size of $((FILE_SIZE / 1024 / 1024))M"
# Create a very sparse file, where each extent has a length of 4K and
# is preceded by a 4K hole and followed by another 4K hole.
start=$(date +%s%N)
echo -n > $MNT/foobar
for ((off = 0; off < $FILE_SIZE; off += 8192)); do
xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null
echo -ne "\r$off / $FILE_SIZE ..."
done
end=$(date +%s%N)
echo -e "\nFile created ($(( (end - start) / 1000000 )) milliseconds)"
start=$(date +%s%N)
$CP_PROG $MNT/foobar /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "cp took $dur milliseconds with data/metadata cached and delalloc"
# Flush all delalloc.
sync
start=$(date +%s%N)
$CP_PROG $MNT/foobar /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "cp took $dur milliseconds with data/metadata cached and no delalloc"
# Unmount and mount again to test the case without any metadata
# loaded in memory.
umount $MNT
mount $DEV $MNT
start=$(date +%s%N)
$CP_PROG $MNT/foobar /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "cp took $dur milliseconds without data/metadata cached and no delalloc"
umount $MNT
The results, running on a box with a non-debug kernel (Debian's default
kernel config), were the following:
128M file, before patchset:
cp took 16574 milliseconds with data/metadata cached and delalloc
cp took 122 milliseconds with data/metadata cached and no delalloc
cp took 20144 milliseconds without data/metadata cached and no delalloc
128M file, after patchset:
cp took 6277 milliseconds with data/metadata cached and delalloc
cp took 109 milliseconds with data/metadata cached and no delalloc
cp took 210 milliseconds without data/metadata cached and no delalloc
512M file, before patchset:
cp took 14369 milliseconds with data/metadata cached and delalloc
cp took 429 milliseconds with data/metadata cached and no delalloc
cp took 88034 milliseconds without data/metadata cached and no delalloc
512M file, after patchset:
cp took 12106 milliseconds with data/metadata cached and delalloc
cp took 427 milliseconds with data/metadata cached and no delalloc
cp took 824 milliseconds without data/metadata cached and no delalloc
1G file, before patchset:
cp took 10074 milliseconds with data/metadata cached and delalloc
cp took 886 milliseconds with data/metadata cached and no delalloc
cp took 181261 milliseconds without data/metadata cached and no delalloc
1G file, after patchset:
cp took 3320 milliseconds with data/metadata cached and delalloc
cp took 880 milliseconds with data/metadata cached and no delalloc
cp took 1801 milliseconds without data/metadata cached and no delalloc
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/ctree.h | 1 +
fs/btrfs/file.c | 40 ++++++++++++++++++++++++++++++++--------
2 files changed, 33 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5649f8907984..99defab7e1f6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -426,6 +426,7 @@ struct btrfs_drop_extents_args {
struct btrfs_file_private {
void *filldir_buf;
+ struct extent_state *llseek_cached_state;
};
static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index da390f891220..448b143a5cb2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1696,10 +1696,12 @@ int btrfs_release_file(struct inode *inode, struct file *filp)
{
struct btrfs_file_private *private = filp->private_data;
- if (private && private->filldir_buf)
+ if (private) {
kfree(private->filldir_buf);
- kfree(private);
- filp->private_data = NULL;
+ free_extent_state(private->llseek_cached_state);
+ kfree(private);
+ filp->private_data = NULL;
+ }
/*
* Set by setattr when we are about to truncate a file from a non-zero
@@ -3398,13 +3400,14 @@ bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
* is found, it updates @start_ret with the start of the subrange.
*/
static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
+ struct extent_state **cached_state,
u64 start, u64 end, u64 *start_ret)
{
u64 delalloc_start;
u64 delalloc_end;
bool delalloc;
- delalloc = btrfs_find_delalloc_in_range(inode, start, end, NULL,
+ delalloc = btrfs_find_delalloc_in_range(inode, start, end, cached_state,
&delalloc_start, &delalloc_end);
if (delalloc && whence == SEEK_DATA) {
*start_ret = delalloc_start;
@@ -3447,11 +3450,13 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
return false;
}
-static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
- int whence)
+static loff_t find_desired_extent(struct file *file, loff_t offset, int whence)
{
+ struct btrfs_inode *inode = BTRFS_I(file->f_mapping->host);
+ struct btrfs_file_private *private = file->private_data;
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_state *cached_state = NULL;
+ struct extent_state **delalloc_cached_state;
const loff_t i_size = i_size_read(&inode->vfs_inode);
const u64 ino = btrfs_ino(inode);
struct btrfs_root *root = inode->root;
@@ -3476,6 +3481,22 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
inode_get_bytes(&inode->vfs_inode) == i_size)
return i_size;
+ if (!private) {
+ private = kzalloc(sizeof(*private), GFP_KERNEL);
+ /*
+ * No worries if memory allocation failed.
+ * The private structure is used only for speeding up multiple
+ * lseek SEEK_HOLE/DATA calls to a file when there's delalloc,
+ * so everything will still be correct.
+ */
+ file->private_data = private;
+ }
+
+ if (private)
+ delalloc_cached_state = &private->llseek_cached_state;
+ else
+ delalloc_cached_state = NULL;
+
/*
* offset can be negative, in this case we start finding DATA/HOLE from
* the very start of the file.
@@ -3553,6 +3574,7 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
search_start = offset;
found = find_desired_extent_in_hole(inode, whence,
+ delalloc_cached_state,
search_start,
key.offset - 1,
&found_start);
@@ -3587,6 +3609,7 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
search_start = offset;
found = find_desired_extent_in_hole(inode, whence,
+ delalloc_cached_state,
search_start,
extent_end - 1,
&found_start);
@@ -3628,7 +3651,8 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
/* We have an implicit hole from the last extent found up to i_size. */
if (!found && start < i_size) {
- found = find_desired_extent_in_hole(inode, whence, start,
+ found = find_desired_extent_in_hole(inode, whence,
+ delalloc_cached_state, start,
i_size - 1, &start);
if (!found)
start = i_size;
@@ -3657,7 +3681,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
case SEEK_DATA:
case SEEK_HOLE:
btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_SHARED);
- offset = find_desired_extent(BTRFS_I(inode), offset, whence);
+ offset = find_desired_extent(file, offset, whence);
btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_SHARED);
break;
}
--
2.35.1
* Re: [PATCH 0/9] btrfs: more optimizations for lseek and fiemap
2022-11-11 11:50 [PATCH 0/9] btrfs: more optimizations for lseek and fiemap fdmanana
` (8 preceding siblings ...)
2022-11-11 11:50 ` [PATCH 9/9] btrfs: use cached state when looking for delalloc ranges with lseek fdmanana
@ 2022-11-11 19:12 ` David Sterba
2022-11-12 1:44 ` Wang Yugui
2022-12-13 9:11 ` Naohiro Aota
11 siblings, 0 replies; 16+ messages in thread
From: David Sterba @ 2022-11-11 19:12 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Fri, Nov 11, 2022 at 11:50:26AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Here follows a few more optimizations for lseek and fiemap. Starting with
> coreutils 9.0, cp no longer uses fiemap to determine where holes are in a
> file, instead it uses lseek's SEEK_HOLE and SEEK_DATA modes of operation.
> For very sparse files, or files with lot of delalloc, this can be very
> slow (when compared to ext4 or xfs). This aims to improve that.
>
> The cp pattern is not specific to cp, it's common to iterate over all
> allocated regions of a file sequentially with SEEK_HOLE and SEEK_DATA,
> for either the whole file or just a section. Another popular program that
> does the same is tar, when using its --sparse / -S command line option
> (I use it like that for doing vm backups for example).
>
> The details are in the changelogs of each patch, and results are on the
> changelog of the last patch in the series. There's still much more room
> for further improvement, but that won't happen too soon as it will require
> broader changes outside the lseek and fiemap code.
>
> Filipe Manana (9):
> btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
> btrfs: add an early exit when searching for delalloc range for lseek/fiemap
> btrfs: skip unnecessary delalloc searches during lseek/fiemap
> btrfs: search for delalloc more efficiently during lseek/fiemap
> btrfs: remove no longer used btrfs_next_extent_map()
> btrfs: allow passing a cached state record to count_range_bits()
> btrfs: update stale comment for count_range_bits()
> btrfs: use cached state when looking for delalloc ranges with fiemap
> btrfs: use cached state when looking for delalloc ranges with lseek
Thanks, the improvements look great, added to misc-next.
* Re: [PATCH 0/9] btrfs: more optimizations for lseek and fiemap
2022-11-11 11:50 [PATCH 0/9] btrfs: more optimizations for lseek and fiemap fdmanana
` (9 preceding siblings ...)
2022-11-11 19:12 ` [PATCH 0/9] btrfs: more optimizations for lseek and fiemap David Sterba
@ 2022-11-12 1:44 ` Wang Yugui
2022-12-13 9:11 ` Naohiro Aota
11 siblings, 0 replies; 16+ messages in thread
From: Wang Yugui @ 2022-11-12 1:44 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
Hi,
> From: Filipe Manana <fdmanana@suse.com>
>
> Here follows a few more optimizations for lseek and fiemap. Starting with
> coreutils 9.0, cp no longer uses fiemap to determine where holes are in a
> file, instead it uses lseek's SEEK_HOLE and SEEK_DATA modes of operation.
> For very sparse files, or files with lot of delalloc, this can be very
> slow (when compared to ext4 or xfs). This aims to improve that.
>
> The cp pattern is not specific to cp, it's common to iterate over all
> allocated regions of a file sequentially with SEEK_HOLE and SEEK_DATA,
> for either the whole file or just a section. Another popular program that
> does the same is tar, when using its --sparse / -S command line option
> (I use it like that for doing vm backups for example).
The case reported in *1 is hugely improved. Thanks a lot.
*1
https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
and btrfs misc-next with these patches passed fstests too.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/11/12
> The details are in the changelogs of each patch, and results are on the
> changelog of the last patch in the series. There's still much more room
> for further improvement, but that won't happen too soon as it will require
> broader changes outside the lseek and fiemap code.
>
> Filipe Manana (9):
> btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
> btrfs: add an early exit when searching for delalloc range for lseek/fiemap
> btrfs: skip unnecessary delalloc searches during lseek/fiemap
> btrfs: search for delalloc more efficiently during lseek/fiemap
> btrfs: remove no longer used btrfs_next_extent_map()
> btrfs: allow passing a cached state record to count_range_bits()
> btrfs: update stale comment for count_range_bits()
> btrfs: use cached state when looking for delalloc ranges with fiemap
> btrfs: use cached state when looking for delalloc ranges with lseek
>
> fs/btrfs/ctree.h | 1 +
> fs/btrfs/extent-io-tree.c | 73 +++++++++++--
> fs/btrfs/extent-io-tree.h | 10 +-
> fs/btrfs/extent_io.c | 30 +++---
> fs/btrfs/extent_map.c | 29 -----
> fs/btrfs/extent_map.h | 2 -
> fs/btrfs/file.c | 221 ++++++++++++++++++--------------------
> fs/btrfs/file.h | 1 +
> fs/btrfs/inode.c | 2 +-
> 9 files changed, 190 insertions(+), 179 deletions(-)
>
> --
> 2.35.1
* Re: [PATCH 0/9] btrfs: more optimizations for lseek and fiemap
2022-11-11 11:50 [PATCH 0/9] btrfs: more optimizations for lseek and fiemap fdmanana
` (10 preceding siblings ...)
2022-11-12 1:44 ` Wang Yugui
@ 2022-12-13 9:11 ` Naohiro Aota
2022-12-13 10:39 ` Filipe Manana
11 siblings, 1 reply; 16+ messages in thread
From: Naohiro Aota @ 2022-12-13 9:11 UTC (permalink / raw)
To: fdmanana@kernel.org; +Cc: linux-btrfs@vger.kernel.org
Hello, Filipe
Thank you for this series. We had a performance issue on a HDFS write only
workload with the zoned mode. That issue is solved thanks to this
series. In addition, it improves the performance even on the regular
btrfs. So, this series might be worth backporting. I'd like to share the
result.
The test workload is a system call replay, which mimics a HDFS write only
workload. The workload does buffered writes with pwrite() to multiple files
and does not issue any sync operation, so the performance number is the
number of bytes issued with pwrite() divided by the total runtime without
sync. The total pwrite() size is about 60GB.
Before the series:
- Regular: 307.68 MB/s
- Zoned emulation: 269.32 MB/s
- Zoned: 231.93 MB/s
At the first patch of the series:
- Regular: 349.84 MB/s (+13.70%)
- Zoned emulation: 363.48 MB/s (+34.96%)
- Zoned: 326.40 MB/s (+40.73%)
After the series (at b7af0635c87f ("btrfs: print transaction aborted messages with an error level")):
- Regular: 350.22 MB/s (+13.83%)
- Zoned emulation: 360.96 MB/s (+34.03%)
- Zoned: 326.24 MB/s (+40.66%)
So, before the first patch, the zoned case had poor performance (-12% for
the zoned emulation and -24% for the zoned case) compared to the regular
case. Since the regular case and the zoned emulation case ran on the same
device, this shows the zoned mode's penalty.
This series improves the performance in all the cases. But the zoned
cases see far larger improvements, and the series somehow resolves the
performance degradation.
Also, even with only the first patch, the performance greatly improved. So,
interestingly, the first patch, which touches only the readpage() side, is
the key to fixing the issue in our case.
On Fri, Nov 11, 2022 at 11:50:26AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Here follows a few more optimizations for lseek and fiemap. Starting with
> coreutils 9.0, cp no longer uses fiemap to determine where holes are in a
> file, instead it uses lseek's SEEK_HOLE and SEEK_DATA modes of operation.
> For very sparse files, or files with lot of delalloc, this can be very
> slow (when compared to ext4 or xfs). This aims to improve that.
>
> The cp pattern is not specific to cp, it's common to iterate over all
> allocated regions of a file sequentially with SEEK_HOLE and SEEK_DATA,
> for either the whole file or just a section. Another popular program that
> does the same is tar, when using its --sparse / -S command line option
> (I use it like that for doing vm backups for example).
>
> The details are in the changelogs of each patch, and results are on the
> changelog of the last patch in the series. There's still much more room
> for further improvement, but that won't happen too soon as it will require
> broader changes outside the lseek and fiemap code.
>
> Filipe Manana (9):
> btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
> btrfs: add an early exit when searching for delalloc range for lseek/fiemap
> btrfs: skip unnecessary delalloc searches during lseek/fiemap
> btrfs: search for delalloc more efficiently during lseek/fiemap
> btrfs: remove no longer used btrfs_next_extent_map()
> btrfs: allow passing a cached state record to count_range_bits()
> btrfs: update stale comment for count_range_bits()
> btrfs: use cached state when looking for delalloc ranges with fiemap
> btrfs: use cached state when looking for delalloc ranges with lseek
>
> fs/btrfs/ctree.h | 1 +
> fs/btrfs/extent-io-tree.c | 73 +++++++++++--
> fs/btrfs/extent-io-tree.h | 10 +-
> fs/btrfs/extent_io.c | 30 +++---
> fs/btrfs/extent_map.c | 29 -----
> fs/btrfs/extent_map.h | 2 -
> fs/btrfs/file.c | 221 ++++++++++++++++++--------------------
> fs/btrfs/file.h | 1 +
> fs/btrfs/inode.c | 2 +-
> 9 files changed, 190 insertions(+), 179 deletions(-)
>
> --
> 2.35.1
>
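For reference, the SEEK_HOLE/SEEK_DATA iteration pattern that cp and tar use, as described in the quoted cover letter, can be sketched like this (a minimal sketch; hole reporting granularity and delalloc handling depend on the filesystem, and filesystems without SEEK_HOLE support report the whole file as one data region):

```python
import os

def data_regions(path):
    """Yield (offset, length) pairs for the allocated (data) regions of a
    file, alternating SEEK_DATA and SEEK_HOLE lseek() calls."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                data = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError:
                # ENXIO: no data past this offset (trailing hole).
                break
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            yield (data, hole - data)
            offset = hole
    finally:
        os.close(fd)
```

Each loop iteration ends up in the filesystem's llseek hole/data search (including delalloc lookups), which is the path this series makes cheaper on btrfs.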
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: [PATCH 0/9] btrfs: more optimizations for lseek and fiemap
2022-12-13 9:11 ` Naohiro Aota
@ 2022-12-13 10:39 ` Filipe Manana
2022-12-16 6:53 ` Naohiro Aota
0 siblings, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2022-12-13 10:39 UTC (permalink / raw)
To: Naohiro Aota; +Cc: linux-btrfs@vger.kernel.org
On Tue, Dec 13, 2022 at 9:11 AM Naohiro Aota <Naohiro.Aota@wdc.com> wrote:
>
> Hello, Filipe
Hi Aota,
>
> Thank you for this series. We had a performance issue on a HDFS write only
> workload with the zoned mode. That issue is solved thanks to this
> series. In addition, it improves the performance even on the regular
> btrfs. So, this series might be worth backporting. I'd like to share the
> result.
I'm afraid backporting the whole series is not an option.
It depends on two very large patchsets that largely rewrite fiemap and lseek,
focused mostly on performance improvements, but which also fix a bug with
fiemap extent sharedness reporting.
In reverse order:
https://lore.kernel.org/linux-btrfs/cover.1665490018.git.fdmanana@suse.com/
(introduced in this merge window)
https://lore.kernel.org/linux-btrfs/cover.1662022922.git.fdmanana@suse.com/
(introduced in 6.1)
>
> The test workload is a system call replay, which mimics a HDFS write only
> workload. The workload does buffered writes with pwrite() to multiple files
> and does not issue any sync operation, so the performance number is the
> number of bytes issued with pwrite() divided by the total runtime without
> sync. The total pwrite() size is about 60GB.
>
> Before the series:
> - Regular: 307.68 MB/s
> - Zoned emulation: 269.32 MB/s
> - Zoned: 231.93 MB/s
>
> At the first patch of the series:
> - Regular: 349.84 MB/s (+13.70%)
> - Zoned emulation: 363.48 MB/s (+34.96%)
> - Zoned: 326.40 MB/s (+40.73%)
>
> After the series (at b7af0635c87f ("btrfs: print transaction aborted messages with an error level")):
> - Regular: 350.22 MB/s (+13.83%)
> - Zoned emulation: 360.96 MB/s (+34.03%)
> - Zoned: 326.24 MB/s (+40.66%)
>
> So, before the first patch, the zoned case had poor performance (-12% on
> the zoned emulation and -24% on the zoned) compared to the regular case. As
> the regular case and the zoned emulation case ran on the same device, it
> shows the zoned mode's penalty.
>
> This series improves the performance for all the cases. But, the zoned
> cases have far better improvements and it somehow solved the performance
> degradation.
>
> Also, even only with the first patch, the performance greatly improved. So,
> interestingly, the first patch, touching only the readpage() side is the
> key to fixing the issue for our case.
Backporting the first patch only seems reasonable to me, and it
depends on the following one that landed in 6.1:
52b029f42751 ("btrfs: remove unnecessary EXTENT_UPTODATE state in
buffered I/O path")
(it's mentioned in the changelog)
It seems your workload is probably not fiemap/lseek heavy (I only see
mentions of pwrite).
Thanks for the report!
>
> On Fri, Nov 11, 2022 at 11:50:26AM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Here follows a few more optimizations for lseek and fiemap. Starting with
> > coreutils 9.0, cp no longer uses fiemap to determine where holes are in a
> > file, instead it uses lseek's SEEK_HOLE and SEEK_DATA modes of operation.
> > For very sparse files, or files with lot of delalloc, this can be very
> > slow (when compared to ext4 or xfs). This aims to improve that.
> >
> > The cp pattern is not specific to cp, it's common to iterate over all
> > allocated regions of a file sequentially with SEEK_HOLE and SEEK_DATA,
> > for either the whole file or just a section. Another popular program that
> > does the same is tar, when using its --sparse / -S command line option
> > (I use it like that for doing vm backups for example).
> >
> > The details are in the changelogs of each patch, and results are on the
> > changelog of the last patch in the series. There's still much more room
> > for further improvement, but that won't happen too soon as it will require
> > broader changes outside the lseek and fiemap code.
> >
> > Filipe Manana (9):
> > btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
> > btrfs: add an early exit when searching for delalloc range for lseek/fiemap
> > btrfs: skip unnecessary delalloc searches during lseek/fiemap
> > btrfs: search for delalloc more efficiently during lseek/fiemap
> > btrfs: remove no longer used btrfs_next_extent_map()
> > btrfs: allow passing a cached state record to count_range_bits()
> > btrfs: update stale comment for count_range_bits()
> > btrfs: use cached state when looking for delalloc ranges with fiemap
> > btrfs: use cached state when looking for delalloc ranges with lseek
> >
> > fs/btrfs/ctree.h | 1 +
> > fs/btrfs/extent-io-tree.c | 73 +++++++++++--
> > fs/btrfs/extent-io-tree.h | 10 +-
> > fs/btrfs/extent_io.c | 30 +++---
> > fs/btrfs/extent_map.c | 29 -----
> > fs/btrfs/extent_map.h | 2 -
> > fs/btrfs/file.c | 221 ++++++++++++++++++--------------------
> > fs/btrfs/file.h | 1 +
> > fs/btrfs/inode.c | 2 +-
> > 9 files changed, 190 insertions(+), 179 deletions(-)
> >
> > --
> > 2.35.1
> >
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: [PATCH 0/9] btrfs: more optimizations for lseek and fiemap
2022-12-13 10:39 ` Filipe Manana
@ 2022-12-16 6:53 ` Naohiro Aota
2023-01-23 2:13 ` Naohiro Aota
0 siblings, 1 reply; 16+ messages in thread
From: Naohiro Aota @ 2022-12-16 6:53 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs@vger.kernel.org
On Tue, Dec 13, 2022 at 10:39:07AM +0000, Filipe Manana wrote:
> On Tue, Dec 13, 2022 at 9:11 AM Naohiro Aota <Naohiro.Aota@wdc.com> wrote:
> >
> > Hello, Filipe
>
> Hi Aota,
>
> >
> > Thank you for this series. We had a performance issue on a HDFS write only
> > workload with the zoned mode. That issue is solved thanks to this
> > series. In addition, it improves the performance even on the regular
> > btrfs. So, this series might be worth backporting. I'd like to share the
> > result.
>
> I'm afraid backporting the whole series is not an option.
> It depends on two very large patchsets that largely rewrite fiemap and lseek,
> focused more on performance improvement but also fix one bug with fiemap
> extent sharedness reporting.
>
> In reverse order:
>
> https://lore.kernel.org/linux-btrfs/cover.1665490018.git.fdmanana@suse.com/
> (introduced in this merge window)
>
> https://lore.kernel.org/linux-btrfs/cover.1662022922.git.fdmanana@suse.com/
> (introduced in 6.1)
I see. These are huge.
> >
> > The test workload is a system call replay, which mimics a HDFS write only
> > workload. The workload does buffered writes with pwrite() to multiple files
> > and does not issue any sync operation, so the performance number is the
> > number of bytes issued with pwrite() divided by the total runtime without
> > sync. The total pwrite() size is about 60GB.
> >
> > Before the series:
> > - Regular: 307.68 MB/s
> > - Zoned emulation: 269.32 MB/s
> > - Zoned: 231.93 MB/s
> >
> > At the first patch of the series:
> > - Regular: 349.84 MB/s (+13.70%)
> > - Zoned emulation: 363.48 MB/s (+34.96%)
> > - Zoned: 326.40 MB/s (+40.73%)
> >
> > After the series (at b7af0635c87f ("btrfs: print transaction aborted messages with an error level")):
> > - Regular: 350.22 MB/s (+13.83%)
> > - Zoned emulation: 360.96 MB/s (+34.03%)
> > - Zoned: 326.24 MB/s (+40.66%)
> >
> > So, before the first patch, the zoned case had poor performance (-12% on
> > the zoned emulation and -24% on the zoned) compared to the regular case. As
> > the regular case and the zoned emulation case ran on the same device, it
> > shows the zoned mode's penalty.
> >
> > This series improves the performance for all the cases. But, the zoned
> > cases have far better improvements and it somehow solved the performance
> > degradation.
> >
> > Also, even only with the first patch, the performance greatly improved. So,
> > interestingly, the first patch, touching only the readpage() side is the
> > key to fixing the issue for our case.
>
> Backporting the first patch only seems reasonable to me, and it
> depends on the following one that landed in 6.1:
>
> 52b029f42751 ("btrfs: remove unnecessary EXTENT_UPTODATE state in
> buffered I/O path")
>
> (it's mentioned in the changelog)
Thank you for the info. I backported 6adb9287e3f8 ("btrfs: do not use
GFP_ATOMIC in the read endio") and ccd2d08b91ee ("btrfs: remove leftover
setting of EXTENT_UPTODATE state in an inode's io_tree") on v6.1, which
already has 52b029f42751. However, the performance numbers didn't improve :(
I'll dig further to find which patch is necessary for the improvement.
> It seems your workload is probably not fiemap/lseek heavy (I only see
> mentions of pwrite).
> Thanks for the report!
Yes, it only does pwrite()s.
> >
> > On Fri, Nov 11, 2022 at 11:50:26AM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Here follows a few more optimizations for lseek and fiemap. Starting with
> > > coreutils 9.0, cp no longer uses fiemap to determine where holes are in a
> > > file, instead it uses lseek's SEEK_HOLE and SEEK_DATA modes of operation.
> > > For very sparse files, or files with lot of delalloc, this can be very
> > > slow (when compared to ext4 or xfs). This aims to improve that.
> > >
> > > The cp pattern is not specific to cp, it's common to iterate over all
> > > allocated regions of a file sequentially with SEEK_HOLE and SEEK_DATA,
> > > for either the whole file or just a section. Another popular program that
> > > does the same is tar, when using its --sparse / -S command line option
> > > (I use it like that for doing vm backups for example).
> > >
> > > The details are in the changelogs of each patch, and results are on the
> > > changelog of the last patch in the series. There's still much more room
> > > for further improvement, but that won't happen too soon as it will require
> > > broader changes outside the lseek and fiemap code.
> > >
> > > Filipe Manana (9):
> > > btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
> > > btrfs: add an early exit when searching for delalloc range for lseek/fiemap
> > > btrfs: skip unnecessary delalloc searches during lseek/fiemap
> > > btrfs: search for delalloc more efficiently during lseek/fiemap
> > > btrfs: remove no longer used btrfs_next_extent_map()
> > > btrfs: allow passing a cached state record to count_range_bits()
> > > btrfs: update stale comment for count_range_bits()
> > > btrfs: use cached state when looking for delalloc ranges with fiemap
> > > btrfs: use cached state when looking for delalloc ranges with lseek
> > >
> > > fs/btrfs/ctree.h | 1 +
> > > fs/btrfs/extent-io-tree.c | 73 +++++++++++--
> > > fs/btrfs/extent-io-tree.h | 10 +-
> > > fs/btrfs/extent_io.c | 30 +++---
> > > fs/btrfs/extent_map.c | 29 -----
> > > fs/btrfs/extent_map.h | 2 -
> > > fs/btrfs/file.c | 221 ++++++++++++++++++--------------------
> > > fs/btrfs/file.h | 1 +
> > > fs/btrfs/inode.c | 2 +-
> > > 9 files changed, 190 insertions(+), 179 deletions(-)
> > >
> > > --
> > > 2.35.1
> > >
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: [PATCH 0/9] btrfs: more optimizations for lseek and fiemap
2022-12-16 6:53 ` Naohiro Aota
@ 2023-01-23 2:13 ` Naohiro Aota
0 siblings, 0 replies; 16+ messages in thread
From: Naohiro Aota @ 2023-01-23 2:13 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs@vger.kernel.org
On Fri, Dec 16, 2022 at 06:53:36AM +0000, Naohiro Aota wrote:
> On Tue, Dec 13, 2022 at 10:39:07AM +0000, Filipe Manana wrote:
> > On Tue, Dec 13, 2022 at 9:11 AM Naohiro Aota <Naohiro.Aota@wdc.com> wrote:
> > >
> > > Hello, Filipe
> >
> > Hi Aota,
> >
> > >
> > > Thank you for this series. We had a performance issue on a HDFS write only
> > > workload with the zoned mode. That issue is solved thanks to this
> > > series. In addition, it improves the performance even on the regular
> > > btrfs. So, this series might be worth backporting. I'd like to share the
> > > result.
> >
> > I'm afraid backporting the whole series is not an option.
> > It depends on two very large patchsets that largely rewrite fiemap and lseek,
> > focused more on performance improvement but also fix one bug with fiemap
> > extent sharedness reporting.
> >
> > In reverse order:
> >
> > https://lore.kernel.org/linux-btrfs/cover.1665490018.git.fdmanana@suse.com/
> > (introduced in this merge window)
> >
> > https://lore.kernel.org/linux-btrfs/cover.1662022922.git.fdmanana@suse.com/
> > (introduced in 6.1)
>
> I see. These are huge.
>
> > >
> > > The test workload is a system call replay, which mimics a HDFS write only
> > > workload. The workload does buffered writes with pwrite() to multiple files
> > > and does not issue any sync operation, so the performance number is the
> > > number of bytes issued with pwrite() divided by the total runtime without
> > > sync. The total pwrite() size is about 60GB.
> > >
> > > Before the series:
> > > - Regular: 307.68 MB/s
> > > - Zoned emulation: 269.32 MB/s
> > > - Zoned: 231.93 MB/s
> > >
> > > At the first patch of the series:
> > > - Regular: 349.84 MB/s (+13.70%)
> > > - Zoned emulation: 363.48 MB/s (+34.96%)
> > > - Zoned: 326.40 MB/s (+40.73%)
> > >
> > > After the series (at b7af0635c87f ("btrfs: print transaction aborted messages with an error level")):
> > > - Regular: 350.22 MB/s (+13.83%)
> > > - Zoned emulation: 360.96 MB/s (+34.03%)
> > > - Zoned: 326.24 MB/s (+40.66%)
> > >
> > > So, before the first patch, the zoned case had poor performance (-12% on
> > > the zoned emulation and -24% on the zoned) compared to the regular case. As
> > > the regular case and the zoned emulation case ran on the same device, it
> > > shows the zoned mode's penalty.
> > >
> > > This series improves the performance for all the cases. But, the zoned
> > > cases have far better improvements and it somehow solved the performance
> > > degradation.
> > >
> > > Also, even only with the first patch, the performance greatly improved. So,
> > > interestingly, the first patch, touching only the readpage() side is the
> > > key to fixing the issue for our case.
> >
> > Backporting the first patch only seems reasonable to me, and it
> > depends on the following one that landed in 6.1:
> >
> > 52b029f42751 ("btrfs: remove unnecessary EXTENT_UPTODATE state in
> > buffered I/O path")
> >
> > (it's mentioned in the changelog)
>
> Thank you for the info. I backported 6adb9287e3f8 ("btrfs: do not use
> GFP_ATOMIC in the read endio") and ccd2d08b91ee ("btrfs: remove leftover
> setting of EXTENT_UPTODATE state in an inode's io_tree") on v6.1, which
> already has 52b029f42751. However, the performance number didn't go well :(
Actually, this was my mistake. I forgot to apply a patch of mine that
removes a performance limitation. With that in place, 6adb9287e3f8 ("btrfs:
do not use GFP_ATOMIC in the read endio") and ccd2d08b91ee ("btrfs: remove
leftover setting of EXTENT_UPTODATE state in an inode's io_tree") are
enough to solve the performance issue on v6.1.
^ permalink raw reply [flat|nested] 16+ messages in thread