* [PATCH v8] btrfs: prefer to allocate larger folio for metadata
@ 2024-07-25 10:11 Qu Wenruo
From: Qu Wenruo @ 2024-07-25 10:11 UTC (permalink / raw)
To: linux-btrfs
Since btrfs metadata is always a fixed size (nodesize, determined at
mkfs time, defaulting to 16K), and btrfs has full control of its folios
(reads are triggered internally, with no read/readahead callbacks), btrfs
is the best place to experiment with larger folios.
To enable larger folios, btrfs has to meet the following conditions:
- The extent buffer start is aligned to nodesize
This should be the common case for any btrfs in the last 5 years.
- The nodesize is larger than page size
- MM layer can fulfill our larger folio allocation
The larger folio will cover exactly the metadata size (nodesize).
If any of these conditions is not met, we just fall back to page sized
folios and proceed as usual.
This means we can have mixed orders for btrfs metadata.
Thus there are several new corner cases with mixed orders:
1) New filemap_add_folio() -EEXIST failure cases
For mixed order cases, filemap_add_folio() can return -EEXIST
while filemap_lock_folio() returns -ENOENT.
We can only retry several times before falling back to order 0
folios.
2) Existing folio size may be different than the one we allocated
This is after the existing eb checks.
2.1) The existing folio is larger than the allocated one
Need to free all allocated folios, and use the existing larger
folio instead.
2.2) The existing folio has the same size
Free the allocated one and reuse the page cache.
This is the existing path.
2.3) The existing folio is smaller than the allocated one
Fall back to re-allocate order 0 folios instead.
Otherwise all the needed infrastructure is already here; we only need
to try allocating a larger folio as our first attempt in
alloc_eb_folio_array().
For now, the higher order allocation is only attempted for debug
builds, until we have enough test coverage to push it to end users.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v8:
- Drop the memcgroup optimization dependency
Opting out of memcgroup will be pushed as an independent patchset
instead. It's not related to the soft lockup.
- Fix a soft lockup caused by mixed folio orders
|<- folio ->|
| | |//|//| |//| is the existing page cache
In the above case, filemap_add_folio() will always return -EEXIST
while filemap_lock_folio() returns -ENOENT, which can lead to an
infinite loop.
Fix it by retrying only 5 times for larger folios, then falling back
to order 0 folios.
- Slightly reword the commit message
Make it shorter and better organized.
v7:
- Fix a line accidentally removed by a previous modification attempt
Previously I was moving that line to the common branch to
unconditionally define the root_mem_cgroup pointer.
That was later discarded in favor of a macro definition, but I
forgot to add back the original line.
v6:
- Add a new root_mem_cgroup definition for CONFIG_MEMCG=n cases
So that users of root_mem_cgroup no longer need to check
CONFIG_MEMCG.
This fixes the compile error for CONFIG_MEMCG=n.
- Slight rewording of the 2nd patch
v5:
- Use root memcgroup to attach folios to btree inode filemap
- Only try higher order folio once without NOFAIL nor extra retry
v4:
- Hide the feature behind CONFIG_BTRFS_DEBUG
So that end users won't be affected (aka, still per-page based
allocation) meanwhile we can do more testing on this new behavior.
v3:
- Rebased to the latest for-next branch
- Use PAGE_ALLOC_COSTLY_ORDER to determine whether to use __GFP_NOFAIL
- Add a dependency MM patch "mm/page_alloc: unify the warning on NOFAIL
and high order allocation"
This allows us to use NOFAIL for nodesizes up to 32K, and makes sure
that for the default 16K nodesize, all metadata goes into 16K folios
v2:
- Rebased to handle the change in "btrfs: cache folio size and shift in extent_buffer"
---
fs/btrfs/extent_io.c | 122 ++++++++++++++++++++++++++++++-------------
1 file changed, 86 insertions(+), 36 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index aa7f8148cd0d..0beebcb9be77 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -719,12 +719,28 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
*
* For now, the folios populated are always in order 0 (aka, single page).
*/
-static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail)
+static int alloc_eb_folio_array(struct extent_buffer *eb, int order,
+ bool nofail)
{
struct page *page_array[INLINE_EXTENT_BUFFER_PAGES] = { 0 };
int num_pages = num_extent_pages(eb);
int ret;
+ if (order) {
+ gfp_t gfp;
+
+ if (order > 0)
+ gfp = GFP_NOFS | __GFP_NORETRY | __GFP_NOWARN;
+ else
+ gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
+ eb->folios[0] = folio_alloc(gfp, order);
+ if (likely(eb->folios[0])) {
+ eb->folio_size = folio_size(eb->folios[0]);
+ eb->folio_shift = folio_shift(eb->folios[0]);
+ return 0;
+ }
+ /* Fallback to 0 order (single page) allocation. */
+ }
ret = btrfs_alloc_page_array(num_pages, page_array, nofail);
if (ret < 0)
return ret;
@@ -2707,7 +2723,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
*/
set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
- ret = alloc_eb_folio_array(new, false);
+ ret = alloc_eb_folio_array(new, 0, false);
if (ret) {
btrfs_release_extent_buffer(new);
return NULL;
@@ -2740,7 +2756,7 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
if (!eb)
return NULL;
- ret = alloc_eb_folio_array(eb, false);
+ ret = alloc_eb_folio_array(eb, 0, false);
if (ret)
goto err;
@@ -2955,7 +2971,16 @@ static int check_eb_alignment(struct btrfs_fs_info *fs_info, u64 start)
return 0;
}
+static void free_all_eb_folios(struct extent_buffer *eb)
+{
+ for (int i = 0; i < INLINE_EXTENT_BUFFER_PAGES; i++) {
+ if (eb->folios[i])
+ folio_put(eb->folios[i]);
+ eb->folios[i] = NULL;
+ }
+}
+#define BTRFS_ADD_FOLIO_RETRY_LIMIT (5)
/*
* Return 0 if eb->folios[i] is attached to btree inode successfully.
* Return >0 if there is already another extent buffer for the range,
@@ -2973,6 +2998,8 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
struct address_space *mapping = fs_info->btree_inode->i_mapping;
const unsigned long index = eb->start >> PAGE_SHIFT;
struct folio *existing_folio = NULL;
+ const int eb_order = folio_order(eb->folios[0]);
+ int retried = 0;
int ret;
ASSERT(found_eb_ret);
@@ -2990,18 +3017,25 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
/* The page cache only exists for a very short time, just retry. */
if (IS_ERR(existing_folio)) {
existing_folio = NULL;
+ retried++;
+ /*
+ * We can have the following case:
+ * |<- folio ->|
+ * | | |//|//|
+ * Where |//| is the slot that we have a page cache.
+ *
+ * In above case, filemap_add_folio() will return -EEXIST,
+ * but filemap_lock_folio() will return -ENOENT.
+ * After several retries, we know it's the above case,
+ * and just fallback to order 0 folios instead.
+ */
+ if (eb_order > 0 && retried > BTRFS_ADD_FOLIO_RETRY_LIMIT) {
+ ASSERT(i == 0);
+ return -EAGAIN;
+ }
goto retry;
}
- /* For now, we should only have single-page folios for btree inode. */
- ASSERT(folio_nr_pages(existing_folio) == 1);
-
- if (folio_size(existing_folio) != eb->folio_size) {
- folio_unlock(existing_folio);
- folio_put(existing_folio);
- return -EAGAIN;
- }
-
finish:
spin_lock(&mapping->i_private_lock);
if (existing_folio && fs_info->nodesize < PAGE_SIZE) {
@@ -3010,6 +3044,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
eb->folios[i] = existing_folio;
} else if (existing_folio) {
struct extent_buffer *existing_eb;
+ int existing_order = folio_order(existing_folio);
existing_eb = grab_extent_buffer(fs_info,
folio_page(existing_folio, 0));
@@ -3021,9 +3056,34 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
folio_put(existing_folio);
return 1;
}
- /* The extent buffer no longer exists, we can reuse the folio. */
- __free_page(folio_page(eb->folios[i], 0));
- eb->folios[i] = existing_folio;
+ if (existing_order > eb_order) {
+ /*
+ * The existing one has higher order, we need to drop
+ * all eb folios before reusing it.
+ * And this should only happen for the first folio.
+ */
+ ASSERT(i == 0);
+ free_all_eb_folios(eb);
+ eb->folios[i] = existing_folio;
+ } else if (existing_order == eb_order) {
+ /*
+ * Can safely reuse the filemap folio, just
+ * release the eb one.
+ */
+ folio_put(eb->folios[i]);
+ eb->folios[i] = existing_folio;
+ } else {
+ /*
+ * The existing one has lower order.
+ *
+ * Just retry and fallback to order 0.
+ */
+ ASSERT(i == 0);
+ folio_unlock(existing_folio);
+ folio_put(existing_folio);
+ spin_unlock(&mapping->i_private_lock);
+ return -EAGAIN;
+ }
}
eb->folio_size = folio_size(eb->folios[i]);
eb->folio_shift = folio_shift(eb->folios[i]);
@@ -3056,6 +3116,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
u64 lockdep_owner = owner_root;
bool page_contig = true;
int uptodate = 1;
+ int order = 0;
int ret;
if (check_eb_alignment(fs_info, start))
@@ -3072,6 +3133,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
btrfs_warn_32bit_limit(fs_info);
#endif
+ if (IS_ENABLED(CONFIG_BTRFS_DEBUG) && fs_info->nodesize > PAGE_SIZE &&
+ IS_ALIGNED(start, fs_info->nodesize))
+ order = ilog2(fs_info->nodesize >> PAGE_SHIFT);
+
eb = find_extent_buffer(fs_info, start);
if (eb)
return eb;
@@ -3106,7 +3171,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
reallocate:
/* Allocate all pages first. */
- ret = alloc_eb_folio_array(eb, true);
+ ret = alloc_eb_folio_array(eb, order, true);
if (ret < 0) {
btrfs_free_subpage(prealloc);
goto out;
@@ -3123,27 +3188,11 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
goto out;
}
- /*
- * TODO: Special handling for a corner case where the order of
- * folios mismatch between the new eb and filemap.
- *
- * This happens when:
- *
- * - the new eb is using higher order folio
- *
- * - the filemap is still using 0-order folios for the range
- * This can happen at the previous eb allocation, and we don't
- * have higher order folio for the call.
- *
- * - the existing eb has already been freed
- *
- * In this case, we have to free the existing folios first, and
- * re-allocate using the same order.
- * Thankfully this is not going to happen yet, as we're still
- * using 0-order folios.
- */
+ /* Need to fallback to 0 order folios. */
if (unlikely(ret == -EAGAIN)) {
- ASSERT(0);
+ ASSERT(order > 0);
+ order = 0;
+ free_all_eb_folios(eb);
goto reallocate;
}
attached++;
@@ -3154,6 +3203,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
* and free the allocated page.
*/
folio = eb->folios[i];
+ num_folios = num_extent_folios(eb);
WARN_ON(btrfs_folio_test_dirty(fs_info, folio, eb->start, eb->len));
/*
--
2.45.2
* Re: [PATCH v8] btrfs: prefer to allocate larger folio for metadata
2024-07-25 10:11 [PATCH v8] btrfs: prefer to allocate larger folio for metadata Qu Wenruo
@ 2024-07-26 14:57 ` Josef Bacik
From: Josef Bacik @ 2024-07-26 14:57 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Jul 25, 2024 at 07:41:16PM +0930, Qu Wenruo wrote:
> Since btrfs metadata is always in fixed size (nodesize, determined at
> mkfs time, default to 16K), and btrfs has the full control of the folios
> (read is triggered internally, no read/readahead call backs), it's the
> best location to experimental larger folios inside btrfs.
> [...]
> @@ -719,12 +719,28 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
> *
> * For now, the folios populated are always in order 0 (aka, single page).
> */
> -static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail)
> +static int alloc_eb_folio_array(struct extent_buffer *eb, int order,
> + bool nofail)
> {
> struct page *page_array[INLINE_EXTENT_BUFFER_PAGES] = { 0 };
> int num_pages = num_extent_pages(eb);
> int ret;
>
> + if (order) {
> + gfp_t gfp;
> +
> + if (order > 0)
> + gfp = GFP_NOFS | __GFP_NORETRY | __GFP_NOWARN;
> + else
> + gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
This check is useless, we're already checking if (order) above.
> + eb->folios[0] = folio_alloc(gfp, order);
> + if (likely(eb->folios[0])) {
> + eb->folio_size = folio_size(eb->folios[0]);
> + eb->folio_shift = folio_shift(eb->folios[0]);
> + return 0;
> + }
> + /* Fallback to 0 order (single page) allocation. */
> + }
> ret = btrfs_alloc_page_array(num_pages, page_array, nofail);
> if (ret < 0)
> return ret;
> [...]
> @@ -2990,18 +3017,25 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
> /* The page cache only exists for a very short time, just retry. */
> if (IS_ERR(existing_folio)) {
> existing_folio = NULL;
> + retried++;
> + /*
> + * We can have the following case:
> + * |<- folio ->|
> + * | | |//|//|
> + * Where |//| is the slot that we have a page cache.
> + *
> + * In above case, filemap_add_folio() will return -EEXIST,
> + * but filemap_lock_folio() will return -ENOENT.
> + * After several retries, we know it's the above case,
> + * and just fallback to order 0 folios instead.
> + */
> + if (eb_order > 0 && retried > BTRFS_ADD_FOLIO_RETRY_LIMIT) {
> + ASSERT(i == 0);
> + return -EAGAIN;
> + }
Ok hold on, I'm just now looking at this code, and am completely confused.
filemap_add_folio inserts our new folio into the mapping and returns with it
locked. Under what circumstances do we end up with filemap_lock_folio()
returning ENOENT?
I understand in this larger folio case where this could happen, but this code
exists today where we're only allocating 0 order folios. So does this actually
happen today, or were you future proofing it?
Also it seems like a super bad, no good thing that we can end up being allowed
to insert a folio that overlaps other folios already in the mapping. So if that
can happen, that seems like the thing that needs to be addressed generically.
This sets off huge alarm bells for me, I'm going to need a more thorough
explanation of how this happens and why it's ok in the normal case. Thanks,
Josef
* Re: [PATCH v8] btrfs: prefer to allocate larger folio for metadata
2024-07-26 14:57 ` Josef Bacik
@ 2024-07-26 22:32 ` Qu Wenruo
From: Qu Wenruo @ 2024-07-26 22:32 UTC (permalink / raw)
To: Josef Bacik, Qu Wenruo; +Cc: linux-btrfs
On 2024/7/27 00:27, Josef Bacik wrote:
> On Thu, Jul 25, 2024 at 07:41:16PM +0930, Qu Wenruo wrote:
>> Since btrfs metadata is always in fixed size (nodesize, determined at
>> mkfs time, default to 16K), and btrfs has the full control of the folios
>> (read is triggered internally, no read/readahead call backs), it's the
>> best location to experimental larger folios inside btrfs.
>> [...]
>> @@ -2990,18 +3017,25 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
>> /* The page cache only exists for a very short time, just retry. */
>> if (IS_ERR(existing_folio)) {
>> existing_folio = NULL;
>> + retried++;
>> + /*
>> + * We can have the following case:
>> + * |<- folio ->|
>> + * | | |//|//|
>> + * Where |//| is the slot that we have a page cache.
>> + *
>> + * In above case, filemap_add_folio() will return -EEXIST,
>> + * but filemap_lock_folio() will return -ENOENT.
>> + * After several retries, we know it's the above case,
>> + * and just fallback to order 0 folios instead.
>> + */
>> + if (eb_order > 0 && retried > BTRFS_ADD_FOLIO_RETRY_LIMIT) {
>> + ASSERT(i == 0);
>> + return -EAGAIN;
>> + }
>
> Ok hold on, I'm just now looking at this code, and am completely confused.
>
> filemap_add_folio inserts our new folio into the mapping and returns with it
> locked. Under what circumstances do we end up with filemap_lock_folio()
> returning ENOENT?
Remember we go to filemap_lock_folio() only when filemap_add_folio()
returns -EEXIST.
There is an existing case (all order 0 folios) like this:

filemap_add_folio()           |
  returns -EEXIST             |
                              | Some page reclaim happens and
                              | chooses the page at exactly the
                              | same index to be reclaimed
filemap_lock_folio()          |
  returns -ENOENT             |
Remember that between the filemap_add_folio() and filemap_lock_folio()
there is nothing preventing page reclaim from happening.
>
> I understand in this larger folio case where this could happen, but this code
> exists today where we're only allocating 0 order folios. So does this actually
> happen today, or were you future proofing it?
It's already happening even for order 0 folios, as we do not hold any
lock on the address space.
>
> Also it seems like a super bad, no good thing that we can end up being allowed
> to insert a folio that overlaps other folios already in the mapping. So if that
> can happen, that seems like the thing that needs to be addressed generically.
For the generic __filemap_get_folio() with FGP_CREAT, that's exactly
what it does: just retry with a smaller folio.
And in that case, it retries when filemap_add_folio() hits -EEXIST,
just like us.
>
> This sets off huge alarm bells for me, I'm going to need a more thorough
> explanation of how this happens and why it's ok in the normal case. Thanks,
Sorry, I'm not seeing what you're concerned about.
If you mean how we could have filemap_add_folio() succeed but
filemap_lock_folio() fail, then it's impossible.
Because if filemap_add_folio() succeeded, we would never hit the
remaining code.
If you're concerned about the seemingly infinite retry loop, it's the
same as in __filemap_get_folio().
If you're talking about the fact that we are unable to distinguish a
real EEXIST from the reclaim race: yes, we do not have a good way to
do that, and neither does __filemap_get_folio(), except that it always
tries a lower folio order, which is not something we can do easily here.
Thanks,
Qu
>
> Josef
>
* Re: [PATCH v8] btrfs: prefer to allocate larger folio for metadata
2024-07-26 22:32 ` Qu Wenruo
@ 2024-07-31 15:11 ` Josef Bacik
From: Josef Bacik @ 2024-07-31 15:11 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs
On Sat, Jul 27, 2024 at 08:02:26AM +0930, Qu Wenruo wrote:
>
>
> On 2024/7/27 00:27, Josef Bacik wrote:
> > On Thu, Jul 25, 2024 at 07:41:16PM +0930, Qu Wenruo wrote:
> > > Since btrfs metadata is always in fixed size (nodesize, determined at
> > > mkfs time, default to 16K), and btrfs has the full control of the folios
> > > (read is triggered internally, no read/readahead call backs), it's the
> > > best location to experimental larger folios inside btrfs.
> > >
> > > To enable larger folios, the btrfs has to meet the following conditions:
> > >
> > > - The extent buffer start is aligned to nodesize
> > > This should be the common case for any btrfs in the last 5 years.
> > >
> > > - The nodesize is larger than page size
> > >
> > > - MM layer can fulfill our larger folio allocation
> > > The larger folio will cover exactly the metadata size (nodesize).
> > >
> > > If any of the condition is not met, we just fall back to page sized
> > > folio and go as usual.
> > > This means, we can have mixed orders for btrfs metadata.
> > >
> > > Thus there are several new corner cases with the mixed orders:
> > >
> > > 1) New filemap_add_folio() -EEXIST failure cases
> > > For mixed order cases, filemap_add_folio() can return -EEXIST
> > > meanwhile filemap_lock_folio() returns -ENOENT.
> > > We can only retry several times, before falling back to 0 order
> > > folios.
> > >
> > > 2) Existing folio size may be different than the one we allocated
> > > This is after the existing eb checks.
> > >
> > > 2.1) The existing folio is larger than the allocated one
> > > Need to free all allocated folios, and use the existing larger
> > > folio instead.
> > >
> > > 2.2) The existing folio has the same size
> > > Free the allocated one and reuse the page cache.
> > > This is the existing path.
> > >
> > > 2.3) The existing folio is smaller than the allocated one
> > > Fall back to re-allocate order 0 folios instead.
> > >
> > > Otherwise all the needed infrastructure is already here; we only need to
> > > try allocating a larger folio first in alloc_eb_folio_array().
> > >
> > > For now, the higher order allocation is only a best-effort attempt for
> > > debug builds, until we have enough test coverage to push it to end
> > > users.
> > >
> > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > > ---
> > > Changelog:
> > > v8:
> > > - Drop the memcg optimization as a dependency
> > > Opting out of memcg will be pushed as an independent patchset
> > > instead. It's not related to the soft lockup.
> > >
> > > - Fix a soft lockup caused by mixed folio orders
> > >   |<-    folio    ->|
> > >   |  |  |//|//|         (|//| is the existing page cache)
> > >   In the above case, filemap_add_folio() will always return -EEXIST
> > >   but filemap_lock_folio() will return -ENOENT,
> > >   which can lead to an endless loop.
> > >   Fix it by retrying only 5 times for larger folios, then falling back
> > >   to order 0 folios.
> > >
> > > - Slightly rewording the commit messages
> > > Make it shorter and better organized.
> > >
> > > v7:
> > > - Fix an accidentally removed line caused by previous modification
> > > attempt
> > > Previously I was moving that line to the common branch to
> > > unconditionally define root_mem_cgroup pointer.
> > > But that's later discarded and changed to use macro definition, but
> > > forgot to add back the original line.
> > >
> > > v6:
> > > - Add a new root_mem_cgroup definition for CONFIG_MEMCG=n cases
> > > So that users of root_mem_cgroup no longer need to check
> > > CONFIG_MEMCG.
> > > This is to fix the compile error for CONFIG_MEMCG=n cases.
> > >
> > > - Slight rewording of the 2nd patch
> > >
> > > v5:
> > > - Use root memcgroup to attach folios to btree inode filemap
> > > - Only try higher order folio once without NOFAIL nor extra retry
> > >
> > > v4:
> > > - Hide the feature behind CONFIG_BTRFS_DEBUG
> > > So that end users won't be affected (aka, still per-page based
> > > allocation) meanwhile we can do more testing on this new behavior.
> > >
> > > v3:
> > > - Rebased to the latest for-next branch
> > > - Use PAGE_ALLOC_COSTLY_ORDER to determine whether to use __GFP_NOFAIL
> > > - Add a dependency MM patch "mm/page_alloc: unify the warning on NOFAIL
> > > and high order allocation"
> > > This allows us to use NOFAIL up to 32K nodesize, and makes sure for
> > > default 16K nodesize, all metadata would go 16K folios
> > >
> > > v2:
> > > - Rebased to handle the change in "btrfs: cache folio size and shift in extent_buffer"
> > > ---
> > > fs/btrfs/extent_io.c | 122 ++++++++++++++++++++++++++++++-------------
> > > 1 file changed, 86 insertions(+), 36 deletions(-)
> > >
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index aa7f8148cd0d..0beebcb9be77 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -719,12 +719,28 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
> > > *
> > > * For now, the folios populated are always in order 0 (aka, single page).
> > > */
> > > -static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail)
> > > +static int alloc_eb_folio_array(struct extent_buffer *eb, int order,
> > > + bool nofail)
> > > {
> > > struct page *page_array[INLINE_EXTENT_BUFFER_PAGES] = { 0 };
> > > int num_pages = num_extent_pages(eb);
> > > int ret;
> > >
> > > + if (order) {
> > > + gfp_t gfp;
> > > +
> > > + if (order > 0)
> > > + gfp = GFP_NOFS | __GFP_NORETRY | __GFP_NOWARN;
> > > + else
> > > + gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
> >
> > This check is useless, we're already checking if (order) above.
> >
> > > + eb->folios[0] = folio_alloc(gfp, order);
> > > + if (likely(eb->folios[0])) {
> > > + eb->folio_size = folio_size(eb->folios[0]);
> > > + eb->folio_shift = folio_shift(eb->folios[0]);
> > > + return 0;
> > > + }
> > > + /* Fallback to 0 order (single page) allocation. */
> > > + }
> > > ret = btrfs_alloc_page_array(num_pages, page_array, nofail);
> > > if (ret < 0)
> > > return ret;
> > > @@ -2707,7 +2723,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
> > > */
> > > set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
> > >
> > > - ret = alloc_eb_folio_array(new, false);
> > > + ret = alloc_eb_folio_array(new, 0, false);
> > > if (ret) {
> > > btrfs_release_extent_buffer(new);
> > > return NULL;
> > > @@ -2740,7 +2756,7 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
> > > if (!eb)
> > > return NULL;
> > >
> > > - ret = alloc_eb_folio_array(eb, false);
> > > + ret = alloc_eb_folio_array(eb, 0, false);
> > > if (ret)
> > > goto err;
> > >
> > > @@ -2955,7 +2971,16 @@ static int check_eb_alignment(struct btrfs_fs_info *fs_info, u64 start)
> > > return 0;
> > > }
> > >
> > > +static void free_all_eb_folios(struct extent_buffer *eb)
> > > +{
> > > + for (int i = 0; i < INLINE_EXTENT_BUFFER_PAGES; i++) {
> > > + if (eb->folios[i])
> > > + folio_put(eb->folios[i]);
> > > + eb->folios[i] = NULL;
> > > + }
> > > +}
> > >
> > > +#define BTRFS_ADD_FOLIO_RETRY_LIMIT (5)
> > > /*
> > > * Return 0 if eb->folios[i] is attached to btree inode successfully.
> > > * Return >0 if there is already another extent buffer for the range,
> > > @@ -2973,6 +2998,8 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
> > > struct address_space *mapping = fs_info->btree_inode->i_mapping;
> > > const unsigned long index = eb->start >> PAGE_SHIFT;
> > > struct folio *existing_folio = NULL;
> > > + const int eb_order = folio_order(eb->folios[0]);
> > > + int retried = 0;
> > > int ret;
> > >
> > > ASSERT(found_eb_ret);
> > > @@ -2990,18 +3017,25 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
> > > /* The page cache only exists for a very short time, just retry. */
> > > if (IS_ERR(existing_folio)) {
> > > existing_folio = NULL;
> > > + retried++;
> > > + /*
> > > + * We can have the following case:
> > > + * |<- folio ->|
> > > + * | | |//|//|
> > > + * Where |//| is the slot that we have a page cache.
> > > + *
> > > + * In the above case, filemap_add_folio() will return -EEXIST,
> > > + * but filemap_lock_folio() will return -ENOENT.
> > > + * After several retries, we know it's the above case,
> > > + * and just fall back to order 0 folios instead.
> > > + */
> > > + if (eb_order > 0 && retried > BTRFS_ADD_FOLIO_RETRY_LIMIT) {
> > > + ASSERT(i == 0);
> > > + return -EAGAIN;
> > > + }
> >
> > Ok hold on, I'm just now looking at this code, and am completely confused.
> >
> > filemap_add_folio inserts our new folio into the mapping and returns with it
> > locked. Under what circumstances do we end up with filemap_lock_folio()
> > returning ENOENT?
>
> Remember we call filemap_lock_folio() only when filemap_add_folio()
> returns -EEXIST.
>
> There is an existing case (all order 0 folios) like this:
>
> ------------------------------+-------------------------------------
> filemap_add_folio()           |
>   returns -EEXIST             |
>                               | Some page reclaim happens
>                               | Chooses the page at exactly the same
>                               | index to be reclaimed
> filemap_lock_folio()          |
>   returns -ENOENT             |
>
> Remember that between the filemap_add_folio() and filemap_lock_folio()
> there is nothing preventing page reclaim from happening.
>
>
> >
> > I understand in this larger folio case where this could happen, but this code
> > exists today where we're only allocating 0 order folios. So does this actually
> > happen today, or were you future proofing it?
>
> It's already happening even for order 0 folios, as we do not hold any
> lock on the address space.
>
> >
> > Also it seems like a super bad, no good thing that we can end up being allowed
> > to insert a folio that overlaps other folios already in the mapping. So if that
> > can happen, that seems like the thing that needs to be addressed generically.
>
> For the generic __filemap_get_folio() with FGP_CREAT, that's exactly
> what it does: just retry with a smaller folio.
>
> And in that case, they do the retry if filemap_add_folio() hits -EEXIST
> just like us.
You're right, I misread the code and thought this was the normal path, so my
whole rambling was just wrong, sorry about that.
However if we're allocating them at the order that our extent buffer is, this
means that we previously had an extent buffer that had 0 order pages, and these
just haven't been cleaned up yet.
This is kind of an ugly scenario, because now we're widening the possibility of
racing with somebody else trying to create an extent buffer here. I think the
only sane choice is to immediately switch to 0 order folios and carry on.
Thanks,
Josef
* Re: [PATCH v8] btrfs: prefer to allocate larger folio for metadata
2024-07-31 15:11 ` Josef Bacik
@ 2024-07-31 22:37 ` Qu Wenruo
0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2024-07-31 22:37 UTC (permalink / raw)
To: Josef Bacik; +Cc: Qu Wenruo, linux-btrfs
On 2024/8/1 00:41, Josef Bacik wrote:
> On Sat, Jul 27, 2024 at 08:02:26AM +0930, Qu Wenruo wrote:
[...]
>>
>> And in that case, they do the retry if filemap_add_folio() hits -EEXIST
>> just like us.
>
> You're right, I misread the code and thought this was the normal path, so my
> whole rambling was just wrong, sorry about that.
>
> However if we're allocating them at the order that our extent buffer is, this
> means that we previously had an extent buffer that had 0 order pages, and these
> just haven't been cleaned up yet.
Yes, and I believe the situation is caused by the use of the (__GFP_NORETRY |
__GFP_NOWARN) gfp flags,
and by the fact that btree_release_folio() only handles the given folio,
leaving the remaining ones still attached to the filemap (each requiring
another release_folio() call to free it).
That will be another optimization soon, but I'm still wondering if we
should get rid of (__GFP_NORETRY | __GFP_NOWARN) when the folio is not that
costly to allocate, or enhance the release_folio() callback, or even both?
>
> This is kind of an ugly scenario, because now we're widening the possibility of
> racing with somebody else trying to create an extent buffer here. I think the
> only sane choice is to immediately switch to 0 order folios and carry on.
> Thanks,
I'm totally fine with the immediate fallback.
Thanks for the review,
Qu
>
> Josef
>
Thread overview: 5+ messages
2024-07-25 10:11 [PATCH v8] btrfs: prefer to allocate larger folio for metadata Qu Wenruo
2024-07-26 14:57 ` Josef Bacik
2024-07-26 22:32 ` Qu Wenruo
2024-07-31 15:11 ` Josef Bacik
2024-07-31 22:37 ` Qu Wenruo