public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH-RFC] btrfs: for unclustered allocation don't consider ffe_ctl->empty_cluster
@ 2026-02-26 11:34 Alex Lyakas
  2026-02-26 17:37 ` Boris Burkov
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Lyakas @ 2026-02-26 11:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.lyakas

I encountered an issue when performing unclustered allocation for metadata:

free_space_ctl->free_space = 2MB
ffe_ctl->num_bytes = 4096
ffe_ctl->empty_cluster = 2MB
ffe_ctl->empty_size = 0

So early check for free_space_ctl->free_space skips the block group, even though
it has enough space to do the allocation.
I think when doing unclustered allocation we should not look at ffe_ctl->empty_cluster.

I tested this on stable kernel 6.6.

Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 03cf9f242c70..84b340a67882 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3885,7 +3885,7 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
 		free_space_ctl = bg->free_space_ctl;
 		spin_lock(&free_space_ctl->tree_lock);
 		if (free_space_ctl->free_space <
-		    ffe_ctl->num_bytes + ffe_ctl->empty_cluster +
+		    ffe_ctl->num_bytes +
 		    ffe_ctl->empty_size) {
 			ffe_ctl->total_free_space = max_t(u64,
 					ffe_ctl->total_free_space,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH-RFC] btrfs: for unclustered allocation don't consider ffe_ctl->empty_cluster
  2026-02-26 11:34 [PATCH-RFC] btrfs: for unclustered allocation don't consider ffe_ctl->empty_cluster Alex Lyakas
@ 2026-02-26 17:37 ` Boris Burkov
  2026-02-26 18:29   ` Boris Burkov
  0 siblings, 1 reply; 7+ messages in thread
From: Boris Burkov @ 2026-02-26 17:37 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Thu, Feb 26, 2026 at 01:34:19PM +0200, Alex Lyakas wrote:
> I encountered an issue when performing unclustered allocation for metadata:
> 
> free_space_ctl->free_space = 2MB
> ffe_ctl->num_bytes = 4096
> ffe_ctl->empty_cluster = 2MB
> ffe_ctl->empty_size = 0
> 
> So early check for free_space_ctl->free_space skips the block group, even though
> it has enough space to do the allocation.
> I think when doing unclustered allocation we should not look at ffe_ctl->empty_cluster.

I see what you are saying, and a the level of this situation and this
line of code, I think you are right.

But as-is, this change does not contain enough reasoning about the
semantics of the allocation algorithm to realistically be anything
but exchanging one bug for two new ones.

What is empty_cluster modelling? Why is it included in this calculation?
Why should it not be included? Where else is empty_cluster used and
should it change under this new interpretation? Does any of this change
if there is actually a cluster active for clustered allocations?

etc. etc.

Thanks,
Boris

> 
> I tested this on stable kernel 6.6.
> 
> Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com>
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 03cf9f242c70..84b340a67882 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3885,7 +3885,7 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
>  		free_space_ctl = bg->free_space_ctl;
>  		spin_lock(&free_space_ctl->tree_lock);
>  		if (free_space_ctl->free_space <
> -		    ffe_ctl->num_bytes + ffe_ctl->empty_cluster +
> +		    ffe_ctl->num_bytes +
>  		    ffe_ctl->empty_size) {
>  			ffe_ctl->total_free_space = max_t(u64,
>  					ffe_ctl->total_free_space,
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH-RFC] btrfs: for unclustered allocation don't consider ffe_ctl->empty_cluster
  2026-02-26 17:37 ` Boris Burkov
@ 2026-02-26 18:29   ` Boris Burkov
  2026-03-01 10:14     ` Alex Lyakas
  0 siblings, 1 reply; 7+ messages in thread
From: Boris Burkov @ 2026-02-26 18:29 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Thu, Feb 26, 2026 at 09:37:46AM -0800, Boris Burkov wrote:
> On Thu, Feb 26, 2026 at 01:34:19PM +0200, Alex Lyakas wrote:
> > I encountered an issue when performing unclustered allocation for metadata:
> > 
> > free_space_ctl->free_space = 2MB
> > ffe_ctl->num_bytes = 4096
> > ffe_ctl->empty_cluster = 2MB
> > ffe_ctl->empty_size = 0
> > 
> > So early check for free_space_ctl->free_space skips the block group, even though
> > it has enough space to do the allocation.
> > I think when doing unclustered allocation we should not look at ffe_ctl->empty_cluster.
> 
> I see what you are saying, and a the level of this situation and this
> line of code, I think you are right.
> 
> But as-is, this change does not contain enough reasoning about the
> semantics of the allocation algorithm to realistically be anything
> but exchanging one bug for two new ones.
> 
> What is empty_cluster modelling? Why is it included in this calculation?
> Why should it not be included? Where else is empty_cluster used and
> should it change under this new interpretation? Does any of this change
> if there is actually a cluster active for clustered allocations?
> 
> etc. etc.
> 
> Thanks,
> Boris
> 

I missed the RFC tag in the patch, so I would like to apologize for my
negativity. In my experience with other bugs, the interplay between the
clustered algorithm and the unclustered algorithm is under-specified so
I think it is likely you have indeed found a bug. 

If you want to fix it, I would proceed along the lines I complained
about in my first response and try to define the relationships between
the variables in a consistent way that explains why we shouldn't count
that variable here.

If you do go through with sharpening the definition of empty_cluster,
I would be happy to review that work and help get it in.

Thanks,
Boris

> > 
> > I tested this on stable kernel 6.6.
> > 
> > Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com>
> > ---
> >  fs/btrfs/extent-tree.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 03cf9f242c70..84b340a67882 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -3885,7 +3885,7 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
> >  		free_space_ctl = bg->free_space_ctl;
> >  		spin_lock(&free_space_ctl->tree_lock);
> >  		if (free_space_ctl->free_space <
> > -		    ffe_ctl->num_bytes + ffe_ctl->empty_cluster +
> > +		    ffe_ctl->num_bytes +
> >  		    ffe_ctl->empty_size) {
> >  			ffe_ctl->total_free_space = max_t(u64,
> >  					ffe_ctl->total_free_space,
> > -- 
> > 2.43.0
> > 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH-RFC] btrfs: for unclustered allocation don't consider ffe_ctl->empty_cluster
  2026-02-26 18:29   ` Boris Burkov
@ 2026-03-01 10:14     ` Alex Lyakas
  2026-03-02 19:43       ` Boris Burkov
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Lyakas @ 2026-03-01 10:14 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs

Hi Boris,

On Thu, Feb 26, 2026 at 8:28 PM Boris Burkov <boris@bur.io> wrote:
>
> On Thu, Feb 26, 2026 at 09:37:46AM -0800, Boris Burkov wrote:
> > On Thu, Feb 26, 2026 at 01:34:19PM +0200, Alex Lyakas wrote:
> > > I encountered an issue when performing unclustered allocation for metadata:
> > >
> > > free_space_ctl->free_space = 2MB
> > > ffe_ctl->num_bytes = 4096
> > > ffe_ctl->empty_cluster = 2MB
> > > ffe_ctl->empty_size = 0
> > >
> > > So early check for free_space_ctl->free_space skips the block group, even though
> > > it has enough space to do the allocation.
> > > I think when doing unclustered allocation we should not look at ffe_ctl->empty_cluster.
> >
> > I see what you are saying, and a the level of this situation and this
> > line of code, I think you are right.
> >
> > But as-is, this change does not contain enough reasoning about the
> > semantics of the allocation algorithm to realistically be anything
> > but exchanging one bug for two new ones.
> >

Thank you for your review and your comments below.

However, I am not sure what is it that you are actually requesting.
Do you expect me to describe the whole mechanics of find_free_extent()
and how its state is tracked through ffe_ctl?

> > What is empty_cluster modelling? Why is it included in this calculation?
> > Why should it not be included? Where else is empty_cluster used and
> > should it change under this new interpretation? Does any of this change
> > if there is actually a cluster active for clustered allocations?
I am not changing the meaning of any field of ffe_ctl, which tracks
the state of every call to find_free_extent().

I found an issue where "empty_cluster" is used erroneously in my opinion.

According to the code, "empty_cluster" describes the minimal amount of
free space in the block group, which we are considering to allocate a
cluster from. I am not changing the meaning of it.

My fix is suggesting that "empty_cluster" should not be used when
looking at free space in a block group during unclustered allocation.
Nothing else is changed. In the issue that I saw, this bug prevented
us from allocating a metadata block from a block group, when the block
group still had enough space to make a 4Kb allocation.

I tagged the patch as RFC, because I am unable to test it on the
latest mainline kernel, and I tested it on a stable LTS kernel 6.6.

Thanks,
Alex.





> >
> > etc. etc.
> >
> > Thanks,
> > Boris
> >
>
> I missed the RFC tag in the patch, so I would like to apologize for my
> negativity. In my experience with other bugs, the interplay between the
> clustered algorithm and the unclustered algorithm is under-specified so
> I think it is likely you have indeed found a bug.
>
> If you want to fix it, I would proceed along the lines I complained
> about in my first response and try to define the relationships between
> the variables in a consistent way that explains why we shouldn't count
> that variable here.
>
> If you do go through with sharpening the definition of empty_cluster,
> I would be happy to review that work and help get it in.
>



> Thanks,
> Boris
>
> > >
> > > I tested this on stable kernel 6.6.
> > >
> > > Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com>
> > > ---
> > >  fs/btrfs/extent-tree.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > index 03cf9f242c70..84b340a67882 100644
> > > --- a/fs/btrfs/extent-tree.c
> > > +++ b/fs/btrfs/extent-tree.c
> > > @@ -3885,7 +3885,7 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
> > >             free_space_ctl = bg->free_space_ctl;
> > >             spin_lock(&free_space_ctl->tree_lock);
> > >             if (free_space_ctl->free_space <
> > > -               ffe_ctl->num_bytes + ffe_ctl->empty_cluster +
> > > +               ffe_ctl->num_bytes +
> > >                 ffe_ctl->empty_size) {
> > >                     ffe_ctl->total_free_space = max_t(u64,
> > >                                     ffe_ctl->total_free_space,
> > > --
> > > 2.43.0
> > >

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH-RFC] btrfs: for unclustered allocation don't consider ffe_ctl->empty_cluster
  2026-03-01 10:14     ` Alex Lyakas
@ 2026-03-02 19:43       ` Boris Burkov
  2026-03-02 21:49         ` Boris Burkov
  0 siblings, 1 reply; 7+ messages in thread
From: Boris Burkov @ 2026-03-02 19:43 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Sun, Mar 01, 2026 at 12:14:09PM +0200, Alex Lyakas wrote:
> Hi Boris,
> 
> On Thu, Feb 26, 2026 at 8:28 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Thu, Feb 26, 2026 at 09:37:46AM -0800, Boris Burkov wrote:
> > > On Thu, Feb 26, 2026 at 01:34:19PM +0200, Alex Lyakas wrote:
> > > > I encountered an issue when performing unclustered allocation for metadata:
> > > >
> > > > free_space_ctl->free_space = 2MB
> > > > ffe_ctl->num_bytes = 4096
> > > > ffe_ctl->empty_cluster = 2MB
> > > > ffe_ctl->empty_size = 0
> > > >
> > > > So early check for free_space_ctl->free_space skips the block group, even though
> > > > it has enough space to do the allocation.
> > > > I think when doing unclustered allocation we should not look at ffe_ctl->empty_cluster.
> > >
> > > I see what you are saying, and a the level of this situation and this
> > > line of code, I think you are right.
> > >
> > > But as-is, this change does not contain enough reasoning about the
> > > semantics of the allocation algorithm to realistically be anything
> > > but exchanging one bug for two new ones.
> > >
> 
> Thank you for your review and your comments below.
> 
> However, I am not sure what is it that you are actually requesting.
> Do you expect me to describe the whole mechanics of find_free_extent()
> and how its state is tracked through ffe_ctl?

To be confident that this change does not introduce a different,
potentially more serious, bug, it is important to demonstrate why this
transformation is reasonable, safe, and fits the existing model. That
implies some level of explanation of the mechanisms of the existing
model.

> 
> > > What is empty_cluster modelling? Why is it included in this calculation?
> > > Why should it not be included? Where else is empty_cluster used and
> > > should it change under this new interpretation? Does any of this change
> > > if there is actually a cluster active for clustered allocations?
> I am not changing the meaning of any field of ffe_ctl, which tracks
> the state of every call to find_free_extent().
> 
> I found an issue where "empty_cluster" is used erroneously in my opinion.
> 

At this point, in my opinion, you have a useful insight and a good lead
for a fix, but have not done enough work to justify the change you want
to make.

> According to the code, "empty_cluster" describes the minimal amount of
> free space in the block group, which we are considering to allocate a
> cluster from. I am not changing the meaning of it.
> 
> My fix is suggesting that "empty_cluster" should not be used when
> looking at free space in a block group during unclustered allocation.
> Nothing else is changed. In the issue that I saw, this bug prevented
> us from allocating a metadata block from a block group, when the block
> group still had enough space to make a 4Kb allocation.

At a minimum level, I would like to see an analysis of why the
empty_cluster may have been included in the calculation and some
reasoning about why excluding it ought to be safe and not cause issues
in other situations. Aside from improving the behavior in the conditions
you observed.

i.e., a general demonstration of an understanding of how the change fits
into the code at a wider level than just changing the one thing you care
about right now.

> 
> I tagged the patch as RFC, because I am unable to test it on the
> latest mainline kernel, and I tested it on a stable LTS kernel 6.6.

Why are you unable to test it on the latest mainline kernel? I can run
it through fstests for you, if that would be helpful.

> 
> Thanks,
> Alex.

Thanks,
Boris

> 
> 
> 
> 
> 
> > >
> > > etc. etc.
> > >
> > > Thanks,
> > > Boris
> > >
> >
> > I missed the RFC tag in the patch, so I would like to apologize for my
> > negativity. In my experience with other bugs, the interplay between the
> > clustered algorithm and the unclustered algorithm is under-specified so
> > I think it is likely you have indeed found a bug.
> >
> > If you want to fix it, I would proceed along the lines I complained
> > about in my first response and try to define the relationships between
> > the variables in a consistent way that explains why we shouldn't count
> > that variable here.
> >
> > If you do go through with sharpening the definition of empty_cluster,
> > I would be happy to review that work and help get it in.
> >
> 
> 
> 
> > Thanks,
> > Boris
> >
> > > >
> > > > I tested this on stable kernel 6.6.
> > > >
> > > > Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com>
> > > > ---
> > > >  fs/btrfs/extent-tree.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > > index 03cf9f242c70..84b340a67882 100644
> > > > --- a/fs/btrfs/extent-tree.c
> > > > +++ b/fs/btrfs/extent-tree.c
> > > > @@ -3885,7 +3885,7 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
> > > >             free_space_ctl = bg->free_space_ctl;
> > > >             spin_lock(&free_space_ctl->tree_lock);
> > > >             if (free_space_ctl->free_space <
> > > > -               ffe_ctl->num_bytes + ffe_ctl->empty_cluster +
> > > > +               ffe_ctl->num_bytes +
> > > >                 ffe_ctl->empty_size) {
> > > >                     ffe_ctl->total_free_space = max_t(u64,
> > > >                                     ffe_ctl->total_free_space,
> > > > --
> > > > 2.43.0
> > > >

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH-RFC] btrfs: for unclustered allocation don't consider ffe_ctl->empty_cluster
  2026-03-02 19:43       ` Boris Burkov
@ 2026-03-02 21:49         ` Boris Burkov
  2026-03-02 21:59           ` Boris Burkov
  0 siblings, 1 reply; 7+ messages in thread
From: Boris Burkov @ 2026-03-02 21:49 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Mon, Mar 02, 2026 at 11:43:59AM -0800, Boris Burkov wrote:
> On Sun, Mar 01, 2026 at 12:14:09PM +0200, Alex Lyakas wrote:
> > Hi Boris,
> > 
> > On Thu, Feb 26, 2026 at 8:28 PM Boris Burkov <boris@bur.io> wrote:
> > >
> > > On Thu, Feb 26, 2026 at 09:37:46AM -0800, Boris Burkov wrote:
> > > > On Thu, Feb 26, 2026 at 01:34:19PM +0200, Alex Lyakas wrote:
> > > > > I encountered an issue when performing unclustered allocation for metadata:
> > > > >
> > > > > free_space_ctl->free_space = 2MB
> > > > > ffe_ctl->num_bytes = 4096
> > > > > ffe_ctl->empty_cluster = 2MB
> > > > > ffe_ctl->empty_size = 0
> > > > >
> > > > > So early check for free_space_ctl->free_space skips the block group, even though
> > > > > it has enough space to do the allocation.
> > > > > I think when doing unclustered allocation we should not look at ffe_ctl->empty_cluster.
> > > >
> > > > I see what you are saying, and a the level of this situation and this
> > > > line of code, I think you are right.
> > > >
> > > > But as-is, this change does not contain enough reasoning about the
> > > > semantics of the allocation algorithm to realistically be anything
> > > > but exchanging one bug for two new ones.
> > > >
> > 
> > Thank you for your review and your comments below.
> > 
> > However, I am not sure what is it that you are actually requesting.
> > Do you expect me to describe the whole mechanics of find_free_extent()
> > and how its state is tracked through ffe_ctl?
> 
> To be confident that this change does not introduce a different,
> potentially more serious, bug, it is important to demonstrate why this
> transformation is reasonable, safe, and fits the existing model. That
> implies some level of explanation of the mechanisms of the existing
> model.
> 
> > 
> > > > What is empty_cluster modelling? Why is it included in this calculation?
> > > > Why should it not be included? Where else is empty_cluster used and
> > > > should it change under this new interpretation? Does any of this change
> > > > if there is actually a cluster active for clustered allocations?
> > I am not changing the meaning of any field of ffe_ctl, which tracks
> > the state of every call to find_free_extent().
> > 
> > I found an issue where "empty_cluster" is used erroneously in my opinion.
> > 
> 
> At this point, in my opinion, you have a useful insight and a good lead
> for a fix, but have not done enough work to justify the change you want
> to make.
> 
> > According to the code, "empty_cluster" describes the minimal amount of
> > free space in the block group, which we are considering to allocate a
> > cluster from. I am not changing the meaning of it.
> > 
> > My fix is suggesting that "empty_cluster" should not be used when
> > looking at free space in a block group during unclustered allocation.
> > Nothing else is changed. In the issue that I saw, this bug prevented
> > us from allocating a metadata block from a block group, when the block
> > group still had enough space to make a 4Kb allocation.
> 
> At a minimum level, I would like to see an analysis of why the
> empty_cluster may have been included in the calculation and some
> reasoning about why excluding it ought to be safe and not cause issues
> in other situations. Aside from improving the behavior in the conditions
> you observed.
> 
> i.e., a general demonstration of an understanding of how the change fits
> into the code at a wider level than just changing the one thing you care
> about right now.
> 

I spent a few minutes looking into this out of curiousity.

So at a base level, clustered allocator is used by metadata and data in
ssd_spread mode. So let's focus on metadata, since that was your case
anyway.

This check was introduced by the commit
425d83156ca ("Btrfs: skip block groups without enough space for a cluster")

with the justification that for unclustered allocation attempts in
clustered mode, the search fails because it searches for bytes +
empty_size anyway, which I see borne out in btrfs_find_space_for_alloc()
(u64 bytes_search = bytes + empty_size)

Later in the wider loop, we set empty_size to 0 as a final last resort.

So to me, this suggests the allocator is intentionally designed to go
allocate a new metadata block group when there isn't enough room for a
new cluster, but shouldn't return ENOSPC as a result of this preference.

How exactly did you notice this in the first place? I assume you saw it
skipping a block group in practice then noticed this code? After the
fix, did it actually use the block group? I am kind of surprised based
on that code I found in btrfs_find_space_for_alloc(), if so. So that
would probably need explaining.

I could well be wrong in some part of my analysis, I am just sharing
some quick findings. Please feel free to investigate further and
convince me otherwise. But this is the sort of analysis I was looking
for to motivate and contextualize the change.

In addition, please include exactly the improvement in behavior from
your change to make it as clear as possible. (e.g., before when running
<script> it would allocate 6 metadata block_groups, now it only
allocates 5.)

Thanks,
Boris

> > 
> > I tagged the patch as RFC, because I am unable to test it on the
> > latest mainline kernel, and I tested it on a stable LTS kernel 6.6.
> 
> Why are you unable to test it on the latest mainline kernel? I can run
> it through fstests for you, if that would be helpful.
> 
> > 
> > Thanks,
> > Alex.
> 
> Thanks,
> Boris
> 
> > 
> > 
> > 
> > 
> > 
> > > >
> > > > etc. etc.
> > > >
> > > > Thanks,
> > > > Boris
> > > >
> > >
> > > I missed the RFC tag in the patch, so I would like to apologize for my
> > > negativity. In my experience with other bugs, the interplay between the
> > > clustered algorithm and the unclustered algorithm is under-specified so
> > > I think it is likely you have indeed found a bug.
> > >
> > > If you want to fix it, I would proceed along the lines I complained
> > > about in my first response and try to define the relationships between
> > > the variables in a consistent way that explains why we shouldn't count
> > > that variable here.
> > >
> > > If you do go through with sharpening the definition of empty_cluster,
> > > I would be happy to review that work and help get it in.
> > >
> > 
> > 
> > 
> > > Thanks,
> > > Boris
> > >
> > > > >
> > > > > I tested this on stable kernel 6.6.
> > > > >
> > > > > Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com>
> > > > > ---
> > > > >  fs/btrfs/extent-tree.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > > > index 03cf9f242c70..84b340a67882 100644
> > > > > --- a/fs/btrfs/extent-tree.c
> > > > > +++ b/fs/btrfs/extent-tree.c
> > > > > @@ -3885,7 +3885,7 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
> > > > >             free_space_ctl = bg->free_space_ctl;
> > > > >             spin_lock(&free_space_ctl->tree_lock);
> > > > >             if (free_space_ctl->free_space <
> > > > > -               ffe_ctl->num_bytes + ffe_ctl->empty_cluster +
> > > > > +               ffe_ctl->num_bytes +
> > > > >                 ffe_ctl->empty_size) {
> > > > >                     ffe_ctl->total_free_space = max_t(u64,
> > > > >                                     ffe_ctl->total_free_space,
> > > > > --
> > > > > 2.43.0
> > > > >

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH-RFC] btrfs: for unclustered allocation don't consider ffe_ctl->empty_cluster
  2026-03-02 21:49         ` Boris Burkov
@ 2026-03-02 21:59           ` Boris Burkov
  0 siblings, 0 replies; 7+ messages in thread
From: Boris Burkov @ 2026-03-02 21:59 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Mon, Mar 02, 2026 at 01:49:18PM -0800, Boris Burkov wrote:
> On Mon, Mar 02, 2026 at 11:43:59AM -0800, Boris Burkov wrote:
> > On Sun, Mar 01, 2026 at 12:14:09PM +0200, Alex Lyakas wrote:
> > > Hi Boris,
> > > 
> > > On Thu, Feb 26, 2026 at 8:28 PM Boris Burkov <boris@bur.io> wrote:
> > > >
> > > > On Thu, Feb 26, 2026 at 09:37:46AM -0800, Boris Burkov wrote:
> > > > > On Thu, Feb 26, 2026 at 01:34:19PM +0200, Alex Lyakas wrote:
> > > > > > I encountered an issue when performing unclustered allocation for metadata:
> > > > > >
> > > > > > free_space_ctl->free_space = 2MB
> > > > > > ffe_ctl->num_bytes = 4096
> > > > > > ffe_ctl->empty_cluster = 2MB
> > > > > > ffe_ctl->empty_size = 0
> > > > > >
> > > > > > So early check for free_space_ctl->free_space skips the block group, even though
> > > > > > it has enough space to do the allocation.
> > > > > > I think when doing unclustered allocation we should not look at ffe_ctl->empty_cluster.
> > > > >
> > > > > I see what you are saying, and a the level of this situation and this
> > > > > line of code, I think you are right.
> > > > >
> > > > > But as-is, this change does not contain enough reasoning about the
> > > > > semantics of the allocation algorithm to realistically be anything
> > > > > but exchanging one bug for two new ones.
> > > > >
> > > 
> > > Thank you for your review and your comments below.
> > > 
> > > However, I am not sure what is it that you are actually requesting.
> > > Do you expect me to describe the whole mechanics of find_free_extent()
> > > and how its state is tracked through ffe_ctl?
> > 
> > To be confident that this change does not introduce a different,
> > potentially more serious, bug, it is important to demonstrate why this
> > transformation is reasonable, safe, and fits the existing model. That
> > implies some level of explanation of the mechanisms of the existing
> > model.
> > 
> > > 
> > > > > What is empty_cluster modelling? Why is it included in this calculation?
> > > > > Why should it not be included? Where else is empty_cluster used and
> > > > > should it change under this new interpretation? Does any of this change
> > > > > if there is actually a cluster active for clustered allocations?
> > > I am not changing the meaning of any field of ffe_ctl, which tracks
> > > the state of every call to find_free_extent().
> > > 
> > > I found an issue where "empty_cluster" is used erroneously in my opinion.
> > > 
> > 
> > At this point, in my opinion, you have a useful insight and a good lead
> > for a fix, but have not done enough work to justify the change you want
> > to make.
> > 
> > > According to the code, "empty_cluster" describes the minimal amount of
> > > free space in the block group, which we are considering to allocate a
> > > cluster from. I am not changing the meaning of it.
> > > 
> > > My fix is suggesting that "empty_cluster" should not be used when
> > > looking at free space in a block group during unclustered allocation.
> > > Nothing else is changed. In the issue that I saw, this bug prevented
> > > us from allocating a metadata block from a block group, when the block
> > > group still had enough space to make a 4Kb allocation.
> > 
> > At a minimum level, I would like to see an analysis of why the
> > empty_cluster may have been included in the calculation and some
> > reasoning about why excluding it ought to be safe and not cause issues
> > in other situations. Aside from improving the behavior in the conditions
> > you observed.
> > 
> > i.e., a general demonstration of an understanding of how the change fits
> > into the code at a wider level than just changing the one thing you care
> > about right now.
> > 
> 
> I spent a few minutes looking into this out of curiousity.
> 
> So at a base level, clustered allocator is used by metadata and data in
> ssd_spread mode. So let's focus on metadata, since that was your case
> anyway.
> 
> This check was introduced by the commit
> 425d83156ca ("Btrfs: skip block groups without enough space for a cluster")

a5f6f719a5cd ("Btrfs: test free space only for unclustered allocation")

gives another good hint on the evolution and why the check is only done
in the unclustered case.

> 
> with the justification that for unclustered allocation attempts in
> clustered mode, the search fails because it searches for bytes +
> empty_size anyway, which I see borne out in btrfs_find_space_for_alloc()
> (u64 bytes_search = bytes + empty_size)
> 
> Later in the wider loop, we set empty_size to 0 as a final last resort.
> 
> So to me, this suggests the allocator is intentionally designed to go
> allocate a new metadata block group when there isn't enough room for a
> new cluster, but shouldn't return ENOSPC as a result of this preference.
> 
> How exactly did you notice this in the first place? I assume you saw it
> skipping a block group in practice then noticed this code? After the
> fix, did it actually use the block group? I am kind of surprised based
> on that code I found in btrfs_find_space_for_alloc(), if so. So that
> would probably need explaining.
> 
> I could well be wrong in some part of my analysis, I am just sharing
> some quick findings. Please feel free to investigate further and
> convince me otherwise. But this is the sort of analysis I was looking
> for to motivate and contextualize the change.
> 
> In addition, please include exactly the improvement in behavior from
> your change to make it as clear as possible. (e.g., before when running
> <script> it would allocate 6 metadata block_groups, now it only
> allocates 5.)
> 
> Thanks,
> Boris
> 
> > > 
> > > I tagged the patch as RFC, because I am unable to test it on the
> > > latest mainline kernel, and I tested it on a stable LTS kernel 6.6.
> > 
> > Why are you unable to test it on the latest mainline kernel? I can run
> > it through fstests for you, if that would be helpful.
> > 
> > > 
> > > Thanks,
> > > Alex.
> > 
> > Thanks,
> > Boris
> > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > > >
> > > > > etc. etc.
> > > > >
> > > > > Thanks,
> > > > > Boris
> > > > >
> > > >
> > > > I missed the RFC tag in the patch, so I would like to apologize for my
> > > > negativity. In my experience with other bugs, the interplay between the
> > > > clustered algorithm and the unclustered algorithm is under-specified so
> > > > I think it is likely you have indeed found a bug.
> > > >
> > > > If you want to fix it, I would proceed along the lines I complained
> > > > about in my first response and try to define the relationships between
> > > > the variables in a consistent way that explains why we shouldn't count
> > > > that variable here.
> > > >
> > > > If you do go through with sharpening the definition of empty_cluster,
> > > > I would be happy to review that work and help get it in.
> > > >
> > > 
> > > 
> > > 
> > > > Thanks,
> > > > Boris
> > > >
> > > > > >
> > > > > > I tested this on stable kernel 6.6.
> > > > > >
> > > > > > Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com>
> > > > > > ---
> > > > > >  fs/btrfs/extent-tree.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > > > > index 03cf9f242c70..84b340a67882 100644
> > > > > > --- a/fs/btrfs/extent-tree.c
> > > > > > +++ b/fs/btrfs/extent-tree.c
> > > > > > @@ -3885,7 +3885,7 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
> > > > > >             free_space_ctl = bg->free_space_ctl;
> > > > > >             spin_lock(&free_space_ctl->tree_lock);
> > > > > >             if (free_space_ctl->free_space <
> > > > > > -               ffe_ctl->num_bytes + ffe_ctl->empty_cluster +
> > > > > > +               ffe_ctl->num_bytes +
> > > > > >                 ffe_ctl->empty_size) {
> > > > > >                     ffe_ctl->total_free_space = max_t(u64,
> > > > > >                                     ffe_ctl->total_free_space,
> > > > > > --
> > > > > > 2.43.0
> > > > > >

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-03-02 21:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-26 11:34 [PATCH-RFC] btrfs: for unclustered allocation don't consider ffe_ctl->empty_cluster Alex Lyakas
2026-02-26 17:37 ` Boris Burkov
2026-02-26 18:29   ` Boris Burkov
2026-03-01 10:14     ` Alex Lyakas
2026-03-02 19:43       ` Boris Burkov
2026-03-02 21:49         ` Boris Burkov
2026-03-02 21:59           ` Boris Burkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox