[PATCH] ext4: fix dirtyclusters double decrement on fs shutdown

public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown
@ 2025-12-12 15:47 Brian Foster
  2025-12-13  1:46 ` Baokun Li
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2025-12-12 15:47 UTC (permalink / raw)
  To: linux-ext4

fstests test generic/388 occasionally reproduces a warning in
ext4_put_super() associated with the dirty clusters count:

  WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]

Tracing the failure shows that the warning fires due to an
s_dirtyclusters_counter value of -1. IOW, this appears to be a
spurious decrement as opposed to some sort of leak. Further tracing
of the dirty cluster count deltas and an LLM scan of the resulting
output identified the cause as a double decrement in the error path
between ext4_mb_mark_diskspace_used() and the caller
ext4_mb_new_blocks().

First, note that generic/388 is a shutdown vs. fsstress test and so
produces a random set of operations and shutdown injections. In the
problematic case, the shutdown triggers an error return from the
ext4_handle_dirty_metadata() call(s) made from
ext4_mb_mark_context(). The changed value is non-zero at this point,
so ext4_mb_mark_diskspace_used() does not exit after the error
bubbles up from ext4_mb_mark_context(). Instead, the former
decrements both cluster counters and returns the error up to
ext4_mb_new_blocks(). The latter falls into the !ar->len out path
which decrements the dirty clusters counter a second time, creating
the inconsistency.

AFAICT the solution here is to exit immediately from
ext4_mb_mark_diskspace_used() on error, regardless of the changed
value. This leaves the caller responsible for clearing the block
reservation at the same level it is acquired. This also skips the
free clusters update, but the caller also calls into
ext4_discard_allocated_blocks() to free the blocks back into the
group. This survives an overnight loop test of generic/388 on an
otherwise reproducing system and survives a local regression run.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

I've thrown some testing at this and poked around the area enough that I
_think_ it is reasonably sane, but the error paths are hairy and I could
certainly be missing some details. I'm happy to try a different approach
if there are any thoughts around that.. thanks.

Brian

 fs/ext4/mballoc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 56d50fd3310b..224abfd6a42b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4234,7 +4234,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
 				   ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len,
 				   flags, &changed);

-	if (err && changed == 0)
+	if (err)
 		return err;

 #ifdef AGGRESSIVE_CHECK
-- 
2.51.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown
  2025-12-12 15:47 [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown Brian Foster
@ 2025-12-13  1:46 ` Baokun Li
  2025-12-15 15:28   ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Baokun Li @ 2025-12-13  1:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-ext4

Hi Brian,

Thanks for the patch.

On 2025-12-12 23:47, Brian Foster wrote:
> fstests test generic/388 occasionally reproduces a warning in
> ext4_put_super() associated with the dirty clusters count:
>
>   WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]
>
> Tracing the failure shows that the warning fires due to an
> s_dirtyclusters_counter value of -1. IOW, this appears to be a
> spurious decrement as opposed to some sort of leak. Further tracing
> of the dirty cluster count deltas and an LLM scan of the resulting
> output identified the cause as a double decrement in the error path
> between ext4_mb_mark_diskspace_used() and the caller
> ext4_mb_new_blocks().
>
> First, note that generic/388 is a shutdown vs. fsstress test and so
> produces a random set of operations and shutdown injections. In the
> problematic case, the shutdown triggers an error return from the
> ext4_handle_dirty_metadata() call(s) made from
> ext4_mb_mark_context(). The changed value is non-zero at this point,
> so ext4_mb_mark_diskspace_used() does not exit after the error
> bubbles up from ext4_mb_mark_context(). Instead, the former
> decrements both cluster counters and returns the error up to
> ext4_mb_new_blocks(). The latter falls into the !ar->len out path
> which decrements the dirty clusters counter a second time, creating
> the inconsistency.
>
> AFAICT the solution here is to exit immediately from
> ext4_mb_mark_diskspace_used() on error, regardless of the changed
> value. This leaves the caller responsible for clearing the block
> reservation at the same level it is acquired. This also skips the
> free clusters update, but the caller also calls into
> ext4_discard_allocated_blocks() to free the blocks back into the
> group. This survives an overnight loop test of generic/388 on an
> otherwise reproducing system and survives a local regression run.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>
> Hi all,
>
> I've thrown some testing at this and poked around the area enough that I
> _think_ it is reasonably sane, but the error paths are hairy and I could
> certainly be missing some details. I'm happy to try a different approach
> if there are any thoughts around that.. thanks.
>
> Brian
>
>  fs/ext4/mballoc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 56d50fd3310b..224abfd6a42b 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4234,7 +4234,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
>  				   ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len,
>  				   flags, &changed);
>  
> -	if (err && changed == 0)
> +	if (err)
>  		return err;
>  
>  #ifdef AGGRESSIVE_CHECK

I think we might need to swap that && for an ||.

Basically, we should only proceed with the following logic if there's
no error and the bitmap was actually changed. If we hit an error or
if the section we intended to modify was already fully handled,
we should just bail out early. Otherwise, the err could get quietly
ignored unless we hit a duplicate allocation that happens to result in
'changed' being zero.

By the way, I spotted two other spots with similar error logic:
ext4_mb_clear_bb() and ext4_group_add_blocks().

Since this issue popped up in the last couple of years, we should
probably add a Fixes: tag to make backporting easier.


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown
  2025-12-13  1:46 ` Baokun Li
@ 2025-12-15 15:28   ` Brian Foster
  2025-12-16  4:01     ` Baokun Li
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2025-12-15 15:28 UTC (permalink / raw)
  To: Baokun Li; +Cc: linux-ext4

On Sat, Dec 13, 2025 at 09:46:23AM +0800, Baokun Li wrote:
> Hi Brian,
> 
> Thanks for the patch.
> 

Hi Baokun,

Thanks for reviewing..

> On 2025-12-12 23:47, Brian Foster wrote:
> > fstests test generic/388 occasionally reproduces a warning in
> > ext4_put_super() associated with the dirty clusters count:
> >
> >   WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]
> >
> > Tracing the failure shows that the warning fires due to an
> > s_dirtyclusters_counter value of -1. IOW, this appears to be a
> > spurious decrement as opposed to some sort of leak. Further tracing
> > of the dirty cluster count deltas and an LLM scan of the resulting
> > output identified the cause as a double decrement in the error path
> > between ext4_mb_mark_diskspace_used() and the caller
> > ext4_mb_new_blocks().
> >
> > First, note that generic/388 is a shutdown vs. fsstress test and so
> > produces a random set of operations and shutdown injections. In the
> > problematic case, the shutdown triggers an error return from the
> > ext4_handle_dirty_metadata() call(s) made from
> > ext4_mb_mark_context(). The changed value is non-zero at this point,
> > so ext4_mb_mark_diskspace_used() does not exit after the error
> > bubbles up from ext4_mb_mark_context(). Instead, the former
> > decrements both cluster counters and returns the error up to
> > ext4_mb_new_blocks(). The latter falls into the !ar->len out path
> > which decrements the dirty clusters counter a second time, creating
> > the inconsistency.
> >
> > AFAICT the solution here is to exit immediately from
> > ext4_mb_mark_diskspace_used() on error, regardless of the changed
> > value. This leaves the caller responsible for clearing the block
> > reservation at the same level it is acquired. This also skips the
> > free clusters update, but the caller also calls into
> > ext4_discard_allocated_blocks() to free the blocks back into the
> > group. This survives an overnight loop test of generic/388 on an
> > otherwise reproducing system and survives a local regression run.
> >
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >
> > Hi all,
> >
> > I've thrown some testing at this and poked around the area enough that I
> > _think_ it is reasonably sane, but the error paths are hairy and I could
> > certainly be missing some details. I'm happy to try a different approach
> > if there are any thoughts around that.. thanks.
> >
> > Brian
> >
> >  fs/ext4/mballoc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 56d50fd3310b..224abfd6a42b 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -4234,7 +4234,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
> >  				   ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len,
> >  				   flags, &changed);
> >  
> > -	if (err && changed == 0)
> > +	if (err)
> >  		return err;
> >  
> >  #ifdef AGGRESSIVE_CHECK
> 
> I think we might need to swap that && for an ||.
> 
> Basically, we should only proceed with the following logic if there's
> no error and the bitmap was actually changed. If we hit an error or
> if the section we intended to modify was already fully handled,
> we should just bail out early. Otherwise, the err could get quietly
> ignored unless we hit a duplicate allocation that happens to result in
> 'changed' being zero.
> 

Hmm.. to make sure I understand, are you referring to an inconsistency
case where we allocated blocks that were already marked as such in the
group on disk..?

I'm a little uneasy about this because it seems to conflict with the
surrounding code. AFAICT the only way we can hit something like !err &&
!changed is via EXT4_MB_BITMAP_MARKED_CHECK, which causes
_mark_context() to check the bitmap for "already modified" bits up
front.

If this scenario plays out, the caller has a BUG check just after the
return (also under the aggressive check macro). So ISTM that this sort
of (err || !changed) logic would bypass the aggressive checks and let
the fs carry on when it probably shouldn't. Hm?

> By the way, I spotted two other spots with similar error logic:
> ext4_mb_clear_bb() and ext4_group_add_blocks().
> 

Yeah, I saw those as well but didn't think they needed changing. My high
level understanding of the alloc case is that ext4_mb_new_blocks()
acquires res (!delalloc), allocs blocks out of in-core structures, then
calls down into _mark_diskspace_used() to update/journal on-disk
structures with the pending alloc. If the latter fails, we release res
and feed blocks back into the in-core structures. So IOW, if we return
directly from _mark_diskspace_used() the counters/state end up
consistent afaict.

For the ext4_free_blocks() case, we call _mark_context() and if it fails
with changed != 0 (and don't otherwise BUG), we still go ahead and free
the blocks in the e4b and return the error. It does look like the
discard code can clobber the error though, so perhaps that should be
fixed. But otherwise it's not clear to me why we might want to exit
early there. Am I missing something else?

> Since this issue popped up in the last couple of years, we should
> probably add a Fixes: tag to make backporting easier.
> 

Do you have a target patch in mind? I made a pass through historical
changes and it looked like this was a longer standing issue through
various bits of refactoring..

Brian

> 
> Cheers,
> Baokun
> 
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown
  2025-12-15 15:28   ` Brian Foster
@ 2025-12-16  4:01     ` Baokun Li
  2025-12-16 15:53       ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Baokun Li @ 2025-12-16  4:01 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-ext4

On 2025-12-15 23:28, Brian Foster wrote:
> On Sat, Dec 13, 2025 at 09:46:23AM +0800, Baokun Li wrote:
>> Hi Brian,
>>
>> Thanks for the patch.
>>
> Hi Baokun,
>
> Thanks for reviewing..
>
>> On 2025-12-12 23:47, Brian Foster wrote:
>>> fstests test generic/388 occasionally reproduces a warning in
>>> ext4_put_super() associated with the dirty clusters count:
>>>
>>>   WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]
>>>
>>> Tracing the failure shows that the warning fires due to an
>>> s_dirtyclusters_counter value of -1. IOW, this appears to be a
>>> spurious decrement as opposed to some sort of leak. Further tracing
>>> of the dirty cluster count deltas and an LLM scan of the resulting
>>> output identified the cause as a double decrement in the error path
>>> between ext4_mb_mark_diskspace_used() and the caller
>>> ext4_mb_new_blocks().
>>>
>>> First, note that generic/388 is a shutdown vs. fsstress test and so
>>> produces a random set of operations and shutdown injections. In the
>>> problematic case, the shutdown triggers an error return from the
>>> ext4_handle_dirty_metadata() call(s) made from
>>> ext4_mb_mark_context(). The changed value is non-zero at this point,
>>> so ext4_mb_mark_diskspace_used() does not exit after the error
>>> bubbles up from ext4_mb_mark_context(). Instead, the former
>>> decrements both cluster counters and returns the error up to
>>> ext4_mb_new_blocks(). The latter falls into the !ar->len out path
>>> which decrements the dirty clusters counter a second time, creating
>>> the inconsistency.
>>>
>>> AFAICT the solution here is to exit immediately from
>>> ext4_mb_mark_diskspace_used() on error, regardless of the changed
>>> value. This leaves the caller responsible for clearing the block
>>> reservation at the same level it is acquired. This also skips the
>>> free clusters update, but the caller also calls into
>>> ext4_discard_allocated_blocks() to free the blocks back into the
>>> group. This survives an overnight loop test of generic/388 on an
>>> otherwise reproducing system and survives a local regression run.
>>>
>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>> ---
>>>
>>> Hi all,
>>>
>>> I've thrown some testing at this and poked around the area enough that I
>>> _think_ it is reasonably sane, but the error paths are hairy and I could
>>> certainly be missing some details. I'm happy to try a different approach
>>> if there are any thoughts around that.. thanks.
>>>
>>> Brian
>>>
>>>  fs/ext4/mballoc.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>> index 56d50fd3310b..224abfd6a42b 100644
>>> --- a/fs/ext4/mballoc.c
>>> +++ b/fs/ext4/mballoc.c
>>> @@ -4234,7 +4234,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
>>>  				   ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len,
>>>  				   flags, &changed);
>>>  
>>> -	if (err && changed == 0)
>>> +	if (err)
>>>  		return err;
>>>  
>>>  #ifdef AGGRESSIVE_CHECK
>> I think we might need to swap that && for an ||.
>>
>> Basically, we should only proceed with the following logic if there's
>> no error and the bitmap was actually changed. If we hit an error or
>> if the section we intended to modify was already fully handled,
>> we should just bail out early. Otherwise, the err could get quietly
>> ignored unless we hit a duplicate allocation that happens to result in
>> 'changed' being zero.
>>
> Hmm.. to make sure I understand, are you referring to an inconsistency
> case where we allocated blocks that were already marked as such in the
> group on disk..?
Yes.
> I'm a little uneasy about this because it seems to conflict with the
> surrounding code. AFAICT the only way we can hit something like !err &&
> !changed is via EXT4_MB_BITMAP_MARKED_CHECK, which causes
> _mark_context() to check the bitmap for "already modified" bits up
> front.
>
> If this scenario plays out, the caller has a BUG check just after the
> return (also under the aggressive check macro). So ISTM that this sort
> of (err || !changed) logic would bypass the aggressive checks and let
> the fs carry on when it probably shouldn't. Hm?

Regarding ext4_mb_mark_context, if the passed ret_changed pointer is
non-NULL, we initialize *ret_changed to 0. After updating the bitmap_bh,
we then update *ret_changed with the actual number of blocks modified
(changed).

Therefore, the original intention was for changed == 0 to signify that
an error occurred in ext4_mb_mark_context() before ret_changed could be
updated. However, as you pointed out, we also get changed == 0 when the
target extent has already been fully marked as allocated within bitmap_bh.

Crucially, we only genuinely check the bitmap to modify changed when
EXT4_MB_BITMAP_MARKED_CHECK is set (i.e., when AGGRESSIVE_CHECK is defined,
or during fast commit or resize operations). Otherwise, changed is always
set to the target length. This means that, in the general case, errors
returned after the point where ret_changed is updated (e.g., the error
from ext4_handle_dirty_metadata()) are usually ignored.

In summary:

 * (err && changed == 0) only concerns errors that occur before ret_changed
   is updated.
 * (err || changed == 0) concerns whether there was an error OR if any
   modification actually took place.

If we only care about err, we could move the update of ret_changed inside
ext4_mb_mark_context() to just before the successful return.

>> By the way, I spotted two other spots with similar error logic:
>> ext4_mb_clear_bb() and ext4_group_add_blocks().
>>
> Yeah, I saw those as well but didn't think they needed changing. My high
> level understanding of the alloc case is that ext4_mb_new_blocks()
> acquires res (!delalloc), allocs blocks out of in-core structures, then
> calls down into _mark_diskspace_used() to update/journal on-disk
> structures with the pending alloc. If the latter fails, we release res
> and feed blocks back into the in-core structures. So IOW, if we return
> directly from _mark_diskspace_used() the counters/state end up
> consistent afaict.
>
> For the ext4_free_blocks() case, we call _mark_context() and if it fails
> with changed != 0 (and don't otherwise BUG), we still go ahead and free
> the blocks in the e4b and return the error. It does look like the
> discard code can clobber the error though, so perhaps that should be
> fixed. But otherwise it's not clear to me why we might want to exit
> early there. Am I missing something else?

The core issue is that they risk ignoring certain errors, which can
result in inconsistency.

>
>> Since this issue popped up in the last couple of years, we should
>> probably add a Fixes: tag to make backporting easier.
>>
> Do you have a target patch in mind? I made a pass through historical
> changes and it looked like this was a longer standing issue through
> various bits of refactoring..
>
No, I haven't, but I suspect it was introduced when
(err && changed == 0) was added.


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown
  2025-12-16  4:01     ` Baokun Li
@ 2025-12-16 15:53       ` Brian Foster
  2025-12-17  4:05         ` Baokun Li
  2025-12-19 14:29         ` Jan Kara
  0 siblings, 2 replies; 9+ messages in thread
From: Brian Foster @ 2025-12-16 15:53 UTC (permalink / raw)
  To: Baokun Li; +Cc: linux-ext4, Jan Kara

On Tue, Dec 16, 2025 at 12:01:41PM +0800, Baokun Li wrote:
> On 2025-12-15 23:28, Brian Foster wrote:
> > On Sat, Dec 13, 2025 at 09:46:23AM +0800, Baokun Li wrote:
> >> Hi Brian,
> >>
> >> Thanks for the patch.
> >>
> > Hi Baokun,
> >
> > Thanks for reviewing..
> >
> >> On 2025-12-12 23:47, Brian Foster wrote:
> >>> fstests test generic/388 occasionally reproduces a warning in
> >>> ext4_put_super() associated with the dirty clusters count:
> >>>
> >>>   WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]
> >>>
> >>> Tracing the failure shows that the warning fires due to an
> >>> s_dirtyclusters_counter value of -1. IOW, this appears to be a
> >>> spurious decrement as opposed to some sort of leak. Further tracing
> >>> of the dirty cluster count deltas and an LLM scan of the resulting
> >>> output identified the cause as a double decrement in the error path
> >>> between ext4_mb_mark_diskspace_used() and the caller
> >>> ext4_mb_new_blocks().
> >>>
> >>> First, note that generic/388 is a shutdown vs. fsstress test and so
> >>> produces a random set of operations and shutdown injections. In the
> >>> problematic case, the shutdown triggers an error return from the
> >>> ext4_handle_dirty_metadata() call(s) made from
> >>> ext4_mb_mark_context(). The changed value is non-zero at this point,
> >>> so ext4_mb_mark_diskspace_used() does not exit after the error
> >>> bubbles up from ext4_mb_mark_context(). Instead, the former
> >>> decrements both cluster counters and returns the error up to
> >>> ext4_mb_new_blocks(). The latter falls into the !ar->len out path
> >>> which decrements the dirty clusters counter a second time, creating
> >>> the inconsistency.
> >>>
> >>> AFAICT the solution here is to exit immediately from
> >>> ext4_mb_mark_diskspace_used() on error, regardless of the changed
> >>> value. This leaves the caller responsible for clearing the block
> >>> reservation at the same level it is acquired. This also skips the
> >>> free clusters update, but the caller also calls into
> >>> ext4_discard_allocated_blocks() to free the blocks back into the
> >>> group. This survives an overnight loop test of generic/388 on an
> >>> otherwise reproducing system and survives a local regression run.
> >>>
> >>> Signed-off-by: Brian Foster <bfoster@redhat.com>
> >>> ---
> >>>
> >>> Hi all,
> >>>
> >>> I've thrown some testing at this and poked around the area enough that I
> >>> _think_ it is reasonably sane, but the error paths are hairy and I could
> >>> certainly be missing some details. I'm happy to try a different approach
> >>> if there are any thoughts around that.. thanks.
> >>>
> >>> Brian
> >>>
> >>>  fs/ext4/mballoc.c | 2 +-
> >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> >>> index 56d50fd3310b..224abfd6a42b 100644
> >>> --- a/fs/ext4/mballoc.c
> >>> +++ b/fs/ext4/mballoc.c
> >>> @@ -4234,7 +4234,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
> >>>  				   ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len,
> >>>  				   flags, &changed);
> >>>  
> >>> -	if (err && changed == 0)
> >>> +	if (err)
> >>>  		return err;
> >>>  
> >>>  #ifdef AGGRESSIVE_CHECK
> >> I think we might need to swap that && for an ||.
> >>
> >> Basically, we should only proceed with the following logic if there's
> >> no error and the bitmap was actually changed. If we hit an error or
> >> if the section we intended to modify was already fully handled,
> >> we should just bail out early. Otherwise, the err could get quietly
> >> ignored unless we hit a duplicate allocation that happens to result in
> >> 'changed' being zero.
> >>
> > Hmm.. to make sure I understand, are you referring to an inconsistency
> > case where we allocated blocks that were already marked as such in the
> > group on disk..?
> Yes.
> > I'm a little uneasy about this because it seems to conflict with the
> > surrounding code. AFAICT the only way we can hit something like !err &&
> > !changed is via EXT4_MB_BITMAP_MARKED_CHECK, which causes
> > _mark_context() to check the bitmap for "already modified" bits up
> > front.
> >
> > If this scenario plays out, the caller has a BUG check just after the
> > return (also under the aggressive check macro). So ISTM that this sort
> > of (err || !changed) logic would bypass the aggressive checks and let
> > the fs carry on when it probably shouldn't. Hm?
> 
> Regarding ext4_mb_mark_context, if the passed ret_changed pointer is
> non-NULL, we initialize *ret_changed to 0. After updating the bitmap_bh,
> we then update *ret_changed with the actual number of blocks modified
> (changed).
> 
> Therefore, the original intention was for changed == 0 to signify that
> an error occurred in ext4_mb_mark_context() before ret_changed could be
> updated. However, as you pointed out, we also get changed == 0 when the
> target extent has already been fully marked as allocated within bitmap_bh.
> 
> Crucially, we only genuinely check the bitmap to modify changed when
> EXT4_MB_BITMAP_MARKED_CHECK is set (i.e., when AGGRESSIVE_CHECK is defined,
> or during fast commit or resize operations). Otherwise, changed is always
> set to the target length. This means that, in the general case, errors
> returned after the point where ret_changed is updated (e.g., the error
> from ext4_handle_dirty_metadata()) are usually ignored.
> 
> In summary:
> 
>  * (err && changed == 0) only concerns errors that occur before ret_changed
>    is updated.
>  * (err || changed == 0) concerns whether there was an error OR if any
>    modification actually took place.
> 
> If we only care about err, we could move the update of ret_changed inside
> ext4_mb_mark_context() to just before the successful return.
> 

Yeah, I think the _mark_context() logic is reasonably straightforward
from the code. What is less clear is why the allocation path only cares
about errors prior to ret_changed being set.

It looks to me that this is just wrong. I suspect either due to
copy/paste error in the mark_diskspace_used() path at some point in the
past, or an attempt to filter out the BUG case from an obvious case
where changed will be 0 on certain error returns.

I don't think the mark_context() logic alone tells the full story here.
I think what's relevant is the high level error handling of the
!delalloc allocation path:

1. ext4_mb_new_blocks() reserves blocks and attempts physical allocation
in memory. On success, it calls into ext4_mb_mark_diskspace_used() to
update on-disk structures.

2a. If ext4_mb_mark_diskspace_used() is successful, it decrements the
freeclusters counter and releases res from the dirtyclusers counter
(i.e. transfer the block from dirty to used). ext4_mb_new_blocks()
basically just returns the result.

2b. If ext4_mb_mark_diskspace_used() fails, ext4_mb_new_blocks()
receives the error. In the error exit path, it frees the blocks back
into the incore structures and releases the reservation it acquired in
step 1.  

However the bug this patch is trying to fix is that
ext4_mb_mark_diskspace_used() runs the counter updates regardless of
error in some cases. If the on-disk update fails, _new_blocks() will
release its reservation and return the blocks, so _mark_diskspace_used()
shouldn't account that block from the dirty and free counters.

What isn't quite clear to me is how this is expected to deal with the
modified buffer. This particular case is a shutdown and journal abort,
so I suspect the buffer can't write back.

> >> By the way, I spotted two other spots with similar error logic:
> >> ext4_mb_clear_bb() and ext4_group_add_blocks().
> >>
> > Yeah, I saw those as well but didn't think they needed changing. My high
> > level understanding of the alloc case is that ext4_mb_new_blocks()
> > acquires res (!delalloc), allocs blocks out of in-core structures, then
> > calls down into _mark_diskspace_used() to update/journal on-disk
> > structures with the pending alloc. If the latter fails, we release res
> > and feed blocks back into the in-core structures. So IOW, if we return
> > directly from _mark_diskspace_used() the counters/state end up
> > consistent afaict.
> >
> > For the ext4_free_blocks() case, we call _mark_context() and if it fails
> > with changed != 0 (and don't otherwise BUG), we still go ahead and free
> > the blocks in the e4b and return the error. It does look like the
> > discard code can clobber the error though, so perhaps that should be
> > fixed. But otherwise it's not clear to me why we might want to exit
> > early there. Am I missing something else?
> 
> The core issue is that they risk ignoring certain errors, which can
> result in inconsistency.
> 

Hmm.. I'm not sure it's that simple in the free path. It looks like
things are ordered differently there. We modify the on-disk struct and
if it changes something, then even it fails we go ahead and proceed with
the in-core updates, and then return the error. Modulo the discard logic
thing, the error is then passed into the standard error handling code,
so it isn't really ignored.

Though again I'm not quite sure what the expected result is here in the
case where it's the ext4_handle_dirty_metadata() call that fails. Is
this a guaranteed shutdown/abort situation? Perhaps Jan or somebody
familiar with the journaling code could chime in on this..? Thanks.

> >
> >> Since this issue popped up in the last couple of years, we should
> >> probably add a Fixes: tag to make backporting easier.
> >>
> > Do you have a target patch in mind? I made a pass through historical
> > changes and it looked like this was a longer standing issue through
> > various bits of refactoring..
> >
> No, I haven't, but I suspect it was introduced when
> (err && changed == 0) was added.
> 

Making another pass.. that code is introduced in commit 2f94711b098b
("ext4: call ext4_mb_mark_context in ext4_mb_mark_diskspace_used"), but
that just refactors things and doesn't appear to introduce the problem.
It looks like it still double decrements due to how the code is ordered.

Going back further.. commit 0087d9fb3f29 ("ext4: Fix
s_dirty_blocks_counter if block allocation failed with nodelalloc") adds
the release in ext4_mb_new_blocks(), so that appears to be where the
logic wart was first introduced.

That said, that one goes back to 2009 and the current fix wouldn't
apply. TBH given how old the bug is and that it's basically just a
warning in a shutdown case, I'm not sure it's really worth the risk of
backporting at all.. I'll defer to the community on that however and can
at least mention the commit in the changelog if we don't ultimately tag
it.

Brian

> 
> Cheers,
> Baokun
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown
  2025-12-16 15:53       ` Brian Foster
@ 2025-12-17  4:05         ` Baokun Li
  2025-12-17 13:52           ` Brian Foster
  2025-12-19 14:29         ` Jan Kara
  1 sibling, 1 reply; 9+ messages in thread
From: Baokun Li @ 2025-12-17  4:05 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-ext4, Jan Kara

On 2025-12-16 23:53, Brian Foster wrote:
> On Tue, Dec 16, 2025 at 12:01:41PM +0800, Baokun Li wrote:
>> On 2025-12-15 23:28, Brian Foster wrote:
>>> On Sat, Dec 13, 2025 at 09:46:23AM +0800, Baokun Li wrote:
>>>> Hi Brian,
>>>>
>>>> Thanks for the patch.
>>>>
>>> Hi Baokun,
>>>
>>> Thanks for reviewing..
>>>
>>>> On 2025-12-12 23:47, Brian Foster wrote:
>>>>> fstests test generic/388 occasionally reproduces a warning in
>>>>> ext4_put_super() associated with the dirty clusters count:
>>>>>
>>>>>   WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]
>>>>>
>>>>> Tracing the failure shows that the warning fires due to an
>>>>> s_dirtyclusters_counter value of -1. IOW, this appears to be a
>>>>> spurious decrement as opposed to some sort of leak. Further tracing
>>>>> of the dirty cluster count deltas and an LLM scan of the resulting
>>>>> output identified the cause as a double decrement in the error path
>>>>> between ext4_mb_mark_diskspace_used() and the caller
>>>>> ext4_mb_new_blocks().
>>>>>
>>>>> First, note that generic/388 is a shutdown vs. fsstress test and so
>>>>> produces a random set of operations and shutdown injections. In the
>>>>> problematic case, the shutdown triggers an error return from the
>>>>> ext4_handle_dirty_metadata() call(s) made from
>>>>> ext4_mb_mark_context(). The changed value is non-zero at this point,
>>>>> so ext4_mb_mark_diskspace_used() does not exit after the error
>>>>> bubbles up from ext4_mb_mark_context(). Instead, the former
>>>>> decrements both cluster counters and returns the error up to
>>>>> ext4_mb_new_blocks(). The latter falls into the !ar->len out path
>>>>> which decrements the dirty clusters counter a second time, creating
>>>>> the inconsistency.
>>>>>
>>>>> AFAICT the solution here is to exit immediately from
>>>>> ext4_mb_mark_diskspace_used() on error, regardless of the changed
>>>>> value. This leaves the caller responsible for clearing the block
>>>>> reservation at the same level it is acquired. This also skips the
>>>>> free clusters update, but the caller also calls into
>>>>> ext4_discard_allocated_blocks() to free the blocks back into the
>>>>> group. This survives an overnight loop test of generic/388 on an
>>>>> otherwise reproducing system and survives a local regression run.
>>>>>
>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>>> ---
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I've thrown some testing at this and poked around the area enough that I
>>>>> _think_ it is reasonably sane, but the error paths are hairy and I could
>>>>> certainly be missing some details. I'm happy to try a different approach
>>>>> if there are any thoughts around that.. thanks.
>>>>>
>>>>> Brian
>>>>>
>>>>>  fs/ext4/mballoc.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>>> index 56d50fd3310b..224abfd6a42b 100644
>>>>> --- a/fs/ext4/mballoc.c
>>>>> +++ b/fs/ext4/mballoc.c
>>>>> @@ -4234,7 +4234,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
>>>>>  				   ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len,
>>>>>  				   flags, &changed);
>>>>>  
>>>>> -	if (err && changed == 0)
>>>>> +	if (err)
>>>>>  		return err;
>>>>>  
>>>>>  #ifdef AGGRESSIVE_CHECK
>>>> I think we might need to swap that && for an ||.
>>>>
>>>> Basically, we should only proceed with the following logic if there's
>>>> no error and the bitmap was actually changed. If we hit an error or
>>>> if the section we intended to modify was already fully handled,
>>>> we should just bail out early. Otherwise, the err could get quietly
>>>> ignored unless we hit a duplicate allocation that happens to result in
>>>> 'changed' being zero.
>>>>
>>> Hmm.. to make sure I understand, are you referring to an inconsistency
>>> case where we allocated blocks that were already marked as such in the
>>> group on disk..?
>> Yes.
>>> I'm a little uneasy about this because it seems to conflict with the
>>> surrounding code. AFAICT the only way we can hit something like !err &&
>>> !changed is via EXT4_MB_BITMAP_MARKED_CHECK, which causes
>>> _mark_context() to check the bitmap for "already modified" bits up
>>> front.
>>>
>>> If this scenario plays out, the caller has a BUG check just after the
>>> return (also under the aggressive check macro). So ISTM that this sort
>>> of (err || !changed) logic would bypass the aggressive checks and let
>>> the fs carry on when it probably shouldn't. Hm?
>> Regarding ext4_mb_mark_context, if the passed ret_changed pointer is
>> non-NULL, we initialize *ret_changed to 0. After updating the bitmap_bh,
>> we then update *ret_changed with the actual number of blocks modified
>> (changed).
>>
>> Therefore, the original intention was for changed == 0 to signify that
>> an error occurred in ext4_mb_mark_context() before ret_changed could be
>> updated. However, as you pointed out, we also get changed == 0 when the
>> target extent has already been fully marked as allocated within bitmap_bh.
>>
>> Crucially, we only genuinely check the bitmap to modify changed when
>> EXT4_MB_BITMAP_MARKED_CHECK is set (i.e., when AGGRESSIVE_CHECK is defined,
>> or during fast commit or resize operations). Otherwise, changed is always
>> set to the target length. This means that, in the general case, errors
>> returned after the point where ret_changed is updated (e.g., the error
>> from ext4_handle_dirty_metadata()) are usually ignored.
>>
>> In summary:
>>
>>  * (err && changed == 0) only concerns errors that occur before ret_changed
>>    is updated.
>>  * (err || changed == 0) concerns whether there was an error OR if any
>>    modification actually took place.
>>
>> If we only care about err, we could move the update of ret_changed inside
>> ext4_mb_mark_context() to just before the successful return.
>>
> Yeah, I think the _mark_context() logic is reasonably straightforward
> from the code. What is less clear is why the allocation path only cares
> about errors prior to ret_changed being set.
>
> It looks to me that this is just wrong. I suspect either due to
> copy/paste error in the mark_diskspace_used() path at some point in the
> past, or an attempt to filter out the BUG case from an obvious case
> where changed will be 0 on certain error returns.
>
> I don't think the mark_context() logic alone tells the full story here.
> I think what's relevant is the high level error handling of the
> !delalloc allocation path:
>
> 1. ext4_mb_new_blocks() reserves blocks and attempts physical allocation
> in memory. On success, it calls into ext4_mb_mark_diskspace_used() to
> update on-disk structures.
>
> 2a. If ext4_mb_mark_diskspace_used() is successful, it decrements the
> freeclusters counter and releases res from the dirtyclusers counter
> (i.e. transfer the block from dirty to used). ext4_mb_new_blocks()
> basically just returns the result.
>
> 2b. If ext4_mb_mark_diskspace_used() fails, ext4_mb_new_blocks()
> receives the error. In the error exit path, it frees the blocks back
> into the incore structures and releases the reservation it acquired in
> step 1.  
>
> However the bug this patch is trying to fix is that
> ext4_mb_mark_diskspace_used() runs the counter updates regardless of
> error in some cases. If the on-disk update fails, _new_blocks() will
> release its reservation and return the blocks, so _mark_diskspace_used()
> shouldn't account that block from the dirty and free counters.
>
> What isn't quite clear to me is how this is expected to deal with the
> modified buffer. This particular case is a shutdown and journal abort,
> so I suspect the buffer can't write back.

These are actually two separate issues, so let's focus on the
dirtyclusters problem first, as it's more straightforward.

In ext4_mb_new_blocks(), when nodealloc is active, we increment
s_dirtyclusters_counter via ext4_claim_free_clusters() to ensure
there is enough free space to satisfy the current allocation request.

Once the block allocation finishes—regardless of whether it succeeded
or failed—we should return the incremented dirtyclusters count.
Therefore, we don't need to pass the reserv_clstrs parameter to
ext4_mb_mark_diskspace_used() just to update the counter there.
Instead, we can completely remove the dirtyclusters update logic from
that function.

A better approach would be to decrement the dirtyclusters count in
ext4_mb_new_blocks() whenever (!(ar->flags & EXT4_MB_DELALLOC_RESERVED)),
regardless of whether ar->len is 0.

This makes the code logic much clearer and helps prevent counter anomalies.

>
>>>> By the way, I spotted two other spots with similar error logic:
>>>> ext4_mb_clear_bb() and ext4_group_add_blocks().
>>>>
>>> Yeah, I saw those as well but didn't think they needed changing. My high
>>> level understanding of the alloc case is that ext4_mb_new_blocks()
>>> acquires res (!delalloc), allocs blocks out of in-core structures, then
>>> calls down into _mark_diskspace_used() to update/journal on-disk
>>> structures with the pending alloc. If the latter fails, we release res
>>> and feed blocks back into the in-core structures. So IOW, if we return
>>> directly from _mark_diskspace_used() the counters/state end up
>>> consistent afaict.
>>>
>>> For the ext4_free_blocks() case, we call _mark_context() and if it fails
>>> with changed != 0 (and don't otherwise BUG), we still go ahead and free
>>> the blocks in the e4b and return the error. It does look like the
>>> discard code can clobber the error though, so perhaps that should be
>>> fixed. But otherwise it's not clear to me why we might want to exit
>>> early there. Am I missing something else?
>> The core issue is that they risk ignoring certain errors, which can
>> result in inconsistency.
>>
> Hmm.. I'm not sure it's that simple in the free path. It looks like
> things are ordered differently there. We modify the on-disk struct and
> if it changes something, then even it fails we go ahead and proceed with
> the in-core updates, and then return the error. Modulo the discard logic
> thing, the error is then passed into the standard error handling code,
> so it isn't really ignored.
>
> Though again I'm not quite sure what the expected result is here in the
> case where it's the ext4_handle_dirty_metadata() call that fails. Is
> this a guaranteed shutdown/abort situation? Perhaps Jan or somebody
> familiar with the journaling code could chime in on this..? Thanks.
>
>>>> Since this issue popped up in the last couple of years, we should
>>>> probably add a Fixes: tag to make backporting easier.
>>>>
>>> Do you have a target patch in mind? I made a pass through historical
>>> changes and it looked like this was a longer standing issue through
>>> various bits of refactoring..
>>>
>> No, I haven't, but I suspect it was introduced when
>> (err && changed == 0) was added.
>>
> Making another pass.. that code is introduced in commit 2f94711b098b
> ("ext4: call ext4_mb_mark_context in ext4_mb_mark_diskspace_used"), but
> that just refactors things and doesn't appear to introduce the problem.
> It looks like it still double decrements due to how the code is ordered.

Yes, that is a separate issue.

>
> Going back further.. commit 0087d9fb3f29 ("ext4: Fix
> s_dirty_blocks_counter if block allocation failed with nodelalloc") adds
> the release in ext4_mb_new_blocks(), so that appears to be where the
> logic wart was first introduced.

Yes, it does appear that this commit introduced the issue. 

Thanks for finding the exact commit where it was introduced!

>
> That said, that one goes back to 2009 and the current fix wouldn't
> apply. TBH given how old the bug is and that it's basically just a
> warning in a shutdown case, I'm not sure it's really worth the risk of
> backporting at all.. I'll defer to the community on that however and can
> at least mention the commit in the changelog if we don't ultimately tag
> it.

I think it's essential to have this fixed.
Thank you for your effort on this.


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown
  2025-12-17  4:05         ` Baokun Li
@ 2025-12-17 13:52           ` Brian Foster
  0 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2025-12-17 13:52 UTC (permalink / raw)
  To: Baokun Li; +Cc: linux-ext4, Jan Kara

On Wed, Dec 17, 2025 at 12:05:18PM +0800, Baokun Li wrote:
> On 2025-12-16 23:53, Brian Foster wrote:
> > On Tue, Dec 16, 2025 at 12:01:41PM +0800, Baokun Li wrote:
> >> On 2025-12-15 23:28, Brian Foster wrote:
> >>> On Sat, Dec 13, 2025 at 09:46:23AM +0800, Baokun Li wrote:
> >>>> Hi Brian,
> >>>>
> >>>> Thanks for the patch.
> >>>>
> >>> Hi Baokun,
> >>>
> >>> Thanks for reviewing..
> >>>
> >>>> On 2025-12-12 23:47, Brian Foster wrote:
> >>>>> fstests test generic/388 occasionally reproduces a warning in
> >>>>> ext4_put_super() associated with the dirty clusters count:
> >>>>>
> >>>>>   WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]
> >>>>>
> >>>>> Tracing the failure shows that the warning fires due to an
> >>>>> s_dirtyclusters_counter value of -1. IOW, this appears to be a
> >>>>> spurious decrement as opposed to some sort of leak. Further tracing
> >>>>> of the dirty cluster count deltas and an LLM scan of the resulting
> >>>>> output identified the cause as a double decrement in the error path
> >>>>> between ext4_mb_mark_diskspace_used() and the caller
> >>>>> ext4_mb_new_blocks().
> >>>>>
> >>>>> First, note that generic/388 is a shutdown vs. fsstress test and so
> >>>>> produces a random set of operations and shutdown injections. In the
> >>>>> problematic case, the shutdown triggers an error return from the
> >>>>> ext4_handle_dirty_metadata() call(s) made from
> >>>>> ext4_mb_mark_context(). The changed value is non-zero at this point,
> >>>>> so ext4_mb_mark_diskspace_used() does not exit after the error
> >>>>> bubbles up from ext4_mb_mark_context(). Instead, the former
> >>>>> decrements both cluster counters and returns the error up to
> >>>>> ext4_mb_new_blocks(). The latter falls into the !ar->len out path
> >>>>> which decrements the dirty clusters counter a second time, creating
> >>>>> the inconsistency.
> >>>>>
> >>>>> AFAICT the solution here is to exit immediately from
> >>>>> ext4_mb_mark_diskspace_used() on error, regardless of the changed
> >>>>> value. This leaves the caller responsible for clearing the block
> >>>>> reservation at the same level it is acquired. This also skips the
> >>>>> free clusters update, but the caller also calls into
> >>>>> ext4_discard_allocated_blocks() to free the blocks back into the
> >>>>> group. This survives an overnight loop test of generic/388 on an
> >>>>> otherwise reproducing system and survives a local regression run.
> >>>>>
> >>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
> >>>>> ---
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I've thrown some testing at this and poked around the area enough that I
> >>>>> _think_ it is reasonably sane, but the error paths are hairy and I could
> >>>>> certainly be missing some details. I'm happy to try a different approach
> >>>>> if there are any thoughts around that.. thanks.
> >>>>>
> >>>>> Brian
> >>>>>
> >>>>>  fs/ext4/mballoc.c | 2 +-
> >>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> >>>>> index 56d50fd3310b..224abfd6a42b 100644
> >>>>> --- a/fs/ext4/mballoc.c
> >>>>> +++ b/fs/ext4/mballoc.c
> >>>>> @@ -4234,7 +4234,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
> >>>>>  				   ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len,
> >>>>>  				   flags, &changed);
> >>>>>  
> >>>>> -	if (err && changed == 0)
> >>>>> +	if (err)
> >>>>>  		return err;
> >>>>>  
> >>>>>  #ifdef AGGRESSIVE_CHECK
> >>>> I think we might need to swap that && for an ||.
> >>>>
> >>>> Basically, we should only proceed with the following logic if there's
> >>>> no error and the bitmap was actually changed. If we hit an error or
> >>>> if the section we intended to modify was already fully handled,
> >>>> we should just bail out early. Otherwise, the err could get quietly
> >>>> ignored unless we hit a duplicate allocation that happens to result in
> >>>> 'changed' being zero.
> >>>>
> >>> Hmm.. to make sure I understand, are you referring to an inconsistency
> >>> case where we allocated blocks that were already marked as such in the
> >>> group on disk..?
> >> Yes.
> >>> I'm a little uneasy about this because it seems to conflict with the
> >>> surrounding code. AFAICT the only way we can hit something like !err &&
> >>> !changed is via EXT4_MB_BITMAP_MARKED_CHECK, which causes
> >>> _mark_context() to check the bitmap for "already modified" bits up
> >>> front.
> >>>
> >>> If this scenario plays out, the caller has a BUG check just after the
> >>> return (also under the aggressive check macro). So ISTM that this sort
> >>> of (err || !changed) logic would bypass the aggressive checks and let
> >>> the fs carry on when it probably shouldn't. Hm?
> >> Regarding ext4_mb_mark_context, if the passed ret_changed pointer is
> >> non-NULL, we initialize *ret_changed to 0. After updating the bitmap_bh,
> >> we then update *ret_changed with the actual number of blocks modified
> >> (changed).
> >>
> >> Therefore, the original intention was for changed == 0 to signify that
> >> an error occurred in ext4_mb_mark_context() before ret_changed could be
> >> updated. However, as you pointed out, we also get changed == 0 when the
> >> target extent has already been fully marked as allocated within bitmap_bh.
> >>
> >> Crucially, we only genuinely check the bitmap to modify changed when
> >> EXT4_MB_BITMAP_MARKED_CHECK is set (i.e., when AGGRESSIVE_CHECK is defined,
> >> or during fast commit or resize operations). Otherwise, changed is always
> >> set to the target length. This means that, in the general case, errors
> >> returned after the point where ret_changed is updated (e.g., the error
> >> from ext4_handle_dirty_metadata()) are usually ignored.
> >>
> >> In summary:
> >>
> >>  * (err && changed == 0) only concerns errors that occur before ret_changed
> >>    is updated.
> >>  * (err || changed == 0) concerns whether there was an error OR if any
> >>    modification actually took place.
> >>
> >> If we only care about err, we could move the update of ret_changed inside
> >> ext4_mb_mark_context() to just before the successful return.
> >>
> > Yeah, I think the _mark_context() logic is reasonably straightforward
> > from the code. What is less clear is why the allocation path only cares
> > about errors prior to ret_changed being set.
> >
> > It looks to me that this is just wrong. I suspect either due to
> > copy/paste error in the mark_diskspace_used() path at some point in the
> > past, or an attempt to filter out the BUG case from an obvious case
> > where changed will be 0 on certain error returns.
> >
> > I don't think the mark_context() logic alone tells the full story here.
> > I think what's relevant is the high level error handling of the
> > !delalloc allocation path:
> >
> > 1. ext4_mb_new_blocks() reserves blocks and attempts physical allocation
> > in memory. On success, it calls into ext4_mb_mark_diskspace_used() to
> > update on-disk structures.
> >
> > 2a. If ext4_mb_mark_diskspace_used() is successful, it decrements the
> > freeclusters counter and releases res from the dirtyclusers counter
> > (i.e. transfer the block from dirty to used). ext4_mb_new_blocks()
> > basically just returns the result.
> >
> > 2b. If ext4_mb_mark_diskspace_used() fails, ext4_mb_new_blocks()
> > receives the error. In the error exit path, it frees the blocks back
> > into the incore structures and releases the reservation it acquired in
> > step 1.  
> >
> > However the bug this patch is trying to fix is that
> > ext4_mb_mark_diskspace_used() runs the counter updates regardless of
> > error in some cases. If the on-disk update fails, _new_blocks() will
> > release its reservation and return the blocks, so _mark_diskspace_used()
> > shouldn't account that block from the dirty and free counters.
> >
> > What isn't quite clear to me is how this is expected to deal with the
> > modified buffer. This particular case is a shutdown and journal abort,
> > so I suspect the buffer can't write back.
> 
> These are actually two separate issues, so let's focus on the
> dirtyclusters problem first, as it's more straightforward.
> 

Agreed.

> In ext4_mb_new_blocks(), when nodealloc is active, we increment
> s_dirtyclusters_counter via ext4_claim_free_clusters() to ensure
> there is enough free space to satisfy the current allocation request.
> 
> Once the block allocation finishes—regardless of whether it succeeded
> or failed—we should return the incremented dirtyclusters count.
> Therefore, we don't need to pass the reserv_clstrs parameter to
> ext4_mb_mark_diskspace_used() just to update the counter there.
> Instead, we can completely remove the dirtyclusters update logic from
> that function.
> 
> A better approach would be to decrement the dirtyclusters count in
> ext4_mb_new_blocks() whenever (!(ar->flags & EXT4_MB_DELALLOC_RESERVED)),
> regardless of whether ar->len is 0.
> 

Sure I think we could do something like that.

> This makes the code logic much clearer and helps prevent counter anomalies.
> 
> >
> >>>> By the way, I spotted two other spots with similar error logic:
> >>>> ext4_mb_clear_bb() and ext4_group_add_blocks().
> >>>>
> >>> Yeah, I saw those as well but didn't think they needed changing. My high
> >>> level understanding of the alloc case is that ext4_mb_new_blocks()
> >>> acquires res (!delalloc), allocs blocks out of in-core structures, then
> >>> calls down into _mark_diskspace_used() to update/journal on-disk
> >>> structures with the pending alloc. If the latter fails, we release res
> >>> and feed blocks back into the in-core structures. So IOW, if we return
> >>> directly from _mark_diskspace_used() the counters/state end up
> >>> consistent afaict.
> >>>
> >>> For the ext4_free_blocks() case, we call _mark_context() and if it fails
> >>> with changed != 0 (and don't otherwise BUG), we still go ahead and free
> >>> the blocks in the e4b and return the error. It does look like the
> >>> discard code can clobber the error though, so perhaps that should be
> >>> fixed. But otherwise it's not clear to me why we might want to exit
> >>> early there. Am I missing something else?
> >> The core issue is that they risk ignoring certain errors, which can
> >> result in inconsistency.
> >>
> > Hmm.. I'm not sure it's that simple in the free path. It looks like
> > things are ordered differently there. We modify the on-disk struct and
> > if it changes something, then even it fails we go ahead and proceed with
> > the in-core updates, and then return the error. Modulo the discard logic
> > thing, the error is then passed into the standard error handling code,
> > so it isn't really ignored.
> >
> > Though again I'm not quite sure what the expected result is here in the
> > case where it's the ext4_handle_dirty_metadata() call that fails. Is
> > this a guaranteed shutdown/abort situation? Perhaps Jan or somebody
> > familiar with the journaling code could chime in on this..? Thanks.
> >
> >>>> Since this issue popped up in the last couple of years, we should
> >>>> probably add a Fixes: tag to make backporting easier.
> >>>>
> >>> Do you have a target patch in mind? I made a pass through historical
> >>> changes and it looked like this was a longer standing issue through
> >>> various bits of refactoring..
> >>>
> >> No, I haven't, but I suspect it was introduced when
> >> (err && changed == 0) was added.
> >>
> > Making another pass.. that code is introduced in commit 2f94711b098b
> > ("ext4: call ext4_mb_mark_context in ext4_mb_mark_diskspace_used"), but
> > that just refactors things and doesn't appear to introduce the problem.
> > It looks like it still double decrements due to how the code is ordered.
> 
> Yes, that is a separate issue.
> 
> >
> > Going back further.. commit 0087d9fb3f29 ("ext4: Fix
> > s_dirty_blocks_counter if block allocation failed with nodelalloc") adds
> > the release in ext4_mb_new_blocks(), so that appears to be where the
> > logic wart was first introduced.
> 
> Yes, it does appear that this commit introduced the issue. 
> 
> Thanks for finding the exact commit where it was introduced!
> 
> >
> > That said, that one goes back to 2009 and the current fix wouldn't
> > apply. TBH given how old the bug is and that it's basically just a
> > warning in a shutdown case, I'm not sure it's really worth the risk of
> > backporting at all.. I'll defer to the community on that however and can
> > at least mention the commit in the changelog if we don't ultimately tag
> > it.
> 
> I think it's essential to have this fixed.
> Thank you for your effort on this.
> 

Ack, thanks.

Brian

> 
> Cheers,
> Baokun
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown
  2025-12-16 15:53       ` Brian Foster
  2025-12-17  4:05         ` Baokun Li
@ 2025-12-19 14:29         ` Jan Kara
  2026-01-05 14:04           ` Brian Foster
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2025-12-19 14:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: Baokun Li, linux-ext4, Jan Kara

On Tue 16-12-25 10:53:35, Brian Foster wrote:
> On Tue, Dec 16, 2025 at 12:01:41PM +0800, Baokun Li wrote:
> > On 2025-12-15 23:28, Brian Foster wrote:
> > > On Sat, Dec 13, 2025 at 09:46:23AM +0800, Baokun Li wrote:
> > >> Hi Brian,
> > >>
> > >> Thanks for the patch.
> > >>
> > > Hi Baokun,
> > >
> > > Thanks for reviewing..
> > >
> > >> On 2025-12-12 23:47, Brian Foster wrote:
> > >>> fstests test generic/388 occasionally reproduces a warning in
> > >>> ext4_put_super() associated with the dirty clusters count:
> > >>>
> > >>>   WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]
> > >>>
> > >>> Tracing the failure shows that the warning fires due to an
> > >>> s_dirtyclusters_counter value of -1. IOW, this appears to be a
> > >>> spurious decrement as opposed to some sort of leak. Further tracing
> > >>> of the dirty cluster count deltas and an LLM scan of the resulting
> > >>> output identified the cause as a double decrement in the error path
> > >>> between ext4_mb_mark_diskspace_used() and the caller
> > >>> ext4_mb_new_blocks().
> > >>>
> > >>> First, note that generic/388 is a shutdown vs. fsstress test and so
> > >>> produces a random set of operations and shutdown injections. In the
> > >>> problematic case, the shutdown triggers an error return from the
> > >>> ext4_handle_dirty_metadata() call(s) made from
> > >>> ext4_mb_mark_context(). The changed value is non-zero at this point,
> > >>> so ext4_mb_mark_diskspace_used() does not exit after the error
> > >>> bubbles up from ext4_mb_mark_context(). Instead, the former
> > >>> decrements both cluster counters and returns the error up to
> > >>> ext4_mb_new_blocks(). The latter falls into the !ar->len out path
> > >>> which decrements the dirty clusters counter a second time, creating
> > >>> the inconsistency.
> > >>>
> > >>> AFAICT the solution here is to exit immediately from
> > >>> ext4_mb_mark_diskspace_used() on error, regardless of the changed
> > >>> value. This leaves the caller responsible for clearing the block
> > >>> reservation at the same level it is acquired. This also skips the
> > >>> free clusters update, but the caller also calls into
> > >>> ext4_discard_allocated_blocks() to free the blocks back into the
> > >>> group. This survives an overnight loop test of generic/388 on an
> > >>> otherwise reproducing system and survives a local regression run.
> > >>>
> > >>> Signed-off-by: Brian Foster <bfoster@redhat.com>
> > >>> ---
> > >>>
> > >>> Hi all,
> > >>>
> > >>> I've thrown some testing at this and poked around the area enough that I
> > >>> _think_ it is reasonably sane, but the error paths are hairy and I could
> > >>> certainly be missing some details. I'm happy to try a different approach
> > >>> if there are any thoughts around that.. thanks.
> > >>>
> > >>> Brian
> > >>>
> > >>>  fs/ext4/mballoc.c | 2 +-
> > >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > >>> index 56d50fd3310b..224abfd6a42b 100644
> > >>> --- a/fs/ext4/mballoc.c
> > >>> +++ b/fs/ext4/mballoc.c
> > >>> @@ -4234,7 +4234,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
> > >>>  				   ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len,
> > >>>  				   flags, &changed);
> > >>>  
> > >>> -	if (err && changed == 0)
> > >>> +	if (err)
> > >>>  		return err;
> > >>>  
> > >>>  #ifdef AGGRESSIVE_CHECK
> > >> I think we might need to swap that && for an ||.
> > >>
> > >> Basically, we should only proceed with the following logic if there's
> > >> no error and the bitmap was actually changed. If we hit an error or
> > >> if the section we intended to modify was already fully handled,
> > >> we should just bail out early. Otherwise, the err could get quietly
> > >> ignored unless we hit a duplicate allocation that happens to result in
> > >> 'changed' being zero.
> > >>
> > > Hmm.. to make sure I understand, are you referring to an inconsistency
> > > case where we allocated blocks that were already marked as such in the
> > > group on disk..?
> > Yes.
> > > I'm a little uneasy about this because it seems to conflict with the
> > > surrounding code. AFAICT the only way we can hit something like !err &&
> > > !changed is via EXT4_MB_BITMAP_MARKED_CHECK, which causes
> > > _mark_context() to check the bitmap for "already modified" bits up
> > > front.
> > >
> > > If this scenario plays out, the caller has a BUG check just after the
> > > return (also under the aggressive check macro). So ISTM that this sort
> > > of (err || !changed) logic would bypass the aggressive checks and let
> > > the fs carry on when it probably shouldn't. Hm?
> > 
> > Regarding ext4_mb_mark_context, if the passed ret_changed pointer is
> > non-NULL, we initialize *ret_changed to 0. After updating the bitmap_bh,
> > we then update *ret_changed with the actual number of blocks modified
> > (changed).
> > 
> > Therefore, the original intention was for changed == 0 to signify that
> > an error occurred in ext4_mb_mark_context() before ret_changed could be
> > updated. However, as you pointed out, we also get changed == 0 when the
> > target extent has already been fully marked as allocated within bitmap_bh.
> > 
> > Crucially, we only genuinely check the bitmap to modify changed when
> > EXT4_MB_BITMAP_MARKED_CHECK is set (i.e., when AGGRESSIVE_CHECK is defined,
> > or during fast commit or resize operations). Otherwise, changed is always
> > set to the target length. This means that, in the general case, errors
> > returned after the point where ret_changed is updated (e.g., the error
> > from ext4_handle_dirty_metadata()) are usually ignored.
> > 
> > In summary:
> > 
> >  * (err && changed == 0) only concerns errors that occur before ret_changed
> >    is updated.
> >  * (err || changed == 0) concerns whether there was an error OR if any
> >    modification actually took place.
> > 
> > If we only care about err, we could move the update of ret_changed inside
> > ext4_mb_mark_context() to just before the successful return.
> > 
> 
> Yeah, I think the _mark_context() logic is reasonably straightforward
> from the code. What is less clear is why the allocation path only cares
> about errors prior to ret_changed being set.
> 
> It looks to me that this is just wrong. I suspect either due to
> copy/paste error in the mark_diskspace_used() path at some point in the
> past, or an attempt to filter out the BUG case from an obvious case
> where changed will be 0 on certain error returns.
> 
> I don't think the mark_context() logic alone tells the full story here.
> I think what's relevant is the high level error handling of the
> !delalloc allocation path:
> 
> 1. ext4_mb_new_blocks() reserves blocks and attempts physical allocation
> in memory. On success, it calls into ext4_mb_mark_diskspace_used() to
> update on-disk structures.
> 
> 2a. If ext4_mb_mark_diskspace_used() is successful, it decrements the
> freeclusters counter and releases res from the dirtyclusers counter
> (i.e. transfer the block from dirty to used). ext4_mb_new_blocks()
> basically just returns the result.
> 
> 2b. If ext4_mb_mark_diskspace_used() fails, ext4_mb_new_blocks()
> receives the error. In the error exit path, it frees the blocks back
> into the incore structures and releases the reservation it acquired in
> step 1.  
> 
> However the bug this patch is trying to fix is that
> ext4_mb_mark_diskspace_used() runs the counter updates regardless of
> error in some cases. If the on-disk update fails, _new_blocks() will
> release its reservation and return the blocks, so _mark_diskspace_used()
> shouldn't account that block from the dirty and free counters.
> 
> What isn't quite clear to me is how this is expected to deal with the
> modified buffer. This particular case is a shutdown and journal abort,
> so I suspect the buffer can't write back.
> 
> > >> By the way, I spotted two other spots with similar error logic:
> > >> ext4_mb_clear_bb() and ext4_group_add_blocks().
> > >>
> > > Yeah, I saw those as well but didn't think they needed changing. My high
> > > level understanding of the alloc case is that ext4_mb_new_blocks()
> > > acquires res (!delalloc), allocs blocks out of in-core structures, then
> > > calls down into _mark_diskspace_used() to update/journal on-disk
> > > structures with the pending alloc. If the latter fails, we release res
> > > and feed blocks back into the in-core structures. So IOW, if we return
> > > directly from _mark_diskspace_used() the counters/state end up
> > > consistent afaict.
> > >
> > > For the ext4_free_blocks() case, we call _mark_context() and if it fails
> > > with changed != 0 (and don't otherwise BUG), we still go ahead and free
> > > the blocks in the e4b and return the error. It does look like the
> > > discard code can clobber the error though, so perhaps that should be
> > > fixed. But otherwise it's not clear to me why we might want to exit
> > > early there. Am I missing something else?
> > 
> > The core issue is that they risk ignoring certain errors, which can
> > result in inconsistency.
> 
> Hmm.. I'm not sure it's that simple in the free path. It looks like
> things are ordered differently there. We modify the on-disk struct and
> if it changes something, then even it fails we go ahead and proceed with
> the in-core updates, and then return the error. Modulo the discard logic
> thing, the error is then passed into the standard error handling code,
> so it isn't really ignored.
> 
> Though again I'm not quite sure what the expected result is here in the
> case where it's the ext4_handle_dirty_metadata() call that fails. Is
> this a guaranteed shutdown/abort situation? Perhaps Jan or somebody
> familiar with the journaling code could chime in on this..? Thanks.

Yes, at the moment ext4_handle_dirty_metadata() or
ext4_journal_get_write_access() returns error the journal is dead and no
modifications can make it to the disk. *But* this applies only to
filesystems with the journal. If we don't have a journal,
ext4_handle_dirty_metadata() can fail also due to simple IO error when
writing out the buffer. OTOH without a journal you're expected to run
e2fsck once error like this happens so we just strive not to loose more
data than necessary and not to crash the kernel...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown
  2025-12-19 14:29         ` Jan Kara
@ 2026-01-05 14:04           ` Brian Foster
  0 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2026-01-05 14:04 UTC (permalink / raw)
  To: Jan Kara; +Cc: Baokun Li, linux-ext4

On Fri, Dec 19, 2025 at 03:29:25PM +0100, Jan Kara wrote:
> On Tue 16-12-25 10:53:35, Brian Foster wrote:
> > On Tue, Dec 16, 2025 at 12:01:41PM +0800, Baokun Li wrote:
> > > On 2025-12-15 23:28, Brian Foster wrote:
> > > > On Sat, Dec 13, 2025 at 09:46:23AM +0800, Baokun Li wrote:
> > > >> Hi Brian,
> > > >>
> > > >> Thanks for the patch.
> > > >>
> > > > Hi Baokun,
> > > >
> > > > Thanks for reviewing..
> > > >
> > > >> On 2025-12-12 23:47, Brian Foster wrote:
> > > >>> fstests test generic/388 occasionally reproduces a warning in
> > > >>> ext4_put_super() associated with the dirty clusters count:
> > > >>>
> > > >>>   WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]
> > > >>>
> > > >>> Tracing the failure shows that the warning fires due to an
> > > >>> s_dirtyclusters_counter value of -1. IOW, this appears to be a
> > > >>> spurious decrement as opposed to some sort of leak. Further tracing
> > > >>> of the dirty cluster count deltas and an LLM scan of the resulting
> > > >>> output identified the cause as a double decrement in the error path
> > > >>> between ext4_mb_mark_diskspace_used() and the caller
> > > >>> ext4_mb_new_blocks().
> > > >>>
> > > >>> First, note that generic/388 is a shutdown vs. fsstress test and so
> > > >>> produces a random set of operations and shutdown injections. In the
> > > >>> problematic case, the shutdown triggers an error return from the
> > > >>> ext4_handle_dirty_metadata() call(s) made from
> > > >>> ext4_mb_mark_context(). The changed value is non-zero at this point,
> > > >>> so ext4_mb_mark_diskspace_used() does not exit after the error
> > > >>> bubbles up from ext4_mb_mark_context(). Instead, the former
> > > >>> decrements both cluster counters and returns the error up to
> > > >>> ext4_mb_new_blocks(). The latter falls into the !ar->len out path
> > > >>> which decrements the dirty clusters counter a second time, creating
> > > >>> the inconsistency.
> > > >>>
> > > >>> AFAICT the solution here is to exit immediately from
> > > >>> ext4_mb_mark_diskspace_used() on error, regardless of the changed
> > > >>> value. This leaves the caller responsible for clearing the block
> > > >>> reservation at the same level it is acquired. This also skips the
> > > >>> free clusters update, but the caller also calls into
> > > >>> ext4_discard_allocated_blocks() to free the blocks back into the
> > > >>> group. This survives an overnight loop test of generic/388 on an
> > > >>> otherwise reproducing system and survives a local regression run.
> > > >>>
> > > >>> Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > >>> ---
> > > >>>
> > > >>> Hi all,
> > > >>>
> > > >>> I've thrown some testing at this and poked around the area enough that I
> > > >>> _think_ it is reasonably sane, but the error paths are hairy and I could
> > > >>> certainly be missing some details. I'm happy to try a different approach
> > > >>> if there are any thoughts around that.. thanks.
> > > >>>
> > > >>> Brian
> > > >>>
> > > >>>  fs/ext4/mballoc.c | 2 +-
> > > >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >>>
> > > >>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > >>> index 56d50fd3310b..224abfd6a42b 100644
> > > >>> --- a/fs/ext4/mballoc.c
> > > >>> +++ b/fs/ext4/mballoc.c
> > > >>> @@ -4234,7 +4234,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
> > > >>>  				   ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len,
> > > >>>  				   flags, &changed);
> > > >>>  
> > > >>> -	if (err && changed == 0)
> > > >>> +	if (err)
> > > >>>  		return err;
> > > >>>  
> > > >>>  #ifdef AGGRESSIVE_CHECK
> > > >> I think we might need to swap that && for an ||.
> > > >>
> > > >> Basically, we should only proceed with the following logic if there's
> > > >> no error and the bitmap was actually changed. If we hit an error or
> > > >> if the section we intended to modify was already fully handled,
> > > >> we should just bail out early. Otherwise, the err could get quietly
> > > >> ignored unless we hit a duplicate allocation that happens to result in
> > > >> 'changed' being zero.
> > > >>
> > > > Hmm.. to make sure I understand, are you referring to an inconsistency
> > > > case where we allocated blocks that were already marked as such in the
> > > > group on disk..?
> > > Yes.
> > > > I'm a little uneasy about this because it seems to conflict with the
> > > > surrounding code. AFAICT the only way we can hit something like !err &&
> > > > !changed is via EXT4_MB_BITMAP_MARKED_CHECK, which causes
> > > > _mark_context() to check the bitmap for "already modified" bits up
> > > > front.
> > > >
> > > > If this scenario plays out, the caller has a BUG check just after the
> > > > return (also under the aggressive check macro). So ISTM that this sort
> > > > of (err || !changed) logic would bypass the aggressive checks and let
> > > > the fs carry on when it probably shouldn't. Hm?
> > > 
> > > Regarding ext4_mb_mark_context, if the passed ret_changed pointer is
> > > non-NULL, we initialize *ret_changed to 0. After updating the bitmap_bh,
> > > we then update *ret_changed with the actual number of blocks modified
> > > (changed).
> > > 
> > > Therefore, the original intention was for changed == 0 to signify that
> > > an error occurred in ext4_mb_mark_context() before ret_changed could be
> > > updated. However, as you pointed out, we also get changed == 0 when the
> > > target extent has already been fully marked as allocated within bitmap_bh.
> > > 
> > > Crucially, we only genuinely check the bitmap to modify changed when
> > > EXT4_MB_BITMAP_MARKED_CHECK is set (i.e., when AGGRESSIVE_CHECK is defined,
> > > or during fast commit or resize operations). Otherwise, changed is always
> > > set to the target length. This means that, in the general case, errors
> > > returned after the point where ret_changed is updated (e.g., the error
> > > from ext4_handle_dirty_metadata()) are usually ignored.
> > > 
> > > In summary:
> > > 
> > >  * (err && changed == 0) only concerns errors that occur before ret_changed
> > >    is updated.
> > >  * (err || changed == 0) concerns whether there was an error OR if any
> > >    modification actually took place.
> > > 
> > > If we only care about err, we could move the update of ret_changed inside
> > > ext4_mb_mark_context() to just before the successful return.
> > > 
> > 
> > Yeah, I think the _mark_context() logic is reasonably straightforward
> > from the code. What is less clear is why the allocation path only cares
> > about errors prior to ret_changed being set.
> > 
> > It looks to me that this is just wrong. I suspect either due to
> > copy/paste error in the mark_diskspace_used() path at some point in the
> > past, or an attempt to filter out the BUG case from an obvious case
> > where changed will be 0 on certain error returns.
> > 
> > I don't think the mark_context() logic alone tells the full story here.
> > I think what's relevant is the high level error handling of the
> > !delalloc allocation path:
> > 
> > 1. ext4_mb_new_blocks() reserves blocks and attempts physical allocation
> > in memory. On success, it calls into ext4_mb_mark_diskspace_used() to
> > update on-disk structures.
> > 
> > 2a. If ext4_mb_mark_diskspace_used() is successful, it decrements the
> > freeclusters counter and releases res from the dirtyclusers counter
> > (i.e. transfer the block from dirty to used). ext4_mb_new_blocks()
> > basically just returns the result.
> > 
> > 2b. If ext4_mb_mark_diskspace_used() fails, ext4_mb_new_blocks()
> > receives the error. In the error exit path, it frees the blocks back
> > into the incore structures and releases the reservation it acquired in
> > step 1.  
> > 
> > However the bug this patch is trying to fix is that
> > ext4_mb_mark_diskspace_used() runs the counter updates regardless of
> > error in some cases. If the on-disk update fails, _new_blocks() will
> > release its reservation and return the blocks, so _mark_diskspace_used()
> > shouldn't account that block from the dirty and free counters.
> > 
> > What isn't quite clear to me is how this is expected to deal with the
> > modified buffer. This particular case is a shutdown and journal abort,
> > so I suspect the buffer can't write back.
> > 
> > > >> By the way, I spotted two other spots with similar error logic:
> > > >> ext4_mb_clear_bb() and ext4_group_add_blocks().
> > > >>
> > > > Yeah, I saw those as well but didn't think they needed changing. My high
> > > > level understanding of the alloc case is that ext4_mb_new_blocks()
> > > > acquires res (!delalloc), allocs blocks out of in-core structures, then
> > > > calls down into _mark_diskspace_used() to update/journal on-disk
> > > > structures with the pending alloc. If the latter fails, we release res
> > > > and feed blocks back into the in-core structures. So IOW, if we return
> > > > directly from _mark_diskspace_used() the counters/state end up
> > > > consistent afaict.
> > > >
> > > > For the ext4_free_blocks() case, we call _mark_context() and if it fails
> > > > with changed != 0 (and don't otherwise BUG), we still go ahead and free
> > > > the blocks in the e4b and return the error. It does look like the
> > > > discard code can clobber the error though, so perhaps that should be
> > > > fixed. But otherwise it's not clear to me why we might want to exit
> > > > early there. Am I missing something else?
> > > 
> > > The core issue is that they risk ignoring certain errors, which can
> > > result in inconsistency.
> > 
> > Hmm.. I'm not sure it's that simple in the free path. It looks like
> > things are ordered differently there. We modify the on-disk struct and
> > if it changes something, then even it fails we go ahead and proceed with
> > the in-core updates, and then return the error. Modulo the discard logic
> > thing, the error is then passed into the standard error handling code,
> > so it isn't really ignored.
> > 
> > Though again I'm not quite sure what the expected result is here in the
> > case where it's the ext4_handle_dirty_metadata() call that fails. Is
> > this a guaranteed shutdown/abort situation? Perhaps Jan or somebody
> > familiar with the journaling code could chime in on this..? Thanks.
> 
> Yes, at the moment ext4_handle_dirty_metadata() or
> ext4_journal_get_write_access() returns error the journal is dead and no
> modifications can make it to the disk. *But* this applies only to
> filesystems with the journal. If we don't have a journal,
> ext4_handle_dirty_metadata() can fail also due to simple IO error when
> writing out the buffer. OTOH without a journal you're expected to run
> e2fsck once error like this happens so we just strive not to loose more
> data than necessary and not to crash the kernel...
> 

Ah good point, thanks. Anyways, I ran out of time before holiday pto to
get this one turned around with Baokun's suggested changes. Once I'm
undug and caught back up I'll get back to this and take another look
with this case in mind..

Brian

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-01-05 14:04 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-12 15:47 [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown Brian Foster
2025-12-13  1:46 ` Baokun Li
2025-12-15 15:28   ` Brian Foster
2025-12-16  4:01     ` Baokun Li
2025-12-16 15:53       ` Brian Foster
2025-12-17  4:05         ` Baokun Li
2025-12-17 13:52           ` Brian Foster
2025-12-19 14:29         ` Jan Kara
2026-01-05 14:04           ` Brian Foster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox