From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2974C346E4C for ; Wed, 17 Dec 2025 13:52:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1765979543; cv=none; b=WrXW/KkJtOLuR6TfyMZIafJ1erCWqVTrHwg2ZW3bBKDOoN7tNsqRlo5G2lUQY39KwAvaQSaYmwiz+JvWJQYr51l7oLQthRiXB2sHe2y8PmfWMJEGchIRFDH2+QTqau6C1fzQ/4MySAclsz6DwdiuXisdZIqno35IfPCCxq2vEVQ= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1765979543; c=relaxed/simple; bh=R2ek4QCI3O0vdSEfVt+FhruFza6s8UoZ6D1XoGnG+1Q=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=sk/VRqFcnkticOCa4/cIPanOZKHGX6zLSd+8jXDRGOBc0ypg7Bmz1VZxpb8XZAgPtnWufHoS3X/vg01tBS1ffGbiOllyHoAAHhT1FFU3un3Du5qsCSJbKxzhppbVTrJ6wGm9040W5ljDrRsTlTNsxdXcv+AYGerY/5cdxawx6ks= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=GM6BOhoy; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="GM6BOhoy" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1765979537; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=YvHo3DruybSwHnCGMW7ttmK9eQVDupdoKKFTyz/Gkzo=; b=GM6BOhoyXoYN84egj6z6F29jbrg/Dk2v0hWJBgzP22cmiNDQYHiWqjdQS+8VYLaxUo7dO4 FNBKUoisjaxbpQDD08a8kELXuXOMihKuGRi9q4/i6KdM8rpJFCjbD5+49PjhO4XBozgGTT 3WpJIOuvqRYbZfrPN3AcGoAXzrkTBZo= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-138-oADYrfukO4GsMjfit70YWQ-1; Wed, 17 Dec 2025 08:52:13 -0500 X-MC-Unique: oADYrfukO4GsMjfit70YWQ-1 X-Mimecast-MFC-AGG-ID: oADYrfukO4GsMjfit70YWQ_1765979532 Received: from mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.93]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 85898180028B; Wed, 17 Dec 2025 13:52:12 +0000 (UTC) Received: from bfoster (unknown [10.22.64.2]) by mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id D07A51800576; Wed, 17 Dec 2025 13:52:11 +0000 (UTC) Date: Wed, 17 Dec 2025 08:52:09 -0500 From: Brian Foster To: Baokun Li Cc: linux-ext4@vger.kernel.org, Jan Kara Subject: Re: [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown Message-ID: References: <20251212154735.512651-1-bfoster@redhat.com> <106ba6fe-4f94-4ed4-a53a-98a1f3ad30ab@huawei.com> <5e580bd0-519d-4bd0-93f5-9a2c532380ec@huawei.com> Precedence: bulk X-Mailing-List: linux-ext4@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <5e580bd0-519d-4bd0-93f5-9a2c532380ec@huawei.com> X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.93 On Wed, Dec 17, 2025 at 12:05:18PM +0800, Baokun Li wrote: > On 2025-12-16 23:53, Brian Foster wrote: > > On Tue, Dec 16, 2025 at 12:01:41PM +0800, Baokun Li wrote: > >> On 2025-12-15 23:28, Brian Foster wrote: > >>> On Sat, Dec 13, 2025 at 09:46:23AM +0800, Baokun Li wrote: > >>>> Hi Brian, > >>>> > >>>> Thanks for the patch. > >>>> > >>> Hi Baokun, > >>> > >>> Thanks for reviewing.. > >>> > >>>> On 2025-12-12 23:47, Brian Foster wrote: > >>>>> fstests test generic/388 occasionally reproduces a warning in > >>>>> ext4_put_super() associated with the dirty clusters count: > >>>>> > >>>>> WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4] > >>>>> > >>>>> Tracing the failure shows that the warning fires due to an > >>>>> s_dirtyclusters_counter value of -1. IOW, this appears to be a > >>>>> spurious decrement as opposed to some sort of leak. Further tracing > >>>>> of the dirty cluster count deltas and an LLM scan of the resulting > >>>>> output identified the cause as a double decrement in the error path > >>>>> between ext4_mb_mark_diskspace_used() and the caller > >>>>> ext4_mb_new_blocks(). > >>>>> > >>>>> First, note that generic/388 is a shutdown vs. fsstress test and so > >>>>> produces a random set of operations and shutdown injections. In the > >>>>> problematic case, the shutdown triggers an error return from the > >>>>> ext4_handle_dirty_metadata() call(s) made from > >>>>> ext4_mb_mark_context(). The changed value is non-zero at this point, > >>>>> so ext4_mb_mark_diskspace_used() does not exit after the error > >>>>> bubbles up from ext4_mb_mark_context(). Instead, the former > >>>>> decrements both cluster counters and returns the error up to > >>>>> ext4_mb_new_blocks(). The latter falls into the !ar->len out path > >>>>> which decrements the dirty clusters counter a second time, creating > >>>>> the inconsistency. > >>>>> > >>>>> AFAICT the solution here is to exit immediately from > >>>>> ext4_mb_mark_diskspace_used() on error, regardless of the changed > >>>>> value. This leaves the caller responsible for clearing the block > >>>>> reservation at the same level it is acquired. This also skips the > >>>>> free clusters update, but the caller also calls into > >>>>> ext4_discard_allocated_blocks() to free the blocks back into the > >>>>> group. This survives an overnight loop test of generic/388 on an > >>>>> otherwise reproducing system and survives a local regression run. > >>>>> > >>>>> Signed-off-by: Brian Foster > >>>>> --- > >>>>> > >>>>> Hi all, > >>>>> > >>>>> I've thrown some testing at this and poked around the area enough that I > >>>>> _think_ it is reasonably sane, but the error paths are hairy and I could > >>>>> certainly be missing some details. I'm happy to try a different approach > >>>>> if there are any thoughts around that.. thanks. > >>>>> > >>>>> Brian > >>>>> > >>>>> fs/ext4/mballoc.c | 2 +- > >>>>> 1 file changed, 1 insertion(+), 1 deletion(-) > >>>>> > >>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > >>>>> index 56d50fd3310b..224abfd6a42b 100644 > >>>>> --- a/fs/ext4/mballoc.c > >>>>> +++ b/fs/ext4/mballoc.c > >>>>> @@ -4234,7 +4234,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac, > >>>>> ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len, > >>>>> flags, &changed); > >>>>> > >>>>> - if (err && changed == 0) > >>>>> + if (err) > >>>>> return err; > >>>>> > >>>>> #ifdef AGGRESSIVE_CHECK > >>>> I think we might need to swap that && for an ||. > >>>> > >>>> Basically, we should only proceed with the following logic if there's > >>>> no error and the bitmap was actually changed. If we hit an error or > >>>> if the section we intended to modify was already fully handled, > >>>> we should just bail out early. Otherwise, the err could get quietly > >>>> ignored unless we hit a duplicate allocation that happens to result in > >>>> 'changed' being zero. > >>>> > >>> Hmm.. to make sure I understand, are you referring to an inconsistency > >>> case where we allocated blocks that were already marked as such in the > >>> group on disk..? > >> Yes. > >>> I'm a little uneasy about this because it seems to conflict with the > >>> surrounding code. AFAICT the only way we can hit something like !err && > >>> !changed is via EXT4_MB_BITMAP_MARKED_CHECK, which causes > >>> _mark_context() to check the bitmap for "already modified" bits up > >>> front. > >>> > >>> If this scenario plays out, the caller has a BUG check just after the > >>> return (also under the aggressive check macro). So ISTM that this sort > >>> of (err || !changed) logic would bypass the aggressive checks and let > >>> the fs carry on when it probably shouldn't. Hm? > >> Regarding ext4_mb_mark_context, if the passed ret_changed pointer is > >> non-NULL, we initialize *ret_changed to 0. After updating the bitmap_bh, > >> we then update *ret_changed with the actual number of blocks modified > >> (changed). > >> > >> Therefore, the original intention was for changed == 0 to signify that > >> an error occurred in ext4_mb_mark_context() before ret_changed could be > >> updated. However, as you pointed out, we also get changed == 0 when the > >> target extent has already been fully marked as allocated within bitmap_bh. > >> > >> Crucially, we only genuinely check the bitmap to modify changed when > >> EXT4_MB_BITMAP_MARKED_CHECK is set (i.e., when AGGRESSIVE_CHECK is defined, > >> or during fast commit or resize operations). Otherwise, changed is always > >> set to the target length. This means that, in the general case, errors > >> returned after the point where ret_changed is updated (e.g., the error > >> from ext4_handle_dirty_metadata()) are usually ignored. > >> > >> In summary: > >> > >>  * (err && changed == 0) only concerns errors that occur before ret_changed > >>    is updated. > >>  * (err || changed == 0) concerns whether there was an error OR if any > >>    modification actually took place. > >> > >> If we only care about err, we could move the update of ret_changed inside > >> ext4_mb_mark_context() to just before the successful return. > >> > > Yeah, I think the _mark_context() logic is reasonably straightforward > > from the code. What is less clear is why the allocation path only cares > > about errors prior to ret_changed being set. > > > > It looks to me that this is just wrong. I suspect either due to > > copy/paste error in the mark_diskspace_used() path at some point in the > > past, or an attempt to filter out the BUG case from an obvious case > > where changed will be 0 on certain error returns. > > > > I don't think the mark_context() logic alone tells the full story here. > > I think what's relevant is the high level error handling of the > > !delalloc allocation path: > > > > 1. ext4_mb_new_blocks() reserves blocks and attempts physical allocation > > in memory. On success, it calls into ext4_mb_mark_diskspace_used() to > > update on-disk structures. > > > > 2a. If ext4_mb_mark_diskspace_used() is successful, it decrements the > > freeclusters counter and releases res from the dirtyclusers counter > > (i.e. transfer the block from dirty to used). ext4_mb_new_blocks() > > basically just returns the result. > > > > 2b. If ext4_mb_mark_diskspace_used() fails, ext4_mb_new_blocks() > > receives the error. In the error exit path, it frees the blocks back > > into the incore structures and releases the reservation it acquired in > > step 1. > > > > However the bug this patch is trying to fix is that > > ext4_mb_mark_diskspace_used() runs the counter updates regardless of > > error in some cases. If the on-disk update fails, _new_blocks() will > > release its reservation and return the blocks, so _mark_diskspace_used() > > shouldn't account that block from the dirty and free counters. > > > > What isn't quite clear to me is how this is expected to deal with the > > modified buffer. This particular case is a shutdown and journal abort, > > so I suspect the buffer can't write back. > > These are actually two separate issues, so let's focus on the > dirtyclusters problem first, as it's more straightforward. > Agreed. > In ext4_mb_new_blocks(), when nodealloc is active, we increment > s_dirtyclusters_counter via ext4_claim_free_clusters() to ensure > there is enough free space to satisfy the current allocation request. > > Once the block allocation finishes—regardless of whether it succeeded > or failed—we should return the incremented dirtyclusters count. > Therefore, we don't need to pass the reserv_clstrs parameter to > ext4_mb_mark_diskspace_used() just to update the counter there. > Instead, we can completely remove the dirtyclusters update logic from > that function. > > A better approach would be to decrement the dirtyclusters count in > ext4_mb_new_blocks() whenever (!(ar->flags & EXT4_MB_DELALLOC_RESERVED)), > regardless of whether ar->len is 0. > Sure I think we could do something like that. > This makes the code logic much clearer and helps prevent counter anomalies. > > > > >>>> By the way, I spotted two other spots with similar error logic: > >>>> ext4_mb_clear_bb() and ext4_group_add_blocks(). > >>>> > >>> Yeah, I saw those as well but didn't think they needed changing. My high > >>> level understanding of the alloc case is that ext4_mb_new_blocks() > >>> acquires res (!delalloc), allocs blocks out of in-core structures, then > >>> calls down into _mark_diskspace_used() to update/journal on-disk > >>> structures with the pending alloc. If the latter fails, we release res > >>> and feed blocks back into the in-core structures. So IOW, if we return > >>> directly from _mark_diskspace_used() the counters/state end up > >>> consistent afaict. > >>> > >>> For the ext4_free_blocks() case, we call _mark_context() and if it fails > >>> with changed != 0 (and don't otherwise BUG), we still go ahead and free > >>> the blocks in the e4b and return the error. It does look like the > >>> discard code can clobber the error though, so perhaps that should be > >>> fixed. But otherwise it's not clear to me why we might want to exit > >>> early there. Am I missing something else? > >> The core issue is that they risk ignoring certain errors, which can > >> result in inconsistency. > >> > > Hmm.. I'm not sure it's that simple in the free path. It looks like > > things are ordered differently there. We modify the on-disk struct and > > if it changes something, then even it fails we go ahead and proceed with > > the in-core updates, and then return the error. Modulo the discard logic > > thing, the error is then passed into the standard error handling code, > > so it isn't really ignored. > > > > Though again I'm not quite sure what the expected result is here in the > > case where it's the ext4_handle_dirty_metadata() call that fails. Is > > this a guaranteed shutdown/abort situation? Perhaps Jan or somebody > > familiar with the journaling code could chime in on this..? Thanks. > > > >>>> Since this issue popped up in the last couple of years, we should > >>>> probably add a Fixes: tag to make backporting easier. > >>>> > >>> Do you have a target patch in mind? I made a pass through historical > >>> changes and it looked like this was a longer standing issue through > >>> various bits of refactoring.. > >>> > >> No, I haven't, but I suspect it was introduced when > >> (err && changed == 0) was added. > >> > > Making another pass.. that code is introduced in commit 2f94711b098b > > ("ext4: call ext4_mb_mark_context in ext4_mb_mark_diskspace_used"), but > > that just refactors things and doesn't appear to introduce the problem. > > It looks like it still double decrements due to how the code is ordered. > > Yes, that is a separate issue. > > > > > Going back further.. commit 0087d9fb3f29 ("ext4: Fix > > s_dirty_blocks_counter if block allocation failed with nodelalloc") adds > > the release in ext4_mb_new_blocks(), so that appears to be where the > > logic wart was first introduced. > > Yes, it does appear that this commit introduced the issue.  > > Thanks for finding the exact commit where it was introduced! > > > > > That said, that one goes back to 2009 and the current fix wouldn't > > apply. TBH given how old the bug is and that it's basically just a > > warning in a shutdown case, I'm not sure it's really worth the risk of > > backporting at all.. I'll defer to the community on that however and can > > at least mention the commit in the changelog if we don't ultimately tag > > it. > > I think it's essential to have this fixed. > Thank you for your effort on this. > Ack, thanks. Brian > > Cheers, > Baokun >