linux-ext4.vger.kernel.org archive mirror
From: Zhang Yi <yi.zhang@huaweicloud.com>
To: Jan Kara <jack@suse.cz>
Cc: Ext4 Developers List <linux-ext4@vger.kernel.org>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	Zhang Yi <yi.zhang@huawei.com>,
	"Li,Baokun" <libaokun1@huawei.com>,
	hsiangkao@linux.alibaba.com, yangerkun <yangerkun@huawei.com>
Subject: Re: BUG report: an ext4 data corruption issue in nojournal mode
Date: Mon, 15 Sep 2025 20:03:47 +0800	[thread overview]
Message-ID: <36785255-3f1c-46aa-9df3-f43f3b042e40@huaweicloud.com> (raw)
In-Reply-To: <mugnb73mvlvclccatavdd2rwczz3wl3gs7rh4kwcnejkdh4t6b@na743yuuidlj>

On 9/15/2025 6:56 PM, Jan Kara wrote:
> On Sat 13-09-25 11:36:27, Zhang Yi wrote:
>> On 9/12/2025 10:42 PM, Jan Kara wrote:
>>> Hello!
>>>
>>> On Fri 12-09-25 11:28:13, Zhang Yi wrote:
>>>> Gao Xiang recently discovered a data corruption issue in **nojournal**
>>>> mode. After analysis, we found that the problem is that after a
>>>> metadata block is freed, it can be immediately reallocated as a data
>>>> block. However, the metadata in this block may still be in the process
>>>> of being written back, which means the new data in the block could be
>>>> overwritten by the stale metadata.
>>>>
>>>> When releasing a metadata block, ext4_forget() calls bforget() in
>>>> nojournal mode, which clears the dirty flag on the buffer_head. If the
>>>> metadata has not yet started to be written back at this point, there is
>>>> no issue. However, if the write-back has already begun but the I/O has
>>>> not yet completed, ext4_forget() will have no effect, and the subsequent
>>>> ext4_mb_clear_bb() will immediately return the block to the mb
>>>> allocator. This block can then be immediately reallocated, potentially
>>>> triggering a data loss issue.
>>>
>>> Yes, I agree this can be a problem.
>>>
>>>> This issue is somewhat related to a patch set[1] that has already been
>>>> merged. Before that patch set, clean_bdev_aliases() and
>>>> clean_bdev_bh_alias() could, in most cases, ensure that the dirty flag
>>>> of the block device buffer was cleared and the write-back had completed
>>>> before newly allocated blocks were used. However, that patch set fixed
>>>> a similar issue in journal mode and removed this safeguard because it
>>>> was fragile and missed some corner cases[2], increasing the likelihood
>>>> of triggering this issue.
>>>
>>> Right.
>>>
>>>> Furthermore, I found that this issue theoretically persists even in
>>>> **ordered** journal mode. In the final else branch of
>>>> jbd2_journal_forget(), if the metadata block to be released is also
>>>> undergoing write-back, jbd2_journal_forget() will add this buffer to
>>>> the current transaction for forgetting. Once the current transaction is
>>>> committed, the block can be reallocated, but there is no guarantee that
>>>> the ongoing I/O has completed. Typically, an in-flight metadata
>>>> writeback does not take long to complete, but it might be throttled by
>>>> the block layer or delayed by a slow underlying device. So although it
>>>> is difficult to trigger, the issue theoretically still exists.
>>>
>>> I don't think this can actually happen. For writeback to be happening on a
>>> buffer it still has to be part of a checkpoint list of some transaction.
>>> That means we'll call jbd2_journal_try_remove_checkpoint() which will lock
>>> the buffer and that's enough to make sure the buffer writeback has either
>>> completed or not yet started. If I missed some case, please tell me.
>>>
>>
>> Yes, jbd2_journal_try_remove_checkpoint() does lock the buffer and check
>> the buffer's dirty status under the buffer lock. However, first, it
>> returns immediately if the buffer is locked by the write-back process,
>> which means it does not wait for the write-back to complete; thus, by the
>> time the current transaction is committed, there is still no guarantee
>> that the I/O has completed.
> 
> Right, it will return EBUSY for a buffer under IO, file the buffer on the
> BJ_Forget list of the running transaction, and in principle we're not
> guaranteed the IO completes before that transaction commits (although in
> practice that's always true).
> 
>> Second, it unlocks the buffer once it finds the buffer is still dirty;
>> if a concurrent write-back starts just after this unlock and before the
>> clear_buffer_dirty() in jbd2_journal_forget(), the issue can still
>> theoretically happen, right?
> 
> Hum, that as well.
> 
>> It seems that only the following changes can make sure the buffer
>> writeback has either completed or not yet started (and will never start
>> again). What do you think?
>>
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index c7867139af69..e4691e445106 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -1772,23 +1772,26 @@ int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
>>  			goto drop;
>>  		}
>>
>> -		/*
>> -		 * Otherwise, if the buffer has been written to disk,
>> -		 * it is safe to remove the checkpoint and drop it.
>> -		 */
>> -		if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
>> -			spin_unlock(&journal->j_list_lock);
>> -			goto drop;
>> +		lock_buffer(bh);
> 
> We hold j_list_lock and b_state_lock here so you cannot lock the buffer...
> I think we rather need something like:
> 
>                 /*
>                  * Otherwise, if the buffer has been written to disk,
>                  * it is safe to remove the checkpoint and drop it.
>                  */     
>                 if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
>                         spin_unlock(&journal->j_list_lock);
>                         goto drop;
>                 }
> 
>                 /*
>                  * The buffer is still not written to disk, we should
>                  * attach this buffer to current transaction so that the
>                  * buffer can be checkpointed only after the current
>                  * transaction commits.
>                  */
>                 clear_buffer_dirty(bh);
> +		wait_for_writeback = 1;
> 		__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
> 		spin_unlock(&journal->j_list_lock);
> 	}
> drop:
> 	__brelse(bh);
> 	spin_unlock(&jh->b_state_lock);
> +	if (wait_for_writeback)
> +		wait_on_buffer(bh);
> 	jbd2_journal_put_journal_head(jh);
> 	if (drop_reserve) {
> ...
> 

Yes, I agree. This is the simplest way to provide a guarantee.

Thanks,
Yi.



Thread overview: 5+ messages
2025-09-12  3:28 BUG report: an ext4 data corruption issue in nojournal mode Zhang Yi
2025-09-12 14:42 ` Jan Kara
2025-09-13  3:36   ` Zhang Yi
2025-09-15 10:56     ` Jan Kara
2025-09-15 12:03       ` Zhang Yi [this message]
