linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode
@ 2025-09-16  9:33 Zhang Yi
  2025-09-16  9:33 ` [PATCH 1/2] jbd2: ensure that all ongoing I/O complete before freeing blocks Zhang Yi
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Zhang Yi @ 2025-09-16  9:33 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	hsiangkao, yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Hello!

This series fixes a data corruption issue reported by Gao Xiang in
nojournal mode. The problem is that after a metadata block is freed,
it can be immediately reallocated as a data block. However, the metadata
on this block may still be in the process of being written back, which
means the new data in this block could potentially be overwritten by the
stale metadata and trigger a data corruption issue. Please see the
discussion below with Jan for more details:

  https://lore.kernel.org/linux-ext4/a9417096-9549-4441-9878-b1955b899b4e@huaweicloud.com/

Patch 1 hardens the same case in ordered journal mode, theoretically
preventing the occurrence of stale data issues.
Patch 2 fixes this issue in nojournal mode.

Regards,
Yi.

Zhang Yi (2):
  jbd2: ensure that all ongoing I/O complete before freeing blocks
  ext4: wait for ongoing I/O to complete before freeing blocks

 fs/ext4/ext4_jbd2.c   | 11 +++++++++--
 fs/jbd2/transaction.c | 13 +++++++++----
 2 files changed, 18 insertions(+), 6 deletions(-)

-- 
2.46.1
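
For readers who want the race spelled out, here is a minimal userspace
sketch of the sequence described in the cover letter. It is not kernel
code: toy_block, toy_io and reuse_block() are hypothetical names, and the
interleaving is fixed by hand instead of coming from real concurrency.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy model (assumption): one on-disk block and one in-flight write. */
struct toy_block { char content[32]; };
struct toy_io { bool pending; char payload[32]; };

/* Start writeback: the payload is captured now and lands on disk later. */
static void toy_submit(struct toy_io *io, const char *payload)
{
	snprintf(io->payload, sizeof(io->payload), "%s", payload);
	io->pending = true;
}

/* Complete writeback: the captured payload overwrites the block. */
static void toy_complete(struct toy_io *io, struct toy_block *blk)
{
	if (!io->pending)
		return;
	memcpy(blk->content, io->payload, sizeof(blk->content));
	io->pending = false;
}

/*
 * Free a metadata block whose writeback is in flight, reallocate it as a
 * data block, and report whether the new data survives on disk.
 */
static bool reuse_block(bool wait_before_free)
{
	struct toy_block blk = { "" };
	struct toy_io meta_io = { false, "" };

	toy_submit(&meta_io, "stale-metadata");
	if (wait_before_free)
		toy_complete(&meta_io, &blk);	/* stands in for wait_on_buffer() */

	/* The block is freed, reallocated, and the new data reaches the disk. */
	snprintf(blk.content, sizeof(blk.content), "%s", "new-data");

	/* Any I/O still in flight completes after the reallocation. */
	toy_complete(&meta_io, &blk);
	return strcmp(blk.content, "new-data") == 0;
}
```

In this model reuse_block(false) returns false, because the late
completion overwrites the new data, while reuse_block(true) returns true.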



* [PATCH 1/2] jbd2: ensure that all ongoing I/O complete before freeing blocks
  2025-09-16  9:33 [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode Zhang Yi
@ 2025-09-16  9:33 ` Zhang Yi
  2025-09-16 10:56   ` Jan Kara
  2025-09-16  9:33 ` [PATCH 2/2] ext4: wait for ongoing I/O to " Zhang Yi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Zhang Yi @ 2025-09-16  9:33 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	hsiangkao, yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

When releasing file system metadata blocks in jbd2_journal_forget(), if
the buffer has not yet been checkpointed, it may have already been
written back, currently be in the process of being written back, or not
yet have been written back. jbd2_journal_forget() calls
jbd2_journal_try_remove_checkpoint() to check the buffer's status and
add it to the current transaction if it has not been written back. This
buffer can only be reallocated after the transaction is committed.

jbd2_journal_try_remove_checkpoint() attempts to lock the buffer and
check its dirty status while holding the buffer lock. If the buffer has
already been written back, everything proceeds normally. However, there
are two issues. First, the function returns immediately if the buffer is
locked by the write-back process. It does not wait for the write-back to
complete. Consequently, until the current transaction is committed and
the block is reallocated, there is no guarantee that the I/O will
complete. This means that ongoing I/O could write stale metadata to the
newly allocated block, potentially corrupting data. Second, the function
unlocks the buffer as soon as it detects that the buffer is still dirty.
If a concurrent write-back occurs immediately after this unlocking and
before clear_buffer_dirty() is called in jbd2_journal_forget(), data
corruption can theoretically still occur.

Although these two issues are unlikely to occur in practice since the
in-flight metadata writeback I/O does not take that long to complete,
it's better to explicitly ensure that all ongoing I/O operations are
completed.

Fixes: 597599268e3b ("jbd2: discard dirty data when forgetting an un-journalled buffer")
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/jbd2/transaction.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index c7867139af69..3e510564de6e 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1659,6 +1659,7 @@ int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
 	int drop_reserve = 0;
 	int err = 0;
 	int was_modified = 0;
+	int wait_for_writeback = 0;
 
 	if (is_handle_aborted(handle))
 		return -EROFS;
@@ -1782,18 +1783,22 @@ int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
 		}
 
 		/*
-		 * The buffer is still not written to disk, we should
-		 * attach this buffer to current transaction so that the
-		 * buffer can be checkpointed only after the current
-		 * transaction commits.
+		 * The buffer has not yet been written to disk. We should
+		 * either clear the buffer or ensure that the ongoing I/O
+		 * is completed, and attach this buffer to current
+		 * transaction so that the buffer can be checkpointed only
+		 * after the current transaction commits.
 		 */
 		clear_buffer_dirty(bh);
+		wait_for_writeback = 1;
 		__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
 		spin_unlock(&journal->j_list_lock);
 	}
 drop:
 	__brelse(bh);
 	spin_unlock(&jh->b_state_lock);
+	if (wait_for_writeback)
+		wait_on_buffer(bh);
 	jbd2_journal_put_journal_head(jh);
 	if (drop_reserve) {
 		/* no need to reserve log space for this block -bzzz */
-- 
2.46.1
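
The first issue in the commit message (the checkpoint check returns
without waiting when writeback holds the buffer lock) can be sketched in
userspace as below. This is a toy model under stated assumptions: the
TOY_* bits, struct toy_bh and the toy_* functions are hypothetical and only
mimic the described behaviour. Note that the real patch defers
wait_on_buffer() until the spinlocks are dropped, which this
single-threaded model does not represent.

```c
#include <stdbool.h>

/* Toy buffer state bits (assumption; not the kernel's BH_* flags). */
enum { TOY_DIRTY = 1, TOY_LOCKED = 2 };

struct toy_bh { unsigned int state; };

/*
 * Models the check described above: give up immediately when the buffer
 * is locked by writeback or still dirty; never wait.
 */
static bool toy_try_remove_checkpoint(const struct toy_bh *bh)
{
	if (bh->state & TOY_LOCKED)
		return false;	/* writeback holds the lock: return, do not wait */
	if (bh->state & TOY_DIRTY)
		return false;	/* not written back yet */
	return true;		/* clean and idle: safe to drop */
}

/* Stand-in for wait_on_buffer(): the in-flight write finishes. */
static void toy_wait_on_buffer(struct toy_bh *bh)
{
	bh->state &= ~(unsigned int)TOY_LOCKED;
}

/*
 * Forget path: returns true if a write may still be in flight when the
 * function returns, i.e. when the block could later be reused unsafely.
 */
static bool toy_forget(struct toy_bh *bh, bool wait_for_writeback)
{
	if (toy_try_remove_checkpoint(bh))
		return false;

	bh->state &= ~(unsigned int)TOY_DIRTY;	/* clear_buffer_dirty() */
	if (wait_for_writeback)
		toy_wait_on_buffer(bh);
	return (bh->state & TOY_LOCKED) != 0;
}
```

With a buffer that is locked by writeback, toy_forget(&bh, false) reports
that I/O may still be in flight, whereas toy_forget(&bh, true) does not.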



* [PATCH 2/2] ext4: wait for ongoing I/O to complete before freeing blocks
  2025-09-16  9:33 [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode Zhang Yi
  2025-09-16  9:33 ` [PATCH 1/2] jbd2: ensure that all ongoing I/O complete before freeing blocks Zhang Yi
@ 2025-09-16  9:33 ` Zhang Yi
  2025-09-16 10:57   ` Jan Kara
  2025-10-02 11:42 ` [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode Gao Xiang
  2025-10-15  2:44 ` Theodore Ts'o
  3 siblings, 1 reply; 10+ messages in thread
From: Zhang Yi @ 2025-09-16  9:33 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	hsiangkao, yi.zhang, yi.zhang, libaokun1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

When freeing metadata blocks in nojournal mode, ext4_forget() calls
bforget() to clear the dirty flag on the buffer_head and remove
associated mappings. This is acceptable if the metadata has not yet
begun to be written back. However, if the write-back has already started
but is not yet completed, ext4_forget() will have no effect.
Subsequently, ext4_mb_clear_bb() will immediately return the block to
the mb allocator. This block can then be reallocated immediately,
potentially causing a data corruption issue.

Fix this by clearing the buffer's dirty flag and waiting for the ongoing
I/O to complete, ensuring that no further writes of stale data will
occur.

Fixes: 16e08b14a455 ("ext4: cleanup clean_bdev_aliases() calls")
Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Closes: https://lore.kernel.org/linux-ext4/a9417096-9549-4441-9878-b1955b899b4e@huaweicloud.com/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4_jbd2.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index b3e9b7bd7978..a0e66bc10093 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -280,9 +280,16 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
 		  bh, is_metadata, inode->i_mode,
 		  test_opt(inode->i_sb, DATA_FLAGS));
 
-	/* In the no journal case, we can just do a bforget and return */
+	/*
+	 * In the no journal case, we should wait for the ongoing buffer
+	 * to complete and do a forget.
+	 */
 	if (!ext4_handle_valid(handle)) {
-		bforget(bh);
+		if (bh) {
+			clear_buffer_dirty(bh);
+			wait_on_buffer(bh);
+			__bforget(bh);
+		}
 		return 0;
 	}
 
-- 
2.46.1
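
The patch clears the dirty flag before it waits. One reading of why that
order matters (an interpretation, not something stated in the commit
message) is sketched below as a userspace toy: if the wait came first, a
flusher could start a fresh write between the two steps. struct toy_buf,
toy_flusher_step() and the hand-picked interleaving are all hypothetical.

```c
#include <stdbool.h>

struct toy_buf {
	bool dirty;	/* contents still need to be written */
	bool locked;	/* a write is in flight */
};

/* Hypothetical flusher step: starts a write only for dirty, idle buffers. */
static void toy_flusher_step(struct toy_buf *b)
{
	if (b->dirty && !b->locked) {
		b->dirty = false;
		b->locked = true;
	}
}

/* Stand-in for wait_on_buffer(): returns once no write is in flight. */
static void toy_wait(struct toy_buf *b)
{
	b->locked = false;
}

/*
 * Run the forget sequence with the flusher sneaking in between the two
 * steps. Returns true if a write is in flight when the block is freed.
 */
static bool toy_io_in_flight_at_free(struct toy_buf b, bool clear_first)
{
	if (clear_first) {
		b.dirty = false;	/* clear_buffer_dirty() */
		toy_flusher_step(&b);	/* nothing to start: buffer is clean */
		toy_wait(&b);		/* drains a write that began earlier */
	} else {
		toy_wait(&b);
		toy_flusher_step(&b);	/* may start a fresh write */
		b.dirty = false;
	}
	return b.locked;
}
```

In this model the clear-then-wait order leaves no write in flight for
either a dirty idle buffer or one already under I/O, while the reversed
order can free a dirty idle buffer with a write still pending.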



* Re: [PATCH 1/2] jbd2: ensure that all ongoing I/O complete before freeing blocks
  2025-09-16  9:33 ` [PATCH 1/2] jbd2: ensure that all ongoing I/O complete before freeing blocks Zhang Yi
@ 2025-09-16 10:56   ` Jan Kara
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2025-09-16 10:56 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, hsiangkao, yi.zhang, libaokun1, yukuai3, yangerkun

On Tue 16-09-25 17:33:36, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When releasing file system metadata blocks in jbd2_journal_forget(), if
> the buffer has not yet been checkpointed, it may have already been
> written back, currently be in the process of being written back, or not
> yet have been written back. jbd2_journal_forget() calls
> jbd2_journal_try_remove_checkpoint() to check the buffer's status and
> add it to the current transaction if it has not been written back. This
> buffer can only be reallocated after the transaction is committed.
> 
> jbd2_journal_try_remove_checkpoint() attempts to lock the buffer and
> check its dirty status while holding the buffer lock. If the buffer has
> already been written back, everything proceeds normally. However, there
> are two issues. First, the function returns immediately if the buffer is
> locked by the write-back process. It does not wait for the write-back to
> complete. Consequently, until the current transaction is committed and
> the block is reallocated, there is no guarantee that the I/O will
> complete. This means that ongoing I/O could write stale metadata to the
> newly allocated block, potentially corrupting data. Second, the function
> unlocks the buffer as soon as it detects that the buffer is still dirty.
> If a concurrent write-back occurs immediately after this unlocking and
> before clear_buffer_dirty() is called in jbd2_journal_forget(), data
> corruption can theoretically still occur.
> 
> Although these two issues are unlikely to occur in practice since the
> in-flight metadata writeback I/O does not take that long to complete,
> it's better to explicitly ensure that all ongoing I/O operations are
> completed.
> 
> Fixes: 597599268e3b ("jbd2: discard dirty data when forgetting an un-journalled buffer")
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/jbd2/transaction.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index c7867139af69..3e510564de6e 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1659,6 +1659,7 @@ int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
>  	int drop_reserve = 0;
>  	int err = 0;
>  	int was_modified = 0;
> +	int wait_for_writeback = 0;
>  
>  	if (is_handle_aborted(handle))
>  		return -EROFS;
> @@ -1782,18 +1783,22 @@ int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
>  		}
>  
>  		/*
> -		 * The buffer is still not written to disk, we should
> -		 * attach this buffer to current transaction so that the
> -		 * buffer can be checkpointed only after the current
> -		 * transaction commits.
> +		 * The buffer has not yet been written to disk. We should
> +		 * either clear the buffer or ensure that the ongoing I/O
> +		 * is completed, and attach this buffer to current
> +		 * transaction so that the buffer can be checkpointed only
> +		 * after the current transaction commits.
>  		 */
>  		clear_buffer_dirty(bh);
> +		wait_for_writeback = 1;
>  		__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
>  		spin_unlock(&journal->j_list_lock);
>  	}
>  drop:
>  	__brelse(bh);
>  	spin_unlock(&jh->b_state_lock);
> +	if (wait_for_writeback)
> +		wait_on_buffer(bh);
>  	jbd2_journal_put_journal_head(jh);
>  	if (drop_reserve) {
>  		/* no need to reserve log space for this block -bzzz */
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 2/2] ext4: wait for ongoing I/O to complete before freeing blocks
  2025-09-16  9:33 ` [PATCH 2/2] ext4: wait for ongoing I/O to " Zhang Yi
@ 2025-09-16 10:57   ` Jan Kara
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2025-09-16 10:57 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	jack, hsiangkao, yi.zhang, libaokun1, yukuai3, yangerkun

On Tue 16-09-25 17:33:37, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When freeing metadata blocks in nojournal mode, ext4_forget() calls
> bforget() to clear the dirty flag on the buffer_head and remove
> associated mappings. This is acceptable if the metadata has not yet
> begun to be written back. However, if the write-back has already started
> but is not yet completed, ext4_forget() will have no effect.
> Subsequently, ext4_mb_clear_bb() will immediately return the block to
> the mb allocator. This block can then be reallocated immediately,
> potentially causing a data corruption issue.
> 
> Fix this by clearing the buffer's dirty flag and waiting for the ongoing
> I/O to complete, ensuring that no further writes of stale data will
> occur.
> 
> Fixes: 16e08b14a455 ("ext4: cleanup clean_bdev_aliases() calls")
> Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> Closes: https://lore.kernel.org/linux-ext4/a9417096-9549-4441-9878-b1955b899b4e@huaweicloud.com/
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/ext4_jbd2.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index b3e9b7bd7978..a0e66bc10093 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -280,9 +280,16 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
>  		  bh, is_metadata, inode->i_mode,
>  		  test_opt(inode->i_sb, DATA_FLAGS));
>  
> -	/* In the no journal case, we can just do a bforget and return */
> +	/*
> +	 * In the no journal case, we should wait for the ongoing buffer
> +	 * to complete and do a forget.
> +	 */
>  	if (!ext4_handle_valid(handle)) {
> -		bforget(bh);
> +		if (bh) {
> +			clear_buffer_dirty(bh);
> +			wait_on_buffer(bh);
> +			__bforget(bh);
> +		}
>  		return 0;
>  	}
>  
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode
  2025-09-16  9:33 [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode Zhang Yi
  2025-09-16  9:33 ` [PATCH 1/2] jbd2: ensure that all ongoing I/O complete before freeing blocks Zhang Yi
  2025-09-16  9:33 ` [PATCH 2/2] ext4: wait for ongoing I/O to " Zhang Yi
@ 2025-10-02 11:42 ` Gao Xiang
  2025-10-06 13:52   ` Jan Kara
  2025-10-15  2:44 ` Theodore Ts'o
  3 siblings, 1 reply; 10+ messages in thread
From: Gao Xiang @ 2025-10-02 11:42 UTC (permalink / raw)
  To: Zhang Yi, linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	yi.zhang, libaokun1, yukuai3, yangerkun

Hi Ted,

On 2025/9/16 17:33, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Hello!
> 
> This series fixes a data corruption issue reported by Gao Xiang in
> nojournal mode. The problem is that after a metadata block is freed,
> it can be immediately reallocated as a data block. However, the metadata
> on this block may still be in the process of being written back, which
> means the new data in this block could potentially be overwritten by the
> stale metadata and trigger a data corruption issue. Please see the
> discussion below with Jan for more details:
> 
>    https://lore.kernel.org/linux-ext4/a9417096-9549-4441-9878-b1955b899b4e@huaweicloud.com/
> 
> Patch 1 hardens the same case in ordered journal mode, theoretically
> preventing the occurrence of stale data issues.
> Patch 2 fixes this issue in nojournal mode.

It seems this series has not been applied; has it been ignored?

When ext4 nojournal mode is used, it is actually a very
serious bug since data corruption can happen very easily
in specific conditions (we actually have a specific
environment which can reproduce the issue very quickly)

Also, it seems AWS folks reported this issue years ago
(2021). The symptoms were almost the same, and the issue
still exists today:
https://lore.kernel.org/linux-ext4/20211108173520.xp6xphodfhcen2sy@u87e72aa3c6c25c.ant.amazon.com/

Some of our internal businesses actually rely on EXT4
no_journal mode, and when they upgraded the kernel from
4.19 to 5.10, they read corrupted data after page cache
memory was reclaimed (the on-disk data had actually been
corrupted even earlier).

So personally I wonder what the current status of
EXT4 no_journal mode is, since this issue has existed
for more than 5 years, yet some people need an
extent-enabled ext2 and so selected this mode.

We have already released an announcement advising customers
not to use no_journal mode because it seems to lack
sufficient maintenance (yet many end users are interested
in this mode):
https://www.alibabacloud.com/help/en/alinux/support/data-corruption-risk-and-solution-in-ext4-nojounral-mode

Thanks,
Gao Xiang

> 
> Regards,
> Yi.
> 
> Zhang Yi (2):
>    jbd2: ensure that all ongoing I/O complete before freeing blocks
>    ext4: wait for ongoing I/O to complete before freeing blocks
> 
>   fs/ext4/ext4_jbd2.c   | 11 +++++++++--
>   fs/jbd2/transaction.c | 13 +++++++++----
>   2 files changed, 18 insertions(+), 6 deletions(-)
> 



* Re: [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode
  2025-10-02 11:42 ` [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode Gao Xiang
@ 2025-10-06 13:52   ` Jan Kara
  2025-10-06 14:37     ` Gao Xiang
  2025-10-09 14:58     ` Theodore Ts'o
  0 siblings, 2 replies; 10+ messages in thread
From: Jan Kara @ 2025-10-06 13:52 UTC (permalink / raw)
  To: Ted Tso
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, adilger.kernel,
	jack, yi.zhang, libaokun1, yukuai3, yangerkun, Gao Xiang

Hi Ted!

I think this patch series has fallen through the cracks. Can you please
push it to Linus? Given there are real users hitting the data corruption,
we should do it soon (although it isn't a new issue so it isn't
supercritical).

On Thu 02-10-25 19:42:34, Gao Xiang wrote:
> On 2025/9/16 17:33, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > Hello!
> > 
> > This series fixes a data corruption issue reported by Gao Xiang in
> > nojournal mode. The problem is that after a metadata block is freed,
> > it can be immediately reallocated as a data block. However, the metadata
> > on this block may still be in the process of being written back, which
> > means the new data in this block could potentially be overwritten by the
> > stale metadata and trigger a data corruption issue. Please see the
> > discussion below with Jan for more details:
> > 
> >    https://lore.kernel.org/linux-ext4/a9417096-9549-4441-9878-b1955b899b4e@huaweicloud.com/
> > 
> > Patch 1 hardens the same case in ordered journal mode, theoretically
> > preventing the occurrence of stale data issues.
> > Patch 2 fixes this issue in nojournal mode.
> 
> It seems this series has not been applied; has it been ignored?

Well, likely Ted just missed it when collecting patches for his PR.

> When ext4 nojournal mode is used, it is actually a very
> serious bug since data corruption can happen very easily
> in specific conditions (we actually have a specific
> environment which can reproduce the issue very quickly)

This is good to know so that we can prioritize accordingly.

> Also, it seems AWS folks reported this issue years ago
> (2021). The symptoms were almost the same, and the issue
> still exists today:
> https://lore.kernel.org/linux-ext4/20211108173520.xp6xphodfhcen2sy@u87e72aa3c6c25c.ant.amazon.com/

Likely yes, but back then we weren't able to figure out the root cause.

> Some of our internal businesses actually rely on EXT4
> no_journal mode, and when they upgraded the kernel from
> 4.19 to 5.10, they read corrupted data after page cache
> memory was reclaimed (the on-disk data had actually been
> corrupted even earlier).
> 
> So personally I wonder what the current status of
> EXT4 no_journal mode is, since this issue has existed
> for more than 5 years, yet some people need an
> extent-enabled ext2 and so selected this mode.

The nojournal mode is fully supported. There are many enterprise customers
(mostly cloud vendors) that depend on it. Including Ted's employer ;)

> We have already released an announcement advising customers
> not to use no_journal mode because it seems to lack
> sufficient maintenance (yet many end users are interested
> in this mode):
> https://www.alibabacloud.com/help/en/alinux/support/data-corruption-risk-and-solution-in-ext4-nojounral-mode

Well, it's good to be cautious but the reality is that data corruption
issues do happen from time to time, both in nojournal mode and in normal
journalled mode. And this one has existed since nojournal mode was first
implemented, so it apparently requires rather specific conditions to hit.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode
  2025-10-06 13:52   ` Jan Kara
@ 2025-10-06 14:37     ` Gao Xiang
  2025-10-09 14:58     ` Theodore Ts'o
  1 sibling, 0 replies; 10+ messages in thread
From: Gao Xiang @ 2025-10-06 14:37 UTC (permalink / raw)
  To: Jan Kara, Ted Tso
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, adilger.kernel,
	yi.zhang, libaokun1, yukuai3, yangerkun

Hi Jan,

On 2025/10/6 21:52, Jan Kara wrote:
> Hi Ted!
> 
> I think this patch series has fallen through the cracks. Can you please
> push it to Linus? Given there are real users hitting the data corruption,
> we should do it soon (although it isn't a new issue so it isn't
> supercritical).

Thanks for the ping.

> 

..

> 
>> Some of our internal businesses actually rely on EXT4
>> no_journal mode, and when they upgraded the kernel from
>> 4.19 to 5.10, they read corrupted data after page cache
>> memory was reclaimed (the on-disk data had actually been
>> corrupted even earlier).
>>
>> So personally I wonder what the current status of
>> EXT4 no_journal mode is, since this issue has existed
>> for more than 5 years, yet some people need an
>> extent-enabled ext2 and so selected this mode.
> 
> The nojournal mode is fully supported. There are many enterprise customers
> (mostly cloud vendors) that depend on it. Including Ted's employer ;)

.. yet honestly, this issue can be easily observed with
no_journal + memory pressure, and our new 5.10 kernel
setup (previously 4.19) catches this issue very easily.

If memory is sufficient, the valid page cache can
cover up this issue, but the on-disk data could still
be corrupted.
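
As a rough illustration of that masking effect, the toy userspace model
below serves reads from a cached copy until it is reclaimed. struct
toy_file, toy_read() and toy_reclaim() are hypothetical names and do not
correspond to any kernel interface.

```c
#include <stddef.h>

struct toy_file {
	const char *disk;	/* what is actually stored on disk */
	const char *cache;	/* cached copy, or NULL once reclaimed */
};

/* Reads are served from the page cache while a valid copy exists. */
static const char *toy_read(const struct toy_file *f)
{
	return f->cache ? f->cache : f->disk;
}

/* Memory pressure: the clean cached page is dropped. */
static void toy_reclaim(struct toy_file *f)
{
	f->cache = NULL;
}
```

In this model a reader keeps seeing the cached contents until
toy_reclaim() runs; only the next toy_read() exposes what is on disk.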

So we wonder how widely no_journal mode is used at scale
now, and whether those users run workloads under memory pressure.

> 
>> We have already released an announcement advising customers
>> not to use no_journal mode because it seems to lack
>> sufficient maintenance (yet many end users are interested
>> in this mode):
>> https://www.alibabacloud.com/help/en/alinux/support/data-corruption-risk-and-solution-in-ext4-nojounral-mode
> 
> Well, it's good to be cautious but the reality is that data corruption
> issues do happen from time to time, both in nojournal mode and in normal
> journalled mode. And this one has existed since nojournal mode was first
> implemented, so it apparently requires rather specific conditions to hit.

The original issue (the one fixed by Yi in 2019) existed
for quite a long time, and I think it was hard to reproduce
compared to this one. However, the lack of
clean_bdev_aliases() and clean_bdev_bh_alias() introduced
another serious regression, which has existed from 2019 until
now and can be easily reproduced on some specific VM setups.
Our workload also creates and deletes some small and big
files, and the data corruption is observable because some
file data ends up filled with extent-layout contents, much
like the previous AWS report.

Thanks,
Gao Xiang

> 
> 								Honza
> 



* Re: [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode
  2025-10-06 13:52   ` Jan Kara
  2025-10-06 14:37     ` Gao Xiang
@ 2025-10-09 14:58     ` Theodore Ts'o
  1 sibling, 0 replies; 10+ messages in thread
From: Theodore Ts'o @ 2025-10-09 14:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, adilger.kernel,
	yi.zhang, libaokun1, yukuai3, yangerkun, Gao Xiang

Yes, sorry, this fell through the cracks.  I just applied and am
running it through tests.

					- Ted


* Re: [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode
  2025-09-16  9:33 [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode Zhang Yi
                   ` (2 preceding siblings ...)
  2025-10-02 11:42 ` [PATCH 0/2] ext4: fix an data corruption issue in nojournal mode Gao Xiang
@ 2025-10-15  2:44 ` Theodore Ts'o
  3 siblings, 0 replies; 10+ messages in thread
From: Theodore Ts'o @ 2025-10-15  2:44 UTC (permalink / raw)
  To: linux-ext4, Zhang Yi
  Cc: Theodore Ts'o, linux-fsdevel, linux-kernel, adilger.kernel,
	jack, yi.zhang, libaokun1, yukuai3, yangerkun, Gao Xiang


On Tue, 16 Sep 2025 17:33:35 +0800, Zhang Yi wrote:
> This series fixes a data corruption issue reported by Gao Xiang in
> nojournal mode. The problem is that after a metadata block is freed,
> it can be immediately reallocated as a data block. However, the metadata
> on this block may still be in the process of being written back, which
> means the new data in this block could potentially be overwritten by the
> stale metadata and trigger a data corruption issue. Please see the
> discussion below with Jan for more details:
> 
> [...]

Applied, thanks!

[1/2] jbd2: ensure that all ongoing I/O complete before freeing blocks
      commit: 3c652c3a71de1d30d72dc82c3bead8deb48eb749
[2/2] ext4: wait for ongoing I/O to complete before freeing blocks
      commit: 328a782cb138029182e521c08f50eb1587db955d

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>
