[PATCH] ocfs2: revalidate the journal dinode before toggling dirty

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH] ocfs2: revalidate the journal dinode before toggling dirty
@ 2026-05-09 13:52 ZhengYuan Huang
  2026-05-10  4:02 ` Joseph Qi
  0 siblings, 1 reply; 5+ messages in thread
From: ZhengYuan Huang @ 2026-05-09 13:52 UTC (permalink / raw)
  To: mark, jlbec, joseph.qi
  Cc: ocfs2-devel, linux-kernel, baijiaju1990, r33s3n6, zzzccc427,
	ZhengYuan Huang

[BUG]
A fuzzed OCFS2 image can corrupt the current slot journal dinode while
mount is still in progress. The mount path first reports the invalid
journal block and then crashes in shutdown:

kernel BUG at fs/ocfs2/journal.c:1034!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
RIP: 0010:ocfs2_journal_toggle_dirty+0x2d6/0x340 fs/ocfs2/journal.c:1034
Call Trace:
 ocfs2_journal_shutdown+0x414/0xc30 fs/ocfs2/journal.c:1116
 ocfs2_mount_volume fs/ocfs2/super.c:1785 [inline]
 ocfs2_fill_super+0x30a9/0x3cd0 fs/ocfs2/super.c:1083
 get_tree_bdev_flags+0x38b/0x640 fs/super.c:1698
 get_tree_bdev+0x24/0x40 fs/super.c:1721
 ocfs2_get_tree+0x21/0x30 fs/ocfs2/super.c:1184
 vfs_get_tree+0x9a/0x370 fs/super.c:1758
 fc_mount fs/namespace.c:1199 [inline]
 do_new_mount_fc fs/namespace.c:3642 [inline]
 do_new_mount fs/namespace.c:3718 [inline]
 path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
 do_mount fs/namespace.c:4041 [inline]
 __do_sys_mount fs/namespace.c:4229 [inline]
 __se_sys_mount fs/namespace.c:4206 [inline]
 __x64_sys_mount+0x282/0x320 fs/namespace.c:4206
 ...

[CAUSE]
ocfs2_journal_toggle_dirty() assumes journal->j_bh still contains the
same validated dinode that ocfs2_journal_init() locked earlier, and it
uses BUG_ON() when the buffer no longer looks like a dinode. That
assumption is too strong. The mount path can force the same current-slot
journal inode block back in from disk through
ocfs2_read_journal_inode(..., OCFS2_BH_IGNORE_CACHE) while
ocfs2_mark_dead_nodes() scans the journal slots. If that reread finds
corrupted metadata, mount unwinds through ocfs2_journal_shutdown(),
which reuses journal->j_bh and turns the metadata corruption into a
kernel BUG.

[FIX]
Revalidate journal->j_bh with ocfs2_validate_inode_block() before
updating the dirty flag. If the cached journal dinode has become
invalid, return the corruption error and keep the failure on OCFS2's
normal read-only/error path instead of crashing the kernel. This
revalidation happens in the cold path of mount, so the performance
impact should be negligible.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
---
 fs/ocfs2/journal.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index f9bf3bac085d..c9a972a1304e 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -1021,12 +1021,15 @@ static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb,
 	struct buffer_head *bh = journal->j_bh;
 	struct ocfs2_dinode *fe;

-	fe = (struct ocfs2_dinode *)bh->b_data;
+	/* The journal inode block can be forced back in from disk while the
+	 * mount path is still running, so validate the cached bh again before
+	 * updating the journal state on disk.
+	 */
+	status = ocfs2_validate_inode_block(osb->sb, bh);
+	if (status < 0)
+		return status;

-	/* The journal bh on the osb always comes from ocfs2_journal_init()
-	 * and was validated there inside ocfs2_inode_lock_full().  It's a
-	 * code bug if we mess it up. */
-	BUG_ON(!OCFS2_IS_VALID_DINODE(fe));
+	fe = (struct ocfs2_dinode *)bh->b_data;

 	flags = le32_to_cpu(fe->id1.journal1.ij_flags);
 	if (dirty)
-- 
2.49.0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] ocfs2: revalidate the journal dinode before toggling dirty
  2026-05-09 13:52 [PATCH] ocfs2: revalidate the journal dinode before toggling dirty ZhengYuan Huang
@ 2026-05-10  4:02 ` Joseph Qi
  2026-05-11  2:58   ` ZhengYuan Huang
  0 siblings, 1 reply; 5+ messages in thread
From: Joseph Qi @ 2026-05-10  4:02 UTC (permalink / raw)
  To: ZhengYuan Huang
  Cc: ocfs2-devel, linux-kernel, baijiaju1990, r33s3n6, zzzccc427,
	Mark Fasheh, Joel Becker



On 5/9/26 9:52 PM, ZhengYuan Huang wrote:
> [BUG]
> A fuzzed OCFS2 image can corrupt the current slot journal dinode while
> mount is still in progress. The mount path first reports the invalid
> journal block and then crashes in shutdown:
> 
> kernel BUG at fs/ocfs2/journal.c:1034!
> Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
> RIP: 0010:ocfs2_journal_toggle_dirty+0x2d6/0x340 fs/ocfs2/journal.c:1034
> Call Trace:
>  ocfs2_journal_shutdown+0x414/0xc30 fs/ocfs2/journal.c:1116
>  ocfs2_mount_volume fs/ocfs2/super.c:1785 [inline]
>  ocfs2_fill_super+0x30a9/0x3cd0 fs/ocfs2/super.c:1083
>  get_tree_bdev_flags+0x38b/0x640 fs/super.c:1698
>  get_tree_bdev+0x24/0x40 fs/super.c:1721
>  ocfs2_get_tree+0x21/0x30 fs/ocfs2/super.c:1184
>  vfs_get_tree+0x9a/0x370 fs/super.c:1758
>  fc_mount fs/namespace.c:1199 [inline]
>  do_new_mount_fc fs/namespace.c:3642 [inline]
>  do_new_mount fs/namespace.c:3718 [inline]
>  path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
>  do_mount fs/namespace.c:4041 [inline]
>  __do_sys_mount fs/namespace.c:4229 [inline]
>  __se_sys_mount fs/namespace.c:4206 [inline]
>  __x64_sys_mount+0x282/0x320 fs/namespace.c:4206
>  ...
> 
> 
> [CAUSE]
> ocfs2_journal_toggle_dirty() assumes journal->j_bh still contains the
> same validated dinode that ocfs2_journal_init() locked earlier, and it
> uses BUG_ON() when the buffer no longer looks like a dinode. That
> assumption is too strong. The mount path can force the same current-slot
> journal inode block back in from disk through
> ocfs2_read_journal_inode(..., OCFS2_BH_IGNORE_CACHE) while
> ocfs2_mark_dead_nodes() scans the journal slots. If that reread finds
> corrupted metadata, mount unwinds through ocfs2_journal_shutdown(),
> which reuses journal->j_bh and turns the metadata corruption into a
> kernel BUG.
> 

A bit confused.
Since journal dinode is firstly validated, it means image is checked.
Now mount is in progress, how to corrupt it during runtime?

Thanks,
Joseph

> [FIX]
> Revalidate journal->j_bh with ocfs2_validate_inode_block() before
> updating the dirty flag. If the cached journal dinode has become
> invalid, return the corruption error and keep the failure on OCFS2's
> normal read-only/error path instead of crashing the kernel. This
> revalidation happens in the cold path of mount, so the performance
> impact should be negligible.
> 
> Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
> ---
>  fs/ocfs2/journal.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index f9bf3bac085d..c9a972a1304e 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -1021,12 +1021,15 @@ static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb,
>  	struct buffer_head *bh = journal->j_bh;
>  	struct ocfs2_dinode *fe;
>  
> -	fe = (struct ocfs2_dinode *)bh->b_data;
> +	/* The journal inode block can be forced back in from disk while the
> +	 * mount path is still running, so validate the cached bh again before
> +	 * updating the journal state on disk.
> +	 */
> +	status = ocfs2_validate_inode_block(osb->sb, bh);
> +	if (status < 0)
> +		return status;
>  
> -	/* The journal bh on the osb always comes from ocfs2_journal_init()
> -	 * and was validated there inside ocfs2_inode_lock_full().  It's a
> -	 * code bug if we mess it up. */
> -	BUG_ON(!OCFS2_IS_VALID_DINODE(fe));
> +	fe = (struct ocfs2_dinode *)bh->b_data;
>  
>  	flags = le32_to_cpu(fe->id1.journal1.ij_flags);
>  	if (dirty)


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] ocfs2: revalidate the journal dinode before toggling dirty
  2026-05-10  4:02 ` Joseph Qi
@ 2026-05-11  2:58   ` ZhengYuan Huang
  2026-05-11  6:15     ` Joseph Qi
  0 siblings, 1 reply; 5+ messages in thread
From: ZhengYuan Huang @ 2026-05-11  2:58 UTC (permalink / raw)
  To: Joseph Qi
  Cc: ocfs2-devel, linux-kernel, baijiaju1990, r33s3n6, zzzccc427,
	Mark Fasheh, Joel Becker

On Sun, May 10, 2026 at 12:02 PM Joseph Qi <joseph.qi@linux.alibaba.com> wrote:
>
>
>
> On 5/9/26 9:52 PM, ZhengYuan Huang wrote:
> > [BUG]
> > A fuzzed OCFS2 image can corrupt the current slot journal dinode while
> > mount is still in progress. The mount path first reports the invalid
> > journal block and then crashes in shutdown:
> >
> > kernel BUG at fs/ocfs2/journal.c:1034!
> > Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
> > RIP: 0010:ocfs2_journal_toggle_dirty+0x2d6/0x340 fs/ocfs2/journal.c:1034
> > Call Trace:
> >  ocfs2_journal_shutdown+0x414/0xc30 fs/ocfs2/journal.c:1116
> >  ocfs2_mount_volume fs/ocfs2/super.c:1785 [inline]
> >  ocfs2_fill_super+0x30a9/0x3cd0 fs/ocfs2/super.c:1083
> >  get_tree_bdev_flags+0x38b/0x640 fs/super.c:1698
> >  get_tree_bdev+0x24/0x40 fs/super.c:1721
> >  ocfs2_get_tree+0x21/0x30 fs/ocfs2/super.c:1184
> >  vfs_get_tree+0x9a/0x370 fs/super.c:1758
> >  fc_mount fs/namespace.c:1199 [inline]
> >  do_new_mount_fc fs/namespace.c:3642 [inline]
> >  do_new_mount fs/namespace.c:3718 [inline]
> >  path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
> >  do_mount fs/namespace.c:4041 [inline]
> >  __do_sys_mount fs/namespace.c:4229 [inline]
> >  __se_sys_mount fs/namespace.c:4206 [inline]
> >  __x64_sys_mount+0x282/0x320 fs/namespace.c:4206
> >  ...
> >
> >
> > [CAUSE]
> > ocfs2_journal_toggle_dirty() assumes journal->j_bh still contains the
> > same validated dinode that ocfs2_journal_init() locked earlier, and it
> > uses BUG_ON() when the buffer no longer looks like a dinode. That
> > assumption is too strong. The mount path can force the same current-slot
> > journal inode block back in from disk through
> > ocfs2_read_journal_inode(..., OCFS2_BH_IGNORE_CACHE) while
> > ocfs2_mark_dead_nodes() scans the journal slots. If that reread finds
> > corrupted metadata, mount unwinds through ocfs2_journal_shutdown(),
> > which reuses journal->j_bh and turns the metadata corruption into a
> > kernel BUG.
> >
>
> A bit confused.
> Since journal dinode is firstly validated, it means image is checked.
> Now mount is in progress, how to corrupt it during runtime?
>
> Thanks,
> Joseph

Thanks for taking a look.

Yes, the journal dinode is validated when it is first initialized. My
concern is that later in the mount path, the same journal inode block
may be read again from disk with OCFS2_BH_IGNORE_CACHE, so the buffer
used by ocfs2_journal_shutdown() may no longer be the same validated
contents.

This does not mean the filesystem itself corrupts the block during
mount. Rather, after the initial validation and before the later use,
the block contents may change due to unexpected disk corruption, I/O
problems, or a forced reread of corrupted on-disk metadata. In that
case, ocfs2_journal_toggle_dirty() should not rely only on the earlier
validation.

Since this is a cold mount/shutdown error path, adding this extra
validation should not have a meaningful performance impact. I see it
as a small robustness improvement to avoid turning bad metadata into a
kernel BUG.

Thanks,
ZhengYuan Huang

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] ocfs2: revalidate the journal dinode before toggling dirty
  2026-05-11  2:58   ` ZhengYuan Huang
@ 2026-05-11  6:15     ` Joseph Qi
  2026-05-12  2:45       ` ZhengYuan Huang
  0 siblings, 1 reply; 5+ messages in thread
From: Joseph Qi @ 2026-05-11  6:15 UTC (permalink / raw)
  To: ZhengYuan Huang
  Cc: ocfs2-devel, linux-kernel, baijiaju1990, r33s3n6, zzzccc427,
	Mark Fasheh, Joel Becker



On 5/11/26 10:58 AM, ZhengYuan Huang wrote:
> On Sun, May 10, 2026 at 12:02 PM Joseph Qi <joseph.qi@linux.alibaba.com> wrote:
>>
>>
>>
>> On 5/9/26 9:52 PM, ZhengYuan Huang wrote:
>>> [BUG]
>>> A fuzzed OCFS2 image can corrupt the current slot journal dinode while
>>> mount is still in progress. The mount path first reports the invalid
>>> journal block and then crashes in shutdown:
>>>
>>> kernel BUG at fs/ocfs2/journal.c:1034!
>>> Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
>>> RIP: 0010:ocfs2_journal_toggle_dirty+0x2d6/0x340 fs/ocfs2/journal.c:1034
>>> Call Trace:
>>>  ocfs2_journal_shutdown+0x414/0xc30 fs/ocfs2/journal.c:1116
>>>  ocfs2_mount_volume fs/ocfs2/super.c:1785 [inline]
>>>  ocfs2_fill_super+0x30a9/0x3cd0 fs/ocfs2/super.c:1083
>>>  get_tree_bdev_flags+0x38b/0x640 fs/super.c:1698
>>>  get_tree_bdev+0x24/0x40 fs/super.c:1721
>>>  ocfs2_get_tree+0x21/0x30 fs/ocfs2/super.c:1184
>>>  vfs_get_tree+0x9a/0x370 fs/super.c:1758
>>>  fc_mount fs/namespace.c:1199 [inline]
>>>  do_new_mount_fc fs/namespace.c:3642 [inline]
>>>  do_new_mount fs/namespace.c:3718 [inline]
>>>  path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
>>>  do_mount fs/namespace.c:4041 [inline]
>>>  __do_sys_mount fs/namespace.c:4229 [inline]
>>>  __se_sys_mount fs/namespace.c:4206 [inline]
>>>  __x64_sys_mount+0x282/0x320 fs/namespace.c:4206
>>>  ...
>>>
>>>
>>> [CAUSE]
>>> ocfs2_journal_toggle_dirty() assumes journal->j_bh still contains the
>>> same validated dinode that ocfs2_journal_init() locked earlier, and it
>>> uses BUG_ON() when the buffer no longer looks like a dinode. That
>>> assumption is too strong. The mount path can force the same current-slot
>>> journal inode block back in from disk through
>>> ocfs2_read_journal_inode(..., OCFS2_BH_IGNORE_CACHE) while
>>> ocfs2_mark_dead_nodes() scans the journal slots. If that reread finds
>>> corrupted metadata, mount unwinds through ocfs2_journal_shutdown(),
>>> which reuses journal->j_bh and turns the metadata corruption into a
>>> kernel BUG.
>>>
>>
>> A bit confused.
>> Since journal dinode is firstly validated, it means image is checked.
>> Now mount is in progress, how to corrupt it during runtime?
>>
>> Thanks,
>> Joseph
> 
> Thanks for taking a look.
> 
> Yes, the journal dinode is validated when it is first initialized. My
> concern is that later in the mount path, the same journal inode block
> may be read again from disk with OCFS2_BH_IGNORE_CACHE, so the buffer
> used by ocfs2_journal_shutdown() may no longer be the same validated
> contents.
> 
After the validation in ocfs2_journal_init(), the in-memory copy won't
spontaneously become invalid.

And if it is broken by a re-write (e.g. recover), this a bug in the
re-write flow and we have to fix the flow itself.


> This does not mean the filesystem itself corrupts the block during
> mount. Rather, after the initial validation and before the later use,
> the block contents may change due to unexpected disk corruption, I/O
> problems, or a forced reread of corrupted on-disk metadata. In that
> case, ocfs2_journal_toggle_dirty() should not rely only on the earlier
> validation.
> 
ocfs2_validate_inode_block() is a bit heavy. So if we want to prevent a
BUG_ON in case of unexpected disk corruption (still a bit strange, it is
fine in init and then suddenly down...), a simpler alternative would be
just replace BUG_ON with WARN_ON and return error.

Thanks,
Joseph

> Since this is a cold mount/shutdown error path, adding this extra
> validation should not have a meaningful performance impact. I see it
> as a small robustness improvement to avoid turning bad metadata into a
> kernel BUG.
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] ocfs2: revalidate the journal dinode before toggling dirty
  2026-05-11  6:15     ` Joseph Qi
@ 2026-05-12  2:45       ` ZhengYuan Huang
  0 siblings, 0 replies; 5+ messages in thread
From: ZhengYuan Huang @ 2026-05-12  2:45 UTC (permalink / raw)
  To: Joseph Qi
  Cc: ocfs2-devel, linux-kernel, baijiaju1990, r33s3n6, zzzccc427,
	Mark Fasheh, Joel Becker

On Mon, May 11, 2026 at 2:15 PM Joseph Qi <joseph.qi@linux.alibaba.com> wrote:
>
>
>
> On 5/11/26 10:58 AM, ZhengYuan Huang wrote:
> > On Sun, May 10, 2026 at 12:02 PM Joseph Qi <joseph.qi@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 5/9/26 9:52 PM, ZhengYuan Huang wrote:
> >>> [BUG]
> >>> A fuzzed OCFS2 image can corrupt the current slot journal dinode while
> >>> mount is still in progress. The mount path first reports the invalid
> >>> journal block and then crashes in shutdown:
> >>>
> >>> kernel BUG at fs/ocfs2/journal.c:1034!
> >>> Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
> >>> RIP: 0010:ocfs2_journal_toggle_dirty+0x2d6/0x340 fs/ocfs2/journal.c:1034
> >>> Call Trace:
> >>>  ocfs2_journal_shutdown+0x414/0xc30 fs/ocfs2/journal.c:1116
> >>>  ocfs2_mount_volume fs/ocfs2/super.c:1785 [inline]
> >>>  ocfs2_fill_super+0x30a9/0x3cd0 fs/ocfs2/super.c:1083
> >>>  get_tree_bdev_flags+0x38b/0x640 fs/super.c:1698
> >>>  get_tree_bdev+0x24/0x40 fs/super.c:1721
> >>>  ocfs2_get_tree+0x21/0x30 fs/ocfs2/super.c:1184
> >>>  vfs_get_tree+0x9a/0x370 fs/super.c:1758
> >>>  fc_mount fs/namespace.c:1199 [inline]
> >>>  do_new_mount_fc fs/namespace.c:3642 [inline]
> >>>  do_new_mount fs/namespace.c:3718 [inline]
> >>>  path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
> >>>  do_mount fs/namespace.c:4041 [inline]
> >>>  __do_sys_mount fs/namespace.c:4229 [inline]
> >>>  __se_sys_mount fs/namespace.c:4206 [inline]
> >>>  __x64_sys_mount+0x282/0x320 fs/namespace.c:4206
> >>>  ...
> >>>
> >>>
> >>> [CAUSE]
> >>> ocfs2_journal_toggle_dirty() assumes journal->j_bh still contains the
> >>> same validated dinode that ocfs2_journal_init() locked earlier, and it
> >>> uses BUG_ON() when the buffer no longer looks like a dinode. That
> >>> assumption is too strong. The mount path can force the same current-slot
> >>> journal inode block back in from disk through
> >>> ocfs2_read_journal_inode(..., OCFS2_BH_IGNORE_CACHE) while
> >>> ocfs2_mark_dead_nodes() scans the journal slots. If that reread finds
> >>> corrupted metadata, mount unwinds through ocfs2_journal_shutdown(),
> >>> which reuses journal->j_bh and turns the metadata corruption into a
> >>> kernel BUG.
> >>>
> >>
> >> A bit confused.
> >> Since journal dinode is firstly validated, it means image is checked.
> >> Now mount is in progress, how to corrupt it during runtime?
> >>
> >> Thanks,
> >> Joseph
> >
> > Thanks for taking a look.
> >
> > Yes, the journal dinode is validated when it is first initialized. My
> > concern is that later in the mount path, the same journal inode block
> > may be read again from disk with OCFS2_BH_IGNORE_CACHE, so the buffer
> > used by ocfs2_journal_shutdown() may no longer be the same validated
> > contents.
> >
> After the validation in ocfs2_journal_init(), the in-memory copy won't
> spontaneously become invalid.
>
> And if it is broken by a re-write (e.g. recover), this a bug in the
> re-write flow and we have to fix the flow itself.
>
>
> > This does not mean the filesystem itself corrupts the block during
> > mount. Rather, after the initial validation and before the later use,
> > the block contents may change due to unexpected disk corruption, I/O
> > problems, or a forced reread of corrupted on-disk metadata. In that
> > case, ocfs2_journal_toggle_dirty() should not rely only on the earlier
> > validation.
> >
> ocfs2_validate_inode_block() is a bit heavy. So if we want to prevent a
> BUG_ON in case of unexpected disk corruption (still a bit strange, it is
> fine in init and then suddenly down...), a simpler alternative would be
> just replace BUG_ON with WARN_ON and return error.
>
> Thanks,
> Joseph
>
> > Since this is a cold mount/shutdown error path, adding this extra
> > validation should not have a meaningful performance impact. I see it
> > as a small robustness improvement to avoid turning bad metadata into a
> > kernel BUG.
> >

Thanks for the clarification. I have updated the patch according to
your suggestion and sent out v2.

Thanks,
ZhengYuan Huang

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-05-12  2:45 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-09 13:52 [PATCH] ocfs2: revalidate the journal dinode before toggling dirty ZhengYuan Huang
2026-05-10  4:02 ` Joseph Qi
2026-05-11  2:58   ` ZhengYuan Huang
2026-05-11  6:15     ` Joseph Qi
2026-05-12  2:45       ` ZhengYuan Huang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox