* [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail
From: syzbot @ 2024-08-23 0:00 UTC (permalink / raw)
To: jack, jlbec, joseph.qi, linux-ext4, linux-kernel, mark,
ocfs2-devel, syzkaller-bugs, tytso
Hello,
syzbot found the following issue on:
HEAD commit: c3f2d783a459 Merge tag 'mm-hotfixes-stable-2024-08-17-19-3..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16ba0fcb980000
kernel config: https://syzkaller.appspot.com/x/.config?x=7229118d88b4a71b
dashboard link: https://syzkaller.appspot.com/bug?extid=8512f3dbd96253ffbe27
compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15a1fc09980000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11da5cfd980000
Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-c3f2d783.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/4d927f7c3cfd/vmlinux-c3f2d783.xz
kernel image: https://storage.googleapis.com/syzbot-assets/ea54bdfad24b/bzImage-c3f2d783.xz
mounted in repro: https://storage.googleapis.com/syzbot-assets/ec5187e2b71d/mount_0.gz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+8512f3dbd96253ffbe27@syzkaller.appspotmail.com
(syz-executor675,5099,0):ocfs2_check_volume:2425 ERROR: ocfs2 journal load failed! -22
(syz-executor675,5099,0):ocfs2_check_volume:2481 ERROR: status = -22
(syz-executor675,5099,0):ocfs2_mount_volume:1821 ERROR: status = -22
------------[ cut here ]------------
kernel BUG at fs/jbd2/checkpoint.c:321!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5099 Comm: syz-executor675 Not tainted 6.11.0-rc3-syzkaller-00338-gc3f2d783a459 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:jbd2_cleanup_journal_tail+0x2bf/0x2d0 fs/jbd2/checkpoint.c:321
Code: 24 40 80 e1 07 80 c1 03 38 c1 0f 8c 65 ff ff ff 48 8d 7c 24 40 e8 51 4d 8a ff e9 56 ff ff ff e8 57 1a 46 09 e8 a2 0b 23 ff 90 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 90 90 90 90 90
RSP: 0018:ffffc90000e36d60 EFLAGS: 00010293
RAX: ffffffff82707f4e RBX: 0000000000000000 RCX: ffff88801a738000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90000e36e30 R08: ffffffff82707dd7 R09: 1ffff11003709c16
R10: dffffc0000000000 R11: ffffed1003709c17 R12: 1ffff920001c6db0
R13: ffff88801b84e000 R14: 1ffff920001c6db6 R15: 1ffff11003709c00
FS: 000055555d18e380(0000) GS:ffff888020800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f251d195ed8 CR3: 000000001bee4000 CR4: 0000000000350ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
jbd2_journal_flush+0x290/0xc10 fs/jbd2/journal.c:2479
ocfs2_journal_shutdown+0x443/0xbe0 fs/ocfs2/journal.c:1081
ocfs2_mount_volume+0x169f/0x1940 fs/ocfs2/super.c:1842
ocfs2_fill_super+0x483b/0x5880 fs/ocfs2/super.c:1084
mount_bdev+0x20a/0x2d0 fs/super.c:1679
legacy_get_tree+0xee/0x190 fs/fs_context.c:662
vfs_get_tree+0x90/0x2a0 fs/super.c:1800
do_new_mount+0x2be/0xb40 fs/namespace.c:3472
do_mount fs/namespace.c:3812 [inline]
__do_sys_mount fs/namespace.c:4020 [inline]
__se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f15c3ba9dea
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb a6 e8 5e 04 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc0f1577b8 EFLAGS: 00000282 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007ffc0f1577d0 RCX: 00007f15c3ba9dea
RDX: 0000000020004440 RSI: 0000000020000040 RDI: 00007ffc0f1577d0
RBP: 0000000000000004 R08: 00007ffc0f157810 R09: 0000000000004424
R10: 00000000000008c0 R11: 0000000000000282 R12: 00000000000008c0
R13: 00007ffc0f157810 R14: 0000000000000003 R15: 0000000001000000
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:jbd2_cleanup_journal_tail+0x2bf/0x2d0 fs/jbd2/checkpoint.c:321
Code: 24 40 80 e1 07 80 c1 03 38 c1 0f 8c 65 ff ff ff 48 8d 7c 24 40 e8 51 4d 8a ff e9 56 ff ff ff e8 57 1a 46 09 e8 a2 0b 23 ff 90 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 90 90 90 90 90
RSP: 0018:ffffc90000e36d60 EFLAGS: 00010293
RAX: ffffffff82707f4e RBX: 0000000000000000 RCX: ffff88801a738000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90000e36e30 R08: ffffffff82707dd7 R09: 1ffff11003709c16
R10: dffffc0000000000 R11: ffffed1003709c17 R12: 1ffff920001c6db0
R13: ffff88801b84e000 R14: 1ffff920001c6db6 R15: 1ffff11003709c00
FS: 000055555d18e380(0000) GS:ffff888020800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f251d195ed8 CR3: 000000001bee4000 CR4: 0000000000350ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.
If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
* Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail
From: Vinicius Peixoto @ 2024-08-26 4:22 UTC (permalink / raw)
To: syzbot+8512f3dbd96253ffbe27
Cc: jack, jlbec, joseph.qi, linux-ext4, linux-kernel, mark,
ocfs2-devel, syzkaller-bugs, tytso, ~lkcamp/discussion
Hi all,
I noticed this report from syzbot when going through the preliminary
tasks for the Linux Kernel Mentorship Program, and thought I'd take a
stab at solving it. I apologize in advance for any mistakes as I'm still
very new to kernel development. Either way, here's my analysis:
From what I can tell by looking at the reproducer from syzbot, it is
trying to mount a file filled with bogus data as an ocfs2 disk, and this
is triggering an assertion in jbd2_cleanup_journal_tail, which in turn
causes a panic.
The problematic call stack goes roughly like this:
mount_bdev
-> ocfs2_mount_volume
-> ocfs2_check_volume
-> ocfs2_journal_load
-> jbd2_journal_load
-> journal_reset (fails)
Since the disk data is bogus, journal_reset fails with -EINVAL ("JBD2:
Journal too short (blocks 2-1024)"); this leaves journal->j_head == 0.
However, jbd2_journal_load clears the JBD2_ABORT flag right before
calling journal_reset. This leads to a problem later when
ocfs2_mount_volume tries to flush the journal as part of the cleanup
when aborting the mount operation:
-> ocfs2_mount_volume (error; goto out_system_inodes)
-> ocfs2_journal_shutdown
-> jbd2_journal_flush
-> jbd2_cleanup_journal_tail (J_ASSERT fails)
This failure happens because of the following code:
	if (is_journal_aborted(journal))
		return -EIO;

	if (!jbd2_journal_get_log_tail(journal, &first_tid, &blocknr))
		return 1;
	J_ASSERT(blocknr != 0);
Since JBD2_ABORT was cleared in jbd2_journal_load earlier, we enter
jbd2_journal_get_log_tail, which sets *blocknr = journal->j_head (which
is 0) and then trips the assertion, causing a panic.
I confirmed that setting the JBD2_ABORT flag in journal_reset before
returning -EINVAL fixes the problem:
 static int journal_reset(journal_t *journal)
 	journal_fail_superblock(journal);
+	journal->j_flags |= JBD2_ABORT;
 	return -EINVAL;
You can find a proper patch file + the syzbot re-test result in [1].
However, I'm not entirely sure whether this is the correct decision, and
I wanted to confirm that this is an appropriate solution before sending
a proper patch to the mailing list.
Thanks in advance,
Vinicius
[1] https://syzkaller.appspot.com/bug?extid=8512f3dbd96253ffbe27
* Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail
From: Theodore Ts'o @ 2024-08-26 13:32 UTC (permalink / raw)
To: Vinicius Peixoto
Cc: syzbot+8512f3dbd96253ffbe27, jack, jlbec, joseph.qi, linux-ext4,
linux-kernel, mark, ocfs2-devel, syzkaller-bugs,
~lkcamp/discussion
On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> Since the disk data is bogus, journal_reset fails with -EINVAL ("JBD2:
> Journal too short (blocks 2-1024)"); this leaves journal->j_head == NULL.
> However, jbd2_journal_load clears the JBD2_ABORT flag right before calling
> journal_reset. This leads to a problem later when ofcs2_mount_volume tries
> to flush the journal as part of the cleanup when aborting the mount
> operation:
>
> -> ofcs2_mount_volume (error; goto out_system_inodes)
> -> ofcs2_journal_shutdown
> -> jbd2_journal_flush
> -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> ...
>
> I confirmed that setting the JBD2_ABORT flag in journal_reset before
> returning -EINVAL fixes the problem:
>
> static int journal_reset(journal_t *journal)
> journal_fail_superblock(journal);
> + journal->j_flags |= JBD2_ABORT;
> return -EINVAL;
>
> You can find a proper patch file + the syzbot re-test result in [1].
> However, I'm not entirely sure whether this is the correct decision, and I
> wanted to confirm that this is an appropriate solution before sending a
> proper patch to the mailing list.
The reason this isn't an issue with ext4 is that the normal "right"
way to handle this is: if jbd2_journal_load() returns an error, call
jbd2_journal_destroy() to release the data structures associated with
the journal --- and then don't try to use the journal afterwards.
The weird thing is that there are two code paths in ocfs2 that call
jbd2_journal_load(). One is in ocfs2_replay_journal(), which does what
ext4 does. The other is ocfs2_journal_load(), which does *not* do
this, and this is the one you saw in the syzkaller reproducer.
It looks like one code path is used to replay the journal for some other
node, and the other is to load (and presumably later, replay) the
journal for the mounting node.
There are also some other things which look very confusing, such as
the fact that ocfs2_journal_shutdown() calls igrab():

	/* need to inc inode use count - jbd2_journal_destroy will iput. */
	if (!igrab(inode))
		BUG();
... which is *weird*. Normally the journal inode refcount is expected
to be bumped before calling jbd2_journal_init_inode() --- which
normally is done by an iget() function, and is needed to make sure the
journal inode isn't released out from under the jbd2 layer. It looks
like ocfs2 uses the journal inode for other stuff, so it gets the
journal inode without calling something like iget(). Which is OK, I
guess, but it means that there are a whole bunch of places where it has
to call BUG_ON(!igrab(journal->j_inode)) before calling
jbd2_journal_destroy(). It would seem like the *right* thing to do is
to bump the refcount in ocfs2_journal_init(), and if for some reason
the igrab() fails, it can just return an error early, instead of having
to resort to BUG_ON() later. And if you don't realize that you have to
do this weird igrab() before calling jbd2_journal_destroy(), you'll
end up leaking the journal inode.
Anyway, I think the right thing to do is to restructure how ocfs2
calls the jbd2 journal, but that's going to require a more careful
examination of the code paths. Your patch is a bit of a blunt force
hack that should be harmless, but it's probably not the best way to
fix the problem, although doing it "right" would be more
complicated....
- Ted
* Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail
From: Joel Becker @ 2024-08-26 19:34 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Vinicius Peixoto, syzbot+8512f3dbd96253ffbe27, jack, joseph.qi,
linux-ext4, linux-kernel, mark, ocfs2-devel, syzkaller-bugs,
~lkcamp/discussion
On Mon, Aug 26, 2024 at 09:32:08AM -0400, Theodore Ts'o wrote:
> On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> > Since the disk data is bogus, journal_reset fails with -EINVAL ("JBD2:
> > Journal too short (blocks 2-1024)"); this leaves journal->j_head == NULL.
> > However, jbd2_journal_load clears the JBD2_ABORT flag right before calling
> > journal_reset. This leads to a problem later when ofcs2_mount_volume tries
> > to flush the journal as part of the cleanup when aborting the mount
> > operation:
> >
> > -> ofcs2_mount_volume (error; goto out_system_inodes)
> > -> ofcs2_journal_shutdown
> > -> jbd2_journal_flush
> > -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> > ...
> The reason why this isn't an issue with ext4 is because the normal
> "right" way to do this is if jbd2_journal_load() returns an error, is
> to call jbd2_journal_destroy() to release the data structures with the
> journal --- and then don't try to use the journal afterwards.
>
> The weird thing is that there are two code paths in ocfs2 that calls
> jbd2_journal_load(). One is in ocfs2_replay_journal() which does what
> ext4 does. The other is ocfs2_load_journal() which does *not* do
> this, and this is the one which you saw in the syzkaller reproducer.
> It looks like one codepath is used to replay the ocfs2 for some other
> node, and the is to load (and presumably later, replay) the journal
> for the mounting node.
You are correct, Ted, that one path is for the local journal and the
other is to recover remote journals for other nodes that may have
crashed.
I think the big ordering issue is that we set
osb->journal->j_state=OCFS2_JOURNAL_LOADED in ocfs2_journal_init(),
before we've attempted any replay. Later in ocfs2_journal_shutdown(),
we check this state and decide to perform cleanup.
Instead, we should not set OCFS2_JOURNAL_LOADED until
ocfs2_journal_load() has called jbd2_journal_load(). Only then do we
actually know we have loaded a valid journal.
Something like:
```
	status = jbd2_journal_load(journal->j_journal);
	if (status < 0) {
		mlog(ML_ERROR, "Failed to load journal!\n");
		BUG_ON(!igrab(journal->j_inode));
		jbd2_journal_destroy(journal->j_journal);
		iput(journal->j_inode);
		goto done;
	}
	journal->j_state = OCFS2_JOURNAL_LOADED;
```
With code like this, when jbd2_journal_load() fails, the future
ocfs2_journal_shutdown() will exit early, because !OCFS2_JOURNAL_LOADED.
I think this is the right spot; a quick audit of the paths (it has been
a while) doesn't find any other outstanding state; the rest of journal
startup, such as the commit thread, only happens after this.
> jbd2_journal_destroy(). It would seem like the *right* thing to do is
> to bump the refcount in ocfs2_journal_init(), and if for some reason
> the igrab fails, it can just return an error early, instead of having
> to resort to BUG_ON() later, and if you don't realize that you have to
> do this weird igrab() before calling jbd2_journal_destroy(), you'll
> end up leaking the journal inode.
There are interactions of journal inodes for nodes we don't own, and
also connections to cluster locks for our own journal (don't replay
ourselves while another node has locked it and is recovering us). So we
do have some state to keep track of. But it's been so long that I don't
recall if there was a specific reason we do this late via igrab(), or if
it's just that we should have done as you describe and missed it.
You'll note that I copied the igrab/iput game in my snippet above.
Should someone try to audit the igrab/iput thing later? Yes. But it's
not a necessary part of this fix.
Thanks,
Joel
--
"War doesn't determine who's right; war determines who's left."
http://www.jlbec.org/
jlbec@evilplan.org
* Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail
From: Jan Kara @ 2024-08-28 10:54 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Vinicius Peixoto, syzbot+8512f3dbd96253ffbe27, jack, jlbec,
joseph.qi, linux-ext4, linux-kernel, mark, ocfs2-devel,
Julian Sun, syzkaller-bugs, ~lkcamp/discussion
On Mon 26-08-24 09:32:08, Theodore Ts'o wrote:
> On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> > Since the disk data is bogus, journal_reset fails with -EINVAL ("JBD2:
> > Journal too short (blocks 2-1024)"); this leaves journal->j_head == NULL.
> > However, jbd2_journal_load clears the JBD2_ABORT flag right before calling
> > journal_reset. This leads to a problem later when ofcs2_mount_volume tries
> > to flush the journal as part of the cleanup when aborting the mount
> > operation:
> >
> > -> ofcs2_mount_volume (error; goto out_system_inodes)
> > -> ofcs2_journal_shutdown
> > -> jbd2_journal_flush
> > -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> > ...
> >
> > I confirmed that setting the JBD2_ABORT flag in journal_reset before
> > returning -EINVAL fixes the problem:
> >
> > static int journal_reset(journal_t *journal)
> > journal_fail_superblock(journal);
> > + journal->j_flags |= JBD2_ABORT;
> > return -EINVAL;
> >
> > You can find a proper patch file + the syzbot re-test result in [1].
> > However, I'm not entirely sure whether this is the correct decision, and I
> > wanted to confirm that this is an appropriate solution before sending a
> > proper patch to the mailing list.
>
> The reason why this isn't an issue with ext4 is because the normal
> "right" way to do this is if jbd2_journal_load() returns an error, is
> to call jbd2_journal_destroy() to release the data structures with the
> journal --- and then don't try to use the journal afterwards.
Yep. OCFS2 guys are actually looking into fixing this in OCFS2
(https://lore.kernel.org/ocfs2-devel/20240819131120.746077-1-sunjunchao2870@gmail.com/)
Not sure what the status is, though. Julian, did you send v2 of your fix?
Honza
> The weird thing is that there are two code paths in ocfs2 that calls
> jbd2_journal_load(). One is in ocfs2_replay_journal() which does what
> ext4 does. The other is ocfs2_load_journal() which does *not* do
> this, and this is the one which you saw in the syzkaller reproducer.
> It looks like one codepath is used to replay the ocfs2 for some other
> node, and the is to load (and presumably later, replay) the journal
> for the mounting node.
>
> There are also some other things which look very confusing, such as the
> fact that ocfs2_journal_shutdown calls igrab:
>
> /* need to inc inode use count - jbd2_journal_destroy will iput. */
> if (!igrab(inode))
> BUG();
>
> ... which is *weird*. Normaly the journal inode refcount is expected
> to be bumped before calling jbd2_journal_init_inode() --- which
> normally is done by an iget() function, and is needed to make sure the
> journal inode isn't released out from under the jbd2 layer. It looks
> like ocfs2 uses the journal inode for other stuff so get the journal
> inode without calling something like iget(). Which is OK, I guess,
> but it means that there are a whole bunch of places where it has to
> call !BUG_ON(igrab(journal->j_inode) before calling
> jbd2_journal_destroy(). It would seem like the *right* thing to do is
> to bump the refcount in ocfs2_journal_init(), and if for some reason
> the igrab fails, it can just return an error early, instead of having
> to resort to BUG_ON() later, and if you don't realize that you have to
> do this weird igrab() before calling jbd2_journal_destroy(), you'll
> end up leaking the journal inode.
>
> Anyway, I think the right thing to do is to restructure how ocfs2
> calls the jbd2 journal, but that's going to require a more careful
> examination of the code paths. Your patch is a bit of a blunt force
> hack that should be harmless, but it's probably not the best way to
> fix the problem, although doing it "right" would be more
> complicated....
>
> - Ted
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail
From: Julian Sun @ 2024-08-28 11:55 UTC (permalink / raw)
To: Jan Kara, Theodore Ts'o
Cc: Vinicius Peixoto, syzbot+8512f3dbd96253ffbe27, jack, jlbec,
joseph.qi, linux-ext4, linux-kernel, mark, ocfs2-devel,
syzkaller-bugs, ~lkcamp/discussion
On Wed, 2024-08-28 at 12:54 +0200, Jan Kara wrote:
> On Mon 26-08-24 09:32:08, Theodore Ts'o wrote:
> > On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> > > Since the disk data is bogus, journal_reset fails with -EINVAL
> > > ("JBD2:
> > > Journal too short (blocks 2-1024)"); this leaves journal->j_head
> > > == NULL.
> > > However, jbd2_journal_load clears the JBD2_ABORT flag right
> > > before calling
> > > journal_reset. This leads to a problem later when
> > > ofcs2_mount_volume tries
> > > to flush the journal as part of the cleanup when aborting the
> > > mount
> > > operation:
> > >
> > > -> ofcs2_mount_volume (error; goto out_system_inodes)
> > > -> ofcs2_journal_shutdown
> > > -> jbd2_journal_flush
> > > -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> > > ...
> > >
> > > I confirmed that setting the JBD2_ABORT flag in journal_reset
> > > before
> > > returning -EINVAL fixes the problem:
> > >
> > > static int journal_reset(journal_t *journal)
> > > journal_fail_superblock(journal);
> > > + journal->j_flags |= JBD2_ABORT;
> > > return -EINVAL;
> > >
> > > You can find a proper patch file + the syzbot re-test result in
> > > [1].
> > > However, I'm not entirely sure whether this is the correct
> > > decision, and I
> > > wanted to confirm that this is an appropriate solution before
> > > sending a
> > > proper patch to the mailing list.
> >
> > The reason why this isn't an issue with ext4 is because the normal
> > "right" way to do this is if jbd2_journal_load() returns an error,
> > is
> > to call jbd2_journal_destroy() to release the data structures with
> > the
> > journal --- and then don't try to use the journal afterwards.
>
> Yep. OCFS2 guys are actually looking into fixing this in OCFS2
> (
> https://lore.kernel.org/ocfs2-devel/20240819131120.746077-1-sunjunchao2870@gmail.com
> /)
> Not sure what's the status though. Julian, did you send v2 of your
> fix?
Yeah, I have already submitted v2 [1] of the patch; it is awaiting
review.
[1]:
https://lore.kernel.org/ocfs2-devel/20240823083150.17590-1-sunjunchao2870@gmail.com/
>
> Honza
>
> > The weird thing is that there are two code paths in ocfs2 that
> > calls
> > jbd2_journal_load(). One is in ocfs2_replay_journal() which does
> > what
> > ext4 does. The other is ocfs2_load_journal() which does *not* do
> > this, and this is the one which you saw in the syzkaller
> > reproducer.
> > It looks like one codepath is used to replay the ocfs2 for some
> > other
> > node, and the is to load (and presumably later, replay) the journal
> > for the mounting node.
> >
> > There are also some other things which look very confusing, such as
> > the
> > fact that ocfs2_journal_shutdown calls igrab:
> >
> > /* need to inc inode use count - jbd2_journal_destroy will
> > iput. */
> > if (!igrab(inode))
> > BUG();
> >
> > ... which is *weird*. Normaly the journal inode refcount is
> > expected
> > to be bumped before calling jbd2_journal_init_inode() --- which
> > normally is done by an iget() function, and is needed to make sure
> > the
> > journal inode isn't released out from under the jbd2 layer. It
> > looks
> > like ocfs2 uses the journal inode for other stuff so get the
> > journal
> > inode without calling something like iget(). Which is OK, I guess,
> > but it means that there are a whole bunch of places where it has to
> > call !BUG_ON(igrab(journal->j_inode) before calling
> > jbd2_journal_destroy(). It would seem like the *right* thing to do
> > is
> > to bump the refcount in ocfs2_journal_init(), and if for some
> > reason
> > the igrab fails, it can just return an error early, instead of
> > having
> > to resort to BUG_ON() later, and if you don't realize that you have
> > to
> > do this weird igrab() before calling jbd2_journal_destroy(), you'll
> > end up leaking the journal inode.
> >
> > Anyway, I think the right thing to do is to restructure how ocfs2
> > calls the jbd2 journal, but that's going to require a more careful
> > examination of the code paths. Your patch is a bit of a blunt
> > force
> > hack that should be harmless, but it's probably not the best way to
> > fix the problem, although doing it "right" would be more
> > complicated....
> >
> > - Ted
Thanks,
--
Julian Sun <sunjunchao2870@gmail.com>
* Re: [syzbot] [ext4?] [ocfs2?] kernel BUG in jbd2_cleanup_journal_tail
From: Julian Sun @ 2024-08-28 12:28 UTC (permalink / raw)
To: Joel Becker, Theodore Ts'o
Cc: Vinicius Peixoto, syzbot+8512f3dbd96253ffbe27, jack, joseph.qi,
linux-ext4, linux-kernel, mark, ocfs2-devel, syzkaller-bugs,
~lkcamp/discussion
On Mon, 2024-08-26 at 12:34 -0700, Joel Becker wrote:
> On Mon, Aug 26, 2024 at 09:32:08AM -0400, Theodore Ts'o wrote:
> > On Mon, Aug 26, 2024 at 01:22:54AM -0300, Vinicius Peixoto wrote:
> > > Since the disk data is bogus, journal_reset fails with -EINVAL
> > > ("JBD2:
> > > Journal too short (blocks 2-1024)"); this leaves journal->j_head
> > > == NULL.
> > > However, jbd2_journal_load clears the JBD2_ABORT flag right
> > > before calling
> > > journal_reset. This leads to a problem later when
> > > ofcs2_mount_volume tries
> > > to flush the journal as part of the cleanup when aborting the
> > > mount
> > > operation:
> > >
> > > -> ofcs2_mount_volume (error; goto out_system_inodes)
> > > -> ofcs2_journal_shutdown
> > > -> jbd2_journal_flush
> > > -> jbd2_cleanup_journal_tail (J_ASSERT fails)
> > > ...
> > The reason why this isn't an issue with ext4 is because the normal
> > "right" way to do this is if jbd2_journal_load() returns an error,
> > is
> > to call jbd2_journal_destroy() to release the data structures with
> > the
> > journal --- and then don't try to use the journal afterwards.
> >
> > The weird thing is that there are two code paths in ocfs2 that
> > calls
> > jbd2_journal_load(). One is in ocfs2_replay_journal() which does
> > what
> > ext4 does. The other is ocfs2_load_journal() which does *not* do
> > this, and this is the one which you saw in the syzkaller
> > reproducer.
> > It looks like one codepath is used to replay the ocfs2 for some
> > other
> > node, and the is to load (and presumably later, replay) the journal
> > for the mounting node.
>
> You are correct, Ted, that one path is for the local journal and the
> other is to recover remote journals for other nodes that may have
> crashed.
>
> I think the big ordering issue is that we set
> osb->journal->j_state=OCFS2_JOURNAL_LOADED in ocfs2_journal_init(),
> before we've attempted any replay. Later in
> ocfs2_journal_shutdown(),
> we check this state and decide to perform cleanup.
>
> Instead, we should not set OCFS2_JOURNAL_LOADED until
> ocfs2_journal_load() has called jbd2_journal_load(). Only then do we
> actually know we have loaded a valid journal.
Yeah, that's right. We should not set OCFS2_JOURNAL_LOADED in
ocfs2_journal_init(); instead, we should set another flag there, like
OCFS2_JOURNAL_INIT, to indicate that resources have been allocated.
This way, we can clean them up properly in ocfs2_journal_shutdown(). We
should distinguish between these two states to ensure the correct exit
procedure when an error occurs, just like this patch [1] does.
[1]:
https://lore.kernel.org/ocfs2-devel/20240823083150.17590-1-sunjunchao2870@gmail.com/
>
> Something like:
>
> ```
> status = jbd2_journal_load(journal->j_journal);
> if (status < 0) {
> mlog(ML_ERROR, "Failed to load journal!\n");
> BUG_ON(!igrab(journal->j_inode));
> jbd2_journal_destroy(journal->j_journal);
> iput(journal->j_inode)
> goto done;
> }
> journal->j_state = OCFS2_JOURNAL_LOADED;
> ```
>
> With code like this, when jbd2_journal_load() fails, the future
> ocfs2_journal_shutdown() will exit early, because
> !OCFS2_JOURNAL_LOADED.
>
> I think this is the right spot; a quick audit of the paths (it has
> been
> a while) doesn't find any other outstanding state; the rest of
> journal
> startup, such as the commit thread etc, only happen after this.
>
> > jbd2_journal_destroy(). It would seem like the *right* thing to do
> > is
> > to bump the refcount in ocfs2_journal_init(), and if for some
> > reason
> > the igrab fails, it can just return an error early, instead of
> > having
> > to resort to BUG_ON() later, and if you don't realize that you have
> > to
> > do this weird igrab() before calling jbd2_journal_destroy(), you'll
> > end up leaking the journal inode.
>
> There are interactions of journal inodes for nodes we don't own, and
> also connections to cluster locks for our own journal (don't replay
> ourselves while another node has locked it and is recovering us). So
> we
> do have some state to keep track of. But it's been so long that I
> don't
> recall if there was a specific reason we do this late via igrab(), or
> if
> it's just that we should have done as you describe and missed it.
> You'll note that I copied the igrab/iput game in my snippet above.
>
> Should someone try to audit the igrab/iput thing later? Yes. But
> it's
> not a necessary part of this fix.
>
> Thanks,
> Joel
>
Thanks,
--
Julian Sun <sunjunchao2870@gmail.com>