* Ext4 jbd2 state lock race condition
@ 2016-03-25 10:32 Da-Chang Guan
2016-03-26 15:21 ` Theodore Ts'o
From: Da-Chang Guan @ 2016-03-25 10:32 UTC
To: linux-ext4
Hi, all,
We have a 4-core Android device with a system hang issue. The stack
trace suggests the hang may be caused by a race on the jbd2 state lock.
The stack trace is:
03-24 00:24:00[26516.738548] INFO: rcu_sched self-detected stall on
CPU { 2} (t=380280 jiffies g=631554 c=631553 q=6057)
03-24 00:24:00[26516.748298] Sending NMI to all CPUs:
03-24 00:24:00[26516.753286] NMI backtrace for cpu 0
03-24 00:24:00[26516.756854]
03-24 00:24:00[26516.758380] CPU: 0 PID: 587 Comm: system_server
Tainted: P O 3.10.19-mag2+ #12
03-24 00:24:00[26516.766655] task: deb14c00 ti: debce000 task.ti: debce000
03-24 00:24:00[26516.772178] PC is at _raw_read_lock+0x18/0x30
03-24 00:24:00[26516.776635] LR is at start_this_handle+0xd0/0x570
03-24 00:24:00[26516.781447] pc : [<c0745c94>] lr : [<c02e26fc>]
psr: 800b0013
03-24 00:24:00[26516.787857] sp : debcfc10 ip : debcfc20 fp : debcfc1c
03-24 00:24:00[26516.793201] r10: c0eac5d8 r9 : dfa23400 r8 : debce000
03-24 00:24:00[26516.798545] r7 : 00000002 r6 : dfa23414 r5 :
00000000 r4 : dfa23400
03-24 00:24:00[26516.805221] r3 : 80000000 r2 : c0cba0c0 r1 :
d5675788 r0 : dfa23414
03-24 00:24:00[26516.811897] Flags: Nzcv IRQs on FIQs on Mode
SVC_32 ISA ARM Segment user
03-24 00:24:00[26516.819196] Control: 10c5383d Table: 1ee1c06a DAC: 00000015
03-24 00:24:00[26516.825073] CPU: 0 PID: 587 Comm: system_server
Tainted: P O 3.10.19-mag2+ #12
03-24 00:24:00[26516.833349] [<c011b878>]
(unwind_backtrace+0x0/0x124) from [<c0117688>] (show_stack+0x20/0x24)
03-24 00:24:00[26516.842157] [<c0117688>] (show_stack+0x20/0x24)
from [<c0740840>] (dump_stack+0x20/0x28)
03-24 00:24:00[26516.850432] [<c0740840>] (dump_stack+0x20/0x28)
from [<c0114e80>] (show_regs+0x2c/0x34)
03-24 00:24:00[26516.858619] [<c0114e80>] (show_regs+0x2c/0x34) from
[<c03cf574>] (nmi_cpu_backtrace+0x68/0x9c)
03-24 00:24:00[26516.867428] [<c03cf574>]
(nmi_cpu_backtrace+0x68/0x9c) from [<c01194e0>]
(handle_IPI+0x3a8/0x3ec)
03-24 00:24:00[26516.876503] [<c01194e0>] (handle_IPI+0x3a8/0x3ec)
from [<c010855c>] (gic_handle_irq+0x64/0x6c)
03-24 00:24:00[26516.885312] [<c010855c>] (gic_handle_irq+0x64/0x6c)
from [<c0113340>] (__irq_svc+0x40/0x50)
03-24 00:24:00[26516.893853] Exception stack(0xdebcfbc8 to 0xdebcfc10)
03-24 00:24:00[26516.899021] fbc0: dfa23414
d5675788 c0cba0c0 80000000 dfa23400 00000000
03-24 00:24:00[26516.907385] fbe0: dfa23414 00000002 debce000
dfa23400 c0eac5d8 debcfc1c debcfc20 debcfc10
03-24 00:24:00[26516.915749] fc00: c02e26fc c0745c94 800b0013 ffffffff
03-24 00:24:00[26516.920916] [<c0113340>] (__irq_svc+0x40/0x50) from
[<c0745c94>] (_raw_read_lock+0x18/0x30)
03-24 00:24:00[26516.929459] [<c0745c94>] (_raw_read_lock+0x18/0x30)
from [<c02e26fc>] (start_this_handle+0xd0/0x570)
03-24 00:24:00[26516.938801] [<c02e26fc>]
(start_this_handle+0xd0/0x570) from [<c02e2c44>]
(jbd2__journal_start+0xa8/0x170)
03-24 00:24:00[26516.948675] [<c02e2c44>]
(jbd2__journal_start+0xa8/0x170) from [<c02cbf24>]
(__ext4_journal_start_sb+0x104/0x124)
03-24 00:24:00[26516.959171] [<c02cbf24>]
(__ext4_journal_start_sb+0x104/0x124) from [<c02af284>]
(ext4_dirty_inode+0x2c/0x58)
03-24 00:24:00[26516.969312] [<c02af284>]
(ext4_dirty_inode+0x2c/0x58) from [<c02614e8>]
(__mark_inode_dirty+0x84/0x288)
03-24 00:24:00[26516.978921] [<c02614e8>]
(__mark_inode_dirty+0x84/0x288) from [<c0254e04>]
(update_time+0xac/0xb4)
03-24 00:24:00[26516.988084] [<c0254e04>] (update_time+0xac/0xb4)
from [<c0255054>] (file_update_time+0xd0/0xf4)
03-24 00:24:00[26516.996982] [<c0255054>]
(file_update_time+0xd0/0xf4) from [<c01ff150>]
(__generic_file_aio_write+0x268/0x3dc)
03-24 00:24:00[26517.007212] [<c01ff150>]
(__generic_file_aio_write+0x268/0x3dc) from [<c01ff32c>]
(generic_file_aio_write+0x68/0xc8)
03-24 00:24:00[26517.017975] [<c01ff32c>]
(generic_file_aio_write+0x68/0xc8) from [<c02a4ca0>]
(ext4_file_write+0x1d0/0x468)
03-24 00:24:00[26517.027938] [<c02a4ca0>]
(ext4_file_write+0x1d0/0x468) from [<c023b760>]
(do_sync_write+0x84/0xa8)
03-24 00:24:00[26517.037101] [<c023b760>] (do_sync_write+0x84/0xa8)
from [<c023beb8>] (vfs_write+0xe4/0x184)
03-24 00:24:00[26517.045643] [<c023beb8>] (vfs_write+0xe4/0x184)
from [<c023c4ec>] (SyS_pwrite64+0x70/0x90)
03-24 00:24:00[26517.054096] [<c023c4ec>] (SyS_pwrite64+0x70/0x90)
from [<c0113740>] (ret_fast_syscall+0x0/0x30)
03-24 00:24:00[26517.062992] NMI backtrace for cpu 1
All 4 cores seem to be stuck waiting on a lock:
03-24 00:24:00[26516.929459] [<c0745c94>] (_raw_read_lock+0x18/0x30)
from [<c02e26fc>] (start_this_handle+0xd0/0x570)
03-24 00:24:00[26516.938801] [<c02e26fc>]
(start_this_handle+0xd0/0x570) from [<c02e2c44>]
(jbd2__journal_start+0xa8/0x170)
03-24 00:24:00[26516.948675] [<c02e2c44>]
(jbd2__journal_start+0xa8/0x170) from [<c02cbf24>]
(__ext4_journal_start_sb+0x104/0x124)
We checked the source code, and it seems to hang here:
static int start_this_handle(journal_t *journal, handle_t *handle,
                             gfp_t gfp_mask)
...
repeat:
        read_lock(&journal->j_state_lock);
The Linux kernel version is 3.7.2.
We want to know who holds the lock at that time so we can fix
it, but we don't even know how to start debugging.
Any help would be appreciated.
Regards,
David Guan
* Re: Ext4 jbd2 state lock race condition
2016-03-25 10:32 Ext4 jbd2 state lock race condition Da-Chang Guan
@ 2016-03-26 15:21 ` Theodore Ts'o
From: Theodore Ts'o @ 2016-03-26 15:21 UTC
To: Da-Chang Guan; +Cc: linux-ext4

On Fri, Mar 25, 2016 at 06:32:47PM +0800, Da-Chang Guan wrote:
> Hi, all,
>
> We have a 4-core Android device with a system hang issue. The stack
> trace suggests the hang may be caused by a race on the jbd2 state lock.

So this is an ancient kernel (3.7.2) --- which is extremely old. It's
not even a stable kernel, and in fact starting this year I stopped
caring about 3.10 kernels since, while it is disgraceful that we are
shipping phones in 2016 using kernels dating from 2013, there are no
mobile devices I care about that will be using anything older than 3.18
going forward.

So just to set your expectations, as upstream developers we generally
only support the latest upstream kernels. Because I've been doing some
work to add ext4 encryption support into Android, for a while I suffered
having to support 3.10-based device kernels. At this point, though, I
personally have little or no interest in kernels older than 3.18.

In terms of trying to debug this, if you can reproduce the bug, you'll
be in much better shape. Also, if you have a serial console and
CONFIG_MAGIC_SYSRQ is enabled, I'd suggest getting stack traces of all
the CPUs so you can see who else might be holding the lock.

If you can't reproduce the problem, and you can't get the stack traces
for all the CPUs using the magic sysrq, I doubt there's much that can
be done to reproduce the problem.

May I suggest upgrading to at least 3.18, preferably the latest stable
kernel, which as of this writing is 3.18.29? I am running regression
tests on 3.18, and making sure that critical bug fixes are getting
backported to 4.4 and 3.18. (With 3.14 and 3.10 happening if I have
time and if it's not too difficult, but starting this year, those two
kernels are much lower priority for me.)

Best regards,

> We want to know who holds the lock at that time so we can fix
> it, but we don't even know how to start debugging.

If you can reproduce the problem, using CONFIG_LOCKDEP will be very
helpful. Also perhaps useful would be to build 3.7.2 on x86, and then
use xfstests to try to flush out bugs. I'm sure you will find them ---
when I first started testing a 3.10-based msm kernel, I was able to
trivially trigger kernel crashes using kvm-xfstests.

I think you'll find it is much easier to find the bugs on x86, and then
fix up the kernel so it's not crashing there, and then see if that
addresses your problem on arm, because there is a much more powerful
testing infrastructure you can use for x86. See:

http://thunk.org/gce-xfstests

If you can upgrade to a non-antique kernel, though, I think you'll save
yourself much more time. It may be that using kvm-xfstests or
gce-xfstests to demonstrate how unstable 3.7.2 is might be helpful in
persuading your management to let you upgrade to something a bit more
recent.

Cheers,

- Ted