* Ext4 jbd2 state lock race condition
@ 2016-03-25 10:32 Da-Chang Guan
  2016-03-26 15:21 ` Theodore Ts'o
From: Da-Chang Guan @ 2016-03-25 10:32 UTC
  To: linux-ext4

Hi, all,

  We have a 4-core Android device that has a system hang issue. The
stack trace shows the hang may be caused by a race on the jbd2 state lock.

  The stack trace is:

  03-24 00:24:00[26516.738548] INFO: rcu_sched self-detected stall on
CPU { 2}  (t=380280 jiffies g=631554 c=631553 q=6057)
  03-24 00:24:00[26516.748298] Sending NMI to all CPUs:
  03-24 00:24:00[26516.753286] NMI backtrace for cpu 0
  03-24 00:24:00[26516.756854]
  03-24 00:24:00[26516.758380] CPU: 0 PID: 587 Comm: system_server
Tainted: P           O 3.10.19-mag2+ #12
  03-24 00:24:00[26516.766655] task: deb14c00 ti: debce000 task.ti: debce000
  03-24 00:24:00[26516.772178] PC is at _raw_read_lock+0x18/0x30
  03-24 00:24:00[26516.776635] LR is at start_this_handle+0xd0/0x570
  03-24 00:24:00[26516.781447] pc : [<c0745c94>]    lr : [<c02e26fc>]
  psr: 800b0013
  03-24 00:24:00[26516.787857] sp : debcfc10  ip : debcfc20  fp : debcfc1c
  03-24 00:24:00[26516.793201] r10: c0eac5d8  r9 : dfa23400  r8 : debce000
  03-24 00:24:00[26516.798545] r7 : 00000002  r6 : dfa23414  r5 :
00000000  r4 : dfa23400
  03-24 00:24:00[26516.805221] r3 : 80000000  r2 : c0cba0c0  r1 :
d5675788  r0 : dfa23414
  03-24 00:24:00[26516.811897] Flags: Nzcv  IRQs on  FIQs on  Mode
SVC_32  ISA ARM  Segment user
  03-24 00:24:00[26516.819196] Control: 10c5383d  Table: 1ee1c06a  DAC: 00000015
  03-24 00:24:00[26516.825073] CPU: 0 PID: 587 Comm: system_server
Tainted: P           O 3.10.19-mag2+ #12
  03-24 00:24:00[26516.833349] [<c011b878>]
(unwind_backtrace+0x0/0x124) from [<c0117688>] (show_stack+0x20/0x24)
  03-24 00:24:00[26516.842157] [<c0117688>] (show_stack+0x20/0x24)
from [<c0740840>] (dump_stack+0x20/0x28)
  03-24 00:24:00[26516.850432] [<c0740840>] (dump_stack+0x20/0x28)
from [<c0114e80>] (show_regs+0x2c/0x34)
  03-24 00:24:00[26516.858619] [<c0114e80>] (show_regs+0x2c/0x34) from
[<c03cf574>] (nmi_cpu_backtrace+0x68/0x9c)
  03-24 00:24:00[26516.867428] [<c03cf574>]
(nmi_cpu_backtrace+0x68/0x9c) from [<c01194e0>]
(handle_IPI+0x3a8/0x3ec)
  03-24 00:24:00[26516.876503] [<c01194e0>] (handle_IPI+0x3a8/0x3ec)
from [<c010855c>] (gic_handle_irq+0x64/0x6c)
  03-24 00:24:00[26516.885312] [<c010855c>] (gic_handle_irq+0x64/0x6c)
from [<c0113340>] (__irq_svc+0x40/0x50)
  03-24 00:24:00[26516.893853] Exception stack(0xdebcfbc8 to 0xdebcfc10)
  03-24 00:24:00[26516.899021] fbc0:                   dfa23414
d5675788 c0cba0c0 80000000 dfa23400 00000000
  03-24 00:24:00[26516.907385] fbe0: dfa23414 00000002 debce000
dfa23400 c0eac5d8 debcfc1c debcfc20 debcfc10
  03-24 00:24:00[26516.915749] fc00: c02e26fc c0745c94 800b0013 ffffffff
  03-24 00:24:00[26516.920916] [<c0113340>] (__irq_svc+0x40/0x50) from
[<c0745c94>] (_raw_read_lock+0x18/0x30)
  03-24 00:24:00[26516.929459] [<c0745c94>] (_raw_read_lock+0x18/0x30)
from [<c02e26fc>] (start_this_handle+0xd0/0x570)
  03-24 00:24:00[26516.938801] [<c02e26fc>]
(start_this_handle+0xd0/0x570) from [<c02e2c44>]
(jbd2__journal_start+0xa8/0x170)
  03-24 00:24:00[26516.948675] [<c02e2c44>]
(jbd2__journal_start+0xa8/0x170) from [<c02cbf24>]
(__ext4_journal_start_sb+0x104/0x124)
  03-24 00:24:00[26516.959171] [<c02cbf24>]
(__ext4_journal_start_sb+0x104/0x124) from [<c02af284>]
(ext4_dirty_inode+0x2c/0x58)
  03-24 00:24:00[26516.969312] [<c02af284>]
(ext4_dirty_inode+0x2c/0x58) from [<c02614e8>]
(__mark_inode_dirty+0x84/0x288)
  03-24 00:24:00[26516.978921] [<c02614e8>]
(__mark_inode_dirty+0x84/0x288) from [<c0254e04>]
(update_time+0xac/0xb4)
  03-24 00:24:00[26516.988084] [<c0254e04>] (update_time+0xac/0xb4)
from [<c0255054>] (file_update_time+0xd0/0xf4)
  03-24 00:24:00[26516.996982] [<c0255054>]
(file_update_time+0xd0/0xf4) from [<c01ff150>]
(__generic_file_aio_write+0x268/0x3dc)
  03-24 00:24:00[26517.007212] [<c01ff150>]
(__generic_file_aio_write+0x268/0x3dc) from [<c01ff32c>]
(generic_file_aio_write+0x68/0xc8)
  03-24 00:24:00[26517.017975] [<c01ff32c>]
(generic_file_aio_write+0x68/0xc8) from [<c02a4ca0>]
(ext4_file_write+0x1d0/0x468)
  03-24 00:24:00[26517.027938] [<c02a4ca0>]
(ext4_file_write+0x1d0/0x468) from [<c023b760>]
(do_sync_write+0x84/0xa8)
  03-24 00:24:00[26517.037101] [<c023b760>] (do_sync_write+0x84/0xa8)
from [<c023beb8>] (vfs_write+0xe4/0x184)
  03-24 00:24:00[26517.045643] [<c023beb8>] (vfs_write+0xe4/0x184)
from [<c023c4ec>] (SyS_pwrite64+0x70/0x90)
  03-24 00:24:00[26517.054096] [<c023c4ec>] (SyS_pwrite64+0x70/0x90)
from [<c0113740>] (ret_fast_syscall+0x0/0x30)
  03-24 00:24:00[26517.062992] NMI backtrace for cpu 1

  All 4 cores seem to be stuck waiting for a lock:

  03-24 00:24:00[26516.929459] [<c0745c94>] (_raw_read_lock+0x18/0x30)
from [<c02e26fc>] (start_this_handle+0xd0/0x570)
  03-24 00:24:00[26516.938801] [<c02e26fc>]
(start_this_handle+0xd0/0x570) from [<c02e2c44>]
(jbd2__journal_start+0xa8/0x170)
  03-24 00:24:00[26516.948675] [<c02e2c44>]
(jbd2__journal_start+0xa8/0x170) from [<c02cbf24>]
(__ext4_journal_start_sb+0x104/0x124)

   We checked the source code and it seems to hang here:

   static int start_this_handle(journal_t *journal, handle_t *handle,
                                gfp_t gfp_mask)
   {
    ...
   repeat:
         read_lock(&journal->j_state_lock);

   Linux kernel version is 3.7.2.

   We want to know who holds the lock at that time so we can fix
it.  But we don't even know how to start debugging.
   Any help would be appreciated.

Regards,
David Guan


* Re: Ext4 jbd2 state lock race condition
  2016-03-25 10:32 Ext4 jbd2 state lock race condition Da-Chang Guan
@ 2016-03-26 15:21 ` Theodore Ts'o
From: Theodore Ts'o @ 2016-03-26 15:21 UTC
  To: Da-Chang Guan; +Cc: linux-ext4

On Fri, Mar 25, 2016 at 06:32:47PM +0800, Da-Chang Guan wrote:
> Hi, all,
> 
>   We have a 4-core Android device that has a system hang issue. The
> stack trace shows the hang may be caused by a race on the jbd2 state lock.

So this is an ancient kernel (3.7.2) --- which is extremely old.  It's
not even a stable kernel, and in fact starting this year I stopped
caring about 3.10 kernels, since, disgraceful as it is that we are
shipping phones in 2016 using kernels dating from 2013, there are no
mobile devices I care about that will be using anything older than
3.18 going forward.

So just to set your expectations, as upstream developers we generally
only support the latest upstream kernels.  Because I've been doing
some work to add ext4 encryption support into Android, for a while I
suffered having to support 3.10 based device kernels.  At this point,
though, I personally have little or no interest for kernels older than
3.18.

In terms of trying to debug this, if you can reproduce the bug, you'll
be in much better shape.  Also, if you have a serial console and
CONFIG_MAGIC_SYSRQ is enabled, I'd suggest getting stack traces of all
the CPUs so you can see who else might be holding the lock.  If you
can't reproduce the problem, and you can't get the stack traces for
all the CPUs using the magic sysrq, I doubt there's much that can be
done to reproduce the problem.

May I suggest upgrading to at least 3.18, preferably the latest stable
kernel, which as of this writing is 3.18.29?  I am running regression
tests on 3.18, and making sure that critical bug fixes are getting
back ported to 4.4 and 3.18.  (With 3.14 and 3.10 happening if I have
time and if it's not too difficult, but starting this year, those two
kernels are much lower priority for me.)

Best regards,

>    We want to know who holds the lock at that time so we can fix
> it.  But we don't even know how to start debugging.

If you can reproduce the problem, using CONFIG_LOCKDEP will be very
helpful.  Also perhaps useful would be to build 3.7.2 on x86, and then
use xfstests to try to flush out bugs.  I'm sure you will find them ---
when I first started testing a 3.10-based msm kernel, I was able to
trivially trigger kernel crashes using kvm-xfstests.  I think you'll
find it is much easier to find the bugs on x86, and then fix up the
kernel so it's not crashing there, and then see if that addresses your
problem on arm, because there is a much more powerful testing
infrastructure you can use for x86.  See:

	       http://thunk.org/gce-xfstests

If you can upgrade to a non-antique kernel, though, I think you'll
save yourself much more time.  It may be that using kvm-xfstests or
gce-xfstests to demonstrate how unstable 3.7.2 is might be helpful in
persuading your management to let you upgrade to something a bit more
recent.

Cheers,

						- Ted

