From mboxrd@z Thu Jan  1 00:00:00 1970
From: Lachlan McIlroy <lachlan@sgi.com>
Subject: Re: possible deadlock shown by CONFIG_PROVE_LOCKING
Date: Mon, 04 Feb 2008 16:56:52 +1100
Message-ID: <47A6A924.5010601@sgi.com>
References: <18328.37831.359806.359123@fisica.ufpr.br>	<479942D9.6000302@sgi.com> <18332.63985.756005.37224@fisica.ufpr.br>
Reply-To: lachlan@sgi.com
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: linux-fsdevel@vger.kernel.org
To: Carlos Carvalho <carlos@fisica.ufpr.br>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from relay1.sgi.com ([192.48.171.29]:60718 "EHLO relay.sgi.com"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
	id S1751180AbYBDFwv (ORCPT
	<rfc822;@relay.sgi.com:linux-fsdevel@vger.kernel.org>);
	Mon, 4 Feb 2008 00:52:51 -0500
In-Reply-To: <18332.63985.756005.37224@fisica.ufpr.br>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

Carlos Carvalho wrote:
> Lachlan McIlroy (lachlan@sgi.com) wrote on 25 January 2008 13:00:
>  >Carlos Carvalho wrote:
>  >> I compiled the kernel with Ingo's CONFIG_PROVE_LOCKING and got the
>  >> below at boot. Is it a problem?
>  >It was a problem - it has since been fixed in 2.6.23.
>  >Patch is attached in case you're interested.
> 
> [patch removed]
> 
> Thanks. I applied it to 22.16 without rejects but then the machine
> doesn't even mount the partition:
> 
> Filesystem "md0": Disabling barriers, not supported by the underlying device
> XFS mounting filesystem md0
> Unable to handle kernel paging request at ffff8110e325e8d0 RIP: 
The patch shouldn't have caused that.  Did you keep a stack trace from this?

> 
> Then I ran 23.14, no errors. However the machine locks hard a few
> seconds after we start the heavy IO tests...
Not good either - again, any stack traces from the stuck threads?

> 
> So we used .24. Now the machine stood some 16h of heavy IO load. Good!
> However we got the log below after about 15 minutes of running the test.
> It didn't repeat. Is it still a problem?
If it didn't cause a deadlock then it's probably not an issue.  We probably
just need some more lockdep annotations to tell lockdep to ignore this
locking scenario.  I can't see where we acquire nested qh_lock's in
xfs_qm_dqget().  It might actually be caused by acquiring the xfs_qm_freelist
lock in xfs_qm_dqget while holding the hash lock because both locks are
initialised with the same stack and lockdep will think they are locks for
the same object type.

> 
> Jan 26 19:25:43 mumm kernel: Filesystem "md0": Disabling barriers, not supported by the underlying device
> Jan 26 19:25:43 mumm kernel: XFS mounting filesystem md0
> Jan 26 19:25:43 mumm kernel: Ending clean XFS mount for filesystem: md0
> 
> 15min later...
> 
> Jan 26 19:41:40 mumm kernel: 
> Jan 26 19:41:40 mumm kernel: =============================================
> Jan 26 19:41:40 mumm kernel: [ INFO: possible recursive locking detected ]
> Jan 26 19:41:40 mumm kernel: 2.6.24 #3
> Jan 26 19:41:40 mumm kernel: ---------------------------------------------
> Jan 26 19:41:40 mumm kernel: rm/1589 is trying to acquire lock:
> Jan 26 19:41:40 mumm kernel:  (&list->qh_lock){--..}, at: [<ffffffff802c58e2>] xfs_qm_dqget+0x297/0x355
> Jan 26 19:41:40 mumm kernel: 
> Jan 26 19:41:40 mumm kernel: but task is already holding lock:
> Jan 26 19:41:40 mumm kernel:  (&list->qh_lock){--..}, at: [<ffffffff802c5841>] xfs_qm_dqget+0x1f6/0x355
> Jan 26 19:41:40 mumm kernel: 
> Jan 26 19:41:40 mumm kernel: other info that might help us debug this:
> Jan 26 19:41:40 mumm kernel: 4 locks held by rm/1589:
> Jan 26 19:41:40 mumm kernel:  #0:  (&type->i_mutex_dir_key#2/1){--..}, at: [<ffffffff8027c400>] do_unlinkat+0x6b/0x133
> Jan 26 19:41:40 mumm kernel:  #1:  (&sb->s_type->i_mutex_key#2){--..}, at: [<ffffffff8027a59d>] vfs_unlink+0x42/0x8c
> Jan 26 19:41:40 mumm kernel:  #2:  (&(&ip->i_lock)->mr_lock){----}, at: [<ffffffff802f4093>] xfs_ilock+0x63/0x8d
> Jan 26 19:41:40 mumm kernel:  #3:  (&list->qh_lock){--..}, at: [<ffffffff802c5841>] xfs_qm_dqget+0x1f6/0x355
> Jan 26 19:41:40 mumm kernel: 
> Jan 26 19:41:40 mumm kernel: stack backtrace:
> Jan 26 19:41:40 mumm kernel: Pid: 1589, comm: rm Not tainted 2.6.24 #3
> Jan 26 19:41:40 mumm kernel: 
> Jan 26 19:41:40 mumm kernel: Call Trace:
> Jan 26 19:41:40 mumm kernel:  [<ffffffff802446b2>] __lock_acquire+0x1b7/0xcd0
> Jan 26 19:41:40 mumm kernel:  [<ffffffff802c58e2>] xfs_qm_dqget+0x297/0x355
> Jan 26 19:41:40 mumm kernel:  [<ffffffff802455f4>] lock_acquire+0x51/0x6c
> Jan 26 19:41:40 mumm kernel:  [<ffffffff802c58e2>] xfs_qm_dqget+0x297/0x355
> Jan 26 19:41:40 mumm kernel:  [<ffffffff80241ef9>] debug_mutex_lock_common+0x16/0x23
> Jan 26 19:41:40 mumm kernel: other info that might help us debug this:
> Jan 26 19:41:40 mumm kernel: 4 locks held by rm/1589:
> Jan 26 19:41:40 mumm kernel:  #0:  (&type->i_mutex_dir_key#2/1){--..}, at: [<ffffffff8027c400>] do_unlinkat+0x6b/0x133
> Jan 26 19:41:40 mumm kernel:  #1:  (&sb->s_type->i_mutex_key#2){--..}, at: [<ffffffff8027a59d>] vfs_unlink+0x42/0x8c
> Jan 26 19:41:40 mumm kernel:  #2:  (&(&ip->i_lock)->mr_lock){----}, at: [<ffffffff802f4093>] xfs_ilock+0x63/0x8d
> Jan 26 19:41:40 mumm kernel:  #3:  (&list->qh_lock){--..}, at: [<ffffffff802c5841>] xfs_qm_dqget+0x1f6/0x355
> Jan 26 19:41:40 mumm kernel: 
> Jan 26 19:41:40 mumm kernel: stack backtrace:
> Jan 26 19:41:40 mumm kernel: Pid: 1589, comm: rm Not tainted 2.6.24 #3
> Jan 26 19:41:40 mumm kernel: 
> Jan 26 19:41:40 mumm kernel: Call Trace:
> Jan 26 19:41:40 mumm kernel:  [<ffffffff802446b2>] __lock_acquire+0x1b7/0xcd0
> Jan 26 19:41:40 mumm kernel:  [<ffffffff802c58e2>] xfs_qm_dqget+0x297/0x355
> Jan 26 19:41:40 mumm kernel:  [<ffffffff802455f4>] lock_acquire+0x51/0x6c
> Jan 26 19:41:40 mumm kernel:  [<ffffffff802c58e2>] xfs_qm_dqget+0x297/0x355
> Jan 26 19:41:40 mumm kernel:  [<ffffffff80241ef9>] debug_mutex_lock_common+0x16/0x23
> Jan 26 19:41:40 mumm kernel:  [<ffffffff803f8c7f>] mutex_lock_nested+0xec/0x286
> Jan 26 19:41:40 mumm kernel:  [<ffffffff802c58e2>] xfs_qm_dqget+0x297/0x355
> Jan 26 19:41:40 mumm kernel:  [<ffffffff802c8c2c>] xfs_qm_dqattach_one+0xe2/0x150
> Jan 26 19:41:40 mumm kernel:  [<ffffffff802c8d51>] xfs_qm_dqattach+0xb7/0x175
> Jan 26 19:41:40 mumm kernel:  [<ffffffff80311ac5>] xfs_remove+0x8f/0x364
> Jan 26 19:41:40 mumm kernel:  [<ffffffff803f8dff>] mutex_lock_nested+0x26c/0x286
> Jan 26 19:41:40 mumm kernel:  [<ffffffff80244082>] trace_hardirqs_on+0x115/0x139
> Jan 26 19:41:40 mumm kernel:  [<ffffffff80318e6f>] xfs_vn_unlink+0x17/0x36
> Jan 26 19:41:40 mumm kernel:  [<ffffffff8027a5b7>] vfs_unlink+0x5c/0x8c
> Jan 26 19:41:40 mumm kernel:  [<ffffffff8027c449>] do_unlinkat+0xb4/0x133
> Jan 26 19:41:40 mumm kernel:  [<ffffffff803f997a>] trace_hardirqs_on_thunk+0x35/0x3a
> Jan 26 19:41:40 mumm kernel:  [<ffffffff80244082>] trace_hardirqs_on+0x115/0x139
> Jan 26 19:41:40 mumm kernel:  [<ffffffff803f997a>] trace_hardirqs_on_thunk+0x35/0x3a
> Jan 26 19:41:40 mumm kernel:  [<ffffffff8020b66e>] system_call+0x7e/0x83
>