net/sunrpc: v4.14-rc4 lockdep warning

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* net/sunrpc: v4.14-rc4 lockdep warning
@ 2017-10-09 18:17 Lorenzo Pieralisi
  2017-10-09 18:32 ` Trond Myklebust
  0 siblings, 1 reply; 7+ messages in thread
From: Lorenzo Pieralisi @ 2017-10-09 18:17 UTC (permalink / raw)
  To: linux-kernel, linux-nfs
  Cc: Trond Myklebust, J. Bruce Fields, Anna Schumaker, Jeff Layton

Hi,

I have run into the lockdep warning below while running v4.14-rc3/rc4
on an ARM64 defconfig Juno dev board - reporting it to check whether
it is a known/genuine issue.

Please let me know if you need further debug data or need some
specific tests.

Thanks,
Lorenzo

[    6.209384] ======================================================
[    6.215569] WARNING: possible circular locking dependency detected
[    6.221755] 4.14.0-rc4 #54 Not tainted
[    6.225503] ------------------------------------------------------
[    6.231689] kworker/4:0H/32 is trying to acquire lock:
[    6.236830]  ((&task->u.tk_work)){+.+.}, at: [<ffff0000080e64cc>] process_one_work+0x1cc/0x3f0
[    6.245472] 
               but task is already holding lock:
[    6.251309]  ("xprtiod"){+.+.}, at: [<ffff0000080e64cc>] process_one_work+0x1cc/0x3f0
[    6.259158] 
               which lock already depends on the new lock.

[    6.267345] 
               the existing dependency chain (in reverse order) is:
[    6.274836] 
               -> #1 ("xprtiod"){+.+.}:
[    6.279903]        lock_acquire+0x6c/0xb8
[    6.283914]        flush_work+0x188/0x270
[    6.287926]        __cancel_work_timer+0x120/0x198
[    6.292720]        cancel_work_sync+0x10/0x18
[    6.297081]        xs_destroy+0x34/0x58
[    6.300917]        xprt_destroy+0x84/0x90
[    6.304927]        xprt_put+0x34/0x40
[    6.308589]        rpc_task_release_client+0x6c/0x80
[    6.313557]        rpc_release_resources_task+0x2c/0x38
[    6.318786]        __rpc_execute+0x9c/0x210
[    6.322971]        rpc_async_schedule+0x10/0x18
[    6.327504]        process_one_work+0x240/0x3f0
[    6.332036]        worker_thread+0x48/0x420
[    6.336222]        kthread+0x12c/0x158
[    6.339972]        ret_from_fork+0x10/0x18
[    6.344068] 
               -> #0 ((&task->u.tk_work)){+.+.}:
[    6.349920]        __lock_acquire+0x12ec/0x14a8
[    6.354451]        lock_acquire+0x6c/0xb8
[    6.358462]        process_one_work+0x22c/0x3f0
[    6.362994]        worker_thread+0x48/0x420
[    6.367180]        kthread+0x12c/0x158
[    6.370929]        ret_from_fork+0x10/0x18
[    6.375025] 
               other info that might help us debug this:

[    6.383038]  Possible unsafe locking scenario:

[    6.388962]        CPU0                    CPU1
[    6.393493]        ----                    ----
[    6.398023]   lock("xprtiod");
[    6.401080]                                lock((&task->u.tk_work));
[    6.407444]                                lock("xprtiod");
[    6.413024]   lock((&task->u.tk_work));
[    6.416863] 
                *** DEADLOCK ***

[    6.422789] 1 lock held by kworker/4:0H/32:
[    6.426972]  #0:  ("xprtiod"){+.+.}, at: [<ffff0000080e64cc>] process_one_work+0x1cc/0x3f0
[    6.435258] 
               stack backtrace:
[    6.439618] CPU: 4 PID: 32 Comm: kworker/4:0H Not tainted 4.14.0-rc4 #54
[    6.446325] Hardware name: ARM Juno development board (r2) (DT)
[    6.452252] Workqueue: xprtiod rpc_async_schedule
[    6.456959] Call trace:
[    6.459406] [<ffff000008089430>] dump_backtrace+0x0/0x3c8
[    6.464810] [<ffff00000808980c>] show_stack+0x14/0x20
[    6.469866] [<ffff000008a01a30>] dump_stack+0xb8/0xf0
[    6.474922] [<ffff0000081194ac>] print_circular_bug+0x224/0x3a0
[    6.480849] [<ffff00000811a304>] check_prev_add+0x304/0x860
[    6.486426] [<ffff00000811c8c4>] __lock_acquire+0x12ec/0x14a8
[    6.492177] [<ffff00000811d144>] lock_acquire+0x6c/0xb8
[    6.497406] [<ffff0000080e652c>] process_one_work+0x22c/0x3f0
[    6.503156] [<ffff0000080e6738>] worker_thread+0x48/0x420
[    6.508560] [<ffff0000080ed5bc>] kthread+0x12c/0x158
[    6.513528] [<ffff000008084d48>] ret_from_fork+0x10/0x18

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: net/sunrpc: v4.14-rc4 lockdep warning
  2017-10-09 18:17 net/sunrpc: v4.14-rc4 lockdep warning Lorenzo Pieralisi
@ 2017-10-09 18:32 ` Trond Myklebust
  2017-10-10 14:03   ` tj
  0 siblings, 1 reply; 7+ messages in thread
From: Trond Myklebust @ 2017-10-09 18:32 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, lorenzo.pieralisi@arm.com,
	linux-nfs@vger.kernel.org, jiangshanlai@gmail.com, tj@kernel.org
  Cc: bfields@fieldses.org, anna.schumaker@netapp.com,
	jlayton@poochiereds.net

On Mon, 2017-10-09 at 19:17 +0100, Lorenzo Pieralisi wrote:
> Hi,
> 
> I have run into the lockdep warning below while running v4.14-rc3/rc4
> on an ARM64 defconfig Juno dev board - reporting it to check whether
> it is a known/genuine issue.
> 
> Please let me know if you need further debug data or need some
> specific tests.
> 
> Thanks,
> Lorenzo
> 
> [    6.209384] ======================================================
> [    6.215569] WARNING: possible circular locking dependency detected
> [    6.221755] 4.14.0-rc4 #54 Not tainted
> [    6.225503] ------------------------------------------------------
> [    6.231689] kworker/4:0H/32 is trying to acquire lock:
> [    6.236830]  ((&task->u.tk_work)){+.+.}, at: [<ffff0000080e64cc>]
> process_one_work+0x1cc/0x3f0
> [    6.245472] 
>                but task is already holding lock:
> [    6.251309]  ("xprtiod"){+.+.}, at: [<ffff0000080e64cc>]
> process_one_work+0x1cc/0x3f0
> [    6.259158] 
>                which lock already depends on the new lock.
> 
> [    6.267345] 
>                the existing dependency chain (in reverse order) is:
> [    6.274836] 
>                -> #1 ("xprtiod"){+.+.}:
> [    6.279903]        lock_acquire+0x6c/0xb8
> [    6.283914]        flush_work+0x188/0x270
> [    6.287926]        __cancel_work_timer+0x120/0x198
> [    6.292720]        cancel_work_sync+0x10/0x18
> [    6.297081]        xs_destroy+0x34/0x58
> [    6.300917]        xprt_destroy+0x84/0x90
> [    6.304927]        xprt_put+0x34/0x40
> [    6.308589]        rpc_task_release_client+0x6c/0x80
> [    6.313557]        rpc_release_resources_task+0x2c/0x38
> [    6.318786]        __rpc_execute+0x9c/0x210
> [    6.322971]        rpc_async_schedule+0x10/0x18
> [    6.327504]        process_one_work+0x240/0x3f0
> [    6.332036]        worker_thread+0x48/0x420
> [    6.336222]        kthread+0x12c/0x158
> [    6.339972]        ret_from_fork+0x10/0x18
> [    6.344068] 
>                -> #0 ((&task->u.tk_work)){+.+.}:
> [    6.349920]        __lock_acquire+0x12ec/0x14a8
> [    6.354451]        lock_acquire+0x6c/0xb8
> [    6.358462]        process_one_work+0x22c/0x3f0
> [    6.362994]        worker_thread+0x48/0x420
> [    6.367180]        kthread+0x12c/0x158
> [    6.370929]        ret_from_fork+0x10/0x18
> [    6.375025] 
>                other info that might help us debug this:
> 
> [    6.383038]  Possible unsafe locking scenario:
> 
> [    6.388962]        CPU0                    CPU1
> [    6.393493]        ----                    ----
> [    6.398023]   lock("xprtiod");
> [    6.401080]                                lock((&task-
> >u.tk_work));
> [    6.407444]                                lock("xprtiod");
> [    6.413024]   lock((&task->u.tk_work));
> [    6.416863] 
>                 *** DEADLOCK ***
> 
> [    6.422789] 1 lock held by kworker/4:0H/32:
> [    6.426972]  #0:  ("xprtiod"){+.+.}, at: [<ffff0000080e64cc>]
> process_one_work+0x1cc/0x3f0
> [    6.435258] 
>                stack backtrace:
> [    6.439618] CPU: 4 PID: 32 Comm: kworker/4:0H Not tainted 4.14.0-
> rc4 #54
> [    6.446325] Hardware name: ARM Juno development board (r2) (DT)
> [    6.452252] Workqueue: xprtiod rpc_async_schedule
> [    6.456959] Call trace:
> [    6.459406] [<ffff000008089430>] dump_backtrace+0x0/0x3c8
> [    6.464810] [<ffff00000808980c>] show_stack+0x14/0x20
> [    6.469866] [<ffff000008a01a30>] dump_stack+0xb8/0xf0
> [    6.474922] [<ffff0000081194ac>] print_circular_bug+0x224/0x3a0
> [    6.480849] [<ffff00000811a304>] check_prev_add+0x304/0x860
> [    6.486426] [<ffff00000811c8c4>] __lock_acquire+0x12ec/0x14a8
> [    6.492177] [<ffff00000811d144>] lock_acquire+0x6c/0xb8
> [    6.497406] [<ffff0000080e652c>] process_one_work+0x22c/0x3f0
> [    6.503156] [<ffff0000080e6738>] worker_thread+0x48/0x420
> [    6.508560] [<ffff0000080ed5bc>] kthread+0x12c/0x158
> [    6.513528] [<ffff000008084d48>] ret_from_fork+0x10/0x18
> 

Adding Tejun and Lai, since this looks like a workqueue locking issue.

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: net/sunrpc: v4.14-rc4 lockdep warning
  2017-10-09 18:32 ` Trond Myklebust
@ 2017-10-10 14:03   ` tj
  2017-10-10 16:48     ` Trond Myklebust
  0 siblings, 1 reply; 7+ messages in thread
From: tj @ 2017-10-10 14:03 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-kernel@vger.kernel.org, lorenzo.pieralisi@arm.com,
	linux-nfs@vger.kernel.org, jiangshanlai@gmail.com,
	bfields@fieldses.org, anna.schumaker@netapp.com,
	jlayton@poochiereds.net

Hello, Trond.

On Mon, Oct 09, 2017 at 06:32:13PM +0000, Trond Myklebust wrote:
> On Mon, 2017-10-09 at 19:17 +0100, Lorenzo Pieralisi wrote:
> > I have run into the lockdep warning below while running v4.14-rc3/rc4
> > on an ARM64 defconfig Juno dev board - reporting it to check whether
> > it is a known/genuine issue.
> > 
> > Please let me know if you need further debug data or need some
> > specific tests.
> > 
> > [    6.209384] ======================================================
> > [    6.215569] WARNING: possible circular locking dependency detected
> > [    6.221755] 4.14.0-rc4 #54 Not tainted
> > [    6.225503] ------------------------------------------------------
> > [    6.231689] kworker/4:0H/32 is trying to acquire lock:
> > [    6.236830]  ((&task->u.tk_work)){+.+.}, at: [<ffff0000080e64cc>]
> > process_one_work+0x1cc/0x3f0
> > [    6.245472] 
> >                but task is already holding lock:
> > [    6.251309]  ("xprtiod"){+.+.}, at: [<ffff0000080e64cc>]
> > process_one_work+0x1cc/0x3f0
> > [    6.259158] 
> >                which lock already depends on the new lock.
> > 
> > [    6.267345] 
> >                the existing dependency chain (in reverse order) is:
..
> Adding Tejun and Lai, since this looks like a workqueue locking issue.

It looks a bit cryptic but it's warning against the following case.

1. Memory pressure is high and rescuer kicks in for the xprtiod
   workqueue.  There are no other kworkers serving the workqueue.

2. The rescuer runs the xptr_destroy path and ends up calling
   cancel_work_sync() on a work item which is queued on xprtiod.

3. The work item is pending on the same workqueue and assuming that
   memory pressure doesn't let off (let's say reclaim is trying to
   kick off nfs pages), the only way it can get executed is by the
   rescuer which is waiting for the work item - an A-B-A deadlock.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: net/sunrpc: v4.14-rc4 lockdep warning
  2017-10-10 14:03   ` tj
@ 2017-10-10 16:48     ` Trond Myklebust
  2017-10-10 17:19       ` tj
  0 siblings, 1 reply; 7+ messages in thread
From: Trond Myklebust @ 2017-10-10 16:48 UTC (permalink / raw)
  To: tj@kernel.org
  Cc: bfields@fieldses.org, linux-kernel@vger.kernel.org,
	lorenzo.pieralisi@arm.com, jlayton@poochiereds.net,
	linux-nfs@vger.kernel.org, jiangshanlai@gmail.com,
	anna.schumaker@netapp.com

On Tue, 2017-10-10 at 07:03 -0700, tj@kernel.org wrote:
> Hello, Trond.
> 
> On Mon, Oct 09, 2017 at 06:32:13PM +0000, Trond Myklebust wrote:
> > On Mon, 2017-10-09 at 19:17 +0100, Lorenzo Pieralisi wrote:
> > > I have run into the lockdep warning below while running v4.14-
> > > rc3/rc4
> > > on an ARM64 defconfig Juno dev board - reporting it to check
> > > whether
> > > it is a known/genuine issue.
> > > 
> > > Please let me know if you need further debug data or need some
> > > specific tests.
> > > 
> > > [    6.209384]
> > > ======================================================
> > > [    6.215569] WARNING: possible circular locking dependency
> > > detected
> > > [    6.221755] 4.14.0-rc4 #54 Not tainted
> > > [    6.225503] --------------------------------------------------
> > > ----
> > > [    6.231689] kworker/4:0H/32 is trying to acquire lock:
> > > [    6.236830]  ((&task->u.tk_work)){+.+.}, at:
> > > [<ffff0000080e64cc>]
> > > process_one_work+0x1cc/0x3f0
> > > [    6.245472] 
> > >                but task is already holding lock:
> > > [    6.251309]  ("xprtiod"){+.+.}, at: [<ffff0000080e64cc>]
> > > process_one_work+0x1cc/0x3f0
> > > [    6.259158] 
> > >                which lock already depends on the new lock.
> > > 
> > > [    6.267345] 
> > >                the existing dependency chain (in reverse order)
> > > is:
> 
> ..
> > Adding Tejun and Lai, since this looks like a workqueue locking
> > issue.
> 
> It looks a bit cryptic but it's warning against the following case.
> 
> 1. Memory pressure is high and rescuer kicks in for the xprtiod
>    workqueue.  There are no other kworkers serving the workqueue.
> 
> 2. The rescuer runs the xptr_destroy path and ends up calling
>    cancel_work_sync() on a work item which is queued on xprtiod.
> 
> 3. The work item is pending on the same workqueue and assuming that
>    memory pressure doesn't let off (let's say reclaim is trying to
>    kick off nfs pages), the only way it can get executed is by the
>    rescuer which is waiting for the work item - an A-B-A deadlock.
> 

Hi Tejun,

Thanks for the explanation. What I'm not really understanding here
though, is how the work item could be queued at all. We have a
wait_on_bit_lock() in xprt_destroy() that should mean the xprt-
>task_cleanup work item has completed running, and that it cannot be
requeued.

Is there a possibility that the flush_queue() might be triggered
despite the work item not being queued?

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: net/sunrpc: v4.14-rc4 lockdep warning
  2017-10-10 16:48     ` Trond Myklebust
@ 2017-10-10 17:19       ` tj
  2017-10-11 17:49         ` Trond Myklebust
  0 siblings, 1 reply; 7+ messages in thread
From: tj @ 2017-10-10 17:19 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: bfields@fieldses.org, linux-kernel@vger.kernel.org,
	lorenzo.pieralisi@arm.com, jlayton@poochiereds.net,
	linux-nfs@vger.kernel.org, jiangshanlai@gmail.com,
	anna.schumaker@netapp.com

Hello,

On Tue, Oct 10, 2017 at 04:48:57PM +0000, Trond Myklebust wrote:
> Thanks for the explanation. What I'm not really understanding here
> though, is how the work item could be queued at all. We have a
> wait_on_bit_lock() in xprt_destroy() that should mean the xprt-
> >task_cleanup work item has completed running, and that it cannot be
> requeued.
> 
> Is there a possibility that the flush_queue() might be triggered
> despite the work item not being queued?

Yeah, for sure.  The lockdep annotations don't distinguish those
cases and assume the worst case.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: net/sunrpc: v4.14-rc4 lockdep warning
  2017-10-10 17:19       ` tj
@ 2017-10-11 17:49         ` Trond Myklebust
  2017-10-16 13:34           ` Jan Glauber
  0 siblings, 1 reply; 7+ messages in thread
From: Trond Myklebust @ 2017-10-11 17:49 UTC (permalink / raw)
  To: tj@kernel.org
  Cc: bfields@fieldses.org, linux-kernel@vger.kernel.org,
	lorenzo.pieralisi@arm.com, jlayton@poochiereds.net,
	linux-nfs@vger.kernel.org, jiangshanlai@gmail.com,
	anna.schumaker@netapp.com

On Tue, 2017-10-10 at 10:19 -0700, tj@kernel.org wrote:
> Hello,
> 
> On Tue, Oct 10, 2017 at 04:48:57PM +0000, Trond Myklebust wrote:
> > Thanks for the explanation. What I'm not really understanding here
> > though, is how the work item could be queued at all. We have a
> > wait_on_bit_lock() in xprt_destroy() that should mean the xprt-
> > > task_cleanup work item has completed running, and that it cannot
> > > be
> > 
> > requeued.
> > 
> > Is there a possibility that the flush_queue() might be triggered
> > despite the work item not being queued?
> 
> Yeah, for sure.  The lockdep annotations don't distinguish those
> cases and assume the worst case.
> 

OK. Let's just remove that call to cancel_work_sync() then. As I said,
it should be redundant due to the wait_on_bit_lock().

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: net/sunrpc: v4.14-rc4 lockdep warning
  2017-10-11 17:49         ` Trond Myklebust
@ 2017-10-16 13:34           ` Jan Glauber
  0 siblings, 0 replies; 7+ messages in thread
From: Jan Glauber @ 2017-10-16 13:34 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: tj@kernel.org, bfields@fieldses.org, linux-kernel@vger.kernel.org,
	lorenzo.pieralisi@arm.com, jlayton@poochiereds.net,
	linux-nfs@vger.kernel.org, jiangshanlai@gmail.com,
	anna.schumaker@netapp.com

Hi Trond,

is there a patch available for this issue? I'm running into with
4.14-rc5 on my ARM64 board.

thanks, Jan

2017-10-11 19:49 GMT+02:00 Trond Myklebust <trondmy@primarydata.com>:
> On Tue, 2017-10-10 at 10:19 -0700, tj@kernel.org wrote:
>> Hello,
>>
>> On Tue, Oct 10, 2017 at 04:48:57PM +0000, Trond Myklebust wrote:
>> > Thanks for the explanation. What I'm not really understanding here
>> > though, is how the work item could be queued at all. We have a
>> > wait_on_bit_lock() in xprt_destroy() that should mean the xprt-
>> > > task_cleanup work item has completed running, and that it cannot
>> > > be
>> >
>> > requeued.
>> >
>> > Is there a possibility that the flush_queue() might be triggered
>> > despite the work item not being queued?
>>
>> Yeah, for sure.  The lockdep annotations don't distinguish those
>> cases and assume the worst case.
>>
>
> OK. Let's just remove that call to cancel_work_sync() then. As I said,
> it should be redundant due to the wait_on_bit_lock().
>
> --
> Trond Myklebust
> Linux NFS client maintainer, PrimaryData
> trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-10-16 13:34 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-10-09 18:17 net/sunrpc: v4.14-rc4 lockdep warning Lorenzo Pieralisi
2017-10-09 18:32 ` Trond Myklebust
2017-10-10 14:03   ` tj
2017-10-10 16:48     ` Trond Myklebust
2017-10-10 17:19       ` tj
2017-10-11 17:49         ` Trond Myklebust
2017-10-16 13:34           ` Jan Glauber

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox