Re: nfsd oops on Linus' current tree.

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: "J. Bruce Fields" <bfields@fieldses.org>
To: "Adamson, Dros" <Weston.Adamson@netapp.com>
Cc: "Myklebust, Trond" <Trond.Myklebust@netapp.com>,
	Dave Jones <davej@redhat.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	Tejun Heo <tj@kernel.org>
Subject: Re: nfsd oops on Linus' current tree.
Date: Thu, 3 Jan 2013 15:11:20 -0500	[thread overview]
Message-ID: <20130103201120.GA2096@fieldses.org> (raw)
In-Reply-To: <0EC8763B847DB24D9ADF5EBD9CD7B4191259E4A2@SACEXCMBX02-PRD.hq.netapp.com>

On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote:
> Hey, sorry for the late response, I've been on vacation.
> 
> On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <bfields@fieldses.org>
> wrote:
> 
> > On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote:
> >> Please reread what I said. There was no obvious circular
> >> dependency, because nfsiod and rpciod are separate workqueues, both
> >> created with WQ_MEM_RECLAIM.
> > 
> > Oh, sorry, I read "rpciod" as "nfsiod"!
> > 
> >> Dros' experience shows, however that a call to rpc_shutdown_client
> >> in an nfsiod work item will deadlock with rpciod if the RPC task's
> >> work item has been assigned to the same CPU as the one running the
> >> rpc_shutdown_client work item.
> >> 
> >> I can't tell right now if that is intentional (in which case the
> >> WARN_ON in the rpc code is correct), or if it is a bug in the
> >> workqueue code. For now, we're assuming the former.
> > 
> > Well, Documentation/workqueue.txt says:
> > 
> > 	"Each wq with WQ_MEM_RECLAIM set has an execution context
> > 	reserved for it.  If there is dependency among multiple work
> > 	items used during memory reclaim, they should be queued to
> > 	separate wq each with WQ_MEM_RECLAIM."
> 
> The deadlock we were seeing was:
> 
> - task A gets queued on rpciod workqueue and assigned kworker-0:0 -
> task B gets queued on rpciod workqueue and assigned the same kworker
> (kworker-0:0) - task A gets run, calls rpc_shutdown_client(), which
> will loop forever waiting for task B to run rpc_async_release() - task
> B will never run rpc_async_release() - it can't run until kworker-0:0
> is free, which won't happen until task A (rpc_shutdown_client) is done
> 
> The same deadlock happened when we tried queuing the tasks on a
> different workqueues -- queue_work() assigns the task to a kworker
> thread and it's luck of the draw if it's the same kworker as task A.
> We tried the different workqueue options, but nothing changed this
> behavior.
> 
> Once a work struct is queued, there is no way to back out of the
> deadlock.  From kernel/workqueue.c:insert_wq_barrier comment:
> 
>  * Currently, a queued barrier can't be canceled.  This is because *
>  try_to_grab_pending() can't determine whether the work to be *
>  grabbed is at the head of the queue and thus can't clear LINKED *
>  flag of the previous work while there must be a valid next work *
>  after a work with LINKED flag set.
> 
> So once a work struct is queued and there is an ordering dependency
> (i.e. task A is before task B), there is no way to back task B out -
> so we can't just call cancel_work() or something on task B in
> rpc_shutdown_client.
> 
> The root of our issue is that rpc_shutdown_client is never safe to
> call from a workqueue context - it loops until there are no more
> tasks, marking tasks as killed and waiting for them to be cleaned up
> in each task's own workqueue context. Any tasks that have already been
> assigned to the same kworker thread will never have a chance to run
> this cleanup stage.
> 
> When fixing this deadlock, Trond and I discussed changing how
> rpc_shutdown_client works (making it workqueue safe), but Trond felt
> that it'd be better to just not call it from a workqueue context and
> print a warning if it is.
> 
> IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no
> success), but I'd argue that even if that did work it would still be
> very easy to call rpc_shutdown_client from the wrong context and MUCH
> harder to detect it.  It's also unclear to me if setting rpciod
> workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...

Both rpciod and nfsiod already set WQ_MEM_RECLAIM.

But, right, looking at kernel/workqueue.c, it seems that the dedicated
"rescuer" threads are invoked only in the case when work is stalled
because a new worker thread isn't allocated quickly enough.

So, what to do that's simplest enough that it would work for
post-rc2/stable?  I was happy having just a simple dedicated
thread--these are only started when nfsd is, so there's no real thread
proliferation problem.

I'll go quietly weep for a little while and then think about it some
more....

--b.

next prev parent reply	other threads:[~2013-01-03 20:11 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-12-21 15:33 nfsd oops on Linus' current tree Dave Jones
2012-12-21 18:08 ` J. Bruce Fields
2012-12-21 18:40   ` Myklebust, Trond
2012-12-21 23:08     ` J. Bruce Fields
2012-12-21 23:15       ` Myklebust, Trond
2012-12-21 23:26         ` J. Bruce Fields
2012-12-21 23:36           ` Myklebust, Trond
2012-12-21 23:45             ` J. Bruce Fields
2013-01-03 16:28               ` Adamson, Dros
2013-01-03 20:11                 ` J. Bruce Fields [this message]
2013-01-03 20:27                   ` Adamson, Dros
2013-01-03 20:52                     ` J. Bruce Fields
2013-01-03 22:15                       ` Tejun Heo
2013-01-03 22:08                   ` Tejun Heo
2013-01-03 22:12                     ` Myklebust, Trond
2013-01-03 22:26                       ` Tejun Heo
2013-01-03 22:34                         ` Tejun Heo
2013-01-03 23:11                         ` Myklebust, Trond
     [not found]                         ` <1357254692.55285.33.camel@lade.trondhjem.org>
2013-01-03 23:26                           ` Myklebust, Trond
2013-01-04 17:11                             ` Adamson, Dros
2013-01-03 22:03                 ` Tejun Heo
2013-01-03 23:08                   ` J. Bruce Fields
2012-12-22  0:48   ` [PATCH] Revert "nfsd: warn on odd reply state in nfsd_vfs_read" J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130103201120.GA2096@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=Trond.Myklebust@netapp.com \
    --cc=Weston.Adamson@netapp.com \
    --cc=davej@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox