From: "J. Bruce Fields" <bfields@fieldses.org>
To: "Adamson, Dros" <Weston.Adamson@netapp.com>
Cc: "Myklebust, Trond" <Trond.Myklebust@netapp.com>,
Dave Jones <davej@redhat.com>,
Linux Kernel <linux-kernel@vger.kernel.org>,
"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
Tejun Heo <tj@kernel.org>
Subject: Re: nfsd oops on Linus' current tree.
Date: Thu, 3 Jan 2013 15:11:20 -0500 [thread overview]
Message-ID: <20130103201120.GA2096@fieldses.org> (raw)
In-Reply-To: <0EC8763B847DB24D9ADF5EBD9CD7B4191259E4A2@SACEXCMBX02-PRD.hq.netapp.com>
On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote:
> Hey, sorry for the late response, I've been on vacation.
>
> On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <bfields@fieldses.org>
> wrote:
>
> > On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote:
> >> Please reread what I said. There was no obvious circular
> >> dependency, because nfsiod and rpciod are separate workqueues, both
> >> created with WQ_MEM_RECLAIM.
> >
> > Oh, sorry, I read "rpciod" as "nfsiod"!
> >
> >> Dros' experience shows, however that a call to rpc_shutdown_client
> >> in an nfsiod work item will deadlock with rpciod if the RPC task's
> >> work item has been assigned to the same CPU as the one running the
> >> rpc_shutdown_client work item.
> >>
> >> I can't tell right now if that is intentional (in which case the
> >> WARN_ON in the rpc code is correct), or if it is a bug in the
> >> workqueue code. For now, we're assuming the former.
> >
> > Well, Documentation/workqueue.txt says:
> >
> > "Each wq with WQ_MEM_RECLAIM set has an execution context
> > reserved for it. If there is dependency among multiple work
> > items used during memory reclaim, they should be queued to
> > separate wq each with WQ_MEM_RECLAIM."
>
> The deadlock we were seeing was:
>
> - task A gets queued on rpciod workqueue and assigned kworker-0:0 -
> task B gets queued on rpciod workqueue and assigned the same kworker
> (kworker-0:0) - task A gets run, calls rpc_shutdown_client(), which
> will loop forever waiting for task B to run rpc_async_release() - task
> B will never run rpc_async_release() - it can't run until kworker-0:0
> is free, which won't happen until task A (rpc_shutdown_client) is done
>
> The same deadlock happened when we tried queuing the tasks on a
> different workqueues -- queue_work() assigns the task to a kworker
> thread and it's luck of the draw if it's the same kworker as task A.
> We tried the different workqueue options, but nothing changed this
> behavior.
>
> Once a work struct is queued, there is no way to back out of the
> deadlock. From kernel/workqueue.c:insert_wq_barrier comment:
>
> * Currently, a queued barrier can't be canceled. This is because *
> try_to_grab_pending() can't determine whether the work to be *
> grabbed is at the head of the queue and thus can't clear LINKED *
> flag of the previous work while there must be a valid next work *
> after a work with LINKED flag set.
>
> So once a work struct is queued and there is an ordering dependency
> (i.e. task A is before task B), there is no way to back task B out -
> so we can't just call cancel_work() or something on task B in
> rpc_shutdown_client.
>
> The root of our issue is that rpc_shutdown_client is never safe to
> call from a workqueue context - it loops until there are no more
> tasks, marking tasks as killed and waiting for them to be cleaned up
> in each task's own workqueue context. Any tasks that have already been
> assigned to the same kworker thread will never have a chance to run
> this cleanup stage.
>
> When fixing this deadlock, Trond and I discussed changing how
> rpc_shutdown_client works (making it workqueue safe), but Trond felt
> that it'd be better to just not call it from a workqueue context and
> print a warning if it is.
>
> IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no
> success), but I'd argue that even if that did work it would still be
> very easy to call rpc_shutdown_client from the wrong context and MUCH
> harder to detect it. It's also unclear to me if setting rpciod
> workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...
Both rpciod and nfsiod already set WQ_MEM_RECLAIM.
But, right, looking at kernel/workqueue.c, it seems that the dedicated
"rescuer" threads are invoked only in the case when work is stalled
because a new worker thread isn't allocated quickly enough.
So, what to do that's simplest enough that it would work for
post-rc2/stable? I was happy having just a simple dedicated
thread--these are only started when nfsd is, so there's no real thread
proliferation problem.
I'll go quietly weep for a little while and then think about it some
more....
--b.
next prev parent reply other threads:[~2013-01-03 20:11 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-12-21 15:33 nfsd oops on Linus' current tree Dave Jones
2012-12-21 18:08 ` J. Bruce Fields
2012-12-21 18:40 ` Myklebust, Trond
2012-12-21 23:08 ` J. Bruce Fields
2012-12-21 23:15 ` Myklebust, Trond
2012-12-21 23:26 ` J. Bruce Fields
2012-12-21 23:36 ` Myklebust, Trond
2012-12-21 23:45 ` J. Bruce Fields
2013-01-03 16:28 ` Adamson, Dros
2013-01-03 20:11 ` J. Bruce Fields [this message]
2013-01-03 20:27 ` Adamson, Dros
2013-01-03 20:52 ` J. Bruce Fields
2013-01-03 22:15 ` Tejun Heo
2013-01-03 22:08 ` Tejun Heo
2013-01-03 22:12 ` Myklebust, Trond
2013-01-03 22:26 ` Tejun Heo
2013-01-03 22:34 ` Tejun Heo
2013-01-03 23:11 ` Myklebust, Trond
[not found] ` <1357254692.55285.33.camel@lade.trondhjem.org>
2013-01-03 23:26 ` Myklebust, Trond
2013-01-04 17:11 ` Adamson, Dros
2013-01-04 18:15 ` [PATCH 1/3] SUNRPC: Ensure that we free the rpc_task after cleanups are done Trond Myklebust
2013-01-04 18:15 ` [PATCH 2/3] NFS: Ensure that we free the rpc_task after read and write " Trond Myklebust
2013-01-04 18:15 ` [PATCH 3/3] SUNRPC: Partial revert of commit 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 Trond Myklebust
2013-01-04 18:29 ` [PATCH 2/3] NFS: Ensure that we free the rpc_task after read and write cleanups are done Bruce Fields
2013-01-04 18:52 ` Myklebust, Trond
2013-01-04 19:12 ` Bruce Fields
2013-01-07 17:47 ` [PATCH 1/3] SUNRPC: Ensure that we free the rpc_task after " Tejun Heo
2013-01-07 17:54 ` Myklebust, Trond
2013-01-03 22:03 ` nfsd oops on Linus' current tree Tejun Heo
2013-01-03 23:08 ` J. Bruce Fields
2012-12-22 0:48 ` [PATCH] Revert "nfsd: warn on odd reply state in nfsd_vfs_read" J. Bruce Fields
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130103201120.GA2096@fieldses.org \
--to=bfields@fieldses.org \
--cc=Trond.Myklebust@netapp.com \
--cc=Weston.Adamson@netapp.com \
--cc=davej@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.