From: Dr Fields James Bruce <bfields@fieldses.org>
To: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Andrew Martin <amartin@xes-inc.com>, Jim Rees <rees@umich.edu>,
bhawley@luminex.com, Brown Neil <neilb@suse.de>,
linux-nfs-owner@vger.kernel.org, linux-nfs@vger.kernel.org
Subject: Re: Optimal NFS mount options to safely allow interrupts and timeouts on newer kernels
Date: Fri, 28 Mar 2014 18:00:38 -0400
Message-ID: <20140328220038.GK6041@fieldses.org>
In-Reply-To: <A95B7939-BDEB-4FFB-BCC0-6EAD9487E7D8@primarydata.com>
On Tue, Mar 18, 2014 at 06:27:57PM -0400, Trond Myklebust wrote:
>
> On Mar 18, 2014, at 17:50, Andrew Martin <amartin@xes-inc.com> wrote:
>
> > ----- Original Message -----
> >> From: "Trond Myklebust" <trond.myklebust@primarydata.com>
> >> To: "Andrew Martin" <amartin@xes-inc.com>
> >> Cc: "Jim Rees" <rees@umich.edu>, bhawley@luminex.com, "Brown Neil" <neilb@suse.de>, linux-nfs-owner@vger.kernel.org,
> >> linux-nfs@vger.kernel.org
> >> Sent: Thursday, March 6, 2014 3:01:03 PM
> >> Subject: Re: Optimal NFS mount options to safely allow interrupts and timeouts on newer kernels
> >>
> >>
> >
> > Trond,
> >
> > This problem has reoccurred, and I have captured the debug output that you requested:
> >
> > echo 0 >/proc/sys/sunrpc/rpc_debug:
> > http://pastebin.com/9juDs2TW
> >
> > echo w > /proc/sysrq-trigger ; dmesg:
> > http://pastebin.com/1vDx9bNf
> >
> > netstat -tn:
> > http://pastebin.com/mjxqjmuL
> >
> > One suggestion for debug was to attempt to run "umount -f /path/to/mountpoint"
> > repeatedly to attempt to send SIGKILL back up to the application. This always
> > returned "Device or resource busy" and I was unable to unmount the filesystem
> > until I used "umount -l".
> >
> > I was able to kill -9 all but two of the processes that were blocking in
> > uninterruptible sleep. Note that I was able to get lsof output on these
> > processes this time, and they all appeared to be blocking on access to a
> > single file on the nfs share. If I tried to cat said file from this client,
> > my terminal would block:
> > open("/path/to/file", O_RDONLY) = 3
> > fstat(3, {st_mode=S_IFREG|0644, st_size=42385, ...}) = 0
> > mmap(NULL, 1056768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb00f0dc000
> > read(3,
> >
> > However, I could cat the file just fine from another nfs client. Does this
> > additional information shed any light on the source of this problem?
> >
>
> Ah… So this machine is acting both as an NFSv3 client and an NFSv4 server?
>
> • [1140235.544551] SysRq : Show Blocked State
> • [1140235.547126] task PC stack pid father
> • [1140235.547145] rpciod/0 D 0000000000000001 0 833 2 0x00000000
> • [1140235.547150] ffff8802812a3c20 0000000000000046 0000000000015e00 0000000000015e00
> • [1140235.547155] ffff880297251ad0 ffff8802812a3fd8 0000000000015e00 ffff880297251700
> • [1140235.547159] 0000000000015e00 ffff8802812a3fd8 0000000000015e00 ffff880297251ad0
> • [1140235.547164] Call Trace:
> • [1140235.547175] [<ffffffff8156a1a5>] schedule_timeout+0x195/0x300
> • [1140235.547182] [<ffffffff81078130>] ? process_timeout+0x0/0x10
> • [1140235.547197] [<ffffffffa009ef52>] rpc_shutdown_client+0xc2/0x100 [sunrpc]
> • [1140235.547203] [<ffffffff81086750>] ? autoremove_wake_function+0x0/0x40
> • [1140235.547216] [<ffffffffa01aa62c>] put_nfs4_client+0x4c/0xb0 [nfsd]
> • [1140235.547227] [<ffffffffa01ae669>] nfsd4_cb_probe_done+0x29/0x60 [nfsd]
> • [1140235.547238] [<ffffffffa00a5d0c>] rpc_exit_task+0x2c/0x60 [sunrpc]
> • [1140235.547250] [<ffffffffa00a64e6>] __rpc_execute+0x66/0x2a0 [sunrpc]
> • [1140235.547261] [<ffffffffa00a6750>] ? rpc_async_schedule+0x0/0x20 [sunrpc]
> • [1140235.547272] [<ffffffffa00a6765>] rpc_async_schedule+0x15/0x20 [sunrpc]
> • [1140235.547276] [<ffffffff81081ba7>] run_workqueue+0xc7/0x1a0
> • [1140235.547279] [<ffffffff81081d23>] worker_thread+0xa3/0x110
> • [1140235.547284] [<ffffffff81086750>] ? autoremove_wake_function+0x0/0x40
> • [1140235.547287] [<ffffffff81081c80>] ? worker_thread+0x0/0x110
> • [1140235.547291] [<ffffffff810863d6>] kthread+0x96/0xa0
> • [1140235.547295] [<ffffffff810141aa>] child_rip+0xa/0x20
> • [1140235.547299] [<ffffffff81086340>] ? kthread+0x0/0xa0
> • [1140235.547302] [<ffffffff810141a0>] ? child_rip+0x0/0x20
>
> The above looks bad. The rpciod thread is sleeping, waiting for the rpc client to terminate, and the only task running on that rpc client, according to your rpc_debug output, is the above CB_NULL probe. Deadlock...
>
> Bruce, it looks like the above should have been fixed in Linux 2.6.35 with commit 9045b4b9f7f3 (nfsd4: remove probe task's reference on client), is that correct?
Yes, that definitely looks like it would explain the bug. And the sysrq
trace shows 2.6.32-57.
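For anyone wanting to spot the same signature in their own "echo w > /proc/sysrq-trigger" output, here is a rough sketch in Python. It assumes the dmesg line layout seen in the trace quoted above (a bracketed timestamp, a task header with state D, then "[<addr>] symbol+0x..." frames). The function names blocked_tasks and looks_like_probe_deadlock are made up for illustration, and matching on two frame names is only a heuristic, not a proof of deadlock.

```python
import re

# Strip quote markers, list bullets and the "[seconds.micros]" dmesg timestamp.
TS = re.compile(r"^[>\s•]*\[\s*\d+\.\d+\]\s*")
# Task header as printed by SysRq-w, e.g.
#   rpciod/0 D 0000000000000001 0 833 2 0x00000000
TASK = re.compile(r"^(\S+)\s+D\s+[0-9a-f]+\s+\d+\s+(\d+)\s+\d+\s+0x[0-9a-f]+$")
# Stack frame, e.g. "[<ffffffffa009ef52>] rpc_shutdown_client+0xc2/0x100 [sunrpc]".
# A leading "?" marks an unreliable frame.
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?(\w+)\+0x")


def blocked_tasks(log):
    """Return one dict per blocked (state D) task with its reliable frames."""
    tasks = []
    for raw in log.splitlines():
        line = TS.sub("", raw).strip()
        header = TASK.match(line)
        if header:
            tasks.append({"name": header.group(1),
                          "pid": int(header.group(2)),
                          "frames": []})
            continue
        frame = FRAME.search(line)
        # Skip "?" frames: they may be stale stack contents.
        if frame and tasks and not frame.group(1):
            tasks[-1]["frames"].append(frame.group(2))
    return tasks


def looks_like_probe_deadlock(task):
    """Heuristic: the callback probe completion is waiting in rpc_shutdown_client."""
    frames = task["frames"]
    return "rpc_shutdown_client" in frames and "nfsd4_cb_probe_done" in frames
```

Feeding it the saved dmesg text and filtering with looks_like_probe_deadlock should flag the rpciod/0 task from the trace above.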
Andrew Martin, can you confirm that the problem is no longer
reproducible on a kernel with that patch applied?
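
As a quick first check of whether a given machine predates the fix, something like the following Python sketch compares the running kernel release against 2.6.35. The helper names parse_release and predates_fix are hypothetical. Note the assumption baked in: distribution kernels frequently backport patches, so a version older than 2.6.35 only suggests the commit may be missing, and a vendor changelog is the real authority.

```python
import platform
import re

# Upstream release that Trond cites as containing commit 9045b4b9f7f3.
FIXED_IN = (2, 6, 35)


def parse_release(release):
    """Turn a release string such as '2.6.32-57' into (2, 6, 32)."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError("unrecognised kernel release: %r" % release)
    # A missing third component (e.g. '4.0') is treated as 0.
    return tuple(int(g) if g else 0 for g in m.groups())


def predates_fix(release):
    """True if the upstream base version is older than FIXED_IN.

    Assumption: no backports. A vendor kernel may carry the fix
    even when this returns True.
    """
    return parse_release(release) < FIXED_IN


if __name__ == "__main__":
    rel = platform.release()
    state = "predates" if predates_fix(rel) else "is at or after"
    print("%s %s %s" % (rel, state, ".".join(map(str, FIXED_IN))))
```

For the 2.6.32-57 kernel in the sysrq trace, predates_fix returns True.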
--b.