* nfsd oops on Linus' current tree. @ 2012-12-21 15:33 Dave Jones 2012-12-21 18:08 ` J. Bruce Fields 0 siblings, 1 reply; 23+ messages in thread From: Dave Jones @ 2012-12-21 15:33 UTC (permalink / raw) To: Linux Kernel; +Cc: bfields, linux-nfs Did a mount from a client (also running Linus current), and the server spat this out.. [ 6936.306135] ------------[ cut here ]------------ [ 6936.306154] WARNING: at net/sunrpc/clnt.c:617 rpc_shutdown_client+0x12a/0x1b0 [sunrpc]() [ 6936.306156] Hardware name: [ 6936.306157] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy [ 6936.306214] Pid: 52, comm: kworker/u:2 Not tainted 3.7.0+ #34 [ 6936.306216] Call Trace: [ 6936.306224] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0 [ 6936.306227] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20 [ 6936.306235] [<ffffffffa02c62ca>] rpc_shutdown_client+0x12a/0x1b0 [sunrpc] [ 6936.306240] [<ffffffff81368318>] ? delay_tsc+0x98/0xf0 [ 6936.306252] [<ffffffffa034a60b>] nfsd4_process_cb_update.isra.16+0x4b/0x230 [nfsd] [ 6936.306256] [<ffffffff8109677c>] ? __rcu_read_unlock+0x5c/0xa0 [ 6936.306260] [<ffffffff81370d46>] ? debug_object_deactivate+0x46/0x130 [ 6936.306269] [<ffffffffa034a87d>] nfsd4_do_callback_rpc+0x8d/0xa0 [nfsd] [ 6936.306272] [<ffffffff810900f7>] process_one_work+0x207/0x760 [ 6936.306274] [<ffffffff81090087>] ? process_one_work+0x197/0x760 [ 6936.306277] [<ffffffff81090afe>] ? 
worker_thread+0x21e/0x440 [ 6936.306285] [<ffffffffa034a7f0>] ? nfsd4_process_cb_update.isra.16+0x230/0x230 [nfsd] [ 6936.306289] [<ffffffff81090a3e>] worker_thread+0x15e/0x440 [ 6936.306292] [<ffffffff810908e0>] ? rescuer_thread+0x250/0x250 [ 6936.306295] [<ffffffff8109b16d>] kthread+0xed/0x100 [ 6936.306299] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40 [ 6936.306302] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160 [ 6936.306307] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0 [ 6936.306310] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160 [ 6936.306312] ---[ end trace 5bab69e086ae3c6f ]--- [ 6936.363213] ------------[ cut here ]------------ [ 6936.363226] WARNING: at fs/nfsd/vfs.c:937 nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd]() [ 6936.363229] Hardware name: [ 6936.363230] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy [ 6936.363284] Pid: 699, comm: nfsd Tainted: G W 3.7.0+ #34 [ 6936.363286] Call Trace: [ 6936.363293] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0 [ 6936.363296] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20 [ 6936.363302] [<ffffffffa031ef77>] nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd] [ 6936.363310] [<ffffffffa0321948>] nfsd_read_file+0x88/0xb0 [nfsd] [ 6936.363317] [<ffffffffa0332956>] nfsd4_encode_read+0x186/0x260 [nfsd] [ 6936.363325] [<ffffffffa03391cc>] nfsd4_encode_operation+0x5c/0xa0 [nfsd] [ 6936.363333] [<ffffffffa032e5a9>] 
nfsd4_proc_compound+0x289/0x780 [nfsd] [ 6936.363339] [<ffffffffa0319e5b>] nfsd_dispatch+0xeb/0x230 [nfsd] [ 6936.363355] [<ffffffffa02d3d38>] svc_process_common+0x328/0x6d0 [sunrpc] [ 6936.363365] [<ffffffffa02d4433>] svc_process+0x103/0x160 [sunrpc] [ 6936.363371] [<ffffffffa031921b>] nfsd+0xdb/0x160 [nfsd] [ 6936.363378] [<ffffffffa0319140>] ? nfsd_destroy+0x210/0x210 [nfsd] [ 6936.363381] [<ffffffff8109b16d>] kthread+0xed/0x100 [ 6936.363385] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40 [ 6936.363388] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160 [ 6936.363393] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0 [ 6936.363396] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160 [ 6936.363398] ---[ end trace 5bab69e086ae3c70 ]--- ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2012-12-21 15:33 nfsd oops on Linus' current tree Dave Jones @ 2012-12-21 18:08 ` J. Bruce Fields 2012-12-21 18:40 ` Myklebust, Trond 2012-12-22 0:48 ` [PATCH] Revert "nfsd: warn on odd reply state in nfsd_vfs_read" J. Bruce Fields 0 siblings, 2 replies; 23+ messages in thread From: J. Bruce Fields @ 2012-12-21 18:08 UTC (permalink / raw) To: Dave Jones, Linux Kernel, linux-nfs, Weston Andros Adamson, Trond Myklebust On Fri, Dec 21, 2012 at 10:33:48AM -0500, Dave Jones wrote: > Did a mount from a client (also running Linus current), and the > server spat this out.. > > [ 6936.306135] ------------[ cut here ]------------ > [ 6936.306154] WARNING: at net/sunrpc/clnt.c:617 rpc_shutdown_client+0x12a/0x1b0 [sunrpc]() This is a warning added by 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 "SUNRPC: add WARN_ON_ONCE for potential deadlock", pointing out that nfsd is calling shutdown_client from a workqueue, which is a problem because shutdown_client has to wait on rpc tasks that run on a workqueue. I don't believe there's any circular dependency among the workqueues (we're calling shutdown_client from callback_wq, not rpciod_workqueue), but 168e4b39d1afb.. says that we could get a deadlock if both are running on the same kworker thread. I'm not sure what to do about that. 
> [ 6936.306156] Hardware name: > [ 6936.306157] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy > [ 6936.306214] Pid: 52, comm: kworker/u:2 Not tainted 3.7.0+ #34 > [ 6936.306216] Call Trace: > [ 6936.306224] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0 > [ 6936.306227] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20 > [ 6936.306235] [<ffffffffa02c62ca>] rpc_shutdown_client+0x12a/0x1b0 [sunrpc] > [ 6936.306240] [<ffffffff81368318>] ? delay_tsc+0x98/0xf0 > [ 6936.306252] [<ffffffffa034a60b>] nfsd4_process_cb_update.isra.16+0x4b/0x230 [nfsd] > [ 6936.306256] [<ffffffff8109677c>] ? __rcu_read_unlock+0x5c/0xa0 > [ 6936.306260] [<ffffffff81370d46>] ? debug_object_deactivate+0x46/0x130 > [ 6936.306269] [<ffffffffa034a87d>] nfsd4_do_callback_rpc+0x8d/0xa0 [nfsd] > [ 6936.306272] [<ffffffff810900f7>] process_one_work+0x207/0x760 > [ 6936.306274] [<ffffffff81090087>] ? process_one_work+0x197/0x760 > [ 6936.306277] [<ffffffff81090afe>] ? worker_thread+0x21e/0x440 > [ 6936.306285] [<ffffffffa034a7f0>] ? nfsd4_process_cb_update.isra.16+0x230/0x230 [nfsd] > [ 6936.306289] [<ffffffff81090a3e>] worker_thread+0x15e/0x440 > [ 6936.306292] [<ffffffff810908e0>] ? rescuer_thread+0x250/0x250 > [ 6936.306295] [<ffffffff8109b16d>] kthread+0xed/0x100 > [ 6936.306299] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40 > [ 6936.306302] [<ffffffff8109b080>] ? 
kthread_create_on_node+0x160/0x160 > [ 6936.306307] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0 > [ 6936.306310] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160 > [ 6936.306312] ---[ end trace 5bab69e086ae3c6f ]--- > [ 6936.363213] ------------[ cut here ]------------ > [ 6936.363226] WARNING: at fs/nfsd/vfs.c:937 nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd]() This warning is unrelated, and is probably just carelessness on my part: I couldn't see why this condition would happen, and I stuck the warning in there without looking much harder. Probably we should just revert 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e "nfsd: warn on odd reply state in nfsd_vfs_read" while I go stare at the code. --b. > [ 6936.363229] Hardware name: > [ 6936.363230] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy > [ 6936.363284] Pid: 699, comm: nfsd Tainted: G W 3.7.0+ #34 > [ 6936.363286] Call Trace: > [ 6936.363293] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0 > [ 6936.363296] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20 > [ 6936.363302] [<ffffffffa031ef77>] nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd] > [ 6936.363310] [<ffffffffa0321948>] nfsd_read_file+0x88/0xb0 [nfsd] > [ 6936.363317] [<ffffffffa0332956>] nfsd4_encode_read+0x186/0x260 [nfsd] > [ 6936.363325] [<ffffffffa03391cc>] nfsd4_encode_operation+0x5c/0xa0 [nfsd] > [ 6936.363333] [<ffffffffa032e5a9>] nfsd4_proc_compound+0x289/0x780 [nfsd] > 
[ 6936.363339] [<ffffffffa0319e5b>] nfsd_dispatch+0xeb/0x230 [nfsd] > [ 6936.363355] [<ffffffffa02d3d38>] svc_process_common+0x328/0x6d0 [sunrpc] > [ 6936.363365] [<ffffffffa02d4433>] svc_process+0x103/0x160 [sunrpc] > [ 6936.363371] [<ffffffffa031921b>] nfsd+0xdb/0x160 [nfsd] > [ 6936.363378] [<ffffffffa0319140>] ? nfsd_destroy+0x210/0x210 [nfsd] > [ 6936.363381] [<ffffffff8109b16d>] kthread+0xed/0x100 > [ 6936.363385] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40 > [ 6936.363388] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160 > [ 6936.363393] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0 > [ 6936.363396] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160 > [ 6936.363398] ---[ end trace 5bab69e086ae3c70 ]--- > ^ permalink raw reply [flat|nested] 23+ messages in thread
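The hazard Bruce describes above — a work item synchronously waiting for other work that needs the same worker thread — can be sketched with a userspace analogy. This is not kernel code: a single-worker `ThreadPoolExecutor` stands in for one kworker servicing a workqueue, and the names are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# One worker thread models a single kworker servicing a work queue.
pool = ThreadPoolExecutor(max_workers=1)

def shutdown_like_task():
    # "Task A": runs on the only worker and then waits for "task B",
    # the way rpc_shutdown_client() waits for queued rpc_task cleanup.
    task_b = pool.submit(lambda: "cleaned up")
    try:
        # B can never run: A is occupying the sole worker while it waits.
        return task_b.result(timeout=1)
    except FutureTimeout:
        return "deadlock"

outcome = pool.submit(shutdown_like_task).result()
print(outcome)  # -> "deadlock"
pool.shutdown()
```

Without the timeout (the real kernel path has none), task A would wait forever, which is exactly the situation the WARN_ON_ONCE is flagging.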
* Re: nfsd oops on Linus' current tree. 2012-12-21 18:08 ` J. Bruce Fields @ 2012-12-21 18:40 ` Myklebust, Trond 2012-12-21 23:08 ` J. Bruce Fields 2012-12-22 0:48 ` [PATCH] Revert "nfsd: warn on odd reply state in nfsd_vfs_read" J. Bruce Fields 1 sibling, 1 reply; 23+ messages in thread From: Myklebust, Trond @ 2012-12-21 18:40 UTC (permalink / raw) To: J. Bruce Fields Cc: Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org, Adamson, Dros On Fri, 2012-12-21 at 13:08 -0500, J. Bruce Fields wrote: > On Fri, Dec 21, 2012 at 10:33:48AM -0500, Dave Jones wrote: > > Did a mount from a client (also running Linus current), and the > > server spat this out.. > > > > [ 6936.306135] ------------[ cut here ]------------ > > [ 6936.306154] WARNING: at net/sunrpc/clnt.c:617 rpc_shutdown_client+0x12a/0x1b0 [sunrpc]() > > This is a warning added by 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 > "SUNRPC: add WARN_ON_ONCE for potential deadlock", pointing out that > nfsd is calling shutdown_client from a workqueue, which is a problem > because shutdown_client has to wait on rpc tasks that run on a > workqueue. > > I don't believe there's any circular dependency among the workqueues > (we're calling shutdown_client from callback_wq, not rpciod_workqueue), We were getting deadlocks with rpciod when calling rpc_shutdown_client from the nfsiod workqueue. The problem here is that the workqueues all run using the same pool of threads, and so you can get "interesting" deadlocks when one of these threads has to wait for another one. > but 168e4b39d1afb.. says that we could get a deadlock if both are > running on the same kworker thread. > > I'm not sure what to do about that. > The question is if you really do need the call to rpc_killall_tasks and the synchronous wait for completion of old tasks? If you don't care, then we could just have you call rpc_release_client() in order to release your reference on the rpc_client. 
> > [ 6936.306156] Hardware name: > > [ 6936.306157] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy > > [ 6936.306214] Pid: 52, comm: kworker/u:2 Not tainted 3.7.0+ #34 > > [ 6936.306216] Call Trace: > > [ 6936.306224] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0 > > [ 6936.306227] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20 > > [ 6936.306235] [<ffffffffa02c62ca>] rpc_shutdown_client+0x12a/0x1b0 [sunrpc] > > [ 6936.306240] [<ffffffff81368318>] ? delay_tsc+0x98/0xf0 > > [ 6936.306252] [<ffffffffa034a60b>] nfsd4_process_cb_update.isra.16+0x4b/0x230 [nfsd] > > [ 6936.306256] [<ffffffff8109677c>] ? __rcu_read_unlock+0x5c/0xa0 > > [ 6936.306260] [<ffffffff81370d46>] ? debug_object_deactivate+0x46/0x130 > > [ 6936.306269] [<ffffffffa034a87d>] nfsd4_do_callback_rpc+0x8d/0xa0 [nfsd] > > [ 6936.306272] [<ffffffff810900f7>] process_one_work+0x207/0x760 > > [ 6936.306274] [<ffffffff81090087>] ? process_one_work+0x197/0x760 > > [ 6936.306277] [<ffffffff81090afe>] ? worker_thread+0x21e/0x440 > > [ 6936.306285] [<ffffffffa034a7f0>] ? nfsd4_process_cb_update.isra.16+0x230/0x230 [nfsd] > > [ 6936.306289] [<ffffffff81090a3e>] worker_thread+0x15e/0x440 > > [ 6936.306292] [<ffffffff810908e0>] ? rescuer_thread+0x250/0x250 > > [ 6936.306295] [<ffffffff8109b16d>] kthread+0xed/0x100 > > [ 6936.306299] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40 > > [ 6936.306302] [<ffffffff8109b080>] ? 
kthread_create_on_node+0x160/0x160 > > [ 6936.306307] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0 > > [ 6936.306310] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160 > > [ 6936.306312] ---[ end trace 5bab69e086ae3c6f ]--- > > [ 6936.363213] ------------[ cut here ]------------ > > [ 6936.363226] WARNING: at fs/nfsd/vfs.c:937 nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd]() > > This warning is unrelated, and is probably just carelessness on my part: > I couldn't see why this condition would happen, and I stuck the warning > in there without looking much harder. Probably we should just revert > 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e "nfsd: warn on odd reply state > in nfsd_vfs_read" while I go stare at the code. > > --b. > > > [ 6936.363229] Hardware name: > > [ 6936.363230] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy > > [ 6936.363284] Pid: 699, comm: nfsd Tainted: G W 3.7.0+ #34 > > [ 6936.363286] Call Trace: > > [ 6936.363293] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0 > > [ 6936.363296] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20 > > [ 6936.363302] [<ffffffffa031ef77>] nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd] > > [ 6936.363310] [<ffffffffa0321948>] nfsd_read_file+0x88/0xb0 [nfsd] > > [ 6936.363317] [<ffffffffa0332956>] nfsd4_encode_read+0x186/0x260 [nfsd] > > [ 6936.363325] [<ffffffffa03391cc>] nfsd4_encode_operation+0x5c/0xa0 [nfsd] > > [ 6936.363333] 
[<ffffffffa032e5a9>] nfsd4_proc_compound+0x289/0x780 [nfsd] > > [ 6936.363339] [<ffffffffa0319e5b>] nfsd_dispatch+0xeb/0x230 [nfsd] > > [ 6936.363355] [<ffffffffa02d3d38>] svc_process_common+0x328/0x6d0 [sunrpc] > > [ 6936.363365] [<ffffffffa02d4433>] svc_process+0x103/0x160 [sunrpc] > > [ 6936.363371] [<ffffffffa031921b>] nfsd+0xdb/0x160 [nfsd] > > [ 6936.363378] [<ffffffffa0319140>] ? nfsd_destroy+0x210/0x210 [nfsd] > > [ 6936.363381] [<ffffffff8109b16d>] kthread+0xed/0x100 > > [ 6936.363385] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40 > > [ 6936.363388] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160 > > [ 6936.363393] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0 > > [ 6936.363396] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160 > > [ 6936.363398] ---[ end trace 5bab69e086ae3c70 ]--- > > -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2012-12-21 18:40 ` Myklebust, Trond @ 2012-12-21 23:08 ` J. Bruce Fields 2012-12-21 23:15 ` Myklebust, Trond 0 siblings, 1 reply; 23+ messages in thread From: J. Bruce Fields @ 2012-12-21 23:08 UTC (permalink / raw) To: Myklebust, Trond Cc: Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org, Adamson, Dros On Fri, Dec 21, 2012 at 06:40:54PM +0000, Myklebust, Trond wrote: > On Fri, 2012-12-21 at 13:08 -0500, J. Bruce Fields wrote: > > On Fri, Dec 21, 2012 at 10:33:48AM -0500, Dave Jones wrote: > > > Did a mount from a client (also running Linus current), and the > > > server spat this out.. > > > > > > [ 6936.306135] ------------[ cut here ]------------ > > > [ 6936.306154] WARNING: at net/sunrpc/clnt.c:617 rpc_shutdown_client+0x12a/0x1b0 [sunrpc]() > > > > This is a warning added by 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 > > "SUNRPC: add WARN_ON_ONCE for potential deadlock", pointing out that > > nfsd is calling shutdown_client from a workqueue, which is a problem > > because shutdown_client has to wait on rpc tasks that run on a > > workqueue. > > > > I don't believe there's any circular dependency among the workqueues > > (we're calling shutdown_client from callback_wq, not rpciod_workqueue), > > We were getting deadlocks with rpciod when calling rpc_shutdown_client > from the nfsiod workqueue. > > The problem here is that the workqueues all run using the same pool of > threads, and so you can get "interesting" deadlocks when one of these > threads has to wait for another one. OK, coming back after reviewing Documentation/workqueue.txt: the workqueues potentially involved here are rpciod and the server's laundromat and callback workqueues. The former is created with alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 1); and the latter two are both created with create_singlethread_workqueue(), which also sets WQ_MEM_RECLAIM. 
As I understand it, that ensures that each of the three has at least one "rescuer" thread dedicated to it--so there shouldn't be any deadlock as long as there's no circular dependency among the three. So I think this is a false positive--am I missing something? > > but 168e4b39d1afb.. says that we could get a deadlock if both are > > running on the same kworker thread. > > > > I'm not sure what to do about that. > > > > The question is if you really do need the call to rpc_killall_tasks and > the synchronous wait for completion of old tasks? If you don't care, > then we could just have you call rpc_release_client() in order to > release your reference on the rpc_client. No, the waiting's intentional: for example, the laundromat wq is what's responsible for reaping expired clients, and it wants to wait for callback rpc's to exit. (If I remember right they may reference data structures we're about to clean up). --b. ^ permalink raw reply [flat|nested] 23+ messages in thread
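Bruce's point — that separate queues with dedicated execution contexts break the wait dependency — can be shown in the same userspace model. Here two single-worker pools (hypothetical stand-ins for the callback workqueue and rpciod, not kernel APIs) let the waiting work complete, since the work being waited on has its own worker.

```python
from concurrent.futures import ThreadPoolExecutor

callback_wq = ThreadPoolExecutor(max_workers=1)  # stands in for the callback workqueue
rpciod = ThreadPoolExecutor(max_workers=1)       # stands in for rpciod

def waiting_task():
    # Waits for work queued on the *other* pool; that pool has its own
    # dedicated worker, so the wait completes instead of deadlocking.
    return rpciod.submit(lambda: "task done").result(timeout=5)

result = callback_wq.submit(waiting_task).result()
print(result)  # -> "task done"
callback_wq.shutdown()
rpciod.shutdown()
```

This mirrors the workqueue.txt guidance: dependent work items should go to separate WQ_MEM_RECLAIM queues, each of which is guaranteed its own rescuer context.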
* RE: nfsd oops on Linus' current tree. 2012-12-21 23:08 ` J. Bruce Fields @ 2012-12-21 23:15 ` Myklebust, Trond 2012-12-21 23:26 ` J. Bruce Fields 0 siblings, 1 reply; 23+ messages in thread From: Myklebust, Trond @ 2012-12-21 23:15 UTC (permalink / raw) To: J. Bruce Fields Cc: Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org, Adamson, Dros Apologies for top-posting. The SSD on my laptop died, and so I'm stuck using webmail for this account... Our experience with nfsiod is that the WQ_MEM_RECLAIM option still deadlocks despite the "rescuer thread". The CPU that is running the workqueue will deadlock with any rpciod task that is assigned to the same CPU. Interestingly enough, the WQ_UNBOUND option also appears able to deadlock in the same situation. Sorry, I have no explanation why... Cheers, Trond ________________________________________ From: J. Bruce Fields [bfields@fieldses.org] Sent: Friday, December 21, 2012 6:08 PM To: Myklebust, Trond Cc: Dave Jones; Linux Kernel; linux-nfs@vger.kernel.org; Adamson, Dros Subject: Re: nfsd oops on Linus' current tree. On Fri, Dec 21, 2012 at 06:40:54PM +0000, Myklebust, Trond wrote: > On Fri, 2012-12-21 at 13:08 -0500, J. Bruce Fields wrote: > > On Fri, Dec 21, 2012 at 10:33:48AM -0500, Dave Jones wrote: > > > Did a mount from a client (also running Linus current), and the > > > server spat this out.. > > > > > > [ 6936.306135] ------------[ cut here ]------------ > > > [ 6936.306154] WARNING: at net/sunrpc/clnt.c:617 rpc_shutdown_client+0x12a/0x1b0 [sunrpc]() > > > > This is a warning added by 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 > > "SUNRPC: add WARN_ON_ONCE for potential deadlock", pointing out that > > nfsd is calling shutdown_client from a workqueue, which is a problem > > because shutdown_client has to wait on rpc tasks that run on a > > workqueue. 
> > > > I don't believe there's any circular dependency among the workqueues > > (we're calling shutdown_client from callback_wq, not rpciod_workqueue), > > We were getting deadlocks with rpciod when calling rpc_shutdown_client > from the nfsiod workqueue. > > The problem here is that the workqueues all run using the same pool of > threads, and so you can get "interesting" deadlocks when one of these > threads has to wait for another one. OK, coming back after reviewing Documentation/workqueue.txt: the workqueues potentially involved here are rpciod and the server's laundromat and callback workqueues. The former is created with alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 1); and the latter two are both created with create_singlethread_workqueue(), which also sets WQ_MEM_RECLAIM. As I understand it, that ensures that each of the three has at least one "rescuer" thread dedicated to it--so there shouldn't be any deadlock as long as there's no circular dependency among the three. So think this is a false positive--am I missing something? > > but 168e4b39d1afb.. says that we could get a deadlock if both are > > running on the same kworker thread. > > > > I'm not sure what to do about that. > > > > The question is if you really do need the call to rpc_killall_tasks and > the synchronous wait for completion of old tasks? If you don't care, > then we could just have you call rpc_release_client() in order to > release your reference on the rpc_client. No, the waiting's intentional: for example, the laundromat wq is what's responsible for reaping expired clients, and it wants to wait for callback rpc's to exit. (If I remember right they may reference data structures we're about to clean up). --b. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2012-12-21 23:15 ` Myklebust, Trond @ 2012-12-21 23:26 ` J. Bruce Fields 2012-12-21 23:36 ` Myklebust, Trond 0 siblings, 1 reply; 23+ messages in thread From: J. Bruce Fields @ 2012-12-21 23:26 UTC (permalink / raw) To: Myklebust, Trond Cc: Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org, Adamson, Dros On Fri, Dec 21, 2012 at 11:15:40PM +0000, Myklebust, Trond wrote: > Apologies for top-posting. The SSD on my laptop died, and so I'm stuck using webmail for this account... Fun! If that happens to me on this trip, I've got a week trying to hack the kernel from my cell phone.... > Our experience with nfsiod is that the WQ_MEM_RECLAIM option still deadlocks despite the "rescuer thread". The CPU that is running the workqueue will deadlock with any rpciod task that is assigned to the same CPU. Interestingly enough, the WQ_UNBOUND option also appears able to deadlock in the same situation. > > Sorry, I have no explanation why... As I said: > there shouldn't be any deadlock as long as there's no circular > dependency among the three. There was a circular dependency (of rpciod on itself), so having a dedicated rpciod rescuer thread wouldn't help--once the rescuer thread is waiting for work queued to do the same queue you're asking for trouble. The last argument in alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 1); ensures that it will never allow more than 1 piece of work to run per CPU, so the deadlock should be pretty easy to hit. And with UNBOUND that's only one piece of work globally, so yeah all you need is an rpc at shutdown time and it should deadlock every time. --b. ^ permalink raw reply [flat|nested] 23+ messages in thread
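The `max_active = 1` behaviour Bruce describes — at most one work item executing at a time — can be mimicked in userspace with a one-worker pool. This is a simplification (real workqueue concurrency limits are per-CPU), but it shows why a later item cannot start until the running one finishes:

```python
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)  # rough analogue of max_active = 1
events = []

def first():
    events.append("first start")
    time.sleep(0.2)  # hold the only execution slot for a while
    events.append("first end")

def second():
    events.append("second start")

pool.submit(first)
pool.submit(second).result()
print(events)  # -> ['first start', 'first end', 'second start']
pool.shutdown()
```

So if `first` were instead waiting on `second` (as in the shutdown-from-workqueue case), `second` would never reach the front of the queue.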
* RE: nfsd oops on Linus' current tree. 2012-12-21 23:26 ` J. Bruce Fields @ 2012-12-21 23:36 ` Myklebust, Trond 2012-12-21 23:45 ` J. Bruce Fields 0 siblings, 1 reply; 23+ messages in thread From: Myklebust, Trond @ 2012-12-21 23:36 UTC (permalink / raw) To: J. Bruce Fields Cc: Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org, Adamson, Dros Please reread what I said. There was no obvious circular dependency, because nfsiod and rpciod are separate workqueues, both created with WQ_MEM_RECLAIM. Dros' experience shows, however that a call to rpc_shutdown_client in an nfsiod work item will deadlock with rpciod if the RPC task's work item has been assigned to the same CPU as the one running the rpc_shutdown_client work item. I can't tell right now if that is intentional (in which case the WARN_ON in the rpc code is correct), or if it is a bug in the workqueue code. For now, we're assuming the former. ________________________________________ From: J. Bruce Fields [bfields@fieldses.org] Sent: Friday, December 21, 2012 6:26 PM To: Myklebust, Trond Cc: Dave Jones; Linux Kernel; linux-nfs@vger.kernel.org; Adamson, Dros Subject: Re: nfsd oops on Linus' current tree. On Fri, Dec 21, 2012 at 11:15:40PM +0000, Myklebust, Trond wrote: > Apologies for top-posting. The SSD on my laptop died, and so I'm stuck using webmail for this account... Fun! If that happens to me on this trip, I've got a week trying to hack the kernel from my cell phone.... > Our experience with nfsiod is that the WQ_MEM_RECLAIM option still deadlocks despite the "rescuer thread". The CPU that is running the workqueue will deadlock with any rpciod task that is assigned to the same CPU. Interestingly enough, the WQ_UNBOUND option also appears able to deadlock in the same situation. > > Sorry, I have no explanation why... As I said: > there shouldn't be any deadlock as long as there's no circular > dependency among the three. 
There was a circular dependency (of rpciod on itself), so having a dedicated rpciod rescuer thread wouldn't help--once the rescuer thread is waiting for work queued to do the same queue you're asking for trouble. The last argument in alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 1); ensures that it will never allow more than 1 piece of work to run per CPU, so the deadlock should be pretty easy to hit. And with UNBOUND that's only one piece of work globally, so yeah all you need is an rpc at shutdown time and it should deadlock every time. --b. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2012-12-21 23:36 ` Myklebust, Trond @ 2012-12-21 23:45 ` J. Bruce Fields 2013-01-03 16:28 ` Adamson, Dros 0 siblings, 1 reply; 23+ messages in thread From: J. Bruce Fields @ 2012-12-21 23:45 UTC (permalink / raw) To: Myklebust, Trond Cc: Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org, Adamson, Dros, Tejun Heo On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote: > Please reread what I said. There was no obvious circular dependency, > because nfsiod and rpciod are separate workqueues, both created with > WQ_MEM_RECLAIM. Oh, sorry, I read "rpciod" as "nfsiod"! > Dros' experience shows, however that a call to > rpc_shutdown_client in an nfsiod work item will deadlock with rpciod > if the RPC task's work item has been assigned to the same CPU as the > one running the rpc_shutdown_client work item. > > I can't tell right now if that is intentional (in which case the > WARN_ON in the rpc code is correct), or if it is a bug in the > workqueue code. For now, we're assuming the former. Well, Documentation/workqueue.txt says: "Each wq with WQ_MEM_RECLAIM set has an execution context reserved for it. If there is dependency among multiple work items used during memory reclaim, they should be queued to separate wq each with WQ_MEM_RECLAIM." And I can't see how it could have been safe to convert create_single_singlethread_workqueue() callers otherwise. --b. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2012-12-21 23:45 ` J. Bruce Fields @ 2013-01-03 16:28 ` Adamson, Dros 2013-01-03 20:11 ` J. Bruce Fields 2013-01-03 22:03 ` Tejun Heo 0 siblings, 2 replies; 23+ messages in thread From: Adamson, Dros @ 2013-01-03 16:28 UTC (permalink / raw) To: J. Bruce Fields Cc: Myklebust, Trond, Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org, Adamson, Dros, Tejun Heo Hey, sorry for the late response, I've been on vacation. On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote: >> Please reread what I said. There was no obvious circular dependency, >> because nfsiod and rpciod are separate workqueues, both created with >> WQ_MEM_RECLAIM. > > Oh, sorry, I read "rpciod" as "nfsiod"! > >> Dros' experience shows, however that a call to >> rpc_shutdown_client in an nfsiod work item will deadlock with rpciod >> if the RPC task's work item has been assigned to the same CPU as the >> one running the rpc_shutdown_client work item. >> >> I can't tell right now if that is intentional (in which case the >> WARN_ON in the rpc code is correct), or if it is a bug in the >> workqueue code. For now, we're assuming the former. > > Well, Documentation/workqueue.txt says: > > "Each wq with WQ_MEM_RECLAIM set has an execution context > reserved for it. If there is dependency among multiple work > items used during memory reclaim, they should be queued to > separate wq each with WQ_MEM_RECLAIM." 
The deadlock we were seeing was: - task A gets queued on rpciod workqueue and assigned kworker-0:0 - task B gets queued on rpciod workqueue and assigned the same kworker (kworker-0:0) - task A gets run, calls rpc_shutdown_client(), which will loop forever waiting for task B to run rpc_async_release() - task B will never run rpc_async_release() - it can't run until kworker-0:0 is free, which won't happen until task A (rpc_shutdown_client) is done The same deadlock happened when we tried queuing the tasks on a different workqueues -- queue_work() assigns the task to a kworker thread and it's luck of the draw if it's the same kworker as task A. We tried the different workqueue options, but nothing changed this behavior. Once a work struct is queued, there is no way to back out of the deadlock. From kernel/workqueue.c:insert_wq_barrier comment: * Currently, a queued barrier can't be canceled. This is because * try_to_grab_pending() can't determine whether the work to be * grabbed is at the head of the queue and thus can't clear LINKED * flag of the previous work while there must be a valid next work * after a work with LINKED flag set. So once a work struct is queued and there is an ordering dependency (i.e. task A is before task B), there is no way to back task B out - so we can't just call cancel_work() or something on task B in rpc_shutdown_client. The root of our issue is that rpc_shutdown_client is never safe to call from a workqueue context - it loops until there are no more tasks, marking tasks as killed and waiting for them to be cleaned up in each task's own workqueue context. Any tasks that have already been assigned to the same kworker thread will never have a chance to run this cleanup stage. When fixing this deadlock, Trond and I discussed changing how rpc_shutdown_client works (making it workqueue safe), but Trond felt that it'd be better to just not call it from a workqueue context and print a warning if it is. 
IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no success), but I'd argue that even if that did work it would still be very easy to call rpc_shutdown_client from the wrong context and MUCH harder to detect it. It's also unclear to me if setting rpciod workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc... -dros ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2013-01-03 16:28 ` Adamson, Dros @ 2013-01-03 20:11 ` J. Bruce Fields 2013-01-03 20:27 ` Adamson, Dros 2013-01-03 22:08 ` Tejun Heo 2013-01-03 22:03 ` Tejun Heo 1 sibling, 2 replies; 23+ messages in thread From: J. Bruce Fields @ 2013-01-03 20:11 UTC (permalink / raw) To: Adamson, Dros Cc: Myklebust, Trond, Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org, Tejun Heo On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote: > Hey, sorry for the late response, I've been on vacation. > > On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <bfields@fieldses.org> > wrote: > > > On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote: > >> Please reread what I said. There was no obvious circular > >> dependency, because nfsiod and rpciod are separate workqueues, both > >> created with WQ_MEM_RECLAIM. > > > > Oh, sorry, I read "rpciod" as "nfsiod"! > > > >> Dros' experience shows, however that a call to rpc_shutdown_client > >> in an nfsiod work item will deadlock with rpciod if the RPC task's > >> work item has been assigned to the same CPU as the one running the > >> rpc_shutdown_client work item. > >> > >> I can't tell right now if that is intentional (in which case the > >> WARN_ON in the rpc code is correct), or if it is a bug in the > >> workqueue code. For now, we're assuming the former. > > > > Well, Documentation/workqueue.txt says: > > > > "Each wq with WQ_MEM_RECLAIM set has an execution context > > reserved for it. If there is dependency among multiple work > > items used during memory reclaim, they should be queued to > > separate wq each with WQ_MEM_RECLAIM." 
> > The deadlock we were seeing was: > > - task A gets queued on rpciod workqueue and assigned kworker-0:0 - > task B gets queued on rpciod workqueue and assigned the same kworker > (kworker-0:0) - task A gets run, calls rpc_shutdown_client(), which > will loop forever waiting for task B to run rpc_async_release() - task > B will never run rpc_async_release() - it can't run until kworker-0:0 > is free, which won't happen until task A (rpc_shutdown_client) is done > > The same deadlock happened when we tried queuing the tasks on a > different workqueues -- queue_work() assigns the task to a kworker > thread and it's luck of the draw if it's the same kworker as task A. > We tried the different workqueue options, but nothing changed this > behavior. > > Once a work struct is queued, there is no way to back out of the > deadlock. From kernel/workqueue.c:insert_wq_barrier comment: > > * Currently, a queued barrier can't be canceled. This is because * > try_to_grab_pending() can't determine whether the work to be * > grabbed is at the head of the queue and thus can't clear LINKED * > flag of the previous work while there must be a valid next work * > after a work with LINKED flag set. > > So once a work struct is queued and there is an ordering dependency > (i.e. task A is before task B), there is no way to back task B out - > so we can't just call cancel_work() or something on task B in > rpc_shutdown_client. > > The root of our issue is that rpc_shutdown_client is never safe to > call from a workqueue context - it loops until there are no more > tasks, marking tasks as killed and waiting for them to be cleaned up > in each task's own workqueue context. Any tasks that have already been > assigned to the same kworker thread will never have a chance to run > this cleanup stage. 
> > When fixing this deadlock, Trond and I discussed changing how > rpc_shutdown_client works (making it workqueue safe), but Trond felt > that it'd be better to just not call it from a workqueue context and > print a warning if it is. > > IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no > success), but I'd argue that even if that did work it would still be > very easy to call rpc_shutdown_client from the wrong context and MUCH > harder to detect it. It's also unclear to me if setting rpciod > workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc... Both rpciod and nfsiod already set WQ_MEM_RECLAIM. But, right, looking at kernel/workqueue.c, it seems that the dedicated "rescuer" threads are invoked only in the case when work is stalled because a new worker thread isn't allocated quickly enough. So, what to do that's simplest enough that it would work for post-rc2/stable? I was happy having just a simple dedicated thread--these are only started when nfsd is, so there's no real thread proliferation problem. I'll go quietly weep for a little while and then think about it some more.... --b. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2013-01-03 20:11 ` J. Bruce Fields @ 2013-01-03 20:27 ` Adamson, Dros 2013-01-03 20:52 ` J. Bruce Fields 2013-01-03 22:08 ` Tejun Heo 1 sibling, 1 reply; 23+ messages in thread From: Adamson, Dros @ 2013-01-03 20:27 UTC (permalink / raw) To: J. Bruce Fields Cc: Adamson, Dros, Myklebust, Trond, Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org, Tejun Heo On Jan 3, 2013, at 3:11 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote: >> Hey, sorry for the late response, I've been on vacation. >> >> On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <bfields@fieldses.org> >> wrote: >> >>> On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote: >>>> Please reread what I said. There was no obvious circular >>>> dependency, because nfsiod and rpciod are separate workqueues, both >>>> created with WQ_MEM_RECLAIM. >>> >>> Oh, sorry, I read "rpciod" as "nfsiod"! >>> >>>> Dros' experience shows, however that a call to rpc_shutdown_client >>>> in an nfsiod work item will deadlock with rpciod if the RPC task's >>>> work item has been assigned to the same CPU as the one running the >>>> rpc_shutdown_client work item. >>>> >>>> I can't tell right now if that is intentional (in which case the >>>> WARN_ON in the rpc code is correct), or if it is a bug in the >>>> workqueue code. For now, we're assuming the former. >>> >>> Well, Documentation/workqueue.txt says: >>> >>> "Each wq with WQ_MEM_RECLAIM set has an execution context >>> reserved for it. If there is dependency among multiple work >>> items used during memory reclaim, they should be queued to >>> separate wq each with WQ_MEM_RECLAIM." 
>> >> The deadlock we were seeing was: >> >> - task A gets queued on rpciod workqueue and assigned kworker-0:0 - >> task B gets queued on rpciod workqueue and assigned the same kworker >> (kworker-0:0) - task A gets run, calls rpc_shutdown_client(), which >> will loop forever waiting for task B to run rpc_async_release() - task >> B will never run rpc_async_release() - it can't run until kworker-0:0 >> is free, which won't happen until task A (rpc_shutdown_client) is done >> >> The same deadlock happened when we tried queuing the tasks on a >> different workqueues -- queue_work() assigns the task to a kworker >> thread and it's luck of the draw if it's the same kworker as task A. >> We tried the different workqueue options, but nothing changed this >> behavior. >> >> Once a work struct is queued, there is no way to back out of the >> deadlock. From kernel/workqueue.c:insert_wq_barrier comment: >> >> * Currently, a queued barrier can't be canceled. This is because * >> try_to_grab_pending() can't determine whether the work to be * >> grabbed is at the head of the queue and thus can't clear LINKED * >> flag of the previous work while there must be a valid next work * >> after a work with LINKED flag set. >> >> So once a work struct is queued and there is an ordering dependency >> (i.e. task A is before task B), there is no way to back task B out - >> so we can't just call cancel_work() or something on task B in >> rpc_shutdown_client. >> >> The root of our issue is that rpc_shutdown_client is never safe to >> call from a workqueue context - it loops until there are no more >> tasks, marking tasks as killed and waiting for them to be cleaned up >> in each task's own workqueue context. Any tasks that have already been >> assigned to the same kworker thread will never have a chance to run >> this cleanup stage. 
>> >> When fixing this deadlock, Trond and I discussed changing how >> rpc_shutdown_client works (making it workqueue safe), but Trond felt >> that it'd be better to just not call it from a workqueue context and >> print a warning if it is. >> >> IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no >> success), but I'd argue that even if that did work it would still be >> very easy to call rpc_shutdown_client from the wrong context and MUCH >> harder to detect it. It's also unclear to me if setting rpciod >> workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc... > > Both rpciod and nfsiod already set WQ_MEM_RECLAIM. Heh, oh yeah :) > > But, right, looking at kernel/workqueue.c, it seems that the dedicated > "rescuer" threads are invoked only in the case when work is stalled > because a new worker thread isn't allocated quickly enough. > > So, what to do that's simplest enough that it would work for > post-rc2/stable? I was happy having just a simple dedicated > thread--these are only started when nfsd is, so there's no real thread > proliferation problem. That should work fine. The client went this way and spawns a new kthread before calling rpc_shutdown_client() from a workqueue context. This should happen very infrequently. -dros > > I'll go quietly weep for a little while and then think about it some > more.... ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2013-01-03 20:27 ` Adamson, Dros @ 2013-01-03 20:52 ` J. Bruce Fields 2013-01-03 22:15 ` Tejun Heo 0 siblings, 1 reply; 23+ messages in thread From: J. Bruce Fields @ 2013-01-03 20:52 UTC (permalink / raw) To: Adamson, Dros Cc: Myklebust, Trond, Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org, Tejun Heo On Thu, Jan 03, 2013 at 08:27:20PM +0000, Adamson, Dros wrote: > > On Jan 3, 2013, at 3:11 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > > > On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote: > >> Hey, sorry for the late response, I've been on vacation. > >> > >> On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <bfields@fieldses.org> > >> wrote: > >> > >>> On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote: > >>>> Please reread what I said. There was no obvious circular > >>>> dependency, because nfsiod and rpciod are separate workqueues, both > >>>> created with WQ_MEM_RECLAIM. > >>> > >>> Oh, sorry, I read "rpciod" as "nfsiod"! > >>> > >>>> Dros' experience shows, however that a call to rpc_shutdown_client > >>>> in an nfsiod work item will deadlock with rpciod if the RPC task's > >>>> work item has been assigned to the same CPU as the one running the > >>>> rpc_shutdown_client work item. > >>>> > >>>> I can't tell right now if that is intentional (in which case the > >>>> WARN_ON in the rpc code is correct), or if it is a bug in the > >>>> workqueue code. For now, we're assuming the former. > >>> > >>> Well, Documentation/workqueue.txt says: > >>> > >>> "Each wq with WQ_MEM_RECLAIM set has an execution context > >>> reserved for it. If there is dependency among multiple work > >>> items used during memory reclaim, they should be queued to > >>> separate wq each with WQ_MEM_RECLAIM." 
> >> > >> The deadlock we were seeing was: > >> > >> - task A gets queued on rpciod workqueue and assigned kworker-0:0 - > >> task B gets queued on rpciod workqueue and assigned the same kworker > >> (kworker-0:0) - task A gets run, calls rpc_shutdown_client(), which > >> will loop forever waiting for task B to run rpc_async_release() - task > >> B will never run rpc_async_release() - it can't run until kworker-0:0 > >> is free, which won't happen until task A (rpc_shutdown_client) is done > >> > >> The same deadlock happened when we tried queuing the tasks on a > >> different workqueues -- queue_work() assigns the task to a kworker > >> thread and it's luck of the draw if it's the same kworker as task A. > >> We tried the different workqueue options, but nothing changed this > >> behavior. > >> > >> Once a work struct is queued, there is no way to back out of the > >> deadlock. From kernel/workqueue.c:insert_wq_barrier comment: > >> > >> * Currently, a queued barrier can't be canceled. This is because * > >> try_to_grab_pending() can't determine whether the work to be * > >> grabbed is at the head of the queue and thus can't clear LINKED * > >> flag of the previous work while there must be a valid next work * > >> after a work with LINKED flag set. > >> > >> So once a work struct is queued and there is an ordering dependency > >> (i.e. task A is before task B), there is no way to back task B out - > >> so we can't just call cancel_work() or something on task B in > >> rpc_shutdown_client. > >> > >> The root of our issue is that rpc_shutdown_client is never safe to > >> call from a workqueue context - it loops until there are no more > >> tasks, marking tasks as killed and waiting for them to be cleaned up > >> in each task's own workqueue context. Any tasks that have already been > >> assigned to the same kworker thread will never have a chance to run > >> this cleanup stage. 
> >> > >> When fixing this deadlock, Trond and I discussed changing how > >> rpc_shutdown_client works (making it workqueue safe), but Trond felt > >> that it'd be better to just not call it from a workqueue context and > >> print a warning if it is. > >> > >> IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no > >> success), but I'd argue that even if that did work it would still be > >> very easy to call rpc_shutdown_client from the wrong context and MUCH > >> harder to detect it. It's also unclear to me if setting rpciod > >> workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc... > > > > Both rpciod and nfsiod already set WQ_MEM_RECLAIM. > > Heh, oh yeah :) > > > > > But, right, looking at kernel/workqueue.c, it seems that the dedicated > > "rescuer" threads are invoked only in the case when work is stalled > > because a new worker thread isn't allocated quickly enough. > > > > So, what to do that's simplest enough that it would work for > > post-rc2/stable? I was happy having just a simple dedicated > > thread--these are only started when nfsd is, so there's no real thread > > proliferation problem. > > That should work fine. No, I meant I was happy having my workqueue handled by a single dedicated thread, which I don't see a trivial way to do any more. > The client went this way and spawns a new kthread before calling rpc_shutdown_client() from a workqueue context. This should happen very infrequently. I don't think I can do that, as I was really counting on being able to wait for the client to shut down. I'll think about it. --b. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2013-01-03 20:52 ` J. Bruce Fields @ 2013-01-03 22:15 ` Tejun Heo 0 siblings, 0 replies; 23+ messages in thread From: Tejun Heo @ 2013-01-03 22:15 UTC (permalink / raw) To: J. Bruce Fields Cc: Adamson, Dros, Myklebust, Trond, Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org It's getting a bit repetitive but I really wanna steer people away from implementing separate kthreads for wrong reasons. kthread is surprisingly difficult to get right especially around freezing / exiting / hotplugs and people get it subtly wrong very often. On Thu, Jan 03, 2013 at 03:52:21PM -0500, J. Bruce Fields wrote: > No, I meant I was happy having my workqueue handled by a single > dedicated thread, which I don't see a trivial way to do any more. Because you don't need that. The old workqueue didn't give you anything better than the current one. It was usually more difficult to get right in terms of execution dependency because it introduced a lot of unintended execution dependencies through one-(per-cpu)-worker-per-workqueue rule. If anyone can show me workqueue is deadlocking when it shouldn't I'll be happy to look into it and fix it, but, up until now, most if not all reported cases were through incorrect usage, and things definitely don't work like described in this thread. Thanks. -- tejun ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2013-01-03 20:11 ` J. Bruce Fields 2013-01-03 20:27 ` Adamson, Dros @ 2013-01-03 22:08 ` Tejun Heo 2013-01-03 22:12 ` Myklebust, Trond 1 sibling, 1 reply; 23+ messages in thread From: Tejun Heo @ 2013-01-03 22:08 UTC (permalink / raw) To: J. Bruce Fields Cc: Adamson, Dros, Myklebust, Trond, Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org Hello, On Thu, Jan 03, 2013 at 03:11:20PM -0500, J. Bruce Fields wrote: > Both rpciod and nfsiod already set WQ_MEM_RECLAIM. > > But, right, looking at kernel/workqueue.c, it seems that the dedicated > "rescuer" threads are invoked only in the case when work is stalled > because a new worker thread isn't allocated quickly enough. Because that's the *only* case where progress can't be guaranteed otherwise. > So, what to do that's simplest enough that it would work for > post-rc2/stable? I was happy having just a simple dedicated > thread--these are only started when nfsd is, so there's no real thread > proliferation problem. The analysis is likely completely wrong, so please don't go off doing something unnecessary. Please take look at what's causing the deadlocks again. Thanks. -- tejun ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2013-01-03 22:08 ` Tejun Heo @ 2013-01-03 22:12 ` Myklebust, Trond 2013-01-03 22:26 ` Tejun Heo 0 siblings, 1 reply; 23+ messages in thread From: Myklebust, Trond @ 2013-01-03 22:12 UTC (permalink / raw) To: Tejun Heo Cc: J. Bruce Fields, Adamson, Dros, Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org On Thu, 2013-01-03 at 17:08 -0500, Tejun Heo wrote: > Hello, > > On Thu, Jan 03, 2013 at 03:11:20PM -0500, J. Bruce Fields wrote: > > Both rpciod and nfsiod already set WQ_MEM_RECLAIM. > > > > But, right, looking at kernel/workqueue.c, it seems that the dedicated > > "rescuer" threads are invoked only in the case when work is stalled > > because a new worker thread isn't allocated quickly enough. > > Because that's the *only* case where progress can't be guaranteed > otherwise. > > > So, what to do that's simplest enough that it would work for > > post-rc2/stable? I was happy having just a simple dedicated > > thread--these are only started when nfsd is, so there's no real thread > > proliferation problem. > > The analysis is likely completely wrong, so please don't go off doing > something unnecessary. Please take look at what's causing the > deadlocks again. The analysis is a no-brainer: We see a deadlock due to one work item waiting for completion of another work item that is queued on the same CPU. There is no other dependency between the two work items. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2013-01-03 22:12 ` Myklebust, Trond @ 2013-01-03 22:26 ` Tejun Heo 2013-01-03 22:34 ` Tejun Heo ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Tejun Heo @ 2013-01-03 22:26 UTC (permalink / raw) To: Myklebust, Trond Cc: J. Bruce Fields, Adamson, Dros, Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org Hello, Trond. On Thu, Jan 03, 2013 at 10:12:32PM +0000, Myklebust, Trond wrote: > > The analysis is likely completely wrong, so please don't go off doing > > something unnecessary. Please take look at what's causing the > > deadlocks again. > > The analysis is a no-brainer: > We see a deadlock due to one work item waiting for completion of another > work item that is queued on the same CPU. There is no other dependency > between the two work items. What do you mean "waiting for completion of"? Is one flushing the other? Or is one waiting for the other to take certain action? How are the two work items related? Are they queued on the same workqueue? Can you create a minimal repro case of the observed deadlock? Thanks. -- tejun ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2013-01-03 22:26 ` Tejun Heo @ 2013-01-03 22:34 ` Tejun Heo 2013-01-03 23:11 ` Myklebust, Trond [not found] ` <1357254692.55285.33.camel@lade.trondhjem.org> 2 siblings, 0 replies; 23+ messages in thread From: Tejun Heo @ 2013-01-03 22:34 UTC (permalink / raw) To: Myklebust, Trond Cc: J. Bruce Fields, Adamson, Dros, Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org On Thu, Jan 03, 2013 at 05:26:39PM -0500, Tejun Heo wrote: > Hello, Trond. > > On Thu, Jan 03, 2013 at 10:12:32PM +0000, Myklebust, Trond wrote: > > > The analysis is likely completely wrong, so please don't go off doing > > > something unnecessary. Please take look at what's causing the > > > deadlocks again. > > > > The analysis is a no-brainer: > > We see a deadlock due to one work item waiting for completion of another > > work item that is queued on the same CPU. There is no other dependency > > between the two work items. > > What do you mean "waiting for completion of"? Is one flushing the > other? Or is one waiting for the other to take certain action? How > are the two work items related? Are they queued on the same > workqueue? Can you create a minimal repro case of the observed > deadlock? Ooh, BTW, there was a bug where workqueue code created a false dependency between two work items. Workqueue currently considers two work items to be the same if they're on the same address and won't execute them concurrently - ie. it makes a work item which is queued again while being executed wait for the previous execution to complete. If a work function frees the work item, and then waits for an event which should be performed by another work item and *that* work item recycles the freed work item, it can create a false dependency loop. There really is no reliable way to detect this short of verifying every memory free. 
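The "same address, therefore same work item" rule Tejun describes can be modeled with a toy dispatcher (a deliberately simplified sketch, using Python's id() as the stand-in for a kernel address):

```python
# Toy model of workqueue non-reentrancy keyed on the work item's address:
# a freed-and-recycled item at the same address is indistinguishable from
# the one still executing, creating the false dependency described above.
class Work:
    pass

executing = set()              # addresses of items currently being executed

def dispatch(work):
    if id(work) in executing:
        return "waits for previous execution"   # the false dependency
    executing.add(id(work))
    return "runs"

a = Work()
print(dispatch(a))             # -> runs
recycled = a                   # allocator hands back the same address
print(dispatch(recycled))      # -> waits for previous execution
```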
A patch is queued to make such occurrences less likely (work functions should also match for two work items considered the same), but if you're seeing this, the best thing to do is freeing the work item at the end of the work function. Thanks. -- tejun ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree. 2013-01-03 22:26 ` Tejun Heo 2013-01-03 22:34 ` Tejun Heo @ 2013-01-03 23:11 ` Myklebust, Trond [not found] ` <1357254692.55285.33.camel@lade.trondhjem.org> 2 siblings, 0 replies; 23+ messages in thread From: Myklebust, Trond @ 2013-01-03 23:11 UTC (permalink / raw) To: Tejun Heo Cc: J. Bruce Fields, Adamson, Dros, Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 3289 bytes --] On Thu, 2013-01-03 at 17:26 -0500, Tejun Heo wrote: > Hello, Trond. > > On Thu, Jan 03, 2013 at 10:12:32PM +0000, Myklebust, Trond wrote: > > > The analysis is likely completely wrong, so please don't go off doing > > > something unnecessary. Please take look at what's causing the > > > deadlocks again. > > > > The analysis is a no-brainer: > > We see a deadlock due to one work item waiting for completion of another > > work item that is queued on the same CPU. There is no other dependency > > between the two work items. > > What do you mean "waiting for completion of"? Is one flushing the > other? Or is one waiting for the other to take certain action? How > are the two work items related? Are they queued on the same > workqueue? Can you create a minimal repro case of the observed > deadlock? The two work items are running on different work queues. One is running on the nfsiod work queue, and is waiting for the other to complete on the rpciod work queue. Basically, the nfsiod work item is trying to shut down an RPC session, and it is waiting for each outstanding RPC call to finish running a clean-up routine. We can't call flush_work(), since we don't have a way to pin the work_struct for any long period of time, so we queue all the RPC calls up, then sleep on a global wait queue for 1 second or until the last RPC call wakes us up (see rpc_shutdown_client()). 
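The wait pattern Trond describes in rpc_shutdown_client() -- queue everything up, then sleep on a shared wait queue in 1-second intervals until the last call issues the wakeup -- can be roughly sketched in user space (names are illustrative, not the real kernel API):

```python
# Sketch of "sleep on a global wait queue for 1 second or until the last
# RPC call wakes us up": a condition variable with a timed, re-checked wait.
import threading

cond = threading.Condition()
outstanding = 3                # RPC calls still running their cleanup

def rpc_task_cleanup():
    global outstanding
    with cond:
        outstanding -= 1
        if outstanding == 0:
            cond.notify_all()  # the last task wakes the shutdown waiter

def shutdown_wait():
    with cond:
        while outstanding:
            cond.wait(timeout=1.0)   # 1-second sleep, then re-check
    return "all tasks finished"

tasks = [threading.Thread(target=rpc_task_cleanup) for _ in range(3)]
for t in tasks:
    t.start()
result = shutdown_wait()
print(result)  # -> all tasks finished
```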
In the deadlock scenario, it looks as if one (or more) of the RPC calls are getting queued on the same CPU (but on the rpciod workqueue) as the shutdown process (running on nfsiod).

> Ooh, BTW, there was a bug where workqueue code created a false
> dependency between two work items. Workqueue currently considers two
> work items to be the same if they're on the same address and won't
> execute them concurrently - ie. it makes a work item which is queued
> again while being executed wait for the previous execution to
> complete.
>
> If a work function frees the work item, and then waits for an event
> which should be performed by another work item and *that* work item
> recycles the freed work item, it can create a false dependency loop.
> There really is no reliable way to detect this short of verifying
> every memory free. A patch is queued to make such occurrences less
> likely (work functions should also match for two work items considered
> the same), but if you're seeing this, the best thing to do is freeing
> the work item at the end of the work function.

That's interesting... I wonder if we may have been hitting that issue.

From what I can see, we do actually free the write RPC task (and hence the work_struct) before we call the asynchronous unlink completion...

Dros, can you see if reverting commit 324d003b0cd82151adbaecefef57b73f7959a469 + commit 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 and then applying the attached patch also fixes the hang on a pristine 3.7.x kernel?
--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: gnurr.dif --]
[-- Type: text/x-patch; name="gnurr.dif", Size: 1283 bytes --]

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index b6bdb18..400f7ec 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -91,12 +91,13 @@ void nfs_readdata_release(struct nfs_read_data *rdata)
 	put_nfs_open_context(rdata->args.context);
 	if (rdata->pages.pagevec != rdata->pages.page_array)
 		kfree(rdata->pages.pagevec);
-	if (rdata != &read_header->rpc_data)
-		kfree(rdata);
-	else
+	if (rdata == &read_header->rpc_data) {
 		rdata->header = NULL;
+		rdata = NULL;
+	}
 	if (atomic_dec_and_test(&hdr->refcnt))
 		hdr->completion_ops->completion(hdr);
+	kfree(rdata);
 }
 EXPORT_SYMBOL_GPL(nfs_readdata_release);

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b673be3..45d9250 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -126,12 +126,13 @@ void nfs_writedata_release(struct nfs_write_data *wdata)
 	put_nfs_open_context(wdata->args.context);
 	if (wdata->pages.pagevec != wdata->pages.page_array)
 		kfree(wdata->pages.pagevec);
-	if (wdata != &write_header->rpc_data)
-		kfree(wdata);
-	else
+	if (wdata == &write_header->rpc_data) {
 		wdata->header = NULL;
+		wdata = NULL;
+	}
 	if (atomic_dec_and_test(&hdr->refcnt))
 		hdr->completion_ops->completion(hdr);
+	kfree(wdata);
 }
 EXPORT_SYMBOL_GPL(nfs_writedata_release);

^ permalink raw reply related	[flat|nested] 23+ messages in thread
[parent not found: <1357254692.55285.33.camel@lade.trondhjem.org>]
* Re: nfsd oops on Linus' current tree. [not found] ` <1357254692.55285.33.camel@lade.trondhjem.org> @ 2013-01-03 23:26 ` Myklebust, Trond 2013-01-04 17:11 ` Adamson, Dros 0 siblings, 1 reply; 23+ messages in thread From: Myklebust, Trond @ 2013-01-03 23:26 UTC (permalink / raw) To: Tejun Heo Cc: J. Bruce Fields, Adamson, Dros, Dave Jones, Linux Kernel, linux-nfs@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 1911 bytes --] On Thu, 2013-01-03 at 18:11 -0500, Trond Myklebust wrote: > On Thu, 2013-01-03 at 17:26 -0500, Tejun Heo wrote: > > Ooh, BTW, there was a bug where workqueue code created a false > > dependency between two work items. Workqueue currently considers two > > work items to be the same if they're on the same address and won't > > execute them concurrently - ie. it makes a work item which is queued > > again while being executed wait for the previous execution to > > complete. > > > > If a work function frees the work item, and then waits for an event > > which should be performed by another work item and *that* work item > > recycles the freed work item, it can create a false dependency loop. > > There really is no reliable way to detect this short of verifying > > every memory free. A patch is queued to make such occurrences less > > likely (work functions should also match for two work items considered > > the same), but if you're seeing this, the best thing to do is freeing > > the work item at the end of the work function. > > That's interesting... I wonder if we may have been hitting that issue. > > From what I can see, we do actually free the write RPC task (and hence > the work_struct) before we call the asynchronous unlink completion... > > Dros, can you see if reverting commit > 324d003b0cd82151adbaecefef57b73f7959a469 + commit > 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 and then applying the attached > patch also fixes the hang on a pristine 3.7.x kernel? 
Actually, we probably also need to look at rpc_free_task, so the following patch, instead...

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: gnurr2.dif --]
[-- Type: text/x-patch; name="gnurr2.dif", Size: 2061 bytes --]

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index b6bdb18..400f7ec 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -91,12 +91,13 @@ void nfs_readdata_release(struct nfs_read_data *rdata)
 	put_nfs_open_context(rdata->args.context);
 	if (rdata->pages.pagevec != rdata->pages.page_array)
 		kfree(rdata->pages.pagevec);
-	if (rdata != &read_header->rpc_data)
-		kfree(rdata);
-	else
+	if (rdata == &read_header->rpc_data) {
 		rdata->header = NULL;
+		rdata = NULL;
+	}
 	if (atomic_dec_and_test(&hdr->refcnt))
 		hdr->completion_ops->completion(hdr);
+	kfree(rdata);
 }
 EXPORT_SYMBOL_GPL(nfs_readdata_release);

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b673be3..45d9250 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -126,12 +126,13 @@ void nfs_writedata_release(struct nfs_write_data *wdata)
 	put_nfs_open_context(wdata->args.context);
 	if (wdata->pages.pagevec != wdata->pages.page_array)
 		kfree(wdata->pages.pagevec);
-	if (wdata != &write_header->rpc_data)
-		kfree(wdata);
-	else
+	if (wdata == &write_header->rpc_data) {
 		wdata->header = NULL;
+		wdata = NULL;
+	}
 	if (atomic_dec_and_test(&hdr->refcnt))
 		hdr->completion_ops->completion(hdr);
+	kfree(wdata);
 }
 EXPORT_SYMBOL_GPL(nfs_writedata_release);

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index d17a704..500407a 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -936,14 +936,13 @@ struct rpc_task *rpc_new_task(const struct rpc_task_setup *setup_data)
 
 static void rpc_free_task(struct rpc_task *task)
 {
-	const struct rpc_call_ops *tk_ops = task->tk_ops;
-	void *calldata = task->tk_calldata;
+	unsigned short tk_flags = task->tk_flags;
 
-	if (task->tk_flags & RPC_TASK_DYNAMIC) {
+	rpc_release_calldata(task->tk_ops, task->tk_calldata);
+	if (tk_flags & RPC_TASK_DYNAMIC) {
 		dprintk("RPC: %5u freeing task\n", task->tk_pid);
 		mempool_free(task, rpc_task_mempool);
 	}
-	rpc_release_calldata(tk_ops, calldata);
 }
 
 static void rpc_async_release(struct work_struct *work)

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* Re: nfsd oops on Linus' current tree.
  2013-01-03 23:26 ` Myklebust, Trond
@ 2013-01-04 17:11   ` Adamson, Dros
  0 siblings, 0 replies; 23+ messages in thread
From: Adamson, Dros @ 2013-01-04 17:11 UTC (permalink / raw)
To: Myklebust, Trond
Cc: Tejun Heo, J. Bruce Fields, Adamson, Dros, Dave Jones, Linux Kernel,
	linux-nfs@vger.kernel.org

On Jan 3, 2013, at 6:26 PM, "Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:

> On Thu, 2013-01-03 at 18:11 -0500, Trond Myklebust wrote:
>> On Thu, 2013-01-03 at 17:26 -0500, Tejun Heo wrote:
>>> Ooh, BTW, there was a bug where workqueue code created a false
>>> dependency between two work items.  Workqueue currently considers two
>>> work items to be the same if they're on the same address and won't
>>> execute them concurrently - ie. it makes a work item which is queued
>>> again while being executed wait for the previous execution to
>>> complete.
>>>
>>> If a work function frees the work item, and then waits for an event
>>> which should be performed by another work item and *that* work item
>>> recycles the freed work item, it can create a false dependency loop.
>>> There really is no reliable way to detect this short of verifying
>>> every memory free.  A patch is queued to make such occurrences less
>>> likely (work functions should also match for two work items considered
>>> the same), but if you're seeing this, the best thing to do is freeing
>>> the work item at the end of the work function.
>>
>> That's interesting... I wonder if we may have been hitting that issue.
>>
>> From what I can see, we do actually free the write RPC task (and hence
>> the work_struct) before we call the asynchronous unlink completion...
>>
>> Dros, can you see if reverting commit
>> 324d003b0cd82151adbaecefef57b73f7959a469 + commit
>> 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 and then applying the attached
>> patch also fixes the hang on a pristine 3.7.x kernel?
>
> Actually, we probably also need to look at rpc_free_task, so the
> following patch, instead...

Yes, this patch fixes the hang!

Thank you for the explanation Tejun - that makes a lot of sense and
explains the workqueue behavior that we were seeing.

-dros
* Re: nfsd oops on Linus' current tree.
  2013-01-03 16:28 ` Adamson, Dros
  2013-01-03 20:11   ` J. Bruce Fields
@ 2013-01-03 22:03   ` Tejun Heo
  2013-01-03 23:08     ` J. Bruce Fields
  1 sibling, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2013-01-03 22:03 UTC (permalink / raw)
To: Adamson, Dros
Cc: J. Bruce Fields, Myklebust, Trond, Dave Jones, Linux Kernel,
	linux-nfs@vger.kernel.org

Hello, guys.

On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote:
> The deadlock we were seeing was:
>
> - task A gets queued on rpciod workqueue and assigned kworker-0:0
> - task B gets queued on rpciod workqueue and assigned the same kworker (kworker-0:0)
> - task A gets run, calls rpc_shutdown_client(), which will loop forever waiting for task B to run rpc_async_release()
> - task B will never run rpc_async_release() - it can't run until kworker-0:0 is free, which won't happen until task A (rpc_shutdown_client) is done
>
> The same deadlock happened when we tried queuing the tasks on a
> different workqueues -- queue_work() assigns the task to a kworker
> thread and it's luck of the draw if it's the same kworker as task A.
> We tried the different workqueue options, but nothing changed this
> behavior.

Work items don't get assigned to workers on queueing.  Idle workers
pick up work items.  A work item is directly assigned to a specific
worker iff the worker is already executing that specific work item or
the new work item is "linked" to the one it's currently executing.
Currently, the only case where a linked work item is used is when
flushing which is guaranteed to not introduce dependency the other way
around.

So, your diagnosis looks wrong to me.  If such problem existed, we
would be seeing deadlocks all over the place.

> Once a work struct is queued, there is no way to back out of the
> deadlock. From kernel/workqueue.c:insert_wq_barrier comment:

Yes, there are.  cancel_work[_sync]() do exactly that.

> * Currently, a queued barrier can't be canceled.  This is because
> * try_to_grab_pending() can't determine whether the work to be
> * grabbed is at the head of the queue and thus can't clear LINKED
> * flag of the previous work while there must be a valid next work
> * after a work with LINKED flag set.
>
> So once a work struct is queued and there is an ordering dependency
> (i.e. task A is before task B), there is no way to back task B out -
> so we can't just call cancel_work() or something on task B in
> rpc_shutdown_client.

A *barrier* can't be canceled.  A barrier is used only to flush work
items.  The above comment means that we currently don't (or can't)
support canceling flush_work().  It has *nothing* to do with canceling
regular work items.  You can cancel work items fine.

> The root of our issue is that rpc_shutdown_client is never safe to
> call from a workqueue context - it loops until there are no more
> tasks, marking tasks as killed and waiting for them to be cleaned up
> in each task's own workqueue context.  Any tasks that have already
> been assigned to the same kworker thread will never have a chance to
> run this cleanup stage.
>
> When fixing this deadlock, Trond and I discussed changing how
> rpc_shutdown_client works (making it workqueue safe), but Trond felt
> that it'd be better to just not call it from a workqueue context and
> print a warning if it is.
>
> IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with
> no success), but I'd argue that even if that did work it would still
> be very easy to call rpc_shutdown_client from the wrong context and
> MUCH harder to detect it.  It's also unclear to me if setting rpciod
> workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...

It looks like you guys ended up in a weird place misled by wrong
analysis.  Unless you require more than one concurrent execution on
the same workqueue, WQ_MEM_RECLAIM guarantees forward progress.  It
won't deadlock because "a different work item is queued to the same
worker".  The whole thing is designed *exactly* to avoid problems like
that.  So, I'd strongly recommend looking again at why the deadlocks
are occurring.

Thanks.

-- 
tejun
* Re: nfsd oops on Linus' current tree.
  2013-01-03 22:03 ` Tejun Heo
@ 2013-01-03 23:08   ` J. Bruce Fields
  0 siblings, 0 replies; 23+ messages in thread
From: J. Bruce Fields @ 2013-01-03 23:08 UTC (permalink / raw)
To: Tejun Heo
Cc: Adamson, Dros, Myklebust, Trond, Dave Jones, Linux Kernel,
	linux-nfs@vger.kernel.org

On Thu, Jan 03, 2013 at 05:03:09PM -0500, Tejun Heo wrote:
> Hello, guys.
>
> On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote:
> > The deadlock we were seeing was:
> >
> > - task A gets queued on rpciod workqueue and assigned kworker-0:0
> > - task B gets queued on rpciod workqueue and assigned the same kworker (kworker-0:0)
> > - task A gets run, calls rpc_shutdown_client(), which will loop forever waiting for task B to run rpc_async_release()
> > - task B will never run rpc_async_release() - it can't run until kworker-0:0 is free, which won't happen until task A (rpc_shutdown_client) is done
> >
> > The same deadlock happened when we tried queuing the tasks on a
> > different workqueues -- queue_work() assigns the task to a kworker
> > thread and it's luck of the draw if it's the same kworker as task A.
> > We tried the different workqueue options, but nothing changed this
> > behavior.
>
> Work items don't get assigned to workers on queueing.  Idle workers
> pick up work items.

Oh, so that's why the case where we can't create a new worker is the
only case we should need the rescuers for.  Got it.  I think.

--b.

> A work item is directly assigned to a specific
> worker iff the worker is already executing that specific work item or
> the new work item is "linked" to the one it's currently executing.
> Currently, the only case where a linked work item is used is when
> flushing which is guaranteed to not introduce dependency the other way
> around.
>
> So, your diagnosis looks wrong to me.  If such problem existed, we
> would be seeing deadlocks all over the place.
>
> > Once a work struct is queued, there is no way to back out of the
> > deadlock. From kernel/workqueue.c:insert_wq_barrier comment:
>
> Yes, there are.  cancel_work[_sync]() do exactly that.
>
> > * Currently, a queued barrier can't be canceled.  This is because
> > * try_to_grab_pending() can't determine whether the work to be
> > * grabbed is at the head of the queue and thus can't clear LINKED
> > * flag of the previous work while there must be a valid next work
> > * after a work with LINKED flag set.
> >
> > So once a work struct is queued and there is an ordering dependency
> > (i.e. task A is before task B), there is no way to back task B out -
> > so we can't just call cancel_work() or something on task B in
> > rpc_shutdown_client.
>
> A *barrier* can't be canceled.  A barrier is used only to flush work
> items.  The above comment means that we currently don't (or can't)
> support canceling flush_work().  It has *nothing* to do with canceling
> regular work items.  You can cancel work items fine.
>
> > The root of our issue is that rpc_shutdown_client is never safe to
> > call from a workqueue context - it loops until there are no more
> > tasks, marking tasks as killed and waiting for them to be cleaned up
> > in each task's own workqueue context.  Any tasks that have already
> > been assigned to the same kworker thread will never have a chance to
> > run this cleanup stage.
> >
> > When fixing this deadlock, Trond and I discussed changing how
> > rpc_shutdown_client works (making it workqueue safe), but Trond felt
> > that it'd be better to just not call it from a workqueue context and
> > print a warning if it is.
> >
> > IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with
> > no success), but I'd argue that even if that did work it would still
> > be very easy to call rpc_shutdown_client from the wrong context and
> > MUCH harder to detect it.  It's also unclear to me if setting rpciod
> > workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...
>
> It looks like you guys ended up in a weird place misled by wrong
> analysis.  Unless you require more than one concurrent execution on
> the same workqueue, WQ_MEM_RECLAIM guarantees forward progress.  It
> won't deadlock because "a different work item is queued to the same
> worker".  The whole thing is designed *exactly* to avoid problems like
> that.  So, I'd strongly recommend looking again at why the deadlocks
> are occurring.
>
> Thanks.
>
> -- 
> tejun
* [PATCH] Revert "nfsd: warn on odd reply state in nfsd_vfs_read"
  2012-12-21 18:08 ` J. Bruce Fields
  2012-12-21 18:40   ` Myklebust, Trond
@ 2012-12-22  0:48   ` J. Bruce Fields
  1 sibling, 0 replies; 23+ messages in thread
From: J. Bruce Fields @ 2012-12-22 0:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Jones, Linux Kernel, linux-nfs, Weston Andros Adamson,
	Trond Myklebust

From: "J. Bruce Fields" <bfields@redhat.com>

This reverts commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e.

This is obviously wrong, and I have no idea how I missed seeing the
warning in testing: I must just not have looked at the right logs.  The
caller bumps rq_resused/rq_next_page, so it will always be hit on a
large enough read.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/vfs.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f0a6d88..d586117 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -934,7 +934,6 @@ nfsd_vfs_read(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 			.u.data = rqstp,
 		};
 
-		WARN_ON_ONCE(rqstp->rq_next_page != rqstp->rq_respages + 1);
 		rqstp->rq_next_page = rqstp->rq_respages + 1;
 		host_err = splice_direct_to_actor(file, &sd, nfsd_direct_splice_actor);
 	} else {
-- 
1.7.11.7
Thread overview: 23+ messages
2012-12-21 15:33 nfsd oops on Linus' current tree Dave Jones
2012-12-21 18:08 ` J. Bruce Fields
2012-12-21 18:40 ` Myklebust, Trond
2012-12-21 23:08 ` J. Bruce Fields
2012-12-21 23:15 ` Myklebust, Trond
2012-12-21 23:26 ` J. Bruce Fields
2012-12-21 23:36 ` Myklebust, Trond
2012-12-21 23:45 ` J. Bruce Fields
2013-01-03 16:28 ` Adamson, Dros
2013-01-03 20:11 ` J. Bruce Fields
2013-01-03 20:27 ` Adamson, Dros
2013-01-03 20:52 ` J. Bruce Fields
2013-01-03 22:15 ` Tejun Heo
2013-01-03 22:08 ` Tejun Heo
2013-01-03 22:12 ` Myklebust, Trond
2013-01-03 22:26 ` Tejun Heo
2013-01-03 22:34 ` Tejun Heo
2013-01-03 23:11 ` Myklebust, Trond
[not found] ` <1357254692.55285.33.camel@lade.trondhjem.org>
2013-01-03 23:26 ` Myklebust, Trond
2013-01-04 17:11 ` Adamson, Dros
2013-01-03 22:03 ` Tejun Heo
2013-01-03 23:08 ` J. Bruce Fields
2012-12-22 0:48 ` [PATCH] Revert "nfsd: warn on odd reply state in nfsd_vfs_read" J. Bruce Fields