public inbox for linux-nfs@vger.kernel.org
 help / color / mirror / Atom feed
* Confused about BUG_ON in rpcb_getport_async
@ 2008-08-12  2:27 Neil Brown
       [not found] ` <18592.62730.840231.108375-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Brown @ 2008-08-12  2:27 UTC (permalink / raw)
  To: linux-nfs


Hi 
 I have a report of a the BUG_ON in rpcb_getport_clnt being
 triggered.  This is:
	/* Autobind on cloned rpc clients is discouraged */
	BUG_ON(clnt->cl_parent != clnt);

 It looks to me that while they might be discouraged, they are not
 prevented and so having the BUG_ON is wrong.

 When rpc_clone_client creates a clone, it sets cl_autobind to 0,
 and gives the new client a reference to the same cl_xprt as the
 original client.

 The only effect of cl_autobind is to prevent rpc_force_rebind from
 clearing the BOUND flag on ->cl_xprt.
 So while a call to rpc_force_rebind on the clone will not clear that
 flag, a call on the original client will clear that flag.

 So a cloned client can still end up with a ->cl_xprt with the BOUND
 flag clear.

 So call_bind (which is present in the call trace under the oops) can
 find that !xprt_bound, even when the client is a cloned client.

 When this happens, ->rpcbind, which is rpcb_getport_clnt, goes BOOM.

 What should happen when a clone client finds that its transport is no
 longer bound?  Should rpc_getport_async just do
    clnt = task->tk_client->cl_parent;
 ??

Perplexed,
NeilBrown

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Confused about BUG_ON in rpcb_getport_async
       [not found] ` <18592.62730.840231.108375-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2008-08-12 20:32   ` Trond Myklebust
  2008-08-13 23:45     ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Trond Myklebust @ 2008-08-12 20:32 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-nfs

On Tue, 2008-08-12 at 12:27 +1000, Neil Brown wrote:
> Hi 
>  I have a report of a the BUG_ON in rpcb_getport_clnt being
>  triggered.  This is:
> 	/* Autobind on cloned rpc clients is discouraged */
> 	BUG_ON(clnt->cl_parent != clnt);
> 
>  It looks to me that while they might be discouraged, they are not
>  prevented and so having the BUG_ON is wrong.
> 
>  When rpc_clone_client creates a clone, it sets cl_autobind to 0,
>  and gives the new client a reference to the same cl_xprt as the
>  original client.
> 
>  The only effect of cl_autobind is to prevent rpc_force_rebind from
>  clearing the BOUND flag on ->cl_xprt.
>  So while a call to rpc_force_rebind on the clone will not clear that
>  flag, a call on the original client will clear that flag.
> 
>  So a cloned client can still end up with a ->cl_xprt with the BOUND
>  flag clear.
> 
>  So call_bind (which is present in the call trace under the oops) can
>  find that !xprt_bound, even when the client is a cloned client.
> 
>  When this happens, ->rpcbind, which is rpcb_getport_clnt, goes BOOM.
> 
>  What should happen when a clone client finds that its transport is no
>  longer bound?  Should rpc_getport_async just do
>     clnt = task->tk_client->cl_parent;
>  ??

This shouldn't be a particularly urgent problem, since lockd should
normally be the only thing that needs to use the autobind functionality,
and it does not use cloned clients.

That said, I think that the answer is that cloned clients may indeed use
rpcbind as long as they share the same program number and version as the
parent.
IOW: we should probably rather BUG_ON() calls to rpc_bind_new_program()
if the parent has xprt->cl_autobind set.

Cheers
  Trond


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Confused about BUG_ON in rpcb_getport_async
  2008-08-12 20:32   ` Trond Myklebust
@ 2008-08-13 23:45     ` Neil Brown
       [not found]       ` <18595.29231.61262.220929-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Brown @ 2008-08-13 23:45 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On Tuesday August 12, trond.myklebust@fys.uio.no wrote:
> On Tue, 2008-08-12 at 12:27 +1000, Neil Brown wrote:
> > Hi 
> >  I have a report of a the BUG_ON in rpcb_getport_clnt being
> >  triggered.  This is:
> > 	/* Autobind on cloned rpc clients is discouraged */
> > 	BUG_ON(clnt->cl_parent != clnt);
> > 
> >  It looks to me that while they might be discouraged, they are not
> >  prevented and so having the BUG_ON is wrong.
> > 
> >  When rpc_clone_client creates a clone, it sets cl_autobind to 0,
> >  and gives the new client a reference to the same cl_xprt as the
> >  original client.
> > 
> >  The only effect of cl_autobind is to prevent rpc_force_rebind from
> >  clearing the BOUND flag on ->cl_xprt.
> >  So while a call to rpc_force_rebind on the clone will not clear that
> >  flag, a call on the original client will clear that flag.
> > 
> >  So a cloned client can still end up with a ->cl_xprt with the BOUND
> >  flag clear.
> > 
> >  So call_bind (which is present in the call trace under the oops) can
> >  find that !xprt_bound, even when the client is a cloned client.
> > 
> >  When this happens, ->rpcbind, which is rpcb_getport_clnt, goes BOOM.
> > 
> >  What should happen when a clone client finds that its transport is no
> >  longer bound?  Should rpc_getport_async just do
> >     clnt = task->tk_client->cl_parent;
> >  ??
> 
> This shouldn't be a particularly urgent problem, since lockd should
> normally be the only thing that needs to use the autobind functionality,
> and it does not use cloned clients.
> 
> That said, I think that the answer is that cloned clients may indeed use
> rpcbind as long as they share the same program number and version as the
> parent.
> IOW: we should probably rather BUG_ON() calls to rpc_bind_new_program()
> if the parent has xprt->cl_autobind set.

Just to confirm,  you are saying that the current BUG_ON is indeed
wrong, and the fact that it fires does not imply any real bug.
It should be removed.  Possibly a BUG_ON could be put somewhere else
to catch incorrect behaviour.

Is that correct?

I now have multiple people for whom this BUG_ON has fired.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Confused about BUG_ON in rpcb_getport_async
       [not found]       ` <18595.29231.61262.220929-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2008-08-14  0:10         ` Trond Myklebust
  2008-08-14  2:26           ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Trond Myklebust @ 2008-08-14  0:10 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-nfs

On Thu, 2008-08-14 at 09:45 +1000, Neil Brown wrote:
> Just to confirm,  you are saying that the current BUG_ON is indeed
> wrong, and the fact that it fires does not imply any real bug.
> It should be removed.  Possibly a BUG_ON could be put somewhere else
> to catch incorrect behaviour.
> 
> Is that correct?

Yes.

> I now have multiple people for whom this BUG_ON has fired.

Could you send us a trace? AFAIK it shouldn't really be happening...

Trond


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Confused about BUG_ON in rpcb_getport_async
  2008-08-14  0:10         ` Trond Myklebust
@ 2008-08-14  2:26           ` Neil Brown
  0 siblings, 0 replies; 5+ messages in thread
From: Neil Brown @ 2008-08-14  2:26 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On Wednesday August 13, trond.myklebust@fys.uio.no wrote:
> On Thu, 2008-08-14 at 09:45 +1000, Neil Brown wrote:
> > Just to confirm,  you are saying that the current BUG_ON is indeed
> > wrong, and the fact that it fires does not imply any real bug.
> > It should be removed.  Possibly a BUG_ON could be put somewhere else
> > to catch incorrect behaviour.
> > 
> > Is that correct?
> 
> Yes.

Thanks.

> 
> > I now have multiple people for whom this BUG_ON has fired.
> 
> Could you send us a trace? AFAIK it shouldn't really be happening...

https://bugzilla.novell.com/show_bug.cgi?id=415607
contains
Aug  7 09:55:42 copperhead klogd: ------------[ cut here ]------------
Aug  7 09:55:42 copperhead klogd: kernel BUG at net/sunrpc/rpcb_clnt.c:290!
Aug  7 09:55:42 copperhead klogd: invalid opcode: 0000 [1] SMP
Aug  7 09:55:42 copperhead klogd: last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
Aug  7 09:55:42 copperhead klogd: CPU 3
Aug  7 09:55:42 copperhead klogd: Modules linked in: lp parport_pc ppdev
parport nls_utf8 nfs lockd nfs_acl sunrpc autofs4 binfmt_misc iptable_filter
ip_tables ip6table_filter ip6_tables x_tables ipv6 bonding microcode
firmware_class fuse loop dm_mod rtc_cmos iTCO_wdt sg shpchp rtc_core i2c_i801
iTCO_vendor_support rtc_lib floppy container i2c_core i5000_edac usb_storage
sr_mod pci_hotplug button cdrom joydev e1000e edac_core usbhid hid ff_memless
ehci_hcd uhci_hcd sd_mod usbcore edd ext3 mbcache jbd fan ata_piix ahci libata
dock 3w_9xxx scsi_mod thermal processor [last unloaded: parport_pc]
Aug  7 09:55:42 copperhead klogd: Pid: 14342, comm: xauth Tainted: G        N 2.6.25.11-0.1-default #1
Aug  7 09:55:42 copperhead klogd: RIP: 0010:[<ffffffff883ad55a>] [<ffffffff883ad55a>] :sunrpc:rpcb_getport_async+0x66/0x460
Aug  7 09:55:42 copperhead klogd: RSP: 0018:ffff810110cff938  EFLAGS: 00010287
Aug  7 09:55:42 copperhead klogd: RAX: ffffffff883d11d0 RBX: ffff81022c8d5c00 RCX: 0000000000000001
Aug  7 09:55:42 copperhead klogd: RDX: 0000000000000000 RSI: ffff81010a543e40 RDI: ffff81010a543e40
Aug  7 09:55:42 copperhead klogd: RBP: ffff810110cffa58 R08: ffff81010a543f28 R09: 0000000000000001
Aug  7 09:55:42 copperhead klogd: R10: ffff810001031280 R11: ffff810110cffc18 R12: ffff81018b047800
Aug  7 09:55:42 copperhead klogd: R13: ffff81018b047800 R14: ffff81010a543e40 R15: ffff810229d40800
Aug  7 09:55:42 copperhead klogd: FS:  00007fb4bd7ed6f0(0000) GS:ffff81022f1037c0(0000) knlGS:0000000000000000
Aug  7 09:55:42 copperhead klogd: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug  7 09:55:42 copperhead klogd: CR2: 00007f3992dfcf00 CR3: 00000001d4601000 CR4: 00000000000006e0
Aug  7 09:55:42 copperhead klogd: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug  7 09:55:42 copperhead klogd: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug  7 09:55:42 copperhead klogd: Process xauth (pid: 14342, threadinfo ffff810110cfe000, task ffff810116954080)
Aug  7 09:55:42 copperhead klogd: Stack:  0000000000000246 ffff81022ad67370 ffff810110cff978 0000000000000246
Aug  7 09:55:42 copperhead klogd:  0000000000011200 ffff81022ad67340 ffff810110cff9a8 ffff810110cff9c0
Aug  7 09:55:42 copperhead klogd:  ffff810110cff988 ffffffff80276757 ffff810110cffa08 ffffffff8027685a
Aug  7 09:55:42 copperhead klogd: Call Trace: Aug  7 09:55:42 copperhead klogd:  [<ffffffff883a018b>] :sunrpc:call_bind+0x6f/0x74
Aug  7 09:55:42 copperhead klogd:  [<ffffffff883a72d9>] :sunrpc:__rpc_execute+0x86/0x246
Aug  7 09:55:42 copperhead klogd:  [<ffffffff883a751e>] :sunrpc:rpc_execute+0x85/0x8e
Aug  7 09:55:42 copperhead klogd:  [<ffffffff883a14a5>] :sunrpc:rpc_run_task+0x52/0x5a
Aug  7 09:55:42 copperhead klogd:  [<ffffffff883a1598>] :sunrpc:rpc_call_sync+0x3f/0x5d
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8840c0bb>] :nfs:nfs3_rpc_wrapper+0x22/0x5c
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8840c5b5>] :nfs:nfs3_proc_access+0xf3/0x15e
Aug  7 09:55:42 copperhead klogd:  [<ffffffff883ffde0>] :nfs:nfs_do_access+0x173/0x341
Aug  7 09:55:42 copperhead klogd:  [<ffffffff884000a3>] :nfs:nfs_permission+0xf5/0x166
Aug  7 09:55:42 copperhead klogd:  [<ffffffff802a7497>] permission+0xc7/0x118
Aug  7 09:55:42 copperhead klogd:  [<ffffffff802a7bbc>] lookup_hash+0x21/0x43
Aug  7 09:55:42 copperhead klogd:  [<ffffffff802ab6d8>] open_namei+0xe8/0x6a0
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8029ec5b>] do_filp_open+0x28/0x4b
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8029eccf>] do_sys_open+0x51/0xd2
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8029ed79>] sys_open+0x1b/0x1d
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8020bffa>] system_call_after_swapgs+0x8a/0x8f
Aug  7 09:55:42 copperhead klogd: DWARF2 unwinder stuck at system_call_after_swapgs+0x8a/0x8f
Aug  7 09:55:42 copperhead klogd:
Aug  7 09:55:42 copperhead klogd: Leftover inexact backtrace:
Aug  7 09:55:42 copperhead klogd:
Aug  7 09:55:42 copperhead syslog-ng[2774]: last message repeated 2 times
Aug  7 09:55:42 copperhead klogd: Code: 3b 88 44 8b 4b 44 44 8b 43 40 48 c7 c7
57 74 3b 88 48 8b 4b 50 89 04 24 31 c0 e8 ed 24 0a f8 4c 8b bb e8 00 00 00 49
39 df 74 04 <0f> 0b eb fe 49 8d 85 78 02 00 00 48 89 85 f8 fe ff ff f0 0f ba
Aug  7 09:55:42 copperhead klogd: RIP  [<ffffffff883ad55a>] :sunrpc:rpcb_getport_async+0x66/0x460
Aug  7 09:55:42 copperhead klogd:  RSP <ffff810110cff938>
Aug  7 09:55:42 copperhead klogd: ---[ end trace 4b516acb183486f8 ]---

Kernel was
   Linux copperhead 2.6.25.11-0.1-default #1 SMP 2008-07-13 20:48:28 +0200 x86_64 x86_64 x86_64 GNU/Linux

NeilBrown

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-08-14  2:27 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-08-12  2:27 Confused about BUG_ON in rpcb_getport_async Neil Brown
     [not found] ` <18592.62730.840231.108375-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2008-08-12 20:32   ` Trond Myklebust
2008-08-13 23:45     ` Neil Brown
     [not found]       ` <18595.29231.61262.220929-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2008-08-14  0:10         ` Trond Myklebust
2008-08-14  2:26           ` Neil Brown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox