* [NFSv4] 2.6.23-rc4 oops in nfs4_cb_recall...

From: Daniel J Blueman @ 2007-09-04 11:05 UTC
To: Trond Myklebust; +Cc: Linux Kernel, nfsv4

Hi Trond,

When accessing a directory inode from a single other client, NFSv4
callbacks catastrophically failed [1] on the NFS server with 2.6.23-rc4
(unpatched); both clients are running 2.6.22 (Ubuntu Gutsy build). It
seems hard to reproduce, since this kernel had been running smoothly on
the server for 7 days.

What information will help track this down, or is there a known failure
mechanism? I can map stack frames to source lines with objdump, if that
helps.

Daniel

---

[1] general protection fault: 0000 [1] SMP
CPU 1
Modules linked in: dvb_usb_dtt200u dvb_usb dvb_core firmware_class i2c_core uhci_hcd ehci_hcd usbcore
Pid: 24009, comm: nfs4_cb_recall Not tainted 2.6.23-rc4-109 #1
RIP: 0010:[xprt_reserve+217/384]  [xprt_reserve+217/384] xprt_reserve+0xd9/0x180
RSP: 0018:ffff81003905de20  EFLAGS: 00010286
RAX: ffff81000a2600a8 RBX: ffff81003a1d8780 RCX: 4d00000000610d00
RDX: 0b66656403100000 RSI: 0000000000000000 RDI: ffff81000a260000
RBP: ffff81003ebf3000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff81000a260000
R13: ffff81003ebf3460 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff81003fdd4180(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018
CR0: 000000008005003b CR2: 00002b456c102468 CR3: 000000003e047000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
Process nfs4_cb_recall (pid: 24009, threadinfo ffff81003905c000, task ffff81000a358000)
Stack:  ffff810032360040 0000000000000000 ffff81003a1d8780 ffffffff804d2170
 ffff81003a1d8870 ffffffff80483a3b 0000000000000000 ffff81003a1d8780
 ffff81003a1d8780 ffffffff804d2170 ffff81003905ded0 ffffffff8047d876
Call Trace:
 [__rpc_execute+107/656] __rpc_execute+0x6b/0x290
 [rpc_do_run_task+118/208] rpc_do_run_task+0x76/0xd0
 [rpc_call_sync+21/64] rpc_call_sync+0x15/0x40
 [nfsd4_cb_recall+258/304] nfsd4_cb_recall+0x102/0x130
 [do_recall+0/32] do_recall+0x0/0x20
 [do_recall+17/32] do_recall+0x11/0x20
 [kthread+75/128] kthread+0x4b/0x80
 [child_rip+10/18] child_rip+0xa/0x12
 [kthread+0/128] kthread+0x0/0x80

--
Daniel J Blueman
* Re: [NFSv4] 2.6.23-rc4 oops in nfs4_cb_recall...

From: J. Bruce Fields @ 2007-09-09 21:04 UTC
To: Daniel J Blueman; +Cc: Trond Myklebust, nfsv4, Linux Kernel

> When accessing a directory inode from a single other client, NFSv4
> callbacks catastrophically failed [1] on the NFS server with
> 2.6.23-rc4 (unpatched); clients are both 2.6.22 (Ubuntu Gutsy build).
> Seems not easy to reproduce, since this kernel was running smoothly
> for 7 days on the server.
>
> What information will help track this down, or is there a known
> failure mechanism?

I haven't seen that before.

> I can map stack frames to source lines with objdump, if that helps.

If it's still easy, it might help to figure out exactly where in
xprt_reserve() it died, and why. If we've got some race that can lead
to freeing the client while a callback is in progress, then perhaps this
is on the first dereference of xprt?

--b.
* Re: [NFSv4] 2.6.23-rc4 oops in nfs4_cb_recall...

From: Daniel J Blueman @ 2007-09-10 14:39 UTC
To: J. Bruce Fields; +Cc: Trond Myklebust, nfsv4, Linux Kernel

On 09/09/2007, J. Bruce Fields <bfields@fieldses.org> wrote:
> > When accessing a directory inode from a single other client, NFSv4
> > callbacks catastrophically failed [1] on the NFS server with
> > 2.6.23-rc4 (unpatched); clients are both 2.6.22 (Ubuntu Gutsy build).
> > Seems not easy to reproduce, since this kernel was running smoothly
> > for 7 days on the server.
> >
> > What information will help track this down, or is there a known
> > failure mechanism?
>
> I haven't seen that before.
>
> > I can map stack frames to source lines with objdump, if that helps.
>
> If it's still easy, it might help to figure out exactly where in
> xprt_reserve() it died, and why. If we've got some race that can lead
> to freeing the client while a callback is in progress, then perhaps this
> is on the first dereference of xprt?

I've raised the bug report in bugzilla, added observations from a second
occurrence, and disassembled xprt_reserve with line numbers:

http://bugzilla.kernel.org/show_bug.cgi?id=9003

Ping me for any more detail/info, and thanks!

Daniel

--
Daniel J Blueman
* Re: [NFSv4] 2.6.23-rc4 oops in nfs4_cb_recall...

From: J. Bruce Fields @ 2007-09-13 1:53 UTC
To: Daniel J Blueman; +Cc: Trond Myklebust, nfsv4, Linux Kernel

On Mon, Sep 10, 2007 at 03:39:23PM +0100, Daniel J Blueman wrote:
> On 09/09/2007, J. Bruce Fields <bfields@fieldses.org> wrote:
> > If it's still easy, it might help to figure out exactly where in
> > xprt_reserve() it died, and why. If we've got some race that can lead
> > to freeing the client while a callback is in progress, then perhaps this
> > is on the first dereference of xprt?
>
> I've raised the bug report in bugzilla, added observations from a
> second occurrence, and disassembled xprt_reserve with line numbers:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=9003
>
> Ping me for any more detail/info, and thanks!

If you or anyone else that's seen this problem could test the
following, that would be helpful. Thanks!

--b.

diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index c1cb7e0..9d536a8 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -486,6 +486,7 @@ out_put_cred:
 	/* Success or failure, now we're either waiting for lease expiration
 	 * or deleg_return. */
 	dprintk("NFSD: nfs4_cb_recall: dp %p dl_flock %p dl_count %d\n",dp, dp->dl_flock, atomic_read(&dp->dl_count));
+	put_nfs4_client(clp);
 	nfs4_put_delegation(dp);
 	return;
 }
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 6256492..6f182d2 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -358,9 +358,22 @@ alloc_client(struct xdr_netobj name)
 	return clp;
 }
 
+static void
+shutdown_callback_client(struct nfs4_client *clp)
+{
+	struct rpc_clnt *clnt = clp->cl_callback.cb_client;
+
+	/* shutdown rpc client, ending any outstanding recall rpcs */
+	if (clnt) {
+		clp->cl_callback.cb_client = NULL;
+		rpc_shutdown_client(clnt);
+	}
+}
+
 static inline void
 free_client(struct nfs4_client *clp)
 {
+	shutdown_callback_client(clp);
 	if (clp->cl_cred.cr_group_info)
 		put_group_info(clp->cl_cred.cr_group_info);
 	kfree(clp->cl_name.data);
@@ -375,18 +388,6 @@ put_nfs4_client(struct nfs4_client *clp)
 }
 
 static void
-shutdown_callback_client(struct nfs4_client *clp)
-{
-	struct rpc_clnt *clnt = clp->cl_callback.cb_client;
-
-	/* shutdown rpc client, ending any outstanding recall rpcs */
-	if (clnt) {
-		clp->cl_callback.cb_client = NULL;
-		rpc_shutdown_client(clnt);
-	}
-}
-
-static void
 expire_client(struct nfs4_client *clp)
 {
 	struct nfs4_stateowner *sop;
@@ -396,8 +397,6 @@ expire_client(struct nfs4_client *clp)
 	dprintk("NFSD: expire_client cl_count %d\n",
 			atomic_read(&clp->cl_count));
 
-	shutdown_callback_client(clp);
-
 	INIT_LIST_HEAD(&reaplist);
 	spin_lock(&recall_lock);
 	while (!list_empty(&clp->cl_delegations)) {
@@ -1346,6 +1345,7 @@ void nfsd_break_deleg_cb(struct file_lock *fl)
 	 * lock) we know the server hasn't removed the lease yet, we know
 	 * it's safe to take a reference: */
 	atomic_inc(&dp->dl_count);
+	atomic_inc(&dp->dl_client->cl_count);
 
 	spin_lock(&recall_lock);
 	list_add_tail(&dp->dl_recall_lru, &del_recall_lru);
@@ -1354,8 +1354,12 @@ void nfsd_break_deleg_cb(struct file_lock *fl)
 	/* only place dl_time is set. protected by lock_kernel*/
 	dp->dl_time = get_seconds();
 
-	/* XXX need to merge NFSD_LEASE_TIME with fs/locks.c:lease_break_time */
-	fl->fl_break_time = jiffies + NFSD_LEASE_TIME * HZ;
+	/*
+	 * We don't want the locks code to timeout the lease for us;
+	 * we'll remove it ourself if the delegation isn't returned
+	 * in time.
+	 */
+	fl->fl_break_time = 0;
 
 	t = kthread_run(do_recall, dp, "%s", "nfs4_cb_recall");
 	if (IS_ERR(t)) {
@@ -1364,6 +1368,7 @@ void nfsd_break_deleg_cb(struct file_lock *fl)
 		printk(KERN_INFO "NFSD: Callback thread failed for "
 			"for client (clientid %08x/%08x)\n",
 			clp->cl_clientid.cl_boot, clp->cl_clientid.cl_id);
+		put_nfs4_client(dp->dl_client);
 		nfs4_put_delegation(dp);
 	}
 }