From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tom Tucker Subject: Re: [pnfs] nfs41: sunrpc: handle clnt==NULL in call_status Date: Wed, 05 Nov 2008 12:05:42 -0600 Message-ID: <4911E076.2030605@opengridcomputing.com> References: <273FE88A07F5D445824060902F700344029749C8@SACMVEXC1-PRD.hq.netapp.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Cc: "Labiaga, Ricardo" , Trond Myklebust , Benny Halevy , NFS list , pNFS Mailing List To: Chuck Lever Return-path: Received: from smtp.opengridcomputing.com ([209.198.142.2]:34325 "EHLO smtp.opengridcomputing.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752470AbYKESFm (ORCPT ); Wed, 5 Nov 2008 13:05:42 -0500 In-Reply-To: Sender: linux-nfs-owner@vger.kernel.org List-ID: Chuck Lever wrote: > Hey Ricardo- > > On Nov 4, 2008, at 9:40 PM, Labiaga, Ricardo wrote: >> Hi Tom, >> >> The thinking is that NFSRDMA could initially use a TCP callback channel. >> We'll implement BIND_CONN_TO_SESSION so that the backchannel does not >> need to be tied to the forechannel connection. This should address the >> case where you have NFSRDMA for the forechannel and TCP for the >> backchannel. BIND_CONN_TO_SESSION is also required to reestablish >> dropped connections effectively (to avoid losing the reply cache). >> >> We're first trying to get enough of the backchannel functionality >> implemented to make it useful. We can then build on that. Tom Talpey >> and I had a brief conversation about what would be needed to implement >> the backchannel over NFSRDMA some time ago. We didn't see anything >> inherently wrong with the current approach. Tom mentioned there needs >> to be some RDMA work during the pre-allocation. It will also need to >> inspect the RPC direction to determine if it needs to queue the request >> on the callback service, or proceed as usual with the processing of the >> request. > > Your approach is fine for a prototype. However... > > Now we have to have this conversation again when adding any new > transport capability, such as SCTP. > > Plus, to support NFSv4.1 over RDMA, you now _require_ the existence of > an IP network between client and server. This may be as minor as > ensuring IPoIB is configured on both ends, but it's still a significant > extra step for a pure SAN over RDMA environment. FWIW, We already require IPoIB because the NFSRDMA implementation uses the RDMA CM which is IP address based. This obviates the > whole point of combining the forward and back channels onto one transport. > > In my opinion you are making the developer's life simpler now and > complicating the lives of current users and admins, and future developers. > >>> -----Original Message----- >>> From: Tom Tucker [mailto:tom@opengridcomputing.com] >>> Sent: Tuesday, November 04, 2008 4:44 PM >>> To: Labiaga, Ricardo >>> Cc: Trond Myklebust; Benny Halevy; NFS list; pNFS Mailing List >>> Subject: Re: [pnfs] nfs41: sunrpc: handle clnt==NULL in call_status >>> >>> Ricardo: >>> >>> I don't mean to throw this whole discussion down a rat-hole, but the >>> current back channel architecture isn't transport independent, so >>> NFSRDMA on NFS41 is basically broken because all the back-channel >> logic >>> is in the xprtsock.c (transport specific) file. It also assumes that >> the >>> same socket (not transport, socket) is used for both fore and back >>> channel communication. >>> >>> What are the plans long-term for making this consistent with the >>> existing architecture? Does it have to be the same socket/connection >> for >>> some reason? >>> >>> Thanks, >>> Tom >>> >>> Labiaga, Ricardo wrote: >>>>> -----Original Message----- >>>>> From: Trond Myklebust [mailto:trond.myklebust@fys.uio.no] >>>>> Sent: Tuesday, November 04, 2008 10:19 AM >>>>> To: Benny Halevy >>>>> Cc: pNFS Mailing List; NFS list >>>>> Subject: Re: nfs41: sunrpc: handle clnt==NULL in call_status >>>>> >>>>> On Tue, 2008-11-04 at 17:52 +0200, Benny Halevy wrote: >>>>>> On the following error path task->tk_client == NULL causing NULL >>>> deref: >>>>>> Nov 4 14:14:32 tl1 kernel: Callback slot table overflowed >>>>>> Nov 4 14:14:32 tl1 kernel: BUG: unable to handle kernel NULL >>>> pointer >>>>> dereference at 0000000000000078 >>>>>> Nov 4 14:14:32 tl1 kernel: IP: [] >>>>> call_status+0xe1/0x145 [sunrpc] >>>>>> Nov 4 14:14:32 tl1 kernel: PGD 2f973067 PUD 2f974067 PMD 0 >>>>>> Nov 4 14:14:32 tl1 kernel: Oops: 0000 [1] SMP >>>>>> Nov 4 14:14:32 tl1 kernel: CPU 0 >>>>>> Nov 4 14:14:32 tl1 kernel: Modules linked in: panfs(P) >>>> panlayoutdriver >>>>> nfslayoutdriver nfsd auth_rpcgss exportfs nfs lockd nfs_acl fuse >>>> sunrpc >>>>> ipv6 cpufreq_ondemand powernow_k8 freq_table dm_mirror dm_log >>>> dm_multipath >>>>> dm_mod snd_hda_intel snd_seq_dummy sr_mod snd_seq_oss cdrom >>>>> snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss >>>>> snd_pcm snd_timer snd_page_alloc snd_hwdep snd pcspkr serio_raw >> k8temp >>>>> hwmon forcedeth soundcore pata_amd sg jedec_probe cfi_probe >> gen_probe >>>> mtd >>>>> i2c_nforce2 chipreg i2c_core button map_funcs sata_nv ata_generic >>>>> pata_acpi libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd >>>>> ehci_hcd [last unloaded: scsi_wait_scan] >>>>>> Nov 4 14:14:32 tl1 kernel: Pid: 3405, comm: nfsv4.1-svc Tainted: >> P >>>>> 2.6.27-pnfs #1 >>>>>> Nov 4 14:14:32 tl1 kernel: RIP: 0010:[] >>>>> [] call_status+0xe1/0x145 [sunrpc] >>>>>> Nov 4 14:14:32 tl1 kernel: RSP: 0018:ffff88002a70bd60 EFLAGS: >>>> 00010246 >>>>>> Nov 4 14:14:32 tl1 kernel: RAX: 0000000000000000 RBX: >>>> 00000000ffffff95 >>>>> RCX: 0000000000000000 >>>>>> Nov 4 14:14:32 tl1 kernel: RDX: ffff88002a70bd10 RSI: >>>> ffff88002fc52dc0 >>>>> RDI: ffff88002fc52dc0 >>>>>> Nov 4 14:14:32 tl1 kernel: RBP: ffff88002a70bd80 R08: >>>> ffff88002f9dae40 >>>>> R09: ffffffff81489e60 >>>>>> Nov 4 14:14:32 tl1 kernel: R10: ffff880038497800 R11: >>>> ffff88002a70bd30 >>>>> R12: ffff88002fc52dc0 >>>>>> Nov 4 14:14:32 tl1 kernel: R13: 0000000000000000 R14: >>>> ffff88002f57fa00 >>>>> R15: ffff88002fc52e50 >>>>>> Nov 4 14:14:32 tl1 kernel: FS: 00007fe78ffac6f0(0000) >>>>> GS:ffffffff8148da80(0000) knlGS:0000000000000000 >>>>>> Nov 4 14:14:32 tl1 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: >>>>> 000000008005003b >>>>>> Nov 4 14:14:32 tl1 kernel: CR2: 0000000000000078 CR3: >>>> 000000002f971000 >>>>> CR4: 00000000000006e0 >>>>>> Nov 4 14:14:32 tl1 kernel: DR0: 0000000000000000 DR1: >>>> 0000000000000000 >>>>> DR2: 0000000000000000 >>>>>> Nov 4 14:14:32 tl1 kernel: DR3: 0000000000000000 DR6: >>>> 00000000ffff0ff0 >>>>> DR7: 0000000000000400 >>>>>> Nov 4 14:14:32 tl1 kernel: Process nfsv4.1-svc (pid: 3405, >>>> threadinfo >>>>> ffff88002a70a000, task ffff88003694d900) >>>>>> Nov 4 14:14:32 tl1 kernel: Stack: 0000000000000000 >>>> ffff88002f57fa00 >>>>> ffff88002fc52dc0 ffff880030450ce8 >>>>>> Nov 4 14:14:32 tl1 kernel: ffff88002a70bdc0 ffffffffa02bb9db >>>>> ffff88002a70bdf0 ffff88002fc52dc0 >>>>>> Nov 4 14:14:32 tl1 kernel: ffff88002f57fa00 ffff88002f57fa00 >>>>> ffff880030450ce8 ffff88002a734000 >>>>>> Nov 4 14:14:32 tl1 kernel: Call Trace: >>>>>> Nov 4 14:14:32 tl1 kernel: [] >>>>> __rpc_execute+0x7e/0x21f [sunrpc] >>>>>> Nov 4 14:14:32 tl1 kernel: [] >>>> rpc_execute+0x28/0x2c >>>>> [sunrpc] >>>>>> Nov 4 14:14:32 tl1 kernel: [] >>>>> rpc_run_bc_task+0x88/0xae [sunrpc] >>>>>> Nov 4 14:14:32 tl1 kernel: [] bc_send+0x35/0x5a >>>>> [sunrpc] >>>>>> Nov 4 14:14:32 tl1 kernel: [] >>>>> bc_svc_process+0xf6/0x101 [sunrpc] >>>>>> Nov 4 14:14:32 tl1 kernel: [] ? >>>>> prepare_to_wait+0x5c/0x65 >>>>>> Nov 4 14:14:32 tl1 kernel: [] >>>>> nfs41_callback_svc+0xda/0x131 [nfs] >>>>>> Nov 4 14:14:32 tl1 kernel: [] ? >>>>> autoremove_wake_function+0x0/0x38 >>>>>> Nov 4 14:14:32 tl1 kernel: [] ? >>>>> nfs41_callback_svc+0x0/0x131 [nfs] >>>>>> Nov 4 14:14:32 tl1 kernel: [] kthread+0x49/0x76 >>>>>> Nov 4 14:14:32 tl1 kernel: [] >> child_rip+0xa/0x11 >>>>>> Nov 4 14:14:32 tl1 kernel: [] ? >>>>> pick_next_task_fair+0x93/0xa3 >>>>>> Nov 4 14:14:32 tl1 kernel: [] ? >> kthread+0x0/0x76 >>>>>> Nov 4 14:14:32 tl1 kernel: [] ? >>>> child_rip+0x0/0x11 >>>>>> Nov 4 14:14:32 tl1 kernel: >>>>>> Nov 4 14:14:32 tl1 kernel: >>>>>> Nov 4 14:14:32 tl1 kernel: Code: e8 7c 70 00 00 49 8b 44 24 20 49 >>>> c7 44 >>>>> 24 68 dc 48 2b a0 f6 40 78 02 74 6d 41 8b b6 34 01 00 00 48 8b 78 >> 30 >>>> e8 b9 >>>>> 2a 00 00 eb 5b <41> f6 45 78 04 74 13 49 8b 55 30 48 8d 82 80 04 00 >> 00 >>>> f0 >>>>> 80 a2 >>>>>> Nov 4 14:14:32 tl1 kernel: RIP [] >>>>> call_status+0xe1/0x145 [sunrpc] >>>>>> Nov 4 14:14:32 tl1 kernel: RSP >>>>>> Nov 4 14:14:32 tl1 kernel: CR2: 0000000000000078 >>>>>> Nov 4 14:14:32 tl1 kernel: ---[ end trace 1584c489f2573aaf ]--- >>>>>> >>>>>> Signed-off-by: Benny Halevy >>>>>> --- >>>>>> >>>>>> Trond, I'm not sure if this can happen without nfs41. >>>>>> However, please consider this patch for upstream since >>>>>> it is safe to do in any case. >>>>>> >>>>>> Benny >>>>>> >>>>>> net/sunrpc/clnt.c | 8 +++++--- >>>>>> 1 files changed, 5 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c >>>>>> index 78fc483..b555d9f 100644 >>>>>> --- a/net/sunrpc/clnt.c >>>>>> +++ b/net/sunrpc/clnt.c >>>>>> @@ -1206,7 +1206,8 @@ call_status(struct rpc_task *task) >>>>>> break; >>>>>> case -ECONNREFUSED: >>>>>> case -ENOTCONN: >>>>>> - rpc_force_rebind(clnt); >>>>>> + if (clnt) >>>>>> + rpc_force_rebind(clnt); >>>>>> task->tk_action = call_bind; >>>>>> break; >>>>>> case -EAGAIN: >>>>>> @@ -1217,9 +1218,10 @@ call_status(struct rpc_task *task) >>>>>> rpc_exit(task, status); >>>>>> break; >>>>>> default: >>>>>> - if (clnt->cl_chatty) >>>>>> + if (!clnt || clnt->cl_chatty) >>>>>> printk("%s: RPC call returned error %d\n", >>>>>> - clnt->cl_protname, -status); >>>>>> + clnt ? clnt->cl_protname : ">>> protocol>", >>>>>> + -status); >>>>>> rpc_exit(task, status); >>>>>> } >>>>>> } >>>>> BIG NACK! >>>>> >>>>> How does even it make sense for a task to get past call_transmit >> and >>>>> call_status without having task->tk_client set? This sounds like >>>> serious >>>>> borkenness in the nfsv4.1 patches... >>>> >>>> The "Callback slot table overflowed" message means we couldn't >> obtain a >>>> pre-allocated rpc_rqst to process the callback. When this occurs, >> the >>>> client sets the XPRT_CLOSE_WAIT bit to close the connection. (As an >>>> aside, should we be using xprt_force_disconnect() instead?). >>>> >>>> This leads me to believe the server exceeded the number of >> outstanding >>>> requests allowed on the backchannel (slot count of one at the >> moment), >>>> the new request caused us to close the connection and pulled the rug >>>> from under the callback service. I'll investigate further. >>>> >>>> Is this reproducible? >>>> >>>> - ricardo >>>> >>>> >>>>> Trond >>>>> ... >>>>> >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe >> linux-nfs" >>>> in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> _______________________________________________ >>>> pNFS mailing list >>>> pNFS@linux-nfs.org >>>> http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > Chuck Lever > chuck[dot]lever[at]oracle[dot]com > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html