public inbox for linux-nfs@vger.kernel.org
* Debugging a kernel crash in svc_process_common() on the client (NFS 4.1)
@ 2018-11-22 14:32 Evgenii Shatokhin
From: Evgenii Shatokhin @ 2018-11-22 14:32 UTC (permalink / raw)
  To: linux-nfs@vger.kernel.org; +Cc: Trond Myklebust

Hi,

I am debugging rare kernel crashes in svc_process_common() (in the
'sunrpc' kernel module) that some of our customers have hit.
Unfortunately, I have not been able to reproduce them so far, and I see
no obvious fix in the mainline kernel.

Any hints on how to debug the issue further would be very helpful.

The OS is Virtuozzo 7. The crashes happened with at least 2 kernels, 
based on 3.10.0-862.11.6 and 3.10.0-693.21.1 from RHEL.

The crashes happened when the customers' systems were writing backups of 
some data to NFS shares. NFS 4.1 was used. No RDMA.

The backtrace looks like this:

#0 svc_process_common [sunrpc]
#1 bc_svc_process [sunrpc]
#2 nfs41_callback_svc [nfsv4]
#3 kthread

Each time, the crash happened here:

	/* Setup reply header */
	rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp);

The 'struct svc_xprt' instance that rqstp->rq_xprt pointed to was filled
with invalid data, so dereferencing
rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr caused the crash.

I checked the crash dumps and found that, by the time of the crash, the
memory page that had held that 'struct svc_xprt' (or, to be exact, the
'struct svc_sock' that contains it) had already been given to another,
unrelated process. So it seems that the processing of the backchannel
request on these NFS clients may have raced with something that called
svc_xprt_put() for that 'struct svc_xprt' instance.
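
To make the suspected lifetime problem concrete, here is a minimal
userspace sketch of the rule I would expect to hold: whoever calls
through the transport's ops table must hold its own reference for the
duration of the call. This is not the sunrpc code. All model_* names are
hypothetical stand-ins, the structures are heavily simplified, and the
"concurrent" put is simulated in a single thread.

```c
#include <stdlib.h>

/* Simplified stand-ins for the kernel structures; NOT the real definitions. */
struct model_rqst;

struct model_xprt_ops {
	void (*prep_reply_hdr)(struct model_rqst *rqstp);
};

struct model_xprt {
	int refcount;	/* plain int: this model is single-threaded */
	const struct model_xprt_ops *ops;
};

struct model_rqst {
	struct model_xprt *xprt;
	int hdr_prepared;
};

static int model_free_count;	/* number of transports freed so far */

static void model_prep_reply_hdr(struct model_rqst *rqstp)
{
	rqstp->hdr_prepared = 1;
}

static const struct model_xprt_ops model_ops = {
	.prep_reply_hdr = model_prep_reply_hdr,
};

/* Allocate a transport with one reference, owned by the caller. */
static struct model_xprt *model_xprt_create(void)
{
	struct model_xprt *xprt = malloc(sizeof(*xprt));

	if (!xprt)
		abort();
	xprt->refcount = 1;
	xprt->ops = &model_ops;
	return xprt;
}

static void model_xprt_get(struct model_xprt *xprt)
{
	xprt->refcount++;
}

/* Drop one reference; free on the last one. Returns 1 if freed. */
static int model_xprt_put(struct model_xprt *xprt)
{
	if (--xprt->refcount > 0)
		return 0;
	free(xprt);
	model_free_count++;
	return 1;
}

/*
 * Model of the request path: pin the transport, call through its ops,
 * then unpin. If concurrent_put is set, the owner's reference is dropped
 * in the middle, standing in for the other side of the suspected race.
 * Returns 1 if the final put in this function freed the transport.
 */
static int model_process(struct model_rqst *rqstp, int concurrent_put)
{
	struct model_xprt *xprt = rqstp->xprt;

	model_xprt_get(xprt);
	if (concurrent_put)
		model_xprt_put(xprt);	/* owner lets go; our ref keeps it alive */
	xprt->ops->prep_reply_hdr(rqstp);
	return model_xprt_put(xprt);
}
```

In this model the owner's put does not free the object while the request
path still holds its reference, so the call through ops is safe. Without
the model_xprt_get() in model_process(), the same sequence would call
through freed memory, which is the kind of failure the crash dumps
suggest.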

At first, I thought it might be a race between a backchannel request
and umount of the NFS share (although I have no indication that the
customers' systems tried to unmount it). So I added a delay to
bc_svc_process(), opened a file on an NFS share from one NFS client, and
replaced the file from another client so that the server would recall
the delegation and thus trigger a backchannel request. Then I closed the
files and tried to umount the NFS share. Everything went OK, with no
crash: umount waited until the client had processed the backchannel
request, as it should have.

I am new to this code, so I might be missing something obvious. However,
at the moment I cannot see how bc_svc_process() could race with the
freeing of that 'struct svc_sock'.

Any ideas?

Regards,
Evgenii
