From: Evgenii Shatokhin <eshatokhin@virtuozzo.com>
To: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Subject: Debugging a kernel crash in svc_process_common() on the client (NFS 4.1)
Date: Thu, 22 Nov 2018 14:32:39 +0000
Message-ID: <ffb9d35c-46a8-17f5-dbfa-c0cb2be444e0@virtuozzo.com>

Hi,

I am debugging rare kernel crashes in svc_process_common() (in the
'sunrpc' kernel module) that some of our customers have hit.
Unfortunately, I have not been able to reproduce them so far and can
see no obvious fix in the mainline kernel.

Any hints on how to debug the issue further could be very helpful.

The OS is Virtuozzo 7. The crashes happened with at least 2 kernels, 
based on 3.10.0-862.11.6 and 3.10.0-693.21.1 from RHEL.

The crashes happened when the customers' systems were writing backups of 
some data to NFS shares. NFS 4.1 was used. No RDMA.

The backtrace looks like this:

#0 svc_process_common [sunrpc]
#1 bc_svc_process [sunrpc]
#2 nfs41_callback_svc [nfsv4]
#3 kthread

Each time, the crash happened here:

	/* Setup reply header */
	rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp);

The 'struct svc_xprt' instance that rqstp->rq_xprt pointed to was filled
with invalid data, so dereferencing
rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr caused the crash.

I checked the crash dumps and found that the memory page allocated for
that 'struct svc_xprt' (or, to be exact, for the 'struct svc_sock' that
contains it) had already been given to another, unrelated process by
that time. So it seems that the processing of the backchannel request
on these NFS clients could race with something that called
svc_xprt_put() for that 'struct svc_xprt' instance.
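
To make the suspected pattern concrete, here is a small userspace model
of it. This is only a sketch under my own assumptions: it is
single-threaded, the names (xprt_model, xprt_get, xprt_put, rqst_model)
are hypothetical and are not the sunrpc API, and a 'freed' flag stands in
for the page being handed to another process. It only shows why a request
that keeps a bare pointer is exposed to a concurrent put, while a request
that holds its own reference is not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted transport object. */
struct xprt_model {
	int refcount;	/* number of holders */
	bool freed;	/* models "memory was released and reused" */
};

/* Hypothetical stand-in for a request that points at a transport. */
struct rqst_model {
	struct xprt_model *xprt;
};

static void xprt_init(struct xprt_model *x)
{
	x->refcount = 1;	/* the owner's reference */
	x->freed = false;
}

static void xprt_get(struct xprt_model *x)
{
	x->refcount++;
}

static void xprt_put(struct xprt_model *x)
{
	if (--x->refcount == 0)
		x->freed = true;	/* last reference dropped */
}

/* Returns true if the request can still safely use its transport. */
static bool process_request(const struct rqst_model *rq)
{
	return !rq->xprt->freed;
}

/* The request stores a bare pointer; teardown drops the last reference. */
static bool scenario_borrowed_pointer(void)
{
	struct xprt_model x;
	struct rqst_model rq;

	xprt_init(&x);
	rq.xprt = &x;		/* no reference taken */
	xprt_put(&x);		/* models the racing put */
	return process_request(&rq);
}

/* The request takes its own reference before teardown happens. */
static bool scenario_held_reference(void)
{
	struct xprt_model x;
	struct rqst_model rq;
	bool ok;

	xprt_init(&x);
	xprt_get(&x);		/* reference held by the request */
	rq.xprt = &x;
	xprt_put(&x);		/* owner's put: object stays alive */
	ok = process_request(&rq);
	xprt_put(&x);		/* request drops its reference */
	return ok;
}
```

In this model scenario_borrowed_pointer() reports that the transport is
no longer usable, while scenario_held_reference() reports that it is.
Whether the backchannel path really lacks such a reference is exactly
what I have not been able to establish.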

At first, I thought that it might be a race between a backchannel
request and umount of the NFS share (although I have no indication that
the customers' systems tried to unmount it). To check that, I added a
delay to bc_svc_process(), opened a file on an NFS share from one NFS
client, and replaced the file from another client so that the server
would recall the delegation and thus trigger a backchannel request.
Then I closed the files and tried to umount the NFS share. Everything
went OK, no crash: umount waited until the client had processed the
backchannel request, as it should have.

I am new to this code, so I might be missing something obvious. However,
I cannot see at the moment how bc_svc_process() could race with the
freeing of that 'struct svc_sock'.

Any ideas?

Regards,
Evgenii
