NFS TCP race condition with SOCK_ASYNC_NOSPACE

From: Andrew Cooper <andrew.cooper3@citrix.com>
To: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	<netdev@vger.kernel.org>
Subject: NFS TCP race condition with SOCK_ASYNC_NOSPACE
Date: Fri, 18 Nov 2011 18:40:01 +0000
Message-ID: <4EC6A681.30902@citrix.com>

Hello,

As described originally in
http://www.spinics.net/lists/linux-nfs/msg25314.html, we were
encountering a bug whereby the NFS session was unexpectedly timing out.

I believe I have found the source of the race condition causing the timeout.

Brief overview of setup:
  10Gb network, NFS mounted using TCP.  Problem reproduces with
multiple different NICs, with synchronous or asynchronous mounts, and
with soft and hard mounts.  Reproduces on 2.6.32 and I am currently
trying to reproduce with mainline. (I don't have physical access to the
servers so installing stuff is not fantastically easy)

In net/sunrpc/xprtsock.c:xs_tcp_send_request(), we try to write data to
the sock buffer using xs_sendpages().

When the sock buffer is nearly full, we get an EAGAIN from
xs_sendpages() which causes a break out of the loop.  Lower down the
function, we switch on status, which causes us to call xs_nospace() with
the task.

In xs_nospace(), we test the SOCK_ASYNC_NOSPACE bit from the socket, and
in the rare case where that bit is clear, we return 0 instead of
EAGAIN.  This promptly overwrites status in xs_tcp_send_request().
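
The status overwrite described above can be illustrated with a small
userspace model.  This is only a sketch built from the description in
this message, not the kernel source: every model_* name and struct is
hypothetical, locking and the connected/disconnected handling are left
out, and the send buffer is reduced to a byte counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the socket state; not the kernel's struct. */
struct model_sock {
	bool async_nospace;	/* models the SOCK_ASYNC_NOSPACE bit */
	size_t free_bytes;	/* room left in the send buffer */
};

/* Models xs_sendpages(): queue what fits, -EAGAIN once the buffer is full. */
static int model_sendpages(struct model_sock *sk, size_t len)
{
	size_t n;

	assert(sk != NULL);
	if (sk->free_bytes == 0)
		return -EAGAIN;
	n = len < sk->free_bytes ? len : sk->free_bytes;
	sk->free_bytes -= n;
	return (int)n;
}

/* Models xs_nospace() as described: the default return value is 0 and
 * -EAGAIN is only returned when the bit is set. */
static int model_nospace(const struct model_sock *sk)
{
	int ret = 0;

	if (sk->async_nospace)
		ret = -EAGAIN;
	return ret;
}

/* Models the loop and the switch in xs_tcp_send_request(). */
static int model_send_request(struct model_sock *sk, size_t len, size_t *sent)
{
	int status;

	*sent = 0;
	for (;;) {
		status = model_sendpages(sk, len - *sent);
		if (status < 0)
			break;
		*sent += (size_t)status;
		if (*sent >= len)
			return 0;
	}

	switch (status) {
	case -EAGAIN:
		/* The -EAGAIN from the send is replaced by whatever
		 * model_nospace() returns, which may be 0. */
		status = model_nospace(sk);
		break;
	default:
		break;
	}
	return status;
}
```

In this model, a 300 byte request sent into a socket with 100 bytes
free and the bit clear comes back with status 0 and only 100 bytes
sent, which is the inconsistent state described in the next paragraph.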

The result is that xs_tcp_release_xprt() finds a request which has no
error, but has not sent all of the bytes in its send buffer.  It cleans
up by setting XPRT_CLOSE_WAIT which causes xprt_clear_locked() to queue
xprt->task_cleanup, which closes the TCP connection.
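
As a rough illustration of the release-time check just described, the
sketch below models a request that reports no error but has only been
partly sent.  The model_* names are hypothetical and the code is a
userspace approximation based on this message, not the kernel
function.  Treating a request with zero bytes sent as needing no
cleanup is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's rpc_rqst / rpc_xprt. */
struct model_req {
	size_t bytes_sent;	/* bytes already written to the socket */
	size_t buf_len;		/* total length of the send buffer */
};

struct model_xprt {
	bool close_wait;	/* models the XPRT_CLOSE_WAIT state bit */
};

/* Models the check made when the transport is released: a partly sent
 * request leaves the TCP stream mid-record, so the connection is
 * marked for closing. */
static void model_release_xprt(struct model_xprt *xprt,
			       const struct model_req *req)
{
	assert(xprt != NULL && req != NULL);
	assert(req->bytes_sent <= req->buf_len);

	if (req->bytes_sent == 0)
		return;		/* assumption: nothing on the wire yet */
	if (req->bytes_sent == req->buf_len)
		return;		/* request fully sent */
	xprt->close_wait = true;
}
```

With bytes_sent = 100 and buf_len = 300, the state left behind when the
status is overwritten with 0, the model marks the transport for
closing.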

Under normal operation, the TCP connection goes down and back up without
interruption to the NFS layer.  However, when the NFS server hangs in a
half closed state, the client forces a RST of the TCP connection,
leading to the timeout.

I have tried a few naive fixes such as changing the default return value
in xs_nospace() from 0 to -EAGAIN (meaning that 0 will never be
returned) but this causes a kernel memory leak.  Can someone who has a
better understanding of these interactions than me have a look?  It
seems that the if (test_bit()) test in xs_nospace() should have an else
clause.

Thanks in advance,

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
