From: Richard Laager <rlaager@wiktel.com>
To: NeilBrown <nfbrown@novell.com>,
trond.myklebust@primarydata.com,
Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Subject: Re: PROBLEM: NFS Client Ignores TCP Resets
Date: Thu, 7 Apr 2016 04:45:55 -0500
Message-ID: <57062C53.9080102@wiktel.com>
In-Reply-To: <87twjjpcl8.fsf@notabene.neil.brown.name>
On 04/02/2016 10:58 PM, NeilBrown wrote:
> On Sun, Feb 14 2016, Richard Laager wrote:
>
>> [1.] One line summary of the problem:
>>
>> NFS Client Ignores TCP Resets
>>
>> [2.] Full description of the problem/report:
>>
>> Steps to reproduce:
>> 1) Mount NFS share from HA cluster with TCP.
>> 2) Fail over the HA cluster. (The NFS server's IP address moves from one
>> machine to the other.)
>> 3) Access the mounted NFS share from the client (an `ls` is sufficient).
>>
>> Expected results:
>> Accessing the NFS mount works fine immediately.
>>
>> Actual results:
>> Accessing the NFS mount hangs for 5 minutes. Then the TCP connection
>> times out, a new connection is established, and it works fine again.
>>
>> After the IP moves, the new server responds to the client with TCP RST
>> packets, just as I would expect. I would expect the client to tear down
>> its TCP connection immediately and re-establish a new one. But it
>> doesn't. Am I confused, or is this a bug?
>>
>> For the duration of this test, all iptables firewalling was disabled on
>> the client machine.
>>
>> I have a packet capture of a minimized test (just a simple ls):
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1542826/+attachment/4571304/+files/dovecot-test.upstream-kernel.pcap
>
> I notice that the server sends packets from a different MAC address to
> the one it advertises in ARP replies (and the one the client sends to).
> This is probably normal - maybe you have two interfaces bonded together?
>
> Maybe it would help to be explicit about the network configuration
> between client and server - are there switches? soft or hard?
>
> Where is tcpdump being run? On the (virtual) client, or on the
> (physical) host or elsewhere?
Yes, there is link bonding happening on both sides. Details below.
This test was run from a VM (for testing purposes), but the problem is
equally reproducible on just the host, with or without this VLAN
attached to a bridge. That is, whether we put the NFS client IP on
bond0 (with no br9 existing) or put it on br9, we get the same behavior
using NFS from the host.
I believe I was running the packet capture from inside the VM.
+------------------------------+
| Host |
| |
| +------+ |
| | VM | |
| | | |
| | eth0 | |
| +------+ |
| | VM's eth0 |
| | is e.g. |
| | vnet0 on |
| | the host |
| | |
| TCP/IP -------+ br9 |
| Stack | |
| | |
| | |
| | bond0 |
| +-------+------+ |
| p5p1 | | p6p1 |
| | | |
+-------| |-------+
| |
10GbE | | 10GbE
| |
+----------+ +----------+
| Switch 1 |20Gb| Switch 2 |
| |====| |
+----------+ +----------+
| |
10GbE | | 10GbE
| |
+-------| |-------+
| | | |
| oce0 | | oce1 |
| +-------+------+ |
| | ipmp0 |
| | |
| TCP/IP -------+ |
| Stack |
| |
| Storage Head |
+------------------------------+
The switches behave like a single, larger virtual switch.
The VM host is doing actual 802.3ad LAG, whereas the storage heads are
doing Solaris's link-based IPMP.
There are two storage heads, each with two physical interfaces:
krls1:
oce0: 00:90:fa:34:f3:be
oce1: 00:90:fa:34:f3:c2
krls2:
oce0: 00:90:fa:34:f3:3e
oce1: 00:90:fa:34:f3:42
The failover event in the original packet capture was failing over from
krls1 to krls2.
...
> If you were up to building your own kernel, I would suggest putting some
> printks in tcp_validate_incoming() (in net/ipv4/tcp_input.c).
>
> Print a message if th->rst is ever set, and another if the
> tcp_sequence() test causes it to be discarded. It shouldn't, but
> something seems to be discarding it somewhere...
I added the changes you suggested:
--- tcp_input.c.orig	2016-04-07 04:11:07.907669997 -0500
+++ tcp_input.c	2016-04-04 19:41:09.661590000 -0500
@@ -5133,6 +5133,11 @@
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	if (th->rst)
+	{
+		printk(KERN_WARNING "Received RST segment.\n");
+	}
+
 	/* RFC1323: H1. Apply PAWS check first. */
 	if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp &&
 	    tcp_paws_discard(sk, skb)) {
@@ -5163,6 +5168,20 @@
 					  &tp->last_oow_ack_time))
 				tcp_send_dupack(sk, skb);
 		}
+		if (th->rst)
+		{
+			printk(KERN_WARNING "Discarding RST segment due to tcp_sequence()\n");
+			if (before(TCP_SKB_CB(skb)->end_seq, tp->rcv_wup))
+			{
+				printk(KERN_WARNING "RST segment failed before test: %u %u\n",
+				       TCP_SKB_CB(skb)->end_seq, tp->rcv_wup);
+			}
+			if (after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
+			{
+				printk(KERN_WARNING "RST segment failed after test: %u %u %u\n",
+				       TCP_SKB_CB(skb)->seq, tp->rcv_nxt, tcp_receive_window(tp));
+			}
+		}
 		goto discard;
 	}
@@ -5174,10 +5193,13 @@
 	 * else
 	 *     Send a challenge ACK
 	 */
-	if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt)
+	if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+		printk(KERN_WARNING "Accepted RST segment\n");
 		tcp_reset(sk);
-	else
+	} else {
+		printk(KERN_WARNING "Sending challenge ACK for RST segment\n");
 		tcp_send_challenge_ack(sk, skb);
+	}
 	goto discard;
 }
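For reference, the acceptance logic this patch instruments can be modeled
in a few lines of Python. This is a simplified sketch of the RFC 5961
section 3.2 behavior as implemented by tcp_validate_incoming(), not the
actual kernel code; the 32-bit wraparound comparison is the part that
matters:

```python
MOD = 1 << 32

def before(a, b):
    # 32-bit sequence comparison with wraparound, like the kernel's
    # before(): true when the signed 32-bit difference (a - b) is negative
    return ((a - b) % MOD) >= (1 << 31)

def after(a, b):
    return before(b, a)

def classify_rst(seq, end_seq, rcv_nxt, rcv_wup, window):
    """Return what the stack should do with an incoming RST segment."""
    # tcp_sequence(): discard anything outside the receive window
    if before(end_seq, rcv_wup) or after(seq, (rcv_nxt + window) % MOD):
        return "discard"
    # RFC 5961 3.2: an exact sequence match resets the connection; an
    # in-window but inexact match only draws a challenge ACK, so a blind
    # attacker cannot kill the connection
    if seq == rcv_nxt:
        return "reset"
    return "challenge-ack"

print(classify_rst(1000, 1000, 1000, 900, 65535))  # exact match
print(classify_rst(1500, 1500, 1000, 900, 65535))  # in-window, inexact
print(classify_rst(500, 500, 1000, 900, 65535))    # stale segment
```

On this model, a legitimate RST from the new server (whose sequence
numbers track the old connection) should land in "reset" or at worst
"challenge-ack" — never silently vanish, which is why the missing
"Received RST segment." printk below is the interesting datum.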
...reordered quoted text...
> Can you create a TCP connection to some other port on the server
> (telnet? ssh? http?) and see what happens to it on fail-over?
> You would need some protocol that the server won't quickly close.
> Maybe just "telnet SERVER 2049" and don't type anything until after the
> failover.
>
> If that closes quickly, then maybe it is an NFS bug. If that persists
> for a long timeout before closing, then it must be a network bug -
> either in the network code or the network hardware.
> In that case, netdev@vger.kernel.org might be the best place to ask.
I tried "telnet 10.20.0.30 22". I got the SSH header. I sent no input,
forced a storage cluster failover, and then hit enter after the
failover was complete. The ssh connection immediately terminated. My
tcp_validate_incoming() debugging code, as expected, showed "Received
RST segment." and "Accepted RST segment". These correspond to the one
RST packet I received on the SSH connection.
In a separate failover event, I tested accessing NFS over TCP. I do
*not* get "Received RST segment." So I conclude that
tcp_validate_incoming() is not being called.
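The expected client-side behavior when an established connection receives
a RST can also be demonstrated with plain sockets. This loopback sketch
(hypothetical port number) uses SO_LINGER with a zero timeout to make the
server's close() emit a RST rather than a FIN, and confirms the client's
next read fails immediately — the same thing the telnet test showed:

```python
import socket
import struct
import threading

PORT = 50907  # arbitrary loopback port; adjust if it collides

def rst_server():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    # SO_LINGER with l_onoff=1, l_linger=0 makes close() send RST, not FIN
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()
    srv.close()

t = threading.Thread(target=rst_server)
t.start()
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", PORT))
t.join()  # server has closed with RST by now
try:
    cli.recv(1)  # the pending RST aborts the read
    result = "no error"
except ConnectionResetError:
    result = "connection reset"
finally:
    cli.close()
print(result)
```

That immediate "connection reset" is what the SSH connection showed on
failover, and what the NFS connection should show but does not.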
Any thoughts on what that means or where to go from here?
--
Richard
Thread overview:
2016-02-14 2:24 PROBLEM: NFS Client Ignores TCP Resets Richard Laager
2016-03-08 17:06 ` Richard Laager
2016-03-09 21:16 ` Anna Schumaker
2016-03-09 21:42 ` Richard Laager
2016-03-11 9:44 ` Richard Laager
2016-04-02 1:43 ` Richard Laager
2016-04-03 3:58 ` NeilBrown
2016-04-07 9:45 ` Richard Laager [this message]
2016-04-08 0:47 ` NeilBrown
2017-10-02 19:29 ` Olga Kornievskaia
2017-10-02 22:13 ` NeilBrown
[not found] ` <CAN-5tyHuuBJxwqFLkiZa5ktBk7ypCJxmZ9creeD_RGWbK4Xn3A@mail.gmail.com>
2017-10-02 19:48 ` Richard Laager
2017-10-02 22:03 ` Olga Kornievskaia
2017-10-03 0:09 ` Richard Laager