From: Richard Laager <rlaager@wiktel.com>
To: NeilBrown <nfbrown@novell.com>,
trond.myklebust@primarydata.com,
Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Subject: Re: PROBLEM: NFS Client Ignores TCP Resets
Date: Thu, 7 Apr 2016 04:45:55 -0500
Message-ID: <57062C53.9080102@wiktel.com>
In-Reply-To: <87twjjpcl8.fsf@notabene.neil.brown.name>
On 04/02/2016 10:58 PM, NeilBrown wrote:
> On Sun, Feb 14 2016, Richard Laager wrote:
>
>> [1.] One line summary of the problem:
>>
>> NFS Client Ignores TCP Resets
>>
>> [2.] Full description of the problem/report:
>>
>> Steps to reproduce:
>> 1) Mount NFS share from HA cluster with TCP.
>> 2) Fail over the HA cluster. (The NFS server's IP address moves from one
>> machine to the other.)
>> 3) Access the mounted NFS share from the client (an `ls` is sufficient).
>>
>> Expected results:
>> Accessing the NFS mount works fine immediately.
>>
>> Actual results:
>> Accessing the NFS mount hangs for 5 minutes. Then the TCP connection
>> times out, a new connection is established, and it works fine again.
>>
>> After the IP moves, the new server responds to the client with TCP RST
>> packets, just as I would expect. I would expect the client to tear down
>> its TCP connection immediately and re-establish a new one. But it
>> doesn't. Am I confused, or is this a bug?
>>
>> For the duration of this test, all iptables firewalling was disabled on
>> the client machine.
>>
>> I have a packet capture of a minimized test (just a simple ls):
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1542826/+attachment/4571304/+files/dovecot-test.upstream-kernel.pcap
>
> I notice that the server sends packets from a different MAC address to
> the one it advertises in ARP replies (and the one the client sends to).
> This is probably normal - maybe you have two interfaces bonded together?
>
> Maybe it would help to be explicit about the network configuration
> between client and server - are there switches? soft or hard?
>
> Where is tcpdump being run? On the (virtual) client, or on the
> (physical) host or elsewhere?
Yes, there is link bonding happening on both sides. Details below.
This test was run from a VM (for testing purposes), but the problem is
equally reproducible on just the host, with or without this VLAN
attached to a bridge. That is, whether we put the NFS client IP on
bond0 (with no br9 existing) or put it on br9, we get the same behavior
using NFS from the host.
I believe I was running the packet capture from inside the VM.
+------------------------------+
| Host |
| |
| +------+ |
| | VM | |
| | | |
| | eth0 | |
| +------+ |
| | VM's eth0 |
| | is e.g. |
| | vnet0 on |
| | the host |
| | |
| TCP/IP -------+ br9 |
| Stack | |
| | |
| | |
| | bond0 |
| +-------+------+ |
| p5p1 | | p6p1 |
| | | |
+-------| |-------+
| |
10GbE | | 10GbE
| |
+----------+ +----------+
| Switch 1 |20Gb| Switch 2 |
| |====| |
+----------+ +----------+
| |
10GbE | | 10GbE
| |
+-------| |-------+
| | | |
| oce0 | | oce1 |
| +-------+------+ |
| | ipmp0 |
| | |
| TCP/IP -------+ |
| Stack |
| |
| Storage Head |
+------------------------------+
The switches behave like a single, larger virtual switch.
The VM host is doing actual 802.3ad LAG, whereas the storage heads are
doing Solaris's link-based IPMP.
There are two storage heads, each with two physical interfaces:
krls1:
oce0: 00:90:fa:34:f3:be
oce1: 00:90:fa:34:f3:c2
krls2:
oce0: 00:90:fa:34:f3:3e
oce1: 00:90:fa:34:f3:42
The failover event in the original packet capture was failing over from
krls1 to krls2.
...
> If you were up to building your own kernel, I would suggest putting some
> printks in tcp_validate_incoming() (in net/ipv4/tcp_input.c).
>
> Print a message if th->rst is ever set, and another if the
> tcp_sequence() test causes it to be discarded. It shouldn't, but
> something seems to be discarding it somewhere...
I added the changes you suggested:
--- tcp_input.c.orig	2016-04-07 04:11:07.907669997 -0500
+++ tcp_input.c	2016-04-04 19:41:09.661590000 -0500
@@ -5133,6 +5133,11 @@
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	if (th->rst)
+	{
+		printk(KERN_WARNING "Received RST segment.\n");
+	}
+
 	/* RFC1323: H1. Apply PAWS check first. */
 	if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp &&
 	    tcp_paws_discard(sk, skb)) {
@@ -5163,6 +5168,20 @@
 					  &tp->last_oow_ack_time))
 				tcp_send_dupack(sk, skb);
 		}
+		if (th->rst)
+		{
+			printk(KERN_WARNING "Discarding RST segment due to tcp_sequence()\n");
+			if (before(TCP_SKB_CB(skb)->end_seq, tp->rcv_wup))
+			{
+				printk(KERN_WARNING "RST segment failed before test: %u %u\n",
+				       TCP_SKB_CB(skb)->end_seq, tp->rcv_wup);
+			}
+			if (after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
+			{
+				printk(KERN_WARNING "RST segment failed after test: %u %u %u\n",
+				       TCP_SKB_CB(skb)->seq, tp->rcv_nxt, tcp_receive_window(tp));
+			}
+		}
 		goto discard;
 	}
@@ -5174,10 +5193,13 @@
 	 * else
 	 *     Send a challenge ACK
 	 */
-	if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt)
+	if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+		printk(KERN_WARNING "Accepted RST segment\n");
 		tcp_reset(sk);
-	else
+	} else {
+		printk(KERN_WARNING "Sending challenge ACK for RST segment\n");
 		tcp_send_challenge_ack(sk, skb);
+	}
 	goto discard;
 }
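For reference, the acceptance logic this patch instruments can be modeled
in a few lines of Python. This is a simplified sketch of the RFC 5961
section 3.2 behavior as implemented by tcp_validate_incoming(), not the
actual kernel code; the 32-bit wraparound comparison is the part that
matters:

```python
MOD = 1 << 32

def before(a, b):
    # 32-bit sequence comparison with wraparound, like the kernel's
    # before(): true when the signed 32-bit difference (a - b) is negative
    return ((a - b) % MOD) >= (1 << 31)

def after(a, b):
    return before(b, a)

def classify_rst(seq, end_seq, rcv_nxt, rcv_wup, window):
    """Return what the stack should do with an incoming RST segment."""
    # tcp_sequence(): discard anything outside the receive window
    if before(end_seq, rcv_wup) or after(seq, (rcv_nxt + window) % MOD):
        return "discard"
    # RFC 5961 3.2: an exact sequence match resets the connection; an
    # in-window but inexact match only draws a challenge ACK, so a blind
    # attacker cannot kill the connection
    if seq == rcv_nxt:
        return "reset"
    return "challenge-ack"

print(classify_rst(1000, 1000, 1000, 900, 65535))  # exact match
print(classify_rst(1500, 1500, 1000, 900, 65535))  # in-window, inexact
print(classify_rst(500, 500, 1000, 900, 65535))    # stale segment
```

On this model, a legitimate RST from the new server (whose sequence
numbers track the old connection) should land in "reset" or at worst
"challenge-ack" — never silently vanish, which is why the missing
"Received RST segment." printk below is the interesting datum.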
...reordered quoted text...
> Can you create a TCP connection to some other port on the server
> (telnet? ssh? http?) and see what happens to it on fail-over?
> You would need some protocol that the server won't quickly close.
> Maybe just "telnet SERVER 2049" and don't type anything until after the
> failover.
>
> If that closes quickly, then maybe it is an NFS bug. If that persists
> for a long timeout before closing, then it must be a network bug -
> either in the network code or the network hardware.
> In that case, netdev@vger.kernel.org might be the best place to ask.
I tried "telnet 10.20.0.30 22". I got the SSH header. I sent no input,
forced a storage cluster failover, and then hit enter after the
failover was complete. The ssh connection immediately terminated. My
tcp_validate_incoming() debugging code, as expected, showed "Received
RST segment." and "Accepted RST segment". These correspond to the one
RST packet I received on the SSH connection.
In a separate failover event, I tested accessing NFS over TCP. I do
*not* get "Received RST segment." So I conclude that
tcp_validate_incoming() is not being called.
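The expected client-side behavior when an established connection receives
a RST can also be demonstrated with plain sockets. This loopback sketch
(hypothetical port number) uses SO_LINGER with a zero timeout to make the
server's close() emit a RST rather than a FIN, and confirms the client's
next read fails immediately — the same thing the telnet test showed:

```python
import socket
import struct
import threading

PORT = 50907  # arbitrary loopback port; adjust if it collides

def rst_server():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    # SO_LINGER with l_onoff=1, l_linger=0 makes close() send RST, not FIN
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()
    srv.close()

t = threading.Thread(target=rst_server)
t.start()
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", PORT))
t.join()  # server has closed with RST by now
try:
    cli.recv(1)  # the pending RST aborts the read
    result = "no error"
except ConnectionResetError:
    result = "connection reset"
finally:
    cli.close()
print(result)
```

That immediate "connection reset" is what the SSH connection showed on
failover, and what the NFS connection should show but does not.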
Any thoughts on what that means or where to go from here?
--
Richard
Thread overview:
2016-02-14 2:24 PROBLEM: NFS Client Ignores TCP Resets Richard Laager
2016-03-08 17:06 ` Richard Laager
2016-03-09 21:16 ` Anna Schumaker
2016-03-09 21:42 ` Richard Laager
2016-03-11 9:44 ` Richard Laager
2016-04-02 1:43 ` Richard Laager
2016-04-03 3:58 ` NeilBrown
2016-04-07 9:45 ` Richard Laager [this message]
2016-04-08 0:47 ` NeilBrown
2017-10-02 19:29 ` Olga Kornievskaia
2017-10-02 22:13 ` NeilBrown
[not found] ` <CAN-5tyHuuBJxwqFLkiZa5ktBk7ypCJxmZ9creeD_RGWbK4Xn3A@mail.gmail.com>
2017-10-02 19:48 ` Richard Laager
2017-10-02 22:03 ` Olga Kornievskaia
2017-10-03 0:09 ` Richard Laager