* Receive offloads, small RCVBUF and zero TCP window
@ 2016-11-28 20:49 Alex Sidorenko
2016-11-28 20:54 ` David Miller
0 siblings, 1 reply; 5+ messages in thread
From: Alex Sidorenko @ 2016-11-28 20:49 UTC (permalink / raw)
To: netdev
One of our customers has hit a problem: the TCP window closes and stays closed forever even though the receive buffer is empty. The problem was reported on RHEL6.8 and I think the issue is in the __tcp_select_window() subroutine. Comparing the RHEL6.8 kernel sources with the latest upstream kernel (pulled from git today), it looks like it is still present in current kernels.
The problem is triggered by the following conditions:
(a) a small RCVBUF (24576 in our case), and as a result WS=0 (no window scaling)
(b) mss = icsk->icsk_ack.rcv_mss > MTU
I asked the customer to trigger a vmcore when the problem occurs to find out why the window stays closed forever. In the vmcore I can see (doing the calculations by following the __tcp_select_window() sources):
windows: rcv=0, snd=65535 advmss=1460 rcv_ws=0 snd_ws=0
--- Emulating __tcp_select_window ---
rcv_mss=7300 free_space=18432 allowed_space=18432 full_space=16972
rcv_ssthresh=5840, so free_space->5840
So when we reach the test

	if (window <= free_space - mss || window > free_space)
		window = (free_space / mss) * mss;
	else if (mss == full_space &&
		 free_space > window + (full_space >> 1))
		window = free_space;
we have a negative value of (free_space - mss) = 5840 - 7300 = -1460. As a result neither branch is taken, the window is not updated, and it stays zero forever - even though the application has read all available data and we have sufficient free_space.
This occurs only because we have an interface with MTU=1500 (so mss=1460 is expected), but icsk->icsk_ack.rcv_mss is 5*1460 = 7300.
As a result, "Get the largest window that is a nice multiple of mss" means a multiple of 7300, and with this receive buffer that never happens!
All other mss-related values look reasonable:
crash64> struct tcp_sock 0xffff8801bcb8c840 | grep mss
    icsk_sync_mss = 0xffffffff814ce620,
    rcv_mss = 7300
    mss_cache = 1460,
    advmss = 1460,
    user_mss = 0,
    mss_clamp = 1460
Now the question is whether it is OK to have icsk->icsk_ack.rcv_mss larger than MTU. I suspect the most important factor is that this host is running under VMWare. VMWare probably optimizes receive offloading aggressively, pushing merged SKBs larger than the MTU up to us. I have written a tool that prints warnings when we have mss > advmss and ran it on my collection of vmcores. In almost all cases where the vmcore was taken on a VMWare guest, there are some connections with mss > advmss. I have not found this high an mss value in any non-VMWare vmcore.
Obviously this is a corner-case problem - it can only happen with a small RCVBUF. But I think it needs to be fixed anyway. I am not sure whether having icsk->icsk_ack.rcv_mss > MTU is expected. If not, this should be fixed in the receive offload subroutines (LRO?) or maybe in the VMWare NIC driver.
But if it is OK for NICs to merge received SKBs and present supersegments to TCP (similar to TSO), this needs to be fixed in __tcp_select_window() - e.g. if we see a small RCVBUF and a large icsk->icsk_ack.rcv_mss, switch to mss_clamp, as was done in older versions; a sketch of that idea follows the quoted comment below. From the __tcp_select_window() comment:
/* MSS for the peer's data. Previous versions used mss_clamp
* here. I don't know if the value based on our guesses
* of peer's MSS is better for the performance. It's more correct
* but may be worse for the performance because of rcv_mss
* fluctuations. --SAW 1998/11/1
*/
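A minimal sketch of the fallback suggested above (an illustration of the idea only, not a tested or accepted patch), placed where __tcp_select_window() picks up mss:

	struct tcp_sock *tp = tcp_sk(sk);
	int mss = icsk->icsk_ack.rcv_mss;

	/* Sketch: if receive offload has inflated rcv_mss beyond the
	 * negotiated clamp, fall back to mss_clamp so that "a nice
	 * multiple of mss" can still be non-zero with a small RCVBUF.
	 */
	if (mss > tp->rx_opt.mss_clamp)
		mss = tp->rx_opt.mss_clamp;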
Regards,
Alex
--
------------------------------------------------------------------
Alex Sidorenko email: asid@hpe.com
ERT Linux Hewlett-Packard Enterprise (Canada)
------------------------------------------------------------------
* Re: Receive offloads, small RCVBUF and zero TCP window
2016-11-28 20:49 Receive offloads, small RCVBUF and zero TCP window Alex Sidorenko
@ 2016-11-28 20:54 ` David Miller
2016-11-28 21:14 ` Alex Sidorenko
2016-11-28 22:01 ` Marcelo Ricardo Leitner
0 siblings, 2 replies; 5+ messages in thread
From: David Miller @ 2016-11-28 20:54 UTC (permalink / raw)
To: alexandre.sidorenko; +Cc: netdev
From: Alex Sidorenko <alexandre.sidorenko@hpe.com>
Date: Mon, 28 Nov 2016 15:49:26 -0500
> Now the question is whether it is OK to have icsk->icsk_ack.rcv_mss
> larger than MTU.
It absolutely is not OK.
If VMWare wants to receive large frames for batching purposes it must
use GRO or similar to achieve that, not just send vanilla frames into
the stack which are larger than the device MTU.
* Re: Receive offloads, small RCVBUF and zero TCP window
2016-11-28 20:54 ` David Miller
@ 2016-11-28 21:14 ` Alex Sidorenko
2016-11-30 15:10 ` Alex Sidorenko
2016-11-28 22:01 ` Marcelo Ricardo Leitner
1 sibling, 1 reply; 5+ messages in thread
From: Alex Sidorenko @ 2016-11-28 21:14 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Monday, November 28, 2016 3:54:59 PM EST David Miller wrote:
> From: Alex Sidorenko <alexandre.sidorenko@hpe.com>
> Date: Mon, 28 Nov 2016 15:49:26 -0500
>
> > Now the question is whether it is OK to have icsk->icsk_ack.rcv_mss
> > larger than MTU.
>
> It absolutely is not OK.
>
> If VMWare wants to receive large frames for batching purposes it must
> use GRO or similar to achieve that, not just send vanilla frames into
> the stack which are larger than the device MTU.
>
As VMWare's vmxnet3 driver is open source and part of the mainline kernel, do you think the problem is in that driver or elsewhere? I looked at the vmxnet3 sources and see that it uses the LRO/GRO subroutines. Unfortunately, I don't understand its logic well enough to tell whether it is doing anything incorrectly.
Alex
--
------------------------------------------------------------------
Alex Sidorenko email: asid@hpe.com
ERT Linux Hewlett-Packard Enterprise (Canada)
------------------------------------------------------------------
* Re: Receive offloads, small RCVBUF and zero TCP window
2016-11-28 20:54 ` David Miller
2016-11-28 21:14 ` Alex Sidorenko
@ 2016-11-28 22:01 ` Marcelo Ricardo Leitner
1 sibling, 0 replies; 5+ messages in thread
From: Marcelo Ricardo Leitner @ 2016-11-28 22:01 UTC (permalink / raw)
To: David Miller; +Cc: alexandre.sidorenko, netdev, jmaxwell37, eric.dumazet
On Mon, Nov 28, 2016 at 03:54:59PM -0500, David Miller wrote:
> From: Alex Sidorenko <alexandre.sidorenko@hpe.com>
> Date: Mon, 28 Nov 2016 15:49:26 -0500
>
> > Now the question is whether it is OK to have icsk->icsk_ack.rcv_mss
> > larger than MTU.
>
> It absolutely is not OK.
>
Would it make sense to add a pr_warn_once() and perhaps even clamp it
down to a known/saner MSS?
> If VMWare wants to receive large frames for batching purposes it must
> use GRO or similar to achieve that, not just send vanilla frames into
> the stack which are larger than the device MTU.
>
It's not the first report I've seen of this type of issue. IBM also hit
it recently when the gso_size could not be carried from the tx side to
the rx side, and the warning probably would have saved quite some
debugging time.
Something like (but with a better msg, for sure):
--8<--
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a27b9c0e27c0..3a59cffae3fa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -144,7 +144,9 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 	 */
 	len = skb_shinfo(skb)->gso_size ? : skb->len;
 	if (len >= icsk->icsk_ack.rcv_mss) {
-		icsk->icsk_ack.rcv_mss = len;
+		icsk->icsk_ack.rcv_mss = min(len, tcp_sk(sk)->advmss);
+		if (icsk->icsk_ack.rcv_mss != len)
+			pr_warn_once("Your driver is likely doing bad rx acceleration.\n");
 	} else {
 		/* Otherwise, we make more careful check taking into account,
 		 * that SACKs block is variable.
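A quick check of the sketch against the vmcore numbers from the start of the thread, assuming the clamp picks the smaller of the two values as the "clamp it down" intent above suggests:

	/* len = 7300 (merged super-segment), tcp_sk(sk)->advmss = 1460:
	 * rcv_mss becomes 1460 (the sane value), rcv_mss != len, and
	 * pr_warn_once() fires exactly in the offending case, pointing
	 * at the receive-offload path.
	 */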
* Re: Receive offloads, small RCVBUF and zero TCP window
2016-11-28 21:14 ` Alex Sidorenko
@ 2016-11-30 15:10 ` Alex Sidorenko
0 siblings, 0 replies; 5+ messages in thread
From: Alex Sidorenko @ 2016-11-30 15:10 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Marcelo Ricardo Leitner
On Monday, November 28, 2016 4:14:04 PM EST Alex Sidorenko wrote:
> On Monday, November 28, 2016 3:54:59 PM EST David Miller wrote:
> > From: Alex Sidorenko <alexandre.sidorenko@hpe.com>
> > Date: Mon, 28 Nov 2016 15:49:26 -0500
> >
> > > Now the question is whether it is OK to have icsk->icsk_ack.rcv_mss
> > > larger than MTU.
> >
> > It absolutely is not OK.
> >
> > If VMWare wants to receive large frames for batching purposes it must
> > use GRO or similar to achieve that, not just send vanilla frames into
> > the stack which are larger than the device MTU.
> >
>
> As VMWare's vmxnet3 driver is open source and part of the mainline kernel, do you think the problem is in that driver or elsewhere? I looked at the vmxnet3 sources and see that it uses the LRO/GRO subroutines. Unfortunately, I don't understand its logic well enough to tell whether it is doing anything incorrectly.
I think this has already been fixed in recent versions of the vmxnet3 driver (but not in RHEL6). VMWare/ESX can indeed pass us large aggregated SKBs (> MTU) if LRO is enabled, but the driver takes care of that in vmxnet3_rq_rx_complete():
	} else if (segCnt != 0 || skb->len > mtu) {
		u32 hlen;
		hlen = vmxnet3_get_hdr_len(adapter, skb,
			(union Vmxnet3_GenericDesc *)rcd);
		if (hlen == 0)
			goto not_lro;
		skb_shinfo(skb)->gso_type =
			rcd->v4 ? SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
		if (segCnt != 0) {
			skb_shinfo(skb)->gso_segs = segCnt;
			skb_shinfo(skb)->gso_size =
				DIV_ROUND_UP(skb->len -
					hlen, segCnt);
		} else {
			skb_shinfo(skb)->gso_size = mtu - hlen;
		}
	}
So if packets have been aggregated ("u8 segCnt; /* Number of aggregated packets */" in the descriptor), we compute gso_size by dividing the large skb->len, minus the header length, by that count; a quick numeric check follows below.
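As a rough numeric check of that computation (the 54-byte header length is an assumption for a plain Ethernet/IPv4/TCP frame; the 5 x 1460 aggregation matches the vmcore at the start of the thread):

	#include <stdio.h>

	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

	int main(void)
	{
		unsigned int seg_cnt = 5;			/* segCnt from the rx descriptor */
		unsigned int hlen = 54;				/* assumed Ethernet + IPv4 + TCP header */
		unsigned int skb_len = hlen + seg_cnt * 1460;	/* 7354 */
		unsigned int gso_size = DIV_ROUND_UP(skb_len - hlen, seg_cnt);

		/* tcp_measure_rcv_mss() prefers gso_size when it is set,
		 * so rcv_mss ends up as 1460 instead of the merged skb->len.
		 */
		printf("gso_size = %u\n", gso_size);		/* prints 1460 */
		return 0;
	}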
I still like Marcelo's idea of printing a warning when icsk->icsk_ack.rcv_mss looks unreasonable; it should really help with detecting buggy drivers.