From: Alex Sidorenko <alexandre.sidorenko@hpe.com>
To: netdev@vger.kernel.org
Subject: Receive offloads, small RCVBUF and zero TCP window
Date: Mon, 28 Nov 2016 15:49:26 -0500
Message-ID: <2080597.A38JFJZ1AD@zbook>

One of our customers has hit a problem: the TCP window closes and stays closed forever, even though the receive buffer is empty. The problem was reported on RHEL6.8, and I think the issue is in the __tcp_select_window() subroutine. Comparing the RHEL6.8 kernel sources with the latest upstream kernel (pulled from git today), the issue appears to still be present in the latest kernels.

The problem is triggered by the following conditions:

(a) a small RCVBUF (24576 in our case), which results in a window scale of 0 (WS=0)
(b) mss = icsk->icsk_ack.rcv_mss > MTU
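
For context, a receive buffer this small typically comes from the application shrinking SO_RCVBUF before connecting; something like this (hypothetical application code, not taken from the customer's setup):

	#include <stdio.h>
	#include <sys/socket.h>

	int main(void)
	{
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		int val = 12288;	/* the kernel doubles this, giving 24576 */

		/* Must be done before connect()/listen(); with a buffer
		 * this small the kernel negotiates window scale 0.
		 */
		if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val)) < 0)
			perror("setsockopt");
		return 0;
	}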

I asked the customer to capture a vmcore when the problem occurs, to find out why the window stays closed forever. Doing the calculations from the __tcp_select_window() sources against the vmcore, I see:

        windows: rcv=0, snd=65535  advmss=1460 rcv_ws=0 snd_ws=0
        --- Emulating __tcp_select_window ---
          rcv_mss=7300 free_space=18432 allowed_space=18432 full_space=16972
          rcv_ssthresh=5840, so free_space->5840 

So when we reach the test

		if (window <= free_space - mss || window > free_space)
			window = (free_space / mss) * mss;
		else if (mss == full_space &&
			 free_space > window + (full_space >> 1))
			window = free_space;

we have a negative value of (free_space - mss) = -1460, so the first condition is false: window (0) is neither <= -1460 nor > free_space (5840). The else-if branch does not apply either, since mss != full_space.

As a result, we never update the window and it stays zero forever - even though the application has read all available data and we have sufficient free_space.
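
To make this concrete, here is a minimal userspace emulation of the arithmetic above, plugging in the values from the vmcore (a sketch; variable names mirror the kernel locals):

	#include <stdio.h>

	int main(void)
	{
		/* values observed in the vmcore */
		int mss        = 7300;	/* icsk->icsk_ack.rcv_mss */
		int free_space = 5840;	/* already clamped down to rcv_ssthresh */
		int full_space = 16972;
		int window     = 0;	/* tp->rcv_wnd, currently closed */

		if (window <= free_space - mss || window > free_space)
			window = (free_space / mss) * mss;
		else if (mss == full_space &&
			 free_space > window + (full_space >> 1))
			window = free_space;

		/* Neither branch fires: 0 <= -1460 is false, 0 > 5840 is
		 * false, and mss != full_space.  Even the first branch
		 * would only compute (5840 / 7300) * 7300 == 0.
		 */
		printf("window = %d\n", window);	/* prints 0 */
		return 0;
	}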


This occurs only because the interface has MTU=1500 (so mss=1460 is expected), while icsk->icsk_ack.rcv_mss is 5*1460 = 7300.

As a result, "Get the largest window that is a nice multiple of mss" means a multiple of 7300 - and with free_space capped at 5840, that can never happen!

All other mss-related values look reasonable:

crash64> struct tcp_sock 0xffff8801bcb8c840  | grep mss
    icsk_sync_mss = 0xffffffff814ce620 , 
      rcv_mss = 7300
  mss_cache = 1460, 
  advmss = 1460, 
    user_mss = 0, 
    mss_clamp = 1460


Now the question is whether it is OK to have icsk->icsk_ack.rcv_mss larger than the MTU. I suspect the most important factor is that this host is running under VMware. VMware probably optimizes receive offloading aggressively, pushing merged SKBs larger than the MTU up to us. I have written a tool that prints warnings when mss > advmss and ran it on my collection of vmcores. In almost all cases where the vmcore was taken on a VMware guest, there are some connections with mss > advmss. I have not found such a high mss value in any non-VMware vmcore.

Obviously, this is a corner-case problem - it can happen only with a small RCVBUF. But I think it needs to be fixed anyway. I am not sure whether having icsk->icsk_ack.rcv_mss > MTU is expected. If it is not, this should be fixed in the receive offload code (LRO?) or perhaps in the VMware NIC driver.

But if it is OK for NICs to merge received SKBs and present supersegments to TCP (similar to TSO on the send side), then this needs to be fixed in __tcp_select_window() - e.g., if we see a small RCVBUF and a large icsk->icsk_ack.rcv_mss, switch to mss_clamp, as older versions did. From the comment in __tcp_select_window():

	/* MSS for the peer's data.  Previous versions used mss_clamp
	 * here.  I don't know if the value based on our guesses
	 * of peer's MSS is better for the performance.  It's more correct
	 * but may be worse for the performance because of rcv_mss
	 * fluctuations.  --SAW  1998/11/1
	 */
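
One possible shape for such a fix (an untested sketch, not a submitted patch; falling back to mss_clamp as in older versions would be another option) is to clamp mss right after the locals are set up in __tcp_select_window(), where mss and full_space are both ints:

	/* Untested sketch: receive offload may coalesce SKBs beyond the
	 * MTU, inflating rcv_mss past what the receive buffer can ever
	 * hold.  Clamp it so that "a nice multiple of mss" can still
	 * come out nonzero.
	 */
	if (mss > full_space)
		mss = full_space;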

Regards,
Alex

-- 

------------------------------------------------------------------
Alex Sidorenko	email: asid@hpe.com
ERT  Linux 	Hewlett-Packard Enterprise (Canada)
------------------------------------------------------------------
