All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/4] mptcp: improve mptcp-level window tracking
@ 2022-04-11 10:40 Paolo Abeni
  2022-04-11 10:40 ` [RFC PATCH 1/4] mptcp: really share subflow snd_wnd Paolo Abeni
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Paolo Abeni @ 2022-04-11 10:40 UTC (permalink / raw)
  To: mptcp

I've been chasing bad/unstable performances with multiple subflows
on very high speed links.

It looks like the root cause is due to the current mptcp-level
congestion window handling. There are apparently a few different
sub-issues:

- the rcv_wnd is not effectively shared on the tx side, as each
  subflow takes in account only the value received by the underlaying
  TCP connection. This is addressed in patch 1/4

- The mptcp-level offered wnd right edge is currently allowed to shrink.
  Reading section 3.3.4.:

"""
   The receive window is relative to the DATA_ACK.  As in TCP, a
   receiver MUST NOT shrink the right edge of the receive window (i.e.,
   DATA_ACK + receive window).  The receiver will use the data sequence
   number to tell if a packet should be accepted at the connection
   level.
"""

   I read the above as we need to reflect window right-edge tracking
   on the wire, see patch 3/4.

- The offered window right edge tracking can happen concurrently on
  multiple subflows, but there is no mutex protection. We need an
  additional atomic operation - still patch 3/4

This series additionally bump a few new MIBs to track all the above
(ensure/observe that the suspected races actually take place).

With this series tput in the critical scenario raises from ~26 Gbps
(ranging in 4-30 Gbps) to ~43 Gbps (with min > 33 Gbps)

I guess patch 3/4 is the most debatable - expecially for RFC compliance
Any feedback more then welcome!

Note: still in patch 3/4, I'm unsure that the th->window update is
strictly necessary from functional perspective (e.g. possibly the atomic
operation is enough), I'll try to test that, too.

Paolo Abeni (4):
  mptcp: really share subflow snd_wnd
  mptcp: add mib for xmit window sharing
  mptcp: never shrink offered window
  mptcp: add more offered MIBs counter.

 include/net/mptcp.h   |  2 +-
 net/ipv4/tcp_output.c |  2 +-
 net/mptcp/mib.c       |  4 +++
 net/mptcp/mib.h       |  6 +++++
 net/mptcp/options.c   | 61 +++++++++++++++++++++++++++++++++++++------
 net/mptcp/protocol.c  | 24 +++++++++++------
 6 files changed, 81 insertions(+), 18 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 1/4] mptcp: really share subflow snd_wnd
  2022-04-11 10:40 [RFC PATCH 0/4] mptcp: improve mptcp-level window tracking Paolo Abeni
@ 2022-04-11 10:40 ` Paolo Abeni
  2022-04-11 10:40 ` [RFC PATCH 2/4] mptcp: add mib for xmit window sharing Paolo Abeni
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2022-04-11 10:40 UTC (permalink / raw)
  To: mptcp

As per RFC, mptcp subflows use a "shared" snd_wnd: the effective
window is the maximum among the current values received on all
subflows. Without such feature a data transfer using multiple
subflows could block.

Window sharing is currently implemented in the RX side:
__tcp_select_window uses the mptcp-level receive buffer to compute
the announced window.

That is not enough: the TCP stack will stick to the window size
received on the given subflow; we need to propagate the msk window
value on each subflow at xmit time.

Change the packet scheduler to ignore the subflow level window
and and use instead the msk level one

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index b2c654992de0..9eb33681d00d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1141,19 +1141,20 @@ struct mptcp_sendmsg_info {
 	bool data_lock_held;
 };
 
-static int mptcp_check_allowed_size(struct mptcp_sock *msk, u64 data_seq,
-				    int avail_size)
+static int mptcp_check_allowed_size(const struct mptcp_sock *msk, struct sock *ssk,
+				    u64 data_seq, int avail_size)
 {
 	u64 window_end = mptcp_wnd_end(msk);
+	u64 mptcp_snd_wnd;
 
 	if (__mptcp_check_fallback(msk))
 		return avail_size;
 
-	if (!before64(data_seq + avail_size, window_end)) {
-		u64 allowed_size = window_end - data_seq;
+	mptcp_snd_wnd = window_end - data_seq;
+	avail_size = min_t(unsigned int, mptcp_snd_wnd, avail_size);
 
-		return min_t(unsigned int, allowed_size, avail_size);
-	}
+	if (unlikely(tcp_sk(ssk)->snd_wnd < mptcp_snd_wnd))
+		tcp_sk(ssk)->snd_wnd = min_t(u64, U32_MAX, mptcp_snd_wnd);
 
 	return avail_size;
 }
@@ -1305,7 +1306,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	}
 
 	/* Zero window and all data acked? Probe. */
-	copy = mptcp_check_allowed_size(msk, data_seq, copy);
+	copy = mptcp_check_allowed_size(msk, ssk, data_seq, copy);
 	if (copy == 0) {
 		u64 snd_una = READ_ONCE(msk->snd_una);
 
@@ -1498,11 +1499,16 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 	 * to check that subflow has a non empty cwin.
 	 */
 	ssk = send_info[SSK_MODE_ACTIVE].ssk;
-	if (!ssk || !sk_stream_memory_free(ssk) || !tcp_sk(ssk)->snd_wnd)
+	if (!ssk || !sk_stream_memory_free(ssk))
 		return NULL;
 
-	burst = min_t(int, MPTCP_SEND_BURST_SIZE, tcp_sk(ssk)->snd_wnd);
+	burst = min_t(int, MPTCP_SEND_BURST_SIZE, mptcp_wnd_end(msk) - msk->snd_nxt);
 	wmem = READ_ONCE(ssk->sk_wmem_queued);
+	if (!burst) {
+		msk->last_snd = NULL;
+		return ssk;
+	}
+
 	subflow = mptcp_subflow_ctx(ssk);
 	subflow->avg_pacing_rate = div_u64((u64)subflow->avg_pacing_rate * wmem +
 					   READ_ONCE(ssk->sk_pacing_rate) * burst,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 2/4] mptcp: add mib for xmit window sharing
  2022-04-11 10:40 [RFC PATCH 0/4] mptcp: improve mptcp-level window tracking Paolo Abeni
  2022-04-11 10:40 ` [RFC PATCH 1/4] mptcp: really share subflow snd_wnd Paolo Abeni
@ 2022-04-11 10:40 ` Paolo Abeni
  2022-04-11 10:40 ` [RFC PATCH 3/4] mptcp: never shrink offered window Paolo Abeni
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2022-04-11 10:40 UTC (permalink / raw)
  To: mptcp

Bump a counter for counter when snd_wnd is shared among subflow,
for observability's sake.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/mib.c      | 1 +
 net/mptcp/mib.h      | 1 +
 net/mptcp/protocol.c | 4 +++-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index d93a8c9996fd..6a6f8151375a 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -56,6 +56,7 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
 	SNMP_MIB_ITEM("SubflowStale", MPTCP_MIB_SUBFLOWSTALE),
 	SNMP_MIB_ITEM("SubflowRecover", MPTCP_MIB_SUBFLOWRECOVER),
+	SNMP_MIB_ITEM("SndWndShared", MPTCP_MIB_SNDWNDSHARED),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 529d07af9e14..2411510bef66 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -49,6 +49,7 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_RCVPRUNED,		/* Incoming packet dropped due to memory limit */
 	MPTCP_MIB_SUBFLOWSTALE,		/* Subflows entered 'stale' status */
 	MPTCP_MIB_SUBFLOWRECOVER,	/* Subflows returned to active status after being stale */
+	MPTCP_MIB_SNDWNDSHARED,		/* Subflow snd wnd is overridden by msk's one */
 	__MPTCP_MIB_MAX
 };
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9eb33681d00d..6ef530271845 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1153,8 +1153,10 @@ static int mptcp_check_allowed_size(const struct mptcp_sock *msk, struct sock *s
 	mptcp_snd_wnd = window_end - data_seq;
 	avail_size = min_t(unsigned int, mptcp_snd_wnd, avail_size);
 
-	if (unlikely(tcp_sk(ssk)->snd_wnd < mptcp_snd_wnd))
+	if (unlikely(tcp_sk(ssk)->snd_wnd < mptcp_snd_wnd)) {
 		tcp_sk(ssk)->snd_wnd = min_t(u64, U32_MAX, mptcp_snd_wnd);
+		MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_SNDWNDSHARED);
+	}
 
 	return avail_size;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 3/4] mptcp: never shrink offered window
  2022-04-11 10:40 [RFC PATCH 0/4] mptcp: improve mptcp-level window tracking Paolo Abeni
  2022-04-11 10:40 ` [RFC PATCH 1/4] mptcp: really share subflow snd_wnd Paolo Abeni
  2022-04-11 10:40 ` [RFC PATCH 2/4] mptcp: add mib for xmit window sharing Paolo Abeni
@ 2022-04-11 10:40 ` Paolo Abeni
  2022-04-11 18:23   ` kernel test robot
                     ` (2 more replies)
  2022-04-11 10:40 ` [RFC PATCH 4/4] mptcp: add more offered MIBs counter Paolo Abeni
  2022-04-11 13:49 ` [RFC PATCH 0/4] mptcp: improve mptcp-level window tracking Paolo Abeni
  4 siblings, 3 replies; 10+ messages in thread
From: Paolo Abeni @ 2022-04-11 10:40 UTC (permalink / raw)
  To: mptcp

As per RFC, the offered MPTCP-level window should never shrink.
While we currently track the right edge, we don't enforce the
above constraint on the wire.
Additionally, concurrent xmit on different subflows can end-up in
erroneous right edge update.
Address the above explicitly updating the announced window and
protecting the update with an additional atomic operation (sic)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/mptcp.h   |  2 +-
 net/ipv4/tcp_output.c |  2 +-
 net/mptcp/options.c   | 57 +++++++++++++++++++++++++++++++++++++------
 3 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 877077b53200..5706d132713e 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -125,7 +125,7 @@ bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
 			       struct mptcp_out_options *opts);
 bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb);
 
-void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp,
+void mptcp_write_options(__be32 *ptr, struct tcp_sock *tp,
 			 struct mptcp_out_options *opts);
 
 void mptcp_diag_fill_info(struct mptcp_sock *msk, struct mptcp_info *info);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c221f3bce975..edad33b9db11 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -444,7 +444,7 @@ struct tcp_out_options {
 	struct mptcp_out_options mptcp;
 };
 
-static void mptcp_options_write(__be32 *ptr, const struct tcp_sock *tp,
+static void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 				struct tcp_out_options *opts)
 {
 #if IS_ENABLED(CONFIG_MPTCP)
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index e05d9458a025..ffc0292da041 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1224,20 +1224,61 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 	return true;
 }
 
-static void mptcp_set_rwin(const struct tcp_sock *tp)
+static void mptcp_set_rwin(struct tcp_sock *tp, void *opt)
 {
 	const struct sock *ssk = (const struct sock *)tp;
-	const struct mptcp_subflow_context *subflow;
+	struct mptcp_subflow_context *subflow;
+	u64 ack_seq, rcv_wnd_old, rcv_wnd_new;
 	struct mptcp_sock *msk;
-	u64 ack_seq;
+	struct tcphdr *th;
+	u32 new_win;
+	u64 win;
 
 	subflow = mptcp_subflow_ctx(ssk);
 	msk = mptcp_sk(subflow->conn);
 
-	ack_seq = READ_ONCE(msk->ack_seq) + tp->rcv_wnd;
+	ack_seq = READ_ONCE(msk->ack_seq);
+	rcv_wnd_new = ack_seq + tp->rcv_wnd;
+
+	rcv_wnd_old = READ_ONCE(msk->rcv_wnd_sent);
+	if (after64(rcv_wnd_new, rcv_wnd_old)) {
+		u64 rcv_wnd;
 
-	if (after64(ack_seq, READ_ONCE(msk->rcv_wnd_sent)))
-		WRITE_ONCE(msk->rcv_wnd_sent, ack_seq);
+		for (;;) {
+			rcv_wnd = cmpxchg64(&msk->rcv_wnd_sent, rcv_wnd_old, rcv_wnd_new);
+
+			if (rcv_wnd == rcv_wnd_old)
+				break;
+			if (before64(rcv_wnd_new, rcv_wnd))
+				goto raise_win;
+			rcv_wnd_old = rcv_wnd;
+		};
+		return;
+	}
+
+	if (rcv_wnd_new != rcv_wnd_old) {
+
+raise_win:
+		th = opt - sizeof(struct tcphdr);
+		win = rcv_wnd_old - ack_seq;
+		tp->rcv_wnd = min_t(u64, win, U32_MAX);
+		new_win = tp->rcv_wnd;
+
+		/* Make sure we do not exceed the maximum possible
+		 * scaled window.
+		 */
+		if (unlikely(th->syn))
+			new_win = min(new_win, 65535U) << tp->rx_opt.rcv_wscale;
+		if (!tp->rx_opt.rcv_wscale &&
+		    sock_net(ssk)->ipv4.sysctl_tcp_workaround_signed_windows)
+			new_win = min(new_win, MAX_TCP_WINDOW);
+		else
+			new_win = min(new_win, (65535U << tp->rx_opt.rcv_wscale));
+
+		/* RFC1323 scaling applied */
+		new_win >>= tp->rx_opt.rcv_wscale;
+		th->window = htons(new_win);
+	}
 }
 
 u16 __mptcp_make_csum(u64 data_seq, u32 subflow_seq, u16 data_len, __wsum sum)
@@ -1265,7 +1306,7 @@ static u16 mptcp_make_csum(const struct mptcp_ext *mpext)
 				 ~csum_unfold(mpext->csum));
 }
 
-void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp,
+void mptcp_write_options(__be32 *ptr, struct tcp_sock *tp,
 			 struct mptcp_out_options *opts)
 {
 	const struct sock *ssk = (const struct sock *)tp;
@@ -1554,7 +1595,7 @@ void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp,
 	}
 
 	if (tp)
-		mptcp_set_rwin(tp);
+		mptcp_set_rwin(tp, opts);
 }
 
 __be32 mptcp_get_reset_option(const struct sk_buff *skb)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 4/4] mptcp: add more offered MIBs counter.
  2022-04-11 10:40 [RFC PATCH 0/4] mptcp: improve mptcp-level window tracking Paolo Abeni
                   ` (2 preceding siblings ...)
  2022-04-11 10:40 ` [RFC PATCH 3/4] mptcp: never shrink offered window Paolo Abeni
@ 2022-04-11 10:40 ` Paolo Abeni
  2022-04-11 11:58   ` mptcp: add more offered MIBs counter.: Tests Results MPTCP CI
  2022-04-11 13:49 ` [RFC PATCH 0/4] mptcp: improve mptcp-level window tracking Paolo Abeni
  4 siblings, 1 reply; 10+ messages in thread
From: Paolo Abeni @ 2022-04-11 10:40 UTC (permalink / raw)
  To: mptcp

Track the exceptional handling of MPTCP-level offered window
with a few more counters for observability.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/mib.c     | 3 +++
 net/mptcp/mib.h     | 5 +++++
 net/mptcp/options.c | 6 +++++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index 6a6f8151375a..0dac2863c6e1 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -57,6 +57,9 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("SubflowStale", MPTCP_MIB_SUBFLOWSTALE),
 	SNMP_MIB_ITEM("SubflowRecover", MPTCP_MIB_SUBFLOWRECOVER),
 	SNMP_MIB_ITEM("SndWndShared", MPTCP_MIB_SNDWNDSHARED),
+	SNMP_MIB_ITEM("RcvWndShared", MPTCP_MIB_RCVWNDSHARED),
+	SNMP_MIB_ITEM("RcvWndConflictUpdate", MPTCP_MIB_RCVWNDCONFLICTUPDATE),
+	SNMP_MIB_ITEM("RcvWndConflict", MPTCP_MIB_RCVWNDCONFLICT),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 2411510bef66..e05428838005 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -50,6 +50,11 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_SUBFLOWSTALE,		/* Subflows entered 'stale' status */
 	MPTCP_MIB_SUBFLOWRECOVER,	/* Subflows returned to active status after being stale */
 	MPTCP_MIB_SNDWNDSHARED,		/* Subflow snd wnd is overridden by msk's one */
+	MPTCP_MIB_RCVWNDSHARED,		/* Subflow rcv wnd is overridden by msk's one */
+	MPTCP_MIB_RCVWNDCONFLICTUPDATE,	/* subflow rcv wnd is overridden by msk's one due to
+					 * conflict with another subflow while updating msk rcv wnd
+					 */
+	MPTCP_MIB_RCVWNDCONFLICT,	/* Conflict with another subflow while updating msk rcv wnd */
 	__MPTCP_MIB_MAX
 };
 
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index ffc0292da041..dc84f39ff9f9 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1249,8 +1249,11 @@ static void mptcp_set_rwin(struct tcp_sock *tp, void *opt)
 
 			if (rcv_wnd == rcv_wnd_old)
 				break;
-			if (before64(rcv_wnd_new, rcv_wnd))
+			if (before64(rcv_wnd_new, rcv_wnd)) {
+				MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_RCVWNDCONFLICTUPDATE);
 				goto raise_win;
+			}
+			MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_RCVWNDCONFLICT);
 			rcv_wnd_old = rcv_wnd;
 		};
 		return;
@@ -1278,6 +1281,7 @@ static void mptcp_set_rwin(struct tcp_sock *tp, void *opt)
 		/* RFC1323 scaling applied */
 		new_win >>= tp->rx_opt.rcv_wscale;
 		th->window = htons(new_win);
+		MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_RCVWNDSHARED);
 	}
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: mptcp: add more offered MIBs counter.: Tests Results
  2022-04-11 10:40 ` [RFC PATCH 4/4] mptcp: add more offered MIBs counter Paolo Abeni
@ 2022-04-11 11:58   ` MPTCP CI
  0 siblings, 0 replies; 10+ messages in thread
From: MPTCP CI @ 2022-04-11 11:58 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal:
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/6597949374332928
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/6597949374332928/summary/summary.txt

- KVM Validation: debug:
  - Critical: KMemLeak ❌:
  - Task: https://cirrus-ci.com/task/4768362025713664
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/4768362025713664/summary/summary.txt

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/3b2bbb4557d7


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-debug

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (Tessares)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/4] mptcp: improve mptcp-level window tracking
  2022-04-11 10:40 [RFC PATCH 0/4] mptcp: improve mptcp-level window tracking Paolo Abeni
                   ` (3 preceding siblings ...)
  2022-04-11 10:40 ` [RFC PATCH 4/4] mptcp: add more offered MIBs counter Paolo Abeni
@ 2022-04-11 13:49 ` Paolo Abeni
  4 siblings, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2022-04-11 13:49 UTC (permalink / raw)
  To: mptcp

On Mon, 2022-04-11 at 12:40 +0200, Paolo Abeni wrote:
> Note: still in patch 3/4, I'm unsure that the th->window update is
> strictly necessary from functional perspective (e.g. possibly the atomic
> operation is enough), I'll try to test that, too.

A couple more notes:

- it looks like the th->window update is necessary, a test run without
the relevant chunk produced significantly less good results.

- this is still (supriese, surprise!) not perfect, I can observe
low_but_not_zero count of TcpExtTCPZeroWindowDrop, which instead should
be 0 without evil middle-box (and is 0 with plain TCP)

- I'm wondering if we should additionally add another mib counter for 
MPTCPWANTZEROWINDOWADV - quite similar to TCPWANTZEROWINDOWADV

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 3/4] mptcp: never shrink offered window
  2022-04-11 10:40 ` [RFC PATCH 3/4] mptcp: never shrink offered window Paolo Abeni
@ 2022-04-11 18:23   ` kernel test robot
  2022-04-11 19:46   ` kernel test robot
  2022-04-15  0:39   ` Mat Martineau
  2 siblings, 0 replies; 10+ messages in thread
From: kernel test robot @ 2022-04-11 18:23 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: llvm, kbuild-all

Hi Paolo,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on mptcp/export]
[also build test ERROR on linus/master v5.18-rc2 next-20220411]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Paolo-Abeni/mptcp-improve-mptcp-level-window-tracking/20220411-184144
base:   https://github.com/multipath-tcp/mptcp_net-next.git export
config: hexagon-defconfig (https://download.01.org/0day-ci/archive/20220412/202204120247.mC6ZORgL-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project c6e83f560f06cdfe8aa47b248d8bdc58f947274b)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/4cd25cbca4acc18a428817e9b1adc5ddd131422b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Paolo-Abeni/mptcp-improve-mptcp-level-window-tracking/20220411-184144
        git checkout 4cd25cbca4acc18a428817e9b1adc5ddd131422b
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash net/mptcp/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   net/mptcp/options.c:554:21: warning: parameter 'remaining' set but not used [-Wunused-but-set-parameter]
                                             unsigned int remaining,
                                                          ^
>> net/mptcp/options.c:1248:14: error: implicit declaration of function 'arch_cmpxchg64' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
                           rcv_wnd = cmpxchg64(&msk->rcv_wnd_sent, rcv_wnd_old, rcv_wnd_new);
                                     ^
   include/linux/atomic/atomic-instrumented.h:1946:2: note: expanded from macro 'cmpxchg64'
           arch_cmpxchg64(__ai_ptr, __VA_ARGS__); \
           ^
   1 warning and 1 error generated.


vim +/arch_cmpxchg64 +1248 net/mptcp/options.c

  1226	
  1227	static void mptcp_set_rwin(struct tcp_sock *tp, void *opt)
  1228	{
  1229		const struct sock *ssk = (const struct sock *)tp;
  1230		struct mptcp_subflow_context *subflow;
  1231		u64 ack_seq, rcv_wnd_old, rcv_wnd_new;
  1232		struct mptcp_sock *msk;
  1233		struct tcphdr *th;
  1234		u32 new_win;
  1235		u64 win;
  1236	
  1237		subflow = mptcp_subflow_ctx(ssk);
  1238		msk = mptcp_sk(subflow->conn);
  1239	
  1240		ack_seq = READ_ONCE(msk->ack_seq);
  1241		rcv_wnd_new = ack_seq + tp->rcv_wnd;
  1242	
  1243		rcv_wnd_old = READ_ONCE(msk->rcv_wnd_sent);
  1244		if (after64(rcv_wnd_new, rcv_wnd_old)) {
  1245			u64 rcv_wnd;
  1246	
  1247			for (;;) {
> 1248				rcv_wnd = cmpxchg64(&msk->rcv_wnd_sent, rcv_wnd_old, rcv_wnd_new);
  1249	
  1250				if (rcv_wnd == rcv_wnd_old)
  1251					break;
  1252				if (before64(rcv_wnd_new, rcv_wnd))
  1253					goto raise_win;
  1254				rcv_wnd_old = rcv_wnd;
  1255			};
  1256			return;
  1257		}
  1258	
  1259		if (rcv_wnd_new != rcv_wnd_old) {
  1260	
  1261	raise_win:
  1262			th = opt - sizeof(struct tcphdr);
  1263			win = rcv_wnd_old - ack_seq;
  1264			tp->rcv_wnd = min_t(u64, win, U32_MAX);
  1265			new_win = tp->rcv_wnd;
  1266	
  1267			/* Make sure we do not exceed the maximum possible
  1268			 * scaled window.
  1269			 */
  1270			if (unlikely(th->syn))
  1271				new_win = min(new_win, 65535U) << tp->rx_opt.rcv_wscale;
  1272			if (!tp->rx_opt.rcv_wscale &&
  1273			    sock_net(ssk)->ipv4.sysctl_tcp_workaround_signed_windows)
  1274				new_win = min(new_win, MAX_TCP_WINDOW);
  1275			else
  1276				new_win = min(new_win, (65535U << tp->rx_opt.rcv_wscale));
  1277	
  1278			/* RFC1323 scaling applied */
  1279			new_win >>= tp->rx_opt.rcv_wscale;
  1280			th->window = htons(new_win);
  1281		}
  1282	}
  1283	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 3/4] mptcp: never shrink offered window
  2022-04-11 10:40 ` [RFC PATCH 3/4] mptcp: never shrink offered window Paolo Abeni
  2022-04-11 18:23   ` kernel test robot
@ 2022-04-11 19:46   ` kernel test robot
  2022-04-15  0:39   ` Mat Martineau
  2 siblings, 0 replies; 10+ messages in thread
From: kernel test robot @ 2022-04-11 19:46 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: llvm, kbuild-all

Hi Paolo,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on mptcp/export]
[also build test ERROR on linus/master v5.18-rc2 next-20220411]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Paolo-Abeni/mptcp-improve-mptcp-level-window-tracking/20220411-184144
base:   https://github.com/multipath-tcp/mptcp_net-next.git export
config: arm-buildonly-randconfig-r006-20220411 (https://download.01.org/0day-ci/archive/20220412/202204120334.WqEBThyK-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project c6e83f560f06cdfe8aa47b248d8bdc58f947274b)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/intel-lab-lkp/linux/commit/4cd25cbca4acc18a428817e9b1adc5ddd131422b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Paolo-Abeni/mptcp-improve-mptcp-level-window-tracking/20220411-184144
        git checkout 4cd25cbca4acc18a428817e9b1adc5ddd131422b
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash net/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   net/mptcp/options.c:554:21: warning: parameter 'remaining' set but not used [-Wunused-but-set-parameter]
                                             unsigned int remaining,
                                                          ^
   In file included from net/mptcp/options.c:9:
   In file included from include/linux/kernel.h:22:
   In file included from include/linux/bitops.h:33:
   In file included from arch/arm/include/asm/bitops.h:243:
   In file included from include/asm-generic/bitops/lock.h:5:
   In file included from include/linux/atomic.h:7:
   In file included from arch/arm/include/asm/atomic.h:16:
>> arch/arm/include/asm/cmpxchg.h:254:1: error: instruction requires: !armv*m
   "1:     ldrexd          %1, %H1, [%3]\n"
   ^
   <inline asm>:1:5: note: instantiated into assembly here
           1:      ldrexd          r8, r9, [r11]
                   ^
   In file included from net/mptcp/options.c:9:
   In file included from include/linux/kernel.h:22:
   In file included from include/linux/bitops.h:33:
   In file included from arch/arm/include/asm/bitops.h:243:
   In file included from include/asm-generic/bitops/lock.h:5:
   In file included from include/linux/atomic.h:7:
   In file included from arch/arm/include/asm/atomic.h:16:
   arch/arm/include/asm/cmpxchg.h:258:2: error: instruction requires: !armv*m
   "       strexd          %0, %5, %H5, [%3]\n"
    ^
   <inline asm>:5:2: note: instantiated into assembly here
           strexd          r0, r4, r5, [r11]
           ^
   1 warning and 2 errors generated.

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for DRM_GEM_SHMEM_HELPER
   Depends on HAS_IOMEM && DRM && MMU
   Selected by
   - DRM_SSD130X && HAS_IOMEM && DRM


vim +254 arch/arm/include/asm/cmpxchg.h

e001bbae7147b1 Russell King 2015-05-26  243  
2523c67bb6962f Will Deacon  2013-10-09  244  static inline unsigned long long __cmpxchg64(unsigned long long *ptr,
2523c67bb6962f Will Deacon  2013-10-09  245  					     unsigned long long old,
2523c67bb6962f Will Deacon  2013-10-09  246  					     unsigned long long new)
2523c67bb6962f Will Deacon  2013-10-09  247  {
2523c67bb6962f Will Deacon  2013-10-09  248  	unsigned long long oldval;
2523c67bb6962f Will Deacon  2013-10-09  249  	unsigned long res;
2523c67bb6962f Will Deacon  2013-10-09  250  
c32ffce0f66e5d Will Deacon  2014-02-21  251  	prefetchw(ptr);
c32ffce0f66e5d Will Deacon  2014-02-21  252  
2523c67bb6962f Will Deacon  2013-10-09  253  	__asm__ __volatile__(
2523c67bb6962f Will Deacon  2013-10-09 @254  "1:	ldrexd		%1, %H1, [%3]\n"
2523c67bb6962f Will Deacon  2013-10-09  255  "	teq		%1, %4\n"
2523c67bb6962f Will Deacon  2013-10-09  256  "	teqeq		%H1, %H4\n"
2523c67bb6962f Will Deacon  2013-10-09  257  "	bne		2f\n"
2523c67bb6962f Will Deacon  2013-10-09  258  "	strexd		%0, %5, %H5, [%3]\n"
2523c67bb6962f Will Deacon  2013-10-09  259  "	teq		%0, #0\n"
2523c67bb6962f Will Deacon  2013-10-09  260  "	bne		1b\n"
2523c67bb6962f Will Deacon  2013-10-09  261  "2:"
2523c67bb6962f Will Deacon  2013-10-09  262  	: "=&r" (res), "=&r" (oldval), "+Qo" (*ptr)
2523c67bb6962f Will Deacon  2013-10-09  263  	: "r" (ptr), "r" (old), "r" (new)
2523c67bb6962f Will Deacon  2013-10-09  264  	: "cc");
2523c67bb6962f Will Deacon  2013-10-09  265  
2523c67bb6962f Will Deacon  2013-10-09  266  	return oldval;
2523c67bb6962f Will Deacon  2013-10-09  267  }
2523c67bb6962f Will Deacon  2013-10-09  268  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 3/4] mptcp: never shrink offered window
  2022-04-11 10:40 ` [RFC PATCH 3/4] mptcp: never shrink offered window Paolo Abeni
  2022-04-11 18:23   ` kernel test robot
  2022-04-11 19:46   ` kernel test robot
@ 2022-04-15  0:39   ` Mat Martineau
  2 siblings, 0 replies; 10+ messages in thread
From: Mat Martineau @ 2022-04-15  0:39 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

On Mon, 11 Apr 2022, Paolo Abeni wrote:

> As per RFC, the offered MPTCP-level window should never shrink.
> While we currently track the right edge, we don't enforce the
> above constraint on the wire.
> Additionally, concurrent xmit on different subflows can end-up in
> erroneous right edge update.
> Address the above explicitly updating the announced window and
> protecting the update with an additional atomic operation (sic)
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> include/net/mptcp.h   |  2 +-
> net/ipv4/tcp_output.c |  2 +-
> net/mptcp/options.c   | 57 +++++++++++++++++++++++++++++++++++++------
> 3 files changed, 51 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/mptcp.h b/include/net/mptcp.h
> index 877077b53200..5706d132713e 100644
> --- a/include/net/mptcp.h
> +++ b/include/net/mptcp.h
> @@ -125,7 +125,7 @@ bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
> 			       struct mptcp_out_options *opts);
> bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb);
>
> -void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp,
> +void mptcp_write_options(__be32 *ptr, struct tcp_sock *tp,
> 			 struct mptcp_out_options *opts);
>
> void mptcp_diag_fill_info(struct mptcp_sock *msk, struct mptcp_info *info);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index c221f3bce975..edad33b9db11 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -444,7 +444,7 @@ struct tcp_out_options {
> 	struct mptcp_out_options mptcp;
> };
>
> -static void mptcp_options_write(__be32 *ptr, const struct tcp_sock *tp,
> +static void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
> 				struct tcp_out_options *opts)
> {
> #if IS_ENABLED(CONFIG_MPTCP)
> diff --git a/net/mptcp/options.c b/net/mptcp/options.c
> index e05d9458a025..ffc0292da041 100644
> --- a/net/mptcp/options.c
> +++ b/net/mptcp/options.c
> @@ -1224,20 +1224,61 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
> 	return true;
> }
>
> -static void mptcp_set_rwin(const struct tcp_sock *tp)
> +static void mptcp_set_rwin(struct tcp_sock *tp, void *opt)
> {
> 	const struct sock *ssk = (const struct sock *)tp;
> -	const struct mptcp_subflow_context *subflow;
> +	struct mptcp_subflow_context *subflow;
> +	u64 ack_seq, rcv_wnd_old, rcv_wnd_new;
> 	struct mptcp_sock *msk;
> -	u64 ack_seq;
> +	struct tcphdr *th;
> +	u32 new_win;
> +	u64 win;
>
> 	subflow = mptcp_subflow_ctx(ssk);
> 	msk = mptcp_sk(subflow->conn);
>
> -	ack_seq = READ_ONCE(msk->ack_seq) + tp->rcv_wnd;
> +	ack_seq = READ_ONCE(msk->ack_seq);
> +	rcv_wnd_new = ack_seq + tp->rcv_wnd;
> +
> +	rcv_wnd_old = READ_ONCE(msk->rcv_wnd_sent);
> +	if (after64(rcv_wnd_new, rcv_wnd_old)) {
> +		u64 rcv_wnd;
>
> -	if (after64(ack_seq, READ_ONCE(msk->rcv_wnd_sent)))
> -		WRITE_ONCE(msk->rcv_wnd_sent, ack_seq);
> +		for (;;) {
> +			rcv_wnd = cmpxchg64(&msk->rcv_wnd_sent, rcv_wnd_old, rcv_wnd_new);
> +
> +			if (rcv_wnd == rcv_wnd_old)
> +				break;
> +			if (before64(rcv_wnd_new, rcv_wnd))
> +				goto raise_win;
> +			rcv_wnd_old = rcv_wnd;
> +		};
> +		return;
> +	}
> +
> +	if (rcv_wnd_new != rcv_wnd_old) {
> +
> +raise_win:
> +		th = opt - sizeof(struct tcphdr);

'opt' is a pointer to a struct mptcp_out_options, not a pointer to the 
option bytes in a tcp header... so this ends up writing somewhere 
unexpected on the stack.

'ptr' (in mptcp_write_options()) is somewhere in the TCP header, but we 
don't know how much option space was consumed before mptcp_write_options() 
was called. Looks like a copy of the original ptr passed in to 
tcp_options_write() (or, even better, an actual tcphdr pointer!) needs to 
be plumbed in to this function.


- Mat

> +		win = rcv_wnd_old - ack_seq;
> +		tp->rcv_wnd = min_t(u64, win, U32_MAX);
> +		new_win = tp->rcv_wnd;
> +
> +		/* Make sure we do not exceed the maximum possible
> +		 * scaled window.
> +		 */
> +		if (unlikely(th->syn))
> +			new_win = min(new_win, 65535U) << tp->rx_opt.rcv_wscale;
> +		if (!tp->rx_opt.rcv_wscale &&
> +		    sock_net(ssk)->ipv4.sysctl_tcp_workaround_signed_windows)
> +			new_win = min(new_win, MAX_TCP_WINDOW);
> +		else
> +			new_win = min(new_win, (65535U << tp->rx_opt.rcv_wscale));
> +
> +		/* RFC1323 scaling applied */
> +		new_win >>= tp->rx_opt.rcv_wscale;
> +		th->window = htons(new_win);
> +	}
> }
>
> u16 __mptcp_make_csum(u64 data_seq, u32 subflow_seq, u16 data_len, __wsum sum)
> @@ -1265,7 +1306,7 @@ static u16 mptcp_make_csum(const struct mptcp_ext *mpext)
> 				 ~csum_unfold(mpext->csum));
> }
>
> -void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp,
> +void mptcp_write_options(__be32 *ptr, struct tcp_sock *tp,
> 			 struct mptcp_out_options *opts)
> {
> 	const struct sock *ssk = (const struct sock *)tp;
> @@ -1554,7 +1595,7 @@ void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp,
> 	}
>
> 	if (tp)
> -		mptcp_set_rwin(tp);
> +		mptcp_set_rwin(tp, opts);
> }
>
> __be32 mptcp_get_reset_option(const struct sk_buff *skb)
> -- 
> 2.35.1
>
>
>

--
Mat Martineau
Intel

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-04-15  0:39 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-04-11 10:40 [RFC PATCH 0/4] mptcp: improve mptcp-level window tracking Paolo Abeni
2022-04-11 10:40 ` [RFC PATCH 1/4] mptcp: really share subflow snd_wnd Paolo Abeni
2022-04-11 10:40 ` [RFC PATCH 2/4] mptcp: add mib for xmit window sharing Paolo Abeni
2022-04-11 10:40 ` [RFC PATCH 3/4] mptcp: never shrink offered window Paolo Abeni
2022-04-11 18:23   ` kernel test robot
2022-04-11 19:46   ` kernel test robot
2022-04-15  0:39   ` Mat Martineau
2022-04-11 10:40 ` [RFC PATCH 4/4] mptcp: add more offered MIBs counter Paolo Abeni
2022-04-11 11:58   ` mptcp: add more offered MIBs counter.: Tests Results MPTCP CI
2022-04-11 13:49 ` [RFC PATCH 0/4] mptcp: improve mptcp-level window tracking Paolo Abeni

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.