public inbox for mptcp@lists.linux.dev
* [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement
@ 2025-11-20  8:39 Paolo Abeni
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 1/6] trace: mptcp: add mptcp_rcvbuf_grow tracepoint Paolo Abeni
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Paolo Abeni @ 2025-11-20  8:39 UTC (permalink / raw)
  To: mptcp; +Cc: martineau

This series collects a few follow-ups for the backlog refactor, with some
of them posing as fixes just to confuse the enemy.

Targeting net-next as the issues addressed are very old, the changes
are quite invasive and, as mentioned, based on the BL refactor.

Overall this:
- introduces a tracepoint similar to TCP's, useful to track down the issues
  addressed here
- avoids rx drops in corner-case scenarios, improving tput stability and
  peak tput over fast links
- refactors the rcv space and rtt estimator, overall making DRS more
  correct and avoiding the rcv buffer drifting to tcp_rmem[2], which in
  turn makes the tput more stable and less bursty

Note that after the following upstream series is merged:

https://patchwork.kernel.org/user/todo/netdevbpf/?series=1025209

we want to introduce similar changes in mptcp, too.

I'm sorry for the high-frequency spamming; I'm trying to complete this
stuff before ENORESOURCES. Please ignore the previous 2 revisions.
---
v6 -> v7:
   - re-added the patches dropped in the last 2 iterations

v5 -> v6:
  - dropped the buggy patch 3/5

v4 -> v5:
  - dropped patch 4/6; it proved to be useless together with new version
    of patch 4/5
  - refactor rtt estimator: use sliding window instead of dropping "too
    high" values

v3 -> v4:
  - dropped already merged patch 1
  - update comment in patch 6/6

v2 -> v3:
  - use __entry->family in patch 1/7
  - more verbose commit message for patch 6/7

v1 -> v2:
  - new patch 1/7
  - old patch 2/4 has been split in 3: 3/7, 4/7 and 5/7
  - refactor both the rcv space init and rtt estimator

Paolo Abeni (6):
  trace: mptcp: add mptcp_rcvbuf_grow tracepoint
  mptcp: do not account for OoO in mptcp_rcvbuf_grow()
  mptcp: fix receive space timestamp initialization.
  mptcp: consolidate rcv space init
  mptcp: better mptcp-level RTT estimator
  mptcp: add receive queue awareness in tcp_rcv_space_adjust()

 include/trace/events/mptcp.h |  80 ++++++++++++++++++++++++
 net/mptcp/protocol.c         | 116 ++++++++++++++++++-----------------
 net/mptcp/protocol.h         |  44 ++++++++++++-
 net/mptcp/subflow.c          |   2 -
 4 files changed, 183 insertions(+), 59 deletions(-)

-- 
2.51.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v7 mptcp-next 1/6] trace: mptcp: add mptcp_rcvbuf_grow tracepoint
  2025-11-20  8:39 [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement Paolo Abeni
@ 2025-11-20  8:39 ` Paolo Abeni
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 2/6] mptcp: do not account for OoO in mptcp_rcvbuf_grow() Paolo Abeni
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Paolo Abeni @ 2025-11-20  8:39 UTC (permalink / raw)
  To: mptcp; +Cc: martineau

Similar to TCP, provide a new tracepoint to better understand
mptcp_rcv_space_adjust() behavior, which presents many artifacts.

Note that the format string used is so long that I preferred to
wrap it, contrary to the guidance for quoted strings.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v4 -> v5:
  - fixed a couple of checkpatch issues.

v2 -> v3:
  - use __entry->family; note that its value is shown as a raw number
    instead of a string, as show_family_name() is available only to
    net/core/ traces and I preferred not to move the mptcp traces
    there, as we need mptcp-specific helpers, too.
---
 include/trace/events/mptcp.h | 80 ++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.c         |  3 ++
 2 files changed, 83 insertions(+)

diff --git a/include/trace/events/mptcp.h b/include/trace/events/mptcp.h
index 085b749cdd97..269d949b2025 100644
--- a/include/trace/events/mptcp.h
+++ b/include/trace/events/mptcp.h
@@ -5,7 +5,13 @@
 #if !defined(_TRACE_MPTCP_H) || defined(TRACE_HEADER_MULTI_READ)
 #define _TRACE_MPTCP_H
 
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
 #include <linux/tracepoint.h>
+#include <net/ipv6.h>
+#include <net/tcp.h>
+#include <linux/sock_diag.h>
+#include <net/rstreason.h>
 
 #define show_mapping_status(status)					\
 	__print_symbolic(status,					\
@@ -178,6 +184,80 @@ TRACE_EVENT(subflow_check_data_avail,
 		  __entry->skb)
 );
 
+#include <trace/events/net_probe_common.h>
+
+TRACE_EVENT(mptcp_rcvbuf_grow,
+
+	TP_PROTO(struct sock *sk, int time),
+
+	TP_ARGS(sk, time),
+
+	TP_STRUCT__entry(
+		__field(int, time)
+		__field(__u32, rtt_us)
+		__field(__u32, copied)
+		__field(__u32, inq)
+		__field(__u32, space)
+		__field(__u32, ooo_space)
+		__field(__u32, rcvbuf)
+		__field(__u32, rcv_wnd)
+		__field(__u8, scaling_ratio)
+		__field(__u16, sport)
+		__field(__u16, dport)
+		__field(__u16, family)
+		__array(__u8, saddr, 4)
+		__array(__u8, daddr, 4)
+		__array(__u8, saddr_v6, 16)
+		__array(__u8, daddr_v6, 16)
+		__field(const void *, skaddr)
+	),
+
+	TP_fast_assign(
+		struct mptcp_sock *msk = mptcp_sk(sk);
+		struct inet_sock *inet = inet_sk(sk);
+		bool ofo_empty;
+		__be32 *p32;
+
+		__entry->time = time;
+		__entry->rtt_us = msk->rcvq_space.rtt_us >> 3;
+		__entry->copied = msk->rcvq_space.copied;
+		__entry->inq = mptcp_inq_hint(sk);
+		__entry->space = msk->rcvq_space.space;
+		ofo_empty = RB_EMPTY_ROOT(&msk->out_of_order_queue);
+		__entry->ooo_space = ofo_empty ? 0 :
+				     MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq -
+				     msk->ack_seq;
+
+		__entry->rcvbuf = sk->sk_rcvbuf;
+		__entry->rcv_wnd = atomic64_read(&msk->rcv_wnd_sent) -
+				   msk->ack_seq;
+		__entry->scaling_ratio = msk->scaling_ratio;
+		__entry->sport = ntohs(inet->inet_sport);
+		__entry->dport = ntohs(inet->inet_dport);
+		__entry->family = sk->sk_family;
+
+		p32 = (__be32 *)__entry->saddr;
+		*p32 = inet->inet_saddr;
+
+		p32 = (__be32 *)__entry->daddr;
+		*p32 = inet->inet_daddr;
+
+		TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
+			       sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
+
+		__entry->skaddr = sk;
+	),
+
+	TP_printk("time=%u rtt_us=%u copied=%u inq=%u space=%u ooo=%u scaling_ratio=%u "
+		  "rcvbuf=%u rcv_wnd=%u family=%d sport=%hu dport=%hu saddr=%pI4 "
+		  "daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c skaddr=%p",
+		  __entry->time, __entry->rtt_us, __entry->copied,
+		  __entry->inq, __entry->space, __entry->ooo_space,
+		  __entry->scaling_ratio, __entry->rcvbuf, __entry->rcv_wnd,
+		  __entry->family, __entry->sport, __entry->dport,
+		  __entry->saddr, __entry->daddr, __entry->saddr_v6,
+		  __entry->daddr_v6, __entry->skaddr)
+);
 #endif /* _TRACE_MPTCP_H */
 
 /* This part must be outside protection */
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index f8756521b2e2..64f592a7897c 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -28,6 +28,8 @@
 #include "protocol.h"
 #include "mib.h"
 
+static unsigned int mptcp_inq_hint(const struct sock *sk);
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/mptcp.h>
 
@@ -2118,6 +2120,7 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	if (msk->rcvq_space.copied <= msk->rcvq_space.space)
 		goto new_measure;
 
+	trace_mptcp_rcvbuf_grow(sk, time);
 	if (mptcp_rcvbuf_grow(sk, msk->rcvq_space.copied)) {
 		/* Make subflows follow along.  If we do not do this, we
 		 * get drops at subflow level if skbs can't be moved to
-- 
2.51.1



* [PATCH v7 mptcp-next 2/6] mptcp: do not account for OoO in mptcp_rcvbuf_grow()
  2025-11-20  8:39 [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement Paolo Abeni
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 1/6] trace: mptcp: add mptcp_rcvbuf_grow tracepoint Paolo Abeni
@ 2025-11-20  8:39 ` Paolo Abeni
  2025-11-27  0:06   ` Mat Martineau
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 3/6] mptcp: fix receive space timestamp initialization Paolo Abeni
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Paolo Abeni @ 2025-11-20  8:39 UTC (permalink / raw)
  To: mptcp; +Cc: martineau

MPTCP-level OoOs are physiological when multiple subflows are active
concurrently; they will not cause retransmissions, nor are they caused by
drops.

Accounting for them in mptcp_rcvbuf_grow() causes the rcvbuf to slowly
drift towards tcp_rmem[2].

Remove such accounting. Note that subflows will still account for TCP-level
OoO when the MPTCP-level rcvbuf is propagated.

This also closes a subtle and very unlikely race condition with rcvspace
init: on active sockets, with user-space holding the msk-level socket lock,
such initialization could complete in the receive callback after the first
OoO data has reached the rcvbuf, potentially triggering a divide-by-zero
Oops.
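To illustrate the divide-by-zero hazard, here is a minimal userspace sketch; the helper name and the growth formula are purely illustrative, not the kernel's, the point being only the division by the previous window size:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, loosely modeled on mptcp_rcvbuf_grow(): the growth
 * factor is scaled by the previous window value, hence the division below.
 */
static int rcvwin_grow(uint32_t newval, uint32_t oldval, uint64_t *rcvwin)
{
	/* In-kernel, do_div() by a zero oldval would Oops; rcvq_space.space
	 * is 0 until mptcp_rcv_space_init() has run.
	 */
	if (oldval == 0)
		return -1;
	*rcvwin += (((uint64_t)newval - oldval) << 1) / oldval;
	return 0;
}
```

With the OoO-path call removed, the grow logic can no longer run before the rcvspace init completes.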

Fixes: e118cdc34dd1 ("mptcp: rcvbuf auto-tuning improvement")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 64f592a7897c..e31ccc4bbb2d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -226,9 +226,6 @@ static bool mptcp_rcvbuf_grow(struct sock *sk, u32 newval)
 	do_div(grow, oldval);
 	rcvwin += grow << 1;
 
-	if (!RB_EMPTY_ROOT(&msk->out_of_order_queue))
-		rcvwin += MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq - msk->ack_seq;
-
 	cap = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
 
 	rcvbuf = min_t(u32, mptcp_space_from_win(sk, rcvwin), cap);
@@ -352,9 +349,6 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 end:
 	skb_condense(skb);
 	skb_set_owner_r(skb, sk);
-	/* do not grow rcvbuf for not-yet-accepted or orphaned sockets. */
-	if (sk->sk_socket)
-		mptcp_rcvbuf_grow(sk, msk->rcvq_space.space);
 }
 
 static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
-- 
2.51.1



* [PATCH v7 mptcp-next 3/6] mptcp: fix receive space timestamp initialization.
  2025-11-20  8:39 [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement Paolo Abeni
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 1/6] trace: mptcp: add mptcp_rcvbuf_grow tracepoint Paolo Abeni
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 2/6] mptcp: do not account for OoO in mptcp_rcvbuf_grow() Paolo Abeni
@ 2025-11-20  8:39 ` Paolo Abeni
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 4/6] mptcp: consolidate rcv space init Paolo Abeni
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Paolo Abeni @ 2025-11-20  8:39 UTC (permalink / raw)
  To: mptcp; +Cc: martineau

MPTCP initializes the receive buffer timestamp in mptcp_rcv_space_init(),
using the provided subflow timestamp. Such helper is invoked in several
places; for passive sockets, space init happens at clone time.

In that scenario, MPTCP ends up accessing the subflow timestamp before
its initialization, leading to fairly random timing for the first
receive buffer auto-tune event, as the timestamp for newly created
subflows is not refreshed there.

Fix the issue by moving the timestamp initialization out of the mentioned
helper, to data transfer start, and always using a fresh timestamp.

Fixes: 013e3179dbd2 ("mptcp: fix rcv space initialization")
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v6 -> v7:
  - do not remove the mptcp_rcv_space_init() call in mptcp_sk_clone_init()

v5 -> v6
  - really remove the stamp init from mptcp_rcv_space_init()

v1 -> v2:
  - factor out only the tstamp change for better reviewability
---
 net/mptcp/protocol.c | 8 ++++----
 net/mptcp/protocol.h | 5 +++++
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e31ccc4bbb2d..b25cc8d5c98d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2083,8 +2083,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 
 	msk->rcvq_space.copied += copied;
 
-	mstamp = div_u64(tcp_clock_ns(), NSEC_PER_USEC);
-	time = tcp_stamp_us_delta(mstamp, msk->rcvq_space.time);
+	mstamp = mptcp_stamp();
+	time = tcp_stamp_us_delta(mstamp, READ_ONCE(msk->rcvq_space.time));
 
 	rtt_us = msk->rcvq_space.rtt_us;
 	if (rtt_us && time < (rtt_us >> 3))
@@ -3525,6 +3525,7 @@ struct sock *mptcp_sk_clone_init(const struct sock *sk,
 	__mptcp_propagate_sndbuf(nsk, ssk);
 
 	mptcp_rcv_space_init(msk, ssk);
+	msk->rcvq_space.time = mptcp_stamp();
 
 	if (mp_opt->suboptions & OPTION_MPTCP_MPC_ACK)
 		__mptcp_subflow_fully_established(msk, subflow, mp_opt);
@@ -3542,8 +3543,6 @@ void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
 	msk->rcvq_space.copied = 0;
 	msk->rcvq_space.rtt_us = 0;
 
-	msk->rcvq_space.time = tp->tcp_mstamp;
-
 	/* initial rcv_space offering made to peer */
 	msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
 				      TCP_INIT_CWND * tp->advmss);
@@ -3739,6 +3738,7 @@ void mptcp_finish_connect(struct sock *ssk)
 	 * accessing the field below
 	 */
 	WRITE_ONCE(msk->local_key, subflow->local_key);
+	WRITE_ONCE(msk->rcvq_space.time, mptcp_stamp());
 
 	mptcp_pm_new_connection(msk, ssk, 0);
 }
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 199f28f3dd5e..95c62f2ac705 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -917,6 +917,11 @@ static inline bool mptcp_is_fully_established(struct sock *sk)
 	       READ_ONCE(mptcp_sk(sk)->fully_established);
 }
 
+static inline u64 mptcp_stamp(void)
+{
+	return div_u64(tcp_clock_ns(), NSEC_PER_USEC);
+}
+
 void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk);
 void mptcp_data_ready(struct sock *sk, struct sock *ssk);
 bool mptcp_finish_join(struct sock *sk);
-- 
2.51.1



* [PATCH v7 mptcp-next 4/6] mptcp: consolidate rcv space init
  2025-11-20  8:39 [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement Paolo Abeni
                   ` (2 preceding siblings ...)
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 3/6] mptcp: fix receive space timestamp initialization Paolo Abeni
@ 2025-11-20  8:39 ` Paolo Abeni
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator Paolo Abeni
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Paolo Abeni @ 2025-11-20  8:39 UTC (permalink / raw)
  To: mptcp; +Cc: martineau

MPTCP uses several calls to the mptcp_rcv_space_init() helper to
initialize the receive space, with a catch-up call in
mptcp_rcv_space_adjust().

Drop all the other invocations, which are not strictly needed, and move
the constant fields' initialization to socket init/reset time.

This removes a bit of complexity from the MPTCP DRS code. No functional
changes intended.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v4 -> v5:
  - reworded the commit message

v1 -> v2:
  - split helper consolidation out of v1 patch
  - additionally move 'copied' and 'rtt_us' initialization out of
    mptcp_rcv_space_init()
---
 net/mptcp/protocol.c | 30 +++++++++++++++---------------
 net/mptcp/protocol.h |  1 -
 net/mptcp/subflow.c  |  2 --
 3 files changed, 15 insertions(+), 18 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index b25cc8d5c98d..40c90d841237 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2061,6 +2061,21 @@ static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg,
 	return copied;
 }
 
+static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
+{
+	const struct tcp_sock *tp = tcp_sk(ssk);
+
+	msk->rcvspace_init = 1;
+	msk->rcvq_space.copied = 0;
+	msk->rcvq_space.rtt_us = 0;
+
+	/* initial rcv_space offering made to peer */
+	msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
+				      TCP_INIT_CWND * tp->advmss);
+	if (msk->rcvq_space.space == 0)
+		msk->rcvq_space.space = TCP_INIT_CWND * TCP_MSS_DEFAULT;
+}
+
 /* receive buffer autotuning.  See tcp_rcv_space_adjust for more information.
  *
  * Only difference: Use highest rtt estimate of the subflows in use.
@@ -3535,21 +3550,6 @@ struct sock *mptcp_sk_clone_init(const struct sock *sk,
 	return nsk;
 }
 
-void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
-{
-	const struct tcp_sock *tp = tcp_sk(ssk);
-
-	msk->rcvspace_init = 1;
-	msk->rcvq_space.copied = 0;
-	msk->rcvq_space.rtt_us = 0;
-
-	/* initial rcv_space offering made to peer */
-	msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
-				      TCP_INIT_CWND * tp->advmss);
-	if (msk->rcvq_space.space == 0)
-		msk->rcvq_space.space = TCP_INIT_CWND * TCP_MSS_DEFAULT;
-}
-
 static void mptcp_destroy(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 95c62f2ac705..ee0dbd6dbacf 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -922,7 +922,6 @@ static inline u64 mptcp_stamp(void)
 	return div_u64(tcp_clock_ns(), NSEC_PER_USEC);
 }
 
-void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk);
 void mptcp_data_ready(struct sock *sk, struct sock *ssk);
 bool mptcp_finish_join(struct sock *sk);
 bool mptcp_schedule_work(struct sock *sk);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 86ce58ae533d..23f439853a7c 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -462,8 +462,6 @@ void __mptcp_sync_state(struct sock *sk, int state)
 
 	subflow = mptcp_subflow_ctx(ssk);
 	__mptcp_propagate_sndbuf(sk, ssk);
-	if (!msk->rcvspace_init)
-		mptcp_rcv_space_init(msk, ssk);
 
 	if (sk->sk_state == TCP_SYN_SENT) {
 		/* subflow->idsn is always available is TCP_SYN_SENT state,
-- 
2.51.1



* [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator
  2025-11-20  8:39 [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement Paolo Abeni
                   ` (3 preceding siblings ...)
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 4/6] mptcp: consolidate rcv space init Paolo Abeni
@ 2025-11-20  8:39 ` Paolo Abeni
  2025-11-27  2:19   ` Mat Martineau
                     ` (2 more replies)
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 6/6] mptcp: add receive queue awareness in tcp_rcv_space_adjust() Paolo Abeni
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 16+ messages in thread
From: Paolo Abeni @ 2025-11-20  8:39 UTC (permalink / raw)
  To: mptcp; +Cc: martineau

The current MPTCP-level RTT estimator has several issues. On high-speed
links, the MPTCP-level receive buffer auto-tuning happens with a frequency
well above the TCP level's. That in turn can cause excessive/unneeded
receive buffer increases.

On such links, the initial rtt_us value is considerably higher
than the actual delay, and the current mptcp_rcv_space_adjust() updates
msk->rcvq_space.rtt_us with a period equal to that field's previous
value. If the initial rtt_us is 40ms, its first update will happen after
40ms, even if the subflows see actual RTTs orders of magnitude lower.

Additionally:
- setting the msk rtt to the maximum among all the subflows' RTTs makes DRS
constantly overshoot the rcvbuf size when a subflow has considerably
higher latency than the other(s).

- during unidirectional bulk transfers with multiple active subflows, the
TCP-level RTT estimator occasionally sees considerably higher values than
the real link delay, e.g. when the packet scheduler reacts to an incoming
ack on a given subflow by pushing data on a different subflow.

- currently inactive but still open subflows (i.e. switched to backup mode)
are always considered when computing the msk-level rtt.

Address all the issues above with a more accurate RTT estimation
strategy: the MPTCP-level RTT is set to the minimum of all the subflows
actually feeding data into the MPTCP receive buffer, using a small sliding
window.

While at it, also use an EWMA to compute the msk-level scaling_ratio, so
that MPTCP can avoid traversing the subflow list in
mptcp_rcv_space_adjust().

Use some care to avoid updating msk and ssk level fields too often.
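The two mechanisms described above can be sketched in userspace C, simplified from the patch (the names mirror the kernel ones but are standalone here): a fixed-size ring of per-subflow RTT samples read back as a minimum, plus a 7/8 EWMA for the scaling ratio.

```c
#include <stdint.h>

#define MPTCP_RTT_SAMPLES 5

struct rtt_est {
	uint32_t samples[MPTCP_RTT_SAMPLES];
	uint32_t next_sample;
};

static void rtt_est_init(struct rtt_est *e)
{
	for (int i = 0; i < MPTCP_RTT_SAMPLES; i++)
		e->samples[i] = UINT32_MAX;	/* "no sample yet" */
	e->next_sample = 0;
}

/* One slot is overwritten per update; after MPTCP_RTT_SAMPLES updates a
 * stale sample falls out of the window.
 */
static void rtt_est_add(struct rtt_est *e, uint32_t rtt_us)
{
	e->samples[e->next_sample] = rtt_us;
	if (++e->next_sample == MPTCP_RTT_SAMPLES)
		e->next_sample = 0;
}

/* Minimum over the window, as mptcp_rtt_us_est() does. */
static uint32_t rtt_est_min(const struct rtt_est *e)
{
	uint32_t rtt_us = e->samples[0];

	for (int i = 1; i < MPTCP_RTT_SAMPLES; i++)
		if (e->samples[i] < rtt_us)
			rtt_us = e->samples[i];
	return rtt_us;
}

/* 7/8 old + 1/8 new, matching the patch's scaling_ratio update. */
static uint8_t ewma8(uint8_t old, uint8_t sample)
{
	return (((unsigned int)old << 3) - old + sample) >> 3;
}
```

A slow sample (e.g. biased by cross-subflow scheduling) cannot dominate: it only wins if every other recent subflow sample is slower, and it ages out after a window's worth of updates.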

Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v4 -> v5:
  - avoid filtering out too-high values; use a sliding window instead

v3 -> v4:
  - really refresh msk rtt after a full win per subflow (off-by-one in prev
    revision)
  - sync mptcp_rcv_space_adjust() comment with the new code

v1 -> v2:
  - do not use explicit reset flags - do rcv win based decision instead
  - discard 0 rtt_us samples from subflows
  - discard samples on non empty rx queue
  - discard "too high" samples, see the code comments WRT the whys
---
 include/trace/events/mptcp.h |  2 +-
 net/mptcp/protocol.c         | 63 ++++++++++++++++++++----------------
 net/mptcp/protocol.h         | 38 +++++++++++++++++++++-
 3 files changed, 73 insertions(+), 30 deletions(-)

diff --git a/include/trace/events/mptcp.h b/include/trace/events/mptcp.h
index 269d949b2025..04521acba483 100644
--- a/include/trace/events/mptcp.h
+++ b/include/trace/events/mptcp.h
@@ -219,7 +219,7 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
 		__be32 *p32;
 
 		__entry->time = time;
-		__entry->rtt_us = msk->rcvq_space.rtt_us >> 3;
+		__entry->rtt_us = mptcp_rtt_us_est(msk) >> 3;
 		__entry->copied = msk->rcvq_space.copied;
 		__entry->inq = mptcp_inq_hint(sk);
 		__entry->space = msk->rcvq_space.space;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 40c90d841237..5dde016a108d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -880,6 +880,32 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	return moved;
 }
 
+static void mptcp_rcv_rtt_update(struct mptcp_sock *msk,
+				 struct mptcp_subflow_context *subflow)
+{
+	const struct tcp_sock *tp = tcp_sk(subflow->tcp_sock);
+	u32 rtt_us = tp->rcv_rtt_est.rtt_us;
+	int id;
+
+	/* Update once per subflow per rcvwnd to avoid touching the msk
+	 * too often.
+	 */
+	if (!rtt_us || tp->rcv_rtt_est.seq == subflow->prev_rtt_seq)
+		return;
+
+	subflow->prev_rtt_seq = tp->rcv_rtt_est.seq;
+
+	/* Pairs with READ_ONCE() in mptcp_rtt_us_est(). */
+	id = msk->rcv_rtt_est.next_sample;
+	WRITE_ONCE(msk->rcv_rtt_est.samples[id], rtt_us);
+	if (++msk->rcv_rtt_est.next_sample == MPTCP_RTT_SAMPLES)
+		msk->rcv_rtt_est.next_sample = 0;
+
+	/* EWMA among the incoming subflows */
+	msk->scaling_ratio = ((msk->scaling_ratio << 3) - msk->scaling_ratio +
+			     tp->scaling_ratio) >> 3;
+}
+
 void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
@@ -893,6 +919,7 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 		return;
 
 	mptcp_data_lock(sk);
+	mptcp_rcv_rtt_update(msk, subflow);
 	if (!sock_owned_by_user(sk)) {
 		/* Wake-up the reader only for in-sequence data */
 		if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
@@ -2067,7 +2094,6 @@ static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
 
 	msk->rcvspace_init = 1;
 	msk->rcvq_space.copied = 0;
-	msk->rcvq_space.rtt_us = 0;
 
 	/* initial rcv_space offering made to peer */
 	msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
@@ -2078,15 +2104,15 @@ static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
 
 /* receive buffer autotuning.  See tcp_rcv_space_adjust for more information.
  *
- * Only difference: Use highest rtt estimate of the subflows in use.
+ * Only difference: Use lowest rtt estimate of the subflows in use, see
+ * mptcp_rcv_rtt_update() and mptcp_rtt_us_est().
  */
 static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 {
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
-	u8 scaling_ratio = U8_MAX;
-	u32 time, advmss = 1;
-	u64 rtt_us, mstamp;
+	u32 time, rtt_us;
+	u64 mstamp;
 
 	msk_owned_by_me(msk);
 
@@ -2101,29 +2127,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	mstamp = mptcp_stamp();
 	time = tcp_stamp_us_delta(mstamp, READ_ONCE(msk->rcvq_space.time));
 
-	rtt_us = msk->rcvq_space.rtt_us;
-	if (rtt_us && time < (rtt_us >> 3))
-		return;
-
-	rtt_us = 0;
-	mptcp_for_each_subflow(msk, subflow) {
-		const struct tcp_sock *tp;
-		u64 sf_rtt_us;
-		u32 sf_advmss;
-
-		tp = tcp_sk(mptcp_subflow_tcp_sock(subflow));
-
-		sf_rtt_us = READ_ONCE(tp->rcv_rtt_est.rtt_us);
-		sf_advmss = READ_ONCE(tp->advmss);
-
-		rtt_us = max(sf_rtt_us, rtt_us);
-		advmss = max(sf_advmss, advmss);
-		scaling_ratio = min(tp->scaling_ratio, scaling_ratio);
-	}
-
-	msk->rcvq_space.rtt_us = rtt_us;
-	msk->scaling_ratio = scaling_ratio;
-	if (time < (rtt_us >> 3) || rtt_us == 0)
+	rtt_us = mptcp_rtt_us_est(msk);
+	if (rtt_us == U32_MAX || time < (rtt_us >> 3))
 		return;
 
 	if (msk->rcvq_space.copied <= msk->rcvq_space.space)
@@ -2969,6 +2974,7 @@ static void __mptcp_init_sock(struct sock *sk)
 	msk->timer_ival = TCP_RTO_MIN;
 	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
 	msk->backlog_len = 0;
+	mptcp_init_rtt_est(msk);
 
 	WRITE_ONCE(msk->first, NULL);
 	inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
@@ -3412,6 +3418,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 	msk->bytes_sent = 0;
 	msk->bytes_retrans = 0;
 	msk->rcvspace_init = 0;
+	mptcp_init_rtt_est(msk);
 
 	/* for fallback's sake */
 	WRITE_ONCE(msk->ack_seq, 0);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index ee0dbd6dbacf..b392d7855928 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -269,6 +269,13 @@ struct mptcp_data_frag {
 	struct page *page;
 };
 
+/* Arbitrary compromise between as low as possible to react timely to subflow
+ * close event and as big as possible to avoid being fouled by biased large
+ * samples due to peer sending data on a different subflow WRT to the incoming
+ * ack.
+ */
+#define MPTCP_RTT_SAMPLES	5
+
 /* MPTCP connection sock */
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
@@ -340,11 +347,17 @@ struct mptcp_sock {
 				 */
 	struct mptcp_pm_data	pm;
 	struct mptcp_sched_ops	*sched;
+
+	/* Most recent rtt_us observed by in use incoming subflows. */
+	struct {
+		u32	samples[MPTCP_RTT_SAMPLES];
+		u32	next_sample;
+	} rcv_rtt_est;
+
 	struct {
 		int	space;	/* bytes copied in last measurement window */
 		int	copied; /* bytes copied in this measurement window */
 		u64	time;	/* start time of measurement window */
-		u64	rtt_us; /* last maximum rtt of subflows */
 	} rcvq_space;
 	u8		scaling_ratio;
 	bool		allow_subflows;
@@ -422,6 +435,27 @@ static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
 	return msk->first_pending;
 }
 
+static inline void mptcp_init_rtt_est(struct mptcp_sock *msk)
+{
+	int i;
+
+	for (i = 0; i < MPTCP_RTT_SAMPLES; ++i)
+		msk->rcv_rtt_est.samples[i] = U32_MAX;
+	msk->rcv_rtt_est.next_sample = 0;
+	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+}
+
+static inline u32 mptcp_rtt_us_est(const struct mptcp_sock *msk)
+{
+	u32 rtt_us = msk->rcv_rtt_est.samples[0];
+	int i;
+
+	/* Lockless access of collected samples. */
+	for (i = 1; i < MPTCP_RTT_SAMPLES; ++i)
+		rtt_us = min(rtt_us, READ_ONCE(msk->rcv_rtt_est.samples[i]));
+	return rtt_us;
+}
+
 static inline struct mptcp_data_frag *mptcp_send_next(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -523,6 +557,8 @@ struct mptcp_subflow_context {
 	u32	map_data_len;
 	__wsum	map_data_csum;
 	u32	map_csum_len;
+	u32	prev_rtt_us;
+	u32	prev_rtt_seq;
 	u32	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_join : 1,   /* send MP_JOIN */
 		request_bkup : 1,
-- 
2.51.1



* [PATCH v7 mptcp-next 6/6] mptcp: add receive queue awareness in tcp_rcv_space_adjust()
  2025-11-20  8:39 [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement Paolo Abeni
                   ` (4 preceding siblings ...)
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator Paolo Abeni
@ 2025-11-20  8:39 ` Paolo Abeni
  2025-11-20  9:48 ` [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement MPTCP CI
  2025-11-27 18:42 ` Matthieu Baerts
  7 siblings, 0 replies; 16+ messages in thread
From: Paolo Abeni @ 2025-11-20  8:39 UTC (permalink / raw)
  To: mptcp; +Cc: martineau

This is the MPTCP counterpart of commit ea33537d8292 ("tcp: add receive
queue awareness in tcp_rcv_space_adjust()").

Prior to this commit:

ESTAB 33165568 0      192.168.255.2:5201 192.168.255.1:53380 \
	skmem:(r33076416,rb33554432,t0,tb91136,f448,w0,o0,bl0,d0)

After:
ESTAB 3279168 0      192.168.255.2:5201 192.168.255.1:53042 \
	skmem:(r3190912,rb3719956,t0,tb91136,f1536,w0,o0,bl0,d0)
(same tput)
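The core of the change can be sketched as follows (hypothetical standalone helper; in the patch the subtraction uses mptcp_inq_hint()): bytes still queued in the receive buffer were received but not yet consumed by the reader, so they should not count towards the measurement window.

```c
/* Return nonzero when the rcvbuf should be grown: only bytes actually
 * consumed by the reader (copied minus what is still queued) are compared
 * against the current measurement window.
 */
static int should_grow(int copied, int inq, int space)
{
	copied -= inq;
	return copied > space;
}
```

This is what keeps the rcvbuf near 3 MB in the "After" ss output above instead of ballooning to 32 MB: data the application has not drained no longer looks like demand for a bigger buffer.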

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 5dde016a108d..6f4220c8b5bb 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2131,11 +2131,13 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	if (rtt_us == U32_MAX || time < (rtt_us >> 3))
 		return;
 
-	if (msk->rcvq_space.copied <= msk->rcvq_space.space)
+	copied = msk->rcvq_space.copied;
+	copied -= mptcp_inq_hint(sk);
+	if (copied <= msk->rcvq_space.space)
 		goto new_measure;
 
 	trace_mptcp_rcvbuf_grow(sk, time);
-	if (mptcp_rcvbuf_grow(sk, msk->rcvq_space.copied)) {
+	if (mptcp_rcvbuf_grow(sk, copied)) {
 		/* Make subflows follow along.  If we do not do this, we
 		 * get drops at subflow level if skbs can't be moved to
 		 * the mptcp rx queue fast enough (announced rcv_win can
@@ -2149,7 +2151,7 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 			slow = lock_sock_fast(ssk);
 			/* subflows can be added before tcp_init_transfer() */
 			if (tcp_sk(ssk)->rcvq_space.space)
-				tcp_rcvbuf_grow(ssk, msk->rcvq_space.copied);
+				tcp_rcvbuf_grow(ssk, copied);
 			unlock_sock_fast(ssk, slow);
 		}
 	}
-- 
2.51.1



* Re: [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement
  2025-11-20  8:39 [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement Paolo Abeni
                   ` (5 preceding siblings ...)
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 6/6] mptcp: add receive queue awareness in tcp_rcv_space_adjust() Paolo Abeni
@ 2025-11-20  9:48 ` MPTCP CI
  2025-11-27 18:42 ` Matthieu Baerts
  7 siblings, 0 replies; 16+ messages in thread
From: MPTCP CI @ 2025-11-20  9:48 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal (except selftest_mptcp_join): Success! ✅
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Success! ✅
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/19531247842

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/9db02a7e6d33
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1025709


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts already made to have a stable
test suite when executed on a public CI like this one, it is possible that
some reported issues are not due to your modifications. Still, do not
hesitate to help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)


* Re: [PATCH v7 mptcp-next 2/6] mptcp: do not account for OoO in mptcp_rcvbuf_grow()
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 2/6] mptcp: do not account for OoO in mptcp_rcvbuf_grow() Paolo Abeni
@ 2025-11-27  0:06   ` Mat Martineau
  0 siblings, 0 replies; 16+ messages in thread
From: Mat Martineau @ 2025-11-27  0:06 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

On Thu, 20 Nov 2025, Paolo Abeni wrote:

> MPTCP-level OoOs are physiological when multiple subflows are active
> concurrently and will not cause retransmissions nor are caused by
> drops.
>
> Accounting for them in mptcp_rcvbuf_grow() causes the rcvbuf to slowly
> drift towards tcp_rmem[2].
>
> Remove such accounting. Note that subflows will still account for TCP-level
> OoO when the MPTCP-level rcvbuf is propagated.
>
> This also closes a subtle and very unlikely race condition with rcvspace
> init: active sockets, with user-space holding the msk-level socket lock,
> could complete such initialization in the receive callback after the
> first OoO data reaches the rcvbuf, potentially triggering a divide-by-zero
> Oops.
>
> Fixes: e118cdc34dd1 ("mptcp: rcvbuf auto-tuning improvement")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/mptcp/protocol.c | 6 ------
> 1 file changed, 6 deletions(-)

Looks good, thanks Paolo:

Reviewed-by: Mat Martineau <martineau@kernel.org>


>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 64f592a7897c..e31ccc4bbb2d 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -226,9 +226,6 @@ static bool mptcp_rcvbuf_grow(struct sock *sk, u32 newval)
> 	do_div(grow, oldval);
> 	rcvwin += grow << 1;
>
> -	if (!RB_EMPTY_ROOT(&msk->out_of_order_queue))
> -		rcvwin += MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq - msk->ack_seq;
> -
> 	cap = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
>
> 	rcvbuf = min_t(u32, mptcp_space_from_win(sk, rcvwin), cap);
> @@ -352,9 +349,6 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
> end:
> 	skb_condense(skb);
> 	skb_set_owner_r(skb, sk);
> -	/* do not grow rcvbuf for not-yet-accepted or orphaned sockets. */
> -	if (sk->sk_socket)
> -		mptcp_rcvbuf_grow(sk, msk->rcvq_space.space);
> }
>
> static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
> -- 
> 2.51.1
>
>


* Re: [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator Paolo Abeni
@ 2025-11-27  2:19   ` Mat Martineau
  2025-11-27  7:36     ` Paolo Abeni
  2025-11-27 18:13   ` Matthieu Baerts
  2025-12-16 16:38   ` Matthieu Baerts
  2 siblings, 1 reply; 16+ messages in thread
From: Mat Martineau @ 2025-11-27  2:19 UTC (permalink / raw)
  To: Paolo Abeni, Matthieu Baerts; +Cc: mptcp

On Thu, 20 Nov 2025, Paolo Abeni wrote:

> The current MPTCP-level RTT estimator has several issues. On high speed
> links, the MPTCP-level receive buffer auto-tuning happens with a frequency
> well above the TCP-level's one. That in turn can cause excessive/unneeded
> receive buffer increase.
>

Thanks for the revisions Paolo, this definitely addresses my concerns 
about rtt values getting "stuck".

Am I understanding correctly: the msk-level rtt estimate doesn't have 
a direct effect on the size of the autotuned receive buffer; it just 
changes how often it's updated?

It seems like this new algorithm will do a good job of protecting against 
high-rtt outliers and effects from inactive subflows. It might have 
low-rtt outliers; is that a problem at all? Or is the occasional early 
update ok?

I think it would be a good idea to put the series into the export branch 
so we can get some test coverage - but would still like to hear your 
thoughts on the low-rtt outlier question!

Reviewed-by: Mat Martineau <martineau@kernel.org>


> On such links, the initial rtt_us value is considerably higher
> than the actual delay, and the current mptcp_rcv_space_adjust() updates
> msk->rcvq_space.rtt_us with a period equal to the such field previous
> value. If the initial rtt_us is 40ms, its first update will happen after
> 40ms, even if the subflows see actual RTT orders of magnitude lower.
>
> Additionally:
> - setting the msk rtt to the maximum among all the subflows' RTTs makes DRS
> constantly overshoot the rcvbuf size when a subflow has considerably
> higher latency than the other(s).
>
> - during unidirectional bulk transfers with multiple active subflows, the
> TCP-level RTT estimator occasionally sees a considerably higher value than
> the real link delay, i.e. when the packet scheduler reacts to an incoming
> ack on given subflow pushing data on a different subflow.
>
> - currently inactive but still open subflows (i.e. switched to backup mode)
> are always considered when computing the msk-level rtt.
>
> Address all the issues above with a more accurate RTT estimation
> strategy: the MPTCP-level RTT is set to the minimum of all the subflows
> actually feeding data into the MPTCP receive buffer, using a small sliding
> window.
>
> While at it, also use EWMA to compute the msk-level scaling_ratio, so that
> mptcp can avoid traversing the subflow list in mptcp_rcv_space_adjust().
>
> Use some care to avoid updating msk and ssk level fields too often.
>
> Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v4 -> v5:
>  - avoid filtering out too high value, use sliding window instead
>
> v3 -> v4:
>  - really refresh msk rtt after a full win per subflow (off-by-one in prev
>    revision)
>  - sync mptcp_rcv_space_adjust() comment with the new code
>
> v1 -> v2:
>  - do not use explicit reset flags - do rcv win based decision instead
>  - discard 0 rtt_us samples from subflows
>  - discard samples on non empty rx queue
>  - discard "too high" samples, see the code comments WRT the whys
> ---
> include/trace/events/mptcp.h |  2 +-
> net/mptcp/protocol.c         | 63 ++++++++++++++++++++----------------
> net/mptcp/protocol.h         | 38 +++++++++++++++++++++-
> 3 files changed, 73 insertions(+), 30 deletions(-)
>
> diff --git a/include/trace/events/mptcp.h b/include/trace/events/mptcp.h
> index 269d949b2025..04521acba483 100644
> --- a/include/trace/events/mptcp.h
> +++ b/include/trace/events/mptcp.h
> @@ -219,7 +219,7 @@ TRACE_EVENT(mptcp_rcvbuf_grow,
> 		__be32 *p32;
>
> 		__entry->time = time;
> -		__entry->rtt_us = msk->rcvq_space.rtt_us >> 3;
> +		__entry->rtt_us = mptcp_rtt_us_est(msk) >> 3;
> 		__entry->copied = msk->rcvq_space.copied;
> 		__entry->inq = mptcp_inq_hint(sk);
> 		__entry->space = msk->rcvq_space.space;
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 40c90d841237..5dde016a108d 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -880,6 +880,32 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
> 	return moved;
> }
>
> +static void mptcp_rcv_rtt_update(struct mptcp_sock *msk,
> +				 struct mptcp_subflow_context *subflow)
> +{
> +	const struct tcp_sock *tp = tcp_sk(subflow->tcp_sock);
> +	u32 rtt_us = tp->rcv_rtt_est.rtt_us;
> +	int id;
> +
> +	/* Update once per subflow per rcvwnd to avoid touching the msk
> +	 * too often.
> +	 */
> +	if (!rtt_us || tp->rcv_rtt_est.seq == subflow->prev_rtt_seq)
> +		return;
> +
> +	subflow->prev_rtt_seq = tp->rcv_rtt_est.seq;
> +
> +	/* Pairs with READ_ONCE() in mptcp_rtt_us_est(). */
> +	id = msk->rcv_rtt_est.next_sample;
> +	WRITE_ONCE(msk->rcv_rtt_est.samples[id], rtt_us);
> +	if (++msk->rcv_rtt_est.next_sample == MPTCP_RTT_SAMPLES)
> +		msk->rcv_rtt_est.next_sample = 0;
> +
> +	/* EWMA among the incoming subflows */
> +	msk->scaling_ratio = ((msk->scaling_ratio << 3) - msk->scaling_ratio +
> +			     tp->scaling_ratio) >> 3;
> +}
> +
> void mptcp_data_ready(struct sock *sk, struct sock *ssk)
> {
> 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
> @@ -893,6 +919,7 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
> 		return;
>
> 	mptcp_data_lock(sk);
> +	mptcp_rcv_rtt_update(msk, subflow);
> 	if (!sock_owned_by_user(sk)) {
> 		/* Wake-up the reader only for in-sequence data */
> 		if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
> @@ -2067,7 +2094,6 @@ static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
>
> 	msk->rcvspace_init = 1;
> 	msk->rcvq_space.copied = 0;
> -	msk->rcvq_space.rtt_us = 0;
>
> 	/* initial rcv_space offering made to peer */
> 	msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
> @@ -2078,15 +2104,15 @@ static void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
>
> /* receive buffer autotuning.  See tcp_rcv_space_adjust for more information.
>  *
> - * Only difference: Use highest rtt estimate of the subflows in use.
> + * Only difference: Use lowest rtt estimate of the subflows in use, see
> + * mptcp_rcv_rtt_update() and mptcp_rtt_us_est().
>  */
> static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
> {
> 	struct mptcp_subflow_context *subflow;
> 	struct sock *sk = (struct sock *)msk;
> -	u8 scaling_ratio = U8_MAX;
> -	u32 time, advmss = 1;
> -	u64 rtt_us, mstamp;
> +	u32 time, rtt_us;
> +	u64 mstamp;
>
> 	msk_owned_by_me(msk);
>
> @@ -2101,29 +2127,8 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
> 	mstamp = mptcp_stamp();
> 	time = tcp_stamp_us_delta(mstamp, READ_ONCE(msk->rcvq_space.time));
>
> -	rtt_us = msk->rcvq_space.rtt_us;
> -	if (rtt_us && time < (rtt_us >> 3))
> -		return;
> -
> -	rtt_us = 0;
> -	mptcp_for_each_subflow(msk, subflow) {
> -		const struct tcp_sock *tp;
> -		u64 sf_rtt_us;
> -		u32 sf_advmss;
> -
> -		tp = tcp_sk(mptcp_subflow_tcp_sock(subflow));
> -
> -		sf_rtt_us = READ_ONCE(tp->rcv_rtt_est.rtt_us);
> -		sf_advmss = READ_ONCE(tp->advmss);
> -
> -		rtt_us = max(sf_rtt_us, rtt_us);
> -		advmss = max(sf_advmss, advmss);
> -		scaling_ratio = min(tp->scaling_ratio, scaling_ratio);
> -	}
> -
> -	msk->rcvq_space.rtt_us = rtt_us;
> -	msk->scaling_ratio = scaling_ratio;
> -	if (time < (rtt_us >> 3) || rtt_us == 0)
> +	rtt_us = mptcp_rtt_us_est(msk);
> +	if (rtt_us == U32_MAX || time < (rtt_us >> 3))
> 		return;
>
> 	if (msk->rcvq_space.copied <= msk->rcvq_space.space)
> @@ -2969,6 +2974,7 @@ static void __mptcp_init_sock(struct sock *sk)
> 	msk->timer_ival = TCP_RTO_MIN;
> 	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
> 	msk->backlog_len = 0;
> +	mptcp_init_rtt_est(msk);
>
> 	WRITE_ONCE(msk->first, NULL);
> 	inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
> @@ -3412,6 +3418,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
> 	msk->bytes_sent = 0;
> 	msk->bytes_retrans = 0;
> 	msk->rcvspace_init = 0;
> +	mptcp_init_rtt_est(msk);
>
> 	/* for fallback's sake */
> 	WRITE_ONCE(msk->ack_seq, 0);
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index ee0dbd6dbacf..b392d7855928 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -269,6 +269,13 @@ struct mptcp_data_frag {
> 	struct page *page;
> };
>
> +/* Arbitrary compromise between as low as possible to react timely to subflow
> + * close event and as big as possible to avoid being fouled by biased large
> + * samples due to peer sending data on a different subflow WRT to the incoming
> + * ack.
> + */
> +#define MPTCP_RTT_SAMPLES	5
> +
> /* MPTCP connection sock */
> struct mptcp_sock {
> 	/* inet_connection_sock must be the first member */
> @@ -340,11 +347,17 @@ struct mptcp_sock {
> 				 */
> 	struct mptcp_pm_data	pm;
> 	struct mptcp_sched_ops	*sched;
> +
> +	/* Most recent rtt_us observed by in use incoming subflows. */
> +	struct {
> +		u32	samples[MPTCP_RTT_SAMPLES];
> +		u32	next_sample;
> +	} rcv_rtt_est;
> +
> 	struct {
> 		int	space;	/* bytes copied in last measurement window */
> 		int	copied; /* bytes copied in this measurement window */
> 		u64	time;	/* start time of measurement window */
> -		u64	rtt_us; /* last maximum rtt of subflows */
> 	} rcvq_space;
> 	u8		scaling_ratio;
> 	bool		allow_subflows;
> @@ -422,6 +435,27 @@ static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
> 	return msk->first_pending;
> }
>
> +static inline void mptcp_init_rtt_est(struct mptcp_sock *msk)
> +{
> +	int i;
> +
> +	for (i = 0; i < MPTCP_RTT_SAMPLES; ++i)
> +		msk->rcv_rtt_est.samples[i] = U32_MAX;
> +	msk->rcv_rtt_est.next_sample = 0;
> +	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
> +}
> +
> +static inline u32 mptcp_rtt_us_est(const struct mptcp_sock *msk)
> +{
> +	u32 rtt_us = msk->rcv_rtt_est.samples[0];
> +	int i;
> +
> +	/* Lockless access of collected samples. */
> +	for (i = 1; i < MPTCP_RTT_SAMPLES; ++i)
> +		rtt_us = min(rtt_us, READ_ONCE(msk->rcv_rtt_est.samples[i]));
> +	return rtt_us;
> +}
> +
> static inline struct mptcp_data_frag *mptcp_send_next(struct sock *sk)
> {
> 	struct mptcp_sock *msk = mptcp_sk(sk);
> @@ -523,6 +557,8 @@ struct mptcp_subflow_context {
> 	u32	map_data_len;
> 	__wsum	map_data_csum;
> 	u32	map_csum_len;
> +	u32	prev_rtt_us;
> +	u32	prev_rtt_seq;
> 	u32	request_mptcp : 1,  /* send MP_CAPABLE */
> 		request_join : 1,   /* send MP_JOIN */
> 		request_bkup : 1,
> -- 
> 2.51.1
>
>
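The sliding-window bookkeeping added by the patch above can be modeled in a
few lines of userspace code. This is an illustrative Python sketch, not
kernel code: the class name is invented, and only the constants and update
rules mirror the patch (mptcp_init_rtt_est(), mptcp_rcv_rtt_update(),
mptcp_rtt_us_est(), and the integer EWMA used for scaling_ratio).

```python
# Userspace model of the patch's RTT bookkeeping: a 5-slot ring buffer
# of per-subflow RTT samples, read back as a min() over the window.
MPTCP_RTT_SAMPLES = 5
U32_MAX = 2**32 - 1

class RcvRttEst:  # invented name, models msk->rcv_rtt_est
    def __init__(self):
        # mptcp_init_rtt_est(): unfilled slots read as U32_MAX so they
        # never win the min() below.
        self.samples = [U32_MAX] * MPTCP_RTT_SAMPLES
        self.next_sample = 0

    def update(self, rtt_us):
        # mptcp_rcv_rtt_update(): overwrite the oldest slot.
        self.samples[self.next_sample] = rtt_us
        self.next_sample = (self.next_sample + 1) % MPTCP_RTT_SAMPLES

    def est(self):
        # mptcp_rtt_us_est(): minimum over the whole window.
        return min(self.samples)

def ewma_scaling_ratio(old, new):
    # msk->scaling_ratio update: ((old << 3) - old + new) >> 3,
    # i.e. an integer EWMA with alpha = 1/8.
    return ((old << 3) - old + new) >> 3

est = RcvRttEst()
for rtt in (1200, 900, 45000, 1100, 950):  # one scheduler-induced spike
    est.update(rtt)
print(est.est())  # 900: the 45ms outlier never wins the min()
```

Note how a single large outlier never influences the estimate at all; it
simply ages out of the window after MPTCP_RTT_SAMPLES further updates.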


* Re: [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator
  2025-11-27  2:19   ` Mat Martineau
@ 2025-11-27  7:36     ` Paolo Abeni
  0 siblings, 0 replies; 16+ messages in thread
From: Paolo Abeni @ 2025-11-27  7:36 UTC (permalink / raw)
  To: Mat Martineau, Matthieu Baerts; +Cc: mptcp

On 11/27/25 3:19 AM, Mat Martineau wrote:
> On Thu, 20 Nov 2025, Paolo Abeni wrote:
> 
>> The current MPTCP-level RTT estimator has several issues. On high speed
>> links, the MPTCP-level receive buffer auto-tuning happens with a frequency
>> well above the TCP-level's one. That in turn can cause excessive/unneeded
>> receive buffer increase.
>>
> 
> Thanks for the revisions Paolo, this definitely addresses my concerns 
> about rtt values getting "stuck".
> 
> Am I understanding correctly: the msk-level rtt estimate doesn't have 
> a direct effect on the size of the autotuned receive buffer, it just 
> changes how often it's updated?

The ties between the rtt estimate and receive buffer auto-tuning are very
tight: DRS accounts for the amount of data received in one estimated RTT,
and derives the rcvbuf size from that value.

If the RTT estimate is, e.g., larger than the actual RTT, the accounted
data can also be significantly greater than the real amount of data the
connection is pushing in one RTT, and the rcvbuf estimate will be
unnecessarily large.

I guess it could be called an indirect effect, but a very deterministic
and impactful one.
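The effect can be sketched with a toy userspace model (illustrative
Python, not kernel code; the function name and the rates are made up):

```python
# Toy model of the effect described above: DRS accounts the bytes
# received during one *estimated* RTT and sizes the rcvbuf from that,
# so an RTT estimate larger than the real one inflates the result.

def drs_measured_bytes(rate_bytes_per_ms, est_rtt_ms):
    """Bytes DRS attributes to 'one RTT' worth of traffic."""
    return rate_bytes_per_ms * est_rtt_ms

rate = 1000        # peer pushes 1000 bytes/ms
real_rtt_ms = 1    # actual link RTT
stale_rtt_ms = 40  # stale initial msk-level estimate

print(drs_measured_bytes(rate, real_rtt_ms))   # 1000: matches the real BDP
print(drs_measured_bytes(rate, stale_rtt_ms))  # 40000: 40x overshoot
```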

> It seems like this new algorithm will do a good job of protecting against 
> high-rtt outliers and effects from inactive subflows. It might have 
> low-rtt outliers, 

Can it? How? AFAICS subflow-level RTT estimate can't be lower than the
current wire RTT. It can be a bit slow - due to the sliding window - to
adapt to RTT increases.

> is that a problem at all? Or is the occasional early 
> update ok?

A lower-than-real RTT estimate could prevent the rcvbuf from growing and
slow down the transfer, as it becomes rcvbuf limited. I guess there are
scenarios where hitting such a condition could be possible, but:

- it should be quite hard to see in practice and FWIW I never observed it here.
- the situation is transient; with time the RTT estimate should always
converge to the correct value.
- I think it's far better than the opposite, especially on the server
side.
/P



* Re: [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator Paolo Abeni
  2025-11-27  2:19   ` Mat Martineau
@ 2025-11-27 18:13   ` Matthieu Baerts
  2025-11-28  8:47     ` Paolo Abeni
  2025-12-16 16:38   ` Matthieu Baerts
  2 siblings, 1 reply; 16+ messages in thread
From: Matthieu Baerts @ 2025-11-27 18:13 UTC (permalink / raw)
  To: Paolo Abeni, martineau; +Cc: mptcp

Hi Paolo, Mat,

On 20/11/2025 09:39, Paolo Abeni wrote:
> The current MPTCP-level RTT estimator has several issues. On high speed
> links, the MPTCP-level receive buffer auto-tuning happens with a frequency
> well above the TCP-level's one. That in turn can cause excessive/unneeded
> receive buffer increase.
> 
> On such links, the initial rtt_us value is considerably higher
> than the actual delay, and the current mptcp_rcv_space_adjust() updates
> msk->rcvq_space.rtt_us with a period equal to that field's previous
> value. If the initial rtt_us is 40ms, its first update will happen after
> 40ms, even if the subflows see actual RTT orders of magnitude lower.
> 
> Additionally:
> - setting the msk rtt to the maximum among all the subflows RTTs makes DRS
> constantly overshoot the rcvbuf size when a subflow has considerably
> higher latency than the other(s).
> 
> - during unidirectional bulk transfers with multiple active subflows, the
> TCP-level RTT estimator occasionally sees considerably higher value than
> the real link delay, i.e. when the packet scheduler reacts to an incoming
> ack on given subflow pushing data on a different subflow.
> 
> - currently inactive but still open subflows (i.e. switched to backup mode)
> are always considered when computing the msk-level rtt.
> 
> Address all the issues above with a more accurate RTT estimation
> strategy: the MPTCP-level RTT is set to the minimum of all the subflows
> actually feeding data into the MPTCP receive buffer, using a small sliding
> window.

(...)

> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index ee0dbd6dbacf..b392d7855928 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -269,6 +269,13 @@ struct mptcp_data_frag {
>  	struct page *page;
>  };
>  
> +/* Arbitrary compromise between as low as possible to react timely to subflow
> + * close event and as big as possible to avoid being fouled by biased large
> + * samples due to peer sending data on a different subflow WRT to the incoming
> + * ack.
> + */
> +#define MPTCP_RTT_SAMPLES	5
> +
>  /* MPTCP connection sock */
>  struct mptcp_sock {
>  	/* inet_connection_sock must be the first member */
> @@ -340,11 +347,17 @@ struct mptcp_sock {
>  				 */
>  	struct mptcp_pm_data	pm;
>  	struct mptcp_sched_ops	*sched;
> +
> +	/* Most recent rtt_us observed by in use incoming subflows. */
> +	struct {
> +		u32	samples[MPTCP_RTT_SAMPLES];
> +		u32	next_sample;
> +	} rcv_rtt_est;

I'm sorry to react only now; I didn't manage to follow this in detail,
but I have one question: why not use a smoothed RTT [1]? Is it because
the goal is to mix data from the active/recently used subflows and only
take the minimum, rather than "combining" RTTs from different subflows?

[1] https://datatracker.ietf.org/doc/rfc6298/
    srtt = old * (1-alpha) + new * alpha   # alpha is 1/8 in RFC6298

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



* Re: [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement
  2025-11-20  8:39 [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement Paolo Abeni
                   ` (6 preceding siblings ...)
  2025-11-20  9:48 ` [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement MPTCP CI
@ 2025-11-27 18:42 ` Matthieu Baerts
  7 siblings, 0 replies; 16+ messages in thread
From: Matthieu Baerts @ 2025-11-27 18:42 UTC (permalink / raw)
  To: Paolo Abeni, mptcp; +Cc: martineau

Hi Paolo, Mat,

On 20/11/2025 09:39, Paolo Abeni wrote:
> This series collects a few follow-ups for the backlog refactor, with some
> of them posing as fixes just to confuse the enemy.
> 
> Targeting net-next as the issues addressed are very old, the change
> is quite invasive and, as mentioned, based on the BL refactor.
> 
> Overall this:
> - introduces a tracepoint similar to tcp useful to track down the issues
>   addressed here
> - avoid rx drop in corner case scenarios, improving tput stability and
>   peak tput over fast link
> - refactor the rcv space and rtt estimator, overall making DRS more
>   correct and avoiding rcv buffer drifting to tcp_rmem[2], which in
>   turn makes the tput more stable and less bursty

Thank you for the new improvements and reviews!

I guess the answer to my question in patch 5/6 will be that nothing
needs to be changed, so I decided to apply this series (hopefully) to
ease things. Worst case, I can easily amend/revert patches!

> Note that after this upstream series will be merged:
> 
> https://patchwork.kernel.org/user/todo/netdevbpf/?series=1025209
> 
> we want to introduce similar changes in mptcp, too.

(ah, that's where I saw that message, and I just noticed there was no urgency!)


Now in our tree (feat. for net-next):

New patches for t/upstream:
- 1ac40c258962: trace: mptcp: add mptcp_rcvbuf_grow tracepoint
- 744ffd37f43a: mptcp: do not account for OoO in mptcp_rcvbuf_grow()
- 2c093fc5ab1b: mptcp: fix receive space timestamp initialization
- f2a2ee211112: mptcp: consolidate rcv space init
- 436eef292ce0: mptcp: better mptcp-level RTT estimator
- 4d83ab6b697c: mptcp: add receive queue awareness in tcp_rcv_space_adjust()
- Results: 1fea9a6bd10f..36ddaf26cab4 (export)

Tests are now in progress:

- export:
https://github.com/multipath-tcp/mptcp_net-next/commit/51795655931ed517a9362868f07690f569417880/checks

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



* Re: [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator
  2025-11-27 18:13   ` Matthieu Baerts
@ 2025-11-28  8:47     ` Paolo Abeni
  2025-11-28  9:51       ` Matthieu Baerts
  0 siblings, 1 reply; 16+ messages in thread
From: Paolo Abeni @ 2025-11-28  8:47 UTC (permalink / raw)
  To: Matthieu Baerts, martineau; +Cc: mptcp

On 11/27/25 7:13 PM, Matthieu Baerts wrote:
> On 20/11/2025 09:39, Paolo Abeni wrote:
>> The current MPTCP-level RTT estimator has several issues. On high speed
>> links, the MPTCP-level receive buffer auto-tuning happens with a frequency
>> well above the TCP-level's one. That in turn can cause excessive/unneeded
>> receive buffer increase.
>>
>> On such links, the initial rtt_us value is considerably higher
>> than the actual delay, and the current mptcp_rcv_space_adjust() updates
>> msk->rcvq_space.rtt_us with a period equal to that field's previous
>> value. If the initial rtt_us is 40ms, its first update will happen after
>> 40ms, even if the subflows see actual RTT orders of magnitude lower.
>>
>> Additionally:
>> - setting the msk rtt to the maximum among all the subflows RTTs makes DRS
>> constantly overshoot the rcvbuf size when a subflow has considerably
>> higher latency than the other(s).
>>
>> - during unidirectional bulk transfers with multiple active subflows, the
>> TCP-level RTT estimator occasionally sees considerably higher value than
>> the real link delay, i.e. when the packet scheduler reacts to an incoming
>> ack on given subflow pushing data on a different subflow.
>>
>> - currently inactive but still open subflows (i.e. switched to backup mode)
>> are always considered when computing the msk-level rtt.
>>
>> Address all the issues above with a more accurate RTT estimation
>> strategy: the MPTCP-level RTT is set to the minimum of all the subflows
>> actually feeding data into the MPTCP receive buffer, using a small sliding
>> window.
> 
> (...)
> 
>> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
>> index ee0dbd6dbacf..b392d7855928 100644
>> --- a/net/mptcp/protocol.h
>> +++ b/net/mptcp/protocol.h
>> @@ -269,6 +269,13 @@ struct mptcp_data_frag {
>>  	struct page *page;
>>  };
>>  
>> +/* Arbitrary compromise between as low as possible to react timely to subflow
>> + * close event and as big as possible to avoid being fouled by biased large
>> + * samples due to peer sending data on a different subflow WRT to the incoming
>> + * ack.
>> + */
>> +#define MPTCP_RTT_SAMPLES	5
>> +
>>  /* MPTCP connection sock */
>>  struct mptcp_sock {
>>  	/* inet_connection_sock must be the first member */
>> @@ -340,11 +347,17 @@ struct mptcp_sock {
>>  				 */
>>  	struct mptcp_pm_data	pm;
>>  	struct mptcp_sched_ops	*sched;
>> +
>> +	/* Most recent rtt_us observed by in use incoming subflows. */
>> +	struct {
>> +		u32	samples[MPTCP_RTT_SAMPLES];
>> +		u32	next_sample;
>> +	} rcv_rtt_est;
> 
> I'm sorry to react only now, I didn't manage to follow this in details,
> but I have one question: why not using a smooth RTT [1]? Is it because
> the goal is to mix data from the active/recently used subflows and only
> to take the minimum, and not "combining" RTT from different subflows?
> 
> [1] https://datatracker.ietf.org/doc/rfc6298/
>     srtt = old * (1-alpha) + new * alpha   # alpha is 1/8 in RFC6298

TCP already uses EWMA for the rtt; the values seen by MPTCP on each
subflow have already gone through such processing.

If there is a single subflow, applying EWMA again should only cause
slower reactions.

If there are multiple subflows, we really want the min(), not the
smoothed average, because:

- there are high spikes caused by the mptcp packet scheduler: we want to
filter them out entirely, while EWMA would make them contribute to the
estimate, and the results are very visible (negatively) in my experiments.

- different subflows can have very different rtts (say 1ms vs 100ms). We
really want the minimum, otherwise DRS will be fouled and the rcvbuf will
"explode".
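The contrast can be seen numerically with a quick sketch (illustrative
Python, not kernel code; alpha = 1/8 as in RFC 6298, sample values made up):

```python
# Compare EWMA smoothing against a sliding-window min() on a sample
# stream polluted by scheduler-induced spikes (values in us).
def ewma(samples, alpha=0.125):
    srtt = samples[0]
    for s in samples[1:]:
        srtt = srtt * (1 - alpha) + s * alpha
    return srtt

def window_min(samples, win=5):
    return min(samples[-win:])

stream = [1000, 1000, 50000, 1000, 50000, 1000, 1000, 1000]
print(round(ewma(stream)))  # 8245: the spikes keep the smoothed rtt ~8x too high
print(window_min(stream))   # 1000: the spikes are filtered out entirely
```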

/P



* Re: [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator
  2025-11-28  8:47     ` Paolo Abeni
@ 2025-11-28  9:51       ` Matthieu Baerts
  0 siblings, 0 replies; 16+ messages in thread
From: Matthieu Baerts @ 2025-11-28  9:51 UTC (permalink / raw)
  To: Paolo Abeni, martineau; +Cc: mptcp

Hi Paolo,

On 28/11/2025 09:47, Paolo Abeni wrote:
> On 11/27/25 7:13 PM, Matthieu Baerts wrote:
>> On 20/11/2025 09:39, Paolo Abeni wrote:
>>> The current MPTCP-level RTT estimator has several issues. On high speed
>>> links, the MPTCP-level receive buffer auto-tuning happens with a frequency
>>> well above the TCP-level's one. That in turn can cause excessive/unneeded
>>> receive buffer increase.
>>>
>>> On such links, the initial rtt_us value is considerably higher
>>> than the actual delay, and the current mptcp_rcv_space_adjust() updates
>>> msk->rcvq_space.rtt_us with a period equal to that field's previous
>>> value. If the initial rtt_us is 40ms, its first update will happen after
>>> 40ms, even if the subflows see actual RTT orders of magnitude lower.
>>>
>>> Additionally:
>>> - setting the msk rtt to the maximum among all the subflows RTTs makes DRS
>>> constantly overshoot the rcvbuf size when a subflow has considerably
>>> higher latency than the other(s).
>>>
>>> - during unidirectional bulk transfers with multiple active subflows, the
>>> TCP-level RTT estimator occasionally sees considerably higher value than
>>> the real link delay, i.e. when the packet scheduler reacts to an incoming
>>> ack on given subflow pushing data on a different subflow.
>>>
>>> - currently inactive but still open subflows (i.e. switched to backup mode)
>>> are always considered when computing the msk-level rtt.
>>>
>>> Address all the issues above with a more accurate RTT estimation
>>> strategy: the MPTCP-level RTT is set to the minimum of all the subflows
>>> actually feeding data into the MPTCP receive buffer, using a small sliding
>>> window.
>>
>> (...)
>>
>>> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
>>> index ee0dbd6dbacf..b392d7855928 100644
>>> --- a/net/mptcp/protocol.h
>>> +++ b/net/mptcp/protocol.h
>>> @@ -269,6 +269,13 @@ struct mptcp_data_frag {
>>>  	struct page *page;
>>>  };
>>>  
>>> +/* Arbitrary compromise between as low as possible to react timely to subflow
>>> + * close event and as big as possible to avoid being fouled by biased large
>>> + * samples due to peer sending data on a different subflow WRT to the incoming
>>> + * ack.
>>> + */
>>> +#define MPTCP_RTT_SAMPLES	5
>>> +
>>>  /* MPTCP connection sock */
>>>  struct mptcp_sock {
>>>  	/* inet_connection_sock must be the first member */
>>> @@ -340,11 +347,17 @@ struct mptcp_sock {
>>>  				 */
>>>  	struct mptcp_pm_data	pm;
>>>  	struct mptcp_sched_ops	*sched;
>>> +
>>> +	/* Most recent rtt_us observed by in use incoming subflows. */
>>> +	struct {
>>> +		u32	samples[MPTCP_RTT_SAMPLES];
>>> +		u32	next_sample;
>>> +	} rcv_rtt_est;
>>
>> I'm sorry to react only now, I didn't manage to follow this in details,
>> but I have one question: why not using a smooth RTT [1]? Is it because
>> the goal is to mix data from the active/recently used subflows and only
>> to take the minimum, and not "combining" RTT from different subflows?
>>
>> [1] https://datatracker.ietf.org/doc/rfc6298/
>>     srtt = old * (1-alpha) + new * alpha   # alpha is 1/8 in RFC6298
> 
> TCP already use EWMA for rtt; the values seen by MPTCP on each subflow
> went already into such processing.
> 
> If there is a single subflow, doing again EWMA should only cause slower
> reactions.
> 
> If there are multiple subflows, we really want the min() not the
> smoothed average, because:
> 
> - there are high spikes caused by the mptcp packet scheduler: we want to
> entirely filter them out, while EWMA will make them contributing the the
> estimate and results are very visible (negatively) in my experiments.
> 
> - different subflows can have very different rtt (say 1ms vs 100ms). We
> really want the minimum, otherwise DRS will be fouled/the rcvbuf will
> "explode"

Thank you for having taken the time to explain me all this. That's much
clearer, and I just realised I was mixing up acronyms! :)

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



* Re: [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator
  2025-11-20  8:39 ` [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator Paolo Abeni
  2025-11-27  2:19   ` Mat Martineau
  2025-11-27 18:13   ` Matthieu Baerts
@ 2025-12-16 16:38   ` Matthieu Baerts
  2 siblings, 0 replies; 16+ messages in thread
From: Matthieu Baerts @ 2025-12-16 16:38 UTC (permalink / raw)
  To: Paolo Abeni, Mat Martineau; +Cc: mptcp

Hi Paolo, Mat,

On 20/11/2025 09:39, Paolo Abeni wrote:
> The current MPTCP-level RTT estimator has several issues. On high speed
> links, the MPTCP-level receive buffer auto-tuning happens with a frequency
> well above the TCP-level's one. That in turn can cause excessive/unneeded
> receive buffer increase.
> 
> On such links, the initial rtt_us value is considerably higher
> than the actual delay, and the current mptcp_rcv_space_adjust() updates
> msk->rcvq_space.rtt_us with a period equal to the such field previous
> value. If the initial rtt_us is 40ms, its first update will happen after
> 40ms, even if the subflows see actual RTT orders of magnitude lower.
> 
> Additionally:
> - setting the msk rtt to the maximum among all the subflows RTTs makes DRS
> constantly overshoot the rcvbuf size when a subflow has considerably
> higher latency than the other(s).
> 
> - during unidirectional bulk transfers with multiple active subflows,
> the TCP-level RTT estimator occasionally sees considerably higher values
> than the real link delay, e.g. when the packet scheduler reacts to an
> incoming ack on a given subflow by pushing data on a different subflow.
> 
> - currently inactive but still open subflows (i.e. switched to backup mode)
> are always considered when computing the msk-level rtt.
> 
> Address all the issues above with a more accurate RTT estimation
> strategy: the MPTCP-level RTT is set to the minimum of all the subflows
> actually feeding data into the MPTCP receive buffer, using a small
> sliding window.
> 
> While at it, also use an EWMA to compute the msk-level scaling_ratio,
> so that mptcp can avoid traversing the subflow list in
> mptcp_rcv_space_adjust().
> 
> Use some care to avoid updating msk and ssk level fields too often.
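The quoted strategy (windowed minimum of per-subflow RTT samples, plus an EWMA for the scaling ratio) can be sketched roughly as below. This is an illustrative userspace model, not the actual kernel patch; all names (`rtt_win`, `rtt_win_add`, `ewma_ratio`, the 4-slot window size and the 1/8 EWMA weight) are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define RTT_WIN_SLOTS 4	/* hypothetical sliding-window depth */

/* Sliding window of per-period minimum RTT samples: stale high
 * samples age out as the window advances, so the estimate tracks
 * the recent minimum instead of being stuck at an initial,
 * overly large value. */
struct rtt_win {
	uint32_t slot[RTT_WIN_SLOTS];	/* per-slot minimum, in usec */
	int cur;			/* current slot index */
};

static void rtt_win_init(struct rtt_win *w)
{
	for (int i = 0; i < RTT_WIN_SLOTS; i++)
		w->slot[i] = UINT32_MAX;
	w->cur = 0;
}

/* Record one RTT sample into the current slot, keeping the minimum. */
static void rtt_win_add(struct rtt_win *w, uint32_t rtt_us)
{
	if (rtt_us < w->slot[w->cur])
		w->slot[w->cur] = rtt_us;
}

/* Advance the window (e.g. once per measurement period); the oldest
 * slot is recycled, dropping its samples from the estimate. */
static void rtt_win_advance(struct rtt_win *w)
{
	w->cur = (w->cur + 1) % RTT_WIN_SLOTS;
	w->slot[w->cur] = UINT32_MAX;
}

/* Windowed minimum over all slots. */
static uint32_t rtt_win_min(const struct rtt_win *w)
{
	uint32_t min = UINT32_MAX;

	for (int i = 0; i < RTT_WIN_SLOTS; i++)
		if (w->slot[i] < min)
			min = w->slot[i];
	return min;
}

/* Fixed-point EWMA with weight 1/8: new = old - old/8 + sample/8.
 * Smooths the scaling ratio instead of recomputing it from a walk
 * over the whole subflow list on every adjustment. */
static uint8_t ewma_ratio(uint8_t old, uint8_t sample)
{
	return old - (old >> 3) + (sample >> 3);
}
```

With this shape, a 40ms initial sample stops dominating as soon as lower samples land in the current slot, and once the window has fully rotated the old high value disappears entirely.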

FYI, it looks like the recent "simult_flows" instabilities are due to
this patch, see this ticket for more details:

  https://github.com/multipath-tcp/mptcp_net-next/issues/607

Luckily there is no hurry to fix this: the patch is only in our tree
for the moment. I could revert it, but it looks like there are other
bugs in net anyway, e.g. it is not possible to connect to it at the
beginning via vsock.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



end of thread, other threads:[~2025-12-16 16:38 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-20  8:39 [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement Paolo Abeni
2025-11-20  8:39 ` [PATCH v7 mptcp-next 1/6] trace: mptcp: add mptcp_rcvbuf_grow tracepoint Paolo Abeni
2025-11-20  8:39 ` [PATCH v7 mptcp-next 2/6] mptcp: do not account for OoO in mptcp_rcvbuf_grow() Paolo Abeni
2025-11-27  0:06   ` Mat Martineau
2025-11-20  8:39 ` [PATCH v7 mptcp-next 3/6] mptcp: fix receive space timestamp initialization Paolo Abeni
2025-11-20  8:39 ` [PATCH v7 mptcp-next 4/6] mptcp: consolidate rcv space init Paolo Abeni
2025-11-20  8:39 ` [PATCH v7 mptcp-next 5/6] mptcp: better mptcp-level RTT estimator Paolo Abeni
2025-11-27  2:19   ` Mat Martineau
2025-11-27  7:36     ` Paolo Abeni
2025-11-27 18:13   ` Matthieu Baerts
2025-11-28  8:47     ` Paolo Abeni
2025-11-28  9:51       ` Matthieu Baerts
2025-12-16 16:38   ` Matthieu Baerts
2025-11-20  8:39 ` [PATCH v7 mptcp-next 6/6] mptcp: add receive queue awareness in tcp_rcv_space_adjust() Paolo Abeni
2025-11-20  9:48 ` [PATCH v7 mptcp-next 0/6] mptcp: autotune related improvement MPTCP CI
2025-11-27 18:42 ` Matthieu Baerts
