* [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure
@ 2026-05-15  9:07 Paolo Abeni
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 1/7] mptcp: fix missing wakeups in edge scenarios Paolo Abeni
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: Paolo Abeni @ 2026-05-15  9:07 UTC (permalink / raw)
  To: mptcp; +Cc: Geliang Tang, gang.yan

This is an attempt to fix the data transfer stall reported by Geliang and
Gang by more carefully enforcing memory constraints at the MPTCP level.

This iteration differs significantly from the previous one: it entirely
avoids the collapse attempt under memory pressure. Note that this choice
represents a trade-off: collapsing allows much faster transfers (to be
more accurate: an order of magnitude less slow) under some extreme
conditions, but makes transfers slower and much more CPU intensive under
less unlikely conditions.

As a consequence, the `mptcp_data.multi_chunk_sendfile` test case needs a
240-second timeout to complete successfully:

TEST_F_TIMEOUT(mptcp, multi_chunk_sendfile, 240)

The solution performing data collapsing would need a similarly long
timeout for the multiproc test cases: mutliproc_even, mutliproc_readers,
mutliproc_writers, mutliproc_sendpage_even, mutliproc_sendpage_readers,
mutliproc_sendpage_writers.

Patch 1 is new in v6, and is actually a fix for an old issue (targeting
net), included here just for my convenience.

Patches 2 and 3 make the admission check much stricter for incoming
packets exceeding the memory limits, with some exceptions for fallback
sockets.
Patch 4 implements OoO queue pruning for MPTCP, and patch 5 addresses an
edge scenario that could still lead to a transfer stall under memory
pressure.
Finally, patches 6 and 7 improve the MPTCP-level retransmission scheme to
make recovery from memory pressure, or after an MPTCP-level drop,
significantly faster.
---
Paolo Abeni (7):
  mptcp: fix missing wakeups in edge scenarios
  mptcp: explicitly drop over memory limits
  mptcp: enforce hard limit on backlog flushing
  mptcp: implemented OoO queue pruning
  mptcp: track prune recovery status
  mptcp: move the retrans loop to a separate helper
  mptcp: let the retrans scheduler do its job.

 net/mptcp/mib.c      |   2 +
 net/mptcp/mib.h      |   2 +
 net/mptcp/options.c  |  67 +++++++++++-
 net/mptcp/protocol.c | 249 ++++++++++++++++++++++++++++++++-----------
 net/mptcp/protocol.h |   4 +
 net/mptcp/subflow.c  |   1 +
 6 files changed, 259 insertions(+), 66 deletions(-)

-- 
2.54.0



* [PATCH v6 mptcp-next 1/7] mptcp: fix missing wakeups in edge scenarios
  2026-05-15  9:07 [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
@ 2026-05-15  9:07 ` Paolo Abeni
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 2/7] mptcp: explicitly drop over memory limits Paolo Abeni
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2026-05-15  9:07 UTC (permalink / raw)
  To: mptcp; +Cc: Geliang Tang, gang.yan

mptcp_recvmsg() can fill the MPTCP socket receive queue via
mptcp_move_skbs(), but currently does not try to wake up any listener,
because the same process is going to check the receive queue soon.

When multiple threads are reading from the same fd, the above can
cause a stall. Add the missing wakeup.

Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index ce8372fb3c6a..b3ac1cb370f5 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2276,6 +2276,8 @@ static bool mptcp_move_skbs(struct sock *sk)
 		mptcp_backlog_spooled(sk, moved, &skbs);
 	}
 	mptcp_data_unlock(sk);
+	if (enqueued)
+		sk->sk_data_ready(sk);
 	return enqueued;
 }
 
-- 
2.54.0



* [PATCH v6 mptcp-next 2/7] mptcp: explicitly drop over memory limits
  2026-05-15  9:07 [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 1/7] mptcp: fix missing wakeups in edge scenarios Paolo Abeni
@ 2026-05-15  9:07 ` Paolo Abeni
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 3/7] mptcp: enforce hard limit on backlog flushing Paolo Abeni
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2026-05-15  9:07 UTC (permalink / raw)
  To: mptcp; +Cc: Geliang Tang, gang.yan

Currently the enforcement of the rcvbuf constraint is implemented
when moving the skbs into the msk receive or OoO queue, keeping the
incoming skbs in the subflow queue when over limit.

Under significant memory pressure the above can cause permanent data
transfer stalls. Hard enforce the memory limits as early as possible,
before landing even in the subflow queues, and refine the check when
owning the msk socket lock.

Note that fallback sockets must not drop on the later checks, as the
incoming skb is already acked, and such a drop would break the stream.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v4 -> v5:
 - fix possible u32 overflow in mptcp_over_limit

v3 -> v4:
 - schedule TCP ack on drop
 - enforce limits in __mptcp_move_skb() and __mptcp_add_backlog(), too
   but only if not fallback.

v1 -> v2:
 - deal correctly with tcp fin and zero win probe

RFC -> v1:
 - limit vs actual buffer size
 - use CB info instead of skb->len

Note that:
 - this needs the follow-up patches to really fix the stall
 - sashiko can assume ZWP carries unacked data and may be silently dropped.
   AFAIK that is false.
 - sashiko apparently can't grasp that mptcp subflows never hit the tcp rx
   fastpath and that, when the mptcp_incoming_options in
   tcp_rcv_state_process is hit, the peer can't transmit any more data.
 - the memory comparison is intentionally very rough, as
   the msk socket lock is not currently held where the condition is
   now enforced. This should require some refinement, shared as-is
   to avoid more latency on my side
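For readers following the limit arithmetic, the two memory checks described
above can be sketched in user space as below. This is a minimal model, not
the kernel code: the `model_` names and the plain integer parameters are
assumptions made for illustration, standing in for sk_rmem_alloc_get(),
msk->backlog_len and sk->sk_rcvbuf. The exemptions for pure acks, FIN and
zero window probes are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of the checks, not the kernel code. */

/* Early admission check: receive memory plus backlog versus rcvbuf.
 * The sum is computed in 64 bits so two large 32-bit values cannot wrap.
 */
static bool model_over_limit(uint32_t rmem_alloc, uint32_t backlog_len,
			     uint32_t rcvbuf)
{
	uint64_t mem = rmem_alloc;

	mem += backlog_len;
	return mem > rcvbuf;
}

/* Backlog allowance: rcvbuf + rcvbuf / 2 + 64KB, clamped to UINT32_MAX. */
static uint64_t model_backlog_limit(uint32_t rcvbuf)
{
	uint64_t limit = rcvbuf;

	limit += (limit >> 1) + 64 * 1024;
	if (limit > UINT32_MAX)
		limit = UINT32_MAX;

	/* The allowance never shrinks below the configured buffer size. */
	assert(limit >= rcvbuf);
	return limit;
}
```

With a 128KB rcvbuf the modeled backlog allowance is 256KB; with a rcvbuf
close to UINT32_MAX the clamp applies instead.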
---
 net/mptcp/options.c  | 32 ++++++++++++++++++++++++++++++--
 net/mptcp/protocol.c | 29 +++++++++++++++++++++--------
 2 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 4cc583fdc7a9..36f12e5dfa92 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1158,8 +1158,30 @@ static bool add_addr_hmac_valid(struct mptcp_sock *msk,
 	return hmac == mp_opt->ahmac;
 }
 
-/* Return false in case of error (or subflow has been reset),
- * else return true.
+static bool mptcp_over_limit(struct sock *sk, struct sock *ssk,
+			     const struct sk_buff *skb)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	u64 mem = sk_rmem_alloc_get(sk);
+
+	mem += READ_ONCE(msk->backlog_len);
+	if (likely(mem <= READ_ONCE(sk->sk_rcvbuf)))
+		return false;
+
+	/* Avoid silently dropping pure acks, fin or zero win probes. */
+	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq ||
+	    TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN ||
+	    !after(TCP_SKB_CB(skb)->end_seq, tcp_sk(ssk)->rcv_nxt))
+		return false;
+
+	/* Dropped due to memory constraints, schedule an ack. */
+	inet_csk(ssk)->icsk_ack.pending |= ICSK_ACK_NOMEM | ICSK_ACK_NOW;
+	inet_csk_schedule_ack(ssk);
+	return true;
+}
+
+/* Return false when the caller must drop the packet, i.e. in case of error,
+ * subflow has been reset, or over memory limits.
  */
 bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 {
@@ -1185,6 +1207,9 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 
 		__mptcp_data_acked(subflow->conn);
 		mptcp_data_unlock(subflow->conn);
+
+		if (mptcp_over_limit(subflow->conn, sk, skb))
+			return false;
 		return true;
 	}
 
@@ -1263,6 +1288,9 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 		return true;
 	}
 
+	if (mptcp_over_limit(subflow->conn, sk, skb))
+		return false;
+
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
 	if (!mpext)
 		return false;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index b3ac1cb370f5..9d2ed9503d08 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -381,6 +381,15 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 
 	mptcp_borrow_fwdmem(sk, skb);
 
+	/* Can't drop packets for fallback socket this late, or the stream
+	 * will break.
+	 */
+	if (unlikely(sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) &&
+	    !__mptcp_check_fallback(msk)) {
+		mptcp_drop(sk, skb);
+		return false;
+	}
+
 	if (MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq) {
 		/* in sequence */
 		msk->bytes_received += copy_len;
@@ -675,6 +684,7 @@ static void __mptcp_add_backlog(struct sock *sk,
 	struct sk_buff *tail = NULL;
 	struct sock *ssk = skb->sk;
 	bool fragstolen;
+	u64 limit;
 	int delta;
 
 	if (unlikely(sk->sk_state == TCP_CLOSE)) {
@@ -682,6 +692,15 @@ static void __mptcp_add_backlog(struct sock *sk,
 		return;
 	}
 
+	/* Similar additional allowance as plain TCP. */
+	limit = READ_ONCE(sk->sk_rcvbuf);
+	limit += (limit >> 1) + 64 * 1024;
+	limit = min_t(u64, limit, UINT_MAX);
+	if (msk->backlog_len > limit && !__mptcp_check_fallback(msk)) {
+		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_RCVBUFF);
+		return;
+	}
+
 	/* Try to coalesce with the last skb in our backlog */
 	if (!list_empty(&msk->backlog_list))
 		tail = list_last_entry(&msk->backlog_list, struct sk_buff, list);
@@ -753,7 +772,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 
 			mptcp_init_skb(ssk, skb, offset, len);
 
-			if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf) {
+			if (own_msk) {
 				mptcp_subflow_lend_fwdmem(subflow, skb);
 				ret |= __mptcp_move_skb(sk, skb);
 			} else {
@@ -2211,10 +2230,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
 
 	*delta = 0;
 	while (1) {
-		/* If the msk recvbuf is full stop, don't drop */
-		if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
-			break;
-
 		prefetch(skb->next);
 		list_del(&skb->list);
 		*delta += skb->truesize;
@@ -2242,9 +2257,7 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 	DEBUG_NET_WARN_ON_ONCE(msk->backlog_unaccounted && sk->sk_socket &&
 			       mem_cgroup_from_sk(sk));
 
-	/* Don't spool the backlog if the rcvbuf is full. */
-	if (list_empty(&msk->backlog_list) ||
-	    sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+	if (list_empty(&msk->backlog_list))
 		return false;
 
 	INIT_LIST_HEAD(skbs);
-- 
2.54.0



* [PATCH v6 mptcp-next 3/7] mptcp: enforce hard limit on backlog flushing
  2026-05-15  9:07 [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 1/7] mptcp: fix missing wakeups in edge scenarios Paolo Abeni
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 2/7] mptcp: explicitly drop over memory limits Paolo Abeni
@ 2026-05-15  9:07 ` Paolo Abeni
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 4/7] mptcp: implemented OoO queue pruning Paolo Abeni
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2026-05-15  9:07 UTC (permalink / raw)
  To: mptcp; +Cc: Geliang Tang, gang.yan

Currently a wild producer could keep the backlog flushing operation
spinning for an unbounded time.

Since the previous patch the amount of data present in the backlog is
hard-limited. Move the backlog len update to the end of the flush loop to
prevent it from spinning forever.

Also, there is no need to splice the remaining skbs list back into the
backlog, as that list is always empty after each backlog processing loop.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
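The reasoning in the commit message above can be illustrated with a small
user-space model. It is only a sketch under stated assumptions: the
`model_backlog` structure, the single-threaded "producer" call inside the
loop and the `max_passes` cap are all hypothetical, introduced to make the
effect of deferring the length update observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the backlog accounting, not the kernel code. */
struct model_backlog {
	uint64_t len;		/* accounted bytes, checked against limit */
	uint64_t queued;	/* bytes currently sitting in the list */
	uint64_t limit;		/* hard limit on the accounted length */
};

/* Producer: admits a chunk only while the accounted length is in bounds. */
static bool model_produce(struct model_backlog *bl, uint64_t chunk)
{
	if (bl->len > bl->limit)
		return false;
	bl->len += chunk;
	bl->queued += chunk;
	return true;
}

/* Flush loop: returns the number of passes, capped at max_passes. */
static unsigned int model_flush(struct model_backlog *bl, uint64_t chunk,
				bool deferred_update, unsigned int max_passes)
{
	unsigned int passes = 0;
	uint64_t moved = 0;

	while (bl->queued && passes < max_passes) {
		uint64_t batch = bl->queued;

		bl->queued = 0;		/* take the whole list */
		passes++;

		/* The producer runs while the flusher is busy. */
		model_produce(bl, chunk);

		if (deferred_update) {
			moved += batch;	/* account once, after the loop */
		} else {
			assert(bl->len >= batch);
			bl->len -= batch; /* frees room on every pass */
		}
	}
	assert(bl->len >= moved);
	bl->len -= moved;
	return passes;
}
```

In this model, updating the length on every pass lets the producer refill
the list indefinitely, so only the cap stops the loop. With the deferred
update the producer eventually hits the limit and the loop ends on its own.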
 net/mptcp/protocol.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9d2ed9503d08..78b8bcac7d91 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2228,7 +2228,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	bool moved = false;
 
-	*delta = 0;
 	while (1) {
 		prefetch(skb->next);
 		list_del(&skb->list);
@@ -2265,20 +2264,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 	return true;
 }
 
-static void mptcp_backlog_spooled(struct sock *sk, u32 moved,
-				  struct list_head *skbs)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
-	list_splice(skbs, &msk->backlog_list);
-}
-
 static bool mptcp_move_skbs(struct sock *sk)
 {
+	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct list_head skbs;
 	bool enqueued = false;
-	u32 moved;
+	u32 moved = 0;
 
 	mptcp_data_lock(sk);
 	while (mptcp_can_spool_backlog(sk, &skbs)) {
@@ -2286,8 +2277,8 @@ static bool mptcp_move_skbs(struct sock *sk)
 		enqueued |= __mptcp_move_skbs(sk, &skbs, &moved);
 
 		mptcp_data_lock(sk);
-		mptcp_backlog_spooled(sk, moved, &skbs);
 	}
+	WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
 	mptcp_data_unlock(sk);
 	if (enqueued)
 		sk->sk_data_ready(sk);
@@ -3672,12 +3663,12 @@ static void mptcp_release_cb(struct sock *sk)
 	__must_hold(&sk->sk_lock.slock)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	u32 moved = 0;
 
 	for (;;) {
 		unsigned long flags = (msk->cb_flags & MPTCP_FLAGS_PROCESS_CTX_NEED);
 		struct list_head join_list, skbs;
 		bool spool_bl;
-		u32 moved;
 
 		spool_bl = mptcp_can_spool_backlog(sk, &skbs);
 		if (!flags && !spool_bl)
@@ -3710,9 +3701,9 @@ static void mptcp_release_cb(struct sock *sk)
 
 		cond_resched();
 		spin_lock_bh(&sk->sk_lock.slock);
-		if (spool_bl)
-			mptcp_backlog_spooled(sk, moved, &skbs);
 	}
+	if (moved)
+		WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
 
 	if (__test_and_clear_bit(MPTCP_CLEAN_UNA, &msk->cb_flags))
 		__mptcp_clean_una_wakeup(sk);
-- 
2.54.0



* [PATCH v6 mptcp-next 4/7] mptcp: implemented OoO queue pruning
  2026-05-15  9:07 [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
                   ` (2 preceding siblings ...)
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 3/7] mptcp: enforce hard limit on backlog flushing Paolo Abeni
@ 2026-05-15  9:07 ` Paolo Abeni
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 5/7] mptcp: track prune recovery status Paolo Abeni
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2026-05-15  9:07 UTC (permalink / raw)
  To: mptcp; +Cc: Geliang Tang, gang.yan

Leverage the hybrid helpers to implement the receive queue and OoO queue
collapsing at ingress time when reaching memory bounds.

If the msk is owned by user space when the skb arrives, perform the
pruning in the release_cb. The prune check is additionally performed
when the skb reaches the msk-level queues.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v2 -> v3:
 - deal with unsynced TFO skb at prune time - only possible when pruning
   in mptcp_over_limit()

v1 -> v2:
 - collapse rcv queue, too
 - deal with MPC map, too
 - drop left-over sentence in the commit message

RFC -> v1:
 - use data_seq only when available
 - avoid ack_seq lockless access
 - drop limit on fallback
 - collapse rcvqueue, too
 - drop only when pruning is not possible and over rcvbuf * 2

Note:
 - sashiko can be confused about fwd memory lifecycle (I can
   understand that :). Any excess amount of fwd allocated memory
   is always released by the next sk_mem_uncharge() - i.e. fwd memory
   is not tied to the current skb.
 - sashiko is also fooled by the main xtcp_collapse_ofo_queue()
   loop: ooo_last_skb is always kept up to date with the current tree
   status
 - AFAICS KASAN handles bitmap variables in a sane way, and sashiko
   doesn't know about that
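The tail-first pruning order used for the OoO queue can be sketched in user
space as follows. This is an illustration only: the `model_seg` array stands
in for the rb-tree, all `model_` names are hypothetical, and plain `>`
comparisons are used, so sequence wraparound is ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one out-of-order segment, not a kernel type. */
struct model_seg {
	uint64_t map_seq;	/* first sequence number covered */
	uint32_t truesize;	/* memory charged for the segment */
};

/* Drop segments from the tail of a queue sorted by map_seq.
 * Returns the number of segments left; *rmem is updated in place.
 */
static size_t model_prune_tail(const struct model_seg *q, size_t n,
			       uint64_t seq, uint64_t *rmem, uint64_t rcvbuf)
{
	while (n > 0) {
		const struct model_seg *last = &q[n - 1];

		/* Stop if the incoming data would land after the tail. */
		if (seq > last->map_seq)
			break;

		assert(*rmem >= last->truesize);
		*rmem -= last->truesize;
		n--;

		/* Stop as soon as memory is back under the limit. */
		if (*rmem < rcvbuf)
			break;
	}
	return n;
}
```

Segments with the highest sequence numbers go first, since they are the
furthest from being deliverable, and pruning stops either once memory is
back within bounds or once the incoming data itself would be the new tail.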
---
 net/mptcp/mib.c      |  2 ++
 net/mptcp/mib.h      |  2 ++
 net/mptcp/options.c  | 42 +++++++++++++++++++++++++++++++++++-------
 net/mptcp/protocol.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.h |  1 +
 5 files changed, 82 insertions(+), 7 deletions(-)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index f23fda0c55a7..bdc863c3a952 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -85,6 +85,8 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("SimultConnectFallback", MPTCP_MIB_SIMULTCONNFALLBACK),
 	SNMP_MIB_ITEM("FallbackFailed", MPTCP_MIB_FALLBACKFAILED),
 	SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
+	SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
+	SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
 };
 
 /* mptcp_mib_alloc - allocate percpu mib counters
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 812218b5ed2b..8ec314847fc3 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -88,6 +88,8 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_SIMULTCONNFALLBACK,	/* Simultaneous connect */
 	MPTCP_MIB_FALLBACKFAILED,	/* Can't fallback due to msk status */
 	MPTCP_MIB_WINPROBE,		/* MPTCP-level zero window probe */
+	MPTCP_MIB_OFO_PRUNED,		/* MPTCP-level OoO queue pruned */
+	MPTCP_MIB_RCVPRUNED,		/* Dropped due to memory constrains */
 	__MPTCP_MIB_MAX
 };
 
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 36f12e5dfa92..ec64e1a127d7 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1159,10 +1159,13 @@ static bool add_addr_hmac_valid(struct mptcp_sock *msk,
 }
 
 static bool mptcp_over_limit(struct sock *sk, struct sock *ssk,
-			     const struct sk_buff *skb)
+			     const struct sk_buff *skb,
+			     const struct mptcp_options_received *mp_opt)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	u64 mem = sk_rmem_alloc_get(sk);
+	u64 limit;
+	bool ret;
 
 	mem += READ_ONCE(msk->backlog_len);
 	if (likely(mem <= READ_ONCE(sk->sk_rcvbuf)))
@@ -1174,10 +1177,31 @@ static bool mptcp_over_limit(struct sock *sk, struct sock *ssk,
 	    !after(TCP_SKB_CB(skb)->end_seq, tcp_sk(ssk)->rcv_nxt))
 		return false;
 
-	/* Dropped due to memory constraints, schedule an ack. */
-	inet_csk(ssk)->icsk_ack.pending |= ICSK_ACK_NOMEM | ICSK_ACK_NOW;
-	inet_csk_schedule_ack(ssk);
-	return true;
+	mptcp_data_lock(sk);
+	if (!sock_owned_by_user(sk)) {
+		/* When the data sequence is not (yet) available for the
+		 * incoming skb, allow pruning the whole OoO queue.
+		 */
+		u64 seq = (!mp_opt->use_map || mp_opt->mpc_map) ?
+			  msk->ack_seq : mp_opt->data_seq;
+
+		limit = sk->sk_rcvbuf;
+		__mptcp_check_prune(sk, seq);
+	} else {
+		/* Pruning will take place later in the RX path, allow
+		 * some extra slack.
+		 */
+		limit = ((u64)READ_ONCE(sk->sk_rcvbuf)) << 1;
+	}
+	ret = sk_rmem_alloc_get(sk) + msk->backlog_len > limit;
+	mptcp_data_unlock(sk);
+
+	if (ret) {
+		/* Dropped due to memory constraints, schedule an ack. */
+		inet_csk(ssk)->icsk_ack.pending |= ICSK_ACK_NOMEM | ICSK_ACK_NOW;
+		inet_csk_schedule_ack(ssk);
+	}
+	return ret;
 }
 
 /* Return false when the caller must drop the packet, i.e. in case of error,
@@ -1208,7 +1232,11 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 		__mptcp_data_acked(subflow->conn);
 		mptcp_data_unlock(subflow->conn);
 
-		if (mptcp_over_limit(subflow->conn, sk, skb))
+		/* Will use ack_seq as limit for OoO pruning; any value would do
+		 * as OoO queue must be empty.
+		 */
+		mp_opt.use_map = 0;
+		if (mptcp_over_limit(subflow->conn, sk, skb, &mp_opt))
 			return false;
 		return true;
 	}
@@ -1288,7 +1316,7 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 		return true;
 	}
 
-	if (mptcp_over_limit(subflow->conn, sk, skb))
+	if (mptcp_over_limit(subflow->conn, sk, skb, &mp_opt))
 		return false;
 
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 78b8bcac7d91..b79dd4c4fe31 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -373,6 +373,46 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
 	skb_dst_drop(skb);
 }
 
+/* "Inspired" from the TCP version */
+static void mptcp_prune_ofo_queue(struct sock *sk, u64 seq)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct rb_node *node, *prev;
+	bool pruned = false;
+
+	if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
+		return;
+
+	node = &msk->ooo_last_skb->rbnode;
+
+	do {
+		struct sk_buff *skb = rb_to_skb(node);
+
+		/* Stop pruning if the incoming skb would land in OoO tail. */
+		if (after(seq, MPTCP_SKB_CB(skb)->map_seq))
+			break;
+
+		pruned = true;
+		prev = rb_prev(node);
+		rb_erase(node, &msk->out_of_order_queue);
+		mptcp_drop(sk, skb);
+		msk->ooo_last_skb = rb_to_skb(prev);
+		if (atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf)
+			break;
+
+		node = prev;
+	} while (node);
+
+	if (pruned)
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED);
+}
+
+bool __mptcp_check_prune(struct sock *sk, u64 seq)
+{
+	mptcp_prune_ofo_queue(sk, seq);
+	return atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf;
+}
+
 static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 {
 	u64 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
@@ -385,7 +425,9 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 	 * will break.
 	 */
 	if (unlikely(sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) &&
+	    __mptcp_check_prune(sk, MPTCP_SKB_CB(skb)->map_seq) &&
 	    !__mptcp_check_fallback(msk)) {
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
 		mptcp_drop(sk, skb);
 		return false;
 	}
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 661600f8b573..95774a4e7231 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -827,6 +827,7 @@ bool __mptcp_close(struct sock *sk, long timeout);
 void mptcp_cancel_work(struct sock *sk);
 void __mptcp_unaccepted_force_close(struct sock *sk);
 void mptcp_set_state(struct sock *sk, int state);
+bool __mptcp_check_prune(struct sock *sk, u64 seq);
 
 bool mptcp_addresses_equal(const struct mptcp_addr_info *a,
 			   const struct mptcp_addr_info *b, bool use_port);
-- 
2.54.0



* [PATCH v6 mptcp-next 5/7] mptcp: track prune recovery status
  2026-05-15  9:07 [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
                   ` (3 preceding siblings ...)
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 4/7] mptcp: implemented OoO queue pruning Paolo Abeni
@ 2026-05-15  9:07 ` Paolo Abeni
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 6/7] mptcp: move the retrans loop to a separate helper Paolo Abeni
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2026-05-15  9:07 UTC (permalink / raw)
  To: mptcp; +Cc: Geliang Tang, gang.yan

After dropping any data already acked at the TCP level, MPTCP must
avoid inducing TCP-level retransmissions until the pruned data has been
successfully acked at the MPTCP level. Otherwise the subflows could keep
retransmitting skbs carrying OoO MPTCP data, preventing reinjections and
completely stalling the data transfer.

Explicitly keep track of the highest pruned MPTCP-level seq number and
stop dropping at TCP level until such sequence has been acked.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
Notes:
 - sashiko may miss that the msk->ack_seq access in mptcp_over_limit()
   happens under the msk data lock and is thus race-free.
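The recovery tracking described in the commit message can be modeled in user
space as below. This is a sketch under assumptions, not the kernel code: the
`model_msk` structure and every `model_` helper are hypothetical, and only
the two sequence numbers relevant to the rule are kept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the receive-side state, not a kernel type. */
struct model_msk {
	uint64_t ack_seq;	/* next in-order MPTCP-level sequence */
	uint64_t pruned_seq;	/* highest sequence dropped by pruning */
};

/* Wrap-safe "a is before b" on 64-bit sequence numbers. */
static bool model_before64(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* Record that data up to end_seq was dropped at the MPTCP level. */
static void model_note_pruned(struct model_msk *msk, uint64_t end_seq)
{
	if (model_before64(msk->pruned_seq, end_seq))
		msk->pruned_seq = end_seq;
}

/* In-order data arrived: move ack_seq forward, drag pruned_seq along. */
static void model_advance_ack(struct model_msk *msk, uint64_t ack_seq)
{
	assert(!model_before64(ack_seq, msk->ack_seq));
	msk->ack_seq = ack_seq;
	if (model_before64(msk->pruned_seq, msk->ack_seq))
		msk->pruned_seq = msk->ack_seq;
}

/* TCP-level drops are suppressed while pruned data is still missing. */
static bool model_tcp_drop_allowed(const struct model_msk *msk,
				   bool over_limit)
{
	if (model_before64(msk->ack_seq, msk->pruned_seq))
		return false;
	return over_limit;
}
```

While ack_seq is still below pruned_seq the model never reports a TCP-level
drop, even when over the memory limit; once the pruned range has been
received again, the plain over-limit decision applies.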
---
 net/mptcp/options.c  |  7 +++++++
 net/mptcp/protocol.c | 14 +++++++++++++-
 net/mptcp/protocol.h |  3 +++
 net/mptcp/subflow.c  |  1 +
 4 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index ec64e1a127d7..c96b8166224b 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1194,6 +1194,13 @@ static bool mptcp_over_limit(struct sock *sk, struct sock *ssk,
 		limit = ((u64)READ_ONCE(sk->sk_rcvbuf)) << 1;
 	}
 	ret = sk_rmem_alloc_get(sk) + msk->backlog_len > limit;
+
+	/* After pruning any packets ensure that MPTCP-driven drops do not
+	 * cause TCP-level retransmission.
+	 */
+	if (before64(msk->ack_seq, READ_ONCE(msk->pruned_seq)))
+		ret = false;
+
 	mptcp_data_unlock(sk);
 
 	if (ret) {
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index b79dd4c4fe31..640632c283e1 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -379,12 +379,14 @@ static void mptcp_prune_ofo_queue(struct sock *sk, u64 seq)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct rb_node *node, *prev;
 	bool pruned = false;
+	u64 pruned_seq;
 
 	if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
 		return;
 
 	node = &msk->ooo_last_skb->rbnode;
 
+	pruned_seq = msk->pruned_seq;
 	do {
 		struct sk_buff *skb = rb_to_skb(node);
 
@@ -395,16 +397,21 @@ static void mptcp_prune_ofo_queue(struct sock *sk, u64 seq)
 		pruned = true;
 		prev = rb_prev(node);
 		rb_erase(node, &msk->out_of_order_queue);
+		if (after(MPTCP_SKB_CB(skb)->end_seq, pruned_seq))
+			pruned_seq = MPTCP_SKB_CB(skb)->end_seq;
 		mptcp_drop(sk, skb);
 		msk->ooo_last_skb = rb_to_skb(prev);
+
 		if (atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf)
 			break;
 
 		node = prev;
 	} while (node);
 
-	if (pruned)
+	if (pruned) {
+		WRITE_ONCE(msk->pruned_seq, pruned_seq);
 		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED);
+	}
 }
 
 bool __mptcp_check_prune(struct sock *sk, u64 seq)
@@ -427,6 +434,8 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 	if (unlikely(sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) &&
 	    __mptcp_check_prune(sk, MPTCP_SKB_CB(skb)->map_seq) &&
 	    !__mptcp_check_fallback(msk)) {
+		if (after(MPTCP_SKB_CB(skb)->end_seq, msk->pruned_seq))
+			WRITE_ONCE(msk->pruned_seq, MPTCP_SKB_CB(skb)->end_seq);
 		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
 		mptcp_drop(sk, skb);
 		return false;
@@ -887,6 +896,8 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
 		WRITE_ONCE(msk->ack_seq, end_seq);
 		moved = true;
 	}
+	if (after64(msk->ack_seq, msk->pruned_seq))
+		WRITE_ONCE(msk->pruned_seq, msk->ack_seq);
 	return moved;
 }
 
@@ -3536,6 +3547,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 	/* for fallback's sake */
 	WRITE_ONCE(msk->ack_seq, 0);
 	atomic64_set(&msk->rcv_wnd_sent, 0);
+	WRITE_ONCE(msk->pruned_seq, 0);
 
 	WRITE_ONCE(sk->sk_shutdown, 0);
 	sk_error_report(sk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 95774a4e7231..32daf51e48ef 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -303,6 +303,9 @@ struct mptcp_sock {
 	u64		bytes_acked;
 	u64		snd_una;
 	u64		wnd_end;
+	u64		pruned_seq;		/* If strictly above ack_seq,
+						 * the highest seq pruned.
+						 */
 	u32		last_data_sent;
 	u32		last_data_recv;
 	u32		last_ack_recv;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index d562e149606f..cc75d914c1b5 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -494,6 +494,7 @@ static void subflow_set_remote_key(struct mptcp_sock *msk,
 
 	WRITE_ONCE(msk->remote_key, subflow->remote_key);
 	WRITE_ONCE(msk->ack_seq, subflow->iasn);
+	WRITE_ONCE(msk->pruned_seq, subflow->iasn);
 	WRITE_ONCE(msk->can_ack, true);
 	atomic64_set(&msk->rcv_wnd_sent, subflow->iasn);
 }
-- 
2.54.0



* [PATCH v6 mptcp-next 6/7] mptcp: move the retrans loop to a separate helper
  2026-05-15  9:07 [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
                   ` (4 preceding siblings ...)
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 5/7] mptcp: track prune recovery status Paolo Abeni
@ 2026-05-15  9:07 ` Paolo Abeni
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 7/7] mptcp: let the retrans scheduler do its job Paolo Abeni
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2026-05-15  9:07 UTC (permalink / raw)
  To: mptcp; +Cc: Geliang Tang, gang.yan

This is a cleanup in order to make the next patch simpler.
No functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 74 +++++++++++++++++++++++++-------------------
 1 file changed, 43 insertions(+), 31 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 640632c283e1..de142be05934 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2835,41 +2835,14 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
 	sk_error_report(sk);
 }
 
-static void __mptcp_retrans(struct sock *sk)
+/* Retransmit the specified data fragment on all the selected subflows. */
+static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
 {
 	struct mptcp_sendmsg_info info = { .data_lock_held = true, };
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct mptcp_subflow_context *subflow;
-	struct mptcp_data_frag *dfrag;
 	struct sock *ssk;
-	int ret, err;
-	u16 len = 0;
-
-	mptcp_clean_una_wakeup(sk);
-
-	/* first check ssk: need to kick "stale" logic */
-	err = mptcp_sched_get_retrans(msk);
-	dfrag = mptcp_rtx_head(sk);
-	if (!dfrag) {
-		if (mptcp_data_fin_enabled(msk)) {
-			struct inet_connection_sock *icsk = inet_csk(sk);
-
-			WRITE_ONCE(icsk->icsk_retransmits,
-				   icsk->icsk_retransmits + 1);
-			mptcp_set_datafin_timeout(sk);
-			mptcp_send_ack(msk);
-
-			goto reset_timer;
-		}
-
-		if (!mptcp_send_head(sk))
-			goto clear_scheduled;
-
-		goto reset_timer;
-	}
-
-	if (err)
-		goto reset_timer;
+	int ret, len = 0;
 
 	mptcp_for_each_subflow(msk, subflow) {
 		if (READ_ONCE(subflow->scheduled)) {
@@ -2897,7 +2870,7 @@ static void __mptcp_retrans(struct sock *sk)
 			    !msk->allow_subflows) {
 				spin_unlock_bh(&msk->fallback_lock);
 				release_sock(ssk);
-				goto clear_scheduled;
+				return -1;
 			}
 
 			while (info.sent < info.limit) {
@@ -2920,6 +2893,45 @@ static void __mptcp_retrans(struct sock *sk)
 			release_sock(ssk);
 		}
 	}
+	return len;
+}
+
+static void __mptcp_retrans(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct mptcp_subflow_context *subflow;
+	struct mptcp_data_frag *dfrag;
+	int err, len;
+
+	mptcp_clean_una_wakeup(sk);
+
+	/* first check ssk: need to kick "stale" logic */
+	err = mptcp_sched_get_retrans(msk);
+	dfrag = mptcp_rtx_head(sk);
+	if (!dfrag) {
+		if (mptcp_data_fin_enabled(msk)) {
+			struct inet_connection_sock *icsk = inet_csk(sk);
+
+			WRITE_ONCE(icsk->icsk_retransmits,
+				   icsk->icsk_retransmits + 1);
+			mptcp_set_datafin_timeout(sk);
+			mptcp_send_ack(msk);
+
+			goto reset_timer;
+		}
+
+		if (!mptcp_send_head(sk))
+			goto clear_scheduled;
+
+		goto reset_timer;
+	}
+
+	if (err)
+		goto reset_timer;
+
+	len = __mptcp_push_retrans(sk, dfrag);
+	if (len < 0)
+		goto clear_scheduled;
 
 	msk->bytes_retrans += len;
 	dfrag->already_sent = max(dfrag->already_sent, len);
-- 
2.54.0



* [PATCH v6 mptcp-next 7/7] mptcp: let the retrans scheduler do its job.
  2026-05-15  9:07 [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
                   ` (5 preceding siblings ...)
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 6/7] mptcp: move the retrans loop to a separate helper Paolo Abeni
@ 2026-05-15  9:07 ` Paolo Abeni
  2026-05-15  9:29 ` [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
  2026-05-15 10:35 ` MPTCP CI
  8 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2026-05-15  9:07 UTC (permalink / raw)
  To: mptcp; +Cc: Geliang Tang, gang.yan

Currently the MPTCP core enforces that when the MPTCP-level retrans timer
fires, at most a single dfrag is retransmitted. In some corner cases it
may be necessary to retransmit multiple dfrags, and the MPTCP socket will
need to wait for multiple retrans timeouts to accomplish that.

Remove the mentioned constraint, allowing multiple dfrags to be
transmitted per retrans period, as long as the scheduler keeps selecting
subflows for retransmissions and pending data is available in the rtx
queue.
The default scheduler will transmit a dfrag per available subflow.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v4 -> v5:
  - fixed already_sent update

v3 -> v4:
  - avoid quadratic behavior, fix retrans_seq update
  - fix rtx timer re-schedule miss

v2 -> v3:
  - fix infinite loop issue (should address tls tests failures)

v1 -> v2:
  - fix retrans sequence update (sashiko)
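
The behavior described in the commit message can be sketched in userspace
as follows. This is only an illustrative model, not the kernel code:
`struct frag`, `retrans_pass()`, `picks` and `chunk` are hypothetical names
invented for the example, checksums are assumed to be disabled (so only
bytes already sent once are eligible), and locking, subflow selection and
incoming acks are ignored. `before64()` follows the same wrap-safe
comparison idea as the helper used in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct mptcp_data_frag. */
struct frag {
	uint64_t data_seq;     /* first MPTCP-level sequence number */
	uint16_t data_len;     /* bytes queued in this fragment */
	uint16_t already_sent; /* bytes transmitted at least once */
};

/* Wrap-safe "seq1 precedes seq2" comparison on 64-bit sequence numbers. */
static int before64(uint64_t seq1, uint64_t seq2)
{
	return (int64_t)(seq1 - seq2) < 0;
}

/*
 * Model of one retransmission timer expiry. 'picks' is an assumed cap on
 * how many times the scheduler selects a subflow; 'chunk' is an assumed cap
 * on the bytes a single pick can carry. Returns the bytes retransmitted.
 */
static unsigned int retrans_pass(const struct frag *q, int nr,
				 uint64_t snd_una, int picks,
				 unsigned int chunk)
{
	uint64_t retrans_seq = snd_una;
	unsigned int total = 0;
	int i = 0;

	while (i < nr && picks-- > 0) {
		const struct frag *f = &q[i];
		unsigned int offset, len;

		/* The cursor must not precede the fragment being resent. */
		assert(!before64(retrans_seq, f->data_seq));
		offset = (unsigned int)(retrans_seq - f->data_seq);
		if (offset >= f->already_sent)
			break; /* nothing eligible left in this fragment */

		len = f->already_sent - offset;
		if (len > chunk)
			len = chunk;
		retrans_seq += len;
		total += len;

		/* Move on only once the whole fragment has been resent. */
		if (before64(retrans_seq, f->data_seq + f->data_len))
			break;
		i++;
		/* Never-sent fragments belong to the regular send path. */
		if (i < nr && !q[i].already_sent)
			break;
	}
	return total;
}
```

With a three-fragment queue {1000, 100, 100}, {1100, 100, 100},
{1200, 100, 40} (data_seq, data_len, already_sent) and snd_una = 1000, a
single pick resends 100 bytes, which models the old one-dfrag-per-timer
limit, while four picks resend 240 bytes within the same pass.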
---
 net/mptcp/protocol.c | 105 +++++++++++++++++++++++++++++++------------
 1 file changed, 77 insertions(+), 28 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index de142be05934..b0877908883a 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1208,13 +1208,6 @@ static void __mptcp_clean_una_wakeup(struct sock *sk)
 	mptcp_write_space(sk);
 }
 
-static void mptcp_clean_una_wakeup(struct sock *sk)
-{
-	mptcp_data_lock(sk);
-	__mptcp_clean_una_wakeup(sk);
-	mptcp_data_unlock(sk);
-}
-
 static void mptcp_enter_memory_pressure(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow;
@@ -2835,8 +2828,12 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
 	sk_error_report(sk);
 }
 
-/* Retransmit the specified data fragment on all the selected subflows. */
-static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
+/*
+ * Retransmit the specified data fragment on all the selected subflows,
+ * starting from the specified sequence
+ */
+static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag,
+				u64 sent_seq)
 {
 	struct mptcp_sendmsg_info info = { .data_lock_held = true, };
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -2846,6 +2843,7 @@ static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
 
 	mptcp_for_each_subflow(msk, subflow) {
 		if (READ_ONCE(subflow->scheduled)) {
+			u16 offset = sent_seq - dfrag->data_seq;
 			u16 copied = 0;
 
 			mptcp_subflow_set_scheduled(subflow, false);
@@ -2855,9 +2853,12 @@ static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
 			lock_sock(ssk);
 
 			/* limit retransmission to the bytes already sent on some subflows */
-			info.sent = 0;
+			info.sent = offset;
 			info.limit = READ_ONCE(msk->csum_enabled) ? dfrag->data_len :
 								    dfrag->already_sent;
+			DEBUG_NET_WARN_ON_ONCE(!before64(sent_seq,
+							 dfrag->data_seq +
+							 info.limit));
 
 			/*
 			 * make the whole retrans decision, xmit, disallow
@@ -2901,41 +2902,89 @@ static void __mptcp_retrans(struct sock *sk)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct mptcp_subflow_context *subflow;
 	struct mptcp_data_frag *dfrag;
+	bool retransmitted = false;
+	u64 retrans_seq;
 	int err, len;
 
-	mptcp_clean_una_wakeup(sk);
-
-	/* first check ssk: need to kick "stale" logic */
-	err = mptcp_sched_get_retrans(msk);
+	mptcp_data_lock(sk);
+	__mptcp_clean_una_wakeup(sk);
+	retrans_seq = msk->snd_una;
 	dfrag = mptcp_rtx_head(sk);
+	mptcp_data_unlock(sk);
+	if (!dfrag)
+		goto check_data_fin;
+
+	for (;;) {
+		bool already_retrans;
+
+		/* The scheduler may clean the RTX queue. */
+		get_page(dfrag->page);
+
+		/* The default scheduler will kick "stale" logic. */
+		err = mptcp_sched_get_retrans(msk);
+		if (err) {
+			put_page(dfrag->page);
+			break;
+		}
+
+		/* Incoming acks can have moved retrans sequence after
+		 * the current dfrag, if so try to start again from RTX head.
+		 */
+		mptcp_data_lock(sk);
+		already_retrans = !dfrag->already_sent ||
+				  !before64(msk->snd_una, dfrag->data_seq +
+					    dfrag->already_sent);
+		put_page(dfrag->page);
+		if (already_retrans) {
+			__mptcp_clean_una_wakeup(sk);
+			retrans_seq = msk->snd_una;
+			dfrag = mptcp_rtx_head(sk);
+		}
+		mptcp_data_unlock(sk);
+		if (!dfrag)
+			break;
+
+		len = __mptcp_push_retrans(sk, dfrag, retrans_seq);
+		if (len < 0)
+			goto clear_scheduled;
+
+		retransmitted = true;
+		retrans_seq += len;
+		msk->bytes_retrans += len;
+		dfrag->already_sent = max_t(u16, dfrag->already_sent,
+					    retrans_seq - dfrag->data_seq);
+
+		/* Attempt the next fragment only if the current one is
+		 * completely retransmitted.
+		 */
+		if (before64(retrans_seq, dfrag->data_seq + dfrag->data_len))
+			break;
+
+		dfrag = list_is_last(&dfrag->list, &msk->rtx_queue) ?
+				NULL : list_next_entry(dfrag, list);
+		if (!dfrag || !dfrag->already_sent)
+			break;
+	}
+
+	/* Data fin retransmission needed only if no data retransmission took
+	 * place, and RTX queue is empty.
+	 */
+check_data_fin:
 	if (!dfrag) {
-		if (mptcp_data_fin_enabled(msk)) {
+		if (!retransmitted && mptcp_data_fin_enabled(msk)) {
 			struct inet_connection_sock *icsk = inet_csk(sk);
 
 			WRITE_ONCE(icsk->icsk_retransmits,
 				   icsk->icsk_retransmits + 1);
 			mptcp_set_datafin_timeout(sk);
 			mptcp_send_ack(msk);
-
 			goto reset_timer;
 		}
 
 		if (!mptcp_send_head(sk))
 			goto clear_scheduled;
-
-		goto reset_timer;
 	}
 
-	if (err)
-		goto reset_timer;
-
-	len = __mptcp_push_retrans(sk, dfrag);
-	if (len < 0)
-		goto clear_scheduled;
-
-	msk->bytes_retrans += len;
-	dfrag->already_sent = max(dfrag->already_sent, len);
-
 reset_timer:
 	mptcp_check_and_set_pending(sk);
 
-- 
2.54.0



* Re: [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure
  2026-05-15  9:07 [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
                   ` (6 preceding siblings ...)
  2026-05-15  9:07 ` [PATCH v6 mptcp-next 7/7] mptcp: let the retrans scheduler do its job Paolo Abeni
@ 2026-05-15  9:29 ` Paolo Abeni
  2026-05-18  8:13   ` gang.yan
  2026-05-15 10:35 ` MPTCP CI
  8 siblings, 1 reply; 14+ messages in thread
From: Paolo Abeni @ 2026-05-15  9:29 UTC (permalink / raw)
  To: mptcp; +Cc: Geliang Tang, gang.yan

On 5/15/26 11:07 AM, Paolo Abeni wrote:
> This an attempt to fix the data transfer stall reported by Geliang and
> Gang more carefully enforcing memory constraints at the MPTCP level.
> 
> This iteration presents a significant change WRT the previous one,
> avoiding entirely the collapse attempt on memory pressure. Note that
> this choice represent a trade off: collapsing allow much faster transfer
> (to be more accurate: order of magnitude less slow) under some extreme
> conditions, but makes transfer slower and much more CPU intensive for
> less unlikely conditions.
> 
> As a consequence of the above the `mptcp_data.multi_chunk_sendfile`
> test-case needs a 240 seconds timeout to complete successfully:
> 
> TEST_F_TIMEOUT(mptcp, multi_chunk_sendfile, 240)
> 
> The solution performing data collapsing would need similar long timeout
> for the multiproc tests cases: mutliproc_even, mutliproc_readers,
> mutliproc_writers, mutliproc_sendpage_even, mutliproc_sendpage_readers,
> mutliproc_sendpage_writers.
> 
> Patch 1 is new in v6, and is actually a fix for an old issue (targeting
> net), included here just for my convenience.
> 
> Patch 2 and 3 makes the admission check much more strict for incoming
> packets exceeding the memory limits, with some exception for fallback
> sockets.
> Patch 4 makes implement OoO queue pruning for MPTCP and patch 5
> addresses an edge scenario that could still lead to transfer stall
> under memory pressure.
> Finally patch 6 and 7 improve the MPTCP-level retransmission schema to
> make recovery from memory pressure/after MPTCP-level drop significanly
> faster.

@Geliang, @Gang: could you please have a spin at this iteration? Note
that you must increase the timeout for the
mptcp_data.multi_chunk_sendfile test-case, as mentioned above.

Side note: along with the "collapse" code, this revision also omits a few
related refactor patches, which I still plan to upstream later, since they
are a nice cleanup and reduce the differences vs plain TCP.

/P



* Re: [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure
  2026-05-15  9:07 [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
                   ` (7 preceding siblings ...)
  2026-05-15  9:29 ` [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
@ 2026-05-15 10:35 ` MPTCP CI
  8 siblings, 0 replies; 14+ messages in thread
From: MPTCP CI @ 2026-05-15 10:35 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal (except selftest_mptcp_join): Success! ✅
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Success! ✅
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/25910612780

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/0c52f631034a
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1095238


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)


* Re: [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure
  2026-05-15  9:29 ` [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
@ 2026-05-18  8:13   ` gang.yan
  2026-05-18 16:30     ` Paolo Abeni
  0 siblings, 1 reply; 14+ messages in thread
From: gang.yan @ 2026-05-18  8:13 UTC (permalink / raw)
  To: Paolo Abeni, mptcp; +Cc: Geliang Tang

May 15, 2026 at 5:29 PM, "Paolo Abeni" <pabeni@redhat.com> wrote:


> 
> On 5/15/26 11:07 AM, Paolo Abeni wrote:
> 
> > 
> > This an attempt to fix the data transfer stall reported by Geliang and
> >  Gang more carefully enforcing memory constraints at the MPTCP level.
> >  
> >  This iteration presents a significant change WRT the previous one,
> >  avoiding entirely the collapse attempt on memory pressure. Note that
> >  this choice represent a trade off: collapsing allow much faster transfer
> >  (to be more accurate: order of magnitude less slow) under some extreme
> >  conditions, but makes transfer slower and much more CPU intensive for
> >  less unlikely conditions.
> >  
> >  As a consequence of the above the `mptcp_data.multi_chunk_sendfile`
> >  test-case needs a 240 seconds timeout to complete successfully:
> >  
> >  TEST_F_TIMEOUT(mptcp, multi_chunk_sendfile, 240)
> >  
> >  The solution performing data collapsing would need similar long timeout
> >  for the multiproc tests cases: mutliproc_even, mutliproc_readers,
> >  mutliproc_writers, mutliproc_sendpage_even, mutliproc_sendpage_readers,
> >  mutliproc_sendpage_writers.
> >  
> >  Patch 1 is new in v6, and is actually a fix for an old issue (targeting
> >  net), included here just for my convenience.
> >  
> >  Patch 2 and 3 makes the admission check much more strict for incoming
> >  packets exceeding the memory limits, with some exception for fallback
> >  sockets.
> >  Patch 4 makes implement OoO queue pruning for MPTCP and patch 5
> >  addresses an edge scenario that could still lead to transfer stall
> >  under memory pressure.
> >  Finally patch 6 and 7 improve the MPTCP-level retransmission schema to
> >  make recovery from memory pressure/after MPTCP-level drop significanly
> >  faster.
> > 
> @Geliang, @Gang: could you please have a spin at this iteration? Note
> that you must increase the timeout for the
> mptcp_data.multi_chunk_sendfile test-case, as mentioned above.
> 
Hi Paolo,

Thanks a lot for the new patch set. We have encountered some issues during
our testing. It will take us some time to analyze and locate the root cause.
We will post updates on the mailing list as soon as we make progress.

Best regards
Gang

> Side note: with the "collapse" code this revision also omitted a related
> few refactor patches, that I still plan to upstream later, since the
> effect is a nice cleanup and reducing differences VS plain TCP.
> 
> /P
>


* Re: [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure
  2026-05-18  8:13   ` gang.yan
@ 2026-05-18 16:30     ` Paolo Abeni
  2026-05-19  8:50       ` gang.yan
  0 siblings, 1 reply; 14+ messages in thread
From: Paolo Abeni @ 2026-05-18 16:30 UTC (permalink / raw)
  To: gang.yan, mptcp; +Cc: Geliang Tang

On 5/18/26 10:13 AM, gang.yan@linux.dev wrote:
> May 15, 2026 at 5:29 PM, "Paolo Abeni" <pabeni@redhat.com> wrote:
>> On 5/15/26 11:07 AM, Paolo Abeni wrote:
>>> This an attempt to fix the data transfer stall reported by Geliang and
>>>  Gang more carefully enforcing memory constraints at the MPTCP level.
>>>  
>>>  This iteration presents a significant change WRT the previous one,
>>>  avoiding entirely the collapse attempt on memory pressure. Note that
>>>  this choice represent a trade off: collapsing allow much faster transfer
>>>  (to be more accurate: order of magnitude less slow) under some extreme
>>>  conditions, but makes transfer slower and much more CPU intensive for
>>>  less unlikely conditions.
>>>  
>>>  As a consequence of the above the `mptcp_data.multi_chunk_sendfile`
>>>  test-case needs a 240 seconds timeout to complete successfully:
>>>  
>>>  TEST_F_TIMEOUT(mptcp, multi_chunk_sendfile, 240)
>>>  
>>>  The solution performing data collapsing would need similar long timeout
>>>  for the multiproc tests cases: mutliproc_even, mutliproc_readers,
>>>  mutliproc_writers, mutliproc_sendpage_even, mutliproc_sendpage_readers,
>>>  mutliproc_sendpage_writers.
>>>  
>>>  Patch 1 is new in v6, and is actually a fix for an old issue (targeting
>>>  net), included here just for my convenience.
>>>  
>>>  Patch 2 and 3 makes the admission check much more strict for incoming
>>>  packets exceeding the memory limits, with some exception for fallback
>>>  sockets.
>>>  Patch 4 makes implement OoO queue pruning for MPTCP and patch 5
>>>  addresses an edge scenario that could still lead to transfer stall
>>>  under memory pressure.
>>>  Finally patch 6 and 7 improve the MPTCP-level retransmission schema to
>>>  make recovery from memory pressure/after MPTCP-level drop significanly
>>>  faster.
>>>
>> @Geliang, @Gang: could you please have a spin at this iteration? Note
>> that you must increase the timeout for the
>> mptcp_data.multi_chunk_sendfile test-case, as mentioned above.
>>
> Hi Paolo,
> 
> Thanks a lot for the new patch set. We have encountered some issues during
> our testing. It will take us some time to analyze and locate the root cause.
> We will post updates on the mailing list as soon as we make progress.

I ran a few hundred iterations of the mptcp_data test without observing
any issue (still running). Can you please share which issue you are
observing, the relevant test case and the build type (debug/non-debug)?

Thanks,

Paolo



* Re: [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure
  2026-05-18 16:30     ` Paolo Abeni
@ 2026-05-19  8:50       ` gang.yan
  2026-05-19 16:18         ` Paolo Abeni
  0 siblings, 1 reply; 14+ messages in thread
From: gang.yan @ 2026-05-19  8:50 UTC (permalink / raw)
  To: Paolo Abeni, mptcp; +Cc: Geliang Tang

May 19, 2026 at 12:30 AM, "Paolo Abeni" <pabeni@redhat.com> wrote:


> 
> On 5/18/26 10:13 AM, gang.yan@linux.dev wrote:
> 
> > 
> > May 15, 2026 at 5:29 PM, "Paolo Abeni" <pabeni@redhat.com> wrote:
> > 
> > > 
> > > On 5/15/26 11:07 AM, Paolo Abeni wrote:
> > > 
> >  This an attempt to fix the data transfer stall reported by Geliang and
> >  Gang more carefully enforcing memory constraints at the MPTCP level.
> >  
> >  This iteration presents a significant change WRT the previous one,
> >  avoiding entirely the collapse attempt on memory pressure. Note that
> >  this choice represent a trade off: collapsing allow much faster transfer
> >  (to be more accurate: order of magnitude less slow) under some extreme
> >  conditions, but makes transfer slower and much more CPU intensive for
> >  less unlikely conditions.
> >  
> >  As a consequence of the above the `mptcp_data.multi_chunk_sendfile`
> >  test-case needs a 240 seconds timeout to complete successfully:
> >  
> >  TEST_F_TIMEOUT(mptcp, multi_chunk_sendfile, 240)
> >  
> >  The solution performing data collapsing would need similar long timeout
> >  for the multiproc tests cases: mutliproc_even, mutliproc_readers,
> >  mutliproc_writers, mutliproc_sendpage_even, mutliproc_sendpage_readers,
> >  mutliproc_sendpage_writers.
> >  
> >  Patch 1 is new in v6, and is actually a fix for an old issue (targeting
> >  net), included here just for my convenience.
> >  
> >  Patch 2 and 3 makes the admission check much more strict for incoming
> >  packets exceeding the memory limits, with some exception for fallback
> >  sockets.
> >  Patch 4 makes implement OoO queue pruning for MPTCP and patch 5
> >  addresses an edge scenario that could still lead to transfer stall
> >  under memory pressure.
> >  Finally patch 6 and 7 improve the MPTCP-level retransmission schema to
> >  make recovery from memory pressure/after MPTCP-level drop significanly
> >  faster.
> > 
> > > 
> > > @Geliang, @Gang: could you please have a spin at this iteration? Note
> > >  that you must increase the timeout for the
> > >  mptcp_data.multi_chunk_sendfile test-case, as mentioned above.
> > > 
> >  Hi Paolo,
> >  
> >  Thanks a lot for the new patch set. We have encountered some issues during
> >  our testing. It will take us some time to analyze and locate the root cause.
> >  We will post updates on the mailing list as soon as we make progress.
> > 
> I run a few hundred iterations of the mptcp_data test without observing
> any issue (still running). Can you please share which issue are you
> observing, the relevant tests case and the build type (debug/non debug)?
> 
> Thanks,
>
Hi Paolo,

Sorry for not providing the specific details earlier in my first message.
The issue we are seeing occurs specifically when KTLS is enabled.

Today I ran several sets of comparative tests (many hundreds of iterations),
and the current results suggest that the integration framework between KTLS
and MPTCP may need some adjustments. We are investigating the root cause.

Thanks for your help.

Cheers,
Gang

> Paolo
>


* Re: [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure
  2026-05-19  8:50       ` gang.yan
@ 2026-05-19 16:18         ` Paolo Abeni
  0 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2026-05-19 16:18 UTC (permalink / raw)
  To: gang.yan, mptcp; +Cc: Geliang Tang

On 5/19/26 10:50 AM, gang.yan@linux.dev wrote:
> May 19, 2026 at 12:30 AM, "Paolo Abeni" <pabeni@redhat.com> wrote:
>> On 5/18/26 10:13 AM, gang.yan@linux.dev wrote:
>>>  Thanks a lot for the new patch set. We have encountered some issues during
>>>  our testing. It will take us some time to analyze and locate the root cause.
>>>  We will post updates on the mailing list as soon as we make progress.
>>>
>> I run a few hundred iterations of the mptcp_data test without observing
>> any issue (still running). Can you please share which issue are you
>> observing, the relevant tests case and the build type (debug/non debug)?
>
> Sorry for not providing the specific details earlier in my first message.
> The issue we are seeing occurs specifically when KTLS is enabled.
> 
> Today I ran several sets of comparative tests (many hundreds iterations), and the
> current results suggest that the integration framework between KTLS and MPTCP
> may need some adjustments. We are investigating the root cause.

Reading the above, I understand that testing with plain MPTCP (no KTLS)
gives good results; is that correct? In any case I'll post a v7 to deal
with some of the feedback reported by sashiko on v6.

Thanks,

Paolo



end of thread, other threads:[~2026-05-19 16:18 UTC | newest]

Thread overview: 14+ messages
2026-05-15  9:07 [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
2026-05-15  9:07 ` [PATCH v6 mptcp-next 1/7] mptcp: fix missing wakeups in edge scenarios Paolo Abeni
2026-05-15  9:07 ` [PATCH v6 mptcp-next 2/7] mptcp: explicitly drop over memory limits Paolo Abeni
2026-05-15  9:07 ` [PATCH v6 mptcp-next 3/7] mptcp: enforce hard limit on backlog flushing Paolo Abeni
2026-05-15  9:07 ` [PATCH v6 mptcp-next 4/7] mptcp: implemented OoO queue pruning Paolo Abeni
2026-05-15  9:07 ` [PATCH v6 mptcp-next 5/7] mptcp: track prune recovery status Paolo Abeni
2026-05-15  9:07 ` [PATCH v6 mptcp-next 6/7] mptcp: move the retrans loop to a separate helper Paolo Abeni
2026-05-15  9:07 ` [PATCH v6 mptcp-next 7/7] mptcp: let the retrans scheduler do its job Paolo Abeni
2026-05-15  9:29 ` [PATCH v6 mptcp-next 0/7] mptcp: address stall under memory pressure Paolo Abeni
2026-05-18  8:13   ` gang.yan
2026-05-18 16:30     ` Paolo Abeni
2026-05-19  8:50       ` gang.yan
2026-05-19 16:18         ` Paolo Abeni
2026-05-15 10:35 ` MPTCP CI
