[PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure
@ 2026-04-24 14:08 Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 1/9] mptcp: move checks vs rcvbuf size earlier in the RX path Paolo Abeni
                   ` (10 more replies)
  0 siblings, 11 replies; 13+ messages in thread
From: Paolo Abeni @ 2026-04-24 14:08 UTC (permalink / raw)
  To: mptcp; +Cc: yangang, geliang, matttbe

This an attempt to fix the data transfer stall reported by Geliang and
Gang more carefully enforcing memory constraints at the MPTCP level.

Patch 1/9 moves the bound check before entering the TCP socket.
Patch 2, 3 and 4 are cleanups/refactors finalized to safely re-using TCP
helpers on MPTCP skbs.
Patch 5 makes TCP pruning related helpers available to MPTCP and patch 6
makes use of them. Patch 7 addresses an edge scenario that could still
lead to transfer stall under memory pressure.
Finally patch 8 and 9 improve the MPTCP-level retransmission schema to
make recovery from memory pressure significanly faster.

Note that the diffstat is biases by the quite large patch 4/9, which
contains mechanical transformation of existing code; "real" changes are
noticiable smaller.

Tested successfully vs the test cases proposed by Geliang and Gang.
---
RFC -> v1:
 - dropped old patch 4 & 5
 - addressed AI reported comments
 - added retrans refactor.

Paolo Abeni (9):
  mptcp: move checks vs rcvbuf size earlier in the RX path
  mptcp: drop the mptcp_ooo_try_coalesce() helper
  mptcp: remove CB offset field
  mptcp: sync mptcp skb cb layout with tcp one
  tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage, too
  mptcp: implemented OoO queue pruning
  mptcp: track prune recovery status
  mptcp: move the retrans loop to a separate helper
  mptcp: let the retrans scheduler do its job.

 include/net/tcp.h    |   8 ++
 net/ipv4/tcp_input.c |  55 +++++---
 net/mptcp/fastopen.c |   1 -
 net/mptcp/mib.c      |   3 +
 net/mptcp/mib.h      |   3 +
 net/mptcp/options.c  |  55 +++++++-
 net/mptcp/protocol.c | 328 ++++++++++++++++++++++++++++---------------
 net/mptcp/protocol.h |  11 +-
 net/mptcp/subflow.c  |   2 +
 9 files changed, 323 insertions(+), 143 deletions(-)

-- 
2.53.0

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next v1 1/9] mptcp: move checks vs rcvbuf size earlier in the RX path
  2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
@ 2026-04-24 14:08 ` Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 2/9] mptcp: drop the mptcp_ooo_try_coalesce() helper Paolo Abeni
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2026-04-24 14:08 UTC (permalink / raw)
  To: mptcp; +Cc: yangang, geliang, matttbe

Currently the enforcement of the rcvbuf constraint is implemented
when moving the skbs into the msk receive or OoO queue.

Under significant memory pressure the above can cause permanent data
transfer stalls. Move the checks early on, before landing even in
the subflow queues.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
RFC -> v1:
 - limit vs actual buffer size
 - use CB info instead of skb->len

Note that:
 - this needs the follow-up patches to really fix the stall
 - the memory comparison is intentionally very rough, as
   the msk socket lock is not currently held where the condition is
   now enforced. This should require some refinement, shared as-is
   to avoid more latency on my side
---
 net/mptcp/options.c  | 18 ++++++++++++++++--
 net/mptcp/protocol.c | 10 ++--------
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 4cc583fdc7a9..14afeee8ca5f 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1158,8 +1158,16 @@ static bool add_addr_hmac_valid(struct mptcp_sock *msk,
 	return hmac == mp_opt->ahmac;
 }
 
-/* Return false in case of error (or subflow has been reset),
- * else return true.
+static bool mptcp_over_limit(const struct sock *sk, struct sk_buff *skb)
+{
+	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
+		return false;
+
+	return sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf);
+}
+
+/* Return false when the caller must drop the packet, i.e. in case of error,
+ * subflow has been reset, or over memory limits.
  */
 bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 {
@@ -1185,6 +1193,9 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 
 		__mptcp_data_acked(subflow->conn);
 		mptcp_data_unlock(subflow->conn);
+
+		if (mptcp_over_limit(subflow->conn, skb))
+			return false;
 		return true;
 	}
 
@@ -1263,6 +1274,9 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 		return true;
 	}
 
+	if (mptcp_over_limit(subflow->conn, skb))
+		return false;
+
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
 	if (!mpext)
 		return false;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 17b9a8c13ebf..81a9b8077d6b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -739,7 +739,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 
 			mptcp_init_skb(ssk, skb, offset, len);
 
-			if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf) {
+			if (own_msk) {
 				mptcp_subflow_lend_fwdmem(subflow, skb);
 				ret |= __mptcp_move_skb(sk, skb);
 			} else {
@@ -2197,10 +2197,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
 
 	*delta = 0;
 	while (1) {
-		/* If the msk recvbuf is full stop, don't drop */
-		if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
-			break;
-
 		prefetch(skb->next);
 		list_del(&skb->list);
 		*delta += skb->truesize;
@@ -2228,9 +2224,7 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 	DEBUG_NET_WARN_ON_ONCE(msk->backlog_unaccounted && sk->sk_socket &&
 			       mem_cgroup_from_sk(sk));
 
-	/* Don't spool the backlog if the rcvbuf is full. */
-	if (list_empty(&msk->backlog_list) ||
-	    sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+	if (list_empty(&msk->backlog_list))
 		return false;
 
 	INIT_LIST_HEAD(skbs);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next v1 2/9] mptcp: drop the mptcp_ooo_try_coalesce() helper
  2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 1/9] mptcp: move checks vs rcvbuf size earlier in the RX path Paolo Abeni
@ 2026-04-24 14:08 ` Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 3/9] mptcp: remove CB offset field Paolo Abeni
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2026-04-24 14:08 UTC (permalink / raw)
  To: mptcp; +Cc: yangang, geliang, matttbe

It's used to save an additional comparison for in-order skbs, but is
also a barrier to remove CB offset. Remove the helper, let
__mptcp_try_coalesce() always perform the sequence check and remove
duplicate checks from the callers.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 81a9b8077d6b..ad0a289b544b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -159,7 +159,8 @@ static bool __mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 {
 	int limit = READ_ONCE(sk->sk_rcvbuf);
 
-	if (unlikely(MPTCP_SKB_CB(to)->cant_coalesce) ||
+	if (MPTCP_SKB_CB(from)->map_seq != MPTCP_SKB_CB(to)->end_seq ||
+	    unlikely(MPTCP_SKB_CB(to)->cant_coalesce) ||
 	    MPTCP_SKB_CB(from)->offset ||
 	    ((to->len + from->len) > (limit >> 3)) ||
 	    !skb_try_coalesce(to, from, fragstolen, delta))
@@ -192,15 +193,6 @@ static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 	return true;
 }
 
-static bool mptcp_ooo_try_coalesce(struct mptcp_sock *msk, struct sk_buff *to,
-				   struct sk_buff *from)
-{
-	if (MPTCP_SKB_CB(from)->map_seq != MPTCP_SKB_CB(to)->end_seq)
-		return false;
-
-	return mptcp_try_coalesce((struct sock *)msk, to, from);
-}
-
 /* "inspired" by tcp_rcvbuf_grow(), main difference:
  * - mptcp does not maintain a msk-level window clamp
  * - returns true when  the receive buffer is actually updated
@@ -275,7 +267,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 	/* with 2 subflows, adding at end of ooo queue is quite likely
 	 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
 	 */
-	if (mptcp_ooo_try_coalesce(msk, msk->ooo_last_skb, skb)) {
+	if (mptcp_try_coalesce(sk, msk->ooo_last_skb, skb)) {
 		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOMERGE);
 		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOQUEUETAIL);
 		return;
@@ -321,7 +313,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 				MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA);
 				goto merge_right;
 			}
-		} else if (mptcp_ooo_try_coalesce(msk, skb1, skb)) {
+		} else if (mptcp_try_coalesce(sk, skb1, skb)) {
 			MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOMERGE);
 			return;
 		}
@@ -672,8 +664,7 @@ static void __mptcp_add_backlog(struct sock *sk,
 	if (!list_empty(&msk->backlog_list))
 		tail = list_last_entry(&msk->backlog_list, struct sk_buff, list);
 
-	if (tail && MPTCP_SKB_CB(skb)->map_seq == MPTCP_SKB_CB(tail)->end_seq &&
-	    ssk == tail->sk &&
+	if (tail && ssk == tail->sk &&
 	    __mptcp_try_coalesce(sk, tail, skb, &fragstolen, &delta)) {
 		skb->truesize -= delta;
 		kfree_skb_partial(skb, fragstolen);
@@ -797,7 +788,7 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
 
 		end_seq = MPTCP_SKB_CB(skb)->end_seq;
 		tail = skb_peek_tail(&sk->sk_receive_queue);
-		if (!tail || !mptcp_ooo_try_coalesce(msk, tail, skb)) {
+		if (!tail || !mptcp_try_coalesce(sk, tail, skb)) {
 			int delta = msk->ack_seq - MPTCP_SKB_CB(skb)->map_seq;
 
 			/* skip overlapping data, if any */
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next v1 3/9] mptcp: remove CB offset field
  2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 1/9] mptcp: move checks vs rcvbuf size earlier in the RX path Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 2/9] mptcp: drop the mptcp_ooo_try_coalesce() helper Paolo Abeni
@ 2026-04-24 14:08 ` Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 4/9] mptcp: sync mptcp skb cb layout with tcp one Paolo Abeni
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2026-04-24 14:08 UTC (permalink / raw)
  To: mptcp; +Cc: yangang, geliang, matttbe

Instead, to track the bytes already consumed inside each skb, use a
msk-level new field, tracking the amount of bytes already copied to
user-space, alike what TCP is already doing.

This simplify a bit the code and will make possible the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
Note: this has the potential to break almost everything. On the flip
side the CB->offset vs copied_seq difference from TCP is quite confusing
and removing it will be for the good.

Also this explicitly relays on "mptcp: do not drop partial packets" to
avoid dropping partially consumed packets
---
 net/mptcp/fastopen.c |  1 -
 net/mptcp/protocol.c | 30 +++++++++++-------------------
 net/mptcp/protocol.h |  2 +-
 net/mptcp/subflow.c  |  1 +
 4 files changed, 13 insertions(+), 21 deletions(-)

diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
index 82ec15bcfd7f..fe579e45ecdf 100644
--- a/net/mptcp/fastopen.c
+++ b/net/mptcp/fastopen.c
@@ -44,7 +44,6 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
 	/* Only the sequence delta is relevant */
 	MPTCP_SKB_CB(skb)->map_seq = -skb->len;
 	MPTCP_SKB_CB(skb)->end_seq = 0;
-	MPTCP_SKB_CB(skb)->offset = 0;
 	MPTCP_SKB_CB(skb)->has_rxtstamp = TCP_SKB_CB(skb)->has_rxtstamp;
 	MPTCP_SKB_CB(skb)->cant_coalesce = 1;
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index ad0a289b544b..c0b77d77c268 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -161,7 +161,6 @@ static bool __mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 
 	if (MPTCP_SKB_CB(from)->map_seq != MPTCP_SKB_CB(to)->end_seq ||
 	    unlikely(MPTCP_SKB_CB(to)->cant_coalesce) ||
-	    MPTCP_SKB_CB(from)->offset ||
 	    ((to->len + from->len) > (limit >> 3)) ||
 	    !skb_try_coalesce(to, from, fragstolen, delta))
 		return false;
@@ -343,8 +342,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 	skb_set_owner_r(skb, sk);
 }
 
-static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
-			   int copy_len)
+static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	bool has_rxtstamp = TCP_SKB_CB(skb)->has_rxtstamp;
@@ -353,9 +351,8 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
 	 * mptcp_subflow_get_mapped_dsn() is based on the current tp->copied_seq
 	 * value
 	 */
-	MPTCP_SKB_CB(skb)->map_seq = mptcp_subflow_get_mapped_dsn(subflow);
-	MPTCP_SKB_CB(skb)->end_seq = MPTCP_SKB_CB(skb)->map_seq + copy_len;
-	MPTCP_SKB_CB(skb)->offset = offset;
+	MPTCP_SKB_CB(skb)->map_seq = mptcp_subflow_get_mapped_dsn(subflow) - offset;
+	MPTCP_SKB_CB(skb)->end_seq = MPTCP_SKB_CB(skb)->map_seq + skb->len;
 	MPTCP_SKB_CB(skb)->has_rxtstamp = has_rxtstamp;
 	MPTCP_SKB_CB(skb)->cant_coalesce = 0;
 
@@ -728,7 +725,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 		if (offset < skb->len) {
 			size_t len = skb->len - offset;
 
-			mptcp_init_skb(ssk, skb, offset, len);
+			mptcp_init_skb(ssk, skb, offset);
 
 			if (own_msk) {
 				mptcp_subflow_lend_fwdmem(subflow, skb);
@@ -795,8 +792,6 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
 			pr_debug("uncoalesced seq=%llx ack seq=%llx delta=%d\n",
 				 MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq,
 				 delta);
-			MPTCP_SKB_CB(skb)->offset += delta;
-			MPTCP_SKB_CB(skb)->map_seq += delta;
 			__skb_queue_tail(&sk->sk_receive_queue, skb);
 		}
 		msk->bytes_received += end_seq - msk->ack_seq;
@@ -2050,7 +2045,7 @@ static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg,
 	int copied = 0;
 
 	skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
-		u32 delta, offset = MPTCP_SKB_CB(skb)->offset;
+		u32 delta, offset = msk->copied_seq - MPTCP_SKB_CB(skb)->map_seq;
 		u32 data_len = skb->len - offset;
 		u32 count;
 		int err;
@@ -2088,11 +2083,9 @@ static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg,
 
 		if (!(flags & MSG_PEEK)) {
 			msk->bytes_consumed += count;
-			if (count < data_len) {
-				MPTCP_SKB_CB(skb)->offset += count;
-				MPTCP_SKB_CB(skb)->map_seq += count;
+			msk->copied_seq += count;
+			if (count < data_len)
 				break;
-			}
 
 			mptcp_eat_recv_skb(sk, skb);
 		} else {
@@ -3457,6 +3450,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 
 	/* for fallback's sake */
 	WRITE_ONCE(msk->ack_seq, 0);
+	WRITE_ONCE(msk->copied_seq, 0);
 
 	WRITE_ONCE(sk->sk_shutdown, 0);
 	sk_error_report(sk);
@@ -4340,7 +4334,7 @@ static struct sk_buff *mptcp_recv_skb(struct sock *sk, u32 *off)
 		mptcp_move_skbs(sk);
 
 	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
-		offset = MPTCP_SKB_CB(skb)->offset;
+		offset = msk->copied_seq - MPTCP_SKB_CB(skb)->map_seq;
 		if (offset < skb->len) {
 			*off = offset;
 			return skb;
@@ -4382,11 +4376,9 @@ static int __mptcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		copied += count;
 
 		msk->bytes_consumed += count;
-		if (count < data_len) {
-			MPTCP_SKB_CB(skb)->offset += count;
-			MPTCP_SKB_CB(skb)->map_seq += count;
+		msk->copied_seq += count;
+		if (count < data_len)
 			break;
-		}
 
 		mptcp_eat_recv_skb(sk, skb);
 	}
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 661600f8b573..dd437643e604 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -128,7 +128,6 @@
 struct mptcp_skb_cb {
 	u64 map_seq;
 	u64 end_seq;
-	u32 offset;
 	u8  has_rxtstamp;
 	u8  cant_coalesce;
 };
@@ -289,6 +288,7 @@ struct mptcp_sock {
 	u64		bytes_sent;
 	u64		snd_nxt;
 	u64		bytes_received;
+	u64		copied_seq;
 	u64		ack_seq;
 	atomic64_t	rcv_wnd_sent;
 	u64		rcv_data_fin_seq;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index c57ed27a5fb0..2a8d5da4aaea 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -494,6 +494,7 @@ static void subflow_set_remote_key(struct mptcp_sock *msk,
 
 	WRITE_ONCE(msk->remote_key, subflow->remote_key);
 	WRITE_ONCE(msk->ack_seq, subflow->iasn);
+	WRITE_ONCE(msk->copied_seq, subflow->iasn);
 	WRITE_ONCE(msk->can_ack, true);
 	atomic64_set(&msk->rcv_wnd_sent, subflow->iasn);
 }
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next v1 4/9] mptcp: sync mptcp skb cb layout with tcp one
  2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
                   ` (2 preceding siblings ...)
  2026-04-24 14:08 ` [PATCH mptcp-next v1 3/9] mptcp: remove CB offset field Paolo Abeni
@ 2026-04-24 14:08 ` Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 5/9] tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage, too Paolo Abeni
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2026-04-24 14:08 UTC (permalink / raw)
  To: mptcp; +Cc: yangang, geliang, matttbe

The MPTCP protocol uses a significantly different CB layout WRT TCP, as it
includes different information and use 64 bits for the sequence numbers.

As the msk-level rcvbuf buffer size is limited by the core socket code the
INT_MAX, we can safely use 32 bits for MPTCP-level sequence number. This
allow updating the MPTCP CB layout so that fields with a corresponding TCP-level
data use the same area inside the CB itself.

Add build time check the unsure the latter invariant.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
rfc -> v1:
  - keep `ack_seq` up2date
---
 net/mptcp/protocol.c | 81 ++++++++++++++++++++++++++------------------
 net/mptcp/protocol.h |  6 ++--
 2 files changed, 52 insertions(+), 35 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index c0b77d77c268..49e62f817fd6 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -28,7 +28,7 @@
 #include "protocol.h"
 #include "mib.h"
 
-static unsigned int mptcp_inq_hint(const struct sock *sk);
+static int mptcp_inq_hint(const struct sock *sk);
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/mptcp.h>
@@ -165,7 +165,7 @@ static bool __mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 	    !skb_try_coalesce(to, from, fragstolen, delta))
 		return false;
 
-	pr_debug("colesced seq %llx into %llx new len %d new end seq %llx\n",
+	pr_debug("colesced seq %x into %x new len %d new end seq %x\n",
 		 MPTCP_SKB_CB(from)->map_seq, MPTCP_SKB_CB(to)->map_seq,
 		 to->len, MPTCP_SKB_CB(from)->end_seq);
 	MPTCP_SKB_CB(to)->end_seq = MPTCP_SKB_CB(from)->end_seq;
@@ -235,20 +235,20 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 {
 	struct sock *sk = (struct sock *)msk;
 	struct rb_node **p, *parent;
-	u64 seq, end_seq, max_seq;
+	u32 seq, end_seq, max_seq;
 	struct sk_buff *skb1;
 
 	seq = MPTCP_SKB_CB(skb)->map_seq;
 	end_seq = MPTCP_SKB_CB(skb)->end_seq;
 	max_seq = atomic64_read(&msk->rcv_wnd_sent);
 
-	pr_debug("msk=%p seq=%llx limit=%llx empty=%d\n", msk, seq, max_seq,
+	pr_debug("msk=%p seq=%x limit=%x empty=%d\n", msk, seq, max_seq,
 		 RB_EMPTY_ROOT(&msk->out_of_order_queue));
-	if (after64(end_seq, max_seq)) {
+	if (after(end_seq, max_seq)) {
 		/* out of window */
 		mptcp_drop(sk, skb);
-		pr_debug("oow by %lld, rcv_wnd_sent %llu\n",
-			 (unsigned long long)end_seq - (unsigned long)max_seq,
+		pr_debug("oow by %d, rcv_wnd_sent %llu\n",
+			 end_seq - max_seq,
 			 (unsigned long long)atomic64_read(&msk->rcv_wnd_sent));
 		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_NODSSWINDOW);
 		return;
@@ -273,7 +273,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 	}
 
 	/* Can avoid an rbtree lookup if we are adding skb after ooo_last_skb */
-	if (!before64(seq, MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq)) {
+	if (!before(seq, MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq)) {
 		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOQUEUETAIL);
 		parent = &msk->ooo_last_skb->rbnode;
 		p = &parent->rb_right;
@@ -285,18 +285,18 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 	while (*p) {
 		parent = *p;
 		skb1 = rb_to_skb(parent);
-		if (before64(seq, MPTCP_SKB_CB(skb1)->map_seq)) {
+		if (before(seq, MPTCP_SKB_CB(skb1)->map_seq)) {
 			p = &parent->rb_left;
 			continue;
 		}
-		if (before64(seq, MPTCP_SKB_CB(skb1)->end_seq)) {
-			if (!after64(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) {
+		if (before(seq, MPTCP_SKB_CB(skb1)->end_seq)) {
+			if (!after(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) {
 				/* All the bits are present. Drop. */
 				mptcp_drop(sk, skb);
 				MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA);
 				return;
 			}
-			if (after64(seq, MPTCP_SKB_CB(skb1)->map_seq)) {
+			if (after(seq, MPTCP_SKB_CB(skb1)->map_seq)) {
 				/* partial overlap:
 				 *     |     skb      |
 				 *  |     skb1    |
@@ -327,7 +327,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 merge_right:
 	/* Remove other segments covered by skb. */
 	while ((skb1 = skb_rb_next(skb)) != NULL) {
-		if (before64(end_seq, MPTCP_SKB_CB(skb1)->end_seq))
+		if (before(end_seq, MPTCP_SKB_CB(skb1)->end_seq))
 			break;
 		rb_erase(&skb1->rbnode, &msk->out_of_order_queue);
 		mptcp_drop(sk, skb1);
@@ -349,10 +349,11 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset)
 
 	/* the skb map_seq accounts for the skb offset:
 	 * mptcp_subflow_get_mapped_dsn() is based on the current tp->copied_seq
-	 * value
+	 * value; note that seq numbers are truncated to 32bits
 	 */
 	MPTCP_SKB_CB(skb)->map_seq = mptcp_subflow_get_mapped_dsn(subflow) - offset;
 	MPTCP_SKB_CB(skb)->end_seq = MPTCP_SKB_CB(skb)->map_seq + skb->len;
+	MPTCP_SKB_CB(skb)->flags = 0;
 	MPTCP_SKB_CB(skb)->has_rxtstamp = has_rxtstamp;
 	MPTCP_SKB_CB(skb)->cant_coalesce = 0;
 
@@ -364,13 +365,14 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset)
 
 static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 {
-	u64 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
+	u32 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	u32 ack_seq = msk->ack_seq;
 	struct sk_buff *tail;
 
 	mptcp_borrow_fwdmem(sk, skb);
 
-	if (MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq) {
+	if (MPTCP_SKB_CB(skb)->map_seq == ack_seq) {
 		/* in sequence */
 		msk->bytes_received += copy_len;
 		WRITE_ONCE(msk->ack_seq, msk->ack_seq + copy_len);
@@ -381,7 +383,7 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 		skb_set_owner_r(skb, sk);
 		__skb_queue_tail(&sk->sk_receive_queue, skb);
 		return true;
-	} else if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq)) {
+	} else if (after(MPTCP_SKB_CB(skb)->map_seq, ack_seq)) {
 		mptcp_data_queue_ofo(msk, skb);
 		return false;
 	}
@@ -762,40 +764,40 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
 {
 	struct sock *sk = (struct sock *)msk;
 	struct sk_buff *skb, *tail;
+	u32 seq_delta, ack_seq;
 	bool moved = false;
 	struct rb_node *p;
-	u64 end_seq;
 
 	p = rb_first(&msk->out_of_order_queue);
 	pr_debug("msk=%p empty=%d\n", msk, RB_EMPTY_ROOT(&msk->out_of_order_queue));
 	while (p) {
+		ack_seq = msk->ack_seq;
 		skb = rb_to_skb(p);
-		if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq))
+		if (after(MPTCP_SKB_CB(skb)->map_seq, ack_seq))
 			break;
 
 		p = rb_next(p);
 		rb_erase(&skb->rbnode, &msk->out_of_order_queue);
 
-		if (unlikely(!after64(MPTCP_SKB_CB(skb)->end_seq,
-				      msk->ack_seq))) {
+		if (unlikely(!after(MPTCP_SKB_CB(skb)->end_seq, ack_seq))) {
 			mptcp_drop(sk, skb);
 			MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA);
 			continue;
 		}
 
-		end_seq = MPTCP_SKB_CB(skb)->end_seq;
+		seq_delta = MPTCP_SKB_CB(skb)->end_seq - ack_seq;
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		if (!tail || !mptcp_try_coalesce(sk, tail, skb)) {
-			int delta = msk->ack_seq - MPTCP_SKB_CB(skb)->map_seq;
+			int delta = ack_seq - MPTCP_SKB_CB(skb)->map_seq;
 
 			/* skip overlapping data, if any */
-			pr_debug("uncoalesced seq=%llx ack seq=%llx delta=%d\n",
-				 MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq,
+			pr_debug("uncoalesced seq=%x ack seq=%x delta=%d\n",
+				 MPTCP_SKB_CB(skb)->map_seq, ack_seq,
 				 delta);
 			__skb_queue_tail(&sk->sk_receive_queue, skb);
 		}
-		msk->bytes_received += end_seq - msk->ack_seq;
-		WRITE_ONCE(msk->ack_seq, end_seq);
+		msk->bytes_received += seq_delta;
+		WRITE_ONCE(msk->ack_seq, msk->ack_seq + seq_delta);
 		moved = true;
 	}
 	return moved;
@@ -2243,19 +2245,20 @@ static bool mptcp_move_skbs(struct sock *sk)
 	return enqueued;
 }
 
-static unsigned int mptcp_inq_hint(const struct sock *sk)
+static int mptcp_inq_hint(const struct sock *sk)
 {
 	const struct mptcp_sock *msk = mptcp_sk(sk);
 	const struct sk_buff *skb;
 
 	skb = skb_peek(&sk->sk_receive_queue);
 	if (skb) {
-		u64 hint_val = READ_ONCE(msk->ack_seq) - MPTCP_SKB_CB(skb)->map_seq;
+		int hint_val = (u32)READ_ONCE(msk->ack_seq) -
+			       MPTCP_SKB_CB(skb)->map_seq;
 
-		if (hint_val >= INT_MAX)
-			return INT_MAX;
+		if (hint_val < 0)
+			return -hint_val;
 
-		return (unsigned int)hint_val;
+		return hint_val;
 	}
 
 	if (sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN))
@@ -2363,7 +2366,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			tcp_recv_timestamp(msg, sk, &tss);
 
 		if (cmsg_flags & MPTCP_CMSG_INQ) {
-			unsigned int inq = mptcp_inq_hint(sk);
+			int inq = mptcp_inq_hint(sk);
 
 			put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq);
 		}
@@ -4583,11 +4586,23 @@ static int mptcp_napi_poll(struct napi_struct *napi, int budget)
 	return work_done;
 }
 
+#define CHK_CB_FIELD(mptcp_field, tcp_field)	\
+	({					\
+		BUILD_BUG_ON(offsetof(struct mptcp_skb_cb, mptcp_field) !=    \
+			     offsetof(struct tcp_skb_cb, tcp_field));	      \
+		BUILD_BUG_ON(offsetofend(struct mptcp_skb_cb, mptcp_field) != \
+			     offsetofend(struct tcp_skb_cb, tcp_field));      \
+	})
+
 void __init mptcp_proto_init(void)
 {
 	struct mptcp_delegated_action *delegated;
 	int cpu;
 
+	CHK_CB_FIELD(map_seq, seq);
+	CHK_CB_FIELD(end_seq, end_seq);
+	CHK_CB_FIELD(flags, tcp_flags);
+
 	mptcp_prot.h.hashinfo = tcp_prot.h.hashinfo;
 
 	if (percpu_counter_init(&mptcp_sockets_allocated, 0, GFP_KERNEL))
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index dd437643e604..e541f42fca25 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -126,8 +126,10 @@
 #define MPTCP_SYNC_SNDBUF	7
 
 struct mptcp_skb_cb {
-	u64 map_seq;
-	u64 end_seq;
+	u32 map_seq;
+	u32 end_seq;
+	u32 unused;
+	u16 flags;
 	u8  has_rxtstamp;
 	u8  cant_coalesce;
 };
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next v1 5/9] tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage, too
  2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
                   ` (3 preceding siblings ...)
  2026-04-24 14:08 ` [PATCH mptcp-next v1 4/9] mptcp: sync mptcp skb cb layout with tcp one Paolo Abeni
@ 2026-04-24 14:08 ` Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 6/9] mptcp: implemented OoO queue pruning Paolo Abeni
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2026-04-24 14:08 UTC (permalink / raw)
  To: mptcp; +Cc: yangang, geliang, matttbe

The end goal is to avoid duplicating the quite untrivial strategy at MPTCP
level.

After the previous patch, the mentioned helpers could process skbs standing
in MPTCP-level queues without any CB-related adaptation.

The only additional adjustment needed is explicitly providing the OoO queue
reference, to cope with different sk layout.

Additionally rename the helper to clearly document its hybrid nature and
let it return the number of collapsed skbs, to allow proper accounting from
the future MPTCP caller.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
rfc -> v1:
 - fix arg typo

Note:
 - this will need a significant amount of testing at the TCP level and
   explicit approval from Eric, which I can't guess if we can hope.
---
 include/net/tcp.h    |  8 +++++++
 net/ipv4/tcp_input.c | 55 ++++++++++++++++++++++++++++----------------
 2 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6156d1d068e1..34a96f0bcf0a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1828,6 +1828,14 @@ extern void tcp_openreq_init_rwin(struct request_sock *req,
 
 void tcp_enter_memory_pressure(struct sock *sk);
 void tcp_leave_memory_pressure(struct sock *sk);
+unsigned int xtcp_collapse(struct sock *sk, struct sk_buff_head *list,
+			   struct rb_root *root, struct sk_buff *head,
+			   struct sk_buff *tail, u32 start, u32 end,
+			   u8 scaling_ratio);
+unsigned int xtcp_collapse_ofo_queue(struct sock *sk,
+				     struct rb_root *out_of_order_queue,
+				     struct sk_buff **ooo_last_skb,
+				     u8 scaling_ratio);
 
 static inline int keepalive_intvl_when(const struct tcp_sock *tp)
 {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7171442c3ed7..8417785fa48f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5725,16 +5725,22 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
 /* Collapse contiguous sequence of skbs head..tail with
  * sequence numbers start..end.
  *
+ * sk can be either a TCP or an MPTCP socket.
+ *
  * If tail is NULL, this means until the end of the queue.
  *
  * Segments with FIN/SYN are not collapsed (only because this
  * simplifies code)
+ *
+ * Returns the number of collapsed skbs.
  */
-static void
-tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
-	     struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end)
+unsigned int
+xtcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
+	      struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end,
+	      u8 scaling_ratio)
 {
 	struct sk_buff *skb = head, *n;
+	unsigned int collapsed = 0;
 	struct sk_buff_head tmp;
 	bool end_of_skbs;
 
@@ -5750,6 +5756,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 
 		/* No new bits? It is possible on ofo queue. */
 		if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
+			collapsed++;
 			skb = tcp_collapse_one(sk, skb, list, root);
 			if (!skb)
 				break;
@@ -5762,7 +5769,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 		 *   overlaps to the next one and mptcp allow collapsing.
 		 */
 		if (!(TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) &&
-		    (tcp_win_from_space(sk, skb->truesize) > skb->len ||
+		    (__tcp_win_from_space(scaling_ratio, skb->truesize) > skb->len ||
 		     before(TCP_SKB_CB(skb)->seq, start))) {
 			end_of_skbs = false;
 			break;
@@ -5782,7 +5789,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 	if (end_of_skbs ||
 	    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) ||
 	    !skb_frags_readable(skb))
-		return;
+		return collapsed;
 
 	__skb_queue_head_init(&tmp);
 
@@ -5819,6 +5826,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 				start += size;
 			}
 			if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
+				collapsed++;
 				skb = tcp_collapse_one(sk, skb, list, root);
 				if (!skb ||
 				    skb == tail ||
@@ -5832,23 +5840,26 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 end:
 	skb_queue_walk_safe(&tmp, skb, n)
 		tcp_rbtree_insert(root, skb);
+	return collapsed;
 }
 
 /* Collapse ofo queue. Algorithm: select contiguous sequence of skbs
- * and tcp_collapse() them until all the queue is collapsed.
+ * and xtcp_collapse() them until all the queue is collapsed.
  */
-static void tcp_collapse_ofo_queue(struct sock *sk)
+unsigned int xtcp_collapse_ofo_queue(struct sock *sk,
+				     struct rb_root *ooo_queue,
+				     struct sk_buff **ooo_last_skb,
+				     u8 scaling_ratio)
 {
-	struct tcp_sock *tp = tcp_sk(sk);
-	u32 range_truesize, sum_tiny = 0;
+	u32 range_truesize, sum_tiny = 0, collapsed = 0;
 	struct sk_buff *skb, *head;
 	u32 start, end;
 
-	skb = skb_rb_first(&tp->out_of_order_queue);
+	skb = skb_rb_first(ooo_queue);
 new_range:
 	if (!skb) {
-		tp->ooo_last_skb = skb_rb_last(&tp->out_of_order_queue);
-		return;
+		*ooo_last_skb = skb_rb_last(ooo_queue);
+		return collapsed;
 	}
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
@@ -5866,12 +5877,13 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 			/* Do not attempt collapsing tiny skbs */
 			if (range_truesize != head->truesize ||
 			    end - start >= SKB_WITH_OVERHEAD(PAGE_SIZE)) {
-				tcp_collapse(sk, NULL, &tp->out_of_order_queue,
-					     head, skb, start, end);
+				collapsed += xtcp_collapse(sk, NULL, ooo_queue,
+					      head, skb, start, end,
+					      scaling_ratio);
 			} else {
 				sum_tiny += range_truesize;
 				if (sum_tiny > sk->sk_rcvbuf >> 3)
-					return;
+					return collapsed;
 			}
 			goto new_range;
 		}
@@ -5882,6 +5894,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 		if (after(TCP_SKB_CB(skb)->end_seq, end))
 			end = TCP_SKB_CB(skb)->end_seq;
 	}
+	return collapsed;
 }
 
 /*
@@ -5969,12 +5982,14 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
 	if (tcp_can_ingest(sk, in_skb))
 		return 0;
 
-	tcp_collapse_ofo_queue(sk);
+	xtcp_collapse_ofo_queue(sk, &tp->out_of_order_queue,
+				&tp->ooo_last_skb, tp->scaling_ratio);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
-		tcp_collapse(sk, &sk->sk_receive_queue, NULL,
-			     skb_peek(&sk->sk_receive_queue),
-			     NULL,
-			     tp->copied_seq, tp->rcv_nxt);
+		xtcp_collapse(sk, &sk->sk_receive_queue, NULL,
+			      skb_peek(&sk->sk_receive_queue),
+			      NULL,
+			      tp->copied_seq, tp->rcv_nxt,
+			      tp->scaling_ratio);
 
 	if (tcp_can_ingest(sk, in_skb))
 		return 0;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next v1 6/9] mptcp: implemented OoO queue pruning
  2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
                   ` (4 preceding siblings ...)
  2026-04-24 14:08 ` [PATCH mptcp-next v1 5/9] tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage, too Paolo Abeni
@ 2026-04-24 14:08 ` Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 7/9] mptcp: track prune recovery status Paolo Abeni
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2026-04-24 14:08 UTC (permalink / raw)
  To: mptcp; +Cc: yangang, geliang, matttbe

Leverage the hybrid helper to implement the receive queue and OoO queue
collapsing at ingress time when reaching memory bounds.

If the msk is owned by the user-space at incoming skb time, perform the
pruning in the release_cb. The prune check is additionally performed
when the skb reaches the msk-level queues.

Pruning is not needed for fallback socket, as their MPTCP-level OoO queue
must always be empty: remove the ingress check for such scenario and
relay on the TCP-level one..

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
RFC -> v1:
 - use data_seq only when available
 - avoid ack_seq lockless access
 - drop limit on fallback
 - collapse rcvqueue, too
 - drop only when pruning is not possible and over rcvbuf * 2

Notes:
 - Similarly to path 'mptcp: move checks vs rcvbuf size earlier in the RX
   path', some cleanup/tuning in mptcp_over_limit() will be needed
 - Pruning in the release_cb() is likely not needed, should probably be
   removed (after more testing).
---
 net/mptcp/mib.c      |  3 ++
 net/mptcp/mib.h      |  3 ++
 net/mptcp/options.c  | 40 +++++++++++++++++++++++---
 net/mptcp/protocol.c | 68 ++++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.h |  2 ++
 5 files changed, 112 insertions(+), 4 deletions(-)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index f23fda0c55a7..5128feec942c 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -85,6 +85,9 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("SimultConnectFallback", MPTCP_MIB_SIMULTCONNFALLBACK),
 	SNMP_MIB_ITEM("FallbackFailed", MPTCP_MIB_FALLBACKFAILED),
 	SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
+	SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
+	SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
+	SNMP_MIB_ITEM("RcvCollapsed", MPTCP_MIB_RCVCOLLAPSED),
 };
 
 /* mptcp_mib_alloc - allocate percpu mib counters
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 812218b5ed2b..2f8f68e33ac5 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -88,6 +88,9 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_SIMULTCONNFALLBACK,	/* Simultaneous connect */
 	MPTCP_MIB_FALLBACKFAILED,	/* Can't fallback due to msk status */
 	MPTCP_MIB_WINPROBE,		/* MPTCP-level zero window probe */
+	MPTCP_MIB_OFO_PRUNED,		/* MPTCP-level OoO queue pruned */
+	MPTCP_MIB_RCVPRUNED,		/* Dropped due to memory constrains */
+	MPTCP_MIB_RCVCOLLAPSED,		/* Collapsed due to memory pressure */
 	__MPTCP_MIB_MAX
 };
 
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 14afeee8ca5f..a49cb03954e5 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1158,12 +1158,40 @@ static bool add_addr_hmac_valid(struct mptcp_sock *msk,
 	return hmac == mp_opt->ahmac;
 }
 
-static bool mptcp_over_limit(const struct sock *sk, struct sk_buff *skb)
+static bool mptcp_over_limit(struct sock *sk, struct sk_buff *skb,
+			     const struct mptcp_options_received *mp_opt)
 {
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	bool ret;
+
 	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
 		return false;
 
-	return sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf);
+	/* Allow some slack for backlog processing */
+	if (sk_rmem_alloc_get(sk) < READ_ONCE(sk->sk_rcvbuf))
+		return false;
+
+	mptcp_data_lock(sk);
+	if (!sock_owned_by_user(sk)) {
+		/* When the data seqence is not (yet) available for the,
+		 * incoming skb, allow pruning the whole OoO queue
+		 */
+		u32 seq = !mp_opt->use_map || mp_opt->mpc_map ? msk->ack_seq :
+			  mp_opt->data_seq;
+
+		__mptcp_check_prune(sk, seq);
+		ret = sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf);
+	} else {
+		u64 limit = ((u64)READ_ONCE(sk->sk_rcvbuf)) << 1;
+
+		/* Pruning will take place later in the RX path, allow
+		 * some extra slack.
+		 */
+		ret = sk_rmem_alloc_get(sk) > limit;
+		__set_bit(MPTCP_PRUNE, &msk->cb_flags);
+	}
+	mptcp_data_unlock(sk);
+	return ret;
 }
 
 /* Return false when the caller must drop the packet, i.e. in case of error,
@@ -1194,7 +1222,11 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 		__mptcp_data_acked(subflow->conn);
 		mptcp_data_unlock(subflow->conn);
 
-		if (mptcp_over_limit(subflow->conn, skb))
+		/* Will use ack_seq as limit for OoO pruning; any value would do
+		 * as OoO queue must be empty.
+		 */
+		mp_opt.use_map = 0;
+		if (mptcp_over_limit(subflow->conn, skb, &mp_opt))
 			return false;
 		return true;
 	}
@@ -1274,7 +1306,7 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 		return true;
 	}
 
-	if (mptcp_over_limit(subflow->conn, skb))
+	if (mptcp_over_limit(subflow->conn, skb, &mp_opt))
 		return false;
 
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 49e62f817fd6..0c57561ee046 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -363,6 +363,66 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset)
 	skb_dst_drop(skb);
 }
 
+/* "Inspired" from the TCP version */
+static void mptcp_prune_ofo_queue(struct sock *sk, u32 seq)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct rb_node *node, *prev;
+	bool pruned = false;
+
+	if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
+		return;
+
+	node = &msk->ooo_last_skb->rbnode;
+
+	do {
+		struct sk_buff *skb = rb_to_skb(node);
+
+		/* If incoming skb would land last in ofo queue, stop pruning. */
+		if (after(seq, MPTCP_SKB_CB(skb)->map_seq))
+			break;
+
+		pruned = true;
+		prev = rb_prev(node);
+		rb_erase(node, &msk->out_of_order_queue);
+		mptcp_drop(sk, skb);
+		msk->ooo_last_skb = rb_to_skb(prev);
+		if (atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf)
+			break;
+
+		node = prev;
+	} while (node);
+
+	if (pruned)
+		NET_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED);
+}
+
+bool __mptcp_check_prune(struct sock *sk, u32 seq)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	unsigned int dropped;
+
+	if (likely(atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf))
+		return false;
+
+	dropped = xtcp_collapse_ofo_queue(sk, &msk->out_of_order_queue,
+					  &msk->ooo_last_skb, msk->scaling_ratio);
+	if (!skb_queue_empty(&sk->sk_receive_queue))
+		dropped += xtcp_collapse(sk, &sk->sk_receive_queue, NULL,
+					 skb_peek(&sk->sk_receive_queue),
+					 NULL,
+					 msk->copied_seq, msk->ack_seq,
+					 msk->scaling_ratio);
+
+	if (dropped)
+		MPTCP_ADD_STATS(sock_net(sk), MPTCP_MIB_RCVCOLLAPSED, dropped);
+	if (likely(atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf))
+		return false;
+
+	mptcp_prune_ofo_queue(sk, seq);
+	return atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf;
+}
+
 static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 {
 	u32 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
@@ -372,6 +432,12 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 
 	mptcp_borrow_fwdmem(sk, skb);
 
+	if (__mptcp_check_prune(sk, MPTCP_SKB_CB(skb)->map_seq)) {
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
+		mptcp_drop(sk, skb);
+		return false;
+	}
+
 	if (MPTCP_SKB_CB(skb)->map_seq == ack_seq) {
 		/* in sequence */
 		msk->bytes_received += copy_len;
@@ -3679,6 +3745,8 @@ static void mptcp_release_cb(struct sock *sk)
 			__mptcp_error_report(sk);
 		if (__test_and_clear_bit(MPTCP_SYNC_SNDBUF, &msk->cb_flags))
 			__mptcp_sync_sndbuf(sk);
+		if (__test_and_clear_bit(MPTCP_PRUNE, &msk->cb_flags))
+			__mptcp_check_prune(sk, msk->ack_seq - 1);
 	}
 }
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index e541f42fca25..a6b7eedf36cf 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -124,6 +124,7 @@
 #define MPTCP_FLUSH_JOIN_LIST	5
 #define MPTCP_SYNC_STATE	6
 #define MPTCP_SYNC_SNDBUF	7
+#define MPTCP_PRUNE		8
 
 struct mptcp_skb_cb {
 	u32 map_seq;
@@ -829,6 +830,7 @@ bool __mptcp_close(struct sock *sk, long timeout);
 void mptcp_cancel_work(struct sock *sk);
 void __mptcp_unaccepted_force_close(struct sock *sk);
 void mptcp_set_state(struct sock *sk, int state);
+bool __mptcp_check_prune(struct sock *sk, u32 seq);
 
 bool mptcp_addresses_equal(const struct mptcp_addr_info *a,
 			   const struct mptcp_addr_info *b, bool use_port);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next v1 7/9] mptcp: track prune recovery status
  2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
                   ` (5 preceding siblings ...)
  2026-04-24 14:08 ` [PATCH mptcp-next v1 6/9] mptcp: implemented OoO queue pruning Paolo Abeni
@ 2026-04-24 14:08 ` Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 8/9] mptcp: move the retrans loop to a separate helper Paolo Abeni
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2026-04-24 14:08 UTC (permalink / raw)
  To: mptcp; +Cc: yangang, geliang, matttbe

After dropping any data already acked at the TCP level, the MPTCP must
avoid inducing TCP-level retransmission until the pruned data has been
successfully acked at MPTCP level. Otherwise the subflows could keep
retransmitting skbs carring OoO MPTCP data, preventing reinjections and
stalling completely the data transfer.

Explicitly keep track of the highest pruned MPTCP-level seq number and
stop dropping at TCP level until such sequence has been acked.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/options.c  |  7 ++++++-
 net/mptcp/protocol.c | 14 +++++++++++++-
 net/mptcp/protocol.h |  1 +
 net/mptcp/subflow.c  |  1 +
 4 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index a49cb03954e5..941e4ec705fe 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1191,7 +1191,12 @@ static bool mptcp_over_limit(struct sock *sk, struct sk_buff *skb,
 		__set_bit(MPTCP_PRUNE, &msk->cb_flags);
 	}
 	mptcp_data_unlock(sk);
-	return ret;
+
+	/* After pruning any packets ensure that MPTCP-driven drops do not
+	 * cause TCP-level retransmission
+	 */
+	return ret &&
+	       !before(READ_ONCE(msk->ack_seq), READ_ONCE(msk->pruned_seq));
 }
 
 /* Return false when the caller must drop the packet, i.e. in case of error,
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 0c57561ee046..44840020e53a 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -369,12 +369,14 @@ static void mptcp_prune_ofo_queue(struct sock *sk, u32 seq)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct rb_node *node, *prev;
 	bool pruned = false;
+	u32 pruned_seq;
 
 	if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
 		return;
 
 	node = &msk->ooo_last_skb->rbnode;
 
+	pruned_seq = msk->pruned_seq;
 	do {
 		struct sk_buff *skb = rb_to_skb(node);
 
@@ -385,16 +387,21 @@ static void mptcp_prune_ofo_queue(struct sock *sk, u32 seq)
 		pruned = true;
 		prev = rb_prev(node);
 		rb_erase(node, &msk->out_of_order_queue);
+		if (after(MPTCP_SKB_CB(skb)->end_seq, pruned_seq))
+			pruned_seq = MPTCP_SKB_CB(skb)->end_seq;
 		mptcp_drop(sk, skb);
 		msk->ooo_last_skb = rb_to_skb(prev);
+
 		if (atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf)
 			break;
 
 		node = prev;
 	} while (node);
 
-	if (pruned)
+	if (pruned) {
+		WRITE_ONCE(msk->pruned_seq, pruned_seq);
 		NET_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED);
+	}
 }
 
 bool __mptcp_check_prune(struct sock *sk, u32 seq)
@@ -433,6 +440,8 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
 	mptcp_borrow_fwdmem(sk, skb);
 
 	if (__mptcp_check_prune(sk, MPTCP_SKB_CB(skb)->map_seq)) {
+		if (after(MPTCP_SKB_CB(skb)->end_seq, msk->pruned_seq))
+			WRITE_ONCE(msk->pruned_seq, MPTCP_SKB_CB(skb)->end_seq);
 		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
 		mptcp_drop(sk, skb);
 		return false;
@@ -866,6 +875,8 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
 		WRITE_ONCE(msk->ack_seq, msk->ack_seq + seq_delta);
 		moved = true;
 	}
+	if (after(msk->ack_seq, msk->pruned_seq))
+		WRITE_ONCE(msk->pruned_seq, (u32)msk->ack_seq);
 	return moved;
 }
 
@@ -3520,6 +3531,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 	/* for fallback's sake */
 	WRITE_ONCE(msk->ack_seq, 0);
 	WRITE_ONCE(msk->copied_seq, 0);
+	WRITE_ONCE(msk->pruned_seq, 0);
 
 	WRITE_ONCE(sk->sk_shutdown, 0);
 	sk_error_report(sk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index a6b7eedf36cf..b7b32301e7c4 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -306,6 +306,7 @@ struct mptcp_sock {
 	u64		bytes_acked;
 	u64		snd_una;
 	u64		wnd_end;
+	u32		pruned_seq;		/* if after ack_seq, highest seq pruned */
 	u32		last_data_sent;
 	u32		last_data_recv;
 	u32		last_ack_recv;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 2a8d5da4aaea..70a5c2a08278 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -495,6 +495,7 @@ static void subflow_set_remote_key(struct mptcp_sock *msk,
 	WRITE_ONCE(msk->remote_key, subflow->remote_key);
 	WRITE_ONCE(msk->ack_seq, subflow->iasn);
 	WRITE_ONCE(msk->copied_seq, subflow->iasn);
+	WRITE_ONCE(msk->pruned_seq, subflow->iasn);
 	WRITE_ONCE(msk->can_ack, true);
 	atomic64_set(&msk->rcv_wnd_sent, subflow->iasn);
 }
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next v1 8/9] mptcp: move the retrans loop to a separate helper
  2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
                   ` (6 preceding siblings ...)
  2026-04-24 14:08 ` [PATCH mptcp-next v1 7/9] mptcp: track prune recovery status Paolo Abeni
@ 2026-04-24 14:08 ` Paolo Abeni
  2026-04-24 14:08 ` [PATCH mptcp-next v1 9/9] mptcp: let the retrans scheduler do its job Paolo Abeni
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2026-04-24 14:08 UTC (permalink / raw)
  To: mptcp; +Cc: yangang, geliang, matttbe

This is a cleanup in order to make the next patch simpler.
No functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 74 +++++++++++++++++++++++++-------------------
 1 file changed, 43 insertions(+), 31 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 44840020e53a..093c50a43bcb 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2820,41 +2820,14 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
 	sk_error_report(sk);
 }
 
-static void __mptcp_retrans(struct sock *sk)
+/* Retransmit the specified data fragment on all the selected subflows. */
+static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
 {
 	struct mptcp_sendmsg_info info = { .data_lock_held = true, };
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct mptcp_subflow_context *subflow;
-	struct mptcp_data_frag *dfrag;
 	struct sock *ssk;
-	int ret, err;
-	u16 len = 0;
-
-	mptcp_clean_una_wakeup(sk);
-
-	/* first check ssk: need to kick "stale" logic */
-	err = mptcp_sched_get_retrans(msk);
-	dfrag = mptcp_rtx_head(sk);
-	if (!dfrag) {
-		if (mptcp_data_fin_enabled(msk)) {
-			struct inet_connection_sock *icsk = inet_csk(sk);
-
-			WRITE_ONCE(icsk->icsk_retransmits,
-				   icsk->icsk_retransmits + 1);
-			mptcp_set_datafin_timeout(sk);
-			mptcp_send_ack(msk);
-
-			goto reset_timer;
-		}
-
-		if (!mptcp_send_head(sk))
-			goto clear_scheduled;
-
-		goto reset_timer;
-	}
-
-	if (err)
-		goto reset_timer;
+	int ret, len = 0;
 
 	mptcp_for_each_subflow(msk, subflow) {
 		if (READ_ONCE(subflow->scheduled)) {
@@ -2882,7 +2855,7 @@ static void __mptcp_retrans(struct sock *sk)
 			    !msk->allow_subflows) {
 				spin_unlock_bh(&msk->fallback_lock);
 				release_sock(ssk);
-				goto clear_scheduled;
+				return -1;
 			}
 
 			while (info.sent < info.limit) {
@@ -2905,6 +2878,45 @@ static void __mptcp_retrans(struct sock *sk)
 			release_sock(ssk);
 		}
 	}
+	return len;
+}
+
+static void __mptcp_retrans(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct mptcp_subflow_context *subflow;
+	struct mptcp_data_frag *dfrag;
+	int err, len;
+
+	mptcp_clean_una_wakeup(sk);
+
+	/* first check ssk: need to kick "stale" logic */
+	err = mptcp_sched_get_retrans(msk);
+	dfrag = mptcp_rtx_head(sk);
+	if (!dfrag) {
+		if (mptcp_data_fin_enabled(msk)) {
+			struct inet_connection_sock *icsk = inet_csk(sk);
+
+			WRITE_ONCE(icsk->icsk_retransmits,
+				   icsk->icsk_retransmits + 1);
+			mptcp_set_datafin_timeout(sk);
+			mptcp_send_ack(msk);
+
+			goto reset_timer;
+		}
+
+		if (!mptcp_send_head(sk))
+			goto clear_scheduled;
+
+		goto reset_timer;
+	}
+
+	if (err)
+		goto reset_timer;
+
+	len = __mptcp_push_retrans(sk, dfrag);
+	if (len < 0)
+		goto clear_scheduled;
 
 	msk->bytes_retrans += len;
 	dfrag->already_sent = max(dfrag->already_sent, len);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next v1 9/9] mptcp: let the retrans scheduler do its job.
  2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
                   ` (7 preceding siblings ...)
  2026-04-24 14:08 ` [PATCH mptcp-next v1 8/9] mptcp: move the retrans loop to a separate helper Paolo Abeni
@ 2026-04-24 14:08 ` Paolo Abeni
  2026-04-24 16:29 ` [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure MPTCP CI
  2026-04-27  7:27 ` Geliang Tang
  10 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2026-04-24 14:08 UTC (permalink / raw)
  To: mptcp; +Cc: yangang, geliang, matttbe

Currently the MPTCP core enforces that when MPTCP-level retrans timer
fires, at most a single dfrag is retransmitted. If some corner-cases it
may be necessary retransmit multiple dfrags, and the MPTCP socket will
need to wait multiple retrans timeout to accomplish that.

Remove the mentioned constraint, allowing to transmit multiple dfrags per
retrans period, as long as the scheduler keeps selecting subflows for
retransmissions and pending data is available in the rtx queue.
The default scheduler will transmit a dfrag per available subflow.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 84 +++++++++++++++++++++++++-------------------
 1 file changed, 47 insertions(+), 37 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 093c50a43bcb..da84bf5410da 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1187,13 +1187,6 @@ static void __mptcp_clean_una_wakeup(struct sock *sk)
 	mptcp_write_space(sk);
 }
 
-static void mptcp_clean_una_wakeup(struct sock *sk)
-{
-	mptcp_data_lock(sk);
-	__mptcp_clean_una_wakeup(sk);
-	mptcp_data_unlock(sk);
-}
-
 static void mptcp_enter_memory_pressure(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow;
@@ -2820,8 +2813,12 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
 	sk_error_report(sk);
 }
 
-/* Retransmit the specified data fragment on all the selected subflows. */
-static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
+/*
+ * Retransmit the specified data fragment on all the selected subflows,
+ * starting from the specified sequence
+ */
+static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag,
+				u64 retrans_seq)
 {
 	struct mptcp_sendmsg_info info = { .data_lock_held = true, };
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -2840,7 +2837,7 @@ static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
 			lock_sock(ssk);
 
 			/* limit retransmission to the bytes already sent on some subflows */
-			info.sent = 0;
+			info.sent = retrans_seq - dfrag->data_seq;
 			info.limit = READ_ONCE(msk->csum_enabled) ? dfrag->data_len :
 								    dfrag->already_sent;
 
@@ -2886,42 +2883,55 @@ static void __mptcp_retrans(struct sock *sk)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct mptcp_subflow_context *subflow;
 	struct mptcp_data_frag *dfrag;
+	u64 retrans_seq;
 	int err, len;
 
-	mptcp_clean_una_wakeup(sk);
-
-	/* first check ssk: need to kick "stale" logic */
-	err = mptcp_sched_get_retrans(msk);
-	dfrag = mptcp_rtx_head(sk);
-	if (!dfrag) {
-		if (mptcp_data_fin_enabled(msk)) {
-			struct inet_connection_sock *icsk = inet_csk(sk);
+	mptcp_data_lock(sk);
+	__mptcp_clean_una_wakeup(sk);
+	retrans_seq = msk->snd_una;
+	mptcp_data_unlock(sk);
 
-			WRITE_ONCE(icsk->icsk_retransmits,
-				   icsk->icsk_retransmits + 1);
-			mptcp_set_datafin_timeout(sk);
-			mptcp_send_ack(msk);
+	for (;;) {
+		/* first check ssk: need to kick "stale" logic */
+		err = mptcp_sched_get_retrans(msk);
+		dfrag = mptcp_rtx_head(sk);
+		if (!dfrag) {
+			if (mptcp_data_fin_enabled(msk)) {
+				struct inet_connection_sock *icsk = inet_csk(sk);
+
+				WRITE_ONCE(icsk->icsk_retransmits,
+					   icsk->icsk_retransmits + 1);
+				mptcp_set_datafin_timeout(sk);
+				mptcp_send_ack(msk);
+				break;
+			}
 
-			goto reset_timer;
+			if (!mptcp_send_head(sk))
+				goto clear_scheduled;
+			break;
 		}
 
-		if (!mptcp_send_head(sk))
-			goto clear_scheduled;
-
-		goto reset_timer;
-	}
-
-	if (err)
-		goto reset_timer;
+		if (err)
+			break;
 
-	len = __mptcp_push_retrans(sk, dfrag);
-	if (len < 0)
-		goto clear_scheduled;
+		/* Skip the data already retransmitted in this run */
+		while (dfrag && !before64(retrans_seq, dfrag->data_seq +
+					  dfrag->already_sent))
+			dfrag = list_is_last(&dfrag->list, &msk->rtx_queue) ? NULL :
+				list_next_entry(dfrag, list);
+		if (!dfrag || !dfrag->already_sent)
+			break;
 
-	msk->bytes_retrans += len;
-	dfrag->already_sent = max(dfrag->already_sent, len);
+		len = __mptcp_push_retrans(sk, dfrag, retrans_seq);
+		if (len < 0)
+			goto clear_scheduled;
+		if (!len)
+			break;
 
-reset_timer:
+		retrans_seq += len;
+		msk->bytes_retrans += len;
+		dfrag->already_sent = max(dfrag->already_sent, len);
+	}
 	mptcp_check_and_set_pending(sk);
 
 	if (!mptcp_rtx_timer_pending(sk))
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure
  2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
                   ` (8 preceding siblings ...)
  2026-04-24 14:08 ` [PATCH mptcp-next v1 9/9] mptcp: let the retrans scheduler do its job Paolo Abeni
@ 2026-04-24 16:29 ` MPTCP CI
  2026-04-27  7:27 ` Geliang Tang
  10 siblings, 0 replies; 13+ messages in thread
From: MPTCP CI @ 2026-04-24 16:29 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal (except selftest_mptcp_join): Unstable: 7 failed test(s): packetdrill_fastopen packetdrill_regressions selftest_mptcp_connect selftest_mptcp_connect_checksum selftest_mptcp_connect_mmap selftest_mptcp_connect_sendfile selftest_mptcp_connect_splice ⚠️ 
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Critical: 2 Call Trace(s) - Critical: Global Timeout ❌
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Unstable: 2 failed test(s): bpftest_test_progs-no_alu32_mptcp bpftest_test_progs_mptcp ⚠️ 
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/24894934320

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/cceb6849cdf8
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1085225


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure
  2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
                   ` (9 preceding siblings ...)
  2026-04-24 16:29 ` [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure MPTCP CI
@ 2026-04-27  7:27 ` Geliang Tang
  2026-04-28  8:16   ` Geliang Tang
  10 siblings, 1 reply; 13+ messages in thread
From: Geliang Tang @ 2026-04-27  7:27 UTC (permalink / raw)
  To: Paolo Abeni, mptcp; +Cc: yangang, matttbe

Hi Paolo,

Thanks for these fixes.

On Fri, 2026-04-24 at 16:08 +0200, Paolo Abeni wrote:
> This an attempt to fix the data transfer stall reported by Geliang
> and
> Gang more carefully enforcing memory constraints at the MPTCP level.
> 
> Patch 1/9 moves the bound check before entering the TCP socket.
> Patch 2, 3 and 4 are cleanups/refactors finalized to safely re-using
> TCP
> helpers on MPTCP skbs.
> Patch 5 makes TCP pruning related helpers available to MPTCP and
> patch 6
> makes use of them. Patch 7 addresses an edge scenario that could
> still
> lead to transfer stall under memory pressure.
> Finally patch 8 and 9 improve the MPTCP-level retransmission schema
> to
> make recovery from memory pressure significanly faster.
> 
> Note that the diffstat is biases by the quite large patch 4/9, which
> contains mechanical transformation of existing code; "real" changes
> are
> noticiable smaller.
> 
> Tested successfully vs the test cases proposed by Geliang and Gang.

We found this issue while testing the MPTCP TLS selftests
(tools/testing/selftests/net/tls.c). The multi_chunk.c test was
actually extracted from chunked_sendfile() function in tls.c. The tls.c
file contains many test groups, and chunked_sendfile() is just one of
them. Therefore, passing the chunked_sendfile tests does not guarantee
that all tests in tls.c will pass in the future. So we will provide you
with a test similar to multi_chunk.c shortly, but includes a more
complete set of the tests from tls.c.

> ---
> RFC -> v1:
>  - dropped old patch 4 & 5
>  - addressed AI reported comments
>  - added retrans refactor.
> 
> Paolo Abeni (9):
>   mptcp: move checks vs rcvbuf size earlier in the RX path
>   mptcp: drop the mptcp_ooo_try_coalesce() helper
>   mptcp: remove CB offset field

When implementing MPTCP KTLS, I used this offset field. Please see the
implementation in [1], including mptcp_read_done() and
mptcp_get_skb_seq(). After removing the offset field, I think the new
version should be implemented as follows:

static void mptcp_read_done(struct sock *sk, size_t len) 
{
        struct mptcp_sock *msk = mptcp_sk(sk);
        struct sk_buff *skb;
        size_t left;
        u32 offset;

        msk_owned_by_me(msk);

        if (sk->sk_state == TCP_LISTEN)
                return;

        left = len; 
        while (left && (skb = mptcp_recv_skb(sk, &offset)) != NULL) {
                int used;

                used = min_t(size_t, skb->len - offset, left);
                msk->bytes_consumed += used;
                msk->copied_seq += used;
                left -= used;

                if (skb->len > offset + used)
                        break;

                mptcp_eat_recv_skb(sk, skb);
        }

        mptcp_rcv_space_adjust(msk, len - left);

        /* Clean up data we have read: This will do ACK frames. */
        if (left != len) 
                mptcp_cleanup_rbuf(msk, len - left);
}

static u32 mptcp_get_skb_seq(struct sk_buff *skb)
{
        return MPTCP_SKB_CB(skb)->map_seq;
}

But unfortunately, after this modification, with all your patches in
this set, the newly added MPTCP test cases in tls.c did not all pass.
Are there any obvious issues with my modification in these two
function?

Thanks again, and I will continue to follow up and test this series.

-Geliang

[1]
https://patchwork.kernel.org/project/mptcp/patch/b86c642262c9718f4936ad52dab804b8f494aa6d.1777026753.git.tanggeliang@kylinos.cn/

>   mptcp: sync mptcp skb cb layout with tcp one
>   tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage, too
>   mptcp: implemented OoO queue pruning
>   mptcp: track prune recovery status
>   mptcp: move the retrans loop to a separate helper
>   mptcp: let the retrans scheduler do its job.
> 
>  include/net/tcp.h    |   8 ++
>  net/ipv4/tcp_input.c |  55 +++++---
>  net/mptcp/fastopen.c |   1 -
>  net/mptcp/mib.c      |   3 +
>  net/mptcp/mib.h      |   3 +
>  net/mptcp/options.c  |  55 +++++++-
>  net/mptcp/protocol.c | 328 ++++++++++++++++++++++++++++-------------
> --
>  net/mptcp/protocol.h |  11 +-
>  net/mptcp/subflow.c  |   2 +
>  9 files changed, 323 insertions(+), 143 deletions(-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure
  2026-04-27  7:27 ` Geliang Tang
@ 2026-04-28  8:16   ` Geliang Tang
  0 siblings, 0 replies; 13+ messages in thread
From: Geliang Tang @ 2026-04-28  8:16 UTC (permalink / raw)
  To: Paolo Abeni, mptcp; +Cc: yangang, matttbe

[-- Attachment #1: Type: text/plain, Size: 6739 bytes --]

Hi Paolo,

On Mon, 2026-04-27 at 15:27 +0800, Geliang Tang wrote:
> Hi Paolo,
> 
> Thanks for these fixes.
> 
> On Fri, 2026-04-24 at 16:08 +0200, Paolo Abeni wrote:
> > This an attempt to fix the data transfer stall reported by Geliang
> > and
> > Gang more carefully enforcing memory constraints at the MPTCP
> > level.
> > 
> > Patch 1/9 moves the bound check before entering the TCP socket.
> > Patch 2, 3 and 4 are cleanups/refactors finalized to safely re-
> > using
> > TCP
> > helpers on MPTCP skbs.
> > Patch 5 makes TCP pruning related helpers available to MPTCP and
> > patch 6
> > makes use of them. Patch 7 addresses an edge scenario that could
> > still
> > lead to transfer stall under memory pressure.
> > Finally patch 8 and 9 improve the MPTCP-level retransmission schema
> > to
> > make recovery from memory pressure significanly faster.
> > 
> > Note that the diffstat is biases by the quite large patch 4/9,
> > which
> > contains mechanical transformation of existing code; "real" changes
> > are
> > noticiable smaller.
> > 
> > Tested successfully vs the test cases proposed by Geliang and Gang.
> 
> We found this issue while testing the MPTCP TLS selftests
> (tools/testing/selftests/net/tls.c). The multi_chunk.c test was
> actually extracted from chunked_sendfile() function in tls.c. The
> tls.c
> file contains many test groups, and chunked_sendfile() is just one of
> them. Therefore, passing the chunked_sendfile tests does not
> guarantee
> that all tests in tls.c will pass in the future. So we will provide
> you
> with a test similar to multi_chunk.c shortly, but includes a more
> complete set of the tests from tls.c.

I have attached a new test that basically covers all the tests in
tls.c. This test uses MPTCP sockets without TLS encryption.

When I ran it on v5 of Gang's patches in [1], all test items passed;
however, when I ran it on v2 of this series, several test items failed,
accompanied by an OOM error:

# ok 7 mptcp.mutliproc_sendpage_even
# #  RUN           mptcp.mutliproc_writers ...
# # mutliproc_writers: Test terminated by timeout
# #          FAIL  mptcp.mutliproc_writers
# not ok 8 mptcp.mutliproc_writers
# #  RUN           mptcp.mutliproc_readers ...
# # mutliproc_readers: Test terminated by timeout
# #          FAIL  mptcp.mutliproc_readers
# not ok 9 mptcp.mutliproc_readers
# #  RUN           mptcp.mutliproc_even ...
root@mptcpdev:/home/tgl/mptcp_net-next# [   83.031990][  T458]
kworker/13:2: page allocation failure: order:0,
mode:0x40820(GFP_ATOMIC|__GFP_COMP),
nodemask=(null),cpuset=/,mems_allowed=0
[   83.032214][  T465] SLUB: Unable to allocate memory on CPU 20 (of
node 0) on node -1, gfp=0x920(GFP_ATOMIC|__GFP_ZERO)
[   83.032956][  T458] CPU: 13 UID: 0 PID: 458 Comm: kworker/13:2 Not
tainted 7.1.0-rc1+ #85 PREEMPT(full) 
[   83.032959][  T458] Hardware name: Bochs Bochs, BIOS Bochs
01/01/2011
[   83.032961][  T458] Workqueue: events mptcp_worker
[   83.032969][  T458] Call Trace:
[   83.032971][  T458]  <TASK>
[   83.032973][  T458]  dump_stack_lvl+0x6f/0xb0
[   83.032982][  T458]  warn_alloc.cold+0x9b/0x1c4
[   83.032987][  T458]  ? __pfx_warn_alloc+0x10/0x10
[   83.033000][  T458]  __alloc_pages_slowpath.constprop.0+0xa3e/0x1770
[   83.033005][  T458]  ?
__pfx___alloc_pages_slowpath.constprop.0+0x10/0x10
[   83.033011][  T458]  __alloc_frozen_pages_noprof+0x2f3/0x380

I hope this test is useful to you. If you need me to do any testing, I
am very willing to help, just let me know.

Thanks,
-Geliang

[1]
https://patchwork.kernel.org/project/mptcp/cover/cover.1775033340.git.yangang@kylinos.cn/

> 
> > ---
> > RFC -> v1:
> >  - dropped old patch 4 & 5
> >  - addressed AI reported comments
> >  - added retrans refactor.
> > 
> > Paolo Abeni (9):
> >   mptcp: move checks vs rcvbuf size earlier in the RX path
> >   mptcp: drop the mptcp_ooo_try_coalesce() helper
> >   mptcp: remove CB offset field
> 
> When implementing MPTCP KTLS, I used this offset field. Please see
> the
> implementation in [1], including mptcp_read_done() and
> mptcp_get_skb_seq(). After removing the offset field, I think the new
> version should be implemented as follows:
> 
> static void mptcp_read_done(struct sock *sk, size_t len) 
> {
>         struct mptcp_sock *msk = mptcp_sk(sk);
>         struct sk_buff *skb;
>         size_t left;
>         u32 offset;
> 
>         msk_owned_by_me(msk);
> 
>         if (sk->sk_state == TCP_LISTEN)
>                 return;
> 
>         left = len; 
>         while (left && (skb = mptcp_recv_skb(sk, &offset)) != NULL) {
>                 int used;
> 
>                 used = min_t(size_t, skb->len - offset, left);
>                 msk->bytes_consumed += used;
>                 msk->copied_seq += used;
>                 left -= used;
> 
>                 if (skb->len > offset + used)
>                         break;
> 
>                 mptcp_eat_recv_skb(sk, skb);
>         }
> 
>         mptcp_rcv_space_adjust(msk, len - left);
> 
>         /* Clean up data we have read: This will do ACK frames. */
>         if (left != len) 
>                 mptcp_cleanup_rbuf(msk, len - left);
> }
> 
> static u32 mptcp_get_skb_seq(struct sk_buff *skb)
> {
>         return MPTCP_SKB_CB(skb)->map_seq;
> }
> 
> But unfortunately, after this modification, with all your patches in
> this set, the newly added MPTCP test cases in tls.c did not all pass.
> Are there any obvious issues with my modification in these two
> function?
> 
> Thanks again, and I will continue to follow up and test this series.
> 
> -Geliang
> 
> [1]
> https://patchwork.kernel.org/project/mptcp/patch/b86c642262c9718f4936ad52dab804b8f494aa6d.1777026753.git.tanggeliang@kylinos.cn/
> 
> >   mptcp: sync mptcp skb cb layout with tcp one
> >   tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage,
> > too
> >   mptcp: implemented OoO queue pruning
> >   mptcp: track prune recovery status
> >   mptcp: move the retrans loop to a separate helper
> >   mptcp: let the retrans scheduler do its job.
> > 
> >  include/net/tcp.h    |   8 ++
> >  net/ipv4/tcp_input.c |  55 +++++---
> >  net/mptcp/fastopen.c |   1 -
> >  net/mptcp/mib.c      |   3 +
> >  net/mptcp/mib.h      |   3 +
> >  net/mptcp/options.c  |  55 +++++++-
> >  net/mptcp/protocol.c | 328 ++++++++++++++++++++++++++++-----------
> > --
> > --
> >  net/mptcp/protocol.h |  11 +-
> >  net/mptcp/subflow.c  |   2 +
> >  9 files changed, 323 insertions(+), 143 deletions(-)

[-- Attachment #2: 0001-selftest-mptcp-add-some-trasnfer-mode-like-tls.patch --]
[-- Type: text/x-patch, Size: 33815 bytes --]

From ff5ed6aaed973bf97f568e008b4a78a3dc0369b3 Mon Sep 17 00:00:00 2001
Message-ID: <ff5ed6aaed973bf97f568e008b4a78a3dc0369b3.1777363110.git.tanggeliang@kylinos.cn>
From: Gang Yan <yangang@kylinos.cn>
Date: Mon, 27 Apr 2026 16:03:31 +0800
Subject: [PATCH] selftest: mptcp: add some trasnfer mode like tls

This patch adds a test using transfer mode like tls without ktls
configuration.

But the msg_more and msg_more_unsent will fail, I think it's not a big
deal with, so just cancel it temprorily.
---
 tools/testing/selftests/net/mptcp/Makefile    |    1 +
 .../testing/selftests/net/mptcp/mptcp_data.c  | 1240 +++++++++++++++++
 .../testing/selftests/net/mptcp/mptcp_data.sh |   45 +
 3 files changed, 1286 insertions(+)
 create mode 100644 tools/testing/selftests/net/mptcp/mptcp_data.c
 create mode 100755 tools/testing/selftests/net/mptcp/mptcp_data.sh

diff --git a/tools/testing/selftests/net/mptcp/Makefile b/tools/testing/selftests/net/mptcp/Makefile
index 22ba0da2adb8..413d0a70bb89 100644
--- a/tools/testing/selftests/net/mptcp/Makefile
+++ b/tools/testing/selftests/net/mptcp/Makefile
@@ -25,6 +25,7 @@ TEST_GEN_FILES := \
 	mptcp_inq \
 	mptcp_sockopt \
 	pm_nl_ctl \
+	mptcp_data \
 # end of TEST_GEN_FILES
 
 TEST_FILES := \
diff --git a/tools/testing/selftests/net/mptcp/mptcp_data.c b/tools/testing/selftests/net/mptcp/mptcp_data.c
new file mode 100644
index 000000000000..c6d6984f06ae
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/mptcp_data.c
@@ -0,0 +1,1240 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <errno.h>
+#include <error.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <linux/tcp.h>
+
+#include <sys/epoll.h>
+#include <sys/types.h>
+#include <sys/sendfile.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+
+#include "kselftest_harness.h"
+
+#ifndef IPPROTO_MPTCP
+#define IPPROTO_MPTCP 262
+#endif
+
+#ifndef TCP_CLOSE
+#define TCP_CLOSE 7
+#endif
+
+
+#define PAYLOAD_MAX_LEN 16384
+
+static void memrnd(void *s, size_t n)
+{
+	int *dword = s;
+	char *byte;
+
+	for (; n >= 4; n -= 4)
+		*dword++ = rand();
+	byte = (void *)dword;
+	while (n--)
+		*byte++ = rand();
+}
+
+static void mptcp_sock_pair(struct __test_metadata *_metadata,
+			    int *fd, int *cfd)
+{
+	struct sockaddr_in addr;
+	socklen_t len;
+	int sfd, ret;
+
+	len = sizeof(addr);
+
+	addr.sin_family = AF_INET;
+	addr.sin_addr.s_addr = htonl(INADDR_ANY);
+	addr.sin_port = 0;
+
+	*fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
+	ASSERT_GE(*fd, 0);
+	sfd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
+	ASSERT_GE(sfd, 0);
+
+	ret = bind(sfd, &addr, sizeof(addr));
+	ASSERT_EQ(ret, 0);
+	ret = listen(sfd, 10);
+	ASSERT_EQ(ret, 0);
+
+	ret = getsockname(sfd, &addr, &len);
+	ASSERT_EQ(ret, 0);
+
+	ret = connect(*fd, &addr, sizeof(addr));
+	ASSERT_EQ(ret, 0);
+
+	*cfd = accept(sfd, &addr, &len);
+	ASSERT_GE(*cfd, 0);
+
+	close(sfd);
+}
+
+FIXTURE(mptcp)
+{
+	int fd, cfd;
+};
+
+FIXTURE_SETUP(mptcp)
+{
+	mptcp_sock_pair(_metadata, &self->fd, &self->cfd);
+}
+
+FIXTURE_TEARDOWN(mptcp)
+{
+	close(self->fd);
+	close(self->cfd);
+}
+
+TEST_F(mptcp, sendfile)
+{
+	int filefd = open("/proc/self/exe", O_RDONLY);
+	struct stat st;
+
+	EXPECT_GE(filefd, 0);
+	fstat(filefd, &st);
+	EXPECT_GE(sendfile(self->fd, filefd, 0, st.st_size), 0);
+
+	close(filefd);
+}
+
+TEST_F(mptcp, send_then_sendfile)
+{
+	int filefd = open("/proc/self/exe", O_RDONLY);
+	char const *test_str = "test_send";
+	int to_send = strlen(test_str) + 1;
+	char recv_buf[10];
+	struct stat st;
+	char *buf;
+
+	EXPECT_GE(filefd, 0);
+	fstat(filefd, &st);
+	buf = (char *)malloc(st.st_size);
+
+	EXPECT_EQ(send(self->fd, test_str, to_send, 0), to_send);
+	EXPECT_EQ(recv(self->cfd, recv_buf, to_send, MSG_WAITALL), to_send);
+	EXPECT_EQ(memcmp(test_str, recv_buf, to_send), 0);
+
+	EXPECT_GE(sendfile(self->fd, filefd, 0, st.st_size), 0);
+	EXPECT_EQ(recv(self->cfd, buf, st.st_size, MSG_WAITALL), st.st_size);
+
+	free(buf);
+	close(filefd);
+}
+
+static void chunked_sendfile(struct __test_metadata *_metadata,
+			     struct _test_data_mptcp *self,
+			     uint16_t chunk_size,
+			     uint16_t extra_payload_size)
+{
+	char buf[PAYLOAD_MAX_LEN];
+	uint16_t test_payload_size;
+	int size = 0;
+	int ret;
+	char filename[] = "/tmp/mytemp.XXXXXX";
+	int fd = mkstemp(filename);
+	off_t offset = 0;
+
+	unlink(filename);
+	ASSERT_GE(fd, 0);
+	EXPECT_GE(chunk_size, 1);
+	test_payload_size = chunk_size + extra_payload_size;
+	ASSERT_GE(PAYLOAD_MAX_LEN, test_payload_size);
+	memset(buf, 1, test_payload_size);
+	size = write(fd, buf, test_payload_size);
+	EXPECT_EQ(size, test_payload_size);
+	fsync(fd);
+
+	while (size > 0) {
+		ret = sendfile(self->fd, fd, &offset, chunk_size);
+		EXPECT_GE(ret, 0);
+		size -= ret;
+	}
+
+	EXPECT_EQ(recv(self->cfd, buf, test_payload_size, MSG_WAITALL),
+		  test_payload_size);
+
+	close(fd);
+}
+
+TEST_F(mptcp, multi_chunk_sendfile)
+{
+	chunked_sendfile(_metadata, self, 4096, 4096);
+	chunked_sendfile(_metadata, self, 4096, 0);
+	chunked_sendfile(_metadata, self, 4096, 1);
+	chunked_sendfile(_metadata, self, 4096, 2048);
+	chunked_sendfile(_metadata, self, 8192, 2048);
+	chunked_sendfile(_metadata, self, 4096, 8192);
+	chunked_sendfile(_metadata, self, 8192, 4096);
+	chunked_sendfile(_metadata, self, 12288, 1024);
+	chunked_sendfile(_metadata, self, 12288, 2000);
+	chunked_sendfile(_metadata, self, 15360, 100);
+	chunked_sendfile(_metadata, self, 15360, 300);
+	chunked_sendfile(_metadata, self, 1, 4096);
+	chunked_sendfile(_metadata, self, 2048, 4096);
+	chunked_sendfile(_metadata, self, 2048, 8192);
+	chunked_sendfile(_metadata, self, 4096, 8192);
+	chunked_sendfile(_metadata, self, 1024, 12288);
+	chunked_sendfile(_metadata, self, 2000, 12288);
+	chunked_sendfile(_metadata, self, 100, 15360);
+	chunked_sendfile(_metadata, self, 300, 15360);
+}
+
+TEST_F(mptcp, recv_max)
+{
+	unsigned int send_len = PAYLOAD_MAX_LEN;
+	char recv_mem[PAYLOAD_MAX_LEN];
+	char buf[PAYLOAD_MAX_LEN];
+
+	memrnd(buf, sizeof(buf));
+
+	EXPECT_GE(send(self->fd, buf, send_len, 0), 0);
+	EXPECT_NE(recv(self->cfd, recv_mem, send_len, 0), -1);
+	EXPECT_EQ(memcmp(buf, recv_mem, send_len), 0);
+}
+
+TEST_F(mptcp, recv_small)
+{
+	char const *test_str = "test_read";
+	int send_len = 10;
+	char buf[10];
+
+	send_len = strlen(test_str) + 1;
+	EXPECT_EQ(send(self->fd, test_str, send_len, 0), send_len);
+	EXPECT_NE(recv(self->cfd, buf, send_len, 0), -1);
+	EXPECT_EQ(memcmp(buf, test_str, send_len), 0);
+}
+
+/*
+TEST_F(mptcp, msg_more)
+{
+	char const *test_str = "test_read";
+	int send_len = 10;
+	char buf[10 * 2];
+
+	EXPECT_EQ(send(self->fd, test_str, send_len, MSG_MORE), send_len);
+	EXPECT_EQ(recv(self->cfd, buf, send_len, MSG_DONTWAIT), -1);
+	EXPECT_EQ(send(self->fd, test_str, send_len, 0), send_len);
+	EXPECT_EQ(recv(self->cfd, buf, send_len * 2, MSG_WAITALL),
+		  send_len * 2);
+	EXPECT_EQ(memcmp(buf, test_str, send_len), 0);
+}
+
+TEST_F(mptcp, msg_more_unsent)
+{
+	char const *test_str = "test_read";
+	int send_len = 10;
+	char buf[10];
+
+	EXPECT_EQ(send(self->fd, test_str, send_len, MSG_MORE), send_len);
+	EXPECT_EQ(recv(self->cfd, buf, send_len, MSG_DONTWAIT), -1);
+}
+
+*/
+TEST_F(mptcp, msg_eor)
+{
+	char const *test_str = "test_read";
+	int send_len = 10;
+	char buf[10];
+
+	EXPECT_EQ(send(self->fd, test_str, send_len, MSG_EOR), send_len);
+	EXPECT_EQ(recv(self->cfd, buf, send_len, MSG_WAITALL), send_len);
+	EXPECT_EQ(memcmp(buf, test_str, send_len), 0);
+}
+
+TEST_F(mptcp, sendmsg_single)
+{
+	char const *test_str = "test_sendmsg";
+	size_t send_len = 13;
+	struct msghdr msg;
+	struct iovec vec;
+	char buf[13];
+
+	vec.iov_base = (char *)test_str;
+	vec.iov_len = send_len;
+	memset(&msg, 0, sizeof(struct msghdr));
+	msg.msg_iov = &vec;
+	msg.msg_iovlen = 1;
+	EXPECT_EQ(sendmsg(self->fd, &msg, 0), send_len);
+	EXPECT_EQ(recv(self->cfd, buf, send_len, MSG_WAITALL), send_len);
+	EXPECT_EQ(memcmp(buf, test_str, send_len), 0);
+}
+
+#define MAX_FRAGS	64
+#define SEND_LEN	13
+TEST_F(mptcp, sendmsg_fragmented)
+{
+	char const *test_str = "test_sendmsg";
+	char buf[SEND_LEN * MAX_FRAGS];
+	struct iovec vec[MAX_FRAGS];
+	struct msghdr msg;
+	int i, frags;
+
+	for (frags = 1; frags <= MAX_FRAGS; frags++) {
+		for (i = 0; i < frags; i++) {
+			vec[i].iov_base = (char *)test_str;
+			vec[i].iov_len = SEND_LEN;
+		}
+
+		memset(&msg, 0, sizeof(struct msghdr));
+		msg.msg_iov = vec;
+		msg.msg_iovlen = frags;
+
+		EXPECT_EQ(sendmsg(self->fd, &msg, 0), SEND_LEN * frags);
+		EXPECT_EQ(recv(self->cfd, buf, SEND_LEN * frags, MSG_WAITALL),
+			  SEND_LEN * frags);
+
+		for (i = 0; i < frags; i++)
+			EXPECT_EQ(memcmp(buf + SEND_LEN * i,
+					 test_str, SEND_LEN), 0);
+	}
+}
+#undef MAX_FRAGS
+#undef SEND_LEN
+
+TEST_F(mptcp, sendmsg_large)
+{
+	void *mem = malloc(16384);
+	size_t send_len = 16384;
+	size_t sends = 128;
+	struct msghdr msg;
+	size_t recvs = 0;
+	size_t sent = 0;
+
+	memset(&msg, 0, sizeof(struct msghdr));
+	while (sent++ < sends) {
+		struct iovec vec = { (void *)mem, send_len };
+
+		msg.msg_iov = &vec;
+		msg.msg_iovlen = 1;
+		EXPECT_EQ(sendmsg(self->fd, &msg, 0), send_len);
+	}
+
+	while (recvs++ < sends)
+		EXPECT_NE(recv(self->cfd, mem, send_len, 0), -1);
+
+	free(mem);
+}
+
+TEST_F(mptcp, sendmsg_multiple)
+{
+	char const *test_str = "test_sendmsg_multiple";
+	struct iovec vec[5];
+	char *test_strs[5];
+	struct msghdr msg;
+	int total_len = 0;
+	int len_cmp = 0;
+	int iov_len = 5;
+	char *buf;
+	int i;
+
+	memset(&msg, 0, sizeof(struct msghdr));
+	for (i = 0; i < iov_len; i++) {
+		test_strs[i] = (char *)malloc(strlen(test_str) + 1);
+		snprintf(test_strs[i], strlen(test_str) + 1, "%s", test_str);
+		vec[i].iov_base = (void *)test_strs[i];
+		vec[i].iov_len = strlen(test_strs[i]) + 1;
+		total_len += vec[i].iov_len;
+	}
+	msg.msg_iov = vec;
+	msg.msg_iovlen = iov_len;
+
+	EXPECT_EQ(sendmsg(self->fd, &msg, 0), total_len);
+	buf = malloc(total_len);
+	EXPECT_NE(recv(self->cfd, buf, total_len, 0), -1);
+	for (i = 0; i < iov_len; i++) {
+		EXPECT_EQ(memcmp(test_strs[i], buf + len_cmp,
+				 strlen(test_strs[i])),
+			  0);
+		len_cmp += strlen(buf + len_cmp) + 1;
+	}
+	for (i = 0; i < iov_len; i++)
+		free(test_strs[i]);
+	free(buf);
+}
+
+TEST_F(mptcp, sendmsg_multiple_stress)
+{
+	char const *test_str = "abcdefghijklmno";
+	struct iovec vec[1024];
+	char *test_strs[1024];
+	int iov_len = 1024;
+	int total_len = 0;
+	char buf[1 << 14];
+	struct msghdr msg;
+	int len_cmp = 0;
+	int i;
+
+	memset(&msg, 0, sizeof(struct msghdr));
+	for (i = 0; i < iov_len; i++) {
+		test_strs[i] = (char *)malloc(strlen(test_str) + 1);
+		snprintf(test_strs[i], strlen(test_str) + 1, "%s", test_str);
+		vec[i].iov_base = (void *)test_strs[i];
+		vec[i].iov_len = strlen(test_strs[i]) + 1;
+		total_len += vec[i].iov_len;
+	}
+	msg.msg_iov = vec;
+	msg.msg_iovlen = iov_len;
+
+	EXPECT_EQ(sendmsg(self->fd, &msg, 0), total_len);
+	EXPECT_NE(recv(self->cfd, buf, total_len, 0), -1);
+
+	for (i = 0; i < iov_len; i++)
+		len_cmp += strlen(buf + len_cmp) + 1;
+
+	for (i = 0; i < iov_len; i++)
+		free(test_strs[i]);
+}
+
+TEST_F(mptcp, splice_from_pipe)
+{
+	int send_len = PAYLOAD_MAX_LEN;
+	char mem_send[PAYLOAD_MAX_LEN];
+	char mem_recv[PAYLOAD_MAX_LEN];
+	int p[2];
+
+	ASSERT_GE(pipe(p), 0);
+	EXPECT_GE(write(p[1], mem_send, send_len), 0);
+	EXPECT_GE(splice(p[0], NULL, self->fd, NULL, send_len, 0), 0);
+	EXPECT_EQ(recv(self->cfd, mem_recv, send_len, MSG_WAITALL), send_len);
+	EXPECT_EQ(memcmp(mem_send, mem_recv, send_len), 0);
+}
+
+TEST_F(mptcp, splice_more)
+{
+	unsigned int f = SPLICE_F_NONBLOCK | SPLICE_F_MORE | SPLICE_F_GIFT;
+	int send_len = PAYLOAD_MAX_LEN;
+	char mem_send[PAYLOAD_MAX_LEN];
+	int i, send_pipe = 1;
+	int p[2];
+
+	ASSERT_GE(pipe(p), 0);
+	EXPECT_GE(write(p[1], mem_send, send_len), 0);
+	for (i = 0; i < 32; i++)
+		EXPECT_EQ(splice(p[0], NULL, self->fd, NULL, send_pipe, f), 1);
+}
+
+TEST_F(mptcp, splice_from_pipe2)
+{
+	int send_len = 16000;
+	char mem_send[16000];
+	char mem_recv[16000];
+	int p2[2];
+	int p[2];
+
+	memrnd(mem_send, sizeof(mem_send));
+
+	ASSERT_GE(pipe(p), 0);
+	ASSERT_GE(pipe(p2), 0);
+	EXPECT_EQ(write(p[1], mem_send, 8000), 8000);
+	EXPECT_EQ(splice(p[0], NULL, self->fd, NULL, 8000, 0), 8000);
+	EXPECT_EQ(write(p2[1], mem_send + 8000, 8000), 8000);
+	EXPECT_EQ(splice(p2[0], NULL, self->fd, NULL, 8000, 0), 8000);
+	EXPECT_EQ(recv(self->cfd, mem_recv, send_len, MSG_WAITALL), send_len);
+	EXPECT_EQ(memcmp(mem_send, mem_recv, send_len), 0);
+}
+
+TEST_F(mptcp, send_and_splice)
+{
+	int send_len = PAYLOAD_MAX_LEN;
+	char mem_send[PAYLOAD_MAX_LEN];
+	char mem_recv[PAYLOAD_MAX_LEN];
+	char const *test_str = "test_read";
+	int send_len2 = 10;
+	char buf[10];
+	int p[2];
+
+	ASSERT_GE(pipe(p), 0);
+	EXPECT_EQ(send(self->fd, test_str, send_len2, 0), send_len2);
+	EXPECT_EQ(recv(self->cfd, buf, send_len2, MSG_WAITALL), send_len2);
+	EXPECT_EQ(memcmp(test_str, buf, send_len2), 0);
+
+	EXPECT_GE(write(p[1], mem_send, send_len), send_len);
+	EXPECT_GE(splice(p[0], NULL, self->fd, NULL, send_len, 0), send_len);
+
+	EXPECT_EQ(recv(self->cfd, mem_recv, send_len, MSG_WAITALL), send_len);
+	EXPECT_EQ(memcmp(mem_send, mem_recv, send_len), 0);
+}
+
+TEST_F(mptcp, splice_to_pipe)
+{
+	int send_len = PAYLOAD_MAX_LEN;
+	char mem_send[PAYLOAD_MAX_LEN];
+	char mem_recv[PAYLOAD_MAX_LEN];
+	int p[2];
+
+	memrnd(mem_send, sizeof(mem_send));
+
+	ASSERT_GE(pipe(p), 0);
+	EXPECT_EQ(send(self->fd, mem_send, send_len, 0), send_len);
+	EXPECT_EQ(splice(self->cfd, NULL, p[1], NULL, send_len, 0), send_len);
+	EXPECT_EQ(read(p[0], mem_recv, send_len), send_len);
+	EXPECT_EQ(memcmp(mem_send, mem_recv, send_len), 0);
+}
+
+TEST_F(mptcp, recv_and_splice)
+{
+	int send_len = PAYLOAD_MAX_LEN;
+	char mem_send[PAYLOAD_MAX_LEN];
+	char mem_recv[PAYLOAD_MAX_LEN];
+	int half = send_len / 2;
+	int p[2];
+
+	ASSERT_GE(pipe(p), 0);
+	EXPECT_EQ(send(self->fd, mem_send, send_len, 0), send_len);
+	/* Recv half of the data, splice the other half */
+	EXPECT_EQ(recv(self->cfd, mem_recv, half, MSG_WAITALL), half);
+	EXPECT_EQ(splice(self->cfd, NULL, p[1], NULL, half, SPLICE_F_NONBLOCK),
+		  half);
+	EXPECT_EQ(read(p[0], &mem_recv[half], half), half);
+	EXPECT_EQ(memcmp(mem_send, mem_recv, send_len), 0);
+}
+
+TEST_F(mptcp, peek_and_splice)
+{
+	int send_len = PAYLOAD_MAX_LEN;
+	char mem_send[PAYLOAD_MAX_LEN];
+	char mem_recv[PAYLOAD_MAX_LEN];
+	int chunk = PAYLOAD_MAX_LEN / 4;
+	int n, i, p[2];
+
+	memrnd(mem_send, sizeof(mem_send));
+
+	ASSERT_GE(pipe(p), 0);
+	for (i = 0; i < 4; i++)
+		EXPECT_EQ(send(self->fd, &mem_send[chunk * i], chunk, 0),
+			  chunk);
+
+	EXPECT_EQ(recv(self->cfd, mem_recv, chunk * 5 / 2,
+		       MSG_WAITALL | MSG_PEEK),
+		  chunk * 5 / 2);
+	EXPECT_EQ(memcmp(mem_send, mem_recv, chunk * 5 / 2), 0);
+
+	n = 0;
+	while (n < send_len) {
+		i = splice(self->cfd, NULL, p[1], NULL, send_len - n, 0);
+		EXPECT_GT(i, 0);
+		n += i;
+	}
+	EXPECT_EQ(n, send_len);
+	EXPECT_EQ(read(p[0], mem_recv, send_len), send_len);
+	EXPECT_EQ(memcmp(mem_send, mem_recv, send_len), 0);
+}
+
+#define MAX_FRAGS 48
+TEST_F(mptcp, splice_short)
+{
+	struct iovec sendchar_iov;
+	char read_buf[0x10000];
+	char sendbuf[0x100];
+	char sendchar = 'S';
+	int pipefds[2];
+	int i;
+
+	sendchar_iov.iov_base = &sendchar;
+	sendchar_iov.iov_len = 1;
+
+	memset(sendbuf, 's', sizeof(sendbuf));
+
+	ASSERT_GE(pipe2(pipefds, O_NONBLOCK), 0);
+	ASSERT_GE(fcntl(pipefds[0], F_SETPIPE_SZ, (MAX_FRAGS + 1) * 0x1000), 0);
+
+	for (i = 0; i < MAX_FRAGS; i++)
+		ASSERT_GE(vmsplice(pipefds[1], &sendchar_iov, 1, 0), 0);
+
+	ASSERT_EQ(write(pipefds[1], sendbuf, sizeof(sendbuf)), sizeof(sendbuf));
+
+	EXPECT_EQ(splice(pipefds[0], NULL, self->fd, NULL, MAX_FRAGS + 0x1000, 0),
+		  MAX_FRAGS + sizeof(sendbuf));
+	EXPECT_EQ(recv(self->cfd, read_buf, sizeof(read_buf), 0),
+		  MAX_FRAGS + sizeof(sendbuf));
+	EXPECT_EQ(recv(self->cfd, read_buf, sizeof(read_buf), MSG_DONTWAIT), -1);
+	EXPECT_EQ(errno, EAGAIN);
+}
+#undef MAX_FRAGS
+
+TEST_F(mptcp, recvmsg_single)
+{
+	char const *test_str = "test_recvmsg_single";
+	int send_len = strlen(test_str) + 1;
+	char buf[20];
+	struct msghdr hdr;
+	struct iovec vec;
+
+	memset(&hdr, 0, sizeof(hdr));
+	EXPECT_EQ(send(self->fd, test_str, send_len, 0), send_len);
+	vec.iov_base = (char *)buf;
+	vec.iov_len = send_len;
+	hdr.msg_iovlen = 1;
+	hdr.msg_iov = &vec;
+	EXPECT_NE(recvmsg(self->cfd, &hdr, 0), -1);
+	EXPECT_EQ(memcmp(test_str, buf, send_len), 0);
+}
+
+TEST_F(mptcp, recvmsg_single_max)
+{
+	int send_len = PAYLOAD_MAX_LEN;
+	char send_mem[PAYLOAD_MAX_LEN];
+	char recv_mem[PAYLOAD_MAX_LEN];
+	struct iovec vec;
+	struct msghdr hdr;
+
+	memrnd(send_mem, sizeof(send_mem));
+
+	EXPECT_EQ(send(self->fd, send_mem, send_len, 0), send_len);
+	vec.iov_base = (char *)recv_mem;
+	vec.iov_len = PAYLOAD_MAX_LEN;
+
+	hdr.msg_iovlen = 1;
+	hdr.msg_iov = &vec;
+	EXPECT_NE(recvmsg(self->cfd, &hdr, 0), -1);
+	EXPECT_EQ(memcmp(send_mem, recv_mem, send_len), 0);
+}
+
+TEST_F(mptcp, recvmsg_multiple)
+{
+	unsigned int msg_iovlen = 1024;
+	struct iovec vec[1024];
+	char *iov_base[1024];
+	unsigned int iov_len = 16;
+	int send_len = 1 << 14;
+	char buf[1 << 14];
+	struct msghdr hdr;
+	int i;
+
+	memrnd(buf, sizeof(buf));
+
+	EXPECT_EQ(send(self->fd, buf, send_len, 0), send_len);
+	for (i = 0; i < msg_iovlen; i++) {
+		iov_base[i] = (char *)malloc(iov_len);
+		vec[i].iov_base = iov_base[i];
+		vec[i].iov_len = iov_len;
+	}
+
+	hdr.msg_iovlen = msg_iovlen;
+	hdr.msg_iov = vec;
+	EXPECT_NE(recvmsg(self->cfd, &hdr, 0), -1);
+
+	for (i = 0; i < msg_iovlen; i++)
+		free(iov_base[i]);
+}
+
+TEST_F(mptcp, single_send_multiple_recv)
+{
+	unsigned int total_len = PAYLOAD_MAX_LEN * 2;
+	unsigned int send_len = PAYLOAD_MAX_LEN;
+	char send_mem[PAYLOAD_MAX_LEN * 2];
+	char recv_mem[PAYLOAD_MAX_LEN * 2];
+
+	memrnd(send_mem, sizeof(send_mem));
+
+	EXPECT_GE(send(self->fd, send_mem, total_len, 0), 0);
+	memset(recv_mem, 0, total_len);
+
+	EXPECT_NE(recv(self->cfd, recv_mem, send_len, 0), -1);
+	EXPECT_NE(recv(self->cfd, recv_mem + send_len, send_len, 0), -1);
+	EXPECT_EQ(memcmp(send_mem, recv_mem, total_len), 0);
+}
+
+TEST_F(mptcp, multiple_send_single_recv)
+{
+	unsigned int total_len = 2 * 10;
+	unsigned int send_len = 10;
+	char recv_mem[2 * 10];
+	char send_mem[10];
+
+	memrnd(send_mem, sizeof(send_mem));
+
+	EXPECT_GE(send(self->fd, send_mem, send_len, 0), 0);
+	EXPECT_GE(send(self->fd, send_mem, send_len, 0), 0);
+	memset(recv_mem, 0, total_len);
+	EXPECT_EQ(recv(self->cfd, recv_mem, total_len, MSG_WAITALL), total_len);
+
+	EXPECT_EQ(memcmp(send_mem, recv_mem, send_len), 0);
+	EXPECT_EQ(memcmp(send_mem, recv_mem + send_len, send_len), 0);
+}
+
+TEST_F(mptcp, single_send_multiple_recv_non_align)
+{
+	const unsigned int total_len = 15;
+	const unsigned int recv_len = 10;
+	char recv_mem[recv_len * 2];
+	char send_mem[total_len];
+
+	memrnd(send_mem, sizeof(send_mem));
+
+	EXPECT_GE(send(self->fd, send_mem, total_len, 0), 0);
+	memset(recv_mem, 0, total_len);
+
+	EXPECT_EQ(recv(self->cfd, recv_mem, recv_len, 0), recv_len);
+	EXPECT_EQ(recv(self->cfd, recv_mem + recv_len, recv_len, 0), 5);
+	EXPECT_EQ(memcmp(send_mem, recv_mem, total_len), 0);
+}
+
+TEST_F(mptcp, recv_partial)
+{
+	char const *test_str = "test_read_partial";
+	char const *test_str_first = "test_read";
+	char const *test_str_second = "_partial";
+	int send_len = strlen(test_str) + 1;
+	char recv_mem[18];
+
+	memset(recv_mem, 0, sizeof(recv_mem));
+	EXPECT_EQ(send(self->fd, test_str, send_len, 0), send_len);
+	EXPECT_EQ(recv(self->cfd, recv_mem, strlen(test_str_first),
+		       MSG_WAITALL), strlen(test_str_first));
+	EXPECT_EQ(memcmp(test_str_first, recv_mem, strlen(test_str_first)), 0);
+	memset(recv_mem, 0, sizeof(recv_mem));
+	EXPECT_EQ(recv(self->cfd, recv_mem, strlen(test_str_second),
+		       MSG_WAITALL), strlen(test_str_second));
+	EXPECT_EQ(memcmp(test_str_second, recv_mem, strlen(test_str_second)),
+		  0);
+}
+
+TEST_F(mptcp, recv_nonblock)
+{
+	char buf[4096];
+	bool err;
+
+	EXPECT_EQ(recv(self->cfd, buf, sizeof(buf), MSG_DONTWAIT), -1);
+	err = (errno == EAGAIN || errno == EWOULDBLOCK);
+	EXPECT_EQ(err, true);
+}
+
+TEST_F(mptcp, recv_peek)
+{
+	char const *test_str = "test_read_peek";
+	int send_len = strlen(test_str) + 1;
+	char buf[15];
+
+	EXPECT_EQ(send(self->fd, test_str, send_len, 0), send_len);
+	EXPECT_EQ(recv(self->cfd, buf, send_len, MSG_PEEK), send_len);
+	EXPECT_EQ(memcmp(test_str, buf, send_len), 0);
+	memset(buf, 0, sizeof(buf));
+	EXPECT_EQ(recv(self->cfd, buf, send_len, 0), send_len);
+	EXPECT_EQ(memcmp(test_str, buf, send_len), 0);
+}
+
+TEST_F(mptcp, recv_peek_multiple)
+{
+	char const *test_str = "test_read_peek";
+	int send_len = strlen(test_str) + 1;
+	unsigned int num_peeks = 100;
+	char buf[15];
+	int i;
+
+	EXPECT_EQ(send(self->fd, test_str, send_len, 0), send_len);
+	for (i = 0; i < num_peeks; i++) {
+		EXPECT_NE(recv(self->cfd, buf, send_len, MSG_PEEK), -1);
+		EXPECT_EQ(memcmp(test_str, buf, send_len), 0);
+		memset(buf, 0, sizeof(buf));
+	}
+	EXPECT_NE(recv(self->cfd, buf, send_len, 0), -1);
+	EXPECT_EQ(memcmp(test_str, buf, send_len), 0);
+}
+
+TEST_F(mptcp, recv_peek_multiple_records)
+{
+	char const *test_str = "test_read_peek_mult_recs";
+	char const *test_str_first = "test_read_peek";
+	char const *test_str_second = "_mult_recs";
+	int len;
+	char buf[64];
+
+	len = strlen(test_str_first);
+	EXPECT_EQ(send(self->fd, test_str_first, len, 0), len);
+
+	len = strlen(test_str_second) + 1;
+	EXPECT_EQ(send(self->fd, test_str_second, len, 0), len);
+
+	len = strlen(test_str_first);
+	memset(buf, 0, len);
+	EXPECT_EQ(recv(self->cfd, buf, len, MSG_PEEK | MSG_WAITALL), len);
+
+	len = strlen(test_str_first);
+	EXPECT_EQ(memcmp(test_str_first, buf, len), 0);
+
+	len = strlen(test_str) + 1;
+	memset(buf, 0, len);
+	EXPECT_EQ(recv(self->cfd, buf, len, MSG_WAITALL), len);
+
+	/* Non-MSG_PEEK will consume the peeked data */
+	len = strlen(test_str) + 1;
+	EXPECT_EQ(memcmp(test_str, buf, len), 0);
+
+	/* MSG_MORE holds data in kernel send buffer */
+	len = strlen(test_str_first);
+	EXPECT_EQ(send(self->fd, test_str_first, len, MSG_MORE), len);
+
+	len = strlen(test_str_second) + 1;
+	EXPECT_EQ(send(self->fd, test_str_second, len, 0), len);
+
+	len = strlen(test_str) + 1;
+	memset(buf, 0, len);
+	EXPECT_EQ(recv(self->cfd, buf, len, MSG_PEEK | MSG_WAITALL), len);
+
+	len = strlen(test_str) + 1;
+	EXPECT_EQ(memcmp(test_str, buf, len), 0);
+}
+
+TEST_F(mptcp, recv_peek_large_buf_mult_recs)
+{
+	char const *test_str = "test_read_peek_mult_recs";
+	char const *test_str_first = "test_read_peek";
+	char const *test_str_second = "_mult_recs";
+	int len;
+	char buf[64];
+
+	len = strlen(test_str_first);
+	EXPECT_EQ(send(self->fd, test_str_first, len, 0), len);
+
+	len = strlen(test_str_second) + 1;
+	EXPECT_EQ(send(self->fd, test_str_second, len, 0), len);
+
+	len = strlen(test_str) + 1;
+	memset(buf, 0, len);
+	EXPECT_NE((len = recv(self->cfd, buf, len,
+			      MSG_PEEK | MSG_WAITALL)), -1);
+	len = strlen(test_str) + 1;
+	EXPECT_EQ(memcmp(test_str, buf, len), 0);
+}
+
+TEST_F(mptcp, recv_lowat)
+{
+	char send_mem[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
+	char recv_mem[20];
+	int lowat = 8;
+
+	EXPECT_EQ(send(self->fd, send_mem, 10, 0), 10);
+	EXPECT_EQ(send(self->fd, send_mem, 5, 0), 5);
+
+	memset(recv_mem, 0, 20);
+	EXPECT_EQ(setsockopt(self->cfd, SOL_SOCKET, SO_RCVLOWAT,
+			     &lowat, sizeof(lowat)), 0);
+	EXPECT_EQ(recv(self->cfd, recv_mem, 1, MSG_WAITALL), 1);
+	EXPECT_EQ(recv(self->cfd, recv_mem + 1, 6, MSG_WAITALL), 6);
+	EXPECT_EQ(recv(self->cfd, recv_mem + 7, 10, 0), 8);
+
+	EXPECT_EQ(memcmp(send_mem, recv_mem, 10), 0);
+	EXPECT_EQ(memcmp(send_mem, recv_mem + 10, 5), 0);
+}
+
+TEST_F(mptcp, bidir)
+{
+	char const *test_str = "test_read";
+	int send_len = 10;
+	char buf[10];
+
+	ASSERT_EQ(strlen(test_str) + 1, send_len);
+
+	EXPECT_EQ(send(self->fd, test_str, send_len, 0), send_len);
+	EXPECT_NE(recv(self->cfd, buf, send_len, 0), -1);
+	EXPECT_EQ(memcmp(buf, test_str, send_len), 0);
+
+	memset(buf, 0, sizeof(buf));
+
+	EXPECT_EQ(send(self->cfd, test_str, send_len, 0), send_len);
+	EXPECT_NE(recv(self->fd, buf, send_len, 0), -1);
+	EXPECT_EQ(memcmp(buf, test_str, send_len), 0);
+}
+
+TEST_F(mptcp, pollin)
+{
+	char const *test_str = "test_poll";
+	struct pollfd fd = { 0, 0, 0 };
+	char buf[10];
+	int send_len = 10;
+
+	EXPECT_EQ(send(self->fd, test_str, send_len, 0), send_len);
+	fd.fd = self->cfd;
+	fd.events = POLLIN;
+
+	EXPECT_EQ(poll(&fd, 1, 20), 1);
+	EXPECT_EQ(fd.revents & POLLIN, 1);
+	EXPECT_EQ(recv(self->cfd, buf, send_len, MSG_WAITALL), send_len);
+	/* Test timing out */
+	EXPECT_EQ(poll(&fd, 1, 20), 0);
+}
+
+TEST_F(mptcp, poll_wait)
+{
+	char const *test_str = "test_poll_wait";
+	int send_len = strlen(test_str) + 1;
+	struct pollfd fd = { 0, 0, 0 };
+	char recv_mem[15];
+
+	fd.fd = self->cfd;
+	fd.events = POLLIN;
+	EXPECT_EQ(send(self->fd, test_str, send_len, 0), send_len);
+	/* Set timeout to inf. secs */
+	EXPECT_EQ(poll(&fd, 1, -1), 1);
+	EXPECT_EQ(fd.revents & POLLIN, 1);
+	EXPECT_EQ(recv(self->cfd, recv_mem, send_len, MSG_WAITALL), send_len);
+}
+
+TEST_F(mptcp, poll_wait_split)
+{
+	struct pollfd fd = { 0, 0, 0 };
+	char send_mem[20] = {};
+	char recv_mem[15];
+
+	fd.fd = self->cfd;
+	fd.events = POLLIN;
+	/* Send 20 bytes */
+	EXPECT_EQ(send(self->fd, send_mem, sizeof(send_mem), 0),
+		  sizeof(send_mem));
+	/* Poll with inf. timeout */
+	EXPECT_EQ(poll(&fd, 1, -1), 1);
+	EXPECT_EQ(fd.revents & POLLIN, 1);
+	EXPECT_EQ(recv(self->cfd, recv_mem, sizeof(recv_mem), MSG_WAITALL),
+		  sizeof(recv_mem));
+
+	/* The remaining 5 bytes are still in the socket buffer */
+	fd.fd = self->cfd;
+	fd.events = POLLIN;
+	EXPECT_EQ(poll(&fd, 1, -1), 1);
+	EXPECT_EQ(fd.revents & POLLIN, 1);
+	EXPECT_EQ(recv(self->cfd, recv_mem, sizeof(recv_mem), 0),
+		  sizeof(send_mem) - sizeof(recv_mem));
+}
+
+TEST_F(mptcp, blocking)
+{
+	size_t data = 100000;
+	int res = fork();
+
+	EXPECT_NE(res, -1);
+
+	if (res) {
+		/* parent */
+		size_t left = data;
+		char buf[16384];
+		int status;
+		int pid2;
+
+		while (left) {
+			int res = send(self->fd, buf,
+				       left > 16384 ? 16384 : left, 0);
+
+			EXPECT_GE(res, 0);
+			left -= res;
+		}
+
+		pid2 = wait(&status);
+		EXPECT_EQ(status, 0);
+		EXPECT_EQ(res, pid2);
+	} else {
+		/* child */
+		size_t left = data;
+		char buf[16384];
+
+		while (left) {
+			int res = recv(self->cfd, buf,
+				       left > 16384 ? 16384 : left, 0);
+
+			EXPECT_GE(res, 0);
+			left -= res;
+		}
+	}
+}
+
+TEST_F(mptcp, nonblocking)
+{
+	size_t data = 100000;
+	int sendbuf = 100;
+	int flags;
+	int res;
+
+	flags = fcntl(self->fd, F_GETFL, 0);
+	fcntl(self->fd, F_SETFL, flags | O_NONBLOCK);
+	fcntl(self->cfd, F_SETFL, flags | O_NONBLOCK);
+
+	/* Ensure nonblocking behavior by imposing a small send
+	 * buffer.
+	 */
+	EXPECT_EQ(setsockopt(self->fd, SOL_SOCKET, SO_SNDBUF,
+			     &sendbuf, sizeof(sendbuf)), 0);
+
+	res = fork();
+	EXPECT_NE(res, -1);
+
+	if (res) {
+		/* parent */
+		bool eagain = false;
+		size_t left = data;
+		char buf[16384];
+		int status;
+		int pid2;
+
+		while (left) {
+			int res = send(self->fd, buf,
+				       left > 16384 ? 16384 : left, 0);
+
+			if (res == -1 && errno == EAGAIN) {
+				eagain = true;
+				usleep(10000);
+				continue;
+			}
+			EXPECT_GE(res, 0);
+			left -= res;
+		}
+
+		EXPECT_TRUE(eagain);
+		pid2 = wait(&status);
+
+		EXPECT_EQ(status, 0);
+		EXPECT_EQ(res, pid2);
+	} else {
+		/* child */
+		bool eagain = false;
+		size_t left = data;
+		char buf[16384];
+
+		while (left) {
+			int res = recv(self->cfd, buf,
+				       left > 16384 ? 16384 : left, 0);
+
+			if (res == -1 && errno == EAGAIN) {
+				eagain = true;
+				usleep(10000);
+				continue;
+			}
+			EXPECT_GE(res, 0);
+			left -= res;
+		}
+		EXPECT_TRUE(eagain);
+	}
+}
+
+static void
+test_mutliproc(struct __test_metadata *_metadata, struct _test_data_mptcp *self,
+	       bool sendpg, unsigned int n_readers, unsigned int n_writers)
+{
+	const unsigned int n_children = n_readers + n_writers;
+	const size_t data = 6 * 1000 * 1000;
+	const size_t file_sz = data / 100;
+	size_t read_bias, write_bias;
+	int i, fd, child_id;
+	char buf[file_sz];
+	pid_t pid;
+
+	/* Only allow multiples for simplicity */
+	ASSERT_EQ(!(n_readers % n_writers) || !(n_writers % n_readers), true);
+	read_bias = n_writers / n_readers ?: 1;
+	write_bias = n_readers / n_writers ?: 1;
+
+	/* prep a file to send */
+	fd = open("/tmp/", O_TMPFILE | O_RDWR, 0600);
+	ASSERT_GE(fd, 0);
+
+	memset(buf, 0xac, file_sz);
+	ASSERT_EQ(write(fd, buf, file_sz), file_sz);
+
+	/* spawn children */
+	for (child_id = 0; child_id < n_children; child_id++) {
+		pid = fork();
+		ASSERT_NE(pid, -1);
+		if (!pid)
+			break;
+	}
+
+	/* parent waits for all children */
+	if (pid) {
+		for (i = 0; i < n_children; i++) {
+			int status;
+
+			wait(&status);
+			EXPECT_EQ(status, 0);
+		}
+
+		return;
+	}
+
+	/* Split threads for reading and writing */
+	if (child_id < n_readers) {
+		size_t left = data * read_bias;
+		char rb[8001];
+
+		while (left) {
+			int res;
+
+			res = recv(self->cfd, rb,
+				   left > sizeof(rb) ? sizeof(rb) : left, 0);
+
+			EXPECT_GE(res, 0);
+			left -= res;
+		}
+	} else {
+		size_t left = data * write_bias;
+
+		while (left) {
+			int res;
+
+			ASSERT_EQ(lseek(fd, 0, SEEK_SET), 0);
+			if (sendpg)
+				res = sendfile(self->fd, fd, NULL,
+					       left > file_sz ? file_sz : left);
+			else
+				res = send(self->fd, buf,
+					   left > file_sz ? file_sz : left, 0);
+
+			EXPECT_GE(res, 0);
+			left -= res;
+		}
+	}
+}
+
+TEST_F(mptcp, mutliproc_even)
+{
+	test_mutliproc(_metadata, self, false, 6, 6);
+}
+
+TEST_F(mptcp, mutliproc_readers)
+{
+	test_mutliproc(_metadata, self, false, 4, 12);
+}
+
+TEST_F(mptcp, mutliproc_writers)
+{
+	test_mutliproc(_metadata, self, false, 10, 2);
+}
+
+TEST_F(mptcp, mutliproc_sendpage_even)
+{
+	test_mutliproc(_metadata, self, true, 6, 6);
+}
+
+TEST_F(mptcp, mutliproc_sendpage_readers)
+{
+	test_mutliproc(_metadata, self, true, 4, 12);
+}
+
+TEST_F(mptcp, mutliproc_sendpage_writers)
+{
+	test_mutliproc(_metadata, self, true, 10, 2);
+}
+
+TEST_F(mptcp, shutdown)
+{
+	char const *test_str = "test_read";
+	int send_len = 10;
+	char buf[10];
+
+	ASSERT_EQ(strlen(test_str) + 1, send_len);
+
+	EXPECT_EQ(send(self->fd, test_str, send_len, 0), send_len);
+	EXPECT_NE(recv(self->cfd, buf, send_len, 0), -1);
+	EXPECT_EQ(memcmp(buf, test_str, send_len), 0);
+
+	shutdown(self->fd, SHUT_RDWR);
+	shutdown(self->cfd, SHUT_RDWR);
+}
+
+TEST_F(mptcp, shutdown_unsent)
+{
+	char const *test_str = "test_read";
+	int send_len = 10;
+
+	EXPECT_EQ(send(self->fd, test_str, send_len, MSG_MORE), send_len);
+
+	shutdown(self->fd, SHUT_RDWR);
+	shutdown(self->cfd, SHUT_RDWR);
+}
+
+static bool wait_for_tcp_close(struct __test_metadata *_metadata,
+			       int fd, int max)
+{
+	struct tcp_info info;
+	socklen_t len;
+	int i, ret;
+
+	for (i = 0; i < max; i++) {
+		len = sizeof(info);
+		ret = getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len);
+		ASSERT_EQ(ret, 0);
+		if (info.tcpi_state == TCP_CLOSE)
+			return true;
+		usleep(1000);
+	}
+
+	return false;
+}
+
+TEST_F(mptcp, shutdown_reuse)
+{
+	struct sockaddr_in addr;
+	int ret;
+
+	shutdown(self->fd, SHUT_RDWR);
+	shutdown(self->cfd, SHUT_RDWR);
+	close(self->cfd);
+
+	EXPECT_TRUE(wait_for_tcp_close(_metadata, self->fd, 1000));
+
+	addr.sin_family = AF_INET;
+	addr.sin_addr.s_addr = htonl(INADDR_ANY);
+	addr.sin_port = 0;
+
+	ret = bind(self->fd, &addr, sizeof(addr));
+	EXPECT_EQ(ret, 0);
+	ret = listen(self->fd, 10);
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	ret = connect(self->fd, &addr, sizeof(addr));
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EISCONN);
+}
+
+TEST(mptcp_v6)
+{
+	struct sockaddr_in6 addr, addr2;
+	int sfd, ret, fd;
+	socklen_t len, len2;
+
+	addr.sin6_family = AF_INET6;
+	addr.sin6_addr = in6addr_any;
+	addr.sin6_port = 0;
+
+	fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_MPTCP);
+	ASSERT_GE(fd, 0);
+	sfd = socket(AF_INET6, SOCK_STREAM, IPPROTO_MPTCP);
+	ASSERT_GE(sfd, 0);
+
+	ret = bind(sfd, &addr, sizeof(addr));
+	ASSERT_EQ(ret, 0);
+	ret = listen(sfd, 10);
+	ASSERT_EQ(ret, 0);
+
+	len = sizeof(addr);
+	ret = getsockname(sfd, &addr, &len);
+	ASSERT_EQ(ret, 0);
+
+	ret = connect(fd, &addr, sizeof(addr));
+	ASSERT_EQ(ret, 0);
+
+	len = sizeof(addr);
+	ret = getsockname(fd, &addr, &len);
+	ASSERT_EQ(ret, 0);
+
+	len2 = sizeof(addr2);
+	ret = getsockname(fd, &addr2, &len2);
+	ASSERT_EQ(ret, 0);
+
+	EXPECT_EQ(len2, len);
+	EXPECT_EQ(memcmp(&addr, &addr2, len), 0);
+
+	close(fd);
+	close(sfd);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/net/mptcp/mptcp_data.sh b/tools/testing/selftests/net/mptcp/mptcp_data.sh
new file mode 100755
index 000000000000..d06b3503e810
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/mptcp_data.sh
@@ -0,0 +1,45 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(dirname "${0}")/mptcp_lib.sh"
+
+ret=0
+ns1=""
+
+# This function is used in the cleanup trap
+#shellcheck disable=SC2317,SC2329
+cleanup()
+{
+        if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
+                kill "$pid" 2>/dev/null
+                wait "$pid" 2>/dev/null
+        fi
+
+        mptcp_lib_ns_exit "$ns1"
+}
+
+init()
+{
+        local max="${1:-4}"
+
+        mptcp_lib_ns_init ns1
+
+        mptcp_lib_pm_nl_set_limits "$ns1" "$max" "$max"
+
+        local i
+        for i in $(seq 1 "$max"); do
+                mptcp_lib_pm_nl_add_endpoint "$ns1" \
+                        "127.0.0.1" flags signal port 1000"$i"
+        done
+}
+
+trap cleanup EXIT
+
+mptcp_lib_check_mptcp
+init
+ip -n "${ns1}" mptcp limits
+mptcp_lib_pm_nl_show_endpoints "$ns1"
+
+#gcc -o ../mptcp_data.o ../mptcp_data.c
+
+ip netns exec "$ns1" ./mptcp_data
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-04-28  8:19 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-24 14:08 [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure Paolo Abeni
2026-04-24 14:08 ` [PATCH mptcp-next v1 1/9] mptcp: move checks vs rcvbuf size earlier in the RX path Paolo Abeni
2026-04-24 14:08 ` [PATCH mptcp-next v1 2/9] mptcp: drop the mptcp_ooo_try_coalesce() helper Paolo Abeni
2026-04-24 14:08 ` [PATCH mptcp-next v1 3/9] mptcp: remove CB offset field Paolo Abeni
2026-04-24 14:08 ` [PATCH mptcp-next v1 4/9] mptcp: sync mptcp skb cb layout with tcp one Paolo Abeni
2026-04-24 14:08 ` [PATCH mptcp-next v1 5/9] tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage, too Paolo Abeni
2026-04-24 14:08 ` [PATCH mptcp-next v1 6/9] mptcp: implemented OoO queue pruning Paolo Abeni
2026-04-24 14:08 ` [PATCH mptcp-next v1 7/9] mptcp: track prune recovery status Paolo Abeni
2026-04-24 14:08 ` [PATCH mptcp-next v1 8/9] mptcp: move the retrans loop to a separate helper Paolo Abeni
2026-04-24 14:08 ` [PATCH mptcp-next v1 9/9] mptcp: let the retrans scheduler do its job Paolo Abeni
2026-04-24 16:29 ` [PATCH mptcp-next v1 0/9] mptcp: address stall under memory pressure MPTCP CI
2026-04-27  7:27 ` Geliang Tang
2026-04-28  8:16   ` Geliang Tang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.