* [RFC PATCH 0/6] mptcp: address stall under memory pressure
From: Paolo Abeni @ 2026-04-20 10:29 UTC (permalink / raw)
To: mptcp; +Cc: yangang, geliang, matttbe
This is a very early RFC to discuss a different approach to solving the
data transfer stall reported by Geliang and Gang.
There are a few open points documented in the individual patches; the
goal here is to describe in some detail the intended architecture.
Note that the diffstat is biased by the quite large patch 2/6, which
contains mechanical transformations of existing code; the "real" changes
are noticeably smaller.
Paolo Abeni (6):
mptcp: move checks vs rcvbuf size earlier in the RX path
mptcp: sync mptcp skb cb layout with tcp one
tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage, too
mptcp: implemented OoO queue pruning
mptcp: refine coalescing conditions
mptcp: unclone skbs before coalescing them, when needed
include/net/tcp.h | 4 ++
net/ipv4/tcp_input.c | 55 ++++++++++------
net/mptcp/mib.c | 3 +
net/mptcp/mib.h | 3 +
net/mptcp/options.c | 35 +++++++++-
net/mptcp/protocol.c | 154 +++++++++++++++++++++++++++++++------------
net/mptcp/protocol.h | 7 +-
7 files changed, 195 insertions(+), 66 deletions(-)
--
2.53.0
* [RFC PATCH 1/6] mptcp: move checks vs rcvbuf size earlier in the RX path
From: Paolo Abeni @ 2026-04-20 10:29 UTC (permalink / raw)
To: mptcp; +Cc: yangang, geliang, matttbe
Currently the rcvbuf constraint is enforced when moving the skbs into
the msk receive or OoO queue.
Under significant memory pressure this can cause permanent data
transfer stalls. Move the checks earlier, before the skbs even land in
the subflow queues.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
Note that:
- this needs the follow-up patches to really fix the stall
- the memory comparison is intentionally very rough, as
the msk socket lock is not currently held where the condition is
now enforced. This will require some refinement; it is shared as-is
to avoid more latency on my side
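
As an illustration only (not part of the patch), the rough check described
above can be modelled in user space. `struct sock_model` and `over_limit()`
are hypothetical stand-ins for `struct sock` and `mptcp_over_limit()`; the
sketch mirrors just the arithmetic, assuming no locking is involved:

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for the few struct sock fields involved. */
struct sock_model {
	int rcvbuf;     /* models sk->sk_rcvbuf */
	int rmem_alloc; /* models sk_rmem_alloc_get(sk) */
};

/* Mirrors the arithmetic of mptcp_over_limit(): skbs without payload are
 * never dropped; data is refused only once the charged receive memory
 * exceeds twice the receive buffer size.
 */
static bool over_limit(const struct sock_model *sk, unsigned int skb_len)
{
	int limit;

	if (!skb_len)
		return false;

	limit = sk->rcvbuf << 1;
	return sk->rmem_alloc > limit;
}
```

The factor of two leaves some slack, since the comparison is done without
holding the msk socket lock and is therefore approximate.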
---
net/mptcp/options.c | 21 +++++++++++++++++++--
net/mptcp/protocol.c | 9 ++-------
2 files changed, 21 insertions(+), 9 deletions(-)
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 4cc583fdc7a9..a6d290427611 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1158,8 +1158,19 @@ static bool add_addr_hmac_valid(struct mptcp_sock *msk,
return hmac == mp_opt->ahmac;
}
-/* Return false in case of error (or subflow has been reset),
- * else return true.
+static bool mptcp_over_limit(const struct sock *sk, struct sk_buff *skb)
+{
+ int limit;
+
+ if (!skb->len)
+ return false;
+
+ limit = READ_ONCE(sk->sk_rcvbuf) << 1;
+ return sk_rmem_alloc_get(sk) > limit;
+}
+
+/* Return false when the caller must drop the packet, i.e. in case of error,
+ * when the subflow has been reset, or when over memory limits.
*/
bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
{
@@ -1185,6 +1196,9 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
__mptcp_data_acked(subflow->conn);
mptcp_data_unlock(subflow->conn);
+
+ if (mptcp_over_limit(subflow->conn, skb))
+ return false;
return true;
}
@@ -1263,6 +1277,9 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
return true;
}
+ if (mptcp_over_limit(subflow->conn, skb))
+ return false;
+
mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
if (!mpext)
return false;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 17b9a8c13ebf..2d143b929bbf 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -739,7 +739,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
mptcp_init_skb(ssk, skb, offset, len);
- if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf) {
+ if (own_msk) {
mptcp_subflow_lend_fwdmem(subflow, skb);
ret |= __mptcp_move_skb(sk, skb);
} else {
@@ -2197,10 +2197,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
*delta = 0;
while (1) {
- /* If the msk recvbuf is full stop, don't drop */
- if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
- break;
-
prefetch(skb->next);
list_del(&skb->list);
*delta += skb->truesize;
@@ -2229,8 +2225,7 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
mem_cgroup_from_sk(sk));
/* Don't spool the backlog if the rcvbuf is full. */
- if (list_empty(&msk->backlog_list) ||
- sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+ if (list_empty(&msk->backlog_list))
return false;
INIT_LIST_HEAD(skbs);
--
2.53.0
* [RFC PATCH 2/6] mptcp: sync mptcp skb cb layout with tcp one
From: Paolo Abeni @ 2026-04-20 10:29 UTC (permalink / raw)
To: mptcp; +Cc: yangang, geliang, matttbe
The MPTCP protocol uses a significantly different CB layout from TCP, as it
includes different information and uses 64 bits for the sequence numbers.
As the msk-level rcvbuf size is limited by the core socket code to
INT_MAX, we can safely use 32 bits for the MPTCP-level sequence numbers. This
allows updating the MPTCP CB layout so that fields with corresponding TCP-level
data use the same area inside the CB itself.
Add build-time checks to ensure the latter invariant.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
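
For illustration only (not part of the patch): the reason 32 bits suffice is
that all comparisons are done modulo 2^32, in the style of the TCP
`before()`/`after()` helpers, and the sequence numbers being compared never
lie more than 2^31 apart. The helper names below (`seq_before()`,
`seq_after()`, `seq_trunc()`) are hypothetical user-space stand-ins:

```c
#include <stdbool.h>
#include <stdint.h>

/* Same idea as the kernel's before()/after() helpers: compare modulo 2^32
 * by looking at the sign of the 32-bit difference.
 */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

static bool seq_after(uint32_t a, uint32_t b)
{
	return seq_before(b, a);
}

/* Truncate a 64-bit data sequence number to its low 32 bits. */
static uint32_t seq_trunc(uint64_t dsn)
{
	return (uint32_t)dsn;
}
```

Two 64-bit sequence numbers that straddle a 32-bit wrap still compare
correctly after truncation, as long as their distance stays below 2^31.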
net/mptcp/protocol.c | 81 +++++++++++++++++++++++++-------------------
net/mptcp/protocol.h | 5 +--
2 files changed, 50 insertions(+), 36 deletions(-)
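
A minimal user-space sketch of the build-time layout check, for illustration
only. The two structures are hypothetical, simplified models (the real
`struct tcp_skb_cb` has more members), and C11 `static_assert` stands in for
`BUILD_BUG_ON()`:

```c
#include <assert.h> /* static_assert (C11) */
#include <stddef.h> /* offsetof */
#include <stdint.h>

/* Hypothetical, simplified layouts: only the leading fields matter here. */
struct tcp_cb_model {
	uint32_t seq;
	uint32_t end_seq;
	uint32_t misc;      /* placeholder for whatever follows end_seq */
	uint16_t tcp_flags;
};

struct mptcp_cb_model {
	uint32_t map_seq;
	uint32_t end_seq;
	uint32_t offset;
	uint16_t flags;
	uint8_t  has_rxtstamp;
	uint8_t  cant_coalesce;
};

/* Offset of the first byte past a field, like the kernel's offsetofend(). */
#define FIELD_END(type, field) \
	(offsetof(type, field) + sizeof(((type *)0)->field))

/* Fail the build unless both fields start and end at the same offsets. */
#define CHK_FIELD(mptcp_field, tcp_field)                              \
	static_assert(offsetof(struct mptcp_cb_model, mptcp_field) ==  \
		      offsetof(struct tcp_cb_model, tcp_field) &&      \
		      FIELD_END(struct mptcp_cb_model, mptcp_field) == \
		      FIELD_END(struct tcp_cb_model, tcp_field),       \
		      "CB field layout mismatch")

CHK_FIELD(map_seq, seq);
CHK_FIELD(end_seq, end_seq);
CHK_FIELD(flags, tcp_flags);
```

Checking both the start and the end offset catches a size mismatch as well as
a displacement, which is what lets code written against one layout read the
other.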
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 2d143b929bbf..800aa7d9408e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -28,7 +28,7 @@
#include "protocol.h"
#include "mib.h"
-static unsigned int mptcp_inq_hint(const struct sock *sk);
+static int mptcp_inq_hint(const struct sock *sk);
#define CREATE_TRACE_POINTS
#include <trace/events/mptcp.h>
@@ -165,7 +165,7 @@ static bool __mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
!skb_try_coalesce(to, from, fragstolen, delta))
return false;
- pr_debug("colesced seq %llx into %llx new len %d new end seq %llx\n",
+ pr_debug("colesced seq %x into %x new len %d new end seq %x\n",
MPTCP_SKB_CB(from)->map_seq, MPTCP_SKB_CB(to)->map_seq,
to->len, MPTCP_SKB_CB(from)->end_seq);
MPTCP_SKB_CB(to)->end_seq = MPTCP_SKB_CB(from)->end_seq;
@@ -244,20 +244,20 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
{
struct sock *sk = (struct sock *)msk;
struct rb_node **p, *parent;
- u64 seq, end_seq, max_seq;
+ u32 seq, end_seq, max_seq;
struct sk_buff *skb1;
seq = MPTCP_SKB_CB(skb)->map_seq;
end_seq = MPTCP_SKB_CB(skb)->end_seq;
max_seq = atomic64_read(&msk->rcv_wnd_sent);
- pr_debug("msk=%p seq=%llx limit=%llx empty=%d\n", msk, seq, max_seq,
+ pr_debug("msk=%p seq=%x limit=%x empty=%d\n", msk, seq, max_seq,
RB_EMPTY_ROOT(&msk->out_of_order_queue));
- if (after64(end_seq, max_seq)) {
+ if (after(end_seq, max_seq)) {
/* out of window */
mptcp_drop(sk, skb);
- pr_debug("oow by %lld, rcv_wnd_sent %llu\n",
- (unsigned long long)end_seq - (unsigned long)max_seq,
+ pr_debug("oow by %d, rcv_wnd_sent %llu\n",
+ end_seq - max_seq,
(unsigned long long)atomic64_read(&msk->rcv_wnd_sent));
MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_NODSSWINDOW);
return;
@@ -282,7 +282,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
}
/* Can avoid an rbtree lookup if we are adding skb after ooo_last_skb */
- if (!before64(seq, MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq)) {
+ if (!before(seq, MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq)) {
MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOQUEUETAIL);
parent = &msk->ooo_last_skb->rbnode;
p = &parent->rb_right;
@@ -294,18 +294,18 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
while (*p) {
parent = *p;
skb1 = rb_to_skb(parent);
- if (before64(seq, MPTCP_SKB_CB(skb1)->map_seq)) {
+ if (before(seq, MPTCP_SKB_CB(skb1)->map_seq)) {
p = &parent->rb_left;
continue;
}
- if (before64(seq, MPTCP_SKB_CB(skb1)->end_seq)) {
- if (!after64(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) {
+ if (before(seq, MPTCP_SKB_CB(skb1)->end_seq)) {
+ if (!after(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) {
/* All the bits are present. Drop. */
mptcp_drop(sk, skb);
MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA);
return;
}
- if (after64(seq, MPTCP_SKB_CB(skb1)->map_seq)) {
+ if (after(seq, MPTCP_SKB_CB(skb1)->map_seq)) {
/* partial overlap:
* | skb |
* | skb1 |
@@ -336,7 +336,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
merge_right:
/* Remove other segments covered by skb. */
while ((skb1 = skb_rb_next(skb)) != NULL) {
- if (before64(end_seq, MPTCP_SKB_CB(skb1)->end_seq))
+ if (before(end_seq, MPTCP_SKB_CB(skb1)->end_seq))
break;
rb_erase(&skb1->rbnode, &msk->out_of_order_queue);
mptcp_drop(sk, skb1);
@@ -359,11 +359,12 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
/* the skb map_seq accounts for the skb offset:
* mptcp_subflow_get_mapped_dsn() is based on the current tp->copied_seq
- * value
+ * value; note that seq numbers are truncated to 32bits
*/
MPTCP_SKB_CB(skb)->map_seq = mptcp_subflow_get_mapped_dsn(subflow);
MPTCP_SKB_CB(skb)->end_seq = MPTCP_SKB_CB(skb)->map_seq + copy_len;
MPTCP_SKB_CB(skb)->offset = offset;
+ MPTCP_SKB_CB(skb)->flags = 0;
MPTCP_SKB_CB(skb)->has_rxtstamp = has_rxtstamp;
MPTCP_SKB_CB(skb)->cant_coalesce = 0;
@@ -375,13 +376,14 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
{
- u64 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
+ u32 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
struct mptcp_sock *msk = mptcp_sk(sk);
+ u32 ack_seq = msk->ack_seq;
struct sk_buff *tail;
mptcp_borrow_fwdmem(sk, skb);
- if (MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq) {
+ if (MPTCP_SKB_CB(skb)->map_seq == ack_seq) {
/* in sequence */
msk->bytes_received += copy_len;
WRITE_ONCE(msk->ack_seq, msk->ack_seq + copy_len);
@@ -392,7 +394,7 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
skb_set_owner_r(skb, sk);
__skb_queue_tail(&sk->sk_receive_queue, skb);
return true;
- } else if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq)) {
+ } else if (after(MPTCP_SKB_CB(skb)->map_seq, ack_seq)) {
mptcp_data_queue_ofo(msk, skb);
return false;
}
@@ -772,44 +774,42 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
{
+ u32 seq_delta, ack_seq = msk->ack_seq;
struct sock *sk = (struct sock *)msk;
struct sk_buff *skb, *tail;
bool moved = false;
struct rb_node *p;
- u64 end_seq;
p = rb_first(&msk->out_of_order_queue);
pr_debug("msk=%p empty=%d\n", msk, RB_EMPTY_ROOT(&msk->out_of_order_queue));
while (p) {
skb = rb_to_skb(p);
- if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq))
+ if (after(MPTCP_SKB_CB(skb)->map_seq, ack_seq))
break;
p = rb_next(p);
rb_erase(&skb->rbnode, &msk->out_of_order_queue);
- if (unlikely(!after64(MPTCP_SKB_CB(skb)->end_seq,
- msk->ack_seq))) {
+ if (unlikely(!after(MPTCP_SKB_CB(skb)->end_seq, ack_seq))) {
mptcp_drop(sk, skb);
MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA);
continue;
}
- end_seq = MPTCP_SKB_CB(skb)->end_seq;
+ seq_delta = MPTCP_SKB_CB(skb)->end_seq - ack_seq;
tail = skb_peek_tail(&sk->sk_receive_queue);
if (!tail || !mptcp_ooo_try_coalesce(msk, tail, skb)) {
- int delta = msk->ack_seq - MPTCP_SKB_CB(skb)->map_seq;
+ int delta = ack_seq - MPTCP_SKB_CB(skb)->map_seq;
/* skip overlapping data, if any */
- pr_debug("uncoalesced seq=%llx ack seq=%llx delta=%d\n",
- MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq,
- delta);
+ pr_debug("uncoalesced seq=%x ack seq=%x delta=%d\n",
+ MPTCP_SKB_CB(skb)->map_seq, ack_seq, delta);
MPTCP_SKB_CB(skb)->offset += delta;
MPTCP_SKB_CB(skb)->map_seq += delta;
__skb_queue_tail(&sk->sk_receive_queue, skb);
}
- msk->bytes_received += end_seq - msk->ack_seq;
- WRITE_ONCE(msk->ack_seq, end_seq);
+ msk->bytes_received += seq_delta;
+ WRITE_ONCE(msk->ack_seq, msk->ack_seq + seq_delta);
moved = true;
}
return moved;
@@ -2260,19 +2260,20 @@ static bool mptcp_move_skbs(struct sock *sk)
return enqueued;
}
-static unsigned int mptcp_inq_hint(const struct sock *sk)
+static int mptcp_inq_hint(const struct sock *sk)
{
const struct mptcp_sock *msk = mptcp_sk(sk);
const struct sk_buff *skb;
skb = skb_peek(&sk->sk_receive_queue);
if (skb) {
- u64 hint_val = READ_ONCE(msk->ack_seq) - MPTCP_SKB_CB(skb)->map_seq;
+ int hint_val = (u32)READ_ONCE(msk->ack_seq) -
+ MPTCP_SKB_CB(skb)->map_seq;
- if (hint_val >= INT_MAX)
- return INT_MAX;
+ if (hint_val < 0)
+ return -hint_val;
- return (unsigned int)hint_val;
+ return hint_val;
}
if (sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN))
@@ -2380,7 +2381,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
tcp_recv_timestamp(msg, sk, &tss);
if (cmsg_flags & MPTCP_CMSG_INQ) {
- unsigned int inq = mptcp_inq_hint(sk);
+ int inq = mptcp_inq_hint(sk);
put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq);
}
@@ -4601,11 +4602,23 @@ static int mptcp_napi_poll(struct napi_struct *napi, int budget)
return work_done;
}
+#define CHK_CB_FIELD(mptcp_field, tcp_field) \
+ ({ \
+ BUILD_BUG_ON(offsetof(struct mptcp_skb_cb, mptcp_field) != \
+ offsetof(struct tcp_skb_cb, tcp_field)); \
+ BUILD_BUG_ON(offsetofend(struct mptcp_skb_cb, mptcp_field) != \
+ offsetofend(struct tcp_skb_cb, tcp_field)); \
+ })
+
void __init mptcp_proto_init(void)
{
struct mptcp_delegated_action *delegated;
int cpu;
+ CHK_CB_FIELD(map_seq, seq);
+ CHK_CB_FIELD(end_seq, end_seq);
+ CHK_CB_FIELD(flags, tcp_flags);
+
mptcp_prot.h.hashinfo = tcp_prot.h.hashinfo;
if (percpu_counter_init(&mptcp_sockets_allocated, 0, GFP_KERNEL))
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 661600f8b573..ad906737ee9f 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -126,9 +126,10 @@
#define MPTCP_SYNC_SNDBUF 7
struct mptcp_skb_cb {
- u64 map_seq;
- u64 end_seq;
+ u32 map_seq;
+ u32 end_seq;
u32 offset;
+ u16 flags;
u8 has_rxtstamp;
u8 cant_coalesce;
};
--
2.53.0
* [RFC PATCH 3/6] tcp: expose the tcp_collapse_ofo_queue() helper to mptcp usage, too
From: Paolo Abeni @ 2026-04-20 10:29 UTC (permalink / raw)
To: mptcp; +Cc: yangang, geliang, matttbe
The end goal is to avoid duplicating the rather non-trivial collapsing
strategy at the MPTCP level.
After the previous patch, the mentioned helpers can process skbs sitting
in MPTCP-level queues without any CB-related adaptation.
The only additional adjustment needed is explicitly providing the OoO queue
reference, to cope with the different sk layout.
Additionally, rename the helper to clearly document its hybrid nature and
let it return the number of collapsed skbs, to allow proper accounting by
the future MPTCP caller.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
Note:
- this will need a significant amount of testing at the TCP level and
explicit approval from Eric, which I can't tell whether we can hope for.
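
For illustration only: the diff below also passes the scaling ratio
explicitly, since the collapse heuristic compares the window an skb's
truesize would be worth against its payload length. The sketch assumes the
ratio is a fraction of 256 (a shift by 8), as I believe
`__tcp_win_from_space()` does in current kernels; `win_from_space()` and
`wasteful()` are hypothetical user-space names:

```c
#include <stdbool.h>
#include <stdint.h>

#define RMEM_TO_WIN_SCALE 8 /* assumed to match TCP_RMEM_TO_WIN_SCALE */

/* Window space that a given amount of buffer memory accounts for. */
static int win_from_space(uint8_t scaling_ratio, int space)
{
	int64_t scaled = (int64_t)space * scaling_ratio;

	return (int)(scaled >> RMEM_TO_WIN_SCALE);
}

/* An skb is worth collapsing when its memory footprint, converted to
 * window space, still exceeds the payload it actually carries.
 */
static bool wasteful(uint8_t scaling_ratio, int truesize, int len)
{
	return win_from_space(scaling_ratio, truesize) > len;
}
```

With a ratio of 128 (one half), a 4096-byte truesize skb carrying 100 bytes
is flagged, while a 2304-byte truesize skb carrying 1460 bytes is not.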
---
include/net/tcp.h | 4 ++++
net/ipv4/tcp_input.c | 55 ++++++++++++++++++++++++++++----------------
2 files changed, 39 insertions(+), 20 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6156d1d068e1..4d23e75fc5cb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1828,6 +1828,10 @@ extern void tcp_openreq_init_rwin(struct request_sock *req,
void tcp_enter_memory_pressure(struct sock *sk);
void tcp_leave_memory_pressure(struct sock *sk);
+unsigned int xtcp_collapse_ofo_queue(struct sock *sk,
+ struct rb_root *out_of_order_queue,
+ struct sk_buff **ooo_last_skb,
+ u8 scaling_ratio);
static inline int keepalive_intvl_when(const struct tcp_sock *tp)
{
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7171442c3ed7..4daccc9c4795 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5725,16 +5725,22 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
/* Collapse contiguous sequence of skbs head..tail with
* sequence numbers start..end.
*
+ * sk can be either a TCP or an MPTCP socket.
+ *
* If tail is NULL, this means until the end of the queue.
*
* Segments with FIN/SYN are not collapsed (only because this
* simplifies code)
+ *
+ * Returns the number of collapsed skbs.
*/
-static void
-tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
- struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end)
+static unsigned int
+xtcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
+ struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end,
+ u8 scaling_ratio)
{
struct sk_buff *skb = head, *n;
+ unsigned int collapsed = 0;
struct sk_buff_head tmp;
bool end_of_skbs;
@@ -5750,6 +5756,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
/* No new bits? It is possible on ofo queue. */
if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
+ collapsed++;
skb = tcp_collapse_one(sk, skb, list, root);
if (!skb)
break;
@@ -5762,7 +5769,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
* overlaps to the next one and mptcp allow collapsing.
*/
if (!(TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) &&
- (tcp_win_from_space(sk, skb->truesize) > skb->len ||
+ (__tcp_win_from_space(scaling_ratio, skb->truesize) > skb->len ||
before(TCP_SKB_CB(skb)->seq, start))) {
end_of_skbs = false;
break;
@@ -5782,7 +5789,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
if (end_of_skbs ||
(TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) ||
!skb_frags_readable(skb))
- return;
+ return collapsed;
__skb_queue_head_init(&tmp);
@@ -5819,6 +5826,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
start += size;
}
if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
+ collapsed++;
skb = tcp_collapse_one(sk, skb, list, root);
if (!skb ||
skb == tail ||
@@ -5832,23 +5840,26 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
end:
skb_queue_walk_safe(&tmp, skb, n)
tcp_rbtree_insert(root, skb);
+ return collapsed;
}
/* Collapse ofo queue. Algorithm: select contiguous sequence of skbs
- * and tcp_collapse() them until all the queue is collapsed.
+ * and xtcp_collapse() them until all the queue is collapsed.
*/
-static void tcp_collapse_ofo_queue(struct sock *sk)
+unsigned int xtcp_collapse_ofo_queue(struct sock *sk,
+ struct rb_root *ooo_queue,
+ struct sk_buff **ooo_last_skb,
+ u8 scaling_ratio)
{
- struct tcp_sock *tp = tcp_sk(sk);
- u32 range_truesize, sum_tiny = 0;
+ u32 range_truesize, sum_tiny = 0, collapsed = 0;
struct sk_buff *skb, *head;
u32 start, end;
- skb = skb_rb_first(&tp->out_of_order_queue);
+ skb = skb_rb_first(ooo_queue);
new_range:
if (!skb) {
- tp->ooo_last_skb = skb_rb_last(&tp->out_of_order_queue);
- return;
+ *ooo_last_skb = skb_rb_last(ooo_queue);
+ return collapsed;
}
start = TCP_SKB_CB(skb)->seq;
end = TCP_SKB_CB(skb)->end_seq;
@@ -5866,12 +5877,13 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
/* Do not attempt collapsing tiny skbs */
if (range_truesize != head->truesize ||
end - start >= SKB_WITH_OVERHEAD(PAGE_SIZE)) {
- tcp_collapse(sk, NULL, &tp->out_of_order_queue,
- head, skb, start, end);
+ collapsed += xtcp_collapse(sk, NULL, ooo_queue,
+ head, skb, start, end,
+ scaling_ratio);
} else {
sum_tiny += range_truesize;
if (sum_tiny > sk->sk_rcvbuf >> 3)
- return;
+ return collapsed;
}
goto new_range;
}
@@ -5882,6 +5894,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
if (after(TCP_SKB_CB(skb)->end_seq, end))
end = TCP_SKB_CB(skb)->end_seq;
}
+ return collapsed;
}
/*
@@ -5969,12 +5982,14 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
if (tcp_can_ingest(sk, in_skb))
return 0;
- tcp_collapse_ofo_queue(sk);
+ xtcp_collapse_ofo_queue(sk, &tp->out_of_order_queue,
+ &tp->ooo_last_skb, tp->scaling_ratio);
if (!skb_queue_empty(&sk->sk_receive_queue))
- tcp_collapse(sk, &sk->sk_receive_queue, NULL,
- skb_peek(&sk->sk_receive_queue),
- NULL,
- tp->copied_seq, tp->rcv_nxt);
+ xtcp_collapse(sk, &sk->sk_receive_queue, NULL,
+ skb_peek(&sk->sk_receive_queue),
+ NULL,
+ tp->copied_seq, tp->rcv_nxt,
+ tp->scaling_ratio);
if (tcp_can_ingest(sk, in_skb))
return 0;
--
2.53.0
* [RFC PATCH 4/6] mptcp: implemented OoO queue pruning
From: Paolo Abeni @ 2026-04-20 10:29 UTC (permalink / raw)
To: mptcp; +Cc: yangang, geliang, matttbe
Leverage the hybrid helper to implement OoO queue pruning at
ingress time.
If the msk is owned by user space when the skb arrives, perform the
pruning in the release_cb. The prune check is additionally performed
when the skb reaches the msk-level queues.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
Notes:
- Similarly to patch 'mptcp: move checks vs rcvbuf size earlier in the RX
path', some cleanup/tuning in mptcp_over_limit() will be needed
- Pruning in the release_cb() is likely not needed and should probably be
removed (after more testing).
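
For illustration only, a user-space model of the tail pruning step. The
rbtree is replaced by a sorted array, and `struct seg_model`,
`struct ofo_model` and `prune_tail()` are hypothetical names; the sketch
only mirrors the stop conditions used by mptcp_prune_ofo_queue():

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct seg_model {
	uint32_t map_seq; /* first sequence number covered by the segment */
	int truesize;     /* memory charged for the segment */
};

struct ofo_model {
	struct seg_model *segs; /* sorted by increasing map_seq */
	size_t nr;
	int rmem_alloc;
	int rcvbuf;
};

/* True if a is after b, modulo 2^32. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Drop segments from the tail until memory is back under rcvbuf.
 * Returns the number of dropped segments.
 */
static size_t prune_tail(struct ofo_model *q, uint32_t seq)
{
	size_t dropped = 0;

	while (q->nr) {
		struct seg_model *last = &q->segs[q->nr - 1];

		/* The incoming data would land last in the queue: stop. */
		if (seq_after(seq, last->map_seq))
			break;

		q->rmem_alloc -= last->truesize;
		q->nr--;
		dropped++;
		if (q->rmem_alloc < q->rcvbuf)
			break;
	}
	return dropped;
}
```

Dropping from the tail keeps the data closest to the current ack sequence,
which is the data most likely to unblock the receiver.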
---
net/mptcp/mib.c | 3 +++
net/mptcp/mib.h | 3 +++
net/mptcp/options.c | 22 +++++++++++++---
net/mptcp/protocol.c | 61 ++++++++++++++++++++++++++++++++++++++++++++
net/mptcp/protocol.h | 2 ++
5 files changed, 87 insertions(+), 4 deletions(-)
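
For illustration only, a user-space model of the deferral described in the
commit message: when the msk is owned by user space the prune is only
flagged, and it runs later from the release callback. All names here are
hypothetical, and locking is deliberately left out:

```c
#include <stdbool.h>

#define PRUNE_BIT 8 /* models the MPTCP_PRUNE flag number */

struct msk_model {
	unsigned long cb_flags;
	bool owned_by_user;
	unsigned int prune_runs; /* counts how many times pruning ran */
};

/* Ingress path, reached once the receive memory is over the limit. */
static void incoming_over_limit(struct msk_model *msk)
{
	if (!msk->owned_by_user)
		msk->prune_runs++;                 /* prune right away */
	else
		msk->cb_flags |= 1UL << PRUNE_BIT; /* defer to release */
}

/* Models the release callback: user space drops ownership of the socket
 * and any deferred prune is executed exactly once.
 */
static void release_cb(struct msk_model *msk)
{
	msk->owned_by_user = false;
	if (msk->cb_flags & (1UL << PRUNE_BIT)) {
		msk->cb_flags &= ~(1UL << PRUNE_BIT);
		msk->prune_runs++;
	}
}
```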
diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index f23fda0c55a7..5128feec942c 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -85,6 +85,9 @@ static const struct snmp_mib mptcp_snmp_list[] = {
SNMP_MIB_ITEM("SimultConnectFallback", MPTCP_MIB_SIMULTCONNFALLBACK),
SNMP_MIB_ITEM("FallbackFailed", MPTCP_MIB_FALLBACKFAILED),
SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
+ SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
+ SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
+ SNMP_MIB_ITEM("RcvCollapsed", MPTCP_MIB_RCVCOLLAPSED),
};
/* mptcp_mib_alloc - allocate percpu mib counters
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 812218b5ed2b..2f8f68e33ac5 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -88,6 +88,9 @@ enum linux_mptcp_mib_field {
MPTCP_MIB_SIMULTCONNFALLBACK, /* Simultaneous connect */
MPTCP_MIB_FALLBACKFAILED, /* Can't fallback due to msk status */
MPTCP_MIB_WINPROBE, /* MPTCP-level zero window probe */
+ MPTCP_MIB_OFO_PRUNED, /* MPTCP-level OoO queue pruned */
+ MPTCP_MIB_RCVPRUNED, /* Dropped due to memory constraints */
+ MPTCP_MIB_RCVCOLLAPSED, /* Collapsed due to memory pressure */
__MPTCP_MIB_MAX
};
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index a6d290427611..a6a6da262413 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1158,15 +1158,29 @@ static bool add_addr_hmac_valid(struct mptcp_sock *msk,
return hmac == mp_opt->ahmac;
}
-static bool mptcp_over_limit(const struct sock *sk, struct sk_buff *skb)
+static bool mptcp_over_limit(struct sock *sk, struct sk_buff *skb, u32 seq)
{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ bool ret = true;
int limit;
if (!skb->len)
return false;
+ /* Allow some slack for backlog processing */
limit = READ_ONCE(sk->sk_rcvbuf) << 1;
- return sk_rmem_alloc_get(sk) > limit;
+ if (sk_rmem_alloc_get(sk) < limit)
+ return false;
+
+ mptcp_data_lock(sk);
+ if (!sock_owned_by_user(sk)) {
+ __mptcp_check_prune(sk, seq);
+ ret = sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf);
+ } else {
+ __set_bit(MPTCP_PRUNE, &msk->cb_flags);
+ }
+ mptcp_data_unlock(sk);
+ return ret;
}
/* Return false when the caller must drop the packet, i.e. in case of error,
@@ -1197,7 +1211,7 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
__mptcp_data_acked(subflow->conn);
mptcp_data_unlock(subflow->conn);
- if (mptcp_over_limit(subflow->conn, skb))
+ if (mptcp_over_limit(subflow->conn, skb, msk->ack_seq))
return false;
return true;
}
@@ -1277,7 +1291,7 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
return true;
}
- if (mptcp_over_limit(subflow->conn, skb))
+ if (mptcp_over_limit(subflow->conn, skb, mp_opt.use_map ? mp_opt.data_seq : msk->ack_seq))
return false;
mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 800aa7d9408e..9cf135e04d69 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -374,6 +374,59 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
skb_dst_drop(skb);
}
+/* "Inspired" by the TCP version */
+static void mptcp_prune_ofo_queue(struct sock *sk, u32 seq)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct rb_node *node, *prev;
+ bool pruned = false;
+
+ if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
+ return;
+
+ node = &msk->ooo_last_skb->rbnode;
+
+ do {
+ struct sk_buff *skb = rb_to_skb(node);
+
+ /* If incoming skb would land last in ofo queue, stop pruning. */
+ if (after(seq, MPTCP_SKB_CB(skb)->map_seq))
+ break;
+
+ pruned = true;
+ prev = rb_prev(node);
+ rb_erase(node, &msk->out_of_order_queue);
+ mptcp_drop(sk, skb);
+ msk->ooo_last_skb = rb_to_skb(prev);
+ if (atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf)
+ break;
+
+ node = prev;
+ } while (node);
+
+ if (pruned)
+ MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED);
+}
+
+bool __mptcp_check_prune(struct sock *sk, u32 seq)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ unsigned int dropped;
+
+ if (likely(atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf))
+ return false;
+
+ dropped = xtcp_collapse_ofo_queue(sk, &msk->out_of_order_queue,
+ &msk->ooo_last_skb, msk->scaling_ratio);
+ if (dropped)
+ MPTCP_ADD_STATS(sock_net(sk), MPTCP_MIB_RCVCOLLAPSED, dropped);
+ if (likely(atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf))
+ return false;
+
+ mptcp_prune_ofo_queue(sk, seq);
+ return atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf;
+}
+
static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
{
u32 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
@@ -383,6 +436,12 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
mptcp_borrow_fwdmem(sk, skb);
+ if (__mptcp_check_prune(sk, MPTCP_SKB_CB(skb)->map_seq)) {
+ MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
+ mptcp_drop(sk, skb);
+ return false;
+ }
+
if (MPTCP_SKB_CB(skb)->map_seq == ack_seq) {
/* in sequence */
msk->bytes_received += copy_len;
@@ -3693,6 +3752,8 @@ static void mptcp_release_cb(struct sock *sk)
__mptcp_error_report(sk);
if (__test_and_clear_bit(MPTCP_SYNC_SNDBUF, &msk->cb_flags))
__mptcp_sync_sndbuf(sk);
+ if (__test_and_clear_bit(MPTCP_PRUNE, &msk->cb_flags))
+ __mptcp_check_prune(sk, msk->ack_seq - 1);
}
}
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index ad906737ee9f..e4bc77de725e 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -124,6 +124,7 @@
#define MPTCP_FLUSH_JOIN_LIST 5
#define MPTCP_SYNC_STATE 6
#define MPTCP_SYNC_SNDBUF 7
+#define MPTCP_PRUNE 8
struct mptcp_skb_cb {
u32 map_seq;
@@ -828,6 +829,7 @@ bool __mptcp_close(struct sock *sk, long timeout);
void mptcp_cancel_work(struct sock *sk);
void __mptcp_unaccepted_force_close(struct sock *sk);
void mptcp_set_state(struct sock *sk, int state);
+bool __mptcp_check_prune(struct sock *sk, u32 seq);
bool mptcp_addresses_equal(const struct mptcp_addr_info *a,
const struct mptcp_addr_info *b, bool use_port);
--
2.53.0
* [RFC PATCH 5/6] mptcp: refine coalescing conditions
From: Paolo Abeni @ 2026-04-20 10:29 UTC (permalink / raw)
To: mptcp; +Cc: yangang, geliang, matttbe
The current conditions prevent any coalescing when the receive buffer
is small. Ensure that MPTCP can always aggregate at least up to the max
GSO size.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
Note:
- alternatively we could drop the rcvbuf-related check entirely; to be
verified against the simult_flows tests
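
For illustration only, the arithmetic of the refined length check, assuming
`limit` is derived from the receive buffer size as the commit message
implies. `coalesce_cap()` and `can_coalesce_len()` are hypothetical
user-space names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper bound for the coalesced length: one eighth of the limit, but
 * never less than 64K - 1.
 */
static int coalesce_cap(int limit)
{
	int eighth = limit >> 3;

	return eighth > UINT16_MAX ? eighth : UINT16_MAX;
}

/* Length-only part of the coalescing condition. */
static bool can_coalesce_len(int to_len, int from_len, int limit)
{
	return to_len + from_len <= coalesce_cap(limit);
}
```

With a 16 KiB limit the old bound was 2048 bytes, so two 1460-byte segments
could never be merged; with the floor in place they can.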
---
net/mptcp/protocol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 9cf135e04d69..8ddd4bb5172e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -161,7 +161,7 @@ static bool __mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
if (unlikely(MPTCP_SKB_CB(to)->cant_coalesce) ||
MPTCP_SKB_CB(from)->offset ||
- ((to->len + from->len) > (limit >> 3)) ||
+ ((to->len + from->len) > max(U16_MAX, (limit >> 3))) ||
!skb_try_coalesce(to, from, fragstolen, delta))
return false;
--
2.53.0
* [RFC PATCH 6/6] mptcp: unclone skbs before coalescing them, when needed
From: Paolo Abeni @ 2026-04-20 10:29 UTC (permalink / raw)
To: mptcp; +Cc: yangang, geliang, matttbe
The self-tests can attempt skb coalescing on cloned skbs, as the
forwarding path uses only veth devices. Cloning in turn prevents
coalescing, making the memory pressure scenario more extreme. Unclone
the target skb before coalescing, when needed.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
Possibly we could obtain the same effect with some netem magic, which
would be better.
---
net/mptcp/protocol.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 8ddd4bb5172e..42af9f9e935d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -162,6 +162,7 @@ static bool __mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
if (unlikely(MPTCP_SKB_CB(to)->cant_coalesce) ||
MPTCP_SKB_CB(from)->offset ||
((to->len + from->len) > max(U16_MAX, (limit >> 3))) ||
+ (skb_cloned(to) && skb_unclone(to, GFP_ATOMIC)) ||
!skb_try_coalesce(to, from, fragstolen, delta))
return false;
--
2.53.0
* Re: [RFC PATCH 0/6] mptcp: address stall under memory pressure
From: MPTCP CI @ 2026-04-20 11:39 UTC (permalink / raw)
To: Paolo Abeni; +Cc: mptcp
Hi Paolo,
Thank you for your modifications, that's great!
Our CI did some validations and here is its report:
- KVM Validation: normal (except selftest_mptcp_join): Unstable: 1 failed test(s): selftest_mptcp_connect_checksum ⚠️
- KVM Validation: normal (only selftest_mptcp_join): Unstable: 1 failed test(s): selftest_mptcp_join ⚠️
- KVM Validation: debug (except selftest_mptcp_join): Unstable: 2 failed test(s): packetdrill_dss selftest_mptcp_connect_checksum ⚠️
- KVM Validation: debug (only selftest_mptcp_join): Unstable: 1 failed test(s): selftest_mptcp_join ⚠️
- KVM Validation: btf-normal (only bpftest_all): Unstable: 2 failed test(s): bpftest_test_progs-no_alu32_mptcp bpftest_test_progs_mptcp ⚠️
- KVM Validation: btf-debug (only bpftest_all): Unstable: 2 failed test(s): bpftest_test_progs-cpuv4_mptcp bpftest_test_progs_mptcp ⚠️
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/24661753570
Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/d6ebb747f78c
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1083242
If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:
$ cd [kernel source code]
$ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
--pull always mptcp/mptcp-upstream-virtme-docker:latest \
auto-normal
For more details:
https://github.com/multipath-tcp/mptcp-upstream-virtme-docker
Please note that despite all the efforts already made to keep the test
suite stable when executed on a public CI like this one, it is possible
that some reported issues are not due to your modifications. Still, do not
hesitate to help us improve that ;-)
Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)