* [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
2026-04-01 8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
@ 2026-04-01 8:54 ` Gang Yan
2026-04-01 20:12 ` Paolo Abeni
2026-04-01 8:54 ` [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue Gang Yan
` (3 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: Gang Yan @ 2026-04-01 8:54 UTC (permalink / raw)
To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang
From: Gang Yan <yangang@kylinos.cn>
This patch replaces the original list-based backlog_list with a
red-black tree (RB-tree) based backlog_queue for MPTCP.
Add key helper functions:
- mptcp_queue_backlog: Insert skb into backlog_queue in order of
map_seq via RB-tree
- mptcp_backlog_queue_to_list: Convert RB-tree based backlog_queue to
list_head
- mptcp_backlog_list_to_queue: Convert list_head back to RB-tree based
backlog_queue
Adapt existing backlog operation logic:
- Update mptcp_can_spool_backlog to splice RB-tree backlog to list
via new helper
- Adjust mptcp_backlog_spooled to restore list skbs back to RB-tree
backlog_queue
- Modify mptcp_close_ssk and mptcp_recv_skb to check RB-tree emptiness
instead of list
- Update mptcp_backlog_purge to use RB-tree to list conversion for
backlog cleanup
Furthermore, this patch also initializes msk->backlog_unaccounted in
'__mptcp_init_sock()'.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:
Hi Matt, Paolo:
I have reviewed the suggestions from the AI review and found several of
them valuable; this replies to those comments.
- Missing the initialization of msk->backlog_unaccounted.
I've fixed this in v5.
- Potential performance problems:
The main goal of this patch is to replace the existing backlog_list
with an RB-tree structure to resolve the stall issue. The
replacement itself is intended to be functionally equivalent—if the
RB-tree usage is correct, the stall problem should be addressed.
Performance-wise, the operations under spinlock can be optimized
later (e.g., removing the temporary list or refining locks). I've
tested with simult_subflows.sh and saw no significant regression,
so I'd prefer to keep this change as-is in this series.
- RB-tree traversal crash in 'mptcp_close_ssk()'
After reviewing the code, the crash does not seem to happen. The walk
does not modify the RB-tree structure — it only clears skb->sk
pointers.
Also, since this is pre-existing upstream code, I think the concern
should be addressed separately.
Thanks,
Gang
Changelog:
v5:
- Fix missing initialization of msk->backlog_unaccounted in
__mptcp_init_sock() to match the commit message.
---
net/mptcp/protocol.c | 73 +++++++++++++++++++++++++++++++++++++-------
net/mptcp/protocol.h | 2 +-
2 files changed, 63 insertions(+), 12 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 46fedfa05a54..e6f83c8afc76 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -653,6 +653,33 @@ static void mptcp_dss_corruption(struct mptcp_sock *msk, struct sock *ssk)
}
}
+static int mptcp_queue_backlog(struct mptcp_sock *msk, struct sk_buff *skb)
+{
+ u64 seq = MPTCP_SKB_CB(skb)->map_seq;
+ struct rb_node **p, *parent = NULL;
+
+ p = &msk->backlog_queue.rb_node;
+ if (RB_EMPTY_ROOT(&msk->backlog_queue))
+ goto insert;
+
+ while (*p) {
+ struct sk_buff *s;
+
+ parent = *p;
+ s = rb_to_skb(parent);
+
+ if (before64(seq, MPTCP_SKB_CB(s)->map_seq))
+ p = &parent->rb_left;
+ else
+ p = &parent->rb_right;
+ }
+
+insert:
+ rb_link_node(&skb->rbnode, parent, p);
+ rb_insert_color(&skb->rbnode, &msk->backlog_queue);
+ return 0;
+}
+
static void __mptcp_add_backlog(struct sock *sk,
struct mptcp_subflow_context *subflow,
struct sk_buff *skb)
@@ -669,8 +696,8 @@ static void __mptcp_add_backlog(struct sock *sk,
}
/* Try to coalesce with the last skb in our backlog */
- if (!list_empty(&msk->backlog_list))
- tail = list_last_entry(&msk->backlog_list, struct sk_buff, list);
+ if (!RB_EMPTY_ROOT(&msk->backlog_queue))
+ tail = skb_rb_last(&msk->backlog_queue);
if (tail && MPTCP_SKB_CB(skb)->map_seq == MPTCP_SKB_CB(tail)->end_seq &&
ssk == tail->sk &&
@@ -681,7 +708,7 @@ static void __mptcp_add_backlog(struct sock *sk,
goto account;
}
- list_add_tail(&skb->list, &msk->backlog_list);
+ mptcp_queue_backlog(msk, skb);
mptcp_subflow_lend_fwdmem(subflow, skb);
delta = skb->truesize;
@@ -2199,6 +2226,29 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
return moved;
}
+static void mptcp_backlog_queue_to_list(struct mptcp_sock *msk,
+ struct list_head *list)
+{
+ struct sk_buff *skb;
+
+ while ((skb = skb_rb_first(&msk->backlog_queue)) != NULL) {
+ rb_erase(&skb->rbnode, &msk->backlog_queue);
+ RB_CLEAR_NODE(&skb->rbnode);
+ list_add_tail(&skb->list, list);
+ }
+}
+
+static void mptcp_backlog_list_to_queue(struct mptcp_sock *msk,
+ struct list_head *list)
+{
+ struct sk_buff *skb, *tmp;
+
+ list_for_each_entry_safe(skb, tmp, list, list) {
+ list_del(&skb->list);
+ mptcp_queue_backlog(msk, skb);
+ }
+}
+
static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
{
struct mptcp_sock *msk = mptcp_sk(sk);
@@ -2210,12 +2260,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
mem_cgroup_from_sk(sk));
/* Don't spool the backlog if the rcvbuf is full. */
- if (list_empty(&msk->backlog_list) ||
+ if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
return false;
INIT_LIST_HEAD(skbs);
- list_splice_init(&msk->backlog_list, skbs);
+ mptcp_backlog_queue_to_list(msk, skbs);
return true;
}
@@ -2225,7 +2275,7 @@ static void mptcp_backlog_spooled(struct sock *sk, u32 moved,
struct mptcp_sock *msk = mptcp_sk(sk);
WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
- list_splice(skbs, &msk->backlog_list);
+ mptcp_backlog_list_to_queue(msk, skbs);
}
static bool mptcp_move_skbs(struct sock *sk)
@@ -2311,7 +2361,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
copied += bytes_read;
- if (!list_empty(&msk->backlog_list) && mptcp_move_skbs(sk))
+ if (!RB_EMPTY_ROOT(&msk->backlog_queue) && mptcp_move_skbs(sk))
continue;
/* only the MPTCP socket status is relevant here. The exit
@@ -2640,7 +2690,7 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
/* Remove any reference from the backlog to this ssk; backlog skbs consume
* space in the msk receive queue, no need to touch sk->sk_rmem_alloc
*/
- list_for_each_entry(skb, &msk->backlog_list, list) {
+ skb_rbtree_walk(skb, &msk->backlog_queue) {
if (skb->sk != ssk)
continue;
@@ -2896,7 +2946,7 @@ static void mptcp_backlog_purge(struct sock *sk)
LIST_HEAD(backlog);
mptcp_data_lock(sk);
- list_splice_init(&msk->backlog_list, &backlog);
+ mptcp_backlog_queue_to_list(msk, &backlog);
msk->backlog_len = 0;
mptcp_data_unlock(sk);
@@ -2999,13 +3049,14 @@ static void __mptcp_init_sock(struct sock *sk)
INIT_LIST_HEAD(&msk->conn_list);
INIT_LIST_HEAD(&msk->join_list);
INIT_LIST_HEAD(&msk->rtx_queue);
- INIT_LIST_HEAD(&msk->backlog_list);
+ msk->backlog_queue = RB_ROOT;
INIT_WORK(&msk->work, mptcp_worker);
msk->out_of_order_queue = RB_ROOT;
msk->first_pending = NULL;
msk->timer_ival = TCP_RTO_MIN;
msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
msk->backlog_len = 0;
+ msk->backlog_unaccounted = 0;
mptcp_init_rtt_est(msk);
WRITE_ONCE(msk->first, NULL);
@@ -4335,7 +4386,7 @@ static struct sk_buff *mptcp_recv_skb(struct sock *sk, u32 *off)
struct sk_buff *skb;
u32 offset;
- if (!list_empty(&msk->backlog_list))
+ if (!RB_EMPTY_ROOT(&msk->backlog_queue))
mptcp_move_skbs(sk);
while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index f5d4d7d030f2..f0eaba2c61fa 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -372,7 +372,7 @@ struct mptcp_sock {
* allow_join
*/
- struct list_head backlog_list; /* protected by the data lock */
+ struct rb_root backlog_queue; /* protected by the data lock */
u32 backlog_len;
u32 backlog_unaccounted;
};
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
2026-04-01 8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
@ 2026-04-01 20:12 ` Paolo Abeni
2026-04-15 7:17 ` Paolo Abeni
0 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2026-04-01 20:12 UTC (permalink / raw)
To: Gang Yan, mptcp; +Cc: Gang Yan, Geliang Tang
On 4/1/26 10:54 AM, Gang Yan wrote:
> @@ -2199,6 +2226,29 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
> return moved;
> }
>
> +static void mptcp_backlog_queue_to_list(struct mptcp_sock *msk,
> + struct list_head *list)
> +{
> + struct sk_buff *skb;
> +
> + while ((skb = skb_rb_first(&msk->backlog_queue)) != NULL) {
> + rb_erase(&skb->rbnode, &msk->backlog_queue);
> + RB_CLEAR_NODE(&skb->rbnode);
> + list_add_tail(&skb->list, list);
> + }
This is O(N * log(N))
> +}
> +
> +static void mptcp_backlog_list_to_queue(struct mptcp_sock *msk,
> + struct list_head *list)
> +{
> + struct sk_buff *skb, *tmp;
> +
> + list_for_each_entry_safe(skb, tmp, list, list) {
> + list_del(&skb->list);
> + mptcp_queue_backlog(msk, skb);
> + }
> +}
> +
> static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
> {
> struct mptcp_sock *msk = mptcp_sk(sk);
> @@ -2210,12 +2260,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
> mem_cgroup_from_sk(sk));
>
> /* Don't spool the backlog if the rcvbuf is full. */
> - if (list_empty(&msk->backlog_list) ||
> + if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
> sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
> return false;
>
> INIT_LIST_HEAD(skbs);
> - list_splice_init(&msk->backlog_list, skbs);
> + mptcp_backlog_queue_to_list(msk, skbs);
This is under a spinlock with BH disabled, and N is potentially quite
high. This code is problematic as it could cause stall and delay BH for
a significant amount of time.
Note that there are more chunks later and in later patches with similar
problem.
I don't have a good solution on top of my head, I'll try to go back to
the design table.
/P
* Re: [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
2026-04-01 20:12 ` Paolo Abeni
@ 2026-04-15 7:17 ` Paolo Abeni
2026-04-15 8:21 ` gang.yan
0 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2026-04-15 7:17 UTC (permalink / raw)
To: Gang Yan, mptcp; +Cc: Gang Yan, Geliang Tang
On 4/1/26 10:12 PM, Paolo Abeni wrote:
> On 4/1/26 10:54 AM, Gang Yan wrote:
>> @@ -2210,12 +2260,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
>> mem_cgroup_from_sk(sk));
>>
>> /* Don't spool the backlog if the rcvbuf is full. */
>> - if (list_empty(&msk->backlog_list) ||
>> + if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
>> sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
>> return false;
>>
>> INIT_LIST_HEAD(skbs);
>> - list_splice_init(&msk->backlog_list, skbs);
>> + mptcp_backlog_queue_to_list(msk, skbs);
>
> This is under a spinlock with BH disabled, and N is potentially quite
> high. This code is problematic as it could cause stall and delay BH for
> a significant amount of time.
>
> Note that there are more chunks later and in later patches with similar
> problem.
>
> I don't have a good solution on top of my head, I'll try to go back to
> the design table.
I had some time to take a better look at the whole problem and how it
manifests itself in the included self-tests, including the initial one
using sendfile().
AFAICS the stall in the self-tests in patch 5/5 is caused by the sysctl
setting taking effect on the server side _after_ the 3whs has
negotiated the initial window; the rcvbuf suddenly shrinks from ~128K to
4K and almost every incoming packet is dropped.
The test itself is really an extreme condition; we should accept any
implementation able to complete the transfer - even at very low speed.
The initial test-case, the one using sendfile(), operates in a
significantly different way: it generates 1-byte-long DSS mappings,
preventing coalescing (I have not yet understood why coalescing does not
happen), which causes an extremely bad skb->truesize/skb->len ratio,
which in turn causes the initial window to be way too "optimistic",
extreme rcvbuf squeeze at runtime, and a behavior similar to the previous one.
In both cases simply dropping incoming packets early/in
mptcp_incoming_options() when the rcvbuf is full does not solve the
issue: if the rcvbuf is used (mostly) by the OoO queue, retransmissions
always hit the same rcvbuf condition and are also dropped.
The root cause of both scenarios is that some very unlikely conditions
call for retracting the announced rcv wnd, but MPTCP can't do that.
I am starting to think that we need a strategy similar to plain TCP
to deal with such scenarios: when the rcvbuf is full we need to condense
and eventually prune the OoO queue (see tcp_prune_queue(),
tcp_collapse_ofo_queue(), tcp_collapse()).
The above has some serious downsides, i.e. it could lead to a large slice
of almost-duplicate complex code, as it is difficult to abstract the MPTCP
vs TCP differences (CB, seq numbers, drop reasons). Still under investigation.
/P
* Re: [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
2026-04-15 7:17 ` Paolo Abeni
@ 2026-04-15 8:21 ` gang.yan
2026-04-20 9:34 ` Paolo Abeni
0 siblings, 1 reply; 18+ messages in thread
From: gang.yan @ 2026-04-15 8:21 UTC (permalink / raw)
To: Paolo Abeni, mptcp
April 15, 2026 at 3:17 PM, "Paolo Abeni" <pabeni@redhat.com> wrote:
>
> On 4/1/26 10:12 PM, Paolo Abeni wrote:
>
> >
> > On 4/1/26 10:54 AM, Gang Yan wrote:
> >
> > >
> > > @@ -2210,12 +2260,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
> > > mem_cgroup_from_sk(sk));
> > >
> > > /* Don't spool the backlog if the rcvbuf is full. */
> > > - if (list_empty(&msk->backlog_list) ||
> > > + if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
> > > sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
> > > return false;
> > >
> > > INIT_LIST_HEAD(skbs);
> > > - list_splice_init(&msk->backlog_list, skbs);
> > > + mptcp_backlog_queue_to_list(msk, skbs);
> > >
> >
> > This is under a spinlock with BH disabled, and N is potentially quite
> > high. This code is problematic as it could cause stall and delay BH for
> > a significant amount of time.
> >
> > Note that there are more chunks later and in later patches with similar
> > problem.
> >
> > I don't have a good solution on top of my head, I'll try to go back to
> > the design table.
> >
> I had some time to have a better look at the whole problem and how it
> manifest itself in the included self-tests - comprising the initial one
> using sendfile().
>
> AFAICS the stall in the self-tests in patch 5/5 is caused by the sysctl
> setting taking effect on the server side _after_ that the 3whs
> negotiated the initial window; the rcvbuf suddenly shrinks from ~128K to
> 4K and almost every incoming packet is dropped.
>
> The test itself is really an extreme condition; we should accept any
> implementation able to complete the transfer - even at very low speed.
>
> The initial test-case, the one using sendfile(), operates in a
> significantly different way: it generates 1-bytes len DSS preventing
> coalescing (I've not understood yet why coalescing does not happen),
> which cause an extremely bad skb->truesize/skb->len ratio, which in turn
> causes the initial window being way too "optimistic", extreme rcvbuf
> squeeze at runtime and a behavior similar to the previous one.
>
> In both cases simply dropping incoming packets early/in
> mptcp_incoming_options() when the rcvbuf is full does not solve the
> issue: if the rcvbuf is used (mostly) by the OoO queue, retransmissions
> always hit the same rcvbuf condition and are also dropped.
>
> The root cause of both scenario is that some very unlikely condition
> calls to retract the announced rcv wnd, but mptcp can't do that.
>
> Currently I start to think that we need a strategy similar to plain TCP
> to deal with such scenario: when rcvbuf is full we need to condense and
> eventually prune the OoO queue (see tcp_prune_queue(),
> tcp_collapse_ofo_queue(), tcp_collapse()).
>
> The above has some serious downsides, i.e. it could lead to large slice
> of almost duplicate complex code, as is diff to abstract the MPTCP vs
> TCP differences (CB, seq numbers, drop reasons). Still under investigation.
>
> /P
>
Hi, Paolo
Thanks a lot for your detailed and insightful analysis of this problem!
I fully agree with your points: MPTCP should allow the transfer to complete
even under extremely slow or harsh conditions, just as you mentioned.
Regarding the TCP-style mechanisms like 'tcp_prune_queue' for handling full
rcvbuf conditions — I have actually attempted similar implementations before.
As you pointed out, this approach is indeed highly complex for MPTCP. There
are far too many aspects that require careful modification and consideration,
making it extremely challenging to implement correctly.
I’ve also made some progress on this issue recently. I implemented a
solution based on the backlog list structure: when a transfer stall is
detected, I traverse the backlog_list and move only the skbs where
'map_seq == ack_seq' into the rcv_queue to unblock the transfer.
This approach effectively prevents unbounded growth of rmem_alloc, and has nearly
no impact on performance. The core logic is roughly as follows:
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 17b9a8c13ebf..c733dd2aa85f 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2189,6 +2189,18 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
msk->rcvq_space.time = mstamp;
}
+static bool mptcp_stop_spool_backlog(struct sock *sk)
+{
+ return sk_rmem_alloc_get(sk) > sk->sk_rcvbuf &&
+ !skb_queue_empty(&sk->sk_receive_queue);
+}
+
+static bool mptcp_recv_stall(struct sock *sk)
+{
+ return sk_rmem_alloc_get(sk) > sk->sk_rcvbuf &&
+ skb_queue_empty(&sk->sk_receive_queue);
+}
+
static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delta)
{
struct sk_buff *skb = list_first_entry(skbs, struct sk_buff, list);
@@ -2198,7 +2210,7 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
*delta = 0;
while (1) {
/* If the msk recvbuf is full stop, don't drop */
- if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+ if (mptcp_stop_spool_backlog(sk))
break;
prefetch(skb->next);
@@ -2228,12 +2240,32 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
DEBUG_NET_WARN_ON_ONCE(msk->backlog_unaccounted && sk->sk_socket &&
mem_cgroup_from_sk(sk));
- /* Don't spool the backlog if the rcvbuf is full. */
+ /* Don't spool the backlog if the rcvbuf is full when rcv_queue is
+ * not empty.
+ */
if (list_empty(&msk->backlog_list) ||
- sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+ mptcp_stop_spool_backlog(sk))
return false;
INIT_LIST_HEAD(skbs);
+
+ /* If the rcvbuf is full and the rcv_queue is empty, the transmission
+ * will stall. To prevent this, move the ack_seq skb from the
+ * backlog_list to the receive queue.
+ */
+ if (mptcp_recv_stall(sk)) {
+ struct sk_buff *iter, *tmp;
+
+ list_for_each_entry_safe(iter, tmp, &msk->backlog_list, list) {
+ if (MPTCP_SKB_CB(iter)->map_seq == msk->ack_seq) {
+ list_del(&iter->list);
+ list_add(&iter->list, skbs);
+ break;
+ }
+ }
+ return !list_empty(skbs);
+ }
+
list_splice_init(&msk->backlog_list, skbs);
return true;
}
I would really appreciate it if you could share your thoughts on this approach when
you have time.
Cheers,
Gang
* Re: [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
2026-04-15 8:21 ` gang.yan
@ 2026-04-20 9:34 ` Paolo Abeni
2026-04-20 9:41 ` gang.yan
0 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2026-04-20 9:34 UTC (permalink / raw)
To: gang.yan, mptcp
On 4/15/26 10:21 AM, gang.yan@linux.dev wrote:
> April 15, 2026 at 3:17 PM, "Paolo Abeni" <pabeni@redhat.com> wrote:
>> AFAICS the stall in the self-tests in patch 5/5 is caused by the sysctl
>> setting taking effect on the server side _after_ that the 3whs
>> negotiated the initial window; the rcvbuf suddenly shrinks from ~128K to
>> 4K and almost every incoming packet is dropped.
>>
>> The test itself is really an extreme condition; we should accept any
>> implementation able to complete the transfer - even at very low speed.
>>
>> The initial test-case, the one using sendfile(), operates in a
>> significantly different way: it generates 1-bytes len DSS preventing
>> coalescing (I've not understood yet why coalescing does not happen),
>> which cause an extremely bad skb->truesize/skb->len ratio, which in turn
>> causes the initial window being way too "optimistic", extreme rcvbuf
>> squeeze at runtime and a behavior similar to the previous one.
>>
>> In both cases simply dropping incoming packets early/in
>> mptcp_incoming_options() when the rcvbuf is full does not solve the
>> issue: if the rcvbuf is used (mostly) by the OoO queue, retransmissions
>> always hit the same rcvbuf condition and are also dropped.
>>
>> The root cause of both scenario is that some very unlikely condition
>> calls to retract the announced rcv wnd, but mptcp can't do that.
>>
>> Currently I start to think that we need a strategy similar to plain TCP
>> to deal with such scenario: when rcvbuf is full we need to condense and
>> eventually prune the OoO queue (see tcp_prune_queue(),
>> tcp_collapse_ofo_queue(), tcp_collapse()).
>>
>> The above has some serious downsides, i.e. it could lead to large slice
>> of almost duplicate complex code, as is diff to abstract the MPTCP vs
>> TCP differences (CB, seq numbers, drop reasons). Still under investigation.
>>
>> /P
>>
> Hi, Paolo
> Thanks a lot for your detailed and insightful analysis of this problem!
>
> I fully agree with your points: MPTCP should allow the transfer to complete
> even under extremely slow or harsh conditions, just as you mentioned.
I was likely not clear in my previous message. IMHO the key point is
that in the mentioned scenario we should consider suitable any fix that
would allow completing the transfer, even at an extremely low average
bitrate - because the memory conditions are indeed extreme.
I.e. we can/should consider this case an extreme slow path.
> Regarding the TCP-style mechanisms like 'tcp_prune_queue' for handling full
> rcvbuf conditions — I have actually attempted similar implementations before.
> As you pointed out, this approach is indeed highly complex for MPTCP. There
> are far too many aspects that require careful modification and consideration,
> making it extremely challenging to implement correctly.
I agree duplicating the TCP pruning code inside MPTCP does not look like
a viable solution.
I think we can instead share it (at least the most cumbersome helpers),
with some caveats. I have a few very rough patches doing that. Let me add
some comments to make the code at least somewhat readable and I'll share
them here.
/P
* Re: [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
2026-04-20 9:34 ` Paolo Abeni
@ 2026-04-20 9:41 ` gang.yan
2026-04-20 10:33 ` Paolo Abeni
0 siblings, 1 reply; 18+ messages in thread
From: gang.yan @ 2026-04-20 9:41 UTC (permalink / raw)
To: Paolo Abeni, mptcp
April 20, 2026 at 5:34 PM, "Paolo Abeni" <pabeni@redhat.com> wrote:
>
> On 4/15/26 10:21 AM, gang.yan@linux.dev wrote:
>
> >
> > April 15, 2026 at 3:17 PM, "Paolo Abeni" <pabeni@redhat.com> wrote:
> >
> > >
> > > AFAICS the stall in the self-tests in patch 5/5 is caused by the sysctl
> > > setting taking effect on the server side _after_ that the 3whs
> > > negotiated the initial window; the rcvbuf suddenly shrinks from ~128K to
> > > 4K and almost every incoming packet is dropped.
> > >
> > > The test itself is really an extreme condition; we should accept any
> > > implementation able to complete the transfer - even at very low speed.
> > >
> > > The initial test-case, the one using sendfile(), operates in a
> > > significantly different way: it generates 1-bytes len DSS preventing
> > > coalescing (I've not understood yet why coalescing does not happen),
> > > which cause an extremely bad skb->truesize/skb->len ratio, which in turn
> > > causes the initial window being way too "optimistic", extreme rcvbuf
> > > squeeze at runtime and a behavior similar to the previous one.
> > >
> > > In both cases simply dropping incoming packets early/in
> > > mptcp_incoming_options() when the rcvbuf is full does not solve the
> > > issue: if the rcvbuf is used (mostly) by the OoO queue, retransmissions
> > > always hit the same rcvbuf condition and are also dropped.
> > >
> > > The root cause of both scenario is that some very unlikely condition
> > > calls to retract the announced rcv wnd, but mptcp can't do that.
> > >
> > > Currently I start to think that we need a strategy similar to plain TCP
> > > to deal with such scenario: when rcvbuf is full we need to condense and
> > > eventually prune the OoO queue (see tcp_prune_queue(),
> > > tcp_collapse_ofo_queue(), tcp_collapse()).
> > >
> > > The above has some serious downsides, i.e. it could lead to large slice
> > > of almost duplicate complex code, as is diff to abstract the MPTCP vs
> > > TCP differences (CB, seq numbers, drop reasons). Still under investigation.
> > >
> > > /P
> > >
> > Hi, Paolo
> > Thanks a lot for your detailed and insightful analysis of this problem!
> >
> > I fully agree with your points: MPTCP should allow the transfer to complete
> > even under extremely slow or harsh conditions, just as you mentioned.
> >
> I was likely not clear in my previous message. IMHO the key point is
> that in the mentioned scenario we should consider suitable any fix that
> would allow completing the transfer, even at a extremely low average
> bitrate - because the memory conditions are indeed extreme.
>
> I.e. we can/should consider this case an extreme slow path.
>
> >
> > Regarding the TCP-style mechanisms like 'tcp_prune_queue' for handling full
> > rcvbuf conditions — I have actually attempted similar implementations before.
> > As you pointed out, this approach is indeed highly complex for MPTCP. There
> > are far too many aspects that require careful modification and consideration,
> > making it extremely challenging to implement correctly.
> >
> I agree duplicating the TCP pruning code inside MPTCP does not look a
> viable solution.
>
> I think we can instead share it (at least the most cumbersome helper),
> with some caveat. I have a few very rough patches doing that. Let me add
> some comments to at least somehow make the code readable and I'll share
> them here.
>
Thank you so much for your help and for working on this!
Looking forward to your updates.
Thanks
Gang
> /P
>
* Re: [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
2026-04-20 9:41 ` gang.yan
@ 2026-04-20 10:33 ` Paolo Abeni
2026-04-21 1:26 ` gang.yan
0 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2026-04-20 10:33 UTC (permalink / raw)
To: gang.yan, mptcp
On 4/20/26 11:41 AM, gang.yan@linux.dev wrote:
> April 20, 2026 at 5:34 PM, "Paolo Abeni" <pabeni@redhat.com> wrote:
>>> Regarding the TCP-style mechanisms like 'tcp_prune_queue' for handling full
>>> rcvbuf conditions — I have actually attempted similar implementations before.
>>> As you pointed out, this approach is indeed highly complex for MPTCP. There
>>> are far too many aspects that require careful modification and consideration,
>>> making it extremely challenging to implement correctly.
>>>
>> I agree duplicating the TCP pruning code inside MPTCP does not look a
>> viable solution.
>>
>> I think we can instead share it (at least the most cumbersome helper),
>> with some caveat. I have a few very rough patches doing that. Let me add
>> some comments to at least somehow make the code readable and I'll share
>> them here.
>>
>
> Thank you so much for your help and for working on this!
Thank you indeed for all the effort and the patience to cope with my
slow feedback.
Also, I hope it's clear that the goal is trying to find the "best
possible" solution and it's _not_ preventing or stealing your (much
appreciated) work!
Cheers,
Paolo
* Re: [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
2026-04-20 10:33 ` Paolo Abeni
@ 2026-04-21 1:26 ` gang.yan
0 siblings, 0 replies; 18+ messages in thread
From: gang.yan @ 2026-04-21 1:26 UTC (permalink / raw)
To: mptcp
April 20, 2026 at 6:33 PM, "Paolo Abeni" <pabeni@redhat.com> wrote:
>
> On 4/20/26 11:41 AM, gang.yan@linux.dev wrote:
>
> >
> > April 20, 2026 at 5:34 PM, "Paolo Abeni" <pabeni@redhat.com> wrote:
> > > > Regarding the TCP-style mechanisms like 'tcp_prune_queue' for handling full
> > > > rcvbuf conditions — I have actually attempted similar implementations before.
> > > > As you pointed out, this approach is indeed highly complex for MPTCP. There
> > > > are far too many aspects that require careful modification and consideration,
> > > > making it extremely challenging to implement correctly.
> >
> > >
> > > I agree duplicating the TCP pruning code inside MPTCP does not look a
> > > viable solution.
> > >
> > > I think we can instead share it (at least the most cumbersome helper),
> > > with some caveat. I have a few very rough patches doing that. Let me add
> > > some comments to at least somehow make the code readable and I'll share
> > > them here.
> > >
> >
> > Thank you so much for your help and for working on this!
> >
> Thank you indeed for all the effort and the patience to cope with my
> slow feedback.
>
> Also I hope it's clear that the goal is trying to find the "best
> possible" solution and it's _not_ preventing or steeling your (much
> appreciated) work!
>
> Cheers,
>
> Paolo
>
Thank you very much for your understanding and patience throughout
the review process. It’s clear to me that we share the same goal:
to find the best and most robust solution. I really value your
feedback and guidance, and I will test your new patches as soon as
possible and provide feedback promptly.
Thanks again for your time and support.
Best regards,
Gang
* [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue
2026-04-01 8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
2026-04-01 8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
@ 2026-04-01 8:54 ` Gang Yan
2026-04-01 8:54 ` [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready Gang Yan
` (2 subsequent siblings)
4 siblings, 0 replies; 18+ messages in thread
From: Gang Yan @ 2026-04-01 8:54 UTC (permalink / raw)
To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang
From: Gang Yan <yangang@kylinos.cn>
The original condition would stop moving skbs or spooling the backlog even
when the receive queue is empty, leading to a receive stall.
Modify the condition in __mptcp_move_skbs() and mptcp_can_spool_backlog()
to only treat rcvbuf as full when:
sk_rmem_alloc_get(sk) > sk->sk_rcvbuf && !skb_queue_empty(&sk->sk_receive_queue)
This ensures the backlog can still be moved to the receive queue when
the queue is empty, avoiding stall problems.
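To make the boolean shape of the new check easy to reason about, here is a
standalone userspace sketch; the function and parameter names below are
illustrative stand-ins for the kernel helpers, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the patched condition: treat the rcvbuf as full only when the
 * accounted memory exceeds the buffer size AND the receive queue still
 * holds data the reader can drain. Mirrors:
 *   sk_rmem_alloc_get(sk) <= sk->sk_rcvbuf ||
 *   skb_queue_empty(&sk->sk_receive_queue)
 */
static bool can_dequeue_from_backlog(int rmem_alloc, int rcvbuf,
				     bool rx_queue_empty)
{
	return rmem_alloc <= rcvbuf || rx_queue_empty;
}
```

With rmem_alloc above rcvbuf, the old check (`rmem_alloc > rcvbuf` alone)
would always stop; the modeled check still allows draining the backlog when
the receive queue is empty, which is exactly the stall being fixed.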
Fixes: 6228efe0cc01 ("mptcp: leverage the backlog for RX packet processing")
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
net/mptcp/protocol.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e6f83c8afc76..8f1afd90290e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2197,6 +2197,12 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
msk->rcvq_space.time = mstamp;
}
+static bool mptcp_can_dequeue_from_backlog(struct sock *sk)
+{
+ return sk_rmem_alloc_get(sk) <= sk->sk_rcvbuf ||
+ skb_queue_empty(&sk->sk_receive_queue);
+}
+
static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delta)
{
struct sk_buff *skb = list_first_entry(skbs, struct sk_buff, list);
@@ -2206,7 +2212,7 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
*delta = 0;
while (1) {
/* If the msk recvbuf is full stop, don't drop */
- if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+ if (!mptcp_can_dequeue_from_backlog(sk))
break;
prefetch(skb->next);
@@ -2261,7 +2267,7 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
/* Don't spool the backlog if the rcvbuf is full. */
if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
- sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+ !mptcp_can_dequeue_from_backlog(sk))
return false;
INIT_LIST_HEAD(skbs);
--
2.43.0
* [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready
2026-04-01 8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
2026-04-01 8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
2026-04-01 8:54 ` [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue Gang Yan
@ 2026-04-01 8:54 ` Gang Yan
2026-04-01 8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
2026-04-01 8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
4 siblings, 0 replies; 18+ messages in thread
From: Gang Yan @ 2026-04-01 8:54 UTC (permalink / raw)
To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang
From: Gang Yan <yangang@kylinos.cn>
There is a stall caused by an unprocessed backlog_queue in
'move_skbs_to_msk'.
This patch adds a check for the backlog_queue and moves skbs to the
receive queue when no skbs were moved from the subflow but the
backlog_queue is not empty.
Fixes: 6228efe0cc01 ("mptcp: leverage the backlog for RX packet processing")
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:
Hi Matt, Paolo:
Here is the response to the AI review for this patch:
- Potential dead-lock problems:
I've fixed this in the next patch. Keeping it as a separate patch
should make the review easier. Should I squash it into this one?
- The question of the short-circuiting logical OR (||):
In fact, the short-circuiting is intentional: if 'move_skbs_to_msk()'
returns true, there is already data in the receive queue, which
should not cause the stall problems. In that case moving more data
from the backlog isn't necessary. Using '||' avoids redundant work
and matches the intended logic.
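As a sanity check of that argument, a small userspace sketch (all names
invented for illustration, standing in for move_skbs_to_msk() and
move_skbs_from_backlog()) shows `||` never invoking the backlog path once
the subflow move already produced in-sequence data:

```c
#include <assert.h>
#include <stdbool.h>

static int backlog_spool_calls;

/* Stand-in for move_skbs_to_msk(): returns true when in-sequence data
 * was moved to the receive queue. */
static bool move_from_subflow(bool has_in_seq_data)
{
	return has_in_seq_data;
}

/* Stand-in for move_skbs_from_backlog(): counts how often it runs. */
static bool move_from_backlog(void)
{
	backlog_spool_calls++;
	return true;
}

/* Models the wake-up condition in mptcp_data_ready(): the backlog is
 * only consulted as a fallback, thanks to short-circuit evaluation. */
static bool data_ready_model(bool subflow_has_data)
{
	return move_from_subflow(subflow_has_data) || move_from_backlog();
}
```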
Thanks
Gang
---
net/mptcp/protocol.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 8f1afd90290e..326b5d4d79e7 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -49,6 +49,11 @@ static struct percpu_counter mptcp_sockets_allocated ____cacheline_aligned_in_sm
static void __mptcp_destroy_sock(struct sock *sk);
static void mptcp_check_send_data_fin(struct sock *sk);
+static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs);
+static void mptcp_backlog_spooled(struct sock *sk, u32 moved,
+ struct list_head *skbs);
+static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs,
+ u32 *delta);
DEFINE_PER_CPU(struct mptcp_delegated_action, mptcp_delegated_actions) = {
.bh_lock = INIT_LOCAL_LOCK(bh_lock),
@@ -906,6 +911,19 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
return moved;
}
+static bool move_skbs_from_backlog(struct sock *sk)
+{
+ struct list_head skbs;
+ bool enqueued = false;
+ u32 moved;
+
+ while (mptcp_can_spool_backlog(sk, &skbs)) {
+ enqueued |= __mptcp_move_skbs(sk, &skbs, &moved);
+ mptcp_backlog_spooled(sk, moved, &skbs);
+ }
+ return enqueued;
+}
+
static void mptcp_rcv_rtt_update(struct mptcp_sock *msk,
struct mptcp_subflow_context *subflow)
{
@@ -948,7 +966,9 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
mptcp_rcv_rtt_update(msk, subflow);
if (!sock_owned_by_user(sk)) {
/* Wake-up the reader only for in-sequence data */
- if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
+ if ((move_skbs_to_msk(msk, ssk) ||
+ move_skbs_from_backlog(sk)) &&
+ mptcp_epollin_ready(sk))
sk->sk_data_ready(sk);
} else {
__mptcp_move_skbs_from_subflow(msk, ssk, false);
--
2.43.0
* [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready
2026-04-01 8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
` (2 preceding siblings ...)
2026-04-01 8:54 ` [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready Gang Yan
@ 2026-04-01 8:54 ` Gang Yan
2026-04-01 20:02 ` Paolo Abeni
2026-04-01 8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
4 siblings, 1 reply; 18+ messages in thread
From: Gang Yan @ 2026-04-01 8:54 UTC (permalink / raw)
To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang
From: Gang Yan <yangang@kylinos.cn>
This patch defers mptcp_check_data_fin from __mptcp_move_skbs to avoid
deadlock.
When processing backlogged data in softirq context, __mptcp_move_skbs
directly calls mptcp_check_data_fin, which can lead to a deadlock if
the msk socket is in TCP_FIN_WAIT2 state. In that case,
mptcp_check_data_fin calls mptcp_shutdown_subflows, which attempts to
lock the subflow socket with lock_sock_fast() - a sleeping function
that cannot be called in atomic context (softirq).
The correct pattern is already used in move_skbs_to_msk: if a data fin
is pending, schedule the work to be processed later in process context
rather than handling it directly.
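A minimal userspace model of that deferral pattern (names are illustrative;
schedule_work_model() stands in for mptcp_schedule_work(), and the booleans
replace the real socket state):

```c
#include <assert.h>
#include <stdbool.h>

static bool worker_scheduled;

/* Stand-in for mptcp_schedule_work(): safe to call from atomic context,
 * since it only flags the worker to run later in process context. */
static void schedule_work_model(void)
{
	worker_scheduled = true;
}

/* Models the patched tail of __mptcp_move_skbs(): if data was moved and
 * a DATA_FIN is pending, defer to the worker instead of calling the
 * (possibly sleeping) fin handler directly in softirq. */
static bool move_skbs_model(bool moved, bool data_fin_pending)
{
	if (moved && data_fin_pending)
		schedule_work_model();
	return moved;
}
```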
Reported-by: Geliang Tang <geliang@kernel.org>
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:
Hi Matt, Paolo:
Here is the response to the AI review for this patch:
- Typo problem in subject line:
Already fixed in v5.
- Remove else branch:
It demonstrates a TOCTOU race, though the race window is extremely
narrow. The else branch is useless even without the race. So I
removed it in v5. WDYT?
Thanks,
Gang
Changelog:
v5:
- Fix typo in title.
- Remove the else branch.
---
net/mptcp/protocol.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 326b5d4d79e7..88e19fa61b84 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2247,8 +2247,9 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
}
__mptcp_ofo_queue(msk);
- if (moved)
- mptcp_check_data_fin((struct sock *)msk);
+ if (moved && mptcp_pending_data_fin(sk, NULL))
+ mptcp_schedule_work(sk);
+
return moved;
}
--
2.43.0
* Re: [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready
2026-04-01 8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
@ 2026-04-01 20:02 ` Paolo Abeni
[not found] ` <e1f58d941ad141f9f96a24d1a1d9d6ca74cc2f5e@linux.dev>
0 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2026-04-01 20:02 UTC (permalink / raw)
To: Gang Yan, mptcp; +Cc: Gang Yan, Geliang Tang
On 4/1/26 10:54 AM, Gang Yan wrote:
> From: Gang Yan <yangang@kylinos.cn>
>
> This patch defers mptcp_check_data_fin from __mptcp_move_skbs to avoid
> deadlock.
>
> When processing backlogged data in softirq context, __mptcp_move_skbs
> directly calls mptcp_check_data_fin, which can lead to a deadlock if
> the msk socket is in TCP_FIN_WAIT2 state. In that case,
> mptcp_check_data_fin calls mptcp_shutdown_subflows, which attempts to
> lock the subflow socket with lock_sock_fast() - a sleeping function
> that cannot be called in atomic context (softirq).
The fix for such issue should be squashed inside the patch generating
the issue itself.
> The correct pattern is already used in move_skbs_to_msk: if a data fin
> is pending, schedule the work to be processed later in process context
> rather than handling it directly.
>
> Reported-by: Geliang Tang <geliang@kernel.org>
> Co-developed-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Gang Yan <yangang@kylinos.cn>
> ---
> Notes:
>
> Hi Matt,Paolo:
>
> Here is the response of AI review for this patch:
>
> - Typo problem in subject line:
> Already fixed in v5.
>
> - Remove else branch:
> It demonstrates a TOCTOU race, though the race window is extremely
> narrow. The else branch is useless even without the race. So I
> removed it in v5. WDYT?
>
> Thanks,
> Gang
>
> Changelog:
> v5:
> - Fix typo in title.
> - Remove the else branch.
>
> ---
> net/mptcp/protocol.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 326b5d4d79e7..88e19fa61b84 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -2247,8 +2247,9 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
> }
>
> __mptcp_ofo_queue(msk);
> - if (moved)
> - mptcp_check_data_fin((struct sock *)msk);
> + if (moved && mptcp_pending_data_fin(sk, NULL))
> + mptcp_schedule_work(sk);
Scheduling the worker when we are not in BH could introduce problematic
delays. Instead you could factor out a ____mptcp_move_skbs() helper (a
better name would be nice!) not including the mptcp_check_data_fin()
call, use the latter in mptcp_data_ready() and in such function
eventually schedule the mptcp work as needed.
/P
* [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf
2026-04-01 8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
` (3 preceding siblings ...)
2026-04-01 8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
@ 2026-04-01 8:54 ` Gang Yan
2026-04-03 1:17 ` Geliang Tang
4 siblings, 1 reply; 18+ messages in thread
From: Gang Yan @ 2026-04-01 8:54 UTC (permalink / raw)
To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang
From: Gang Yan <yangang@kylinos.cn>
This patch adds a test for transmission with a limited receive buffer.
Without the fixes in this set, this test fails with errors:
# 001 4 subflows with reduced rcvbuf
# net.ipv4.tcp_rmem = 4096 4096 4096
# Info: Test file (size 1024 KB) for client
# Info: Test file (size 1024 KB) for server
# copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
# copyfd_io_poll: poll timed out (events: POLLIN 0, POLLOUT 4)
# [FAIL] client exit code 2, server 2
# Server ns stats (ns2-qg1Hqo)
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:
Changelog:
v5:
- Drop the unnecessary tcp_rmem restore.
---
tools/testing/selftests/net/mptcp/mptcp_join.sh | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index beec41f6662a..a21333424085 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -4417,6 +4417,22 @@ endpoint_tests()
fi
}
+rcvbuf_tests()
+{
+ if reset "4 subflows with reduced rcvbuf"; then
+ ip netns exec $ns2 sysctl -w net.ipv4.tcp_rmem="4096 4096 4096"
+
+ pm_nl_set_limits $ns1 0 4
+ pm_nl_set_limits $ns2 0 4
+ pm_nl_add_endpoint $ns2 10.0.2.2 flags subflow
+ pm_nl_add_endpoint $ns2 10.0.3.2 flags subflow
+ pm_nl_add_endpoint $ns2 10.0.4.2 flags subflow
+ speed=fast test_linkfail=1024 \
+ run_tests $ns1 $ns2 10.0.1.1
+ chk_join_nr 3 3 3
+ fi
+}
+
# [$1: error message]
usage()
{
@@ -4467,6 +4483,7 @@ all_tests_sorted=(
F@fail_tests
u@userspace_tests
I@endpoint_tests
+ R@rcvbuf_tests
)
all_tests_args=""
--
2.43.0
* Re: [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf
2026-04-01 8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
@ 2026-04-03 1:17 ` Geliang Tang
2026-04-03 8:52 ` Matthieu Baerts
0 siblings, 1 reply; 18+ messages in thread
From: Geliang Tang @ 2026-04-03 1:17 UTC (permalink / raw)
To: Gang Yan, mptcp; +Cc: pabeni, Gang Yan
Hi Gang,
On Wed, 2026-04-01 at 16:54 +0800, Gang Yan wrote:
> From: Gang Yan <yangang@kylinos.cn>
>
> This patch adds a test for transmission with a limited receive
> buffer.
>
> Without the fixes in this set, this test fails with errors:
>
> # 001 4 subflows with reduced rcvbuf
> # net.ipv4.tcp_rmem = 4096 4096 4096
> # Info: Test file (size 1024 KB) for client
> # Info: Test file (size 1024 KB) for server
> # copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
> # copyfd_io_poll: poll timed out (events: POLLIN 0, POLLOUT 4)
> # [FAIL] client exit code 2, server 2
> # Server ns stats (ns2-qg1Hqo)
>
> Co-developed-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Gang Yan <yangang@kylinos.cn>
> ---
> Notes:
>
> Changelog:
> v5:
> - Drop the unnecessary tcp_rmem restore.
>
> ---
> tools/testing/selftests/net/mptcp/mptcp_join.sh | 17
> +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh
> b/tools/testing/selftests/net/mptcp/mptcp_join.sh
> index beec41f6662a..a21333424085 100755
> --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
> +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
> @@ -4417,6 +4417,22 @@ endpoint_tests()
> fi
> }
>
> +rcvbuf_tests()
> +{
> + if reset "4 subflows with reduced rcvbuf"; then
> + ip netns exec $ns2 sysctl -w net.ipv4.tcp_rmem="4096
> 4096 4096"
Instead of using reset() here, it's better to set this sysctl in a new
helper - for example, reset_with_tcp_rmem() - as is done in other
tests.
Thanks,
-Geliang
> +
> + pm_nl_set_limits $ns1 0 4
> + pm_nl_set_limits $ns2 0 4
> + pm_nl_add_endpoint $ns2 10.0.2.2 flags subflow
> + pm_nl_add_endpoint $ns2 10.0.3.2 flags subflow
> + pm_nl_add_endpoint $ns2 10.0.4.2 flags subflow
> + speed=fast test_linkfail=1024 \
> + run_tests $ns1 $ns2 10.0.1.1
> + chk_join_nr 3 3 3
> + fi
> +}
> +
> # [$1: error message]
> usage()
> {
> @@ -4467,6 +4483,7 @@ all_tests_sorted=(
> F@fail_tests
> u@userspace_tests
> I@endpoint_tests
> + R@rcvbuf_tests
> )
>
> all_tests_args=""
* Re: [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf
2026-04-03 1:17 ` Geliang Tang
@ 2026-04-03 8:52 ` Matthieu Baerts
0 siblings, 0 replies; 18+ messages in thread
From: Matthieu Baerts @ 2026-04-03 8:52 UTC (permalink / raw)
To: Geliang Tang, Gang Yan, mptcp; +Cc: pabeni, Gang Yan
Hi Geliang,
On 03/04/2026 03:17, Geliang Tang wrote:
> Hi Gang,
>
> On Wed, 2026-04-01 at 16:54 +0800, Gang Yan wrote:
>> From: Gang Yan <yangang@kylinos.cn>
>>
>> This patch adds a test for transmission with a limited receive
>> buffer.
>>
>> Without the fixes in this set, this test fails with errors:
>>
>> # 001 4 subflows with reduced rcvbuf
>> # net.ipv4.tcp_rmem = 4096 4096 4096
>> # Info: Test file (size 1024 KB) for client
>> # Info: Test file (size 1024 KB) for server
>> # copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
>> # copyfd_io_poll: poll timed out (events: POLLIN 0, POLLOUT 4)
>> # [FAIL] client exit code 2, server 2
>> # Server ns stats (ns2-qg1Hqo)
>>
>> Co-developed-by: Geliang Tang <geliang@kernel.org>
>> Signed-off-by: Geliang Tang <geliang@kernel.org>
>> Signed-off-by: Gang Yan <yangang@kylinos.cn>
>> ---
>> Notes:
>>
>> Changelog:
>> v5:
>> - Drop the unnecessary tcp_rmem restore.
>>
>> ---
>> tools/testing/selftests/net/mptcp/mptcp_join.sh | 17
>> +++++++++++++++++
>> 1 file changed, 17 insertions(+)
>>
>> diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh
>> b/tools/testing/selftests/net/mptcp/mptcp_join.sh
>> index beec41f6662a..a21333424085 100755
>> --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
>> +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
>> @@ -4417,6 +4417,22 @@ endpoint_tests()
>> fi
>> }
>>
>> +rcvbuf_tests()
>> +{
>> + if reset "4 subflows with reduced rcvbuf"; then
>> + ip netns exec $ns2 sysctl -w net.ipv4.tcp_rmem="4096
>> 4096 4096"
>
> Instead of using reset() here, it's better to set this sysctl in a new
> helper - for example, reset_with_tcp_rmem() - as is done in other
> tests.
If there is only one test, I think it is fine to leave this here: at
least it is clear what is being done. A helper only used once just to
set one extra command might not be worth it.
But what you can do is use 'sysctl -q', like everywhere else 'sysctl' is used.
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.