public inbox for mptcp@lists.linux.dev
* [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready
@ 2026-04-01  8:54 Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan

From: Gang Yan <yangang@kylinos.cn>

Changelog:
V5:
  - Patch 1 fixes missing initialization of msk->backlog_unaccounted in
    __mptcp_init_sock() to match the commit message.
  - Patch 4 removes the 'else' branch to fix a potential TOCTOU race
    and fixes a typo in the title.
  - Patch 5 drops the redundant tcp_rmem restore.
  - Add notes in some patches of the series in response to the AI
    review.
v4:
  - Pass 'delta' to 'move_skbs_to_msk' to fix the potential problems,
    and add a helper named 'move_skbs_from_backlog' to achieve this.
  - Add a test case in mptcp_join.sh which can reproduce such stall
    problems.
  - Fix a deadlock problem introduced in commit 1d77124a6322
    ('mptcp: fix the stall problems with data_ready.').
v3:
  - Replace the backlog list with a backlog_queue (RB-tree) to solve
    the stall problems.
  Link: https://patchwork.kernel.org/project/mptcp/cover/cover.1773735950.git.yangang@kylinos.cn/
v2:
  Link: https://patchwork.kernel.org/project/mptcp/patch/20260209084717.506379-1-gang.yan@linux.dev/
v1:
  Link: https://patchwork.kernel.org/project/mptcp/cover/cover.1770273341.git.yangang@kylinos.cn/


Gang Yan (5):
  mptcp: replace backlog_list with backlog_queue
  mptcp: fix the stall problems using backlog_queue
  mptcp: fix the stall problems with data_ready
  mptcp: fix the dead_lock in mptcp_data_ready
  selftests: mptcp: test transmission with small rcvbuf

 net/mptcp/protocol.c                          | 110 +++++++++++++++---
 net/mptcp/protocol.h                          |   2 +-
 .../testing/selftests/net/mptcp/mptcp_join.sh |  17 +++
 3 files changed, 112 insertions(+), 17 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
  2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
@ 2026-04-01  8:54 ` Gang Yan
  2026-04-01 20:12   ` Paolo Abeni
  2026-04-01  8:54 ` [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue Gang Yan
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang


From: Gang Yan <yangang@kylinos.cn>

This patch replaces the original list-based backlog_list with a
red-black tree (RB-tree) based backlog_queue for MPTCP.

Add key helper functions:
 - mptcp_queue_backlog: Insert skb into backlog_queue in order of
   map_seq via RB-tree
 - mptcp_backlog_queue_to_list: Convert RB-tree based backlog_queue to
   list_head
 - mptcp_backlog_list_to_queue: Convert list_head back to RB-tree based
   backlog_queue

Adapt the existing backlog operation logic:
 - Update mptcp_can_spool_backlog to splice the RB-tree backlog to a
   list via the new helper
 - Adjust mptcp_backlog_spooled to restore list skbs back to the
   RB-tree based backlog_queue
 - Modify mptcp_close_ssk and mptcp_recv_skb to check RB-tree emptiness
   instead of the list
 - Update mptcp_backlog_purge to use the RB-tree to list conversion for
   backlog cleanup

Furthermore, this patch also initializes msk->backlog_unaccounted in
'__mptcp_init_sock()'.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:

Hi Matt, Paolo:

I have reviewed the suggestions from the AI review and found several of
them valuable. Here are my responses to those comments.

- Missing initialization of msk->backlog_unaccounted:
    I've fixed this in v5.

- Potential performance problems:
    The main goal of this patch is to replace the existing backlog_list
    with an RB-tree structure to resolve the stall issue. The
    replacement itself is intended to be functionally equivalent: if
    the RB-tree usage is correct, the stall problem should be
    addressed.

    Performance-wise, the operations under the spinlock can be
    optimized later (e.g., removing the temporary list or refining the
    locking). I've tested with simult_subflows.sh and saw no
    significant regression, so I'd prefer to keep this change as-is in
    this series.

- RB-tree traversal crash in 'mptcp_close_ssk()':
    After reviewing the code, the crash does not seem possible: the
    walk does not modify the RB-tree structure, it only clears skb->sk
    pointers.

    Besides, this is pre-existing upstream code, so I think the
    concerns should be addressed separately.

Thanks,
Gang

Changelog:
v5:
  - Fix missing initialization of msk->backlog_unaccounted in
    __mptcp_init_sock() to match the commit message.

---
 net/mptcp/protocol.c | 73 +++++++++++++++++++++++++++++++++++++-------
 net/mptcp/protocol.h |  2 +-
 2 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 46fedfa05a54..e6f83c8afc76 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -653,6 +653,33 @@ static void mptcp_dss_corruption(struct mptcp_sock *msk, struct sock *ssk)
 	}
 }
 
+static int mptcp_queue_backlog(struct mptcp_sock *msk, struct sk_buff *skb)
+{
+	u64 seq = MPTCP_SKB_CB(skb)->map_seq;
+	struct rb_node **p, *parent = NULL;
+
+	p = &msk->backlog_queue.rb_node;
+	if (RB_EMPTY_ROOT(&msk->backlog_queue))
+		goto insert;
+
+	while (*p) {
+		struct sk_buff *s;
+
+		parent = *p;
+		s = rb_to_skb(parent);
+
+		if (before64(seq, MPTCP_SKB_CB(s)->map_seq))
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
+	}
+
+insert:
+	rb_link_node(&skb->rbnode, parent, p);
+	rb_insert_color(&skb->rbnode, &msk->backlog_queue);
+	return 0;
+}
+
 static void __mptcp_add_backlog(struct sock *sk,
 				struct mptcp_subflow_context *subflow,
 				struct sk_buff *skb)
@@ -669,8 +696,8 @@ static void __mptcp_add_backlog(struct sock *sk,
 	}
 
 	/* Try to coalesce with the last skb in our backlog */
-	if (!list_empty(&msk->backlog_list))
-		tail = list_last_entry(&msk->backlog_list, struct sk_buff, list);
+	if (!RB_EMPTY_ROOT(&msk->backlog_queue))
+		tail = skb_rb_last(&msk->backlog_queue);
 
 	if (tail && MPTCP_SKB_CB(skb)->map_seq == MPTCP_SKB_CB(tail)->end_seq &&
 	    ssk == tail->sk &&
@@ -681,7 +708,7 @@ static void __mptcp_add_backlog(struct sock *sk,
 		goto account;
 	}
 
-	list_add_tail(&skb->list, &msk->backlog_list);
+	mptcp_queue_backlog(msk, skb);
 	mptcp_subflow_lend_fwdmem(subflow, skb);
 	delta = skb->truesize;
 
@@ -2199,6 +2226,29 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
 	return moved;
 }
 
+static void mptcp_backlog_queue_to_list(struct mptcp_sock *msk,
+					struct list_head *list)
+{
+	struct sk_buff *skb;
+
+	while ((skb = skb_rb_first(&msk->backlog_queue)) != NULL) {
+		rb_erase(&skb->rbnode, &msk->backlog_queue);
+		RB_CLEAR_NODE(&skb->rbnode);
+		list_add_tail(&skb->list, list);
+	}
+}
+
+static void mptcp_backlog_list_to_queue(struct mptcp_sock *msk,
+					struct list_head *list)
+{
+	struct sk_buff *skb, *tmp;
+
+	list_for_each_entry_safe(skb, tmp, list, list) {
+		list_del(&skb->list);
+		mptcp_queue_backlog(msk, skb);
+	}
+}
+
 static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -2210,12 +2260,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 			       mem_cgroup_from_sk(sk));
 
 	/* Don't spool the backlog if the rcvbuf is full. */
-	if (list_empty(&msk->backlog_list) ||
+	if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
 	    sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
 		return false;
 
 	INIT_LIST_HEAD(skbs);
-	list_splice_init(&msk->backlog_list, skbs);
+	mptcp_backlog_queue_to_list(msk, skbs);
 	return true;
 }
 
@@ -2225,7 +2275,7 @@ static void mptcp_backlog_spooled(struct sock *sk, u32 moved,
 	struct mptcp_sock *msk = mptcp_sk(sk);
 
 	WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
-	list_splice(skbs, &msk->backlog_list);
+	mptcp_backlog_list_to_queue(msk, skbs);
 }
 
 static bool mptcp_move_skbs(struct sock *sk)
@@ -2311,7 +2361,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 
 		copied += bytes_read;
 
-		if (!list_empty(&msk->backlog_list) && mptcp_move_skbs(sk))
+		if (!RB_EMPTY_ROOT(&msk->backlog_queue) && mptcp_move_skbs(sk))
 			continue;
 
 		/* only the MPTCP socket status is relevant here. The exit
@@ -2640,7 +2690,7 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 	/* Remove any reference from the backlog to this ssk; backlog skbs consume
 	 * space in the msk receive queue, no need to touch sk->sk_rmem_alloc
 	 */
-	list_for_each_entry(skb, &msk->backlog_list, list) {
+	skb_rbtree_walk(skb, &msk->backlog_queue) {
 		if (skb->sk != ssk)
 			continue;
 
@@ -2896,7 +2946,7 @@ static void mptcp_backlog_purge(struct sock *sk)
 	LIST_HEAD(backlog);
 
 	mptcp_data_lock(sk);
-	list_splice_init(&msk->backlog_list, &backlog);
+	mptcp_backlog_queue_to_list(msk, &backlog);
 	msk->backlog_len = 0;
 	mptcp_data_unlock(sk);
 
@@ -2999,13 +3049,14 @@ static void __mptcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&msk->conn_list);
 	INIT_LIST_HEAD(&msk->join_list);
 	INIT_LIST_HEAD(&msk->rtx_queue);
-	INIT_LIST_HEAD(&msk->backlog_list);
+	msk->backlog_queue = RB_ROOT;
 	INIT_WORK(&msk->work, mptcp_worker);
 	msk->out_of_order_queue = RB_ROOT;
 	msk->first_pending = NULL;
 	msk->timer_ival = TCP_RTO_MIN;
 	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
 	msk->backlog_len = 0;
+	msk->backlog_unaccounted = 0;
 	mptcp_init_rtt_est(msk);
 
 	WRITE_ONCE(msk->first, NULL);
@@ -4335,7 +4386,7 @@ static struct sk_buff *mptcp_recv_skb(struct sock *sk, u32 *off)
 	struct sk_buff *skb;
 	u32 offset;
 
-	if (!list_empty(&msk->backlog_list))
+	if (!RB_EMPTY_ROOT(&msk->backlog_queue))
 		mptcp_move_skbs(sk);
 
 	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index f5d4d7d030f2..f0eaba2c61fa 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -372,7 +372,7 @@ struct mptcp_sock {
 					 * allow_join
 					 */
 
-	struct list_head backlog_list;	/* protected by the data lock */
+	struct rb_root	backlog_queue;	/* protected by the data lock */
 	u32		backlog_len;
 	u32		backlog_unaccounted;
 };
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue
  2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
@ 2026-04-01  8:54 ` Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready Gang Yan
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang

From: Gang Yan <yangang@kylinos.cn>

The original condition would stop moving skbs or spooling the backlog
even when the receive queue is empty, leading to a receive stall.

Modify the condition in __mptcp_move_skbs() and mptcp_can_spool_backlog()
to only treat rcvbuf as full when:
  sk_rmem_alloc_get(sk) > sk->sk_rcvbuf && !skb_queue_empty(&sk->sk_receive_queue)

This ensures the backlog can still be moved to the receive queue when
the queue is empty, avoiding stall problems.

Fixes: 6228efe0cc01 ("mptcp: leverage the backlog for RX packet processing")
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
 net/mptcp/protocol.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e6f83c8afc76..8f1afd90290e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2197,6 +2197,12 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	msk->rcvq_space.time = mstamp;
 }
 
+static bool mptcp_can_dequeue_from_backlog(struct sock *sk)
+{
+	return sk_rmem_alloc_get(sk) <= sk->sk_rcvbuf ||
+	       skb_queue_empty(&sk->sk_receive_queue);
+}
+
 static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delta)
 {
 	struct sk_buff *skb = list_first_entry(skbs, struct sk_buff, list);
@@ -2206,7 +2212,7 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
 	*delta = 0;
 	while (1) {
 		/* If the msk recvbuf is full stop, don't drop */
-		if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+		if (!mptcp_can_dequeue_from_backlog(sk))
 			break;
 
 		prefetch(skb->next);
@@ -2261,7 +2267,7 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 
 	/* Don't spool the backlog if the rcvbuf is full. */
 	if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
-	    sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+	    !mptcp_can_dequeue_from_backlog(sk))
 		return false;
 
 	INIT_LIST_HEAD(skbs);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready
  2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue Gang Yan
@ 2026-04-01  8:54 ` Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
  4 siblings, 0 replies; 11+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang

From: Gang Yan <yangang@kylinos.cn>

There exists a stall caused by an unprocessed backlog_queue in
'move_skbs_to_msk'.

This patch adds a check for the backlog_queue and moves skbs to the
receive queue when no skbs were moved from the subflow but the
backlog_queue is not empty.

Fixes: 6228efe0cc01 ("mptcp: leverage the backlog for RX packet processing")
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:

Hi Matt, Paolo:

Here is the response to the AI review for this patch:
- Potential deadlock problems:
    I've fixed this in the next patch. Keeping it as a separate patch
    should make the review easier. Should I squash it into this one?

- The question of the short-circuiting logical OR (||):
    In fact, the short-circuiting is intentional: if 'move_skbs_to_msk()'
    returns true, there is already data in the receive queue, which
    should not cause the stall problems. In that case moving more data
    from the backlog isn't necessary. Using '||' avoids redundant work
    and matches the intended logic.

Thanks
Gang

---
 net/mptcp/protocol.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 8f1afd90290e..326b5d4d79e7 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -49,6 +49,11 @@ static struct percpu_counter mptcp_sockets_allocated ____cacheline_aligned_in_sm
 
 static void __mptcp_destroy_sock(struct sock *sk);
 static void mptcp_check_send_data_fin(struct sock *sk);
+static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs);
+static void mptcp_backlog_spooled(struct sock *sk, u32 moved,
+				  struct list_head *skbs);
+static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs,
+			      u32 *delta);
 
 DEFINE_PER_CPU(struct mptcp_delegated_action, mptcp_delegated_actions) = {
 	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
@@ -906,6 +911,19 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	return moved;
 }
 
+static bool move_skbs_from_backlog(struct sock *sk)
+{
+	struct list_head skbs;
+	bool enqueued = false;
+	u32 moved;
+
+	while (mptcp_can_spool_backlog(sk, &skbs)) {
+		enqueued |= __mptcp_move_skbs(sk, &skbs, &moved);
+		mptcp_backlog_spooled(sk, moved, &skbs);
+	}
+	return enqueued;
+}
+
 static void mptcp_rcv_rtt_update(struct mptcp_sock *msk,
 				 struct mptcp_subflow_context *subflow)
 {
@@ -948,7 +966,9 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 	mptcp_rcv_rtt_update(msk, subflow);
 	if (!sock_owned_by_user(sk)) {
 		/* Wake-up the reader only for in-sequence data */
-		if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
+		if ((move_skbs_to_msk(msk, ssk) ||
+		     move_skbs_from_backlog(sk)) &&
+		    mptcp_epollin_ready(sk))
 			sk->sk_data_ready(sk);
 	} else {
 		__mptcp_move_skbs_from_subflow(msk, ssk, false);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready
  2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
                   ` (2 preceding siblings ...)
  2026-04-01  8:54 ` [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready Gang Yan
@ 2026-04-01  8:54 ` Gang Yan
  2026-04-01 20:02   ` Paolo Abeni
  2026-04-01  8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
  4 siblings, 1 reply; 11+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang

From: Gang Yan <yangang@kylinos.cn>

This patch defers mptcp_check_data_fin from __mptcp_move_skbs to avoid
deadlock.

When processing backlogged data in softirq context, __mptcp_move_skbs
directly calls mptcp_check_data_fin, which can lead to a deadlock if
the msk socket is in TCP_FIN_WAIT2 state. In that case,
mptcp_check_data_fin calls mptcp_shutdown_subflows, which attempts to
lock the subflow socket with lock_sock_fast() - a sleeping function
that cannot be called in atomic context (softirq).

The correct pattern is already used in move_skbs_to_msk: if a data fin
is pending, schedule the work to be processed later in process context
rather than handling it directly.

Reported-by: Geliang Tang <geliang@kernel.org>
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:

Hi Matt, Paolo:

Here is the response to the AI review for this patch:

- Typo problem in subject line:
    Already fixed in v5.

- Remove else branch:
    The AI review demonstrates a TOCTOU race, though the race window
    is extremely narrow. The else branch is useless even without the
    race, so I removed it in v5. WDYT?

Thanks,
Gang

Changelog:
v5:
  - Fix typo in title.
  - Remove the else branch.

---
 net/mptcp/protocol.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 326b5d4d79e7..88e19fa61b84 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2247,8 +2247,9 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
 	}
 
 	__mptcp_ofo_queue(msk);
-	if (moved)
-		mptcp_check_data_fin((struct sock *)msk);
+	if (moved && mptcp_pending_data_fin(sk, NULL))
+		mptcp_schedule_work(sk);
+
 	return moved;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf
  2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
                   ` (3 preceding siblings ...)
  2026-04-01  8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
@ 2026-04-01  8:54 ` Gang Yan
  2026-04-03  1:17   ` Geliang Tang
  4 siblings, 1 reply; 11+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang

From: Gang Yan <yangang@kylinos.cn>

This patch adds a test for transmission with a limited receive buffer.

Without the fixes in this set, this test fails with errors:

 # 001 4 subflows with reduced rcvbuf
 # net.ipv4.tcp_rmem = 4096 4096 4096
 #       Info: Test file (size 1024 KB) for client
 #       Info: Test file (size 1024 KB) for server
 # copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
 # copyfd_io_poll: poll timed out (events: POLLIN 0, POLLOUT 4)
 # [FAIL] client exit code 2, server 2
 # Server ns stats (ns2-qg1Hqo)

Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:

Changelog:
v5:
  - Drop the unnecessary tcp_rmem restore.

---
 tools/testing/selftests/net/mptcp/mptcp_join.sh | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index beec41f6662a..a21333424085 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -4417,6 +4417,22 @@ endpoint_tests()
 	fi
 }
 
+rcvbuf_tests()
+{
+	if reset "4 subflows with reduced rcvbuf"; then
+		ip netns exec $ns2 sysctl -w net.ipv4.tcp_rmem="4096 4096 4096"
+
+		pm_nl_set_limits $ns1 0 4
+		pm_nl_set_limits $ns2 0 4
+		pm_nl_add_endpoint $ns2 10.0.2.2 flags subflow
+		pm_nl_add_endpoint $ns2 10.0.3.2 flags subflow
+		pm_nl_add_endpoint $ns2 10.0.4.2 flags subflow
+		speed=fast test_linkfail=1024 \
+			run_tests $ns1 $ns2 10.0.1.1
+		chk_join_nr 3 3 3
+	fi
+}
+
 # [$1: error message]
 usage()
 {
@@ -4467,6 +4483,7 @@ all_tests_sorted=(
 	F@fail_tests
 	u@userspace_tests
 	I@endpoint_tests
+	R@rcvbuf_tests
 )
 
 all_tests_args=""
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready
  2026-04-01  8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
@ 2026-04-01 20:02   ` Paolo Abeni
       [not found]     ` <e1f58d941ad141f9f96a24d1a1d9d6ca74cc2f5e@linux.dev>
  0 siblings, 1 reply; 11+ messages in thread
From: Paolo Abeni @ 2026-04-01 20:02 UTC (permalink / raw)
  To: Gang Yan, mptcp; +Cc: Gang Yan, Geliang Tang

On 4/1/26 10:54 AM, Gang Yan wrote:
> From: Gang Yan <yangang@kylinos.cn>
> 
> This patch defers mptcp_check_data_fin from __mptcp_move_skbs to avoid
> deadlock.
> 
> When processing backlogged data in softirq context, __mptcp_move_skbs
> directly calls mptcp_check_data_fin, which can lead to a deadlock if
> the msk socket is in TCP_FIN_WAIT2 state. In that case,
> mptcp_check_data_fin calls mptcp_shutdown_subflows, which attempts to
> lock the subflow socket with lock_sock_fast() - a sleeping function
> that cannot be called in atomic context (softirq).

The fix for such issue should be squashed inside the patch generating
the issue itself.

> The correct pattern is already used in move_skbs_to_msk: if a data fin
> is pending, schedule the work to be processed later in process context
> rather than handling it directly.
> 
> Reported-by: Geliang Tang <geliang@kernel.org>
> Co-developed-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Gang Yan <yangang@kylinos.cn>
> ---
> Notes:
> 
> Hi Matt,Paolo:
> 
> Here is the response of AI review for this patch:
> 
> - Typo problem in subject line:
>     Already fixed in v5.
> 
> - Remove else branch:
>     It demonstrates a TOCTOU race, though the race window is extremely
>     narrow. The else branch is useless even without the race. So I
>     removed it in v5. WDYT?
> 
> Thanks,
> Gang
> 
> Changelog:
> v5:
>   - Fix typo in title.
>   - Remove the else branch.
> 
> ---
>  net/mptcp/protocol.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 326b5d4d79e7..88e19fa61b84 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -2247,8 +2247,9 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
>  	}
>  
>  	__mptcp_ofo_queue(msk);
> -	if (moved)
> -		mptcp_check_data_fin((struct sock *)msk);
> +	if (moved && mptcp_pending_data_fin(sk, NULL))
> +		mptcp_schedule_work(sk);

Scheduling the worker when we are not in BH could introduce problematic
delays. Instead you could factor out a ____mptcp_move_skbs() helper (a
better name would be nice!) not including the mptcp_check_data_fin()
call, use the latter in mptcp_data_ready() and in such function
eventually schedule the mptcp work as needed.

/P


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
  2026-04-01  8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
@ 2026-04-01 20:12   ` Paolo Abeni
  0 siblings, 0 replies; 11+ messages in thread
From: Paolo Abeni @ 2026-04-01 20:12 UTC (permalink / raw)
  To: Gang Yan, mptcp; +Cc: Gang Yan, Geliang Tang

On 4/1/26 10:54 AM, Gang Yan wrote:
> @@ -2199,6 +2226,29 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
>  	return moved;
>  }
>  
> +static void mptcp_backlog_queue_to_list(struct mptcp_sock *msk,
> +					struct list_head *list)
> +{
> +	struct sk_buff *skb;
> +
> +	while ((skb = skb_rb_first(&msk->backlog_queue)) != NULL) {
> +		rb_erase(&skb->rbnode, &msk->backlog_queue);
> +		RB_CLEAR_NODE(&skb->rbnode);
> +		list_add_tail(&skb->list, list);
> +	}

This is O(N * log(N))

> +}
> +
> +static void mptcp_backlog_list_to_queue(struct mptcp_sock *msk,
> +					struct list_head *list)
> +{
> +	struct sk_buff *skb, *tmp;
> +
> +	list_for_each_entry_safe(skb, tmp, list, list) {
> +		list_del(&skb->list);
> +		mptcp_queue_backlog(msk, skb);
> +	}
> +}
> +
>  static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
>  {
>  	struct mptcp_sock *msk = mptcp_sk(sk);
> @@ -2210,12 +2260,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
>  			       mem_cgroup_from_sk(sk));
>  
>  	/* Don't spool the backlog if the rcvbuf is full. */
> -	if (list_empty(&msk->backlog_list) ||
> +	if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
>  	    sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
>  		return false;
>  
>  	INIT_LIST_HEAD(skbs);
> -	list_splice_init(&msk->backlog_list, skbs);
> +	mptcp_backlog_queue_to_list(msk, skbs);

This is under a spinlock with BH disabled, and N is potentially quite
high. This code is problematic as it could cause stall and delay BH for
a significant amount of time.

Note that there are more chunks later and in later patches with similar
problem.

I don't have a good solution on top of my head, I'll try to go back to
the design table.

/P


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf
  2026-04-01  8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
@ 2026-04-03  1:17   ` Geliang Tang
  2026-04-03  8:52     ` Matthieu Baerts
  0 siblings, 1 reply; 11+ messages in thread
From: Geliang Tang @ 2026-04-03  1:17 UTC (permalink / raw)
  To: Gang Yan, mptcp; +Cc: pabeni, Gang Yan

Hi Gang,

On Wed, 2026-04-01 at 16:54 +0800, Gang Yan wrote:
> From: Gang Yan <yangang@kylinos.cn>
> 
> This patch adds a test for transmission with a limited receive
> buffer.
> 
> Without the fixes in this set, this test fails with errors:
> 
>  # 001 4 subflows with reduced rcvbuf
>  # net.ipv4.tcp_rmem = 4096 4096 4096
>  #       Info: Test file (size 1024 KB) for client
>  #       Info: Test file (size 1024 KB) for server
>  # copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
>  # copyfd_io_poll: poll timed out (events: POLLIN 0, POLLOUT 4)
>  # [FAIL] client exit code 2, server 2
>  # Server ns stats (ns2-qg1Hqo)
> 
> Co-developed-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Gang Yan <yangang@kylinos.cn>
> ---
> Notes:
> 
> Changelog:
> v5:
>   - Drop the unnecessary tcp_rmem restore.
> 
> ---
>  tools/testing/selftests/net/mptcp/mptcp_join.sh | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
> index beec41f6662a..a21333424085 100755
> --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
> +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
> @@ -4417,6 +4417,22 @@ endpoint_tests()
>  	fi
>  }
>  
> +rcvbuf_tests()
> +{
> +	if reset "4 subflows with reduced rcvbuf"; then
> +		ip netns exec $ns2 sysctl -w net.ipv4.tcp_rmem="4096 4096 4096"

Instead of using reset() here, it's better to set this sysctl in a new
helper - for example, reset_with_tcp_rmem() - as is done in other
tests.

Thanks,
-Geliang

> +
> +		pm_nl_set_limits $ns1 0 4
> +		pm_nl_set_limits $ns2 0 4
> +		pm_nl_add_endpoint $ns2 10.0.2.2 flags subflow
> +		pm_nl_add_endpoint $ns2 10.0.3.2 flags subflow
> +		pm_nl_add_endpoint $ns2 10.0.4.2 flags subflow
> +		speed=fast test_linkfail=1024 \
> +			run_tests $ns1 $ns2 10.0.1.1
> +		chk_join_nr 3 3 3
> +	fi
> +}
> +
>  # [$1: error message]
>  usage()
>  {
> @@ -4467,6 +4483,7 @@ all_tests_sorted=(
>  	F@fail_tests
>  	u@userspace_tests
>  	I@endpoint_tests
> +	R@rcvbuf_tests
>  )
>  
>  all_tests_args=""

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Fwd: Re: [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready
       [not found]     ` <e1f58d941ad141f9f96a24d1a1d9d6ca74cc2f5e@linux.dev>
@ 2026-04-03  6:45       ` gang.yan
  0 siblings, 0 replies; 11+ messages in thread
From: gang.yan @ 2026-04-03  6:45 UTC (permalink / raw)
  To: mptcp

-------- Forwarded message --------

From: gang.yan@linux.dev
To: "Paolo Abeni" <pabeni@redhat.com>
Sent: April 3, 2026 at 2:44 PM
Subject: Re: [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready

April 2, 2026 at 4:02 AM, "Paolo Abeni" <pabeni@redhat.com> wrote:

> On 4/1/26 10:54 AM, Gang Yan wrote:
> > From: Gang Yan <yangang@kylinos.cn>
> >
> > This patch defers mptcp_check_data_fin from __mptcp_move_skbs to avoid
> > deadlock.
> >
> > When processing backlogged data in softirq context, __mptcp_move_skbs
> > directly calls mptcp_check_data_fin, which can lead to a deadlock if
> > the msk socket is in TCP_FIN_WAIT2 state. In that case,
> > mptcp_check_data_fin calls mptcp_shutdown_subflows, which attempts to
> > lock the subflow socket with lock_sock_fast() - a sleeping function
> > that cannot be called in atomic context (softirq).
>
> The fix for such issue should be squashed inside the patch generating
> the issue itself.
>
> > The correct pattern is already used in move_skbs_to_msk: if a data fin
> > is pending, schedule the work to be processed later in process context
> > rather than handling it directly.
> >
> > Reported-by: Geliang Tang <geliang@kernel.org>
> > Co-developed-by: Geliang Tang <geliang@kernel.org>
> > Signed-off-by: Geliang Tang <geliang@kernel.org>
> > Signed-off-by: Gang Yan <yangang@kylinos.cn>
> > ---
> > Notes:
> >
> > Hi Matt, Paolo:
> >
> > Here is the response to the AI review for this patch:
> >
> > - Typo problem in subject line:
> >     Already fixed in v5.
> >
> > - Remove else branch:
> >     It demonstrates a TOCTOU race, though the race window is extremely
> >     narrow. The else branch is useless even without the race. So I
> >     removed it in v5. WDYT?
> >
> > Thanks,
> > Gang
> >
> > Changelog:
> > v5:
> >   - Fix typo in title.
> >   - Remove the else branch.
> >
> > ---
> >  net/mptcp/protocol.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index 326b5d4d79e7..88e19fa61b84 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -2247,8 +2247,9 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
> >  	}
> >
> >  	__mptcp_ofo_queue(msk);
> > -	if (moved)
> > -		mptcp_check_data_fin((struct sock *)msk);
> > +	if (moved && mptcp_pending_data_fin(sk, NULL))
> > +		mptcp_schedule_work(sk);
>
> Scheduling the worker when we are not in BH could introduce problematic
> delays. Instead you could factor out a ____mptcp_move_skbs() helper (a
> better name would be nice!) not including the mptcp_check_data_fin()
> call, use the latter in mptcp_data_ready() and in such function
> eventually schedule the mptcp work as needed.
Hi Paolo,

Thanks for your review.

I'm a little confused: does this mean 'move_skbs_to_msk' should be
fixed too? Something like:

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6ce0d18c86b6..e90c8621683c 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -869,13 +869,6 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	if (unlikely(ssk->sk_err))
 		__mptcp_subflow_error_report(sk, ssk);
 
-	/* If the moves have caught up with the DATA_FIN sequence number
-	 * it's time to ack the DATA_FIN and change socket state, but
-	 * this is not a good place to change state. Let the workqueue
-	 * do it.
-	 */
-	if (mptcp_pending_data_fin(sk, NULL))
-		mptcp_schedule_work(sk);
 	return moved;
 }
 
@@ -909,6 +902,7 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	bool moved = false;
 
 	/* The peer can send data while we are shutting down this
 	 * subflow at subflow destruction time, but we must avoid enqueuing
@@ -920,13 +914,17 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 	mptcp_data_lock(sk);
 	mptcp_rcv_rtt_update(msk, subflow);
 	if (!sock_owned_by_user(sk)) {
+		moved = move_skbs_to_msk(msk, ssk);
 		/* Wake-up the reader only for in-sequence data */
-		if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
+		if (moved && mptcp_epollin_ready(sk))
 			sk->sk_data_ready(sk);
 	} else {
 		__mptcp_move_skbs_from_subflow(msk, ssk, false);
 	}
 	mptcp_data_unlock(sk);
+
+	if (mptcp_pending_data_fin(sk, NULL))
+		mptcp_check_data_fin(sk);
 }

Looking forward to your reply.

Thanks,
Gang

> 
> /P
>


* Re: [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf
  2026-04-03  1:17   ` Geliang Tang
@ 2026-04-03  8:52     ` Matthieu Baerts
  0 siblings, 0 replies; 11+ messages in thread
From: Matthieu Baerts @ 2026-04-03  8:52 UTC (permalink / raw)
  To: Geliang Tang, Gang Yan, mptcp; +Cc: pabeni, Gang Yan

Hi Geliang,

On 03/04/2026 03:17, Geliang Tang wrote:
> Hi Gang,
> 
> On Wed, 2026-04-01 at 16:54 +0800, Gang Yan wrote:
>> From: Gang Yan <yangang@kylinos.cn>
>>
>> This patch adds a test for transmission with a limited receive
>> buffer.
>>
>> Without the fixes in this set, this test fails with errors:
>>
>>  # 001 4 subflows with reduced rcvbuf
>>  # net.ipv4.tcp_rmem = 4096 4096 4096
>>  #       Info: Test file (size 1024 KB) for client
>>  #       Info: Test file (size 1024 KB) for server
>>  # copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
>>  # copyfd_io_poll: poll timed out (events: POLLIN 0, POLLOUT 4)
>>  # [FAIL] client exit code 2, server 2
>>  # Server ns stats (ns2-qg1Hqo)
>>
>> Co-developed-by: Geliang Tang <geliang@kernel.org>
>> Signed-off-by: Geliang Tang <geliang@kernel.org>
>> Signed-off-by: Gang Yan <yangang@kylinos.cn>
>> ---
>> Notes:
>>
>> Changelog:
>> v5:
>>   - Drop the unnecessary tcp_rmem restore.
>>
>> ---
>>  tools/testing/selftests/net/mptcp/mptcp_join.sh | 17 +++++++++++++++++
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
>> index beec41f6662a..a21333424085 100755
>> --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
>> +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
>> @@ -4417,6 +4417,22 @@ endpoint_tests()
>>  	fi
>>  }
>>  
>> +rcvbuf_tests()
>> +{
>> +	if reset "4 subflows with reduced rcvbuf"; then
>> +		ip netns exec $ns2 sysctl -w net.ipv4.tcp_rmem="4096 4096 4096"
> 
> Instead of using reset() here, it's better to set this sysctl in a new
> helper - for example, reset_with_tcp_rmem() - as is done in other
> tests.

If there is only one test, I think it is fine to leave this here: at
least it is clear what is being done. A helper used only once, just to
run one extra command, might not be worth it.

What you can do, though, is use 'sysctl -q', as is done everywhere else
'sysctl' is used.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



end of thread, other threads:[~2026-04-03  8:52 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
2026-04-01  8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
2026-04-01 20:12   ` Paolo Abeni
2026-04-01  8:54 ` [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue Gang Yan
2026-04-01  8:54 ` [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready Gang Yan
2026-04-01  8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
2026-04-01 20:02   ` Paolo Abeni
     [not found]     ` <e1f58d941ad141f9f96a24d1a1d9d6ca74cc2f5e@linux.dev>
2026-04-03  6:45       ` Fwd: " gang.yan
2026-04-01  8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
2026-04-03  1:17   ` Geliang Tang
2026-04-03  8:52     ` Matthieu Baerts

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox