public inbox for mptcp@lists.linux.dev
* [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready
@ 2026-04-01  8:54 Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan

From: Gang Yan <yangang@kylinos.cn>

Changelog:
v5:
  - Patch 1 fixes missing initialization of msk->backlog_unaccounted in
    __mptcp_init_sock() to match the commit message.
  - Patch 4 removes the 'else' branch to fix potential TOCTOU race and
    fix typo in title.
  - Patch 5 drops the redundant tcp_rmem restore.
  - Add notes in some patches of the series in response to the AI
    review.
v4:
  - Pass 'delta' to 'move_skbs_to_msk' to fix the potential problems,
    and add a helper named 'move_skbs_from_backlog' to achieve this.
  - Add a test case in mptcp_join.sh which can reproduce such stall
    problems.
  - Fix a deadlock problem introduced in commit 1d77124a6322
    ('mptcp: fix the stall problems with data_ready.').
v3:
  - Replace the backlog list with a backlog_queue (RB-tree) to solve
    the stall problems.
  Link: https://patchwork.kernel.org/project/mptcp/cover/cover.1773735950.git.yangang@kylinos.cn/
v2:
  Link: https://patchwork.kernel.org/project/mptcp/patch/20260209084717.506379-1-gang.yan@linux.dev/
v1:
  Link: https://patchwork.kernel.org/project/mptcp/cover/cover.1770273341.git.yangang@kylinos.cn/


Gang Yan (5):
  mptcp: replace backlog_list with backlog_queue
  mptcp: fix the stall problems using backlog_queue
  mptcp: fix the stall problems with data_ready
  mptcp: fix the dead_lock in mptcp_data_ready
  selftests: mptcp: test transmission with small rcvbuf

 net/mptcp/protocol.c                          | 110 +++++++++++++++---
 net/mptcp/protocol.h                          |   2 +-
 .../testing/selftests/net/mptcp/mptcp_join.sh |  17 +++
 3 files changed, 112 insertions(+), 17 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
  2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
@ 2026-04-01  8:54 ` Gang Yan
  2026-04-01 20:12   ` Paolo Abeni
  2026-04-01  8:54 ` [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue Gang Yan
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang

From: Gang Yan <yangang@kylinos.cn>

This patch replaces the original list-based backlog_list with a
red-black tree (RB-tree) based backlog_queue for MPTCP.

Add key helper functions:
 - mptcp_queue_backlog: Insert skb into backlog_queue in order of
   map_seq via RB-tree
 - mptcp_backlog_queue_to_list: Convert RB-tree based backlog_queue to
   list_head
 - mptcp_backlog_list_to_queue: Convert list_head back to RB-tree based
   backlog_queue

Adapt existing backlog operation logic:
   - Update mptcp_can_spool_backlog to splice RB-tree backlog to list
     via new helper
   - Adjust mptcp_backlog_spooled to restore list skbs back to RB-tree
     backlog_queue
   - Modify mptcp_close_ssk and mptcp_recv_skb to check RB-tree emptiness
     instead of list
   - Update mptcp_backlog_purge to use RB-tree to list conversion for
     backlog cleanup

Furthermore, this patch also initializes msk->backlog_unaccounted in
'__mptcp_init_sock()'.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:

Hi Matt, Paolo:

I have reviewed the suggestions from the AI review and found several of
them valuable. This is a response to the AI review comments.

- Missing the initialization of msk->backlog_unaccounted.
    I've fixed this in v5.

- Potential performance problems:
    The main goal of this patch is to replace the existing backlog_list
    with an RB-tree structure to resolve the stall issue. The
    replacement itself is intended to be functionally equivalent—if the
    RB-tree usage is correct, the stall problem should be addressed.

    Performance-wise, the operations under spinlock can be optimized
    later (e.g., removing the temporary list or refining locks). I've
    tested with simult_subflows.sh and saw no significant regression,
    so I'd prefer to keep this change as-is in this series.

- RB-tree traversal crash in 'mptcp_close_ssk()'
    After reviewing the code, the crash does not seem to happen: the
    walk does not modify the RB-tree structure, it only clears the
    skb->sk pointers.

    Also, since this is pre-existing upstream code, I think the
    concern should be addressed separately.

Thanks,
Gang

Changelog:
v5:
  - Fix missing initialization of msk->backlog_unaccounted in
    __mptcp_init_sock() to match the commit message.

---
 net/mptcp/protocol.c | 73 +++++++++++++++++++++++++++++++++++++-------
 net/mptcp/protocol.h |  2 +-
 2 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 46fedfa05a54..e6f83c8afc76 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -653,6 +653,33 @@ static void mptcp_dss_corruption(struct mptcp_sock *msk, struct sock *ssk)
 	}
 }
 
+static int mptcp_queue_backlog(struct mptcp_sock *msk, struct sk_buff *skb)
+{
+	u64 seq = MPTCP_SKB_CB(skb)->map_seq;
+	struct rb_node **p, *parent = NULL;
+
+	p = &msk->backlog_queue.rb_node;
+	if (RB_EMPTY_ROOT(&msk->backlog_queue))
+		goto insert;
+
+	while (*p) {
+		struct sk_buff *s;
+
+		parent = *p;
+		s = rb_to_skb(parent);
+
+		if (before64(seq, MPTCP_SKB_CB(s)->map_seq))
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
+	}
+
+insert:
+	rb_link_node(&skb->rbnode, parent, p);
+	rb_insert_color(&skb->rbnode, &msk->backlog_queue);
+	return 0;
+}
+
 static void __mptcp_add_backlog(struct sock *sk,
 				struct mptcp_subflow_context *subflow,
 				struct sk_buff *skb)
@@ -669,8 +696,8 @@ static void __mptcp_add_backlog(struct sock *sk,
 	}
 
 	/* Try to coalesce with the last skb in our backlog */
-	if (!list_empty(&msk->backlog_list))
-		tail = list_last_entry(&msk->backlog_list, struct sk_buff, list);
+	if (!RB_EMPTY_ROOT(&msk->backlog_queue))
+		tail = skb_rb_last(&msk->backlog_queue);
 
 	if (tail && MPTCP_SKB_CB(skb)->map_seq == MPTCP_SKB_CB(tail)->end_seq &&
 	    ssk == tail->sk &&
@@ -681,7 +708,7 @@ static void __mptcp_add_backlog(struct sock *sk,
 		goto account;
 	}
 
-	list_add_tail(&skb->list, &msk->backlog_list);
+	mptcp_queue_backlog(msk, skb);
 	mptcp_subflow_lend_fwdmem(subflow, skb);
 	delta = skb->truesize;
 
@@ -2199,6 +2226,29 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
 	return moved;
 }
 
+static void mptcp_backlog_queue_to_list(struct mptcp_sock *msk,
+					struct list_head *list)
+{
+	struct sk_buff *skb;
+
+	while ((skb = skb_rb_first(&msk->backlog_queue)) != NULL) {
+		rb_erase(&skb->rbnode, &msk->backlog_queue);
+		RB_CLEAR_NODE(&skb->rbnode);
+		list_add_tail(&skb->list, list);
+	}
+}
+
+static void mptcp_backlog_list_to_queue(struct mptcp_sock *msk,
+					struct list_head *list)
+{
+	struct sk_buff *skb, *tmp;
+
+	list_for_each_entry_safe(skb, tmp, list, list) {
+		list_del(&skb->list);
+		mptcp_queue_backlog(msk, skb);
+	}
+}
+
 static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -2210,12 +2260,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 			       mem_cgroup_from_sk(sk));
 
 	/* Don't spool the backlog if the rcvbuf is full. */
-	if (list_empty(&msk->backlog_list) ||
+	if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
 	    sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
 		return false;
 
 	INIT_LIST_HEAD(skbs);
-	list_splice_init(&msk->backlog_list, skbs);
+	mptcp_backlog_queue_to_list(msk, skbs);
 	return true;
 }
 
@@ -2225,7 +2275,7 @@ static void mptcp_backlog_spooled(struct sock *sk, u32 moved,
 	struct mptcp_sock *msk = mptcp_sk(sk);
 
 	WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
-	list_splice(skbs, &msk->backlog_list);
+	mptcp_backlog_list_to_queue(msk, skbs);
 }
 
 static bool mptcp_move_skbs(struct sock *sk)
@@ -2311,7 +2361,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 
 		copied += bytes_read;
 
-		if (!list_empty(&msk->backlog_list) && mptcp_move_skbs(sk))
+		if (!RB_EMPTY_ROOT(&msk->backlog_queue) && mptcp_move_skbs(sk))
 			continue;
 
 		/* only the MPTCP socket status is relevant here. The exit
@@ -2640,7 +2690,7 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 	/* Remove any reference from the backlog to this ssk; backlog skbs consume
 	 * space in the msk receive queue, no need to touch sk->sk_rmem_alloc
 	 */
-	list_for_each_entry(skb, &msk->backlog_list, list) {
+	skb_rbtree_walk(skb, &msk->backlog_queue) {
 		if (skb->sk != ssk)
 			continue;
 
@@ -2896,7 +2946,7 @@ static void mptcp_backlog_purge(struct sock *sk)
 	LIST_HEAD(backlog);
 
 	mptcp_data_lock(sk);
-	list_splice_init(&msk->backlog_list, &backlog);
+	mptcp_backlog_queue_to_list(msk, &backlog);
 	msk->backlog_len = 0;
 	mptcp_data_unlock(sk);
 
@@ -2999,13 +3049,14 @@ static void __mptcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&msk->conn_list);
 	INIT_LIST_HEAD(&msk->join_list);
 	INIT_LIST_HEAD(&msk->rtx_queue);
-	INIT_LIST_HEAD(&msk->backlog_list);
+	msk->backlog_queue = RB_ROOT;
 	INIT_WORK(&msk->work, mptcp_worker);
 	msk->out_of_order_queue = RB_ROOT;
 	msk->first_pending = NULL;
 	msk->timer_ival = TCP_RTO_MIN;
 	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
 	msk->backlog_len = 0;
+	msk->backlog_unaccounted = 0;
 	mptcp_init_rtt_est(msk);
 
 	WRITE_ONCE(msk->first, NULL);
@@ -4335,7 +4386,7 @@ static struct sk_buff *mptcp_recv_skb(struct sock *sk, u32 *off)
 	struct sk_buff *skb;
 	u32 offset;
 
-	if (!list_empty(&msk->backlog_list))
+	if (!RB_EMPTY_ROOT(&msk->backlog_queue))
 		mptcp_move_skbs(sk);
 
 	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index f5d4d7d030f2..f0eaba2c61fa 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -372,7 +372,7 @@ struct mptcp_sock {
 					 * allow_join
 					 */
 
-	struct list_head backlog_list;	/* protected by the data lock */
+	struct rb_root	backlog_queue;	/* protected by the data lock */
 	u32		backlog_len;
 	u32		backlog_unaccounted;
 };
-- 
2.43.0



* [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue
  2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
@ 2026-04-01  8:54 ` Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready Gang Yan
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang

From: Gang Yan <yangang@kylinos.cn>

The original condition would stop moving skbs or spooling the backlog
even when the receive queue is empty, leading to a receive stall.

Modify the condition in __mptcp_move_skbs() and mptcp_can_spool_backlog()
to only treat rcvbuf as full when:
  sk_rmem_alloc_get(sk) > sk->sk_rcvbuf && !skb_queue_empty(&sk->sk_receive_queue)

This ensures the backlog can still be moved to the receive queue when
the queue is empty, avoiding stall problems.

Fixes: 6228efe0cc01 ("mptcp: leverage the backlog for RX packet processing")
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
 net/mptcp/protocol.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e6f83c8afc76..8f1afd90290e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2197,6 +2197,12 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	msk->rcvq_space.time = mstamp;
 }
 
+static bool mptcp_can_dequeue_from_backlog(struct sock *sk)
+{
+	return sk_rmem_alloc_get(sk) <= sk->sk_rcvbuf ||
+	       skb_queue_empty(&sk->sk_receive_queue);
+}
+
 static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delta)
 {
 	struct sk_buff *skb = list_first_entry(skbs, struct sk_buff, list);
@@ -2206,7 +2212,7 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
 	*delta = 0;
 	while (1) {
 		/* If the msk recvbuf is full stop, don't drop */
-		if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+		if (!mptcp_can_dequeue_from_backlog(sk))
 			break;
 
 		prefetch(skb->next);
@@ -2261,7 +2267,7 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 
 	/* Don't spool the backlog if the rcvbuf is full. */
 	if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
-	    sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+	    !mptcp_can_dequeue_from_backlog(sk))
 		return false;
 
 	INIT_LIST_HEAD(skbs);
-- 
2.43.0



* [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready
  2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue Gang Yan
@ 2026-04-01  8:54 ` Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
  2026-04-01  8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
  4 siblings, 0 replies; 12+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang

From: Gang Yan <yangang@kylinos.cn>

A stall can occur when 'move_skbs_to_msk' leaves the backlog_queue
unprocessed.

This patch adds a check for the backlog_queue and moves skbs to the
receive queue when no skbs were moved from the subflow but the
backlog_queue is not empty.

Fixes: 6228efe0cc01 ("mptcp: leverage the backlog for RX packet processing")
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:

Hi Matt, Paolo:

Here is the response to the AI review for this patch:
- Potential dead-lock problems:
    I've fixed this in the next patch. Keeping it as a separate patch
    should make the review easier. Should I squash it into this one?

- The question of the short-circuiting logical OR (||):
    The short-circuiting is intentional: if 'move_skbs_to_msk()'
    returns true, there is already data in the receive queue, so the
    stall cannot occur. In that case, moving more data from the
    backlog isn't necessary; using '||' avoids redundant work and
    matches the intended logic.

Thanks
Gang

---
 net/mptcp/protocol.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 8f1afd90290e..326b5d4d79e7 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -49,6 +49,11 @@ static struct percpu_counter mptcp_sockets_allocated ____cacheline_aligned_in_sm
 
 static void __mptcp_destroy_sock(struct sock *sk);
 static void mptcp_check_send_data_fin(struct sock *sk);
+static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs);
+static void mptcp_backlog_spooled(struct sock *sk, u32 moved,
+				  struct list_head *skbs);
+static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs,
+			      u32 *delta);
 
 DEFINE_PER_CPU(struct mptcp_delegated_action, mptcp_delegated_actions) = {
 	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
@@ -906,6 +911,19 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	return moved;
 }
 
+static bool move_skbs_from_backlog(struct sock *sk)
+{
+	struct list_head skbs;
+	bool enqueued = false;
+	u32 moved;
+
+	while (mptcp_can_spool_backlog(sk, &skbs)) {
+		enqueued |= __mptcp_move_skbs(sk, &skbs, &moved);
+		mptcp_backlog_spooled(sk, moved, &skbs);
+	}
+	return enqueued;
+}
+
 static void mptcp_rcv_rtt_update(struct mptcp_sock *msk,
 				 struct mptcp_subflow_context *subflow)
 {
@@ -948,7 +966,9 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 	mptcp_rcv_rtt_update(msk, subflow);
 	if (!sock_owned_by_user(sk)) {
 		/* Wake-up the reader only for in-sequence data */
-		if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
+		if ((move_skbs_to_msk(msk, ssk) ||
+		     move_skbs_from_backlog(sk)) &&
+		    mptcp_epollin_ready(sk))
 			sk->sk_data_ready(sk);
 	} else {
 		__mptcp_move_skbs_from_subflow(msk, ssk, false);
-- 
2.43.0



* [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready
  2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
                   ` (2 preceding siblings ...)
  2026-04-01  8:54 ` [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready Gang Yan
@ 2026-04-01  8:54 ` Gang Yan
  2026-04-01 20:02   ` Paolo Abeni
  2026-04-01  8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
  4 siblings, 1 reply; 12+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang

From: Gang Yan <yangang@kylinos.cn>

This patch defers mptcp_check_data_fin from __mptcp_move_skbs to avoid
deadlock.

When processing backlogged data in softirq context, __mptcp_move_skbs
directly calls mptcp_check_data_fin, which can lead to a deadlock if
the msk socket is in TCP_FIN_WAIT2 state. In that case,
mptcp_check_data_fin calls mptcp_shutdown_subflows, which attempts to
lock the subflow socket with lock_sock_fast() - a sleeping function
that cannot be called in atomic context (softirq).

The correct pattern is already used in move_skbs_to_msk: if a data fin
is pending, schedule the work to be processed later in process context
rather than handling it directly.

Reported-by: Geliang Tang <geliang@kernel.org>
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:

Hi Matt, Paolo:

Here is the response to the AI review for this patch:

- Typo problem in subject line:
    Already fixed in v5.

- Remove else branch:
    The check does exhibit a TOCTOU race, though the race window is
    extremely narrow, and the else branch is useless even without the
    race. So I removed it in v5. WDYT?

Thanks,
Gang

Changelog:
v5:
  - Fix typo in title.
  - Remove the else branch.

---
 net/mptcp/protocol.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 326b5d4d79e7..88e19fa61b84 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2247,8 +2247,9 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
 	}
 
 	__mptcp_ofo_queue(msk);
-	if (moved)
-		mptcp_check_data_fin((struct sock *)msk);
+	if (moved && mptcp_pending_data_fin(sk, NULL))
+		mptcp_schedule_work(sk);
+
 	return moved;
 }
 
-- 
2.43.0



* [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf
  2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
                   ` (3 preceding siblings ...)
  2026-04-01  8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
@ 2026-04-01  8:54 ` Gang Yan
  2026-04-03  1:17   ` Geliang Tang
  4 siblings, 1 reply; 12+ messages in thread
From: Gang Yan @ 2026-04-01  8:54 UTC (permalink / raw)
  To: mptcp; +Cc: pabeni, Gang Yan, Geliang Tang

From: Gang Yan <yangang@kylinos.cn>

This patch adds a test for transmission with a limited receive buffer.

Without the fixes in this set, this test fails with errors:

 # 001 4 subflows with reduced rcvbuf
 # net.ipv4.tcp_rmem = 4096 4096 4096
 #       Info: Test file (size 1024 KB) for client
 #       Info: Test file (size 1024 KB) for server
 # copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
 # copyfd_io_poll: poll timed out (events: POLLIN 0, POLLOUT 4)
 # [FAIL] client exit code 2, server 2
 # Server ns stats (ns2-qg1Hqo)

Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
---
Notes:

Changelog:
v5:
  - Drop the unnecessary tcp_rmem restore.

---
 tools/testing/selftests/net/mptcp/mptcp_join.sh | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index beec41f6662a..a21333424085 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -4417,6 +4417,22 @@ endpoint_tests()
 	fi
 }
 
+rcvbuf_tests()
+{
+	if reset "4 subflows with reduced rcvbuf"; then
+		ip netns exec $ns2 sysctl -w net.ipv4.tcp_rmem="4096 4096 4096"
+
+		pm_nl_set_limits $ns1 0 4
+		pm_nl_set_limits $ns2 0 4
+		pm_nl_add_endpoint $ns2 10.0.2.2 flags subflow
+		pm_nl_add_endpoint $ns2 10.0.3.2 flags subflow
+		pm_nl_add_endpoint $ns2 10.0.4.2 flags subflow
+		speed=fast test_linkfail=1024 \
+			run_tests $ns1 $ns2 10.0.1.1
+		chk_join_nr 3 3 3
+	fi
+}
+
 # [$1: error message]
 usage()
 {
@@ -4467,6 +4483,7 @@ all_tests_sorted=(
 	F@fail_tests
 	u@userspace_tests
 	I@endpoint_tests
+	R@rcvbuf_tests
 )
 
 all_tests_args=""
-- 
2.43.0



* Re: [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready
  2026-04-01  8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
@ 2026-04-01 20:02   ` Paolo Abeni
       [not found]     ` <e1f58d941ad141f9f96a24d1a1d9d6ca74cc2f5e@linux.dev>
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2026-04-01 20:02 UTC (permalink / raw)
  To: Gang Yan, mptcp; +Cc: Gang Yan, Geliang Tang

On 4/1/26 10:54 AM, Gang Yan wrote:
> From: Gang Yan <yangang@kylinos.cn>
> 
> This patch defers mptcp_check_data_fin from __mptcp_move_skbs to avoid
> deadlock.
> 
> When processing backlogged data in softirq context, __mptcp_move_skbs
> directly calls mptcp_check_data_fin, which can lead to a deadlock if
> the msk socket is in TCP_FIN_WAIT2 state. In that case,
> mptcp_check_data_fin calls mptcp_shutdown_subflows, which attempts to
> lock the subflow socket with lock_sock_fast() - a sleeping function
> that cannot be called in atomic context (softirq).

The fix for such issue should be squashed inside the patch generating
the issue itself.

> The correct pattern is already used in move_skbs_to_msk: if a data fin
> is pending, schedule the work to be processed later in process context
> rather than handling it directly.
> 
> Reported-by: Geliang Tang <geliang@kernel.org>
> Co-developed-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Gang Yan <yangang@kylinos.cn>
> ---
> Notes:
> 
> Hi Matt,Paolo:
> 
> Here is the response of AI review for this patch:
> 
> - Typo problem in subject line:
>     Already fixed in v5.
> 
> - Remove else branch:
>     It demonstrates a TOCTOU race, though the race window is extremely
>     narrow. The else branch is useless even without the race. So I
>     removed it in v5. WDYT?
> 
> Thanks,
> Gang
> 
> Changelog:
> v5:
>   - Fix typo in title.
>   - Remove the else branch.
> 
> ---
>  net/mptcp/protocol.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 326b5d4d79e7..88e19fa61b84 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -2247,8 +2247,9 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
>  	}
>  
>  	__mptcp_ofo_queue(msk);
> -	if (moved)
> -		mptcp_check_data_fin((struct sock *)msk);
> +	if (moved && mptcp_pending_data_fin(sk, NULL))
> +		mptcp_schedule_work(sk);

Scheduling the worker when we are not in BH could introduce problematic
delays. Instead you could factor out a ____mptcp_move_skbs() helper (a
better name would be nice!) not including the mptcp_check_data_fin()
call, use the latter in mptcp_data_ready() and in such function
eventually schedule the mptcp work as needed.

/P



* Re: [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
  2026-04-01  8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
@ 2026-04-01 20:12   ` Paolo Abeni
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2026-04-01 20:12 UTC (permalink / raw)
  To: Gang Yan, mptcp; +Cc: Gang Yan, Geliang Tang

On 4/1/26 10:54 AM, Gang Yan wrote:
> @@ -2199,6 +2226,29 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
>  	return moved;
>  }
>  
> +static void mptcp_backlog_queue_to_list(struct mptcp_sock *msk,
> +					struct list_head *list)
> +{
> +	struct sk_buff *skb;
> +
> +	while ((skb = skb_rb_first(&msk->backlog_queue)) != NULL) {
> +		rb_erase(&skb->rbnode, &msk->backlog_queue);
> +		RB_CLEAR_NODE(&skb->rbnode);
> +		list_add_tail(&skb->list, list);
> +	}

This is O(N * log(N))

> +}
> +
> +static void mptcp_backlog_list_to_queue(struct mptcp_sock *msk,
> +					struct list_head *list)
> +{
> +	struct sk_buff *skb, *tmp;
> +
> +	list_for_each_entry_safe(skb, tmp, list, list) {
> +		list_del(&skb->list);
> +		mptcp_queue_backlog(msk, skb);
> +	}
> +}
> +
>  static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
>  {
>  	struct mptcp_sock *msk = mptcp_sk(sk);
> @@ -2210,12 +2260,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
>  			       mem_cgroup_from_sk(sk));
>  
>  	/* Don't spool the backlog if the rcvbuf is full. */
> -	if (list_empty(&msk->backlog_list) ||
> +	if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
>  	    sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
>  		return false;
>  
>  	INIT_LIST_HEAD(skbs);
> -	list_splice_init(&msk->backlog_list, skbs);
> +	mptcp_backlog_queue_to_list(msk, skbs);

This is under a spinlock with BH disabled, and N is potentially quite
high. This code is problematic as it could cause stall and delay BH for
a significant amount of time.

Note that there are more chunks later and in later patches with similar
problem.

I don't have a good solution on top of my head, I'll try to go back to
the design table.

/P



* Re: [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf
  2026-04-01  8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
@ 2026-04-03  1:17   ` Geliang Tang
  2026-04-03  8:52     ` Matthieu Baerts
  0 siblings, 1 reply; 12+ messages in thread
From: Geliang Tang @ 2026-04-03  1:17 UTC (permalink / raw)
  To: Gang Yan, mptcp; +Cc: pabeni, Gang Yan

Hi Gang,

On Wed, 2026-04-01 at 16:54 +0800, Gang Yan wrote:
> From: Gang Yan <yangang@kylinos.cn>
> 
> This patch adds a test for transmission with a limited receive
> buffer.
> 
> Without the fixes in this set, this test fails with errors:
> 
>  # 001 4 subflows with reduced rcvbuf
>  # net.ipv4.tcp_rmem = 4096 4096 4096
>  #       Info: Test file (size 1024 KB) for client
>  #       Info: Test file (size 1024 KB) for server
>  # copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
>  # copyfd_io_poll: poll timed out (events: POLLIN 0, POLLOUT 4)
>  # [FAIL] client exit code 2, server 2
>  # Server ns stats (ns2-qg1Hqo)
> 
> Co-developed-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Geliang Tang <geliang@kernel.org>
> Signed-off-by: Gang Yan <yangang@kylinos.cn>
> ---
> Notes:
> 
> Changelog:
> v5:
>   - Drop the unnecessary tcp_rmem restore.
> 
> ---
>  tools/testing/selftests/net/mptcp/mptcp_join.sh | 17
> +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh
> b/tools/testing/selftests/net/mptcp/mptcp_join.sh
> index beec41f6662a..a21333424085 100755
> --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
> +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
> @@ -4417,6 +4417,22 @@ endpoint_tests()
>  	fi
>  }
>  
> +rcvbuf_tests()
> +{
> +	if reset "4 subflows with reduced rcvbuf"; then
> +		ip netns exec $ns2 sysctl -w net.ipv4.tcp_rmem="4096
> 4096 4096"

Instead of using reset() here, it's better to set this sysctl in a new
helper - for example, reset_with_tcp_rmem() - as is done in other
tests.

Thanks,
-Geliang

> +
> +		pm_nl_set_limits $ns1 0 4
> +		pm_nl_set_limits $ns2 0 4
> +		pm_nl_add_endpoint $ns2 10.0.2.2 flags subflow
> +		pm_nl_add_endpoint $ns2 10.0.3.2 flags subflow
> +		pm_nl_add_endpoint $ns2 10.0.4.2 flags subflow
> +		speed=fast test_linkfail=1024 \
> +			run_tests $ns1 $ns2 10.0.1.1
> +		chk_join_nr 3 3 3
> +	fi
> +}
> +
>  # [$1: error message]
>  usage()
>  {
> @@ -4467,6 +4483,7 @@ all_tests_sorted=(
>  	F@fail_tests
>  	u@userspace_tests
>  	I@endpoint_tests
> +	R@rcvbuf_tests
>  )
>  
>  all_tests_args=""


* Fwd: Re: [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready
       [not found]     ` <e1f58d941ad141f9f96a24d1a1d9d6ca74cc2f5e@linux.dev>
@ 2026-04-03  6:45       ` gang.yan
  2026-04-13 15:49         ` Paolo Abeni
  0 siblings, 1 reply; 12+ messages in thread
From: gang.yan @ 2026-04-03  6:45 UTC (permalink / raw)
  To: mptcp

-------- Forwarded message -------

From: gang.yan@linux.dev
To: "Paolo Abeni" <pabeni@redhat.com>
Sent: April 3, 2026 at 2:44 PM
Subject: Re: [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready

April 2, 2026 at 4:02 AM, "Paolo Abeni" <pabeni@redhat.com> wrote:



> 
> On 4/1/26 10:54 AM, Gang Yan wrote:
>  
>  
>  From: Gang Yan <yangang@kylinos.cn>
>  
>  This patch defers mptcp_check_data_fin from __mptcp_move_skbs to avoid
>  deadlock.
>  
>  When processing backlogged data in softirq context, __mptcp_move_skbs
>  directly calls mptcp_check_data_fin, which can lead to a deadlock if
>  the msk socket is in TCP_FIN_WAIT2 state. In that case,
>  mptcp_check_data_fin calls mptcp_shutdown_subflows, which attempts to
>  lock the subflow socket with lock_sock_fast() - a sleeping function
>  that cannot be called in atomic context (softirq).
>  
>  The fix for such issue should be squashed inside the patch generating
>  the issue itself.
>  
>  
>  The correct pattern is already used in move_skbs_to_msk: if a data fin
>  is pending, schedule the work to be processed later in process context
>  rather than handling it directly.
>  
>  Reported-by: Geliang Tang <geliang@kernel.org>
>  Co-developed-by: Geliang Tang <geliang@kernel.org>
>  Signed-off-by: Geliang Tang <geliang@kernel.org>
>  Signed-off-by: Gang Yan <yangang@kylinos.cn>
>  ---
>  Notes:
>  
>  Hi Matt, Paolo:
>  
>  Here is the response of AI review for this patch:
>  
>  - Typo problem in subject line:
>  Already fixed in v5.
>  
>  - Remove else branch:
>  It demonstrates a TOCTOU race, though the race window is extremely
>  narrow. The else branch is useless even without the race. So I
>  removed it in v5. WDYT?
>  
>  Thanks,
>  Gang
>  
>  Changelog:
>  v5:
>  - Fix typo in title.
>  - Remove the else branch.
>  
>  ---
>  net/mptcp/protocol.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>  
>  diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>  index 326b5d4d79e7..88e19fa61b84 100644
>  --- a/net/mptcp/protocol.c
>  +++ b/net/mptcp/protocol.c
>  @@ -2247,8 +2247,9 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
>  }
>  
>  __mptcp_ofo_queue(msk);
>  - if (moved)
>  - mptcp_check_data_fin((struct sock *)msk);
>  + if (moved && mptcp_pending_data_fin(sk, NULL))
>  + mptcp_schedule_work(sk);
>  
>  Scheduling the worker when we are not in BH could introduce problematic
>  delays. Instead you could factor out a ____mptcp_move_skbs() helper (a
>  better name would be nice!) not including the mptcp_check_data_fin()
>  call, use the latter in mptcp_data_ready() and in such function
>  eventually schedule the mptcp work as needed.
> 
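If I read the suggestion correctly, the factoring would look roughly like
this (a sketch only: '____mptcp_move_skbs()' is the placeholder name from
the review, arguments and surrounding context are elided, so this is not
actual kernel code):

```c
/* Sketch of the suggested split; placeholder name from the review,
 * arguments and locking context elided. */
static bool ____mptcp_move_skbs(struct sock *sk, ...)
{
	/* move skbs out of the backlog, but do NOT call
	 * mptcp_check_data_fin() here
	 */
	...
}

/* process context (e.g. recvmsg path): safe to handle DATA_FIN inline */
static bool __mptcp_move_skbs(struct sock *sk, ...)
{
	bool moved = ____mptcp_move_skbs(sk, ...);

	if (moved)
		mptcp_check_data_fin(sk);
	return moved;
}

/* BH context: only schedule the worker when a DATA_FIN is pending */
void mptcp_data_ready(struct sock *sk, struct sock *ssk)
{
	...
	if (____mptcp_move_skbs(sk, ...) && mptcp_pending_data_fin(sk, NULL))
		mptcp_schedule_work(sk);
	...
}
```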
Hi Paolo,

Thanks for your review. 
I'm a little bit confused. Does this mean 'move_skbs_to_msk' should be
fixed too? Like this:

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6ce0d18c86b6..e90c8621683c 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -869,13 +869,6 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 if (unlikely(ssk->sk_err))
 __mptcp_subflow_error_report(sk, ssk);
 
- /* If the moves have caught up with the DATA_FIN sequence number
- * it's time to ack the DATA_FIN and change socket state, but
- * this is not a good place to change state. Let the workqueue
- * do it.
- */
- if (mptcp_pending_data_fin(sk, NULL))
- mptcp_schedule_work(sk);
 return moved;
 }
 
@@ -909,6 +902,7 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 struct mptcp_sock *msk = mptcp_sk(sk);
+ bool moved = false;
 
 /* The peer can send data while we are shutting down this
 * subflow at subflow destruction time, but we must avoid enqueuing
@@ -920,13 +914,17 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 mptcp_data_lock(sk);
 mptcp_rcv_rtt_update(msk, subflow);
 if (!sock_owned_by_user(sk)) {
+ moved = move_skbs_to_msk(msk, ssk); 
 /* Wake-up the reader only for in-sequence data */
- if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
+ if (moved && mptcp_epollin_ready(sk))
 sk->sk_data_ready(sk);
 } else {
 __mptcp_move_skbs_from_subflow(msk, ssk, false);
 }
 mptcp_data_unlock(sk);
+ 
+ if (mptcp_pending_data_fin(sk, NULL))
+ mptcp_check_data_fin(sk);
 }

Looking forward to your reply.

Many thanks,
Gang

> 
> /P
>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf
  2026-04-03  1:17   ` Geliang Tang
@ 2026-04-03  8:52     ` Matthieu Baerts
  0 siblings, 0 replies; 12+ messages in thread
From: Matthieu Baerts @ 2026-04-03  8:52 UTC (permalink / raw)
  To: Geliang Tang, Gang Yan, mptcp; +Cc: pabeni, Gang Yan

Hi Geliang,

On 03/04/2026 03:17, Geliang Tang wrote:
> Hi Gang,
> 
> On Wed, 2026-04-01 at 16:54 +0800, Gang Yan wrote:
>> From: Gang Yan <yangang@kylinos.cn>
>>
>> This patch adds a test for transmission with a limited receive
>> buffer.
>>
>> Without the fixes in this set, this test fails with errors:
>>
>>  # 001 4 subflows with reduced rcvbuf
>>  # net.ipv4.tcp_rmem = 4096 4096 4096
>>  #       Info: Test file (size 1024 KB) for client
>>  #       Info: Test file (size 1024 KB) for server
>>  # copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
>>  # copyfd_io_poll: poll timed out (events: POLLIN 0, POLLOUT 4)
>>  # [FAIL] client exit code 2, server 2
>>  # Server ns stats (ns2-qg1Hqo)
>>
>> Co-developed-by: Geliang Tang <geliang@kernel.org>
>> Signed-off-by: Geliang Tang <geliang@kernel.org>
>> Signed-off-by: Gang Yan <yangang@kylinos.cn>
>> ---
>> Notes:
>>
>> Changelog:
>> v5:
>>   - Drop the unnecessary tcp_rmem restore.
>>
>> ---
>>  tools/testing/selftests/net/mptcp/mptcp_join.sh | 17
>> +++++++++++++++++
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh
>> b/tools/testing/selftests/net/mptcp/mptcp_join.sh
>> index beec41f6662a..a21333424085 100755
>> --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
>> +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
>> @@ -4417,6 +4417,22 @@ endpoint_tests()
>>  	fi
>>  }
>>  
>> +rcvbuf_tests()
>> +{
>> +	if reset "4 subflows with reduced rcvbuf"; then
>> +		ip netns exec $ns2 sysctl -w net.ipv4.tcp_rmem="4096
>> 4096 4096"
> 
> Instead of using reset() here, it's better to set this sysctl in a new
> helper - for example, reset_with_tcp_rmem() - as is done in other
> tests.

If there is only one test, I think it is fine to leave this here: at
least it is clear what is being done. A helper only used once just to
set one extra command might not be worth it.

But what you can do is to use 'sysctl -q', like every time 'sysctl' is used.
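For instance, the line in the new test would then become something like
this (an untested sketch, reusing the same sysctl as in the patch):

```shell
# quiet form, matching how the other sysctl calls in mptcp_join.sh are done
ip netns exec $ns2 sysctl -qw net.ipv4.tcp_rmem="4096 4096 4096"
```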

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fwd: Re: [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready
  2026-04-03  6:45       ` Fwd: " gang.yan
@ 2026-04-13 15:49         ` Paolo Abeni
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2026-04-13 15:49 UTC (permalink / raw)
  To: gang.yan, mptcp

Hi,

I'm sorry for the very late reply. I really can't keep up with
everything; my latency here is going to be even worse.

On 4/3/26 8:45 AM, gang.yan@linux.dev wrote:
> -------- Forwarded message -------
> 
> From: gang.yan@linux.dev
> To: "Paolo Abeni" <pabeni@redhat.com>
> Sent: April 3, 2026 at 2:44 PM
> Subject: Re: [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready
> 
> April 2, 2026 at 4:02 AM, "Paolo Abeni" <pabeni@redhat.com> wrote:
> 
> 
> 
>>
>> On 4/1/26 10:54 AM, Gang Yan wrote:
>>  
>>  
>>  From: Gang Yan <yangang@kylinos.cn>
>>  
>>  This patch defers mptcp_check_data_fin from __mptcp_move_skbs to avoid
>>  deadlock.
>>  
>>  When processing backlogged data in softirq context, __mptcp_move_skbs
>>  directly calls mptcp_check_data_fin, which can lead to a deadlock if
>>  the msk socket is in TCP_FIN_WAIT2 state. In that case,
>>  mptcp_check_data_fin calls mptcp_shutdown_subflows, which attempts to
>>  lock the subflow socket with lock_sock_fast() - a sleeping function
>>  that cannot be called in atomic context (softirq).
>>  
>>  The fix for such issue should be squashed inside the patch generating
>>  the issue itself.
>>  
>>  
>>  The correct pattern is already used in move_skbs_to_msk: if a data fin
>>  is pending, schedule the work to be processed later in process context
>>  rather than handling it directly.
>>  
>>  Reported-by: Geliang Tang <geliang@kernel.org>
>>  Co-developed-by: Geliang Tang <geliang@kernel.org>
>>  Signed-off-by: Geliang Tang <geliang@kernel.org>
>>  Signed-off-by: Gang Yan <yangang@kylinos.cn>
>>  ---
>>  Notes:
>>  
>>  Hi Matt, Paolo:
>>  
>>  Here is the response of AI review for this patch:
>>  
>>  - Typo problem in subject line:
>>  Already fixed in v5.
>>  
>>  - Remove else branch:
>>  It demonstrates a TOCTOU race, though the race window is extremely
>>  narrow. The else branch is useless even without the race. So I
>>  removed it in v5. WDYT?
>>  
>>  Thanks,
>>  Gang
>>  
>>  Changelog:
>>  v5:
>>  - Fix typo in title.
>>  - Remove the else branch.
>>  
>>  ---
>>  net/mptcp/protocol.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>  
>>  diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>  index 326b5d4d79e7..88e19fa61b84 100644
>>  --- a/net/mptcp/protocol.c
>>  +++ b/net/mptcp/protocol.c
>>  @@ -2247,8 +2247,9 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
>>  }
>>  
>>  __mptcp_ofo_queue(msk);
>>  - if (moved)
>>  - mptcp_check_data_fin((struct sock *)msk);
>>  + if (moved && mptcp_pending_data_fin(sk, NULL))
>>  + mptcp_schedule_work(sk);
>>  
>>  Scheduling the worker when we are not in BH could introduce problematic
>>  delays. Instead you could factor out a ____mptcp_move_skbs() helper (a
>>  better name would be nice!) not including the mptcp_check_data_fin()
>>  call, use the latter in mptcp_data_ready() and in such function
>>  eventually schedule the mptcp work as needed.
>>
> Hi Paolo,
> 
> Thanks for your review. 
> I'm a little bit confused. Does this mean 'move_skbs_to_msk' should be
> fixed too? Like this:

The additional latency due to scheduling the work is the lesser problem;
the main one is the potentially very long time spent in RB tree processing.

> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 6ce0d18c86b6..e90c8621683c 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -869,13 +869,6 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
>  if (unlikely(ssk->sk_err))
>  __mptcp_subflow_error_report(sk, ssk);
>  
> - /* If the moves have caught up with the DATA_FIN sequence number
> - * it's time to ack the DATA_FIN and change socket state, but
> - * this is not a good place to change state. Let the workqueue
> - * do it.
> - */
> - if (mptcp_pending_data_fin(sk, NULL))
> - mptcp_schedule_work(sk);

Also, off the top of my memory, it has been like that since the
beginning and never caused problems. The point vs the proposed patch is
that it adds latency to the current status quo.

>  return moved;
>  }
>  
> @@ -909,6 +902,7 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
>  {
>  struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
>  struct mptcp_sock *msk = mptcp_sk(sk);
> + bool moved = false;
>  
>  /* The peer can send data while we are shutting down this
>  * subflow at subflow destruction time, but we must avoid enqueuing
> @@ -920,13 +914,17 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
>  mptcp_data_lock(sk);
>  mptcp_rcv_rtt_update(msk, subflow);
>  if (!sock_owned_by_user(sk)) {
> + moved = move_skbs_to_msk(msk, ssk); 
>  /* Wake-up the reader only for in-sequence data */
> - if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
> + if (moved && mptcp_epollin_ready(sk))
>  sk->sk_data_ready(sk);
>  } else {
>  __mptcp_move_skbs_from_subflow(msk, ssk, false);
>  }
>  mptcp_data_unlock(sk);
> + 
> + if (mptcp_pending_data_fin(sk, NULL))
> + mptcp_check_data_fin(sk);

Finally, this chunk should still (mostly) run in atomic context, as it's
invoked by the subflow sk_data_ready and sk_state_change callbacks,
which almost always run in BH scope.

TL;DR: this chunk is not needed.

/P


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-04-13 15:49 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
2026-04-01  8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
2026-04-01 20:12   ` Paolo Abeni
2026-04-01  8:54 ` [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue Gang Yan
2026-04-01  8:54 ` [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready Gang Yan
2026-04-01  8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
2026-04-01 20:02   ` Paolo Abeni
     [not found]     ` <e1f58d941ad141f9f96a24d1a1d9d6ca74cc2f5e@linux.dev>
2026-04-03  6:45       ` Fwd: " gang.yan
2026-04-13 15:49         ` Paolo Abeni
2026-04-01  8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
2026-04-03  1:17   ` Geliang Tang
2026-04-03  8:52     ` Matthieu Baerts

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox