[PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing
@ 2025-10-03 14:01 Paolo Abeni
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 1/8] mptcp: borrow forward memory from subflow Paolo Abeni
                   ` (9 more replies)
  0 siblings, 10 replies; 14+ messages in thread
From: Paolo Abeni @ 2025-10-03 14:01 UTC (permalink / raw)
  To: mptcp

This series includes RX path improvement built around backlog processing

The main goals are improving the RX performances _and_ increase the
long term maintainability.

Patches 1-3 prepare the stack for backlog processing, removing
assumptions that will not hold true anymore after backlog introduction.

Patch 4 fixes a long standing issue which is quite hard to reproduce
with the current implementation but will become very apparent with
backlog usage.

Patches 5 and 6 are more cleanups that will make the backlog patch a
little less huge.

Patch 7 is a somewhat unrelated cleanup, included here before I forgot
about it.

The real work is done by patch 8. There are significant changes vs the
previous iteration, as it turned out we can't uset the sk_backlog, as
the mptcp release callback can also release and re-acquire the msk-level
spinlock and core backlog processing works under the assumption that
such event is not possible.

Other relevant points are:
- skbs in the backlog are _not_ accounted. TCP does the same, and we
  can't update the fwd mem while enqueuing to the backlog as the caller
  does not own the msk-level socket lock nor can acquire it.
- skbs in the backlog still use the incoming ssk rmem. This allows
  backpressure and implicitly prevent excessive memory usage for the
  backlog itself
- [this is possibly the most critical point]: when the msk rx buf is
  full, we don't add more packets there even when the caller owns the
  msk socket lock. Instead packets are added to the backlog. Note that
  the amount of memory used there is still limited by the above. Also
  note that this implicitly means that such packets could stage in the
  backlog until the receiver flushes the rx buffer - an unbound amount
  of time. That is not supposed to happen for the backlog, ence the
  criticality here.

This survived a few hours of selftest iterations in a loop: should
address all functional issues observed in previous iterations (and
possibly includes different ones ;)

Paolo Abeni (8):
  mptcp: borrow forward memory from subflow
  mptcp: cleanup fallback data fin reception
  mptcp: cleanup fallback dummy mapping generation
  mptcp: fix MSG_PEEK stream corruption
  mptcp: ensure the kernel PM does not take action too late
  mptcp: do not miss early first subflow close event notification.
  mptcp: make mptcp_destroy_common() static
  mptcp: leverage the backlog for RX packet processing

 net/mptcp/pm.c        |   4 +-
 net/mptcp/pm_kernel.c |   2 +
 net/mptcp/protocol.c  | 306 +++++++++++++++++++++++++++---------------
 net/mptcp/protocol.h  |   7 +-
 net/mptcp/subflow.c   |  12 +-
 5 files changed, 215 insertions(+), 116 deletions(-)

-- 
2.51.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v4 mptcp-next 1/8] mptcp: borrow forward memory from subflow
  2025-10-03 14:01 [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing Paolo Abeni
@ 2025-10-03 14:01 ` Paolo Abeni
  2025-10-03 15:41   ` Matthieu Baerts
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 2/8] mptcp: cleanup fallback data fin reception Paolo Abeni
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 14+ messages in thread
From: Paolo Abeni @ 2025-10-03 14:01 UTC (permalink / raw)
  To: mptcp

In the MPTCP receive path, we release the subflow allocated
fwd memory just to allocate it again shortly after for the msk.

That could increases the failures chances, especially during
backlog processing, when other actions could consume the just
released memory before the msk socket has a chance to do the
rcv allocation.

Replace the skb_orphan() call with an open-coded variant that
explicitly borrows, with a PAGE_SIZE granularity, the fwd memory
from the subflow socket instead of releasing it. During backlog
processing the borrowed memory is accounted at release_cb time.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v1 -> v2:
  - rebased
  - explain why skb_orphan is removed
---
 net/mptcp/protocol.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 574a1e222d9cf..34661ab979158 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -337,11 +337,12 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 		mptcp_rcvbuf_grow(sk);
 }
 
-static void mptcp_init_skb(struct sock *ssk,
-			   struct sk_buff *skb, int offset, int copy_len)
+static int mptcp_init_skb(struct sock *ssk,
+			  struct sk_buff *skb, int offset, int copy_len)
 {
 	const struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	bool has_rxtstamp = TCP_SKB_CB(skb)->has_rxtstamp;
+	int borrowed;
 
 	/* the skb map_seq accounts for the skb offset:
 	 * mptcp_subflow_get_mapped_dsn() is based on the current tp->copied_seq
@@ -357,6 +358,13 @@ static void mptcp_init_skb(struct sock *ssk,
 
 	skb_ext_reset(skb);
 	skb_dst_drop(skb);
+
+	/* "borrow" the fwd memory from the subflow, instead of reclaiming it */
+	skb->destructor = NULL;
+	borrowed = ssk->sk_forward_alloc - sk_unused_reserved_mem(ssk);
+	borrowed &= ~(PAGE_SIZE - 1);
+	sk_forward_alloc_add(ssk, skb->truesize - borrowed);
+	return borrowed;
 }
 
 static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
@@ -690,9 +698,12 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 
 		if (offset < skb->len) {
 			size_t len = skb->len - offset;
+			int bmem;
 
-			mptcp_init_skb(ssk, skb, offset, len);
-			skb_orphan(skb);
+			bmem = mptcp_init_skb(ssk, skb, offset, len);
+			skb->sk = NULL;
+			sk_forward_alloc_add(sk, bmem);
+			atomic_sub(skb->truesize, &ssk->sk_rmem_alloc);
 			ret = __mptcp_move_skb(sk, skb) || ret;
 			seq += len;
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 mptcp-next 1/8] mptcp: borrow forward memory from subflow
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 1/8] mptcp: borrow forward memory from subflow Paolo Abeni
@ 2025-10-03 15:41   ` Matthieu Baerts
  0 siblings, 0 replies; 14+ messages in thread
From: Matthieu Baerts @ 2025-10-03 15:41 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

On 03/10/2025 16:01, Paolo Abeni wrote:
> In the MPTCP receive path, we release the subflow allocated
> fwd memory just to allocate it again shortly after for the msk.
> 
> That could increases the failures chances, especially during
> backlog processing, when other actions could consume the just
> released memory before the msk socket has a chance to do the
> rcv allocation.
> 
> Replace the skb_orphan() call with an open-coded variant that
> explicitly borrows, with a PAGE_SIZE granularity, the fwd memory
> from the subflow socket instead of releasing it. During backlog
> processing the borrowed memory is accounted at release_cb time.
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v1 -> v2:
>   - rebased
>   - explain why skb_orphan is removed

Thank you for the new version!

> ---
>  net/mptcp/protocol.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 574a1e222d9cf..34661ab979158 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -337,11 +337,12 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
>  		mptcp_rcvbuf_grow(sk);
>  }
>  
> -static void mptcp_init_skb(struct sock *ssk,
> -			   struct sk_buff *skb, int offset, int copy_len)
> +static int mptcp_init_skb(struct sock *ssk,
> +			  struct sk_buff *skb, int offset, int copy_len)

There is a conflict here with the export branch. Sorry, that's my fault:
when I sent the patches upstream, I moved some arguments to the previous
line (as reported by Geliang):

 static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, (...)

I manually resolved the conflicts and sent it to our CI:

https://github.com/multipath-tcp/mptcp_net-next/commit/2ed154e487d2cbb1ea49428d0c524354289a3eb4/checks

While at it, I also restarted the syzkaller instances to validate this
series.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v4 mptcp-next 2/8] mptcp: cleanup fallback data fin reception
  2025-10-03 14:01 [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing Paolo Abeni
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 1/8] mptcp: borrow forward memory from subflow Paolo Abeni
@ 2025-10-03 14:01 ` Paolo Abeni
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 3/8] mptcp: cleanup fallback dummy mapping generation Paolo Abeni
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2025-10-03 14:01 UTC (permalink / raw)
  To: mptcp

MPTCP currently generate a dummy data_fin for fallback socket
when the fallback subflow has completed data reception using
the current ack_seq.

We are going to introduce backlog usage for the msk soon, even
for fallback sockets: the ack_seq value will not match the most recent
sequence number seen by the fallback subflow socket, as it will ignore
data_seq sitting in the backlog.

Instead use the last map sequence number to set the data_fin,
as fallback (dummy) map sequences are always in sequence.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v2 -> v3:
  - keep the close check in subflow_sched_work_if_closed, fix
    CI failures
---
 net/mptcp/subflow.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index e8325890a3223..b9455c04e8a46 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1285,6 +1285,7 @@ static bool subflow_is_done(const struct sock *sk)
 /* sched mptcp worker for subflow cleanup if no more data is pending */
 static void subflow_sched_work_if_closed(struct mptcp_sock *msk, struct sock *ssk)
 {
+	const struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	struct sock *sk = (struct sock *)msk;
 
 	if (likely(ssk->sk_state != TCP_CLOSE &&
@@ -1303,7 +1304,8 @@ static void subflow_sched_work_if_closed(struct mptcp_sock *msk, struct sock *ss
 	 */
 	if (__mptcp_check_fallback(msk) && subflow_is_done(ssk) &&
 	    msk->first == ssk &&
-	    mptcp_update_rcv_data_fin(msk, READ_ONCE(msk->ack_seq), true))
+	    mptcp_update_rcv_data_fin(msk, subflow->map_seq +
+				      subflow->map_data_len, true))
 		mptcp_schedule_work(sk);
 }
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v4 mptcp-next 3/8] mptcp: cleanup fallback dummy mapping generation
  2025-10-03 14:01 [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing Paolo Abeni
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 1/8] mptcp: borrow forward memory from subflow Paolo Abeni
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 2/8] mptcp: cleanup fallback data fin reception Paolo Abeni
@ 2025-10-03 14:01 ` Paolo Abeni
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 4/8] mptcp: fix MSG_PEEK stream corruption Paolo Abeni
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2025-10-03 14:01 UTC (permalink / raw)
  To: mptcp

MPTCP currently access ack_seq outside the msk socket log scope to
generate the dummy mapping for fallback socket. Soon we are going
to introduce backlog usage and even for fallback socket the ack_seq
value will be significantly off outside of the msk socket lock scope.

Avoid relying on ack_seq for dummy mapping generation, using instead
the subflow sequence number. Note that in case of disconnect() and
(re)connect() we must ensure that any previous state is re-set.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v2 -> v3:
 - reordered before the backlog introduction to avoid transiently
   break the fallback
 - explicitly reset ack_seq
---
 net/mptcp/protocol.c | 3 +++
 net/mptcp/subflow.c  | 8 +++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 34661ab979158..12f201aa81f43 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3234,6 +3234,9 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 	msk->bytes_retrans = 0;
 	msk->rcvspace_init = 0;
 
+	/* for fallback's sake */
+	WRITE_ONCE(msk->ack_seq, 0);
+
 	WRITE_ONCE(sk->sk_shutdown, 0);
 	sk_error_report(sk);
 	return 0;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index b9455c04e8a46..ac8616e7521e8 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -491,6 +491,9 @@ static void subflow_set_remote_key(struct mptcp_sock *msk,
 	mptcp_crypto_key_sha(subflow->remote_key, NULL, &subflow->iasn);
 	subflow->iasn++;
 
+	/* for fallback's sake */
+	subflow->map_seq = subflow->iasn;
+
 	WRITE_ONCE(msk->remote_key, subflow->remote_key);
 	WRITE_ONCE(msk->ack_seq, subflow->iasn);
 	WRITE_ONCE(msk->can_ack, true);
@@ -1435,9 +1438,12 @@ static bool subflow_check_data_avail(struct sock *ssk)
 
 	skb = skb_peek(&ssk->sk_receive_queue);
 	subflow->map_valid = 1;
-	subflow->map_seq = READ_ONCE(msk->ack_seq);
 	subflow->map_data_len = skb->len;
 	subflow->map_subflow_seq = tcp_sk(ssk)->copied_seq - subflow->ssn_offset;
+	subflow->map_seq = __mptcp_expand_seq(subflow->map_seq,
+					      subflow->iasn +
+					      TCP_SKB_CB(skb)->seq -
+					      subflow->ssn_offset - 1);
 	WRITE_ONCE(subflow->data_avail, true);
 	return true;
 }
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v4 mptcp-next 4/8] mptcp: fix MSG_PEEK stream corruption
  2025-10-03 14:01 [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing Paolo Abeni
                   ` (2 preceding siblings ...)
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 3/8] mptcp: cleanup fallback dummy mapping generation Paolo Abeni
@ 2025-10-03 14:01 ` Paolo Abeni
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 5/8] mptcp: ensure the kernel PM does not take action too late Paolo Abeni
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2025-10-03 14:01 UTC (permalink / raw)
  To: mptcp

If a MSG_PEEK | MSG_WAITALL read operation consumes all the bytes in the
receive queue and recvmsg() need to waits for more data - i.e. it's a
blocking one - upon arrival of the next packet the MPTCP protocol will
start again copying the oldest data present in the receive queue,
corrupting the data stream.

Address the issue explicitly tracking the peeked sequence number,
restarting from the last peeked byte.

Fixes: ca4fb892579f ("mptcp: add MSG_PEEK support")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This may sound quite esoteric, but it will soon become very easy to
reproduce with mptcp_connect, thanks to the backlog.
---
 net/mptcp/protocol.c | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 12f201aa81f43..ce1238f620c33 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1947,22 +1947,36 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied);
 
-static int __mptcp_recvmsg_mskq(struct sock *sk,
-				struct msghdr *msg,
-				size_t len, int flags,
+static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg,
+				size_t len, int flags, int copied_total,
 				struct scm_timestamping_internal *tss,
 				int *cmsg_flags)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct sk_buff *skb, *tmp;
+	int total_data_len = 0;
 	int copied = 0;
 
 	skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
-		u32 offset = MPTCP_SKB_CB(skb)->offset;
+		u32 delta, offset = MPTCP_SKB_CB(skb)->offset;
 		u32 data_len = skb->len - offset;
-		u32 count = min_t(size_t, len - copied, data_len);
+		u32 count;
 		int err;
 
+		if (flags & MSG_PEEK) {
+			/* skip already peeked skbs*/
+			if (total_data_len + data_len <= copied_total) {
+				total_data_len += data_len;
+				continue;
+			}
+
+			/* skip the already peeked data in the current skb */
+			delta = copied_total - total_data_len;
+			offset += delta;
+			data_len -= delta;
+		}
+
+		count = min_t(size_t, len - copied, data_len);
 		if (!(flags & MSG_TRUNC)) {
 			err = skb_copy_datagram_msg(skb, offset, msg, count);
 			if (unlikely(err < 0)) {
@@ -1979,16 +1993,14 @@ static int __mptcp_recvmsg_mskq(struct sock *sk,
 
 		copied += count;
 
-		if (count < data_len) {
-			if (!(flags & MSG_PEEK)) {
+		if (!(flags & MSG_PEEK)) {
+			msk->bytes_consumed += count;
+			if (count < data_len) {
 				MPTCP_SKB_CB(skb)->offset += count;
 				MPTCP_SKB_CB(skb)->map_seq += count;
-				msk->bytes_consumed += count;
+				break;
 			}
-			break;
-		}
 
-		if (!(flags & MSG_PEEK)) {
 			/* avoid the indirect call, we know the destructor is sock_rfree */
 			skb->destructor = NULL;
 			skb->sk = NULL;
@@ -1996,7 +2008,6 @@ static int __mptcp_recvmsg_mskq(struct sock *sk,
 			sk_mem_uncharge(sk, skb->truesize);
 			__skb_unlink(skb, &sk->sk_receive_queue);
 			skb_attempt_defer_free(skb);
-			msk->bytes_consumed += count;
 		}
 
 		if (copied >= len)
@@ -2194,7 +2205,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	while (copied < len) {
 		int err, bytes_read;
 
-		bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags, &tss, &cmsg_flags);
+		bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags,
+						  copied, &tss, &cmsg_flags);
 		if (unlikely(bytes_read < 0)) {
 			if (!copied)
 				copied = bytes_read;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v4 mptcp-next 5/8] mptcp: ensure the kernel PM does not take action too late
  2025-10-03 14:01 [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing Paolo Abeni
                   ` (3 preceding siblings ...)
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 4/8] mptcp: fix MSG_PEEK stream corruption Paolo Abeni
@ 2025-10-03 14:01 ` Paolo Abeni
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 6/8] mptcp: do not miss early first subflow close event notification Paolo Abeni
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2025-10-03 14:01 UTC (permalink / raw)
  To: mptcp

The PM hooks can currently take place when when the msk is already
shutting down. Subflow creation will fail, thanks to the existing
check at join time, but we can entirely avoid starting the to be failed
operations.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/pm.c        | 4 +++-
 net/mptcp/pm_kernel.c | 2 ++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index daf6dcb806843..eade530d38e01 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -588,6 +588,7 @@ void mptcp_pm_subflow_established(struct mptcp_sock *msk)
 void mptcp_pm_subflow_check_next(struct mptcp_sock *msk,
 				 const struct mptcp_subflow_context *subflow)
 {
+	struct sock *sk = (struct sock *)msk;
 	struct mptcp_pm_data *pm = &msk->pm;
 	bool update_subflows;
 
@@ -611,7 +612,8 @@ void mptcp_pm_subflow_check_next(struct mptcp_sock *msk,
 	/* Even if this subflow is not really established, tell the PM to try
 	 * to pick the next ones, if possible.
 	 */
-	if (mptcp_pm_nl_check_work_pending(msk))
+	if (mptcp_is_fully_established(sk) &&
+	    mptcp_pm_nl_check_work_pending(msk))
 		mptcp_pm_schedule_work(msk, MPTCP_PM_SUBFLOW_ESTABLISHED);
 
 	spin_unlock_bh(&pm->lock);
diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c
index da431da16ae04..07b5142004e73 100644
--- a/net/mptcp/pm_kernel.c
+++ b/net/mptcp/pm_kernel.c
@@ -328,6 +328,8 @@ static void mptcp_pm_create_subflow_or_signal_addr(struct mptcp_sock *msk)
 	struct mptcp_pm_local local;
 
 	mptcp_mpc_endpoint_setup(msk);
+	if (!mptcp_is_fully_established(sk))
+		return;
 
 	pr_debug("local %d:%d signal %d:%d subflows %d:%d\n",
 		 msk->pm.local_addr_used, endp_subflow_max,
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v4 mptcp-next 6/8] mptcp: do not miss early first subflow close event notification.
  2025-10-03 14:01 [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing Paolo Abeni
                   ` (4 preceding siblings ...)
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 5/8] mptcp: ensure the kernel PM does not take action too late Paolo Abeni
@ 2025-10-03 14:01 ` Paolo Abeni
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 7/8] mptcp: make mptcp_destroy_common() static Paolo Abeni
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2025-10-03 14:01 UTC (permalink / raw)
  To: mptcp

The MPTCP protocol is not currently emitting the NL event when the first
subflow is closed before msk accept() time.

By replacing the in use close helper is such scenario, implicitly introduce
the missing notification. Note that in such scenario we want to be sure
that mptcp_close_ssk() will not trigger any PM work, move the msk state
change update earlier, so that the previous patch will offer such
guarantee.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index ce1238f620c33..6ae5ab7595272 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3988,10 +3988,10 @@ static int mptcp_stream_accept(struct socket *sock, struct socket *newsock,
 		 * deal with bad peers not doing a complete shutdown.
 		 */
 		if (unlikely(inet_sk_state_load(msk->first) == TCP_CLOSE)) {
-			__mptcp_close_ssk(newsk, msk->first,
-					  mptcp_subflow_ctx(msk->first), 0);
 			if (unlikely(list_is_singular(&msk->conn_list)))
 				mptcp_set_state(newsk, TCP_CLOSE);
+			mptcp_close_ssk(newsk, msk->first,
+					mptcp_subflow_ctx(msk->first));
 		}
 	} else {
 tcpfallback:
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v4 mptcp-next 7/8] mptcp: make mptcp_destroy_common() static
  2025-10-03 14:01 [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing Paolo Abeni
                   ` (5 preceding siblings ...)
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 6/8] mptcp: do not miss early first subflow close event notification Paolo Abeni
@ 2025-10-03 14:01 ` Paolo Abeni
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 8/8] mptcp: leverage the backlog for RX packet processing Paolo Abeni
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2025-10-03 14:01 UTC (permalink / raw)
  To: mptcp

Such function is only used inside protocol.c, there is no need
to expose it to the whole stack.

Note that the function definition most be moved earlier to avoid
forward declaration.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 42 +++++++++++++++++++++---------------------
 net/mptcp/protocol.h |  2 --
 2 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6ae5ab7595272..e354f16f4a79f 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3195,6 +3195,27 @@ static void mptcp_copy_inaddrs(struct sock *msk, const struct sock *ssk)
 	inet_sk(msk)->inet_rcv_saddr = inet_sk(ssk)->inet_rcv_saddr;
 }
 
+static void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
+{
+	struct mptcp_subflow_context *subflow, *tmp;
+	struct sock *sk = (struct sock *)msk;
+
+	__mptcp_clear_xmit(sk);
+
+	/* join list will be eventually flushed (with rst) at sock lock release time */
+	mptcp_for_each_subflow_safe(msk, subflow, tmp)
+		__mptcp_close_ssk(sk, mptcp_subflow_tcp_sock(subflow), subflow, flags);
+
+	__skb_queue_purge(&sk->sk_receive_queue);
+	skb_rbtree_purge(&msk->out_of_order_queue);
+
+	/* move all the rx fwd alloc into the sk_mem_reclaim_final in
+	 * inet_sock_destruct() will dispose it
+	 */
+	mptcp_token_destroy(msk);
+	mptcp_pm_destroy(msk);
+}
+
 static int mptcp_disconnect(struct sock *sk, int flags)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -3399,27 +3420,6 @@ void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
 		msk->rcvq_space.space = TCP_INIT_CWND * TCP_MSS_DEFAULT;
 }
 
-void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
-{
-	struct mptcp_subflow_context *subflow, *tmp;
-	struct sock *sk = (struct sock *)msk;
-
-	__mptcp_clear_xmit(sk);
-
-	/* join list will be eventually flushed (with rst) at sock lock release time */
-	mptcp_for_each_subflow_safe(msk, subflow, tmp)
-		__mptcp_close_ssk(sk, mptcp_subflow_tcp_sock(subflow), subflow, flags);
-
-	__skb_queue_purge(&sk->sk_receive_queue);
-	skb_rbtree_purge(&msk->out_of_order_queue);
-
-	/* move all the rx fwd alloc into the sk_mem_reclaim_final in
-	 * inet_sock_destruct() will dispose it
-	 */
-	mptcp_token_destroy(msk);
-	mptcp_pm_destroy(msk);
-}
-
 static void mptcp_destroy(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 0545eab231250..46d8432c72ee7 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -979,8 +979,6 @@ static inline void mptcp_propagate_sndbuf(struct sock *sk, struct sock *ssk)
 	local_bh_enable();
 }
 
-void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags);
-
 #define MPTCP_TOKEN_MAX_RETRIES	4
 
 void __init mptcp_token_init(void);
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v4 mptcp-next 8/8] mptcp: leverage the backlog for RX packet processing
  2025-10-03 14:01 [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing Paolo Abeni
                   ` (6 preceding siblings ...)
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 7/8] mptcp: make mptcp_destroy_common() static Paolo Abeni
@ 2025-10-03 14:01 ` Paolo Abeni
  2025-10-06  5:09   ` Geliang Tang
  2025-10-03 16:10 ` [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing MPTCP CI
  2025-10-04  1:26 ` Geliang Tang
  9 siblings, 1 reply; 14+ messages in thread
From: Paolo Abeni @ 2025-10-03 14:01 UTC (permalink / raw)
  To: mptcp

When the msk socket is owned or the msk receive buffer is full,
move the incoming skbs in a msk level backlog list. This avoid
traversing the joined subflows and acquiring the subflow level
socket lock at reception time, improving the RX performances.

The skbs in the backlog keep using the incoming subflow receive space,
to allow backpressure on the subflow flow control, and when processing the
backlog, skbs exceeding the msk receive space are not dropped and
re-inserted into backlog processing, as dropping packets already acked
at the TCP level, is explicitly discouraged by the RFC and would corrupt
the data stream for fallback sockets.

As a drawback, special care is needed to avoid adding skbs to the backlog
of a closed msk, and to avoid leaving dangling references into the backlog
at subflow closing time.

Note that we can't use sk_backlog, as such list is processed before
release_cb() and the latter can release and re-acquire the msk level
socket spin lock. That would cause msk-level OoO that in turn are
fatal in case of fallback.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 204 ++++++++++++++++++++++++++++---------------
 net/mptcp/protocol.h |   5 +-
 2 files changed, 136 insertions(+), 73 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e354f16f4a79f..1fcdb26b8e0a0 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -654,8 +654,35 @@ static void mptcp_dss_corruption(struct mptcp_sock *msk, struct sock *ssk)
 	}
 }
 
+static void __mptcp_add_backlog(struct sock *sk, struct sk_buff *skb)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct sk_buff *tail = NULL;
+	bool fragstolen;
+	int delta;
+
+	if (unlikely(sk->sk_state == TCP_CLOSE))
+		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
+
+	/* Try to coalesce with the last skb in our backlog */
+	if (!list_empty(&msk->backlog_list))
+		tail = list_last_entry(&msk->backlog_list, struct sk_buff, list);
+
+	if (tail && MPTCP_SKB_CB(skb)->map_seq == MPTCP_SKB_CB(tail)->end_seq &&
+	    skb->sk == tail->sk &&
+	    __mptcp_try_coalesce(sk, tail, skb, &fragstolen, &delta)) {
+		atomic_sub(skb->truesize - delta, &skb->sk->sk_rmem_alloc);
+		kfree_skb_partial(skb, fragstolen);
+		WRITE_ONCE(msk->backlog_len, msk->backlog_len + delta);
+		return;
+	}
+
+	list_add_tail(&skb->list, &msk->backlog_list);
+	WRITE_ONCE(msk->backlog_len, msk->backlog_len + skb->truesize);
+}
+
 static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
-					   struct sock *ssk)
+					   struct sock *ssk, bool own_msk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	struct sock *sk = (struct sock *)msk;
@@ -671,9 +698,6 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 		struct sk_buff *skb;
 		bool fin;
 
-		if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
-			break;
-
 		/* try to move as much data as available */
 		map_remaining = subflow->map_data_len -
 				mptcp_subflow_get_map_offset(subflow);
@@ -701,10 +725,18 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 			int bmem;
 
 			bmem = mptcp_init_skb(ssk, skb, offset, len);
-			skb->sk = NULL;
-			sk_forward_alloc_add(sk, bmem);
-			atomic_sub(skb->truesize, &ssk->sk_rmem_alloc);
-			ret = __mptcp_move_skb(sk, skb) || ret;
+			if (own_msk)
+				sk_forward_alloc_add(sk, bmem);
+			else
+				msk->borrowed_mem += bmem;
+
+			if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf) {
+				skb->sk = NULL;
+				atomic_sub(skb->truesize, &ssk->sk_rmem_alloc);
+				ret |= __mptcp_move_skb(sk, skb);
+			} else {
+				__mptcp_add_backlog(sk, skb);
+			}
 			seq += len;
 
 			if (unlikely(map_remaining < len)) {
@@ -823,7 +855,7 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	struct sock *sk = (struct sock *)msk;
 	bool moved;
 
-	moved = __mptcp_move_skbs_from_subflow(msk, ssk);
+	moved = __mptcp_move_skbs_from_subflow(msk, ssk, true);
 	__mptcp_ofo_queue(msk);
 	if (unlikely(ssk->sk_err))
 		__mptcp_subflow_error_report(sk, ssk);
@@ -838,18 +870,10 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	return moved;
 }
 
-static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	/* Wake-up the reader only for in-sequence data */
-	if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
-		sk->sk_data_ready(sk);
-}
-
 void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+	struct mptcp_sock *msk = mptcp_sk(sk);
 
 	/* The peer can send data while we are shutting down this
 	 * subflow at msk destruction time, but we must avoid enqueuing
@@ -859,10 +883,13 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 		return;
 
 	mptcp_data_lock(sk);
-	if (!sock_owned_by_user(sk))
-		__mptcp_data_ready(sk, ssk);
-	else
-		__set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->cb_flags);
+	if (!sock_owned_by_user(sk)) {
+		/* Wake-up the reader only for in-sequence data */
+		if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
+			sk->sk_data_ready(sk);
+	} else {
+		__mptcp_move_skbs_from_subflow(msk, ssk, false);
+	}
 	mptcp_data_unlock(sk);
 }
 
@@ -2096,60 +2123,61 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	msk->rcvq_space.time = mstamp;
 }
 
-static struct mptcp_subflow_context *
-__mptcp_first_ready_from(struct mptcp_sock *msk,
-			 struct mptcp_subflow_context *subflow)
+static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delta)
 {
-	struct mptcp_subflow_context *start_subflow = subflow;
-
-	while (!READ_ONCE(subflow->data_avail)) {
-		subflow = mptcp_next_subflow(msk, subflow);
-		if (subflow == start_subflow)
-			return NULL;
-	}
-	return subflow;
-}
-
-static bool __mptcp_move_skbs(struct sock *sk)
-{
-	struct mptcp_subflow_context *subflow;
+	struct sk_buff *skb = list_first_entry(skbs, struct sk_buff, list);
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	bool ret = false;
-
-	if (list_empty(&msk->conn_list))
-		return false;
-
-	subflow = list_first_entry(&msk->conn_list,
-				   struct mptcp_subflow_context, node);
-	for (;;) {
-		struct sock *ssk;
-		bool slowpath;
+	bool moved = false;
 
-		/*
-		 * As an optimization avoid traversing the subflows list
-		 * and ev. acquiring the subflow socket lock before baling out
-		 */
+	while (1) {
+		/* If the msk recvbuf is full stop, don't drop */
 		if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
 			break;
 
-		subflow = __mptcp_first_ready_from(msk, subflow);
-		if (!subflow)
-			break;
+		prefetch(skb->next);
+		list_del(&skb->list);
+		*delta += skb->truesize;
 
-		ssk = mptcp_subflow_tcp_sock(subflow);
-		slowpath = lock_sock_fast(ssk);
-		ret = __mptcp_move_skbs_from_subflow(msk, ssk) || ret;
-		if (unlikely(ssk->sk_err))
-			__mptcp_error_report(sk);
-		unlock_sock_fast(ssk, slowpath);
+		/* Release the memory allocated on the incoming subflow before
+		 * moving it to the msk
+		 */
+		atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
+		skb->sk = NULL;
+		moved |= __mptcp_move_skb(sk, skb);
+		if (list_empty(skbs))
+			break;
 
-		subflow = mptcp_next_subflow(msk, subflow);
+		skb = list_first_entry(skbs, struct sk_buff, list);
 	}
 
 	__mptcp_ofo_queue(msk);
-	if (ret)
+	if (moved)
 		mptcp_check_data_fin((struct sock *)msk);
-	return ret;
+	return moved;
+}
+
+static bool mptcp_move_skbs(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	bool moved = false;
+	LIST_HEAD(skbs);
+	u32 delta = 0;
+
+	mptcp_data_lock(sk);
+	while (!list_empty(&msk->backlog_list)) {
+		list_splice_init(&msk->backlog_list, &skbs);
+		mptcp_data_unlock(sk);
+		moved |= __mptcp_move_skbs(sk, &skbs, &delta);
+
+		mptcp_data_lock(sk);
+		if (!list_empty(&skbs)) {
+			list_splice(&skbs, &msk->backlog_list);
+			break;
+		}
+	}
+	WRITE_ONCE(msk->backlog_len, msk->backlog_len - delta);
+	mptcp_data_unlock(sk);
+	return moved;
 }
 
 static unsigned int mptcp_inq_hint(const struct sock *sk)
@@ -2215,7 +2243,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 
 		copied += bytes_read;
 
-		if (skb_queue_empty(&sk->sk_receive_queue) && __mptcp_move_skbs(sk))
+		if (!list_empty(&msk->backlog_list) && mptcp_move_skbs(sk))
 			continue;
 
 		/* only the MPTCP socket status is relevant here. The exit
@@ -2520,6 +2548,9 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 		     struct mptcp_subflow_context *subflow)
 {
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct sk_buff *skb;
+
 	/* The first subflow can already be closed and still in the list */
 	if (subflow->close_event_done)
 		return;
@@ -2529,6 +2560,18 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 	if (sk->sk_state == TCP_ESTABLISHED)
 		mptcp_event(MPTCP_EVENT_SUB_CLOSED, mptcp_sk(sk), ssk, GFP_KERNEL);
 
+	/* Remove any reference from the backlog to this ssk, accounting the
+	 * related skb directly to the main socket
+	 */
+	list_for_each_entry(skb, &msk->backlog_list, list) {
+		if (skb->sk != ssk)
+			continue;
+
+		atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
+		atomic_add(skb->truesize, &sk->sk_rmem_alloc);
+		skb->sk = sk;
+	}
+
 	/* subflow aborted before reaching the fully_established status
 	 * attempt the creation of the next subflow
 	 */
@@ -2761,8 +2804,11 @@ static void mptcp_do_fastclose(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow, *tmp;
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct sk_buff *skb;
 
 	mptcp_set_state(sk, TCP_CLOSE);
+	list_for_each_entry(skb, &msk->backlog_list, list)
+		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
 	mptcp_for_each_subflow_safe(msk, subflow, tmp)
 		__mptcp_close_ssk(sk, mptcp_subflow_tcp_sock(subflow),
 				  subflow, MPTCP_CF_FASTCLOSE);
@@ -2820,6 +2866,7 @@ static void __mptcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&msk->conn_list);
 	INIT_LIST_HEAD(&msk->join_list);
 	INIT_LIST_HEAD(&msk->rtx_queue);
+	INIT_LIST_HEAD(&msk->backlog_list);
 	INIT_WORK(&msk->work, mptcp_worker);
 	msk->out_of_order_queue = RB_ROOT;
 	msk->first_pending = NULL;
@@ -3199,9 +3246,13 @@ static void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
 {
 	struct mptcp_subflow_context *subflow, *tmp;
 	struct sock *sk = (struct sock *)msk;
+	struct sk_buff *skb;
 
 	__mptcp_clear_xmit(sk);
 
+	list_for_each_entry(skb, &msk->backlog_list, list)
+		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
+
 	/* join list will be eventually flushed (with rst) at sock lock release time */
 	mptcp_for_each_subflow_safe(msk, subflow, tmp)
 		__mptcp_close_ssk(sk, mptcp_subflow_tcp_sock(subflow), subflow, flags);
@@ -3451,23 +3502,29 @@ void __mptcp_check_push(struct sock *sk, struct sock *ssk)
 
 #define MPTCP_FLAGS_PROCESS_CTX_NEED (BIT(MPTCP_PUSH_PENDING) | \
 				      BIT(MPTCP_RETRANSMIT) | \
-				      BIT(MPTCP_FLUSH_JOIN_LIST) | \
-				      BIT(MPTCP_DEQUEUE))
+				      BIT(MPTCP_FLUSH_JOIN_LIST))
 
 /* processes deferred events and flush wmem */
 static void mptcp_release_cb(struct sock *sk)
 	__must_hold(&sk->sk_lock.slock)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	u32 delta = 0;
 
 	for (;;) {
 		unsigned long flags = (msk->cb_flags & MPTCP_FLAGS_PROCESS_CTX_NEED);
-		struct list_head join_list;
+		LIST_HEAD(join_list);
+		LIST_HEAD(skbs);
+
+		sk_forward_alloc_add(sk, msk->borrowed_mem);
+		msk->borrowed_mem = 0;
+
+		if (sk_rmem_alloc_get(sk) < sk->sk_rcvbuf)
+			list_splice_init(&msk->backlog_list, &skbs);
 
-		if (!flags)
+		if (!flags && list_empty(&skbs))
 			break;
 
-		INIT_LIST_HEAD(&join_list);
 		list_splice_init(&msk->join_list, &join_list);
 
 		/* the following actions acquire the subflow socket lock
@@ -3486,7 +3543,8 @@ static void mptcp_release_cb(struct sock *sk)
 			__mptcp_push_pending(sk, 0);
 		if (flags & BIT(MPTCP_RETRANSMIT))
 			__mptcp_retrans(sk);
-		if ((flags & BIT(MPTCP_DEQUEUE)) && __mptcp_move_skbs(sk)) {
+		if (!list_empty(&skbs) &&
+		    __mptcp_move_skbs(sk, &skbs, &delta)) {
 			/* notify ack seq update */
 			mptcp_cleanup_rbuf(msk, 0);
 			sk->sk_data_ready(sk);
@@ -3494,7 +3552,9 @@ static void mptcp_release_cb(struct sock *sk)
 
 		cond_resched();
 		spin_lock_bh(&sk->sk_lock.slock);
+		list_splice(&skbs, &msk->backlog_list);
 	}
+	WRITE_ONCE(msk->backlog_len, msk->backlog_len - delta);
 
 	if (__test_and_clear_bit(MPTCP_CLEAN_UNA, &msk->cb_flags))
 		__mptcp_clean_una_wakeup(sk);
@@ -3726,7 +3786,7 @@ static int mptcp_ioctl(struct sock *sk, int cmd, int *karg)
 			return -EINVAL;
 
 		lock_sock(sk);
-		if (__mptcp_move_skbs(sk))
+		if (mptcp_move_skbs(sk))
 			mptcp_cleanup_rbuf(msk, 0);
 		*karg = mptcp_inq_hint(sk);
 		release_sock(sk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 46d8432c72ee7..c9c6582b4e1c4 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -124,7 +124,6 @@
 #define MPTCP_FLUSH_JOIN_LIST	5
 #define MPTCP_SYNC_STATE	6
 #define MPTCP_SYNC_SNDBUF	7
-#define MPTCP_DEQUEUE		8
 
 struct mptcp_skb_cb {
 	u64 map_seq;
@@ -301,6 +300,7 @@ struct mptcp_sock {
 	u32		last_ack_recv;
 	unsigned long	timer_ival;
 	u32		token;
+	u32		borrowed_mem;
 	unsigned long	flags;
 	unsigned long	cb_flags;
 	bool		recovery;		/* closing subflow write queue reinjected */
@@ -358,6 +358,8 @@ struct mptcp_sock {
 					 * allow_infinite_fallback and
 					 * allow_join
 					 */
+	struct list_head backlog_list;	/*protected by the data lock */
+	u32		backlog_len;
 };
 
 #define mptcp_data_lock(sk) spin_lock_bh(&(sk)->sk_lock.slock)
@@ -408,6 +410,7 @@ static inline int mptcp_space_from_win(const struct sock *sk, int win)
 static inline int __mptcp_space(const struct sock *sk)
 {
 	return mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) -
+				    READ_ONCE(mptcp_sk(sk)->backlog_len) -
 				    sk_rmem_alloc_get(sk));
 }
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 mptcp-next 8/8] mptcp: leverage the backlog for RX packet processing
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 8/8] mptcp: leverage the backlog for RX packet processing Paolo Abeni
@ 2025-10-06  5:09   ` Geliang Tang
  2025-10-06  7:51     ` Paolo Abeni
  0 siblings, 1 reply; 14+ messages in thread
From: Geliang Tang @ 2025-10-06  5:09 UTC (permalink / raw)
  To: Paolo Abeni, mptcp

Hi Paolo,

On Fri, 2025-10-03 at 16:01 +0200, Paolo Abeni wrote:
> When the msk socket is owned or the msk receive buffer is full,
> move the incoming skbs in a msk level backlog list. This avoid
> traversing the joined subflows and acquiring the subflow level
> socket lock at reception time, improving the RX performances.
> 
> The skbs in the backlog keep using the incoming subflow receive
> space,
> to allow backpressure on the subflow flow control, and when
> processing the
> backlog, skbs exceeding the msk receive space are not dropped and
> re-inserted into backlog processing, as dropping packets already
> acked
> at the TCP level, is explicitly discouraged by the RFC and would
> corrupt
> the data stream for fallback sockets.
> 
> As a drawback, special care is needed to avoid adding skbs to the
> backlog
> of a closed msk, and to avoid leaving dangling references into the
> backlog
> at subflow closing time.
> 
> Note that we can't use sk_backlog, as such list is processed before
> release_cb() and the latter can release and re-acquire the msk level
> socket spin lock. That would cause msk-level OoO that in turn are
> fatal in case of fallback.
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  net/mptcp/protocol.c | 204 ++++++++++++++++++++++++++++-------------
> --
>  net/mptcp/protocol.h |   5 +-
>  2 files changed, 136 insertions(+), 73 deletions(-)
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index e354f16f4a79f..1fcdb26b8e0a0 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -654,8 +654,35 @@ static void mptcp_dss_corruption(struct
> mptcp_sock *msk, struct sock *ssk)
>  	}
>  }
>  
> +static void __mptcp_add_backlog(struct sock *sk, struct sk_buff
> *skb)
> +{
> +	struct mptcp_sock *msk = mptcp_sk(sk);
> +	struct sk_buff *tail = NULL;
> +	bool fragstolen;
> +	int delta;
> +
> +	if (unlikely(sk->sk_state == TCP_CLOSE))
> +		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
> +
> +	/* Try to coalesce with the last skb in our backlog */
> +	if (!list_empty(&msk->backlog_list))
> +		tail = list_last_entry(&msk->backlog_list, struct
> sk_buff, list);
> +
> +	if (tail && MPTCP_SKB_CB(skb)->map_seq ==
> MPTCP_SKB_CB(tail)->end_seq &&
> +	    skb->sk == tail->sk &&
> +	    __mptcp_try_coalesce(sk, tail, skb, &fragstolen,
> &delta)) {
> +		atomic_sub(skb->truesize - delta, &skb->sk-
> >sk_rmem_alloc);
> +		kfree_skb_partial(skb, fragstolen);
> +		WRITE_ONCE(msk->backlog_len, msk->backlog_len +
> delta);
> +		return;
> +	}
> +
> +	list_add_tail(&skb->list, &msk->backlog_list);
> +	WRITE_ONCE(msk->backlog_len, msk->backlog_len + skb-
> >truesize);
> +}
> +
>  static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
> -					   struct sock *ssk)
> +					   struct sock *ssk, bool
> own_msk)
>  {
>  	struct mptcp_subflow_context *subflow =
> mptcp_subflow_ctx(ssk);
>  	struct sock *sk = (struct sock *)msk;
> @@ -671,9 +698,6 @@ static bool __mptcp_move_skbs_from_subflow(struct
> mptcp_sock *msk,
>  		struct sk_buff *skb;
>  		bool fin;
>  
> -		if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
> -			break;
> -
>  		/* try to move as much data as available */
>  		map_remaining = subflow->map_data_len -
>  				mptcp_subflow_get_map_offset(subflow
> );
> @@ -701,10 +725,18 @@ static bool
> __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
>  			int bmem;
>  
>  			bmem = mptcp_init_skb(ssk, skb, offset,
> len);
> -			skb->sk = NULL;
> -			sk_forward_alloc_add(sk, bmem);
> -			atomic_sub(skb->truesize, &ssk-
> >sk_rmem_alloc);
> -			ret = __mptcp_move_skb(sk, skb) || ret;
> +			if (own_msk)
> +				sk_forward_alloc_add(sk, bmem);
> +			else
> +				msk->borrowed_mem += bmem;
> +
> +			if (own_msk && sk_rmem_alloc_get(sk) < sk-
> >sk_rcvbuf) {
> +				skb->sk = NULL;
> +				atomic_sub(skb->truesize, &ssk-
> >sk_rmem_alloc);
> +				ret |= __mptcp_move_skb(sk, skb);
> +			} else {
> +				__mptcp_add_backlog(sk, skb);
> +			}
>  			seq += len;
>  
>  			if (unlikely(map_remaining < len)) {
> @@ -823,7 +855,7 @@ static bool move_skbs_to_msk(struct mptcp_sock
> *msk, struct sock *ssk)
>  	struct sock *sk = (struct sock *)msk;
>  	bool moved;
>  
> -	moved = __mptcp_move_skbs_from_subflow(msk, ssk);
> +	moved = __mptcp_move_skbs_from_subflow(msk, ssk, true);
>  	__mptcp_ofo_queue(msk);
>  	if (unlikely(ssk->sk_err))
>  		__mptcp_subflow_error_report(sk, ssk);
> @@ -838,18 +870,10 @@ static bool move_skbs_to_msk(struct mptcp_sock
> *msk, struct sock *ssk)
>  	return moved;
>  }
>  
> -static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
> -{
> -	struct mptcp_sock *msk = mptcp_sk(sk);
> -
> -	/* Wake-up the reader only for in-sequence data */
> -	if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
> -		sk->sk_data_ready(sk);
> -}
> -
>  void mptcp_data_ready(struct sock *sk, struct sock *ssk)
>  {
>  	struct mptcp_subflow_context *subflow =
> mptcp_subflow_ctx(ssk);
> +	struct mptcp_sock *msk = mptcp_sk(sk);
>  
>  	/* The peer can send data while we are shutting down this
>  	 * subflow at msk destruction time, but we must avoid
> enqueuing
> @@ -859,10 +883,13 @@ void mptcp_data_ready(struct sock *sk, struct
> sock *ssk)
>  		return;
>  
>  	mptcp_data_lock(sk);
> -	if (!sock_owned_by_user(sk))
> -		__mptcp_data_ready(sk, ssk);
> -	else
> -		__set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->cb_flags);
> +	if (!sock_owned_by_user(sk)) {
> +		/* Wake-up the reader only for in-sequence data */
> +		if (move_skbs_to_msk(msk, ssk) &&
> mptcp_epollin_ready(sk))
> +			sk->sk_data_ready(sk);
> +	} else {
> +		__mptcp_move_skbs_from_subflow(msk, ssk, false);
> +	}
>  	mptcp_data_unlock(sk);
>  }
>  
> @@ -2096,60 +2123,61 @@ static void mptcp_rcv_space_adjust(struct
> mptcp_sock *msk, int copied)
>  	msk->rcvq_space.time = mstamp;
>  }
>  
> -static struct mptcp_subflow_context *
> -__mptcp_first_ready_from(struct mptcp_sock *msk,
> -			 struct mptcp_subflow_context *subflow)
> +static bool __mptcp_move_skbs(struct sock *sk, struct list_head
> *skbs, u32 *delta)
>  {
> -	struct mptcp_subflow_context *start_subflow = subflow;
> -
> -	while (!READ_ONCE(subflow->data_avail)) {
> -		subflow = mptcp_next_subflow(msk, subflow);
> -		if (subflow == start_subflow)
> -			return NULL;
> -	}
> -	return subflow;
> -}
> -
> -static bool __mptcp_move_skbs(struct sock *sk)
> -{
> -	struct mptcp_subflow_context *subflow;
> +	struct sk_buff *skb = list_first_entry(skbs, struct sk_buff,
> list);
>  	struct mptcp_sock *msk = mptcp_sk(sk);
> -	bool ret = false;
> -
> -	if (list_empty(&msk->conn_list))
> -		return false;
> -
> -	subflow = list_first_entry(&msk->conn_list,
> -				   struct mptcp_subflow_context,
> node);
> -	for (;;) {
> -		struct sock *ssk;
> -		bool slowpath;
> +	bool moved = false;
>  
> -		/*
> -		 * As an optimization avoid traversing the subflows
> list
> -		 * and ev. acquiring the subflow socket lock before
> baling out
> -		 */
> +	while (1) {
> +		/* If the msk recvbuf is full stop, don't drop */
>  		if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
>  			break;
>  
> -		subflow = __mptcp_first_ready_from(msk, subflow);
> -		if (!subflow)
> -			break;
> +		prefetch(skb->next);
> +		list_del(&skb->list);
> +		*delta += skb->truesize;
>  
> -		ssk = mptcp_subflow_tcp_sock(subflow);
> -		slowpath = lock_sock_fast(ssk);
> -		ret = __mptcp_move_skbs_from_subflow(msk, ssk) ||
> ret;
> -		if (unlikely(ssk->sk_err))
> -			__mptcp_error_report(sk);
> -		unlock_sock_fast(ssk, slowpath);
> +		/* Release the memory allocated on the incoming
> subflow before
> +		 * moving it to the msk
> +		 */
> +		atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
> +		skb->sk = NULL;
> +		moved |= __mptcp_move_skb(sk, skb);
> +		if (list_empty(skbs))
> +			break;
>  
> -		subflow = mptcp_next_subflow(msk, subflow);
> +		skb = list_first_entry(skbs, struct sk_buff, list);
>  	}
>  
>  	__mptcp_ofo_queue(msk);
> -	if (ret)
> +	if (moved)
>  		mptcp_check_data_fin((struct sock *)msk);
> -	return ret;
> +	return moved;
> +}
> +
> +static bool mptcp_move_skbs(struct sock *sk)
> +{
> +	struct mptcp_sock *msk = mptcp_sk(sk);
> +	bool moved = false;
> +	LIST_HEAD(skbs);
> +	u32 delta = 0;
> +
> +	mptcp_data_lock(sk);
> +	while (!list_empty(&msk->backlog_list)) {
> +		list_splice_init(&msk->backlog_list, &skbs);
> +		mptcp_data_unlock(sk);
> +		moved |= __mptcp_move_skbs(sk, &skbs, &delta);
> +
> +		mptcp_data_lock(sk);
> +		if (!list_empty(&skbs)) {
> +			list_splice(&skbs, &msk->backlog_list);
> +			break;
> +		}
> +	}
> +	WRITE_ONCE(msk->backlog_len, msk->backlog_len - delta);
> +	mptcp_data_unlock(sk);
> +	return moved;
>  }
>  
>  static unsigned int mptcp_inq_hint(const struct sock *sk)
> @@ -2215,7 +2243,7 @@ static int mptcp_recvmsg(struct sock *sk,
> struct msghdr *msg, size_t len,
>  
>  		copied += bytes_read;
>  
> -		if (skb_queue_empty(&sk->sk_receive_queue) &&
> __mptcp_move_skbs(sk))
> +		if (!list_empty(&msk->backlog_list) &&
> mptcp_move_skbs(sk))
>  			continue;
>  
>  		/* only the MPTCP socket status is relevant here.
> The exit
> @@ -2520,6 +2548,9 @@ static void __mptcp_close_ssk(struct sock *sk,
> struct sock *ssk,
>  void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
>  		     struct mptcp_subflow_context *subflow)
>  {
> +	struct mptcp_sock *msk = mptcp_sk(sk);
> +	struct sk_buff *skb;
> +
>  	/* The first subflow can already be closed and still in the
> list */
>  	if (subflow->close_event_done)
>  		return;
> @@ -2529,6 +2560,18 @@ void mptcp_close_ssk(struct sock *sk, struct
> sock *ssk,
>  	if (sk->sk_state == TCP_ESTABLISHED)
>  		mptcp_event(MPTCP_EVENT_SUB_CLOSED, mptcp_sk(sk),
> ssk, GFP_KERNEL);
>  
> +	/* Remove any reference from the backlog to this ssk,
> accounting the
> +	 * related skb directly to the main socket
> +	 */
> +	list_for_each_entry(skb, &msk->backlog_list, list) {
> +		if (skb->sk != ssk)
> +			continue;
> +
> +		atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
> +		atomic_add(skb->truesize, &sk->sk_rmem_alloc);
> +		skb->sk = sk;
> +	}
> +
>  	/* subflow aborted before reaching the fully_established
> status
>  	 * attempt the creation of the next subflow
>  	 */
> @@ -2761,8 +2804,11 @@ static void mptcp_do_fastclose(struct sock
> *sk)
>  {
>  	struct mptcp_subflow_context *subflow, *tmp;
>  	struct mptcp_sock *msk = mptcp_sk(sk);
> +	struct sk_buff *skb;
>  
>  	mptcp_set_state(sk, TCP_CLOSE);
> +	list_for_each_entry(skb, &msk->backlog_list, list)
> +		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
>  	mptcp_for_each_subflow_safe(msk, subflow, tmp)
>  		__mptcp_close_ssk(sk,
> mptcp_subflow_tcp_sock(subflow),
>  				  subflow, MPTCP_CF_FASTCLOSE);
> @@ -2820,6 +2866,7 @@ static void __mptcp_init_sock(struct sock *sk)
>  	INIT_LIST_HEAD(&msk->conn_list);
>  	INIT_LIST_HEAD(&msk->join_list);
>  	INIT_LIST_HEAD(&msk->rtx_queue);
> +	INIT_LIST_HEAD(&msk->backlog_list);
>  	INIT_WORK(&msk->work, mptcp_worker);
>  	msk->out_of_order_queue = RB_ROOT;
>  	msk->first_pending = NULL;
> @@ -3199,9 +3246,13 @@ static void mptcp_destroy_common(struct
> mptcp_sock *msk, unsigned int flags)
>  {
>  	struct mptcp_subflow_context *subflow, *tmp;
>  	struct sock *sk = (struct sock *)msk;
> +	struct sk_buff *skb;
>  
>  	__mptcp_clear_xmit(sk);
>  
> +	list_for_each_entry(skb, &msk->backlog_list, list)
> +		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
> +

I encountered this error while testing the splice test:

[ 6708.714803] BUG: KFENCE: use-after-free read in mptcp_destroy_common
(net/mptcp/protocol.c:3233 (discriminator 4))  
[ 6708.714803] 
[ 6708.714848] Use-after-free read at 0x000000007fc00f9c (in kfence-
#253):

Holding the msk data lock while traversing the backlog_list can fix
this error:

        mptcp_data_lock(sk);
        list_for_each_entry(skb, &msk->backlog_list, list)
                kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
        mptcp_data_unlock(sk);

Similarly, I have also added the holding of the msk data lock in
mptcp_do_fastclose().

I would like to ask if this modification is acceptable. If it works, I
can send a squash-to patch for this.

Thanks,
-Geliang

>  	/* join list will be eventually flushed (with rst) at sock
> lock release time */
>  	mptcp_for_each_subflow_safe(msk, subflow, tmp)
>  		__mptcp_close_ssk(sk,
> mptcp_subflow_tcp_sock(subflow), subflow, flags);
> @@ -3451,23 +3502,29 @@ void __mptcp_check_push(struct sock *sk,
> struct sock *ssk)
>  
>  #define MPTCP_FLAGS_PROCESS_CTX_NEED (BIT(MPTCP_PUSH_PENDING) | \
>  				      BIT(MPTCP_RETRANSMIT) | \
> -				      BIT(MPTCP_FLUSH_JOIN_LIST) | \
> -				      BIT(MPTCP_DEQUEUE))
> +				      BIT(MPTCP_FLUSH_JOIN_LIST))
>  
>  /* processes deferred events and flush wmem */
>  static void mptcp_release_cb(struct sock *sk)
>  	__must_hold(&sk->sk_lock.slock)
>  {
>  	struct mptcp_sock *msk = mptcp_sk(sk);
> +	u32 delta = 0;
>  
>  	for (;;) {
>  		unsigned long flags = (msk->cb_flags &
> MPTCP_FLAGS_PROCESS_CTX_NEED);
> -		struct list_head join_list;
> +		LIST_HEAD(join_list);
> +		LIST_HEAD(skbs);
> +
> +		sk_forward_alloc_add(sk, msk->borrowed_mem);
> +		msk->borrowed_mem = 0;
> +
> +		if (sk_rmem_alloc_get(sk) < sk->sk_rcvbuf)
> +			list_splice_init(&msk->backlog_list, &skbs);
>  
> -		if (!flags)
> +		if (!flags && list_empty(&skbs))
>  			break;
>  
> -		INIT_LIST_HEAD(&join_list);
>  		list_splice_init(&msk->join_list, &join_list);
>  
>  		/* the following actions acquire the subflow socket
> lock
> @@ -3486,7 +3543,8 @@ static void mptcp_release_cb(struct sock *sk)
>  			__mptcp_push_pending(sk, 0);
>  		if (flags & BIT(MPTCP_RETRANSMIT))
>  			__mptcp_retrans(sk);
> -		if ((flags & BIT(MPTCP_DEQUEUE)) &&
> __mptcp_move_skbs(sk)) {
> +		if (!list_empty(&skbs) &&
> +		    __mptcp_move_skbs(sk, &skbs, &delta)) {
>  			/* notify ack seq update */
>  			mptcp_cleanup_rbuf(msk, 0);
>  			sk->sk_data_ready(sk);
> @@ -3494,7 +3552,9 @@ static void mptcp_release_cb(struct sock *sk)
>  
>  		cond_resched();
>  		spin_lock_bh(&sk->sk_lock.slock);
> +		list_splice(&skbs, &msk->backlog_list);
>  	}
> +	WRITE_ONCE(msk->backlog_len, msk->backlog_len - delta);
>  
>  	if (__test_and_clear_bit(MPTCP_CLEAN_UNA, &msk->cb_flags))
>  		__mptcp_clean_una_wakeup(sk);
> @@ -3726,7 +3786,7 @@ static int mptcp_ioctl(struct sock *sk, int
> cmd, int *karg)
>  			return -EINVAL;
>  
>  		lock_sock(sk);
> -		if (__mptcp_move_skbs(sk))
> +		if (mptcp_move_skbs(sk))
>  			mptcp_cleanup_rbuf(msk, 0);
>  		*karg = mptcp_inq_hint(sk);
>  		release_sock(sk);
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index 46d8432c72ee7..c9c6582b4e1c4 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -124,7 +124,6 @@
>  #define MPTCP_FLUSH_JOIN_LIST	5
>  #define MPTCP_SYNC_STATE	6
>  #define MPTCP_SYNC_SNDBUF	7
> -#define MPTCP_DEQUEUE		8
>  
>  struct mptcp_skb_cb {
>  	u64 map_seq;
> @@ -301,6 +300,7 @@ struct mptcp_sock {
>  	u32		last_ack_recv;
>  	unsigned long	timer_ival;
>  	u32		token;
> +	u32		borrowed_mem;
>  	unsigned long	flags;
>  	unsigned long	cb_flags;
>  	bool		recovery;		/* closing subflow
> write queue reinjected */
> @@ -358,6 +358,8 @@ struct mptcp_sock {
>  					 * allow_infinite_fallback
> and
>  					 * allow_join
>  					 */
> +	struct list_head backlog_list;	/*protected by the data lock
> */
> +	u32		backlog_len;
>  };
>  
>  #define mptcp_data_lock(sk) spin_lock_bh(&(sk)->sk_lock.slock)
> @@ -408,6 +410,7 @@ static inline int mptcp_space_from_win(const
> struct sock *sk, int win)
>  static inline int __mptcp_space(const struct sock *sk)
>  {
>  	return mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) -
> +				    READ_ONCE(mptcp_sk(sk)-
> >backlog_len) -
>  				    sk_rmem_alloc_get(sk));
>  }
>  


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 mptcp-next 8/8] mptcp: leverage the backlog for RX packet processing
  2025-10-06  5:09   ` Geliang Tang
@ 2025-10-06  7:51     ` Paolo Abeni
  0 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2025-10-06  7:51 UTC (permalink / raw)
  To: Geliang Tang, mptcp

On 10/6/25 7:09 AM, Geliang Tang wrote:
> On Fri, 2025-10-03 at 16:01 +0200, Paolo Abeni wrote:
>> When the msk socket is owned or the msk receive buffer is full,
>> move the incoming skbs in a msk level backlog list. This avoid
>> traversing the joined subflows and acquiring the subflow level
>> socket lock at reception time, improving the RX performances.
>>
>> The skbs in the backlog keep using the incoming subflow receive
>> space,
>> to allow backpressure on the subflow flow control, and when
>> processing the
>> backlog, skbs exceeding the msk receive space are not dropped and
>> re-inserted into backlog processing, as dropping packets already
>> acked
>> at the TCP level, is explicitly discouraged by the RFC and would
>> corrupt
>> the data stream for fallback sockets.
>>
>> As a drawback, special care is needed to avoid adding skbs to the
>> backlog
>> of a closed msk, and to avoid leaving dangling references into the
>> backlog
>> at subflow closing time.
>>
>> Note that we can't use sk_backlog, as such list is processed before
>> release_cb() and the latter can release and re-acquire the msk level
>> socket spin lock. That would cause msk-level OoO that in turn are
>> fatal in case of fallback.
>>
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>>  net/mptcp/protocol.c | 204 ++++++++++++++++++++++++++++-------------
>> --
>>  net/mptcp/protocol.h |   5 +-
>>  2 files changed, 136 insertions(+), 73 deletions(-)
>>
>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>> index e354f16f4a79f..1fcdb26b8e0a0 100644
>> --- a/net/mptcp/protocol.c
>> +++ b/net/mptcp/protocol.c
>> @@ -654,8 +654,35 @@ static void mptcp_dss_corruption(struct
>> mptcp_sock *msk, struct sock *ssk)
>>  	}
>>  }
>>  
>> +static void __mptcp_add_backlog(struct sock *sk, struct sk_buff
>> *skb)
>> +{
>> +	struct mptcp_sock *msk = mptcp_sk(sk);
>> +	struct sk_buff *tail = NULL;
>> +	bool fragstolen;
>> +	int delta;
>> +
>> +	if (unlikely(sk->sk_state == TCP_CLOSE))
>> +		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
>> +
>> +	/* Try to coalesce with the last skb in our backlog */
>> +	if (!list_empty(&msk->backlog_list))
>> +		tail = list_last_entry(&msk->backlog_list, struct
>> sk_buff, list);
>> +
>> +	if (tail && MPTCP_SKB_CB(skb)->map_seq ==
>> MPTCP_SKB_CB(tail)->end_seq &&
>> +	    skb->sk == tail->sk &&
>> +	    __mptcp_try_coalesce(sk, tail, skb, &fragstolen,
>> &delta)) {
>> +		atomic_sub(skb->truesize - delta, &skb->sk-
>>> sk_rmem_alloc);
>> +		kfree_skb_partial(skb, fragstolen);
>> +		WRITE_ONCE(msk->backlog_len, msk->backlog_len +
>> delta);
>> +		return;
>> +	}
>> +
>> +	list_add_tail(&skb->list, &msk->backlog_list);
>> +	WRITE_ONCE(msk->backlog_len, msk->backlog_len + skb-
>>> truesize);
>> +}
>> +
>>  static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
>> -					   struct sock *ssk)
>> +					   struct sock *ssk, bool
>> own_msk)
>>  {
>>  	struct mptcp_subflow_context *subflow =
>> mptcp_subflow_ctx(ssk);
>>  	struct sock *sk = (struct sock *)msk;
>> @@ -671,9 +698,6 @@ static bool __mptcp_move_skbs_from_subflow(struct
>> mptcp_sock *msk,
>>  		struct sk_buff *skb;
>>  		bool fin;
>>  
>> -		if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
>> -			break;
>> -
>>  		/* try to move as much data as available */
>>  		map_remaining = subflow->map_data_len -
>>  				mptcp_subflow_get_map_offset(subflow
>> );
>> @@ -701,10 +725,18 @@ static bool
>> __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
>>  			int bmem;
>>  
>>  			bmem = mptcp_init_skb(ssk, skb, offset,
>> len);
>> -			skb->sk = NULL;
>> -			sk_forward_alloc_add(sk, bmem);
>> -			atomic_sub(skb->truesize, &ssk-
>>> sk_rmem_alloc);
>> -			ret = __mptcp_move_skb(sk, skb) || ret;
>> +			if (own_msk)
>> +				sk_forward_alloc_add(sk, bmem);
>> +			else
>> +				msk->borrowed_mem += bmem;
>> +
>> +			if (own_msk && sk_rmem_alloc_get(sk) < sk-
>>> sk_rcvbuf) {
>> +				skb->sk = NULL;
>> +				atomic_sub(skb->truesize, &ssk-
>>> sk_rmem_alloc);
>> +				ret |= __mptcp_move_skb(sk, skb);
>> +			} else {
>> +				__mptcp_add_backlog(sk, skb);
>> +			}
>>  			seq += len;
>>  
>>  			if (unlikely(map_remaining < len)) {
>> @@ -823,7 +855,7 @@ static bool move_skbs_to_msk(struct mptcp_sock
>> *msk, struct sock *ssk)
>>  	struct sock *sk = (struct sock *)msk;
>>  	bool moved;
>>  
>> -	moved = __mptcp_move_skbs_from_subflow(msk, ssk);
>> +	moved = __mptcp_move_skbs_from_subflow(msk, ssk, true);
>>  	__mptcp_ofo_queue(msk);
>>  	if (unlikely(ssk->sk_err))
>>  		__mptcp_subflow_error_report(sk, ssk);
>> @@ -838,18 +870,10 @@ static bool move_skbs_to_msk(struct mptcp_sock
>> *msk, struct sock *ssk)
>>  	return moved;
>>  }
>>  
>> -static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
>> -{
>> -	struct mptcp_sock *msk = mptcp_sk(sk);
>> -
>> -	/* Wake-up the reader only for in-sequence data */
>> -	if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
>> -		sk->sk_data_ready(sk);
>> -}
>> -
>>  void mptcp_data_ready(struct sock *sk, struct sock *ssk)
>>  {
>>  	struct mptcp_subflow_context *subflow =
>> mptcp_subflow_ctx(ssk);
>> +	struct mptcp_sock *msk = mptcp_sk(sk);
>>  
>>  	/* The peer can send data while we are shutting down this
>>  	 * subflow at msk destruction time, but we must avoid
>> enqueuing
>> @@ -859,10 +883,13 @@ void mptcp_data_ready(struct sock *sk, struct
>> sock *ssk)
>>  		return;
>>  
>>  	mptcp_data_lock(sk);
>> -	if (!sock_owned_by_user(sk))
>> -		__mptcp_data_ready(sk, ssk);
>> -	else
>> -		__set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->cb_flags);
>> +	if (!sock_owned_by_user(sk)) {
>> +		/* Wake-up the reader only for in-sequence data */
>> +		if (move_skbs_to_msk(msk, ssk) &&
>> mptcp_epollin_ready(sk))
>> +			sk->sk_data_ready(sk);
>> +	} else {
>> +		__mptcp_move_skbs_from_subflow(msk, ssk, false);
>> +	}
>>  	mptcp_data_unlock(sk);
>>  }
>>  
>> @@ -2096,60 +2123,61 @@ static void mptcp_rcv_space_adjust(struct
>> mptcp_sock *msk, int copied)
>>  	msk->rcvq_space.time = mstamp;
>>  }
>>  
>> -static struct mptcp_subflow_context *
>> -__mptcp_first_ready_from(struct mptcp_sock *msk,
>> -			 struct mptcp_subflow_context *subflow)
>> +static bool __mptcp_move_skbs(struct sock *sk, struct list_head
>> *skbs, u32 *delta)
>>  {
>> -	struct mptcp_subflow_context *start_subflow = subflow;
>> -
>> -	while (!READ_ONCE(subflow->data_avail)) {
>> -		subflow = mptcp_next_subflow(msk, subflow);
>> -		if (subflow == start_subflow)
>> -			return NULL;
>> -	}
>> -	return subflow;
>> -}
>> -
>> -static bool __mptcp_move_skbs(struct sock *sk)
>> -{
>> -	struct mptcp_subflow_context *subflow;
>> +	struct sk_buff *skb = list_first_entry(skbs, struct sk_buff,
>> list);
>>  	struct mptcp_sock *msk = mptcp_sk(sk);
>> -	bool ret = false;
>> -
>> -	if (list_empty(&msk->conn_list))
>> -		return false;
>> -
>> -	subflow = list_first_entry(&msk->conn_list,
>> -				   struct mptcp_subflow_context,
>> node);
>> -	for (;;) {
>> -		struct sock *ssk;
>> -		bool slowpath;
>> +	bool moved = false;
>>  
>> -		/*
>> -		 * As an optimization avoid traversing the subflows
>> list
>> -		 * and ev. acquiring the subflow socket lock before
>> baling out
>> -		 */
>> +	while (1) {
>> +		/* If the msk recvbuf is full stop, don't drop */
>>  		if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
>>  			break;
>>  
>> -		subflow = __mptcp_first_ready_from(msk, subflow);
>> -		if (!subflow)
>> -			break;
>> +		prefetch(skb->next);
>> +		list_del(&skb->list);
>> +		*delta += skb->truesize;
>>  
>> -		ssk = mptcp_subflow_tcp_sock(subflow);
>> -		slowpath = lock_sock_fast(ssk);
>> -		ret = __mptcp_move_skbs_from_subflow(msk, ssk) ||
>> ret;
>> -		if (unlikely(ssk->sk_err))
>> -			__mptcp_error_report(sk);
>> -		unlock_sock_fast(ssk, slowpath);
>> +		/* Release the memory allocated on the incoming
>> subflow before
>> +		 * moving it to the msk
>> +		 */
>> +		atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
>> +		skb->sk = NULL;
>> +		moved |= __mptcp_move_skb(sk, skb);
>> +		if (list_empty(skbs))
>> +			break;
>>  
>> -		subflow = mptcp_next_subflow(msk, subflow);
>> +		skb = list_first_entry(skbs, struct sk_buff, list);
>>  	}
>>  
>>  	__mptcp_ofo_queue(msk);
>> -	if (ret)
>> +	if (moved)
>>  		mptcp_check_data_fin((struct sock *)msk);
>> -	return ret;
>> +	return moved;
>> +}
>> +
>> +static bool mptcp_move_skbs(struct sock *sk)
>> +{
>> +	struct mptcp_sock *msk = mptcp_sk(sk);
>> +	bool moved = false;
>> +	LIST_HEAD(skbs);
>> +	u32 delta = 0;
>> +
>> +	mptcp_data_lock(sk);
>> +	while (!list_empty(&msk->backlog_list)) {
>> +		list_splice_init(&msk->backlog_list, &skbs);
>> +		mptcp_data_unlock(sk);
>> +		moved |= __mptcp_move_skbs(sk, &skbs, &delta);
>> +
>> +		mptcp_data_lock(sk);
>> +		if (!list_empty(&skbs)) {
>> +			list_splice(&skbs, &msk->backlog_list);
>> +			break;
>> +		}
>> +	}
>> +	WRITE_ONCE(msk->backlog_len, msk->backlog_len - delta);
>> +	mptcp_data_unlock(sk);
>> +	return moved;
>>  }
>>  
>>  static unsigned int mptcp_inq_hint(const struct sock *sk)
>> @@ -2215,7 +2243,7 @@ static int mptcp_recvmsg(struct sock *sk,
>> struct msghdr *msg, size_t len,
>>  
>>  		copied += bytes_read;
>>  
>> -		if (skb_queue_empty(&sk->sk_receive_queue) &&
>> __mptcp_move_skbs(sk))
>> +		if (!list_empty(&msk->backlog_list) &&
>> mptcp_move_skbs(sk))
>>  			continue;
>>  
>>  		/* only the MPTCP socket status is relevant here.
>> The exit
>> @@ -2520,6 +2548,9 @@ static void __mptcp_close_ssk(struct sock *sk,
>> struct sock *ssk,
>>  void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
>>  		     struct mptcp_subflow_context *subflow)
>>  {
>> +	struct mptcp_sock *msk = mptcp_sk(sk);
>> +	struct sk_buff *skb;
>> +
>>  	/* The first subflow can already be closed and still in the
>> list */
>>  	if (subflow->close_event_done)
>>  		return;
>> @@ -2529,6 +2560,18 @@ void mptcp_close_ssk(struct sock *sk, struct
>> sock *ssk,
>>  	if (sk->sk_state == TCP_ESTABLISHED)
>>  		mptcp_event(MPTCP_EVENT_SUB_CLOSED, mptcp_sk(sk),
>> ssk, GFP_KERNEL);
>>  
>> +	/* Remove any reference from the backlog to this ssk,
>> accounting the
>> +	 * related skb directly to the main socket
>> +	 */
>> +	list_for_each_entry(skb, &msk->backlog_list, list) {
>> +		if (skb->sk != ssk)
>> +			continue;
>> +
>> +		atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
>> +		atomic_add(skb->truesize, &sk->sk_rmem_alloc);
>> +		skb->sk = sk;
>> +	}
>> +
>>  	/* subflow aborted before reaching the fully_established
>> status
>>  	 * attempt the creation of the next subflow
>>  	 */
>> @@ -2761,8 +2804,11 @@ static void mptcp_do_fastclose(struct sock
>> *sk)
>>  {
>>  	struct mptcp_subflow_context *subflow, *tmp;
>>  	struct mptcp_sock *msk = mptcp_sk(sk);
>> +	struct sk_buff *skb;
>>  
>>  	mptcp_set_state(sk, TCP_CLOSE);
>> +	list_for_each_entry(skb, &msk->backlog_list, list)
>> +		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
>>  	mptcp_for_each_subflow_safe(msk, subflow, tmp)
>>  		__mptcp_close_ssk(sk,
>> mptcp_subflow_tcp_sock(subflow),
>>  				  subflow, MPTCP_CF_FASTCLOSE);
>> @@ -2820,6 +2866,7 @@ static void __mptcp_init_sock(struct sock *sk)
>>  	INIT_LIST_HEAD(&msk->conn_list);
>>  	INIT_LIST_HEAD(&msk->join_list);
>>  	INIT_LIST_HEAD(&msk->rtx_queue);
>> +	INIT_LIST_HEAD(&msk->backlog_list);
>>  	INIT_WORK(&msk->work, mptcp_worker);
>>  	msk->out_of_order_queue = RB_ROOT;
>>  	msk->first_pending = NULL;
>> @@ -3199,9 +3246,13 @@ static void mptcp_destroy_common(struct
>> mptcp_sock *msk, unsigned int flags)
>>  {
>>  	struct mptcp_subflow_context *subflow, *tmp;
>>  	struct sock *sk = (struct sock *)msk;
>> +	struct sk_buff *skb;
>>  
>>  	__mptcp_clear_xmit(sk);
>>  
>> +	list_for_each_entry(skb, &msk->backlog_list, list)
>> +		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
>> +
> 
> I encountered this error while testing the splice test:
> 
> [ 6708.714803] BUG: KFENCE: use-after-free read in mptcp_destroy_common
> (net/mptcp/protocol.c:3233 (discriminator 4))  
> [ 6708.714803] 
> [ 6708.714848] Use-after-free read at 0x000000007fc00f9c (in kfence-
> #253):
> 
> Holding the msk data lock while traversing the backlog_list can fix
> this error:
> 
>         mptcp_data_lock(sk);
>         list_for_each_entry(skb, &msk->backlog_list, list)
>                 kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
>         mptcp_data_unlock(sk);
> 
> Similarly, I have also added the holding of the msk data lock in
> mptcp_do_fastclose().
> 
> I would like to ask if this modification is acceptable. If it works, I
> can send a squash-to patch for this.

Thank you for the feedback!

Indeed the current mptcp_destroy_common() impl is quite broken. Beyond
the missing spinlock, there is double free caused by missing removal
from backlog list.

I have a new review ready that should address the above and a few more
issues reported by the CI.  I'll share it soon!

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing
  2025-10-03 14:01 [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing Paolo Abeni
                   ` (7 preceding siblings ...)
  2025-10-03 14:01 ` [PATCH v4 mptcp-next 8/8] mptcp: leverage the backlog for RX packet processing Paolo Abeni
@ 2025-10-03 16:10 ` MPTCP CI
  2025-10-04  1:26 ` Geliang Tang
  9 siblings, 0 replies; 14+ messages in thread
From: MPTCP CI @ 2025-10-03 16:10 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal (except selftest_mptcp_join): Success! ✅
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Success! ✅
- KVM Validation: debug (only selftest_mptcp_join): Unstable: 1 failed test(s): selftest_mptcp_join - Critical: 4 Call Trace(s) ❌
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/18226665550

Initiator: Matthieu Baerts (NGI0)
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/2ed154e487d2
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1008293


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing
  2025-10-03 14:01 [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing Paolo Abeni
                   ` (8 preceding siblings ...)
  2025-10-03 16:10 ` [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing MPTCP CI
@ 2025-10-04  1:26 ` Geliang Tang
  9 siblings, 0 replies; 14+ messages in thread
From: Geliang Tang @ 2025-10-04  1:26 UTC (permalink / raw)
  To: Paolo Abeni, mptcp

Hi Paolo,

Thanks for this v4.

On Fri, 2025-10-03 at 16:01 +0200, Paolo Abeni wrote:
> This series includes RX path improvement built around backlog
> processing
> 
> The main goals are improving the RX performances _and_ increase the
> long term maintainability.
> 
> Patches 1-3 prepare the stack for backlog processing, removing
> assumptions that will not hold true anymore after backlog
> introduction.
> 
> Patch 4 fixes a long standing issue which is quite hard to reproduce
> with the current implementation but will become very apparent with
> backlog usage.
> 
> Patches 5 and 6 are more cleanups that will make the backlog patch a
> little less huge.
> 
> Patch 7 is a somewhat unrelated cleanup, included here before I
> forgot
> about it.
> 
> The real work is done by patch 8. There are significant changes vs
> the
> previous iteration, as it turned out we can't uset the sk_backlog, as
> the mptcp release callback can also release and re-acquire the msk-
> level
> spinlock and core backlog processing works under the assumption that
> such event is not possible.
> 
> Other relevant points are:
> - skbs in the backlog are _not_ accounted. TCP does the same, and we
>   can't update the fwd mem while enqueuing to the backlog as the
> caller
>   does not own the msk-level socket lock nor can acquire it.
> - skbs in the backlog still use the incoming ssk rmem. This allows
>   backpressure and implicitly prevent excessive memory usage for the
>   backlog itself
> - [this is possibly the most critical point]: when the msk rx buf is
>   full, we don't add more packets there even when the caller owns the
>   msk socket lock. Instead packets are added to the backlog. Note
> that
>   the amount of memory used there is still limited by the above. Also
>   note that this implicitly means that such packets could stage in
> the
>   backlog until the receiver flushes the rx buffer - an unbound
> amount
>   of time. That is not supposed to happen for the backlog, ence the
>   criticality here.
> 
> This survived a few hours of selftest iterations in a loop: should
> address all functional issues observed in previous iterations (and
> possibly includes different ones ;)

This set looks good to me!

Reviewed-by: Geliang Tang <geliang@kernel.org>

Thanks,
-Geliang

> 
> Paolo Abeni (8):
>   mptcp: borrow forward memory from subflow
>   mptcp: cleanup fallback data fin reception
>   mptcp: cleanup fallback dummy mapping generation
>   mptcp: fix MSG_PEEK stream corruption
>   mptcp: ensure the kernel PM does not take action too late
>   mptcp: do not miss early first subflow close event notification.
>   mptcp: make mptcp_destroy_common() static
>   mptcp: leverage the backlog for RX packet processing
> 
>  net/mptcp/pm.c        |   4 +-
>  net/mptcp/pm_kernel.c |   2 +
>  net/mptcp/protocol.c  | 306 +++++++++++++++++++++++++++-------------
> --
>  net/mptcp/protocol.h  |   7 +-
>  net/mptcp/subflow.c   |  12 +-
>  5 files changed, 215 insertions(+), 116 deletions(-)
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-10-06  7:52 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-03 14:01 [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing Paolo Abeni
2025-10-03 14:01 ` [PATCH v4 mptcp-next 1/8] mptcp: borrow forward memory from subflow Paolo Abeni
2025-10-03 15:41   ` Matthieu Baerts
2025-10-03 14:01 ` [PATCH v4 mptcp-next 2/8] mptcp: cleanup fallback data fin reception Paolo Abeni
2025-10-03 14:01 ` [PATCH v4 mptcp-next 3/8] mptcp: cleanup fallback dummy mapping generation Paolo Abeni
2025-10-03 14:01 ` [PATCH v4 mptcp-next 4/8] mptcp: fix MSG_PEEK stream corruption Paolo Abeni
2025-10-03 14:01 ` [PATCH v4 mptcp-next 5/8] mptcp: ensure the kernel PM does not take action too late Paolo Abeni
2025-10-03 14:01 ` [PATCH v4 mptcp-next 6/8] mptcp: do not miss early first subflow close event notification Paolo Abeni
2025-10-03 14:01 ` [PATCH v4 mptcp-next 7/8] mptcp: make mptcp_destroy_common() static Paolo Abeni
2025-10-03 14:01 ` [PATCH v4 mptcp-next 8/8] mptcp: leverage the backlog for RX packet processing Paolo Abeni
2025-10-06  5:09   ` Geliang Tang
2025-10-06  7:51     ` Paolo Abeni
2025-10-03 16:10 ` [PATCH v4 mptcp-next 0/8] mptcp: introduce backlog processing MPTCP CI
2025-10-04  1:26 ` Geliang Tang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.