* [PATCH v6 mptcp-next 01/11] mptcp: drop bogus optimization in __mptcp_check_push()
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
@ 2025-10-22 14:31 ` Paolo Abeni
2025-10-22 14:31 ` [PATCH v6 mptcp-next 02/11] mptcp: borrow forward memory from subflow Paolo Abeni
` (12 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paolo Abeni @ 2025-10-22 14:31 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
Accessing the transmit queue without owning the msk socket lock is
inherently racy, hence __mptcp_check_push() could actually quit early
even when there is pending data.
That in turn could cause unexpected tx stalls and timeouts.
Dropping the early check avoids the race, implicitly relying on later
tests under the relevant lock. With this change, all the other
mptcp_send_head() call sites are now under the msk socket lock and we
can additionally drop the now unneeded annotations on the transmit head
pointer accesses.
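For illustration only, a minimal userspace sketch of the pattern at stake
(hypothetical names and a pthread mutex, not the kernel code): a lockless
emptiness test can observe a stale head pointer while another thread is
appending, so the push is skipped; a test done only under the lock cannot
miss pending data.

#include <pthread.h>
#include <stddef.h>

struct queue {
	pthread_mutex_t lock;
	struct node *head;	/* pending transmit data */
};

/* Racy variant, analogous to the check removed here: the lockless peek
 * can see a stale NULL while a writer is publishing a new head, leaving
 * pending data unsent until a timeout fires.
 */
static void check_push_racy(struct queue *q)
{
	if (!q->head)		/* lockless peek: may be stale */
		return;

	pthread_mutex_lock(&q->lock);
	/* ... push pending data ... */
	pthread_mutex_unlock(&q->lock);
}

/* Fixed variant: test only under the lock. */
static void check_push_fixed(struct queue *q)
{
	pthread_mutex_lock(&q->lock);
	if (q->head) {
		/* ... push pending data ... */
	}
	pthread_mutex_unlock(&q->lock);
}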
Fixes: 6e628cd3a8f7 ("mptcp: use mptcp release_cb for delayed tasks")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/protocol.c | 11 ++++-------
net/mptcp/protocol.h | 2 +-
2 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index f42e28a031f39a..804227736638e3 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -998,7 +998,7 @@ static void __mptcp_clean_una(struct sock *sk)
if (WARN_ON_ONCE(!msk->recovery))
break;
- WRITE_ONCE(msk->first_pending, mptcp_send_next(sk));
+ msk->first_pending = mptcp_send_next(sk);
}
dfrag_clear(sk, dfrag);
@@ -1543,7 +1543,7 @@ static int __subflow_push_pending(struct sock *sk, struct sock *ssk,
mptcp_update_post_push(msk, dfrag, ret);
}
- WRITE_ONCE(msk->first_pending, mptcp_send_next(sk));
+ msk->first_pending = mptcp_send_next(sk);
if (msk->snd_burst <= 0 ||
!sk_stream_memory_free(ssk) ||
@@ -1903,7 +1903,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
get_page(dfrag->page);
list_add_tail(&dfrag->list, &msk->rtx_queue);
if (!msk->first_pending)
- WRITE_ONCE(msk->first_pending, dfrag);
+ msk->first_pending = dfrag;
}
pr_debug("msk=%p dfrag at seq=%llu len=%u sent=%u new=%d\n", msk,
dfrag->data_seq, dfrag->data_len, dfrag->already_sent,
@@ -2874,7 +2874,7 @@ static void __mptcp_clear_xmit(struct sock *sk)
struct mptcp_sock *msk = mptcp_sk(sk);
struct mptcp_data_frag *dtmp, *dfrag;
- WRITE_ONCE(msk->first_pending, NULL);
+ msk->first_pending = NULL;
list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list)
dfrag_clear(sk, dfrag);
}
@@ -3414,9 +3414,6 @@ void __mptcp_data_acked(struct sock *sk)
void __mptcp_check_push(struct sock *sk, struct sock *ssk)
{
- if (!mptcp_send_head(sk))
- return;
-
if (!sock_owned_by_user(sk))
__mptcp_subflow_push_pending(sk, ssk, false);
else
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 0545eab231250d..a3bbce8950a5e0 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -415,7 +415,7 @@ static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
{
const struct mptcp_sock *msk = mptcp_sk(sk);
- return READ_ONCE(msk->first_pending);
+ return msk->first_pending;
}
static inline struct mptcp_data_frag *mptcp_send_next(struct sock *sk)
--
2.51.0
* [PATCH v6 mptcp-next 02/11] mptcp: borrow forward memory from subflow
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
2025-10-22 14:31 ` [PATCH v6 mptcp-next 01/11] mptcp: drop bogus optimization in __mptcp_check_push() Paolo Abeni
@ 2025-10-22 14:31 ` Paolo Abeni
2025-10-23 6:38 ` Geliang Tang
2025-10-22 14:31 ` [PATCH v6 mptcp-next 03/11] mptcp: cleanup fallback data fin reception Paolo Abeni
` (11 subsequent siblings)
13 siblings, 1 reply; 23+ messages in thread
From: Paolo Abeni @ 2025-10-22 14:31 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
In the MPTCP receive path, we release the subflow allocated
fwd memory just to allocate it again shortly after for the msk.
That could increase the chance of failures, especially during
backlog processing, when other actions could consume the just
released memory before the msk socket has a chance to do the
rcv allocation.
Replace the skb_orphan() call with an open-coded variant that
explicitly borrows, with a PAGE_SIZE granularity, the fwd memory
from the subflow socket instead of releasing it. During backlog
processing the borrowed memory is accounted at release_cb time.
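A tiny standalone sketch of the page-granularity arithmetic used below
(PAGE_SIZE assumed to be 4096 for the example; names are illustrative):

#include <stdio.h>

#define PAGE_SIZE 4096

/* Mirror of "borrowed &= ~(PAGE_SIZE - 1)": round the amount taken
 * from the subflow fwd allocation down to a whole number of pages;
 * the remainder stays accounted to the subflow.
 */
static int borrowable(int fwd_alloc, int unused_reserved)
{
	int borrowed = fwd_alloc - unused_reserved;

	return borrowed & ~(PAGE_SIZE - 1);
}

int main(void)
{
	/* 10000 bytes available, nothing reserved: borrow 2 full
	 * pages (8192); the 1808 byte remainder is not borrowed.
	 */
	printf("%d\n", borrowable(10000, 0));
	return 0;
}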
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v1 -> v2:
- rebased
- explain why skb_orphan is removed
---
net/mptcp/protocol.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 804227736638e3..372ae2d9fd229e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -337,11 +337,12 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
mptcp_rcvbuf_grow(sk);
}
-static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
- int copy_len)
+static int mptcp_init_skb(struct sock *ssk,
+ struct sk_buff *skb, int offset, int copy_len)
{
const struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
bool has_rxtstamp = TCP_SKB_CB(skb)->has_rxtstamp;
+ int borrowed;
/* the skb map_seq accounts for the skb offset:
* mptcp_subflow_get_mapped_dsn() is based on the current tp->copied_seq
@@ -357,6 +358,13 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
skb_ext_reset(skb);
skb_dst_drop(skb);
+
+ /* "borrow" the fwd memory from the subflow, instead of reclaiming it */
+ skb->destructor = NULL;
+ borrowed = ssk->sk_forward_alloc - sk_unused_reserved_mem(ssk);
+ borrowed &= ~(PAGE_SIZE - 1);
+ sk_forward_alloc_add(ssk, skb->truesize - borrowed);
+ return borrowed;
}
static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
@@ -690,9 +698,12 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
if (offset < skb->len) {
size_t len = skb->len - offset;
+ int bmem;
- mptcp_init_skb(ssk, skb, offset, len);
- skb_orphan(skb);
+ bmem = mptcp_init_skb(ssk, skb, offset, len);
+ skb->sk = NULL;
+ sk_forward_alloc_add(sk, bmem);
+ atomic_sub(skb->truesize, &ssk->sk_rmem_alloc);
ret = __mptcp_move_skb(sk, skb) || ret;
seq += len;
--
2.51.0
* Re: [PATCH v6 mptcp-next 02/11] mptcp: borrow forward memory from subflow
2025-10-22 14:31 ` [PATCH v6 mptcp-next 02/11] mptcp: borrow forward memory from subflow Paolo Abeni
@ 2025-10-23 6:38 ` Geliang Tang
0 siblings, 0 replies; 23+ messages in thread
From: Geliang Tang @ 2025-10-23 6:38 UTC (permalink / raw)
To: Paolo Abeni, mptcp; +Cc: Mat Martineau
On Wed, 2025-10-22 at 16:31 +0200, Paolo Abeni wrote:
> In the MPTCP receive path, we release the subflow allocated
> fwd memory just to allocate it again shortly after for the msk.
>
> That could increase the chance of failures, especially during
> backlog processing, when other actions could consume the just
> released memory before the msk socket has a chance to do the
> rcv allocation.
>
> Replace the skb_orphan() call with an open-coded variant that
> explicitly borrows, with a PAGE_SIZE granularity, the fwd memory
> from the subflow socket instead of releasing it. During backlog
> processing the borrowed memory is accounted at release_cb time.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v1 -> v2:
> - rebased
> - explain why skb_orphan is removed
> ---
> net/mptcp/protocol.c | 19 +++++++++++++++----
> 1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 804227736638e3..372ae2d9fd229e 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -337,11 +337,12 @@ static void mptcp_data_queue_ofo(struct
> mptcp_sock *msk, struct sk_buff *skb)
> mptcp_rcvbuf_grow(sk);
> }
>
> -static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb,
> int offset,
> - int copy_len)
> +static int mptcp_init_skb(struct sock *ssk,
> + struct sk_buff *skb, int offset, int
> copy_len)
nit:
int mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
int copy_len)
is better.
> {
> const struct mptcp_subflow_context *subflow =
> mptcp_subflow_ctx(ssk);
> bool has_rxtstamp = TCP_SKB_CB(skb)->has_rxtstamp;
> + int borrowed;
>
> /* the skb map_seq accounts for the skb offset:
> * mptcp_subflow_get_mapped_dsn() is based on the current
> tp->copied_seq
> @@ -357,6 +358,13 @@ static void mptcp_init_skb(struct sock *ssk,
> struct sk_buff *skb, int offset,
>
> skb_ext_reset(skb);
> skb_dst_drop(skb);
> +
> + /* "borrow" the fwd memory from the subflow, instead of
> reclaiming it */
> + skb->destructor = NULL;
> + borrowed = ssk->sk_forward_alloc -
> sk_unused_reserved_mem(ssk);
> + borrowed &= ~(PAGE_SIZE - 1);
> + sk_forward_alloc_add(ssk, skb->truesize - borrowed);
> + return borrowed;
> }
>
> static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
> @@ -690,9 +698,12 @@ static bool
> __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
>
> if (offset < skb->len) {
> size_t len = skb->len - offset;
> + int bmem;
>
> - mptcp_init_skb(ssk, skb, offset, len);
> - skb_orphan(skb);
> + bmem = mptcp_init_skb(ssk, skb, offset,
> len);
> + skb->sk = NULL;
> + sk_forward_alloc_add(sk, bmem);
> + atomic_sub(skb->truesize, &ssk-
> >sk_rmem_alloc);
> ret = __mptcp_move_skb(sk, skb) || ret;
> seq += len;
>
* [PATCH v6 mptcp-next 03/11] mptcp: cleanup fallback data fin reception
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
2025-10-22 14:31 ` [PATCH v6 mptcp-next 01/11] mptcp: drop bogus optimization in __mptcp_check_push() Paolo Abeni
2025-10-22 14:31 ` [PATCH v6 mptcp-next 02/11] mptcp: borrow forward memory from subflow Paolo Abeni
@ 2025-10-22 14:31 ` Paolo Abeni
2025-10-22 14:31 ` [PATCH v6 mptcp-next 04/11] mptcp: cleanup fallback dummy mapping generation Paolo Abeni
` (10 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paolo Abeni @ 2025-10-22 14:31 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
MPTCP currently generates a dummy data_fin for fallback sockets
when the fallback subflow has completed data reception, using
the current ack_seq.
We are going to introduce backlog usage for the msk soon, even
for fallback sockets: the ack_seq value will not match the most recent
sequence number seen by the fallback subflow socket, as it will ignore
data_seq sitting in the backlog.
Instead use the last map sequence number to set the data_fin,
as fallback (dummy) map sequences are always in sequence.
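As a toy model of the sequence arithmetic (hypothetical types, not the
kernel structs): since fallback mappings are contiguous, the end of the
last mapping is the next expected data sequence number, and thus a safe
data_fin value even while some data still sits in the backlog.

#include <stdint.h>
#include <assert.h>

struct map {
	uint64_t seq;	/* akin to subflow->map_seq */
	uint32_t len;	/* akin to subflow->map_data_len */
};

int main(void)
{
	struct map last = { .seq = 1000, .len = 300 };

	/* first byte past the last in-sequence mapping */
	uint64_t data_fin_seq = last.seq + last.len;

	assert(data_fin_seq == 1300);
	return 0;
}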
Reviewed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v2 -> v3:
- keep the close check in subflow_sched_work_if_closed, fix
CI failures
---
net/mptcp/subflow.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index e8325890a32238..b9455c04e8a462 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1285,6 +1285,7 @@ static bool subflow_is_done(const struct sock *sk)
/* sched mptcp worker for subflow cleanup if no more data is pending */
static void subflow_sched_work_if_closed(struct mptcp_sock *msk, struct sock *ssk)
{
+ const struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
struct sock *sk = (struct sock *)msk;
if (likely(ssk->sk_state != TCP_CLOSE &&
@@ -1303,7 +1304,8 @@ static void subflow_sched_work_if_closed(struct mptcp_sock *msk, struct sock *ss
*/
if (__mptcp_check_fallback(msk) && subflow_is_done(ssk) &&
msk->first == ssk &&
- mptcp_update_rcv_data_fin(msk, READ_ONCE(msk->ack_seq), true))
+ mptcp_update_rcv_data_fin(msk, subflow->map_seq +
+ subflow->map_data_len, true))
mptcp_schedule_work(sk);
}
--
2.51.0
* [PATCH v6 mptcp-next 04/11] mptcp: cleanup fallback dummy mapping generation
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
` (2 preceding siblings ...)
2025-10-22 14:31 ` [PATCH v6 mptcp-next 03/11] mptcp: cleanup fallback data fin reception Paolo Abeni
@ 2025-10-22 14:31 ` Paolo Abeni
2025-10-22 14:31 ` [PATCH v6 mptcp-next 05/11] mptcp: fix MSG_PEEK stream corruption Paolo Abeni
` (9 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paolo Abeni @ 2025-10-22 14:31 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
MPTCP currently accesses ack_seq outside the msk socket lock scope to
generate the dummy mapping for fallback sockets. Soon we are going
to introduce backlog usage, and even for fallback sockets the ack_seq
value will be significantly off outside of the msk socket lock scope.
Avoid relying on ack_seq for dummy mapping generation, using instead
the subflow sequence number. Note that in case of disconnect() and
(re)connect() we must ensure that any previous state is re-set.
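For reference, a standalone sketch of the kind of 32-bit to 64-bit
sequence expansion involved here; this shows the general idea behind
helpers like __mptcp_expand_seq(), not the in-kernel implementation,
which may differ in detail. It assumes sequence numbers only move
forward:

#include <stdint.h>
#include <assert.h>

/* Pick the 64-bit value whose low 32 bits match cur32 and that is
 * closest to (at or ahead of) old_seq, bumping the high word when
 * the 32-bit counter has wrapped.
 */
static uint64_t expand_seq(uint64_t old_seq, uint32_t cur32)
{
	uint64_t expanded = (old_seq & ~0xffffffffULL) | cur32;

	if (expanded + 0x80000000ULL < old_seq)	/* wrapped past old_seq */
		expanded += 1ULL << 32;
	return expanded;
}

int main(void)
{
	/* the 32-bit counter wrapped: 0x100000010 is the right answer */
	assert(expand_seq(0xfffffff0ULL, 0x10) == 0x100000010ULL);
	return 0;
}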
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v2 -> v3:
- reordered before the backlog introduction to avoid transiently
break the fallback
- explicitly reset ack_seq
---
net/mptcp/protocol.c | 3 +++
net/mptcp/subflow.c | 8 +++++++-
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 372ae2d9fd229e..51c55a45aeaccd 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3234,6 +3234,9 @@ static int mptcp_disconnect(struct sock *sk, int flags)
msk->bytes_retrans = 0;
msk->rcvspace_init = 0;
+ /* for fallback's sake */
+ WRITE_ONCE(msk->ack_seq, 0);
+
WRITE_ONCE(sk->sk_shutdown, 0);
sk_error_report(sk);
return 0;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index b9455c04e8a462..ac8616e7521e8a 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -491,6 +491,9 @@ static void subflow_set_remote_key(struct mptcp_sock *msk,
mptcp_crypto_key_sha(subflow->remote_key, NULL, &subflow->iasn);
subflow->iasn++;
+ /* for fallback's sake */
+ subflow->map_seq = subflow->iasn;
+
WRITE_ONCE(msk->remote_key, subflow->remote_key);
WRITE_ONCE(msk->ack_seq, subflow->iasn);
WRITE_ONCE(msk->can_ack, true);
@@ -1435,9 +1438,12 @@ static bool subflow_check_data_avail(struct sock *ssk)
skb = skb_peek(&ssk->sk_receive_queue);
subflow->map_valid = 1;
- subflow->map_seq = READ_ONCE(msk->ack_seq);
subflow->map_data_len = skb->len;
subflow->map_subflow_seq = tcp_sk(ssk)->copied_seq - subflow->ssn_offset;
+ subflow->map_seq = __mptcp_expand_seq(subflow->map_seq,
+ subflow->iasn +
+ TCP_SKB_CB(skb)->seq -
+ subflow->ssn_offset - 1);
WRITE_ONCE(subflow->data_avail, true);
return true;
}
--
2.51.0
* [PATCH v6 mptcp-next 05/11] mptcp: fix MSG_PEEK stream corruption
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
` (3 preceding siblings ...)
2025-10-22 14:31 ` [PATCH v6 mptcp-next 04/11] mptcp: cleanup fallback dummy mapping generation Paolo Abeni
@ 2025-10-22 14:31 ` Paolo Abeni
2025-10-23 16:56 ` Mat Martineau
2025-10-22 14:31 ` [PATCH v6 mptcp-next 06/11] mptcp: ensure the kernel PM does not take action too late Paolo Abeni
` (8 subsequent siblings)
13 siblings, 1 reply; 23+ messages in thread
From: Paolo Abeni @ 2025-10-22 14:31 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
If a MSG_PEEK | MSG_WAITALL read operation consumes all the bytes in the
receive queue and recvmsg() needs to wait for more data - i.e. it's a
blocking one - upon arrival of the next packet the MPTCP protocol will
start again copying the oldest data present in the receive queue,
corrupting the data stream.
Address the issue by explicitly tracking the peeked sequence number,
restarting from the last peeked byte.
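A rough userspace illustration of the failure mode (a hypothetical
helper, not the selftest): a blocking peek larger than what is
currently queued makes recvmsg() wait mid-peek, which is exactly the
window where the old code restarted copying from the head of the
receive queue.

#include <sys/types.h>
#include <sys/socket.h>

/* Peek 'len' bytes in one blocking call: if only part of the data is
 * queued, the call waits for the rest. On a correct stack the result
 * equals the stream prefix later returned by a plain read; with the
 * bug, the bytes queued before the wait were copied twice.
 */
static ssize_t peek_all(int fd, void *buf, size_t len)
{
	return recv(fd, buf, len, MSG_PEEK | MSG_WAITALL);
}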
Fixes: ca4fb892579f ("mptcp: add MSG_PEEK support")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This may sound quite esoteric, but it will soon become very easy to
reproduce with mptcp_connect, thanks to the backlog.
---
net/mptcp/protocol.c | 38 +++++++++++++++++++++++++-------------
1 file changed, 25 insertions(+), 13 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 51c55a45aeaccd..200b657080eb3e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1947,22 +1947,36 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied);
-static int __mptcp_recvmsg_mskq(struct sock *sk,
- struct msghdr *msg,
- size_t len, int flags,
+static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg,
+ size_t len, int flags, int copied_total,
struct scm_timestamping_internal *tss,
int *cmsg_flags)
{
struct mptcp_sock *msk = mptcp_sk(sk);
struct sk_buff *skb, *tmp;
+ int total_data_len = 0;
int copied = 0;
skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
- u32 offset = MPTCP_SKB_CB(skb)->offset;
+ u32 delta, offset = MPTCP_SKB_CB(skb)->offset;
u32 data_len = skb->len - offset;
- u32 count = min_t(size_t, len - copied, data_len);
+ u32 count;
int err;
+ if (flags & MSG_PEEK) {
+ /* skip already peeked skbs */
+ if (total_data_len + data_len <= copied_total) {
+ total_data_len += data_len;
+ continue;
+ }
+
+ /* skip the already peeked data in the current skb */
+ delta = copied_total - total_data_len;
+ offset += delta;
+ data_len -= delta;
+ }
+
+ count = min_t(size_t, len - copied, data_len);
if (!(flags & MSG_TRUNC)) {
err = skb_copy_datagram_msg(skb, offset, msg, count);
if (unlikely(err < 0)) {
@@ -1979,16 +1993,14 @@ static int __mptcp_recvmsg_mskq(struct sock *sk,
copied += count;
- if (count < data_len) {
- if (!(flags & MSG_PEEK)) {
+ if (!(flags & MSG_PEEK)) {
+ msk->bytes_consumed += count;
+ if (count < data_len) {
MPTCP_SKB_CB(skb)->offset += count;
MPTCP_SKB_CB(skb)->map_seq += count;
- msk->bytes_consumed += count;
+ break;
}
- break;
- }
- if (!(flags & MSG_PEEK)) {
/* avoid the indirect call, we know the destructor is sock_rfree */
skb->destructor = NULL;
skb->sk = NULL;
@@ -1996,7 +2008,6 @@ static int __mptcp_recvmsg_mskq(struct sock *sk,
sk_mem_uncharge(sk, skb->truesize);
__skb_unlink(skb, &sk->sk_receive_queue);
skb_attempt_defer_free(skb);
- msk->bytes_consumed += count;
}
if (copied >= len)
@@ -2194,7 +2205,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
while (copied < len) {
int err, bytes_read;
- bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags, &tss, &cmsg_flags);
+ bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags,
+ copied, &tss, &cmsg_flags);
if (unlikely(bytes_read < 0)) {
if (!copied)
copied = bytes_read;
--
2.51.0
* Re: [PATCH v6 mptcp-next 05/11] mptcp: fix MSG_PEEK stream corruption
2025-10-22 14:31 ` [PATCH v6 mptcp-next 05/11] mptcp: fix MSG_PEEK stream corruption Paolo Abeni
@ 2025-10-23 16:56 ` Mat Martineau
2025-10-24 7:34 ` Paolo Abeni
0 siblings, 1 reply; 23+ messages in thread
From: Mat Martineau @ 2025-10-23 16:56 UTC (permalink / raw)
To: Paolo Abeni; +Cc: mptcp, Geliang Tang
On Wed, 22 Oct 2025, Paolo Abeni wrote:
> If a MSG_PEEK | MSG_WAITALL read operation consumes all the bytes in the
> receive queue and recvmsg() needs to wait for more data - i.e. it's a
> blocking one - upon arrival of the next packet the MPTCP protocol will
> start again copying the oldest data present in the receive queue,
> corrupting the data stream.
>
> Address the issue explicitly tracking the peeked sequence number,
> restarting from the last peeked byte.
>
> Fixes: ca4fb892579f ("mptcp: add MSG_PEEK support")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> This may sound quite esoteric, but it will soon become very easy to
> reproduce with mptcp_connect, thanks to the backlog.
Would it be good to apply this to -net?
- Mat
> ---
> net/mptcp/protocol.c | 38 +++++++++++++++++++++++++-------------
> 1 file changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 51c55a45aeaccd..200b657080eb3e 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1947,22 +1947,36 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>
> static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied);
>
> -static int __mptcp_recvmsg_mskq(struct sock *sk,
> - struct msghdr *msg,
> - size_t len, int flags,
> +static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg,
> + size_t len, int flags, int copied_total,
> struct scm_timestamping_internal *tss,
> int *cmsg_flags)
> {
> struct mptcp_sock *msk = mptcp_sk(sk);
> struct sk_buff *skb, *tmp;
> + int total_data_len = 0;
> int copied = 0;
>
> skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
> - u32 offset = MPTCP_SKB_CB(skb)->offset;
> + u32 delta, offset = MPTCP_SKB_CB(skb)->offset;
> u32 data_len = skb->len - offset;
> - u32 count = min_t(size_t, len - copied, data_len);
> + u32 count;
> int err;
>
> + if (flags & MSG_PEEK) {
> + /* skip already peeked skbs */
> + if (total_data_len + data_len <= copied_total) {
> + total_data_len += data_len;
> + continue;
> + }
> +
> + /* skip the already peeked data in the current skb */
> + delta = copied_total - total_data_len;
> + offset += delta;
> + data_len -= delta;
> + }
> +
> + count = min_t(size_t, len - copied, data_len);
> if (!(flags & MSG_TRUNC)) {
> err = skb_copy_datagram_msg(skb, offset, msg, count);
> if (unlikely(err < 0)) {
> @@ -1979,16 +1993,14 @@ static int __mptcp_recvmsg_mskq(struct sock *sk,
>
> copied += count;
>
> - if (count < data_len) {
> - if (!(flags & MSG_PEEK)) {
> + if (!(flags & MSG_PEEK)) {
> + msk->bytes_consumed += count;
> + if (count < data_len) {
> MPTCP_SKB_CB(skb)->offset += count;
> MPTCP_SKB_CB(skb)->map_seq += count;
> - msk->bytes_consumed += count;
> + break;
> }
> - break;
> - }
>
> - if (!(flags & MSG_PEEK)) {
> /* avoid the indirect call, we know the destructor is sock_rfree */
> skb->destructor = NULL;
> skb->sk = NULL;
> @@ -1996,7 +2008,6 @@ static int __mptcp_recvmsg_mskq(struct sock *sk,
> sk_mem_uncharge(sk, skb->truesize);
> __skb_unlink(skb, &sk->sk_receive_queue);
> skb_attempt_defer_free(skb);
> - msk->bytes_consumed += count;
> }
>
> if (copied >= len)
> @@ -2194,7 +2205,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> while (copied < len) {
> int err, bytes_read;
>
> - bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags, &tss, &cmsg_flags);
> + bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags,
> + copied, &tss, &cmsg_flags);
> if (unlikely(bytes_read < 0)) {
> if (!copied)
> copied = bytes_read;
> --
> 2.51.0
>
>
>
* Re: [PATCH v6 mptcp-next 05/11] mptcp: fix MSG_PEEK stream corruption
2025-10-23 16:56 ` Mat Martineau
@ 2025-10-24 7:34 ` Paolo Abeni
0 siblings, 0 replies; 23+ messages in thread
From: Paolo Abeni @ 2025-10-24 7:34 UTC (permalink / raw)
To: Mat Martineau; +Cc: mptcp, Geliang Tang
On 10/23/25 6:56 PM, Mat Martineau wrote:
> On Wed, 22 Oct 2025, Paolo Abeni wrote:
>
>> If a MSG_PEEK | MSG_WAITALL read operation consumes all the bytes in the
>> receive queue and recvmsg() needs to wait for more data - i.e. it's a
>> blocking one - upon arrival of the next packet the MPTCP protocol will
>> start again copying the oldest data present in the receive queue,
>> corrupting the data stream.
>>
>> Address the issue explicitly tracking the peeked sequence number,
>> restarting from the last peeked byte.
>>
>> Fixes: ca4fb892579f ("mptcp: add MSG_PEEK support")
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> This may sound quite esoteric, but it will soon become very easy to
>> reproduce with mptcp_connect, thanks to the backlog.
>
> Would it be good to apply this to -net?
FWIW, I'm fine with applying this to -net.
Thanks,
Paolo
* [PATCH v6 mptcp-next 06/11] mptcp: ensure the kernel PM does not take action too late
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
` (4 preceding siblings ...)
2025-10-22 14:31 ` [PATCH v6 mptcp-next 05/11] mptcp: fix MSG_PEEK stream corruption Paolo Abeni
@ 2025-10-22 14:31 ` Paolo Abeni
2025-10-22 14:31 ` [PATCH v6 mptcp-next 07/11] mptcp: do not miss early first subflow close event notification Paolo Abeni
` (7 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paolo Abeni @ 2025-10-22 14:31 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
The PM hooks can currently take place when the msk is already
shutting down. Subflow creation will fail, thanks to the existing
check at join time, but we can entirely avoid starting the doomed
operations.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/pm.c | 4 +++-
net/mptcp/pm_kernel.c | 2 ++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index daf6dcb8068431..eade530d38e018 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -588,6 +588,7 @@ void mptcp_pm_subflow_established(struct mptcp_sock *msk)
void mptcp_pm_subflow_check_next(struct mptcp_sock *msk,
const struct mptcp_subflow_context *subflow)
{
+ struct sock *sk = (struct sock *)msk;
struct mptcp_pm_data *pm = &msk->pm;
bool update_subflows;
@@ -611,7 +612,8 @@ void mptcp_pm_subflow_check_next(struct mptcp_sock *msk,
/* Even if this subflow is not really established, tell the PM to try
* to pick the next ones, if possible.
*/
- if (mptcp_pm_nl_check_work_pending(msk))
+ if (mptcp_is_fully_established(sk) &&
+ mptcp_pm_nl_check_work_pending(msk))
mptcp_pm_schedule_work(msk, MPTCP_PM_SUBFLOW_ESTABLISHED);
spin_unlock_bh(&pm->lock);
diff --git a/net/mptcp/pm_kernel.c b/net/mptcp/pm_kernel.c
index da431da16ae04c..07b5142004e73e 100644
--- a/net/mptcp/pm_kernel.c
+++ b/net/mptcp/pm_kernel.c
@@ -328,6 +328,8 @@ static void mptcp_pm_create_subflow_or_signal_addr(struct mptcp_sock *msk)
struct mptcp_pm_local local;
mptcp_mpc_endpoint_setup(msk);
+ if (!mptcp_is_fully_established(sk))
+ return;
pr_debug("local %d:%d signal %d:%d subflows %d:%d\n",
msk->pm.local_addr_used, endp_subflow_max,
--
2.51.0
* [PATCH v6 mptcp-next 07/11] mptcp: do not miss early first subflow close event notification.
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
` (5 preceding siblings ...)
2025-10-22 14:31 ` [PATCH v6 mptcp-next 06/11] mptcp: ensure the kernel PM does not take action too late Paolo Abeni
@ 2025-10-22 14:31 ` Paolo Abeni
2025-10-22 14:31 ` [PATCH v6 mptcp-next 08/11] mptcp: make mptcp_destroy_common() static Paolo Abeni
` (6 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paolo Abeni @ 2025-10-22 14:31 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
The MPTCP protocol is not currently emitting the NL event when the first
subflow is closed before msk accept() time.
By replacing the close helper in use in such a scenario, we implicitly
introduce the missing notification. Note that in such a scenario we want
to be sure that mptcp_close_ssk() will not trigger any PM work; move the
msk state change earlier, so that the previous patch will offer such a
guarantee.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/protocol.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 200b657080eb3e..8c835ba5a437a2 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3985,10 +3985,10 @@ static int mptcp_stream_accept(struct socket *sock, struct socket *newsock,
* deal with bad peers not doing a complete shutdown.
*/
if (unlikely(inet_sk_state_load(msk->first) == TCP_CLOSE)) {
- __mptcp_close_ssk(newsk, msk->first,
- mptcp_subflow_ctx(msk->first), 0);
if (unlikely(list_is_singular(&msk->conn_list)))
mptcp_set_state(newsk, TCP_CLOSE);
+ mptcp_close_ssk(newsk, msk->first,
+ mptcp_subflow_ctx(msk->first));
}
} else {
tcpfallback:
--
2.51.0
* [PATCH v6 mptcp-next 08/11] mptcp: make mptcp_destroy_common() static
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
` (6 preceding siblings ...)
2025-10-22 14:31 ` [PATCH v6 mptcp-next 07/11] mptcp: do not miss early first subflow close event notification Paolo Abeni
@ 2025-10-22 14:31 ` Paolo Abeni
2025-10-22 14:31 ` [PATCH v6 mptcp-next 09/11] mptcp: drop the __mptcp_data_ready() helper Paolo Abeni
` (5 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paolo Abeni @ 2025-10-22 14:31 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
This function is only used inside protocol.c, there is no need
to expose it to the whole stack.
Note that the function definition must be moved earlier to avoid a
forward declaration.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/protocol.c | 42 +++++++++++++++++++++---------------------
net/mptcp/protocol.h | 2 --
2 files changed, 21 insertions(+), 23 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 8c835ba5a437a2..3653b10d24887e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3195,6 +3195,27 @@ static void mptcp_copy_inaddrs(struct sock *msk, const struct sock *ssk)
inet_sk(msk)->inet_rcv_saddr = inet_sk(ssk)->inet_rcv_saddr;
}
+static void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
+{
+ struct mptcp_subflow_context *subflow, *tmp;
+ struct sock *sk = (struct sock *)msk;
+
+ __mptcp_clear_xmit(sk);
+
+ /* join list will be eventually flushed (with rst) at sock lock release time */
+ mptcp_for_each_subflow_safe(msk, subflow, tmp)
+ __mptcp_close_ssk(sk, mptcp_subflow_tcp_sock(subflow), subflow, flags);
+
+ __skb_queue_purge(&sk->sk_receive_queue);
+ skb_rbtree_purge(&msk->out_of_order_queue);
+
+ /* move all the rx fwd alloc into the sk_mem_reclaim_final in
+ * inet_sock_destruct() will dispose it
+ */
+ mptcp_token_destroy(msk);
+ mptcp_pm_destroy(msk);
+}
+
static int mptcp_disconnect(struct sock *sk, int flags)
{
struct mptcp_sock *msk = mptcp_sk(sk);
@@ -3399,27 +3420,6 @@ void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk)
msk->rcvq_space.space = TCP_INIT_CWND * TCP_MSS_DEFAULT;
}
-void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
-{
- struct mptcp_subflow_context *subflow, *tmp;
- struct sock *sk = (struct sock *)msk;
-
- __mptcp_clear_xmit(sk);
-
- /* join list will be eventually flushed (with rst) at sock lock release time */
- mptcp_for_each_subflow_safe(msk, subflow, tmp)
- __mptcp_close_ssk(sk, mptcp_subflow_tcp_sock(subflow), subflow, flags);
-
- __skb_queue_purge(&sk->sk_receive_queue);
- skb_rbtree_purge(&msk->out_of_order_queue);
-
- /* move all the rx fwd alloc into the sk_mem_reclaim_final in
- * inet_sock_destruct() will dispose it
- */
- mptcp_token_destroy(msk);
- mptcp_pm_destroy(msk);
-}
-
static void mptcp_destroy(struct sock *sk)
{
struct mptcp_sock *msk = mptcp_sk(sk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index a3bbce8950a5e0..dc61579282b2fc 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -979,8 +979,6 @@ static inline void mptcp_propagate_sndbuf(struct sock *sk, struct sock *ssk)
local_bh_enable();
}
-void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags);
-
#define MPTCP_TOKEN_MAX_RETRIES 4
void __init mptcp_token_init(void);
--
2.51.0
* [PATCH v6 mptcp-next 09/11] mptcp: drop the __mptcp_data_ready() helper
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
` (7 preceding siblings ...)
2025-10-22 14:31 ` [PATCH v6 mptcp-next 08/11] mptcp: make mptcp_destroy_common() static Paolo Abeni
@ 2025-10-22 14:31 ` Paolo Abeni
2025-10-23 6:38 ` Geliang Tang
2025-10-22 14:31 ` [PATCH v6 mptcp-next 10/11] mptcp: introduce mptcp-level backlog Paolo Abeni
` (4 subsequent siblings)
13 siblings, 1 reply; 23+ messages in thread
From: Paolo Abeni @ 2025-10-22 14:31 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
It adds little clarity and there is a single user of this helper;
just inline it in the caller.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
- v4 -> v5:
split out of main backlog patch, to make the latter smaller
---
net/mptcp/protocol.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 3653b10d24887e..c68a4b410e7e5b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -838,18 +838,10 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
return moved;
}
-static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
-{
- struct mptcp_sock *msk = mptcp_sk(sk);
-
- /* Wake-up the reader only for in-sequence data */
- if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
- sk->sk_data_ready(sk);
-}
-
void mptcp_data_ready(struct sock *sk, struct sock *ssk)
{
struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+ struct mptcp_sock *msk = mptcp_sk(sk);
/* The peer can send data while we are shutting down this
* subflow at msk destruction time, but we must avoid enqueuing
@@ -859,10 +851,14 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
return;
mptcp_data_lock(sk);
- if (!sock_owned_by_user(sk))
- __mptcp_data_ready(sk, ssk);
- else
+ if (!sock_owned_by_user(sk)) {
+ /* Wake-up the reader only for in-sequence data */
+ if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
+ sk->sk_data_ready(sk);
+
+ } else {
__set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->cb_flags);
+ }
mptcp_data_unlock(sk);
}
--
2.51.0
* Re: [PATCH v6 mptcp-next 09/11] mptcp: drop the __mptcp_data_ready() helper
2025-10-22 14:31 ` [PATCH v6 mptcp-next 09/11] mptcp: drop the __mptcp_data_ready() helper Paolo Abeni
@ 2025-10-23 6:38 ` Geliang Tang
0 siblings, 0 replies; 23+ messages in thread
From: Geliang Tang @ 2025-10-23 6:38 UTC (permalink / raw)
To: Paolo Abeni, mptcp; +Cc: Mat Martineau
On Wed, 2025-10-22 at 16:31 +0200, Paolo Abeni wrote:
> It adds little clarity and there is a single user of this helper;
> just inline it in the caller.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> - v4 -> v5:
> split out of main backlog patch, to make the latter smaller
> ---
> net/mptcp/protocol.c | 20 ++++++++------------
> 1 file changed, 8 insertions(+), 12 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 3653b10d24887e..c68a4b410e7e5b 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -838,18 +838,10 @@ static bool move_skbs_to_msk(struct mptcp_sock
> *msk, struct sock *ssk)
> return moved;
> }
>
> -static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
> -{
> - struct mptcp_sock *msk = mptcp_sk(sk);
> -
> - /* Wake-up the reader only for in-sequence data */
> - if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
> - sk->sk_data_ready(sk);
> -}
> -
> void mptcp_data_ready(struct sock *sk, struct sock *ssk)
> {
> struct mptcp_subflow_context *subflow =
> mptcp_subflow_ctx(ssk);
> + struct mptcp_sock *msk = mptcp_sk(sk);
>
> /* The peer can send data while we are shutting down this
> * subflow at msk destruction time, but we must avoid
> enqueuing
> @@ -859,10 +851,14 @@ void mptcp_data_ready(struct sock *sk, struct
> sock *ssk)
> return;
>
> mptcp_data_lock(sk);
> - if (!sock_owned_by_user(sk))
> - __mptcp_data_ready(sk, ssk);
> - else
> + if (!sock_owned_by_user(sk)) {
> + /* Wake-up the reader only for in-sequence data */
> + if (move_skbs_to_msk(msk, ssk) &&
> mptcp_epollin_ready(sk))
> + sk->sk_data_ready(sk);
> +
nit:
There's an extra blank line here.
Thanks,
-Geliang
> + } else {
> __set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->cb_flags);
> + }
> mptcp_data_unlock(sk);
> }
>
* [PATCH v6 mptcp-next 10/11] mptcp: introduce mptcp-level backlog
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
` (8 preceding siblings ...)
2025-10-22 14:31 ` [PATCH v6 mptcp-next 09/11] mptcp: drop the __mptcp_data_ready() helper Paolo Abeni
@ 2025-10-22 14:31 ` Paolo Abeni
2025-10-22 14:31 ` [PATCH v6 mptcp-next 11/11] mptcp: leverage the backlog for RX packet processing Paolo Abeni
` (3 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paolo Abeni @ 2025-10-22 14:31 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
We will soon use it for incoming data processing.
MPTCP can't leverage the sk_backlog, as the latter is processed
before the release callback, and such a callback for MPTCP releases
and re-acquires the socket spinlock, breaking the sk_backlog processing
assumption.
Add a skb backlog list inside the mptcp sock struct, and implement
basic helpers to transfer packets to and purge such a list.
Packets in the backlog are not memory accounted, but still use the
incoming subflow receive memory, to allow back-pressure.
No packet is currently added to the backlog, so no functional changes
intended here.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
--
v5 -> v6:
- call mptcp_bl_free() instead of inlining it.
- report the bl mem in diag mem info
- moved here the mptcp_close_ssk chunk from the next patch.
(logically belongs here)
v4 -> v5:
- split out of the next path, to make the latter smaller
- set a custom destructor for skbs in the backlog; this avoids
duplicate code and fixes a few places where the needed ssk cleanup
was not performed.
- factor out the backlog purge in a new helper,
use spinlock protection, clear the backlog list and zero the
backlog len
- explicitly init the backlog_len at mptcp_init_sock() time
---
net/mptcp/mptcp_diag.c | 3 +-
net/mptcp/protocol.c | 85 ++++++++++++++++++++++++++++++++++++++++--
net/mptcp/protocol.h | 4 ++
3 files changed, 87 insertions(+), 5 deletions(-)
diff --git a/net/mptcp/mptcp_diag.c b/net/mptcp/mptcp_diag.c
index ac974299de71cd..136c2d05c0eeb8 100644
--- a/net/mptcp/mptcp_diag.c
+++ b/net/mptcp/mptcp_diag.c
@@ -195,7 +195,8 @@ static void mptcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
struct mptcp_sock *msk = mptcp_sk(sk);
struct mptcp_info *info = _info;
- r->idiag_rqueue = sk_rmem_alloc_get(sk);
+ r->idiag_rqueue = sk_rmem_alloc_get(sk) +
+ READ_ONCE(mptcp_sk(sk)->backlog_len);
r->idiag_wqueue = sk_wmem_alloc_get(sk);
if (inet_sk_state_load(sk) == TCP_LISTEN) {
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index c68a4b410e7e5b..5a1d8f9e0fb0ec 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -337,6 +337,11 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
mptcp_rcvbuf_grow(sk);
}
+static void mptcp_bl_free(struct sk_buff *skb)
+{
+ atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
+}
+
static int mptcp_init_skb(struct sock *ssk,
struct sk_buff *skb, int offset, int copy_len)
{
@@ -360,7 +365,7 @@ static int mptcp_init_skb(struct sock *ssk,
skb_dst_drop(skb);
/* "borrow" the fwd memory from the subflow, instead of reclaiming it */
- skb->destructor = NULL;
+ skb->destructor = mptcp_bl_free;
borrowed = ssk->sk_forward_alloc - sk_unused_reserved_mem(ssk);
borrowed &= ~(PAGE_SIZE - 1);
sk_forward_alloc_add(ssk, skb->truesize - borrowed);
@@ -373,6 +378,13 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
struct mptcp_sock *msk = mptcp_sk(sk);
struct sk_buff *tail;
+ /* Avoid the indirect call overhead, we know destructor is
+ * mptcp_bl_free at this point.
+ */
+ mptcp_bl_free(skb);
+ skb->sk = NULL;
+ skb->destructor = NULL;
+
/* try to fetch required memory from subflow */
if (!sk_rmem_schedule(sk, skb, skb->truesize)) {
MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
@@ -654,6 +666,35 @@ static void mptcp_dss_corruption(struct mptcp_sock *msk, struct sock *ssk)
}
}
+static void __mptcp_add_backlog(struct sock *sk, struct sk_buff *skb)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct sk_buff *tail = NULL;
+ bool fragstolen;
+ int delta;
+
+ if (unlikely(sk->sk_state == TCP_CLOSE)) {
+ kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
+ return;
+ }
+
+ /* Try to coalesce with the last skb in our backlog */
+ if (!list_empty(&msk->backlog_list))
+ tail = list_last_entry(&msk->backlog_list, struct sk_buff, list);
+
+ if (tail && MPTCP_SKB_CB(skb)->map_seq == MPTCP_SKB_CB(tail)->end_seq &&
+ skb->sk == tail->sk &&
+ __mptcp_try_coalesce(sk, tail, skb, &fragstolen, &delta)) {
+ skb->truesize -= delta;
+ kfree_skb_partial(skb, fragstolen);
+ WRITE_ONCE(msk->backlog_len, msk->backlog_len + delta);
+ return;
+ }
+
+ list_add_tail(&skb->list, &msk->backlog_list);
+ WRITE_ONCE(msk->backlog_len, msk->backlog_len + skb->truesize);
+}
+
static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
struct sock *ssk)
{
@@ -701,10 +742,12 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
int bmem;
bmem = mptcp_init_skb(ssk, skb, offset, len);
- skb->sk = NULL;
sk_forward_alloc_add(sk, bmem);
- atomic_sub(skb->truesize, &ssk->sk_rmem_alloc);
- ret = __mptcp_move_skb(sk, skb) || ret;
+
+ if (true)
+ ret |= __mptcp_move_skb(sk, skb);
+ else
+ __mptcp_add_backlog(sk, skb);
seq += len;
if (unlikely(map_remaining < len)) {
@@ -2516,6 +2559,9 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
struct mptcp_subflow_context *subflow)
{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct sk_buff *skb;
+
/* The first subflow can already be closed and still in the list */
if (subflow->close_event_done)
return;
@@ -2525,6 +2571,18 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
if (sk->sk_state == TCP_ESTABLISHED)
mptcp_event(MPTCP_EVENT_SUB_CLOSED, mptcp_sk(sk), ssk, GFP_KERNEL);
+ /* Remove any reference from the backlog to this ssk, accounting the
+ * related skb directly to the main socket
+ */
+ list_for_each_entry(skb, &msk->backlog_list, list) {
+ if (skb->sk != ssk)
+ continue;
+
+ atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
+ atomic_add(skb->truesize, &sk->sk_rmem_alloc);
+ skb->sk = sk;
+ }
+
/* subflow aborted before reaching the fully_established status
* attempt the creation of the next subflow
*/
@@ -2753,12 +2811,28 @@ static void mptcp_mp_fail_no_response(struct mptcp_sock *msk)
unlock_sock_fast(ssk, slow);
}
+static void mptcp_backlog_purge(struct sock *sk)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct sk_buff *tmp, *skb;
+ LIST_HEAD(backlog);
+
+ mptcp_data_lock(sk);
+ list_splice_init(&msk->backlog_list, &backlog);
+ msk->backlog_len = 0;
+ mptcp_data_unlock(sk);
+
+ list_for_each_entry_safe(skb, tmp, &backlog, list)
+ kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
+}
+
static void mptcp_do_fastclose(struct sock *sk)
{
struct mptcp_subflow_context *subflow, *tmp;
struct mptcp_sock *msk = mptcp_sk(sk);
mptcp_set_state(sk, TCP_CLOSE);
+ mptcp_backlog_purge(sk);
mptcp_for_each_subflow_safe(msk, subflow, tmp)
__mptcp_close_ssk(sk, mptcp_subflow_tcp_sock(subflow),
subflow, MPTCP_CF_FASTCLOSE);
@@ -2816,11 +2890,13 @@ static void __mptcp_init_sock(struct sock *sk)
INIT_LIST_HEAD(&msk->conn_list);
INIT_LIST_HEAD(&msk->join_list);
INIT_LIST_HEAD(&msk->rtx_queue);
+ INIT_LIST_HEAD(&msk->backlog_list);
INIT_WORK(&msk->work, mptcp_worker);
msk->out_of_order_queue = RB_ROOT;
msk->first_pending = NULL;
msk->timer_ival = TCP_RTO_MIN;
msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+ msk->backlog_len = 0;
WRITE_ONCE(msk->first, NULL);
inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
@@ -3197,6 +3273,7 @@ static void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
struct sock *sk = (struct sock *)msk;
__mptcp_clear_xmit(sk);
+ mptcp_backlog_purge(sk);
/* join list will be eventually flushed (with rst) at sock lock release time */
mptcp_for_each_subflow_safe(msk, subflow, tmp)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index dc61579282b2fc..d814e8151458d5 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -358,6 +358,9 @@ struct mptcp_sock {
* allow_infinite_fallback and
* allow_join
*/
+
+ struct list_head backlog_list; /* protected by the data lock */
+ u32 backlog_len;
};
#define mptcp_data_lock(sk) spin_lock_bh(&(sk)->sk_lock.slock)
@@ -408,6 +411,7 @@ static inline int mptcp_space_from_win(const struct sock *sk, int win)
static inline int __mptcp_space(const struct sock *sk)
{
return mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) -
+ READ_ONCE(mptcp_sk(sk)->backlog_len) -
sk_rmem_alloc_get(sk));
}
--
2.51.0
* [PATCH v6 mptcp-next 11/11] mptcp: leverage the backlog for RX packet processing
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
` (9 preceding siblings ...)
2025-10-22 14:31 ` [PATCH v6 mptcp-next 10/11] mptcp: introduce mptcp-level backlog Paolo Abeni
@ 2025-10-22 14:31 ` Paolo Abeni
2025-10-23 15:11 ` Paolo Abeni
2025-10-22 15:50 ` [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing MPTCP CI
` (2 subsequent siblings)
13 siblings, 1 reply; 23+ messages in thread
From: Paolo Abeni @ 2025-10-22 14:31 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
When the msk socket is owned or the msk receive buffer is full,
move the incoming skbs into an msk-level backlog list. This avoids
traversing the joined subflows and acquiring the subflow-level
socket lock at reception time, improving the RX performance.
When processing the backlog, use the fwd alloc memory borrowed from
the incoming subflow. skbs exceeding the msk receive space are
not dropped; instead they are kept in the backlog until the receive
buffer is freed. Dropping packets already acked at the TCP level is
explicitly discouraged by the RFC and would corrupt the data stream
for fallback sockets.
Move the conditional reschedule in release_cb() to take action only
after the first loop iteration, to avoid rescheduling just before
releasing the lock.
Special care is needed to avoid adding skbs to the backlog of a closed
msk and to avoid leaving dangling references into the backlog
at subflow closing time.
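The overall shape of the spooling loop, as a standalone sketch with
hypothetical types and a pthread mutex in place of the data lock:
detach the whole backlog under the lock, process it with the lock
dropped, then re-acquire the lock and pick up anything queued meanwhile.

#include <pthread.h>
#include <stddef.h>

struct item {
	struct item *next;
};

struct sock_model {
	pthread_mutex_t data_lock;
	struct item *backlog;	/* protected by data_lock */
};

static void spool_backlog(struct sock_model *s)
{
	pthread_mutex_lock(&s->data_lock);
	while (s->backlog) {
		/* detach the whole pending list under the lock */
		struct item *todo = s->backlog;

		s->backlog = NULL;
		pthread_mutex_unlock(&s->data_lock);

		/* process with the lock dropped: producers can keep
		 * appending to s->backlog in the meantime
		 */
		while (todo)
			todo = todo->next;	/* ... move to rx queue ... */

		pthread_mutex_lock(&s->data_lock);
		/* loop again: re-check for skbs queued while unlocked */
	}
	pthread_mutex_unlock(&s->data_lock);
}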
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v5 -> v6:
- do backlog len update asap to advise the correct window.
- explicitly bound backlog processing loop to the maximum BL len
v4 -> v5:
- consolidate ssk rcvbuf accounting in __mptcp_move_skb(), remove
some code duplication
- return early in __mptcp_add_backlog() when dropping skbs due to
the msk being closed. This avoids a later UaF
---
net/mptcp/protocol.c | 151 +++++++++++++++++++++++++++----------------
net/mptcp/protocol.h | 2 +-
2 files changed, 96 insertions(+), 57 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 5a1d8f9e0fb0ec..0aae17ab77edb2 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -696,7 +696,7 @@ static void __mptcp_add_backlog(struct sock *sk, struct sk_buff *skb)
}
static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
- struct sock *ssk)
+ struct sock *ssk, bool own_msk)
{
struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
struct sock *sk = (struct sock *)msk;
@@ -712,9 +712,6 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
struct sk_buff *skb;
bool fin;
- if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
- break;
-
/* try to move as much data as available */
map_remaining = subflow->map_data_len -
mptcp_subflow_get_map_offset(subflow);
@@ -742,9 +739,12 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
int bmem;
bmem = mptcp_init_skb(ssk, skb, offset, len);
- sk_forward_alloc_add(sk, bmem);
+ if (own_msk)
+ sk_forward_alloc_add(sk, bmem);
+ else
+ msk->borrowed_mem += bmem;
- if (true)
+ if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf)
ret |= __mptcp_move_skb(sk, skb);
else
__mptcp_add_backlog(sk, skb);
@@ -866,7 +866,7 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
struct sock *sk = (struct sock *)msk;
bool moved;
- moved = __mptcp_move_skbs_from_subflow(msk, ssk);
+ moved = __mptcp_move_skbs_from_subflow(msk, ssk, true);
__mptcp_ofo_queue(msk);
if (unlikely(ssk->sk_err))
__mptcp_subflow_error_report(sk, ssk);
@@ -898,9 +898,8 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
/* Wake-up the reader only for in-sequence data */
if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
sk->sk_data_ready(sk);
-
} else {
- __set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->cb_flags);
+ __mptcp_move_skbs_from_subflow(msk, ssk, false);
}
mptcp_data_unlock(sk);
}
@@ -2135,60 +2134,92 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
msk->rcvq_space.time = mstamp;
}
-static struct mptcp_subflow_context *
-__mptcp_first_ready_from(struct mptcp_sock *msk,
- struct mptcp_subflow_context *subflow)
+static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delta)
{
- struct mptcp_subflow_context *start_subflow = subflow;
+ struct sk_buff *skb = list_first_entry(skbs, struct sk_buff, list);
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ bool moved = false;
+
+ *delta = 0;
+ while (1) {
+ /* If the msk recvbuf is full stop, don't drop */
+ if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+ break;
+
+ prefetch(skb->next);
+ list_del(&skb->list);
+ *delta += skb->truesize;
+
+ moved |= __mptcp_move_skb(sk, skb);
+ if (list_empty(skbs))
+ break;
- while (!READ_ONCE(subflow->data_avail)) {
- subflow = mptcp_next_subflow(msk, subflow);
- if (subflow == start_subflow)
- return NULL;
+ skb = list_first_entry(skbs, struct sk_buff, list);
}
- return subflow;
+
+ __mptcp_ofo_queue(msk);
+ if (moved)
+ mptcp_check_data_fin((struct sock *)msk);
+ return moved;
}
-static bool __mptcp_move_skbs(struct sock *sk)
+static bool mptcp_can_spool_backlog(struct sock *sk, u32 moved,
+ struct list_head *skbs)
{
- struct mptcp_subflow_context *subflow;
struct mptcp_sock *msk = mptcp_sk(sk);
- bool ret = false;
- if (list_empty(&msk->conn_list))
+ if (list_empty(&msk->backlog_list))
return false;
- subflow = list_first_entry(&msk->conn_list,
- struct mptcp_subflow_context, node);
- for (;;) {
- struct sock *ssk;
- bool slowpath;
+ /* Borrowed mem could be zero only in the unlikely event that the bl
+ * is full
+ */
+ if (likely(msk->borrowed_mem)) {
+ sk_forward_alloc_add(sk, msk->borrowed_mem);
+ msk->borrowed_mem = 0;
+ sk->sk_reserved_mem = msk->backlog_len;
+ }
- /*
- * As an optimization avoid traversing the subflows list
- * and ev. acquiring the subflow socket lock before baling out
- */
- if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
- break;
+ /* Limit the backlog loop to the maximum backlog size; moved skbs are
+ * accounted on both the backlog and the receive buffer; the caller
+ * should update the backlog usage ASAP, to avoid underestimate the
+ * rcvwnd.
+ */
+ if (moved > sk->sk_rcvbuf || sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+ return false;
- subflow = __mptcp_first_ready_from(msk, subflow);
- if (!subflow)
- break;
+ INIT_LIST_HEAD(skbs);
+ list_splice_init(&msk->backlog_list, skbs);
+ return true;
+}
- ssk = mptcp_subflow_tcp_sock(subflow);
- slowpath = lock_sock_fast(ssk);
- ret = __mptcp_move_skbs_from_subflow(msk, ssk) || ret;
- if (unlikely(ssk->sk_err))
- __mptcp_error_report(sk);
- unlock_sock_fast(ssk, slowpath);
+static void mptcp_backlog_spooled(struct sock *sk, u32 moved,
+ struct list_head *skbs)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
- subflow = mptcp_next_subflow(msk, subflow);
- }
+ WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
+ list_splice(skbs, &msk->backlog_list);
+ sk->sk_reserved_mem = msk->backlog_len;
+}
- __mptcp_ofo_queue(msk);
- if (ret)
- mptcp_check_data_fin((struct sock *)msk);
- return ret;
+static bool mptcp_move_skbs(struct sock *sk)
+{
+ u32 moved, total_moved = 0;
+ struct list_head skbs;
+ bool enqueued = false;
+
+ mptcp_data_lock(sk);
+ while (mptcp_can_spool_backlog(sk, total_moved, &skbs)) {
+ mptcp_data_unlock(sk);
+ enqueued |= __mptcp_move_skbs(sk, &skbs, &moved);
+
+ mptcp_data_lock(sk);
+ total_moved += moved;
+ mptcp_backlog_spooled(sk, moved, &skbs);
+ }
+ mptcp_data_unlock(sk);
+ return enqueued;
}
static unsigned int mptcp_inq_hint(const struct sock *sk)
@@ -2254,7 +2285,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
copied += bytes_read;
- if (skb_queue_empty(&sk->sk_receive_queue) && __mptcp_move_skbs(sk))
+ if (!list_empty(&msk->backlog_list) && mptcp_move_skbs(sk))
continue;
/* only the MPTCP socket status is relevant here. The exit
@@ -3521,20 +3552,22 @@ void __mptcp_check_push(struct sock *sk, struct sock *ssk)
#define MPTCP_FLAGS_PROCESS_CTX_NEED (BIT(MPTCP_PUSH_PENDING) | \
BIT(MPTCP_RETRANSMIT) | \
- BIT(MPTCP_FLUSH_JOIN_LIST) | \
- BIT(MPTCP_DEQUEUE))
+ BIT(MPTCP_FLUSH_JOIN_LIST))
/* processes deferred events and flush wmem */
static void mptcp_release_cb(struct sock *sk)
__must_hold(&sk->sk_lock.slock)
{
struct mptcp_sock *msk = mptcp_sk(sk);
+ u32 moved, total_moved = 0;
for (;;) {
unsigned long flags = (msk->cb_flags & MPTCP_FLAGS_PROCESS_CTX_NEED);
- struct list_head join_list;
+ struct list_head join_list, skbs;
+ bool spool_bl;
- if (!flags)
+ spool_bl = mptcp_can_spool_backlog(sk, total_moved, &skbs);
+ if (!flags && !spool_bl)
break;
INIT_LIST_HEAD(&join_list);
@@ -3550,20 +3583,26 @@ static void mptcp_release_cb(struct sock *sk)
msk->cb_flags &= ~flags;
spin_unlock_bh(&sk->sk_lock.slock);
+ if (total_moved)
+ cond_resched();
+
if (flags & BIT(MPTCP_FLUSH_JOIN_LIST))
__mptcp_flush_join_list(sk, &join_list);
if (flags & BIT(MPTCP_PUSH_PENDING))
__mptcp_push_pending(sk, 0);
if (flags & BIT(MPTCP_RETRANSMIT))
__mptcp_retrans(sk);
- if ((flags & BIT(MPTCP_DEQUEUE)) && __mptcp_move_skbs(sk)) {
+ if (spool_bl && __mptcp_move_skbs(sk, &skbs, &moved)) {
/* notify ack seq update */
mptcp_cleanup_rbuf(msk, 0);
sk->sk_data_ready(sk);
}
- cond_resched();
spin_lock_bh(&sk->sk_lock.slock);
+ if (spool_bl) {
+ total_moved += moved;
+ mptcp_backlog_spooled(sk, moved, &skbs);
+ }
}
if (__test_and_clear_bit(MPTCP_CLEAN_UNA, &msk->cb_flags))
@@ -3796,7 +3835,7 @@ static int mptcp_ioctl(struct sock *sk, int cmd, int *karg)
return -EINVAL;
lock_sock(sk);
- if (__mptcp_move_skbs(sk))
+ if (mptcp_move_skbs(sk))
mptcp_cleanup_rbuf(msk, 0);
*karg = mptcp_inq_hint(sk);
release_sock(sk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index d814e8151458d5..9e2a44546354a0 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -124,7 +124,6 @@
#define MPTCP_FLUSH_JOIN_LIST 5
#define MPTCP_SYNC_STATE 6
#define MPTCP_SYNC_SNDBUF 7
-#define MPTCP_DEQUEUE 8
struct mptcp_skb_cb {
u64 map_seq;
@@ -301,6 +300,7 @@ struct mptcp_sock {
u32 last_ack_recv;
unsigned long timer_ival;
u32 token;
+ u32 borrowed_mem;
unsigned long flags;
unsigned long cb_flags;
bool recovery; /* closing subflow write queue reinjected */
--
2.51.0
* Re: [PATCH v6 mptcp-next 11/11] mptcp: leverage the backlog for RX packet processing
2025-10-22 14:31 ` [PATCH v6 mptcp-next 11/11] mptcp: leverage the backlog for RX packet processing Paolo Abeni
@ 2025-10-23 15:11 ` Paolo Abeni
2025-10-23 15:52 ` Matthieu Baerts
0 siblings, 1 reply; 23+ messages in thread
From: Paolo Abeni @ 2025-10-23 15:11 UTC (permalink / raw)
To: mptcp; +Cc: Mat Martineau, Geliang Tang
On 10/22/25 4:31 PM, Paolo Abeni wrote:
> When the msk socket is owned or the msk receive buffer is full,
> move the incoming skbs into an msk-level backlog list. This avoids
> traversing the joined subflows and acquiring the subflow-level
> socket lock at reception time, improving the RX performance.
>
> When processing the backlog, use the fwd alloc memory borrowed from
> the incoming subflow. skbs exceeding the msk receive space are
> not dropped; instead they are kept in the backlog until the receive
> buffer is freed. Dropping packets already acked at the TCP level is
> explicitly discouraged by the RFC and would corrupt the data stream
> for fallback sockets.
>
> Move the conditional reschedule in release_cb() to take action only
> after the first loop iteration, to avoid rescheduling just before
> releasing the lock.
>
> Special care is needed to avoid adding skbs to the backlog of a closed
> msk and to avoid leaving dangling references into the backlog
> at subflow closing time.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v5 -> v6:
> - do the backlog len update ASAP to advertise the correct window.
> - explicitly bound backlog processing loop to the maximum BL len
>
> v4 -> v5:
> - consolidate ssk rcvbuf accounting in __mptcp_move_skb(), remove
> some code duplication
> - return early in __mptcp_add_backlog() when dropping skbs due to
> the msk being closed. This avoids a later UaF
> ---
> net/mptcp/protocol.c | 151 +++++++++++++++++++++++++++----------------
> net/mptcp/protocol.h | 2 +-
> 2 files changed, 96 insertions(+), 57 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 5a1d8f9e0fb0ec..0aae17ab77edb2 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -696,7 +696,7 @@ static void __mptcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> }
>
> static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
> - struct sock *ssk)
> + struct sock *ssk, bool own_msk)
> {
> struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
> struct sock *sk = (struct sock *)msk;
> @@ -712,9 +712,6 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
> struct sk_buff *skb;
> bool fin;
>
> - if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
> - break;
> -
> /* try to move as much data as available */
> map_remaining = subflow->map_data_len -
> mptcp_subflow_get_map_offset(subflow);
> @@ -742,9 +739,12 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
> int bmem;
>
> bmem = mptcp_init_skb(ssk, skb, offset, len);
> - sk_forward_alloc_add(sk, bmem);
> + if (own_msk)
> + sk_forward_alloc_add(sk, bmem);
> + else
> + msk->borrowed_mem += bmem;
>
> - if (true)
> + if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf)
> ret |= __mptcp_move_skb(sk, skb);
> else
> __mptcp_add_backlog(sk, skb);
> @@ -866,7 +866,7 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
> struct sock *sk = (struct sock *)msk;
> bool moved;
>
> - moved = __mptcp_move_skbs_from_subflow(msk, ssk);
> + moved = __mptcp_move_skbs_from_subflow(msk, ssk, true);
> __mptcp_ofo_queue(msk);
> if (unlikely(ssk->sk_err))
> __mptcp_subflow_error_report(sk, ssk);
> @@ -898,9 +898,8 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
> /* Wake-up the reader only for in-sequence data */
> if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
> sk->sk_data_ready(sk);
> -
> } else {
> - __set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->cb_flags);
> + __mptcp_move_skbs_from_subflow(msk, ssk, false);
> }
> mptcp_data_unlock(sk);
> }
> @@ -2135,60 +2134,92 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
> msk->rcvq_space.time = mstamp;
> }
>
> -static struct mptcp_subflow_context *
> -__mptcp_first_ready_from(struct mptcp_sock *msk,
> - struct mptcp_subflow_context *subflow)
> +static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delta)
> {
> - struct mptcp_subflow_context *start_subflow = subflow;
> + struct sk_buff *skb = list_first_entry(skbs, struct sk_buff, list);
> + struct mptcp_sock *msk = mptcp_sk(sk);
> + bool moved = false;
> +
> + *delta = 0;
> + while (1) {
> + /* If the msk recvbuf is full, stop; don't drop */
> + if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
> + break;
> +
> + prefetch(skb->next);
> + list_del(&skb->list);
> + *delta += skb->truesize;
> +
> + moved |= __mptcp_move_skb(sk, skb);
> + if (list_empty(skbs))
> + break;
>
> - while (!READ_ONCE(subflow->data_avail)) {
> - subflow = mptcp_next_subflow(msk, subflow);
> - if (subflow == start_subflow)
> - return NULL;
> + skb = list_first_entry(skbs, struct sk_buff, list);
> }
> - return subflow;
> +
> + __mptcp_ofo_queue(msk);
> + if (moved)
> + mptcp_check_data_fin((struct sock *)msk);
> + return moved;
> }
>
> -static bool __mptcp_move_skbs(struct sock *sk)
> +static bool mptcp_can_spool_backlog(struct sock *sk, u32 moved,
> + struct list_head *skbs)
> {
> - struct mptcp_subflow_context *subflow;
> struct mptcp_sock *msk = mptcp_sk(sk);
> - bool ret = false;
>
> - if (list_empty(&msk->conn_list))
> + if (list_empty(&msk->backlog_list))
> return false;
>
> - subflow = list_first_entry(&msk->conn_list,
> - struct mptcp_subflow_context, node);
> - for (;;) {
> - struct sock *ssk;
> - bool slowpath;
> + /* Borrowed mem could be zero only in the unlikely event that the bl
> + * is full
> + */
> + if (likely(msk->borrowed_mem)) {
> + sk_forward_alloc_add(sk, msk->borrowed_mem);
> + msk->borrowed_mem = 0;
> + sk->sk_reserved_mem = msk->backlog_len;
With the above I intended to prevent the fwd memory handling from
releasing backlog_len bytes. Re-reading the relevant code, it does not
actually guarantee that (experimentation confirmed); see:
https://elixir.bootlin.com/linux/v6.18-rc2/source/include/net/sock.h#L1593
and:
https://elixir.bootlin.com/linux/v6.18-rc2/source/include/net/sock.h#L1580
This will need some more care. Also patch 2 will require some
significant rework.
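For reference, the two helpers in question look roughly like this
(paraphrased from the linked sock.h; lightly trimmed, with comments
added, so treat it as a sketch rather than a verbatim copy):

static inline int sk_unused_reserved_mem(const struct sock *sk)
{
	int unused_mem;

	if (likely(!sk->sk_reserved_mem))
		return 0;

	/* the reserve is netted against memory already charged to
	 * the regular queues
	 */
	unused_mem = sk->sk_reserved_mem - sk->sk_wmem_queued -
		     atomic_read(&sk->sk_rmem_alloc);

	return unused_mem > 0 ? unused_mem : 0;
}

static inline void sk_mem_reclaim(struct sock *sk)
{
	int reclaimable;

	if (!sk_has_account(sk))
		return;

	/* only the currently unused part of the reserve is shielded
	 * from reclaim
	 */
	reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);

	if (reclaimable >= (int)PAGE_SIZE)
		__sk_mem_reclaim(sk, reclaimable);
}

Since the backlog skbs are charged neither to sk_rmem_alloc nor to
sk_wmem_queued, data sitting in the regular queues shrinks the "unused"
reserve, and reclaim can still release the fwd memory backing the
backlog.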
@Mat, @Matttbe: could you please consider merging patches 1,3-9?
I think they should be pretty uncontroversial; merging them would make
the series more manageable for future iterations (and would ease my
frustration in getting this thing to work correctly).
Thanks!
Paolo
> + }
>
> - /*
> - * As an optimization avoid traversing the subflows list
> - * and ev. acquiring the subflow socket lock before baling out
> - */
> - if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
> - break;
> + /* Limit the backlog loop to the maximum backlog size; moved skbs are
> + * accounted on both the backlog and the receive buffer; the caller
> + * should update the backlog usage ASAP, to avoid underestimating the
> + * rcvwnd.
> + */
> + if (moved > sk->sk_rcvbuf || sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
> + return false;
>
> - subflow = __mptcp_first_ready_from(msk, subflow);
> - if (!subflow)
> - break;
> + INIT_LIST_HEAD(skbs);
> + list_splice_init(&msk->backlog_list, skbs);
> + return true;
> +}
>
> - ssk = mptcp_subflow_tcp_sock(subflow);
> - slowpath = lock_sock_fast(ssk);
> - ret = __mptcp_move_skbs_from_subflow(msk, ssk) || ret;
> - if (unlikely(ssk->sk_err))
> - __mptcp_error_report(sk);
> - unlock_sock_fast(ssk, slowpath);
> +static void mptcp_backlog_spooled(struct sock *sk, u32 moved,
> + struct list_head *skbs)
> +{
> + struct mptcp_sock *msk = mptcp_sk(sk);
>
> - subflow = mptcp_next_subflow(msk, subflow);
> - }
> + WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
> + list_splice(skbs, &msk->backlog_list);
> + sk->sk_reserved_mem = msk->backlog_len;
> +}
>
> - __mptcp_ofo_queue(msk);
> - if (ret)
> - mptcp_check_data_fin((struct sock *)msk);
> - return ret;
> +static bool mptcp_move_skbs(struct sock *sk)
> +{
> + u32 moved, total_moved = 0;
> + struct list_head skbs;
> + bool enqueued = false;
> +
> + mptcp_data_lock(sk);
> + while (mptcp_can_spool_backlog(sk, total_moved, &skbs)) {
> + mptcp_data_unlock(sk);
> + enqueued |= __mptcp_move_skbs(sk, &skbs, &moved);
> +
> + mptcp_data_lock(sk);
> + total_moved += moved;
> + mptcp_backlog_spooled(sk, moved, &skbs);
> + }
> + mptcp_data_unlock(sk);
> + return enqueued;
> }
>
> static unsigned int mptcp_inq_hint(const struct sock *sk)
> @@ -2254,7 +2285,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>
> copied += bytes_read;
>
> - if (skb_queue_empty(&sk->sk_receive_queue) && __mptcp_move_skbs(sk))
> + if (!list_empty(&msk->backlog_list) && mptcp_move_skbs(sk))
> continue;
>
> /* only the MPTCP socket status is relevant here. The exit
> @@ -3521,20 +3552,22 @@ void __mptcp_check_push(struct sock *sk, struct sock *ssk)
>
> #define MPTCP_FLAGS_PROCESS_CTX_NEED (BIT(MPTCP_PUSH_PENDING) | \
> BIT(MPTCP_RETRANSMIT) | \
> - BIT(MPTCP_FLUSH_JOIN_LIST) | \
> - BIT(MPTCP_DEQUEUE))
> + BIT(MPTCP_FLUSH_JOIN_LIST))
>
> /* processes deferred events and flush wmem */
> static void mptcp_release_cb(struct sock *sk)
> __must_hold(&sk->sk_lock.slock)
> {
> struct mptcp_sock *msk = mptcp_sk(sk);
> + u32 moved, total_moved = 0;
>
> for (;;) {
> unsigned long flags = (msk->cb_flags & MPTCP_FLAGS_PROCESS_CTX_NEED);
> - struct list_head join_list;
> + struct list_head join_list, skbs;
> + bool spool_bl;
>
> - if (!flags)
> + spool_bl = mptcp_can_spool_backlog(sk, total_moved, &skbs);
> + if (!flags && !spool_bl)
> break;
>
> INIT_LIST_HEAD(&join_list);
> @@ -3550,20 +3583,26 @@ static void mptcp_release_cb(struct sock *sk)
> msk->cb_flags &= ~flags;
> spin_unlock_bh(&sk->sk_lock.slock);
>
> + if (total_moved)
> + cond_resched();
> +
> if (flags & BIT(MPTCP_FLUSH_JOIN_LIST))
> __mptcp_flush_join_list(sk, &join_list);
> if (flags & BIT(MPTCP_PUSH_PENDING))
> __mptcp_push_pending(sk, 0);
> if (flags & BIT(MPTCP_RETRANSMIT))
> __mptcp_retrans(sk);
> - if ((flags & BIT(MPTCP_DEQUEUE)) && __mptcp_move_skbs(sk)) {
> + if (spool_bl && __mptcp_move_skbs(sk, &skbs, &moved)) {
> /* notify ack seq update */
> mptcp_cleanup_rbuf(msk, 0);
> sk->sk_data_ready(sk);
> }
>
> - cond_resched();
> spin_lock_bh(&sk->sk_lock.slock);
> + if (spool_bl) {
> + total_moved += moved;
> + mptcp_backlog_spooled(sk, moved, &skbs);
> + }
> }
>
> if (__test_and_clear_bit(MPTCP_CLEAN_UNA, &msk->cb_flags))
> @@ -3796,7 +3835,7 @@ static int mptcp_ioctl(struct sock *sk, int cmd, int *karg)
> return -EINVAL;
>
> lock_sock(sk);
> - if (__mptcp_move_skbs(sk))
> + if (mptcp_move_skbs(sk))
> mptcp_cleanup_rbuf(msk, 0);
> *karg = mptcp_inq_hint(sk);
> release_sock(sk);
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index d814e8151458d5..9e2a44546354a0 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -124,7 +124,6 @@
> #define MPTCP_FLUSH_JOIN_LIST 5
> #define MPTCP_SYNC_STATE 6
> #define MPTCP_SYNC_SNDBUF 7
> -#define MPTCP_DEQUEUE 8
>
> struct mptcp_skb_cb {
> u64 map_seq;
> @@ -301,6 +300,7 @@ struct mptcp_sock {
> u32 last_ack_recv;
> unsigned long timer_ival;
> u32 token;
> + u32 borrowed_mem;
> unsigned long flags;
> unsigned long cb_flags;
> bool recovery; /* closing subflow write queue reinjected */
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v6 mptcp-next 11/11] mptcp: leverage the backlog for RX packet processing
2025-10-23 15:11 ` Paolo Abeni
@ 2025-10-23 15:52 ` Matthieu Baerts
2025-10-23 17:02 ` Mat Martineau
0 siblings, 1 reply; 23+ messages in thread
From: Matthieu Baerts @ 2025-10-23 15:52 UTC (permalink / raw)
To: Paolo Abeni, Mat Martineau; +Cc: Geliang Tang, mptcp
Hi Paolo, Mat,
On 23/10/2025 17:11, Paolo Abeni wrote:
>
>
> On 10/22/25 4:31 PM, Paolo Abeni wrote:
>> When the msk socket is owned or the msk receive buffer is full,
>> move the incoming skbs into a msk-level backlog list. This avoids
>> traversing the joined subflows and acquiring the subflow-level
>> socket lock at reception time, improving RX performance.
>>
>> When processing the backlog, use the fwd alloc memory borrowed from
>> the incoming subflow. skbs exceeding the msk receive space are
>> not dropped; instead they are kept in the backlog until the receive
>> buffer is freed. Dropping packets already acked at the TCP level is
>> explicitly discouraged by the RFC and would corrupt the data stream
>> for fallback sockets.
>>
>> Move the conditional reschedule in release_cb() to take action only
>> after the first loop iteration, to avoid rescheduling just before
>> releasing the lock.
>>
>> Special care is needed to avoid adding skbs to the backlog of a closed
>> msk and to avoid leaving dangling references in the backlog
>> at subflow closing time.
(...)
>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>> index 5a1d8f9e0fb0ec..0aae17ab77edb2 100644
>> --- a/net/mptcp/protocol.c
>> +++ b/net/mptcp/protocol.c
(...)
>> -static bool __mptcp_move_skbs(struct sock *sk)
>> +static bool mptcp_can_spool_backlog(struct sock *sk, u32 moved,
>> + struct list_head *skbs)
>> {
>> - struct mptcp_subflow_context *subflow;
>> struct mptcp_sock *msk = mptcp_sk(sk);
>> - bool ret = false;
>>
>> - if (list_empty(&msk->conn_list))
>> + if (list_empty(&msk->backlog_list))
>> return false;
>>
>> - subflow = list_first_entry(&msk->conn_list,
>> - struct mptcp_subflow_context, node);
>> - for (;;) {
>> - struct sock *ssk;
>> - bool slowpath;
>> + /* Borrowed mem could be zero only in the unlikely event that the bl
>> + * is full
>> + */
>> + if (likely(msk->borrowed_mem)) {
>> + sk_forward_alloc_add(sk, msk->borrowed_mem);
>> + msk->borrowed_mem = 0;
>> + sk->sk_reserved_mem = msk->backlog_len;
>
> With the above I intended to prevent the fwd memory handling from
> releasing backlog_len bytes. Re-reading the relevant code, it does not
> actually guarantee that (experimentation confirmed); see:
>
> https://elixir.bootlin.com/linux/v6.18-rc2/source/include/net/sock.h#L1593
>
> and:
>
> https://elixir.bootlin.com/linux/v6.18-rc2/source/include/net/sock.h#L1580
>
> This will need some more care. Also patch 2 will require some
> significant rework.
Thank you for looking at this complex part, and for having spotted that!
> @Mat, @Matttbe: could you please consider merging patches 1,3-9?
>
> I think they should be pretty uncontroversial; merging them would make
> the series more manageable for future iterations (and would ease my
> frustration in getting this thing to work correctly).
It makes sense, fine by me. I will wait for Mat's review before applying
them (patch 1 is for 'net' I suppose).
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v6 mptcp-next 11/11] mptcp: leverage the backlog for RX packet processing
2025-10-23 15:52 ` Matthieu Baerts
@ 2025-10-23 17:02 ` Mat Martineau
2025-10-23 17:43 ` Matthieu Baerts
0 siblings, 1 reply; 23+ messages in thread
From: Mat Martineau @ 2025-10-23 17:02 UTC (permalink / raw)
To: Matthieu Baerts; +Cc: Paolo Abeni, Geliang Tang, mptcp
On Thu, 23 Oct 2025, Matthieu Baerts wrote:
> Hi Paolo, Mat,
>
> On 23/10/2025 17:11, Paolo Abeni wrote:
>>
>>
>> On 10/22/25 4:31 PM, Paolo Abeni wrote:
>>> When the msk socket is owned or the msk receive buffer is full,
>>> move the incoming skbs into a msk-level backlog list. This avoids
>>> traversing the joined subflows and acquiring the subflow-level
>>> socket lock at reception time, improving RX performance.
>>>
>>> When processing the backlog, use the fwd alloc memory borrowed from
>>> the incoming subflow. skbs exceeding the msk receive space are
>>> not dropped; instead they are kept in the backlog until the receive
>>> buffer is freed. Dropping packets already acked at the TCP level is
>>> explicitly discouraged by the RFC and would corrupt the data stream
>>> for fallback sockets.
>>>
>>> Move the conditional reschedule in release_cb() to take action only
>>> after the first loop iteration, to avoid rescheduling just before
>>> releasing the lock.
>>>
>>> Special care is needed to avoid adding skbs to the backlog of a closed
>>> msk and to avoid leaving dangling references in the backlog
>>> at subflow closing time.
>
> (...)
>
>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>> index 5a1d8f9e0fb0ec..0aae17ab77edb2 100644
>>> --- a/net/mptcp/protocol.c
>>> +++ b/net/mptcp/protocol.c
>
> (...)
>
>>> -static bool __mptcp_move_skbs(struct sock *sk)
>>> +static bool mptcp_can_spool_backlog(struct sock *sk, u32 moved,
>>> + struct list_head *skbs)
>>> {
>>> - struct mptcp_subflow_context *subflow;
>>> struct mptcp_sock *msk = mptcp_sk(sk);
>>> - bool ret = false;
>>>
>>> - if (list_empty(&msk->conn_list))
>>> + if (list_empty(&msk->backlog_list))
>>> return false;
>>>
>>> - subflow = list_first_entry(&msk->conn_list,
>>> - struct mptcp_subflow_context, node);
>>> - for (;;) {
>>> - struct sock *ssk;
>>> - bool slowpath;
>>> + /* Borrowed mem could be zero only in the unlikely event that the bl
>>> + * is full
>>> + */
>>> + if (likely(msk->borrowed_mem)) {
>>> + sk_forward_alloc_add(sk, msk->borrowed_mem);
>>> + msk->borrowed_mem = 0;
>>> + sk->sk_reserved_mem = msk->backlog_len;
>>
>> With the above I intended to prevent the fwd memory handling from
>> releasing backlog_len bytes. Re-reading the relevant code, it does not
>> actually guarantee that (experimentation confirmed); see:
>>
>> https://elixir.bootlin.com/linux/v6.18-rc2/source/include/net/sock.h#L1593
>>
>> and:
>>
>> https://elixir.bootlin.com/linux/v6.18-rc2/source/include/net/sock.h#L1580
>>
>> This will need some more care. Also patch 2 will require some
>> significant rework.
>
> Thank you for looking at this complex part, and for having spotted that!
>
>> @Mat, @Matttbe: could you please consider merging patches 1,3-9?
>>
>> I think they should be pretty uncontroversial; merging them would make
>> the series more manageable for future iterations (and would ease my
>> frustration in getting this thing to work correctly).
>
> It makes sense, fine by me. I will wait for Mat's review before applying
> them (patch 1 is for 'net' I suppose).
>
Applying 1,3-9 to our tree(s) makes sense to me. Maybe patch 5 to -net
too? (I just sent an email about that before I saw this message.)
Matthieu, do you want me to reply to each patch so the RvB tag is in patchwork,
or does this suffice for patches 1 & 3-9:
Reviewed-by: Mat Martineau <martineau@kernel.org>
- Mat
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v6 mptcp-next 11/11] mptcp: leverage the backlog for RX packet processing
2025-10-23 17:02 ` Mat Martineau
@ 2025-10-23 17:43 ` Matthieu Baerts
0 siblings, 0 replies; 23+ messages in thread
From: Matthieu Baerts @ 2025-10-23 17:43 UTC (permalink / raw)
To: Mat Martineau, Paolo Abeni; +Cc: Geliang Tang, mptcp
Hi Mat, Paolo,
On 23/10/2025 19:02, Mat Martineau wrote:
> On Thu, 23 Oct 2025, Matthieu Baerts wrote:
>
>> Hi Paolo, Mat,
>>
>> On 23/10/2025 17:11, Paolo Abeni wrote:
>>>
>>>
>>> On 10/22/25 4:31 PM, Paolo Abeni wrote:
>>>> When the msk socket is owned or the msk receive buffer is full,
>>>> move the incoming skbs into a msk-level backlog list. This avoids
>>>> traversing the joined subflows and acquiring the subflow-level
>>>> socket lock at reception time, improving RX performance.
>>>>
>>>> When processing the backlog, use the fwd alloc memory borrowed from
>>>> the incoming subflow. skbs exceeding the msk receive space are
>>>> not dropped; instead they are kept in the backlog until the receive
>>>> buffer is freed. Dropping packets already acked at the TCP level is
>>>> explicitly discouraged by the RFC and would corrupt the data stream
>>>> for fallback sockets.
>>>>
>>>> Move the conditional reschedule in release_cb() to take action only
>>>> after the first loop iteration, to avoid rescheduling just before
>>>> releasing the lock.
>>>>
>>>> Special care is needed to avoid adding skbs to the backlog of a closed
>>>> msk and to avoid leaving dangling references in the backlog
>>>> at subflow closing time.
>>
>> (...)
>>
>>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>>> index 5a1d8f9e0fb0ec..0aae17ab77edb2 100644
>>>> --- a/net/mptcp/protocol.c
>>>> +++ b/net/mptcp/protocol.c
>>
>> (...)
>>
>>>> -static bool __mptcp_move_skbs(struct sock *sk)
>>>> +static bool mptcp_can_spool_backlog(struct sock *sk, u32 moved,
>>>> + struct list_head *skbs)
>>>> {
>>>> - struct mptcp_subflow_context *subflow;
>>>> struct mptcp_sock *msk = mptcp_sk(sk);
>>>> - bool ret = false;
>>>>
>>>> - if (list_empty(&msk->conn_list))
>>>> + if (list_empty(&msk->backlog_list))
>>>> return false;
>>>>
>>>> - subflow = list_first_entry(&msk->conn_list,
>>>> - struct mptcp_subflow_context, node);
>>>> - for (;;) {
>>>> - struct sock *ssk;
>>>> - bool slowpath;
>>>> + /* Borrowed mem could be zero only in the unlikely event that the bl
>>>> + * is full
>>>> + */
>>>> + if (likely(msk->borrowed_mem)) {
>>>> + sk_forward_alloc_add(sk, msk->borrowed_mem);
>>>> + msk->borrowed_mem = 0;
>>>> + sk->sk_reserved_mem = msk->backlog_len;
>>>
>>> With the above I intended to prevent the fwd memory handling from
>>> releasing backlog_len bytes. Re-reading the relevant code, it does not
>>> actually guarantee that (experimentation confirmed); see:
>>>
>>> https://elixir.bootlin.com/linux/v6.18-rc2/source/include/net/sock.h#L1593
>>>
>>> and:
>>>
>>> https://elixir.bootlin.com/linux/v6.18-rc2/source/include/net/sock.h#L1580
>>>
>>> This will need some more care. Also patch 2 will require some
>>> significant rework.
>>
>> Thank you for looking at this complex part, and for having spotted that!
>>
>>> @Mat, @Matttbe: could you please consider merging patches 1,3-9?
>>>
>>> I think they should be pretty uncontroversial; merging them would make
>>> the series more manageable for future iterations (and would ease my
>>> frustration in getting this thing to work correctly).
>>
>> It makes sense, fine by me. I will wait for Mat's review before applying
>> them (patch 1 is for 'net' I suppose).
>>
>
> Applying 1,3-9 to our tree(s) makes sense to me. Maybe patch 5 to -net
> too? (I just sent an email about that before I saw this message.)
Good catch! (I will wait for Paolo's reply before applying the patches.)
> Matthieu, do you want me to reply to each patch so the RvB tag is in patchwork,
> or does this suffice for patches 1 & 3-9:
>
> Reviewed-by: Mat Martineau <martineau@kernel.org>
No, that's fine, I can copy-paste it. (Note that you can also send it
as a reply to the cover letter; then I will simply apply only the
mentioned patches.)
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
` (10 preceding siblings ...)
2025-10-22 14:31 ` [PATCH v6 mptcp-next 11/11] mptcp: leverage the backlog for RX packet processing Paolo Abeni
@ 2025-10-22 15:50 ` MPTCP CI
2025-10-23 6:37 ` Geliang Tang
2025-10-27 12:17 ` Matthieu Baerts
13 siblings, 0 replies; 23+ messages in thread
From: MPTCP CI @ 2025-10-22 15:50 UTC (permalink / raw)
To: Paolo Abeni; +Cc: mptcp
Hi Paolo,
Thank you for your modifications, that's great!
Our CI did some validations and here is its report:
- KVM Validation: normal (except selftest_mptcp_join): Success! ✅
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Success! ✅
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/18720365057
Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/61d32775fca0
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1014567
If there are any issues, you can reproduce them using the same environment as
the one used by the CI, thanks to a docker image, e.g.:
$ cd [kernel source code]
$ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
--pull always mptcp/mptcp-upstream-virtme-docker:latest \
auto-normal
For more details:
https://github.com/multipath-tcp/mptcp-upstream-virtme-docker
Please note that despite all the efforts already made to have a stable test
suite when executed on a public CI like this one, it is possible that some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)
Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
` (11 preceding siblings ...)
2025-10-22 15:50 ` [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing MPTCP CI
@ 2025-10-23 6:37 ` Geliang Tang
2025-10-27 12:17 ` Matthieu Baerts
13 siblings, 0 replies; 23+ messages in thread
From: Geliang Tang @ 2025-10-23 6:37 UTC (permalink / raw)
To: Paolo Abeni, mptcp; +Cc: Mat Martineau
Hi Paolo,
Thanks for this v6.
On Wed, 2025-10-22 at 16:31 +0200, Paolo Abeni wrote:
> This series includes RX path improvements built around backlog
> processing.
>
> The main goals are improving RX performance _and_ increasing
> long-term maintainability.
>
> Patches 2-4 prepare the stack for backlog processing, removing
> assumptions that will not hold true anymore after backlog
> introduction.
>
> Patches 1 and 5 fix long-standing issues which are quite hard to
> reproduce with the current implementation, but the 2nd one will become
> very apparent with backlog usage.
>
> Patches 6, 7 and 9 are more cleanups that will make the backlog patch
> a little less huge.
>
> Patch 8 is a somewhat unrelated cleanup, included here before I
> forget about it.
>
> The real work is done by patches 10 and 11. Patch 10 introduces the
> helpers needed to manipulate the msk-level backlog, and the data
> struct itself, without any actual functional change. Patch 11 finally
> uses the backlog for RX skb processing. Note that MPTCP can't use the
> sk_backlog, as the mptcp release callback can also release and
> re-acquire the msk-level spinlock, and core backlog processing works
> under the assumption that such an event is not possible.
>
> A relevant point is memory accounting for skbs in the backlog.
>
> It's somewhat "original" due to MPTCP constraints. Such skbs use
> space from the incoming subflow receive buffer and do not explicitly
> use any forward-allocated memory, as we can't update the msk fwd mem
> while enqueuing, nor do we want to acquire the ssk socket lock again
> while processing the skbs.
>
> Instead the msk borrows memory from the subflow and reserves it for
> the backlog - see patches 3 and 11 for the gory details.
>
> Note that even if the skbs can sit in the backlog for an unbounded
> time,
>
> ---
> v5 -> v6:
> - added patch 1/11
> - widely reworked patches 10 && 11 to avoid double-accounting
> backlog skbs and to address the fwd allocated memory criticality
> mentioned in previous iterations.
All tests have passed on my end. The only minor issues requiring
cleanup are in patch 2 and patch 9, which Matt or I can address in
subsequent revisions. All patches LGTM.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Geliang Tang <geliang@kernel.org>
>
> Paolo Abeni (11):
> mptcp: drop bogus optimization in __mptcp_check_push()
> mptcp: borrow forward memory from subflow
> mptcp: cleanup fallback data fin reception
> mptcp: cleanup fallback dummy mapping generation
> mptcp: fix MSG_PEEK stream corruption
> mptcp: ensure the kernel PM does not take action too late
> mptcp: do not miss early first subflow close event notification.
> mptcp: make mptcp_destroy_common() static
> mptcp: drop the __mptcp_data_ready() helper
> mptcp: introduce mptcp-level backlog
> mptcp: leverage the backlog for RX packet processing
>
> net/mptcp/mptcp_diag.c | 3 +-
> net/mptcp/pm.c | 4 +-
> net/mptcp/pm_kernel.c | 2 +
> net/mptcp/protocol.c | 363 ++++++++++++++++++++++++++++-----------
> --
> net/mptcp/protocol.h | 10 +-
> net/mptcp/subflow.c | 12 +-
> 6 files changed, 272 insertions(+), 122 deletions(-)
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing
2025-10-22 14:31 [PATCH v6 mptcp-next 00/11] mptcp: introduce backlog processing Paolo Abeni
` (12 preceding siblings ...)
2025-10-23 6:37 ` Geliang Tang
@ 2025-10-27 12:17 ` Matthieu Baerts
13 siblings, 0 replies; 23+ messages in thread
From: Matthieu Baerts @ 2025-10-27 12:17 UTC (permalink / raw)
To: Paolo Abeni, mptcp; +Cc: Mat Martineau, Geliang Tang
Hi Paolo,
On 22/10/2025 16:31, Paolo Abeni wrote:
> This series includes RX path improvements built around backlog processing.
>
> The main goals are improving RX performance _and_ increasing
> long-term maintainability.
>
> Patches 2-4 prepare the stack for backlog processing, removing
> assumptions that will not hold true anymore after backlog introduction.
>
> Patches 1 and 5 fix long-standing issues which are quite hard to
> reproduce with the current implementation, but the 2nd one will become
> very apparent with backlog usage.
>
> Patches 6, 7 and 9 are more cleanups that will make the backlog patch a
> little less huge.
>
> Patch 8 is a somewhat unrelated cleanup, included here before I forget
> about it.
>
> The real work is done by patches 10 and 11. Patch 10 introduces the helpers
> needed to manipulate the msk-level backlog, and the data struct itself,
> without any actual functional change. Patch 11 finally uses the backlog
> for RX skb processing. Note that MPTCP can't use the sk_backlog, as
> the mptcp release callback can also release and re-acquire the msk-level
> spinlock, and core backlog processing works under the assumption that
> such an event is not possible.
>
> A relevant point is memory accounting for skbs in the backlog.
>
> It's somewhat "original" due to MPTCP constraints. Such skbs use space
> from the incoming subflow receive buffer and do not explicitly use any
> forward-allocated memory, as we can't update the msk fwd mem while
> enqueuing, nor do we want to acquire the ssk socket lock again while
> processing the skbs.
>
> Instead the msk borrows memory from the subflow and reserves it for
> the backlog - see patches 3 and 11 for the gory details.
>
> Note that even if the skbs can sit in the backlog for an unbounded time,
>
> ---
> v5 -> v6:
> - added patch 1/11
> - widely reworked patches 10 && 11 to avoid double-accounting backlog
> skbs and to address the fwd allocated memory criticality mentioned
> in previous iterations.
>
> Paolo Abeni (11):
> mptcp: drop bogus optimization in __mptcp_check_push()
> mptcp: borrow forward memory from subflow
> mptcp: cleanup fallback data fin reception
> mptcp: cleanup fallback dummy mapping generation
> mptcp: fix MSG_PEEK stream corruption
> mptcp: ensure the kernel PM does not take action too late
> mptcp: do not miss early first subflow close event notification.
> mptcp: make mptcp_destroy_common() static
> mptcp: drop the __mptcp_data_ready() helper
> mptcp: introduce mptcp-level backlog
> mptcp: leverage the backlog for RX packet processing
Sorry for the delay in applying the patches. Now in our tree:
New patches for t/upstream-net and t/upstream:
- 3a771ce1a023: mptcp: drop bogus optimization in __mptcp_check_push()
- b7ad5d80dff9: mptcp: fix MSG_PEEK stream corruption
- Results: 2a626f21446a..1ce735a9c1d6 (export-net)
- Results: 0d7b92893723..f23a9744c467 (export)
Tests are now in progress:
- export-net:
https://github.com/multipath-tcp/mptcp_net-next/commit/294a632909acc62a7656edc52770f9f59332af39/checks
- export:
https://github.com/multipath-tcp/mptcp_net-next/commit/1035c9b6c98cf0d15998eb7ce53e5f81621e5116/checks
New patches for t/upstream:
- f0025b89c22c: mptcp: cleanup fallback data fin reception
- 905fec8ceabc: mptcp: cleanup fallback dummy mapping generation
- 8b29c30e82d2: mptcp: ensure the kernel PM does not take action too late
- 439f10b1648b: mptcp: do not miss early first subflow close event
notification
- 00804e3abc8d: mptcp: make mptcp_destroy_common() static
- 25c9415c554d: mptcp: drop the __mptcp_data_ready() helper
- Results: f23a9744c467..eb31e166f63e (export)
Tests are now in progress:
- export:
https://github.com/multipath-tcp/mptcp_net-next/commit/54c37fb023bf32f10f60fb90d27a8fa800de426f/checks
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 23+ messages in thread