MPTCP Linux Development
* [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up
@ 2025-11-14  9:17 Paolo Abeni
  2025-11-14  9:17 ` [PATCH v4 mptcp-next 1/3] mptcp: grafting MPJ subflow earlier Paolo Abeni
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Paolo Abeni @ 2025-11-14  9:17 UTC (permalink / raw)
  To: mptcp

This series tries to fix the existing races between memcg accounting and
the backlog processing.

Only a minor delta over v3: typo fixes, a build fix when CG is not
enabled, and reworded comments/commit messages.

Paolo Abeni (3):
  mptcp: grafting MPJ subflow earlier
  Squash-to: "mptcp: fix memcg accounting for passive sockets"
  Squash-to: "mptcp: leverage the backlog for RX packet processing"

 net/mptcp/protocol.c | 110 +++++++++++++++++++++++++++++++++++--------
 net/mptcp/protocol.h |   1 +
 2 files changed, 92 insertions(+), 19 deletions(-)

-- 
2.51.1



* [PATCH v4 mptcp-next 1/3] mptcp: grafting MPJ subflow earlier
  2025-11-14  9:17 [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up Paolo Abeni
@ 2025-11-14  9:17 ` Paolo Abeni
  2025-11-14  9:17 ` [PATCH v4 mptcp-next 2/3] Squash-to: "mptcp: fix memcg accounting for passive sockets" Paolo Abeni
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Paolo Abeni @ 2025-11-14  9:17 UTC (permalink / raw)
  To: mptcp

Later patches need to ensure that all MPJ subflows are grafted to the
msk socket before accept() completion.

Currently the grafting happens under the msk socket lock, potentially
at msk release_cb time, which makes satisfying the above condition a bit
tricky.

Move the MPJ subflow grafting earlier, under the msk data lock, so that
we can use that lock as a synchronization point.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v3 -> v4:
  - clarified it's not a fix
  - move the graft under the msk socket lock
  - no need to graft for active subflows
---
 net/mptcp/protocol.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 78ac8ba80e59..4a4cb9952596 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -933,12 +933,6 @@ static bool __mptcp_finish_join(struct mptcp_sock *msk, struct sock *ssk)
 	mptcp_subflow_joined(msk, ssk);
 	spin_unlock_bh(&msk->fallback_lock);
 
-	/* attach to msk socket only after we are sure we will deal with it
-	 * at close time
-	 */
-	if (sk->sk_socket && !ssk->sk_socket)
-		mptcp_sock_graft(ssk, sk->sk_socket);
-
 	mptcp_subflow_ctx(ssk)->subflow_id = msk->subflow_id++;
 	mptcp_sockopt_sync_locked(msk, ssk);
 	mptcp_stop_tout_timer(sk);
@@ -3760,6 +3754,20 @@ void mptcp_sock_graft(struct sock *sk, struct socket *parent)
 	write_unlock_bh(&sk->sk_callback_lock);
 }
 
+/* Can be called without holding the msk socket lock; use the callback lock
+ * to avoid {READ_,WRITE_}ONCE annotations on sk_socket.
+ */
+static void mptcp_sock_check_graft(struct sock *sk, struct sock *ssk)
+{
+	struct socket *sock;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	sock = sk->sk_socket;
+	write_unlock_bh(&sk->sk_callback_lock);
+	if (sock)
+		mptcp_sock_graft(ssk, sock);
+}
+
 bool mptcp_finish_join(struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
@@ -3775,7 +3783,9 @@ bool mptcp_finish_join(struct sock *ssk)
 		return false;
 	}
 
-	/* active subflow, already present inside the conn_list */
+	/* Active subflow, already present inside the conn_list; it is
+	 * grafted either by __mptcp_subflow_connect() or accept().
+	 */
 	if (!list_empty(&subflow->node)) {
 		spin_lock_bh(&msk->fallback_lock);
 		if (!msk->allow_subflows) {
@@ -3802,11 +3812,17 @@ bool mptcp_finish_join(struct sock *ssk)
 		if (ret) {
 			sock_hold(ssk);
 			list_add_tail(&subflow->node, &msk->conn_list);
+			mptcp_sock_check_graft(parent, ssk);
 		}
 	} else {
 		sock_hold(ssk);
 		list_add_tail(&subflow->node, &msk->join_list);
 		__set_bit(MPTCP_FLUSH_JOIN_LIST, &msk->cb_flags);
+
+		/* In case of later failures, __mptcp_flush_join_list() will
+		 * properly orphan the ssk via mptcp_close_ssk().
+		 */
+		mptcp_sock_check_graft(parent, ssk);
 	}
 	mptcp_data_unlock(parent);
 
-- 
2.51.1



* [PATCH v4 mptcp-next 2/3] Squash-to: "mptcp: fix memcg accounting for passive sockets"
  2025-11-14  9:17 [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up Paolo Abeni
  2025-11-14  9:17 ` [PATCH v4 mptcp-next 1/3] mptcp: grafting MPJ subflow earlier Paolo Abeni
@ 2025-11-14  9:17 ` Paolo Abeni
  2025-11-14  9:17 ` [PATCH v4 mptcp-next 3/3] Squash-to: "mptcp: leverage the backlog for RX packet processing" Paolo Abeni
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Paolo Abeni @ 2025-11-14  9:17 UTC (permalink / raw)
  To: mptcp

__mptcp_inherit_memcg() is currently invoked by mptcp_graph_subflows()
with the wrong GFP flags, as lock_sock_fast() can leave the caller in
atomic scope.

Since this is not an extremely hot path, use plain lock_sock()
instead.

Additionally move the subflow CG initialization earlier, under the msk
data lock, so that the next patch can use the latter as a synchronization
point to ensure all subflows are CG accounted.

Finally fix a typo in the mentioned helper name.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v3 -> v4:
  - rebased
---
 net/mptcp/protocol.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4a4cb9952596..2364c144bf4f 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -936,8 +936,6 @@ static bool __mptcp_finish_join(struct mptcp_sock *msk, struct sock *ssk)
 	mptcp_subflow_ctx(ssk)->subflow_id = msk->subflow_id++;
 	mptcp_sockopt_sync_locked(msk, ssk);
 	mptcp_stop_tout_timer(sk);
-	__mptcp_inherit_cgrp_data(sk, ssk);
-	__mptcp_inherit_memcg(sk, ssk, GFP_ATOMIC);
 	__mptcp_propagate_sndbuf(sk, ssk);
 	return true;
 }
@@ -3764,8 +3762,11 @@ static void mptcp_sock_check_graft(struct sock *sk, struct sock *ssk)
 	write_lock_bh(&sk->sk_callback_lock);
 	sock = sk->sk_socket;
 	write_unlock_bh(&sk->sk_callback_lock);
-	if (sock)
+	if (sock) {
 		mptcp_sock_graft(ssk, sock);
+		__mptcp_inherit_cgrp_data(sk, ssk);
+		__mptcp_inherit_memcg(sk, ssk, GFP_ATOMIC);
+	}
 }
 
 bool mptcp_finish_join(struct sock *ssk)
@@ -4083,18 +4084,17 @@ static int mptcp_listen(struct socket *sock, int backlog)
 	return err;
 }
 
-static void mptcp_graph_subflows(struct sock *sk)
+static void mptcp_graft_subflows(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow;
 	struct mptcp_sock *msk = mptcp_sk(sk);
 
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
-		bool slow;
 
-		slow = lock_sock_fast(ssk);
+		lock_sock(ssk);
 
-		/* set ssk->sk_socket of accept()ed flows to mptcp socket.
+		/* Set ssk->sk_socket of accept()ed flows to mptcp socket.
 		 * This is needed so NOSPACE flag can be set from tcp stack.
 		 */
 		if (!ssk->sk_socket)
@@ -4102,7 +4102,7 @@ static void mptcp_graph_subflows(struct sock *sk)
 
 		__mptcp_inherit_cgrp_data(sk, ssk);
 		__mptcp_inherit_memcg(sk, ssk, GFP_KERNEL);
-		unlock_sock_fast(ssk, slow);
+		release_sock(ssk);
 	}
 }
 
@@ -4153,7 +4153,7 @@ static int mptcp_stream_accept(struct socket *sock, struct socket *newsock,
 		msk = mptcp_sk(newsk);
 		msk->in_accept_queue = 0;
 
-		mptcp_graph_subflows(newsk);
+		mptcp_graft_subflows(newsk);
 		mptcp_rps_record_subflows(msk);
 
 		/* Do late cleanup for the first subflow as necessary. Also
-- 
2.51.1



* [PATCH v4 mptcp-next 3/3] Squash-to: "mptcp: leverage the backlog for RX packet processing"
  2025-11-14  9:17 [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up Paolo Abeni
  2025-11-14  9:17 ` [PATCH v4 mptcp-next 1/3] mptcp: grafting MPJ subflow earlier Paolo Abeni
  2025-11-14  9:17 ` [PATCH v4 mptcp-next 2/3] Squash-to: "mptcp: fix memcg accounting for passive sockets" Paolo Abeni
@ 2025-11-14  9:17 ` Paolo Abeni
  2025-11-14 10:30 ` [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up MPTCP CI
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Paolo Abeni @ 2025-11-14  9:17 UTC (permalink / raw)
  To: mptcp

If a subflow receives data before gaining the memcg while the msk
socket lock is held at accept time, or if the PM locks the still
unaccepted msk socket while subflows concurrently push data to it,
mptcp_graft_subflows() can complete with a non-empty backlog.

The msk will try to borrow such memory, but some of the skbs there
were not memcg charged. When the msk eventually returns such memory,
we would hit the same splat as in #597.
[even if so far I was unable to replicate this scenario]

This patch tries to address such a potential issue by:
- explicitly keeping track of the amount of memory added to the backlog
  without CG accounting
- additionally accounting for such memory at accept time
- preventing any subflow from adding non-CG-accounted memory to the
  backlog after the above flush

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v3 -> v4:
  - fixed a bunch of typos
  - fixed build error when CG are not enabled
---
 net/mptcp/protocol.c | 64 +++++++++++++++++++++++++++++++++++++++++---
 net/mptcp/protocol.h |  1 +
 2 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 2364c144bf4f..ad3c43a9c3f4 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -678,6 +678,7 @@ static void __mptcp_add_backlog(struct sock *sk,
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct sk_buff *tail = NULL;
+	struct sock *ssk = skb->sk;
 	bool fragstolen;
 	int delta;
 
@@ -691,18 +692,26 @@ static void __mptcp_add_backlog(struct sock *sk,
 		tail = list_last_entry(&msk->backlog_list, struct sk_buff, list);
 
 	if (tail && MPTCP_SKB_CB(skb)->map_seq == MPTCP_SKB_CB(tail)->end_seq &&
-	    skb->sk == tail->sk &&
+	    ssk == tail->sk &&
 	    __mptcp_try_coalesce(sk, tail, skb, &fragstolen, &delta)) {
 		skb->truesize -= delta;
 		kfree_skb_partial(skb, fragstolen);
 		__mptcp_subflow_lend_fwdmem(subflow, delta);
-		WRITE_ONCE(msk->backlog_len, msk->backlog_len + delta);
-		return;
+		goto account;
 	}
 
 	list_add_tail(&skb->list, &msk->backlog_list);
 	mptcp_subflow_lend_fwdmem(subflow, skb);
-	WRITE_ONCE(msk->backlog_len, msk->backlog_len + skb->truesize);
+	delta = skb->truesize;
+
+account:
+	WRITE_ONCE(msk->backlog_len, msk->backlog_len + delta);
+
+	/* Possibly not accept()ed yet, keep track of memory not CG
+	 * accounted, mptcp_graft_subflows() will handle it.
+	 */
+	if (!mem_cgroup_from_sk(ssk))
+		msk->backlog_unaccounted += delta;
 }
 
 static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
@@ -2179,6 +2188,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 
+	/* After CG initialization, subflows should never add skbs before
+	 * gaining the CG themselves.
+	 */
+	DEBUG_NET_WARN_ON_ONCE(msk->backlog_unaccounted && sk->sk_socket &&
+			       mem_cgroup_from_sk(sk));
+
 	/* Don't spool the backlog if the rcvbuf is full. */
 	if (list_empty(&msk->backlog_list) ||
 	    sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
@@ -4089,6 +4104,22 @@ static void mptcp_graft_subflows(struct sock *sk)
 	struct mptcp_subflow_context *subflow;
 	struct mptcp_sock *msk = mptcp_sk(sk);
 
+	if (mem_cgroup_sockets_enabled) {
+		LIST_HEAD(join_list);
+
+		/* Subflows joining after __inet_accept() will get the
+		 * mem CG properly initialized at mptcp_finish_join() time,
+		 * but subflows pending in join_list need explicit
+		 * initialization before flushing `backlog_unaccounted`
+		 * or MPTCP can later unexpectedly observe unaccounted memory.
+		 */
+		mptcp_data_lock(sk);
+		list_splice_init(&msk->join_list, &join_list);
+		mptcp_data_unlock(sk);
+
+		__mptcp_flush_join_list(sk, &join_list);
+	}
+
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
@@ -4100,10 +4131,35 @@ static void mptcp_graft_subflows(struct sock *sk)
 		if (!ssk->sk_socket)
 			mptcp_sock_graft(ssk, sk->sk_socket);
 
+		if (!mem_cgroup_sk_enabled(sk))
+			goto unlock;
+
 		__mptcp_inherit_cgrp_data(sk, ssk);
 		__mptcp_inherit_memcg(sk, ssk, GFP_KERNEL);
+
+unlock:
 		release_sock(ssk);
 	}
+
+	if (mem_cgroup_sk_enabled(sk)) {
+		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
+		int amt;
+
+		/* Account the backlog memory; prior accept() is aware of
+		 * fwd and rmem only.
+		 */
+		mptcp_data_lock(sk);
+		amt = sk_mem_pages(sk->sk_forward_alloc +
+				   msk->backlog_unaccounted +
+				   atomic_read(&sk->sk_rmem_alloc)) -
+		      sk_mem_pages(sk->sk_forward_alloc +
+				   atomic_read(&sk->sk_rmem_alloc));
+		msk->backlog_unaccounted = 0;
+		mptcp_data_unlock(sk);
+
+		if (amt)
+			mem_cgroup_sk_charge(sk, amt, gfp);
+	}
 }
 
 static int mptcp_stream_accept(struct socket *sock, struct socket *newsock,
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 161b704be16b..199f28f3dd5e 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -360,6 +360,7 @@ struct mptcp_sock {
 
 	struct list_head backlog_list;	/* protected by the data lock */
 	u32		backlog_len;
+	u32		backlog_unaccounted;
 };
 
 #define mptcp_data_lock(sk) spin_lock_bh(&(sk)->sk_lock.slock)
-- 
2.51.1



* Re: [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up
  2025-11-14  9:17 [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up Paolo Abeni
                   ` (2 preceding siblings ...)
  2025-11-14  9:17 ` [PATCH v4 mptcp-next 3/3] Squash-to: "mptcp: leverage the backlog for RX packet processing" Paolo Abeni
@ 2025-11-14 10:30 ` MPTCP CI
  2025-11-14 11:58 ` Matthieu Baerts
  2025-11-14 12:06 ` Matthieu Baerts
  5 siblings, 0 replies; 7+ messages in thread
From: MPTCP CI @ 2025-11-14 10:30 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal (except selftest_mptcp_join): Success! ✅
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Success! ✅
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/19360390865

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/c7e19c65fe47
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1023415


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)


* Re: [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up
  2025-11-14  9:17 [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up Paolo Abeni
                   ` (3 preceding siblings ...)
  2025-11-14 10:30 ` [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up MPTCP CI
@ 2025-11-14 11:58 ` Matthieu Baerts
  2025-11-14 12:06 ` Matthieu Baerts
  5 siblings, 0 replies; 7+ messages in thread
From: Matthieu Baerts @ 2025-11-14 11:58 UTC (permalink / raw)
  To: Paolo Abeni, mptcp

Hi Paolo,

On 14/11/2025 10:17, Paolo Abeni wrote:
> This tries to fix the existing races in memcg accounting with the
> backlog.
> 
> Only minor delta over v3: typos, build fix without CG, and reworded
> comments/commit messages.

Thank you for the v4 and all these fixes. It looks good to me:

Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



* Re: [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up
  2025-11-14  9:17 [PATCH v4 mptcp-next 0/3] mptcp: cg and backlog follow-up Paolo Abeni
                   ` (4 preceding siblings ...)
  2025-11-14 11:58 ` Matthieu Baerts
@ 2025-11-14 12:06 ` Matthieu Baerts
  5 siblings, 0 replies; 7+ messages in thread
From: Matthieu Baerts @ 2025-11-14 12:06 UTC (permalink / raw)
  To: Paolo Abeni, mptcp

Hi Paolo,

On 14/11/2025 10:17, Paolo Abeni wrote:
> This tries to fix the existing races in memcg accounting with the
> backlog.
> 
> Only minor delta over v3: typos, build fix without CG, and reworded
> comments/commit messages.

Now in our tree:

New patches for t/upstream:
- f67fd656fe97: mptcp: grafting MPJ subflow earlier
- 36a8ca62b02d: "squashed" patch 2/3 in "mptcp: fix memcg accounting for
passive sockets"
- a3565f84e54f: "squashed" patch 3/3 in "mptcp: leverage the backlog for
RX packet processing"
- Results: 13d7de45372c..13cda072207e (export)

Tests are now in progress:

- export:
https://github.com/multipath-tcp/mptcp_net-next/commit/c2eff70b71b8c42253db435d68374277502522e7/checks

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


