All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH mptcp-next-next 0/3] mptcp: rx path refactor
@ 2024-11-29 17:45 Paolo Abeni
  2024-11-29 17:45 ` [PATCH mptcp-next-next 1/3] mptcp: consolidate subflow cleanup Paolo Abeni
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Paolo Abeni @ 2024-11-29 17:45 UTC (permalink / raw)
  To: mptcp

This is a batch of changes I had sitting in my local tree for a while.
Why another refactor you may ask? Two main resons:

- currently the mptcp RX path introduces quite a bit of 'exceptional'
 accounting/locking processing WRT to plain TCP, adding up to the
 implementation complexity in a misurable way
- the performance gap WRT plain TCP for single subflow connections is
 quite measurable.

The present refactor addresses both the above items: most of the
additional complexity is dropped, and single stream performances
increase measurably - from 55Gbps to 71Gbps in my loopback test. As a
reference, plain TCP is around 84Gps on the same host.

The above comes to a price: the patch are invasive, even in subtle ways:
the chance of destabilizing the implementation is real (ence the
additional, intentional '-next' into the subj).

In any case keeping the patch hidden for longer was not going to do any
good, so here we are.

Paolo Abeni (3):
  mptcp: consolidate subflow cleanup
  mptcp: move the whole rx path under msk socket lock protection
  mptcp: cleanup mem accounting.

 net/mptcp/fastopen.c |   4 +-
 net/mptcp/protocol.c | 202 +++++++++----------------------------------
 net/mptcp/protocol.h |   6 +-
 net/mptcp/subflow.c  |  33 +++----
 4 files changed, 66 insertions(+), 179 deletions(-)

-- 
2.45.2


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH mptcp-next-next 1/3] mptcp: consolidate subflow cleanup
  2024-11-29 17:45 [PATCH mptcp-next-next 0/3] mptcp: rx path refactor Paolo Abeni
@ 2024-11-29 17:45 ` Paolo Abeni
  2024-11-29 17:45 ` [PATCH mptcp-next-next 2/3] mptcp: move the whole rx path under msk socket lock protection Paolo Abeni
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2024-11-29 17:45 UTC (permalink / raw)
  To: mptcp

Consolidate all the cleanup actions requiring the worked in a single
helper and ensure the dummy data fin creation for fallback socket is
performed only when the tcp rx queue is empty.

There are no functional changes intended, but this will simplify the
next patch, when the tcp rx queue spooling could be delayed at release_cb
time.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/subflow.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index fd021cf8286e..2926bdf88e42 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1271,7 +1271,12 @@ static void mptcp_subflow_discard_data(struct sock *ssk, struct sk_buff *skb,
 		subflow->map_valid = 0;
 }
 
-/* sched mptcp worker to remove the subflow if no more data is pending */
+static bool subflow_is_done(const struct sock *sk)
+{
+	return sk->sk_shutdown & RCV_SHUTDOWN || sk->sk_state == TCP_CLOSE;
+}
+
+/* sched mptcp worker for subflow cleanup if no more data is pending */
 static void subflow_sched_work_if_closed(struct mptcp_sock *msk, struct sock *ssk)
 {
 	struct sock *sk = (struct sock *)msk;
@@ -1281,8 +1286,18 @@ static void subflow_sched_work_if_closed(struct mptcp_sock *msk, struct sock *ss
 		    inet_sk_state_load(sk) != TCP_ESTABLISHED)))
 		return;
 
-	if (skb_queue_empty(&ssk->sk_receive_queue) &&
-	    !test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, &msk->flags))
+	if (!skb_queue_empty(&ssk->sk_receive_queue))
+		return;
+
+	if (!test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, &msk->flags))
+		mptcp_schedule_work(sk);
+
+	/* when the fallback subflow closes the rx side, trigger a 'dummy'
+	 * ingress data fin, so that the msk state will follow along
+	 */
+	if (__mptcp_check_fallback(msk) && subflow_is_done(ssk) &&
+	    msk->first == ssk &&
+	    mptcp_update_rcv_data_fin(msk, READ_ONCE(msk->ack_seq), true))
 		mptcp_schedule_work(sk);
 }
 
@@ -1842,11 +1857,6 @@ static void __subflow_state_change(struct sock *sk)
 	rcu_read_unlock();
 }
 
-static bool subflow_is_done(const struct sock *sk)
-{
-	return sk->sk_shutdown & RCV_SHUTDOWN || sk->sk_state == TCP_CLOSE;
-}
-
 static void subflow_state_change(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
@@ -1873,13 +1883,6 @@ static void subflow_state_change(struct sock *sk)
 		subflow_error_report(sk);
 
 	subflow_sched_work_if_closed(mptcp_sk(parent), sk);
-
-	/* when the fallback subflow closes the rx side, trigger a 'dummy'
-	 * ingress data fin, so that the msk state will follow along
-	 */
-	if (__mptcp_check_fallback(msk) && subflow_is_done(sk) && msk->first == sk &&
-	    mptcp_update_rcv_data_fin(msk, READ_ONCE(msk->ack_seq), true))
-		mptcp_schedule_work(parent);
 }
 
 void mptcp_subflow_queue_clean(struct sock *listener_sk, struct sock *listener_ssk)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH mptcp-next-next 2/3] mptcp: move the whole rx path under msk socket lock protection
  2024-11-29 17:45 [PATCH mptcp-next-next 0/3] mptcp: rx path refactor Paolo Abeni
  2024-11-29 17:45 ` [PATCH mptcp-next-next 1/3] mptcp: consolidate subflow cleanup Paolo Abeni
@ 2024-11-29 17:45 ` Paolo Abeni
  2024-12-02 16:56   ` Matthieu Baerts
  2024-11-29 17:45 ` [PATCH mptcp-next-next 3/3] mptcp: cleanup mem accounting Paolo Abeni
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Paolo Abeni @ 2024-11-29 17:45 UTC (permalink / raw)
  To: mptcp

After commit c2e6048fa1cf ("mptcp: fix race in release_cb") it's
pretty straight forward move the whole MPTCP rx path under the socket
lock leveraging the release_cb.

We can drop a bunch of spin_lock pairs in the receive functions, use
a single receive queue and invoke __mptcp_move_skbs only when subflows
ask for it.

This will allow more cleanup in the next patch

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/fastopen.c |  2 ++
 net/mptcp/protocol.c | 76 +++++++++++++++++++-------------------------
 net/mptcp/protocol.h |  2 +-
 3 files changed, 36 insertions(+), 44 deletions(-)

diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
index a29ff901df75..fb945c0d50bf 100644
--- a/net/mptcp/fastopen.c
+++ b/net/mptcp/fastopen.c
@@ -49,6 +49,7 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
 	MPTCP_SKB_CB(skb)->has_rxtstamp = TCP_SKB_CB(skb)->has_rxtstamp;
 
 	mptcp_data_lock(sk);
+	DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk));
 
 	mptcp_set_owner_r(skb, sk);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);
@@ -65,6 +66,7 @@ void __mptcp_fastopen_gen_msk_ackseq(struct mptcp_sock *msk, struct mptcp_subflo
 	struct sock *sk = (struct sock *)msk;
 	struct sk_buff *skb;
 
+	DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk));
 	skb = skb_peek_tail(&sk->sk_receive_queue);
 	if (skb) {
 		WARN_ON_ONCE(MPTCP_SKB_CB(skb)->end_seq);
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index f768aa4473fb..159add48f6d9 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -845,7 +845,7 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	return moved > 0;
 }
 
-void mptcp_data_ready(struct sock *sk, struct sock *ssk)
+static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -868,9 +868,17 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 		return;
 
 	/* Wake-up the reader only for in-sequence data */
-	mptcp_data_lock(sk);
 	if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
 		sk->sk_data_ready(sk);
+}
+
+void mptcp_data_ready(struct sock *sk, struct sock *ssk)
+{
+	mptcp_data_lock(sk);
+	if (!sock_owned_by_user(sk))
+		__mptcp_data_ready(sk, ssk);
+	else
+		__set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->cb_flags);
 	mptcp_data_unlock(sk);
 }
 
@@ -1077,9 +1085,7 @@ static void __mptcp_clean_una_wakeup(struct sock *sk)
 
 static void mptcp_clean_una_wakeup(struct sock *sk)
 {
-	mptcp_data_lock(sk);
 	__mptcp_clean_una_wakeup(sk);
-	mptcp_data_unlock(sk);
 }
 
 static void mptcp_enter_memory_pressure(struct sock *sk)
@@ -1939,16 +1945,22 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	goto out;
 }
 
-static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
+static bool __mptcp_move_skbs(struct sock *sk);
+
+static int __mptcp_recvmsg_mskq(struct sock *sk,
 				struct msghdr *msg,
 				size_t len, int flags,
 				struct scm_timestamping_internal *tss,
 				int *cmsg_flags)
 {
+	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct sk_buff *skb, *tmp;
 	int copied = 0;
 
-	skb_queue_walk_safe(&msk->receive_queue, skb, tmp) {
+	if (skb_queue_empty(&sk->sk_receive_queue) && !__mptcp_move_skbs(sk))
+		return 0;
+
+	skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
 		u32 offset = MPTCP_SKB_CB(skb)->offset;
 		u32 data_len = skb->len - offset;
 		u32 count = min_t(size_t, len - copied, data_len);
@@ -1983,7 +1995,7 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
 			/* we will bulk release the skb memory later */
 			skb->destructor = NULL;
 			WRITE_ONCE(msk->rmem_released, msk->rmem_released + skb->truesize);
-			__skb_unlink(skb, &msk->receive_queue);
+			__skb_unlink(skb, &sk->sk_receive_queue);
 			__kfree_skb(skb);
 			msk->bytes_consumed += count;
 		}
@@ -2107,16 +2119,9 @@ static void __mptcp_update_rmem(struct sock *sk)
 	WRITE_ONCE(msk->rmem_released, 0);
 }
 
-static void __mptcp_splice_receive_queue(struct sock *sk)
+static bool __mptcp_move_skbs(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	skb_queue_splice_tail_init(&sk->sk_receive_queue, &msk->receive_queue);
-}
-
-static bool __mptcp_move_skbs(struct mptcp_sock *msk)
-{
-	struct sock *sk = (struct sock *)msk;
 	unsigned int moved = 0;
 	bool ret, done;
 
@@ -2124,37 +2129,27 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk)
 		struct sock *ssk = mptcp_subflow_recv_lookup(msk);
 		bool slowpath;
 
-		/* we can have data pending in the subflows only if the msk
-		 * receive buffer was full at subflow_data_ready() time,
-		 * that is an unlikely slow path.
-		 */
-		if (likely(!ssk))
+		if (unlikely(!ssk))
 			break;
 
 		slowpath = lock_sock_fast(ssk);
-		mptcp_data_lock(sk);
 		__mptcp_update_rmem(sk);
 		done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
-		mptcp_data_unlock(sk);
 
 		if (unlikely(ssk->sk_err))
 			__mptcp_error_report(sk);
 		unlock_sock_fast(ssk, slowpath);
 	} while (!done);
 
-	/* acquire the data lock only if some input data is pending */
 	ret = moved > 0;
 	if (!RB_EMPTY_ROOT(&msk->out_of_order_queue) ||
-	    !skb_queue_empty_lockless(&sk->sk_receive_queue)) {
-		mptcp_data_lock(sk);
+	    !skb_queue_empty(&sk->sk_receive_queue)) {
 		__mptcp_update_rmem(sk);
 		ret |= __mptcp_ofo_queue(msk);
-		__mptcp_splice_receive_queue(sk);
-		mptcp_data_unlock(sk);
 	}
 	if (ret)
 		mptcp_check_data_fin((struct sock *)msk);
-	return !skb_queue_empty(&msk->receive_queue);
+	return ret;
 }
 
 static unsigned int mptcp_inq_hint(const struct sock *sk)
@@ -2162,7 +2157,7 @@ static unsigned int mptcp_inq_hint(const struct sock *sk)
 	const struct mptcp_sock *msk = mptcp_sk(sk);
 	const struct sk_buff *skb;
 
-	skb = skb_peek(&msk->receive_queue);
+	skb = skb_peek(&sk->sk_receive_queue);
 	if (skb) {
 		u64 hint_val = READ_ONCE(msk->ack_seq) - MPTCP_SKB_CB(skb)->map_seq;
 
@@ -2208,7 +2203,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	while (copied < len) {
 		int err, bytes_read;
 
-		bytes_read = __mptcp_recvmsg_mskq(msk, msg, len - copied, flags, &tss, &cmsg_flags);
+		bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags, &tss, &cmsg_flags);
 		if (unlikely(bytes_read < 0)) {
 			if (!copied)
 				copied = bytes_read;
@@ -2220,8 +2215,6 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		/* be sure to advertise window change */
 		mptcp_cleanup_rbuf(msk);
 
-		if (skb_queue_empty(&msk->receive_queue) && __mptcp_move_skbs(msk))
-			continue;
 
 		/* only the MPTCP socket status is relevant here. The exit
 		 * conditions mirror closely tcp_recvmsg()
@@ -2246,7 +2239,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 				/* race breaker: the shutdown could be after the
 				 * previous receive queue check
 				 */
-				if (__mptcp_move_skbs(msk))
+				if (__mptcp_move_skbs(sk))
 					continue;
 				break;
 			}
@@ -2290,9 +2283,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 	}
 
-	pr_debug("msk=%p rx queue empty=%d:%d copied=%d\n",
-		 msk, skb_queue_empty_lockless(&sk->sk_receive_queue),
-		 skb_queue_empty(&msk->receive_queue), copied);
+	pr_debug("msk=%p rx queue empty=%d copied=%d",
+		 msk, skb_queue_empty(&sk->sk_receive_queue), copied);
 
 	release_sock(sk);
 	return copied;
@@ -2819,7 +2811,6 @@ static void __mptcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&msk->join_list);
 	INIT_LIST_HEAD(&msk->rtx_queue);
 	INIT_WORK(&msk->work, mptcp_worker);
-	__skb_queue_head_init(&msk->receive_queue);
 	msk->out_of_order_queue = RB_ROOT;
 	msk->first_pending = NULL;
 	WRITE_ONCE(msk->rmem_fwd_alloc, 0);
@@ -3402,12 +3393,8 @@ void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
 	mptcp_for_each_subflow_safe(msk, subflow, tmp)
 		__mptcp_close_ssk(sk, mptcp_subflow_tcp_sock(subflow), subflow, flags);
 
-	/* move to sk_receive_queue, sk_stream_kill_queues will purge it */
-	mptcp_data_lock(sk);
-	skb_queue_splice_tail_init(&msk->receive_queue, &sk->sk_receive_queue);
 	__skb_queue_purge(&sk->sk_receive_queue);
 	skb_rbtree_purge(&msk->out_of_order_queue);
-	mptcp_data_unlock(sk);
 
 	/* move all the rx fwd alloc into the sk_mem_reclaim_final in
 	 * inet_sock_destruct() will dispose it
@@ -3450,7 +3437,8 @@ void __mptcp_check_push(struct sock *sk, struct sock *ssk)
 
 #define MPTCP_FLAGS_PROCESS_CTX_NEED (BIT(MPTCP_PUSH_PENDING) | \
 				      BIT(MPTCP_RETRANSMIT) | \
-				      BIT(MPTCP_FLUSH_JOIN_LIST))
+				      BIT(MPTCP_FLUSH_JOIN_LIST) | \
+				      BIT(MPTCP_DEQUEUE))
 
 /* processes deferred events and flush wmem */
 static void mptcp_release_cb(struct sock *sk)
@@ -3484,6 +3472,8 @@ static void mptcp_release_cb(struct sock *sk)
 			__mptcp_push_pending(sk, 0);
 		if (flags & BIT(MPTCP_RETRANSMIT))
 			__mptcp_retrans(sk);
+		if ((flags & BIT(MPTCP_DEQUEUE)) && __mptcp_move_skbs(sk))
+			sk->sk_data_ready(sk);
 
 		cond_resched();
 		spin_lock_bh(&sk->sk_lock.slock);
@@ -3721,7 +3711,7 @@ static int mptcp_ioctl(struct sock *sk, int cmd, int *karg)
 			return -EINVAL;
 
 		lock_sock(sk);
-		__mptcp_move_skbs(msk);
+		__mptcp_move_skbs(sk);
 		*karg = mptcp_inq_hint(sk);
 		release_sock(sk);
 		break;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index b4c72a73594f..ad940cc1f26f 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -124,6 +124,7 @@
 #define MPTCP_FLUSH_JOIN_LIST	5
 #define MPTCP_SYNC_STATE	6
 #define MPTCP_SYNC_SNDBUF	7
+#define MPTCP_DEQUEUE		8
 
 struct mptcp_skb_cb {
 	u64 map_seq;
@@ -322,7 +323,6 @@ struct mptcp_sock {
 	struct work_struct work;
 	struct sk_buff  *ooo_last_skb;
 	struct rb_root  out_of_order_queue;
-	struct sk_buff_head receive_queue;
 	struct list_head conn_list;
 	struct list_head rtx_queue;
 	struct mptcp_data_frag *first_pending;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH mptcp-next-next 3/3] mptcp: cleanup mem accounting.
  2024-11-29 17:45 [PATCH mptcp-next-next 0/3] mptcp: rx path refactor Paolo Abeni
  2024-11-29 17:45 ` [PATCH mptcp-next-next 1/3] mptcp: consolidate subflow cleanup Paolo Abeni
  2024-11-29 17:45 ` [PATCH mptcp-next-next 2/3] mptcp: move the whole rx path under msk socket lock protection Paolo Abeni
@ 2024-11-29 17:45 ` Paolo Abeni
  2024-11-29 18:52 ` [PATCH mptcp-next-next 0/3] mptcp: rx path refactor MPTCP CI
  2024-12-02 16:56 ` Matthieu Baerts
  4 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2024-11-29 17:45 UTC (permalink / raw)
  To: mptcp

After the previous patch, updating sk_forward_memory is cheap and
we can drop a lot of complexity from the MPTCP memory acconting,
removing the custom fwd mem allocations for rmem.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/fastopen.c |   2 +-
 net/mptcp/protocol.c | 128 ++++---------------------------------------
 net/mptcp/protocol.h |   4 +-
 3 files changed, 13 insertions(+), 121 deletions(-)

diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
index fb945c0d50bf..b0f1dddfb143 100644
--- a/net/mptcp/fastopen.c
+++ b/net/mptcp/fastopen.c
@@ -51,7 +51,7 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
 	mptcp_data_lock(sk);
 	DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk));
 
-	mptcp_set_owner_r(skb, sk);
+	skb_set_owner_r(skb, sk);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);
 	mptcp_sk(sk)->bytes_received += skb->len;
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 159add48f6d9..426acc03a932 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -118,17 +118,6 @@ static void mptcp_drop(struct sock *sk, struct sk_buff *skb)
 	__kfree_skb(skb);
 }
 
-static void mptcp_rmem_fwd_alloc_add(struct sock *sk, int size)
-{
-	WRITE_ONCE(mptcp_sk(sk)->rmem_fwd_alloc,
-		   mptcp_sk(sk)->rmem_fwd_alloc + size);
-}
-
-static void mptcp_rmem_charge(struct sock *sk, int size)
-{
-	mptcp_rmem_fwd_alloc_add(sk, -size);
-}
-
 static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 			       struct sk_buff *from)
 {
@@ -149,7 +138,7 @@ static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 	 * negative one
 	 */
 	atomic_add(delta, &sk->sk_rmem_alloc);
-	mptcp_rmem_charge(sk, delta);
+	sk_mem_charge(sk, delta);
 	kfree_skb_partial(from, fragstolen);
 
 	return true;
@@ -164,44 +153,6 @@ static bool mptcp_ooo_try_coalesce(struct mptcp_sock *msk, struct sk_buff *to,
 	return mptcp_try_coalesce((struct sock *)msk, to, from);
 }
 
-static void __mptcp_rmem_reclaim(struct sock *sk, int amount)
-{
-	amount >>= PAGE_SHIFT;
-	mptcp_rmem_charge(sk, amount << PAGE_SHIFT);
-	__sk_mem_reduce_allocated(sk, amount);
-}
-
-static void mptcp_rmem_uncharge(struct sock *sk, int size)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-	int reclaimable;
-
-	mptcp_rmem_fwd_alloc_add(sk, size);
-	reclaimable = msk->rmem_fwd_alloc - sk_unused_reserved_mem(sk);
-
-	/* see sk_mem_uncharge() for the rationale behind the following schema */
-	if (unlikely(reclaimable >= PAGE_SIZE))
-		__mptcp_rmem_reclaim(sk, reclaimable);
-}
-
-static void mptcp_rfree(struct sk_buff *skb)
-{
-	unsigned int len = skb->truesize;
-	struct sock *sk = skb->sk;
-
-	atomic_sub(len, &sk->sk_rmem_alloc);
-	mptcp_rmem_uncharge(sk, len);
-}
-
-void mptcp_set_owner_r(struct sk_buff *skb, struct sock *sk)
-{
-	skb_orphan(skb);
-	skb->sk = sk;
-	skb->destructor = mptcp_rfree;
-	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
-	mptcp_rmem_charge(sk, skb->truesize);
-}
-
 /* "inspired" by tcp_data_queue_ofo(), main differences:
  * - use mptcp seqs
  * - don't cope with sacks
@@ -314,25 +265,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 
 end:
 	skb_condense(skb);
-	mptcp_set_owner_r(skb, sk);
-}
-
-static bool mptcp_rmem_schedule(struct sock *sk, struct sock *ssk, int size)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-	int amt, amount;
-
-	if (size <= msk->rmem_fwd_alloc)
-		return true;
-
-	size -= msk->rmem_fwd_alloc;
-	amt = sk_mem_pages(size);
-	amount = amt << PAGE_SHIFT;
-	if (!__sk_mem_raise_allocated(sk, size, amt, SK_MEM_RECV))
-		return false;
-
-	mptcp_rmem_fwd_alloc_add(sk, amount);
-	return true;
+	skb_set_owner_r(skb, sk);
 }
 
 static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
@@ -350,7 +283,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
 	skb_orphan(skb);
 
 	/* try to fetch required memory from subflow */
-	if (!mptcp_rmem_schedule(sk, ssk, skb->truesize)) {
+	if (!sk_rmem_schedule(sk, skb, skb->truesize)) {
 		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
 		goto drop;
 	}
@@ -374,7 +307,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
 		if (tail && mptcp_try_coalesce(sk, tail, skb))
 			return true;
 
-		mptcp_set_owner_r(skb, sk);
+		skb_set_owner_r(skb, sk);
 		__skb_queue_tail(&sk->sk_receive_queue, skb);
 		return true;
 	} else if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq)) {
@@ -1077,17 +1010,10 @@ static void __mptcp_clean_una(struct sock *sk)
 
 static void __mptcp_clean_una_wakeup(struct sock *sk)
 {
-	lockdep_assert_held_once(&sk->sk_lock.slock);
-
 	__mptcp_clean_una(sk);
 	mptcp_write_space(sk);
 }
 
-static void mptcp_clean_una_wakeup(struct sock *sk)
-{
-	__mptcp_clean_una_wakeup(sk);
-}
-
 static void mptcp_enter_memory_pressure(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow;
@@ -1992,9 +1918,10 @@ static int __mptcp_recvmsg_mskq(struct sock *sk,
 		}
 
 		if (!(flags & MSG_PEEK)) {
-			/* we will bulk release the skb memory later */
+			/* avoid the indirect call, we know the destructor is sock_wfree */
 			skb->destructor = NULL;
-			WRITE_ONCE(msk->rmem_released, msk->rmem_released + skb->truesize);
+			atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+			sk_mem_uncharge(sk, skb->truesize);
 			__skb_unlink(skb, &sk->sk_receive_queue);
 			__kfree_skb(skb);
 			msk->bytes_consumed += count;
@@ -2107,18 +2034,6 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	msk->rcvq_space.time = mstamp;
 }
 
-static void __mptcp_update_rmem(struct sock *sk)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	if (!msk->rmem_released)
-		return;
-
-	atomic_sub(msk->rmem_released, &sk->sk_rmem_alloc);
-	mptcp_rmem_uncharge(sk, msk->rmem_released);
-	WRITE_ONCE(msk->rmem_released, 0);
-}
-
 static bool __mptcp_move_skbs(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -2133,7 +2048,6 @@ static bool __mptcp_move_skbs(struct sock *sk)
 			break;
 
 		slowpath = lock_sock_fast(ssk);
-		__mptcp_update_rmem(sk);
 		done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
 
 		if (unlikely(ssk->sk_err))
@@ -2143,10 +2057,9 @@ static bool __mptcp_move_skbs(struct sock *sk)
 
 	ret = moved > 0;
 	if (!RB_EMPTY_ROOT(&msk->out_of_order_queue) ||
-	    !skb_queue_empty(&sk->sk_receive_queue)) {
-		__mptcp_update_rmem(sk);
+	    !skb_queue_empty(&sk->sk_receive_queue))
 		ret |= __mptcp_ofo_queue(msk);
-	}
+
 	if (ret)
 		mptcp_check_data_fin((struct sock *)msk);
 	return ret;
@@ -2371,17 +2284,13 @@ bool __mptcp_retransmit_pending_data(struct sock *sk)
 	 * some data in the mptcp rtx queue has not really xmitted yet.
 	 * keep it simple and re-inject the whole mptcp level rtx queue
 	 */
-	mptcp_data_lock(sk);
 	__mptcp_clean_una_wakeup(sk);
 	rtx_head = mptcp_rtx_head(sk);
-	if (!rtx_head) {
-		mptcp_data_unlock(sk);
+	if (!rtx_head)
 		return false;
-	}
 
 	msk->recovery_snd_nxt = msk->snd_nxt;
 	msk->recovery = true;
-	mptcp_data_unlock(sk);
 
 	msk->first_pending = rtx_head;
 	msk->snd_burst = 0;
@@ -2640,7 +2549,7 @@ static void __mptcp_retrans(struct sock *sk)
 	int ret, err;
 	u16 len = 0;
 
-	mptcp_clean_una_wakeup(sk);
+	__mptcp_clean_una_wakeup(sk);
 
 	/* first check ssk: need to kick "stale" logic */
 	err = mptcp_sched_get_retrans(msk);
@@ -2813,8 +2722,6 @@ static void __mptcp_init_sock(struct sock *sk)
 	INIT_WORK(&msk->work, mptcp_worker);
 	msk->out_of_order_queue = RB_ROOT;
 	msk->first_pending = NULL;
-	WRITE_ONCE(msk->rmem_fwd_alloc, 0);
-	WRITE_ONCE(msk->rmem_released, 0);
 	msk->timer_ival = TCP_RTO_MIN;
 	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
 
@@ -3040,8 +2947,6 @@ static void __mptcp_destroy_sock(struct sock *sk)
 
 	sk->sk_prot->destroy(sk);
 
-	WARN_ON_ONCE(READ_ONCE(msk->rmem_fwd_alloc));
-	WARN_ON_ONCE(msk->rmem_released);
 	sk_stream_kill_queues(sk);
 	xfrm_sk_free_policy(sk);
 
@@ -3399,8 +3304,6 @@ void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
 	/* move all the rx fwd alloc into the sk_mem_reclaim_final in
 	 * inet_sock_destruct() will dispose it
 	 */
-	sk_forward_alloc_add(sk, msk->rmem_fwd_alloc);
-	WRITE_ONCE(msk->rmem_fwd_alloc, 0);
 	mptcp_token_destroy(msk);
 	mptcp_pm_free_anno_list(msk);
 	mptcp_free_local_addr_list(msk);
@@ -3493,8 +3396,6 @@ static void mptcp_release_cb(struct sock *sk)
 		if (__test_and_clear_bit(MPTCP_SYNC_SNDBUF, &msk->cb_flags))
 			__mptcp_sync_sndbuf(sk);
 	}
-
-	__mptcp_update_rmem(sk);
 }
 
 /* MP_JOIN client subflow must wait for 4th ack before sending any data:
@@ -3665,12 +3566,6 @@ static void mptcp_shutdown(struct sock *sk, int how)
 		__mptcp_wr_shutdown(sk);
 }
 
-static int mptcp_forward_alloc_get(const struct sock *sk)
-{
-	return READ_ONCE(sk->sk_forward_alloc) +
-	       READ_ONCE(mptcp_sk(sk)->rmem_fwd_alloc);
-}
-
 static int mptcp_ioctl_outq(const struct mptcp_sock *msk, u64 v)
 {
 	const struct sock *sk = (void *)msk;
@@ -3828,7 +3723,6 @@ static struct proto mptcp_prot = {
 	.hash		= mptcp_hash,
 	.unhash		= mptcp_unhash,
 	.get_port	= mptcp_get_port,
-	.forward_alloc_get	= mptcp_forward_alloc_get,
 	.stream_memory_free	= mptcp_stream_memory_free,
 	.sockets_allocated	= &mptcp_sockets_allocated,
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index ad940cc1f26f..a0d46b69746d 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -278,7 +278,6 @@ struct mptcp_sock {
 	u64		rcv_data_fin_seq;
 	u64		bytes_retrans;
 	u64		bytes_consumed;
-	int		rmem_fwd_alloc;
 	int		snd_burst;
 	int		old_wspace;
 	u64		recovery_snd_nxt;	/* in recovery mode accept up to this seq;
@@ -293,7 +292,6 @@ struct mptcp_sock {
 	u32		last_ack_recv;
 	unsigned long	timer_ival;
 	u32		token;
-	int		rmem_released;
 	unsigned long	flags;
 	unsigned long	cb_flags;
 	bool		recovery;		/* closing subflow write queue reinjected */
@@ -384,7 +382,7 @@ static inline void msk_owned_by_me(const struct mptcp_sock *msk)
  */
 static inline int __mptcp_rmem(const struct sock *sk)
 {
-	return atomic_read(&sk->sk_rmem_alloc) - READ_ONCE(mptcp_sk(sk)->rmem_released);
+	return atomic_read(&sk->sk_rmem_alloc);
 }
 
 static inline int mptcp_win_from_space(const struct sock *sk, int space)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH mptcp-next-next 0/3] mptcp: rx path refactor
  2024-11-29 17:45 [PATCH mptcp-next-next 0/3] mptcp: rx path refactor Paolo Abeni
                   ` (2 preceding siblings ...)
  2024-11-29 17:45 ` [PATCH mptcp-next-next 3/3] mptcp: cleanup mem accounting Paolo Abeni
@ 2024-11-29 18:52 ` MPTCP CI
  2024-12-02  8:19   ` Paolo Abeni
  2024-12-02 16:56 ` Matthieu Baerts
  4 siblings, 1 reply; 9+ messages in thread
From: MPTCP CI @ 2024-11-29 18:52 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal: Success! ✅
- KVM Validation: debug: Unstable: 1 failed test(s): selftest_mptcp_connect 🔴
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/12088981326

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/818e1f7b5c07
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=913372


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH mptcp-next-next 0/3] mptcp: rx path refactor
  2024-11-29 18:52 ` [PATCH mptcp-next-next 0/3] mptcp: rx path refactor MPTCP CI
@ 2024-12-02  8:19   ` Paolo Abeni
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2024-12-02  8:19 UTC (permalink / raw)
  To: mptcp, Matthieu Baerts (NGI0)

On 11/29/24 19:52, MPTCP CI wrote:
> Thank you for your modifications, that's great!
> 
> Our CI did some validations and here is its report:
> 
> - KVM Validation: normal: Success! ✅
> - KVM Validation: debug: Unstable: 1 failed test(s): selftest_mptcp_connect 🔴

This is somewhat 'funny'; I can't reproduce it locally, and other tests
do the connection successfully. I guess there is some race due to the
slow env?!? Is virtme still running without KVM ?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH mptcp-next-next 0/3] mptcp: rx path refactor
  2024-11-29 17:45 [PATCH mptcp-next-next 0/3] mptcp: rx path refactor Paolo Abeni
                   ` (3 preceding siblings ...)
  2024-11-29 18:52 ` [PATCH mptcp-next-next 0/3] mptcp: rx path refactor MPTCP CI
@ 2024-12-02 16:56 ` Matthieu Baerts
  4 siblings, 0 replies; 9+ messages in thread
From: Matthieu Baerts @ 2024-12-02 16:56 UTC (permalink / raw)
  To: Paolo Abeni, mptcp

Hi Paolo,

On 29/11/2024 18:45, Paolo Abeni wrote:
> This is a batch of changes I had sitting in my local tree for a while.
> Why another refactor you may ask? Two main resons:
> 
> - currently the mptcp RX path introduces quite a bit of 'exceptional'
>  accounting/locking processing WRT to plain TCP, adding up to the
>  implementation complexity in a misurable way
> - the performance gap WRT plain TCP for single subflow connections is
>  quite measurable.
> 
> The present refactor addresses both the above items: most of the
> additional complexity is dropped, and single stream performances
> increase measurably - from 55Gbps to 71Gbps in my loopback test. As a
> reference, plain TCP is around 84Gps on the same host.

Nice clean-up, with an excellent improvement!

> The above comes to a price: the patch are invasive, even in subtle ways:
> the chance of destabilizing the implementation is real (ence the
> additional, intentional '-next' into the subj).

Because we are only at the beginning of a new cycle, do we need to have
this in '-next'? If we can have syzkaller validating this for a couple
of weeks, would it not be OK to send that to netdev, to have it in
linux-next for a few weeks?

> In any case keeping the patch hidden for longer was not going to do any
> good, so here we are.
> 
> Paolo Abeni (3):
>   mptcp: consolidate subflow cleanup
>   mptcp: move the whole rx path under msk socket lock protection
>   mptcp: cleanup mem accounting.

The series look good to me: the modifications seem to make sense, but I
don't know well this part that we usually avoid modifying :)

@Mat: do you mind having a look please?

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH mptcp-next-next 2/3] mptcp: move the whole rx path under msk socket lock protection
  2024-11-29 17:45 ` [PATCH mptcp-next-next 2/3] mptcp: move the whole rx path under msk socket lock protection Paolo Abeni
@ 2024-12-02 16:56   ` Matthieu Baerts
  2024-12-03  7:40     ` Paolo Abeni
  0 siblings, 1 reply; 9+ messages in thread
From: Matthieu Baerts @ 2024-12-02 16:56 UTC (permalink / raw)
  To: Paolo Abeni, mptcp

Hi Paolo,

On 29/11/2024 18:45, Paolo Abeni wrote:
> After commit c2e6048fa1cf ("mptcp: fix race in release_cb") it's
> pretty straight forward move the whole MPTCP rx path under the socket
> lock leveraging the release_cb.
> 
> We can drop a bunch of spin_lock pairs in the receive functions, use
> a single receive queue and invoke __mptcp_move_skbs only when subflows
> ask for it.
> 
> This will allow more cleanup in the next patch

(...)

> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index f768aa4473fb..159add48f6d9 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c

(...)

> @@ -2290,9 +2283,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  		}
>  	}
>  
> -	pr_debug("msk=%p rx queue empty=%d:%d copied=%d\n",
> -		 msk, skb_queue_empty_lockless(&sk->sk_receive_queue),
> -		 skb_queue_empty(&msk->receive_queue), copied);
> +	pr_debug("msk=%p rx queue empty=%d copied=%d",

A small detail: the '\n' at the end is missing: no need to delay the
output if it is not supposed to be used with a pr_cont().

(Something we can fix when applying the patches.)

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH mptcp-next-next 2/3] mptcp: move the whole rx path under msk socket lock protection
  2024-12-02 16:56   ` Matthieu Baerts
@ 2024-12-03  7:40     ` Paolo Abeni
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2024-12-03  7:40 UTC (permalink / raw)
  To: Matthieu Baerts, mptcp

On 12/2/24 17:56, Matthieu Baerts wrote:
> Hi Paolo,
> 
> On 29/11/2024 18:45, Paolo Abeni wrote:
>> After commit c2e6048fa1cf ("mptcp: fix race in release_cb") it's
>> pretty straight forward move the whole MPTCP rx path under the socket
>> lock leveraging the release_cb.
>>
>> We can drop a bunch of spin_lock pairs in the receive functions, use
>> a single receive queue and invoke __mptcp_move_skbs only when subflows
>> ask for it.
>>
>> This will allow more cleanup in the next patch
> 
> (...)
> 
>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>> index f768aa4473fb..159add48f6d9 100644
>> --- a/net/mptcp/protocol.c
>> +++ b/net/mptcp/protocol.c
> 
> (...)
> 
>> @@ -2290,9 +2283,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>>  		}
>>  	}
>>  
>> -	pr_debug("msk=%p rx queue empty=%d:%d copied=%d\n",
>> -		 msk, skb_queue_empty_lockless(&sk->sk_receive_queue),
>> -		 skb_queue_empty(&msk->receive_queue), copied);
>> +	pr_debug("msk=%p rx queue empty=%d copied=%d",
> 
> A small detail: the '\n' at the end is missing: no need to delay the
> output if it is not supposed to be used with a pr_cont().

Thanks! I'll fix in the next revision. I think there is at least a
fixable race in the current code (the one causing the ST failure -
different root cause from
https://github.com/multipath-tcp/mptcp_net-next/issues/483, this is
specific to the refactor).

I hope to be able to share the next version much later today.

Cheers,

Paolo

> 
> (Something we can fix when applying the patches.)
> 
> Cheers,
> Matt


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-12-03  7:40 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-29 17:45 [PATCH mptcp-next-next 0/3] mptcp: rx path refactor Paolo Abeni
2024-11-29 17:45 ` [PATCH mptcp-next-next 1/3] mptcp: consolidate subflow cleanup Paolo Abeni
2024-11-29 17:45 ` [PATCH mptcp-next-next 2/3] mptcp: move the whole rx path under msk socket lock protection Paolo Abeni
2024-12-02 16:56   ` Matthieu Baerts
2024-12-03  7:40     ` Paolo Abeni
2024-11-29 17:45 ` [PATCH mptcp-next-next 3/3] mptcp: cleanup mem accounting Paolo Abeni
2024-11-29 18:52 ` [PATCH mptcp-next-next 0/3] mptcp: rx path refactor MPTCP CI
2024-12-02  8:19   ` Paolo Abeni
2024-12-02 16:56 ` Matthieu Baerts

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.