[PATCH mptcp-next v2 0/7] mptcp: rx path refactor

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH mptcp-next v2 0/7] mptcp: rx path refactor
@ 2024-12-06 12:09 Paolo Abeni
  2024-12-06 12:10 ` [PATCH mptcp-next v2 1/7] mptcp: prevent excessive coalescing on receive Paolo Abeni
                   ` (10 more replies)
  0 siblings, 11 replies; 18+ messages in thread
From: Paolo Abeni @ 2024-12-06 12:09 UTC (permalink / raw)
  To: mptcp

This is a batch of changes I had sitting in my local tree for a while.
Why another refactor you may ask? Two main resons:

- currently the mptcp RX path introduces quite a bit of 'exceptional'
 accounting/locking processing WRT to plain TCP, adding up to the
 implementation complexity in a misurable way
- the performance gap WRT plain TCP for single subflow connections is
 quite measurable.

The present refactor addresses both the above items: most of the
additional complexity is dropped, and single stream performances
increase measurably - from 55Gbps to 71Gbps in my loopback test. As a
reference, plain TCP is around 84Gps on the same host.

The above comes to a price: the patch are invasive, even in subtle ways:
the chance of destabilizing the implementation is real (ence the
additional, intentional '-next' into the subj).

In any case keeping the patch hidden for longer was not going to do any
good, so here we are.

Changes from v1:
  - fixed several data stream corruption and wake-up misses due
    to multi subflows races
  - added patches 1-3 mainly to address the above
  - added an additional follow-up patch (patch 7) with more cleanup

Paolo Abeni (7):
  mptcp: prevent excessive coalescing on receive
  tcp: fix recvbuffer adjust on sleeping rcvmsg
  mptcp: don't always assume copied data in mptcp_cleanup_rbuf()
  mptcp: consolidate subflow cleanup
  mptcp: move the whole rx path under msk socket lock protection
  mptcp: cleanup mem accounting.
  net: dismiss sk_forward_alloc_get()

 include/net/sock.h   |  13 ---
 net/core/sock.c      |   2 +-
 net/ipv4/af_inet.c   |   2 +-
 net/ipv4/inet_diag.c |   2 +-
 net/mptcp/fastopen.c |   4 +-
 net/mptcp/protocol.c | 259 +++++++++++++------------------------------
 net/mptcp/protocol.h |   6 +-
 net/mptcp/subflow.c  |  33 +++---
 net/sched/em_meta.c  |   2 +-
 9 files changed, 104 insertions(+), 219 deletions(-)

-- 
2.45.2

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH mptcp-next v2 1/7] mptcp: prevent excessive coalescing on receive
  2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
@ 2024-12-06 12:10 ` Paolo Abeni
  2024-12-10 12:03   ` Matthieu Baerts
  2024-12-06 12:10 ` [PATCH mptcp-next v2 2/7] tcp: fix recvbuffer adjust on sleeping rcvmsg Paolo Abeni
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2024-12-06 12:10 UTC (permalink / raw)
  To: mptcp

Currently the skb size after coalescing is only limited by the skb
layout (the skb must not carry frag_list). A single coalesced skb
covering several MSS can potentially fill completely the receive
buffer. In such a case, the snd win will zero until the receive buffer
will be empty again, affecting tput badly.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
No fixes tag because the problem is not very visible in practice
currently, but will be apparent after the rx refactor.
Still I hope this could affect positively the simlut flows self-tests
---
 net/mptcp/protocol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index f768aa4473fb..fd9593f85a98 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -136,6 +136,7 @@ static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 	int delta;
 
 	if (MPTCP_SKB_CB(from)->offset ||
+	    ((to->len + from->len) > (sk->sk_rcvbuf >> 3)) ||
 	    !skb_try_coalesce(to, from, &fragstolen, &delta))
 		return false;
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH mptcp-next v2 2/7] tcp: fix recvbuffer adjust on sleeping rcvmsg
  2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
  2024-12-06 12:10 ` [PATCH mptcp-next v2 1/7] mptcp: prevent excessive coalescing on receive Paolo Abeni
@ 2024-12-06 12:10 ` Paolo Abeni
  2024-12-06 12:10 ` [PATCH mptcp-next v2 3/7] mptcp: don't always assume copied data in mptcp_cleanup_rbuf() Paolo Abeni
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paolo Abeni @ 2024-12-06 12:10 UTC (permalink / raw)
  To: mptcp

If the recvmsg() blocks after receiving some data - i.e. due to
SO_RCVLOWAT - the MPTCP code will attempt multiple times to
adjust the receive buffer size, wrongly accounting every time the
cumulative of received data - instead of accounting only for the
delta.

Address the issue moving mptcp_rcv_space_adjust just after the
data reception and passing it only the just received bytes.

This also remove an unneeded difference between the TCP and MPTCP
RX code path implementation.

Fixes: 581302298524 ("mptcp: error out earlier on disconnect")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
Not strictly related to the refactor, found while investigating
the CI failure
---
 net/mptcp/protocol.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index fd9593f85a98..bca8c2c046c3 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1940,6 +1940,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	goto out;
 }
 
+static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied);
+
 static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
 				struct msghdr *msg,
 				size_t len, int flags,
@@ -1993,6 +1995,7 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
 			break;
 	}
 
+	mptcp_rcv_space_adjust(msk, copied);
 	return copied;
 }
 
@@ -2269,7 +2272,6 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 
 		pr_debug("block timeout %ld\n", timeo);
-		mptcp_rcv_space_adjust(msk, copied);
 		err = sk_wait_data(sk, &timeo, NULL);
 		if (err < 0) {
 			err = copied ? : err;
@@ -2277,8 +2279,6 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 	}
 
-	mptcp_rcv_space_adjust(msk, copied);
-
 out_err:
 	if (cmsg_flags && copied >= 0) {
 		if (cmsg_flags & MPTCP_CMSG_TS)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH mptcp-next v2 3/7] mptcp: don't always assume copied data in mptcp_cleanup_rbuf()
  2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
  2024-12-06 12:10 ` [PATCH mptcp-next v2 1/7] mptcp: prevent excessive coalescing on receive Paolo Abeni
  2024-12-06 12:10 ` [PATCH mptcp-next v2 2/7] tcp: fix recvbuffer adjust on sleeping rcvmsg Paolo Abeni
@ 2024-12-06 12:10 ` Paolo Abeni
  2024-12-06 12:10 ` [PATCH mptcp-next v2 4/7] mptcp: consolidate subflow cleanup Paolo Abeni
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paolo Abeni @ 2024-12-06 12:10 UTC (permalink / raw)
  To: mptcp

Under some corner cases the MPTCP protocol can end-up invoking
mptcp_cleanup_rbuf() when no data has been copied, but such helper
assumes the opposite condition.

Explicitly drop such assumption and performs the costly call only
when strictly needed - before releasing the msk socket lock.

Fixes: fd8976790a6c ("mptcp: be careful on MPTCP-level ack.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index bca8c2c046c3..690614816749 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -529,13 +529,13 @@ static void mptcp_send_ack(struct mptcp_sock *msk)
 		mptcp_subflow_send_ack(mptcp_subflow_tcp_sock(subflow));
 }
 
-static void mptcp_subflow_cleanup_rbuf(struct sock *ssk)
+static void mptcp_subflow_cleanup_rbuf(struct sock *ssk, int copied)
 {
 	bool slow;
 
 	slow = lock_sock_fast(ssk);
 	if (tcp_can_send_ack(ssk))
-		tcp_cleanup_rbuf(ssk, 1);
+		tcp_cleanup_rbuf(ssk, copied);
 	unlock_sock_fast(ssk, slow);
 }
 
@@ -552,7 +552,7 @@ static bool mptcp_subflow_could_cleanup(const struct sock *ssk, bool rx_empty)
 			      (ICSK_ACK_PUSHED2 | ICSK_ACK_PUSHED)));
 }
 
-static void mptcp_cleanup_rbuf(struct mptcp_sock *msk)
+static void mptcp_cleanup_rbuf(struct mptcp_sock *msk, int copied)
 {
 	int old_space = READ_ONCE(msk->old_wspace);
 	struct mptcp_subflow_context *subflow;
@@ -560,14 +560,14 @@ static void mptcp_cleanup_rbuf(struct mptcp_sock *msk)
 	int space =  __mptcp_space(sk);
 	bool cleanup, rx_empty;
 
-	cleanup = (space > 0) && (space >= (old_space << 1));
-	rx_empty = !__mptcp_rmem(sk);
+	cleanup = (space > 0) && (space >= (old_space << 1)) && copied;
+	rx_empty = !__mptcp_rmem(sk) && copied;
 
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
 		if (cleanup || mptcp_subflow_could_cleanup(ssk, rx_empty))
-			mptcp_subflow_cleanup_rbuf(ssk);
+			mptcp_subflow_cleanup_rbuf(ssk, copied);
 	}
 }
 
@@ -2221,9 +2221,6 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 
 		copied += bytes_read;
 
-		/* be sure to advertise window change */
-		mptcp_cleanup_rbuf(msk);
-
 		if (skb_queue_empty(&msk->receive_queue) && __mptcp_move_skbs(msk))
 			continue;
 
@@ -2272,6 +2269,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 
 		pr_debug("block timeout %ld\n", timeo);
+		mptcp_cleanup_rbuf(msk, copied);
 		err = sk_wait_data(sk, &timeo, NULL);
 		if (err < 0) {
 			err = copied ? : err;
@@ -2279,6 +2277,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 	}
 
+	mptcp_cleanup_rbuf(msk, copied);
+
 out_err:
 	if (cmsg_flags && copied >= 0) {
 		if (cmsg_flags & MPTCP_CMSG_TS)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH mptcp-next v2 4/7] mptcp: consolidate subflow cleanup
  2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
                   ` (2 preceding siblings ...)
  2024-12-06 12:10 ` [PATCH mptcp-next v2 3/7] mptcp: don't always assume copied data in mptcp_cleanup_rbuf() Paolo Abeni
@ 2024-12-06 12:10 ` Paolo Abeni
  2024-12-06 12:11 ` [PATCH mptcp-next v2 5/7] mptcp: move the whole rx path under msk socket lock protection Paolo Abeni
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paolo Abeni @ 2024-12-06 12:10 UTC (permalink / raw)
  To: mptcp

Consolidate all the cleanup actions requiring the worker in a single
helper and ensure the dummy data fin creation for fallback socket is
performed only when the tcp rx queue is empty.

There are no functional changes intended, but this will simplify the
next patch, when the tcp rx queue spooling could be delayed at release_cb
time.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/subflow.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index fd021cf8286e..2926bdf88e42 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1271,7 +1271,12 @@ static void mptcp_subflow_discard_data(struct sock *ssk, struct sk_buff *skb,
 		subflow->map_valid = 0;
 }
 
-/* sched mptcp worker to remove the subflow if no more data is pending */
+static bool subflow_is_done(const struct sock *sk)
+{
+	return sk->sk_shutdown & RCV_SHUTDOWN || sk->sk_state == TCP_CLOSE;
+}
+
+/* sched mptcp worker for subflow cleanup if no more data is pending */
 static void subflow_sched_work_if_closed(struct mptcp_sock *msk, struct sock *ssk)
 {
 	struct sock *sk = (struct sock *)msk;
@@ -1281,8 +1286,18 @@ static void subflow_sched_work_if_closed(struct mptcp_sock *msk, struct sock *ss
 		    inet_sk_state_load(sk) != TCP_ESTABLISHED)))
 		return;
 
-	if (skb_queue_empty(&ssk->sk_receive_queue) &&
-	    !test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, &msk->flags))
+	if (!skb_queue_empty(&ssk->sk_receive_queue))
+		return;
+
+	if (!test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, &msk->flags))
+		mptcp_schedule_work(sk);
+
+	/* when the fallback subflow closes the rx side, trigger a 'dummy'
+	 * ingress data fin, so that the msk state will follow along
+	 */
+	if (__mptcp_check_fallback(msk) && subflow_is_done(ssk) &&
+	    msk->first == ssk &&
+	    mptcp_update_rcv_data_fin(msk, READ_ONCE(msk->ack_seq), true))
 		mptcp_schedule_work(sk);
 }
 
@@ -1842,11 +1857,6 @@ static void __subflow_state_change(struct sock *sk)
 	rcu_read_unlock();
 }
 
-static bool subflow_is_done(const struct sock *sk)
-{
-	return sk->sk_shutdown & RCV_SHUTDOWN || sk->sk_state == TCP_CLOSE;
-}
-
 static void subflow_state_change(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
@@ -1873,13 +1883,6 @@ static void subflow_state_change(struct sock *sk)
 		subflow_error_report(sk);
 
 	subflow_sched_work_if_closed(mptcp_sk(parent), sk);
-
-	/* when the fallback subflow closes the rx side, trigger a 'dummy'
-	 * ingress data fin, so that the msk state will follow along
-	 */
-	if (__mptcp_check_fallback(msk) && subflow_is_done(sk) && msk->first == sk &&
-	    mptcp_update_rcv_data_fin(msk, READ_ONCE(msk->ack_seq), true))
-		mptcp_schedule_work(parent);
 }
 
 void mptcp_subflow_queue_clean(struct sock *listener_sk, struct sock *listener_ssk)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH mptcp-next v2 5/7] mptcp: move the whole rx path under msk socket lock protection
  2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
                   ` (3 preceding siblings ...)
  2024-12-06 12:10 ` [PATCH mptcp-next v2 4/7] mptcp: consolidate subflow cleanup Paolo Abeni
@ 2024-12-06 12:11 ` Paolo Abeni
  2024-12-06 12:11 ` [PATCH mptcp-next v2 6/7] mptcp: cleanup mem accounting Paolo Abeni
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paolo Abeni @ 2024-12-06 12:11 UTC (permalink / raw)
  To: mptcp

After commit c2e6048fa1cf ("mptcp: fix race in release_cb") we can
move the whole MPTCP rx path under the socket lock leveraging the
release_cb.

We can drop a bunch of spin_lock pairs in the receive functions, use
a single receive queue and invoke __mptcp_move_skbs only when subflows
ask for it.

This will allow more cleanup in the next patch.

Some changes are worth specific mention:

The msk rcvbuf update now always happens under both the msk and the
subflow socket lock: we can drop a bunch of ONCE annotation and
consolidate the checks.

When the skbs move is delayed at msk release callback time, even the
msk rcvbuf update is delayed; additionally take care of such action in
__mptcp_move_skbs().

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v1 -> v2:
 - cleanup && fixup msk sk_rcvbuf update
 - fix missed wakeup due to bad placed __mptcp_move_skbs(); move it
   form __mptcp_recvmsg_mskq into mptcp_recvmsg()
 - added missing '\n' in debug message format string
 - keep 'snd_una' && friends update under the mptcp data lock
---
 net/mptcp/fastopen.c |   2 +
 net/mptcp/protocol.c | 123 ++++++++++++++++++++-----------------------
 net/mptcp/protocol.h |   2 +-
 3 files changed, 61 insertions(+), 66 deletions(-)

diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
index a29ff901df75..fb945c0d50bf 100644
--- a/net/mptcp/fastopen.c
+++ b/net/mptcp/fastopen.c
@@ -49,6 +49,7 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
 	MPTCP_SKB_CB(skb)->has_rxtstamp = TCP_SKB_CB(skb)->has_rxtstamp;
 
 	mptcp_data_lock(sk);
+	DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk));
 
 	mptcp_set_owner_r(skb, sk);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);
@@ -65,6 +66,7 @@ void __mptcp_fastopen_gen_msk_ackseq(struct mptcp_sock *msk, struct mptcp_subflo
 	struct sock *sk = (struct sock *)msk;
 	struct sk_buff *skb;
 
+	DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk));
 	skb = skb_peek_tail(&sk->sk_receive_queue);
 	if (skb) {
 		WARN_ON_ONCE(MPTCP_SKB_CB(skb)->end_seq);
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 690614816749..42d04de32560 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -643,18 +643,6 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 	bool more_data_avail;
 	struct tcp_sock *tp;
 	bool done = false;
-	int sk_rbuf;
-
-	sk_rbuf = READ_ONCE(sk->sk_rcvbuf);
-
-	if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
-		int ssk_rbuf = READ_ONCE(ssk->sk_rcvbuf);
-
-		if (unlikely(ssk_rbuf > sk_rbuf)) {
-			WRITE_ONCE(sk->sk_rcvbuf, ssk_rbuf);
-			sk_rbuf = ssk_rbuf;
-		}
-	}
 
 	pr_debug("msk=%p ssk=%p\n", msk, ssk);
 	tp = tcp_sk(ssk);
@@ -722,7 +710,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 		WRITE_ONCE(tp->copied_seq, seq);
 		more_data_avail = mptcp_subflow_data_available(ssk);
 
-		if (atomic_read(&sk->sk_rmem_alloc) > sk_rbuf) {
+		if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) {
 			done = true;
 			break;
 		}
@@ -846,11 +834,30 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	return moved > 0;
 }
 
+static void __mptcp_rcvbuf_update(struct sock *sk, struct sock *ssk)
+{
+	if (unlikely(ssk->sk_rcvbuf > sk->sk_rcvbuf))
+		WRITE_ONCE(sk->sk_rcvbuf, ssk->sk_rcvbuf);
+}
+
+static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	__mptcp_rcvbuf_update(sk, ssk);
+
+	/* over limit? can't append more skbs to msk, Also, no need to wake-up*/
+	if (__mptcp_rmem(sk) > sk->sk_rcvbuf)
+		return;
+
+	/* Wake-up the reader only for in-sequence data */
+	if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
+		sk->sk_data_ready(sk);
+}
+
 void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
-	struct mptcp_sock *msk = mptcp_sk(sk);
-	int sk_rbuf, ssk_rbuf;
 
 	/* The peer can send data while we are shutting down this
 	 * subflow at msk destruction time, but we must avoid enqueuing
@@ -859,19 +866,11 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 	if (unlikely(subflow->disposable))
 		return;
 
-	ssk_rbuf = READ_ONCE(ssk->sk_rcvbuf);
-	sk_rbuf = READ_ONCE(sk->sk_rcvbuf);
-	if (unlikely(ssk_rbuf > sk_rbuf))
-		sk_rbuf = ssk_rbuf;
-
-	/* over limit? can't append more skbs to msk, Also, no need to wake-up*/
-	if (__mptcp_rmem(sk) > sk_rbuf)
-		return;
-
-	/* Wake-up the reader only for in-sequence data */
 	mptcp_data_lock(sk);
-	if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
-		sk->sk_data_ready(sk);
+	if (!sock_owned_by_user(sk))
+		__mptcp_data_ready(sk, ssk);
+	else
+		__set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->cb_flags);
 	mptcp_data_unlock(sk);
 }
 
@@ -1942,16 +1941,17 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied);
 
-static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
+static int __mptcp_recvmsg_mskq(struct sock *sk,
 				struct msghdr *msg,
 				size_t len, int flags,
 				struct scm_timestamping_internal *tss,
 				int *cmsg_flags)
 {
+	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct sk_buff *skb, *tmp;
 	int copied = 0;
 
-	skb_queue_walk_safe(&msk->receive_queue, skb, tmp) {
+	skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
 		u32 offset = MPTCP_SKB_CB(skb)->offset;
 		u32 data_len = skb->len - offset;
 		u32 count = min_t(size_t, len - copied, data_len);
@@ -1986,7 +1986,7 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
 			/* we will bulk release the skb memory later */
 			skb->destructor = NULL;
 			WRITE_ONCE(msk->rmem_released, msk->rmem_released + skb->truesize);
-			__skb_unlink(skb, &msk->receive_queue);
+			__skb_unlink(skb, &sk->sk_receive_queue);
 			__kfree_skb(skb);
 			msk->bytes_consumed += count;
 		}
@@ -2111,54 +2111,46 @@ static void __mptcp_update_rmem(struct sock *sk)
 	WRITE_ONCE(msk->rmem_released, 0);
 }
 
-static void __mptcp_splice_receive_queue(struct sock *sk)
+static bool __mptcp_move_skbs(struct sock *sk)
 {
+	struct mptcp_subflow_context *subflow;
 	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	skb_queue_splice_tail_init(&sk->sk_receive_queue, &msk->receive_queue);
-}
-
-static bool __mptcp_move_skbs(struct mptcp_sock *msk)
-{
-	struct sock *sk = (struct sock *)msk;
 	unsigned int moved = 0;
 	bool ret, done;
 
+	/* verify we can move any data from the subflow, eventually updating */
+	if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
+		mptcp_for_each_subflow(msk, subflow)
+			__mptcp_rcvbuf_update(sk, subflow->tcp_sock);
+
+	if (__mptcp_rmem(sk) > sk->sk_rcvbuf)
+		return false;
+
 	do {
 		struct sock *ssk = mptcp_subflow_recv_lookup(msk);
 		bool slowpath;
 
-		/* we can have data pending in the subflows only if the msk
-		 * receive buffer was full at subflow_data_ready() time,
-		 * that is an unlikely slow path.
-		 */
-		if (likely(!ssk))
+		if (unlikely(!ssk))
 			break;
 
 		slowpath = lock_sock_fast(ssk);
-		mptcp_data_lock(sk);
 		__mptcp_update_rmem(sk);
 		done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
-		mptcp_data_unlock(sk);
 
 		if (unlikely(ssk->sk_err))
 			__mptcp_error_report(sk);
 		unlock_sock_fast(ssk, slowpath);
 	} while (!done);
 
-	/* acquire the data lock only if some input data is pending */
 	ret = moved > 0;
 	if (!RB_EMPTY_ROOT(&msk->out_of_order_queue) ||
-	    !skb_queue_empty_lockless(&sk->sk_receive_queue)) {
-		mptcp_data_lock(sk);
+	    !skb_queue_empty(&sk->sk_receive_queue)) {
 		__mptcp_update_rmem(sk);
 		ret |= __mptcp_ofo_queue(msk);
-		__mptcp_splice_receive_queue(sk);
-		mptcp_data_unlock(sk);
 	}
 	if (ret)
 		mptcp_check_data_fin((struct sock *)msk);
-	return !skb_queue_empty(&msk->receive_queue);
+	return ret;
 }
 
 static unsigned int mptcp_inq_hint(const struct sock *sk)
@@ -2166,7 +2158,7 @@ static unsigned int mptcp_inq_hint(const struct sock *sk)
 	const struct mptcp_sock *msk = mptcp_sk(sk);
 	const struct sk_buff *skb;
 
-	skb = skb_peek(&msk->receive_queue);
+	skb = skb_peek(&sk->sk_receive_queue);
 	if (skb) {
 		u64 hint_val = READ_ONCE(msk->ack_seq) - MPTCP_SKB_CB(skb)->map_seq;
 
@@ -2212,7 +2204,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	while (copied < len) {
 		int err, bytes_read;
 
-		bytes_read = __mptcp_recvmsg_mskq(msk, msg, len - copied, flags, &tss, &cmsg_flags);
+		bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags, &tss, &cmsg_flags);
 		if (unlikely(bytes_read < 0)) {
 			if (!copied)
 				copied = bytes_read;
@@ -2221,7 +2213,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 
 		copied += bytes_read;
 
-		if (skb_queue_empty(&msk->receive_queue) && __mptcp_move_skbs(msk))
+		if (skb_queue_empty(&sk->sk_receive_queue) && __mptcp_move_skbs(sk))
 			continue;
 
 		/* only the MPTCP socket status is relevant here. The exit
@@ -2247,7 +2239,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 				/* race breaker: the shutdown could be after the
 				 * previous receive queue check
 				 */
-				if (__mptcp_move_skbs(msk))
+				if (__mptcp_move_skbs(sk))
 					continue;
 				break;
 			}
@@ -2291,9 +2283,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 	}
 
-	pr_debug("msk=%p rx queue empty=%d:%d copied=%d\n",
-		 msk, skb_queue_empty_lockless(&sk->sk_receive_queue),
-		 skb_queue_empty(&msk->receive_queue), copied);
+	pr_debug("msk=%p rx queue empty=%d copied=%d\n",
+		 msk, skb_queue_empty(&sk->sk_receive_queue), copied);
 
 	release_sock(sk);
 	return copied;
@@ -2820,7 +2811,6 @@ static void __mptcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&msk->join_list);
 	INIT_LIST_HEAD(&msk->rtx_queue);
 	INIT_WORK(&msk->work, mptcp_worker);
-	__skb_queue_head_init(&msk->receive_queue);
 	msk->out_of_order_queue = RB_ROOT;
 	msk->first_pending = NULL;
 	WRITE_ONCE(msk->rmem_fwd_alloc, 0);
@@ -3403,12 +3393,8 @@ void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
 	mptcp_for_each_subflow_safe(msk, subflow, tmp)
 		__mptcp_close_ssk(sk, mptcp_subflow_tcp_sock(subflow), subflow, flags);
 
-	/* move to sk_receive_queue, sk_stream_kill_queues will purge it */
-	mptcp_data_lock(sk);
-	skb_queue_splice_tail_init(&msk->receive_queue, &sk->sk_receive_queue);
 	__skb_queue_purge(&sk->sk_receive_queue);
 	skb_rbtree_purge(&msk->out_of_order_queue);
-	mptcp_data_unlock(sk);
 
 	/* move all the rx fwd alloc into the sk_mem_reclaim_final in
 	 * inet_sock_destruct() will dispose it
@@ -3451,7 +3437,8 @@ void __mptcp_check_push(struct sock *sk, struct sock *ssk)
 
 #define MPTCP_FLAGS_PROCESS_CTX_NEED (BIT(MPTCP_PUSH_PENDING) | \
 				      BIT(MPTCP_RETRANSMIT) | \
-				      BIT(MPTCP_FLUSH_JOIN_LIST))
+				      BIT(MPTCP_FLUSH_JOIN_LIST) | \
+				      BIT(MPTCP_DEQUEUE))
 
 /* processes deferred events and flush wmem */
 static void mptcp_release_cb(struct sock *sk)
@@ -3485,6 +3472,11 @@ static void mptcp_release_cb(struct sock *sk)
 			__mptcp_push_pending(sk, 0);
 		if (flags & BIT(MPTCP_RETRANSMIT))
 			__mptcp_retrans(sk);
+		if ((flags & BIT(MPTCP_DEQUEUE)) && __mptcp_move_skbs(sk)) {
+			/* notify ack seq update */
+			mptcp_cleanup_rbuf(msk, 0);
+			sk->sk_data_ready(sk);
+		}
 
 		cond_resched();
 		spin_lock_bh(&sk->sk_lock.slock);
@@ -3722,7 +3714,8 @@ static int mptcp_ioctl(struct sock *sk, int cmd, int *karg)
 			return -EINVAL;
 
 		lock_sock(sk);
-		__mptcp_move_skbs(msk);
+		if (__mptcp_move_skbs(sk))
+			mptcp_cleanup_rbuf(msk, 0);
 		*karg = mptcp_inq_hint(sk);
 		release_sock(sk);
 		break;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index b4c72a73594f..ad940cc1f26f 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -124,6 +124,7 @@
 #define MPTCP_FLUSH_JOIN_LIST	5
 #define MPTCP_SYNC_STATE	6
 #define MPTCP_SYNC_SNDBUF	7
+#define MPTCP_DEQUEUE		8
 
 struct mptcp_skb_cb {
 	u64 map_seq;
@@ -322,7 +323,6 @@ struct mptcp_sock {
 	struct work_struct work;
 	struct sk_buff  *ooo_last_skb;
 	struct rb_root  out_of_order_queue;
-	struct sk_buff_head receive_queue;
 	struct list_head conn_list;
 	struct list_head rtx_queue;
 	struct mptcp_data_frag *first_pending;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH mptcp-next v2 6/7] mptcp: cleanup mem accounting.
  2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
                   ` (4 preceding siblings ...)
  2024-12-06 12:11 ` [PATCH mptcp-next v2 5/7] mptcp: move the whole rx path under msk socket lock protection Paolo Abeni
@ 2024-12-06 12:11 ` Paolo Abeni
  2024-12-07  1:45   ` Mat Martineau
  2024-12-06 12:11 ` [PATCH mptcp-next v2 7/7] net: dismiss sk_forward_alloc_get() Paolo Abeni
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2024-12-06 12:11 UTC (permalink / raw)
  To: mptcp

After the previous patch, updating sk_forward_memory is cheap and
we can drop a lot of complexity from the MPTCP memory acconting,
removing the custom fwd mem allocations for rmem.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v1 -> v2:
 - keep 'snd_una' and recovery-related fields under the msk
   data lock
 - dropped unneeded code in __mptcp_move_skbs()
---
 net/mptcp/fastopen.c |   2 +-
 net/mptcp/protocol.c | 115 +++----------------------------------------
 net/mptcp/protocol.h |   4 +-
 3 files changed, 10 insertions(+), 111 deletions(-)

diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
index fb945c0d50bf..b0f1dddfb143 100644
--- a/net/mptcp/fastopen.c
+++ b/net/mptcp/fastopen.c
@@ -51,7 +51,7 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
 	mptcp_data_lock(sk);
 	DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk));
 
-	mptcp_set_owner_r(skb, sk);
+	skb_set_owner_r(skb, sk);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);
 	mptcp_sk(sk)->bytes_received += skb->len;
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 42d04de32560..4f27b0cafac5 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -118,17 +118,6 @@ static void mptcp_drop(struct sock *sk, struct sk_buff *skb)
 	__kfree_skb(skb);
 }
 
-static void mptcp_rmem_fwd_alloc_add(struct sock *sk, int size)
-{
-	WRITE_ONCE(mptcp_sk(sk)->rmem_fwd_alloc,
-		   mptcp_sk(sk)->rmem_fwd_alloc + size);
-}
-
-static void mptcp_rmem_charge(struct sock *sk, int size)
-{
-	mptcp_rmem_fwd_alloc_add(sk, -size);
-}
-
 static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 			       struct sk_buff *from)
 {
@@ -150,7 +139,7 @@ static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 	 * negative one
 	 */
 	atomic_add(delta, &sk->sk_rmem_alloc);
-	mptcp_rmem_charge(sk, delta);
+	sk_mem_charge(sk, delta);
 	kfree_skb_partial(from, fragstolen);
 
 	return true;
@@ -165,44 +154,6 @@ static bool mptcp_ooo_try_coalesce(struct mptcp_sock *msk, struct sk_buff *to,
 	return mptcp_try_coalesce((struct sock *)msk, to, from);
 }
 
-static void __mptcp_rmem_reclaim(struct sock *sk, int amount)
-{
-	amount >>= PAGE_SHIFT;
-	mptcp_rmem_charge(sk, amount << PAGE_SHIFT);
-	__sk_mem_reduce_allocated(sk, amount);
-}
-
-static void mptcp_rmem_uncharge(struct sock *sk, int size)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-	int reclaimable;
-
-	mptcp_rmem_fwd_alloc_add(sk, size);
-	reclaimable = msk->rmem_fwd_alloc - sk_unused_reserved_mem(sk);
-
-	/* see sk_mem_uncharge() for the rationale behind the following schema */
-	if (unlikely(reclaimable >= PAGE_SIZE))
-		__mptcp_rmem_reclaim(sk, reclaimable);
-}
-
-static void mptcp_rfree(struct sk_buff *skb)
-{
-	unsigned int len = skb->truesize;
-	struct sock *sk = skb->sk;
-
-	atomic_sub(len, &sk->sk_rmem_alloc);
-	mptcp_rmem_uncharge(sk, len);
-}
-
-void mptcp_set_owner_r(struct sk_buff *skb, struct sock *sk)
-{
-	skb_orphan(skb);
-	skb->sk = sk;
-	skb->destructor = mptcp_rfree;
-	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
-	mptcp_rmem_charge(sk, skb->truesize);
-}
-
 /* "inspired" by tcp_data_queue_ofo(), main differences:
  * - use mptcp seqs
  * - don't cope with sacks
@@ -315,25 +266,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 
 end:
 	skb_condense(skb);
-	mptcp_set_owner_r(skb, sk);
-}
-
-static bool mptcp_rmem_schedule(struct sock *sk, struct sock *ssk, int size)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-	int amt, amount;
-
-	if (size <= msk->rmem_fwd_alloc)
-		return true;
-
-	size -= msk->rmem_fwd_alloc;
-	amt = sk_mem_pages(size);
-	amount = amt << PAGE_SHIFT;
-	if (!__sk_mem_raise_allocated(sk, size, amt, SK_MEM_RECV))
-		return false;
-
-	mptcp_rmem_fwd_alloc_add(sk, amount);
-	return true;
+	skb_set_owner_r(skb, sk);
 }
 
 static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
@@ -351,7 +284,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
 	skb_orphan(skb);
 
 	/* try to fetch required memory from subflow */
-	if (!mptcp_rmem_schedule(sk, ssk, skb->truesize)) {
+	if (!sk_rmem_schedule(sk, skb, skb->truesize)) {
 		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
 		goto drop;
 	}
@@ -375,7 +308,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
 		if (tail && mptcp_try_coalesce(sk, tail, skb))
 			return true;
 
-		mptcp_set_owner_r(skb, sk);
+		skb_set_owner_r(skb, sk);
 		__skb_queue_tail(&sk->sk_receive_queue, skb);
 		return true;
 	} else if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq)) {
@@ -1983,9 +1916,10 @@ static int __mptcp_recvmsg_mskq(struct sock *sk,
 		}
 
 		if (!(flags & MSG_PEEK)) {
-			/* we will bulk release the skb memory later */
+			/* avoid the indirect call, we know the destructor is sock_wfree */
 			skb->destructor = NULL;
-			WRITE_ONCE(msk->rmem_released, msk->rmem_released + skb->truesize);
+			atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+			sk_mem_uncharge(sk, skb->truesize);
 			__skb_unlink(skb, &sk->sk_receive_queue);
 			__kfree_skb(skb);
 			msk->bytes_consumed += count;
@@ -2099,18 +2033,6 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	msk->rcvq_space.time = mstamp;
 }
 
-static void __mptcp_update_rmem(struct sock *sk)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	if (!msk->rmem_released)
-		return;
-
-	atomic_sub(msk->rmem_released, &sk->sk_rmem_alloc);
-	mptcp_rmem_uncharge(sk, msk->rmem_released);
-	WRITE_ONCE(msk->rmem_released, 0);
-}
-
 static bool __mptcp_move_skbs(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow;
@@ -2134,7 +2056,6 @@ static bool __mptcp_move_skbs(struct sock *sk)
 			break;
 
 		slowpath = lock_sock_fast(ssk);
-		__mptcp_update_rmem(sk);
 		done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
 
 		if (unlikely(ssk->sk_err))
@@ -2142,12 +2063,7 @@ static bool __mptcp_move_skbs(struct sock *sk)
 		unlock_sock_fast(ssk, slowpath);
 	} while (!done);
 
-	ret = moved > 0;
-	if (!RB_EMPTY_ROOT(&msk->out_of_order_queue) ||
-	    !skb_queue_empty(&sk->sk_receive_queue)) {
-		__mptcp_update_rmem(sk);
-		ret |= __mptcp_ofo_queue(msk);
-	}
+	ret = moved > 0 || __mptcp_ofo_queue(msk);
 	if (ret)
 		mptcp_check_data_fin((struct sock *)msk);
 	return ret;
@@ -2813,8 +2729,6 @@ static void __mptcp_init_sock(struct sock *sk)
 	INIT_WORK(&msk->work, mptcp_worker);
 	msk->out_of_order_queue = RB_ROOT;
 	msk->first_pending = NULL;
-	WRITE_ONCE(msk->rmem_fwd_alloc, 0);
-	WRITE_ONCE(msk->rmem_released, 0);
 	msk->timer_ival = TCP_RTO_MIN;
 	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
 
@@ -3040,8 +2954,6 @@ static void __mptcp_destroy_sock(struct sock *sk)
 
 	sk->sk_prot->destroy(sk);
 
-	WARN_ON_ONCE(READ_ONCE(msk->rmem_fwd_alloc));
-	WARN_ON_ONCE(msk->rmem_released);
 	sk_stream_kill_queues(sk);
 	xfrm_sk_free_policy(sk);
 
@@ -3399,8 +3311,6 @@ void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
 	/* move all the rx fwd alloc into the sk_mem_reclaim_final in
 	 * inet_sock_destruct() will dispose it
 	 */
-	sk_forward_alloc_add(sk, msk->rmem_fwd_alloc);
-	WRITE_ONCE(msk->rmem_fwd_alloc, 0);
 	mptcp_token_destroy(msk);
 	mptcp_pm_free_anno_list(msk);
 	mptcp_free_local_addr_list(msk);
@@ -3496,8 +3406,6 @@ static void mptcp_release_cb(struct sock *sk)
 		if (__test_and_clear_bit(MPTCP_SYNC_SNDBUF, &msk->cb_flags))
 			__mptcp_sync_sndbuf(sk);
 	}
-
-	__mptcp_update_rmem(sk);
 }
 
 /* MP_JOIN client subflow must wait for 4th ack before sending any data:
@@ -3668,12 +3576,6 @@ static void mptcp_shutdown(struct sock *sk, int how)
 		__mptcp_wr_shutdown(sk);
 }
 
-static int mptcp_forward_alloc_get(const struct sock *sk)
-{
-	return READ_ONCE(sk->sk_forward_alloc) +
-	       READ_ONCE(mptcp_sk(sk)->rmem_fwd_alloc);
-}
-
 static int mptcp_ioctl_outq(const struct mptcp_sock *msk, u64 v)
 {
 	const struct sock *sk = (void *)msk;
@@ -3832,7 +3734,6 @@ static struct proto mptcp_prot = {
 	.hash		= mptcp_hash,
 	.unhash		= mptcp_unhash,
 	.get_port	= mptcp_get_port,
-	.forward_alloc_get	= mptcp_forward_alloc_get,
 	.stream_memory_free	= mptcp_stream_memory_free,
 	.sockets_allocated	= &mptcp_sockets_allocated,
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index ad940cc1f26f..a0d46b69746d 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -278,7 +278,6 @@ struct mptcp_sock {
 	u64		rcv_data_fin_seq;
 	u64		bytes_retrans;
 	u64		bytes_consumed;
-	int		rmem_fwd_alloc;
 	int		snd_burst;
 	int		old_wspace;
 	u64		recovery_snd_nxt;	/* in recovery mode accept up to this seq;
@@ -293,7 +292,6 @@ struct mptcp_sock {
 	u32		last_ack_recv;
 	unsigned long	timer_ival;
 	u32		token;
-	int		rmem_released;
 	unsigned long	flags;
 	unsigned long	cb_flags;
 	bool		recovery;		/* closing subflow write queue reinjected */
@@ -384,7 +382,7 @@ static inline void msk_owned_by_me(const struct mptcp_sock *msk)
  */
 static inline int __mptcp_rmem(const struct sock *sk)
 {
-	return atomic_read(&sk->sk_rmem_alloc) - READ_ONCE(mptcp_sk(sk)->rmem_released);
+	return atomic_read(&sk->sk_rmem_alloc);
 }
 
 static inline int mptcp_win_from_space(const struct sock *sk, int space)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH mptcp-next v2 7/7] net: dismiss sk_forward_alloc_get()
  2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
                   ` (5 preceding siblings ...)
  2024-12-06 12:11 ` [PATCH mptcp-next v2 6/7] mptcp: cleanup mem accounting Paolo Abeni
@ 2024-12-06 12:11 ` Paolo Abeni
  2024-12-06 13:17 ` [PATCH mptcp-next v2 0/7] mptcp: rx path refactor MPTCP CI
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paolo Abeni @ 2024-12-06 12:11 UTC (permalink / raw)
  To: mptcp

After the previous patch we can remove the forward_alloc_get
proto callback, basically reverting commit 292e6077b040 ("net: introduce
sk_forward_alloc_get()") and commit 66d58f046c9d ("net: use
sk_forward_alloc_get() in sk_get_meminfo()").

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/sock.h   | 13 -------------
 net/core/sock.c      |  2 +-
 net/ipv4/af_inet.c   |  2 +-
 net/ipv4/inet_diag.c |  2 +-
 net/sched/em_meta.c  |  2 +-
 5 files changed, 4 insertions(+), 17 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7464e9f9f47c..a5c28a4f0263 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1283,10 +1283,6 @@ struct proto {
 	unsigned int		inuse_idx;
 #endif
 
-#if IS_ENABLED(CONFIG_MPTCP)
-	int			(*forward_alloc_get)(const struct sock *sk);
-#endif
-
 	bool			(*stream_memory_free)(const struct sock *sk, int wake);
 	bool			(*sock_is_readable)(struct sock *sk);
 	/* Memory pressure */
@@ -1347,15 +1343,6 @@ int sock_load_diag_module(int family, int protocol);
 
 INDIRECT_CALLABLE_DECLARE(bool tcp_stream_memory_free(const struct sock *sk, int wake));
 
-static inline int sk_forward_alloc_get(const struct sock *sk)
-{
-#if IS_ENABLED(CONFIG_MPTCP)
-	if (sk->sk_prot->forward_alloc_get)
-		return sk->sk_prot->forward_alloc_get(sk);
-#endif
-	return READ_ONCE(sk->sk_forward_alloc);
-}
-
 static inline bool __sk_stream_memory_free(const struct sock *sk, int wake)
 {
 	if (READ_ONCE(sk->sk_wmem_queued) >= READ_ONCE(sk->sk_sndbuf))
diff --git a/net/core/sock.c b/net/core/sock.c
index 74729d20cd00..06b732604bf2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3858,7 +3858,7 @@ void sk_get_meminfo(const struct sock *sk, u32 *mem)
 	mem[SK_MEMINFO_RCVBUF] = READ_ONCE(sk->sk_rcvbuf);
 	mem[SK_MEMINFO_WMEM_ALLOC] = sk_wmem_alloc_get(sk);
 	mem[SK_MEMINFO_SNDBUF] = READ_ONCE(sk->sk_sndbuf);
-	mem[SK_MEMINFO_FWD_ALLOC] = sk_forward_alloc_get(sk);
+	mem[SK_MEMINFO_FWD_ALLOC] = READ_ONCE(sk->sk_forward_alloc);
 	mem[SK_MEMINFO_WMEM_QUEUED] = READ_ONCE(sk->sk_wmem_queued);
 	mem[SK_MEMINFO_OPTMEM] = atomic_read(&sk->sk_omem_alloc);
 	mem[SK_MEMINFO_BACKLOG] = READ_ONCE(sk->sk_backlog.len);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8095e82de808..a460ef3a2b0b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -153,7 +153,7 @@ void inet_sock_destruct(struct sock *sk)
 	WARN_ON_ONCE(atomic_read(&sk->sk_rmem_alloc));
 	WARN_ON_ONCE(refcount_read(&sk->sk_wmem_alloc));
 	WARN_ON_ONCE(sk->sk_wmem_queued);
-	WARN_ON_ONCE(sk_forward_alloc_get(sk));
+	WARN_ON_ONCE(sk->sk_forward_alloc);
 
 	kfree(rcu_dereference_protected(inet->inet_opt, 1));
 	dst_release(rcu_dereference_protected(sk->sk_dst_cache, 1));
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 321acc8abf17..efe2a085cf68 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -282,7 +282,7 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
 		struct inet_diag_meminfo minfo = {
 			.idiag_rmem = sk_rmem_alloc_get(sk),
 			.idiag_wmem = READ_ONCE(sk->sk_wmem_queued),
-			.idiag_fmem = sk_forward_alloc_get(sk),
+			.idiag_fmem = READ_ONCE(sk->sk_forward_alloc),
 			.idiag_tmem = sk_wmem_alloc_get(sk),
 		};
 
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 8996c73c9779..3f2e707a11d1 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -460,7 +460,7 @@ META_COLLECTOR(int_sk_fwd_alloc)
 		*err = -1;
 		return;
 	}
-	dst->value = sk_forward_alloc_get(sk);
+	dst->value = READ_ONCE(sk->sk_forward_alloc);
 }
 
 META_COLLECTOR(int_sk_sndbuf)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH mptcp-next v2 0/7] mptcp: rx path refactor
  2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
                   ` (6 preceding siblings ...)
  2024-12-06 12:11 ` [PATCH mptcp-next v2 7/7] net: dismiss sk_forward_alloc_get() Paolo Abeni
@ 2024-12-06 13:17 ` MPTCP CI
  2024-12-06 16:41 ` Paolo Abeni
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: MPTCP CI @ 2024-12-06 13:17 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal: Success! ✅
- KVM Validation: debug: Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/12198751082

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/1c3b9857b5b4
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=915337


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mptcp-next v2 0/7] mptcp: rx path refactor
  2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
                   ` (7 preceding siblings ...)
  2024-12-06 13:17 ` [PATCH mptcp-next v2 0/7] mptcp: rx path refactor MPTCP CI
@ 2024-12-06 16:41 ` Paolo Abeni
  2024-12-21  2:12 ` Mat Martineau
  2024-12-21 11:17 ` Matthieu Baerts
  10 siblings, 0 replies; 18+ messages in thread
From: Paolo Abeni @ 2024-12-06 16:41 UTC (permalink / raw)
  To: mptcp

On 12/6/24 13:09, Paolo Abeni wrote:
> The present refactor addresses both the above items: most of the
> additional complexity is dropped, and single stream performances
> increase measurably - from 55Gbps to 71Gbps in my loopback test. As a
> reference, plain TCP is around 84Gps on the same host.

I forgot to update this section.

I did some additional test with multiple links, differences vs the
existing codebase is within the noise range.

Disclaimer: tests over multiple veth pairs, with RPS to use multiple
cores for the forwarding. This is a somewhat decent approximation of a
real setup with multiple H/W links but I still have to try the latter.

I think the issue reported by the CI in the previous iteration should be
addressed here - or at least this iteration addresses the closest
problem I was able to reproduce locally.

I'm not aware of any standing issue, but I would bet a few coins this is
not bug-free.

Paolo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mptcp-next v2 6/7] mptcp: cleanup mem accounting.
  2024-12-06 12:11 ` [PATCH mptcp-next v2 6/7] mptcp: cleanup mem accounting Paolo Abeni
@ 2024-12-07  1:45   ` Mat Martineau
  2024-12-10  9:00     ` Paolo Abeni
  0 siblings, 1 reply; 18+ messages in thread
From: Mat Martineau @ 2024-12-07  1:45 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

On Fri, 6 Dec 2024, Paolo Abeni wrote:

> After the previous patch, updating sk_forward_memory is cheap and
> we can drop a lot of complexity from the MPTCP memory acconting,
> removing the custom fwd mem allocations for rmem.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v1 -> v2:
> - keep 'snd_una' and recovery-related fields under the msk
>   data lock
> - dropped unneeded code in __mptcp_move_skbs()
> ---
> net/mptcp/fastopen.c |   2 +-
> net/mptcp/protocol.c | 115 +++----------------------------------------
> net/mptcp/protocol.h |   4 +-
> 3 files changed, 10 insertions(+), 111 deletions(-)
>
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index ad940cc1f26f..a0d46b69746d 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -278,7 +278,6 @@ struct mptcp_sock {
> 	u64		rcv_data_fin_seq;
> 	u64		bytes_retrans;
> 	u64		bytes_consumed;
> -	int		rmem_fwd_alloc;
> 	int		snd_burst;
> 	int		old_wspace;
> 	u64		recovery_snd_nxt;	/* in recovery mode accept up to this seq;
> @@ -293,7 +292,6 @@ struct mptcp_sock {
> 	u32		last_ack_recv;
> 	unsigned long	timer_ival;
> 	u32		token;
> -	int		rmem_released;
> 	unsigned long	flags;
> 	unsigned long	cb_flags;
> 	bool		recovery;		/* closing subflow write queue reinjected */
> @@ -384,7 +382,7 @@ static inline void msk_owned_by_me(const struct mptcp_sock *msk)
>  */
> static inline int __mptcp_rmem(const struct sock *sk)
> {
> -	return atomic_read(&sk->sk_rmem_alloc) - READ_ONCE(mptcp_sk(sk)->rmem_released);
> +	return atomic_read(&sk->sk_rmem_alloc);
> }

Hi Paolo -

Minor change: this helper is now exactly the same as sk_rmem_alloc_get(), 
so use that existing helper instead.

Thats actually the only suggestion I have for v2! Good to delete code and 
improve performance. I also think the release_cb approach on rx is a good 
idea because it could be further leveraged to help with redundant 
schedulers (our main problem there was trying to trigger sends on other 
subflows when we couldn't acquire the msk lock).

- Mat

>
> static inline int mptcp_win_from_space(const struct sock *sk, int space)
> -- 
> 2.45.2
>
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mptcp-next v2 6/7] mptcp: cleanup mem accounting.
  2024-12-07  1:45   ` Mat Martineau
@ 2024-12-10  9:00     ` Paolo Abeni
  0 siblings, 0 replies; 18+ messages in thread
From: Paolo Abeni @ 2024-12-10  9:00 UTC (permalink / raw)
  To: Mat Martineau; +Cc: mptcp

Hi,

On 12/7/24 02:45, Mat Martineau wrote:
> Minor change: this helper is now exactly the same as sk_rmem_alloc_get(), 
> so use that existing helper instead.

If there are no strong opinion against that, I think it would be
preferrable to do the removal in a separate patch. In fact, I have 2
more follow-up patches doing the __mptcp_rmem() removal and some
micro-optimization/cleanup for __mptcp_move_skb().

Cheers,

Paolo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mptcp-next v2 1/7] mptcp: prevent excessive coalescing on receive
  2024-12-06 12:10 ` [PATCH mptcp-next v2 1/7] mptcp: prevent excessive coalescing on receive Paolo Abeni
@ 2024-12-10 12:03   ` Matthieu Baerts
  2024-12-21 10:00     ` Matthieu Baerts
  0 siblings, 1 reply; 18+ messages in thread
From: Matthieu Baerts @ 2024-12-10 12:03 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

On 06/12/2024 13:10, Paolo Abeni wrote:
> Currently the skb size after coalescing is only limited by the skb
> layout (the skb must not carry frag_list). A single coalesced skb
> covering several MSS can potentially fill completely the receive
> buffer. In such a case, the snd win will zero until the receive buffer
> will be empty again, affecting tput badly.

Good idea! Thank you for having looked at this!

> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> No fixes tag because the problem is not very visible in practice
> currently, but will be apparent after the rx refactor.
> Still I hope this could affect positively the simlut flows self-tests

The modification looks safe enough to me: it should not make thing worst.

If it helps, a Fixes tag can be added and next to the 'Cc: stable' tag,
we can ask to backport this patch after a few weeks, no? e.g.

  Cc: <stable@vger.kernel.org> # after -rc6

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mptcp-next v2 0/7] mptcp: rx path refactor
  2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
                   ` (8 preceding siblings ...)
  2024-12-06 16:41 ` Paolo Abeni
@ 2024-12-21  2:12 ` Mat Martineau
  2024-12-21 11:17 ` Matthieu Baerts
  10 siblings, 0 replies; 18+ messages in thread
From: Mat Martineau @ 2024-12-21  2:12 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

On Fri, 6 Dec 2024, Paolo Abeni wrote:

> This is a batch of changes I had sitting in my local tree for a while.
> Why another refactor you may ask? Two main resons:
>
> - currently the mptcp RX path introduces quite a bit of 'exceptional'
> accounting/locking processing WRT to plain TCP, adding up to the
> implementation complexity in a misurable way
> - the performance gap WRT plain TCP for single subflow connections is
> quite measurable.
>
> The present refactor addresses both the above items: most of the
> additional complexity is dropped, and single stream performances
> increase measurably - from 55Gbps to 71Gbps in my loopback test. As a
> reference, plain TCP is around 84Gps on the same host.
>

Hi Paolo -

Thanks for the v2 and for sharing the performance numbers - great to see 
such a dramatic improvement in throughput!

> The above comes to a price: the patch are invasive, even in subtle ways:
> the chance of destabilizing the implementation is real (ence the
> additional, intentional '-next' into the subj).
>

I think we will also need to be extra careful about monitoring stable tree 
backports that modify the affected functions. But to me the 
simplifications and performance fixes are worth it.

It will help to get this in to the export branch for further testing!

Reviewed-by: Mat Martineau <martineau@kernel.org>


> In any case keeping the patch hidden for longer was not going to do any
> good, so here we are.
>
> Changes from v1:
>  - fixed several data stream corruption and wake-up misses due
>    to multi subflows races
>  - added patches 1-3 mainly to address the above
>  - added an additional follow-up patch (patch 7) with more cleanup
>
> Paolo Abeni (7):
>  mptcp: prevent excessive coalescing on receive
>  tcp: fix recvbuffer adjust on sleeping rcvmsg
>  mptcp: don't always assume copied data in mptcp_cleanup_rbuf()
>  mptcp: consolidate subflow cleanup
>  mptcp: move the whole rx path under msk socket lock protection
>  mptcp: cleanup mem accounting.
>  net: dismiss sk_forward_alloc_get()
>
> include/net/sock.h   |  13 ---
> net/core/sock.c      |   2 +-
> net/ipv4/af_inet.c   |   2 +-
> net/ipv4/inet_diag.c |   2 +-
> net/mptcp/fastopen.c |   4 +-
> net/mptcp/protocol.c | 259 +++++++++++++------------------------------
> net/mptcp/protocol.h |   6 +-
> net/mptcp/subflow.c  |  33 +++---
> net/sched/em_meta.c  |   2 +-
> 9 files changed, 104 insertions(+), 219 deletions(-)
>
> -- 
> 2.45.2
>
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mptcp-next v2 1/7] mptcp: prevent excessive coalescing on receive
  2024-12-10 12:03   ` Matthieu Baerts
@ 2024-12-21 10:00     ` Matthieu Baerts
  2024-12-27  9:40       ` Paolo Abeni
  0 siblings, 1 reply; 18+ messages in thread
From: Matthieu Baerts @ 2024-12-21 10:00 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

On 10/12/2024 13:03, Matthieu Baerts wrote:
> Hi Paolo,
> 
> On 06/12/2024 13:10, Paolo Abeni wrote:
>> Currently the skb size after coalescing is only limited by the skb
>> layout (the skb must not carry frag_list). A single coalesced skb
>> covering several MSS can potentially fill completely the receive
>> buffer. In such a case, the snd win will zero until the receive buffer
>> will be empty again, affecting tput badly.
> 
> Good idea! Thank you for having looked at this!
> 
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> No fixes tag because the problem is not very visible in practice
>> currently, but will be apparent after the rx refactor.
>> Still I hope this could affect positively the simlut flows self-tests
> 
> The modification looks safe enough to me: it should not make thing worst.

Do you still prefer not targetting -net for this patch here? Or is it OK
to do so and add a Fixes tag?

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mptcp-next v2 0/7] mptcp: rx path refactor
  2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
                   ` (9 preceding siblings ...)
  2024-12-21  2:12 ` Mat Martineau
@ 2024-12-21 11:17 ` Matthieu Baerts
  10 siblings, 0 replies; 18+ messages in thread
From: Matthieu Baerts @ 2024-12-21 11:17 UTC (permalink / raw)
  To: Paolo Abeni, mptcp

Hi Paolo, Mat,

On 06/12/2024 13:09, Paolo Abeni wrote:
> This is a batch of changes I had sitting in my local tree for a while.
> Why another refactor you may ask? Two main resons:
> 
> - currently the mptcp RX path introduces quite a bit of 'exceptional'
>  accounting/locking processing WRT to plain TCP, adding up to the
>  implementation complexity in a misurable way
> - the performance gap WRT plain TCP for single subflow connections is
>  quite measurable.
> 
> The present refactor addresses both the above items: most of the
> additional complexity is dropped, and single stream performances
> increase measurably - from 55Gbps to 71Gbps in my loopback test. As a
> reference, plain TCP is around 84Gps on the same host.
> 
> The above comes to a price: the patch are invasive, even in subtle ways:
> the chance of destabilizing the implementation is real (ence the
> additional, intentional '-next' into the subj).
> 
> In any case keeping the patch hidden for longer was not going to do any
> good, so here we are.

Thank you for the patches and the reviews!

Patches 2 and 3 are now in the "fixes for net", while the others are in
"feat. for next". I think patch 1 can go to "Fixes for net", any objections?

New patches for t/upstream-net and t/upstream:
- d443f83de23d: mptcp: fix recvbuffer adjust on sleeping rcvmsg
- 11d4943d66ee: mptcp: don't always assume copied data in
mptcp_cleanup_rbuf()
- Results: 3a0f411ba1c5..2116e71814b7 (export-net)
- Results: df3193ae83eb..953b0d388720 (export)

Tests are now in progress:

- export-net:
https://github.com/multipath-tcp/mptcp_net-next/commit/047c2f99575350d9971b6257f96ce5f734f0186a/checks

New patches for t/upstream:
- 972d8a5b7f20: mptcp: prevent excessive coalescing on receive
- 0b0a25a24d21: mptcp: consolidate subflow cleanup
- 533bb69f1b0d: mptcp: move the whole rx path under msk socket lock
protection
- c5d13b230497: mptcp: cleanup mem accounting
- a564b9923a94: net: dismiss sk_forward_alloc_get()
- Results: 953b0d388720..9213903d77ba (export)

Tests are now in progress:

- export:
https://github.com/multipath-tcp/mptcp_net-next/commit/6aae9c8bd8361dd90a7def87459ca0173915c56c/checks

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mptcp-next v2 1/7] mptcp: prevent excessive coalescing on receive
  2024-12-21 10:00     ` Matthieu Baerts
@ 2024-12-27  9:40       ` Paolo Abeni
  2024-12-30 18:24         ` Matthieu Baerts
  0 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2024-12-27  9:40 UTC (permalink / raw)
  To: Matthieu Baerts; +Cc: mptcp

On 12/21/24 11:00, Matthieu Baerts wrote:
> On 10/12/2024 13:03, Matthieu Baerts wrote:
>> On 06/12/2024 13:10, Paolo Abeni wrote:
>>> Currently the skb size after coalescing is only limited by the skb
>>> layout (the skb must not carry frag_list). A single coalesced skb
>>> covering several MSS can potentially fill completely the receive
>>> buffer. In such a case, the snd win will zero until the receive buffer
>>> will be empty again, affecting tput badly.
>>
>> Good idea! Thank you for having looked at this!
>>
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>> ---
>>> No fixes tag because the problem is not very visible in practice
>>> currently, but will be apparent after the rx refactor.
>>> Still I hope this could affect positively the simlut flows self-tests
>>
>> The modification looks safe enough to me: it should not make thing worst.
> 
> Do you still prefer not targetting -net for this patch here? Or is it OK
> to do so and add a Fixes tag?

Whatever is easier to do in practice works for me. Perhaps monitoring
the simult_flows ST flakes could give an idea - if that get more
stable/less unstable we should possibly target 'net'.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mptcp-next v2 1/7] mptcp: prevent excessive coalescing on receive
  2024-12-27  9:40       ` Paolo Abeni
@ 2024-12-30 18:24         ` Matthieu Baerts
  0 siblings, 0 replies; 18+ messages in thread
From: Matthieu Baerts @ 2024-12-30 18:24 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

On 27/12/2024 10:40, Paolo Abeni wrote:
> On 12/21/24 11:00, Matthieu Baerts wrote:
>> On 10/12/2024 13:03, Matthieu Baerts wrote:
>>> On 06/12/2024 13:10, Paolo Abeni wrote:
>>>> Currently the skb size after coalescing is only limited by the skb
>>>> layout (the skb must not carry frag_list). A single coalesced skb
>>>> covering several MSS can potentially fill completely the receive
>>>> buffer. In such a case, the snd win will zero until the receive buffer
>>>> will be empty again, affecting tput badly.
>>>
>>> Good idea! Thank you for having looked at this!
>>>
>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>>> ---
>>>> No fixes tag because the problem is not very visible in practice
>>>> currently, but will be apparent after the rx refactor.
>>>> Still I hope this could affect positively the simlut flows self-tests
>>>
>>> The modification looks safe enough to me: it should not make thing worst.
>>
>> Do you still prefer not targetting -net for this patch here? Or is it OK
>> to do so and add a Fixes tag?
> 
> Whatever is easier to do in practice works for me. Perhaps monitoring
> the simult_flows ST flakes could give an idea - if that get more
> stable/less unstable we should possibly target 'net'.

Thank you for your reply!

It looks OK so far. I just sent this patch to netdev, so we can also get
some tests over there. I will monitor if it helps on NIPA as well.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-12-30 18:24 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-06 12:09 [PATCH mptcp-next v2 0/7] mptcp: rx path refactor Paolo Abeni
2024-12-06 12:10 ` [PATCH mptcp-next v2 1/7] mptcp: prevent excessive coalescing on receive Paolo Abeni
2024-12-10 12:03   ` Matthieu Baerts
2024-12-21 10:00     ` Matthieu Baerts
2024-12-27  9:40       ` Paolo Abeni
2024-12-30 18:24         ` Matthieu Baerts
2024-12-06 12:10 ` [PATCH mptcp-next v2 2/7] tcp: fix recvbuffer adjust on sleeping rcvmsg Paolo Abeni
2024-12-06 12:10 ` [PATCH mptcp-next v2 3/7] mptcp: don't always assume copied data in mptcp_cleanup_rbuf() Paolo Abeni
2024-12-06 12:10 ` [PATCH mptcp-next v2 4/7] mptcp: consolidate subflow cleanup Paolo Abeni
2024-12-06 12:11 ` [PATCH mptcp-next v2 5/7] mptcp: move the whole rx path under msk socket lock protection Paolo Abeni
2024-12-06 12:11 ` [PATCH mptcp-next v2 6/7] mptcp: cleanup mem accounting Paolo Abeni
2024-12-07  1:45   ` Mat Martineau
2024-12-10  9:00     ` Paolo Abeni
2024-12-06 12:11 ` [PATCH mptcp-next v2 7/7] net: dismiss sk_forward_alloc_get() Paolo Abeni
2024-12-06 13:17 ` [PATCH mptcp-next v2 0/7] mptcp: rx path refactor MPTCP CI
2024-12-06 16:41 ` Paolo Abeni
2024-12-21  2:12 ` Mat Martineau
2024-12-21 11:17 ` Matthieu Baerts

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.