[PATCH mptcp-next 0/4] mptcp: performance improvements

* [PATCH mptcp-next 0/4] mptcp: performance improvements
From: Paolo Abeni @ 2023-09-20 12:45 UTC
  To: mptcp

Cool subj just to catch more attention;)

This is a follow-up to yesterday's meeting and to the discussion started
in issues/437.

The first 3 patches implement a working support for rcvlowat, with the
intended and relevant side effect of avoiding most TCP-level immediate
acks.
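
For context, the threshold in question is the one an application sets via
SO_RCVLOWAT. A minimal user-space sketch of requesting it on an MPTCP socket
could look like the following; the helper name open_mptcp_with_rcvlowat() is
hypothetical and not part of this series, and the fallback define only covers
libc headers that predate IPPROTO_MPTCP.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262 /* value from linux/in.h, missing in older libc headers */
#endif

/* Hypothetical helper: open an MPTCP socket and ask that blocking reads and
 * poll() report it readable only once at least `lowat` bytes are queued.
 * Returns the fd on success, -1 on failure with errno set.
 */
static int open_mptcp_with_rcvlowat(int lowat)
{
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

	if (fd < 0)
		return -1; /* e.g. EPROTONOSUPPORT when the kernel lacks MPTCP */

	if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

Whether the requested value is actually honoured on the receive path depends
on the kernel in use; that is what the first 3 patches are about.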

The existing check in:

https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_input.c#L5534

nearly always triggers an immediate ack, as any data received
on the subflow is usually immediately moved into the msk. That is, we
nearly always have: tp->rcv_nxt == tp->copied_seq.

That causes a large number of unneeded (TCP-level) acks. After patch
3, MPTCP behaves much more like plain TCP, compressing/delaying many
unneeded immediate acks and moving some of them to recvmsg() time.
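
The lowat part of that check can be modelled in user space roughly as below.
This is a simplified sketch, not the kernel code: struct rx_state and
lowat_forces_immediate_ack() are made-up names, and the window-update,
quick-ack and ICSK_ACK_NOW branches of the real check are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model; field names mirror the kernel ones but this is not
 * kernel code.
 */
struct rx_state {
	uint32_t rcv_nxt;    /* next sequence number expected from the peer */
	uint32_t rcv_wup;    /* rcv_nxt when the last window update was sent */
	uint32_t copied_seq; /* first sequence number not yet taken by the reader */
	uint32_t rcv_mss;    /* estimate of the peer's segment size */
	uint32_t rcvlowat;   /* sk_rcvlowat, at least 1 */
};

/* True when more than one full frame is pending acknowledgement and the
 * unread data is below the low-water mark, i.e. the ack must go out now.
 */
static bool lowat_forces_immediate_ack(const struct rx_state *s)
{
	uint32_t unacked = s->rcv_nxt - s->rcv_wup;
	uint32_t unread = s->rcv_nxt - s->copied_seq;

	return unacked > s->rcv_mss && unread < s->rcvlowat;
}
```

With copied_seq equal to rcv_nxt the unread amount is 0, which is below any
valid low-water mark, so in this model every second full frame yields an
immediate ack.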

This change has a significant effect on the max throughput, ranging
from -4% to +40% depending on the specific setup.
The small regression happens on CPU-bound bulk transfers with a
not-so-large write buffer and can be eliminated by increasing the write
buffer size. Even in that scenario, the overall efficiency (B/W divided
by the total CPU cycles consumed) increases.
My personal take is that overall this is for the better.

The last patch gives some speed-up to the tx path, just by using the
'correct' (or better) helper to memcpy the data from the user-space
into the kernel buffer.

Paolo Abeni (4):
  mptcp: properly account fastopen data
  mptcp: use plain bool instead of custom binary enum
  mptcp: give rcvlowat some love
  mptcp: use copy_from_iter helpers on transmit

 net/mptcp/fastopen.c |  1 +
 net/mptcp/protocol.c | 42 +++++++++++++++++++++++++-----------------
 net/mptcp/protocol.h | 27 +++++++++++++++++++++------
 net/mptcp/sockopt.c  | 31 +++++++++++++++++++++++++++++++
 net/mptcp/subflow.c  | 24 ++++++++++++++++--------
 5 files changed, 94 insertions(+), 31 deletions(-)

-- 
2.41.0


* [PATCH mptcp-next 1/4] mptcp: properly account fastopen data
From: Paolo Abeni @ 2023-09-20 12:45 UTC
  To: mptcp

Currently the socket-level counter aggregating the received data
does not take into account the data received via fastopen.

Address the issue by updating the counter as required.

Fixes: 38967f424b5b ("mptcp: track some aggregate data counters")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/fastopen.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
index bceaab8dd8e4..74698582a285 100644
--- a/net/mptcp/fastopen.c
+++ b/net/mptcp/fastopen.c
@@ -52,6 +52,7 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
 
 	mptcp_set_owner_r(skb, sk);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);
+	mptcp_sk(sk)->bytes_received += skb->len;
 
 	sk->sk_data_ready(sk);
 
-- 
2.41.0



* [PATCH mptcp-next 2/4] mptcp: use plain bool instead of custom binary enum
From: Paolo Abeni @ 2023-09-20 12:45 UTC
  To: mptcp

The 'data_avail' subflow field is already used as a plain boolean;
drop the custom binary enum type and switch to bool.

No functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.h |  7 +------
 net/mptcp/subflow.c  | 12 ++++++------
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index ab775e48c11d..158b6fc19217 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -433,11 +433,6 @@ mptcp_subflow_rsk(const struct request_sock *rsk)
 	return (struct mptcp_subflow_request_sock *)rsk;
 }
 
-enum mptcp_data_avail {
-	MPTCP_SUBFLOW_NODATA,
-	MPTCP_SUBFLOW_DATA_AVAIL,
-};
-
 struct mptcp_delegated_action {
 	struct napi_struct napi;
 	struct list_head head;
@@ -494,7 +489,7 @@ struct mptcp_subflow_context {
 		valid_csum_seen : 1,        /* at least one csum validated */
 		is_mptfo : 1,	    /* subflow is doing TFO */
 		__unused : 9;
-	enum mptcp_data_avail data_avail;
+	bool	data_avail;
 	bool	scheduled;
 	u32	remote_nonce;
 	u64	thmac;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 0eae952064b1..0a4465dcf56d 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1237,7 +1237,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
 	struct sk_buff *skb;
 
 	if (!skb_peek(&ssk->sk_receive_queue))
-		WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
+		WRITE_ONCE(subflow->data_avail, false);
 	if (subflow->data_avail)
 		return true;
 
@@ -1271,7 +1271,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
 			continue;
 		}
 
-		WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_DATA_AVAIL);
+		WRITE_ONCE(subflow->data_avail, true);
 		break;
 	}
 	return true;
@@ -1293,7 +1293,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
 				goto reset;
 			}
 			mptcp_subflow_fail(msk, ssk);
-			WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_DATA_AVAIL);
+			WRITE_ONCE(subflow->data_avail, true);
 			return true;
 		}
 
@@ -1310,7 +1310,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
 			while ((skb = skb_peek(&ssk->sk_receive_queue)))
 				sk_eat_skb(ssk, skb);
 			tcp_send_active_reset(ssk, GFP_ATOMIC);
-			WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
+			WRITE_ONCE(subflow->data_avail, false);
 			return false;
 		}
 
@@ -1322,7 +1322,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
 	subflow->map_seq = READ_ONCE(msk->ack_seq);
 	subflow->map_data_len = skb->len;
 	subflow->map_subflow_seq = tcp_sk(ssk)->copied_seq - subflow->ssn_offset;
-	WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_DATA_AVAIL);
+	WRITE_ONCE(subflow->data_avail, true);
 	return true;
 }
 
@@ -1334,7 +1334,7 @@ bool mptcp_subflow_data_available(struct sock *sk)
 	if (subflow->map_valid &&
 	    mptcp_subflow_get_map_offset(subflow) >= subflow->map_data_len) {
 		subflow->map_valid = 0;
-		WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
+		WRITE_ONCE(subflow->data_avail, false);
 
 		pr_debug("Done with mapping: seq=%u data_len=%u",
 			 subflow->map_subflow_seq,
-- 
2.41.0



* [PATCH mptcp-next 3/4] mptcp: give rcvlowat some love
From: Paolo Abeni @ 2023-09-20 12:45 UTC
  To: mptcp

The MPTCP protocol allows setting sk_rcvlowat, but the value there
is currently ignored.

Additionally, the default subflow sk_rcvlowat basically disables
per-subflow delayed ack: the MPTCP protocol moves the incoming data from the
subflows into the msk socket as soon as the TCP stack invokes the subflow
data_ready callback. Later, when __tcp_ack_snd_check() takes action,
the subflow-level copied_seq matches rcv_nxt, and that mandates an
immediate ack.

Let the MPTCP receive path be aware of such a threshold, explicitly tracking
the amount of data available to be read and checking it against sk_rcvlowat
in mptcp_poll() and before waking up readers.

Additionally, implement the set_rcvlowat() callback, to properly handle
the rcvbuf auto-tuning on sk_rcvlowat changes.

Finally, to properly handle delayed ack, force the subflow-level threshold
to 0 and instead explicitly ask for an immediate ack when the msk-level
threshold is not reached.
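
The msk-level accounting described above can be sketched in user space as
follows. This is an illustrative model only: struct msk_model and the
model_*() helpers are hypothetical names, and the memory-pressure overrides
that the patch also applies are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the msk-level counters; not kernel code. */
struct msk_model {
	uint64_t bytes_received; /* in-sequence bytes moved into the msk */
	uint64_t bytes_consumed; /* bytes already handed to the reader */
	int rcvlowat;            /* msk-level low-water mark, at least 1 */
};

/* Data queued at the msk level and not yet read. */
static uint64_t model_data_avail(const struct msk_model *m)
{
	return m->bytes_received - m->bytes_consumed;
}

/* Readers are woken (and poll reports EPOLLIN) only at or above the mark. */
static bool model_epollin_ready(const struct msk_model *m)
{
	return model_data_avail(m) >= (uint64_t)m->rcvlowat;
}
```

In this model, a reader with a 4096-byte mark is not woken for a 1000-byte
arrival, and becomes not-ready again once a read brings the unread amount
back below the mark.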

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 23 ++++++++++-------------
 net/mptcp/protocol.h | 20 ++++++++++++++++++++
 net/mptcp/sockopt.c  | 31 +++++++++++++++++++++++++++++++
 net/mptcp/subflow.c  | 12 ++++++++++--
 4 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 0a9d00e794d4..1de8eaccd48b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -861,9 +861,8 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 
 	/* Wake-up the reader only for in-sequence data */
 	mptcp_data_lock(sk);
-	if (move_skbs_to_msk(msk, ssk))
+	if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
 		sk->sk_data_ready(sk);
-
 	mptcp_data_unlock(sk);
 }
 
@@ -1928,6 +1927,7 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
 			if (!(flags & MSG_PEEK)) {
 				MPTCP_SKB_CB(skb)->offset += count;
 				MPTCP_SKB_CB(skb)->map_seq += count;
+				msk->bytes_consumed += count;
 			}
 			break;
 		}
@@ -1938,6 +1938,7 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
 			WRITE_ONCE(msk->rmem_released, msk->rmem_released + skb->truesize);
 			__skb_unlink(skb, &msk->receive_queue);
 			__kfree_skb(skb);
+			msk->bytes_consumed += count;
 		}
 
 		if (copied >= len)
@@ -2955,16 +2956,9 @@ void __mptcp_unaccepted_force_close(struct sock *sk)
 	__mptcp_destroy_sock(sk);
 }
 
-static __poll_t mptcp_check_readable(struct mptcp_sock *msk)
+static __poll_t mptcp_check_readable(struct sock *sk)
 {
-	/* Concurrent splices from sk_receive_queue into receive_queue will
-	 * always show at least one non-empty queue when checked in this order.
-	 */
-	if (skb_queue_empty_lockless(&((struct sock *)msk)->sk_receive_queue) &&
-	    skb_queue_empty_lockless(&msk->receive_queue))
-		return 0;
-
-	return EPOLLIN | EPOLLRDNORM;
+	return mptcp_epollin_ready(sk) ? EPOLLIN | EPOLLRDNORM : 0;
 }
 
 static void mptcp_check_listen_stop(struct sock *sk)
@@ -3002,7 +2996,7 @@ bool __mptcp_close(struct sock *sk, long timeout)
 		goto cleanup;
 	}
 
-	if (mptcp_check_readable(msk) || timeout < 0) {
+	if (mptcp_data_avail(msk) || timeout < 0) {
 		/* If the msk has read data, or the caller explicitly ask it,
 		 * do the MPTCP equivalent of TCP reset, aka MPTCP fastclose
 		 */
@@ -3135,6 +3129,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 	msk->snd_data_fin_enable = false;
 	msk->rcv_fastclose = false;
 	msk->use_64bit_ack = false;
+	msk->bytes_consumed = 0;
 	WRITE_ONCE(msk->csum_enabled, mptcp_is_checksum_enabled(sock_net(sk)));
 	mptcp_pm_data_reset(msk);
 	mptcp_ca_reset(sk);
@@ -3918,7 +3913,7 @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 		mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;
 
 	if (state != TCP_SYN_SENT && state != TCP_SYN_RECV) {
-		mask |= mptcp_check_readable(msk);
+		mask |= mptcp_check_readable(sk);
 		if (shutdown & SEND_SHUTDOWN)
 			mask |= EPOLLOUT | EPOLLWRNORM;
 		else
@@ -3956,6 +3951,7 @@ static const struct proto_ops mptcp_stream_ops = {
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
+	.set_rcvlowat	   = mptcp_set_rcvlowat,
 };
 
 static struct inet_protosw mptcp_protosw = {
@@ -4057,6 +4053,7 @@ static const struct proto_ops mptcp_v6_stream_ops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	   = inet6_compat_ioctl,
 #endif
+	.set_rcvlowat	   = mptcp_set_rcvlowat,
 };
 
 static struct proto mptcp_v6_prot;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 158b6fc19217..a895424bf112 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -268,6 +268,7 @@ struct mptcp_sock {
 	atomic64_t	rcv_wnd_sent;
 	u64		rcv_data_fin_seq;
 	u64		bytes_retrans;
+	u64		bytes_consumed;
 	int		rmem_fwd_alloc;
 	int		snd_burst;
 	int		old_wspace;
@@ -667,6 +668,24 @@ struct sock *mptcp_subflow_get_retrans(struct mptcp_sock *msk);
 int mptcp_sched_get_send(struct mptcp_sock *msk);
 int mptcp_sched_get_retrans(struct mptcp_sock *msk);
 
+static inline u64 mptcp_data_avail(const struct mptcp_sock *msk)
+{
+	return READ_ONCE(msk->bytes_received) - READ_ONCE(msk->bytes_consumed);
+}
+
+static inline bool mptcp_epollin_ready(const struct sock *sk)
+{
+	/* mptcp doesn't have to deal with small skbs in the receive queue,
+	 * as it can always coalesce them
+	 */
+	return (mptcp_data_avail(mptcp_sk(sk)) >= sk->sk_rcvlowat) ||
+	       (mem_cgroup_sockets_enabled && sk->sk_memcg &&
+		mem_cgroup_under_socket_pressure(sk->sk_memcg)) ||
+	       READ_ONCE(tcp_memory_pressure);
+}
+
+int mptcp_set_rcvlowat(struct sock *sk, int val);
+
 static inline bool __tcp_can_send(const struct sock *ssk)
 {
 	/* only send if our side has not closed yet */
@@ -741,6 +760,7 @@ static inline bool mptcp_is_fully_established(struct sock *sk)
 	return inet_sk_state_load(sk) == TCP_ESTABLISHED &&
 	       READ_ONCE(mptcp_sk(sk)->fully_established);
 }
+
 void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk);
 void mptcp_data_ready(struct sock *sk, struct sock *ssk);
 bool mptcp_finish_join(struct sock *sk);
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 453d6c78c25c..8351e4558379 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -1456,9 +1456,40 @@ void mptcp_sockopt_sync_locked(struct mptcp_sock *msk, struct sock *ssk)
 	 */
 	tcp_sk(ssk)->notsent_lowat = UINT_MAX;
 
+	ssk->sk_rcvlowat = 0;
+
 	if (READ_ONCE(subflow->setsockopt_seq) != msk->setsockopt_seq) {
 		sync_socket_options(msk, ssk);
 
 		subflow->setsockopt_seq = msk->setsockopt_seq;
 	}
 }
+
+/* unfortunately this is different enough from the tcp version so
+ * that we can't factor it out
+ */
+int mptcp_set_rcvlowat(struct sock *sk, int val)
+{
+	int space, cap;
+
+	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
+		cap = sk->sk_rcvbuf >> 1;
+	else
+		cap = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2]) >> 1;
+	val = min(val, cap);
+	WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
+
+	/* Check if we need to signal EPOLLIN right now */
+	if (mptcp_epollin_ready(sk))
+		sk->sk_data_ready(sk);
+
+	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
+		return 0;
+
+	space = __tcp_space_from_win(mptcp_sk(sk)->scaling_ratio, val);
+	if (space > sk->sk_rcvbuf) {
+		WRITE_ONCE(sk->sk_rcvbuf, space);
+		tcp_sk(sk)->window_clamp = val;
+	}
+	return 0;
+}
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 0a4465dcf56d..1301b0215f62 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1405,10 +1405,18 @@ static void subflow_data_ready(struct sock *sk)
 	WARN_ON_ONCE(!__mptcp_check_fallback(msk) && !subflow->mp_capable &&
 		     !subflow->mp_join && !(state & TCPF_CLOSE));
 
-	if (mptcp_subflow_data_available(sk))
+	if (mptcp_subflow_data_available(sk)) {
 		mptcp_data_ready(parent, sk);
-	else if (unlikely(sk->sk_err))
+
+		/* subflow-level lowat tests are not relevant.
+		 * respect the msk-level threshold, eventually mandating an immediate ack
+		 */
+		if (mptcp_data_avail(msk) < parent->sk_rcvlowat &&
+		    (tcp_sk(sk)->rcv_nxt - tcp_sk(sk)->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss)
+			inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+	} else if (unlikely(sk->sk_err)) {
 		subflow_error_report(sk);
+	}
 }
 
 static void subflow_write_space(struct sock *ssk)
-- 
2.41.0



* [PATCH mptcp-next 4/4] mptcp: use copy_from_iter helpers on transmit
From: Paolo Abeni @ 2023-09-20 12:45 UTC
  To: mptcp

The perf traces show a high cost for the MPTCP transmit path memcpy.

It turns out that the helper currently in use carries quite a bit
of unneeded overhead, e.g. to map/unmap the memory pages.

Moving to the 'copy_from_iter' variant removes such overhead and
additionally gains the no-cache support.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 1de8eaccd48b..6f9e116598ed 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1766,6 +1766,18 @@ static int mptcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg,
 	return ret;
 }
 
+static int do_copy_data_nocache(struct sock *sk, int copy,
+				struct iov_iter *from, char *to)
+{
+	if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY) {
+		if (!copy_from_iter_full_nocache(to, copy, from))
+			return -EFAULT;
+	} else if (!copy_from_iter_full(to, copy, from)) {
+		return -EFAULT;
+	}
+	return 0;
+}
+
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -1839,11 +1851,10 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		if (!sk_wmem_schedule(sk, total_ts))
 			goto wait_for_memory;
 
-		if (copy_page_from_iter(dfrag->page, offset, psize,
-					&msg->msg_iter) != psize) {
-			ret = -EFAULT;
+		ret = do_copy_data_nocache(sk, psize, &msg->msg_iter,
+					   page_address(dfrag->page) + offset);
+		if (ret)
 			goto do_error;
-		}
 
 		/* data successfully copied into the write queue */
 		sk_forward_alloc_add(sk, -total_ts);
-- 
2.41.0



* Re: [PATCH mptcp-next 0/4] mptcp: performance improvements
From: Mat Martineau @ 2023-09-22 16:59 UTC
  To: Paolo Abeni; +Cc: mptcp

On Wed, 20 Sep 2023, Paolo Abeni wrote:

> Cool subj just to catch more attention;)
>
> This is a follow-up to yesterday's meeting and to the discussion started
> in issues/437.
>
> The first 3 patches implement a working support for rcvlowat, with the
> intended and relevant side effect of avoiding most TCP-level immediate
> acks.
>
> The existing check in:
>
> https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_input.c#L5534
>
> nearly always triggers an immediate ack, as any data received
> on the subflow is usually immediately moved into the msk. That is, we
> nearly always have: tp->rcv_nxt == tp->copied_seq.
>
> That causes a large number of unneeded (TCP-level) acks. After patch
> 3, MPTCP behaves much more like plain TCP, compressing/delaying many
> unneeded immediate acks and moving some of them to recvmsg() time.
>
> This change has a significant effect on the max throughput, ranging
> from -4% to +40% depending on the specific setup.
> The small regression happens on CPU-bound bulk transfers with a
> not-so-large write buffer and can be eliminated by increasing the write
> buffer size. Even in that scenario, the overall efficiency (B/W divided
> by the total CPU cycles consumed) increases.
> My personal take is that overall this is for the better.
>
> The last patch gives some speed-up to the tx path, just by using the
> 'correct' (or better) helper to memcpy the data from the user-space
> into the kernel buffer.
>

Hi Paolo -

Series looks good to me, thanks for the fixes and enhancements

Reviewed-by: Mat Martineau <martineau@kernel.org>

> Paolo Abeni (4):
>  mptcp: properly account fastopen data
>  mptcp: use plain bool instead of custom binary enum
>  mptcp: give rcvlowat some love
>  mptcp: use copy_from_iter helpers on transmit
>
> net/mptcp/fastopen.c |  1 +
> net/mptcp/protocol.c | 42 +++++++++++++++++++++++++-----------------
> net/mptcp/protocol.h | 27 +++++++++++++++++++++------
> net/mptcp/sockopt.c  | 31 +++++++++++++++++++++++++++++++
> net/mptcp/subflow.c  | 24 ++++++++++++++++--------
> 5 files changed, 94 insertions(+), 31 deletions(-)
>
> -- 
> 2.41.0
>
>
>


* Re: [PATCH mptcp-next 0/4] mptcp: performance improvements
From: Matthieu Baerts @ 2023-09-23  8:20 UTC
  To: Paolo Abeni, mptcp

Hi Paolo, Mat,

On 20/09/2023 14:45, Paolo Abeni wrote:
> Cool subj just to catch more attention;)
> 
> This is a follow-up to yesterday's meeting and to the discussion started
> in issues/437.
> 
> The first 3 patches implement a working support for rcvlowat, with the
> intended and relevant side effect of avoiding most TCP-level immediate
> acks.
> 
> The existing check in:
> 
> https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_input.c#L5534
> 
> nearly always triggers an immediate ack, as any data received
> on the subflow is usually immediately moved into the msk. That is, we
> nearly always have: tp->rcv_nxt == tp->copied_seq.
> 
> That causes a large number of unneeded (TCP-level) acks. After patch
> 3, MPTCP behaves much more like plain TCP, compressing/delaying many
> unneeded immediate acks and moving some of them to recvmsg() time.
> 
> This change has a significant effect on the max throughput, ranging
> from -4% to +40% depending on the specific setup.
> The small regression happens on CPU-bound bulk transfers with a
> not-so-large write buffer and can be eliminated by increasing the write
> buffer size. Even in that scenario, the overall efficiency (B/W divided
> by the total CPU cycles consumed) increases.
> My personal take is that overall this is for the better.
> 
> The last patch gives some speed-up to the tx path, just by using the
> 'correct' (or better) helper to memcpy the data from the user-space
> into the kernel buffer.

Thank you for this nice patch set!

New patches for t/upstream:
- 0188ee6897bb: mptcp: properly account fastopen data
- ea3f76f3c64b: mptcp: use plain bool instead of custom binary enum
- 03cdb460a417: mptcp: give rcvlowat some love
- 11b40c8da614: mptcp: use copy_from_iter helpers on transmit
- Results: 1c2adc4e573c..f2baa63004ad (export)

Tests are now in progress:

https://cirrus-ci.com/github/multipath-tcp/mptcp_net-next/export/20230923T080409

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net
