All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH mptcp-next 0/7] mptcp: implement TCP_NOTSENT_LOWAT support
@ 2024-01-22 15:08 Paolo Abeni
  2024-01-22 15:08 ` [PATCH mptcp-next 1/7] mptcp: push ad DSS boundaries Paolo Abeni
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Paolo Abeni @ 2024-01-22 15:08 UTC (permalink / raw)
  To: mptcp

Patch 6/7 does the magic, all the others are minor cleanup and fix of
buglet exposed by such feature. I'll push a paired pktdrill test.

Note that this relies on the existing accounting for snd_nxt. As I
stated such accounting is not 110% accurate as it tracks the most recent
sequence number queued to any subflow, and not the actual sequence
number sent on the wire.

I experimented a lot trying to implement the latter and in the end it
proved to be both "too complex" and "not necessary".

The complexity raises from the need for additional lock and a lot of
refactory to introduce such protection without adding significant
overhead. Additionally, snd_nxt is currenly used and exposed with the
current semantic both eBPF and the internal packet scheduling.
Introducing a different tracking will still require us to keep the old
one.

More interesting, a more accurate tracking could be not strictly
necessary: as the MPTCP protocol enqueues data to the subflows only up
the available send window, any enqueue data is sent on the wire
instantly, without any blocking operation short of a drop in the tx path
at the nft or TC layer.

The individual patches changelog carry the gory details.

Paolo Abeni (7):
  mptcp: push ad DSS boundaries
  mptcp: fix snd_wnd initialization for passive socket
  mptcp: fix potential wake-up event loss
  mptcp: cleanup writer wake-up
  mptcp: avoid some duplicate code in socket option handling
  mptcp: implement TCP_NOTSENT_LOWAT support.
  mptcp: cleanup SOL_TCP handling

 net/mptcp/protocol.c | 51 ++++++++++++++++++++-----------
 net/mptcp/protocol.h | 57 ++++++++++++++++++++++++++++-------
 net/mptcp/sockopt.c  | 71 ++++++++++++++++++++------------------------
 3 files changed, 111 insertions(+), 68 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next 1/7] mptcp: push ad DSS boundaries
  2024-01-22 15:08 [PATCH mptcp-next 0/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
@ 2024-01-22 15:08 ` Paolo Abeni
  2024-02-08  1:55   ` Mat Martineau
  2024-01-22 15:08 ` [PATCH mptcp-next 2/7] mptcp: fix snd_wnd initialization for passive socket Paolo Abeni
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Paolo Abeni @ 2024-01-22 15:08 UTC (permalink / raw)
  To: mptcp

when inserting/creating skbs on the subflow write queue, if
the to-be-appended data don't is not contiguous to the tail data,
the protocol creates a new skb and prevent the TCP stack from
merging it later by setting the EOR marker.

Still no push flag is explicitly set at the end of previous GSO
packet, making the aggregation on the receiver side sub-optimal -
and packetdrill self-tests less predictable.

Explicitly mark the end of not contiguous DSS with the push flag.

Fixes: 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This could also land on next, as needed, dropping the fixes tag
---
 net/mptcp/protocol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 744b4d6f15f4..4d7131f96ebf 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1264,6 +1264,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 		mpext = mptcp_get_ext(skb);
 		if (!mptcp_skb_can_collapse_to(data_seq, skb, mpext)) {
 			TCP_SKB_CB(skb)->eor = 1;
+			tcp_mark_push(tcp_sk(ssk), skb);
 			goto alloc_skb;
 		}
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next 2/7] mptcp: fix snd_wnd initialization for passive socket
  2024-01-22 15:08 [PATCH mptcp-next 0/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
  2024-01-22 15:08 ` [PATCH mptcp-next 1/7] mptcp: push ad DSS boundaries Paolo Abeni
@ 2024-01-22 15:08 ` Paolo Abeni
  2024-01-22 15:08 ` [PATCH mptcp-next 3/7] mptcp: fix potential wake-up event loss Paolo Abeni
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2024-01-22 15:08 UTC (permalink / raw)
  To: mptcp

Such value should be inherited from the first subflow, but
passive sockets always used 'rsk_rcv_wnd'.

Fixes: 6f8a612a33e4 ("mptcp: keep track of advertised windows right edge")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4d7131f96ebf..7a47834bf6ac 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3218,7 +3218,7 @@ struct sock *mptcp_sk_clone_init(const struct sock *sk,
 	WRITE_ONCE(msk->write_seq, subflow_req->idsn + 1);
 	WRITE_ONCE(msk->snd_nxt, msk->write_seq);
 	WRITE_ONCE(msk->snd_una, msk->write_seq);
-	WRITE_ONCE(msk->wnd_end, msk->snd_nxt + req->rsk_rcv_wnd);
+	WRITE_ONCE(msk->wnd_end, msk->snd_nxt + tcp_sk(ssk)->snd_wnd);
 	msk->setsockopt_seq = mptcp_sk(sk)->setsockopt_seq;
 	mptcp_init_sched(msk, mptcp_sk(sk)->sched);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next 3/7] mptcp: fix potential wake-up event loss
  2024-01-22 15:08 [PATCH mptcp-next 0/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
  2024-01-22 15:08 ` [PATCH mptcp-next 1/7] mptcp: push ad DSS boundaries Paolo Abeni
  2024-01-22 15:08 ` [PATCH mptcp-next 2/7] mptcp: fix snd_wnd initialization for passive socket Paolo Abeni
@ 2024-01-22 15:08 ` Paolo Abeni
  2024-02-08  1:57   ` Mat Martineau
  2024-01-22 15:08 ` [PATCH mptcp-next 4/7] mptcp: cleanup writer wake-up Paolo Abeni
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Paolo Abeni @ 2024-01-22 15:08 UTC (permalink / raw)
  To: mptcp

After the blamed commit below, the send buffer auto-tuning can
happen after that the mptcp_propagate_sndbuf() completes - via
the delegated action infrastructure.

We must check for write space even after such change or we risk
missing the wake-up event.

Fixes: 8005184fd1ca ("mptcp: refactor sndbuf auto-tuning")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.h | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index d05ec76dd7c2..bbc490f36273 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -797,6 +797,16 @@ static inline bool mptcp_data_fin_enabled(const struct mptcp_sock *msk)
 	       READ_ONCE(msk->write_seq) == READ_ONCE(msk->snd_nxt);
 }
 
+static inline void mptcp_write_space(struct sock *sk)
+{
+	if (sk_stream_is_writeable(sk)) {
+		/* pairs with memory barrier in mptcp_poll */
+		smp_mb();
+		if (test_and_clear_bit(MPTCP_NOSPACE, &mptcp_sk(sk)->flags))
+			sk_stream_write_space(sk);
+	}
+}
+
 static inline void __mptcp_sync_sndbuf(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow;
@@ -815,6 +825,7 @@ static inline void __mptcp_sync_sndbuf(struct sock *sk)
 
 	/* the msk max wmem limit is <nr_subflows> * tcp wmem[2] */
 	WRITE_ONCE(sk->sk_sndbuf, new_sndbuf);
+	mptcp_write_space(sk);
 }
 
 /* The called held both the msk socket and the subflow socket locks,
@@ -845,14 +856,16 @@ static inline void mptcp_propagate_sndbuf(struct sock *sk, struct sock *ssk)
 	local_bh_enable();
 }
 
-static inline void mptcp_write_space(struct sock *sk)
+static inline void __mptcp_sync_sndnxt(struct sock *sk, u64 new_snd_nxt)
 {
-	if (sk_stream_is_writeable(sk)) {
-		/* pairs with memory barrier in mptcp_poll */
-		smp_mb();
-		if (test_and_clear_bit(MPTCP_NOSPACE, &mptcp_sk(sk)->flags))
-			sk_stream_write_space(sk);
-	}
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	if (!after64(new_snd_nxt, msk->snd_nxt))
+		return;
+
+	msk->bytes_sent += new_snd_nxt - msk->snd_nxt;
+	WRITE_ONCE(msk->snd_nxt, new_snd_nxt);
+	mptcp_write_space(sk);
 }
 
 void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next 4/7] mptcp: cleanup writer wake-up
  2024-01-22 15:08 [PATCH mptcp-next 0/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
                   ` (2 preceding siblings ...)
  2024-01-22 15:08 ` [PATCH mptcp-next 3/7] mptcp: fix potential wake-up event loss Paolo Abeni
@ 2024-01-22 15:08 ` Paolo Abeni
  2024-01-22 15:08 ` [PATCH mptcp-next 5/7] mptcp: avoid some duplicate code in socket option handling Paolo Abeni
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2024-01-22 15:08 UTC (permalink / raw)
  To: mptcp

After commit 5cf92bbadc58 ("mptcp: re-enable sndbuf autotune"), the
MPTCP_NOSPACE bit is redundant: it is always set and cleared together with
SOCK_NOSPACE.

Let's drop the first and always relay on the latter, dropping a bunch
of useless code.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 15 +++------------
 net/mptcp/protocol.h | 16 ++++++----------
 2 files changed, 9 insertions(+), 22 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 7a47834bf6ac..3599205fdceb 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1691,15 +1691,6 @@ static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool
 	}
 }
 
-static void mptcp_set_nospace(struct sock *sk)
-{
-	/* enable autotune */
-	set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-
-	/* will be cleared on avail space */
-	set_bit(MPTCP_NOSPACE, &mptcp_sk(sk)->flags);
-}
-
 static int mptcp_disconnect(struct sock *sk, int flags);
 
 static int mptcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg,
@@ -1873,7 +1864,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		continue;
 
 wait_for_memory:
-		mptcp_set_nospace(sk);
+		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		__mptcp_push_pending(sk, msg->msg_flags);
 		ret = sk_stream_wait_memory(sk, &timeo);
 		if (ret)
@@ -3898,8 +3889,8 @@ static __poll_t mptcp_check_writeable(struct mptcp_sock *msk)
 	if (sk_stream_is_writeable(sk))
 		return EPOLLOUT | EPOLLWRNORM;
 
-	mptcp_set_nospace(sk);
-	smp_mb__after_atomic(); /* msk->flags is changed by write_space cb */
+	set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+	smp_mb__after_atomic(); /* NOSPACE is changed by mptcp_write_space() */
 	if (sk_stream_is_writeable(sk))
 		return EPOLLOUT | EPOLLWRNORM;
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index bbc490f36273..3069d5c072b0 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -113,10 +113,9 @@
 #define MPTCP_RST_TRANSIENT	BIT(0)
 
 /* MPTCP socket atomic flags */
-#define MPTCP_NOSPACE		1
-#define MPTCP_WORK_RTX		2
-#define MPTCP_FALLBACK_DONE	4
-#define MPTCP_WORK_CLOSE_SUBFLOW 5
+#define MPTCP_WORK_RTX		1
+#define MPTCP_FALLBACK_DONE	2
+#define MPTCP_WORK_CLOSE_SUBFLOW 3
 
 /* MPTCP socket release cb flags */
 #define MPTCP_PUSH_PENDING	1
@@ -799,12 +798,9 @@ static inline bool mptcp_data_fin_enabled(const struct mptcp_sock *msk)
 
 static inline void mptcp_write_space(struct sock *sk)
 {
-	if (sk_stream_is_writeable(sk)) {
-		/* pairs with memory barrier in mptcp_poll */
-		smp_mb();
-		if (test_and_clear_bit(MPTCP_NOSPACE, &mptcp_sk(sk)->flags))
-			sk_stream_write_space(sk);
-	}
+	/* pairs with memory barrier in mptcp_poll */
+	smp_mb();
+	sk_stream_write_space(sk);
 }
 
 static inline void __mptcp_sync_sndbuf(struct sock *sk)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next 5/7] mptcp: avoid some duplicate code in socket option handling
  2024-01-22 15:08 [PATCH mptcp-next 0/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
                   ` (3 preceding siblings ...)
  2024-01-22 15:08 ` [PATCH mptcp-next 4/7] mptcp: cleanup writer wake-up Paolo Abeni
@ 2024-01-22 15:08 ` Paolo Abeni
  2024-01-22 15:08 ` [PATCH mptcp-next 6/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
  2024-01-22 15:08 ` [PATCH mptcp-next 7/7] mptcp: cleanup SOL_TCP handling Paolo Abeni
  6 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2024-01-22 15:08 UTC (permalink / raw)
  To: mptcp

The mptcp_get_int_option() helper is needless open-coded in a
couple of places, replace the duplicate code with the helper
call.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/sockopt.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index da37e4541a5d..ac37f6c5e2ed 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -629,13 +629,11 @@ static int mptcp_setsockopt_sol_tcp_cork(struct mptcp_sock *msk, sockptr_t optva
 {
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
-	int val;
-
-	if (optlen < sizeof(int))
-		return -EINVAL;
+	int val, ret;
 
-	if (copy_from_sockptr(&val, optval, sizeof(val)))
-		return -EFAULT;
+	ret = mptcp_get_int_option(msk, optval, optlen, &val);
+	if (ret)
+		return ret;
 
 	lock_sock(sk);
 	sockopt_seq_inc(msk);
@@ -659,13 +657,11 @@ static int mptcp_setsockopt_sol_tcp_nodelay(struct mptcp_sock *msk, sockptr_t op
 {
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
-	int val;
-
-	if (optlen < sizeof(int))
-		return -EINVAL;
+	int val, ret;
 
-	if (copy_from_sockptr(&val, optval, sizeof(val)))
-		return -EFAULT;
+	ret = mptcp_get_int_option(msk, optval, optlen, &val);
+	if (ret)
+		return ret;
 
 	lock_sock(sk);
 	sockopt_seq_inc(msk);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next 6/7] mptcp: implement TCP_NOTSENT_LOWAT support.
  2024-01-22 15:08 [PATCH mptcp-next 0/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
                   ` (4 preceding siblings ...)
  2024-01-22 15:08 ` [PATCH mptcp-next 5/7] mptcp: avoid some duplicate code in socket option handling Paolo Abeni
@ 2024-01-22 15:08 ` Paolo Abeni
  2024-02-08  2:12   ` Mat Martineau
  2024-01-22 15:08 ` [PATCH mptcp-next 7/7] mptcp: cleanup SOL_TCP handling Paolo Abeni
  6 siblings, 1 reply; 13+ messages in thread
From: Paolo Abeni @ 2024-01-22 15:08 UTC (permalink / raw)
  To: mptcp

Add support for such socket option storing the user-space provided
value in a new msk field, and using such data to implement the
_mptcp_stream_memory_free() helper, similar to the TCP one.

To avoid adding more indirect calls in the fast path, instead of
hooking the new helper via the sk_stream_memory_free sock cb,
add the directly calls where needed.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/464
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 33 ++++++++++++++++++++++++++++-----
 net/mptcp/protocol.h | 28 +++++++++++++++++++++++++++-
 net/mptcp/sockopt.c  | 12 ++++++++++++
 3 files changed, 67 insertions(+), 6 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 3599205fdceb..53d6c5544900 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1761,6 +1761,25 @@ static int do_copy_data_nocache(struct sock *sk, int copy,
 	return 0;
 }
 
+static u32 mptcp_send_limit(const struct sock *sk)
+{
+	const struct mptcp_sock *msk = mptcp_sk(sk);
+	u32 limit, not_sent;
+
+	if (!sk_stream_memory_free(sk))
+		return 0;
+
+	limit = mptcp_notsent_lowat(sk);
+	if (limit == UINT_MAX)
+		return UINT_MAX;
+
+	not_sent = msk->write_seq - msk->snd_nxt;
+	if (not_sent >= limit)
+		return 0;
+
+	return limit - not_sent;
+}
+
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -1805,6 +1824,12 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		struct mptcp_data_frag *dfrag;
 		bool dfrag_collapsed;
 		size_t psize, offset;
+		u32 copy_limit;
+
+		/* ensure fitting the notsent_lowat() constraint */
+		copy_limit = mptcp_send_limit(sk);
+		if (!copy_limit)
+			goto wait_for_memory;
 
 		/* reuse tail pfrag, if possible, or carve a new one from the
 		 * page allocator
@@ -1812,9 +1837,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		dfrag = mptcp_pending_tail(sk);
 		dfrag_collapsed = mptcp_frag_can_collapse_to(msk, pfrag, dfrag);
 		if (!dfrag_collapsed) {
-			if (!sk_stream_memory_free(sk))
-				goto wait_for_memory;
-
 			if (!mptcp_page_frag_refill(sk, pfrag))
 				goto wait_for_memory;
 
@@ -1829,6 +1851,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		offset = dfrag->offset + dfrag->data_len;
 		psize = pfrag->size - offset;
 		psize = min_t(size_t, psize, msg_data_left(msg));
+		psize = min_t(size_t, psize, copy_limit);
 		total_ts = psize + frag_truesize;
 
 		if (!sk_wmem_schedule(sk, total_ts))
@@ -3886,12 +3909,12 @@ static __poll_t mptcp_check_writeable(struct mptcp_sock *msk)
 {
 	struct sock *sk = (struct sock *)msk;
 
-	if (sk_stream_is_writeable(sk))
+	if (__mptcp_stream_is_writeable(sk, 1))
 		return EPOLLOUT | EPOLLWRNORM;
 
 	set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 	smp_mb__after_atomic(); /* NOSPACE is changed by mptcp_write_space() */
-	if (sk_stream_is_writeable(sk))
+	if (__mptcp_stream_is_writeable(sk, 1))
 		return EPOLLOUT | EPOLLWRNORM;
 
 	return 0;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 3069d5c072b0..4a32d3d11fb6 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -307,6 +307,7 @@ struct mptcp_sock {
 			in_accept_queue:1,
 			free_first:1,
 			rcvspace_init:1;
+	u32		notsent_lowat;
 	struct work_struct work;
 	struct sk_buff  *ooo_last_skb;
 	struct rb_root  out_of_order_queue;
@@ -796,11 +797,36 @@ static inline bool mptcp_data_fin_enabled(const struct mptcp_sock *msk)
 	       READ_ONCE(msk->write_seq) == READ_ONCE(msk->snd_nxt);
 }
 
+static inline u32 mptcp_notsent_lowat(const struct sock *sk)
+{
+	struct net *net = sock_net(sk);
+	u32 val;
+
+	val = READ_ONCE(mptcp_sk(sk)->notsent_lowat);
+	return val ?: READ_ONCE(net->ipv4.sysctl_tcp_notsent_lowat);
+}
+
+static inline bool __mptcp_stream_memory_free(const struct sock *sk, int wake)
+{
+	const struct mptcp_sock *msk = mptcp_sk(sk);
+	u32 notsent_bytes;
+
+	notsent_bytes = READ_ONCE(msk->write_seq) - READ_ONCE(msk->snd_nxt);
+	return (notsent_bytes << wake) < mptcp_notsent_lowat(sk);
+}
+
+static inline bool __mptcp_stream_is_writeable(const struct sock *sk, int wake)
+{
+	return __mptcp_stream_memory_free(sk, wake) &&
+	       __sk_stream_is_writeable(sk, wake);
+}
+
 static inline void mptcp_write_space(struct sock *sk)
 {
 	/* pairs with memory barrier in mptcp_poll */
 	smp_mb();
-	sk_stream_write_space(sk);
+	if (__mptcp_stream_memory_free(sk, 1))
+		sk_stream_write_space(sk);
 }
 
 static inline void __mptcp_sync_sndbuf(struct sock *sk)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index ac37f6c5e2ed..1b38dac70719 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -812,6 +812,16 @@ static int mptcp_setsockopt_sol_tcp(struct mptcp_sock *msk, int optname,
 		return 0;
 	case TCP_ULP:
 		return -EOPNOTSUPP;
+	case TCP_NOTSENT_LOWAT:
+		ret = mptcp_get_int_option(msk, optval, optlen, &val);
+		if (ret)
+			return ret;
+
+		lock_sock(sk);
+		WRITE_ONCE(msk->notsent_lowat, val);
+		mptcp_write_space(sk);
+		release_sock(sk);
+		return 0;
 	case TCP_CONGESTION:
 		return mptcp_setsockopt_sol_tcp_congestion(msk, optval, optlen);
 	case TCP_CORK:
@@ -1345,6 +1355,8 @@ static int mptcp_getsockopt_sol_tcp(struct mptcp_sock *msk, int optname,
 		return mptcp_put_int_option(msk, optval, optlen, msk->cork);
 	case TCP_NODELAY:
 		return mptcp_put_int_option(msk, optval, optlen, msk->nodelay);
+	case TCP_NOTSENT_LOWAT:
+		return mptcp_put_int_option(msk, optval, optlen, msk->notsent_lowat);
 	}
 	return -EOPNOTSUPP;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH mptcp-next 7/7] mptcp: cleanup SOL_TCP handling
  2024-01-22 15:08 [PATCH mptcp-next 0/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
                   ` (5 preceding siblings ...)
  2024-01-22 15:08 ` [PATCH mptcp-next 6/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
@ 2024-01-22 15:08 ` Paolo Abeni
  6 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2024-01-22 15:08 UTC (permalink / raw)
  To: mptcp

Most TCP-level socket options get an integer from user space, and
set the corresponding field under the msk-level socket lock.

Reduce the code duplication moving such operations in the common code.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/sockopt.c | 75 ++++++++++++++++++---------------------------
 1 file changed, 30 insertions(+), 45 deletions(-)

diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 1b38dac70719..dcd1c76d2a3b 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -624,18 +624,11 @@ static int mptcp_setsockopt_sol_tcp_congestion(struct mptcp_sock *msk, sockptr_t
 	return ret;
 }
 
-static int mptcp_setsockopt_sol_tcp_cork(struct mptcp_sock *msk, sockptr_t optval,
-					 unsigned int optlen)
+static int __mptcp_setsockopt_sol_tcp_cork(struct mptcp_sock *msk, int val)
 {
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
-	int val, ret;
 
-	ret = mptcp_get_int_option(msk, optval, optlen, &val);
-	if (ret)
-		return ret;
-
-	lock_sock(sk);
 	sockopt_seq_inc(msk);
 	msk->cork = !!val;
 	mptcp_for_each_subflow(msk, subflow) {
@@ -647,23 +640,15 @@ static int mptcp_setsockopt_sol_tcp_cork(struct mptcp_sock *msk, sockptr_t optva
 	}
 	if (!val)
 		mptcp_check_and_set_pending(sk);
-	release_sock(sk);
 
 	return 0;
 }
 
-static int mptcp_setsockopt_sol_tcp_nodelay(struct mptcp_sock *msk, sockptr_t optval,
-					    unsigned int optlen)
+static int __mptcp_setsockopt_sol_tcp_nodelay(struct mptcp_sock *msk, int val)
 {
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
-	int val, ret;
 
-	ret = mptcp_get_int_option(msk, optval, optlen, &val);
-	if (ret)
-		return ret;
-
-	lock_sock(sk);
 	sockopt_seq_inc(msk);
 	msk->nodelay = !!val;
 	mptcp_for_each_subflow(msk, subflow) {
@@ -675,8 +660,6 @@ static int mptcp_setsockopt_sol_tcp_nodelay(struct mptcp_sock *msk, sockptr_t op
 	}
 	if (val)
 		mptcp_check_and_set_pending(sk);
-	release_sock(sk);
-
 	return 0;
 }
 
@@ -799,35 +782,10 @@ static int mptcp_setsockopt_sol_tcp(struct mptcp_sock *msk, int optname,
 	int ret, val;
 
 	switch (optname) {
-	case TCP_INQ:
-		ret = mptcp_get_int_option(msk, optval, optlen, &val);
-		if (ret)
-			return ret;
-		if (val < 0 || val > 1)
-			return -EINVAL;
-
-		lock_sock(sk);
-		msk->recvmsg_inq = !!val;
-		release_sock(sk);
-		return 0;
 	case TCP_ULP:
 		return -EOPNOTSUPP;
-	case TCP_NOTSENT_LOWAT:
-		ret = mptcp_get_int_option(msk, optval, optlen, &val);
-		if (ret)
-			return ret;
-
-		lock_sock(sk);
-		WRITE_ONCE(msk->notsent_lowat, val);
-		mptcp_write_space(sk);
-		release_sock(sk);
-		return 0;
 	case TCP_CONGESTION:
 		return mptcp_setsockopt_sol_tcp_congestion(msk, optval, optlen);
-	case TCP_CORK:
-		return mptcp_setsockopt_sol_tcp_cork(msk, optval, optlen);
-	case TCP_NODELAY:
-		return mptcp_setsockopt_sol_tcp_nodelay(msk, optval, optlen);
 	case TCP_DEFER_ACCEPT:
 		/* See tcp.c: TCP_DEFER_ACCEPT does not fail */
 		mptcp_setsockopt_first_sf_only(msk, SOL_TCP, optname, optval, optlen);
@@ -840,7 +798,34 @@ static int mptcp_setsockopt_sol_tcp(struct mptcp_sock *msk, int optname,
 						      optval, optlen);
 	}
 
-	return -EOPNOTSUPP;
+	ret = mptcp_get_int_option(msk, optval, optlen, &val);
+	if (ret)
+		return ret;
+
+	lock_sock(sk);
+	switch (optname) {
+	case TCP_INQ:
+		if (val < 0 || val > 1)
+			ret = -EINVAL;
+		else
+			msk->recvmsg_inq = !!val;
+		break;
+	case TCP_NOTSENT_LOWAT:
+		WRITE_ONCE(msk->notsent_lowat, val);
+		mptcp_write_space(sk);
+		break;
+	case TCP_CORK:
+		ret = __mptcp_setsockopt_sol_tcp_cork(msk, val);
+		break;
+	case TCP_NODELAY:
+		ret = __mptcp_setsockopt_sol_tcp_nodelay(msk, val);
+		break;
+	default:
+		ret = -ENOPROTOOPT;
+	}
+
+	release_sock(sk);
+	return ret;
 }
 
 int mptcp_setsockopt(struct sock *sk, int level, int optname,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH mptcp-next 1/7] mptcp: push ad DSS boundaries
  2024-01-22 15:08 ` [PATCH mptcp-next 1/7] mptcp: push ad DSS boundaries Paolo Abeni
@ 2024-02-08  1:55   ` Mat Martineau
  0 siblings, 0 replies; 13+ messages in thread
From: Mat Martineau @ 2024-02-08  1:55 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

On Mon, 22 Jan 2024, Paolo Abeni wrote:

> when inserting/creating skbs on the subflow write queue, if
> the to-be-appended data don't is not contiguous to the tail data,

Hi Paolo -

Looks like the "don't" here is extra, can you clarify this sentence?

> the protocol creates a new skb and prevent the TCP stack from
> merging it later by setting the EOR marker.
>
> Still no push flag is explicitly set at the end of previous GSO
> packet, making the aggregation on the receiver side sub-optimal -
> and packetdrill self-tests less predictable.
>
> Explicitly mark the end of not contiguous DSS with the push flag.
>

I agree with the approach, gro will never benefit from waiting around for 
the next packet as it will have a new DSS header and can never be 
combined.

- Mat

> Fixes: 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> This could also land on next, as needed, dropping the fixes tag
> ---
> net/mptcp/protocol.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 744b4d6f15f4..4d7131f96ebf 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1264,6 +1264,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> 		mpext = mptcp_get_ext(skb);
> 		if (!mptcp_skb_can_collapse_to(data_seq, skb, mpext)) {
> 			TCP_SKB_CB(skb)->eor = 1;
> +			tcp_mark_push(tcp_sk(ssk), skb);
> 			goto alloc_skb;
> 		}
>
> -- 
> 2.43.0
>
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mptcp-next 3/7] mptcp: fix potential wake-up event loss
  2024-01-22 15:08 ` [PATCH mptcp-next 3/7] mptcp: fix potential wake-up event loss Paolo Abeni
@ 2024-02-08  1:57   ` Mat Martineau
  2024-02-12 11:09     ` Paolo Abeni
  0 siblings, 1 reply; 13+ messages in thread
From: Mat Martineau @ 2024-02-08  1:57 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

On Mon, 22 Jan 2024, Paolo Abeni wrote:

> After the blamed commit below, the send buffer auto-tuning can
> happen after that the mptcp_propagate_sndbuf() completes - via
> the delegated action infrastructure.
>
> We must check for write space even after such change or we risk
> missing the wake-up event.
>
> Fixes: 8005184fd1ca ("mptcp: refactor sndbuf auto-tuning")

Hi Paolo -

These first three patches have "Fixes:" tags, what do you think about 
applying them to mptcp-net instead of mptcp-next?

> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/mptcp/protocol.h | 27 ++++++++++++++++++++-------
> 1 file changed, 20 insertions(+), 7 deletions(-)
>
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index d05ec76dd7c2..bbc490f36273 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -797,6 +797,16 @@ static inline bool mptcp_data_fin_enabled(const struct mptcp_sock *msk)
> 	       READ_ONCE(msk->write_seq) == READ_ONCE(msk->snd_nxt);
> }
>
> +static inline void mptcp_write_space(struct sock *sk)
> +{
> +	if (sk_stream_is_writeable(sk)) {
> +		/* pairs with memory barrier in mptcp_poll */
> +		smp_mb();
> +		if (test_and_clear_bit(MPTCP_NOSPACE, &mptcp_sk(sk)->flags))
> +			sk_stream_write_space(sk);
> +	}
> +}
> +
> static inline void __mptcp_sync_sndbuf(struct sock *sk)
> {
> 	struct mptcp_subflow_context *subflow;
> @@ -815,6 +825,7 @@ static inline void __mptcp_sync_sndbuf(struct sock *sk)
>
> 	/* the msk max wmem limit is <nr_subflows> * tcp wmem[2] */
> 	WRITE_ONCE(sk->sk_sndbuf, new_sndbuf);
> +	mptcp_write_space(sk);
> }
>
> /* The called held both the msk socket and the subflow socket locks,
> @@ -845,14 +856,16 @@ static inline void mptcp_propagate_sndbuf(struct sock *sk, struct sock *ssk)
> 	local_bh_enable();
> }
>
> -static inline void mptcp_write_space(struct sock *sk)
> +static inline void __mptcp_sync_sndnxt(struct sock *sk, u64 new_snd_nxt)

This is never called - extra code from an experiment or did some delegated 
action code not get added?

- Mat

> {
> -	if (sk_stream_is_writeable(sk)) {
> -		/* pairs with memory barrier in mptcp_poll */
> -		smp_mb();
> -		if (test_and_clear_bit(MPTCP_NOSPACE, &mptcp_sk(sk)->flags))
> -			sk_stream_write_space(sk);
> -	}
> +	struct mptcp_sock *msk = mptcp_sk(sk);
> +
> +	if (!after64(new_snd_nxt, msk->snd_nxt))
> +		return;
> +
> +	msk->bytes_sent += new_snd_nxt - msk->snd_nxt;
> +	WRITE_ONCE(msk->snd_nxt, new_snd_nxt);
> +	mptcp_write_space(sk);
> }
>
> void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags);
> -- 
> 2.43.0
>
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mptcp-next 6/7] mptcp: implement TCP_NOTSENT_LOWAT support.
  2024-01-22 15:08 ` [PATCH mptcp-next 6/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
@ 2024-02-08  2:12   ` Mat Martineau
  2024-02-12 11:27     ` Paolo Abeni
  0 siblings, 1 reply; 13+ messages in thread
From: Mat Martineau @ 2024-02-08  2:12 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

On Mon, 22 Jan 2024, Paolo Abeni wrote:

> Add support for such socket option storing the user-space provided
> value in a new msk field, and using such data to implement the
> _mptcp_stream_memory_free() helper, similar to the TCP one.
>
> To avoid adding more indirect calls in the fast path, instead of
> hooking the new helper via the sk_stream_memory_free sock cb,
> add the directly calls where needed.
>
> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/464
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/mptcp/protocol.c | 33 ++++++++++++++++++++++++++++-----
> net/mptcp/protocol.h | 28 +++++++++++++++++++++++++++-
> net/mptcp/sockopt.c  | 12 ++++++++++++
> 3 files changed, 67 insertions(+), 6 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 3599205fdceb..53d6c5544900 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1761,6 +1761,25 @@ static int do_copy_data_nocache(struct sock *sk, int copy,
> 	return 0;
> }
>
> +static u32 mptcp_send_limit(const struct sock *sk)
> +{
> +	const struct mptcp_sock *msk = mptcp_sk(sk);
> +	u32 limit, not_sent;
> +
> +	if (!sk_stream_memory_free(sk))
> +		return 0;
> +
> +	limit = mptcp_notsent_lowat(sk);
> +	if (limit == UINT_MAX)
> +		return UINT_MAX;
> +
> +	not_sent = msk->write_seq - msk->snd_nxt;
> +	if (not_sent >= limit)
> +		return 0;
> +
> +	return limit - not_sent;
> +}
> +
> static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> {
> 	struct mptcp_sock *msk = mptcp_sk(sk);
> @@ -1805,6 +1824,12 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> 		struct mptcp_data_frag *dfrag;
> 		bool dfrag_collapsed;
> 		size_t psize, offset;
> +		u32 copy_limit;
> +
> +		/* ensure fitting the notsent_lowat() constraint */
> +		copy_limit = mptcp_send_limit(sk);
> +		if (!copy_limit)
> +			goto wait_for_memory;
>
> 		/* reuse tail pfrag, if possible, or carve a new one from the
> 		 * page allocator
> @@ -1812,9 +1837,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> 		dfrag = mptcp_pending_tail(sk);
> 		dfrag_collapsed = mptcp_frag_can_collapse_to(msk, pfrag, dfrag);
> 		if (!dfrag_collapsed) {
> -			if (!sk_stream_memory_free(sk))
> -				goto wait_for_memory;
> -
> 			if (!mptcp_page_frag_refill(sk, pfrag))
> 				goto wait_for_memory;
>
> @@ -1829,6 +1851,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> 		offset = dfrag->offset + dfrag->data_len;
> 		psize = pfrag->size - offset;
> 		psize = min_t(size_t, psize, msg_data_left(msg));
> +		psize = min_t(size_t, psize, copy_limit);


Hi Paolo -

Any concern that this will cause us to start sending short packets if 
mptcp_send_limit() starts returning small (nonzero) values?

If copy_limit is small on the first iteration of this loop, it looks like 
a small segment will be created. When that gets acked, a small amount 
becomes available and then we create another small segment (repeat...).

On the first pass (or on the 'wait_for_memory' path) should something like 
__mptcp_stream_is_writeable(sk, 1) be used to make sure a reasonable chunk 
of buffer is available before proceeding?

Or is that the approach to this feature that got too complicated? :)


- Mat


> 		total_ts = psize + frag_truesize;
>
> 		if (!sk_wmem_schedule(sk, total_ts))
> @@ -3886,12 +3909,12 @@ static __poll_t mptcp_check_writeable(struct mptcp_sock *msk)
> {
> 	struct sock *sk = (struct sock *)msk;
>
> -	if (sk_stream_is_writeable(sk))
> +	if (__mptcp_stream_is_writeable(sk, 1))
> 		return EPOLLOUT | EPOLLWRNORM;
>
> 	set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> 	smp_mb__after_atomic(); /* NOSPACE is changed by mptcp_write_space() */
> -	if (sk_stream_is_writeable(sk))
> +	if (__mptcp_stream_is_writeable(sk, 1))
> 		return EPOLLOUT | EPOLLWRNORM;
>
> 	return 0;
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index 3069d5c072b0..4a32d3d11fb6 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -307,6 +307,7 @@ struct mptcp_sock {
> 			in_accept_queue:1,
> 			free_first:1,
> 			rcvspace_init:1;
> +	u32		notsent_lowat;
> 	struct work_struct work;
> 	struct sk_buff  *ooo_last_skb;
> 	struct rb_root  out_of_order_queue;
> @@ -796,11 +797,36 @@ static inline bool mptcp_data_fin_enabled(const struct mptcp_sock *msk)
> 	       READ_ONCE(msk->write_seq) == READ_ONCE(msk->snd_nxt);
> }
>
> +static inline u32 mptcp_notsent_lowat(const struct sock *sk)
> +{
> +	struct net *net = sock_net(sk);
> +	u32 val;
> +
> +	val = READ_ONCE(mptcp_sk(sk)->notsent_lowat);
> +	return val ?: READ_ONCE(net->ipv4.sysctl_tcp_notsent_lowat);
> +}
> +
> +static inline bool __mptcp_stream_memory_free(const struct sock *sk, int wake)
> +{
> +	const struct mptcp_sock *msk = mptcp_sk(sk);
> +	u32 notsent_bytes;
> +
> +	notsent_bytes = READ_ONCE(msk->write_seq) - READ_ONCE(msk->snd_nxt);
> +	return (notsent_bytes << wake) < mptcp_notsent_lowat(sk);
> +}
> +
> +static inline bool __mptcp_stream_is_writeable(const struct sock *sk, int wake)
> +{
> +	return __mptcp_stream_memory_free(sk, wake) &&
> +	       __sk_stream_is_writeable(sk, wake);
> +}
> +
> static inline void mptcp_write_space(struct sock *sk)
> {
> 	/* pairs with memory barrier in mptcp_poll */
> 	smp_mb();
> -	sk_stream_write_space(sk);
> +	if (__mptcp_stream_memory_free(sk, 1))
> +		sk_stream_write_space(sk);
> }
>
> static inline void __mptcp_sync_sndbuf(struct sock *sk)
> diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
> index ac37f6c5e2ed..1b38dac70719 100644
> --- a/net/mptcp/sockopt.c
> +++ b/net/mptcp/sockopt.c
> @@ -812,6 +812,16 @@ static int mptcp_setsockopt_sol_tcp(struct mptcp_sock *msk, int optname,
> 		return 0;
> 	case TCP_ULP:
> 		return -EOPNOTSUPP;
> +	case TCP_NOTSENT_LOWAT:
> +		ret = mptcp_get_int_option(msk, optval, optlen, &val);
> +		if (ret)
> +			return ret;
> +
> +		lock_sock(sk);
> +		WRITE_ONCE(msk->notsent_lowat, val);
> +		mptcp_write_space(sk);
> +		release_sock(sk);
> +		return 0;
> 	case TCP_CONGESTION:
> 		return mptcp_setsockopt_sol_tcp_congestion(msk, optval, optlen);
> 	case TCP_CORK:
> @@ -1345,6 +1355,8 @@ static int mptcp_getsockopt_sol_tcp(struct mptcp_sock *msk, int optname,
> 		return mptcp_put_int_option(msk, optval, optlen, msk->cork);
> 	case TCP_NODELAY:
> 		return mptcp_put_int_option(msk, optval, optlen, msk->nodelay);
> +	case TCP_NOTSENT_LOWAT:
> +		return mptcp_put_int_option(msk, optval, optlen, msk->notsent_lowat);
> 	}
> 	return -EOPNOTSUPP;
> }
> -- 
> 2.43.0
>
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mptcp-next 3/7] mptcp: fix potential wake-up event loss
  2024-02-08  1:57   ` Mat Martineau
@ 2024-02-12 11:09     ` Paolo Abeni
  0 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2024-02-12 11:09 UTC (permalink / raw)
  To: Mat Martineau; +Cc: mptcp

On Wed, 2024-02-07 at 17:57 -0800, Mat Martineau wrote:
> On Mon, 22 Jan 2024, Paolo Abeni wrote:
> 
> > After the blamed commit below, the send buffer auto-tuning can
> > happen after that the mptcp_propagate_sndbuf() completes - via
> > the delegated action infrastructure.
> > 
> > We must check for write space even after such change or we risk
> > missing the wake-up event.
> > 
> > Fixes: 8005184fd1ca ("mptcp: refactor sndbuf auto-tuning")
> 
> Hi Paolo -
> 
> These first three patches have "Fixes:" tags, what do you think about 
> applying them to mptcp-net instead of mptcp-next?


I'm fine with that. I send them together because are needed to allow
TCP_NOTSENT_LOWAT working in a reliable way.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mptcp-next 6/7] mptcp: implement TCP_NOTSENT_LOWAT support.
  2024-02-08  2:12   ` Mat Martineau
@ 2024-02-12 11:27     ` Paolo Abeni
  0 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2024-02-12 11:27 UTC (permalink / raw)
  To: Mat Martineau; +Cc: mptcp

On Wed, 2024-02-07 at 18:12 -0800, Mat Martineau wrote:
> On Mon, 22 Jan 2024, Paolo Abeni wrote:
> 
> > Add support for such socket option storing the user-space provided
> > value in a new msk field, and using such data to implement the
> > _mptcp_stream_memory_free() helper, similar to the TCP one.
> > 
> > To avoid adding more indirect calls in the fast path, instead of
> > hooking the new helper via the sk_stream_memory_free sock cb,
> > add the directly calls where needed.
> > 
> > Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/464
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> > net/mptcp/protocol.c | 33 ++++++++++++++++++++++++++++-----
> > net/mptcp/protocol.h | 28 +++++++++++++++++++++++++++-
> > net/mptcp/sockopt.c  | 12 ++++++++++++
> > 3 files changed, 67 insertions(+), 6 deletions(-)
> > 
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index 3599205fdceb..53d6c5544900 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -1761,6 +1761,25 @@ static int do_copy_data_nocache(struct sock *sk, int copy,
> > 	return 0;
> > }
> > 
> > +static u32 mptcp_send_limit(const struct sock *sk)
> > +{
> > +	const struct mptcp_sock *msk = mptcp_sk(sk);
> > +	u32 limit, not_sent;
> > +
> > +	if (!sk_stream_memory_free(sk))
> > +		return 0;
> > +
> > +	limit = mptcp_notsent_lowat(sk);
> > +	if (limit == UINT_MAX)
> > +		return UINT_MAX;
> > +
> > +	not_sent = msk->write_seq - msk->snd_nxt;
> > +	if (not_sent >= limit)
> > +		return 0;
> > +
> > +	return limit - not_sent;
> > +}
> > +
> > static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > {
> > 	struct mptcp_sock *msk = mptcp_sk(sk);
> > @@ -1805,6 +1824,12 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > 		struct mptcp_data_frag *dfrag;
> > 		bool dfrag_collapsed;
> > 		size_t psize, offset;
> > +		u32 copy_limit;
> > +
> > +		/* ensure fitting the notsent_lowat() constraint */
> > +		copy_limit = mptcp_send_limit(sk);
> > +		if (!copy_limit)
> > +			goto wait_for_memory;
> > 
> > 		/* reuse tail pfrag, if possible, or carve a new one from the
> > 		 * page allocator
> > @@ -1812,9 +1837,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > 		dfrag = mptcp_pending_tail(sk);
> > 		dfrag_collapsed = mptcp_frag_can_collapse_to(msk, pfrag, dfrag);
> > 		if (!dfrag_collapsed) {
> > -			if (!sk_stream_memory_free(sk))
> > -				goto wait_for_memory;
> > -
> > 			if (!mptcp_page_frag_refill(sk, pfrag))
> > 				goto wait_for_memory;
> > 
> > @@ -1829,6 +1851,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > 		offset = dfrag->offset + dfrag->data_len;
> > 		psize = pfrag->size - offset;
> > 		psize = min_t(size_t, psize, msg_data_left(msg));
> > +		psize = min_t(size_t, psize, copy_limit);
> 
> 
> Hi Paolo -
> 
> Any concern that this will cause us to start sending short packets if 
> mptcp_send_limit() starts returning small (nonzero) values?
> 
> If copy_limit is small on the first iteration of this loop, it looks like 
> a small segment will be created. When that gets acked, a small amount 
> becomes available and then we create another small segment (repeat...).

The idea is to align the mptcp behavior with the one exposed by plain
TCP.

Note that wait_for_memory/sk_stream_wait_memory() should be woken-up
only after __mptcp_stream_memory_free(sk, 1), see mptcp_write_space()
below...

> On the first pass (or on the 'wait_for_memory' path) should something like 
> __mptcp_stream_is_writeable(sk, 1) be used to make sure a reasonable chunk 
> of buffer is available before proceeding?

... so the above should be already guaranteed (and plain TCP does not
check for that in tcp_sendmsg_locked/wait_for_space.

Anyway I think you outlined a bug here: mptcp_sendmsg() waits for
mptcp_notsent_lowat/__mptcp_stream_memory_free(sk, 0) but I explicitly
avoided to set sk_prot->stream_memory_free, so sk_stream_wait_memory()
could wake the reader too early.

I'll send a v2 addressing the above.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-02-12 11:27 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-22 15:08 [PATCH mptcp-next 0/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
2024-01-22 15:08 ` [PATCH mptcp-next 1/7] mptcp: push ad DSS boundaries Paolo Abeni
2024-02-08  1:55   ` Mat Martineau
2024-01-22 15:08 ` [PATCH mptcp-next 2/7] mptcp: fix snd_wnd initialization for passive socket Paolo Abeni
2024-01-22 15:08 ` [PATCH mptcp-next 3/7] mptcp: fix potential wake-up event loss Paolo Abeni
2024-02-08  1:57   ` Mat Martineau
2024-02-12 11:09     ` Paolo Abeni
2024-01-22 15:08 ` [PATCH mptcp-next 4/7] mptcp: cleanup writer wake-up Paolo Abeni
2024-01-22 15:08 ` [PATCH mptcp-next 5/7] mptcp: avoid some duplicate code in socket option handling Paolo Abeni
2024-01-22 15:08 ` [PATCH mptcp-next 6/7] mptcp: implement TCP_NOTSENT_LOWAT support Paolo Abeni
2024-02-08  2:12   ` Mat Martineau
2024-02-12 11:27     ` Paolo Abeni
2024-01-22 15:08 ` [PATCH mptcp-next 7/7] mptcp: cleanup SOL_TCP handling Paolo Abeni

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.