* [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2
@ 2022-11-16 11:43 Geliang Tang
  2022-11-16 11:43 ` [PATCH mptcp-next v20 1/7] mptcp: add scheduler wrappers Geliang Tang
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Geliang Tang @ 2022-11-16 11:43 UTC
  To: mptcp; +Cc: Geliang Tang

v20:
- rebased on "Squash to "mptcp: refactor push_pending logic" v19"

v19:
- patch 1, use 'continue' instead of 'goto again'.

v18:
 - some cleanups
 - update commit logs.

v17:
 - address Mat's comments in v16.
 - rebase to export/20221108T055508.

v16:
- keep last_snd and snd_burst in struct mptcp_sock.
- drop "mptcp: register default scheduler".
- drop "mptcp: add scheduler wrappers", move it into "mptcp: use
get_send wrapper" and "mptcp: use get_retrans wrapper".
- depends on 'v2, Revert "mptcp: add get_subflow wrappers" - fix
divide error in mptcp_subflow_get_send'

v15:
 1: "refactor push pending" v10
 2-11: "register default scheduler" v3
  - move last_snd and snd_burst into struct mptcp_sched_ops
 12-19: "BPF redundant scheduler" v15
  - split "use get_send wrapper" into two patches
 - rebase to export/20221021T061837.

v14:
- add "mptcp: refactor push_pending logic" v10 as patch 1
- drop update_first_pending in patch 4
- drop update_already_sent in patch 5

v13:
- depends on "refactor push pending" v9.
- Simply 'goto out' after invoking mptcp_subflow_delegate in patch 1.
- All selftests (mptcp_connect.sh, mptcp_join.sh and simult_flows.sh) passed.

v12:
 - fix WARN_ON_ONCE(reuse_skb) and WARN_ON_ONCE(!msk->recovery) errors
   in kernel logs.

v11:
 - address Mat's comments in v10.
 - rebase to export/20220908T063452

v10:
 - send multiple dfrags in __mptcp_push_pending().

v9:
 - drop the extra *err parameter of mptcp_sched_get_send() as Florian
   suggested.

v8:
 - update __mptcp_push_pending(), send the same data on each subflow.
 - update __mptcp_retrans, track the max sent data.
 - add a new patch.

v7:
 - drop redundant flag in v6
 - drop __mptcp_subflows_push_pending in v6
 - update redundant subflows support in __mptcp_push_pending
 - update redundant subflows support in __mptcp_retrans

v6:
 - Add redundant flag for struct mptcp_sched_ops.
 - add a dedicated function __mptcp_subflows_push_pending() to deal with
   redundant subflows push pending.

v5:
 - address Paolo's comment: keep the optimization in
   mptcp_subflow_get_send() for the non-eBPF case.
 - merge mptcp_sched_get_send() and __mptcp_sched_get_send() in v4 into one.
 - depends on "cleanups for bpf sched selftests".

v4:
 - small cleanups in patch 1, 2.
 - add TODO in patch 3.
 - rebase patch 5 on 'cleanups for bpf sched selftests'.

v3:
 - use new API.
 - fix the link failure tests issue mentioned in
   https://patchwork.kernel.org/project/mptcp/cover/cover.1653033459.git.geliang.tang@suse.com/

v2:
 - add MPTCP_SUBFLOWS_MAX limit to avoid infinite loops when the
   scheduler always sets call_again to true.
 - track the largest copied amount.
 - deal with __mptcp_subflow_push_pending() and the retransmit loop.
 - depends on "BPF round-robin scheduler" v14.

v1:

Implements the redundant BPF MPTCP scheduler, which sends all packets
redundantly on all available subflows.

Geliang Tang (7):
  mptcp: add scheduler wrappers
  mptcp: use get_send wrapper
  mptcp: use get_retrans wrapper
  mptcp: delay updating first_pending
  mptcp: delay updating already_sent
  selftests/bpf: Add bpf_red scheduler
  selftests/bpf: Add bpf_red test

 net/mptcp/protocol.c                          | 242 ++++++++++++------
 net/mptcp/protocol.h                          |  18 +-
 net/mptcp/sched.c                             |  67 +++++
 .../testing/selftests/bpf/prog_tests/mptcp.c  |  34 +++
 .../selftests/bpf/progs/mptcp_bpf_red.c       |  45 ++++
 5 files changed, 320 insertions(+), 86 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/mptcp_bpf_red.c

-- 
2.35.3


* [PATCH mptcp-next v20 1/7] mptcp: add scheduler wrappers
  2022-11-16 11:43 [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Geliang Tang
@ 2022-11-16 11:43 ` Geliang Tang
  2022-11-16 11:43 ` [PATCH mptcp-next v20 2/7] mptcp: use get_send wrapper Geliang Tang
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Geliang Tang @ 2022-11-16 11:43 UTC
  To: mptcp; +Cc: Geliang Tang

This patch defines two packet scheduler wrappers, mptcp_sched_get_send()
and mptcp_sched_get_retrans(), which invoke data_init() and get_subflow()
of msk->sched. If a subflow is already marked as scheduled, both wrappers
return 0 without consulting the scheduler again.

Set data->reinject to true in mptcp_sched_get_retrans() and to false in
mptcp_sched_get_send().

If msk->sched is NULL, fall back to the default functions
mptcp_subflow_get_send() and mptcp_subflow_get_retrans() (no longer
static) to pick a subflow, and mark that subflow as scheduled.
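The dispatch order of the two wrappers can be modelled in userspace. This
is a minimal sketch under stated assumptions: model_msk, model_subflow,
model_sched_ops and model_default_pick() are hypothetical stand-ins, not
kernel structures, and the "default" pick simply returns the first
subflow instead of running mptcp_subflow_get_send()'s real heuristics.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are NOT the kernel structures. */
#define MODEL_SUBFLOWS_MAX 8

struct model_subflow {
	bool scheduled;
};

struct model_sched_data {
	bool reinject;
};

struct model_msk;

/* Stand-in for struct mptcp_sched_ops: only the hooks the wrapper calls. */
struct model_sched_ops {
	void (*data_init)(struct model_msk *msk, struct model_sched_data *data);
	int (*get_subflow)(struct model_msk *msk, struct model_sched_data *data);
};

struct model_msk {
	struct model_subflow subflows[MODEL_SUBFLOWS_MAX];
	int nr_subflows;
	const struct model_sched_ops *sched; /* NULL: use the built-in default */
};

/* Hypothetical stand-in for the default pickers: just the first subflow. */
static struct model_subflow *model_default_pick(struct model_msk *msk)
{
	return msk->nr_subflows > 0 ? &msk->subflows[0] : NULL;
}

/* Mirrors the shape of mptcp_sched_get_send()/mptcp_sched_get_retrans(). */
static int model_sched_get(struct model_msk *msk, bool reinject)
{
	struct model_sched_data data;
	struct model_subflow *sf;
	int i;

	assert(msk->nr_subflows <= MODEL_SUBFLOWS_MAX);

	/* A previous decision is still pending: don't ask again. */
	for (i = 0; i < msk->nr_subflows; i++)
		if (msk->subflows[i].scheduled)
			return 0;

	if (!msk->sched) {
		sf = model_default_pick(msk);
		if (!sf)
			return -EINVAL;
		sf->scheduled = true;
		return 0;
	}

	data.reinject = reinject; /* true only on the retransmit path */
	msk->sched->data_init(msk, &data);
	return msk->sched->get_subflow(msk, &data);
}
```

With no scheduler loaded, the first call marks one subflow; later calls
return 0 until the send path clears that flag.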

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
---
 net/mptcp/protocol.c |  4 ++--
 net/mptcp/protocol.h |  4 ++++
 net/mptcp/sched.c    | 48 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 2e60165c0eb2..6531df5ef4dc 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1397,7 +1397,7 @@ bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
  * returns the subflow that will transmit the next DSS
  * additionally updates the rtx timeout
  */
-static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
+struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 {
 	struct subflow_send_info send_info[SSK_MODE_MAX];
 	struct mptcp_subflow_context *subflow;
@@ -2213,7 +2213,7 @@ static void mptcp_timeout_timer(struct timer_list *t)
  *
  * A backup subflow is returned only if that is the only kind available.
  */
-static struct sock *mptcp_subflow_get_retrans(struct mptcp_sock *msk)
+struct sock *mptcp_subflow_get_retrans(struct mptcp_sock *msk)
 {
 	struct sock *backup = NULL, *pick = NULL;
 	struct mptcp_subflow_context *subflow;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index bae216bff6e4..8536035a71d0 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -652,6 +652,10 @@ int mptcp_init_sched(struct mptcp_sock *msk,
 void mptcp_release_sched(struct mptcp_sock *msk);
 void mptcp_subflow_set_scheduled(struct mptcp_subflow_context *subflow,
 				 bool scheduled);
+struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk);
+struct sock *mptcp_subflow_get_retrans(struct mptcp_sock *msk);
+int mptcp_sched_get_send(struct mptcp_sock *msk);
+int mptcp_sched_get_retrans(struct mptcp_sock *msk);
 
 static inline bool __tcp_can_send(const struct sock *ssk)
 {
diff --git a/net/mptcp/sched.c b/net/mptcp/sched.c
index 0d7c73e9562e..f51f9cf20b6e 100644
--- a/net/mptcp/sched.c
+++ b/net/mptcp/sched.c
@@ -112,3 +112,51 @@ void mptcp_sched_data_set_contexts(const struct mptcp_sock *msk,
 	for (; i < MPTCP_SUBFLOWS_MAX; i++)
 		data->contexts[i] = NULL;
 }
+
+int mptcp_sched_get_send(struct mptcp_sock *msk)
+{
+	struct mptcp_subflow_context *subflow;
+	struct mptcp_sched_data data;
+	struct sock *ssk = NULL;
+
+	mptcp_for_each_subflow(msk, subflow) {
+		if (READ_ONCE(subflow->scheduled))
+			return 0;
+	}
+
+	if (!msk->sched) {
+		ssk = mptcp_subflow_get_send(msk);
+		if (!ssk)
+			return -EINVAL;
+		mptcp_subflow_set_scheduled(mptcp_subflow_ctx(ssk), true);
+		return 0;
+	}
+
+	data.reinject = false;
+	msk->sched->data_init(msk, &data);
+	return msk->sched->get_subflow(msk, &data);
+}
+
+int mptcp_sched_get_retrans(struct mptcp_sock *msk)
+{
+	struct mptcp_subflow_context *subflow;
+	struct mptcp_sched_data data;
+	struct sock *ssk = NULL;
+
+	mptcp_for_each_subflow(msk, subflow) {
+		if (READ_ONCE(subflow->scheduled))
+			return 0;
+	}
+
+	if (!msk->sched) {
+		ssk = mptcp_subflow_get_retrans(msk);
+		if (!ssk)
+			return -EINVAL;
+		mptcp_subflow_set_scheduled(mptcp_subflow_ctx(ssk), true);
+		return 0;
+	}
+
+	data.reinject = true;
+	msk->sched->data_init(msk, &data);
+	return msk->sched->get_subflow(msk, &data);
+}
-- 
2.35.3


* [PATCH mptcp-next v20 2/7] mptcp: use get_send wrapper
  2022-11-16 11:43 [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Geliang Tang
  2022-11-16 11:43 ` [PATCH mptcp-next v20 1/7] mptcp: add scheduler wrappers Geliang Tang
@ 2022-11-16 11:43 ` Geliang Tang
  2022-11-18 22:04   ` Mat Martineau
  2022-11-16 11:43 ` [PATCH mptcp-next v20 3/7] mptcp: use get_retrans wrapper Geliang Tang
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Geliang Tang @ 2022-11-16 11:43 UTC
  To: mptcp; +Cc: Geliang Tang

This patch adds support for multiple subflows to __mptcp_push_pending()
and __mptcp_subflow_push_pending(). Both now use the get_send() wrapper
instead of calling mptcp_subflow_get_send() directly.

Check the subflows' scheduled flags to find which subflow or subflows
the scheduler picked, and use them to send data.

Move the sock_owned_by_me() check and the fallback check from
mptcp_subflow_get_send() into the get_send() wrapper.

This commit allows the scheduler to set the subflow->scheduled bit in
multiple subflows, but it does not allow for sending redundant data.
Multiple scheduled subflows will send sequential data on each subflow.
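The difference between "several scheduled subflows" and "redundant data"
can be shown with a simplified userspace model. In this sketch the types
are hypothetical, `pending` stands for the unsent bytes at the send head,
and `chunk` is an assumed per-subflow limit standing in for the burst and
window checks in __subflow_push_pending().

```c
#include <assert.h>
#include <stdbool.h>

struct model_subflow {
	bool scheduled;
	int bytes_sent; /* bytes this subflow transmitted */
};

/* Walk all subflows; each scheduled one takes the NEXT bytes from the send
 * head, so data is spread sequentially, not duplicated. Returns bytes sent.
 */
static int model_push_pending(struct model_subflow *sf, int nr, int pending,
			      int chunk)
{
	int total = 0;
	int i;

	assert(chunk > 0);
	for (i = 0; i < nr && pending > 0; i++) {
		int len;

		if (!sf[i].scheduled)
			continue;
		len = pending < chunk ? pending : chunk;
		sf[i].bytes_sent += len;
		pending -= len;
		total += len;
		/* like mptcp_subflow_set_scheduled(subflow, false) */
		sf[i].scheduled = false;
	}
	return total;
}
```

Two scheduled subflows sharing 3000 pending bytes with a 1400-byte chunk
each send 1400 distinct bytes: 2800 in total, not 2 x 3000.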

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
---
 net/mptcp/protocol.c | 119 ++++++++++++++++++++++++++-----------------
 net/mptcp/sched.c    |  13 +++++
 2 files changed, 84 insertions(+), 48 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6531df5ef4dc..f3720923b22d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1408,15 +1408,6 @@ struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 	u64 linger_time;
 	long tout = 0;
 
-	sock_owned_by_me(sk);
-
-	if (__mptcp_check_fallback(msk)) {
-		if (!msk->first)
-			return NULL;
-		return __tcp_can_send(msk->first) &&
-		       sk_stream_memory_free(msk->first) ? msk->first : NULL;
-	}
-
 	/* pick the subflow with the lower wmem/wspace ratio */
 	for (i = 0; i < SSK_MODE_MAX; ++i) {
 		send_info[i].ssk = NULL;
@@ -1563,47 +1554,58 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags)
 {
 	struct sock *prev_ssk = NULL, *ssk = NULL;
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct mptcp_subflow_context *subflow;
 	struct mptcp_sendmsg_info info = {
 				.flags = flags,
 	};
 	bool do_check_data_fin = false;
+	int err = 0;
 
-	while (mptcp_send_head(sk)) {
+	while (mptcp_send_head(sk) && !err) {
 		int ret = 0;
 
-		prev_ssk = ssk;
-		ssk = mptcp_subflow_get_send(msk);
-
-		/* First check. If the ssk has changed since
-		 * the last round, release prev_ssk
-		 */
-		if (ssk != prev_ssk && prev_ssk)
-			mptcp_push_release(prev_ssk, &info);
-		if (!ssk)
+		if (mptcp_sched_get_send(msk))
 			goto out;
 
-		/* Need to lock the new subflow only if different
-		 * from the previous one, otherwise we are still
-		 * helding the relevant lock
-		 */
-		if (ssk != prev_ssk)
-			lock_sock(ssk);
+		mptcp_for_each_subflow(msk, subflow) {
+			if (READ_ONCE(subflow->scheduled)) {
+				prev_ssk = ssk;
+				ssk = mptcp_subflow_tcp_sock(subflow);
 
-		ret = __subflow_push_pending(sk, ssk, &info);
-		if (ret <= 0) {
-			if (ret == -EAGAIN)
-				continue;
-			mptcp_push_release(ssk, &info);
-			goto out;
+				/* First check. If the ssk has changed since
+				 * the last round, release prev_ssk
+				 */
+				if (ssk != prev_ssk && prev_ssk)
+					mptcp_push_release(prev_ssk, &info);
+
+				/* Need to lock the new subflow only if different
+				 * from the previous one, otherwise we are still
+				 * helding the relevant lock
+				 */
+				if (ssk != prev_ssk)
+					lock_sock(ssk);
+
+				ret = __subflow_push_pending(sk, ssk, &info);
+				if (ret <= 0) {
+					if (ret != -EAGAIN ||
+					    inet_sk_state_load(ssk) == TCP_FIN_WAIT1 ||
+					    inet_sk_state_load(ssk) == TCP_FIN_WAIT2 ||
+					    inet_sk_state_load(ssk) == TCP_CLOSE)
+						err = 1;
+					continue;
+				}
+				do_check_data_fin = true;
+				msk->last_snd = ssk;
+				mptcp_subflow_set_scheduled(subflow, false);
+			}
 		}
-		do_check_data_fin = true;
 	}
 
+out:
 	/* at this point we held the socket lock for the last subflow we used */
 	if (ssk)
 		mptcp_push_release(ssk, &info);
 
-out:
 	/* ensure the rtx timer is running */
 	if (!mptcp_timer_pending(sk))
 		mptcp_reset_timer(sk);
@@ -1614,33 +1616,54 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags)
 static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool first)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct mptcp_subflow_context *subflow;
 	struct mptcp_sendmsg_info info = {
 		.data_lock_held = true,
 	};
-	struct sock *xmit_ssk;
-	int copied = 0;
+	int copied = 0, err = 0;
 
 	info.flags = 0;
-	while (mptcp_send_head(sk)) {
+	while (mptcp_send_head(sk) && !err) {
 		int ret = 0;
 
 		/* check for a different subflow usage only after
 		 * spooling the first chunk of data
 		 */
-		xmit_ssk = first ? ssk : mptcp_subflow_get_send(msk);
-		if (!xmit_ssk)
-			goto out;
-		if (xmit_ssk != ssk) {
-			mptcp_subflow_delegate(mptcp_subflow_ctx(xmit_ssk),
-					       MPTCP_DELEGATE_SEND);
-			goto out;
+		if (first) {
+			ret = __subflow_push_pending(sk, ssk, &info);
+			first = false;
+			if (ret <= 0)
+				break;
+			copied += ret;
+			msk->last_snd = ssk;
+			continue;
 		}
 
-		ret = __subflow_push_pending(sk, ssk, &info);
-		first = false;
-		if (ret <= 0)
-			break;
-		copied += ret;
+		if (mptcp_sched_get_send(msk))
+			goto out;
+
+		mptcp_for_each_subflow(msk, subflow) {
+			if (READ_ONCE(subflow->scheduled)) {
+				struct sock *xmit_ssk = mptcp_subflow_tcp_sock(subflow);
+
+				if (xmit_ssk != ssk) {
+					mptcp_subflow_delegate(subflow,
+							       MPTCP_DELEGATE_SEND);
+					msk->last_snd = ssk;
+					mptcp_subflow_set_scheduled(subflow, false);
+					goto out;
+				}
+
+				ret = __subflow_push_pending(sk, ssk, &info);
+				if (ret <= 0) {
+					err = 1;
+					continue;
+				}
+				copied += ret;
+				msk->last_snd = ssk;
+				mptcp_subflow_set_scheduled(subflow, false);
+			}
+		}
 	}
 
 out:
diff --git a/net/mptcp/sched.c b/net/mptcp/sched.c
index f51f9cf20b6e..923c07296ff3 100644
--- a/net/mptcp/sched.c
+++ b/net/mptcp/sched.c
@@ -119,11 +119,24 @@ int mptcp_sched_get_send(struct mptcp_sock *msk)
 	struct mptcp_sched_data data;
 	struct sock *ssk = NULL;
 
+	sock_owned_by_me((const struct sock *)msk);
+
 	mptcp_for_each_subflow(msk, subflow) {
 		if (READ_ONCE(subflow->scheduled))
 			return 0;
 	}
 
+	/* the following check is moved out of mptcp_subflow_get_send */
+	if (__mptcp_check_fallback(msk)) {
+		if (msk->first &&
+		    __tcp_can_send(msk->first) &&
+		    sk_stream_memory_free(msk->first)) {
+			mptcp_subflow_set_scheduled(mptcp_subflow_ctx(msk->first), true);
+			return 0;
+		}
+		return -EINVAL;
+	}
+
 	if (!msk->sched) {
 		ssk = mptcp_subflow_get_send(msk);
 		if (!ssk)
-- 
2.35.3


* [PATCH mptcp-next v20 3/7] mptcp: use get_retrans wrapper
  2022-11-16 11:43 [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Geliang Tang
  2022-11-16 11:43 ` [PATCH mptcp-next v20 1/7] mptcp: add scheduler wrappers Geliang Tang
  2022-11-16 11:43 ` [PATCH mptcp-next v20 2/7] mptcp: use get_send wrapper Geliang Tang
@ 2022-11-16 11:43 ` Geliang Tang
  2022-11-18 22:05   ` Mat Martineau
  2022-11-16 11:43 ` [PATCH mptcp-next v20 4/7] mptcp: delay updating first_pending Geliang Tang
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Geliang Tang @ 2022-11-16 11:43 UTC
  To: mptcp; +Cc: Geliang Tang

This patch adds support for multiple subflows to __mptcp_retrans(),
which now uses the get_retrans() wrapper instead of calling
mptcp_subflow_get_retrans() directly.

Check the subflows' scheduled flags to find which subflow or subflows
the scheduler picked, and use them to retransmit data.

Move the sock_owned_by_me() check and the fallback check from
mptcp_subflow_get_retrans() into the get_retrans() wrapper.
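The bookkeeping in the new retransmit loop (each scheduled subflow resends
the same dfrag, and already_sent is raised to the largest amount copied)
can be sketched in userspace. The types are hypothetical, and `room` is an
assumed per-subflow send capacity standing in for what
mptcp_sendmsg_frag() manages to push.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_dfrag {
	uint16_t data_len;
	uint16_t already_sent;
};

struct model_subflow {
	bool scheduled;
	uint16_t room;   /* hypothetical: bytes this subflow can send now */
	uint16_t copied; /* bytes it retransmitted in this pass */
};

static uint16_t model_min16(uint16_t a, uint16_t b) { return a < b ? a : b; }
static uint16_t model_max16(uint16_t a, uint16_t b) { return a > b ? a : b; }

static void model_retrans(struct model_dfrag *dfrag, struct model_subflow *sf,
			  int nr, bool csum_enabled)
{
	/* same limit as the patch: data_len with csum, else already_sent */
	uint16_t limit = csum_enabled ? dfrag->data_len : dfrag->already_sent;
	uint16_t len = 0;
	int i;

	assert(dfrag->already_sent <= dfrag->data_len);
	for (i = 0; i < nr; i++) {
		if (!sf[i].scheduled)
			continue;
		/* every scheduled subflow restarts from offset 0 of the dfrag */
		sf[i].copied = model_min16(limit, sf[i].room);
		len = model_max16(sf[i].copied, len); /* track the max copied */
		sf[i].scheduled = false;
	}
	dfrag->already_sent = model_max16(dfrag->already_sent, len);
}
```

Without checksums the limit is already_sent itself, so already_sent can
only grow when csum_enabled lets a subflow send past it.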

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
---
 net/mptcp/protocol.c | 67 ++++++++++++++++++++++++++------------------
 net/mptcp/sched.c    |  6 ++++
 2 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index f3720923b22d..b8265badbe29 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2242,11 +2242,6 @@ struct sock *mptcp_subflow_get_retrans(struct mptcp_sock *msk)
 	struct mptcp_subflow_context *subflow;
 	int min_stale_count = INT_MAX;
 
-	sock_owned_by_me((const struct sock *)msk);
-
-	if (__mptcp_check_fallback(msk))
-		return NULL;
-
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
@@ -2521,16 +2516,17 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
 static void __mptcp_retrans(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct mptcp_subflow_context *subflow;
 	struct mptcp_sendmsg_info info = {};
 	struct mptcp_data_frag *dfrag;
-	size_t copied = 0;
 	struct sock *ssk;
-	int ret;
+	int ret, err;
+	u16 len = 0;
 
 	mptcp_clean_una_wakeup(sk);
 
 	/* first check ssk: need to kick "stale" logic */
-	ssk = mptcp_subflow_get_retrans(msk);
+	err = mptcp_sched_get_retrans(msk);
 	dfrag = mptcp_rtx_head(sk);
 	if (!dfrag) {
 		if (mptcp_data_fin_enabled(msk)) {
@@ -2549,31 +2545,46 @@ static void __mptcp_retrans(struct sock *sk)
 		goto reset_timer;
 	}
 
-	if (!ssk)
+	if (err)
 		goto reset_timer;
 
-	lock_sock(ssk);
+	mptcp_for_each_subflow(msk, subflow) {
+		if (READ_ONCE(subflow->scheduled)) {
+			u16 copied = 0;
 
-	/* limit retransmission to the bytes already sent on some subflows */
-	info.sent = 0;
-	info.limit = READ_ONCE(msk->csum_enabled) ? dfrag->data_len : dfrag->already_sent;
-	while (info.sent < info.limit) {
-		ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info);
-		if (ret <= 0)
-			break;
+			ssk = mptcp_subflow_tcp_sock(subflow);
+			if (!ssk)
+				goto reset_timer;
 
-		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RETRANSSEGS);
-		copied += ret;
-		info.sent += ret;
-	}
-	if (copied) {
-		dfrag->already_sent = max(dfrag->already_sent, info.sent);
-		tcp_push(ssk, 0, info.mss_now, tcp_sk(ssk)->nonagle,
-			 info.size_goal);
-		WRITE_ONCE(msk->allow_infinite_fallback, false);
-	}
+			lock_sock(ssk);
 
-	release_sock(ssk);
+			/* limit retransmission to the bytes already sent on some subflows */
+			info.sent = 0;
+			info.limit = READ_ONCE(msk->csum_enabled) ? dfrag->data_len :
+								    dfrag->already_sent;
+			while (info.sent < info.limit) {
+				ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info);
+				if (ret <= 0)
+					break;
+
+				MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RETRANSSEGS);
+				copied += ret;
+				info.sent += ret;
+			}
+			if (copied) {
+				len = max(copied, len);
+				tcp_push(ssk, 0, info.mss_now, tcp_sk(ssk)->nonagle,
+					 info.size_goal);
+				WRITE_ONCE(msk->allow_infinite_fallback, false);
+			}
+
+			release_sock(ssk);
+
+			msk->last_snd = ssk;
+			mptcp_subflow_set_scheduled(subflow, false);
+		}
+	}
+	dfrag->already_sent = max(dfrag->already_sent, len);
 
 reset_timer:
 	mptcp_check_and_set_pending(sk);
diff --git a/net/mptcp/sched.c b/net/mptcp/sched.c
index 923c07296ff3..edddd7cceada 100644
--- a/net/mptcp/sched.c
+++ b/net/mptcp/sched.c
@@ -156,11 +156,17 @@ int mptcp_sched_get_retrans(struct mptcp_sock *msk)
 	struct mptcp_sched_data data;
 	struct sock *ssk = NULL;
 
+	sock_owned_by_me((const struct sock *)msk);
+
 	mptcp_for_each_subflow(msk, subflow) {
 		if (READ_ONCE(subflow->scheduled))
 			return 0;
 	}
 
+	/* the following check is moved out of mptcp_subflow_get_retrans */
+	if (__mptcp_check_fallback(msk))
+		return -EINVAL;
+
 	if (!msk->sched) {
 		ssk = mptcp_subflow_get_retrans(msk);
 		if (!ssk)
-- 
2.35.3


* [PATCH mptcp-next v20 4/7] mptcp: delay updating first_pending
  2022-11-16 11:43 [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Geliang Tang
                   ` (2 preceding siblings ...)
  2022-11-16 11:43 ` [PATCH mptcp-next v20 3/7] mptcp: use get_retrans wrapper Geliang Tang
@ 2022-11-16 11:43 ` Geliang Tang
  2022-11-16 11:43 ` [PATCH mptcp-next v20 5/7] mptcp: delay updating already_sent Geliang Tang
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Geliang Tang @ 2022-11-16 11:43 UTC
  To: mptcp; +Cc: Geliang Tang

To make redundant packet schedulers easier to support, this patch
refactors the data sending loop in __subflow_push_pending() to delay
updating first_pending until the data has been sent on all scheduled
subflows.
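Why delaying the update helps can be seen in a small userspace model. In
this sketch, fragments are hypothetical array indices rather than list
entries, and `budget` is an assumed per-subflow fragment limit. Because a
push only records the last fragment it touched, every subflow starts from
the same send head, and first_pending is advanced once afterwards.

```c
#include <assert.h>

/* Hypothetical model: frags are array slots; -1 plays the role of NULL. */
#define MODEL_NO_FRAG (-1)

struct model_queue {
	int nr_frags;
	int first_pending; /* stand-in for msk->first_pending */
};

/* Like mptcp_next_frag(): the following fragment, or none at the tail. */
static int model_next_frag(const struct model_queue *q, int cur)
{
	if (cur == MODEL_NO_FRAG || cur + 1 >= q->nr_frags)
		return MODEL_NO_FRAG;
	return cur + 1;
}

/* One subflow's push: walk up to `budget` frags from the send head and
 * return the last one touched; first_pending is NOT advanced here. */
static int model_subflow_push(const struct model_queue *q, int budget)
{
	int last = MODEL_NO_FRAG;
	int cur = q->first_pending;

	assert(budget >= 0);
	while (cur != MODEL_NO_FRAG && budget-- > 0) {
		last = cur;
		cur = model_next_frag(q, cur);
	}
	return last;
}

/* Like mptcp_update_first_pending(): advance once, after all subflows. */
static void model_update_first_pending(struct model_queue *q, int last_frag)
{
	if (last_frag != MODEL_NO_FRAG)
		q->first_pending = model_next_frag(q, last_frag);
}
```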

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
---
 net/mptcp/protocol.c | 36 ++++++++++++++++++++++++++++++++----
 net/mptcp/protocol.h | 13 ++++++++++---
 2 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index b8265badbe29..4c249d1b9ec6 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1112,6 +1112,7 @@ struct mptcp_sendmsg_info {
 	u16 sent;
 	unsigned int flags;
 	bool data_lock_held;
+	struct mptcp_data_frag *last_frag;
 };
 
 static int mptcp_check_allowed_size(const struct mptcp_sock *msk, struct sock *ssk,
@@ -1502,6 +1503,19 @@ static void mptcp_update_post_push(struct mptcp_sock *msk,
 		msk->snd_nxt = snd_nxt_new;
 }
 
+static void mptcp_update_first_pending(struct sock *sk, struct mptcp_sendmsg_info *info)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	if (info->last_frag)
+		WRITE_ONCE(msk->first_pending, mptcp_next_frag(sk, info->last_frag));
+}
+
+static void mptcp_update_dfrags(struct sock *sk, struct mptcp_sendmsg_info *info)
+{
+	mptcp_update_first_pending(sk, info);
+}
+
 void mptcp_check_and_set_pending(struct sock *sk)
 {
 	if (mptcp_send_head(sk))
@@ -1515,7 +1529,13 @@ static int __subflow_push_pending(struct sock *sk, struct sock *ssk,
 	struct mptcp_data_frag *dfrag;
 	int len, copied = 0, err = 0;
 
-	while ((dfrag = mptcp_send_head(sk))) {
+	info->last_frag = NULL;
+
+	dfrag = mptcp_send_head(sk);
+	if (!dfrag)
+		goto out;
+
+	do {
 		info->sent = dfrag->already_sent;
 		info->limit = dfrag->data_len;
 		len = dfrag->data_len - dfrag->already_sent;
@@ -1534,7 +1554,8 @@ static int __subflow_push_pending(struct sock *sk, struct sock *ssk,
 
 			mptcp_update_post_push(msk, dfrag, ret);
 		}
-		WRITE_ONCE(msk->first_pending, mptcp_send_next(sk));
+		info->last_frag = dfrag;
+		dfrag = mptcp_next_frag(sk, dfrag);
 
 		if (msk->snd_burst <= 0 ||
 		    !sk_stream_memory_free(ssk) ||
@@ -1543,7 +1564,7 @@ static int __subflow_push_pending(struct sock *sk, struct sock *ssk,
 			goto out;
 		}
 		mptcp_set_timeout(sk);
-	}
+	} while (dfrag);
 	err = copied;
 
 out:
@@ -1587,6 +1608,7 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags)
 
 				ret = __subflow_push_pending(sk, ssk, &info);
 				if (ret <= 0) {
+					mptcp_update_first_pending(sk, &info);
 					if (ret != -EAGAIN ||
 					    inet_sk_state_load(ssk) == TCP_FIN_WAIT1 ||
 					    inet_sk_state_load(ssk) == TCP_FIN_WAIT2 ||
@@ -1599,6 +1621,7 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags)
 				mptcp_subflow_set_scheduled(subflow, false);
 			}
 		}
+		mptcp_update_dfrags(sk, &info);
 	}
 
 out:
@@ -1632,10 +1655,13 @@ static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool
 		if (first) {
 			ret = __subflow_push_pending(sk, ssk, &info);
 			first = false;
-			if (ret <= 0)
+			if (ret <= 0) {
+				mptcp_update_first_pending(sk, &info);
 				break;
+			}
 			copied += ret;
 			msk->last_snd = ssk;
+			mptcp_update_dfrags(sk, &info);
 			continue;
 		}
 
@@ -1656,6 +1682,7 @@ static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool
 
 				ret = __subflow_push_pending(sk, ssk, &info);
 				if (ret <= 0) {
+					mptcp_update_first_pending(sk, &info);
 					err = 1;
 					continue;
 				}
@@ -1664,6 +1691,7 @@ static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool
 				mptcp_subflow_set_scheduled(subflow, false);
 			}
 		}
+		mptcp_update_dfrags(sk, &info);
 	}
 
 out:
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 8536035a71d0..cdce0c092c3c 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -358,16 +358,23 @@ static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
 	return READ_ONCE(msk->first_pending);
 }
 
-static inline struct mptcp_data_frag *mptcp_send_next(struct sock *sk)
+static inline struct mptcp_data_frag *mptcp_next_frag(const struct sock *sk,
+						      struct mptcp_data_frag *cur)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct mptcp_data_frag *cur;
 
-	cur = msk->first_pending;
+	if (!cur)
+		return NULL;
+
 	return list_is_last(&cur->list, &msk->rtx_queue) ? NULL :
 						     list_next_entry(cur, list);
 }
 
+static inline struct mptcp_data_frag *mptcp_send_next(const struct sock *sk)
+{
+	return mptcp_next_frag(sk, mptcp_send_head(sk));
+}
+
 static inline struct mptcp_data_frag *mptcp_pending_tail(const struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-- 
2.35.3


* [PATCH mptcp-next v20 5/7] mptcp: delay updating already_sent
  2022-11-16 11:43 [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Geliang Tang
                   ` (3 preceding siblings ...)
  2022-11-16 11:43 ` [PATCH mptcp-next v20 4/7] mptcp: delay updating first_pending Geliang Tang
@ 2022-11-16 11:43 ` Geliang Tang
  2022-11-18 22:15   ` Mat Martineau
  2022-11-16 11:43 ` [PATCH mptcp-next v20 6/7] selftests/bpf: Add bpf_red scheduler Geliang Tang
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Geliang Tang @ 2022-11-16 11:43 UTC
  To: mptcp; +Cc: Geliang Tang

This patch adds a new member, sent, to struct mptcp_data_frag and saves
info->sent in it, so that updating the already_sent field of the dfrags
can be delayed until all data has been sent.
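The split between per-round progress (sent) and committed progress
(already_sent) can be illustrated with a simplified userspace model. The
type and helpers are hypothetical, and `room` is an assumed number of
bytes one subflow pushes in its pass. Each pass restarts from
already_sent, so a second subflow re-sends the same bytes instead of
continuing after the first one.

```c
#include <assert.h>
#include <stdint.h>

struct model_dfrag {
	uint16_t data_len;
	uint16_t sent;         /* per-round progress (the new member) */
	uint16_t already_sent; /* committed progress */
};

/* One subflow's pass over the dfrag. */
static void model_push_frag(struct model_dfrag *d, uint16_t room)
{
	uint16_t left = d->data_len - d->already_sent;

	assert(d->already_sent <= d->data_len);
	d->sent = d->already_sent;            /* dfrag->sent = info->sent */
	d->sent += room < left ? room : left; /* as mptcp_update_post_push() */
}

/* Like the loop body in mptcp_update_dfrags(): commit once per round. */
static void model_update_dfrag(struct model_dfrag *d)
{
	if (d->sent) {
		if (d->sent > d->already_sent)
			d->already_sent = d->sent;
		d->sent = 0;
	}
}
```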

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
---
 net/mptcp/protocol.c | 18 ++++++++++++++++--
 net/mptcp/protocol.h |  1 +
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4c249d1b9ec6..a1007751bd4d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1099,6 +1099,7 @@ mptcp_carve_data_frag(const struct mptcp_sock *msk, struct page_frag *pfrag,
 	dfrag->data_seq = msk->write_seq;
 	dfrag->overhead = offset - orig_offset + sizeof(struct mptcp_data_frag);
 	dfrag->offset = offset + sizeof(struct mptcp_data_frag);
+	dfrag->sent = 0;
 	dfrag->already_sent = 0;
 	dfrag->page = pfrag->page;
 
@@ -1484,11 +1485,11 @@ static void mptcp_update_post_push(struct mptcp_sock *msk,
 {
 	u64 snd_nxt_new = dfrag->data_seq;
 
-	dfrag->already_sent += sent;
+	dfrag->sent += sent;
 
 	msk->snd_burst -= sent;
 
-	snd_nxt_new += dfrag->already_sent;
+	snd_nxt_new += dfrag->sent;
 
 	/* snd_nxt_new can be smaller than snd_nxt in case mptcp
 	 * is recovering after a failover. In that event, this re-sends
@@ -1513,6 +1514,18 @@ static void mptcp_update_first_pending(struct sock *sk, struct mptcp_sendmsg_inf
 
 static void mptcp_update_dfrags(struct sock *sk, struct mptcp_sendmsg_info *info)
 {
+	struct mptcp_data_frag *dfrag = mptcp_send_head(sk);
+
+	if (!dfrag)
+		return;
+
+	do {
+		if (dfrag->sent) {
+			dfrag->already_sent = max(dfrag->already_sent, dfrag->sent);
+			dfrag->sent = 0;
+		}
+	} while ((dfrag = mptcp_next_frag(sk, dfrag)));
+
 	mptcp_update_first_pending(sk, info);
 }
 
@@ -1539,6 +1552,7 @@ static int __subflow_push_pending(struct sock *sk, struct sock *ssk,
 		info->sent = dfrag->already_sent;
 		info->limit = dfrag->data_len;
 		len = dfrag->data_len - dfrag->already_sent;
+		dfrag->sent = info->sent;
 		while (len > 0) {
 			int ret = 0;
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index cdce0c092c3c..cdadb39a03da 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -250,6 +250,7 @@ struct mptcp_data_frag {
 	u16 data_len;
 	u16 offset;
 	u16 overhead;
+	u16 sent;
 	u16 already_sent;
 	struct page *page;
 };
-- 
2.35.3


* [PATCH mptcp-next v20 6/7] selftests/bpf: Add bpf_red scheduler
  2022-11-16 11:43 [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Geliang Tang
                   ` (4 preceding siblings ...)
  2022-11-16 11:43 ` [PATCH mptcp-next v20 5/7] mptcp: delay updating already_sent Geliang Tang
@ 2022-11-16 11:43 ` Geliang Tang
  2022-11-16 11:43 ` [PATCH mptcp-next v20 7/7] selftests/bpf: Add bpf_red test Geliang Tang
  2022-11-18 21:42 ` [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Mat Martineau
  7 siblings, 0 replies; 14+ messages in thread
From: Geliang Tang @ 2022-11-16 11:43 UTC
  To: mptcp; +Cc: Geliang Tang

This patch implements the redundant BPF MPTCP scheduler, named bpf_red,
which sends all packets redundantly on all available subflows.
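Stripped of the BPF and struct_ops plumbing (SEC(), BPF_STRUCT_OPS), the
selection logic of bpf_red_get_subflow() reduces to the plain-C sketch
below. The types are hypothetical stand-ins; it relies on the same
assumption as the BPF program, namely that unused slots at the tail of
the contexts array are NULL, as mptcp_sched_data_set_contexts() arranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SUBFLOWS_MAX 8 /* stand-in for MPTCP_SUBFLOWS_MAX */

struct model_subflow {
	bool scheduled;
};

/* Stand-in for struct mptcp_sched_data: NULL-terminated context array. */
struct model_sched_data {
	struct model_subflow *contexts[MODEL_SUBFLOWS_MAX];
};

/* Plain-C mirror of bpf_red_get_subflow(): schedule every known subflow. */
static int model_red_get_subflow(struct model_sched_data *data)
{
	int i;

	assert(data != NULL);
	for (i = 0; i < MODEL_SUBFLOWS_MAX; i++) {
		if (!data->contexts[i])
			break; /* first NULL slot ends the list */
		data->contexts[i]->scheduled = true;
	}
	return 0;
}
```

Entries after the first NULL slot are never visited, which is why the
scheduler depends on the NULL-filled tail.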

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
---
 .../selftests/bpf/progs/mptcp_bpf_red.c       | 45 +++++++++++++++++++
 1 file changed, 45 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/mptcp_bpf_red.c

diff --git a/tools/testing/selftests/bpf/progs/mptcp_bpf_red.c b/tools/testing/selftests/bpf/progs/mptcp_bpf_red.c
new file mode 100644
index 000000000000..30dd6f521b7f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/mptcp_bpf_red.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022, SUSE. */
+
+#include <linux/bpf.h>
+#include "bpf_tcp_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/mptcp_sched_red_init")
+void BPF_PROG(mptcp_sched_red_init, const struct mptcp_sock *msk)
+{
+}
+
+SEC("struct_ops/mptcp_sched_red_release")
+void BPF_PROG(mptcp_sched_red_release, const struct mptcp_sock *msk)
+{
+}
+
+void BPF_STRUCT_OPS(bpf_red_data_init, const struct mptcp_sock *msk,
+		    struct mptcp_sched_data *data)
+{
+	mptcp_sched_data_set_contexts(msk, data);
+}
+
+int BPF_STRUCT_OPS(bpf_red_get_subflow, const struct mptcp_sock *msk,
+		   struct mptcp_sched_data *data)
+{
+	for (int i = 0; i < MPTCP_SUBFLOWS_MAX; i++) {
+		if (!data->contexts[i])
+			break;
+
+		mptcp_subflow_set_scheduled(data->contexts[i], true);
+	}
+
+	return 0;
+}
+
+SEC(".struct_ops")
+struct mptcp_sched_ops red = {
+	.init		= (void *)mptcp_sched_red_init,
+	.release	= (void *)mptcp_sched_red_release,
+	.data_init	= (void *)bpf_red_data_init,
+	.get_subflow	= (void *)bpf_red_get_subflow,
+	.name		= "bpf_red",
+};
-- 
2.35.3


* [PATCH mptcp-next v20 7/7] selftests/bpf: Add bpf_red test
  2022-11-16 11:43 [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Geliang Tang
                   ` (5 preceding siblings ...)
  2022-11-16 11:43 ` [PATCH mptcp-next v20 6/7] selftests/bpf: Add bpf_red scheduler Geliang Tang
@ 2022-11-16 11:43 ` Geliang Tang
  2022-11-18 21:42 ` [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Mat Martineau
  7 siblings, 0 replies; 14+ messages in thread
From: Geliang Tang @ 2022-11-16 11:43 UTC
  To: mptcp; +Cc: Geliang Tang

This patch adds the redundant BPF MPTCP scheduler test, test_red(). It
uses sysctl to set net.mptcp.scheduler to this scheduler, adds two veth
net devices to simulate the multiple-addresses case, and uses the
'ip mptcp endpoint' command to add the new endpoint ADDR_2 to PM
netlink. It then sends data and checks bytes_sent in the 'ss' output to
make sure the data has been sent redundantly on both net devices.
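The has_bytes_sent() helper used by the test is not shown in this series.
As a hypothetical illustration of the kind of check it performs, the
sketch below extracts the counter from one line of `ss -ti` style output,
assuming the `bytes_sent:<n>` token format; model_parse_bytes_sent() is
an invented name, not part of the selftests.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: value of the first "bytes_sent:<n>" token in one
 * line of ss output, or -1 if the token is missing or has no number. */
static long model_parse_bytes_sent(const char *line)
{
	const char *key = "bytes_sent:";
	const char *p;
	char *end;
	long val;

	assert(line != NULL);
	p = strstr(line, key);
	if (!p)
		return -1;
	p += strlen(key);
	val = strtol(p, &end, 10);
	if (end == p)
		return -1; /* no digits after the key */
	return val;
}
```

A redundant run would be expected to report a non-zero value on the lines
for both ADDR_1 and ADDR_2.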

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
---
 .../testing/selftests/bpf/prog_tests/mptcp.c  | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/mptcp.c b/tools/testing/selftests/bpf/prog_tests/mptcp.c
index 647d313475bc..8426a5aba721 100644
--- a/tools/testing/selftests/bpf/prog_tests/mptcp.c
+++ b/tools/testing/selftests/bpf/prog_tests/mptcp.c
@@ -9,6 +9,7 @@
 #include "mptcp_bpf_first.skel.h"
 #include "mptcp_bpf_bkup.skel.h"
 #include "mptcp_bpf_rr.skel.h"
+#include "mptcp_bpf_red.skel.h"
 
 #ifndef TCP_CA_NAME_MAX
 #define TCP_CA_NAME_MAX	16
@@ -381,6 +382,37 @@ static void test_rr(void)
 	mptcp_bpf_rr__destroy(rr_skel);
 }
 
+static void test_red(void)
+{
+	struct mptcp_bpf_red *red_skel;
+	int server_fd, client_fd;
+	struct bpf_link *link;
+
+	red_skel = mptcp_bpf_red__open_and_load();
+	if (!ASSERT_OK_PTR(red_skel, "bpf_red__open_and_load"))
+		return;
+
+	link = bpf_map__attach_struct_ops(red_skel->maps.red);
+	if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+		mptcp_bpf_red__destroy(red_skel);
+		return;
+	}
+
+	sched_init("subflow", "bpf_red");
+	server_fd = start_mptcp_server(AF_INET, ADDR_1, 0, 0);
+	client_fd = connect_to_fd(server_fd, 0);
+
+	send_data(server_fd, client_fd);
+	ASSERT_OK(has_bytes_sent(ADDR_1), "has_bytes_sent addr 1");
+	ASSERT_OK(has_bytes_sent(ADDR_2), "has_bytes_sent addr 2");
+
+	close(client_fd);
+	close(server_fd);
+	sched_cleanup();
+	bpf_link__destroy(link);
+	mptcp_bpf_red__destroy(red_skel);
+}
+
 void test_mptcp(void)
 {
 	if (test__start_subtest("base"))
@@ -391,4 +423,6 @@ void test_mptcp(void)
 		test_bkup();
 	if (test__start_subtest("rr"))
 		test_rr();
+	if (test__start_subtest("red"))
+		test_red();
 }
-- 
2.35.3


* Re: [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2
  2022-11-16 11:43 [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Geliang Tang
                   ` (6 preceding siblings ...)
  2022-11-16 11:43 ` [PATCH mptcp-next v20 7/7] selftests/bpf: Add bpf_red test Geliang Tang
@ 2022-11-18 21:42 ` Mat Martineau
  2022-11-18 22:20   ` Geliang Tang
  7 siblings, 1 reply; 14+ messages in thread
From: Mat Martineau @ 2022-11-18 21:42 UTC (permalink / raw)
  To: Geliang Tang; +Cc: mptcp

On Wed, 16 Nov 2022, Geliang Tang wrote:

> v20:
> - rebased on "Squash to "mptcp: refactor push_pending logic" v19"
>
> v19:
> - patch 1, use 'continue' instead of 'goto again'.
>
> v18:
> - some cleanups
> - update commit logs.
>
> v17:
> - address Mat's comments in v16
> - rebase to export/20221108T055508.
>
> v16:
> - keep last_snd and snd_burst in struct mptcp_sock.

I didn't notice this back in v16 - why has last_snd returned? At the end 
of the series, it still doesn't appear to do anything. Is it useful for 
some future feature?

- Mat

> - drop "mptcp: register default scheduler".
> - drop "mptcp: add scheduler wrappers", move it into "mptcp: use
> get_send wrapper" and "mptcp: use get_retrans wrapper".
> - depends on 'v2, Revert "mptcp: add get_subflow wrappers" - fix
> divide error in mptcp_subflow_get_send'
>
> v15:
> 1: "refactor push pending" v10
> 2-11: "register default scheduler" v3
>  - move last_snd and snd_burst into struct mptcp_sched_ops
> 12-19: "BPF redundant scheduler" v15
>  - split "use get_send wrapper" into two patches
> - rebase to export/20221021T061837.
>
> v14:
> - add "mptcp: refactor push_pending logic" v10 as patch 1
> - drop update_first_pending in patch 4
> - drop update_already_sent in patch 5
>
> v13:
> - depends on "refactor push pending" v9.
> - Simply 'goto out' after invoking mptcp_subflow_delegate in patch 1.
> - All selftests (mptcp_connect.sh, mptcp_join.sh and simult_flows.sh) passed.
>
> v12:
> - fix WARN_ON_ONCE(reuse_skb) and WARN_ON_ONCE(!msk->recovery) errors
>   in kernel logs.
>
> v11:
> - address to Mat's comments in v10.
> - rebase to export/20220908T063452
>
> v10:
> - send multiple dfrags in __mptcp_push_pending().
>
> v9:
> - drop the extra *err parameter of mptcp_sched_get_send() as Florian
>   suggested.
>
> v8:
> - update __mptcp_push_pending(), send the same data on each subflow.
> - update __mptcp_retrans, track the max sent data.
> - add a new patch.
>
> v7:
> - drop redundant flag in v6
> - drop __mptcp_subflows_push_pending in v6
> - update redundant subflows support in __mptcp_push_pending
> - update redundant subflows support in __mptcp_retrans
>
> v6:
> - Add redundant flag for struct mptcp_sched_ops.
> - add a dedicated function __mptcp_subflows_push_pending() to deal with
>   redundant subflows push pending.
>
> v5:
> - address Paolo's comment: keep the optimization in
> mptcp_subflow_get_send() for the non-eBPF case.
> - merge mptcp_sched_get_send() and __mptcp_sched_get_send() in v4 into one.
> - depends on "cleanups for bpf sched selftests".
>
> v4:
> - small cleanups in patch 1, 2.
> - add TODO in patch 3.
> - rebase patch 5 on 'cleanups for bpf sched selftests'.
>
> v3:
> - use new API.
> - fix the link failure tests issue mentioned in ("https://patchwork.kernel.org/project/mptcp/cover/cover.1653033459.git.geliang.tang@suse.com/").
>
> v2:
> - add MPTCP_SUBFLOWS_MAX limit to avoid infinite loops when the
>   scheduler always sets call_again to true.
> - track the largest copied amount.
> - deal with __mptcp_subflow_push_pending() and the retransmit loop.
> - depends on "BPF round-robin scheduler" v14.
>
> v1:
>
> Implements the redundant BPF MPTCP scheduler, which sends all packets
> redundantly on all available subflows.
>
> Geliang Tang (7):
>  mptcp: add scheduler wrappers
>  mptcp: use get_send wrapper
>  mptcp: use get_retrans wrapper
>  mptcp: delay updating first_pending
>  mptcp: delay updating already_sent
>  selftests/bpf: Add bpf_red scheduler
>  selftests/bpf: Add bpf_red test
>
> net/mptcp/protocol.c                          | 242 ++++++++++++------
> net/mptcp/protocol.h                          |  18 +-
> net/mptcp/sched.c                             |  67 +++++
> .../testing/selftests/bpf/prog_tests/mptcp.c  |  34 +++
> .../selftests/bpf/progs/mptcp_bpf_red.c       |  45 ++++
> 5 files changed, 320 insertions(+), 86 deletions(-)
> create mode 100644 tools/testing/selftests/bpf/progs/mptcp_bpf_red.c
>
> -- 
> 2.35.3
>
>
>

--
Mat Martineau
Intel

* Re: [PATCH mptcp-next v20 2/7] mptcp: use get_send wrapper
  2022-11-16 11:43 ` [PATCH mptcp-next v20 2/7] mptcp: use get_send wrapper Geliang Tang
@ 2022-11-18 22:04   ` Mat Martineau
  0 siblings, 0 replies; 14+ messages in thread
From: Mat Martineau @ 2022-11-18 22:04 UTC (permalink / raw)
  To: Geliang Tang; +Cc: mptcp

On Wed, 16 Nov 2022, Geliang Tang wrote:

> This patch adds multiple-subflow support to __mptcp_push_pending and
> __mptcp_subflow_push_pending, using the get_send() wrapper instead of
> mptcp_subflow_get_send() in them.
>
> Check the subflow scheduled flags to find which subflow or subflows were
> picked by the scheduler, and use them to send data.
>
> Move the sock_owned_by_me() and fallback checks from
> mptcp_subflow_get_send() into the get_send() wrapper.
>
> This commit allows the scheduler to set the subflow->scheduled bit in
> multiple subflows, but it does not allow for sending redundant data.
> Multiple scheduled subflows will send sequential data on each subflow.
>
> Signed-off-by: Geliang Tang <geliang.tang@suse.com>
> ---
> net/mptcp/protocol.c | 119 ++++++++++++++++++++++++++-----------------
> net/mptcp/sched.c    |  13 +++++
> 2 files changed, 84 insertions(+), 48 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 6531df5ef4dc..f3720923b22d 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1408,15 +1408,6 @@ struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> 	u64 linger_time;
> 	long tout = 0;
>
> -	sock_owned_by_me(sk);
> -
> -	if (__mptcp_check_fallback(msk)) {
> -		if (!msk->first)
> -			return NULL;
> -		return __tcp_can_send(msk->first) &&
> -		       sk_stream_memory_free(msk->first) ? msk->first : NULL;
> -	}
> -
> 	/* pick the subflow with the lower wmem/wspace ratio */
> 	for (i = 0; i < SSK_MODE_MAX; ++i) {
> 		send_info[i].ssk = NULL;
> @@ -1563,47 +1554,58 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags)
> {
> 	struct sock *prev_ssk = NULL, *ssk = NULL;
> 	struct mptcp_sock *msk = mptcp_sk(sk);
> +	struct mptcp_subflow_context *subflow;
> 	struct mptcp_sendmsg_info info = {
> 				.flags = flags,
> 	};
> 	bool do_check_data_fin = false;
> +	int err = 0;
>
> -	while (mptcp_send_head(sk)) {
> +	while (mptcp_send_head(sk) && !err) {

This "!err" will quit the loop if any subflow encounters an error: if, in
one iteration of this loop, two subflows send successfully and one fails,
the loop will still exit.

Isn't it better to continue looping unless *all* scheduled subflows fail?
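A minimal userspace sketch of the loop shape suggested above, under simplifying assumptions: struct sketch_subflow, sketch_push() (an all-or-nothing stand-in for __subflow_push_pending()) and sketch_push_pending() are hypothetical and not kernel code. A failing subflow is skipped, and the loop only ends when no scheduled subflow made progress in a round; the flag is a bool, as suggested for 'err' further down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified subflow model: not a kernel structure. */
struct sketch_subflow {
	bool scheduled; /* set by the scheduler for this round */
	int budget;     /* bytes this subflow can still accept */
};

/* Hypothetical stand-in for __subflow_push_pending(): sends all of 'len'
 * or nothing, returning the bytes sent (0 means the subflow failed).
 */
static int sketch_push(struct sketch_subflow *sf, int len)
{
	if (sf->budget < len)
		return 0;
	sf->budget -= len;
	return len;
}

/* Send 'pending' bytes redundantly on every scheduled subflow, 'chunk'
 * bytes per round. Returns the number of bytes left unsent.
 */
static int sketch_push_pending(struct sketch_subflow *sfs, size_t n,
			       int pending, int chunk)
{
	assert(chunk > 0);

	while (pending > 0) {
		int len = pending < chunk ? pending : chunk;
		bool any_sent = false; /* a bool, not an integer error code */

		for (size_t i = 0; i < n; i++) {
			if (!sfs[i].scheduled)
				continue;
			/* A failure on one subflow does not end the loop. */
			if (sketch_push(&sfs[i], len) > 0)
				any_sent = true;
		}
		if (!any_sent)
			break; /* every scheduled subflow failed */
		pending -= len;
	}
	return pending;
}
```

With one broken and one healthy subflow, all 30 bytes still go out: `struct sketch_subflow a[2] = {{true, 0}, {true, 100}};` followed by `sketch_push_pending(a, 2, 30, 10)` returns 0, whereas a "!err"-style exit would stop after the first round.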

> 		int ret = 0;
>
> -		prev_ssk = ssk;
> -		ssk = mptcp_subflow_get_send(msk);
> -
> -		/* First check. If the ssk has changed since
> -		 * the last round, release prev_ssk
> -		 */
> -		if (ssk != prev_ssk && prev_ssk)
> -			mptcp_push_release(prev_ssk, &info);
> -		if (!ssk)
> +		if (mptcp_sched_get_send(msk))
> 			goto out;
>
> -		/* Need to lock the new subflow only if different
> -		 * from the previous one, otherwise we are still
> -		 * helding the relevant lock
> -		 */
> -		if (ssk != prev_ssk)
> -			lock_sock(ssk);
> +		mptcp_for_each_subflow(msk, subflow) {
> +			if (READ_ONCE(subflow->scheduled)) {
> +				prev_ssk = ssk;
> +				ssk = mptcp_subflow_tcp_sock(subflow);
>
> -		ret = __subflow_push_pending(sk, ssk, &info);
> -		if (ret <= 0) {
> -			if (ret == -EAGAIN)
> -				continue;
> -			mptcp_push_release(ssk, &info);
> -			goto out;
> +				/* First check. If the ssk has changed since
> +				 * the last round, release prev_ssk
> +				 */
> +				if (ssk != prev_ssk && prev_ssk)
> +					mptcp_push_release(prev_ssk, &info);
> +
> +				/* Need to lock the new subflow only if different
> +				 * from the previous one, otherwise we are still
> +				 * helding the relevant lock
> +				 */
> +				if (ssk != prev_ssk)
> +					lock_sock(ssk);

The above two statements could be combined to only check ssk != prev_ssk 
once:

if (ssk != prev_ssk) {
 	/* First check. If the ssk ... */
 	if (prev_ssk)
 		mptcp_push_release(prev_ssk, &info);

 	/* Need to lock ... */
 	lock_sock(ssk);
}


> +
> +				ret = __subflow_push_pending(sk, ssk, &info);
> +				if (ret <= 0) {
> +					if (ret != -EAGAIN ||
> +					    inet_sk_state_load(ssk) == TCP_FIN_WAIT1 ||
> +					    inet_sk_state_load(ssk) == TCP_FIN_WAIT2 ||
> +					    inet_sk_state_load(ssk) == TCP_CLOSE)

Each call to inet_sk_state_load() uses smp_load_acquire(), so the sk state 
should be loaded only once and then compared to multiple values (or use 
the "1 << sk_state" technique, like in mptcp_pending_data_fin_ack()).
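A userspace sketch of the "load once, then mask" pattern. In-kernel, the condition would presumably read `(1 << inet_sk_state_load(ssk)) & (TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2 | TCPF_CLOSE)`. The SKETCH_* names below are hypothetical copies of the kernel's TCP state numbers, and sketch_state_load() only approximates inet_sk_state_load() with a GCC/Clang acquire load.

```c
#include <assert.h>
#include <stdbool.h>

/* Copies of the kernel TCP state numbering, prefixed so they cannot
 * clash with libc headers.
 */
enum {
	SKETCH_TCP_ESTABLISHED = 1,
	SKETCH_TCP_SYN_SENT,
	SKETCH_TCP_SYN_RECV,
	SKETCH_TCP_FIN_WAIT1,
	SKETCH_TCP_FIN_WAIT2,
	SKETCH_TCP_TIME_WAIT,
	SKETCH_TCP_CLOSE,
};

#define SKETCH_TCPF(state) (1 << (state))

/* Stand-in for inet_sk_state_load(): a single acquire load. */
static int sketch_state_load(const int *sk_state)
{
	return __atomic_load_n(sk_state, __ATOMIC_ACQUIRE);
}

/* Load the state once, then test it against several states with one mask
 * instead of issuing one acquire load per comparison.
 */
static bool sketch_push_should_stop(const int *sk_state)
{
	int state = sketch_state_load(sk_state);

	return (1 << state) & (SKETCH_TCPF(SKETCH_TCP_FIN_WAIT1) |
			       SKETCH_TCPF(SKETCH_TCP_FIN_WAIT2) |
			       SKETCH_TCPF(SKETCH_TCP_CLOSE));
}
```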

> +						err = 1;
> +					continue;
> +				}
> +				do_check_data_fin = true;
> +				msk->last_snd = ssk;
> +				mptcp_subflow_set_scheduled(subflow, false);
> +			}
> 		}
> -		do_check_data_fin = true;
> 	}
>
> +out:
> 	/* at this point we held the socket lock for the last subflow we used */
> 	if (ssk)
> 		mptcp_push_release(ssk, &info);
>
> -out:
> 	/* ensure the rtx timer is running */
> 	if (!mptcp_timer_pending(sk))
> 		mptcp_reset_timer(sk);
> @@ -1614,33 +1616,54 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags)
> static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk, bool first)
> {
> 	struct mptcp_sock *msk = mptcp_sk(sk);
> +	struct mptcp_subflow_context *subflow;
> 	struct mptcp_sendmsg_info info = {
> 		.data_lock_held = true,
> 	};
> -	struct sock *xmit_ssk;
> -	int copied = 0;
> +	int copied = 0, err = 0;

I suggest making 'err' a bool, so it's obvious it's not an integer error 
code (like -EAGAIN, etc).

>
> 	info.flags = 0;
> -	while (mptcp_send_head(sk)) {
> +	while (mptcp_send_head(sk) && !err) {
> 		int ret = 0;
>
> 		/* check for a different subflow usage only after
> 		 * spooling the first chunk of data
> 		 */
> -		xmit_ssk = first ? ssk : mptcp_subflow_get_send(msk);
> -		if (!xmit_ssk)
> -			goto out;
> -		if (xmit_ssk != ssk) {
> -			mptcp_subflow_delegate(mptcp_subflow_ctx(xmit_ssk),
> -					       MPTCP_DELEGATE_SEND);
> -			goto out;
> +		if (first) {
> +			ret = __subflow_push_pending(sk, ssk, &info);
> +			first = false;
> +			if (ret <= 0)
> +				break;
> +			copied += ret;
> +			msk->last_snd = ssk;
> +			continue;
> 		}
>
> -		ret = __subflow_push_pending(sk, ssk, &info);
> -		first = false;
> -		if (ret <= 0)
> -			break;
> -		copied += ret;
> +		if (mptcp_sched_get_send(msk))
> +			goto out;
> +
> +		mptcp_for_each_subflow(msk, subflow) {
> +			if (READ_ONCE(subflow->scheduled)) {
> +				struct sock *xmit_ssk = mptcp_subflow_tcp_sock(subflow);
> +
> +				if (xmit_ssk != ssk) {
> +					mptcp_subflow_delegate(subflow,
> +							       MPTCP_DELEGATE_SEND);
> +					msk->last_snd = ssk;
> +					mptcp_subflow_set_scheduled(subflow, false);
> +					goto out;
> +				}
> +
> +				ret = __subflow_push_pending(sk, ssk, &info);
> +				if (ret <= 0) {
> +					err = 1;
> +					continue;
> +				}
> +				copied += ret;
> +				msk->last_snd = ssk;
> +				mptcp_subflow_set_scheduled(subflow, false);
> +			}
> +		}
> 	}
>
> out:
> diff --git a/net/mptcp/sched.c b/net/mptcp/sched.c
> index f51f9cf20b6e..923c07296ff3 100644
> --- a/net/mptcp/sched.c
> +++ b/net/mptcp/sched.c
> @@ -119,11 +119,24 @@ int mptcp_sched_get_send(struct mptcp_sock *msk)
> 	struct mptcp_sched_data data;
> 	struct sock *ssk = NULL;
>
> +	sock_owned_by_me((const struct sock *)msk);
> +
> 	mptcp_for_each_subflow(msk, subflow) {
> 		if (READ_ONCE(subflow->scheduled))
> 			return 0;
> 	}
>
> +	/* the following check is moved out of mptcp_subflow_get_send */
> +	if (__mptcp_check_fallback(msk)) {
> +		if (msk->first &&
> +		    __tcp_can_send(msk->first) &&
> +		    sk_stream_memory_free(msk->first)) {
> +			mptcp_subflow_set_scheduled(mptcp_subflow_ctx(msk->first), true);
> +			return 0;
> +		}
> +		return -EINVAL;
> +	}
> +

Better to check fallback before the subflow->scheduled bits, right? It's 
not expected for subflow->scheduled to be set on the single subflow 
present in a connection that has gone through fallback, but it's better to 
be safe by reordering these chunks of code.
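A simplified model of the reordering suggested above: the fallback check runs before the scan of scheduled bits, so a stale bit on the only subflow cannot bypass the can-send test. struct sketch_sf, struct sketch_msk and sketch_sched_get_send() are hypothetical; step 3 is a placeholder for the real scheduler call.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types mirroring only the fields the checks use. */
struct sketch_sf {
	bool scheduled;
	bool can_send; /* __tcp_can_send() && sk_stream_memory_free() */
};

struct sketch_msk {
	bool fallback;             /* __mptcp_check_fallback() */
	struct sketch_sf *first;   /* msk->first, may be NULL */
	struct sketch_sf *subflows;
	size_t nr_subflows;
};

static int sketch_sched_get_send(struct sketch_msk *msk)
{
	/* 1. Fallback first: only msk->first may carry data. */
	if (msk->fallback) {
		if (msk->first && msk->first->can_send) {
			msk->first->scheduled = true;
			return 0;
		}
		return -EINVAL;
	}

	/* 2. Subflows still scheduled from an earlier round are reused. */
	for (size_t i = 0; i < msk->nr_subflows; i++)
		if (msk->subflows[i].scheduled)
			return 0;

	/* 3. Placeholder scheduler: pick the first subflow able to send. */
	for (size_t i = 0; i < msk->nr_subflows; i++) {
		if (msk->subflows[i].can_send) {
			msk->subflows[i].scheduled = true;
			return 0;
		}
	}
	return -EINVAL;
}
```

With the original ordering, a fallen-back connection whose single subflow still had `scheduled` set would return 0 even if it could not send; here it returns -EINVAL.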


> 	if (!msk->sched) {
> 		ssk = mptcp_subflow_get_send(msk);
> 		if (!ssk)
> -- 
> 2.35.3
>
>
>

--
Mat Martineau
Intel

* Re: [PATCH mptcp-next v20 3/7] mptcp: use get_retrans wrapper
  2022-11-16 11:43 ` [PATCH mptcp-next v20 3/7] mptcp: use get_retrans wrapper Geliang Tang
@ 2022-11-18 22:05   ` Mat Martineau
  0 siblings, 0 replies; 14+ messages in thread
From: Mat Martineau @ 2022-11-18 22:05 UTC (permalink / raw)
  To: Geliang Tang; +Cc: mptcp

On Wed, 16 Nov 2022, Geliang Tang wrote:

> This patch adds multiple-subflow support to __mptcp_retrans(), using the
> get_retrans() wrapper instead of mptcp_subflow_get_retrans() in it.
>
> Check the subflow scheduled flags to find which subflow or subflows were
> picked by the scheduler, and use them to send data.
>
> Move the sock_owned_by_me() and fallback checks from
> mptcp_subflow_get_retrans() into the get_retrans() wrapper.
>
> Signed-off-by: Geliang Tang <geliang.tang@suse.com>
> ---
> net/mptcp/protocol.c | 67 ++++++++++++++++++++++++++------------------
> net/mptcp/sched.c    |  6 ++++
> 2 files changed, 45 insertions(+), 28 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index f3720923b22d..b8265badbe29 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -2242,11 +2242,6 @@ struct sock *mptcp_subflow_get_retrans(struct mptcp_sock *msk)
> 	struct mptcp_subflow_context *subflow;
> 	int min_stale_count = INT_MAX;
>
> -	sock_owned_by_me((const struct sock *)msk);
> -
> -	if (__mptcp_check_fallback(msk))
> -		return NULL;
> -
> 	mptcp_for_each_subflow(msk, subflow) {
> 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
>
> @@ -2521,16 +2516,17 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
> static void __mptcp_retrans(struct sock *sk)
> {
> 	struct mptcp_sock *msk = mptcp_sk(sk);
> +	struct mptcp_subflow_context *subflow;
> 	struct mptcp_sendmsg_info info = {};
> 	struct mptcp_data_frag *dfrag;
> -	size_t copied = 0;
> 	struct sock *ssk;
> -	int ret;
> +	int ret, err;
> +	u16 len = 0;
>
> 	mptcp_clean_una_wakeup(sk);
>
> 	/* first check ssk: need to kick "stale" logic */
> -	ssk = mptcp_subflow_get_retrans(msk);
> +	err = mptcp_sched_get_retrans(msk);
> 	dfrag = mptcp_rtx_head(sk);
> 	if (!dfrag) {
> 		if (mptcp_data_fin_enabled(msk)) {
> @@ -2549,31 +2545,46 @@ static void __mptcp_retrans(struct sock *sk)
> 		goto reset_timer;
> 	}
>
> -	if (!ssk)
> +	if (err)
> 		goto reset_timer;
>
> -	lock_sock(ssk);
> +	mptcp_for_each_subflow(msk, subflow) {
> +		if (READ_ONCE(subflow->scheduled)) {
> +			u16 copied = 0;
>
> -	/* limit retransmission to the bytes already sent on some subflows */
> -	info.sent = 0;
> -	info.limit = READ_ONCE(msk->csum_enabled) ? dfrag->data_len : dfrag->already_sent;
> -	while (info.sent < info.limit) {
> -		ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info);
> -		if (ret <= 0)
> -			break;
> +			ssk = mptcp_subflow_tcp_sock(subflow);
> +			if (!ssk)
> +				goto reset_timer;
>
> -		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RETRANSSEGS);
> -		copied += ret;
> -		info.sent += ret;
> -	}
> -	if (copied) {
> -		dfrag->already_sent = max(dfrag->already_sent, info.sent);
> -		tcp_push(ssk, 0, info.mss_now, tcp_sk(ssk)->nonagle,
> -			 info.size_goal);
> -		WRITE_ONCE(msk->allow_infinite_fallback, false);
> -	}
> +			lock_sock(ssk);
>
> -	release_sock(ssk);
> +			/* limit retransmission to the bytes already sent on some subflows */
> +			info.sent = 0;
> +			info.limit = READ_ONCE(msk->csum_enabled) ? dfrag->data_len :
> +								    dfrag->already_sent;
> +			while (info.sent < info.limit) {
> +				ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info);
> +				if (ret <= 0)
> +					break;
> +
> +				MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RETRANSSEGS);
> +				copied += ret;
> +				info.sent += ret;
> +			}
> +			if (copied) {
> +				len = max(copied, len);
> +				tcp_push(ssk, 0, info.mss_now, tcp_sk(ssk)->nonagle,
> +					 info.size_goal);
> +				WRITE_ONCE(msk->allow_infinite_fallback, false);
> +			}
> +
> +			release_sock(ssk);
> +
> +			msk->last_snd = ssk;
> +			mptcp_subflow_set_scheduled(subflow, false);
> +		}
> +	}
> +	dfrag->already_sent = max(dfrag->already_sent, len);
>
> reset_timer:
> 	mptcp_check_and_set_pending(sk);
> diff --git a/net/mptcp/sched.c b/net/mptcp/sched.c
> index 923c07296ff3..edddd7cceada 100644
> --- a/net/mptcp/sched.c
> +++ b/net/mptcp/sched.c
> @@ -156,11 +156,17 @@ int mptcp_sched_get_retrans(struct mptcp_sock *msk)
> 	struct mptcp_sched_data data;
> 	struct sock *ssk = NULL;
>
> +	sock_owned_by_me((const struct sock *)msk);
> +
> 	mptcp_for_each_subflow(msk, subflow) {
> 		if (READ_ONCE(subflow->scheduled))
> 			return 0;
> 	}
>
> +	/* the following check is moved out of mptcp_subflow_get_retrans */
> +	if (__mptcp_check_fallback(msk))
> +		return -EINVAL;
> +

Like the previous patch, I think this fits better before the 
mptcp_for_each_subflow().

> 	if (!msk->sched) {
> 		ssk = mptcp_subflow_get_retrans(msk);
> 		if (!ssk)
> -- 
> 2.35.3
>
>
>

--
Mat Martineau
Intel

* Re: [PATCH mptcp-next v20 5/7] mptcp: delay updating already_sent
  2022-11-16 11:43 ` [PATCH mptcp-next v20 5/7] mptcp: delay updating already_sent Geliang Tang
@ 2022-11-18 22:15   ` Mat Martineau
  2022-11-28  3:33     ` Geliang Tang
  0 siblings, 1 reply; 14+ messages in thread
From: Mat Martineau @ 2022-11-18 22:15 UTC (permalink / raw)
  To: Geliang Tang; +Cc: mptcp

On Wed, 16 Nov 2022, Geliang Tang wrote:

> This patch adds a new member, sent, to struct mptcp_data_frag and saves
> info->sent in it, so that updating already_sent of the dfrags can be
> delayed until all data is sent.
>

I think this patch is likely the cause of the DSS issues you mentioned in 
https://lore.kernel.org/mptcp/20221022125610.GA28495@bogon/

Looking at the code some more, __mptcp_clean_una() accesses and modifies 
dfrag->already_sent. If data is sent on one subflow, it can be acked at 
any time - even while the redundant sends are still happening on other 
subflows. So I don't think delaying dfrag->already_sent updates is a 
design that can work.

This makes me think that redundant sends need to work more like the MPTCP 
retransmit code path. When the scheduler selects multiple subflows, the 
first subflow to send is a "normal" transmit, and any other subflows would 
act like a retransmit when accessing the dfrags.
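One possible reading of this suggestion, as a userspace sketch with hypothetical names (struct sketch_dfrag reduces struct mptcp_data_frag to two fields). The first scheduled subflow does a normal transmit and advances already_sent immediately, so an ack handled in between (as __mptcp_clean_una() may do) never sees a stale value. Every other scheduled subflow behaves like __mptcp_retrans() and only re-sends bytes already covered by already_sent, without modifying it. For simplicity, the redundant sends restart from offset 0 of the dfrag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of struct mptcp_data_frag. */
struct sketch_dfrag {
	unsigned short data_len;
	unsigned short already_sent;
};

static int sketch_min(int a, int b)
{
	return a < b ? a : b;
}

/* First scheduled subflow: normal transmit of not-yet-sent bytes;
 * already_sent is updated right away, not deferred.
 */
static int sketch_first_send(struct sketch_dfrag *dfrag, int budget)
{
	int len = sketch_min(dfrag->data_len - dfrag->already_sent, budget);

	dfrag->already_sent += len;
	return len;
}

/* Any further scheduled subflow: retransmit-like, limited to the bytes
 * already_sent covers, and already_sent is left untouched.
 */
static int sketch_redundant_send(const struct sketch_dfrag *dfrag, int budget)
{
	return sketch_min(dfrag->already_sent, budget);
}

/* One round over n scheduled subflows; sent[i] receives the bytes
 * subflow i would transmit.
 */
static void sketch_redundant_round(struct sketch_dfrag *dfrag,
				   const int *budgets, int *sent, size_t n)
{
	for (size_t i = 0; i < n; i++)
		sent[i] = i == 0 ? sketch_first_send(dfrag, budgets[i])
				 : sketch_redundant_send(dfrag, budgets[i]);
}
```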


- Mat


> Signed-off-by: Geliang Tang <geliang.tang@suse.com>
> ---
> net/mptcp/protocol.c | 18 ++++++++++++++++--
> net/mptcp/protocol.h |  1 +
> 2 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 4c249d1b9ec6..a1007751bd4d 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1099,6 +1099,7 @@ mptcp_carve_data_frag(const struct mptcp_sock *msk, struct page_frag *pfrag,
> 	dfrag->data_seq = msk->write_seq;
> 	dfrag->overhead = offset - orig_offset + sizeof(struct mptcp_data_frag);
> 	dfrag->offset = offset + sizeof(struct mptcp_data_frag);
> +	dfrag->sent = 0;
> 	dfrag->already_sent = 0;
> 	dfrag->page = pfrag->page;
>
> @@ -1484,11 +1485,11 @@ static void mptcp_update_post_push(struct mptcp_sock *msk,
> {
> 	u64 snd_nxt_new = dfrag->data_seq;
>
> -	dfrag->already_sent += sent;
> +	dfrag->sent += sent;
>
> 	msk->snd_burst -= sent;
>
> -	snd_nxt_new += dfrag->already_sent;
> +	snd_nxt_new += dfrag->sent;
>
> 	/* snd_nxt_new can be smaller than snd_nxt in case mptcp
> 	 * is recovering after a failover. In that event, this re-sends
> @@ -1513,6 +1514,18 @@ static void mptcp_update_first_pending(struct sock *sk, struct mptcp_sendmsg_inf
>
> static void mptcp_update_dfrags(struct sock *sk, struct mptcp_sendmsg_info *info)
> {
> +	struct mptcp_data_frag *dfrag = mptcp_send_head(sk);
> +
> +	if (!dfrag)
> +		return;
> +
> +	do {
> +		if (dfrag->sent) {
> +			dfrag->already_sent = max(dfrag->already_sent, dfrag->sent);
> +			dfrag->sent = 0;
> +		}
> +	} while ((dfrag = mptcp_next_frag(sk, dfrag)));
> +
> 	mptcp_update_first_pending(sk, info);
> }
>
> @@ -1539,6 +1552,7 @@ static int __subflow_push_pending(struct sock *sk, struct sock *ssk,
> 		info->sent = dfrag->already_sent;
> 		info->limit = dfrag->data_len;
> 		len = dfrag->data_len - dfrag->already_sent;
> +		dfrag->sent = info->sent;
> 		while (len > 0) {
> 			int ret = 0;
>
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index cdce0c092c3c..cdadb39a03da 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -250,6 +250,7 @@ struct mptcp_data_frag {
> 	u16 data_len;
> 	u16 offset;
> 	u16 overhead;
> +	u16 sent;
> 	u16 already_sent;
> 	struct page *page;
> };
> -- 
> 2.35.3
>
>
>

--
Mat Martineau
Intel

* Re: [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2
  2022-11-18 21:42 ` [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Mat Martineau
@ 2022-11-18 22:20   ` Geliang Tang
  0 siblings, 0 replies; 14+ messages in thread
From: Geliang Tang @ 2022-11-18 22:20 UTC (permalink / raw)
  To: Mat Martineau; +Cc: mptcp

On Fri, Nov 18, 2022 at 01:42:48PM -0800, Mat Martineau wrote:
> On Wed, 16 Nov 2022, Geliang Tang wrote:
> 
> > v20:
> > - rebased on "Squash to "mptcp: refactor push_pending logic" v19"
> > 
> > v19:
> > - patch 1, use 'continue' instead of 'goto again'.
> > 
> > v18:
> > - some cleanups
> > - update commit logs.
> > 
> > v17:
> > - address to Mat's comments in v16
> > - rebase to export/20221108T055508.
> > 
> > v16:
> > - keep last_snd and snd_burst in struct mptcp_sock.
> 
> I didn't notice this back in v16 - why has last_snd returned? At the end of
> the series, it still doesn't appear to do anything. Is it useful for some
> future feature?

last_snd is now used by the BPF round-robin packet scheduler in
"selftests/bpf: Add bpf_rr scheduler".
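A minimal userspace sketch of how a round-robin scheduler can use last_snd: pick the subflow after the one that sent last, wrapping to the first one. sketch_rr_pick() and SKETCH_SUBFLOWS_MAX (standing in for MPTCP_SUBFLOWS_MAX) are hypothetical; this is not the bpf_rr selftest source, only an illustration of the idea.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SUBFLOWS_MAX 8 /* stands in for MPTCP_SUBFLOWS_MAX */

/* contexts[] holds the subflows in order, terminated by NULL (or full).
 * Returns the index of the subflow to schedule next: the one after
 * last_snd, or 0 when last_snd is NULL, unknown, or the last entry.
 */
static int sketch_rr_pick(const void *const *contexts, const void *last_snd)
{
	for (int i = 0; i < SKETCH_SUBFLOWS_MAX && contexts[i]; i++) {
		if (contexts[i] != last_snd)
			continue;
		if (i + 1 < SKETCH_SUBFLOWS_MAX && contexts[i + 1])
			return i + 1;
		return 0; /* wrap around */
	}
	return 0;
}
```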

Thanks,
-Geliang

* Re: [PATCH mptcp-next v20 5/7] mptcp: delay updating already_sent
  2022-11-18 22:15   ` Mat Martineau
@ 2022-11-28  3:33     ` Geliang Tang
  0 siblings, 0 replies; 14+ messages in thread
From: Geliang Tang @ 2022-11-28  3:33 UTC (permalink / raw)
  To: Mat Martineau; +Cc: mptcp

On Fri, Nov 18, 2022 at 02:15:30PM -0800, Mat Martineau wrote:
> On Wed, 16 Nov 2022, Geliang Tang wrote:
> 
> > This patch adds a new member, sent, to struct mptcp_data_frag and saves
> > info->sent in it, so that updating already_sent of the dfrags can be
> > delayed until all data is sent.
> > 
> 
> I think this patch is likely the cause of the DSS issues you mentioned in
> https://lore.kernel.org/mptcp/20221022125610.GA28495@bogon/
> 
> Looking at the code some more, __mptcp_clean_una() accesses and modifies
> dfrag->already_sent. If data is sent on one subflow, it can be acked at any
> time - even while the redundant sends are still happening on other subflows.
> So I don't think delaying dfrag->already_sent updates is a design that can
> work.
> 
> This makes me think that redundant sends need to work more like the MPTCP
> retransmit code path. When the scheduler selects multiple subflows, the
> first subflow to send is a "normal" transmit, and any other subflows would
> act like a retransmit when accessing the dfrags.

Hi Mat,

In v21 I moved the redundant sends onto the retransmit code path, but
the DSS issues still exist.

Thanks,
-Geliang

Thread overview: 14+ messages
2022-11-16 11:43 [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Geliang Tang
2022-11-16 11:43 ` [PATCH mptcp-next v20 1/7] mptcp: add scheduler wrappers Geliang Tang
2022-11-16 11:43 ` [PATCH mptcp-next v20 2/7] mptcp: use get_send wrapper Geliang Tang
2022-11-18 22:04   ` Mat Martineau
2022-11-16 11:43 ` [PATCH mptcp-next v20 3/7] mptcp: use get_retrans wrapper Geliang Tang
2022-11-18 22:05   ` Mat Martineau
2022-11-16 11:43 ` [PATCH mptcp-next v20 4/7] mptcp: delay updating first_pending Geliang Tang
2022-11-16 11:43 ` [PATCH mptcp-next v20 5/7] mptcp: delay updating already_sent Geliang Tang
2022-11-18 22:15   ` Mat Martineau
2022-11-28  3:33     ` Geliang Tang
2022-11-16 11:43 ` [PATCH mptcp-next v20 6/7] selftests/bpf: Add bpf_red scheduler Geliang Tang
2022-11-16 11:43 ` [PATCH mptcp-next v20 7/7] selftests/bpf: Add bpf_red test Geliang Tang
2022-11-18 21:42 ` [PATCH mptcp-next v20 0/7] BPF redundant scheduler, part 2 Mat Martineau
2022-11-18 22:20   ` Geliang Tang
