* [PATCH v11 mptcp-next 0/2] mptcp: address stall under memory pressure
@ 2026-05-30 14:59 Paolo Abeni
2026-05-30 14:59 ` [PATCH v11 mptcp-next 1/2] mptcp: move the retrans loop to a separate helper Paolo Abeni
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Paolo Abeni @ 2026-05-30 14:59 UTC
To: mptcp
This brings (hopefully) the final bits required to address the
data transfer stall reported by Geliang and Gang.
This series improves MPTCP retransmissions to make them reliable:
pruning can require some of them.
The only change over the previous iteration is in patch 2, addressing
feedback from Geliang and sashiko (bad WARN condition).
Paolo Abeni (2):
mptcp: move the retrans loop to a separate helper
mptcp: let the retrans scheduler do its job.
net/mptcp/protocol.c | 156 +++++++++++++++++++++++++++++++------------
1 file changed, 112 insertions(+), 44 deletions(-)
--
2.54.0
* [PATCH v11 mptcp-next 1/2] mptcp: move the retrans loop to a separate helper
From: Paolo Abeni @ 2026-05-30 14:59 UTC
To: mptcp
This is a cleanup in order to make the next patch simpler.
No functional change intended.
Tested-by: Gang Yan <yangang@kylinos.cn>
Tested-by: Geliang Tang <geliang@kernel.org>
Acked-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/protocol.c | 74 +++++++++++++++++++++++++-------------------
1 file changed, 43 insertions(+), 31 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 03234e8cc26c..51756800edc2 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2830,41 +2830,14 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
sk_error_report(sk);
}
-static void __mptcp_retrans(struct sock *sk)
+/* Retransmit the specified data fragment on all the selected subflows. */
+static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
{
struct mptcp_sendmsg_info info = { .data_lock_held = true, };
struct mptcp_sock *msk = mptcp_sk(sk);
struct mptcp_subflow_context *subflow;
- struct mptcp_data_frag *dfrag;
struct sock *ssk;
- int ret, err;
- u16 len = 0;
-
- mptcp_clean_una_wakeup(sk);
-
- /* first check ssk: need to kick "stale" logic */
- err = mptcp_sched_get_retrans(msk);
- dfrag = mptcp_rtx_head(sk);
- if (!dfrag) {
- if (mptcp_data_fin_enabled(msk)) {
- struct inet_connection_sock *icsk = inet_csk(sk);
-
- WRITE_ONCE(icsk->icsk_retransmits,
- icsk->icsk_retransmits + 1);
- mptcp_set_datafin_timeout(sk);
- mptcp_send_ack(msk);
-
- goto reset_timer;
- }
-
- if (!mptcp_send_head(sk))
- goto clear_scheduled;
-
- goto reset_timer;
- }
-
- if (err)
- goto reset_timer;
+ int ret, len = 0;
mptcp_for_each_subflow(msk, subflow) {
if (READ_ONCE(subflow->scheduled)) {
@@ -2892,7 +2865,7 @@ static void __mptcp_retrans(struct sock *sk)
!msk->allow_subflows) {
spin_unlock_bh(&msk->fallback_lock);
release_sock(ssk);
- goto clear_scheduled;
+ return -1;
}
while (info.sent < info.limit) {
@@ -2915,6 +2888,45 @@ static void __mptcp_retrans(struct sock *sk)
release_sock(ssk);
}
}
+ return len;
+}
+
+static void __mptcp_retrans(struct sock *sk)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct mptcp_subflow_context *subflow;
+ struct mptcp_data_frag *dfrag;
+ int err, len;
+
+ mptcp_clean_una_wakeup(sk);
+
+ /* first check ssk: need to kick "stale" logic */
+ err = mptcp_sched_get_retrans(msk);
+ dfrag = mptcp_rtx_head(sk);
+ if (!dfrag) {
+ if (mptcp_data_fin_enabled(msk)) {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+
+ WRITE_ONCE(icsk->icsk_retransmits,
+ icsk->icsk_retransmits + 1);
+ mptcp_set_datafin_timeout(sk);
+ mptcp_send_ack(msk);
+
+ goto reset_timer;
+ }
+
+ if (!mptcp_send_head(sk))
+ goto clear_scheduled;
+
+ goto reset_timer;
+ }
+
+ if (err)
+ goto reset_timer;
+
+ len = __mptcp_push_retrans(sk, dfrag);
+ if (len < 0)
+ goto clear_scheduled;
msk->bytes_retrans += len;
dfrag->already_sent = max(dfrag->already_sent, len);
--
2.54.0
* [PATCH v11 mptcp-next 2/2] mptcp: let the retrans scheduler do its job.
From: Paolo Abeni @ 2026-05-30 14:59 UTC
To: mptcp
Currently the MPTCP core enforces that when the MPTCP-level retrans timer
fires, at most a single dfrag is retransmitted. In some corner cases it
may be necessary to retransmit multiple dfrags, and the MPTCP socket will
need to wait for multiple retrans timeouts to accomplish that.
Remove the mentioned constraint, allowing multiple dfrags to be
transmitted per retrans period, as long as the scheduler keeps selecting
subflows for retransmissions and pending data is available in the rtx queue.
The default scheduler will transmit a dfrag per available subflow.
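To illustrate the behavior described above, here is a minimal user-space
sketch, not the kernel code. `struct frag`, `retrans_timer_model()` and the
`picks` parameter are hypothetical names introduced only for this example:
`picks` stands in for how many times the scheduler is willing to select a
subflow during one timer expiry. The model assumes a contiguous queue, uses
plain comparisons instead of the wraparound-safe helpers, and ignores the
checksum case where the limit is `data_len`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct mptcp_data_frag. */
struct frag {
	uint64_t data_seq;	/* first sequence number of the fragment */
	uint16_t data_len;	/* total bytes held by the fragment */
	uint16_t already_sent;	/* bytes transmitted at least once */
};

/*
 * Model of a single retrans timer expiry: walk the queue from snd_una and
 * retransmit one fragment per scheduler pick. Returns the number of bytes
 * retransmitted.
 */
static uint64_t retrans_timer_model(const struct frag *q, size_t n,
				    uint64_t snd_una, unsigned int picks)
{
	uint64_t retrans_seq = snd_una;
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct frag *f = &q[i];
		uint64_t limit = f->data_seq + f->already_sent;
		uint64_t len;

		/* The scheduler selected no subflow: stop until next timer. */
		if (!picks)
			break;
		/* Never-sent data is not a retransmission candidate. */
		if (!f->already_sent)
			break;
		/* Nothing left to retransmit in this fragment. */
		if (retrans_seq < f->data_seq || retrans_seq >= limit)
			break;

		len = limit - retrans_seq;
		assert(len > 0);
		picks--;
		retrans_seq += len;
		total += len;

		/* Move on only if the fragment was completely retransmitted. */
		if (retrans_seq < f->data_seq + f->data_len)
			break;
	}
	return total;
}
```

With `picks == 1` the model matches the old "one dfrag per timer" behavior;
larger values let one expiry cover several fragments.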
Tested-by: Gang Yan <yangang@kylinos.cn>
Tested-by: Geliang Tang <geliang@kernel.org>
Acked-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v10 -> v11:
- avoid WARNING when retransmitting a dfrag with 0 already_sent, as such
a state can happen, as reported by Geliang. Instead explicitly check
for the bad condition and skip.
v9 -> v10:
- simpler handling for data-fin rtx
v7 -> v8:
- fix corner-case retrans_seq update
v4 -> v5:
- fixed already_sent update
v3 -> v4:
- avoid quadratic behavior, fix retrans_seq update
- fix rtx timer re-schedule miss
v2 -> v3:
- fix infinite loop issue (should address tls tests failures)
v1 -> v2:
- fix retrans sequence update (sashiko)
Note:
- sashiko may see missing data-fin rtx when the initial `dfrag` is
not NULL. data-fin RTX is NOT needed in such a scenario.
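
For readers following the v10 -> v11 change, the skip condition can be
sketched in user space as below. This is an illustration only:
`seq_before64()` mirrors the idea of the kernel's `before64()` helper, and
`frag_already_retrans()` is a hypothetical name for the predicate the patch
computes inline. A fragment is skipped when nothing was ever sent from it
(`already_sent == 0`) or when `snd_una` has already moved past the sent part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "seq1 is before seq2" on 64-bit sequence numbers. */
static bool seq_before64(uint64_t seq1, uint64_t seq2)
{
	return (int64_t)(seq1 - seq2) < 0;
}

/*
 * True when the fragment has nothing left to retransmit: either no byte
 * was ever sent, or every sent byte is already acknowledged.
 */
static bool frag_already_retrans(uint64_t snd_una, uint64_t data_seq,
				 uint16_t already_sent)
{
	return !already_sent ||
	       !seq_before64(snd_una, data_seq + already_sent);
}
```

The signed difference keeps the comparison correct even when the sequence
space wraps around between `data_seq` and `snd_una`.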
---
net/mptcp/protocol.c | 120 +++++++++++++++++++++++++++++++------------
1 file changed, 88 insertions(+), 32 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 51756800edc2..264a13bc6f3e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1201,13 +1201,6 @@ static void __mptcp_clean_una_wakeup(struct sock *sk)
mptcp_write_space(sk);
}
-static void mptcp_clean_una_wakeup(struct sock *sk)
-{
- mptcp_data_lock(sk);
- __mptcp_clean_una_wakeup(sk);
- mptcp_data_unlock(sk);
-}
-
static void mptcp_enter_memory_pressure(struct sock *sk)
{
struct mptcp_subflow_context *subflow;
@@ -2830,8 +2823,12 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
sk_error_report(sk);
}
-/* Retransmit the specified data fragment on all the selected subflows. */
-static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
+/*
+ * Retransmit the specified data fragment on all the selected subflows,
+ * starting from the specified sequence
+ */
+static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag,
+ u64 sent_seq)
{
struct mptcp_sendmsg_info info = { .data_lock_held = true, };
struct mptcp_sock *msk = mptcp_sk(sk);
@@ -2841,6 +2838,7 @@ static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
mptcp_for_each_subflow(msk, subflow) {
if (READ_ONCE(subflow->scheduled)) {
+ u16 offset = sent_seq - dfrag->data_seq;
u16 copied = 0;
mptcp_subflow_set_scheduled(subflow, false);
@@ -2850,7 +2848,7 @@ static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
lock_sock(ssk);
/* limit retransmission to the bytes already sent on some subflows */
- info.sent = 0;
+ info.sent = offset;
info.limit = READ_ONCE(msk->csum_enabled) ? dfrag->data_len :
dfrag->already_sent;
@@ -2896,14 +2894,89 @@ static void __mptcp_retrans(struct sock *sk)
struct mptcp_sock *msk = mptcp_sk(sk);
struct mptcp_subflow_context *subflow;
struct mptcp_data_frag *dfrag;
+ bool need_retrans;
+ u64 retrans_seq;
int err, len;
- mptcp_clean_una_wakeup(sk);
-
- /* first check ssk: need to kick "stale" logic */
- err = mptcp_sched_get_retrans(msk);
+ mptcp_data_lock(sk);
+ __mptcp_clean_una_wakeup(sk);
+ retrans_seq = msk->snd_una;
dfrag = mptcp_rtx_head(sk);
- if (!dfrag) {
+ need_retrans = !!dfrag;
+ mptcp_data_unlock(sk);
+ if (!dfrag)
+ goto check_data_fin;
+
+ for (;;) {
+ bool already_retrans;
+ u64 sent_seq;
+
+ /* The default scheduler will kick "stale" logic, that in
+ * turn can process incoming acks and clean the RTX queue;
+ * ensure that the current dfrag will still be around
+ * afterwards.
+ */
+ get_page(dfrag->page);
+ err = mptcp_sched_get_retrans(msk);
+ if (err) {
+ put_page(dfrag->page);
+ break;
+ }
+
+ /* Incoming acks can have moved retrans sequence after
+ * the current dfrag, if so try to start again from RTX head.
+ */
+ mptcp_data_lock(sk);
+ already_retrans = !dfrag->already_sent ||
+ !before64(msk->snd_una, dfrag->data_seq +
+ dfrag->already_sent);
+ put_page(dfrag->page);
+ if (already_retrans) {
+ __mptcp_clean_una_wakeup(sk);
+ retrans_seq = msk->snd_una;
+ dfrag = mptcp_rtx_head(sk);
+ need_retrans = !!dfrag;
+ } else if (after64(msk->snd_una, retrans_seq)) {
+ retrans_seq = msk->snd_una;
+ }
+ mptcp_data_unlock(sk);
+
+ /* `already_sent` can be 0 for `dfrag` belonging to the RTX
+ * queue due to __mptcp_retransmit_pending_data().
+ */
+ if (!dfrag || !dfrag->already_sent)
+ break;
+
+ /* Can fail only in case of fallback. */
+ len = __mptcp_push_retrans(sk, dfrag, retrans_seq);
+ if (len < 0)
+ goto clear_scheduled;
+
+ retrans_seq += len;
+ msk->bytes_retrans += len;
+ dfrag->already_sent = max_t(u16, dfrag->already_sent,
+ retrans_seq - dfrag->data_seq);
+
+ /* With csum enabled retransmission can send new data. */
+ sent_seq = dfrag->already_sent + dfrag->data_seq;
+ if (after64(sent_seq, msk->snd_nxt))
+ WRITE_ONCE(msk->snd_nxt, sent_seq);
+
+ /* Attempt the next fragment only if the current one is
+ * completely retransmitted.
+ */
+ if (before64(retrans_seq, dfrag->data_seq + dfrag->data_len))
+ break;
+
+ dfrag = list_is_last(&dfrag->list, &msk->rtx_queue) ?
+ NULL : list_next_entry(dfrag, list);
+ if (!dfrag)
+ break;
+ }
+
+ /* Attempt data-fin retransmission only when the RTX queue is empty. */
+ if (!need_retrans) {
+check_data_fin:
if (mptcp_data_fin_enabled(msk)) {
struct inet_connection_sock *icsk = inet_csk(sk);
@@ -2911,30 +2984,13 @@ static void __mptcp_retrans(struct sock *sk)
icsk->icsk_retransmits + 1);
mptcp_set_datafin_timeout(sk);
mptcp_send_ack(msk);
-
goto reset_timer;
}
if (!mptcp_send_head(sk))
goto clear_scheduled;
-
- goto reset_timer;
}
- if (err)
- goto reset_timer;
-
- len = __mptcp_push_retrans(sk, dfrag);
- if (len < 0)
- goto clear_scheduled;
-
- msk->bytes_retrans += len;
- dfrag->already_sent = max(dfrag->already_sent, len);
-
- /* With csum enabled retransmission can send new data. */
- if (after64(dfrag->already_sent + dfrag->data_seq, msk->snd_nxt))
- WRITE_ONCE(msk->snd_nxt, dfrag->already_sent + dfrag->data_seq);
-
reset_timer:
mptcp_check_and_set_pending(sk);
--
2.54.0
* Re: [PATCH v11 mptcp-next 0/2] mptcp: address stall under memory pressure
From: MPTCP CI @ 2026-05-30 16:28 UTC
To: Paolo Abeni; +Cc: mptcp
Hi Paolo,
Thank you for your modifications, that's great!
Our CI did some validations and here is its report:
- KVM Validation: normal (except selftest_mptcp_join): Success! ✅
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Success! ✅
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/26687469839
Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/89eb2ec6592a
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1103334
If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:
$ cd [kernel source code]
$ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
--pull always mptcp/mptcp-upstream-virtme-docker:latest \
auto-normal
For more details:
https://github.com/multipath-tcp/mptcp-upstream-virtme-docker
Please note that despite all the efforts already made to have a
stable test suite when executed on a public CI like this one, it is possible
that some reported issues are not due to your modifications. Still, do not
hesitate to help us improve that ;-)
Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)