From: Geliang Tang <geliang@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>, mptcp@lists.linux.dev
Cc: Matthieu Baerts <matttbe@kernel.org>, gang.yan@linux.dev
Subject: Re: [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job.
Date: Wed, 27 May 2026 13:46:56 +0800 [thread overview]
Message-ID: <52022434fe4b2996253b5fc2af7ea98aa987bcb5.camel@kernel.org> (raw)
In-Reply-To: <a35934164aad8a08ed8108853cff7a282e5c25ef.1779485511.git.pabeni@redhat.com>
Hi Paolo,
On Fri, 2026-05-22 at 23:43 +0200, Paolo Abeni wrote:
> Currently the MPTCP core enforces that when MPTCP-level retrans timer
> fires, at most a single dfrag is retransmitted. If some corner-cases
> it
> may be necessary retransmit multiple dfrags, and the MPTCP socket
> will
> need to wait multiple retrans timeout to accomplish that.
>
> Remove the mentioned constraint, allowing to transmit multiple dfrags
> per
> retrans period, as long as the scheduler keeps selecting subflows for
> retransmissions and pending data is available in the rtx queue.
> The default scheduler will transmit a dfrag per available subflow.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v7 -> v8
> - fix corner-case retrans_seq update
>
> v4 -> v5:
> - fixed already_sent update
>
> v3 -> v4:
> - avoid quadratic behavior, fix retrans_seq update
> - fix rtx timer re-schedule miss
>
> v2 -> v3:
> - fix infinite loop issue (should address tls tests failures)
>
> v1 -> v2:
> - fix retrans sequence update (sashiko)
>
> Note:
> - sashiko see issues when dfrag = mptcp_rtx_head(sk) != NULL and
> dfrag->already_sent == 0. That condition should not possible: if
> mptcp_rtx_head() is not NULL there should be some data already
> sent.
> ---
> net/mptcp/protocol.c | 117 +++++++++++++++++++++++++++++++----------
> --
> 1 file changed, 85 insertions(+), 32 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 4ebe45e8a3d2..892fc44dffac 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1197,13 +1197,6 @@ static void __mptcp_clean_una_wakeup(struct
> sock *sk)
> mptcp_write_space(sk);
> }
>
> -static void mptcp_clean_una_wakeup(struct sock *sk)
> -{
> - mptcp_data_lock(sk);
> - __mptcp_clean_una_wakeup(sk);
> - mptcp_data_unlock(sk);
> -}
> -
> static void mptcp_enter_memory_pressure(struct sock *sk)
> {
> struct mptcp_subflow_context *subflow;
> @@ -2824,8 +2817,12 @@ static void mptcp_check_fastclose(struct
> mptcp_sock *msk)
> sk_error_report(sk);
> }
>
> -/* Retransmit the specified data fragment on all the selected
> subflows. */
> -static int __mptcp_push_retrans(struct sock *sk, struct
> mptcp_data_frag *dfrag)
> +/*
> + * Retransmit the specified data fragment on all the selected
> subflows,
> + * starting from the specified sequence
> + */
> +static int __mptcp_push_retrans(struct sock *sk, struct
> mptcp_data_frag *dfrag,
> + u64 sent_seq)
> {
> struct mptcp_sendmsg_info info = { .data_lock_held = true,
> };
> struct mptcp_sock *msk = mptcp_sk(sk);
> @@ -2835,6 +2832,7 @@ static int __mptcp_push_retrans(struct sock
> *sk, struct mptcp_data_frag *dfrag)
>
> mptcp_for_each_subflow(msk, subflow) {
> if (READ_ONCE(subflow->scheduled)) {
> + u16 offset = sent_seq - dfrag->data_seq;
> u16 copied = 0;
>
> mptcp_subflow_set_scheduled(subflow, false);
> @@ -2844,9 +2842,12 @@ static int __mptcp_push_retrans(struct sock
> *sk, struct mptcp_data_frag *dfrag)
> lock_sock(ssk);
>
> /* limit retransmission to the bytes already
> sent on some subflows */
> - info.sent = 0;
> + info.sent = offset;
> info.limit = READ_ONCE(msk->csum_enabled) ?
> dfrag->data_len :
>
> dfrag->already_sent;
> + DEBUG_NET_WARN_ON_ONCE(!before64(sent_seq,
> + dfrag-
> >data_seq +
> +
> info.limit));
When I tested NVMe MPTCP TLS based on this series, I encountered the
following error:
------------[ cut here ]------------
!before64(sent_seq, dfrag->data_seq + info.limit)
WARNING: net/mptcp/protocol.c:2879 at
__mptcp_push_retrans+0x562/0x6f0, CPU#5: kworker/5:2H/823
Modules linked in: mptcp_diag tcp_diag inet_diag sch_netem
CPU: 5 UID: 0 PID: 823 Comm: kworker/5:2H Not tainted 7.1.0-rc4+ #11
PREEMPT(full)
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: nvmet_tcp_wq nvmet_tcp_io_work
RIP: 0010:__mptcp_push_retrans+0x562/0x6f0
I printed the "info.limit" value at that time and found it to be 0:
if (!before64(sent_seq, dfrag->data_seq + info.limit))
pr_info("not ok sent_seq=%llu dfrag->data_seq=%llu info.limit=%u\n",
sent_seq, dfrag->data_seq, info.limit);
# Starting 4 threads
[ 358.532991][ T257] MPTCP: not ok sent_seq=4607903646123206498
dfrag->data_seq=4607903646123206498 info.limit=0
[ 360.946058][ T90] MPTCP: not ok sent_seq=12604043601802564894
dfrag->data_seq=12604043601802564894 info.limit=0
It seems we need to remove this DEBUG_NET_WARN_ON_ONCE().
Thanks,
-Geliang
>
> /*
> * make the whole retrans decision, xmit,
> disallow
> @@ -2890,45 +2891,97 @@ static void __mptcp_retrans(struct sock *sk)
> struct mptcp_sock *msk = mptcp_sk(sk);
> struct mptcp_subflow_context *subflow;
> struct mptcp_data_frag *dfrag;
> + bool retransmitted = false;
> + u64 retrans_seq;
> int err, len;
>
> - mptcp_clean_una_wakeup(sk);
> -
> - /* first check ssk: need to kick "stale" logic */
> - err = mptcp_sched_get_retrans(msk);
> + mptcp_data_lock(sk);
> + __mptcp_clean_una_wakeup(sk);
> + retrans_seq = msk->snd_una;
> dfrag = mptcp_rtx_head(sk);
> + mptcp_data_unlock(sk);
> + if (!dfrag)
> + goto check_data_fin;
> +
> + for (;;) {
> + bool already_retrans;
> + u64 sent_seq;
> +
> + /* The scheduler may clean the RTX queue. */
> + get_page(dfrag->page);
> +
> + /* The default scheduler will kick "stale" logic. */
> + err = mptcp_sched_get_retrans(msk);
> + if (err) {
> + put_page(dfrag->page);
> + break;
> + }
> +
> + /* Incoming acks can have moved retrans sequence
> after
> + * the current dfrag, if so try to start again from
> RTX head.
> + */
> + mptcp_data_lock(sk);
> + already_retrans = !dfrag->already_sent ||
> + !before64(msk->snd_una, dfrag-
> >data_seq +
> + dfrag->already_sent);
> + put_page(dfrag->page);
> + if (already_retrans) {
> + __mptcp_clean_una_wakeup(sk);
> + retrans_seq = msk->snd_una;
> + dfrag = mptcp_rtx_head(sk);
> + } else if (after64(msk->snd_una, retrans_seq)) {
> + retrans_seq = msk->snd_una;
> + }
> + mptcp_data_unlock(sk);
> + if (!dfrag)
> + break;
> +
> + len = __mptcp_push_retrans(sk, dfrag, retrans_seq);
> + if (len < 0)
> + goto clear_scheduled;
> +
> + retransmitted = true;
> + retrans_seq += len;
> + msk->bytes_retrans += len;
> + dfrag->already_sent = max_t(u16, dfrag-
> >already_sent,
> + retrans_seq - dfrag-
> >data_seq);
> +
> + /* With csum enabled retransmission can send new
> data. */
> + sent_seq = dfrag->already_sent + dfrag->data_seq;
> + if (after64(sent_seq, msk->snd_nxt))
> + msk->snd_nxt = sent_seq;
> +
> + /* Attempt the next fragment only if the current one
> is
> + * completely retransmitted.
> + */
> + if (before64(retrans_seq, dfrag->data_seq + dfrag-
> >data_len))
> + break;
> +
> + dfrag = list_is_last(&dfrag->list, &msk->rtx_queue)
> ?
> + NULL : list_next_entry(dfrag, list);
> + if (!dfrag || !dfrag->already_sent)
> + break;
> + }
> +
> + /* Data fin retransmission needed only if no data
> retransmission took
> + * place, and RTX queue is empty.
> + */
> +check_data_fin:
> if (!dfrag) {
> - if (mptcp_data_fin_enabled(msk)) {
> + if (!retransmitted && mptcp_data_fin_enabled(msk)) {
> struct inet_connection_sock *icsk =
> inet_csk(sk);
>
> WRITE_ONCE(icsk->icsk_retransmits,
> icsk->icsk_retransmits + 1);
> mptcp_set_datafin_timeout(sk);
> mptcp_send_ack(msk);
> -
> goto reset_timer;
> }
>
> if (!mptcp_send_head(sk))
> goto clear_scheduled;
> -
> - goto reset_timer;
> }
>
> - if (err)
> - goto reset_timer;
> -
> - len = __mptcp_push_retrans(sk, dfrag);
> - if (len < 0)
> - goto clear_scheduled;
> -
> - msk->bytes_retrans += len;
> - dfrag->already_sent = max(dfrag->already_sent, len);
> -
> - /* With csum enabled retransmission can send new data. */
> - if (after64(dfrag->already_sent + dfrag->data_seq, msk-
> >snd_nxt))
> - msk->snd_nxt = dfrag->already_sent + dfrag-
> >data_seq;
> -
> reset_timer:
> mptcp_check_and_set_pending(sk);
>
next prev parent reply other threads:[~2026-05-27 5:47 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 1/9] mptcp: fix missing wakeups in edge scenarios Paolo Abeni
2026-05-26 7:47 ` Matthieu Baerts
2026-05-22 21:43 ` [PATCH v8 mptcp-next 2/9] mptcp: fix retransmission loop when csum is enabled Paolo Abeni
2026-05-26 7:48 ` Matthieu Baerts
2026-05-26 15:10 ` Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 3/9] mptcp: close TOCTOU race while computing rcv_wnd Paolo Abeni
2026-05-26 6:10 ` Geliang Tang
2026-05-26 6:34 ` Paolo Abeni
2026-05-26 7:48 ` Matthieu Baerts
2026-05-22 21:43 ` [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink Paolo Abeni
2026-05-24 8:34 ` Paolo Abeni
2026-05-26 7:02 ` Matthieu Baerts
2026-05-26 7:49 ` Matthieu Baerts
2026-05-26 15:17 ` Paolo Abeni
2026-05-27 0:51 ` Matthieu Baerts
2026-05-26 7:48 ` Matthieu Baerts
2026-05-26 15:16 ` Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 5/9] mptcp: explicitly drop over memory limits Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 6/9] mptcp: enforce hard limit on backlog flushing Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning Paolo Abeni
2026-05-26 3:13 ` gang.yan
2026-05-26 6:50 ` Paolo Abeni
2026-05-27 5:30 ` gang.yan
2026-05-27 10:01 ` Paolo Abeni
2026-05-28 1:18 ` gang.yan
2026-05-22 21:43 ` [PATCH v8 mptcp-next 8/9] mptcp: move the retrans loop to a separate helper Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job Paolo Abeni
2026-05-27 5:46 ` Geliang Tang [this message]
2026-05-22 23:10 ` [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure MPTCP CI
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=52022434fe4b2996253b5fc2af7ea98aa987bcb5.camel@kernel.org \
--to=geliang@kernel.org \
--cc=gang.yan@linux.dev \
--cc=matttbe@kernel.org \
--cc=mptcp@lists.linux.dev \
--cc=pabeni@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox