Re: [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job.

MPTCP Linux Development
 help / color / mirror / Atom feed

From: Geliang Tang <geliang@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>, mptcp@lists.linux.dev
Cc: Matthieu Baerts <matttbe@kernel.org>, gang.yan@linux.dev
Subject: Re: [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job.
Date: Wed, 27 May 2026 13:46:56 +0800	[thread overview]
Message-ID: <52022434fe4b2996253b5fc2af7ea98aa987bcb5.camel@kernel.org> (raw)
In-Reply-To: <a35934164aad8a08ed8108853cff7a282e5c25ef.1779485511.git.pabeni@redhat.com>

Hi Paolo,

On Fri, 2026-05-22 at 23:43 +0200, Paolo Abeni wrote:
> Currently the MPTCP core enforces that when MPTCP-level retrans timer
> fires, at most a single dfrag is retransmitted. If some corner-cases
> it
> may be necessary retransmit multiple dfrags, and the MPTCP socket
> will
> need to wait multiple retrans timeout to accomplish that.
> 
> Remove the mentioned constraint, allowing to transmit multiple dfrags
> per
> retrans period, as long as the scheduler keeps selecting subflows for
> retransmissions and pending data is available in the rtx queue.
> The default scheduler will transmit a dfrag per available subflow.
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v7 -> v8
>   - fix corner-case retrans_seq update
> 
> v4 -> v5:
>   - fixed already_sent update
> 
> v3 -> v4:
>   - avoid quadratic behavior, fix retrans_seq update
>   - fix rtx timer re-schedule miss
> 
> v2 -> v3:
>   - fix infinite loop issue (should address tls tests failures)
> 
> v1 -> v2:
>   - fix retrans sequence update (sashiko)
> 
> Note:
>  - sashiko see issues when dfrag = mptcp_rtx_head(sk) != NULL and
>    dfrag->already_sent == 0. That condition should not possible: if
>    mptcp_rtx_head() is not NULL there should be some data already
>    sent.
> ---
>  net/mptcp/protocol.c | 117 +++++++++++++++++++++++++++++++----------
> --
>  1 file changed, 85 insertions(+), 32 deletions(-)
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 4ebe45e8a3d2..892fc44dffac 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1197,13 +1197,6 @@ static void __mptcp_clean_una_wakeup(struct
> sock *sk)
>  	mptcp_write_space(sk);
>  }
>  
> -static void mptcp_clean_una_wakeup(struct sock *sk)
> -{
> -	mptcp_data_lock(sk);
> -	__mptcp_clean_una_wakeup(sk);
> -	mptcp_data_unlock(sk);
> -}
> -
>  static void mptcp_enter_memory_pressure(struct sock *sk)
>  {
>  	struct mptcp_subflow_context *subflow;
> @@ -2824,8 +2817,12 @@ static void mptcp_check_fastclose(struct
> mptcp_sock *msk)
>  	sk_error_report(sk);
>  }
>  
> -/* Retransmit the specified data fragment on all the selected
> subflows. */
> -static int __mptcp_push_retrans(struct sock *sk, struct
> mptcp_data_frag *dfrag)
> +/*
> + * Retransmit the specified data fragment on all the selected
> subflows,
> + * starting from the specified sequence
> + */
> +static int __mptcp_push_retrans(struct sock *sk, struct
> mptcp_data_frag *dfrag,
> +				u64 sent_seq)
>  {
>  	struct mptcp_sendmsg_info info = { .data_lock_held = true,
> };
>  	struct mptcp_sock *msk = mptcp_sk(sk);
> @@ -2835,6 +2832,7 @@ static int __mptcp_push_retrans(struct sock
> *sk, struct mptcp_data_frag *dfrag)
>  
>  	mptcp_for_each_subflow(msk, subflow) {
>  		if (READ_ONCE(subflow->scheduled)) {
> +			u16 offset = sent_seq - dfrag->data_seq;
>  			u16 copied = 0;
>  
>  			mptcp_subflow_set_scheduled(subflow, false);
> @@ -2844,9 +2842,12 @@ static int __mptcp_push_retrans(struct sock
> *sk, struct mptcp_data_frag *dfrag)
>  			lock_sock(ssk);
>  
>  			/* limit retransmission to the bytes already
> sent on some subflows */
> -			info.sent = 0;
> +			info.sent = offset;
>  			info.limit = READ_ONCE(msk->csum_enabled) ?
> dfrag->data_len :
>  								   
> dfrag->already_sent;
> +			DEBUG_NET_WARN_ON_ONCE(!before64(sent_seq,
> +							 dfrag-
> >data_seq +
> +							
> info.limit));

When I tested NVMe MPTCP TLS based on this series, I encountered the
following error:

 ------------[ cut here ]------------
 !before64(sent_seq, dfrag->data_seq + info.limit)
 WARNING: net/mptcp/protocol.c:2879 at
__mptcp_push_retrans+0x562/0x6f0, CPU#5: kworker/5:2H/823
 Modules linked in: mptcp_diag tcp_diag inet_diag sch_netem
 CPU: 5 UID: 0 PID: 823 Comm: kworker/5:2H Not tainted 7.1.0-rc4+ #11
PREEMPT(full)
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 Workqueue: nvmet_tcp_wq nvmet_tcp_io_work
 RIP: 0010:__mptcp_push_retrans+0x562/0x6f0

I printed the "info.limit" value at that time and found it to be 0:

if (!before64(sent_seq, dfrag->data_seq + info.limit))
  pr_info("not ok sent_seq=%llu dfrag->data_seq=%llu info.limit=%u\n",
          sent_seq, dfrag->data_seq, info.limit);

# Starting 4 threads
[  358.532991][  T257] MPTCP: not ok sent_seq=4607903646123206498
dfrag->data_seq=4607903646123206498 info.limit=0
[  360.946058][   T90] MPTCP: not ok sent_seq=12604043601802564894
dfrag->data_seq=12604043601802564894 info.limit=0

It seems we need to remove this DEBUG_NET_WARN_ON_ONCE().

Thanks,
-Geliang

>  
>  			/*
>  			 * make the whole retrans decision, xmit,
> disallow
> @@ -2890,45 +2891,97 @@ static void __mptcp_retrans(struct sock *sk)
>  	struct mptcp_sock *msk = mptcp_sk(sk);
>  	struct mptcp_subflow_context *subflow;
>  	struct mptcp_data_frag *dfrag;
> +	bool retransmitted = false;
> +	u64 retrans_seq;
>  	int err, len;
>  
> -	mptcp_clean_una_wakeup(sk);
> -
> -	/* first check ssk: need to kick "stale" logic */
> -	err = mptcp_sched_get_retrans(msk);
> +	mptcp_data_lock(sk);
> +	__mptcp_clean_una_wakeup(sk);
> +	retrans_seq = msk->snd_una;
>  	dfrag = mptcp_rtx_head(sk);
> +	mptcp_data_unlock(sk);
> +	if (!dfrag)
> +		goto check_data_fin;
> +
> +	for (;;) {
> +		bool already_retrans;
> +		u64 sent_seq;
> +
> +		/* The scheduler may clean the RTX queue. */
> +		get_page(dfrag->page);
> +
> +		/* The default scheduler will kick "stale" logic. */
> +		err = mptcp_sched_get_retrans(msk);
> +		if (err) {
> +			put_page(dfrag->page);
> +			break;
> +		}
> +
> +		/* Incoming acks can have moved retrans sequence
> after
> +		 * the current dfrag, if so try to start again from
> RTX head.
> +		 */
> +		mptcp_data_lock(sk);
> +		already_retrans = !dfrag->already_sent ||
> +				  !before64(msk->snd_una, dfrag-
> >data_seq +
> +					    dfrag->already_sent);
> +		put_page(dfrag->page);
> +		if (already_retrans) {
> +			__mptcp_clean_una_wakeup(sk);
> +			retrans_seq = msk->snd_una;
> +			dfrag = mptcp_rtx_head(sk);
> +		} else if (after64(msk->snd_una, retrans_seq)) {
> +			retrans_seq = msk->snd_una;
> +		}
> +		mptcp_data_unlock(sk);
> +		if (!dfrag)
> +			break;
> +
> +		len = __mptcp_push_retrans(sk, dfrag, retrans_seq);
> +		if (len < 0)
> +			goto clear_scheduled;
> +
> +		retransmitted = true;
> +		retrans_seq += len;
> +		msk->bytes_retrans += len;
> +		dfrag->already_sent = max_t(u16, dfrag-
> >already_sent,
> +					    retrans_seq - dfrag-
> >data_seq);
> +
> +		/* With csum enabled retransmission can send new
> data. */
> +		sent_seq = dfrag->already_sent + dfrag->data_seq;
> +		if (after64(sent_seq, msk->snd_nxt))
> +			msk->snd_nxt = sent_seq;
> +
> +		/* Attempt the next fragment only if the current one
> is
> +		 * completely retransmitted.
> +		 */
> +		if (before64(retrans_seq, dfrag->data_seq + dfrag-
> >data_len))
> +			break;
> +
> +		dfrag = list_is_last(&dfrag->list, &msk->rtx_queue)
> ?
> +				NULL : list_next_entry(dfrag, list);
> +		if (!dfrag || !dfrag->already_sent)
> +			break;
> +	}
> +
> +	/* Data fin retransmission needed only if no data
> retransmission took
> +	 * place, and RTX queue is empty.
> +	 */
> +check_data_fin:
>  	if (!dfrag) {
> -		if (mptcp_data_fin_enabled(msk)) {
> +		if (!retransmitted && mptcp_data_fin_enabled(msk)) {
>  			struct inet_connection_sock *icsk =
> inet_csk(sk);
>  
>  			WRITE_ONCE(icsk->icsk_retransmits,
>  				   icsk->icsk_retransmits + 1);
>  			mptcp_set_datafin_timeout(sk);
>  			mptcp_send_ack(msk);
> -
>  			goto reset_timer;
>  		}
>  
>  		if (!mptcp_send_head(sk))
>  			goto clear_scheduled;
> -
> -		goto reset_timer;
>  	}
>  
> -	if (err)
> -		goto reset_timer;
> -
> -	len = __mptcp_push_retrans(sk, dfrag);
> -	if (len < 0)
> -		goto clear_scheduled;
> -
> -	msk->bytes_retrans += len;
> -	dfrag->already_sent = max(dfrag->already_sent, len);
> -
> -	/* With csum enabled retransmission can send new data. */
> -	if (after64(dfrag->already_sent + dfrag->data_seq, msk-
> >snd_nxt))
> -		msk->snd_nxt = dfrag->already_sent + dfrag-
> >data_seq;
> -
>  reset_timer:
>  	mptcp_check_and_set_pending(sk);
>

next prev parent reply	other threads:[~2026-05-27  5:47 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 1/9] mptcp: fix missing wakeups in edge scenarios Paolo Abeni
2026-05-26  7:47   ` Matthieu Baerts
2026-05-22 21:43 ` [PATCH v8 mptcp-next 2/9] mptcp: fix retransmission loop when csum is enabled Paolo Abeni
2026-05-26  7:48   ` Matthieu Baerts
2026-05-26 15:10     ` Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 3/9] mptcp: close TOCTOU race while computing rcv_wnd Paolo Abeni
2026-05-26  6:10   ` Geliang Tang
2026-05-26  6:34     ` Paolo Abeni
2026-05-26  7:48       ` Matthieu Baerts
2026-05-22 21:43 ` [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink Paolo Abeni
2026-05-24  8:34   ` Paolo Abeni
2026-05-26  7:02     ` Matthieu Baerts
2026-05-26  7:49       ` Matthieu Baerts
2026-05-26 15:17         ` Paolo Abeni
2026-05-27  0:51           ` Matthieu Baerts
2026-05-26  7:48   ` Matthieu Baerts
2026-05-26 15:16     ` Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 5/9] mptcp: explicitly drop over memory limits Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 6/9] mptcp: enforce hard limit on backlog flushing Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning Paolo Abeni
2026-05-26  3:13   ` gang.yan
2026-05-26  6:50     ` Paolo Abeni
2026-05-27  5:30       ` gang.yan
2026-05-27 10:01         ` Paolo Abeni
2026-05-28  1:18           ` gang.yan
2026-05-22 21:43 ` [PATCH v8 mptcp-next 8/9] mptcp: move the retrans loop to a separate helper Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job Paolo Abeni
2026-05-27  5:46   ` Geliang Tang [this message]
2026-05-22 23:10 ` [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure MPTCP CI

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52022434fe4b2996253b5fc2af7ea98aa987bcb5.camel@kernel.org \
    --to=geliang@kernel.org \
    --cc=gang.yan@linux.dev \
    --cc=matttbe@kernel.org \
    --cc=mptcp@lists.linux.dev \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox