Re: [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job.

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Geliang Tang <geliang@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>, mptcp@lists.linux.dev
Cc: Matthieu Baerts <matttbe@kernel.org>, gang.yan@linux.dev
Subject: Re: [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job.
Date: Wed, 27 May 2026 13:46:56 +0800	[thread overview]
Message-ID: <52022434fe4b2996253b5fc2af7ea98aa987bcb5.camel@kernel.org> (raw)
In-Reply-To: <a35934164aad8a08ed8108853cff7a282e5c25ef.1779485511.git.pabeni@redhat.com>

Hi Paolo,

On Fri, 2026-05-22 at 23:43 +0200, Paolo Abeni wrote:
> Currently the MPTCP core enforces that when MPTCP-level retrans timer
> fires, at most a single dfrag is retransmitted. If some corner-cases
> it
> may be necessary retransmit multiple dfrags, and the MPTCP socket
> will
> need to wait multiple retrans timeout to accomplish that.
> 
> Remove the mentioned constraint, allowing to transmit multiple dfrags
> per
> retrans period, as long as the scheduler keeps selecting subflows for
> retransmissions and pending data is available in the rtx queue.
> The default scheduler will transmit a dfrag per available subflow.
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v7 -> v8
>   - fix corner-case retrans_seq update
> 
> v4 -> v5:
>   - fixed already_sent update
> 
> v3 -> v4:
>   - avoid quadratic behavior, fix retrans_seq update
>   - fix rtx timer re-schedule miss
> 
> v2 -> v3:
>   - fix infinite loop issue (should address tls tests failures)
> 
> v1 -> v2:
>   - fix retrans sequence update (sashiko)
> 
> Note:
>  - sashiko see issues when dfrag = mptcp_rtx_head(sk) != NULL and
>    dfrag->already_sent == 0. That condition should not possible: if
>    mptcp_rtx_head() is not NULL there should be some data already
>    sent.
> ---
>  net/mptcp/protocol.c | 117 +++++++++++++++++++++++++++++++----------
> --
>  1 file changed, 85 insertions(+), 32 deletions(-)
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 4ebe45e8a3d2..892fc44dffac 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1197,13 +1197,6 @@ static void __mptcp_clean_una_wakeup(struct
> sock *sk)
>  	mptcp_write_space(sk);
>  }
>  
> -static void mptcp_clean_una_wakeup(struct sock *sk)
> -{
> -	mptcp_data_lock(sk);
> -	__mptcp_clean_una_wakeup(sk);
> -	mptcp_data_unlock(sk);
> -}
> -
>  static void mptcp_enter_memory_pressure(struct sock *sk)
>  {
>  	struct mptcp_subflow_context *subflow;
> @@ -2824,8 +2817,12 @@ static void mptcp_check_fastclose(struct
> mptcp_sock *msk)
>  	sk_error_report(sk);
>  }
>  
> -/* Retransmit the specified data fragment on all the selected
> subflows. */
> -static int __mptcp_push_retrans(struct sock *sk, struct
> mptcp_data_frag *dfrag)
> +/*
> + * Retransmit the specified data fragment on all the selected
> subflows,
> + * starting from the specified sequence
> + */
> +static int __mptcp_push_retrans(struct sock *sk, struct
> mptcp_data_frag *dfrag,
> +				u64 sent_seq)
>  {
>  	struct mptcp_sendmsg_info info = { .data_lock_held = true,
> };
>  	struct mptcp_sock *msk = mptcp_sk(sk);
> @@ -2835,6 +2832,7 @@ static int __mptcp_push_retrans(struct sock
> *sk, struct mptcp_data_frag *dfrag)
>  
>  	mptcp_for_each_subflow(msk, subflow) {
>  		if (READ_ONCE(subflow->scheduled)) {
> +			u16 offset = sent_seq - dfrag->data_seq;
>  			u16 copied = 0;
>  
>  			mptcp_subflow_set_scheduled(subflow, false);
> @@ -2844,9 +2842,12 @@ static int __mptcp_push_retrans(struct sock
> *sk, struct mptcp_data_frag *dfrag)
>  			lock_sock(ssk);
>  
>  			/* limit retransmission to the bytes already
> sent on some subflows */
> -			info.sent = 0;
> +			info.sent = offset;
>  			info.limit = READ_ONCE(msk->csum_enabled) ?
> dfrag->data_len :
>  								   
> dfrag->already_sent;
> +			DEBUG_NET_WARN_ON_ONCE(!before64(sent_seq,
> +							 dfrag-
> >data_seq +
> +							
> info.limit));

When I tested NVMe MPTCP TLS based on this series, I encountered the
following error:

 ------------[ cut here ]------------
 !before64(sent_seq, dfrag->data_seq + info.limit)
 WARNING: net/mptcp/protocol.c:2879 at
__mptcp_push_retrans+0x562/0x6f0, CPU#5: kworker/5:2H/823
 Modules linked in: mptcp_diag tcp_diag inet_diag sch_netem
 CPU: 5 UID: 0 PID: 823 Comm: kworker/5:2H Not tainted 7.1.0-rc4+ #11
PREEMPT(full)
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 Workqueue: nvmet_tcp_wq nvmet_tcp_io_work
 RIP: 0010:__mptcp_push_retrans+0x562/0x6f0

I printed the "info.limit" value at that time and found it to be 0:

if (!before64(sent_seq, dfrag->data_seq + info.limit))
  pr_info("not ok sent_seq=%llu dfrag->data_seq=%llu info.limit=%u\n",
          sent_seq, dfrag->data_seq, info.limit);

# Starting 4 threads
[  358.532991][  T257] MPTCP: not ok sent_seq=4607903646123206498
dfrag->data_seq=4607903646123206498 info.limit=0
[  360.946058][   T90] MPTCP: not ok sent_seq=12604043601802564894
dfrag->data_seq=12604043601802564894 info.limit=0

It seems we need to remove this DEBUG_NET_WARN_ON_ONCE().

Thanks,
-Geliang

>  
>  			/*
>  			 * make the whole retrans decision, xmit,
> disallow
> @@ -2890,45 +2891,97 @@ static void __mptcp_retrans(struct sock *sk)
>  	struct mptcp_sock *msk = mptcp_sk(sk);
>  	struct mptcp_subflow_context *subflow;
>  	struct mptcp_data_frag *dfrag;
> +	bool retransmitted = false;
> +	u64 retrans_seq;
>  	int err, len;
>  
> -	mptcp_clean_una_wakeup(sk);
> -
> -	/* first check ssk: need to kick "stale" logic */
> -	err = mptcp_sched_get_retrans(msk);
> +	mptcp_data_lock(sk);
> +	__mptcp_clean_una_wakeup(sk);
> +	retrans_seq = msk->snd_una;
>  	dfrag = mptcp_rtx_head(sk);
> +	mptcp_data_unlock(sk);
> +	if (!dfrag)
> +		goto check_data_fin;
> +
> +	for (;;) {
> +		bool already_retrans;
> +		u64 sent_seq;
> +
> +		/* The scheduler may clean the RTX queue. */
> +		get_page(dfrag->page);
> +
> +		/* The default scheduler will kick "stale" logic. */
> +		err = mptcp_sched_get_retrans(msk);
> +		if (err) {
> +			put_page(dfrag->page);
> +			break;
> +		}
> +
> +		/* Incoming acks can have moved retrans sequence
> after
> +		 * the current dfrag, if so try to start again from
> RTX head.
> +		 */
> +		mptcp_data_lock(sk);
> +		already_retrans = !dfrag->already_sent ||
> +				  !before64(msk->snd_una, dfrag-
> >data_seq +
> +					    dfrag->already_sent);
> +		put_page(dfrag->page);
> +		if (already_retrans) {
> +			__mptcp_clean_una_wakeup(sk);
> +			retrans_seq = msk->snd_una;
> +			dfrag = mptcp_rtx_head(sk);
> +		} else if (after64(msk->snd_una, retrans_seq)) {
> +			retrans_seq = msk->snd_una;
> +		}
> +		mptcp_data_unlock(sk);
> +		if (!dfrag)
> +			break;
> +
> +		len = __mptcp_push_retrans(sk, dfrag, retrans_seq);
> +		if (len < 0)
> +			goto clear_scheduled;
> +
> +		retransmitted = true;
> +		retrans_seq += len;
> +		msk->bytes_retrans += len;
> +		dfrag->already_sent = max_t(u16, dfrag-
> >already_sent,
> +					    retrans_seq - dfrag-
> >data_seq);
> +
> +		/* With csum enabled retransmission can send new
> data. */
> +		sent_seq = dfrag->already_sent + dfrag->data_seq;
> +		if (after64(sent_seq, msk->snd_nxt))
> +			msk->snd_nxt = sent_seq;
> +
> +		/* Attempt the next fragment only if the current one
> is
> +		 * completely retransmitted.
> +		 */
> +		if (before64(retrans_seq, dfrag->data_seq + dfrag-
> >data_len))
> +			break;
> +
> +		dfrag = list_is_last(&dfrag->list, &msk->rtx_queue)
> ?
> +				NULL : list_next_entry(dfrag, list);
> +		if (!dfrag || !dfrag->already_sent)
> +			break;
> +	}
> +
> +	/* Data fin retransmission needed only if no data
> retransmission took
> +	 * place, and RTX queue is empty.
> +	 */
> +check_data_fin:
>  	if (!dfrag) {
> -		if (mptcp_data_fin_enabled(msk)) {
> +		if (!retransmitted && mptcp_data_fin_enabled(msk)) {
>  			struct inet_connection_sock *icsk =
> inet_csk(sk);
>  
>  			WRITE_ONCE(icsk->icsk_retransmits,
>  				   icsk->icsk_retransmits + 1);
>  			mptcp_set_datafin_timeout(sk);
>  			mptcp_send_ack(msk);
> -
>  			goto reset_timer;
>  		}
>  
>  		if (!mptcp_send_head(sk))
>  			goto clear_scheduled;
> -
> -		goto reset_timer;
>  	}
>  
> -	if (err)
> -		goto reset_timer;
> -
> -	len = __mptcp_push_retrans(sk, dfrag);
> -	if (len < 0)
> -		goto clear_scheduled;
> -
> -	msk->bytes_retrans += len;
> -	dfrag->already_sent = max(dfrag->already_sent, len);
> -
> -	/* With csum enabled retransmission can send new data. */
> -	if (after64(dfrag->already_sent + dfrag->data_seq, msk-
> >snd_nxt))
> -		msk->snd_nxt = dfrag->already_sent + dfrag-
> >data_seq;
> -
>  reset_timer:
>  	mptcp_check_and_set_pending(sk);
>

next prev parent reply	other threads:[~2026-05-27  5:47 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 1/9] mptcp: fix missing wakeups in edge scenarios Paolo Abeni
2026-05-26  7:47   ` Matthieu Baerts
2026-05-22 21:43 ` [PATCH v8 mptcp-next 2/9] mptcp: fix retransmission loop when csum is enabled Paolo Abeni
2026-05-26  7:48   ` Matthieu Baerts
2026-05-26 15:10     ` Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 3/9] mptcp: close TOCTOU race while computing rcv_wnd Paolo Abeni
2026-05-26  6:10   ` Geliang Tang
2026-05-26  6:34     ` Paolo Abeni
2026-05-26  7:48       ` Matthieu Baerts
2026-05-22 21:43 ` [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink Paolo Abeni
2026-05-24  8:34   ` Paolo Abeni
2026-05-26  7:02     ` Matthieu Baerts
2026-05-26  7:49       ` Matthieu Baerts
2026-05-26 15:17         ` Paolo Abeni
2026-05-27  0:51           ` Matthieu Baerts
2026-05-26  7:48   ` Matthieu Baerts
2026-05-26 15:16     ` Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 5/9] mptcp: explicitly drop over memory limits Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 6/9] mptcp: enforce hard limit on backlog flushing Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning Paolo Abeni
2026-05-26  3:13   ` gang.yan
2026-05-26  6:50     ` Paolo Abeni
2026-05-27  5:30       ` gang.yan
2026-05-27 10:01         ` Paolo Abeni
2026-05-28  1:18           ` gang.yan
2026-05-22 21:43 ` [PATCH v8 mptcp-next 8/9] mptcp: move the retrans loop to a separate helper Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job Paolo Abeni
2026-05-27  5:46   ` Geliang Tang [this message]
2026-05-22 23:10 ` [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure MPTCP CI

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52022434fe4b2996253b5fc2af7ea98aa987bcb5.camel@kernel.org \
    --to=geliang@kernel.org \
    --cc=gang.yan@linux.dev \
    --cc=matttbe@kernel.org \
    --cc=mptcp@lists.linux.dev \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.