* [PATCH v8 mptcp-next 1/9] mptcp: fix missing wakeups in edge scenarios
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
@ 2026-05-22 21:43 ` Paolo Abeni
2026-05-26 7:47 ` Matthieu Baerts
2026-05-22 21:43 ` [PATCH v8 mptcp-next 2/9] mptcp: fix retransmission loop when csum is enabled Paolo Abeni
` (8 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-05-22 21:43 UTC (permalink / raw)
To: mptcp; +Cc: Matthieu Baerts, Geliang Tang, gang.yan
The mptcp_recvmsg() can fill MPTCP socket receive queue via
mptcp_move_skbs(), but currently does not try to wakeup any listener,
because the same process is going to check the receive queue soon.
When multiple threads are reading from the same fd, the above can
cause stall. Add the missing wakeup.
Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v6 -> v7
- use mptcp_epollin_ready()
Notes:
- sashiko may raise concerns about thundering herd problems. If
application uses multiple threads to read on the same socket, that
could/will happen. Application should not use multiple threads
to read from the same socket if they want good performances.
---
net/mptcp/protocol.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index ce8372fb3c6a..f14572fb1975 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2276,6 +2276,8 @@ static bool mptcp_move_skbs(struct sock *sk)
mptcp_backlog_spooled(sk, moved, &skbs);
}
mptcp_data_unlock(sk);
+ if (enqueued && mptcp_epollin_ready(sk))
+ sk->sk_data_ready(sk);
return enqueued;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 1/9] mptcp: fix missing wakeups in edge scenarios
2026-05-22 21:43 ` [PATCH v8 mptcp-next 1/9] mptcp: fix missing wakeups in edge scenarios Paolo Abeni
@ 2026-05-26 7:47 ` Matthieu Baerts
0 siblings, 0 replies; 30+ messages in thread
From: Matthieu Baerts @ 2026-05-26 7:47 UTC (permalink / raw)
To: Paolo Abeni, mptcp; +Cc: Geliang Tang, gang.yan
Hi Paolo,
Thank you for this fix!
On 23/05/2026 07:43, Paolo Abeni wrote:
> The mptcp_recvmsg() can fill MPTCP socket receive queue via
> mptcp_move_skbs(), but currently does not try to wakeup any listener,
> because the same process is going to check the receive queue soon.
>
> When multiple threads are reading from the same fd, the above can
> cause stall. Add the missing wakeup.
>
> Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v6 -> v7
> - use mptcp_epollin_ready()
>
> Notes:
> - sashiko may raise concerns about thundering herd problems. If
> application uses multiple threads to read on the same socket, that
> could/will happen. Application should not use multiple threads
> to read from the same socket if they want good performances.
> ---
> net/mptcp/protocol.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index ce8372fb3c6a..f14572fb1975 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -2276,6 +2276,8 @@ static bool mptcp_move_skbs(struct sock *sk)
> mptcp_backlog_spooled(sk, moved, &skbs);
> }
> mptcp_data_unlock(sk);
> + if (enqueued && mptcp_epollin_ready(sk))
> + sk->sk_data_ready(sk);
Nit: I hope that's OK if I add new lines above and below this new block
when applying the patch.
Apart from that:
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> return enqueued;
> }
>
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v8 mptcp-next 2/9] mptcp: fix retransmission loop when csum is enabled
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 1/9] mptcp: fix missing wakeups in edge scenarios Paolo Abeni
@ 2026-05-22 21:43 ` Paolo Abeni
2026-05-26 7:48 ` Matthieu Baerts
2026-05-22 21:43 ` [PATCH v8 mptcp-next 3/9] mptcp: close TOCTOU race while computing rcv_wnd Paolo Abeni
` (7 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-05-22 21:43 UTC (permalink / raw)
To: mptcp; +Cc: Matthieu Baerts, Geliang Tang, gang.yan
Sashiko noted that retransmission with csum enabled can actually
transmit new data, but currently the relevant code does not update
accordingly snd_nxt.
The may cause incoming ack drop and an endless retransmission loop.
Address the issue incrementing snd_nxt as needed.
Fixes: 4e14867d5e91 ("mptcp: tune re-injections for csum enabled mode")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/protocol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index f14572fb1975..4f2b2868031a 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2867,6 +2867,10 @@ static void __mptcp_retrans(struct sock *sk)
msk->bytes_retrans += len;
dfrag->already_sent = max(dfrag->already_sent, len);
+ /* With csum enabled retransmission can send new data. */
+ if (after64(dfrag->already_sent + dfrag->data_seq, msk->snd_nxt))
+ msk->snd_nxt = dfrag->already_sent + dfrag->data_seq;
+
reset_timer:
mptcp_check_and_set_pending(sk);
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 2/9] mptcp: fix retransmission loop when csum is enabled
2026-05-22 21:43 ` [PATCH v8 mptcp-next 2/9] mptcp: fix retransmission loop when csum is enabled Paolo Abeni
@ 2026-05-26 7:48 ` Matthieu Baerts
2026-05-26 15:10 ` Paolo Abeni
0 siblings, 1 reply; 30+ messages in thread
From: Matthieu Baerts @ 2026-05-26 7:48 UTC (permalink / raw)
To: Paolo Abeni, mptcp; +Cc: Geliang Tang, gang.yan
Hi Paolo,
Thank you for this fix!
On 23/05/2026 07:43, Paolo Abeni wrote:
> Sashiko noted that retransmission with csum enabled can actually
> transmit new data, but currently the relevant code does not update
> accordingly snd_nxt.
>
> The may cause incoming ack drop and an endless retransmission loop.
>
> Address the issue incrementing snd_nxt as needed.
>
> Fixes: 4e14867d5e91 ("mptcp: tune re-injections for csum enabled mode")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/mptcp/protocol.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index f14572fb1975..4f2b2868031a 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -2867,6 +2867,10 @@ static void __mptcp_retrans(struct sock *sk)
> msk->bytes_retrans += len;
> dfrag->already_sent = max(dfrag->already_sent, len);
>
> + /* With csum enabled retransmission can send new data. */
> + if (after64(dfrag->already_sent + dfrag->data_seq, msk->snd_nxt))
> + msk->snd_nxt = dfrag->already_sent + dfrag->data_seq;
Should we use WRITE_ONCE() here, similar to other places where it is
updated?
If yes, I can do the modification when applying the patch.
Apart from that:
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> +
> reset_timer:
> mptcp_check_and_set_pending(sk);
>
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 2/9] mptcp: fix retransmission loop when csum is enabled
2026-05-26 7:48 ` Matthieu Baerts
@ 2026-05-26 15:10 ` Paolo Abeni
0 siblings, 0 replies; 30+ messages in thread
From: Paolo Abeni @ 2026-05-26 15:10 UTC (permalink / raw)
To: Matthieu Baerts, mptcp; +Cc: Geliang Tang, gang.yan
On 5/26/26 9:48 AM, Matthieu Baerts wrote:
> On 23/05/2026 07:43, Paolo Abeni wrote:
>> Sashiko noted that retransmission with csum enabled can actually
>> transmit new data, but currently the relevant code does not update
>> accordingly snd_nxt.
>>
>> The may cause incoming ack drop and an endless retransmission loop.
>>
>> Address the issue incrementing snd_nxt as needed.
>>
>> Fixes: 4e14867d5e91 ("mptcp: tune re-injections for csum enabled mode")
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> net/mptcp/protocol.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>> index f14572fb1975..4f2b2868031a 100644
>> --- a/net/mptcp/protocol.c
>> +++ b/net/mptcp/protocol.c
>> @@ -2867,6 +2867,10 @@ static void __mptcp_retrans(struct sock *sk)
>> msk->bytes_retrans += len;
>> dfrag->already_sent = max(dfrag->already_sent, len);
>>
>> + /* With csum enabled retransmission can send new data. */
>> + if (after64(dfrag->already_sent + dfrag->data_seq, msk->snd_nxt))
>> + msk->snd_nxt = dfrag->already_sent + dfrag->data_seq;
>
> Should we use WRITE_ONCE() here, similar to other places where it is
> updated?
O wow, I (and sashiko too?!?) missed that. Yes, it's needed.
> If yes, I can do the modification when applying the patch.
Would be great, thanks
/P
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v8 mptcp-next 3/9] mptcp: close TOCTOU race while computing rcv_wnd
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 1/9] mptcp: fix missing wakeups in edge scenarios Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 2/9] mptcp: fix retransmission loop when csum is enabled Paolo Abeni
@ 2026-05-22 21:43 ` Paolo Abeni
2026-05-26 6:10 ` Geliang Tang
2026-05-22 21:43 ` [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink Paolo Abeni
` (6 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-05-22 21:43 UTC (permalink / raw)
To: mptcp; +Cc: Matthieu Baerts, Geliang Tang, gang.yan
The MPTCP output path access locklessly the MPTCP-level ack_seq
in multiple times, using possibly different values for the data_ack
in the DSS option and to compute the announced rcv wnd for the same
packet.
Refactor the cote to avoid inconsistencies which may confuse the
peer. Also ensure that the MPTCP level rcv wnd is updated only when
the egress packet actually contains a DSS ack.
Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/options.c | 36 ++++++++++++++++++------------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 4cc583fdc7a9..4d72f286a485 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -570,7 +570,6 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
struct mptcp_ext *mpext;
unsigned int ack_size;
bool ret = false;
- u64 ack_seq;
opts->csum_reqd = READ_ONCE(msk->csum_enabled);
mpext = skb ? mptcp_get_ext(skb) : NULL;
@@ -601,14 +600,11 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
return ret;
}
- ack_seq = READ_ONCE(msk->ack_seq);
if (READ_ONCE(msk->use_64bit_ack)) {
ack_size = TCPOLEN_MPTCP_DSS_ACK64;
- opts->ext_copy.data_ack = ack_seq;
opts->ext_copy.ack64 = 1;
} else {
ack_size = TCPOLEN_MPTCP_DSS_ACK32;
- opts->ext_copy.data_ack32 = (uint32_t)ack_seq;
opts->ext_copy.ack64 = 0;
}
opts->ext_copy.use_ack = 1;
@@ -1298,19 +1294,14 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
return true;
}
-static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
+static u64 mptcp_set_rwin(struct mptcp_sock *msk, struct tcp_sock *tp,
+ struct tcphdr *th, u64 ack_seq)
{
const struct sock *ssk = (const struct sock *)tp;
- struct mptcp_subflow_context *subflow;
- u64 ack_seq, rcv_wnd_old, rcv_wnd_new;
- struct mptcp_sock *msk;
+ u64 rcv_wnd_old, rcv_wnd_new;
u32 new_win;
u64 win;
- subflow = mptcp_subflow_ctx(ssk);
- msk = mptcp_sk(subflow->conn);
-
- ack_seq = READ_ONCE(msk->ack_seq);
rcv_wnd_new = ack_seq + tp->rcv_wnd;
rcv_wnd_old = atomic64_read(&msk->rcv_wnd_sent);
@@ -1363,7 +1354,7 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
update_wspace:
WRITE_ONCE(msk->old_wspace, tp->rcv_wnd);
- subflow->rcv_wnd_sent = rcv_wnd_new;
+ return rcv_wnd_new;
}
static void mptcp_track_rwin(struct tcp_sock *tp)
@@ -1475,13 +1466,25 @@ void mptcp_write_options(struct tcphdr *th, __be32 *ptr, struct tcp_sock *tp,
*ptr++ = mptcp_option(MPTCPOPT_DSS, len, 0, flags);
if (mpext->use_ack) {
+ struct mptcp_sock *msk;
+ u64 ack_seq;
+
+ /* DSS option is set only by mptcp_established_option,
+ * the caller is __tcp_transmit_skb() and ssk is always
+ * not NULL.
+ */
+ subflow = mptcp_subflow_ctx(ssk);
+ msk = mptcp_sk(subflow->conn);
+ ack_seq = READ_ONCE(msk->ack_seq);
if (mpext->ack64) {
- put_unaligned_be64(mpext->data_ack, ptr);
+ put_unaligned_be64(ack_seq, ptr);
ptr += 2;
} else {
- put_unaligned_be32(mpext->data_ack32, ptr);
+ put_unaligned_be32(ack_seq, ptr);
ptr += 1;
}
+ subflow->rcv_wnd_sent = mptcp_set_rwin(msk, tp, th,
+ ack_seq);
}
if (mpext->use_map) {
@@ -1709,9 +1712,6 @@ void mptcp_write_options(struct tcphdr *th, __be32 *ptr, struct tcp_sock *tp,
i += 4;
}
}
-
- if (tp)
- mptcp_set_rwin(tp, th);
}
__be32 mptcp_get_reset_option(const struct sk_buff *skb)
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 3/9] mptcp: close TOCTOU race while computing rcv_wnd
2026-05-22 21:43 ` [PATCH v8 mptcp-next 3/9] mptcp: close TOCTOU race while computing rcv_wnd Paolo Abeni
@ 2026-05-26 6:10 ` Geliang Tang
2026-05-26 6:34 ` Paolo Abeni
0 siblings, 1 reply; 30+ messages in thread
From: Geliang Tang @ 2026-05-26 6:10 UTC (permalink / raw)
To: Paolo Abeni, mptcp; +Cc: Matthieu Baerts, gang.yan
Hi Paolo,
On Fri, 2026-05-22 at 23:43 +0200, Paolo Abeni wrote:
> The MPTCP output path access locklessly the MPTCP-level ack_seq
> in multiple times, using possibly different values for the data_ack
> in the DSS option and to compute the announced rcv wnd for the same
> packet.
>
> Refactor the cote to avoid inconsistencies which may confuse the
> peer. Also ensure that the MPTCP level rcv wnd is updated only when
> the egress packet actually contains a DSS ack.
>
> Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/mptcp/options.c | 36 ++++++++++++++++++------------------
> 1 file changed, 18 insertions(+), 18 deletions(-)
>
> diff --git a/net/mptcp/options.c b/net/mptcp/options.c
> index 4cc583fdc7a9..4d72f286a485 100644
> --- a/net/mptcp/options.c
> +++ b/net/mptcp/options.c
> @@ -570,7 +570,6 @@ static bool mptcp_established_options_dss(struct
> sock *sk, struct sk_buff *skb,
> struct mptcp_ext *mpext;
> unsigned int ack_size;
> bool ret = false;
> - u64 ack_seq;
>
> opts->csum_reqd = READ_ONCE(msk->csum_enabled);
> mpext = skb ? mptcp_get_ext(skb) : NULL;
> @@ -601,14 +600,11 @@ static bool
> mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
> return ret;
> }
>
> - ack_seq = READ_ONCE(msk->ack_seq);
> if (READ_ONCE(msk->use_64bit_ack)) {
> ack_size = TCPOLEN_MPTCP_DSS_ACK64;
> - opts->ext_copy.data_ack = ack_seq;
> opts->ext_copy.ack64 = 1;
> } else {
> ack_size = TCPOLEN_MPTCP_DSS_ACK32;
> - opts->ext_copy.data_ack32 = (uint32_t)ack_seq;
> opts->ext_copy.ack64 = 0;
> }
> opts->ext_copy.use_ack = 1;
> @@ -1298,19 +1294,14 @@ bool mptcp_incoming_options(struct sock *sk,
> struct sk_buff *skb)
> return true;
> }
>
> -static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
> +static u64 mptcp_set_rwin(struct mptcp_sock *msk, struct tcp_sock
> *tp,
> + struct tcphdr *th, u64 ack_seq)
> {
> const struct sock *ssk = (const struct sock *)tp;
> - struct mptcp_subflow_context *subflow;
> - u64 ack_seq, rcv_wnd_old, rcv_wnd_new;
> - struct mptcp_sock *msk;
> + u64 rcv_wnd_old, rcv_wnd_new;
> u32 new_win;
> u64 win;
>
> - subflow = mptcp_subflow_ctx(ssk);
> - msk = mptcp_sk(subflow->conn);
> -
> - ack_seq = READ_ONCE(msk->ack_seq);
> rcv_wnd_new = ack_seq + tp->rcv_wnd;
>
> rcv_wnd_old = atomic64_read(&msk->rcv_wnd_sent);
> @@ -1363,7 +1354,7 @@ static void mptcp_set_rwin(struct tcp_sock *tp,
> struct tcphdr *th)
>
> update_wspace:
> WRITE_ONCE(msk->old_wspace, tp->rcv_wnd);
> - subflow->rcv_wnd_sent = rcv_wnd_new;
> + return rcv_wnd_new;
> }
>
> static void mptcp_track_rwin(struct tcp_sock *tp)
> @@ -1475,13 +1466,25 @@ void mptcp_write_options(struct tcphdr *th,
> __be32 *ptr, struct tcp_sock *tp,
> *ptr++ = mptcp_option(MPTCPOPT_DSS, len, 0, flags);
>
> if (mpext->use_ack) {
> + struct mptcp_sock *msk;
> + u64 ack_seq;
> +
> + /* DSS option is set only by
> mptcp_established_option,
> + * the caller is __tcp_transmit_skb() and
> ssk is always
> + * not NULL.
> + */
> + subflow = mptcp_subflow_ctx(ssk);
> + msk = mptcp_sk(subflow->conn);
> + ack_seq = READ_ONCE(msk->ack_seq);
> if (mpext->ack64) {
> - put_unaligned_be64(mpext->data_ack,
> ptr);
> + put_unaligned_be64(ack_seq, ptr);
> ptr += 2;
> } else {
> - put_unaligned_be32(mpext-
> >data_ack32, ptr);
This patch makes data_ack and data_ack32 in struct mptcp_ext no longer
used. Can they be removed from struct mptcp_ext, and update
mptcp_dump_mpext() in include/trace/events/mptcp.h accordingly?
Thanks,
-Geliang
> + put_unaligned_be32(ack_seq, ptr);
> ptr += 1;
> }
> + subflow->rcv_wnd_sent = mptcp_set_rwin(msk,
> tp, th,
> +
> ack_seq);
> }
>
> if (mpext->use_map) {
> @@ -1709,9 +1712,6 @@ void mptcp_write_options(struct tcphdr *th,
> __be32 *ptr, struct tcp_sock *tp,
> i += 4;
> }
> }
> -
> - if (tp)
> - mptcp_set_rwin(tp, th);
> }
>
> __be32 mptcp_get_reset_option(const struct sk_buff *skb)
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 3/9] mptcp: close TOCTOU race while computing rcv_wnd
2026-05-26 6:10 ` Geliang Tang
@ 2026-05-26 6:34 ` Paolo Abeni
2026-05-26 7:48 ` Matthieu Baerts
0 siblings, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-05-26 6:34 UTC (permalink / raw)
To: Geliang Tang, mptcp; +Cc: Matthieu Baerts, gang.yan
On 5/26/26 8:10 AM, Geliang Tang wrote:
> On Fri, 2026-05-22 at 23:43 +0200, Paolo Abeni wrote:
>> The MPTCP output path access locklessly the MPTCP-level ack_seq
>> in multiple times, using possibly different values for the data_ack
>> in the DSS option and to compute the announced rcv wnd for the same
>> packet.
>>
>> Refactor the cote to avoid inconsistencies which may confuse the
>> peer. Also ensure that the MPTCP level rcv wnd is updated only when
>> the egress packet actually contains a DSS ack.
>>
>> Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> net/mptcp/options.c | 36 ++++++++++++++++++------------------
>> 1 file changed, 18 insertions(+), 18 deletions(-)
>>
>> diff --git a/net/mptcp/options.c b/net/mptcp/options.c
>> index 4cc583fdc7a9..4d72f286a485 100644
>> --- a/net/mptcp/options.c
>> +++ b/net/mptcp/options.c
>> @@ -570,7 +570,6 @@ static bool mptcp_established_options_dss(struct
>> sock *sk, struct sk_buff *skb,
>> struct mptcp_ext *mpext;
>> unsigned int ack_size;
>> bool ret = false;
>> - u64 ack_seq;
>>
>> opts->csum_reqd = READ_ONCE(msk->csum_enabled);
>> mpext = skb ? mptcp_get_ext(skb) : NULL;
>> @@ -601,14 +600,11 @@ static bool
>> mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
>> return ret;
>> }
>>
>> - ack_seq = READ_ONCE(msk->ack_seq);
>> if (READ_ONCE(msk->use_64bit_ack)) {
>> ack_size = TCPOLEN_MPTCP_DSS_ACK64;
>> - opts->ext_copy.data_ack = ack_seq;
>> opts->ext_copy.ack64 = 1;
>> } else {
>> ack_size = TCPOLEN_MPTCP_DSS_ACK32;
>> - opts->ext_copy.data_ack32 = (uint32_t)ack_seq;
>> opts->ext_copy.ack64 = 0;
>> }
>> opts->ext_copy.use_ack = 1;
>> @@ -1298,19 +1294,14 @@ bool mptcp_incoming_options(struct sock *sk,
>> struct sk_buff *skb)
>> return true;
>> }
>>
>> -static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
>> +static u64 mptcp_set_rwin(struct mptcp_sock *msk, struct tcp_sock
>> *tp,
>> + struct tcphdr *th, u64 ack_seq)
>> {
>> const struct sock *ssk = (const struct sock *)tp;
>> - struct mptcp_subflow_context *subflow;
>> - u64 ack_seq, rcv_wnd_old, rcv_wnd_new;
>> - struct mptcp_sock *msk;
>> + u64 rcv_wnd_old, rcv_wnd_new;
>> u32 new_win;
>> u64 win;
>>
>> - subflow = mptcp_subflow_ctx(ssk);
>> - msk = mptcp_sk(subflow->conn);
>> -
>> - ack_seq = READ_ONCE(msk->ack_seq);
>> rcv_wnd_new = ack_seq + tp->rcv_wnd;
>>
>> rcv_wnd_old = atomic64_read(&msk->rcv_wnd_sent);
>> @@ -1363,7 +1354,7 @@ static void mptcp_set_rwin(struct tcp_sock *tp,
>> struct tcphdr *th)
>>
>> update_wspace:
>> WRITE_ONCE(msk->old_wspace, tp->rcv_wnd);
>> - subflow->rcv_wnd_sent = rcv_wnd_new;
>> + return rcv_wnd_new;
>> }
>>
>> static void mptcp_track_rwin(struct tcp_sock *tp)
>> @@ -1475,13 +1466,25 @@ void mptcp_write_options(struct tcphdr *th,
>> __be32 *ptr, struct tcp_sock *tp,
>> *ptr++ = mptcp_option(MPTCPOPT_DSS, len, 0, flags);
>>
>> if (mpext->use_ack) {
>> + struct mptcp_sock *msk;
>> + u64 ack_seq;
>> +
>> + /* DSS option is set only by
>> mptcp_established_option,
>> + * the caller is __tcp_transmit_skb() and
>> ssk is always
>> + * not NULL.
>> + */
>> + subflow = mptcp_subflow_ctx(ssk);
>> + msk = mptcp_sk(subflow->conn);
>> + ack_seq = READ_ONCE(msk->ack_seq);
>> if (mpext->ack64) {
>> - put_unaligned_be64(mpext->data_ack,
>> ptr);
>> + put_unaligned_be64(ack_seq, ptr);
>> ptr += 2;
>> } else {
>> - put_unaligned_be32(mpext-
>>> data_ack32, ptr);
>
> This patch makes data_ack and data_ack32 in struct mptcp_ext no longer
> used. Can they be removed from struct mptcp_ext, and update
> mptcp_dump_mpext() in include/trace/events/mptcp.h accordingly?
Thanks for your review.
It looks like mptcp_dump_mpext() always dumped data_ack, but such value
was always 0-set when that function was called? if so dropping the
fields and adjusting the helper could be a separate/later patch?
/P
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 3/9] mptcp: close TOCTOU race while computing rcv_wnd
2026-05-26 6:34 ` Paolo Abeni
@ 2026-05-26 7:48 ` Matthieu Baerts
0 siblings, 0 replies; 30+ messages in thread
From: Matthieu Baerts @ 2026-05-26 7:48 UTC (permalink / raw)
To: Paolo Abeni, Geliang Tang, mptcp; +Cc: gang.yan
Hi Paolo, Geliang,
Thank you for this fix!
On 26/05/2026 16:34, Paolo Abeni wrote:
> On 5/26/26 8:10 AM, Geliang Tang wrote:
>> On Fri, 2026-05-22 at 23:43 +0200, Paolo Abeni wrote:
>>> The MPTCP output path access locklessly the MPTCP-level ack_seq
>>> in multiple times, using possibly different values for the data_ack
>>> in the DSS option and to compute the announced rcv wnd for the same
>>> packet.
>>>
>>> Refactor the cote to avoid inconsistencies which may confuse the
>>> peer. Also ensure that the MPTCP level rcv wnd is updated only when
>>> the egress packet actually contains a DSS ack.
>>>
>>> Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>> ---
>>> net/mptcp/options.c | 36 ++++++++++++++++++------------------
>>> 1 file changed, 18 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/net/mptcp/options.c b/net/mptcp/options.c
>>> index 4cc583fdc7a9..4d72f286a485 100644
>>> --- a/net/mptcp/options.c
>>> +++ b/net/mptcp/options.c
>>> @@ -570,7 +570,6 @@ static bool mptcp_established_options_dss(struct
>>> sock *sk, struct sk_buff *skb,
>>> struct mptcp_ext *mpext;
>>> unsigned int ack_size;
>>> bool ret = false;
>>> - u64 ack_seq;
>>>
>>> opts->csum_reqd = READ_ONCE(msk->csum_enabled);
>>> mpext = skb ? mptcp_get_ext(skb) : NULL;
>>> @@ -601,14 +600,11 @@ static bool
>>> mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
>>> return ret;
>>> }
>>>
>>> - ack_seq = READ_ONCE(msk->ack_seq);
>>> if (READ_ONCE(msk->use_64bit_ack)) {
>>> ack_size = TCPOLEN_MPTCP_DSS_ACK64;
>>> - opts->ext_copy.data_ack = ack_seq;
>>> opts->ext_copy.ack64 = 1;
>>> } else {
>>> ack_size = TCPOLEN_MPTCP_DSS_ACK32;
>>> - opts->ext_copy.data_ack32 = (uint32_t)ack_seq;
>>> opts->ext_copy.ack64 = 0;
>>> }
>>> opts->ext_copy.use_ack = 1;
>>> @@ -1298,19 +1294,14 @@ bool mptcp_incoming_options(struct sock *sk,
>>> struct sk_buff *skb)
>>> return true;
>>> }
>>>
>>> -static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
>>> +static u64 mptcp_set_rwin(struct mptcp_sock *msk, struct tcp_sock
>>> *tp,
>>> + struct tcphdr *th, u64 ack_seq)
>>> {
>>> const struct sock *ssk = (const struct sock *)tp;
>>> - struct mptcp_subflow_context *subflow;
>>> - u64 ack_seq, rcv_wnd_old, rcv_wnd_new;
>>> - struct mptcp_sock *msk;
>>> + u64 rcv_wnd_old, rcv_wnd_new;
>>> u32 new_win;
>>> u64 win;
>>>
>>> - subflow = mptcp_subflow_ctx(ssk);
>>> - msk = mptcp_sk(subflow->conn);
>>> -
>>> - ack_seq = READ_ONCE(msk->ack_seq);
>>> rcv_wnd_new = ack_seq + tp->rcv_wnd;
>>>
>>> rcv_wnd_old = atomic64_read(&msk->rcv_wnd_sent);
>>> @@ -1363,7 +1354,7 @@ static void mptcp_set_rwin(struct tcp_sock *tp,
>>> struct tcphdr *th)
>>>
>>> update_wspace:
>>> WRITE_ONCE(msk->old_wspace, tp->rcv_wnd);
>>> - subflow->rcv_wnd_sent = rcv_wnd_new;
>>> + return rcv_wnd_new;
>>> }
>>>
>>> static void mptcp_track_rwin(struct tcp_sock *tp)
>>> @@ -1475,13 +1466,25 @@ void mptcp_write_options(struct tcphdr *th,
>>> __be32 *ptr, struct tcp_sock *tp,
>>> *ptr++ = mptcp_option(MPTCPOPT_DSS, len, 0, flags);
>>>
>>> if (mpext->use_ack) {
>>> + struct mptcp_sock *msk;
>>> + u64 ack_seq;
>>> +
>>> + /* DSS option is set only by
>>> mptcp_established_option,
>>> + * the caller is __tcp_transmit_skb() and
>>> ssk is always
>>> + * not NULL.
>>> + */
>>> + subflow = mptcp_subflow_ctx(ssk);
>>> + msk = mptcp_sk(subflow->conn);
>>> + ack_seq = READ_ONCE(msk->ack_seq);
>>> if (mpext->ack64) {
>>> - put_unaligned_be64(mpext->data_ack,
>>> ptr);
>>> + put_unaligned_be64(ack_seq, ptr);
>>> ptr += 2;
>>> } else {
>>> - put_unaligned_be32(mpext-
>>>> data_ack32, ptr);
>>
>> This patch makes data_ack and data_ack32 in struct mptcp_ext no longer
>> used. Can they be removed from struct mptcp_ext, and update
>> mptcp_dump_mpext() in include/trace/events/mptcp.h accordingly?
>
> Thanks for your review.
>
> It looks like mptcp_dump_mpext() always dumped data_ack, but such value
> was always 0-set when that function was called? if so dropping the
> fields and adjusting the helper could be a separate/later patch?
If they have never been used elsewhere before -- which seems to be the
case -- that should be fine to do that in a separate patch. In other
cases, that's probably safer to do that in the same patch, to help to
catch other usages when backporting such patches.
In other words, no need to change anything here, and a follow-up patch
would be nice :)
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
` (2 preceding siblings ...)
2026-05-22 21:43 ` [PATCH v8 mptcp-next 3/9] mptcp: close TOCTOU race while computing rcv_wnd Paolo Abeni
@ 2026-05-22 21:43 ` Paolo Abeni
2026-05-24 8:34 ` Paolo Abeni
2026-05-26 7:48 ` Matthieu Baerts
2026-05-22 21:43 ` [PATCH v8 mptcp-next 5/9] mptcp: explicitly drop over memory limits Paolo Abeni
` (5 subsequent siblings)
9 siblings, 2 replies; 30+ messages in thread
From: Paolo Abeni @ 2026-05-22 21:43 UTC (permalink / raw)
To: mptcp; +Cc: Matthieu Baerts, Geliang Tang, gang.yan
In MPTCP connection, the `window` field in the TCP header refers to the
MPTCP-level rcv_nxt and it's right edge should not move backward. Such
constraint is enforced at DSS option generation time.
At the same time, the TCP stack ensures independently that the TCP-level
rcv wnd right's edge does not move backward. That in turn causes artificial
inflating of the MPTCP rcv window when the incoming data is acked at the
TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).
As a consequence, the incoming traffic can exceed the receiver rcvbuf size
even when the sender is not misbehaving.
Prevent such scenario forcibly allowing the TCP subflow to shrink the
TCP-level rcv wnd regardless of the current netns setting.
Fixes: f3589be0c420 ("mptcp: never shrink offered window")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/options.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 4d72f286a485..97ea4aa37b33 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -566,6 +566,7 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
{
struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
struct mptcp_sock *msk = mptcp_sk(subflow->conn);
+ struct tcp_sock *tp = tcp_sk(sk);
unsigned int dss_size = 0;
struct mptcp_ext *mpext;
unsigned int ack_size;
@@ -614,6 +615,12 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
if (dss_size == 0)
ack_size += TCPOLEN_MPTCP_DSS_BASE;
+ /* The caller is __tcp_transmit_skb(), and will compute the new rcv
+ * wnd soon: ensure that the window can shrink.
+ */
+ if (skb)
+ tp->rcv_wnd = tp->rcv_nxt - tp->rcv_wup;
+
dss_size += ack_size;
*size = ALIGN(dss_size, 4);
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink
2026-05-22 21:43 ` [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink Paolo Abeni
@ 2026-05-24 8:34 ` Paolo Abeni
2026-05-26 7:02 ` Matthieu Baerts
2026-05-26 7:48 ` Matthieu Baerts
1 sibling, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-05-24 8:34 UTC (permalink / raw)
To: Matthieu Baerts; +Cc: Geliang Tang, gang.yan, mptcp
On 5/22/26 11:43 PM, Paolo Abeni wrote:
> In MPTCP connection, the `window` field in the TCP header refers to the
> MPTCP-level rcv_nxt and it's right edge should not move backward. Such
> constraint is enforced at DSS option generation time.
>
> At the same time, the TCP stack ensures independently that the TCP-level
> rcv wnd right's edge does not move backward. That in turn causes artificial
> inflating of the MPTCP rcv window when the incoming data is acked at the
> TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).
>
> As a consequence, the incoming traffic can exceed the receiver rcvbuf size
> even when the sender is not misbehaving.
>
> Prevent such scenario forcibly allowing the TCP subflow to shrink the
> TCP-level rcv wnd regardless of the current netns setting.
>
> Fixes: f3589be0c420 ("mptcp: never shrink offered window")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Sashiko failed to review this series from this patch included onward:
https://sashiko.dev/#/patchset/cover.1779485511.git.pabeni%40redhat.com
and it's not clear to me why.
I think patch 1/3 are quite uncontroversial and reasonably even this one.
WRT the stall addressing patches, this iteration survives mptcp_data
testing without timeout tuning, hopefully should be fine for Geliang and
Gang usage.
@Matttbe: please LMK if you prefer a complete repost or if you could
apply patch 1-3 (or more).
Thanks,
Paolo
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink
2026-05-24 8:34 ` Paolo Abeni
@ 2026-05-26 7:02 ` Matthieu Baerts
2026-05-26 7:49 ` Matthieu Baerts
0 siblings, 1 reply; 30+ messages in thread
From: Matthieu Baerts @ 2026-05-26 7:02 UTC (permalink / raw)
To: Paolo Abeni; +Cc: Geliang Tang, gang.yan, mptcp
Hi Paolo,
On 24/05/2026 18:34, Paolo Abeni wrote:
> On 5/22/26 11:43 PM, Paolo Abeni wrote:
>> In MPTCP connection, the `window` field in the TCP header refers to the
>> MPTCP-level rcv_nxt and it's right edge should not move backward. Such
>> constraint is enforced at DSS option generation time.
>>
>> At the same time, the TCP stack ensures independently that the TCP-level
>> rcv wnd right's edge does not move backward. That in turn causes artificial
>> inflating of the MPTCP rcv window when the incoming data is acked at the
>> TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).
>>
>> As a consequence, the incoming traffic can exceed the receiver rcvbuf size
>> even when the sender is not misbehaving.
>>
>> Prevent such scenario forcibly allowing the TCP subflow to shrink the
>> TCP-level rcv wnd regardless of the current netns setting.
>>
>> Fixes: f3589be0c420 ("mptcp: never shrink offered window")
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>
> Sashiko failed to review this series from this patch included onward:
>
> https://sashiko.dev/#/patchset/cover.1779485511.git.pabeni%40redhat.com
>
> and it's not clear to me why.
It is not the first time I see failures. I guess it is not linked to
your series.
> I think patch 1/3 are quite uncontroversial and reasonably even this one.
>
> WRT the stall addressing patches, this iteration survives mptcp_data
> testing without timeout tuning, hopefully should be fine for Geliang and
> Gang usage.
>
> @Matttbe: please LMK if you prefer a complete repost or if you could
> apply patch 1-3 (or more).
Sorry for the delay. Yes, on it. I guess patches 1-4 can already be
applied to -net.
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink
2026-05-26 7:02 ` Matthieu Baerts
@ 2026-05-26 7:49 ` Matthieu Baerts
2026-05-26 15:17 ` Paolo Abeni
0 siblings, 1 reply; 30+ messages in thread
From: Matthieu Baerts @ 2026-05-26 7:49 UTC (permalink / raw)
To: Paolo Abeni; +Cc: Geliang Tang, gang.yan, mptcp
Hi Paolo,
On 26/05/2026 17:02, Matthieu Baerts wrote:
> Hi Paolo,
>
> On 24/05/2026 18:34, Paolo Abeni wrote:
>> On 5/22/26 11:43 PM, Paolo Abeni wrote:
>>> In MPTCP connection, the `window` field in the TCP header refers to the
>>> MPTCP-level rcv_nxt and it's right edge should not move backward. Such
>>> constraint is enforced at DSS option generation time.
>>>
>>> At the same time, the TCP stack ensures independently that the TCP-level
>>> rcv wnd right's edge does not move backward. That in turn causes artificial
>>> inflating of the MPTCP rcv window when the incoming data is acked at the
>>> TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).
>>>
>>> As a consequence, the incoming traffic can exceed the receiver rcvbuf size
>>> even when the sender is not misbehaving.
>>>
>>> Prevent such scenario forcibly allowing the TCP subflow to shrink the
>>> TCP-level rcv wnd regardless of the current netns setting.
>>>
>>> Fixes: f3589be0c420 ("mptcp: never shrink offered window")
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>
>> Sashiko failed to review this series from this patch included onward:
>>
>> https://sashiko.dev/#/patchset/cover.1779485511.git.pabeni%40redhat.com
>>
>> and it's not clear to me why.
>
> It is not the first time I see failures. I guess it is not linked to
> your series.
>
>> I think patch 1/3 are quite uncontroversial and reasonably even this one.
>>
>> WRT the stall addressing patches, this iteration survives mptcp_data
>> testing without timeout tuning, hopefully should be fine for Geliang and
>> Gang usage.
>>
>> @Matttbe: please LMK if you prefer a complete repost or if you could
>> apply patch 1-3 (or more).
> Sorry for the delay. Yes, on it. I guess patches 1-4 can already be
> applied to -net.
Just reviewed them. Please tell me if you are OK with the suggested
modifications, or if you prefer to send a new version.
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink
2026-05-26 7:49 ` Matthieu Baerts
@ 2026-05-26 15:17 ` Paolo Abeni
2026-05-27 0:51 ` Matthieu Baerts
0 siblings, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-05-26 15:17 UTC (permalink / raw)
To: Matthieu Baerts; +Cc: Geliang Tang, gang.yan, mptcp
On 5/26/26 9:49 AM, Matthieu Baerts wrote:
> On 26/05/2026 17:02, Matthieu Baerts wrote:
>> On 24/05/2026 18:34, Paolo Abeni wrote:
>>> On 5/22/26 11:43 PM, Paolo Abeni wrote:
>>>> In MPTCP connection, the `window` field in the TCP header refers to the
>>>> MPTCP-level rcv_nxt and it's right edge should not move backward. Such
>>>> constraint is enforced at DSS option generation time.
>>>>
>>>> At the same time, the TCP stack ensures independently that the TCP-level
>>>> rcv wnd right's edge does not move backward. That in turn causes artificial
>>>> inflating of the MPTCP rcv window when the incoming data is acked at the
>>>> TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).
>>>>
>>>> As a consequence, the incoming traffic can exceed the receiver rcvbuf size
>>>> even when the sender is not misbehaving.
>>>>
>>>> Prevent such scenario forcibly allowing the TCP subflow to shrink the
>>>> TCP-level rcv wnd regardless of the current netns setting.
>>>>
>>>> Fixes: f3589be0c420 ("mptcp: never shrink offered window")
>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>>
>>> Sashiko failed to review this series from this patch included onward:
>>>
>>> https://sashiko.dev/#/patchset/cover.1779485511.git.pabeni%40redhat.com
>>>
>>> and it's not clear to me why.
>>
>> It is not the first time I see failures. I guess it is not linked to
>> your series.
>>
>>> I think patch 1/3 are quite uncontroversial and reasonably even this one.
>>>
>>> WRT the stall addressing patches, this iteration survives mptcp_data
>>> testing without timeout tuning, hopefully should be fine for Geliang and
>>> Gang usage.
>>>
>>> @Matttbe: please LMK if you prefer a complete repost or if you could
>>> apply patch 1-3 (or more).
>> Sorry for the delay. Yes, on it. I guess patches 1-4 can already be
>> applied to -net.
> Just reviewed them. Please tell me if you are OK with the suggested
> modifications, or if you prefer to send a new version.
I think patch 4 does not need modifications. On the flip side that one
is a little delicate and has not gone through sashiko yet due to the
mentioned issue. I think you could apply patch 1-3 and I'll repost the
following ones.
/P
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink
2026-05-26 15:17 ` Paolo Abeni
@ 2026-05-27 0:51 ` Matthieu Baerts
0 siblings, 0 replies; 30+ messages in thread
From: Matthieu Baerts @ 2026-05-27 0:51 UTC (permalink / raw)
To: Paolo Abeni; +Cc: Geliang Tang, gang.yan, mptcp
Hi Paolo,
On 27/05/2026 01:17, Paolo Abeni wrote:
> On 5/26/26 9:49 AM, Matthieu Baerts wrote:
>> On 26/05/2026 17:02, Matthieu Baerts wrote:
>>> On 24/05/2026 18:34, Paolo Abeni wrote:
>>>> On 5/22/26 11:43 PM, Paolo Abeni wrote:
>>>>> In MPTCP connection, the `window` field in the TCP header refers to the
>>>>> MPTCP-level rcv_nxt and it's right edge should not move backward. Such
>>>>> constraint is enforced at DSS option generation time.
>>>>>
>>>>> At the same time, the TCP stack ensures independently that the TCP-level
>>>>> rcv wnd right's edge does not move backward. That in turn causes artificial
>>>>> inflating of the MPTCP rcv window when the incoming data is acked at the
>>>>> TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).
>>>>>
>>>>> As a consequence, the incoming traffic can exceed the receiver rcvbuf size
>>>>> even when the sender is not misbehaving.
>>>>>
>>>>> Prevent such scenario forcibly allowing the TCP subflow to shrink the
>>>>> TCP-level rcv wnd regardless of the current netns setting.
>>>>>
>>>>> Fixes: f3589be0c420 ("mptcp: never shrink offered window")
>>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>>>
>>>> Sashiko failed to review this series from this patch included onward:
>>>>
>>>> https://sashiko.dev/#/patchset/cover.1779485511.git.pabeni%40redhat.com
>>>>
>>>> and it's not clear to me why.
>>>
>>> It is not the first time I see failures. I guess it is not linked to
>>> your series.
>>>
>>>> I think patch 1/3 are quite uncontroversial and reasonably even this one.
>>>>
>>>> WRT the stall addressing patches, this iteration survives mptcp_data
>>>> testing without timeout tuning, hopefully should be fine for Geliang and
>>>> Gang usage.
>>>>
>>>> @Matttbe: please LMK if you prefer a complete repost or if you could
>>>> apply patch 1-3 (or more).
>>> Sorry for the delay. Yes, on it. I guess patches 1-4 can already be
>>> applied to -net.
>> Just reviewed them. Please tell me if you are OK with the suggested
>> modifications, or if you prefer to send a new version.
>
> I think patch 4 does not need modifications. On the flip side that one
> is a little delicate and has not gone through sashiko yet due to the
> mentioned issue. I think you could apply patch 1-3 and I'll repost the
> following ones.
Fine by me. I just applied the 3 first patches with the 2 small fixes:
New patches for t/upstream-net and t/upstream:
- fc8fb422aae7: mptcp: fix missing wakeups in edge scenarios
- 6219a0efdede: Squash to "mptcp: fix missing wakeups in edge scenarios"
- 2950e9d2932a: mptcp: fix retransmission loop when csum is enabled
- fcee762a0dc5: Squash to "mptcp: fix retransmission loop when csum is enabled"
- 0f17b1705eee: mptcp: close TOCTOU race while computing rcv_wnd
- Results: 1685899b9752..7865561e04f3 (export-net)
- Results: 599cb099e01e..19d4169d0689 (export)
Tests are now in progress:
- export-net: https://github.com/multipath-tcp/mptcp_net-next/commit/32c8161bff373b91be456498bcb8463c146460fe/checks
- export: https://github.com/multipath-tcp/mptcp_net-next/commit/b41d3d0c1a32a4a80618b5898cf6f993ee9e8511/checks
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink
2026-05-22 21:43 ` [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink Paolo Abeni
2026-05-24 8:34 ` Paolo Abeni
@ 2026-05-26 7:48 ` Matthieu Baerts
2026-05-26 15:16 ` Paolo Abeni
1 sibling, 1 reply; 30+ messages in thread
From: Matthieu Baerts @ 2026-05-26 7:48 UTC (permalink / raw)
To: Paolo Abeni, mptcp; +Cc: Geliang Tang, gang.yan
Hi Paolo,
On 23/05/2026 07:43, Paolo Abeni wrote:
> In MPTCP connection, the `window` field in the TCP header refers to the
> MPTCP-level rcv_nxt and it's right edge should not move backward. Such
> constraint is enforced at DSS option generation time.
>
> At the same time, the TCP stack ensures independently that the TCP-level
> rcv wnd right's edge does not move backward. That in turn causes artificial
> inflating of the MPTCP rcv window when the incoming data is acked at the
> TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).
>
> As a consequence, the incoming traffic can exceed the receiver rcvbuf size
> even when the sender is not misbehaving.
>
> Prevent such scenario forcibly allowing the TCP subflow to shrink the
> TCP-level rcv wnd regardless of the current netns setting.
>
> Fixes: f3589be0c420 ("mptcp: never shrink offered window")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/mptcp/options.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/net/mptcp/options.c b/net/mptcp/options.c
> index 4d72f286a485..97ea4aa37b33 100644
> --- a/net/mptcp/options.c
> +++ b/net/mptcp/options.c
> @@ -566,6 +566,7 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
> {
> struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
> struct mptcp_sock *msk = mptcp_sk(subflow->conn);
> + struct tcp_sock *tp = tcp_sk(sk);
> unsigned int dss_size = 0;
> struct mptcp_ext *mpext;
> unsigned int ack_size;
> @@ -614,6 +615,12 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
> if (dss_size == 0)
> ack_size += TCPOLEN_MPTCP_DSS_BASE;
>
> + /* The caller is __tcp_transmit_skb(), and will compute the new rcv
> + * wnd soon: ensure that the window can shrink.
> + */
> + if (skb)
> + tp->rcv_wnd = tp->rcv_nxt - tp->rcv_wup;
Do you need to call tcp_update_max_rcv_wnd_seq() after having updated
rcv_wnd?
See 0e24d17bd966 ("tcp: implement RFC 7323 window retraction receiver
requirements") for more details about it.
If we do that, I guess we can declare "struct tcp_sock *tp" here in the
if-statement.
I can also update the patch when applying it. Apart from that:
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink
2026-05-26 7:48 ` Matthieu Baerts
@ 2026-05-26 15:16 ` Paolo Abeni
0 siblings, 0 replies; 30+ messages in thread
From: Paolo Abeni @ 2026-05-26 15:16 UTC (permalink / raw)
To: Matthieu Baerts, mptcp; +Cc: Geliang Tang, gang.yan
On 5/26/26 9:48 AM, Matthieu Baerts wrote:
> On 23/05/2026 07:43, Paolo Abeni wrote:
>> In MPTCP connection, the `window` field in the TCP header refers to the
>> MPTCP-level rcv_nxt and it's right edge should not move backward. Such
>> constraint is enforced at DSS option generation time.
>>
>> At the same time, the TCP stack ensures independently that the TCP-level
>> rcv wnd right's edge does not move backward. That in turn causes artificial
>> inflating of the MPTCP rcv window when the incoming data is acked at the
>> TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).
>>
>> As a consequence, the incoming traffic can exceed the receiver rcvbuf size
>> even when the sender is not misbehaving.
>>
>> Prevent such scenario forcibly allowing the TCP subflow to shrink the
>> TCP-level rcv wnd regardless of the current netns setting.
>>
>> Fixes: f3589be0c420 ("mptcp: never shrink offered window")
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> net/mptcp/options.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/net/mptcp/options.c b/net/mptcp/options.c
>> index 4d72f286a485..97ea4aa37b33 100644
>> --- a/net/mptcp/options.c
>> +++ b/net/mptcp/options.c
>> @@ -566,6 +566,7 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
>> {
>> struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
>> struct mptcp_sock *msk = mptcp_sk(subflow->conn);
>> + struct tcp_sock *tp = tcp_sk(sk);
>> unsigned int dss_size = 0;
>> struct mptcp_ext *mpext;
>> unsigned int ack_size;
>> @@ -614,6 +615,12 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
>> if (dss_size == 0)
>> ack_size += TCPOLEN_MPTCP_DSS_BASE;
>>
>> + /* The caller is __tcp_transmit_skb(), and will compute the new rcv
>> + * wnd soon: ensure that the window can shrink.
>> + */
>> + if (skb)
>> + tp->rcv_wnd = tp->rcv_nxt - tp->rcv_wup;
>
> Do you need to call tcp_update_max_rcv_wnd_seq() after having updated
> rcv_wnd?
I think it's not needed here, because the above will never move
rightwards the rcv_wnd - it sets the current wnd to zero.
Note that LINUX_MIB_TCPFROMZEROWINDOWADV/LINUX_MIB_TCPTOZEROWINDOWADV
for mptcp subflow will be incorrect - as they are already due to mptcp
changing rcv_wnd after such accounting.
/P
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v8 mptcp-next 5/9] mptcp: explicitly drop over memory limits
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
` (3 preceding siblings ...)
2026-05-22 21:43 ` [PATCH v8 mptcp-next 4/9] mptcp: allow subflow rcv wnd to shrink Paolo Abeni
@ 2026-05-22 21:43 ` Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 6/9] mptcp: enforce hard limit on backlog flushing Paolo Abeni
` (4 subsequent siblings)
9 siblings, 0 replies; 30+ messages in thread
From: Paolo Abeni @ 2026-05-22 21:43 UTC (permalink / raw)
To: mptcp; +Cc: Matthieu Baerts, Geliang Tang, gang.yan
Currently the enforcement of the rcvbuf constraint is implemented
when moving the skbs into the msk receive or OoO queue, keeping the
incoming skbs in the subflow queue when over limit.
Under significant memory pressure the above can cause permanent data
transfer stalls, as the skb needed to make forward progress can be
stuck in a subflow queue.
Over memory limits, drop the incoming skb, relaying on MPTCP-level
retransmissions.
Note that fallback socket must perform the limit before the skb reaches
the subflow-level queue, as dropping an in-sequence already acked skb
would break the stream.
This is not a complete fix for the stall issue, as the drop strategy
needs refinements that will come in the next patches.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v7 -> v8:
- removed non fallback check in mptcp_incoming_option(): that is an
tput optimization (avoid rejections) for a slowpath case (sender
is misbehaving) and needs too much additional complexity.
- move here from later patches mibs definition
v6 -> v7:
- fix sign extension issues
v4 -> v5:
- fix possible u32 overflow in mptcp_over_limit
v3 -> v4:
- schedule TCP ack on drop
- enforce limits in __mptcp_move_skb() and __mptcp_add_backlog(), too
but only if not fallback.
v1 -> v2:
- deal correctly with tcp fin and zero win probe
RFC -> v1:
- limit vs actual buffer size
- use CB info instead of skb->len
Note that:
- this needs the follow-up patches to really fix the stall
- sashiko can assume ZWP carries unacked data and may be silently dropped.
AFAIK that is false.
- sashiko apparently can't graps mptcp subflow never hit the tcp rx
fastpath, and the mptcp_incoming_options in tcp_rcv_state_process
is hit, the peer can't transmit any more data.
- the memory comparison is intentionally very rough, as
the msk socket lock is not currently held where the condition is
now enforced. This should require some refinement, shared as-is
to avoid more latency on my side
---
net/mptcp/mib.c | 2 ++
net/mptcp/mib.h | 2 ++
net/mptcp/options.c | 28 +++++++++++++++++++++++++---
net/mptcp/protocol.c | 31 +++++++++++++++++++++++--------
4 files changed, 52 insertions(+), 11 deletions(-)
diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index f23fda0c55a7..ef65e2df709f 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -85,6 +85,8 @@ static const struct snmp_mib mptcp_snmp_list[] = {
SNMP_MIB_ITEM("SimultConnectFallback", MPTCP_MIB_SIMULTCONNFALLBACK),
SNMP_MIB_ITEM("FallbackFailed", MPTCP_MIB_FALLBACKFAILED),
SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
+ SNMP_MIB_ITEM("BacklogDrop", MPTCP_MIB_BACKLOGDROP),
+ SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
};
/* mptcp_mib_alloc - allocate percpu mib counters
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 812218b5ed2b..c84eb853d499 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -88,6 +88,8 @@ enum linux_mptcp_mib_field {
MPTCP_MIB_SIMULTCONNFALLBACK, /* Simultaneous connect */
MPTCP_MIB_FALLBACKFAILED, /* Can't fallback due to msk status */
MPTCP_MIB_WINPROBE, /* MPTCP-level zero window probe */
+ MPTCP_MIB_BACKLOGDROP, /* Backlog over memory limit */
+ MPTCP_MIB_RCVPRUNED, /* Dropped due to memory constrains */
__MPTCP_MIB_MAX
};
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 97ea4aa37b33..2b35bdc113a5 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1161,8 +1161,30 @@ static bool add_addr_hmac_valid(struct mptcp_sock *msk,
return hmac == mp_opt->ahmac;
}
-/* Return false in case of error (or subflow has been reset),
- * else return true.
+static bool mptcp_over_limit(struct sock *sk, struct sock *ssk,
+ const struct sk_buff *skb)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ u64 mem = sk_rmem_alloc_get(sk);
+
+ mem += READ_ONCE(msk->backlog_len);
+ if (likely(mem <= READ_ONCE(sk->sk_rcvbuf)))
+ return false;
+
+ /* Avoid silently dropping pure acks, fin or zero win probes. */
+ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq ||
+ TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN ||
+ !after(TCP_SKB_CB(skb)->end_seq, tcp_sk(ssk)->rcv_nxt))
+ return false;
+
+ /* Dropped due to memory constraints, schedule an ack. */
+ inet_csk(ssk)->icsk_ack.pending |= ICSK_ACK_NOMEM | ICSK_ACK_NOW;
+ inet_csk_schedule_ack(ssk);
+ return true;
+}
+
+/* Return false when the caller must drop the packet, i.e. in case of error,
+ * subflow has been reset, or over memory limits.
*/
bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
{
@@ -1188,7 +1210,7 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
__mptcp_data_acked(subflow->conn);
mptcp_data_unlock(subflow->conn);
- return true;
+ return !mptcp_over_limit(subflow->conn, sk, skb);
}
mptcp_get_options(skb, &mp_opt);
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4f2b2868031a..1d498c601145 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -381,6 +381,16 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
mptcp_borrow_fwdmem(sk, skb);
+ /* Can't drop packets for fallback socket this late, or the stream
+ * will break.
+ */
+ if (unlikely(sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) &&
+ !__mptcp_check_fallback(msk)) {
+ MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
+ mptcp_drop(sk, skb);
+ return false;
+ }
+
if (MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq) {
/* in sequence */
msk->bytes_received += copy_len;
@@ -675,6 +685,7 @@ static void __mptcp_add_backlog(struct sock *sk,
struct sk_buff *tail = NULL;
struct sock *ssk = skb->sk;
bool fragstolen;
+ u64 limit;
int delta;
if (unlikely(sk->sk_state == TCP_CLOSE)) {
@@ -682,6 +693,16 @@ static void __mptcp_add_backlog(struct sock *sk,
return;
}
+ /* Similar additional allowance as plain TCP. */
+ limit = READ_ONCE(sk->sk_rcvbuf);
+ limit += (limit >> 1) + 64 * 1024;
+ limit = min_t(u64, limit, UINT_MAX);
+ if (msk->backlog_len > limit && !__mptcp_check_fallback(msk)) {
+ __MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_BACKLOGDROP);
+ kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_BACKLOG);
+ return;
+ }
+
/* Try to coalesce with the last skb in our backlog */
if (!list_empty(&msk->backlog_list))
tail = list_last_entry(&msk->backlog_list, struct sk_buff, list);
@@ -753,7 +774,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
mptcp_init_skb(ssk, skb, offset, len);
- if (own_msk && sk_rmem_alloc_get(sk) < sk->sk_rcvbuf) {
+ if (own_msk) {
mptcp_subflow_lend_fwdmem(subflow, skb);
ret |= __mptcp_move_skb(sk, skb);
} else {
@@ -2211,10 +2232,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
*delta = 0;
while (1) {
- /* If the msk recvbuf is full stop, don't drop */
- if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
- break;
-
prefetch(skb->next);
list_del(&skb->list);
*delta += skb->truesize;
@@ -2242,9 +2259,7 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
DEBUG_NET_WARN_ON_ONCE(msk->backlog_unaccounted && sk->sk_socket &&
mem_cgroup_from_sk(sk));
- /* Don't spool the backlog if the rcvbuf is full. */
- if (list_empty(&msk->backlog_list) ||
- sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+ if (list_empty(&msk->backlog_list))
return false;
INIT_LIST_HEAD(skbs);
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread* [PATCH v8 mptcp-next 6/9] mptcp: enforce hard limit on backlog flushing
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
` (4 preceding siblings ...)
2026-05-22 21:43 ` [PATCH v8 mptcp-next 5/9] mptcp: explicitly drop over memory limits Paolo Abeni
@ 2026-05-22 21:43 ` Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning Paolo Abeni
` (3 subsequent siblings)
9 siblings, 0 replies; 30+ messages in thread
From: Paolo Abeni @ 2026-05-22 21:43 UTC (permalink / raw)
To: mptcp; +Cc: Matthieu Baerts, Geliang Tang, gang.yan
Currently a wild producer could keep the backlog flushing operation
spinning for an unbound time.
Since the previous patch the amount of data present in the backlog is
hard-limited. Move the backlog len update at the end of the flush loop to
prevent it spinning forever.
Also, no need to splice back the remaining skbs list into the backlog, as
such list is always empty after each backlog processing loop.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/protocol.c | 21 ++++++---------------
1 file changed, 6 insertions(+), 15 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 1d498c601145..03d6f8658467 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2230,7 +2230,6 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
struct mptcp_sock *msk = mptcp_sk(sk);
bool moved = false;
- *delta = 0;
while (1) {
prefetch(skb->next);
list_del(&skb->list);
@@ -2267,20 +2266,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
return true;
}
-static void mptcp_backlog_spooled(struct sock *sk, u32 moved,
- struct list_head *skbs)
-{
- struct mptcp_sock *msk = mptcp_sk(sk);
-
- WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
- list_splice(skbs, &msk->backlog_list);
-}
-
static bool mptcp_move_skbs(struct sock *sk)
{
+ struct mptcp_sock *msk = mptcp_sk(sk);
struct list_head skbs;
bool enqueued = false;
- u32 moved;
+ u32 moved = 0;
mptcp_data_lock(sk);
while (mptcp_can_spool_backlog(sk, &skbs)) {
@@ -2288,8 +2279,8 @@ static bool mptcp_move_skbs(struct sock *sk)
enqueued |= __mptcp_move_skbs(sk, &skbs, &moved);
mptcp_data_lock(sk);
- mptcp_backlog_spooled(sk, moved, &skbs);
}
+ WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
mptcp_data_unlock(sk);
if (enqueued && mptcp_epollin_ready(sk))
sk->sk_data_ready(sk);
@@ -3678,12 +3669,12 @@ static void mptcp_release_cb(struct sock *sk)
__must_hold(&sk->sk_lock.slock)
{
struct mptcp_sock *msk = mptcp_sk(sk);
+ u32 moved = 0;
for (;;) {
unsigned long flags = (msk->cb_flags & MPTCP_FLAGS_PROCESS_CTX_NEED);
struct list_head join_list, skbs;
bool spool_bl;
- u32 moved;
spool_bl = mptcp_can_spool_backlog(sk, &skbs);
if (!flags && !spool_bl)
@@ -3716,9 +3707,9 @@ static void mptcp_release_cb(struct sock *sk)
cond_resched();
spin_lock_bh(&sk->sk_lock.slock);
- if (spool_bl)
- mptcp_backlog_spooled(sk, moved, &skbs);
}
+ if (moved)
+ WRITE_ONCE(msk->backlog_len, msk->backlog_len - moved);
if (__test_and_clear_bit(MPTCP_CLEAN_UNA, &msk->cb_flags))
__mptcp_clean_una_wakeup(sk);
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread* [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
` (5 preceding siblings ...)
2026-05-22 21:43 ` [PATCH v8 mptcp-next 6/9] mptcp: enforce hard limit on backlog flushing Paolo Abeni
@ 2026-05-22 21:43 ` Paolo Abeni
2026-05-26 3:13 ` gang.yan
2026-05-22 21:43 ` [PATCH v8 mptcp-next 8/9] mptcp: move the retrans loop to a separate helper Paolo Abeni
` (2 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-05-22 21:43 UTC (permalink / raw)
To: mptcp; +Cc: Matthieu Baerts, Geliang Tang, gang.yan
Leverage the hybrid helpers to implement the receive queue and OoO queue
collapsing at ingress time when reaching memory bounds.
If the msk is owned by the user-space at incoming skb time, perform the
pruning in the release_cb. The prune check is additionally performed
when the skb reaches the msk-level queues.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v6 -> v7:
- fix u64 -> u32 truncation
v2 -> v3:
- deal with unsynced TFO skb at prune time - only possible when pruning
in mptcp_over_limit()
v1 -> v2:
- collapse rcv queue, too
- deal with MPC map, too
- drop left-over sentence in the commit message
RFC -> v1:
- use data_seq only when available
- avoid ack_seq lockless access
- drop limit on fallback
- collapse rcvqueue, too
- drop only when pruning is not possible and over rcvbuf * 2
Note:
- sashiko can be confused about fwd memory lifecycle (I can
understand that :). Any exceeding amount of fwd allocated memory
is always released by the next sk_mem_uncharge() - i.e. fwd memory
is not tied to the current skb.
- AFAICS KASAN handles bitmap variables in a sane way, and sashiko
doesn't know about that
---
net/mptcp/mib.c | 1 +
net/mptcp/mib.h | 1 +
net/mptcp/protocol.c | 46 +++++++++++++++++++++++++++++++++++++++++---
3 files changed, 45 insertions(+), 3 deletions(-)
diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index ef65e2df709f..d9bd4f4afcc0 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -87,6 +87,7 @@ static const struct snmp_mib mptcp_snmp_list[] = {
SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
SNMP_MIB_ITEM("BacklogDrop", MPTCP_MIB_BACKLOGDROP),
SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
+ SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
};
/* mptcp_mib_alloc - allocate percpu mib counters
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index c84eb853d499..18f35f7e0a2d 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -90,6 +90,7 @@ enum linux_mptcp_mib_field {
MPTCP_MIB_WINPROBE, /* MPTCP-level zero window probe */
MPTCP_MIB_BACKLOGDROP, /* Backlog over memory limit */
MPTCP_MIB_RCVPRUNED, /* Dropped due to memory constrains */
+ MPTCP_MIB_OFO_PRUNED, /* MPTCP-level OoO queue pruned */
__MPTCP_MIB_MAX
};
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 03d6f8658467..f446e22148b9 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -373,6 +373,43 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
skb_dst_drop(skb);
}
+/* "Inspired" from the TCP version */
+static void mptcp_prune_ofo_queue(struct sock *sk, u64 seq)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct rb_node *node, *prev;
+ bool pruned = false;
+ u64 mem;
+
+ if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
+ return;
+
+ node = &msk->ooo_last_skb->rbnode;
+
+ do {
+ struct sk_buff *skb = rb_to_skb(node);
+
+ /* Stop pruning if the incoming skb would land in OoO tail. */
+ if (after64(seq, MPTCP_SKB_CB(skb)->map_seq))
+ break;
+
+ pruned = true;
+ prev = rb_prev(node);
+ rb_erase(node, &msk->out_of_order_queue);
+ mptcp_drop(sk, skb);
+ msk->ooo_last_skb = rb_to_skb(prev);
+
+ mem = (unsigned int)atomic_read(&sk->sk_rmem_alloc);
+ if (mem < sk->sk_rcvbuf)
+ break;
+
+ node = prev;
+ } while (node);
+
+ if (pruned)
+ MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED);
+}
+
static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
{
u64 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
@@ -386,9 +423,12 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
*/
if (unlikely(sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) &&
!__mptcp_check_fallback(msk)) {
- MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
- mptcp_drop(sk, skb);
- return false;
+ mptcp_prune_ofo_queue(sk, MPTCP_SKB_CB(skb)->map_seq);
+ if (sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) {
+ MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
+ mptcp_drop(sk, skb);
+ return false;
+ }
}
if (MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq) {
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning
2026-05-22 21:43 ` [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning Paolo Abeni
@ 2026-05-26 3:13 ` gang.yan
2026-05-26 6:50 ` Paolo Abeni
0 siblings, 1 reply; 30+ messages in thread
From: gang.yan @ 2026-05-26 3:13 UTC (permalink / raw)
To: Paolo Abeni, mptcp; +Cc: Matthieu Baerts, Geliang Tang
May 23, 2026 at 5:43 AM, "Paolo Abeni" <pabeni@redhat.com mailto:pabeni@redhat.com?to=%22Paolo%20Abeni%22%20%3Cpabeni%40redhat.com%3E > wrote:
>
> Leverage the hybrid helpers to implement the receive queue and OoO queue
> collapsing at ingress time when reaching memory bounds.
>
> If the msk is owned by the user-space at incoming skb time, perform the
> pruning in the release_cb. The prune check is additionally performed
> when the skb reaches the msk-level queues.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v6 -> v7:
> - fix u64 -> u32 truncation
>
> v2 -> v3:
> - deal with unsynced TFO skb at prune time - only possible when pruning
> in mptcp_over_limit()
>
> v1 -> v2:
> - collapse rcv queue, too
> - deal with MPC map, too
> - drop left-over sentence in the commit message
>
> RFC -> v1:
> - use data_seq only when available
> - avoid ack_seq lockless access
> - drop limit on fallback
> - collapse rcvqueue, too
> - drop only when pruning is not possible and over rcvbuf * 2
>
> Note:
> - sashiko can be confused about fwd memory lifecycle (I can
> understand that :). Any exceeding amount of fwd allocated memory
> is always released by the next sk_mem_uncharge() - i.e. fwd memory
> is not tied to the current skb.
> - AFAICS KASAN handles bitmap variables in a sane way, and sashiko
> doesn't know about that
> ---
> net/mptcp/mib.c | 1 +
> net/mptcp/mib.h | 1 +
> net/mptcp/protocol.c | 46 +++++++++++++++++++++++++++++++++++++++++---
> 3 files changed, 45 insertions(+), 3 deletions(-)
>
> diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
> index ef65e2df709f..d9bd4f4afcc0 100644
> --- a/net/mptcp/mib.c
> +++ b/net/mptcp/mib.c
> @@ -87,6 +87,7 @@ static const struct snmp_mib mptcp_snmp_list[] = {
> SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
> SNMP_MIB_ITEM("BacklogDrop", MPTCP_MIB_BACKLOGDROP),
> SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
> + SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
> };
>
> /* mptcp_mib_alloc - allocate percpu mib counters
> diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
> index c84eb853d499..18f35f7e0a2d 100644
> --- a/net/mptcp/mib.h
> +++ b/net/mptcp/mib.h
> @@ -90,6 +90,7 @@ enum linux_mptcp_mib_field {
> MPTCP_MIB_WINPROBE, /* MPTCP-level zero window probe */
> MPTCP_MIB_BACKLOGDROP, /* Backlog over memory limit */
> MPTCP_MIB_RCVPRUNED, /* Dropped due to memory constrains */
> + MPTCP_MIB_OFO_PRUNED, /* MPTCP-level OoO queue pruned */
> __MPTCP_MIB_MAX
> };
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 03d6f8658467..f446e22148b9 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -373,6 +373,43 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
> skb_dst_drop(skb);
> }
>
> +/* "Inspired" from the TCP version */
> +static void mptcp_prune_ofo_queue(struct sock *sk, u64 seq)
> +{
> + struct mptcp_sock *msk = mptcp_sk(sk);
> + struct rb_node *node, *prev;
> + bool pruned = false;
> + u64 mem;
> +
> + if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
> + return;
> +
> + node = &msk->ooo_last_skb->rbnode;
> +
> + do {
> + struct sk_buff *skb = rb_to_skb(node);
> +
> + /* Stop pruning if the incoming skb would land in OoO tail. */
> + if (after64(seq, MPTCP_SKB_CB(skb)->map_seq))
> + break;
> +
> + pruned = true;
> + prev = rb_prev(node);
> + rb_erase(node, &msk->out_of_order_queue);
> + mptcp_drop(sk, skb);
> + msk->ooo_last_skb = rb_to_skb(prev);
> +
> + mem = (unsigned int)atomic_read(&sk->sk_rmem_alloc);
> + if (mem < sk->sk_rcvbuf)
> + break;
Hi Paolo,
Thanks for the v8. While going through the code, I ran into a
point that I'm not entirely sure about.
TCP‘s design uses sk->sk_rcvbuf >> 3 (one eighth of the buffer)
as a goal. It we use sk->sk_rcvbuf here, the loop may break after
deleting just one packet, right? This may fail to free enough space
for the incoming out‑of‑order packet, leading to repeated pruning
calls and potential packet drops.
Perhaps you intend to resolve the differences between TCP and MPTCP
when refactoring this function later?
Thanks
Gang
> +
> + node = prev;
> + } while (node);
> +
> + if (pruned)
> + MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFO_PRUNED);
> +}
> +
> static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
> {
> u64 copy_len = MPTCP_SKB_CB(skb)->end_seq - MPTCP_SKB_CB(skb)->map_seq;
> @@ -386,9 +423,12 @@ static bool __mptcp_move_skb(struct sock *sk, struct sk_buff *skb)
> */
> if (unlikely(sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) &&
> !__mptcp_check_fallback(msk)) {
> - MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
> - mptcp_drop(sk, skb);
> - return false;
> + mptcp_prune_ofo_queue(sk, MPTCP_SKB_CB(skb)->map_seq);
> + if (sk_rmem_alloc_get(sk) > READ_ONCE(sk->sk_rcvbuf)) {
> + MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
> + mptcp_drop(sk, skb);
> + return false;
> + }
> }
>
> if (MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq) {
> --
> 2.54.0
>
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning
2026-05-26 3:13 ` gang.yan
@ 2026-05-26 6:50 ` Paolo Abeni
2026-05-27 5:30 ` gang.yan
0 siblings, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-05-26 6:50 UTC (permalink / raw)
To: gang.yan, mptcp; +Cc: Matthieu Baerts, Geliang Tang
On 5/26/26 5:13 AM, gang.yan@linux.dev wrote:
> May 23, 2026 at 5:43 AM, "Paolo Abeni" <pabeni@redhat.com mailto:pabeni@redhat.com?to=%22Paolo%20Abeni%22%20%3Cpabeni%40redhat.com%3E > wrote:
>> diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
>> index ef65e2df709f..d9bd4f4afcc0 100644
>> --- a/net/mptcp/mib.c
>> +++ b/net/mptcp/mib.c
>> @@ -87,6 +87,7 @@ static const struct snmp_mib mptcp_snmp_list[] = {
>> SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
>> SNMP_MIB_ITEM("BacklogDrop", MPTCP_MIB_BACKLOGDROP),
>> SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
>> + SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
>> };
>>
>> /* mptcp_mib_alloc - allocate percpu mib counters
>> diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
>> index c84eb853d499..18f35f7e0a2d 100644
>> --- a/net/mptcp/mib.h
>> +++ b/net/mptcp/mib.h
>> @@ -90,6 +90,7 @@ enum linux_mptcp_mib_field {
>> MPTCP_MIB_WINPROBE, /* MPTCP-level zero window probe */
>> MPTCP_MIB_BACKLOGDROP, /* Backlog over memory limit */
>> MPTCP_MIB_RCVPRUNED, /* Dropped due to memory constrains */
>> + MPTCP_MIB_OFO_PRUNED, /* MPTCP-level OoO queue pruned */
>> __MPTCP_MIB_MAX
>> };
>>
>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>> index 03d6f8658467..f446e22148b9 100644
>> --- a/net/mptcp/protocol.c
>> +++ b/net/mptcp/protocol.c
>> @@ -373,6 +373,43 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
>> skb_dst_drop(skb);
>> }
>>
>> +/* "Inspired" from the TCP version */
>> +static void mptcp_prune_ofo_queue(struct sock *sk, u64 seq)
>> +{
>> + struct mptcp_sock *msk = mptcp_sk(sk);
>> + struct rb_node *node, *prev;
>> + bool pruned = false;
>> + u64 mem;
>> +
>> + if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
>> + return;
>> +
>> + node = &msk->ooo_last_skb->rbnode;
>> +
>> + do {
>> + struct sk_buff *skb = rb_to_skb(node);
>> +
>> + /* Stop pruning if the incoming skb would land in OoO tail. */
>> + if (after64(seq, MPTCP_SKB_CB(skb)->map_seq))
>> + break;
>> +
>> + pruned = true;
>> + prev = rb_prev(node);
>> + rb_erase(node, &msk->out_of_order_queue);
>> + mptcp_drop(sk, skb);
>> + msk->ooo_last_skb = rb_to_skb(prev);
>> +
>> + mem = (unsigned int)atomic_read(&sk->sk_rmem_alloc);
>> + if (mem < sk->sk_rcvbuf)
>> + break;
> Hi Paolo,
>
> Thanks for the v8. While going through the code, I ran into a
> point that I'm not entirely sure about.
>
> TCP‘s design uses sk->sk_rcvbuf >> 3 (one eighth of the buffer)
> as a goal. It we use sk->sk_rcvbuf here, the loop may break after
> deleting just one packet, right? This may fail to free enough space
> for the incoming out‑of‑order packet, leading to repeated pruning
> calls and potential packet drops.
Thank you for your review.
Yes, here there is some difference vs plain TCP and I think it's for the
better.
TCP tries to make a "significant" amount of space in the receiver buffer
while MPTCP tries to make room for a single packet.
Both strategies make sense in their respective context: TCP invokes
tcp_prune_ofo_queue() only after collapsing, and the latter is very CPU
intensive; if tcp_prune_ofo_queue() would make room for a single packet,
the next incoming one could still trigger collapsing and burn a lot of
CPU cycles (note that this bad chain of events could still happen if
sk_rcvbuf / 8 is not bigger than 2 packets).
MPTCP (intentionally, as per discussion in the last iterations here)
does not perform collapsing, and directly does pruning when over limit.
Pruning is relatively cheap - computational complexity wise we could do
it for each incoming packet with "no issues", at least compared to
collapsing.
Dropping the minimal amount of packets to fit the incoming one, allows
MPTCP to minimize the need for reinjections (for packets already acked
at the TCP-level, which we really should avoid). I think overall this
compromise is a better fit for MPTCP.
BTW did you have the chance to perform testing on top of this revision?
Thanks,
Paolo
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning
2026-05-26 6:50 ` Paolo Abeni
@ 2026-05-27 5:30 ` gang.yan
2026-05-27 10:01 ` Paolo Abeni
0 siblings, 1 reply; 30+ messages in thread
From: gang.yan @ 2026-05-27 5:30 UTC (permalink / raw)
To: Paolo Abeni, mptcp; +Cc: Matthieu Baerts, Geliang Tang
May 26, 2026 at 2:50 PM, "Paolo Abeni" <pabeni@redhat.com mailto:pabeni@redhat.com?to=%22Paolo%20Abeni%22%20%3Cpabeni%40redhat.com%3E > wrote:
>
> On 5/26/26 5:13 AM, gang.yan@linux.dev wrote:
>
> >
> > May 23, 2026 at 5:43 AM, "Paolo Abeni" <pabeni@redhat.com mailto:pabeni@redhat.com?to=%22Paolo%20Abeni%22%20%3Cpabeni%40redhat.com%3E > wrote:
> >
> > >
> > > diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
> > > index ef65e2df709f..d9bd4f4afcc0 100644
> > > --- a/net/mptcp/mib.c
> > > +++ b/net/mptcp/mib.c
> > > @@ -87,6 +87,7 @@ static const struct snmp_mib mptcp_snmp_list[] = {
> > > SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
> > > SNMP_MIB_ITEM("BacklogDrop", MPTCP_MIB_BACKLOGDROP),
> > > SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
> > > + SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
> > > };
> > >
> > > /* mptcp_mib_alloc - allocate percpu mib counters
> > > diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
> > > index c84eb853d499..18f35f7e0a2d 100644
> > > --- a/net/mptcp/mib.h
> > > +++ b/net/mptcp/mib.h
> > > @@ -90,6 +90,7 @@ enum linux_mptcp_mib_field {
> > > MPTCP_MIB_WINPROBE, /* MPTCP-level zero window probe */
> > > MPTCP_MIB_BACKLOGDROP, /* Backlog over memory limit */
> > > MPTCP_MIB_RCVPRUNED, /* Dropped due to memory constrains */
> > > + MPTCP_MIB_OFO_PRUNED, /* MPTCP-level OoO queue pruned */
> > > __MPTCP_MIB_MAX
> > > };
> > >
> > > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > > index 03d6f8658467..f446e22148b9 100644
> > > --- a/net/mptcp/protocol.c
> > > +++ b/net/mptcp/protocol.c
> > > @@ -373,6 +373,43 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
> > > skb_dst_drop(skb);
> > > }
> > >
> > > +/* "Inspired" from the TCP version */
> > > +static void mptcp_prune_ofo_queue(struct sock *sk, u64 seq)
> > > +{
> > > + struct mptcp_sock *msk = mptcp_sk(sk);
> > > + struct rb_node *node, *prev;
> > > + bool pruned = false;
> > > + u64 mem;
> > > +
> > > + if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
> > > + return;
> > > +
> > > + node = &msk->ooo_last_skb->rbnode;
> > > +
> > > + do {
> > > + struct sk_buff *skb = rb_to_skb(node);
> > > +
> > > + /* Stop pruning if the incoming skb would land in OoO tail. */
> > > + if (after64(seq, MPTCP_SKB_CB(skb)->map_seq))
> > > + break;
> > > +
> > > + pruned = true;
> > > + prev = rb_prev(node);
> > > + rb_erase(node, &msk->out_of_order_queue);
> > > + mptcp_drop(sk, skb);
> > > + msk->ooo_last_skb = rb_to_skb(prev);
> > > +
> > > + mem = (unsigned int)atomic_read(&sk->sk_rmem_alloc);
> > > + if (mem < sk->sk_rcvbuf)
> > > + break;
> > >
> > Hi Paolo,
> >
> > Thanks for the v8. While going through the code, I ran into a
> > point that I'm not entirely sure about.
> >
> > TCP‘s design uses sk->sk_rcvbuf >> 3 (one eighth of the buffer)
> > as a goal. It we use sk->sk_rcvbuf here, the loop may break after
> > deleting just one packet, right? This may fail to free enough space
> > for the incoming out‑of‑order packet, leading to repeated pruning
> > calls and potential packet drops.
> >
> Thank you for your review.
>
> Yes, here there is some difference vs plain TCP and I think it's for the
> better.
>
> TCP tries to make a "significant" amount of space in the receiver buffer
> while MPTCP tries to make room for a single packet.
>
> Both strategies make sense in their respective context: TCP invokes
> tcp_prune_ofo_queue() only after collapsing, and the latter is very CPU
> intensive; if tcp_prune_ofo_queue() would make room for a single packet,
> the next incoming one could still trigger collapsing and burn a lot of
> CPU cycles (note that this bad chain of events could still happen if
> sk_rcvbuf / 8 is not bigger than 2 packets).
>
> MPTCP (intentionally, as per discussion in the last iterations here)
> does not perform collapsing, and directly does pruning when over limit.
> Pruning is relatively cheap - computational complexity wise we could do
> it for each incoming packet with "no issues", at least compared to
> collapsing.
>
> Dropping the minimal amount of packets to fit the incoming one, allows
> MPTCP to minimize the need for reinjections (for packets already acked
> at the TCP-level, which we really should avoid). I think overall this
> compromise is a better fit for MPTCP.
>
> BTW did you have the chance to perform testing on top of this revision?
>
Hi Paolo,
Yes, I've tested v8 many times — hundreds of runs. And the tls.c part works
fine too.
I also spotted some issues with MPTCP TLS adaptation, but those are unrelated
to v8. I'll reply in Geliang's thread about them soon.
Thanks for your help.
Cherrs
Gang
> Thanks,
>
> Paolo
>
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning
2026-05-27 5:30 ` gang.yan
@ 2026-05-27 10:01 ` Paolo Abeni
2026-05-28 1:18 ` gang.yan
0 siblings, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-05-27 10:01 UTC (permalink / raw)
To: gang.yan; +Cc: Matthieu Baerts (NGI0), Geliang Tang, MPTCP Linux
On 5/27/26 7:30 AM, gang.yan@linux.dev wrote:
> May 26, 2026 at 2:50 PM, "Paolo Abeni" <pabeni@redhat.com mailto:pabeni@redhat.com?to=%22Paolo%20Abeni%22%20%3Cpabeni%40redhat.com%3E > wrote:
>> On 5/26/26 5:13 AM, gang.yan@linux.dev wrote:
>>> May 23, 2026 at 5:43 AM, "Paolo Abeni" <pabeni@redhat.com mailto:pabeni@redhat.com?to=%22Paolo%20Abeni%22%20%3Cpabeni%40redhat.com%3E > wrote:
>>>> diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
>>>> index ef65e2df709f..d9bd4f4afcc0 100644
>>>> --- a/net/mptcp/mib.c
>>>> +++ b/net/mptcp/mib.c
>>>> @@ -87,6 +87,7 @@ static const struct snmp_mib mptcp_snmp_list[] = {
>>>> SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
>>>> SNMP_MIB_ITEM("BacklogDrop", MPTCP_MIB_BACKLOGDROP),
>>>> SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
>>>> + SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
>>>> };
>>>>
>>>> /* mptcp_mib_alloc - allocate percpu mib counters
>>>> diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
>>>> index c84eb853d499..18f35f7e0a2d 100644
>>>> --- a/net/mptcp/mib.h
>>>> +++ b/net/mptcp/mib.h
>>>> @@ -90,6 +90,7 @@ enum linux_mptcp_mib_field {
>>>> MPTCP_MIB_WINPROBE, /* MPTCP-level zero window probe */
>>>> MPTCP_MIB_BACKLOGDROP, /* Backlog over memory limit */
>>>> MPTCP_MIB_RCVPRUNED, /* Dropped due to memory constrains */
>>>> + MPTCP_MIB_OFO_PRUNED, /* MPTCP-level OoO queue pruned */
>>>> __MPTCP_MIB_MAX
>>>> };
>>>>
>>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>>> index 03d6f8658467..f446e22148b9 100644
>>>> --- a/net/mptcp/protocol.c
>>>> +++ b/net/mptcp/protocol.c
>>>> @@ -373,6 +373,43 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
>>>> skb_dst_drop(skb);
>>>> }
>>>>
>>>> +/* "Inspired" from the TCP version */
>>>> +static void mptcp_prune_ofo_queue(struct sock *sk, u64 seq)
>>>> +{
>>>> + struct mptcp_sock *msk = mptcp_sk(sk);
>>>> + struct rb_node *node, *prev;
>>>> + bool pruned = false;
>>>> + u64 mem;
>>>> +
>>>> + if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
>>>> + return;
>>>> +
>>>> + node = &msk->ooo_last_skb->rbnode;
>>>> +
>>>> + do {
>>>> + struct sk_buff *skb = rb_to_skb(node);
>>>> +
>>>> + /* Stop pruning if the incoming skb would land in OoO tail. */
>>>> + if (after64(seq, MPTCP_SKB_CB(skb)->map_seq))
>>>> + break;
>>>> +
>>>> + pruned = true;
>>>> + prev = rb_prev(node);
>>>> + rb_erase(node, &msk->out_of_order_queue);
>>>> + mptcp_drop(sk, skb);
>>>> + msk->ooo_last_skb = rb_to_skb(prev);
>>>> +
>>>> + mem = (unsigned int)atomic_read(&sk->sk_rmem_alloc);
>>>> + if (mem < sk->sk_rcvbuf)
>>>> + break;
>>>>
>>> Hi Paolo,
>>>
>>> Thanks for the v8. While going through the code, I ran into a
>>> point that I'm not entirely sure about.
>>>
>>> TCP‘s design uses sk->sk_rcvbuf >> 3 (one eighth of the buffer)
>>> as a goal. It we use sk->sk_rcvbuf here, the loop may break after
>>> deleting just one packet, right? This may fail to free enough space
>>> for the incoming out‑of‑order packet, leading to repeated pruning
>>> calls and potential packet drops.
>>>
>> Thank you for your review.
>>
>> Yes, here there is some difference vs plain TCP and I think it's for the
>> better.
>>
>> TCP tries to make a "significant" amount of space in the receiver buffer
>> while MPTCP tries to make room for a single packet.
>>
>> Both strategies make sense in their respective context: TCP invokes
>> tcp_prune_ofo_queue() only after collapsing, and the latter is very CPU
>> intensive; if tcp_prune_ofo_queue() would make room for a single packet,
>> the next incoming one could still trigger collapsing and burn a lot of
>> CPU cycles (note that this bad chain of events could still happen if
>> sk_rcvbuf / 8 is not bigger than 2 packets).
>>
>> MPTCP (intentionally, as per discussion in the last iterations here)
>> does not perform collapsing, and directly does pruning when over limit.
>> Pruning is relatively cheap - computational complexity wise we could do
>> it for each incoming packet with "no issues", at least compared to
>> collapsing.
>>
>> Dropping the minimal amount of packets to fit the incoming one, allows
>> MPTCP to minimize the need for reinjections (for packets already acked
>> at the TCP-level, which we really should avoid). I think overall this
>> compromise is a better fit for MPTCP.
>>
>> BTW did you have the chance to perform testing on top of this revision?
>>
>
> Hi Paolo,
>
> Yes, I've tested v8 many times — hundreds of runs. And the tls.c part works
> fine too.
>
> I also spotted some issues with MPTCP TLS adaptation, but those are unrelated
> to v8. I'll reply in Geliang's thread about them soon.
Thank you for testing. I'll send a v9 to try to get sashiko comments on
remaining patches, but no real changes in it. Feel free to add your
formal Tested-by tag, if you want :)
Thanks,
Paolo
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning
2026-05-27 10:01 ` Paolo Abeni
@ 2026-05-28 1:18 ` gang.yan
0 siblings, 0 replies; 30+ messages in thread
From: gang.yan @ 2026-05-28 1:18 UTC (permalink / raw)
To: Paolo Abeni; +Cc: Matthieu Baerts (NGI0), Geliang Tang, MPTCP Linux
May 27, 2026 at 6:01 PM, "Paolo Abeni" <pabeni@redhat.com mailto:pabeni@redhat.com?to=%22Paolo%20Abeni%22%20%3Cpabeni%40redhat.com%3E > wrote:
>
> On 5/27/26 7:30 AM, gang.yan@linux.dev wrote:
>
> >
> > May 26, 2026 at 2:50 PM, "Paolo Abeni" <pabeni@redhat.com mailto:pabeni@redhat.com?to=%22Paolo%20Abeni%22%20%3Cpabeni%40redhat.com%3E > wrote:
> >
> > >
> > > On 5/26/26 5:13 AM, gang.yan@linux.dev wrote:
> > >
> > May 23, 2026 at 5:43 AM, "Paolo Abeni" <pabeni@redhat.com mailto:pabeni@redhat.com?to=%22Paolo%20Abeni%22%20%3Cpabeni%40redhat.com%3E > wrote:
> > diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
> > index ef65e2df709f..d9bd4f4afcc0 100644
> > --- a/net/mptcp/mib.c
> > +++ b/net/mptcp/mib.c
> > @@ -87,6 +87,7 @@ static const struct snmp_mib mptcp_snmp_list[] = {
> > SNMP_MIB_ITEM("WinProbe", MPTCP_MIB_WINPROBE),
> > SNMP_MIB_ITEM("BacklogDrop", MPTCP_MIB_BACKLOGDROP),
> > SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
> > + SNMP_MIB_ITEM("OfoPruned", MPTCP_MIB_OFO_PRUNED),
> > };
> >
> > /* mptcp_mib_alloc - allocate percpu mib counters
> > diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
> > index c84eb853d499..18f35f7e0a2d 100644
> > --- a/net/mptcp/mib.h
> > +++ b/net/mptcp/mib.h
> > @@ -90,6 +90,7 @@ enum linux_mptcp_mib_field {
> > MPTCP_MIB_WINPROBE, /* MPTCP-level zero window probe */
> > MPTCP_MIB_BACKLOGDROP, /* Backlog over memory limit */
> > MPTCP_MIB_RCVPRUNED, /* Dropped due to memory constrains */
> > + MPTCP_MIB_OFO_PRUNED, /* MPTCP-level OoO queue pruned */
> > __MPTCP_MIB_MAX
> > };
> >
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index 03d6f8658467..f446e22148b9 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -373,6 +373,43 @@ static void mptcp_init_skb(struct sock *ssk, struct sk_buff *skb, int offset,
> > skb_dst_drop(skb);
> > }
> >
> > +/* "Inspired" from the TCP version */
> > +static void mptcp_prune_ofo_queue(struct sock *sk, u64 seq)
> > +{
> > + struct mptcp_sock *msk = mptcp_sk(sk);
> > + struct rb_node *node, *prev;
> > + bool pruned = false;
> > + u64 mem;
> > +
> > + if (RB_EMPTY_ROOT(&msk->out_of_order_queue))
> > + return;
> > +
> > + node = &msk->ooo_last_skb->rbnode;
> > +
> > + do {
> > + struct sk_buff *skb = rb_to_skb(node);
> > +
> > + /* Stop pruning if the incoming skb would land in OoO tail. */
> > + if (after64(seq, MPTCP_SKB_CB(skb)->map_seq))
> > + break;
> > +
> > + pruned = true;
> > + prev = rb_prev(node);
> > + rb_erase(node, &msk->out_of_order_queue);
> > + mptcp_drop(sk, skb);
> > + msk->ooo_last_skb = rb_to_skb(prev);
> > +
> > + mem = (unsigned int)atomic_read(&sk->sk_rmem_alloc);
> > + if (mem < sk->sk_rcvbuf)
> > + break;
> >
> > Hi Paolo,
> >
> > Thanks for the v8. While going through the code, I ran into a
> > point that I'm not entirely sure about.
> >
> > TCP‘s design uses sk->sk_rcvbuf >> 3 (one eighth of the buffer)
> > as a goal. It we use sk->sk_rcvbuf here, the loop may break after
> > deleting just one packet, right? This may fail to free enough space
> > for the incoming out‑of‑order packet, leading to repeated pruning
> > calls and potential packet drops.
> >
> > >
> > > Thank you for your review.
> > >
> > > Yes, here there is some difference vs plain TCP and I think it's for the
> > > better.
> > >
> > > TCP tries to make a "significant" amount of space in the receiver buffer
> > > while MPTCP tries to make room for a single packet.
> > >
> > > Both strategies make sense in their respective context: TCP invokes
> > > tcp_prune_ofo_queue() only after collapsing, and the latter is very CPU
> > > intensive; if tcp_prune_ofo_queue() would make room for a single packet,
> > > the next incoming one could still trigger collapsing and burn a lot of
> > > CPU cycles (note that this bad chain of events could still happen if
> > > sk_rcvbuf / 8 is not bigger than 2 packets).
> > >
> > > MPTCP (intentionally, as per discussion in the last iterations here)
> > > does not perform collapsing, and directly does pruning when over limit.
> > > Pruning is relatively cheap - computational complexity wise we could do
> > > it for each incoming packet with "no issues", at least compared to
> > > collapsing.
> > >
> > > Dropping the minimal amount of packets to fit the incoming one, allows
> > > MPTCP to minimize the need for reinjections (for packets already acked
> > > at the TCP-level, which we really should avoid). I think overall this
> > > compromise is a better fit for MPTCP.
> > >
> > > BTW did you have the chance to perform testing on top of this revision?
> > >
> >
> > Hi Paolo,
> >
> > Yes, I've tested v8 many times — hundreds of runs. And the tls.c part works
> > fine too.
> >
> > I also spotted some issues with MPTCP TLS adaptation, but those are unrelated
> > to v8. I'll reply in Geliang's thread about them soon.
> >
> Thank you for testing. I'll send a v9 to try to get sashiko comments on
> remaining patches, but no real changes in it. Feel free to add your
> formal Tested-by tag, if you want :)
Great, thank you.
Tested-by: Gang Yan <yangang@kylinos.cn>
>
> Thanks,
>
> Paolo
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v8 mptcp-next 8/9] mptcp: move the retrans loop to a separate helper
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
` (6 preceding siblings ...)
2026-05-22 21:43 ` [PATCH v8 mptcp-next 7/9] mptcp: implemented OoO queue pruning Paolo Abeni
@ 2026-05-22 21:43 ` Paolo Abeni
2026-05-22 21:43 ` [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job Paolo Abeni
2026-05-22 23:10 ` [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure MPTCP CI
9 siblings, 0 replies; 30+ messages in thread
From: Paolo Abeni @ 2026-05-22 21:43 UTC (permalink / raw)
To: mptcp; +Cc: Matthieu Baerts, Geliang Tang, gang.yan
This is a cleanup in order to make the next patch simpler.
No functional change intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/mptcp/protocol.c | 74 +++++++++++++++++++++++++-------------------
1 file changed, 43 insertions(+), 31 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index f446e22148b9..4ebe45e8a3d2 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2824,41 +2824,14 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
sk_error_report(sk);
}
-static void __mptcp_retrans(struct sock *sk)
+/* Retransmit the specified data fragment on all the selected subflows. */
+static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
{
struct mptcp_sendmsg_info info = { .data_lock_held = true, };
struct mptcp_sock *msk = mptcp_sk(sk);
struct mptcp_subflow_context *subflow;
- struct mptcp_data_frag *dfrag;
struct sock *ssk;
- int ret, err;
- u16 len = 0;
-
- mptcp_clean_una_wakeup(sk);
-
- /* first check ssk: need to kick "stale" logic */
- err = mptcp_sched_get_retrans(msk);
- dfrag = mptcp_rtx_head(sk);
- if (!dfrag) {
- if (mptcp_data_fin_enabled(msk)) {
- struct inet_connection_sock *icsk = inet_csk(sk);
-
- WRITE_ONCE(icsk->icsk_retransmits,
- icsk->icsk_retransmits + 1);
- mptcp_set_datafin_timeout(sk);
- mptcp_send_ack(msk);
-
- goto reset_timer;
- }
-
- if (!mptcp_send_head(sk))
- goto clear_scheduled;
-
- goto reset_timer;
- }
-
- if (err)
- goto reset_timer;
+ int ret, len = 0;
mptcp_for_each_subflow(msk, subflow) {
if (READ_ONCE(subflow->scheduled)) {
@@ -2886,7 +2859,7 @@ static void __mptcp_retrans(struct sock *sk)
!msk->allow_subflows) {
spin_unlock_bh(&msk->fallback_lock);
release_sock(ssk);
- goto clear_scheduled;
+ return -1;
}
while (info.sent < info.limit) {
@@ -2909,6 +2882,45 @@ static void __mptcp_retrans(struct sock *sk)
release_sock(ssk);
}
}
+ return len;
+}
+
+static void __mptcp_retrans(struct sock *sk)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct mptcp_subflow_context *subflow;
+ struct mptcp_data_frag *dfrag;
+ int err, len;
+
+ mptcp_clean_una_wakeup(sk);
+
+ /* first check ssk: need to kick "stale" logic */
+ err = mptcp_sched_get_retrans(msk);
+ dfrag = mptcp_rtx_head(sk);
+ if (!dfrag) {
+ if (mptcp_data_fin_enabled(msk)) {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+
+ WRITE_ONCE(icsk->icsk_retransmits,
+ icsk->icsk_retransmits + 1);
+ mptcp_set_datafin_timeout(sk);
+ mptcp_send_ack(msk);
+
+ goto reset_timer;
+ }
+
+ if (!mptcp_send_head(sk))
+ goto clear_scheduled;
+
+ goto reset_timer;
+ }
+
+ if (err)
+ goto reset_timer;
+
+ len = __mptcp_push_retrans(sk, dfrag);
+ if (len < 0)
+ goto clear_scheduled;
msk->bytes_retrans += len;
dfrag->already_sent = max(dfrag->already_sent, len);
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread* [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job.
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
` (7 preceding siblings ...)
2026-05-22 21:43 ` [PATCH v8 mptcp-next 8/9] mptcp: move the retrans loop to a separate helper Paolo Abeni
@ 2026-05-22 21:43 ` Paolo Abeni
2026-05-27 5:46 ` Geliang Tang
2026-05-22 23:10 ` [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure MPTCP CI
9 siblings, 1 reply; 30+ messages in thread
From: Paolo Abeni @ 2026-05-22 21:43 UTC (permalink / raw)
To: mptcp; +Cc: Matthieu Baerts, Geliang Tang, gang.yan
Currently the MPTCP core enforces that when MPTCP-level retrans timer
fires, at most a single dfrag is retransmitted. If some corner-cases it
may be necessary retransmit multiple dfrags, and the MPTCP socket will
need to wait multiple retrans timeout to accomplish that.
Remove the mentioned constraint, allowing to transmit multiple dfrags per
retrans period, as long as the scheduler keeps selecting subflows for
retransmissions and pending data is available in the rtx queue.
The default scheduler will transmit a dfrag per available subflow.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v7 -> v8
- fix corner-case retrans_seq update
v4 -> v5:
- fixed already_sent update
v3 -> v4:
- avoid quadratic behavior, fix retrans_seq update
- fix rtx timer re-schedule miss
v2 -> v3:
- fix infinite loop issue (should address tls tests failures)
v1 -> v2:
- fix retrans sequence update (sashiko)
Note:
- sashiko see issues when dfrag = mptcp_rtx_head(sk) != NULL and
dfrag->already_sent == 0. That condition should not possible: if
mptcp_rtx_head() is not NULL there should be some data already
sent.
---
net/mptcp/protocol.c | 117 +++++++++++++++++++++++++++++++------------
1 file changed, 85 insertions(+), 32 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4ebe45e8a3d2..892fc44dffac 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1197,13 +1197,6 @@ static void __mptcp_clean_una_wakeup(struct sock *sk)
mptcp_write_space(sk);
}
-static void mptcp_clean_una_wakeup(struct sock *sk)
-{
- mptcp_data_lock(sk);
- __mptcp_clean_una_wakeup(sk);
- mptcp_data_unlock(sk);
-}
-
static void mptcp_enter_memory_pressure(struct sock *sk)
{
struct mptcp_subflow_context *subflow;
@@ -2824,8 +2817,12 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
sk_error_report(sk);
}
-/* Retransmit the specified data fragment on all the selected subflows. */
-static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
+/*
+ * Retransmit the specified data fragment on all the selected subflows,
+ * starting from the specified sequence
+ */
+static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag,
+ u64 sent_seq)
{
struct mptcp_sendmsg_info info = { .data_lock_held = true, };
struct mptcp_sock *msk = mptcp_sk(sk);
@@ -2835,6 +2832,7 @@ static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
mptcp_for_each_subflow(msk, subflow) {
if (READ_ONCE(subflow->scheduled)) {
+ u16 offset = sent_seq - dfrag->data_seq;
u16 copied = 0;
mptcp_subflow_set_scheduled(subflow, false);
@@ -2844,9 +2842,12 @@ static int __mptcp_push_retrans(struct sock *sk, struct mptcp_data_frag *dfrag)
lock_sock(ssk);
/* limit retransmission to the bytes already sent on some subflows */
- info.sent = 0;
+ info.sent = offset;
info.limit = READ_ONCE(msk->csum_enabled) ? dfrag->data_len :
dfrag->already_sent;
+ DEBUG_NET_WARN_ON_ONCE(!before64(sent_seq,
+ dfrag->data_seq +
+ info.limit));
/*
* make the whole retrans decision, xmit, disallow
@@ -2890,45 +2891,97 @@ static void __mptcp_retrans(struct sock *sk)
struct mptcp_sock *msk = mptcp_sk(sk);
struct mptcp_subflow_context *subflow;
struct mptcp_data_frag *dfrag;
+ bool retransmitted = false;
+ u64 retrans_seq;
int err, len;
- mptcp_clean_una_wakeup(sk);
-
- /* first check ssk: need to kick "stale" logic */
- err = mptcp_sched_get_retrans(msk);
+ mptcp_data_lock(sk);
+ __mptcp_clean_una_wakeup(sk);
+ retrans_seq = msk->snd_una;
dfrag = mptcp_rtx_head(sk);
+ mptcp_data_unlock(sk);
+ if (!dfrag)
+ goto check_data_fin;
+
+ for (;;) {
+ bool already_retrans;
+ u64 sent_seq;
+
+ /* The scheduler may clean the RTX queue. */
+ get_page(dfrag->page);
+
+ /* The default scheduler will kick "stale" logic. */
+ err = mptcp_sched_get_retrans(msk);
+ if (err) {
+ put_page(dfrag->page);
+ break;
+ }
+
+ /* Incoming acks can have moved retrans sequence after
+ * the current dfrag, if so try to start again from RTX head.
+ */
+ mptcp_data_lock(sk);
+ already_retrans = !dfrag->already_sent ||
+ !before64(msk->snd_una, dfrag->data_seq +
+ dfrag->already_sent);
+ put_page(dfrag->page);
+ if (already_retrans) {
+ __mptcp_clean_una_wakeup(sk);
+ retrans_seq = msk->snd_una;
+ dfrag = mptcp_rtx_head(sk);
+ } else if (after64(msk->snd_una, retrans_seq)) {
+ retrans_seq = msk->snd_una;
+ }
+ mptcp_data_unlock(sk);
+ if (!dfrag)
+ break;
+
+ len = __mptcp_push_retrans(sk, dfrag, retrans_seq);
+ if (len < 0)
+ goto clear_scheduled;
+
+ retransmitted = true;
+ retrans_seq += len;
+ msk->bytes_retrans += len;
+ dfrag->already_sent = max_t(u16, dfrag->already_sent,
+ retrans_seq - dfrag->data_seq);
+
+ /* With csum enabled retransmission can send new data. */
+ sent_seq = dfrag->already_sent + dfrag->data_seq;
+ if (after64(sent_seq, msk->snd_nxt))
+ msk->snd_nxt = sent_seq;
+
+ /* Attempt the next fragment only if the current one is
+ * completely retransmitted.
+ */
+ if (before64(retrans_seq, dfrag->data_seq + dfrag->data_len))
+ break;
+
+ dfrag = list_is_last(&dfrag->list, &msk->rtx_queue) ?
+ NULL : list_next_entry(dfrag, list);
+ if (!dfrag || !dfrag->already_sent)
+ break;
+ }
+
+ /* Data fin retransmission needed only if no data retransmission took
+ * place, and RTX queue is empty.
+ */
+check_data_fin:
if (!dfrag) {
- if (mptcp_data_fin_enabled(msk)) {
+ if (!retransmitted && mptcp_data_fin_enabled(msk)) {
struct inet_connection_sock *icsk = inet_csk(sk);
WRITE_ONCE(icsk->icsk_retransmits,
icsk->icsk_retransmits + 1);
mptcp_set_datafin_timeout(sk);
mptcp_send_ack(msk);
-
goto reset_timer;
}
if (!mptcp_send_head(sk))
goto clear_scheduled;
-
- goto reset_timer;
}
- if (err)
- goto reset_timer;
-
- len = __mptcp_push_retrans(sk, dfrag);
- if (len < 0)
- goto clear_scheduled;
-
- msk->bytes_retrans += len;
- dfrag->already_sent = max(dfrag->already_sent, len);
-
- /* With csum enabled retransmission can send new data. */
- if (after64(dfrag->already_sent + dfrag->data_seq, msk->snd_nxt))
- msk->snd_nxt = dfrag->already_sent + dfrag->data_seq;
-
reset_timer:
mptcp_check_and_set_pending(sk);
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread* Re: [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job.
2026-05-22 21:43 ` [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job Paolo Abeni
@ 2026-05-27 5:46 ` Geliang Tang
0 siblings, 0 replies; 30+ messages in thread
From: Geliang Tang @ 2026-05-27 5:46 UTC (permalink / raw)
To: Paolo Abeni, mptcp; +Cc: Matthieu Baerts, gang.yan
Hi Paolo,
On Fri, 2026-05-22 at 23:43 +0200, Paolo Abeni wrote:
> Currently the MPTCP core enforces that when MPTCP-level retrans timer
> fires, at most a single dfrag is retransmitted. If some corner-cases
> it
> may be necessary retransmit multiple dfrags, and the MPTCP socket
> will
> need to wait multiple retrans timeout to accomplish that.
>
> Remove the mentioned constraint, allowing to transmit multiple dfrags
> per
> retrans period, as long as the scheduler keeps selecting subflows for
> retransmissions and pending data is available in the rtx queue.
> The default scheduler will transmit a dfrag per available subflow.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v7 -> v8
> - fix corner-case retrans_seq update
>
> v4 -> v5:
> - fixed already_sent update
>
> v3 -> v4:
> - avoid quadratic behavior, fix retrans_seq update
> - fix rtx timer re-schedule miss
>
> v2 -> v3:
> - fix infinite loop issue (should address tls tests failures)
>
> v1 -> v2:
> - fix retrans sequence update (sashiko)
>
> Note:
> - sashiko see issues when dfrag = mptcp_rtx_head(sk) != NULL and
> dfrag->already_sent == 0. That condition should not possible: if
> mptcp_rtx_head() is not NULL there should be some data already
> sent.
> ---
> net/mptcp/protocol.c | 117 +++++++++++++++++++++++++++++++----------
> --
> 1 file changed, 85 insertions(+), 32 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 4ebe45e8a3d2..892fc44dffac 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1197,13 +1197,6 @@ static void __mptcp_clean_una_wakeup(struct
> sock *sk)
> mptcp_write_space(sk);
> }
>
> -static void mptcp_clean_una_wakeup(struct sock *sk)
> -{
> - mptcp_data_lock(sk);
> - __mptcp_clean_una_wakeup(sk);
> - mptcp_data_unlock(sk);
> -}
> -
> static void mptcp_enter_memory_pressure(struct sock *sk)
> {
> struct mptcp_subflow_context *subflow;
> @@ -2824,8 +2817,12 @@ static void mptcp_check_fastclose(struct
> mptcp_sock *msk)
> sk_error_report(sk);
> }
>
> -/* Retransmit the specified data fragment on all the selected
> subflows. */
> -static int __mptcp_push_retrans(struct sock *sk, struct
> mptcp_data_frag *dfrag)
> +/*
> + * Retransmit the specified data fragment on all the selected
> subflows,
> + * starting from the specified sequence
> + */
> +static int __mptcp_push_retrans(struct sock *sk, struct
> mptcp_data_frag *dfrag,
> + u64 sent_seq)
> {
> struct mptcp_sendmsg_info info = { .data_lock_held = true,
> };
> struct mptcp_sock *msk = mptcp_sk(sk);
> @@ -2835,6 +2832,7 @@ static int __mptcp_push_retrans(struct sock
> *sk, struct mptcp_data_frag *dfrag)
>
> mptcp_for_each_subflow(msk, subflow) {
> if (READ_ONCE(subflow->scheduled)) {
> + u16 offset = sent_seq - dfrag->data_seq;
> u16 copied = 0;
>
> mptcp_subflow_set_scheduled(subflow, false);
> @@ -2844,9 +2842,12 @@ static int __mptcp_push_retrans(struct sock
> *sk, struct mptcp_data_frag *dfrag)
> lock_sock(ssk);
>
> /* limit retransmission to the bytes already
> sent on some subflows */
> - info.sent = 0;
> + info.sent = offset;
> info.limit = READ_ONCE(msk->csum_enabled) ?
> dfrag->data_len :
>
> dfrag->already_sent;
> + DEBUG_NET_WARN_ON_ONCE(!before64(sent_seq,
> + dfrag-
> >data_seq +
> +
> info.limit));
When I tested NVMe MPTCP TLS based on this series, I encountered the
following error:
------------[ cut here ]------------
!before64(sent_seq, dfrag->data_seq + info.limit)
WARNING: net/mptcp/protocol.c:2879 at
__mptcp_push_retrans+0x562/0x6f0, CPU#5: kworker/5:2H/823
Modules linked in: mptcp_diag tcp_diag inet_diag sch_netem
CPU: 5 UID: 0 PID: 823 Comm: kworker/5:2H Not tainted 7.1.0-rc4+ #11
PREEMPT(full)
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: nvmet_tcp_wq nvmet_tcp_io_work
RIP: 0010:__mptcp_push_retrans+0x562/0x6f0
I printed the "info.limit" value at that time and found it to be 0:
if (!before64(sent_seq, dfrag->data_seq + info.limit))
pr_info("not ok sent_seq=%llu dfrag->data_seq=%llu info.limit=%u\n",
sent_seq, dfrag->data_seq, info.limit);
# Starting 4 threads
[ 358.532991][ T257] MPTCP: not ok sent_seq=4607903646123206498
dfrag->data_seq=4607903646123206498 info.limit=0
[ 360.946058][ T90] MPTCP: not ok sent_seq=12604043601802564894
dfrag->data_seq=12604043601802564894 info.limit=0
It seems we need to remove this DEBUG_NET_WARN_ON_ONCE().
Thanks,
-Geliang
>
> /*
> * make the whole retrans decision, xmit,
> disallow
> @@ -2890,45 +2891,97 @@ static void __mptcp_retrans(struct sock *sk)
> struct mptcp_sock *msk = mptcp_sk(sk);
> struct mptcp_subflow_context *subflow;
> struct mptcp_data_frag *dfrag;
> + bool retransmitted = false;
> + u64 retrans_seq;
> int err, len;
>
> - mptcp_clean_una_wakeup(sk);
> -
> - /* first check ssk: need to kick "stale" logic */
> - err = mptcp_sched_get_retrans(msk);
> + mptcp_data_lock(sk);
> + __mptcp_clean_una_wakeup(sk);
> + retrans_seq = msk->snd_una;
> dfrag = mptcp_rtx_head(sk);
> + mptcp_data_unlock(sk);
> + if (!dfrag)
> + goto check_data_fin;
> +
> + for (;;) {
> + bool already_retrans;
> + u64 sent_seq;
> +
> + /* The scheduler may clean the RTX queue. */
> + get_page(dfrag->page);
> +
> + /* The default scheduler will kick "stale" logic. */
> + err = mptcp_sched_get_retrans(msk);
> + if (err) {
> + put_page(dfrag->page);
> + break;
> + }
> +
> + /* Incoming acks can have moved retrans sequence
> after
> + * the current dfrag, if so try to start again from
> RTX head.
> + */
> + mptcp_data_lock(sk);
> + already_retrans = !dfrag->already_sent ||
> + !before64(msk->snd_una, dfrag-
> >data_seq +
> + dfrag->already_sent);
> + put_page(dfrag->page);
> + if (already_retrans) {
> + __mptcp_clean_una_wakeup(sk);
> + retrans_seq = msk->snd_una;
> + dfrag = mptcp_rtx_head(sk);
> + } else if (after64(msk->snd_una, retrans_seq)) {
> + retrans_seq = msk->snd_una;
> + }
> + mptcp_data_unlock(sk);
> + if (!dfrag)
> + break;
> +
> + len = __mptcp_push_retrans(sk, dfrag, retrans_seq);
> + if (len < 0)
> + goto clear_scheduled;
> +
> + retransmitted = true;
> + retrans_seq += len;
> + msk->bytes_retrans += len;
> + dfrag->already_sent = max_t(u16, dfrag-
> >already_sent,
> + retrans_seq - dfrag-
> >data_seq);
> +
> + /* With csum enabled retransmission can send new
> data. */
> + sent_seq = dfrag->already_sent + dfrag->data_seq;
> + if (after64(sent_seq, msk->snd_nxt))
> + msk->snd_nxt = sent_seq;
> +
> + /* Attempt the next fragment only if the current one
> is
> + * completely retransmitted.
> + */
> + if (before64(retrans_seq, dfrag->data_seq + dfrag-
> >data_len))
> + break;
> +
> + dfrag = list_is_last(&dfrag->list, &msk->rtx_queue)
> ?
> + NULL : list_next_entry(dfrag, list);
> + if (!dfrag || !dfrag->already_sent)
> + break;
> + }
> +
> + /* Data fin retransmission needed only if no data
> retransmission took
> + * place, and RTX queue is empty.
> + */
> +check_data_fin:
> if (!dfrag) {
> - if (mptcp_data_fin_enabled(msk)) {
> + if (!retransmitted && mptcp_data_fin_enabled(msk)) {
> struct inet_connection_sock *icsk =
> inet_csk(sk);
>
> WRITE_ONCE(icsk->icsk_retransmits,
> icsk->icsk_retransmits + 1);
> mptcp_set_datafin_timeout(sk);
> mptcp_send_ack(msk);
> -
> goto reset_timer;
> }
>
> if (!mptcp_send_head(sk))
> goto clear_scheduled;
> -
> - goto reset_timer;
> }
>
> - if (err)
> - goto reset_timer;
> -
> - len = __mptcp_push_retrans(sk, dfrag);
> - if (len < 0)
> - goto clear_scheduled;
> -
> - msk->bytes_retrans += len;
> - dfrag->already_sent = max(dfrag->already_sent, len);
> -
> - /* With csum enabled retransmission can send new data. */
> - if (after64(dfrag->already_sent + dfrag->data_seq, msk-
> >snd_nxt))
> - msk->snd_nxt = dfrag->already_sent + dfrag-
> >data_seq;
> -
> reset_timer:
> mptcp_check_and_set_pending(sk);
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure
2026-05-22 21:43 [PATCH v8 mptcp-next 0/9] mptcp: address stall under memory pressure Paolo Abeni
` (8 preceding siblings ...)
2026-05-22 21:43 ` [PATCH v8 mptcp-next 9/9] mptcp: let the retrans scheduler do its job Paolo Abeni
@ 2026-05-22 23:10 ` MPTCP CI
9 siblings, 0 replies; 30+ messages in thread
From: MPTCP CI @ 2026-05-22 23:10 UTC (permalink / raw)
To: Paolo Abeni; +Cc: mptcp
Hi Paolo,
Thank you for your modifications, that's great!
Our CI did some validations and here is its report:
- KVM Validation: normal (except selftest_mptcp_join): Success! ✅
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Unstable: 2 failed test(s): packetdrill_dss packetdrill_fastopen ⚠️
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/26314295414
Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/0bcab31fe736
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1099630
If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:
$ cd [kernel source code]
$ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
--pull always mptcp/mptcp-upstream-virtme-docker:latest \
auto-normal
For more details:
https://github.com/multipath-tcp/mptcp-upstream-virtme-docker
Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)
Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)
^ permalink raw reply [flat|nested] 30+ messages in thread