From: Jakub Kicinski <kuba@kernel.org>
To: matttbe@kernel.org
Cc: Jakub Kicinski <kuba@kernel.org>,
martineau@kernel.org, davem@davemloft.net,
netdev@vger.kernel.org, pabeni@redhat.com, fw@strlen.de,
horms@kernel.org, edumazet@google.com,
linux-kernel@vger.kernel.org, mptcp@lists.linux.dev,
geliang@kernel.org, mhiramat@kernel.org,
linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org,
mathieu.desnoyers@efficios.com
Subject: Re: [net-next,1/2] mptcp: better mptcp-level RTT estimator
Date: Tue, 10 Mar 2026 19:45:47 -0700 [thread overview]
Message-ID: <20260311024547.361027-1-kuba@kernel.org> (raw)
In-Reply-To: <20260309-net-next-mptcp-reduce-rbuf-v1-1-8f471206f9c5@kernel.org>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
mptcp: better mptcp-level RTT estimator
The current MPTCP-level RTT estimator has several issues. On high speed
links, the MPTCP-level receive buffer auto-tuning happens with a
frequency well above the TCP-level's one. That in turn can cause
excessive/unneeded receive buffer increase.
On such links, the initial rtt_us value is considerably higher than the
actual delay, and the current mptcp_rcv_space_adjust() updates
msk->rcvq_space.rtt_us with a period equal to the such field previous
value. If the initial rtt_us is 40ms, its first update will happen after
40ms, even if the subflows see actual RTT orders of magnitude lower.
Additionally:
- setting the msk RTT to the maximum among all the subflows RTTs makes
DRS constantly overshooting the rcvbuf size when a subflow has
considerable higher latency than the other(s).
- during unidirectional bulk transfers with multiple active subflows,
the TCP-level RTT estimator occasionally sees considerably higher
value than the real link delay, i.e. when the packet scheduler reacts
to an incoming ACK on given subflow pushing data on a different
subflow.
- currently inactive but still open subflows (i.e. switched to backup
mode) are always considered when computing the msk-level RTT.
Address the all the issues above with a more accurate RTT estimation
strategy: the MPTCP-level RTT is set to the minimum of all the subflows
actually feeding data into the MPTCP receive buffer, using a small
sliding window.
> While at it, also use EWMA to compute the msk-level scaling_ratio, to
> that MPTCP can avoid traversing the subflow list is
> mptcp_rcv_space_adjust().
This isn't a bug, but there are two typos here that make the sentence
hard to parse. Should this be 'so that MPTCP can avoid traversing the
subflow list in mptcp_rcv_space_adjust()'?
Use some care to avoid updating msk and ssk level fields too often.
Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning")
[ ... ]
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index 0bd1ee860316..6ec65c0ae655 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
[ ... ]
> @@ -422,6 +435,27 @@ static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
> return msk->first_pending;
> }
>
> +static inline void mptcp_init_rtt_est(struct mptcp_sock *msk)
> +{
> + int i;
> +
> + for (i = 0; i < MPTCP_RTT_SAMPLES; ++i)
> + msk->rcv_rtt_est.samples[i] = U32_MAX;
> + msk->rcv_rtt_est.next_sample = 0;
> + msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
> +}
> +
> +static inline u32 mptcp_rtt_us_est(const struct mptcp_sock *msk)
> +{
> + u32 rtt_us = msk->rcv_rtt_est.samples[0];
^^^^
Is samples[0] missing a READ_ONCE()?
The function comment says 'Lockless access of collected samples' and the
writer mptcp_rcv_rtt_update() uses WRITE_ONCE() for all sample indices
(including index 0) with a comment 'Pairs with READ_ONCE() in
mptcp_rtt_us_est()':
net/mptcp/protocol.c:mptcp_rcv_rtt_update() {
...
/* Pairs with READ_ONCE() in mptcp_rtt_us_est(). */
id = msk->rcv_rtt_est.next_sample;
WRITE_ONCE(msk->rcv_rtt_est.samples[id], rtt_us);
...
}
However, samples[0] is read with a plain load while samples[1..4] correctly
use READ_ONCE():
> + int i;
> +
> + /* Lockless access of collected samples. */
> + for (i = 1; i < MPTCP_RTT_SAMPLES; ++i)
> + rtt_us = min(rtt_us, READ_ONCE(msk->rcv_rtt_est.samples[i]));
> + return rtt_us;
> +}
The writer runs under mptcp_data_lock (spin_lock_bh) while the reader runs
under the socket lock (msk_owned_by_me) -- these are different locks, so
WRITE_ONCE/READ_ONCE pairing is required. KCSAN would flag this as a data
race.
> +
> static inline struct mptcp_data_frag *mptcp_send_next(struct sock *sk)
> {
> struct mptcp_sock *msk = mptcp_sk(sk);
[ ... ]
> @@ -523,6 +557,8 @@ struct mptcp_subflow_context {
> u32 map_data_len;
> __wsum map_data_csum;
> u32 map_csum_len;
> + u32 prev_rtt_us;
^^^^
Is this field used anywhere?
The field 'prev_rtt_us' is added to struct mptcp_subflow_context but doesn't
appear to be referenced anywhere in the codebase. Only 'prev_rtt_seq' is used
in mptcp_rcv_rtt_update().
> + u32 prev_rtt_seq;
> u32 request_mptcp : 1, /* send MP_CAPABLE */
> request_join : 1, /* send MP_JOIN */
> request_bkup : 1,
--
pw-bot: cr
next prev parent reply other threads:[~2026-03-11 2:45 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-09 15:56 [PATCH net-next 0/2] mptcp: autotune related improvement Matthieu Baerts (NGI0)
2026-03-09 15:56 ` [PATCH net-next 1/2] mptcp: better mptcp-level RTT estimator Matthieu Baerts (NGI0)
2026-03-11 2:45 ` Jakub Kicinski [this message]
2026-03-11 16:27 ` [net-next,1/2] " Matthieu Baerts
2026-03-09 15:56 ` [PATCH net-next 2/2] mptcp: add receive queue awareness in tcp_rcv_space_adjust() Matthieu Baerts (NGI0)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260311024547.361027-1-kuba@kernel.org \
--to=kuba@kernel.org \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=fw@strlen.de \
--cc=geliang@kernel.org \
--cc=horms@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=martineau@kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=matttbe@kernel.org \
--cc=mhiramat@kernel.org \
--cc=mptcp@lists.linux.dev \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=rostedt@goodmis.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.