From: Jakub Kicinski
To: matttbe@kernel.org
Cc: Jakub Kicinski, martineau@kernel.org, davem@davemloft.net, netdev@vger.kernel.org, pabeni@redhat.com, fw@strlen.de, horms@kernel.org, edumazet@google.com, linux-kernel@vger.kernel.org, mptcp@lists.linux.dev, geliang@kernel.org, mhiramat@kernel.org, linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mathieu.desnoyers@efficios.com
Subject: Re: [net-next,1/2] mptcp: better mptcp-level RTT estimator
Date: Tue, 10 Mar 2026 19:45:47 -0700
Message-ID: <20260311024547.361027-1-kuba@kernel.org>
In-Reply-To: <20260309-net-next-mptcp-reduce-rbuf-v1-1-8f471206f9c5@kernel.org>
References: <20260309-net-next-mptcp-reduce-rbuf-v1-1-8f471206f9c5@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html

---
mptcp: better mptcp-level RTT estimator

The current MPTCP-level RTT estimator has several issues.

On high speed links, the MPTCP-level receive buffer auto-tuning happens
with a frequency well above the TCP-level one. That in turn can cause
excessive/unneeded receive buffer increases. On such links, the initial
rtt_us value is considerably higher than the actual delay, and the
current mptcp_rcv_space_adjust() updates msk->rcvq_space.rtt_us with a
period equal to that field's previous value. If the initial rtt_us is
40ms, its first update will happen only after 40ms, even if the
subflows see actual RTTs orders of magnitude lower.
Additionally:

- setting the msk RTT to the maximum among all the subflows' RTTs makes
  DRS constantly overshoot the rcvbuf size when one subflow has
  considerably higher latency than the other(s).

- during unidirectional bulk transfers with multiple active subflows,
  the TCP-level RTT estimator occasionally sees considerably higher
  values than the real link delay, i.e. when the packet scheduler
  reacts to an incoming ACK on a given subflow by pushing data on a
  different subflow.

- currently inactive but still open subflows (i.e. switched to backup
  mode) are always considered when computing the msk-level RTT.

Address all the issues above with a more accurate RTT estimation
strategy: the MPTCP-level RTT is set to the minimum of all the subflows
actually feeding data into the MPTCP receive buffer, using a small
sliding window.

> While at it, also use EWMA to compute the msk-level scaling_ratio, to
> that MPTCP can avoid traversing the subflow list is
> mptcp_rcv_space_adjust().

This isn't a bug, but there are two typos here that make the sentence
hard to parse. Should this be 'so that MPTCP can avoid traversing the
subflow list in mptcp_rcv_space_adjust()'?

Use some care to avoid updating msk and ssk level fields too often.

Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning")

[ ... ]

> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index 0bd1ee860316..6ec65c0ae655 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h

[ ...
]

> @@ -422,6 +435,27 @@ static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
> 	return msk->first_pending;
> }
>
> +static inline void mptcp_init_rtt_est(struct mptcp_sock *msk)
> +{
> +	int i;
> +
> +	for (i = 0; i < MPTCP_RTT_SAMPLES; ++i)
> +		msk->rcv_rtt_est.samples[i] = U32_MAX;
> +	msk->rcv_rtt_est.next_sample = 0;
> +	msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
> +}
> +
> +static inline u32 mptcp_rtt_us_est(const struct mptcp_sock *msk)
> +{
> +	u32 rtt_us = msk->rcv_rtt_est.samples[0];
	                 ^^^^

Is samples[0] missing a READ_ONCE()? The function comment says
'Lockless access of collected samples' and the writer
mptcp_rcv_rtt_update() uses WRITE_ONCE() for all sample indices
(including index 0) with a comment 'Pairs with READ_ONCE() in
mptcp_rtt_us_est()':

net/mptcp/protocol.c:mptcp_rcv_rtt_update()
{
	...
	/* Pairs with READ_ONCE() in mptcp_rtt_us_est(). */
	id = msk->rcv_rtt_est.next_sample;
	WRITE_ONCE(msk->rcv_rtt_est.samples[id], rtt_us);
	...
}

However, samples[0] is read with a plain load while samples[1..4]
correctly use READ_ONCE():

> +	int i;
> +
> +	/* Lockless access of collected samples. */
> +	for (i = 1; i < MPTCP_RTT_SAMPLES; ++i)
> +		rtt_us = min(rtt_us, READ_ONCE(msk->rcv_rtt_est.samples[i]));
> +	return rtt_us;
> +}

The writer runs under mptcp_data_lock (spin_lock_bh) while the reader
runs under the socket lock (msk_owned_by_me) -- these are different
locks, so WRITE_ONCE/READ_ONCE pairing is required. KCSAN would flag
this as a data race.

> +
>  static inline struct mptcp_data_frag *mptcp_send_next(struct sock *sk)
>  {
>  	struct mptcp_sock *msk = mptcp_sk(sk);

[ ... ]

> @@ -523,6 +557,8 @@ struct mptcp_subflow_context {
> 	u32 map_data_len;
> 	__wsum map_data_csum;
> 	u32 map_csum_len;
> +	u32 prev_rtt_us;
	    ^^^^

Is this field used anywhere? The field 'prev_rtt_us' is added to
struct mptcp_subflow_context but doesn't appear to be referenced
anywhere in the codebase. Only 'prev_rtt_seq' is used in
mptcp_rcv_rtt_update().
> +	u32 prev_rtt_seq;
> 	u32 request_mptcp : 1,  /* send MP_CAPABLE */
> 		request_join : 1,   /* send MP_JOIN */
> 		request_bkup : 1,

--
pw-bot: cr