Re: [PATCHv3 net-next] tcp: Add TCP ROCCET congestion control module.

Netdev List
 help / color / mirror / Atom feed

From: Jakub Kicinski <kuba@kernel.org>
To: t.fuechsel@gmx.de
Cc: Jakub Kicinski <kuba@kernel.org>,
	davem@davemloft.net, dsahern@kernel.org, edumazet@google.com,
	pabeni@redhat.com, horms@kernel.org, ncardwell@google.com,
	kuniyu@google.com, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, fmancera@suse.de, ebiggers@kernel.org,
	bpf@vger.kernel.org, lukas.prause@ikt.uni-hannover.de
Subject: Re: [PATCHv3 net-next] tcp: Add TCP ROCCET congestion control module.
Date: Mon,  8 Jun 2026 19:06:20 -0700	[thread overview]
Message-ID: <20260609020620.787588-1-kuba@kernel.org> (raw)
In-Reply-To: <ah_ZFYB_tiw45_fg@volt-roccet-vm>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
tcp: Add TCP ROCCET congestion control module.

This patch adds a new TCP congestion control algorithm derived from
TCP CUBIC, called ROCCET, which adds bufferbloat detection based on
srRTT and ACK arrival rate, and replaces HyStart with a custom slow
start phase called LAUNCH. A few observations follow.

[Low]
The commit message says:

  The second metric is the acknowledgment arrival rate sampled over 5
  RTT intervals.

Looking at update_ack_rate(), the sampling window is a fixed 100 ms
wall-clock interval (u32 interval = USEC_PER_MSEC * 100), not a
multiple of the connection's RTT. The 5*curr_rtt window only governs
ca->next_srrtt_check. Should the commit message match the
implementation, or should the implementation be changed to actually
sample over 5 RTT intervals?

[Low]
Could the commit message also discuss why ROCCET cannot be implemented
as a BPF struct_ops program on top of CUBIC, given that this same
patch already registers BTF kfuncs for the struct_ops interface and
that most of the file is a verbatim copy of CUBIC?

> diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
> index 301b47660305..23197958324b 100644
> --- a/net/ipv4/Kconfig
> +++ b/net/ipv4/Kconfig
> @@ -663,6 +663,18 @@ config TCP_CONG_CDG
>  	    delay gradients." In Networking 2011. Preprint:
>  	    http://caia.swin.edu.au/cv/dahayes/content/networking2011-cdg-preprint.pdf
>  
> +config TCP_CONG_ROCCET
> +	tristate "ROCCET TCP"
> +	default n
> +	help
> +	  TCP ROCCET is a sender-side only modification of the TCP CUBIC
> +	  protocol stack/TCP CUBIC congestion control algorithm that
> +	  optimizes the performance of TCP congestion control. Especially
> +	  for networks with large buffers (wireless, cellular networks),
> +	  TCP ROCCET has improved performance by maintaining similar
> +	  throughput as CUBIC while reducing the latency.
> +	  For more information, see: https://arxiv.org/abs/2510.25281

[Low]
The help text describes ROCCET as a "modification of the TCP CUBIC
protocol stack/TCP CUBIC congestion control algorithm", but CUBIC
itself is not modified by this patch; ROCCET is a new sibling module
that copies large parts of CUBIC. The acronym ROCCET is also never
expanded here. Should the help text be reworded?

> diff --git a/net/ipv4/tcp_roccet.h b/net/ipv4/tcp_roccet.h
> new file mode 100644
> index 000000000000..5025d14667bb
> --- /dev/null
> +++ b/net/ipv4/tcp_roccet.h
> @@ -0,0 +1,46 @@

[Low]
Is the separate header net/ipv4/tcp_roccet.h needed? It only declares
struct ack_rate and struct roccettcp, both of which are used solely
inside tcp_roccet.c. Other in-tree CCs (tcp_cubic.c, tcp_bbr.c,
tcp_vegas.c, tcp_dctcp.c, tcp_bic.c, tcp_cdg.c, ...) keep their
per-socket state in the .c file.

> +struct roccettcp {
> +	u32 cnt;		/* increase cwnd by 1 after ACKs */
[ ... ]
> +	u32 curr_rtt;		/* the minimum rtt of current round */

[Low]
The comment says "the minimum rtt of current round", but in
roccettcp_acked() curr_rtt is assigned the latest sample
(ca->curr_rtt = sample->rtt_us;) without any min-of-round logic. Was
the comment intended to be copied from tcp_cubic.c, where curr_rtt
has different semantics?

> diff --git a/net/ipv4/tcp_roccet.c b/net/ipv4/tcp_roccet.c
> new file mode 100644
> index 000000000000..900329bb070b
> --- /dev/null
> +++ b/net/ipv4/tcp_roccet.c
[ ... ]
> +/* Parameters that are specific to the ROCCET-Algorithm */
> +static int sr_rtt_upper_bound __read_mostly = 100;
> +static int ack_rate_diff_ss __read_mostly = 10;
> +
> +module_param(sr_rtt_upper_bound, int, 0644);
> +MODULE_PARM_DESC(sr_rtt_upper_bound, "ROCCET's upper bound for srRTT.");
> +module_param(ack_rate_diff_ss, int, 0644);
> +MODULE_PARM_DESC(ack_rate_diff_ss,
> +		 "ROCCET's threshold to exit slow start if ACK-rate defer by given amount of segments.");
> +
> +static int fast_convergence __read_mostly = 1;
> +static int beta __read_mostly = 717; /* = 717/1024 (BICTCP_BETA_SCALE) */
> +static int initial_ssthresh __read_mostly;
> +static int bic_scale __read_mostly = 41;
> +static int tcp_friendliness __read_mostly = 1;

[Low]
sr_rtt_upper_bound and ack_rate_diff_ss are declared signed int and
exposed as 0644 module params, but they are compared to and added to
u32 quantities (curr_srrtt, get_ack_rate_diff(), roccet_xj). If a
negative value is written, the signed-to-unsigned promotion turns it
into a very large unsigned, effectively disabling the bufferbloat
detection branch. Should these be unsigned int with a setter that
rejects 0 / negative values?

[Low]
Is there a way to avoid adding seven new module_param tunables to a
new in-tree CC? Five of them duplicate CUBIC's identically named
knobs (fast_convergence, beta, initial_ssthresh, bic_scale,
tcp_friendliness), which makes it hard for an admin to reason about
which knob applies to which CC, and these are global (not per-netns)
unlike the sysctls used by other modern CCs.

> +static __always_inline void update_min_rtt(struct sock *sk)
> +{
> +	struct roccettcp *ca = inet_csk_ca(sk);
> +
> +	/* Check if new lower min RTT was found. If so, set it directly */
> +	if (ca->curr_rtt < ca->curr_min_rtt)
> +		ca->curr_min_rtt = max(ca->curr_rtt, 1);
> +}

[High]
Can update_min_rtt() and update_srrtt() leave curr_min_rtt at 0 and
crash the kernel via divide-by-zero in roccettcp_cong_avoid()?

After roccettcp_reset() (called from roccettcp_init() and from
roccettcp_state(TCP_CA_Loss)) curr_rtt is 0 and curr_min_rtt is ~0U.
On the next ACK that reaches roccettcp_cong_avoid() before
roccettcp_acked() has set curr_rtt to a non-zero value (the early
return in roccettcp_acked() suppresses updates for HZ jiffies after
fast recovery), update_min_rtt() does:

	if (ca->curr_rtt < ca->curr_min_rtt)
		ca->curr_min_rtt = max(ca->curr_rtt, 1);

so curr_min_rtt is clamped to 1. Then update_srrtt() runs:

	if (ca->curr_rtt < ca->curr_min_rtt)
		ca->curr_min_rtt = ca->curr_rtt;

which overwrites curr_min_rtt with 0 (curr_rtt). update_srrtt()
returns early on its own divide-by-zero guard, but cong_avoid keeps
running and reaches:

	roccet_xj = ((jitter * 100) / ca->curr_min_rtt) +
		    sr_rtt_upper_bound;

and divides by 0.

Once curr_min_rtt becomes 0, the unsigned compare
(ca->curr_rtt < ca->curr_min_rtt) is false for any later positive
curr_rtt, so curr_min_rtt stays 0 for the lifetime of the socket. Is
that intended?

> +/* Compute srRTT.
> + */
> +static __always_inline void update_srrtt(struct sock *sk)
> +{
[ ... ]
> +	u32 rrtt = (100 * (ca->curr_rtt - ca->curr_min_rtt)) /
> +		   ca->curr_min_rtt;

[Medium]
Can this overflow? curr_rtt and curr_min_rtt are microseconds and can
reach values close to TCP_RTO_MAX (~120 s). 100 * (curr_rtt -
curr_min_rtt) is computed in u32 and wraps once the difference
exceeds about 42.9 ms. The wrapped rrtt is then folded into
curr_srrtt and drives the LAUNCH/ORBITER decisions. Should this be
promoted to u64 with div_u64()?

[ ... ]
> +__bpf_kfunc static void roccettcp_cwnd_event(struct sock *sk,
> +					     enum tcp_ca_event event)
> +{
> +	if (event == CA_EVENT_TX_START) {
> +		struct roccettcp *ca = inet_csk_ca(sk);
> +		u32 now = tcp_jiffies32;
> +		s32 delta;
> +
> +		delta = now - tcp_sk(sk)->lsndtime;
> +
> +		/* We were application limited (idle) for a while.
> +		 * Shift epoch_start to keep cwnd growth to cubic curve.
> +		 */
> +		if (ca->epoch_start && delta > 0) {
> +			ca->epoch_start += delta;
> +			if (after(ca->epoch_start, now))
> +				ca->epoch_start = now;
> +		}
> +		return;
> +	}
> +}

[High]
Is this body reachable? In include/net/tcp.h tcp_ca_event() dispatches
CA_EVENT_TX_START exclusively to .cwnd_event_tx_start and returns
before calling .cwnd_event:

	if (event == CA_EVENT_TX_START) {
		if (icsk->icsk_ca_ops->cwnd_event_tx_start)
			icsk->icsk_ca_ops->cwnd_event_tx_start(sk);
		return;
	}

tcp_cubic.c handles this by registering cubictcp_cwnd_event_tx_start
in .cwnd_event_tx_start. With the registration here only setting
.cwnd_event = roccettcp_cwnd_event, the entire idle-period
epoch_start adjustment is dead code, and on a transition out of
application-limited idle bictcp_update() will compute t with a stale
epoch_start, which then propagates through:

	delta = (cube_rtt_scale * offs * offs * offs) >> (10 + 3 * BICTCP_HZ);

producing a large bic_target/cnt. Should this be wired to
.cwnd_event_tx_start instead?

[ ... ]
> +/* calculate the cubic root of x using a table lookup followed by one
> + * Newton-Raphson iteration.
> + * Avg err ~= 0.195%
> + */
> +static u32 cubic_root(u64 a)
[ ... ]
> +/* Compute congestion window to use.
> + */
> +static __always_inline void bictcp_update(struct roccettcp *ca, u32 cwnd,
> +					  u32 acked)

[Low]
cubic_root(), bictcp_update(), the module_param block, and the
precomputation in roccettcp_register() appear to be byte-for-byte
copies of tcp_cubic.c. Should these be factored as shared helpers in
a common file (or ROCCET implemented as a BPF struct_ops on top of
CUBIC) so that future numerical fixes to CUBIC propagate to ROCCET?

[ ... ]
> +__bpf_kfunc static void roccettcp_cong_avoid(struct sock *sk, u32 ack,
> +					     u32 acked)
> +{
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	struct roccettcp *ca = inet_csk_ca(sk);
> +
> +	u32 now = jiffies_to_usecs(tcp_jiffies32);
> +	bool evaluate_srrtt = false;
> +	u32 roccet_xj;
> +	u32 jitter;
> +
> +	if (after(ca->last_rtt, ca->curr_rtt))
> +		jitter = ca->last_rtt - ca->curr_rtt;
> +	else
> +		jitter = ca->curr_rtt - ca->last_rtt;

[Low]
after() is the TCP sequence-number macro from include/net/tcp.h.
Using it here for an RTT-microsecond comparison works (it expands to
a signed 32-bit subtraction), but it is a sequence-space helper and
reads as a layering mistake. The same file uses raw subtraction with
<= a few lines below ("now - ca->roccet_last_event_time_us <= 100 *
USEC_PER_MSEC"), so the time arithmetic is internally inconsistent.
Could the after() uses be replaced with explicit (s32) subtractions?

[Low]
On the very first call after a reset, ca->last_rtt is 0, so this
computes jitter = |curr_rtt - 0| = curr_rtt, i.e. an entire RTT
worth of "jitter". That value then feeds roccet_xj and loosens the
bufferbloat-detection threshold during the LAUNCH phase the
algorithm is supposed to keep tightest. Should the first sample be
suppressed (e.g., skip jitter computation while last_rtt == 0)?

[ ... ]
> +	roccet_xj = ((jitter * 100) / ca->curr_min_rtt) +
> +		    sr_rtt_upper_bound;
> +	if (roccet_xj < sr_rtt_upper_bound)
> +		roccet_xj = sr_rtt_upper_bound;

[Medium]
Same overflow class as update_srrtt(): jitter * 100 is computed in
u32. With jitter above ~42.9 ms in microseconds the multiplication
wraps, and the subsequent clamp "if (roccet_xj < sr_rtt_upper_bound)"
only catches a subset of the wrap cases. The wrapped value drives
whether the cwnd-reduction branch fires. Should jitter * 100 be
performed in u64?

[ ... ]
> +static int __init roccettcp_register(void)
> +{
> +	int ret;
> +
> +	BUILD_BUG_ON(sizeof(struct roccettcp) > ICSK_CA_PRIV_SIZE);
[ ... ]
> +	beta_scale =
> +		8 * (BICTCP_BETA_SCALE + beta) / 3 / (BICTCP_BETA_SCALE - beta);
[ ... ]
> +	/* divide by bic_scale and by constant Srtt (100ms) */
> +	do_div(cube_factor, bic_scale * 10);

[Low]
Can these divides panic at module load if bic_scale=0 or beta=1024 is
passed via modprobe? do_div(cube_factor, bic_scale * 10) divides by 0
when bic_scale is 0, and beta_scale divides by (BICTCP_BETA_SCALE -
beta) which is 0 when beta is 1024. The same shape exists in
tcp_cubic.c, but should the new file sanity-check these inputs
before computing the precomputed factors?

[Low]
MODULE_VERSION("1.0") is more typical of out-of-tree DKMS modules
than of in-tree code; should it be dropped here?
-- 
pw-bot: cr

     prev parent reply	other threads:[~2026-06-09  2:06 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-03  7:34 [PATCHv3 net-next] tcp: Add TCP ROCCET congestion control module Tim Fuechsel
2026-06-09  2:06 ` Jakub Kicinski [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260609020620.787588-1-kuba@kernel.org \
    --to=kuba@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=ebiggers@kernel.org \
    --cc=edumazet@google.com \
    --cc=fmancera@suse.de \
    --cc=horms@kernel.org \
    --cc=kuniyu@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lukas.prause@ikt.uni-hannover.de \
    --cc=ncardwell@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=t.fuechsel@gmx.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox