From: Jakub Kicinski <kuba@kernel.org>
To: t.fuechsel@gmx.de
Cc: Jakub Kicinski <kuba@kernel.org>,
davem@davemloft.net, dsahern@kernel.org, edumazet@google.com,
pabeni@redhat.com, horms@kernel.org, ncardwell@google.com,
kuniyu@google.com, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, fmancera@suse.de, ebiggers@kernel.org,
bpf@vger.kernel.org, lukas.prause@ikt.uni-hannover.de
Subject: Re: [PATCHv3 net-next] tcp: Add TCP ROCCET congestion control module.
Date: Mon, 8 Jun 2026 19:06:20 -0700 [thread overview]
Message-ID: <20260609020620.787588-1-kuba@kernel.org> (raw)
In-Reply-To: <ah_ZFYB_tiw45_fg@volt-roccet-vm>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
tcp: Add TCP ROCCET congestion control module.
This patch adds a new TCP congestion control algorithm derived from
TCP CUBIC, called ROCCET, which adds bufferbloat detection based on
srRTT and ACK arrival rate, and replaces HyStart with a custom slow
start phase called LAUNCH. A few observations follow.
[Low]
The commit message says:
The second metric is the acknowledgment arrival rate sampled over 5
RTT intervals.
Looking at update_ack_rate(), the sampling window is a fixed 100 ms
wall-clock interval (u32 interval = USEC_PER_MSEC * 100), not a
multiple of the connection's RTT. The 5*curr_rtt window only governs
ca->next_srrtt_check. Should the commit message match the
implementation, or should the implementation be changed to actually
sample over 5 RTT intervals?
[Low]
Could the commit message also discuss why ROCCET cannot be implemented
as a BPF struct_ops program on top of CUBIC, given that this same
patch already registers BTF kfuncs for the struct_ops interface and
that most of the file is a verbatim copy of CUBIC?
> diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
> index 301b47660305..23197958324b 100644
> --- a/net/ipv4/Kconfig
> +++ b/net/ipv4/Kconfig
> @@ -663,6 +663,18 @@ config TCP_CONG_CDG
> delay gradients." In Networking 2011. Preprint:
> http://caia.swin.edu.au/cv/dahayes/content/networking2011-cdg-preprint.pdf
>
> +config TCP_CONG_ROCCET
> + tristate "ROCCET TCP"
> + default n
> + help
> + TCP ROCCET is a sender-side only modification of the TCP CUBIC
> + protocol stack/TCP CUBIC congestion control algorithm that
> + optimizes the performance of TCP congestion control. Especially
> + for networks with large buffers (wireless, cellular networks),
> + TCP ROCCET has improved performance by maintaining similar
> + throughput as CUBIC while reducing the latency.
> + For more information, see: https://arxiv.org/abs/2510.25281
[Low]
The help text describes ROCCET as a "modification of the TCP CUBIC
protocol stack/TCP CUBIC congestion control algorithm", but CUBIC
itself is not modified by this patch; ROCCET is a new sibling module
that copies large parts of CUBIC. The acronym ROCCET is also never
expanded here. Should the help text be reworded?
> diff --git a/net/ipv4/tcp_roccet.h b/net/ipv4/tcp_roccet.h
> new file mode 100644
> index 000000000000..5025d14667bb
> --- /dev/null
> +++ b/net/ipv4/tcp_roccet.h
> @@ -0,0 +1,46 @@
[Low]
Is the separate header net/ipv4/tcp_roccet.h needed? It only declares
struct ack_rate and struct roccettcp, both of which are used solely
inside tcp_roccet.c. Other in-tree CCs (tcp_cubic.c, tcp_bbr.c,
tcp_vegas.c, tcp_dctcp.c, tcp_bic.c, tcp_cdg.c, ...) keep their
per-socket state in the .c file.
> +struct roccettcp {
> + u32 cnt; /* increase cwnd by 1 after ACKs */
[ ... ]
> + u32 curr_rtt; /* the minimum rtt of current round */
[Low]
The comment says "the minimum rtt of current round", but in
roccettcp_acked() curr_rtt is assigned the latest sample
(ca->curr_rtt = sample->rtt_us;) without any min-of-round logic. Was
the comment intended to be copied from tcp_cubic.c, where curr_rtt
has different semantics?
> diff --git a/net/ipv4/tcp_roccet.c b/net/ipv4/tcp_roccet.c
> new file mode 100644
> index 000000000000..900329bb070b
> --- /dev/null
> +++ b/net/ipv4/tcp_roccet.c
[ ... ]
> +/* Parameters that are specific to the ROCCET-Algorithm */
> +static int sr_rtt_upper_bound __read_mostly = 100;
> +static int ack_rate_diff_ss __read_mostly = 10;
> +
> +module_param(sr_rtt_upper_bound, int, 0644);
> +MODULE_PARM_DESC(sr_rtt_upper_bound, "ROCCET's upper bound for srRTT.");
> +module_param(ack_rate_diff_ss, int, 0644);
> +MODULE_PARM_DESC(ack_rate_diff_ss,
> + "ROCCET's threshold to exit slow start if ACK-rate defer by given amount of segments.");
> +
> +static int fast_convergence __read_mostly = 1;
> +static int beta __read_mostly = 717; /* = 717/1024 (BICTCP_BETA_SCALE) */
> +static int initial_ssthresh __read_mostly;
> +static int bic_scale __read_mostly = 41;
> +static int tcp_friendliness __read_mostly = 1;
[Low]
sr_rtt_upper_bound and ack_rate_diff_ss are declared signed int and
exposed as 0644 module params, but they are compared to and added to
u32 quantities (curr_srrtt, get_ack_rate_diff(), roccet_xj). If a
negative value is written, the signed-to-unsigned promotion turns it
into a very large unsigned, effectively disabling the bufferbloat
detection branch. Should these be unsigned int with a setter that
rejects 0 / negative values?
[Low]
Is there a way to avoid adding seven new module_param tunables to a
new in-tree CC? Five of them duplicate CUBIC's identically named
knobs (fast_convergence, beta, initial_ssthresh, bic_scale,
tcp_friendliness), which makes it hard for an admin to reason about
which knob applies to which CC, and these are global (not per-netns)
unlike the sysctls used by other modern CCs.
> +static __always_inline void update_min_rtt(struct sock *sk)
> +{
> + struct roccettcp *ca = inet_csk_ca(sk);
> +
> + /* Check if new lower min RTT was found. If so, set it directly */
> + if (ca->curr_rtt < ca->curr_min_rtt)
> + ca->curr_min_rtt = max(ca->curr_rtt, 1);
> +}
[High]
Can update_min_rtt() and update_srrtt() leave curr_min_rtt at 0 and
crash the kernel via divide-by-zero in roccettcp_cong_avoid()?
After roccettcp_reset() (called from roccettcp_init() and from
roccettcp_state(TCP_CA_Loss)) curr_rtt is 0 and curr_min_rtt is ~0U.
On the next ACK that reaches roccettcp_cong_avoid() before
roccettcp_acked() has set curr_rtt to a non-zero value (the early
return in roccettcp_acked() suppresses updates for HZ jiffies after
fast recovery), update_min_rtt() does:
if (ca->curr_rtt < ca->curr_min_rtt)
ca->curr_min_rtt = max(ca->curr_rtt, 1);
so curr_min_rtt is clamped to 1. Then update_srrtt() runs:
if (ca->curr_rtt < ca->curr_min_rtt)
ca->curr_min_rtt = ca->curr_rtt;
which overwrites curr_min_rtt with 0 (curr_rtt). update_srrtt()
returns early on its own divide-by-zero guard, but cong_avoid keeps
running and reaches:
roccet_xj = ((jitter * 100) / ca->curr_min_rtt) +
sr_rtt_upper_bound;
and divides by 0.
Once curr_min_rtt becomes 0, the unsigned compare
(ca->curr_rtt < ca->curr_min_rtt) is false for any later positive
curr_rtt, so curr_min_rtt stays 0 for the lifetime of the socket. Is
that intended?
> +/* Compute srRTT.
> + */
> +static __always_inline void update_srrtt(struct sock *sk)
> +{
[ ... ]
> + u32 rrtt = (100 * (ca->curr_rtt - ca->curr_min_rtt)) /
> + ca->curr_min_rtt;
[Medium]
Can this overflow? curr_rtt and curr_min_rtt are microseconds and can
reach values close to TCP_RTO_MAX (~120 s). 100 * (curr_rtt -
curr_min_rtt) is computed in u32 and wraps once the difference
exceeds about 42.9 ms. The wrapped rrtt is then folded into
curr_srrtt and drives the LAUNCH/ORBITER decisions. Should this be
promoted to u64 with div_u64()?
[ ... ]
> +__bpf_kfunc static void roccettcp_cwnd_event(struct sock *sk,
> + enum tcp_ca_event event)
> +{
> + if (event == CA_EVENT_TX_START) {
> + struct roccettcp *ca = inet_csk_ca(sk);
> + u32 now = tcp_jiffies32;
> + s32 delta;
> +
> + delta = now - tcp_sk(sk)->lsndtime;
> +
> + /* We were application limited (idle) for a while.
> + * Shift epoch_start to keep cwnd growth to cubic curve.
> + */
> + if (ca->epoch_start && delta > 0) {
> + ca->epoch_start += delta;
> + if (after(ca->epoch_start, now))
> + ca->epoch_start = now;
> + }
> + return;
> + }
> +}
[High]
Is this body reachable? In include/net/tcp.h tcp_ca_event() dispatches
CA_EVENT_TX_START exclusively to .cwnd_event_tx_start and returns
before calling .cwnd_event:
if (event == CA_EVENT_TX_START) {
if (icsk->icsk_ca_ops->cwnd_event_tx_start)
icsk->icsk_ca_ops->cwnd_event_tx_start(sk);
return;
}
tcp_cubic.c handles this by registering cubictcp_cwnd_event_tx_start
in .cwnd_event_tx_start. With the registration here only setting
.cwnd_event = roccettcp_cwnd_event, the entire idle-period
epoch_start adjustment is dead code, and on a transition out of
application-limited idle bictcp_update() will compute t with a stale
epoch_start, which then propagates through:
delta = (cube_rtt_scale * offs * offs * offs) >> (10 + 3 * BICTCP_HZ);
producing a large bic_target/cnt. Should this be wired to
.cwnd_event_tx_start instead?
[ ... ]
> +/* calculate the cubic root of x using a table lookup followed by one
> + * Newton-Raphson iteration.
> + * Avg err ~= 0.195%
> + */
> +static u32 cubic_root(u64 a)
[ ... ]
> +/* Compute congestion window to use.
> + */
> +static __always_inline void bictcp_update(struct roccettcp *ca, u32 cwnd,
> + u32 acked)
[Low]
cubic_root(), bictcp_update(), the module_param block, and the
precomputation in roccettcp_register() appear to be byte-for-byte
copies of tcp_cubic.c. Should these be factored as shared helpers in
a common file (or ROCCET implemented as a BPF struct_ops on top of
CUBIC) so that future numerical fixes to CUBIC propagate to ROCCET?
[ ... ]
> +__bpf_kfunc static void roccettcp_cong_avoid(struct sock *sk, u32 ack,
> + u32 acked)
> +{
> + struct tcp_sock *tp = tcp_sk(sk);
> + struct roccettcp *ca = inet_csk_ca(sk);
> +
> + u32 now = jiffies_to_usecs(tcp_jiffies32);
> + bool evaluate_srrtt = false;
> + u32 roccet_xj;
> + u32 jitter;
> +
> + if (after(ca->last_rtt, ca->curr_rtt))
> + jitter = ca->last_rtt - ca->curr_rtt;
> + else
> + jitter = ca->curr_rtt - ca->last_rtt;
[Low]
after() is the TCP sequence-number macro from include/net/tcp.h.
Using it here for an RTT-microsecond comparison works (it expands to
a signed 32-bit subtraction), but it is a sequence-space helper and
reads as a layering mistake. The same file uses raw subtraction with
<= a few lines below ("now - ca->roccet_last_event_time_us <= 100 *
USEC_PER_MSEC"), so the time arithmetic is internally inconsistent.
Could the after() uses be replaced with explicit (s32) subtractions?
[Low]
On the very first call after a reset, ca->last_rtt is 0, so this
computes jitter = |curr_rtt - 0| = curr_rtt, i.e. an entire RTT
worth of "jitter". That value then feeds roccet_xj and loosens the
bufferbloat-detection threshold during the LAUNCH phase the
algorithm is supposed to keep tightest. Should the first sample be
suppressed (e.g., skip jitter computation while last_rtt == 0)?
[ ... ]
> + roccet_xj = ((jitter * 100) / ca->curr_min_rtt) +
> + sr_rtt_upper_bound;
> + if (roccet_xj < sr_rtt_upper_bound)
> + roccet_xj = sr_rtt_upper_bound;
[Medium]
Same overflow class as update_srrtt(): jitter * 100 is computed in
u32. With jitter above ~42.9 ms in microseconds the multiplication
wraps, and the subsequent clamp "if (roccet_xj < sr_rtt_upper_bound)"
only catches a subset of the wrap cases. The wrapped value drives
whether the cwnd-reduction branch fires. Should jitter * 100 be
performed in u64?
[ ... ]
> +static int __init roccettcp_register(void)
> +{
> + int ret;
> +
> + BUILD_BUG_ON(sizeof(struct roccettcp) > ICSK_CA_PRIV_SIZE);
[ ... ]
> + beta_scale =
> + 8 * (BICTCP_BETA_SCALE + beta) / 3 / (BICTCP_BETA_SCALE - beta);
[ ... ]
> + /* divide by bic_scale and by constant Srtt (100ms) */
> + do_div(cube_factor, bic_scale * 10);
[Low]
Can these divides panic at module load if bic_scale=0 or beta=1024 is
passed via modprobe? do_div(cube_factor, bic_scale * 10) divides by 0
when bic_scale is 0, and beta_scale divides by (BICTCP_BETA_SCALE -
beta) which is 0 when beta is 1024. The same shape exists in
tcp_cubic.c, but should the new file sanity-check these inputs
before computing the precomputed factors?
[Low]
MODULE_VERSION("1.0") is more typical of out-of-tree DKMS modules
than of in-tree code; should it be dropped here?
--
pw-bot: cr
prev parent reply other threads:[~2026-06-09 2:06 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-03 7:34 [PATCHv3 net-next] tcp: Add TCP ROCCET congestion control module Tim Fuechsel
2026-06-09 2:06 ` Jakub Kicinski [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260609020620.787588-1-kuba@kernel.org \
--to=kuba@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=davem@davemloft.net \
--cc=dsahern@kernel.org \
--cc=ebiggers@kernel.org \
--cc=edumazet@google.com \
--cc=fmancera@suse.de \
--cc=horms@kernel.org \
--cc=kuniyu@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=lukas.prause@ikt.uni-hannover.de \
--cc=ncardwell@google.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=t.fuechsel@gmx.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox