* [PATCH net v4 0/3] tcp_cubic: fix to achieve at least the same throughput as Reno
@ 2024-08-17 16:33 Mingrui Zhang
2024-08-17 16:33 ` [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT Mingrui Zhang
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Mingrui Zhang @ 2024-08-17 16:33 UTC (permalink / raw)
To: edumazet, davem, ncardwell, netdev; +Cc: Mingrui Zhang, Lisong Xu
This patch series fixes several CUBIC bugs so that "CUBIC achieves at least
the same throughput as Reno in small-BDP networks".
[RFC 9438: https://www.rfc-editor.org/rfc/rfc9438.html]
It consists of three bug fixes, all changing the function bictcp_update()
in tcp_cubic.c, which controls how fast CUBIC increases its
congestion window size snd_cwnd.
(1) tcp_cubic: fix to run bictcp_update() at least once per RTT
(2) tcp_cubic: fix to match Reno additive increment
(3) tcp_cubic: fix to use emulated Reno cwnd one RTT in the future
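For readers less familiar with the call path, a simplified kernel-context
sketch of how bictcp_update() feeds congestion-avoidance growth follows; it
paraphrases cubictcp_cong_avoid() in net/ipv4/tcp_cubic.c (hystart handling
elided) and is an illustration only, not a literal excerpt.

/* Simplified sketch, not a literal copy of net/ipv4/tcp_cubic.c.
 * bictcp_update() computes ca->cnt, the number of ACKed packets needed
 * before snd_cwnd may grow by one segment; all three fixes in this
 * series change how ca->cnt is derived.
 */
static void cubictcp_cong_avoid_sketch(struct sock *sk, u32 ack, u32 acked)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct bictcp *ca = inet_csk_ca(sk);

	if (!tcp_is_cwnd_limited(sk))
		return;

	if (tcp_in_slow_start(tp)) {
		acked = tcp_slow_start(tp, acked);	/* hystart checks elided */
		if (!acked)
			return;
	}
	bictcp_update(ca, tcp_snd_cwnd(tp), acked);	/* sets ca->cnt */
	tcp_cong_avoid_ai(tp, ca->cnt, acked);	/* cwnd grows ~acked / ca->cnt */
}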
Experiments:
Below are Mininet experiments to demonstrate the performance difference
between the original CUBIC and patched CUBIC.
Network: link capacity = 100Mbps, RTT = 4ms
TCP flows: one RENO and one CUBIC, each with an initial cwnd of 10 packets.
The first data packet of each flow is lost.
snd_cwnd of RENO and original CUBIC flows
https://github.com/zmrui/tcp_cubic_fix/blob/main/renocubic_fixb0.jpg
snd_cwnd of RENO and patched CUBIC (with bug fixes 1, 2, and 3) flows.
https://github.com/zmrui/tcp_cubic_fix/blob/main/renocubic_fixb1b2b3.jpg
The results of patched CUBIC with different combinations of
bug fixes 1, 2, and 3 can be found at the following link,
where you can also find more experiment results.
https://github.com/zmrui/tcp_cubic_fix
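As an aside for anyone reproducing a similar Reno-vs-CUBIC comparison: the
Mininet scripts behind the figures live in the repository linked above, and
the sketch below is not taken from them. It only illustrates the usual way a
single flow is pinned to a given congestion control, via the TCP_CONGESTION
socket option, assuming the "reno" and "cubic" modules are available on the
sender.

/* Illustrative userspace sketch (not part of this series or its tests). */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

static int set_cc(int fd, const char *name)
{
	/* e.g. set_cc(fd, "reno") for the RENO flow and
	 *      set_cc(fd, "cubic") for the CUBIC flow,
	 * typically before the transfer starts.
	 */
	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
			  name, strlen(name));
}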
Thanks
Mingrui, and Lisong
Changes:
v3->v4:
replace min() with min_t()
separate declarations and code of tcp_cwnd_next_rtt
https://lore.kernel.org/netdev/20240815214035.1145228-1-mrzhang97@gmail.com/
v2->v3:
Correct the "Fixes:" footer content
https://lore.kernel.org/netdev/20240815001718.2845791-1-mrzhang97@gmail.com/
v1->v2:
Separate patches
Add new cwnd_prior field to hold cwnd before a loss event
https://lore.kernel.org/netdev/20240810223130.379146-1-mrzhang97@gmail.com/
Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com>
Signed-off-by: Lisong Xu <xu@unl.edu>
Mingrui Zhang (3):
tcp_cubic: fix to run bictcp_update() at least once per RTT
tcp_cubic: fix to match Reno additive increment
tcp_cubic: fix to use emulated Reno cwnd one RTT in the future
net/ipv4/tcp_cubic.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 15+ messages in thread* [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT 2024-08-17 16:33 [PATCH net v4 0/3] tcp_cubic: fix to achieve at least the same throughput as Reno Mingrui Zhang @ 2024-08-17 16:33 ` Mingrui Zhang 2024-08-19 9:00 ` Eric Dumazet 2024-08-17 16:33 ` [PATCH net v4 2/3] tcp_cubic: fix to match Reno additive increment Mingrui Zhang 2024-08-17 16:34 ` [PATCH net v4 3/3] tcp_cubic: fix to use emulated Reno cwnd one RTT in the future Mingrui Zhang 2 siblings, 1 reply; 15+ messages in thread From: Mingrui Zhang @ 2024-08-17 16:33 UTC (permalink / raw) To: edumazet, davem, ncardwell, netdev; +Cc: Mingrui Zhang, Lisong Xu The original code bypasses bictcp_update() under certain conditions to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd, bictcp_update() is executed 32 times per second. As a result, it is possible that bictcp_update() is not executed for several RTTs when RTT is short (specifically < 1/32 second = 31 ms and last_cwnd==cwnd which may happen in small-BDP networks), thus leading to low throughput in these RTTs. The patched code executes bictcp_update() 32 times per second if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd. Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)") Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> Signed-off-by: Lisong Xu <xu@unl.edu> --- v3->v4: Replace min() with min_t() v2->v3: Correct the "Fixes:" footer content v1->v2: Separate patches net/ipv4/tcp_cubic.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index 5dbed91c6178..00da7d592032 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) ca->ack_cnt += acked; /* count the number of ACKed packets */ + /* Update 32 times per second if RTT > 1/32 second, + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd + */ if (ca->last_cwnd == cwnd && - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) + (s32)(tcp_jiffies32 - ca->last_time) <= + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) return; /* The CUBIC function can update ca->cnt at most once per jiffy. -- 2.34.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
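To make the effect of the patched bound concrete, here is a small worked
sketch under stated assumptions (HZ = 1000 and delay_min of about 4000 us,
matching the 4 ms RTT used in the cover-letter experiments); it merely
restates the two gate values side by side and is not additional kernel code.

/* Kernel-context illustration of the skip window applied when
 * last_cwnd == cwnd (relies on <linux/jiffies.h> and <linux/minmax.h>).
 */
static inline s32 skip_window_old(void)
{
	return HZ / 32;		/* 31 jiffies at HZ=1000, i.e. ~31 ms */
}

static inline s32 skip_window_patched(u32 delay_min_us)
{
	/* usecs_to_jiffies(4000) == 4 at HZ=1000, so with a 4 ms RTT
	 * bictcp_update() can no longer be skipped for several
	 * consecutive RTTs.
	 */
	return min_t(s32, HZ / 32, usecs_to_jiffies(delay_min_us));
}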
* Re: [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT 2024-08-17 16:33 ` [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT Mingrui Zhang @ 2024-08-19 9:00 ` Eric Dumazet 2024-08-19 20:36 ` Mingrui Zhang 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2024-08-19 9:00 UTC (permalink / raw) To: Mingrui Zhang; +Cc: davem, ncardwell, netdev, Lisong Xu On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > > The original code bypasses bictcp_update() under certain conditions > to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd, > bictcp_update() is executed 32 times per second. As a result, > it is possible that bictcp_update() is not executed for several > RTTs when RTT is short (specifically < 1/32 second = 31 ms and > last_cwnd==cwnd which may happen in small-BDP networks), > thus leading to low throughput in these RTTs. > > The patched code executes bictcp_update() 32 times per second > if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd. > > Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)") > Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") I do not understand this Fixes: tag ? Commit ac35f562203a was essentially a nop at that time... > Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> > Signed-off-by: Lisong Xu <xu@unl.edu> > --- > v3->v4: Replace min() with min_t() > v2->v3: Correct the "Fixes:" footer content > v1->v2: Separate patches > > net/ipv4/tcp_cubic.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c > index 5dbed91c6178..00da7d592032 100644 > --- a/net/ipv4/tcp_cubic.c > +++ b/net/ipv4/tcp_cubic.c > @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) > > ca->ack_cnt += acked; /* count the number of ACKed packets */ > > + /* Update 32 times per second if RTT > 1/32 second, > + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd > + */ > if (ca->last_cwnd == cwnd && > - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > + (s32)(tcp_jiffies32 - ca->last_time) <= > + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) This looks convoluted to me and still limited if HZ=250 (some distros still use 250 jiffies per second :/ ) I would suggest switching to usec right away. diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index 5dbed91c6178257df8d2ccd1c8690a10bdbaf56a..fae000a57bf7d3803c5dd854af64b6933c4e26dd 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -211,26 +211,27 @@ static u32 cubic_root(u64 a) /* * Compute congestion window to use. */ -static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) +static inline void bictcp_update(struct tcp_sock *tp, struct bictcp *ca, + u32 cwnd, u32 acked) { u32 delta, bic_target, max_cnt; u64 offs, t; ca->ack_cnt += acked; /* count the number of ACKed packets */ - if (ca->last_cwnd == cwnd && - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) + delta = tp->tcp_mstamp - ca->last_time; + if (ca->last_cwnd == cwnd && delta <= ca->delay_min) return; - /* The CUBIC function can update ca->cnt at most once per jiffy. + /* The CUBIC function can update ca->cnt at most once per ms. * On all cwnd reduction events, ca->epoch_start is set to 0, * which will force a recalculation of ca->cnt. 
*/ - if (ca->epoch_start && tcp_jiffies32 == ca->last_time) + if (ca->epoch_start && delta < USEC_PER_MSEC) goto tcp_friendliness; ca->last_cwnd = cwnd; - ca->last_time = tcp_jiffies32; + ca->last_time = tp->tcp_mstamp; if (ca->epoch_start == 0) { ca->epoch_start = tcp_jiffies32; /* record beginning */ @@ -334,7 +335,7 @@ __bpf_kfunc static void cubictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked) if (!acked) return; } - bictcp_update(ca, tcp_snd_cwnd(tp), acked); + bictcp_update(tp, ca, tcp_snd_cwnd(tp), acked); tcp_cong_avoid_ai(tp, ca->cnt, acked); } ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT 2024-08-19 9:00 ` Eric Dumazet @ 2024-08-19 20:36 ` Mingrui Zhang 2024-08-20 12:53 ` Eric Dumazet 0 siblings, 1 reply; 15+ messages in thread From: Mingrui Zhang @ 2024-08-19 20:36 UTC (permalink / raw) To: Eric Dumazet; +Cc: davem, ncardwell, netdev, Lisong Xu On 8/19/24 04:00, Eric Dumazet wrote: > On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >> The original code bypasses bictcp_update() under certain conditions >> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd, >> bictcp_update() is executed 32 times per second. As a result, >> it is possible that bictcp_update() is not executed for several >> RTTs when RTT is short (specifically < 1/32 second = 31 ms and >> last_cwnd==cwnd which may happen in small-BDP networks), >> thus leading to low throughput in these RTTs. >> >> The patched code executes bictcp_update() 32 times per second >> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd. >> >> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)") >> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") > I do not understand this Fixes: tag ? > > Commit ac35f562203a was essentially a nop at that time... > I may misunderstood the use of Fixes tag and choose the latest commit of that line. Shall it supposed to be the very first commit with that behavior? That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced? >> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> >> Signed-off-by: Lisong Xu <xu@unl.edu> >> --- >> v3->v4: Replace min() with min_t() >> v2->v3: Correct the "Fixes:" footer content >> v1->v2: Separate patches >> >> net/ipv4/tcp_cubic.c | 6 +++++- >> 1 file changed, 5 insertions(+), 1 deletion(-) >> >> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c >> index 5dbed91c6178..00da7d592032 100644 >> --- a/net/ipv4/tcp_cubic.c >> +++ b/net/ipv4/tcp_cubic.c >> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) >> >> ca->ack_cnt += acked; /* count the number of ACKed packets */ >> >> + /* Update 32 times per second if RTT > 1/32 second, >> + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd >> + */ >> if (ca->last_cwnd == cwnd && >> - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) >> + (s32)(tcp_jiffies32 - ca->last_time) <= >> + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) > This looks convoluted to me and still limited if HZ=250 (some distros > still use 250 jiffies per second :/ ) > > I would suggest switching to usec right away. Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :) Thank you > > diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c > index 5dbed91c6178257df8d2ccd1c8690a10bdbaf56a..fae000a57bf7d3803c5dd854af64b6933c4e26dd > 100644 > --- a/net/ipv4/tcp_cubic.c > +++ b/net/ipv4/tcp_cubic.c > @@ -211,26 +211,27 @@ static u32 cubic_root(u64 a) > /* > * Compute congestion window to use. 
> */ > -static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) > +static inline void bictcp_update(struct tcp_sock *tp, struct bictcp *ca, > + u32 cwnd, u32 acked) > { > u32 delta, bic_target, max_cnt; > u64 offs, t; > > ca->ack_cnt += acked; /* count the number of ACKed packets */ > > - if (ca->last_cwnd == cwnd && > - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > + delta = tp->tcp_mstamp - ca->last_time; > + if (ca->last_cwnd == cwnd && delta <= ca->delay_min) > return; > > - /* The CUBIC function can update ca->cnt at most once per jiffy. > + /* The CUBIC function can update ca->cnt at most once per ms. > * On all cwnd reduction events, ca->epoch_start is set to 0, > * which will force a recalculation of ca->cnt. > */ > - if (ca->epoch_start && tcp_jiffies32 == ca->last_time) > + if (ca->epoch_start && delta < USEC_PER_MSEC) > goto tcp_friendliness; > > ca->last_cwnd = cwnd; > - ca->last_time = tcp_jiffies32; > + ca->last_time = tp->tcp_mstamp; > > if (ca->epoch_start == 0) { > ca->epoch_start = tcp_jiffies32; /* record beginning */ > @@ -334,7 +335,7 @@ __bpf_kfunc static void cubictcp_cong_avoid(struct > sock *sk, u32 ack, u32 acked) > if (!acked) > return; > } > - bictcp_update(ca, tcp_snd_cwnd(tp), acked); > + bictcp_update(tp, ca, tcp_snd_cwnd(tp), acked); > tcp_cong_avoid_ai(tp, ca->cnt, acked); > } ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT 2024-08-19 20:36 ` Mingrui Zhang @ 2024-08-20 12:53 ` Eric Dumazet 2024-08-25 17:47 ` Mingrui Zhang 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2024-08-20 12:53 UTC (permalink / raw) To: Mingrui Zhang; +Cc: davem, ncardwell, netdev, Lisong Xu On Mon, Aug 19, 2024 at 10:36 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > > On 8/19/24 04:00, Eric Dumazet wrote: > > On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > >> The original code bypasses bictcp_update() under certain conditions > >> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd, > >> bictcp_update() is executed 32 times per second. As a result, > >> it is possible that bictcp_update() is not executed for several > >> RTTs when RTT is short (specifically < 1/32 second = 31 ms and > >> last_cwnd==cwnd which may happen in small-BDP networks), > >> thus leading to low throughput in these RTTs. > >> > >> The patched code executes bictcp_update() 32 times per second > >> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd. > >> > >> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)") > >> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") > > I do not understand this Fixes: tag ? > > > > Commit ac35f562203a was essentially a nop at that time... > > > I may misunderstood the use of Fixes tag and choose the latest commit of that line. > > Shall it supposed to be the very first commit with that behavior? > That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced? I was referring to this line : Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") Commit ac35f562203a did not change the behavior at all. I see no particular reason to mention it, this is confusing. > >> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> > >> Signed-off-by: Lisong Xu <xu@unl.edu> > >> --- > >> v3->v4: Replace min() with min_t() > >> v2->v3: Correct the "Fixes:" footer content > >> v1->v2: Separate patches > >> > >> net/ipv4/tcp_cubic.c | 6 +++++- > >> 1 file changed, 5 insertions(+), 1 deletion(-) > >> > >> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c > >> index 5dbed91c6178..00da7d592032 100644 > >> --- a/net/ipv4/tcp_cubic.c > >> +++ b/net/ipv4/tcp_cubic.c > >> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) > >> > >> ca->ack_cnt += acked; /* count the number of ACKed packets */ > >> > >> + /* Update 32 times per second if RTT > 1/32 second, > >> + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd > >> + */ > >> if (ca->last_cwnd == cwnd && > >> - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > >> + (s32)(tcp_jiffies32 - ca->last_time) <= > >> + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) > > This looks convoluted to me and still limited if HZ=250 (some distros > > still use 250 jiffies per second :/ ) > > > > I would suggest switching to usec right away. > Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :) > Thank you No problem, there is no hurry. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT 2024-08-20 12:53 ` Eric Dumazet @ 2024-08-25 17:47 ` Mingrui Zhang 2024-08-26 9:25 ` Eric Dumazet 0 siblings, 1 reply; 15+ messages in thread From: Mingrui Zhang @ 2024-08-25 17:47 UTC (permalink / raw) To: Eric Dumazet; +Cc: davem, ncardwell, netdev, Lisong Xu On 8/20/24 07:53, Eric Dumazet wrote: > On Mon, Aug 19, 2024 at 10:36 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >> On 8/19/24 04:00, Eric Dumazet wrote: >>> On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >>>> The original code bypasses bictcp_update() under certain conditions >>>> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd, >>>> bictcp_update() is executed 32 times per second. As a result, >>>> it is possible that bictcp_update() is not executed for several >>>> RTTs when RTT is short (specifically < 1/32 second = 31 ms and >>>> last_cwnd==cwnd which may happen in small-BDP networks), >>>> thus leading to low throughput in these RTTs. >>>> >>>> The patched code executes bictcp_update() 32 times per second >>>> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd. >>>> >>>> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)") >>>> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") >>> I do not understand this Fixes: tag ? >>> >>> Commit ac35f562203a was essentially a nop at that time... >>> >> I may misunderstood the use of Fixes tag and choose the latest commit of that line. >> >> Shall it supposed to be the very first commit with that behavior? >> That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced? > I was referring to this line : Fixes: ac35f562203a ("tcp: bic, cubic: > use tcp_jiffies32 instead of tcp_time_stamp") > > Commit ac35f562203a did not change the behavior at all. > > I see no particular reason to mention it, this is confusing. > > >>>> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> >>>> Signed-off-by: Lisong Xu <xu@unl.edu> >>>> --- >>>> v3->v4: Replace min() with min_t() >>>> v2->v3: Correct the "Fixes:" footer content >>>> v1->v2: Separate patches >>>> >>>> net/ipv4/tcp_cubic.c | 6 +++++- >>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c >>>> index 5dbed91c6178..00da7d592032 100644 >>>> --- a/net/ipv4/tcp_cubic.c >>>> +++ b/net/ipv4/tcp_cubic.c >>>> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) >>>> >>>> ca->ack_cnt += acked; /* count the number of ACKed packets */ >>>> >>>> + /* Update 32 times per second if RTT > 1/32 second, >>>> + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd >>>> + */ >>>> if (ca->last_cwnd == cwnd && >>>> - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) >>>> + (s32)(tcp_jiffies32 - ca->last_time) <= >>>> + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) >>> This looks convoluted to me and still limited if HZ=250 (some distros >>> still use 250 jiffies per second :/ ) >>> >>> I would suggest switching to usec right away. >> Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :) >> Thank you > No problem, there is no hurry. Thank you, Eric, for your suggestion (switching ca->last_time from jiffies to usec)! We thought about it and feel that it is more complicated and beyond the scope of this patch. 
There are two blocks of code in bictcp_update().
* Block 1: cubic calculation, which is computationally intensive.
* Block 2: tcp friendliness, which emulates RENO.

There are two if statements to control how often these two blocks are
called to reduce CPU overhead.
* If statement 1: if the condition is true, neither of the two blocks is called.
        if (ca->last_cwnd == cwnd &&
            (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
                return;
* If statement 2: if the condition is true, block 1 is not called.
  Intuitively, block 1 is called at most once per jiffy.
        if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
                goto tcp_friendliness;

This patch changes only the first if statement. If we switch ca->last_time
from jiffies to usec, we need to change not only the first if statement
but also the second if statement, as well as block 1.
* change the first if statement from jiffies to usec.
* change the second if statement from jiffies to usec. Need to determine
  how often (in usec) block 1 is called.
* change block 1 from jiffies to usec. Should be fine, but need to make
  sure there is no calculation overflow.

Therefore, it might be better to keep the current patch as it is, and
address the switch from jiffies to usec in future patches.

Thank you!
Lisong (Mingrui sending on behalf of)

^ permalink raw reply [flat|nested] 15+ messages in thread
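To complement the description above, a minimal control-flow skeleton of
bictcp_update() follows (bodies elided; the two gate conditions and the
label are taken from net/ipv4/tcp_cubic.c as quoted above, the rest is
paraphrase, not a literal excerpt):

/* Skeleton only -- not the full function. */
static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
{
	ca->ack_cnt += acked;	/* count the number of ACKed packets */

	/* gate 1: skip both blocks */
	if (ca->last_cwnd == cwnd &&
	    (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
		return;

	/* gate 2: skip block 1, so the cubic calculation runs at most
	 * once per jiffy while an epoch is in progress
	 */
	if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
		goto tcp_friendliness;

	ca->last_cwnd = cwnd;
	ca->last_time = tcp_jiffies32;

	/* block 1: cubic window calculation (computationally intensive),
	 * producing bic_target and ca->cnt
	 */

tcp_friendliness:
	/* block 2: Reno emulation (ca->tcp_cwnd), which may lower ca->cnt */
	;
}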
* Re: [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT 2024-08-25 17:47 ` Mingrui Zhang @ 2024-08-26 9:25 ` Eric Dumazet 2024-08-28 20:32 ` Neal Cardwell 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2024-08-26 9:25 UTC (permalink / raw) To: Mingrui Zhang; +Cc: davem, ncardwell, netdev, Lisong Xu On Sun, Aug 25, 2024 at 7:47 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > > On 8/20/24 07:53, Eric Dumazet wrote: > > On Mon, Aug 19, 2024 at 10:36 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > >> On 8/19/24 04:00, Eric Dumazet wrote: > >>> On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > >>>> The original code bypasses bictcp_update() under certain conditions > >>>> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd, > >>>> bictcp_update() is executed 32 times per second. As a result, > >>>> it is possible that bictcp_update() is not executed for several > >>>> RTTs when RTT is short (specifically < 1/32 second = 31 ms and > >>>> last_cwnd==cwnd which may happen in small-BDP networks), > >>>> thus leading to low throughput in these RTTs. > >>>> > >>>> The patched code executes bictcp_update() 32 times per second > >>>> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd. > >>>> > >>>> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)") > >>>> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") > >>> I do not understand this Fixes: tag ? > >>> > >>> Commit ac35f562203a was essentially a nop at that time... > >>> > >> I may misunderstood the use of Fixes tag and choose the latest commit of that line. > >> > >> Shall it supposed to be the very first commit with that behavior? > >> That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced? > > I was referring to this line : Fixes: ac35f562203a ("tcp: bic, cubic: > > use tcp_jiffies32 instead of tcp_time_stamp") > > > > Commit ac35f562203a did not change the behavior at all. > > > > I see no particular reason to mention it, this is confusing. > > > > > >>>> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> > >>>> Signed-off-by: Lisong Xu <xu@unl.edu> > >>>> --- > >>>> v3->v4: Replace min() with min_t() > >>>> v2->v3: Correct the "Fixes:" footer content > >>>> v1->v2: Separate patches > >>>> > >>>> net/ipv4/tcp_cubic.c | 6 +++++- > >>>> 1 file changed, 5 insertions(+), 1 deletion(-) > >>>> > >>>> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c > >>>> index 5dbed91c6178..00da7d592032 100644 > >>>> --- a/net/ipv4/tcp_cubic.c > >>>> +++ b/net/ipv4/tcp_cubic.c > >>>> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) > >>>> > >>>> ca->ack_cnt += acked; /* count the number of ACKed packets */ > >>>> > >>>> + /* Update 32 times per second if RTT > 1/32 second, > >>>> + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd > >>>> + */ > >>>> if (ca->last_cwnd == cwnd && > >>>> - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > >>>> + (s32)(tcp_jiffies32 - ca->last_time) <= > >>>> + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) > >>> This looks convoluted to me and still limited if HZ=250 (some distros > >>> still use 250 jiffies per second :/ ) > >>> > >>> I would suggest switching to usec right away. > >> Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :) > >> Thank you > > No problem, there is no hurry. 
> > Thank you, Eric, for your suggestion (switching ca->last_time from jiffies to usec)! > We thought about it and feel that it is more complicated and beyond the scope of this patch. > > There are two blocks of code in bictcp_update(). > * Block 1: cubic calculation, which is computationally intensive. > * Block 2: tcp friendliness, which emulates RENO. > > There are two if statements to control how often these two blocks are called to reduce CPU overhead. > * If statement 1: if the condition is true, none of the two blocks are called. > if (ca->last_cwnd == cwnd && > (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > return; > > * If statement 2: If the condition is true, block 1 is not called. Intuitively, block 1 is called at most once per jiffy. > if (ca->epoch_start && tcp_jiffies32 == ca->last_time) > goto tcp_friendliness; > > > This patch changes only the first if statement. If we switch ca->last_time from jiffies to usec, > we need to change not only the first if statement but also the second if statement, as well as block 1. > * change the first if statement from jiffies to usec. > * change the second if statement from jiffies to usec. Need to determine how often (in usec) block 1 is called > * change block 1 from jiffies to usec. Should be fine, but need to make sure no calculation overflow. No problem, I can take care of the jiffies -> usec conversion, you can then send your patch on top of it. > > Therefore, it might be better to keep the current patch as it is, and address the switch from jiffies to usec in future patches. I prefer you rebase your patch after mine is merged. There is a common misconception with jiffies. It can change in less than 20 nsec. Assuming that delta(jiffies) == 1 means that 1ms has elapsed is plain wrong. In the old days, linux TCP only could rely on jiffies and we had to accept its limits. We now can switch to high resolution clocks, without extra costs, because we already cache in tcp->tcp_mstamp the usec timestamp for the current time. Some distros are using CONFIG_HZ_250=y or CONFIG_HZ_100=y, this means current logic in cubic is more fuzzy for them. Without ca->last_time conversion to jiffies, your patch would still be limited to jiffies resolution: usecs_to_jiffies(ca->delay_min) would round up to much bigger values for DC communications. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT 2024-08-26 9:25 ` Eric Dumazet @ 2024-08-28 20:32 ` Neal Cardwell 2024-09-30 16:24 ` Eric Dumazet 0 siblings, 1 reply; 15+ messages in thread From: Neal Cardwell @ 2024-08-28 20:32 UTC (permalink / raw) To: Eric Dumazet; +Cc: Mingrui Zhang, davem, netdev, Lisong Xu On Mon, Aug 26, 2024 at 5:26 AM Eric Dumazet <edumazet@google.com> wrote: ... > I prefer you rebase your patch after mine is merged. > > There is a common misconception with jiffies. > It can change in less than 20 nsec. > Assuming that delta(jiffies) == 1 means that 1ms has elapsed is plain wrong. > In the old days, linux TCP only could rely on jiffies and we had to > accept its limits. > We now can switch to high resolution clocks, without extra costs, > because we already cache in tcp->tcp_mstamp > the usec timestamp for the current time. > > Some distros are using CONFIG_HZ_250=y or CONFIG_HZ_100=y, this means > current logic in cubic is more fuzzy for them. > > Without ca->last_time conversion to jiffies, your patch would still be > limited to jiffies resolution: > usecs_to_jiffies(ca->delay_min) would round up to much bigger values > for DC communications. Even given Eric's excellent point that is raised above, that an increase of jiffies by one can happen even though only O(us) or less may have elapsed, AFAICT the patch should be fine in practice. The patch says: + /* Update 32 times per second if RTT > 1/32 second, + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd + */ if (ca->last_cwnd == cwnd && - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) + (s32)(tcp_jiffies32 - ca->last_time) <= + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) return; So, basically, we only run fall through and try to run the core of bictcp_update() if cwnd has increased since ca-> last_cwnd, or tcp_jiffies32 has increased by more than min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)) since ca->last_time. AFAICT this works out OK because the logic is looking for "more than" and usecs_to_jiffies() rounds up. That means that in the interesting/tricky/common case where ca->delay_min is less than a jiffy, usecs_to_jiffies(ca->delay_min) will return 1 jiffy. That means that in this case we will only fall through and try to run the core of bictcp_update() if cwnd has increased since ca-> last_cwnd, or tcp_jiffies32 has increased by more than 1 jiffy (i.e., 2 or more jiffies). AFAICT the fact that this check is triggering only if tcp_jiffies32 has increased by 2 or more means that at least one full jiffy has elapsed between when we set ca->last_time and the time when this check triggers running the core of bictcp_update(). So AFAICT this logic is not tricked by the fact that a single increment of tcp_jiffies32 can happen over O(us) or less. At first glance it may sound like if the RTT is much less than a jiffy, many RTTs could elapse before we run the core of bictcp_update(). However, AFAIK if the RTT is much less than a jiffy then CUBIC is very likely in Reno mode, and so is very likely to increase cwnd by roughly 1 packet per round trip (the behavior of Reno), so the (ca->last_cwnd == cwnd) condition should fail roughly once per round trip and allow recomputation of the ca->cnt slope. So AFAICT this patch should be OK in practice. Given those considerations, Eric, do you think it would be OK to accept the patch as-is? Thanks! neal ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT 2024-08-28 20:32 ` Neal Cardwell @ 2024-09-30 16:24 ` Eric Dumazet 0 siblings, 0 replies; 15+ messages in thread From: Eric Dumazet @ 2024-09-30 16:24 UTC (permalink / raw) To: Neal Cardwell; +Cc: Mingrui Zhang, davem, netdev, Lisong Xu On Wed, Aug 28, 2024 at 10:32 PM Neal Cardwell <ncardwell@google.com> wrote: > > On Mon, Aug 26, 2024 at 5:26 AM Eric Dumazet <edumazet@google.com> wrote: > ... > > I prefer you rebase your patch after mine is merged. > > > > There is a common misconception with jiffies. > > It can change in less than 20 nsec. > > Assuming that delta(jiffies) == 1 means that 1ms has elapsed is plain wrong. > > In the old days, linux TCP only could rely on jiffies and we had to > > accept its limits. > > We now can switch to high resolution clocks, without extra costs, > > because we already cache in tcp->tcp_mstamp > > the usec timestamp for the current time. > > > > Some distros are using CONFIG_HZ_250=y or CONFIG_HZ_100=y, this means > > current logic in cubic is more fuzzy for them. > > > > Without ca->last_time conversion to jiffies, your patch would still be > > limited to jiffies resolution: > > usecs_to_jiffies(ca->delay_min) would round up to much bigger values > > for DC communications. > > Even given Eric's excellent point that is raised above, that an > increase of jiffies by one can happen even though only O(us) or less > may have elapsed, AFAICT the patch should be fine in practice. > > The patch says: > > + /* Update 32 times per second if RTT > 1/32 second, > + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd > + */ > if (ca->last_cwnd == cwnd && > - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > + (s32)(tcp_jiffies32 - ca->last_time) <= > + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) > return; > > So, basically, we only run fall through and try to run the core of > bictcp_update() if cwnd has increased since ca-> last_cwnd, or > tcp_jiffies32 has increased by more than > min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)) since ca->last_time. > > AFAICT this works out OK because the logic is looking for "more than" > and usecs_to_jiffies() rounds up. That means that in the > interesting/tricky/common case where ca->delay_min is less than a > jiffy, usecs_to_jiffies(ca->delay_min) will return 1 jiffy. That means > that in this case we will only fall through and try to run the core of > bictcp_update() if cwnd has increased since ca-> last_cwnd, or > tcp_jiffies32 has increased by more than 1 jiffy (i.e., 2 or more > jiffies). > > AFAICT the fact that this check is triggering only if tcp_jiffies32 > has increased by 2 or more means that at least one full jiffy has > elapsed between when we set ca->last_time and the time when this check > triggers running the core of bictcp_update(). > > So AFAICT this logic is not tricked by the fact that a single > increment of tcp_jiffies32 can happen over O(us) or less. > > At first glance it may sound like if the RTT is much less than a > jiffy, many RTTs could elapse before we run the core of > bictcp_update(). However, AFAIK if the RTT is much less than a jiffy > then CUBIC is very likely in Reno mode, and so is very likely to > increase cwnd by roughly 1 packet per round trip (the behavior of > Reno), so the (ca->last_cwnd == cwnd) condition should fail roughly > once per round trip and allow recomputation of the ca->cnt slope. > > So AFAICT this patch should be OK in practice. 
> > Given those considerations, Eric, do you think it would be OK to > accept the patch as-is? > Ok, what about updating net/ipv4/tcp_bic.c at the same time ? ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH net v4 2/3] tcp_cubic: fix to match Reno additive increment 2024-08-17 16:33 [PATCH net v4 0/3] tcp_cubic: fix to achieve at least the same throughput as Reno Mingrui Zhang 2024-08-17 16:33 ` [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT Mingrui Zhang @ 2024-08-17 16:33 ` Mingrui Zhang 2024-08-19 8:22 ` Eric Dumazet 2024-08-17 16:34 ` [PATCH net v4 3/3] tcp_cubic: fix to use emulated Reno cwnd one RTT in the future Mingrui Zhang 2 siblings, 1 reply; 15+ messages in thread From: Mingrui Zhang @ 2024-08-17 16:33 UTC (permalink / raw) To: edumazet, davem, ncardwell, netdev; +Cc: Mingrui Zhang, Lisong Xu The original code follows RFC 8312 (obsoleted CUBIC RFC). The patched code follows RFC 9438 (new CUBIC RFC): "Once _W_est_ has grown to reach the _cwnd_ at the time of most recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ -- the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve the same congestion window increment rate as Reno, which uses AIMD (1,0.5)." Add new field 'cwnd_prior' in bictcp to hold cwnd before a loss event Fixes: 89b3d9aaf467 ("[TCP] cubic: precompute constants") Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> Signed-off-by: Lisong Xu <xu@unl.edu> --- v2->v3: Correct the "Fixes:" footer content v1->v2: Add new field 'cwnd_prior' in bictcp to hold cwnd before a loss event v1->v2: Separate patches net/ipv4/tcp_cubic.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index 00da7d592032..03cfbad37dab 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -102,6 +102,7 @@ struct bictcp { u32 end_seq; /* end_seq of the round */ u32 last_ack; /* last time when the ACK spacing is close */ u32 curr_rtt; /* the minimum rtt of current round */ + u32 cwnd_prior; /* cwnd before a loss event */ }; static inline void bictcp_reset(struct bictcp *ca) @@ -305,7 +306,10 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) if (tcp_friendliness) { u32 scale = beta_scale; - delta = (cwnd * scale) >> 3; + if (cwnd < ca->cwnd_prior) + delta = (cwnd * scale) >> 3; /* CUBIC additive increment */ + else + delta = cwnd; /* Reno additive increment */ while (ca->ack_cnt > delta) { /* update tcp cwnd */ ca->ack_cnt -= delta; ca->tcp_cwnd++; @@ -355,6 +359,7 @@ __bpf_kfunc static u32 cubictcp_recalc_ssthresh(struct sock *sk) / (2 * BICTCP_BETA_SCALE); else ca->last_max_cwnd = tcp_snd_cwnd(tp); + ca->cwnd_prior = tcp_snd_cwnd(tp); return max((tcp_snd_cwnd(tp) * beta) / BICTCP_BETA_SCALE, 2U); } -- 2.34.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
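As a back-of-the-envelope check of the constants involved (using the values
already in tcp_cubic.c, beta = 717/1024 ~= 0.7, and the notation of RFC 9438
Section 4.3): the Reno-emulation loop increments ca->tcp_cwnd once for every
delta ACKed packets, and roughly cwnd packets are ACKed per RTT, so the
emulated window W_est grows by about cwnd/delta segments per RTT.

  original: delta = (cwnd * beta_scale) >> 3
            => W_est grows by ~3*(1 - beta)/(1 + beta) ~= 0.53 segments/RTT,
               the RFC 8312 alpha_cubic
  patched:  delta = cwnd      (taken once cwnd >= ca->cwnd_prior)
            => W_est grows by ~1 segment/RTT, i.e. alpha_cubic = 1,
               matching the RFC 9438 Reno-friendly region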
* Re: [PATCH net v4 2/3] tcp_cubic: fix to match Reno additive increment 2024-08-17 16:33 ` [PATCH net v4 2/3] tcp_cubic: fix to match Reno additive increment Mingrui Zhang @ 2024-08-19 8:22 ` Eric Dumazet 2024-08-19 21:03 ` Mingrui Zhang 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2024-08-19 8:22 UTC (permalink / raw) To: Mingrui Zhang; +Cc: davem, ncardwell, netdev, Lisong Xu On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > > The original code follows RFC 8312 (obsoleted CUBIC RFC). > > The patched code follows RFC 9438 (new CUBIC RFC): Please give the precise location in the RFC (4.3 Reno-Friendly Region) > "Once _W_est_ has grown to reach the _cwnd_ at the time of most > recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ -- > the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve > the same congestion window increment rate as Reno, which uses AIMD > (1,0.5)." > > Add new field 'cwnd_prior' in bictcp to hold cwnd before a loss event > > Fixes: 89b3d9aaf467 ("[TCP] cubic: precompute constants") RFC 9438 is brand new, I think we should not backport this patch to stable linux versions. This would target net-next, unless there is clear evidence that it is absolutely safe. Note the existence of tools/testing/selftests/bpf/progs/bpf_cc_cubic.c and tools/testing/selftests/bpf/progs/bpf_cubic.c If this patch was a fix, I presume we would need to fix these files ? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net v4 2/3] tcp_cubic: fix to match Reno additive increment 2024-08-19 8:22 ` Eric Dumazet @ 2024-08-19 21:03 ` Mingrui Zhang 2024-08-20 12:56 ` Eric Dumazet 0 siblings, 1 reply; 15+ messages in thread From: Mingrui Zhang @ 2024-08-19 21:03 UTC (permalink / raw) To: Eric Dumazet; +Cc: davem, ncardwell, netdev, Lisong Xu On 8/19/24 03:22, Eric Dumazet wrote: > On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >> The original code follows RFC 8312 (obsoleted CUBIC RFC). >> >> The patched code follows RFC 9438 (new CUBIC RFC): > Please give the precise location in the RFC (4.3 Reno-Friendly Region) Thank you, Eric, I will write it more clearly in the next version patch to submit. > >> "Once _W_est_ has grown to reach the _cwnd_ at the time of most >> recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ -- >> the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve >> the same congestion window increment rate as Reno, which uses AIMD >> (1,0.5)." >> >> Add new field 'cwnd_prior' in bictcp to hold cwnd before a loss event >> >> Fixes: 89b3d9aaf467 ("[TCP] cubic: precompute constants") > RFC 9438 is brand new, I think we should not backport this patch to > stable linux versions. > > This would target net-next, unless there is clear evidence that it is > absolutely safe. I agree with you that this patch would target net-next. > Note the existence of tools/testing/selftests/bpf/progs/bpf_cc_cubic.c > and tools/testing/selftests/bpf/progs/bpf_cubic.c > > If this patch was a fix, I presume we would need to fix these files ? In my understanding, the bpf_cubic.c and bpf_cc_cubic.c are not designed to create a fully equivalent version of tcp_cubic, but more focus on BPF logic testing usage. For example, the up-to-date bpf_cubic does not involve the changes in commit 9957b38b5e7a ("tcp_cubic: make hystart_ack_delay() aware of BIG TCP") Maybe we would ask BPF maintainers whether to fix these BPF files? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net v4 2/3] tcp_cubic: fix to match Reno additive increment 2024-08-19 21:03 ` Mingrui Zhang @ 2024-08-20 12:56 ` Eric Dumazet 2024-08-20 14:22 ` Mingrui Zhang 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2024-08-20 12:56 UTC (permalink / raw) To: Mingrui Zhang; +Cc: davem, ncardwell, netdev, Lisong Xu On Mon, Aug 19, 2024 at 11:03 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > > On 8/19/24 03:22, Eric Dumazet wrote: > > On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > >> The original code follows RFC 8312 (obsoleted CUBIC RFC). > >> > >> The patched code follows RFC 9438 (new CUBIC RFC): > > Please give the precise location in the RFC (4.3 Reno-Friendly Region) > > Thank you, Eric, > I will write it more clearly in the next version patch to submit. > > > > >> "Once _W_est_ has grown to reach the _cwnd_ at the time of most > >> recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ -- > >> the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve > >> the same congestion window increment rate as Reno, which uses AIMD > >> (1,0.5)." > >> > >> Add new field 'cwnd_prior' in bictcp to hold cwnd before a loss event > >> > >> Fixes: 89b3d9aaf467 ("[TCP] cubic: precompute constants") > > RFC 9438 is brand new, I think we should not backport this patch to > > stable linux versions. > > > > This would target net-next, unless there is clear evidence that it is > > absolutely safe. > > I agree with you that this patch would target net-next. > > > Note the existence of tools/testing/selftests/bpf/progs/bpf_cc_cubic.c > > and tools/testing/selftests/bpf/progs/bpf_cubic.c > > > > If this patch was a fix, I presume we would need to fix these files ? > In my understanding, the bpf_cubic.c and bpf_cc_cubic.c are not designed to create a fully equivalent version of tcp_cubic, but more focus on BPF logic testing usage. > For example, the up-to-date bpf_cubic does not involve the changes in commit 9957b38b5e7a ("tcp_cubic: make hystart_ack_delay() aware of BIG TCP") > > Maybe we would ask BPF maintainers whether to fix these BPF files? We try (as TCP maintainers) to keep tools/testing/selftests/bpf/progs/bpf_cubic.c up to date with the kernel C code. Because _if_ someone is really using BPF based cubic, they should get the fix eventually. See for instance commit 7d21d54d624777358ab6c7be7ff778808fef70ba Author: Neal Cardwell <ncardwell@google.com> Date: Wed Jun 24 12:42:03 2020 -0400 bpf: tcp: bpf_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT Apply the fix from: "tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT" to the BPF implementation of TCP CUBIC congestion control. Repeating the commit description here for completeness: Mirja Kuehlewind reported a bug in Linux TCP CUBIC Hystart, where Hystart HYSTART_DELAY mechanism can exit Slow Start spuriously on an ACK when the minimum rtt of a connection goes down. From inspection it is clear from the existing code that this could happen in an example like the following: o The first 8 RTT samples in a round trip are 150ms, resulting in a curr_rtt of 150ms and a delay_min of 150ms. o The 9th RTT sample is 100ms. The curr_rtt does not change after the first 8 samples, so curr_rtt remains 150ms. But delay_min can be lowered at any time, so delay_min falls to 100ms. The code executes the HYSTART_DELAY comparison between curr_rtt of 150ms and delay_min of 100ms, and the curr_rtt is declared far enough above delay_min to force a (spurious) exit of Slow start. 
The fix here is simple: allow every RTT sample in a round trip to lower the curr_rtt. Fixes: 6de4a9c430b5 ("bpf: tcp: Add bpf_cubic example") Reported-by: Mirja Kuehlewind <mirja.kuehlewind@ericsson.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net v4 2/3] tcp_cubic: fix to match Reno additive increment 2024-08-20 12:56 ` Eric Dumazet @ 2024-08-20 14:22 ` Mingrui Zhang 0 siblings, 0 replies; 15+ messages in thread From: Mingrui Zhang @ 2024-08-20 14:22 UTC (permalink / raw) To: Eric Dumazet; +Cc: davem, ncardwell, netdev, Lisong Xu On 8/20/24 07:56, Eric Dumazet wrote: > On Mon, Aug 19, 2024 at 11:03 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >> On 8/19/24 03:22, Eric Dumazet wrote: >>> On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >>>> The original code follows RFC 8312 (obsoleted CUBIC RFC). >>>> >>>> The patched code follows RFC 9438 (new CUBIC RFC): >>> Please give the precise location in the RFC (4.3 Reno-Friendly Region) >> Thank you, Eric, >> I will write it more clearly in the next version patch to submit. >> >>>> "Once _W_est_ has grown to reach the _cwnd_ at the time of most >>>> recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ -- >>>> the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve >>>> the same congestion window increment rate as Reno, which uses AIMD >>>> (1,0.5)." >>>> >>>> Add new field 'cwnd_prior' in bictcp to hold cwnd before a loss event >>>> >>>> Fixes: 89b3d9aaf467 ("[TCP] cubic: precompute constants") >>> RFC 9438 is brand new, I think we should not backport this patch to >>> stable linux versions. >>> >>> This would target net-next, unless there is clear evidence that it is >>> absolutely safe. >> I agree with you that this patch would target net-next. >> >>> Note the existence of tools/testing/selftests/bpf/progs/bpf_cc_cubic.c >>> and tools/testing/selftests/bpf/progs/bpf_cubic.c >>> >>> If this patch was a fix, I presume we would need to fix these files ? >> In my understanding, the bpf_cubic.c and bpf_cc_cubic.c are not designed to create a fully equivalent version of tcp_cubic, but more focus on BPF logic testing usage. >> For example, the up-to-date bpf_cubic does not involve the changes in commit 9957b38b5e7a ("tcp_cubic: make hystart_ack_delay() aware of BIG TCP") >> >> Maybe we would ask BPF maintainers whether to fix these BPF files? > We try (as TCP maintainers) to keep > tools/testing/selftests/bpf/progs/bpf_cubic.c up to date with the > kernel C code. > Because _if_ someone is really using BPF based cubic, they should get > the fix eventually. > I got your point. Yes, we should fix those BPF based cubic files if this patch was confirmed. > See for instance > > commit 7d21d54d624777358ab6c7be7ff778808fef70ba > Author: Neal Cardwell <ncardwell@google.com> > Date: Wed Jun 24 12:42:03 2020 -0400 > > bpf: tcp: bpf_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT > > Apply the fix from: > "tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT" > to the BPF implementation of TCP CUBIC congestion control. > > Repeating the commit description here for completeness: > > Mirja Kuehlewind reported a bug in Linux TCP CUBIC Hystart, where > Hystart HYSTART_DELAY mechanism can exit Slow Start spuriously on an > ACK when the minimum rtt of a connection goes down. From inspection it > is clear from the existing code that this could happen in an example > like the following: > > o The first 8 RTT samples in a round trip are 150ms, resulting in a > curr_rtt of 150ms and a delay_min of 150ms. > > o The 9th RTT sample is 100ms. The curr_rtt does not change after the > first 8 samples, so curr_rtt remains 150ms. But delay_min can be > lowered at any time, so delay_min falls to 100ms. 
The code executes > the HYSTART_DELAY comparison between curr_rtt of 150ms and delay_min > of 100ms, and the curr_rtt is declared far enough above delay_min to > force a (spurious) exit of Slow start. > > The fix here is simple: allow every RTT sample in a round trip to > lower the curr_rtt. > > Fixes: 6de4a9c430b5 ("bpf: tcp: Add bpf_cubic example") > Reported-by: Mirja Kuehlewind <mirja.kuehlewind@ericsson.com> > Signed-off-by: Neal Cardwell <ncardwell@google.com> > Signed-off-by: Eric Dumazet <edumazet@google.com> > Acked-by: Soheil Hassas Yeganeh <soheil@google.com> > Signed-off-by: David S. Miller <davem@davemloft.net> Thank you for pointing out this patch as an example. ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH net v4 3/3] tcp_cubic: fix to use emulated Reno cwnd one RTT in the future 2024-08-17 16:33 [PATCH net v4 0/3] tcp_cubic: fix to achieve at least the same throughput as Reno Mingrui Zhang 2024-08-17 16:33 ` [PATCH net v4 1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT Mingrui Zhang 2024-08-17 16:33 ` [PATCH net v4 2/3] tcp_cubic: fix to match Reno additive increment Mingrui Zhang @ 2024-08-17 16:34 ` Mingrui Zhang 2 siblings, 0 replies; 15+ messages in thread From: Mingrui Zhang @ 2024-08-17 16:34 UTC (permalink / raw) To: edumazet, davem, ncardwell, netdev; +Cc: Mingrui Zhang, Lisong Xu The original code estimates RENO snd_cwnd using the estimated RENO snd_cwnd at the current time (i.e., tcp_cwnd). The patched code estimates RENO snd_cwnd using the estimated RENO snd_cwnd after one RTT (i.e., tcp_cwnd_next_rtt), because ca->cnt is used to increase snd_cwnd for the next RTT. Fixes: 89b3d9aaf467 ("[TCP] cubic: precompute constants") Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> Signed-off-by: Lisong Xu <xu@unl.edu> --- v3->v4: Separate declarations and code of tcp_cwnd_next_rtt v2->v3: Correct the "Fixes:" footer content v1->v2: Separate patches net/ipv4/tcp_cubic.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index 03cfbad37dab..2d7121ca789e 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -304,7 +304,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) tcp_friendliness: /* TCP Friendly */ if (tcp_friendliness) { - u32 scale = beta_scale; + u32 scale = beta_scale, tcp_cwnd_next_rtt; if (cwnd < ca->cwnd_prior) delta = (cwnd * scale) >> 3; /* CUBIC additive increment */ @@ -315,8 +315,11 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) ca->tcp_cwnd++; } - if (ca->tcp_cwnd > cwnd) { /* if bic is slower than tcp */ - delta = ca->tcp_cwnd - cwnd; + /* Reno cwnd one RTT in the future */ + tcp_cwnd_next_rtt = ca->tcp_cwnd + (ca->ack_cnt + cwnd) / delta; + + if (tcp_cwnd_next_rtt > cwnd) { /* if bic is slower than Reno */ + delta = tcp_cwnd_next_rtt - cwnd; max_cnt = cwnd / delta; if (ca->cnt > max_cnt) ca->cnt = max_cnt; -- 2.34.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
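For completeness, a short sanity check of the new estimate (this merely
restates the patch's arithmetic, under the approximation that about cwnd
packets are ACKed over one RTT): after the emulation loop, ca->ack_cnt holds
the ACKs left over modulo delta, so over the next RTT the counter reaches
roughly ca->ack_cnt + cwnd, producing about (ca->ack_cnt + cwnd) / delta
further increments of the emulated Reno window. Hence

  tcp_cwnd_next_rtt ~= ca->tcp_cwnd + (ca->ack_cnt + cwnd) / delta

is the emulated Reno cwnd one RTT ahead, and it is this value (rather than
the current ca->tcp_cwnd) that the patch compares against cwnd when deciding
whether ca->cnt must be capped so that CUBIC keeps pace with Reno.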