* [PATCH] bonding: Improve the accuracy of LACPDU transmissions
@ 2025-06-18 19:53 carlos.bilbao
2025-06-20 3:15 ` Tonghao Zhang
0 siblings, 1 reply; 7+ messages in thread
From: carlos.bilbao @ 2025-06-18 19:53 UTC (permalink / raw)
To: jv, andrew+netdev, davem, edumazet, kuba, pabeni, horms, netdev,
linux-kernel
Cc: sforshee, bilbao, Carlos Bilbao
From: Carlos Bilbao <carlos.bilbao@kernel.org>
Improve the timing accuracy of LACPDU transmissions in the bonding 802.3ad
(LACP) driver. The current approach relies on a decrementing counter to
limit the transmission rate. In our experience, this method is susceptible
to delays (such as those caused by CPU contention or soft lockups) which
can lead to accumulated drift in the LACPDU send interval. Over time, this
drift can cause synchronization issues with the top-of-rack (ToR) switch
managing the LAG, manifesting as lag map flapping. This in turn can trigger
temporary interface removal and potential packet loss.
This patch improves stability with a jiffies-based mechanism to track and
enforce the minimum transmission interval; keeping track of when the next
LACPDU should be sent.
Suggested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Signed-off-by: Carlos Bilbao (DigitalOcean) <carlos.bilbao@kernel.org>
---
drivers/net/bonding/bond_3ad.c | 18 ++++++++----------
include/net/bond_3ad.h | 5 +----
2 files changed, 9 insertions(+), 14 deletions(-)
diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index c6807e473ab7..47610697e4e5 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -1375,10 +1375,12 @@ static void ad_churn_machine(struct port *port)
*/
static void ad_tx_machine(struct port *port)
{
- /* check if tx timer expired, to verify that we do not send more than
- * 3 packets per second
- */
- if (port->sm_tx_timer_counter && !(--port->sm_tx_timer_counter)) {
+ unsigned long now = jiffies;
+
+ /* Check if enough time has passed since the last LACPDU sent */
+ if (time_after_eq(now, port->sm_tx_next_jiffies)) {
+ port->sm_tx_next_jiffies += ad_ticks_per_sec / AD_MAX_TX_IN_SECOND;
+
/* check if there is something to send */
if (port->ntt && (port->sm_vars & AD_PORT_LACP_ENABLED)) {
__update_lacpdu_from_port(port);
@@ -1395,10 +1397,6 @@ static void ad_tx_machine(struct port *port)
port->ntt = false;
}
}
- /* restart tx timer(to verify that we will not exceed
- * AD_MAX_TX_IN_SECOND
- */
- port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND;
}
}
@@ -2199,9 +2197,9 @@ void bond_3ad_bind_slave(struct slave *slave)
/* actor system is the bond's system */
__ad_actor_update_port(port);
/* tx timer(to verify that no more than MAX_TX_IN_SECOND
- * lacpdu's are sent in one second)
+ * lacpdu's are sent in the configured interval (1 or 30 secs))
*/
- port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND;
+ port->sm_tx_next_jiffies = jiffies + ad_ticks_per_sec / AD_MAX_TX_IN_SECOND;
__disable_port(port);
diff --git a/include/net/bond_3ad.h b/include/net/bond_3ad.h
index 2053cd8e788a..956d4cb45db1 100644
--- a/include/net/bond_3ad.h
+++ b/include/net/bond_3ad.h
@@ -231,10 +231,7 @@ typedef struct port {
mux_states_t sm_mux_state; /* state machine mux state */
u16 sm_mux_timer_counter; /* state machine mux timer counter */
tx_states_t sm_tx_state; /* state machine tx state */
- u16 sm_tx_timer_counter; /* state machine tx timer counter
- * (always on - enter to transmit
- * state 3 time per second)
- */
+ unsigned long sm_tx_next_jiffies;/* expected jiffies for next LACPDU sent */
u16 sm_churn_actor_timer_counter;
u16 sm_churn_partner_timer_counter;
u32 churn_actor_count;
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] bonding: Improve the accuracy of LACPDU transmissions
2025-06-18 19:53 [PATCH] bonding: Improve the accuracy of LACPDU transmissions carlos.bilbao
@ 2025-06-20 3:15 ` Tonghao Zhang
2025-06-24 21:15 ` Jay Vosburgh
0 siblings, 1 reply; 7+ messages in thread
From: Tonghao Zhang @ 2025-06-20 3:15 UTC (permalink / raw)
To: carlos.bilbao
Cc: jv, andrew+netdev, davem, edumazet, kuba, pabeni, horms, netdev,
linux-kernel, sforshee, bilbao
> 2025年6月19日 03:53,carlos.bilbao@kernel.org 写道:
>
> From: Carlos Bilbao <carlos.bilbao@kernel.org>
>
> Improve the timing accuracy of LACPDU transmissions in the bonding 802.3ad
> (LACP) driver. The current approach relies on a decrementing counter to
> limit the transmission rate. In our experience, this method is susceptible
> to delays (such as those caused by CPU contention or soft lockups) which
> can lead to accumulated drift in the LACPDU send interval. Over time, this
> drift can cause synchronization issues with the top-of-rack (ToR) switch
> managing the LAG, manifesting as lag map flapping. This in turn can trigger
> temporary interface removal and potential packet loss.
>
> This patch improves stability with a jiffies-based mechanism to track and
> enforce the minimum transmission interval; keeping track of when the next
> LACPDU should be sent.
>
> Suggested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
> Signed-off-by: Carlos Bilbao (DigitalOcean) <carlos.bilbao@kernel.org>
> ---
> drivers/net/bonding/bond_3ad.c | 18 ++++++++----------
> include/net/bond_3ad.h | 5 +----
> 2 files changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index c6807e473ab7..47610697e4e5 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -1375,10 +1375,12 @@ static void ad_churn_machine(struct port *port)
> */
> static void ad_tx_machine(struct port *port)
> {
> - /* check if tx timer expired, to verify that we do not send more than
> - * 3 packets per second
> - */
> - if (port->sm_tx_timer_counter && !(--port->sm_tx_timer_counter)) {
> + unsigned long now = jiffies;
> +
> + /* Check if enough time has passed since the last LACPDU sent */
> + if (time_after_eq(now, port->sm_tx_next_jiffies)) {
> + port->sm_tx_next_jiffies += ad_ticks_per_sec / AD_MAX_TX_IN_SECOND;
> +
> /* check if there is something to send */
> if (port->ntt && (port->sm_vars & AD_PORT_LACP_ENABLED)) {
> __update_lacpdu_from_port(port);
> @@ -1395,10 +1397,6 @@ static void ad_tx_machine(struct port *port)
> port->ntt = false;
> }
> }
> - /* restart tx timer(to verify that we will not exceed
> - * AD_MAX_TX_IN_SECOND
> - */
> - port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND;
> }
> }
>
> @@ -2199,9 +2197,9 @@ void bond_3ad_bind_slave(struct slave *slave)
> /* actor system is the bond's system */
> __ad_actor_update_port(port);
> /* tx timer(to verify that no more than MAX_TX_IN_SECOND
> - * lacpdu's are sent in one second)
> + * lacpdu's are sent in the configured interval (1 or 30 secs))
> */
> - port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND;
> + port->sm_tx_next_jiffies = jiffies + ad_ticks_per_sec / AD_MAX_TX_IN_SECOND;
If CONFIG_HZ is 1000, there is 1000 tick per second, but "ad_ticks_per_sec / AD_MAX_TX_IN_SECOND” == 10/3 == 3, so that means send lacp packets every 3 ticks ?
>
> __disable_port(port);
>
> diff --git a/include/net/bond_3ad.h b/include/net/bond_3ad.h
> index 2053cd8e788a..956d4cb45db1 100644
> --- a/include/net/bond_3ad.h
> +++ b/include/net/bond_3ad.h
> @@ -231,10 +231,7 @@ typedef struct port {
> mux_states_t sm_mux_state; /* state machine mux state */
> u16 sm_mux_timer_counter; /* state machine mux timer counter */
> tx_states_t sm_tx_state; /* state machine tx state */
> - u16 sm_tx_timer_counter; /* state machine tx timer counter
> - * (always on - enter to transmit
> - * state 3 time per second)
> - */
> + unsigned long sm_tx_next_jiffies;/* expected jiffies for next LACPDU sent */
> u16 sm_churn_actor_timer_counter;
> u16 sm_churn_partner_timer_counter;
> u32 churn_actor_count;
> --
> 2.43.0
>
>
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] bonding: Improve the accuracy of LACPDU transmissions
2025-06-20 3:15 ` Tonghao Zhang
@ 2025-06-24 21:15 ` Jay Vosburgh
2025-06-24 22:27 ` Carlos Bilbao
2025-06-24 22:47 ` Seth Forshee
0 siblings, 2 replies; 7+ messages in thread
From: Jay Vosburgh @ 2025-06-24 21:15 UTC (permalink / raw)
To: Tonghao Zhang
Cc: carlos.bilbao, andrew+netdev, davem, edumazet, kuba, pabeni,
horms, netdev, linux-kernel, sforshee, bilbao
Tonghao Zhang <tonghao@bamaicloud.com> wrote:
>
>
>
>> 2025年6月19日 03:53,carlos.bilbao@kernel.org 写道:
>>
>> From: Carlos Bilbao <carlos.bilbao@kernel.org>
>>
>> Improve the timing accuracy of LACPDU transmissions in the bonding 802.3ad
>> (LACP) driver. The current approach relies on a decrementing counter to
>> limit the transmission rate. In our experience, this method is susceptible
>> to delays (such as those caused by CPU contention or soft lockups) which
>> can lead to accumulated drift in the LACPDU send interval. Over time, this
>> drift can cause synchronization issues with the top-of-rack (ToR) switch
>> managing the LAG, manifesting as lag map flapping. This in turn can trigger
>> temporary interface removal and potential packet loss.
So, you're saying that contention or soft lockups are causing
the queue_delayed_work() of bond_3ad_state_machine_handler() to be
scheduled late, and, then, because the LACPDU TX limiter is based on the
number of state machine executions (which should be every 100ms), it is
then late sending LACPDUs?
If the core problem is that the state machine as a whole isn't
running regularly, how is doing a clock-based time check reliable? Or
should I take the word "improve" from the Subject literally, and assume
it's making things "better" but not "totally perfect"?
Is the sync issue with the TOR due to missing / delayed LACPDUs,
or is there more to it? At the fast LACP rate, the periodic timeout is
3 seconds for a nominal 1 second LACPDU interval, which is fairly
generous.
>> This patch improves stability with a jiffies-based mechanism to track and
>> enforce the minimum transmission interval; keeping track of when the next
>> LACPDU should be sent.
>>
>> Suggested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
>> Signed-off-by: Carlos Bilbao (DigitalOcean) <carlos.bilbao@kernel.org>
>> ---
>> drivers/net/bonding/bond_3ad.c | 18 ++++++++----------
>> include/net/bond_3ad.h | 5 +----
>> 2 files changed, 9 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>> index c6807e473ab7..47610697e4e5 100644
>> --- a/drivers/net/bonding/bond_3ad.c
>> +++ b/drivers/net/bonding/bond_3ad.c
>> @@ -1375,10 +1375,12 @@ static void ad_churn_machine(struct port *port)
>> */
>> static void ad_tx_machine(struct port *port)
>> {
>> - /* check if tx timer expired, to verify that we do not send more than
>> - * 3 packets per second
>> - */
>> - if (port->sm_tx_timer_counter && !(--port->sm_tx_timer_counter)) {
>> + unsigned long now = jiffies;
>> +
>> + /* Check if enough time has passed since the last LACPDU sent */
>> + if (time_after_eq(now, port->sm_tx_next_jiffies)) {
>> + port->sm_tx_next_jiffies += ad_ticks_per_sec / AD_MAX_TX_IN_SECOND;
>> +
>> /* check if there is something to send */
>> if (port->ntt && (port->sm_vars & AD_PORT_LACP_ENABLED)) {
>> __update_lacpdu_from_port(port);
>> @@ -1395,10 +1397,6 @@ static void ad_tx_machine(struct port *port)
>> port->ntt = false;
>> }
>> }
>> - /* restart tx timer(to verify that we will not exceed
>> - * AD_MAX_TX_IN_SECOND
>> - */
>> - port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND;
>> }
>> }
>>
>> @@ -2199,9 +2197,9 @@ void bond_3ad_bind_slave(struct slave *slave)
>> /* actor system is the bond's system */
>> __ad_actor_update_port(port);
>> /* tx timer(to verify that no more than MAX_TX_IN_SECOND
>> - * lacpdu's are sent in one second)
>> + * lacpdu's are sent in the configured interval (1 or 30 secs))
>> */
>> - port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND;
>> + port->sm_tx_next_jiffies = jiffies + ad_ticks_per_sec / AD_MAX_TX_IN_SECOND;
>
>If CONFIG_HZ is 1000, there is 1000 tick per second, but "ad_ticks_per_sec / AD_MAX_TX_IN_SECOND” == 10/3 == 3, so that means send lacp packets every 3 ticks ?
Agreed, I think the math is off here.
ad_ticks_per_sec is 10, it's the number of times the entire LACP
state machine runs per second. It is unrelated to jiffies, and can't be
used directly with jiffy units (the duration of which varies depending
on what CONFIG_HZ is). I agree that it's confusingly similar to
ad_delta_in_ticks, which is measured in jiffy units.
You'll probably want to use msecs_to_jiffies() somewhere.
How did you test this to insure the TX machine doesn't overrun
(i.e., exceed AD_MAX_TX_IN_SECOND LACPDU transmissions in one second)?
-J
>>
>> __disable_port(port);
>>
>> diff --git a/include/net/bond_3ad.h b/include/net/bond_3ad.h
>> index 2053cd8e788a..956d4cb45db1 100644
>> --- a/include/net/bond_3ad.h
>> +++ b/include/net/bond_3ad.h
>> @@ -231,10 +231,7 @@ typedef struct port {
>> mux_states_t sm_mux_state; /* state machine mux state */
>> u16 sm_mux_timer_counter; /* state machine mux timer counter */
>> tx_states_t sm_tx_state; /* state machine tx state */
>> - u16 sm_tx_timer_counter; /* state machine tx timer counter
>> - * (always on - enter to transmit
>> - * state 3 time per second)
>> - */
>> + unsigned long sm_tx_next_jiffies;/* expected jiffies for next LACPDU sent */
>> u16 sm_churn_actor_timer_counter;
>> u16 sm_churn_partner_timer_counter;
>> u32 churn_actor_count;
>> --
>> 2.43.0
>>
>>
>>
>
---
-Jay Vosburgh, jv@jvosburgh.net
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] bonding: Improve the accuracy of LACPDU transmissions
2025-06-24 21:15 ` Jay Vosburgh
@ 2025-06-24 22:27 ` Carlos Bilbao
2025-06-24 22:33 ` Carlos Bilbao
2025-06-24 22:47 ` Seth Forshee
1 sibling, 1 reply; 7+ messages in thread
From: Carlos Bilbao @ 2025-06-24 22:27 UTC (permalink / raw)
To: Jay Vosburgh, Tonghao Zhang
Cc: carlos.bilbao, andrew+netdev, davem, edumazet, kuba, pabeni,
horms, netdev, linux-kernel, sforshee, bilbao
Hello both,
On 6/24/25 16:15, Jay Vosburgh wrote:
> Tonghao Zhang <tonghao@bamaicloud.com> wrote:
>
>>
>>
>>> 2025年6月19日 03:53,carlos.bilbao@kernel.org 写道:
>>>
>>> From: Carlos Bilbao <carlos.bilbao@kernel.org>
>>>
>>> Improve the timing accuracy of LACPDU transmissions in the bonding 802.3ad
>>> (LACP) driver. The current approach relies on a decrementing counter to
>>> limit the transmission rate. In our experience, this method is susceptible
>>> to delays (such as those caused by CPU contention or soft lockups) which
>>> can lead to accumulated drift in the LACPDU send interval. Over time, this
>>> drift can cause synchronization issues with the top-of-rack (ToR) switch
>>> managing the LAG, manifesting as lag map flapping. This in turn can trigger
>>> temporary interface removal and potential packet loss.
> So, you're saying that contention or soft lockups are causing
> the queue_delayed_work() of bond_3ad_state_machine_handler() to be
> scheduled late, and, then, because the LACPDU TX limiter is based on the
> number of state machine executions (which should be every 100ms), it is
> then late sending LACPDUs?
>
> If the core problem is that the state machine as a whole isn't
> running regularly, how is doing a clock-based time check reliable? Or
> should I take the word "improve" from the Subject literally, and assume
> it's making things "better" but not "totally perfect"?
>
> Is the sync issue with the TOR due to missing / delayed LACPDUs,
> or is there more to it? At the fast LACP rate, the periodic timeout is
> 3 seconds for a nominal 1 second LACPDU interval, which is fairly
> generous.
>
>>> This patch improves stability with a jiffies-based mechanism to track and
>>> enforce the minimum transmission interval; keeping track of when the next
>>> LACPDU should be sent.
>>>
>>> Suggested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
>>> Signed-off-by: Carlos Bilbao (DigitalOcean) <carlos.bilbao@kernel.org>
>>> ---
>>> drivers/net/bonding/bond_3ad.c | 18 ++++++++----------
>>> include/net/bond_3ad.h | 5 +----
>>> 2 files changed, 9 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>>> index c6807e473ab7..47610697e4e5 100644
>>> --- a/drivers/net/bonding/bond_3ad.c
>>> +++ b/drivers/net/bonding/bond_3ad.c
>>> @@ -1375,10 +1375,12 @@ static void ad_churn_machine(struct port *port)
>>> */
>>> static void ad_tx_machine(struct port *port)
>>> {
>>> - /* check if tx timer expired, to verify that we do not send more than
>>> - * 3 packets per second
>>> - */
>>> - if (port->sm_tx_timer_counter && !(--port->sm_tx_timer_counter)) {
>>> + unsigned long now = jiffies;
>>> +
>>> + /* Check if enough time has passed since the last LACPDU sent */
>>> + if (time_after_eq(now, port->sm_tx_next_jiffies)) {
>>> + port->sm_tx_next_jiffies += ad_ticks_per_sec / AD_MAX_TX_IN_SECOND;
>>> +
>>> /* check if there is something to send */
>>> if (port->ntt && (port->sm_vars & AD_PORT_LACP_ENABLED)) {
>>> __update_lacpdu_from_port(port);
>>> @@ -1395,10 +1397,6 @@ static void ad_tx_machine(struct port *port)
>>> port->ntt = false;
>>> }
>>> }
>>> - /* restart tx timer(to verify that we will not exceed
>>> - * AD_MAX_TX_IN_SECOND
>>> - */
>>> - port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND;
>>> }
>>> }
>>>
>>> @@ -2199,9 +2197,9 @@ void bond_3ad_bind_slave(struct slave *slave)
>>> /* actor system is the bond's system */
>>> __ad_actor_update_port(port);
>>> /* tx timer(to verify that no more than MAX_TX_IN_SECOND
>>> - * lacpdu's are sent in one second)
>>> + * lacpdu's are sent in the configured interval (1 or 30 secs))
>>> */
>>> - port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND;
>>> + port->sm_tx_next_jiffies = jiffies + ad_ticks_per_sec / AD_MAX_TX_IN_SECOND;
>> If CONFIG_HZ is 1000, there is 1000 tick per second, but "ad_ticks_per_sec / AD_MAX_TX_IN_SECOND” == 10/3 == 3, so that means send lacp packets every 3 ticks ?
> Agreed, I think the math is off here.
>
> ad_ticks_per_sec is 10, it's the number of times the entire LACP
> state machine runs per second. It is unrelated to jiffies, and can't be
> used directly with jiffy units (the duration of which varies depending
> on what CONFIG_HZ is). I agree that it's confusingly similar to
> ad_delta_in_ticks, which is measured in jiffy units.
>
> You'll probably want to use msecs_to_jiffies() somewhere.
Thank you for taking the time to review my patch!
I agree, the math is off here and I plan to fix that in v2. It shouldn't
be:
port->sm_tx_next_jiffies = jiffies + ad_ticks_per_sec / AD_MAX_TX_IN_SECOND;
port->sm_tx_next_jiffies = jiffies + 3;
... instead, it should be:
port->sm_tx_next_jiffies = jiffies + HZ / AD_MAX_TX_IN_SECOND;
... which, assuming CONFIG_HZ is 1000, it would be:
port->sm_tx_next_jiffies = jiffies + 333;
This math also works in terms of the units too:
HZ = ticks / sec
AD_MAX_X_IN_SECOND = packets / sec
so:
(ticks / sec) / (packets /sec) = ticks / packet
>
> How did you test this to insure the TX machine doesn't overrun
> (i.e., exceed AD_MAX_TX_IN_SECOND LACPDU transmissions in one second)?
>
> -J
>
>>> __disable_port(port);
>>>
>>> diff --git a/include/net/bond_3ad.h b/include/net/bond_3ad.h
>>> index 2053cd8e788a..956d4cb45db1 100644
>>> --- a/include/net/bond_3ad.h
>>> +++ b/include/net/bond_3ad.h
>>> @@ -231,10 +231,7 @@ typedef struct port {
>>> mux_states_t sm_mux_state; /* state machine mux state */
>>> u16 sm_mux_timer_counter; /* state machine mux timer counter */
>>> tx_states_t sm_tx_state; /* state machine tx state */
>>> - u16 sm_tx_timer_counter; /* state machine tx timer counter
>>> - * (always on - enter to transmit
>>> - * state 3 time per second)
>>> - */
>>> + unsigned long sm_tx_next_jiffies;/* expected jiffies for next LACPDU sent */
>>> u16 sm_churn_actor_timer_counter;
>>> u16 sm_churn_partner_timer_counter;
>>> u32 churn_actor_count;
>>> --
>>> 2.43.0
>>>
>>>
>>>
> ---
> -Jay Vosburgh, jv@jvosburgh.net
>
Thanks,
Carlos
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] bonding: Improve the accuracy of LACPDU transmissions
2025-06-24 22:27 ` Carlos Bilbao
@ 2025-06-24 22:33 ` Carlos Bilbao
0 siblings, 0 replies; 7+ messages in thread
From: Carlos Bilbao @ 2025-06-24 22:33 UTC (permalink / raw)
To: Jay Vosburgh, Tonghao Zhang
Cc: carlos.bilbao, andrew+netdev, davem, edumazet, kuba, pabeni,
horms, netdev, linux-kernel, sforshee, bilbao
On 6/24/25 17:27, Carlos Bilbao wrote:
> Hello both,
>
> On 6/24/25 16:15, Jay Vosburgh wrote:
>> Tonghao Zhang <tonghao@bamaicloud.com> wrote:
>>
>>>
>>>
>>>> 2025年6月19日 03:53,carlos.bilbao@kernel.org 写道:
>>>>
>>>> From: Carlos Bilbao <carlos.bilbao@kernel.org>
>>>>
>>>> Improve the timing accuracy of LACPDU transmissions in the bonding
>>>> 802.3ad
>>>> (LACP) driver. The current approach relies on a decrementing
>>>> counter to
>>>> limit the transmission rate. In our experience, this method is
>>>> susceptible
>>>> to delays (such as those caused by CPU contention or soft lockups)
>>>> which
>>>> can lead to accumulated drift in the LACPDU send interval. Over
>>>> time, this
>>>> drift can cause synchronization issues with the top-of-rack (ToR)
>>>> switch
>>>> managing the LAG, manifesting as lag map flapping. This in turn can
>>>> trigger
>>>> temporary interface removal and potential packet loss.
>> So, you're saying that contention or soft lockups are causing
>> the queue_delayed_work() of bond_3ad_state_machine_handler() to be
>> scheduled late, and, then, because the LACPDU TX limiter is based on the
>> number of state machine executions (which should be every 100ms), it is
>> then late sending LACPDUs?
>>
>> If the core problem is that the state machine as a whole isn't
>> running regularly, how is doing a clock-based time check reliable? Or
>> should I take the word "improve" from the Subject literally, and assume
>> it's making things "better" but not "totally perfect"?
>>
>> Is the sync issue with the TOR due to missing / delayed LACPDUs,
>> or is there more to it? At the fast LACP rate, the periodic timeout is
>> 3 seconds for a nominal 1 second LACPDU interval, which is fairly
>> generous.
Also, I didn’t ignore your comments -- I agree there’s a deeper issue at
the state machine level. I’m looking into it and will get back to y'all.
>>
>>>> This patch improves stability with a jiffies-based mechanism to
>>>> track and
>>>> enforce the minimum transmission interval; keeping track of when
>>>> the next
>>>> LACPDU should be sent.
>>>>
>>>> Suggested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
>>>> Signed-off-by: Carlos Bilbao (DigitalOcean) <carlos.bilbao@kernel.org>
>>>> ---
>>>> drivers/net/bonding/bond_3ad.c | 18 ++++++++----------
>>>> include/net/bond_3ad.h | 5 +----
>>>> 2 files changed, 9 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/drivers/net/bonding/bond_3ad.c
>>>> b/drivers/net/bonding/bond_3ad.c
>>>> index c6807e473ab7..47610697e4e5 100644
>>>> --- a/drivers/net/bonding/bond_3ad.c
>>>> +++ b/drivers/net/bonding/bond_3ad.c
>>>> @@ -1375,10 +1375,12 @@ static void ad_churn_machine(struct port
>>>> *port)
>>>> */
>>>> static void ad_tx_machine(struct port *port)
>>>> {
>>>> - /* check if tx timer expired, to verify that we do not send more
>>>> than
>>>> - * 3 packets per second
>>>> - */
>>>> - if (port->sm_tx_timer_counter && !(--port->sm_tx_timer_counter)) {
>>>> + unsigned long now = jiffies;
>>>> +
>>>> + /* Check if enough time has passed since the last LACPDU sent */
>>>> + if (time_after_eq(now, port->sm_tx_next_jiffies)) {
>>>> + port->sm_tx_next_jiffies += ad_ticks_per_sec / AD_MAX_TX_IN_SECOND;
>>>> +
>>>> /* check if there is something to send */
>>>> if (port->ntt && (port->sm_vars & AD_PORT_LACP_ENABLED)) {
>>>> __update_lacpdu_from_port(port);
>>>> @@ -1395,10 +1397,6 @@ static void ad_tx_machine(struct port *port)
>>>> port->ntt = false;
>>>> }
>>>> }
>>>> - /* restart tx timer(to verify that we will not exceed
>>>> - * AD_MAX_TX_IN_SECOND
>>>> - */
>>>> - port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND;
>>>> }
>>>> }
>>>>
>>>> @@ -2199,9 +2197,9 @@ void bond_3ad_bind_slave(struct slave *slave)
>>>> /* actor system is the bond's system */
>>>> __ad_actor_update_port(port);
>>>> /* tx timer(to verify that no more than MAX_TX_IN_SECOND
>>>> - * lacpdu's are sent in one second)
>>>> + * lacpdu's are sent in the configured interval (1 or 30 secs))
>>>> */
>>>> - port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND;
>>>> + port->sm_tx_next_jiffies = jiffies + ad_ticks_per_sec /
>>>> AD_MAX_TX_IN_SECOND;
>>> If CONFIG_HZ is 1000, there is 1000 tick per second, but
>>> "ad_ticks_per_sec / AD_MAX_TX_IN_SECOND” == 10/3 == 3, so that means
>>> send lacp packets every 3 ticks ?
>> Agreed, I think the math is off here.
>>
>> ad_ticks_per_sec is 10, it's the number of times the entire LACP
>> state machine runs per second. It is unrelated to jiffies, and can't be
>> used directly with jiffy units (the duration of which varies depending
>> on what CONFIG_HZ is). I agree that it's confusingly similar to
>> ad_delta_in_ticks, which is measured in jiffy units.
>>
>> You'll probably want to use msecs_to_jiffies() somewhere.
>
>
> Thank you for taking the time to review my patch!
>
> I agree, the math is off here and I plan to fix that in v2. It shouldn't
> be:
>
> port->sm_tx_next_jiffies = jiffies + ad_ticks_per_sec /
> AD_MAX_TX_IN_SECOND;
> port->sm_tx_next_jiffies = jiffies + 3;
>
> ... instead, it should be:
>
> port->sm_tx_next_jiffies = jiffies + HZ / AD_MAX_TX_IN_SECOND;
>
> ... which, assuming CONFIG_HZ is 1000, it would be:
>
> port->sm_tx_next_jiffies = jiffies + 333;
>
> This math also works in terms of the units too:
>
> HZ = ticks / sec
> AD_MAX_X_IN_SECOND = packets / sec
>
> so:
>
> (ticks / sec) / (packets /sec) = ticks / packet
>
>
>>
>> How did you test this to insure the TX machine doesn't overrun
>> (i.e., exceed AD_MAX_TX_IN_SECOND LACPDU transmissions in one second)?
>>
>> -J
>>
>>>> __disable_port(port);
>>>>
>>>> diff --git a/include/net/bond_3ad.h b/include/net/bond_3ad.h
>>>> index 2053cd8e788a..956d4cb45db1 100644
>>>> --- a/include/net/bond_3ad.h
>>>> +++ b/include/net/bond_3ad.h
>>>> @@ -231,10 +231,7 @@ typedef struct port {
>>>> mux_states_t sm_mux_state; /* state machine mux state */
>>>> u16 sm_mux_timer_counter; /* state machine mux timer counter */
>>>> tx_states_t sm_tx_state; /* state machine tx state */
>>>> - u16 sm_tx_timer_counter; /* state machine tx timer counter
>>>> - * (always on - enter to transmit
>>>> - * state 3 time per second)
>>>> - */
>>>> + unsigned long sm_tx_next_jiffies;/* expected jiffies for next
>>>> LACPDU sent */
>>>> u16 sm_churn_actor_timer_counter;
>>>> u16 sm_churn_partner_timer_counter;
>>>> u32 churn_actor_count;
>>>> --
>>>> 2.43.0
>>>>
>>>>
>>>>
>> ---
>> -Jay Vosburgh, jv@jvosburgh.net
>>
>
> Thanks,
>
> Carlos
>
Thanks,
Carlos
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] bonding: Improve the accuracy of LACPDU transmissions
2025-06-24 21:15 ` Jay Vosburgh
2025-06-24 22:27 ` Carlos Bilbao
@ 2025-06-24 22:47 ` Seth Forshee
2025-06-25 16:06 ` Seth Forshee
1 sibling, 1 reply; 7+ messages in thread
From: Seth Forshee @ 2025-06-24 22:47 UTC (permalink / raw)
To: Jay Vosburgh
Cc: Tonghao Zhang, carlos.bilbao, andrew+netdev, davem, edumazet,
kuba, pabeni, horms, netdev, linux-kernel, bilbao
On Tue, Jun 24, 2025 at 02:15:32PM -0700, Jay Vosburgh wrote:
> Tonghao Zhang <tonghao@bamaicloud.com> wrote:
>
> >
> >
> >
> >> 2025年6月19日 03:53,carlos.bilbao@kernel.org 写道:
> >>
> >> From: Carlos Bilbao <carlos.bilbao@kernel.org>
> >>
> >> Improve the timing accuracy of LACPDU transmissions in the bonding 802.3ad
> >> (LACP) driver. The current approach relies on a decrementing counter to
> >> limit the transmission rate. In our experience, this method is susceptible
> >> to delays (such as those caused by CPU contention or soft lockups) which
> >> can lead to accumulated drift in the LACPDU send interval. Over time, this
> >> drift can cause synchronization issues with the top-of-rack (ToR) switch
> >> managing the LAG, manifesting as lag map flapping. This in turn can trigger
> >> temporary interface removal and potential packet loss.
>
> So, you're saying that contention or soft lockups are causing
> the queue_delayed_work() of bond_3ad_state_machine_handler() to be
> scheduled late, and, then, because the LACPDU TX limiter is based on the
> number of state machine executions (which should be every 100ms), it is
> then late sending LACPDUs?
>
> If the core problem is that the state machine as a whole isn't
> running regularly, how is doing a clock-based time check reliable? Or
> should I take the word "improve" from the Subject literally, and assume
> it's making things "better" but not "totally perfect"?
>
> Is the sync issue with the TOR due to missing / delayed LACPDUs,
> or is there more to it? At the fast LACP rate, the periodic timeout is
> 3 seconds for a nominal 1 second LACPDU interval, which is fairly
> generous.
To take a step back: we originally started looking at this code because
we've seen cases where TORs drop a link from the bond due to not getting
a LACPDU within the short timeout period. We don't currently know the
root cause of that problem, so this patch isn't trying to fix that issue.
But when looking at the LACPDUs that the TOR is receiving, we saw that
the packets were pretty consistently late by hundreds of milliseconds.
This patch from Carlos is trying to improve that inconsistency.
Yes, the math is incorrect, and that's partly my fault from giving
Carlos bad advice after looking at the code too quickly. But the fact
that this mistake improved the timing points to what I think may be at
the root of the inconsistency we see.
ad_tx_machine() resets port->sm_tx_timer_counter every time it reaches
zero, regardless of when the last LACPDU was sent. This means that
ad_tx_machine() is delaying LACPDU tx to the next expiration of this
periodic timer. So when Carlos made the counter very small, he made the
tx delay much shorter.
As I understand it the intention is to ensure that LACPDU tx is spaced
out by at least ~300ms, not to align them to an arbitrary ~300ms
boundary. If so, a simple improvement would be to reset the counter only
when an LACPDU is sent, then allow sending a LACPDU any time after it
reaches zero. Though I still think it makes sense to make the state
machines time-based rather than counter-based to ensure they aren't
sensitive to delays in running the delayed work.
Thanks,
Seth
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] bonding: Improve the accuracy of LACPDU transmissions
2025-06-24 22:47 ` Seth Forshee
@ 2025-06-25 16:06 ` Seth Forshee
0 siblings, 0 replies; 7+ messages in thread
From: Seth Forshee @ 2025-06-25 16:06 UTC (permalink / raw)
To: Jay Vosburgh
Cc: Tonghao Zhang, carlos.bilbao, andrew+netdev, davem, edumazet,
kuba, pabeni, horms, netdev, linux-kernel, bilbao
On Tue, Jun 24, 2025 at 05:47:13PM -0500, Seth Forshee wrote:
> As I understand it the intention is to ensure that LACPDU tx is spaced
> out by at least ~300ms, not to align them to an arbitrary ~300ms
> boundary. If so, a simple improvement would be to reset the counter only
> when an LACPDU is sent, then allow sending a LACPDU any time after it
> reaches zero. Though I still think it makes sense to make the state
> machines time-based rather than counter-based to ensure they aren't
> sensitive to delays in running the delayed work.
Sent a patch which only changes when the counter is reset:
https://lore.kernel.org/all/20250625-fix-lacpdu-jitter-v1-1-4d0ee627e1ba@kernel.org/
On an unloaded system the timing of LACPDUs is consistent within ~10ms
after this change.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2025-06-25 16:06 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-18 19:53 [PATCH] bonding: Improve the accuracy of LACPDU transmissions carlos.bilbao
2025-06-20 3:15 ` Tonghao Zhang
2025-06-24 21:15 ` Jay Vosburgh
2025-06-24 22:27 ` Carlos Bilbao
2025-06-24 22:33 ` Carlos Bilbao
2025-06-24 22:47 ` Seth Forshee
2025-06-25 16:06 ` Seth Forshee
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).