bnxt_en: Incorrect tx timestamp report

netdev.vger.kernel.org archive mirror

* bnxt_en: Incorrect tx timestamp report
@ 2025-03-20 14:35 Kamil Zaripov
  2025-03-20 14:48 ` Andrew Lunn
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Kamil Zaripov @ 2025-03-20 14:35 UTC (permalink / raw)
  To: netdev

Hi all,

I've encountered a bug in the bnxt_en driver and I am unsure about the correct approach to fix it. Every 2^48 nanoseconds (or roughly 78.19 hours) there is a probability that the hardware timestamp for a sent packet may deviate by either 2^48 nanoseconds less or 2^47 nanoseconds more compared to the actual time.

This issue likely occurs within the bnxt_async_event_process function when handling the ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE event. It appears that the payload of this event contains bits 48–63 of the PHC timer counter. During event handling, this function reads bits 0–47 of the same counter to combine them and subsequently updates the cycle_last field within the struct timecounter. The relevant code can be found here:
https://elixir.bootlin.com/linux/v6.13.7/source/drivers/net/ethernet/broadcom/bnxt/bnxt.c#L2829-L2833

The issue arises if bits 48–63 of the PHC counter increment by 1 between sending the ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE event and its actual handling by the driver. In such a case, cycle_last becomes approximately 2^48 nanoseconds behind the real-time value.
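For scale, 2^48 ns is 281,474,976,710,656 ns, or about 78.19 hours. The race described above can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions, not the driver's code: `phc_combine` and `phc_combine_error` are hypothetical names, and the sketch assumes the event payload holds bits 48-63 while bits 0-47 are read separately at handling time.

```c
#include <stdint.h>

#define PHC_LOW_BITS 48
#define PHC_LOW_MASK ((UINT64_C(1) << PHC_LOW_BITS) - 1)

/*
 * Combine the upper 16 bits carried by the event payload with a
 * later read of the lower 48 bits (hypothetical helper).
 */
static uint64_t phc_combine(uint16_t event_high, uint64_t low48)
{
    return ((uint64_t)event_high << PHC_LOW_BITS) | (low48 & PHC_LOW_MASK);
}

/*
 * Signed difference, in counter units, between the combined value and
 * the true 64-bit counter at the moment the lower bits were read.
 */
static int64_t phc_combine_error(uint64_t true_counter, uint16_t event_high)
{
    return (int64_t)(phc_combine(event_high, true_counter) - true_counter);
}
```

If the true counter has already advanced to upper bits 5 while the event still carries 4, `phc_combine_error()` returns -2^48, which matches the size of the jump reported above.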

A possibly related issue involves the BCM57502 network card, which seemingly possesses only a single PHC device. However, the bnxt_en driver creates four PHC Linux devices when operating in quad-port mode. Consequently, clock synchronization daemons like phc2sys attempt to independently synchronize the system clock to each of these four PHC clocks. This scenario can lead to unstable synchronization and might also trigger additional ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE events.

Given these issues, I have two questions:

1. Would it be beneficial to modify the bnxt_en driver to create only a single PHC Linux device for network cards that physically have only one PHC?

2. Is there a method available to read the complete 64-bit PHC counter to mitigate the observed problem of 2^48-nanosecond time jumps?

Best regards,
Zaripov Kamil

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-20 14:35 bnxt_en: Incorrect tx timestamp report Kamil Zaripov
@ 2025-03-20 14:48 ` Andrew Lunn
       [not found]   ` <CAGtf3ibFAidzpFKm1o5zmZF3Neu8MgdXp_n_Wt+mv8M9YZhhug@mail.gmail.com>
  2025-03-20 16:21   ` Vadim Fedorenko
  2025-03-20 15:56 ` Pavan Chebbi
  2025-03-20 17:11 ` Jacob Keller
  2 siblings, 2 replies; 18+ messages in thread
From: Andrew Lunn @ 2025-03-20 14:48 UTC (permalink / raw)
  To: Kamil Zaripov; +Cc: netdev

> 2. Is there a method available to read the complete 64-bit PHC
> counter to mitigate the observed problem of 2^48-nanosecond time
> jumps?

The usual workaround is to read the upper part, the lower part, and
the upper part again. If you get two different values for the upper
part, do it all again, until you get consistent values.

Look around other PTP drivers, there is probably code you can
copy/paste.

	Andrew
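The read-high, read-low, read-high pattern described above might look roughly like the sketch below. It assumes a counter split into two readable 32-bit halves, which is the generic case and not necessarily what bnxt_en hardware exposes. The accessor struct and the simulated device are hypothetical and exist only so the example is self-contained; a real driver would use its own register accessors.

```c
#include <stdint.h>

/* Hypothetical register accessors for a counter split into two halves. */
struct split_counter_ops {
    uint32_t (*read_hi)(void *ctx);
    uint32_t (*read_lo)(void *ctx);
};

/* Retry until the upper half is the same before and after the low read. */
static uint64_t read_split_counter(const struct split_counter_ops *ops,
                                   void *ctx)
{
    uint32_t hi, lo, hi2;

    do {
        hi = ops->read_hi(ctx);
        lo = ops->read_lo(ctx);
        hi2 = ops->read_hi(ctx);
    } while (hi != hi2); /* upper half changed mid-read: try again */

    return ((uint64_t)hi << 32) | lo;
}

/* Simulated device: the counter advances by 'step' after each low read. */
struct fake_dev {
    uint64_t counter;
    uint64_t step;
};

static uint32_t fake_read_hi(void *ctx)
{
    struct fake_dev *d = ctx;

    return (uint32_t)(d->counter >> 32);
}

static uint32_t fake_read_lo(void *ctx)
{
    struct fake_dev *d = ctx;
    uint32_t lo = (uint32_t)d->counter;

    d->counter += d->step;
    return lo;
}
```

With the simulated device starting at 0x1FFFFFFFF, the first pass sees the upper half change from 1 to 2 and retries, so the caller never gets a value stitched together from two different epochs.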

^ permalink raw reply	[flat|nested] 18+ messages in thread

[parent not found: <CAGtf3ibFAidzpFKm1o5zmZF3Neu8MgdXp_n_Wt+mv8M9YZhhug@mail.gmail.com>]

* Re: bnxt_en: Incorrect tx timestamp report
       [not found]   ` <CAGtf3ibFAidzpFKm1o5zmZF3Neu8MgdXp_n_Wt+mv8M9YZhhug@mail.gmail.com>
@ 2025-03-20 15:14     ` Kamil Zaripov
  0 siblings, 0 replies; 18+ messages in thread
From: Kamil Zaripov @ 2025-03-20 15:14 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev

Yes, I know, but the issue is that there seems to be no way to read the
upper bits (48-63) other than receiving them from
ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE or setting them in a settime64
call. See the comments on commit
https://github.com/torvalds/linux/commit/24ac1ecd524065cdcf8c27dc85ae37eccce8f2f6

Kamil.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-20 14:48 ` Andrew Lunn
       [not found]   ` <CAGtf3ibFAidzpFKm1o5zmZF3Neu8MgdXp_n_Wt+mv8M9YZhhug@mail.gmail.com>
@ 2025-03-20 16:21   ` Vadim Fedorenko
  1 sibling, 0 replies; 18+ messages in thread
From: Vadim Fedorenko @ 2025-03-20 16:21 UTC (permalink / raw)
  To: Andrew Lunn, Pavan Chebbi, Kamil Zaripov
  Cc: netdev, Michael Chan, Andrew Gospodarek

On 20/03/2025 14:48, Andrew Lunn wrote:
>> 2. Is there a method available to read the complete 64-bit PHC
>> counter to mitigate the observed problem of 2^48-nanosecond time
>> jumps?
> 
> The usual workaround is to read the upper part, the lower part, and
> the upper part again. If you get two different values for the upper
> part, do it all again, until you get consistent values.
> 
> Look around other PTP drivers, there is probably code you can
> copy/paste.

This part of the driver is tricky. ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE
reports only 16 bits of 64 bits timestamp, 48-63 range, which doesn't
overlap with anything else. The assumption is that when the driver
processes this event, the register which reports bits of range 0-47 has
already overflowed and holds new value. Unfortunately, there is a time
gap between register overflow and update of MSB of the cached timestamp.

There is no easy way to solve this problem, but we may add additional
check on every read, probably... Not sure, though
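One possible shape for such a per-read check, purely as a sketch, is to extend each 48-bit reading against the last full 64-bit value the software has seen and to treat a smaller result as a wrap of the lower bits. This is an assumption about what the check could look like, not the driver's method. It only holds if reads happen more often than once per 2^48 counter units and the counter never steps backwards in between. `extend_low48` is a hypothetical helper name.

```c
#include <stdint.h>

#define LOW_BITS 48
#define LOW_MASK ((UINT64_C(1) << LOW_BITS) - 1)

/*
 * Extend a fresh 48-bit reading to 64 bits using the last known full
 * value. If the result would go backwards, the lower 48 bits must have
 * wrapped since 'last_full' was recorded, so carry into the upper bits.
 */
static uint64_t extend_low48(uint64_t last_full, uint64_t low48)
{
    uint64_t full = (last_full & ~LOW_MASK) | (low48 & LOW_MASK);

    if (full < last_full)
        full += UINT64_C(1) << LOW_BITS;

    return full;
}
```

The design choice here is that the upper bits are derived from observed wraps of the lower bits rather than from a separately delivered event, so there is no window in which the two parts disagree.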

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-20 14:35 bnxt_en: Incorrect tx timestamp report Kamil Zaripov
  2025-03-20 14:48 ` Andrew Lunn
@ 2025-03-20 15:56 ` Pavan Chebbi
  2025-03-20 16:21   ` Kamil Zaripov
  2025-03-20 16:26   ` Vadim Fedorenko
  2025-03-20 17:11 ` Jacob Keller
  2 siblings, 2 replies; 18+ messages in thread
From: Pavan Chebbi @ 2025-03-20 15:56 UTC (permalink / raw)
  To: Kamil Zaripov; +Cc: netdev, Michael Chan, Andrew Gospodarek

On Thu, Mar 20, 2025 at 8:07 PM Kamil Zaripov <zaripov-kamil@avride.ai> wrote:
>
> Hi all,
>
> I've encountered a bug in the bnxt_en driver and I am unsure about the correct approach to fix it. Every 2^48 nanoseconds (or roughly 78.19 hours) there is a probability that the hardware timestamp for a sent packet may deviate by either 2^48 nanoseconds less or 2^47 nanoseconds more compared to the actual time.
>
> This issue likely occurs within the bnxt_async_event_process function when handling the ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE event. It appears that the payload of this event contains bits 48–63 of the PHC timer counter. During event handling, this function reads bits 0–47 of the same counter to combine them and subsequently updates the cycle_last field within the struct timecounter. The relevant code can be found here:
> https://elixir.bootlin.com/linux/v6.13.7/source/drivers/net/ethernet/broadcom/bnxt/bnxt.c#L2829-L2833
>
> The issue arises if bits 48–63 of the PHC counter increment by 1 between sending the ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE event and its actual handling by the driver. In such a case, cycle_last becomes approximately 2^48 nanoseconds behind the real-time value.
>
> A possibly related issue involves the BCM57502 network card, which seemingly possesses only a single PHC device. However, the bnxt_en driver creates four PHC Linux devices when operating in quad-port mode. Consequently, clock synchronization daemons like phc2sys attempt to independently synchronize the system clock to each of these four PHC clocks. This scenario can lead to unstable synchronization and might also trigger additional ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE events.
>
> Given these issues, I have two questions:
>
> 1. Would it be beneficial to modify the bnxt_en driver to create only a single PHC Linux device for network cards that physically have only one PHC?

It's not clear to me if you are facing this issue when the PHC is
shared between multiple hosts or if you are running a single host NIC.
In the cases where a PHC is shared across multiple hosts, the driver
identifies such a configuration and switches to non-real time PHC
access mode.
https://web.git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/drivers/net/ethernet/broadcom/bnxt?id=85036aee1938d65da4be6ae1bc7e5e7e30b567b9
If you are using a configuration like the multi host, can you please
make sure you have this patch?

Let me know if you are not in the multi-host config. Do post the
ethtool -i output to help know the firmware version.

>
> 2. Is there a method available to read the complete 64-bit PHC counter to mitigate the observed problem of 2^48-nanosecond time jumps?
>
> Best regards,
> Zaripov Kamil
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-20 15:56 ` Pavan Chebbi
@ 2025-03-20 16:21   ` Kamil Zaripov
  2025-03-20 16:26   ` Vadim Fedorenko
  1 sibling, 0 replies; 18+ messages in thread
From: Kamil Zaripov @ 2025-03-20 16:21 UTC (permalink / raw)
  To: Pavan Chebbi; +Cc: netdev, Michael Chan, Andrew Gospodarek

> It's not clear to me if you are facing this issue when the PHC is
> shared between multiple hosts or if you are running a single host NIC.

I think that BCM57502 works in single host mode but I'm not sure: it
is ADLINK's Ampere Altra Developer Platform and maybe BMC can see this
NIC as well. But all actions over PHC are performed from CPU only.

> In the cases where a PHC is shared across multiple hosts, the driver
> identifies such a configuration and switches to non-real time PHC
> access mode.

Is there a way to tell which access mode the driver is using for the PHC?

> https://web.git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/drivers/net/ethernet/broadcom/bnxt?id=85036aee1938d65da4be6ae1bc7e5e7e30b567b9
> If you are using a configuration like the multi host, can you please
> make sure you have this patch?

We are using upstream Linux v6.6.39, which includes commit
85036aee1938d65da4be6ae1bc7e5e7e30b567b9.


> Let me know if you are not in the multi-host config. Do post the
> ethtool -i output to help know the firmware version.

Here is output of this command for the first port of this NIC:

    $ ethtool -i enP2s1f0np0
    driver: bnxt_en
    version: 6.6.39
    firmware-version: 224.0.110.0/pkg 224.1.60.0
    expansion-rom-version:
    bus-info: 0002:01:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: yes
    supports-register-dump: yes
    supports-priv-flags: no

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-20 15:56 ` Pavan Chebbi
  2025-03-20 16:21   ` Kamil Zaripov
@ 2025-03-20 16:26   ` Vadim Fedorenko
  1 sibling, 0 replies; 18+ messages in thread
From: Vadim Fedorenko @ 2025-03-20 16:26 UTC (permalink / raw)
  To: Pavan Chebbi, Kamil Zaripov; +Cc: netdev, Michael Chan, Andrew Gospodarek

On 20/03/2025 15:56, Pavan Chebbi wrote:
> On Thu, Mar 20, 2025 at 8:07 PM Kamil Zaripov <zaripov-kamil@avride.ai> wrote:
>>
>> Hi all,
>>
>> I've encountered a bug in the bnxt_en driver and I am unsure about the correct approach to fix it. Every 2^48 nanoseconds (or roughly 78.19 hours) there is a probability that the hardware timestamp for a sent packet may deviate by either 2^48 nanoseconds less or 2^47 nanoseconds more compared to the actual time.
>>
>> This issue likely occurs within the bnxt_async_event_process function when handling the ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE event. It appears that the payload of this event contains bits 48–63 of the PHC timer counter. During event handling, this function reads bits 0–47 of the same counter to combine them and subsequently updates the cycle_last field within the struct timecounter. The relevant code can be found here:
>> https://elixir.bootlin.com/linux/v6.13.7/source/drivers/net/ethernet/broadcom/bnxt/bnxt.c#L2829-L2833
>>
>> The issue arises if bits 48–63 of the PHC counter increment by 1 between sending the ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE event and its actual handling by the driver. In such a case, cycle_last becomes approximately 2^48 nanoseconds behind the real-time value.
>>
>> A possibly related issue involves the BCM57502 network card, which seemingly possesses only a single PHC device. However, the bnxt_en driver creates four PHC Linux devices when operating in quad-port mode. Consequently, clock synchronization daemons like phc2sys attempt to independently synchronize the system clock to each of these four PHC clocks. This scenario can lead to unstable synchronization and might also trigger additional ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE events.
>>
>> Given these issues, I have two questions:
>>
>> 1. Would it be beneficial to modify the bnxt_en driver to create only a single PHC Linux device for network cards that physically have only one PHC?
> 
> It's not clear to me if you are facing this issue when the PHC is
> shared between multiple hosts or if you are running a single host NIC.
> In the cases where a PHC is shared across multiple hosts, the driver
> identifies such a configuration and switches to non-real time PHC
> access mode.
> https://web.git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/drivers/net/ethernet/broadcom/bnxt?id=85036aee1938d65da4be6ae1bc7e5e7e30b567b9
> If you are using a configuration like the multi host, can you please
> make sure you have this patch?
> 
> Let me know if you are not in the multi-host config. Do post the
> ethtool -i output to help know the firmware version.

AFAIU, the setup is a single host with a multi-port NIC, which exports
several PTP devices, all of them using RTC mode. But as the HW has a
single physical PHC, it's not possible to properly discipline all of
the PTP devices in parallel. I think mlx5 was adjusted to export only a
single PHC device for multi-port configurations for the very same
reasons.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-20 14:35 bnxt_en: Incorrect tx timestamp report Kamil Zaripov
  2025-03-20 14:48 ` Andrew Lunn
  2025-03-20 15:56 ` Pavan Chebbi
@ 2025-03-20 17:11 ` Jacob Keller
  2025-03-21 15:17   ` Kamil Zaripov
  2 siblings, 1 reply; 18+ messages in thread
From: Jacob Keller @ 2025-03-20 17:11 UTC (permalink / raw)
  To: Kamil Zaripov, netdev

On 3/20/2025 7:35 AM, Kamil Zaripov wrote:
> 1. Would it be beneficial to modify the bnxt_en driver to create only a single PHC Linux device for network cards that physically have only one PHC?
> 

That depends. If it has only one underlying clock, but each PF has its
own register space, it may functionally be independent clocks in
practice. I don't know the bnxt_en driver or hardware well enough to
know if that is the case.

If it really is one clock with one set of registers to control it, then
it should only expose one PHC. This may be tricky depending on the
driver design. (See ice as an example where we've had a lot of
challenges in this space because of the multiple PFs)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-20 17:11 ` Jacob Keller
@ 2025-03-21 15:17   ` Kamil Zaripov
  2025-03-21 17:33     ` Michael Chan
  0 siblings, 1 reply; 18+ messages in thread
From: Kamil Zaripov @ 2025-03-21 15:17 UTC (permalink / raw)
  To: Jacob Keller; +Cc: netdev

> That depends. If it has only one underlying clock, but each PF has its
> own register space, it may functionally be independent clocks in
> practice. I don't know the bnxt_en driver or hardware well enough to
> know if that is the case.

> If it really is one clock with one set of registers to control it, then
> it should only expose one PHC. This may be tricky depending on the
> driver design. (See ice as an example where we've had a lot of
> challenges in this space because of the multiple PFs)

I can only guess, from looking at the __bnxt_hwrm_ptp_qcfg function,
that it depends on hardware and/or firmware (see
https://elixir.bootlin.com/linux/v6.13.7/source/drivers/net/ethernet/broadcom/bnxt/bnxt.c#L9427-L9431).
I hope that broadcom folks can clarify this.

> This part of the driver is tricky. ASYNC_EVENT_CMPL_EVENT_ID_PHC_UPDATE
> reports only 16 bits of 64 bits timestamp, 48-63 range, which doesn't
> overlap with anything else. The assumption is that when the driver
> processes this event, the register which reports bits of range 0-47 has
> already overflowed and holds new value. Unfortunately, there is a time
> gap between register overflow and update of MSB of the cached timestamp.

Indeed, PHC counter reading is pretty complex in the case of the
bnxt_en driver. The final timestamp sent to userspace is assembled
from three parts, which are stored in different places and updated by
different mechanisms. Apparently, in some corner cases the driver
fails to produce the correct result.

> There is no easy way to solve this problem, but we may add additional
> check on every read, probably... Not sure, though

Right now I've just added an extra check to the
bnxt_ptp_rtc_timecounter_init function. It should work in some cases,
but I do not believe that it is the right way to fix the original
issue.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-21 15:17   ` Kamil Zaripov
@ 2025-03-21 17:33     ` Michael Chan
  2025-03-24 15:04       ` Pavan Chebbi
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Chan @ 2025-03-21 17:33 UTC (permalink / raw)
  To: Kamil Zaripov; +Cc: Jacob Keller, netdev

On Fri, Mar 21, 2025 at 8:17 AM Kamil Zaripov <zaripov-kamil@avride.ai> wrote:
>
> > That depends. If it has only one underlying clock, but each PF has its
> > own register space, it may functionally be independent clocks in
> > practice. I don't know the bnxt_en driver or hardware well enough to
> > know if that is the case.
>
> > If it really is one clock with one set of registers to control it, then
> > it should only expose one PHC. This may be tricky depending on the
> > driver design. (See ice as an example where we've had a lot of
> > challenges in this space because of the multiple PFs)
>
> I can only guess, from looking at the __bnxt_hwrm_ptp_qcfg function,
> that it depends on hardware and/or firmware (see
> https://elixir.bootlin.com/linux/v6.13.7/source/drivers/net/ethernet/broadcom/bnxt/bnxt.c#L9427-L9431).
> I hope that broadcom folks can clarify this.
>

It is one physical PHC per chip.  Each function has access to the
shared PHC.   It won't work properly when multiple functions try to
adjust the PHC independently.  That's why we use the non-RTC mode when
the PHC is shared in multi-function mode.  Pavan can add more details
on this.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-21 17:33     ` Michael Chan
@ 2025-03-24 15:04       ` Pavan Chebbi
  2025-03-25 10:13         ` Kamil Zaripov
  0 siblings, 1 reply; 18+ messages in thread
From: Pavan Chebbi @ 2025-03-24 15:04 UTC (permalink / raw)
  To: Michael Chan; +Cc: Kamil Zaripov, Jacob Keller, Linux Netdev List

On Fri, 21 Mar, 2025, 11:03 pm Michael Chan, <michael.chan@broadcom.com> wrote:

> On Fri, Mar 21, 2025 at 8:17 AM Kamil Zaripov <zaripov-kamil@avride.ai> wrote:
> >
> > > That depends. If it has only one underlying clock, but each PF has its
> > > own register space, it may functionally be independent clocks in
> > > practice. I don't know the bnxt_en driver or hardware well enough to
> > > know if that is the case.
> >
> > > If it really is one clock with one set of registers to control it, then
> > > it should only expose one PHC. This may be tricky depending on the
> > > driver design. (See ice as an example where we've had a lot of
> > > challenges in this space because of the multiple PFs)
> >
> > I can only guess, from looking at the __bnxt_hwrm_ptp_qcfg function,
> > that it depends on hardware and/or firmware (see
> >
> https://elixir.bootlin.com/linux/v6.13.7/source/drivers/net/ethernet/broadcom/bnxt/bnxt.c#L9427-L9431
> ).
> > I hope that broadcom folks can clarify this.
> >
>
> It is one physical PHC per chip.  Each function has access to the
> shared PHC.   It won't work properly when multiple functions try to
> adjust the PHC independently.  That's why we use the non-RTC mode when
> the PHC is shared in multi-function mode.  Pavan can add more details
> on this.
>

Yes, that's correct. It's one PHC shared across functions. The way we
handle multiple functions accessing the shared PHC is by firmware
allowing only one function to adjust the frequency. All the other
functions' adjustments are ignored. However, needless to say, they all
still receive the latest timestamps. As I recall, this event design was
an earlier version of our multi host support implementation where the
rollover was being tracked in the firmware.

The latest driver handles the rollover on its own and we don't need the
firmware to tell us. I checked with the firmware team and I gather that
the version you are using is very old. Firmware version 230.x onwards,
you should not receive this event for rollovers. Is it possible for you
to update the firmware? Do you have access to a more recent (230+)
firmware?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-24 15:04       ` Pavan Chebbi
@ 2025-03-25 10:13         ` Kamil Zaripov
  2025-03-25 10:41           ` Vadim Fedorenko
  0 siblings, 1 reply; 18+ messages in thread
From: Kamil Zaripov @ 2025-03-25 10:13 UTC (permalink / raw)
  To: Pavan Chebbi; +Cc: Michael Chan, Jacob Keller, Linux Netdev List


> On 24 Mar 2025, at 17:04, Pavan Chebbi <pavan.chebbi@broadcom.com> wrote:
> 
>> On Fri, 21 Mar, 2025, 11:03 pm Michael Chan, <michael.chan@broadcom.com> wrote:
>> 
>> > On Fri, Mar 21, 2025 at 8:17 AM Kamil Zaripov <zaripov-kamil@avride.ai> wrote:
>> >
>> > > That depends. If it has only one underlying clock, but each PF has its
>> > > own register space, it may functionally be independent clocks in
>> > > practice. I don't know the bnxt_en driver or hardware well enough to
>> > > know if that is the case.
>> >
>> > > If it really is one clock with one set of registers to control it, then
>> > > it should only expose one PHC. This may be tricky depending on the
>> > > driver design. (See ice as an example where we've had a lot of
>> > > challenges in this space because of the multiple PFs)
>> >
>> > I can only guess, from looking at the __bnxt_hwrm_ptp_qcfg function,
>> > that it depends on hardware and/or firmware (see
>> > https://elixir.bootlin.com/linux/v6.13.7/source/drivers/net/ethernet/broadcom/bnxt/bnxt.c#L9427-L9431).
>> > I hope that broadcom folks can clarify this.
>> >
>> 
>> It is one physical PHC per chip.  Each function has access to the
>> shared PHC.   It won't work properly when multiple functions try to
>> adjust the PHC independently.  That's why we use the non-RTC mode when
>> the PHC is shared in multi-function mode.  Pavan can add more details
>> on this.
> Yes, that's correct. It's one PHC shared across functions. The way we handle multiple
> functions accessing the shared PHC is by firmware allowing only one function to adjust
> the frequency. All the other functions' adjustments are ignored. ...

I guess I don’t understand how this works. Am I right that if a userspace program changes the frequency of PHC devices 0, 1, 2 and 3 (one for each port on the NIC), the driver will send the PHC frequency change four times, but the firmware will drop three of these commands and apply only one? How can I tell which PHC actually represents the adjustable clock and which ones are phony?

Another thing that I cannot understand is the so-called RTC and non-RTC modes. Is there any documentation that describes them? Or specific parts of the driver that change its behavior between RTC and non-RTC mode?

> … However, needless to say,
> they all still receive the latest timestamps. As I recall, this event design was an earlier
> version of our multi host support implementation where the rollover was being tracked in
> the firmware. 

From which version does the bnxt_en driver track rollover on the driver side rather than on the firmware side?

> The latest driver handles the rollover on its own and we don't need the firmware to tell us.
> I checked with the firmware team and I gather that the version you are using is very old. 
> Firmware version 230.x onwards, you should not receive this event for rollovers.
> Is it possible for you to update the firmware? Do you have access to a more recent (230+) firmware?

Yes, I can update the firmware if you can tell me where I can find the latest firmware and the update instructions.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-25 10:13         ` Kamil Zaripov
@ 2025-03-25 10:41           ` Vadim Fedorenko
  2025-03-25 12:24             ` Pavan Chebbi
  2025-03-26 13:50             ` Kamil Zaripov
  0 siblings, 2 replies; 18+ messages in thread
From: Vadim Fedorenko @ 2025-03-25 10:41 UTC (permalink / raw)
  To: Kamil Zaripov; +Cc: Michael Chan, Jacob Keller, Pavan Chebbi, Linux Netdev List

On 25/03/2025 10:13, Kamil Zaripov wrote:
> 
>> On 24 Mar 2025, at 17:04, Pavan Chebbi <pavan.chebbi@broadcom.com> wrote:
>>
>>> On Fri, 21 Mar, 2025, 11:03 pm Michael Chan, <michael.chan@broadcom.com> wrote:
>>>
>>>> On Fri, Mar 21, 2025 at 8:17 AM Kamil Zaripov <zaripov-kamil@avride.ai> wrote:
>>>>
>>>>> That depends. If it has only one underlying clock, but each PF has its
>>>>> own register space, it may functionally be independent clocks in
>>>>> practice. I don't know the bnxt_en driver or hardware well enough to
>>>>> know if that is the case.
>>>>
>>>>> If it really is one clock with one set of registers to control it, then
>>>>> it should only expose one PHC. This may be tricky depending on the
>>>>> driver design. (See ice as an example where we've had a lot of
>>>>> challenges in this space because of the multiple PFs)
>>>>
>>>> I can only guess, from looking at the __bnxt_hwrm_ptp_qcfg function,
>>>> that it depends on hardware and/or firmware (see
>>>> https://elixir.bootlin.com/linux/v6.13.7/source/drivers/net/ethernet/broadcom/bnxt/bnxt.c#L9427-L9431).
>>>> I hope that broadcom folks can clarify this.
>>>>
>>>
>>> It is one physical PHC per chip.  Each function has access to the
>>> shared PHC.   It won't work properly when multiple functions try to
>>> adjust the PHC independently.  That's why we use the non-RTC mode when
>>> the PHC is shared in multi-function mode.  Pavan can add more details
>>> on this.
>> Yes, that's correct. It's one PHC shared across functions. The way we handle multiple
>> functions accessing the shared PHC is by firmware allowing only one function to adjust
>> the frequency. All the other functions' adjustments are ignored. ...
> 
> I guess I don’t understand how this works. Am I right that if a userspace program changes the frequency of PHC devices 0, 1, 2 and 3 (one for each port on the NIC), the driver will send the PHC frequency change four times, but the firmware will drop three of these commands and apply only one? How can I tell which PHC actually represents the adjustable clock and which ones are phony?

It can be any of the PHC devices; usually the first one that tries to
adjust will be used.

> 
> Another thing that I cannot understand is the so-called RTC and non-RTC modes. Is there any documentation that describes them? Or specific parts of the driver that change its behavior between RTC and non-RTC mode?

Generally, non-RTC means free-running HW PHC clock with timecounter
adjustment on top of it. With RTC mode every adjfine() call tries to
adjust HW configuration to change the slope of PHC.
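As a rough illustration of the non-RTC idea described above, the sketch below keeps the hardware counter free-running and applies all adjustments in a software structure layered on top of it. It is loosely modeled on the kernel's `struct timecounter` but heavily simplified (no fractional nanosecond tracking), and every `sw_tc_*` name is hypothetical. It is not the bnxt_en implementation.

```c
#include <stdint.h>

/* Simplified software clock layered over a free-running counter. */
struct sw_timecounter {
    uint64_t cycle_last; /* raw counter value at the last update */
    uint64_t nsec;       /* software time corresponding to cycle_last */
    uint64_t mask;       /* valid bits of the raw counter */
    uint32_t mult;       /* cycles-to-ns multiplier */
    uint32_t shift;      /* cycles-to-ns shift */
};

/* Advance the software clock to the raw counter value 'cycle_now'. */
static uint64_t sw_tc_read(struct sw_timecounter *tc, uint64_t cycle_now)
{
    /* Masking makes the delta correct across a counter wrap. */
    uint64_t delta = (cycle_now - tc->cycle_last) & tc->mask;

    tc->nsec += (delta * tc->mult) >> tc->shift;
    tc->cycle_last = cycle_now;
    return tc->nsec;
}

/* Step the software clock; the hardware counter is left untouched. */
static void sw_tc_adjtime(struct sw_timecounter *tc, int64_t delta_ns)
{
    tc->nsec += (uint64_t)delta_ns;
}
```

In this model a frequency adjustment would change `mult` rather than any hardware register, which is why several functions can share one free-running counter without interfering with each other. It also shows why a wrong `cycle_last` matters: the delta computed on the next read is off by the same amount.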

> 
>> … However, needless to say,
>> they all still receive the latest timestamps. As I recall, this event design was an earlier
>> version of our multi host support implementation where the rollover was being tracked in
>> the firmware.
> 
> From which version does the bnxt_en driver track rollover on the driver side rather than on the firmware side?

It was done a couple of years ago, in 5.x era.

> 
>> The latest driver handles the rollover on its own and we don't need the firmware to tell us.
>> I checked with the firmware team and I gather that the version you are using is very old.
>> Firmware version 230.x onwards, you should not receive this event for rollovers.
>> Is it possible for you to update the firmware? Do you have access to a more recent (230+) firmware?
> 
> Yes, I can update the firmware if you can tell me where I can find the latest firmware and the update instructions.
> 

Broadcom's web site has a pretty easy-to-use support portal with NIC
firmware publicly available. The current version is 232 and it has all
the improvements Pavan mentioned.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-25 10:41           ` Vadim Fedorenko
@ 2025-03-25 12:24             ` Pavan Chebbi
  2025-03-26 13:50             ` Kamil Zaripov
  1 sibling, 0 replies; 18+ messages in thread
From: Pavan Chebbi @ 2025-03-25 12:24 UTC (permalink / raw)
  To: Vadim Fedorenko
  Cc: Kamil Zaripov, Michael Chan, Jacob Keller, Linux Netdev List

> > Yes, I can update the firmware if you can tell me where I can find the latest firmware and the update instructions.
> >
>
> Broadcom's web site has a pretty easy-to-use support portal with NIC
> firmware publicly available. The current version is 232 and it has all
> the improvements Pavan mentioned.
>
Thanks, Vadim, for chiming in. I guess you answered all of Kamil's questions.
I am curious about Kamil's use case of running PTP on 4 ports (in a
single host?), which seems to be using RTC mode.
As Vadim pointed out earlier, this cannot be an accurate config
given that we run a shared PHC.
Can Kamil give details of his configuration?


* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-25 10:41           ` Vadim Fedorenko
  2025-03-25 12:24             ` Pavan Chebbi
@ 2025-03-26 13:50             ` Kamil Zaripov
  2025-03-26 20:31               ` Jacob Keller
  2025-03-27 13:16               ` Pavan Chebbi
  1 sibling, 2 replies; 18+ messages in thread
From: Kamil Zaripov @ 2025-03-26 13:50 UTC (permalink / raw)
  To: Vadim Fedorenko
  Cc: Michael Chan, Jacob Keller, Pavan Chebbi, Linux Netdev List

> On 25 Mar 2025, at 12:41, Vadim Fedorenko <vadim.fedorenko@linux.dev> wrote:
> 
> On 25/03/2025 10:13, Kamil Zaripov wrote:
>> 
>> I guess I don’t understand how does it work. Am I right that if userspace program changes frequency of PHC devices 0,1,2,3 (one for each port present in NIC) driver will send PHC frequency change 4 times but firmware will drop 3 of these frequency change commands and will pick up only one? How can I understand which PHC will actually represent adjustable clock and which one is phony?
> 
> It can be any of PHC devices, mostly the first to try to adjust will be used.

I believe that randomly selecting one of the PHC clocks to control the actual PHC in the NIC, and directing commands received on the other clocks to /dev/null, is quite unexpected behavior for userspace applications.

>> Another thing that I cannot understand is so-called RTC and non-RTC mode. Is there any documentation that describes it? Or specific parts of the driver that change its behavior on for RTC and non-RTC mode?
> 
> Generally, non-RTC means free-running HW PHC clock with timecounter
> adjustment on top of it. With RTC mode every adjfine() call tries to
> adjust HW configuration to change the slope of PHC.

Just to clarify:

Am I right that in RTC mode:
1.1. All 64 bits of the PHC counter are stored on the NIC (both the “readable” 0–47 bits and the higher 48–63 bits).
1.2. When userspace attempts to change the PHC counter value (using adjtime or settime), these changes are propagated to the NIC via the PORT_MAC_CFG_REQ_ENABLES_PTP_ADJ_PHASE and FUNC_PTP_CFG_REQ_ENABLES_PTP_SET_TIME requests.
1.3. If one port of a four-port NIC is updated, the change is propagated to all other ports via the ASYNC_EVENT_CMPL_PHC_UPDATE_EVENT_DATA1_FLAGS_PHC_RTC_UPDATE event. As a result, all four instances of the bnxt_en driver receive the event with the high 48–63 bits of the counter in payload. They then asynchronously read the 0–47 bits and update the timecounter struct’s nsec field.
1.4. If we ignore the bug related to unsynchronized reading of the higher (48–63) and lower (0–47) bits of the PHC counter, the time across each timecounter instance should remain in sync.
1.5. When userspace calls adjfine, it triggers the PORT_MAC_CFG_REQ_ENABLES_PTP_FREQ_ADJ_PPB request, causing the PHC tick rate to change.

In non-RTC mode:
2.1. Only the lower 0–47 bits are stored on the NIC. The higher 48–63 bits are stored only in the timecounter struct.
2.2. When userspace tries to change the PHC counter via adjtime or settime, the change is reflected only in the timecounter struct.
2.3. Each timecounter instance may have its own nsec field value, potentially leading to different timestamps read from /dev/ptp[0-3].
2.4. When userspace calls adjfine, it only modifies the mult field in the cyclecounter struct, which means no real change occurs to the PHC tick rate on the hardware.

And about the issue in general:
3.1. Firmware versions 230+ operate in non-RTC mode in all environments.
3.2. Firmware version 224 uses RTC mode because older driver versions were not designed to track overflows (the higher 48–63 bits of the PHC counter) on the driver side.
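
To make the race from 1.3/1.4 concrete, here is a minimal user-space sketch. It is not the bnxt_en code: the names combine_naive() and extend_low48() are hypothetical, and the "nearest to a recent reference" rule is just one common way to rebuild the upper bits on the driver side, assuming the reference is less than 2^47 counts away from the sample.

```c
#include <stdint.h>

#define PHC_LOW_BITS 48
#define PHC_LOW_MASK ((UINT64_C(1) << PHC_LOW_BITS) - 1)
#define PHC_WRAP     (UINT64_C(1) << PHC_LOW_BITS)

/*
 * Naive combination: the upper 16 bits come from an event payload and the
 * lower 48 bits are read later. If the counter wrapped in between, the
 * result is 2^48 behind the real value.
 */
static uint64_t combine_naive(uint16_t high16, uint64_t low48)
{
	return ((uint64_t)high16 << PHC_LOW_BITS) | (low48 & PHC_LOW_MASK);
}

/*
 * Rollover-aware extension (illustrative): among all 64-bit values whose
 * lower 48 bits equal low48, pick the one closest to a recent 64-bit
 * reference value ref.
 */
static uint64_t extend_low48(uint64_t ref, uint64_t low48)
{
	const uint64_t half = PHC_WRAP >> 1;
	uint64_t cand = (ref & ~PHC_LOW_MASK) | (low48 & PHC_LOW_MASK);

	if (cand > ref && cand - ref > half && cand >= PHC_WRAP)
		cand -= PHC_WRAP;	/* sample was taken before the last wrap */
	else if (ref > cand && ref - cand > half && cand <= UINT64_MAX - PHC_WRAP)
		cand += PHC_WRAP;	/* counter wrapped after ref was taken */

	return cand;
}
```

With a stale upper half of 1 and a lower half of 0x10 read just after the wrap, combine_naive() yields a value exactly 2^48 smaller than the one extend_low48() derives from a reference taken shortly before the wrap.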

>>> The latest driver handles the rollover on its own and we don't need the firmware to tell us.
>>> I checked with the firmware team and I gather that the version you are using is very old.
>>> Firmware version 230.x onwards, you should not receive this event for rollovers.
>>> Is it possible for you to update the firmware? Do you have access to a more recent (230+) firmware?
>> Yes, I can update firmware if you can tell where can I find the latest firmware and the update instructions?
> 
> Broadcom's web site has pretty easy support portal with NIC firmware
> publicly available. Current version is 232 and it has all the
> improvements Pavan mentioned.

Yes, I have found the "Broadcom BCM57xx Fwupg Tools" archive with some precompiled binaries for the x86_64 platform. The problem is that our hosts are aarch64 and use Nix as a package manager, so it will take some time to make it work in our setup. I had hoped that there was a firmware binary itself that I could pass to ethtool --flash.

> On 25 Mar 2025, at 14:24, Pavan Chebbi <pavan.chebbi@broadcom.com> wrote:
> 
>>> Yes, I can update firmware if you can tell where can I find the latest firmware and the update instructions?
>>> 
>> 
>> Broadcom's web site has pretty easy support portal with NIC firmware
>> publicly available. Current version is 232 and it has all the
>> improvements Pavan mentioned.
>> 
> Thanks Vadim for chiming in. I guess you answered all of Kamil's questions.

Yes, thank you for the help. Without your explanation, I would have spent a lot more time understanding it on my own.

> I am curious about Kamil's use case of running PTP on 4 ports (in a
> single host?) which seem to be using RTC mode.
> Like Vadim pointed out earlier, this cannot be an accurate config
> given we run a shared PHC.
> Can Kamil give details of his configuration?

I have a system equipped with a BCM57502 NIC that functions as a PTP grandmaster in a small local network. Four PTP clients — each connected to one of the NIC’s four ports — synchronize their time with the grandmaster using the PTP L2P2P protocol. To support this configuration, I run four ptp4l instances (one for each port) and a single phc2sys daemon to synchronize system time and PHC time by adjusting the PHC. Because the bnxt_en driver reports different PHC device indexes for each NIC port, the phc2sys daemon treats each PHC device as independent and adjusts their times separately.

We also have a similar setup with a different network card, the Intel E810-C, which has four ports as well. However, its ice driver exposes only one PHC device and probably reads the PHC counter in a different way. I do not remember similar issues with this setup.

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-26 13:50             ` Kamil Zaripov
@ 2025-03-26 20:31               ` Jacob Keller
  2025-03-27 13:16               ` Pavan Chebbi
  1 sibling, 0 replies; 18+ messages in thread
From: Jacob Keller @ 2025-03-26 20:31 UTC (permalink / raw)
  To: Kamil Zaripov, Vadim Fedorenko
  Cc: Michael Chan, Pavan Chebbi, Linux Netdev List



On 3/26/2025 6:50 AM, Kamil Zaripov wrote:
> 
> 
>> On 25 Mar 2025, at 12:41, Vadim Fedorenko <vadim.fedorenko@linux.dev> wrote:
>>
>> On 25/03/2025 10:13, Kamil Zaripov wrote:
>>>
>>> I guess I don’t understand how does it work. Am I right that if userspace program changes frequency of PHC devices 0,1,2,3 (one for each port present in NIC) driver will send PHC frequency change 4 times but firmware will drop 3 of these frequency change commands and will pick up only one? How can I understand which PHC will actually represent adjustable clock and which one is phony?
>>
>> It can be any of PHC devices, mostly the first to try to adjust will be used.
> 
> I believe that randomly selecting one of the PHC clock to control actual PHC in NIC and directing commands received on other clocks to the /dev/null is quite unexpected behavior for the userspace applications.
> 

At the very least, this should somehow be predictable. It would be better
for software to manage this and report only one PHC to userspace, with
each netdev reporting its associated PHC via get_ts_info.

* Re: bnxt_en: Incorrect tx timestamp report
  2025-03-26 13:50             ` Kamil Zaripov
  2025-03-26 20:31               ` Jacob Keller
@ 2025-03-27 13:16               ` Pavan Chebbi
  2025-04-01 20:17                 ` Keller, Jacob E
  1 sibling, 1 reply; 18+ messages in thread
From: Pavan Chebbi @ 2025-03-27 13:16 UTC (permalink / raw)
  To: Kamil Zaripov
  Cc: Vadim Fedorenko, Michael Chan, Jacob Keller, Linux Netdev List

On Wed, Mar 26, 2025 at 7:20 PM Kamil Zaripov <zaripov-kamil@avride.ai> wrote:
>
>
>
> > On 25 Mar 2025, at 12:41, Vadim Fedorenko <vadim.fedorenko@linux.dev> wrote:
> >
> > On 25/03/2025 10:13, Kamil Zaripov wrote:
> >>
> >> I guess I don’t understand how does it work. Am I right that if userspace program changes frequency of PHC devices 0,1,2,3 (one for each port present in NIC) driver will send PHC frequency change 4 times but firmware will drop 3 of these frequency change commands and will pick up only one? How can I understand which PHC will actually represent adjustable clock and which one is phony?
> >
> > It can be any of PHC devices, mostly the first to try to adjust will be used.
>
> I believe that randomly selecting one of the PHC clock to control actual PHC in NIC and directing commands received on other clocks to the /dev/null is quite unexpected behavior for the userspace applications.
>
> >> Another thing that I cannot understand is so-called RTC and non-RTC mode. Is there any documentation that describes it? Or specific parts of the driver that change its behavior on for RTC and non-RTC mode?
> >
> > Generally, non-RTC means free-running HW PHC clock with timecounter
> > adjustment on top of it. With RTC mode every adjfine() call tries to
> > adjust HW configuration to change the slope of PHC.
>
> Just to clarify:
>
> Am I right that in RTC mode:
> 1.1. All 64 bits of the PHC counter are stored on the NIC (both the “readable” 0–47 bits and the higher 48–63 bits).
In both RTC and non-RTC modes, the driver uses the lower 48 bits from
HW as the cycles fed to the timecounter that the driver has mapped to
the PHC.

> 1.2. When userspace attempts to change the PHC counter value (using adjtime or settime), these changes are propagated to the NIC via the PORT_MAC_CFG_REQ_ENABLES_PTP_ADJ_PHASE and FUNC_PTP_CFG_REQ_ENABLES_PTP_SET_TIME requests.
True.

> 1.3. If one port of a four-port NIC is updated, the change is propagated to all other ports via the ASYNC_EVENT_CMPL_PHC_UPDATE_EVENT_DATA1_FLAGS_PHC_RTC_UPDATE event. As a result, all four instances of the bnxt_en driver receive the event with the high 48–63 bits of the counter in payload. They then asynchronously read the 0–47 bits and update the timecounter struct’s nsec field.
Not true in the latest firmware.

> 1.4. If we ignore the bug related to unsynchronized reading of the higher (48–63) and lower (0–47) bits of the PHC counter, the time across each timecounter instance should remain in sync.
Well, no. It won't be very accurate. We designed non-RTC mode for such
use cases. But yes, your use case is not exactly what non-RTC caters
for.

> 1.5. When userspace calls adjfine, it triggers the PORT_MAC_CFG_REQ_ENABLES_PTP_FREQ_ADJ_PPB request, causing the PHC tick rate to change.
Correct. But only the first ever port that made the freq adj will
continue to make further freq adjustments. This was a policy decision,
not exactly random. There is an option in our tools to see which is
the interface that is currently making freq adjustments.

>
> In non-RTC mode:
> 2.1. Only the lower 0–47 bits are stored on the NIC. The higher 48–63 bits are stored only in the timecounter struct.
> 2.2. When userspace tries to change the PHC counter via adjtime or settime, the change is reflected only in the timecounter struct.
Correct.

> 2.3. Each timecounter instance may have its own nsec field value, potentially leading to different timestamps read from /dev/ptp[0-3].
Basically each of the timecounters is independent.

> 2.4. When userspace calls adjfine, it only modifies the mul field in the cyclecounter struct, which means no real change occurs to the PHC tick rate on the hardware.
Correct.
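
As a rough illustration of the non-RTC behavior in 2.1-2.4, here is a simplified user-space model of a timecounter sitting on top of a free-running 48-bit counter. The struct and function names are hypothetical; the masked cycle delta mirrors what the kernel's timecounter does, but fractional-nanosecond tracking is left out.

```c
#include <stdint.h>

/* Simplified model of a cyclecounter: counter width plus mult/shift scaling. */
struct cc_model {
	uint64_t mask;	/* e.g. 2^48 - 1 for a 48-bit hardware counter */
	uint32_t mult;	/* adjfine would change this in non-RTC mode */
	uint32_t shift;
};

/* Simplified model of a timecounter: software time on top of raw cycles. */
struct tc_model {
	struct cc_model cc;
	uint64_t cycle_last;	/* last raw counter value seen */
	uint64_t nsec;		/* software-maintained time */
};

/*
 * Advance the model with a new raw counter reading. Masking the difference
 * makes a single counter wrap between two readings harmless.
 */
static uint64_t tc_model_read(struct tc_model *tc, uint64_t cycle_now)
{
	uint64_t delta = (cycle_now - tc->cycle_last) & tc->cc.mask;

	tc->nsec += (delta * tc->cc.mult) >> tc->cc.shift;
	tc->cycle_last = cycle_now;
	return tc->nsec;
}
```

In this model, settime/adjtime would only touch nsec and adjfine would only touch mult, so the hardware counter itself keeps running undisturbed.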

>
> And about issue in general:
> 3.1. Firmware versions 230+ operate in non-RTC mode in all environments.
No, the driver chooses when to shift from RTC to non-RTC mode.
Currently this happens only in the multi-host environment, where each
port is used to synchronize a different Linux system clock.
But firmware 230+ no longer tracks the rollover in FW, and the
ASYNC_EVENT_CMPL_PHC_UPDATE_EVENT_DATA1_FLAGS_PHC_RTC_UPDATE event is
deprecated.

> 3.2. Firmware version 224 uses RTC mode because older driver versions were not designed to track overflows (the higher 48–63 bits of the PHC counter) on the driver side.
>
>
> >>> The latest driver handles the rollover on its own and we don't need the firmware to tell us.
> >>> I checked with the firmware team and I gather that the version you are using is very old.
> >>> Firmware version 230.x onwards, you should not receive this event for rollovers.
> >>> Is it possible for you to update the firmware? Do you have access to a more recent (230+) firmware?
> >> Yes, I can update firmware if you can tell where can I find the latest firmware and the update instructions?
> >
> > Broadcom's web site has pretty easy support portal with NIC firmware
> > publicly available. Current version is 232 and it has all the
> > improvements Pavan mentioned.
>
> Yes, I have found the "Broadcom BCM57xx Fwupg Tools" archive with some precompiled binaries for x86_64 platform. The problem is that our hosts are aarch64 and uses the Nix as a package manager, it will take some time to make it work in our setup. I just hoped that there is firmware binary itself that I can pass to ethtool --flash.
>
>
>
> > On 25 Mar 2025, at 14:24, Pavan Chebbi <pavan.chebbi@broadcom.com> wrote:
> >
> >>> Yes, I can update firmware if you can tell where can I find the latest firmware and the update instructions?
> >>>
> >>
> >> Broadcom's web site has pretty easy support portal with NIC firmware
> >> publicly available. Current version is 232 and it has all the
> >> improvements Pavan mentioned.
> >>
> > Thanks Vadim for chiming in. I guess you answered all of Kamil's questions.
>
> Yes, thank you for help. Without your explanation, I would have spent a lot more time understanding it on my own.
>
> > I am curious about Kamil's use case of running PTP on 4 ports (in a
> > single host?) which seem to be using RTC mode.
> > Like Vadim pointed out earlier, this cannot be an accurate config
> > given we run a shared PHC.
> > Can Kamil give details of his configuration?
>
> I have a system equipped with a BCM57502 NIC that functions as a PTP grandmaster in a small local network. Four PTP clients — each connected to one of the NIC’s four ports — synchronize their time with the grandmaster using the PTP L2P2P protocol. To support this configuration, I run four ptp4l instances (one for each port) and a single phc2sys daemon to synchronize system time and PHC time by adjusting the PHC. Because the bnxt_en driver reports different PHC device indexes for each NIC port, the phc2sys daemon treats each PHC device as independent and adjusts their times separately.
>
If you are using a Broadcom NIC and have only one system time to
update, I don't see why we should have 4 PTP clients. Just one
instance of ptp4l running on one of the ports, plus one phc2sys, is
going to be valid (and is sufficient?).
Thinking out loud: the phc2sys daemon could be picking up all the
available clocks, but I think that needs to be modified, unless we
decide to stop exposing multiple clocks for the same PHC in our
design.
Of course, I am not sure if you have a requirement of 4 GMs to sync with.

> We also have a similar setup with a different network card, the Intel E810-C, which has four ports as well. However, its ice driver exposes only one PHC device and probably read PHC counter in a different way. I do not remember similar issues with this setup.
>
I think on the Intel NIC this problem itself would not arise,
because you will run only 1 client each of ptp4l and phc2sys, right?
But I am not sure how you can run 4 GMs on the Intel NIC, if that is
what you are running.


* RE: bnxt_en: Incorrect tx timestamp report
  2025-03-27 13:16               ` Pavan Chebbi
@ 2025-04-01 20:17                 ` Keller, Jacob E
  0 siblings, 0 replies; 18+ messages in thread
From: Keller, Jacob E @ 2025-04-01 20:17 UTC (permalink / raw)
  To: Pavan Chebbi, Kamil Zaripov
  Cc: Vadim Fedorenko, Michael Chan, Linux Netdev List



> -----Original Message-----
> From: Pavan Chebbi <pavan.chebbi@broadcom.com>
> Sent: Thursday, March 27, 2025 6:17 AM
> To: Kamil Zaripov <zaripov-kamil@avride.ai>
> Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev>; Michael Chan
> <michael.chan@broadcom.com>; Keller, Jacob E <jacob.e.keller@intel.com>;
> Linux Netdev List <netdev@vger.kernel.org>
> Subject: Re: bnxt_en: Incorrect tx timestamp report
> 
>  I think on the Intel NIC, this problem itself would not arise,
> because you will run only 1 client each of ptp4l and phc2sys, right?
> But I am not sure how you can run 4 GMs on Intel NIC if you are
> running that.

You can run one ptp4l instance connected to all 4 ports as a boundary clock. If you try to run separate instances of ptp4l on each port, you'll run into issues with each port trying to synchronize, unless you explicitly configure each ptp4l instance to be source-only so that it never goes into the sink/slave state.
