Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] net: generalize skb freeing deferral to per-cpu lists
From: Eric Dumazet @ 2022-04-22 16:50 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Eric Dumazet, David S . Miller, Paolo Abeni, netdev
In-Reply-To: <20220422094058.30f34bb4@kernel.org>

On Fri, Apr 22, 2022 at 9:41 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 21 Apr 2022 08:39:20 -0700 Eric Dumazet wrote:
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 84d78df60453955a8eaf05847f6e2145176a727a..2fe311447fae5e860eee95f6e8772926d4915e9f 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1080,6 +1080,7 @@ struct sk_buff {
> >               unsigned int    sender_cpu;
> >       };
> >  #endif
> > +     u16                     alloc_cpu;
> >  #ifdef CONFIG_NETWORK_SECMARK
> >       __u32           secmark;
> >  #endif
>
> nit: kdoc missing

Yep, I had this covered for v2 already, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: generalize skb freeing deferral to per-cpu lists
From: Eric Dumazet @ 2022-04-22 16:50 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Eric Dumazet, David S . Miller, Paolo Abeni, netdev
In-Reply-To: <20220422094014.1bcf78d5@kernel.org>

On Fri, Apr 22, 2022 at 9:40 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 21 Apr 2022 08:39:20 -0700 Eric Dumazet wrote:
> > 10 runs of one TCP_STREAM flow
>
> Was the test within a NUMA node or cross-node?

This was a NUMA host, but nothing done to force anything (no pinning,
both for sender and receiver)

>
> For my learning - could this change cause more cache line bouncing
> than individual per-socket lists for non-RFS setups. Multiple CPUs
> may try to queue skbs for freeing on one remove node.

I did tests as well in non-RFS setups, and got nice improvement as well
I could post them in v2 if you want.

The thing is that with a typical number of RX queues (typically 16 or
32 queues on a 100Gbit NIC),
there is enough sharding for this spinlock to be a non-issue.

Also, we could quite easily add some batching in a future patch, for
the cases where the number of RX queues
is too small.

(Each cpu could hold up to 8 or 16 skbs in a per-cpu cache, before
giving them back to alloc_cpu(s))


>
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 7dccbfd1bf5635c27514c70b4a06d3e6f74395dd..0162a9bdc9291e7aae967a044788d09bd2ef2423 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3081,6 +3081,9 @@ struct softnet_data {
> >       struct sk_buff_head     input_pkt_queue;
> >       struct napi_struct      backlog;
> >
> > +     /* Another possibly contended cache line */
> > +     struct sk_buff_head     skb_defer_list ____cacheline_aligned_in_smp;
>
> If so maybe we can avoid some dirtying and use a single-linked list?
> No point modifying the cache line of the skb already on the list.

Good idea, I can think about it.

>
> > +     call_single_data_t  csd_defer;
> >  };

^ permalink raw reply

* Re: [PATCH net-next 5/5] net: dsa: b53: mark as non-legacy
From: Florian Fainelli @ 2022-04-22 16:43 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andrew Lunn, David S. Miller, Jakub Kicinski, netdev,
	Vivien Didelot, Vladimir Oltean
In-Reply-To: <YmJbzay/OiSAxYWF@shell.armlinux.org.uk>

On 4/22/22 00:39, Russell King (Oracle) wrote:
> On Thu, Apr 21, 2022 at 03:31:26PM -0700, Florian Fainelli wrote:
>> Hi Russell,
>>
>> On 2/22/22 02:16, Russell King (Oracle) wrote:
>>> The B53 driver does not make use of the speed, duplex, pause or
>>> advertisement in its phylink_mac_config() implementation, so it can be
>>> marked as a non-legacy driver.
>>>
>>> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
>>> ---
>>>    drivers/net/dsa/b53/b53_common.c | 6 ++++++
>>>    1 file changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
>>> index 50a372dc32ae..83bf30349c26 100644
>>> --- a/drivers/net/dsa/b53/b53_common.c
>>> +++ b/drivers/net/dsa/b53/b53_common.c
>>> @@ -1346,6 +1346,12 @@ static void b53_phylink_get_caps(struct dsa_switch *ds, int port,
>>>    	/* Get the implementation specific capabilities */
>>>    	if (dev->ops->phylink_get_caps)
>>>    		dev->ops->phylink_get_caps(dev, port, config);
>>> +
>>> +	/* This driver does not make use of the speed, duplex, pause or the
>>> +	 * advertisement in its mac_config, so it is safe to mark this driver
>>> +	 * as non-legacy.
>>> +	 */
>>> +	config->legacy_pre_march2020 = false;
>>
>> This patch appears to cause a regression for me, I am not sure why I did not
>> notice it back when I tested it but I suspect it had to do with me testing
>> only with a copper module and not with a fiber module.
>>
>> Now that I tested it again, the SFP port (port 5 in my set-up) link up
>> interrupt does not fire up when setting config->legacy_pre_march2020 to
>> false.
>>
>> Here is a working log with phylink debugging enabled:
>>
>> # udhcpc -i sfp
>> udhcpc: started, v1.35.0
>> [   49.479637] bgmac-enet 18024000.ethernet eth2: Link is Up - 1Gbps/Full -
>> flow control off
>> [   49.488139] Generic PHY fixed-0:02: PHY state change UP -> RUNNING
>> [   49.488256] b53-srab-switch 18036000.ethernet-switch sfp: configuring for
>> inband/1000base-x link mode
>> [   49.504062] b53-srab-switch 18036000.ethernet-switch sfp: major config
>> 1000base-x
>> [   49.511800] b53-srab-switch 18036000.ethernet-switch sfp:
>> phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown
>> adv=0000000,00000201
>> [   49.527504] b53-srab-switch 18036000.ethernet-switch sfp: mac link down
>> [   49.535044] sfp sfp: SM: enter present:down:down event dev_up
>> [   49.541006] sfp sfp: tx disable 1 -> 0
>> [   49.544897] sfp sfp: SM: exit present:up:wait
>> [   49.549509] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
>> udhcpc: broadcasting discover
>> [   49.595185] sfp sfp: SM: enter present:up:wait event timeout
>> [   49.601064] sfp sfp: SM: exit present:up:link_up
>> [   52.388917] b53-srab-switch 18036000.ethernet-switch sfp: mac link up
>> [   52.396513] b53-srab-switch 18036000.ethernet-switch sfp: Link is Up -
>> 1Gbps/Full - flow control rx/tx
>> [   52.406145] IPv6: ADDRCONF(NETDEV_CHANGE): sfp: link becomes ready
>> udhcpc: broadcasting discover
>> udhcpc: broadcasting select for 192.168.3.156, server 192.168.3.1
>> udhcpc: lease of 192.168.3.156 obtained from 192.168.3.1, lease time 600
>> deleting routers
>> adding dns 192.168.1.1
>>
>> and one that is not working with phylink debugging enabled:
>>
>> # udhcpc -i sfp
>> udhcpc: started, v1.35.0
>> [   27.863529] bgmac-enet 18024000.ethernet eth2: Link is Up - 1Gbps/Full -
>> flow control off
>> [   27.872021] Generic PHY fixed-0:02: PHY state change UP -> RUNNING
>> [   27.872120] b53-srab-switch 18036000.ethernet-switch sfp: configuring for
>> inband/1000base-x link mode
>> [   27.887952] b53-srab-switch 18036000.ethernet-switch sfp: major config
>> 1000base-x
>> [   27.895689] b53-srab-switch 18036000.ethernet-switch sfp:
>> phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown
>> adv=0000000,00000201
>> [   27.895802] b53-srab-switch 18036000.ethernet-switch sfp: mac link down
>> [   27.911945] sfp sfp: SM: enter present:down:down event dev_up
>> [   27.923947] sfp sfp: tx disable 1 -> 0
>> [   27.927835] sfp sfp: SM: exit present:up:wait
>> [   27.932442] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
>> udhcpc: broadcasting discover
>> [   27.978181] sfp sfp: SM: enter present:up:wait event timeout
>> [   27.984056] sfp sfp: SM: exit present:up:link_up
>> [   30.686440] b53-srab-switch 18036000.ethernet-switch sfp: mac link up
>> udhcpc: broadcasting discover
>> udhcpc: broadcasting discover
>>
>> The mac side appears to be UP but not no carrier is set to the sfp network
>> device. Do you have any idea why that would happen?
> 
> Oh, it's because setting that flag means we're wanting the PCS methods
> rather than the legacy MAC methods for an_restart and getting the PCS
> link state - so the patch in question was submitted too early (it
> should have been _after_ the conversion to PCS.)

Meh, sorry I was really slow on this and did not even connect the dots. 
Indeed that is what it is.

> 
> If we get the patch reverted in net-next, and then convert b53 to use
> PCS support, we'll then be putting the patch back, so I wonder if it
> would just make sense to apply the PCS conversion patch, possibly
> adding a comment in the commit message pointing out that this fixes
> the b53 legacy_pre_march2020 patch. Thoughts?

I just responded to your other patch "net: dsa: b53: convert to 
phylink_pcs", so if we target that one for 'net' I think we should be 
good to go.

Thanks!
-- 
Florian

^ permalink raw reply

* Re: Accessing XDP packet memory from the end
From: Alexander Lobakin @ 2022-04-22 16:41 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Larysa Zaremba, bpf, netdev, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, Magnus Karlsson,
	Maciej Fijalkowski, Alexander Lobakin
In-Reply-To: <87czhagxuw.fsf@toke.dk>

From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Thu, 21 Apr 2022 19:17:11 +0200

> Larysa Zaremba <larysa.zaremba@intel.com> writes:
> 
> > Dear all,
> > Our team has encountered a need of accessing data_meta in a following way:
> >
> > int xdp_meta_prog(struct xdp_md *ctx)
> > {
> > 	void *data_meta_ptr = (void *)(long)ctx->data_meta;
> > 	void *data_end = (void *)(long)ctx->data_end;
> > 	void *data = (void *)(long)ctx->data;
> > 	u64 data_size = sizeof(u32);
> > 	u32 magic_meta;
> > 	u8 offset;
> >
> > 	offset = (u8)((s64)data - (s64)data_meta_ptr);
> > 	if (offset < data_size) {
> > 		bpf_printk("invalid offset: %ld\n", offset);
> > 		return XDP_DROP;
> > 	}
> >
> > 	data_meta_ptr += offset;
> > 	data_meta_ptr -= data_size;
> >
> > 	if (data_meta_ptr + data_size > data) {
> > 		return XDP_DROP;
> > 	}
> > 		
> > 	magic_meta = *((u32 *)data);
> > 	bpf_printk("Magic: %d\n", magic_meta);
> > 	return XDP_PASS;
> > }
> >
> > Unfortunately, verifier claims this code attempts to access packet with
> > an offset of -2 (a constant part) and negative offset is generally forbidden.
> >
> > For now we have 2 solutions, one is using bpf_xdp_adjust_meta(),
> > which is pretty good, but not ideal for the hot path.
> > The second one is the patch at the end.
> >
> > Do you see any other way of accessing memory from the end of data_meta/data?
> > What do you think about both suggested solutions?
> 
> The problem is that the compiler is generating code that the verifier
> doesn't understand. It's notoriously hard to get LLVM to produce code
> that preserves the right bounds checks which is why projects like Cilium
> use helpers with inline ASM to produce the right loads, like in [0].
> 
> Adapting that cilium helper to load from the metadata area, your example
> can be rewritten as follows (which works just fine with no verifier
> changes):
> 
> static __always_inline int
> xdp_load_meta_bytes(const struct xdp_md *ctx, __u64 off, void *to, const __u64 len)
> {
> 	void *from;
> 	int ret;
> 	/* LLVM tends to generate code that verifier doesn't understand,
> 	 * so force it the way we want it in order to open up a range
> 	 * on the reg.
> 	 */
> 	asm volatile("r1 = *(u32 *)(%[ctx] +8)\n\t"
> 		     "r2 = *(u32 *)(%[ctx] +0)\n\t"
> 		     "%[off] &= %[offmax]\n\t"
> 		     "r1 += %[off]\n\t"
> 		     "%[from] = r1\n\t"
> 		     "r1 += %[len]\n\t"
> 		     "if r1 > r2 goto +2\n\t"
> 		     "%[ret] = 0\n\t"
> 		     "goto +1\n\t"
> 		     "%[ret] = %[errno]\n\t"
> 		     : [ret]"=r"(ret), [from]"=r"(from)
> 		     : [ctx]"r"(ctx), [off]"r"(off), [len]"ri"(len),
> 		       [offmax]"i"(__CTX_OFF_MAX), [errno]"i"(-EINVAL)
> 		     : "r1", "r2");
> 	if (!ret)
> 		__builtin_memcpy(to, from, len);
> 	return ret;
> }
> 
> 
> SEC("xdp")
> int xdp_meta_prog(struct xdp_md *ctx)
> {
>         void *data_meta_ptr = (void *)(long)ctx->data_meta;
>         void *data = (void *)(long)ctx->data;
>         __u32 magic_meta;
>         __u8 offset;
> 	int ret;
> 
>         offset = (__u8)((__s64)data - (__s64)data_meta_ptr);
> 	ret = xdp_load_meta_bytes(ctx, offset - 4, &magic_meta, sizeof(magic_meta));
> 	if (ret) {
> 		bpf_printk("load bytes failed: %d\n", ret);
>                 return XDP_DROP;
> 	}
> 
>         bpf_printk("Magic: %d\n", magic_meta);
>         return XDP_PASS;
> }

At the moment, we use this (based on Cilium's and your), it works
just like we want C code to work previously:

#define __CTX_OFF_MAX 0xff

static __always_inline void *
can_i_access_meta_please(const struct xdp_md *ctx, __u64 off, const __u64 len)
{
	void *ret;

	/* LLVM tends to generate code that verifier doesn't understand,
	 * so force it the way we want it in order to open up a range
	 * on the reg.
	 */
	asm volatile("r1 = *(u32 *)(%[ctx] +8)\n\t"
		     "r2 = *(u32 *)(%[ctx] +0)\n\t"
		     "%[off] &= %[offmax]\n\t"
		     "r1 += %[off]\n\t"
		     "%[ret] = r1\n\t"
		     "r1 += %[len]\n\t"
		     "if r1 > r2 goto +1\n\t"
		     "goto +1\n\t"
		     "%[ret] = %[null]\n\t"
		     : [ret]"=r"(ret)
		     : [ctx]"r"(ctx), [off]"r"(off), [len]"ri"(len),
		       [offmax]"i"(__CTX_OFF_MAX), [null]"i"(NULL)
		     : "r1", "r2");

	return ret;
}

SEC("xdp")
int xdp_prognum_n0_meta(struct xdp_md *ctx)
{
	void *data_meta = (void *)(__s64)ctx->data_meta;
	void *data = (void *)(__s64)ctx->data;
	struct xdp_meta_generic *md;
	__u64 offset;

	offset = (__u64)((__s64)data - (__s64)data_meta);

	md = can_i_access_meta_please(ctx, offset, sizeof(*md));
	if (__builtin_expect(!md, 0)) {
		bpf_printk("No you can't\n");
		return XDP_DROP;
	}

	bpf_printk("Magic: 0x%04x\n", md->magic_id);
	return XDP_PASS;
}

Thanks for the help! It's a shame LLVM still suck on generating
correct object code from C.
I guess we'll define a helper above in one of the headers to not
copy-paste it back and forth between each program wanting to
access only the generic part of the metadata (which is always being
placed at the end).

> 
> -Toke
> 
> 
> [0] https://github.com/cilium/cilium/blob/master/bpf/include/bpf/ctx/xdp.h#L35

Thanks,
Al

^ permalink raw reply

* Re: [PATCH RFC net-next v2] net: dsa: b53: convert to phylink_pcs
From: Florian Fainelli @ 2022-04-22 16:41 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andrew Lunn, David S. Miller, Heiner Kallweit, netdev,
	Vivien Didelot, Vladimir Oltean, Jakub Kicinski, Paolo Abeni
In-Reply-To: <E1ndrr8-005FbQ-F7@rmk-PC.armlinux.org.uk>

On 4/11/22 04:05, Russell King (Oracle) wrote:
> Convert B53 to use phylink_pcs for the serdes rather than hooking it
> into the MAC-layer callbacks.
> 
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Fixes: 81c1681cbb9f ("net: dsa: b53: mark as non-legacy")

Thanks Russell!
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next] net: generalize skb freeing deferral to per-cpu lists
From: Jakub Kicinski @ 2022-04-22 16:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, Paolo Abeni, netdev, Eric Dumazet
In-Reply-To: <20220421153920.3637792-1-eric.dumazet@gmail.com>

On Thu, 21 Apr 2022 08:39:20 -0700 Eric Dumazet wrote:
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 84d78df60453955a8eaf05847f6e2145176a727a..2fe311447fae5e860eee95f6e8772926d4915e9f 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1080,6 +1080,7 @@ struct sk_buff {
>  		unsigned int	sender_cpu;
>  	};
>  #endif
> +	u16			alloc_cpu;
>  #ifdef CONFIG_NETWORK_SECMARK
>  	__u32		secmark;
>  #endif

nit: kdoc missing

^ permalink raw reply

* Re: [PATCH net-next] net: generalize skb freeing deferral to per-cpu lists
From: Jakub Kicinski @ 2022-04-22 16:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, Paolo Abeni, netdev, Eric Dumazet
In-Reply-To: <20220421153920.3637792-1-eric.dumazet@gmail.com>

On Thu, 21 Apr 2022 08:39:20 -0700 Eric Dumazet wrote:
> 10 runs of one TCP_STREAM flow

Was the test within a NUMA node or cross-node?

For my learning - could this change cause more cache line bouncing 
than individual per-socket lists for non-RFS setups. Multiple CPUs 
may try to queue skbs for freeing on one remove node.

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 7dccbfd1bf5635c27514c70b4a06d3e6f74395dd..0162a9bdc9291e7aae967a044788d09bd2ef2423 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3081,6 +3081,9 @@ struct softnet_data {
>  	struct sk_buff_head	input_pkt_queue;
>  	struct napi_struct	backlog;
>  
> +	/* Another possibly contended cache line */
> +	struct sk_buff_head	skb_defer_list ____cacheline_aligned_in_smp;

If so maybe we can avoid some dirtying and use a single-linked list? 
No point modifying the cache line of the skb already on the list.

> +	call_single_data_t  csd_defer;
>  };

^ permalink raw reply

* Re: 5.10.4+ hang with 'rmmod nf_conntrack'
From: Ben Greear @ 2022-04-22 16:32 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev
In-Reply-To: <0fa45356-7c63-bb01-19c8-9447cf2cad39@candelatech.com>

On 1/8/21 5:07 AM, Ben Greear wrote:
> On 1/7/21 10:16 PM, Florian Westphal wrote:
>> Ben Greear <greearb@candelatech.com> wrote:
>>> I noticed my system has a hung process trying to 'rmmod nf_conntrack'.
>>>
>>> I've generally been doing the script that calls rmmod forever,
>>> but only extensively tested on 5.4 kernel and earlier.
>>>
>>> If anyone has any ideas, please let me know.  This is from 'sysrq t'.  I don't see
>>> any hung-task splats in dmesg.
>>
>> rmmod on conntrack loops forever until the active conntrack object count reaches 0.
>> (plus a walk of the conntrack table to evict/put all entries).

Hello Florian,

I keep hitting this bug in a particular test case in 5.17.4+, so I added some debug to
try to learn more.

My debugging patch looks like this:

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 7552e1e9fd62..29724114caef 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -2543,6 +2543,7 @@ void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list)
  {
         int busy;
         struct net *net;
+       unsigned long loops = 0;

         /*
          * This makes sure all current packets have passed through
@@ -2556,12 +2557,30 @@ void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list)
                 struct nf_conntrack_net *cnet = nf_ct_pernet(net);

                 nf_ct_iterate_cleanup(kill_all, net, 0, 0);
-               if (atomic_read(&cnet->count) != 0)
+               if (atomic_read(&cnet->count) != 0) {
+                       if (loops > 50010)
+                               pr_err("nf-conntrack-cleanup-net-list, loops: %ld  cnet-count: %d, expect-count: %d users4: %d users6: %d  users_bridge: %d\n",
+                                      loops, atomic_read(&cnet->count), cnet->expect_count,
+                                      cnet->users4, cnet->users6, cnet->users_bridge);
                         busy = 1;
+               }
         }
         if (busy) {
+               loops++;
+               if (loops > 50000) {
+                       msleep(500);
+               }
                 schedule();
-               goto i_see_dead_people;
+               if (loops > 50020) {
+                       /* This thing is wedged, going to require a reboot to recover, so attempt
+                        * to just ignore the bad count and see if system works OK.
+                        */
+                       WARN_ON_ONCE(1);
+                       pr_err("ERROR:  nf_conntrack_cleanup_net cannot make progress.  Ignoring stale reference count and will continue.\n");
+               }
+               else {
+                       goto i_see_dead_people;
+               }
         }

         list_for_each_entry(net, net_exit_list, exit_list) {


Do you (or anyone else), have some ideas for how to debug this further to help find where the reference
is leaked (or not released)?


[ 1486.101795] nf_conntrack: nf-conntrack-cleanup-net-list, loops: 50011  cnet-count: 1, expect-count: 0 users4: 0 users6: 0  users_bridge: 0
[ 1486.620464] nf_conntrack: nf-conntrack-cleanup-net-list, loops: 50012  cnet-count: 1, expect-count: 0 users4: 0 users6: 0  users_bridge: 0
[ 1487.141700] nf_conntrack: nf-conntrack-cleanup-net-list, loops: 50013  cnet-count: 1, expect-count: 0 users4: 0 users6: 0  users_bridge: 0
[ 1487.660248] nf_conntrack: nf-conntrack-cleanup-net-list, loops: 50014  cnet-count: 1, expect-count: 0 users4: 0 users6: 0  users_bridge: 0
[ 1488.180352] nf_conntrack: nf-conntrack-cleanup-net-list, loops: 50015  cnet-count: 1, expect-count: 0 users4: 0 users6: 0  users_bridge: 0
[ 1488.700485] nf_conntrack: nf-conntrack-cleanup-net-list, loops: 50016  cnet-count: 1, expect-count: 0 users4: 0 users6: 0  users_bridge: 0
[ 1489.220488] nf_conntrack: nf-conntrack-cleanup-net-list, loops: 50017  cnet-count: 1, expect-count: 0 users4: 0 users6: 0  users_bridge: 0
[ 1489.741310] nf_conntrack: nf-conntrack-cleanup-net-list, loops: 50018  cnet-count: 1, expect-count: 0 users4: 0 users6: 0  users_bridge: 0
[ 1490.260216] nf_conntrack: nf-conntrack-cleanup-net-list, loops: 50019  cnet-count: 1, expect-count: 0 users4: 0 users6: 0  users_bridge: 0
[ 1490.780219] nf_conntrack: nf-conntrack-cleanup-net-list, loops: 50020  cnet-count: 1, expect-count: 0 users4: 0 users6: 0  users_bridge: 0
[ 1491.299594] ------------[ cut here ]------------
[ 1491.299598] WARNING: CPU: 6 PID: 25447 at net/netfilter/nf_conntrack_core.c:2578 nf_conntrack_cleanup_net_list+0x105/0x210 [nf_conntrack]
[ 1491.299651] Modules linked in: nfnetlink nf_conntrack(-) nf_defrag_ipv6 nf_defrag_ipv4 bpfilter vrf 8021q garp mrp stp llc macvlan pktgen f71882fg 
snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio uvcvideo videobuf2_vmalloc snd_hda_intel videobuf2_memops snd_intel_dspcfg 
videobuf2_v4l2 videobuf2_common snd_hda_codec coretemp intel_rapl_msr intel_rapl_common iTCO_wdt ch341 cp210x ftdi_sio videodev snd_hda_core ee1004 snd_hwdep 
intel_pmc_bxt mc iTCO_vendor_support mt7915e mt76_connac_lib snd_seq mt76 snd_seq_device intel_wmi_thunderbolt snd_pcm intel_tcc_cooling mac80211 snd_timer 
x86_pkg_temp_thermal snd intel_powerclamp i2c_i801 soundcore i2c_smbus cfg80211 mei_wdt mei_hdcp mei_pxp intel_pch_thermal acpi_pad sch_fq_codel nfsd 
auth_rpcgss nfs_acl lockd grace sunrpc raid1 dm_raid raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor async_tx raid6_pq i915 intel_gtt 
drm_kms_helper cec rc_core ttm igb i2c_algo_bit drm ixgbe mdio agpgart xhci_pci
[ 1491.299839]  hwmon i2c_core xhci_pci_renesas dca wmi video fuse [last unloaded: nf_conntrack_netlink]
[ 1491.299862] CPU: 6 PID: 25447 Comm: rmmod Tainted: G        W         5.17.4+ #30
[ 1491.299869] Hardware name: Default string Default string/SKYBAY, BIOS 5.12 08/04/2020
[ 1491.299873] RIP: 0010:nf_conntrack_cleanup_net_list+0x105/0x210 [nf_conntrack]
[ 1491.299918] Code: 00 00 77 0a e8 6c 3b 05 e1 e9 30 ff ff ff bf f4 01 00 00 e8 ad d9 f3 df e8 58 3b 05 e1 49 81 fd 65 c3 00 00 0f 85 14 ff ff ff <0f> 0b 48 c7 
c7 c0 3c 36 a1 e8 d4 f1 fc e0 4c 89 e7 e8 05 ea 18 e0
[ 1491.299925] RSP: 0018:ffff888126fffd78 EFLAGS: 00010246
[ 1491.299931] RAX: 0000000000400100 RBX: ffff888126fffdd8 RCX: ffffffff823a27ba
[ 1491.299936] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff888145cb212c
[ 1491.299941] RBP: ffff8881179c8f00 R08: ffffed1028b96421 R09: ffffed1028b96421
[ 1491.299946] R10: ffff888145cb2107 R11: ffffed1028b96420 R12: ffff888126fffe08
[ 1491.299951] R13: 000000000000c365 R14: 0000000000000001 R15: ffff888134155500
[ 1491.299955] FS:  00007fcbb8644740(0000) GS:ffff88841e100000(0000) knlGS:0000000000000000
[ 1491.299961] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1491.299965] CR2: 00000000090c6000 CR3: 00000001370bd003 CR4: 00000000003706e0
[ 1491.299970] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1491.299974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1491.299978] Call Trace:
[ 1491.299982]  <TASK>
[ 1491.299987]  free_exit_list+0x77/0xc0
[ 1491.299996]  unregister_pernet_operations+0x130/0x1a0
[ 1491.300004]  ? free_exit_list+0xc0/0xc0
[ 1491.300011]  ? __mutex_unlock_slowpath.isra.21+0x230/0x230
[ 1491.300021]  unregister_pernet_subsys+0x18/0x30
[ 1491.300028]  nf_conntrack_standalone_fini+0x11/0x47f [nf_conntrack]
[ 1491.300076]  __x64_sys_delete_module+0x1f5/0x310
[ 1491.300086]  ? __ia32_sys_delete_module+0x310/0x310
[ 1491.300096]  do_syscall_64+0x34/0xb0
[ 1491.300104]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1491.300112] RIP: 0033:0x7fcbb876ca9b
[ 1491.300117] Code: 73 01 c3 48 8b 0d ed 33 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 
ff ff 73 01 c3 48 8b 0d bd 33 0c 00 f7 d8 64 89 01 48
[ 1491.300124] RSP: 002b:00007ffcbddd8928 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 1491.300130] RAX: ffffffffffffffda RBX: 0000560cfea95820 RCX: 00007fcbb876ca9b
[ 1491.300135] RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000560cfea95888
[ 1491.300140] RBP: 00007ffcbddd8988 R08: 0000000000000000 R09: 0000000000000000
[ 1491.300144] R10: 00007fcbb87e0ac0 R11: 0000000000000206 R12: 00007ffcbddd8b50
[ 1491.300148] R13: 00007ffcbdddabc5 R14: 0000560cfea952a0 R15: 0000560cfea95820
[ 1491.300157]  </TASK>
[ 1491.300159] ---[ end trace 0000000000000000 ]---
[ 1491.300163] nf_conntrack: ERROR:  nf_conntrack_cleanup_net cannot make progress.  Ignoring stale reference count and will continue.
[ 1491.437516] =============================================================================
[ 1491.444783] BUG nf_conntrack (Tainted: G        W        ): Objects remaining in nf_conntrack on __kmem_cache_shutdown()
[ 1491.454650] -----------------------------------------------------------------------------

[ 1491.461981] Slab 0x000000006b714627 objects=25 used=1 fp=0x0000000095de89c2 flags=0x5fff8000010200(slab|head|node=0|zone=2|lastcpupid=0x3fff)
[ 1491.473738] CPU: 4 PID: 25447 Comm: rmmod Tainted: G        W         5.17.4+ #30
[ 1491.473743] Hardware name: Default string Default string/SKYBAY, BIOS 5.12 08/04/2020
[ 1491.473745] Call Trace:
[ 1491.473747]  <TASK>
[ 1491.473749]  dump_stack_lvl+0x47/0x5c
[ 1491.473758]  slab_err+0x8f/0xd0
[ 1491.473764]  ? _raw_read_lock_irqsave+0x10/0x50
[ 1491.473769]  ? _raw_read_lock+0x30/0x30
[ 1491.473773]  ? flush_all_cpus_locked+0xd5/0x120
[ 1491.473777]  __kmem_cache_shutdown+0x13f/0x2e0
[ 1491.473781]  ? kasan_quarantine_remove_cache+0xd1/0xe0
[ 1491.473787]  kmem_cache_destroy+0x4a/0x110
[ 1491.473793]  __x64_sys_delete_module+0x1f5/0x310
[ 1491.473800]  ? __ia32_sys_delete_module+0x310/0x310
[ 1491.473804]  do_syscall_64+0x34/0xb0
[ 1491.473809]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1491.473814] RIP: 0033:0x7fcbb876ca9b
[ 1491.473817] Code: 73 01 c3 48 8b 0d ed 33 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 
ff ff 73 01 c3 48 8b 0d bd 33 0c 00 f7 d8 64 89 01 48
[ 1491.473821] RSP: 002b:00007ffcbddd8928 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 1491.473825] RAX: ffffffffffffffda RBX: 0000560cfea95820 RCX: 00007fcbb876ca9b
[ 1491.473828] RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000560cfea95888
[ 1491.473830] RBP: 00007ffcbddd8988 R08: 0000000000000000 R09: 0000000000000000
[ 1491.473832] R10: 00007fcbb87e0ac0 R11: 0000000000000206 R12: 00007ffcbddd8b50
[ 1491.473834] R13: 00007ffcbdddabc5 R14: 0000560cfea952a0 R15: 0000560cfea95820
[ 1491.473838]  </TASK>
[ 1491.473840] Disabling lock debugging due to kernel taint
[ 1491.473843] Object 0x000000006989dcb9 @offset=5120
[ 1491.477761] ------------[ cut here ]------------
[ 1491.477763] kmem_cache_destroy nf_conntrack: Slab cache still has objects when called from __x64_sys_delete_module+0x1f5/0x310
[ 1491.477780] WARNING: CPU: 4 PID: 25447 at mm/slab_common.c:504 kmem_cache_destroy+0x108/0x110
[ 1491.477788] Modules linked in: nfnetlink nf_conntrack(-) nf_defrag_ipv6 nf_defrag_ipv4 bpfilter vrf 8021q garp mrp stp llc macvlan pktgen f71882fg 
snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio uvcvideo videobuf2_vmalloc snd_hda_intel videobuf2_memops snd_intel_dspcfg 
videobuf2_v4l2 videobuf2_common snd_hda_codec coretemp intel_rapl_msr intel_rapl_common iTCO_wdt ch341 cp210x ftdi_sio videodev snd_hda_core ee1004 snd_hwdep 
intel_pmc_bxt mc iTCO_vendor_support mt7915e mt76_connac_lib snd_seq mt76 snd_seq_device intel_wmi_thunderbolt snd_pcm intel_tcc_cooling mac80211 snd_timer 
x86_pkg_temp_thermal snd intel_powerclamp i2c_i801 soundcore i2c_smbus cfg80211 mei_wdt mei_hdcp mei_pxp intel_pch_thermal acpi_pad sch_fq_codel nfsd 
auth_rpcgss nfs_acl lockd grace sunrpc raid1 dm_raid raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor async_tx raid6_pq i915 intel_gtt 
drm_kms_helper cec rc_core ttm igb i2c_algo_bit drm ixgbe mdio agpgart xhci_pci
[ 1491.477900]  hwmon i2c_core xhci_pci_renesas dca wmi video fuse [last unloaded: nf_conntrack_netlink]
[ 1491.477912] CPU: 4 PID: 25447 Comm: rmmod Tainted: G    B   W         5.17.4+ #30
[ 1491.477916] Hardware name: Default string Default string/SKYBAY, BIOS 5.12 08/04/2020
[ 1491.477918] RIP: 0010:kmem_cache_destroy+0x108/0x110
[ 1491.477922] Code: dd 08 00 48 89 df e8 d7 dd 08 00 eb d3 c3 48 8b 53 60 48 8b 4c 24 10 48 c7 c6 c0 61 90 82 48 c7 c7 58 66 fc 82 e8 8a 8f ec 00 <0f> 0b eb b2 
0f 1f 40 00 55 53 48 83 ff 10 76 11 48 8b 74 24 10 48
[ 1491.477925] RSP: 0018:ffff888126fffe80 EFLAGS: 00010286
[ 1491.477928] RAX: 0000000000000000 RBX: ffff888109061e00 RCX: 0000000000000027
[ 1491.477931] RDX: 0000000000000027 RSI: dffffc0000000000 RDI: ffff88841e028428
[ 1491.477933] RBP: ffffffffa1381a00 R08: ffffed1083c05086 R09: ffffed1083c05086
[ 1491.477936] R10: ffff88841e02842b R11: ffffed1083c05085 R12: ffffffffa1381d08
[ 1491.477938] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1491.477940] FS:  00007fcbb8644740(0000) GS:ffff88841e000000(0000) knlGS:0000000000000000
[ 1491.477943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1491.477946] CR2: 00007efd5f2d4734 CR3: 00000001370bd003 CR4: 00000000003706e0
[ 1491.477949] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1491.477951] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1491.477953] Call Trace:
[ 1491.477955]  <TASK>
[ 1491.477957]  __x64_sys_delete_module+0x1f5/0x310
[ 1491.477961]  ? __ia32_sys_delete_module+0x310/0x310
[ 1491.477966]  do_syscall_64+0x34/0xb0
[ 1491.477971]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1491.477975] RIP: 0033:0x7fcbb876ca9b
[ 1491.477978] Code: 73 01 c3 48 8b 0d ed 33 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 
ff ff 73 01 c3 48 8b 0d bd 33 0c 00 f7 d8 64 89 01 48
[ 1491.477981] RSP: 002b:00007ffcbddd8928 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 1491.477985] RAX: ffffffffffffffda RBX: 0000560cfea95820 RCX: 00007fcbb876ca9b
[ 1491.477987] RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000560cfea95888
[ 1491.477989] RBP: 00007ffcbddd8988 R08: 0000000000000000 R09: 0000000000000000
[ 1491.477991] R10: 00007fcbb87e0ac0 R11: 0000000000000206 R12: 00007ffcbddd8b50
[ 1491.477994] R13: 00007ffcbdddabc5 R14: 0000560cfea952a0 R15: 0000560cfea95820
[ 1491.477997]  </TASK>
[ 1491.477999] ---[ end trace 0000000000000000 ]---


>>
>>> I'll see if it is reproducible and if so will try
>>> with lockdep enabled...
>>
>> No idea, there was a regression in 5.6, but that was fixed by the time
>> 5.7 was released.
>>
>> Can't reproduce hangs with a script that injects a few dummy entries
>> and then removes the module:
>>
>> added=0
>>
>> add_and_rmmod()
>> {
>>          while [ $added -lt 1000 ]; do
>>                  conntrack -I -s $(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%255+1)) \
>>                          -d $(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%256)).$(($RANDOM%255+1)) \
>>                           --protonum 6 --timeout $(((RANDOM%120) + 240)) --state ESTABLISHED --sport $RANDOM --dport $RANDOM 2> /dev/null || break
>>
>>                  added=$((added + 1))
>>                  if [ $((added % 1000)) -eq 0 ];then
>>                          echo $added
>>                  fi
>>          done
>>
>>          echo rmmod after adding $added entries
>>          conntrack -C
>>          rmmod nf_conntrack_netlink
>>          rmmod nf_conntrack
>> }
>>
>> add_and_rmmod
>>
>> I don't see how it would make a difference, but do you have any special conntrack features enabled
>> at run time, e.g. reliable netlink events? (If you don't know what I mean the answer is no).
> 
> Not that I know of, but I am using lots of VRF devices, each with their own routing table, as well
> as some wifi stations and AP netdevs.
> 
> I'll let you know if I can reproduce it again..
> 
> Thanks,
> Ben
> 
> 


^ permalink raw reply related

* Re: [PATCH net 3/3] ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode
From: William Tu @ 2022-04-22 16:35 UTC (permalink / raw)
  To: Peilin Ye
  Cc: David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Paolo Abeni, Peilin Ye, xeb@mail.ru, Daniel Borkmann, Cong Wang,
	Eric Dumazet, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Linux Kernel Network Developers, bpf, LKML
In-Reply-To: <b606e0355949a3ca8081ee29d9d22f2f30e898bd.1650575919.git.peilin.ye@bytedance.com>

On Thu, Apr 21, 2022 at 3:09 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> From: Peilin Ye <peilin.ye@bytedance.com>
>
> As pointed out by Jakub Kicinski, currently using TUNNEL_SEQ in
> collect_md mode is racy for [IP6]GRE[TAP] devices.  Consider the
> following sequence of events:
>
> 1. An [IP6]GRE[TAP] device is created in collect_md mode using "ip link
>    add ... external".  "ip" ignores "[o]seq" if "external" is specified,
>    so TUNNEL_SEQ is off, and the device is marked as NETIF_F_LLTX (i.e.
>    it uses lockless TX);
> 2. Someone sets TUNNEL_SEQ on outgoing skb's, using e.g.
>    bpf_skb_set_tunnel_key() in an eBPF program attached to this device;
> 3. gre_fb_xmit() or __gre6_xmit() processes these skb's:
>
>         gre_build_header(skb, tun_hlen,
>                          flags, protocol,
>                          tunnel_id_to_key32(tun_info->key.tun_id),
>                          (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
>                                               : 0);   ^^^^^^^^^^^^^^^^^
>
> Since we are not using the TX lock (&txq->_xmit_lock), multiple CPUs may
> try to do this tunnel->o_seqno++ in parallel, which is racy.  Fix it by
> making o_seqno atomic_t.
>
> As mentioned by Eric Dumazet in commit b790e01aee74 ("ip_gre: lockless
> xmit"), making o_seqno atomic_t increases "chance for packets being out
> of order at receiver" when NETIF_F_LLTX is on.
>
> Maybe a better fix would be:
>
> 1. Do not ignore "oseq" in external mode.  Users MUST specify "oseq" if
>    they want the kernel to allow sequencing of outgoing packets;
> 2. Reject all outgoing TUNNEL_SEQ packets if the device was not created
>    with "oseq".
>
> Unfortunately, that would break userspace.
>
> We could now make [IP6]GRE[TAP] devices always NETIF_F_LLTX, but let us
> do it in separate patches to keep this fix minimal.
>
> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> Fixes: 77a5196a804e ("gre: add sequence number for collect md mode.")
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
> ---

LGTM
Acked-by: William Tu <u9012063@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next] libbpf: also check /sys/kernel/tracing for tracefs files
From: Daniel Borkmann @ 2022-04-22 16:32 UTC (permalink / raw)
  To: Connor O'Brien, Alexei Starovoitov, Andrii Nakryiko
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, netdev, bpf, linux-kernel
In-Reply-To: <20220421201125.13907-1-connoro@google.com>

On 4/21/22 10:11 PM, Connor O'Brien wrote:
> libbpf looks for tracefs files only under debugfs, but tracefs may be
> mounted even if debugfs is not. When /sys/kernel/debug/tracing is
> absent, try looking under /sys/kernel/tracing instead.
> 
> Signed-off-by: Connor O'Brien <connoro@google.com>
> ---
>   tools/lib/bpf/libbpf.c | 26 +++++++++++++++++++-------
>   1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 68cc134d070d..6ef587329eb2 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -10140,10 +10140,16 @@ static void gen_kprobe_legacy_event_name(char *buf, size_t buf_sz,
>   		 __sync_fetch_and_add(&index, 1));
>   }
>   
> +static bool debugfs_available(void)
> +{
> +	return !access("/sys/kernel/debug/tracing", F_OK);

Should this be a one-time probe, so on subsequent debugfs_available() calls
we return the initially cached result?

> +}
> +
>   static int add_kprobe_event_legacy(const char *probe_name, bool retprobe,
>   				   const char *kfunc_name, size_t offset)
>   {
> -	const char *file = "/sys/kernel/debug/tracing/kprobe_events";
> +	const char *file = debugfs_available() ? "/sys/kernel/debug/tracing/kprobe_events" :
> +		"/sys/kernel/tracing/kprobe_events";
>   
>   	return append_to_file(file, "%c:%s/%s %s+0x%zx",
>   			      retprobe ? 'r' : 'p',
> @@ -10153,7 +10159,8 @@ static int add_kprobe_event_legacy(const char *probe_name, bool retprobe,
>   
>   static int remove_kprobe_event_legacy(const char *probe_name, bool retprobe)
>   {
> -	const char *file = "/sys/kernel/debug/tracing/kprobe_events";
> +	const char *file = debugfs_available() ? "/sys/kernel/debug/tracing/kprobe_events" :
> +		"/sys/kernel/tracing/kprobe_events";
>   
>   	return append_to_file(file, "-:%s/%s", retprobe ? "kretprobes" : "kprobes", probe_name);
>   }
> @@ -10163,7 +10170,8 @@ static int determine_kprobe_perf_type_legacy(const char *probe_name, bool retpro
>   	char file[256];
>   
>   	snprintf(file, sizeof(file),
> -		 "/sys/kernel/debug/tracing/events/%s/%s/id",
> +		 debugfs_available() ? "/sys/kernel/debug/tracing/events/%s/%s/id" :
> +		 "/sys/kernel/tracing/events/%s/%s/id",
>   		 retprobe ? "kretprobes" : "kprobes", probe_name);
>   
>   	return parse_uint_from_file(file, "%d\n");
> @@ -10493,7 +10501,8 @@ static void gen_uprobe_legacy_event_name(char *buf, size_t buf_sz,
>   static inline int add_uprobe_event_legacy(const char *probe_name, bool retprobe,
>   					  const char *binary_path, size_t offset)
>   {
> -	const char *file = "/sys/kernel/debug/tracing/uprobe_events";
> +	const char *file = debugfs_available() ? "/sys/kernel/debug/tracing/uprobe_events" :
> +		"/sys/kernel/tracing/uprobe_events";
>   
>   	return append_to_file(file, "%c:%s/%s %s:0x%zx",
>   			      retprobe ? 'r' : 'p',
> @@ -10503,7 +10512,8 @@ static inline int add_uprobe_event_legacy(const char *probe_name, bool retprobe,
>   
>   static inline int remove_uprobe_event_legacy(const char *probe_name, bool retprobe)
>   {
> -	const char *file = "/sys/kernel/debug/tracing/uprobe_events";
> +	const char *file = debugfs_available() ? "/sys/kernel/debug/tracing/uprobe_events" :
> +		"/sys/kernel/tracing/uprobe_events";
>   
>   	return append_to_file(file, "-:%s/%s", retprobe ? "uretprobes" : "uprobes", probe_name);
>   }
> @@ -10513,7 +10523,8 @@ static int determine_uprobe_perf_type_legacy(const char *probe_name, bool retpro
>   	char file[512];
>   
>   	snprintf(file, sizeof(file),
> -		 "/sys/kernel/debug/tracing/events/%s/%s/id",
> +		 debugfs_available() ? "/sys/kernel/debug/tracing/events/%s/%s/id" :
> +		 "/sys/kernel/tracing/events/%s/%s/id",
>   		 retprobe ? "uretprobes" : "uprobes", probe_name);
>   
>   	return parse_uint_from_file(file, "%d\n");
> @@ -11071,7 +11082,8 @@ static int determine_tracepoint_id(const char *tp_category,
>   	int ret;
>   
>   	ret = snprintf(file, sizeof(file),
> -		       "/sys/kernel/debug/tracing/events/%s/%s/id",
> +		       debugfs_available() ? "/sys/kernel/debug/tracing/events/%s/%s/id" :
> +		       "/sys/kernel/tracing/events/%s/%s/id",
>   		       tp_category, tp_name);
>   	if (ret < 0)
>   		return -errno;
> 


^ permalink raw reply

* Re: [PATCH net 2/3] ip6_gre: Make o_seqno start from 0 in native mode
From: William Tu @ 2022-04-22 16:25 UTC (permalink / raw)
  To: Peilin Ye
  Cc: David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Paolo Abeni, Peilin Ye, xeb@mail.ru, Daniel Borkmann, Cong Wang,
	Eric Dumazet, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Linux Kernel Network Developers, bpf, LKML
In-Reply-To: <950bfd124e4f87bd9e1acbf6303545875c3681fe.1650575919.git.peilin.ye@bytedance.com>

On Thu, Apr 21, 2022 at 3:08 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> From: Peilin Ye <peilin.ye@bytedance.com>
>
> For IP6GRE and IP6GRETAP devices, currently o_seqno starts from 1 in
> native mode.  According to RFC 2890 2.2., "The first datagram is sent
> with a sequence number of 0."  Fix it.
>
> It is worth mentioning that o_seqno already starts from 0 in collect_md
> mode, see the "if (tunnel->parms.collect_md)" clause in __gre6_xmit(),
> where tunnel->o_seqno is passed to gre_build_header() before getting
> incremented.
>
> Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>

LGTM
Acked-by: William Tu <u9012063@gmail.com>

^ permalink raw reply

* Re: [PATCH net 1/3] ip_gre: Make o_seqno start from 0 in native mode
From: William Tu @ 2022-04-22 16:25 UTC (permalink / raw)
  To: Peilin Ye
  Cc: David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Paolo Abeni, Peilin Ye, xeb@mail.ru, Daniel Borkmann, Cong Wang,
	Eric Dumazet, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Linux Kernel Network Developers, bpf, LKML
In-Reply-To: <dd63f881729052aa4e08a5c7cb9732724c557dfd.1650575919.git.peilin.ye@bytedance.com>

On Thu, Apr 21, 2022 at 3:08 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> From: Peilin Ye <peilin.ye@bytedance.com>
>
> For GRE and GRETAP devices, currently o_seqno starts from 1 in native
> mode.  According to RFC 2890 2.2., "The first datagram is sent with a
> sequence number of 0."  Fix it.
>
> It is worth mentioning that o_seqno already starts from 0 in collect_md
> mode, see gre_fb_xmit(), where tunnel->o_seqno is passed to
> gre_build_header() before getting incremented.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>

LGTM
Acked-by: William Tu <u9012063@gmail.com>

^ permalink raw reply

* Re: FEC MDIO read timeout on linkup
From: Francesco Dolcini @ 2022-04-22 16:04 UTC (permalink / raw)
  To: Fabio Estevam
  Cc: Francesco Dolcini, Joakim Zhang, Andrew Lunn, Heiner Kallweit,
	Russell King, netdev, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Tim Harvey
In-Reply-To: <CAOMZO5B6nBH_fFGsg+EKZGTOqqTjztvpGNNCJ0wpbcTq+2vPDA@mail.gmail.com>

Hello Fabio,

On Fri, Apr 22, 2022 at 12:55:22PM -0300, Fabio Estevam wrote:
> On Fri, Apr 22, 2022 at 12:26 PM Francesco Dolcini
> <francesco.dolcini@toradex.com> wrote:
> > What I can see from the code is that the timeout is coming from
> > net/phy/micrel.c:kszphy_handle_interrupt().
> 
> For debugging purposes, could you try not to describe the Ethernet PHY
> irq pin inside imx6qdl-apalis.dtsi?
Yep, we'll try to see what's going on with polling instead that having the
interrupt enabled.

> Looking at its pinctrl, I don't see it has pull-up enabled. Is there
> an external pull-up on the KSZ9131 pin?
yes, we have a 4.7K Ohm pull-up on the board.

In addition to that I just tried with

@@ -4151,7 +4151,7 @@ static int __maybe_unused fec_runtime_suspend(struct device *dev)
        struct fec_enet_private *fep = netdev_priv(ndev);

        clk_disable_unprepare(fep->clk_ahb);
-       clk_disable_unprepare(fep->clk_ipg);
+       //clk_disable_unprepare(fep->clk_ipg);

and the issue is still present.

Francesco


^ permalink raw reply

* Re: [PATCH net-next] net: generalize skb freeing deferral to per-cpu lists
From: Eric Dumazet @ 2022-04-22 15:59 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev
In-Reply-To: <319497a698ba77244aa935c13dc9b93c893dbbc3.camel@redhat.com>

On Fri, Apr 22, 2022 at 2:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> Looks great! I have a few questions below mostly to understand better
> how it works...
>

Hi Paolo, thanks for the review :)

> On Thu, 2022-04-21 at 08:39 -0700, Eric Dumazet wrote:
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 84d78df60453955a8eaf05847f6e2145176a727a..2fe311447fae5e860eee95f6e8772926d4915e9f 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1080,6 +1080,7 @@ struct sk_buff {
> >               unsigned int    sender_cpu;
> >       };
> >  #endif
> > +     u16                     alloc_cpu;
>
> I *think* we could in theory fetch the CPU that allocated the skb from
> the napi_id - adding a cpu field to napi_struct and implementing an
> helper to fetch it. Have you considered that option? or the napi lookup
> would be just too expensive?

I have considered that, but a NAPI is not guaranteed to be
owned/serviced from a single cpu.

(In fact, I realized recently about the fact that commit
01770a166165 "tcp: fix race condition when creating child sockets from
syncookies"
has not been backported to stable kernels.

This tcp bug would not happen in normal cases, where all packets from
a particular 4-tuple
are handled by a single cpu.

>
> [...]
>
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 4a77ebda4fb155581a5f761a864446a046987f51..4136d9c0ada6870ea0f7689702bdb5f0bbf29145 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4545,6 +4545,12 @@ static void rps_trigger_softirq(void *data)
> >
> >  #endif /* CONFIG_RPS */
> >
> > +/* Called from hardirq (IPI) context */
> > +static void trigger_rx_softirq(void *data)
>
> Perhaps '__always_unused' ? (But the compiler doesn't complain here)

Sure I will add this.

>
> > @@ -6486,3 +6487,46 @@ void __skb_ext_put(struct skb_ext *ext)
> >  }
> >  EXPORT_SYMBOL(__skb_ext_put);
> >  #endif /* CONFIG_SKB_EXTENSIONS */
> > +
> > +/**
> > + * skb_attempt_defer_free - queue skb for remote freeing
> > + * @skb: buffer
> > + *
> > + * Put @skb in a per-cpu list, using the cpu which
> > + * allocated the skb/pages to reduce false sharing
> > + * and memory zone spinlock contention.
> > + */
> > +void skb_attempt_defer_free(struct sk_buff *skb)
> > +{
> > +     int cpu = skb->alloc_cpu;
> > +     struct softnet_data *sd;
> > +     unsigned long flags;
> > +     bool kick;
> > +
> > +     if (WARN_ON_ONCE(cpu >= nr_cpu_ids) || !cpu_online(cpu)) {
> > +             __kfree_skb(skb);
> > +             return;
> > +     }
>
> I'm wondering if we should skip even when cpu == smp_processor_id()?

Yes, although we would have to use the raw_smp_processor_id() form I guess.

>
> > +
> > +     sd = &per_cpu(softnet_data, cpu);
> > +     /* We do not send an IPI or any signal.
> > +      * Remote cpu will eventually call skb_defer_free_flush()
> > +      */
> > +     spin_lock_irqsave(&sd->skb_defer_list.lock, flags);
> > +     __skb_queue_tail(&sd->skb_defer_list, skb);
> > +
> > +     /* kick every time queue length reaches 128.
> > +      * This should avoid blocking in smp_call_function_single_async().
> > +      * This condition should hardly be bit under normal conditions,
> > +      * unless cpu suddenly stopped to receive NIC interrupts.
> > +      */
> > +     kick = skb_queue_len(&sd->skb_defer_list) == 128;
>
> Out of sheer curiosity why 128? I guess it's should be larger then
> NAPI_POLL_WEIGHT, to cope with with maximum theorethical burst len?

Yes, I needed a value there, but was not sure which precise one.
In my tests I had no IPI ever sent with 128.

>
> Thanks!
>
> Paolo
>

^ permalink raw reply

* Re: FEC MDIO read timeout on linkup
From: Fabio Estevam @ 2022-04-22 15:55 UTC (permalink / raw)
  To: Francesco Dolcini
  Cc: Joakim Zhang, Andrew Lunn, Heiner Kallweit, Russell King, netdev,
	Jakub Kicinski, Paolo Abeni, David S. Miller, Tim Harvey
In-Reply-To: <20220422152612.GA510015@francesco-nb.int.toradex.com>

Hi Francesco,

On Fri, Apr 22, 2022 at 12:26 PM Francesco Dolcini
<francesco.dolcini@toradex.com> wrote:
>
> Hello all,
> I have been recently trying to debug an issue with FEC driver erroring
> a MDIO read timeout during linkup [0]. At the beginning I was working
> with an old 5.4 kernel, but today I tried with the current master
> (5.18.0-rc3-00080-gd569e86915b7) and the issue is just there.
>
> I'm also aware of the old discussions on the topic and I tried to
> increase the timeout without success (even if I'm not sure is relevant
> with the newer polling solution).
>
> The issue was reproduced on an apalis-imx6 that has a KSZ9131
> ethernet connected to the FEC MAC.
>
> No load on the machine, 4 cores just idling during my test.
>
> What I can see from the code is that the timeout is coming from
> net/phy/micrel.c:kszphy_handle_interrupt().

For debugging purposes, could you try not to describe the Ethernet PHY
irq pin inside imx6qdl-apalis.dtsi?

Looking at its pinctrl, I don't see it has pull-up enabled. Is there
an external pull-up
on the KSZ9131 pin?

^ permalink raw reply

* Re: [PATCH bpf] lwt_bpf: fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook
From: Daniel Borkmann @ 2022-04-22 15:55 UTC (permalink / raw)
  To: Eyal Birger, davem, kuba, pabeni, ast, andrii
  Cc: kafai, songliubraving, yhs, john.fastabend, kpsingh, mkl, tgraf,
	shmulik.ladkani, netdev, bpf, stable
In-Reply-To: <c053fdf3-84bb-faee-387d-6edb2df9ffee@iogearbox.net>

On 4/22/22 5:48 PM, Daniel Borkmann wrote:
> On 4/20/22 6:52 PM, Eyal Birger wrote:
[...]
> 
> Ok, makes sense given for BPF_OK the dst->dev shouldn't change here (e.g. as opposed
> to BPF_REDIRECT). Applied, please also follow-up with a BPF selftest for test_progs
> so that this won't break in future when it's running as part of BPF CI.

(Coverage for lwt BPF flavor from test_progs is afaik non-existent aside from some section
name tests. Would be great in general to have runtime tests asserting lwt behavior there if
you have a chance.)

^ permalink raw reply

* Re: [PATCH bpf] lwt_bpf: fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook
From: patchwork-bot+netdevbpf @ 2022-04-22 15:50 UTC (permalink / raw)
  To: Eyal Birger
  Cc: davem, kuba, pabeni, ast, daniel, andrii, kafai, songliubraving,
	yhs, john.fastabend, kpsingh, mkl, tgraf, shmulik.ladkani, netdev,
	bpf, stable
In-Reply-To: <20220420165219.1755407-1-eyal.birger@gmail.com>

Hello:

This patch was applied to bpf/bpf.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Wed, 20 Apr 2022 19:52:19 +0300 you wrote:
> xmit_check_hhlen() observes the dst for getting the device hard header
> length to make sure a modified packet can fit. When a helper which changes
> the dst - such as bpf_skb_set_tunnel_key() - is called as part of the xmit
> program the accessed dst is no longer valid.
> 
> This leads to the following splat:
> 
> [...]

Here is the summary with links:
  - [bpf] lwt_bpf: fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook
    https://git.kernel.org/bpf/bpf/c/b02d196c44ea

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH bpf] lwt_bpf: fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook
From: Daniel Borkmann @ 2022-04-22 15:48 UTC (permalink / raw)
  To: Eyal Birger, davem, kuba, pabeni, ast, andrii
  Cc: kafai, songliubraving, yhs, john.fastabend, kpsingh, mkl, tgraf,
	shmulik.ladkani, netdev, bpf, stable
In-Reply-To: <20220420165219.1755407-1-eyal.birger@gmail.com>

On 4/20/22 6:52 PM, Eyal Birger wrote:
> xmit_check_hhlen() observes the dst for getting the device hard header
> length to make sure a modified packet can fit. When a helper which changes
> the dst - such as bpf_skb_set_tunnel_key() - is called as part of the xmit
> program the accessed dst is no longer valid.
> 
> This leads to the following splat:
> 
>   BUG: kernel NULL pointer dereference, address: 00000000000000de
>   #PF: supervisor read access in kernel mode
>   #PF: error_code(0x0000) - not-present page
>   PGD 0 P4D 0
>   Oops: 0000 [#1] PREEMPT SMP PTI
>   CPU: 0 PID: 798 Comm: ping Not tainted 5.18.0-rc2+ #103
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>   RIP: 0010:bpf_xmit+0xfb/0x17f
>   Code: c6 c0 4d cd 8e 48 c7 c7 7d 33 f0 8e e8 42 09 fb ff 48 8b 45 58 48 8b 95 c8 00 00 00 48 2b 95 c0 00 00 00 48 83 e0 fe 48 8b 00 <0f> b7 80 de 00 00 00 39 c2 73 22 29 d0 b9 20 0a 00 00 31 d2 48 89
>   RSP: 0018:ffffb148c0bc7b98 EFLAGS: 00010282
>   RAX: 0000000000000000 RBX: 0000000000240008 RCX: 0000000000000000
>   RDX: 0000000000000010 RSI: 00000000ffffffea RDI: 00000000ffffffff
>   RBP: ffff922a828a4e00 R08: ffffffff8f1350e8 R09: 00000000ffffdfff
>   R10: ffffffff8f055100 R11: ffffffff8f105100 R12: 0000000000000000
>   R13: ffff922a828a4e00 R14: 0000000000000040 R15: 0000000000000000
>   FS:  00007f414e8f0080(0000) GS:ffff922afdc00000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00000000000000de CR3: 0000000002d80006 CR4: 0000000000370ef0
>   Call Trace:
>    <TASK>
>    lwtunnel_xmit.cold+0x71/0xc8
>    ip_finish_output2+0x279/0x520
>    ? __ip_finish_output.part.0+0x21/0x130
> 
> Fix by fetching the device hard header length before running the bpf code.
> 
> Cc: stable@vger.kernel.org
> Fixes: commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure")
> Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
> ---
>   net/core/lwt_bpf.c | 7 +++----
>   1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> index 349480ef68a5..8b6b5e72b217 100644
> --- a/net/core/lwt_bpf.c
> +++ b/net/core/lwt_bpf.c
> @@ -159,10 +159,8 @@ static int bpf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>   	return dst->lwtstate->orig_output(net, sk, skb);
>   }
>   
> -static int xmit_check_hhlen(struct sk_buff *skb)
> +static int xmit_check_hhlen(struct sk_buff *skb, int hh_len)
>   {
> -	int hh_len = skb_dst(skb)->dev->hard_header_len;
> -
>   	if (skb_headroom(skb) < hh_len) {
>   		int nhead = HH_DATA_ALIGN(hh_len - skb_headroom(skb));
>   
> @@ -274,6 +272,7 @@ static int bpf_xmit(struct sk_buff *skb)
>   
>   	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
>   	if (bpf->xmit.prog) {
> +		int hh_len = dst->dev->hard_header_len;
>   		__be16 proto = skb->protocol;
>   		int ret;
>   
> @@ -291,7 +290,7 @@ static int bpf_xmit(struct sk_buff *skb)
>   			/* If the header was expanded, headroom might be too
>   			 * small for L2 header to come, expand as needed.
>   			 */
> -			ret = xmit_check_hhlen(skb);
> +			ret = xmit_check_hhlen(skb, hh_len);

Ok, makes sense given for BPF_OK the dst->dev shouldn't change here (e.g. as opposed
to BPF_REDIRECT). Applied, please also follow-up with a BPF selftest for test_progs
so that this won't break in future when it's running as part of BPF CI.

>   			if (unlikely(ret))
>   				return ret;
>   
> 

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net-next v2] net: phy: broadcom: 1588 support on bcm54210pe
From: Andrew Lunn @ 2022-04-22 15:42 UTC (permalink / raw)
  To: Lasse Johnsen
  Cc: Richard Cochran, netdev, Gordon Hollingworth, Ahmad Byagowi,
	Florian Fainelli, Heiner Kallweit, Russell King,
	bcm-kernel-feedback-list, David S. Miller, Jakub Kicinski,
	Paolo Abeni
In-Reply-To: <208820C3-E4C8-4B75-B926-15BCD844CE96@timebeat.app>

On Fri, Apr 22, 2022 at 04:08:18PM +0100, Lasse Johnsen wrote:

Hi Lasse

Don't attach a patch to the end of a discussion. What you email is
what comes out of git-format patch. Nothing added.

If you want to discuss review comments, just reply to the email with
the comments. 

> From 3fcbbac9fe85909877f15d95f7a1e64dd6d06ab7 Mon Sep 17 00:00:00 2001
> From: Lasse Johnsen <l@ssejohnsen.me>
> Date: Sat, 5 Feb 2022 09:34:19 -0500
> Subject: [PATCH] Added support for IEEE1588 timestamping for the BCM54210PE
>  PHY using the kernel mii_timestamper interface
> 
> ---
>  arch/arm/configs/bcm2711_defconfig            |    1 +
>  arch/arm64/configs/bcm2711_defconfig          |    1 +
>  .../net/ethernet/broadcom/genet/bcmgenet.c    |    8 +-
>  drivers/net/phy/Makefile                      |    1 +
>  drivers/net/phy/bcm54210pe_ptp.c              | 1382 +++++++++++++++++
>  drivers/net/phy/bcm54210pe_ptp.h              |   85 +
>  drivers/net/phy/broadcom.c                    |   21 +-
>  drivers/ptp/Kconfig                           |   17 +

You need to break this up into a patch series. You probably want the
following patches:

defconfig changes
The core ptp code, in library form
Extensions to drivers/net/phy/broadcom.c to use the new code
bcmgenet.c change.


> +static u64 four_u16_to_ns(u16 *four_u16)
> +{
> +	u32 seconds;
> +	u32 nanoseconds;
> +	struct timespec64 ts;
> +	u16 *ptr;

Now it has been through checkpatch, it is starting to look like Linux
code :-)

Network drivers have a few extra code style hoops to jump
through. Variables should be sorted longest to shortest. So you want:

> +	struct timespec64 ts;
> +	u32 nanoseconds;
> +	u32 seconds;
> +	u16 *ptr;

This is known as reverse Christmas tree.

> +static int bcm54210pe_interrupts_enable(struct phy_device *phydev, bool fsync_en, bool sop_en)

Although Linux as a whole allows 100 character lines, networking
mostly stays with 80. I'm not sure it is strictly enforced, but it is
a good idea to try to keep with it.

> +{
> +	u16 interrupt_mask;
> +
> +	interrupt_mask = 0;

You can combine these into one line.

> +static int bcm54210pe_get80bittime(struct bcm54210pe_private *private,
> +				   struct timespec64 *ts,
> +				   struct ptp_system_timestamp *sts)
> +{
> +	struct phy_device *phydev;
> +	u16 nco_6_register_value;
> +	int i;
> +	u64 time_stamp_48, time_stamp_80, control_ts;
> +
> +	phydev = private->phydev;
> +
> +	// Capture timestamp on next framesync
> +	nco_6_register_value = 0x2000;

You should not have magic numbers. Please add a #define. If the
#define has a sensible name, it should then be obvious you are
capturing a timestamp on the next frame sync and so you don't need the
comment.

> +
> +	// Lock
> +	mutex_lock(&private->clock_lock);

Comments like this don't add any value. I can see it is a lock because
you are calling mutex_lock.

> +static int bcm54210pe_perout_enable(struct bcm54210pe_private *private, s64 period,
> +				    s64 pulsewidth, int enable)
> +{
> +	struct phy_device *phydev;
> +	u16 nco_6_register_value, frequency_hi, frequency_lo,
> +		pulsewidth_reg, pulse_start_hi, pulse_start_lo;
> +	int err;
> +
> +	phydev = private->phydev;
> +
> +	if (enable) {
> +		frequency_hi = 0;
> +		frequency_lo = 0;
> +		pulsewidth_reg = 0;
> +		pulse_start_hi = 0;
> +		pulse_start_lo = 0;
> +
> +		// Convert interval pulse spacing (period) and pulsewidth to 8 ns units
> +		period /= 8;
> +		pulsewidth /= 8;
> +
> +		// Mode 2 only: If pulsewidth is not explicitly set with PTP_PEROUT_DUTY_CYCLE
> +		if (pulsewidth == 0) {
> +			if (period < 2500) {
> +				// At a frequency at less than 20us (2500 x 8ns) set
> +				// pulse length to 1/10th of the interval pulse spacing
> +				pulsewidth = period / 10;
> +
> +				// Where the interval pulse spacing is short,
> +				// ensure we set a pulse length of 8ns
> +				if (pulsewidth == 0)
> +					pulsewidth = 1;
> +
> +			} else {
> +				// Otherwise set pulse with to 4us (8ns x 500 = 4us)
> +				pulsewidth = 500;
> +			}
> +		}
> +
> +		if (private->perout_mode == SYNC_OUT_MODE_1) {
> +			// Set period
> +			private->perout_period = period;
> +
> +			if (!private->perout_en) {
> +				// Set enable per_out
> +				private->perout_en = true;
> +				schedule_delayed_work(&private->perout_ws, msecs_to_jiffies(1));
> +			}
> +
> +			err = 0;
> +
> +		} else if (private->perout_mode == SYNC_OUT_MODE_2) {
> +			// Set enable per_out
> +			private->perout_en = true;
> +
> +			// Calculate registers
> +
> +			// Lowest 16 bits of 8ns interval pulse spacing [15:0]
> +			frequency_lo	= (u16)period;
> +
> +			// Highest 14 bits of 8ns interval pulse spacing [29:16]
> +			frequency_hi	= (u16)(0x3FFF & (period >> 16));
> +
> +			// 2 lowest bits of 8ns pulse length [1:0]
> +			frequency_hi   |= (u16)pulsewidth << 14;
> +
> +			// 7 highest bit  of 8 ns pulse length [8:2]
> +			pulsewidth_reg	= (u16)(0x7F & (pulsewidth >> 2));
> +
> +			// Get base value
> +			nco_6_register_value = bcm54210pe_get_base_nco6_reg(private,
> +									    nco_6_register_value,
> +									    true);
> +
> +			mutex_lock(&private->clock_lock);
> +
> +			// Write to register
> +			err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG,
> +						nco_6_register_value);
> +
> +			// Set sync out pulse interval spacing and pulse length
> +			err |= bcm_phy_write_exp(phydev, NSE_DPPL_NCO_3_0_REG, frequency_lo);
> +			err |= bcm_phy_write_exp(phydev, NSE_DPPL_NCO_3_1_REG, frequency_hi);
> +			err |= bcm_phy_write_exp(phydev, NSE_DPPL_NCO_3_2_REG, pulsewidth_reg);
> +
> +			// On next framesync load sync out frequency
> +			err |= bcm_phy_write_exp(phydev, SHADOW_REG_LOAD, 0x0200);
> +
> +			// Trigger immediate framesync
> +			err |= bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
> +
> +			mutex_unlock(&private->clock_lock);
> +		}
> +	} else {
> +		// Set disable pps
> +		private->perout_en = false;
> +
> +		// Get base value
> +		nco_6_register_value = bcm54210pe_get_base_nco6_reg(private,
> +								    nco_6_register_value,
> +								    false);
> +
> +		mutex_lock(&private->clock_lock);
> +
> +		// Write to register
> +		err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
> +
> +		mutex_unlock(&private->clock_lock);
> +	}
> +
> +	return err;

This is a pretty big function, and its indentation gets pretty deep. The coding style:

https://www.kernel.org/doc/html/latest/process/coding-style.html#functions

says:

Functions should be short and sweet, and do just one thing. They
should fit on one or two screenfuls of text (the ISO/ANSI screen size
is 80x24, as we all know), and do one thing and do that well.

Maybe you can break this up into helpers.

> +void bcm54210pe_txtstamp(struct mii_timestamper *mii_ts, struct sk_buff *skb, int type)
> +{
> +	struct bcm54210pe_private *private = container_of(mii_ts, struct bcm54210pe_private,
> +							  mii_ts);
> +
> +	switch (private->hwts_tx_en) {
> +	case HWTSTAMP_TX_ON:
> +	{
> +		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +		skb_queue_tail(&private->tx_skb_queue, skb);
> +		schedule_work(&private->txts_work);
> +		break;
> +	}
> +
> +	case HWTSTAMP_TX_OFF:
> +	{
> +	}

That just looks odd.

Should there be a break? Do you want to fall through? If you do want
to fall through, you need to mark it so. But since it is empty, i
guess you don't want to fall through?

> +
> +	default:
> +	{
> +		kfree_skb(skb);
> +		break;
> +	}
> +	}
> +}

^ permalink raw reply

* FEC MDIO read timeout on linkup
From: Francesco Dolcini @ 2022-04-22 15:26 UTC (permalink / raw)
  To: Joakim Zhang, Andrew Lunn, Heiner Kallweit, Russell King
  Cc: netdev, Jakub Kicinski, Paolo Abeni, David S. Miller,
	Fabio Estevam, Tim Harvey

Hello all,
I have been recently trying to debug an issue with FEC driver erroring
a MDIO read timeout during linkup [0]. At the beginning I was working
with an old 5.4 kernel, but today I tried with the current master
(5.18.0-rc3-00080-gd569e86915b7) and the issue is just there.

I'm also aware of the old discussions on the topic and I tried to
increase the timeout without success (even if I'm not sure is relevant
with the newer polling solution).

The issue was reproduced on an apalis-imx6 that has a KSZ9131
ethernet connected to the FEC MAC.

No load on the machine, 4 cores just idling during my test.

What I can see from the code is that the timeout is coming from
net/phy/micrel.c:kszphy_handle_interrupt().

Could this be some sort of race condition? Any suggestion for debugging
this?

Here the stack trace:

[  146.195696] fec 2188000.ethernet eth0: MDIO read timeout
[  146.201779] ------------[ cut here ]------------
[  146.206671] WARNING: CPU: 0 PID: 571 at drivers/net/phy/phy.c:942 phy_error+0x24/0x6c
[  146.214744] Modules linked in: bnep imx_vdoa imx_sdma evbug
[  146.220640] CPU: 0 PID: 571 Comm: irq/128-2188000 Not tainted 5.18.0-rc3-00080-gd569e86915b7 #9
[  146.229563] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[  146.236257]  unwind_backtrace from show_stack+0x10/0x14
[  146.241640]  show_stack from dump_stack_lvl+0x58/0x70
[  146.246841]  dump_stack_lvl from __warn+0xb4/0x24c
[  146.251772]  __warn from warn_slowpath_fmt+0x5c/0xd4
[  146.256873]  warn_slowpath_fmt from phy_error+0x24/0x6c
[  146.262249]  phy_error from kszphy_handle_interrupt+0x40/0x48
[  146.268159]  kszphy_handle_interrupt from irq_thread_fn+0x1c/0x78
[  146.274417]  irq_thread_fn from irq_thread+0xf0/0x1dc
[  146.279605]  irq_thread from kthread+0xe4/0x104
[  146.284267]  kthread from ret_from_fork+0x14/0x28
[  146.289164] Exception stack(0xe6fa1fb0 to 0xe6fa1ff8)
[  146.294448] 1fa0:                                     00000000 00000000 00000000 00000000
[  146.302842] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  146.311281] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  146.318262] irq event stamp: 12325
[  146.321780] hardirqs last  enabled at (12333): [<c01984c4>] __up_console_sem+0x50/0x60
[  146.330013] hardirqs last disabled at (12342): [<c01984b0>] __up_console_sem+0x3c/0x60
[  146.338259] softirqs last  enabled at (12324): [<c01017f0>] __do_softirq+0x2c0/0x624
[  146.346311] softirqs last disabled at (12319): [<c01300ac>] __irq_exit_rcu+0x138/0x178
[  146.354447] ---[ end trace 0000000000000000 ]---


The issue is not systematic, however using the following script is
pretty easy (minutes) to trigger:

```
#!/bin/bash

count=0

wait_link_or_exit()
{
	tmo=600
	while ! ethtool eth0 |grep -qF 'Link detected: yes'
	do
		sleep 0.1
		tmo=$((tmo - 1))
		[ $tmo -gt 0 ] || exit 1
	done
}

while true
do
	count=$((count + 1))
	echo "run $count"

	ethtool -s eth0 speed 10 duplex half autoneg on
	wait_link_or_exit

	ethtool -s eth0 speed 10 duplex full autoneg on
	wait_link_or_exit

	ethtool -s eth0 speed 100 duplex half autoneg on
	wait_link_or_exit

	ethtool -s eth0 speed 100 duplex full autoneg on
	wait_link_or_exit
done

```

Francesco

[0] https://lore.kernel.org/all/20220325140808.GA1047855@francesco-nb.int.toradex.com/



^ permalink raw reply

* Re: [PATCH net-next v2] net: phy: broadcom: 1588 support on bcm54210pe
From: Jonathan Lemon @ 2022-04-22 15:22 UTC (permalink / raw)
  To: Lasse Johnsen
  Cc: Richard Cochran, netdev, Gordon Hollingworth, Ahmad Byagowi,
	Florian Fainelli, Andrew Lunn, Heiner Kallweit, Russell King,
	bcm-kernel-feedback-list, David S. Miller, Jakub Kicinski,
	Paolo Abeni
In-Reply-To: <208820C3-E4C8-4B75-B926-15BCD844CE96@timebeat.app>

On Fri, Apr 22, 2022 at 04:08:18PM +0100, Lasse Johnsen wrote:
> > On 21 Apr 2022, at 15:48, Richard Cochran <richardcochran@gmail.com> wrote:
> > Moreover: Does this device provide in-band Rx time stamps?  If so, why
> > not use them?
> 
> This is the first generation PHY and it does not do in-band RX. I asked BCM and studied the documentation. I’m sure I’m allowed to say, that the second generation 40nm BCM PHY (which - "I am not making this up" is available in 3 versions: BCM54210, BCM54210S and BCM54210SE - not “PE”) - supports in-band rx timestamps. However, as a matter of curiosity, BCM utilise the field in the header now used for minor versioning in 1588-2019, so in due course using this silicon feature will be a significant challenge.

Actually, it does support in-band RX timestamps.  Doing this would be
cleaner, and you'd only need to capture TX timestamps.
-- 
Jonathan

^ permalink raw reply

* [PATCH net-next v2] net: phy: broadcom: 1588 support on bcm54210pe
From: Lasse Johnsen @ 2022-04-22 15:08 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, Gordon Hollingworth, Ahmad Byagowi, Florian Fainelli,
	Andrew Lunn, Heiner Kallweit, Russell King,
	bcm-kernel-feedback-list, David S. Miller, Jakub Kicinski,
	Paolo Abeni
In-Reply-To: <20220421144825.GA11810@hoboy.vegasvil.org>

[-- Attachment #1: Type: text/plain, Size: 15308 bytes --]

Hi Richard,

Firstly, off topic, please allow me the indulgence this once. If it was not for your work, then I’m pretty sure my career would have looked significantly different. It’s a real privilege to interact with you. Respect… :-)

(To anyone not too familiar with the real world consequences of what accurate clock sync has provided, in finance as an example, Richard’s work prevents brokerages from front-running their clients, it’s a bulwark against fraud and is the foundation for numerous pieces of financial regulation protecting the average Joe all over the world).


> On 21 Apr 2022, at 15:48, Richard Cochran <richardcochran@gmail.com> wrote:
> 
> On Wed, Apr 20, 2022 at 03:03:26PM +0100, Lasse Johnsen wrote:
> 
>> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> index bd1f419bc47ae..2fa6258103025 100644
>> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> 
> This change is unrelated to the PHY driver and should be in its own
> patch series.
> 

Noted, I have everything in a single commit after I flattened 86 previous commits - what is a good way to do what you propose?

>> @@ -39,8 +39,11 @@
>> 
>> #include <asm/unaligned.h>
>> 
>> +#include <linux/ptp_classify.h>
>> +
>> #include "bcmgenet.h"
>> 
>> +
> 
> Don't add extra blank lines.  As Andrew commented, run your code
> through checkpatch.pl and fix the coding style.

I have done so in the new patches below.

> 
>> /* Maximum number of hardware queues, downsized if needed */
>> #define GENET_MAX_MQ_CNT	4
>> 
>> @@ -2096,7 +2099,18 @@ static netdev_tx_t bcmgenet_xmit(struct sk_buff *skb, struct net_device *dev)
>> 	}
>> 
>> 	GENET_CB(skb)->last_cb = tx_cb_ptr;
>> -	skb_tx_timestamp(skb);
>> +
>> +	// Timestamping
>> +	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
>> +	{
>> +		//skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
>> +		skb_pull(skb, skb_mac_offset(skb)); // Feels like this pull should really be part of ptp_classify_raw...
>> +		skb_clone_tx_timestamp(skb);
>> +	}
>> +	else if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP))
>> +	{
>> +		skb_tstamp_tx(skb, NULL);
>> +	}
> 
> This is totally wrong.  skb_tx_timestamp() is the correct MAC driver API.

Thank you. What is a good if condition in the instance then please? In the new patches I have gone with:

(unlikely(skb_shinfo(skb)->tx_flags & (SKBTX_HW_TSTAMP | SKBTX_SW_TSTAMP)) > 0)


> 
>> 
>> 	/* Decrement total BD count and advance our write pointer */
>> 	ring->free_bds -= nr_frags + 1;
>> diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
>> index a13e402074cf8..528192d59d793 100644
>> --- a/drivers/net/phy/Makefile
>> +++ b/drivers/net/phy/Makefile
>> @@ -46,6 +46,7 @@ obj-$(CONFIG_BCM84881_PHY)	+= bcm84881.o
>> obj-$(CONFIG_BCM87XX_PHY)	+= bcm87xx.o
>> obj-$(CONFIG_BCM_CYGNUS_PHY)	+= bcm-cygnus.o
>> obj-$(CONFIG_BCM_NET_PHYLIB)	+= bcm-phy-lib.o
>> +obj-$(CONFIG_BCM54120PE_PHY)	+= bcm54210pe_ptp.o
>> obj-$(CONFIG_BROADCOM_PHY)	+= broadcom.o
>> obj-$(CONFIG_CICADA_PHY)	+= cicada.o
>> obj-$(CONFIG_CORTINA_PHY)	+= cortina.o
> 
>> diff --git a/drivers/net/phy/bcm54210pe_ptp.c b/drivers/net/phy/bcm54210pe_ptp.c
>> new file mode 100755
>> index 0000000000000..c4882c84229f9
>> --- /dev/null
>> +++ b/drivers/net/phy/bcm54210pe_ptp.c
>> @@ -0,0 +1,1406 @@
>> +// SPDX-License-Identifier: GPL-2.0+
>> +/*
>> + *  drivers/net/phy/bcm54210pe_ptp.c
>> + *
>> + * IEEE1588 (PTP), perout and extts for BCM54210PE PHY
>> + *
>> + * Authors: Carlos Fernandez, Kyle Judd, Lasse Johnsen
>> + * License: GPL
> 
> Use the proper MODULE macros for this info.

I have added the module macros in the new patches.

> 
>> + */
>> +
>> +#include <linux/gpio/consumer.h>
>> +#include <linux/ip.h>                                                                                
>> +#include <linux/net_tstamp.h>
>> +#include <linux/mii.h>
>> +#include <linux/phy.h>                                                                               
>> +#include <linux/ptp_classify.h>
>> +#include <linux/ptp_clock_kernel.h>                                                                  
>> +#include <linux/udp.h>
>> +#include <asm/unaligned.h> 
>> +#include <linux/brcmphy.h>
>> +#include <linux/irq.h>
>> +#include <linux/workqueue.h>
>> +#include <linux/gpio.h>
>> +#include <linux/if_ether.h>
>> +#include <linux/delay.h>
>> +#include <linux/sched.h>
> 
> Suggest keeping include directives in alphabetical order.

Noted, and included in new version of patches.

> 
>> +
>> +#include "bcm54210pe_ptp.h"
>> +#include "bcm-phy-lib.h"
>> +
>> +#define PTP_CONTROL_OFFSET		32
>> +#define PTP_TSMT_OFFSET 		0
>> +#define PTP_SEQUENCE_ID_OFFSET		30
>> +#define PTP_CLOCK_ID_OFFSET		20
>> +#define PTP_CLOCK_ID_SIZE		8
>> +#define PTP_SEQUENCE_PORT_NUMER_OFFSET  (PTP_CLOCK_ID_OFFSET + PTP_CLOCK_ID_SIZE)
>> +
>> +#define EXT_SELECT_REG			0x17
>> +#define EXT_DATA_REG			0x15
>> +
>> +#define EXT_ENABLE_REG1			0x17
>> +#define EXT_ENABLE_DATA1		0x0F7E
>> +#define EXT_ENABLE_REG2			0x15
>> +#define EXT_ENABLE_DATA2		0x0000
>> +
>> +#define EXT_1588_SLICE_REG		0x0810
>> +#define EXT_1588_SLICE_DATA		0x0101
> 
> EXT_1588_SLICE_DATA is not used anywhere.

Noted and removed. EXT_SELECT_REG removed for same reason.

>> +static bool bcm54210pe_fetch_timestamp(u8 txrx, u8 message_type, u16 seq_id, struct bcm54210pe_private *private, u64 *timestamp)
>> +{
>> +	struct bcm54210pe_circular_buffer_item *item; 
>> +	struct list_head *this, *next;
>> +
>> +	u8 index = (txrx * 4) + message_type;
>> +
>> +	if(index >= CIRCULAR_BUFFER_COUNT)
>> +	{
>> +		return false;
>> +	}
>> +
>> +	list_for_each_safe(this, next, &private->circular_buffers[index])
>> +	{
>> +		item = list_entry(this, struct bcm54210pe_circular_buffer_item, list);
>> +
>> +		if(item->sequence_id == seq_id && item->is_valid)
>> +		{
>> +			item->is_valid = false;
> 
> Instead of using this flag, just remove matched items from list.

It’s a circular buffer of a predefined size. I could be wrong, but does this not preclude what you suggest? If not, I would humbly ask for guidance.

>> +static void bcm54210pe_read_sop_time_register(struct bcm54210pe_private *private)
>> +{
>> +	struct phy_device *phydev = private->phydev;
>> +	struct bcm54210pe_circular_buffer_item *item;
>> +	u16 fifo_info_1, fifo_info_2;
>> +	u8 tx_or_rx, msg_type, index;
>> +	u16 sequence_id;
>> +	u64 timestamp;
>> +	u16 Time[4];
>> +
>> +	mutex_lock(&private->timestamp_buffer_lock);
>> +
>> +	while(bcm_phy_read_exp(phydev, INTERRUPT_STATUS_REG) & 2)
> 
> Replace magic number 2 proper macro.

The magic is gone and replaced with INTC_SOP. INTC_FSYNC has also been defined in the header file in preparation for future interrupt driven behaviour. It is currently unused. If this is not agreeable I can remove. 

> 
> Also, this loop is potentially infinite.  Add a sanity check to break out.

I have added an iteration check to prevent this.

>> +	while (skb != NULL) {
>> +
>> +		// Yes....  skb_defer_rx_timestamp just did this but <ZZZzzz>....
>> +		skb_push(skb, ETH_HLEN);
>> +		type = ptp_classify_raw(skb);
>> +		skb_pull(skb, ETH_HLEN);
>> +
>> +		hdr = ptp_parse_header(skb, type);
>> +
>> +		if (hdr == NULL) {
>> +			goto dequeue;
>> +		}
>> +
>> +		msg_type = ptp_get_msgtype(hdr, type);
>> +		sequence_id = be16_to_cpu(hdr->sequence_id);
>> +
>> +		timestamp = 0;
>> +
>> +		for (x = 0; x < 10; x++) {
> 
> Ten times? and ...

If we have not matched the timestamp in the PHY fifo by then then a timestamp of 0 is returned. This strikes me as reasonable behaviour, but I can omit returning the packet in the err queue completely if that is the desired behaviour. I will rely on your guidance for the desired behaviour.

> 
>> +			bcm54210pe_read_sop_time_register(private);
>> +			if (bcm54210pe_fetch_timestamp(0, msg_type, sequence_id,
>> +						       private, &timestamp)) {
>> +				break;
>> +			}
>> +
>> +			udelay(private->fib_sequence[x] *
>> +			       private->fib_factor_rx);
> 
> with a cute udelay?  No, use a proper deferral mechanism.

Hehe… The proposed method strikes me as cheap and reasonable in the instance. I am happy to change, but will you propose a satisfactory deferral mechanism in the instance? I just don’t have sufficient kernel experience to identify the best approach and your input would be much appreciated. 

I have read this: https://www.kernel.org/doc/Documentation/timers/timers-howto.txt

I think on the rx side it is unlikely that we get to the udelay as that timestamp “should be”(tm) available on the first iteration. On the corresponding tx side the amount of udelaying is also minimal, but again, I will take the direction you give me.  

>> +static int bcm54210pe_config_1588(struct phy_device *phydev)
>> +{
>> +	int err;
>> +
>> +	err = bcm_phy_write_exp(phydev, PTP_INTERRUPT_REG, 0x3c02 );
>> +
>> +	err |=  bcm_phy_write_exp(phydev, GLOBAL_TIMESYNC_REG, 0x0001); //Enable global timesync register.
>> +	err |=  bcm_phy_write_exp(phydev, EXT_1588_SLICE_REG, 0x0101); //ENABLE TX and RX slice 1588
> 
> Does this device support multiple ports?  If so, the driver should
> support that.

It does not no. It is a single port PHY.

>> 
>> +bool bcm54210pe_rxtstamp(struct mii_timestamper *mii_ts, struct sk_buff *skb, int type)
>> +{
>> +	struct bcm54210pe_private *private = container_of(mii_ts, struct bcm54210pe_private, mii_ts);
>> +
>> +	if (private->hwts_rx_en) {
>> +		skb_queue_tail(&private->rx_skb_queue, skb);
> 
> This is clunky.  The time stamp item may already be in the list.  Code
> should check the list and deliver the skb immediately on match.  Queue
> the skb only when time stamp not ready yet.
> 

No, that strikes me as unlikely both when acting as a source or consumer of PTP. The updating of the list is triggered by an tx or rx frame event. So in order for an rx timestamp to be already available, a tx event must have occurred at more or less the exact same time. This is not impossible, but probably sufficiently improbable to warrant the check you suggest. However, your experience far outweighs mine and I can modify to implement the behaviour you suggest if you insist.

> It appears that time stamps generate an interrupt, no?  If so, then
> why not use the ISR to trigger reading of time stamps?
> 

No, as I indicated in my email we trigger poll style behaviour on the receipt or transmit of 1588 frames. This is not ideal, and I hope to change in the future time permitting, but in the current iteration, that is the behaviour and it performs well.

> See dp83640.c for an example of how to handle asynchronous Rx time stamps.
> 
> Moreover: Does this device provide in-band Rx time stamps?  If so, why
> not use them?

This is the first generation PHY and it does not do in-band RX. I asked BCM and studied the documentation. I’m sure I’m allowed to say, that the second generation 40nm BCM PHY (which - "I am not making this up" is available in 3 versions: BCM54210, BCM54210S and BCM54210SE - not “PE”) - supports in-band rx timestamps. However, as a matter of curiosity, BCM utilise the field in the header now used for minor versioning in 1588-2019, so in due course using this silicon feature will be a significant challenge.

>> +	// Initialisation of work_structs and similar
>> +	INIT_WORK(&bcm54210pe->txts_work, bcm54210pe_run_tx_timestamp_match_thread);
>> +	INIT_WORK(&bcm54210pe->rxts_work, bcm54210pe_run_rx_timestamp_match_thread);
>> +	INIT_DELAYED_WORK(&bcm54210pe->perout_ws, bcm54210pe_run_perout_mode_one_thread);
>> +	INIT_DELAYED_WORK(&bcm54210pe->extts_ws, bcm54210pe_run_extts_thread);
> 
> Don't use generic work.  Instead, implement ptp_clock_info::do_aux_work()

No periodic work is undertaken. Rather rx and tx frames schedule work. With this in mind are you confident your recommendation is sound? If so I might need additional instruction on how to implement.

> 
> That way, you get a kthread that may be given appropriate scheduling priority.

Given the currently scheme if you do not propose to change it, can I give the theads an alternative priority such as it is?

>> 
>> +static u64 ts_to_ns(struct timespec64 *ts)
>> +{
>> +	return ((u64)ts->tv_sec * (u64)1000000000) + ts->tv_nsec;
>> +}
> 
> Use timespec64_to_ns()

Thank you. Noted with thanks and updated.

> 
>> +static void ns_to_ts(u64 time_stamp, struct timespec64 *ts)
>> +{
>> +	ts->tv_sec  = ( (u64)time_stamp / (u64)1000000000 );
>> +	ts->tv_nsec = ( (u64)time_stamp % (u64)1000000000 );
>> +}
> 
> Use ns_to_timespec64()

Also noted with thanks and updated.

>> 
>> +static bool bcm54210pe_rxtstamp(struct mii_timestamper *mii_ts, struct sk_buff *skb, int type);
>> +static void bcm54210pe_txtstamp(struct mii_timestamper *mii_ts, struct sk_buff *skb, int type);
>> +static void bcm54210pe_run_rx_timestamp_match_thread(struct work_struct *w);
>> +static void bcm54210pe_run_tx_timestamp_match_thread(struct work_struct *w);
>> +static void bcm54210pe_read_sop_time_register(struct bcm54210pe_private *private);
>> +static bool bcm54210pe_fetch_timestamp(u8 txrx, u8 message_type, u16 seq_id, struct bcm54210pe_private *private, u64 *timestamp);
>> +
>> +static u16  bcm54210pe_get_base_nco6_reg(struct bcm54210pe_private *private, u16 val, bool do_nse_init);
>> +static int  bcm54210pe_interrupts_enable(struct phy_device *phydev, bool fsync_en, bool sop_en);
>> +static int  bcm54210pe_gettimex(struct ptp_clock_info *info, struct timespec64 *ts, struct ptp_system_timestamp *sts);
>> +static int  bcm54210pe_get80bittime(struct bcm54210pe_private *private, struct timespec64 *ts, struct ptp_system_timestamp *sts);
>> +static int  bcm54210pe_get48bittime(struct bcm54210pe_private *private, u64 *time_stamp);
>> +static void bcm54210pe_read80bittime_register(struct phy_device *phydev, u64 *time_stamp_80, u64 *time_stamp_48);
>> +static void bcm54210pe_read48bittime_register(struct phy_device *phydev, u64 *time_stamp);
>> +
>> +static int  bcm54210pe_perout_enable(struct bcm54210pe_private *private, s64 period, s64 pulsewidth, int on);
>> +static void bcm54210pe_run_perout_mode_one_thread(struct work_struct *perout_ws);
>> +
>> +static int  bcm54210pe_extts_enable(struct bcm54210pe_private *private, int enable);
>> +static void bcm54210pe_run_extts_thread(struct work_struct *extts_ws);
>> +static void bcm54210pe_trigger_extts_event(struct bcm54210pe_private *private, u64 timestamp);
>> +
>> +static u64  convert_48bit_to_80bit(u64 second_on_set, u64 ts);
>> +static u64  four_u16_to_ns(u16 *four_u16);
>> +static u64  ts_to_ns(struct timespec64 *ts);
>> +static void ns_to_ts(u64 time_stamp, struct timespec64 *ts);
> 
> Never put "static" prototypes into a header file.
> 
> Avoid static prototypes altogether.  Instead, order the functions
> within the source file so that sub-functions appear before their call
> sites.

Noted with thanks and updated.




[-- Attachment #2: bcm54210pe-1588.v2.patch.txt --]
[-- Type: text/plain, Size: 48668 bytes --]

From 3fcbbac9fe85909877f15d95f7a1e64dd6d06ab7 Mon Sep 17 00:00:00 2001
From: Lasse Johnsen <l@ssejohnsen.me>
Date: Sat, 5 Feb 2022 09:34:19 -0500
Subject: [PATCH] Added support for IEEE1588 timestamping for the BCM54210PE
 PHY using the kernel mii_timestamper interface

---
 arch/arm/configs/bcm2711_defconfig            |    1 +
 arch/arm64/configs/bcm2711_defconfig          |    1 +
 .../net/ethernet/broadcom/genet/bcmgenet.c    |    8 +-
 drivers/net/phy/Makefile                      |    1 +
 drivers/net/phy/bcm54210pe_ptp.c              | 1382 +++++++++++++++++
 drivers/net/phy/bcm54210pe_ptp.h              |   85 +
 drivers/net/phy/broadcom.c                    |   21 +-
 drivers/ptp/Kconfig                           |   17 +
 8 files changed, 1514 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/phy/bcm54210pe_ptp.c
 create mode 100644 drivers/net/phy/bcm54210pe_ptp.h

diff --git a/arch/arm/configs/bcm2711_defconfig b/arch/arm/configs/bcm2711_defconfig
index 8f4ae82cade4f..c7f3cce0024b7 100644
--- a/arch/arm/configs/bcm2711_defconfig
+++ b/arch/arm/configs/bcm2711_defconfig
@@ -1402,6 +1402,7 @@ CONFIG_MAXIM_THERMOCOUPLE=m
 CONFIG_MAX31856=m
 CONFIG_PWM_BCM2835=m
 CONFIG_PWM_PCA9685=m
+CONFIG_GENERIC_PHY=y
 CONFIG_RPI_AXIPERF=m
 CONFIG_NVMEM_RMEM=m
 CONFIG_EXT4_FS=y
diff --git a/arch/arm64/configs/bcm2711_defconfig b/arch/arm64/configs/bcm2711_defconfig
index 75333e69ef741..2af6b2fc5dcda 100644
--- a/arch/arm64/configs/bcm2711_defconfig
+++ b/arch/arm64/configs/bcm2711_defconfig
@@ -1409,6 +1409,7 @@ CONFIG_MAXIM_THERMOCOUPLE=m
 CONFIG_MAX31856=m
 CONFIG_PWM_BCM2835=m
 CONFIG_PWM_PCA9685=m
+CONFIG_GENERIC_PHY=y
 CONFIG_RPI_AXIPERF=m
 CONFIG_ANDROID=y
 CONFIG_ANDROID_BINDER_IPC=y
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index bd1f419bc47ae..264bcb5b08b24 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -36,6 +36,7 @@
 #include <linux/ipv6.h>
 #include <linux/phy.h>
 #include <linux/platform_data/bcmgenet.h>
+#include <linux/ptp_classify.h>
 
 #include <asm/unaligned.h>
 
@@ -2096,7 +2097,12 @@ static netdev_tx_t bcmgenet_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	GENET_CB(skb)->last_cb = tx_cb_ptr;
-	skb_tx_timestamp(skb);
+
+	// Timestamping
+	if (unlikely(skb_shinfo(skb)->tx_flags & (SKBTX_HW_TSTAMP | SKBTX_SW_TSTAMP)) > 0) {
+		skb_pull(skb, skb_mac_offset(skb));
+		skb_tx_timestamp(skb);
+	}
 
 	/* Decrement total BD count and advance our write pointer */
 	ring->free_bds -= nr_frags + 1;
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index a13e402074cf8..528192d59d793 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_BCM84881_PHY)	+= bcm84881.o
 obj-$(CONFIG_BCM87XX_PHY)	+= bcm87xx.o
 obj-$(CONFIG_BCM_CYGNUS_PHY)	+= bcm-cygnus.o
 obj-$(CONFIG_BCM_NET_PHYLIB)	+= bcm-phy-lib.o
+obj-$(CONFIG_BCM54120PE_PHY)	+= bcm54210pe_ptp.o
 obj-$(CONFIG_BROADCOM_PHY)	+= broadcom.o
 obj-$(CONFIG_CICADA_PHY)	+= cicada.o
 obj-$(CONFIG_CORTINA_PHY)	+= cortina.o
diff --git a/drivers/net/phy/bcm54210pe_ptp.c b/drivers/net/phy/bcm54210pe_ptp.c
new file mode 100644
index 0000000000000..d35f98d1403c7
--- /dev/null
+++ b/drivers/net/phy/bcm54210pe_ptp.c
@@ -0,0 +1,1382 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ *
+ * IEEE1588 (PTP), perout and extts for BCM54210PE PHY
+ *
+ * Authors: Carlos Fernandez, Kyle Judd, Lasse Johnsen
+ * License: GPL
+ */
+
+#include <linux/brcmphy.h>
+#include <linux/delay.h>
+#include <linux/gpio.h>
+#include <linux/gpio/consumer.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/irq.h>
+#include <linux/mii.h>
+#include <linux/net_tstamp.h>
+#include <linux/phy.h>
+#include <linux/ptp_classify.h>
+#include <linux/ptp_clock_kernel.h>
+#include <linux/sched.h>
+#include <linux/udp.h>
+#include <linux/workqueue.h>
+
+#include <asm/unaligned.h>
+
+#include "bcm54210pe_ptp.h"
+#include "bcm-phy-lib.h"
+
+MODULE_DESCRIPTION("Broadcom BCM54210PE PHY driver");
+MODULE_AUTHOR("Lasse L. Johnsen");
+MODULE_LICENSE("GPL");
+
+#define PTP_CONTROL_OFFSET		32
+#define PTP_TSMT_OFFSET			0
+#define PTP_SEQUENCE_ID_OFFSET	30
+#define PTP_CLOCK_ID_OFFSET		20
+#define PTP_CLOCK_ID_SIZE		8
+#define PTP_SEQUENCE_PORT_NUMER_OFFSET  (PTP_CLOCK_ID_OFFSET + PTP_CLOCK_ID_SIZE)
+
+#define EXT_ENABLE_REG1			0x17
+#define EXT_ENABLE_DATA1		0x0F7E
+#define EXT_ENABLE_REG2			0x15
+#define EXT_ENABLE_DATA2		0x0000
+
+#define EXT_1588_SLICE_REG		0x0810
+#define EXT_1588_SLICE_DATA		0x0101
+
+#define ORIGINAL_TIME_CODE_0	0x0854
+#define ORIGINAL_TIME_CODE_1	0x0855
+#define ORIGINAL_TIME_CODE_2	0x0856
+#define ORIGINAL_TIME_CODE_3	0x0857
+#define ORIGINAL_TIME_CODE_4	0x0858
+
+#define TIME_STAMP_REG_0		0x0889
+#define TIME_STAMP_REG_1		0x088A
+#define TIME_STAMP_REG_2		0x088B
+#define TIME_STAMP_REG_3		0x08C4
+#define TIME_STAMP_INFO_1		0x088C
+#define TIME_STAMP_INFO_2		0x088D
+#define INTERRUPT_STATUS_REG	0x085F
+#define INTERRUPT_MASK_REG		0x085E
+#define EXT_SOFTWARE_RESET		0x0F70
+#define EXT_RESET1				0x0001 //RESET
+#define EXT_RESET2				0x0000 //NORMAL OPERATION
+#define GLOBAL_TIMESYNC_REG		0x0FF5
+
+#define TX_EVENT_MODE_REG		0x0811
+#define RX_EVENT_MODE_REG		0x0819
+#define TX_TSCAPTURE_ENABLE_REG	0x0821
+#define RX_TSCAPTURE_ENABLE_REG	0x0822
+#define TXRX_1588_OPTION_REG	0x0823
+
+#define TX_TS_OFFSET_LSB		0x0834
+#define TX_TS_OFFSET_MSB		0x0835
+#define RX_TS_OFFSET_LSB		0x0844
+#define RX_TS_OFFSET_MSB		0x0845
+#define NSE_DPPL_NCO_1_LSB_REG	0x0873
+#define NSE_DPPL_NCO_1_MSB_REG	0x0874
+
+#define NSE_DPPL_NCO_2_0_REG	0x0875
+#define NSE_DPPL_NCO_2_1_REG	0x0876
+#define NSE_DPPL_NCO_2_2_REG	0x0877
+
+#define NSE_DPPL_NCO_3_0_REG	0x0878
+#define NSE_DPPL_NCO_3_1_REG	0x0879
+#define NSE_DPPL_NCO_3_2_REG	0x087A
+
+#define NSE_DPPL_NCO_4_REG		0x087B
+
+#define NSE_DPPL_NCO_5_0_REG	0x087C
+#define NSE_DPPL_NCO_5_1_REG	0x087D
+#define NSE_DPPL_NCO_5_2_REG	0x087E
+
+#define NSE_DPPL_NCO_6_REG		0x087F
+
+#define NSE_DPPL_NCO_7_0_REG	0x0880
+#define NSE_DPPL_NCO_7_1_REG	0x0881
+
+#define DPLL_SELECT_REG			0x085b
+#define TIMECODE_SEL_REG		0x08C3
+#define SHADOW_REG_CONTROL		0x085C
+#define SHADOW_REG_LOAD			0x085D
+
+#define PTP_INTERRUPT_REG		0x0D0C
+
+#define CTR_DBG_REG				0x088E
+#define HEART_BEAT_REG4			0x08ED
+#define HEART_BEAT_REG3			0x08EC
+#define HEART_BEAT_REG2			0x0888
+#define	HEART_BEAT_REG1			0x0887
+#define	HEART_BEAT_REG0			0x0886
+
+#define READ_END_REG			0x0885
+
+static u64 convert_48bit_to_80bit(u64 second_on_set, u64 ts)
+{
+	return (second_on_set * 1000000000) + ts;
+}
+
+static u64 four_u16_to_ns(u16 *four_u16)
+{
+	u32 seconds;
+	u32 nanoseconds;
+	struct timespec64 ts;
+	u16 *ptr;
+
+	nanoseconds = 0;
+	seconds = 0;
+
+	ptr = (u16 *)&nanoseconds;
+	*ptr = four_u16[0]; ptr++; *ptr = four_u16[1];
+
+	ptr = (u16 *)&seconds;
+	*ptr = four_u16[2]; ptr++; *ptr = four_u16[3];
+
+	ts.tv_sec = seconds;
+	ts.tv_nsec = nanoseconds;
+
+	return timespec64_to_ns(&ts);
+}
+
+static int bcm54210pe_interrupts_enable(struct phy_device *phydev, bool fsync_en, bool sop_en)
+{
+	u16 interrupt_mask;
+
+	interrupt_mask = 0;
+
+	if (fsync_en)
+		interrupt_mask |= 0x0001;
+
+	if (sop_en)
+		interrupt_mask |= 0x0002;
+
+	return bcm_phy_write_exp(phydev, INTERRUPT_MASK_REG, interrupt_mask);
+}
+
+static bool bcm54210pe_fetch_timestamp(u8 txrx, u8 message_type, u16 seq_id,
+				       struct bcm54210pe_private *private, u64 *timestamp)
+{
+	struct bcm54210pe_circular_buffer_item *item;
+	struct list_head *this, *next;
+
+	u8 index = (txrx * 4) + message_type;
+
+	if (index >= CIRCULAR_BUFFER_COUNT)
+		return false;
+
+	list_for_each_safe(this, next, &private->circular_buffers[index]) {
+		item = list_entry(this, struct bcm54210pe_circular_buffer_item, list);
+
+		if (item->sequence_id == seq_id && item->is_valid) {
+			item->is_valid = false;
+			*timestamp = item->time_stamp;
+			mutex_unlock(&private->timestamp_buffer_lock);
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static u16 bcm54210pe_get_base_nco6_reg(struct bcm54210pe_private *private,
+					u16 val, bool do_nse_init)
+{
+	// Set Global mode to CPU system
+	val |= 0xC000;
+
+	// NSE init
+	if (do_nse_init)
+		val |= 0x1000;
+
+	if (private->extts_en)
+		val |= 0x2004;
+
+	if (private->perout_en) {
+		if (private->perout_mode == SYNC_OUT_MODE_1)
+			val |= 0x0001;
+		else if (private->perout_mode == SYNC_OUT_MODE_2)
+			val |= 0x0002;
+	}
+
+	return val;
+}
+
+static void bcm54210pe_read_sop_time_register(struct bcm54210pe_private *private)
+{
+	struct phy_device *phydev = private->phydev;
+	struct bcm54210pe_circular_buffer_item *item;
+	u16 fifo_info_1, fifo_info_2;
+	u8 tx_or_rx, msg_type, index;
+	u16 sequence_id;
+	u64 timestamp;
+	u16 time[4];
+	int deadlock_check;
+
+	deadlock_check = 0;
+
+	mutex_lock(&private->timestamp_buffer_lock);
+
+	while (bcm_phy_read_exp(phydev, INTERRUPT_STATUS_REG) & INTC_SOP) {
+		mutex_lock(&private->clock_lock);
+
+		// Flush out the FIFO
+		bcm_phy_write_exp(phydev, READ_END_REG, 1);
+
+		time[3] = bcm_phy_read_exp(phydev, TIME_STAMP_REG_3);
+		time[2] = bcm_phy_read_exp(phydev, TIME_STAMP_REG_2);
+		time[1] = bcm_phy_read_exp(phydev, TIME_STAMP_REG_1);
+		time[0] = bcm_phy_read_exp(phydev, TIME_STAMP_REG_0);
+
+		fifo_info_1 = bcm_phy_read_exp(phydev, TIME_STAMP_INFO_1);
+		fifo_info_2 = bcm_phy_read_exp(phydev, TIME_STAMP_INFO_2);
+
+		bcm_phy_write_exp(phydev, READ_END_REG, 2);
+		bcm_phy_write_exp(phydev, READ_END_REG, 0);
+
+		mutex_unlock(&private->clock_lock);
+
+		msg_type = (u8)((fifo_info_2 & 0xF000) >> 12);
+		tx_or_rx = (u8)((fifo_info_2 & 0x0800) >> 11); // 1 = TX, 0 = RX
+		sequence_id = fifo_info_1;
+
+		timestamp = four_u16_to_ns(time);
+
+		index = (tx_or_rx * 4) + msg_type;
+
+		if (index < CIRCULAR_BUFFER_COUNT)
+			item = list_first_entry_or_null(&private->circular_buffers[index],
+							struct bcm54210pe_circular_buffer_item,
+							list);
+
+		if (!item)
+			continue;
+
+		list_del_init(&item->list);
+
+		item->msg_type = msg_type;
+		item->sequence_id = sequence_id;
+		item->time_stamp = timestamp;
+		item->is_valid = true;
+
+		list_add_tail(&item->list, &private->circular_buffers[index]);
+
+		deadlock_check++;
+		if (deadlock_check > 100)
+			break;
+	}
+
+	mutex_unlock(&private->timestamp_buffer_lock);
+}
+
+static void bcm54210pe_run_rx_timestamp_match_thread(struct work_struct *w)
+{
+	struct bcm54210pe_private *private =
+		container_of(w, struct bcm54210pe_private, rxts_work);
+
+	struct skb_shared_hwtstamps *shhwtstamps;
+	struct ptp_header *hdr;
+	struct sk_buff *skb;
+
+	u8 msg_type;
+	u16 sequence_id;
+	u64 timestamp;
+	int x, type;
+
+	skb = skb_dequeue(&private->rx_skb_queue);
+
+	while (skb) {
+		// Yes....  skb_defer_rx_timestamp just did this but <ZZZzzz>....
+		skb_push(skb, ETH_HLEN);
+		type = ptp_classify_raw(skb);
+		skb_pull(skb, ETH_HLEN);
+
+		hdr = ptp_parse_header(skb, type);
+
+		if (!hdr)
+			goto dequeue;
+
+		msg_type = ptp_get_msgtype(hdr, type);
+		sequence_id = be16_to_cpu(hdr->sequence_id);
+
+		timestamp = 0;
+
+		for (x = 0; x < 10; x++) {
+			bcm54210pe_read_sop_time_register(private);
+			if (bcm54210pe_fetch_timestamp(0, msg_type, sequence_id,
+						       private, &timestamp)) {
+				break;
+			}
+
+			udelay(private->fib_sequence[x] *
+			       private->fib_factor_rx);
+		}
+
+		shhwtstamps = skb_hwtstamps(skb);
+
+		if (!shhwtstamps)
+			goto dequeue;
+
+		memset(shhwtstamps, 0, sizeof(*shhwtstamps));
+		shhwtstamps->hwtstamp = ns_to_ktime(timestamp);
+
+dequeue:
+		netif_rx_ni(skb);
+		skb = skb_dequeue(&private->rx_skb_queue);
+	}
+}
+
+static void bcm54210pe_run_tx_timestamp_match_thread(struct work_struct *w)
+{
+	struct bcm54210pe_private *private =
+		container_of(w, struct bcm54210pe_private, txts_work);
+
+	struct skb_shared_hwtstamps *shhwtstamps;
+	struct sk_buff *skb;
+
+	struct ptp_header *hdr;
+	u8 msg_type;
+	u16 sequence_id;
+	u64 timestamp;
+	int x, type;
+
+	timestamp = 0;
+	skb = skb_dequeue(&private->tx_skb_queue);
+
+	while (skb) {
+		type = ptp_classify_raw(skb);
+		hdr = ptp_parse_header(skb, type);
+
+		if (!hdr)
+			goto dequeue;
+
+		msg_type = ptp_get_msgtype(hdr, type);
+		sequence_id = be16_to_cpu(hdr->sequence_id);
+
+		for (x = 0; x < 10; x++) {
+			bcm54210pe_read_sop_time_register(private);
+			if (bcm54210pe_fetch_timestamp(1, msg_type, sequence_id,
+						       private, &timestamp)) {
+				break;
+			}
+			udelay(private->fib_sequence[x] * private->fib_factor_tx);
+		}
+		shhwtstamps = skb_hwtstamps(skb);
+
+		if (!shhwtstamps) {
+			kfree_skb(skb);
+			goto dequeue;
+		}
+
+		memset(shhwtstamps, 0, sizeof(*shhwtstamps));
+		shhwtstamps->hwtstamp = ns_to_ktime(timestamp);
+
+		skb_complete_tx_timestamp(skb, shhwtstamps);
+
+dequeue:
+		skb = skb_dequeue(&private->tx_skb_queue);
+	}
+}
+
+static int bcm54210pe_config_1588(struct phy_device *phydev)
+{
+	int err;
+
+	err = bcm_phy_write_exp(phydev, PTP_INTERRUPT_REG, 0x3c02);
+
+	//Enable global timesync register
+	err |=  bcm_phy_write_exp(phydev, GLOBAL_TIMESYNC_REG, 0x0001);
+
+	//ENABLE TX and RX slice 1588
+	err |=  bcm_phy_write_exp(phydev, EXT_1588_SLICE_REG, 0x0101);
+
+	//Add 80bit timestamp + NO CPU MODE in TX
+	err |=  bcm_phy_write_exp(phydev, TX_EVENT_MODE_REG, 0xFF00);
+
+	//Add 32+32 bits timestamp + NO CPU mode in RX
+	err |=  bcm_phy_write_exp(phydev, RX_EVENT_MODE_REG, 0xFF00);
+
+	//Select 80 bit counter
+	err |=  bcm_phy_write_exp(phydev, TIMECODE_SEL_REG, 0x0101);
+
+	//Enable timestamp capture in TX
+	err |=  bcm_phy_write_exp(phydev, TX_TSCAPTURE_ENABLE_REG, 0x0001);
+
+	//Enable timestamp capture in RX
+	err |=  bcm_phy_write_exp(phydev, RX_TSCAPTURE_ENABLE_REG, 0x0001);
+
+	//Enable shadow register
+	err |= bcm_phy_write_exp(phydev, SHADOW_REG_CONTROL, 0x0000);
+	err |= bcm_phy_write_exp(phydev, SHADOW_REG_LOAD, 0x07c0);
+
+	// Set global mode and trigger immediate framesync to load shaddow registers
+	err |=  bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, 0xC020);
+
+	// Enable Interrupt behaviour (eventhough we get no interrupts)
+	err |= bcm54210pe_interrupts_enable(phydev, true, false);
+
+	return err;
+}
+
+// Must be called under clock_lock
+static void bcm54210pe_trigger_extts_event(struct bcm54210pe_private *private, u64 timestamp)
+{
+	struct ptp_clock_event event;
+	struct timespec64 ts;
+
+	event.type = PTP_CLOCK_EXTTS;
+	event.timestamp = convert_48bit_to_80bit(private->second_on_set, timestamp);
+	event.index = 0;
+
+	ptp_clock_event(private->ptp->ptp_clock, &event);
+
+	private->last_extts_ts = timestamp;
+
+	ts = ns_to_timespec64(timestamp);
+}
+
+// Must be called under clock_lock
+static void bcm54210pe_read80bittime_register(struct phy_device *phydev,
+					      u64 *time_stamp_80, u64 *time_stamp_48)
+{
+	u16 time[5];
+	u64 ts[3];
+	u64 cumulative;
+
+	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x400);
+	time[4] = bcm_phy_read_exp(phydev, HEART_BEAT_REG4);
+	time[3] = bcm_phy_read_exp(phydev, HEART_BEAT_REG3);
+	time[2] = bcm_phy_read_exp(phydev, HEART_BEAT_REG2);
+	time[1] = bcm_phy_read_exp(phydev, HEART_BEAT_REG1);
+	time[0] = bcm_phy_read_exp(phydev, HEART_BEAT_REG0);
+
+	// Set read end bit
+	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x800);
+	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x000);
+
+	*time_stamp_80 = four_u16_to_ns(time);
+
+	if (time_stamp_48) {
+		ts[2] = (((u64)time[2]) << 32);
+		ts[1] = (((u64)time[1]) << 16);
+		ts[0] = ((u64)time[0]);
+
+		cumulative = 0;
+		cumulative |= ts[0];
+		cumulative |= ts[1];
+		cumulative |= ts[2];
+
+		*time_stamp_48 = cumulative;
+	}
+}
+
+// Must be called under clock_lock
+static void bcm54210pe_read48bittime_register(struct phy_device *phydev, u64 *time_stamp)
+{
+	u16 time[3];
+	u64 ts[3];
+	u64 cumulative;
+
+	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x400);
+	time[2] = bcm_phy_read_exp(phydev, HEART_BEAT_REG2);
+	time[1] = bcm_phy_read_exp(phydev, HEART_BEAT_REG1);
+	time[0] = bcm_phy_read_exp(phydev, HEART_BEAT_REG0);
+
+	// Set read end bit
+	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x800);
+	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x000);
+
+	ts[2] = (((u64)time[2]) << 32);
+	ts[1] = (((u64)time[1]) << 16);
+	ts[0] = ((u64)time[0]);
+
+	cumulative = 0;
+	cumulative |= ts[0];
+	cumulative |= ts[1];
+	cumulative |= ts[2];
+
+	*time_stamp = cumulative;
+}
+
+static int bcm54210pe_get80bittime(struct bcm54210pe_private *private,
+				   struct timespec64 *ts,
+				   struct ptp_system_timestamp *sts)
+{
+	struct phy_device *phydev;
+	u16 nco_6_register_value;
+	int i;
+	u64 time_stamp_48, time_stamp_80, control_ts;
+
+	phydev = private->phydev;
+
+	// Capture timestamp on next framesync
+	nco_6_register_value = 0x2000;
+
+	// Lock
+	mutex_lock(&private->clock_lock);
+
+	// We share frame sync events with extts, so we need to ensure no event
+	// has occurred as we are about to boot the registers, so....
+
+	// If extts is enabled
+	if (private->extts_en) {
+		// Halt framesyncs generated by the sync in pin
+		bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0000);
+
+		// Read what's in the 8- bit register
+		bcm54210pe_read48bittime_register(phydev, &control_ts);
+
+		// If it matches neither the last gettime or extts timestamp
+		if (control_ts != private->last_extts_ts &&
+		    control_ts != private->last_immediate_ts[0]) {
+			// Odds are this is a extts not yet logged as an event
+			//printk("extts triggered by get80bittime\n");
+			bcm54210pe_trigger_extts_event(private, control_ts);
+		}
+	}
+
+	// Heartbeat register selection. Latch 80 bit Original time counter
+	// into Heartbeat register (this is undocumented)
+	bcm_phy_write_exp(phydev, DPLL_SELECT_REG, 0x0040);
+
+	// Amend to base register
+	nco_6_register_value = bcm54210pe_get_base_nco6_reg(private, nco_6_register_value, false);
+
+	// Set the NCO register
+	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
+
+	// Trigger framesync
+	if (sts) {
+		// If we are doing a gettimex call
+		ptp_read_system_prets(sts);
+		bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
+		ptp_read_system_postts(sts);
+
+	} else {
+		// or if we are doing a gettime call
+		bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
+	}
+
+	for (i = 0; i < 5; i++) {
+		bcm54210pe_read80bittime_register(phydev, &time_stamp_80, &time_stamp_48);
+
+		if (time_stamp_80 != 0)
+			break;
+	}
+
+	// Convert to timespec64
+	*ts = ns_to_timespec64(time_stamp_80);
+
+	// If we are using extts
+	if (private->extts_en) {
+		// Commit last timestamp
+		private->last_immediate_ts[0] = time_stamp_48;
+		private->last_immediate_ts[1] = time_stamp_80;
+
+		// Heartbeat register selection. Latch 48 bit Original time counter
+		// into Heartbeat register (this is undocumented)
+		bcm_phy_write_exp(phydev, DPLL_SELECT_REG, 0x0000);
+
+		// Rearm framesync for sync in pin
+		nco_6_register_value = bcm54210pe_get_base_nco6_reg(private,
+								    nco_6_register_value, false);
+		bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
+	}
+
+	mutex_unlock(&private->clock_lock);
+
+	return 0;
+}
+
+static int bcm54210pe_gettimex(struct ptp_clock_info *info,
+			       struct timespec64 *ts,
+			       struct ptp_system_timestamp *sts)
+{
+	struct bcm54210pe_ptp *ptp;
+
+	ptp = container_of(info, struct bcm54210pe_ptp, caps);
+
+	return bcm54210pe_get80bittime(ptp->chosen, ts, sts);
+}
+
+static int bcm54210pe_gettime(struct ptp_clock_info *info, struct timespec64 *ts)
+{
+	return bcm54210pe_gettimex(info, ts, NULL);
+}
+
+static int bcm54210pe_get48bittime(struct bcm54210pe_private *private, u64 *timestamp)
+{
+	u16 nco_6_register_value;
+	int i, err;
+
+	struct phy_device *phydev = private->phydev;
+
+	// Capture timestamp on next framesync
+	nco_6_register_value = 0x2000;
+
+	mutex_lock(&private->clock_lock);
+
+	// Heartbeat register selection. Latch 48 bit Original time counter
+	// into Heartbeat register (this is undocumented)
+	err = bcm_phy_write_exp(phydev, DPLL_SELECT_REG, 0x0000);
+
+	// Amend to base register
+	nco_6_register_value =
+		bcm54210pe_get_base_nco6_reg(private, nco_6_register_value, false);
+
+	// Set the NCO register
+	err |= bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
+
+	// Trigger framesync
+	err |= bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
+
+	for (i = 0; i < 5; i++) {
+		bcm54210pe_read48bittime_register(phydev, timestamp);
+
+		if (*timestamp != 0)
+			break;
+	}
+
+	mutex_unlock(&private->clock_lock);
+
+	return err;
+}
+
+static int bcm54210pe_settime(struct ptp_clock_info *info, const struct timespec64 *ts)
+{
+	u16 shadow_load_register, nco_6_register_value;
+	u16 original_time_codes[5], local_time_codes[3];
+	struct bcm54210pe_ptp *ptp;
+	struct phy_device *phydev;
+
+	ptp = container_of(info, struct bcm54210pe_ptp, caps);
+	phydev = ptp->chosen->phydev;
+
+	shadow_load_register = 0;
+	nco_6_register_value = 0;
+
+	// Assign original time codes (80 bit)
+	original_time_codes[4] = (u16)((ts->tv_sec & 0x0000FFFF00000000) >> 32);
+	original_time_codes[3] = (u16)((ts->tv_sec  & 0x00000000FFFF0000) >> 16);
+	original_time_codes[2] = (u16)(ts->tv_sec  & 0x000000000000FFFF);
+	original_time_codes[1] = (u16)((ts->tv_nsec & 0x00000000FFFF0000) >> 16);
+	original_time_codes[0] = (u16)(ts->tv_nsec & 0x000000000000FFFF);
+
+	// Assign original time codes (48 bit)
+	local_time_codes[2] = 0x4000;
+	local_time_codes[1] = (u16)(ts->tv_nsec >> 20);
+	local_time_codes[0] = (u16)(ts->tv_nsec >> 4);
+
+	// Set Time Code load bit in the shadow load register
+	shadow_load_register |= 0x0400;
+
+	// Set Local Time load bit in the shadow load register
+	shadow_load_register |= 0x0080;
+
+	mutex_lock(&ptp->chosen->clock_lock);
+
+	// Write Original Time Code Register
+	bcm_phy_write_exp(phydev, ORIGINAL_TIME_CODE_0, original_time_codes[0]);
+	bcm_phy_write_exp(phydev, ORIGINAL_TIME_CODE_1, original_time_codes[1]);
+	bcm_phy_write_exp(phydev, ORIGINAL_TIME_CODE_2, original_time_codes[2]);
+	bcm_phy_write_exp(phydev, ORIGINAL_TIME_CODE_3, original_time_codes[3]);
+	bcm_phy_write_exp(phydev, ORIGINAL_TIME_CODE_4, original_time_codes[4]);
+
+	// Write Local Time Code Register
+	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_2_0_REG, local_time_codes[0]);
+	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_2_1_REG, local_time_codes[1]);
+	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_2_2_REG, local_time_codes[2]);
+
+	// Write Shadow register
+	bcm_phy_write_exp(phydev, SHADOW_REG_CONTROL, 0x0000);
+	bcm_phy_write_exp(phydev, SHADOW_REG_LOAD, shadow_load_register);
+
+	// Set global mode and nse_init
+	nco_6_register_value = bcm54210pe_get_base_nco6_reg(ptp->chosen,
+							    nco_6_register_value, true);
+
+	// Write to register
+	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
+
+	// Trigger framesync
+	bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
+
+	// Set the second on set
+	ptp->chosen->second_on_set = ts->tv_sec;
+
+	mutex_unlock(&ptp->chosen->clock_lock);
+
+	return 0;
+}
+
+static int bcm54210pe_adjfine(struct ptp_clock_info *info, long scaled_ppm)
+{
+	int err;
+	u16 lo, hi;
+	u32 corrected_8ns_interval, base_8ns_interval;
+	bool negative;
+
+	struct bcm54210pe_ptp *ptp = container_of(info, struct bcm54210pe_ptp, caps);
+	struct phy_device *phydev = ptp->chosen->phydev;
+
+	negative = false;
+	if (scaled_ppm < 0) {
+		negative = true;
+		scaled_ppm = -scaled_ppm;
+	}
+
+	// This is not completely accurate but very fast
+	scaled_ppm >>= 7;
+
+	// Nominal counter increment is 8ns
+	base_8ns_interval = 1 << 31;
+
+	// Add or subtract differential
+	if (negative)
+		corrected_8ns_interval = base_8ns_interval - scaled_ppm;
+	else
+		corrected_8ns_interval = base_8ns_interval + scaled_ppm;
+
+	// Load up registers
+	hi = (corrected_8ns_interval & 0xFFFF0000) >> 16;
+	lo = (corrected_8ns_interval & 0x0000FFFF);
+
+	mutex_lock(&ptp->chosen->clock_lock);
+
+	// Set freq_mdio_sel to 1
+	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_2_2_REG, 0x4000);
+
+	// Load 125MHz frequency reqcntrl
+	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_1_MSB_REG, hi);
+	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_1_LSB_REG, lo);
+
+	// On next framesync load freq from freqcntrl
+	bcm_phy_write_exp(phydev, SHADOW_REG_LOAD, 0x0040);
+
+	// Trigger framesync
+	err = bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
+
+	mutex_unlock(&ptp->chosen->clock_lock);
+
+	return err;
+}
+
+static int bcm54210pe_adjtime(struct ptp_clock_info *info, s64 delta)
+{
+	int err;
+	struct timespec64 ts;
+	u64 now;
+
+	err = bcm54210pe_gettime(info, &ts);
+	if (err < 0)
+		return err;
+
+	now = ktime_to_ns(timespec64_to_ktime(ts));
+	ts = ns_to_timespec64(now + delta);
+
+	err = bcm54210pe_settime(info, &ts);
+
+	return err;
+}
+
+static int bcm54210pe_extts_enable(struct bcm54210pe_private *private, int enable)
+{
+	int err;
+	struct phy_device *phydev;
+	u16 nco_6_register_value;
+
+	phydev = private->phydev;
+
+	if (enable) {
+		if (!private->extts_en) {
+			// Set enable per_out
+			private->extts_en = true;
+			err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_4_REG, 0x0001);
+
+			nco_6_register_value = 0;
+			nco_6_register_value = bcm54210pe_get_base_nco6_reg(private,
+									    nco_6_register_value,
+									    false);
+
+			err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_7_0_REG, 0x0100);
+			err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_7_1_REG, 0x0200);
+			err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
+
+			schedule_delayed_work(&private->extts_ws, msecs_to_jiffies(1));
+		}
+
+	} else {
+		private->extts_en = false;
+		err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_4_REG, 0x0000);
+	}
+
+	return err;
+}
+
+static void bcm54210pe_run_extts_thread(struct work_struct *extts_ws)
+{
+	struct bcm54210pe_private *private;
+	struct phy_device *phydev;
+	u64 interval, time_stamp_48, time_stamp_80;
+
+	private = container_of((struct delayed_work *)extts_ws,
+			       struct bcm54210pe_private, extts_ws);
+	phydev = private->phydev;
+
+	interval = 10;	// in ms - long after we are gone from this earth, discussions will be had
+			// and songs will be sung about whether this interval is short enough....
+			// Before you complain let me say that in Timebeat.app up to ~150ms allows
+			// single digit ns servo accuracy. If your client / servo is not as cool:
+			// Do better :-)
+
+	mutex_lock(&private->clock_lock);
+
+	bcm54210pe_read80bittime_register(phydev, &time_stamp_80, &time_stamp_48);
+
+	if (private->last_extts_ts != time_stamp_48 &&
+	    private->last_immediate_ts[0] != time_stamp_48 &&
+	    private->last_immediate_ts[1] != time_stamp_80) {
+		bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, 0xE000);
+		bcm54210pe_trigger_extts_event(private, time_stamp_48);
+		bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, 0xE004);
+	}
+
+	mutex_unlock(&private->clock_lock);
+
+	// Do we need to reschedule
+	if (private->extts_en)
+		schedule_delayed_work(&private->extts_ws, msecs_to_jiffies(interval));
+}
+
+static int bcm54210pe_perout_enable(struct bcm54210pe_private *private, s64 period,
+				    s64 pulsewidth, int enable)
+{
+	struct phy_device *phydev;
+	u16 nco_6_register_value, frequency_hi, frequency_lo,
+		pulsewidth_reg, pulse_start_hi, pulse_start_lo;
+	int err;
+
+	phydev = private->phydev;
+
+	if (enable) {
+		frequency_hi = 0;
+		frequency_lo = 0;
+		pulsewidth_reg = 0;
+		pulse_start_hi = 0;
+		pulse_start_lo = 0;
+
+		// Convert interval pulse spacing (period) and pulsewidth to 8 ns units
+		period /= 8;
+		pulsewidth /= 8;
+
+		// Mode 2 only: If pulsewidth is not explicitly set with PTP_PEROUT_DUTY_CYCLE
+		if (pulsewidth == 0) {
+			if (period < 2500) {
+				// At a frequency at less than 20us (2500 x 8ns) set
+				// pulse length to 1/10th of the interval pulse spacing
+				pulsewidth = period / 10;
+
+				// Where the interval pulse spacing is short,
+				// ensure we set a pulse length of 8ns
+				if (pulsewidth == 0)
+					pulsewidth = 1;
+
+			} else {
+				// Otherwise set pulse with to 4us (8ns x 500 = 4us)
+				pulsewidth = 500;
+			}
+		}
+
+		if (private->perout_mode == SYNC_OUT_MODE_1) {
+			// Set period
+			private->perout_period = period;
+
+			if (!private->perout_en) {
+				// Set enable per_out
+				private->perout_en = true;
+				schedule_delayed_work(&private->perout_ws, msecs_to_jiffies(1));
+			}
+
+			err = 0;
+
+		} else if (private->perout_mode == SYNC_OUT_MODE_2) {
+			// Set enable per_out
+			private->perout_en = true;
+
+			// Calculate registers
+
+			// Lowest 16 bits of 8ns interval pulse spacing [15:0]
+			frequency_lo	= (u16)period;
+
+			// Highest 14 bits of 8ns interval pulse spacing [29:16]
+			frequency_hi	= (u16)(0x3FFF & (period >> 16));
+
+			// 2 lowest bits of 8ns pulse length [1:0]
+			frequency_hi   |= (u16)pulsewidth << 14;
+
+			// 7 highest bit  of 8 ns pulse length [8:2]
+			pulsewidth_reg	= (u16)(0x7F & (pulsewidth >> 2));
+
+			// Get base value
+			nco_6_register_value = bcm54210pe_get_base_nco6_reg(private,
+									    nco_6_register_value,
+									    true);
+
+			mutex_lock(&private->clock_lock);
+
+			// Write to register
+			err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG,
+						nco_6_register_value);
+
+			// Set sync out pulse interval spacing and pulse length
+			err |= bcm_phy_write_exp(phydev, NSE_DPPL_NCO_3_0_REG, frequency_lo);
+			err |= bcm_phy_write_exp(phydev, NSE_DPPL_NCO_3_1_REG, frequency_hi);
+			err |= bcm_phy_write_exp(phydev, NSE_DPPL_NCO_3_2_REG, pulsewidth_reg);
+
+			// On next framesync load sync out frequency
+			err |= bcm_phy_write_exp(phydev, SHADOW_REG_LOAD, 0x0200);
+
+			// Trigger immediate framesync
+			err |= bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
+
+			mutex_unlock(&private->clock_lock);
+		}
+	} else {
+		// Set disable pps
+		private->perout_en = false;
+
+		// Get base value
+		nco_6_register_value = bcm54210pe_get_base_nco6_reg(private,
+								    nco_6_register_value,
+								    false);
+
+		mutex_lock(&private->clock_lock);
+
+		// Write to register
+		err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
+
+		mutex_unlock(&private->clock_lock);
+	}
+
+	return err;
+}
+
+static void bcm54210pe_run_perout_mode_one_thread(struct work_struct *perout_ws)
+{
+	struct bcm54210pe_private *private;
+	u64 local_time_stamp_48bits; //, local_time_stamp_80bits;
+	u64 next_event, time_before_next_pulse, period;
+	u16 nco_6_register_value, pulsewidth_nco3_hack;
+	u64 wait_one, wait_two;
+
+	private = container_of((struct delayed_work *)perout_ws,
+			       struct bcm54210pe_private, perout_ws);
+	period = private->perout_period * 8;
+	pulsewidth_nco3_hack = 250; // The BCM chip is broken.
+				    // It does not respect this in sync out mode 1
+
+	nco_6_register_value = 0;
+
+	// Get base value
+	nco_6_register_value = bcm54210pe_get_base_nco6_reg(private, nco_6_register_value, false);
+
+	// Get 48 bit local time
+	bcm54210pe_get48bittime(private, &local_time_stamp_48bits);
+
+	// Calculate time before next event and next event time
+	time_before_next_pulse =  period - (local_time_stamp_48bits % period);
+	next_event = local_time_stamp_48bits + time_before_next_pulse;
+
+	// Lock
+	mutex_lock(&private->clock_lock);
+
+	// Set pulsewidth (test reveal this does not work),
+	// but registers need content or no pulse will exist
+	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_3_1_REG, pulsewidth_nco3_hack << 14);
+	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_3_2_REG, pulsewidth_nco3_hack >> 2);
+
+	// Set sync out pulse interval spacing and pulse length
+	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_0_REG, next_event & 0xFFF0);
+	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_1_REG, next_event >> 16);
+	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_2_REG, next_event >> 32);
+
+	// On next framesync load sync out frequency
+	bcm_phy_write_exp(private->phydev, SHADOW_REG_LOAD, 0x0200);
+
+	// Write to register with mode one set for sync out
+	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value | 0x0001);
+
+	// Trigger immediate framesync
+	bcm_phy_modify_exp(private->phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
+
+	// Unlock
+	mutex_unlock(&private->clock_lock);
+
+	// Wait until 1/10 period after the next pulse
+	wait_one = (time_before_next_pulse / 1000000) + (period / 1000000 / 10);
+	mdelay(wait_one);
+
+	// Lock
+	mutex_lock(&private->clock_lock);
+
+	// Clear pulse by bumping sync_out_match to max (this pulls sync out down)
+	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_0_REG, 0xFFF0);
+	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_1_REG, 0xFFFF);
+	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_2_REG, 0xFFFF);
+
+	// On next framesync load sync out frequency
+	bcm_phy_write_exp(private->phydev, SHADOW_REG_LOAD, 0x0200);
+
+	// Trigger immediate framesync
+	bcm_phy_modify_exp(private->phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
+
+	// Unlock
+	mutex_unlock(&private->clock_lock);
+
+	// Calculate wait before we reschedule the next pulse
+	wait_two = (period / 1000000) - (2 * (period / 10000000));
+
+	// Do we need to reschedule
+	if (private->perout_en)
+		schedule_delayed_work(&private->perout_ws, msecs_to_jiffies(wait_two));
+}
+
+bool bcm54210pe_rxtstamp(struct mii_timestamper *mii_ts, struct sk_buff *skb, int type)
+{
+	struct bcm54210pe_private *private = container_of(mii_ts, struct bcm54210pe_private,
+							  mii_ts);
+
+	if (private->hwts_rx_en) {
+		skb_queue_tail(&private->rx_skb_queue, skb);
+		schedule_work(&private->rxts_work);
+		return true;
+	}
+
+	return false;
+}
+
+void bcm54210pe_txtstamp(struct mii_timestamper *mii_ts, struct sk_buff *skb, int type)
+{
+	struct bcm54210pe_private *private = container_of(mii_ts, struct bcm54210pe_private,
+							  mii_ts);
+
+	switch (private->hwts_tx_en) {
+	case HWTSTAMP_TX_ON:
+	{
+		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+		skb_queue_tail(&private->tx_skb_queue, skb);
+		schedule_work(&private->txts_work);
+		break;
+	}
+
+	case HWTSTAMP_TX_OFF:
+	{
+	}
+
+	default:
+	{
+		kfree_skb(skb);
+		break;
+	}
+	}
+}
+
+int bcm54210pe_ts_info(struct mii_timestamper *mii_ts, struct ethtool_ts_info *info)
+{
+	struct bcm54210pe_private *bcm54210pe = container_of(mii_ts, struct bcm54210pe_private,
+							     mii_ts);
+
+	info->so_timestamping =
+		SOF_TIMESTAMPING_TX_HARDWARE |
+		SOF_TIMESTAMPING_RX_HARDWARE |
+		SOF_TIMESTAMPING_RAW_HARDWARE;
+
+	info->phc_index = ptp_clock_index(bcm54210pe->ptp->ptp_clock);
+	info->tx_types =
+		(1 << HWTSTAMP_TX_OFF) |
+		(1 << HWTSTAMP_TX_ON);
+	info->rx_filters =
+		(1 << HWTSTAMP_FILTER_NONE) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT);
+	return 0;
+}
+
+int bcm54210pe_hwtstamp(struct mii_timestamper *mii_ts, struct ifreq *ifr)
+{
+	struct bcm54210pe_private *device = container_of(mii_ts, struct bcm54210pe_private, mii_ts);
+
+	struct hwtstamp_config cfg;
+
+	if (copy_from_user(&cfg, ifr->ifr_data, sizeof(cfg)))
+		return -EFAULT;
+
+	if (cfg.flags) /* reserved for future extensions */
+		return -EINVAL;
+
+	if (cfg.tx_type < 0 || cfg.tx_type > HWTSTAMP_TX_ONESTEP_SYNC)
+		return -ERANGE;
+
+	device->hwts_tx_en = cfg.tx_type;
+
+	switch (cfg.rx_filter) {
+	case HWTSTAMP_FILTER_NONE:
+		device->hwts_rx_en = 0;
+		device->layer = 0;
+		device->version = 0;
+		break;
+	case HWTSTAMP_FILTER_PTP_V1_L4_EVENT:
+	case HWTSTAMP_FILTER_PTP_V1_L4_SYNC:
+	case HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ:
+		device->hwts_rx_en = 1;
+		device->layer = PTP_CLASS_L4;
+		device->version = PTP_CLASS_V1;
+		cfg.rx_filter = HWTSTAMP_FILTER_PTP_V1_L4_EVENT;
+		break;
+	case HWTSTAMP_FILTER_PTP_V2_L4_EVENT:
+	case HWTSTAMP_FILTER_PTP_V2_L4_SYNC:
+	case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ:
+		device->hwts_rx_en = 1;
+		device->layer = PTP_CLASS_L4;
+		device->version = PTP_CLASS_V2;
+		cfg.rx_filter = HWTSTAMP_FILTER_PTP_V2_L4_EVENT;
+		break;
+	case HWTSTAMP_FILTER_PTP_V2_L2_EVENT:
+	case HWTSTAMP_FILTER_PTP_V2_L2_SYNC:
+	case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ:
+		device->hwts_rx_en = 1;
+		device->layer = PTP_CLASS_L2;
+		device->version = PTP_CLASS_V2;
+		cfg.rx_filter = HWTSTAMP_FILTER_PTP_V2_L2_EVENT;
+		break;
+	case HWTSTAMP_FILTER_PTP_V2_EVENT:
+	case HWTSTAMP_FILTER_PTP_V2_SYNC:
+	case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
+		device->hwts_rx_en = 1;
+		device->layer = PTP_CLASS_L4 | PTP_CLASS_L2;
+		device->version = PTP_CLASS_V2;
+		cfg.rx_filter = HWTSTAMP_FILTER_PTP_V2_EVENT;
+		break;
+	default:
+		return -ERANGE;
+	}
+
+	return copy_to_user(ifr->ifr_data, &cfg, sizeof(cfg)) ? -EFAULT : 0;
+}
+
+static int bcm54210pe_feature_enable(struct ptp_clock_info *info,
+				     struct ptp_clock_request *req, int on)
+{
+	struct bcm54210pe_ptp *ptp = container_of(info, struct bcm54210pe_ptp, caps);
+	s64 period, pulsewidth;
+	struct timespec64 ts;
+
+	switch (req->type) {
+	case PTP_CLK_REQ_PEROUT:
+
+		period = 0;
+		pulsewidth = 0;
+
+		// Check if pin func is set correctly
+		if (ptp->chosen->sdp_config[SYNC_OUT_PIN].func != PTP_PF_PEROUT)
+			return -EOPNOTSUPP;
+
+		// No other flags supported
+		if (req->perout.flags & ~PTP_PEROUT_DUTY_CYCLE)
+			return -EOPNOTSUPP;
+
+		// Check if a specific pulsewidth is set
+		if ((req->perout.flags & PTP_PEROUT_DUTY_CYCLE) > 0) {
+			if (ptp->chosen->perout_mode == SYNC_OUT_MODE_1)
+				return -EOPNOTSUPP;
+
+			// Extract pulsewidth
+			ts.tv_sec = req->perout.on.sec;
+			ts.tv_nsec = req->perout.on.nsec;
+			pulsewidth = timespec64_to_ns(&ts);
+
+			// 9 bits in 8ns units, so max = 4,088ns
+			if (pulsewidth > 511 * 8)
+				return -ERANGE;
+		}
+
+		// Extract pulse spacing interval (period)
+		ts.tv_sec = req->perout.period.sec;
+		ts.tv_nsec = req->perout.period.nsec;
+		period = timespec64_to_ns(&ts);
+
+		// 16ns is minimum pulse spacing interval (a value of
+		// 16 will result in 8ns high followed by 8 ns low)
+		if (period != 0 && period < 16)
+			return -ERANGE;
+
+		return bcm54210pe_perout_enable(ptp->chosen, period, pulsewidth, on);
+
+	case PTP_CLK_REQ_EXTTS:
+
+		if (ptp->chosen->sdp_config[SYNC_IN_PIN].func != PTP_PF_EXTTS)
+			return -EOPNOTSUPP;
+
+		return bcm54210pe_extts_enable(ptp->chosen, on);
+
+	default:
+		break;
+	}
+
+	return -EOPNOTSUPP;
+}
+
+static int bcm54210pe_ptp_verify_pin(struct ptp_clock_info *info, unsigned int pin,
+				     enum ptp_pin_function func, unsigned int chan)
+{
+	switch (func) {
+	case PTP_PF_NONE:
+		return 0;
+	case PTP_PF_EXTTS:
+		if (pin == SYNC_IN_PIN)
+			return 0;
+		break;
+	case PTP_PF_PEROUT:
+		if (pin == SYNC_OUT_PIN)
+			return 0;
+		break;
+	case PTP_PF_PHYSYNC:
+		break;
+	}
+	return -1;
+}
+
+static const struct ptp_clock_info bcm54210pe_clk_caps = {
+	.owner		= THIS_MODULE,
+	.name		= "BCM54210PE_PHC",
+	.max_adj	= 100000000,
+	.n_alarm	= 0,
+	.n_pins		= 2,
+	.n_ext_ts	= 1,
+	.n_per_out	= 1,
+	.pps		= 0,
+	.adjtime	= &bcm54210pe_adjtime,
+	.adjfine	= &bcm54210pe_adjfine,
+	.gettime64	= &bcm54210pe_gettime,
+	.gettimex64	= &bcm54210pe_gettimex,
+	.settime64	= &bcm54210pe_settime,
+	.enable		= &bcm54210pe_feature_enable,
+	.verify		= &bcm54210pe_ptp_verify_pin,
+};
+
+static int bcm54210pe_sw_reset(struct phy_device *phydev)
+{
+	u16 err;
+	u16 aux;
+
+	err =  bcm_phy_write_exp(phydev, EXT_SOFTWARE_RESET, EXT_RESET1);
+	err |= bcm_phy_read_exp(phydev, EXT_ENABLE_REG1);
+
+	if (err < 0)
+		return err;
+
+	err |= bcm_phy_write_exp(phydev, EXT_SOFTWARE_RESET, EXT_RESET2);
+	aux = bcm_phy_read_exp(phydev, EXT_SOFTWARE_RESET);
+	return err;
+}
+
+int bcm54210pe_probe(struct phy_device *phydev)
+{
+	int x, y;
+	struct bcm54210pe_ptp *ptp;
+	struct bcm54210pe_private *bcm54210pe;
+	struct ptp_pin_desc *sync_in_pin_desc, *sync_out_pin_desc;
+
+	bcm54210pe_sw_reset(phydev);
+	bcm54210pe_config_1588(phydev);
+
+	bcm54210pe = kzalloc(sizeof(*bcm54210pe), GFP_KERNEL);
+	if (!bcm54210pe)
+		return -ENOMEM;
+
+	ptp = kzalloc(sizeof(*ptp), GFP_KERNEL);
+	if (!ptp)
+		return -ENOMEM;
+
+	bcm54210pe->phydev = phydev;
+	bcm54210pe->ptp = ptp;
+
+	bcm54210pe->mii_ts.rxtstamp = bcm54210pe_rxtstamp;
+	bcm54210pe->mii_ts.txtstamp = bcm54210pe_txtstamp;
+	bcm54210pe->mii_ts.hwtstamp = bcm54210pe_hwtstamp;
+	bcm54210pe->mii_ts.ts_info  = bcm54210pe_ts_info;
+
+	phydev->mii_ts = &bcm54210pe->mii_ts;
+
+	// Initialisation of work_structs and similar
+	INIT_WORK(&bcm54210pe->txts_work, bcm54210pe_run_tx_timestamp_match_thread);
+	INIT_WORK(&bcm54210pe->rxts_work, bcm54210pe_run_rx_timestamp_match_thread);
+	INIT_DELAYED_WORK(&bcm54210pe->perout_ws, bcm54210pe_run_perout_mode_one_thread);
+	INIT_DELAYED_WORK(&bcm54210pe->extts_ws, bcm54210pe_run_extts_thread);
+
+	// SKB queues
+	skb_queue_head_init(&bcm54210pe->tx_skb_queue);
+	skb_queue_head_init(&bcm54210pe->rx_skb_queue);
+
+	for (x = 0; x < CIRCULAR_BUFFER_COUNT; x++) {
+		INIT_LIST_HEAD(&bcm54210pe->circular_buffers[x]);
+
+		for (y = 0; y < CIRCULAR_BUFFER_ITEM_COUNT; y++)
+			list_add(&bcm54210pe->circular_buffer_items[x][y].list,
+				 &bcm54210pe->circular_buffers[x]);
+	}
+
+	// Caps
+	memcpy(&bcm54210pe->ptp->caps, &bcm54210pe_clk_caps, sizeof(bcm54210pe_clk_caps));
+	bcm54210pe->ptp->caps.pin_config = bcm54210pe->sdp_config;
+
+	// Mutex
+	mutex_init(&bcm54210pe->clock_lock);
+	mutex_init(&bcm54210pe->timestamp_buffer_lock);
+
+	// Features
+	bcm54210pe->one_step = false;
+	bcm54210pe->extts_en = false;
+	bcm54210pe->perout_en = false;
+	bcm54210pe->perout_mode = SYNC_OUT_MODE_1;
+
+	// Fibonacci RSewoke style progressive backoff scheme
+	bcm54210pe->fib_sequence[0] = 1;
+	bcm54210pe->fib_sequence[1] = 1;
+	bcm54210pe->fib_sequence[2] = 2;
+	bcm54210pe->fib_sequence[3] = 3;
+	bcm54210pe->fib_sequence[4] = 5;
+	bcm54210pe->fib_sequence[5] = 8;
+	bcm54210pe->fib_sequence[6] = 13;
+	bcm54210pe->fib_sequence[7] = 21;
+	bcm54210pe->fib_sequence[8] = 34;
+	bcm54210pe->fib_sequence[9] = 55;
+
+	//bcm54210pe->fib_sequence = {1, 1, 2, 3, 5, 8, 13, 21, 34, 55};
+	bcm54210pe->fib_factor_rx = 10;
+	bcm54210pe->fib_factor_tx = 10;
+
+	// Pin descriptions
+	sync_in_pin_desc = &bcm54210pe->sdp_config[SYNC_IN_PIN];
+	snprintf(sync_in_pin_desc->name, sizeof(sync_in_pin_desc->name), "SYNC_IN");
+	sync_in_pin_desc->index = SYNC_IN_PIN;
+	sync_in_pin_desc->func = PTP_PF_NONE;
+
+	sync_out_pin_desc = &bcm54210pe->sdp_config[SYNC_OUT_PIN];
+	snprintf(sync_out_pin_desc->name, sizeof(sync_out_pin_desc->name), "SYNC_OUT");
+	sync_out_pin_desc->index = SYNC_OUT_PIN;
+	sync_out_pin_desc->func = PTP_PF_NONE;
+
+	ptp->chosen = bcm54210pe;
+	phydev->priv = bcm54210pe;
+	ptp->caps.owner = THIS_MODULE;
+
+	bcm54210pe->ptp->ptp_clock = ptp_clock_register(&bcm54210pe->ptp->caps, &phydev->mdio.dev);
+
+	if (IS_ERR(bcm54210pe->ptp->ptp_clock))
+		return PTR_ERR(bcm54210pe->ptp->ptp_clock);
+
+	return 0;
+}
diff --git a/drivers/net/phy/bcm54210pe_ptp.h b/drivers/net/phy/bcm54210pe_ptp.h
new file mode 100644
index 0000000000000..f43e1a4ddbd3d
--- /dev/null
+++ b/drivers/net/phy/bcm54210pe_ptp.h
@@ -0,0 +1,85 @@
+/* SPDX-License-Identifier: GPL-2.0+
+ *
+ * IEEE1588 (PTP), perout and extts for BCM54210PE PHY
+ *
+ * Authors: Carlos Fernandez, Kyle Judd, Lasse Johnsen
+ * License: GPL
+ */
+
+#include <linux/list.h>
+#include <linux/ptp_clock_kernel.h>
+
+#define CIRCULAR_BUFFER_COUNT		8
+#define CIRCULAR_BUFFER_ITEM_COUNT	32
+
+#define SYNC_IN_PIN			0
+#define SYNC_OUT_PIN			1
+
+#define SYNC_OUT_MODE_1			1
+#define SYNC_OUT_MODE_2			2
+
+#define DIRECTION_RX			0
+#define DIRECTION_TX			1
+
+#define INTC_FSYNC			1
+#define INTC_SOP			2
+
+struct bcm54210pe_ptp {
+	struct ptp_clock_info caps;
+	struct ptp_clock *ptp_clock;
+	struct bcm54210pe_private *chosen;
+};
+
+struct bcm54210pe_circular_buffer_item {
+	struct list_head list;
+
+	u8 msg_type;
+	u16 sequence_id;
+	u64 time_stamp;
+	bool is_valid;
+};
+
+struct bcm54210pe_private {
+	struct phy_device *phydev;
+	struct bcm54210pe_ptp *ptp;
+	struct mii_timestamper mii_ts;
+	struct ptp_pin_desc sdp_config[2];
+
+	int ts_tx_config;
+	int tx_rx_filter;
+
+	bool one_step;
+	bool perout_en;
+	bool extts_en;
+
+	int second_on_set;
+
+	int perout_mode;
+	int perout_period;
+	int perout_pulsewidth;
+
+	u64 last_extts_ts;
+	u64 last_immediate_ts[2];
+
+	struct sk_buff_head tx_skb_queue;
+	struct sk_buff_head rx_skb_queue;
+
+	struct bcm54210pe_circular_buffer_item
+		circular_buffer_items[CIRCULAR_BUFFER_COUNT]
+				     [CIRCULAR_BUFFER_ITEM_COUNT];
+	struct list_head circular_buffers[CIRCULAR_BUFFER_COUNT];
+
+	struct work_struct txts_work, rxts_work;
+	struct delayed_work perout_ws, extts_ws;
+	struct mutex clock_lock, timestamp_buffer_lock;
+
+	int fib_sequence[10];
+
+	int fib_factor_rx;
+	int fib_factor_tx;
+
+	int hwts_tx_en;
+	int hwts_rx_en;
+	int layer;
+	int version;
+};
diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 8b0ac38742d06..0aba0f08eb49d 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -15,6 +15,11 @@
 #include <linux/phy.h>
 #include <linux/brcmphy.h>
 #include <linux/of.h>
+#include <linux/irq.h>
+
+#if IS_ENABLED(CONFIG_BCM54120PE_PHY)
+extern int bcm54210pe_probe(struct phy_device *phydev);
+#endif
 
 #define BRCM_PHY_MODEL(phydev) \
 	((phydev)->drv->phy_id & (phydev)->drv->phy_id_mask)
@@ -778,7 +783,20 @@ static struct phy_driver broadcom_drivers[] = {
 	.config_init	= bcm54xx_config_init,
 	.ack_interrupt	= bcm_phy_ack_intr,
 	.config_intr	= bcm_phy_config_intr,
-}, {
+},
+
+#if IS_ENABLED(CONFIG_BCM54120PE_PHY)
+{
+	.phy_id		= PHY_ID_BCM54213PE,
+	.phy_id_mask	= 0xffffffff,
+	.name		= "Broadcom BCM54210PE",
+	/* PHY_GBIT_FEATURES */
+	.config_init	= bcm54xx_config_init,
+	.ack_interrupt	= bcm_phy_ack_intr,
+	.config_intr	= bcm_phy_config_intr,
+	.probe		= bcm54210pe_probe,
+#elif
+{
 	.phy_id		= PHY_ID_BCM54213PE,
 	.phy_id_mask	= 0xffffffff,
 	.name		= "Broadcom BCM54213PE",
@@ -786,6 +804,7 @@ static struct phy_driver broadcom_drivers[] = {
 	.config_init	= bcm54xx_config_init,
 	.ack_interrupt	= bcm_phy_ack_intr,
 	.config_intr	= bcm_phy_config_intr,
+#endif
 }, {
 	.phy_id		= PHY_ID_BCM5461,
 	.phy_id_mask	= 0xfffffff0,
diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 3e377f3c69e5d..586f9143311b4 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -87,6 +87,23 @@ config PTP_1588_CLOCK_INES
 	  core.  This clock is only useful if the MII bus of your MAC
 	  is wired up to the core.
 
+ config BCM54120PE_PHY
+	tristate "Add support for ptp in bcm54210pe PHYs"
+	depends on NETWORK_PHY_TIMESTAMPING
+	depends on PHYLIB
+	depends on PTP_1588_CLOCK
+	depends on BCM_NET_PHYLIB
+        select NET_PTP_CLASSIFY
+	help
+	  This driver adds support for using the BCM54210PE as a PTP
+	  clock. This clock is only useful if your PTP programs are
+	  getting hardware time stamps on the PTP Ethernet packets
+	  using the SO_TIMESTAMPING API.
+
+	  In order for this to work, your MAC driver must also
+	  implement the skb_tx_timestamp() function.
+
+
 config PTP_1588_CLOCK_PCH
 	tristate "Intel PCH EG20T as PTP clock"
 	depends on X86_32 || COMPILE_TEST

[-- Attachment #3: Type: text/plain, Size: 3 bytes --]





^ permalink raw reply related

* [PATCH net-next 28/28] sfc/siena: Reinstate SRIOV init/fini function calls
From: Martin Habets @ 2022-04-22 15:02 UTC (permalink / raw)
  To: kuba, pabeni, davem; +Cc: netdev, ecree.xilinx
In-Reply-To: <165063937837.27138.6911229584057659609.stgit@palantir17.mph.net>

They were removed in patch 1 since they are not used for EF10.
Put that code back for Siena, with the prototypes in siena_sriov.h
since that file is a more applicable for it.

Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
---
 drivers/net/ethernet/sfc/siena/efx.c         |   16 ++++++++++++++++
 drivers/net/ethernet/sfc/siena/siena_sriov.h |    3 +++
 2 files changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/sfc/siena/efx.c b/drivers/net/ethernet/sfc/siena/efx.c
index 718b07b1ed28..236e033f9f3d 100644
--- a/drivers/net/ethernet/sfc/siena/efx.c
+++ b/drivers/net/ethernet/sfc/siena/efx.c
@@ -32,6 +32,9 @@
 #include "io.h"
 #include "selftest.h"
 #include "sriov.h"
+#ifdef CONFIG_SFC_SIENA_SRIOV
+#include "siena_sriov.h"
+#endif
 
 #include "mcdi_port_common.h"
 #include "mcdi_pcol.h"
@@ -1271,6 +1274,12 @@ static int __init efx_init_module(void)
 	if (rc)
 		goto err_notifier;
 
+#ifdef CONFIG_SFC_SIENA_SRIOV
+	rc = efx_init_sriov();
+	if (rc)
+		goto err_sriov;
+#endif
+
 	rc = efx_siena_create_reset_workqueue();
 	if (rc)
 		goto err_reset;
@@ -1284,6 +1293,10 @@ static int __init efx_init_module(void)
  err_pci:
 	efx_siena_destroy_reset_workqueue();
  err_reset:
+#ifdef CONFIG_SFC_SIENA_SRIOV
+	efx_fini_sriov();
+ err_sriov:
+#endif
 	unregister_netdevice_notifier(&efx_netdev_notifier);
  err_notifier:
 	return rc;
@@ -1295,6 +1308,9 @@ static void __exit efx_exit_module(void)
 
 	pci_unregister_driver(&efx_pci_driver);
 	efx_siena_destroy_reset_workqueue();
+#ifdef CONFIG_SFC_SIENA_SRIOV
+	efx_fini_sriov();
+#endif
 	unregister_netdevice_notifier(&efx_netdev_notifier);
 
 }
diff --git a/drivers/net/ethernet/sfc/siena/siena_sriov.h b/drivers/net/ethernet/sfc/siena/siena_sriov.h
index 69a7a18e9ba0..50f6e924495e 100644
--- a/drivers/net/ethernet/sfc/siena/siena_sriov.h
+++ b/drivers/net/ethernet/sfc/siena/siena_sriov.h
@@ -60,6 +60,9 @@ static inline bool efx_siena_sriov_enabled(struct efx_nic *efx)
 {
 	return efx->vf_init_count != 0;
 }
+
+int efx_init_sriov(void);
+void efx_fini_sriov(void);
 #else /* !CONFIG_SFC_SIENA_SRIOV */
 static inline bool efx_siena_sriov_enabled(struct efx_nic *efx)
 {


^ permalink raw reply related

* [PATCH net-next 27/28] sfc/siena: Make PTP and reset support specific for Siena
From: Martin Habets @ 2022-04-22 15:02 UTC (permalink / raw)
  To: kuba, pabeni, davem; +Cc: netdev, ecree.xilinx
In-Reply-To: <165063937837.27138.6911229584057659609.stgit@palantir17.mph.net>

From: Martin Habets <martinh@xilinx.com>

Change the clock name and work queue names to differentiate them from
the names used in sfc.ko.

Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
---
 drivers/net/ethernet/sfc/siena/efx_common.c |    2 +-
 drivers/net/ethernet/sfc/siena/ptp.c        |    7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/sfc/siena/efx_common.c b/drivers/net/ethernet/sfc/siena/efx_common.c
index a615bffcbad4..954daf464abb 100644
--- a/drivers/net/ethernet/sfc/siena/efx_common.c
+++ b/drivers/net/ethernet/sfc/siena/efx_common.c
@@ -112,7 +112,7 @@ static struct workqueue_struct *reset_workqueue;
 
 int efx_siena_create_reset_workqueue(void)
 {
-	reset_workqueue = create_singlethread_workqueue("sfc_reset");
+	reset_workqueue = create_singlethread_workqueue("sfc_siena_reset");
 	if (!reset_workqueue) {
 		printk(KERN_ERR "Failed to create reset workqueue\n");
 		return -ENOMEM;
diff --git a/drivers/net/ethernet/sfc/siena/ptp.c b/drivers/net/ethernet/sfc/siena/ptp.c
index 8e18da096595..7c46752e6eae 100644
--- a/drivers/net/ethernet/sfc/siena/ptp.c
+++ b/drivers/net/ethernet/sfc/siena/ptp.c
@@ -1422,7 +1422,7 @@ static void efx_ptp_worker(struct work_struct *work)
 
 static const struct ptp_clock_info efx_phc_clock_info = {
 	.owner		= THIS_MODULE,
-	.name		= "sfc",
+	.name		= "sfc_siena",
 	.max_adj	= MAX_PPB,
 	.n_alarm	= 0,
 	.n_ext_ts	= 0,
@@ -1458,7 +1458,7 @@ static int efx_ptp_probe(struct efx_nic *efx, struct efx_channel *channel)
 
 	skb_queue_head_init(&ptp->rxq);
 	skb_queue_head_init(&ptp->txq);
-	ptp->workwq = create_singlethread_workqueue("sfc_ptp");
+	ptp->workwq = create_singlethread_workqueue("sfc_siena_ptp");
 	if (!ptp->workwq) {
 		rc = -ENOMEM;
 		goto fail2;
@@ -1502,7 +1502,8 @@ static int efx_ptp_probe(struct efx_nic *efx, struct efx_channel *channel)
 			goto fail3;
 		} else if (ptp->phc_clock) {
 			INIT_WORK(&ptp->pps_work, efx_ptp_pps_worker);
-			ptp->pps_workwq = create_singlethread_workqueue("sfc_pps");
+			ptp->pps_workwq =
+				create_singlethread_workqueue("sfc_siena_pps");
 			if (!ptp->pps_workwq) {
 				rc = -ENOMEM;
 				goto fail4;


^ permalink raw reply related

* [PATCH net-next 26/28] sfc/siena: Make MCDI logging support specific for Siena
From: Martin Habets @ 2022-04-22 15:02 UTC (permalink / raw)
  To: kuba, pabeni, davem; +Cc: netdev, ecree.xilinx
In-Reply-To: <165063937837.27138.6911229584057659609.stgit@palantir17.mph.net>

From: Martin Habets <martinh@xilinx.com>

Add a Siena Kconfig option and use it in stead of the sfc one.
Rename the internal variable for the 'mcdi_logging_default' module
parameter to avoid a naming conflict with the one in sfc.ko.

Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
---
 drivers/net/ethernet/sfc/Kconfig            |    2 +-
 drivers/net/ethernet/sfc/siena/Kconfig      |   10 ++++++++++
 drivers/net/ethernet/sfc/siena/efx_common.c |    2 +-
 drivers/net/ethernet/sfc/siena/efx_common.h |    2 +-
 drivers/net/ethernet/sfc/siena/mcdi.c       |   23 ++++++++++++-----------
 drivers/net/ethernet/sfc/siena/mcdi.h       |    2 +-
 6 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/sfc/Kconfig b/drivers/net/ethernet/sfc/Kconfig
index dac2f09702aa..0950e6b0508f 100644
--- a/drivers/net/ethernet/sfc/Kconfig
+++ b/drivers/net/ethernet/sfc/Kconfig
@@ -55,7 +55,7 @@ config SFC_SRIOV
 	  features, allowing accelerated network performance in
 	  virtualized environments.
 config SFC_MCDI_LOGGING
-	bool "Solarflare SFC9000/SFC9100-family MCDI logging support"
+	bool "Solarflare SFC9100-family MCDI logging support"
 	depends on SFC
 	default y
 	help
diff --git a/drivers/net/ethernet/sfc/siena/Kconfig b/drivers/net/ethernet/sfc/siena/Kconfig
index 4eb6801ff3c0..cb3c5cb42a53 100644
--- a/drivers/net/ethernet/sfc/siena/Kconfig
+++ b/drivers/net/ethernet/sfc/siena/Kconfig
@@ -33,3 +33,13 @@ config SFC_SIENA_SRIOV
 	  This enables support for the Single Root I/O Virtualization
 	  features, allowing accelerated network performance in
 	  virtualized environments.
+config SFC_SIENA_MCDI_LOGGING
+	bool "Solarflare SFC9000-family MCDI logging support"
+	depends on SFC_SIENA
+	default y
+	help
+	  This enables support for tracing of MCDI (Management-Controller-to-
+	  Driver-Interface) commands and responses, allowing debugging of
+	  driver/firmware interaction.  The tracing is actually enabled by
+	  a sysfs file 'mcdi_logging' under the PCI device, or via module
+	  parameter mcdi_logging_default.
diff --git a/drivers/net/ethernet/sfc/siena/efx_common.c b/drivers/net/ethernet/sfc/siena/efx_common.c
index 3aef8d216f95..a615bffcbad4 100644
--- a/drivers/net/ethernet/sfc/siena/efx_common.c
+++ b/drivers/net/ethernet/sfc/siena/efx_common.c
@@ -1170,7 +1170,7 @@ void efx_siena_fini_io(struct efx_nic *efx)
 		pci_disable_device(efx->pci_dev);
 }
 
-#ifdef CONFIG_SFC_MCDI_LOGGING
+#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
 static ssize_t mcdi_logging_show(struct device *dev,
 				 struct device_attribute *attr,
 				 char *buf)
diff --git a/drivers/net/ethernet/sfc/siena/efx_common.h b/drivers/net/ethernet/sfc/siena/efx_common.h
index 470033611436..aeb92f4e34b7 100644
--- a/drivers/net/ethernet/sfc/siena/efx_common.h
+++ b/drivers/net/ethernet/sfc/siena/efx_common.h
@@ -88,7 +88,7 @@ static inline void efx_schedule_channel_irq(struct efx_channel *channel)
 	efx_schedule_channel(channel);
 }
 
-#ifdef CONFIG_SFC_MCDI_LOGGING
+#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
 void efx_siena_init_mcdi_logging(struct efx_nic *efx);
 void efx_siena_fini_mcdi_logging(struct efx_nic *efx);
 #else
diff --git a/drivers/net/ethernet/sfc/siena/mcdi.c b/drivers/net/ethernet/sfc/siena/mcdi.c
index b767e29cfe92..3df0f0eca3b7 100644
--- a/drivers/net/ethernet/sfc/siena/mcdi.c
+++ b/drivers/net/ethernet/sfc/siena/mcdi.c
@@ -51,9 +51,10 @@ static int efx_mcdi_drv_attach(struct efx_nic *efx, bool driver_operating,
 static bool efx_mcdi_poll_once(struct efx_nic *efx);
 static void efx_mcdi_abandon(struct efx_nic *efx);
 
-#ifdef CONFIG_SFC_MCDI_LOGGING
-static bool mcdi_logging_default;
-module_param(mcdi_logging_default, bool, 0644);
+#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
+static bool efx_siena_mcdi_logging_default;
+module_param_named(mcdi_logging_default, efx_siena_mcdi_logging_default,
+		   bool, 0644);
 MODULE_PARM_DESC(mcdi_logging_default,
 		 "Enable MCDI logging on newly-probed functions");
 #endif
@@ -70,12 +71,12 @@ int efx_siena_mcdi_init(struct efx_nic *efx)
 
 	mcdi = efx_mcdi(efx);
 	mcdi->efx = efx;
-#ifdef CONFIG_SFC_MCDI_LOGGING
+#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
 	/* consuming code assumes buffer is page-sized */
 	mcdi->logging_buffer = (char *)__get_free_page(GFP_KERNEL);
 	if (!mcdi->logging_buffer)
 		goto fail1;
-	mcdi->logging_enabled = mcdi_logging_default;
+	mcdi->logging_enabled = efx_siena_mcdi_logging_default;
 #endif
 	init_waitqueue_head(&mcdi->wq);
 	init_waitqueue_head(&mcdi->proxy_rx_wq);
@@ -114,7 +115,7 @@ int efx_siena_mcdi_init(struct efx_nic *efx)
 
 	return 0;
 fail2:
-#ifdef CONFIG_SFC_MCDI_LOGGING
+#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
 	free_page((unsigned long)mcdi->logging_buffer);
 fail1:
 #endif
@@ -140,7 +141,7 @@ void efx_siena_mcdi_fini(struct efx_nic *efx)
 	if (!efx->mcdi)
 		return;
 
-#ifdef CONFIG_SFC_MCDI_LOGGING
+#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
 	free_page((unsigned long)efx->mcdi->iface.logging_buffer);
 #endif
 
@@ -151,7 +152,7 @@ static void efx_mcdi_send_request(struct efx_nic *efx, unsigned cmd,
 				  const efx_dword_t *inbuf, size_t inlen)
 {
 	struct efx_mcdi_iface *mcdi = efx_mcdi(efx);
-#ifdef CONFIG_SFC_MCDI_LOGGING
+#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
 	char *buf = mcdi->logging_buffer; /* page-sized */
 #endif
 	efx_dword_t hdr[2];
@@ -198,7 +199,7 @@ static void efx_mcdi_send_request(struct efx_nic *efx, unsigned cmd,
 		hdr_len = 8;
 	}
 
-#ifdef CONFIG_SFC_MCDI_LOGGING
+#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
 	if (mcdi->logging_enabled && !WARN_ON_ONCE(!buf)) {
 		int bytes = 0;
 		int i;
@@ -266,7 +267,7 @@ static void efx_mcdi_read_response_header(struct efx_nic *efx)
 {
 	struct efx_mcdi_iface *mcdi = efx_mcdi(efx);
 	unsigned int respseq, respcmd, error;
-#ifdef CONFIG_SFC_MCDI_LOGGING
+#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
 	char *buf = mcdi->logging_buffer; /* page-sized */
 #endif
 	efx_dword_t hdr;
@@ -286,7 +287,7 @@ static void efx_mcdi_read_response_header(struct efx_nic *efx)
 			EFX_DWORD_FIELD(hdr, MC_CMD_V2_EXTN_IN_ACTUAL_LEN);
 	}
 
-#ifdef CONFIG_SFC_MCDI_LOGGING
+#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
 	if (mcdi->logging_enabled && !WARN_ON_ONCE(!buf)) {
 		size_t hdr_len, data_len;
 		int bytes = 0;
diff --git a/drivers/net/ethernet/sfc/siena/mcdi.h b/drivers/net/ethernet/sfc/siena/mcdi.h
index 03810c570a33..06f38e5e6832 100644
--- a/drivers/net/ethernet/sfc/siena/mcdi.h
+++ b/drivers/net/ethernet/sfc/siena/mcdi.h
@@ -80,7 +80,7 @@ struct efx_mcdi_iface {
 	spinlock_t async_lock;
 	struct list_head async_list;
 	struct timer_list async_timer;
-#ifdef CONFIG_SFC_MCDI_LOGGING
+#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
 	char *logging_buffer;
 	bool logging_enabled;
 #endif


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox