Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Accessing XDP packet memory from the end
From: Jesper Dangaard Brouer @ 2022-04-21 16:27 UTC (permalink / raw)
  To: Larysa Zaremba, bpf
  Cc: brouer, netdev, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Toke Hoiland-Jorgensen, Magnus Karlsson,
	Maciej Fijalkowski, Alexander Lobakin, xdp-hints@xdp-project.net
In-Reply-To: <20220421155620.81048-1-larysa.zaremba@intel.com>



On 21/04/2022 17.56, Larysa Zaremba wrote:
> Dear all,
> Our team has encountered a need of accessing data_meta in a following way:
> 
> int xdp_meta_prog(struct xdp_md *ctx)
> {
> 	void *data_meta_ptr = (void *)(long)ctx->data_meta;
> 	void *data_end = (void *)(long)ctx->data_end;
> 	void *data = (void *)(long)ctx->data;
> 	u64 data_size = sizeof(u32);
> 	u32 magic_meta;
> 	u8 offset;
> 
> 	offset = (u8)((s64)data - (s64)data_meta_ptr);

I'm not sure the verifier can handle this 'offset' calc. As it cannot
statically know the sized based on this statement. Maybe this is not the
issue.

> 	if (offset < data_size) {
> 		bpf_printk("invalid offset: %ld\n", offset);
> 		return XDP_DROP;
> 	}
> 
> 	data_meta_ptr += offset;
> 	data_meta_ptr -= data_size;
> 
> 	if (data_meta_ptr + data_size > data) {
> 		return XDP_DROP;
> 	}
> 		
> 	magic_meta = *((u32 *)data);
> 	bpf_printk("Magic: %d\n", magic_meta);
> 	return XDP_PASS;
> }
> 
> Unfortunately, verifier claims this code attempts to access packet with
> an offset of -2 (a constant part) and negative offset is generally forbidden.

Are you forgetting to mention:
  - Have you modified the NIC driver to adjust data_meta pointer and 
provide info in this area?

p.s. this is exactly what I'm also working towards[1], so I'll be happy
to collaborate.  I'm missing the driver code, as link[1] is focused on
decoding BTF data_meta area in userspace for AF_XDP.

[1] 
https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction

> For now we have 2 solutions, one is using bpf_xdp_adjust_meta(),
> which is pretty good, but not ideal for the hot path.
> The second one is the patch at the end.
> 

Are you saying, verifier cannot handle that driver changed data_meta 
pointer and provided info there (without calling bpf_xdp_adjust_meta)?


> Do you see any other way of accessing memory from the end of data_meta/data?
> What do you think about both suggested solutions?
> 
> Best regards,
> Larysa Zaremba
> 
> ---
> 
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3576,8 +3576,11 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
>   	}
>   
>   	err = reg->range < 0 ? -EINVAL :
> -	      __check_mem_access(env, regno, off, size, reg->range,
> -				 zero_size_allowed);
> +	      __check_mem_access(env, regno, off + reg->smin_value, size,
> +				 reg->range + reg->smin_value, zero_size_allowed);
> +	err = err ? :
> +	      __check_mem_access(env, regno, off + reg->umax_value, size,
> +				 reg->range + reg->umax_value, zero_size_allowed);
>   	if (err) {
>   		verbose(env, "R%d offset is outside of the packet\n", regno);
>   		return err;
> 


^ permalink raw reply

* Accessing XDP packet memory from the end
From: Larysa Zaremba @ 2022-04-21 15:56 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, netdev, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, Toke Hoiland-Jorgensen,
	Magnus Karlsson, Maciej Fijalkowski, Alexander Lobakin

Dear all,
Our team has encountered a need of accessing data_meta in a following way:

int xdp_meta_prog(struct xdp_md *ctx)
{
	void *data_meta_ptr = (void *)(long)ctx->data_meta;
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	u64 data_size = sizeof(u32);
	u32 magic_meta;
	u8 offset;

	offset = (u8)((s64)data - (s64)data_meta_ptr);
	if (offset < data_size) {
		bpf_printk("invalid offset: %ld\n", offset);
		return XDP_DROP;
	}

	data_meta_ptr += offset;
	data_meta_ptr -= data_size;

	if (data_meta_ptr + data_size > data) {
		return XDP_DROP;
	}
		
	magic_meta = *((u32 *)data);
	bpf_printk("Magic: %d\n", magic_meta);
	return XDP_PASS;
}

Unfortunately, verifier claims this code attempts to access packet with
an offset of -2 (a constant part) and negative offset is generally forbidden.

For now we have 2 solutions, one is using bpf_xdp_adjust_meta(),
which is pretty good, but not ideal for the hot path.
The second one is the patch at the end.

Do you see any other way of accessing memory from the end of data_meta/data?
What do you think about both suggested solutions?

Best regards,
Larysa Zaremba

---

--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3576,8 +3576,11 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
 	}
 
 	err = reg->range < 0 ? -EINVAL :
-	      __check_mem_access(env, regno, off, size, reg->range,
-				 zero_size_allowed);
+	      __check_mem_access(env, regno, off + reg->smin_value, size,
+				 reg->range + reg->smin_value, zero_size_allowed);
+	err = err ? :
+	      __check_mem_access(env, regno, off + reg->umax_value, size,
+				 reg->range + reg->umax_value, zero_size_allowed);
 	if (err) {
 		verbose(env, "R%d offset is outside of the packet\n", regno);
 		return err;

^ permalink raw reply

* Re: [PATCH] wwan_hwsim: Avoid flush_scheduled_work() usage
From: Sergey Ryazanov @ 2022-04-21 16:14 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Loic Poulain, Johannes Berg, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Network Development
In-Reply-To: <cb2f74ea-74ce-2dba-c5e3-e4672f1be663@I-love.SAKURA.ne.jp>

Hello Tetsuo,

On Wed, Apr 20, 2022 at 1:17 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> On 2022/04/20 18:53, Loic Poulain wrote:
>>> @@ -506,9 +507,15 @@ static int __init wwan_hwsim_init(void)
>>>         if (wwan_hwsim_devsnum < 0 || wwan_hwsim_devsnum > 128)
>>>                 return -EINVAL;
>>>
>>> +       wwan_wq = alloc_workqueue("wwan_wq", 0, 0);
>>> +       if (!wwan_wq)
>>> +               return -ENOMEM;
>>> +
>>>         wwan_hwsim_class = class_create(THIS_MODULE, "wwan_hwsim");
>>> -       if (IS_ERR(wwan_hwsim_class))
>>> +       if (IS_ERR(wwan_hwsim_class)) {
>>> +               destroy_workqueue(wwan_wq);
>>>                 return PTR_ERR(wwan_hwsim_class);
>>> +       }
>>>
>>>         wwan_hwsim_debugfs_topdir = debugfs_create_dir("wwan_hwsim", NULL);
>>>         wwan_hwsim_debugfs_devcreate =
>>> @@ -524,6 +531,7 @@ static int __init wwan_hwsim_init(void)
>>>
>>>  err_clean_devs:
>
> Do you want
>
>         debugfs_remove(wwan_hwsim_debugfs_devcreate);
>
> here (as a separate patch)?

Nope. But I will not be against such a patch. I remove the "devcreate"
file in wwwan_hwsim_exit() to prevent new emulated device creation
while the workqueue flushing, which can take a sufficient time. Here
we cleanup the leftovers of the attempt to automatically create
emulated devices. Here is no workqueue flushing, so the race window is
very tight.

In other words, the preparatory debugfs file removal is practically
not required here, but it will not hurt anyone. And possibly will make
the code less questionable.

>>>         wwan_hwsim_free_devs();
>>> +       destroy_workqueue(wwan_wq);
>>>         debugfs_remove(wwan_hwsim_debugfs_topdir);
>>>         class_destroy(wwan_hwsim_class);
>>>
>>> @@ -534,7 +542,7 @@ static void __exit wwan_hwsim_exit(void)
>>>  {
>>>         debugfs_remove(wwan_hwsim_debugfs_devcreate);   /* Avoid new devs */
>>>         wwan_hwsim_free_devs();
>>> -       flush_scheduled_work();         /* Wait deletion works completion */
>>> +       destroy_workqueue(wwan_wq);             /* Wait deletion works completion */
>>
>> Wouldn't it be simpler to just remove the flush call. It Looks like
>> all ports have been removed at that point, and all works cancelled,
>> right?
>
> I guess that this flush_scheduled_work() is for waiting for schedule_work(&dev->del_work) from
> wwan_hwsim_debugfs_devdestroy_write(). That is, if wwan_hwsim_debugfs_devdestroy_write() already
> scheduled this work, wwan_hwsim_dev_del() from wwan_hwsim_dev_del_work() might be still in progress
> even after wwan_hwsim_dev_del() from wwan_hwsim_free_devs() from wwan_hwsim_exit() returned.

Exactly. This code will wait for the completion of the work that was
scheduled somewhere else.

-- 
Sergey

^ permalink raw reply

* Re: [PATCH 3/5] hv_sock: Add validation for untrusted Hyper-V values
From: Stefano Garzarella @ 2022-04-21 16:14 UTC (permalink / raw)
  To: Andrea Parri
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, Michael Kelley, David Miller, Jakub Kicinski,
	Paolo Abeni, linux-hyperv, Linux Virtualization, netdev,
	kernel list
In-Reply-To: <20220421152827.GB4679@anparri>

On Thu, Apr 21, 2022 at 5:30 PM Andrea Parri <parri.andrea@gmail.com> wrote:
>
> > > @@ -577,12 +577,19 @@ static bool hvs_dgram_allow(u32 cid, u32 port)
> > > static int hvs_update_recv_data(struct hvsock *hvs)
> > > {
> > >     struct hvs_recv_buf *recv_buf;
> > > -   u32 payload_len;
> > > +   u32 pkt_len, payload_len;
> > > +
> > > +   pkt_len = hv_pkt_len(hvs->recv_desc);
> > > +
> > > +   /* Ensure the packet is big enough to read its header */
> > > +   if (pkt_len < HVS_HEADER_LEN)
> > > +           return -EIO;
> > >
> > >     recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
> > >     payload_len = recv_buf->hdr.data_size;
> > >
> > > -   if (payload_len > HVS_MTU_SIZE)
> > > +   /* Ensure the packet is big enough to read its payload */
> > > +   if (payload_len > pkt_len - HVS_HEADER_LEN || payload_len > HVS_MTU_SIZE)
> >
> > checkpatch warns that we exceed 80 characters, I do not have a strong
> > opinion on this, but if you have to resend better break the condition into 2
> > lines.
>
> Will break if preferred.  (but does it really warn??  I understand that
> the warning was deprecated and the "limit" increased to 100 chars...)

I see the warn here:
https://patchwork.kernel.org/project/netdevbpf/patch/20220420200720.434717-4-parri.andrea@gmail.com/

in the kernel doc [1] we still say we prefer 80 columns, so I try to
follow, especially when it doesn't make things worse.

[1] https://docs.kernel.org/process/coding-style.html#breaking-long-lines-and-strings

>
>
> > Maybe even update or remove the comment? (it only describes the first
> > condition, but the conditions are pretty clear, so I don't think it adds
> > much).
>
> Works for me.  (taking it as this applies to the previous comment too.)

Yep.

Thanks,
Stefano


^ permalink raw reply

* Re: [PATCH net] net: Use this_cpu_inc() to increment net->core_stats
From: Eric Dumazet @ 2022-04-21 16:06 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: netdev, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Thomas Gleixner, Peter Zijlstra
In-Reply-To: <YmFjdOp+R5gVGZ7p@linutronix.de>

On Thu, Apr 21, 2022 at 7:00 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>

>                 for_each_possible_cpu(i) {
>                         core_stats = per_cpu_ptr(p, i);
> -                       storage->rx_dropped += local_read(&core_stats->rx_dropped);
> -                       storage->tx_dropped += local_read(&core_stats->tx_dropped);
> -                       storage->rx_nohandler += local_read(&core_stats->rx_nohandler);
> +                       storage->rx_dropped += core_stats->rx_dropped;
> +                       storage->tx_dropped += core_stats->tx_dropped;
> +                       storage->rx_nohandler += core_stats->rx_nohandler;

I think that one of the reasons for me to use  local_read() was that
it provided what was needed to avoid future syzbot reports.

Perhaps use READ_ONCE() here ?

Yes, we have many similar folding loops that are  simply assuming
compiler won't do stupid things.

^ permalink raw reply

* Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations
From: Shakeel Butt @ 2022-04-21 15:56 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Vlastimil Babka, Roman Gushchin, kernel, LKML, netdev, Cgroups,
	Michal Hocko, Florian Westphal, David S. Miller, Jakub Kicinski,
	Paolo Abeni
In-Reply-To: <55605876-d05a-8be3-a6ae-ec26de9ee178@openvz.org>

On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <vvs@openvz.org> wrote:
>
> __register_pernet_operations() executes init hook of registered
> pernet_operation structure in all existing net namespaces.
>
> Typically, these hooks are called by a process associated with
> the specified net namespace, and all __GFP_ACCOUNTING marked
> allocation are accounted for corresponding container/memcg.
>
> However __register_pernet_operations() calls the hooks in the same
> context, and as a result all marked allocations are accounted
> to one memcg for all processed net namespaces.
>
> This patch adjusts active memcg for each net namespace and helps
> to account memory allocated inside ops_init() into the proper memcg.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> Dear Vlastimil, Roman,
> I'm not sure that memcg is used correctly here,
> is it perhaps some additional locking required?
> ---
>  net/core/net_namespace.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index a5b5bb99c644..171c6e0b2337 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -26,6 +26,7 @@
>  #include <net/net_namespace.h>
>  #include <net/netns/generic.h>
>
> +#include <linux/sched/mm.h>
>  /*
>   *     Our network namespace constructor/destructor lists
>   */
> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
>                  * setup_net() and cleanup_net() are not possible.
>                  */
>                 for_each_net(net) {
> +                       struct mem_cgroup *old, *memcg = NULL;
> +#ifdef CONFIG_MEMCG
> +                       memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);

memcg from obj is unstable, so you need a reference on memcg. You can
introduce get_mem_cgroup_from_kmem() which works for both
MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
init_net) it should return NULL.

> +#endif
> +                       old = set_active_memcg(memcg);
>                         error = ops_init(ops, net);
> +                       set_active_memcg(old);
>                         if (error)
>                                 goto out_undo;
>                         list_add_tail(&net->exit_list, &net_exit_list);
> --
> 2.31.1
>

^ permalink raw reply

* [PATCH net-next] net: ethernet: mtk_eth_soc: add check for allocation failure
From: Dan Carpenter @ 2022-04-21 15:49 UTC (permalink / raw)
  To: Felix Fietkau
  Cc: John Crispin, Sean Wang, Mark Lee, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Matthias Brugger, netdev,
	linux-mediatek, kernel-janitors

Check if the kzalloc() failed.

Fixes: 804775dfc288 ("net: ethernet: mtk_eth_soc: add support for Wireless Ethernet Dispatch (WED)")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 drivers/net/ethernet/mediatek/mtk_wed.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_wed.c b/drivers/net/ethernet/mediatek/mtk_wed.c
index f0eacf819cd9..a2793bbb8ce0 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed.c
+++ b/drivers/net/ethernet/mediatek/mtk_wed.c
@@ -827,6 +827,8 @@ void mtk_wed_add_hw(struct device_node *np, struct mtk_eth *eth,
 		goto unlock;
 
 	hw = kzalloc(sizeof(*hw), GFP_KERNEL);
+	if (!hw)
+		goto unlock;
 	hw->node = np;
 	hw->regs = regs;
 	hw->eth = eth;
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH net] net: Use this_cpu_inc() to increment net->core_stats
From: Sebastian Andrzej Siewior @ 2022-04-21 15:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Thomas Gleixner, Peter Zijlstra
In-Reply-To: <CANn89i+0u=DmAd1_vv-vJsJ53L2y6v7pvvTgrVN9D=rGo9-ifA@mail.gmail.com>

On 2022-04-21 08:32:30 [-0700], Eric Dumazet wrote:
> On Thu, Apr 21, 2022 at 7:00 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> >
> > The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
> > netdev_core_stats_alloc() to return a per-CPU pointer.
> > netdev_core_stats_alloc() will allocate memory on its first invocation
> > which breaks on PREEMPT_RT because it requires non-atomic context for
> > memory allocation.
> 
> Can you elaborate on this, I am confused ?
> 
> You are saying that on PREEMPT_RT, we can not call
> alloc_percpu_gfp(XXX, GFP_ATOMIC | __GFP_NOWARN);
> under some contexts ?

Correct. On PREEMPT_RT you must not explicitly create an atomic context
by
- using preempt_disable()
- acquiring a raw_spinlock_t lock
- using local_irq_disable()

while allocating memory. GFP_ATOMIC won't save you. The internal locks
within mm (kmalloc() and per-CPU memory) are sleeping locks and can not
be acquired in atomic context.

> preemption might be disabled by callers of net->core_stats anyways...

It won't be disabled by
- acquiring a spinlock_t lock
- running in softirq or interrupt handler

I haven't seen any splats (with RT enabled) other than this
preempt_disable() section so far. However only the first caller
allocates memory so maybe I add a check later on to be sure.

Sebastian

^ permalink raw reply

* [PATCH] batman-adv: remove unnecessary type castings
From: Yu Zhe @ 2022-04-21 15:48 UTC (permalink / raw)
  To: mareklindner, sw, a, sven, davem, kuba, pabeni
  Cc: b.a.t.m.a.n, netdev, linux-kernel, liqiong, kernel-janitors,
	Yu Zhe

remove unnecessary void* type castings.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
---
 net/batman-adv/bridge_loop_avoidance.c |  4 ++--
 net/batman-adv/translation-table.c     | 10 +++++-----
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/batman-adv/bridge_loop_avoidance.c b/net/batman-adv/bridge_loop_avoidance.c
index 7f8a14d99cdb..38cc21370eda 100644
--- a/net/batman-adv/bridge_loop_avoidance.c
+++ b/net/batman-adv/bridge_loop_avoidance.c
@@ -65,7 +65,7 @@ batadv_bla_send_announce(struct batadv_priv *bat_priv,
  */
 static inline u32 batadv_choose_claim(const void *data, u32 size)
 {
-	struct batadv_bla_claim *claim = (struct batadv_bla_claim *)data;
+	struct batadv_bla_claim *claim = data;
 	u32 hash = 0;
 
 	hash = jhash(&claim->addr, sizeof(claim->addr), hash);
@@ -86,7 +86,7 @@ static inline u32 batadv_choose_backbone_gw(const void *data, u32 size)
 	const struct batadv_bla_backbone_gw *gw;
 	u32 hash = 0;
 
-	gw = (struct batadv_bla_backbone_gw *)data;
+	gw = data;
 	hash = jhash(&gw->orig, sizeof(gw->orig), hash);
 	hash = jhash(&gw->vid, sizeof(gw->vid), hash);
 
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c
index 8478034d3abf..f579060aa503 100644
--- a/net/batman-adv/translation-table.c
+++ b/net/batman-adv/translation-table.c
@@ -106,7 +106,7 @@ static inline u32 batadv_choose_tt(const void *data, u32 size)
 	struct batadv_tt_common_entry *tt;
 	u32 hash = 0;
 
-	tt = (struct batadv_tt_common_entry *)data;
+	tt = data;
 	hash = jhash(&tt->addr, ETH_ALEN, hash);
 	hash = jhash(&tt->vid, sizeof(tt->vid), hash);
 
@@ -2766,7 +2766,7 @@ static void batadv_tt_tvlv_generate(struct batadv_priv *bat_priv,
 	u32 i;
 
 	tt_tot = batadv_tt_entries(tt_len);
-	tt_change = (struct batadv_tvlv_tt_change *)tvlv_buff;
+	tt_change = tvlv_buff;
 
 	if (!valid_cb)
 		return;
@@ -3994,7 +3994,7 @@ static void batadv_tt_tvlv_ogm_handler_v1(struct batadv_priv *bat_priv,
 	if (tvlv_value_len < sizeof(*tt_data))
 		return;
 
-	tt_data = (struct batadv_tvlv_tt_data *)tvlv_value;
+	tt_data = tvlv_value;
 	tvlv_value_len -= sizeof(*tt_data);
 
 	num_vlan = ntohs(tt_data->num_vlan);
@@ -4037,7 +4037,7 @@ static int batadv_tt_tvlv_unicast_handler_v1(struct batadv_priv *bat_priv,
 	if (tvlv_value_len < sizeof(*tt_data))
 		return NET_RX_SUCCESS;
 
-	tt_data = (struct batadv_tvlv_tt_data *)tvlv_value;
+	tt_data = tvlv_value;
 	tvlv_value_len -= sizeof(*tt_data);
 
 	tt_vlan_len = sizeof(struct batadv_tvlv_tt_vlan_data);
@@ -4129,7 +4129,7 @@ static int batadv_roam_tvlv_unicast_handler_v1(struct batadv_priv *bat_priv,
 		goto out;
 
 	batadv_inc_counter(bat_priv, BATADV_CNT_TT_ROAM_ADV_RX);
-	roaming_adv = (struct batadv_tvlv_roam_adv *)tvlv_value;
+	roaming_adv = tvlv_value;
 
 	batadv_dbg(BATADV_DBG_TT, bat_priv,
 		   "Received ROAMING_ADV from %pM (client %pM)\n",
-- 
2.25.1


^ permalink raw reply related

* [PATCH net] net: lan966x: fix a couple off by one bugs
From: Dan Carpenter @ 2022-04-21 15:46 UTC (permalink / raw)
  To: Horatiu Vultur
  Cc: UNGLinuxDriver, David S. Miller, Jakub Kicinski, Paolo Abeni,
	netdev, kernel-janitors

The lan966x->ports[] array has lan966x->num_phys_ports elements.  These
are assigned in lan966x_probe().  That means the > comparison should be
changed to >=.

The first off by one check is harmless but the second one could lead to
an out of bounds access and a crash.

Fixes: 5ccd66e01cbe ("net: lan966x: add support for interrupts from analyzer")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 drivers/net/ethernet/microchip/lan966x/lan966x_mac.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_mac.c b/drivers/net/ethernet/microchip/lan966x/lan966x_mac.c
index 2679111ef669..005e56ea5da1 100644
--- a/drivers/net/ethernet/microchip/lan966x/lan966x_mac.c
+++ b/drivers/net/ethernet/microchip/lan966x/lan966x_mac.c
@@ -346,7 +346,7 @@ static void lan966x_mac_irq_process(struct lan966x *lan966x, u32 row,
 
 			lan966x_mac_process_raw_entry(&raw_entries[column],
 						      mac, &vid, &dest_idx);
-			if (WARN_ON(dest_idx > lan966x->num_phys_ports))
+			if (WARN_ON(dest_idx >= lan966x->num_phys_ports))
 				continue;
 
 			/* If the entry in SW is found, then there is nothing
@@ -393,7 +393,7 @@ static void lan966x_mac_irq_process(struct lan966x *lan966x, u32 row,
 
 		lan966x_mac_process_raw_entry(&raw_entries[column],
 					      mac, &vid, &dest_idx);
-		if (WARN_ON(dest_idx > lan966x->num_phys_ports))
+		if (WARN_ON(dest_idx >= lan966x->num_phys_ports))
 			continue;
 
 		mac_entry = lan966x_mac_alloc_entry(mac, vid, dest_idx);
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next] net: generalize skb freeing deferral to per-cpu lists
From: Eric Dumazet @ 2022-04-21 15:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.

For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 71874 Mbit.

Kernel profiles on cpu running user thread recvmsg() looks better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |  3 +++
 include/linux/skbuff.h    |  2 ++
 include/net/sock.h        |  2 --
 include/net/tcp.h         | 12 ----------
 net/core/dev.c            | 27 +++++++++++++++++++++++
 net/core/skbuff.c         | 46 ++++++++++++++++++++++++++++++++++++++-
 net/core/sock.c           |  3 ---
 net/ipv4/tcp.c            | 25 +--------------------
 net/ipv4/tcp_ipv4.c       |  1 -
 net/ipv6/tcp_ipv6.c       |  1 -
 net/tls/tls_sw.c          |  2 --
 11 files changed, 78 insertions(+), 46 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7dccbfd1bf5635c27514c70b4a06d3e6f74395dd..0162a9bdc9291e7aae967a044788d09bd2ef2423 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3081,6 +3081,9 @@ struct softnet_data {
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 
+	/* Another possibly contended cache line */
+	struct sk_buff_head	skb_defer_list ____cacheline_aligned_in_smp;
+	call_single_data_t  csd_defer;
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 84d78df60453955a8eaf05847f6e2145176a727a..2fe311447fae5e860eee95f6e8772926d4915e9f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1080,6 +1080,7 @@ struct sk_buff {
 		unsigned int	sender_cpu;
 	};
 #endif
+	u16			alloc_cpu;
 #ifdef CONFIG_NETWORK_SECMARK
 	__u32		secmark;
 #endif
@@ -1321,6 +1322,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb_around(struct sk_buff *skb,
 				 void *data, unsigned int frag_size);
+void skb_attempt_defer_free(struct sk_buff *skb);
 
 struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index a01d6c421aa2caad4032167284eed9565d4bd545..f9f8ecae0f8decb3e0e74c6efaff5b890e3685ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -292,7 +292,6 @@ struct sk_filter;
   *	@sk_pacing_shift: scaling factor for TCP Small Queues
   *	@sk_lingertime: %SO_LINGER l_linger setting
   *	@sk_backlog: always used with the per-socket spinlock held
-  *	@defer_list: head of llist storing skbs to be freed
   *	@sk_callback_lock: used with the callbacks in the end of this struct
   *	@sk_error_queue: rarely used
   *	@sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
@@ -417,7 +416,6 @@ struct sock {
 		struct sk_buff	*head;
 		struct sk_buff	*tail;
 	} sk_backlog;
-	struct llist_head defer_list;
 
 #define sk_rmem_alloc sk_backlog.rmem_alloc
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 679b1964d49414fcb1c361778fd0ac664e8c8ea5..94a52ad1101c12e13c2957e8b028b697742c451f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1375,18 +1375,6 @@ static inline bool tcp_checksum_complete(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 		     enum skb_drop_reason *reason);
 
-#ifdef CONFIG_INET
-void __sk_defer_free_flush(struct sock *sk);
-
-static inline void sk_defer_free_flush(struct sock *sk)
-{
-	if (llist_empty(&sk->defer_list))
-		return;
-	__sk_defer_free_flush(sk);
-}
-#else
-static inline void sk_defer_free_flush(struct sock *sk) {}
-#endif
 
 int tcp_filter(struct sock *sk, struct sk_buff *skb);
 void tcp_set_state(struct sock *sk, int state);
diff --git a/net/core/dev.c b/net/core/dev.c
index 4a77ebda4fb155581a5f761a864446a046987f51..4136d9c0ada6870ea0f7689702bdb5f0bbf29145 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4545,6 +4545,12 @@ static void rps_trigger_softirq(void *data)
 
 #endif /* CONFIG_RPS */
 
+/* Called from hardirq (IPI) context */
+static void trigger_rx_softirq(void *data)
+{
+	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+}
+
 /*
  * Check if this softnet_data structure is another cpu one
  * If yes, queue it to our IPI list and return 1
@@ -6571,6 +6577,24 @@ static int napi_threaded_poll(void *data)
 	return 0;
 }
 
+static void skb_defer_free_flush(struct softnet_data *sd)
+{
+	struct sk_buff_head list;
+	struct sk_buff *skb;
+	unsigned long flags;
+
+	if (skb_queue_empty_lockless(&sd->skb_defer_list))
+		return;
+	__skb_queue_head_init(&list);
+
+	spin_lock_irqsave(&sd->skb_defer_list.lock, flags);
+	skb_queue_splice_init(&sd->skb_defer_list, &list);
+	spin_unlock_irqrestore(&sd->skb_defer_list.lock, flags);
+
+	while ((skb = __skb_dequeue(&list)) != NULL)
+		__kfree_skb(skb);
+}
+
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
@@ -6616,6 +6640,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
 		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 
 	net_rps_action_and_irq_enable(sd);
+	skb_defer_free_flush(sd);
 }
 
 struct netdev_adjacent {
@@ -11326,6 +11351,8 @@ static int __init net_dev_init(void)
 		INIT_CSD(&sd->csd, rps_trigger_softirq, sd);
 		sd->cpu = i;
 #endif
+		INIT_CSD(&sd->csd_defer, trigger_rx_softirq, NULL);
+		skb_queue_head_init(&sd->skb_defer_list);
 
 		init_gro_hash(&sd->backlog);
 		sd->backlog.poll = process_backlog;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 30b523fa4ad2e9be30bdefdc61f70f989c345bbf..c0ab436401b5d1d985e6eccc393ffa4ad007843b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -204,7 +204,7 @@ static void __build_skb_around(struct sk_buff *skb, void *data,
 	skb_set_end_offset(skb, size);
 	skb->mac_header = (typeof(skb->mac_header))~0U;
 	skb->transport_header = (typeof(skb->transport_header))~0U;
-
+	skb->alloc_cpu = raw_smp_processor_id();
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
@@ -1037,6 +1037,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	CHECK_SKB_FIELD(napi_id);
 #endif
+	CHECK_SKB_FIELD(alloc_cpu);
 #ifdef CONFIG_XPS
 	CHECK_SKB_FIELD(sender_cpu);
 #endif
@@ -6486,3 +6487,46 @@ void __skb_ext_put(struct skb_ext *ext)
 }
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
+
+/**
+ * skb_attempt_defer_free - queue skb for remote freeing
+ * @skb: buffer
+ *
+ * Put @skb in a per-cpu list, using the cpu which
+ * allocated the skb/pages to reduce false sharing
+ * and memory zone spinlock contention.
+ */
+void skb_attempt_defer_free(struct sk_buff *skb)
+{
+	int cpu = skb->alloc_cpu;
+	struct softnet_data *sd;
+	unsigned long flags;
+	bool kick;
+
+	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) || !cpu_online(cpu)) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	sd = &per_cpu(softnet_data, cpu);
+	/* We do not send an IPI or any signal.
+	 * Remote cpu will eventually call skb_defer_free_flush()
+	 */
+	spin_lock_irqsave(&sd->skb_defer_list.lock, flags);
+	__skb_queue_tail(&sd->skb_defer_list, skb);
+
+	/* kick every time queue length reaches 128.
+	 * This should avoid blocking in smp_call_function_single_async().
+	 * This condition should hardly be bit under normal conditions,
+	 * unless cpu suddenly stopped to receive NIC interrupts.
+	 */
+	kick = skb_queue_len(&sd->skb_defer_list) == 128;
+
+	spin_unlock_irqrestore(&sd->skb_defer_list.lock, flags);
+
+	/* Make sure to trigger NET_RX_SOFTIRQ on the remote CPU
+	 * if we are unlucky enough (this seems very unlikely).
+	 */
+	if (unlikely(kick))
+		smp_call_function_single_async(cpu, &sd->csd_defer);
+}
diff --git a/net/core/sock.c b/net/core/sock.c
index 29abec3eabd8905f2671e0b5789878a129453ef6..a0f3989de3d62456665e8b6382a4681fba17d60c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2082,9 +2082,6 @@ void sk_destruct(struct sock *sk)
 {
 	bool use_call_rcu = sock_flag(sk, SOCK_RCU_FREE);
 
-	WARN_ON_ONCE(!llist_empty(&sk->defer_list));
-	sk_defer_free_flush(sk);
-
 	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
 		reuseport_detach_sock(sk);
 		use_call_rcu = true;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e20b87b3bf907a9b04b7531936129fb729e96c52..db55af9eb37b56bf0ec3b47212240c0302b86a1f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -843,7 +843,6 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 	}
 
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (spliced)
 		return spliced;
@@ -1589,20 +1588,6 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		tcp_send_ack(sk);
 }
 
-void __sk_defer_free_flush(struct sock *sk)
-{
-	struct llist_node *head;
-	struct sk_buff *skb, *n;
-
-	head = llist_del_all(&sk->defer_list);
-	llist_for_each_entry_safe(skb, n, head, ll_node) {
-		prefetch(n);
-		skb_mark_not_on_list(skb);
-		__kfree_skb(skb);
-	}
-}
-EXPORT_SYMBOL(__sk_defer_free_flush);
-
 static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	__skb_unlink(skb, &sk->sk_receive_queue);
@@ -1610,11 +1595,7 @@ static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 		sock_rfree(skb);
 		skb->destructor = NULL;
 		skb->sk = NULL;
-		if (!skb_queue_empty(&sk->sk_receive_queue) ||
-		    !llist_empty(&sk->defer_list)) {
-			llist_add(&skb->ll_node, &sk->defer_list);
-			return;
-		}
+		return skb_attempt_defer_free(skb);
 	}
 	__kfree_skb(skb);
 }
@@ -2453,7 +2434,6 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			__sk_flush_backlog(sk);
 		} else {
 			tcp_cleanup_rbuf(sk, copied);
-			sk_defer_free_flush(sk);
 			sk_wait_data(sk, &timeo, last);
 		}
 
@@ -2571,7 +2551,6 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
 	lock_sock(sk);
 	ret = tcp_recvmsg_locked(sk, msg, len, flags, &tss, &cmsg_flags);
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (cmsg_flags && ret >= 0) {
 		if (cmsg_flags & TCP_CMSG_TS)
@@ -3096,7 +3075,6 @@ int tcp_disconnect(struct sock *sk, int flags)
 		sk->sk_frag.page = NULL;
 		sk->sk_frag.offset = 0;
 	}
-	sk_defer_free_flush(sk);
 	sk_error_report(sk);
 	return 0;
 }
@@ -4225,7 +4203,6 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
 							  &zc, &len, err);
 		release_sock(sk);
-		sk_defer_free_flush(sk);
 		if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
 			goto zerocopy_rcv_cmsg;
 		switch (len) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 157265aecbedd1761de2d892b5a54f4a0cfe1912..f5eb40e6ec48824bb684175c0dc920653c72a6d8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2066,7 +2066,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 782df529ff69e0b2fc5f9d12fc72538b7041a392..09031561fc0e74f2a92069aa5ebd78df9c39cd5a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1728,7 +1728,6 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ddbe05ec5489dd352dee832e038884339f338b43..bc54f6c5b1a4cabbfe1e3eff1768128b2730c730 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1911,7 +1911,6 @@ int tls_sw_recvmsg(struct sock *sk,
 
 end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	if (psock)
 		sk_psock_put(sk, psock);
 	return copied ? : err;
@@ -1983,7 +1982,6 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 
 splice_read_end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	return copied ? : err;
 }
 
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply related

* Re: [PATCH net] net: Use this_cpu_inc() to increment net->core_stats
From: Eric Dumazet @ 2022-04-21 15:32 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: netdev, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Thomas Gleixner, Peter Zijlstra
In-Reply-To: <YmFjdOp+R5gVGZ7p@linutronix.de>

On Thu, Apr 21, 2022 at 7:00 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
> netdev_core_stats_alloc() to return a per-CPU pointer.
> netdev_core_stats_alloc() will allocate memory on its first invocation
> which breaks on PREEMPT_RT because it requires non-atomic context for
> memory allocation.

Can you elaborate on this, I am confused ?

You are saying that on PREEMPT_RT, we can not call
alloc_percpu_gfp(XXX, GFP_ATOMIC | __GFP_NOWARN);
under some contexts ?

preemption might be disabled by callers of net->core_stats anyways...


>
> This can be avoided by enabling preemption in netdev_core_stats_alloc()
> assuming the caller always disables preemption.
>
> It might be better to replace local_inc() with this_cpu_inc() now that
> dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
> not rely on already disabled preemption. This results in less
> instructions on x86-64:
> local_inc:
> |          incl %gs:__preempt_count(%rip)  # __preempt_count
> |          movq    488(%rdi), %rax # _1->core_stats, _22
> |          testq   %rax, %rax      # _22
> |          je      .L585   #,
> |          add %gs:this_cpu_off(%rip), %rax        # this_cpu_off, tcp_ptr__
> |  .L586:
> |          testq   %rax, %rax      # _27
> |          je      .L587   #,
> |          incq (%rax)            # _6->a.counter
> |  .L587:
> |          decl %gs:__preempt_count(%rip)  # __preempt_count
>
> this_cpu_inc(), this patch:
> |         movq    488(%rdi), %rax # _1->core_stats, _5
> |         testq   %rax, %rax      # _5
> |         je      .L591   #,
> | .L585:
> |         incq %gs:(%rax) # _18->rx_dropped
>
> Use unsigned long as type for the counter. Use this_cpu_inc() to
> increment the counter. Use a plain read of the counter.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  include/linux/netdevice.h | 17 +++++++----------
>  net/core/dev.c            | 14 +++++---------
>  2 files changed, 12 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 59e27a2b7bf04..0009112b23767 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -199,10 +199,10 @@ struct net_device_stats {
>   * Try to fit them in a single cache line, for dev_get_stats() sake.
>   */
>  struct net_device_core_stats {
> -       local_t         rx_dropped;
> -       local_t         tx_dropped;
> -       local_t         rx_nohandler;
> -} __aligned(4 * sizeof(local_t));
> +       unsigned long   rx_dropped;
> +       unsigned long   tx_dropped;
> +       unsigned long   rx_nohandler;
> +} __aligned(4 * sizeof(unsigned long));
>
>  #include <linux/cache.h>
>  #include <linux/skbuff.h>
> @@ -3843,7 +3843,7 @@ static __always_inline bool __is_skb_forwardable(const struct net_device *dev,
>         return false;
>  }
>
> -struct net_device_core_stats *netdev_core_stats_alloc(struct net_device *dev);
> +struct net_device_core_stats __percpu *netdev_core_stats_alloc(struct net_device *dev);
>
>  static inline struct net_device_core_stats *dev_core_stats(struct net_device *dev)
>  {
> @@ -3851,7 +3851,7 @@ static inline struct net_device_core_stats *dev_core_stats(struct net_device *de
>         struct net_device_core_stats __percpu *p = READ_ONCE(dev->core_stats);
>
>         if (likely(p))
> -               return this_cpu_ptr(p);
> +               return p;
>
>         return netdev_core_stats_alloc(dev);
>  }
> @@ -3861,12 +3861,9 @@ static inline void dev_core_stats_##FIELD##_inc(struct net_device *dev)          \
>  {                                                                              \
>         struct net_device_core_stats *p;                                        \
>                                                                                 \
> -       preempt_disable();                                                      \
>         p = dev_core_stats(dev);                                                \
> -                                                                               \
>         if (p)                                                                  \
> -               local_inc(&p->FIELD);                                           \
> -       preempt_enable();                                                       \
> +               this_cpu_inc(p->FIELD);                                         \
>  }
>  DEV_CORE_STATS_INC(rx_dropped)
>  DEV_CORE_STATS_INC(tx_dropped)
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 9ec51c1d77cf4..bf6026158874e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -10309,7 +10309,7 @@ void netdev_stats_to_stats64(struct rtnl_link_stats64 *stats64,
>  }
>  EXPORT_SYMBOL(netdev_stats_to_stats64);
>
> -struct net_device_core_stats *netdev_core_stats_alloc(struct net_device *dev)
> +struct net_device_core_stats __percpu *netdev_core_stats_alloc(struct net_device *dev)
>  {
>         struct net_device_core_stats __percpu *p;
>
> @@ -10320,11 +10320,7 @@ struct net_device_core_stats *netdev_core_stats_alloc(struct net_device *dev)
>                 free_percpu(p);
>
>         /* This READ_ONCE() pairs with the cmpxchg() above */
> -       p = READ_ONCE(dev->core_stats);
> -       if (!p)
> -               return NULL;
> -
> -       return this_cpu_ptr(p);
> +       return READ_ONCE(dev->core_stats);
>  }
>  EXPORT_SYMBOL(netdev_core_stats_alloc);
>
> @@ -10361,9 +10357,9 @@ struct rtnl_link_stats64 *dev_get_stats(struct net_device *dev,
>
>                 for_each_possible_cpu(i) {
>                         core_stats = per_cpu_ptr(p, i);
> -                       storage->rx_dropped += local_read(&core_stats->rx_dropped);
> -                       storage->tx_dropped += local_read(&core_stats->tx_dropped);
> -                       storage->rx_nohandler += local_read(&core_stats->rx_nohandler);
> +                       storage->rx_dropped += core_stats->rx_dropped;
> +                       storage->tx_dropped += core_stats->tx_dropped;
> +                       storage->rx_nohandler += core_stats->rx_nohandler;
>                 }
>         }
>         return storage;
> --
> 2.35.2
>

^ permalink raw reply

* Re: [PATCH 3/5] hv_sock: Add validation for untrusted Hyper-V values
From: Andrea Parri @ 2022-04-21 15:30 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, Michael Kelley, David Miller, Jakub Kicinski,
	Paolo Abeni, linux-hyperv, virtualization, netdev, linux-kernel
In-Reply-To: <20220421140805.qg4cwqhsq5vuqkut@sgarzare-redhat>

> > @@ -577,12 +577,19 @@ static bool hvs_dgram_allow(u32 cid, u32 port)
> > static int hvs_update_recv_data(struct hvsock *hvs)
> > {
> > 	struct hvs_recv_buf *recv_buf;
> > -	u32 payload_len;
> > +	u32 pkt_len, payload_len;
> > +
> > +	pkt_len = hv_pkt_len(hvs->recv_desc);
> > +
> > +	/* Ensure the packet is big enough to read its header */
> > +	if (pkt_len < HVS_HEADER_LEN)
> > +		return -EIO;
> > 
> > 	recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
> > 	payload_len = recv_buf->hdr.data_size;
> > 
> > -	if (payload_len > HVS_MTU_SIZE)
> > +	/* Ensure the packet is big enough to read its payload */
> > +	if (payload_len > pkt_len - HVS_HEADER_LEN || payload_len > HVS_MTU_SIZE)
> 
> checkpatch warns that we exceed 80 characters, I do not have a strong
> opinion on this, but if you have to resend better break the condition into 2
> lines.

Will break if preferred.  (but does it really warn??  I understand that
the warning was deprecated and the "limit" increased to 100 chars...)


> Maybe even update or remove the comment? (it only describes the first
> condition, but the conditions are pretty clear, so I don't think it adds
> much).

Works for me.  (taking it as this applies to the previous comment too.)

Thanks,
  Andrea

^ permalink raw reply

* Re: [PATCH 2/5] hv_sock: Copy packets sent by Hyper-V out of the ring buffer
From: Andrea Parri @ 2022-04-21 15:21 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, Michael Kelley, David Miller, Jakub Kicinski,
	Paolo Abeni, linux-hyperv, virtualization, netdev, linux-kernel
In-Reply-To: <20220421135839.2fj6fk6bvlrau73o@sgarzare-redhat>

> > @@ -378,6 +381,8 @@ static void hvs_open_connection(struct vmbus_channel *chan)
> > 		rcvbuf = ALIGN(rcvbuf, HV_HYP_PAGE_SIZE);
> > 	}
> > 
> > +	chan->max_pkt_size = HVS_MAX_PKT_SIZE;
> > +
> 
> premise, I don't know HyperV channels :-(
> 
> Is this change necessary to use hv_pkt_iter_first() instead of
> hv_pkt_iter_first_raw()?

Yes, the change is required to initialize the buffer which holds the
copies of the incoming packets (in hv_ringbuffer_init()).


> If yes, then please mention that you set this value in the commit message,
> otherwise maybe better to have a separate patch.

Sure, will do.

Thanks,
  Andrea

^ permalink raw reply

* Re: [PATCH net] sctp: check asoc strreset_chunk in sctp_generate_reconf_event
From: Xin Long @ 2022-04-21 15:04 UTC (permalink / raw)
  To: David Miller
  Cc: network dev, linux-sctp @ vger . kernel . org, Jakub Kicinski,
	Marcelo Ricardo Leitner, Neil Horman
In-Reply-To: <20220421.111807.1804226251338066304.davem@davemloft.net>

On Thu, Apr 21, 2022 at 6:18 AM David Miller <davem@davemloft.net> wrote:
>
> From: Xin Long <lucien.xin@gmail.com>
> Date: Wed, 20 Apr 2022 16:52:41 -0400
>
> > diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
> > index b3815b568e8e..463c4a58d2c3 100644
> > --- a/net/sctp/sm_sideeffect.c
> > +++ b/net/sctp/sm_sideeffect.c
> > @@ -458,6 +458,10 @@ void sctp_generate_reconf_event(struct timer_list *t)
> >               goto out_unlock;
> >       }
> >
> > +     /* This happens when the response arrives after the timer is triggered. */
> > +     if (!asoc->strreset_chunk)
> > +             goto out_unlock;
> > +
> This will return 0 because that is error's value right there, intentional?
intentional, this won't be treated as an error.

^ permalink raw reply

* Re: [PATCH] [net-next] selftests: net: vrf_strict_mode_test: add support to select a test to run
From: Roopa Prabhu @ 2022-04-21 14:53 UTC (permalink / raw)
  To: Jaehee Park, outreachy, Julia Denham, Roopa Prabhu,
	Stefano Brivio, netdev
In-Reply-To: <20220420044647.GA1288137@jaehee-ThinkPad-X1-Extreme>


On 4/19/22 21:46, Jaehee Park wrote:
> Add a boilerplate test loop to run all tests in
> vrf_strict_mode_test.sh.
>
> Signed-off-by: Jaehee Park <jhpark1013@gmail.com>
> ---

pls ignore this, duplicate patch


>   .../testing/selftests/net/vrf_strict_mode_test.sh  | 14 +++++++++++++-
>   1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/net/vrf_strict_mode_test.sh b/tools/testing/selftests/net/vrf_strict_mode_test.sh
> index 865d53c1781c..116ca43381b5 100755
> --- a/tools/testing/selftests/net/vrf_strict_mode_test.sh
> +++ b/tools/testing/selftests/net/vrf_strict_mode_test.sh
> @@ -14,6 +14,8 @@ INIT_NETNS_NAME="init"
>   
>   PAUSE_ON_FAIL=${PAUSE_ON_FAIL:=no}
>   
> +TESTS="vrf_strict_mode_tests_init vrf_strict_mode_tests_testns vrf_strict_mode_tests_mix"
> +
>   log_test()
>   {
>   	local rc=$1
> @@ -391,7 +393,17 @@ fi
>   cleanup &> /dev/null
>   
>   setup
> -vrf_strict_mode_tests
> +for t in $TESTS
> +do
> +	case $t in
> +	vrf_strict_mode_tests_init|vrf_strict_mode_init) vrf_strict_mode_tests_init;;
> +	vrf_strict_mode_tests_testns|vrf_strict_mode_testns) vrf_strict_mode_tests_testns;;
> +	vrf_strict_mode_tests_mix|vrf_strict_mode_mix) vrf_strict_mode_tests_mix;;
> +
> +	help) echo "Test names: $TESTS"; exit 0;;
> +
> +	esac
> +done
>   cleanup
>   
>   print_log_test_results

^ permalink raw reply

* Re: [PATCH net-next] selftests: net: vrf_strict_mode_test: add support to select a test to run
From: Roopa Prabhu @ 2022-04-21 14:52 UTC (permalink / raw)
  To: Jaehee Park, outreachy, Julia Denham, Roopa Prabhu,
	Stefano Brivio, netdev
In-Reply-To: <20220420045512.GA1289782@jaehee-ThinkPad-X1-Extreme>


On 4/19/22 21:55, Jaehee Park wrote:
> Add a boilerplate test loop to run all tests in
> vrf_strict_mode_test.sh.
>
> Signed-off-by: Jaehee Park <jhpark1013@gmail.com>
> ---

Jaehee, this needs more work. The idea is to be able to run individual 
tests with -t option.

An example is drop_monitor_tests.sh, see the usage and getopts arg 
parsing at the beginning of the test

eg ./drop_monitor_tests.sh -t <testname>


>   .../testing/selftests/net/vrf_strict_mode_test.sh  | 14 +++++++++++++-
>   1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/net/vrf_strict_mode_test.sh b/tools/testing/selftests/net/vrf_strict_mode_test.sh
> index 865d53c1781c..116ca43381b5 100755
> --- a/tools/testing/selftests/net/vrf_strict_mode_test.sh
> +++ b/tools/testing/selftests/net/vrf_strict_mode_test.sh
> @@ -14,6 +14,8 @@ INIT_NETNS_NAME="init"
>   
>   PAUSE_ON_FAIL=${PAUSE_ON_FAIL:=no}
>   
> +TESTS="vrf_strict_mode_tests_init vrf_strict_mode_tests_testns vrf_strict_mode_tests_mix"
> +
>   log_test()
>   {
>   	local rc=$1
> @@ -391,7 +393,17 @@ fi
>   cleanup &> /dev/null
>   
>   setup
> -vrf_strict_mode_tests
> +for t in $TESTS
> +do
> +	case $t in
> +	vrf_strict_mode_tests_init|vrf_strict_mode_init) vrf_strict_mode_tests_init;;
> +	vrf_strict_mode_tests_testns|vrf_strict_mode_testns) vrf_strict_mode_tests_testns;;
> +	vrf_strict_mode_tests_mix|vrf_strict_mode_mix) vrf_strict_mode_tests_mix;;
> +
> +	help) echo "Test names: $TESTS"; exit 0;;
> +
> +	esac
> +done
>   cleanup
>   
>   print_log_test_results

^ permalink raw reply

* Re: [PATCH bpf-next] selftests/bpf: fix attach tests retcode checks
From: patchwork-bot+netdevbpf @ 2022-04-21 14:50 UTC (permalink / raw)
  To: Artem Savkov
  Cc: laoar.shao, ast, daniel, andrii, bpf, netdev, linux-kernel,
	linux-kselftest
In-Reply-To: <20220421130104.1582053-1-asavkov@redhat.com>

Hello:

This patch was applied to bpf/bpf-next.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Thu, 21 Apr 2022 15:01:04 +0200 you wrote:
> Switching to libbpf 1.0 API broke test_sock and test_sysctl as they
> check for return of bpf_prog_attach to be exactly -1. Switch the check
> to '< 0' instead.
> 
> Fixes: b858ba8c52b6 ("selftests/bpf: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK")
> Signed-off-by: Artem Savkov <asavkov@redhat.com>
> 
> [...]

Here is the summary with links:
  - [bpf-next] selftests/bpf: fix attach tests retcode checks
    https://git.kernel.org/bpf/bpf-next/c/920fd5e1771d

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: Support for IEEE1588 timestamping in the BCM54210PE PHY using the kernel mii_timestamper interface
From: Richard Cochran @ 2022-04-21 14:48 UTC (permalink / raw)
  To: Lasse Johnsen; +Cc: netdev, Gordon Hollingworth, Ahmad Byagowi
In-Reply-To: <928593CA-9CE9-4A54-B84A-9973126E026D@timebeat.app>

On Wed, Apr 20, 2022 at 03:03:26PM +0100, Lasse Johnsen wrote:

> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index bd1f419bc47ae..2fa6258103025 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c

This change is unrelated to the PHY driver and should be in its own
patch series.

> @@ -39,8 +39,11 @@
>  
>  #include <asm/unaligned.h>
>  
> +#include <linux/ptp_classify.h>
> +
>  #include "bcmgenet.h"
>  
> +

Don't add extra blank lines.  As Andrew commented, run your code
through checkpatch.pl and fix the coding style.

>  /* Maximum number of hardware queues, downsized if needed */
>  #define GENET_MAX_MQ_CNT	4
>  
> @@ -2096,7 +2099,18 @@ static netdev_tx_t bcmgenet_xmit(struct sk_buff *skb, struct net_device *dev)
>  	}
>  
>  	GENET_CB(skb)->last_cb = tx_cb_ptr;
> -	skb_tx_timestamp(skb);
> +
> +	// Timestamping
> +	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
> +	{
> +		//skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +		skb_pull(skb, skb_mac_offset(skb)); // Feels like this pull should really be part of ptp_classify_raw...
> +		skb_clone_tx_timestamp(skb);
> +	}
> +	else if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP))
> +	{
> +		skb_tstamp_tx(skb, NULL);
> +	}

This is totally wrong.  skb_tx_timestamp() is the correct MAC driver API.

>  
>  	/* Decrement total BD count and advance our write pointer */
>  	ring->free_bds -= nr_frags + 1;
> diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
> index a13e402074cf8..528192d59d793 100644
> --- a/drivers/net/phy/Makefile
> +++ b/drivers/net/phy/Makefile
> @@ -46,6 +46,7 @@ obj-$(CONFIG_BCM84881_PHY)	+= bcm84881.o
>  obj-$(CONFIG_BCM87XX_PHY)	+= bcm87xx.o
>  obj-$(CONFIG_BCM_CYGNUS_PHY)	+= bcm-cygnus.o
>  obj-$(CONFIG_BCM_NET_PHYLIB)	+= bcm-phy-lib.o
> +obj-$(CONFIG_BCM54120PE_PHY)	+= bcm54210pe_ptp.o
>  obj-$(CONFIG_BROADCOM_PHY)	+= broadcom.o
>  obj-$(CONFIG_CICADA_PHY)	+= cicada.o
>  obj-$(CONFIG_CORTINA_PHY)	+= cortina.o

> diff --git a/drivers/net/phy/bcm54210pe_ptp.c b/drivers/net/phy/bcm54210pe_ptp.c
> new file mode 100755
> index 0000000000000..c4882c84229f9
> --- /dev/null
> +++ b/drivers/net/phy/bcm54210pe_ptp.c
> @@ -0,0 +1,1406 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + *  drivers/net/phy/bcm54210pe_ptp.c
> + *
> + * IEEE1588 (PTP), perout and extts for BCM54210PE PHY
> + *
> + * Authors: Carlos Fernandez, Kyle Judd, Lasse Johnsen
> + * License: GPL

Use the proper MODULE macros for this info.

> + */
> +
> +#include <linux/gpio/consumer.h>
> +#include <linux/ip.h>                                                                                
> +#include <linux/net_tstamp.h>
> +#include <linux/mii.h>
> +#include <linux/phy.h>                                                                               
> +#include <linux/ptp_classify.h>
> +#include <linux/ptp_clock_kernel.h>                                                                  
> +#include <linux/udp.h>
> +#include <asm/unaligned.h> 
> +#include <linux/brcmphy.h>
> +#include <linux/irq.h>
> +#include <linux/workqueue.h>
> +#include <linux/gpio.h>
> +#include <linux/if_ether.h>
> +#include <linux/delay.h>
> +#include <linux/sched.h>

Suggest keeping include directives in alphabetical order.

> +
> +#include "bcm54210pe_ptp.h"
> +#include "bcm-phy-lib.h"
> +
> +#define PTP_CONTROL_OFFSET		32
> +#define PTP_TSMT_OFFSET 		0
> +#define PTP_SEQUENCE_ID_OFFSET		30
> +#define PTP_CLOCK_ID_OFFSET		20
> +#define PTP_CLOCK_ID_SIZE		8
> +#define PTP_SEQUENCE_PORT_NUMER_OFFSET  (PTP_CLOCK_ID_OFFSET + PTP_CLOCK_ID_SIZE)
> +
> +#define EXT_SELECT_REG			0x17
> +#define EXT_DATA_REG			0x15
> +
> +#define EXT_ENABLE_REG1			0x17
> +#define EXT_ENABLE_DATA1		0x0F7E
> +#define EXT_ENABLE_REG2			0x15
> +#define EXT_ENABLE_DATA2		0x0000
> +
> +#define EXT_1588_SLICE_REG		0x0810
> +#define EXT_1588_SLICE_DATA		0x0101

EXT_1588_SLICE_DATA is not used anywhere.

> +
> +#define ORIGINAL_TIME_CODE_0 		0x0854
> +#define ORIGINAL_TIME_CODE_1 		0x0855
> +#define ORIGINAL_TIME_CODE_2 		0x0856
> +#define ORIGINAL_TIME_CODE_3 		0x0857
> +#define ORIGINAL_TIME_CODE_4 		0x0858
> +
> +#define TIME_STAMP_REG_0		0x0889
> +#define TIME_STAMP_REG_1		0x088A
> +#define TIME_STAMP_REG_2		0x088B
> +#define TIME_STAMP_REG_3		0x08C4
> +#define TIME_STAMP_INFO_1		0x088C
> +#define TIME_STAMP_INFO_2		0x088D
> +#define INTERRUPT_STATUS_REG		0x085F
> +#define INTERRUPT_MASK_REG		0x085E
> +#define EXT_SOFTWARE_RESET		0x0F70
> +#define EXT_RESET1			0x0001 //RESET
> +#define EXT_RESET2			0x0000 //NORMAL OPERATION
> +#define GLOBAL_TIMESYNC_REG		0x0FF5
> +
> +#define TX_EVENT_MODE_REG		0x0811
> +#define RX_EVENT_MODE_REG		0x0819
> +#define TX_TSCAPTURE_ENABLE_REG		0x0821
> +#define RX_TSCAPTURE_ENABLE_REG		0x0822
> +#define TXRX_1588_OPTION_REG		0x0823
> +
> +#define TX_TS_OFFSET_LSB		0x0834
> +#define TX_TS_OFFSET_MSB		0x0835
> +#define RX_TS_OFFSET_LSB		0x0844
> +#define RX_TS_OFFSET_MSB		0x0845
> +#define NSE_DPPL_NCO_1_LSB_REG		0x0873
> +#define NSE_DPPL_NCO_1_MSB_REG		0x0874
> +
> +#define NSE_DPPL_NCO_2_0_REG		0x0875
> +#define NSE_DPPL_NCO_2_1_REG		0x0876
> +#define NSE_DPPL_NCO_2_2_REG		0x0877
> +
> +#define NSE_DPPL_NCO_3_0_REG		0x0878
> +#define NSE_DPPL_NCO_3_1_REG		0x0879
> +#define NSE_DPPL_NCO_3_2_REG		0x087A
> +
> +#define NSE_DPPL_NCO_4_REG		0x087B
> +
> +#define NSE_DPPL_NCO_5_0_REG		0x087C
> +#define NSE_DPPL_NCO_5_1_REG		0x087D
> +#define NSE_DPPL_NCO_5_2_REG		0x087E
> +
> +#define NSE_DPPL_NCO_6_REG		0x087F
> +
> +#define NSE_DPPL_NCO_7_0_REG		0x0880
> +#define NSE_DPPL_NCO_7_1_REG		0x0881
> +
> +#define DPLL_SELECT_REG			0x085b
> +#define TIMECODE_SEL_REG		0x08C3
> +#define SHADOW_REG_CONTROL		0x085C
> +#define SHADOW_REG_LOAD			0x085D
> +
> +#define PTP_INTERRUPT_REG		0x0D0C
> +
> +#define CTR_DBG_REG			0x088E
> +#define HEART_BEAT_REG4			0x08ED
> +#define HEART_BEAT_REG3			0x08EC
> +#define HEART_BEAT_REG2			0x0888
> +#define	HEART_BEAT_REG1			0x0887
> +#define	HEART_BEAT_REG0			0x0886
> +	
> +#define READ_END_REG			0x0885
> +
> +static bool bcm54210pe_fetch_timestamp(u8 txrx, u8 message_type, u16 seq_id, struct bcm54210pe_private *private, u64 *timestamp)
> +{
> +	struct bcm54210pe_circular_buffer_item *item; 
> +	struct list_head *this, *next;
> +
> +	u8 index = (txrx * 4) + message_type;
> +
> +	if(index >= CIRCULAR_BUFFER_COUNT)
> +	{
> +		return false;
> +	}
> +
> +	list_for_each_safe(this, next, &private->circular_buffers[index])
> +	{
> +		item = list_entry(this, struct bcm54210pe_circular_buffer_item, list);
> +
> +		if(item->sequence_id == seq_id && item->is_valid)
> +		{
> +			item->is_valid = false;

Instead of using this flag, just remove matched items from list.

> +			*timestamp = item->time_stamp;
> +			mutex_unlock(&private->timestamp_buffer_lock);
> +			return true;
> +		}
> +	}
> +
> +	return false;
> +}
> +
> +static void bcm54210pe_read_sop_time_register(struct bcm54210pe_private *private)
> +{
> +	struct phy_device *phydev = private->phydev;
> +	struct bcm54210pe_circular_buffer_item *item;
> +	u16 fifo_info_1, fifo_info_2;
> +	u8 tx_or_rx, msg_type, index;
> +	u16 sequence_id;
> +	u64 timestamp;
> +	u16 Time[4];
> +
> +	mutex_lock(&private->timestamp_buffer_lock);
> +
> +	while(bcm_phy_read_exp(phydev, INTERRUPT_STATUS_REG) & 2)

Replace magic number 2 proper macro.

Also, this loop is potentially infinite.  Add a sanity check to break out.

> +	{
> +		mutex_lock(&private->clock_lock);
> +
> +		// Flush out the FIFO
> +		bcm_phy_write_exp(phydev, READ_END_REG, 1);
> +
> +		Time[3] = bcm_phy_read_exp(phydev, TIME_STAMP_REG_3);
> +		Time[2] = bcm_phy_read_exp(phydev, TIME_STAMP_REG_2);
> +		Time[1] = bcm_phy_read_exp(phydev, TIME_STAMP_REG_1);
> +		Time[0] = bcm_phy_read_exp(phydev, TIME_STAMP_REG_0);
> +
> +		fifo_info_1 = bcm_phy_read_exp(phydev, TIME_STAMP_INFO_1);
> +		fifo_info_2 = bcm_phy_read_exp(phydev, TIME_STAMP_INFO_2);
> +
> +		bcm_phy_write_exp(phydev, READ_END_REG, 2);
> +		bcm_phy_write_exp(phydev, READ_END_REG, 0);
> +
> +		mutex_unlock(&private->clock_lock);
> +
> +		msg_type = (u8) ((fifo_info_2 & 0xF000) >> 12);
> +		tx_or_rx = (u8) ((fifo_info_2 & 0x0800) >> 11); // 1 = TX, 0 = RX
> +		sequence_id = fifo_info_1;
> +
> +		timestamp = four_u16_to_ns(Time);
> +
> +		index = (tx_or_rx * 4) + msg_type;
> +
> +		if(index < CIRCULAR_BUFFER_COUNT)
> +		{
> +			item = list_first_entry_or_null(&private->circular_buffers[index], struct bcm54210pe_circular_buffer_item, list);
> +		}
> +
> +		if(item == NULL) {
> +			continue;
> +		}
> +
> +		list_del_init(&item->list);
> +
> +		item->msg_type = msg_type;
> +		item->sequence_id = sequence_id;
> +		item->time_stamp = timestamp;
> +		item->is_valid = true;
> +
> +		list_add_tail(&item->list, &private->circular_buffers[index]);
> +	}
> +
> +	mutex_unlock(&private->timestamp_buffer_lock);
> +}
> +
> +static void bcm54210pe_run_rx_timestamp_match_thread(struct work_struct *w)
> +{
> +	struct bcm54210pe_private *private =
> +		container_of(w, struct bcm54210pe_private, rxts_work);
> +
> +	struct skb_shared_hwtstamps *shhwtstamps;
> +	struct ptp_header *hdr;
> +	struct sk_buff *skb;
> +
> +	u8 msg_type;
> +	u16 sequence_id;
> +	u64 timestamp;
> +	int x, type;
> +
> +	skb = skb_dequeue(&private->rx_skb_queue);
> +
> +	while (skb != NULL) {
> +
> +		// Yes....  skb_defer_rx_timestamp just did this but <ZZZzzz>....
> +		skb_push(skb, ETH_HLEN);
> +		type = ptp_classify_raw(skb);
> +		skb_pull(skb, ETH_HLEN);
> +
> +		hdr = ptp_parse_header(skb, type);
> +
> +		if (hdr == NULL) {
> +			goto dequeue;
> +		}
> +
> +		msg_type = ptp_get_msgtype(hdr, type);
> +		sequence_id = be16_to_cpu(hdr->sequence_id);
> +
> +		timestamp = 0;
> +
> +		for (x = 0; x < 10; x++) {

Ten times? and ...

> +			bcm54210pe_read_sop_time_register(private);
> +			if (bcm54210pe_fetch_timestamp(0, msg_type, sequence_id,
> +						       private, &timestamp)) {
> +				break;
> +			}
> +
> +			udelay(private->fib_sequence[x] *
> +			       private->fib_factor_rx);

with a cute udelay?  No, use a proper deferral mechanism.

> +		}
> +
> +		shhwtstamps = skb_hwtstamps(skb);
> +
> +		if (!shhwtstamps) {
> +			goto dequeue;
> +		}
> +
> +		memset(shhwtstamps, 0, sizeof(*shhwtstamps));
> +		shhwtstamps->hwtstamp = ns_to_ktime(timestamp);
> +
> +	dequeue:
> +		netif_rx_ni(skb);
> +		skb = skb_dequeue(&private->rx_skb_queue);
> +	}
> +}
> +
> +static void bcm54210pe_run_tx_timestamp_match_thread(struct work_struct *w)
> +{
> +	struct bcm54210pe_private *private =
> +		container_of(w, struct bcm54210pe_private, txts_work);
> +
> +	struct skb_shared_hwtstamps *shhwtstamps;
> +	struct sk_buff *skb;
> +
> +	struct ptp_header *hdr;
> +	u8 msg_type;
> +	u16 sequence_id;
> +	u64 timestamp;
> +	int x;
> +
> +	timestamp = 0;
> +	skb = skb_dequeue(&private->tx_skb_queue);
> +
> +	while (skb != NULL) {
> +		int type = ptp_classify_raw(skb);
> +		hdr = ptp_parse_header(skb, type);
> +
> +		if (!hdr) {
> +			return;
> +		}
> +
> +		msg_type = ptp_get_msgtype(hdr, type);
> +		sequence_id = be16_to_cpu(hdr->sequence_id);
> +
> +		for (x = 0; x < 10; x++) {
> +			bcm54210pe_read_sop_time_register(private);
> +			if (bcm54210pe_fetch_timestamp(1, msg_type, sequence_id,
> +						       private, &timestamp)) {
> +				break;
> +			}
> +			udelay(private->fib_sequence[x] * private->fib_factor_tx);
> +		}
> +		shhwtstamps = skb_hwtstamps(skb);
> +
> +		if (!shhwtstamps) {
> +			kfree_skb(skb);
> +			goto dequeue;
> +		}
> +
> +		memset(shhwtstamps, 0, sizeof(*shhwtstamps));
> +		shhwtstamps->hwtstamp = ns_to_ktime(timestamp);
> +
> +		skb_complete_tx_timestamp(skb, shhwtstamps);
> +
> +	dequeue:
> +		skb = skb_dequeue(&private->tx_skb_queue);
> +	}
> +}
> +
> +static int bcm54210pe_config_1588(struct phy_device *phydev)
> +{
> +	int err;
> +
> +	err = bcm_phy_write_exp(phydev, PTP_INTERRUPT_REG, 0x3c02 );
> +
> +	err |=  bcm_phy_write_exp(phydev, GLOBAL_TIMESYNC_REG, 0x0001); //Enable global timesync register.
> +	err |=  bcm_phy_write_exp(phydev, EXT_1588_SLICE_REG, 0x0101); //ENABLE TX and RX slice 1588

Does this device support multiple ports?  If so, the driver should
support that.

> +	err |=  bcm_phy_write_exp(phydev, TX_EVENT_MODE_REG, 0xFF00); //Add 80bit timestamp + NO CPU MODE in TX
> +	err |=  bcm_phy_write_exp(phydev, RX_EVENT_MODE_REG, 0xFF00); //Add 32+32 bits timestamp + NO CPU mode in RX
> +	err |=  bcm_phy_write_exp(phydev, TIMECODE_SEL_REG, 0x0101); //Select 80 bit counter
> +	err |=  bcm_phy_write_exp(phydev, TX_TSCAPTURE_ENABLE_REG, 0x0001); //Enable timestamp capture in TX
> +	err |=  bcm_phy_write_exp(phydev, RX_TSCAPTURE_ENABLE_REG, 0x0001); //Enable timestamp capture in RX
> +
> +	//Enable shadow register
> +	err |= bcm_phy_write_exp(phydev, SHADOW_REG_CONTROL, 0x0000);
> +	err |= bcm_phy_write_exp(phydev, SHADOW_REG_LOAD, 0x07c0);
> +
> +
> +	// Set global mode and trigger immediate framesync to load shaddow registers
> +	err |=  bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, 0xC020);
> +
> +	//15, 33 or 41 - experimental
> +	printk("DEBUG: GPIO %d IRQ %d\n", 15, gpio_to_irq(41));
> +
> +	// Enable Interrupt behaviour (eventhough we get no interrupts)
> +	err |= bcm54210pe_interrupts_enable(phydev,true, false);
> +	
> +	return err; 
> +}
> +
> +static int bcm54210pe_gettimex(struct ptp_clock_info *info,
> +			       struct timespec64 *ts,
> +			       struct ptp_system_timestamp *sts)
> +{
> +
> +	struct bcm54210pe_ptp *ptp = container_of(info, struct bcm54210pe_ptp, caps);
> +	return bcm54210pe_get80bittime(ptp->chosen, ts, sts);
> +}
> +
> +// Must be called under clock_lock
> +static void bcm54210pe_read80bittime_register(struct phy_device *phydev, u64 *time_stamp_80, u64 *time_stamp_48)
> +{
> +	u16 Time[5];
> +	u64 ts[3];
> +	u64 cumulative;
> +
> +	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x400);
> +	Time[4] = bcm_phy_read_exp(phydev, HEART_BEAT_REG4);
> +	Time[3] = bcm_phy_read_exp(phydev, HEART_BEAT_REG3);
> +	Time[2] = bcm_phy_read_exp(phydev, HEART_BEAT_REG2);
> +	Time[1] = bcm_phy_read_exp(phydev, HEART_BEAT_REG1);
> +	Time[0] = bcm_phy_read_exp(phydev, HEART_BEAT_REG0);
> +
> +	// Set read end bit
> +	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x800);
> +	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x000);
> +
> +	*time_stamp_80 = four_u16_to_ns(Time);
> +
> +	if (time_stamp_48 != NULL) {
> +
> +
> +		ts[2] = (((u64)Time[2]) << 32);
> +		ts[1] = (((u64)Time[1]) << 16);
> +		ts[0] = ((u64)Time[0]);
> +
> +		cumulative = 0;
> +		cumulative |= ts[0];
> +		cumulative |= ts[1];
> +		cumulative |= ts[2];
> +
> +		*time_stamp_48 = cumulative;
> +	}
> +}
> +
> +// Must be called under clock_lock
> +static void bcm54210pe_read48bittime_register(struct phy_device *phydev, u64 *time_stamp)
> +{
> +	u16 Time[3];
> +	u64 ts[3];
> +	u64 cumulative;
> +
> +	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x400);
> +	Time[2] = bcm_phy_read_exp(phydev, HEART_BEAT_REG2);
> +	Time[1] = bcm_phy_read_exp(phydev, HEART_BEAT_REG1);
> +	Time[0] = bcm_phy_read_exp(phydev, HEART_BEAT_REG0);
> +
> +	// Set read end bit
> +	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x800);
> +	bcm_phy_write_exp(phydev, CTR_DBG_REG, 0x000);
> +
> +
> +	ts[2] = (((u64)Time[2]) << 32);
> +	ts[1] = (((u64)Time[1]) << 16);
> +	ts[0] = ((u64)Time[0]);
> +
> +	cumulative = 0;
> +	cumulative |= ts[0];
> +	cumulative |= ts[1];
> +	cumulative |= ts[2];
> +
> +	*time_stamp = cumulative;
> +}
> +
> +static int bcm54210pe_get80bittime(struct bcm54210pe_private *private,
> +				   struct timespec64 *ts,
> +				   struct ptp_system_timestamp *sts)
> +{
> +	struct phy_device *phydev;
> +	u16 nco_6_register_value;
> +	int i;
> +	u64 time_stamp_48, time_stamp_80, control_ts;
> +
> +	phydev = private->phydev;
> +
> +	// Capture timestamp on next framesync
> +	nco_6_register_value = 0x2000;
> +
> +	// Lock
> +	mutex_lock(&private->clock_lock);
> +
> +	// We share frame sync events with extts, so we need to ensure no event has occurred as we are about to boot the registers, so....
> +
> +	// If extts is enabled
> +	if (private->extts_en) {
> +
> +		// Halt framesyncs generated by the sync in pin
> +		bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0000);
> +
> +		// Read what's in the 8- bit register
> +		bcm54210pe_read48bittime_register(phydev, &control_ts);
> +
> +		// If it matches neither the last gettime or extts timestamp
> +		if (control_ts != private->last_extts_ts && control_ts != private->last_immediate_ts[0]) { // FIXME: This is a bug
> +
> +			// Odds are this is a extts not yet logged as an event
> +			//printk("extts triggered by get80bittime\n");
> +			bcm54210pe_trigger_extts_event(private, control_ts);
> +		}
> +	}
> +
> +	// Heartbeat register selection. Latch 80 bit Original time coude counter into Heartbeat register
> +	// (this is undocumented)
> +	bcm_phy_write_exp(phydev, DPLL_SELECT_REG, 0x0040);
> +
> +	// Amend to base register
> +	nco_6_register_value = bcm54210pe_get_base_nco6_reg(private, nco_6_register_value, false);
> +
> +	// Set the NCO register
> +	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
> +
> +	// Trigger framesync
> +	if (sts != NULL) {
> +
> +		// If we are doing a gettimex call
> +		ptp_read_system_prets(sts);
> +		bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
> +		ptp_read_system_postts(sts);
> +
> +	} else {
> +
> +		// or if we are doing a gettime call
> +		bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
> +	}
> +
> +	for (i = 0; i < 5;i++) {
> +
> +		bcm54210pe_read80bittime_register(phydev, &time_stamp_80, &time_stamp_48);
> +
> +		if (time_stamp_80 != 0) {
> +			break;
> +		}
> +	}
> +
> +	// Convert to timespec64
> +	ns_to_ts(time_stamp_80, ts);
> +
> +	// If we are using extts
> +	if(private->extts_en) {
> +
> +		// Commit last timestamp
> +	   	private->last_immediate_ts[0] = time_stamp_48;
> +	    	private->last_immediate_ts[1] = time_stamp_80;
> +
> +		// Heartbeat register selection. Latch 48 bit Original time coude counter into Heartbeat register
> +		// (this is undocumented)
> +		bcm_phy_write_exp(phydev, DPLL_SELECT_REG, 0x0000);
> +
> +		// Rearm framesync for sync in pin
> +		nco_6_register_value = bcm54210pe_get_base_nco6_reg(private, nco_6_register_value, false);
> +		bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
> +	}
> +
> +	mutex_unlock(&private->clock_lock);
> +
> +	return 0;
> +}
> +
> +static int bcm54210pe_gettime(struct ptp_clock_info *info, struct timespec64 *ts)
> +{
> +	int err;
> +	err = bcm54210pe_gettimex(info, ts, NULL);
> +	return err;
> +}
> +
> +static int bcm54210pe_get48bittime(struct bcm54210pe_private *private, u64 *timestamp)
> +{
> +	u16 nco_6_register_value;
> +	int i, err;
> +
> +	struct phy_device *phydev = private->phydev;
> +
> +	// Capture timestamp on next framesync
> +	nco_6_register_value = 0x2000;
> +
> +	mutex_lock(&private->clock_lock);
> +
> +	// Heartbeat register selection. Latch 48 bit Original time coude counter into Heartbeat register
> +	// (this is undocumented)
> +	err = bcm_phy_write_exp(phydev, DPLL_SELECT_REG, 0x0000);
> +
> +	// Amend to base register
> +	nco_6_register_value =
> +		bcm54210pe_get_base_nco6_reg(private, nco_6_register_value, false);
> +
> +	// Set the NCO register
> +	err |= bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
> +
> +	// Trigger framesync
> +	err |= bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
> +
> +	for (i = 0; i < 5; i++) {
> +
> +		bcm54210pe_read48bittime_register(phydev,timestamp);
> +
> +		if (*timestamp != 0) {
> +			break;
> +		}
> +	}
> +	mutex_unlock(&private->clock_lock);
> +
> +	return err;
> +}
> +
> +static int bcm54210pe_settime(struct ptp_clock_info *info, const struct timespec64 *ts)
> +{
> +	u16 shadow_load_register, nco_6_register_value;
> +	u16 original_time_codes[5], local_time_codes[3];
> +	struct bcm54210pe_ptp *ptp;
> +	struct phy_device *phydev;
> +
> +	ptp = container_of(info, struct bcm54210pe_ptp, caps);
> +	phydev = ptp->chosen->phydev;
> +
> +	shadow_load_register = 0;
> +	nco_6_register_value = 0;
> +
> +	// Assign original time codes (80 bit)
> +	original_time_codes[4] = (u16) ((ts->tv_sec & 0x0000FFFF00000000) >> 32);
> +	original_time_codes[3] = (u16) ((ts->tv_sec  & 0x00000000FFFF0000) >> 16);
> +	original_time_codes[2] = (u16) (ts->tv_sec  & 0x000000000000FFFF);
> +	original_time_codes[1] = (u16) ((ts->tv_nsec & 0x00000000FFFF0000) >> 16);
> +	original_time_codes[0] = (u16) (ts->tv_nsec & 0x000000000000FFFF);
> +
> +	// Assign original time codes (48 bit)
> +	local_time_codes[2] = 0x4000;
> +	local_time_codes[1] = (u16) (ts->tv_nsec >> 20);
> +	local_time_codes[0] = (u16) (ts->tv_nsec >> 4);
> +
> +	// Set Time Code load bit in the shadow load register
> +	shadow_load_register |= 0x0400;
> +
> +	// Set Local Time load bit in the shadow load register
> +	shadow_load_register |= 0x0080;
> +
> +	mutex_lock(&ptp->chosen->clock_lock);
> +
> +	// Write Original Time Code Register
> +	bcm_phy_write_exp(phydev, ORIGINAL_TIME_CODE_0, original_time_codes[0]);
> +	bcm_phy_write_exp(phydev, ORIGINAL_TIME_CODE_1, original_time_codes[1]);
> +	bcm_phy_write_exp(phydev, ORIGINAL_TIME_CODE_2, original_time_codes[2]);
> +	bcm_phy_write_exp(phydev, ORIGINAL_TIME_CODE_3, original_time_codes[3]);
> +	bcm_phy_write_exp(phydev, ORIGINAL_TIME_CODE_4, original_time_codes[4]);
> +
> +	// Write Local Time Code Register
> +	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_2_0_REG, local_time_codes[0]);
> +	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_2_1_REG, local_time_codes[1]);
> +	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_2_2_REG, local_time_codes[2]);
> +
> +	// Write Shadow register
> +	bcm_phy_write_exp(phydev, SHADOW_REG_CONTROL, 0x0000);
> +	bcm_phy_write_exp(phydev, SHADOW_REG_LOAD, shadow_load_register);
> +
> +	// Set global mode and nse_init
> +	nco_6_register_value = bcm54210pe_get_base_nco6_reg(ptp->chosen, nco_6_register_value, true);
> +
> +	// Write to register
> +	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
> +
> +	// Trigger framesync
> +	bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
> +
> +	// Set the second on set
> +	ptp->chosen->second_on_set = ts->tv_sec;
> +
> +	mutex_unlock(&ptp->chosen->clock_lock);
> +
> +	return 0;
> +}
> +
> +static int bcm54210pe_adjfine(struct ptp_clock_info *info, long scaled_ppm)
> +{	
> +	int err;
> +	u16 lo, hi;
> +	u32 corrected_8ns_interval, base_8ns_interval;
> +	bool negative;
> +
> +	struct bcm54210pe_ptp *ptp = container_of(info, struct bcm54210pe_ptp, caps);
> +	struct phy_device *phydev = ptp->chosen->phydev;
> +
> +	negative = false;
> +        if ( scaled_ppm < 0 ) {
> +		negative = true;
> +		scaled_ppm = -scaled_ppm;
> +	}
> +
> +	// This is not completely accurate but very fast
> +	scaled_ppm >>= 7;
> +
> +	// Nominal counter increment is 8ns
> +	base_8ns_interval = 1 << 31;
> +
> +	// Add or subtract differential
> +	if (negative) {
> +		corrected_8ns_interval = base_8ns_interval - scaled_ppm;
> +	} else {
> +		corrected_8ns_interval = base_8ns_interval + scaled_ppm;
> +	}
> +
> +	// Load up registers
> +	hi = (corrected_8ns_interval & 0xFFFF0000) >> 16;
> +	lo = (corrected_8ns_interval & 0x0000FFFF);
> +
> +	mutex_lock(&ptp->chosen->clock_lock);
> +
> +	// Set freq_mdio_sel to 1
> +	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_2_2_REG, 0x4000);
> +
> +	// Load 125MHz frequency reqcntrl
> +	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_1_MSB_REG, hi);
> +	bcm_phy_write_exp(phydev, NSE_DPPL_NCO_1_LSB_REG, lo);
> +
> +	// On next framesync load freq from freqcntrl
> +	bcm_phy_write_exp(phydev, SHADOW_REG_LOAD, 0x0040);
> +
> +	// Trigger framesync
> +	err = bcm_phy_modify_exp(phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
> +
> +	mutex_unlock(&ptp->chosen->clock_lock);
> +
> +	return err;
> +}
> +
> +static int bcm54210pe_adjtime(struct ptp_clock_info *info, s64 delta)
> +{
> +	int err;
> +	struct timespec64 ts;
> +	u64 now;
> +
> +	err = bcm54210pe_gettime(info, &ts);
> +	if (err < 0)
> +		return err;
> +
> +	now = ktime_to_ns(timespec64_to_ktime(ts));
> +	ts = ns_to_timespec64(now + delta);
> +
> +	err = bcm54210pe_settime(info, &ts);
> +
> +	return err;
> +}
> +
> +
> +static int bcm54210pe_extts_enable(struct bcm54210pe_private *private, int enable)
> +{
> +	int err;
> +	struct phy_device *phydev;
> +	u16 nco_6_register_value;
> +
> +	phydev = private->phydev;
> +
> +	if (enable) {
> +
> +		if (!private->extts_en) {
> +
> +			// Set enable per_out
> +			private->extts_en = true;
> +			err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_4_REG, 0x0001);
> +
> +			nco_6_register_value = 0;
> +			nco_6_register_value = bcm54210pe_get_base_nco6_reg(private, nco_6_register_value, false);
> +
> +			err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_7_0_REG, 0x0100);
> +			err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_7_1_REG, 0x0200);
> +			err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
> +
> +			schedule_delayed_work(&private->extts_ws, msecs_to_jiffies(1));
> +		}
> +
> +	} else {
> +		private->extts_en = false;
> +		err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_4_REG, 0x0000);
> +
> +	}
> +
> +	return err;
> +}
> +
> +static void bcm54210pe_run_extts_thread(struct work_struct *extts_ws)
> +{
> +	struct bcm54210pe_private *private;
> +	struct phy_device *phydev;
> +	u64 interval, time_stamp_48, time_stamp_80;
> +
> +	private = container_of((struct delayed_work *)extts_ws, struct bcm54210pe_private, extts_ws);
> +	phydev = private->phydev;
> +
> +	interval = 10;	// in ms - long after we are gone from this earth, discussions will be had and
> +	  		// songs will be sung about whether this interval is short enough....
> +			// Before you complain let me say that in Timebeat.app up to ~150ms allows
> +			// single digit ns servo accuracy. If your client / servo is not as cool: Do better :-)
> +
> +	mutex_lock(&private->clock_lock);
> +
> +	bcm54210pe_read80bittime_register(phydev, &time_stamp_80, &time_stamp_48);
> +
> +
> +	if (private->last_extts_ts != time_stamp_48 &&
> +	    private->last_immediate_ts[0] != time_stamp_48 &&
> +	    private->last_immediate_ts[1] != time_stamp_80) {
> +
> +		bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, 0xE000);
> +		bcm54210pe_trigger_extts_event(private, time_stamp_48);
> +		bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, 0xE004);
> +	}
> +
> +	mutex_unlock(&private->clock_lock);
> +
> +	// Do we need to reschedule
> +	if (private->extts_en) {
> +		schedule_delayed_work(&private->extts_ws, msecs_to_jiffies(interval));
> +	}
> +}
> +
> +// Must be called under clock_lock
> +static void bcm54210pe_trigger_extts_event(struct bcm54210pe_private *private, u64 time_stamp)
> +{
> +
> +	struct ptp_clock_event event;
> +	struct timespec64 ts;
> +
> +	event.type = PTP_CLOCK_EXTTS;
> +	event.timestamp = convert_48bit_to_80bit(private->second_on_set, time_stamp);
> +	event.index = 0;
> +
> +	ptp_clock_event(private->ptp->ptp_clock, &event);
> +
> +    	private->last_extts_ts = time_stamp;
> +
> +	ns_to_ts(time_stamp, &ts);
> +}
> +
> +static int bcm54210pe_perout_enable(struct bcm54210pe_private *private, s64 period, s64 pulsewidth, int enable)
> +{
> +	struct phy_device *phydev;
> +	u16 nco_6_register_value, frequency_hi, frequency_lo, pulsewidth_reg, pulse_start_hi, pulse_start_lo;
> +	int err;
> +
> +	phydev = private->phydev;
> +
> +	if (enable) {
> +		frequency_hi = 0;
> +		frequency_lo = 0;
> +		pulsewidth_reg = 0;
> +		pulse_start_hi = 0;
> +		pulse_start_lo = 0;
> +
> +		// Convert interval pulse spacing (period) and pulsewidth to 8 ns units
> +		period /= 8;
> +		pulsewidth /= 8;
> +
> +		// Mode 2 only: If pulsewidth is not explicitly set with PTP_PEROUT_DUTY_CYCLE
> +		if (pulsewidth == 0) {
> +			if (period < 2500) {
> +				// At a frequency at less than 20us (2500 x 8ns) set pulse length to 1/10th of the interval pulse spacing
> +				pulsewidth = period / 10;
> +
> +				// Where the interval pulse spacing is short, ensure we set a pulse length of 8ns
> +				if (pulsewidth == 0) {
> +					pulsewidth = 1;
> +				}
> +
> +			} else {
> +				// Otherwise set pulse with to 4us (8ns x 500 = 4us)
> +				pulsewidth = 500;
> +			}
> +		}
> +
> +		if (private->perout_mode == SYNC_OUT_MODE_1) {
> +
> +			// Set period
> +			private->perout_period = period;
> +
> +			if (!private->perout_en) {
> +
> +				// Set enable per_out
> +				private->perout_en = true;
> +				schedule_delayed_work(&private->perout_ws, msecs_to_jiffies(1));
> +			}
> +
> +			err = 0;
> +
> +		} else if (private->perout_mode == SYNC_OUT_MODE_2) {
> +
> +			// Set enable per_out
> +			private->perout_en = true;
> +
> +			// Calculate registers
> +			frequency_lo 	 = (u16)period; 			// Lowest 16 bits of 8ns interval pulse spacing [15:0]
> +			frequency_hi	 = (u16) (0x3FFF & (period >> 16));	// Highest 14 bits of 8ns interval pulse spacing [29:16]
> +			frequency_hi	|= (u16) pulsewidth << 14; 		// 2 lowest bits of 8ns pulse length [1:0]
> +			pulsewidth_reg	 = (u16) (0x7F & (pulsewidth >> 2));	// 7 highest bit  of 8 ns pulse length [8:2]
> +
> +			// Get base value
> +			nco_6_register_value = bcm54210pe_get_base_nco6_reg(
> +				private, nco_6_register_value, true);
> +
> +			mutex_lock(&private->clock_lock);
> +
> +			// Write to register
> +			err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG,
> +						nco_6_register_value);
> +
> +			// Set sync out pulse interval spacing and pulse length
> +			err |= bcm_phy_write_exp(
> +				phydev, NSE_DPPL_NCO_3_0_REG, frequency_lo);
> +			err |= bcm_phy_write_exp(
> +				phydev, NSE_DPPL_NCO_3_1_REG, frequency_hi);
> +			err |= bcm_phy_write_exp(phydev,
> +						 NSE_DPPL_NCO_3_2_REG,
> +						 pulsewidth_reg);
> +
> +			// On next framesync load sync out frequency
> +			err |= bcm_phy_write_exp(phydev, SHADOW_REG_LOAD,
> +						 0x0200);
> +
> +			// Trigger immediate framesync framesync
> +			err |= bcm_phy_modify_exp(
> +				phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
> +
> +			mutex_unlock(&private->clock_lock);
> +		}
> +	} else {
> +
> +		// Set disable pps
> +		private->perout_en = false;
> +
> +		// Get base value
> +		nco_6_register_value = bcm54210pe_get_base_nco6_reg(private, nco_6_register_value, false);
> +
> +		mutex_lock(&private->clock_lock);
> +
> +		// Write to register
> +		err = bcm_phy_write_exp(phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value);
> +
> +		mutex_unlock(&private->clock_lock);
> +	}
> +
> +	return err;
> +}
> +
> +static void bcm54210pe_run_perout_mode_one_thread(struct work_struct *perout_ws)
> +{
> +	struct bcm54210pe_private *private;
> +	u64 local_time_stamp_48bits; //, local_time_stamp_80bits;
> +	u64 next_event, time_before_next_pulse, period;
> +	u16 nco_6_register_value, pulsewidth_nco3_hack;
> +	u64 wait_one, wait_two;
> +
> +	private = container_of((struct delayed_work *)perout_ws, struct bcm54210pe_private, perout_ws);
> +	period = private->perout_period * 8;
> +	pulsewidth_nco3_hack = 250; // The BCM chip is broken. It does not respect this in sync out mode 1
> +
> +	nco_6_register_value = 0;
> +
> +	// Get base value
> +	nco_6_register_value = bcm54210pe_get_base_nco6_reg(private, nco_6_register_value, false);
> +
> +	// Get 48 bit local time
> +	bcm54210pe_get48bittime(private, &local_time_stamp_48bits);
> +
> +	// Calculate time before next event and next event time
> +	time_before_next_pulse =  period - (local_time_stamp_48bits % period);
> +	next_event = local_time_stamp_48bits + time_before_next_pulse;
> +
> +	// Lock
> +	mutex_lock(&private->clock_lock);
> +
> +	// Set pulsewidth (test reveal this does not work), but registers need content or no pulse will exist
> +	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_3_1_REG, pulsewidth_nco3_hack << 14);
> +	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_3_2_REG, pulsewidth_nco3_hack >> 2);
> +
> +	// Set sync out pulse interval spacing and pulse length
> +	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_0_REG, next_event & 0xFFF0);
> +	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_1_REG, next_event >> 16);
> +	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_2_REG, next_event >> 32);
> +
> +	// On next framesync load sync out frequency
> +	bcm_phy_write_exp(private->phydev, SHADOW_REG_LOAD, 0x0200);
> +
> +	// Write to register with mode one set for sync out
> +	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_6_REG, nco_6_register_value || 0x0001);
> +
> +	// Trigger immediate framesync framesync
> +	bcm_phy_modify_exp(private->phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
> +
> +	// Unlock
> +	mutex_unlock(&private->clock_lock);
> +
> +	// Wait until 1/10 period after the next pulse
> +	wait_one = (time_before_next_pulse / 1000000) + (period / 1000000 / 10);
> +	mdelay(wait_one);
> +
> +	// Lock
> +	mutex_lock(&private->clock_lock);
> +
> +	// Clear pulse by bumping sync_out_match to max (this pulls sync out down)
> +	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_0_REG, 0xFFF0);
> +	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_1_REG, 0xFFFF);
> +	bcm_phy_write_exp(private->phydev, NSE_DPPL_NCO_5_2_REG, 0xFFFF);
> +
> +	// On next framesync load sync out frequency
> +	bcm_phy_write_exp(private->phydev, SHADOW_REG_LOAD, 0x0200);
> +
> +	// Trigger immediate framesync framesync
> +	bcm_phy_modify_exp(private->phydev, NSE_DPPL_NCO_6_REG, 0x003C, 0x0020);
> +
> +	// Unlock
> +	mutex_unlock(&private->clock_lock);
> +
> +	// Calculate wait before we reschedule the next pulse
> +	wait_two = (period / 1000000) - (2 * (period / 10000000));
> +
> +	// Do we need to reschedule
> +	if (private->perout_en) {
> +		schedule_delayed_work(&private->perout_ws, msecs_to_jiffies(wait_two));
> +	}
> +}
> +
> +
> +bool bcm54210pe_rxtstamp(struct mii_timestamper *mii_ts, struct sk_buff *skb, int type)
> +{
> +	struct bcm54210pe_private *private = container_of(mii_ts, struct bcm54210pe_private, mii_ts);
> +
> +	if (private->hwts_rx_en) {
> +		skb_queue_tail(&private->rx_skb_queue, skb);

This is clunky.  The time stamp item may already be in the list.  Code
should check the list and deliver the skb immediately on match.  Queue
the skb only when time stamp not ready yet.

It appears that time stamps generate an interrupt, no?  If so, then
why not use the ISR to trigger reading of time stamps?

See dp83640.c for an example of how to handle asynchronous Rx time stamps.

Moreover: Does this device provide in-band Rx time stamps?  If so, why
not use them?

> +		schedule_work(&private->rxts_work);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +void bcm54210pe_txtstamp(struct mii_timestamper *mii_ts, struct sk_buff *skb, int type)
> +{
> +	struct bcm54210pe_private *private = container_of(mii_ts, struct bcm54210pe_private, mii_ts);
> +
> +	switch (private->hwts_tx_en)
> +	{
> +		case HWTSTAMP_TX_ON:
> +		{	
> +			skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +			skb_queue_tail(&private->tx_skb_queue, skb);
> +			schedule_work(&private->txts_work);
> +			break;
> +		}
> +		case HWTSTAMP_TX_OFF:
> +		{	
> +		
> +		}
> +		default:
> +		{
> +			kfree_skb(skb);
> +			break;
> +		}
> +	}
> +}
> +
> +int bcm54210pe_ts_info(struct mii_timestamper *mii_ts, struct ethtool_ts_info *info)
> +{
> +	struct bcm54210pe_private *bcm54210pe = container_of(mii_ts, struct bcm54210pe_private, mii_ts);
> +
> +	info->so_timestamping =
> +		SOF_TIMESTAMPING_TX_HARDWARE |
> +		SOF_TIMESTAMPING_RX_HARDWARE |
> +		SOF_TIMESTAMPING_RAW_HARDWARE;
> +
> +	info->phc_index = ptp_clock_index(bcm54210pe->ptp->ptp_clock);
> +	info->tx_types =
> +		(1 << HWTSTAMP_TX_OFF) |
> +		(1 << HWTSTAMP_TX_ON) ;
> +      	info->rx_filters =
> +                (1 << HWTSTAMP_FILTER_NONE) |
> +                (1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
> +                (1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT);
> +	return 0;
> +}
> +
> +int bcm54210pe_hwtstamp(struct mii_timestamper *mii_ts, struct ifreq *ifr)
> +{
> +	struct bcm54210pe_private *device = container_of(mii_ts, struct bcm54210pe_private, mii_ts);
> +
> +	struct hwtstamp_config cfg;
> +
> +	if (copy_from_user(&cfg, ifr->ifr_data, sizeof(cfg)))
> +		return -EFAULT;
> +
> +	if (cfg.flags) /* reserved for future extensions */
> +		return -EINVAL;
> +
> +	if (cfg.tx_type < 0 || cfg.tx_type > HWTSTAMP_TX_ONESTEP_SYNC)
> +		return -ERANGE;
> +
> +	device->hwts_tx_en = cfg.tx_type;
> +
> +	switch (cfg.rx_filter) {
> +	case HWTSTAMP_FILTER_NONE:
> +		device->hwts_rx_en = 0;
> +		device->layer = 0;
> +		device->version = 0;
> +		break;
> +	case HWTSTAMP_FILTER_PTP_V1_L4_EVENT:
> +	case HWTSTAMP_FILTER_PTP_V1_L4_SYNC:
> +	case HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ:
> +		device->hwts_rx_en = 1;
> +		device->layer = PTP_CLASS_L4;
> +		device->version = PTP_CLASS_V1;
> +		cfg.rx_filter = HWTSTAMP_FILTER_PTP_V1_L4_EVENT;
> +		break;
> +	case HWTSTAMP_FILTER_PTP_V2_L4_EVENT:
> +	case HWTSTAMP_FILTER_PTP_V2_L4_SYNC:
> +	case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ:
> +		device->hwts_rx_en = 1;
> +		device->layer = PTP_CLASS_L4;
> +		device->version = PTP_CLASS_V2;
> +		cfg.rx_filter = HWTSTAMP_FILTER_PTP_V2_L4_EVENT;
> +		break;
> +	case HWTSTAMP_FILTER_PTP_V2_L2_EVENT:
> +	case HWTSTAMP_FILTER_PTP_V2_L2_SYNC:
> +	case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ:
> +		device->hwts_rx_en = 1;
> +		device->layer = PTP_CLASS_L2;
> +		device->version = PTP_CLASS_V2;
> +		cfg.rx_filter = HWTSTAMP_FILTER_PTP_V2_L2_EVENT;
> +		break;
> +	case HWTSTAMP_FILTER_PTP_V2_EVENT:
> +	case HWTSTAMP_FILTER_PTP_V2_SYNC:
> +	case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
> +		device->hwts_rx_en = 1;
> +		device->layer = PTP_CLASS_L4 | PTP_CLASS_L2;
> +		device->version = PTP_CLASS_V2;
> +		cfg.rx_filter = HWTSTAMP_FILTER_PTP_V2_EVENT;
> +		break;
> +	default:
> +		return -ERANGE;
> +	}
> +	
> +	return copy_to_user(ifr->ifr_data, &cfg, sizeof(cfg)) ? -EFAULT : 0;
> +}
> +
> +static int bcm54210pe_feature_enable(struct ptp_clock_info *info, struct ptp_clock_request *req, int on)
> +{
> +	struct bcm54210pe_ptp *ptp = container_of(info, struct bcm54210pe_ptp, caps);
> +	s64 period, pulsewidth;
> +	struct timespec64 ts;
> +
> +	switch (req->type) {
> +
> +	case PTP_CLK_REQ_PEROUT :
> +
> +		period = 0;
> +		pulsewidth = 0;
> +
> +		// Check if pin func is set correctly
> +		if (ptp->chosen->sdp_config[SYNC_OUT_PIN].func != PTP_PF_PEROUT) {
> +			return -EOPNOTSUPP;
> +		}
> +
> +		// No other flags supported
> +		if (req->perout.flags & ~PTP_PEROUT_DUTY_CYCLE) {
> +			return -EOPNOTSUPP;
> +		}
> +
> +		// Check if a specific pulsewidth is set
> +		if ((req->perout.flags & PTP_PEROUT_DUTY_CYCLE) > 0) {
> +
> +			if (ptp->chosen->perout_mode == SYNC_OUT_MODE_1) {
> +				return -EOPNOTSUPP;
> +			}
> +
> +			// Extract pulsewidth
> +			ts.tv_sec = req->perout.on.sec;
> +			ts.tv_nsec = req->perout.on.nsec;
> +			pulsewidth = timespec64_to_ns(&ts);
> +
> +			// 9 bits in 8ns units, so max = 4,088ns
> +			if (pulsewidth > 511 * 8) {
> +				return -ERANGE;
> +			}
> +		}
> +
> +		// Extract pulse spacing interval (period)
> +		ts.tv_sec = req->perout.period.sec;
> +		ts.tv_nsec = req->perout.period.nsec;
> +		period = timespec64_to_ns(&ts);
> +
> +		// 16ns is minimum pulse spacing interval (a value of 16 will result in 8ns high followed by 8 ns low)
> +		if (period != 0 && period < 16) {
> +			return -ERANGE;
> +		}
> +
> +		return bcm54210pe_perout_enable(ptp->chosen, period, pulsewidth, on);
> +
> +	case PTP_CLK_REQ_EXTTS:
> +
> +		if (ptp->chosen->sdp_config[SYNC_IN_PIN].func != PTP_PF_EXTTS) {
> +			return -EOPNOTSUPP;
> +		}
> +
> +		return bcm54210pe_extts_enable(ptp->chosen, on);
> +
> +	default:
> +		break;
> +	}
> +
> +	return -EOPNOTSUPP;
> +}
> +
> +
> +static int bcm54210pe_ptp_verify_pin(struct ptp_clock_info *info, unsigned int pin,
> +			      enum ptp_pin_function func, unsigned int chan)
> +{
> +	switch (func) {
> +	case PTP_PF_NONE:
> +		return 0;
> +		break;
> +	case PTP_PF_EXTTS:
> +		if (pin == SYNC_IN_PIN)
> +			return 0;
> +		break;
> +	case PTP_PF_PEROUT:
> +		if (pin == SYNC_OUT_PIN)
> +			return 0;
> +		break;
> +	case PTP_PF_PHYSYNC:
> +		break;
> +	}
> +	return -1;
> +}
> +
> +static const struct ptp_clock_info bcm54210pe_clk_caps = {
> +        .owner          = THIS_MODULE,
> +        .name           = "BCM54210PE_PHC",
> +        .max_adj        = 100000000,
> +        .n_alarm        = 0,
> +        .n_pins         = 2,
> +        .n_ext_ts       = 1,
> +        .n_per_out      = 1,
> +        .pps            = 0,
> +        .adjtime        = &bcm54210pe_adjtime,
> +        .adjfine        = &bcm54210pe_adjfine,
> +        .gettime64      = &bcm54210pe_gettime,
> +	.gettimex64	= &bcm54210pe_gettimex,
> +        .settime64      = &bcm54210pe_settime,
> +	.enable		= &bcm54210pe_feature_enable,
> +	.verify		= &bcm54210pe_ptp_verify_pin,
> +};
> +
> +static int bcm54210pe_interrupts_enable(struct phy_device *phydev, bool fsync_en, bool sop_en)
> +{
> +	u16 interrupt_mask;
> +
> +	interrupt_mask = 0;
> +
> +	if (fsync_en) {
> +		interrupt_mask |= 0x0001;
> +	}
> +
> +	if (sop_en) {
> +		interrupt_mask |= 0x0002;
> +	}
> +
> +	return bcm_phy_write_exp(phydev, INTERRUPT_MASK_REG, interrupt_mask);
> +}
> +
> +static int bcm54210pe_sw_reset(struct phy_device *phydev)
> +{
> +	u16 err;
> +	u16 aux;
> +        
> +	err =  bcm_phy_write_exp(phydev, EXT_SOFTWARE_RESET, EXT_RESET1);
> +	err = bcm_phy_read_exp(phydev, EXT_ENABLE_REG1);
> +        if (err < 0)
> +                return err;
> +
> +        err = bcm_phy_write_exp(phydev, EXT_SOFTWARE_RESET, EXT_RESET2);
> +	aux = bcm_phy_read_exp(phydev, EXT_SOFTWARE_RESET);
> +        return err;
> +}
> +
> +int bcm54210pe_probe(struct phy_device *phydev)
> +{
> +	int x, y;
> +	struct bcm54210pe_ptp *ptp;
> +        struct bcm54210pe_private *bcm54210pe;
> +	struct ptp_pin_desc *sync_in_pin_desc, *sync_out_pin_desc;
> +
> +	bcm54210pe_sw_reset(phydev);
> +	bcm54210pe_config_1588(phydev);
> +
> +	bcm54210pe = kzalloc(sizeof(struct bcm54210pe_private), GFP_KERNEL);
> +        if (!bcm54210pe) {
> +		return -ENOMEM;
> +	}
> +
> +	ptp = kzalloc(sizeof(struct bcm54210pe_ptp), GFP_KERNEL);
> +        if (!ptp) {
> +		return -ENOMEM;
> +	}
> +
> +	bcm54210pe->phydev = phydev;
> +	bcm54210pe->ptp = ptp;
> +
> +	bcm54210pe->mii_ts.rxtstamp = bcm54210pe_rxtstamp;
> +	bcm54210pe->mii_ts.txtstamp = bcm54210pe_txtstamp;
> +	bcm54210pe->mii_ts.hwtstamp = bcm54210pe_hwtstamp;
> +	bcm54210pe->mii_ts.ts_info  = bcm54210pe_ts_info;
> +
> +
> +	phydev->mii_ts = &bcm54210pe->mii_ts;
> +
> +	// Initialisation of work_structs and similar
> +	INIT_WORK(&bcm54210pe->txts_work, bcm54210pe_run_tx_timestamp_match_thread);
> +	INIT_WORK(&bcm54210pe->rxts_work, bcm54210pe_run_rx_timestamp_match_thread);
> +	INIT_DELAYED_WORK(&bcm54210pe->perout_ws, bcm54210pe_run_perout_mode_one_thread);
> +	INIT_DELAYED_WORK(&bcm54210pe->extts_ws, bcm54210pe_run_extts_thread);

Don't use generic work.  Instead, implement ptp_clock_info::do_aux_work()

That way, you get a kthread that may be given appropriate scheduling priority.

> +
> +	// SKB queues
> +	skb_queue_head_init(&bcm54210pe->tx_skb_queue);
> +	skb_queue_head_init(&bcm54210pe->rx_skb_queue);
> +
> +	for (x = 0; x < CIRCULAR_BUFFER_COUNT; x++)
> +	{ 
> +		INIT_LIST_HEAD(&bcm54210pe->circular_buffers[x]);
> +	
> +		for (y = 0; y < CIRCULAR_BUFFER_ITEM_COUNT; y++)
> +		{ list_add(&bcm54210pe->circular_buffer_items[x][y].list, &bcm54210pe->circular_buffers[x]); }
> +	}
> +
> +	// Caps
> +	memcpy(&bcm54210pe->ptp->caps, &bcm54210pe_clk_caps, sizeof(bcm54210pe_clk_caps));
> +	bcm54210pe->ptp->caps.pin_config = bcm54210pe->sdp_config;
> +
> +	// Mutex
> +	mutex_init(&bcm54210pe->clock_lock);
> +	mutex_init(&bcm54210pe->timestamp_buffer_lock);
> +
> +	// Features
> +	bcm54210pe->one_step = false;
> +	bcm54210pe->extts_en = false;
> +	bcm54210pe->perout_en = false;
> +	bcm54210pe->perout_mode = SYNC_OUT_MODE_1;
> +
> +	// Fibonacci RSewoke style progressive backoff scheme
> +	bcm54210pe->fib_sequence[0] = 1;
> +	bcm54210pe->fib_sequence[1] = 1;
> +	bcm54210pe->fib_sequence[2] = 2;
> +	bcm54210pe->fib_sequence[3] = 3;
> +	bcm54210pe->fib_sequence[4] = 5;
> +	bcm54210pe->fib_sequence[5] = 8;
> +	bcm54210pe->fib_sequence[6] = 13;
> +	bcm54210pe->fib_sequence[7] = 21;
> +	bcm54210pe->fib_sequence[8] = 34;
> +	bcm54210pe->fib_sequence[9] = 55;
> +
> +	//bcm54210pe->fib_sequence = {1, 1, 2, 3, 5, 8, 13, 21, 34, 55};
> +	bcm54210pe->fib_factor_rx = 10;
> +	bcm54210pe->fib_factor_tx = 10;
> +
> +	// Pin descriptions
> +	sync_in_pin_desc = &bcm54210pe->sdp_config[SYNC_IN_PIN];
> +	snprintf(sync_in_pin_desc->name, sizeof(sync_in_pin_desc->name), "SYNC_IN");
> +	sync_in_pin_desc->index = SYNC_IN_PIN;
> +	sync_in_pin_desc->func = PTP_PF_NONE;
> +
> +	sync_out_pin_desc = &bcm54210pe->sdp_config[SYNC_OUT_PIN];
> +	snprintf(sync_out_pin_desc->name, sizeof(sync_out_pin_desc->name), "SYNC_OUT");
> +	sync_out_pin_desc->index = SYNC_OUT_PIN;
> +	sync_out_pin_desc->func = PTP_PF_NONE;
> +
> +	ptp->chosen = bcm54210pe;
> +	phydev->priv = bcm54210pe;
> +	ptp->caps.owner = THIS_MODULE;
> +
> +	bcm54210pe->ptp->ptp_clock = ptp_clock_register(&bcm54210pe->ptp->caps, &phydev->mdio.dev);
> +
> +	if (IS_ERR(bcm54210pe->ptp->ptp_clock)) {
> +                        return PTR_ERR(bcm54210pe->ptp->ptp_clock);
> +	}
> +
> +	return 0;
> +}
> +
> +static u16 bcm54210pe_get_base_nco6_reg(struct bcm54210pe_private *private, u16 val, bool do_nse_init)
> +{
> +
> +	// Set Global mode to CPU system
> +	val |= 0xC000;
> +
> +	// NSE init
> +	if (do_nse_init) {
> +		val |= 0x1000;
> +	}
> +
> +	if (private->extts_en) {
> +		val |= 0x2004;
> +	}
> +
> +	if(private->perout_en) {
> +		if (private->perout_mode == SYNC_OUT_MODE_1) {
> +			val |= 0x0001;
> +		} else if (private->perout_mode == SYNC_OUT_MODE_2) {
> +			val |= 0x0002;
> +		}
> +	}
> +
> +	return val;
> +}
> +
> +
> +static u64 convert_48bit_to_80bit(u64 second_on_set, u64 ts)
> +{
> +	return (second_on_set * 1000000000) + ts;
> +}
> +
> +static u64 four_u16_to_ns(u16 *four_u16)
> +{
> +	u32 seconds;
> +	u32 nanoseconds;
> +	struct timespec64 ts;
> +	u16 *ptr;
> +
> +	nanoseconds = 0;
> +	seconds = 0;
> +
> +
> +	ptr = (u16 *)&nanoseconds;
> +	*ptr = four_u16[0]; ptr++; *ptr = four_u16[1];
> +
> +	ptr = (u16 *)&seconds;
> +	*ptr = four_u16[2]; ptr++; *ptr = four_u16[3];
> +
> +	ts.tv_sec = seconds;
> +	ts.tv_nsec = nanoseconds;
> +
> +	return ts_to_ns(&ts);
> +}
> +
> +static u64 ts_to_ns(struct timespec64 *ts)
> +{
> +	return ((u64)ts->tv_sec * (u64)1000000000) + ts->tv_nsec;
> +}

Use timespec64_to_ns()

> +static void ns_to_ts(u64 time_stamp, struct timespec64 *ts)
> +{
> +	ts->tv_sec  = ( (u64)time_stamp / (u64)1000000000 );
> +	ts->tv_nsec = ( (u64)time_stamp % (u64)1000000000 );
> +}

Use ns_to_timespec64()

> diff --git a/drivers/net/phy/bcm54210pe_ptp.h b/drivers/net/phy/bcm54210pe_ptp.h
> new file mode 100755
> index 0000000000000..483dafc2d4514
> --- /dev/null
> +++ b/drivers/net/phy/bcm54210pe_ptp.h
> @@ -0,0 +1,111 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + *  drivers/net/phy/bcm54210pe_ptp.h
> + *
> +* IEEE1588 (PTP), perout and extts for BCM54210PE PHY
> + *
> + * Authors: Carlos Fernandez, Kyle Judd, Lasse Johnsen
> + * License: GPL
> + */
> +
> +#include <linux/ptp_clock_kernel.h>
> +#include <linux/list.h>
> +
> +#define CIRCULAR_BUFFER_COUNT 8
> +#define CIRCULAR_BUFFER_ITEM_COUNT 32
> +
> +#define SYNC_IN_PIN 0
> +#define SYNC_OUT_PIN 1
> +
> +#define SYNC_OUT_MODE_1 1
> +#define SYNC_OUT_MODE_2 2
> +
> +#define DIRECTION_RX 0
> +#define DIRECTION_TX 1
> +
> +struct bcm54210pe_ptp {
> +	struct ptp_clock_info caps;
> +	struct ptp_clock *ptp_clock;
> +	struct bcm54210pe_private *chosen;
> +};
> +
> +struct bcm54210pe_circular_buffer_item {
> +	struct list_head list;
> +
> +	u8 msg_type;
> +	u16 sequence_id;
> +	u64 time_stamp;
> +	bool is_valid;
> +};
> +
> +struct bcm54210pe_private {
> +	struct phy_device *phydev;
> +	struct bcm54210pe_ptp *ptp;
> +	struct mii_timestamper mii_ts;
> +	struct ptp_pin_desc sdp_config[2];
> +
> +	int ts_tx_config;
> +	int tx_rx_filter;
> +
> +	bool one_step;
> +	bool perout_en;
> +	bool extts_en;
> +
> +	int second_on_set;
> +
> +	int perout_mode;
> +	int perout_period;
> +	int perout_pulsewidth;
> +
> +	u64 last_extts_ts;
> +	u64 last_immediate_ts[2];
> +
> +	struct sk_buff_head tx_skb_queue;
> +	struct sk_buff_head rx_skb_queue;
> +
> +	struct bcm54210pe_circular_buffer_item
> +		circular_buffer_items[CIRCULAR_BUFFER_COUNT]
> +				     [CIRCULAR_BUFFER_ITEM_COUNT];
> +	struct list_head circular_buffers[CIRCULAR_BUFFER_COUNT];
> +
> +	struct work_struct txts_work, rxts_work;
> +	struct delayed_work perout_ws, extts_ws;
> +	struct mutex clock_lock, timestamp_buffer_lock;
> +
> +	int fib_sequence[10];
> +
> +	int fib_factor_rx;
> +	int fib_factor_tx;
> +
> +	int hwts_tx_en;
> +	int hwts_rx_en;
> +	int layer;
> +	int version;
> +};
> +
> +static bool bcm54210pe_rxtstamp(struct mii_timestamper *mii_ts, struct sk_buff *skb, int type);
> +static void bcm54210pe_txtstamp(struct mii_timestamper *mii_ts, struct sk_buff *skb, int type);
> +static void bcm54210pe_run_rx_timestamp_match_thread(struct work_struct *w);
> +static void bcm54210pe_run_tx_timestamp_match_thread(struct work_struct *w);
> +static void bcm54210pe_read_sop_time_register(struct bcm54210pe_private *private);
> +static bool bcm54210pe_fetch_timestamp(u8 txrx, u8 message_type, u16 seq_id, struct bcm54210pe_private *private, u64 *timestamp);
> +
> +static u16  bcm54210pe_get_base_nco6_reg(struct bcm54210pe_private *private, u16 val, bool do_nse_init);
> +static int  bcm54210pe_interrupts_enable(struct phy_device *phydev, bool fsync_en, bool sop_en);
> +static int  bcm54210pe_gettimex(struct ptp_clock_info *info, struct timespec64 *ts, struct ptp_system_timestamp *sts);
> +static int  bcm54210pe_get80bittime(struct bcm54210pe_private *private, struct timespec64 *ts, struct ptp_system_timestamp *sts);
> +static int  bcm54210pe_get48bittime(struct bcm54210pe_private *private, u64 *time_stamp);
> +static void bcm54210pe_read80bittime_register(struct phy_device *phydev, u64 *time_stamp_80, u64 *time_stamp_48);
> +static void bcm54210pe_read48bittime_register(struct phy_device *phydev, u64 *time_stamp);
> +
> +static int  bcm54210pe_perout_enable(struct bcm54210pe_private *private, s64 period, s64 pulsewidth, int on);
> +static void bcm54210pe_run_perout_mode_one_thread(struct work_struct *perout_ws);
> +
> +static int  bcm54210pe_extts_enable(struct bcm54210pe_private *private, int enable);
> +static void bcm54210pe_run_extts_thread(struct work_struct *extts_ws);
> +static void bcm54210pe_trigger_extts_event(struct bcm54210pe_private *private, u64 timestamp);
> +
> +static u64  convert_48bit_to_80bit(u64 second_on_set, u64 ts);
> +static u64  four_u16_to_ns(u16 *four_u16);
> +static u64  ts_to_ns(struct timespec64 *ts);
> +static void ns_to_ts(u64 time_stamp, struct timespec64 *ts);

Never put "static" prototypes into a header file.

Avoid static prototypes altogether.  Instead, order the functions
within the source file so that sub-functions appear before their call
sites.

> diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
> index 8b0ac38742d06..c8b79522cf3ad 100644
> --- a/drivers/net/phy/broadcom.c
> +++ b/drivers/net/phy/broadcom.c
> @@ -15,6 +15,11 @@
>  #include <linux/phy.h>
>  #include <linux/brcmphy.h>
>  #include <linux/of.h>
> +#include <linux/irq.h>
> +
> +#if IS_ENABLED (CONFIG_BCM54120PE_PHY)
> +extern int bcm54210pe_probe(struct phy_device *phydev);
> +#endif
>  
>  #define BRCM_PHY_MODEL(phydev) \
>  	((phydev)->drv->phy_id & (phydev)->drv->phy_id_mask)
> @@ -778,7 +783,20 @@ static struct phy_driver broadcom_drivers[] = {
>  	.config_init	= bcm54xx_config_init,
>  	.ack_interrupt	= bcm_phy_ack_intr,
>  	.config_intr	= bcm_phy_config_intr,
> -}, {
> +},
> +
> +#if IS_ENABLED (CONFIG_BCM54120PE_PHY)
> +{
> +	.phy_id		= PHY_ID_BCM54213PE,
> +	.phy_id_mask	= 0xffffffff,
> +        .name           = "Broadcom BCM54210PE",
> +        /* PHY_GBIT_FEATURES */
> +        .config_init    = bcm54xx_config_init,
> +        .ack_interrupt  = bcm_phy_ack_intr,
> +        .config_intr    = bcm_phy_config_intr,
> +	.probe		= bcm54210pe_probe,
> +#elif
> +{ 
>  	.phy_id		= PHY_ID_BCM54213PE,
>  	.phy_id_mask	= 0xffffffff,
>  	.name		= "Broadcom BCM54213PE",
> @@ -786,6 +804,7 @@ static struct phy_driver broadcom_drivers[] = {
>  	.config_init	= bcm54xx_config_init,
>  	.ack_interrupt	= bcm_phy_ack_intr,
>  	.config_intr	= bcm_phy_config_intr,
> +#endif
>  }, {
>  	.phy_id		= PHY_ID_BCM5461,
>  	.phy_id_mask	= 0xfffffff0,
> diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
> index 3e377f3c69e5d..975a62286a9c6 100644
> --- a/drivers/ptp/Kconfig
> +++ b/drivers/ptp/Kconfig
> @@ -87,6 +87,23 @@ config PTP_1588_CLOCK_INES
>  	  core.  This clock is only useful if the MII bus of your MAC
>  	  is wired up to the core.
>  
> + config BCM54120PE_PHY
> +	tristate "Add suport for ptp in bcm54210pe PHYs"
> +	depends on NETWORK_PHY_TIMESTAMPING
> +	depends on PHYLIB
> +	depends on PTP_1588_CLOCK
> +	depends on BCM_NET_PHYLIB
> +        select NET_PTP_CLASSIFY
> +	help
> +	  This driver adds support for using the BCM54210PE as a PTP
> +	  clock. This clock is only useful if your PTP programs are
> +	  getting hardware time stamps on the PTP Ethernet packets
> +	  using the SO_TIMESTAMPING API.
> +
> +	  In order for this to work, your MAC driver must also
> +	  implement the skb_tx_timestamp() function.
> +
> +
>  config PTP_1588_CLOCK_PCH
>  	tristate "Intel PCH EG20T as PTP clock"
>  	depends on X86_32 || COMPILE_TEST


^ permalink raw reply

* Re: [PATCH bpf-next 0/2] xsk: remove reduntant 'falltrough' attributes
From: patchwork-bot+netdevbpf @ 2022-04-21 14:40 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, sfr, andrii, netdev, magnus.karlsson,
	linux-next
In-Reply-To: <20220421132126.471515-1-maciej.fijalkowski@intel.com>

Hello:

This series was applied to bpf/bpf-next.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Thu, 21 Apr 2022 15:21:24 +0200 you wrote:
> This is a follow-up to recently applied set [0] to fix the build
> warnings:
> 
> error: attribute 'fallthrough' not preceding a case label or default
> label [-Werror]
> 
> that Stephen has stumbled upon when merging bpf-next to linux-next.
> Apologies for these leftovers.
> 
> [...]

Here is the summary with links:
  - [bpf-next,1/2] ixgbe: xsk: get rid of redundant 'fallthrough'
    https://git.kernel.org/bpf/bpf-next/c/e130e8d5434b
  - [bpf-next,2/2] i40e: xsk: get rid of redundant 'fallthrough'
    https://git.kernel.org/bpf/bpf-next/c/9d87e41a6d64

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH] vDPA/ifcvf: allow userspace to suspend a queue
From: Zhou Furong @ 2022-04-21 14:27 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst; +Cc: virtualization, netdev
In-Reply-To: <f224fc38-327c-6dac-d2cb-93f8b4ac5b82@intel.com>



On 2022/4/21 19:05, Zhu, Lingshan wrote:
> 
> 
> On 4/12/2022 2:55 PM, Zhou Furong wrote:
>> Hi,
>>
>>> +bool ifcvf_get_vq_ready(struct ifcvf_hw *hw, u16 qid)
>>> +{
>>> +    struct virtio_pci_common_cfg __iomem *cfg = hw->common_cfg;
>>> +    bool queue_enable;
>>> +
>>> +    vp_iowrite16(qid, &cfg->queue_select);
>>> +    queue_enable = vp_ioread16(&cfg->queue_enable);
>>> +
>>> +    return (bool)queue_enable;
>> queue_enable is bool, why cast? looks like remove the variable is better.
>> return vp_ioread16(&cfg->queue_enable);
> Hi, thanks for your comments. This cast tries to make some static 
> checkers happy.
> Please correct me if I misunderstood it:
> vp_ioread16() returns an u16, and bool is not an one bit variable, it is 
> C99 _Bool.
> C99 defines _Bool to be true(expends to int 1) or false(expends to int 0).
> I think this implies a bool is actually capable to store an u16.
> so I am afraid some checkers may complain about "queue_enable = 
> vp_ioread16(&cfg->queue_enable);",
> a cast here can help us silence the checkers.
> 
> see http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf section 7.16
> and https://www.kernel.org/doc/Documentation/process/coding-style.rst 
> section 17
> 
> Thanks,
> Zhu Lingshan
>>
bool type contains 1 bit 1/0 only, any number assign to a boolean 
variable will auto cast to 1 or 0.

if you want make static checker happy(if there is)
queue_enable = (bool)vp_ioread16(&cfg->queue_enable);
return queue_enable;

or
return (bool)vp_ioread16(&cfg->queue_enable);




>>>   static bool ifcvf_vdpa_get_vq_ready(struct vdpa_device *vdpa_dev, 
>>> u16 qid)
>>>   {
>>>       struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev);
>>> +    bool ready;
>>>   -    return vf->vring[qid].ready;
>>> +    ready = ifcvf_get_vq_ready(vf, qid);
>>> +
>>> +    return ready;
>> remove ready looks better
>> return ifcvf_get_vq_ready(vf, qid);
>>
>>
>> Best regards,
>> Furong
> 

^ permalink raw reply

* Re: [PATCH v2 3/3] ARM: dts: imx6qdl-sr-som: update phy configuration for som revision 1.9
From: Russell King (Oracle) @ 2022-04-21 14:20 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Josua Mayer, netdev, alvaro.karsz, Rob Herring,
	Krzysztof Kozlowski, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, NXP Linux Team
In-Reply-To: <YmFcfhzOmi1GwTvS@lunn.ch>

On Thu, Apr 21, 2022 at 03:30:38PM +0200, Andrew Lunn wrote:
> > The only other ways around this that I can see would be to have some
> > way to flag in DT that the PHYs are "optional" - if they're not found
> > while probing the hardware, then don't whinge about them. Or have
> > u-boot discover which address the PHY is located, and update the DT
> > blob passed to the kernel to disable the PHY addresses that aren't
> > present. Or edit the DT to update the node name and reg property. Or
> > something along those lines.
> 
> uboot sounds like the best option. I don't know if we currently
> support the status property for PHYs. Maybe the .dtsi file should have
> them all status = "disabled"; and uboot can flip the populated ones to
> "okay". Or maybe the other way around to handle older bootloaders.

... which would immediately regress the networking on all SolidRun iMX6
platforms when booting "new" DT with existing u-boot, so clearly that
isn't a possible solution.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [PATCH net-next] net/af_packet: add VLAN support for AF_PACKET SOCK_RAW GSO
From: Willem de Bruijn @ 2022-04-21 14:15 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: Willem de Bruijn, netdev, Michael S . Tsirkin, Jason Wang,
	David S . Miller, Jakub Kicinski, Paolo Abeni, Maxim Mikityanskiy,
	virtualization, Balazs Nemeth, Mike Pattrick, Eric Dumazet
In-Reply-To: <YmDCHI330AUfcYKa@Laptop-X1>

On Wed, Apr 20, 2022 at 10:32 PM Hangbin Liu <liuhangbin@gmail.com> wrote:
>
> On Wed, Apr 20, 2022 at 09:45:15AM -0400, Willem de Bruijn wrote:
> > On Wed, Apr 20, 2022 at 4:28 AM Hangbin Liu <liuhangbin@gmail.com> wrote:
> > >
> > > Currently, the kernel drops GSO VLAN tagged packet if it's created with
> > > socket(AF_PACKET, SOCK_RAW, 0) plus virtio_net_hdr.
> > >
> > > The reason is AF_PACKET doesn't adjust the skb network header if there is
> > > a VLAN tag. Then after virtio_net_hdr_set_proto() called, the skb->protocol
> > > will be set to ETH_P_IP/IPv6. And in later inet/ipv6_gso_segment() the skb
> > > is dropped as network header position is invalid.
> > >
> > > Let's handle VLAN packets by adjusting network header position in
> > > packet_parse_headers(), and move the function in packet_snd() before
> > > calling virtio_net_hdr_set_proto().
> >
> > The network header is set in
> >
> >         skb_reset_network_header(skb);
> >
> >         err = -EINVAL;
> >         if (sock->type == SOCK_DGRAM) {
> >                 offset = dev_hard_header(skb, dev, ntohs(proto), addr,
> > NULL, len);
> >                 if (unlikely(offset < 0))
> >                         goto out_free;
> >         } else if (reserve) {
> >                 skb_reserve(skb, -reserve);
> >                 if (len < reserve + sizeof(struct ipv6hdr) &&
> >                     dev->min_header_len != dev->hard_header_len)
> >                         skb_reset_network_header(skb);
> >         }
> >
> > If all that is needed is to move the network header beyond an optional
> > VLAN tag in the case of SOCK_RAW, then this can be done in the else
> > for Ethernet packets.
>
> Before we set network position, we need to check the skb->protocol to make
> sure it's a VLAN packet.
>
> If we set skb->protocol and adjust network header here, like
> packet_parse_headers() does. How should we do with later
>
>         skb->protocol = proto;
>         skb->dev = dev;
>
> settings?
>
> If you are afraid that skb_probe_transport_header(skb) would affect the
> virtio_net_hdr operation. How about split the skb_probe_transport_header()
> from packet_parse_headers()? Something like
>
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1924,13 +1924,19 @@ static int packet_rcv_spkt(struct sk_buff *skb, struct net_device *dev,
>
>  static void packet_parse_headers(struct sk_buff *skb, struct socket *sock)
>  {
> +       int depth;
> +
>         if ((!skb->protocol || skb->protocol == htons(ETH_P_ALL)) &&
>             sock->type == SOCK_RAW) {
>                 skb_reset_mac_header(skb);
>                 skb->protocol = dev_parse_header_protocol(skb);
>         }
>
> -       skb_probe_transport_header(skb);
> +       /* Move network header to the right position for VLAN tagged packets */
> +       if (likely(skb->dev->type == ARPHRD_ETHER) &&
> +           eth_type_vlan(skb->protocol) &&
> +           __vlan_get_protocol(skb, skb->protocol, &depth) != 0)
> +               skb_set_network_header(skb, depth);
>  }
>
>  /*
> @@ -3047,6 +3055,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>         skb->mark = sockc.mark;
>         skb->tstamp = sockc.transmit_time;
>
> +       packet_parse_headers(skb, sock);
> +
>         if (has_vnet_hdr) {
>                 err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
>                 if (err)
> @@ -3055,7 +3065,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>                 virtio_net_hdr_set_proto(skb, &vnet_hdr);
>         }
>
> -       packet_parse_headers(skb, sock);
> +       skb_probe_transport_header(skb);
>
>         if (unlikely(extra_len == 4))
>                 skb->no_fcs = 1;
>
>
> >
> > It is not safe to increase reserve, as that would eat into the
> > reserved hlen LL_RESERVED_SPACE(dev), which does not account for
> > optional VLAN headers.
> >
> > Instead of setting here first, then patching up again later in
> > packet_parse_headers.
> >
> > This change affects all packets with VLAN headers, not just those with
> > GSO. I imagine that moving the network header is safe for all, but
> > don't know that code well enough to verify that it does not have
> > unintended side effects. Does dev_queue_xmit expect the network header
> > to point to the start of the VLAN headers or after, for instance?
>
> I think adjust the network position should be safe, as tap device also did that.

Oh, it's great to be able to point to such prior art. Can you mention
that in the commit message?

Your approach does sound simpler than the above. Thanks for looking
into that alternative, though.

>
> Thanks
> Hangbin

^ permalink raw reply

* Re: [PATCH 3/5] hv_sock: Add validation for untrusted Hyper-V values
From: Stefano Garzarella @ 2022-04-21 14:08 UTC (permalink / raw)
  To: Andrea Parri (Microsoft)
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, Michael Kelley, David Miller, Jakub Kicinski,
	Paolo Abeni, linux-hyperv, virtualization, netdev, linux-kernel
In-Reply-To: <20220420200720.434717-4-parri.andrea@gmail.com>

On Wed, Apr 20, 2022 at 10:07:18PM +0200, Andrea Parri (Microsoft) wrote:
>For additional robustness in the face of Hyper-V errors or malicious
>behavior, validate all values that originate from packets that Hyper-V
>has sent to the guest in the host-to-guest ring buffer.  Ensure that
>invalid values cannot cause data being copied out of the bounds of the
>source buffer in hvs_stream_dequeue().
>
>Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
>---
> include/linux/hyperv.h           |  5 +++++
> net/vmw_vsock/hyperv_transport.c | 11 +++++++++--
> 2 files changed, 14 insertions(+), 2 deletions(-)
>
>diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
>index fe2e0179ed51e..55478a6810b60 100644
>--- a/include/linux/hyperv.h
>+++ b/include/linux/hyperv.h
>@@ -1663,6 +1663,11 @@ static inline u32 hv_pkt_datalen(const struct vmpacket_descriptor *desc)
> 	return (desc->len8 << 3) - (desc->offset8 << 3);
> }
>
>+/* Get packet length associated with descriptor */
>+static inline u32 hv_pkt_len(const struct vmpacket_descriptor *desc)
>+{
>+	return desc->len8 << 3;
>+}
>
> struct vmpacket_descriptor *
> hv_pkt_iter_first_raw(struct vmbus_channel *channel);
>diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
>index 8c37d07017fc4..092cadc2c866d 100644
>--- a/net/vmw_vsock/hyperv_transport.c
>+++ b/net/vmw_vsock/hyperv_transport.c
>@@ -577,12 +577,19 @@ static bool hvs_dgram_allow(u32 cid, u32 port)
> static int hvs_update_recv_data(struct hvsock *hvs)
> {
> 	struct hvs_recv_buf *recv_buf;
>-	u32 payload_len;
>+	u32 pkt_len, payload_len;
>+
>+	pkt_len = hv_pkt_len(hvs->recv_desc);
>+
>+	/* Ensure the packet is big enough to read its header */
>+	if (pkt_len < HVS_HEADER_LEN)
>+		return -EIO;
>
> 	recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
> 	payload_len = recv_buf->hdr.data_size;
>
>-	if (payload_len > HVS_MTU_SIZE)
>+	/* Ensure the packet is big enough to read its payload */
>+	if (payload_len > pkt_len - HVS_HEADER_LEN || payload_len > HVS_MTU_SIZE)

checkpatch warns that we exceed 80 characters, I do not have a strong 
opinion on this, but if you have to resend better break the condition 
into 2 lines.

Maybe even update or remove the comment? (it only describes the first 
condition, but the conditions are pretty clear, so I don't think it adds 
much).

Thanks,
Stefano


^ permalink raw reply

* [PATCH net] net: Use this_cpu_inc() to increment net->core_stats
From: Sebastian Andrzej Siewior @ 2022-04-21 14:00 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Thomas Gleixner, Peter Zijlstra

The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
netdev_core_stats_alloc() to return a per-CPU pointer.
netdev_core_stats_alloc() will allocate memory on its first invocation
which breaks on PREEMPT_RT because it requires non-atomic context for
memory allocation.

This can be avoided by enabling preemption in netdev_core_stats_alloc()
assuming the caller always disables preemption.

It might be better to replace local_inc() with this_cpu_inc() now that
dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
not rely on already disabled preemption. This results in less
instructions on x86-64:
local_inc:
|          incl %gs:__preempt_count(%rip)  # __preempt_count
|          movq    488(%rdi), %rax # _1->core_stats, _22
|          testq   %rax, %rax      # _22
|          je      .L585   #,
|          add %gs:this_cpu_off(%rip), %rax        # this_cpu_off, tcp_ptr__
|  .L586:
|          testq   %rax, %rax      # _27
|          je      .L587   #,
|          incq (%rax)            # _6->a.counter
|  .L587:
|          decl %gs:__preempt_count(%rip)  # __preempt_count

this_cpu_inc(), this patch:
|         movq    488(%rdi), %rax # _1->core_stats, _5
|         testq   %rax, %rax      # _5
|         je      .L591   #,
| .L585:
|         incq %gs:(%rax) # _18->rx_dropped

Use unsigned long as type for the counter. Use this_cpu_inc() to
increment the counter. Use a plain read of the counter.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/netdevice.h | 17 +++++++----------
 net/core/dev.c            | 14 +++++---------
 2 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 59e27a2b7bf04..0009112b23767 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -199,10 +199,10 @@ struct net_device_stats {
  * Try to fit them in a single cache line, for dev_get_stats() sake.
  */
 struct net_device_core_stats {
-	local_t		rx_dropped;
-	local_t		tx_dropped;
-	local_t		rx_nohandler;
-} __aligned(4 * sizeof(local_t));
+	unsigned long	rx_dropped;
+	unsigned long	tx_dropped;
+	unsigned long	rx_nohandler;
+} __aligned(4 * sizeof(unsigned long));
 
 #include <linux/cache.h>
 #include <linux/skbuff.h>
@@ -3843,7 +3843,7 @@ static __always_inline bool __is_skb_forwardable(const struct net_device *dev,
 	return false;
 }
 
-struct net_device_core_stats *netdev_core_stats_alloc(struct net_device *dev);
+struct net_device_core_stats __percpu *netdev_core_stats_alloc(struct net_device *dev);
 
 static inline struct net_device_core_stats *dev_core_stats(struct net_device *dev)
 {
@@ -3851,7 +3851,7 @@ static inline struct net_device_core_stats *dev_core_stats(struct net_device *de
 	struct net_device_core_stats __percpu *p = READ_ONCE(dev->core_stats);
 
 	if (likely(p))
-		return this_cpu_ptr(p);
+		return p;
 
 	return netdev_core_stats_alloc(dev);
 }
@@ -3861,12 +3861,9 @@ static inline void dev_core_stats_##FIELD##_inc(struct net_device *dev)		\
 {										\
 	struct net_device_core_stats *p;					\
 										\
-	preempt_disable();							\
 	p = dev_core_stats(dev);						\
-										\
 	if (p)									\
-		local_inc(&p->FIELD);						\
-	preempt_enable();							\
+		this_cpu_inc(p->FIELD);						\
 }
 DEV_CORE_STATS_INC(rx_dropped)
 DEV_CORE_STATS_INC(tx_dropped)
diff --git a/net/core/dev.c b/net/core/dev.c
index 9ec51c1d77cf4..bf6026158874e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10309,7 +10309,7 @@ void netdev_stats_to_stats64(struct rtnl_link_stats64 *stats64,
 }
 EXPORT_SYMBOL(netdev_stats_to_stats64);
 
-struct net_device_core_stats *netdev_core_stats_alloc(struct net_device *dev)
+struct net_device_core_stats __percpu *netdev_core_stats_alloc(struct net_device *dev)
 {
 	struct net_device_core_stats __percpu *p;
 
@@ -10320,11 +10320,7 @@ struct net_device_core_stats *netdev_core_stats_alloc(struct net_device *dev)
 		free_percpu(p);
 
 	/* This READ_ONCE() pairs with the cmpxchg() above */
-	p = READ_ONCE(dev->core_stats);
-	if (!p)
-		return NULL;
-
-	return this_cpu_ptr(p);
+	return READ_ONCE(dev->core_stats);
 }
 EXPORT_SYMBOL(netdev_core_stats_alloc);
 
@@ -10361,9 +10357,9 @@ struct rtnl_link_stats64 *dev_get_stats(struct net_device *dev,
 
 		for_each_possible_cpu(i) {
 			core_stats = per_cpu_ptr(p, i);
-			storage->rx_dropped += local_read(&core_stats->rx_dropped);
-			storage->tx_dropped += local_read(&core_stats->tx_dropped);
-			storage->rx_nohandler += local_read(&core_stats->rx_nohandler);
+			storage->rx_dropped += core_stats->rx_dropped;
+			storage->tx_dropped += core_stats->tx_dropped;
+			storage->rx_nohandler += core_stats->rx_nohandler;
 		}
 	}
 	return storage;
-- 
2.35.2


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox