Netdev List
 help / color / mirror / Atom feed
* RE: [PATCH v1 net-next 4/4] net: stmmac: setup higher frequency clk support for EHL & TGL
From: Voon, Weifeng @ 2019-08-27 10:38 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli
  Cc: David S. Miller, Maxime Coquelin, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Jose Abreu, Giuseppe Cavallaro,
	Alexandre Torgue, Ong, Boon Leong
In-Reply-To: <20190826201346.GJ2168@lunn.ch>

> > > +#include <linux/clk-provider.h>
> > >  #include <linux/pci.h>
> > >  #include <linux/dmi.h>
> > >
> > > @@ -174,6 +175,19 @@ static int intel_mgbe_common_data(struct
> pci_dev *pdev,
> > >  	plat->axi->axi_blen[1] = 8;
> > >  	plat->axi->axi_blen[2] = 16;
> > >
> > > +	plat->ptp_max_adj = plat->clk_ptp_rate;
> > > +
> > > +	/* Set system clock */
> > > +	plat->stmmac_clk = clk_register_fixed_rate(&pdev->dev,
> > > +						   "stmmac-clk", NULL, 0,
> > > +						   plat->clk_ptp_rate);
> > > +
> > > +	if (IS_ERR(plat->stmmac_clk)) {
> > > +		dev_warn(&pdev->dev, "Fail to register stmmac-clk\n");
> > > +		plat->stmmac_clk = NULL;
> >
> > Don't you need to propagate at least EPROBE_DEFER here?
> 
> Hi Florian
> 
> Isn't a fixed rate clock a complete fake. There is no hardware behind it.
> So can it return EPROBE_DEFER?
> 
>     Andrew

Yes, there is no hardware behind it. So, I don't think we need to deferred probe
and a warning message should be sufficient. Anyhow, please point it out if I miss
out anything.

Thanks. 


^ permalink raw reply

* Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot
From: Pablo Neira Ayuso @ 2019-08-27 10:35 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: netfilter-devel, coreteam, netdev, linux-kernel, Jozsef Kadlecsik,
	Florian Westphal, David S. Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI
In-Reply-To: <20190821141505.2394-1-leonardo@linux.ibm.com>

On Wed, Aug 21, 2019 at 11:15:06AM -0300, Leonardo Bras wrote:
> If IPv6 is disabled on boot (ipv6.disable=1), but nft_fib_inet ends up
> dealing with a IPv6 package, it causes a kernel panic in
> fib6_node_lookup_1(), crashing in bad_page_fault.

Q: How do you get to see IPv6 packets if IPv6 module is disable?

> The panic is caused by trying to deference a very low address (0x38
> in ppc64le), due to ipv6.fib6_main_tbl = NULL.
> BUG: Kernel NULL pointer dereference at 0x00000038
> 
> Fix this behavior by dropping IPv6 packages if !ipv6_mod_enabled().

I'd suggest: s/package/packet/

[...]
> diff --git a/net/ipv6/netfilter/nft_fib_ipv6.c b/net/ipv6/netfilter/nft_fib_ipv6.c
> index 7ece86afd079..75acc417e2ff 100644
> --- a/net/ipv6/netfilter/nft_fib_ipv6.c
> +++ b/net/ipv6/netfilter/nft_fib_ipv6.c
> @@ -125,6 +125,11 @@ void nft_fib6_eval_type(const struct nft_expr *expr, struct nft_regs *regs,
>  	u32 *dest = &regs->data[priv->dreg];
>  	struct ipv6hdr *iph, _iph;
>  
> +	if (!ipv6_mod_enabled()) {
> +		regs->verdict.code = NF_DROP;

NFT_BREAK instead to stop evaluating this rule, this results in a
mismatch, so you let the user decide what to do with packets that do
not match your policy.

The drop case at the bottom of the fib eval function never actually
never happens.

^ permalink raw reply

* [PATCH net] mld: fix memory leak in mld_del_delrec()
From: Eric Dumazet @ 2019-08-27 10:33 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet, syzbot

Similar to the fix done for IPv4 in commit e5b1c6c6277d
("igmp: fix memory leak in igmpv3_del_delrec()"), we need to
make sure mca_tomb and mca_sources are not blindly overwritten.

Using swap() then a call to ip6_mc_clear_src() will take care
of the missing free.

BUG: memory leak
unreferenced object 0xffff888117d9db00 (size 64):
  comm "syz-executor247", pid 6918, jiffies 4294943989 (age 25.350s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 fe 88 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000005b463030>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
    [<000000005b463030>] slab_post_alloc_hook mm/slab.h:522 [inline]
    [<000000005b463030>] slab_alloc mm/slab.c:3319 [inline]
    [<000000005b463030>] kmem_cache_alloc_trace+0x145/0x2c0 mm/slab.c:3548
    [<00000000939cbf94>] kmalloc include/linux/slab.h:552 [inline]
    [<00000000939cbf94>] kzalloc include/linux/slab.h:748 [inline]
    [<00000000939cbf94>] ip6_mc_add1_src net/ipv6/mcast.c:2236 [inline]
    [<00000000939cbf94>] ip6_mc_add_src+0x31f/0x420 net/ipv6/mcast.c:2356
    [<00000000d8972221>] ip6_mc_source+0x4a8/0x600 net/ipv6/mcast.c:449
    [<000000002b203d0d>] do_ipv6_setsockopt.isra.0+0x1b92/0x1dd0 net/ipv6/ipv6_sockglue.c:748
    [<000000001f1e2d54>] ipv6_setsockopt+0x89/0xd0 net/ipv6/ipv6_sockglue.c:944
    [<00000000c8f7bdf9>] udpv6_setsockopt+0x4e/0x90 net/ipv6/udp.c:1558
    [<000000005a9a0c5e>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3139
    [<00000000910b37b2>] __sys_setsockopt+0x10f/0x220 net/socket.c:2084
    [<00000000e9108023>] __do_sys_setsockopt net/socket.c:2100 [inline]
    [<00000000e9108023>] __se_sys_setsockopt net/socket.c:2097 [inline]
    [<00000000e9108023>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2097
    [<00000000f4818160>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
    [<000000008d367e8f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when set link down")
Fixes: 9c8bb163ae78 ("igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
 net/ipv6/mcast.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 7f3f13c3791636f92c5c46aa6235c69682dea151..eaa4c2cc2fbb2e784b5641135d28dd38a0616421 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -787,14 +787,15 @@ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im)
 	if (pmc) {
 		im->idev = pmc->idev;
 		if (im->mca_sfmode == MCAST_INCLUDE) {
-			im->mca_tomb = pmc->mca_tomb;
-			im->mca_sources = pmc->mca_sources;
+			swap(im->mca_tomb, pmc->mca_tomb);
+			swap(im->mca_sources, pmc->mca_sources);
 			for (psf = im->mca_sources; psf; psf = psf->sf_next)
 				psf->sf_crcount = idev->mc_qrv;
 		} else {
 			im->mca_crcount = idev->mc_qrv;
 		}
 		in6_dev_put(pmc->idev);
+		ip6_mc_clear_src(pmc);
 		kfree(pmc);
 	}
 	spin_unlock_bh(&im->mca_lock);
-- 
2.23.0.187.g17f5b7556c-goog


^ permalink raw reply related

* Re: [PATCH net v2] net/sched: pfifo_fast: fix wrong dereference when qdisc is reset
From: Paolo Abeni @ 2019-08-27 10:30 UTC (permalink / raw)
  To: Davide Caratti, Cong Wang, Jamal Hadi Salim, Jiri Pirko,
	David S. Miller, netdev
  Cc: Stefano Brivio, Li Shuang
In-Reply-To: <783231162b9d32faaf5df34ad8ad437b0031bd31.1566901438.git.dcaratti@redhat.com>

On Tue, 2019-08-27 at 12:29 +0200, Davide Caratti wrote:
> Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of
> 'TCQ_F_NOLOCK' bit in the parent qdisc, we need to be sure that per-cpu
> counters are present when 'reset()' is called for pfifo_fast qdiscs.
> Otherwise, the following script:
> 
>  # tc q a dev lo handle 1: root htb default 100
>  # tc c a dev lo parent 1: classid 1:100 htb \
>  > rate 95Mbit ceil 100Mbit burst 64k
>  [...]
>  # tc f a dev lo parent 1: protocol arp basic classid 1:100
>  [...]
>  # tc q a dev lo parent 1:100 handle 100: pfifo_fast
>  [...]
>  # tc q d dev lo root
> 
> can generate the following splat:
> 
>  Unable to handle kernel paging request at virtual address dfff2c01bd148000
>  Mem abort info:
>    ESR = 0x96000004
>    Exception class = DABT (current EL), IL = 32 bits
>    SET = 0, FnV = 0
>    EA = 0, S1PTW = 0
>  Data abort info:
>    ISV = 0, ISS = 0x00000004
>    CM = 0, WnR = 0
>  [dfff2c01bd148000] address between user and kernel address ranges
>  Internal error: Oops: 96000004 [#1] SMP
>  [...]
>  pstate: 80000005 (Nzcv daif -PAN -UAO)
>  pc : pfifo_fast_reset+0x280/0x4d8
>  lr : pfifo_fast_reset+0x21c/0x4d8
>  sp : ffff800d09676fa0
>  x29: ffff800d09676fa0 x28: ffff200012ee22e4
>  x27: dfff200000000000 x26: 0000000000000000
>  x25: ffff800ca0799958 x24: ffff1001940f332b
>  x23: 0000000000000007 x22: ffff200012ee1ab8
>  x21: 0000600de8a40000 x20: 0000000000000000
>  x19: ffff800ca0799900 x18: 0000000000000000
>  x17: 0000000000000002 x16: 0000000000000000
>  x15: 0000000000000000 x14: 0000000000000000
>  x13: 0000000000000000 x12: ffff1001b922e6e2
>  x11: 1ffff001b922e6e1 x10: 0000000000000000
>  x9 : 1ffff001b922e6e1 x8 : dfff200000000000
>  x7 : 0000000000000000 x6 : 0000000000000000
>  x5 : 1fffe400025dc45c x4 : 1fffe400025dc357
>  x3 : 00000c01bd148000 x2 : 0000600de8a40000
>  x1 : 0000000000000007 x0 : 0000600de8a40004
>  Call trace:
>   pfifo_fast_reset+0x280/0x4d8
>   qdisc_reset+0x6c/0x370
>   htb_reset+0x150/0x3b8 [sch_htb]
>   qdisc_reset+0x6c/0x370
>   dev_deactivate_queue.constprop.5+0xe0/0x1a8
>   dev_deactivate_many+0xd8/0x908
>   dev_deactivate+0xe4/0x190
>   qdisc_graft+0x88c/0xbd0
>   tc_get_qdisc+0x418/0x8a8
>   rtnetlink_rcv_msg+0x3a8/0xa78
>   netlink_rcv_skb+0x18c/0x328
>   rtnetlink_rcv+0x28/0x38
>   netlink_unicast+0x3c4/0x538
>   netlink_sendmsg+0x538/0x9a0
>   sock_sendmsg+0xac/0xf8
>   ___sys_sendmsg+0x53c/0x658
>   __sys_sendmsg+0xc8/0x140
>   __arm64_sys_sendmsg+0x74/0xa8
>   el0_svc_handler+0x164/0x468
>   el0_svc+0x10/0x14
>  Code: 910012a0 92400801 d343fc03 11000c21 (38fb6863)
> 
> Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags',
> before dereferencing 'qdisc->cpu_qstats'.
> 
> Changes since v1:
>  - coding style improvements, thanks to Stefano Brivio
> 
> Fixes: 8a53e616de29 ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too")
> CC: Paolo Abeni <pabeni@redhat.com>
> Reported-by: Li Shuang <shuali@redhat.com>
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>
> ---
>  net/sched/sch_generic.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 11c03cf4aa74..099797e5409d 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -688,11 +688,14 @@ static void pfifo_fast_reset(struct Qdisc *qdisc)
>  			kfree_skb(skb);
>  	}
>  
> -	for_each_possible_cpu(i) {
> -		struct gnet_stats_queue *q = per_cpu_ptr(qdisc->cpu_qstats, i);
> +	if (qdisc_is_percpu_stats(qdisc)) {
> +		for_each_possible_cpu(i) {
> +			struct gnet_stats_queue *q;
>  
> -		q->backlog = 0;
> -		q->qlen = 0;
> +			q = per_cpu_ptr(qdisc->cpu_qstats, i);
> +			q->backlog = 0;
> +			q->qlen = 0;
> +		}
>  	}
>  }

Thanks for fixing this Davide!

Acked-by: Paolo Abeni <pabeni@redhat.com>


^ permalink raw reply

* Re: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs
From: Cornelia Huck @ 2019-08-27 10:29 UTC (permalink / raw)
  To: Parav Pandit
  Cc: alex.williamson, jiri, kwankhede, davem, kvm, linux-kernel,
	netdev
In-Reply-To: <20190826204119.54386-3-parav@mellanox.com>

On Mon, 26 Aug 2019 15:41:17 -0500
Parav Pandit <parav@mellanox.com> wrote:

> Mdev alias should be unique among all the mdevs, so that when such alias
> is used by the mdev users to derive other objects, there is no
> collision in a given system.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index e825ff38b037..6eb37f0c6369 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -375,6 +375,11 @@ int mdev_device_create(struct kobject *kobj, struct device *dev,
>  			ret = -EEXIST;
>  			goto mdev_fail;
>  		}
> +		if (tmp->alias && strcmp(tmp->alias, alias) == 0) {

Any way we can relay to the caller that the uuid was fine, but that we
had a hash collision? Duplicate uuids are much more obvious than a
collision here.

> +			mutex_unlock(&mdev_list_lock);
> +			ret = -EEXIST;
> +			goto mdev_fail;
> +		}
>  	}
>  
>  	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);


^ permalink raw reply

* [PATCH net v2] net/sched: pfifo_fast: fix wrong dereference when qdisc is reset
From: Davide Caratti @ 2019-08-27 10:29 UTC (permalink / raw)
  To: Cong Wang, Jamal Hadi Salim, Jiri Pirko, David S. Miller, netdev
  Cc: Paolo Abeni, Stefano Brivio, Li Shuang

Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of
'TCQ_F_NOLOCK' bit in the parent qdisc, we need to be sure that per-cpu
counters are present when 'reset()' is called for pfifo_fast qdiscs.
Otherwise, the following script:

 # tc q a dev lo handle 1: root htb default 100
 # tc c a dev lo parent 1: classid 1:100 htb \
 > rate 95Mbit ceil 100Mbit burst 64k
 [...]
 # tc f a dev lo parent 1: protocol arp basic classid 1:100
 [...]
 # tc q a dev lo parent 1:100 handle 100: pfifo_fast
 [...]
 # tc q d dev lo root

can generate the following splat:

 Unable to handle kernel paging request at virtual address dfff2c01bd148000
 Mem abort info:
   ESR = 0x96000004
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 [dfff2c01bd148000] address between user and kernel address ranges
 Internal error: Oops: 96000004 [#1] SMP
 [...]
 pstate: 80000005 (Nzcv daif -PAN -UAO)
 pc : pfifo_fast_reset+0x280/0x4d8
 lr : pfifo_fast_reset+0x21c/0x4d8
 sp : ffff800d09676fa0
 x29: ffff800d09676fa0 x28: ffff200012ee22e4
 x27: dfff200000000000 x26: 0000000000000000
 x25: ffff800ca0799958 x24: ffff1001940f332b
 x23: 0000000000000007 x22: ffff200012ee1ab8
 x21: 0000600de8a40000 x20: 0000000000000000
 x19: ffff800ca0799900 x18: 0000000000000000
 x17: 0000000000000002 x16: 0000000000000000
 x15: 0000000000000000 x14: 0000000000000000
 x13: 0000000000000000 x12: ffff1001b922e6e2
 x11: 1ffff001b922e6e1 x10: 0000000000000000
 x9 : 1ffff001b922e6e1 x8 : dfff200000000000
 x7 : 0000000000000000 x6 : 0000000000000000
 x5 : 1fffe400025dc45c x4 : 1fffe400025dc357
 x3 : 00000c01bd148000 x2 : 0000600de8a40000
 x1 : 0000000000000007 x0 : 0000600de8a40004
 Call trace:
  pfifo_fast_reset+0x280/0x4d8
  qdisc_reset+0x6c/0x370
  htb_reset+0x150/0x3b8 [sch_htb]
  qdisc_reset+0x6c/0x370
  dev_deactivate_queue.constprop.5+0xe0/0x1a8
  dev_deactivate_many+0xd8/0x908
  dev_deactivate+0xe4/0x190
  qdisc_graft+0x88c/0xbd0
  tc_get_qdisc+0x418/0x8a8
  rtnetlink_rcv_msg+0x3a8/0xa78
  netlink_rcv_skb+0x18c/0x328
  rtnetlink_rcv+0x28/0x38
  netlink_unicast+0x3c4/0x538
  netlink_sendmsg+0x538/0x9a0
  sock_sendmsg+0xac/0xf8
  ___sys_sendmsg+0x53c/0x658
  __sys_sendmsg+0xc8/0x140
  __arm64_sys_sendmsg+0x74/0xa8
  el0_svc_handler+0x164/0x468
  el0_svc+0x10/0x14
 Code: 910012a0 92400801 d343fc03 11000c21 (38fb6863)

Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags',
before dereferencing 'qdisc->cpu_qstats'.

Changes since v1:
 - coding style improvements, thanks to Stefano Brivio

Fixes: 8a53e616de29 ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too")
CC: Paolo Abeni <pabeni@redhat.com>
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/sch_generic.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 11c03cf4aa74..099797e5409d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -688,11 +688,14 @@ static void pfifo_fast_reset(struct Qdisc *qdisc)
 			kfree_skb(skb);
 	}
 
-	for_each_possible_cpu(i) {
-		struct gnet_stats_queue *q = per_cpu_ptr(qdisc->cpu_qstats, i);
+	if (qdisc_is_percpu_stats(qdisc)) {
+		for_each_possible_cpu(i) {
+			struct gnet_stats_queue *q;
 
-		q->backlog = 0;
-		q->qlen = 0;
+			q = per_cpu_ptr(qdisc->cpu_qstats, i);
+			q->backlog = 0;
+			q->qlen = 0;
+		}
 	}
 }
 
-- 
2.20.1


^ permalink raw reply related

* RE: [PATCH v1 net-next] net: phy: mdio_bus: make mdiobus_scan also cover PHY that only talks C45
From: Voon, Weifeng @ 2019-08-27 10:25 UTC (permalink / raw)
  To: David Miller
  Cc: mcoquelin.stm32@gmail.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, joabreu@synopsys.com,
	andrew@lunn.ch, f.fainelli@gmail.com, hkallweit1@gmail.com,
	Ong, Boon Leong
In-Reply-To: <20190826.142853.2135315525185656171.davem@davemloft.net>

> There is something wrong with the clock on the computer you are posting
> these patches from, the date in these postings are in the future by
> several hours.
> 
> This messes up the ordering of changes in patchwork and makes my life
> miserable to a certain degree, so please fix this.
> 
> Thank you.

Sorry about that as my machine's date somehow went out of
sync with the server time. I have already fixed that.

Thanks,
Weifeng

^ permalink raw reply

* Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
From: Cornelia Huck @ 2019-08-27 10:24 UTC (permalink / raw)
  To: Parav Pandit
  Cc: alex.williamson, jiri, kwankhede, davem, kvm, linux-kernel,
	netdev
In-Reply-To: <20190826204119.54386-2-parav@mellanox.com>

On Mon, 26 Aug 2019 15:41:16 -0500
Parav Pandit <parav@mellanox.com> wrote:

> Whenever a parent requests to generate mdev alias, generate a mdev
> alias.
> It is an optional attribute that parent can request to generate
> for each of its child mdev.
> mdev alias is generated using sha1 from the mdev name.

Maybe add some motivation here as well?

"Some vendor drivers want an identifier for an mdev device that is
shorter than the uuid, due to length restrictions in the consumers of
that identifier.

Add a callback that allows a vendor driver to request an alias of a
specified length to be generated (via sha1) for an mdev device. If
generated, that alias is checked for collisions."

> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 98 +++++++++++++++++++++++++++++++-
>  drivers/vfio/mdev/mdev_private.h |  5 +-
>  drivers/vfio/mdev/mdev_sysfs.c   | 13 +++--
>  include/linux/mdev.h             |  4 ++
>  4 files changed, 111 insertions(+), 9 deletions(-)
> 

(...)

> @@ -406,6 +495,10 @@ EXPORT_SYMBOL(mdev_get_iommu_device);
>  
>  static int __init mdev_init(void)
>  {
> +	alias_hash = crypto_alloc_shash("sha1", 0, 0);
> +	if (!alias_hash)
> +		return -ENOMEM;
> +
>  	return mdev_bus_register();

Don't you need to call crypto_free_shash() if mdev_bus_register() fails?

>  }
>  
> @@ -415,6 +508,7 @@ static void __exit mdev_exit(void)
>  		class_compat_unregister(mdev_bus_compat_class);
>  
>  	mdev_bus_unregister();
> +	crypto_free_shash(alias_hash);
>  }
>  
>  module_init(mdev_init)

(...)

> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> index 0ce30ca78db0..f036fe9854ee 100644
> --- a/include/linux/mdev.h
> +++ b/include/linux/mdev.h
> @@ -72,6 +72,9 @@ struct device *mdev_get_iommu_device(struct device *dev);
>   * @mmap:		mmap callback
>   *			@mdev: mediated device structure
>   *			@vma: vma structure
> + * @get_alias_length:	Generate alias for the mdevs of this parent based on the
> + *			mdev device name when it returns non zero alias length.
> + *			It is optional.

What about:

* @get_alias_length: optional callback to specify length of the alias to create
*                    Returns unsigned integer: length of the alias to be created,
*                                              0 to not create an alias

I also think it might be beneficial to add a device parameter here now
(rather than later); that seems to be something that makes sense.

>   * Parent device that support mediated device should be registered with mdev
>   * module with mdev_parent_ops structure.
>   **/
> @@ -92,6 +95,7 @@ struct mdev_parent_ops {
>  	long	(*ioctl)(struct mdev_device *mdev, unsigned int cmd,
>  			 unsigned long arg);
>  	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
> +	unsigned int (*get_alias_length)(void);
>  };
>  
>  /* interface for exporting mdev supported type attributes */


^ permalink raw reply

* Re: [PATCH net] net/sched: pfifo_fast: fix wrong dereference when qdisc is reset
From: Davide Caratti @ 2019-08-27 10:15 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Cong Wang, Jamal Hadi Salim, Jiri Pirko, David S. Miller, netdev,
	Paolo Abeni, Li Shuang
In-Reply-To: <20190827115031.43fcbac5@redhat.com>

hello Stefano,

thanks for looking at this.

On Tue, 2019-08-27 at 11:50 +0200, Stefano Brivio wrote:
> On Tue, 27 Aug 2019 01:15:16 +0200
> Davide Caratti <dcaratti@redhat.com> wrote:
[...]
> @@ -688,12 +688,14 @@ static void pfifo_fast_reset(struct Qdisc *qdisc)
> >  			kfree_skb(skb);
> >  	}
> >  
> > -	for_each_possible_cpu(i) {
> > -		struct gnet_stats_queue *q = per_cpu_ptr(qdisc->cpu_qstats, i);
> > +	if (qdisc_is_percpu_stats(qdisc))
> 
> This needs curly brackets, as the block has multiple lines (for coding
> style only).

If I well read Documentation/process/coding-style.rst, at the end of
section 3), this rule should apply to loops. But I'm fine with the curly
brace here, if checkpatch doesn't say anything.

I will add it in v2. 

> > +		for_each_possible_cpu(i) {
> > +			struct gnet_stats_queue *q =
> > +				per_cpu_ptr(qdisc->cpu_qstats, i);
> 
> And you could split declaration and assignment here, it takes two lines
> anyway and becomes more readable.

ok, I will do that in v2.

thanks!
-- 
davide


^ permalink raw reply

* Re: [RFC PATCH net-next] net: phy: force phy suspend when calling phy_stop
From: Sergei Shtylyov @ 2019-08-27 10:11 UTC (permalink / raw)
  To: Jian Shen, andrew, f.fainelli, hkallweit1, davem
  Cc: netdev, forest.zhouchang, linuxarm
In-Reply-To: <1566874020-14334-1-git-send-email-shenjian15@huawei.com>

On 27.08.2019 5:47, Jian Shen wrote:

> Some ethernet drivers may call phy_start() and phy_stop() from
> ndo_open and ndo_close() respectively.

    ndo_open() for consistency.

> When network cable is unconnected, and operate like below:
> step 1: ifconfig ethX up -> ndo_open -> phy_start ->start
> autoneg, and phy is no link.
> step 2: ifconfig ethX down -> ndo_close -> phy_stop -> just stop
> phy state machine.
> step 3: plugin the network cable, and autoneg complete, then
> LED for link status will be on.
> step 4: ethtool ethX --> see the result of "Link detected" is no.
> 
> This patch forces phy suspend even phydev->link is off.
> 
> Signed-off-by: Jian Shen <shenjian15@huawei.com>
[...]

MBR, Sergei


^ permalink raw reply

* Re: [PATCH v2 0/3] Add NETIF_F_HW_BR_CAP feature
From: Horatiu Vultur @ 2019-08-27 10:10 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: roopa, nikolay, davem, UNGLinuxDriver, alexandre.belloni,
	allan.nielsen, f.fainelli, netdev, linux-kernel, bridge
In-Reply-To: <20190826123811.GA13411@lunn.ch>

The 08/26/2019 14:38, Andrew Lunn wrote:
> External E-Mail
> 
> 
> On Mon, Aug 26, 2019 at 10:11:12AM +0200, Horatiu Vultur wrote:
> > When a network port is added to a bridge then the port is added in
> > promisc mode. Some HW that has bridge capabilities(can learn, forward,
> > flood etc the frames) they are disabling promisc mode in the network
> > driver when the port is added to the SW bridge.
> > 
> > This patch adds the feature NETIF_F_HW_BR_CAP so that the network ports
> > that have this feature will not be set in promisc mode when they are
> > added to a SW bridge.
> > 
> > In this way the HW that has bridge capabilities don't need to send all the
> > traffic to the CPU and can also implement the promisc mode and toggle it
> > using the command 'ip link set dev swp promisc on'
> 
> Hi Horatiu

Hi Andrew,
> 
> I'm still not convinced this is needed. The model is, the hardware is
> there to accelerate what Linux can do in software. Any peculiarities
> of the accelerator should be hidden in the driver.  If the accelerator
> can do its job without needing promisc mode, do that in the driver.
Thanks for the model description. I will keep in my mind for the next
patches that I will do.
> 
> So you are trying to differentiate between promisc mode because the
> interface is a member of a bridge, and promisc mode because some
> application, like pcap, has asked for promisc mode.
> 
> dev->promiscuity is a counter. So what you can do it look at its
> value, and how the interface is being used. If the interface is not a
> member of a bridge, and the count > 0, enable promisc mode in the
> accelerator. If the interface is a member of a bridge, and the count >
> 1, enable promisc mode in the accelerator.
That sounds like a great idea. I was expecting to add this logic in the
set_rx_mode function of the driver. But unfortunetly, I got the calls to
this function before the dev->promiscuity is updated or not to get the
call at all. For example in case the port is member of a bridge and I try
to enable the promisc mode.

> 
>    Andrew
> 
> 

-- 
/Horatiu

^ permalink raw reply

* Re: [PATCH -next] net: mlx5: Kconfig: Fix MLX5_CORE_EN dependencies
From: maowenan @ 2019-08-27  9:51 UTC (permalink / raw)
  To: wharms
  Cc: saeedm, leon, davem, netdev, linux-rdma, linux-kernel,
	kernel-janitors
In-Reply-To: <5D64DABF.4010601@bfs.de>



On 2019/8/27 15:24, walter harms wrote:
> 
> 
> Am 27.08.2019 05:12, schrieb Mao Wenan:
>> When MLX5_CORE_EN=y and PCI_HYPERV_INTERFACE is not set, below errors are found:
>> drivers/net/ethernet/mellanox/mlx5/core/en_main.o: In function `mlx5e_nic_enable':
>> en_main.c:(.text+0xb649): undefined reference to `mlx5e_hv_vhca_stats_create'
>> drivers/net/ethernet/mellanox/mlx5/core/en_main.o: In function `mlx5e_nic_disable':
>> en_main.c:(.text+0xb8c4): undefined reference to `mlx5e_hv_vhca_stats_destroy'
>>
>> This because CONFIG_PCI_HYPERV_INTERFACE is newly introduced by 'commit 348dd93e40c1
>> ("PCI: hv: Add a Hyper-V PCI interface driver for software backchannel interface"),
>> Fix this by making MLX5_CORE_EN imply PCI_HYPERV_INTERFACE.
>>
>> Fixes: cef35af34d6d ("net/mlx5e: Add mlx5e HV VHCA stats agent")
>> Signed-off-by: Mao Wenan <maowenan@huawei.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
>> index 37fef8c..a6a70ce 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
>> @@ -35,6 +35,7 @@ config MLX5_CORE_EN
>>  	depends on IPV6=y || IPV6=n || MLX5_CORE=m
> 
> OT but ...
> is that IPV6 needed at all ? can there be something else that yes or no ?

If I set IPV6=m, errors are found as below:
drivers/net/ethernet/mellanox/mlx5/core/main.o: In function `mlx5_unload':
main.c:(.text+0x275): undefined reference to `mlx5_hv_vhca_cleanup'
drivers/net/ethernet/mellanox/mlx5/core/main.o: In function `mlx5_cleanup_once':
main.c:(.text+0x2e8): undefined reference to `mlx5_hv_vhca_destroy'
drivers/net/ethernet/mellanox/mlx5/core/main.o: In function `mlx5_load_one':
main.c:(.text+0x23c1): undefined reference to `mlx5_hv_vhca_create'
main.c:(.text+0x248f): undefined reference to `mlx5_hv_vhca_init'
main.c:(.text+0x25e0): undefined reference to `mlx5_hv_vhca_cleanup
> 
> re,
>  wh
> 
>>  	select PAGE_POOL
>>  	select DIMLIB
>> +	imply PCI_HYPERV_INTERFACE
>>  	default n
>>  	---help---
>>  	  Ethernet support in Mellanox Technologies ConnectX-4 NIC.
> 
> .
> 


^ permalink raw reply

* Re: [PATCH net] net/sched: pfifo_fast: fix wrong dereference when qdisc is reset
From: Stefano Brivio @ 2019-08-27  9:50 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Cong Wang, Jamal Hadi Salim, Jiri Pirko, David S. Miller, netdev,
	Paolo Abeni, Li Shuang
In-Reply-To: <0598164c6e32684e57c7656f0b8aca0813c51f42.1566861256.git.dcaratti@redhat.com>

On Tue, 27 Aug 2019 01:15:16 +0200
Davide Caratti <dcaratti@redhat.com> wrote:

> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 11c03cf4aa74..c89b787785a1 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -688,12 +688,14 @@ static void pfifo_fast_reset(struct Qdisc *qdisc)
>  			kfree_skb(skb);
>  	}
>  
> -	for_each_possible_cpu(i) {
> -		struct gnet_stats_queue *q = per_cpu_ptr(qdisc->cpu_qstats, i);
> +	if (qdisc_is_percpu_stats(qdisc))

This needs curly brackets, as the block has multiple lines (for coding
style only).

> +		for_each_possible_cpu(i) {
> +			struct gnet_stats_queue *q =
> +				per_cpu_ptr(qdisc->cpu_qstats, i);

And you could split declaration and assignment here, it takes two lines
anyway and becomes more readable.

-- 
Stefano

^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Jiri Pirko @ 2019-08-27  9:35 UTC (permalink / raw)
  To: David Miller
  Cc: jakub.kicinski, dsahern, roopa, netdev, sthemmin, dcbw, mkubecek,
	andrew, parav, saeedm, mlxsw
In-Reply-To: <20190827.012242.418276717667374306.davem@davemloft.net>

Tue, Aug 27, 2019 at 10:22:42AM CEST, davem@davemloft.net wrote:
>From: Jiri Pirko <jiri@resnulli.us>
>Date: Tue, 27 Aug 2019 09:08:08 +0200
>
>> Okay, so if I understand correctly, on top of separate commands for
>> add/del of alternative names, you suggest also get/dump to be separate
>> command and don't fill this up in existing newling/getlink command.
>
>I'm not sure what to do yet.
>
>David has a point, because the only way these ifnames are useful is
>as ways to specify and choose net devices.  So based upon that I'm
>slightly learning towards not using separate commands.

Well yeah, one can use it to handle existing commands instead of
IFLA_NAME.

But why does it rule out separate commands? I think it is cleaner than
to put everything in poor setlink messages :/ The fact that we would
need to add "OP" to the setlink message just feels of. Other similar
needs may show up in the future and we may endup in ridiculous messages
like:

SETLINK
  IFLA_NAME eth0
  IFLA_ATLNAME_LIST (nest)
      IFLA_ALTNAME_OP add
      IFLA_ALTNAME somereallylongname 
      IFLA_ALTNAME_OP del
      IFLA_ALTNAME somereallyreallylongname 
      IFLA_ALTNAME_OP add
      IFLA_ALTNAME someotherreallylongname 
  IFLA_SOMETHING_ELSE_LIST (nest)
      IFLA_SOMETHING_ELSE_OP add
      ...
      IFLA_SOMETHING_ELSE_OP del
      ...
      IFLA_SOMETHING_ELSE_OP add
      ...

I don't know what to think about it. Rollbacks are going to be pure hell :/

^ permalink raw reply

* [PATCH 1/4] dt-bindings: net: dsa: document additional Microchip KSZ8563 switch
From: Razvan Stefanescu @ 2019-08-27  9:31 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, David S . Miller
  Cc: netdev, linux-kernel, Razvan Stefanescu
In-Reply-To: <20190827093110.14957-1-razvan.stefanescu@microchip.com>

It is a 3-Port 10/100 Ethernet Switch with 1588v2 PTP.

Signed-off-by: Razvan Stefanescu <razvan.stefanescu@microchip.com>
---
 Documentation/devicetree/bindings/net/dsa/ksz.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt b/Documentation/devicetree/bindings/net/dsa/ksz.txt
index 5e8429b6f9ca..95e91e84151c 100644
--- a/Documentation/devicetree/bindings/net/dsa/ksz.txt
+++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt
@@ -15,6 +15,7 @@ Required properties:
   - "microchip,ksz8565"
   - "microchip,ksz9893"
   - "microchip,ksz9563"
+  - "microchip,ksz8563"
 
 Optional properties:
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH 2/4] net: dsa: microchip: add KSZ8563 compatibility string
From: Razvan Stefanescu @ 2019-08-27  9:31 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, David S . Miller
  Cc: netdev, linux-kernel, Razvan Stefanescu
In-Reply-To: <20190827093110.14957-1-razvan.stefanescu@microchip.com>

It is a 3-Port 10/100 Ethernet Switch with 1588v2 PTP.

Signed-off-by: Razvan Stefanescu <razvan.stefanescu@microchip.com>
---
 drivers/net/dsa/microchip/ksz9477_spi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/dsa/microchip/ksz9477_spi.c b/drivers/net/dsa/microchip/ksz9477_spi.c
index a226b389e12d..2e402e4d866f 100644
--- a/drivers/net/dsa/microchip/ksz9477_spi.c
+++ b/drivers/net/dsa/microchip/ksz9477_spi.c
@@ -80,6 +80,7 @@ static const struct of_device_id ksz9477_dt_ids[] = {
 	{ .compatible = "microchip,ksz9897" },
 	{ .compatible = "microchip,ksz9893" },
 	{ .compatible = "microchip,ksz9563" },
+	{ .compatible = "microchip,ksz8563" },
 	{},
 };
 MODULE_DEVICE_TABLE(of, ksz9477_dt_ids);
-- 
2.20.1


^ permalink raw reply related

* [PATCH 3/4] net: dsa: microchip: fix interrupt mask
From: Razvan Stefanescu @ 2019-08-27  9:31 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, David S . Miller
  Cc: netdev, linux-kernel, Razvan Stefanescu
In-Reply-To: <20190827093110.14957-1-razvan.stefanescu@microchip.com>

Global Interrupt Mask Register comprises of Lookup Engine (LUE) Interrupt
Mask (bit 31) and GPIO Pin Output Trigger and Timestamp Unit Interrupt
Mask (bit 29).

This corrects LUE bit.

Signed-off-by: Razvan Stefanescu <razvan.stefanescu@microchip.com>
---
 drivers/net/dsa/microchip/ksz9477_reg.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/microchip/ksz9477_reg.h b/drivers/net/dsa/microchip/ksz9477_reg.h
index 2938e892b631..f3949d7b9bbd 100644
--- a/drivers/net/dsa/microchip/ksz9477_reg.h
+++ b/drivers/net/dsa/microchip/ksz9477_reg.h
@@ -76,7 +76,7 @@
 #define TRIG_TS_INT			BIT(30)
 #define APB_TIMEOUT_INT			BIT(29)
 
-#define SWITCH_INT_MASK			(TRIG_TS_INT | APB_TIMEOUT_INT)
+#define SWITCH_INT_MASK			(LUE_INT | TRIG_TS_INT)
 
 #define REG_SW_PORT_INT_STATUS__4	0x0018
 #define REG_SW_PORT_INT_MASK__4		0x001C
-- 
2.20.1


^ permalink raw reply related

* [PATCH 4/4] net: dsa: microchip: avoid hard-codded port count
From: Razvan Stefanescu @ 2019-08-27  9:31 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, David S . Miller
  Cc: netdev, linux-kernel, Razvan Stefanescu
In-Reply-To: <20190827093110.14957-1-razvan.stefanescu@microchip.com>

Use port_cnt value to disable interrupts on switch reset.

Signed-off-by: Razvan Stefanescu <razvan.stefanescu@microchip.com>
---
 drivers/net/dsa/microchip/ksz9477.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/microchip/ksz9477.c b/drivers/net/dsa/microchip/ksz9477.c
index 187be42de5f1..54fc05595d48 100644
--- a/drivers/net/dsa/microchip/ksz9477.c
+++ b/drivers/net/dsa/microchip/ksz9477.c
@@ -213,7 +213,7 @@ static int ksz9477_reset_switch(struct ksz_device *dev)
 
 	/* disable interrupts */
 	ksz_write32(dev, REG_SW_INT_MASK__4, SWITCH_INT_MASK);
-	ksz_write32(dev, REG_SW_PORT_INT_MASK__4, 0x7F);
+	ksz_write32(dev, REG_SW_PORT_INT_MASK__4, dev->port_cnt);
 	ksz_read32(dev, REG_SW_PORT_INT_STATUS__4, &data32);
 
 	/* set broadcast storm protection 10% rate */
-- 
2.20.1


^ permalink raw reply related

* [PATCH 0/4] net: dsa: microchip: add KSZ8563 support
From: Razvan Stefanescu @ 2019-08-27  9:31 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, David S . Miller
  Cc: netdev, linux-kernel, Razvan Stefanescu

This patchset adds compatiblity strings for the KSZ8563 switch and
also adds two small fixes to the ksz9477 driver.

Razvan Stefanescu (4):
  dt-bindings: net: dsa: document additional Microchip KSZ8563 switch
  net: dsa: microchip: add KSZ8563 compatibility string
  net: dsa: microchip: fix interrupt mask
  net: dsa: microchip: avoid hard-codded port count

 Documentation/devicetree/bindings/net/dsa/ksz.txt | 1 +
 drivers/net/dsa/microchip/ksz9477.c               | 2 +-
 drivers/net/dsa/microchip/ksz9477_reg.h           | 2 +-
 drivers/net/dsa/microchip/ksz9477_spi.c           | 1 +
 4 files changed, 4 insertions(+), 2 deletions(-)

-- 
2.20.1


^ permalink raw reply

* RE: [PATCH v4] rtw88: pci: Move a mass of jobs in hw IRQ to soft IRQ
From: Tony Chuang @ 2019-08-27  9:20 UTC (permalink / raw)
  To: Jian-Hong Pan, Kalle Valo, David S . Miller
  Cc: linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux@endlessm.com
In-Reply-To: <20190826070827.1436-1-jian-hong@endlessm.com>

> From: Jian-Hong Pan 
> Subject: [PATCH v4] rtw88: pci: Move a mass of jobs in hw IRQ to soft IRQ
> 
> There is a mass of jobs between spin lock and unlock in the hardware
> IRQ which will occupy much time originally. To make system work more
> efficiently, this patch moves the jobs to the soft IRQ (bottom half) to
> reduce the time in hardware IRQ.
> 
> Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
> ---
> v2:
>  Change the spin_lock_irqsave/unlock_irqrestore to spin_lock/unlock in
>  rtw_pci_interrupt_handler. Because the interrupts are already disabled
>  in the hardware interrupt handler.
> 
> v3:
>  Extend the spin lock protecting area for the TX path in
>  rtw_pci_interrupt_threadfn by Realtek's suggestion
> 
> v4:
>  Remove the WiFi running check in rtw_pci_interrupt_threadfn to avoid AP
>  connection failed by Realtek's suggestion.
> 
>  drivers/net/wireless/realtek/rtw88/pci.c | 32 +++++++++++++++++++-----
>  1 file changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/wireless/realtek/rtw88/pci.c
> b/drivers/net/wireless/realtek/rtw88/pci.c
> index 00ef229552d5..955dd6c6fb57 100644
> --- a/drivers/net/wireless/realtek/rtw88/pci.c
> +++ b/drivers/net/wireless/realtek/rtw88/pci.c
> @@ -866,12 +866,29 @@ static irqreturn_t rtw_pci_interrupt_handler(int irq,
> void *dev)
>  {
>  	struct rtw_dev *rtwdev = dev;
>  	struct rtw_pci *rtwpci = (struct rtw_pci *)rtwdev->priv;
> -	u32 irq_status[4];
> 
>  	spin_lock(&rtwpci->irq_lock);
>  	if (!rtwpci->irq_enabled)
>  		goto out;
> 
> +	/* disable RTW PCI interrupt to avoid more interrupts before the end of
> +	 * thread function
> +	 */
> +	rtw_pci_disable_interrupt(rtwdev, rtwpci);
> +out:
> +	spin_unlock(&rtwpci->irq_lock);
> +
> +	return IRQ_WAKE_THREAD;
> +}
> +
> +static irqreturn_t rtw_pci_interrupt_threadfn(int irq, void *dev)
> +{
> +	struct rtw_dev *rtwdev = dev;
> +	struct rtw_pci *rtwpci = (struct rtw_pci *)rtwdev->priv;
> +	unsigned long flags;
> +	u32 irq_status[4];
> +
> +	spin_lock_irqsave(&rtwpci->irq_lock, flags);
>  	rtw_pci_irq_recognized(rtwdev, rtwpci, irq_status);
> 
>  	if (irq_status[0] & IMR_MGNTDOK)
> @@ -891,8 +908,9 @@ static irqreturn_t rtw_pci_interrupt_handler(int irq,
> void *dev)
>  	if (irq_status[0] & IMR_ROK)
>  		rtw_pci_rx_isr(rtwdev, rtwpci, RTW_RX_QUEUE_MPDU);
> 
> -out:
> -	spin_unlock(&rtwpci->irq_lock);
> +	/* all of the jobs for this interrupt have been done */
> +	rtw_pci_enable_interrupt(rtwdev, rtwpci);
> +	spin_unlock_irqrestore(&rtwpci->irq_lock, flags);
> 
>  	return IRQ_HANDLED;
>  }
> @@ -1152,8 +1170,10 @@ static int rtw_pci_probe(struct pci_dev *pdev,
>  		goto err_destroy_pci;
>  	}
> 
> -	ret = request_irq(pdev->irq, &rtw_pci_interrupt_handler,
> -			  IRQF_SHARED, KBUILD_MODNAME, rtwdev);
> +	ret = devm_request_threaded_irq(rtwdev->dev, pdev->irq,
> +					rtw_pci_interrupt_handler,
> +					rtw_pci_interrupt_threadfn,
> +					IRQF_SHARED, KBUILD_MODNAME, rtwdev);
>  	if (ret) {
>  		ieee80211_unregister_hw(hw);
>  		goto err_destroy_pci;
> @@ -1192,7 +1212,7 @@ static void rtw_pci_remove(struct pci_dev *pdev)
>  	rtw_pci_disable_interrupt(rtwdev, rtwpci);
>  	rtw_pci_destroy(rtwdev, pdev);
>  	rtw_pci_declaim(rtwdev, pdev);
> -	free_irq(rtwpci->pdev->irq, rtwdev);
> +	devm_free_irq(rtwdev->dev, rtwpci->pdev->irq, rtwdev);
>  	rtw_core_deinit(rtwdev);
>  	ieee80211_free_hw(hw);
>  }
> --
> 2.20.1


Now it works fine with MSI interrupt enabled.

But this patch is conflicting with MSI interrupt patch.
Is there a better way we can make Kalle apply them more smoothly?
I can rebase them and submit both if you're OK.

Yan-Hsuan

^ permalink raw reply

* [PATCH] ipv6: Not to probe neighbourless routes
From: Yi Wang @ 2019-08-27  9:08 UTC (permalink / raw)
  To: davem
  Cc: kuznet, yoshfuji, netdev, linux-kernel, xue.zhihong, wang.yi59,
	wang.liang82, Cheng Lin

From: Cheng Lin <cheng.lin130@zte.com.cn>

Originally, Router Reachability Probing require a neighbour entry
existed. Commit 2152caea7196 ("ipv6: Do not depend on rt->n in
rt6_probe().") removed the requirement for a neighbour entry. And
commit f547fac624be ("ipv6: rate-limit probes for neighbourless
routes") adds rate-limiting for neighbourless routes.

And, the Neighbor Discovery for IP version 6 (IPv6)(rfc4861) says,
"
7.2.5.  Receipt of Neighbor Advertisements

When a valid Neighbor Advertisement is received (either solicited or
unsolicited), the Neighbor Cache is searched for the target's entry.
If no entry exists, the advertisement SHOULD be silently discarded.
There is no need to create an entry if none exists, since the
recipient has apparently not initiated any communication with the
target.
".

In rt6_probe(), just a Neighbor Solicitation message are transmited.
When receiving a Neighbor Advertisement, the node does nothing in a
Neighborless condition.

Not sure it's needed to create a neighbor entry in Router
Reachability Probing. And the Original way may be the right way.

This patch recover the requirement for a neighbour entry.

Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
---
 include/net/ip6_fib.h | 5 -----
 net/ipv6/route.c      | 5 +----
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 4b5656c..8c2e022 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -124,11 +124,6 @@ struct rt6_exception {
 
 struct fib6_nh {
 	struct fib_nh_common	nh_common;
-
-#ifdef CONFIG_IPV6_ROUTER_PREF
-	unsigned long		last_probe;
-#endif
-
 	struct rt6_info * __percpu *rt6i_pcpu;
 	struct rt6_exception_bucket __rcu *rt6i_exception_bucket;
 };
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index fd059e0..c4bcffc 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -639,12 +639,12 @@ static void rt6_probe(struct fib6_nh *fib6_nh)
 	nh_gw = &fib6_nh->fib_nh_gw6;
 	dev = fib6_nh->fib_nh_dev;
 	rcu_read_lock_bh();
-	idev = __in6_dev_get(dev);
 	neigh = __ipv6_neigh_lookup_noref(dev, nh_gw);
 	if (neigh) {
 		if (neigh->nud_state & NUD_VALID)
 			goto out;
 
+		idev = __in6_dev_get(dev);
 		write_lock(&neigh->lock);
 		if (!(neigh->nud_state & NUD_VALID) &&
 		    time_after(jiffies,
@@ -654,9 +654,6 @@ static void rt6_probe(struct fib6_nh *fib6_nh)
 				__neigh_set_probe_once(neigh);
 		}
 		write_unlock(&neigh->lock);
-	} else if (time_after(jiffies, fib6_nh->last_probe +
-				       idev->cnf.rtr_probe_interval)) {
-		work = kmalloc(sizeof(*work), GFP_ATOMIC);
 	}
 
 	if (work) {
-- 
1.8.3.1


^ permalink raw reply related

* Re: [RFC PATCH net-next] net: phy: force phy suspend when calling phy_stop
From: shenjian (K) @ 2019-08-27  8:29 UTC (permalink / raw)
  To: Heiner Kallweit, andrew, f.fainelli, davem
  Cc: netdev, forest.zhouchang, linuxarm
In-Reply-To: <fc2a700a-9c24-b96c-df6b-c5414883d89e@gmail.com>



在 2019/8/27 13:51, Heiner Kallweit 写道:
> On 27.08.2019 04:47, Jian Shen wrote:
>> Some ethernet drivers may call phy_start() and phy_stop() from
>> ndo_open and ndo_close() respectively.
>>
>> When network cable is unconnected, and operate like below:
>> step 1: ifconfig ethX up -> ndo_open -> phy_start ->start
>> autoneg, and phy is no link.
>> step 2: ifconfig ethX down -> ndo_close -> phy_stop -> just stop
>> phy state machine.
>> step 3: plugin the network cable, and autoneg complete, then
>> LED for link status will be on.
>> step 4: ethtool ethX --> see the result of "Link detected" is no.
>>
> Step 3 and 4 seem to be unrelated to the actual issue.
> With which MAC + PHY driver did you observe this?
> 
Thanks Heiner,

I tested this on HNS3 driver, with two phy, Marvell 88E1512 and RTL8211.

Step 3 and Step 4 is just to describe that the LED of link shows link up,
but the port information shows no link.


>> This patch forces phy suspend even phydev->link is off.
>>
>> Signed-off-by: Jian Shen <shenjian15@huawei.com>
>> ---
>>  drivers/net/phy/phy.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index f3adea9..0acd5b4 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -911,8 +911,8 @@ void phy_state_machine(struct work_struct *work)
>>  		if (phydev->link) {
>>  			phydev->link = 0;
>>  			phy_link_down(phydev, true);
>> -			do_suspend = true;
>>  		}
>> +		do_suspend = true;
>>  		break;
>>  	}
>>  
>>
> Heiner
> 
> 


^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: David Miller @ 2019-08-27  8:22 UTC (permalink / raw)
  To: jiri
  Cc: jakub.kicinski, dsahern, roopa, netdev, sthemmin, dcbw, mkubecek,
	andrew, parav, saeedm, mlxsw
In-Reply-To: <20190827070808.GA2250@nanopsycho>

From: Jiri Pirko <jiri@resnulli.us>
Date: Tue, 27 Aug 2019 09:08:08 +0200

> Okay, so if I understand correctly, on top of separate commands for
> add/del of alternative names, you suggest also get/dump to be separate
> command and don't fill this up in existing newling/getlink command.

I'm not sure what to do yet.

David has a point, because the only way these ifnames are useful is
as ways to specify and choose net devices.  So based upon that I'm
slightly learning towards not using separate commands.

^ permalink raw reply

* Re: [PATCH 14/16] include/linux: prefer __section from compiler_attributes.h
From: Will Deacon @ 2019-08-27  8:21 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Nick Desaulniers, Andrew Morton, Sedat Dilek, Josh Poimboeuf,
	Yonghong Song, clang-built-linux, Luc Van Oostenryck,
	Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra (Intel), Nicholas Piggin, Jiri Kosina,
	Ard Biesheuvel, Michael Ellerman, Masahiro Yamada,
	Hans Liljestrand, Elena Reshetova, David Windsor, Marc Zyngier,
	Ming Lei, Dou Liyang, Julien Thierry, Mauro Carvalho Chehab,
	Jens Axboe, linux-kernel, linux-sparse, rcu, Network Development,
	bpf
In-Reply-To: <CANiq72mXvuVdO2i9gobmh9YeUW4bhe7YnqQc6PaZrbqua1DoYw@mail.gmail.com>

On Sat, Aug 24, 2019 at 02:51:46PM +0200, Miguel Ojeda wrote:
> On Tue, Aug 13, 2019 at 10:33 AM Will Deacon <will@kernel.org> wrote:
> >
> > -ENOCOMMITMESSAGE
> >
> > Otherwise, patch looks good to me.
> 
> Do you want Ack, Review or nothing?

You can add my Ack if a commit message appears.

Will

^ permalink raw reply

* Re: [PATCH bpf-next v5 03/11] xsk: add support to allow unaligned chunk placement
From: Maxim Mikityanskiy @ 2019-08-27  7:36 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	bjorn.topel@intel.com, magnus.karlsson@intel.com,
	jakub.kicinski@netronome.com, jonathan.lemon@gmail.com,
	Saeed Mahameed, stephen@networkplumber.org,
	bruce.richardson@intel.com, ciara.loftus@intel.com,
	bpf@vger.kernel.org, intel-wired-lan@lists.osuosl.org
In-Reply-To: <20190822014427.49800-4-kevin.laatz@intel.com>

On 2019-08-22 04:44, Kevin Laatz wrote:
> Currently, addresses are chunk size aligned. This means, we are very
> restricted in terms of where we can place chunk within the umem. For
> example, if we have a chunk size of 2k, then our chunks can only be placed
> at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
> 
> This patch introduces the ability to use unaligned chunks. With these
> changes, we are no longer bound to having to place chunks at a 2k (or
> whatever your chunk size is) interval. Since we are no longer dealing with
> aligned chunks, they can now cross page boundaries. Checks for page
> contiguity have been added in order to keep track of which pages are
> followed by a physically contiguous page.
> 
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> ---
> v2:
>    - Add checks for the flags coming from userspace
>    - Fix how we get chunk_size in xsk_diag.c
>    - Add defines for masking the new descriptor format
>    - Modified the rx functions to use new descriptor format
>    - Modified the tx functions to use new descriptor format
> 
> v3:
>    - Add helper function to do address/offset masking/addition
> 
> v4:
>    - fixed page_start calculation in __xsk_rcv_memcpy().
>    - move offset handling to the xdp_umem_get_* functions
>    - modified the len field in xdp_umem_reg struct. We now use 16 bits from
>      this for the flags field.
>    - removed next_pg_contig field from xdp_umem_page struct. Using low 12
>      bits of addr to store flags instead.
>    - other minor changes based on review comments
> 
> v5:
>    - Added accessors for getting addr and offset
>    - Added helper function to add offset to addr
>    - Fixed offset handling in xsk_rcv
>    - Removed bitfields from xdp_umem_reg
>    - Added struct size checking for xdp_umem_reg in xsk_setsockopt to handle
>      different versions of the struct.
>    - fix conflicts after 'bpf-af-xdp-wakeup' was merged.
> ---
>   include/net/xdp_sock.h      | 75 +++++++++++++++++++++++++++--
>   include/uapi/linux/if_xdp.h |  9 ++++
>   net/xdp/xdp_umem.c          | 19 ++++++--
>   net/xdp/xsk.c               | 96 +++++++++++++++++++++++++++++--------
>   net/xdp/xsk_diag.c          |  2 +-
>   net/xdp/xsk_queue.h         | 68 ++++++++++++++++++++++----
>   6 files changed, 232 insertions(+), 37 deletions(-)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index f023b9940d64..c9398ce7960f 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -16,6 +16,13 @@
>   struct net_device;
>   struct xsk_queue;
>   
> +/* Masks for xdp_umem_page flags.
> + * The low 12-bits of the addr will be 0 since this is the page address, so we
> + * can use them for flags.
> + */
> +#define XSK_NEXT_PG_CONTIG_SHIFT 0
> +#define XSK_NEXT_PG_CONTIG_MASK (1ULL << XSK_NEXT_PG_CONTIG_SHIFT)
> +
>   struct xdp_umem_page {
>   	void *addr;
>   	dma_addr_t dma;
> @@ -27,8 +34,12 @@ struct xdp_umem_fq_reuse {
>   	u64 handles[];
>   };
>   
> -/* Flags for the umem flags field. */
> -#define XDP_UMEM_USES_NEED_WAKEUP (1 << 0)
> +/* Flags for the umem flags field.
> + *
> + * The NEED_WAKEUP flag is 1 due to the reuse of the flags field for public
> + * flags. See inlude/uapi/include/linux/if_xdp.h.
> + */
> +#define XDP_UMEM_USES_NEED_WAKEUP (1 << 1)
>   
>   struct xdp_umem {
>   	struct xsk_queue *fq;
> @@ -124,14 +135,36 @@ void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
>   int xsk_map_inc(struct xsk_map *map);
>   void xsk_map_put(struct xsk_map *map);
>   
> +static inline u64 xsk_umem_extract_addr(u64 addr)
> +{
> +	return addr & XSK_UNALIGNED_BUF_ADDR_MASK;
> +}
> +
> +static inline u64 xsk_umem_extract_offset(u64 addr)
> +{
> +	return addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> +}
> +
> +static inline u64 xsk_umem_add_offset_to_addr(u64 addr)
> +{
> +	return xsk_umem_extract_addr(addr) + xsk_umem_extract_offset(addr);
> +}
> +
>   static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
>   {
> -	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
> +	unsigned long page_addr;
> +
> +	addr = xsk_umem_add_offset_to_addr(addr);
> +	page_addr = (unsigned long)umem->pages[addr >> PAGE_SHIFT].addr;
> +
> +	return (char *)(page_addr & PAGE_MASK) + (addr & ~PAGE_MASK);
>   }
>   
>   static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
>   {
> -	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
> +	addr = xsk_umem_add_offset_to_addr(addr);
> +
> +	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & ~PAGE_MASK);
>   }
>   
>   /* Reuse-queue aware version of FILL queue helpers */
> @@ -172,6 +205,19 @@ static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
>   
>   	rq->handles[rq->length++] = addr;
>   }
> +
> +/* Handle the offset appropriately depending on aligned or unaligned mode.
> + * For unaligned mode, we store the offset in the upper 16-bits of the address.
> + * For aligned mode, we simply add the offset to the address.
> + */
> +static inline u64 xsk_umem_adjust_offset(struct xdp_umem *umem, u64 address,
> +					 u64 offset)
> +{
> +	if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG)
> +		return address + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
> +	else
> +		return address + offset;
> +}
>   #else
>   static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>   {
> @@ -241,6 +287,21 @@ static inline struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev,
>   	return NULL;
>   }
>   
> +static inline u64 xsk_umem_extract_addr(u64 addr)
> +{
> +	return 0;
> +}
> +
> +static inline u64 xsk_umem_extract_offset(u64 addr)
> +{
> +	return 0;
> +}
> +
> +static inline u64 xsk_umem_add_offset_to_addr(u64 addr)
> +{
> +	return 0;
> +}
> +
>   static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
>   {
>   	return NULL;
> @@ -290,6 +351,12 @@ static inline bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem)
>   	return false;
>   }
>   
> +static inline u64 xsk_umem_adjust_offset(struct xdp_umem *umem, u64 handle,
> +					 u64 offset)
> +{
> +	return 0;
> +}
> +
>   #endif /* CONFIG_XDP_SOCKETS */
>   
>   #endif /* _LINUX_XDP_SOCK_H */
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 62b80d57b72a..be328c59389d 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -26,6 +26,9 @@
>    */
>   #define XDP_USE_NEED_WAKEUP (1 << 3)
>   
> +/* Flags for xsk_umem_config flags */
> +#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
> +
>   struct sockaddr_xdp {
>   	__u16 sxdp_family;
>   	__u16 sxdp_flags;
> @@ -66,6 +69,7 @@ struct xdp_umem_reg {
>   	__u64 len; /* Length of packet data area */
>   	__u32 chunk_size;
>   	__u32 headroom;
> +	__u32 flags;
>   };
>   
>   struct xdp_statistics {
> @@ -87,6 +91,11 @@ struct xdp_options {
>   #define XDP_UMEM_PGOFF_FILL_RING	0x100000000ULL
>   #define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000ULL
>   
> +/* Masks for unaligned chunks mode */
> +#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
> +#define XSK_UNALIGNED_BUF_ADDR_MASK \
> +	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> +
>   /* Rx/Tx descriptor */
>   struct xdp_desc {
>   	__u64 addr;
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 2d65779282a1..e997b263a0dd 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -340,6 +340,7 @@ static int xdp_umem_account_pages(struct xdp_umem *umem)
>   
>   static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>   {
> +	bool unaligned_chunks = mr->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG;
>   	u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
>   	unsigned int chunks, chunks_per_page;
>   	u64 addr = mr->addr, size = mr->len;
> @@ -355,7 +356,11 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>   		return -EINVAL;
>   	}
>   
> -	if (!is_power_of_2(chunk_size))
> +	if (mr->flags & ~(XDP_UMEM_UNALIGNED_CHUNK_FLAG |
> +			XDP_UMEM_USES_NEED_WAKEUP))
> +		return -EINVAL;
> +
> +	if (!unaligned_chunks && !is_power_of_2(chunk_size))
>   		return -EINVAL;
>   
>   	if (!PAGE_ALIGNED(addr)) {
> @@ -372,9 +377,11 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>   	if (chunks == 0)
>   		return -EINVAL;
>   
> -	chunks_per_page = PAGE_SIZE / chunk_size;
> -	if (chunks < chunks_per_page || chunks % chunks_per_page)
> -		return -EINVAL;
> +	if (!unaligned_chunks) {
> +		chunks_per_page = PAGE_SIZE / chunk_size;
> +		if (chunks < chunks_per_page || chunks % chunks_per_page)
> +			return -EINVAL;
> +	}
>   
>   	headroom = ALIGN(headroom, 64);
>   
> @@ -383,13 +390,15 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>   		return -EINVAL;
>   
>   	umem->address = (unsigned long)addr;
> -	umem->chunk_mask = ~((u64)chunk_size - 1);
> +	umem->chunk_mask = unaligned_chunks ? XSK_UNALIGNED_BUF_ADDR_MASK
> +					    : ~((u64)chunk_size - 1);
>   	umem->size = size;
>   	umem->headroom = headroom;
>   	umem->chunk_size_nohr = chunk_size - headroom;
>   	umem->npgs = size / PAGE_SIZE;
>   	umem->pgs = NULL;
>   	umem->user = NULL;
> +	umem->flags = mr->flags;
>   	INIT_LIST_HEAD(&umem->xsk_list);
>   	spin_lock_init(&umem->xsk_list_lock);
>   
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index ee4428a892fa..907e5f12338f 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -45,7 +45,7 @@ EXPORT_SYMBOL(xsk_umem_has_addrs);
>   
>   u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
>   {
> -	return xskq_peek_addr(umem->fq, addr);
> +	return xskq_peek_addr(umem->fq, addr, umem);
>   }
>   EXPORT_SYMBOL(xsk_umem_peek_addr);
>   
> @@ -115,21 +115,43 @@ bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem)
>   }
>   EXPORT_SYMBOL(xsk_umem_uses_need_wakeup);
>   
> +/* If a buffer crosses a page boundary, we need to do 2 memcpy's, one for
> + * each page. This is only required in copy mode.
> + */
> +static void __xsk_rcv_memcpy(struct xdp_umem *umem, u64 addr, void *from_buf,
> +			     u32 len, u32 metalen)
> +{
> +	void *to_buf = xdp_umem_get_data(umem, addr);
> +
> +	addr = xsk_umem_add_offset_to_addr(addr);
> +	if (xskq_crosses_non_contig_pg(umem, addr, len + metalen)) {
> +		void *next_pg_addr = umem->pages[(addr >> PAGE_SHIFT) + 1].addr;
> +		u64 page_start = addr & ~(PAGE_SIZE - 1);
> +		u64 first_len = PAGE_SIZE - (addr - page_start);
> +
> +		memcpy(to_buf, from_buf, first_len + metalen);
> +		memcpy(next_pg_addr, from_buf + first_len, len - first_len);
> +
> +		return;
> +	}
> +
> +	memcpy(to_buf, from_buf, len + metalen);
> +}
> +
>   static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>   {
> -	void *to_buf, *from_buf;
> +	u64 offset = xs->umem->headroom;
> +	u64 addr, memcpy_addr;
> +	void *from_buf;
>   	u32 metalen;
> -	u64 addr;
>   	int err;
>   
> -	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> +	if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
>   	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>   		xs->rx_dropped++;
>   		return -ENOSPC;
>   	}
>   
> -	addr += xs->umem->headroom;
> -
>   	if (unlikely(xdp_data_meta_unsupported(xdp))) {
>   		from_buf = xdp->data;
>   		metalen = 0;
> @@ -138,9 +160,11 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>   		metalen = xdp->data - xdp->data_meta;
>   	}
>   
> -	to_buf = xdp_umem_get_data(xs->umem, addr);
> -	memcpy(to_buf, from_buf, len + metalen);
> -	addr += metalen;
> +	memcpy_addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
> +	__xsk_rcv_memcpy(xs->umem, memcpy_addr, from_buf, len, metalen);
> +
> +	offset += metalen;
> +	addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
>   	err = xskq_produce_batch_desc(xs->rx, addr, len);
>   	if (!err) {
>   		xskq_discard_addr(xs->umem->fq);
> @@ -185,6 +209,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>   {
>   	u32 metalen = xdp->data - xdp->data_meta;
>   	u32 len = xdp->data_end - xdp->data;
> +	u64 offset = xs->umem->headroom;
>   	void *buffer;
>   	u64 addr;
>   	int err;
> @@ -196,17 +221,17 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>   		goto out_unlock;
>   	}
>   
> -	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> +	if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
>   	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>   		err = -ENOSPC;
>   		goto out_drop;
>   	}
>   
> -	addr += xs->umem->headroom;
> -
> -	buffer = xdp_umem_get_data(xs->umem, addr);
> +	buffer = xdp_umem_get_data(xs->umem, addr + offset);
>   	memcpy(buffer, xdp->data_meta, len + metalen);
> -	addr += metalen;
> +	offset += metalen;
> +
> +	addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
>   	err = xskq_produce_batch_desc(xs->rx, addr, len);
>   	if (err)
>   		goto out_drop;
> @@ -250,7 +275,7 @@ bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc)
>   
>   	rcu_read_lock();
>   	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
> -		if (!xskq_peek_desc(xs->tx, desc))
> +		if (!xskq_peek_desc(xs->tx, desc, umem))
>   			continue;
>   
>   		if (xskq_produce_addr_lazy(umem->cq, desc->addr))
> @@ -304,7 +329,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
>   	if (xs->queue_id >= xs->dev->real_num_tx_queues)
>   		goto out;
>   
> -	while (xskq_peek_desc(xs->tx, &desc)) {
> +	while (xskq_peek_desc(xs->tx, &desc, xs->umem)) {
>   		char *buffer;
>   		u64 addr;
>   		u32 len;
> @@ -333,7 +358,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
>   		skb->dev = xs->dev;
>   		skb->priority = sk->sk_priority;
>   		skb->mark = sk->sk_mark;
> -		skb_shinfo(skb)->destructor_arg = (void *)(long)addr;
> +		skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
>   		skb->destructor = xsk_destruct_skb;
>   
>   		err = dev_direct_xmit(skb, xs->queue_id);
> @@ -526,6 +551,24 @@ static struct socket *xsk_lookup_xsk_from_fd(int fd)
>   	return sock;
>   }
>   
> +/* Check if umem pages are contiguous.
> + * If zero-copy mode, use the DMA address to do the page contiguity check
> + * For all other modes we use addr (kernel virtual address)
> + * Store the result in the low bits of addr.
> + */
> +static void xsk_check_page_contiguity(struct xdp_umem *umem, u32 flags)
> +{
> +	struct xdp_umem_page *pgs = umem->pages;
> +	int i, is_contig;
> +
> +	for (i = 0; i < umem->npgs - 1; i++) {
> +		is_contig = (flags & XDP_ZEROCOPY) ?
> +			(pgs[i].dma + PAGE_SIZE == pgs[i + 1].dma) :
> +			(pgs[i].addr + PAGE_SIZE == pgs[i + 1].addr);
> +		pgs[i].addr += is_contig << XSK_NEXT_PG_CONTIG_SHIFT;
> +	}
> +}
> +
>   static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   {
>   	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
> @@ -616,6 +659,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   		err = xdp_umem_assign_dev(xs->umem, dev, qid, flags);
>   		if (err)
>   			goto out_unlock;
> +
> +		xsk_check_page_contiguity(xs->umem, flags);
>   	}
>   
>   	xs->dev = dev;
> @@ -636,6 +681,13 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   	return err;
>   }
>   
> +struct xdp_umem_reg_v1 {
> +	__u64 addr; /* Start of packet data area */
> +	__u64 len; /* Length of packet data area */
> +	__u32 chunk_size;
> +	__u32 headroom;
> +};
> +
>   static int xsk_setsockopt(struct socket *sock, int level, int optname,
>   			  char __user *optval, unsigned int optlen)
>   {
> @@ -673,10 +725,16 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>   	}
>   	case XDP_UMEM_REG:
>   	{
> -		struct xdp_umem_reg mr;
> +		size_t mr_size = sizeof(struct xdp_umem_reg);
> +		struct xdp_umem_reg mr = {};
>   		struct xdp_umem *umem;
>   
> -		if (copy_from_user(&mr, optval, sizeof(mr)))
> +		if (optlen < sizeof(struct xdp_umem_reg_v1))
> +			return -EINVAL;
> +		else if (optlen < sizeof(mr))
> +			mr_size = sizeof(struct xdp_umem_reg_v1);
> +
> +		if (copy_from_user(&mr, optval, mr_size))
>   			return -EFAULT;
>   
>   		mutex_lock(&xs->mutex);
> diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c
> index d5e06c8e0cbf..9986a759fe06 100644
> --- a/net/xdp/xsk_diag.c
> +++ b/net/xdp/xsk_diag.c
> @@ -56,7 +56,7 @@ static int xsk_diag_put_umem(const struct xdp_sock *xs, struct sk_buff *nlskb)
>   	du.id = umem->id;
>   	du.size = umem->size;
>   	du.num_pages = umem->npgs;
> -	du.chunk_size = (__u32)(~umem->chunk_mask + 1);
> +	du.chunk_size = umem->chunk_size_nohr + umem->headroom;
>   	du.headroom = umem->headroom;
>   	du.ifindex = umem->dev ? umem->dev->ifindex : 0;
>   	du.queue_id = umem->queue_id;
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index dd9e985c2461..6c67c9d0294f 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -134,6 +134,17 @@ static inline bool xskq_has_addrs(struct xsk_queue *q, u32 cnt)
>   
>   /* UMEM queue */
>   
> +static inline bool xskq_crosses_non_contig_pg(struct xdp_umem *umem, u64 addr,
> +					      u64 length)
> +{
> +	bool cross_pg = (addr & (PAGE_SIZE - 1)) + length > PAGE_SIZE;
> +	bool next_pg_contig =
> +		(unsigned long)umem->pages[(addr >> PAGE_SHIFT)].addr &
> +			XSK_NEXT_PG_CONTIG_MASK;
> +
> +	return cross_pg && !next_pg_contig;
> +}
> +
>   static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
>   {
>   	if (addr >= q->size) {
> @@ -144,23 +155,49 @@ static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
>   	return true;
>   }
>   
> -static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr)
> +static inline bool xskq_is_valid_addr_unaligned(struct xsk_queue *q, u64 addr,
> +						u64 length,
> +						struct xdp_umem *umem)
> +{
> +	addr = xsk_umem_add_offset_to_addr(addr);
> +	if (addr >= q->size ||

I know I already asked about it, but I feel we need to clarify things 
here further.

The `addr >= q->size` check is good, it checks that the address doesn't 
overflow the UMEM. However, the new encoding of UMEM handles consists of 
two parts: base address of the frame (low bits) and offset (high bits), 
that are added together by xsk_umem_add_offset_to_addr. It's possible to 
craft an address that has too big base address of the frame (all 
0xff...f), some non-zero offset, so after adding them together we pass 
`addr < q->size`, but the base address exceeds the UMEM. In my opinion, 
we should not allow such addresses, because the base address of the 
frame is out of bounds, so we need one more check here. What do you 
think about this point?

> +	    xskq_crosses_non_contig_pg(umem, addr, length)) {
> +		q->invalid_descs++;
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr,
> +				      struct xdp_umem *umem)
>   {
>   	while (q->cons_tail != q->cons_head) {
>   		struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
>   		unsigned int idx = q->cons_tail & q->ring_mask;
>   
>   		*addr = READ_ONCE(ring->desc[idx]) & q->chunk_mask;
> +
> +		if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
> +			if (xskq_is_valid_addr_unaligned(q, *addr,
> +							 umem->chunk_size_nohr,
> +							 umem))
> +				return addr;
> +			goto out;
> +		}
> +
>   		if (xskq_is_valid_addr(q, *addr))
>   			return addr;
>   
> +out:
>   		q->cons_tail++;
>   	}
>   
>   	return NULL;
>   }
>   
> -static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
> +static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr,
> +				  struct xdp_umem *umem)
>   {
>   	if (q->cons_tail == q->cons_head) {
>   		smp_mb(); /* D, matches A */
> @@ -171,7 +208,7 @@ static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
>   		smp_rmb();
>   	}
>   
> -	return xskq_validate_addr(q, addr);
> +	return xskq_validate_addr(q, addr, umem);
>   }
>   
>   static inline void xskq_discard_addr(struct xsk_queue *q)
> @@ -230,8 +267,21 @@ static inline int xskq_reserve_addr(struct xsk_queue *q)
>   
>   /* Rx/Tx queue */
>   
> -static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
> +static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d,
> +				      struct xdp_umem *umem)
>   {
> +	if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
> +		if (!xskq_is_valid_addr_unaligned(q, d->addr, d->len, umem))
> +			return false;
> +
> +		if (d->len > umem->chunk_size_nohr || d->options) {
> +			q->invalid_descs++;
> +			return false;
> +		}
> +
> +		return true;
> +	}
> +
>   	if (!xskq_is_valid_addr(q, d->addr))
>   		return false;
>   
> @@ -245,14 +295,15 @@ static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
>   }
>   
>   static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
> -						  struct xdp_desc *desc)
> +						  struct xdp_desc *desc,
> +						  struct xdp_umem *umem)
>   {
>   	while (q->cons_tail != q->cons_head) {
>   		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
>   		unsigned int idx = q->cons_tail & q->ring_mask;
>   
>   		*desc = READ_ONCE(ring->desc[idx]);
> -		if (xskq_is_valid_desc(q, desc))
> +		if (xskq_is_valid_desc(q, desc, umem))
>   			return desc;
>   
>   		q->cons_tail++;
> @@ -262,7 +313,8 @@ static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
>   }
>   
>   static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
> -					      struct xdp_desc *desc)
> +					      struct xdp_desc *desc,
> +					      struct xdp_umem *umem)
>   {
>   	if (q->cons_tail == q->cons_head) {
>   		smp_mb(); /* D, matches A */
> @@ -273,7 +325,7 @@ static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
>   		smp_rmb(); /* C, matches B */
>   	}
>   
> -	return xskq_validate_desc(q, desc);
> +	return xskq_validate_desc(q, desc, umem);
>   }
>   
>   static inline void xskq_discard_desc(struct xsk_queue *q)
> 


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox