Netdev List
 help / color / mirror / Atom feed
* Re: [pull request][for-next 00/11] Mellanox, mlx5 E-Switch updates 2017-12-19
From: Saeed Mahameed @ 2017-12-19 23:36 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Saeed Mahameed, David S. Miller, Doug Ledford, Linux Netdev List,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Leon Romanovsky
In-Reply-To: <20171219224600.GJ6122-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

On Tue, Dec 19, 2017 at 2:46 PM, Marcelo Ricardo Leitner
<marcelo.leitner-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Dec 19, 2017 at 02:39:38PM -0800, Saeed Mahameed wrote:
>> On Tue, Dec 19, 2017 at 1:54 PM, Marcelo Ricardo Leitner
>> <marcelo.leitner-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> > On Tue, Dec 19, 2017 at 12:33:29PM -0800, Saeed Mahameed wrote:
>> >> Hi Dave and Doug,
>> >>
>> >> ==============
>> >> This series includes updates for mlx5 E-Switch infrastructures,
>> >> to be merged into net-next and rdma-next trees.
>> >>
>> >> Mark's patches provide E-Switch refactoring that generalize the mlx5
>> >> E-Switch vf representors interfaces and data structures. The serious is
>> >> mainly focused on moving ethernet (netdev) specific representors logic out
>> >> of E-Switch (eswitch.c) into mlx5e representor module (en_rep.c), which
>> >> provides better separation and allows future support for other types of vf
>> >> representors (e.g. RDMA).
>> >>
>> >> Gal's patches at the end of this serious, provide a simple syntax fix and
>> >> two other patches that handles vport ingress/egress ACL steering name
>> >> spaces to be aligned with the Firmware/Hardware specs.
>> >
>> > Patch 10 actually looks quite worrysome if a card would support only
>> > one ingress or egress, but if all of them support both, then it should
>> > be fine yes. Is that possible, to support only one direction?
>> >
>>
>> Hi Marcelo,
>>
>> Patch 10 is a function renaming patch that fixes a function name to
>> correspond with its behavior!
>>
>> Are you asking about all mlx5 cards or all cards in general ?
>> All patches in this series, and specifically patches 10,11 only
>> concern mlx5 eswitch specific implementation and mlx5 hardware spec,
>> they have nothing to do with other vendors cards.
>
> I was asking only bout Mellanox cards.
>
>>
>> All mlx5 cards do support both ingress and egress ACLs per vport (VF),
>> each has its own namespaces (separate pipeline), hence patch #11.
>
> Alright then.
>
>>
>> So i don't see why you worry so much :).
>
> Oh but because currently they are called like:
>
>                 if (MLX5_CAP_ESW_EGRESS_ACL(dev, ft_support)) {
>                         err = init_egress_acl_root_ns(steering);
> ....
>
>                 if (MLX5_CAP_ESW_INGRESS_ACL(dev, ft_support)) {
>                         err = init_ingress_acl_root_ns(steering);
> ....
>
> and it could cause bigger problems if only one kind were supported,
> like trying to setup egress while it's not supported by the card.
> But again, as it seems both are always supported, that's no issue.
>

Ok, now i see your point, and the driver is already ok for that
matter, even for the extreme non existing cases of when one of the
ACLs is not supported,
ACL users need to query for the ACL namespace before using it, e.g:

root_ns = mlx5_get_flow_vport_acl_namespace(dev,
MLX5_FLOW_NAMESPACE_ESW_EGRESS, vport->vport);
if (!root_ns)
           abort;

so if for some reason  MLX5_CAP_ESW_EGRESS_ACL(dev, ft_support) is
false, then any following call to mlx5_get_flow_vport_acl_namespace
will return NULL, which means "not supported" and those cases are
already handled by the mlx5 eswitch driver.

Saeed.

> Thanks,
> Marcelo
>
>>
>> >   Marcelo
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v3 ipsec-next 0/3] xfrm: offload api fixes
From: Shannon Nelson @ 2017-12-19 23:35 UTC (permalink / raw)
  To: steffen.klassert; +Cc: netdev

These are a couple of little fixes to the xfrm_offload API to make
life just a little easier for the poor driver developer.

Changes from v2:
 - fix up another kbuild robot complaint when CONFIG_XFRM_OFFLOAD is off
 - split out checks into a common function for register and feature check

Changes from v1:
 - removed netdev_err() notes  (Steffen)
 - fixed build when CONFIG_XFRM_OFFLOAD is off (kbuild robot)
 - split into multiple patches (me)


Shannon Nelson (3):
  xfrm: check for xdo_dev_state_free
  xfrm: check for xdo_dev_ops add and delete
  xfrm: wrap xfrmdev_ops with offload config

 include/linux/netdevice.h |  2 +-
 include/net/xfrm.h        |  3 ++-
 net/xfrm/xfrm_device.c    | 32 ++++++++++++++++++--------------
 3 files changed, 21 insertions(+), 16 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH v3 ipsec-next 3/3] xfrm: wrap xfrmdev_ops with offload config
From: Shannon Nelson @ 2017-12-19 23:35 UTC (permalink / raw)
  To: steffen.klassert; +Cc: netdev
In-Reply-To: <1513726549-7065-1-git-send-email-shannon.nelson@oracle.com>

There's no reason to define netdev->xfrmdev_ops if
the offload facility is not CONFIG'd in.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 include/linux/netdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2eaac7d..145d0de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1697,7 +1697,7 @@ struct net_device {
 	const struct ndisc_ops *ndisc_ops;
 #endif
 
-#ifdef CONFIG_XFRM
+#ifdef CONFIG_XFRM_OFFLOAD
 	const struct xfrmdev_ops *xfrmdev_ops;
 #endif
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 ipsec-next 1/3] xfrm: check for xdo_dev_state_free
From: Shannon Nelson @ 2017-12-19 23:35 UTC (permalink / raw)
  To: steffen.klassert; +Cc: netdev
In-Reply-To: <1513726549-7065-1-git-send-email-shannon.nelson@oracle.com>

The current XFRM code assumes that we've implemented the
xdo_dev_state_free() callback, even if it is meaningless to the driver.
This patch adds a check for it before calling, as done in other APIs,
to prevent a NULL function pointer kernel crash.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 include/net/xfrm.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index e015e16..dfabd04 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1891,7 +1891,8 @@ static inline void xfrm_dev_state_free(struct xfrm_state *x)
 	 struct net_device *dev = xso->dev;
 
 	if (dev && dev->xfrmdev_ops) {
-		dev->xfrmdev_ops->xdo_dev_state_free(x);
+		if (dev->xfrmdev_ops->xdo_dev_state_free)
+			dev->xfrmdev_ops->xdo_dev_state_free(x);
 		xso->dev = NULL;
 		dev_put(dev);
 	}
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 ipsec-next 2/3] xfrm: check for xdo_dev_ops add and delete
From: Shannon Nelson @ 2017-12-19 23:35 UTC (permalink / raw)
  To: steffen.klassert; +Cc: netdev
In-Reply-To: <1513726549-7065-1-git-send-email-shannon.nelson@oracle.com>

This adds a check for the required add and delete functions up front
at registration time to be sure both are defined.

Since both the features check and the registration check are looking
at the same things, break out the check for both to call.

Lastly, for some reason the feature check was setting xfrmdev_ops to
NULL if the NETIF_F_HW_ESP bit was missing, which would probably
surprise the driver later if the driver turned its NETIF_F_HW_ESP bit
back on.  We shouldn't be messing with the driver's callback list, so
we stop doing that with this patch.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 net/xfrm/xfrm_device.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 30e5746..d271de5 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -142,17 +142,31 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
 EXPORT_SYMBOL_GPL(xfrm_dev_offload_ok);
 #endif
 
-static int xfrm_dev_register(struct net_device *dev)
+static int xfrm_api_check(struct net_device *dev)
 {
-	if ((dev->features & NETIF_F_HW_ESP) && !dev->xfrmdev_ops)
-		return NOTIFY_BAD;
+#ifdef CONFIG_XFRM_OFFLOAD
 	if ((dev->features & NETIF_F_HW_ESP_TX_CSUM) &&
 	    !(dev->features & NETIF_F_HW_ESP))
 		return NOTIFY_BAD;
 
+	if ((dev->features & NETIF_F_HW_ESP) &&
+	    (!(dev->xfrmdev_ops &&
+	       dev->xfrmdev_ops->xdo_dev_state_add &&
+	       dev->xfrmdev_ops->xdo_dev_state_delete)))
+		return NOTIFY_BAD;
+#else
+	if (dev->features & (NETIF_F_HW_ESP | NETIF_F_HW_ESP_TX_CSUM))
+		return NOTIFY_BAD;
+#endif
+
 	return NOTIFY_DONE;
 }
 
+static int xfrm_dev_register(struct net_device *dev)
+{
+	return xfrm_api_check(dev);
+}
+
 static int xfrm_dev_unregister(struct net_device *dev)
 {
 	xfrm_policy_cache_flush();
@@ -161,16 +175,7 @@ static int xfrm_dev_unregister(struct net_device *dev)
 
 static int xfrm_dev_feat_change(struct net_device *dev)
 {
-	if ((dev->features & NETIF_F_HW_ESP) && !dev->xfrmdev_ops)
-		return NOTIFY_BAD;
-	else if (!(dev->features & NETIF_F_HW_ESP))
-		dev->xfrmdev_ops = NULL;
-
-	if ((dev->features & NETIF_F_HW_ESP_TX_CSUM) &&
-	    !(dev->features & NETIF_F_HW_ESP))
-		return NOTIFY_BAD;
-
-	return NOTIFY_DONE;
+	return xfrm_api_check(dev);
 }
 
 static int xfrm_dev_down(struct net_device *dev)
-- 
2.7.4

^ permalink raw reply related

* Re: [RFC] hv_netvsc: automatically name slave VF network device
From: Jakub Kicinski @ 2017-12-19 23:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Samudrala, Sridhar, netdev, Stephen Hemminger
In-Reply-To: <20171219145017.78b60959@xeon-e3>

On Tue, 19 Dec 2017 14:50:17 -0800, Stephen Hemminger wrote:
> On Tue, 19 Dec 2017 14:44:37 -0800
> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> 
> >  -static void __netvsc_vf_setup(struct net_device *ndev,  
> > > -			      struct net_device *vf_netdev)
> > > -{
> > > -	int ret;
> > > +	/* set the name of VF device based on upper device name */
> > > +	snprintf(vf_name, IFNAMSIZ, "%s_vf", ndev->name);
> > > +	ret = dev_change_name(vf_netdev, vf_name);
> > > +	if (ret != 0)
> > > +		netdev_warn(vf_netdev,
> > > +			    "can not rename device: (%d)\n", ret);    
> > 
> > It is possible that upper device name can change after this call.  I 
> > noticed this
> > when i tried this approach with virtio_net.
> > 
> > Also, what should happen if the upper device is unloaded? Should we rename
> > the VF name?  
> 
> Yes upper device can change name. So sure, netdevice could trap that
> in callback (it already has notifier) and rename VF. Will add that in V2.
> 
> If upper device is unloaded then it is already decoupled from the VF.
> There is no good value to change it back to. The orignal name probably
> has been reused by then.

Both of those issues would be solved by just exposing phys_port_name
from the VF driver, and letting systemd do its thing independent of 
magic bonds.

Reluctance to do driver work aside :/

^ permalink raw reply

* Re: [PATCH 3/3] trace: print address if symbol not found
From: kbuild test robot @ 2017-12-19 23:19 UTC (permalink / raw)
  To: Tobin C. Harding
  Cc: kbuild-all, kernel-hardening, Tobin C. Harding, Steven Rostedt,
	Tycho Andersen, Linus Torvalds, Kees Cook, Andrew Morton,
	Daniel Borkmann, Masahiro Yamada, Alexei Starovoitov,
	linux-kernel, Network Development
In-Reply-To: <1513554812-13014-4-git-send-email-me@tobin.cc>

[-- Attachment #1: Type: text/plain, Size: 1801 bytes --]

Hi Tobin,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.15-rc4 next-20171219]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Tobin-C-Harding/kallsyms-don-t-leak-address/20171220-062707
config: i386-randconfig-x016-201751 (attached as .config)
compiler: gcc-7 (Debian 7.2.0-12) 7.2.1 20171025
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   kernel/trace/trace_events_hist.c: In function 'hist_trigger_stacktrace_print':
>> kernel/trace/trace_events_hist.c:985:3: error: implicit declaration of function 'trace_sprint_symbol_addr'; did you mean 'trace_sprint_symbol'? [-Werror=implicit-function-declaration]
      trace_sprint_symbol_addr(str, stacktrace_entries[i]);
      ^~~~~~~~~~~~~~~~~~~~~~~~
      trace_sprint_symbol
   cc1: some warnings being treated as errors

vim +985 kernel/trace/trace_events_hist.c

   971	
   972	static void hist_trigger_stacktrace_print(struct seq_file *m,
   973						  unsigned long *stacktrace_entries,
   974						  unsigned int max_entries)
   975	{
   976		char str[KSYM_SYMBOL_LEN];
   977		unsigned int spaces = 8;
   978		unsigned int i;
   979	
   980		for (i = 0; i < max_entries; i++) {
   981			if (stacktrace_entries[i] == ULONG_MAX)
   982				return;
   983	
   984			seq_printf(m, "%*c", 1 + spaces, ' ');
 > 985			trace_sprint_symbol_addr(str, stacktrace_entries[i]);
   986			seq_printf(m, "%s\n", str);
   987		}
   988	}
   989	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29958 bytes --]

^ permalink raw reply

* Re: [PATCH v2 iproute2 net-next] erspan: add erspan version II support
From: William Tu @ 2017-12-19 23:15 UTC (permalink / raw)
  To: David Ahern; +Cc: Linux Kernel Network Developers
In-Reply-To: <d7283b25-e7a2-76e3-3946-137bc5baf416@gmail.com>

On Tue, Dec 19, 2017 at 2:17 PM, David Ahern <dsahern@gmail.com> wrote:
> On 12/15/17 6:06 PM, William Tu wrote:
>> @@ -343,6 +355,22 @@ get_failed:
>>                               invarg("invalid erspan index\n", *argv);
>>                       if (erspan_idx & ~((1<<20) - 1) || erspan_idx == 0)
>>                               invarg("erspan index must be > 0 and <= 20-bit\n", *argv);
>> +             } else if (strcmp(*argv, "erspan_ver") == 0) {
>> +                     NEXT_ARG();
>> +                     if (get_u8(&erspan_ver, *argv, 0))
>> +                             invarg("invalid erspan version\n", *argv);
>> +                     if (erspan_ver != 1 && erspan_ver != 2)
>> +                             invarg("erspan version must be 1 or 2\n", *argv);
>> +             } else if (strcmp(*argv, "erspan_dir") == 0) {
>> +                     NEXT_ARG();
>> +                     if (get_u8(&erspan_dir, *argv, 0))
>> +                             invarg("invalid erspan direction\n", *argv);
>> +                     if (erspan_dir != 0 && erspan_dir != 1)
>> +                             invarg("erspan direction must be 0(Ingress) or 1(Egress)\n", *argv);
>
> Why not allow ingress and egress as strings and using matches()?
> e.g.,  '... erspan_dir in[gress] ...' or '... erspan_dir e[gress] ...'
>
> Seems much more intuitive than remembering "0" = ingress and "1" = egress.
>
>> @@ -374,8 +402,15 @@ get_failed:
>>               addattr_l(n, 1024, IFLA_GRE_TTL, &ttl, 1);
>>               addattr_l(n, 1024, IFLA_GRE_TOS, &tos, 1);
>>               addattr32(n, 1024, IFLA_GRE_FWMARK, fwmark);
>> -             if (erspan_idx != 0)
>> -                     addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
>> +             if (erspan_ver) {
>> +                     addattr8(n, 1024, IFLA_GRE_ERSPAN_VER, erspan_ver);
>> +                     if (erspan_ver == 1 && erspan_idx != 0) {
>> +                             addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
>> +                     } else if (erspan_ver == 2) {
>> +                             addattr8(n, 1024, IFLA_GRE_ERSPAN_DIR, erspan_dir);
>> +                             addattr16(n, 1024, IFLA_GRE_ERSPAN_HWID, erspan_hwid);
>> +                     }
>> +             }
>>       } else {
>>               addattr_l(n, 1024, IFLA_GRE_COLLECT_METADATA, NULL, 0);
>>       }
>> @@ -514,7 +549,25 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
>>       if (tb[IFLA_GRE_ERSPAN_INDEX]) {
>>               __u32 erspan_idx = rta_getattr_u32(tb[IFLA_GRE_ERSPAN_INDEX]);
>>
>> -             fprintf(f, "erspan_index %u ", erspan_idx);
>> +             print_uint(PRINT_ANY, "erspan_index", "erspan_index %u", erspan_idx);
>> +     }
>> +
>> +     if (tb[IFLA_GRE_ERSPAN_VER]) {
>> +             __u8 erspan_ver = rta_getattr_u8(tb[IFLA_GRE_ERSPAN_VER]);
>> +
>> +             print_uint(PRINT_ANY, "erspan_ver", "erspan_ver %u", erspan_ver);
>> +     }
>> +
>> +     if (tb[IFLA_GRE_ERSPAN_DIR]) {
>> +             __u8 erspan_dir = rta_getattr_u8(tb[IFLA_GRE_ERSPAN_DIR]);
>> +
>> +             print_uint(PRINT_ANY, "erspan_dir", "erspan_dir %u", erspan_dir);
>
> similarly, here can convert the 0/1 to a human string.
>
> Same comments for the changes to link_gre6.c
>
Hi David,
Thanks for the feedback. I will convert to human string and resubmit.
William

^ permalink raw reply

* Re: [net-next: PATCH 0/8] Armada 7k/8k PP2 ACPI support
From: Marcin Wojtas @ 2017-12-19 23:13 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: David Miller, linux-kernel, linux-arm-kernel, netdev,
	Russell King - ARM Linux, Rafael J. Wysocki, Florian Fainelli,
	Antoine Ténart, Thomas Petazzoni, Gregory Clément,
	Ezequiel Garcia, nadavh, Neta Zur Hershkovits, Ard Biesheuvel,
	Grzegorz Jaszczyk, Tomasz Nowicki
In-Reply-To: <20171219204614.GA24156@lunn.ch>

Hi Andrew,

2017-12-19 21:46 GMT+01:00 Andrew Lunn <andrew@lunn.ch>:
>> Of course! v2 will not have such problem, I've been waiting however
>> for the feedback about the ACPI representation. Anyway, I'm strongly
>> leaning towards using _ADR/_CID objects in PHY's nodes for ACPI, so
>> maybe I'll just issue the v2 in order to push the discussion a bit
>> forward.
>
> Hi Marcin
>
> I know ~0 about ACPI. But what seems to be missing for me is
> documentation. You are defining a ABI here, which all future MDIO
> busses, PHYs, and to some extent Ethernet switches need to follow. So
> i would expect this to be documented somewhere.

As the code derives straight from of_mdio.c, there is absolutely no
deviation from Documentation/devicetree/bindings/net/mdio.txt.

>
> How does documentation work in the ACPI world?
>

ACPI has its own specification, the newest one is 6.2
http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

It is pretty likely, that if we establish a generic solution for the
MDIO bus, a short section about it may probably require adding to
above - for now I see none.

Best regards,
Marcin

^ permalink raw reply

* Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
From: Stephen Hemminger @ 2017-12-19 22:53 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: David Miller, mst, netdev, virtualization, alexander.duyck,
	jesse.brandeburg
In-Reply-To: <ddef3208-3cf5-faa2-7572-a561a6b35e2f@intel.com>

On Tue, 19 Dec 2017 14:37:50 -0800
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:

> On 12/19/2017 11:46 AM, Stephen Hemminger wrote:
> > On Tue, 19 Dec 2017 11:42:33 -0800
> > "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> >  
> >> On 12/19/2017 10:41 AM, Stephen Hemminger wrote:  
> >>> On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
> >>> David Miller <davem@davemloft.net> wrote:
> >>>     
> >>>> From: Stephen Hemminger <stephen@networkplumber.org>
> >>>> Date: Tue, 19 Dec 2017 09:55:48 -0800
> >>>>     
> >>>>> could be 10ms, just enough to let udev do its renaming  
> >>>> Please, move to some kind of notification or event based handling of
> >>>> this problem.
> >>>>
> >>>> No delay is safe, what if userspace gets swapped out or whatever
> >>>> else might make userspace stall unexpectedly?
> >>>>     
> >>> The plan is to remove the delay and do the naming in the kernel.
> >>> This was suggested by Lennart since udev is only doing naming policy
> >>> because kernel names were not repeatable.
> >>>
> >>> This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
> >>>
> >>> Patch is pending.  
> >> Do we really need to delay the setup until the name is changed?
> >> Can't we call dev_set_mtu() and dev_open() until dev_change_name() is done?
> >>
> >> Thanks
> >> Sridhar  
> > You can call dev_set_mtu, but when dev_open is done the device name
> > can not be changed by userspace.  
> I did a quick test to remove the delay and also the dev_open() call and 
> i don't see
> any issues with virtio taking over the VF datapath.
> Only the netdev_info() messages may show old device name.
> 
> Any specific scenario where we need to explicitly call  VF's dev_open() 
> in the VF setup process?
> I tried i40evf driver loaded after virtio_net  AND  virtio_net loading 
> after i40evf.
> 
> Thanks
> Sridhar

It happens with hotplug. It is possible on Hyper-V to hotplug SR-IOV on
and off while guest is running. If SR-IOV is disabled in host then the
VF device is removed (hotplug) and the inverse. If the master device is
up then the VF device should be brought up by the master device.

^ permalink raw reply

* Re: [RFC] hv_netvsc: automatically name slave VF network device
From: Stephen Hemminger @ 2017-12-19 22:51 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, Stephen Hemminger, Jiri Pirko
In-Reply-To: <20171219142452.0f708189@cakuba.netronome.com>

On Tue, 19 Dec 2017 14:24:52 -0800
Jakub Kicinski <kubakici@wp.pl> wrote:

> On Tue, 19 Dec 2017 14:06:59 -0800, Stephen Hemminger wrote:
> > > > > > I assume you mean the modern application is udev, and it works
> > > > > > but the name is meaningless because it based of synthetic PCI
> > > > > > information. The PCI host adapter is simulated for pass through
> > > > > > devices. Names like enp12s0.
> > > > > > 
> > > > > > Since every passthrough VF device on Hyper-V/Azure has a matching
> > > > > > synthetic network device with same mac address. It is best to
> > > > > > have the relationship shown in the name.          
> > > > > 
> > > > > How about we make the VF drivers expose "vf" as phys_port_name?
> > > > > Then systemd/udev should glue that onto the name regardless of
> > > > > how the VF is used.        
> > > > 
> > > > One of the goals was not to modify in any way other drivers (like VF).      
> > > 
> > > Why?  Do you have out-of-tree drivers you can't change or some such?    
> > 
> > This needs to work on enterprise distributions; plus it is not good
> > practice to introduce random changes into partners like Mellanox
> > drivers.  
> 
> Are we talking about Linux or Windows kernel here?  I don't think
> this requires hypervisor changes?  The notion of a "partner" and
> changing drivers by people who are not employed by a vendor being 
> bad practice sounds entirely foreign to me.

Minor bug fixes sure, but changing semantics requires consultation.
Vendors like Intel and Mellanox have code bases that feed upstream.

> If we agree that marking VF interfaces is useful (and I think 
> it is) then we should mark them always, not only when they are 
> enslaved to a magic bond.  And the natural way of doing that 
> seems to be phys_port_name.

^ permalink raw reply

* Re: [RFC] hv_netvsc: automatically name slave VF network device
From: Stephen Hemminger @ 2017-12-19 22:50 UTC (permalink / raw)
  To: Samudrala, Sridhar; +Cc: netdev, Stephen Hemminger
In-Reply-To: <e2fa7512-7c49-b756-f239-8bf52bc70d65@intel.com>

On Tue, 19 Dec 2017 14:44:37 -0800
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:

>  -static void __netvsc_vf_setup(struct net_device *ndev,
> > -			      struct net_device *vf_netdev)
> > -{
> > -	int ret;
> > +	/* set the name of VF device based on upper device name */
> > +	snprintf(vf_name, IFNAMSIZ, "%s_vf", ndev->name);
> > +	ret = dev_change_name(vf_netdev, vf_name);
> > +	if (ret != 0)
> > +		netdev_warn(vf_netdev,
> > +			    "can not rename device: (%d)\n", ret);  
> 
> It is possible that upper device name can change after this call.  I 
> noticed this
> when i tried this approach with virtio_net.
> 
> Also, what should happen if the upper device is unloaded? Should we rename
> the VF name?

Yes upper device can change name. So sure, netdevice could trap that
in callback (it already has notifier) and rename VF. Will add that in V2.

If upper device is unloaded then it is already decoupled from the VF.
There is no good value to change it back to. The orignal name probably
has been reused by then.

^ permalink raw reply

* Re: [pull request][for-next 00/11] Mellanox, mlx5 E-Switch updates 2017-12-19
From: Marcelo Ricardo Leitner @ 2017-12-19 22:46 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Saeed Mahameed, David S. Miller, Doug Ledford, Linux Netdev List,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Leon Romanovsky
In-Reply-To: <CALzJLG9i4g7c8vKbw7k+o=Pv5GcqWi6AATpOQXj09_KwqvWkmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Dec 19, 2017 at 02:39:38PM -0800, Saeed Mahameed wrote:
> On Tue, Dec 19, 2017 at 1:54 PM, Marcelo Ricardo Leitner
> <marcelo.leitner-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Tue, Dec 19, 2017 at 12:33:29PM -0800, Saeed Mahameed wrote:
> >> Hi Dave and Doug,
> >>
> >> ==============
> >> This series includes updates for mlx5 E-Switch infrastructures,
> >> to be merged into net-next and rdma-next trees.
> >>
> >> Mark's patches provide E-Switch refactoring that generalize the mlx5
> >> E-Switch vf representors interfaces and data structures. The serious is
> >> mainly focused on moving ethernet (netdev) specific representors logic out
> >> of E-Switch (eswitch.c) into mlx5e representor module (en_rep.c), which
> >> provides better separation and allows future support for other types of vf
> >> representors (e.g. RDMA).
> >>
> >> Gal's patches at the end of this serious, provide a simple syntax fix and
> >> two other patches that handles vport ingress/egress ACL steering name
> >> spaces to be aligned with the Firmware/Hardware specs.
> >
> > Patch 10 actually looks quite worrysome if a card would support only
> > one ingress or egress, but if all of them support both, then it should
> > be fine yes. Is that possible, to support only one direction?
> >
> 
> Hi Marcelo,
> 
> Patch 10 is a function renaming patch that fixes a function name to
> correspond with its behavior!
> 
> Are you asking about all mlx5 cards or all cards in general ?
> All patches in this series, and specifically patches 10,11 only
> concern mlx5 eswitch specific implementation and mlx5 hardware spec,
> they have nothing to do with other vendors cards.

I was asking only bout Mellanox cards.

> 
> All mlx5 cards do support both ingress and egress ACLs per vport (VF),
> each has its own namespaces (separate pipeline), hence patch #11.

Alright then.

> 
> So i don't see why you worry so much :).

Oh but because currently they are called like:

                if (MLX5_CAP_ESW_EGRESS_ACL(dev, ft_support)) {
                        err = init_egress_acl_root_ns(steering);
....

                if (MLX5_CAP_ESW_INGRESS_ACL(dev, ft_support)) {
                        err = init_ingress_acl_root_ns(steering);
....

and it could cause bigger problems if only one kind were supported,
like trying to setup egress while it's not supported by the card.
But again, as it seems both are always supported, that's no issue.

Thanks,
Marcelo

> 
> >   Marcelo
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC] hv_netvsc: automatically name slave VF network device
From: Samudrala, Sridhar @ 2017-12-19 22:44 UTC (permalink / raw)
  To: Stephen Hemminger, netdev; +Cc: Stephen Hemminger
In-Reply-To: <20171219193537.22587-1-sthemmin@microsoft.com>

On 12/19/2017 11:35 AM, Stephen Hemminger wrote:
> Rename the VF device to ethX_vf based on the ethX as the
> synthetic device.  This eliminates the need for delay on setup,
> and the PCI (udev based) naming is not reproducible on Hyper-V
> anyway. The name of the VF does not matter since all control
> operations take place the primary device. It does make the
> user experience better to associate the names.
>
> Based on feedback from all.systems.go talk.
> The downside is that it requires exporting a symbol from netdev
> core which makes it harder to backport.
>
> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> ---
>   drivers/net/hyperv/hyperv_net.h |  1 -
>   drivers/net/hyperv/netvsc_drv.c | 53 +++++++++++------------------------------
>   net/core/dev.c                  |  1 +
>   3 files changed, 15 insertions(+), 40 deletions(-)
>
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 0db3bd1ea06f..5013189123e6 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -762,7 +762,6 @@ struct net_device_context {
>   	/* State to manage the associated VF interface. */
>   	struct net_device __rcu *vf_netdev;
>   	struct netvsc_vf_pcpu_stats __percpu *vf_stats;
> -	struct delayed_work vf_takeover;
>   
>   	/* 1: allocated, serial number is valid. 0: not allocated */
>   	u32 vf_alloc;
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index c5584c2d440e..8a0afe9a5064 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -49,7 +49,6 @@
>   #define RING_SIZE_MIN		64
>   
>   #define LINKCHANGE_INT (2 * HZ)
> -#define VF_TAKEOVER_INT (HZ / 10)
>   
>   static unsigned int ring_size __ro_after_init = 128;
>   module_param(ring_size, uint, S_IRUGO);
> @@ -1760,7 +1759,7 @@ static rx_handler_result_t netvsc_vf_handle_frame(struct sk_buff **pskb)
>   static int netvsc_vf_join(struct net_device *vf_netdev,
>   			  struct net_device *ndev)
>   {
> -	struct net_device_context *ndev_ctx = netdev_priv(ndev);
> +	char vf_name[IFNAMSIZ];
>   	int ret;
>   
>   	ret = netdev_rx_handler_register(vf_netdev,
> @@ -1783,23 +1782,14 @@ static int netvsc_vf_join(struct net_device *vf_netdev,
>   	/* set slave flag before open to prevent IPv6 addrconf */
>   	vf_netdev->flags |= IFF_SLAVE;
>   
> -	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
> -
>   	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
>   
> -	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
> -	return 0;
> -
> -upper_link_failed:
> -	netdev_rx_handler_unregister(vf_netdev);
> -rx_handler_failed:
> -	return ret;
> -}
> -
> -static void __netvsc_vf_setup(struct net_device *ndev,
> -			      struct net_device *vf_netdev)
> -{
> -	int ret;
> +	/* set the name of VF device based on upper device name */
> +	snprintf(vf_name, IFNAMSIZ, "%s_vf", ndev->name);
> +	ret = dev_change_name(vf_netdev, vf_name);
> +	if (ret != 0)
> +		netdev_warn(vf_netdev,
> +			    "can not rename device: (%d)\n", ret);

It is possible that upper device name can change after this call.  I 
noticed this
when i tried this approach with virtio_net.

Also, what should happen if the upper device is unloaded? Should we rename
the VF name?


>   
>   	/* Align MTU of VF with master */
>   	ret = dev_set_mtu(vf_netdev, ndev->mtu);
> @@ -1807,34 +1797,21 @@ static void __netvsc_vf_setup(struct net_device *ndev,
>   		netdev_warn(vf_netdev,
>   			    "unable to change mtu to %u\n", ndev->mtu);
>   
> +	/* If upper device is already up, bring up slave */
>   	if (netif_running(ndev)) {
>   		ret = dev_open(vf_netdev);
>   		if (ret)
>   			netdev_warn(vf_netdev,
>   				    "unable to open: %d\n", ret);
>   	}
> -}
> -
> -/* Setup VF as slave of the synthetic device.
> - * Runs in workqueue to avoid recursion in netlink callbacks.
> - */
> -static void netvsc_vf_setup(struct work_struct *w)
> -{
> -	struct net_device_context *ndev_ctx
> -		= container_of(w, struct net_device_context, vf_takeover.work);
> -	struct net_device *ndev = hv_get_drvdata(ndev_ctx->device_ctx);
> -	struct net_device *vf_netdev;
> -
> -	if (!rtnl_trylock()) {
> -		schedule_delayed_work(&ndev_ctx->vf_takeover, 0);
> -		return;
> -	}
>   
> -	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
> -	if (vf_netdev)
> -		__netvsc_vf_setup(ndev, vf_netdev);
> +	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
> +	return 0;
>   
> -	rtnl_unlock();
> + upper_link_failed:
> +	netdev_rx_handler_unregister(vf_netdev);
> + rx_handler_failed:
> +	return ret;
>   }
>   
>   static int netvsc_register_vf(struct net_device *vf_netdev)
> @@ -1904,7 +1881,6 @@ static int netvsc_unregister_vf(struct net_device *vf_netdev)
>   		return NOTIFY_DONE;
>   
>   	net_device_ctx = netdev_priv(ndev);
> -	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
>   
>   	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
>   
> @@ -1947,7 +1923,6 @@ static int netvsc_probe(struct hv_device *dev,
>   
>   	spin_lock_init(&net_device_ctx->lock);
>   	INIT_LIST_HEAD(&net_device_ctx->reconfig_events);
> -	INIT_DELAYED_WORK(&net_device_ctx->vf_takeover, netvsc_vf_setup);
>   
>   	net_device_ctx->vf_stats
>   		= netdev_alloc_pcpu_stats(struct netvsc_vf_pcpu_stats);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index c7db39926769..f51358a90efa 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1238,6 +1238,7 @@ int dev_change_name(struct net_device *dev, const char *newname)
>   
>   	return err;
>   }
> +EXPORT_SYMBOL(dev_change_name);
>   
>   /**
>    *	dev_set_alias - change ifalias of a device

^ permalink raw reply

* Re: [net 02/14] Revert "mlx5: move affinity hints assignments to generic code"
From: Jes Sorensen @ 2017-12-19 22:42 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller; +Cc: netdev, Sagi Grimberg, Thomas Gleixner
In-Reply-To: <20171219222456.29627-3-saeedm@mellanox.com>

On 12/19/2017 05:24 PM, Saeed Mahameed wrote:
> Before the offending commit, mlx5 core did the IRQ affinity itself,
> and it seems that the new generic code have some drawbacks and one
> of them is the lack for user ability to modify irq affinity after
> the initial affinity values got assigned.
> 
> The issue is still being discussed and a solution in the new generic code
> is required, until then we need to revert this patch.
> 
> This fixes the following issue:
> echo <new affinity> > /proc/irq/<x>/smp_affinity
> fails with  -EIO
> 
> This reverts commit a435393acafbf0ecff4deb3e3cb554b34f0d0664.
> Note: kept mlx5_get_vector_affinity in include/linux/mlx5/driver.h since
> it is used in mlx5_ib driver.
> 
> Fixes: a435393acafb ("mlx5: move affinity hints assignments to generic code")
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Jes Sorensen <jsorensen@fb.com>
> Reported-by: Jes Sorensen <jsorensen@fb.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

Acked-by: Jes Sorensen <jsorensen@fb.com>

Cheers,
Jes


> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h      |  1 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 45 +++++++-------
>  drivers/net/ethernet/mellanox/mlx5/core/main.c    | 75 +++++++++++++++++++++--
>  include/linux/mlx5/driver.h                       |  1 +
>  4 files changed, 93 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index c0872b3284cb..43f9054830e5 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -590,6 +590,7 @@ struct mlx5e_channel {
>  	struct mlx5_core_dev      *mdev;
>  	struct hwtstamp_config    *tstamp;
>  	int                        ix;
> +	int                        cpu;
>  };
>  
>  struct mlx5e_channels {
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index d2b057a3e512..cbec66bc82f1 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -71,11 +71,6 @@ struct mlx5e_channel_param {
>  	struct mlx5e_cq_param      icosq_cq;
>  };
>  
> -static int mlx5e_get_node(struct mlx5e_priv *priv, int ix)
> -{
> -	return pci_irq_get_node(priv->mdev->pdev, MLX5_EQ_VEC_COMP_BASE + ix);
> -}
> -
>  static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
>  {
>  	return MLX5_CAP_GEN(mdev, striding_rq) &&
> @@ -444,17 +439,16 @@ static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
>  	int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
>  	int mtt_sz = mlx5e_get_wqe_mtt_sz();
>  	int mtt_alloc = mtt_sz + MLX5_UMR_ALIGN - 1;
> -	int node = mlx5e_get_node(c->priv, c->ix);
>  	int i;
>  
>  	rq->mpwqe.info = kzalloc_node(wq_sz * sizeof(*rq->mpwqe.info),
> -					GFP_KERNEL, node);
> +				      GFP_KERNEL, cpu_to_node(c->cpu));
>  	if (!rq->mpwqe.info)
>  		goto err_out;
>  
>  	/* We allocate more than mtt_sz as we will align the pointer */
> -	rq->mpwqe.mtt_no_align = kzalloc_node(mtt_alloc * wq_sz,
> -					GFP_KERNEL, node);
> +	rq->mpwqe.mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
> +					cpu_to_node(c->cpu));
>  	if (unlikely(!rq->mpwqe.mtt_no_align))
>  		goto err_free_wqe_info;
>  
> @@ -562,7 +556,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>  	int err;
>  	int i;
>  
> -	rqp->wq.db_numa_node = mlx5e_get_node(c->priv, c->ix);
> +	rqp->wq.db_numa_node = cpu_to_node(c->cpu);
>  
>  	err = mlx5_wq_ll_create(mdev, &rqp->wq, rqc_wq, &rq->wq,
>  				&rq->wq_ctrl);
> @@ -629,8 +623,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>  	default: /* MLX5_WQ_TYPE_LINKED_LIST */
>  		rq->wqe.frag_info =
>  			kzalloc_node(wq_sz * sizeof(*rq->wqe.frag_info),
> -				     GFP_KERNEL,
> -				     mlx5e_get_node(c->priv, c->ix));
> +				     GFP_KERNEL, cpu_to_node(c->cpu));
>  		if (!rq->wqe.frag_info) {
>  			err = -ENOMEM;
>  			goto err_rq_wq_destroy;
> @@ -1000,13 +993,13 @@ static int mlx5e_alloc_xdpsq(struct mlx5e_channel *c,
>  	sq->uar_map   = mdev->mlx5e_res.bfreg.map;
>  	sq->min_inline_mode = params->tx_min_inline_mode;
>  
> -	param->wq.db_numa_node = mlx5e_get_node(c->priv, c->ix);
> +	param->wq.db_numa_node = cpu_to_node(c->cpu);
>  	err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, &sq->wq, &sq->wq_ctrl);
>  	if (err)
>  		return err;
>  	sq->wq.db = &sq->wq.db[MLX5_SND_DBR];
>  
> -	err = mlx5e_alloc_xdpsq_db(sq, mlx5e_get_node(c->priv, c->ix));
> +	err = mlx5e_alloc_xdpsq_db(sq, cpu_to_node(c->cpu));
>  	if (err)
>  		goto err_sq_wq_destroy;
>  
> @@ -1053,13 +1046,13 @@ static int mlx5e_alloc_icosq(struct mlx5e_channel *c,
>  	sq->channel   = c;
>  	sq->uar_map   = mdev->mlx5e_res.bfreg.map;
>  
> -	param->wq.db_numa_node = mlx5e_get_node(c->priv, c->ix);
> +	param->wq.db_numa_node = cpu_to_node(c->cpu);
>  	err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, &sq->wq, &sq->wq_ctrl);
>  	if (err)
>  		return err;
>  	sq->wq.db = &sq->wq.db[MLX5_SND_DBR];
>  
> -	err = mlx5e_alloc_icosq_db(sq, mlx5e_get_node(c->priv, c->ix));
> +	err = mlx5e_alloc_icosq_db(sq, cpu_to_node(c->cpu));
>  	if (err)
>  		goto err_sq_wq_destroy;
>  
> @@ -1126,13 +1119,13 @@ static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
>  	if (MLX5_IPSEC_DEV(c->priv->mdev))
>  		set_bit(MLX5E_SQ_STATE_IPSEC, &sq->state);
>  
> -	param->wq.db_numa_node = mlx5e_get_node(c->priv, c->ix);
> +	param->wq.db_numa_node = cpu_to_node(c->cpu);
>  	err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, &sq->wq, &sq->wq_ctrl);
>  	if (err)
>  		return err;
>  	sq->wq.db    = &sq->wq.db[MLX5_SND_DBR];
>  
> -	err = mlx5e_alloc_txqsq_db(sq, mlx5e_get_node(c->priv, c->ix));
> +	err = mlx5e_alloc_txqsq_db(sq, cpu_to_node(c->cpu));
>  	if (err)
>  		goto err_sq_wq_destroy;
>  
> @@ -1504,8 +1497,8 @@ static int mlx5e_alloc_cq(struct mlx5e_channel *c,
>  	struct mlx5_core_dev *mdev = c->priv->mdev;
>  	int err;
>  
> -	param->wq.buf_numa_node = mlx5e_get_node(c->priv, c->ix);
> -	param->wq.db_numa_node  = mlx5e_get_node(c->priv, c->ix);
> +	param->wq.buf_numa_node = cpu_to_node(c->cpu);
> +	param->wq.db_numa_node  = cpu_to_node(c->cpu);
>  	param->eq_ix   = c->ix;
>  
>  	err = mlx5e_alloc_cq_common(mdev, param, cq);
> @@ -1604,6 +1597,11 @@ static void mlx5e_close_cq(struct mlx5e_cq *cq)
>  	mlx5e_free_cq(cq);
>  }
>  
> +static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
> +{
> +	return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
> +}
> +
>  static int mlx5e_open_tx_cqs(struct mlx5e_channel *c,
>  			     struct mlx5e_params *params,
>  			     struct mlx5e_channel_param *cparam)
> @@ -1752,12 +1750,13 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
>  {
>  	struct mlx5e_cq_moder icocq_moder = {0, 0};
>  	struct net_device *netdev = priv->netdev;
> +	int cpu = mlx5e_get_cpu(priv, ix);
>  	struct mlx5e_channel *c;
>  	unsigned int irq;
>  	int err;
>  	int eqn;
>  
> -	c = kzalloc_node(sizeof(*c), GFP_KERNEL, mlx5e_get_node(priv, ix));
> +	c = kzalloc_node(sizeof(*c), GFP_KERNEL, cpu_to_node(cpu));
>  	if (!c)
>  		return -ENOMEM;
>  
> @@ -1765,6 +1764,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
>  	c->mdev     = priv->mdev;
>  	c->tstamp   = &priv->tstamp;
>  	c->ix       = ix;
> +	c->cpu      = cpu;
>  	c->pdev     = &priv->mdev->pdev->dev;
>  	c->netdev   = priv->netdev;
>  	c->mkey_be  = cpu_to_be32(priv->mdev->mlx5e_res.mkey.key);
> @@ -1853,8 +1853,7 @@ static void mlx5e_activate_channel(struct mlx5e_channel *c)
>  	for (tc = 0; tc < c->num_tc; tc++)
>  		mlx5e_activate_txqsq(&c->sq[tc]);
>  	mlx5e_activate_rq(&c->rq);
> -	netif_set_xps_queue(c->netdev,
> -		mlx5_get_vector_affinity(c->priv->mdev, c->ix), c->ix);
> +	netif_set_xps_queue(c->netdev, get_cpu_mask(c->cpu), c->ix);
>  }
>  
>  static void mlx5e_deactivate_channel(struct mlx5e_channel *c)
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 5f323442cc5a..8a89c7e8cd63 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -317,9 +317,6 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
>  {
>  	struct mlx5_priv *priv = &dev->priv;
>  	struct mlx5_eq_table *table = &priv->eq_table;
> -	struct irq_affinity irqdesc = {
> -		.pre_vectors = MLX5_EQ_VEC_COMP_BASE,
> -	};
>  	int num_eqs = 1 << MLX5_CAP_GEN(dev, log_max_eq);
>  	int nvec;
>  
> @@ -333,10 +330,9 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
>  	if (!priv->irq_info)
>  		goto err_free_msix;
>  
> -	nvec = pci_alloc_irq_vectors_affinity(dev->pdev,
> +	nvec = pci_alloc_irq_vectors(dev->pdev,
>  			MLX5_EQ_VEC_COMP_BASE + 1, nvec,
> -			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
> -			&irqdesc);
> +			PCI_IRQ_MSIX);
>  	if (nvec < 0)
>  		return nvec;
>  
> @@ -622,6 +618,63 @@ u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev)
>  	return (u64)timer_l | (u64)timer_h1 << 32;
>  }
>  
> +static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
> +{
> +	struct mlx5_priv *priv  = &mdev->priv;
> +	int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
> +
> +	if (!zalloc_cpumask_var(&priv->irq_info[i].mask, GFP_KERNEL)) {
> +		mlx5_core_warn(mdev, "zalloc_cpumask_var failed");
> +		return -ENOMEM;
> +	}
> +
> +	cpumask_set_cpu(cpumask_local_spread(i, priv->numa_node),
> +			priv->irq_info[i].mask);
> +
> +	if (IS_ENABLED(CONFIG_SMP) &&
> +	    irq_set_affinity_hint(irq, priv->irq_info[i].mask))
> +		mlx5_core_warn(mdev, "irq_set_affinity_hint failed, irq 0x%.4x", irq);
> +
> +	return 0;
> +}
> +
> +static void mlx5_irq_clear_affinity_hint(struct mlx5_core_dev *mdev, int i)
> +{
> +	struct mlx5_priv *priv  = &mdev->priv;
> +	int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
> +
> +	irq_set_affinity_hint(irq, NULL);
> +	free_cpumask_var(priv->irq_info[i].mask);
> +}
> +
> +static int mlx5_irq_set_affinity_hints(struct mlx5_core_dev *mdev)
> +{
> +	int err;
> +	int i;
> +
> +	for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++) {
> +		err = mlx5_irq_set_affinity_hint(mdev, i);
> +		if (err)
> +			goto err_out;
> +	}
> +
> +	return 0;
> +
> +err_out:
> +	for (i--; i >= 0; i--)
> +		mlx5_irq_clear_affinity_hint(mdev, i);
> +
> +	return err;
> +}
> +
> +static void mlx5_irq_clear_affinity_hints(struct mlx5_core_dev *mdev)
> +{
> +	int i;
> +
> +	for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++)
> +		mlx5_irq_clear_affinity_hint(mdev, i);
> +}
> +
>  int mlx5_vector2eqn(struct mlx5_core_dev *dev, int vector, int *eqn,
>  		    unsigned int *irqn)
>  {
> @@ -1097,6 +1150,12 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
>  		goto err_stop_eqs;
>  	}
>  
> +	err = mlx5_irq_set_affinity_hints(dev);
> +	if (err) {
> +		dev_err(&pdev->dev, "Failed to alloc affinity hint cpumask\n");
> +		goto err_affinity_hints;
> +	}
> +
>  	err = mlx5_init_fs(dev);
>  	if (err) {
>  		dev_err(&pdev->dev, "Failed to init flow steering\n");
> @@ -1154,6 +1213,9 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
>  	mlx5_cleanup_fs(dev);
>  
>  err_fs:
> +	mlx5_irq_clear_affinity_hints(dev);
> +
> +err_affinity_hints:
>  	free_comp_eqs(dev);
>  
>  err_stop_eqs:
> @@ -1222,6 +1284,7 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
>  
>  	mlx5_sriov_detach(dev);
>  	mlx5_cleanup_fs(dev);
> +	mlx5_irq_clear_affinity_hints(dev);
>  	free_comp_eqs(dev);
>  	mlx5_stop_eqs(dev);
>  	mlx5_put_uars_page(dev, priv->uar);
> diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
> index a886b51511ab..40a6f33c4cde 100644
> --- a/include/linux/mlx5/driver.h
> +++ b/include/linux/mlx5/driver.h
> @@ -556,6 +556,7 @@ struct mlx5_core_sriov {
>  };
>  
>  struct mlx5_irq_info {
> +	cpumask_var_t mask;
>  	char name[MLX5_MAX_IRQ_NAME];
>  };
>  
> 

^ permalink raw reply

* Re: [pull request][for-next 00/11] Mellanox, mlx5 E-Switch updates 2017-12-19
From: Saeed Mahameed @ 2017-12-19 22:39 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Saeed Mahameed, David S. Miller, Doug Ledford, Linux Netdev List,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Leon Romanovsky
In-Reply-To: <20171219215414.GH6122-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

On Tue, Dec 19, 2017 at 1:54 PM, Marcelo Ricardo Leitner
<marcelo.leitner-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Dec 19, 2017 at 12:33:29PM -0800, Saeed Mahameed wrote:
>> Hi Dave and Doug,
>>
>> ==============
>> This series includes updates for mlx5 E-Switch infrastructures,
>> to be merged into net-next and rdma-next trees.
>>
>> Mark's patches provide E-Switch refactoring that generalize the mlx5
>> E-Switch vf representors interfaces and data structures. The serious is
>> mainly focused on moving ethernet (netdev) specific representors logic out
>> of E-Switch (eswitch.c) into mlx5e representor module (en_rep.c), which
>> provides better separation and allows future support for other types of vf
>> representors (e.g. RDMA).
>>
>> Gal's patches at the end of this serious, provide a simple syntax fix and
>> two other patches that handles vport ingress/egress ACL steering name
>> spaces to be aligned with the Firmware/Hardware specs.
>
> Patch 10 actually looks quite worrysome if a card would support only
> one ingress or egress, but if all of them support both, then it should
> be fine yes. Is that possible, to support only one direction?
>

Hi Marcelo,

Patch 10 is a function renaming patch that fixes a function name to
correspond with its behavior!

Are you asking about all mlx5 cards or all cards in general ?
All patches in this series, and specifically patches 10,11 only
concern mlx5 eswitch specific implementation and mlx5 hardware spec,
they have nothing to do with other vendors cards.

All mlx5 cards do support both ingress and egress ACLs per vport (VF),
each has its own namespaces (separate pipeline), hence patch #11.

So i don't see why you worry so much :).

>   Marcelo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2017-12-19 22:37 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, mst, netdev, virtualization, alexander.duyck,
	jesse.brandeburg
In-Reply-To: <20171219114627.76dac732@xeon-e3>

On 12/19/2017 11:46 AM, Stephen Hemminger wrote:
> On Tue, 19 Dec 2017 11:42:33 -0800
> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
>
>> On 12/19/2017 10:41 AM, Stephen Hemminger wrote:
>>> On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
>>> David Miller <davem@davemloft.net> wrote:
>>>   
>>>> From: Stephen Hemminger <stephen@networkplumber.org>
>>>> Date: Tue, 19 Dec 2017 09:55:48 -0800
>>>>   
>>>>> could be 10ms, just enough to let udev do its renaming
>>>> Please, move to some kind of notification or event based handling of
>>>> this problem.
>>>>
>>>> No delay is safe, what if userspace gets swapped out or whatever
>>>> else might make userspace stall unexpectedly?
>>>>   
>>> The plan is to remove the delay and do the naming in the kernel.
>>> This was suggested by Lennart since udev is only doing naming policy
>>> because kernel names were not repeatable.
>>>
>>> This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
>>>
>>> Patch is pending.
>> Do we really need to delay the setup until the name is changed?
>> Can't we call dev_set_mtu() and dev_open() until dev_change_name() is done?
>>
>> Thanks
>> Sridhar
> You can call dev_set_mtu, but when dev_open is done the device name
> can not be changed by userspace.
I did a quick test to remove the delay and also the dev_open() call and 
i don't see
any issues with virtio taking over the VF datapath.
Only the netdev_info() messages may show old device name.

Any specific scenario where we need to explicitly call  VF's dev_open() 
in the VF setup process?
I tried i40evf driver loaded after virtio_net  AND  virtio_net loading 
after i40evf.

Thanks
Sridhar

^ permalink raw reply

* Re: [PATCH net V2] net: always reevalulate autoflowlabel setting for reset packet
From: Eric Dumazet @ 2017-12-19 22:33 UTC (permalink / raw)
  To: Shaohua Li, netdev, davem; +Cc: Kernel Team, Shaohua Li, Martin KaFai Lau
In-Reply-To: <b44762a8392137c6008e2f7490cbd0853f0015ac.1513716787.git.shli@fb.com>

On Tue, 2017-12-19 at 12:58 -0800, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> ipv6_pinfo.autoflowlabel is set in sock creation. Later if we change
> sysctl.ip6.auto_flowlabels, the ipv6_pinfo.autoflowlabel isn't changed,
> so the sock will keep the old behavior in terms of auto flowlabel. Reset
> packet is suffering from this problem, because reset packset is sent
> from a special control socket, which is created at boot time. Since
> sysctl.ipv6.auto_flowlabels is 2 by default, the control socket will
> always have its ipv6_pinfo.autoflowlabel set, even after user set
> sysctl.ipv6.auto_flowlabels to 1, so reset packset will always have
> flowlabel.
> 
> To fix this, we always reevaluate autoflowlabel setting for reset
> packet. Normal sock has the same issue too, but since the
> sysctl.ipv6.auto_flowlabels is usually set at host startup, this isn't a
> big issue for normal sock.
> 
> Cc: Martin KaFai Lau <kafai@fb.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  net/ipv6/tcp_ipv6.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 7178476..5fba548 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -787,9 +787,11 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
>  	struct net *net = sk ? sock_net(sk) : dev_net(skb_dst(skb)->dev);
>  	struct sock *ctl_sk = net->ipv6.tcp_sk;
>  	unsigned int tot_len = sizeof(struct tcphdr);
> +	struct ipv6_pinfo *np = inet6_sk(ctl_sk);
>  	struct dst_entry *dst;
>  	__be32 *topt;
>  
> +	np->autoflowlabel = ip6_default_np_autolabel(net);


This looks unsafe to set a bitfield on a shared socket (one ctl_sk per
netns)

Compiler could play strange things here.

^ permalink raw reply

* [net 12/14] net/mlx5: Fix steering memory leak
From: Saeed Mahameed @ 2017-12-19 22:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Maor Gottlieb, Saeed Mahameed
In-Reply-To: <20171219222456.29627-1-saeedm@mellanox.com>

From: Maor Gottlieb <maorg@mellanox.com>

Flow steering priority and namespace are software only objects that
didn't have the proper destructors and were not freed during steering
cleanup.

Fix it by adding destructor functions for these objects.

Fixes: bd71b08ec2ee ("net/mlx5: Support multiple updates of steering rules in parallel")
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index c70fd663a633..dfaad9ecb2b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -174,6 +174,8 @@ static void del_hw_fte(struct fs_node *node);
 static void del_sw_flow_table(struct fs_node *node);
 static void del_sw_flow_group(struct fs_node *node);
 static void del_sw_fte(struct fs_node *node);
+static void del_sw_prio(struct fs_node *node);
+static void del_sw_ns(struct fs_node *node);
 /* Delete rule (destination) is special case that 
  * requires to lock the FTE for all the deletion process.
  */
@@ -408,6 +410,16 @@ static inline struct mlx5_core_dev *get_dev(struct fs_node *node)
 	return NULL;
 }
 
+static void del_sw_ns(struct fs_node *node)
+{
+	kfree(node);
+}
+
+static void del_sw_prio(struct fs_node *node)
+{
+	kfree(node);
+}
+
 static void del_hw_flow_table(struct fs_node *node)
 {
 	struct mlx5_flow_table *ft;
@@ -2064,7 +2076,7 @@ static struct fs_prio *fs_create_prio(struct mlx5_flow_namespace *ns,
 		return ERR_PTR(-ENOMEM);
 
 	fs_prio->node.type = FS_TYPE_PRIO;
-	tree_init_node(&fs_prio->node, NULL, NULL);
+	tree_init_node(&fs_prio->node, NULL, del_sw_prio);
 	tree_add_node(&fs_prio->node, &ns->node);
 	fs_prio->num_levels = num_levels;
 	fs_prio->prio = prio;
@@ -2090,7 +2102,7 @@ static struct mlx5_flow_namespace *fs_create_namespace(struct fs_prio *prio)
 		return ERR_PTR(-ENOMEM);
 
 	fs_init_namespace(ns);
-	tree_init_node(&ns->node, NULL, NULL);
+	tree_init_node(&ns->node, NULL, del_sw_ns);
 	tree_add_node(&ns->node, &prio->node);
 	list_add_tail(&ns->node.list, &prio->node.children);
 
-- 
2.13.0

^ permalink raw reply related

* [net 11/14] net/mlx5e: Prevent possible races in VXLAN control flow
From: Saeed Mahameed @ 2017-12-19 22:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Gal Pressman, Saeed Mahameed
In-Reply-To: <20171219222456.29627-1-saeedm@mellanox.com>

From: Gal Pressman <galp@mellanox.com>

When calling add/remove VXLAN port, a lock must be held in order to
prevent race scenarios when more than one add/remove happens at the
same time.
Fix by holding our state_lock (mutex) as done by all other parts of the
driver.
Note that the spinlock protecting the radix-tree is still needed in
order to synchronize radix-tree access from softirq context.

Fixes: b3f63c3d5e2c ("net/mlx5e: Add netdev support for VXLAN tunneling")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/vxlan.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c b/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c
index 25f782344667..2f74953e4561 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c
@@ -88,6 +88,7 @@ static void mlx5e_vxlan_add_port(struct work_struct *work)
 	struct mlx5e_vxlan *vxlan;
 	int err;
 
+	mutex_lock(&priv->state_lock);
 	vxlan = mlx5e_vxlan_lookup_port(priv, port);
 	if (vxlan) {
 		atomic_inc(&vxlan->refcount);
@@ -117,6 +118,7 @@ static void mlx5e_vxlan_add_port(struct work_struct *work)
 err_delete_port:
 	mlx5e_vxlan_core_del_port_cmd(priv->mdev, port);
 free_work:
+	mutex_unlock(&priv->state_lock);
 	kfree(vxlan_work);
 }
 
@@ -130,6 +132,7 @@ static void mlx5e_vxlan_del_port(struct work_struct *work)
 	struct mlx5e_vxlan *vxlan;
 	bool remove = false;
 
+	mutex_lock(&priv->state_lock);
 	spin_lock_bh(&vxlan_db->lock);
 	vxlan = radix_tree_lookup(&vxlan_db->tree, port);
 	if (!vxlan)
@@ -147,6 +150,7 @@ static void mlx5e_vxlan_del_port(struct work_struct *work)
 		mlx5e_vxlan_core_del_port_cmd(priv->mdev, port);
 		kfree(vxlan);
 	}
+	mutex_unlock(&priv->state_lock);
 	kfree(vxlan_work);
 }
 
-- 
2.13.0

^ permalink raw reply related

* [net 10/14] net/mlx5e: Add refcount to VXLAN structure
From: Saeed Mahameed @ 2017-12-19 22:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Gal Pressman, Saeed Mahameed
In-Reply-To: <20171219222456.29627-1-saeedm@mellanox.com>

From: Gal Pressman <galp@mellanox.com>

A refcount mechanism must be implemented in order to prevent unwanted
scenarios such as:
- Open an IPv4 VXLAN interface
- Open an IPv6 VXLAN interface (different socket)
- Remove one of the interfaces

With current implementation, the UDP port will be removed from our VXLAN
database and turn off the offloads for the other interface, which is
still active.
The reference count mechanism will only allow UDP port removals once all
consumers are gone.

Fixes: b3f63c3d5e2c ("net/mlx5e: Add netdev support for VXLAN tunneling")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/vxlan.c | 50 +++++++++++++------------
 drivers/net/ethernet/mellanox/mlx5/core/vxlan.h |  1 +
 2 files changed, 28 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c b/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c
index f8238275759f..25f782344667 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c
@@ -88,8 +88,11 @@ static void mlx5e_vxlan_add_port(struct work_struct *work)
 	struct mlx5e_vxlan *vxlan;
 	int err;
 
-	if (mlx5e_vxlan_lookup_port(priv, port))
+	vxlan = mlx5e_vxlan_lookup_port(priv, port);
+	if (vxlan) {
+		atomic_inc(&vxlan->refcount);
 		goto free_work;
+	}
 
 	if (mlx5e_vxlan_core_add_port_cmd(priv->mdev, port))
 		goto free_work;
@@ -99,6 +102,7 @@ static void mlx5e_vxlan_add_port(struct work_struct *work)
 		goto err_delete_port;
 
 	vxlan->udp_port = port;
+	atomic_set(&vxlan->refcount, 1);
 
 	spin_lock_bh(&vxlan_db->lock);
 	err = radix_tree_insert(&vxlan_db->tree, vxlan->udp_port, vxlan);
@@ -116,32 +120,33 @@ static void mlx5e_vxlan_add_port(struct work_struct *work)
 	kfree(vxlan_work);
 }
 
-static void __mlx5e_vxlan_core_del_port(struct mlx5e_priv *priv, u16 port)
+static void mlx5e_vxlan_del_port(struct work_struct *work)
 {
+	struct mlx5e_vxlan_work *vxlan_work =
+		container_of(work, struct mlx5e_vxlan_work, work);
+	struct mlx5e_priv *priv         = vxlan_work->priv;
 	struct mlx5e_vxlan_db *vxlan_db = &priv->vxlan;
+	u16 port = vxlan_work->port;
 	struct mlx5e_vxlan *vxlan;
+	bool remove = false;
 
 	spin_lock_bh(&vxlan_db->lock);
-	vxlan = radix_tree_delete(&vxlan_db->tree, port);
-	spin_unlock_bh(&vxlan_db->lock);
-
+	vxlan = radix_tree_lookup(&vxlan_db->tree, port);
 	if (!vxlan)
-		return;
-
-	mlx5e_vxlan_core_del_port_cmd(priv->mdev, vxlan->udp_port);
-
-	kfree(vxlan);
-}
+		goto out_unlock;
 
-static void mlx5e_vxlan_del_port(struct work_struct *work)
-{
-	struct mlx5e_vxlan_work *vxlan_work =
-		container_of(work, struct mlx5e_vxlan_work, work);
-	struct mlx5e_priv *priv = vxlan_work->priv;
-	u16 port = vxlan_work->port;
+	if (atomic_dec_and_test(&vxlan->refcount)) {
+		radix_tree_delete(&vxlan_db->tree, port);
+		remove = true;
+	}
 
-	__mlx5e_vxlan_core_del_port(priv, port);
+out_unlock:
+	spin_unlock_bh(&vxlan_db->lock);
 
+	if (remove) {
+		mlx5e_vxlan_core_del_port_cmd(priv->mdev, port);
+		kfree(vxlan);
+	}
 	kfree(vxlan_work);
 }
 
@@ -171,12 +176,11 @@ void mlx5e_vxlan_cleanup(struct mlx5e_priv *priv)
 	struct mlx5e_vxlan *vxlan;
 	unsigned int port = 0;
 
-	spin_lock_bh(&vxlan_db->lock);
+	/* Lockless since we are the only radix-tree consumers, wq is disabled */
 	while (radix_tree_gang_lookup(&vxlan_db->tree, (void **)&vxlan, port, 1)) {
 		port = vxlan->udp_port;
-		spin_unlock_bh(&vxlan_db->lock);
-		__mlx5e_vxlan_core_del_port(priv, (u16)port);
-		spin_lock_bh(&vxlan_db->lock);
+		radix_tree_delete(&vxlan_db->tree, port);
+		mlx5e_vxlan_core_del_port_cmd(priv->mdev, port);
+		kfree(vxlan);
 	}
-	spin_unlock_bh(&vxlan_db->lock);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vxlan.h b/drivers/net/ethernet/mellanox/mlx5/core/vxlan.h
index 5def12c048e3..5ef6ae7d568a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vxlan.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vxlan.h
@@ -36,6 +36,7 @@
 #include "en.h"
 
 struct mlx5e_vxlan {
+	atomic_t refcount;
 	u16 udp_port;
 };
 
-- 
2.13.0

^ permalink raw reply related

* [net 13/14] net/mlx5: Cleanup IRQs in case of unload failure
From: Saeed Mahameed @ 2017-12-19 22:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Moshe Shemesh, Saeed Mahameed
In-Reply-To: <20171219222456.29627-1-saeedm@mellanox.com>

From: Moshe Shemesh <moshe@mellanox.com>

When mlx5_stop_eqs fails to destroy any of the eqs it returns with an error.
In such failure flow the function will return without
releasing all EQs irqs and then pci_free_irq_vectors will fail.
Fix by only warn on destroy EQ failure and continue to release other
EQs and their irqs.

It fixes the following kernel trace:
kernel: kernel BUG at drivers/pci/msi.c:352!
...
...
kernel: Call Trace:
kernel: pci_disable_msix+0xd3/0x100
kernel: pci_free_irq_vectors+0xe/0x20
kernel: mlx5_load_one.isra.17+0x9f5/0xec0 [mlx5_core]

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 20 +++++++++++++-------
 include/linux/mlx5/driver.h                  |  2 +-
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 0308a2b4823c..ab4d1465b7e4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -775,7 +775,7 @@ int mlx5_start_eqs(struct mlx5_core_dev *dev)
 	return err;
 }
 
-int mlx5_stop_eqs(struct mlx5_core_dev *dev)
+void mlx5_stop_eqs(struct mlx5_core_dev *dev)
 {
 	struct mlx5_eq_table *table = &dev->priv.eq_table;
 	int err;
@@ -784,22 +784,28 @@ int mlx5_stop_eqs(struct mlx5_core_dev *dev)
 	if (MLX5_CAP_GEN(dev, pg)) {
 		err = mlx5_destroy_unmap_eq(dev, &table->pfault_eq);
 		if (err)
-			return err;
+			mlx5_core_err(dev, "failed to destroy page fault eq, err(%d)\n",
+				      err);
 	}
 #endif
 
 	err = mlx5_destroy_unmap_eq(dev, &table->pages_eq);
 	if (err)
-		return err;
+		mlx5_core_err(dev, "failed to destroy pages eq, err(%d)\n",
+			      err);
 
-	mlx5_destroy_unmap_eq(dev, &table->async_eq);
+	err = mlx5_destroy_unmap_eq(dev, &table->async_eq);
+	if (err)
+		mlx5_core_err(dev, "failed to destroy async eq, err(%d)\n",
+			      err);
 	mlx5_cmd_use_polling(dev);
 
 	err = mlx5_destroy_unmap_eq(dev, &table->cmd_eq);
-	if (err)
+	if (err) {
+		mlx5_core_err(dev, "failed to destroy command eq, err(%d)\n",
+			      err);
 		mlx5_cmd_use_events(dev);
-
-	return err;
+	}
 }
 
 int mlx5_core_eq_query(struct mlx5_core_dev *dev, struct mlx5_eq *eq,
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 40a6f33c4cde..57b109c6e422 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1049,7 +1049,7 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
 		       enum mlx5_eq_type type);
 int mlx5_destroy_unmap_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq);
 int mlx5_start_eqs(struct mlx5_core_dev *dev);
-int mlx5_stop_eqs(struct mlx5_core_dev *dev);
+void mlx5_stop_eqs(struct mlx5_core_dev *dev);
 int mlx5_vector2eqn(struct mlx5_core_dev *dev, int vector, int *eqn,
 		    unsigned int *irqn);
 int mlx5_core_attach_mcg(struct mlx5_core_dev *dev, union ib_gid *mgid, u32 qpn);
-- 
2.13.0

^ permalink raw reply related

* [net 14/14] net/mlx5: Stay in polling mode when command EQ destroy fails
From: Saeed Mahameed @ 2017-12-19 22:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Moshe Shemesh, Saeed Mahameed
In-Reply-To: <20171219222456.29627-1-saeedm@mellanox.com>

From: Moshe Shemesh <moshe@mellanox.com>

During unload, on mlx5_stop_eqs we move command interface from events
mode to polling mode, but if command interface EQ destroy fail we move
back to events mode.
That's wrong since even if we fail to destroy command interface EQ, we
do release its irq, so no interrupts will be received.

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index ab4d1465b7e4..e7e7cef2bde4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -801,11 +801,9 @@ void mlx5_stop_eqs(struct mlx5_core_dev *dev)
 	mlx5_cmd_use_polling(dev);
 
 	err = mlx5_destroy_unmap_eq(dev, &table->cmd_eq);
-	if (err) {
+	if (err)
 		mlx5_core_err(dev, "failed to destroy command eq, err(%d)\n",
 			      err);
-		mlx5_cmd_use_events(dev);
-	}
 }
 
 int mlx5_core_eq_query(struct mlx5_core_dev *dev, struct mlx5_eq *eq,
-- 
2.13.0

^ permalink raw reply related

* [net 09/14] net/mlx5e: Fix possible deadlock of VXLAN lock
From: Saeed Mahameed @ 2017-12-19 22:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Gal Pressman, Saeed Mahameed
In-Reply-To: <20171219222456.29627-1-saeedm@mellanox.com>

From: Gal Pressman <galp@mellanox.com>

mlx5e_vxlan_lookup_port is called both from mlx5e_add_vxlan_port (user
context) and mlx5e_features_check (softirq), but the lock acquired does
not disable bottom half and might result in deadlock. Fix it by simply
replacing spin_lock() with spin_lock_bh().
While at it, replace all unnecessary spin_lock_irq() to spin_lock_bh().

lockdep's WARNING: inconsistent lock state
[  654.028136] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  654.028229] swapper/5/0 [HC0[0]:SC1[9]:HE1:SE0] takes:
[  654.028321]  (&(&vxlan_db->lock)->rlock){+.?.}, at: [<ffffffffa06e7f0e>] mlx5e_vxlan_lookup_port+0x1e/0x50 [mlx5_core]
[  654.028528] {SOFTIRQ-ON-W} state was registered at:
[  654.028607]   _raw_spin_lock+0x3c/0x70
[  654.028689]   mlx5e_vxlan_lookup_port+0x1e/0x50 [mlx5_core]
[  654.028794]   mlx5e_vxlan_add_port+0x2e/0x120 [mlx5_core]
[  654.028878]   process_one_work+0x1e9/0x640
[  654.028942]   worker_thread+0x4a/0x3f0
[  654.029002]   kthread+0x141/0x180
[  654.029056]   ret_from_fork+0x24/0x30
[  654.029114] irq event stamp: 579088
[  654.029174] hardirqs last  enabled at (579088): [<ffffffff818f475a>] ip6_finish_output2+0x49a/0x8c0
[  654.029309] hardirqs last disabled at (579087): [<ffffffff818f470e>] ip6_finish_output2+0x44e/0x8c0
[  654.029446] softirqs last  enabled at (579030): [<ffffffff810b3b3d>] irq_enter+0x6d/0x80
[  654.029567] softirqs last disabled at (579031): [<ffffffff810b3c05>] irq_exit+0xb5/0xc0
[  654.029684] other info that might help us debug this:
[  654.029781]  Possible unsafe locking scenario:

[  654.029868]        CPU0
[  654.029908]        ----
[  654.029947]   lock(&(&vxlan_db->lock)->rlock);
[  654.030045]   <Interrupt>
[  654.030090]     lock(&(&vxlan_db->lock)->rlock);
[  654.030162]
 *** DEADLOCK ***

Fixes: b3f63c3d5e2c ("net/mlx5e: Add netdev support for VXLAN tunneling")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/vxlan.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c b/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c
index 07a9ba6cfc70..f8238275759f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vxlan.c
@@ -71,9 +71,9 @@ struct mlx5e_vxlan *mlx5e_vxlan_lookup_port(struct mlx5e_priv *priv, u16 port)
 	struct mlx5e_vxlan_db *vxlan_db = &priv->vxlan;
 	struct mlx5e_vxlan *vxlan;
 
-	spin_lock(&vxlan_db->lock);
+	spin_lock_bh(&vxlan_db->lock);
 	vxlan = radix_tree_lookup(&vxlan_db->tree, port);
-	spin_unlock(&vxlan_db->lock);
+	spin_unlock_bh(&vxlan_db->lock);
 
 	return vxlan;
 }
@@ -100,9 +100,9 @@ static void mlx5e_vxlan_add_port(struct work_struct *work)
 
 	vxlan->udp_port = port;
 
-	spin_lock_irq(&vxlan_db->lock);
+	spin_lock_bh(&vxlan_db->lock);
 	err = radix_tree_insert(&vxlan_db->tree, vxlan->udp_port, vxlan);
-	spin_unlock_irq(&vxlan_db->lock);
+	spin_unlock_bh(&vxlan_db->lock);
 	if (err)
 		goto err_free;
 
@@ -121,9 +121,9 @@ static void __mlx5e_vxlan_core_del_port(struct mlx5e_priv *priv, u16 port)
 	struct mlx5e_vxlan_db *vxlan_db = &priv->vxlan;
 	struct mlx5e_vxlan *vxlan;
 
-	spin_lock_irq(&vxlan_db->lock);
+	spin_lock_bh(&vxlan_db->lock);
 	vxlan = radix_tree_delete(&vxlan_db->tree, port);
-	spin_unlock_irq(&vxlan_db->lock);
+	spin_unlock_bh(&vxlan_db->lock);
 
 	if (!vxlan)
 		return;
@@ -171,12 +171,12 @@ void mlx5e_vxlan_cleanup(struct mlx5e_priv *priv)
 	struct mlx5e_vxlan *vxlan;
 	unsigned int port = 0;
 
-	spin_lock_irq(&vxlan_db->lock);
+	spin_lock_bh(&vxlan_db->lock);
 	while (radix_tree_gang_lookup(&vxlan_db->tree, (void **)&vxlan, port, 1)) {
 		port = vxlan->udp_port;
-		spin_unlock_irq(&vxlan_db->lock);
+		spin_unlock_bh(&vxlan_db->lock);
 		__mlx5e_vxlan_core_del_port(priv, (u16)port);
-		spin_lock_irq(&vxlan_db->lock);
+		spin_lock_bh(&vxlan_db->lock);
 	}
-	spin_unlock_irq(&vxlan_db->lock);
+	spin_unlock_bh(&vxlan_db->lock);
 }
-- 
2.13.0

^ permalink raw reply related

* [net 08/14] net/mlx5: Fix error flow in CREATE_QP command
From: Saeed Mahameed @ 2017-12-19 22:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Moni Shoua, Saeed Mahameed
In-Reply-To: <20171219222456.29627-1-saeedm@mellanox.com>

From: Moni Shoua <monis@mellanox.com>

In error flow, when DESTROY_QP command should be executed, the wrong
mailbox was set with data, not the one that is written to hardware,
Fix that.

Fixes: 09a7d9eca1a6 '{net,IB}/mlx5: QP/XRCD commands via mlx5 ifc'
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index db9e665ab104..889130edb715 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -213,8 +213,8 @@ int mlx5_core_create_qp(struct mlx5_core_dev *dev,
 err_cmd:
 	memset(din, 0, sizeof(din));
 	memset(dout, 0, sizeof(dout));
-	MLX5_SET(destroy_qp_in, in, opcode, MLX5_CMD_OP_DESTROY_QP);
-	MLX5_SET(destroy_qp_in, in, qpn, qp->qpn);
+	MLX5_SET(destroy_qp_in, din, opcode, MLX5_CMD_OP_DESTROY_QP);
+	MLX5_SET(destroy_qp_in, din, qpn, qp->qpn);
 	mlx5_cmd_exec(dev, din, sizeof(din), dout, sizeof(dout));
 	return err;
 }
-- 
2.13.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox