* [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
@ 2015-05-21 16:24 Or Gerlitz
[not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 31+ messages in thread
From: Or Gerlitz @ 2015-05-21 16:24 UTC (permalink / raw)
To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai, Or Gerlitz
Standard configuration of SRIOV VFs through the host is done over the
following chain of calls: libvirt --> netlink --> PF netdevice -- where
the PF netdevice exports the ndo_set_vf_ calls.
When this comes to IB/IPoIB we should normalize this into the verbs
framework so we further go: PF IPoIB --> verbs API --> PF HW driver
Virtualization systems assign VMs with 48 bits mac, to allow working
with non-modified SW layers (open-stack, libvirt, etc), we can safely
extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac
entry calls the set_vf_guid verb.
One thing to clean for being beyond RFC is to make the get_vf_config
verb return guid and have IPoIB to make it back a mac.
Here's how it looks when using the ip tool (libvirt runs the same
netlink to set it out) and later reflected when the VF read their port.
# ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff
# ibstat -d mlx4_2
CA 'mlx4_2'
CA type: MT4100
Number of ports: 1
Firmware version: 2.34.1260
Hardware version: 0
Node GUID: 0x00140500f30e84c4
System image GUID: 0xf452140300117423
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 7
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0xffeeddfeffccbbaa
Link layer: InfiniBand
Or Gerlitz (3):
IB/IPoIB: Support SRIOV standard configuration
IB/mlx4: Refactor Alias GUID storing
IB/mlx4: Add support for SRIOV VF management
drivers/infiniband/hw/mlx4/main.c | 26 ++++++++++++++
drivers/infiniband/hw/mlx4/mlx4_ib.h | 4 ++
drivers/infiniband/hw/mlx4/sysfs.c | 54 ++++++++++++++++++----------
drivers/infiniband/ulp/ipoib/ipoib_main.c | 39 +++++++++++++++++++++
drivers/net/ethernet/mellanox/mlx4/cmd.c | 26 +++++++++----
include/linux/mlx4/device.h | 2 +
include/rdma/ib_verbs.h | 4 ++
7 files changed, 128 insertions(+), 27 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 31+ messages in thread[parent not found: <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration [not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-21 16:24 ` Or Gerlitz [not found] ` <1432225447-6536-2-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2015-05-21 16:24 ` [PATCH RFC 2/3] IB/mlx4: Refactor Alias GUID storing Or Gerlitz ` (2 subsequent siblings) 3 siblings, 1 reply; 31+ messages in thread From: Or Gerlitz @ 2015-05-21 16:24 UTC (permalink / raw) To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai, Or Gerlitz Standard configuration of SRIOV VFs through the host is done over the following chain of calls: libvirt --> netlink --> PF netdevice When this comes to IB/IPoIB we should normalize this into the verbs framework so we further go: PF IPoIB --> verbs API --> PF HW driver Virtualization systems assign VMs 48 bits mac, to allow working with non-modified SW layers (open-stack, libvirt, etc), we can safely extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac entry calls the set_vf_guid verb. Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 39 +++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 4 +++ 2 files changed, 43 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 9e1b203..8f82870 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1357,6 +1357,43 @@ void ipoib_dev_cleanup(struct net_device *dev) priv->tx_ring = NULL; } +static int ipoib_get_vf_config(struct net_device *dev, int vf, struct ifla_vf_info *ivf) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->ca->get_vf_config) + return priv->ca->get_vf_config(priv->ca, priv->port, vf, ivf); + else + return -EINVAL; +} + +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac) +{ + char *raw_guid; + u64 guid = 0; + + struct ipoib_dev_priv *priv = netdev_priv(dev); + + raw_guid = (char *)&guid; + raw_guid[0] = mac[0]; + raw_guid[1] = mac[1]; + raw_guid[2] = mac[2]; + raw_guid[3] = 0xff; + raw_guid[4] = 0xfe; + raw_guid[5] = mac[3]; + raw_guid[6] = mac[4]; + raw_guid[7] = mac[5]; + + guid &= ~(cpu_to_be64(1ULL << 56)); + guid |= cpu_to_be64(1ULL << 57); + + if (priv->ca->set_vf_guid) + return priv->ca->set_vf_guid(priv->ca, priv->port, queue, guid); + else + return -EINVAL; +} + + static const struct header_ops ipoib_header_ops = { .create = ipoib_hard_header, }; @@ -1371,6 +1408,8 @@ static const struct net_device_ops ipoib_netdev_ops = { .ndo_tx_timeout = ipoib_timeout, .ndo_set_rx_mode = ipoib_set_mcast_list, .ndo_get_iflink = ipoib_get_iflink, + .ndo_get_vf_config = ipoib_get_vf_config, + .ndo_set_vf_mac = ipoib_set_vf_mac, }; void ipoib_setup(struct net_device *dev) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 65994a1..6589520 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -53,6 +53,7 @@ #include <linux/atomic.h> #include <linux/mmu_notifier.h> #include <asm/uaccess.h> +#include <linux/if_link.h> extern struct workqueue_struct *ib_wq; @@ -1653,6 +1654,9 @@ struct ib_device { int (*check_mr_status)(struct ib_mr *mr, u32 check_mask, struct ib_mr_status *mr_status); + int (*set_vf_guid) (struct ib_device *device, int port, int vf, u64 guid); + int (*get_vf_config)(struct ib_device *device, int port, int vf, struct ifla_vf_info *ivf); + struct ib_dma_mapping_ops *dma_ops; struct module *owner; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 31+ messages in thread
[parent not found: <1432225447-6536-2-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration [not found] ` <1432225447-6536-2-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-05-21 18:46 ` Jason Gunthorpe [not found] ` <20150521184613.GD6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-21 18:46 UTC (permalink / raw) To: Or Gerlitz; +Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai On Thu, May 21, 2015 at 07:24:05PM +0300, Or Gerlitz wrote: > Standard configuration of SRIOV VFs through the host is done over the > following chain of calls: libvirt --> netlink --> PF netdevice > > When this comes to IB/IPoIB we should normalize this into the verbs > framework so we further go: PF IPoIB --> verbs API --> PF HW driver > > Virtualization systems assign VMs 48 bits mac, to allow working with > non-modified SW layers (open-stack, libvirt, etc), we can safely > extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac > entry calls the set_vf_guid verb. > > Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > drivers/infiniband/ulp/ipoib/ipoib_main.c | 39 +++++++++++++++++++++++++++++ > include/rdma/ib_verbs.h | 4 +++ > 2 files changed, 43 insertions(+), 0 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 9e1b203..8f82870 100644 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -1357,6 +1357,43 @@ void ipoib_dev_cleanup(struct net_device *dev) > priv->tx_ring = NULL; > } > > +static int ipoib_get_vf_config(struct net_device *dev, int vf, struct ifla_vf_info *ivf) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + > + if (priv->ca->get_vf_config) > + return priv->ca->get_vf_config(priv->ca, priv->port, vf, ivf); > + else > + return -EINVAL; > +} > + > +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac) > +{ > + char *raw_guid; > + u64 guid = 0; This doesn't seem right at all. It makes no sense that a IPoIB interface with a 20 byte LLADDR would accept an 8 byte LLADDR only for 'ip link set vf mac' The definition of the netlink struct seems to confirm this: struct ifla_vf_mac { __u32 vf; __u8 mac[32]; /* MAX_ADDR_LEN */ }; If it was really just ever a mac it would really only be 6 bytes. [Honestly, this whole feature seems very inconistent with the rest of the design of ip net link, so who knows] If the ifla_vf_mac had been variable-sized (like every other address related attribute) then sure, auto detect the length and do the right thing. But with this API, I think you have no choice, 'ip set vf mac LLADDR' can only be the 20 byte address. If you really, really want this: Get someone from iproute or netdev to sign off that they really mean that 'ip set vf mac LLADDR' is always a 6 byte mac. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <20150521184613.GD6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration [not found] ` <20150521184613.GD6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-21 20:05 ` Or Gerlitz [not found] ` <CAJ3xEMgJvXjg3aFbTNEudj9QWMfP4==eBq0ccuhjuVJsv9mRmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-05-21 21:05 ` Doug Ledford 1 sibling, 1 reply; 31+ messages in thread From: Or Gerlitz @ 2015-05-21 20:05 UTC (permalink / raw) To: Jason Gunthorpe Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Thu, May 21, 2015, Jason Gunthorpe wrote: >> +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac) >> +{ >> + char *raw_guid; >> + u64 guid = 0; > > This doesn't seem right at all. > > It makes no sense that a IPoIB interface with a 20 byte LLADDR would > accept an 8 byte LLADDR only for 'ip link set vf mac' > > The definition of the netlink struct seems to confirm this: > > struct ifla_vf_mac { > __u32 vf; > __u8 mac[32]; /* MAX_ADDR_LEN */ > }; > > If it was really just ever a mac it would really only be 6 bytes. > [Honestly, this whole feature seems very inconistent with the rest of > the design of ip net link, so who knows] > > If the ifla_vf_mac had been variable-sized (like every other address > related attribute) then sure, auto detect the length and do the right > thing. > > But with this API, I think you have no choice, 'ip set vf mac LLADDR' > can only be the 20 byte address. You can't enforce 20 byte address on ipoib instance b/c the driver can't dictate the QPN of their UD QP. Also, you don't need to force that, b/c what the virtualization system want to provision relates to a VM ID which is their mac (--> guid). If the ifla_vf_mac had been variable-sized we could have the ipoib set_vf_mac implementation to check if user-space provided 48 bits (mac), 64bits (guid) or even 128bits (whole gid), but it doesn't and I would like to either use the lowest common denominator (48bits) or just take always 64bits which could have two zero bytes (e.g when libvirt calls into that api through netlink). -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <CAJ3xEMgJvXjg3aFbTNEudj9QWMfP4==eBq0ccuhjuVJsv9mRmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration [not found] ` <CAJ3xEMgJvXjg3aFbTNEudj9QWMfP4==eBq0ccuhjuVJsv9mRmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-21 21:01 ` Jason Gunthorpe 0 siblings, 0 replies; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-21 21:01 UTC (permalink / raw) To: Or Gerlitz Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Thu, May 21, 2015 at 11:05:32PM +0300, Or Gerlitz wrote: > On Thu, May 21, 2015, Jason Gunthorpe wrote: > > >> +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac) > >> +{ > >> + char *raw_guid; > >> + u64 guid = 0; > > > > This doesn't seem right at all. > > > > It makes no sense that a IPoIB interface with a 20 byte LLADDR would > > accept an 8 byte LLADDR only for 'ip link set vf mac' > > > > The definition of the netlink struct seems to confirm this: > > > > struct ifla_vf_mac { > > __u32 vf; > > __u8 mac[32]; /* MAX_ADDR_LEN */ > > }; > > > > If it was really just ever a mac it would really only be 6 bytes. > > [Honestly, this whole feature seems very inconistent with the rest of > > the design of ip net link, so who knows] > > > > If the ifla_vf_mac had been variable-sized (like every other address > > related attribute) then sure, auto detect the length and do the right > > thing. > > > > But with this API, I think you have no choice, 'ip set vf mac LLADDR' > > can only be the 20 byte address. > > You can't enforce 20 byte address on ipoib instance b/c the driver > can't dictate the QPN of their UD QP. I can think of several ways off the top of my head to deal with that. Don't see it as a problem. > Also, you don't need to force that, b/c what the virtualization > system want to provision relates to a VM ID which is their mac (--> > guid). No, I do need to force that because that is how netlink works: each device has a defined hw address length, and all netlink messages dealing with LLADDR must use an identical format for the hw address. For IPoIB that is 20 byte. It is a simple matter of UAPI sanity, it is not the kernel's problem that some userspace tools assume a 6 byte LLADDR for all devices - they are broken. > (mac), 64bits (guid) or even 128bits (whole gid), but it doesn't and I > would like to either use the lowest common denominator (48bits) or > just take always 64bits which could have two zero bytes (e.g when > libvirt calls into that api through netlink). Then get the iproute and netdev people to sign off on using ifla_vf_mac with something other than the defined 20 byte IPoIB HW address. Maybe you can work out some scheme with them. Making the IFLA_VF_MAC properly variable sized might be a good angle. NACK otherwise. The rest of the series's idea seems sound. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration [not found] ` <20150521184613.GD6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 2015-05-21 20:05 ` Or Gerlitz @ 2015-05-21 21:05 ` Doug Ledford [not found] ` <1432242331.28905.67.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 31+ messages in thread From: Doug Ledford @ 2015-05-21 21:05 UTC (permalink / raw) To: Jason Gunthorpe; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai [-- Attachment #1: Type: text/plain, Size: 2283 bytes --] On Thu, 2015-05-21 at 12:46 -0600, Jason Gunthorpe wrote: > On Thu, May 21, 2015 at 07:24:05PM +0300, Or Gerlitz wrote: > > Standard configuration of SRIOV VFs through the host is done over the > > following chain of calls: libvirt --> netlink --> PF netdevice > > > > When this comes to IB/IPoIB we should normalize this into the verbs > > framework so we further go: PF IPoIB --> verbs API --> PF HW driver > > > > Virtualization systems assign VMs 48 bits mac, to allow working with > > non-modified SW layers (open-stack, libvirt, etc), we can safely > > extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac > > entry calls the set_vf_guid verb. > > > > Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > > drivers/infiniband/ulp/ipoib/ipoib_main.c | 39 +++++++++++++++++++++++++++++ > > include/rdma/ib_verbs.h | 4 +++ > > 2 files changed, 43 insertions(+), 0 deletions(-) > > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > > index 9e1b203..8f82870 100644 > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > > @@ -1357,6 +1357,43 @@ void ipoib_dev_cleanup(struct net_device *dev) > > priv->tx_ring = NULL; > > } > > > > +static int ipoib_get_vf_config(struct net_device *dev, int vf, struct ifla_vf_info *ivf) > > +{ > > + struct ipoib_dev_priv *priv = netdev_priv(dev); > > + > > + if (priv->ca->get_vf_config) > > + return priv->ca->get_vf_config(priv->ca, priv->port, vf, ivf); > > + else > > + return -EINVAL; > > +} > > + > > +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac) > > +{ > > + char *raw_guid; > > + u64 guid = 0; > > This doesn't seem right at all. > > It makes no sense that a IPoIB interface with a 20 byte LLADDR would > accept an 8 byte LLADDR only for 'ip link set vf mac' It has to be the 8byte guid, the next 8bytes are the subnet prefix which is set by the SM and not under driver control, and the final 4bytes are specific to the IPoIB connection. You could pass in all 20bytes, but you would only be able to set 8 of them. -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <1432242331.28905.67.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration [not found] ` <1432242331.28905.67.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-05-21 21:21 ` Jason Gunthorpe 0 siblings, 0 replies; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-21 21:21 UTC (permalink / raw) To: Doug Ledford; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai On Thu, May 21, 2015 at 05:05:31PM -0400, Doug Ledford wrote: > It has to be the 8byte guid, the next 8bytes are the subnet prefix which > is set by the SM and not under driver control, and the final 4bytes are > specific to the IPoIB connection. You could pass in all 20bytes, but > you would only be able to set 8 of them. Of course you can't set all 20 bytes, but you still have to provide them, that is the defined LLADDR format for this device. 1) Kernel demands immutable bits are zero and fills them in automatically 2) Userspace copies the current LLADDR and only modifies the 8 bytes of the GUID, kernel demands immutable bits are unchanged 3) Who is to say the QPN can't be made settable someday? People have complained about the unstable QPN in the past. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH RFC 2/3] IB/mlx4: Refactor Alias GUID storing [not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2015-05-21 16:24 ` [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration Or Gerlitz @ 2015-05-21 16:24 ` Or Gerlitz 2015-05-21 16:24 ` [PATCH RFC 3/3] IB/mlx4: Add support for SRIOV VF management Or Gerlitz 2015-05-21 16:40 ` [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs Doug Ledford 3 siblings, 0 replies; 31+ messages in thread From: Or Gerlitz @ 2015-05-21 16:24 UTC (permalink / raw) To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai, Or Gerlitz Move the code that actually set the alias GUID provided by the admin into a function which isn't tied to the mlx4 SRIOV sysfs constructs. So we can use for the verbs which deal with guid setting too. This commit does not change any functionality. Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> --- drivers/infiniband/hw/mlx4/mlx4_ib.h | 4 ++ drivers/infiniband/hw/mlx4/sysfs.c | 54 ++++++++++++++++++++++------------ 2 files changed, 39 insertions(+), 19 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index fce3934..a13a814 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -807,6 +807,10 @@ void mlx4_ib_device_unregister_sysfs(struct mlx4_ib_dev *device); __be64 mlx4_ib_gen_node_guid(void); + +void mlx4_store_admin_alias_guid(struct mlx4_ib_dev *mdev, int port, int slave, + __be64 guid); + int mlx4_ib_steer_qp_alloc(struct mlx4_ib_dev *dev, int count, int *qpn); void mlx4_ib_steer_qp_free(struct mlx4_ib_dev *dev, u32 qpn, int count); int mlx4_ib_steer_qp_reg(struct mlx4_ib_dev *mdev, struct mlx4_ib_qp *mqp, diff --git a/drivers/infiniband/hw/mlx4/sysfs.c b/drivers/infiniband/hw/mlx4/sysfs.c index 6797108..705e3b8 100644 --- a/drivers/infiniband/hw/mlx4/sysfs.c +++ b/drivers/infiniband/hw/mlx4/sysfs.c @@ -59,6 +59,38 @@ static ssize_t show_admin_alias_guid(struct device *dev, return sprintf(buf, "%llx\n", be64_to_cpu(sysadmin_ag_val)); } + +void mlx4_store_admin_alias_guid(struct mlx4_ib_dev *mdev, int port, int slave, + __be64 guid) +{ + unsigned long flags; + int record_num;/*0-15*/ + int guid_index_in_rec; /*0 - 7*/ + + record_num = slave / 8; + guid_index_in_rec = slave % 8; + + spin_lock_irqsave(&mdev->sriov.alias_guid.ag_work_lock, flags); + + *(__be64 *)&mdev->sriov.alias_guid.ports_guid[port - 1]. + all_rec_per_port[record_num]. + all_recs[GUID_REC_SIZE * guid_index_in_rec] = guid; + + /* Change the state to be pending for update */ + mdev->sriov.alias_guid.ports_guid[port - 1].all_rec_per_port[record_num].status + = MLX4_GUID_INFO_STATUS_IDLE ; + + mlx4_set_admin_guid(mdev->dev, guid, slave, port); + + /* set the record index */ + mdev->sriov.alias_guid.ports_guid[port - 1].all_rec_per_port[record_num].guid_indexes + |= mlx4_ib_get_aguid_comp_mask_from_ix(guid_index_in_rec); + + spin_unlock_irqrestore(&mdev->sriov.alias_guid.ag_work_lock, flags); + + mlx4_ib_init_alias_guid_work(mdev, port - 1); +} + /* store_admin_alias_guid stores the (new) administratively assigned value of that GUID. * Values in buf parameter string: * 0 - requests opensm to assign a value. @@ -76,7 +108,6 @@ static ssize_t store_admin_alias_guid(struct device *dev, struct mlx4_ib_iov_port *port = mlx4_ib_iov_dentry->ctx; struct mlx4_ib_dev *mdev = port->dev; u64 sysadmin_ag_val; - unsigned long flags; record_num = mlx4_ib_iov_dentry->entry_num / 8; guid_index_in_rec = mlx4_ib_iov_dentry->entry_num % 8; @@ -84,26 +115,11 @@ static ssize_t store_admin_alias_guid(struct device *dev, pr_err("GUID 0 block 0 is RO\n"); return count; } - spin_lock_irqsave(&mdev->sriov.alias_guid.ag_work_lock, flags); sscanf(buf, "%llx", &sysadmin_ag_val); - *(__be64 *)&mdev->sriov.alias_guid.ports_guid[port->num - 1]. - all_rec_per_port[record_num]. - all_recs[GUID_REC_SIZE * guid_index_in_rec] = - cpu_to_be64(sysadmin_ag_val); - - /* Change the state to be pending for update */ - mdev->sriov.alias_guid.ports_guid[port->num - 1].all_rec_per_port[record_num].status - = MLX4_GUID_INFO_STATUS_IDLE ; - mlx4_set_admin_guid(mdev->dev, cpu_to_be64(sysadmin_ag_val), - mlx4_ib_iov_dentry->entry_num, - port->num); - /* set the record index */ - mdev->sriov.alias_guid.ports_guid[port->num - 1].all_rec_per_port[record_num].guid_indexes - |= mlx4_ib_get_aguid_comp_mask_from_ix(guid_index_in_rec); - - spin_unlock_irqrestore(&mdev->sriov.alias_guid.ag_work_lock, flags); - mlx4_ib_init_alias_guid_work(mdev, port->num - 1); + mlx4_store_admin_alias_guid(mdev, port->num, + mlx4_ib_iov_dentry->entry_num, + cpu_to_be64(sysadmin_ag_val)); return count; } -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH RFC 3/3] IB/mlx4: Add support for SRIOV VF management [not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2015-05-21 16:24 ` [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration Or Gerlitz 2015-05-21 16:24 ` [PATCH RFC 2/3] IB/mlx4: Refactor Alias GUID storing Or Gerlitz @ 2015-05-21 16:24 ` Or Gerlitz 2015-05-21 16:40 ` [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs Doug Ledford 3 siblings, 0 replies; 31+ messages in thread From: Or Gerlitz @ 2015-05-21 16:24 UTC (permalink / raw) To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai, Or Gerlitz Add support for the set_vf_guid and get_vf_config verbs. Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> --- drivers/infiniband/hw/mlx4/main.c | 26 ++++++++++++++++++++++++++ drivers/net/ethernet/mellanox/mlx4/cmd.c | 26 ++++++++++++++++++-------- include/linux/mlx4/device.h | 2 ++ 3 files changed, 46 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 57070c5..17b6fa7 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -1325,6 +1325,27 @@ err_malloc: return err; } +static int mlx4_ib_set_vf_guid(struct ib_device *ibdev, int port, int vf, u64 guid) +{ + int slave; + struct mlx4_ib_dev *mdev = to_mdev(ibdev); + + slave = mlx4_get_slave_indx(mdev->dev, vf); + if (slave < 0) + return -EINVAL; + + mlx4_store_admin_alias_guid(mdev, port, slave, cpu_to_be64(guid)); + + return 0; +} + +static int mlx4_ib_get_vf_config(struct ib_device *ibdev, int port, int vf, struct ifla_vf_info *ivf) +{ + struct mlx4_ib_dev *mdev = to_mdev(ibdev); + + return mlx4_get_vf_config(mdev->dev, port, vf, ivf); +} + static struct mlx4_ib_gid_entry *find_gid_entry(struct mlx4_ib_qp *qp, u8 *raw) { struct mlx4_ib_gid_entry *ge; @@ -2250,6 +2271,11 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) ibdev->ib_dev.dealloc_fmr = mlx4_ib_fmr_dealloc; } + if (mlx4_is_master(ibdev->dev)) { + ibdev->ib_dev.set_vf_guid = mlx4_ib_set_vf_guid; + ibdev->ib_dev.get_vf_config = mlx4_ib_get_vf_config; + } + if (dev->caps.flags & MLX4_DEV_CAP_FLAG_MEM_WINDOW || dev->caps.bmme_flags & MLX4_BMME_FLAG_TYPE_2_WIN) { ibdev->ib_dev.alloc_mw = mlx4_ib_alloc_mw; diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c index 7761045..a544650 100644 --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c @@ -2647,7 +2647,7 @@ u32 mlx4_comm_get_version(void) return ((u32) CMD_CHAN_IF_REV << 8) | (u32) CMD_CHAN_VER; } -static int mlx4_get_slave_indx(struct mlx4_dev *dev, int vf) +int mlx4_get_slave_indx(struct mlx4_dev *dev, int vf) { if ((vf < 0) || (vf >= dev->persist->num_vfs)) { mlx4_err(dev, "Bad vf number:%d (number of activated vf: %d)\n", @@ -2657,6 +2657,7 @@ static int mlx4_get_slave_indx(struct mlx4_dev *dev, int vf) return vf+1; } +EXPORT_SYMBOL_GPL(mlx4_get_slave_indx); int mlx4_get_vf_indx(struct mlx4_dev *dev, int slave) { @@ -3089,13 +3090,22 @@ int mlx4_get_vf_config(struct mlx4_dev *dev, int port, int vf, struct ifla_vf_in s_info = &priv->mfunc.master.vf_admin[slave].vport[port]; ivf->vf = vf; - /* need to convert it to a func */ - ivf->mac[0] = ((s_info->mac >> (5*8)) & 0xff); - ivf->mac[1] = ((s_info->mac >> (4*8)) & 0xff); - ivf->mac[2] = ((s_info->mac >> (3*8)) & 0xff); - ivf->mac[3] = ((s_info->mac >> (2*8)) & 0xff); - ivf->mac[4] = ((s_info->mac >> (1*8)) & 0xff); - ivf->mac[5] = ((s_info->mac) & 0xff); + if (dev->caps.port_mask[port] == MLX4_PORT_TYPE_ETH) { + ivf->mac[0] = ((s_info->mac >> (5*8)) & 0xff); + ivf->mac[1] = ((s_info->mac >> (4*8)) & 0xff); + ivf->mac[2] = ((s_info->mac >> (3*8)) & 0xff); + ivf->mac[3] = ((s_info->mac >> (2*8)) & 0xff); + ivf->mac[4] = ((s_info->mac >> (1*8)) & 0xff); + ivf->mac[5] = ((s_info->mac) & 0xff); + } else { + u64 guid = be64_to_cpu(s_info->guid); + ivf->mac[0] = ((guid >> (7*8)) & 0xff); + ivf->mac[1] = ((guid >> (6*8)) & 0xff); + ivf->mac[2] = ((guid >> (5*8)) & 0xff); + ivf->mac[3] = ((guid >> (2*8)) & 0xff); + ivf->mac[4] = ((guid >> (1*8)) & 0xff); + ivf->mac[5] = ((guid) & 0xff); + } ivf->vlan = s_info->default_vlan; ivf->qos = s_info->default_qos; diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index 83e80ab..e5a70bd 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -1382,6 +1382,8 @@ int mlx4_get_slave_from_roce_gid(struct mlx4_dev *dev, int port, u8 *gid, int mlx4_get_roce_gid_from_slave(struct mlx4_dev *dev, int port, int slave_id, u8 *gid); +int mlx4_get_slave_indx(struct mlx4_dev *dev, int vf); + int mlx4_FLOW_STEERING_IB_UC_QP_RANGE(struct mlx4_dev *dev, u32 min_range_qpn, u32 max_range_qpn); -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> ` (2 preceding siblings ...) 2015-05-21 16:24 ` [PATCH RFC 3/3] IB/mlx4: Add support for SRIOV VF management Or Gerlitz @ 2015-05-21 16:40 ` Doug Ledford [not found] ` <1432226406.28905.22.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 3 siblings, 1 reply; 31+ messages in thread From: Doug Ledford @ 2015-05-21 16:40 UTC (permalink / raw) To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai [-- Attachment #1: Type: text/plain, Size: 2953 bytes --] On Thu, 2015-05-21 at 19:24 +0300, Or Gerlitz wrote: > Standard configuration of SRIOV VFs through the host is done over the > following chain of calls: libvirt --> netlink --> PF netdevice -- where > the PF netdevice exports the ndo_set_vf_ calls. > > When this comes to IB/IPoIB we should normalize this into the verbs > framework so we further go: PF IPoIB --> verbs API --> PF HW driver > > Virtualization systems assign VMs with 48 bits mac, to allow working > with non-modified SW layers Why are we suggesting to make this work with unmodified software? Why aren't we doing this right and adding a new ndo entry point for the GUID? The MAC/GUID mapping isn't the only thing that has to be faked here, you would also have to fake the vlan/pkey mapping. This just seems the wrong thing to do. > (open-stack, libvirt, etc), we can safely > extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac > entry calls the set_vf_guid verb. > > One thing to clean for being beyond RFC is to make the get_vf_config > verb return guid and have IPoIB to make it back a mac. > > Here's how it looks when using the ip tool (libvirt runs the same > netlink to set it out) and later reflected when the VF read their port. > > # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff > > # ibstat -d mlx4_2 > CA 'mlx4_2' > CA type: MT4100 > Number of ports: 1 > Firmware version: 2.34.1260 > Hardware version: 0 > Node GUID: 0x00140500f30e84c4 > System image GUID: 0xf452140300117423 > Port 1: > State: Active > Physical state: LinkUp > Rate: 56 > Base lid: 7 > LMC: 0 > SM lid: 1 > Capability mask: 0x02514868 > Port GUID: 0xffeeddfeffccbbaa > Link layer: InfiniBand > > Or Gerlitz (3): > IB/IPoIB: Support SRIOV standard configuration > IB/mlx4: Refactor Alias GUID storing > IB/mlx4: Add support for SRIOV VF management > > drivers/infiniband/hw/mlx4/main.c | 26 ++++++++++++++ > drivers/infiniband/hw/mlx4/mlx4_ib.h | 4 ++ > drivers/infiniband/hw/mlx4/sysfs.c | 54 ++++++++++++++++++---------- > drivers/infiniband/ulp/ipoib/ipoib_main.c | 39 +++++++++++++++++++++ > drivers/net/ethernet/mellanox/mlx4/cmd.c | 26 +++++++++---- > include/linux/mlx4/device.h | 2 + > include/rdma/ib_verbs.h | 4 ++ > 7 files changed, 128 insertions(+), 27 deletions(-) > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <1432226406.28905.22.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <1432226406.28905.22.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-05-21 19:55 ` Or Gerlitz [not found] ` <CAJ3xEMjzpqnQuA=0HQaN8noVq04d9BkVvEWGY7Lq5ZntVTKm4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Or Gerlitz @ 2015-05-21 19:55 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Thu, May 21, 2015 at 7:40 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > The MAC/GUID mapping isn't the only thing that has to be faked here Exactly nothing is faked here, Virtualization systems such as open-stack provision unique 48 bit mac values to VMs, and it's perfectly legitimate and viable to derive 64 bit guid value from that mac. > Why are we suggesting to make this work with unmodified software? Why > aren't we doing this right and adding a new ndo entry point for the GUID? Because rome wasn't built in a day and nor will be the support for IB in today's/tomorrow's virtualization systems, e.g if you follow on this layering [1] Open-Stack / ODL controller [2] Open-Stack neutron / ODL agent [3] libvirt [4] user/kernel netlink API [5] kernel ndo API [6] ipoib [7] kernel verbs API [8] PF IB driver with the approach presented here, we only simply (yeah, simplicity could turn to be critical criteria in engineering) to few kernel only patches that deal with layers 6-8 and we are ready for all sorts of bring-ups, testing and even production! For reasons which I don't really see the practical / real life use case where there's a must to get them to work (but I will happy to hear on) one can go & change the world, namely patch layers 5 ---> 1 too and deal with all sort of dependencies for setting up a system. But guess what, this can be perfectly done in parallel with this small change. > you would also have to fake the vlan/pkey mapping. This just > seems the wrong thing to do. Repeating the above argument --- virt systems provision 12bit vlan-id to be set for VM traffic, which can be nicely map to 16 bit IB pkey doing the same job. I understand that you have sort of desire to see IB ala the full spec going into libvirt and from there up to the whole virtualization management space, but this doesn't need as an argument to not enable doing thing in the right direction. The upstream kernel supports SRIOV for IB over mlx4 for 3 years now, but this can't work with libvirt as is. Using these patches can make the thing. Couple of months ago, we both attended a call with the libvirt developers / maintainers from red-hat and they really liked this staged approach. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <CAJ3xEMjzpqnQuA=0HQaN8noVq04d9BkVvEWGY7Lq5ZntVTKm4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <CAJ3xEMjzpqnQuA=0HQaN8noVq04d9BkVvEWGY7Lq5ZntVTKm4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-21 21:11 ` Doug Ledford [not found] ` <1432242708.28905.73.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Doug Ledford @ 2015-05-21 21:11 UTC (permalink / raw) To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai [-- Attachment #1: Type: text/plain, Size: 4480 bytes --] On Thu, 2015-05-21 at 22:55 +0300, Or Gerlitz wrote: > On Thu, May 21, 2015 at 7:40 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > The MAC/GUID mapping isn't the only thing that has to be faked here > > Exactly nothing is faked here, Virtualization systems such as > open-stack provision unique 48 bit mac values to VMs, and it's > perfectly legitimate and viable to derive 64 bit guid value from that > mac. OK, faked wasn't the best use of words. How's converted behind the software's back? And if the management software set the MAC, then tried to check it via ARP after the guest is up and running, it would never find the guest. I don't know if Open-Stack or any other controller would both A) attempt to set the MAC of the device in libvirt and start the guest and B) enter the MAC into a dhcp.conf file for static IP assignment, but they could, and this sort of manipulation would directly break that. > > Why are we suggesting to make this work with unmodified software? Why > > aren't we doing this right and adding a new ndo entry point for the GUID? > > Because rome wasn't built in a day and nor will be the support for IB > in today's/tomorrow's virtualization systems, e.g if you follow on > this layering > > [1] Open-Stack / ODL controller > [2] Open-Stack neutron / ODL agent > [3] libvirt > [4] user/kernel netlink API > [5] kernel ndo API > [6] ipoib > [7] kernel verbs API > [8] PF IB driver > > with the approach presented here, we only simply (yeah, simplicity > could turn to be critical criteria in engineering) to few kernel only > patches that deal with layers 6-8 and we are ready for all sorts of > bring-ups, testing and even production! You're ready to pretend that your IB device is a regular ethernet device. Not even a RoCE or iWARP device. You totally obscured that the device is RDMA capable and made the entire stack above 6 unawares of what you are doing. If your guest actually intends to use any RDMA capabilities, then this is, at best, a quick and dirty workaround to get you up and running while you work through the process of doing things right, which includes making libvirt aware of the difference between RDMA capable devices and not and how to select those devices and how to mark certain guests as RDMA device needy. > For reasons which I don't really see the practical / real life use > case where there's a must to get them to work (but I will happy to > hear on) one can go & change the world, namely patch layers 5 ---> 1 > too and deal with all sort of dependencies for setting up a system. > But guess what, this can be perfectly done in parallel with this small > change. > > > you would also have to fake the vlan/pkey mapping. This just > > seems the wrong thing to do. > > Repeating the above argument --- virt systems provision 12bit vlan-id > to be set for VM traffic, which can be nicely map to 16 bit IB pkey > doing the same job. > > I understand that you have sort of desire to see IB ala the full spec > going into libvirt and from there up to the whole virtualization > management space, but this doesn't need as an argument to not enable > doing thing in the right direction. The upstream kernel supports SRIOV > for IB over mlx4 for 3 years now, but this can't work with libvirt as > is. Using these patches can make the thing. It's a workaround. It comes with limitations, and if we get around to adding an ndo later for really setting the guid, then it would be possible to call the set_guid ndo with a complete guid that didn't use fffe in the middle 2 bytes, and then when we call get vf_info, we get a MAC back that removes those 2 bytes and generates an inconsistency between what we *think* our constructed guid should be and what the set guid actually is. > Couple of months ago, we both attended a call with the libvirt > developers / maintainers from red-hat and they really liked this > staged approach. My recollection of that call was they said "Oh, you guys don't have an API for us to set the GUIDs yet. Ok, we'll close all the bugs and wait until you do." And they promptly closed the bugs and moved on. But that didn't specify the API to use. That's what we are doing here. But I'm not finding this an entirely convincing solution. -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <1432242708.28905.73.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <1432242708.28905.73.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-05-21 21:44 ` Jason Gunthorpe 2015-05-25 20:04 ` Or Gerlitz 1 sibling, 0 replies; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-21 21:44 UTC (permalink / raw) To: Doug Ledford Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Thu, May 21, 2015 at 05:11:48PM -0400, Doug Ledford wrote: > fffe in the middle 2 bytes, and then when we call get vf_info, we get a > MAC back that removes those 2 bytes and generates an inconsistency > between what we *think* our constructed guid should be and what the set > guid actually is. Bingo - When ndo_get_vf_config is called on the PF it must return the same 20 byte LLADDR that will match the IFLA_ADDRESS returned on the VF netdevice. If do_get_vf_config returns the 20 byte LLADDR, then ndo_set_vf_mac must also accept the same format. This probably suggests option #2 from my last email is the way to go, as the input and ouput from the set, and any followup IFLA_ADDRESS or ndo_get_vf_config will all be the same LLADDR. Hidden changes to the LLADDR are probably just going to confuse things. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <1432242708.28905.73.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2015-05-21 21:44 ` Jason Gunthorpe @ 2015-05-25 20:04 ` Or Gerlitz [not found] ` <CAJ3xEMh3BaxJzCu9mV9m6ZMwgrDaO2UvTyS1i=DEPq9nuLX3oA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 31+ messages in thread From: Or Gerlitz @ 2015-05-25 20:04 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Fri, May 22, 2015 at 12:11 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Thu, 2015-05-21 at 22:55 +0300, Or Gerlitz wrote: >> On Thu, May 21, 2015 at 7:40 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> >> > The MAC/GUID mapping isn't the only thing that has to be faked here >> >> Exactly nothing is faked here, Virtualization systems such as >> open-stack provision unique 48 bit mac values to VMs, and it's >> perfectly legitimate and viable to derive 64 bit guid value from that >> mac. > > OK, faked wasn't the best use of words. How's converted behind the > software's back? And if the management software set the MAC, then tried > to check it via ARP after the guest is up and running, it would never > find the guest. I don't know if Open-Stack or any other controller > would both A) attempt to set the MAC of the device in libvirt and start > the guest and B) enter the MAC into a dhcp.conf file for static IP > assignment, but they could, and this sort of manipulation would directly > break that. OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port GUID, and as Jason noted the user/kernel API allows to deliver up to 32 bytes between user and kernel under the set_vf_mac flow (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through **non-modified** ip tool and net/core/rtnetlink.c things just work - I can set eight bytes value to be the virtual port GUID : # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22 # ibstat -d mlx4_2 CA 'mlx4_2' CA type: MT4100 Number of ports: 1 Firmware version: 2.34.1260 Hardware version: 0 Node GUID: 0x001405003bca04bb System image GUID: 0xf452140300117423 Port 1: State: Active Physical state: LinkUp Rate: 56 Base lid: 7 LMC: 0 SM lid: 1 Capability mask: 0x02514868 Port GUID: 0x2211ffeeddccbbaa Link layer: InfiniBand Re DHCP: RFC 2131 adds a “client identifier” option that replaces the client hardware address as the unique identifier of the client in its subnet. DHCP over IB RFC 4390 [1] requires that IPoIB DHCP clients use the client identifier (as they cannot fit their 20 byte MAC in the client hardware address field). DHCP packages (e.g ISC) that support IPoIB use client identifier which is based on the unique eight byte port GUID, so with the modified patches that use 8 bytes, we're OK DHCP wise. [1] https://tools.ietf.org/html/rfc4390#section-2.1 [...] > It's a workaround. It comes with limitations, and if we get around to > adding an ndo later for really setting the guid, then it would be > possible to call the set_guid ndo with a complete guid that didn't use > fffe in the middle 2 bytes, and then when we call get vf_info, we get a > MAC back that removes those 2 bytes and generates an inconsistency > between what we *think* our constructed guid should be and what the set > guid actually is. OK, as written above, I have managed to get away from this possible mess which you described here by providing eight bytes from user to kernel through the existing netlink API (which is used by the ip tool and libvirt). >> Couple of months ago, we both attended a call with the libvirt >> developers / maintainers from red-hat and they really liked this >> staged approach. > My recollection of that call was they said "Oh, you guys don't have an > API for us to set the GUIDs yet. Ok, we'll close all the bugs and wait > until you do." And they promptly closed the bugs and moved on. But > that didn't specify the API to use. That's what we are doing here. But > I'm not finding this an entirely convincing solution. So that was what said, we were wrong and with this small ipoib/verbs patch, we fully have the API to provision the vGUID of the VF. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <CAJ3xEMh3BaxJzCu9mV9m6ZMwgrDaO2UvTyS1i=DEPq9nuLX3oA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <CAJ3xEMh3BaxJzCu9mV9m6ZMwgrDaO2UvTyS1i=DEPq9nuLX3oA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-25 21:14 ` Jason Gunthorpe [not found] ` <20150525211433.GA9186-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-25 21:14 UTC (permalink / raw) To: Or Gerlitz Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote: > OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port > GUID, and as Jason noted the user/kernel API allows to deliver up to > 32 bytes between user and kernel under the set_vf_mac flow > (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through > **non-modified** ip tool and net/core/rtnetlink.c things just work - > I can set eight bytes value to be the virtual port GUID : Was I not perfectly clear? You have to use the 20 byte LLADDR format here: > # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22 > Port GUID: 0x2211ffeeddccbbaa The byte order got screwed up someplace. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <20150525211433.GA9186-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <20150525211433.GA9186-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-25 21:50 ` Or Gerlitz [not found] ` <CAJ3xEMhG2W6WzxC4Kc2CFmdMwTRUF5ppBgcDZ6SMA=kgJowUpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Or Gerlitz @ 2015-05-25 21:50 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Tue, May 26, 2015 at 12:14 AM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote: > >> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port >> GUID, and as Jason noted the user/kernel API allows to deliver up to >> 32 bytes between user and kernel under the set_vf_mac flow >> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through >> **non-modified** ip tool and net/core/rtnetlink.c things just work - >> I can set eight bytes value to be the virtual port GUID : > > Was I not perfectly clear? You have to use the 20 byte LLADDR format here: > >> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22 Jason, I am aiming to provision the VF IB end-node address == port GUID (vGUID) in the same manner that VF Eth end-node address is their MAC, not more, not less. 20 bytes are the lladdr of IPoIB devices which isn't the VF IB end-node address but rather made of flags (1B) + QPN (3B) + subnet prefix (8B) + VF GUID -- way more then the virtualization system care or can provision. >> Port GUID: 0x2211ffeeddccbbaa > The byte order got screwed up someplace. thx, will fix -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <CAJ3xEMhG2W6WzxC4Kc2CFmdMwTRUF5ppBgcDZ6SMA=kgJowUpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <CAJ3xEMhG2W6WzxC4Kc2CFmdMwTRUF5ppBgcDZ6SMA=kgJowUpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-25 22:32 ` Jason Gunthorpe [not found] ` <20150525223235.GA9858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 2015-05-26 16:53 ` Doug Ledford 1 sibling, 1 reply; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-25 22:32 UTC (permalink / raw) To: Or Gerlitz Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Tue, May 26, 2015 at 12:50:52AM +0300, Or Gerlitz wrote: > On Tue, May 26, 2015 at 12:14 AM, Jason Gunthorpe > <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > > On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote: > > > >> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port > >> GUID, and as Jason noted the user/kernel API allows to deliver up to > >> 32 bytes between user and kernel under the set_vf_mac flow > >> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through > >> **non-modified** ip tool and net/core/rtnetlink.c things just work - > >> I can set eight bytes value to be the virtual port GUID : > > > > Was I not perfectly clear? You have to use the 20 byte LLADDR format here: > > > >> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22 > > Jason, > > I am aiming to provision the VF IB end-node address == port GUID (vGUID) > in the same manner that VF Eth end-node address is their MAC, not > more, not less. I perfectly understand what you are trying to do. I care about the design and consistency of netlink - and that means there is one LLADDR definition for a net device, and every single netlink message that touches a LLADDR uses that definition - for IPoIB that is 20 bytes. To violate that design invariant needs an incredibly strong argument, and you haven't made it. This allows generic code to work with the LLADDR - for instance 'ip link set vf mac' should have checked the size of the IFLA_ADDRESS and demanded that the address argument is the same number of bytes. It is very broken the command happily accepts an 8 byte and 6 byte argument for the same device. Yes, it is ugly that the PF side's ndo_get_vf_config cannot return the same 20 byte address of the VF's ipoib interface, but I think that is less ugly than forcing a different address format just for the vf calls. If you have doubts then *ask netdev*. Ask them if ndo_set_vf_mac must follow the same address size and format as IFLA_ADDRESS, or if we can use something else. Such a netlink architecture choice is beyond the authority of linux-rdma. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <20150525223235.GA9858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <20150525223235.GA9858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-26 3:33 ` Or Gerlitz [not found] ` <CAJ3xEMjSU0xVWyqd8v_-OO5JvsHycGTU6gg=BHpZD8PSqRfzQg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-05-26 16:53 ` Doug Ledford 1 sibling, 1 reply; 31+ messages in thread From: Or Gerlitz @ 2015-05-26 3:33 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Tue, May 26, 2015 at 1:32 AM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > On Tue, May 26, 2015 at 12:50:52AM +0300, Or Gerlitz wrote: >> On Tue, May 26, 2015 at 12:14 AM, Jason Gunthorpe >> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: >> > On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote: >> > >> >> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port >> >> GUID, and as Jason noted the user/kernel API allows to deliver up to >> >> 32 bytes between user and kernel under the set_vf_mac flow >> >> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through >> >> **non-modified** ip tool and net/core/rtnetlink.c things just work - >> >> I can set eight bytes value to be the virtual port GUID : >> > >> > Was I not perfectly clear? You have to use the 20 byte LLADDR format here: >> > >> >> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22 >> >> Jason, >> >> I am aiming to provision the VF IB end-node address == port GUID (vGUID) >> in the same manner that VF Eth end-node address is their MAC, not >> more, not less. > > I perfectly understand what you are trying to do. Good, we should be doing things for a reason, and not just for the sake of doing them 111% right by someone possibly subjective judgement > I care about the design and consistency of netlink - and that means > there is one LLADDR definition for a net device, and every single netlink > message that touches a LLADDR uses that definition - for IPoIB that is 20 > bytes. > To violate that design invariant needs an incredibly strong argument, > and you haven't made it. > This allows generic code to work with the LLADDR - for instance 'ip > link set vf mac' should have checked the size of the IFLA_ADDRESS and > demanded that the address argument is the same number of bytes. It is > very broken the command happily accepts an 8 byte and 6 byte argument > for the same device. OK, so per your view, the existing kernel code for this flow is broken and you resist my attempt to use it as is, and > Yes, it is ugly that the PF side's ndo_get_vf_config cannot return the > same 20 byte address of the VF's ipoib interface, but I think that is > less ugly than forcing a different address format just for the vf calls. you claim that what I propose is uglier from the fact that the PF can't by no means return the 20 VF's IPoIB address and it's OK if I only let the PF configure 20 bytes with part (say four) of them being arbitrary and only have consistent 20B get/set by the PF. Would you be happier if the ipoib ndo_set_vf_mac ndo be 1. getting 20B from user-space and treating 16 of them as the VF subnet-prefix (8B) + vGUID (8B) 2. checking that the subnet-prefix is correct 3. provision the vGUID through PF driver / the verb I proposed for the VF ??? on the way back, for the get_vf_config 1. read the VF vGUID from the PF IB driver through the verbs 2. add the port subnet prefix 3. return 20B to user-space ??? > If you have doubts then *ask netdev*. Ask them if ndo_set_vf_mac must > follow the same address size and format as IFLA_ADDRESS, or if we can > use something else. > Such a netlink architecture choice is beyond the authority of linux-rdma. I am not @ the point to change start changing this specific netlink flow. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <CAJ3xEMjSU0xVWyqd8v_-OO5JvsHycGTU6gg=BHpZD8PSqRfzQg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <CAJ3xEMjSU0xVWyqd8v_-OO5JvsHycGTU6gg=BHpZD8PSqRfzQg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-26 5:58 ` Jason Gunthorpe 0 siblings, 0 replies; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-26 5:58 UTC (permalink / raw) To: Or Gerlitz Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Tue, May 26, 2015 at 06:33:15AM +0300, Or Gerlitz wrote: > > This allows generic code to work with the LLADDR - for instance 'ip > > link set vf mac' should have checked the size of the IFLA_ADDRESS and > > demanded that the address argument is the same number of bytes. It is > > very broken the command happily accepts an 8 byte and 6 byte argument > > for the same device. > > OK, so per your view, the existing kernel code for this flow is broken > and you resist my attempt to use it as is, and What does that mean? The current kernel code that uses this interface is all ethernet drivers. They use 6 bytes, in exactly the same format as the LLADDR. Do you see something different? And yes, the kernel code for this interface is very much inconsistent with the rest of netlink. Carrying a LLADDR in a fixed and unsized 32 byte array and calling it a 'mac' is awful. Looking through the comments from 2010 on the patch that introduced this function, it is pretty clear that the original author did not understand netlink. The userspace side in iproute2 is even worse: static void print_vfinfo(FILE *fp, struct rtattr *vfinfo) { ... fprintf(fp, "\n vf %d MAC %s", vf_mac->vf, ll_addr_n2a((unsigned char *)&vf_mac->mac, ETH_ALEN, 0, b1, sizeof(b1))); Hard coded ETH_ALEN! For a 32 byte array! BLECK! And the set side: static int iplink_parse_vf(int vf, int *argcp, char ***argvp, struct iplink_req *req, int dev_index) { ... len = ll_addr_a2n((char *)ivm.mac, 32, *argv); if (len < 0) return -1; addattr_l(&req->n, sizeof(*req), IFLA_VF_MAC, &ivm, sizeof(ivm)); It just passes random stack garbage into the kernel if the user doesn't provide 32 bytes on the command line. BLECK! The kernel didn't even validate the *length* of address sub attribute until recently: https://patchwork.ozlabs.org/patch/436901/ So, to your question yes, I think it is broken / horribly written. But we can live with it, we don't need to radically change it. So. looking at this interface there are two options for the 32 byte fixed array: 1) 'mac' is some arbitary interface specific thing with no relation to anything else in netlink 2) 'mac' is a LLADDR, and must match the construction of IFLA_ADDRESS Speaking generally - netlink is a general purpose interface - #2 is by far the saner of the two choices. Why? Because 'ip' needs to know the length of the address for the two places noted above. Using the same length as IFLA_ADDRESS is simple and 100% general. Using a lookup on the device type to guess the address length is 8 because it is infiniband, but 6 for ethernet, and who knows what if the tool is old and doesn't recognize the device type.. Crappy. Not The Netlink Way. Thus, the very best choice (at this point) is to use the length of the IFLA_ADDRESS (aka dev->addr_len), which is completely general and works for every single device choice. Is #1 possible? Yes. But linux-rdma is the wrong place to make that choice. Ask netdev. [ .. if we could go back in time fixing the original patch to use a *sized* array for IFLA_VF_MAC would have been ideal, and made #1 much more desirable, but that is not straightforward now, and we have to live with the bad choices someone else made. Try not to add to this crap .. ] > > Yes, it is ugly that the PF side's ndo_get_vf_config cannot return the > > same 20 byte address of the VF's ipoib interface, but I think that is > > less ugly than forcing a different address format just for the vf calls. > > you claim that what I propose is uglier from the fact that the PF can't > by no means return the 20 VF's IPoIB address and it's OK if I only let > the PF configure 20 bytes with part (say four) of them being arbitrary > and only have consistent 20B get/set by the PF. > > Would you be happier if the ipoib ndo_set_vf_mac ndo be > > 1. getting 20B from user-space and treating 16 of them as the VF > subnet-prefix (8B) + vGUID (8B) > > 2. checking that the subnet-prefix is correct > > 3. provision the vGUID through PF driver / the verb I proposed for the VF > > ??? > > on the way back, for the get_vf_config > > 1. read the VF vGUID from the PF IB driver through the verbs > > 2. add the port subnet prefix > > 3. return 20B to user-space > > ??? Yes, absolutely. That maintains the general invariant for LLADDR and the broad integrity of the existing netlink design. Using anything else is also short sited. Current hardware is restricted to a randomized QPN, and current IBA is restricted to a single GID prefix: But are you *absolutely* sure that will be the case forever? This is a UAPI, after all. I'm not: For instance controlling/choosing the QPN solves real issues with live migration of VF IPoIB - the need to invalidate remote ND entires when the QPN changes on migration. > I am not @ the point to change start changing this specific netlink flow. We are not changing anything, we are trying to guess what netdev wants to use the fixed 32 byte array called 'mac' for, when we are not going to store a MAC in it. This is new territory, never been done before. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <20150525223235.GA9858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 2015-05-26 3:33 ` Or Gerlitz @ 2015-05-26 16:53 ` Doug Ledford [not found] ` <1432659226.28905.151.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 31+ messages in thread From: Doug Ledford @ 2015-05-26 16:53 UTC (permalink / raw) To: Jason Gunthorpe Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai [-- Attachment #1: Type: text/plain, Size: 5965 bytes --] On Mon, 2015-05-25 at 16:32 -0600, Jason Gunthorpe wrote: > On Tue, May 26, 2015 at 12:50:52AM +0300, Or Gerlitz wrote: > > On Tue, May 26, 2015 at 12:14 AM, Jason Gunthorpe > > <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > > > On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote: > > > > > >> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port > > >> GUID, and as Jason noted the user/kernel API allows to deliver up to > > >> 32 bytes between user and kernel under the set_vf_mac flow > > >> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through > > >> **non-modified** ip tool and net/core/rtnetlink.c things just work - > > >> I can set eight bytes value to be the virtual port GUID : > > > > > > Was I not perfectly clear? You have to use the 20 byte LLADDR format here: > > > > > >> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22 > > > > Jason, > > > > I am aiming to provision the VF IB end-node address == port GUID (vGUID) > > in the same manner that VF Eth end-node address is their MAC, not > > more, not less. > > I perfectly understand what you are trying to do. > > I care about the design and consistency of netlink - and that means > there is one LLADDR definition for a net device, and every single netlink > message that touches a LLADDR uses that definition - for IPoIB that is 20 > bytes. So, here's my $.02. First, 8 bytes is the right size for the data to be transferred. The remaining 12 bytes of the IPoIB LLADDR are generated constructs at least partially controlled by elements outside of this machine, and moreover, they are *not* the *actual* LLADDR of the underlying IB device and the 20 byte LLADDR never hits the wire. So, it's an artificial construct. The fact that the 20 byte LLADDR of IPoIB is problematic is born out by multiple facts: we only use 8 bytes in DHCP requests and responses, only use 8 bytes in matching IPoIB netdevices against HWADDR entries in the sysconfig/ifcfg-* files, only use 8 bytes for matching the udev rules to rename IPoIB devices, etc. This clearly demonstrates that attaching the netlink ndo points to the IPoIB netdevice is, let's just say, a shortcut. The *real* device that we should be manipulating is not the IPoIB device, but the verbs device. If we were manipulating the verbs device, no one would be arguing for anything other than the exact proper item: an 8 byte GUID. Of course, the problem then becomes an issue of "can we attach the underlying verbs device to the netlink ndo mechanism?" And the answer there is "Not easily, and Dave Miller would likely throw a fit" (and rightfully so, as all netdevice entries are assumed to be IP capable and we aren't, and so we can't really be used by the core of the net stack). Since I don't see us getting permission from Dave to attach verbs devices to the netstack just so we can A) be ignored by the core of the netstack while also B) having only 3 or 4 of the ndo_ entry points defined so we can co-opt their infrastructure for our own uses. That leaves using the IPoIB device. In which case, here's what I want: If a user passes in a MAC addresses, please use the following checks: if (mac_len != 8 && mac_len != 20) err out set guid as lower 8 bytes of mac, leave remainder untouched regardless of what's passed in You will note that if someone passes in a mac size of 8, this is not an error. It is consistent with the existing NetworkManager device settings, ifup-ib device settings, and udev device settings that an 8 byte MAC for IPoIB devices is the right thing to uniquely identify the device. Here's why I disagree with Jason. If you have an app that queries a device, sees that it has a 20byte MAC, and then tries to set that MAC, it may have no specific knowledge of how IPoIB relates to the underlying verbs device and could just be trying to set a MAC blindly. And that's ok, but we aren't going to honor the entire MAC they pass in, and that's likely to confuse the app (or the person running the app). On the other hand, if we advertise a 20 byte MAC, and the app/person passes in an 8 byte MAC instead, then that implicitly denotes that the app/person *knows* this is an IPoIB device, and *knows* that we are actually setting the underlying GUID of the verbs device, and is aware of and prepared for the fact that the entire 20 byte MAC is not in fact settable. IPoIB and verbs devices are an exception to the rest of the netdevice stack, and this usage tacitly acknowledges that fact. So I am of the opinion that if an app/user knows exactly how accessing the IPoIB device is merely as a passthrough to the verbs device and that 8 bytes is the right size, then we should accept it instead of requiring them to supply fake data they know won't be used. > To violate that design invariant needs an incredibly strong argument, > and you haven't made it. Hopefully my argument satisfies your requirements. > This allows generic code to work with the LLADDR - for instance 'ip > link set vf mac' should have checked the size of the IFLA_ADDRESS and > demanded that the address argument is the same number of bytes. It is > very broken the command happily accepts an 8 byte and 6 byte argument > for the same device. > > Yes, it is ugly that the PF side's ndo_get_vf_config cannot return the > same 20 byte address of the VF's ipoib interface, but I think that is > less ugly than forcing a different address format just for the vf calls. > > If you have doubts then *ask netdev*. Ask them if ndo_set_vf_mac must > follow the same address size and format as IFLA_ADDRESS, or if we can > use something else. > > Such a netlink architecture choice is beyond the authority of > linux-rdma. > > Jason -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <1432659226.28905.151.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <1432659226.28905.151.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-05-26 18:13 ` Jason Gunthorpe [not found] ` <20150526181336.GD11800-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-26 18:13 UTC (permalink / raw) To: Doug Ledford Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Tue, May 26, 2015 at 12:53:46PM -0400, Doug Ledford wrote: > > I perfectly understand what you are trying to do. > > > > I care about the design and consistency of netlink - and that means > > there is one LLADDR definition for a net device, and every single netlink > > message that touches a LLADDR uses that definition - for IPoIB that is 20 > > bytes. > > So, here's my $.02. > > First, 8 bytes is the right size for the data to be transferred. The > remaining 12 bytes of the IPoIB LLADDR are generated constructs at > least Today only 8 bytes are settable, but the GID prefix and QPN are both logically something that *could* be set by the PF in some future. > partially controlled by elements outside of this machine, and moreover, > they are *not* the *actual* LLADDR of the underlying IB device and the > 20 byte LLADDR never hits the wire. So, it's an artificial construct. Yes, it is ugly that we are using a IPoIB netdev to configure the RDMA device. If you want to suggest that we use the RDMA Netlink interface to do this, that would make sense to me too. But I thought the whole point was to provide something somewhat software compatible with the ethernet world.. > The fact that the 20 byte LLADDR of IPoIB is problematic is born out by > multiple facts: we only use 8 bytes in DHCP requests and responses, > only use 8 bytes in matching IPoIB netdevices against HWADDR entries in > the sysconfig/ifcfg-* files, only use 8 bytes for matching the udev > rules to rename IPoIB devices, etc. The LLADDR for IPoIB *is* 20 bytes. Truncating it down is *broken userspace*: - DHCP: Not sending the full 20 bytes in the client request means the server cannot unicast the reply. This causes all sorts of problems and is discouraged in the RFCs these days. - ifcfg/udev/networkmanager: So what happens when I do ip link add link ib0 name ib0.1 type ipoib And get two IPoIB interfaces with the same GUID? I doubt any sane user would want to apply the same config to those two interfaces. There is a reason the kernel uses 20 byte - it is the unique LLADDR for the interface, it is the value it stuffs in a ND packet. If userspace takes the 20 byte LLADDR and mangles it so it is no longer unique, then that is it's prerogative. But that doesn't mean the kernel should ever accept the 8 byte value for anything. > This clearly demonstrates that attaching the netlink ndo points to the > IPoIB netdevice is, let's just say, a shortcut. The *real* device that > we should be manipulating is not the IPoIB device, but the verbs > device. Yep. > That leaves using the IPoIB device. In which case, here's what I want: > > If a user passes in a MAC addresses, please use the following checks: > > if (mac_len != 8 && mac_len != 20) > err out As I've already said, this is not possible. The person who designed this netlink extension screwed it up and 'mac_len' is fixed to 32 bytes. Even if they didn't, the netlink common code should validate the input length for the drivers and confirm it is dev->addr_len, as is done for all other cases. Unbreaking it is a UAPI change, not impossible, but do we really care enough about 8 or 20 to push for that? > If you have an app that queries a device, sees that it has a 20byte MAC, > and then tries to set that MAC, it may have no specific knowledge of how > IPoIB relates to the underlying verbs device and could just be trying to > set a MAC blindly. And that's ok, but we aren't going to honor the > entire MAC they pass in, and that's likely to confuse the app (or the > person running the app). Yes, I agree with this, but no matter what we do, apps using this feature need some kind of IB specific knowledge to format the set. If that specific knoweldge is 'ignore the LL_ADDRESS size and use 8 byte' or 'specially format the 20 byte' seems pretty much equal to me. > On the other hand, if we advertise a 20 byte MAC, and the app/person > passes in an 8 byte MAC instead, then that implicitly denotes that the > app/person *knows* this is an IPoIB device, and *knows* that we are > actually setting the underlying GUID of the verbs device, and is aware > of and prepared for the fact that the entire 20 byte MAC is not in fact > settable. Well, as above, this is unworkable, but even if we could specify the length, I think this is goofy. You've been concentrating on the set side, but there is a get side too, and the response to set is a get packet. What does get return? If we accept 8 or 20, then get must return 20. Should get response to a set of 8 return 8 or 20? The kernel UAPI is much cleaner if there is one case: 20 byte LLADDR. The kernel internal API is much cleaner if drivers don't have the option to use two sizes for their LLADDR. Just checking that the 20 byte address is sane (right GID prefix, 0 QPN, right flags) should be enough for the app to denote it knows what it is doing. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <20150526181336.GD11800-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <20150526181336.GD11800-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-26 20:32 ` Doug Ledford [not found] ` <1432672378.28905.178.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Doug Ledford @ 2015-05-26 20:32 UTC (permalink / raw) To: Jason Gunthorpe Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai [-- Attachment #1: Type: text/plain, Size: 4930 bytes --] On Tue, 2015-05-26 at 12:13 -0600, Jason Gunthorpe wrote: > On Tue, May 26, 2015 at 12:53:46PM -0400, Doug Ledford wrote: > > > Yes, it is ugly that we are using a IPoIB netdev to configure the RDMA > device. > > If you want to suggest that we use the RDMA Netlink interface to do > this, I had thought of exactly that... > that would make sense to me too. But I thought the whole point > was to provide something somewhat software compatible with the > ethernet world.. Not so much ethernet world as netdevice world. The iproute2 program is used to configure any and all netdevices, ethernet or otherwise. Right now, we can abuse it to do the same here, but it uses the netdevice ndo_ ops, not rtnetlink to accomplish what it does, so we are limited in how we do thing if we want to maintain tool usage. > > The fact that the 20 byte LLADDR of IPoIB is problematic is born out by > > multiple facts: we only use 8 bytes in DHCP requests and responses, > > only use 8 bytes in matching IPoIB netdevices against HWADDR entries in > > the sysconfig/ifcfg-* files, only use 8 bytes for matching the udev > > rules to rename IPoIB devices, etc. > > The LLADDR for IPoIB *is* 20 bytes. > > Truncating it down is *broken userspace*: > - DHCP: Not sending the full 20 bytes in the client request means the > server cannot unicast the reply. This causes all sorts of problems > and is discouraged in the RFCs these days. Reference? The RFCs I've read (4390 -> 4361 -> 3315) list a number of options (three at the moment), but the LLADDR options all call for using a LLADDR from a device that is a permanent part of the machine (not common with add in cards), so the option most commonly used in IB is option 2, DUID Assigned by Vendor, aka GUID. According to that, truncating to 8 bytes is precisely what you are supposed to do. And, at least in all current Red Hat products, that's exactly how dhcp client creates the client-id. > - ifcfg/udev/networkmanager: So what happens when I do > ip link add link ib0 name ib0.1 type ipoib > And get two IPoIB interfaces with the same GUID? I doubt any sane > user would want to apply the same config to those two interfaces. No, they probably don't want to apply the same rules to both interfaces. I'm not entirely sure I agree with the argument though. I fully expected this to fail without a pkey argument on the ip command line. The net stack doesn't allow users to do the same thing with Ethernet devices, so I'm not sure we shouldn't be disallowing this as opposed to creating duplicate devices that are identical in all ways except name. > As I've already said, this is not possible. The person who designed > this netlink extension screwed it up and 'mac_len' is fixed to 32 > bytes. You're right. We don't have a way of passing in the length we wish to set versus the size of the message. > Even if they didn't, the netlink common code should validate the input > length for the drivers and confirm it is dev->addr_len, as is done for > all other cases. > > Unbreaking it is a UAPI change, not impossible, but do we really care > enough about 8 or 20 to push for that? In truth, at least right now, it's all moot. Since we can't set the subnet prefix, the qpn, or the flags, anything above 8 bytes is immutable regardless of how many bytes we pass in. So even if we say we aren't going to change the UAPI and for everything to 20, the real world result is that 8 works exactly the same and has no functional difference. > What does get return? If we accept 8 or 20, then get must return 20. The get has to return 20 regardless. It's the only accepted means of getting all 20 bytes of the LLADDR. There was a hack long ago (that might even still work, but I haven't checked) where you could call the ioctl to get the lladdr, which was supposed to be limited to 6 bytes, but if the caller of the ioctl passed in an oversized buffer, the kernel would write the full 20 bytes as long as the buffer could hold it. But that hack may have long since been closed up. > Should get response to a set of 8 return 8 or 20? 20. > The kernel UAPI is much cleaner if there is one case: 20 byte LLADDR. As I mentioned above, due to the constraints of what we can set and the fact that we don't actually *tell* the kernel what size we've filled out in the 32byte struct, it's really all moot at this point. > The kernel internal API is much cleaner if drivers don't have the > option to use two sizes for their LLADDR. > > Just checking that the 20 byte address is sane (right GID prefix, 0 > QPN, right flags) should be enough for the app to denote it knows what > it is doing. Yes, checking the subnet prefix and flags would work. -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <1432672378.28905.178.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <1432672378.28905.178.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-05-26 21:11 ` Jason Gunthorpe [not found] ` <20150526211114.GB4502-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-26 21:11 UTC (permalink / raw) To: Doug Ledford Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Tue, May 26, 2015 at 04:32:58PM -0400, Doug Ledford wrote: > Not so much ethernet world as netdevice world. The iproute2 program is > used to configure any and all netdevices, ethernet or otherwise. Right > now, we can abuse it to do the same here, but it uses the netdevice ndo_ > ops, not rtnetlink to accomplish what it does, so we are limited in how > we do thing if we want to maintain tool usage. Hmm? iproute2 does it over rtnetlink? > > The LLADDR for IPoIB *is* 20 bytes. > > > > Truncating it down is *broken userspace*: > > - DHCP: Not sending the full 20 bytes in the client request means the > > server cannot unicast the reply. This causes all sorts of problems > > and is discouraged in the RFCs these days. > > Reference? The RFCs I've read (4390 -> 4361 -> 3315) list a number of > options (three at the moment), but the LLADDR options all call for using I'm talking about this part from RFC 4390: As described above, the link-layer address is unavailable to the DHCP server because the link-layer address is larger than the "chaddr" field length. As a result, the server cannot unicast its reply to the client. Therefore, a DHCP client MUST request that the server send a broadcast reply by setting the BROADCAST flag when IPoIB Address Resolution Protocol (ARP) is not possible, i.e., in situations where the client does not know its IP address. AFAIK, nobody ever solved this, and it actually does cause real world problems for cloud stuff as there is limited randomness in the TID. This is the network side of DHCP. > a LLADDR from a device that is a permanent part of the machine (not > common with add in cards), so the option most commonly used in IB is > option 2, DUID Assigned by Vendor, aka GUID. According to that, > truncating to 8 bytes is precisely what you are supposed to do. And, at > least in all current Red Hat products, that's exactly how dhcp client > creates the client-id. Using the GUID as the client-id is sort of OK from a policy perspective (ie what IP should I use), but it doesn't help the network side, and it breaks down completely when you create child interfaces. Basically, the dhcp server not having the LLADDR at all is a pretty big hack.. No other network I know of runs DHCP like that. > > - ifcfg/udev/networkmanager: So what happens when I do > > ip link add link ib0 name ib0.1 type ipoib > > And get two IPoIB interfaces with the same GUID? I doubt any sane > > user would want to apply the same config to those two interfaces. > > No, they probably don't want to apply the same rules to both interfaces. > I'm not entirely sure I agree with the argument though. I fully > expected this to fail without a pkey argument on the ip command > line. Does that matter to the above tools? Are they using PKey,GUID as their key? > The net stack doesn't allow users to do the same thing with Ethernet > devices, so I'm not sure we shouldn't be disallowing this as opposed to > creating duplicate devices that are identical in all ways except name. The netstack doesn't allow it for ethernet because it would create a 2nd identical LLADDR, and LLADDRs must be unique. Because the QPN is part of the LLADDR IB can create two interfaces on the same physical port that are completely separated by hardware. Read Haggi's email, he explains how they plan to use this to create interfaces that can be delegated to namespaces. It is not a bad idea really.. So prepare for a world where each namespace has a child IPoIB interface with a unique QPN, but the same Pkey and GUID as the host. The breakage from assuming GUID == unique will become a problem. > > Unbreaking it is a UAPI change, not impossible, but do we really care > > enough about 8 or 20 to push for that? > > In truth, at least right now, it's all moot. Since we can't set the > subnet prefix, the qpn, or the flags, anything above 8 bytes is > immutable regardless of how many bytes we pass in. So even if we say we > aren't going to change the UAPI and for everything to 20, the real world > result is that 8 works exactly the same and has no functional > difference. Not quite, in the 20 byte format the 8 bytes of the GUID are in the last 8/20 bytes, so the app would have to place 12 zeros and then the GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID) This is why the question of 'what is ILFA_VF_MAC' is so important, every option presented (MAC,GUID,LLADDR) are incompatible with each other. > > What does get return? If we accept 8 or 20, then get must return 20. > > The get has to return 20 regardless. It's the only accepted means of > getting all 20 bytes of the LLADDR. You are conflating IFLA_ADDRESS and IFLA_VF_MAC. IFLA_VF_MAC could be 8 byte and IFLA_ADDRESS could be 20, I think that makes no sense, but it wouldn't break existing stuff. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <20150526211114.GB4502-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <20150526211114.GB4502-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-27 13:01 ` Or Gerlitz 2015-05-27 14:14 ` Doug Ledford 1 sibling, 0 replies; 31+ messages in thread From: Or Gerlitz @ 2015-05-27 13:01 UTC (permalink / raw) To: Jason Gunthorpe, Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On 5/27/2015 12:11 AM, Jason Gunthorpe wrote: > On Tue, May 26, 2015 at 04:32:58PM -0400, Doug Ledford wrote: > >>> - ifcfg/udev/networkmanager: So what happens when I do >>> ip link add link ib0 name ib0.1 type ipoib >>> And get two IPoIB interfaces with the same GUID? I doubt any sane >>> user would want to apply the same config to those two interfaces. >> No, they probably don't want to apply the same rules to both interfaces. >> I'm not entirely sure I agree with the argument though. I fully >> expected this to fail without a pkey argument on the ip command >> line. > Does that matter to the above tools? Are they using PKey,GUID as their > key? > >> The net stack doesn't allow users to do the same thing with Ethernet >> devices, so I'm not sure we shouldn't be disallowing this as opposed to >> creating duplicate devices that are identical in all ways except name. > The netstack doesn't allow it for ethernet because it would create a > 2nd identical LLADDR, and LLADDRs must be unique. > > Because the QPN is part of the LLADDR IB can create two interfaces on > the same physical port that are completely separated by hardware. Read > Haggi's email, he explains how they plan to use this to create > interfaces that can be delegated to namespaces. It is not a bad idea > really.. > > So prepare for a world where each namespace has a child IPoIB > interface with a unique QPN, but the same Pkey and GUID as the > host. The breakage from assuming GUID == unique will become a problem. > >>> Unbreaking it is a UAPI change, not impossible, but do we really care >>> enough about 8 or 20 to push for that? >> In truth, at least right now, it's all moot. Since we can't set the >> subnet prefix, the qpn, or the flags, anything above 8 bytes is >> immutable regardless of how many bytes we pass in. So even if we say we >> aren't going to change the UAPI and for everything to 20, the real world >> result is that 8 works exactly the same and has no functional >> difference. > Not quite, in the 20 byte format the 8 bytes of the GUID are in the > last 8/20 bytes, so the app would have to place 12 zeros and then the > GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID) > > This is why the question of 'what is ILFA_VF_MAC' is so important, > every option presented (MAC,GUID,LLADDR) are incompatible with each > other. I agree with Doug that to be practical here, libvirt and Co. would really want to use rtnetlink based provisioning of IB VFs, at least in a similar manner done for Eth VFs. So with this assumption at hand, my vote goes to having user-space to provide the eight bytes of vGUID through the ndo_set_vf_mac call into IPoIB. I don't see the real value of user space providing the four zero bytes (19-16) and the 8 bytes of the subnet prefix provided by the SM. My personal thinking is that the important thing to address is consistency between what the virtualization system provisions on the host (ndo_set_vf_mac) to the DHCP server scheme they build. Do we have a go here? Also few comments on DHCP: If we're talking on different vlans/Eth or pkey/IB - it's totally OK for two entities (== IPoIB instances under IB) on the physical subnet to use the same identifier (IB/GUID, Eth/MAC) if they are on two different L2 broadcast domains. The DHCP server is expected to have a different mapping scheme per such virtual L2 subnet. For SRIOV, we don't expect two VFs on the network to use the same vGUID, so DHCP wise we should be OK. Today the Client-ID works fine for SRIOV schemes which are based on 8byte vGUIDs. Re two IPoIB child devices using the same GUID and the same pkey, we can enhance the system and take advantage of IB Alias GUIDs which today are only used for SRIOV for Para-Virtual and other environments too, thanks for the heads up on the necessity of doing so. > >>> What does get return? If we accept 8 or 20, then get must return 20. >> The get has to return 20 regardless. It's the only accepted means of >> getting all 20 bytes of the LLADDR. > You are conflating IFLA_ADDRESS and IFLA_VF_MAC. > > IFLA_VF_MAC could be 8 byte and IFLA_ADDRESS could be 20, I think that > makes no sense, but it wouldn't break existing stuff. > Just to make sure we're on the same page, this thread deals with using rtnetlink's IFLA_VF_MAC(== struct ifla_vf_mac) for provisioning vGUID for IB VFs, through the PF IPoIB interface, not attempting to use IFLA_ADDRESS. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <20150526211114.GB4502-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 2015-05-27 13:01 ` Or Gerlitz @ 2015-05-27 14:14 ` Doug Ledford [not found] ` <1432736046.28905.215.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 31+ messages in thread From: Doug Ledford @ 2015-05-27 14:14 UTC (permalink / raw) To: Jason Gunthorpe Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai [-- Attachment #1: Type: text/plain, Size: 10211 bytes --] On Tue, 2015-05-26 at 15:11 -0600, Jason Gunthorpe wrote: > On Tue, May 26, 2015 at 04:32:58PM -0400, Doug Ledford wrote: > > > Not so much ethernet world as netdevice world. The iproute2 program is > > used to configure any and all netdevices, ethernet or otherwise. Right > > now, we can abuse it to do the same here, but it uses the netdevice ndo_ > > ops, not rtnetlink to accomplish what it does, so we are limited in how > > we do thing if we want to maintain tool usage. > > Hmm? iproute2 does it over rtnetlink? Yes, it does. Sorry for being imprecise. It uses the specific netlink packet that maps back to the predefined ndo_ entry point and so you can't use that netlink on anything other than a netdevice with a defined ndo_ entry point. We can't use it, for example, to extend the new RDMA netlink operations that Kaike posted without also making modifications to iproute2 to know about the new netlink. > > > The LLADDR for IPoIB *is* 20 bytes. > > > > > > Truncating it down is *broken userspace*: > > > - DHCP: Not sending the full 20 bytes in the client request means the > > > server cannot unicast the reply. This causes all sorts of problems > > > and is discouraged in the RFCs these days. > > > > Reference? The RFCs I've read (4390 -> 4361 -> 3315) list a number of > > options (three at the moment), but the LLADDR options all call for using > > I'm talking about this part from RFC 4390: > > As described above, the link-layer address is unavailable to the DHCP > server because the link-layer address is larger than the "chaddr" > field length. As a result, the server cannot unicast its reply to > the client. Therefore, a DHCP client MUST request that the server > send a broadcast reply by setting the BROADCAST flag when IPoIB > Address Resolution Protocol (ARP) is not possible, i.e., in > situations where the client does not know its IP address. > > AFAIK, nobody ever solved this, and it actually does cause real world > problems for cloud stuff as there is limited randomness in the > TID. This is the network side of DHCP. > > > a LLADDR from a device that is a permanent part of the machine (not > > common with add in cards), so the option most commonly used in IB is > > option 2, DUID Assigned by Vendor, aka GUID. According to that, > > truncating to 8 bytes is precisely what you are supposed to do. And, at > > least in all current Red Hat products, that's exactly how dhcp client > > creates the client-id. > > Using the GUID as the client-id is sort of OK from a policy > perspective (ie what IP should I use), but it doesn't help the network > side, and it breaks down completely when you create child interfaces. Actually, no, it doesn't. My standard test network inside Red Hat uses child interfaces and dhcp exclusively and it all works as expected. This is because each child interface has its own unique pkey and all of the devices include the pkey in the broadcast address, so one child does not see another child's broadcast reply, not the parent's. It does, however, mean that you only want a single link on a given pkey for a given device, which is why adding a second ipoib link without a unique pkey seems so broken to me. > Basically, the dhcp server not having the LLADDR at all is a pretty > big hack.. No other network I know of runs DHCP like that. That's how they decided to solve the issue in the RFCs, so that's what we have. > > > - ifcfg/udev/networkmanager: So what happens when I do > > > ip link add link ib0 name ib0.1 type ipoib > > > And get two IPoIB interfaces with the same GUID? I doubt any sane > > > user would want to apply the same config to those two interfaces. > > > > No, they probably don't want to apply the same rules to both interfaces. > > I'm not entirely sure I agree with the argument though. I fully > > expected this to fail without a pkey argument on the ip command > > line. > > Does that matter to the above tools? Are they using PKey,GUID as their > key? They are pkey aware, yes. There is the parent device, and all non-default pkey devices are listed as a pkey device and given a pkey number and they share the same GUID. They are considered children of the parent, just like with vlan setups. > > The net stack doesn't allow users to do the same thing with Ethernet > > devices, so I'm not sure we shouldn't be disallowing this as opposed to > > creating duplicate devices that are identical in all ways except name. > > The netstack doesn't allow it for ethernet because it would create a > 2nd identical LLADDR, and LLADDRs must be unique. And as far as the configuration scripts (at least on Red Hat) as well as NetworkManager is concerned, the unique requirements are GUID/P_Key. But, the P_Key doesn't show up in the LLADDR, only in the broadcast address, so only the GUID is checked in the LLADDR. All IPoIB devices are either parent devices (meaning no PKEY field specified in the config file) in which case they are the first started and match the GUID, or they are PKEY devices and are started only after their parent device is brought up (and here I think they actually match on parent name, not on GUID, but I would have to double check that). > Because the QPN is part of the LLADDR IB can create two interfaces on > the same physical port that are completely separated by hardware. Read > Haggi's email, he explains how they plan to use this to create > interfaces that can be delegated to namespaces. It is not a bad idea > really.. Yes, it is actually. The whole reason we went to GUID matching long ago was because of this exact issue. Actually, allow me to be perfectly clear about this point: the qpn changing on IPoIB interfaces *drove* us to drop the qpn from device matching. In addition, links that were slow to come up (such as 40GBit links that needed more time to synchronize at 40GBit/s than it took to try and start the IPoIB interface) drove us to drop the subnet prefix from the match because you could start to configure your IPoIB interfaces before OpenSM had a chance to tell you what your subnet prefix actually was. So we were forced to stick with only the 8byte GUID. You can not put a transient item into your device identifier, *ever*. The problem here is that even a slight change in ordering of module loading can change the qpn that each IPoIB device gets. Or, even easier to demonstrate, was the fact that if you rmmod ib_ipoib; modprobe ib_ipoib, then all of your devices will fail to start because all of your qpns no longer match up! The *only* way this will ever be a workable item is if we A) reserve a number of queue pairs from the driver specifically for IPoIB use and B) specify which queue pairs go to which IPoIB devices at IPoIB module load time. Short of that, using qpn in the device identifier is a complete non-starter. And that still doesn't address the subnet prefix either. If you want that included, then we need the ability to tell ib_ipoib what each link's subnet prefix will be once it talks to the SM so we don't have the wrong device identifier because we booted up faster than the link synchronized. > > So prepare for a world where each namespace has a child IPoIB > interface with a unique QPN, but the same Pkey and GUID as the > host. The breakage from assuming GUID == unique will become a problem. See above. Without further changes, this is a total non-starter. And this will require coordination with initscripts and NetworkManager (at least, I don't know if other distros use their own custom tools here). > > > Unbreaking it is a UAPI change, not impossible, but do we really care > > > enough about 8 or 20 to push for that? > > > > In truth, at least right now, it's all moot. Since we can't set the > > subnet prefix, the qpn, or the flags, anything above 8 bytes is > > immutable regardless of how many bytes we pass in. So even if we say we > > aren't going to change the UAPI and for everything to 20, the real world > > result is that 8 works exactly the same and has no functional > > difference. > > Not quite, in the 20 byte format the 8 bytes of the GUID are in the > last 8/20 bytes, so the app would have to place 12 zeros and then the > GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID) > > This is why the question of 'what is ILFA_VF_MAC' is so important, > every option presented (MAC,GUID,LLADDR) are incompatible with each > other. For Ethernet devices, it's the MAC. The equivalent of MAC on IB is the GUID. I would leave it at that. IPoIB devices are constructs on top of the GUID/link, and you can have 10 IPoIB interfaces between the parent and children, but we don't need to specify all of those LLADDRs, we just need to give a unique GUID and allow the guest OS to create their own IPoIB devices on top of that. > > > What does get return? If we accept 8 or 20, then get must return 20. > > > > The get has to return 20 regardless. It's the only accepted means of > > getting all 20 bytes of the LLADDR. > > You are conflating IFLA_ADDRESS and IFLA_VF_MAC. > > IFLA_VF_MAC could be 8 byte and IFLA_ADDRESS could be 20, I think that > makes no sense, but it wouldn't break existing stuff. Sorry, you're right. I was thinking about getting the address of the parent IPoIB device we are talking to, not getting the VF_MAC. For VF_MAC, I would say it should be the 8 byte GUID. In a world where we are configuring a base device via a parent base device, then as you suggest it would make no sense that the two would be different sizes. But in a world where we are using a construct on top of the base device in order to access config space for the base device, and we are configuring SRIOV instances of the base device for use in a guest (and not our construct we are accessing things through), then our access device and configure device need not have the same address size. -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <1432736046.28905.215.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <1432736046.28905.215.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-05-27 17:11 ` Jason Gunthorpe [not found] ` <20150527171143.GB9909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-27 17:11 UTC (permalink / raw) To: Doug Ledford Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Wed, May 27, 2015 at 10:14:06AM -0400, Doug Ledford wrote: > > Because the QPN is part of the LLADDR IB can create two interfaces on > > the same physical port that are completely separated by hardware. Read > > Haggi's email, he explains how they plan to use this to create > > interfaces that can be delegated to namespaces. It is not a bad idea > > really.. > > Yes, it is actually. The whole reason we went to GUID matching long ago > was because of this exact issue. I reflected on this some more last night, and yes, I am leaning toward 'bad idea' direction too. Too much stuff breaks if you create multiple children with the same pkey/guid: - RDMA CM cannot disambiguate CM packets between them - DHCP cannot tell them apart - Net scripts/network manager won't work - IPv6 becomes totally broken That means the namespace stuff will have to create children using GUID aliases.. > The *only* way this will ever be a workable item is if we A) reserve a > number of queue pairs from the driver specifically for IPoIB use and B) > specify which queue pairs go to which IPoIB devices at IPoIB module > load This basic idea is exactly why I think we should stick with the 20 byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the VF what QPN to use for IPoIB (if we ever see HW support to implement that) If we use 8 bytes then that route is closed off forever. > > Not quite, in the 20 byte format the 8 bytes of the GUID are in the > > last 8/20 bytes, so the app would have to place 12 zeros and then the > > GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID) > > > > This is why the question of 'what is ILFA_VF_MAC' is so important, > > every option presented (MAC,GUID,LLADDR) are incompatible with each > > other. > > For Ethernet devices, it's the MAC. The equivalent of MAC on IB is the > GUID. I would leave it at that. Yes, both arguments can be made: - Our netlink end point is targetting an IPoIB interface, and the equivelent to an Ethernet MAC in IPoIB language is the LLADDR. - Our netlink interface is targetting the hardware under the IPoIB interface and that MAC equivilent is the GUID > IPoIB devices are constructs on top of > the GUID/link, and you can have 10 IPoIB interfaces between the parent > and children, but we don't need to specify all of those LLADDRs, we just > need to give a unique GUID and allow the guest OS to create their own > IPoIB devices on top of that. As I've said, I would like to see netdev review that idea before we merge any patches.. There are pragmatic downsides to the 8 byte choice: Userspace completely looses the ability to size the address without a table based on link type. That is terrible in the context of netlink's design. For instance iproute2 would need IB specific code to format the 'ip link show' (review print_vfinfo in iproute2) and to length check 'ip link set vf mac' If we do use 8, then it would be ideal (and my strong preference) to also fix the IFLA_VF_MAC message to have a working length. I think that could be done compatibly with a bit of work. At least that way iproute2 can be kept clean when it learns to do IB, and we could have the option again of using 20 someday if we need. So to be clear, to go with the 8 byte option I suggest: - Engage netdev/iproute and confirm they are philosophically OK with IFLA_VF_MAC != IFLA_ADDRESS - Make a kernel patch to properly size the IFLA_VF_MAC message - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat) - Drop in the IB patch Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <20150527171143.GB9909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <20150527171143.GB9909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-05-27 17:53 ` Doug Ledford [not found] ` <1432749191.28905.243.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Doug Ledford @ 2015-05-27 17:53 UTC (permalink / raw) To: Jason Gunthorpe Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai [-- Attachment #1: Type: text/plain, Size: 5184 bytes --] On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote: > On Wed, May 27, 2015 at 10:14:06AM -0400, Doug Ledford wrote: > > > Because the QPN is part of the LLADDR IB can create two interfaces on > > > the same physical port that are completely separated by hardware. Read > > > Haggi's email, he explains how they plan to use this to create > > > interfaces that can be delegated to namespaces. It is not a bad idea > > > really.. > > > > Yes, it is actually. The whole reason we went to GUID matching long ago > > was because of this exact issue. > > I reflected on this some more last night, and yes, I am leaning toward > 'bad idea' direction too. > > Too much stuff breaks if you create multiple children with the same > pkey/guid: > - RDMA CM cannot disambiguate CM packets between them > - DHCP cannot tell them apart > - Net scripts/network manager won't work > - IPv6 becomes totally broken > > That means the namespace stuff will have to create children using GUID > aliases.. Glad we agree on that ;-) > > The *only* way this will ever be a workable item is if we A) reserve a > > number of queue pairs from the driver specifically for IPoIB use and B) > > specify which queue pairs go to which IPoIB devices at IPoIB module > > load > > This basic idea is exactly why I think we should stick with the 20 > byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the > VF what QPN to use for IPoIB (if we ever see HW support to implement that) > > If we use 8 bytes then that route is closed off forever. And that's exactly as it should be. If we allow setting all 20 bytes via the VF_MAC calls, then we violate the "guests should behave like they are on bare metal as much as possible" rule. As a host, we get a GUID and if we want to control the QPNs for IPoIB (and indeed if we want to control how many IPoIB interfaces and on what P_Keys) then we must create config files in /etc/sysconfig/network-scripts (on Red Hat, similar requirements on other distros) that would instruct the OS to create exactly what we want. But, the key point is that we are only given a GUID, and we must create everything else from our config files. Guests should be the same way. They only get the GUID to start, then the guest disk image with its self contained configuration will take over and control the rest. > > > Not quite, in the 20 byte format the 8 bytes of the GUID are in the > > > last 8/20 bytes, so the app would have to place 12 zeros and then the > > > GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID) > > > > > > This is why the question of 'what is ILFA_VF_MAC' is so important, > > > every option presented (MAC,GUID,LLADDR) are incompatible with each > > > other. > > > > For Ethernet devices, it's the MAC. The equivalent of MAC on IB is the > > GUID. I would leave it at that. > > Yes, both arguments can be made: > - Our netlink end point is targetting an IPoIB interface, and > the equivelent to an Ethernet MAC in IPoIB language is the LLADDR. > - Our netlink interface is targetting the hardware under the IPoIB > interface and that MAC equivilent is the GUID > > > IPoIB devices are constructs on top of > > the GUID/link, and you can have 10 IPoIB interfaces between the parent > > and children, but we don't need to specify all of those LLADDRs, we just > > need to give a unique GUID and allow the guest OS to create their own > > IPoIB devices on top of that. > > As I've said, I would like to see netdev review that idea before we > merge any patches.. > > There are pragmatic downsides to the 8 byte choice: Userspace > completely looses the ability to size the address without a table > based on link type. That is terrible in the context of netlink's > design. Well, let's just be clear: netlink/iproute2 screwed the pooch on their implementation of this stuff. Any solution that doesn't include fixing this up in some way is not really a good solution. > For instance iproute2 would need IB specific code to format > the 'ip link show' (review print_vfinfo in iproute2) and to length > check 'ip link set vf mac' > > If we do use 8, then it would be ideal (and my strong preference) to > also fix the IFLA_VF_MAC message to have a working length. I think > that could be done compatibly with a bit of work. At least that way > iproute2 can be kept clean when it learns to do IB, and we could have > the option again of using 20 someday if we need. > > So to be clear, to go with the 8 byte option I suggest: > - Engage netdev/iproute and confirm they are philosophically OK > with IFLA_VF_MAC != IFLA_ADDRESS > - Make a kernel patch to properly size the IFLA_VF_MAC message > - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo > instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat) > - Drop in the IB patch Sounds like a reasonable plan. Or, this is your patch set, are you going to pick up these action items? -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <1432749191.28905.243.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <1432749191.28905.243.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-05-27 18:29 ` Jason Gunthorpe 2015-05-27 21:58 ` Or Gerlitz 1 sibling, 0 replies; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-27 18:29 UTC (permalink / raw) To: Doug Ledford Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Wed, May 27, 2015 at 01:53:11PM -0400, Doug Ledford wrote: > > This basic idea is exactly why I think we should stick with the 20 > > byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the > > VF what QPN to use for IPoIB (if we ever see HW support to implement that) > > > > If we use 8 bytes then that route is closed off forever. > > And that's exactly as it should be. If we allow setting all 20 bytes > via the VF_MAC calls, then we violate the "guests should behave like > they are on bare metal as much as possible" rule. As a host, we get a > GUID and if we want to control the QPNs for IPoIB (and indeed if we > want No, that doesn't really work. The PF's QPN pool is shared between all VF guests. The VF HCA does a call to the PF driver to get a free QPN to use. The VM cannot unilaterally select a QPN without global coordination. In other words, if a QPN is to be forced for IPoIB then it must come from the PF, which must get it from the orchestration software, which must globally coordinate to ensure the QPN is available on every PF the VM may want to run on. Similar to how the MAC must be globally unique. Will this ever be done? Probably not. But if it was, this interface would be the sanest way to go. Anyhow - as far as I'm concerned, fixing the IFLA_VF_MAC to have a proper length nullifies my concern. If we really needed we can switch to a 20 byte format with good compatability options in the future once we have a proper length. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <1432749191.28905.243.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2015-05-27 18:29 ` Jason Gunthorpe @ 2015-05-27 21:58 ` Or Gerlitz [not found] ` <CAJ3xEMjXXKy=DSeksTFEX-GAN=nYm_6nn5msvsYOwnp0roEHJQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 31+ messages in thread From: Or Gerlitz @ 2015-05-27 21:58 UTC (permalink / raw) To: Doug Ledford Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Wed, May 27, 2015 at 8:53 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote: > Well, let's just be clear: netlink/iproute2 screwed the pooch on their > implementation of this stuff. Any solution that doesn't include fixing > this up in some way is not really a good solution. >> For instance iproute2 would need IB specific code to format >> the 'ip link show' (review print_vfinfo in iproute2) and to length >> check 'ip link set vf mac' >> If we do use 8, then it would be ideal (and my strong preference) to >> also fix the IFLA_VF_MAC message to have a working length. I think >> that could be done compatibly with a bit of work. At least that way >> iproute2 can be kept clean when it learns to do IB, and we could have >> the option again of using 20 someday if we need. >> > Sounds like a reasonable plan. > Or, this is your patch set, are you going to pick up these action items? Let see >> So to be clear, to go with the 8 byte option I suggest: >> - Engage netdev/iproute and confirm they are philosophically OK >> with IFLA_VF_MAC != IFLA_ADDRESS the last thing netdev are is having philosophical views, they are very practical (and non-perfect and happy and lively community), the one thing before the last netdev are is caring for the rdma subsystem. If something has to change @ their direct part, we should come with patches. >> - Make a kernel patch to properly size the IFLA_VF_MAC message You mean the below structure which is expected by the kernel after they see the IFLA_VF_MAC NL attribute struct ifla_vf_mac { __u32 vf; __u8 mac[32]; /* MAX_ADDR_LEN */ }; How you thought to patch things such that the size of the address provided by user-space will propagate into the kernel w.o breaking the NL ABI here? Why not just take the eight lower bytes and set them NL --> ipoib --> PF driver >> - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo >> instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat) I was thinking to patch iproute to sense the link type: if eth print six bytes, if ipoib print 8 bytes, simple. >> - Drop in the IB patch sounds good. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <CAJ3xEMjXXKy=DSeksTFEX-GAN=nYm_6nn5msvsYOwnp0roEHJQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <CAJ3xEMjXXKy=DSeksTFEX-GAN=nYm_6nn5msvsYOwnp0roEHJQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-05-27 22:42 ` Jason Gunthorpe 0 siblings, 0 replies; 31+ messages in thread From: Jason Gunthorpe @ 2015-05-27 22:42 UTC (permalink / raw) To: Or Gerlitz Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai On Thu, May 28, 2015 at 12:58:59AM +0300, Or Gerlitz wrote: > On Wed, May 27, 2015 at 8:53 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote: > > Well, let's just be clear: netlink/iproute2 screwed the pooch on their > > implementation of this stuff. Any solution that doesn't include fixing > > this up in some way is not really a good solution. > > >> For instance iproute2 would need IB specific code to format > >> the 'ip link show' (review print_vfinfo in iproute2) and to length > >> check 'ip link set vf mac' > > >> If we do use 8, then it would be ideal (and my strong preference) to > >> also fix the IFLA_VF_MAC message to have a working length. I think > >> that could be done compatibly with a bit of work. At least that way > >> iproute2 can be kept clean when it learns to do IB, and we could have > >> the option again of using 20 someday if we need. > >> > > > Sounds like a reasonable plan. > > Or, this is your patch set, are you going to pick up these action items? > > Let see > > >> So to be clear, to go with the 8 byte option I suggest: > > >> - Engage netdev/iproute and confirm they are philosophically OK > >> with IFLA_VF_MAC != IFLA_ADDRESS > > the last thing netdev are is having philosophical views, they are very > practical (and non-perfect and happy and lively community), the one > thing before the last netdev are is caring for the rdma subsystem. If > something has to change @ their direct part, we should come with > patches. Yes, you need to talk in patches, I will suggest you start with this (totally untested, quickly thrown together): diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 8de36824018d..715b59e9925a 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -1172,7 +1172,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev, nla_nest_cancel(skb, vfinfo); goto nla_put_failure; } - if (nla_put(skb, IFLA_VF_MAC, sizeof(vf_mac), &vf_mac) || + if (nla_put(skb, IFLA_VF_MAC, sizeof(u32) + dev->addr_len, &vf_mac) || nla_put(skb, IFLA_VF_VLAN, sizeof(vf_vlan), &vf_vlan) || nla_put(skb, IFLA_VF_RATE, sizeof(vf_rate), &vf_rate) || @@ -1292,7 +1292,8 @@ static const struct nla_policy ifla_vfinfo_policy[IFLA_VF_INFO_MAX+1] = { }; static const struct nla_policy ifla_vf_policy[IFLA_VF_MAX+1] = { - [IFLA_VF_MAC] = { .len = sizeof(struct ifla_vf_mac) }, + [IFLA_VF_MAC] = { .type = NLA_BINARY, + .len = sizeof(struct ifla_vf_mac) }, [IFLA_VF_VLAN] = { .len = sizeof(struct ifla_vf_vlan) }, [IFLA_VF_TX_RATE] = { .len = sizeof(struct ifla_vf_tx_rate) }, [IFLA_VF_SPOOFCHK] = { .len = sizeof(struct ifla_vf_spoofchk) }, @@ -1447,6 +1448,21 @@ static int do_setvfinfo(struct net_device *dev, struct nlattr *attr) switch (nla_type(vf)) { case IFLA_VF_MAC: { struct ifla_vf_mac *ivm; + int sz; + + /* Legacy compatability, old iproute2 will pass in 32 + * bytes for everything, assume it is a 6 byte MAC + * because that is all old iproute really supported. + */ + sz = nla_len(attr); + if (sz == 36) + sz = 10; + + if (sz != (sizeof(u32) + dev->addr_len)) { + err = -EINVAL; + break; + } + ivm = nla_data(vf); err = -EOPNOTSUPP; if (ops->ndo_set_vf_mac) And make the argument why the above is OK. - Reader side only ever read 6 bytes anyhow, so it doesn't matter that ifla_vf_mac is undersized in the RTA_DATA - Confirm any other users of this (openstack?) you know are similarly OK. - Old write side will always send the full 36 byte struct, so assume 6 bytes. - Justify this as needing to add IB support next which does not use a 6 byte addr_len New code that wants to work with IB will have to use correct lengths. Then based on that iproute2 can be fixed in the two areas I pointed out, doing that at the same time will probably earn some rep. Finally you can make a patch that switches addr_len to something else - that would trigger the philosophical question of 'what actually is a IFLA_VF_MAC', if everyone is OK with changing it then golden. > >> - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo > >> instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat) > > I was thinking to patch iproute to sense the link type: if eth print > six bytes, if ipoib print 8 bytes, simple. This is against the design ideals of netlink: a tool like iproute should not need link specific knowledge to parse kernel messages. Now is the best time to make this fix because we can use the sz == 36 so sz == 10 assumption. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs [not found] ` <CAJ3xEMhG2W6WzxC4Kc2CFmdMwTRUF5ppBgcDZ6SMA=kgJowUpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-05-25 22:32 ` Jason Gunthorpe @ 2015-05-26 16:53 ` Doug Ledford 1 sibling, 0 replies; 31+ messages in thread From: Doug Ledford @ 2015-05-26 16:53 UTC (permalink / raw) To: Or Gerlitz Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Amir Vadai [-- Attachment #1: Type: text/plain, Size: 1717 bytes --] On Tue, 2015-05-26 at 00:50 +0300, Or Gerlitz wrote: > On Tue, May 26, 2015 at 12:14 AM, Jason Gunthorpe > <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > > On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote: > > > >> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port > >> GUID, and as Jason noted the user/kernel API allows to deliver up to > >> 32 bytes between user and kernel under the set_vf_mac flow > >> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through > >> **non-modified** ip tool and net/core/rtnetlink.c things just work - > >> I can set eight bytes value to be the virtual port GUID : > > > > Was I not perfectly clear? You have to use the 20 byte LLADDR format here: > > > >> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22 > > Jason, > > I am aiming to provision the VF IB end-node address == port GUID (vGUID) > in the same manner that VF Eth end-node address is their MAC, not > more, not less. > > 20 bytes are the lladdr of IPoIB devices which isn't the VF IB > end-node address but > rather made of flags (1B) + QPN (3B) + subnet prefix (8B) + VF GUID -- > way more then > the virtualization system care or can provision. > > >> Port GUID: 0x2211ffeeddccbbaa > > > The byte order got screwed up someplace. > > thx, will fix Or, while you are working on this, please attempt setting a larger than 12byte vlan->pkey. The actual ndo_ for setting a vlan uses a u16, so it's possible that just like the MAC setting, our pkey setting might work unaltered too. -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2015-05-27 22:42 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-05-21 16:24 [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs Or Gerlitz
[not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-21 16:24 ` [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration Or Gerlitz
[not found] ` <1432225447-6536-2-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-21 18:46 ` Jason Gunthorpe
[not found] ` <20150521184613.GD6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-21 20:05 ` Or Gerlitz
[not found] ` <CAJ3xEMgJvXjg3aFbTNEudj9QWMfP4==eBq0ccuhjuVJsv9mRmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-21 21:01 ` Jason Gunthorpe
2015-05-21 21:05 ` Doug Ledford
[not found] ` <1432242331.28905.67.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-21 21:21 ` Jason Gunthorpe
2015-05-21 16:24 ` [PATCH RFC 2/3] IB/mlx4: Refactor Alias GUID storing Or Gerlitz
2015-05-21 16:24 ` [PATCH RFC 3/3] IB/mlx4: Add support for SRIOV VF management Or Gerlitz
2015-05-21 16:40 ` [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs Doug Ledford
[not found] ` <1432226406.28905.22.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-21 19:55 ` Or Gerlitz
[not found] ` <CAJ3xEMjzpqnQuA=0HQaN8noVq04d9BkVvEWGY7Lq5ZntVTKm4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-21 21:11 ` Doug Ledford
[not found] ` <1432242708.28905.73.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-21 21:44 ` Jason Gunthorpe
2015-05-25 20:04 ` Or Gerlitz
[not found] ` <CAJ3xEMh3BaxJzCu9mV9m6ZMwgrDaO2UvTyS1i=DEPq9nuLX3oA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-25 21:14 ` Jason Gunthorpe
[not found] ` <20150525211433.GA9186-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-25 21:50 ` Or Gerlitz
[not found] ` <CAJ3xEMhG2W6WzxC4Kc2CFmdMwTRUF5ppBgcDZ6SMA=kgJowUpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-25 22:32 ` Jason Gunthorpe
[not found] ` <20150525223235.GA9858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-26 3:33 ` Or Gerlitz
[not found] ` <CAJ3xEMjSU0xVWyqd8v_-OO5JvsHycGTU6gg=BHpZD8PSqRfzQg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-26 5:58 ` Jason Gunthorpe
2015-05-26 16:53 ` Doug Ledford
[not found] ` <1432659226.28905.151.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-26 18:13 ` Jason Gunthorpe
[not found] ` <20150526181336.GD11800-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-26 20:32 ` Doug Ledford
[not found] ` <1432672378.28905.178.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-26 21:11 ` Jason Gunthorpe
[not found] ` <20150526211114.GB4502-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-27 13:01 ` Or Gerlitz
2015-05-27 14:14 ` Doug Ledford
[not found] ` <1432736046.28905.215.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-27 17:11 ` Jason Gunthorpe
[not found] ` <20150527171143.GB9909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-27 17:53 ` Doug Ledford
[not found] ` <1432749191.28905.243.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-27 18:29 ` Jason Gunthorpe
2015-05-27 21:58 ` Or Gerlitz
[not found] ` <CAJ3xEMjXXKy=DSeksTFEX-GAN=nYm_6nn5msvsYOwnp0roEHJQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-27 22:42 ` Jason Gunthorpe
2015-05-26 16:53 ` Doug Ledford
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox