* [PATCH net-next 00/13] Add mlx5 subfunction support
@ 2020-11-12 19:24 Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 01/13] devlink: Prepare code to fill multiple port function attributes Parav Pandit
                   ` (13 more replies)
  0 siblings, 14 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit
Hi Dave, Jakub, Greg,
This series introduces support for mlx5 subfunction (SF).
A subfunction is a portion of a PCI device that supports multiple
classes of devices such as netdev, RDMA and more.
This patchset is based on Leon's series [3].
It is the third user of the proposed auxiliary bus [4].
Subfunction support is discussed in detail in RFC [1] and [2].
RFC [1] and extension [2] describe requirements, design, and proposed
plumbing using devlink, auxiliary bus and sysfs for systemd/udev
support.
Patch summary:
--------------
Patches 1 to 6 prepare devlink:
Patch-1 prepares code to handle multiple port function attributes
Patch-2 introduces devlink pcisf port flavour similar to pcipf and pcivf
Patch-3 adds port add and delete driver callbacks
Patch-4 adds port function state get and set callbacks
Patch-5 refactors devlink to avoid using global mutex
Patch-6 uses refcount to allow creating devlink instance from existing
one
Patches 7 to 13 implement mlx5 pieces for SF support.
Patch-7 adds SF auxiliary device
Patch-8 adds SF auxiliary driver
Patch-9 prepares eswitch to handle SF vport
Patch-10 adds eswitch helpers to add/remove SF vport
Patch-11 adds SF device configuration commands
Patch-12 implements devlink port add/del callbacks
Patch-13 implements devlink port function get/set callbacks
More on SF plumbing below.
overview:
--------
A subfunction can be created and deleted by a user using devlink port
add/delete interface.
A subfunction can be configured using devlink port function attributes
before it is activated.
When a subfunction is activated, it results in an auxiliary device.
A driver binds to the auxiliary device that further creates supported
class devices.
short example sequence:
-----------------------
Change device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
Configure mac address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active
Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4
$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff
$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.28.1002 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.28.1002 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
subfunction (SF) in detail:
---------------------------
- A sub-function is a portion of the PCI device which supports multiple
  classes of devices such as netdev, RDMA and more.
- A SF netdev has its own dedicated queues (txq, rxq).
- A SF RDMA device has its own QP1, GID table and other RDMA resources.
- A SF supports eswitch representation and tc offload support similar
  to existing PF and VF representors.
- User must configure eswitch to send/receive SF's packets.
- A SF shares PCI level resources with other SFs and/or with its
  parent PCI function.
  For example, an SF shares IRQ vectors with other SFs and its
  PCI function.
  In the future, each SF may have a dedicated IRQ vector.
  A SF has a dedicated window in PCI BAR space that is not shared
  with other SFs or PF. This ensures that when a SF is assigned to
  an application, only that application can access device resources.
- SF's auxiliary device exposes an sfnum sysfs attribute. This will be
  used by systemd/udev to assign deterministic names to its netdev and
  RDMA device.
[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://marc.info/?l=linux-netdev&m=158555928517777&w=2
[3] https://lists.linuxfoundation.org/pipermail/virtualization/2020-November/050473.html
[4] https://lore.kernel.org/linux-rdma/20201023003338.1285642-2-david.m.ertman@intel.com/
Parav Pandit (11):
  devlink: Prepare code to fill multiple port function attributes
  devlink: Introduce PCI SF port flavour and port attribute
  devlink: Support add and delete devlink port
  devlink: Support get and set state of port function
  devlink: Avoid global devlink mutex, use per instance reload lock
  devlink: Introduce devlink refcount to reduce scope of global
    devlink_mutex
  net/mlx5: SF, Add auxiliary device support
  net/mlx5: SF, Add auxiliary device driver
  net/mlx5: E-switch, Add eswitch helpers for SF vport
  net/mlx5: SF, Add port add delete functionality
  net/mlx5: SF, Port function state change support
Vu Pham (2):
  net/mlx5: E-switch, Prepare eswitch to handle SF vport
  net/mlx5: SF, Add SF configuration hardware commands
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |  19 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   4 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   7 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   2 +-
 .../mellanox/mlx5/core/esw/acl/egress_ofld.c  |   2 +-
 .../mellanox/mlx5/core/esw/devlink_port.c     |  41 ++
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  46 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  82 +++
 .../mellanox/mlx5/core/eswitch_offloads.c     |  47 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |  48 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  10 +
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  20 +
 .../net/ethernet/mellanox/mlx5/core/sf/cmd.c  |  48 ++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.c  | 213 ++++++++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.h  |  68 +++
 .../mellanox/mlx5/core/sf/dev/driver.c        | 105 ++++
 .../net/ethernet/mellanox/mlx5/core/sf/priv.h |  14 +
 .../net/ethernet/mellanox/mlx5/core/sf/sf.c   | 498 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  59 +++
 .../net/ethernet/mellanox/mlx5/core/vport.c   |   3 +-
 include/linux/mlx5/driver.h                   |  12 +-
 include/net/devlink.h                         |  82 +++
 include/uapi/linux/devlink.h                  |  26 +
 net/core/devlink.c                            | 362 +++++++++++--
 25 files changed, 1754 insertions(+), 73 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
-- 
2.26.2
* [PATCH net-next 01/13] devlink: Prepare code to fill multiple port function attributes
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 02/13] devlink: Introduce PCI SF port flavour and port attribute Parav Pandit
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit,
	Vu Pham
Prepare code to fill zero or more port function optional attributes.
Subsequent patch makes use of this to fill more port function
attributes.
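The pattern this patch prepares for can be sketched in user space as follows. This is only an illustrative model, not the kernel code: struct fake_msg, attr_fill_fn, the three fill_* helpers and function_attrs_put() are hypothetical names, and the nest start/cancel/end steps are modelled with plain counters rather than real netlink calls. Each filler reports through msg_updated whether it added anything, and the nest is kept only when no filler failed and at least one attribute was added.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a netlink message with one open nest. */
struct fake_msg {
	int nested_attrs;  /* attributes added inside the current nest */
	bool nest_emitted; /* true once a non-empty nest has been finalized */
};

/* A filler adds one optional attribute and sets *msg_updated only if it did. */
typedef int (*attr_fill_fn)(struct fake_msg *msg, bool *msg_updated);

/* Attribute not supported for this port: not an error, nothing is added. */
int fill_unsupported(struct fake_msg *msg, bool *msg_updated)
{
	(void)msg;
	(void)msg_updated;
	return 0;
}

/* Attribute added successfully. */
int fill_hw_addr(struct fake_msg *msg, bool *msg_updated)
{
	msg->nested_attrs++;
	*msg_updated = true;
	return 0;
}

/* Filler that fails, for example because the message has no room left. */
int fill_fails(struct fake_msg *msg, bool *msg_updated)
{
	(void)msg;
	(void)msg_updated;
	return -EMSGSIZE;
}

/* Run all fillers; keep the nest only if none failed and one added data. */
int function_attrs_put(struct fake_msg *msg, const attr_fill_fn *fillers, size_t count)
{
	bool msg_updated = false;
	int err = 0;
	size_t i;

	assert(msg && (fillers || count == 0));
	msg->nested_attrs = 0; /* models starting the nest */

	for (i = 0; i < count && !err; i++)
		err = fillers[i](msg, &msg_updated);

	if (err || !msg_updated)
		msg->nested_attrs = 0; /* models cancelling the nest */
	else
		msg->nest_emitted = true; /* models ending the nest */
	return err;
}
```

With this shape, adding another optional attribute means adding one more filler; the cancel-or-end decision stays in one place.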
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
---
 net/core/devlink.c | 63 +++++++++++++++++++++++-----------------------
 1 file changed, 32 insertions(+), 31 deletions(-)
diff --git a/net/core/devlink.c b/net/core/devlink.c
index a578634052a3..75cca9cbb9d9 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -695,6 +695,31 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg,
 	return 0;
 }
 
+static int
+devlink_port_function_hw_addr_fill(struct devlink *devlink, const struct devlink_ops *ops,
+				   struct devlink_port *port, struct sk_buff *msg,
+				   struct netlink_ext_ack *extack, bool *msg_updated)
+{
+	u8 hw_addr[MAX_ADDR_LEN];
+	int hw_addr_len;
+	int err;
+
+	if (!ops->port_function_hw_addr_get)
+		return 0;
+
+	err = ops->port_function_hw_addr_get(devlink, port, hw_addr, &hw_addr_len, extack);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+	err = nla_put(msg, DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR, hw_addr_len, hw_addr);
+	if (err)
+		return err;
+	*msg_updated = true;
+	return 0;
+}
+
 static int
 devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *port,
 				   struct netlink_ext_ack *extack)
@@ -702,36 +727,16 @@ devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *por
 	struct devlink *devlink = port->devlink;
 	const struct devlink_ops *ops;
 	struct nlattr *function_attr;
-	bool empty_nest = true;
-	int err = 0;
+	bool msg_updated = false;
+	int err;
 
 	function_attr = nla_nest_start_noflag(msg, DEVLINK_ATTR_PORT_FUNCTION);
 	if (!function_attr)
 		return -EMSGSIZE;
 
 	ops = devlink->ops;
-	if (ops->port_function_hw_addr_get) {
-		int hw_addr_len;
-		u8 hw_addr[MAX_ADDR_LEN];
-
-		err = ops->port_function_hw_addr_get(devlink, port, hw_addr, &hw_addr_len, extack);
-		if (err == -EOPNOTSUPP) {
-			/* Port function attributes are optional for a port. If port doesn't
-			 * support function attribute, returning -EOPNOTSUPP is not an error.
-			 */
-			err = 0;
-			goto out;
-		} else if (err) {
-			goto out;
-		}
-		err = nla_put(msg, DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR, hw_addr_len, hw_addr);
-		if (err)
-			goto out;
-		empty_nest = false;
-	}
-
-out:
-	if (err || empty_nest)
+	err = devlink_port_function_hw_addr_fill(devlink, ops, port, msg, extack, &msg_updated);
+	if (err || !msg_updated)
 		nla_nest_cancel(msg, function_attr);
 	else
 		nla_nest_end(msg, function_attr);
@@ -964,7 +969,6 @@ devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port *
 	const struct devlink_ops *ops;
 	const u8 *hw_addr;
 	int hw_addr_len;
-	int err;
 
 	hw_addr = nla_data(attr);
 	hw_addr_len = nla_len(attr);
@@ -989,12 +993,7 @@ devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port *
 		return -EOPNOTSUPP;
 	}
 
-	err = ops->port_function_hw_addr_set(devlink, port, hw_addr, hw_addr_len, extack);
-	if (err)
-		return err;
-
-	devlink_port_notify(port, DEVLINK_CMD_PORT_NEW);
-	return 0;
+	return ops->port_function_hw_addr_set(devlink, port, hw_addr, hw_addr_len, extack);
 }
 
 static int
@@ -1015,6 +1014,8 @@ devlink_port_function_set(struct devlink *devlink, struct devlink_port *port,
 	if (attr)
 		err = devlink_port_function_hw_addr_set(devlink, port, attr, extack);
 
+	if (!err)
+		devlink_port_notify(port, DEVLINK_CMD_PORT_NEW);
 	return err;
 }
 
-- 
2.26.2
* [PATCH net-next 02/13] devlink: Introduce PCI SF port flavour and port attribute
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 01/13] devlink: Prepare code to fill multiple port function attributes Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 03/13] devlink: Support add and delete devlink port Parav Pandit
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit,
	Vu Pham
A PCI sub-function (SF) represents a portion of the device similar
to PCI VF.
In an eswitch, a PCI SF may have a port, which is normally represented
using a representor netdevice.
To have better visibility of eswitch port, its association with SF,
and its representor netdevice, introduce a PCI SF port flavour.
When devlink port flavour is PCI SF, fill up PCI SF attributes of the
port.
Extend port name creation using PCI PF and SF number scheme on best
effort basis, so that vendor drivers can skip defining their own
scheme.
This is done as cApfNsfM, where A, N and M are controller, PCI PF and
PCI SF number respectively.
This is similar to existing naming for PCI PF and PCI VF ports.
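The naming scheme can be illustrated with a small user-space sketch. It assumes, as the snprintf() calls in this patch do, that the "c<A>" prefix is emitted only for ports of an external controller. struct sf_port_attrs and sf_port_name() are hypothetical names used for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space stand-in for the PCI SF port attributes. */
struct sf_port_attrs {
	unsigned int controller;
	unsigned short pf;
	unsigned int sf;
	bool external;
};

/*
 * Build "pf<N>sf<M>", prefixed with "c<A>" only for an external controller.
 * Returns the length of the name, or -1 if the buffer is too small.
 */
int sf_port_name(const struct sf_port_attrs *attrs, char *name, size_t len)
{
	char *start = name;
	int n;

	assert(attrs && name && len > 0);

	if (attrs->external) {
		n = snprintf(name, len, "c%u", attrs->controller);
		if (n < 0 || (size_t)n >= len)
			return -1;
		name += n;
		len -= (size_t)n;
	}
	n = snprintf(name, len, "pf%usf%u", (unsigned int)attrs->pf, attrs->sf);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return (int)strlen(start);
}
```

For a local port with pfnum 0 and sfnum 88 this yields "pf0sf88", which matches the suffix udev appends in names such as ens2f0npf0sf88.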
An example view of a PCI SF port:
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state active opstate attached
$ devlink port show pci/0000:06:00.0/32768 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "eth0",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
---
 include/net/devlink.h        | 17 +++++++++++++
 include/uapi/linux/devlink.h |  5 ++++
 net/core/devlink.c           | 46 ++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)
diff --git a/include/net/devlink.h b/include/net/devlink.h
index b01bb9bca5a2..1b7c9fbc607a 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -92,6 +92,20 @@ struct devlink_port_pci_vf_attrs {
 	u8 external:1;
 };
 
+/**
+ * struct devlink_port_pci_sf_attrs - devlink port's PCI SF attributes
+ * @controller: Associated controller number
+ * @pf: Associated PCI PF number for this port.
+ * @sf: Associated PCI SF of the PCI PF for this port.
+ * @external: when set, indicates if a port is for an external controller
+ */
+struct devlink_port_pci_sf_attrs {
+	u32 controller;
+	u16 pf;
+	u32 sf;
+	u8 external:1;
+};
+
 /**
  * struct devlink_port_attrs - devlink port object
  * @flavour: flavour of the port
@@ -113,6 +127,7 @@ struct devlink_port_attrs {
 		struct devlink_port_phys_attrs phys;
 		struct devlink_port_pci_pf_attrs pci_pf;
 		struct devlink_port_pci_vf_attrs pci_vf;
+		struct devlink_port_pci_sf_attrs pci_sf;
 	};
 };
 
@@ -1401,6 +1416,8 @@ void devlink_port_attrs_pci_pf_set(struct devlink_port *devlink_port, u32 contro
 				   u16 pf, bool external);
 void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port, u32 controller,
 				   u16 pf, u16 vf, bool external);
+void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, u32 controller,
+				   u16 pf, u32 sf, bool external);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 0113bc4db9f5..57065722b9c3 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -200,6 +200,10 @@ enum devlink_port_flavour {
 	DEVLINK_PORT_FLAVOUR_UNUSED, /* Port which exists in the switch, but
 				      * is not used in any way.
 				      */
+	DEVLINK_PORT_FLAVOUR_PCI_SF, /* Represents eswitch port
+				      * for the PCI SF. It is an internal
+				      * port that faces the PCI SF.
+				      */
 };
 
 enum devlink_param_cmode {
@@ -527,6 +531,7 @@ enum devlink_attr {
 	DEVLINK_ATTR_RELOAD_STATS_VALUE,	/* u32 */
 	DEVLINK_ATTR_REMOTE_RELOAD_STATS,	/* nested */
 
+	DEVLINK_ATTR_PORT_PCI_SF_NUMBER,	/* u32 */
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 75cca9cbb9d9..b1e849b624a6 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -673,6 +673,15 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg,
 		if (nla_put_u8(msg, DEVLINK_ATTR_PORT_EXTERNAL, attrs->pci_vf.external))
 			return -EMSGSIZE;
 		break;
+	case DEVLINK_PORT_FLAVOUR_PCI_SF:
+		if (nla_put_u32(msg, DEVLINK_ATTR_PORT_CONTROLLER_NUMBER,
+				attrs->pci_sf.controller) ||
+		    nla_put_u16(msg, DEVLINK_ATTR_PORT_PCI_PF_NUMBER, attrs->pci_sf.pf) ||
+		    nla_put_u32(msg, DEVLINK_ATTR_PORT_PCI_SF_NUMBER, attrs->pci_sf.sf))
+			return -EMSGSIZE;
+		if (nla_put_u8(msg, DEVLINK_ATTR_PORT_EXTERNAL, attrs->pci_sf.external))
+			return -EMSGSIZE;
+		break;
 	case DEVLINK_PORT_FLAVOUR_PHYSICAL:
 	case DEVLINK_PORT_FLAVOUR_CPU:
 	case DEVLINK_PORT_FLAVOUR_DSA:
@@ -8330,6 +8339,33 @@ void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port, u32 contro
 }
 EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_vf_set);
 
+/**
+ *	devlink_port_attrs_pci_sf_set - Set PCI SF port attributes
+ *
+ *	@devlink_port: devlink port
+ *	@controller: associated controller number for the devlink port instance
+ *	@pf: associated PF for the devlink port instance
+ *	@sf: associated SF of a PF for the devlink port instance
+ *	@external: indicates if the port is for an external controller
+ */
+void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, u32 controller,
+				   u16 pf, u32 sf, bool external)
+{
+	struct devlink_port_attrs *attrs = &devlink_port->attrs;
+	int ret;
+
+	if (WARN_ON(devlink_port->registered))
+		return;
+	ret = __devlink_port_attrs_set(devlink_port, DEVLINK_PORT_FLAVOUR_PCI_SF);
+	if (ret)
+		return;
+	attrs->pci_sf.controller = controller;
+	attrs->pci_sf.pf = pf;
+	attrs->pci_sf.sf = sf;
+	attrs->pci_sf.external = external;
+}
+EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_sf_set);
+
 static int __devlink_port_phys_port_name_get(struct devlink_port *devlink_port,
 					     char *name, size_t len)
 {
@@ -8378,6 +8414,16 @@ static int __devlink_port_phys_port_name_get(struct devlink_port *devlink_port,
 		n = snprintf(name, len, "pf%uvf%u",
 			     attrs->pci_vf.pf, attrs->pci_vf.vf);
 		break;
+	case DEVLINK_PORT_FLAVOUR_PCI_SF:
+		if (attrs->pci_sf.external) {
+			n = snprintf(name, len, "c%u", attrs->pci_sf.controller);
+			if (n >= len)
+				return -EINVAL;
+			len -= n;
+			name += n;
+		}
+		n = snprintf(name, len, "pf%usf%u", attrs->pci_sf.pf, attrs->pci_sf.sf);
+		break;
 	}
 
 	if (n >= len)
-- 
2.26.2
* [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 01/13] devlink: Prepare code to fill multiple port function attributes Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 02/13] devlink: Introduce PCI SF port flavour and port attribute Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-18 16:21   ` David Ahern
  2020-11-12 19:24 ` [PATCH net-next 04/13] devlink: Support get and set state of port function Parav Pandit
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit,
	Vu Pham
Extend the devlink interface so that a user can add and delete a port.
Extend devlink to connect user requests to the driver to add/delete
such a port in the device.
When driver routines are invoked, the devlink instance lock is not held.
This enables the driver to register and unregister several devlink
objects (port, health reporter, resource, etc.) by using existing
devlink APIs.
This also helps to uniformly use the code for port unregistration
during driver unload and during port deletion initiated by user.
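A driver-side port add handler would typically begin by validating the requested attributes. The following user-space sketch shows one possible shape of that check. It is hypothetical: the struct only mirrors the fields introduced by this patch, example_port_new_check() is an invented name, the error strings stand in for extack messages, and the policy shown (PCI SF only, no user-chosen port index, SF number required) is an assumption for illustration rather than the behaviour of any particular driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flavours; the real enum lives in the devlink UAPI header. */
enum example_flavour { EXAMPLE_FLAVOUR_PHYSICAL, EXAMPLE_FLAVOUR_PCI_SF };

/* User-space mirror of the port add attributes, using bool validity flags. */
struct example_port_new_attrs {
	enum example_flavour flavour;
	unsigned int port_index;
	uint32_t controller;
	uint32_t sfnum;
	uint16_t pfnum;
	bool port_index_valid;
	bool controller_valid;
	bool sfnum_valid;
};

/*
 * Return 0 if the request is acceptable under the assumed policy, or
 * -EOPNOTSUPP with *errmsg set (standing in for an extack message).
 */
int example_port_new_check(const struct example_port_new_attrs *attrs,
			   const char **errmsg)
{
	assert(attrs && errmsg);
	*errmsg = NULL;

	if (attrs->flavour != EXAMPLE_FLAVOUR_PCI_SF) {
		*errmsg = "Only PCI SF ports can be added";
		return -EOPNOTSUPP;
	}
	if (attrs->port_index_valid) {
		*errmsg = "User-chosen port index is not supported";
		return -EOPNOTSUPP;
	}
	if (!attrs->sfnum_valid) {
		*errmsg = "An SF number must be provided";
		return -EOPNOTSUPP;
	}
	return 0;
}
```

Because the core does not hold the instance lock around the callback, a real handler would also need its own synchronization between concurrent add and delete requests; that part is omitted here.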
Examples of add, show and delete commands:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached
$ udevadm test-builtin net_id /sys/class/net/eth0
Load module index
Parsed configuration file /usr/lib/systemd/network/99-default.link
Created link configuration context.
Using default interface naming scheme 'v245'.
ID_NET_NAMING_SCHEME=v245
ID_NET_NAME_PATH=enp6s0f0npf0sf88
ID_NET_NAME_SLOT=ens2f0npf0sf88
Unload module index
Unloaded link configuration context.
$ devlink port del pci/0000:06:00.0/32768
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
---
 include/net/devlink.h | 38 ++++++++++++++++++++++++
 net/core/devlink.c    | 67 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 105 insertions(+)
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 1b7c9fbc607a..3991345ef3e2 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -152,6 +152,17 @@ struct devlink_port {
 	struct mutex reporters_lock; /* Protects reporter_list */
 };
 
+struct devlink_port_new_attrs {
+	enum devlink_port_flavour flavour;
+	unsigned int port_index;
+	u32 controller;
+	u32 sfnum;
+	u16 pfnum;
+	u8 port_index_valid:1,
+	   controller_valid:1,
+	   sfnum_valid:1;
+};
+
 struct devlink_sb_pool_info {
 	enum devlink_sb_pool_type pool_type;
 	u32 size;
@@ -1360,6 +1371,33 @@ struct devlink_ops {
 	int (*port_function_hw_addr_set)(struct devlink *devlink, struct devlink_port *port,
 					 const u8 *hw_addr, int hw_addr_len,
 					 struct netlink_ext_ack *extack);
+	/**
+	 * @port_new: Port add function.
+	 *
+	 * Should be used by device driver to let caller add new port of a specified flavour
+	 * with optional attributes.
+	 * Driver should return -EOPNOTSUPP if it doesn't support port addition of a specified
+	 * flavour or specified attributes. Driver should set extack error message in case of
+	 * failure to add the port.
+	 * devlink core does not hold a devlink instance lock when this callback is invoked.
+	 * Driver must ensure synchronization when adding or deleting a port. Driver must
+	 * register a port with devlink core.
+	 */
+	int (*port_new)(struct devlink *devlink, const struct devlink_port_new_attrs *attrs,
+			struct netlink_ext_ack *extack);
+	/**
+	 * @port_del: Port delete function.
+	 *
+	 * Should be used by device driver to let caller delete port which was previously created
+	 * using port_new() callback.
+	 * Driver should return -EOPNOTSUPP if it doesn't support port deletion.
+	 * Driver should set extack error message in case of failure to delete the port.
+	 * devlink core does not hold a devlink instance lock when this callback is invoked.
+	 * Driver must ensure synchronization when adding or deleting a port. Driver must
+	 * unregister the port from devlink core.
+	 */
+	int (*port_del)(struct devlink *devlink, unsigned int port_index,
+			struct netlink_ext_ack *extack);
 };
 
 static inline void *devlink_priv(struct devlink *devlink)
diff --git a/net/core/devlink.c b/net/core/devlink.c
index b1e849b624a6..dccdf36afba6 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -1124,6 +1124,57 @@ static int devlink_nl_cmd_port_unsplit_doit(struct sk_buff *skb,
 	return devlink_port_unsplit(devlink, port_index, info->extack);
 }
 
+static int devlink_nl_cmd_port_new_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct devlink_port_new_attrs new_attrs = {};
+	struct devlink *devlink = info->user_ptr[0];
+
+	if (!info->attrs[DEVLINK_ATTR_PORT_FLAVOUR] ||
+	    !info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]) {
+		NL_SET_ERR_MSG_MOD(extack, "Port flavour or PCI PF are not specified");
+		return -EINVAL;
+	}
+	new_attrs.flavour = nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_FLAVOUR]);
+	new_attrs.pfnum = nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]);
+
+	if (info->attrs[DEVLINK_ATTR_PORT_INDEX]) {
+		new_attrs.port_index = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_INDEX]);
+		new_attrs.port_index_valid = true;
+	}
+	if (info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]) {
+		new_attrs.controller =
+			nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]);
+		new_attrs.controller_valid = true;
+	}
+	if (info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]) {
+		new_attrs.sfnum = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]);
+		new_attrs.sfnum_valid = true;
+	}
+
+	if (!devlink->ops->port_new)
+		return -EOPNOTSUPP;
+
+	return devlink->ops->port_new(devlink, &new_attrs, extack);
+}
+
+static int devlink_nl_cmd_port_del_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct devlink *devlink = info->user_ptr[0];
+	unsigned int port_index;
+
+	if (!info->attrs[DEVLINK_ATTR_PORT_INDEX]) {
+		NL_SET_ERR_MSG_MOD(extack, "Port index is not specified");
+		return -EINVAL;
+	}
+	port_index = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_INDEX]);
+
+	if (!devlink->ops->port_del)
+		return -EOPNOTSUPP;
+	return devlink->ops->port_del(devlink, port_index, extack);
+}
+
 static int devlink_nl_sb_fill(struct sk_buff *msg, struct devlink *devlink,
 			      struct devlink_sb *devlink_sb,
 			      enum devlink_command cmd, u32 portid,
@@ -7565,6 +7616,10 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_RELOAD_ACTION] = NLA_POLICY_RANGE(NLA_U8, DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
 							DEVLINK_RELOAD_ACTION_MAX),
 	[DEVLINK_ATTR_RELOAD_LIMITS] = NLA_POLICY_BITFIELD32(DEVLINK_RELOAD_LIMITS_VALID_MASK),
+	[DEVLINK_ATTR_PORT_FLAVOUR] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_PORT_PCI_PF_NUMBER] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_PORT_PCI_SF_NUMBER] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER] = { .type = NLA_U32 },
 };
 
 static const struct genl_small_ops devlink_nl_ops[] = {
@@ -7604,6 +7659,18 @@ static const struct genl_small_ops devlink_nl_ops[] = {
 		.flags = GENL_ADMIN_PERM,
 		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
 	},
+	{
+		.cmd = DEVLINK_CMD_PORT_NEW,
+		.doit = devlink_nl_cmd_port_new_doit,
+		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
+	},
+	{
+		.cmd = DEVLINK_CMD_PORT_DEL,
+		.doit = devlink_nl_cmd_port_del_doit,
+		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
+	},
 	{
 		.cmd = DEVLINK_CMD_SB_GET,
 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
-- 
2.26.2
* [PATCH net-next 04/13] devlink: Support get and set state of port function
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
                   ` (2 preceding siblings ...)
  2020-11-12 19:24 ` [PATCH net-next 03/13] devlink: Support add and delete devlink port Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 05/13] devlink: Avoid global devlink mutex, use per instance reload lock Parav Pandit
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit,
	Vu Pham
A devlink port function can be in active or inactive state.
Allow users to get and set the port function's state.
When the port function is activated, its operational state may change
after a while, once the device is created and a driver binds to it.
The same applies to the deactivation flow.
To clearly describe the state of the port function and its device's
operational state in the host system, define state and opstate
attributes.
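The relationship between the two attributes can be modelled with a short user-space sketch. The names below (struct port_function and the fn_* helpers) are hypothetical, and the rule that deletion is safe only once the function is both inactive and detached is an assumption drawn from the opstate documentation in this patch. The point is that the user changes only the administrative state, while opstate follows later when a driver binds or unbinds.

```c
#include <assert.h>
#include <stdbool.h>

enum fn_state { FN_STATE_INACTIVE, FN_STATE_ACTIVE };
enum fn_opstate { FN_OPSTATE_DETACHED, FN_OPSTATE_ATTACHED };

/* Hypothetical model of a port function's two pieces of state. */
struct port_function {
	enum fn_state state;     /* administrative state set by the user */
	enum fn_opstate opstate; /* whether a driver is currently attached */
};

/* User request: only the administrative state changes immediately. */
void fn_set_state(struct port_function *fn, enum fn_state state)
{
	assert(fn);
	fn->state = state;
}

/* Later event: a driver bound to the device created on activation. */
void fn_driver_bound(struct port_function *fn)
{
	assert(fn);
	if (fn->state == FN_STATE_ACTIVE)
		fn->opstate = FN_OPSTATE_ATTACHED;
}

/* Later event: the driver finished unbinding from the device. */
void fn_driver_unbound(struct port_function *fn)
{
	assert(fn);
	fn->opstate = FN_OPSTATE_DETACHED;
}

/* Graceful teardown: wait until the function is inactive and detached. */
bool fn_safe_to_delete(const struct port_function *fn)
{
	assert(fn);
	return fn->state == FN_STATE_INACTIVE &&
	       fn->opstate == FN_OPSTATE_DETACHED;
}
```

A management tool following this model would set the state to inactive, poll until opstate reports detached, and only then delete the port.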
Example of a PCI SF port which supports a port function:
Create a device with ID=10 and one physical port.
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached
$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active
$ devlink port show pci/0000:06:00.0/32768 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "eth0",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
---
 include/net/devlink.h        | 21 +++++++++
 include/uapi/linux/devlink.h | 21 +++++++++
 net/core/devlink.c           | 89 +++++++++++++++++++++++++++++++++++-
 3 files changed, 130 insertions(+), 1 deletion(-)
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 3991345ef3e2..124bac130c22 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1371,6 +1371,27 @@ struct devlink_ops {
 	int (*port_function_hw_addr_set)(struct devlink *devlink, struct devlink_port *port,
 					 const u8 *hw_addr, int hw_addr_len,
 					 struct netlink_ext_ack *extack);
+	/**
+	 * @port_function_state_get: Port function's state get function.
+	 *
+	 * Should be used by device drivers to report the state of a function managed
+	 * by the devlink port. Driver should return -EOPNOTSUPP if it doesn't support port
+	 * function handling for a particular port.
+	 */
+	int (*port_function_state_get)(struct devlink *devlink, struct devlink_port *port,
+				       enum devlink_port_function_state *state,
+				       enum devlink_port_function_opstate *opstate,
+				       struct netlink_ext_ack *extack);
+	/**
+	 * @port_function_state_set: Port function's state set function.
+	 *
+	 * Should be used by device drivers to set the state of a function managed
+	 * by the devlink port. Driver should return -EOPNOTSUPP if it doesn't support port
+	 * function handling for a particular port.
+	 */
+	int (*port_function_state_set)(struct devlink *devlink, struct devlink_port *port,
+				       enum devlink_port_function_state state,
+				       struct netlink_ext_ack *extack);
 	/**
 	 * @port_new: Port add function.
 	 *
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 57065722b9c3..0c6eb3add736 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -581,9 +581,30 @@ enum devlink_resource_unit {
 enum devlink_port_function_attr {
 	DEVLINK_PORT_FUNCTION_ATTR_UNSPEC,
 	DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR,	/* binary */
+	DEVLINK_PORT_FUNCTION_ATTR_STATE,	/* u8 */
+	DEVLINK_PORT_FUNCTION_ATTR_OPSTATE,	/* u8 */
 
 	__DEVLINK_PORT_FUNCTION_ATTR_MAX,
 	DEVLINK_PORT_FUNCTION_ATTR_MAX = __DEVLINK_PORT_FUNCTION_ATTR_MAX - 1
 };
 
+enum devlink_port_function_state {
+	DEVLINK_PORT_FUNCTION_STATE_INACTIVE,
+	DEVLINK_PORT_FUNCTION_STATE_ACTIVE,
+};
+
+/**
+ * enum devlink_port_function_opstate - indicates operational state of port function
+ * @DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED: Driver is attached to the function of port, for
+ *					    graceful tear down of the function, after
+ *					    inactivation of the port function, user should wait
+ *					    for operational state to turn DETACHED.
+ * @DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED: Driver is detached from the function of port; it is
+ *					    safe to delete the port.
+ */
+enum devlink_port_function_opstate {
+	DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED,
+	DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED,
+};
+
 #endif /* _UAPI_LINUX_DEVLINK_H_ */
diff --git a/net/core/devlink.c b/net/core/devlink.c
index dccdf36afba6..3e59ba73d5c4 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -87,6 +87,9 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_trap_report);
 
 static const struct nla_policy devlink_function_nl_policy[DEVLINK_PORT_FUNCTION_ATTR_MAX + 1] = {
 	[DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR] = { .type = NLA_BINARY },
+	[DEVLINK_PORT_FUNCTION_ATTR_STATE] =
+		NLA_POLICY_RANGE(NLA_U8, DEVLINK_PORT_FUNCTION_STATE_INACTIVE,
+				 DEVLINK_PORT_FUNCTION_STATE_ACTIVE),
 };
 
 static LIST_HEAD(devlink_list);
@@ -729,6 +732,52 @@ devlink_port_function_hw_addr_fill(struct devlink *devlink, const struct devlink
 	return 0;
 }
 
+static bool devlink_port_function_state_valid(enum devlink_port_function_state state)
+{
+	return state == DEVLINK_PORT_FUNCTION_STATE_INACTIVE ||
+	       state == DEVLINK_PORT_FUNCTION_STATE_ACTIVE;
+}
+
+static bool devlink_port_function_opstate_valid(enum devlink_port_function_opstate state)
+{
+	return state == DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED ||
+	       state == DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED;
+}
+
+static int devlink_port_function_state_fill(struct devlink *devlink, const struct devlink_ops *ops,
+					    struct devlink_port *port, struct sk_buff *msg,
+					    struct netlink_ext_ack *extack, bool *msg_updated)
+{
+	enum devlink_port_function_opstate opstate;
+	enum devlink_port_function_state state;
+	int err;
+
+	if (!ops->port_function_state_get)
+		return 0;
+
+	err = ops->port_function_state_get(devlink, port, &state, &opstate, extack);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+	if (!devlink_port_function_state_valid(state)) {
+		WARN_ON_ONCE(1);
+		NL_SET_ERR_MSG_MOD(extack, "Invalid state value read from driver");
+		return -EINVAL;
+	}
+	if (!devlink_port_function_opstate_valid(opstate)) {
+		WARN_ON_ONCE(1);
+		NL_SET_ERR_MSG_MOD(extack, "Invalid operational state value read from driver");
+		return -EINVAL;
+	}
+	if (nla_put_u8(msg, DEVLINK_PORT_FUNCTION_ATTR_STATE, state) ||
+	    nla_put_u8(msg, DEVLINK_PORT_FUNCTION_ATTR_OPSTATE, opstate))
+		return -EMSGSIZE;
+	*msg_updated = true;
+	return 0;
+}
+
 static int
 devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *port,
 				   struct netlink_ext_ack *extack)
@@ -745,6 +794,12 @@ devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *por
 
 	ops = devlink->ops;
 	err = devlink_port_function_hw_addr_fill(devlink, ops, port, msg, extack, &msg_updated);
+	if (err)
+		goto out;
+	err = devlink_port_function_state_fill(devlink, ops, port, msg, extack, &msg_updated);
+	if (err)
+		goto out;
+out:
 	if (err || !msg_updated)
 		nla_nest_cancel(msg, function_attr);
 	else
@@ -1005,6 +1060,28 @@ devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port *
 	return ops->port_function_hw_addr_set(devlink, port, hw_addr, hw_addr_len, extack);
 }
 
+static int
+devlink_port_function_state_set(struct devlink *devlink, struct devlink_port *port,
+				const struct nlattr *attr, struct netlink_ext_ack *extack)
+{
+	enum devlink_port_function_state state;
+	const struct devlink_ops *ops;
+	int err;
+
+	state = nla_get_u8(attr);
+	ops = devlink->ops;
+	if (!ops->port_function_state_set) {
+		NL_SET_ERR_MSG_MOD(extack, "Port function does not support state setting");
+		return -EOPNOTSUPP;
+	}
+	err = ops->port_function_state_set(devlink, port, state, extack);
+	if (err)
+		return err;
+
+	devlink_port_notify(port, DEVLINK_CMD_PORT_NEW);
+	return 0;
+}
+
 static int
 devlink_port_function_set(struct devlink *devlink, struct devlink_port *port,
 			  const struct nlattr *attr, struct netlink_ext_ack *extack)
@@ -1020,8 +1097,18 @@ devlink_port_function_set(struct devlink *devlink, struct devlink_port *port,
 	}
 
 	attr = tb[DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR];
-	if (attr)
+	if (attr) {
 		err = devlink_port_function_hw_addr_set(devlink, port, attr, extack);
+		if (err)
+			return err;
+	}
+	/* Keep this as the last function attribute set, so that when
+	 * multiple port function attributes are set along with state,
+	 * those can be applied first before activating the state.
+	 */
+	attr = tb[DEVLINK_PORT_FUNCTION_ATTR_STATE];
+	if (attr)
+		err = devlink_port_function_state_set(devlink, port, attr, extack);
 
 	if (!err)
 		devlink_port_notify(port, DEVLINK_CMD_PORT_NEW);
-- 
2.26.2
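The kernel-doc for the two enums above implies a teardown order: set the state to inactive, wait for the operational state to become detached, and only then delete the port. A minimal userspace sketch of that rule follows. The enum names and values are copied from the diff; the helper sf_port_safe_to_delete() is hypothetical and is not part of devlink.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace copies of the uapi enums added by this patch. */
enum devlink_port_function_state {
	DEVLINK_PORT_FUNCTION_STATE_INACTIVE,
	DEVLINK_PORT_FUNCTION_STATE_ACTIVE,
};

enum devlink_port_function_opstate {
	DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED,
	DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED,
};

/* DETACHED is the first enumerator, so a zeroed opstate reads as detached. */
static_assert(DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED == 0,
	      "DETACHED is expected to be the zero value");

/* Hypothetical helper: a port is safe to delete gracefully only after the
 * function was inactivated and the driver has detached from it.
 */
static bool sf_port_safe_to_delete(enum devlink_port_function_state state,
				   enum devlink_port_function_opstate opstate)
{
	return state == DEVLINK_PORT_FUNCTION_STATE_INACTIVE &&
	       opstate == DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED;
}
```

A management tool could poll the port attributes after inactivation and call such a helper before issuing the port delete command.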
* [PATCH net-next 05/13] devlink: Avoid global devlink mutex, use per instance reload lock
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
                   ` (3 preceding siblings ...)
  2020-11-12 19:24 ` [PATCH net-next 04/13] devlink: Support get and set state of port function Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 06/13] devlink: Introduce devlink refcount to reduce scope of global devlink_mutex Parav Pandit
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC (permalink / raw)
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit
devlink device reload is a special operation that brings the device
down and up again. For a device with subfunctions, this unregisters the
devlink instance of each subfunction port.
Running devlink_reload() with the global devlink_mutex held therefore
leads to the same mutex being acquired twice. For example,
devlink_reload()
  mutex_lock(&devlink_mutex); <- First lock acquire
  mlx5_reload_down(PCI PF device)
    disable_sf_devices();
      sf_state_set(inactive);
        ancillary_dev->remove();
           mlx5_adev_remove(adev);
             devlink_unregister(adev->devlink_instance);
               mutex_lock(&devlink_mutex); <- Second lock acquire
Hence devlink_reload() cannot run under the global devlink_mutex.
In a second such case, the reload_down() callback is likely to disable
reload on the child devlink device, which takes devlink_mutex again.
This also prevents devlink_reload() from using the overloaded global
devlink_mutex.
devlink_reload()
  mutex_lock(&devlink_mutex); <- First lock acquire
    mlx5_reload_down(PCI PF device)
      disable_sf_devices();
        ancillary_dev->remove();
           mlx5_adev_remove(adev);
             devlink_reload_disable(adev->devlink_instance);
               mutex_lock(&devlink_mutex); <- Second lock acquire
Therefore, introduce a reload_lock per devlink instance which is held
when performing devlink device reload.
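The difference between the two locking schemes can be sketched in a small single-threaded userspace model. Everything here (toy_mutex, toy_devlink, toy_reload() and toy_reload_disable()) is hypothetical and only mimics the call flow in the traces above; the toy lock reports a second acquire as EDEADLK where a real non-recursive mutex would hang.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock for a single-threaded model: re-acquiring returns EDEADLK. */
struct toy_mutex {
	bool held;
};

static int toy_mutex_lock(struct toy_mutex *lock)
{
	if (lock->held)
		return EDEADLK;
	lock->held = true;
	return 0;
}

static void toy_mutex_unlock(struct toy_mutex *lock)
{
	assert(lock->held);
	lock->held = false;
}

/* Hypothetical stand-in for struct devlink with only the fields used here. */
struct toy_devlink {
	struct toy_mutex *reload_lock;	/* lock taken around reload */
	bool reload_enabled;
	struct toy_devlink *child;	/* SF instance torn down in reload_down */
};

/* Models devlink_reload_disable(): takes the instance's reload lock. */
static int toy_reload_disable(struct toy_devlink *dl)
{
	int err = toy_mutex_lock(dl->reload_lock);

	if (err)
		return err;
	dl->reload_enabled = false;
	toy_mutex_unlock(dl->reload_lock);
	return 0;
}

/* Models devlink_reload(): holds the lock across the reload_down step,
 * during which reload is disabled on the child instance.
 */
static int toy_reload(struct toy_devlink *dl)
{
	int err = toy_mutex_lock(dl->reload_lock);

	if (err)
		return err;
	if (!dl->reload_enabled)
		err = EOPNOTSUPP;
	else if (dl->child)
		err = toy_reload_disable(dl->child);
	toy_mutex_unlock(dl->reload_lock);
	return err;
}
```

When parent and child point at one shared toy_mutex (the global-mutex scheme), toy_reload() on the parent fails with EDEADLK. When each instance points at its own toy_mutex (the per-instance reload_lock scheme), the same call succeeds and the child ends up with reload disabled.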
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
---
 include/net/devlink.h |  1 +
 net/core/devlink.c    | 25 +++++++++++++++----------
 2 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 124bac130c22..ef487b8ed17b 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -52,6 +52,7 @@ struct devlink {
 	struct mutex lock; /* Serializes access to devlink instance specific objects such as
 			    * port, sb, dpipe, resource, params, region, traps and more.
 			    */
+	struct mutex reload_lock; /* Protects reload operation */
 	u8 reload_failed:1,
 	   reload_enabled:1,
 	   registered:1;
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 3e59ba73d5c4..c7c6f274d392 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -3307,29 +3307,32 @@ static int devlink_reload(struct devlink *devlink, struct net *dest_net,
 	u32 remote_reload_stats[DEVLINK_RELOAD_STATS_ARRAY_SIZE];
 	int err;
 
-	if (!devlink->reload_enabled)
-		return -EOPNOTSUPP;
+	mutex_lock(&devlink->reload_lock);
+	if (!devlink->reload_enabled) {
+		err = -EOPNOTSUPP;
+		goto done;
+	}
 
 	memcpy(remote_reload_stats, devlink->stats.remote_reload_stats,
 	       sizeof(remote_reload_stats));
 	err = devlink->ops->reload_down(devlink, !!dest_net, action, limit, extack);
 	if (err)
-		return err;
+		goto done;
 
 	if (dest_net && !net_eq(dest_net, devlink_net(devlink)))
 		devlink_reload_netns_change(devlink, dest_net);
 
 	err = devlink->ops->reload_up(devlink, action, limit, actions_performed, extack);
 	devlink_reload_failed_set(devlink, !!err);
-	if (err)
-		return err;
 
 	WARN_ON(!(*actions_performed & BIT(action)));
 	/* Catch driver on updating the remote action within devlink reload */
 	WARN_ON(memcmp(remote_reload_stats, devlink->stats.remote_reload_stats,
 		       sizeof(remote_reload_stats)));
 	devlink_reload_stats_update(devlink, limit, *actions_performed);
-	return 0;
+done:
+	mutex_unlock(&devlink->reload_lock);
+	return err;
 }
 
 static int
@@ -8118,6 +8121,7 @@ struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size)
 	INIT_LIST_HEAD(&devlink->trap_policer_list);
 	mutex_init(&devlink->lock);
 	mutex_init(&devlink->reporters_lock);
+	mutex_init(&devlink->reload_lock);
 	return devlink;
 }
 EXPORT_SYMBOL_GPL(devlink_alloc);
@@ -8166,9 +8170,9 @@ EXPORT_SYMBOL_GPL(devlink_unregister);
  */
 void devlink_reload_enable(struct devlink *devlink)
 {
-	mutex_lock(&devlink_mutex);
+	mutex_lock(&devlink->reload_lock);
 	devlink->reload_enabled = true;
-	mutex_unlock(&devlink_mutex);
+	mutex_unlock(&devlink->reload_lock);
 }
 EXPORT_SYMBOL_GPL(devlink_reload_enable);
 
@@ -8182,12 +8186,12 @@ EXPORT_SYMBOL_GPL(devlink_reload_enable);
  */
 void devlink_reload_disable(struct devlink *devlink)
 {
-	mutex_lock(&devlink_mutex);
+	mutex_lock(&devlink->reload_lock);
 	/* Mutex is taken which ensures that no reload operation is in
 	 * progress while setting up forbidded flag.
 	 */
 	devlink->reload_enabled = false;
-	mutex_unlock(&devlink_mutex);
+	mutex_unlock(&devlink->reload_lock);
 }
 EXPORT_SYMBOL_GPL(devlink_reload_disable);
 
@@ -8198,6 +8202,7 @@ EXPORT_SYMBOL_GPL(devlink_reload_disable);
  */
 void devlink_free(struct devlink *devlink)
 {
+	mutex_destroy(&devlink->reload_lock);
 	mutex_destroy(&devlink->reporters_lock);
 	mutex_destroy(&devlink->lock);
 	WARN_ON(!list_empty(&devlink->trap_policer_list));
-- 
2.26.2
* [PATCH net-next 06/13] devlink: Introduce devlink refcount to reduce scope of global devlink_mutex
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
                   ` (4 preceding siblings ...)
  2020-11-12 19:24 ` [PATCH net-next 05/13] devlink: Avoid global devlink mutex, use per instance reload lock Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 07/13] net/mlx5: SF, Add auxiliary device support Parav Pandit
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC (permalink / raw)
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit
Currently the global devlink_mutex is held while a doit() operation is
in progress. This brings a limitation:
a driver cannot call devlink_register()/devlink_unregister() from
within a devlink doit() callback.
This is typically required when the port state change callback described
in RFC [1] deletes an active SF port or activates an SF port, which
results in unregistering or registering a devlink instance on a
different bus such as the auxiliary bus.
An example flow:
devlink_predoit()
  mutex_lock(&devlink_mutex); <- First lock acquire
  devlink_reload()
    driver->reload_down(inactive)
        adev->remove();
           mlx5_adev_remove(ancillary_dev);
             devlink_unregister(ancillary_dev->devlink_instance);
               mutex_lock(&devlink_mutex); <- Second lock acquire
This patch is a preparation patch to enable drivers to do this.
It maintains a per devlink instance refcount to prevent devlink device
unregistration while user commands are in progress or while the devlink
device is migrating to the init_net network namespace.
devlink_nl_family remains registered with parallel_ops disabled. So even
after dropping devlink_mutex during doit commands, userspace still cannot
run multiple devlink commands concurrently for one or multiple devlink
instances.
[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho
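The get/put scheme described above can be sketched in userspace with C11 atomics. This is only a model under stated assumptions: struct toy_instance and the toy_* functions are hypothetical, a boolean flag stands in for the kernel's struct completion, and toy_unregister_begin() returns whether the wait would already be over instead of blocking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the refcount and completion added to struct devlink. */
struct toy_instance {
	atomic_uint refcount;
	atomic_bool unregister_complete;
};

/* Models devlink_register(): the registration itself holds one reference. */
static void toy_register(struct toy_instance *inst)
{
	atomic_init(&inst->unregister_complete, false);
	atomic_init(&inst->refcount, 1);
}

/* Models devlink_try_get(): increment unless the count already hit zero. */
static bool toy_try_get(struct toy_instance *inst)
{
	unsigned int old = atomic_load(&inst->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&inst->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Models devlink_put(): the last put signals the waiter in unregister. */
static void toy_put(struct toy_instance *inst)
{
	unsigned int old = atomic_fetch_sub(&inst->refcount, 1);

	assert(old > 0);
	if (old == 1)
		atomic_store(&inst->unregister_complete, true);
}

/* Models the first half of devlink_unregister(): drop the registration
 * reference and report whether all users are already gone.
 */
static bool toy_unregister_begin(struct toy_instance *inst)
{
	toy_put(inst);
	return atomic_load(&inst->unregister_complete);
}
```

In this model a user that took a reference before unregistration keeps the instance alive until its own toy_put(). The sketch does not model the removal from devlink_list, which in the patch is what stops new lookups from finding the instance once unregistration has started.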
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
---
 include/net/devlink.h |  5 +++
 net/core/devlink.c    | 84 +++++++++++++++++++++++++++++++------------
 2 files changed, 67 insertions(+), 22 deletions(-)
diff --git a/include/net/devlink.h b/include/net/devlink.h
index ef487b8ed17b..c8eab814c234 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -53,6 +53,11 @@ struct devlink {
 			    * port, sb, dpipe, resource, params, region, traps and more.
 			    */
 	struct mutex reload_lock; /* Protects reload operation */
+	struct list_head reload_list;
+	refcount_t refcount; /* Serializes user doit commands and netns command
+			      * with device unregistration.
+			      */
+	struct completion unregister_complete;
 	u8 reload_failed:1,
 	   reload_enabled:1,
 	   registered:1;
diff --git a/net/core/devlink.c b/net/core/devlink.c
index c7c6f274d392..84f3ec12b3e8 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -96,9 +96,8 @@ static LIST_HEAD(devlink_list);
 
 /* devlink_mutex
  *
- * An overall lock guarding every operation coming from userspace.
- * It also guards devlink devices list and it is taken when
- * driver registers/unregisters it.
+ * An overall lock guarding devlink devices list during operations coming from
+ * userspace and when driver registers/unregisters devlink device.
  */
 static DEFINE_MUTEX(devlink_mutex);
 
@@ -121,6 +120,18 @@ void devlink_net_set(struct devlink *devlink, struct net *net)
 }
 EXPORT_SYMBOL_GPL(devlink_net_set);
 
+static inline bool
+devlink_try_get(struct devlink *devlink)
+{
+	return refcount_inc_not_zero(&devlink->refcount);
+}
+
+static void devlink_put(struct devlink *devlink)
+{
+	if (refcount_dec_and_test(&devlink->refcount))
+		complete(&devlink->unregister_complete);
+}
+
 static struct devlink *devlink_get_from_attrs(struct net *net,
 					      struct nlattr **attrs)
 {
@@ -139,7 +150,7 @@ static struct devlink *devlink_get_from_attrs(struct net *net,
 	list_for_each_entry(devlink, &devlink_list, list) {
 		if (strcmp(devlink->dev->bus->name, busname) == 0 &&
 		    strcmp(dev_name(devlink->dev), devname) == 0 &&
-		    net_eq(devlink_net(devlink), net))
+		    net_eq(devlink_net(devlink), net) && devlink_try_get(devlink))
 			return devlink;
 	}
 
@@ -411,7 +422,7 @@ devlink_region_snapshot_get_by_id(struct devlink_region *region, u32 id)
 
 /* The per devlink instance lock is taken by default in the pre-doit
  * operation, yet several commands do not require this. The global
- * devlink lock is taken and protects from disruption by user-calls.
+ * devlink lock is taken and protects from disruption by dumpit user-calls.
  */
 #define DEVLINK_NL_FLAG_NO_LOCK			BIT(2)
 
@@ -424,10 +435,10 @@ static int devlink_nl_pre_doit(const struct genl_ops *ops,
 
 	mutex_lock(&devlink_mutex);
 	devlink = devlink_get_from_info(info);
-	if (IS_ERR(devlink)) {
-		mutex_unlock(&devlink_mutex);
+	mutex_unlock(&devlink_mutex);
+
+	if (IS_ERR(devlink))
 		return PTR_ERR(devlink);
-	}
 	if (~ops->internal_flags & DEVLINK_NL_FLAG_NO_LOCK)
 		mutex_lock(&devlink->lock);
 	info->user_ptr[0] = devlink;
@@ -448,7 +459,7 @@ static int devlink_nl_pre_doit(const struct genl_ops *ops,
 unlock:
 	if (~ops->internal_flags & DEVLINK_NL_FLAG_NO_LOCK)
 		mutex_unlock(&devlink->lock);
-	mutex_unlock(&devlink_mutex);
+	devlink_put(devlink);
 	return err;
 }
 
@@ -460,7 +471,7 @@ static void devlink_nl_post_doit(const struct genl_ops *ops,
 	devlink = info->user_ptr[0];
 	if (~ops->internal_flags & DEVLINK_NL_FLAG_NO_LOCK)
 		mutex_unlock(&devlink->lock);
-	mutex_unlock(&devlink_mutex);
+	devlink_put(devlink);
 }
 
 static struct genl_family devlink_nl_family;
@@ -8122,6 +8133,7 @@ struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size)
 	mutex_init(&devlink->lock);
 	mutex_init(&devlink->reporters_lock);
 	mutex_init(&devlink->reload_lock);
+	init_completion(&devlink->unregister_complete);
 	return devlink;
 }
 EXPORT_SYMBOL_GPL(devlink_alloc);
@@ -8136,6 +8148,7 @@ int devlink_register(struct devlink *devlink, struct device *dev)
 {
 	devlink->dev = dev;
 	devlink->registered = true;
+	refcount_set(&devlink->refcount, 1);
 	mutex_lock(&devlink_mutex);
 	list_add_tail(&devlink->list, &devlink_list);
 	devlink_notify(devlink, DEVLINK_CMD_NEW);
@@ -8151,12 +8164,23 @@ EXPORT_SYMBOL_GPL(devlink_register);
  */
 void devlink_unregister(struct devlink *devlink)
 {
+	/* Remove from the list first, so that no new users can get it */
 	mutex_lock(&devlink_mutex);
-	WARN_ON(devlink_reload_supported(devlink->ops) &&
-		devlink->reload_enabled);
 	devlink_notify(devlink, DEVLINK_CMD_DEL);
 	list_del(&devlink->list);
 	mutex_unlock(&devlink_mutex);
+
+	/* Balances with refcount_set in devlink_register(). */
+	devlink_put(devlink);
+	/* Wait for any existing users to stop using the devlink device */
+	wait_for_completion(&devlink->unregister_complete);
+
+	/* At this point there are no active users working on the devlink instance;
+	 * the net ns exit operation (if any) has also completed.
+	 * devlink is out of global list, hence no users can acquire reference to this devlink
+	 * instance anymore. Hence, it is safe to proceed with unregistration.
+	 */
+	WARN_ON(devlink_reload_supported(devlink->ops) && devlink->reload_enabled);
 }
 EXPORT_SYMBOL_GPL(devlink_unregister);
 
@@ -10472,6 +10496,8 @@ static void __net_exit devlink_pernet_pre_exit(struct net *net)
 {
 	struct devlink *devlink;
 	u32 actions_performed;
+	LIST_HEAD(local_list);
+	struct devlink *tmp;
 	int err;
 
 	/* In case network namespace is getting destroyed, reload
@@ -10479,18 +10505,32 @@ static void __net_exit devlink_pernet_pre_exit(struct net *net)
 	 */
 	mutex_lock(&devlink_mutex);
 	list_for_each_entry(devlink, &devlink_list, list) {
-		if (net_eq(devlink_net(devlink), net)) {
-			if (WARN_ON(!devlink_reload_supported(devlink->ops)))
-				continue;
-			err = devlink_reload(devlink, &init_net,
-					     DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
-					     DEVLINK_RELOAD_LIMIT_UNSPEC,
-					     &actions_performed, NULL);
-			if (err && err != -EOPNOTSUPP)
-				pr_warn("Failed to reload devlink instance into init_net\n");
-		}
+		if (!net_eq(devlink_net(devlink), net))
+			continue;
+
+		if (WARN_ON(!devlink_reload_supported(devlink->ops)))
+			continue;
+
+		/* Hold the reference to devlink instance so that it doesn't get unregistered
+		 * once global devlink_mutex is unlocked.
+		 * Store the devlink to a shadow list so that if devlink unregistration is
+		 * started, it can be still found in the shadow list.
+		 */
+		if (devlink_try_get(devlink))
+			list_add_tail(&devlink->reload_list, &local_list);
 	}
 	mutex_unlock(&devlink_mutex);
+
+	list_for_each_entry_safe(devlink, tmp, &local_list, reload_list) {
+		list_del_init(&devlink->reload_list);
+		err = devlink_reload(devlink, &init_net,
+				     DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
+				     DEVLINK_RELOAD_LIMIT_UNSPEC,
+				     &actions_performed, NULL);
+		if (err && err != -EOPNOTSUPP)
+			pr_warn("Failed to reload devlink instance into init_net\n");
+		devlink_put(devlink);
+	}
 }
 
 static struct pernet_operations devlink_pernet_ops __net_initdata = {
-- 
2.26.2
* [PATCH net-next 07/13] net/mlx5: SF, Add auxiliary device support
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
                   ` (5 preceding siblings ...)
  2020-11-12 19:24 ` [PATCH net-next 06/13] devlink: Introduce devlink refcount to reduce scope of global devlink_mutex Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-12-07  2:48   ` David Ahern
  2020-11-12 19:24 ` [PATCH net-next 08/13] net/mlx5: SF, Add auxiliary device driver Parav Pandit
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC (permalink / raw)
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit,
	Vu Pham
Introduce an API to add and delete an auxiliary device for an SF.
Each SF has its own dedicated window in PCI BAR 2.
An SF device is similar to a PCI PF or VF in that it supports multiple
classes of devices such as net, rdma and vdpa.
An SF device will be added or removed by a subsequent patch when the SF
devlink port function state is changed.
A subfunction device exposes the user supplied subfunction number, which
systemd/udev will further use to give deterministic names to its
netdevice and rdma device.
An mlx5 subfunction auxiliary device example:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88 state active
On activation,
$ ls -l /sys/bus/auxiliary/devices/
mlx5_core.sf.0 -> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_core.sf.0
$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.0/sfnum
88
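The per-SF window in BAR 2 follows from two values used by this patch: the window length is 2^(log_min_sf_size + 12) bytes and the window of a given SF starts at the BAR 2 base plus sf_index times that length. A minimal sketch of the arithmetic follows; the function names are hypothetical, and the capability and BAR values in the usage note are made-up illustration values, not read from any device. The sketch shifts a 64-bit constant so that larger exponents do not overflow a plain int.

```c
#include <assert.h>
#include <stdint.h>

/* Window length in bytes: 4 KiB units scaled by 2^log_min_sf_size. */
static uint64_t sf_window_length(unsigned int log_min_sf_size)
{
	assert(log_min_sf_size + 12 < 64);
	return UINT64_C(1) << (log_min_sf_size + 12);
}

/* Start of the window for sf_index, given the start address of PCI BAR 2. */
static uint64_t sf_window_base(uint64_t bar2_start, uint16_t sf_index,
			       unsigned int log_min_sf_size)
{
	return bar2_start + (uint64_t)sf_index * sf_window_length(log_min_sf_size);
}
```

With an assumed log_min_sf_size of 4 each window is 64 KiB, so for an assumed BAR 2 start of 0xe0000000 the window of sf_index 3 would begin at 0xe0030000.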
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |   9 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   4 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  12 +
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.c  | 213 ++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.h  |  55 +++++
 include/linux/mlx5/driver.h                   |   4 +
 6 files changed, 297 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 485478979b1a..10dfaf671c90 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -202,3 +202,12 @@ config MLX5_SW_STEERING
 	default y
 	help
 	Build support for software-managed steering in the NIC.
+
+config MLX5_SF
+	bool "Mellanox Technologies subfunction device support using auxiliary device"
+	depends on MLX5_CORE && MLX5_CORE_EN
+	default n
+	help
+	Build support for subfunction device in the NIC. A Mellanox subfunction
+	device can support RDMA, netdevice and vdpa device.
+	It is similar to a SRIOV VF but it doesn't require SRIOV support.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 2d477f9a8cb7..ee866da1d9ba 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -85,3 +85,7 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 					steering/dr_ste.o steering/dr_send.o \
 					steering/dr_cmd.o steering/dr_fw.o \
 					steering/dr_action.o steering/fs_dr.o
+#
+# SF device
+#
+mlx5_core-$(CONFIG_MLX5_SF) += sf/dev/dev.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 37fa56904235..a1ba6056952b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -73,6 +73,7 @@
 #include "ecpf.h"
 #include "lib/hv_vhca.h"
 #include "diag/rsc_dump.h"
+#include "sf/dev/dev.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -884,6 +885,12 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 		goto err_eswitch_cleanup;
 	}
 
+	err = mlx5_sf_dev_table_init(dev);
+	if (err) {
+		mlx5_core_err(dev, "Failed to init SF device table %d\n", err);
+		goto err_sf_dev_cleanup;
+	}
+
 	dev->dm = mlx5_dm_create(dev);
 	if (IS_ERR(dev->dm))
 		mlx5_core_warn(dev, "Failed to init device memory%d\n", err);
@@ -894,6 +901,8 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 
 	return 0;
 
+err_sf_dev_cleanup:
+	mlx5_fpga_cleanup(dev);
 err_eswitch_cleanup:
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
 err_sriov_cleanup:
@@ -925,6 +934,7 @@ static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
 	mlx5_hv_vhca_destroy(dev->hv_vhca);
 	mlx5_fw_tracer_destroy(dev->tracer);
 	mlx5_dm_cleanup(dev);
+	mlx5_sf_dev_table_cleanup(dev);
 	mlx5_fpga_cleanup(dev);
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
 	mlx5_sriov_cleanup(dev);
@@ -1141,6 +1151,7 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 		goto err_ec;
 	}
 
+	mlx5_sf_dev_table_create(dev);
 	return 0;
 
 err_ec:
@@ -1171,6 +1182,7 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 
 static void mlx5_unload(struct mlx5_core_dev *dev)
 {
+	mlx5_sf_dev_table_destroy(dev);
 	mlx5_ec_cleanup(dev);
 	mlx5_sriov_detach(dev);
 	mlx5_cleanup_fs(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
new file mode 100644
index 000000000000..a25f6027b7cd
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
@@ -0,0 +1,213 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include <linux/mlx5/device.h>
+#include "mlx5_core.h"
+#include "dev.h"
+
+struct mlx5_sf_dev_table {
+	/* Serializes table access between driver unload context and
+	 * device add/remove user command context.
+	 */
+	struct mutex table_lock;
+	struct xarray devices;
+	unsigned int max_sfs;
+	phys_addr_t base_address;
+	u64 sf_bar_length;
+};
+
+static bool mlx5_sf_dev_supported(const struct mlx5_core_dev *dev)
+{
+	return MLX5_CAP_GEN(dev, sf);
+}
+
+static ssize_t sfnum_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct auxiliary_device *adev = container_of(dev, struct auxiliary_device, dev);
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", sf_dev->sfnum);
+}
+static DEVICE_ATTR_RO(sfnum);
+
+static struct attribute *sf_device_attrs[] = {
+	&dev_attr_sfnum.attr,
+	NULL,
+};
+
+static const struct attribute_group sf_attr_group = {
+	.attrs = sf_device_attrs,
+};
+
+static const struct attribute_group *sf_attr_groups[2] = {
+	&sf_attr_group,
+	NULL
+};
+
+static void mlx5_sf_dev_release(struct device *device)
+{
+	struct auxiliary_device *adev = container_of(device, struct auxiliary_device, dev);
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+
+	mlx5_adev_idx_free(sf_dev->adev.id);
+	kfree(sf_dev);
+}
+
+static void mlx5_sf_dev_remove(struct mlx5_sf_dev *sf_dev)
+{
+	auxiliary_device_delete(&sf_dev->adev);
+	auxiliary_device_uninit(&sf_dev->adev);
+}
+
+int mlx5_sf_dev_add(struct mlx5_core_dev *dev, u16 sf_index, u32 sfnum)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+	struct mlx5_sf_dev *sf_dev;
+	struct pci_dev *pdev;
+	int id;
+	int err;
+
+	id = mlx5_adev_idx_alloc();
+	if (id < 0)
+		return id;
+
+	sf_dev = kzalloc(sizeof(*sf_dev), GFP_KERNEL);
+	if (!sf_dev) {
+		mlx5_adev_idx_free(id);
+		return -ENOMEM;
+	}
+	pdev = dev->pdev;
+	sf_dev->adev.id = id;
+	sf_dev->adev.name = MLX5_SF_DEV_ID_NAME;
+	sf_dev->adev.dev.release = mlx5_sf_dev_release;
+	sf_dev->adev.dev.parent = &pdev->dev;
+	sf_dev->adev.dev.groups = sf_attr_groups;
+	sf_dev->sfnum = sfnum;
+	sf_dev->parent_mdev = dev;
+
+	/* Serialize with unloading the driver. */
+	mutex_lock(&table->table_lock);
+	if (!table->max_sfs) {
+		mlx5_adev_idx_free(id);
+		kfree(sf_dev);
+		err = -EOPNOTSUPP;
+		goto add_err;
+	}
+	sf_dev->bar_base_addr = table->base_address + (sf_index * table->sf_bar_length);
+
+	err = auxiliary_device_init(&sf_dev->adev);
+	if (err) {
+		mlx5_adev_idx_free(id);
+		kfree(sf_dev);
+		goto add_err;
+	}
+
+	err = auxiliary_device_add(&sf_dev->adev);
+	if (err) {
+		put_device(&sf_dev->adev.dev);
+		goto add_err;
+	}
+
+	err = xa_insert(&table->devices, sf_index, sf_dev, GFP_KERNEL);
+	if (err)
+		goto xa_err;
+	mutex_unlock(&table->table_lock);
+	return 0;
+
+xa_err:
+	mlx5_sf_dev_remove(sf_dev);
+add_err:
+	mutex_unlock(&table->table_lock);
+	return err;
+}
+
+void mlx5_sf_dev_del(struct mlx5_core_dev *dev, u16 sf_index)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+	struct mlx5_sf_dev *sf_dev;
+
+	mutex_lock(&table->table_lock);
+	sf_dev = xa_load(&table->devices, sf_index);
+	if (!sf_dev)
+		goto done;
+
+	xa_erase(&table->devices, sf_index);
+	mlx5_sf_dev_remove(sf_dev);
+done:
+	mutex_unlock(&table->table_lock);
+}
+
+int mlx5_sf_dev_table_init(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_dev_table *table;
+
+	if (!mlx5_sf_dev_supported(dev))
+		return 0;
+
+	table = kzalloc(sizeof(*table), GFP_KERNEL);
+	if (!table)
+		return -ENOMEM;
+
+	mutex_init(&table->table_lock);
+	xa_init(&table->devices);
+	dev->priv.sf_dev_table = table;
+	return 0;
+}
+
+void mlx5_sf_dev_table_cleanup(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+
+	if (!table)
+		return;
+
+	WARN_ON(!xa_empty(&table->devices));
+	mutex_destroy(&table->table_lock);
+	kfree(table);
+	dev->priv.sf_dev_table = NULL;
+}
+
+void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+	unsigned int max_sfs;
+
+	if (!table)
+		return;
+
+	/* Honor the caps changed during reload */
+	if (!mlx5_sf_dev_supported(dev))
+		return;
+
+	max_sfs = 1 << MLX5_CAP_GEN(dev, log_max_sf);
+	table->base_address = pci_resource_start(dev->pdev, 2);
+	table->sf_bar_length = 1 << (MLX5_CAP_GEN(dev, log_min_sf_size) + 12);
+	mutex_lock(&table->table_lock);
+	table->max_sfs = max_sfs;
+	mutex_unlock(&table->table_lock);
+}
+
+static void mlx5_sf_dev_destroy_all(struct mlx5_sf_dev_table *table)
+{
+	struct mlx5_sf_dev *sf_dev;
+	unsigned long index;
+
+	xa_for_each(&table->devices, index, sf_dev) {
+		xa_erase(&table->devices, index);
+		mlx5_sf_dev_remove(sf_dev);
+	}
+}
+
+void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+
+	if (!table)
+		return;
+
+	mutex_lock(&table->table_lock);
+	table->max_sfs = 0;
+	mlx5_sf_dev_destroy_all(table);
+	mutex_unlock(&table->table_lock);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
new file mode 100644
index 000000000000..d81612122a45
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_SF_DEV_H__
+#define __MLX5_SF_DEV_H__
+
+#ifdef CONFIG_MLX5_SF
+
+#include <linux/auxiliary_bus.h>
+
+#define MLX5_SF_DEV_ID_NAME "sf"
+
+struct mlx5_sf_dev {
+	struct auxiliary_device adev;
+	struct mlx5_core_dev *parent_mdev;
+	phys_addr_t bar_base_addr;
+	u32 sfnum;
+};
+
+void __exit mlx5_sf_dev_exit(void);
+int mlx5_sf_dev_table_init(struct mlx5_core_dev *dev);
+void mlx5_sf_dev_table_cleanup(struct mlx5_core_dev *dev);
+void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev);
+void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev);
+
+int mlx5_sf_dev_add(struct mlx5_core_dev *dev, u16 sf_index, u32 sfnum);
+void mlx5_sf_dev_del(struct mlx5_core_dev *dev, u16 sf_index);
+
+#else
+
+static inline void __exit mlx5_sf_dev_exit(void)
+{
+}
+
+static inline int mlx5_sf_dev_table_init(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_dev_table_cleanup(struct mlx5_core_dev *dev)
+{
+}
+
+static inline int mlx5_sf_dev_table_create(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev)
+{
+}
+
+#endif
+
+#endif
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 28e9b2f17eb9..151cacab07db 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -507,6 +507,7 @@ struct mlx5_devcom;
 struct mlx5_fw_reset;
 struct mlx5_eq_table;
 struct mlx5_irq_table;
+struct mlx5_sf_dev_table;
 
 struct mlx5_rate_limit {
 	u32			rate;
@@ -603,6 +604,9 @@ struct mlx5_priv {
 
 	struct mlx5_bfreg_data		bfregs;
 	struct mlx5_uars_page	       *uar;
+#ifdef CONFIG_MLX5_SF
+	struct mlx5_sf_dev_table *sf_dev_table;
+#endif
 };
 
 enum mlx5_device_state {
-- 
2.26.2
* [PATCH net-next 08/13] net/mlx5: SF, Add auxiliary device driver
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
                   ` (6 preceding siblings ...)
  2020-11-12 19:24 ` [PATCH net-next 07/13] net/mlx5: SF, Add auxiliary device support Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 09/13] net/mlx5: E-switch, Prepare eswitch to handle SF vport Parav Pandit
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC (permalink / raw)
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit,
	Vu Pham
Add an auxiliary device driver for the mlx5 subfunction auxiliary device.
An mlx5 subfunction is similar to a PCI PF or VF. For each subfunction,
an auxiliary device is created.
When an mlx5 SF auxiliary device binds to this driver, its netdev and
RDMA devices are created. They appear as follows:
$ ls -l /sys/bus/auxiliary/devices/
mlx5_core.sf.0 -> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_core.sf.0
$ ls -l /sys/class/net/eth1/device
/sys/class/net/eth1/device -> ../../../mlx5_core.sf.0
$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.0/sfnum
88
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.0
$ devlink port show auxiliary/mlx5_core.sf.0/1
auxiliary/mlx5_core.sf.0/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
$ rdma link show mlx5_0/1
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
$ rdma dev show
8: rocep6s0f1: node_type ca fw 16.27.1017 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
13: mlx5_0: node_type ca fw 16.27.1017 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
In the future, the devlink device instance name will carry an sfnum
annotation, either through an alias or in the devlink instance name
itself, as described in RFC [1].
[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
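For readers unfamiliar with auxiliary bus naming, the following is a
minimal userspace sketch (not kernel code) of how a device name such as
mlx5_core.sf.0 relates to the driver's id table entry "mlx5_core.sf".
It assumes the device name is the module name, the device id name and an
instance id joined by dots, and that matching compares everything before
the last dot. The helper name aux_name_matches is hypothetical; the real
matching is done by the auxiliary bus core.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true when dev_name, with its trailing
 * ".<instance id>" removed, is exactly id_name.
 */
static bool aux_name_matches(const char *dev_name, const char *id_name)
{
	const char *dot = strrchr(dev_name, '.');
	size_t prefix_len;

	/* A name without any dot has no instance id to strip. */
	if (!dot)
		return false;

	prefix_len = (size_t)(dot - dev_name);
	return strlen(id_name) == prefix_len &&
	       strncmp(dev_name, id_name, prefix_len) == 0;
}
```

Under these assumptions, aux_name_matches("mlx5_core.sf.0",
"mlx5_core.sf") is true, while a device of another type under the same
module, for example "mlx5_core.eth.0", does not match the SF driver.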
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |  22 ++--
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  10 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  20 ++++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.h  |  13 +++
 .../mellanox/mlx5/core/sf/dev/driver.c        | 105 ++++++++++++++++++
 include/linux/mlx5/driver.h                   |   4 +-
 8 files changed, 168 insertions(+), 10 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index ee866da1d9ba..7dd5be49fb9e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -88,4 +88,4 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 #
 # SF device
 #
-mlx5_core-$(CONFIG_MLX5_SF) += sf/dev/dev.o
+mlx5_core-$(CONFIG_MLX5_SF) += sf/dev/dev.o sf/dev/driver.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 8ebfe782f95e..50c235e54c86 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -465,7 +465,7 @@ int mlx5_eq_table_init(struct mlx5_core_dev *dev)
 	for (i = 0; i < MLX5_EVENT_TYPE_MAX; i++)
 		ATOMIC_INIT_NOTIFIER_HEAD(&eq_table->nh[i]);
 
-	eq_table->irq_table = dev->priv.irq_table;
+	eq_table->irq_table = mlx5_irq_table_get(dev);
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index a1ba6056952b..adfa21de938e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -83,7 +83,6 @@ unsigned int mlx5_core_debug_mask;
 module_param_named(debug_mask, mlx5_core_debug_mask, uint, 0644);
 MODULE_PARM_DESC(debug_mask, "debug mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both. Default=0");
 
-#define MLX5_DEFAULT_PROF	2
 static unsigned int prof_sel = MLX5_DEFAULT_PROF;
 module_param_named(prof_sel, prof_sel, uint, 0444);
 MODULE_PARM_DESC(prof_sel, "profile selector. Valid range 0 - 2");
@@ -1295,7 +1294,7 @@ void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup)
 	mutex_unlock(&dev->intf_state_mutex);
 }
 
-static int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx)
+int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx)
 {
 	struct mlx5_priv *priv = &dev->priv;
 	int err;
@@ -1345,7 +1344,7 @@ static int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx)
 	return err;
 }
 
-static void mlx5_mdev_uninit(struct mlx5_core_dev *dev)
+void mlx5_mdev_uninit(struct mlx5_core_dev *dev)
 {
 	struct mlx5_priv *priv = &dev->priv;
 
@@ -1684,16 +1683,24 @@ static int __init init(void)
 	if (err)
 		goto err_debug;
 
+	err = mlx5_sf_driver_register();
+	if (err)
+		goto err_sf;
+
 #ifdef CONFIG_MLX5_CORE_EN
 	err = mlx5e_init();
-	if (err) {
-		pci_unregister_driver(&mlx5_core_driver);
-		goto err_debug;
-	}
+	if (err)
+		goto err_eth;
 #endif
 
 	return 0;
 
+#ifdef CONFIG_MLX5_CORE_EN
+err_eth:
+	mlx5_sf_driver_unregister();
+#endif
+err_sf:
+	pci_unregister_driver(&mlx5_core_driver);
 err_debug:
 	mlx5_unregister_debugfs();
 	return err;
@@ -1704,6 +1711,7 @@ static void __exit cleanup(void)
 #ifdef CONFIG_MLX5_CORE_EN
 	mlx5e_cleanup();
 #endif
+	mlx5_sf_driver_unregister();
 	pci_unregister_driver(&mlx5_core_driver);
 	mlx5_unregister_debugfs();
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index dd7312621d0d..499aa76bf8d1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -117,6 +117,8 @@ enum mlx5_semaphore_space_address {
 	MLX5_SEMAPHORE_SW_RESET         = 0x20,
 };
 
+#define MLX5_DEFAULT_PROF       2
+
 int mlx5_query_hca_caps(struct mlx5_core_dev *dev);
 int mlx5_query_board_id(struct mlx5_core_dev *dev);
 int mlx5_cmd_init_hca(struct mlx5_core_dev *dev, uint32_t *sw_owner_id);
@@ -172,6 +174,7 @@ struct cpumask *
 mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
 struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
 int mlx5_irq_get_num_comp(struct mlx5_irq_table *table);
+struct mlx5_irq_table *mlx5_irq_table_get(struct mlx5_core_dev *dev);
 
 int mlx5_events_init(struct mlx5_core_dev *dev);
 void mlx5_events_cleanup(struct mlx5_core_dev *dev);
@@ -253,6 +256,13 @@ enum {
 u8 mlx5_get_nic_state(struct mlx5_core_dev *dev);
 void mlx5_set_nic_state(struct mlx5_core_dev *dev, u8 state);
 
+static inline bool mlx5_core_is_sf(const struct mlx5_core_dev *dev)
+{
+	return dev->coredev_type == MLX5_COREDEV_SF;
+}
+
+int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx);
+void mlx5_mdev_uninit(struct mlx5_core_dev *dev);
 void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup);
 int mlx5_load_one(struct mlx5_core_dev *dev, bool boot);
 #endif /* __MLX5_CORE_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 6fd974920394..a61e09aff152 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -30,6 +30,9 @@ int mlx5_irq_table_init(struct mlx5_core_dev *dev)
 {
 	struct mlx5_irq_table *irq_table;
 
+	if (mlx5_core_is_sf(dev))
+		return 0;
+
 	irq_table = kvzalloc(sizeof(*irq_table), GFP_KERNEL);
 	if (!irq_table)
 		return -ENOMEM;
@@ -40,6 +43,9 @@ int mlx5_irq_table_init(struct mlx5_core_dev *dev)
 
 void mlx5_irq_table_cleanup(struct mlx5_core_dev *dev)
 {
+	if (mlx5_core_is_sf(dev))
+		return;
+
 	kvfree(dev->priv.irq_table);
 }
 
@@ -268,6 +274,9 @@ int mlx5_irq_table_create(struct mlx5_core_dev *dev)
 	int nvec;
 	int err;
 
+	if (mlx5_core_is_sf(dev))
+		return 0;
+
 	nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
 	       MLX5_IRQ_VEC_COMP_BASE;
 	nvec = min_t(int, nvec, num_eqs);
@@ -319,6 +328,9 @@ void mlx5_irq_table_destroy(struct mlx5_core_dev *dev)
 	struct mlx5_irq_table *table = dev->priv.irq_table;
 	int i;
 
+	if (mlx5_core_is_sf(dev))
+		return;
+
 	/* free_irq requires that affinity and rmap will be cleared
 	 * before calling it. This is why there is asymmetry with set_rmap
 	 * which should be called after alloc_irq but before request_irq.
@@ -332,3 +344,11 @@ void mlx5_irq_table_destroy(struct mlx5_core_dev *dev)
 	kfree(table->irq);
 }
 
+struct mlx5_irq_table *mlx5_irq_table_get(struct mlx5_core_dev *dev)
+{
+#ifdef CONFIG_MLX5_SF
+	if (mlx5_core_is_sf(dev))
+		return dev->priv.parent_mdev->priv.irq_table;
+#endif
+	return dev->priv.irq_table;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
index d81612122a45..37634e3dedb5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
@@ -13,6 +13,7 @@
 struct mlx5_sf_dev {
 	struct auxiliary_device adev;
 	struct mlx5_core_dev *parent_mdev;
+	struct mlx5_core_dev *mdev;
 	phys_addr_t bar_base_addr;
 	u32 sfnum;
 };
@@ -26,6 +27,9 @@ void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev);
 int mlx5_sf_dev_add(struct mlx5_core_dev *dev, u16 sf_index, u32 sfnum);
 void mlx5_sf_dev_del(struct mlx5_core_dev *dev, u16 sf_index);
 
+int mlx5_sf_driver_register(void);
+void mlx5_sf_driver_unregister(void);
+
 #else
 
 static inline void __exit mlx5_sf_dev_exit(void)
@@ -50,6 +54,15 @@ static inline void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev)
 {
 }
 
+static inline int mlx5_sf_driver_register(void)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_driver_unregister(void)
+{
+}
+
 #endif
 
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
new file mode 100644
index 000000000000..10fe41c13a4a
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include <linux/mlx5/device.h>
+#include "mlx5_core.h"
+#include "dev.h"
+#include "devlink.h"
+
+static int mlx5_sf_dev_probe(struct auxiliary_device *adev, const struct auxiliary_device_id *id)
+{
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+	struct mlx5_core_dev *mdev;
+	struct devlink *devlink;
+	int err;
+
+	devlink = mlx5_devlink_alloc();
+	if (!devlink)
+		return -ENOMEM;
+
+	mdev = devlink_priv(devlink);
+	mdev->device = &adev->dev;
+	mdev->pdev = sf_dev->parent_mdev->pdev;
+	mdev->bar_addr = sf_dev->bar_base_addr;
+	mdev->iseg_base = sf_dev->bar_base_addr;
+	mdev->coredev_type = MLX5_COREDEV_SF;
+	mdev->priv.parent_mdev = sf_dev->parent_mdev;
+	mdev->priv.adev_idx = sf_dev->adev.id;
+	sf_dev->mdev = mdev;
+
+	err = mlx5_mdev_init(mdev, MLX5_DEFAULT_PROF);
+	if (err) {
+		mlx5_core_warn(mdev, "mlx5_mdev_init err=%d\n", err);
+		goto mdev_err;
+	}
+
+	mdev->iseg = ioremap(mdev->iseg_base, sizeof(*mdev->iseg));
+	if (!mdev->iseg) {
+		mlx5_core_warn(mdev, "remap error\n");
+		goto remap_err;
+	}
+
+	err = mlx5_load_one(mdev, true);
+	if (err) {
+		mlx5_core_warn(mdev, "mlx5_load_one err=%d\n", err);
+		goto load_one_err;
+	}
+	devlink_reload_enable(devlink);
+	return 0;
+
+load_one_err:
+	iounmap(mdev->iseg);
+remap_err:
+	mlx5_mdev_uninit(mdev);
+mdev_err:
+	mlx5_devlink_free(devlink);
+	return err;
+}
+
+static int mlx5_sf_dev_remove(struct auxiliary_device *adev)
+{
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+	struct devlink *devlink;
+
+	devlink = priv_to_devlink(sf_dev->mdev);
+	devlink_reload_disable(devlink);
+	mlx5_unload_one(sf_dev->mdev, true);
+	iounmap(sf_dev->mdev->iseg);
+	mlx5_mdev_uninit(sf_dev->mdev);
+	mlx5_devlink_free(devlink);
+	return 0;
+}
+
+static void mlx5_sf_dev_shutdown(struct auxiliary_device *adev)
+{
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+
+	mlx5_unload_one(sf_dev->mdev, false);
+}
+
+static const struct auxiliary_device_id mlx5_sf_dev_id_table[] = {
+	{ .name = KBUILD_MODNAME "." MLX5_SF_DEV_ID_NAME, },
+	{ },
+};
+
+MODULE_DEVICE_TABLE(auxiliary, mlx5_sf_dev_id_table);
+
+static struct auxiliary_driver mlx5_sf_driver = {
+	.name = MLX5_SF_DEV_ID_NAME,
+	.probe = mlx5_sf_dev_probe,
+	.remove = mlx5_sf_dev_remove,
+	.shutdown = mlx5_sf_dev_shutdown,
+	.id_table = mlx5_sf_dev_id_table,
+};
+
+int mlx5_sf_driver_register(void)
+{
+	return auxiliary_driver_register(&mlx5_sf_driver);
+}
+
+void mlx5_sf_driver_unregister(void)
+{
+	auxiliary_driver_unregister(&mlx5_sf_driver);
+}
+
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 151cacab07db..f3104b50ade5 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -193,7 +193,8 @@ enum port_state_policy {
 
 enum mlx5_coredev_type {
 	MLX5_COREDEV_PF,
-	MLX5_COREDEV_VF
+	MLX5_COREDEV_VF,
+	MLX5_COREDEV_SF,
 };
 
 struct mlx5_field_desc {
@@ -606,6 +607,7 @@ struct mlx5_priv {
 	struct mlx5_uars_page	       *uar;
 #ifdef CONFIG_MLX5_SF
 	struct mlx5_sf_dev_table *sf_dev_table;
+	struct mlx5_core_dev *parent_mdev;
 #endif
 };
 
-- 
2.26.2
* [PATCH net-next 09/13] net/mlx5: E-switch, Prepare eswitch to handle SF vport
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
                   ` (7 preceding siblings ...)
  2020-11-12 19:24 ` [PATCH net-next 08/13] net/mlx5: SF, Add auxiliary device driver Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 10/13] net/mlx5: E-switch, Add eswitch helpers for " Parav Pandit
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC (permalink / raw)
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Vu Pham,
	Parav Pandit, Roi Dayan
From: Vu Pham <vuhuong@nvidia.com>
Prepare eswitch to handle SF vports when
(a) querying eswitch functions
(b) creating egress ACLs
(c) calculating the total number of vports
Assign a dedicated placeholder for SF vports and their representors.
They are placed after the VF vports and before the ECPF vport, as below:
[PF,VF0,...,VFn,SF0,...,SFm,ECPF,UPLINK].
Change functions to map SF vport numbers to indices when
accessing the vports or representors arrays, and vice versa.
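The layout above can be sketched in standalone C as follows. This is an
illustration only, not the driver code: struct esw_layout and its fields
are hypothetical stand-ins for values the driver reads from device
capabilities (max VFs, log_max_sf, sf_base_id), and PF_PLACEHOLDER is
assumed to be 1, reserving index 0 for the PF.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for capability values read from the device. */
struct esw_layout {
	uint16_t max_vfs;       /* number of VF vports */
	uint16_t max_sfs;       /* number of SF vports, 0 if SFs are unsupported */
	uint16_t sf_base_vport; /* first SF vport number */
};

#define PF_PLACEHOLDER 1 /* assumption: one array slot for the PF */

/* True when vport_num falls inside the SF vport number range. */
static bool is_sf_vport(const struct esw_layout *l, uint16_t vport_num)
{
	return l->max_sfs &&
	       vport_num >= l->sf_base_vport &&
	       vport_num < l->sf_base_vport + l->max_sfs;
}

/* SF array slots start right after the PF and VF slots. */
static int sf_vport_num_to_index(const struct esw_layout *l, uint16_t vport_num)
{
	return vport_num - l->sf_base_vport + PF_PLACEHOLDER + l->max_vfs;
}

/* Inverse of sf_vport_num_to_index(). */
static uint16_t sf_index_to_vport_num(const struct esw_layout *l, int idx)
{
	return (uint16_t)(l->sf_base_vport + idx -
			  (PF_PLACEHOLDER + l->max_vfs));
}
```

With 4 VFs and an SF base vport number of 0x800, the first SF lands at
array index 5 (index 0 is the PF, 1 to 4 are the VFs), and index 5 maps
back to vport number 0x800.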
Signed-off-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Kconfig   | 10 ++++
 .../mellanox/mlx5/core/esw/acl/egress_ofld.c  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 11 +++-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h | 55 +++++++++++++++++++
 .../mellanox/mlx5/core/eswitch_offloads.c     | 11 ++++
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   | 35 ++++++++++++
 .../net/ethernet/mellanox/mlx5/core/vport.c   |  3 +-
 7 files changed, 123 insertions(+), 4 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 10dfaf671c90..11d5e0e99bd6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -211,3 +211,13 @@ config MLX5_SF
 	Build support for subfuction device in the NIC. A Mellanox subfunction
 	device can support RDMA, netdevice and vdpa device.
 	It is similar to a SRIOV VF but it doesn't require SRIOV support.
+
+config MLX5_SF_MANAGER
+	bool
+	depends on MLX5_SF && MLX5_ESWITCH
+	default y
+	help
+	Build support for subfunction port in the NIC. A Mellanox subfunction
+	port is managed through devlink.  A subfunction supports RDMA, netdevice
+	and vdpa device. It is similar to a SRIOV VF but it doesn't require
+	SRIOV support.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c
index c3faae67e4d6..45758ff3c14e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c
@@ -150,7 +150,7 @@ static void esw_acl_egress_ofld_groups_destroy(struct mlx5_vport *vport)
 
 static bool esw_acl_egress_needed(const struct mlx5_eswitch *esw, u16 vport_num)
 {
-	return mlx5_eswitch_is_vf_vport(esw, vport_num);
+	return mlx5_eswitch_is_vf_vport(esw, vport_num) || mlx5_esw_is_sf_vport(esw, vport_num);
 }
 
 int esw_acl_egress_ofld_setup(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index b44f28fb5518..5b90f126b7f3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1369,9 +1369,15 @@ const u32 *mlx5_esw_query_functions(struct mlx5_core_dev *dev)
 {
 	int outlen = MLX5_ST_SZ_BYTES(query_esw_functions_out);
 	u32 in[MLX5_ST_SZ_DW(query_esw_functions_in)] = {};
+	u16 max_sf_vports;
 	u32 *out;
 	int err;
 
+	max_sf_vports = mlx5_sf_max_ports(dev);
+	/* Device interface is array of 64-bits */
+	if (max_sf_vports)
+		outlen += DIV_ROUND_UP(max_sf_vports, BITS_PER_TYPE(__be64)) * sizeof(__be64);
+
 	out = kvzalloc(outlen, GFP_KERNEL);
 	if (!out)
 		return ERR_PTR(-ENOMEM);
@@ -1379,7 +1385,7 @@ const u32 *mlx5_esw_query_functions(struct mlx5_core_dev *dev)
 	MLX5_SET(query_esw_functions_in, in, opcode,
 		 MLX5_CMD_OP_QUERY_ESW_FUNCTIONS);
 
-	err = mlx5_cmd_exec_inout(dev, query_esw_functions, in, out);
+	err = mlx5_cmd_exec(dev, in, sizeof(in), out, outlen);
 	if (!err)
 		return out;
 
@@ -1874,7 +1880,8 @@ static bool
 is_port_function_supported(const struct mlx5_eswitch *esw, u16 vport_num)
 {
 	return vport_num == MLX5_VPORT_PF ||
-	       mlx5_eswitch_is_vf_vport(esw, vport_num);
+	       mlx5_eswitch_is_vf_vport(esw, vport_num) ||
+	       mlx5_esw_is_sf_vport(esw, vport_num);
 }
 
 int mlx5_devlink_port_function_hw_addr_get(struct devlink *devlink,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index cf87de94418f..2165bc065196 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -43,6 +43,7 @@
 #include <linux/mlx5/fs.h>
 #include "lib/mpfs.h"
 #include "lib/fs_chains.h"
+#include "sf/sf.h"
 #include "en/tc_ct.h"
 
 #ifdef CONFIG_MLX5_ESWITCH
@@ -499,6 +500,45 @@ static inline u16 mlx5_eswitch_first_host_vport_num(struct mlx5_core_dev *dev)
 		MLX5_VPORT_PF : MLX5_VPORT_FIRST_VF;
 }
 
+static inline u16 mlx5_esw_sf_start_vport_num(const struct mlx5_core_dev *dev)
+{
+	return MLX5_CAP_GEN(dev, sf_base_id);
+}
+
+static inline int mlx5_esw_sf_start_idx(const struct mlx5_eswitch *esw)
+{
+	/* PF and VF vport indices range from 0 to max_vfs */
+	return MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev);
+}
+
+static inline int mlx5_esw_sf_end_idx(const struct mlx5_eswitch *esw)
+{
+	return mlx5_esw_sf_start_idx(esw) + mlx5_sf_max_ports(esw->dev);
+}
+
+static inline int
+mlx5_esw_sf_vport_num_to_index(const struct mlx5_eswitch *esw, u16 vport_num)
+{
+	return vport_num - mlx5_esw_sf_start_vport_num(esw->dev) +
+	       MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev);
+}
+
+static inline u16
+mlx5_esw_sf_vport_index_to_num(const struct mlx5_eswitch *esw, int idx)
+{
+	return mlx5_esw_sf_start_vport_num(esw->dev) + idx -
+	       (MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev));
+}
+
+static inline bool
+mlx5_esw_is_sf_vport(const struct mlx5_eswitch *esw, u16 vport_num)
+{
+	return mlx5_sf_supported(esw->dev) &&
+	       vport_num >= mlx5_esw_sf_start_vport_num(esw->dev) &&
+	       (vport_num < (mlx5_esw_sf_start_vport_num(esw->dev) +
+			     mlx5_sf_max_ports(esw->dev)));
+}
+
 static inline bool mlx5_eswitch_is_funcs_handler(const struct mlx5_core_dev *dev)
 {
 	return mlx5_core_is_ecpf_esw_manager(dev);
@@ -527,6 +567,10 @@ static inline int mlx5_eswitch_vport_num_to_index(struct mlx5_eswitch *esw,
 	if (vport_num == MLX5_VPORT_UPLINK)
 		return mlx5_eswitch_uplink_idx(esw);
 
+	if (mlx5_esw_is_sf_vport(esw, vport_num))
+		return mlx5_esw_sf_vport_num_to_index(esw, vport_num);
+
+	/* PF and VF vports start from 0 to max_vfs */
 	return vport_num;
 }
 
@@ -540,6 +584,12 @@ static inline u16 mlx5_eswitch_index_to_vport_num(struct mlx5_eswitch *esw,
 	if (index == mlx5_eswitch_uplink_idx(esw))
 		return MLX5_VPORT_UPLINK;
 
+	/* SF vport indices are after VFs and before ECPF */
+	if (mlx5_sf_supported(esw->dev) &&
+	    index > mlx5_core_max_vfs(esw->dev))
+		return mlx5_esw_sf_vport_index_to_num(esw, index);
+
+	/* PF and VF vports start from 0 to max_vfs */
 	return index;
 }
 
@@ -625,6 +675,11 @@ void mlx5e_tc_clean_fdb_peer_flows(struct mlx5_eswitch *esw);
 	for ((vport) = (nvfs);						\
 	     (vport) >= (esw)->first_host_vport; (vport)--)
 
+#define mlx5_esw_for_each_sf_rep(esw, i, rep)		\
+	for ((i) = mlx5_esw_sf_start_idx(esw);		\
+	     (rep) = &(esw)->offloads.vport_reps[(i)],	\
+	     (i) < mlx5_esw_sf_end_idx(esw); (i++))
+
 struct mlx5_eswitch *mlx5_devlink_eswitch_get(struct devlink *devlink);
 struct mlx5_vport *__must_check
 mlx5_eswitch_get_vport(struct mlx5_eswitch *esw, u16 vport_num);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 429dc613530b..01242afbfcce 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1801,11 +1801,22 @@ static void __esw_offloads_unload_rep(struct mlx5_eswitch *esw,
 		esw->offloads.rep_ops[rep_type]->unload(rep);
 }
 
+static void __unload_reps_sf_vport(struct mlx5_eswitch *esw, u8 rep_type)
+{
+	struct mlx5_eswitch_rep *rep;
+	int i;
+
+	mlx5_esw_for_each_sf_rep(esw, i, rep)
+		__esw_offloads_unload_rep(esw, rep, rep_type);
+}
+
 static void __unload_reps_all_vport(struct mlx5_eswitch *esw, u8 rep_type)
 {
 	struct mlx5_eswitch_rep *rep;
 	int i;
 
+	__unload_reps_sf_vport(esw, rep_type);
+
 	mlx5_esw_for_each_vf_rep_reverse(esw, i, rep, esw->esw_funcs.num_vfs)
 		__esw_offloads_unload_rep(esw, rep, rep_type);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
new file mode 100644
index 000000000000..7b9071a865ce
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_SF_H__
+#define __MLX5_SF_H__
+
+#include <linux/mlx5/driver.h>
+
+#ifdef CONFIG_MLX5_SF_MANAGER
+
+static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev)
+{
+	return MLX5_CAP_GEN(dev, sf);
+}
+
+static inline u16 mlx5_sf_max_ports(const struct mlx5_core_dev *dev)
+{
+	return mlx5_sf_supported(dev) ? 1 << MLX5_CAP_GEN(dev, log_max_sf) : 0;
+}
+
+#else
+
+static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev)
+{
+	return false;
+}
+
+static inline u16 mlx5_sf_max_ports(const struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+#endif
+
+#endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index bdafc85fd874..233aa8242916 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -36,6 +36,7 @@
 #include <linux/mlx5/vport.h>
 #include <linux/mlx5/eswitch.h>
 #include "mlx5_core.h"
+#include "sf/sf.h"
 
 /* Mutex to hold while enabling or disabling RoCE */
 static DEFINE_MUTEX(mlx5_roce_en_lock);
@@ -1160,6 +1161,6 @@ EXPORT_SYMBOL_GPL(mlx5_query_nic_system_image_guid);
  */
 u16 mlx5_eswitch_get_total_vports(const struct mlx5_core_dev *dev)
 {
-	return MLX5_SPECIAL_VPORTS(dev) + mlx5_core_max_vfs(dev);
+	return MLX5_SPECIAL_VPORTS(dev) + mlx5_core_max_vfs(dev) + mlx5_sf_max_ports(dev);
 }
 EXPORT_SYMBOL_GPL(mlx5_eswitch_get_total_vports);
-- 
2.26.2
* [PATCH net-next 10/13] net/mlx5: E-switch, Add eswitch helpers for SF vport
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
                   ` (8 preceding siblings ...)
  2020-11-12 19:24 ` [PATCH net-next 09/13] net/mlx5: E-switch, Prepare eswitch to handle SF vport Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 11/13] net/mlx5: SF, Add SF configuration hardware commands Parav Pandit
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC (permalink / raw)
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit,
	Vu Pham, Roi Dayan
Add helpers to enable/disable an eswitch port, register its devlink port
and load its representor.
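The ordering these helpers follow (enable the vport, register the
devlink port, load the representor, and undo in reverse on failure or
disable) can be sketched in standalone C as below. This is a simplified
model, not the driver code: struct sf_vport_state, step() and the
fail_step knob are hypothetical and exist only to make the unwinding
visible.

```c
#include <stdbool.h>

/* Hypothetical state standing in for one SF vport. */
struct sf_vport_state {
	bool vport_enabled;
	bool devlink_registered;
	bool rep_loaded;
	int fail_step; /* illustration knob: 0 = no failure, 1..3 = failing step */
};

/* Mark one setup step done, or fail if it is the step chosen to fail. */
static int step(struct sf_vport_state *s, int n, bool *flag)
{
	if (s->fail_step == n)
		return -1;
	*flag = true;
	return 0;
}

static int sf_vport_enable(struct sf_vport_state *s)
{
	int err;

	err = step(s, 1, &s->vport_enabled);      /* enable the eswitch vport */
	if (err)
		return err;

	err = step(s, 2, &s->devlink_registered); /* register its devlink port */
	if (err)
		goto devlink_err;

	err = step(s, 3, &s->rep_loaded);         /* load its representor */
	if (err)
		goto rep_err;
	return 0;

	/* Undo only the steps that completed, in reverse order. */
rep_err:
	s->devlink_registered = false;
devlink_err:
	s->vport_enabled = false;
	return err;
}

/* Teardown mirrors setup: representor, then devlink port, then vport. */
static void sf_vport_disable(struct sf_vport_state *s)
{
	s->rep_loaded = false;
	s->devlink_registered = false;
	s->vport_enabled = false;
}
```

In this model, a failure at the representor step leaves no state behind,
which is the property the goto labels in the enable helper are meant to
provide.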
Signed-off-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
---
 .../mellanox/mlx5/core/esw/devlink_port.c     | 41 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 10 +++++
 .../net/ethernet/mellanox/mlx5/core/eswitch.h | 15 +++++++
 .../mellanox/mlx5/core/eswitch_offloads.c     | 36 +++++++++++++++-
 4 files changed, 100 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
index 88688b84513b..f361a896c278 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
@@ -122,3 +122,44 @@ struct devlink_port *mlx5_esw_offloads_devlink_port(struct mlx5_eswitch *esw, u1
 	vport = mlx5_eswitch_get_vport(esw, vport_num);
 	return IS_ERR(vport) ? ERR_CAST(vport) : vport->dl_port;
 }
+
+int mlx5_esw_devlink_sf_port_register(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum)
+{
+	struct mlx5_core_dev *dev = esw->dev;
+	struct netdev_phys_item_id ppid = {};
+	unsigned int dl_port_index;
+	struct mlx5_vport *vport;
+	struct devlink *devlink;
+	u16 pfnum;
+	int err;
+
+	vport = mlx5_eswitch_get_vport(esw, vport_num);
+	if (IS_ERR(vport))
+		return PTR_ERR(vport);
+
+	pfnum = PCI_FUNC(dev->pdev->devfn);
+	mlx5_esw_get_port_parent_id(dev, &ppid);
+	memcpy(dl_port->attrs.switch_id.id, &ppid.id[0], ppid.id_len);
+	dl_port->attrs.switch_id.id_len = ppid.id_len;
+	devlink_port_attrs_pci_sf_set(dl_port, 0, pfnum, sfnum, false);
+	devlink = priv_to_devlink(dev);
+	dl_port_index = mlx5_esw_vport_to_devlink_port_index(dev, vport_num);
+	err = devlink_port_register(devlink, dl_port, dl_port_index);
+	if (err)
+		return err;
+
+	vport->dl_port = dl_port;
+	return 0;
+}
+
+void mlx5_esw_devlink_sf_port_unregister(struct mlx5_eswitch *esw, u16 vport_num)
+{
+	struct mlx5_vport *vport;
+
+	vport = mlx5_eswitch_get_vport(esw, vport_num);
+	if (IS_ERR(vport))
+		return;
+	devlink_port_unregister(vport->dl_port);
+	vport->dl_port = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 5b90f126b7f3..d72766b78bd7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1342,6 +1342,16 @@ static void esw_disable_vport(struct mlx5_eswitch *esw, u16 vport_num)
 	mutex_unlock(&esw->state_lock);
 }
 
+int mlx5_esw_vport_enable(struct mlx5_eswitch *esw, u16 vport_num)
+{
+	return esw_enable_vport(esw, vport_num, MLX5_VPORT_UC_ADDR_CHANGE);
+}
+
+void mlx5_esw_vport_disable(struct mlx5_eswitch *esw, u16 vport_num)
+{
+	esw_disable_vport(esw, vport_num);
+}
+
 static int eswitch_vport_event(struct notifier_block *nb,
 			       unsigned long type, void *data)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 2165bc065196..3a373f314a6b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -711,6 +711,9 @@ esw_get_max_restore_tag(struct mlx5_eswitch *esw);
 int esw_offloads_load_rep(struct mlx5_eswitch *esw, u16 vport_num);
 void esw_offloads_unload_rep(struct mlx5_eswitch *esw, u16 vport_num);
 
+int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num);
+void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num);
+
 int mlx5_eswitch_load_vport(struct mlx5_eswitch *esw, u16 vport_num,
 			    enum mlx5_eswitch_vport_event enabled_events);
 void mlx5_eswitch_unload_vport(struct mlx5_eswitch *esw, u16 vport_num);
@@ -722,6 +725,18 @@ void mlx5_eswitch_unload_vf_vports(struct mlx5_eswitch *esw, u16 num_vfs);
 int mlx5_esw_offloads_devlink_port_register(struct mlx5_eswitch *esw, u16 vport_num);
 void mlx5_esw_offloads_devlink_port_unregister(struct mlx5_eswitch *esw, u16 vport_num);
 struct devlink_port *mlx5_esw_offloads_devlink_port(struct mlx5_eswitch *esw, u16 vport_num);
+
+int mlx5_esw_devlink_sf_port_register(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum);
+void mlx5_esw_devlink_sf_port_unregister(struct mlx5_eswitch *esw, u16 vport_num);
+
+int mlx5_esw_vport_enable(struct mlx5_eswitch *esw, u16 vport_num);
+void mlx5_esw_vport_disable(struct mlx5_eswitch *esw, u16 vport_num);
+
+int mlx5_esw_offloads_sf_vport_enable(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum);
+void mlx5_esw_offloads_sf_vport_disable(struct mlx5_eswitch *esw, u16 vport_num);
+
 #else  /* CONFIG_MLX5_ESWITCH */
 /* eswitch API stubs */
 static inline int  mlx5_eswitch_init(struct mlx5_core_dev *dev) { return 0; }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 01242afbfcce..14f73c202adf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1834,7 +1834,7 @@ static void __unload_reps_all_vport(struct mlx5_eswitch *esw, u8 rep_type)
 	__esw_offloads_unload_rep(esw, rep, rep_type);
 }
 
-static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
+int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
 {
 	struct mlx5_eswitch_rep *rep;
 	int rep_type;
@@ -1858,7 +1858,7 @@ static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
 	return err;
 }
 
-static void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num)
+void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num)
 {
 	struct mlx5_eswitch_rep *rep;
 	int rep_type;
@@ -2842,3 +2842,35 @@ u32 mlx5_eswitch_get_vport_metadata_for_match(struct mlx5_eswitch *esw,
 	return vport->metadata << (32 - ESW_SOURCE_PORT_METADATA_BITS);
 }
 EXPORT_SYMBOL(mlx5_eswitch_get_vport_metadata_for_match);
+
+int mlx5_esw_offloads_sf_vport_enable(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum)
+{
+	int err;
+
+	err = mlx5_esw_vport_enable(esw, vport_num);
+	if (err)
+		return err;
+
+	err = mlx5_esw_devlink_sf_port_register(esw, dl_port, vport_num, sfnum);
+	if (err)
+		goto devlink_err;
+
+	err = mlx5_esw_offloads_rep_load(esw, vport_num);
+	if (err)
+		goto rep_err;
+	return 0;
+
+rep_err:
+	mlx5_esw_devlink_sf_port_unregister(esw, vport_num);
+devlink_err:
+	mlx5_esw_vport_disable(esw, vport_num);
+	return err;
+}
+
+void mlx5_esw_offloads_sf_vport_disable(struct mlx5_eswitch *esw, u16 vport_num)
+{
+	mlx5_esw_offloads_rep_unload(esw, vport_num);
+	mlx5_esw_devlink_sf_port_unregister(esw, vport_num);
+	mlx5_esw_vport_disable(esw, vport_num);
+}
-- 
2.26.2
* [PATCH net-next 11/13] net/mlx5: SF, Add SF configuration hardware commands
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
                   ` (9 preceding siblings ...)
  2020-11-12 19:24 ` [PATCH net-next 10/13] net/mlx5: E-switch, Add eswitch helpers for " Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 12/13] net/mlx5: SF, Add port add delete functionality Parav Pandit
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC (permalink / raw)
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Vu Pham,
	Parav Pandit
From: Vu Pham <vuhuong@nvidia.com>
Add command helpers to allocate and deallocate an SF, and to enable and
disable its HCA function in the device.
Set the SF HCA capability when SF support is configured and the device
supports it.
Subsequent patches use these helpers.
Signed-off-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |  5 ++
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |  4 ++
 .../net/ethernet/mellanox/mlx5/core/main.c    |  5 ++
 .../net/ethernet/mellanox/mlx5/core/sf/cmd.c  | 48 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sf/priv.h | 14 ++++++
 5 files changed, 76 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 7dd5be49fb9e..911e7bb43b23 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -89,3 +89,8 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 # SF device
 #
 mlx5_core-$(CONFIG_MLX5_SF) += sf/dev/dev.o sf/dev/driver.o
+
+#
+# SF manager
+#
+mlx5_core-$(CONFIG_MLX5_SF_MANAGER) += sf/cmd.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index e49387dbef98..7de8139bc167 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -333,6 +333,7 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
 	case MLX5_CMD_OP_DEALLOC_MEMIC:
 	case MLX5_CMD_OP_PAGE_FAULT_RESUME:
 	case MLX5_CMD_OP_QUERY_ESW_FUNCTIONS:
+	case MLX5_CMD_OP_DEALLOC_SF:
 		return MLX5_CMD_STAT_OK;
 
 	case MLX5_CMD_OP_QUERY_HCA_CAP:
@@ -464,6 +465,7 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
 	case MLX5_CMD_OP_ALLOC_MEMIC:
 	case MLX5_CMD_OP_MODIFY_XRQ:
 	case MLX5_CMD_OP_RELEASE_XRQ_ERROR:
+	case MLX5_CMD_OP_ALLOC_SF:
 		*status = MLX5_DRIVER_STATUS_ABORTED;
 		*synd = MLX5_DRIVER_SYND;
 		return -EIO;
@@ -657,6 +659,8 @@ const char *mlx5_command_str(int command)
 	MLX5_COMMAND_STR_CASE(DESTROY_UMEM);
 	MLX5_COMMAND_STR_CASE(RELEASE_XRQ_ERROR);
 	MLX5_COMMAND_STR_CASE(MODIFY_XRQ);
+	MLX5_COMMAND_STR_CASE(ALLOC_SF);
+	MLX5_COMMAND_STR_CASE(DEALLOC_SF);
 	default: return "unknown command opcode";
 	}
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index adfa21de938e..bd414d93f70e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -567,6 +567,11 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
 	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
 		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
 
+#ifdef CONFIG_MLX5_SF_MANAGER
+	if (MLX5_CAP_GEN_MAX(dev, sf) && MLX5_ESWITCH_MANAGER(dev))
+		MLX5_SET(cmd_hca_cap, set_hca_cap, sf, 1);
+#endif
+
 	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
new file mode 100644
index 000000000000..8dd44a2b2467
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+
+int mlx5_cmd_alloc_sf(struct mlx5_core_dev *dev, u16 function_id)
+{
+	u32 out[MLX5_ST_SZ_DW(alloc_sf_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(alloc_sf_in)] = {};
+
+	MLX5_SET(alloc_sf_in, in, opcode, MLX5_CMD_OP_ALLOC_SF);
+	MLX5_SET(alloc_sf_in, in, function_id, function_id);
+
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id)
+{
+	u32 out[MLX5_ST_SZ_DW(dealloc_sf_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(dealloc_sf_in)] = {};
+
+	MLX5_SET(dealloc_sf_in, in, opcode, MLX5_CMD_OP_DEALLOC_SF);
+	MLX5_SET(dealloc_sf_in, in, function_id, function_id);
+
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+int mlx5_cmd_sf_enable_hca(struct mlx5_core_dev *dev, u16 func_id)
+{
+	u32 out[MLX5_ST_SZ_DW(enable_hca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(enable_hca_in)] = {};
+
+	MLX5_SET(enable_hca_in, in, opcode, MLX5_CMD_OP_ENABLE_HCA);
+	MLX5_SET(enable_hca_in, in, function_id, func_id);
+	MLX5_SET(enable_hca_in, in, embedded_cpu_function, 0);
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+int mlx5_cmd_sf_disable_hca(struct mlx5_core_dev *dev, u16 func_id)
+{
+	u32 out[MLX5_ST_SZ_DW(disable_hca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(disable_hca_in)] = {};
+
+	MLX5_SET(disable_hca_in, in, opcode, MLX5_CMD_OP_DISABLE_HCA);
+	MLX5_SET(disable_hca_in, in, function_id, func_id);
+	MLX5_SET(disable_hca_in, in, embedded_cpu_function, 0);
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
new file mode 100644
index 000000000000..0e39df9f297e
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_SF_PRIV_H__
+#define __MLX5_SF_PRIV_H__
+
+#include <linux/mlx5/driver.h>
+
+int mlx5_cmd_alloc_sf(struct mlx5_core_dev *dev, u16 function_id);
+int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id);
+int mlx5_cmd_sf_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
+int mlx5_cmd_sf_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
+
+#endif
-- 
2.26.2
* [PATCH net-next 12/13] net/mlx5: SF, Add port add delete functionality
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
                   ` (10 preceding siblings ...)
  2020-11-12 19:24 ` [PATCH net-next 11/13] net/mlx5: SF, Add SF configuration hardware commands Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-12 19:24 ` [PATCH net-next 13/13] net/mlx5: SF, Port function state change support Parav Pandit
  2020-11-16 22:52 ` [PATCH net-next 00/13] Add mlx5 subfunction support Jakub Kicinski
  13 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC (permalink / raw)
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit,
	Vu Pham
To handle SF port management outside of the eswitch as an independent
software layer, introduce eswitch notifier APIs. An upper layer that
wishes to support SF port management in switchdev mode can then perform
its task whenever the eswitch mode is set to switchdev or before the
eswitch is disabled.
(a) Initialize the SF port table on such an eswitch event.
(b) Add SF port add and delete functionality in switchdev mode.
(c) Destroy all SF ports when the eswitch is disabled.
(d) Expose SF port add and delete to the user via devlink commands.
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached
$ devlink port show ens2f0npf0sf88 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "inactive",
                "opstate": "detached"
            }
        }
    }
}
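The notifier flow described at the top of this message can be pictured
with a small userspace model. This is only a sketch: the names below
(esw_notifier_register, esw_mode_change_notify, struct sf_table with an
"enabled" flag) are hypothetical stand-ins, and a plain linked list
replaces the kernel's blocking notifier chain. It shows the eswitch
publishing a mode change and the SF layer reacting to it without the
eswitch knowing anything about SFs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; this is not the kernel notifier API. */
enum esw_mode { ESW_NONE, ESW_LEGACY, ESW_OFFLOADS };

struct esw_event_info {
	enum esw_mode new_mode;
};

struct notifier {
	int (*call)(struct notifier *nb, void *data);
	struct notifier *next;
};

struct eswitch {
	struct notifier *head;	/* list of registered listeners */
};

/* Simplified container_of: recover the enclosing struct from a member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void esw_notifier_register(struct eswitch *esw, struct notifier *nb)
{
	assert(nb->call);	/* a listener without a callback is a bug */
	nb->next = esw->head;
	esw->head = nb;
}

/* Called by the eswitch after enabling a mode and before disabling it. */
static void esw_mode_change_notify(struct eswitch *esw, enum esw_mode mode)
{
	struct esw_event_info info = { .new_mode = mode };
	struct notifier *nb;

	for (nb = esw->head; nb; nb = nb->next)
		nb->call(nb, &info);
}

/* The SF layer embeds its notifier and keeps its own state. */
struct sf_table {
	int enabled;
	struct notifier esw_nb;
};

static int sf_esw_event(struct notifier *nb, void *data)
{
	struct sf_table *table = container_of(nb, struct sf_table, esw_nb);
	const struct esw_event_info *info = data;

	switch (info->new_mode) {
	case ESW_OFFLOADS:	/* switchdev mode: SF ports may be added */
		table->enabled = 1;
		break;
	case ESW_NONE:		/* eswitch going away: tear SF ports down */
		table->enabled = 0;
		break;
	default:		/* other modes leave the table untouched */
		break;
	}
	return 0;
}
```

A caller would set table.esw_nb.call = sf_esw_event, register it once at
init time, and from then on only the eswitch side calls
esw_mode_change_notify(). The SF layer stays decoupled from the eswitch
internals.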
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   5 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  25 ++
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  12 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |   9 +
 .../net/ethernet/mellanox/mlx5/core/sf/sf.c   | 370 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  17 +
 include/linux/mlx5/driver.h                   |   4 +
 8 files changed, 443 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.c
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 911e7bb43b23..b1d7f193375a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -93,4 +93,4 @@ mlx5_core-$(CONFIG_MLX5_SF) += sf/dev/dev.o sf/dev/driver.o
 #
 # SF manager
 #
-mlx5_core-$(CONFIG_MLX5_SF_MANAGER) += sf/cmd.o
+mlx5_core-$(CONFIG_MLX5_SF_MANAGER) += sf/cmd.o sf/sf.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index aeffb6b135ee..7ad8dc26cb74 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -7,6 +7,7 @@
 #include "fw_reset.h"
 #include "fs_core.h"
 #include "eswitch.h"
+#include "sf/sf.h"
 
 static int mlx5_devlink_flash_update(struct devlink *devlink,
 				     struct devlink_flash_update_params *params,
@@ -187,6 +188,10 @@ static const struct devlink_ops mlx5_devlink_ops = {
 	.eswitch_encap_mode_get = mlx5_devlink_eswitch_encap_mode_get,
 	.port_function_hw_addr_get = mlx5_devlink_port_function_hw_addr_get,
 	.port_function_hw_addr_set = mlx5_devlink_port_function_hw_addr_set,
+#endif
+#ifdef CONFIG_MLX5_SF_MANAGER
+	.port_new = mlx5_devlink_sf_port_new,
+	.port_del = mlx5_devlink_sf_port_del,
 #endif
 	.flash_update = mlx5_devlink_flash_update,
 	.info_get = mlx5_devlink_info_get,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index d72766b78bd7..25f8c0918fca 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1585,6 +1585,15 @@ mlx5_eswitch_update_num_of_vfs(struct mlx5_eswitch *esw, int num_vfs)
 	kvfree(out);
 }
 
+static void mlx5_esw_mode_change_notify(struct mlx5_eswitch *esw, u16 mode)
+{
+	struct mlx5_esw_event_info info = {};
+
+	info.new_mode = mode;
+
+	blocking_notifier_call_chain(&esw->n_head, 0, &info);
+}
+
 /**
  * mlx5_eswitch_enable_locked - Enable eswitch
  * @esw:	Pointer to eswitch
@@ -1645,6 +1654,8 @@ int mlx5_eswitch_enable_locked(struct mlx5_eswitch *esw, int mode, int num_vfs)
 		 mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS",
 		 esw->esw_funcs.num_vfs, esw->enabled_vports);
 
+	mlx5_esw_mode_change_notify(esw, mode);
+
 	return 0;
 
 abort:
@@ -1701,6 +1712,11 @@ void mlx5_eswitch_disable_locked(struct mlx5_eswitch *esw, bool clear_vf)
 		 esw->mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS",
 		 esw->esw_funcs.num_vfs, esw->enabled_vports);
 
+	/* Notify eswitch users that the eswitch is leaving its current mode,
+	 * so that they can do the necessary cleanup before it is disabled.
+	 */
+	mlx5_esw_mode_change_notify(esw, MLX5_ESWITCH_NONE);
+
 	mlx5_eswitch_event_handlers_unregister(esw);
 
 	if (esw->mode == MLX5_ESWITCH_LEGACY)
@@ -1801,6 +1817,7 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
 	esw->offloads.inline_mode = MLX5_INLINE_MODE_NONE;
 
 	dev->priv.eswitch = esw;
+	BLOCKING_INIT_NOTIFIER_HEAD(&esw->n_head);
 	return 0;
 abort:
 	if (esw->work_queue)
@@ -2493,4 +2510,12 @@ bool mlx5_esw_multipath_prereq(struct mlx5_core_dev *dev0,
 		dev1->priv.eswitch->mode == MLX5_ESWITCH_OFFLOADS);
 }
 
+int mlx5_esw_event_notifier_register(struct mlx5_eswitch *esw, struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&esw->n_head, nb);
+}
 
+void mlx5_esw_event_notifier_unregister(struct mlx5_eswitch *esw, struct notifier_block *nb)
+{
+	blocking_notifier_chain_unregister(&esw->n_head, nb);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 3a373f314a6b..fb26690a0ad1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -278,6 +278,7 @@ struct mlx5_eswitch {
 	struct {
 		u32             large_group_num;
 	}  params;
+	struct blocking_notifier_head n_head;
 };
 
 void esw_offloads_disable(struct mlx5_eswitch *esw);
@@ -737,6 +738,17 @@ int mlx5_esw_offloads_sf_vport_enable(struct mlx5_eswitch *esw, struct devlink_p
 				      u16 vport_num, u32 sfnum);
 void mlx5_esw_offloads_sf_vport_disable(struct mlx5_eswitch *esw, u16 vport_num);
 
+/**
+ * struct mlx5_esw_event_info - Indicates eswitch mode changed/changing.
+ *
+ * @new_mode: New mode of eswitch.
+ */
+struct mlx5_esw_event_info {
+	u16 new_mode;
+};
+
+int mlx5_esw_event_notifier_register(struct mlx5_eswitch *esw, struct notifier_block *n);
+void mlx5_esw_event_notifier_unregister(struct mlx5_eswitch *esw, struct notifier_block *n);
 #else  /* CONFIG_MLX5_ESWITCH */
 /* eswitch API stubs */
 static inline int  mlx5_eswitch_init(struct mlx5_core_dev *dev) { return 0; }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index bd414d93f70e..f6570ce9393f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -889,6 +889,12 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 		goto err_eswitch_cleanup;
 	}
 
+	err = mlx5_sf_table_init(dev);
+	if (err) {
+		mlx5_core_err(dev, "Failed to init SF table %d\n", err);
+		goto err_sf_table_cleanup;
+	}
+
 	err = mlx5_sf_dev_table_init(dev);
 	if (err) {
 		mlx5_core_err(dev, "Failed to init SF device table %d\n", err);
@@ -906,6 +912,8 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 	return 0;
 
 err_sf_dev_cleanup:
+	mlx5_sf_table_cleanup(dev);
+err_sf_table_cleanup:
 	mlx5_fpga_cleanup(dev);
 err_eswitch_cleanup:
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
@@ -939,6 +947,7 @@ static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
 	mlx5_fw_tracer_destroy(dev->tracer);
 	mlx5_dm_cleanup(dev);
 	mlx5_sf_dev_table_cleanup(dev);
+	mlx5_sf_table_cleanup(dev);
 	mlx5_fpga_cleanup(dev);
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
 	mlx5_sriov_cleanup(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.c
new file mode 100644
index 000000000000..dff44ab5057d
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include "eswitch.h"
+#include "priv.h"
+
+struct mlx5_sf {
+	struct devlink_port dl_port;
+	unsigned int port_index;
+	u32 usr_sfnum;
+	u16 sw_id;
+	u16 hw_fn_id;
+};
+
+struct mlx5_sf_table {
+	struct mlx5_core_dev *dev; /* To refer from notifier context. */
+	struct xarray port_indices; /* port index based lookup. */
+	struct ida fn_ida; /* allocator based on firmware limits. */
+	int ida_max_ports;
+	refcount_t refcount;
+	struct completion disable_complete;
+	struct notifier_block esw_nb;
+};
+
+static u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id)
+{
+	return sw_id + mlx5_esw_sf_start_vport_num(dev);
+}
+
+static int mlx5_sf_id_alloc(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	u16 hw_fn_id;
+	int sw_id;
+	int err;
+
+	if (!table->ida_max_ports)
+		return -EOPNOTSUPP;
+	sw_id = ida_alloc_max(&table->fn_ida, table->ida_max_ports - 1, GFP_KERNEL);
+	if (sw_id < 0)
+		return sw_id;
+	sf->sw_id = sw_id;
+
+	hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, sw_id);
+	err = mlx5_cmd_alloc_sf(table->dev, hw_fn_id);
+	if (err)
+		goto cmd_err;
+
+	sf->hw_fn_id = hw_fn_id;
+	return 0;
+
+cmd_err:
+	ida_free(&table->fn_ida, sf->sw_id);
+	return err;
+}
+
+static void mlx5_sf_id_free(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	mlx5_cmd_dealloc_sf(table->dev, sf->hw_fn_id);
+	ida_free(&table->fn_ida, sf->sw_id);
+}
+
+static struct mlx5_sf *
+mlx5_sf_lookup_by_index(struct mlx5_sf_table *table, unsigned int port_index)
+{
+	return xa_load(&table->port_indices, port_index);
+}
+
+static struct mlx5_sf *
+mlx5_sf_lookup_by_sfnum(struct mlx5_sf_table *table, u32 usr_sfnum)
+{
+	unsigned long index;
+	struct mlx5_sf *sf;
+
+	xa_for_each(&table->port_indices, index, sf) {
+		if (sf->usr_sfnum == usr_sfnum)
+			return sf;
+	}
+	return NULL;
+}
+
+static int mlx5_sf_id_insert(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	int err;
+
+	err = xa_insert(&table->port_indices, sf->port_index, sf, GFP_KERNEL);
+	/* After this stage, SF can be queried by devlink user by its port index. */
+	return err;
+}
+
+static void mlx5_sf_id_erase(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	xa_erase(&table->port_indices, sf->port_index);
+}
+
+static struct mlx5_sf *
+mlx5_sf_alloc(struct mlx5_sf_table *table, u32 sfnum, struct netlink_ext_ack *extack)
+{
+	unsigned int dl_port_index;
+	struct mlx5_sf *sf;
+	int err;
+
+	sf = mlx5_sf_lookup_by_sfnum(table, sfnum);
+	if (sf) {
+		NL_SET_ERR_MSG_MOD(extack, "SF already exists. Choose a different sfnum");
+		err = -EEXIST;
+		goto err;
+	}
+	sf = kzalloc(sizeof(*sf), GFP_KERNEL);
+	if (!sf) {
+		err = -ENOMEM;
+		goto err;
+	}
+	err = mlx5_sf_id_alloc(table, sf);
+	if (err)
+		goto id_err;
+
+	dl_port_index = mlx5_esw_vport_to_devlink_port_index(table->dev, sf->hw_fn_id);
+	sf->port_index = dl_port_index;
+	sf->usr_sfnum = sfnum;
+
+	err = mlx5_sf_id_insert(table, sf);
+	if (err)
+		goto insert_err;
+
+	return sf;
+
+insert_err:
+	mlx5_sf_id_free(table, sf);
+id_err:
+	kfree(sf);
+err:
+	return ERR_PTR(err);
+}
+
+static void mlx5_sf_free(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	mlx5_sf_id_erase(table, sf);
+	mlx5_sf_id_free(table, sf);
+	kfree(sf);
+}
+
+static struct mlx5_sf_table *mlx5_sf_table_try_get(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_table *table = dev->priv.sf_table;
+
+	if (!table)
+		return NULL;
+
+	return refcount_inc_not_zero(&table->refcount) ? table : NULL;
+}
+
+static void mlx5_sf_table_put(struct mlx5_sf_table *table)
+{
+	if (refcount_dec_and_test(&table->refcount))
+		complete(&table->disable_complete);
+}
+
+static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
+		       const struct devlink_port_new_attrs *new_attr,
+		       struct netlink_ext_ack *extack)
+{
+	struct mlx5_eswitch *esw = dev->priv.eswitch;
+	struct mlx5_sf *sf;
+	u16 hw_fn_id;
+	int err;
+
+	sf = mlx5_sf_alloc(table, new_attr->sfnum, extack);
+	if (IS_ERR(sf))
+		return PTR_ERR(sf);
+
+	hw_fn_id = sf->hw_fn_id;
+	err = mlx5_esw_offloads_sf_vport_enable(esw, &sf->dl_port, hw_fn_id, new_attr->sfnum);
+	if (err)
+		goto esw_err;
+	return 0;
+
+esw_err:
+	mlx5_sf_free(table, sf);
+	return err;
+}
+
+static void mlx5_sf_del(struct mlx5_core_dev *dev, struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	struct mlx5_eswitch *esw = dev->priv.eswitch;
+
+	mlx5_esw_offloads_sf_vport_disable(esw, sf->hw_fn_id);
+	mlx5_sf_free(table, sf);
+}
+
+static int
+mlx5_sf_new_check_attr(struct mlx5_core_dev *dev, const struct devlink_port_new_attrs *new_attr,
+		       struct netlink_ext_ack *extack)
+{
+	if (new_attr->flavour != DEVLINK_PORT_FLAVOUR_PCI_SF) {
+		NL_SET_ERR_MSG_MOD(extack, "Driver supports only SF port addition.");
+		return -EOPNOTSUPP;
+	}
+	if (new_attr->port_index_valid) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Driver does not support user defined port index assignment.");
+		return -EOPNOTSUPP;
+	}
+	if (!new_attr->sfnum_valid) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "User must provide unique sfnum. Driver does not support auto assignment.");
+		return -EOPNOTSUPP;
+	}
+	if (new_attr->controller_valid && new_attr->controller) {
+		NL_SET_ERR_MSG_MOD(extack, "External controller is unsupported.");
+		return -EOPNOTSUPP;
+	}
+	if (new_attr->pfnum != PCI_FUNC(dev->pdev->devfn)) {
+		NL_SET_ERR_MSG_MOD(extack, "Invalid pfnum supplied.");
+		return -EOPNOTSUPP;
+	}
+	return 0;
+}
+
+int mlx5_devlink_sf_port_new(struct devlink *devlink, const struct devlink_port_new_attrs *new_attr,
+			     struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	int err;
+
+	err = mlx5_sf_new_check_attr(dev, new_attr, extack);
+	if (err)
+		return err;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Port add is only supported in eswitch switchdev mode with SF ports enabled.");
+		return -EOPNOTSUPP;
+	}
+	err = mlx5_sf_add(dev, table, new_attr, extack);
+	mlx5_sf_table_put(table);
+	return err;
+}
+
+int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
+			     struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	struct mlx5_sf *sf;
+	int err = 0;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Port del is only supported in eswitch switchdev mode with SF ports enabled.");
+		return -EOPNOTSUPP;
+	}
+	sf = mlx5_sf_lookup_by_index(table, port_index);
+	if (!sf) {
+		err = -ENODEV;
+		goto sf_err;
+	}
+
+	mlx5_sf_del(dev, table, sf);
+sf_err:
+	mlx5_sf_table_put(table);
+	return err;
+}
+
+static void mlx5_sf_destroy_all(struct mlx5_sf_table *table)
+{
+	struct mlx5_core_dev *dev = table->dev;
+	unsigned long index;
+	struct mlx5_sf *sf;
+
+	xa_for_each(&table->port_indices, index, sf)
+		mlx5_sf_del(dev, table, sf);
+}
+
+static void mlx5_sf_table_enable(struct mlx5_sf_table *table)
+{
+	if (!mlx5_sf_max_ports(table->dev))
+		return;
+
+	xa_init(&table->port_indices);
+	table->ida_max_ports = mlx5_sf_max_ports(table->dev);
+	init_completion(&table->disable_complete);
+	refcount_set(&table->refcount, 1);
+}
+
+static void mlx5_sf_table_disable(struct mlx5_sf_table *table)
+{
+	if (!mlx5_sf_max_ports(table->dev))
+		return;
+
+	if (!refcount_read(&table->refcount))
+		return;
+
+	/* Balances with refcount_set; drop the reference so that new user cmd cannot start. */
+	mlx5_sf_table_put(table);
+	wait_for_completion(&table->disable_complete);
+
+	/* At this point, no new user commands can start.
+	 * It is safe to destroy all user created SFs.
+	 */
+	mlx5_sf_destroy_all(table);
+	WARN_ON(!xa_empty(&table->port_indices));
+}
+
+static int mlx5_sf_esw_event(struct notifier_block *nb, unsigned long event, void *data)
+{
+	struct mlx5_sf_table *table = container_of(nb, struct mlx5_sf_table, esw_nb);
+	const struct mlx5_esw_event_info *mode = data;
+
+	switch (mode->new_mode) {
+	case MLX5_ESWITCH_OFFLOADS:
+		mlx5_sf_table_enable(table);
+		break;
+	case MLX5_ESWITCH_NONE:
+		mlx5_sf_table_disable(table);
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static bool mlx5_sf_table_supported(const struct mlx5_core_dev *dev)
+{
+	return dev->priv.eswitch && MLX5_ESWITCH_MANAGER(dev) && mlx5_sf_supported(dev);
+}
+
+int mlx5_sf_table_init(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_table *table;
+	int err;
+
+	if (!mlx5_sf_table_supported(dev))
+		return 0;
+
+	table = kzalloc(sizeof(*table), GFP_KERNEL);
+	if (!table)
+		return -ENOMEM;
+
+	table->dev = dev;
+	ida_init(&table->fn_ida);
+	dev->priv.sf_table = table;
+	table->esw_nb.notifier_call = mlx5_sf_esw_event;
+	err = mlx5_esw_event_notifier_register(dev->priv.eswitch, &table->esw_nb);
+	if (err)
+		goto reg_err;
+	return 0;
+
+reg_err:
+	kfree(table);
+	dev->priv.sf_table = NULL;
+	return err;
+}
+
+void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_table *table = dev->priv.sf_table;
+
+	if (!table)
+		return;
+
+	mlx5_esw_event_notifier_unregister(dev->priv.eswitch, &table->esw_nb);
+	WARN_ON(refcount_read(&table->refcount));
+	WARN_ON(!ida_is_empty(&table->fn_ida));
+	kfree(table);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
index 7b9071a865ce..555b19a5880d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
@@ -18,6 +18,14 @@ static inline u16 mlx5_sf_max_ports(const struct mlx5_core_dev *dev)
 	return mlx5_sf_supported(dev) ? 1 << MLX5_CAP_GEN(dev, log_max_sf) : 0;
 }
 
+int mlx5_sf_table_init(struct mlx5_core_dev *dev);
+void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev);
+
+int mlx5_devlink_sf_port_new(struct devlink *devlink, const struct devlink_port_new_attrs *new_attr,
+			     struct netlink_ext_ack *extack);
+int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
+			     struct netlink_ext_ack *extack);
+
 #else
 
 static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev)
@@ -30,6 +38,15 @@ static inline u16 mlx5_sf_max_ports(const struct mlx5_core_dev *dev)
 	return 0;
 }
 
+static inline int mlx5_sf_table_init(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev)
+{
+}
+
 #endif
 
 #endif
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index f3104b50ade5..e625cb20ad76 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -509,6 +509,7 @@ struct mlx5_fw_reset;
 struct mlx5_eq_table;
 struct mlx5_irq_table;
 struct mlx5_sf_dev_table;
+struct mlx5_sf_table;
 
 struct mlx5_rate_limit {
 	u32			rate;
@@ -609,6 +610,9 @@ struct mlx5_priv {
 	struct mlx5_sf_dev_table *sf_dev_table;
 	struct mlx5_core_dev *parent_mdev;
 #endif
+#ifdef CONFIG_MLX5_SF_MANAGER
+	struct mlx5_sf_table *sf_table;
+#endif
 };
 
 enum mlx5_device_state {
-- 
2.26.2
* [PATCH net-next 13/13] net/mlx5: SF, Port function state change support
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
                   ` (11 preceding siblings ...)
  2020-11-12 19:24 ` [PATCH net-next 12/13] net/mlx5: SF, Add port add delete functionality Parav Pandit
@ 2020-11-12 19:24 ` Parav Pandit
  2020-11-16 22:52 ` [PATCH net-next 00/13] Add mlx5 subfunction support Jakub Kicinski
  13 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-12 19:24 UTC (permalink / raw)
  To: netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Parav Pandit,
	Vu Pham
Support changing the state of the SF port's function through devlink.
When activating the SF port's function, enable the HCA in the device and
then add its auxiliary device.
When deactivating the SF port's function, delete its auxiliary device and
then disable the HCA.
Port function attribute get/set callbacks are invoked with the devlink
instance lock held. Such callbacks need to synchronize with the SF port
table being disabled. While operating on the devlink port, these
callbacks hold a table refcount to synchronize with the table disable
context.
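The refcount-based synchronization described above can be modelled in
userspace as follows. This is a sketch under stated assumptions, not the
driver code: struct sf_table_model and the table_* helpers are
hypothetical names, C11 atomics stand in for refcount_t, and a boolean
flag stands in for the completion that the disable path waits on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the table lifetime. */
struct sf_table_model {
	atomic_int refcount;	/* 0 means the table is disabled */
	bool disable_complete;	/* stands in for struct completion */
};

/* Enabling the table installs the initial reference. */
static void table_enable(struct sf_table_model *t)
{
	t->disable_complete = false;
	atomic_store(&t->refcount, 1);
}

/* Take a reference only while the table is enabled (refcount != 0). */
static bool table_try_get(struct sf_table_model *t)
{
	int old = atomic_load(&t->refcount);

	while (old != 0) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&t->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last put signals the waiting disable path. */
static void table_put(struct sf_table_model *t)
{
	int old = atomic_fetch_sub(&t->refcount, 1);

	assert(old > 0);	/* a put without a matching get is a bug */
	if (old == 1)
		t->disable_complete = true;
}

/* Disable drops the initial reference; the real code then waits for
 * completion before destroying the ports.
 */
static void table_disable_begin(struct sf_table_model *t)
{
	table_put(t);
}
```

In this model a callback brackets its work with table_try_get() and
table_put(). Once table_disable_begin() has run, in-flight callbacks
finish normally, the last table_put() marks the disable as complete, and
any later table_try_get() fails, so no new callback can touch the table.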
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached
$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active
$ devlink port show ens2f0npf0sf88 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
On port function activation, an auxiliary device is created, as shown in
the example below.
$ devlink dev show
devlink dev show auxiliary/mlx5_core.sf.0
$ devlink port show auxiliary/mlx5_core.sf.0/1
auxiliary/mlx5_core.sf.0/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   2 +
 .../net/ethernet/mellanox/mlx5/core/sf/sf.c   | 128 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |   7 +
 3 files changed, 137 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 7ad8dc26cb74..22d22959e6f6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -192,6 +192,8 @@ static const struct devlink_ops mlx5_devlink_ops = {
 #ifdef CONFIG_MLX5_SF_MANAGER
 	.port_new = mlx5_devlink_sf_port_new,
 	.port_del = mlx5_devlink_sf_port_del,
+	.port_function_state_get = mlx5_devlink_sf_port_fn_state_get,
+	.port_function_state_set = mlx5_devlink_sf_port_fn_state_set,
 #endif
 	.flash_update = mlx5_devlink_flash_update,
 	.info_get = mlx5_devlink_info_get,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.c
index dff44ab5057d..7e90629fd910 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.c
@@ -4,6 +4,7 @@
 #include <linux/mlx5/driver.h>
 #include "eswitch.h"
 #include "priv.h"
+#include "sf/dev/dev.h"
 
 struct mlx5_sf {
 	struct devlink_port dl_port;
@@ -11,6 +12,7 @@ struct mlx5_sf {
 	u32 usr_sfnum;
 	u16 sw_id;
 	u16 hw_fn_id;
+	enum devlink_port_function_state state;
 };
 
 struct mlx5_sf_table {
@@ -115,6 +117,7 @@ mlx5_sf_alloc(struct mlx5_sf_table *table, u32 sfnum, struct netlink_ext_ack *ex
 	if (err)
 		goto id_err;
 
+	sf->state = DEVLINK_PORT_FUNCTION_STATE_INACTIVE;
 	dl_port_index = mlx5_esw_vport_to_devlink_port_index(table->dev, sf->hw_fn_id);
 	sf->port_index = dl_port_index;
 	sf->usr_sfnum = sfnum;
@@ -156,6 +159,126 @@ static void mlx5_sf_table_put(struct mlx5_sf_table *table)
 		complete(&table->disable_complete);
 }
 
+static int
+mlx5_sf_state_get(struct mlx5_core_dev *dev, struct mlx5_sf *sf,
+		  enum devlink_port_function_state *state,
+		  enum devlink_port_function_opstate *opstate)
+{
+	int err = 0;
+
+	*state = sf->state;
+	switch (sf->state) {
+	case DEVLINK_PORT_FUNCTION_STATE_ACTIVE:
+		*opstate = DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED;
+		break;
+	case DEVLINK_PORT_FUNCTION_STATE_INACTIVE:
+		*opstate = DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED;
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+	return err;
+}
+
+int mlx5_devlink_sf_port_fn_state_get(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state *state,
+				      enum devlink_port_function_opstate *opstate,
+				      struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	int err = -EOPNOTSUPP;
+	struct mlx5_sf *sf;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table)
+		return -EOPNOTSUPP;
+
+	sf = mlx5_sf_lookup_by_index(table, dl_port->index);
+	if (!sf)
+		goto sf_err;
+	err = mlx5_sf_state_get(dev, sf, state, opstate);
+sf_err:
+	mlx5_sf_table_put(table);
+	return err;
+}
+
+static int mlx5_sf_activate(struct mlx5_core_dev *dev, struct mlx5_sf *sf)
+{
+	int err;
+
+	err = mlx5_cmd_sf_enable_hca(dev, sf->hw_fn_id);
+	if (err)
+		return err;
+
+	err = mlx5_sf_dev_add(dev, sf->sw_id, sf->usr_sfnum);
+	if (err)
+		goto dev_err;
+
+	sf->state = DEVLINK_PORT_FUNCTION_STATE_ACTIVE;
+	return 0;
+
+dev_err:
+	mlx5_cmd_sf_disable_hca(dev, sf->hw_fn_id);
+	return err;
+}
+
+static int mlx5_sf_deactivate(struct mlx5_core_dev *dev, struct mlx5_sf *sf)
+{
+	int err;
+
+	mlx5_sf_dev_del(dev, sf->sw_id);
+	err = mlx5_cmd_sf_disable_hca(dev, sf->hw_fn_id);
+	if (err)
+		return err;
+	sf->state = DEVLINK_PORT_FUNCTION_STATE_INACTIVE;
+	return 0;
+}
+
+static int mlx5_sf_state_set(struct mlx5_core_dev *dev, struct mlx5_sf *sf,
+			     enum devlink_port_function_state state)
+{
+	int err;
+
+	if (sf->state == state)
+		return 0;
+	if (state == DEVLINK_PORT_FUNCTION_STATE_ACTIVE)
+		err = mlx5_sf_activate(dev, sf);
+	else if (state == DEVLINK_PORT_FUNCTION_STATE_INACTIVE)
+		err = mlx5_sf_deactivate(dev, sf);
+	else
+		err = -EINVAL;
+	return err;
+}
+
+int mlx5_devlink_sf_port_fn_state_set(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state state,
+				      struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	struct mlx5_sf *sf;
+	int err;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Port state set is only supported in eswitch switchdev mode or SF ports are disabled.");
+		return -EOPNOTSUPP;
+	}
+	sf = mlx5_sf_lookup_by_index(table, dl_port->index);
+	if (!sf) {
+		err = -ENODEV;
+		goto out;
+	}
+
+	err = mlx5_sf_state_set(dev, sf, state);
+out:
+	mlx5_sf_table_put(table);
+	return err;
+}
+
 static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
 		       const struct devlink_port_new_attrs *new_attr,
 		       struct netlink_ext_ack *extack)
@@ -184,6 +307,10 @@ static void mlx5_sf_del(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
 {
 	struct mlx5_eswitch *esw = dev->priv.eswitch;
 
+	if (sf->state == DEVLINK_PORT_FUNCTION_STATE_ACTIVE) {
+		mlx5_sf_dev_del(dev, sf->sw_id);
+		mlx5_cmd_sf_disable_hca(dev, sf->hw_fn_id);
+	}
 	mlx5_esw_offloads_sf_vport_disable(esw, sf->hw_fn_id);
 	mlx5_sf_free(table, sf);
 }
@@ -343,6 +470,7 @@ int mlx5_sf_table_init(struct mlx5_core_dev *dev)
 
 	table->dev = dev;
 	ida_init(&table->fn_ida);
+	refcount_set(&table->refcount, 0);
 	dev->priv.sf_table = table;
 	table->esw_nb.notifier_call = mlx5_sf_esw_event;
 	err = mlx5_esw_event_notifier_register(dev->priv.eswitch, &table->esw_nb);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
index 555b19a5880d..3d1c459b9936 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
@@ -25,6 +25,13 @@ int mlx5_devlink_sf_port_new(struct devlink *devlink, const struct devlink_port_
 			     struct netlink_ext_ack *extack);
 int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
 			     struct netlink_ext_ack *extack);
+int mlx5_devlink_sf_port_fn_state_get(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state *state,
+				      enum devlink_port_function_opstate *opstate,
+				      struct netlink_ext_ack *extack);
+int mlx5_devlink_sf_port_fn_state_set(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state state,
+				      struct netlink_ext_ack *extack);
 
 #else
 
-- 
2.26.2
^ permalink raw reply related	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
                   ` (12 preceding siblings ...)
  2020-11-12 19:24 ` [PATCH net-next 13/13] net/mlx5: SF, Port function state change support Parav Pandit
@ 2020-11-16 22:52 ` Jakub Kicinski
  2020-11-17  0:06   ` Saeed Mahameed
  13 siblings, 1 reply; 57+ messages in thread
From: Jakub Kicinski @ 2020-11-16 22:52 UTC (permalink / raw)
  To: Parav Pandit
  Cc: netdev, linux-rdma, gregkh, jiri, jgg, dledford, leonro, saeedm,
	davem
On Thu, 12 Nov 2020 21:24:10 +0200 Parav Pandit wrote:
> This series introduces support for mlx5 subfunction (SF).
> A subfunction is a portion of a PCI device that supports multiple
> classes of devices such as netdev, RDMA and more.
> 
> This patchset is based on Leon's series [3].
> It is a third user of proposed auxiliary bus [4].
> 
> Subfunction support is discussed in detail in RFC [1] and [2].
> RFC [1] and extension [2] describes requirements, design, and proposed
> plumbing using devlink, auxiliary bus and sysfs for systemd/udev
> support.
So we're going to have two ways of adding subdevs? Via devlink and via
the new vdpa netlink thing?
Question number two - is this supposed to be ready to be applied to
net-next? It seems there is a conflict.
Also could you please wrap your code at 80 chars?
Thanks.
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-16 22:52 ` [PATCH net-next 00/13] Add mlx5 subfunction support Jakub Kicinski
@ 2020-11-17  0:06   ` Saeed Mahameed
  2020-11-17  1:58     ` Jakub Kicinski
  0 siblings, 1 reply; 57+ messages in thread
From: Saeed Mahameed @ 2020-11-17  0:06 UTC (permalink / raw)
  To: Jakub Kicinski, Parav Pandit
  Cc: netdev, linux-rdma, gregkh, jiri, jgg, dledford, leonro, davem
On Mon, 2020-11-16 at 14:52 -0800, Jakub Kicinski wrote:
> On Thu, 12 Nov 2020 21:24:10 +0200 Parav Pandit wrote:
> > This series introduces support for mlx5 subfunction (SF).
> > A subfunction is a portion of a PCI device that supports multiple
> > classes of devices such as netdev, RDMA and more.
> > 
> > This patchset is based on Leon's series [3].
> > It is a third user of proposed auxiliary bus [4].
> > 
> > Subfunction support is discussed in detail in RFC [1] and [2].
> > RFC [1] and extension [2] describes requirements, design, and
> > proposed
> > plumbing using devlink, auxiliary bus and sysfs for systemd/udev
> > support.
> 
> So we're going to have two ways of adding subdevs? Via devlink and
> via
> the new vdpa netlink thing?
> 
Via devlink you add the Sub-function bus device - think of it as
spawning a new VF - but it has no actual characteristics
(netdev/vdpa/rdma) "yet" until the user admin decides to load an interface
on it via aux sysfs.
Basically devlink adds a new eswitch port (the SF port) and loading the
drivers and the interfaces is done via the auxbus subsystem only after
the SF is spawned by FW.
> Question number two - is this supposed to be ready to be applied to
> net-next? It seems there is a conflict.
> 
This series requires other mlx5 and auxbus infrastructure dependencies
that were already submitted by Leon 2-3 weeks ago and are pending Greg's
review. Once finalized, they will be merged into mlx5-next; then I will
ask you to pull mlx5-next, and only after that can you apply this series
cleanly to net-next. Sorry for the mess, but we had to move forward and
show how the auxdev subsystem is actually being used.
Leon's series:
https://patchwork.ozlabs.org/project/netdev/cover/20201101201542.2027568-1-leon@kernel.org/
> Also could you please wrap your code at 80 chars?
> 
I prefer not to do this in mlx5; in mlx5 we follow a 95 chars rule.
But if you insist :) .. 
Thanks,
Saeed.
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-17  0:06   ` Saeed Mahameed
@ 2020-11-17  1:58     ` Jakub Kicinski
  2020-11-17  4:08       ` Parav Pandit
  0 siblings, 1 reply; 57+ messages in thread
From: Jakub Kicinski @ 2020-11-17  1:58 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Parav Pandit, netdev, linux-rdma, gregkh, jiri, jgg, dledford,
	leonro, davem
On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote:
> On Mon, 2020-11-16 at 14:52 -0800, Jakub Kicinski wrote:
> > On Thu, 12 Nov 2020 21:24:10 +0200 Parav Pandit wrote:  
> > > This series introduces support for mlx5 subfunction (SF).
> > > A subfunction is a portion of a PCI device that supports multiple
> > > classes of devices such as netdev, RDMA and more.
> > > 
> > > This patchset is based on Leon's series [3].
> > > It is a third user of proposed auxiliary bus [4].
> > > 
> > > Subfunction support is discussed in detail in RFC [1] and [2].
> > > RFC [1] and extension [2] describes requirements, design, and
> > > proposed
> > > plumbing using devlink, auxiliary bus and sysfs for systemd/udev
> > > support.  
> > 
> > So we're going to have two ways of adding subdevs? Via devlink and
> > via the new vdpa netlink thing?
> 
> Via devlink you add the Sub-function bus device - think of it as
> spawning a new VF - but has no actual characteristics
> (netdev/vdpa/rdma) "yet" until user admin decides to load an interface
> on it via aux sysfs.
By which you mean it doesn't get probed or the device type is not set
(IOW it can still become a block device or netdev depending on the vdpa
request)?
> Basically devlink adds a new eswitch port (the SF port) and loading the
> drivers and the interfaces is done via the auxbus subsystem only after
> the SF is spawned by FW.
But why?
Is this for the SmartNIC / bare metal case? The flow for spawning on
the local host gets highly convoluted.
> > Also could you please wrap your code at 80 chars?
> 
> I prefer not to do this in mlx5; in mlx5 we follow a 95 chars rule.
> But if you insist :) .. 
Oh yeah, I meant the devlink patches!
^ permalink raw reply	[flat|nested] 57+ messages in thread
* RE: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-17  1:58     ` Jakub Kicinski
@ 2020-11-17  4:08       ` Parav Pandit
  2020-11-17 17:11         ` Jakub Kicinski
  0 siblings, 1 reply; 57+ messages in thread
From: Parav Pandit @ 2020-11-17  4:08 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, Jason Gunthorpe,
	dledford@redhat.com, Leon Romanovsky, davem@davemloft.net
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Tuesday, November 17, 2020 7:28 AM
> 
> On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote:
> > > > Subfunction support is discussed in detail in RFC [1] and [2].
> > > > RFC [1] and extension [2] describes requirements, design, and
> > > > proposed plumbing using devlink, auxiliary bus and sysfs for
> > > > systemd/udev support.
> > >
> > > So we're going to have two ways of adding subdevs? Via devlink and
> > > via the new vdpa netlink thing?
Nop.
Subfunctions (subdevs) are added in only one way,
i.e. via devlink port, as settled in RFC [1].
Just to refresh everyone's memory, we discussed and settled on the flow in [2];
RFC [1] followed this discussion.
The vdpa tool of [3] can add one or more vdpa device(s) on top of an already spawned PF, VF or SF device.
> >
> > Via devlink you add the Sub-function bus device - think of it as
> > spawning a new VF - but has no actual characteristics
> > (netdev/vdpa/rdma) "yet" until user admin decides to load an interface
> > on it via aux sysfs.
> 
> By which you mean it doesn't get probed or the device type is not set (IOW it can
> still become a block device or netdev depending on the vdpa request)?
> 
> > Basically devlink adds a new eswitch port (the SF port) and loading
> > the drivers and the interfaces is done via the auxbus subsystem only
> > after the SF is spawned by FW.
> 
> But why?
> 
> Is this for the SmartNIC / bare metal case? The flow for spawning on the local
> host gets highly convoluted.
> 
The spawning flow is the same for (a) the local host and (b) an external host controller from a SmartNIC.
$ devlink port add..
[..]
Followed by
$ devlink port function set state...
The only change would be to specify the destination where it is spawned (controller number, pf, sf num etc).
Please refer to the detailed examples in the individual patches.
Patches 12 and 13 mostly cover the complete view.
> > > Also could you please wrap your code at 80 chars?
> >
> > I prefer not to do this in mlx5; in mlx5 we follow a 95 chars rule.
> > But if you insist :) ..
> 
> Oh yeah, I meant the devlink patches!
May I ask why?
The past few devlink patches [4] followed the 100 chars rule. When did we revert to 80?
If we did, any pointers to the thread for 80? checkpatch.pl in --strict mode didn't complain when I prepared the patches.
[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://lore.kernel.org/netdev/20200324132044.GI20941@ziepe.ca/
[3] https://lists.linuxfoundation.org/pipermail/virtualization/2020-November/050623.html
[4] commits dc64cc7c6310, 77069ba2e3ad, a1e8ae907c8d, 2a916ecc4056, ba356c90985d
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-17  4:08       ` Parav Pandit
@ 2020-11-17 17:11         ` Jakub Kicinski
  2020-11-17 18:49           ` Jason Gunthorpe
  2020-11-17 18:50           ` Parav Pandit
  0 siblings, 2 replies; 57+ messages in thread
From: Jakub Kicinski @ 2020-11-17 17:11 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Saeed Mahameed, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org,
	Jiri Pirko, Jason Gunthorpe, dledford@redhat.com, Leon Romanovsky,
	davem@davemloft.net
On Tue, 17 Nov 2020 04:08:57 +0000 Parav Pandit wrote:
> > On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote:  
> > > > > Subfunction support is discussed in detail in RFC [1] and [2].
> > > > > RFC [1] and extension [2] describes requirements, design, and
> > > > > proposed plumbing using devlink, auxiliary bus and sysfs for
> > > > > systemd/udev support.  
> > > >
> > > > So we're going to have two ways of adding subdevs? Via devlink and
> > > > via the new vdpa netlink thing?  
> Nop.
> Subfunctions (subdevs) are added only one way, 
> i.e. devlink port as settled in RFC [1].
> 
> Just to refresh all our memory, we discussed and settled on the flow
> in [2]; RFC [1] followed this discussion.
> 
> vdpa tool of [3] can add one or more vdpa device(s) on top of already
> spawned PF, VF, SF device.
Nack for the networking part of that. It'd basically be VMDq.
> > > Via devlink you add the Sub-function bus device - think of it as
> > > spawning a new VF - but has no actual characteristics
> > > (netdev/vdpa/rdma) "yet" until user admin decides to load an
> > > interface on it via aux sysfs.  
> > 
> > By which you mean it doesn't get probed or the device type is not
> > set (IOW it can still become a block device or netdev depending on
> > the vdpa request)? 
> > > Basically devlink adds a new eswitch port (the SF port) and
> > > loading the drivers and the interfaces is done via the auxbus
> > > subsystem only after the SF is spawned by FW.  
> > 
> > But why?
> > 
> > Is this for the SmartNIC / bare metal case? The flow for spawning
> > on the local host gets highly convoluted.
>
> The flow of spawning for (a) local host or (b) for external host
> controller from smartnic is same.
> 
> $ devlink port add..
> [..]
> Followed by
> $ devlink port function set state...
> 
> Only change would be to specify the destination where to spawn it.
> (controller number, pf, sf num etc) Please refer to the detailed
> examples in individual patch. Patch 12 and 13 mostly covers the
> complete view.
Please share full examples of the workflow.
I'm asking how the vdpa API fits in with this, and you're showing me
the two devlink commands we already talked about in the past.
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-17 17:11         ` Jakub Kicinski
@ 2020-11-17 18:49           ` Jason Gunthorpe
  2020-11-19  2:14             ` Jakub Kicinski
  2020-11-17 18:50           ` Parav Pandit
  1 sibling, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2020-11-17 18:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Parav Pandit, Saeed Mahameed, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org,
	Jiri Pirko, dledford@redhat.com, Leon Romanovsky,
	davem@davemloft.net
On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > Just to refresh all our memory, we discussed and settled on the flow
> > in [2]; RFC [1] followed this discussion.
> > 
> > vdpa tool of [3] can add one or more vdpa device(s) on top of already
> > spawned PF, VF, SF device.
> 
> Nack for the networking part of that. It'd basically be VMDq.
What are you NAK'ing? 
It is consistent with the multi-subsystem device sharing model we've
had for ages now.
The physical ethernet port is shared between multiple accelerator
subsystems. netdev gets its slice of traffic, so does RDMA, iSCSI,
VDPA, etc.
Jason
^ permalink raw reply	[flat|nested] 57+ messages in thread
* RE: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-17 17:11         ` Jakub Kicinski
  2020-11-17 18:49           ` Jason Gunthorpe
@ 2020-11-17 18:50           ` Parav Pandit
  2020-11-19  2:23             ` Jakub Kicinski
  1 sibling, 1 reply; 57+ messages in thread
From: Parav Pandit @ 2020-11-17 18:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org,
	Jiri Pirko, Jason Gunthorpe, dledford@redhat.com, Leon Romanovsky,
	davem@davemloft.net
Hi Jakub,
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Tuesday, November 17, 2020 10:41 PM
> 
> On Tue, 17 Nov 2020 04:08:57 +0000 Parav Pandit wrote:
> > > On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote:
> > > > > > Subfunction support is discussed in detail in RFC [1] and [2].
> > > > > > RFC [1] and extension [2] describes requirements, design, and
> > > > > > proposed plumbing using devlink, auxiliary bus and sysfs for
> > > > > > systemd/udev support.
> > > > >
> > > > > So we're going to have two ways of adding subdevs? Via devlink
> > > > > and via the new vdpa netlink thing?
> > Nop.
> > Subfunctions (subdevs) are added only one way, i.e. devlink port as
> > settled in RFC [1].
> >
> > Just to refresh all our memory, we discussed and settled on the flow
> > in [2]; RFC [1] followed this discussion.
> >
> > vdpa tool of [3] can add one or more vdpa device(s) on top of already
> > spawned PF, VF, SF device.
> 
> Nack for the networking part of that. It'd basically be VMDq.
> 
Can you please clarify which networking part you mean?
Which patches exactly in this patchset?
> > > > Via devlink you add the Sub-function bus device - think of it as
> > > > spawning a new VF - but has no actual characteristics
> > > > (netdev/vdpa/rdma) "yet" until user admin decides to load an
> > > > interface on it via aux sysfs.
> > >
> > > By which you mean it doesn't get probed or the device type is not
> > > set (IOW it can still become a block device or netdev depending on
> > > the vdpa request)?
> > > > Basically devlink adds a new eswitch port (the SF port) and
> > > > loading the drivers and the interfaces is done via the auxbus
> > > > subsystem only after the SF is spawned by FW.
> > >
> > > But why?
> > >
> > > Is this for the SmartNIC / bare metal case? The flow for spawning on
> > > the local host gets highly convoluted.
> >
> > The flow of spawning for (a) local host or (b) for external host
> > controller from smartnic is same.
> >
> > $ devlink port add..
> > [..]
> > Followed by
> > $ devlink port function set state...
> >
> > Only change would be to specify the destination where to spawn it.
> > (controller number, pf, sf num etc) Please refer to the detailed
> > examples in individual patch. Patch 12 and 13 mostly covers the
> > complete view.
> 
> Please share full examples of the workflow.
> 
Please find the full example sequence below, taken from this cover letter and from the respective patches 12 and 13.
Change device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
Configure the MAC address of the port function (existing API):
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active
Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4
$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff
$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.28.1002 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.28.1002 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
At this point the vdpa tool of [1] can create one or more vdpa net devices on this subfunction device, using the sequence below.
$ vdpa parentdev list
auxiliary/mlx5_core.sf.4
  supported_classes
    net
$ vdpa dev add parentdev auxiliary/mlx5_core.sf.4 type net name foo0
$ vdpa dev show foo0
foo0: parentdev auxiliary/mlx5_core.sf.4 type network parentdev vdpasim vendor_id 0 max_vqs 2 max_vq_size 256
> I'm asking how the vdpa API fits in with this, and you're showing me the two
> devlink commands we already talked about in the past.
Oh ok, sorry, my bad. I now understand your question about how the vdpa commands relate to this.
Please look at the example sequence above, which also covers the vdpa example.
[1] https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/
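The devlink part of the sequence above can also be written down as a
dry-run helper that only formats the commands, in order, without
executing anything. This is a sketch under assumptions: struct
sf_request, enum sf_step and sf_format_step() are hypothetical names,
and the port index is assumed to be the one devlink reports after
"port add" (32768 in the patch 13 example).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Steps of the workflow, in the order they must be issued. */
enum sf_step {
	SF_STEP_SWITCHDEV,
	SF_STEP_PORT_ADD,
	SF_STEP_SET_HW_ADDR,
	SF_STEP_ACTIVATE,
	SF_STEP_COUNT
};

/* Hypothetical request descriptor; field names are illustrative only. */
struct sf_request {
	const char *pci_dev;		/* e.g. "pci/0000:06:00.0" */
	unsigned int pfnum;
	unsigned int sfnum;
	unsigned int port_index;	/* devlink port index of the SF port */
	const char *hw_addr;
};

/*
 * Format the devlink command for one step into buf.
 * Returns the snprintf() result, or -1 for an unknown step.
 */
static int sf_format_step(const struct sf_request *req, enum sf_step step,
			  char *buf, size_t len)
{
	assert(req && buf);

	switch (step) {
	case SF_STEP_SWITCHDEV:
		return snprintf(buf, len,
				"devlink dev eswitch set %s mode switchdev",
				req->pci_dev);
	case SF_STEP_PORT_ADD:
		return snprintf(buf, len,
				"devlink port add %s flavour pcisf pfnum %u sfnum %u",
				req->pci_dev, req->pfnum, req->sfnum);
	case SF_STEP_SET_HW_ADDR:
		return snprintf(buf, len,
				"devlink port function set %s/%u hw_addr %s",
				req->pci_dev, req->port_index, req->hw_addr);
	case SF_STEP_ACTIVATE:
		return snprintf(buf, len,
				"devlink port function set %s/%u state active",
				req->pci_dev, req->port_index);
	default:
		return -1;
	}
}
```

Iterating from SF_STEP_SWITCHDEV up to SF_STEP_COUNT keeps the MAC
address configuration ahead of activation, which is the ordering the
example sequence uses.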
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-12 19:24 ` [PATCH net-next 03/13] devlink: Support add and delete devlink port Parav Pandit
@ 2020-11-18 16:21   ` David Ahern
  2020-11-18 17:02     ` Parav Pandit
  0 siblings, 1 reply; 57+ messages in thread
From: David Ahern @ 2020-11-18 16:21 UTC (permalink / raw)
  To: Parav Pandit, netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Vu Pham
On 11/12/20 12:24 PM, Parav Pandit wrote:
> Extended devlink interface for the user to add and delete port.
> Extend devlink to connect user requests to driver to add/delete
> such port in the device.
> 
> When driver routines are invoked, devlink instance lock is not held.
> This enables driver to perform several devlink objects registration,
> unregistration such as (port, health reporter, resource etc)
> by using existing devlink APIs.
> This also helps to uniformly use the code for port unregistration
> during driver unload and during port deletion initiated by user.
> 
> Examples of add, show and delete commands:
> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
> 
> $ devlink port show
> pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
> 
> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> 
> $ devlink port show pci/0000:06:00.0/32768
> pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
>   function:
>     hw_addr 00:00:00:00:88:88 state inactive opstate detached
> 
There has to be limits on the number of sub functions that can be
created for a device. How does a user find that limit?
Also, it seems like there are hardware constraints at play. e.g., can a user
reduce the number of queues used by the physical function to support
more sub-functions? If so how does a user programmatically learn about
this limitation? e.g., devlink could have support to show resource
sizing and configure constraints similar to what mlxsw has.
^ permalink raw reply	[flat|nested] 57+ messages in thread
* RE: [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-18 16:21   ` David Ahern
@ 2020-11-18 17:02     ` Parav Pandit
  2020-11-18 18:03       ` David Ahern
  2020-11-19  0:52       ` Jacob Keller
  0 siblings, 2 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-18 17:02 UTC (permalink / raw)
  To: David Ahern, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org
  Cc: Jiri Pirko, Jason Gunthorpe, dledford@redhat.com, Leon Romanovsky,
	Saeed Mahameed, kuba@kernel.org, davem@davemloft.net, Vu Pham
> From: David Ahern <dsahern@gmail.com>
> Sent: Wednesday, November 18, 2020 9:51 PM
> 
> On 11/12/20 12:24 PM, Parav Pandit wrote:
> > Extended devlink interface for the user to add and delete port.
> > Extend devlink to connect user requests to driver to add/delete such
> > port in the device.
> >
> > When driver routines are invoked, devlink instance lock is not held.
> > This enables driver to perform several devlink objects registration,
> > unregistration such as (port, health reporter, resource etc) by using
> > existing devlink APIs.
> > This also helps to uniformly use the code for port unregistration
> > during driver unload and during port deletion initiated by user.
> >
> > Examples of add, show and delete commands:
> > $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
> >
> > $ devlink port show
> > pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical
> > port 0 splittable false
> >
> > $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> >
> > $ devlink port show pci/0000:06:00.0/32768
> > pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0
> pfnum 0 sfnum 88 external false splittable false
> >   function:
> >     hw_addr 00:00:00:00:88:88 state inactive opstate detached
> >
> 
> There has to be limits on the number of sub functions that can be created for
> a device. How does a user find that limit?
Yes, this came up internally, but didn't really converge.
The devlink resource looked too verbose for average or simple use cases.
But it may be fine.
The hurdle I faced with devlink resource is defining the granularity.
For example, one devlink instance deploys subfunctions on multiple PCI functions.
So how to name them? Currently we have controller and PFs in the port annotation.
So the resource names would be
c0pf0_subfunctions -> for controller 0, pf 0
c1pf2_subfunctions -> for controller 1, pf 2
I couldn't convince myself to name it this way.
The example below looked simpler to use, but the plumbing doesn't exist for it.
$ devlink resource show pci/0000:03:00.0
pci/0000:03:00.0/1: name max_sfs count 256 controller 0 pf 0
pci/0000:03:00.0/2: name max_sfs count 100 controller 1 pf 0
pci/0000:03:00.0/3: name max_sfs count 64 controller 1 pf 1
$ devlink resource set pci/0000:03:00.0/1 max_sfs 100
The second option I was considering was to use port params, which doesn't sound as right as a resource.
> 
> Also, it seems like there are hardware constraints at play. e.g., can a user reduce
> the number of queues used by the physical function to support more sub-
> functions? If so how does a user programmatically learn about this limitation?
> e.g., devlink could have support to show resource sizing and configure
> constraints similar to what mlxsw has.
Yes, we need to figure out its naming. For mlx5, the number of queues has no relation to subfunctions.
But PCI resources do, and this is something we want to do in the future, as you said, maybe using devlink resource.
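To make the per-controller, per-PF max_sfs idea above concrete, here is
a small userspace sketch of the bookkeeping it implies. Nothing here is
existing devlink plumbing: struct sf_limit, sf_limit_reserve() and
sf_limit_resize() are hypothetical names, and the error codes are only
one possible choice.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical subfunction budget for one (controller, PF) pair. */
struct sf_limit {
	unsigned int controller;
	unsigned int pf;
	unsigned int max_sfs;	/* configured limit, e.g. 256 */
	unsigned int in_use;	/* subfunctions currently created */
};

static struct sf_limit *sf_limit_find(struct sf_limit *tbl, size_t n,
				      unsigned int controller, unsigned int pf)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].controller == controller && tbl[i].pf == pf)
			return &tbl[i];
	return NULL;
}

/* Account for one new subfunction; fails once the limit is reached. */
static int sf_limit_reserve(struct sf_limit *tbl, size_t n,
			    unsigned int controller, unsigned int pf)
{
	struct sf_limit *l = sf_limit_find(tbl, n, controller, pf);

	if (!l)
		return -ENODEV;
	assert(l->in_use <= l->max_sfs);
	if (l->in_use == l->max_sfs)
		return -ENOSPC;
	l->in_use++;
	return 0;
}

/* Change the limit; shrinking below the current usage is refused. */
static int sf_limit_resize(struct sf_limit *tbl, size_t n,
			   unsigned int controller, unsigned int pf,
			   unsigned int new_max)
{
	struct sf_limit *l = sf_limit_find(tbl, n, controller, pf);

	if (!l)
		return -ENODEV;
	if (new_max < l->in_use)
		return -EBUSY;
	l->max_sfs = new_max;
	return 0;
}
```

Keying the table by (controller, pf) avoids encoding the pair into a
resource name such as c0pf0_subfunctions.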
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-18 17:02     ` Parav Pandit
@ 2020-11-18 18:03       ` David Ahern
  2020-11-18 18:38         ` Jason Gunthorpe
  2020-11-18 19:22         ` Parav Pandit
  2020-11-19  0:52       ` Jacob Keller
  1 sibling, 2 replies; 57+ messages in thread
From: David Ahern @ 2020-11-18 18:03 UTC (permalink / raw)
  To: Parav Pandit, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org
  Cc: Jiri Pirko, Jason Gunthorpe, dledford@redhat.com, Leon Romanovsky,
	Saeed Mahameed, kuba@kernel.org, davem@davemloft.net, Vu Pham
On 11/18/20 10:02 AM, Parav Pandit wrote:
> 
>> From: David Ahern <dsahern@gmail.com>
>> Sent: Wednesday, November 18, 2020 9:51 PM
>>
>> On 11/12/20 12:24 PM, Parav Pandit wrote:
>>> Extended devlink interface for the user to add and delete port.
>>> Extend devlink to connect user requests to driver to add/delete such
>>> port in the device.
>>>
>>> When driver routines are invoked, devlink instance lock is not held.
>>> This enables driver to perform several devlink objects registration,
>>> unregistration such as (port, health reporter, resource etc) by using
>>> existing devlink APIs.
>>> This also helps to uniformly use the code for port unregistration
>>> during driver unload and during port deletion initiated by user.
>>>
>>> Examples of add, show and delete commands:
>>> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
>>>
>>> $ devlink port show
>>> pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical
>>> port 0 splittable false
>>>
>>> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
>>>
>>> $ devlink port show pci/0000:06:00.0/32768
>>> pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0
>> pfnum 0 sfnum 88 external false splittable false
>>>   function:
>>>     hw_addr 00:00:00:00:88:88 state inactive opstate detached
>>>
>>
>> There has to be limits on the number of sub functions that can be created for
>> a device. How does a user find that limit?
> Yes, this came up internally, but didn't really converged.
> The devlink resource looked too verbose for an average or simple use cases.
> But it may be fine.
> The hurdle I faced with devlink resource is with defining the granularity.
> 
> For example one devlink instance deploys sub functions on multiple pci functions.
> So how to name them? Currently we have controller and PFs in port annotation.
> So resource name as 
> c0pf0_subfunctions -> for controller 0, pf 0 
> c1pf2_subfunctions -> for controller 1, pf 2
> 
> Couldn't convince my self to name it this way.
> 
> Below example looked simpler to use but plumbing doesn’t exist for it.
> 
> $ devlink resource show pci/0000:03:00.0
> pci/0000:03:00.0/1: name max_sfs count 256 controller 0 pf 0
> pci/0000:03:00.0/2: name max_sfs count 100 controller 1 pf 0
> pci/0000:03:00.0/3: name max_sfs count 64 controller 1 pf 1
> 
> $ devlink resource set pci/0000:03:00.0/1 max_sfs 100
> 
> Second option I was considering was use port params which doesn't sound so right as resource.
> 
>>
>> Also, seems like there are hardware constraint at play. e.g., can a user reduce
>> the number of queues used by the physical function to support more sub-
>> functions? If so how does a user programmatically learn about this limitation?
>> e.g., devlink could have support to show resource sizing and configure
>> constraints similar to what mlxsw has.
> Yes, need to figure out its naming. For mlx5 num queues doesn't have relation to subfunctions.
> But PCI resource has relation and this is something we want to do in future, as you said may be using devlink resource.
> 
With ConnectX-4 Lx, for example, the netdev can have at most 63 queues,
which leaves 96-CPU servers a bit short. That is one example of the limited
number of queues that a NIC can handle (or currently exposes to the OS; I am
not sure which). If I create a subfunction for Ethernet traffic, how many
queues are allocated to it by default? Is it managed via ethtool like the
PF? And is there an impact on the resources used by / available to the
primary device?
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-18 18:03       ` David Ahern
@ 2020-11-18 18:38         ` Jason Gunthorpe
  2020-11-18 19:36           ` David Ahern
  2020-11-18 19:22         ` Parav Pandit
  1 sibling, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2020-11-18 18:38 UTC (permalink / raw)
  To: David Ahern
  Cc: Parav Pandit, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, Saeed Mahameed, kuba@kernel.org,
	davem@davemloft.net, Vu Pham
On Wed, Nov 18, 2020 at 11:03:24AM -0700, David Ahern wrote:
> With Connectx-4 Lx for example the netdev can have at most 63 queues
What netdev calls a queue is really a "can the device deliver
interrupts and packets to a given per-CPU queue" and covers a whole
spectrum of smaller limits like RSS scheme, # of available interrupts,
ability of the device to create queues, etc.
CX4Lx can create a huge number of queues, but hits one of these limits,
which means netdev's specific usage can't scale up. Other stuff like
RDMA doesn't have the same limits, and has tonnes of queues.
What seems to be needed is a resource controller concept like cgroup
has for processes. The system is really organized into a tree:
           physical device
              mlx5_core
        /      |      \      \                        (aux bus)
     netdev   rdma    vdpa   SF  etc
                             |                        (aux bus)
                           mlx5_core
                          /      \                    (aux bus)
                       netdev   vdpa
And it does make a lot of sense to start to talk about limits at each
tree level.
eg the top of the tree may have 128 physical interrupts. With 128 CPU
cores that isn't enough interrupts to support all of those things
concurrently.
So the user may want to configure:
 - The first level netdev only gets 64,
 - 3rd level mlx5_core gets 32 
 - Final level vdpa gets 8
Other stuff has to fight it out with the remaining shared interrupts.
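A minimal Python sketch of the per-level limits described above. It is not
from the patchset: the Node class and its method names are hypothetical, and
it only models the bookkeeping, using the example numbers from this mail
(128 interrupts at the top, 64 for the first-level netdev, 32 for the SF's
mlx5_core, 8 for the final-level vdpa). Nodes without a limit are assumed to
use the shared remainder.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One level of the device tree, with an optional dedicated-interrupt limit."""
    name: str
    limit: Optional[int] = None  # None means "no reservation, use the shared pool"
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

    def reserved_below(self) -> int:
        """Interrupts reserved by limited nodes in this subtree, excluding self."""
        total = 0
        for child in self.children:
            if child.limit is not None:
                total += child.limit  # a limited child caps its whole subtree
            else:
                total += child.reserved_below()
        return total

    def validate(self) -> None:
        """Reservations below a limited node must fit inside that node's limit."""
        if self.limit is not None and self.reserved_below() > self.limit:
            raise ValueError(f"{self.name}: reservations exceed limit {self.limit}")
        for child in self.children:
            child.validate()


# Tree from the diagram above; the numbers are the example values in this mail.
root = Node("mlx5_core (physical)", 128)
root.add(Node("netdev", 64))
root.add(Node("rdma"))
root.add(Node("vdpa"))
sf = root.add(Node("SF"))
sf_core = sf.add(Node("mlx5_core (SF)", 32))
sf_core.add(Node("netdev"))
sf_core.add(Node("vdpa", 8))

root.validate()
shared = root.limit - root.reserved_below()
print(shared)  # 32 interrupts left for everything without a reservation
```

The point of the tree form is that a limit at one level (32 for the SF's
mlx5_core) bounds everything beneath it (the vdpa's 8), much like a cgroup
hierarchy.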
In netdev land, the # of interrupts governs the # of queues.
For RDMA, the # of interrupts limits the CPU affinities for queues.
For VDPA, the # of interrupts limits the # of VMs that can use VT-d.
The same story repeats for other, less general resources: mlx5 also
consumes limited BAR space and some limited memory elements. These
numbers are much bigger and may not need explicit governing, but the
general concept holds.
It would be very nice if the limit could be injected when the aux
device is created but before the driver is bound. I'm not sure how to
manage that though..
I assume other devices will be different, maybe some devices have a
limit on the number of total queues, or a limit on the number of
VDPA or RDMA devices.
Jason
^ permalink raw reply	[flat|nested] 57+ messages in thread
* RE: [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-18 18:03       ` David Ahern
  2020-11-18 18:38         ` Jason Gunthorpe
@ 2020-11-18 19:22         ` Parav Pandit
  2020-11-19  0:41           ` Jacob Keller
  1 sibling, 1 reply; 57+ messages in thread
From: Parav Pandit @ 2020-11-18 19:22 UTC (permalink / raw)
  To: David Ahern, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org
  Cc: Jiri Pirko, Jason Gunthorpe, dledford@redhat.com, Leon Romanovsky,
	Saeed Mahameed, kuba@kernel.org, davem@davemloft.net, Vu Pham
> From: David Ahern <dsahern@gmail.com>
> Sent: Wednesday, November 18, 2020 11:33 PM
> 
> 
> With Connectx-4 Lx for example the netdev can have at most 63 queues
> leaving 96 cpu servers a bit short - as an example of the limited number of
> queues that a nic can handle (or currently exposes to the OS not sure which).
> If I create a subfunction for ethernet traffic, how many queues are allocated
> to it by default, is it managed via ethtool like the pf and is there an impact to
> the resources used by / available to the primary device?
Jason already answered it with details.
Thanks a lot Jason.
Short answer to the ethtool question: yes, ethtool can change the number of queues for a subfunction, just as for the PF.
In this patchset, a subfunction gets the same number of queues as the PF by default.
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-18 18:38         ` Jason Gunthorpe
@ 2020-11-18 19:36           ` David Ahern
  2020-11-18 20:42             ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: David Ahern @ 2020-11-18 19:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Parav Pandit, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, Saeed Mahameed, kuba@kernel.org,
	davem@davemloft.net, Vu Pham
On 11/18/20 11:38 AM, Jason Gunthorpe wrote:
> On Wed, Nov 18, 2020 at 11:03:24AM -0700, David Ahern wrote:
> 
>> With Connectx-4 Lx for example the netdev can have at most 63 queues
> 
> What netdev calls a queue is really a "can the device deliver
> interrupts and packets to a given per-CPU queue" and covers a whole
> spectrum of smaller limits like RSS scheme, # of available interrupts,
> ability of the device to create queues, etc.
> 
> CX4Lx can create a huge number of queues, but hits one of these limits
> that mean netdev's specific usage can't scale up. Other stuff like
> RDMA doesn't have the same limits, and has tonnes of queues.
> 
> What seems to be needed is a resource controller concept like cgroup
> has for processes. The system is really organized into a tree:
> 
>            physical device
>               mlx5_core
>         /      |      \      \                        (aux bus)
>      netdev   rdma    vdpa   SF  etc
>                              |                        (aux bus)
>                            mlx5_core
>                           /      \                    (aux bus)
>                        netdev   vdpa
> 
> And it does make a lot of sense to start to talk about limits at each
> tree level.
> 
> eg the top of the tree may have 128 physical interrupts. With 128 CPU
> cores that isn't enough interrupts to support all of those things
> concurrently.
> 
> So the user may want to configure:
>  - The first level netdev only gets 64,
>  - 3rd level mlx5_core gets 32 
>  - Final level vdpa gets 8
> 
> Other stuff has to fight it out with the remaining shared interrupts.
> 
> In netdev land # of interrupts governs # of queues
> 
> For RDMA # of interrupts limits the CPU affinities for queues
> 
> VDPA limits the # of VMs that can use VT-d
> 
> The same story repeats for other less general resources, mlx5 also
> has consumption of limited BAR space, and consumption of some limited
> memory elements. These numbers are much bigger and may not need
> explicit governing, but the general concept holds.
> 
> It would be very nice if the limit could be injected when the aux
> device is created but before the driver is bound. I'm not sure how to
> manage that though..
> 
> I assume other devices will be different, maybe some devices have a
> limit on the number of total queues, or a limit on the number of
> VDPA or RDMA devices.
> 
> Jason
> 
Those are a lot of low-level resource details that need to be summarized
into a nicer user / config perspective for specifying limits / allocations.
Thanks for the detailed response.
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-18 19:36           ` David Ahern
@ 2020-11-18 20:42             ` Jason Gunthorpe
  0 siblings, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2020-11-18 20:42 UTC (permalink / raw)
  To: David Ahern
  Cc: Parav Pandit, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, Saeed Mahameed, kuba@kernel.org,
	davem@davemloft.net, Vu Pham
On Wed, Nov 18, 2020 at 12:36:26PM -0700, David Ahern wrote:
> On 11/18/20 11:38 AM, Jason Gunthorpe wrote:
> > On Wed, Nov 18, 2020 at 11:03:24AM -0700, David Ahern wrote:
> > 
> >> With Connectx-4 Lx for example the netdev can have at most 63 queues
> > 
> > What netdev calls a queue is really a "can the device deliver
> > interrupts and packets to a given per-CPU queue" and covers a whole
> > spectrum of smaller limits like RSS scheme, # of available interrupts,
> > ability of the device to create queues, etc.
> > 
> > CX4Lx can create a huge number of queues, but hits one of these limits
> > that mean netdev's specific usage can't scale up. Other stuff like
> > RDMA doesn't have the same limits, and has tonnes of queues.
> > 
> > What seems to be needed is a resource controller concept like cgroup
> > has for processes. The system is really organized into a tree:
> > 
> >            physical device
> >               mlx5_core
> >         /      |      \      \                        (aux bus)
> >      netdev   rdma    vdpa   SF  etc
> >                              |                        (aux bus)
> >                            mlx5_core
> >                           /      \                    (aux bus)
> >                        netdev   vdpa
> > 
> > And it does make a lot of sense to start to talk about limits at each
> > tree level.
> > 
> > eg the top of the tree may have 128 physical interrupts. With 128 CPU
> > cores that isn't enough interrupts to support all of those things
> > concurrently.
> > 
> > So the user may want to configure:
> >  - The first level netdev only gets 64,
> >  - 3rd level mlx5_core gets 32 
> >  - Final level vdpa gets 8
> > 
> > Other stuff has to fight it out with the remaining shared interrupts.
> > 
> > In netdev land # of interrupts governs # of queues
> > 
> > For RDMA # of interrupts limits the CPU affinities for queues
> > 
> > VDPA limits the # of VMs that can use VT-d
> > 
> > The same story repeats for other less general resources, mlx5 also
> > has consumption of limited BAR space, and consumption of some limited
> > memory elements. These numbers are much bigger and may not need
> > explicit governing, but the general concept holds.
> > 
> > It would be very nice if the limit could be injected when the aux
> > device is created but before the driver is bound. I'm not sure how to
> > manage that though..
> > 
> > I assume other devices will be different, maybe some devices have a
> > limit on the number of total queues, or a limit on the number of
> > VDPA or RDMA devices.
> 
> A lot of low level resource details that need to be summarized into a
> nicer user / config perspective to specify limits / allocations.
Well, now that we have the aux bus stuff there is a nice natural place
to put things..
The aux bus owner device (mlx5_core) could have a list of available
resources.
Each aux bus device (netdev/rdma/vdpa) could have a list of consumed
resources.
Some API would then place a limit on the consumed resources at each aux
bus device.
The tricky bit is the auto-probing/configure. By the time the user has
a chance to apply a limit the drivers are already bound and have
already done their setup. So each subsystem has to support dynamically
imposing a limit..
And I simplified things a bit above too, we actually have two kinds of
interrupt demand: sharable and dedicated. The actual need is to carve
out a bunch of dedicated interrupts and only allow subsystems that are
doing VT-d guest interrupt assignment to consume them (eg VDPA)
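A rough Python sketch of that bookkeeping, under stated assumptions: the
InterruptPool class, its methods and the numbers are all hypothetical and do
not correspond to any existing kernel or devlink API. It shows an owner-side
list of available resources, a per-device list of consumed resources, a
limit that can be imposed after the driver is already bound, and dedicated
interrupts that only VT-d guest assignment may consume.

```python
class InterruptPool:
    """Owner-side view (e.g. mlx5_core): what is available and who consumes it."""

    def __init__(self, shared: int, dedicated: int) -> None:
        self.available = {"shared": shared, "dedicated": dedicated}
        self.consumed = {}  # device name -> {"shared": n, "dedicated": n}
        self.limits = {}    # (device name, kind) -> maximum allowed

    def request(self, dev: str, kind: str, count: int, vtd_guest: bool = False) -> int:
        """Grant up to `count` interrupts of `kind`; return the number granted."""
        if kind == "dedicated" and not vtd_guest:
            return 0  # dedicated vectors are carved out for VT-d guest assignment
        used = self.consumed.setdefault(dev, {"shared": 0, "dedicated": 0})
        cap = self.limits.get((dev, kind))
        if cap is not None:
            count = min(count, max(cap - used[kind], 0))
        granted = min(count, self.available[kind])
        self.available[kind] -= granted
        used[kind] += granted
        return granted

    def set_limit(self, dev: str, kind: str, cap: int) -> int:
        """Impose a limit on an already-bound device; return how many were reclaimed."""
        self.limits[(dev, kind)] = cap
        used = self.consumed.setdefault(dev, {"shared": 0, "dedicated": 0})
        excess = max(used[kind] - cap, 0)
        used[kind] -= excess  # a real subsystem would have to tear down queues here
        self.available[kind] += excess
        return excess


pool = InterruptPool(shared=96, dedicated=32)
got_netdev = pool.request("netdev", "shared", 64)                   # driver auto-probed
refused = pool.request("netdev", "dedicated", 8)                    # not a VT-d user
got_vdpa = pool.request("vdpa", "dedicated", 8, vtd_guest=True)
reclaimed = pool.set_limit("netdev", "shared", 16)                  # user limit arrives late
```

The set_limit() path is the tricky bit mentioned above: the limit arrives
after the netdev has already taken 64 interrupts, so the subsystem has to
give 48 of them back dynamically.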
Jason
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-18 19:22         ` Parav Pandit
@ 2020-11-19  0:41           ` Jacob Keller
  2020-11-19  1:17             ` David Ahern
  0 siblings, 1 reply; 57+ messages in thread
From: Jacob Keller @ 2020-11-19  0:41 UTC (permalink / raw)
  To: Parav Pandit, David Ahern, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org
  Cc: Jiri Pirko, Jason Gunthorpe, dledford@redhat.com, Leon Romanovsky,
	Saeed Mahameed, kuba@kernel.org, davem@davemloft.net, Vu Pham
On 11/18/2020 11:22 AM, Parav Pandit wrote:
> 
> 
>> From: David Ahern <dsahern@gmail.com>
>> Sent: Wednesday, November 18, 2020 11:33 PM
>>
>>
>> With Connectx-4 Lx for example the netdev can have at most 63 queues
>> leaving 96 cpu servers a bit short - as an example of the limited number of
>> queues that a nic can handle (or currently exposes to the OS not sure which).
>> If I create a subfunction for ethernet traffic, how many queues are allocated
>> to it by default, is it managed via ethtool like the pf and is there an impact to
>> the resources used by / available to the primary device?
> 
> Jason already answered it with details.
> Thanks a lot Jason.
> 
> Short answer to ethtool question, yes, ethtool can change num queues for subfunction like PF.
> Default is same number of queues for subfunction as that of PF in this patchset.
> 
But what is the mechanism for partitioning the global resources of the
device into each sub function?
Is it just evenly divided among the subfunctions? Is there some maximum
limit per subfunction?
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-18 17:02     ` Parav Pandit
  2020-11-18 18:03       ` David Ahern
@ 2020-11-19  0:52       ` Jacob Keller
  1 sibling, 0 replies; 57+ messages in thread
From: Jacob Keller @ 2020-11-19  0:52 UTC (permalink / raw)
  To: Parav Pandit, David Ahern, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org
  Cc: Jiri Pirko, Jason Gunthorpe, dledford@redhat.com, Leon Romanovsky,
	Saeed Mahameed, kuba@kernel.org, davem@davemloft.net, Vu Pham
On 11/18/2020 9:02 AM, Parav Pandit wrote:
> 
>> From: David Ahern <dsahern@gmail.com>
>> Sent: Wednesday, November 18, 2020 9:51 PM
>>
>> On 11/12/20 12:24 PM, Parav Pandit wrote:
>>> Extended devlink interface for the user to add and delete port.
>>> Extend devlink to connect user requests to driver to add/delete such
>>> port in the device.
>>>
>>> When driver routines are invoked, devlink instance lock is not held.
>>> This enables driver to perform several devlink objects registration,
>>> unregistration such as (port, health reporter, resource etc) by using
>>> existing devlink APIs.
>>> This also helps to uniformly use the code for port unregistration
>>> during driver unload and during port deletion initiated by user.
>>>
>>> Examples of add, show and delete commands:
>>> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
>>>
>>> $ devlink port show
>>> pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical
>>> port 0 splittable false
>>>
>>> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
>>>
>>> $ devlink port show pci/0000:06:00.0/32768
>>> pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0
>> pfnum 0 sfnum 88 external false splittable false
>>>   function:
>>>     hw_addr 00:00:00:00:88:88 state inactive opstate detached
>>>
>>
>> There has to be limits on the number of sub functions that can be created for
>> a device. How does a user find that limit?
> Yes, this came up internally, but didn't really converged.
> The devlink resource looked too verbose for an average or simple use cases.
> But it may be fine.
> The hurdle I faced with devlink resource is with defining the granularity.
> 
> For example one devlink instance deploys sub functions on multiple pci functions.
> So how to name them? Currently we have controller and PFs in port annotation.
> So resource name as 
> c0pf0_subfunctions -> for controller 0, pf 0 
> c1pf2_subfunctions -> for controller 1, pf 2
> 
> Couldn't convince my self to name it this way.
Yea, I think we need to extend the plumbing of resources to allow
specifying or assigning parent resources to a subfunction.
> 
> Below example looked simpler to use but plumbing doesn’t exist for it.
> 
> $ devlink resource show pci/0000:03:00.0
> pci/0000:03:00.0/1: name max_sfs count 256 controller 0 pf 0
> pci/0000:03:00.0/2: name max_sfs count 100 controller 1 pf 0
> pci/0000:03:00.0/3: name max_sfs count 64 controller 1 pf 1
> 
> $ devlink resource set pci/0000:03:00.0/1 max_sfs 100
> 
> Second option I was considering was use port params which doesn't sound so right as resource.
> 
I don't think port parameters make sense here. They only encapsulate
single name -> value pairs, and don't really help show the relationships
between the subfunction ports and the parent device.
>>
>> Also, seems like there are hardware constraint at play. e.g., can a user reduce
>> the number of queues used by the physical function to support more sub-
>> functions? If so how does a user programmatically learn about this limitation?
>> e.g., devlink could have support to show resource sizing and configure
>> constraints similar to what mlxsw has.
> Yes, need to figure out its naming. For mlx5 num queues doesn't have relation to subfunctions.
> But PCI resource has relation and this is something we want to do in future, as you said may be using devlink resource.
> 
I've been looking into queue management and being able to add and remove
queue groups and queues. I'm leaning towards building on top of devlink
resource for this.
Specifically I have been looking at picking up the work started by
Magnus last year, around creating interface for representing queues to
the stack better for AF_XDP, but it also has other possible uses.
I'd like to make sure it aligns with the ideas here for partitioning
resources. It seems like that would be best done at the devlink level,
where the main devlink instance knows about all of the part's limitations
and can then offer new commands for assigning resources to ports.
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-19  0:41           ` Jacob Keller
@ 2020-11-19  1:17             ` David Ahern
  2020-11-19  1:56               ` Samudrala, Sridhar
  0 siblings, 1 reply; 57+ messages in thread
From: David Ahern @ 2020-11-19  1:17 UTC (permalink / raw)
  To: Jacob Keller, Parav Pandit, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org
  Cc: Jiri Pirko, Jason Gunthorpe, dledford@redhat.com, Leon Romanovsky,
	Saeed Mahameed, kuba@kernel.org, davem@davemloft.net, Vu Pham
On 11/18/20 5:41 PM, Jacob Keller wrote:
> 
> 
> On 11/18/2020 11:22 AM, Parav Pandit wrote:
>>
>>
>>> From: David Ahern <dsahern@gmail.com>
>>> Sent: Wednesday, November 18, 2020 11:33 PM
>>>
>>>
>>> With Connectx-4 Lx for example the netdev can have at most 63 queues
>>> leaving 96 cpu servers a bit short - as an example of the limited number of
>>> queues that a nic can handle (or currently exposes to the OS not sure which).
>>> If I create a subfunction for ethernet traffic, how many queues are allocated
>>> to it by default, is it managed via ethtool like the pf and is there an impact to
>>> the resources used by / available to the primary device?
>>
>> Jason already answered it with details.
>> Thanks a lot Jason.
>>
>> Short answer to ethtool question, yes, ethtool can change num queues for subfunction like PF.
>> Default is same number of queues for subfunction as that of PF in this patchset.
>>
> 
> But what is the mechanism for partitioning the global resources of the
> device into each sub function?
> 
> Is it just evenly divided into the subfunctions? is there some maximum
> limit per sub function?
> 
I hope it is not just evenly divided; it should be user controllable. If
I create a subfunction for say a container's networking, I may want to
only assign 1 Rx and 1 Tx queue pair (or 1 channel depending on
terminology where channel includes Rx, Tx and CQ).
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 03/13] devlink: Support add and delete devlink port
  2020-11-19  1:17             ` David Ahern
@ 2020-11-19  1:56               ` Samudrala, Sridhar
  0 siblings, 0 replies; 57+ messages in thread
From: Samudrala, Sridhar @ 2020-11-19  1:56 UTC (permalink / raw)
  To: David Ahern, Jacob Keller, Parav Pandit, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org
  Cc: Jiri Pirko, Jason Gunthorpe, dledford@redhat.com, Leon Romanovsky,
	Saeed Mahameed, kuba@kernel.org, davem@davemloft.net, Vu Pham
On 11/18/2020 5:17 PM, David Ahern wrote:
> On 11/18/20 5:41 PM, Jacob Keller wrote:
>>
>>
>> On 11/18/2020 11:22 AM, Parav Pandit wrote:
>>>
>>>
>>>> From: David Ahern <dsahern@gmail.com>
>>>> Sent: Wednesday, November 18, 2020 11:33 PM
>>>>
>>>>
>>>> With Connectx-4 Lx for example the netdev can have at most 63 queues
>>>> leaving 96 cpu servers a bit short - as an example of the limited number of
>>>> queues that a nic can handle (or currently exposes to the OS not sure which).
>>>> If I create a subfunction for ethernet traffic, how many queues are allocated
>>>> to it by default, is it managed via ethtool like the pf and is there an impact to
>>>> the resources used by / available to the primary device?
>>>
>>> Jason already answered it with details.
>>> Thanks a lot Jason.
>>>
>>> Short answer to ethtool question, yes, ethtool can change num queues for subfunction like PF.
>>> Default is same number of queues for subfunction as that of PF in this patchset.
>>>
>>
>> But what is the mechanism for partitioning the global resources of the
>> device into each sub function?
>>
>> Is it just evenly divided into the subfunctions? is there some maximum
>> limit per sub function?
>>
> 
> I hope it is not just evenly divided; it should be user controllable. If
> I create a subfunction for say a container's networking, I may want to
> only assign 1 Rx and 1 Tx queue pair (or 1 channel depending on
> terminology where channel includes Rx, Tx and CQ).
I think we need a way to expose and configure a policy for the resources
associated with each type of auxiliary device.
   For example: default, min and max queues and interrupt vectors.
Once an auxiliary device is created, the user should be able to 
configure the resources within the allowed min-max values.
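A minimal Python sketch of such a policy, assuming a per-device-type table
of (default, min, max) values. The ResourcePolicy class, the POLICIES table
and all numbers in it are made up for illustration (63 echoes the CX4Lx
netdev queue limit mentioned earlier in the thread); no such interface
exists in this patchset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourcePolicy:
    """Default/min/max for one resource of one auxiliary device type."""
    default: int
    minimum: int
    maximum: int

    def __post_init__(self) -> None:
        if not (self.minimum <= self.default <= self.maximum):
            raise ValueError("policy must satisfy minimum <= default <= maximum")

    def resolve(self, requested: Optional[int] = None) -> int:
        """Return the default, or the requested value if it is within bounds."""
        if requested is None:
            return self.default
        if not (self.minimum <= requested <= self.maximum):
            raise ValueError(
                f"{requested} is outside the allowed range "
                f"[{self.minimum}, {self.maximum}]"
            )
        return requested


# Hypothetical policy table keyed by (auxiliary device type, resource name).
POLICIES = {
    ("netdev", "queues"): ResourcePolicy(default=8, minimum=1, maximum=63),
    ("netdev", "irq_vectors"): ResourcePolicy(default=8, minimum=1, maximum=64),
    ("vdpa", "irq_vectors"): ResourcePolicy(default=2, minimum=1, maximum=8),
}

queues_at_create = POLICIES[("netdev", "queues")].resolve()    # policy default
queues_container = POLICIES[("netdev", "queues")].resolve(1)   # user shrinks to 1
```

The second call corresponds to the container case raised earlier, where a
subfunction should get only a single queue pair.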
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-17 18:49           ` Jason Gunthorpe
@ 2020-11-19  2:14             ` Jakub Kicinski
  2020-11-19  4:35               ` David Ahern
  2020-11-19  6:12               ` Saeed Mahameed
  0 siblings, 2 replies; 57+ messages in thread
From: Jakub Kicinski @ 2020-11-19  2:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Parav Pandit, Saeed Mahameed, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org,
	Jiri Pirko, dledford@redhat.com, Leon Romanovsky,
	davem@davemloft.net
On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> 
> > > Just to refresh all our memory, we discussed and settled on the flow
> > > in [2]; RFC [1] followed this discussion.
> > > 
> > > vdpa tool of [3] can add one or more vdpa device(s) on top of already
> > > spawned PF, VF, SF device.  
> > 
> > Nack for the networking part of that. It'd basically be VMDq.  
> 
> What are you NAK'ing? 
Spawning multiple netdevs from one device by slicing up its queues.
> It is consistent with the multi-subsystem device sharing model we've
> had for ages now.
> 
> The physical ethernet port is shared between multiple accelerator
> subsystems. netdev gets its slice of traffic, so does RDMA, iSCSI,
> VDPA, etc.
Right, devices of other subsystems are fine, I don't care.
Sorry for not being crystal clear but quite frankly IDK what else can
be expected from me given the submissions have little to no context and
documentation. This comes up every damn time with the SF patches, I'm
tired of having to ask for a basic workflow.
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-17 18:50           ` Parav Pandit
@ 2020-11-19  2:23             ` Jakub Kicinski
  2020-11-19  6:22               ` Saeed Mahameed
  0 siblings, 1 reply; 57+ messages in thread
From: Jakub Kicinski @ 2020-11-19  2:23 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Saeed Mahameed, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org,
	Jiri Pirko, Jason Gunthorpe, dledford@redhat.com, Leon Romanovsky,
	davem@davemloft.net
On Tue, 17 Nov 2020 18:50:57 +0000 Parav Pandit wrote:
> At this point vdpa tool of [1] can create one or more vdpa net devices on this subfunction device in below sequence.
> 
> $ vdpa parentdev list
> auxiliary/mlx5_core.sf.4
>   supported_classes
>     net
> 
> $ vdpa dev add parentdev auxiliary/mlx5_core.sf.4 type net name foo0
> 
> $ vdpa dev show foo0
> foo0: parentdev auxiliary/mlx5_core.sf.4 type network parentdev vdpasim vendor_id 0 max_vqs 2 max_vq_size 256
> 
> > I'm asking how the vdpa API fits in with this, and you're showing me the two
> > devlink commands we already talked about in the past.  
> Oh ok, sorry, my bad. I understood your question now about relation of vdpa commands with this.
> Please look at the above example sequence that covers the vdpa example also.
> 
> [1] https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/
I think the biggest missing piece in my understanding is what's the
technical difference between an SF and a VDPA device.
Isn't a VDPA device an SF with a particular descriptor format for the
queues?
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-19  2:14             ` Jakub Kicinski
@ 2020-11-19  4:35               ` David Ahern
  2020-11-19  5:57                 ` Saeed Mahameed
  2020-11-20  1:29                 ` Jakub Kicinski
  2020-11-19  6:12               ` Saeed Mahameed
  1 sibling, 2 replies; 57+ messages in thread
From: David Ahern @ 2020-11-19  4:35 UTC (permalink / raw)
  To: Jakub Kicinski, Jason Gunthorpe
  Cc: Parav Pandit, Saeed Mahameed, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org,
	Jiri Pirko, dledford@redhat.com, Leon Romanovsky,
	davem@davemloft.net
On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
>> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
>>
>>>> Just to refresh all our memory, we discussed and settled on the flow
>>>> in [2]; RFC [1] followed this discussion.
>>>>
>>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already
>>>> spawned PF, VF, SF device.  
>>>
>>> Nack for the networking part of that. It'd basically be VMDq.  
>>
>> What are you NAK'ing? 
> 
> Spawning multiple netdevs from one device by slicing up its queues.
Why do you object to that? Slicing up h/w resources for virtual
whatever has been common practice for a long time.
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-19  4:35               ` David Ahern
@ 2020-11-19  5:57                 ` Saeed Mahameed
  2020-11-20  1:31                   ` Jakub Kicinski
  2020-11-25  5:33                   ` David Ahern
  2020-11-20  1:29                 ` Jakub Kicinski
  1 sibling, 2 replies; 57+ messages in thread
From: Saeed Mahameed @ 2020-11-19  5:57 UTC (permalink / raw)
  To: David Ahern, Jakub Kicinski, Jason Gunthorpe
  Cc: Parav Pandit, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
On Wed, 2020-11-18 at 21:35 -0700, David Ahern wrote:
> On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> > > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > > 
> > > > > Just to refresh all our memory, we discussed and settled on
> > > > > the flow
> > > > > in [2]; RFC [1] followed this discussion.
> > > > > 
> > > > > vdpa tool of [3] can add one or more vdpa device(s) on top of
> > > > > already
> > > > > spawned PF, VF, SF device.  
> > > > 
> > > > Nack for the networking part of that. It'd basically be VMDq.  
> > > 
> > > What are you NAK'ing? 
> > 
> > Spawning multiple netdevs from one device by slicing up its queues.
> 
> Why do you object to that? Slicing up h/w resources for virtual what
> ever has been common practice for a long time.
> 
> 
We are not slicing up any queues. From our HW and FW perspective, SF ==
VF, literally: a full-blown HW slice (function) with an isolated control
and data plane of its own. This is very different from VMDq, and more
generic and secure. An SF device is exactly like a VF: it doesn't steal or
share any HW resources or control/data path with others. SF is
basically SRIOV done right.
This series has nothing to do with netdev. If you look at the list of
files Parav is touching, there is zero change in our netdev stack :) ..
All Parav is doing is adding the API to create/destroy SFs and
representing the low-level SF function to devlink as a device, just
like a VF.
^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-19  2:14             ` Jakub Kicinski
  2020-11-19  4:35               ` David Ahern
@ 2020-11-19  6:12               ` Saeed Mahameed
  2020-11-19  8:25                 ` Parav Pandit
  2020-11-20  1:35                 ` Jakub Kicinski
  1 sibling, 2 replies; 57+ messages in thread
From: Saeed Mahameed @ 2020-11-19  6:12 UTC (permalink / raw)
  To: Jakub Kicinski, Jason Gunthorpe
  Cc: Parav Pandit, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
On Wed, 2020-11-18 at 18:14 -0800, Jakub Kicinski wrote:
> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > 
> > It is consistent with the multi-subsystem device sharing model
> > we've
> > had for ages now.
> > 
> > The physical ethernet port is shared between multiple accelerator
> > subsystems. netdev gets its slice of traffic, so does RDMA, iSCSI,
> > VDPA, etc.
not just a slice of traffic, a whole HW domain.
> 
> Right, devices of other subsystems are fine, I don't care.
> 
But a netdev will be loaded on SF automatically just through the
current driver design and modularity, since SF == VF and our netdev is
abstract and doesn't know if it runs on a PF/VF/SF .. we literally have
to add code to not load a netdev on a SF. why ? :/
> Sorry for not being crystal clear but quite frankly IDK what else can
> be expected from me given the submissions have little to no context
> and
> documentation. This comes up every damn time with the SF patches, I'm
> tired of having to ask for a basic workflow.
From how this discussion is going, I think you are right: we need to
clarify what we are doing in higher-level, simplified and generic
documentation to give some initial context. Parav, let's add the
missing documentation. We could also add some comments on how this
is very different from VMDq, but I would like to avoid that, since it
is different in almost every way :) ..
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-19  2:23             ` Jakub Kicinski
@ 2020-11-19  6:22               ` Saeed Mahameed
  2020-11-19 14:00                 ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: Saeed Mahameed @ 2020-11-19  6:22 UTC (permalink / raw)
  To: Jakub Kicinski, Parav Pandit
  Cc: netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, Jason Gunthorpe,
	dledford@redhat.com, Leon Romanovsky, davem@davemloft.net
On Wed, 2020-11-18 at 18:23 -0800, Jakub Kicinski wrote:
> On Tue, 17 Nov 2020 18:50:57 +0000 Parav Pandit wrote:
> > At this point vdpa tool of [1] can create one or more vdpa net
> > devices on this subfunction device in below sequence.
> > 
> > $ vdpa parentdev list
> > auxiliary/mlx5_core.sf.4
> >   supported_classes
> >     net
> > 
> > $ vdpa dev add parentdev auxiliary/mlx5_core.sf.4 type net name
> > foo0
> > 
> > $ vdpa dev show foo0
> > foo0: parentdev auxiliary/mlx5_core.sf.4 type network parentdev
> > vdpasim vendor_id 0 max_vqs 2 max_vq_size 256
> > 
> > > I'm asking how the vdpa API fits in with this, and you're showing
> > > me the two
> > > devlink commands we already talked about in the past.  
> > Oh ok, sorry, my bad. I understood your question now about relation
> > of vdpa commands with this.
> > Please look at the above example sequence that covers the vdpa
> > example also.
> > 
> > [1] 
> > https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/
> 
> I think the biggest missing piece in my understanding is what's the
> technical difference between an SF and a VDPA device.
> 
Same difference as between a VF and netdev.
SF == VF, so a full HW function.
VDPA/RDMA/netdev/SCSI/nvme/etc. are just interfaces (ULPs) sharing the
same function, as has always been the case; nothing new about this.
Today on a VF we load RDMA/VDPA/netdev interfaces.
An SF will do exactly the same: the ULPs will simply load, and we don't
need to modify them.
> Isn't a VDPA device an SF with a particular descriptor format for the
> queues?
No :/, 
I hope the above answer clarifies things a bit.
SF is a device function that provides all kinds of queues.
* RE: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-19  6:12               ` Saeed Mahameed
@ 2020-11-19  8:25                 ` Parav Pandit
  2020-11-20  1:35                 ` Jakub Kicinski
  1 sibling, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-19  8:25 UTC (permalink / raw)
  To: Saeed Mahameed, Jakub Kicinski, Jason Gunthorpe
  Cc: netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
> From: Saeed Mahameed <saeed@kernel.org>
> Sent: Thursday, November 19, 2020 11:42 AM
> 
> From how this discussion is going, i think you are right, we need to clarify
> what we are doing in a more high level simplified and generic documentation
> to give some initial context, Parav, let's add the missing documentation, we
> can also add some comments regarding how this is very different from
> VMDq, but i would like to avoid that, since it is different in almost every way:)
Sure I will add Documentation/networking/subfunction.rst in v2 describing subfunction details.
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-19  6:22               ` Saeed Mahameed
@ 2020-11-19 14:00                 ` Jason Gunthorpe
  2020-11-20  3:35                   ` Jakub Kicinski
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2020-11-19 14:00 UTC (permalink / raw)
  To: Saeed Mahameed, Jakub Kicinski
  Cc: Parav Pandit, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
On Wed, Nov 18, 2020 at 10:22:51PM -0800, Saeed Mahameed wrote:
> > I think the biggest missing piece in my understanding is what's the
> > technical difference between an SF and a VDPA device.
> 
> Same difference as between a VF and netdev.
> SF == VF, so a full HW function.
> VDPA/RDMA/netdev/SCSI/nvme/etc.. are just interfaces (ULPs) sharing the
> same functions as always been, nothing new about this.
All the implementation details are very different, but this white
paper from Intel goes into some detail on the basic elements and
rationale for the SF concept:
https://software.intel.com/content/dam/develop/public/us/en/documents/intel-scalable-io-virtualization-technical-specification.pdf
What we are calling a sub-function here is a close cousin to what
Intel calls an Assignable Device Interface. I expect to see other
drivers following this general pattern eventually.
A SF will eventually be assignable to a VM and the VM won't be able to
tell the difference between a VF or SF providing the assignable PCI
resources.
VDPA is also assignable to a guest, but the key difference between
mlx5's SF and VDPA is what guest driver binds to the virtual PCI
function. For a SF the guest will bind mlx5_core, for VDPA the guest
will bind virtio-net.
So, the driver stack for a VM using VDPA might be
 Physical device [pci] -> mlx5_core -> [aux] -> SF -> [aux] ->  mlx5_core -> [aux] -> mlx5_vdpa -> QEMU -> |VM| -> [pci] -> virtio_net
When Parav is talking about creating VDPA devices he means attaching
the VDPA accelerator subsystem to a mlx5_core, wherever that
mlx5_core might be attached to.
To your other remark:
> > What are you NAK'ing?
> Spawning multiple netdevs from one device by slicing up its queues.
This is a bit vague. In SRIOV a device spawns multiple netdevs for a
physical port by "slicing up its physical queues" - where do you see
the cross over between VMDq (bad) and SRIOV (ok)?
I thought the issue with VMDq was more on the horrid management to
configure the traffic splitting, not the actual splitting itself?
In classic SRIOV the traffic is split by a simple non-configurable HW
switch based on MAC address of the VF.
mlx5 already has the extended version of that idea, we can run in
switchdev mode and use switchdev to configure the HW switch. Now
configurable switchdev rules split the traffic for VFs.
This SF step replaces the VF in the above, but everything else is the
same. The switchdev still splits the traffic, it still ends up in same
nested netdev queue structure & RSS a VF/PF would use, etc, etc. No
queues are "stolen" to create the nested netdev.
From the driver perspective there is no significant difference between
sticking a netdev on a mlx5 VF or sticking a netdev on a mlx5 SF. A SF
netdev is not going in and doing deep surgery to the PF netdev to
steal queues or something.
Both VF and SF will be eventually assignable to guests, both can
support all the accelerator subsystems - VDPA, RDMA, etc. Both can
support netdev.
Compared to VMDq, I think there is really no comparison. SF/ADI is an
evolution of a SRIOV VF from something PCI-SIG controlled to something
device specific and lighter weight.
SF/ADI come with an architectural security boundary suitable for
assignment to an untrusted guest. It is not just a jumble of queues.
VMDq is .. not that.
Actually it has been one of the open debates in the virtualization
userspace world. The approach to use switchdev to control the traffic
splitting to VMs is elegant but many drivers are not following
this design. :(
Finally, in the mlx5 model VDPA is just an "application". It asks the
device to create a 'RDMA' raw ethernet packet QP that uses rings
formatted as in the virtio-net specification. We can create it in the kernel
using mlx5_vdpa, and we can create it in userspace through the RDMA
subsystem. Like any "RDMA" application it is contained by the security
boundary of the PF/VF/SF the mlx5_core is running on.
Jason
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-19  4:35               ` David Ahern
  2020-11-19  5:57                 ` Saeed Mahameed
@ 2020-11-20  1:29                 ` Jakub Kicinski
  2020-11-20 17:58                   ` Alexander Duyck
  1 sibling, 1 reply; 57+ messages in thread
From: Jakub Kicinski @ 2020-11-20  1:29 UTC (permalink / raw)
  To: David Ahern
  Cc: Jason Gunthorpe, Parav Pandit, Saeed Mahameed,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net, Alexander Duyck
On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
> On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:  
> >> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> >>  
> >>>> Just to refresh all our memory, we discussed and settled on the flow
> >>>> in [2]; RFC [1] followed this discussion.
> >>>>
> >>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already
> >>>> spawned PF, VF, SF device.    
> >>>
> >>> Nack for the networking part of that. It'd basically be VMDq.    
> >>
> >> What are you NAK'ing?   
> > 
> > Spawning multiple netdevs from one device by slicing up its queues.  
> 
> Why do you object to that? Slicing up h/w resources for virtual what
> ever has been common practice for a long time.
My memory of the VMDq debate is hazy, let me rope in Alex into this.
I believe the argument was that we should offload software constructs,
not create HW-specific APIs which depend on HW availability and
implementation. So the path we took was offloading macvlan.
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-19  5:57                 ` Saeed Mahameed
@ 2020-11-20  1:31                   ` Jakub Kicinski
  2020-11-25  5:33                   ` David Ahern
  1 sibling, 0 replies; 57+ messages in thread
From: Jakub Kicinski @ 2020-11-20  1:31 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David Ahern, Jason Gunthorpe, Parav Pandit,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
On Wed, 18 Nov 2020 21:57:57 -0800 Saeed Mahameed wrote:
> On Wed, 2020-11-18 at 21:35 -0700, David Ahern wrote:
> > On 11/18/20 7:14 PM, Jakub Kicinski wrote:  
> > > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:  
> > > > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > > >   
> > > > > > Just to refresh all our memory, we discussed and settled on
> > > > > > the flow
> > > > > > in [2]; RFC [1] followed this discussion.
> > > > > > 
> > > > > > vdpa tool of [3] can add one or more vdpa device(s) on top of
> > > > > > already
> > > > > > spawned PF, VF, SF device.    
> > > > > 
> > > > > Nack for the networking part of that. It'd basically be VMDq.    
> > > > 
> > > > What are you NAK'ing?   
> > > 
> > > Spawning multiple netdevs from one device by slicing up its queues.  
> > 
> > Why do you object to that? Slicing up h/w resources for virtual what
> > ever has been common practice for a long time.
> 
> We are not slicing up any queues, from our HW and FW perspective SF ==
> VF literally, a full blown HW slice (Function), with isolated control
> and data plane of its own, this is very different from VMDq and more
> generic and secure. an SF device is exactly like a VF, doesn't steal or
> share any HW resources or control/data path with others. SF is
> basically SRIOV done right.
> 
> this series has nothing to do with netdev, if you look at the list of
> files Parav is touching, there is 0 change in our netdev stack :) ..
> all Parav is doing is adding the API to create/destroy SFs and
> represents the low level SF function to devlink as a device, just
> like a VF.
Ack, the concern is about the vdpa, not SF. 
So not really this patch set.
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-19  6:12               ` Saeed Mahameed
  2020-11-19  8:25                 ` Parav Pandit
@ 2020-11-20  1:35                 ` Jakub Kicinski
  2020-11-20  3:34                   ` Parav Pandit
  1 sibling, 1 reply; 57+ messages in thread
From: Jakub Kicinski @ 2020-11-20  1:35 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Jason Gunthorpe, Parav Pandit, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org,
	Jiri Pirko, dledford@redhat.com, Leon Romanovsky,
	davem@davemloft.net
On Wed, 18 Nov 2020 22:12:22 -0800 Saeed Mahameed wrote:
> > Right, devices of other subsystems are fine, I don't care.
> 
> But a netdev will be loaded on SF automatically just through the
> current driver design and modularity, since SF == VF and our netdev is
> abstract and doesn't know if it runs on a PF/VF/SF .. we literally have
> to add code to not load a netdev on a SF. why ? :/
A netdev is fine, but the examples so far don't make it clear (to me) 
if it's expected/supported to spawn _multiple_ netdevs from a single
"vdpa parentdev".
* RE: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-20  1:35                 ` Jakub Kicinski
@ 2020-11-20  3:34                   ` Parav Pandit
  0 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-20  3:34 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: Jason Gunthorpe, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org,
	Jiri Pirko, dledford@redhat.com, Leon Romanovsky,
	davem@davemloft.net
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, November 20, 2020 7:05 AM
> 
> On Wed, 18 Nov 2020 22:12:22 -0800 Saeed Mahameed wrote:
> > > Right, devices of other subsystems are fine, I don't care.
> >
> > But a netdev will be loaded on SF automatically just through the
> > current driver design and modularity, since SF == VF and our netdev is
> > abstract and doesn't know if it runs on a PF/VF/SF .. we literally
> > have to add code to not load a netdev on a SF. why ? :/
> 
> A netdev is fine, but the examples so far don't make it clear (to me) if it's
> expected/supported to spawn _multiple_ netdevs from a single "vdpa
> parentdev".
We do not create a netdev from a vdpa parentdev.
From a vdpa parentdev, only vdpa device(s) are created; each is a 'struct device' residing in /sys/bus/vdpa/<device>.
Currently such a vdpa device is already created on mlx5_vdpa.ko driver load; however, the user has no way to inspect it, query its stats, or get/set its features, hence the vdpa tool.
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-19 14:00                 ` Jason Gunthorpe
@ 2020-11-20  3:35                   ` Jakub Kicinski
  2020-11-20  3:50                     ` Parav Pandit
  2020-11-20 16:16                     ` Jason Gunthorpe
  0 siblings, 2 replies; 57+ messages in thread
From: Jakub Kicinski @ 2020-11-20  3:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Saeed Mahameed, Parav Pandit, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org,
	Jiri Pirko, dledford@redhat.com, Leon Romanovsky,
	davem@davemloft.net
On Thu, 19 Nov 2020 10:00:17 -0400 Jason Gunthorpe wrote:
> Finally, in the mlx5 model VDPA is just an "application". It asks the
> device to create a 'RDMA' raw ethernet packet QP that is uses rings
> formed in the virtio-net specification. We can create it in the kernel
> using mlx5_vdpa, and we can create it in userspace through the RDMA
> subsystem. Like any "RDMA" application it is contained by the security
> boundary of the PF/VF/SF the mlx5_core is running on.
Thanks for the write up!
The SF part is pretty clear to me, it is what it is. DPDK camp has been
pretty excited about ADI/PASID for a while now.
The part that's blurry to me is VDPA.
I was under the impression that for VDPA the device is supposed to
support native virtio 2.0 (or whatever the "HW friendly" spec was).
I believe that's what the early patches from Intel did.
You're saying it's a client application like any other - do I understand
it right that the hypervisor driver will be translating descriptors
between virtio and device-native then?
The vdpa parent is in the hypervisor correct?
Can a VDPA device have multiple children of the same type?
Why do we have a representor for a SF, if the interface is actually VDPA?
Block and net traffic can't reasonably be treated the same by the switch.
Also I'm confused how block device can bind to mlx5_core - in that case
I'm assuming the QP is bound 1:1 with a QP on the SmartNIC side, and
that QP is plugged into an appropriate backend?
* RE: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-20  3:35                   ` Jakub Kicinski
@ 2020-11-20  3:50                     ` Parav Pandit
  2020-11-20 16:16                     ` Jason Gunthorpe
  1 sibling, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-11-20  3:50 UTC (permalink / raw)
  To: Jakub Kicinski, Jason Gunthorpe
  Cc: Saeed Mahameed, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org,
	Jiri Pirko, dledford@redhat.com, Leon Romanovsky,
	davem@davemloft.net
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, November 20, 2020 9:05 AM
> 
> On Thu, 19 Nov 2020 10:00:17 -0400 Jason Gunthorpe wrote:
> > Finally, in the mlx5 model VDPA is just an "application". It asks the
> > device to create a 'RDMA' raw ethernet packet QP that is uses rings
> > formed in the virtio-net specification. We can create it in the kernel
> > using mlx5_vdpa, and we can create it in userspace through the RDMA
> > subsystem. Like any "RDMA" application it is contained by the security
> > boundary of the PF/VF/SF the mlx5_core is running on.
> 
> Thanks for the write up!
> 
> The SF part is pretty clear to me, it is what it is. DPDK camp has been pretty
> excited about ADI/PASID for a while now.
> 
> 
> The part that's blurry to me is VDPA.
> 
> I was under the impression that for VDPA the device is supposed to support
> native virtio 2.0 (or whatever the "HW friendly" spec was).
> 
> I believe that's what the early patches from Intel did.
> 
> You're saying it's a client application like any other - do I understand it right that
> the hypervisor driver will be translating descriptors between virtio and device-
> native then?
>
The mlx5 device supports virtio descriptors natively, so there is no need for translation.
 
> The vdpa parent is in the hypervisor correct?
> 
Yep. 
> Can a VDPA device have multiple children of the same type?
>
I guess you mean the VDPA parentdev? If so, yes; however, at present we see only a one-to-one mapping of vdpa device to parent dev.
 
> Why do we have a representor for a SF, if the interface is actually VDPA?
Because vdpa is just one client out of multiple.
At the moment there is one to one relation of vdpa device to a SF/VF.
> Block and net traffic can't reasonably be treated the same by the switch.
> 
> Also I'm confused how block device can bind to mlx5_core - in that case I'm
> assuming the QP is bound 1:1 with a QP on the SmartNIC side, and that QP is
> plugged into an appropriate backend?
So far mlx5_vdpa.ko does not do block, and there is no plan for it. But yes, in future, for block it would need to bind to a QP in a backend on the SmartNIC.
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-20  3:35                   ` Jakub Kicinski
  2020-11-20  3:50                     ` Parav Pandit
@ 2020-11-20 16:16                     ` Jason Gunthorpe
  1 sibling, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2020-11-20 16:16 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, Parav Pandit, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, gregkh@linuxfoundation.org,
	Jiri Pirko, dledford@redhat.com, Leon Romanovsky,
	davem@davemloft.net
On Thu, Nov 19, 2020 at 07:35:26PM -0800, Jakub Kicinski wrote:
> On Thu, 19 Nov 2020 10:00:17 -0400 Jason Gunthorpe wrote:
> > Finally, in the mlx5 model VDPA is just an "application". It asks the
> > device to create a 'RDMA' raw ethernet packet QP that is uses rings
> > formed in the virtio-net specification. We can create it in the kernel
> > using mlx5_vdpa, and we can create it in userspace through the RDMA
> > subsystem. Like any "RDMA" application it is contained by the security
> > boundary of the PF/VF/SF the mlx5_core is running on.
> 
> Thanks for the write up!
No problem!
> The part that's blurry to me is VDPA.
Okay, I think I see where the gap is, I'm going to elaborate below so
we are clear.
> I was under the impression that for VDPA the device is supposed to
> support native virtio 2.0 (or whatever the "HW friendly" spec was).
I think VDPA covers a wide range of things.
The basic idea is starting with the all SW virtio-net implementation
we can move parts to HW. Each implementation will probably be a little
different here. The kernel vdpa subsystem is a toolbox to mix the
required emulation and HW capability to build a virtio-net PCI
interface.
The most key question to ask of any VDPA design is "what does the VDPA
FW do with the packet once the HW accelerator has parsed the
virtio-net descriptor?".
The VDPA world has refused to agree on this due to vendor squabbling,
but mlx5 has a clear answer:
 VDPA Tx generates an ethernet packet and sends it out the SF/VF port
 through a tunnel to the representor and then on to the switchdev.
Other VDPA designs have a different answer!!
This concept is so innate to how Mellanox views the world that it is not
surprising to me that the cover letters and patch descriptions don't
belabor this point much :)
I'm going to deep dive through this answer below. I think you'll see
this is the most sane and coherent architecture with the tools
available in netdev.. Mellanox thinks the VDPA world should
standardize on this design so we can have a standard control plane.
> You're saying it's a client application like any other - do I understand
> it right that the hypervisor driver will be translating descriptors
> between virtio and device-native then?
No, the hypervisor creates a QP and tells the HW that this QP's
descriptor format follows virtio-net. The QP processes those
descriptors in HW and generates ethernet packets.
A "client application like any other" means that the ethernet packets
VDPA forms are identical to the ones netdev or RDMA forms. They are
all delivered into the tunnel on the SF/VF to the representor and on
to the switch. See below
> The vdpa parent is in the hypervisor correct?
> 
> Can a VDPA device have multiple children of the same type?
I'm not sure parent/child are good words here.
The VDPA emulation runs in the hypervisor, and the virtio-net netdev
driver runs in the guest. The VDPA is attached to a switchdev port and
representor tunnel by virtue of its QPs being created under a SF/VF.
If we imagine a virtio-rdma, then you might have a SF/VF hosting both
VDPA and VDPA-RDMA which emulate two PCI devices assigned to a
VM. Both of these peer virtio's would generate ethernet packets for TX
on the SF/VF port into the tunnel through the representor and to the
switch.
> Why do we have a representor for a SF, if the interface is actually VDPA?
> Block and net traffic can't reasonably be treated the same by the
> switch.
I think you are focusing on queues, the architecture at PF/SF/VF is
not queue based, it is packet based.
At the physical mlx5 the netdev has a switchdev. On that switch I can
create a *switch port*.
The switch port is composed of a representor and a SF/VF. They form a
tunnel for packets.
The representor is the hypervisor side of the tunnel and contains all
packets coming out of and into the SF/VF.
The SF/VF is the guest side of the tunnel and has a full NIC.
The SF/VF can be:
 - Used in the same OS as the switch
 - Assigned to a guest VM as a PCI device
 - Assigned to another processor in the SmartNIC case.
In all cases if I use a queue on a SF/VF to generate an ethernet
packet then that packet *always* goes into the tunnel to the
representor and goes into a switch. It is always contained by any
rules on the switch side. If the switch is set so the representor is
VLAN tagged then a queue on a SF/VF *cannot* escape the VLAN tag.
Similarly SF/VF cannot Rx any packets that are not sent into the
tunnel, meaning the switch controls what packets go into the
representor, through the tunnel and to the SF.
Yes, block and net traffic are all reduced to ethernet packets, sent
through the tunnel to the representor and treated by the switch. It is
no different than a physical switch. If there is to be some net/block
difference it has to be represented in the ethernet packets, eg with
vlan or something.
This is the fundamental security boundary of the architecture. The
SF/VF is a security domain and the only exchange of information from
that security domain to the hypervisor security domain is the tunnel
to the representor. The exchange across the boundary is only *packets*
not queues.
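To make the tunnel model above a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the class and field names are invented for this example and are not mlx5 or devlink APIs, and VLAN enforcement is simplified to overwriting whatever tag the sender picked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    payload: bytes
    vlan: Optional[int] = None      # VLAN tag chosen by the sender, if any
    src_port: Optional[int] = None  # filled in by the tunnel, not the sender


class SwitchPort:
    """Hypothetical model of one switch port: a representor paired with an SF/VF."""

    def __init__(self, port_id: int, enforced_vlan: Optional[int] = None):
        self.port_id = port_id
        self.enforced_vlan = enforced_vlan
        self.function_rx = []  # packets the switch chose to send to the SF/VF

    def tx_from_function(self, pkt: Packet) -> Packet:
        """A packet leaves any queue on the SF/VF and appears at the representor."""
        vlan = pkt.vlan if self.enforced_vlan is None else self.enforced_vlan
        return Packet(pkt.payload, vlan=vlan, src_port=self.port_id)

    def rx_to_function(self, pkt: Packet) -> None:
        """Only packets the switch sends into the tunnel reach the SF/VF."""
        self.function_rx.append(pkt)


port = SwitchPort(port_id=4, enforced_vlan=100)
# The SF/VF tries to send on VLAN 7; the switch side still sees VLAN 100.
seen_by_switch = port.tx_from_function(Packet(b"hello", vlan=7))
```

Note that nothing about queues crosses the boundary in this sketch: the representor side only ever sees a packet and the port it came from.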
Essentially it exactly models the physical world. If I physically plug
in a NIC to a switch then the "representor" is the switch port in the
physical switch OS and the "SF/VF" is the NIC in the server.
The switch OS does not know or care what the NIC is doing. It does not
know or care if the NIC is doing VDPA, or if the packets are "block"
or "net" - they are all just packets by the time it gets to switching.
> Also I'm confused how block device can bind to mlx5_core - in that case
> I'm assuming the QP is bound 1:1 with a QP on the SmartNIC side, and
> that QP is plugged into an appropriate backend?
Every mlx5_core is a full multi-queue instance. It can have a huge
number of queues with no problems. Do not focus on the
queues. *queues* are irrelevant here.
Queues always have two ends. In this model one end is at the CPU and
the other is just ethernet packets. The purpose of the queue is to
convert CPU stuff into ethernet packets and vice versa. A mlx5 device
has a wide range of accelerators that can do all sorts of
transformations between CPU and packets built into the queues.
A queue can only be attached to a single mlx5_core, meaning all the
ethernet packets the queue sources/sinks must come from the PF/SF/VF
port. For SF/VF this port is connected to a tunnel to a representor to
the switch. Thus every queue has its packet side connected to the
switch.
However, the *queue* is an opaque detail of how the ethernet packets
are created from CPU data.
It doesn't matter if the queue is running VDPA, RDMA, netdev, or block
traffic - all of these things inherently result in ethernet packets,
and the hypervisor can't tell how the packet was created.
The architecture is *not* like virtio. virtio queues are individual
tunnels between hypervisor and guest.
This is the key detail: A VDPA queue is *not a tunnel*. It is an engine
to convert CPU data in virtio-net format to ethernet packets and
deliver those packets to the SF/VF end of the tunnel to the representor
and then to the switch. The tunnel is the SF/VF and representor
pairing, NOT the VDPA queue.
Looking at the logical life of a Tx packet from a VM doing VDPA:
 - VM's netdev builds the skb and writes a virtio-net formed descriptor
   to a send queue
 - VM triggers a doorbell via write to a BAR. In mlx5 this write goes
   to the device - qemu mmaps part of the device BAR to the guest
 - The HW begins processing a queue. The queue is in virtio-net format
   so it fetches the descriptor and now has the skb data
 - The HW forms the skb into an ethernet packet and delivers it to the
   representor through the tunnel, which immediately sends it to the
   HW switch. The VDPA QP in the SF/VF is now done.
 - In the switch the HW determines the packet is an exception. It
   applies RSS rules/etc and dynamically identifies on a per-packet
   basis what hypervisor queue the packet should be delivered to.
   This queue is in the hypervisor, and is in mlx5 native format.
 - The chosen hypervisor queue receives this packet and begins
   processing. It gets a receive buffer, writes the packet, and
   triggers an interrupt. This queue is now done.
 - hypervisor netdev now has the packet. It does the exception path
   in netdev and puts the SKB back on another queue for TX to the
   physical port. This queue is in mlx5 native format, the packet goes
   to the physical port.
It traversed three queues. The HW dynamically selected the hypervisor
queue the VDPA packet is delivered to based *entirely* on switch
rules. The originating queue only informs the switch of what SF/VF
(and thus switch port) generated the packet.
At no point does the hypervisor know the packet originated from a VDPA
QP.
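The three-queue walk above can be sketched in a few lines of Python. This is a toy model, not driver code: `FunctionQueue`, `exception_path` and the modulo-based queue choice (standing in for RSS) are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EthPacket:
    data: bytes
    src_port: int  # switch port of the SF/VF; the only origin info kept


class FunctionQueue:
    """Hypothetical queue on an SF/VF; `fmt` is its descriptor format."""

    def __init__(self, fmt: str, port: int):
        self.fmt = fmt
        self.port = port

    def transmit(self, data: bytes) -> EthPacket:
        # The descriptor format is consumed here and does not travel with the packet.
        return EthPacket(data, self.port)


def exception_path(pkt: EthPacket, hyp_rx_queues: List[list]) -> int:
    """No switch rule matched: pick a hypervisor RX queue per packet (RSS stand-in)."""
    idx = (sum(pkt.data) + pkt.src_port) % len(hyp_rx_queues)
    hyp_rx_queues[idx].append(pkt)
    return idx


vdpa_q = FunctionQueue("virtio-net", port=4)     # queue 1: guest-facing VDPA QP
netdev_q = FunctionQueue("mlx5-native", port=4)  # a netdev queue on the same SF/VF
hyp_rx = [[], []]                                # queue 2: hypervisor RX queues
hyp_tx = []                                      # queue 3: hypervisor TX to the wire

pkt = vdpa_q.transmit(b"payload")
chosen = exception_path(pkt, hyp_rx)
hyp_tx.append(hyp_rx[chosen].pop())  # hypervisor netdev forwards the packet
```

In this model a packet sent from `vdpa_q` compares equal to one sent from `netdev_q`, which is the point being made: by the time the switch sees it, only the source port is left.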
The RX side is similar: each PF/SF/VF port has a selector that
chooses which queue each packet goes to. That chooses how the packet
is converted to CPU. Each PF/SF/VF can have a huge number of
selectors, and SF/VF source their packets from the logical tunnel
attached to a representor which receives packets from the switch.
The selector is how the cross subsystem sharing of the ethernet port
works, regardless of PF/SF/VF.
Again the hypervisor side has *no idea* what queue the packet will be
selected to when it delivers the packet to the representor side of the
tunnel.
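As a rough sketch of the RX selector idea, assuming a first-match list of predicates (the real steering logic matches on far more than this, and every name here is hypothetical):

```python
from typing import Callable, List, Tuple

# A selector entry is (match predicate, destination queue name).
Selector = Tuple[Callable[[bytes], bool], str]


def select_rx_queue(selectors: List[Selector], frame: bytes, default: str) -> str:
    """Return the queue for `frame`: the first matching selector wins."""
    for matches, queue in selectors:
        if matches(frame):
            return queue
    return default


def ethertype(frame: bytes) -> int:
    """EtherType of an untagged Ethernet frame (bytes 12-13, big endian)."""
    return int.from_bytes(frame[12:14], "big")


# Hypothetical selectors on one SF/VF: RoCE v1 frames go to an RDMA queue,
# everything else falls through to the netdev queue.
selectors = [(lambda f: ethertype(f) == 0x8915, "rdma-qp")]

roce_frame = bytes(12) + (0x8915).to_bytes(2, "big") + b"data"
ipv4_frame = bytes(12) + (0x0800).to_bytes(2, "big") + b"data"
```

The selection happens entirely on the function side of the tunnel; the caller delivering the frame does not pass in, or learn, which queue is picked.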
Jason
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-20  1:29                 ` Jakub Kicinski
@ 2020-11-20 17:58                   ` Alexander Duyck
  2020-11-20 19:04                     ` Samudrala, Sridhar
  0 siblings, 1 reply; 57+ messages in thread
From: Alexander Duyck @ 2020-11-20 17:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, Jason Gunthorpe, Parav Pandit, Saeed Mahameed,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
> > On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> > > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> > >> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > >>
> > >>>> Just to refresh all our memory, we discussed and settled on the flow
> > >>>> in [2]; RFC [1] followed this discussion.
> > >>>>
> > >>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already
> > >>>> spawned PF, VF, SF device.
> > >>>
> > >>> Nack for the networking part of that. It'd basically be VMDq.
> > >>
> > >> What are you NAK'ing?
> > >
> > > Spawning multiple netdevs from one device by slicing up its queues.
> >
> > Why do you object to that? Slicing up h/w resources for virtual what
> > ever has been common practice for a long time.
>
> My memory of the VMDq debate is hazy, let me rope in Alex into this.
> I believe the argument was that we should offload software constructs,
> not create HW-specific APIs which depend on HW availability and
> implementation. So the path we took was offloading macvlan.
I think it somewhat depends on the type of interface we are talking
about. What we were wanting to avoid was drivers spawning their own
unique VMDq netdevs and each having a different way of doing it. The
approach Intel went with was to use a MACVLAN offload for this.
Although I would imagine many would argue the approach is somewhat
dated and limiting since you cannot do many offloads on a MACVLAN
interface.
With the VDPA case I believe there is a set of predefined virtio
devices that are being emulated and presented so it isn't as if they
are creating a totally new interface for this.
What I would be interested in seeing is if there are any other vendors
that have reviewed this and sign off on this approach. What we don't
want to see is Nvidia/Mellanox do this one way, then Broadcom or Intel
come along later and have yet another way of doing this. We need an
interface and feature set that will work for everyone in terms of how
this will look going forward.
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-20 17:58                   ` Alexander Duyck
@ 2020-11-20 19:04                     ` Samudrala, Sridhar
  2020-11-23 21:51                       ` Saeed Mahameed
  2020-11-24  7:01                       ` Jason Wang
  0 siblings, 2 replies; 57+ messages in thread
From: Samudrala, Sridhar @ 2020-11-20 19:04 UTC (permalink / raw)
  To: Alexander Duyck, Jakub Kicinski
  Cc: David Ahern, Jason Gunthorpe, Parav Pandit, Saeed Mahameed,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
On 11/20/2020 9:58 AM, Alexander Duyck wrote:
> On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
>>> On 11/18/20 7:14 PM, Jakub Kicinski wrote:
>>>> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
>>>>> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
>>>>>
>>>>>>> Just to refresh all our memory, we discussed and settled on the flow
>>>>>>> in [2]; RFC [1] followed this discussion.
>>>>>>>
>>>>>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already
>>>>>>> spawned PF, VF, SF device.
>>>>>>
>>>>>> Nack for the networking part of that. It'd basically be VMDq.
>>>>>
>>>>> What are you NAK'ing?
>>>>
>>>> Spawning multiple netdevs from one device by slicing up its queues.
>>>
>>> Why do you object to that? Slicing up h/w resources for virtual what
>>> ever has been common practice for a long time.
>>
>> My memory of the VMDq debate is hazy, let me rope in Alex into this.
>> I believe the argument was that we should offload software constructs,
>> not create HW-specific APIs which depend on HW availability and
>> implementation. So the path we took was offloading macvlan.
> 
> I think it somewhat depends on the type of interface we are talking
> about. What we were wanting to avoid was drivers spawning their own
> unique VMDq netdevs and each having a different way of doing it. The
> approach Intel went with was to use a MACVLAN offload to approach it.
> Although I would imagine many would argue the approach is somewhat
> dated and limiting since you cannot do many offloads on a MACVLAN
> interface.
Yes. We talked about this at netdev 0x14 and the limitations of macvlan 
based offloads.
https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces
Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI as a 
netdev for kernel containers. AF_XDP ZC in a container is one of the 
use cases this would address. Today we have to pass the entire PF/VF to a 
container to do AF_XDP.
Looks like the current model is to create a subfunction of a specific 
type on auxiliary bus, do some configuration to assign resources and 
then activate the subfunction.
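
As a rough sketch of that create/configure/activate flow, the shell
function below only prints the devlink commands for one SF rather than
running them. The port add and hw_addr forms are the ones used elsewhere
in this thread; the `state active` step follows the iproute2
devlink-port syntax and is an assumption here, as is using the
representor name ens2f0npf0sf88 as the port handle.

```shell
#!/bin/sh
# Print (do not execute) the devlink steps for one subfunction:
# create the port, configure it, then activate it.
# All argument values in the example call are illustrative only.
sf_lifecycle_cmds() {
    dev=$1     # parent devlink device, e.g. pci/0000:06:00.0
    sfnum=$2   # user-chosen subfunction number
    port=$3    # devlink port handle / representor name (assumed)
    mac=$4     # MAC address to assign to the port function
    echo "devlink port add $dev flavour pcisf pfnum 0 sfnum $sfnum"
    echo "devlink port function set $port hw_addr $mac"
    echo "devlink port function set $port state active"
}

sf_lifecycle_cmds pci/0000:06:00.0 88 ens2f0npf0sf88 00:00:00:00:88:88
```

Piping the printed lines to `sh` as root would apply them on a system
that has the matching hardware and driver support.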
> 
> With the VDPA case I believe there is a set of predefined virtio
> devices that are being emulated and presented so it isn't as if they
> are creating a totally new interface for this.
> 
> What I would be interested in seeing is if there are any other vendors
> that have reviewed this and sign off on this approach. What we don't
> want to see is Nivida/Mellanox do this one way, then Broadcom or Intel
> come along later and have yet another way of doing this. We need an
> interface and feature set that will work for everyone in terms of how
> this will look going forward.
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-20 19:04                     ` Samudrala, Sridhar
@ 2020-11-23 21:51                       ` Saeed Mahameed
  2020-11-24  7:01                       ` Jason Wang
  1 sibling, 0 replies; 57+ messages in thread
From: Saeed Mahameed @ 2020-11-23 21:51 UTC (permalink / raw)
  To: Samudrala, Sridhar, Alexander Duyck, Jakub Kicinski
  Cc: David Ahern, Jason Gunthorpe, Parav Pandit,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
On Fri, 2020-11-20 at 11:04 -0800, Samudrala, Sridhar wrote:
> 
> On 11/20/2020 9:58 AM, Alexander Duyck wrote:
> > On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org>
> > wrote:
> > > On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
> > > > On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> > > > > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> > > > > > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski
> > > > > > wrote:
> > > > > > 
> > > > > > > > Just to refresh all our memory, we discussed and
> > > > > > > > settled on the flow
> > > > > > > > in [2]; RFC [1] followed this discussion.
> > > > > > > > 
> > > > > > > > vdpa tool of [3] can add one or more vdpa device(s) on
> > > > > > > > top of already
> > > > > > > > spawned PF, VF, SF device.
> > > > > > > 
> > > > > > > Nack for the networking part of that. It'd basically be
> > > > > > > VMDq.
> > > > > > 
> > > > > > What are you NAK'ing?
> > > > > 
> > > > > Spawning multiple netdevs from one device by slicing up its
> > > > > queues.
> > > > 
> > > > Why do you object to that? Slicing up h/w resources for virtual
> > > > what
> > > > ever has been common practice for a long time.
> > > 
> > > My memory of the VMDq debate is hazy, let me rope in Alex into
> > > this.
> > > I believe the argument was that we should offload software
> > > constructs,
> > > not create HW-specific APIs which depend on HW availability and
> > > implementation. So the path we took was offloading macvlan.
> > 
> > I think it somewhat depends on the type of interface we are talking
> > about. What we were wanting to avoid was drivers spawning their own
> > unique VMDq netdevs and each having a different way of doing it. 
Agreed, but SF netdevs are not VMDq netdevs; they are available in the
switchdev model, where they correspond to a full-blown port (HW domain).
> > The
> > approach Intel went with was to use a MACVLAN offload to approach
> > it.
> > Although I would imagine many would argue the approach is somewhat
> > dated and limiting since you cannot do many offloads on a MACVLAN
> > interface.
> 
> Yes. We talked about this at netdev 0x14 and the limitations of
> macvlan 
> based offloads.
> https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces
> 
> Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI
> as a 
Exactly. Subfunctions are the most generic model to overcome any SW
model limitations, e.g. macvtap offload. All HW vendors are already
creating netdevs on a given PF/VF, so all we need is to model the SF and
all the rest is the same! Most likely everything else comes for free,
like in the mlx5 model where the netdev/rdma interfaces are abstracted
from the underlying HW: the same netdev loads on a PF/VF/SF or even an
embedded function!
> netdev for kernel containers. AF_XDP ZC in a container is one of the 
> usecase this would address. Today we have to pass the entire PF/VF to
> a 
> container to do AF_XDP.
> 
This will be supported out of the box for free with SFs.
> Looks like the current model is to create a subfunction of a
> specific 
> type on auxiliary bus, do some configuration to assign resources and 
> then activate the subfunction.
> 
> > With the VDPA case I believe there is a set of predefined virtio
> > devices that are being emulated and presented so it isn't as if
> > they
> > are creating a totally new interface for this.
> > 
> > What I would be interested in seeing is if there are any other
> > vendors
> > that have reviewed this and sign off on this approach. What we
> > don't
> > want to see is Nivida/Mellanox do this one way, then Broadcom or
> > Intel
> > come along later and have yet another way of doing this. We need an
> > interface and feature set that will work for everyone in terms of
> > how
> > this will look going forward.
Well, the vdpa interface was created by the virtio community and
especially Red Hat; I am not sure Mellanox was even involved in the
initial development stages :-)
Anyway, historically speaking vDPA was originally created for DPDK, but
the same API applies to device drivers that can deliver the same set of
queues and API while bypassing the whole DPDK stack. Enter kernel vDPA,
which was created to overcome some of the userspace limitations and
complexity and to leverage some of the kernel's great features such as
eBPF.
https://www.redhat.com/en/blog/introduction-vdpa-kernel-framework
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-20 19:04                     ` Samudrala, Sridhar
  2020-11-23 21:51                       ` Saeed Mahameed
@ 2020-11-24  7:01                       ` Jason Wang
  2020-11-24  7:05                         ` Jason Wang
  1 sibling, 1 reply; 57+ messages in thread
From: Jason Wang @ 2020-11-24  7:01 UTC (permalink / raw)
  To: Samudrala, Sridhar, Alexander Duyck, Jakub Kicinski
  Cc: David Ahern, Jason Gunthorpe, Parav Pandit, Saeed Mahameed,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
On 2020/11/21 上午3:04, Samudrala, Sridhar wrote:
>
>
> On 11/20/2020 9:58 AM, Alexander Duyck wrote:
>> On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>
>>> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
>>>> On 11/18/20 7:14 PM, Jakub Kicinski wrote:
>>>>> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
>>>>>> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
>>>>>>
>>>>>>>> Just to refresh all our memory, we discussed and settled on the 
>>>>>>>> flow
>>>>>>>> in [2]; RFC [1] followed this discussion.
>>>>>>>>
>>>>>>>> vdpa tool of [3] can add one or more vdpa device(s) on top of 
>>>>>>>> already
>>>>>>>> spawned PF, VF, SF device.
>>>>>>>
>>>>>>> Nack for the networking part of that. It'd basically be VMDq.
>>>>>>
>>>>>> What are you NAK'ing?
>>>>>
>>>>> Spawning multiple netdevs from one device by slicing up its queues.
>>>>
>>>> Why do you object to that? Slicing up h/w resources for virtual what
>>>> ever has been common practice for a long time.
>>>
>>> My memory of the VMDq debate is hazy, let me rope in Alex into this.
>>> I believe the argument was that we should offload software constructs,
>>> not create HW-specific APIs which depend on HW availability and
>>> implementation. So the path we took was offloading macvlan.
>>
>> I think it somewhat depends on the type of interface we are talking
>> about. What we were wanting to avoid was drivers spawning their own
>> unique VMDq netdevs and each having a different way of doing it. The
>> approach Intel went with was to use a MACVLAN offload to approach it.
>> Although I would imagine many would argue the approach is somewhat
>> dated and limiting since you cannot do many offloads on a MACVLAN
>> interface.
>
> Yes. We talked about this at netdev 0x14 and the limitations of 
> macvlan based offloads.
> https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces 
>
>
> Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI as 
> a netdev for kernel containers. AF_XDP ZC in a container is one of the 
> usecase this would address. Today we have to pass the entire PF/VF to 
> a container to do AF_XDP.
>
> Looks like the current model is to create a subfunction of a specific 
> type on auxiliary bus, do some configuration to assign resources and 
> then activate the subfunction.
>
>>
>> With the VDPA case I believe there is a set of predefined virtio
>> devices that are being emulated and presented so it isn't as if they
>> are creating a totally new interface for this.
vDPA doesn't place any limitation on how the device is created or 
implemented. It could be predefined or created dynamically. vDPA leaves 
all of that to the parent device with the help of a unified management 
API[1]. E.g. it could be a PCI device (PF or VF), a sub-function, or a 
software-emulated device.
>>
>> What I would be interested in seeing is if there are any other vendors
>> that have reviewed this and sign off on this approach.
By "this approach" do you mean vDPA subfunction? My understanding is 
that it's totally vendor specific; the vDPA subsystem doesn't want to be 
limited by a specific type of device.
>> What we don't
>> want to see is Nivida/Mellanox do this one way, then Broadcom or Intel
>> come along later and have yet another way of doing this. We need an
>> interface and feature set that will work for everyone in terms of how
>> this will look going forward.
For the feature set, it would be hard to force vendors to implement a 
common set of features (though we can have a recommended set), 
considering they can be negotiated. So the management interface is 
expected to implement features like cpu clusters in order to ensure 
migration compatibility, or qemu can assist with the missing features 
at a performance loss.
Thanks
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-24  7:01                       ` Jason Wang
@ 2020-11-24  7:05                         ` Jason Wang
  0 siblings, 0 replies; 57+ messages in thread
From: Jason Wang @ 2020-11-24  7:05 UTC (permalink / raw)
  To: Samudrala, Sridhar, Alexander Duyck, Jakub Kicinski
  Cc: David Ahern, Jason Gunthorpe, Parav Pandit, Saeed Mahameed,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
On 2020/11/24 下午3:01, Jason Wang wrote:
>
> On 2020/11/21 上午3:04, Samudrala, Sridhar wrote:
>>
>>
>> On 11/20/2020 9:58 AM, Alexander Duyck wrote:
>>> On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>>
>>>> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
>>>>> On 11/18/20 7:14 PM, Jakub Kicinski wrote:
>>>>>> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
>>>>>>> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
>>>>>>>
>>>>>>>>> Just to refresh all our memory, we discussed and settled on 
>>>>>>>>> the flow
>>>>>>>>> in [2]; RFC [1] followed this discussion.
>>>>>>>>>
>>>>>>>>> vdpa tool of [3] can add one or more vdpa device(s) on top of 
>>>>>>>>> already
>>>>>>>>> spawned PF, VF, SF device.
>>>>>>>>
>>>>>>>> Nack for the networking part of that. It'd basically be VMDq.
>>>>>>>
>>>>>>> What are you NAK'ing?
>>>>>>
>>>>>> Spawning multiple netdevs from one device by slicing up its queues.
>>>>>
>>>>> Why do you object to that? Slicing up h/w resources for virtual what
>>>>> ever has been common practice for a long time.
>>>>
>>>> My memory of the VMDq debate is hazy, let me rope in Alex into this.
>>>> I believe the argument was that we should offload software constructs,
>>>> not create HW-specific APIs which depend on HW availability and
>>>> implementation. So the path we took was offloading macvlan.
>>>
>>> I think it somewhat depends on the type of interface we are talking
>>> about. What we were wanting to avoid was drivers spawning their own
>>> unique VMDq netdevs and each having a different way of doing it. The
>>> approach Intel went with was to use a MACVLAN offload to approach it.
>>> Although I would imagine many would argue the approach is somewhat
>>> dated and limiting since you cannot do many offloads on a MACVLAN
>>> interface.
>>
>> Yes. We talked about this at netdev 0x14 and the limitations of 
>> macvlan based offloads.
>> https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces 
>>
>>
>> Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI 
>> as a netdev for kernel containers. AF_XDP ZC in a container is one of 
>> the usecase this would address. Today we have to pass the entire 
>> PF/VF to a container to do AF_XDP.
>>
>> Looks like the current model is to create a subfunction of a specific 
>> type on auxiliary bus, do some configuration to assign resources and 
>> then activate the subfunction.
>>
>>>
>>> With the VDPA case I believe there is a set of predefined virtio
>>> devices that are being emulated and presented so it isn't as if they
>>> are creating a totally new interface for this.
>
>
> vDPA doesn't have any limitation of how the devices is created or 
> implemented. It could be predefined or created dynamically. vDPA 
> leaves all of those to the parent device with the help of a unified 
> management API[1]. E.g It could be a PCI device (PF or VF), 
> sub-function or  software emulated devices.
I missed the link for [1]: https://www.spinics.net/lists/netdev/msg699374.html
Thanks
>
>
>>>
>>> What I would be interested in seeing is if there are any other vendors
>>> that have reviewed this and sign off on this approach.
>
>
> For "this approach" do you mean vDPA subfucntion? My understanding is 
> that it's totally vendor specific, vDPA subsystem don't want to be 
> limited by a specific type of device.
>
>
>>> What we don't
>>> want to see is Nivida/Mellanox do this one way, then Broadcom or Intel
>>> come along later and have yet another way of doing this. We need an
>>> interface and feature set that will work for everyone in terms of how
>>> this will look going forward.
>
> For feature set,  it would be hard to force (we can have a 
> recommendation set of features) vendors to implement a common set of 
> features consider they can be negotiated. So the management interface 
> is expected to implement features like cpu clusters in order to make 
> sure the migration compatibility, or qemu can assist for the missing 
> feature with performance lose.
>
> Thanks
>
>
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-19  5:57                 ` Saeed Mahameed
  2020-11-20  1:31                   ` Jakub Kicinski
@ 2020-11-25  5:33                   ` David Ahern
  2020-11-25  6:00                     ` Parav Pandit
  1 sibling, 1 reply; 57+ messages in thread
From: David Ahern @ 2020-11-25  5:33 UTC (permalink / raw)
  To: Saeed Mahameed, Jakub Kicinski, Jason Gunthorpe
  Cc: Parav Pandit, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
On 11/18/20 10:57 PM, Saeed Mahameed wrote:
> 
> We are not slicing up any queues, from our HW and FW perspective SF ==
> VF literally, a full blown HW slice (Function), with isolated control
> and data plane of its own, this is very different from VMDq and more
> generic and secure. an SF device is exactly like a VF, doesn't steal or
> share any HW resources or control/data path with others. SF is
> basically SRIOV done right.
What does that mean with respect to mac filtering and ntuple rules?
Also, Tx is fairly easy to imagine, but how does hardware know how to
direct packets for the Rx path? As an example, consider 2 VMs or
containers with the same destination ip both using subfunction devices.
How does the nic know how to direct the ingress flows to the right
queues for the subfunction?
* RE: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-25  5:33                   ` David Ahern
@ 2020-11-25  6:00                     ` Parav Pandit
  2020-11-25 14:37                       ` David Ahern
  0 siblings, 1 reply; 57+ messages in thread
From: Parav Pandit @ 2020-11-25  6:00 UTC (permalink / raw)
  To: David Ahern, Saeed Mahameed, Jakub Kicinski, Jason Gunthorpe
  Cc: netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
Hi David,
> From: David Ahern <dsahern@gmail.com>
> Sent: Wednesday, November 25, 2020 11:04 AM
> 
> On 11/18/20 10:57 PM, Saeed Mahameed wrote:
> >
> > We are not slicing up any queues, from our HW and FW perspective SF ==
> > VF literally, a full blown HW slice (Function), with isolated control
> > and data plane of its own, this is very different from VMDq and more
> > generic and secure. an SF device is exactly like a VF, doesn't steal
> > or share any HW resources or control/data path with others. SF is
> > basically SRIOV done right.
> 
> What does that mean with respect to mac filtering and ntuple rules?
> 
> Also, Tx is fairly easy to imagine, but how does hardware know how to direct
> packets for the Rx path? As an example, consider 2 VMs or containers with the
> same destination ip both using subfunction devices.
Since both VMs/containers have the same IP, it is better to place them in different L2 domains via vlan, vxlan etc.
> How does the nic know how to direct the ingress flows to the right queues for
> the subfunction?
> 
Rx steering occurs through tc filters via representor netdev of SF.
Exactly same way as VF representor netdev operation.
When the devlink eswitch port is created as shown in the example in the cover letter, and also in patch-12, it creates the representor netdevice.
Below is a snippet of it.
Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
Configure mac address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
                                                ^^^^^^^^^^^^^^
This is the representor netdevice. It is created by the port add command.
This name is set up by systemd/udev v245 and higher, using the existing phys_port_name infrastructure that already exists for PF and VF representors.
Now the user can add a unicast rx tc rule, for example:
$ tc filter add dev ens2f0np0 parent ffff: prio 1 flower dst_mac 00:00:00:00:88:88 action mirred egress redirect dev ens2f0npf0sf88
I didn't cover this tc example in the cover letter, to keep it short.
But I had a one-line description, as below, in the 'detail' section of the cover letter.
Hope it helps.
- A SF supports eswitch representation and tc offload support similar
  to existing PF and VF representors.
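
For completeness, a minimal sketch of the tc setup around that rule. It
assumes the eswitch is already in switchdev mode and that an ingress
qdisc still has to be attached before `parent ffff:` filters can be
added. The reverse rule (representor back to the uplink, matched on
src_mac) is an illustrative assumption and not taken from the cover
letter. The function only prints the commands.

```shell
#!/bin/sh
# Print (do not execute) the tc commands that steer traffic between the
# uplink and one SF representor in both directions.
sf_tc_cmds() {
    uplink=$1  # uplink representor netdev, e.g. ens2f0np0
    rep=$2     # SF representor netdev, e.g. ens2f0npf0sf88
    mac=$3     # MAC address configured on the SF port function
    # The ingress qdisc provides the ffff: handle used by the filters.
    echo "tc qdisc add dev $uplink ingress"
    echo "tc qdisc add dev $rep ingress"
    # Wire -> SF: match the SF's MAC on the uplink, redirect to the representor.
    echo "tc filter add dev $uplink parent ffff: prio 1 flower dst_mac $mac action mirred egress redirect dev $rep"
    # SF -> wire: packets sent by the SF appear on its representor (assumed rule).
    echo "tc filter add dev $rep parent ffff: prio 1 flower src_mac $mac action mirred egress redirect dev $uplink"
}

sf_tc_cmds ens2f0np0 ens2f0npf0sf88 00:00:00:00:88:88
```

Whether a given rule is offloaded to hardware depends on the driver; tc
reports this per filter with `tc -s filter show dev <dev> ingress`.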
The above answers how to forward the packet to the subfunction.
But how is it forwarded to the right rx queue out of multiple rx queues?
This is done by the RSS configuration set by the user and the number of channels configured via ethtool,
just like for VF and PF.
The driver defaults are similar to VF, and the user can change them via ethtool.
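
A hedged sketch of what that ethtool configuration could look like on
the SF's own netdev. The netdev name `eth_sf` is a placeholder, not a
name taken from this series, and which knobs are available depends on
the driver. The function prints the commands instead of running them.

```shell
#!/bin/sh
# Print (do not execute) ethtool commands that size the channel set of
# an SF netdev and spread RSS evenly across those channels.
sf_rx_queue_cmds() {
    dev=$1   # SF netdev name (placeholder in the example call)
    nch=$2   # number of combined channels to configure
    echo "ethtool -L $dev combined $nch"   # set the channel count
    echo "ethtool -X $dev equal $nch"      # RSS indirection over nch queues
    echo "ethtool -x $dev"                 # show the resulting indirection table
}

sf_rx_queue_cmds eth_sf 4
```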
* Re: [PATCH net-next 00/13] Add mlx5 subfunction support
  2020-11-25  6:00                     ` Parav Pandit
@ 2020-11-25 14:37                       ` David Ahern
  0 siblings, 0 replies; 57+ messages in thread
From: David Ahern @ 2020-11-25 14:37 UTC (permalink / raw)
  To: Parav Pandit, Saeed Mahameed, Jakub Kicinski, Jason Gunthorpe
  Cc: netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org, Jiri Pirko, dledford@redhat.com,
	Leon Romanovsky, davem@davemloft.net
On 11/24/20 11:00 PM, Parav Pandit wrote:
> Hi David,
> 
>> From: David Ahern <dsahern@gmail.com>
>> Sent: Wednesday, November 25, 2020 11:04 AM
>>
>> On 11/18/20 10:57 PM, Saeed Mahameed wrote:
>>>
>>> We are not slicing up any queues, from our HW and FW perspective SF ==
>>> VF literally, a full blown HW slice (Function), with isolated control
>>> and data plane of its own, this is very different from VMDq and more
>>> generic and secure. an SF device is exactly like a VF, doesn't steal
>>> or share any HW resources or control/data path with others. SF is
>>> basically SRIOV done right.
>>
>> What does that mean with respect to mac filtering and ntuple rules?
>>
>> Also, Tx is fairly easy to imagine, but how does hardware know how to direct
>> packets for the Rx path? As an example, consider 2 VMs or containers with the
>> same destination ip both using subfunction devices.
> Since both VM/containers are having same IP, it is better to place them in different L2 domains via vlan, vxlan etc.
ok, so relying on <vlan, dmac> pairs.
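
To make the <vlan, dmac> idea concrete, here is a sketch of two rx
rules that differ only in VLAN ID, MAC and target representor, so two
SFs can carry the same destination IP. The VLAN IDs, the second MAC and
the second representor name (ens2f0npf0sf99) are made up for
illustration, and whether the `vlan pop` action is offloaded is a
driver detail not confirmed in this thread. The function only prints
the commands.

```shell
#!/bin/sh
# Print (do not execute) a tc rule that matches on VLAN ID plus
# destination MAC, strips the tag, and redirects to an SF representor.
sf_vlan_rx_cmd() {
    uplink=$1  # uplink representor netdev
    vid=$2     # VLAN ID that separates the L2 domains
    mac=$3     # destination MAC of the SF
    rep=$4     # SF representor netdev
    echo "tc filter add dev $uplink parent ffff: protocol 802.1Q prio 1 flower vlan_id $vid dst_mac $mac action vlan pop action mirred egress redirect dev $rep"
}

# Two SFs that may use the same IP, kept apart by VLAN (values illustrative).
sf_vlan_rx_cmd ens2f0np0 100 00:00:00:00:88:88 ens2f0npf0sf88
sf_vlan_rx_cmd ens2f0np0 200 00:00:00:00:99:99 ens2f0npf0sf99
```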
> 
>> How does the nic know how to direct the ingress flows to the right queues for
>> the subfunction?
>>
> Rx steering occurs through tc filters via representor netdev of SF.
> Exactly same way as VF representor netdev operation.
> 
> When devlink eswitch port is created as shown in example in cover letter, and also in patch-12, it creates the representor netdevice.
> Below is the snippet of it.
> 
> Add a devlink port of subfunction flavour:
> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> 
> Configure mac address of the port function:
> $ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
>                                                 ^^^^^^^^^^^^^^
> This is the representor netdevice. It is created by port add command.
> This name is setup by systemd/udev v245 and higher by utilizing the existing phys_port_name infrastructure already exists for PF and VF representors.
hardware ensures only packets with that dmac are sent to the subfunction
device.
> 
> Now user can add unicast rx tc rule for example,
> 
> $ tc filter add dev ens2f0np0 parent ffff: prio 1 flower dst_mac 00:00:00:00:88:88 action mirred egress redirect dev ens2f0npf0sf88
> 
> I didn't cover this tc example in cover letter, to keep it short.
> But I had a one line description as below in the 'detail' section of cover-letter.
> Hope it helps.
> 
> - A SF supports eswitch representation and tc offload support similar
>   to existing PF and VF representors.
> 
> Now above portion answers, how to forward the packet to subfunction.
> But how to forward to the right rx queue out of multiple rxqueues?
> This is done by the rss configuration done by the user, number of channels from ethtool.
> Just like VF and PF.
> The driver defaults are similar to VF, which user can change via ethtool.
> 
so users can add flow steering or drop rules to SF devices.
thanks,
* Re: [PATCH net-next 07/13] net/mlx5: SF, Add auxiliary device support
  2020-11-12 19:24 ` [PATCH net-next 07/13] net/mlx5: SF, Add auxiliary device support Parav Pandit
@ 2020-12-07  2:48   ` David Ahern
  2020-12-07  4:53     ` Parav Pandit
  0 siblings, 1 reply; 57+ messages in thread
From: David Ahern @ 2020-12-07  2:48 UTC (permalink / raw)
  To: Parav Pandit, netdev, linux-rdma, gregkh
  Cc: jiri, jgg, dledford, leonro, saeedm, kuba, davem, Vu Pham
On 11/12/20 12:24 PM, Parav Pandit wrote:
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> index 485478979b1a..10dfaf671c90 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> @@ -202,3 +202,12 @@ config MLX5_SW_STEERING
>  	default y
>  	help
>  	Build support for software-managed steering in the NIC.
> +
> +config MLX5_SF
> +	bool "Mellanox Technologies subfunction device support using auxiliary device"
> +	depends on MLX5_CORE && MLX5_CORE_EN
> +	default n
> +	help
> +	Build support for subfuction device in the NIC. A Mellanox subfunction
> +	device can support RDMA, netdevice and vdpa device.
> +	It is similar to a SRIOV VF but it doesn't require SRIOV support.
per Dan's comment about AUXILIARY_BUS being select only, should this
config select AUXILIARY_BUS?
* RE: [PATCH net-next 07/13] net/mlx5: SF, Add auxiliary device support
  2020-12-07  2:48   ` David Ahern
@ 2020-12-07  4:53     ` Parav Pandit
  0 siblings, 0 replies; 57+ messages in thread
From: Parav Pandit @ 2020-12-07  4:53 UTC (permalink / raw)
  To: David Ahern, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	gregkh@linuxfoundation.org
  Cc: Jiri Pirko, Jason Gunthorpe, dledford@redhat.com, Leon Romanovsky,
	Saeed Mahameed, kuba@kernel.org, davem@davemloft.net, Vu Pham
> From: David Ahern <dsahern@gmail.com>
> Sent: Monday, December 7, 2020 8:19 AM
> 
> On 11/12/20 12:24 PM, Parav Pandit wrote:
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > index 485478979b1a..10dfaf671c90 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > @@ -202,3 +202,12 @@ config MLX5_SW_STEERING
> >  	default y
> >  	help
> >  	Build support for software-managed steering in the NIC.
> > +
> > +config MLX5_SF
> > +	bool "Mellanox Technologies subfunction device support using auxiliary
> device"
> > +	depends on MLX5_CORE && MLX5_CORE_EN
> > +	default n
> > +	help
> > +	Build support for subfuction device in the NIC. A Mellanox subfunction
> > +	device can support RDMA, netdevice and vdpa device.
> > +	It is similar to a SRIOV VF but it doesn't require SRIOV support.
> 
> per Dan's comment about AUXILIARY_BUS being select only, should this config
> select AUXILIARY_BUS?
Yes.
However, my patchset depends on patchset [1].
With the introduction of patchset [2], MLX5_CORE depends on AUXILIARY_BUS. 
MLX5_SF depends on MLX5_CORE.
So I omitted explicitly selecting AUXILIARY_BUS in MLX5_SF.
[1] https://lore.kernel.org/linux-rdma/20201204182952.72263-1-saeedm@nvidia.com/
[2] https://lore.kernel.org/alsa-devel/20201026111849.1035786-6-leon@kernel.org/
Thread overview: 57+ messages
2020-11-12 19:24 [PATCH net-next 00/13] Add mlx5 subfunction support Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 01/13] devlink: Prepare code to fill multiple port function attributes Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 02/13] devlink: Introduce PCI SF port flavour and port attribute Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 03/13] devlink: Support add and delete devlink port Parav Pandit
2020-11-18 16:21   ` David Ahern
2020-11-18 17:02     ` Parav Pandit
2020-11-18 18:03       ` David Ahern
2020-11-18 18:38         ` Jason Gunthorpe
2020-11-18 19:36           ` David Ahern
2020-11-18 20:42             ` Jason Gunthorpe
2020-11-18 19:22         ` Parav Pandit
2020-11-19  0:41           ` Jacob Keller
2020-11-19  1:17             ` David Ahern
2020-11-19  1:56               ` Samudrala, Sridhar
2020-11-19  0:52       ` Jacob Keller
2020-11-12 19:24 ` [PATCH net-next 04/13] devlink: Support get and set state of port function Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 05/13] devlink: Avoid global devlink mutex, use per instance reload lock Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 06/13] devlink: Introduce devlink refcount to reduce scope of global devlink_mutex Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 07/13] net/mlx5: SF, Add auxiliary device support Parav Pandit
2020-12-07  2:48   ` David Ahern
2020-12-07  4:53     ` Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 08/13] net/mlx5: SF, Add auxiliary device driver Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 09/13] net/mlx5: E-switch, Prepare eswitch to handle SF vport Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 10/13] net/mlx5: E-switch, Add eswitch helpers for " Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 11/13] net/mlx5: SF, Add SF configuration hardware commands Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 12/13] net/mlx5: SF, Add port add delete functionality Parav Pandit
2020-11-12 19:24 ` [PATCH net-next 13/13] net/mlx5: SF, Port function state change support Parav Pandit
2020-11-16 22:52 ` [PATCH net-next 00/13] Add mlx5 subfunction support Jakub Kicinski
2020-11-17  0:06   ` Saeed Mahameed
2020-11-17  1:58     ` Jakub Kicinski
2020-11-17  4:08       ` Parav Pandit
2020-11-17 17:11         ` Jakub Kicinski
2020-11-17 18:49           ` Jason Gunthorpe
2020-11-19  2:14             ` Jakub Kicinski
2020-11-19  4:35               ` David Ahern
2020-11-19  5:57                 ` Saeed Mahameed
2020-11-20  1:31                   ` Jakub Kicinski
2020-11-25  5:33                   ` David Ahern
2020-11-25  6:00                     ` Parav Pandit
2020-11-25 14:37                       ` David Ahern
2020-11-20  1:29                 ` Jakub Kicinski
2020-11-20 17:58                   ` Alexander Duyck
2020-11-20 19:04                     ` Samudrala, Sridhar
2020-11-23 21:51                       ` Saeed Mahameed
2020-11-24  7:01                       ` Jason Wang
2020-11-24  7:05                         ` Jason Wang
2020-11-19  6:12               ` Saeed Mahameed
2020-11-19  8:25                 ` Parav Pandit
2020-11-20  1:35                 ` Jakub Kicinski
2020-11-20  3:34                   ` Parav Pandit
2020-11-17 18:50           ` Parav Pandit
2020-11-19  2:23             ` Jakub Kicinski
2020-11-19  6:22               ` Saeed Mahameed
2020-11-19 14:00                 ` Jason Gunthorpe
2020-11-20  3:35                   ` Jakub Kicinski
2020-11-20  3:50                     ` Parav Pandit
2020-11-20 16:16                     ` Jason Gunthorpe