* [PATCH rdma-next v3 0/7] Support RDMA events monitoring through
@ 2024-09-09 17:30 Michael Guralnik
2024-09-09 17:30 ` [PATCH v3 rdma-next 1/7] RDMA/mlx5: Check RoCE LAG status before getting netdev Michael Guralnik
` (7 more replies)
0 siblings, 8 replies; 14+ messages in thread
From: Michael Guralnik @ 2024-09-09 17:30 UTC (permalink / raw)
To: jgg
Cc: linux-rdma, leonro, mbloch, cmeiohas, msanalla, dsahern,
Michael Guralnik
This series consists of multiple parts that collectively offer a method
to monitor RDMA events from userspace.
Using netlink, users will be able to monitor their IB device events and
changes, such as device registration, device unregistration, and netdev
attachment.
The first 2 patches contain fixes in mlx5 lag code that are required for
accurate event reporting in case of a lag bond.
Patch #3 initializes phys_port_cnt early in device probe to allow the
IB-to-netdev mapping API to work properly.
Patches #4,#5 modify and export IB-to-netdev mapping API, making it accessible
to all vendors who wish to rely on it for associating their IB device with
a netdevice.
Patches #6,#7 add the netlink support for reporting IB device events to
userspace.
Changes in v3:
- Fix lockdep warning in ib_device_get_netdev by dropping
optimization part from it
- Extend event info to include device names
- Instead of disregarding unknown events, report them to
userspace
- Remove fill_mon_register and replace with fill_nldev_handle to
fill netlink messages for register events
Changes in v2:
- Fix compilation issues with forward declaration of ib_device
- Add missing setting of return code in error flow
Chiara Meiohas (5):
RDMA/mlx5: Initialize phys_port_cnt earlier in RDMA device creation
RDMA/device: Remove optimization in ib_device_get_netdev()
RDMA/mlx5: Use IB set_netdev and get_netdev functions
RDMA/nldev: Add support for RDMA monitoring
RDMA/nldev: Expose whether RDMA monitoring is supported
Mark Bloch (2):
RDMA/mlx5: Check RoCE LAG status before getting netdev
RDMA/mlx5: Obtain upper net device only when needed
drivers/infiniband/core/device.c | 51 ++++-
drivers/infiniband/core/netlink.c | 1 +
drivers/infiniband/core/nldev.c | 130 ++++++++++++
drivers/infiniband/hw/mlx5/ib_rep.c | 22 +-
drivers/infiniband/hw/mlx5/main.c | 197 +++++++++++++-----
drivers/infiniband/hw/mlx5/mlx5_ib.h | 3 +-
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 76 +++----
include/linux/mlx5/device.h | 1 +
include/linux/mlx5/driver.h | 2 +-
include/rdma/ib_verbs.h | 2 +
include/rdma/rdma_netlink.h | 12 ++
include/uapi/rdma/rdma_netlink.h | 16 ++
12 files changed, 399 insertions(+), 114 deletions(-)
--
2.17.2
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v3 rdma-next 1/7] RDMA/mlx5: Check RoCE LAG status before getting netdev
2024-09-09 17:30 [PATCH rdma-next v3 0/7] Support RDMA events monitoring through Michael Guralnik
@ 2024-09-09 17:30 ` Michael Guralnik
2024-09-10 3:58 ` Kalesh Anakkur Purayil
2024-09-09 17:30 ` [PATCH v3 rdma-next 2/7] RDMA/mlx5: Obtain upper net device only when needed Michael Guralnik
` (6 subsequent siblings)
7 siblings, 1 reply; 14+ messages in thread
From: Michael Guralnik @ 2024-09-09 17:30 UTC (permalink / raw)
To: jgg
Cc: linux-rdma, leonro, mbloch, cmeiohas, msanalla, dsahern,
Michael Guralnik
From: Mark Bloch <mbloch@nvidia.com>
Check whether RoCE LAG is active before calling the LAG layer for the
netdev. This makes the LAG-active check explicit at the call site. No
behavior changes with this patch.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/main.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b85ad3c0bfa1..cdf1ce0f6b34 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -198,12 +198,18 @@ static int mlx5_netdev_event(struct notifier_block *this,
case NETDEV_CHANGE:
case NETDEV_UP:
case NETDEV_DOWN: {
- struct net_device *lag_ndev = mlx5_lag_get_roce_netdev(mdev);
struct net_device *upper = NULL;
- if (lag_ndev) {
- upper = netdev_master_upper_dev_get(lag_ndev);
- dev_put(lag_ndev);
+ if (mlx5_lag_is_roce(mdev)) {
+ struct net_device *lag_ndev;
+
+ lag_ndev = mlx5_lag_get_roce_netdev(mdev);
+ if (lag_ndev) {
+ upper = netdev_master_upper_dev_get(lag_ndev);
+ dev_put(lag_ndev);
+ } else {
+ goto done;
+ }
}
if (ibdev->is_rep)
@@ -257,9 +263,10 @@ static struct net_device *mlx5_ib_get_netdev(struct ib_device *device,
if (!mdev)
return NULL;
- ndev = mlx5_lag_get_roce_netdev(mdev);
- if (ndev)
+ if (mlx5_lag_is_roce(mdev)) {
+ ndev = mlx5_lag_get_roce_netdev(mdev);
goto out;
+ }
/* Ensure ndev does not disappear before we invoke dev_hold()
*/
--
2.17.2
* [PATCH v3 rdma-next 2/7] RDMA/mlx5: Obtain upper net device only when needed
2024-09-09 17:30 [PATCH rdma-next v3 0/7] Support RDMA events monitoring through Michael Guralnik
2024-09-09 17:30 ` [PATCH v3 rdma-next 1/7] RDMA/mlx5: Check RoCE LAG status before getting netdev Michael Guralnik
@ 2024-09-09 17:30 ` Michael Guralnik
2024-09-10 3:59 ` Kalesh Anakkur Purayil
2024-09-09 17:30 ` [PATCH v3 rdma-next 3/7] RDMA/mlx5: Initialize phys_port_cnt earlier in RDMA device creation Michael Guralnik
` (5 subsequent siblings)
7 siblings, 1 reply; 14+ messages in thread
From: Michael Guralnik @ 2024-09-09 17:30 UTC (permalink / raw)
To: jgg
Cc: linux-rdma, leonro, mbloch, cmeiohas, msanalla, dsahern,
Michael Guralnik
From: Mark Bloch <mbloch@nvidia.com>
Report the upper device's state as the RDMA port state only in RoCE LAG or
switchdev LAG.
Fixes: 27f9e0ccb6da ("net/mlx5: Lag, Add single RDMA device in multiport mode")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index cdf1ce0f6b34..c75cc3d14e74 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -558,7 +558,7 @@ static int mlx5_query_port_roce(struct ib_device *device, u32 port_num,
if (!ndev)
goto out;
- if (dev->lag_active) {
+ if (mlx5_lag_is_roce(mdev) || mlx5_lag_is_sriov(mdev)) {
rcu_read_lock();
upper = netdev_master_upper_dev_get_rcu(ndev);
if (upper) {
--
2.17.2
* [PATCH v3 rdma-next 3/7] RDMA/mlx5: Initialize phys_port_cnt earlier in RDMA device creation
2024-09-09 17:30 [PATCH rdma-next v3 0/7] Support RDMA events monitoring through Michael Guralnik
2024-09-09 17:30 ` [PATCH v3 rdma-next 1/7] RDMA/mlx5: Check RoCE LAG status before getting netdev Michael Guralnik
2024-09-09 17:30 ` [PATCH v3 rdma-next 2/7] RDMA/mlx5: Obtain upper net device only when needed Michael Guralnik
@ 2024-09-09 17:30 ` Michael Guralnik
2024-09-09 17:30 ` [PATCH v3 rdma-next 4/7] RDMA/device: Remove optimization in ib_device_get_netdev() Michael Guralnik
` (4 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Michael Guralnik @ 2024-09-09 17:30 UTC (permalink / raw)
To: jgg
Cc: linux-rdma, leonro, mbloch, cmeiohas, msanalla, dsahern,
Michael Guralnik
From: Chiara Meiohas <cmeiohas@nvidia.com>
phys_port_cnt of the IB device must be initialized before calling
ib_device_set_netdev().
Previously, phys_port_cnt was initialized in the mlx5_ib init function.
Remove this initialization to allow setting it separately, providing
the flexibility to call ib_device_set_netdev before registering the
IB device.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/ib_rep.c | 1 +
drivers/infiniband/hw/mlx5/main.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mlx5/ib_rep.c b/drivers/infiniband/hw/mlx5/ib_rep.c
index c7a4ee896121..1ad934685d80 100644
--- a/drivers/infiniband/hw/mlx5/ib_rep.c
+++ b/drivers/infiniband/hw/mlx5/ib_rep.c
@@ -104,6 +104,7 @@ mlx5_ib_vport_rep_load(struct mlx5_core_dev *dev, struct mlx5_eswitch_rep *rep)
ibdev->is_rep = true;
vport_index = rep->vport_index;
ibdev->port[vport_index].rep = rep;
+ ibdev->ib_dev.phys_port_cnt = num_ports;
ibdev->port[vport_index].roce.netdev =
mlx5_ib_get_rep_netdev(lag_master->priv.eswitch, rep->vport);
ibdev->mdev = lag_master;
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index c75cc3d14e74..1046c92212c7 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -3932,7 +3932,6 @@ static int mlx5_ib_stage_init_init(struct mlx5_ib_dev *dev)
dev->ib_dev.node_type = RDMA_NODE_IB_CA;
dev->ib_dev.local_dma_lkey = 0 /* not supported for now */;
- dev->ib_dev.phys_port_cnt = dev->num_ports;
dev->ib_dev.dev.parent = mdev->device;
dev->ib_dev.lag_flags = RDMA_LAG_FLAGS_HASH_ALL_SLAVES;
@@ -4647,6 +4646,7 @@ static struct ib_device *mlx5_ib_add_sub_dev(struct ib_device *parent,
mplane->mdev = mparent->mdev;
mplane->num_ports = mparent->num_plane;
mplane->sub_dev_name = name;
+ mplane->ib_dev.phys_port_cnt = mplane->num_ports;
ret = __mlx5_ib_add(mplane, &plane_profile);
if (ret)
@@ -4763,6 +4763,7 @@ static int mlx5r_probe(struct auxiliary_device *adev,
dev->mdev = mdev;
dev->num_ports = num_ports;
+ dev->ib_dev.phys_port_cnt = num_ports;
if (ll == IB_LINK_LAYER_ETHERNET && !mlx5_get_roce_state(mdev))
profile = &raw_eth_profile;
--
2.17.2
* [PATCH v3 rdma-next 4/7] RDMA/device: Remove optimization in ib_device_get_netdev()
2024-09-09 17:30 [PATCH rdma-next v3 0/7] Support RDMA events monitoring through Michael Guralnik
` (2 preceding siblings ...)
2024-09-09 17:30 ` [PATCH v3 rdma-next 3/7] RDMA/mlx5: Initialize phys_port_cnt earlier in RDMA device creation Michael Guralnik
@ 2024-09-09 17:30 ` Michael Guralnik
2024-09-10 4:00 ` Kalesh Anakkur Purayil
2024-09-09 17:30 ` [PATCH v3 rdma-next 5/7] RDMA/mlx5: Use IB set_netdev and get_netdev functions Michael Guralnik
` (3 subsequent siblings)
7 siblings, 1 reply; 14+ messages in thread
From: Michael Guralnik @ 2024-09-09 17:30 UTC (permalink / raw)
To: jgg
Cc: linux-rdma, leonro, mbloch, cmeiohas, msanalla, dsahern,
Michael Guralnik
From: Chiara Meiohas <cmeiohas@nvidia.com>
The caller of ib_device_get_netdev() relies on its result to accurately
match a given netdev with the netdev associated with the IB device.
ib_device_get_netdev() returns NULL when the IB device's associated
netdev is unregistering, preventing the caller from matching netdevs
properly.
Thus, remove this optimization and return the netdev even if it is
undergoing unregistration, allowing matching by the caller.
This change ensures proper netdev matching and reference-count handling
by callers of the ib_device_get_netdev()/ib_device_set_netdev() API.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/device.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 0290aca18d26..b1377503cb9d 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2252,15 +2252,6 @@ struct net_device *ib_device_get_netdev(struct ib_device *ib_dev,
spin_unlock(&pdata->netdev_lock);
}
- /*
- * If we are starting to unregister expedite things by preventing
- * propagation of an unregistering netdev.
- */
- if (res && res->reg_state != NETREG_REGISTERED) {
- dev_put(res);
- return NULL;
- }
-
return res;
}
--
2.17.2
* [PATCH v3 rdma-next 5/7] RDMA/mlx5: Use IB set_netdev and get_netdev functions
2024-09-09 17:30 [PATCH rdma-next v3 0/7] Support RDMA events monitoring through Michael Guralnik
` (3 preceding siblings ...)
2024-09-09 17:30 ` [PATCH v3 rdma-next 4/7] RDMA/device: Remove optimization in ib_device_get_netdev() Michael Guralnik
@ 2024-09-09 17:30 ` Michael Guralnik
2024-09-09 17:30 ` [PATCH v3 rdma-next 6/7] RDMA/nldev: Add support for RDMA monitoring Michael Guralnik
` (2 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Michael Guralnik @ 2024-09-09 17:30 UTC (permalink / raw)
To: jgg
Cc: linux-rdma, leonro, mbloch, cmeiohas, msanalla, dsahern,
Michael Guralnik
From: Chiara Meiohas <cmeiohas@nvidia.com>
The IB layer provides a common interface to store and get net
devices associated with an IB device port (ib_device_set_netdev()
and ib_device_get_netdev()).
Previously, mlx5_ib stored and managed the associated net devices
internally.
Replace internal net device management in mlx5_ib with
ib_device_set_netdev() when attaching/detaching a net device and
ib_device_get_netdev() when retrieving the net device.
Export ib_device_get_netdev().
For mlx5 representors/PFs/VFs and LAG creation, we replace the netdev
assignments with the IB set/get netdev functions.
In active-backup LAG mode, the active slave net device is stored in the
LAG itself. To ensure the net device stored in a LAG bond's IB device is
the active slave, we implement the following:
- mlx5_core: when modifying the slave of a bond we send the internal driver event
MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE.
- mlx5_ib: when catching the event call ib_device_set_netdev()
This patch also ensures the correct IB events are sent in switchdev LAG.
While at it, note that in multiport eswitch mode only a single IB device
is created for all ports. This IB device receives all netdev events of
its VFs once loaded, so to avoid overwriting the mapping of the PF IB
device to the PF netdev, ignore NETDEV_REGISTER events if the IB device
has already been mapped to a netdev.
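The active-backup flow described above relies on the kernel's blocking
notifier chains (blocking_notifier_chain_register() and friends). As a
loose, userspace-only miniature of that pattern, purely for illustration
and not part of this series (all names below are made up):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace imitation of the kernel's blocking-notifier pattern; the real
 * API lives in the kernel and is only sketched here. */
#define NOTIFY_DONE 0
#define NOTIFY_OK   1

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb, unsigned long event,
			     void *data);
	struct notifier_block *next;
};

struct notifier_head {
	struct notifier_block *head;
};

/* Add a listener to the front of the chain. */
static void notifier_register(struct notifier_head *nh,
			      struct notifier_block *nb)
{
	nb->next = nh->head;
	nh->head = nb;
}

/* Invoke every registered listener with the event id and payload. */
static int notifier_call_chain(struct notifier_head *nh, unsigned long event,
			       void *data)
{
	struct notifier_block *nb;
	int ret = NOTIFY_DONE;

	for (nb = nh->head; nb; nb = nb->next)
		ret = nb->notifier_call(nb, event, data);
	return ret;
}

/* Demo listener standing in for mlx5_ib's lag event handler: it just
 * remembers the last event id it saw. */
static unsigned long last_event;
static int demo_cb(struct notifier_block *nb, unsigned long event, void *data)
{
	(void)nb;
	(void)data;
	last_event = event;
	return NOTIFY_OK;
}
```

In the patch itself, mlx5_eth_lag_init() registers the mlx5_ib listener on
dev->mdev->priv.lag_nh, and mlx5_modify_lag() fires the chain with the new
active slave netdev as the data argument.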
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/device.c | 4 +
drivers/infiniband/hw/mlx5/ib_rep.c | 23 +--
drivers/infiniband/hw/mlx5/main.c | 183 ++++++++++++------
drivers/infiniband/hw/mlx5/mlx5_ib.h | 3 +-
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 76 ++++----
include/linux/mlx5/device.h | 1 +
include/linux/mlx5/driver.h | 2 +-
include/rdma/ib_verbs.h | 2 +
8 files changed, 191 insertions(+), 103 deletions(-)
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index b1377503cb9d..9e765c79a892 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2236,6 +2236,9 @@ struct net_device *ib_device_get_netdev(struct ib_device *ib_dev,
if (!rdma_is_port_valid(ib_dev, port))
return NULL;
+ if (!ib_dev->port_data)
+ return NULL;
+
pdata = &ib_dev->port_data[port];
/*
@@ -2254,6 +2257,7 @@ struct net_device *ib_device_get_netdev(struct ib_device *ib_dev,
return res;
}
+EXPORT_SYMBOL(ib_device_get_netdev);
/**
* ib_device_get_by_netdev - Find an IB device associated with a netdev
diff --git a/drivers/infiniband/hw/mlx5/ib_rep.c b/drivers/infiniband/hw/mlx5/ib_rep.c
index 1ad934685d80..49af1cfbe6d1 100644
--- a/drivers/infiniband/hw/mlx5/ib_rep.c
+++ b/drivers/infiniband/hw/mlx5/ib_rep.c
@@ -13,6 +13,7 @@ mlx5_ib_set_vport_rep(struct mlx5_core_dev *dev,
int vport_index)
{
struct mlx5_ib_dev *ibdev;
+ struct net_device *ndev;
ibdev = mlx5_eswitch_uplink_get_proto_dev(dev->priv.eswitch, REP_IB);
if (!ibdev)
@@ -20,12 +21,9 @@ mlx5_ib_set_vport_rep(struct mlx5_core_dev *dev,
ibdev->port[vport_index].rep = rep;
rep->rep_data[REP_IB].priv = ibdev;
- write_lock(&ibdev->port[vport_index].roce.netdev_lock);
- ibdev->port[vport_index].roce.netdev =
- mlx5_ib_get_rep_netdev(rep->esw, rep->vport);
- write_unlock(&ibdev->port[vport_index].roce.netdev_lock);
+ ndev = mlx5_ib_get_rep_netdev(rep->esw, rep->vport);
- return 0;
+ return ib_device_set_netdev(&ibdev->ib_dev, ndev, vport_index + 1);
}
static void mlx5_ib_register_peer_vport_reps(struct mlx5_core_dev *mdev);
@@ -104,11 +102,15 @@ mlx5_ib_vport_rep_load(struct mlx5_core_dev *dev, struct mlx5_eswitch_rep *rep)
ibdev->is_rep = true;
vport_index = rep->vport_index;
ibdev->port[vport_index].rep = rep;
- ibdev->ib_dev.phys_port_cnt = num_ports;
- ibdev->port[vport_index].roce.netdev =
- mlx5_ib_get_rep_netdev(lag_master->priv.eswitch, rep->vport);
ibdev->mdev = lag_master;
ibdev->num_ports = num_ports;
+ ibdev->ib_dev.phys_port_cnt = num_ports;
+ ret = ib_device_set_netdev(&ibdev->ib_dev,
+ mlx5_ib_get_rep_netdev(lag_master->priv.eswitch,
+ rep->vport),
+ vport_index + 1);
+ if (ret)
+ goto fail_add;
ret = __mlx5_ib_add(ibdev, profile);
if (ret)
@@ -161,9 +163,8 @@ mlx5_ib_vport_rep_unload(struct mlx5_eswitch_rep *rep)
}
port = &dev->port[vport_index];
- write_lock(&port->roce.netdev_lock);
- port->roce.netdev = NULL;
- write_unlock(&port->roce.netdev_lock);
+
+ ib_device_set_netdev(&dev->ib_dev, NULL, vport_index + 1);
rep->rep_data[REP_IB].priv = NULL;
port->rep = NULL;
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 1046c92212c7..a750f61513d4 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -147,16 +147,52 @@ static struct mlx5_roce *mlx5_get_rep_roce(struct mlx5_ib_dev *dev,
if (upper && port->rep->vport == MLX5_VPORT_UPLINK)
continue;
-
- read_lock(&port->roce.netdev_lock);
- rep_ndev = mlx5_ib_get_rep_netdev(port->rep->esw,
- port->rep->vport);
- if (rep_ndev == ndev) {
- read_unlock(&port->roce.netdev_lock);
+ rep_ndev = ib_device_get_netdev(&dev->ib_dev, i + 1);
+ if (rep_ndev && rep_ndev == ndev) {
+ dev_put(rep_ndev);
*port_num = i + 1;
return &port->roce;
}
- read_unlock(&port->roce.netdev_lock);
+
+ dev_put(rep_ndev);
+ }
+
+ return NULL;
+}
+
+static bool mlx5_netdev_send_event(struct mlx5_ib_dev *dev,
+ struct net_device *ndev,
+ struct net_device *upper,
+ struct net_device *ib_ndev)
+{
+ if (!dev->ib_active)
+ return false;
+
+ /* Event is about our upper device */
+ if (upper == ndev)
+ return true;
+
+ /* RDMA device is not in lag and not in switchdev */
+ if (!dev->is_rep && !upper && ndev == ib_ndev)
+ return true;
+
+ /* RDMA device is in switchdev */
+ if (dev->is_rep && ndev == ib_ndev)
+ return true;
+
+ return false;
+}
+
+static struct net_device *mlx5_ib_get_rep_uplink_netdev(struct mlx5_ib_dev *ibdev)
+{
+ struct mlx5_ib_port *port;
+ int i;
+
+ for (i = 0; i < ibdev->num_ports; i++) {
+ port = &ibdev->port[i];
+ if (port->rep && port->rep->vport == MLX5_VPORT_UPLINK) {
+ return ib_device_get_netdev(&ibdev->ib_dev, i + 1);
+ }
}
return NULL;
@@ -168,6 +204,7 @@ static int mlx5_netdev_event(struct notifier_block *this,
struct mlx5_roce *roce = container_of(this, struct mlx5_roce, nb);
struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
u32 port_num = roce->native_port_num;
+ struct net_device *ib_ndev = NULL;
struct mlx5_core_dev *mdev;
struct mlx5_ib_dev *ibdev;
@@ -181,29 +218,38 @@ static int mlx5_netdev_event(struct notifier_block *this,
/* Should already be registered during the load */
if (ibdev->is_rep)
break;
- write_lock(&roce->netdev_lock);
+
+ ib_ndev = ib_device_get_netdev(&ibdev->ib_dev, port_num);
+ /* Exit if already registered */
+ if (ib_ndev)
+ goto put_ndev;
+
if (ndev->dev.parent == mdev->device)
- roce->netdev = ndev;
- write_unlock(&roce->netdev_lock);
+ ib_device_set_netdev(&ibdev->ib_dev, ndev, port_num);
break;
case NETDEV_UNREGISTER:
/* In case of reps, ib device goes away before the netdevs */
- write_lock(&roce->netdev_lock);
- if (roce->netdev == ndev)
- roce->netdev = NULL;
- write_unlock(&roce->netdev_lock);
- break;
+ if (ibdev->is_rep)
+ break;
+ ib_ndev = ib_device_get_netdev(&ibdev->ib_dev, port_num);
+ if (ib_ndev == ndev)
+ ib_device_set_netdev(&ibdev->ib_dev, NULL, port_num);
+ goto put_ndev;
case NETDEV_CHANGE:
case NETDEV_UP:
case NETDEV_DOWN: {
struct net_device *upper = NULL;
- if (mlx5_lag_is_roce(mdev)) {
+ if (mlx5_lag_is_roce(mdev) || mlx5_lag_is_sriov(mdev)) {
struct net_device *lag_ndev;
- lag_ndev = mlx5_lag_get_roce_netdev(mdev);
+ if (mlx5_lag_is_roce(mdev))
+ lag_ndev = ib_device_get_netdev(&ibdev->ib_dev, 1);
+ else /* sriov lag */
+ lag_ndev = mlx5_ib_get_rep_uplink_netdev(ibdev);
+
if (lag_ndev) {
upper = netdev_master_upper_dev_get(lag_ndev);
dev_put(lag_ndev);
@@ -216,18 +262,19 @@ static int mlx5_netdev_event(struct notifier_block *this,
roce = mlx5_get_rep_roce(ibdev, ndev, upper, &port_num);
if (!roce)
return NOTIFY_DONE;
- if ((upper == ndev ||
- ((!upper || ibdev->is_rep) && ndev == roce->netdev)) &&
- ibdev->ib_active) {
+
+ ib_ndev = ib_device_get_netdev(&ibdev->ib_dev, port_num);
+
+ if (mlx5_netdev_send_event(ibdev, ndev, upper, ib_ndev)) {
struct ib_event ibev = { };
enum ib_port_state port_state;
if (get_port_state(&ibdev->ib_dev, port_num,
&port_state))
- goto done;
+ goto put_ndev;
if (roce->last_port_state == port_state)
- goto done;
+ goto put_ndev;
roce->last_port_state = port_state;
ibev.device = &ibdev->ib_dev;
@@ -236,7 +283,7 @@ static int mlx5_netdev_event(struct notifier_block *this,
else if (port_state == IB_PORT_ACTIVE)
ibev.event = IB_EVENT_PORT_ACTIVE;
else
- goto done;
+ goto put_ndev;
ibev.element.port_num = port_num;
ib_dispatch_event(&ibev);
@@ -247,39 +294,13 @@ static int mlx5_netdev_event(struct notifier_block *this,
default:
break;
}
+put_ndev:
+ dev_put(ib_ndev);
done:
mlx5_ib_put_native_port_mdev(ibdev, port_num);
return NOTIFY_DONE;
}
-static struct net_device *mlx5_ib_get_netdev(struct ib_device *device,
- u32 port_num)
-{
- struct mlx5_ib_dev *ibdev = to_mdev(device);
- struct net_device *ndev;
- struct mlx5_core_dev *mdev;
-
- mdev = mlx5_ib_get_native_port_mdev(ibdev, port_num, NULL);
- if (!mdev)
- return NULL;
-
- if (mlx5_lag_is_roce(mdev)) {
- ndev = mlx5_lag_get_roce_netdev(mdev);
- goto out;
- }
-
- /* Ensure ndev does not disappear before we invoke dev_hold()
- */
- read_lock(&ibdev->port[port_num - 1].roce.netdev_lock);
- ndev = ibdev->port[port_num - 1].roce.netdev;
- dev_hold(ndev);
- read_unlock(&ibdev->port[port_num - 1].roce.netdev_lock);
-
-out:
- mlx5_ib_put_native_port_mdev(ibdev, port_num);
- return ndev;
-}
-
struct mlx5_core_dev *mlx5_ib_get_native_port_mdev(struct mlx5_ib_dev *ibdev,
u32 ib_port_num,
u32 *native_port_num)
@@ -554,7 +575,7 @@ static int mlx5_query_port_roce(struct ib_device *device, u32 port_num,
if (!put_mdev)
goto out;
- ndev = mlx5_ib_get_netdev(device, port_num);
+ ndev = ib_device_get_netdev(device, port_num);
if (!ndev)
goto out;
@@ -3185,6 +3206,60 @@ static void get_dev_fw_str(struct ib_device *ibdev, char *str)
fw_rev_sub(dev->mdev));
}
+static int lag_event(struct notifier_block *nb, unsigned long event, void *data)
+{
+ struct mlx5_ib_dev *dev = container_of(nb, struct mlx5_ib_dev,
+ lag_events);
+ struct mlx5_core_dev *mdev = dev->mdev;
+ struct mlx5_ib_port *port;
+ struct net_device *ndev;
+ int i, err;
+ int portnum;
+
+ portnum = 0;
+ switch (event) {
+ case MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE:
+ ndev = data;
+ if (ndev) {
+ if (!mlx5_lag_is_roce(mdev)) {
/* sriov lag */
+ for (i = 0; i < dev->num_ports; i++) {
+ port = &dev->port[i];
+ if (port->rep && port->rep->vport ==
+ MLX5_VPORT_UPLINK) {
+ portnum = i;
+ break;
+ }
+ }
+ }
+ err = ib_device_set_netdev(&dev->ib_dev, ndev,
+ portnum + 1);
+ dev_put(ndev);
+ if (err)
+ return err;
+ /* Rescan gids after new netdev assignment */
+ rdma_roce_rescan_device(&dev->ib_dev);
+ }
+ break;
+ default:
+ return NOTIFY_DONE;
+ }
+ return NOTIFY_OK;
+}
+
+static void mlx5e_lag_event_register(struct mlx5_ib_dev *dev)
+{
+ dev->lag_events.notifier_call = lag_event;
+ blocking_notifier_chain_register(&dev->mdev->priv.lag_nh,
+ &dev->lag_events);
+}
+
+static void mlx5e_lag_event_unregister(struct mlx5_ib_dev *dev)
+{
+ blocking_notifier_chain_unregister(&dev->mdev->priv.lag_nh,
+ &dev->lag_events);
+}
+
static int mlx5_eth_lag_init(struct mlx5_ib_dev *dev)
{
struct mlx5_core_dev *mdev = dev->mdev;
@@ -3206,6 +3281,7 @@ static int mlx5_eth_lag_init(struct mlx5_ib_dev *dev)
goto err_destroy_vport_lag;
}
+ mlx5e_lag_event_register(dev);
dev->flow_db->lag_demux_ft = ft;
dev->lag_ports = mlx5_lag_get_num_ports(mdev);
dev->lag_active = true;
@@ -3223,6 +3299,7 @@ static void mlx5_eth_lag_cleanup(struct mlx5_ib_dev *dev)
if (dev->lag_active) {
dev->lag_active = false;
+ mlx5e_lag_event_unregister(dev);
mlx5_destroy_flow_table(dev->flow_db->lag_demux_ft);
dev->flow_db->lag_demux_ft = NULL;
@@ -3937,7 +4014,6 @@ static int mlx5_ib_stage_init_init(struct mlx5_ib_dev *dev)
for (i = 0; i < dev->num_ports; i++) {
spin_lock_init(&dev->port[i].mp.mpi_lock);
- rwlock_init(&dev->port[i].roce.netdev_lock);
dev->port[i].roce.dev = dev;
dev->port[i].roce.native_port_num = i + 1;
dev->port[i].roce.last_port_state = IB_PORT_DOWN;
@@ -4202,7 +4278,6 @@ static const struct ib_device_ops mlx5_ib_dev_common_roce_ops = {
.create_wq = mlx5_ib_create_wq,
.destroy_rwq_ind_table = mlx5_ib_destroy_rwq_ind_table,
.destroy_wq = mlx5_ib_destroy_wq,
- .get_netdev = mlx5_ib_get_netdev,
.modify_wq = mlx5_ib_modify_wq,
INIT_RDMA_OBJ_SIZE(ib_rwq_ind_table, mlx5_ib_rwq_ind_table,
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index c0b1a9cd752b..6af8b4df9a5c 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -896,8 +896,6 @@ struct mlx5_roce {
/* Protect mlx5_ib_get_netdev from invoking dev_hold() with a NULL
* netdev pointer
*/
- rwlock_t netdev_lock;
- struct net_device *netdev;
struct notifier_block nb;
struct netdev_net_notifier nn;
struct notifier_block mdev_nb;
@@ -1146,6 +1144,7 @@ struct mlx5_ib_dev {
/* protect accessing data_direct_dev */
struct mutex data_direct_lock;
struct notifier_block mdev_events;
+ struct notifier_block lag_events;
int num_ports;
/* serialize update of capability mask
*/
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
index cf8045b92689..8577db3308cc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
@@ -445,6 +445,34 @@ static int _mlx5_modify_lag(struct mlx5_lag *ldev, u8 *ports)
return mlx5_cmd_modify_lag(dev0, ldev->ports, ports);
}
+static struct net_device *mlx5_lag_active_backup_get_netdev(struct mlx5_core_dev *dev)
+{
+ struct net_device *ndev = NULL;
+ struct mlx5_lag *ldev;
+ unsigned long flags;
+ int i;
+
+ spin_lock_irqsave(&lag_lock, flags);
+ ldev = mlx5_lag_dev(dev);
+
+ if (!ldev)
+ goto unlock;
+
+ for (i = 0; i < ldev->ports; i++)
+ if (ldev->tracker.netdev_state[i].tx_enabled)
+ ndev = ldev->pf[i].netdev;
+ if (!ndev)
+ ndev = ldev->pf[ldev->ports - 1].netdev;
+
+ if (ndev)
+ dev_hold(ndev);
+
+unlock:
+ spin_unlock_irqrestore(&lag_lock, flags);
+
+ return ndev;
+}
+
void mlx5_modify_lag(struct mlx5_lag *ldev,
struct lag_tracker *tracker)
{
@@ -477,9 +505,18 @@ void mlx5_modify_lag(struct mlx5_lag *ldev,
}
}
- if (tracker->tx_type == NETDEV_LAG_TX_TYPE_ACTIVEBACKUP &&
- !(ldev->mode == MLX5_LAG_MODE_ROCE))
- mlx5_lag_drop_rule_setup(ldev, tracker);
+ if (tracker->tx_type == NETDEV_LAG_TX_TYPE_ACTIVEBACKUP) {
+ struct net_device *ndev = mlx5_lag_active_backup_get_netdev(dev0);
+
+ if (!(ldev->mode == MLX5_LAG_MODE_ROCE))
+ mlx5_lag_drop_rule_setup(ldev, tracker);
+ /* Only sriov and roce lag should have tracker->tx_type set so
+ * no need to check the mode
+ */
+ blocking_notifier_call_chain(&dev0->priv.lag_nh,
+ MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE,
+ ndev);
+ }
}
static int mlx5_lag_set_port_sel_mode_roce(struct mlx5_lag *ldev,
@@ -613,6 +650,7 @@ static int mlx5_create_lag(struct mlx5_lag *ldev,
mlx5_core_err(dev0,
"Failed to deactivate RoCE LAG; driver restart required\n");
}
+ BLOCKING_INIT_NOTIFIER_HEAD(&dev0->priv.lag_nh);
return err;
}
@@ -1492,38 +1530,6 @@ void mlx5_lag_enable_change(struct mlx5_core_dev *dev)
mlx5_queue_bond_work(ldev, 0);
}
-struct net_device *mlx5_lag_get_roce_netdev(struct mlx5_core_dev *dev)
-{
- struct net_device *ndev = NULL;
- struct mlx5_lag *ldev;
- unsigned long flags;
- int i;
-
- spin_lock_irqsave(&lag_lock, flags);
- ldev = mlx5_lag_dev(dev);
-
- if (!(ldev && __mlx5_lag_is_roce(ldev)))
- goto unlock;
-
- if (ldev->tracker.tx_type == NETDEV_LAG_TX_TYPE_ACTIVEBACKUP) {
- for (i = 0; i < ldev->ports; i++)
- if (ldev->tracker.netdev_state[i].tx_enabled)
- ndev = ldev->pf[i].netdev;
- if (!ndev)
- ndev = ldev->pf[ldev->ports - 1].netdev;
- } else {
- ndev = ldev->pf[MLX5_LAG_P1].netdev;
- }
- if (ndev)
- dev_hold(ndev);
-
-unlock:
- spin_unlock_irqrestore(&lag_lock, flags);
-
- return ndev;
-}
-EXPORT_SYMBOL(mlx5_lag_get_roce_netdev);
-
u8 mlx5_lag_get_slave_port(struct mlx5_core_dev *dev,
struct net_device *slave)
{
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index ba875a619b97..a38a17f8c12c 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -370,6 +370,7 @@ enum mlx5_driver_event {
MLX5_DRIVER_EVENT_SF_PEER_DEVLINK,
MLX5_DRIVER_EVENT_AFFILIATION_DONE,
MLX5_DRIVER_EVENT_AFFILIATION_REMOVED,
+ MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE,
};
enum {
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index a96438ded15f..46a7a3d11048 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -643,6 +643,7 @@ struct mlx5_priv {
struct mlx5_sf_hw_table *sf_hw_table;
struct mlx5_sf_table *sf_table;
#endif
+ struct blocking_notifier_head lag_nh;
};
enum mlx5_device_state {
@@ -1181,7 +1182,6 @@ bool mlx5_lag_mode_is_hash(struct mlx5_core_dev *dev);
bool mlx5_lag_is_master(struct mlx5_core_dev *dev);
bool mlx5_lag_is_shared_fdb(struct mlx5_core_dev *dev);
bool mlx5_lag_is_mpesw(struct mlx5_core_dev *dev);
-struct net_device *mlx5_lag_get_roce_netdev(struct mlx5_core_dev *dev);
u8 mlx5_lag_get_slave_port(struct mlx5_core_dev *dev,
struct net_device *slave);
int mlx5_lag_query_cong_counters(struct mlx5_core_dev *dev,
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index a1dcf812d787..aa8ede439905 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4453,6 +4453,8 @@ struct net_device *ib_get_net_dev_by_params(struct ib_device *dev, u32 port,
const struct sockaddr *addr);
int ib_device_set_netdev(struct ib_device *ib_dev, struct net_device *ndev,
unsigned int port);
+struct net_device *ib_device_get_netdev(struct ib_device *ib_dev,
+ u32 port);
struct ib_wq *ib_create_wq(struct ib_pd *pd,
struct ib_wq_init_attr *init_attr);
int ib_destroy_wq_user(struct ib_wq *wq, struct ib_udata *udata);
--
2.17.2
* [PATCH v3 rdma-next 6/7] RDMA/nldev: Add support for RDMA monitoring
2024-09-09 17:30 [PATCH rdma-next v3 0/7] Support RDMA events monitoring through Michael Guralnik
` (4 preceding siblings ...)
2024-09-09 17:30 ` [PATCH v3 rdma-next 5/7] RDMA/mlx5: Use IB set_netdev and get_netdev functions Michael Guralnik
@ 2024-09-09 17:30 ` Michael Guralnik
2024-09-09 18:05 ` Leon Romanovsky
2024-09-10 11:09 ` Leon Romanovsky
2024-09-09 17:30 ` [PATCH v3 rdma-next 7/7] RDMA/nldev: Expose whether RDMA monitoring is supported Michael Guralnik
2024-09-11 13:30 ` [PATCH rdma-next v3 0/7] Support RDMA events monitoring through Leon Romanovsky
7 siblings, 2 replies; 14+ messages in thread
From: Michael Guralnik @ 2024-09-09 17:30 UTC (permalink / raw)
To: jgg
Cc: linux-rdma, leonro, mbloch, cmeiohas, msanalla, dsahern,
Michael Guralnik
From: Chiara Meiohas <cmeiohas@nvidia.com>
Introduce a new netlink command to allow RDMA event monitoring.
The RDMA events currently supported are IB device
registration/unregistration and net device attachment/detachment.
Example output of rdma monitor and the commands which trigger
the events:
$ rdma monitor
$ rmmod mlx5_ib
[UNREGISTER] ibdev_idx 1 ibdev rocep8s0f1
[UNREGISTER] ibdev_idx 0 ibdev rocep8s0f0
$ modprobe mlx5_ib
[REGISTER] ibdev_idx 2 ibdev mlx5_0
[NETDEV_ATTACH] ibdev_idx 2 ibdev mlx5_0 port 1 netdev_idx 4 netdev eth2
[REGISTER] ibdev_idx 3 ibdev mlx5_1
[NETDEV_ATTACH] ibdev_idx 3 ibdev mlx5_1 port 1 netdev_idx 5 netdev eth3
$ devlink dev eswitch set pci/0000:08:00.0 mode switchdev
[UNREGISTER] ibdev_idx 2 ibdev rocep8s0f0
[REGISTER] ibdev_idx 4 ibdev mlx5_0
[NETDEV_ATTACH] ibdev_idx 4 ibdev mlx5_0 port 30 netdev_idx 4 netdev eth2
$ echo 4 > /sys/class/net/eth2/device/sriov_numvfs
[NETDEV_ATTACH] ibdev_idx 4 ibdev rdmap8s0f0 port 2 netdev_idx 7 netdev eth4
[NETDEV_ATTACH] ibdev_idx 4 ibdev rdmap8s0f0 port 3 netdev_idx 8 netdev eth5
[NETDEV_ATTACH] ibdev_idx 4 ibdev rdmap8s0f0 port 4 netdev_idx 9 netdev eth6
[NETDEV_ATTACH] ibdev_idx 4 ibdev rdmap8s0f0 port 5 netdev_idx 10 netdev eth7
[REGISTER] ibdev_idx 5 ibdev mlx5_0
[NETDEV_ATTACH] ibdev_idx 5 ibdev mlx5_0 port 1 netdev_idx 11 netdev eth8
[REGISTER] ibdev_idx 6 ibdev mlx5_0
[NETDEV_ATTACH] ibdev_idx 6 ibdev mlx5_0 port 1 netdev_idx 12 netdev eth9
[REGISTER] ibdev_idx 7 ibdev mlx5_0
[NETDEV_ATTACH] ibdev_idx 7 ibdev mlx5_0 port 1 netdev_idx 13 netdev eth10
[REGISTER] ibdev_idx 8 ibdev mlx5_0
[NETDEV_ATTACH] ibdev_idx 8 ibdev mlx5_0 port 1 netdev_idx 14 netdev eth11
$ echo 0 > /sys/class/net/eth2/device/sriov_numvfs
[UNREGISTER] ibdev_idx 5 ibdev rocep8s0f0v0
[UNREGISTER] ibdev_idx 6 ibdev rocep8s0f0v1
[UNREGISTER] ibdev_idx 7 ibdev rocep8s0f0v2
[UNREGISTER] ibdev_idx 8 ibdev rocep8s0f0v3
[NETDEV_DETACH] ibdev_idx 4 ibdev rdmap8s0f0 port 2
[NETDEV_DETACH] ibdev_idx 4 ibdev rdmap8s0f0 port 3
[NETDEV_DETACH] ibdev_idx 4 ibdev rdmap8s0f0 port 4
[NETDEV_DETACH] ibdev_idx 4 ibdev rdmap8s0f0 port 5
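For userspace consumers, the flow above amounts to joining the new
RDMA_NL_GROUP_NOTIFY multicast group and decoding RDMA_NLDEV_CMD_MONITOR
messages. The helpers below are an illustrative sketch (not part of this
patch) of how the nlmsg_type of such messages is composed and decoded
following the existing RDMA_NL_GET_TYPE convention; the RDMA_NL_NLDEV
value mirrors the uapi enum:

```python
# Illustrative sketch of the RDMA netlink message-type convention used by
# this series; values mirror include/uapi/rdma/rdma_netlink.h.
RDMA_NL_NLDEV = 5  # nldev client index in the uapi enum

def rdma_nl_get_type(client: int, op: int) -> int:
    """Compose nlmsg_type from a client index and an op (RDMA_NL_GET_TYPE)."""
    return (client << 10) + op

def rdma_nl_get_client(nlmsg_type: int) -> int:
    """Recover the client index from nlmsg_type."""
    return nlmsg_type >> 10

def rdma_nl_get_op(nlmsg_type: int) -> int:
    """Recover the op from nlmsg_type."""
    return nlmsg_type & ((1 << 10) - 1)

# A monitor client would match incoming messages against the type built
# from RDMA_NL_NLDEV and RDMA_NLDEV_CMD_MONITOR (whose numeric value is
# fixed by its enum position after this patch, so it is left symbolic here).
```

A listener would bind a NETLINK_RDMA socket, subscribe to the notify group
via NETLINK_ADD_MEMBERSHIP, and filter received headers with these helpers.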
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/device.c | 38 +++++++++
drivers/infiniband/core/netlink.c | 1 +
drivers/infiniband/core/nldev.c | 124 ++++++++++++++++++++++++++++++
include/rdma/rdma_netlink.h | 12 +++
include/uapi/rdma/rdma_netlink.h | 15 ++++
5 files changed, 190 insertions(+)
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 9e765c79a892..d571b78d1bcc 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -1351,6 +1351,30 @@ static void prevent_dealloc_device(struct ib_device *ib_dev)
{
}
+static void ib_device_notify_register(struct ib_device *device)
+{
+ struct net_device *netdev;
+ u32 port;
+ int ret;
+
+ ret = rdma_nl_notify_event(device, 0, RDMA_REGISTER_EVENT);
+ if (ret)
+ return;
+
+ rdma_for_each_port(device, port) {
+ netdev = ib_device_get_netdev(device, port);
+ if (!netdev)
+ continue;
+
+ ret = rdma_nl_notify_event(device, port,
+ RDMA_NETDEV_ATTACH_EVENT);
+ dev_put(netdev);
+ if (ret)
+ return;
+ }
+ return;
+}
+
/**
* ib_register_device - Register an IB device with IB core
* @device: Device to register
@@ -1449,6 +1473,8 @@ int ib_register_device(struct ib_device *device, const char *name,
dev_set_uevent_suppress(&device->dev, false);
/* Mark for userspace that device is ready */
kobject_uevent(&device->dev.kobj, KOBJ_ADD);
+
+ ib_device_notify_register(device);
ib_device_put(device);
return 0;
@@ -1491,6 +1517,7 @@ static void __ib_unregister_device(struct ib_device *ib_dev)
goto out;
disable_device(ib_dev);
+ rdma_nl_notify_event(ib_dev, 0, RDMA_UNREGISTER_EVENT);
/* Expedite removing unregistered pointers from the hash table */
free_netdevs(ib_dev);
@@ -2159,6 +2186,7 @@ static void add_ndev_hash(struct ib_port_data *pdata)
int ib_device_set_netdev(struct ib_device *ib_dev, struct net_device *ndev,
u32 port)
{
+ enum rdma_nl_notify_event_type etype;
struct net_device *old_ndev;
struct ib_port_data *pdata;
unsigned long flags;
@@ -2190,6 +2218,16 @@ int ib_device_set_netdev(struct ib_device *ib_dev, struct net_device *ndev,
spin_unlock_irqrestore(&pdata->netdev_lock, flags);
add_ndev_hash(pdata);
+
+ down_read(&devices_rwsem);
+ if (xa_get_mark(&devices, ib_dev->index, DEVICE_REGISTERED) &&
+ xa_load(&devices, ib_dev->index) == ib_dev) {
+ etype = ndev ?
+ RDMA_NETDEV_ATTACH_EVENT : RDMA_NETDEV_DETACH_EVENT;
+ rdma_nl_notify_event(ib_dev, port, etype);
+ }
+ up_read(&devices_rwsem);
+
return 0;
}
EXPORT_SYMBOL(ib_device_set_netdev);
diff --git a/drivers/infiniband/core/netlink.c b/drivers/infiniband/core/netlink.c
index ae2db0c70788..def14c54b648 100644
--- a/drivers/infiniband/core/netlink.c
+++ b/drivers/infiniband/core/netlink.c
@@ -311,6 +311,7 @@ int rdma_nl_net_init(struct rdma_dev_net *rnet)
struct net *net = read_pnet(&rnet->net);
struct netlink_kernel_cfg cfg = {
.input = rdma_nl_rcv,
+ .flags = NL_CFG_F_NONROOT_RECV,
};
struct sock *nls;
diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 4d4a1f90e484..30b0fd54a7d3 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -170,6 +170,7 @@ static const struct nla_policy nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
[RDMA_NLDEV_ATTR_DEV_TYPE] = { .type = NLA_U8 },
[RDMA_NLDEV_ATTR_PARENT_NAME] = { .type = NLA_NUL_STRING },
[RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE] = { .type = NLA_U8 },
+ [RDMA_NLDEV_ATTR_EVENT_TYPE] = { .type = NLA_U8 },
};
static int put_driver_name_print_type(struct sk_buff *msg, const char *name,
@@ -2722,6 +2723,129 @@ static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = {
},
};
+static int fill_mon_netdev_association(struct sk_buff *msg,
+ struct ib_device *device, u32 port,
+ const struct net *net)
+{
+ struct net_device *netdev = ib_device_get_netdev(device, port);
+ int ret = 0;
+
+ if (netdev && !net_eq(dev_net(netdev), net))
+ goto out;
+
+ ret = nla_put_u32(msg, RDMA_NLDEV_ATTR_DEV_INDEX, device->index);
+ if (ret)
+ goto out;
+
+ ret = nla_put_string(msg, RDMA_NLDEV_ATTR_DEV_NAME,
+ dev_name(&device->dev));
+ if (ret)
+ goto out;
+
+ ret = nla_put_u32(msg, RDMA_NLDEV_ATTR_PORT_INDEX, port);
+ if (ret)
+ goto out;
+
+ if (netdev) {
+ ret = nla_put_u32(msg,
+ RDMA_NLDEV_ATTR_NDEV_INDEX, netdev->ifindex);
+ if (ret)
+ goto out;
+
+ ret = nla_put_string(msg,
+ RDMA_NLDEV_ATTR_NDEV_NAME, netdev->name);
+ }
+
+out:
+ dev_put(netdev);
+ return ret;
+}
+
+static void rdma_nl_notify_err_msg(struct ib_device *device, u32 port_num,
+ enum rdma_nl_notify_event_type type)
+{
+ struct net_device *netdev;
+
+ switch (type) {
+ case RDMA_REGISTER_EVENT:
+ dev_warn_ratelimited(&device->dev,
+ "Failed to send RDMA monitor register device event\n");
+ break;
+ case RDMA_UNREGISTER_EVENT:
+ dev_warn_ratelimited(&device->dev,
+ "Failed to send RDMA monitor unregister device event\n");
+ break;
+ case RDMA_NETDEV_ATTACH_EVENT:
+ netdev = ib_device_get_netdev(device, port_num);
+ dev_warn_ratelimited(&device->dev,
+ "Failed to send RDMA monitor netdev attach event: port %d netdev %d\n",
+ port_num, netdev->ifindex);
+ dev_put(netdev);
+ break;
+ case RDMA_NETDEV_DETACH_EVENT:
+ dev_warn_ratelimited(&device->dev,
+ "Failed to send RDMA monitor netdev detach event: port %d\n",
+ port_num);
+ break;
+ default:
+ break;
+ }
+}
+
+int rdma_nl_notify_event(struct ib_device *device, u32 port_num,
+ enum rdma_nl_notify_event_type type)
+{
+ struct sk_buff *skb;
+ struct net *net;
+ int ret = 0;
+ void *nlh;
+
+ net = read_pnet(&device->coredev.rdma_net);
+ if (!net)
+ return -EINVAL;
+
+ skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+ if (!skb)
+ return -ENOMEM;
+ nlh = nlmsg_put(skb, 0, 0,
+ RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_MONITOR),
+ 0, 0);
+
+ switch (type) {
+ case RDMA_REGISTER_EVENT:
+ case RDMA_UNREGISTER_EVENT:
+ ret = fill_nldev_handle(skb, device);
+ if (ret)
+ goto err_free;
+ break;
+ case RDMA_NETDEV_ATTACH_EVENT:
+ case RDMA_NETDEV_DETACH_EVENT:
+ ret = fill_mon_netdev_association(skb, device,
+ port_num, net);
+ if (ret)
+ goto err_free;
+ break;
+ default:
+ break;
+ }
+
+ ret = nla_put_u8(skb, RDMA_NLDEV_ATTR_EVENT_TYPE, type);
+ if (ret)
+ goto err_free;
+
+ nlmsg_end(skb, nlh);
+ ret = rdma_nl_multicast(net, skb, RDMA_NL_GROUP_NOTIFY, GFP_KERNEL);
+ if (ret && ret != -ESRCH) {
+ skb = NULL; /* skb is freed in the netlink send-op handling */
+ goto err_free;
+ }
+ return 0;
+
+err_free:
+ rdma_nl_notify_err_msg(device, port_num, type);
+ nlmsg_free(skb);
+ return ret;
+}
+
void __init nldev_init(void)
{
rdma_nl_register(RDMA_NL_NLDEV, nldev_cb_table);
diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
index c2a79aeee113..326deaf56d5d 100644
--- a/include/rdma/rdma_netlink.h
+++ b/include/rdma/rdma_netlink.h
@@ -6,6 +6,8 @@
#include <linux/netlink.h>
#include <uapi/rdma/rdma_netlink.h>
+struct ib_device;
+
enum {
RDMA_NLDEV_ATTR_EMPTY_STRING = 1,
RDMA_NLDEV_ATTR_ENTRY_STRLEN = 16,
@@ -110,6 +112,16 @@ int rdma_nl_multicast(struct net *net, struct sk_buff *skb,
*/
bool rdma_nl_chk_listeners(unsigned int group);
+/**
+ * Prepare and send an event message
+ * @ib: the IB device which triggered the event
+ * @port_num: the port number which triggered the event - 0 if unused
+ * @type: the event type
+ * Returns 0 on success or a negative error code
+ */
+int rdma_nl_notify_event(struct ib_device *ib, u32 port_num,
+ enum rdma_nl_notify_event_type type);
+
struct rdma_link_ops {
struct list_head list;
const char *type;
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 2f37568f5556..5f9636d26050 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -15,6 +15,7 @@ enum {
enum {
RDMA_NL_GROUP_IWPM = 2,
RDMA_NL_GROUP_LS,
+ RDMA_NL_GROUP_NOTIFY,
RDMA_NL_NUM_GROUPS
};
@@ -305,6 +306,8 @@ enum rdma_nldev_command {
RDMA_NLDEV_CMD_DELDEV,
+ RDMA_NLDEV_CMD_MONITOR,
+
RDMA_NLDEV_NUM_OPS
};
@@ -574,6 +577,8 @@ enum rdma_nldev_attr {
RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE, /* u8 */
+ RDMA_NLDEV_ATTR_EVENT_TYPE, /* u8 */
+
/*
* Always the end
*/
@@ -624,4 +629,14 @@ enum rdma_nl_name_assign_type {
RDMA_NAME_ASSIGN_TYPE_USER = 1, /* Provided by user-space */
};
+/*
+ * Supported rdma monitoring event types.
+ */
+enum rdma_nl_notify_event_type {
+ RDMA_REGISTER_EVENT,
+ RDMA_UNREGISTER_EVENT,
+ RDMA_NETDEV_ATTACH_EVENT,
+ RDMA_NETDEV_DETACH_EVENT,
+};
+
#endif /* _UAPI_RDMA_NETLINK_H */
--
2.17.2
* [PATCH v3 rdma-next 7/7] RDMA/nldev: Expose whether RDMA monitoring is supported
2024-09-09 17:30 [PATCH rdma-next v3 0/7] Support RDMA events monitoring through Michael Guralnik
` (5 preceding siblings ...)
2024-09-09 17:30 ` [PATCH v3 rdma-next 6/7] RDMA/nldev: Add support for RDMA monitoring Michael Guralnik
@ 2024-09-09 17:30 ` Michael Guralnik
2024-09-11 13:30 ` [PATCH rdma-next v3 0/7] Support RDMA events monitoring through Leon Romanovsky
7 siblings, 0 replies; 14+ messages in thread
From: Michael Guralnik @ 2024-09-09 17:30 UTC (permalink / raw)
To: jgg
Cc: linux-rdma, leonro, mbloch, cmeiohas, msanalla, dsahern,
Michael Guralnik
From: Chiara Meiohas <cmeiohas@nvidia.com>
Extend the "rdma sys" command to display whether RDMA
monitoring is supported.
RDMA monitoring is not supported in mlx4 because it does
not use the ib_device_set_netdev() API, which sends the
RDMA events.
Example output for kernel where monitoring is supported:
$ rdma sys show
netns shared privileged-qkey off monitor on copy-on-fork on
Example output for kernel where monitoring is not supported:
$ rdma sys show
netns shared privileged-qkey off monitor off copy-on-fork on
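The new field is carried as an ordinary u8 netlink attribute, so a
monitor-aware client only needs standard nlattr parsing to read it. The
sketch below is illustrative and not part of this patch: it packs and walks
u8 attributes using the generic nlattr layout (u16 len, u16 type, payload
padded to 4 bytes, host byte order), and the numeric attribute value used
is a placeholder, since the real one is assigned by enum position:

```python
import struct

NLA_HDRLEN = 4   # sizeof(struct nlattr): u16 nla_len + u16 nla_type
NLA_ALIGNTO = 4  # attributes are padded to 4-byte boundaries

def nla_put_u8(buf: bytes, nla_type: int, value: int) -> bytes:
    """Append a u8 attribute in the generic netlink attribute layout."""
    nla_len = NLA_HDRLEN + 1  # header + 1-byte payload
    pad = (NLA_ALIGNTO - nla_len % NLA_ALIGNTO) % NLA_ALIGNTO
    # "=" selects host byte order, matching netlink's native encoding
    return buf + struct.pack("=HHB", nla_len, nla_type, value) + b"\x00" * pad

def nla_walk(buf: bytes):
    """Yield (type, payload) pairs from a buffer of packed attributes."""
    off = 0
    while off + NLA_HDRLEN <= len(buf):
        nla_len, nla_type = struct.unpack_from("=HH", buf, off)
        yield nla_type, buf[off + NLA_HDRLEN : off + nla_len]
        # advance to the next 4-byte-aligned attribute
        off += (nla_len + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)

# Placeholder value: the real RDMA_NLDEV_SYS_ATTR_MONITOR_MODE is fixed by
# its position in the uapi enum after this patch.
RDMA_NLDEV_SYS_ATTR_MONITOR_MODE = 99

msg = nla_put_u8(b"", RDMA_NLDEV_SYS_ATTR_MONITOR_MODE, 1)
attrs = dict(nla_walk(msg))
monitor_on = attrs[RDMA_NLDEV_SYS_ATTR_MONITOR_MODE][0] == 1
```

This is the same layout nla_put_u8() emits on the kernel side, so a client
can decide whether the running kernel reports monitoring support.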
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/nldev.c | 6 ++++++
include/uapi/rdma/rdma_netlink.h | 1 +
2 files changed, 7 insertions(+)
diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 30b0fd54a7d3..b2dca6aa531d 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -1952,6 +1952,12 @@ static int nldev_sys_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
nlmsg_free(msg);
return err;
}
+
+ err = nla_put_u8(msg, RDMA_NLDEV_SYS_ATTR_MONITOR_MODE, 1);
+ if (err) {
+ nlmsg_free(msg);
+ return err;
+ }
/*
* Copy-on-fork is supported.
* See commits:
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 5f9636d26050..39be09c0ffbb 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -579,6 +579,7 @@ enum rdma_nldev_attr {
RDMA_NLDEV_ATTR_EVENT_TYPE, /* u8 */
+ RDMA_NLDEV_SYS_ATTR_MONITOR_MODE, /* u8 */
/*
* Always the end
*/
--
2.17.2
* Re: [PATCH v3 rdma-next 6/7] RDMA/nldev: Add support for RDMA monitoring
2024-09-09 17:30 ` [PATCH v3 rdma-next 6/7] RDMA/nldev: Add support for RDMA monitoring Michael Guralnik
@ 2024-09-09 18:05 ` Leon Romanovsky
2024-09-10 11:09 ` Leon Romanovsky
1 sibling, 0 replies; 14+ messages in thread
From: Leon Romanovsky @ 2024-09-09 18:05 UTC (permalink / raw)
To: Michael Guralnik; +Cc: jgg, linux-rdma, mbloch, cmeiohas, msanalla, dsahern
On Mon, Sep 09, 2024 at 08:30:24PM +0300, Michael Guralnik wrote:
> From: Chiara Meiohas <cmeiohas@nvidia.com>
>
> Introduce a new netlink command to allow rdma event monitoring.
> The rdma events supported now are IB device
> registration/unregistration and net device attachment/detachment.
>
> Example output of rdma monitor and the commands which trigger
> the events:
>
> $ rdma monitor
> $ rmmod mlx5_ib
> [UNREGISTER] ibdev_idx 1 ibdev rocep8s0f1
> [UNREGISTER] ibdev_idx 0 ibdev rocep8s0f0
>
> $ modprobe mlx5_ib
> [REGISTER] ibdev_idx 2 ibdev mlx5_0
> [NETDEV_ATTACH] ibdev_idx 2 ibdev mlx5_0 port 1 netdev_idx 4 netdev eth2
> [REGISTER] ibdev_idx 3 ibdev mlx5_1
> [NETDEV_ATTACH] ibdev_idx 3 ibdev mlx5_1 port 1 netdev_idx 5 netdev eth3
No need to resend the series, I will fix when applying, but the right
format will be:
[NETDEV_ATTACH] dev 3 mlx5_1 port 1 netdev 5 eth3
Thanks
* Re: [PATCH v3 rdma-next 1/7] RDMA/mlx5: Check RoCE LAG status before getting netdev
2024-09-09 17:30 ` [PATCH v3 rdma-next 1/7] RDMA/mlx5: Check RoCE LAG status before getting netdev Michael Guralnik
@ 2024-09-10 3:58 ` Kalesh Anakkur Purayil
0 siblings, 0 replies; 14+ messages in thread
From: Kalesh Anakkur Purayil @ 2024-09-10 3:58 UTC (permalink / raw)
To: Michael Guralnik
Cc: jgg, linux-rdma, leonro, mbloch, cmeiohas, msanalla, dsahern
On Mon, Sep 9, 2024 at 11:10 PM Michael Guralnik <michaelgur@nvidia.com> wrote:
>
> From: Mark Bloch <mbloch@nvidia.com>
>
> Check whether RoCE LAG is active before calling the LAG layer for the netdev.
> This makes the dependency on the LAG state explicit. No behavior change with this patch.
>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Looks good to me
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> ---
> drivers/infiniband/hw/mlx5/main.c | 19 +++++++++++++------
> 1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
> index b85ad3c0bfa1..cdf1ce0f6b34 100644
> --- a/drivers/infiniband/hw/mlx5/main.c
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -198,12 +198,18 @@ static int mlx5_netdev_event(struct notifier_block *this,
> case NETDEV_CHANGE:
> case NETDEV_UP:
> case NETDEV_DOWN: {
> - struct net_device *lag_ndev = mlx5_lag_get_roce_netdev(mdev);
> struct net_device *upper = NULL;
>
> - if (lag_ndev) {
> - upper = netdev_master_upper_dev_get(lag_ndev);
> - dev_put(lag_ndev);
> + if (mlx5_lag_is_roce(mdev)) {
> + struct net_device *lag_ndev;
> +
> + lag_ndev = mlx5_lag_get_roce_netdev(mdev);
> + if (lag_ndev) {
> + upper = netdev_master_upper_dev_get(lag_ndev);
> + dev_put(lag_ndev);
> + } else {
> + goto done;
> + }
> }
>
> if (ibdev->is_rep)
> @@ -257,9 +263,10 @@ static struct net_device *mlx5_ib_get_netdev(struct ib_device *device,
> if (!mdev)
> return NULL;
>
> - ndev = mlx5_lag_get_roce_netdev(mdev);
> - if (ndev)
> + if (mlx5_lag_is_roce(mdev)) {
> + ndev = mlx5_lag_get_roce_netdev(mdev);
> goto out;
> + }
>
> /* Ensure ndev does not disappear before we invoke dev_hold()
> */
> --
> 2.17.2
>
>
--
Regards,
Kalesh A P
* Re: [PATCH v3 rdma-next 2/7] RDMA/mlx5: Obtain upper net device only when needed
2024-09-09 17:30 ` [PATCH v3 rdma-next 2/7] RDMA/mlx5: Obtain upper net device only when needed Michael Guralnik
@ 2024-09-10 3:59 ` Kalesh Anakkur Purayil
0 siblings, 0 replies; 14+ messages in thread
From: Kalesh Anakkur Purayil @ 2024-09-10 3:59 UTC (permalink / raw)
To: Michael Guralnik
Cc: jgg, linux-rdma, leonro, mbloch, cmeiohas, msanalla, dsahern
On Mon, Sep 9, 2024 at 11:10 PM Michael Guralnik <michaelgur@nvidia.com> wrote:
>
> From: Mark Bloch <mbloch@nvidia.com>
>
> Report the upper device's state as the RDMA port state only in RoCE LAG or
> switchdev LAG.
>
> Fixes: 27f9e0ccb6da ("net/mlx5: Lag, Add single RDMA device in multiport mode")
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Looks good to me
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> ---
> drivers/infiniband/hw/mlx5/main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
> index cdf1ce0f6b34..c75cc3d14e74 100644
> --- a/drivers/infiniband/hw/mlx5/main.c
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -558,7 +558,7 @@ static int mlx5_query_port_roce(struct ib_device *device, u32 port_num,
> if (!ndev)
> goto out;
>
> - if (dev->lag_active) {
> + if (mlx5_lag_is_roce(mdev) || mlx5_lag_is_sriov(mdev)) {
> rcu_read_lock();
> upper = netdev_master_upper_dev_get_rcu(ndev);
> if (upper) {
> --
> 2.17.2
>
>
--
Regards,
Kalesh A P
* Re: [PATCH v3 rdma-next 4/7] RDMA/device: Remove optimization in ib_device_get_netdev()
2024-09-09 17:30 ` [PATCH v3 rdma-next 4/7] RDMA/device: Remove optimization in ib_device_get_netdev() Michael Guralnik
@ 2024-09-10 4:00 ` Kalesh Anakkur Purayil
0 siblings, 0 replies; 14+ messages in thread
From: Kalesh Anakkur Purayil @ 2024-09-10 4:00 UTC (permalink / raw)
To: Michael Guralnik
Cc: jgg, linux-rdma, leonro, mbloch, cmeiohas, msanalla, dsahern
On Mon, Sep 9, 2024 at 11:10 PM Michael Guralnik <michaelgur@nvidia.com> wrote:
>
> From: Chiara Meiohas <cmeiohas@nvidia.com>
>
> The caller of ib_device_get_netdev() relies on its result to accurately
> match a given netdev with the ib device associated netdev.
>
> ib_device_get_netdev() returns NULL when the IB device's associated
> netdev is unregistering, preventing the caller from matching netdevs properly.
>
> Thus, remove this optimization and return the netdev even if
> it is undergoing unregistration, allowing matching by the caller.
>
> This change ensures proper netdev matching and reference count handling
> by the caller of ib_device_get_netdev/ib_device_set_netdev API.
>
> Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
> Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Looks good to me
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> ---
> drivers/infiniband/core/device.c | 9 ---------
> 1 file changed, 9 deletions(-)
>
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index 0290aca18d26..b1377503cb9d 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -2252,15 +2252,6 @@ struct net_device *ib_device_get_netdev(struct ib_device *ib_dev,
> spin_unlock(&pdata->netdev_lock);
> }
>
> - /*
> - * If we are starting to unregister expedite things by preventing
> - * propagation of an unregistering netdev.
> - */
> - if (res && res->reg_state != NETREG_REGISTERED) {
> - dev_put(res);
> - return NULL;
> - }
> -
> return res;
> }
>
> --
> 2.17.2
>
>
--
Regards,
Kalesh A P
* Re: [PATCH v3 rdma-next 6/7] RDMA/nldev: Add support for RDMA monitoring
2024-09-09 17:30 ` [PATCH v3 rdma-next 6/7] RDMA/nldev: Add support for RDMA monitoring Michael Guralnik
2024-09-09 18:05 ` Leon Romanovsky
@ 2024-09-10 11:09 ` Leon Romanovsky
1 sibling, 0 replies; 14+ messages in thread
From: Leon Romanovsky @ 2024-09-10 11:09 UTC (permalink / raw)
To: Michael Guralnik; +Cc: jgg, linux-rdma, mbloch, cmeiohas, msanalla, dsahern
On Mon, Sep 09, 2024 at 08:30:24PM +0300, Michael Guralnik wrote:
> From: Chiara Meiohas <cmeiohas@nvidia.com>
>
> Introduce a new netlink command to allow rdma event monitoring.
> The rdma events supported now are IB device
> registration/unregistration and net device attachment/detachment.
>
> Example output of rdma monitor and the commands which trigger
> the events:
>
> $ rdma monitor
> $ rmmod mlx5_ib
> [UNREGISTER] ibdev_idx 1 ibdev rocep8s0f1
> [UNREGISTER] ibdev_idx 0 ibdev rocep8s0f0
>
> $ modprobe mlx5_ib
> [REGISTER] ibdev_idx 2 ibdev mlx5_0
> [NETDEV_ATTACH] ibdev_idx 2 ibdev mlx5_0 port 1 netdev_idx 4 netdev eth2
> [REGISTER] ibdev_idx 3 ibdev mlx5_1
> [NETDEV_ATTACH] ibdev_idx 3 ibdev mlx5_1 port 1 netdev_idx 5 netdev eth3
>
> $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev
> [UNREGISTER] ibdev_idx 2 ibdev rocep8s0f0
> [REGISTER] ibdev_idx 4 ibdev mlx5_0
> [NETDEV_ATTACH] ibdev_idx 4 ibdev mlx5_0 port 30 netdev_idx 4 netdev eth2
>
> $ echo 4 > /sys/class/net/eth2/device/sriov_numvfs
> [NETDEV_ATTACH] ibdev_idx 4 ibdev rdmap8s0f0 port 2 netdev_idx 7 netdev eth4
> [NETDEV_ATTACH] ibdev_idx 4 ibdev rdmap8s0f0 port 3 netdev_idx 8 netdev eth5
> [NETDEV_ATTACH] ibdev_idx 4 ibdev rdmap8s0f0 port 4 netdev_idx 9 netdev eth6
> [NETDEV_ATTACH] ibdev_idx 4 ibdev rdmap8s0f0 port 5 netdev_idx 10 netdev eth7
> [REGISTER] ibdev_idx 5 ibdev mlx5_0
> [NETDEV_ATTACH] ibdev_idx 5 ibdev mlx5_0 port 1 netdev_idx 11 netdev eth8
> [REGISTER] ibdev_idx 6 ibdev mlx5_0
> [NETDEV_ATTACH] ibdev_idx 6 ibdev mlx5_0 port 1 netdev_idx 12 netdev eth9
> [REGISTER] ibdev_idx 7 ibdev mlx5_0
> [NETDEV_ATTACH] ibdev_idx 7 ibdev mlx5_0 port 1 netdev_idx 13 netdev eth10
> [REGISTER] ibdev_idx 8 ibdev mlx5_0
> [NETDEV_ATTACH] ibdev_idx 8 ibdev mlx5_0 port 1 netdev_idx 14 netdev eth11
>
> $ echo 0 > /sys/class/net/eth2/device/sriov_numvfs
> [UNREGISTER] ibdev_idx 5 ibdev rocep8s0f0v0
> [UNREGISTER] ibdev_idx 6 ibdev rocep8s0f0v1
> [UNREGISTER] ibdev_idx 7 ibdev rocep8s0f0v2
> [UNREGISTER] ibdev_idx 8 ibdev rocep8s0f0v3
> [NETDEV_DETACH] ibdev_idx 4 ibdev rdmap8s0f0 port 2
> [NETDEV_DETACH] ibdev_idx 4 ibdev rdmap8s0f0 port 3
> [NETDEV_DETACH] ibdev_idx 4 ibdev rdmap8s0f0 port 4
> [NETDEV_DETACH] ibdev_idx 4 ibdev rdmap8s0f0 port 5
>
> Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
> ---
> drivers/infiniband/core/device.c | 38 +++++++++
> drivers/infiniband/core/netlink.c | 1 +
> drivers/infiniband/core/nldev.c | 124 ++++++++++++++++++++++++++++++
> include/rdma/rdma_netlink.h | 12 +++
> include/uapi/rdma/rdma_netlink.h | 15 ++++
> 5 files changed, 190 insertions(+)
<...>
> /* Expedite removing unregistered pointers from the hash table */
> free_netdevs(ib_dev);
> @@ -2159,6 +2186,7 @@ static void add_ndev_hash(struct ib_port_data *pdata)
> int ib_device_set_netdev(struct ib_device *ib_dev, struct net_device *ndev,
> u32 port)
> {
> + enum rdma_nl_notify_event_type etype;
> struct net_device *old_ndev;
> struct ib_port_data *pdata;
> unsigned long flags;
> @@ -2190,6 +2218,16 @@ int ib_device_set_netdev(struct ib_device *ib_dev, struct net_device *ndev,
> spin_unlock_irqrestore(&pdata->netdev_lock, flags);
>
> add_ndev_hash(pdata);
> +
> + down_read(&devices_rwsem);
> + if (xa_get_mark(&devices, ib_dev->index, DEVICE_REGISTERED) &&
> + xa_load(&devices, ib_dev->index) == ib_dev) {
> + etype = ndev ?
> + RDMA_NETDEV_ATTACH_EVENT : RDMA_NETDEV_DETACH_EVENT;
> + rdma_nl_notify_event(ib_dev, port, etype);
> + }
> + up_read(&devices_rwsem);
There is no need in this locking, let's rewrite the following code
without it. We are in -rc7, I'll add this hunk when applying.
Thanks
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index d571b78d1bcc..3be66dd7b226 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2219,14 +2219,12 @@ int ib_device_set_netdev(struct ib_device *ib_dev, struct net_device *ndev,
add_ndev_hash(pdata);
- down_read(&devices_rwsem);
- if (xa_get_mark(&devices, ib_dev->index, DEVICE_REGISTERED) &&
- xa_load(&devices, ib_dev->index) == ib_dev) {
- etype = ndev ?
- RDMA_NETDEV_ATTACH_EVENT : RDMA_NETDEV_DETACH_EVENT;
- rdma_nl_notify_event(ib_dev, port, etype);
- }
- up_read(&devices_rwsem);
+ /* Make sure that the device is registered before we send events */
+ if (xa_load(&devices, ib_dev->index) != ib_dev)
+ return 0;
+
+ etype = ndev ? RDMA_NETDEV_ATTACH_EVENT : RDMA_NETDEV_DETACH_EVENT;
+ rdma_nl_notify_event(ib_dev, port, etype);
return 0;
}
* Re: [PATCH rdma-next v3 0/7] Support RDMA events monitoring through
2024-09-09 17:30 [PATCH rdma-next v3 0/7] Support RDMA events monitoring through Michael Guralnik
` (6 preceding siblings ...)
2024-09-09 17:30 ` [PATCH v3 rdma-next 7/7] RDMA/nldev: Expose whether RDMA monitoring is supported Michael Guralnik
@ 2024-09-11 13:30 ` Leon Romanovsky
7 siblings, 0 replies; 14+ messages in thread
From: Leon Romanovsky @ 2024-09-11 13:30 UTC (permalink / raw)
To: Jason Gunthorpe, Michael Guralnik
Cc: linux-rdma, mbloch, cmeiohas, msanalla, dsahern, Leon Romanovsky
On Mon, 09 Sep 2024 20:30:18 +0300, Michael Guralnik wrote:
> This series consists of multiple parts that collectively offer a method
> to monitor RDMA events from userspace.
> Using netlink, users will be able to monitor their IB device events and
> changes such as device register, device unregister and netdev
> attachment.
>
> The first 2 patches contain fixes in mlx5 lag code that are required for
> accurate event reporting in case of a lag bond.
>
> [...]
Applied, thanks!
[1/7] RDMA/mlx5: Check RoCE LAG status before getting netdev
https://git.kernel.org/rdma/rdma/c/e67266dc429670
[2/7] RDMA/mlx5: Obtain upper net device only when needed
https://git.kernel.org/rdma/rdma/c/eb66fcc43fde8f
[3/7] RDMA/mlx5: Initialize phys_port_cnt earlier in RDMA device creation
https://git.kernel.org/rdma/rdma/c/41068c95b0bf6e
[4/7] RDMA/device: Remove optimization in ib_device_get_netdev()
https://git.kernel.org/rdma/rdma/c/95ae29d023a4ba
[5/7] RDMA/mlx5: Use IB set_netdev and get_netdev functions
https://git.kernel.org/rdma/rdma/c/425b36d3b2cb81
[6/7] RDMA/nldev: Add support for RDMA monitoring
https://git.kernel.org/rdma/rdma/c/9a13c8cffcf6e8
[7/7] RDMA/nldev: Expose whether RDMA monitoring is supported
https://git.kernel.org/rdma/rdma/c/61eb1e03c16f38
Best regards,
--
Leon Romanovsky <leon@kernel.org>