Re: [RFC PATCH net-next 03/12] devlink: Acquire device lock during reload

public inbox for linux-pci@vger.kernel.org
 help / color / mirror / Atom feed

From: Jiri Pirko <jiri@resnulli.us>
To: Ido Schimmel <idosch@nvidia.com>
Cc: netdev@vger.kernel.org, linux-pci@vger.kernel.org,
	davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
	edumazet@google.com, bhelgaas@google.com,
	alex.williamson@redhat.com, lukas@wunner.de, petrm@nvidia.com,
	jiri@nvidia.com, mlxsw@nvidia.com
Subject: Re: [RFC PATCH net-next 03/12] devlink: Acquire device lock during reload
Date: Tue, 17 Oct 2023 10:04:46 +0200	[thread overview]
Message-ID: <ZS5AHnlJp6orqdLb@nanopsycho> (raw)
In-Reply-To: <20231017074257.3389177-4-idosch@nvidia.com>

Tue, Oct 17, 2023 at 09:42:48AM CEST, idosch@nvidia.com wrote:
>Device drivers register with devlink from their probe routines (under
>the device lock) by acquiring the devlink instance lock and calling
>devl_register().
>
>Drivers that support a devlink reload usually implement the
>reload_{down, up}() operations in a similar fashion to their remove and
>probe routines, respectively.
>
>However, while the remove and probe routines are invoked with the device
>lock held, the reload operations are only invoked with the devlink
>instance lock held. It is therefore impossible for drivers to acquire
>the device lock from their reload operations, as this would result in
>lock inversion.
>
>The motivating use case for invoking the reload operations with the
>device lock held is in mlxsw which needs to trigger a PCI reset as part
>of the reload. The driver cannot call pci_reset_function() as this
>function acquires the device lock. Instead, it needs to call
>__pci_reset_function_locked which expects the device lock to be held.
>
>To that end, adjust devlink to always acquire the device lock before the
>devlink instance lock when performing a reload. Do that both when reload
>is triggered explicitly by user space and when it is triggered as part
>of netns dismantle.
>
>Tested the following flows with netdevsim and mlxsw while lockdep is
>enabled:
>
>netdevsim:
>
> # echo "10 1" > /sys/bus/netdevsim/new_device
> # devlink dev reload netdevsim/netdevsim10
> # ip netns add bla
> # devlink dev reload netdevsim/netdevsim10 netns bla
> # ip netns del bla
> # echo 10 > /sys/bus/netdevsim/del_device
>
>mlxsw:
>
> # devlink dev reload pci/0000:01:00.0
> # ip netns add bla
> # devlink dev reload pci/0000:01:00.0 netns bla
> # ip netns del bla
> # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
> # echo 1 > /sys/bus/pci/rescan
>
>Signed-off-by: Ido Schimmel <idosch@nvidia.com>
>---
> net/devlink/core.c          |  4 ++--
> net/devlink/dev.c           |  8 ++++++++
> net/devlink/devl_internal.h | 19 ++++++++++++++++++-
> net/devlink/health.c        |  3 ++-
> net/devlink/netlink.c       | 21 ++++++++++++++-------
> net/devlink/region.c        |  3 ++-
> 6 files changed, 46 insertions(+), 12 deletions(-)
>
>diff --git a/net/devlink/core.c b/net/devlink/core.c
>index 5b8b692b8c76..0f866f2cbaf6 100644
>--- a/net/devlink/core.c
>+++ b/net/devlink/core.c
>@@ -502,14 +502,14 @@ static void __net_exit devlink_pernet_pre_exit(struct net *net)
> 	 * all devlink instances from this namespace into init_net.
> 	 */
> 	devlinks_xa_for_each_registered_get(net, index, devlink) {
>-		devl_lock(devlink);
>+		devl_dev_lock(devlink, true);
> 		err = 0;
> 		if (devl_is_registered(devlink))
> 			err = devlink_reload(devlink, &init_net,
> 					     DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
> 					     DEVLINK_RELOAD_LIMIT_UNSPEC,
> 					     &actions_performed, NULL);
>-		devl_unlock(devlink);
>+		devl_dev_unlock(devlink, true);
> 		devlink_put(devlink);
> 		if (err && err != -EOPNOTSUPP)
> 			pr_warn("Failed to reload devlink instance into init_net\n");
>diff --git a/net/devlink/dev.c b/net/devlink/dev.c
>index dc8039ca2b38..70cebe716187 100644
>--- a/net/devlink/dev.c
>+++ b/net/devlink/dev.c
>@@ -4,6 +4,7 @@
>  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
>  */
> 
>+#include <linux/device.h>
> #include <net/genetlink.h>
> #include <net/sock.h>
> #include "devl_internal.h"
>@@ -433,6 +434,13 @@ int devlink_reload(struct devlink *devlink, struct net *dest_net,
> 	struct net *curr_net;
> 	int err;
> 
>+	/* Make sure the reload operations are invoked with the device lock
>+	 * held to allow drivers to trigger functionality that expects it
>+	 * (e.g., PCI reset) and to close possible races between these
>+	 * operations and probe/remove.
>+	 */
>+	device_lock_assert(devlink->dev);
>+
> 	memcpy(remote_reload_stats, devlink->stats.remote_reload_stats,
> 	       sizeof(remote_reload_stats));
> 
>diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
>index 741d1bf1bec8..a9c5e52c40a7 100644
>--- a/net/devlink/devl_internal.h
>+++ b/net/devlink/devl_internal.h
>@@ -3,6 +3,7 @@
>  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
>  */
> 
>+#include <linux/device.h>
> #include <linux/etherdevice.h>
> #include <linux/mutex.h>
> #include <linux/netdevice.h>
>@@ -96,6 +97,20 @@ static inline bool devl_is_registered(struct devlink *devlink)
> 	return xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED);
> }
> 
>+static inline void devl_dev_lock(struct devlink *devlink, bool dev_lock)
>+{
>+	if (dev_lock)
>+		device_lock(devlink->dev);
>+	devl_lock(devlink);
>+}
>+
>+static inline void devl_dev_unlock(struct devlink *devlink, bool dev_lock)
>+{
>+	devl_unlock(devlink);
>+	if (dev_lock)
>+		device_unlock(devlink->dev);
>+}
>+
> typedef void devlink_rel_notify_cb_t(struct devlink *devlink, u32 obj_index);
> typedef void devlink_rel_cleanup_cb_t(struct devlink *devlink, u32 obj_index,
> 				      u32 rel_index);
>@@ -113,6 +128,7 @@ int devlink_rel_devlink_handle_put(struct sk_buff *msg, struct devlink *devlink,
> /* Netlink */
> #define DEVLINK_NL_FLAG_NEED_PORT		BIT(0)
> #define DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT	BIT(1)
>+#define DEVLINK_NL_FLAG_NEED_DEV_LOCK		BIT(2)
> 
> enum devlink_multicast_groups {
> 	DEVLINK_MCGRP_CONFIG,
>@@ -140,7 +156,8 @@ typedef int devlink_nl_dump_one_func_t(struct sk_buff *msg,
> 				       int flags);
> 
> struct devlink *
>-devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs);
>+devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
>+			    bool dev_lock);
> 
> int devlink_nl_dumpit(struct sk_buff *msg, struct netlink_callback *cb,
> 		      devlink_nl_dump_one_func_t *dump_one);
>diff --git a/net/devlink/health.c b/net/devlink/health.c
>index 51e6e81e31bb..3c4c049c3636 100644
>--- a/net/devlink/health.c
>+++ b/net/devlink/health.c
>@@ -1266,7 +1266,8 @@ devlink_health_reporter_get_from_cb_lock(struct netlink_callback *cb)
> 	struct nlattr **attrs = info->attrs;
> 	struct devlink *devlink;
> 
>-	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs);
>+	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs,
>+					      false);
> 	if (IS_ERR(devlink))
> 		return NULL;
> 
>diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
>index 499304d9de49..14d598000d72 100644
>--- a/net/devlink/netlink.c
>+++ b/net/devlink/netlink.c
>@@ -124,7 +124,8 @@ int devlink_nl_msg_reply_and_new(struct sk_buff **msg, struct genl_info *info)
> }
> 
> struct devlink *
>-devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs)
>+devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
>+			    bool dev_lock)
> {
> 	struct devlink *devlink;
> 	unsigned long index;
>@@ -138,12 +139,12 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs)
> 	devname = nla_data(attrs[DEVLINK_ATTR_DEV_NAME]);
> 
> 	devlinks_xa_for_each_registered_get(net, index, devlink) {
>-		devl_lock(devlink);
>+		devl_dev_lock(devlink, dev_lock);
> 		if (devl_is_registered(devlink) &&
> 		    strcmp(devlink->dev->bus->name, busname) == 0 &&
> 		    strcmp(dev_name(devlink->dev), devname) == 0)
> 			return devlink;
>-		devl_unlock(devlink);
>+		devl_dev_unlock(devlink, dev_lock);
> 		devlink_put(devlink);
> 	}
> 
>@@ -155,9 +156,12 @@ static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
> {
> 	struct devlink_port *devlink_port;
> 	struct devlink *devlink;
>+	bool dev_lock;
> 	int err;
> 
>-	devlink = devlink_get_from_attrs_lock(genl_info_net(info), info->attrs);
>+	dev_lock = !!(flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK);

I know you are aware, but just for the record: This conflicts
with my patchset "devlink: finish conversion to generated split_ops"
where I'm removing use of internal_flags. Ops that need this (should
be only reload) would need separate devlink_nl_pre/post_doit() helpers.

Otherwise the patch looks fine to me.


>+	devlink = devlink_get_from_attrs_lock(genl_info_net(info), info->attrs,
>+					      dev_lock);
> 	if (IS_ERR(devlink))
> 		return PTR_ERR(devlink);
> 
>@@ -177,7 +181,7 @@ static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
> 	return 0;
> 
> unlock:
>-	devl_unlock(devlink);
>+	devl_dev_unlock(devlink, dev_lock);
> 	devlink_put(devlink);
> 	return err;
> }
>@@ -205,9 +209,11 @@ void devlink_nl_post_doit(const struct genl_split_ops *ops,
> 			  struct sk_buff *skb, struct genl_info *info)
> {
> 	struct devlink *devlink;
>+	bool dev_lock;
> 
>+	dev_lock = !!(ops->internal_flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK);
> 	devlink = info->user_ptr[0];
>-	devl_unlock(devlink);
>+	devl_dev_unlock(devlink, dev_lock);
> 	devlink_put(devlink);
> }
> 
>@@ -219,7 +225,7 @@ static int devlink_nl_inst_single_dumpit(struct sk_buff *msg,
> 	struct devlink *devlink;
> 	int err;
> 
>-	devlink = devlink_get_from_attrs_lock(sock_net(msg->sk), attrs);
>+	devlink = devlink_get_from_attrs_lock(sock_net(msg->sk), attrs, false);
> 	if (IS_ERR(devlink))
> 		return PTR_ERR(devlink);
> 	err = dump_one(msg, devlink, cb, flags | NLM_F_DUMP_FILTERED);
>@@ -420,6 +426,7 @@ static const struct genl_small_ops devlink_nl_small_ops[40] = {
> 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
> 		.doit = devlink_nl_cmd_reload,
> 		.flags = GENL_ADMIN_PERM,
>+		.internal_flags = DEVLINK_NL_FLAG_NEED_DEV_LOCK,
> 	},
> 	{
> 		.cmd = DEVLINK_CMD_PARAM_SET,
>diff --git a/net/devlink/region.c b/net/devlink/region.c
>index d197cdb662db..30c6c49ec10b 100644
>--- a/net/devlink/region.c
>+++ b/net/devlink/region.c
>@@ -883,7 +883,8 @@ int devlink_nl_cmd_region_read_dumpit(struct sk_buff *skb,
> 
> 	start_offset = state->start_offset;
> 
>-	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs);
>+	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs,
>+					      false);
> 	if (IS_ERR(devlink))
> 		return PTR_ERR(devlink);
> 
>-- 
>2.40.1
>
>

next prev parent reply	other threads:[~2023-10-17  8:04 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 01/12] netdevsim: Block until all devices are released Ido Schimmel
2023-10-19  0:53   ` Jakub Kicinski
2023-10-17  7:42 ` [RFC PATCH net-next 02/12] devlink: Hold a reference on parent device Ido Schimmel
2023-10-17  7:56   ` Jiri Pirko
2023-10-17  8:11     ` Ido Schimmel
2023-10-17  9:01       ` Jiri Pirko
2023-10-17  7:42 ` [RFC PATCH net-next 03/12] devlink: Acquire device lock during reload Ido Schimmel
2023-10-17  8:04   ` Jiri Pirko [this message]
2023-10-17  7:42 ` [RFC PATCH net-next 04/12] PCI: Add no PM reset quirk for NVIDIA Spectrum devices Ido Schimmel
2023-10-18 19:40   ` Bjorn Helgaas
2023-10-22  8:23     ` Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 05/12] PCI: Add device-specific reset " Ido Schimmel
2023-10-17 10:00   ` Lukas Wunner
2023-10-18 20:08   ` Bjorn Helgaas
2023-10-25 11:05     ` Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 06/12] PCI: Add debug print for device ready delay Ido Schimmel
2023-10-18 19:41   ` Bjorn Helgaas
2023-10-17  7:42 ` [RFC PATCH net-next 07/12] mlxsw: Extend MRSR pack() function to support new commands Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 08/12] mlxsw: pci: Rename mlxsw_pci_sw_reset() Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 09/12] mlxsw: pci: Move software reset code to a separate function Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 10/12] mlxsw: pci: Add support for new reset flow Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 11/12] mlxsw: pci: Implement PCI reset handlers Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 12/12] selftests: mlxsw: Add PCI reset test Ido Schimmel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZS5AHnlJp6orqdLb@nanopsycho \
    --to=jiri@resnulli.us \
    --cc=alex.williamson@redhat.com \
    --cc=bhelgaas@google.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=idosch@nvidia.com \
    --cc=jiri@nvidia.com \
    --cc=kuba@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mlxsw@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=petrm@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox