netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH net-next 0/2] devlink: Acquire device lock during reload
@ 2023-06-19 12:50 Ido Schimmel
  2023-06-19 12:50 ` [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device Ido Schimmel
  2023-06-19 12:50 ` [RFC PATCH net-next 2/2] devlink: Acquire device lock during reload Ido Schimmel
  0 siblings, 2 replies; 14+ messages in thread
From: Ido Schimmel @ 2023-06-19 12:50 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, pabeni, edumazet, jiri, petrm, Ido Schimmel

These two patches change devlink to acquire the device lock during
reload. This is needed by device drivers that need to invoke a PCI reset
via PCI core during devlink reload, as PCI core requires the device lock
to be held. See detailed explanation in the second patch.

As a preparation for this change, devlink needs to hold a reference on
the underlying device since reload can be invoked asynchronously as part
of netns dismantle. Unfortunately, this change results in a crash which
I'm not sure how to solve. Detailed description and reproducer are
provided in the first patch.

Ido Schimmel (2):
  devlink: Hold a reference on parent device
  devlink: Acquire device lock during reload

 net/devlink/core.c          |  7 +++++--
 net/devlink/dev.c           |  8 ++++++++
 net/devlink/devl_internal.h | 19 ++++++++++++++++++-
 net/devlink/health.c        |  3 ++-
 net/devlink/leftover.c      |  4 +++-
 net/devlink/netlink.c       | 18 ++++++++++++------
 6 files changed, 48 insertions(+), 11 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-19 12:50 [RFC PATCH net-next 0/2] devlink: Acquire device lock during reload Ido Schimmel
@ 2023-06-19 12:50 ` Ido Schimmel
  2023-06-20  6:23   ` Jiri Pirko
  2023-06-21 11:48   ` Jiri Pirko
  2023-06-19 12:50 ` [RFC PATCH net-next 2/2] devlink: Acquire device lock during reload Ido Schimmel
  1 sibling, 2 replies; 14+ messages in thread
From: Ido Schimmel @ 2023-06-19 12:50 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, pabeni, edumazet, jiri, petrm, Ido Schimmel

Each devlink instance is associated with a parent device and a pointer
to this device is stored in the devlink structure, but devlink does not
hold a reference on this device.

This is going to be a problem in the next patch where - among other
things - devlink will acquire the device lock during netns dismantle,
before the reload operation. Since netns dismantle is performed
asynchronously and since a reference is not held on the parent device,
it will be possible to hit a use-after-free.

Prepare for the upcoming change by holding a reference on the parent
device.

Unfortunately, with this patch and this reproducer [1], the following
crash can be observed [2]. The last reference is released from the
device asynchronously - after an RCU grace period - when the netdevsim
module is no longer present. This causes device_release() to invoke a
release callback that is no longer present: nsim_bus_dev_release().

It's not clear to me if I'm doing something wrong in devlink (I don't
think so), if it's a bug in netdevsim or alternatively a bug in core
driver code that allows the bus module to go away before all the devices
that were connected to it are released.

The problem can be solved by devlink holding a reference on the backing
module (i.e., dev->driver->owner) or by each netdevsim device holding a
reference on the netdevsim module. However, this will prevent the
removal of the module when devices are present, something that is
possible today.

[1]
#!/bin/bash

for i in $(seq 1 1000); do
        echo "$i"
        insmod drivers/net/netdevsim/netdevsim.ko
        echo "10 0" > /sys/bus/netdevsim/new_device
        rmmod netdevsim
done

[2]
BUG: unable to handle page fault for address: ffffffffc0490910
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0
Oops: 0010 [#1] PREEMPT SMP
CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
Workqueue: events devlink_release
RIP: 0010:0xffffffffc0490910
Code: Unable to access opcode bytes at 0xffffffffc04908e6.
RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282
RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608
RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800
RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134
R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80
R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05
FS:  0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0
Call Trace:
 <TASK>
 ? __die+0x23/0x70
 ? page_fault_oops+0x181/0x470
 ? exc_page_fault+0xa6/0x140
 ? asm_exc_page_fault+0x26/0x30
 ? device_release+0x23/0x90
 ? device_release+0x34/0x90
 ? kobject_put+0x7d/0x1b0
 ? devlink_release+0x16/0x30
 ? process_one_work+0x1e0/0x3d0
 ? worker_thread+0x4e/0x3b0
 ? rescuer_thread+0x3a0/0x3a0
 ? kthread+0xe5/0x120
 ? kthread_complete_and_exit+0x20/0x20
 ? ret_from_fork+0x1f/0x30
 </TASK>
Modules linked in: [last unloaded: netdevsim]

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 net/devlink/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/devlink/core.c b/net/devlink/core.c
index c23ebabadc52..b191112f57af 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -4,6 +4,7 @@
  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
  */
 
+#include <linux/device.h>
 #include <net/genetlink.h>
 
 #include "devl_internal.h"
@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work)
 
 	mutex_destroy(&devlink->lock);
 	lockdep_unregister_key(&devlink->lock_key);
+	put_device(devlink->dev);
 	kfree(devlink);
 }
 
@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
 	if (ret < 0)
 		goto err_xa_alloc;
 
+	get_device(dev);
 	devlink->dev = dev;
 	devlink->ops = ops;
 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH net-next 2/2] devlink: Acquire device lock during reload
  2023-06-19 12:50 [RFC PATCH net-next 0/2] devlink: Acquire device lock during reload Ido Schimmel
  2023-06-19 12:50 ` [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device Ido Schimmel
@ 2023-06-19 12:50 ` Ido Schimmel
  1 sibling, 0 replies; 14+ messages in thread
From: Ido Schimmel @ 2023-06-19 12:50 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, pabeni, edumazet, jiri, petrm, Ido Schimmel

Device drivers register with devlink from their probe routines (under
the device lock) by acquiring the devlink instance lock and calling
devl_register().

Drivers that support a devlink reload usually implement the
reload_{down, up}() operations in a similar fashion to their remove and
probe routines, respectively.

However, while the remove and probe routines are invoked with the device
lock held, the reload operations are only invoked with the devlink
instance lock held. It is therefore impossible for drivers to acquire
the device lock from their reload operations, as this would result in
lock inversion.

The motivating use case for invoking the reload operations with the
device lock held is in mlxsw which needs to trigger a PCI reset as part
of the reload. The driver cannot call pci_reset_function() as this
function acquires the device lock. Instead, it needs to call
__pci_reset_function_locked which expects the device lock to be held.

To that end, adjust devlink to always acquire the device lock before the
devlink instance lock when performing a reload. Do that both when reload
is triggered explicitly by user space and when it is triggered as part
of netns dismantle.

Tested the following flows with netdevsim and mlxsw while lockdep is
enabled:

netdevsim:

 # echo "10 1" > /sys/bus/netdevsim/new_device
 # devlink dev reload netdevsim/netdevsim10
 # ip netns add bla
 # devlink dev reload netdevsim/netdevsim10 netns bla
 # ip netns del bla
 # echo 10 > /sys/bus/netdevsim/del_device

mlxsw:

 # devlink dev reload pci/0000:01:00.0
 # ip netns add bla
 # devlink dev reload pci/0000:01:00.0 netns bla
 # ip netns del bla
 # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
 # echo 1 > /sys/bus/pci/rescan

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 net/devlink/core.c          |  4 ++--
 net/devlink/dev.c           |  8 ++++++++
 net/devlink/devl_internal.h | 19 ++++++++++++++++++-
 net/devlink/health.c        |  3 ++-
 net/devlink/leftover.c      |  4 +++-
 net/devlink/netlink.c       | 18 ++++++++++++------
 6 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/net/devlink/core.c b/net/devlink/core.c
index b191112f57af..a4b6d548e50c 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -279,14 +279,14 @@ static void __net_exit devlink_pernet_pre_exit(struct net *net)
 	 * all devlink instances from this namespace into init_net.
 	 */
 	devlinks_xa_for_each_registered_get(net, index, devlink) {
-		devl_lock(devlink);
+		devl_dev_lock(devlink, true);
 		err = 0;
 		if (devl_is_registered(devlink))
 			err = devlink_reload(devlink, &init_net,
 					     DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
 					     DEVLINK_RELOAD_LIMIT_UNSPEC,
 					     &actions_performed, NULL);
-		devl_unlock(devlink);
+		devl_dev_unlock(devlink, true);
 		devlink_put(devlink);
 		if (err && err != -EOPNOTSUPP)
 			pr_warn("Failed to reload devlink instance into init_net\n");
diff --git a/net/devlink/dev.c b/net/devlink/dev.c
index bf1d6f1bcfc7..daee2039fb58 100644
--- a/net/devlink/dev.c
+++ b/net/devlink/dev.c
@@ -4,6 +4,7 @@
  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
  */
 
+#include <linux/device.h>
 #include <net/genetlink.h>
 #include <net/sock.h>
 #include "devl_internal.h"
@@ -356,6 +357,13 @@ int devlink_reload(struct devlink *devlink, struct net *dest_net,
 	struct net *curr_net;
 	int err;
 
+	/* Make sure the reload operations are invoked with the device lock
+	 * held to allow drivers to trigger functionality that expects it
+	 * (e.g., PCI reset) and to close possible races between these
+	 * operations and probe/remove.
+	 */
+	device_lock_assert(devlink->dev);
+
 	memcpy(remote_reload_stats, devlink->stats.remote_reload_stats,
 	       sizeof(remote_reload_stats));
 
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 62921b2eb0d3..99c3efbae718 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -3,6 +3,7 @@
  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
  */
 
+#include <linux/device.h>
 #include <linux/mutex.h>
 #include <linux/netdevice.h>
 #include <linux/notifier.h>
@@ -87,12 +88,27 @@ static inline bool devl_is_registered(struct devlink *devlink)
 	return xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED);
 }
 
+static inline void devl_dev_lock(struct devlink *devlink, bool dev_lock)
+{
+	if (dev_lock)
+		device_lock(devlink->dev);
+	devl_lock(devlink);
+}
+
+static inline void devl_dev_unlock(struct devlink *devlink, bool dev_lock)
+{
+	devl_unlock(devlink);
+	if (dev_lock)
+		device_unlock(devlink->dev);
+}
+
 /* Netlink */
 #define DEVLINK_NL_FLAG_NEED_PORT		BIT(0)
 #define DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT	BIT(1)
 #define DEVLINK_NL_FLAG_NEED_RATE		BIT(2)
 #define DEVLINK_NL_FLAG_NEED_RATE_NODE		BIT(3)
 #define DEVLINK_NL_FLAG_NEED_LINECARD		BIT(4)
+#define DEVLINK_NL_FLAG_NEED_DEV_LOCK		BIT(5)
 
 enum devlink_multicast_groups {
 	DEVLINK_MCGRP_CONFIG,
@@ -122,7 +138,8 @@ struct devlink_cmd {
 extern const struct genl_small_ops devlink_nl_ops[56];
 
 struct devlink *
-devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs);
+devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
+			    bool dev_lock);
 
 void devlink_notify_unregister(struct devlink *devlink);
 void devlink_notify_register(struct devlink *devlink);
diff --git a/net/devlink/health.c b/net/devlink/health.c
index 194340a8bb86..fa8ccdcffb7a 100644
--- a/net/devlink/health.c
+++ b/net/devlink/health.c
@@ -1253,7 +1253,8 @@ devlink_health_reporter_get_from_cb(struct netlink_callback *cb)
 	struct nlattr **attrs = info->attrs;
 	struct devlink *devlink;
 
-	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs);
+	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs,
+					      false);
 	if (IS_ERR(devlink))
 		return NULL;
 	devl_unlock(devlink);
diff --git a/net/devlink/leftover.c b/net/devlink/leftover.c
index 1f00f874471f..f4e6030e3b56 100644
--- a/net/devlink/leftover.c
+++ b/net/devlink/leftover.c
@@ -5185,7 +5185,8 @@ static int devlink_nl_cmd_region_read_dumpit(struct sk_buff *skb,
 
 	start_offset = state->start_offset;
 
-	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs);
+	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs,
+					      false);
 	if (IS_ERR(devlink))
 		return PTR_ERR(devlink);
 
@@ -6478,6 +6479,7 @@ const struct genl_small_ops devlink_nl_ops[56] = {
 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
 		.doit = devlink_nl_cmd_reload,
 		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NEED_DEV_LOCK,
 	},
 	{
 		.cmd = DEVLINK_CMD_PARAM_GET,
diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
index 7a332eb70f70..95fd8e3befea 100644
--- a/net/devlink/netlink.c
+++ b/net/devlink/netlink.c
@@ -83,7 +83,8 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 };
 
 struct devlink *
-devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs)
+devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
+			    bool dev_lock)
 {
 	struct devlink *devlink;
 	unsigned long index;
@@ -97,12 +98,12 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs)
 	devname = nla_data(attrs[DEVLINK_ATTR_DEV_NAME]);
 
 	devlinks_xa_for_each_registered_get(net, index, devlink) {
-		devl_lock(devlink);
+		devl_dev_lock(devlink, dev_lock);
 		if (devl_is_registered(devlink) &&
 		    strcmp(devlink->dev->bus->name, busname) == 0 &&
 		    strcmp(dev_name(devlink->dev), devname) == 0)
 			return devlink;
-		devl_unlock(devlink);
+		devl_dev_unlock(devlink, dev_lock);
 		devlink_put(devlink);
 	}
 
@@ -115,9 +116,12 @@ static int devlink_nl_pre_doit(const struct genl_split_ops *ops,
 	struct devlink_linecard *linecard;
 	struct devlink_port *devlink_port;
 	struct devlink *devlink;
+	bool dev_lock;
 	int err;
 
-	devlink = devlink_get_from_attrs_lock(genl_info_net(info), info->attrs);
+	dev_lock = !!(ops->internal_flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK);
+	devlink = devlink_get_from_attrs_lock(genl_info_net(info), info->attrs,
+					      dev_lock);
 	if (IS_ERR(devlink))
 		return PTR_ERR(devlink);
 
@@ -162,7 +166,7 @@ static int devlink_nl_pre_doit(const struct genl_split_ops *ops,
 	return 0;
 
 unlock:
-	devl_unlock(devlink);
+	devl_dev_unlock(devlink, dev_lock);
 	devlink_put(devlink);
 	return err;
 }
@@ -171,9 +175,11 @@ static void devlink_nl_post_doit(const struct genl_split_ops *ops,
 				 struct sk_buff *skb, struct genl_info *info)
 {
 	struct devlink *devlink;
+	bool dev_lock;
 
+	dev_lock = !!(ops->internal_flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK);
 	devlink = info->user_ptr[0];
-	devl_unlock(devlink);
+	devl_dev_unlock(devlink, dev_lock);
 	devlink_put(devlink);
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-19 12:50 ` [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device Ido Schimmel
@ 2023-06-20  6:23   ` Jiri Pirko
  2023-06-20  7:05     ` Ido Schimmel
  2023-06-21 11:48   ` Jiri Pirko
  1 sibling, 1 reply; 14+ messages in thread
From: Jiri Pirko @ 2023-06-20  6:23 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, kuba, pabeni, edumazet, petrm

Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
>Each devlink instance is associated with a parent device and a pointer
>to this device is stored in the devlink structure, but devlink does not
>hold a reference on this device.
>
>This is going to be a problem in the next patch where - among other
>things - devlink will acquire the device lock during netns dismantle,
>before the reload operation. Since netns dismantle is performed
>asynchronously and since a reference is not held on the parent device,
>it will be possible to hit a use-after-free.
>
>Prepare for the upcoming change by holding a reference on the parent
>device.
>
>Unfortunately, with this patch and this reproducer [1], the following
>crash can be observed [2]. The last reference is released from the
>device asynchronously - after an RCU grace period - when the netdevsim
>module is no longer present. This causes device_release() to invoke a
>release callback that is no longer present: nsim_bus_dev_release().
>
>It's not clear to me if I'm doing something wrong in devlink (I don't
>think so), if it's a bug in netdevsim or alternatively a bug in core
>driver code that allows the bus module to go away before all the devices
>that were connected to it are released.
>
>The problem can be solved by devlink holding a reference on the backing
>module (i.e., dev->driver->owner) or by each netdevsim device holding a
>reference on the netdevsim module. However, this will prevent the
>removal of the module when devices are present, something that is
>possible today.
>
>[1]
>#!/bin/bash
>
>for i in $(seq 1 1000); do
>        echo "$i"
>        insmod drivers/net/netdevsim/netdevsim.ko
>        echo "10 0" > /sys/bus/netdevsim/new_device
>        rmmod netdevsim
>done
>
>[2]
>BUG: unable to handle page fault for address: ffffffffc0490910
>#PF: supervisor instruction fetch in kernel mode
>#PF: error_code(0x0010) - not-present page
>PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0
>Oops: 0010 [#1] PREEMPT SMP
>CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299
>Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
>Workqueue: events devlink_release
>RIP: 0010:0xffffffffc0490910
>Code: Unable to access opcode bytes at 0xffffffffc04908e6.
>RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282
>RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608
>RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800
>RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134
>R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80
>R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05
>FS:  0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000
>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0
>Call Trace:
> <TASK>
> ? __die+0x23/0x70
> ? page_fault_oops+0x181/0x470
> ? exc_page_fault+0xa6/0x140
> ? asm_exc_page_fault+0x26/0x30
> ? device_release+0x23/0x90
> ? device_release+0x34/0x90
> ? kobject_put+0x7d/0x1b0
> ? devlink_release+0x16/0x30
> ? process_one_work+0x1e0/0x3d0
> ? worker_thread+0x4e/0x3b0
> ? rescuer_thread+0x3a0/0x3a0
> ? kthread+0xe5/0x120
> ? kthread_complete_and_exit+0x20/0x20
> ? ret_from_fork+0x1f/0x30
> </TASK>
>Modules linked in: [last unloaded: netdevsim]
>
>Signed-off-by: Ido Schimmel <idosch@nvidia.com>

Reviewed-by: Jiri Pirko <jiri@nvidia.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-20  6:23   ` Jiri Pirko
@ 2023-06-20  7:05     ` Ido Schimmel
  2023-06-20 17:43       ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: Ido Schimmel @ 2023-06-20  7:05 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, kuba, pabeni, edumazet, petrm

On Tue, Jun 20, 2023 at 08:23:26AM +0200, Jiri Pirko wrote:
> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
> >Each devlink instance is associated with a parent device and a pointer
> >to this device is stored in the devlink structure, but devlink does not
> >hold a reference on this device.
> >
> >This is going to be a problem in the next patch where - among other
> >things - devlink will acquire the device lock during netns dismantle,
> >before the reload operation. Since netns dismantle is performed
> >asynchronously and since a reference is not held on the parent device,
> >it will be possible to hit a use-after-free.
> >
> >Prepare for the upcoming change by holding a reference on the parent
> >device.
> >
> >Unfortunately, with this patch and this reproducer [1], the following
> >crash can be observed [2]. The last reference is released from the
> >device asynchronously - after an RCU grace period - when the netdevsim
> >module is no longer present. This causes device_release() to invoke a
> >release callback that is no longer present: nsim_bus_dev_release().
> >
> >It's not clear to me if I'm doing something wrong in devlink (I don't
> >think so), if it's a bug in netdevsim or alternatively a bug in core
> >driver code that allows the bus module to go away before all the devices
> >that were connected to it are released.
> >
> >The problem can be solved by devlink holding a reference on the backing
> >module (i.e., dev->driver->owner) or by each netdevsim device holding a
> >reference on the netdevsim module. However, this will prevent the
> >removal of the module when devices are present, something that is
> >possible today.
> >
> >[1]
> >#!/bin/bash
> >
> >for i in $(seq 1 1000); do
> >        echo "$i"
> >        insmod drivers/net/netdevsim/netdevsim.ko
> >        echo "10 0" > /sys/bus/netdevsim/new_device
> >        rmmod netdevsim
> >done
> >
> >[2]
> >BUG: unable to handle page fault for address: ffffffffc0490910
> >#PF: supervisor instruction fetch in kernel mode
> >#PF: error_code(0x0010) - not-present page
> >PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0
> >Oops: 0010 [#1] PREEMPT SMP
> >CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299
> >Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
> >Workqueue: events devlink_release
> >RIP: 0010:0xffffffffc0490910
> >Code: Unable to access opcode bytes at 0xffffffffc04908e6.
> >RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282
> >RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608
> >RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800
> >RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134
> >R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80
> >R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05
> >FS:  0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000
> >CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0
> >Call Trace:
> > <TASK>
> > ? __die+0x23/0x70
> > ? page_fault_oops+0x181/0x470
> > ? exc_page_fault+0xa6/0x140
> > ? asm_exc_page_fault+0x26/0x30
> > ? device_release+0x23/0x90
> > ? device_release+0x34/0x90
> > ? kobject_put+0x7d/0x1b0
> > ? devlink_release+0x16/0x30
> > ? process_one_work+0x1e0/0x3d0
> > ? worker_thread+0x4e/0x3b0
> > ? rescuer_thread+0x3a0/0x3a0
> > ? kthread+0xe5/0x120
> > ? kthread_complete_and_exit+0x20/0x20
> > ? ret_from_fork+0x1f/0x30
> > </TASK>
> >Modules linked in: [last unloaded: netdevsim]
> >
> >Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> 
> Reviewed-by: Jiri Pirko <jiri@nvidia.com>

Thanks, but I was hoping to get feedback on how to solve the problem
mentioned in the commit message :p

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-20  7:05     ` Ido Schimmel
@ 2023-06-20 17:43       ` Jakub Kicinski
  2023-06-21  6:31         ` Ido Schimmel
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2023-06-20 17:43 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: Jiri Pirko, netdev, davem, pabeni, edumazet, petrm

On Tue, 20 Jun 2023 10:05:23 +0300 Ido Schimmel wrote:
> On Tue, Jun 20, 2023 at 08:23:26AM +0200, Jiri Pirko wrote:
> > Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:  
>  [...]  
> > 
> > Reviewed-by: Jiri Pirko <jiri@nvidia.com>  
> 
> Thanks, but I was hoping to get feedback on how to solve the problem
> mentioned in the commit message :p

Do we need to hold the reference on the device until release?
I think you can release it in devlink_free().
The only valid fields for an unregistered devlink instance are:
 - lock
 - refcount
 - index

And obviously unregistered devices can't be reloaded.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-20 17:43       ` Jakub Kicinski
@ 2023-06-21  6:31         ` Ido Schimmel
  2023-06-21 19:03           ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: Ido Schimmel @ 2023-06-21  6:31 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Jiri Pirko, netdev, davem, pabeni, edumazet, petrm

On Tue, Jun 20, 2023 at 10:43:51AM -0700, Jakub Kicinski wrote:
> Do we need to hold the reference on the device until release?
> I think you can release it in devlink_free().
> The only valid fields for an unregistered devlink instance are:
>  - lock
>  - refcount
>  - index
> 
> And obviously unregistered devices can't be reloaded.

Thanks for taking a look.

Moving the release to devlink_free() [1] was the first thing I tried and
it indeed solves the problem I mentioned earlier, but creates a new one.
After devlink_free() returns the devlink instance can still be accessed
by user space in devlink_get_from_attrs_lock(). If I reload in a loop
while concurrently removing and adding the device [2], we can hit a UAF
when trying to acquire the device lock [3].

[1]
diff --git a/net/devlink/core.c b/net/devlink/core.c
index a4b6d548e50c..b60a8463d6e0 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -92,7 +92,6 @@ static void devlink_release(struct work_struct *work)
 
        mutex_destroy(&devlink->lock);
        lockdep_unregister_key(&devlink->lock_key);
-       put_device(devlink->dev);
        kfree(devlink);
 }
 
@@ -264,6 +263,8 @@ void devlink_free(struct devlink *devlink)
 
        xa_erase(&devlinks, devlink->index);
 
+       put_device(devlink->dev);
+
        devlink_put(devlink);
 }
 EXPORT_SYMBOL_GPL(devlink_free);

[2]
while true; do devlink dev reload netdevsim/netdevsim10 &> /dev/null; done &
while true; do echo "10 0" > /sys/bus/netdevsim/new_device; echo 10 > /sys/bus/netdevsim/del_device; done

[3]
[   96.081096] ==================================================================
[   96.082158] BUG: KASAN: slab-use-after-free in __mutex_lock+0x18a7/0x1ac0
[   96.083107] Read of size 8 at addr ffff888109caa8d8 by task devlink/429

[   96.084266] CPU: 0 PID: 429 Comm: devlink Not tainted 6.4.0-rc6-custom-gb01b0912311c-dirty #303
[   96.085443] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
[   96.086632] Call Trace:
[   96.086990]  <TASK>
[   96.087314]  dump_stack_lvl+0x91/0xf0
[   96.087852]  print_report+0xcf/0x660
[   96.088384]  ? __virt_addr_valid+0x86/0x360
[   96.088980]  ? __mutex_lock+0x18a7/0x1ac0
[   96.089558]  ? __mutex_lock+0x18a7/0x1ac0
[   96.090132]  kasan_report+0xd6/0x110
[   96.090667]  ? __mutex_lock+0x18a7/0x1ac0
[   96.091245]  __mutex_lock+0x18a7/0x1ac0
[   96.091898]  ? devlink_get_from_attrs_lock+0x2bc/0x460
[   96.092645]  ? mutex_lock_io_nested+0x18b0/0x18b0
[   96.093323]  ? reacquire_held_locks+0x4e0/0x4e0
[   96.093972]  ? devlink_try_get+0x158/0x1e0
[   96.094586]  ? devlink_get_from_attrs_lock+0x460/0x460
[   96.095325]  ? devlink_get_from_attrs_lock+0x2bc/0x460
[   96.096053]  devlink_get_from_attrs_lock+0x2bc/0x460
[   96.096767]  ? devlink_nl_post_doit+0xf0/0xf0
[   96.097409]  ? __nla_parse+0x40/0x50
[   96.097943]  ? devlink_get_from_attrs_lock+0x460/0x460
[   96.098671]  devlink_nl_pre_doit+0xb3/0x480
[   96.099255]  genl_family_rcv_msg_doit.isra.0+0x1b8/0x2e0
[   96.099966]  ? genl_start+0x670/0x670
[   96.100487]  ? ns_capable+0xda/0x110
[   96.100991]  genl_rcv_msg+0x558/0x7f0
[   96.101516]  ? genl_family_rcv_msg_doit.isra.0+0x2e0/0x2e0
[   96.102260]  ? devlink_get_from_attrs_lock+0x460/0x460
[   96.102925]  ? devlink_reload+0x540/0x540
[   96.103429]  ? devlink_pernet_pre_exit+0x340/0x340
[   96.104017]  netlink_rcv_skb+0x170/0x440
[   96.104508]  ? genl_family_rcv_msg_doit.isra.0+0x2e0/0x2e0
[   96.105166]  ? netlink_ack+0x1380/0x1380
[   96.105662]  ? lock_contended+0xc70/0xc70
[   96.106172]  ? rwsem_down_read_slowpath+0xda0/0xda0
[   96.106767]  ? netlink_deliver_tap+0x1b6/0xd90
[   96.107317]  ? is_vmalloc_addr+0x8b/0xb0
[   96.107804]  genl_rcv+0x2d/0x40
[   96.108213]  netlink_unicast+0x53f/0x810
[   96.108698]  ? netlink_attachskb+0x870/0x870
[   96.109230]  ? lock_release+0x3ac/0xbb0
[   96.109714]  netlink_sendmsg+0x95c/0xe80
[   96.110206]  ? netlink_unicast+0x810/0x810
[   96.110711]  ? __might_fault+0x15b/0x190
[   96.111194]  ? _copy_from_user+0x9f/0xd0
[   96.111691]  __sys_sendto+0x2aa/0x420
[   96.112148]  ? __ia32_sys_getpeername+0xb0/0xb0
[   96.112706]  ? reacquire_held_locks+0x4e0/0x4e0
[   96.113275]  ? rcu_is_watching+0x12/0xb0
[   96.113769]  ? blkcg_exit_disk+0x50/0x50
[   96.114260]  __x64_sys_sendto+0xe5/0x1c0
[   96.114742]  ? lockdep_hardirqs_on+0x7d/0x100
[   96.115285]  ? syscall_enter_from_user_mode+0x20/0x50
[   96.115899]  do_syscall_64+0x38/0x80
[   96.116352]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   96.116964] RIP: 0033:0x7fa073f72fa7
[   96.117417] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 80 3d 5d d6 0c 00 00 41 89 ca 74 10 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 71 c3 55 48 83 ec 30 44 89 4c 24 2c 4c 89 44
[   96.119548] RSP: 002b:00007ffca2723938 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
[   96.120449] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fa073f72fa7
[   96.121287] RDX: 0000000000000034 RSI: 000055883950a460 RDI: 0000000000000003
[   96.122120] RBP: 0000558837836e60 R08: 00007fa074050200 R09: 000000000000000c
[   96.122951] R10: 0000000000000000 R11: 0000000000000202 R12: 000055883950a2a0
[   96.123789] R13: 000055883950a460 R14: 000055883784eeab R15: 000055883950a2a0
[   96.124638]  </TASK>

[   96.125120] Allocated by task 209:
[   96.125543]  kasan_save_stack+0x33/0x50
[   96.125567]  kasan_set_track+0x25/0x30
[   96.125587]  __kasan_kmalloc+0x87/0x90
[   96.125606]  new_device_store+0x22d/0x6a0
[   96.125625]  bus_attr_store+0x7b/0xa0
[   96.125641]  sysfs_kf_write+0x11c/0x170
[   96.125659]  kernfs_fop_write_iter+0x3f7/0x600
[   96.125676]  vfs_write+0x680/0xe90
[   96.125691]  ksys_write+0x143/0x270
[   96.125707]  do_syscall_64+0x38/0x80
[   96.125722]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

[   96.125947] Freed by task 209:
[   96.126332]  kasan_save_stack+0x33/0x50
[   96.126355]  kasan_set_track+0x25/0x30
[   96.126374]  kasan_save_free_info+0x2e/0x40
[   96.126389]  __kasan_slab_free+0x10a/0x180
[   96.126409]  __kmem_cache_free+0x8a/0x1a0
[   96.126429]  device_release+0xa6/0x240
[   96.126449]  kobject_put+0x1f7/0x5b0
[   96.126465]  device_unregister+0x34/0xc0
[   96.126478]  del_device_store+0x308/0x430
[   96.126496]  bus_attr_store+0x7b/0xa0
[   96.126511]  sysfs_kf_write+0x11c/0x170
[   96.126528]  kernfs_fop_write_iter+0x3f7/0x600
[   96.126545]  vfs_write+0x680/0xe90
[   96.126560]  ksys_write+0x143/0x270
[   96.126575]  do_syscall_64+0x38/0x80
[   96.126590]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

[   96.126815] The buggy address belongs to the object at ffff888109caa800
                which belongs to the cache kmalloc-1k of size 1024
[   96.128252] The buggy address is located 216 bytes inside of
                freed 1024-byte region [ffff888109caa800, ffff888109caac00)

[   96.129875] The buggy address belongs to the physical page:
[   96.130540] page:00000000c7912b23 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109ca8
[   96.130562] head:00000000c7912b23 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[   96.130575] flags: 0x200000000010200(slab|head|node=0|zone=2)
[   96.130591] page_type: 0xffffffff()
[   96.130607] raw: 0200000000010200 ffff888100041dc0 dead000000000100 dead000000000122
[   96.130622] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[   96.130632] page dumped because: kasan: bad access detected

[   96.130844] Memory state around the buggy address:
[   96.131424]  ffff888109caa780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   96.132266]  ffff888109caa800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   96.133101] >ffff888109caa880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   96.133944]                                                     ^
[   96.134666]  ffff888109caa900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   96.135510]  ffff888109caa980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   96.136351] ==================================================================

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-19 12:50 ` [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device Ido Schimmel
  2023-06-20  6:23   ` Jiri Pirko
@ 2023-06-21 11:48   ` Jiri Pirko
  2023-06-21 15:35     ` Ido Schimmel
  1 sibling, 1 reply; 14+ messages in thread
From: Jiri Pirko @ 2023-06-21 11:48 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, kuba, pabeni, edumazet, petrm

Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
>Each devlink instance is associated with a parent device and a pointer
>to this device is stored in the devlink structure, but devlink does not
>hold a reference on this device.
>
>This is going to be a problem in the next patch where - among other
>things - devlink will acquire the device lock during netns dismantle,
>before the reload operation. Since netns dismantle is performed
>asynchronously and since a reference is not held on the parent device,
>it will be possible to hit a use-after-free.
>
>Prepare for the upcoming change by holding a reference on the parent
>device.
>
>Unfortunately, with this patch and this reproducer [1], the following
>crash can be observed [2]. The last reference is released from the
>device asynchronously - after an RCU grace period - when the netdevsim
>module is no longer present. This causes device_release() to invoke a
>release callback that is no longer present: nsim_bus_dev_release().
>
>It's not clear to me if I'm doing something wrong in devlink (I don't
>think so), if it's a bug in netdevsim or alternatively a bug in core
>driver code that allows the bus module to go away before all the devices
>that were connected to it are released.
>
>The problem can be solved by devlink holding a reference on the backing
>module (i.e., dev->driver->owner) or by each netdevsim device holding a
>reference on the netdevsim module. However, this will prevent the
>removal of the module when devices are present, something that is
>possible today.
>
>[1]
>#!/bin/bash
>
>for i in $(seq 1 1000); do
>        echo "$i"
>        insmod drivers/net/netdevsim/netdevsim.ko
>        echo "10 0" > /sys/bus/netdevsim/new_device
>        rmmod netdevsim
>done
>
>[2]
>BUG: unable to handle page fault for address: ffffffffc0490910
>#PF: supervisor instruction fetch in kernel mode
>#PF: error_code(0x0010) - not-present page
>PGD 12e040067 P4D 12e040067 PUD 12e042067 PMD 100a38067 PTE 0
>Oops: 0010 [#1] PREEMPT SMP
>CPU: 0 PID: 138 Comm: kworker/0:2 Not tainted 6.4.0-rc5-custom-g42e05937ca59 #299
>Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
>Workqueue: events devlink_release
>RIP: 0010:0xffffffffc0490910
>Code: Unable to access opcode bytes at 0xffffffffc04908e6.
>RSP: 0018:ffffb487802f3e40 EFLAGS: 00010282
>RAX: ffffffffc0490910 RBX: ffff92e6c0057800 RCX: 0001020304050608
>RDX: 0000000000000001 RSI: ffffffff92b7d763 RDI: ffff92e6c0057800
>RBP: ffff92e6c1ef0a00 R08: ffff92e6c0055158 R09: ffff92e6c2be9134
>R10: 0000000000000018 R11: fefefefefefefeff R12: ffffffff934c3e80
>R13: ffff92e6c2a1a740 R14: 0000000000000000 R15: ffff92e7f7c30b05
>FS:  0000000000000000(0000) GS:ffff92e7f7c00000(0000) knlGS:0000000000000000
>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>CR2: ffffffffc04908e6 CR3: 0000000101f1a004 CR4: 0000000000170ef0
>Call Trace:
> <TASK>
> ? __die+0x23/0x70
> ? page_fault_oops+0x181/0x470
> ? exc_page_fault+0xa6/0x140
> ? asm_exc_page_fault+0x26/0x30
> ? device_release+0x23/0x90
> ? device_release+0x34/0x90
> ? kobject_put+0x7d/0x1b0
> ? devlink_release+0x16/0x30
> ? process_one_work+0x1e0/0x3d0
> ? worker_thread+0x4e/0x3b0
> ? rescuer_thread+0x3a0/0x3a0
> ? kthread+0xe5/0x120
> ? kthread_complete_and_exit+0x20/0x20
> ? ret_from_fork+0x1f/0x30
> </TASK>
>Modules linked in: [last unloaded: netdevsim]
>
>Signed-off-by: Ido Schimmel <idosch@nvidia.com>
>---
> net/devlink/core.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/net/devlink/core.c b/net/devlink/core.c
>index c23ebabadc52..b191112f57af 100644
>--- a/net/devlink/core.c
>+++ b/net/devlink/core.c
>@@ -4,6 +4,7 @@
>  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
>  */
> 
>+#include <linux/device.h>
> #include <net/genetlink.h>
> 
> #include "devl_internal.h"
>@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work)
> 
> 	mutex_destroy(&devlink->lock);
> 	lockdep_unregister_key(&devlink->lock_key);
>+	put_device(devlink->dev);

In this case I think you have to make sure this is called before
devlink_free() ends. After the caller of devlink_free() returns (most
probably .remove callback), nothing stops module from being removed.

I don't see other way. Utilize complete() here and wait_for_completion()
at the end of devlink_free().

If the completion in devlink_put() area rings a bell for you, let me save
you the trouble looking it up:
9053637e0da7 ("devlink: remove the registration guarantee of references")
This commit removed that. But it is a different usage.



> 	kfree(devlink);
> }
> 
>@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
> 	if (ret < 0)
> 		goto err_xa_alloc;
> 
>+	get_device(dev);
> 	devlink->dev = dev;
> 	devlink->ops = ops;
> 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
>-- 
>2.40.1
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-21 11:48   ` Jiri Pirko
@ 2023-06-21 15:35     ` Ido Schimmel
  2023-06-22  6:29       ` Jiri Pirko
  0 siblings, 1 reply; 14+ messages in thread
From: Ido Schimmel @ 2023-06-21 15:35 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, kuba, pabeni, edumazet, petrm

On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote:
> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
> >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work)
> > 
> > 	mutex_destroy(&devlink->lock);
> > 	lockdep_unregister_key(&devlink->lock_key);
> >+	put_device(devlink->dev);
> 
> In this case I think you have to make sure this is called before
> devlink_free() ends. After the caller of devlink_free() returns (most
> probably .remove callback), nothing stops module from being removed.
> 
> I don't see other way. Utilize complete() here and wait_for_completion()
> at the end of devlink_free().

I might be missing something, but how can I do something like
wait_for_completion(&devlink->comp) at the end of devlink_free()? After
I call devlink_put() the devlink instance can be freed and the
wait_for_completion() call will result in a UAF.

> 
> If the completion in devlink_put() area rings a bell for you, let me save
> you the trouble looking it up:
> 9053637e0da7 ("devlink: remove the registration guarantee of references")
> This commit removed that. But it is a different usage.
> 
> 
> 
> > 	kfree(devlink);
> > }
> > 
> >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
> > 	if (ret < 0)
> > 		goto err_xa_alloc;
> > 
> >+	get_device(dev);
> > 	devlink->dev = dev;
> > 	devlink->ops = ops;
> > 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
> >-- 
> >2.40.1
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-21  6:31         ` Ido Schimmel
@ 2023-06-21 19:03           ` Jakub Kicinski
  2023-06-22  6:03             ` Ido Schimmel
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2023-06-21 19:03 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: Jiri Pirko, netdev, davem, pabeni, edumazet, petrm

On Wed, 21 Jun 2023 09:31:43 +0300 Ido Schimmel wrote:
> Thanks for taking a look.
> 
> Moving the release to devlink_free() [1] was the first thing I tried and
> it indeed solves the problem I mentioned earlier, but creates a new one.
> After devlink_free() returns the devlink instance can still be accessed
> by user space in devlink_get_from_attrs_lock(). If I reload in a loop
> while concurrently removing and adding the device [2], we can hit a UAF
> when trying to acquire the device lock [3].

Ugh, I didn't look at the second patch, it's taking the device lock
before validating that the devlink instance is registered. 
So we need to extend the list of fields which must always be valid :(

Let's try to fix it at the netdevsim level then? AFAIU we only need the
bus to remain loaded for nsim_bus_dev_release to exist? What if we split
netdevsim into two modules, put the bus stuff in a new module called
netdevsim_bus, and leave the rest (driver) in just netdevsim. That way
we can take a ref on netdevsim_bus until all devices are gone, and still
load / unload netdevsim. With unload resulting in all devices getting
auto-deleted.

I haven't looked in detail so maybe you'll immediately tell me it won't
work, but I'm guessing this is how "real" buses work avoid the problem?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-21 19:03           ` Jakub Kicinski
@ 2023-06-22  6:03             ` Ido Schimmel
  0 siblings, 0 replies; 14+ messages in thread
From: Ido Schimmel @ 2023-06-22  6:03 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Jiri Pirko, netdev, davem, pabeni, edumazet, petrm

On Wed, Jun 21, 2023 at 12:03:57PM -0700, Jakub Kicinski wrote:
> Let's try to fix it at the netdevsim level then?

I think it's a good direction. I couldn't find a way to fix it in
devlink and the problem does seem specific to netdevsim.

> AFAIU we only need the bus to remain loaded for nsim_bus_dev_release
> to exist?

That's my understanding as well.

> What if we split netdevsim into two modules, put the bus stuff in a
> new module called netdevsim_bus, and leave the rest (driver) in just
> netdevsim. That way we can take a ref on netdevsim_bus until all
> devices are gone, and still load / unload netdevsim. With unload
> resulting in all devices getting auto-deleted.
> 
> I haven't looked in detail so maybe you'll immediately tell me it won't
> work, but I'm guessing this is how "real" buses work avoid the problem?

At least with PCI (which I believe is the bus backing most devlink
users), the release callback is builtin - not part of a module - so this
problem can't happen there.

Anyway, I will explore this direction.

Thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-21 15:35     ` Ido Schimmel
@ 2023-06-22  6:29       ` Jiri Pirko
  2023-06-25 11:55         ` Ido Schimmel
  0 siblings, 1 reply; 14+ messages in thread
From: Jiri Pirko @ 2023-06-22  6:29 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, kuba, pabeni, edumazet, petrm

Wed, Jun 21, 2023 at 05:35:15PM CEST, idosch@nvidia.com wrote:
>On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote:
>> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
>> >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work)
>> > 
>> > 	mutex_destroy(&devlink->lock);
>> > 	lockdep_unregister_key(&devlink->lock_key);
>> >+	put_device(devlink->dev);
>> 
>> In this case I think you have to make sure this is called before
>> devlink_free() ends. After the caller of devlink_free() returns (most
>> probably .remove callback), nothing stops module from being removed.
>> 
>> I don't see other way. Utilize complete() here and wait_for_completion()
>> at the end of devlink_free().
>
>I might be missing something, but how can I do something like
>wait_for_completion(&devlink->comp) at the end of devlink_free()? After
>I call devlink_put() the devlink instance can be freed and the
>wait_for_completion() call will result in a UAF.

You have to move the free() to devlink_free()
Basically, all the things done in devlink_put that are symmetrical to
the initialization done in devlink_alloc() should be moved there.


>
>> 
>> If the completion in devlink_put() area rings a bell for you, let me save
>> you the trouble looking it up:
>> 9053637e0da7 ("devlink: remove the registration guarantee of references")
>> This commit removed that. But it is a different usage.
>> 
>> 
>> 
>> > 	kfree(devlink);
>> > }
>> > 
>> >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
>> > 	if (ret < 0)
>> > 		goto err_xa_alloc;
>> > 
>> >+	get_device(dev);
>> > 	devlink->dev = dev;
>> > 	devlink->ops = ops;
>> > 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
>> >-- 
>> >2.40.1
>> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-22  6:29       ` Jiri Pirko
@ 2023-06-25 11:55         ` Ido Schimmel
  2023-06-27 10:13           ` Jiri Pirko
  0 siblings, 1 reply; 14+ messages in thread
From: Ido Schimmel @ 2023-06-25 11:55 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, kuba, pabeni, edumazet, petrm

On Thu, Jun 22, 2023 at 08:29:31AM +0200, Jiri Pirko wrote:
> Wed, Jun 21, 2023 at 05:35:15PM CEST, idosch@nvidia.com wrote:
> >On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote:
> >> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
> >> >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work)
> >> > 
> >> > 	mutex_destroy(&devlink->lock);
> >> > 	lockdep_unregister_key(&devlink->lock_key);
> >> >+	put_device(devlink->dev);
> >> 
> >> In this case I think you have to make sure this is called before
> >> devlink_free() ends. After the caller of devlink_free() returns (most
> >> probably .remove callback), nothing stops module from being removed.
> >> 
> >> I don't see other way. Utilize complete() here and wait_for_completion()
> >> at the end of devlink_free().
> >
> >I might be missing something, but how can I do something like
> >wait_for_completion(&devlink->comp) at the end of devlink_free()? After
> >I call devlink_put() the devlink instance can be freed and the
> >wait_for_completion() call will result in a UAF.
> 
> You have to move the free() to devlink_free()
> Basically, all the things done in devlink_put that are symmetrical to
> the initialization done in devlink_alloc() should be moved there.

But it's a bit weird to dereference 'devlink' (to wait for the
completion) after calling 'devlink_put(devlink)'. Given that this
problem seems to be specific to netdevsim, don't you think it's better
to fix it in netdevsim rather than working around it in devlink?

> 
> 
> >
> >> 
> >> If the completion in devlink_put() area rings a bell for you, let me save
> >> you the trouble looking it up:
> >> 9053637e0da7 ("devlink: remove the registration guarantee of references")
> >> This commit removed that. But it is a different usage.
> >> 
> >> 
> >> 
> >> > 	kfree(devlink);
> >> > }
> >> > 
> >> >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
> >> > 	if (ret < 0)
> >> > 		goto err_xa_alloc;
> >> > 
> >> >+	get_device(dev);
> >> > 	devlink->dev = dev;
> >> > 	devlink->ops = ops;
> >> > 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
> >> >-- 
> >> >2.40.1
> >> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device
  2023-06-25 11:55         ` Ido Schimmel
@ 2023-06-27 10:13           ` Jiri Pirko
  0 siblings, 0 replies; 14+ messages in thread
From: Jiri Pirko @ 2023-06-27 10:13 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, kuba, pabeni, edumazet, petrm

Sun, Jun 25, 2023 at 01:55:36PM CEST, idosch@nvidia.com wrote:
>On Thu, Jun 22, 2023 at 08:29:31AM +0200, Jiri Pirko wrote:
>> Wed, Jun 21, 2023 at 05:35:15PM CEST, idosch@nvidia.com wrote:
>> >On Wed, Jun 21, 2023 at 01:48:36PM +0200, Jiri Pirko wrote:
>> >> Mon, Jun 19, 2023 at 02:50:14PM CEST, idosch@nvidia.com wrote:
>> >> >@@ -91,6 +92,7 @@ static void devlink_release(struct work_struct *work)
>> >> > 
>> >> > 	mutex_destroy(&devlink->lock);
>> >> > 	lockdep_unregister_key(&devlink->lock_key);
>> >> >+	put_device(devlink->dev);
>> >> 
>> >> In this case I think you have to make sure this is called before
>> >> devlink_free() ends. After the caller of devlink_free() returns (most
>> >> probably .remove callback), nothing stops module from being removed.
>> >> 
>> >> I don't see other way. Utilize complete() here and wait_for_completion()
>> >> at the end of devlink_free().
>> >
>> >I might be missing something, but how can I do something like
>> >wait_for_completion(&devlink->comp) at the end of devlink_free()? After
>> >I call devlink_put() the devlink instance can be freed and the
>> >wait_for_completion() call will result in a UAF.
>> 
>> You have to move the free() to devlink_free()
>> Basically, all the things done in devlink_put that are symmetrical to
>> the initialization done in devlink_alloc() should be moved there.
>
>But it's a bit weird to dereference 'devlink' (to wait for the
>completion) after calling 'devlink_put(devlink)'. Given that this

Well, I don't see any problem in that.

>problem seems to be specific to netdevsim, don't you think it's better
>to fix it in netdevsim rather than working around it in devlink?

Up to you, both work.


>
>> 
>> 
>> >
>> >> 
>> >> If the completion in devlink_put() area rings a bell for you, let me save
>> >> you the trouble looking it up:
>> >> 9053637e0da7 ("devlink: remove the registration guarantee of references")
>> >> This commit removed that. But it is a different usage.
>> >> 
>> >> 
>> >> 
>> >> > 	kfree(devlink);
>> >> > }
>> >> > 
>> >> >@@ -204,6 +206,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
>> >> > 	if (ret < 0)
>> >> > 		goto err_xa_alloc;
>> >> > 
>> >> >+	get_device(dev);
>> >> > 	devlink->dev = dev;
>> >> > 	devlink->ops = ops;
>> >> > 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
>> >> >-- 
>> >> >2.40.1
>> >> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-06-27 10:14 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-19 12:50 [RFC PATCH net-next 0/2] devlink: Acquire device lock during reload Ido Schimmel
2023-06-19 12:50 ` [RFC PATCH net-next 1/2] devlink: Hold a reference on parent device Ido Schimmel
2023-06-20  6:23   ` Jiri Pirko
2023-06-20  7:05     ` Ido Schimmel
2023-06-20 17:43       ` Jakub Kicinski
2023-06-21  6:31         ` Ido Schimmel
2023-06-21 19:03           ` Jakub Kicinski
2023-06-22  6:03             ` Ido Schimmel
2023-06-21 11:48   ` Jiri Pirko
2023-06-21 15:35     ` Ido Schimmel
2023-06-22  6:29       ` Jiri Pirko
2023-06-25 11:55         ` Ido Schimmel
2023-06-27 10:13           ` Jiri Pirko
2023-06-19 12:50 ` [RFC PATCH net-next 2/2] devlink: Acquire device lock during reload Ido Schimmel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).