netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use
@ 2023-10-18 15:47 Antoine Tenart
  2023-10-18 15:47 ` [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes Antoine Tenart
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Antoine Tenart @ 2023-10-18 15:47 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet
  Cc: Antoine Tenart, netdev, gregkh, mhocko, stephen

Hi,

This is sent as an RFC because I believe this should be discussed (and
some might want to do additional testing), but the code itself is ready.

Some time ago we tried to improve the rtnl_trylock/restart_syscall
situation[1]. What happens is when there is rtnl contention, userspace
accessing net sysfs attributes will spin and experience delays. This can
happen in different situations, when sysfs attributes are accessed
(networking daemon, configuration, monitoring) while operations under
rtnl are performed (veth creation, driver configuration, etc). A few
improvements can be done in userspace to ease things, like using the
netlink interface instead, or polling less (or more selectively) the
attributes; but in the end the root cause is always there and this keeps
happening from time to time.

That initial effort however wasn't successful, although I think there
was an interest, mostly because we found technical flaws and didn't find
a working solution at the time. Some time later, we gave it a new try
and found something more promising, but the patches fell off my radar. I
recently had another look at this series, made more tests and cleaned it
up.

The technical aspect is described in patch 1 directly in the code
comments, with an additional important comment in patch 3. This was
mostly tested by stress-testing net sysfs attributes (read/write ops)
while adding/removing queues and adding/removing veths, all in parallel.

All comments are welcomed.

Thanks,
Antoine

[1] https://lore.kernel.org/all/20210928125500.167943-1-atenart@kernel.org/T/

Antoine Tenart (4):
  net-sysfs: remove rtnl_trylock from device attributes
  net-sysfs: move queue attribute groups outside the default groups
  net-sysfs: prevent uncleared queues from being re-added
  net-sysfs: remove rtnl_trylock from queue attributes

 include/linux/netdevice.h     |   1 +
 include/net/netdev_rx_queue.h |   1 +
 net/core/net-sysfs.c          | 329 ++++++++++++++++++++++++----------
 3 files changed, 237 insertions(+), 94 deletions(-)

-- 
2.41.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2023-10-18 15:47 [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Antoine Tenart
@ 2023-10-18 15:47 ` Antoine Tenart
  2023-10-18 16:49   ` Greg KH
                     ` (2 more replies)
  2023-10-18 15:47 ` [RFC PATCH net-next 2/4] net-sysfs: move queue attribute groups outside the default groups Antoine Tenart
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 23+ messages in thread
From: Antoine Tenart @ 2023-10-18 15:47 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet
  Cc: Antoine Tenart, netdev, gregkh, mhocko, stephen

We have an ABBA deadlock between net device unregistration and sysfs
files being accessed[1][2]. To prevent this from happening all paths
taking the rtnl lock after the sysfs one (actually kn->active refcount)
use rtnl_trylock and return early (using restart_syscall)[3] which can
make syscalls to spin for a long time when there is contention on the
rtnl lock[4].

There are not many possibilities to improve the above:
- Rework the entire net/ locking logic.
- Invert two locks in one of the paths — not possible.

But here it's actually possible to drop one of the locks safely: the
kernfs_node refcount. More details in the code itself, which comes with
lots of comments.

[1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/
[2] https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/
[3] https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/
[4] https://lore.kernel.org/all/20210928125500.167943-1-atenart@kernel.org/T/

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 net/core/net-sysfs.c | 142 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 112 insertions(+), 30 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index fccaa5bac0ed..76d8729340b7 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -40,6 +40,88 @@ static inline int dev_isalive(const struct net_device *dev)
 	return dev->reg_state <= NETREG_REGISTERED;
 }
 
+/* We have a possible ABBA deadlock between rtnl_lock and kernfs_node->active,
+ * when unregistering a net device and accessing associated sysfs files. Tx/Rx
+ * queues do embed their own kobject for their sysfs files but the same issue
+ * applies as a net device being unresgistered will remove those sysfs files as
+ * well. The potential deadlock is as follow:
+ *
+ *         CPU 0                                         CPU 1
+ *
+ *    rtnl_lock                                   vfs_read
+ *    unregister_netdevice_many                   kernfs_seq_start
+ *    device_del / kobject_put                      kernfs_get_active (kn->active++)
+ *    kernfs_drain                                sysfs_kf_seq_show
+ *    wait_event(                                 rtnl_lock
+ *       kn->active == KN_DEACTIVATED_BIAS)       -> waits on CPU 0 to release
+ *    -> waits on CPU 1 to decrease kn->active       the rtnl lock.
+ *
+ * The historical fix was to use rtnl_trylock with restart_syscall to bail out
+ * of sysfs accesses when the lock wasn't taken. This fixed the above issue as
+ * it allowed CPU 1 to bail out of the ABBA situation.
+ *
+ * But this comes with performances issues, as syscalls are being restarted in
+ * loops when there is contention, with huge slow downs in specific scenarios
+ * (e.g. lots of virtual interfaces created at startup and userspace daemons
+ * querying their attributes).
+ *
+ * The idea below is to bail out of the active kernfs_node protection
+ * (kn->active) while still being in a safe position to continue the execution
+ * of our sysfs helpers.
+ */
+static inline struct kernfs_node *sysfs_rtnl_lock(struct kobject *kobj,
+						  struct attribute *attr,
+						  struct net_device *ndev)
+{
+	struct kernfs_node *kn;
+
+	/* First, we hold a reference to the net device we might use in the
+	 * locking section as the unregistration path might run in parallel.
+	 * This will ensure the net device won't be freed before we return.
+	 */
+	dev_hold(ndev);
+	/* sysfs_break_active_protection was introduced to allow self-removal of
+	 * devices and their associated sysfs files by bailing out of the
+	 * sysfs/kernfs protection. We do this here to allow the unregistration
+	 * path to complete in parallel. The following takes a reference on the
+	 * kobject and the kernfs_node being accessed.
+	 *
+	 * This works because we hold a reference onto the net device and the
+	 * unregistration path will wait for us eventually in netdev_run_todo
+	 * (outside an rtnl lock section).
+	 */
+	kn = sysfs_break_active_protection(kobj, attr);
+	WARN_ON_ONCE(!kn);
+	/* Finally we do take the rtnl lock. This can't deadlock us as the
+	 * unregistration path will be able to drain sysfs files (kernfs_node).
+	 */
+	rtnl_lock();
+	return kn;
+}
+
+static inline void __sysfs_rtnl_unlock(struct net_device *ndev,
+				       struct kernfs_node *kn)
+{
+	rtnl_unlock();
+	/* This will drop our references on the kobject and kernfs_node. A
+	 * limitation is we can't have the sysfs removal logic depend on the
+	 * kobject reference counting, we have to do this manually in our
+	 * unregistration paths.
+	 */
+	if (kn)
+		sysfs_unbreak_active_protection(kn);
+}
+
+static inline void sysfs_rtnl_unlock(struct net_device *ndev,
+				     struct kernfs_node *kn)
+{
+	__sysfs_rtnl_unlock(ndev, kn);
+	/* Might unlock the last unregistration path step. It's not safe to
+	 * access the net device after this.
+	 */
+	dev_put(ndev);
+}
+
 /* use same locking rules as GIF* ioctl's */
 static ssize_t netdev_show(const struct device *dev,
 			   struct device_attribute *attr, char *buf,
@@ -83,6 +165,7 @@ static ssize_t netdev_store(struct device *dev, struct device_attribute *attr,
 {
 	struct net_device *netdev = to_net_dev(dev);
 	struct net *net = dev_net(netdev);
+	struct kernfs_node *kn;
 	unsigned long new;
 	int ret;
 
@@ -93,15 +176,14 @@ static ssize_t netdev_store(struct device *dev, struct device_attribute *attr,
 	if (ret)
 		goto err;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(&dev->kobj, &attr->attr, netdev);
 
 	if (dev_isalive(netdev)) {
 		ret = (*set)(netdev, new);
 		if (ret == 0)
 			ret = len;
 	}
-	rtnl_unlock();
+	sysfs_rtnl_unlock(netdev, kn);
  err:
 	return ret;
 }
@@ -181,7 +263,7 @@ static ssize_t carrier_store(struct device *dev, struct device_attribute *attr,
 	struct net_device *netdev = to_net_dev(dev);
 
 	/* The check is also done in change_carrier; this helps returning early
-	 * without hitting the trylock/restart in netdev_store.
+	 * without hitting the locking section in netdev_store.
 	 */
 	if (!netdev->netdev_ops->ndo_change_carrier)
 		return -EOPNOTSUPP;
@@ -205,16 +287,16 @@ static ssize_t speed_show(struct device *dev,
 			  struct device_attribute *attr, char *buf)
 {
 	struct net_device *netdev = to_net_dev(dev);
+	struct kernfs_node *kn;
 	int ret = -EINVAL;
 
 	/* The check is also done in __ethtool_get_link_ksettings; this helps
-	 * returning early without hitting the trylock/restart below.
+	 * returning early without hitting the locking section below.
 	 */
 	if (!netdev->ethtool_ops->get_link_ksettings)
 		return ret;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(&dev->kobj, &attr->attr, netdev);
 
 	if (netif_running(netdev) && netif_device_present(netdev)) {
 		struct ethtool_link_ksettings cmd;
@@ -222,7 +304,7 @@ static ssize_t speed_show(struct device *dev,
 		if (!__ethtool_get_link_ksettings(netdev, &cmd))
 			ret = sysfs_emit(buf, fmt_dec, cmd.base.speed);
 	}
-	rtnl_unlock();
+	sysfs_rtnl_unlock(netdev, kn);
 	return ret;
 }
 static DEVICE_ATTR_RO(speed);
@@ -231,16 +313,16 @@ static ssize_t duplex_show(struct device *dev,
 			   struct device_attribute *attr, char *buf)
 {
 	struct net_device *netdev = to_net_dev(dev);
+	struct kernfs_node *kn;
 	int ret = -EINVAL;
 
 	/* The check is also done in __ethtool_get_link_ksettings; this helps
-	 * returning early without hitting the trylock/restart below.
+	 * returning early without hitting the locking section below.
 	 */
 	if (!netdev->ethtool_ops->get_link_ksettings)
 		return ret;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(&dev->kobj, &attr->attr, netdev);
 
 	if (netif_running(netdev)) {
 		struct ethtool_link_ksettings cmd;
@@ -262,7 +344,7 @@ static ssize_t duplex_show(struct device *dev,
 			ret = sysfs_emit(buf, "%s\n", duplex);
 		}
 	}
-	rtnl_unlock();
+	sysfs_rtnl_unlock(netdev, kn);
 	return ret;
 }
 static DEVICE_ATTR_RO(duplex);
@@ -428,6 +510,7 @@ static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 {
 	struct net_device *netdev = to_net_dev(dev);
 	struct net *net = dev_net(netdev);
+	struct kernfs_node *kn;
 	size_t count = len;
 	ssize_t ret = 0;
 
@@ -438,8 +521,7 @@ static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 	if (len >  0 && buf[len - 1] == '\n')
 		--count;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(&dev->kobj, &attr->attr, netdev);
 
 	if (dev_isalive(netdev)) {
 		ret = dev_set_alias(netdev, buf, count);
@@ -449,7 +531,7 @@ static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 		netdev_state_change(netdev);
 	}
 err:
-	rtnl_unlock();
+	sysfs_rtnl_unlock(netdev, kn);
 
 	return ret;
 }
@@ -499,16 +581,16 @@ static ssize_t phys_port_id_show(struct device *dev,
 				 struct device_attribute *attr, char *buf)
 {
 	struct net_device *netdev = to_net_dev(dev);
+	struct kernfs_node *kn;
 	ssize_t ret = -EINVAL;
 
 	/* The check is also done in dev_get_phys_port_id; this helps returning
-	 * early without hitting the trylock/restart below.
+	 * early without hitting the locking section below.
 	 */
 	if (!netdev->netdev_ops->ndo_get_phys_port_id)
 		return -EOPNOTSUPP;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(&dev->kobj, &attr->attr, netdev);
 
 	if (dev_isalive(netdev)) {
 		struct netdev_phys_item_id ppid;
@@ -517,7 +599,7 @@ static ssize_t phys_port_id_show(struct device *dev,
 		if (!ret)
 			ret = sysfs_emit(buf, "%*phN\n", ppid.id_len, ppid.id);
 	}
-	rtnl_unlock();
+	sysfs_rtnl_unlock(netdev, kn);
 
 	return ret;
 }
@@ -527,17 +609,17 @@ static ssize_t phys_port_name_show(struct device *dev,
 				   struct device_attribute *attr, char *buf)
 {
 	struct net_device *netdev = to_net_dev(dev);
+	struct kernfs_node *kn;
 	ssize_t ret = -EINVAL;
 
 	/* The checks are also done in dev_get_phys_port_name; this helps
-	 * returning early without hitting the trylock/restart below.
+	 * returning early without hitting the locking section below.
 	 */
 	if (!netdev->netdev_ops->ndo_get_phys_port_name &&
 	    !netdev->devlink_port)
 		return -EOPNOTSUPP;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(&dev->kobj, &attr->attr, netdev);
 
 	if (dev_isalive(netdev)) {
 		char name[IFNAMSIZ];
@@ -546,7 +628,7 @@ static ssize_t phys_port_name_show(struct device *dev,
 		if (!ret)
 			ret = sysfs_emit(buf, "%s\n", name);
 	}
-	rtnl_unlock();
+	sysfs_rtnl_unlock(netdev, kn);
 
 	return ret;
 }
@@ -556,18 +638,18 @@ static ssize_t phys_switch_id_show(struct device *dev,
 				   struct device_attribute *attr, char *buf)
 {
 	struct net_device *netdev = to_net_dev(dev);
+	struct kernfs_node *kn;
 	ssize_t ret = -EINVAL;
 
 	/* The checks are also done in dev_get_phys_port_name; this helps
-	 * returning early without hitting the trylock/restart below. This works
+	 * returning early without hitting the locking section below. This works
 	 * because recurse is false when calling dev_get_port_parent_id.
 	 */
 	if (!netdev->netdev_ops->ndo_get_port_parent_id &&
 	    !netdev->devlink_port)
 		return -EOPNOTSUPP;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(&dev->kobj, &attr->attr, netdev);
 
 	if (dev_isalive(netdev)) {
 		struct netdev_phys_item_id ppid = { };
@@ -576,7 +658,7 @@ static ssize_t phys_switch_id_show(struct device *dev,
 		if (!ret)
 			ret = sysfs_emit(buf, "%*phN\n", ppid.id_len, ppid.id);
 	}
-	rtnl_unlock();
+	sysfs_rtnl_unlock(netdev, kn);
 
 	return ret;
 }
@@ -586,15 +668,15 @@ static ssize_t threaded_show(struct device *dev,
 			     struct device_attribute *attr, char *buf)
 {
 	struct net_device *netdev = to_net_dev(dev);
+	struct kernfs_node *kn;
 	ssize_t ret = -EINVAL;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(&dev->kobj, &attr->attr, netdev);
 
 	if (dev_isalive(netdev))
 		ret = sysfs_emit(buf, fmt_dec, netdev->threaded);
 
-	rtnl_unlock();
+	sysfs_rtnl_unlock(netdev, kn);
 	return ret;
 }
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH net-next 2/4] net-sysfs: move queue attribute groups outside the default groups
  2023-10-18 15:47 [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Antoine Tenart
  2023-10-18 15:47 ` [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes Antoine Tenart
@ 2023-10-18 15:47 ` Antoine Tenart
  2023-10-18 15:47 ` [RFC PATCH net-next 3/4] net-sysfs: prevent uncleared queues from being re-added Antoine Tenart
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Antoine Tenart @ 2023-10-18 15:47 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet
  Cc: Antoine Tenart, netdev, gregkh, mhocko, stephen

Rx/tx queues embed their own kobject for registering their per-queue
sysfs files. The issue is they're using the kobject default groups for
this and entirely rely on the kobject refcounting for releasing their
sysfs paths.

In order to remove rtnl_trylock calls we need sysfs files not to rely on
their associated kobject refcounting for their release. Thus we here
move queues sysfs files from the kobject default groups to their own
groups which can be removed separately.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 include/linux/netdevice.h     |  1 +
 include/net/netdev_rx_queue.h |  1 +
 net/core/net-sysfs.c          | 27 ++++++++++++++++++++++-----
 3 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1c7681263d30..90fa1aa93155 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -626,6 +626,7 @@ struct netdev_queue {
 	struct Qdisc __rcu	*qdisc_sleeping;
 #ifdef CONFIG_SYSFS
 	struct kobject		kobj;
+	const struct attribute_group	**groups;
 #endif
 #if defined(CONFIG_XPS) && defined(CONFIG_NUMA)
 	int			numa_node;
diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index cdcafb30d437..b77bbc027176 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -15,6 +15,7 @@ struct netdev_rx_queue {
 	struct rps_dev_flow_table __rcu	*rps_flow_table;
 #endif
 	struct kobject			kobj;
+	const struct attribute_group	**groups;
 	struct net_device		*dev;
 	netdevice_tracker		dev_tracker;
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 76d8729340b7..a9f712ef9925 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1138,7 +1138,6 @@ static void rx_queue_get_ownership(const struct kobject *kobj,
 static const struct kobj_type rx_queue_ktype = {
 	.sysfs_ops = &rx_queue_sysfs_ops,
 	.release = rx_queue_release,
-	.default_groups = rx_queue_default_groups,
 	.namespace = rx_queue_namespace,
 	.get_ownership = rx_queue_get_ownership,
 };
@@ -1172,10 +1171,15 @@ static int rx_queue_add_kobject(struct net_device *dev, int index)
 	if (error)
 		goto err;
 
+	queue->groups = rx_queue_default_groups;
+	error = sysfs_create_groups(kobj, queue->groups);
+	if (error)
+		goto err;
+
 	if (dev->sysfs_rx_queue_group) {
 		error = sysfs_create_group(kobj, dev->sysfs_rx_queue_group);
 		if (error)
-			goto err;
+			goto err_default_groups;
 	}
 
 	error = rx_queue_default_mask(dev, queue);
@@ -1186,6 +1190,8 @@ static int rx_queue_add_kobject(struct net_device *dev, int index)
 
 	return error;
 
+err_default_groups:
+	sysfs_remove_groups(kobj, queue->groups);
 err:
 	kobject_put(kobj);
 	return error;
@@ -1230,12 +1236,14 @@ net_rx_queue_update_kobjects(struct net_device *dev, int old_num, int new_num)
 	}
 
 	while (--i >= new_num) {
-		struct kobject *kobj = &dev->_rx[i].kobj;
+		struct netdev_rx_queue *queue = &dev->_rx[i];
+		struct kobject *kobj = &queue->kobj;
 
 		if (!refcount_read(&dev_net(dev)->ns.count))
 			kobj->uevent_suppress = 1;
 		if (dev->sysfs_rx_queue_group)
 			sysfs_remove_group(kobj, dev->sysfs_rx_queue_group);
+		sysfs_remove_groups(kobj, queue->groups);
 		kobject_put(kobj);
 	}
 
@@ -1757,7 +1765,6 @@ static void netdev_queue_get_ownership(const struct kobject *kobj,
 static const struct kobj_type netdev_queue_ktype = {
 	.sysfs_ops = &netdev_queue_sysfs_ops,
 	.release = netdev_queue_release,
-	.default_groups = netdev_queue_default_groups,
 	.namespace = netdev_queue_namespace,
 	.get_ownership = netdev_queue_get_ownership,
 };
@@ -1779,15 +1786,24 @@ static int netdev_queue_add_kobject(struct net_device *dev, int index)
 	if (error)
 		goto err;
 
+	queue->groups = netdev_queue_default_groups;
+	error = sysfs_create_groups(kobj, queue->groups);
+	if (error)
+		goto err;
+
 #ifdef CONFIG_BQL
 	error = sysfs_create_group(kobj, &dql_group);
 	if (error)
-		goto err;
+		goto err_default_groups;
 #endif
 
 	kobject_uevent(kobj, KOBJ_ADD);
 	return 0;
 
+#ifdef CONFIG_BQL
+err_default_groups:
+	sysfs_remove_groups(kobj, queue->groups);
+#endif
 err:
 	kobject_put(kobj);
 	return error;
@@ -1841,6 +1857,7 @@ netdev_queue_update_kobjects(struct net_device *dev, int old_num, int new_num)
 #ifdef CONFIG_BQL
 		sysfs_remove_group(&queue->kobj, &dql_group);
 #endif
+		sysfs_remove_groups(&queue->kobj, queue->groups);
 		kobject_put(&queue->kobj);
 	}
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH net-next 3/4] net-sysfs: prevent uncleared queues from being re-added
  2023-10-18 15:47 [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Antoine Tenart
  2023-10-18 15:47 ` [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes Antoine Tenart
  2023-10-18 15:47 ` [RFC PATCH net-next 2/4] net-sysfs: move queue attribute groups outside the default groups Antoine Tenart
@ 2023-10-18 15:47 ` Antoine Tenart
  2023-10-18 15:47 ` [RFC PATCH net-next 4/4] net-sysfs: remove rtnl_trylock from queue attributes Antoine Tenart
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Antoine Tenart @ 2023-10-18 15:47 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet
  Cc: Antoine Tenart, netdev, gregkh, mhocko, stephen

With the (upcoming) removal of the rtnl_trylock/restart_syscall logic
and because of how Tx/Rx queues are implemented (and their
requirements), it might happen that a queue is re-added before having
the chance to be cleared. In such rare case, do not complete the queue
addition operation.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 net/core/net-sysfs.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index a9f712ef9925..75fb92c44291 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1160,6 +1160,20 @@ static int rx_queue_add_kobject(struct net_device *dev, int index)
 	struct kobject *kobj = &queue->kobj;
 	int error = 0;
 
+	/* Rx queues are cleared in rx_queue_release to allow later
+	 * re-registration. This is triggered when their kobj refcount is
+	 * dropped.
+	 *
+	 * If a queue is removed while both a read (or write) operation and a
+	 * the re-addition of the same queue are pending (waiting on rntl_lock)
+	 * it might happen that the re-addition will execute before the read,
+	 * making the initial removal to never happen (queue's kobj refcount
+	 * won't drop enough because of the pending read). In such rare case,
+	 * return to allow the removal operation to complete.
+	 */
+	if (unlikely(kobj->state_initialized))
+		return -EAGAIN;
+
 	/* Kobject_put later will trigger rx_queue_release call which
 	 * decreases dev refcount: Take that reference here
 	 */
@@ -1775,6 +1789,20 @@ static int netdev_queue_add_kobject(struct net_device *dev, int index)
 	struct kobject *kobj = &queue->kobj;
 	int error = 0;
 
+	/* Tx queues are cleared in netdev_queue_release to allow later
+	 * re-registration. This is triggered when their kobj refcount is
+	 * dropped.
+	 *
+	 * If a queue is removed while both a read (or write) operation and a
+	 * the re-addition of the same queue are pending (waiting on rntl_lock)
+	 * it might happen that the re-addition will execute before the read,
+	 * making the initial removal to never happen (queue's kobj refcount
+	 * won't drop enough because of the pending read). In such rare case,
+	 * return to allow the removal operation to complete.
+	 */
+	if (unlikely(kobj->state_initialized))
+		return -EAGAIN;
+
 	/* Kobject_put later will trigger netdev_queue_release call
 	 * which decreases dev refcount: Take that reference here
 	 */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH net-next 4/4] net-sysfs: remove rtnl_trylock from queue attributes
  2023-10-18 15:47 [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Antoine Tenart
                   ` (2 preceding siblings ...)
  2023-10-18 15:47 ` [RFC PATCH net-next 3/4] net-sysfs: prevent uncleared queues from being re-added Antoine Tenart
@ 2023-10-18 15:47 ` Antoine Tenart
  2023-10-18 15:57 ` [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Stephen Hemminger
  2023-10-18 18:15 ` Stephen Hemminger
  5 siblings, 0 replies; 23+ messages in thread
From: Antoine Tenart @ 2023-10-18 15:47 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet
  Cc: Antoine Tenart, netdev, gregkh, mhocko, stephen

Similar to the commit removing remove rtnl_trylock from device
attributes we here apply the same technique to networking queues.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 net/core/net-sysfs.c | 132 ++++++++++++++++++++++++-------------------
 1 file changed, 73 insertions(+), 59 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 75fb92c44291..1ef93a3ef4a5 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1296,9 +1296,11 @@ static int net_rx_queue_change_owner(struct net_device *dev, int num,
  */
 struct netdev_queue_attribute {
 	struct attribute attr;
-	ssize_t (*show)(struct netdev_queue *queue, char *buf);
-	ssize_t (*store)(struct netdev_queue *queue,
-			 const char *buf, size_t len);
+	ssize_t (*show)(struct kobject *kobj, struct attribute *attr,
+			struct netdev_queue *queue, char *buf);
+	ssize_t (*store)(struct kobject *kobj, struct attribute *attr,
+			 struct netdev_queue *queue, const char *buf,
+			 size_t len);
 };
 #define to_netdev_queue_attr(_attr) \
 	container_of(_attr, struct netdev_queue_attribute, attr)
@@ -1315,7 +1317,7 @@ static ssize_t netdev_queue_attr_show(struct kobject *kobj,
 	if (!attribute->show)
 		return -EIO;
 
-	return attribute->show(queue, buf);
+	return attribute->show(kobj, attr, queue, buf);
 }
 
 static ssize_t netdev_queue_attr_store(struct kobject *kobj,
@@ -1329,7 +1331,7 @@ static ssize_t netdev_queue_attr_store(struct kobject *kobj,
 	if (!attribute->store)
 		return -EIO;
 
-	return attribute->store(queue, buf, count);
+	return attribute->store(kobj, attr, queue, buf, count);
 }
 
 static const struct sysfs_ops netdev_queue_sysfs_ops = {
@@ -1337,7 +1339,8 @@ static const struct sysfs_ops netdev_queue_sysfs_ops = {
 	.store = netdev_queue_attr_store,
 };
 
-static ssize_t tx_timeout_show(struct netdev_queue *queue, char *buf)
+static ssize_t tx_timeout_show(struct kobject *kobj, struct attribute *attr,
+			       struct netdev_queue *queue, char *buf)
 {
 	unsigned long trans_timeout = atomic_long_read(&queue->trans_timeout);
 
@@ -1355,18 +1358,18 @@ static unsigned int get_netdev_queue_index(struct netdev_queue *queue)
 	return i;
 }
 
-static ssize_t traffic_class_show(struct netdev_queue *queue,
-				  char *buf)
+static ssize_t traffic_class_show(struct kobject *kobj, struct attribute *attr,
+				  struct netdev_queue *queue, char *buf)
 {
 	struct net_device *dev = queue->dev;
+	struct kernfs_node *kn;
 	int num_tc, tc;
 	int index;
 
 	if (!netif_is_multiqueue(dev))
 		return -ENOENT;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(kobj, attr, queue->dev);
 
 	index = get_netdev_queue_index(queue);
 
@@ -1376,7 +1379,7 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
 	num_tc = dev->num_tc;
 	tc = netdev_txq_to_tc(dev, index);
 
-	rtnl_unlock();
+	sysfs_rtnl_unlock(queue->dev, kn);
 
 	if (tc < 0)
 		return -EINVAL;
@@ -1393,24 +1396,26 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
 }
 
 #ifdef CONFIG_XPS
-static ssize_t tx_maxrate_show(struct netdev_queue *queue,
-			       char *buf)
+static ssize_t tx_maxrate_show(struct kobject *kobj, struct attribute *attr,
+			       struct netdev_queue *queue, char *buf)
 {
 	return sysfs_emit(buf, "%lu\n", queue->tx_maxrate);
 }
 
-static ssize_t tx_maxrate_store(struct netdev_queue *queue,
-				const char *buf, size_t len)
+static ssize_t tx_maxrate_store(struct kobject *kobj, struct attribute *attr,
+				struct netdev_queue *queue, const char *buf,
+				size_t len)
 {
-	struct net_device *dev = queue->dev;
 	int err, index = get_netdev_queue_index(queue);
+	struct net_device *dev = queue->dev;
+	struct kernfs_node *kn;
 	u32 rate = 0;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 
 	/* The check is also done later; this helps returning early without
-	 * hitting the trylock/restart below.
+	 * hitting the locking section below.
 	 */
 	if (!dev->netdev_ops->ndo_set_tx_maxrate)
 		return -EOPNOTSUPP;
@@ -1419,18 +1424,19 @@ static ssize_t tx_maxrate_store(struct netdev_queue *queue,
 	if (err < 0)
 		return err;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(kobj, attr, dev);
 
 	err = -EOPNOTSUPP;
 	if (dev->netdev_ops->ndo_set_tx_maxrate)
 		err = dev->netdev_ops->ndo_set_tx_maxrate(dev, index, rate);
 
-	rtnl_unlock();
 	if (!err) {
 		queue->tx_maxrate = rate;
+		sysfs_rtnl_unlock(dev, kn);
 		return len;
 	}
+
+	sysfs_rtnl_unlock(dev, kn);
 	return err;
 }
 
@@ -1474,16 +1480,17 @@ static ssize_t bql_set(const char *buf, const size_t count,
 	return count;
 }
 
-static ssize_t bql_show_hold_time(struct netdev_queue *queue,
-				  char *buf)
+static ssize_t bql_show_hold_time(struct kobject *kobj, struct attribute *attr,
+				  struct netdev_queue *queue, char *buf)
 {
 	struct dql *dql = &queue->dql;
 
 	return sysfs_emit(buf, "%u\n", jiffies_to_msecs(dql->slack_hold_time));
 }
 
-static ssize_t bql_set_hold_time(struct netdev_queue *queue,
-				 const char *buf, size_t len)
+static ssize_t bql_set_hold_time(struct kobject *kobj, struct attribute *attr,
+				 struct netdev_queue *queue, const char *buf,
+				 size_t len)
 {
 	struct dql *dql = &queue->dql;
 	unsigned int value;
@@ -1502,8 +1509,8 @@ static struct netdev_queue_attribute bql_hold_time_attribute __ro_after_init
 	= __ATTR(hold_time, 0644,
 		 bql_show_hold_time, bql_set_hold_time);
 
-static ssize_t bql_show_inflight(struct netdev_queue *queue,
-				 char *buf)
+static ssize_t bql_show_inflight(struct kobject *kobj, struct attribute *attr,
+				 struct netdev_queue *queue, char *buf)
 {
 	struct dql *dql = &queue->dql;
 
@@ -1514,13 +1521,16 @@ static struct netdev_queue_attribute bql_inflight_attribute __ro_after_init =
 	__ATTR(inflight, 0444, bql_show_inflight, NULL);
 
 #define BQL_ATTR(NAME, FIELD)						\
-static ssize_t bql_show_ ## NAME(struct netdev_queue *queue,		\
-				 char *buf)				\
+static ssize_t bql_show_ ## NAME(struct kobject *kobj,			\
+				 struct attribute *attr,		\
+				 struct netdev_queue *queue, char *buf)	\
 {									\
 	return bql_show(buf, queue->dql.FIELD);				\
 }									\
 									\
-static ssize_t bql_set_ ## NAME(struct netdev_queue *queue,		\
+static ssize_t bql_set_ ## NAME(struct kobject *kobj,			\
+				struct attribute *attr,			\
+				struct netdev_queue *queue,		\
 				const char *buf, size_t len)		\
 {									\
 	return bql_set(buf, len, &queue->dql.FIELD);			\
@@ -1600,9 +1610,11 @@ static ssize_t xps_queue_show(struct net_device *dev, unsigned int index,
 	return len < PAGE_SIZE ? len : -EINVAL;
 }
 
-static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
+static ssize_t xps_cpus_show(struct kobject *kobj, struct attribute *attr,
+			     struct netdev_queue *queue, char *buf)
 {
 	struct net_device *dev = queue->dev;
+	struct kernfs_node *kn;
 	unsigned int index;
 	int len, tc;
 
@@ -1611,32 +1623,34 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
 
 	index = get_netdev_queue_index(queue);
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(kobj, attr, queue->dev);
 
 	/* If queue belongs to subordinate dev use its map */
 	dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
 
 	tc = netdev_txq_to_tc(dev, index);
 	if (tc < 0) {
-		rtnl_unlock();
+		sysfs_rtnl_unlock(queue->dev, kn);
 		return -EINVAL;
 	}
 
-	/* Make sure the subordinate device can't be freed */
-	get_device(&dev->dev);
-	rtnl_unlock();
+	/* Do not release the net device refcnt to make sure it won't be freed
+	 * while xps_queue_show is running.
+	 */
+	__sysfs_rtnl_unlock(queue->dev, kn);
 
 	len = xps_queue_show(dev, index, tc, buf, XPS_CPUS);
 
-	put_device(&dev->dev);
+	dev_put(dev);
 	return len;
 }
 
-static ssize_t xps_cpus_store(struct netdev_queue *queue,
-			      const char *buf, size_t len)
+static ssize_t xps_cpus_store(struct kobject *kobj, struct attribute *attr,
+			      struct netdev_queue *queue, const char *buf,
+			      size_t len)
 {
 	struct net_device *dev = queue->dev;
+	struct kernfs_node *kn;
 	unsigned int index;
 	cpumask_var_t mask;
 	int err;
@@ -1658,13 +1672,9 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 		return err;
 	}
 
-	if (!rtnl_trylock()) {
-		free_cpumask_var(mask);
-		return restart_syscall();
-	}
-
+	kn = sysfs_rtnl_lock(kobj, attr, dev);
 	err = netif_set_xps_queue(dev, mask, index);
-	rtnl_unlock();
+	sysfs_rtnl_unlock(dev, kn);
 
 	free_cpumask_var(mask);
 
@@ -1674,30 +1684,37 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
 	= __ATTR_RW(xps_cpus);
 
-static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
+static ssize_t xps_rxqs_show(struct kobject *kobj, struct attribute *attr,
+			     struct netdev_queue *queue, char *buf)
 {
 	struct net_device *dev = queue->dev;
+	struct kernfs_node *kn;
 	unsigned int index;
-	int tc;
+	int tc, ret;
 
 	index = get_netdev_queue_index(queue);
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	kn = sysfs_rtnl_lock(kobj, attr, dev);
 
 	tc = netdev_txq_to_tc(dev, index);
-	rtnl_unlock();
-	if (tc < 0)
-		return -EINVAL;
 
-	return xps_queue_show(dev, index, tc, buf, XPS_RXQS);
+	/* Do not release the net device refcnt to make sure it won't be freed
+	 * while xps_queue_show is running.
+	 */
+	__sysfs_rtnl_unlock(dev, kn);
+
+	ret = tc >= 0 ? xps_queue_show(dev, index, tc, buf, XPS_RXQS) : -EINVAL;
+	dev_put(dev);
+	return ret;
 }
 
-static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
+static ssize_t xps_rxqs_store(struct kobject *kobj, struct attribute *attr,
+			      struct netdev_queue *queue, const char *buf,
 			      size_t len)
 {
 	struct net_device *dev = queue->dev;
 	struct net *net = dev_net(dev);
+	struct kernfs_node *kn;
 	unsigned long *mask;
 	unsigned int index;
 	int err;
@@ -1717,16 +1734,13 @@ static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
 		return err;
 	}
 
-	if (!rtnl_trylock()) {
-		bitmap_free(mask);
-		return restart_syscall();
-	}
+	kn = sysfs_rtnl_lock(kobj, attr, dev);
 
 	cpus_read_lock();
 	err = __netif_set_xps_queue(dev, mask, index, XPS_RXQS);
 	cpus_read_unlock();
 
-	rtnl_unlock();
+	sysfs_rtnl_unlock(dev, kn);
 
 	bitmap_free(mask);
 	return err ? : len;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use
  2023-10-18 15:47 [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Antoine Tenart
                   ` (3 preceding siblings ...)
  2023-10-18 15:47 ` [RFC PATCH net-next 4/4] net-sysfs: remove rtnl_trylock from queue attributes Antoine Tenart
@ 2023-10-18 15:57 ` Stephen Hemminger
  2023-10-18 16:34   ` Antoine Tenart
  2023-10-18 18:15 ` Stephen Hemminger
  5 siblings, 1 reply; 23+ messages in thread
From: Stephen Hemminger @ 2023-10-18 15:57 UTC (permalink / raw)
  To: Antoine Tenart; +Cc: davem, kuba, pabeni, edumazet, netdev, gregkh, mhocko

On Wed, 18 Oct 2023 17:47:42 +0200
Antoine Tenart <atenart@kernel.org> wrote:

> Hi,
> 
> This is sent as an RFC because I believe this should be discussed (and
> some might want to do additional testing), but the code itself is ready.
> 
> Some time ago we tried to improve the rtnl_trylock/restart_syscall
> situation[1]. What happens is when there is rtnl contention, userspace
> accessing net sysfs attributes will spin and experience delays. This can
> happen in different situations, when sysfs attributes are accessed
> (networking daemon, configuration, monitoring) while operations under
> rtnl are performed (veth creation, driver configuration, etc). A few
> improvements can be done in userspace to ease things, like using the
> netlink interface instead, or polling less (or more selectively) the
> attributes; but in the end the root cause is always there and this keeps
> happening from time to time.
> 
> That initial effort however wasn't successful, although I think there
> was an interest, mostly because we found technical flaws and didn't find
> a working solution at the time. Some time later, we gave it a new try
> and found something more promising, but the patches fell off my radar. I
> recently had another look at this series, made more tests and cleaned it
> up.
> 
> The technical aspect is described in patch 1 directly in the code
> comments, with an additional important comment in patch 3. This was
> mostly tested by stress-testing net sysfs attributes (read/write ops)
> while adding/removing queues and adding/removing veths, all in parallel.
> 
> All comments are welcomed.

The trylock was introduced to deal with lock inversion.
It is not clear how this more complex solution prevents that.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use
  2023-10-18 15:57 ` [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Stephen Hemminger
@ 2023-10-18 16:34   ` Antoine Tenart
  0 siblings, 0 replies; 23+ messages in thread
From: Antoine Tenart @ 2023-10-18 16:34 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, kuba, pabeni, edumazet, netdev, gregkh, mhocko

Quoting Stephen Hemminger (2023-10-18 17:57:17)
> The trylock was introduced to deal with lock inversion.
> It is not clear how this more complex solution prevents that.

Anything specifically in the patch 1 comments is not clear that I can
improve?

The dead lock happens between rtnl_lock and the refcounting on the
attribute kn->active, specifically when unregistering net devices
because device_del kernfs_drain will wait for the kn->active refcount to
reach KN_DEACTIVATED_BIAS, under an rtnl section. The current solution
was making one path to bail out (trylock/restart syscall).

The idea here is we can actually bail out of the attribute kn protection
(kn->active), while still letting unregistering net devices to wait for
the current sysfs operations to complete, by using the net device
refcount instead. To simplify, instead of waiting on kn->active in the
net device unregistration step, this waits on the net device refcount
(netdev_wait_allrefs_any), which is done outside an rtnl section. This
way kernfs_drain can complete under its rtnl section even if a call to
rtnl_lock is waiting in a sysfs operation.

Antoine

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2023-10-18 15:47 ` [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes Antoine Tenart
@ 2023-10-18 16:49   ` Greg KH
  2023-10-19  8:13     ` Antoine Tenart
  2023-10-18 18:13   ` Stephen Hemminger
  2025-01-02 22:36   ` Jakub Kicinski
  2 siblings, 1 reply; 23+ messages in thread
From: Greg KH @ 2023-10-18 16:49 UTC (permalink / raw)
  To: Antoine Tenart; +Cc: davem, kuba, pabeni, edumazet, netdev, mhocko, stephen

On Wed, Oct 18, 2023 at 05:47:43PM +0200, Antoine Tenart wrote:
> We have an ABBA deadlock between net device unregistration and sysfs
> files being accessed[1][2]. To prevent this from happening all paths
> taking the rtnl lock after the sysfs one (actually kn->active refcount)
> use rtnl_trylock and return early (using restart_syscall)[3] which can
> make syscalls to spin for a long time when there is contention on the
> rtnl lock[4].
> 
> There are not many possibilities to improve the above:
> - Rework the entire net/ locking logic.
> - Invert two locks in one of the paths — not possible.
> 
> But here it's actually possible to drop one of the locks safely: the
> kernfs_node refcount. More details in the code itself, which comes with
> lots of comments.
> 
> [1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/
> [2] https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/
> [3] https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/
> [4] https://lore.kernel.org/all/20210928125500.167943-1-atenart@kernel.org/T/
> 
> Signed-off-by: Antoine Tenart <atenart@kernel.org>
> ---
>  net/core/net-sysfs.c | 142 ++++++++++++++++++++++++++++++++++---------
>  1 file changed, 112 insertions(+), 30 deletions(-)
> 
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index fccaa5bac0ed..76d8729340b7 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -40,6 +40,88 @@ static inline int dev_isalive(const struct net_device *dev)
>  	return dev->reg_state <= NETREG_REGISTERED;
>  }
>  
> +/* We have a possible ABBA deadlock between rtnl_lock and kernfs_node->active,
> + * when unregistering a net device and accessing associated sysfs files. Tx/Rx
> + * queues do embed their own kobject for their sysfs files but the same issue
> + * applies as a net device being unresgistered will remove those sysfs files as
> + * well. The potential deadlock is as follow:
> + *
> + *         CPU 0                                         CPU 1
> + *
> + *    rtnl_lock                                   vfs_read
> + *    unregister_netdevice_many                   kernfs_seq_start
> + *    device_del / kobject_put                      kernfs_get_active (kn->active++)
> + *    kernfs_drain                                sysfs_kf_seq_show
> + *    wait_event(                                 rtnl_lock
> + *       kn->active == KN_DEACTIVATED_BIAS)       -> waits on CPU 0 to release
> + *    -> waits on CPU 1 to decrease kn->active       the rtnl lock.
> + *
> + * The historical fix was to use rtnl_trylock with restart_syscall to bail out
> + * of sysfs accesses when the lock wasn't taken. This fixed the above issue as
> + * it allowed CPU 1 to bail out of the ABBA situation.
> + *
> + * But this comes with performances issues, as syscalls are being restarted in
> + * loops when there is contention, with huge slow downs in specific scenarios
> + * (e.g. lots of virtual interfaces created at startup and userspace daemons
> + * querying their attributes).
> + *
> + * The idea below is to bail out of the active kernfs_node protection
> + * (kn->active) while still being in a safe position to continue the execution
> + * of our sysfs helpers.
> + */
> +static inline struct kernfs_node *sysfs_rtnl_lock(struct kobject *kobj,
> +						  struct attribute *attr,
> +						  struct net_device *ndev)
> +{
> +	struct kernfs_node *kn;
> +
> +	/* First, we hold a reference to the net device we might use in the
> +	 * locking section as the unregistration path might run in parallel.
> +	 * This will ensure the net device won't be freed before we return.
> +	 */
> +	dev_hold(ndev);
> +	/* sysfs_break_active_protection was introduced to allow self-removal of
> +	 * devices and their associated sysfs files by bailing out of the
> +	 * sysfs/kernfs protection. We do this here to allow the unregistration
> +	 * path to complete in parallel. The following takes a reference on the
> +	 * kobject and the kernfs_node being accessed.
> +	 *
> +	 * This works because we hold a reference onto the net device and the
> +	 * unregistration path will wait for us eventually in netdev_run_todo
> +	 * (outside an rtnl lock section).
> +	 */
> +	kn = sysfs_break_active_protection(kobj, attr);
> +	WARN_ON_ONCE(!kn);

If this triggers, you will end up rebooting the machines that set
panic-on-warn, do you mean to do that?  And note, the huge majority of
Linux systems in the world have that enabled, so be careful.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2023-10-18 15:47 ` [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes Antoine Tenart
  2023-10-18 16:49   ` Greg KH
@ 2023-10-18 18:13   ` Stephen Hemminger
  2023-10-19  7:48     ` Antoine Tenart
  2025-01-02 22:36   ` Jakub Kicinski
  2 siblings, 1 reply; 23+ messages in thread
From: Stephen Hemminger @ 2023-10-18 18:13 UTC (permalink / raw)
  To: Antoine Tenart; +Cc: davem, kuba, pabeni, edumazet, netdev, gregkh, mhocko

On Wed, 18 Oct 2023 17:47:43 +0200
Antoine Tenart <atenart@kernel.org> wrote:

> +static inline struct kernfs_node *sysfs_rtnl_lock(struct kobject *kobj,
> +						  struct attribute *attr,
> +						  struct net_device *ndev)
Still reviewing the details here.
But inline on static functions is not the code policy in networking related code.
The argument is compiler will inline anyway if useful.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use
  2023-10-18 15:47 [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Antoine Tenart
                   ` (4 preceding siblings ...)
  2023-10-18 15:57 ` [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Stephen Hemminger
@ 2023-10-18 18:15 ` Stephen Hemminger
  2023-10-19  7:47   ` Antoine Tenart
  5 siblings, 1 reply; 23+ messages in thread
From: Stephen Hemminger @ 2023-10-18 18:15 UTC (permalink / raw)
  To: Antoine Tenart; +Cc: davem, kuba, pabeni, edumazet, netdev, gregkh, mhocko

On Wed, 18 Oct 2023 17:47:42 +0200
Antoine Tenart <atenart@kernel.org> wrote:

> Some time ago we tried to improve the rtnl_trylock/restart_syscall
> situation[1]. What happens is when there is rtnl contention, userspace
> accessing net sysfs attributes will spin and experience delays. This can
> happen in different situations, when sysfs attributes are accessed
> (networking daemon, configuration, monitoring) while operations under
> rtnl are performed (veth creation, driver configuration, etc). A few
> improvements can be done in userspace to ease things, like using the
> netlink interface instead, or polling less (or more selectively) the
> attributes; but in the end the root cause is always there and this keeps
> happening from time to time.

What attribute is not exposed by netlink, and only by sysfs?
There will always be more overhead using sysfs.
That doesn't mean the locking should not be fixed, just that better
to avoid the situation if possible.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use
  2023-10-18 18:15 ` Stephen Hemminger
@ 2023-10-19  7:47   ` Antoine Tenart
  2023-10-19 14:54     ` Stephen Hemminger
  0 siblings, 1 reply; 23+ messages in thread
From: Antoine Tenart @ 2023-10-19  7:47 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, kuba, pabeni, edumazet, netdev, gregkh, mhocko

Quoting Stephen Hemminger (2023-10-18 20:15:47)
> On Wed, 18 Oct 2023 17:47:42 +0200
> Antoine Tenart <atenart@kernel.org> wrote:
> 
> > Some time ago we tried to improve the rtnl_trylock/restart_syscall
> > situation[1]. What happens is when there is rtnl contention, userspace
> > accessing net sysfs attributes will spin and experience delays. This can
> > happen in different situations, when sysfs attributes are accessed
> > (networking daemon, configuration, monitoring) while operations under
> > rtnl are performed (veth creation, driver configuration, etc). A few
> > improvements can be done in userspace to ease things, like using the
> > netlink interface instead, or polling less (or more selectively) the
> > attributes; but in the end the root cause is always there and this keeps
> > happening from time to time.
> 
> What attribute is not exposed by netlink, and only by sysfs?

I think there were a few last time (a while a go) I checked, but it's
not an issue IMHO, if one is missing in netlink we can add it.

> That doesn't mean the locking should not be fixed, just that better
> to avoid the situation if possible.

100% agree on this one. I believe using netlink is the right way.

Having said that, sysfs is still there and (quite some time ago) while
having discussions with different projects, some were keen to switch to
netlink, but some weren't and pushed back because "sysfs is a stable
API" and "if there is a kernel issue it should be fixed in the kernel".
Not blaming anyone really, they'd have to support the two interfaces for
compatibility. My point is, yes, I would encourage everyone to use
netlink too, but we don't control every user and it's not like sysfs
will disappear anytime soon.

Thanks,
Antoine

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2023-10-18 18:13   ` Stephen Hemminger
@ 2023-10-19  7:48     ` Antoine Tenart
  0 siblings, 0 replies; 23+ messages in thread
From: Antoine Tenart @ 2023-10-19  7:48 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, kuba, pabeni, edumazet, netdev, gregkh, mhocko

Quoting Stephen Hemminger (2023-10-18 20:13:43)
> On Wed, 18 Oct 2023 17:47:43 +0200
> Antoine Tenart <atenart@kernel.org> wrote:
> 
> > +static inline struct kernfs_node *sysfs_rtnl_lock(struct kobject *kobj,
> > +                                               struct attribute *attr,
> > +                                               struct net_device *ndev)
> Still reviewing the details here.

Thanks!

> But inline on static functions is not the code policy in networking
> related code. The argument is compiler will inline anyway if useful.

Right, I'll fix that.

Antoine

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2023-10-18 16:49   ` Greg KH
@ 2023-10-19  8:13     ` Antoine Tenart
  2023-10-19 15:37       ` Greg KH
  0 siblings, 1 reply; 23+ messages in thread
From: Antoine Tenart @ 2023-10-19  8:13 UTC (permalink / raw)
  To: Greg KH; +Cc: davem, kuba, pabeni, edumazet, netdev, mhocko, stephen

Quoting Greg KH (2023-10-18 18:49:18)
> On Wed, Oct 18, 2023 at 05:47:43PM +0200, Antoine Tenart wrote:
> > +static inline struct kernfs_node *sysfs_rtnl_lock(struct kobject *kobj,
> > +                                               struct attribute *attr,
> > +                                               struct net_device *ndev)
> > +{
> > +     struct kernfs_node *kn;
> > +
> > +     /* First, we hold a reference to the net device we might use in the
> > +      * locking section as the unregistration path might run in parallel.
> > +      * This will ensure the net device won't be freed before we return.
> > +      */
> > +     dev_hold(ndev);
> > +     /* sysfs_break_active_protection was introduced to allow self-removal of
> > +      * devices and their associated sysfs files by bailing out of the
> > +      * sysfs/kernfs protection. We do this here to allow the unregistration
> > +      * path to complete in parallel. The following takes a reference on the
> > +      * kobject and the kernfs_node being accessed.
> > +      *
> > +      * This works because we hold a reference onto the net device and the
> > +      * unregistration path will wait for us eventually in netdev_run_todo
> > +      * (outside an rtnl lock section).
> > +      */
> > +     kn = sysfs_break_active_protection(kobj, attr);
> > +     WARN_ON_ONCE(!kn);
> 
> If this triggers, you will end up rebooting the machines that set
> panic-on-warn, do you mean to do that?  And note, the huge majority of
> Linux systems in the world have that enabled, so be careful.

Right. My understanding was this can not happen here and I added this
one as a "that should not happen and something is really wrong", as the
attribute should be valid until at least the call to
sysfs_break_active_protection.

Thanks,
Antoine

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use
  2023-10-19  7:47   ` Antoine Tenart
@ 2023-10-19 14:54     ` Stephen Hemminger
  0 siblings, 0 replies; 23+ messages in thread
From: Stephen Hemminger @ 2023-10-19 14:54 UTC (permalink / raw)
  To: Antoine Tenart; +Cc: davem, kuba, pabeni, edumazet, netdev, gregkh, mhocko

On Thu, 19 Oct 2023 09:47:27 +0200
Antoine Tenart <atenart@kernel.org> wrote:

> > That doesn't mean the locking should not be fixed, just that better
> > to avoid the situation if possible.  
> 
> 100% agree on this one. I believe using netlink is the right way.
> 
> Having said that, sysfs is still there and (quite some time ago) while
> having discussions with different projects, some were keen to switch to
> netlink, but some weren't and pushed back because "sysfs is a stable
> API" and "if there is a kernel issue it should be fixed in the kernel".
> Not blaming anyone really, they'd have to support the two interfaces for
> compatibility. My point is, yes, I would encourage everyone to use
> netlink too, but we don't control every user and it's not like sysfs
> will disappear anytime soon.

I have seen code doing discovery of new devices via netlink then poking
around in sysfs. But that usage is inherently racy from the application
point of view. By the time device is discovered, it might be removed or
worse renamed before the sysfs operations.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2023-10-19  8:13     ` Antoine Tenart
@ 2023-10-19 15:37       ` Greg KH
  0 siblings, 0 replies; 23+ messages in thread
From: Greg KH @ 2023-10-19 15:37 UTC (permalink / raw)
  To: Antoine Tenart; +Cc: davem, kuba, pabeni, edumazet, netdev, mhocko, stephen

On Thu, Oct 19, 2023 at 10:13:29AM +0200, Antoine Tenart wrote:
> Quoting Greg KH (2023-10-18 18:49:18)
> > On Wed, Oct 18, 2023 at 05:47:43PM +0200, Antoine Tenart wrote:
> > > +static inline struct kernfs_node *sysfs_rtnl_lock(struct kobject *kobj,
> > > +                                               struct attribute *attr,
> > > +                                               struct net_device *ndev)
> > > +{
> > > +     struct kernfs_node *kn;
> > > +
> > > +     /* First, we hold a reference to the net device we might use in the
> > > +      * locking section as the unregistration path might run in parallel.
> > > +      * This will ensure the net device won't be freed before we return.
> > > +      */
> > > +     dev_hold(ndev);
> > > +     /* sysfs_break_active_protection was introduced to allow self-removal of
> > > +      * devices and their associated sysfs files by bailing out of the
> > > +      * sysfs/kernfs protection. We do this here to allow the unregistration
> > > +      * path to complete in parallel. The following takes a reference on the
> > > +      * kobject and the kernfs_node being accessed.
> > > +      *
> > > +      * This works because we hold a reference onto the net device and the
> > > +      * unregistration path will wait for us eventually in netdev_run_todo
> > > +      * (outside an rtnl lock section).
> > > +      */
> > > +     kn = sysfs_break_active_protection(kobj, attr);
> > > +     WARN_ON_ONCE(!kn);
> > 
> > If this triggers, you will end up rebooting the machines that set
> > panic-on-warn, do you mean to do that?  And note, the huge majority of
> > Linux systems in the world have that enabled, so be careful.
> 
> Right. My understanding was this can not happen here and I added this
> one as a "that should not happen and something is really wrong", as the
> attribute should be valid until at least the call to
> sysfs_break_active_protection.

If it can not happen, then no need to ever check it.  If it can happen,
then check for it and handle the error.  Don't cheat and try to rely on
WARN_ON() to paper over lazy programming :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2023-10-18 15:47 ` [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes Antoine Tenart
  2023-10-18 16:49   ` Greg KH
  2023-10-18 18:13   ` Stephen Hemminger
@ 2025-01-02 22:36   ` Jakub Kicinski
  2025-01-03  8:41     ` Eric Dumazet
  2025-01-07 16:30     ` Antoine Tenart
  2 siblings, 2 replies; 23+ messages in thread
From: Jakub Kicinski @ 2025-01-02 22:36 UTC (permalink / raw)
  To: Antoine Tenart; +Cc: davem, pabeni, edumazet, netdev, gregkh, mhocko, stephen

On Wed, 18 Oct 2023 17:47:43 +0200 Antoine Tenart wrote:
> We have an ABBA deadlock between net device unregistration and sysfs
> files being accessed[1][2]. To prevent this from happening all paths
> taking the rtnl lock after the sysfs one (actually kn->active refcount)
> use rtnl_trylock and return early (using restart_syscall)[3] which can
> make syscalls to spin for a long time when there is contention on the
> rtnl lock[4].

Hi Antoine!

I was looking at the sysfs locking, and ended up going down a very
similar path. Luckily lore search for sysfs_break_active_protection()
surfaced this thread so I can save myself some duplicated work :)

Is there any particular reason why you haven't pursued this solution
further? I think it should work.

My version, FWIW:
https://github.com/kuba-moo/linux/commit/2724bb7275496a254b001fe06fe20ccc5addc9d2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2025-01-02 22:36   ` Jakub Kicinski
@ 2025-01-03  8:41     ` Eric Dumazet
  2025-01-07 16:30     ` Antoine Tenart
  1 sibling, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-01-03  8:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Antoine Tenart, davem, pabeni, netdev, gregkh, mhocko, stephen

On Thu, Jan 2, 2025 at 11:36 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 18 Oct 2023 17:47:43 +0200 Antoine Tenart wrote:
> > We have an ABBA deadlock between net device unregistration and sysfs
> > files being accessed[1][2]. To prevent this from happening all paths
> > taking the rtnl lock after the sysfs one (actually kn->active refcount)
> > use rtnl_trylock and return early (using restart_syscall)[3] which can
> > make syscalls to spin for a long time when there is contention on the
> > rtnl lock[4].
>
> Hi Antoine!
>
> I was looking at the sysfs locking, and ended up going down a very
> similar path. Luckily lore search for sysfs_break_active_protection()
> surfaced this thread so I can save myself some duplicated work :)
>
> Is there any particular reason why you haven't pursued this solution
> further? I think it should work.
>
> My version, FWIW:
> https://github.com/kuba-moo/linux/commit/2724bb7275496a254b001fe06fe20ccc5addc9d2

Indeed, this would probably remove a lot of syzbot reports.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2025-01-02 22:36   ` Jakub Kicinski
  2025-01-03  8:41     ` Eric Dumazet
@ 2025-01-07 16:30     ` Antoine Tenart
  2025-01-07 17:06       ` Jakub Kicinski
  1 sibling, 1 reply; 23+ messages in thread
From: Antoine Tenart @ 2025-01-07 16:30 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, pabeni, edumazet, netdev, gregkh, mhocko, stephen

Hi Jakub,

Quoting Jakub Kicinski (2025-01-02 23:36:47)
> On Wed, 18 Oct 2023 17:47:43 +0200 Antoine Tenart wrote:
> > We have an ABBA deadlock between net device unregistration and sysfs
> > files being accessed[1][2]. To prevent this from happening all paths
> > taking the rtnl lock after the sysfs one (actually kn->active refcount)
> > use rtnl_trylock and return early (using restart_syscall)[3] which can
> > make syscalls to spin for a long time when there is contention on the
> > rtnl lock[4].
> 
> I was looking at the sysfs locking, and ended up going down a very
> similar path. Luckily lore search for sysfs_break_active_protection()
> surfaced this thread so I can save myself some duplicated work :)

Seeing that thread in my inbox again is a nice surprise :-)

Did you encounter any specific issue that made you look at the sysfs
locking?

> Is there any particular reason why you haven't pursued this solution
> further? I think it should work.

I felt there wasn't much interest and feedback at the time and we had
things in place to ease the initial issue we were working on (~ slow
boot time w/ lots of netns and containers). With that and given the
change was a bit tricky I didn't wanted to be the only one pushing for
this.

But I still think this could be beneficial for various use cases so if
you're interested I'll be happy to revive it. I'll have to refresh my
mind and run some tests again first. (Any additional testing will be
appreciated too).

> My version, FWIW:
> https://github.com/kuba-moo/linux/commit/2724bb7275496a254b001fe06fe20ccc5addc9d2

I might take a few of your changes in there, eg. I see you used an
interruptible lock. With this and the few minors comments this RFC got I
can prepare a new series.

Thanks!
Antoine

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2025-01-07 16:30     ` Antoine Tenart
@ 2025-01-07 17:06       ` Jakub Kicinski
  2025-01-16 13:36         ` Antoine Tenart
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2025-01-07 17:06 UTC (permalink / raw)
  To: Antoine Tenart; +Cc: davem, pabeni, edumazet, netdev, gregkh, mhocko, stephen

On Tue, 07 Jan 2025 17:30:03 +0100 Antoine Tenart wrote:
> Quoting Jakub Kicinski (2025-01-02 23:36:47)
> > On Wed, 18 Oct 2023 17:47:43 +0200 Antoine Tenart wrote:  
> > > We have an ABBA deadlock between net device unregistration and sysfs
> > > files being accessed[1][2]. To prevent this from happening all paths
> > > taking the rtnl lock after the sysfs one (actually kn->active refcount)
> > > use rtnl_trylock and return early (using restart_syscall)[3] which can
> > > make syscalls to spin for a long time when there is contention on the
> > > rtnl lock[4].  
> > 
> > I was looking at the sysfs locking, and ended up going down a very
> > similar path. Luckily lore search for sysfs_break_active_protection()
> > surfaced this thread so I can save myself some duplicated work :)  
> 
> Seeing that thread in my inbox again is a nice surprise :-)
> 
> Did you encounter any specific issue that made you look at the sysfs
> locking?

I started working on broadening the use of the netdev->lock 
(per instance lock) to lower the rtnl_lock pressure.
I wanted to make sure I will not end up with the same trylock
hack when it comes to sysfs, so I started digging into the existing
issue...

> > Is there any particular reason why you haven't pursued this solution
> > further? I think it should work.  
> 
> I felt there wasn't much interest and feedback at the time and we had
> things in place to ease the initial issue we were working on (~ slow
> boot time w/ lots of netns and containers). With that and given the
> change was a bit tricky I didn't wanted to be the only one pushing for
> this.
> 
> But I still think this could be beneficial for various use cases so if
> you're interested I'll be happy to revive it. I'll have to refresh my
> mind and run some tests again first. (Any additional testing will be
> appreciated too).

TBH my interest is a bit tangential. We keep adding device configuration
APIs to netdev, and they all end up taking rtnl_lock, even tho vast
majority of the time the configuration is completely local to a single
instance. I'm trying to lay enough the groundwork for using the instance
lock to enable less experienced developers using it. Kuniyuki is also
working on making rtnl_lock per netns. I think it's a good time to fix
the sysfs situation.

> > My version, FWIW:
> > https://github.com/kuba-moo/linux/commit/2724bb7275496a254b001fe06fe20ccc5addc9d2  
> 
> I might take a few of your changes in there, eg. I see you used an
> interruptible lock. With this and the few minors comments this RFC got I
> can prepare a new series.

Perfect.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2025-01-07 17:06       ` Jakub Kicinski
@ 2025-01-16 13:36         ` Antoine Tenart
  2025-01-16 23:42           ` Jakub Kicinski
  0 siblings, 1 reply; 23+ messages in thread
From: Antoine Tenart @ 2025-01-16 13:36 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, pabeni, edumazet, netdev, gregkh, mhocko, stephen

Hi Jakub,

Quoting Jakub Kicinski (2025-01-07 18:06:41)
> On Tue, 07 Jan 2025 17:30:03 +0100 Antoine Tenart wrote:
> > Quoting Jakub Kicinski (2025-01-02 23:36:47)
> > > My version, FWIW:
> > > https://github.com/kuba-moo/linux/commit/2724bb7275496a254b001fe06fe20ccc5addc9d2  
> > 
> > I might take a few of your changes in there, eg. I see you used an
> > interruptible lock. With this and the few minors comments this RFC got I
> > can prepare a new series.
> 
> Perfect.

While refreshing the series, especially after adding the dev_isalive()
check, I found out we actually do not need to drop the sysfs protection
and hold a reference to the net device during the whole rtnl locking
section. This is because after getting the rtnl lock and once we know
the net device dismantle hasn't started yet, we're sure dismantle won't
start (and the device won't be freed) until we give back the rtnl lock.

This makes the new helpers easier to use, does not require to expose
the kernfs node to users, making the code more contained; but the
locking order is not as perfect.

We would go from (version 1),

1. unlocking sysfs
2. locking rtnl
3. unlocking rtnl
4. locking sysfs

to (version 2),

1. unlocking sysfs
2. locking rtnl
3. locking sysfs
4. unlocking rtnl

This is actually fine because the "sysfs lock" isn't a lock but a
refcnt, with the only deadlock situation being when draining it.

Version 1: https://github.com/atenart/linux/commit/596c5d9895ccdb75057978abd6be1a42ee4b448e
Version 2: https://github.com/atenart/linux/commit/c6659bb26f564f1fd63d1c279616f57141e9f2bf

Thoughts? Apart from that question, either series is ready for
submission.

Thanks,
Antoine

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2025-01-16 13:36         ` Antoine Tenart
@ 2025-01-16 23:42           ` Jakub Kicinski
  2025-01-16 23:43             ` Jakub Kicinski
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2025-01-16 23:42 UTC (permalink / raw)
  To: Antoine Tenart; +Cc: davem, pabeni, edumazet, netdev, gregkh, mhocko, stephen

On Thu, 16 Jan 2025 14:36:17 +0100 Antoine Tenart wrote:
> While refreshing the series, especially after adding the dev_isalive()
> check, I found out we actually do not need to drop the sysfs protection
> and hold a reference to the net device during the whole rtnl locking
> section. This is because after getting the rtnl lock and once we know
> the net device dismantle hasn't started yet, we're sure dismantle won't
> start (and the device won't be freed) until we give back the rtnl lock.
> 
> This makes the new helpers easier to use, does not require to expose
> the kernfs node to users, making the code more contained; but the
> locking order is not as perfect.
> 
> We would go from (version 1),
> 
> 1. unlocking sysfs
> 2. locking rtnl
> 3. unlocking rtnl
> 4. locking sysfs
> 
> to (version 2),
> 
> 1. unlocking sysfs
> 2. locking rtnl
> 3. locking sysfs
> 4. unlocking rtnl
> 
> This is actually fine because the "sysfs lock" isn't a lock but a
> refcnt, with the only deadlock situation being when draining it.
> 
> Version 1: https://github.com/atenart/linux/commit/596c5d9895ccdb75057978abd6be1a42ee4b448e
> Version 2: https://github.com/atenart/linux/commit/c6659bb26f564f1fd63d1c279616f57141e9f2bf
> 
> Thoughts? Apart from that question, either series is ready for
> submission.

Nice, yes, I think that works!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2025-01-16 23:42           ` Jakub Kicinski
@ 2025-01-16 23:43             ` Jakub Kicinski
  2025-01-17  8:26               ` Antoine Tenart
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2025-01-16 23:43 UTC (permalink / raw)
  To: Antoine Tenart; +Cc: davem, pabeni, edumazet, netdev, gregkh, mhocko, stephen

On Thu, 16 Jan 2025 15:42:41 -0800 Jakub Kicinski wrote:
> On Thu, 16 Jan 2025 14:36:17 +0100 Antoine Tenart wrote:
> > While refreshing the series, especially after adding the dev_isalive()
> > check, I found out we actually do not need to drop the sysfs protection
> > and hold a reference to the net device during the whole rtnl locking
> > section. This is because after getting the rtnl lock and once we know
> > the net device dismantle hasn't started yet, we're sure dismantle won't
> > start (and the device won't be freed) until we give back the rtnl lock.
> > 
> > This makes the new helpers easier to use, does not require to expose
> > the kernfs node to users, making the code more contained; but the
> > locking order is not as perfect.
> > 
> > We would go from (version 1),
> > 
> > 1. unlocking sysfs
> > 2. locking rtnl
> > 3. unlocking rtnl
> > 4. locking sysfs
> > 
> > to (version 2),
> > 
> > 1. unlocking sysfs
> > 2. locking rtnl
> > 3. locking sysfs
> > 4. unlocking rtnl
> > 
> > This is actually fine because the "sysfs lock" isn't a lock but a
> > refcnt, with the only deadlock situation being when draining it.
> > 
> > Version 1: https://github.com/atenart/linux/commit/596c5d9895ccdb75057978abd6be1a42ee4b448e
> > Version 2: https://github.com/atenart/linux/commit/c6659bb26f564f1fd63d1c279616f57141e9f2bf
> > 
> > Thoughts? Apart from that question, either series is ready for
> > submission.  
> 
> Nice, yes, I think that works!

To be clear - by "that" I mean version 2 :)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
  2025-01-16 23:43             ` Jakub Kicinski
@ 2025-01-17  8:26               ` Antoine Tenart
  0 siblings, 0 replies; 23+ messages in thread
From: Antoine Tenart @ 2025-01-17  8:26 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, pabeni, edumazet, netdev, gregkh, mhocko, stephen

Quoting Jakub Kicinski (2025-01-17 00:43:19)
> On Thu, 16 Jan 2025 15:42:41 -0800 Jakub Kicinski wrote:
> > On Thu, 16 Jan 2025 14:36:17 +0100 Antoine Tenart wrote:
> > > While refreshing the series, especially after adding the dev_isalive()
> > > check, I found out we actually do not need to drop the sysfs protection
> > > and hold a reference to the net device during the whole rtnl locking
> > > section. This is because after getting the rtnl lock and once we know
> > > the net device dismantle hasn't started yet, we're sure dismantle won't
> > > start (and the device won't be freed) until we give back the rtnl lock.
> > > 
> > > This makes the new helpers easier to use, does not require to expose
> > > the kernfs node to users, making the code more contained; but the
> > > locking order is not as perfect.
> > > 
> > > We would go from (version 1),
> > > 
> > > 1. unlocking sysfs
> > > 2. locking rtnl
> > > 3. unlocking rtnl
> > > 4. locking sysfs
> > > 
> > > to (version 2),
> > > 
> > > 1. unlocking sysfs
> > > 2. locking rtnl
> > > 3. locking sysfs
> > > 4. unlocking rtnl
> > > 
> > > This is actually fine because the "sysfs lock" isn't a lock but a
> > > refcnt, with the only deadlock situation being when draining it.
> > > 
> > > Version 1: https://github.com/atenart/linux/commit/596c5d9895ccdb75057978abd6be1a42ee4b448e
> > > Version 2: https://github.com/atenart/linux/commit/c6659bb26f564f1fd63d1c279616f57141e9f2bf
> > > 
> > > Thoughts? Apart from that question, either series is ready for
> > > submission.  
> > 
> > Nice, yes, I think that works!
> 
> To be clear - by "that" I mean version 2 :)

Thanks! Will send version 2 then.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-01-17  8:26 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-10-18 15:47 [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Antoine Tenart
2023-10-18 15:47 ` [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes Antoine Tenart
2023-10-18 16:49   ` Greg KH
2023-10-19  8:13     ` Antoine Tenart
2023-10-19 15:37       ` Greg KH
2023-10-18 18:13   ` Stephen Hemminger
2023-10-19  7:48     ` Antoine Tenart
2025-01-02 22:36   ` Jakub Kicinski
2025-01-03  8:41     ` Eric Dumazet
2025-01-07 16:30     ` Antoine Tenart
2025-01-07 17:06       ` Jakub Kicinski
2025-01-16 13:36         ` Antoine Tenart
2025-01-16 23:42           ` Jakub Kicinski
2025-01-16 23:43             ` Jakub Kicinski
2025-01-17  8:26               ` Antoine Tenart
2023-10-18 15:47 ` [RFC PATCH net-next 2/4] net-sysfs: move queue attribute groups outside the default groups Antoine Tenart
2023-10-18 15:47 ` [RFC PATCH net-next 3/4] net-sysfs: prevent uncleared queues from being re-added Antoine Tenart
2023-10-18 15:47 ` [RFC PATCH net-next 4/4] net-sysfs: remove rtnl_trylock from queue attributes Antoine Tenart
2023-10-18 15:57 ` [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Stephen Hemminger
2023-10-18 16:34   ` Antoine Tenart
2023-10-18 18:15 ` Stephen Hemminger
2023-10-19  7:47   ` Antoine Tenart
2023-10-19 14:54     ` Stephen Hemminger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).