netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice()
@ 2025-01-14 20:55 Eric Dumazet
  2025-01-14 20:55 ` [PATCH v3 net-next 1/5] net: expedite synchronize_net() for cleanup_net() Eric Dumazet
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Eric Dumazet @ 2025-01-14 20:55 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Simon Horman, eric.dumazet, Eric Dumazet

One major source of RTNL contention resides in unregister_netdevice()

Due to RCU protection of various network structures, and
unregister_netdevice() being a synchronous function,
it is calling potentially slow functions while holding RTNL.

I think we can release RTNL in two points, so that three
slow functions are called while RTNL can be used
by other threads.

v3: Deal with CONFIG_NET_NS=n
    Use net_todo_list_for_cleanup_net in third patch.
    Split the last patch in two parts for future bisections.

v2: Only temporarily release RTNL from cleanup_net()

v1: https://lore.kernel.org/netdev/20250107130906.098fc8d6@kernel.org/T/#m398c95f5778e1ff70938e079d3c4c43c050ad2a6



Eric Dumazet (5):
  net: expedite synchronize_net() for cleanup_net()
  net: no longer assume RTNL is held in flush_all_backlogs()
  net: no longer hold RTNL while calling flush_all_backlogs()
  net: reduce RTNL hold duration in unregister_netdevice_many_notify()
    (part 1)
  net: reduce RTNL hold duration in unregister_netdevice_many_notify()
    (part 2)

 include/net/net_namespace.h |  2 +
 net/core/dev.c              | 97 ++++++++++++++++++++++++++++---------
 net/core/net_namespace.c    |  5 ++
 3 files changed, 82 insertions(+), 22 deletions(-)

-- 
2.48.0.rc2.279.g1de40edade-goog


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3 net-next 1/5] net: expedite synchronize_net() for cleanup_net()
  2025-01-14 20:55 [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice() Eric Dumazet
@ 2025-01-14 20:55 ` Eric Dumazet
  2025-01-14 22:57   ` Jesse Brandeburg
  2025-01-14 20:55 ` [PATCH v3 net-next 2/5] net: no longer assume RTNL is held in flush_all_backlogs() Eric Dumazet
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2025-01-14 20:55 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Simon Horman, eric.dumazet, Eric Dumazet

cleanup_net() is the single thread responsible
for netns dismantles, and a serious bottleneck.

Before we can get per-netns RTNL, make sure
all synchronize_net() called from this thread
are using rcu_synchronize_expedited().

v3: deal with CONFIG_NET_NS=n

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/net_namespace.h |  2 ++
 net/core/dev.c              | 11 ++++++++++-
 net/core/net_namespace.c    |  5 +++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 5a2a0df8ad91b677b515b392869c6c755be5c868..0f5eb9db0c6264efc1ac83ab577511fd6823f4fe 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -210,6 +210,8 @@ void net_ns_barrier(void);
 
 struct ns_common *get_net_ns(struct ns_common *ns);
 struct net *get_net_ns_by_fd(int fd);
+extern struct task_struct *cleanup_net_task;
+
 #else /* CONFIG_NET_NS */
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
diff --git a/net/core/dev.c b/net/core/dev.c
index fda4e1039bf01d46cfaa5f134d20e1d2bcdcfdfc..0542346a403c2602f94d8bc61f7be0ea0c64c33a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10072,6 +10072,15 @@ static void dev_index_release(struct net *net, int ifindex)
 	WARN_ON(xa_erase(&net->dev_by_index, ifindex));
 }
 
+static bool from_cleanup_net(void)
+{
+#ifdef CONFIG_NET_NS
+	return current == cleanup_net_task;
+#else
+	return false;
+#endif
+}
+
 /* Delayed registration/unregisteration */
 LIST_HEAD(net_todo_list);
 DECLARE_WAIT_QUEUE_HEAD(netdev_unregistering_wq);
@@ -11447,7 +11456,7 @@ EXPORT_SYMBOL_GPL(alloc_netdev_dummy);
 void synchronize_net(void)
 {
 	might_sleep();
-	if (rtnl_is_locked())
+	if (from_cleanup_net() || rtnl_is_locked())
 		synchronize_rcu_expedited();
 	else
 		synchronize_rcu();
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index b5cd3ae4f04cf28d43f8401a3dafebac4a297123..cb39a12b2f8295c605f08b5589932932150a1644 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -588,6 +588,8 @@ static void unhash_nsid(struct net *net, struct net *last)
 
 static LLIST_HEAD(cleanup_list);
 
+struct task_struct *cleanup_net_task;
+
 static void cleanup_net(struct work_struct *work)
 {
 	const struct pernet_operations *ops;
@@ -596,6 +598,8 @@ static void cleanup_net(struct work_struct *work)
 	LIST_HEAD(net_exit_list);
 	LIST_HEAD(dev_kill_list);
 
+	cleanup_net_task = current;
+
 	/* Atomically snapshot the list of namespaces to cleanup */
 	net_kill_list = llist_del_all(&cleanup_list);
 
@@ -670,6 +674,7 @@ static void cleanup_net(struct work_struct *work)
 		put_user_ns(net->user_ns);
 		net_free(net);
 	}
+	cleanup_net_task = NULL;
 }
 
 /**
-- 
2.48.0.rc2.279.g1de40edade-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 net-next 2/5] net: no longer assume RTNL is held in flush_all_backlogs()
  2025-01-14 20:55 [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice() Eric Dumazet
  2025-01-14 20:55 ` [PATCH v3 net-next 1/5] net: expedite synchronize_net() for cleanup_net() Eric Dumazet
@ 2025-01-14 20:55 ` Eric Dumazet
  2025-01-14 23:01   ` Jesse Brandeburg
  2025-01-14 20:55 ` [PATCH v3 net-next 3/5] net: no longer hold RTNL while calling flush_all_backlogs() Eric Dumazet
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2025-01-14 20:55 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Simon Horman, eric.dumazet, Eric Dumazet

flush_all_backlogs() uses per-cpu and static data to hold its
temporary data, on the assumption it is called under RTNL
protection.

Following patch in the series will break this assumption.

Use instead a dynamically allocated piece of memory.

In the unlikely case the allocation fails,
use a boot-time allocated memory.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/dev.c | 53 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 35 insertions(+), 18 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 0542346a403c2602f94d8bc61f7be0ea0c64c33a..b0e05e44d771bee2721d054ddbd03166cc676680 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6012,8 +6012,6 @@ void netif_receive_skb_list(struct list_head *head)
 }
 EXPORT_SYMBOL(netif_receive_skb_list);
 
-static DEFINE_PER_CPU(struct work_struct, flush_works);
-
 /* Network device is going away, flush any packets still pending */
 static void flush_backlog(struct work_struct *work)
 {
@@ -6070,36 +6068,54 @@ static bool flush_required(int cpu)
 	return true;
 }
 
+struct flush_backlogs {
+	cpumask_t		flush_cpus;
+	struct work_struct	w[];
+};
+
+static struct flush_backlogs *flush_backlogs_alloc(void)
+{
+	return kmalloc(struct_size_t(struct flush_backlogs, w, nr_cpu_ids),
+		       GFP_KERNEL);
+}
+
+static struct flush_backlogs *flush_backlogs_fallback;
+static DEFINE_MUTEX(flush_backlogs_mutex);
+
 static void flush_all_backlogs(void)
 {
-	static cpumask_t flush_cpus;
+	struct flush_backlogs *ptr = flush_backlogs_alloc();
 	unsigned int cpu;
 
-	/* since we are under rtnl lock protection we can use static data
-	 * for the cpumask and avoid allocating on stack the possibly
-	 * large mask
-	 */
-	ASSERT_RTNL();
+	if (!ptr) {
+		mutex_lock(&flush_backlogs_mutex);
+		ptr = flush_backlogs_fallback;
+	}
+	cpumask_clear(&ptr->flush_cpus);
 
 	cpus_read_lock();
 
-	cpumask_clear(&flush_cpus);
 	for_each_online_cpu(cpu) {
 		if (flush_required(cpu)) {
-			queue_work_on(cpu, system_highpri_wq,
-				      per_cpu_ptr(&flush_works, cpu));
-			cpumask_set_cpu(cpu, &flush_cpus);
+			INIT_WORK(&ptr->w[cpu], flush_backlog);
+			queue_work_on(cpu, system_highpri_wq, &ptr->w[cpu]);
+			__cpumask_set_cpu(cpu, &ptr->flush_cpus);
 		}
 	}
 
 	/* we can have in flight packet[s] on the cpus we are not flushing,
 	 * synchronize_net() in unregister_netdevice_many() will take care of
-	 * them
+	 * them.
 	 */
-	for_each_cpu(cpu, &flush_cpus)
-		flush_work(per_cpu_ptr(&flush_works, cpu));
+	for_each_cpu(cpu, &ptr->flush_cpus)
+		flush_work(&ptr->w[cpu]);
 
 	cpus_read_unlock();
+
+	if (ptr != flush_backlogs_fallback)
+		kfree(ptr);
+	else
+		mutex_unlock(&flush_backlogs_mutex);
 }
 
 static void net_rps_send_ipi(struct softnet_data *remsd)
@@ -12286,12 +12302,13 @@ static int __init net_dev_init(void)
 	 *	Initialise the packet receive queues.
 	 */
 
+	flush_backlogs_fallback = flush_backlogs_alloc();
+	if (!flush_backlogs_fallback)
+		goto out;
+
 	for_each_possible_cpu(i) {
-		struct work_struct *flush = per_cpu_ptr(&flush_works, i);
 		struct softnet_data *sd = &per_cpu(softnet_data, i);
 
-		INIT_WORK(flush, flush_backlog);
-
 		skb_queue_head_init(&sd->input_pkt_queue);
 		skb_queue_head_init(&sd->process_queue);
 #ifdef CONFIG_XFRM_OFFLOAD
-- 
2.48.0.rc2.279.g1de40edade-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 net-next 3/5] net: no longer hold RTNL while calling flush_all_backlogs()
  2025-01-14 20:55 [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice() Eric Dumazet
  2025-01-14 20:55 ` [PATCH v3 net-next 1/5] net: expedite synchronize_net() for cleanup_net() Eric Dumazet
  2025-01-14 20:55 ` [PATCH v3 net-next 2/5] net: no longer assume RTNL is held in flush_all_backlogs() Eric Dumazet
@ 2025-01-14 20:55 ` Eric Dumazet
  2025-01-14 23:04   ` Jesse Brandeburg
  2025-02-10  4:13   ` Kuniyuki Iwashima
  2025-01-14 20:55 ` [PATCH v3 net-next 4/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1) Eric Dumazet
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2025-01-14 20:55 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Simon Horman, eric.dumazet, Eric Dumazet

flush_all_backlogs() is called from unregister_netdevice_many_notify()
as part of netdevice dismantles.

This is currently called under RTNL, and can last up to 50 ms
on busy hosts.

There is no reason to hold RTNL at this stage, if our caller
is cleanup_net() : netns are no more visible, devices
are in NETREG_UNREGISTERING state and no other thread
could mess our state while RTNL is temporarily released.

In order to provide isolation, this patch provides a separate
'net_todo_list' for cleanup_net().

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/dev.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b0e05e44d771bee2721d054ddbd03166cc676680..f4dd92bed2223269053b6576e4954fcce218a2e5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10097,14 +10097,37 @@ static bool from_cleanup_net(void)
 #endif
 }
 
+static void rtnl_drop_if_cleanup_net(void)
+{
+	if (from_cleanup_net())
+		__rtnl_unlock();
+}
+
+static void rtnl_acquire_if_cleanup_net(void)
+{
+	if (from_cleanup_net())
+		rtnl_lock();
+}
+
 /* Delayed registration/unregisteration */
 LIST_HEAD(net_todo_list);
+static LIST_HEAD(net_todo_list_for_cleanup_net);
+
+/* TODO: net_todo_list/net_todo_list_for_cleanup_net should probably
+ * be provided by callers, instead of being static, rtnl protected.
+ */
+static struct list_head *todo_list(void)
+{
+	return from_cleanup_net() ? &net_todo_list_for_cleanup_net :
+				    &net_todo_list;
+}
+
 DECLARE_WAIT_QUEUE_HEAD(netdev_unregistering_wq);
 atomic_t dev_unreg_count = ATOMIC_INIT(0);
 
 static void net_set_todo(struct net_device *dev)
 {
-	list_add_tail(&dev->todo_list, &net_todo_list);
+	list_add_tail(&dev->todo_list, todo_list());
 }
 
 static netdev_features_t netdev_sync_upper_features(struct net_device *lower,
@@ -10952,7 +10975,7 @@ void netdev_run_todo(void)
 #endif
 
 	/* Snapshot list, allow later requests */
-	list_replace_init(&net_todo_list, &list);
+	list_replace_init(todo_list(), &list);
 
 	__rtnl_unlock();
 
@@ -11575,8 +11598,10 @@ void unregister_netdevice_many_notify(struct list_head *head,
 		unlist_netdevice(dev);
 		WRITE_ONCE(dev->reg_state, NETREG_UNREGISTERING);
 	}
-	flush_all_backlogs();
 
+	rtnl_drop_if_cleanup_net();
+	flush_all_backlogs();
+	rtnl_acquire_if_cleanup_net();
 	synchronize_net();
 
 	list_for_each_entry(dev, head, unreg_list) {
-- 
2.48.0.rc2.279.g1de40edade-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 net-next 4/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)
  2025-01-14 20:55 [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice() Eric Dumazet
                   ` (2 preceding siblings ...)
  2025-01-14 20:55 ` [PATCH v3 net-next 3/5] net: no longer hold RTNL while calling flush_all_backlogs() Eric Dumazet
@ 2025-01-14 20:55 ` Eric Dumazet
  2025-01-14 23:05   ` Jesse Brandeburg
  2025-01-14 20:55 ` [PATCH v3 net-next 5/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2) Eric Dumazet
  2025-01-16  3:30 ` [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice() patchwork-bot+netdevbpf
  5 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2025-01-14 20:55 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Simon Horman, eric.dumazet, Eric Dumazet

Two synchronize_net() calls are currently done while holding RTNL.

This is source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.

For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().

This should be safe, because devices are no longer visible
to other threads at this point.

In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index f4dd92bed2223269053b6576e4954fcce218a2e5..574bd40f3a2bfcc6e43300fad669b1579d48039a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -11602,6 +11602,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
 	rtnl_drop_if_cleanup_net();
 	flush_all_backlogs();
 	rtnl_acquire_if_cleanup_net();
+	/* TODO: move this before the prior rtnl_acquire_if_cleanup_net() */
 	synchronize_net();
 
 	list_for_each_entry(dev, head, unreg_list) {
@@ -11662,7 +11663,9 @@ void unregister_netdevice_many_notify(struct list_head *head,
 #endif
 	}
 
+	rtnl_drop_if_cleanup_net();
 	synchronize_net();
+	rtnl_acquire_if_cleanup_net();
 
 	list_for_each_entry(dev, head, unreg_list) {
 		netdev_put(dev, &dev->dev_registered_tracker);
-- 
2.48.0.rc2.279.g1de40edade-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 net-next 5/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)
  2025-01-14 20:55 [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice() Eric Dumazet
                   ` (3 preceding siblings ...)
  2025-01-14 20:55 ` [PATCH v3 net-next 4/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1) Eric Dumazet
@ 2025-01-14 20:55 ` Eric Dumazet
  2025-01-14 23:09   ` Jesse Brandeburg
  2025-01-16  3:30 ` [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice() patchwork-bot+netdevbpf
  5 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2025-01-14 20:55 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Simon Horman, eric.dumazet, Eric Dumazet

One synchronize_net() call is currently done while holding RTNL.

This is source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.

For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().

This should be safe, because devices are no longer visible
to other threads after unlist_netdevice() call
and setting dev->reg_state to NETREG_UNREGISTERING.

In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/dev.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 574bd40f3a2bfcc6e43300fad669b1579d48039a..0bb97bb392dcdfa82292fe1ae2ee8290b036df60 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -11601,9 +11601,8 @@ void unregister_netdevice_many_notify(struct list_head *head,
 
 	rtnl_drop_if_cleanup_net();
 	flush_all_backlogs();
-	rtnl_acquire_if_cleanup_net();
-	/* TODO: move this before the prior rtnl_acquire_if_cleanup_net() */
 	synchronize_net();
+	rtnl_acquire_if_cleanup_net();
 
 	list_for_each_entry(dev, head, unreg_list) {
 		struct sk_buff *skb = NULL;
-- 
2.48.0.rc2.279.g1de40edade-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 1/5] net: expedite synchronize_net() for cleanup_net()
  2025-01-14 20:55 ` [PATCH v3 net-next 1/5] net: expedite synchronize_net() for cleanup_net() Eric Dumazet
@ 2025-01-14 22:57   ` Jesse Brandeburg
  0 siblings, 0 replies; 13+ messages in thread
From: Jesse Brandeburg @ 2025-01-14 22:57 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Simon Horman, eric.dumazet, jbrandeb@kernel.org

On 1/14/25 12:55 PM, Eric Dumazet wrote:
> cleanup_net() is the single thread responsible
> for netns dismantles, and a serious bottleneck.
> 
> Before we can get per-netns RTNL, make sure
> all synchronize_net() called from this thread
> are using rcu_synchronize_expedited().
> 
> v3: deal with CONFIG_NET_NS=n
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Seems like a good cleanup!

Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 2/5] net: no longer assume RTNL is held in flush_all_backlogs()
  2025-01-14 20:55 ` [PATCH v3 net-next 2/5] net: no longer assume RTNL is held in flush_all_backlogs() Eric Dumazet
@ 2025-01-14 23:01   ` Jesse Brandeburg
  0 siblings, 0 replies; 13+ messages in thread
From: Jesse Brandeburg @ 2025-01-14 23:01 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Simon Horman, eric.dumazet

On 1/14/25 12:55 PM, Eric Dumazet wrote:
> flush_all_backlogs() uses per-cpu and static data to hold its
> temporary data, on the assumption it is called under RTNL
> protection.
> 
> Following patch in the series will break this assumption.
> 
> Use instead a dynamically allocated piece of memory.
> 
> In the unlikely case the allocation fails,
> use a boot-time allocated memory.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 3/5] net: no longer hold RTNL while calling flush_all_backlogs()
  2025-01-14 20:55 ` [PATCH v3 net-next 3/5] net: no longer hold RTNL while calling flush_all_backlogs() Eric Dumazet
@ 2025-01-14 23:04   ` Jesse Brandeburg
  2025-02-10  4:13   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 13+ messages in thread
From: Jesse Brandeburg @ 2025-01-14 23:04 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Simon Horman, eric.dumazet

On 1/14/25 12:55 PM, Eric Dumazet wrote:
> flush_all_backlogs() is called from unregister_netdevice_many_notify()
> as part of netdevice dismantles.
> 
> This is currently called under RTNL, and can last up to 50 ms
> on busy hosts.
> 
> There is no reason to hold RTNL at this stage, if our caller
> is cleanup_net() : netns are no more visible, devices
> are in NETREG_UNREGISTERING state and no other thread
> could mess our state while RTNL is temporarily released.
> 
> In order to provide isolation, this patch provides a separate
> 'net_todo_list' for cleanup_net().
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 4/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)
  2025-01-14 20:55 ` [PATCH v3 net-next 4/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1) Eric Dumazet
@ 2025-01-14 23:05   ` Jesse Brandeburg
  0 siblings, 0 replies; 13+ messages in thread
From: Jesse Brandeburg @ 2025-01-14 23:05 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Simon Horman, eric.dumazet

On 1/14/25 12:55 PM, Eric Dumazet wrote:
> Two synchronize_net() calls are currently done while holding RTNL.
> 
> This is source of RTNL contention in workloads adding and deleting
> many network namespaces per second, because synchronize_rcu()
> and synchronize_rcu_expedited() can use 60+ ms in some cases.
> 
> For cleanup_net() use, temporarily release RTNL
> while calling the last synchronize_net().
> 
> This should be safe, because devices are no longer visible
> to other threads at this point.
> 
> In any case, the new netdev_lock() / netdev_unlock()
> infrastructure that we are adding should allow
> to fix potential issues, with a combination
> of a per-device mutex and dev->reg_state awareness.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com

Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 5/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)
  2025-01-14 20:55 ` [PATCH v3 net-next 5/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2) Eric Dumazet
@ 2025-01-14 23:09   ` Jesse Brandeburg
  0 siblings, 0 replies; 13+ messages in thread
From: Jesse Brandeburg @ 2025-01-14 23:09 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Simon Horman, eric.dumazet

On 1/14/25 12:55 PM, Eric Dumazet wrote:
> One synchronize_net() call is currently done while holding RTNL.
> 
> This is source of RTNL contention in workloads adding and deleting
> many network namespaces per second, because synchronize_rcu()
> and synchronize_rcu_expedited() can use 60+ ms in some cases.
> 
> For cleanup_net() use, temporarily release RTNL
> while calling the last synchronize_net().
> 
> This should be safe, because devices are no longer visible
> to other threads after unlist_netdevice() call
> and setting dev->reg_state to NETREG_UNREGISTERING.
> 
> In any case, the new netdev_lock() / netdev_unlock()
> infrastructure that we are adding should allow
> to fix potential issues, with a combination
> of a per-device mutex and dev->reg_state awareness.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice()
  2025-01-14 20:55 [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice() Eric Dumazet
                   ` (4 preceding siblings ...)
  2025-01-14 20:55 ` [PATCH v3 net-next 5/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2) Eric Dumazet
@ 2025-01-16  3:30 ` patchwork-bot+netdevbpf
  5 siblings, 0 replies; 13+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-01-16  3:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, kuba, pabeni, netdev, horms, eric.dumazet

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 14 Jan 2025 20:55:26 +0000 you wrote:
> One major source of RTNL contention resides in unregister_netdevice()
> 
> Due to RCU protection of various network structures, and
> unregister_netdevice() being a synchronous function,
> it is calling potentially slow functions while holding RTNL.
> 
> I think we can release RTNL in two points, so that three
> slow functions are called while RTNL can be used
> by other threads.
> 
> [...]

Here is the summary with links:
  - [v3,net-next,1/5] net: expedite synchronize_net() for cleanup_net()
    https://git.kernel.org/netdev/net-next/c/0734d7c3d93c
  - [v3,net-next,2/5] net: no longer assume RTNL is held in flush_all_backlogs()
    https://git.kernel.org/netdev/net-next/c/8a2b61e9e879
  - [v3,net-next,3/5] net: no longer hold RTNL while calling flush_all_backlogs()
    https://git.kernel.org/netdev/net-next/c/cfa579f66656
  - [v3,net-next,4/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)
    https://git.kernel.org/netdev/net-next/c/ae646f1a0bb9
  - [v3,net-next,5/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)
    https://git.kernel.org/netdev/net-next/c/83419b61d187

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 3/5] net: no longer hold RTNL while calling flush_all_backlogs()
  2025-01-14 20:55 ` [PATCH v3 net-next 3/5] net: no longer hold RTNL while calling flush_all_backlogs() Eric Dumazet
  2025-01-14 23:04   ` Jesse Brandeburg
@ 2025-02-10  4:13   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 13+ messages in thread
From: Kuniyuki Iwashima @ 2025-02-10  4:13 UTC (permalink / raw)
  To: edumazet; +Cc: davem, eric.dumazet, horms, kuba, netdev, pabeni

Hi Eric,

From: Eric Dumazet <edumazet@google.com>
Date: Tue, 14 Jan 2025 20:55:29 +0000
> @@ -11575,8 +11598,10 @@ void unregister_netdevice_many_notify(struct list_head *head,
>  		unlist_netdevice(dev);
>  		WRITE_ONCE(dev->reg_state, NETREG_UNREGISTERING);
>  	}
> -	flush_all_backlogs();
>  
> +	rtnl_drop_if_cleanup_net();
> +	flush_all_backlogs();
> +	rtnl_acquire_if_cleanup_net();
>  	synchronize_net();
>  
>  	list_for_each_entry(dev, head, unreg_list) {

One of my syzkaller setup happend to not have the revert of this series
and this hunk seemed to trigger BUG_ON(dev->reg_state != NETREG_REGISTERED)
for PPP.

ppp_release() assumed that RTNL is not released until ->ndo_uninit() that
clears ppp->owner to NULL, so this change may be needed in the next try,
just fyi.

Thanks!

---8<---
diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index 4583e15ad03a..ccf3b708bbc9 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -406,7 +406,8 @@ static int ppp_release(struct inode *unused, struct file *file)
 		if (pf->kind == INTERFACE) {
 			ppp = PF_TO_PPP(pf);
 			rtnl_lock();
-			if (file == ppp->owner)
+			if (file == ppp->owner &&
+			    ppp->dev->reg_state == NETREG_REGISTERED)
 				unregister_netdevice(ppp->dev);
 			rtnl_unlock();
 		}
---8<---

---8<---
kernel BUG at net/core/dev.c:11773!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 UID: 0 PID: 1681 Comm: syz.2.364 Not tainted 6.13.0-04046-g0ad9617c78ac #25 2a4f595e37b581d176eb9aae48dfe81ca9e88551
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:unregister_netdevice_many_notify+0x1d44/0x1d50 net/core/dev.c:11773
Code: c1 80 61 30 87 80 e1 07 80 c1 03 38 c1 0f 8c c8 e8 ff ff 48 c7 c7 80 61 30 87 e8 17 6b dc fd e9 b7 e8 ff ff e8 3d c6 8b fd 90 <0f> 0b e8 35 c6 8b fd 90 0f 0b 66 90 55 41 57 41 56 41 55 41 54 53
RSP: 0018:ffa000001440f6c0 EFLAGS: 00010293
RAX: ffffffff83cad063 RBX: 1fe22000029bc02f RCX: ff11000103764480
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000001
RBP: ffa000001440f890 R08: ffffffff8709abc7 R09: 1ffffffff0e13578
R10: dffffc0000000000 R11: fffffbfff0e13579 R12: ffa000001440f8e0
R13: ffa000001440f8e0 R14: 0000000000000002 R15: 0000000000000002
FS:  0000000000000000(0000) GS:ff1100011a000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9c1a46e000 CR3: 00000001095ee002 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
PKRU: 80000000
Call Trace:
 <TASK>
 unregister_netdevice_many net/core/dev.c:11875 [inline]
 unregister_netdevice_queue+0x33d/0x380 net/core/dev.c:11741
 unregister_netdevice include/linux/netdevice.h:3329 [inline]
 ppp_release+0xed/0x1f0 drivers/net/ppp/ppp_generic.c:410
 __fput+0x212/0xa60 fs/file_table.c:450
 task_work_run+0x1cb/0x240 kernel/task_work.c:239
 exit_task_work include/linux/task_work.h:43 [inline]
 do_exit+0x87e/0x2470 kernel/exit.c:938
 do_group_exit+0x21c/0x2d0 kernel/exit.c:1087
 get_signal+0x1206/0x12c0 kernel/signal.c:3036
 arch_do_signal_or_restart+0x87/0x7a0 arch/x86/kernel/signal.c:337
 exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
 exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
 __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
 syscall_exit_to_user_mode+0x8b/0x110 kernel/entry/common.c:218
 do_syscall_64+0xf1/0x1c0 arch/x86/entry/common.c:89
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f35b2c05d29
Code: Unable to access opcode bytes at 0x7f35b2c05cff.
RSP: 002b:00007f35b1235038 EFLAGS: 00000246 ORIG_RAX: 000000000000012a
RAX: 0000000000000005 RBX: 00007f35b2df6160 RCX: 00007f35b2c05d29
RDX: fdffffffffffffff RSI: 0000000000000000 RDI: 0000000020000100
RBP: 00007f35b2c81b08 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f35b2df6160 R15: 00007f35b2f1fa28
 </TASK>
Modules linked in:
---8<---

^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-02-10  4:14 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-14 20:55 [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice() Eric Dumazet
2025-01-14 20:55 ` [PATCH v3 net-next 1/5] net: expedite synchronize_net() for cleanup_net() Eric Dumazet
2025-01-14 22:57   ` Jesse Brandeburg
2025-01-14 20:55 ` [PATCH v3 net-next 2/5] net: no longer assume RTNL is held in flush_all_backlogs() Eric Dumazet
2025-01-14 23:01   ` Jesse Brandeburg
2025-01-14 20:55 ` [PATCH v3 net-next 3/5] net: no longer hold RTNL while calling flush_all_backlogs() Eric Dumazet
2025-01-14 23:04   ` Jesse Brandeburg
2025-02-10  4:13   ` Kuniyuki Iwashima
2025-01-14 20:55 ` [PATCH v3 net-next 4/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1) Eric Dumazet
2025-01-14 23:05   ` Jesse Brandeburg
2025-01-14 20:55 ` [PATCH v3 net-next 5/5] net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2) Eric Dumazet
2025-01-14 23:09   ` Jesse Brandeburg
2025-01-16  3:30 ` [PATCH v3 net-next 0/5] net: reduce RTNL pressure in unregister_netdevice() patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).