public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Kirill Tkhai <ktkhai@virtuozzo.com>
To: davem@davemloft.net, vyasevic@redhat.com,
	kstewart@linuxfoundation.org, pombredanne@nexb.com,
	vyasevich@gmail.com, mark.rutland@arm.com,
	gregkh@linuxfoundation.org, adobriyan@gmail.com, fw@strlen.de,
	nicolas.dichtel@6wind.com, xiyou.wangcong@gmail.com,
	roman.kapl@sysgo.com, paul@paul-moore.com, dsahern@gmail.com,
	daniel@iogearbox.net, lucien.xin@gmail.com,
	mschiffer@universe-factory.net, rshearma@brocade.com,
	netdev@vger.kernel.org, ktkhai@virtuozzo.com,
	ebiederm@xmission.com, avagin@virtuozzo.com,
	gorcunov@virtuozzo.com, eric.dumazet@gmail.com,
	stephen@networkplumber.org, ktkhai@virtuozzo.com
Subject: [PATCH net-next v3 03/32] net: Introduce net_sem for protection of pernet_list
Date: Tue, 13 Feb 2018 12:26:23 +0300	[thread overview]
Message-ID: <151851398385.5034.3452700099290073969.stgit@localhost.localdomain> (raw)
In-Reply-To: <151851357738.5034.10272265431844825686.stgit@localhost.localdomain>

Currently, the mutex is mostly used to protect pernet operations
list. It orders setup_net() and cleanup_net() with parallel
{un,}register_pernet_operations() calls, so ->exit{,batch} methods
of the same pernet operations are executed for a dying net, as
were used to call ->init methods, even after the net namespace
is unlinked from net_namespace_list in cleanup_net().

But there are several problems with scalability. The first one
is that more than one net can't be created or destroyed
at the same moment on the node. For big machines with many cpus
running many containers it's very sensitive.

The second one is that it's need to synchronize_rcu() after net
is removed from net_namespace_list():

Destroy net_ns:
cleanup_net()
  mutex_lock(&net_mutex)
  list_del_rcu(&net->list)
  synchronize_rcu()                                  <--- Sleep there for ages
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list)
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_free_list(ops, &net_exit_list)
  mutex_unlock(&net_mutex)

This primitive is not fast, especially on the systems with many processors
and/or when preemptible RCU is enabled in config. So, all the time, while
cleanup_net() is waiting for RCU grace period, creation of new net namespaces
is not possible, the tasks, who makes it, are sleeping on the same mutex:

Create net_ns:
copy_net_ns()
  mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages

I observed 20-30 seconds hangs of "unshare -n" on ordinary 8-cpu laptop
with preemptible RCU enabled after CRIU tests round is finished.

The solution is to convert net_mutex to the rw_semaphore and add fine grain
locks to really small number of pernet_operations, what really need them.

Then, pernet_operations::init/::exit methods, modifying the net-related data,
will require down_read() locking only, while down_write() will be used
for changing pernet_list (i.e., when modules are being loaded and unloaded).

This gives signify performance increase, after all patch set is applied,
like you may see here:

%for i in {1..10000}; do unshare -n bash -c exit; done

*before*
real 1m40,377s
user 0m9,672s
sys 0m19,928s

*after*
real 0m17,007s
user 0m5,311s
sys 0m11,779

(5.8 times faster)

This patch starts replacing net_mutex to net_sem. It adds rw_semaphore,
describes the variables it protects, and makes to use, where appropriate.
net_mutex is still present, and next patches will kick it out step-by-step.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
---
 include/linux/rtnetlink.h |    1 +
 net/core/net_namespace.c  |   39 ++++++++++++++++++++++++++-------------
 net/core/rtnetlink.c      |    4 ++--
 3 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 1fdcde96eb65..e9ee9ad0a681 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -36,6 +36,7 @@ extern int rtnl_is_locked(void);
 
 extern wait_queue_head_t netdev_unregistering_wq;
 extern struct mutex net_mutex;
+extern struct rw_semaphore net_sem;
 
 #ifdef CONFIG_PROVE_LOCKING
 extern bool lockdep_rtnl_is_held(void);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 81384386f91b..e89b2b7abd36 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -41,6 +41,11 @@ struct net init_net = {
 EXPORT_SYMBOL(init_net);
 
 static bool init_net_initialized;
+/*
+ * net_sem: protects: pernet_list, net_generic_ids,
+ * init_net_initialized and first_device pointer.
+ */
+DECLARE_RWSEM(net_sem);
 
 #define MIN_PERNET_OPS_ID	\
 	((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *))
@@ -286,7 +291,7 @@ struct net *get_net_ns_by_id(struct net *net, int id)
  */
 static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 {
-	/* Must be called with net_mutex held */
+	/* Must be called with net_sem held */
 	const struct pernet_operations *ops, *saved_ops;
 	int error = 0;
 	LIST_HEAD(net_exit_list);
@@ -418,12 +423,16 @@ struct net *copy_net_ns(unsigned long flags,
 	net->ucounts = ucounts;
 	get_user_ns(user_ns);
 
-	rv = mutex_lock_killable(&net_mutex);
+	rv = down_read_killable(&net_sem);
 	if (rv < 0)
 		goto put_userns;
-
+	rv = mutex_lock_killable(&net_mutex);
+	if (rv < 0)
+		goto up_read;
 	rv = setup_net(net, user_ns);
 	mutex_unlock(&net_mutex);
+up_read:
+	up_read(&net_sem);
 	if (rv < 0) {
 put_userns:
 		put_user_ns(user_ns);
@@ -477,6 +486,7 @@ static void cleanup_net(struct work_struct *work)
 	list_replace_init(&cleanup_list, &net_kill_list);
 	spin_unlock_irq(&cleanup_list_lock);
 
+	down_read(&net_sem);
 	mutex_lock(&net_mutex);
 
 	/* Don't let anyone else find us. */
@@ -517,6 +527,7 @@ static void cleanup_net(struct work_struct *work)
 		ops_free_list(ops, &net_exit_list);
 
 	mutex_unlock(&net_mutex);
+	up_read(&net_sem);
 
 	/* Ensure there are no outstanding rcu callbacks using this
 	 * network namespace.
@@ -543,8 +554,10 @@ static void cleanup_net(struct work_struct *work)
  */
 void net_ns_barrier(void)
 {
+	down_write(&net_sem);
 	mutex_lock(&net_mutex);
 	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 }
 EXPORT_SYMBOL(net_ns_barrier);
 
@@ -871,12 +884,12 @@ static int __init net_ns_init(void)
 
 	rcu_assign_pointer(init_net.gen, ng);
 
-	mutex_lock(&net_mutex);
+	down_write(&net_sem);
 	if (setup_net(&init_net, &init_user_ns))
 		panic("Could not setup the initial network namespace");
 
 	init_net_initialized = true;
-	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 
 	register_pernet_subsys(&net_ns_ops);
 
@@ -1016,9 +1029,9 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
 int register_pernet_subsys(struct pernet_operations *ops)
 {
 	int error;
-	mutex_lock(&net_mutex);
+	down_write(&net_sem);
 	error =  register_pernet_operations(first_device, ops);
-	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 	return error;
 }
 EXPORT_SYMBOL_GPL(register_pernet_subsys);
@@ -1034,9 +1047,9 @@ EXPORT_SYMBOL_GPL(register_pernet_subsys);
  */
 void unregister_pernet_subsys(struct pernet_operations *ops)
 {
-	mutex_lock(&net_mutex);
+	down_write(&net_sem);
 	unregister_pernet_operations(ops);
-	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 }
 EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
 
@@ -1062,11 +1075,11 @@ EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
 int register_pernet_device(struct pernet_operations *ops)
 {
 	int error;
-	mutex_lock(&net_mutex);
+	down_write(&net_sem);
 	error = register_pernet_operations(&pernet_list, ops);
 	if (!error && (first_device == &pernet_list))
 		first_device = &ops->list;
-	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 	return error;
 }
 EXPORT_SYMBOL_GPL(register_pernet_device);
@@ -1082,11 +1095,11 @@ EXPORT_SYMBOL_GPL(register_pernet_device);
  */
 void unregister_pernet_device(struct pernet_operations *ops)
 {
-	mutex_lock(&net_mutex);
+	down_write(&net_sem);
 	if (&ops->list == first_device)
 		first_device = first_device->next;
 	unregister_pernet_operations(ops);
-	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 }
 EXPORT_SYMBOL_GPL(unregister_pernet_device);
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index bc290413a49d..257e7bbaffba 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -454,11 +454,11 @@ static void rtnl_lock_unregistering_all(void)
 void rtnl_link_unregister(struct rtnl_link_ops *ops)
 {
 	/* Close the race with cleanup_net() */
-	mutex_lock(&net_mutex);
+	down_write(&net_sem);
 	rtnl_lock_unregistering_all();
 	__rtnl_link_unregister(ops);
 	rtnl_unlock();
-	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 }
 EXPORT_SYMBOL_GPL(rtnl_link_unregister);
 

  parent reply	other threads:[~2018-02-13  9:26 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-13  9:25 [PATCH net-next v3 00/32] Replacing net_mutex with rw_semaphore Kirill Tkhai
2018-02-13  9:26 ` [PATCH net-next v3 01/32] net: Assign net to net_namespace_list in setup_net() Kirill Tkhai
2018-02-13  9:26 ` [PATCH net-next v3 02/32] net: Cleanup in copy_net_ns() Kirill Tkhai
2018-02-13  9:26 ` Kirill Tkhai [this message]
2018-02-26  2:04   ` [lkp-robot] [net] 37b927536f: kernel_BUG_at_net/core/net_namespace.c kernel test robot
2018-02-26  9:40     ` Kirill Tkhai
2018-02-13  9:26 ` [PATCH net-next v3 04/32] net: Move mutex_unlock() in cleanup_net() up Kirill Tkhai
2018-02-13  9:26 ` [PATCH net-next v3 05/32] net: Allow pernet_operations to be executed in parallel Kirill Tkhai
2018-02-13  9:26 ` [PATCH net-next v3 06/32] net: Convert proc_net_ns_ops Kirill Tkhai
2018-02-13  9:27 ` [PATCH net-next v3 07/32] net: Convert net_ns_ops methods Kirill Tkhai
2018-02-13  9:27 ` [PATCH net-next v3 08/32] net: Convert sysctl_pernet_ops Kirill Tkhai
2018-02-13  9:27 ` [PATCH net-next v3 09/32] net: Convert netfilter_net_ops Kirill Tkhai
2018-02-13  9:27 ` [PATCH net-next v3 10/32] net: Convert nf_log_net_ops Kirill Tkhai
2018-02-13  9:27 ` [PATCH net-next v3 11/32] net: Convert net_inuse_ops Kirill Tkhai
2018-02-13  9:27 ` [PATCH net-next v3 12/32] net: Convert net_defaults_ops Kirill Tkhai
2018-02-13  9:28 ` [PATCH net-next v3 13/32] net: Convert netlink_net_ops Kirill Tkhai
2018-02-13  9:28 ` [PATCH net-next v3 14/32] net: Convert rtnetlink_net_ops Kirill Tkhai
2018-02-13  9:28 ` [PATCH net-next v3 15/32] net: Convert audit_net_ops Kirill Tkhai
2018-02-13  9:28 ` [PATCH net-next v3 16/32] net: Convert uevent_net_ops Kirill Tkhai
2018-02-13  9:28 ` [PATCH net-next v3 17/32] net: Convert proto_net_ops Kirill Tkhai
2018-02-13  9:28 ` [PATCH net-next v3 18/32] net: Convert pernet_subsys ops, registered via net_dev_init() Kirill Tkhai
2018-02-13  9:29 ` [PATCH net-next v3 19/32] net: Convert fib_* pernet_operations, registered via subsys_initcall Kirill Tkhai
2018-02-13  9:29 ` [PATCH net-next v3 20/32] net: Convert subsys_initcall() registered pernet_operations from net/sched Kirill Tkhai
2018-02-13  9:29 ` [PATCH net-next v3 21/32] net: Convert genl_pernet_ops Kirill Tkhai
2018-02-13  9:29 ` [PATCH net-next v3 22/32] net: Convert wext_pernet_ops Kirill Tkhai
2018-02-13  9:29 ` [PATCH net-next v3 23/32] net: Convert sysctl_core_ops Kirill Tkhai
2018-02-13  9:29 ` [PATCH net-next v3 24/32] net: Convert pernet_subsys, registered from inet_init() Kirill Tkhai
2018-02-13  9:30 ` [PATCH net-next v3 25/32] net: Convert unix_net_ops Kirill Tkhai
2018-02-13  9:30 ` [PATCH net-next v3 26/32] net: Convert packet_net_ops Kirill Tkhai
2018-02-13  9:30 ` [PATCH net-next v3 27/32] net: Convert ipv4_sysctl_ops Kirill Tkhai
2018-02-13  9:30 ` [PATCH net-next v3 28/32] net: Convert addrconf_ops Kirill Tkhai
2018-02-13  9:30 ` [PATCH net-next v3 29/32] net: Convert loopback_net_ops Kirill Tkhai
2018-02-13  9:30 ` [PATCH net-next v3 30/32] net: Convert default_device_ops Kirill Tkhai
2018-02-13  9:30 ` [PATCH net-next v3 31/32] net: Convert diag_net_ops Kirill Tkhai
2018-02-13  9:31 ` [PATCH net-next v3 32/32] net: Convert netlink_tap_net_ops Kirill Tkhai
2018-02-13 15:54 ` [PATCH net-next v3 00/32] Replacing net_mutex with rw_semaphore David Miller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=151851398385.5034.3452700099290073969.stgit@localhost.localdomain \
    --to=ktkhai@virtuozzo.com \
    --cc=adobriyan@gmail.com \
    --cc=avagin@virtuozzo.com \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=dsahern@gmail.com \
    --cc=ebiederm@xmission.com \
    --cc=eric.dumazet@gmail.com \
    --cc=fw@strlen.de \
    --cc=gorcunov@virtuozzo.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=kstewart@linuxfoundation.org \
    --cc=lucien.xin@gmail.com \
    --cc=mark.rutland@arm.com \
    --cc=mschiffer@universe-factory.net \
    --cc=netdev@vger.kernel.org \
    --cc=nicolas.dichtel@6wind.com \
    --cc=paul@paul-moore.com \
    --cc=pombredanne@nexb.com \
    --cc=roman.kapl@sysgo.com \
    --cc=rshearma@brocade.com \
    --cc=stephen@networkplumber.org \
    --cc=vyasevic@redhat.com \
    --cc=vyasevich@gmail.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox