* [PATCH] net: limit a number of namespaces which can be cleaned up concurrently
@ 2016-10-12 17:32 Andrei Vagin
[not found] ` <1476293579-28582-1-git-send-email-avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Andrei Vagin @ 2016-10-12 17:32 UTC (permalink / raw)
To: netdev-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Eric W. Biederman, Andrey Vagin, David S. Miller
From: Andrey Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
The operation of destroying a netns is heavy and it is executed under
net_mutex. If many namespaces are destroyed concurrently, net_mutex can
be held for a long time, and it is impossible to create a new netns
during this period.
Nowadays, when user namespaces allow unprivileged users to create
network namespaces, this can be a real problem.
On my laptop (Fedora 24, i5-5200U, 12GB) 1000 namespaces require about
300MB of RAM and take about 8 seconds to destroy.
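A minimal reproduction sketch (not part of the submitted patch) is to pin
each new netns with an fd and then drop all of the fds at once, so that
cleanup_net() receives one large batch; the program below assumes it is
run as root or inside an unprivileged user namespace.

/*
 * Reproduction sketch: create N network namespaces, keep them alive via
 * /proc/self/ns/net fds, then release them all at once.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	enum { N = 1000 };
	static int fds[N];
	int i;

	for (i = 0; i < N; i++) {
		if (unshare(CLONE_NEWNET)) {	/* switch this task to a new netns */
			perror("unshare");
			exit(1);
		}
		fds[i] = open("/proc/self/ns/net", O_RDONLY);	/* keep the netns alive */
		if (fds[i] < 0) {
			perror("open");
			exit(1);
		}
	}

	/*
	 * Closing the fds drops the last reference to every namespace except
	 * the current one; they are then destroyed asynchronously by
	 * cleanup_net().
	 */
	for (i = 0; i < N; i++)
		close(fds[i]);
	return 0;
}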
In this patch, the number of namespaces which can be cleaned up
concurrently is limited to 32. net_mutex is released after handling each
batch of net namespaces and then taken again to handle the next one.
This allows other users to take the mutex without waiting for a long
time.
I am not sure whether we need to add a sysctl to customize this limit.
Let me know if you think it's required.
Cc: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Cc: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Andrei Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
net/core/net_namespace.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 989434f..33dd3b7 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -406,10 +406,20 @@ static void cleanup_net(struct work_struct *work)
struct net *net, *tmp;
struct list_head net_kill_list;
LIST_HEAD(net_exit_list);
+ int i = 0;
/* Atomically snapshot the list of namespaces to cleanup */
spin_lock_irq(&cleanup_list_lock);
- list_replace_init(&cleanup_list, &net_kill_list);
+ list_for_each_entry_safe(net, tmp, &cleanup_list, cleanup_list)
+ if (++i == 32)
+ break;
+ if (i == 32) {
+ list_cut_position(&net_kill_list,
+ &cleanup_list, &net->cleanup_list);
+ queue_work(netns_wq, work);
+ } else {
+ list_replace_init(&cleanup_list, &net_kill_list);
+ }
spin_unlock_irq(&cleanup_list_lock);
mutex_lock(&net_mutex);
--
2.7.4
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] net: limit a number of namespaces which can be cleaned up concurrently
[not found] ` <1476293579-28582-1-git-send-email-avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2016-10-13 15:49 ` Eric W. Biederman
[not found] ` <871szk9rl9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Eric W. Biederman @ 2016-10-13 15:49 UTC (permalink / raw)
To: Andrei Vagin
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
David S. Miller
Andrei Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> writes:
> From: Andrey Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
>
> The operation of destroying netns is heavy and it is executed under
> net_mutex. If many namespaces are destroyed concurrently, net_mutex can
> be locked for a long time. It is impossible to create a new netns during
> this period of time.
This may be the right approach, or at least the right approach to bound
net_mutex hold times, but I have to take exception to calling network
namespace cleanup heavy.
The only particularly time-consuming operations I have ever found are calls to
synchronize_rcu/synchronize_sched/synchronize_net.
Ideally we can search out those calls in the network namespace cleanup
operations and figure out how to eliminate those operations or how to
stack them.
> In our days when userns allows to create network namespaces to
> unprivilaged users, it may be a real problem.
Sorting out the synchronize_rcu calls will be a much larger
and much more effective improvement than your patch here.
> On my laptop (fedora 24, i5-5200U, 12GB) 1000 namespaces requires about
> 300MB of RAM and are being destroyed for 8 seconds.
>
> In this patch, a number of namespaces which can be cleaned up
> concurrently is limited by 32. net_mutex is released after handling each
> portion of net namespaces and then it is locked again to handle the next
> one. It allows other users to lock it without waiting for a long
> time.
>
> I am not sure whether we need to add a sysctl to costomize this limit.
> Let me know if you think it's required.
We definitely don't need an extra sysctl.
Eric
> Cc: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> Cc: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Andrei Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> ---
> net/core/net_namespace.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 989434f..33dd3b7 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -406,10 +406,20 @@ static void cleanup_net(struct work_struct *work)
> struct net *net, *tmp;
> struct list_head net_kill_list;
> LIST_HEAD(net_exit_list);
> + int i = 0;
>
> /* Atomically snapshot the list of namespaces to cleanup */
> spin_lock_irq(&cleanup_list_lock);
> - list_replace_init(&cleanup_list, &net_kill_list);
> + list_for_each_entry_safe(net, tmp, &cleanup_list, cleanup_list)
> + if (++i == 32)
> + break;
> + if (i == 32) {
> + list_cut_position(&net_kill_list,
> + &cleanup_list, &net->cleanup_list);
> + queue_work(netns_wq, work);
> + } else {
> + list_replace_init(&cleanup_list, &net_kill_list);
> + }
> spin_unlock_irq(&cleanup_list_lock);
>
> mutex_lock(&net_mutex);
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net: limit a number of namespaces which can be cleaned up concurrently
[not found] ` <871szk9rl9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-10-13 20:44 ` Andrei Vagin
2016-10-14 3:06 ` Eric W. Biederman
0 siblings, 1 reply; 8+ messages in thread
From: Andrei Vagin @ 2016-10-13 20:44 UTC (permalink / raw)
To: Eric W. Biederman
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Andrei Vagin, David S. Miller
On Thu, Oct 13, 2016 at 10:49:38AM -0500, Eric W. Biederman wrote:
> Andrei Vagin <avagin@openvz.org> writes:
>
> > From: Andrey Vagin <avagin@openvz.org>
> >
> > The operation of destroying netns is heavy and it is executed under
> > net_mutex. If many namespaces are destroyed concurrently, net_mutex can
> > be locked for a long time. It is impossible to create a new netns during
> > this period of time.
>
> This may be the right approach or at least the right approach to bound
> net_mutex hold times but I have to take exception to calling network
> namespace cleanup heavy.
>
> The only particularly time consuming operation I have ever found are calls to
> synchronize_rcu/sycrhonize_sched/synchronize_net.
I booted the kernel with maxcpus=1; in this case these functions work
very fast, but the problem is still there.
According to perf, we spend a lot of time in kobject_uevent:
- 99.96% 0.00% kworker/u4:1 [kernel.kallsyms] [k] unregister_netdevice_many
- unregister_netdevice_many
- 99.95% rollback_registered_many
- 99.64% netdev_unregister_kobject
- 33.43% netdev_queue_update_kobjects
- 33.40% kobject_put
- kobject_release
+ 33.37% kobject_uevent
+ 0.03% kobject_del
+ 0.03% sysfs_remove_group
- 33.13% net_rx_queue_update_kobjects
- kobject_put
- kobject_release
+ 33.11% kobject_uevent
+ 0.01% kobject_del
0.00% rx_queue_release
- 33.08% device_del
+ 32.75% kobject_uevent
+ 0.17% device_remove_attrs
+ 0.07% dpm_sysfs_remove
+ 0.04% device_remove_class_symlinks
+ 0.01% kobject_del
+ 0.01% device_pm_remove
+ 0.01% sysfs_remove_file_ns
+ 0.00% klist_del
+ 0.00% driver_deferred_probe_del
0.00% cleanup_glue_dir.isra.14.part.15
0.00% to_acpi_device_node
0.00% sysfs_remove_group
0.00% klist_del
0.00% device_remove_attrs
+ 0.26% call_netdevice_notifiers_info
+ 0.04% rtmsg_ifinfo_build_skb
+ 0.01% rtmsg_ifinfo_send
0.00% dev_uc_flush
0.00% netif_reset_xps_queues_gt
Someone may be listening for these uevents, so we can't stop sending them
without breaking backward compatibility. We can try to optimize
kobject_uevent...
>
> Ideally we can search those out calls in the network namespace cleanup
> operations and figuroue out how to eliminate those operations or how to
> stack them.
>
> > In our days when userns allows to create network namespaces to
> > unprivilaged users, it may be a real problem.
>
> Sorting out syncrhonize_rcu calls will be a much larger
> and much more effective improvement than your patch here.
>
> > On my laptop (fedora 24, i5-5200U, 12GB) 1000 namespaces requires about
> > 300MB of RAM and are being destroyed for 8 seconds.
> >
> > In this patch, a number of namespaces which can be cleaned up
> > concurrently is limited by 32. net_mutex is released after handling each
> > portion of net namespaces and then it is locked again to handle the next
> > one. It allows other users to lock it without waiting for a long
> > time.
> >
> > I am not sure whether we need to add a sysctl to costomize this limit.
> > Let me know if you think it's required.
>
> We definitely don't need an extra sysctl.
Thanks,
Andrei
>
> Eric
>
>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > Signed-off-by: Andrei Vagin <avagin@openvz.org>
> > ---
> > net/core/net_namespace.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> > index 989434f..33dd3b7 100644
> > --- a/net/core/net_namespace.c
> > +++ b/net/core/net_namespace.c
> > @@ -406,10 +406,20 @@ static void cleanup_net(struct work_struct *work)
> > struct net *net, *tmp;
> > struct list_head net_kill_list;
> > LIST_HEAD(net_exit_list);
> > + int i = 0;
> >
> > /* Atomically snapshot the list of namespaces to cleanup */
> > spin_lock_irq(&cleanup_list_lock);
> > - list_replace_init(&cleanup_list, &net_kill_list);
> > + list_for_each_entry_safe(net, tmp, &cleanup_list, cleanup_list)
> > + if (++i == 32)
> > + break;
> > + if (i == 32) {
> > + list_cut_position(&net_kill_list,
> > + &cleanup_list, &net->cleanup_list);
> > + queue_work(netns_wq, work);
> > + } else {
> > + list_replace_init(&cleanup_list, &net_kill_list);
> > + }
> > spin_unlock_irq(&cleanup_list_lock);
> >
> > mutex_lock(&net_mutex);
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net: limit a number of namespaces which can be cleaned up concurrently
2016-10-13 20:44 ` Andrei Vagin
@ 2016-10-14 3:06 ` Eric W. Biederman
[not found] ` <87k2db39zf.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Eric W. Biederman @ 2016-10-14 3:06 UTC (permalink / raw)
To: Andrei Vagin; +Cc: Andrei Vagin, netdev, containers, David S. Miller
Andrei Vagin <avagin@virtuozzo.com> writes:
> On Thu, Oct 13, 2016 at 10:49:38AM -0500, Eric W. Biederman wrote:
>> Andrei Vagin <avagin@openvz.org> writes:
>>
>> > From: Andrey Vagin <avagin@openvz.org>
>> >
>> > The operation of destroying netns is heavy and it is executed under
>> > net_mutex. If many namespaces are destroyed concurrently, net_mutex can
>> > be locked for a long time. It is impossible to create a new netns during
>> > this period of time.
>>
>> This may be the right approach or at least the right approach to bound
>> net_mutex hold times but I have to take exception to calling network
>> namespace cleanup heavy.
>>
>> The only particularly time consuming operation I have ever found are calls to
>> synchronize_rcu/sycrhonize_sched/synchronize_net.
>
> I booted the kernel with maxcpus=1, in this case these functions work
> very fast and the problem is there any way.
>
> Accoding to perf, we spend a lot of time in kobject_uevent:
>
> - 99.96% 0.00% kworker/u4:1 [kernel.kallsyms] [k] unregister_netdevice_many ▒
> - unregister_netdevice_many ◆
> - 99.95% rollback_registered_many ▒
> - 99.64% netdev_unregister_kobject ▒
> - 33.43% netdev_queue_update_kobjects ▒
> - 33.40% kobject_put ▒
> - kobject_release ▒
> + 33.37% kobject_uevent ▒
> + 0.03% kobject_del ▒
> + 0.03% sysfs_remove_group ▒
> - 33.13% net_rx_queue_update_kobjects ▒
> - kobject_put ▒
> - kobject_release ▒
> + 33.11% kobject_uevent ▒
> + 0.01% kobject_del ▒
> 0.00% rx_queue_release ▒
> - 33.08% device_del ▒
> + 32.75% kobject_uevent ▒
> + 0.17% device_remove_attrs ▒
> + 0.07% dpm_sysfs_remove ▒
> + 0.04% device_remove_class_symlinks ▒
> + 0.01% kobject_del ▒
> + 0.01% device_pm_remove ▒
> + 0.01% sysfs_remove_file_ns ▒
> + 0.00% klist_del ▒
> + 0.00% driver_deferred_probe_del ▒
> 0.00% cleanup_glue_dir.isra.14.part.15 ▒
> 0.00% to_acpi_device_node ▒
> 0.00% sysfs_remove_group ▒
> 0.00% klist_del ▒
> 0.00% device_remove_attrs ▒
> + 0.26% call_netdevice_notifiers_info ▒
> + 0.04% rtmsg_ifinfo_build_skb ▒
> + 0.01% rtmsg_ifinfo_send ▒
> 0.00% dev_uc_flush ▒
> 0.00% netif_reset_xps_queues_gt
>
> Someone can listen these uevents, so we can't stop sending them without
> breaking backward compatibility. We can try to optimize
> kobject_uevent...
Oh, that is a surprise. We can definitely skip generating uevents for
network namespaces that are exiting, because by definition no one can see
those network namespaces. If a socket existed that could see those
uevents, it would hold a reference to the network namespace, and as such
the network namespace could not exit.
That sounds like it is worth investigating a little more deeply.
I am surprised that allocation and freeing are so heavy that we are
spending lots of time doing that. On the other hand, kobj_bcast_filter is
very dumb and very late, so I expect something can be moved earlier to
make that code cheaper with the tiniest bit of work.
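As an illustration only (the helper below is not an existing kernel API;
it leans on the 4.8-era convention that cleanup_net() runs only after
net->count has dropped to zero), a check along these lines could
short-circuit the uevent path for kobjects whose network namespace is
already on its way out:

#include <linux/kobject_ns.h>
#include <net/net_namespace.h>

/* Hypothetical helper: is this kobject owned by a netns with no users left? */
static bool kobj_in_exiting_netns(struct kobject *kobj)
{
	const struct kobj_ns_type_operations *ops = kobj_ns_ops(kobj);
	const struct net *net;

	if (!ops || ops->type != KOBJ_NS_TYPE_NET)
		return false;
	if (!kobj->ktype || !kobj->ktype->namespace)
		return false;

	net = kobj->ktype->namespace(kobj);
	return net && atomic_read(&net->count) == 0;
}

kobj_bcast_filter() already resolves the kobject's namespace through
ktype->namespace(), so a test like this could presumably run much earlier
in kobject_uevent_env(), before any buffers are allocated or netlink
messages are built.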
Eric
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net: limit a number of namespaces which can be cleaned up concurrently
[not found] ` <87k2db39zf.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-10-14 14:09 ` David Miller
2016-10-14 21:26 ` Andrei Vagin
1 sibling, 0 replies; 8+ messages in thread
From: David Miller @ 2016-10-14 14:09 UTC (permalink / raw)
To: ebiederm-aS9lmoZGLiVWk0Htik3J/w
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
avagin-5HdwGun5lf+gSpxsJD1C4w, avagin-GEFAQzZX7r8dnm+yROfE0A
From: ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman)
Date: Thu, 13 Oct 2016 22:06:28 -0500
> Oh that is a surprise. We can definitely skip genenerating uevents for
> network namespaces that are exiting because by definition no one can see
> those network namespaces. If a socket existed that could see those
> uevents it would hold a reference to the network namespace and as such
> the network namespace could not exit.
>
> That sounds like it is worth investigating a little more deeply.
>
> I am surprised that allocation and freeing is so heavy we are spending
> lots of time doing that. On the other hand kobj_bcast_filter is very
> dumb and very late so I expect something can be moved earlier and make
> that code cheaper with the tiniest bit of work.
I definitely would rather see the uevents removed to kill ~99% of the
namespace removal overhead than add limiting.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net: limit a number of namespaces which can be cleaned up concurrently
[not found] ` <87k2db39zf.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-10-14 14:09 ` David Miller
@ 2016-10-14 21:26 ` Andrei Vagin
2016-10-15 16:36 ` Eric W. Biederman
1 sibling, 1 reply; 8+ messages in thread
From: Andrei Vagin @ 2016-10-14 21:26 UTC (permalink / raw)
To: Eric W. Biederman
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Andrei Vagin, David S. Miller
On Thu, Oct 13, 2016 at 10:06:28PM -0500, Eric W. Biederman wrote:
> Andrei Vagin <avagin-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> writes:
>
> > On Thu, Oct 13, 2016 at 10:49:38AM -0500, Eric W. Biederman wrote:
> >> Andrei Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> writes:
> >>
> >> > From: Andrey Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> >> >
> >> > The operation of destroying netns is heavy and it is executed under
> >> > net_mutex. If many namespaces are destroyed concurrently, net_mutex can
> >> > be locked for a long time. It is impossible to create a new netns during
> >> > this period of time.
> >>
> >> This may be the right approach or at least the right approach to bound
> >> net_mutex hold times but I have to take exception to calling network
> >> namespace cleanup heavy.
> >>
> >> The only particularly time consuming operation I have ever found are calls to
> >> synchronize_rcu/sycrhonize_sched/synchronize_net.
> >
> > I booted the kernel with maxcpus=1, in this case these functions work
> > very fast and the problem is there any way.
> >
> > Accoding to perf, we spend a lot of time in kobject_uevent:
> >
> > - 99.96% 0.00% kworker/u4:1 [kernel.kallsyms] [k] unregister_netdevice_many
> > - unregister_netdevice_many
> > - 99.95% rollback_registered_many
> > - 99.64% netdev_unregister_kobject
> > - 33.43% netdev_queue_update_kobjects
> > - 33.40% kobject_put
> > - kobject_release
> > + 33.37% kobject_uevent
> > + 0.03% kobject_del
> > + 0.03% sysfs_remove_group
> > - 33.13% net_rx_queue_update_kobjects
> > - kobject_put
> > - kobject_release
> > + 33.11% kobject_uevent
> > + 0.01% kobject_del
> > 0.00% rx_queue_release
> > - 33.08% device_del
> > + 32.75% kobject_uevent
> > + 0.17% device_remove_attrs
> > + 0.07% dpm_sysfs_remove
> > + 0.04% device_remove_class_symlinks
> > + 0.01% kobject_del
> > + 0.01% device_pm_remove
> > + 0.01% sysfs_remove_file_ns
> > + 0.00% klist_del
> > + 0.00% driver_deferred_probe_del
> > 0.00% cleanup_glue_dir.isra.14.part.15
> > 0.00% to_acpi_device_node
> > 0.00% sysfs_remove_group
> > 0.00% klist_del
> > 0.00% device_remove_attrs
> > + 0.26% call_netdevice_notifiers_info
> > + 0.04% rtmsg_ifinfo_build_skb
> > + 0.01% rtmsg_ifinfo_send
> > 0.00% dev_uc_flush
> > 0.00% netif_reset_xps_queues_gt
> >
> > Someone can listen these uevents, so we can't stop sending them without
> > breaking backward compatibility. We can try to optimize
> > kobject_uevent...
>
> Oh that is a surprise. We can definitely skip genenerating uevents for
> network namespaces that are exiting because by definition no one can see
> those network namespaces. If a socket existed that could see those
> uevents it would hold a reference to the network namespace and as such
> the network namespace could not exit.
>
> That sounds like it is worth investigating a little more deeply.
>
> I am surprised that allocation and freeing is so heavy we are spending
> lots of time doing that. On the other hand kobj_bcast_filter is very
> dumb and very late so I expect something can be moved earlier and make
> that code cheaper with the tiniest bit of work.
>
I'm sorry, I collected this data on a kernel with debug options
(DEBUG_SPINLOCK, PROVE_LOCKING, DEBUG_LIST, etc.). If the kernel is
compiled without debug options, kobject_uevent becomes less expensive,
but it is still expensive.
- 98.64% 0.00% kworker/u4:2 [kernel.kallsyms] [k] cleanup_net
- cleanup_net
- 98.54% ops_exit_list.isra.4
- 60.48% default_device_exit_batch
- 60.40% unregister_netdevice_many
- rollback_registered_many
- 59.82% netdev_unregister_kobject
- 20.10% device_del
+ 19.44% kobject_uevent
+ 0.40% device_remove_attrs
+ 0.17% dpm_sysfs_remove
+ 0.04% device_remove_class_symlinks
+ 0.04% kobject_del
+ 0.01% device_pm_remove
+ 0.01% sysfs_remove_file_ns
- 19.89% netdev_queue_update_kobjects
+ 19.81% kobject_put
+ 0.07% sysfs_remove_group
- 19.79% net_rx_queue_update_kobjects
kobject_put
- kobject_release
+ 19.77% kobject_uevent
+ 0.02% kobject_del
0.01% rx_queue_release
+ 0.02% kset_unregister
0.01% pm_runtime_set_memalloc_noio
0.01% bus_remove_device
+ 0.45% call_netdevice_notifiers_info
+ 0.07% rtmsg_ifinfo_build_skb
+ 0.04% rtmsg_ifinfo_send
0.01% kset_unregister
+ 0.07% rtnl_unlock
+ 19.27% rpcsec_gss_exit_net
+ 5.45% tcp_net_metrics_exit
+ 5.31% sunrpc_exit_net
+ 3.18% ip6addrlbl_net_exit
So after removing kobject_uevent, cleanup_net becomes more than two times
faster: 1000 namespaces are cleaned up in 2.8 seconds with uevents and in
1.2 seconds without them. I did these experiments with maxcpus=1 to
exclude synchronize_rcu.
To summarize: we can skip generating uevents, but that doesn't solve the
original problem. If we want to avoid the limit introduced in this patch,
we have to reduce the time for destroying a net namespace by roughly an
order of magnitude, don't we?
Here is a perf report after skipping uevent generation:
- 93.27% 0.00% kworker/u4:1 [kernel.kallsyms] [k] cleanup_net
- cleanup_net
- 92.97% ops_exit_list.isra.4
- 35.14% rpcsec_gss_exit_net
- gss_svc_shutdown_net
- 17.40% rsc_cache_destroy_net
+ 8.64% cache_unregister_net
+ 8.52% cache_purge
+ 0.22% cache_destroy_net
+ 9.00% cache_unregister_net
+ 8.49% cache_purge
+ 0.15% destroy_use_gss_proxy_proc_entry
+ 0.10% cache_destroy_net
- 14.35% tcp_net_metrics_exit
- 7.32% tcp_metrics_flush_all
+ 4.86% _raw_spin_unlock_bh
0.59% __local_bh_enable_ip
6.12% _raw_spin_lock_bh
0.90% _raw_spin_unlock_bh
- 13.08% sunrpc_exit_net
- 6.91% ip_map_cache_destroy
+ 3.90% cache_unregister_net
+ 2.86% cache_purge
+ 0.15% cache_destroy_net
+ 5.95% unix_gid_cache_destroy
+ 0.12% rpc_pipefs_exit_net
+ 0.10% rpc_proc_exit
- 7.35% ip6addrlbl_net_exit
+ call_rcu_sched
+ 3.34% xfrm_net_exit
+ 1.22% ipv6_frags_exit_net
+ 1.17% ipv4_frags_exit_net
+ 0.78% fib_net_exit
+ 0.76% inet6_net_exit
+ 0.76% devinet_exit_net
+ 0.68% addrconf_exit_net
+ 0.63% igmp6_net_exit
+ 0.59% ipv4_mib_exit_net
+ 0.59% uevent_net_exit
> Eric
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net: limit a number of namespaces which can be cleaned up concurrently
2016-10-14 21:26 ` Andrei Vagin
@ 2016-10-15 16:36 ` Eric W. Biederman
[not found] ` <87eg3hy3fm.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Eric W. Biederman @ 2016-10-15 16:36 UTC (permalink / raw)
To: Andrei Vagin; +Cc: Andrei Vagin, netdev, containers, David S. Miller
Andrei Vagin <avagin@virtuozzo.com> writes:
> On Thu, Oct 13, 2016 at 10:06:28PM -0500, Eric W. Biederman wrote:
>> Andrei Vagin <avagin@virtuozzo.com> writes:
>>
>> > On Thu, Oct 13, 2016 at 10:49:38AM -0500, Eric W. Biederman wrote:
>> >> Andrei Vagin <avagin@openvz.org> writes:
>> >>
>> >> > From: Andrey Vagin <avagin@openvz.org>
>> >> >
>> >> > The operation of destroying netns is heavy and it is executed under
>> >> > net_mutex. If many namespaces are destroyed concurrently, net_mutex can
>> >> > be locked for a long time. It is impossible to create a new netns during
>> >> > this period of time.
>> >>
>> >> This may be the right approach or at least the right approach to bound
>> >> net_mutex hold times but I have to take exception to calling network
>> >> namespace cleanup heavy.
>> >>
>> >> The only particularly time consuming operation I have ever found are calls to
>> >> synchronize_rcu/sycrhonize_sched/synchronize_net.
>> >
>> > I booted the kernel with maxcpus=1, in this case these functions work
>> > very fast and the problem is there any way.
>> >
>> > Accoding to perf, we spend a lot of time in kobject_uevent:
>> >
>> > - 99.96% 0.00% kworker/u4:1 [kernel.kallsyms] [k] unregister_netdevice_many
>> > - unregister_netdevice_many
>> > - 99.95% rollback_registered_many
>> > - 99.64% netdev_unregister_kobject
>> > - 33.43% netdev_queue_update_kobjects
>> > - 33.40% kobject_put
>> > - kobject_release
>> > + 33.37% kobject_uevent
>> > + 0.03% kobject_del
>> > + 0.03% sysfs_remove_group
>> > - 33.13% net_rx_queue_update_kobjects
>> > - kobject_put
>> > - kobject_release
>> > + 33.11% kobject_uevent
>> > + 0.01% kobject_del
>> > 0.00% rx_queue_release
>> > - 33.08% device_del
>> > + 32.75% kobject_uevent
>> > + 0.17% device_remove_attrs
>> > + 0.07% dpm_sysfs_remove
>> > + 0.04% device_remove_class_symlinks
>> > + 0.01% kobject_del
>> > + 0.01% device_pm_remove
>> > + 0.01% sysfs_remove_file_ns
>> > + 0.00% klist_del
>> > + 0.00% driver_deferred_probe_del
>> > 0.00% cleanup_glue_dir.isra.14.part.15
>> > 0.00% to_acpi_device_node
>> > 0.00% sysfs_remove_group
>> > 0.00% klist_del
>> > 0.00% device_remove_attrs
>> > + 0.26% call_netdevice_notifiers_info
>> > + 0.04% rtmsg_ifinfo_build_skb
>> > + 0.01% rtmsg_ifinfo_send
>> > 0.00% dev_uc_flush
>> > 0.00% netif_reset_xps_queues_gt
>> >
>> > Someone can listen these uevents, so we can't stop sending them without
>> > breaking backward compatibility. We can try to optimize
>> > kobject_uevent...
>>
>> Oh that is a surprise. We can definitely skip genenerating uevents for
>> network namespaces that are exiting because by definition no one can see
>> those network namespaces. If a socket existed that could see those
>> uevents it would hold a reference to the network namespace and as such
>> the network namespace could not exit.
>>
>> That sounds like it is worth investigating a little more deeply.
>>
>> I am surprised that allocation and freeing is so heavy we are spending
>> lots of time doing that. On the other hand kobj_bcast_filter is very
>> dumb and very late so I expect something can be moved earlier and make
>> that code cheaper with the tiniest bit of work.
>>
>
> I'm sorry, I've collected this data for a kernel with debug options
> (DEBUG_SPINLOCK, PROVE_LOCKING, DEBUG_LIST, etc). If a kernel is
> compiled without debug options, kobject_uevent becomes less expensive,
> but still expensive.
>
> - 98.64% 0.00% kworker/u4:2 [kernel.kallsyms] [k] cleanup_net
> - cleanup_net
> - 98.54% ops_exit_list.isra.4
> - 60.48% default_device_exit_batch
> - 60.40% unregister_netdevice_many
> - rollback_registered_many
> - 59.82% netdev_unregister_kobject
> - 20.10% device_del
> + 19.44% kobject_uevent
> + 0.40% device_remove_attrs
> + 0.17% dpm_sysfs_remove
> + 0.04% device_remove_class_symlinks
> + 0.04% kobject_del
> + 0.01% device_pm_remove
> + 0.01% sysfs_remove_file_ns
> - 19.89% netdev_queue_update_kobjects
> + 19.81% kobject_put
> + 0.07% sysfs_remove_group
> - 19.79% net_rx_queue_update_kobjects
> kobject_put
> - kobject_release
> + 19.77% kobject_uevent
> + 0.02% kobject_del
> 0.01% rx_queue_release
> + 0.02% kset_unregister
> 0.01% pm_runtime_set_memalloc_noio
> 0.01% bus_remove_device
> + 0.45% call_netdevice_notifiers_info
> + 0.07% rtmsg_ifinfo_build_skb
> + 0.04% rtmsg_ifinfo_send
> 0.01% kset_unregister
> + 0.07% rtnl_unlock
> + 19.27% rpcsec_gss_exit_net
> + 5.45% tcp_net_metrics_exit
> + 5.31% sunrpc_exit_net
> + 3.18% ip6addrlbl_net_exit
>
>
> So after removing kobject_uevent, cleanup_net becomes more than two times faster:
>
> 1000 namespaces are cleaned up for 2.8 seconds with uevents, and 1.2 senconds
> without uevents. I do this experiments with max_cpus=1 to exclude synchronize_rcu.
>
> As a summary we can skip generating uevents, but it doesn't solve the original
> problem. If we want to avoid the limit introduced in this patch, we have
> to reduce the time for destroing net namespace in dozen times, don't
> we?
It definitely looks like optimizing kobject_uevent for this case is
worthwhile.
I would not mind getting the raw cost of network namespace cleanups below
2.8ms, or below 1.2ms with the uevent cleanups. There is just a lot going
on in the networking stack, for a lot of good reasons, so that can be
tricky.
The larger issue is that there is a trade-off between latency and
throughput in network namespace destruction. Consider the case of
vsftpd, which creates a new network namespace for every connection.
Something like that can wind up with a huge backlog of network
namespaces to clean up while continually creating more. The system will
go OOM if we don't stop and clean up what we have.
And the batching is very, very important for throughput. So the smallest
batch size we could really accept is one that does not hurt throughput
when destroying network namespaces. Otherwise we will have a growing
backlog of network namespaces to clean up and a system that eventually
stops being usable at all. In that context I think a long hold time on
net_mutex is preferable to a system that does not work at all.
Now, I would love to make both the throughput and the latency better,
and I would be all in favor of that, but it requires some deep changes to
the network namespace initialization and cleanup. Unfortunately I
haven't stared at the problem enough to know what those changes would
need to be. It would have to be something where we would not need to
serialize network namespace cleanup between different network
namespaces, and ideally something we could implement incrementally,
since there is so much networking code that I don't expect we could
verify and change everything overnight.
That, plus in practice the bottleneck has always been the
synchronize_rcu calls, which tend to take at least a millisecond apiece.
Being able to overlap those synchronize_rcu calls in the common case has
reduced the time to run the network stack cleanup code by very dramatic
amounts.
Right now I am very happy that the network namespace cleanup code is
working properly. When I started on the network stack cleanup code for
cleaning up network namespaces I found actual functional bugs. I will be
even happier if we can figure out how to make it all run fast.
But ultimately we have net_mutex and the rtnl_lock that serialize
things on the setup and cleanup paths, and to allow creation to proceed
while cleanup is ongoing we need to find a way to avoid serialization by
either of those, and I have honestly drawn a blank.
So right now my best suggestion for making things better is to find and
fix each little piece we can fix, until things are working as well as we
can make them work. It is not sexy or glamorous or fast, but it makes
things better and is the best that I can see to do.
Eric
> Here is a perf report after skipping generating uevents:
> - 93.27% 0.00% kworker/u4:1 [kernel.kallsyms] [k] cleanup_net
> - cleanup_net
> - 92.97% ops_exit_list.isra.4
> - 35.14% rpcsec_gss_exit_net
> - gss_svc_shutdown_net
> - 17.40% rsc_cache_destroy_net
> + 8.64% cache_unregister_net
> + 8.52% cache_purge
> + 0.22% cache_destroy_net
> + 9.00% cache_unregister_net
> + 8.49% cache_purge
> + 0.15% destroy_use_gss_proxy_proc_entry
> + 0.10% cache_destroy_net
> - 14.35% tcp_net_metrics_exit
> - 7.32% tcp_metrics_flush_all
> + 4.86% _raw_spin_unlock_bh
> 0.59% __local_bh_enable_ip
> 6.12% _raw_spin_lock_bh
> 0.90% _raw_spin_unlock_bh
> - 13.08% sunrpc_exit_net
> - 6.91% ip_map_cache_destroy
> + 3.90% cache_unregister_net
> + 2.86% cache_purge
> + 0.15% cache_destroy_net
> + 5.95% unix_gid_cache_destroy
> + 0.12% rpc_pipefs_exit_net
> + 0.10% rpc_proc_exit
> - 7.35% ip6addrlbl_net_exit
> + call_rcu_sched
> + 3.34% xfrm_net_exit
> + 1.22% ipv6_frags_exit_net
> + 1.17% ipv4_frags_exit_net
> + 0.78% fib_net_exit
> + 0.76% inet6_net_exit
> + 0.76% devinet_exit_net
> + 0.68% addrconf_exit_net
> + 0.63% igmp6_net_exit
> + 0.59% ipv4_mib_exit_net
> + 0.59% uevent_net_exit
>
>> Eric
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net: limit a number of namespaces which can be cleaned up concurrently
[not found] ` <87eg3hy3fm.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-10-19 18:46 ` Andrey Vagin
0 siblings, 0 replies; 8+ messages in thread
From: Andrey Vagin @ 2016-10-19 18:46 UTC (permalink / raw)
To: Eric W. Biederman, David S. Miller
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers, Andrei Vagin
[-- Attachment #1: Type: text/plain, Size: 12106 bytes --]
On Sat, Oct 15, 2016 at 9:36 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andrei Vagin <avagin-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> writes:
>
>> On Thu, Oct 13, 2016 at 10:06:28PM -0500, Eric W. Biederman wrote:
>>> Andrei Vagin <avagin-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> writes:
>>>
>>> > On Thu, Oct 13, 2016 at 10:49:38AM -0500, Eric W. Biederman wrote:
>>> >> Andrei Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> writes:
>>> >>
>>> >> > From: Andrey Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
>>> >> >
>>> >> > The operation of destroying netns is heavy and it is executed under
>>> >> > net_mutex. If many namespaces are destroyed concurrently, net_mutex can
>>> >> > be locked for a long time. It is impossible to create a new netns during
>>> >> > this period of time.
>>> >>
>>> >> This may be the right approach or at least the right approach to bound
>>> >> net_mutex hold times but I have to take exception to calling network
>>> >> namespace cleanup heavy.
>>> >>
>>> >> The only particularly time consuming operation I have ever found are calls to
>>> >> synchronize_rcu/sycrhonize_sched/synchronize_net.
>>> >
>>> > I booted the kernel with maxcpus=1, in this case these functions work
>>> > very fast and the problem is there any way.
>>> >
>>> > Accoding to perf, we spend a lot of time in kobject_uevent:
>>> >
>>> > - 99.96% 0.00% kworker/u4:1 [kernel.kallsyms] [k] unregister_netdevice_many
>>> > - unregister_netdevice_many
>>> > - 99.95% rollback_registered_many
>>> > - 99.64% netdev_unregister_kobject
>>> > - 33.43% netdev_queue_update_kobjects
>>> > - 33.40% kobject_put
>>> > - kobject_release
>>> > + 33.37% kobject_uevent
>>> > + 0.03% kobject_del
>>> > + 0.03% sysfs_remove_group
>>> > - 33.13% net_rx_queue_update_kobjects
>>> > - kobject_put
>>> > - kobject_release
>>> > + 33.11% kobject_uevent
>>> > + 0.01% kobject_del
>>> > 0.00% rx_queue_release
>>> > - 33.08% device_del
>>> > + 32.75% kobject_uevent
>>> > + 0.17% device_remove_attrs
>>> > + 0.07% dpm_sysfs_remove
>>> > + 0.04% device_remove_class_symlinks
>>> > + 0.01% kobject_del
>>> > + 0.01% device_pm_remove
>>> > + 0.01% sysfs_remove_file_ns
>>> > + 0.00% klist_del
>>> > + 0.00% driver_deferred_probe_del
>>> > 0.00% cleanup_glue_dir.isra.14.part.15
>>> > 0.00% to_acpi_device_node
>>> > 0.00% sysfs_remove_group
>>> > 0.00% klist_del
>>> > 0.00% device_remove_attrs
>>> > + 0.26% call_netdevice_notifiers_info
>>> > + 0.04% rtmsg_ifinfo_build_skb
>>> > + 0.01% rtmsg_ifinfo_send
>>> > 0.00% dev_uc_flush
>>> > 0.00% netif_reset_xps_queues_gt
>>> >
>>> > Someone can listen these uevents, so we can't stop sending them without
>>> > breaking backward compatibility. We can try to optimize
>>> > kobject_uevent...
>>>
>>> Oh that is a surprise. We can definitely skip genenerating uevents for
>>> network namespaces that are exiting because by definition no one can see
>>> those network namespaces. If a socket existed that could see those
>>> uevents it would hold a reference to the network namespace and as such
>>> the network namespace could not exit.
>>>
>>> That sounds like it is worth investigating a little more deeply.
>>>
>>> I am surprised that allocation and freeing is so heavy we are spending
>>> lots of time doing that. On the other hand kobj_bcast_filter is very
>>> dumb and very late so I expect something can be moved earlier and make
>>> that code cheaper with the tiniest bit of work.
>>>
>>
>> I'm sorry, I've collected this data for a kernel with debug options
>> (DEBUG_SPINLOCK, PROVE_LOCKING, DEBUG_LIST, etc). If a kernel is
>> compiled without debug options, kobject_uevent becomes less expensive,
>> but still expensive.
>>
>> - 98.64% 0.00% kworker/u4:2 [kernel.kallsyms] [k] cleanup_net
>> - cleanup_net
>> - 98.54% ops_exit_list.isra.4
>> - 60.48% default_device_exit_batch
>> - 60.40% unregister_netdevice_many
>> - rollback_registered_many
>> - 59.82% netdev_unregister_kobject
>> - 20.10% device_del
>> + 19.44% kobject_uevent
>> + 0.40% device_remove_attrs
>> + 0.17% dpm_sysfs_remove
>> + 0.04% device_remove_class_symlinks
>> + 0.04% kobject_del
>> + 0.01% device_pm_remove
>> + 0.01% sysfs_remove_file_ns
>> - 19.89% netdev_queue_update_kobjects
>> + 19.81% kobject_put
>> + 0.07% sysfs_remove_group
>> - 19.79% net_rx_queue_update_kobjects
>> kobject_put
>> - kobject_release
>> + 19.77% kobject_uevent
>> + 0.02% kobject_del
>> 0.01% rx_queue_release
>> + 0.02% kset_unregister
>> 0.01% pm_runtime_set_memalloc_noio
>> 0.01% bus_remove_device
>> + 0.45% call_netdevice_notifiers_info
>> + 0.07% rtmsg_ifinfo_build_skb
>> + 0.04% rtmsg_ifinfo_send
>> 0.01% kset_unregister
>> + 0.07% rtnl_unlock
>> + 19.27% rpcsec_gss_exit_net
>> + 5.45% tcp_net_metrics_exit
>> + 5.31% sunrpc_exit_net
>> + 3.18% ip6addrlbl_net_exit
>>
>>
>> So after removing kobject_uevent, cleanup_net becomes more than two times faster:
>>
>> 1000 namespaces are cleaned up for 2.8 seconds with uevents, and 1.2 senconds
>> without uevents. I do this experiments with max_cpus=1 to exclude synchronize_rcu.
>>
>> As a summary we can skip generating uevents, but it doesn't solve the original
>> problem. If we want to avoid the limit introduced in this patch, we have
>> to reduce the time for destroing net namespace in dozen times, don't
>> we?
>
> It definitely looks like optimizing kobject_uevent for this case is
> worth while.
>
> I would not mind getting the raw cost of network namespace cleanups
> below 2.8ms or with uevent cleanups 1.2ms. There is just a lot going on
> for a lot of good reasons in the networking stack so that can be tricky.
>
> The larger issue is that there is a trade off between latency and
> throughput in network namespace destruction. Consider the case of
> vsftpd. Which creates a new network namespace for every connection.
> Something like that can wind up with a huge backlog of network
> namespaces to clean up while continually creating more. The system will
> go OOM if we don't stop and cleanup what we have.
>
> And the batching is very very important for throughput. So the smallest
> batch size we could really accept is a batch size that does not hurt
> throughput when destroying network namespaces. Otherwise we will have a
> growing backlog of network namespaces to cleanup and a system that
> eventuallys stops being usable at all. In that context I think a long
> hold time on net_mutex is preferable to a system that does not work at
> all.
>
> Now I would love to make both the throughput and the latency better I
> would be all in favor of that, but that requires some deep changes to
> the network namespace initialization and cleanup. Unfortunately I
> haven't stared at the problem enough to know what those changes would
> need to be. But something where we would not need to serialize network
> namespace cleanup between different network namespaces. And ideally
> something we could implement incrementally as there is so much
> networking code I don't expect we could verify and change verything
> overnight.
Eric, I get your point. All of this sounds reasonable.
And here is another idea about net_mutex.
The longer I look at net_mutex, the more it looks like it can be
replaced with a read-write lock. It protects the per-namespace lists of
operations, which are modified only when modules are loaded or unloaded,
while the kernel only reads these lists to create or destroy a network
namespace.
Eric and David, what do you think about this idea? Do you see any
reason why it would not work?
I don't know this code well enough to be sure I'm not missing something
obvious. The attached patch shows what it looks like.
If it works, we will be able to create and destroy net namespaces
concurrently, and even call cleanup_net() from a few threads if we have
a big backlog. It is only one of the steps which may be useful for
fixing this problem.
>
> That plus in practice the bottleneck has always been the synchronize_rcu
> calls which tend to take at least a millisecond a piece. Being able
> overlap those synchronize_rcu calls in the common case has reduced
> the time to run the network stack cleanup code by very dramatic amounts.
>
> Right now I am very happy that the network namespace cleanup code is
> working properly. When I started the network stack cleanup code to
> cleanup network namespaces I found actual functional bugs. I will be
> even happier if we can figure out how to make it all run fast.
>
> But ultimately we have the net_mutex and the rtnl_lock that serialize
> things on the setup and cleanup paths and to allow creation to proceed
> while cleanup is ongoing we need to find a way to avoid serialization by
> either of those, and I have honestly drawn a blank.
>
> So right now my best suggestion for making things better is to find and
> fix each little piece we can fix. Until the things are working as best
> we can make them work. It is not sexy or glamorous or fast but it makes
> things better and is the best that I can see to do.
>
> Eric
>
>
>> Here is a perf report after skipping generating uevents:
>> - 93.27% 0.00% kworker/u4:1 [kernel.kallsyms] [k] cleanup_net
>> - cleanup_net
>> - 92.97% ops_exit_list.isra.4
>> - 35.14% rpcsec_gss_exit_net
>> - gss_svc_shutdown_net
>> - 17.40% rsc_cache_destroy_net
>> + 8.64% cache_unregister_net
>> + 8.52% cache_purge
>> + 0.22% cache_destroy_net
>> + 9.00% cache_unregister_net
>> + 8.49% cache_purge
>> + 0.15% destroy_use_gss_proxy_proc_entry
>> + 0.10% cache_destroy_net
>> - 14.35% tcp_net_metrics_exit
>> - 7.32% tcp_metrics_flush_all
>> + 4.86% _raw_spin_unlock_bh
>> 0.59% __local_bh_enable_ip
>> 6.12% _raw_spin_lock_bh
>> 0.90% _raw_spin_unlock_bh
>> - 13.08% sunrpc_exit_net
>> - 6.91% ip_map_cache_destroy
>> + 3.90% cache_unregister_net
>> + 2.86% cache_purge
>> + 0.15% cache_destroy_net
>> + 5.95% unix_gid_cache_destroy
>> + 0.12% rpc_pipefs_exit_net
>> + 0.10% rpc_proc_exit
>> - 7.35% ip6addrlbl_net_exit
>> + call_rcu_sched
>> + 3.34% xfrm_net_exit
>> + 1.22% ipv6_frags_exit_net
>> + 1.17% ipv4_frags_exit_net
>> + 0.78% fib_net_exit
>> + 0.76% inet6_net_exit
>> + 0.76% devinet_exit_net
>> + 0.68% addrconf_exit_net
>> + 0.63% igmp6_net_exit
>> + 0.59% ipv4_mib_exit_net
>> + 0.59% uevent_net_exit
>>
>>> Eric
[-- Attachment #2: 0001-RFC-net-convert-net_mutex-into-a-read-write-lock.patch --]
[-- Type: text/x-patch, Size: 5405 bytes --]
From 6fb255bed1c82a24931595502a8e393200c7dd52 Mon Sep 17 00:00:00 2001
From: Andrei Vagin <avagin@openvz.org>
Date: Sun, 16 Oct 2016 17:18:43 -0700
Subject: [PATCH] [RFC] net: convert net_mutex into a read-write lock
It protects per-namespace lists of operations, which are modified only
when modules are loaded or unloaded.
And the kernel reads these lists to create or destroy a new network
namespace.
Signed-off-by: Andrei Vagin <avagin@openvz.org>
---
include/linux/rtnetlink.h | 2 +-
net/core/net_namespace.c | 32 ++++++++++++++++----------------
net/core/rtnetlink.c | 4 ++--
3 files changed, 19 insertions(+), 19 deletions(-)
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 57e5484..1a780a3 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -30,7 +30,7 @@ extern int rtnl_trylock(void);
extern int rtnl_is_locked(void);
extern wait_queue_head_t netdev_unregistering_wq;
-extern struct mutex net_mutex;
+extern struct rw_semaphore net_mutex;
#ifdef CONFIG_PROVE_LOCKING
extern bool lockdep_rtnl_is_held(void);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 989434f..81dafce 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -27,7 +27,7 @@
static LIST_HEAD(pernet_list);
static struct list_head *first_device = &pernet_list;
-DEFINE_MUTEX(net_mutex);
+DECLARE_RWSEM(net_mutex);
LIST_HEAD(net_namespace_list);
EXPORT_SYMBOL_GPL(net_namespace_list);
@@ -59,7 +59,7 @@ static int net_assign_generic(struct net *net, int id, void *data)
{
struct net_generic *ng, *old_ng;
- BUG_ON(!mutex_is_locked(&net_mutex));
+ BUG_ON(!rwsem_is_locked(&net_mutex));
BUG_ON(id == 0);
old_ng = rcu_dereference_protected(net->gen,
@@ -379,7 +379,7 @@ struct net *copy_net_ns(unsigned long flags,
get_user_ns(user_ns);
- mutex_lock(&net_mutex);
+ down_read(&net_mutex);
net->ucounts = ucounts;
rv = setup_net(net, user_ns);
if (rv == 0) {
@@ -387,7 +387,7 @@ struct net *copy_net_ns(unsigned long flags,
list_add_tail_rcu(&net->list, &net_namespace_list);
rtnl_unlock();
}
- mutex_unlock(&net_mutex);
+ up_read(&net_mutex);
if (rv < 0) {
dec_net_namespaces(ucounts);
put_user_ns(user_ns);
@@ -412,7 +412,7 @@ static void cleanup_net(struct work_struct *work)
list_replace_init(&cleanup_list, &net_kill_list);
spin_unlock_irq(&cleanup_list_lock);
- mutex_lock(&net_mutex);
+ down_read(&net_mutex);
/* Don't let anyone else find us. */
rtnl_lock();
@@ -452,7 +452,7 @@ static void cleanup_net(struct work_struct *work)
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_free_list(ops, &net_exit_list);
- mutex_unlock(&net_mutex);
+ up_read(&net_mutex);
/* Ensure there are no outstanding rcu callbacks using this
* network namespace.
@@ -763,7 +763,7 @@ static int __init net_ns_init(void)
rcu_assign_pointer(init_net.gen, ng);
- mutex_lock(&net_mutex);
+ down_read(&net_mutex);
if (setup_net(&init_net, &init_user_ns))
panic("Could not setup the initial network namespace");
@@ -773,7 +773,7 @@ static int __init net_ns_init(void)
list_add_tail_rcu(&init_net.list, &net_namespace_list);
rtnl_unlock();
- mutex_unlock(&net_mutex);
+ up_read(&net_mutex);
register_pernet_subsys(&net_ns_ops);
@@ -912,9 +912,9 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
int register_pernet_subsys(struct pernet_operations *ops)
{
int error;
- mutex_lock(&net_mutex);
+ down_write(&net_mutex);
error = register_pernet_operations(first_device, ops);
- mutex_unlock(&net_mutex);
+ up_write(&net_mutex);
return error;
}
EXPORT_SYMBOL_GPL(register_pernet_subsys);
@@ -930,9 +930,9 @@ EXPORT_SYMBOL_GPL(register_pernet_subsys);
*/
void unregister_pernet_subsys(struct pernet_operations *ops)
{
- mutex_lock(&net_mutex);
+ down_write(&net_mutex);
unregister_pernet_operations(ops);
- mutex_unlock(&net_mutex);
+ up_write(&net_mutex);
}
EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
@@ -958,11 +958,11 @@ EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
int register_pernet_device(struct pernet_operations *ops)
{
int error;
- mutex_lock(&net_mutex);
+ down_write(&net_mutex);
error = register_pernet_operations(&pernet_list, ops);
if (!error && (first_device == &pernet_list))
first_device = &ops->list;
- mutex_unlock(&net_mutex);
+ up_write(&net_mutex);
return error;
}
EXPORT_SYMBOL_GPL(register_pernet_device);
@@ -978,11 +978,11 @@ EXPORT_SYMBOL_GPL(register_pernet_device);
*/
void unregister_pernet_device(struct pernet_operations *ops)
{
- mutex_lock(&net_mutex);
+ down_write(&net_mutex);
if (&ops->list == first_device)
first_device = first_device->next;
unregister_pernet_operations(ops);
- mutex_unlock(&net_mutex);
+ up_write(&net_mutex);
}
EXPORT_SYMBOL_GPL(unregister_pernet_device);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index b06d2f4..7533419 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -418,11 +418,11 @@ static void rtnl_lock_unregistering_all(void)
void rtnl_link_unregister(struct rtnl_link_ops *ops)
{
/* Close the race with cleanup_net() */
- mutex_lock(&net_mutex);
+ down_write(&net_mutex);
rtnl_lock_unregistering_all();
__rtnl_link_unregister(ops);
rtnl_unlock();
- mutex_unlock(&net_mutex);
+ up_write(&net_mutex);
}
EXPORT_SYMBOL_GPL(rtnl_link_unregister);
--
2.7.4
^ permalink raw reply related [flat|nested] 8+ messages in thread