linux-kernel.vger.kernel.org archive mirror
* [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
@ 2015-09-15 12:05 Christian Borntraeger
  2015-09-15 13:05 ` Peter Zijlstra
  2015-09-16 15:55 ` [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking" Tejun Heo
  0 siblings, 2 replies; 31+ messages in thread
From: Christian Borntraeger @ 2015-09-15 12:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list

Tejun,


commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14 (sched, cgroup: replace 
signal_struct->group_rwsem with a global percpu_rwsem) causes some noticeable
hiccups when starting several kvm guests (which libvirt will move into cgroups
- each vcpu thread and each i/o thread).
When you now start lots of guests in parallel on a bigger system (32 CPUs with
2-way SMT in my case), the system is so busy that systemd runs into several timeouts
like "Did not receive a reply. Possible causes include: the remote application did
not send a reply, the message bus security policy blocked the reply, the reply
timeout expired, or the network connection was broken."

The problem seems to be that the newly used percpu_rwsem does a
synchronize_sched_expedited for all write downs/ups.

Hacking the percpu_rw_semaphore to always take the slow path and avoid the
synchronize_sched seems to fix the issue. For some (yet unknown to me) reason
the synchronize_sched and the fast path seem to block writers for incredibly
long times.
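For reference, the write-side cost being discussed can be modeled in a toy
userspace sketch (synchronize_sched() is stubbed to a counter; the names mirror
the 4.2-era kernel/locking/percpu-rwsem.c, but this is an illustration, not the
real implementation):

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy single-process model of the 4.2-era percpu_rw_semaphore write side.
 * synchronize_sched() is stubbed to a counter; in the real kernel it waits
 * for every CPU to pass through a scheduling point, which is what makes
 * each down_write/up_write pair so expensive on a busy machine. */

static atomic_int write_ctr;  /* nonzero: readers must take the slow path */
static int grace_periods;     /* stub: counts grace periods a writer costs */

static void synchronize_sched(void)
{
    grace_periods++;  /* the real call can take a long time under load */
}

static void percpu_down_write(void)
{
    atomic_fetch_add(&write_ctr, 1);  /* push readers off the fast path */
    synchronize_sched();              /* wait out in-flight fast-path readers */
    /* the real code then takes brw->rw_sem and drains slow_read_ctr */
}

static void percpu_up_write(void)
{
    synchronize_sched();              /* barrier before readers go fast again */
    atomic_fetch_sub(&write_ctr, 1);  /* last writer re-enables the fast path */
}
```

Every writer thus pays two full grace periods, and every cgroup attach of a
vcpu or i/o thread is a writer, which a parallel guest-startup storm multiplies.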

To trigger the problem, the guests must be CPU bound; in other words, idle
guests do not seem to trigger the pathological case. Any idea how to improve
the situation? This looks like a real regression for larger kvm installations.

Christian


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 12:05 [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm Christian Borntraeger
@ 2015-09-15 13:05 ` Peter Zijlstra
  2015-09-15 13:36   ` Christian Borntraeger
  2015-09-16 15:55 ` [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking" Tejun Heo
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2015-09-15 13:05 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Tejun Heo, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov, Paul McKenney

On Tue, Sep 15, 2015 at 02:05:14PM +0200, Christian Borntraeger wrote:
> Tejun,
> 
> 
> commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14 (sched, cgroup: replace 
> signal_struct->group_rwsem with a global percpu_rwsem) causes some noticeable
> hiccups when starting several kvm guests (which libvirt will move into cgroups
> - each vcpu thread and each i/o thread).
> When you now start lots of guests in parallel on a bigger system (32 CPUs with
> 2-way SMT in my case), the system is so busy that systemd runs into several timeouts
> like "Did not receive a reply. Possible causes include: the remote application did
> not send a reply, the message bus security policy blocked the reply, the reply
> timeout expired, or the network connection was broken."
> 
> The problem seems to be that the newly used percpu_rwsem does a
> synchronize_sched_expedited for all write downs/ups.

Can you try:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev.2015.09.11ab

those include Oleg's rework of the percpu rwsem which should hopefully
improve things somewhat.

But yes, pounding a global lock on a big machine will always suck.


* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 13:05 ` Peter Zijlstra
@ 2015-09-15 13:36   ` Christian Borntraeger
  2015-09-15 13:53     ` Tejun Heo
  2015-09-15 16:42     ` Paolo Bonzini
  0 siblings, 2 replies; 31+ messages in thread
From: Christian Borntraeger @ 2015-09-15 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov, Paul McKenney

Am 15.09.2015 um 15:05 schrieb Peter Zijlstra:
> On Tue, Sep 15, 2015 at 02:05:14PM +0200, Christian Borntraeger wrote:
>> Tejun,
>>
>>
>> commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14 (sched, cgroup: replace 
>> signal_struct->group_rwsem with a global percpu_rwsem) causes some noticeable
>> hiccups when starting several kvm guests (which libvirt will move into cgroups
>> - each vcpu thread and each i/o thread).
>> When you now start lots of guests in parallel on a bigger system (32 CPUs with
>> 2-way SMT in my case), the system is so busy that systemd runs into several timeouts
>> like "Did not receive a reply. Possible causes include: the remote application did
>> not send a reply, the message bus security policy blocked the reply, the reply
>> timeout expired, or the network connection was broken."
>>
>> The problem seems to be that the newly used percpu_rwsem does a
>> synchronize_sched_expedited for all write downs/ups.
> 
> Can you try:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev.2015.09.11ab

yes, dev.2015.09.11a seems to help, thanks. Getting rid of the expedited hammer was
really helpful - I guess.

> 
> those include Oleg's rework of the percpu rwsem which should hopefully
> improve things somewhat.
> 
> But yes, pounding a global lock on a big machine will always suck.

By hacking out the fast path I actually degraded the percpu rwsem to a real global lock, but
things were still a lot faster. 
I am wondering why the old code behaved in such fatal ways. Is there some interaction 
between waiting for a reschedule in the synchronize_sched writer and some fork code 
actually waiting for the read side to get the lock, together with some rescheduling going
on waiting for a lock that fork holds? lockdep does not give me any hints, so I have no clue :-(


Christian



* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 13:36   ` Christian Borntraeger
@ 2015-09-15 13:53     ` Tejun Heo
  2015-09-15 16:42     ` Paolo Bonzini
  1 sibling, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2015-09-15 13:53 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov, Paul McKenney

Hello,

On Tue, Sep 15, 2015 at 03:36:34PM +0200, Christian Borntraeger wrote:
> >> The problem seems to be that the newly used percpu_rwsem does a
> >> synchronize_sched_expedited for all write downs/ups.
> > 
> > Can you try:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev.2015.09.11ab
> 
> yes, dev.2015.09.11a seems to help, thanks. Getting rid of the expedited hammer was
> really helpful - I guess.

Ah, that's nice.  I mentioned in the original patchset that
percpu_rwsem as previously implemented could be too heavy on the
writer side for this path, and I was planning to implement an
rwsem-based lglock if this blew up.  That said, if Oleg's changes
make the issue go away, all the better.

> > those include Oleg's rework of the percpu rwsem which should hopefully
> > improve things somewhat.
> > 
> > But yes, pounding a global lock on a big machine will always suck.
> 
> By hacking out the fast path I actually degraded percpu rwsem to a real global lock, but
> things were still a lot faster. 
> I am wondering why the old code behaved in such fatal ways. Is there some interaction 
> between waiting for a reschedule in the synchronize_sched writer and some fork code 
> actually waiting for the read side to get the lock together with some rescheduling going
> on waiting for a lock that fork holds? lockdep does not give me an hints so I have no clue :-(

percpu_rwsem is a global lock.  My rough suspicion is that the
writer locking path was taking too long (especially if the kernel
has preemption disabled), causing writers to get backed up badly
and starving the readers.

Thanks.

-- 
tejun


* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 13:36   ` Christian Borntraeger
  2015-09-15 13:53     ` Tejun Heo
@ 2015-09-15 16:42     ` Paolo Bonzini
  2015-09-15 17:38       ` Paul E. McKenney
  2015-09-15 21:11       ` Christian Borntraeger
  1 sibling, 2 replies; 31+ messages in thread
From: Paolo Bonzini @ 2015-09-15 16:42 UTC (permalink / raw)
  To: Christian Borntraeger, Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov, Paul McKenney



On 15/09/2015 15:36, Christian Borntraeger wrote:
> I am wondering why the old code behaved in such fatal ways. Is there
> some interaction between waiting for a reschedule in the
> synchronize_sched writer and some fork code actually waiting for the
> read side to get the lock together with some rescheduling going on
> waiting for a lock that fork holds? lockdep does not give me any hints
> so I have no clue :-(

It may just be consuming too much CPU.  kernel/rcu/tree.c warns
about it:

 * if you are using synchronize_sched_expedited() in a loop, please
 * restructure your code to batch your updates, and then use a single
 * synchronize_sched() instead.

and you may remember that in KVM we switched from RCU to SRCU exactly to
avoid userspace-controlled synchronize_rcu_expedited().

In fact, I would say that any userspace-controlled call to *_expedited()
is a bug waiting to happen and a bad idea---because userspace can, with
little effort, end up calling it in a loop.

Paolo


* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 16:42     ` Paolo Bonzini
@ 2015-09-15 17:38       ` Paul E. McKenney
  2015-09-16  8:32         ` Paolo Bonzini
  2015-09-15 21:11       ` Christian Borntraeger
  1 sibling, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2015-09-15 17:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christian Borntraeger, Peter Zijlstra, Tejun Heo, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

On Tue, Sep 15, 2015 at 06:42:19PM +0200, Paolo Bonzini wrote:
> 
> 
> On 15/09/2015 15:36, Christian Borntraeger wrote:
> > I am wondering why the old code behaved in such fatal ways. Is there
> > some interaction between waiting for a reschedule in the
> > synchronize_sched writer and some fork code actually waiting for the
> > read side to get the lock together with some rescheduling going on
> > waiting for a lock that fork holds? lockdep does not give me any hints
> > so I have no clue :-(
> 
> It may just be consuming too much CPU.  kernel/rcu/tree.c warns
> about it:
> 
>  * if you are using synchronize_sched_expedited() in a loop, please
>  * restructure your code to batch your updates, and then use a single
>  * synchronize_sched() instead.
> 
> and you may remember that in KVM we switched from RCU to SRCU exactly to
> avoid userspace-controlled synchronize_rcu_expedited().
> 
> In fact, I would say that any userspace-controlled call to *_expedited()
> is a bug waiting to happen and a bad idea---because userspace can, with
> little effort, end up calling it in a loop.

Excellent points!

Other options in such situations include the following:

o	Rework so that the code uses call_rcu*() instead of *_expedited().

o	Maintain a per-task or per-CPU counter so that every so many
	*_expedited() invocations instead uses the non-expedited
	counterpart.  (For example, synchronize_rcu instead of
	synchronize_rcu_expedited().)
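The second option could be sketched roughly as follows (a hypothetical
illustration: the threshold EXPEDITE_EVERY, the helper name, and the stubbed
grace-period functions are made up, not taken from the kernel):

```c
#include <assert.h>

/* Sketch of the counter idea: only every Nth writer pays for an
 * expedited grace period, the rest use the jitter-free variant.
 * EXPEDITE_EVERY and the stubs below are illustrative only. */

#define EXPEDITE_EVERY 16

static int expedited_calls, normal_calls;

static void synchronize_rcu(void)           { normal_calls++; }
static void synchronize_rcu_expedited(void) { expedited_calls++; }

static void throttled_sync(void)
{
    static int count;  /* would be per-task or per-CPU in the kernel */

    if (count++ % EXPEDITE_EVERY == 0)
        synchronize_rcu_expedited();  /* occasional low-latency grace period */
    else
        synchronize_rcu();            /* usual path, no OS jitter */
}
```

This bounds the rate of expedited grace periods even when userspace drives
the call site in a tight loop.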

Note that synchronize_srcu_expedited() is less troublesome than are the
other *_expedited() functions, because synchronize_srcu_expedited() does
not inflict OS jitter on other CPUs.  This situation is being improved,
so that the other *_expedited() functions inflict less OS jitter and
(mostly) avoid inflicting OS jitter on nohz_full CPUs and idle CPUs (the
latter being important for battery-powered systems).  In addition, the
*_expedited() functions avoid hammering CPUs with N-squared OS jitter
in response to concurrent invocation from all CPUs because multiple
concurrent *_expedited() calls will be satisfied by a single expedited
grace-period operation.  Nevertheless, as Paolo points out, it is still
necessary to exercise caution when exposing synchronous grace periods
to userspace control.

							Thanx, Paul



* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 16:42     ` Paolo Bonzini
  2015-09-15 17:38       ` Paul E. McKenney
@ 2015-09-15 21:11       ` Christian Borntraeger
  2015-09-15 21:26         ` Tejun Heo
  1 sibling, 1 reply; 31+ messages in thread
From: Christian Borntraeger @ 2015-09-15 21:11 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov, Paul McKenney

Am 15.09.2015 um 18:42 schrieb Paolo Bonzini:
> 
> 
> On 15/09/2015 15:36, Christian Borntraeger wrote:
>> I am wondering why the old code behaved in such fatal ways. Is there
>> some interaction between waiting for a reschedule in the
>> synchronize_sched writer and some fork code actually waiting for the
>> read side to get the lock together with some rescheduling going on
>> waiting for a lock that fork holds? lockdep does not give me any hints
>> so I have no clue :-(
> 
> It may just be consuming too much CPU.  kernel/rcu/tree.c warns
> about it:
> 
>  * if you are using synchronize_sched_expedited() in a loop, please
>  * restructure your code to batch your updates, and then use a single
>  * synchronize_sched() instead.
> 
> and you may remember that in KVM we switched from RCU to SRCU exactly to
> avoid userspace-controlled synchronize_rcu_expedited().
> 
> In fact, I would say that any userspace-controlled call to *_expedited()
> is a bug waiting to happen and a bad idea---because userspace can, with
> little effort, end up calling it in a loop.

Right. This also implies that we should fix this for 4.2 - I guess.

Christian



* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 21:11       ` Christian Borntraeger
@ 2015-09-15 21:26         ` Tejun Heo
  2015-09-15 21:38           ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-09-15 21:26 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Paolo Bonzini, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov, Paul McKenney

Hello,

On Tue, Sep 15, 2015 at 11:11:45PM +0200, Christian Borntraeger wrote:
> > In fact, I would say that any userspace-controlled call to *_expedited()
> > is a bug waiting to happen and a bad idea---because userspace can, with
> > little effort, end up calling it in a loop.
> 
> Right. This also implies that we should fix this for 4.2 - I guess.

Are the percpu_rwsem changes enough?  If so, we can try to backport
those.  If those are too risky, we can revert the patches which
switched threadgroup lock to percpu_rwsem.

Thanks.

-- 
tejun


* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 21:26         ` Tejun Heo
@ 2015-09-15 21:38           ` Paul E. McKenney
  2015-09-15 22:28             ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2015-09-15 21:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Borntraeger, Paolo Bonzini, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

On Tue, Sep 15, 2015 at 05:26:22PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 15, 2015 at 11:11:45PM +0200, Christian Borntraeger wrote:
> > > In fact, I would say that any userspace-controlled call to *_expedited()
> > > is a bug waiting to happen and a bad idea---because userspace can, with
> > > little effort, end up calling it in a loop.
> > 
> > Right. This also implies that we should fix this for 4.2 - I guess.
> 
> Are the percpu_rwsem changes enough?  If so, we can try to backport
> those.  If those are too risky, we can revert the patches which
> switched threadgroup lock to percpu_rwsem.

I did take a shot at adding the rcu_sync stuff during this past merge
window, but it did not converge quickly enough to make it.  It looks
quite good for the next merge window.  There have been changes in most
of the relevant areas, so probably best to just try them and see which
works best.

							Thanx, Paul



* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 21:38           ` Paul E. McKenney
@ 2015-09-15 22:28             ` Tejun Heo
  2015-09-15 23:38               ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-09-15 22:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christian Borntraeger, Paolo Bonzini, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

Hello,

On Tue, Sep 15, 2015 at 02:38:30PM -0700, Paul E. McKenney wrote:
> I did take a shot at adding the rcu_sync stuff during this past merge
> window, but it did not converge quickly enough to make it.  It looks
> quite good for the next merge window.  There have been changes in most
> of the relevant areas, so probably best to just try them and see which
> works best.

Heh, I'm having a bit of trouble following.  Are you saying that the
changes would be too big for -stable?  If so, I'll send out reverts of
the culprit patches and then reapply them for this cycle so that it
can land together with the rcu changes in the next merge window, but
it'd be great to find out whether the rcu changes are enough for the
issue that Christian is seeing to go away.  If not, I'll switch to a
different locking scheme and mark those patches w/ stable tag.

Thanks!

-- 
tejun


* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 22:28             ` Tejun Heo
@ 2015-09-15 23:38               ` Paul E. McKenney
  2015-09-16  1:24                 ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2015-09-15 23:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Borntraeger, Paolo Bonzini, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

On Tue, Sep 15, 2015 at 06:28:11PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 15, 2015 at 02:38:30PM -0700, Paul E. McKenney wrote:
> > I did take a shot at adding the rcu_sync stuff during this past merge
> > window, but it did not converge quickly enough to make it.  It looks
> > quite good for the next merge window.  There have been changes in most
> > of the relevant areas, so probably best to just try them and see which
> > works best.
> 
> Heh, I'm having a bit of trouble following.  Are you saying that the
> changes would be too big for -stable?  If so, I'll send out reverts of
> the culprit patches and then reapply them for this cycle so that it
> can land together with the rcu changes in the next merge window, but
> it'd be great to find out whether the rcu changes are enough for the
> issue that Christian is seeing to go away.  If not, I'll switch to a
> different locking scheme and mark those patches w/ stable tag.

Well, the decision as to what is too big for -stable is owned by the
-stable maintainers, not by me.

I am suggesting trying the options and seeing what works best, then
working to convince people as needed.

							Thanx, Paul



* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 23:38               ` Paul E. McKenney
@ 2015-09-16  1:24                 ` Tejun Heo
  2015-09-16  4:35                   ` Paul E. McKenney
  2015-09-16  7:44                   ` Christian Borntraeger
  0 siblings, 2 replies; 31+ messages in thread
From: Tejun Heo @ 2015-09-16  1:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christian Borntraeger, Paolo Bonzini, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

Hello, Paul.

On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
> Well, the decision as to what is too big for -stable is owned by the
> -stable maintainers, not by me.

Is it tho?  Usually the subsystem maintainer knows best and has
the most say in it.  I was mostly curious whether you'd think that the
changes would be too risky.  If not, great.

> I am suggesting trying the options and seeing what works best, then
> working to convince people as needed.

Yeah, sure thing.  Let's wait for Christian.

Thanks.

-- 
tejun


* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16  1:24                 ` Tejun Heo
@ 2015-09-16  4:35                   ` Paul E. McKenney
  2015-09-16 11:06                     ` Tejun Heo
  2015-09-16  7:44                   ` Christian Borntraeger
  1 sibling, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2015-09-16  4:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Borntraeger, Paolo Bonzini, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

On Tue, Sep 15, 2015 at 09:24:15PM -0400, Tejun Heo wrote:
> Hello, Paul.
> 
> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
> > Well, the decision as to what is too big for -stable is owned by the
> > -stable maintainers, not by me.
> 
> Is it tho?  Usually the subsystem maintainer knows the best and has
> most say in it.  I was mostly curious whether you'd think that the
> changes would be too risky.  If not, great.

I do hope that they would listen to what I thought about it, but at
the end of the day, it is the -stable maintainers who pull a given
patch, or don't.

> > I am suggesting trying the options and seeing what works best, then
> > working to convince people as needed.
> 
> Yeah, sure thing.  Let's wait for Christian.

Indeed.  Is there enough benefit to risk jamming this thing into 4.3?
I believe that 4.4 should be a no-brainer.

							Thanx, Paul



* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16  1:24                 ` Tejun Heo
  2015-09-16  4:35                   ` Paul E. McKenney
@ 2015-09-16  7:44                   ` Christian Borntraeger
  2015-09-16 10:58                     ` Christian Borntraeger
  1 sibling, 1 reply; 31+ messages in thread
From: Christian Borntraeger @ 2015-09-16  7:44 UTC (permalink / raw)
  To: Tejun Heo, Paul E. McKenney
  Cc: Paolo Bonzini, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

Am 16.09.2015 um 03:24 schrieb Tejun Heo:
> Hello, Paul.
> 
> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
>> Well, the decision as to what is too big for -stable is owned by the
>> -stable maintainers, not by me.
> 
> Is it tho?  Usually the subsystem maintainer knows the best and has
> most say in it.  I was mostly curious whether you'd think that the
> changes would be too risky.  If not, great.
> 
>> I am suggesting trying the options and seeing what works best, then
>> working to convince people as needed.
> 
> Yeah, sure thing.  Let's wait for Christian.

Well, I have optimized my testcase so that it now puts enough pressure on
the system to confuse systemd (the older 209 version, which still has
some event loop issues) into restarting the journal daemon and doing
several other recoveries.
To avoid regressions - even for somewhat shaky userspaces - we should
consider a revert for 4.2 stable.
There are several followup patches, which make the revert non-trivial,
though.

The rework of the percpu rwsem seems to work fine, but we are beyond the
merge window, so 4.4 seems better to me (and consider a revert for 4.3).

Christian



* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-15 17:38       ` Paul E. McKenney
@ 2015-09-16  8:32         ` Paolo Bonzini
  2015-09-16  8:57           ` Christian Borntraeger
  0 siblings, 1 reply; 31+ messages in thread
From: Paolo Bonzini @ 2015-09-16  8:32 UTC (permalink / raw)
  To: paulmck
  Cc: Christian Borntraeger, Peter Zijlstra, Tejun Heo, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov



On 15/09/2015 19:38, Paul E. McKenney wrote:
> Excellent points!
> 
> Other options in such situations include the following:
> 
> o	Rework so that the code uses call_rcu*() instead of *_expedited().
> 
> o	Maintain a per-task or per-CPU counter so that every so many
> 	*_expedited() invocations instead uses the non-expedited
> 	counterpart.  (For example, synchronize_rcu instead of
> 	synchronize_rcu_expedited().)

Or just use ratelimit (untested):

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 834c4e52cb2d..8fb66b2aeed9 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -6,6 +6,7 @@
 #include <linux/percpu.h>
 #include <linux/wait.h>
 #include <linux/lockdep.h>
+#include <linux/ratelimit.h>
 
 struct percpu_rw_semaphore {
 	unsigned int __percpu	*fast_read_ctr;
@@ -13,6 +14,7 @@ struct percpu_rw_semaphore {
 	struct rw_semaphore	rw_sem;
 	atomic_t		slow_read_ctr;
 	wait_queue_head_t	write_waitq;
+	struct ratelimit_state	expedited_ratelimit;
 };
 
 extern void percpu_down_read(struct percpu_rw_semaphore *);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f32567254867..c33f8bc89384 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -20,6 +20,8 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
 	atomic_set(&brw->write_ctr, 0);
 	atomic_set(&brw->slow_read_ctr, 0);
 	init_waitqueue_head(&brw->write_waitq);
+	/* Expedite one down_write and one up_write per second.  */
+	ratelimit_state_init(&brw->expedited_ratelimit, HZ, 2);
 	return 0;
 }
 
@@ -152,7 +156,10 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
 	 *    fast-path, it executes a full memory barrier before we return.
 	 *    See R_W case in the comment above update_fast_ctr().
 	 */
-	synchronize_sched_expedited();
+	if (__ratelimit(&brw->expedited_ratelimit))
+		synchronize_sched_expedited();
+	else
+		synchronize_sched();
 
 	/* exclude other writers, and block the new readers completely */
 	down_write(&brw->rw_sem);
@@ -172,7 +179,10 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
 	 * Insert the barrier before the next fast-path in down_read,
 	 * see W_R case in the comment above update_fast_ctr().
 	 */
-	synchronize_sched_expedited();
+	if (__ratelimit(&brw->expedited_ratelimit))
+		synchronize_sched_expedited();
+	else
+		synchronize_sched();
 	/* the last writer unblocks update_fast_ctr() */
 	atomic_dec(&brw->write_ctr);
 }


> Note that synchronize_srcu_expedited() is less troublesome than are the
> other *_expedited() functions, because synchronize_srcu_expedited() does
> not inflict OS jitter on other CPUs.

Yup, synchronize_srcu_expedited() is just a busy wait and it can
complete extremely fast if you use SRCU as a "local RCU" rather
than a "sleepable RCU".  However it doesn't apply here since you
want to avoid SRCU's 2 memory barriers per lock/unlock.

Paolo


* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16  8:32         ` Paolo Bonzini
@ 2015-09-16  8:57           ` Christian Borntraeger
  2015-09-16  9:12             ` Paolo Bonzini
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Borntraeger @ 2015-09-16  8:57 UTC (permalink / raw)
  To: Paolo Bonzini, paulmck
  Cc: Peter Zijlstra, Tejun Heo, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

Am 16.09.2015 um 10:32 schrieb Paolo Bonzini:
> 
> 
> On 15/09/2015 19:38, Paul E. McKenney wrote:
>> Excellent points!
>>
>> Other options in such situations include the following:
>>
>> o	Rework so that the code uses call_rcu*() instead of *_expedited().
>>
>> o	Maintain a per-task or per-CPU counter so that every so many
>> 	*_expedited() invocations instead uses the non-expedited
>> 	counterpart.  (For example, synchronize_rcu instead of
>> 	synchronize_rcu_expedited().)
> 
> Or just use ratelimit (untested):

One of my tests was to always replace synchronize_sched_expedited with
synchronize_sched, and things turned out to be even worse. Not sure if
it makes sense to test your in-the-middle approach?

Christian



* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16  8:57           ` Christian Borntraeger
@ 2015-09-16  9:12             ` Paolo Bonzini
  2015-09-16 12:22               ` Oleg Nesterov
  0 siblings, 1 reply; 31+ messages in thread
From: Paolo Bonzini @ 2015-09-16  9:12 UTC (permalink / raw)
  To: Christian Borntraeger, paulmck
  Cc: Peter Zijlstra, Tejun Heo, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov



On 16/09/2015 10:57, Christian Borntraeger wrote:
> Am 16.09.2015 um 10:32 schrieb Paolo Bonzini:
>>
>>
>> On 15/09/2015 19:38, Paul E. McKenney wrote:
>>> Excellent points!
>>>
>>> Other options in such situations include the following:
>>>
>>> o	Rework so that the code uses call_rcu*() instead of *_expedited().
>>>
>>> o	Maintain a per-task or per-CPU counter so that every so many
>>> 	*_expedited() invocations instead uses the non-expedited
>>> 	counterpart.  (For example, synchronize_rcu instead of
>>> 	synchronize_rcu_expedited().)
>>
>> Or just use ratelimit (untested):
> 
> One of my tests was to always replace synchronize_sched_expedited with 
> synchronize_sched and things turned out to be even worse. Not sure if
> it makes sense to test your in-the-middle approach?

I don't think it applies here, since down_write/up_write is a
synchronous API.

If the revert isn't easy, I think backporting rcu_sync is the best bet.
 The issue is that rcu_sync doesn't eliminate synchronize_sched, it only
makes it more rare.  So it's possible that it isn't eliminating the root
cause of the problem.

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16  7:44                   ` Christian Borntraeger
@ 2015-09-16 10:58                     ` Christian Borntraeger
  2015-09-16 11:03                       ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Borntraeger @ 2015-09-16 10:58 UTC (permalink / raw)
  To: Tejun Heo, Paul E. McKenney
  Cc: Paolo Bonzini, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

On 16.09.2015 at 09:44, Christian Borntraeger wrote:
> On 16.09.2015 at 03:24, Tejun Heo wrote:
>> Hello, Paul.
>>
>> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
>>> Well, the decision as to what is too big for -stable is owned by the
>>> -stable maintainers, not by me.
>>
>> Is it tho?  Usually the subsystem maintainer knows the best and has
>> most say in it.  I was mostly curious whether you'd think that the
>> changes would be too risky.  If not, great.
>>
>>> I am suggesting trying the options and seeing what works best, then
>>> working to convince people as needed.
>>
>> Yeah, sure thing.  Let's wait for Christian.
> 
> Well, I have optimized my test case so that it puts enough pressure on
> the system to confuse systemd (the older 209 version, which still has
> some event loop issues) into restarting the journal daemon and doing
> several other recoveries.
> To avoid regressions - even for somewhat shaky userspaces - we should
> consider a revert for 4.2 stable.
> There are several followup patches, which make the revert non-trivial,
> though.
> 
> The rework of the percpu rwsem seems to work fine, but we are beyond the
> merge window so 4.4 seems better to me. (and consider a revert for 4.3)

FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
systemd seems to do that for the processes. 

So a revert is really the right thing to do. In fact, I don't know if the
rcu_sync_enter rework is enough. With systemd, setting the cgroup seems
NOT to be a cold/seldom case.

Christian

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16 10:58                     ` Christian Borntraeger
@ 2015-09-16 11:03                       ` Tejun Heo
  2015-09-16 11:50                         ` Christian Borntraeger
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-09-16 11:03 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Paul E. McKenney, Paolo Bonzini, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

Hello,

On Wed, Sep 16, 2015 at 12:58:00PM +0200, Christian Borntraeger wrote:
> FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
> just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
> systemd seems to do that for the processes. 
>
> So a revert is really the right thing to do. In fact, I don't know if the
> rcu_sync_enter rework is enough. With systemd, setting the cgroup seems
> NOT to be a cold/seldom case.

Booting would usually be the hottest operation for that, and it's still
a *relatively* cold path compared to the reader side, which is the task
fork/exit paths.  The whole point is to shift overhead from the hotter
reader side.  Can you see any problems with the percpu_rwsem rework?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16  4:35                   ` Paul E. McKenney
@ 2015-09-16 11:06                     ` Tejun Heo
  0 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2015-09-16 11:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christian Borntraeger, Paolo Bonzini, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

Hello,

On Tue, Sep 15, 2015 at 09:35:47PM -0700, Paul E. McKenney wrote:
> > > I am suggesting trying the options and seeing what works best, then
> > > working to convince people as needed.
> > 
> > Yeah, sure thing.  Let's wait for Christian.
> 
> Indeed.  Is there enough benefit to risk jamming this thing into 4.3?
> I believe that 4.4 should be a no-brainer.

In that case, I'm gonna revert the threadgroup percpu_rwsem
conversion patches through -stable and reapply them for 4.4 merge
window.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16 11:03                       ` Tejun Heo
@ 2015-09-16 11:50                         ` Christian Borntraeger
  0 siblings, 0 replies; 31+ messages in thread
From: Christian Borntraeger @ 2015-09-16 11:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paul E. McKenney, Paolo Bonzini, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list, Oleg Nesterov

On 16.09.2015 at 13:03, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 16, 2015 at 12:58:00PM +0200, Christian Borntraeger wrote:
>> FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
>> just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
>> systemd seems to do that for the processes. 
>>
>> So a revert is really the right thing to do. In fact, I don't know if the
>> rcu_sync_enter rework is enough. With systemd, setting the cgroup seems
>> NOT to be a cold/seldom case.
> 
> Booting would usually be the hottest operation for that, and it's still
> a *relatively* cold path compared to the reader side, which is the task
> fork/exit paths.  The whole point is to shift overhead from the hotter
> reader side.  Can you see any problems with the percpu_rwsem rework?

As I said, the rcu tree with that change seems to work fine on my
system. This needs more testing on other machines, though. I guess a
revert plus a re-add in the 4.4 merge window should give us enough test
coverage.

Christian


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16  9:12             ` Paolo Bonzini
@ 2015-09-16 12:22               ` Oleg Nesterov
  2015-09-16 12:35                 ` Paolo Bonzini
                                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Oleg Nesterov @ 2015-09-16 12:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christian Borntraeger, paulmck, Peter Zijlstra, Tejun Heo,
	Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list

On 09/16, Paolo Bonzini wrote:
>
>
> On 16/09/2015 10:57, Christian Borntraeger wrote:
> > On 16.09.2015 at 10:32, Paolo Bonzini wrote:
> >>
> >>
> >> On 15/09/2015 19:38, Paul E. McKenney wrote:
> >>> Excellent points!
> >>>
> >>> Other options in such situations include the following:
> >>>
> >>> o	Rework so that the code uses call_rcu*() instead of *_expedited().
> >>>
> >>> o	Maintain a per-task or per-CPU counter so that every so many
> >>> 	*_expedited() invocations instead uses the non-expedited
> >>> 	counterpart.  (For example, synchronize_rcu instead of
> >>> 	synchronize_rcu_expedited().)
> >>
> >> Or just use ratelimit (untested):
> >
> > One of my tests was to always replace synchronize_sched_expedited with
> > synchronize_sched and things turned out to be even worse. Not sure if
> > it makes sense to test your in-the-middle approach?
>
> I don't think it applies here, since down_write/up_write is a
> synchronous API.
>
> If the revert isn't easy, I think backporting rcu_sync is the best bet.

I leave this to Paul and Tejun... at least I think this is not v4.2 material.

>  The issue is that rcu_sync doesn't eliminate synchronize_sched,

Yes, but it eliminates _expedited(). This is good, but otoh this means
that (say) individual __cgroup_procs_write() can take much more time.
However, it won't block the readers and/or disturb the whole system.
And percpu_up_write() doesn't do synchronize_sched() at all.

> it only
> makes it more rare.

Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
a single synchronize_sched().

> So it's possible that it isn't eliminating the root
> cause of the problem.

We will see... Just in case, currently the usage of percpu_down_write()
is suboptimal. We do not need to do ->sync() under cgroup_mutex. But
this needs some WIP changes in rcu_sync. Plus we can do more improvements,
but this is off-topic right now.

Oleg.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16 12:22               ` Oleg Nesterov
@ 2015-09-16 12:35                 ` Paolo Bonzini
  2015-09-16 12:43                   ` Oleg Nesterov
  2015-09-16 12:56                 ` Christian Borntraeger
  2015-09-16 14:16                 ` Tejun Heo
  2 siblings, 1 reply; 31+ messages in thread
From: Paolo Bonzini @ 2015-09-16 12:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Borntraeger, paulmck, Peter Zijlstra, Tejun Heo,
	Ingo Molnar, Linux Kernel Mailing List, KVM list



On 16/09/2015 14:22, Oleg Nesterov wrote:
> >  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> 
> Yes, but it eliminates _expedited(). This is good, but otoh this means
> that (say) individual __cgroup_procs_write() can take much more time.
> However, it won't block the readers and/or disturb the whole system.

According to Christian, removing the _expedited() "makes things worse"
in that the system takes ages to boot up and systemd times out.  So I'm
still a bit wary about anything that uses RCU for the cgroups write side.

However, rcu_sync is okay with him, so perhaps it is really really
effective.  Christian, can you instrument how many synchronize_sched
(out of the 6749 cgroup_procs_write calls) are actually executed at boot
with the rcu rework?

Paolo

> And percpu_up_write() doesn't do synchronize_sched() at all.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16 12:35                 ` Paolo Bonzini
@ 2015-09-16 12:43                   ` Oleg Nesterov
  0 siblings, 0 replies; 31+ messages in thread
From: Oleg Nesterov @ 2015-09-16 12:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christian Borntraeger, paulmck, Peter Zijlstra, Tejun Heo,
	Ingo Molnar, Linux Kernel Mailing List, KVM list

On 09/16, Paolo Bonzini wrote:
>
>
> On 16/09/2015 14:22, Oleg Nesterov wrote:
> > >  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> >
> > Yes, but it eliminates _expedited(). This is good, but otoh this means
> > that (say) individual __cgroup_procs_write() can take much more time.
> > However, it won't block the readers and/or disturb the whole system.
>
> According to Christian, removing the _expedited() "makes things worse"

Yes, sure, we cannot just remove _expedited() from down/up_read().

> in that the system takes ages to boot up and systemd times out.

Yes, this is clear

> So I'm
> still a bit wary about anything that uses RCU for the cgroups write side.
>
> However, rcu_sync is okay with him, so perhaps it is really really
> effective.  Christian, can you instrument how many synchronize_sched
> (out of the 6749 cgroup_procs_write calls) are actually executed at boot
> with the rcu rework?

Heh, another change I have in mind. It would be nice to add some trace
points. But first we should merge the current code.

Oleg.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16 12:22               ` Oleg Nesterov
  2015-09-16 12:35                 ` Paolo Bonzini
@ 2015-09-16 12:56                 ` Christian Borntraeger
  2015-09-16 14:16                 ` Tejun Heo
  2 siblings, 0 replies; 31+ messages in thread
From: Christian Borntraeger @ 2015-09-16 12:56 UTC (permalink / raw)
  To: Oleg Nesterov, Paolo Bonzini
  Cc: paulmck, Peter Zijlstra, Tejun Heo, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list

On 16.09.2015 at 14:22, Oleg Nesterov wrote:
>>  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> 
> Yes, but it eliminates _expedited(). This is good, but otoh this means
> that (say) individual __cgroup_procs_write() can take much more time.
> However, it won't block the readers and/or disturb the whole system.
> And percpu_up_write() doesn't do synchronize_sched() at all.
> 
>> it only
>> makes it more rare.
> 
> Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
> a single synchronize_sched().

And in fact it does. Paolo suggested tracing how often we call
synchronize_sched, so I applied some advanced printk debugging technology ;-)
Up to login I see 41 calls, and after starting the 70 guests this went up to 48.
Nice work.

> 
>> So it's possible that it isn't eliminating the root
>> cause of the problem.
> 
> We will see... Just in case, currently the usage of percpu_down_write()
> is suboptimal. We do not need to do ->sync() under cgroup_mutex. But
> this needs some WIP changes in rcu_sync. Plus we can do more improvements,
> but this is off-topic right now.
> 
> Oleg.
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16 12:22               ` Oleg Nesterov
  2015-09-16 12:35                 ` Paolo Bonzini
  2015-09-16 12:56                 ` Christian Borntraeger
@ 2015-09-16 14:16                 ` Tejun Heo
  2015-09-16 14:19                   ` Paolo Bonzini
  2 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-09-16 14:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paolo Bonzini, Christian Borntraeger, paulmck, Peter Zijlstra,
	Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list

Hello,

On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
> > If the revert isn't easy, I think backporting rcu_sync is the best bet.
> 
> I leave this to Paul and Tejun... at least I think this is not v4.2 material.

Will route reverts through cgroup branch.  Should be pretty painless.
Nice job on percpu_rwsem.  It's so much better than having to come up
with a separate middleground solution.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
  2015-09-16 14:16                 ` Tejun Heo
@ 2015-09-16 14:19                   ` Paolo Bonzini
  0 siblings, 0 replies; 31+ messages in thread
From: Paolo Bonzini @ 2015-09-16 14:19 UTC (permalink / raw)
  To: Tejun Heo, Oleg Nesterov
  Cc: Christian Borntraeger, paulmck, Peter Zijlstra, Ingo Molnar,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	KVM list



On 16/09/2015 16:16, Tejun Heo wrote:
> On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
>>> > > If the revert isn't easy, I think backporting rcu_sync is the best bet.
>> > 
>> > I leave this to Paul and Tejun... at least I think this is not v4.2 material.
> Will route reverts through cgroup branch.  Should be pretty painless.
> Nice job on percpu_rwsem.  It's so much better than having to come up
> with a separate middleground solution.

Thanks all for the quick resolution!

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking"
  2015-09-15 12:05 [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm Christian Borntraeger
  2015-09-15 13:05 ` Peter Zijlstra
@ 2015-09-16 15:55 ` Tejun Heo
  2015-09-16 15:56   ` [PATCH cgroup/for-4.3-fixes 2/2] Revert "sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem" Tejun Heo
                     ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Tejun Heo @ 2015-09-16 15:55 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Ingo Molnar, Peter Zijlstra,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	Oleg Nesterov, Paul E. McKenney, Paolo Bonzini, KVM list

From f9f9e7b776142fb1c0782cade004cc8e0147a199 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 16 Sep 2015 11:51:12 -0400

This reverts commit b5ba75b5fc0e8404e2c50cb68f39bb6a53fc916f.

d59cfc09c32a ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem") and b5ba75b5fc0e ("cgroup: simplify
threadgroup locking") changed how cgroup synchronizes against task
fork and exits so that it uses global percpu_rwsem instead of
per-process rwsem; unfortunately, the write [un]lock paths of
percpu_rwsem always involve synchronize_rcu_expedited() which turned
out to be too expensive.

Improvements for percpu_rwsem are scheduled to be merged in the coming
v4.4-rc1 merge window which alleviates this issue.  For now, revert
the two commits to restore per-process rwsem.  They will be re-applied
for the v4.4-rc1 merge window.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org # v4.2+
---
Hello,

These are the two reverts that I'm pushing through
cgroup/for-4.3-fixes.  I'll re-apply the reverted patches on the
for-4.4 branch so that they can land together with percpu_rwsem
updates during the next merge window.

Thanks.

 kernel/cgroup.c | 45 +++++++++++++++++++++++++++++++++------------
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2cf0f79..115091e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2460,13 +2460,14 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	if (!cgrp)
 		return -ENODEV;
 
-	percpu_down_write(&cgroup_threadgroup_rwsem);
+retry_find_task:
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
+			rcu_read_unlock();
 			ret = -ESRCH;
-			goto out_unlock_rcu;
+			goto out_unlock_cgroup;
 		}
 	} else {
 		tsk = current;
@@ -2482,23 +2483,37 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	 */
 	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
 		ret = -EINVAL;
-		goto out_unlock_rcu;
+		rcu_read_unlock();
+		goto out_unlock_cgroup;
 	}
 
 	get_task_struct(tsk);
 	rcu_read_unlock();
 
+	percpu_down_write(&cgroup_threadgroup_rwsem);
+	if (threadgroup) {
+		if (!thread_group_leader(tsk)) {
+			/*
+			 * a race with de_thread from another thread's exec()
+			 * may strip us of our leadership, if this happens,
+			 * there is no choice but to throw this task away and
+			 * try again; this is
+			 * "double-double-toil-and-trouble-check locking".
+			 */
+			percpu_up_write(&cgroup_threadgroup_rwsem);
+			put_task_struct(tsk);
+			goto retry_find_task;
+		}
+	}
+
 	ret = cgroup_procs_write_permission(tsk, cgrp, of);
 	if (!ret)
 		ret = cgroup_attach_task(cgrp, tsk, threadgroup);
 
-	put_task_struct(tsk);
-	goto out_unlock_threadgroup;
-
-out_unlock_rcu:
-	rcu_read_unlock();
-out_unlock_threadgroup:
 	percpu_up_write(&cgroup_threadgroup_rwsem);
+
+	put_task_struct(tsk);
+out_unlock_cgroup:
 	cgroup_kn_unlock(of->kn);
 	return ret ?: nbytes;
 }
@@ -2643,8 +2658,6 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 
 	lockdep_assert_held(&cgroup_mutex);
 
-	percpu_down_write(&cgroup_threadgroup_rwsem);
-
 	/* look up all csses currently attached to @cgrp's subtree */
 	down_read(&css_set_rwsem);
 	css_for_each_descendant_pre(css, cgroup_css(cgrp, NULL)) {
@@ -2700,8 +2713,17 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 				goto out_finish;
 			last_task = task;
 
+			percpu_down_write(&cgroup_threadgroup_rwsem);
+			/* raced against de_thread() from another thread? */
+			if (!thread_group_leader(task)) {
+				percpu_up_write(&cgroup_threadgroup_rwsem);
+				put_task_struct(task);
+				continue;
+			}
+
 			ret = cgroup_migrate(src_cset->dfl_cgrp, task, true);
 
+			percpu_up_write(&cgroup_threadgroup_rwsem);
 			put_task_struct(task);
 
 			if (WARN(ret, "cgroup: failed to update controllers for the default hierarchy (%d), further operations may crash or hang\n", ret))
@@ -2711,7 +2733,6 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 
 out_finish:
 	cgroup_migrate_finish(&preloaded_csets);
-	percpu_up_write(&cgroup_threadgroup_rwsem);
 	return ret;
 }
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH cgroup/for-4.3-fixes 2/2] Revert "sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"
  2015-09-16 15:55 ` [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking" Tejun Heo
@ 2015-09-16 15:56   ` Tejun Heo
  2015-09-16 17:00   ` [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking" Oleg Nesterov
  2015-09-16 18:45   ` Christian Borntraeger
  2 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2015-09-16 15:56 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Ingo Molnar, Peter Zijlstra,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	Oleg Nesterov, Paul E. McKenney, Paolo Bonzini, KVM list

From 0c986253b939cc14c69d4adbe2b4121bdf4aa220 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 16 Sep 2015 11:51:12 -0400

This reverts commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14.

d59cfc09c32a ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem") and b5ba75b5fc0e ("cgroup: simplify
threadgroup locking") changed how cgroup synchronizes against task
fork and exits so that it uses global percpu_rwsem instead of
per-process rwsem; unfortunately, the write [un]lock paths of
percpu_rwsem always involve synchronize_rcu_expedited() which turned
out to be too expensive.

Improvements for percpu_rwsem are scheduled to be merged in the coming
v4.4-rc1 merge window which alleviates this issue.  For now, revert
the two commits to restore per-process rwsem.  They will be re-applied
for the v4.4-rc1 merge window.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org # v4.2+
---
 include/linux/cgroup-defs.h | 27 ++--------------
 include/linux/init_task.h   |  8 +++++
 include/linux/sched.h       | 12 +++++++
 kernel/cgroup.c             | 77 +++++++++++++++++++++++++++++++++------------
 kernel/fork.c               |  4 +++
 5 files changed, 83 insertions(+), 45 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 4d8fcf2..8492721 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -473,31 +473,8 @@ struct cgroup_subsys {
 	unsigned int depends_on;
 };
 
-extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
-
-/**
- * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
- * @tsk: target task
- *
- * Called from threadgroup_change_begin() and allows cgroup operations to
- * synchronize against threadgroup changes using a percpu_rw_semaphore.
- */
-static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
-{
-	percpu_down_read(&cgroup_threadgroup_rwsem);
-}
-
-/**
- * cgroup_threadgroup_change_end - threadgroup exclusion for cgroups
- * @tsk: target task
- *
- * Called from threadgroup_change_end().  Counterpart of
- * cgroup_threadcgroup_change_begin().
- */
-static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
-{
-	percpu_up_read(&cgroup_threadgroup_rwsem);
-}
+void cgroup_threadgroup_change_begin(struct task_struct *tsk);
+void cgroup_threadgroup_change_end(struct task_struct *tsk);
 
 #else	/* CONFIG_CGROUPS */
 
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index d0b380e..e38681f 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -25,6 +25,13 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_GROUP_RWSEM(sig)						\
+	.group_rwsem = __RWSEM_INITIALIZER(sig.group_rwsem),
+#else
+#define INIT_GROUP_RWSEM(sig)
+#endif
+
 #ifdef CONFIG_CPUSETS
 #define INIT_CPUSET_SEQ(tsk)							\
 	.mems_allowed_seq = SEQCNT_ZERO(tsk.mems_allowed_seq),
@@ -57,6 +64,7 @@ extern struct fs_struct init_fs;
 	INIT_PREV_CPUTIME(sig)						\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
+	INIT_GROUP_RWSEM(sig)						\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4ab9da..b7b9501 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -762,6 +762,18 @@ struct signal_struct {
 	unsigned audit_tty_log_passwd;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * group_rwsem prevents new tasks from entering the threadgroup and
+	 * member tasks from exiting, more specifically, setting of
+	 * PF_EXITING.  fork and exit paths are protected with this rwsem
+	 * using threadgroup_change_begin/end().  Users which require
+	 * threadgroup to remain stable should use threadgroup_[un]lock()
+	 * which also takes care of exec path.  Currently, cgroup is the
+	 * only user.
+	 */
+	struct rw_semaphore group_rwsem;
+#endif
 
 	oom_flags_t oom_flags;
 	short oom_score_adj;		/* OOM kill score adjustment */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 115091e..2c9eae6 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -46,7 +46,6 @@
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/rwsem.h>
-#include <linux/percpu-rwsem.h>
 #include <linux/string.h>
 #include <linux/sort.h>
 #include <linux/kmod.h>
@@ -104,8 +103,6 @@ static DEFINE_SPINLOCK(cgroup_idr_lock);
  */
 static DEFINE_SPINLOCK(release_agent_path_lock);
 
-struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
-
 #define cgroup_assert_mutex_or_rcu_locked()				\
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\
 			   !lockdep_is_held(&cgroup_mutex),		\
@@ -874,6 +871,48 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	return cset;
 }
 
+void cgroup_threadgroup_change_begin(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->group_rwsem);
+}
+
+void cgroup_threadgroup_change_end(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->group_rwsem);
+}
+
+/**
+ * threadgroup_lock - lock threadgroup
+ * @tsk: member task of the threadgroup to lock
+ *
+ * Lock the threadgroup @tsk belongs to.  No new task is allowed to enter
+ * and member tasks aren't allowed to exit (as indicated by PF_EXITING) or
+ * change ->group_leader/pid.  This is useful for cases where the threadgroup
+ * needs to stay stable across blockable operations.
+ *
+ * fork and exit explicitly call threadgroup_change_{begin|end}() for
+ * synchronization.  While held, no new task will be added to threadgroup
+ * and no existing live task will have its PF_EXITING set.
+ *
+ * de_thread() does threadgroup_change_{begin|end}() when a non-leader
+ * sub-thread becomes a new leader.
+ */
+static void threadgroup_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->group_rwsem);
+}
+
+/**
+ * threadgroup_unlock - unlock threadgroup
+ * @tsk: member task of the threadgroup to unlock
+ *
+ * Reverse threadgroup_lock().
+ */
+static inline void threadgroup_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->group_rwsem);
+}
+
 static struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root)
 {
 	struct cgroup *root_cgrp = kf_root->kn->priv;
@@ -2074,9 +2113,9 @@ static void cgroup_task_migrate(struct cgroup *old_cgrp,
 	lockdep_assert_held(&css_set_rwsem);
 
 	/*
-	 * We are synchronized through cgroup_threadgroup_rwsem against
-	 * PF_EXITING setting such that we can't race against cgroup_exit()
-	 * changing the css_set to init_css_set and dropping the old one.
+	 * We are synchronized through threadgroup_lock() against PF_EXITING
+	 * setting such that we can't race against cgroup_exit() changing the
+	 * css_set to init_css_set and dropping the old one.
 	 */
 	WARN_ON_ONCE(tsk->flags & PF_EXITING);
 	old_cset = task_css_set(tsk);
@@ -2133,11 +2172,10 @@ static void cgroup_migrate_finish(struct list_head *preloaded_csets)
  * @src_cset and add it to @preloaded_csets, which should later be cleaned
  * up by cgroup_migrate_finish().
  *
- * This function may be called without holding cgroup_threadgroup_rwsem
- * even if the target is a process.  Threads may be created and destroyed
- * but as long as cgroup_mutex is not dropped, no new css_set can be put
- * into play and the preloaded css_sets are guaranteed to cover all
- * migrations.
+ * This function may be called without holding threadgroup_lock even if the
+ * target is a process.  Threads may be created and destroyed but as long
+ * as cgroup_mutex is not dropped, no new css_set can be put into play and
+ * the preloaded css_sets are guaranteed to cover all migrations.
  */
 static void cgroup_migrate_add_src(struct css_set *src_cset,
 				   struct cgroup *dst_cgrp,
@@ -2240,7 +2278,7 @@ static int cgroup_migrate_prepare_dst(struct cgroup *dst_cgrp,
  * @threadgroup: whether @leader points to the whole process or a single task
  *
  * Migrate a process or task denoted by @leader to @cgrp.  If migrating a
- * process, the caller must be holding cgroup_threadgroup_rwsem.  The
+ * process, the caller must be holding threadgroup_lock of @leader.  The
  * caller is also responsible for invoking cgroup_migrate_add_src() and
  * cgroup_migrate_prepare_dst() on the targets before invoking this
  * function and following up with cgroup_migrate_finish().
@@ -2368,7 +2406,7 @@ static int cgroup_migrate(struct cgroup *cgrp, struct task_struct *leader,
  * @leader: the task or the leader of the threadgroup to be attached
  * @threadgroup: attach the whole threadgroup?
  *
- * Call holding cgroup_mutex and cgroup_threadgroup_rwsem.
+ * Call holding cgroup_mutex and threadgroup_lock of @leader.
  */
 static int cgroup_attach_task(struct cgroup *dst_cgrp,
 			      struct task_struct *leader, bool threadgroup)
@@ -2490,7 +2528,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	get_task_struct(tsk);
 	rcu_read_unlock();
 
-	percpu_down_write(&cgroup_threadgroup_rwsem);
+	threadgroup_lock(tsk);
 	if (threadgroup) {
 		if (!thread_group_leader(tsk)) {
 			/*
@@ -2500,7 +2538,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 			 * try again; this is
 			 * "double-double-toil-and-trouble-check locking".
 			 */
-			percpu_up_write(&cgroup_threadgroup_rwsem);
+			threadgroup_unlock(tsk);
 			put_task_struct(tsk);
 			goto retry_find_task;
 		}
@@ -2510,7 +2548,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	if (!ret)
 		ret = cgroup_attach_task(cgrp, tsk, threadgroup);
 
-	percpu_up_write(&cgroup_threadgroup_rwsem);
+	threadgroup_unlock(tsk);
 
 	put_task_struct(tsk);
 out_unlock_cgroup:
@@ -2713,17 +2751,17 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 				goto out_finish;
 			last_task = task;
 
-			percpu_down_write(&cgroup_threadgroup_rwsem);
+			threadgroup_lock(task);
 			/* raced against de_thread() from another thread? */
 			if (!thread_group_leader(task)) {
-				percpu_up_write(&cgroup_threadgroup_rwsem);
+				threadgroup_unlock(task);
 				put_task_struct(task);
 				continue;
 			}
 
 			ret = cgroup_migrate(src_cset->dfl_cgrp, task, true);
 
-			percpu_up_write(&cgroup_threadgroup_rwsem);
+			threadgroup_unlock(task);
 			put_task_struct(task);
 
 			if (WARN(ret, "cgroup: failed to update controllers for the default hierarchy (%d), further operations may crash or hang\n", ret))
@@ -5045,7 +5083,6 @@ int __init cgroup_init(void)
 	unsigned long key;
 	int ssid, err;
 
-	BUG_ON(percpu_init_rwsem(&cgroup_threadgroup_rwsem));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 7d5f0f1..2845623 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1149,6 +1149,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->group_rwsem);
+#endif
+
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking"
  2015-09-16 15:55 ` [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking" Tejun Heo
  2015-09-16 15:56   ` [PATCH cgroup/for-4.3-fixes 2/2] Revert "sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem" Tejun Heo
@ 2015-09-16 17:00   ` Oleg Nesterov
  2015-09-16 18:45   ` Christian Borntraeger
  2 siblings, 0 replies; 31+ messages in thread
From: Oleg Nesterov @ 2015-09-16 17:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Borntraeger, Ingo Molnar, Peter Zijlstra,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	Paul E. McKenney, Paolo Bonzini, KVM list

On 09/16, Tejun Heo wrote:
>
> From f9f9e7b776142fb1c0782cade004cc8e0147a199 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Wed, 16 Sep 2015 11:51:12 -0400
>
> This reverts commit b5ba75b5fc0e8404e2c50cb68f39bb6a53fc916f.
>
> d59cfc09c32a ("sched, cgroup: replace signal_struct->group_rwsem with
> a global percpu_rwsem") and b5ba75b5fc0e ("cgroup: simplify
> threadgroup locking") changed how cgroup synchronizes against task
> fork and exits so that it uses global percpu_rwsem instead of
> per-process rwsem; unfortunately, the write [un]lock paths of
> percpu_rwsem always involve synchronize_rcu_expedited() which turned
> out to be too expensive.
>
> Improvements for percpu_rwsem are scheduled to be merged in the coming
> v4.4-rc1 merge window which alleviates this issue.  For now, revert
> the two commits to restore per-process rwsem.  They will be re-applied
> for the v4.4-rc1 merge window.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: stable@vger.kernel.org # v4.2+

So just in case, I agree. Perhaps we could merge the percpu_rwsem changes
in v4.3, but these patches look much safer for -stable.

Oleg.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking"
  2015-09-16 15:55 ` [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking" Tejun Heo
  2015-09-16 15:56   ` [PATCH cgroup/for-4.3-fixes 2/2] Revert "sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem" Tejun Heo
  2015-09-16 17:00   ` [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking" Oleg Nesterov
@ 2015-09-16 18:45   ` Christian Borntraeger
  2 siblings, 0 replies; 31+ messages in thread
From: Christian Borntraeger @ 2015-09-16 18:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	Oleg Nesterov, Paul E. McKenney, Paolo Bonzini, KVM list

Both patches in combination
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # on top of 4.3-rc1


As a side note, patch 2 does not apply cleanly on 4.2, so we probably need to 
provide a separate backport.

Christian


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2015-09-16 18:45 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-15 12:05 [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm Christian Borntraeger
2015-09-15 13:05 ` Peter Zijlstra
2015-09-15 13:36   ` Christian Borntraeger
2015-09-15 13:53     ` Tejun Heo
2015-09-15 16:42     ` Paolo Bonzini
2015-09-15 17:38       ` Paul E. McKenney
2015-09-16  8:32         ` Paolo Bonzini
2015-09-16  8:57           ` Christian Borntraeger
2015-09-16  9:12             ` Paolo Bonzini
2015-09-16 12:22               ` Oleg Nesterov
2015-09-16 12:35                 ` Paolo Bonzini
2015-09-16 12:43                   ` Oleg Nesterov
2015-09-16 12:56                 ` Christian Borntraeger
2015-09-16 14:16                 ` Tejun Heo
2015-09-16 14:19                   ` Paolo Bonzini
2015-09-15 21:11       ` Christian Borntraeger
2015-09-15 21:26         ` Tejun Heo
2015-09-15 21:38           ` Paul E. McKenney
2015-09-15 22:28             ` Tejun Heo
2015-09-15 23:38               ` Paul E. McKenney
2015-09-16  1:24                 ` Tejun Heo
2015-09-16  4:35                   ` Paul E. McKenney
2015-09-16 11:06                     ` Tejun Heo
2015-09-16  7:44                   ` Christian Borntraeger
2015-09-16 10:58                     ` Christian Borntraeger
2015-09-16 11:03                       ` Tejun Heo
2015-09-16 11:50                         ` Christian Borntraeger
2015-09-16 15:55 ` [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking" Tejun Heo
2015-09-16 15:56   ` [PATCH cgroup/for-4.3-fixes 2/2] Revert "sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem" Tejun Heo
2015-09-16 17:00   ` [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking" Oleg Nesterov
2015-09-16 18:45   ` Christian Borntraeger
