* Regression on vcpu_is_preempted()
@ 2022-10-28  8:48 Miaohe Lin
  2022-10-28 10:21 ` Abel Wu
  2022-10-29  8:58 ` Peter Zijlstra
  0 siblings, 2 replies; 6+ messages in thread
From: Miaohe Lin @ 2022-10-28  8:48 UTC (permalink / raw)
  To: mingo@redhat.com, Peter Zijlstra, juri.lelli, vincent.guittot,
	rohit.k.jain
  Cc: dietmar.eggemann, Steven Rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel

Hi all scheduler experts:
  When we run java gc in our 8-vcpu guest *without KVM_FEATURE_STEAL_TIME enabled*, the output looks like this:
    With ParallelGCThreads=4 and ConcGCThreads=4, we have:
	G1 Young Generation: 1 times 1786 ms
	G1 Old Generation: 1 times 1022 ms
    With ParallelGCThreads=5 and ConcGCThreads=5, we have:
	G1 Young Generation: 1 times 1557 ms
	G1 Old Generation: 1 times 1020 ms

  This meets our expectation. But *with KVM_FEATURE_STEAL_TIME enabled* in our guest, the output looks like this:
    With ParallelGCThreads=4 and ConcGCThreads=4, we have:
	G1 Young Generation: 1 times 1637 ms
	G1 Old Generation: 1 times 1022 ms
    With ParallelGCThreads=5 and ConcGCThreads=5, we have:
	G1 Young Generation: 1 times 2164 ms
				      ^^^^
	G1 Old Generation: 1 times 1024 ms

  The duration of G1 Young Generation is far longer than we expected when gc threads = 5. We found that the root cause
is that when KVM_FEATURE_STEAL_TIME is enabled *there are many more (3k+) cpu migrations for java gc threads*. This is due to
the commit below:

  commit 247f2f6f3c706b40b5f3886646f3eb53671258bf
  Author: Rohit Jain <rohit.k.jain@oracle.com>
  Date:   Wed May 2 13:52:10 2018 -0700

    sched/core: Don't schedule threads on pre-empted vCPUs

    In paravirt configurations today, spinlocks figure out whether a vCPU is
    running to determine whether or not spinlock should bother spinning. We
    can use the same logic to prioritize CPUs when scheduling threads. If a
    vCPU has been pre-empted, it will incur the extra cost of VMENTER and
    the time it actually spends to be running on the host CPU. If we had
    other vCPUs which were actually running on the host CPU and idle we
    should schedule threads there.

  When the scheduler tries to select a CPU to run the gc thread, available_idle_cpu() will check vcpu_is_preempted().
It will choose another vcpu to run the gc thread when the current vcpu is preempted. But the preempted vcpu has no other work
to do except continuing to do gc. In our guest, there are more vcpus than java gc threads, so there could always be some
available vcpus when the scheduler tries to select an idle vcpu (running on the host). This leads to lots of cpu migrations and
results in a regression.
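
  To make the selection logic above concrete, here is a minimal user-space sketch in C. It is a model, not kernel
code: struct vcpu_model, model_available_idle_cpu() and model_pick_cpu() are hypothetical names, and the "first
available vcpu" search is a simplification of the real wake-up path. Only the two-step test (idle, and not preempted)
is meant to mirror what available_idle_cpu() checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vCPU state as seen by the guest scheduler. */
struct vcpu_model {
	bool idle;       /* no runnable task on this vCPU */
	bool preempted;  /* host has descheduled this vCPU (steal-time hint) */
};

/* Models the check in available_idle_cpu(): idle AND not preempted. */
static bool model_available_idle_cpu(const struct vcpu_model *v)
{
	return v->idle && !v->preempted;
}

/*
 * Pick a vCPU for a waking task: keep prev if it is available, otherwise
 * take the first available vCPU, and fall back to prev if none qualifies.
 */
static int model_pick_cpu(const struct vcpu_model *vcpus, size_t nr, int prev)
{
	assert(prev >= 0 && (size_t)prev < nr);

	if (model_available_idle_cpu(&vcpus[prev]))
		return prev;

	for (size_t i = 0; i < nr; i++)
		if (model_available_idle_cpu(&vcpus[i]))
			return (int)i;

	return prev;
}
```

  In this model, an idle but preempted prev vcpu loses to any other idle vcpu that is running on the host, which is
the case that causes the migrations described above.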

  I'm not really familiar with this mechanism. Is this a problem that needs to be fixed or improved? Or is this just expected
behavior? Any response would be really appreciated!

Thanks!
Miaohe Lin


* Re: Regression on vcpu_is_preempted()
  2022-10-28  8:48 Regression on vcpu_is_preempted() Miaohe Lin
@ 2022-10-28 10:21 ` Abel Wu
  2022-10-29  2:27   ` Miaohe Lin
  2022-10-29  8:58 ` Peter Zijlstra
  1 sibling, 1 reply; 6+ messages in thread
From: Abel Wu @ 2022-10-28 10:21 UTC (permalink / raw)
  To: Miaohe Lin, mingo@redhat.com, Peter Zijlstra, juri.lelli,
	vincent.guittot, rohit.k.jain
  Cc: dietmar.eggemann, Steven Rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel

Hi Miaohe,

On 10/28/22 4:48 PM, Miaohe Lin wrote:
>    When the scheduler tries to select a CPU to run the gc thread, available_idle_cpu() will check vcpu_is_preempted().
> It will choose another vcpu to run the gc thread when the current vcpu is preempted. But the preempted vcpu has no other work
> to do except continuing to do gc. In our guest, there are more vcpus than java gc threads, so there could always be some
> available vcpus when the scheduler tries to select an idle vcpu (running on the host). This leads to lots of cpu migrations and
> results in a regression.

So you want the preempted idle cpus to run gc threads to maximize the
gc throughput, but available_idle_cpu() keeps them from being selected.
In theory, load balancing will help spread load to these cpus (and
make them VMENTERed), so have you checked whether the gc threads showed a
tendency to stack on the same cpus?


* Re: Regression on vcpu_is_preempted()
  2022-10-28 10:21 ` Abel Wu
@ 2022-10-29  2:27   ` Miaohe Lin
  0 siblings, 0 replies; 6+ messages in thread
From: Miaohe Lin @ 2022-10-29  2:27 UTC (permalink / raw)
  To: Abel Wu
  Cc: dietmar.eggemann, Steven Rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel, mingo@redhat.com, Peter Zijlstra,
	juri.lelli, vincent.guittot, rohit.k.jain

On 2022/10/28 18:21, Abel Wu wrote:
> Hi Miaohe,
> 

Hi Abel, many thanks for your reply. :)

> So you want the preempted idle cpus to run gc threads to maximize the
> gc throughput, but available_idle_cpu() keeps them from being selected.

Yes. The preempted idle cpus have nothing to do, just like the other idle cpus that are running on the host.

> In theory, load balancing will help spread load to these cpus (and
> make them VMENTERed), so have you checked whether the gc threads showed a
> tendency to stack on the same cpus?

When KVM_FEATURE_STEAL_TIME is enabled, gc threads are migrated frequently between cpus, with no tendency to
stack on the same cpus. But the load across cpus looks more balanced in this case. It looks like a tradeoff
between gc throughput and cpu load. Any thoughts?
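
The tradeoff can be illustrated with a toy model in C. This is only a sketch under stated assumptions:
model_count_migrations() and MODEL_NR_VCPUS are hypothetical names, a single thread wakes up repeatedly, every
other vcpu is assumed idle, and the preempted[][] table is made-up input rather than measured data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_VCPUS 8

/*
 * Count how often one waking thread changes vCPU over a series of wake-ups.
 * preempted[w][c] says whether vCPU c is preempted at wake-up w.
 * With honor_preempted == false the thread always stays where it was,
 * which models a guest that has no preemption hint.
 */
static unsigned int model_count_migrations(bool preempted[][MODEL_NR_VCPUS],
					   size_t nr_wakeups,
					   bool honor_preempted)
{
	unsigned int migrations = 0;
	int cur = 0; /* the thread starts on vCPU 0 */

	for (size_t w = 0; w < nr_wakeups; w++) {
		if (!honor_preempted || !preempted[w][cur])
			continue; /* previous vCPU is acceptable: stay */

		/* Previous vCPU is preempted: move to the first one that is not. */
		for (int c = 0; c < MODEL_NR_VCPUS; c++) {
			if (!preempted[w][c]) {
				cur = c;
				migrations++;
				break;
			}
		}
	}
	return migrations;
}
```

With more vcpus than threads, some non-preempted vcpu nearly always exists, so in this model every wake-up that
finds the previous vcpu preempted turns into a migration.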

Thanks,
Miaohe Lin



* Re: Regression on vcpu_is_preempted()
  2022-10-28  8:48 Regression on vcpu_is_preempted() Miaohe Lin
  2022-10-28 10:21 ` Abel Wu
@ 2022-10-29  8:58 ` Peter Zijlstra
  2022-10-29  9:15   ` Miaohe Lin
  1 sibling, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2022-10-29  8:58 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: mingo@redhat.com, juri.lelli, vincent.guittot, rohit.k.jain,
	dietmar.eggemann, Steven Rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel

On Fri, Oct 28, 2022 at 04:48:21PM +0800, Miaohe Lin wrote:
>   When the scheduler tries to select a CPU to run the gc thread,
>   available_idle_cpu() will check vcpu_is_preempted().  It will
>   choose another vcpu to run the gc thread when the current vcpu is
>   preempted. But the preempted vcpu has no other work to do except
>   continuing to do gc. In our guest, there are more vcpus than java gc
>   threads, so there could always be some available vcpus when the
>   scheduler tries to select an idle vcpu (running on the host). This
>   leads to lots of cpu migrations and results in a regression.
> 
>   I'm not really familiar with this mechanism. Is this a problem that
>   needs to be fixed or improved? Or is this just expected behavior?
>   Any response would be really appreciated!

This is pretty much expected behaviour. When a vCPU is preempted the
guest cannot know its state or latency. Typically in the overcommitted
case another vCPU will be running on the CPU and getting our vCPU thread
back will take a considerable amount of time.

If you know you're not over-committed, perhaps you should configure your
VM differently.
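
For context, a minimal sketch of how the guest obtains the hint, assuming the usual KVM steal-time mechanism in
which the host sets a flag in a per-vCPU record shared with the guest. The names below (struct model_steal_time,
MODEL_VCPU_PREEMPTED, model_vcpu_is_preempted()) are hypothetical stand-ins and the struct is cut down; this is a
user-space model, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of the 'preempted' byte; stands in for the KVM preempted flag. */
#define MODEL_VCPU_PREEMPTED (1u << 0)

/* Cut-down, hypothetical stand-in for the per-vCPU steal-time record. */
struct model_steal_time {
	uint64_t steal;     /* time the vCPU was runnable but not running */
	uint8_t  preempted; /* host sets bit 0 while the vCPU is descheduled */
};

/*
 * Without steal time the guest has no hint at all, so the answer is
 * always "not preempted". With it, the guest sees only a single bit:
 * nothing about why the vCPU is off the CPU or when it will return.
 */
static bool model_vcpu_is_preempted(const struct model_steal_time *st,
				    bool steal_time_enabled)
{
	if (!steal_time_enabled)
		return false;
	return (st->preempted & MODEL_VCPU_PREEMPTED) != 0;
}
```

The single bit is why the guest cannot tell a briefly descheduled vCPU on an otherwise idle host from one that is
queued behind other work.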


* Re: Regression on vcpu_is_preempted()
  2022-10-29  8:58 ` Peter Zijlstra
@ 2022-10-29  9:15   ` Miaohe Lin
  2022-10-29 12:23     ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Miaohe Lin @ 2022-10-29  9:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo@redhat.com, juri.lelli, vincent.guittot, rohit.k.jain,
	dietmar.eggemann, Steven Rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel

On 2022/10/29 16:58, Peter Zijlstra wrote:
> This is pretty much expected behaviour. When a vCPU is preempted the
> guest cannot know its state or latency. Typically in the overcommitted
> case another vCPU will be running on the CPU and getting our vCPU thread
> back will take a considerable amount of time.

I see. Many thanks for your kind reply and explanation. :)

> 
> If you know you're not over-committed, perhaps you should configure your
> VM differently.

Do you have any suggestions on how I should configure my VM when it's not over-committed?

Thanks,
Miaohe Lin




* Re: Regression on vcpu_is_preempted()
  2022-10-29  9:15   ` Miaohe Lin
@ 2022-10-29 12:23     ` Peter Zijlstra
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Zijlstra @ 2022-10-29 12:23 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: mingo@redhat.com, juri.lelli, vincent.guittot, rohit.k.jain,
	dietmar.eggemann, Steven Rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel

On Sat, Oct 29, 2022 at 05:15:15PM +0800, Miaohe Lin wrote:
> On 2022/10/29 16:58, Peter Zijlstra wrote:
> > This is pretty much expected behaviour. When a vCPU is preempted the
> > guest cannot know its state or latency. Typically in the overcommitted
> > case another vCPU will be running on the CPU and getting our vCPU thread
> > back will take a considerable amount of time.
> 
> I see. Many thanks for your kind reply and explanation. :)
> 
> > 
> > If you know you're not over-committed, perhaps you should configure your
> > VM differently.
> 
> Do you have any suggestions on how I should configure my VM when it's not over-committed?

I'm not an expert on VMs, but IIRC when you construct a pinned VM (i.e.
1:1 vCPU:CPU relations) this all goes away.

