* [Patch 0 of 2]: PV-domain SMP performance
From: Juergen Gross @ 2008-12-17 12:21 UTC (permalink / raw)
To: xen-devel@lists.xensource.com
Hi,
I've experimented a little with the Xen scheduler to improve the performance of
paravirtualized SMP domains, including Dom0.
Under heavy system load, a vcpu might be descheduled while it is inside a
critical section. This in turn increases the system load even further if other
vcpus of the same domain are waiting for the descheduled vcpu to leave the
critical section.
I've created a patch for Xen and one for the Linux kernel to show that
cooperative scheduling can help avoid this problem, or at least make it less
critical.
A vcpu can set a flag "no_desched" in its vcpu_info structure (I've used a
previously unused hole in the structure), which tells the Xen scheduler to keep
the vcpu running for some more time. If the vcpu would otherwise have been
descheduled, the guest is informed via another flag in the vcpu_info that it
should voluntarily give up control after leaving the critical section. If the
guest does not cooperate, it will be descheduled after 1 msec anyway.
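The hypervisor side of this protocol can be sketched as a small userspace model in C. This is not code from the patch: the post names only the `no_desched` flag, so `desched_pending`, `struct vcpu_sched_state`, `on_timeslice_end()` and `on_guest_yield()` are hypothetical names. The 1 msec grace period is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of the two flags shared with the guest. Only
 * "no_desched" is named in the post; "desched_pending" is invented here. */
struct shared_vcpu_flags {
    volatile uint8_t no_desched;      /* guest: "I am in a critical section" */
    volatile uint8_t desched_pending; /* hypervisor: "please yield soon" */
};

/* Hypothetical per-vcpu scheduler bookkeeping. */
struct vcpu_sched_state {
    bool     extended;        /* currently running on borrowed time? */
    uint64_t extend_start_ns; /* when the extension was granted */
};

#define DESCHED_GRACE_NS 1000000ULL /* 1 msec, as described in the post */

enum sched_action { SCHED_PREEMPT, SCHED_EXTEND };

/* Called when the scheduler would normally preempt this vcpu. */
enum sched_action on_timeslice_end(struct shared_vcpu_flags *f,
                                   struct vcpu_sched_state *s,
                                   uint64_t now_ns)
{
    if (f->no_desched) {
        if (!s->extended) {
            /* First request: grant extra time and ask the guest to yield
             * as soon as it leaves the critical section. */
            s->extended = true;
            s->extend_start_ns = now_ns;
            f->desched_pending = 1;
            return SCHED_EXTEND;
        }
        if (now_ns - s->extend_start_ns < DESCHED_GRACE_NS)
            return SCHED_EXTEND;
        /* Grace period used up: an uncooperative guest is preempted anyway. */
    }
    s->extended = false;
    f->desched_pending = 0;
    return SCHED_PREEMPT;
}

/* Called when the guest voluntarily yields after its critical section. */
void on_guest_yield(struct shared_vcpu_flags *f, struct vcpu_sched_state *s)
{
    s->extended = false;
    f->desched_pending = 0;
}
```

In this model the extension is granted at most once per grace period, so a guest that never clears `no_desched` still loses the CPU after 1 msec.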
I've run some tests in Dom0 with a small benchmark that produces high system
load:
    time -p (dd if=/dev/urandom count=1M | cat >/dev/null)
The system is a 4-processor x86_64 machine running the latest xen-unstable. For
each test, the time outputs of all parallel runs are summed (e.g. 2 parallel
runs lasting 60 seconds each are reported as 120 seconds).
Repeated tests returned very similar results (deviation of about 1-2%).
First configuration: 4 vcpus, no pinning:
-----------------------------------------
1 run: real: 79.92 user: 1.04 sys: 78.83
2 runs: real: 271.28 user: 1.91 sys: 269.35
4 runs: real: 882.32 user: 5.70 sys: 875.50
Second configuration: 4 vcpus, all pinned to cpu 0:
---------------------------------------------------
1 run: real: 400.55 user: 0.10 sys: 380.28
2 runs: real: 1270.68 user: 2.58 sys: 653.28
4 runs: real: 1558.27 user: 20.99 sys: 368.10
The same tests with my patches:
First configuration: 4 vcpus, no pinning:
-----------------------------------------
1 run: real: 81.85 user: 1.00 sys: 81.29
2 runs: real: 229.62 user: 2.07 sys: 191.31
4 runs: real: 878.63 user: 3.61 sys: 873.76
Second configuration: 4 vcpus, all pinned to cpu 0:
---------------------------------------------------
1 run: real: 274.06 user: 0.74 sys: 58.88
2 runs: real: 999.77 user: 1.27 sys: 98.61
4 runs: real: 1251.00 user: 16.58 sys: 291.66
This result was achieved by avoiding descheduling of a vcpu only when irqs
are blocked. Even better results might be possible with some fine tuning
(e.g. instrumenting bh_enable/bh_disable).
I think system time has dropped remarkably!
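To put a number on "remarkably", the relative change in summed sys time can be computed from the two sets of tables above. The sketch below only copies those figures into arrays; `relative_reduction()` and `print_sys_reductions()` are illustrative helpers, not part of the patches.

```c
#include <assert.h>
#include <stdio.h>

/* Summed "sys" seconds copied from the benchmark tables
 * (index 0 = 1 run, 1 = 2 runs, 2 = 4 parallel runs). */
static const double sys_unpinned_before[3] = { 78.83, 269.35, 875.50 };
static const double sys_unpinned_after[3]  = { 81.29, 191.31, 873.76 };
static const double sys_pinned_before[3]   = { 380.28, 653.28, 368.10 };
static const double sys_pinned_after[3]    = { 58.88, 98.61, 291.66 };

/* Fraction by which `after` is lower than `before`; negative means higher. */
double relative_reduction(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before;
}

/* Prints the sys-time reduction in percent for both configurations. */
void print_sys_reductions(void)
{
    static const int runs[3] = { 1, 2, 4 };
    for (int i = 0; i < 3; i++)
        printf("%d run(s): unpinned %+6.1f%%  pinned %+6.1f%%\n", runs[i],
               100.0 * relative_reduction(sys_unpinned_before[i],
                                          sys_unpinned_after[i]),
               100.0 * relative_reduction(sys_pinned_before[i],
                                          sys_pinned_after[i]));
}
```

With these figures, sys time in the pinned configuration drops by roughly 85% for 1 and 2 runs and by about 21% for 4 runs. The unpinned results are mixed: about 29% lower for 2 runs, essentially unchanged for 4 runs, and about 3% higher for a single run.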
Patch 1 contains the hypervisor support.
Patch 2 contains my Linux support in irq_enable and irq_disable.
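The guest side described for patch 2 can be modelled roughly as follows. This is a hypothetical sketch, not the actual patch: `guest_irq_disable()` and `guest_irq_enable()` stand in for the instrumented irq_disable/irq_enable, `desched_pending` is an invented name for the second vcpu_info flag, and `hypothetical_yield()` only counts calls where a real kernel would issue a yield hypercall.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag layout: only "no_desched" is named in the post. */
struct shared_vcpu_flags {
    volatile uint8_t no_desched;
    volatile uint8_t desched_pending;
};

/* Stand-in for a "yield this vcpu" hypercall; here it only counts calls. */
unsigned yield_count;
void hypothetical_yield(void) { yield_count++; }

void guest_irq_disable(struct shared_vcpu_flags *f)
{
    /* A real PV kernel would also mask event delivery here. */
    f->no_desched = 1;
}

void guest_irq_enable(struct shared_vcpu_flags *f)
{
    /* Leave the critical section first, then check whether the
     * hypervisor extended our time slice and asked us to yield. */
    f->no_desched = 0;
    atomic_thread_fence(memory_order_seq_cst);
    if (f->desched_pending) {
        f->desched_pending = 0;
        hypothetical_yield();
    }
}
```

The order matters in this model: once `no_desched` is cleared the hypervisor grants no further extensions, so a request the guest does not see is resolved by ordinary preemption rather than left pending.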
Juergen
--
Juergen Gross Principal Developer
IP SW OS6 Telephone: +49 (0) 89 636 47950
Fujitsu Siemens Computers e-mail: juergen.gross@fujitsu-siemens.com
Otto-Hahn-Ring 6 Internet: www.fujitsu-siemens.com
D-81739 Muenchen Company details: www.fujitsu-siemens.com/imprint.html
* Re: [Patch 0 of 2]: PV-domain SMP performance
From: Keir Fraser @ 2008-12-17 12:38 UTC (permalink / raw)
To: Juergen Gross, xen-devel@lists.xensource.com
On 17/12/2008 12:21, "Juergen Gross" <juergen.gross@fujitsu-siemens.com>
wrote:
> This result was achieved by avoiding descheduling of a vcpu only when irqs
> are blocked. Even better results might be possible with some fine tuning
> (e.g. instrumenting bh_enable/bh_disable).
> I think system time has dropped remarkably!
It's nice, but it'd be more compelling if a win were shown on a real
benchmark. Under normal workloads there is actually little lock contention
in the Linux kernel, and hence I think the scope for wins is limited.
Also, pv_ops Linux already has some extra smartness in its spinlock
implementation. A spinner will sleep after some time, making it more likely
that the lock holder will run (who then wakes the sleeper when the lock is
released). You'd need to compare with that approach (which required no extra
hypervisor interfaces).
-- Keir
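The spin-then-sleep idea described above can be illustrated with a small userspace lock. This is only a sketch under assumptions: `SPIN_LIMIT` is an arbitrary spin budget, a pthread condition variable stands in for the hypervisor's block/wake mechanism, and none of the names below come from the pv_ops Linux code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_LIMIT 1024 /* hypothetical spin budget before going to sleep */

struct spin_sleep_lock {
    atomic_bool     locked;
    pthread_mutex_t wait_mtx; /* protects the sleep/wake handshake */
    pthread_cond_t  wait_cv;  /* stand-in for blocking the vcpu */
    int             sleepers; /* waiters the unlocker must wake */
};

#define SPIN_SLEEP_LOCK_INIT \
    { false, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

static bool try_acquire(struct spin_sleep_lock *l)
{
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

void ssl_lock(struct spin_sleep_lock *l)
{
    for (;;) {
        /* Fast path: spin for a bounded time. */
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (!atomic_load_explicit(&l->locked, memory_order_relaxed) &&
                try_acquire(l))
                return;

        /* Spun too long: the holder is probably not running, so sleep and
         * let it run. Rechecking under wait_mtx prevents lost wakeups. */
        pthread_mutex_lock(&l->wait_mtx);
        l->sleepers++;
        while (atomic_load(&l->locked))
            pthread_cond_wait(&l->wait_cv, &l->wait_mtx);
        l->sleepers--;
        pthread_mutex_unlock(&l->wait_mtx);
        /* Woken up: spin again; another thread may win the race. */
    }
}

void ssl_unlock(struct spin_sleep_lock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
    /* The releasing holder wakes any sleepers. */
    pthread_mutex_lock(&l->wait_mtx);
    if (l->sleepers > 0)
        pthread_cond_broadcast(&l->wait_cv);
    pthread_mutex_unlock(&l->wait_mtx);
}

struct demo_arg { struct spin_sleep_lock *lock; long *counter; int iters; };

static void *demo_worker(void *p)
{
    struct demo_arg *a = p;
    for (int i = 0; i < a->iters; i++) {
        ssl_lock(a->lock);
        (*a->counter)++; /* critical section */
        ssl_unlock(a->lock);
    }
    return NULL;
}

/* Runs nthreads (1..16) workers and returns the final counter value. */
long ssl_demo(int nthreads, int iters)
{
    static struct spin_sleep_lock lock = SPIN_SLEEP_LOCK_INIT;
    long counter = 0;
    pthread_t tid[16];
    struct demo_arg arg = { &lock, &counter, iters };

    assert(nthreads >= 1 && nthreads <= 16);
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, demo_worker, &arg);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```

Unlike the no_desched approach, this needs no new hypervisor interface: a preempted holder costs each waiter at most the spin budget, after which the waiters stop burning CPU.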
* Re: [Patch 0 of 2]: PV-domain SMP performance
From: Juergen Gross @ 2008-12-17 14:20 UTC (permalink / raw)
To: Keir Fraser; +Cc: xen-devel@lists.xensource.com
Keir Fraser wrote:
> On 17/12/2008 12:21, "Juergen Gross" <juergen.gross@fujitsu-siemens.com>
> wrote:
>
>> This result was achieved by avoiding descheduling of a vcpu only when irqs
>> are blocked. Even better results might be possible with some fine tuning
>> (e.g. instrumenting bh_enable/bh_disable).
>> I think system time has dropped remarkably!
>
> It's nice, but it'd be more compelling if a win were shown on a real
> benchmark. Under normal workloads there is actually little lock contention
> in the Linux kernel, and hence I think the scope for wins is limited.
>
> Also, pv_ops Linux already has some extra smartness in its spinlock
> implementation. A spinner will sleep after some time, making it more likely
> that the lock holder will run (who then wakes the sleeper when the lock is
> released). You'd need to compare with that approach (which required no extra
> hypervisor interfaces).
Sure, my benchmark is a very special case :-)
The advantage of my solution is that it is a general mechanism to avoid
preemption of a vcpu in critical sections, instead of dealing with the
preemption after it has occurred.
Can pv_ops Linux deal with locks held during interrupt handling?
What about other code paths which should be completed in a short time?
About real world applications:
Again 4 vcpus pinned to one physical cpu, 3 files copied via scp to the test
machine at the same time, each file about 50 MB.
Linux-xen from xensource: about 1:50 elapsed time for each job
My modified Linux: about 0:50 elapsed time
Juergen
* Re: [Patch 0 of 2]: PV-domain SMP performance
From: Keir Fraser @ 2008-12-17 15:05 UTC (permalink / raw)
To: Juergen Gross; +Cc: xen-devel@lists.xensource.com
On 17/12/2008 14:20, "Juergen Gross" <juergen.gross@fujitsu-siemens.com>
wrote:
> The advantage of my solution is that it is a general mechanism to avoid
> preemption of a vcpu in critical sections, instead of dealing with the
> preemption after it has occurred.
> Can pv_ops Linux deal with locks held during interrupt handling?
> What about other code paths which should be completed in a short time?
Yes, the approach is the reverse of yours: it deals with a preempted lock
holder after the fact. It handles irq-safe locks just fine; there is no reason
for it not to.
> About real world applications:
> Again 4 vcpus pinned to one physical cpu, 3 files copied via scp to the test
> machine at the same time, each file about 50 MB.
>
> Linux-xen from xensource: about 1:50 elapsed time for each job
> My modified Linux: about 0:50 elapsed time
So this provides great wins for those who run multi-vcpu VMs on a single
physical CPU? ;-) Actually getting a speedup on this benchmark even in that
configuration is a surprise I will admit -- I'd expect most time to be spent
in sshd in user space. By 0:50 for each job you mean 0:50 for 50MB? That's
10Mbps and I wouldn't even expect a single CPU working alone to be breaking
a sweat. Weird...
-- Keir
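The throughput estimate above can be checked quickly. The helper below is illustrative only and assumes 1 MB = 10^6 bytes: 50 MB in 50 seconds gives 8 Mbit/s, the same ballpark as the "10Mbps" quoted, and the unpatched 1:50 (110 s) case works out to about 3.6 Mbit/s.

```c
#include <assert.h>

/* Average throughput in Mbit/s for `megabytes` (10^6 bytes each)
 * transferred in `seconds`. */
double mbit_per_s(double megabytes, double seconds)
{
    assert(seconds > 0.0);
    return megabytes * 8.0 / seconds;
}
```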
* Re: [Patch 0 of 2]: PV-domain SMP performance
From: Juergen Gross @ 2008-12-18 12:06 UTC (permalink / raw)
To: Keir Fraser; +Cc: xen-devel@lists.xensource.com
Keir Fraser wrote:
> On 17/12/2008 14:20, "Juergen Gross" <juergen.gross@fujitsu-siemens.com>
> wrote:
>> About real world applications:
>> Again 4 vcpus pinned to one physical cpu, 3 files copied via scp to the test
>> machine at the same time, each file about 50 MB.
>>
>> Linux-xen from xensource: about 1:50 elapsed time for each job
>> My modified Linux: about 0:50 elapsed time
>
> So this provides great wins for those who run multi-vcpu VMs on a single
> physical CPU? ;-)
Absolutely! :-)
I thought this would be an easy way to force vcpu scheduling without having to
use multiple domains.
The picture would be quite similar (maybe not as extreme) on a machine with
several multi-vcpu domains under heavy load.
> Actually getting a speedup on this benchmark even in that
> configuration is a surprise I will admit -- I'd expect most time to be spent
> in sshd in user space. By 0:50 for each job you mean 0:50 for 50MB? That's
> 10Mbps and I wouldn't even expect a single CPU working alone to be breaking
> a sweat. Weird...
I think the problem here is the network interrupts, which occur round-robin on
all vcpus, and those are serialized by the scheduler...
Juergen