[Question] PARSEC benchmark has smaller execution time in VM than in native?

xen-devel.lists.xenproject.org archive mirror

* [Question] PARSEC benchmark has smaller execution time in VM than in native?
@ 2016-02-26  5:02 Meng Xu
  2016-02-29 16:06 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 13+ messages in thread
From: Meng Xu @ 2016-02-26  5:02 UTC (permalink / raw)
  To: xen-devel@lists.xen.org; +Cc: Hyon-Young Choi

Hi,

We are comparing execution times in a native machine environment and
in a Xen virtualization environment using the PARSEC benchmark [1].

In the virtualization environment, we run a domU with three VCPUs, each
of them pinned to a core; we pin dom0 to another core that is not
used by the domU.

Inside Linux, both in the domU and in the native environment, we used
cpuset to isolate one core (or VCPU) for the system processes and
another core for the benchmark processes. We also set the isolcpus=
option on the Linux boot command line to keep other, unnecessary
processes off the benchmark core.
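
As a rough illustration of the per-process side of this kind of pinning, here is a minimal Python sketch. It assumes Linux, where `os.sched_setaffinity` is available; the helper name `pin_current_process` is hypothetical, and this is not the cpuset configuration used in the experiment, only a simpler stand-in for the same idea.

```python
import os


def pin_current_process(cpu):
    """Restrict the calling process to a single CPU.

    Returns the affinity set reported by the kernel afterwards, so the
    caller can confirm that the pinning took effect.
    """
    # pid 0 means "the calling process" for both calls below.
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)
```

A benchmark launcher could call `pin_current_process(n)` before exec'ing the workload, so that the child inherits the affinity. This only constrains the process itself; keeping other tasks off that core still needs cpuset or a boot-time option.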

We expected the execution time of the benchmarks in the Xen
virtualization environment to be larger than in the native machine
environment. However, the evaluation gave us the opposite result.

Below is the evaluation data for the canneal and streamcluster benchmarks:

Benchmark: canneal, input=simlarge, conf=gcc-serial
Native: 6.387s
Virtualization: 5.890s

Benchmark: streamcluster, input=simlarge, conf=gcc-serial
Native: 5.276s
Virtualization: 5.240s
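
To put the numbers above in proportion, here is a small Python sketch that computes the relative change of the virtualized run against the native run. The function name `relative_change` is hypothetical; the timings are the ones listed above.

```python
def relative_change(native_s, virt_s):
    """Percentage change of the virtualized time relative to native.

    Negative values mean the virtualized run was faster.
    """
    return (virt_s - native_s) / native_s * 100.0


# (native seconds, virtualized seconds), taken from the data above
results = {
    "canneal": (6.387, 5.890),
    "streamcluster": (5.276, 5.240),
}

for name, (native_s, virt_s) in results.items():
    print(f"{name}: {relative_change(native_s, virt_s):+.2f}%")
```

By this measure canneal is roughly 7.8% faster in the VM, while streamcluster differs by under 1%, which may well be within run-to-run noise for a single measurement.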

Is there anything wrong with our evaluation that would lead to these
abnormal performance results?

Any suggestion or advice is really appreciated.

Thank you very much for your time on this question!

Best regards,

Meng

[1] http://parsec.cs.princeton.edu/

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-02-26  5:02 [Question] PARSEC benchmark has smaller execution time in VM than in native? Meng Xu
@ 2016-02-29 16:06 ` Konrad Rzeszutek Wilk
  2016-02-29 17:29   ` Meng Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-02-29 16:06 UTC (permalink / raw)
  To: Meng Xu; +Cc: Hyon-Young Choi, xen-devel@lists.xen.org

On Fri, Feb 26, 2016 at 12:02:50AM -0500, Meng Xu wrote:
> Hi,
> 

Hey!

CC-ing Elena.

> We are measuring the execution time between native machine environment
> and xen virtualization environment using PARSEC Benchmark [1].
> 
> In virtualiztion environment, we run a domU with three VCPUs, each of
> them pinned to a core; we pin the dom0 to another core that is not
> used by the domU.
> 
> Inside the Linux in domU in virtualization environment and in native
> environment,  We used the cpuset to isolate a core (or VCPU) for the
> system processors and to isolate a core for the benchmark processes.
> We also configured the Linux boot command line with isocpus= option to
> isolate the core for benchmark from other unnecessary processes.

You may want to just offline them and also boot the machine with NUMA
disabled.
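
A minimal Python sketch of what offlining a core looks like through the Linux sysfs CPU hotplug interface. The helper names are hypothetical; writing to the file needs root, and on many x86 systems cpu0 has no `online` file and cannot be offlined.

```python
from pathlib import Path

SYSFS_CPU_ROOT = "/sys/devices/system/cpu"


def cpu_online_path(cpu, sysfs_root=SYSFS_CPU_ROOT):
    """Return the sysfs control file for the given logical CPU number."""
    return Path(sysfs_root) / f"cpu{cpu}" / "online"


def set_cpu_online(cpu, online, sysfs_root=SYSFS_CPU_ROOT):
    """Bring a CPU online (True) or take it offline (False).

    Requires root. Raises OSError if the kernel refuses the request
    or the control file does not exist for this CPU.
    """
    cpu_online_path(cpu, sysfs_root).write_text("1" if online else "0")
```

For example, `set_cpu_online(3, False)` would take logical CPU 3 offline, and `set_cpu_online(3, True)` would bring it back.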

> 
> We expect that execution time of benchmarks in xen virtualization
> environment is larger than the execution time in native machine
> environment. However, the evaluation gave us an opposite result.
> 
> Below is the evaluation data for the canneal and streamcluster benchmarks:
> 
> Benchmark: canneal, input=simlarge, conf=gcc-serial
> Native: 6.387s
> Virtualization: 5.890s
> 
> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
> Native: 5.276s
> Virtualization: 5.240s
> 
> Is there anything wrong with our evaluation that lead to the abnormal
> performance results?

Nothing is wrong. Virtualization is naturally faster than baremetal!

:-)

No clue sadly.
> 
> Any suggestion or advice is really appreciated.
> 
> Thank you very much for your time on this question!
> 
> Best regards,
> 
> Meng
> 
> [1] http://parsec.cs.princeton.edu/
> 
> 
> -----------
> Meng Xu
> PhD Student in Computer and Information Science
> University of Pennsylvania
> http://www.cis.upenn.edu/~mengxu/
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-02-29 16:06 ` Konrad Rzeszutek Wilk
@ 2016-02-29 17:29   ` Meng Xu
  2016-02-29 17:59     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 13+ messages in thread
From: Meng Xu @ 2016-02-29 17:29 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, elena.ufimtseva@oracle.com
  Cc: Hyon-Young Choi, xen-devel@lists.xen.org

On Mon, Feb 29, 2016 at 11:06 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Fri, Feb 26, 2016 at 12:02:50AM -0500, Meng Xu wrote:
>> Hi,
>>
>
> Hey!
>
> CC-ing Elena.

I think you forgot to actually CC her.
Anyway, let's CC her now... :-)

>
>> We are measuring the execution time between native machine environment
>> and xen virtualization environment using PARSEC Benchmark [1].
>>
>> In virtualiztion environment, we run a domU with three VCPUs, each of
>> them pinned to a core; we pin the dom0 to another core that is not
>> used by the domU.
>>
>> Inside the Linux in domU in virtualization environment and in native
>> environment,  We used the cpuset to isolate a core (or VCPU) for the
>> system processors and to isolate a core for the benchmark processes.
>> We also configured the Linux boot command line with isocpus= option to
>> isolate the core for benchmark from other unnecessary processes.
>
> You may want to just offline them and also boot the machine with NUMA
> disabled.

Right, the machine is booted up with NUMA disabled.
We will offline the unnecessary cores then.

>
>>
>> We expect that execution time of benchmarks in xen virtualization
>> environment is larger than the execution time in native machine
>> environment. However, the evaluation gave us an opposite result.
>>
>> Below is the evaluation data for the canneal and streamcluster benchmarks:
>>
>> Benchmark: canneal, input=simlarge, conf=gcc-serial
>> Native: 6.387s
>> Virtualization: 5.890s
>>
>> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
>> Native: 5.276s
>> Virtualization: 5.240s
>>
>> Is there anything wrong with our evaluation that lead to the abnormal
>> performance results?
>
> Nothing is wrong. Virtualization is naturally faster than baremetal!
>
> :-)
>
> No clue sadly.

Ah-ha. This is really surprising to me.... Why would adding one more
layer speed up the system? Unless virtualization disabled some
services that run in native and interfere with the benchmark.

If virtualization is faster than baremetal by nature, why do some
experiments show that virtualization introduces overhead?

For example, VMware did some evaluation at [1]. Fig. 3 on page 9 shows
that virtualization (both VMware ESX301 and Xen) introduces
overhead and the benchmark is slower in virtualization than in native.

 [1] https://www.vmware.com/pdf/hypervisor_performance.pdf

It seems to me that the performance data may be tweaked (kind of
cooked up) to some extent when people are comparing different
hypervisors. One just needs to configure the system in a specific way
to favor one hypervisor over the other.

Meng


* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-02-29 17:29   ` Meng Xu
@ 2016-02-29 17:59     ` Konrad Rzeszutek Wilk
  2016-03-01 13:48       ` Meng Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-02-29 17:59 UTC (permalink / raw)
  To: Meng Xu
  Cc: elena.ufimtseva@oracle.com, Hyon-Young Choi,
	xen-devel@lists.xen.org

> > Hey!
> >
> > CC-ing Elena.
> 
> I think you forgot you cc.ed her..
> Anyway, let's cc. her now... :-)
> 
> >
> >> We are measuring the execution time between native machine environment
> >> and xen virtualization environment using PARSEC Benchmark [1].
> >>
> >> In virtualiztion environment, we run a domU with three VCPUs, each of
> >> them pinned to a core; we pin the dom0 to another core that is not
> >> used by the domU.
> >>
> >> Inside the Linux in domU in virtualization environment and in native
> >> environment,  We used the cpuset to isolate a core (or VCPU) for the
> >> system processors and to isolate a core for the benchmark processes.
> >> We also configured the Linux boot command line with isocpus= option to
> >> isolate the core for benchmark from other unnecessary processes.
> >
> > You may want to just offline them and also boot the machine with NUMA
> > disabled.
> 
> Right, the machine is booted up with NUMA disabled.
> We will offline the unnecessary cores then.
> 
> >
> >>
> >> We expect that execution time of benchmarks in xen virtualization
> >> environment is larger than the execution time in native machine
> >> environment. However, the evaluation gave us an opposite result.
> >>
> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
> >>
> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
> >> Native: 6.387s
> >> Virtualization: 5.890s
> >>
> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
> >> Native: 5.276s
> >> Virtualization: 5.240s
> >>
> >> Is there anything wrong with our evaluation that lead to the abnormal
> >> performance results?
> >
> > Nothing is wrong. Virtualization is naturally faster than baremetal!
> >
> > :-)
> >
> > No clue sadly.
> 
> Ah-ha. This is really surprising to me.... Why will it speed up the
> system by adding one more layer? Unless the virtualization disabled
> some services that occur in native and interfere with the benchmark.
> 
> If virtualization is faster than baremetal by nature, why we can see
> that some experiment shows that virtualization introduces overhead?

Elena told me that there was a weird regression in Linux 4.1, where
CPU-burning workloads were _slower_ on baremetal than as guests.

Updating to a later kernel fixed that: there one could see that
baremetal was faster than (or on par with) the guest.
> 
> For example, VMware did some evaluation at [1]. Fig. 3 on page 9 shows
> that the virtualization (both vmware ESX301 and xen) introduces
> overhead and the benchmark is slower in virtualization than in native.
> 
>  [1] https://www.vmware.com/pdf/hypervisor_performance.pdf
> 
> It seems to me that the performance data may be tweaked (kind of
> cooked up) to some extent when people are comparing different
> hypervisors. we just need to configure the system in a specific way to
> favor one type of hypervisor than the other.

> 
> Meng


* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-02-29 17:59     ` Konrad Rzeszutek Wilk
@ 2016-03-01 13:48       ` Meng Xu
  2016-03-01 18:20         ` Elena Ufimtseva
  0 siblings, 1 reply; 13+ messages in thread
From: Meng Xu @ 2016-03-01 13:48 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: elena.ufimtseva@oracle.com, Hyon-Young Choi,
	xen-devel@lists.xen.org

On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> > Hey!
>> >
>> > CC-ing Elena.
>>
>> I think you forgot you cc.ed her..
>> Anyway, let's cc. her now... :-)
>>
>> >
>> >> We are measuring the execution time between native machine environment
>> >> and xen virtualization environment using PARSEC Benchmark [1].
>> >>
>> >> In virtualiztion environment, we run a domU with three VCPUs, each of
>> >> them pinned to a core; we pin the dom0 to another core that is not
>> >> used by the domU.
>> >>
>> >> Inside the Linux in domU in virtualization environment and in native
>> >> environment,  We used the cpuset to isolate a core (or VCPU) for the
>> >> system processors and to isolate a core for the benchmark processes.
>> >> We also configured the Linux boot command line with isocpus= option to
>> >> isolate the core for benchmark from other unnecessary processes.
>> >
>> > You may want to just offline them and also boot the machine with NUMA
>> > disabled.
>>
>> Right, the machine is booted up with NUMA disabled.
>> We will offline the unnecessary cores then.
>>
>> >
>> >>
>> >> We expect that execution time of benchmarks in xen virtualization
>> >> environment is larger than the execution time in native machine
>> >> environment. However, the evaluation gave us an opposite result.
>> >>
>> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
>> >>
>> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
>> >> Native: 6.387s
>> >> Virtualization: 5.890s
>> >>
>> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
>> >> Native: 5.276s
>> >> Virtualization: 5.240s
>> >>
>> >> Is there anything wrong with our evaluation that lead to the abnormal
>> >> performance results?
>> >
>> > Nothing is wrong. Virtualization is naturally faster than baremetal!
>> >
>> > :-)
>> >
>> > No clue sadly.
>>
>> Ah-ha. This is really surprising to me.... Why will it speed up the
>> system by adding one more layer? Unless the virtualization disabled
>> some services that occur in native and interfere with the benchmark.
>>
>> If virtualization is faster than baremetal by nature, why we can see
>> that some experiment shows that virtualization introduces overhead?
>
> Elena told me that there were some weird regression in Linux 4.1 - where
> CPU burning workloads were _slower_ on baremetal than as guests.

Hi Elena,
Would you mind sharing with us some of your experience of how you
found the real reason? Did you use some tool or methodology to
pin down the reason (i.e., why CPU-burning workloads are _slower_
on baremetal than as guests)?



>
> Updating to a later kernel fixed that  -where one could see that
> baremetal was faster (or on par) with the guest.

Thank you very much, Konrad! We are giving it a shot. :-D

Best Regards,

Meng


* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 13:48       ` Meng Xu
@ 2016-03-01 18:20         ` Elena Ufimtseva
  2016-03-01 19:52           ` Meng Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Elena Ufimtseva @ 2016-03-01 18:20 UTC (permalink / raw)
  To: Meng Xu; +Cc: xen-devel@lists.xen.org, Hyon-Young Choi

On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
> On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> >> > Hey!
> >> >
> >> > CC-ing Elena.
> >>
> >> I think you forgot you cc.ed her..
> >> Anyway, let's cc. her now... :-)
> >>
> >> >
> >> >> We are measuring the execution time between native machine environment
> >> >> and xen virtualization environment using PARSEC Benchmark [1].
> >> >>
> >> >> In virtualiztion environment, we run a domU with three VCPUs, each of
> >> >> them pinned to a core; we pin the dom0 to another core that is not
> >> >> used by the domU.
> >> >>
> >> >> Inside the Linux in domU in virtualization environment and in native
> >> >> environment,  We used the cpuset to isolate a core (or VCPU) for the
> >> >> system processors and to isolate a core for the benchmark processes.
> >> >> We also configured the Linux boot command line with isocpus= option to
> >> >> isolate the core for benchmark from other unnecessary processes.
> >> >
> >> > You may want to just offline them and also boot the machine with NUMA
> >> > disabled.
> >>
> >> Right, the machine is booted up with NUMA disabled.
> >> We will offline the unnecessary cores then.
> >>
> >> >
> >> >>
> >> >> We expect that execution time of benchmarks in xen virtualization
> >> >> environment is larger than the execution time in native machine
> >> >> environment. However, the evaluation gave us an opposite result.
> >> >>
> >> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
> >> >>
> >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
> >> >> Native: 6.387s
> >> >> Virtualization: 5.890s
> >> >>
> >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
> >> >> Native: 5.276s
> >> >> Virtualization: 5.240s
> >> >>
> >> >> Is there anything wrong with our evaluation that lead to the abnormal
> >> >> performance results?
> >> >
> >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
> >> >
> >> > :-)
> >> >
> >> > No clue sadly.
> >>
> >> Ah-ha. This is really surprising to me.... Why will it speed up the
> >> system by adding one more layer? Unless the virtualization disabled
> >> some services that occur in native and interfere with the benchmark.
> >>
> >> If virtualization is faster than baremetal by nature, why we can see
> >> that some experiment shows that virtualization introduces overhead?
> >
> > Elena told me that there were some weird regression in Linux 4.1 - where
> > CPU burning workloads were _slower_ on baremetal than as guests.
> 
> Hi Elena,
> Would you mind sharing with us some of your experience of how you
> found the real reason? Did you use some tool or some methodology to
> pin down the reason (i.e,  CPU burning workloads in native is _slower_
> on baremetal than as guests)?
>

Hi Meng

Yes, sure!

While working on performance tests for the SMT-exposing patches from
Joao, I ran a CPU-bound workload in an HVM guest and then ran the same
test on baremetal using the same kernel.
While testing the CPU-bound workload on baremetal Linux (4.1.0-rc2),
I found that the time to complete the test was a few times longer than
it takes under the HVM guest.
I tried the tests both with kernel threads pinned to cores and without
pinning.
Most of the time the execution times were about twice as long, and
sometimes 4 times as long, as in the HVM case.

What is interesting is not only that it sometimes takes 3-4 times
longer than in the HVM guest, but also that the test with threads bound
to cores takes almost 3 times longer to execute than the same
CPU-bound test under HVM (in all configurations).

I ran each test 5 times; here are the execution times (in seconds):

--------------------------------------------------
          baremetal          |
 thread_bind | thread_unbind | HVM pinned to cores
-------------|---------------|--------------------
      74     |      83       |         28
      74     |      88       |         28
      74     |      38       |         28
      74     |      73       |         28
      74     |      87       |         28
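
For a quick summary of the table above, here is a small Python sketch that computes the mean of each column and the slowdown relative to the HVM runs. The variable and function names are hypothetical; the numbers are copied from the table.

```python
from statistics import mean

# Execution times in seconds, five runs per configuration (from the table)
baremetal_bound = [74, 74, 74, 74, 74]
baremetal_unbound = [83, 88, 38, 73, 87]
hvm_pinned = [28, 28, 28, 28, 28]


def slowdown(times, baseline):
    """Ratio of mean execution time to the mean of the baseline runs."""
    return mean(times) / mean(baseline)


for label, times in [("bound", baremetal_bound),
                     ("unbound", baremetal_unbound)]:
    print(f"baremetal {label}: mean {mean(times):.1f}s, "
          f"{slowdown(times, hvm_pinned):.2f}x HVM")
```

Both baremetal columns average about 74 seconds, roughly 2.6 times the HVM time, although the unbound runs vary far more from run to run (38 to 88 seconds).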

Sometimes the unbound tests gave better times, but not often enough
to present here. Some results were much worse, reaching up to 120
seconds.

Each test has 8 kernel threads. In the baremetal case I tried the
following:
- NUMA off and on;
- all CPUs online;
- isolating the CPUs of the first node;
- setting intel_idle.max_cstate=1;
- disabling intel_pstate;
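
When comparing configurations like these, it can help to record which boot parameters were actually in effect for each run. Below is a minimal Python sketch that parses the kernel command line from `/proc/cmdline`; the function names are hypothetical, and the simple whitespace split does not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dictionary.

    Flags without '=' (such as 'ro') map to None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Return the parsed command line of the running kernel (Linux only)."""
    with open(path) as f:
        return parse_cmdline(f.read())
```

For instance, `read_cmdline().get("intel_idle.max_cstate")` returns `"1"` when that option was set at boot and `None` when it was absent.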

I don't think I have exhausted all the options here, but it looked like
the last two changes did improve performance, though it was still not
comparable to the HVM case.
I am trying to find where the regression happened. Performance on a
newer kernel (I tried 4.5.0-rc4+) was close to or better than HVM.

I am trying to find out if there were relevant regressions, to
understand the reason for this.


What kernel are you guys using?

Elena

See a fuller description of the tests here:
http://lists.xenproject.org/archives/html/xen-devel/2016-01/msg02874.html
Joao's patches are here:
http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg03115.html




> 
> 
> >
> > Updating to a later kernel fixed that  -where one could see that
> > baremetal was faster (or on par) with the guest.
> 
> Thank you very much, Konrad! We are giving it a shot. :-D
> 
> Best Regards,
> 
> Meng


* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 18:20         ` Elena Ufimtseva
@ 2016-03-01 19:52           ` Meng Xu
  2016-03-01 20:39             ` Elena Ufimtseva
  0 siblings, 1 reply; 13+ messages in thread
From: Meng Xu @ 2016-03-01 19:52 UTC (permalink / raw)
  To: Elena Ufimtseva; +Cc: xen-devel@lists.xen.org, Hyon-Young Choi

Hi Elena,

Thank you very much for sharing this! :-)

On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
<elena.ufimtseva@oracle.com> wrote:
>
> On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
> > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
> > <konrad.wilk@oracle.com> wrote:
> > >> > Hey!
> > >> >
> > >> > CC-ing Elena.
> > >>
> > >> I think you forgot you cc.ed her..
> > >> Anyway, let's cc. her now... :-)
> > >>
> > >> >
> > >> >> We are measuring the execution time between native machine environment
> > >> >> and xen virtualization environment using PARSEC Benchmark [1].
> > >> >>
> > >> >> In virtualiztion environment, we run a domU with three VCPUs, each of
> > >> >> them pinned to a core; we pin the dom0 to another core that is not
> > >> >> used by the domU.
> > >> >>
> > >> >> Inside the Linux in domU in virtualization environment and in native
> > >> >> environment,  We used the cpuset to isolate a core (or VCPU) for the
> > >> >> system processors and to isolate a core for the benchmark processes.
> > >> >> We also configured the Linux boot command line with isocpus= option to
> > >> >> isolate the core for benchmark from other unnecessary processes.
> > >> >
> > >> > You may want to just offline them and also boot the machine with NUMA
> > >> > disabled.
> > >>
> > >> Right, the machine is booted up with NUMA disabled.
> > >> We will offline the unnecessary cores then.
> > >>
> > >> >
> > >> >>
> > >> >> We expect that execution time of benchmarks in xen virtualization
> > >> >> environment is larger than the execution time in native machine
> > >> >> environment. However, the evaluation gave us an opposite result.
> > >> >>
> > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
> > >> >>
> > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
> > >> >> Native: 6.387s
> > >> >> Virtualization: 5.890s
> > >> >>
> > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
> > >> >> Native: 5.276s
> > >> >> Virtualization: 5.240s
> > >> >>
> > >> >> Is there anything wrong with our evaluation that lead to the abnormal
> > >> >> performance results?
> > >> >
> > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
> > >> >
> > >> > :-)
> > >> >
> > >> > No clue sadly.
> > >>
> > >> Ah-ha. This is really surprising to me.... Why will it speed up the
> > >> system by adding one more layer? Unless the virtualization disabled
> > >> some services that occur in native and interfere with the benchmark.
> > >>
> > >> If virtualization is faster than baremetal by nature, why we can see
> > >> that some experiment shows that virtualization introduces overhead?
> > >
> > > Elena told me that there were some weird regression in Linux 4.1 - where
> > > CPU burning workloads were _slower_ on baremetal than as guests.
> >
> > Hi Elena,
> > Would you mind sharing with us some of your experience of how you
> > found the real reason? Did you use some tool or some methodology to
> > pin down the reason (i.e,  CPU burning workloads in native is _slower_
> > on baremetal than as guests)?
> >
>
> Hi Meng
>
> Yes, sure!
>
> While working on performance tests for smt-exposing patches from Joao
> I run CPU bound workload in HVM guest and using same kernel in baremetal
> run same test.
> While testing cpu-bound workload on baremetal linux (4.1.0-rc2)
> I found that the time to complete the same test is few times more that
> as it takes for the same under HVM guest.
> I have tried tests where kernel threads pinned to cores and without pinning.
> The execution times are most of the times take as twice longer, sometimes 4
> times longer that HVM case.
>
> Interesting is not only that it takes sometimes 3-4 times more
> than HVM guest, but also that test with bound threads (to cores) takes almost
> 3 times longer
> to execute than running same cpu-bound test under HVM (in all
> configurations).


Wow, I didn't expect that native performance could be so "bad".... ;-)

>
>
> I run each test 5 times and here are the execution times (seconds):
>
> -------------------------------------------------
>         baremetal           |
> thread_bind | thread unbind | HVM pinned to cores
> ----------- |---------------|---------------------
>      74     |     83        |        28
>      74     |     88        |        28
>      74     |     38        |        28
>      74     |     73        |        28
>      74     |     87        |        28
>
> Sometimes better times were on unbinded tests, but not often enough
> to present it here. Some results are much worse and reach up to 120
> seconds.
>
> Each test has 8 kernel threads. In baremetal case I tried the following:
> - numa off,on;
> - all cpus are on;
> - isolate cpus from first node;
> - set intel_idle.max_cstate=1;
> - disable intel_pstate;
>
> I dont think I have exhausted all the options here, but it looked like
> two last changes did improve performance, but was still not comparable to
> HVM case.
> I am trying to find where regression had happened. Performance on newer
> kernel (I tried 4.5.0-rc4+) was close or better than HVM.
>
> I am trying to find f there were some relevant regressions to understand
> the reason of this.


I see. If this only happens with SMT, it may be caused by the
SMT-related load balancing in the Linux scheduler.
However, I have disabled HT on my machine. That is probably also
the reason why I didn't see so much difference in performance.

>
>
>
> What kernel you guys use?


I'm using a quite old kernel, 3.10.31. The reason I'm using this
kernel is that I want to use LITMUS^RT [1], which is a Linux testbed
for real-time scheduling research. (It has a new version though, and I
can upgrade to the latest version to see if the "problem" still
occurs.)

Thanks and Best Regards,

Meng


* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 19:52           ` Meng Xu
@ 2016-03-01 20:39             ` Elena Ufimtseva
  2016-03-01 21:51               ` Sander Eikelenboom
  2016-03-02 19:41               ` Meng Xu
  0 siblings, 2 replies; 13+ messages in thread
From: Elena Ufimtseva @ 2016-03-01 20:39 UTC (permalink / raw)
  To: Meng Xu; +Cc: xen-devel@lists.xen.org, Hyon-Young Choi

On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
> Hi Elena,
> 
> Thank you very much for sharing this! :-)
> 
> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
> <elena.ufimtseva@oracle.com> wrote:
> >
> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
> > > <konrad.wilk@oracle.com> wrote:
> > > >> > Hey!
> > > >> >
> > > >> > CC-ing Elena.
> > > >>
> > > >> I think you forgot you cc.ed her..
> > > >> Anyway, let's cc. her now... :-)
> > > >>
> > > >> >
> > > >> >> We are measuring the execution time between native machine environment
> > > >> >> and xen virtualization environment using PARSEC Benchmark [1].
> > > >> >>
> > > >> >> In virtualiztion environment, we run a domU with three VCPUs, each of
> > > >> >> them pinned to a core; we pin the dom0 to another core that is not
> > > >> >> used by the domU.
> > > >> >>
> > > >> >> Inside the Linux in domU in virtualization environment and in native
> > > >> >> environment,  We used the cpuset to isolate a core (or VCPU) for the
> > > >> >> system processors and to isolate a core for the benchmark processes.
> > > >> >> We also configured the Linux boot command line with isocpus= option to
> > > >> >> isolate the core for benchmark from other unnecessary processes.
> > > >> >
> > > >> > You may want to just offline them and also boot the machine with NUMA
> > > >> > disabled.
> > > >>
> > > >> Right, the machine is booted up with NUMA disabled.
> > > >> We will offline the unnecessary cores then.
> > > >>
> > > >> >
> > > >> >>
> > > >> >> We expect that execution time of benchmarks in xen virtualization
> > > >> >> environment is larger than the execution time in native machine
> > > >> >> environment. However, the evaluation gave us an opposite result.
> > > >> >>
> > > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
> > > >> >>
> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
> > > >> >> Native: 6.387s
> > > >> >> Virtualization: 5.890s
> > > >> >>
> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
> > > >> >> Native: 5.276s
> > > >> >> Virtualization: 5.240s
> > > >> >>
> > > >> >> Is there anything wrong with our evaluation that lead to the abnormal
> > > >> >> performance results?
> > > >> >
> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
> > > >> >
> > > >> > :-)
> > > >> >
> > > >> > No clue sadly.
> > > >>
> > > >> Ah-ha. This is really surprising to me.... Why will it speed up the
> > > >> system by adding one more layer? Unless the virtualization disabled
> > > >> some services that occur in native and interfere with the benchmark.
> > > >>
> > > >> If virtualization is faster than baremetal by nature, why we can see
> > > >> that some experiment shows that virtualization introduces overhead?
> > > >
> > > > Elena told me that there were some weird regression in Linux 4.1 - where
> > > > CPU burning workloads were _slower_ on baremetal than as guests.
> > >
> > > Hi Elena,
> > > Would you mind sharing with us some of your experience of how you
> > > found the real reason? Did you use some tool or some methodology to
> > > pin down the reason (i.e,  CPU burning workloads in native is _slower_
> > > on baremetal than as guests)?
> > >
> >
> > Hi Meng
> >
> > Yes, sure!
> >
> > While working on performance tests for smt-exposing patches from Joao
> > I run CPU bound workload in HVM guest and using same kernel in baremetal
> > run same test.
> > While testing cpu-bound workload on baremetal linux (4.1.0-rc2)
> > I found that the time to complete the same test is few times more that
> > as it takes for the same under HVM guest.
> > I have tried tests where kernel threads pinned to cores and without pinning.
> > The execution times are most of the times take as twice longer, sometimes 4
> > times longer that HVM case.
> >
> > Interesting is not only that it takes sometimes 3-4 times more
> > than HVM guest, but also that test with bound threads (to cores) takes almost
> > 3 times longer
> > to execute than running same cpu-bound test under HVM (in all
> > configurations).
> 
> 
> wow~ I didn't expect the native performance can be so "bad".... ;-)

Yes, quite a surprise :)
> 
> >
> >
> > I run each test 5 times and here are the execution times (seconds):
> >
> > -------------------------------------------------
> >         baremetal           |
> > thread_bind | thread unbind | HVM pinned to cores
> > ----------- |---------------|---------------------
> >      74     |     83        |        28
> >      74     |     88        |        28
> >      74     |     38        |        28
> >      74     |     73        |        28
> >      74     |     87        |        28
> >
> > Sometimes better times were on unbinded tests, but not often enough
> > to present it here. Some results are much worse and reach up to 120
> > seconds.
> >
> > Each test has 8 kernel threads. In baremetal case I tried the following:
> > - numa off,on;
> > - all cpus are on;
> > - isolate cpus from first node;
> > - set intel_idle.max_cstate=1;
> > - disable intel_pstate;
> >
> > I dont think I have exhausted all the options here, but it looked like
> > two last changes did improve performance, but was still not comparable to
> > HVM case.
> > I am trying to find where regression had happened. Performance on newer
> > kernel (I tried 4.5.0-rc4+) was close or better than HVM.
> >
> > I am trying to find f there were some relevant regressions to understand
> > the reason of this.
> 
> 
> I see. If this is only happening for the SMT, it may be caused by the
> SMT-related load balancing in Linux scheduler.
> However, I have disabled the HT on my machine. Probably, that's also
> the reason why I didn't see so much different in performance.

I enabled tracing to see whether there is extensive migration.
The test machine has two nodes, 8 cores each, 2 threads per core, 32 logical cpus in total.

The kernel threads are not bound, and here is the output for the lifetime of one of the threads:

cat ./t-komp_trace |grep t-kompressor|grep 18883

    t-kompressor-18883 [028] d... 69458.596403: sched_switch: prev_comm=kthreadd prev_pid=18883 prev_prio=120 prev_state=D ==> next_comm=swapper/28 next_pid=0 next_prio=120
          insmod-18875 [027] dN.. 69458.669180: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=28 dest_cpu=9
          <idle>-0     [009] d... 69458.669205: sched_switch: prev_comm=swapper/9 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
    t-kompressor-18883 [009] d... 69486.997626: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=R ==> next_comm=migration/9 next_pid=52 next_prio=0
     migration/9-52    [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
          <idle>-0     [025] d... 69486.997641: sched_switch: prev_comm=swapper/25 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
    t-kompressor-18883 [025] d... 69486.997710: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=x ==> next_comm=swapper/25 next_pid=0 next_prio=120
          insmod-18875 [011] .N.. 69503.119960: sched_kthread_stop: comm=t-kompressor pid=18883


Threads are spawned from two cores, then some of the threads migrate to other cores.
In the example above, the thread is spawned on cpu 27 and, when woken up, runs on cpu 009.
Later it migrates to 025, which is the second thread of the same core (009).
While I am not sure why this migration happens, it does not seem to contribute a lot.
Anyway, this picture repeats for some other threads (some stay where they were woken up):

    t-kompressor-18880 [014] dNh. 69485.802729: sched_migrate_task: comm=hald pid=3820 prio=120 orig_cpu=14 dest_cpu=11
    migration/13-72    [013] d... 69486.707459: sched_migrate_task: comm=t-kompressor pid=18878 prio=120 orig_cpu=13 dest_cpu=29
    migration/14-77    [014] d... 69486.783818: sched_migrate_task: comm=t-kompressor pid=18880 prio=120 orig_cpu=14 dest_cpu=30
     migration/8-47    [008] d... 69486.792667: sched_migrate_task: comm=t-kompressor pid=18882 prio=120 orig_cpu=8 dest_cpu=24
    migration/15-82    [015] d... 69486.796429: sched_migrate_task: comm=t-kompressor pid=18881 prio=120 orig_cpu=15 dest_cpu=31
    migration/10-57    [010] d... 69486.857848: sched_migrate_task: comm=t-kompressor pid=18884 prio=120 orig_cpu=10 dest_cpu=26
     migration/9-52    [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
    migration/28-147   [028] d... 69503.073577: sched_migrate_task: comm=t-kompressor pid=18876 prio=120 orig_cpu=28 dest_cpu=10

All threads run on their own cores, and some migrate to the second SMT thread over time.
I probably should have traced some other scheduling events, but I have not found any relevant ones yet.
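
For anyone who wants to repeat this check on their own trace, here is a minimal sketch of how the sched_migrate_task lines could be tallied per thread. It only assumes the ftrace text format shown above; the function names are made up for illustration, and the sibling offset of 16 is an assumption based on this machine's numbering (cpu 9 and cpu 25 share a core), so it may differ on other hardware.

```python
import re
from collections import defaultdict

# Matches sched_migrate_task events in the text format shown in the trace above.
MIGRATE_RE = re.compile(
    r"sched_migrate_task: comm=(?P<comm>\S+) pid=(?P<pid>\d+) "
    r"prio=\d+ orig_cpu=(?P<orig>\d+) dest_cpu=(?P<dest>\d+)"
)


def migrations_by_pid(lines, comm="t-kompressor"):
    """Return {pid: [(orig_cpu, dest_cpu), ...]} for tasks named `comm`."""
    moves = defaultdict(list)
    for line in lines:
        match = MIGRATE_RE.search(line)
        if match and match.group("comm") == comm:
            pid = int(match.group("pid"))
            moves[pid].append((int(match.group("orig")), int(match.group("dest"))))
    return dict(moves)


def is_smt_sibling(orig_cpu, dest_cpu, offset=16):
    """Assumption: sibling hyperthreads are numbered `offset` apart,
    as cpu 9 and cpu 25 are on the test machine described above."""
    return abs(orig_cpu - dest_cpu) == offset
```

Feeding it the lines of the trace file (for example `migrations_by_pid(open("t-komp_trace"))`) gives the list of moves for each thread, which can then be split into sibling and non-sibling migrations.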

> 
> >
> >
> >
> > What kernel you guys use?
> 
> 
> I'm using a quite old kernel
> 3.10.31
> . The reason why I'm using this kernel is because I want to use the
> LITMUS^RT [1], which is a linux testbed for real-time scheduling
> research. (It has a new version though, and I can upgrade to the
> latest version to see if the "problem" still occurs.)

Yes, it will be interesting to see the outcome.

What difference in numbers do you see?
What machines are you seeing it on?
Is your workload purely cpu-bound?


Thanks!

> 
> Thanks and Best Regards,
> 
> Meng


* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 20:39             ` Elena Ufimtseva
@ 2016-03-01 21:51               ` Sander Eikelenboom
  2016-03-01 22:06                 ` Elena Ufimtseva
                                   ` (2 more replies)
  2016-03-02 19:41               ` Meng Xu
  1 sibling, 3 replies; 13+ messages in thread
From: Sander Eikelenboom @ 2016-03-01 21:51 UTC (permalink / raw)
  To: Elena Ufimtseva; +Cc: Hyon-Young Choi, Meng Xu, xen-devel@lists.xen.org


Tuesday, March 1, 2016, 9:39:25 PM, you wrote:

> On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
>> Hi Elena,
>> 
>> Thank you very much for sharing this! :-)
>> 
>> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
>> <elena.ufimtseva@oracle.com> wrote:
>> >
>> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
>> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
>> > > <konrad.wilk@oracle.com> wrote:
>> > > >> > Hey!
>> > > >> >
>> > > >> > CC-ing Elena.
>> > > >>
>> > > >> I think you forgot you cc.ed her..
>> > > >> Anyway, let's cc. her now... :-)
>> > > >>
>> > > >> >
>> > > >> >> We are measuring the execution time between native machine environment
>> > > >> >> and xen virtualization environment using PARSEC Benchmark [1].
>> > > >> >>
>> > > >> >> In virtualiztion environment, we run a domU with three VCPUs, each of
>> > > >> >> them pinned to a core; we pin the dom0 to another core that is not
>> > > >> >> used by the domU.
>> > > >> >>
>> > > >> >> Inside the Linux in domU in virtualization environment and in native
>> > > >> >> environment,  We used the cpuset to isolate a core (or VCPU) for the
>> > > >> >> system processors and to isolate a core for the benchmark processes.
>> > > >> >> We also configured the Linux boot command line with isocpus= option to
>> > > >> >> isolate the core for benchmark from other unnecessary processes.
>> > > >> >
>> > > >> > You may want to just offline them and also boot the machine with NUMA
>> > > >> > disabled.
>> > > >>
>> > > >> Right, the machine is booted up with NUMA disabled.
>> > > >> We will offline the unnecessary cores then.
>> > > >>
>> > > >> >
>> > > >> >>
>> > > >> >> We expect that execution time of benchmarks in xen virtualization
>> > > >> >> environment is larger than the execution time in native machine
>> > > >> >> environment. However, the evaluation gave us an opposite result.
>> > > >> >>
>> > > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
>> > > >> >>
>> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 6.387s
>> > > >> >> Virtualization: 5.890s
>> > > >> >>
>> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 5.276s
>> > > >> >> Virtualization: 5.240s
>> > > >> >>
>> > > >> >> Is there anything wrong with our evaluation that lead to the abnormal
>> > > >> >> performance results?
>> > > >> >
>> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
>> > > >> >
>> > > >> > :-)
>> > > >> >
>> > > >> > No clue sadly.
>> > > >>
>> > > >> Ah-ha. This is really surprising to me.... Why will it speed up the
>> > > >> system by adding one more layer? Unless the virtualization disabled
>> > > >> some services that occur in native and interfere with the benchmark.
>> > > >>
>> > > >> If virtualization is faster than baremetal by nature, why we can see
>> > > >> that some experiment shows that virtualization introduces overhead?
>> > > >
>> > > > Elena told me that there were some weird regression in Linux 4.1 - where
>> > > > CPU burning workloads were _slower_ on baremetal than as guests.
>> > >
>> > > Hi Elena,
>> > > Would you mind sharing with us some of your experience of how you
>> > > found the real reason? Did you use some tool or some methodology to
>> > > pin down the reason (i.e,  CPU burning workloads in native is _slower_
>> > > on baremetal than as guests)?
>> > >
>> >
>> > Hi Meng
>> >
>> > Yes, sure!
>> >
>> > While working on performance tests for smt-exposing patches from Joao
>> > I run CPU bound workload in HVM guest and using same kernel in baremetal
>> > run same test.
>> > While testing cpu-bound workload on baremetal linux (4.1.0-rc2)
>> > I found that the time to complete the same test is few times more that
>> > as it takes for the same under HVM guest.
>> > I have tried tests where kernel threads pinned to cores and without pinning.
>> > The execution times are most of the times take as twice longer, sometimes 4
>> > times longer that HVM case.
>> >
>> > Interesting is not only that it takes sometimes 3-4 times more
>> > than HVM guest, but also that test with bound threads (to cores) takes almost
>> > 3 times longer
>> > to execute than running same cpu-bound test under HVM (in all
>> > configurations).
>> 
>> 
>> wow~ I didn't expect the native performance can be so "bad".... ;-)

> Yes, quite a surprise :)
>> 
>> >
>> >
>> > I run each test 5 times and here are the execution times (seconds):
>> >
>> > -------------------------------------------------
>> >         baremetal           |
>> > thread_bind | thread unbind | HVM pinned to cores
>> > ----------- |---------------|---------------------
>> >      74     |     83        |        28
>> >      74     |     88        |        28
>> >      74     |     38        |        28
>> >      74     |     73        |        28
>> >      74     |     87        |        28
>> >
>> > Sometimes better times were on unbinded tests, but not often enough
>> > to present it here. Some results are much worse and reach up to 120
>> > seconds.
>> >
>> > Each test has 8 kernel threads. In baremetal case I tried the following:
>> > - numa off,on;
>> > - all cpus are on;
>> > - isolate cpus from first node;
>> > - set intel_idle.max_cstate=1;
>> > - disable intel_pstate;
>> >
>> > I dont think I have exhausted all the options here, but it looked like
>> > two last changes did improve performance, but was still not comparable to
>> > HVM case.
>> > I am trying to find where regression had happened. Performance on newer
>> > kernel (I tried 4.5.0-rc4+) was close or better than HVM.

Just a perhaps silly thought, but could something in the time measurement
differ between the two setups and explain the slightly surprising results?
--
Sander 


* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 21:51               ` Sander Eikelenboom
@ 2016-03-01 22:06                 ` Elena Ufimtseva
  2016-03-01 22:12                 ` Dario Faggioli
  2016-03-02 19:44                 ` Meng Xu
  2 siblings, 0 replies; 13+ messages in thread
From: Elena Ufimtseva @ 2016-03-01 22:06 UTC (permalink / raw)
  To: Sander Eikelenboom; +Cc: Hyon-Young Choi, Meng Xu, xen-devel@lists.xen.org

On Tue, Mar 01, 2016 at 10:51:30PM +0100, Sander Eikelenboom wrote:
> 
> Tuesday, March 1, 2016, 9:39:25 PM, you wrote:
> 
> > On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
> >> Hi Elena,
> >> 
> >> Thank you very much for sharing this! :-)
> >> 
> >> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
> >> <elena.ufimtseva@oracle.com> wrote:
> >> >
> >> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
> >> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
> >> > > <konrad.wilk@oracle.com> wrote:
> >> > > >> > Hey!
> >> > > >> >
> >> > > >> > CC-ing Elena.
> >> > > >>
> >> > > >> I think you forgot you cc.ed her..
> >> > > >> Anyway, let's cc. her now... :-)
> >> > > >>
> >> > > >> >
> >> > > >> >> We are measuring the execution time between native machine environment
> >> > > >> >> and xen virtualization environment using PARSEC Benchmark [1].
> >> > > >> >>
> >> > > >> >> In virtualiztion environment, we run a domU with three VCPUs, each of
> >> > > >> >> them pinned to a core; we pin the dom0 to another core that is not
> >> > > >> >> used by the domU.
> >> > > >> >>
> >> > > >> >> Inside the Linux in domU in virtualization environment and in native
> >> > > >> >> environment,  We used the cpuset to isolate a core (or VCPU) for the
> >> > > >> >> system processors and to isolate a core for the benchmark processes.
> >> > > >> >> We also configured the Linux boot command line with isocpus= option to
> >> > > >> >> isolate the core for benchmark from other unnecessary processes.
> >> > > >> >
> >> > > >> > You may want to just offline them and also boot the machine with NUMA
> >> > > >> > disabled.
> >> > > >>
> >> > > >> Right, the machine is booted up with NUMA disabled.
> >> > > >> We will offline the unnecessary cores then.
> >> > > >>
> >> > > >> >
> >> > > >> >>
> >> > > >> >> We expect that execution time of benchmarks in xen virtualization
> >> > > >> >> environment is larger than the execution time in native machine
> >> > > >> >> environment. However, the evaluation gave us an opposite result.
> >> > > >> >>
> >> > > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
> >> > > >> >>
> >> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
> >> > > >> >> Native: 6.387s
> >> > > >> >> Virtualization: 5.890s
> >> > > >> >>
> >> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
> >> > > >> >> Native: 5.276s
> >> > > >> >> Virtualization: 5.240s
> >> > > >> >>
> >> > > >> >> Is there anything wrong with our evaluation that lead to the abnormal
> >> > > >> >> performance results?
> >> > > >> >
> >> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
> >> > > >> >
> >> > > >> > :-)
> >> > > >> >
> >> > > >> > No clue sadly.
> >> > > >>
> >> > > >> Ah-ha. This is really surprising to me.... Why will it speed up the
> >> > > >> system by adding one more layer? Unless the virtualization disabled
> >> > > >> some services that occur in native and interfere with the benchmark.
> >> > > >>
> >> > > >> If virtualization is faster than baremetal by nature, why we can see
> >> > > >> that some experiment shows that virtualization introduces overhead?
> >> > > >
> >> > > > Elena told me that there were some weird regression in Linux 4.1 - where
> >> > > > CPU burning workloads were _slower_ on baremetal than as guests.
> >> > >
> >> > > Hi Elena,
> >> > > Would you mind sharing with us some of your experience of how you
> >> > > found the real reason? Did you use some tool or some methodology to
> >> > > pin down the reason (i.e,  CPU burning workloads in native is _slower_
> >> > > on baremetal than as guests)?
> >> > >
> >> >
> >> > Hi Meng
> >> >
> >> > Yes, sure!
> >> >
> >> > While working on performance tests for smt-exposing patches from Joao
> >> > I run CPU bound workload in HVM guest and using same kernel in baremetal
> >> > run same test.
> >> > While testing cpu-bound workload on baremetal linux (4.1.0-rc2)
> >> > I found that the time to complete the same test is few times more that
> >> > as it takes for the same under HVM guest.
> >> > I have tried tests where kernel threads pinned to cores and without pinning.
> >> > The execution times are most of the times take as twice longer, sometimes 4
> >> > times longer that HVM case.
> >> >
> >> > Interesting is not only that it takes sometimes 3-4 times more
> >> > than HVM guest, but also that test with bound threads (to cores) takes almost
> >> > 3 times longer
> >> > to execute than running same cpu-bound test under HVM (in all
> >> > configurations).
> >> 
> >> 
> >> wow~ I didn't expect the native performance can be so "bad".... ;-)
> 
> > Yes, quite a surprise :)
> >> 
> >> >
> >> >
> >> > I run each test 5 times and here are the execution times (seconds):
> >> >
> >> > -------------------------------------------------
> >> >         baremetal           |
> >> > thread_bind | thread unbind | HVM pinned to cores
> >> > ----------- |---------------|---------------------
> >> >      74     |     83        |        28
> >> >      74     |     88        |        28
> >> >      74     |     38        |        28
> >> >      74     |     73        |        28
> >> >      74     |     87        |        28
> >> >
> >> > Sometimes better times were on unbinded tests, but not often enough
> >> > to present it here. Some results are much worse and reach up to 120
> >> > seconds.
> >> >
> >> > Each test has 8 kernel threads. In baremetal case I tried the following:
> >> > - numa off,on;
> >> > - all cpus are on;
> >> > - isolate cpus from first node;
> >> > - set intel_idle.max_cstate=1;
> >> > - disable intel_pstate;
> >> >
> >> > I dont think I have exhausted all the options here, but it looked like
> >> > two last changes did improve performance, but was still not comparable to
> >> > HVM case.
> >> > I am trying to find where regression had happened. Performance on newer
> >> > kernel (I tried 4.5.0-rc4+) was close or better than HVM.
> 
> Just a perhaps silly thought .. but could there be something in the 
> time-measuring that could differ and explain the slightly surprising results ?
> --
> Sander 


Thanks Sander, I also thought about this. I would consider it further if the
execution times were not so obviously bigger: it is 28 sec vs up to a few
minutes, and the difference is noticeable without even registering the time.

Also, there is the expected difference in a ping-pong test, where baremetal
outperforms HVM slightly on the same kernels.
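
To put a number on how large the gap is, here is a small sketch that computes the slowdown ratios from the five runs in the table quoted above. The variable and function names are only illustrative; the values are copied from that table.

```python
from statistics import mean

# Execution times in seconds, copied from the table of five runs.
baremetal_bound = [74, 74, 74, 74, 74]
baremetal_unbound = [83, 88, 38, 73, 87]
hvm_pinned = [28, 28, 28, 28, 28]


def slowdown(native_times, guest_times):
    """Ratio of mean native time to mean guest time (greater than 1 means native is slower)."""
    return mean(native_times) / mean(guest_times)


bound_ratio = slowdown(baremetal_bound, hvm_pinned)
unbound_ratio = slowdown(baremetal_unbound, hvm_pinned)
worst_unbound_ratio = max(baremetal_unbound) / mean(hvm_pinned)
```

Both the bound and the unbound averages come out at roughly 2.6 times the HVM time, and the worst unbound run at a bit over 3 times, which is far more than a clock discrepancy would be expected to explain.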



* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 21:51               ` Sander Eikelenboom
  2016-03-01 22:06                 ` Elena Ufimtseva
@ 2016-03-01 22:12                 ` Dario Faggioli
  2016-03-02 19:44                 ` Meng Xu
  2 siblings, 0 replies; 13+ messages in thread
From: Dario Faggioli @ 2016-03-01 22:12 UTC (permalink / raw)
  To: Sander Eikelenboom, Elena Ufimtseva
  Cc: Meng Xu, Hyon-Young Choi, xen-devel@lists.xen.org


On Tue, 2016-03-01 at 22:51 +0100, Sander Eikelenboom wrote:
> Tuesday, March 1, 2016, 9:39:25 PM, you wrote:
> 
> > On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
> > > 
> > > > I dont think I have exhausted all the options here, but it
> > > > looked like
> > > > two last changes did improve performance, but was still not
> > > > comparable to
> > > > HVM case.
> > > > I am trying to find where regression had happened. Performance
> > > > on newer
> > > > kernel (I tried 4.5.0-rc4+) was close or better than HVM.
>
> Just a perhaps silly thought .. but could there be something in the 
> time-measuring that could differ and explain the slightly surprising
> results ?
>
I was about to say the exact same thing. In fact, when I can, I always
try to measure time either from the host, or even from outside, rather
than inside the VM itself.

The host should be fine, although in Xen that means measuring in
Dom0, which may be affected by virtualization as well, so from
this point of view outside would be preferable.

However, measuring from outside (and, in certain cases, even from
the host) is not always possible and/or comes with a price.

So, it really depends on the benchmark and on the use case, and in a
bunch of them measuring from the guest is actually fine and/or is what
one wants. However, do at least consider that as an option for the
issue being reported.
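
As a rough illustration of measuring from outside, here is a minimal sketch that times a command with a monotonic clock on the machine doing the measuring, rather than trusting the clock of the system under test. The helper name is made up, and how the benchmark is launched remotely (ssh, a serial console, etc.) is left as an assumption.

```python
import subprocess
import time


def time_command(argv):
    """Run `argv` and return (elapsed_seconds, return_code).

    The elapsed time is taken with time.monotonic() on the machine that
    runs this script, so it does not depend on the guest's own clock.
    """
    start = time.monotonic()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.monotonic() - start, result.returncode


# Hypothetical usage: launch the benchmark on the guest over ssh and time
# it from a separate box. Host name and script are placeholders.
# elapsed, rc = time_command(["ssh", "guest", "./run-benchmark.sh"])
```

Note that this includes the launch overhead (for example the ssh round trip), which is part of the price mentioned above; for runs of several seconds it should be small, but it is worth measuring with an empty command first.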

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 21:51               ` Sander Eikelenboom
  2016-03-01 22:06                 ` Elena Ufimtseva
  2016-03-01 22:12                 ` Dario Faggioli
@ 2016-03-02 19:44                 ` Meng Xu
  2 siblings, 0 replies; 13+ messages in thread
From: Meng Xu @ 2016-03-02 19:44 UTC (permalink / raw)
  To: Sander Eikelenboom
  Cc: Elena Ufimtseva, Hyon-Young Choi, xen-devel@lists.xen.org

On Tue, Mar 1, 2016 at 4:51 PM, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>
> Tuesday, March 1, 2016, 9:39:25 PM, you wrote:
>
>> On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
>>> Hi Elena,
>>>
>>> Thank you very much for sharing this! :-)
>>>
>>> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
>>> <elena.ufimtseva@oracle.com> wrote:
>>> >
>>> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
>>> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
>>> > > <konrad.wilk@oracle.com> wrote:
>>> > > >> > Hey!
>>> > > >> >
>>> > > >> > CC-ing Elena.
>>> > > >>
>>> > > >> I think you forgot you cc.ed her..
>>> > > >> Anyway, let's cc. her now... :-)
>>> > > >>
>>> > > >> >
>>> > > >> >> We are measuring the execution time between native machine environment
>>> > > >> >> and xen virtualization environment using PARSEC Benchmark [1].
>>> > > >> >>
>>> > > >> >> In virtualiztion environment, we run a domU with three VCPUs, each of
>>> > > >> >> them pinned to a core; we pin the dom0 to another core that is not
>>> > > >> >> used by the domU.
>>> > > >> >>
>>> > > >> >> Inside the Linux in domU in virtualization environment and in native
>>> > > >> >> environment,  We used the cpuset to isolate a core (or VCPU) for the
>>> > > >> >> system processors and to isolate a core for the benchmark processes.
>>> > > >> >> We also configured the Linux boot command line with isolcpus= option to
>>> > > >> >> isolate the core for benchmark from other unnecessary processes.
>>> > > >> >
>>> > > >> > You may want to just offline them and also boot the machine with NUMA
>>> > > >> > disabled.
>>> > > >>
>>> > > >> Right, the machine is booted up with NUMA disabled.
>>> > > >> We will offline the unnecessary cores then.
>>> > > >>
>>> > > >> >
>>> > > >> >>
>>> > > >> >> We expect that execution time of benchmarks in xen virtualization
>>> > > >> >> environment is larger than the execution time in native machine
>>> > > >> >> environment. However, the evaluation gave us an opposite result.
>>> > > >> >>
>>> > > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
>>> > > >> >>
>>> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
>>> > > >> >> Native: 6.387s
>>> > > >> >> Virtualization: 5.890s
>>> > > >> >>
>>> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
>>> > > >> >> Native: 5.276s
>>> > > >> >> Virtualization: 5.240s
>>> > > >> >>
>>> > > >> >> Is there anything wrong with our evaluation that lead to the abnormal
>>> > > >> >> performance results?
>>> > > >> >
>>> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
>>> > > >> >
>>> > > >> > :-)
>>> > > >> >
>>> > > >> > No clue sadly.
>>> > > >>
>>> > > >> Ah-ha. This is really surprising to me.... Why will it speed up the
>>> > > >> system by adding one more layer? Unless the virtualization disabled
>>> > > >> some services that occur in native and interfere with the benchmark.
>>> > > >>
>>> > > >> If virtualization is faster than baremetal by nature, why we can see
>>> > > >> that some experiment shows that virtualization introduces overhead?
>>> > > >
>>> > > > Elena told me that there were some weird regression in Linux 4.1 - where
>>> > > > CPU burning workloads were _slower_ on baremetal than as guests.
>>> > >
>>> > > Hi Elena,
>>> > > Would you mind sharing with us some of your experience of how you
>>> > > found the real reason? Did you use some tool or some methodology to
>>> > > pin down the reason (i.e,  CPU burning workloads in native is _slower_
>>> > > on baremetal than as guests)?
>>> > >
>>> >
>>> > Hi Meng
>>> >
>>> > Yes, sure!
>>> >
>>> > While working on performance tests for smt-exposing patches from Joao
>>> > I run CPU bound workload in HVM guest and using same kernel in baremetal
>>> > run same test.
>>> > While testing cpu-bound workload on baremetal linux (4.1.0-rc2)
>>> > I found that the time to complete the same test is few times more that
>>> > as it takes for the same under HVM guest.
>>> > I have tried tests where kernel threads pinned to cores and without pinning.
>>> > The execution times are most of the times take as twice longer, sometimes 4
>>> > times longer that HVM case.
>>> >
>>> > Interesting is not only that it takes sometimes 3-4 times more
>>> > than HVM guest, but also that test with bound threads (to cores) takes almost
>>> > 3 times longer
>>> > to execute than running same cpu-bound test under HVM (in all
>>> > configurations).
>>>
>>>
>>> wow~ I didn't expect the native performance can be so "bad".... ;-)
>
>> Yes, quite a surprise :)
>>>
>>> >
>>> >
>>> > I run each test 5 times and here are the execution times (seconds):
>>> >
>>> > -------------------------------------------------
>>> >         baremetal           |
>>> > thread_bind | thread unbind | HVM pinned to cores
>>> > ----------- |---------------|---------------------
>>> >      74     |     83        |        28
>>> >      74     |     88        |        28
>>> >      74     |     38        |        28
>>> >      74     |     73        |        28
>>> >      74     |     87        |        28
>>> >
>>> > Sometimes better times were on unbinded tests, but not often enough
>>> > to present it here. Some results are much worse and reach up to 120
>>> > seconds.
>>> >
>>> > Each test has 8 kernel threads. In baremetal case I tried the following:
>>> > - numa off,on;
>>> > - all cpus are on;
>>> > - isolate cpus from first node;
>>> > - set intel_idle.max_cstate=1;
>>> > - disable intel_pstate;
>>> >
>>> > I dont think I have exhausted all the options here, but it looked like
>>> > two last changes did improve performance, but was still not comparable to
>>> > HVM case.
>>> > I am trying to find where regression had happened. Performance on newer
>>> > kernel (I tried 4.5.0-rc4+) was close or better than HVM.
>
> Just a perhaps silly thought .. but could there be something in the
> time-measuring that could differ and explain the slightly surprising results ?

Thanks Sander! Actually, I also considered this explanation, as Elena
did. If time measurement were the cause, the difference in execution
time should not vary across different types of workloads/programs.
That's why I think time measurement is not the reason here (at least
not the main reason). :-)
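
One way to sanity-check the time-measurement hypothesis is to record both
wall-clock time and CPU time for the same run and compare them in each
environment. Below is a minimal sketch, assuming Python 3 is available on
both the native and the domU side. The busy_loop workload and the measure
helper are made up for illustration and are not part of PARSEC.

```python
import time


def busy_loop(n):
    """Hypothetical CPU-bound stand-in for a benchmark run."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def measure(func, *args):
    """Return (wall_seconds, cpu_seconds, result) for one call of func."""
    wall_start = time.perf_counter()   # monotonic wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu, result


if __name__ == "__main__":
    wall, cpu, _ = measure(busy_loop, 1_000_000)
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

If the two numbers disagree noticeably in one environment only, the clock
source there would be worth a closer look.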

Best,

Meng



* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 20:39             ` Elena Ufimtseva
  2016-03-01 21:51               ` Sander Eikelenboom
@ 2016-03-02 19:41               ` Meng Xu
  1 sibling, 0 replies; 13+ messages in thread
From: Meng Xu @ 2016-03-02 19:41 UTC (permalink / raw)
  To: Elena Ufimtseva; +Cc: xen-devel@lists.xen.org, Hyon-Young Choi

Hi Elena,


On Tue, Mar 1, 2016 at 3:39 PM, Elena Ufimtseva
<elena.ufimtseva@oracle.com> wrote:
> On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
>> Hi Elena,
>>
>> Thank you very much for sharing this! :-)
>>
>> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
>> <elena.ufimtseva@oracle.com> wrote:
>> >
>> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
>> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
>> > > <konrad.wilk@oracle.com> wrote:
>> > > >> > Hey!
>> > > >> >
>> > > >> > CC-ing Elena.
>> > > >>
>> > > >> I think you forgot you cc.ed her..
>> > > >> Anyway, let's cc. her now... :-)
>> > > >>
>> > > >> >
>> > > >> >> We are measuring the execution time between native machine environment
>> > > >> >> and xen virtualization environment using PARSEC Benchmark [1].
>> > > >> >>
>> > > >> >> In virtualization environment, we run a domU with three VCPUs, each of
>> > > >> >> them pinned to a core; we pin the dom0 to another core that is not
>> > > >> >> used by the domU.
>> > > >> >>
>> > > >> >> Inside the Linux in domU in virtualization environment and in native
>> > > >> >> environment,  We used the cpuset to isolate a core (or VCPU) for the
>> > > >> >> system processors and to isolate a core for the benchmark processes.
>> > > >> >> We also configured the Linux boot command line with isolcpus= option to
>> > > >> >> isolate the core for benchmark from other unnecessary processes.
>> > > >> >
>> > > >> > You may want to just offline them and also boot the machine with NUMA
>> > > >> > disabled.
>> > > >>
>> > > >> Right, the machine is booted up with NUMA disabled.
>> > > >> We will offline the unnecessary cores then.
>> > > >>
>> > > >> >
>> > > >> >>
>> > > >> >> We expect that execution time of benchmarks in xen virtualization
>> > > >> >> environment is larger than the execution time in native machine
>> > > >> >> environment. However, the evaluation gave us an opposite result.
>> > > >> >>
>> > > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
>> > > >> >>
>> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 6.387s
>> > > >> >> Virtualization: 5.890s
>> > > >> >>
>> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 5.276s
>> > > >> >> Virtualization: 5.240s
>> > > >> >>
>> > > >> >> Is there anything wrong with our evaluation that lead to the abnormal
>> > > >> >> performance results?
>> > > >> >
>> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
>> > > >> >
>> > > >> > :-)
>> > > >> >
>> > > >> > No clue sadly.
>> > > >>
>> > > >> Ah-ha. This is really surprising to me.... Why will it speed up the
>> > > >> system by adding one more layer? Unless the virtualization disabled
>> > > >> some services that occur in native and interfere with the benchmark.
>> > > >>
>> > > >> If virtualization is faster than baremetal by nature, why we can see
>> > > >> that some experiment shows that virtualization introduces overhead?
>> > > >
>> > > > Elena told me that there were some weird regression in Linux 4.1 - where
>> > > > CPU burning workloads were _slower_ on baremetal than as guests.
>> > >
>> > > Hi Elena,
>> > > Would you mind sharing with us some of your experience of how you
>> > > found the real reason? Did you use some tool or some methodology to
>> > > pin down the reason (i.e,  CPU burning workloads in native is _slower_
>> > > on baremetal than as guests)?
>> > >
>> >
>> > Hi Meng
>> >
>> > Yes, sure!
>> >
>> > While working on performance tests for smt-exposing patches from Joao
>> > I run CPU bound workload in HVM guest and using same kernel in baremetal
>> > run same test.
>> > While testing cpu-bound workload on baremetal linux (4.1.0-rc2)
>> > I found that the time to complete the same test is few times more that
>> > as it takes for the same under HVM guest.
>> > I have tried tests where kernel threads pinned to cores and without pinning.
>> > The execution times are most of the times take as twice longer, sometimes 4
>> > times longer that HVM case.
>> >
>> > Interesting is not only that it takes sometimes 3-4 times more
>> > than HVM guest, but also that test with bound threads (to cores) takes almost
>> > 3 times longer
>> > to execute than running same cpu-bound test under HVM (in all
>> > configurations).
>>
>>
>> wow~ I didn't expect the native performance can be so "bad".... ;-)
>
> Yes, quite a surprise :)
>>
>> >
>> >
>> > I run each test 5 times and here are the execution times (seconds):
>> >
>> > -------------------------------------------------
>> >         baremetal           |
>> > thread_bind | thread unbind | HVM pinned to cores
>> > ----------- |---------------|---------------------
>> >      74     |     83        |        28
>> >      74     |     88        |        28
>> >      74     |     38        |        28
>> >      74     |     73        |        28
>> >      74     |     87        |        28
>> >
>> > Sometimes better times were on unbinded tests, but not often enough
>> > to present it here. Some results are much worse and reach up to 120
>> > seconds.
>> >
>> > Each test has 8 kernel threads. In baremetal case I tried the following:
>> > - numa off,on;
>> > - all cpus are on;
>> > - isolate cpus from first node;
>> > - set intel_idle.max_cstate=1;
>> > - disable intel_pstate;
>> >
>> > I dont think I have exhausted all the options here, but it looked like
>> > two last changes did improve performance, but was still not comparable to
>> > HVM case.
>> > I am trying to find where regression had happened. Performance on newer
>> > kernel (I tried 4.5.0-rc4+) was close or better than HVM.
>> >
>> > I am trying to find if there were some relevant regressions to understand
>> > the reason of this.
>>
>>
>> I see. If this is only happening for the SMT, it may be caused by the
>> SMT-related load balancing in Linux scheduler.
>> However, I have disabled the HT on my machine. Probably, that's also
>> the reason why I didn't see so much different in performance.
>
> I did enable tracing to see if maybe there is extensive migration:
> Test machine has two nodes, 8 cores each, 2 threads per core, total 32 logical cpus.
>
> Kernel threads are not bound, and here is the output for the life of one of the threads:
>
> cat ./t-komp_trace |grep t-kompressor|grep 18883
>
>     t-kompressor-18883 [028] d... 69458.596403: sched_switch: prev_comm=kthreadd prev_pid=18883 prev_prio=120 prev_state=D ==> next_comm=swapper/28 next_pid=0 next_prio=120
>           insmod-18875 [027] dN.. 69458.669180: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=28 dest_cpu=9
>           <idle>-0     [009] d... 69458.669205: sched_switch: prev_comm=swapper/9 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
>     t-kompressor-18883 [009] d... 69486.997626: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=R ==> next_comm=migration/9 next_pid=52 next_prio=0
>      migration/9-52    [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
>           <idle>-0     [025] d... 69486.997641: sched_switch: prev_comm=swapper/25 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
>     t-kompressor-18883 [025] d... 69486.997710: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=x ==> next_comm=swapper/25 next_pid=0 next_prio=120
>           insmod-18875 [011] .N.. 69503.119960: sched_kthread_stop: comm=t-kompressor pid=18883
>
>
> Threads are spawned from two cores, then some of the threads migrate to other cores.
> In the example above, the thread is spawned on cpu 27 and, when woken up, runs on cpu 009.
> Later it migrates to 025, which is the second thread of the same core (009).
> While I am not sure why this migration happens, it does not seem to contribute a lot.
> Anyway this picture repeats for some other threads (some stay where they were woken up):
>
>     t-kompressor-18880 [014] dNh. 69485.802729: sched_migrate_task: comm=hald pid=3820 prio=120 orig_cpu=14 dest_cpu=11
>     migration/13-72    [013] d... 69486.707459: sched_migrate_task: comm=t-kompressor pid=18878 prio=120 orig_cpu=13 dest_cpu=29
>     migration/14-77    [014] d... 69486.783818: sched_migrate_task: comm=t-kompressor pid=18880 prio=120 orig_cpu=14 dest_cpu=30
>      migration/8-47    [008] d... 69486.792667: sched_migrate_task: comm=t-kompressor pid=18882 prio=120 orig_cpu=8 dest_cpu=24
>     migration/15-82    [015] d... 69486.796429: sched_migrate_task: comm=t-kompressor pid=18881 prio=120 orig_cpu=15 dest_cpu=31
>     migration/10-57    [010] d... 69486.857848: sched_migrate_task: comm=t-kompressor pid=18884 prio=120 orig_cpu=10 dest_cpu=26
>      migration/9-52    [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
>     migration/28-147   [028] d... 69503.073577: sched_migrate_task: comm=t-kompressor pid=18876 prio=120 orig_cpu=28 dest_cpu=10
>
> All threads are running on their own cores and some migrate to the second smt-thread over time.
> I probably should have traced some other scheduling events, but I have not found any relevant ones yet.
>
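
The per-thread migration picture above can also be extracted from the trace
file mechanically. Below is a minimal sketch that assumes the
sched_migrate_task record layout shown in the quoted trace. The
migrations_by_pid helper is a hypothetical name, not an existing tool, and
the trace file path is taken from the command line.

```python
import re
from collections import defaultdict

# Matches the sched_migrate_task records in the format quoted above.
MIGRATE_RE = re.compile(
    r"sched_migrate_task: comm=(?P<comm>\S+) pid=(?P<pid>\d+) "
    r"prio=\d+ orig_cpu=(?P<orig>\d+) dest_cpu=(?P<dest>\d+)"
)


def migrations_by_pid(lines, comm="t-kompressor"):
    """Map pid -> list of (orig_cpu, dest_cpu) hops for tasks named comm."""
    moves = defaultdict(list)
    for line in lines:
        m = MIGRATE_RE.search(line)
        if m and m.group("comm") == comm:
            moves[int(m.group("pid"))].append(
                (int(m.group("orig")), int(m.group("dest")))
            )
    return dict(moves)


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as trace:  # e.g. ./t-komp_trace
        for pid, hops in sorted(migrations_by_pid(trace).items()):
            print(pid, hops)
```

Records for other tasks (such as the hald line in the trace) are skipped, so
the output lists only the hops of the benchmark threads.
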
>>
>> >
>> >
>> >
>> > What kernel you guys use?
>>
>>
>> I'm using a quite old kernel, 3.10.31. The reason why I'm using this
>> kernel is because I want to use the LITMUS^RT [1], which is a linux
>> testbed for real-time scheduling research. (It has a new version
>> though, and I can upgrade to the latest version to see if the
>> "problem" still occurs.)
>
> Yes, it will be interesting to see the outcome.
>
> What difference in numbers do you see?

Below is the evaluation data for the canneal and streamcluster
benchmarks, which are in the PARSEC benchmark:

Benchmark: canneal, input=simlarge, conf=gcc-serial
Native: 6.387s
Virtualization: 5.890s

Benchmark: streamcluster, input=simlarge, conf=gcc-serial
Native: 5.276s
Virtualization: 5.240s
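
To put the gap in relative terms, the numbers above can be turned into a
percentage. This is only arithmetic on the four measurements listed here;
the speedup_percent helper is a name made up for illustration.

```python
def speedup_percent(native_s, virt_s):
    """Percentage by which the VM run is shorter than the native run."""
    return (native_s - virt_s) / native_s * 100.0


# (native seconds, virtualization seconds), copied from the data above
results = {
    "canneal": (6.387, 5.890),
    "streamcluster": (5.276, 5.240),
}

for name, (native_s, virt_s) in results.items():
    print(f"{name}: VM is {speedup_percent(native_s, virt_s):.2f}% faster")
```

That works out to roughly 7.8% for canneal and 0.7% for streamcluster.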

> What machines are you seeing it on?

Below is the CPU info from /proc/cpuinfo (last processor entry):

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping : 9
microcode : 0x12
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl
vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb
xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase
smep erms
bogomips : 6784.70
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
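
A few of these fields are worth reading together when comparing native and
VM runs. Below is a minimal sketch, assuming the "key : value" layout shown
above; parse_cpuinfo and summarize are hypothetical helper names. It
compares the nominal frequency in the model name with the reported
"cpu MHz" and checks whether "siblings" exceeds "cpu cores".

```python
import re


def parse_cpuinfo(text):
    """Parse one processor stanza of /proc/cpuinfo into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip wrapped continuation lines that have no colon
            info[key.strip()] = value.strip()
    return info


def summarize(info):
    """Return (nominal_mhz, current_mhz, smt_visible) from parsed fields."""
    m = re.search(r"@\s*([\d.]+)GHz", info["model name"])
    nominal_mhz = float(m.group(1)) * 1000.0 if m else None
    current_mhz = float(info["cpu MHz"])
    # More siblings than cores usually means SMT threads are visible.
    smt_visible = int(info["siblings"]) > int(info["cpu cores"])
    return nominal_mhz, current_mhz, smt_visible


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        first_stanza = f.read().split("\n\n")[0]
    print(summarize(parse_cpuinfo(first_stanza)))
```

For the entry above this gives a nominal 3400 MHz against a reported
1600 MHz, with siblings (8) greater than cpu cores (4). The reported
frequency is only a snapshot taken when the file was read, so both points
may be worth re-checking in the exact configuration used for the benchmark
runs.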


> Is your workload purely cpu-bound?

Nope. The canneal and streamcluster benchmarks are cache-sensitive (or
memory-sensitive) tasks. The execution time of these two benchmarks
depends on how much cache and memory they can get.

Under the current kernel, IIRC, I didn't see the "abnormal performance
behavior" for cpu-bound tasks.

Thanks and Best Regards,

Meng



end of thread, other threads:[~2016-03-02 19:44 UTC | newest]

Thread overview: 13+ messages
2016-02-26  5:02 [Question] PARSEC benchmark has smaller execution time in VM than in native? Meng Xu
2016-02-29 16:06 ` Konrad Rzeszutek Wilk
2016-02-29 17:29   ` Meng Xu
2016-02-29 17:59     ` Konrad Rzeszutek Wilk
2016-03-01 13:48       ` Meng Xu
2016-03-01 18:20         ` Elena Ufimtseva
2016-03-01 19:52           ` Meng Xu
2016-03-01 20:39             ` Elena Ufimtseva
2016-03-01 21:51               ` Sander Eikelenboom
2016-03-01 22:06                 ` Elena Ufimtseva
2016-03-01 22:12                 ` Dario Faggioli
2016-03-02 19:44                 ` Meng Xu
2016-03-02 19:41               ` Meng Xu
