From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
To: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Hyon-Young Choi <commani@gmail.com>,
	Meng Xu <xumengpanda@gmail.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
Date: Tue, 1 Mar 2016 17:06:13 -0500	[thread overview]
Message-ID: <20160301220613.GA13399@elena.ufimtseva> (raw)
In-Reply-To: <1328562453.20160301225130@eikelenboom.it>

On Tue, Mar 01, 2016 at 10:51:30PM +0100, Sander Eikelenboom wrote:
> 
> Tuesday, March 1, 2016, 9:39:25 PM, you wrote:
> 
> > On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
> >> Hi Elena,
> >> 
> >> Thank you very much for sharing this! :-)
> >> 
> >> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
> >> <elena.ufimtseva@oracle.com> wrote:
> >> >
> >> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
> >> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
> >> > > <konrad.wilk@oracle.com> wrote:
> >> > > >> > Hey!
> >> > > >> >
> >> > > >> > CC-ing Elena.
> >> > > >>
> >> > > >> I think you forgot to cc her..
> >> > > >> Anyway, let's cc her now... :-)
> >> > > >>
> >> > > >> >
> >> > > >> >> We are measuring the execution time between native machine environment
> >> > > >> >> and xen virtualization environment using PARSEC Benchmark [1].
> >> > > >> >>
> >> > > >> >> In the virtualization environment, we run a domU with three VCPUs, each of
> >> > > >> >> them pinned to a core; we pin dom0 to another core that is not
> >> > > >> >> used by the domU.
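
The pinning described above can be sketched with xl roughly like this; the domain name "domu1" and the core numbering are assumptions, and the script only prints the commands so the plan can be reviewed before running as root:

```shell
#!/bin/sh
# Sketch: pin a 3-vCPU domU to pCPUs 1-3 and keep dom0 on pCPU 0.
# "domu1" and the core numbers are assumptions; adjust to your setup.
CMDS=""
for v in 0 1 2; do
  # vCPU v of the guest -> physical CPU v+1
  CMDS="${CMDS}xl vcpu-pin domu1 $v $((v + 1))
"
done
CMDS="${CMDS}xl vcpu-pin Domain-0 all 0
"
printf '%s' "$CMDS"   # review the plan; pipe to sh to apply for real
```
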
> >> > > >> >>
> >> > > >> >> Inside Linux, both in the domU in the virtualization environment and in
> >> > > >> >> the native environment, we used cpusets to isolate a core (or VCPU) for
> >> > > >> >> the system processes and another core for the benchmark processes.
> >> > > >> >> We also added the isolcpus= option to the Linux boot command line to
> >> > > >> >> shield the benchmark core from other unnecessary processes.
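
A minimal sketch of that isolation recipe; the reserved core number and the cpuset mount point are assumptions, and the script only prints the steps rather than performing the privileged writes:

```shell
#!/bin/sh
# Sketch: reserve core 3 for the benchmark via isolcpus= plus a cpuset.
CORE=3                      # assumed benchmark core
PLAN=""
add() { PLAN="${PLAN}$1
"; }
add "boot: append isolcpus=$CORE to the kernel command line"
add "mkdir -p /sys/fs/cgroup/cpuset/bench"
add "echo $CORE > /sys/fs/cgroup/cpuset/bench/cpuset.cpus"
add "echo 0 > /sys/fs/cgroup/cpuset/bench/cpuset.mems"
add "echo \$BENCH_PID > /sys/fs/cgroup/cpuset/bench/tasks"
printf '%s' "$PLAN"         # review, then run the steps as root
```
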
> >> > > >> >
> >> > > >> > You may want to just offline them and also boot the machine with NUMA
> >> > > >> > disabled.
> >> > > >>
> >> > > >> Right, the machine is booted up with NUMA disabled.
> >> > > >> We will offline the unnecessary cores then.
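
Offlining the spare cores is one sysfs write per core; a sketch, with the core list an assumption (the outer echo makes it a dry run):

```shell
#!/bin/sh
# Sketch: take cores 4-7 offline (the core list is an assumption).
OFFLINED=""
for c in 4 5 6 7; do
  OFFLINED="$OFFLINED cpu$c"
  echo "echo 0 > /sys/devices/system/cpu/cpu$c/online"  # drop the outer echo to apply
done
```
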
> >> > > >>
> >> > > >> >
> >> > > >> >>
> >> > > >> >> We expect that execution time of benchmarks in xen virtualization
> >> > > >> >> environment is larger than the execution time in native machine
> >> > > >> >> environment. However, the evaluation gave us an opposite result.
> >> > > >> >>
> >> > > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
> >> > > >> >>
> >> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
> >> > > >> >> Native: 6.387s
> >> > > >> >> Virtualization: 5.890s
> >> > > >> >>
> >> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
> >> > > >> >> Native: 5.276s
> >> > > >> >> Virtualization: 5.240s
> >> > > >> >>
> >> > > >> >> Is there anything wrong with our evaluation that leads to these abnormal
> >> > > >> >> performance results?
> >> > > >> >
> >> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
> >> > > >> >
> >> > > >> > :-)
> >> > > >> >
> >> > > >> > No clue sadly.
> >> > > >>
> >> > > >> Ah-ha. This is really surprising to me.... Why would adding one more
> >> > > >> layer speed the system up? Unless virtualization disables some services
> >> > > >> that run in native and interfere with the benchmark.
> >> > > >>
> >> > > >> If virtualization is naturally faster than baremetal, why do some
> >> > > >> experiments show that virtualization introduces overhead?
> >> > > >
> >> > > > Elena told me that there was a weird regression in Linux 4.1 - where
> >> > > > CPU-burning workloads were _slower_ on baremetal than as guests.
> >> > >
> >> > > Hi Elena,
> >> > > Would you mind sharing some of your experience of how you
> >> > > found the real reason? Did you use a particular tool or methodology to
> >> > > pin down the cause (i.e., that CPU-burning workloads are _slower_
> >> > > on baremetal than as guests)?
> >> > >
> >> >
> >> > Hi Meng
> >> >
> >> > Yes, sure!
> >> >
> >> > While working on performance tests for the SMT-exposing patches from Joao,
> >> > I ran a CPU-bound workload in an HVM guest and, using the same kernel,
> >> > ran the same test on baremetal.
> >> > While testing the CPU-bound workload on baremetal Linux (4.1.0-rc2),
> >> > I found that completing the same test takes a few times longer than
> >> > it does in the HVM guest.
> >> > I tried tests with the kernel threads pinned to cores and without pinning.
> >> > The execution times are usually about twice as long, and sometimes 4
> >> > times longer, than in the HVM case.
> >> >
> >> > What is interesting is not only that it sometimes takes 3-4 times longer
> >> > than in the HVM guest, but also that the test with threads bound to cores
> >> > takes almost 3 times longer to execute than the same CPU-bound test
> >> > under HVM (in all configurations).
> >> 
> >> 
> >> wow~ I didn't expect the native performance could be so "bad".... ;-)
> 
> > Yes, quite a surprise :)
> >> 
> >> >
> >> >
> >> > I ran each test 5 times and here are the execution times (seconds):
> >> >
> >> > -------------------------------------------------
> >> >         baremetal           |
> >> > thread_bind | thread unbind | HVM pinned to cores
> >> > ----------- |---------------|---------------------
> >> >      74     |     83        |        28
> >> >      74     |     88        |        28
> >> >      74     |     38        |        28
> >> >      74     |     73        |        28
> >> >      74     |     87        |        28
> >> >
> >> > Sometimes the unbound tests had better times, but not often enough
> >> > to present them here. Some results are much worse and reach up to 120
> >> > seconds.
> >> >
> >> > Each test has 8 kernel threads. In the baremetal case I tried the following:
> >> > - numa off, on;
> >> > - all cpus on;
> >> > - isolating the cpus of the first node;
> >> > - setting intel_idle.max_cstate=1;
> >> > - disabling intel_pstate.
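
For reference, the last two items correspond to boot parameters along these lines (a config sketch only; exact spellings as in the kernel's boot-parameter documentation):

```shell
# Kernel command line fragment (not executed here):
#   intel_idle.max_cstate=1   # cap C-states at C1
#   intel_pstate=disable      # fall back to the acpi-cpufreq driver
#   numa=off                  # for the runs with NUMA disabled
```
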
> >> >
> >> > I don't think I have exhausted all the options here, but it looked like
> >> > the last two changes did improve performance, though still not to a level
> >> > comparable with the HVM case.
> >> > I am trying to find where the regression happened. Performance on a newer
> >> > kernel (I tried 4.5.0-rc4+) was close to or better than HVM.
> 
> Just a perhaps silly thought .. but could there be something in the 
> time-measuring that could differ and explain the slightly surprising results ?
> --
> Sander 


Thanks Sander, I also thought about this. I would consider it further if the
execution times were not so obviously bigger; it's 28 seconds vs. up to a few
minutes, a difference noticeable without even registering the time.

Also, there is the expected difference in a ping-pong test, where baremetal
slightly outperforms HVM on the same kernels.

> 
> >> > I am trying to find out if there were some relevant regressions, to
> >> > understand the reason for this.
> >> 
> >> 
> >> I see. If this only happens with SMT, it may be caused by the
> >> SMT-related load balancing in the Linux scheduler.
> >> However, I have disabled HT on my machine. That's probably also
> >> the reason why I didn't see so much difference in performance.
> 
> > I enabled tracing to see whether there is extensive migration:
> > The test machine has two nodes, 8 cores each, 2 threads per core, 32 logical cpus in total.
> 
> > Kernel threads are not bound, and here is the output for the life of one of the threads:
> 
> > cat ./t-komp_trace |grep t-kompressor|grep 18883
> 
> >     t-kompressor-18883 [028] d... 69458.596403: sched_switch: prev_comm=kthreadd prev_pid=18883 prev_prio=120 prev_state=D ==> next_comm=swapper/28 next_pid=0 next_prio=120
> >           insmod-18875 [027] dN.. 69458.669180: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=28 dest_cpu=9
> >           <idle>-0     [009] d... 69458.669205: sched_switch: prev_comm=swapper/9 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
> >     t-kompressor-18883 [009] d... 69486.997626: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=R ==> next_comm=migration/9 next_pid=52 next_prio=0
> >      migration/9-52    [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
> >           <idle>-0     [025] d... 69486.997641: sched_switch: prev_comm=swapper/25 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
> >     t-kompressor-18883 [025] d... 69486.997710: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=x ==> next_comm=swapper/25 next_pid=0 next_prio=120
> >           insmod-18875 [011] .N.. 69503.119960: sched_kthread_stop: comm=t-kompressor pid=18883
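
For anyone reproducing this: the two tracepoints above can be enabled through tracefs roughly like so (the mount point is an assumption, commonly /sys/kernel/debug/tracing; printed as a plan rather than run as root):

```shell
#!/bin/sh
# Sketch: enable the sched tracepoints used in the trace above.
T=/sys/kernel/debug/tracing           # tracefs mount point (assumed)
CMDS="echo 1 > $T/events/sched/sched_switch/enable
echo 1 > $T/events/sched/sched_migrate_task/enable
cat $T/trace_pipe > ./t-komp_trace"
printf '%s\n' "$CMDS"                 # review, then run the lines as root
```
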
> 
> 
> > Threads are spawned from two cores, then some of them migrate to other cores.
> > In the example above the thread is spawned on cpu 27 and, when woken up, runs on cpu 009.
> > Later it migrates to 025, which is the second SMT thread of the same core (009).
> > While I am not sure why this migration happens, it does not seem to contribute much.
> > Anyway this picture repeats for some other threads (some stay where they were woken up):
> 
> >     t-kompressor-18880 [014] dNh. 69485.802729: sched_migrate_task: comm=hald pid=3820 prio=120 orig_cpu=14 dest_cpu=11
> >     migration/13-72    [013] d... 69486.707459: sched_migrate_task: comm=t-kompressor pid=18878 prio=120 orig_cpu=13 dest_cpu=29
> >     migration/14-77    [014] d... 69486.783818: sched_migrate_task: comm=t-kompressor pid=18880 prio=120 orig_cpu=14 dest_cpu=30
> >      migration/8-47    [008] d... 69486.792667: sched_migrate_task: comm=t-kompressor pid=18882 prio=120 orig_cpu=8 dest_cpu=24
> >     migration/15-82    [015] d... 69486.796429: sched_migrate_task: comm=t-kompressor pid=18881 prio=120 orig_cpu=15 dest_cpu=31
> >     migration/10-57    [010] d... 69486.857848: sched_migrate_task: comm=t-kompressor pid=18884 prio=120 orig_cpu=10 dest_cpu=26
> >      migration/9-52    [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
> >     migration/28-147   [028] d... 69503.073577: sched_migrate_task: comm=t-kompressor pid=18876 prio=120 orig_cpu=28 dest_cpu=10
> 
> > All threads run on their own cores and some migrate to the second SMT thread over time.
> > I probably should have traced some other scheduling events, but I have not found any other relevant ones yet.
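
The sibling relationship claimed above (cpu 25 being the second SMT thread of core 9) can be double-checked from the standard sysfs topology files; a sketch, printed as a dry run since the value depends on the test box:

```shell
#!/bin/sh
# Sketch: print the sysfs path listing cpu9's SMT siblings; on the
# test machine above it should read "9,25".
CPU=9
SIBLINGS=/sys/devices/system/cpu/cpu$CPU/topology/thread_siblings_list
echo "cat $SIBLINGS"   # dry run; execute the printed command on the test box
```
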
> 
> >> 
> >> >
> >> >
> >> >
> >> > What kernel do you guys use?
> >> 
> >> 
> >> I'm using a quite old kernel, 3.10.31. The reason I'm using this kernel
> >> is that I want to use LITMUS^RT [1], a Linux testbed for real-time
> >> scheduling research. (It has a newer version though, and I can upgrade
> >> to the latest version to see if the "problem" still occurs.)
> 
> > Yes, it will be interesting to see the outcome.
> 
> > What difference in numbers do you see?
> > What machines are you seeing it on?
> > Is your workload purely cpu-bound?
> 
> 
> > Thanks!
> 
> >> 
> >> Thanks and Best Regards,
> >> 
> >> Meng
> 
> 
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Thread overview: 13+ messages
2016-02-26  5:02 [Question] PARSEC benchmark has smaller execution time in VM than in native? Meng Xu
2016-02-29 16:06 ` Konrad Rzeszutek Wilk
2016-02-29 17:29   ` Meng Xu
2016-02-29 17:59     ` Konrad Rzeszutek Wilk
2016-03-01 13:48       ` Meng Xu
2016-03-01 18:20         ` Elena Ufimtseva
2016-03-01 19:52           ` Meng Xu
2016-03-01 20:39             ` Elena Ufimtseva
2016-03-01 21:51               ` Sander Eikelenboom
2016-03-01 22:06                 ` Elena Ufimtseva [this message]
2016-03-01 22:12                 ` Dario Faggioli
2016-03-02 19:44                 ` Meng Xu
2016-03-02 19:41               ` Meng Xu
