* [Question] PARSEC benchmark has smaller execution time in VM than in native?
@ 2016-02-26  5:02 Meng Xu
  2016-02-29 16:06 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 13+ messages in thread
From: Meng Xu @ 2016-02-26  5:02 UTC
To: xen-devel@lists.xen.org; +Cc: Hyon-Young Choi

Hi,

We are comparing execution times between a native machine environment
and a Xen virtualization environment using the PARSEC benchmark [1].

In the virtualization environment, we run a domU with three VCPUs, each of
them pinned to a core; we pin dom0 to another core that is not
used by the domU.

Inside Linux, both in the domU and in the native environment, we used
cpuset to isolate one core (or VCPU) for the system processes and
another core for the benchmark processes.
We also configured the Linux boot command line with the isolcpus= option to
isolate the benchmark core from other unnecessary processes.

We expected the execution time of the benchmarks in the Xen virtualization
environment to be larger than the execution time in the native machine
environment. However, the evaluation gave us the opposite result.

Below is the evaluation data for the canneal and streamcluster benchmarks:

Benchmark: canneal, input=simlarge, conf=gcc-serial
Native: 6.387s
Virtualization: 5.890s

Benchmark: streamcluster, input=simlarge, conf=gcc-serial
Native: 5.276s
Virtualization: 5.240s

Is there anything wrong with our evaluation that leads to the abnormal
performance results?

Any suggestion or advice is really appreciated.

Thank you very much for your time on this question!

Best regards,

Meng

[1] http://parsec.cs.princeton.edu/

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native? 2016-02-26 5:02 [Question] PARSEC benchmark has smaller execution time in VM than in native? Meng Xu @ 2016-02-29 16:06 ` Konrad Rzeszutek Wilk 2016-02-29 17:29 ` Meng Xu 0 siblings, 1 reply; 13+ messages in thread From: Konrad Rzeszutek Wilk @ 2016-02-29 16:06 UTC (permalink / raw) To: Meng Xu; +Cc: Hyon-Young Choi, xen-devel@lists.xen.org On Fri, Feb 26, 2016 at 12:02:50AM -0500, Meng Xu wrote: > Hi, > Hey! CC-ing Elena. > We are measuring the execution time between native machine environment > and xen virtualization environment using PARSEC Benchmark [1]. > > In virtualiztion environment, we run a domU with three VCPUs, each of > them pinned to a core; we pin the dom0 to another core that is not > used by the domU. > > Inside the Linux in domU in virtualization environment and in native > environment, We used the cpuset to isolate a core (or VCPU) for the > system processors and to isolate a core for the benchmark processes. > We also configured the Linux boot command line with isocpus= option to > isolate the core for benchmark from other unnecessary processes. You may want to just offline them and also boot the machine with NUMA disabled. > > We expect that execution time of benchmarks in xen virtualization > environment is larger than the execution time in native machine > environment. However, the evaluation gave us an opposite result. > > Below is the evaluation data for the canneal and streamcluster benchmarks: > > Benchmark: canneal, input=simlarge, conf=gcc-serial > Native: 6.387s > Virtualization: 5.890s > > Benchmark: streamcluster, input=simlarge, conf=gcc-serial > Native: 5.276s > Virtualization: 5.240s > > Is there anything wrong with our evaluation that lead to the abnormal > performance results? Nothing is wrong. Virtualization is naturally faster than baremetal! :-) No clue sadly. > > Any suggestion or advice is really appreciated. 
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native? 2016-02-29 16:06 ` Konrad Rzeszutek Wilk @ 2016-02-29 17:29 ` Meng Xu 2016-02-29 17:59 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 13+ messages in thread From: Meng Xu @ 2016-02-29 17:29 UTC (permalink / raw) To: Konrad Rzeszutek Wilk, elena.ufimtseva@oracle.com Cc: Hyon-Young Choi, xen-devel@lists.xen.org On Mon, Feb 29, 2016 at 11:06 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: > On Fri, Feb 26, 2016 at 12:02:50AM -0500, Meng Xu wrote: >> Hi, >> > > Hey! > > CC-ing Elena. I think you forgot you cc.ed her.. Anyway, let's cc. her now... :-) > >> We are measuring the execution time between native machine environment >> and xen virtualization environment using PARSEC Benchmark [1]. >> >> In virtualiztion environment, we run a domU with three VCPUs, each of >> them pinned to a core; we pin the dom0 to another core that is not >> used by the domU. >> >> Inside the Linux in domU in virtualization environment and in native >> environment, We used the cpuset to isolate a core (or VCPU) for the >> system processors and to isolate a core for the benchmark processes. >> We also configured the Linux boot command line with isocpus= option to >> isolate the core for benchmark from other unnecessary processes. > > You may want to just offline them and also boot the machine with NUMA > disabled. Right, the machine is booted up with NUMA disabled. We will offline the unnecessary cores then. > >> >> We expect that execution time of benchmarks in xen virtualization >> environment is larger than the execution time in native machine >> environment. However, the evaluation gave us an opposite result. 
>> >> Below is the evaluation data for the canneal and streamcluster benchmarks: >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial >> Native: 6.387s >> Virtualization: 5.890s >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial >> Native: 5.276s >> Virtualization: 5.240s >> >> Is there anything wrong with our evaluation that lead to the abnormal >> performance results? > > Nothing is wrong. Virtualization is naturally faster than baremetal! > > :-) > > No clue sadly. Ah-ha. This is really surprising to me.... Why will it speed up the system by adding one more layer? Unless the virtualization disabled some services that occur in native and interfere with the benchmark. If virtualization is faster than baremetal by nature, why we can see that some experiment shows that virtualization introduces overhead? For example, VMware did some evaluation at [1]. Fig. 3 on page 9 shows that the virtualization (both vmware ESX301 and xen) introduces overhead and the benchmark is slower in virtualization than in native. [1] https://www.vmware.com/pdf/hypervisor_performance.pdf It seems to me that the performance data may be tweaked (kind of cooked up) to some extent when people are comparing different hypervisors. we just need to configure the system in a specific way to favor one type of hypervisor than the other. Meng _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native? 2016-02-29 17:29 ` Meng Xu @ 2016-02-29 17:59 ` Konrad Rzeszutek Wilk 2016-03-01 13:48 ` Meng Xu 0 siblings, 1 reply; 13+ messages in thread From: Konrad Rzeszutek Wilk @ 2016-02-29 17:59 UTC (permalink / raw) To: Meng Xu Cc: elena.ufimtseva@oracle.com, Hyon-Young Choi, xen-devel@lists.xen.org > > Hey! > > > > CC-ing Elena. > > I think you forgot you cc.ed her.. > Anyway, let's cc. her now... :-) > > > > >> We are measuring the execution time between native machine environment > >> and xen virtualization environment using PARSEC Benchmark [1]. > >> > >> In virtualiztion environment, we run a domU with three VCPUs, each of > >> them pinned to a core; we pin the dom0 to another core that is not > >> used by the domU. > >> > >> Inside the Linux in domU in virtualization environment and in native > >> environment, We used the cpuset to isolate a core (or VCPU) for the > >> system processors and to isolate a core for the benchmark processes. > >> We also configured the Linux boot command line with isocpus= option to > >> isolate the core for benchmark from other unnecessary processes. > > > > You may want to just offline them and also boot the machine with NUMA > > disabled. > > Right, the machine is booted up with NUMA disabled. > We will offline the unnecessary cores then. > > > > >> > >> We expect that execution time of benchmarks in xen virtualization > >> environment is larger than the execution time in native machine > >> environment. However, the evaluation gave us an opposite result. 
> >> > >> Below is the evaluation data for the canneal and streamcluster benchmarks: > >> > >> Benchmark: canneal, input=simlarge, conf=gcc-serial > >> Native: 6.387s > >> Virtualization: 5.890s > >> > >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial > >> Native: 5.276s > >> Virtualization: 5.240s > >> > >> Is there anything wrong with our evaluation that lead to the abnormal > >> performance results? > > > > Nothing is wrong. Virtualization is naturally faster than baremetal! > > > > :-) > > > > No clue sadly. > > Ah-ha. This is really surprising to me.... Why will it speed up the > system by adding one more layer? Unless the virtualization disabled > some services that occur in native and interfere with the benchmark. > > If virtualization is faster than baremetal by nature, why we can see > that some experiment shows that virtualization introduces overhead? Elena told me that there were some weird regression in Linux 4.1 - where CPU burning workloads were _slower_ on baremetal than as guests. Updating to a later kernel fixed that -where one could see that baremetal was faster (or on par) with the guest. > > For example, VMware did some evaluation at [1]. Fig. 3 on page 9 shows > that the virtualization (both vmware ESX301 and xen) introduces > overhead and the benchmark is slower in virtualization than in native. > > [1] https://www.vmware.com/pdf/hypervisor_performance.pdf > > It seems to me that the performance data may be tweaked (kind of > cooked up) to some extent when people are comparing different > hypervisors. we just need to configure the system in a specific way to > favor one type of hypervisor than the other. > > Meng _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native? 2016-02-29 17:59 ` Konrad Rzeszutek Wilk @ 2016-03-01 13:48 ` Meng Xu 2016-03-01 18:20 ` Elena Ufimtseva 0 siblings, 1 reply; 13+ messages in thread From: Meng Xu @ 2016-03-01 13:48 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: elena.ufimtseva@oracle.com, Hyon-Young Choi, xen-devel@lists.xen.org On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: >> > Hey! >> > >> > CC-ing Elena. >> >> I think you forgot you cc.ed her.. >> Anyway, let's cc. her now... :-) >> >> > >> >> We are measuring the execution time between native machine environment >> >> and xen virtualization environment using PARSEC Benchmark [1]. >> >> >> >> In virtualiztion environment, we run a domU with three VCPUs, each of >> >> them pinned to a core; we pin the dom0 to another core that is not >> >> used by the domU. >> >> >> >> Inside the Linux in domU in virtualization environment and in native >> >> environment, We used the cpuset to isolate a core (or VCPU) for the >> >> system processors and to isolate a core for the benchmark processes. >> >> We also configured the Linux boot command line with isocpus= option to >> >> isolate the core for benchmark from other unnecessary processes. >> > >> > You may want to just offline them and also boot the machine with NUMA >> > disabled. >> >> Right, the machine is booted up with NUMA disabled. >> We will offline the unnecessary cores then. >> >> > >> >> >> >> We expect that execution time of benchmarks in xen virtualization >> >> environment is larger than the execution time in native machine >> >> environment. However, the evaluation gave us an opposite result. 
>> >> >> >> Below is the evaluation data for the canneal and streamcluster benchmarks: >> >> >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial >> >> Native: 6.387s >> >> Virtualization: 5.890s >> >> >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial >> >> Native: 5.276s >> >> Virtualization: 5.240s >> >> >> >> Is there anything wrong with our evaluation that lead to the abnormal >> >> performance results? >> > >> > Nothing is wrong. Virtualization is naturally faster than baremetal! >> > >> > :-) >> > >> > No clue sadly. >> >> Ah-ha. This is really surprising to me.... Why will it speed up the >> system by adding one more layer? Unless the virtualization disabled >> some services that occur in native and interfere with the benchmark. >> >> If virtualization is faster than baremetal by nature, why we can see >> that some experiment shows that virtualization introduces overhead? > > Elena told me that there were some weird regression in Linux 4.1 - where > CPU burning workloads were _slower_ on baremetal than as guests. Hi Elena, Would you mind sharing with us some of your experience of how you found the real reason? Did you use some tool or some methodology to pin down the reason (i.e, CPU burning workloads in native is _slower_ on baremetal than as guests)? > > Updating to a later kernel fixed that -where one could see that > baremetal was faster (or on par) with the guest. Thank you very much, Konrad! We are giving it a shot. :-D Best Regards, Meng _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native? 2016-03-01 13:48 ` Meng Xu @ 2016-03-01 18:20 ` Elena Ufimtseva 2016-03-01 19:52 ` Meng Xu 0 siblings, 1 reply; 13+ messages in thread From: Elena Ufimtseva @ 2016-03-01 18:20 UTC (permalink / raw) To: Meng Xu; +Cc: xen-devel@lists.xen.org, Hyon-Young Choi On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote: > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk > <konrad.wilk@oracle.com> wrote: > >> > Hey! > >> > > >> > CC-ing Elena. > >> > >> I think you forgot you cc.ed her.. > >> Anyway, let's cc. her now... :-) > >> > >> > > >> >> We are measuring the execution time between native machine environment > >> >> and xen virtualization environment using PARSEC Benchmark [1]. > >> >> > >> >> In virtualiztion environment, we run a domU with three VCPUs, each of > >> >> them pinned to a core; we pin the dom0 to another core that is not > >> >> used by the domU. > >> >> > >> >> Inside the Linux in domU in virtualization environment and in native > >> >> environment, We used the cpuset to isolate a core (or VCPU) for the > >> >> system processors and to isolate a core for the benchmark processes. > >> >> We also configured the Linux boot command line with isocpus= option to > >> >> isolate the core for benchmark from other unnecessary processes. > >> > > >> > You may want to just offline them and also boot the machine with NUMA > >> > disabled. > >> > >> Right, the machine is booted up with NUMA disabled. > >> We will offline the unnecessary cores then. > >> > >> > > >> >> > >> >> We expect that execution time of benchmarks in xen virtualization > >> >> environment is larger than the execution time in native machine > >> >> environment. However, the evaluation gave us an opposite result. 
> >> >> > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks: > >> >> > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial > >> >> Native: 6.387s > >> >> Virtualization: 5.890s > >> >> > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial > >> >> Native: 5.276s > >> >> Virtualization: 5.240s > >> >> > >> >> Is there anything wrong with our evaluation that lead to the abnormal > >> >> performance results? > >> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal! > >> > > >> > :-) > >> > > >> > No clue sadly. > >> > >> Ah-ha. This is really surprising to me.... Why will it speed up the > >> system by adding one more layer? Unless the virtualization disabled > >> some services that occur in native and interfere with the benchmark. > >> > >> If virtualization is faster than baremetal by nature, why we can see > >> that some experiment shows that virtualization introduces overhead? > > > > Elena told me that there were some weird regression in Linux 4.1 - where > > CPU burning workloads were _slower_ on baremetal than as guests. > > Hi Elena, > Would you mind sharing with us some of your experience of how you > found the real reason? Did you use some tool or some methodology to > pin down the reason (i.e, CPU burning workloads in native is _slower_ > on baremetal than as guests)? > Hi Meng Yes, sure! While working on performance tests for smt-exposing patches from Joao I run CPU bound workload in HVM guest and using same kernel in baremetal run same test. While testing cpu-bound workload on baremetal linux (4.1.0-rc2) I found that the time to complete the same test is few times more that as it takes for the same under HVM guest. I have tried tests where kernel threads pinned to cores and without pinning. The execution times are most of the times take as twice longer, sometimes 4 times longer that HVM case. 
Interestingly, not only does it sometimes take 3-4 times longer than in the
HVM guest, but the test with threads bound to cores also takes almost 3 times
longer to execute than the same cpu-bound test under HVM (in all
configurations).

I ran each test 5 times; here are the execution times (seconds):

-----------------------------------------------------
         baremetal           |
 thread_bind | thread unbind | HVM pinned to cores
-------------|---------------|---------------------
     74      |      83       |         28
     74      |      88       |         28
     74      |      38       |         28
     74      |      73       |         28
     74      |      87       |         28

Sometimes the unbound tests gave better times, but not often enough
to present them here. Some results are much worse and reach up to 120
seconds.

Each test has 8 kernel threads. In the baremetal case I tried the following:
- numa off, on;
- all cpus are on;
- isolate cpus from first node;
- set intel_idle.max_cstate=1;
- disable intel_pstate;

I don't think I have exhausted all the options here, but it looked like
the last two changes did improve performance, though it was still not
comparable to the HVM case.
I am trying to find where the regression happened. Performance on a newer
kernel (I tried 4.5.0-rc4+) was close to or better than HVM.

I am trying to find out if there were some relevant regressions, to understand
the reason for this.

What kernel do you guys use?

Elena

See more description of the tests here:
http://lists.xenproject.org/archives/html/xen-devel/2016-01/msg02874.html
Joao's patches are here:
http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg03115.html

> >
> > Updating to a later kernel fixed that -where one could see that
> > baremetal was faster (or on par) with the guest.
>
> Thank you very much, Konrad! We are giving it a shot. :-D
>
> Best Regards,
>
> Meng
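To put the "almost 3 times longer" figure next to the table, here is a minimal Python sketch that averages the five runs per configuration and divides by the HVM average. It is only an illustration and not part of the original test setup; the dictionary keys and the helper name `slowdown_vs` are names chosen here, and the numbers are copied from the table above.

```python
from statistics import mean

# Execution times in seconds, copied from the table in the message above.
times = {
    "baremetal_bound": [74, 74, 74, 74, 74],
    "baremetal_unbound": [83, 88, 38, 73, 87],
    "hvm_pinned": [28, 28, 28, 28, 28],
}


def slowdown_vs(baseline, runs):
    """Ratio of mean run time to mean baseline time (>1 means slower)."""
    return mean(runs) / mean(baseline)


# Compare each baremetal configuration against the pinned HVM guest.
ratios = {
    name: slowdown_vs(times["hvm_pinned"], runs)
    for name, runs in times.items()
    if name != "hvm_pinned"
}

for name, ratio in ratios.items():
    print(f"{name}: mean {mean(times[name]):.1f}s, {ratio:.2f}x the HVM time")
```

Both baremetal configurations average out to roughly the same slowdown in this sample; the unbound runs simply vary far more from run to run (38 to 88 seconds), so a mean alone hides part of the picture.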
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native? 2016-03-01 18:20 ` Elena Ufimtseva @ 2016-03-01 19:52 ` Meng Xu 2016-03-01 20:39 ` Elena Ufimtseva 0 siblings, 1 reply; 13+ messages in thread From: Meng Xu @ 2016-03-01 19:52 UTC (permalink / raw) To: Elena Ufimtseva; +Cc: xen-devel@lists.xen.org, Hyon-Young Choi Hi Elena, Thank you very much for sharing this! :-) On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva <elena.ufimtseva@oracle.com> wrote: > > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote: > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk > > <konrad.wilk@oracle.com> wrote: > > >> > Hey! > > >> > > > >> > CC-ing Elena. > > >> > > >> I think you forgot you cc.ed her.. > > >> Anyway, let's cc. her now... :-) > > >> > > >> > > > >> >> We are measuring the execution time between native machine environment > > >> >> and xen virtualization environment using PARSEC Benchmark [1]. > > >> >> > > >> >> In virtualiztion environment, we run a domU with three VCPUs, each of > > >> >> them pinned to a core; we pin the dom0 to another core that is not > > >> >> used by the domU. > > >> >> > > >> >> Inside the Linux in domU in virtualization environment and in native > > >> >> environment, We used the cpuset to isolate a core (or VCPU) for the > > >> >> system processors and to isolate a core for the benchmark processes. > > >> >> We also configured the Linux boot command line with isocpus= option to > > >> >> isolate the core for benchmark from other unnecessary processes. > > >> > > > >> > You may want to just offline them and also boot the machine with NUMA > > >> > disabled. > > >> > > >> Right, the machine is booted up with NUMA disabled. > > >> We will offline the unnecessary cores then. > > >> > > >> > > > >> >> > > >> >> We expect that execution time of benchmarks in xen virtualization > > >> >> environment is larger than the execution time in native machine > > >> >> environment. 
However, the evaluation gave us an opposite result. > > >> >> > > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks: > > >> >> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial > > >> >> Native: 6.387s > > >> >> Virtualization: 5.890s > > >> >> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial > > >> >> Native: 5.276s > > >> >> Virtualization: 5.240s > > >> >> > > >> >> Is there anything wrong with our evaluation that lead to the abnormal > > >> >> performance results? > > >> > > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal! > > >> > > > >> > :-) > > >> > > > >> > No clue sadly. > > >> > > >> Ah-ha. This is really surprising to me.... Why will it speed up the > > >> system by adding one more layer? Unless the virtualization disabled > > >> some services that occur in native and interfere with the benchmark. > > >> > > >> If virtualization is faster than baremetal by nature, why we can see > > >> that some experiment shows that virtualization introduces overhead? > > > > > > Elena told me that there were some weird regression in Linux 4.1 - where > > > CPU burning workloads were _slower_ on baremetal than as guests. > > > > Hi Elena, > > Would you mind sharing with us some of your experience of how you > > found the real reason? Did you use some tool or some methodology to > > pin down the reason (i.e, CPU burning workloads in native is _slower_ > > on baremetal than as guests)? > > > > Hi Meng > > Yes, sure! > > While working on performance tests for smt-exposing patches from Joao > I run CPU bound workload in HVM guest and using same kernel in baremetal > run same test. > While testing cpu-bound workload on baremetal linux (4.1.0-rc2) > I found that the time to complete the same test is few times more that > as it takes for the same under HVM guest. > I have tried tests where kernel threads pinned to cores and without pinning. 
> The execution times are most of the times take as twice longer, sometimes 4 > times longer that HVM case. > > Interesting is not only that it takes sometimes 3-4 times more > than HVM guest, but also that test with bound threads (to cores) takes almost > 3 times longer > to execute than running same cpu-bound test under HVM (in all > configurations). wow~ I didn't expect the native performance can be so "bad".... ;-) > > > I run each test 5 times and here are the execution times (seconds): > > ------------------------------------------------- > baremetal | > thread_bind | thread unbind | HVM pinned to cores > ----------- |---------------|--------------------- > 74 | 83 | 28 > 74 | 88 | 28 > 74 | 38 | 28 > 74 | 73 | 28 > 74 | 87 | 28 > > Sometimes better times were on unbinded tests, but not often enough > to present it here. Some results are much worse and reach up to 120 > seconds. > > Each test has 8 kernel threads. In baremetal case I tried the following: > - numa off,on; > - all cpus are on; > - isolate cpus from first node; > - set intel_idle.max_cstate=1; > - disable intel_pstate; > > I dont think I have exhausted all the options here, but it looked like > two last changes did improve performance, but was still not comparable to > HVM case. > I am trying to find where regression had happened. Performance on newer > kernel (I tried 4.5.0-rc4+) was close or better than HVM. > > I am trying to find f there were some relevant regressions to understand > the reason of this. I see. If this is only happening for the SMT, it may be caused by the SMT-related load balancing in Linux scheduler. However, I have disabled the HT on my machine. Probably, that's also the reason why I didn't see so much different in performance. > > > > What kernel you guys use? I'm using a quite old kernel 3.10.31 . The reason why I'm using this kernel is because I want to use the LITMUS^RT [1], which is a linux testbed for real-time scheduling research. 
(It has a new version though, and I can upgrade to the latest version to see if the "problem" still occurs.)

Thanks and Best Regards,

Meng
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native? 2016-03-01 19:52 ` Meng Xu @ 2016-03-01 20:39 ` Elena Ufimtseva 2016-03-01 21:51 ` Sander Eikelenboom 2016-03-02 19:41 ` Meng Xu 0 siblings, 2 replies; 13+ messages in thread From: Elena Ufimtseva @ 2016-03-01 20:39 UTC (permalink / raw) To: Meng Xu; +Cc: xen-devel@lists.xen.org, Hyon-Young Choi On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote: > Hi Elena, > > Thank you very much for sharing this! :-) > > On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva > <elena.ufimtseva@oracle.com> wrote: > > > > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote: > > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk > > > <konrad.wilk@oracle.com> wrote: > > > >> > Hey! > > > >> > > > > >> > CC-ing Elena. > > > >> > > > >> I think you forgot you cc.ed her.. > > > >> Anyway, let's cc. her now... :-) > > > >> > > > >> > > > > >> >> We are measuring the execution time between native machine environment > > > >> >> and xen virtualization environment using PARSEC Benchmark [1]. > > > >> >> > > > >> >> In virtualiztion environment, we run a domU with three VCPUs, each of > > > >> >> them pinned to a core; we pin the dom0 to another core that is not > > > >> >> used by the domU. > > > >> >> > > > >> >> Inside the Linux in domU in virtualization environment and in native > > > >> >> environment, We used the cpuset to isolate a core (or VCPU) for the > > > >> >> system processors and to isolate a core for the benchmark processes. > > > >> >> We also configured the Linux boot command line with isocpus= option to > > > >> >> isolate the core for benchmark from other unnecessary processes. > > > >> > > > > >> > You may want to just offline them and also boot the machine with NUMA > > > >> > disabled. > > > >> > > > >> Right, the machine is booted up with NUMA disabled. > > > >> We will offline the unnecessary cores then. 
> > > >> > > > >> > > > > >> >> > > > >> >> We expect that execution time of benchmarks in xen virtualization > > > >> >> environment is larger than the execution time in native machine > > > >> >> environment. However, the evaluation gave us an opposite result. > > > >> >> > > > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks: > > > >> >> > > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial > > > >> >> Native: 6.387s > > > >> >> Virtualization: 5.890s > > > >> >> > > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial > > > >> >> Native: 5.276s > > > >> >> Virtualization: 5.240s > > > >> >> > > > >> >> Is there anything wrong with our evaluation that lead to the abnormal > > > >> >> performance results? > > > >> > > > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal! > > > >> > > > > >> > :-) > > > >> > > > > >> > No clue sadly. > > > >> > > > >> Ah-ha. This is really surprising to me.... Why will it speed up the > > > >> system by adding one more layer? Unless the virtualization disabled > > > >> some services that occur in native and interfere with the benchmark. > > > >> > > > >> If virtualization is faster than baremetal by nature, why we can see > > > >> that some experiment shows that virtualization introduces overhead? > > > > > > > > Elena told me that there were some weird regression in Linux 4.1 - where > > > > CPU burning workloads were _slower_ on baremetal than as guests. > > > > > > Hi Elena, > > > Would you mind sharing with us some of your experience of how you > > > found the real reason? Did you use some tool or some methodology to > > > pin down the reason (i.e, CPU burning workloads in native is _slower_ > > > on baremetal than as guests)? > > > > > > > Hi Meng > > > > Yes, sure! > > > > While working on performance tests for smt-exposing patches from Joao > > I run CPU bound workload in HVM guest and using same kernel in baremetal > > run same test. 
> > While testing cpu-bound workload on baremetal linux (4.1.0-rc2)
> > I found that the time to complete the same test is few times more that
> > as it takes for the same under HVM guest.
> > I have tried tests where kernel threads pinned to cores and without pinning.
> > The execution times are most of the times take as twice longer, sometimes 4
> > times longer that HVM case.
> >
> > Interesting is not only that it takes sometimes 3-4 times more
> > than HVM guest, but also that test with bound threads (to cores) takes almost
> > 3 times longer
> > to execute than running same cpu-bound test under HVM (in all
> > configurations).
>
>
> wow~ I didn't expect the native performance can be so "bad".... ;-)

Yes, quite a surprise :)

> >
> >
> > I run each test 5 times and here are the execution times (seconds):
> >
> > -------------------------------------------------
> >         baremetal           |
> > thread_bind | thread unbind | HVM pinned to cores
> > ----------- |---------------|---------------------
> >      74     |      83       |         28
> >      74     |      88       |         28
> >      74     |      38       |         28
> >      74     |      73       |         28
> >      74     |      87       |         28
> >
> > Sometimes better times were on unbinded tests, but not often enough
> > to present it here. Some results are much worse and reach up to 120
> > seconds.
> >
> > Each test has 8 kernel threads. In baremetal case I tried the following:
> > - numa off,on;
> > - all cpus are on;
> > - isolate cpus from first node;
> > - set intel_idle.max_cstate=1;
> > - disable intel_pstate;
> >
> > I dont think I have exhausted all the options here, but it looked like
> > two last changes did improve performance, but was still not comparable to
> > HVM case.
> > I am trying to find where regression had happened. Performance on newer
> > kernel (I tried 4.5.0-rc4+) was close or better than HVM.
> >
> > I am trying to find f there were some relevant regressions to understand
> > the reason of this.
>
>
> I see. If this is only happening for the SMT, it may be caused by the
> SMT-related load balancing in Linux scheduler.
> However, I have disabled the HT on my machine. Probably, that's also
> the reason why I didn't see so much different in performance.

I did enable tracing to see if maybe there is extensive migration.
The test machine has two nodes, 8 cores each, 2 threads per core, 32 logical cpus in total.
The kernel threads are not bound, and here is the output for the life of one of the threads:

cat ./t-komp_trace |grep t-kompressor|grep 18883

t-kompressor-18883 [028] d... 69458.596403: sched_switch: prev_comm=kthreadd prev_pid=18883 prev_prio=120 prev_state=D ==> next_comm=swapper/28 next_pid=0 next_prio=120
insmod-18875 [027] dN.. 69458.669180: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=28 dest_cpu=9
<idle>-0 [009] d... 69458.669205: sched_switch: prev_comm=swapper/9 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
t-kompressor-18883 [009] d... 69486.997626: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=R ==> next_comm=migration/9 next_pid=52 next_prio=0
migration/9-52 [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
<idle>-0 [025] d... 69486.997641: sched_switch: prev_comm=swapper/25 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
t-kompressor-18883 [025] d... 69486.997710: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=x ==> next_comm=swapper/25 next_pid=0 next_prio=120
insmod-18875 [011] .N.. 69503.119960: sched_kthread_stop: comm=t-kompressor pid=18883

Threads are being spawned from two cores, then some of the threads migrate to other cores.
In the example above the thread is spawned on cpu 27 and, when woken up, runs on cpu 009.
Later it migrated to 025, which is the second thread of the same core (009).
While I am not sure why this migration happens, it does not seem to contribute a lot.
Anyway, this picture repeats for some other threads (some stay where they were woken up):

t-kompressor-18880 [014] dNh. 69485.802729: sched_migrate_task: comm=hald pid=3820 prio=120 orig_cpu=14 dest_cpu=11
migration/13-72 [013] d... 69486.707459: sched_migrate_task: comm=t-kompressor pid=18878 prio=120 orig_cpu=13 dest_cpu=29
migration/14-77 [014] d... 69486.783818: sched_migrate_task: comm=t-kompressor pid=18880 prio=120 orig_cpu=14 dest_cpu=30
migration/8-47 [008] d... 69486.792667: sched_migrate_task: comm=t-kompressor pid=18882 prio=120 orig_cpu=8 dest_cpu=24
migration/15-82 [015] d... 69486.796429: sched_migrate_task: comm=t-kompressor pid=18881 prio=120 orig_cpu=15 dest_cpu=31
migration/10-57 [010] d... 69486.857848: sched_migrate_task: comm=t-kompressor pid=18884 prio=120 orig_cpu=10 dest_cpu=26
migration/9-52 [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
migration/28-147 [028] d... 69503.073577: sched_migrate_task: comm=t-kompressor pid=18876 prio=120 orig_cpu=28 dest_cpu=10

All threads are running on their own cores and some migrate to the second smt-thread over time.
I probably should have traced some other scheduling events, but I have not found any relevant ones yet.

>
> >
> >
> >
> > What kernel you guys use?
>
>
> I'm using a quite old kernel 3.10.31. The reason why I'm using this kernel is because I want to use the
> LITMUS^RT [1], which is a linux testbed for real-time scheduling
> research. (It has a new version though, and I can upgrade to the
> latest version to see if the "problem" still occurs.)

Yes, it will be interesting to see the outcome.

What difference in numbers do you see?
What machines are you seeing it on?
Is your workload purely cpu-bound?

Thanks!

>
> Thanks and Best Regards,
>
> Meng
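
The observation above (threads mostly move to the second SMT thread of their own core) can be checked mechanically from the trace. The following is a minimal sketch, not part of the original message: the helper name `classify_migrations` is hypothetical, and the sibling rule (logical cpu N pairs with N + 16) is an assumption inferred from the 9/25 pair mentioned above. On a real machine the pairing should be read from `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` instead.

```python
import re

# Matches the sched_migrate_task fields as they appear in the trace lines above.
MIGRATE_RE = re.compile(
    r"sched_migrate_task: comm=(?P<comm>\S+) pid=(?P<pid>\d+) "
    r"prio=\d+ orig_cpu=(?P<orig>\d+) dest_cpu=(?P<dest>\d+)"
)

# Assumption: logical cpu N and N + 16 share a core (as cpus 9 and 25 do above).
SIBLING_OFFSET = 16


def classify_migrations(lines, comm="t-kompressor"):
    """Return (pid, orig_cpu, dest_cpu, is_smt_sibling) for each migration of comm."""
    result = []
    for line in lines:
        m = MIGRATE_RE.search(line)
        if m is None or m.group("comm") != comm:
            continue  # not a migration event, or a different task (e.g. hald)
        orig, dest = int(m.group("orig")), int(m.group("dest"))
        sibling = abs(orig - dest) == SIBLING_OFFSET
        result.append((int(m.group("pid")), orig, dest, sibling))
    return result


# Three lines copied from the trace above.
sample = [
    "insmod-18875 [027] dN.. 69458.669180: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=28 dest_cpu=9",
    "migration/9-52 [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25",
    "t-kompressor-18880 [014] dNh. 69485.802729: sched_migrate_task: comm=hald pid=3820 prio=120 orig_cpu=14 dest_cpu=11",
]

for pid, orig, dest, sibling in classify_migrations(sample):
    kind = "SMT sibling" if sibling else "other core"
    print(f"pid {pid}: cpu {orig} -> cpu {dest} ({kind})")
```

Feeding the whole trace file through the same function would show how many of the migrations stay within a core and how many cross cores.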
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 20:39 ` Elena Ufimtseva
@ 2016-03-01 21:51 ` Sander Eikelenboom
  2016-03-01 22:06 ` Elena Ufimtseva
  ` (2 more replies)
  2016-03-02 19:41 ` Meng Xu
  1 sibling, 3 replies; 13+ messages in thread
From: Sander Eikelenboom @ 2016-03-01 21:51 UTC (permalink / raw)
  To: Elena Ufimtseva; +Cc: Hyon-Young Choi, Meng Xu, xen-devel@lists.xen.org

Tuesday, March 1, 2016, 9:39:25 PM, you wrote:

> On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
>> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
>> <elena.ufimtseva@oracle.com> wrote:
>> > I dont think I have exhausted all the options here, but it looked like
>> > two last changes did improve performance, but was still not comparable to
>> > HVM case.
>> > I am trying to find where regression had happened. Performance on newer
>> > kernel (I tried 4.5.0-rc4+) was close or better than HVM.

Just a perhaps silly thought .. but could there be something in the time-measuring that could differ and explain the slightly surprising results ?

--
Sander
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 21:51 ` Sander Eikelenboom
@ 2016-03-01 22:06 ` Elena Ufimtseva
  2016-03-01 22:12 ` Dario Faggioli
  2016-03-02 19:44 ` Meng Xu
  2 siblings, 0 replies; 13+ messages in thread
From: Elena Ufimtseva @ 2016-03-01 22:06 UTC (permalink / raw)
  To: Sander Eikelenboom; +Cc: Hyon-Young Choi, Meng Xu, xen-devel@lists.xen.org

On Tue, Mar 01, 2016 at 10:51:30PM +0100, Sander Eikelenboom wrote:
> Tuesday, March 1, 2016, 9:39:25 PM, you wrote:
>
> > On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
> >> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
> >> <elena.ufimtseva@oracle.com> wrote:
> >> > I dont think I have exhausted all the options here, but it looked like
> >> > two last changes did improve performance, but was still not comparable to
> >> > HVM case.
> >> > I am trying to find where regression had happened. Performance on newer
> >> > kernel (I tried 4.5.0-rc4+) was close or better than HVM.
>
> Just a perhaps silly thought .. but could there be something in the
> time-measuring that could differ and explain the slightly surprising results ?
> --
> Sander

Thanks Sander, I also thought about this.
I would consider it further if the execution times were not so obviously bigger: it is 28 sec vs. up to a few minutes, and the difference is noticeable even without registering the time.

Also, there is the expected difference in a ping-pong test, where baremetal slightly outperforms HVM on the same kernels.
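
To put a number on "obviously bigger", the slowdown factors can be computed directly from the table of execution times quoted earlier in the thread. This is only an illustrative sketch; the variable and function names are hypothetical, and the figures are the five runs per configuration reported there.

```python
from statistics import mean

# Execution times in seconds, copied from the table quoted earlier in the thread.
baremetal_bound = [74, 74, 74, 74, 74]
baremetal_unbound = [83, 88, 38, 73, 87]
hvm_pinned = [28, 28, 28, 28, 28]


def slowdown(times, baseline):
    """Ratio of the mean of times to the mean of baseline."""
    return mean(times) / mean(baseline)


print(f"bound vs HVM:   {slowdown(baremetal_bound, hvm_pinned):.2f}x")
print(f"unbound vs HVM: {slowdown(baremetal_unbound, hvm_pinned):.2f}x")
print(f"worst unbound run vs HVM: {max(baremetal_unbound) / mean(hvm_pinned):.2f}x")
```

Both baremetal configurations average roughly 2.6 times the HVM time, which is far larger than any plausible clock-reading error on a run of tens of seconds.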
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 21:51 ` Sander Eikelenboom
  2016-03-01 22:06 ` Elena Ufimtseva
@ 2016-03-01 22:12 ` Dario Faggioli
  2016-03-02 19:44 ` Meng Xu
  2 siblings, 0 replies; 13+ messages in thread
From: Dario Faggioli @ 2016-03-01 22:12 UTC (permalink / raw)
  To: Sander Eikelenboom, Elena Ufimtseva
  Cc: Meng Xu, Hyon-Young Choi, xen-devel@lists.xen.org

On Tue, 2016-03-01 at 22:51 +0100, Sander Eikelenboom wrote:
> Tuesday, March 1, 2016, 9:39:25 PM, you wrote:
>
> > On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
> > >
> > > > I dont think I have exhausted all the options here, but it looked like
> > > > two last changes did improve performance, but was still not comparable to
> > > > HVM case.
> > > > I am trying to find where regression had happened. Performance on newer
> > > > kernel (I tried 4.5.0-rc4+) was close or better than HVM.
>
> Just a perhaps silly thought .. but could there be something in the
> time-measuring that could differ and explain the slightly surprising
> results ?
>
I was about to say the exact same thing.

In fact, when I can, I always try to measure time either from the host,
or even from outside, rather than inside the VM itself.

The host should be fine, although that, in Xen, means measuring in
Dom0, which may be affected by virtualization as well; from this point
of view, outside would be preferable.

However, measuring from outside (or, in certain cases, even from the
host) is not always possible and/or comes with a price. So, it really
depends on the benchmark and on the use case, and in a bunch of them
measuring from the guest is actually fine and/or is what one wants.

However, do at least consider that as an option for the issue being
reported.

Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
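
One way to act on the suggestion of measuring from outside is to time the whole benchmark invocation with a monotonic clock on the measuring machine, rather than trusting the clock inside the guest. The sketch below is an assumption about how that could be done, not something described in the thread; `time_command` is a hypothetical helper, and the stand-in workload only exists so the example runs anywhere.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return the elapsed wall-clock seconds."""
    start = time.monotonic()  # monotonic: unaffected by wall-clock adjustments
    subprocess.run(argv, check=True)
    return time.monotonic() - start


# Stand-in workload so the sketch runs anywhere; in practice argv would be
# whatever launches the benchmark in the guest from the measuring machine.
elapsed = time_command([sys.executable, "-c", "sum(range(10**6))"])
print(f"elapsed: {elapsed:.3f}s")
```

A measurement taken this way also includes the cost of launching the command (and any remote-login overhead), so it suits runs of several seconds and is best compared against the in-guest figure rather than used to replace it.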
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native? 2016-03-01 21:51 ` Sander Eikelenboom 2016-03-01 22:06 ` Elena Ufimtseva 2016-03-01 22:12 ` Dario Faggioli @ 2016-03-02 19:44 ` Meng Xu 2 siblings, 0 replies; 13+ messages in thread From: Meng Xu @ 2016-03-02 19:44 UTC (permalink / raw) To: Sander Eikelenboom Cc: Elena Ufimtseva, Hyon-Young Choi, xen-devel@lists.xen.org On Tue, Mar 1, 2016 at 4:51 PM, Sander Eikelenboom <linux@eikelenboom.it> wrote: > > Tuesday, March 1, 2016, 9:39:25 PM, you wrote: > >> On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote: >>> Hi Elena, >>> >>> Thank you very much for sharing this! :-) >>> >>> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva >>> <elena.ufimtseva@oracle.com> wrote: >>> > >>> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote: >>> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk >>> > > <konrad.wilk@oracle.com> wrote: >>> > > >> > Hey! >>> > > >> > >>> > > >> > CC-ing Elena. >>> > > >> >>> > > >> I think you forgot you cc.ed her.. >>> > > >> Anyway, let's cc. her now... :-) >>> > > >> >>> > > >> > >>> > > >> >> We are measuring the execution time between native machine environment >>> > > >> >> and xen virtualization environment using PARSEC Benchmark [1]. >>> > > >> >> >>> > > >> >> In virtualiztion environment, we run a domU with three VCPUs, each of >>> > > >> >> them pinned to a core; we pin the dom0 to another core that is not >>> > > >> >> used by the domU. >>> > > >> >> >>> > > >> >> Inside the Linux in domU in virtualization environment and in native >>> > > >> >> environment, We used the cpuset to isolate a core (or VCPU) for the >>> > > >> >> system processors and to isolate a core for the benchmark processes. >>> > > >> >> We also configured the Linux boot command line with isocpus= option to >>> > > >> >> isolate the core for benchmark from other unnecessary processes. 
>>> > > >> > >>> > > >> > You may want to just offline them and also boot the machine with NUMA >>> > > >> > disabled. >>> > > >> >>> > > >> Right, the machine is booted up with NUMA disabled. >>> > > >> We will offline the unnecessary cores then. >>> > > >> >>> > > >> > >>> > > >> >> >>> > > >> >> We expect that execution time of benchmarks in xen virtualization >>> > > >> >> environment is larger than the execution time in native machine >>> > > >> >> environment. However, the evaluation gave us an opposite result. >>> > > >> >> >>> > > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks: >>> > > >> >> >>> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial >>> > > >> >> Native: 6.387s >>> > > >> >> Virtualization: 5.890s >>> > > >> >> >>> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial >>> > > >> >> Native: 5.276s >>> > > >> >> Virtualization: 5.240s >>> > > >> >> >>> > > >> >> Is there anything wrong with our evaluation that lead to the abnormal >>> > > >> >> performance results? >>> > > >> > >>> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal! >>> > > >> > >>> > > >> > :-) >>> > > >> > >>> > > >> > No clue sadly. >>> > > >> >>> > > >> Ah-ha. This is really surprising to me.... Why will it speed up the >>> > > >> system by adding one more layer? Unless the virtualization disabled >>> > > >> some services that occur in native and interfere with the benchmark. >>> > > >> >>> > > >> If virtualization is faster than baremetal by nature, why we can see >>> > > >> that some experiment shows that virtualization introduces overhead? >>> > > > >>> > > > Elena told me that there were some weird regression in Linux 4.1 - where >>> > > > CPU burning workloads were _slower_ on baremetal than as guests. >>> > > >>> > > Hi Elena, >>> > > Would you mind sharing with us some of your experience of how you >>> > > found the real reason? 
Did you use some tool or some methodology to >>> > > pin down the reason (i.e, CPU burning workloads in native is _slower_ >>> > > on baremetal than as guests)? >>> > > >>> > >>> > Hi Meng >>> > >>> > Yes, sure! >>> > >>> > While working on performance tests for smt-exposing patches from Joao >>> > I run CPU bound workload in HVM guest and using same kernel in baremetal >>> > run same test. >>> > While testing cpu-bound workload on baremetal linux (4.1.0-rc2) >>> > I found that the time to complete the same test is few times more that >>> > as it takes for the same under HVM guest. >>> > I have tried tests where kernel threads pinned to cores and without pinning. >>> > The execution times are most of the times take as twice longer, sometimes 4 >>> > times longer that HVM case. >>> > >>> > Interesting is not only that it takes sometimes 3-4 times more >>> > than HVM guest, but also that test with bound threads (to cores) takes almost >>> > 3 times longer >>> > to execute than running same cpu-bound test under HVM (in all >>> > configurations). >>> >>> >>> wow~ I didn't expect the native performance can be so "bad".... ;-) > >> Yes, quite a surprise :) >>> >>> > >>> > >>> > I run each test 5 times and here are the execution times (seconds): >>> > >>> > ------------------------------------------------- >>> > baremetal | >>> > thread_bind | thread unbind | HVM pinned to cores >>> > ----------- |---------------|--------------------- >>> > 74 | 83 | 28 >>> > 74 | 88 | 28 >>> > 74 | 38 | 28 >>> > 74 | 73 | 28 >>> > 74 | 87 | 28 >>> > >>> > Sometimes better times were on unbinded tests, but not often enough >>> > to present it here. Some results are much worse and reach up to 120 >>> > seconds. >>> > >>> > Each test has 8 kernel threads. 
>>> > In baremetal case I tried the following:
>>> > - numa off,on;
>>> > - all cpus are on;
>>> > - isolate cpus from first node;
>>> > - set intel_idle.max_cstate=1;
>>> > - disable intel_pstate;
>>> >
>>> > I dont think I have exhausted all the options here, but it looked like
>>> > two last changes did improve performance, but was still not comparable to
>>> > HVM case.
>>> > I am trying to find where regression had happened. Performance on newer
>>> > kernel (I tried 4.5.0-rc4+) was close or better than HVM.
>
> Just a perhaps silly thought .. but could there be something in the
> time-measuring that could differ and explain the slightly surprising results ?

Thanks Sander!
Actually, I also thought about this reason, as Elena did.
If the time measurement were the cause, the difference in execution
time should not vary across different types of workloads/programs.
That's why I think the time measurement is not the reason here (at
least not the main reason). :-)

Best,

Meng
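
One way to look at the time-measuring question raised above is to record wall-clock time and CPU time for the same run and compare the two in both environments. The following is only a minimal sketch, assuming a Linux system with Python 3; `time_command` is a hypothetical helper and was not used by anyone in this thread. Both values are still read from clocks inside the system under test, so this cannot by itself rule out a guest clock problem.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv once and return (wall, user, system) seconds for the child.

    Wall-clock time comes from a monotonic clock; user and system time
    come from the CPU time the kernel accounted to waited-for children.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return wall, user, system


if __name__ == "__main__":
    # Placeholder workload; substitute the real benchmark command line.
    wall, user, system = time_command(
        [sys.executable, "-c", "sum(range(10**6))"]
    )
    print(f"wall={wall:.3f}s user={user:.3f}s sys={system:.3f}s")
```

For a single-threaded, CPU-bound run, wall time and user+system time would be expected to be close. A large gap in only one environment would point at waiting or at the clocks rather than at the computation itself.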
* Re: [Question] PARSEC benchmark has smaller execution time in VM than in native?
  2016-03-01 20:39 ` Elena Ufimtseva
  2016-03-01 21:51 ` Sander Eikelenboom
@ 2016-03-02 19:41 ` Meng Xu
From: Meng Xu @ 2016-03-02 19:41 UTC
To: Elena Ufimtseva; +Cc: xen-devel@lists.xen.org, Hyon-Young Choi

Hi Elena,

On Tue, Mar 1, 2016 at 3:39 PM, Elena Ufimtseva
<elena.ufimtseva@oracle.com> wrote:
> On Tue, Mar 01, 2016 at 02:52:14PM -0500, Meng Xu wrote:
>> Hi Elena,
>>
>> Thank you very much for sharing this! :-)
>>
>> On Tue, Mar 1, 2016 at 1:20 PM, Elena Ufimtseva
>> <elena.ufimtseva@oracle.com> wrote:
>> >
>> > On Tue, Mar 01, 2016 at 08:48:30AM -0500, Meng Xu wrote:
>> > > On Mon, Feb 29, 2016 at 12:59 PM, Konrad Rzeszutek Wilk
>> > > <konrad.wilk@oracle.com> wrote:
>> > > >> > Hey!
>> > > >> >
>> > > >> > CC-ing Elena.
>> > > >>
>> > > >> I think you forgot you cc.ed her..
>> > > >> Anyway, let's cc. her now... :-)
>> > > >>
>> > > >> >
>> > > >> >> We are measuring the execution time between native machine environment
>> > > >> >> and xen virtualization environment using PARSEC Benchmark [1].
>> > > >> >>
>> > > >> >> In virtualization environment, we run a domU with three VCPUs, each of
>> > > >> >> them pinned to a core; we pin the dom0 to another core that is not
>> > > >> >> used by the domU.
>> > > >> >>
>> > > >> >> Inside the Linux in domU in virtualization environment and in native
>> > > >> >> environment, We used the cpuset to isolate a core (or VCPU) for the
>> > > >> >> system processors and to isolate a core for the benchmark processes.
>> > > >> >> We also configured the Linux boot command line with isolcpus= option to
>> > > >> >> isolate the core for benchmark from other unnecessary processes.
>> > > >> >
>> > > >> > You may want to just offline them and also boot the machine with NUMA
>> > > >> > disabled.
>> > > >>
>> > > >> Right, the machine is booted up with NUMA disabled.
>> > > >> We will offline the unnecessary cores then.
>> > > >>
>> > > >> >
>> > > >> >>
>> > > >> >> We expect that execution time of benchmarks in xen virtualization
>> > > >> >> environment is larger than the execution time in native machine
>> > > >> >> environment. However, the evaluation gave us an opposite result.
>> > > >> >>
>> > > >> >> Below is the evaluation data for the canneal and streamcluster benchmarks:
>> > > >> >>
>> > > >> >> Benchmark: canneal, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 6.387s
>> > > >> >> Virtualization: 5.890s
>> > > >> >>
>> > > >> >> Benchmark: streamcluster, input=simlarge, conf=gcc-serial
>> > > >> >> Native: 5.276s
>> > > >> >> Virtualization: 5.240s
>> > > >> >>
>> > > >> >> Is there anything wrong with our evaluation that lead to the abnormal
>> > > >> >> performance results?
>> > > >> >
>> > > >> > Nothing is wrong. Virtualization is naturally faster than baremetal!
>> > > >> >
>> > > >> > :-)
>> > > >> >
>> > > >> > No clue sadly.
>> > > >>
>> > > >> Ah-ha. This is really surprising to me.... Why will it speed up the
>> > > >> system by adding one more layer? Unless the virtualization disabled
>> > > >> some services that occur in native and interfere with the benchmark.
>> > > >>
>> > > >> If virtualization is faster than baremetal by nature, why we can see
>> > > >> that some experiment shows that virtualization introduces overhead?
>> > > >
>> > > > Elena told me that there were some weird regression in Linux 4.1 - where
>> > > > CPU burning workloads were _slower_ on baremetal than as guests.
>> > >
>> > > Hi Elena,
>> > > Would you mind sharing with us some of your experience of how you
>> > > found the real reason? Did you use some tool or some methodology to
>> > > pin down the reason (i.e, CPU burning workloads in native is _slower_
>> > > on baremetal than as guests)?
>> > >
>> >
>> > Hi Meng
>> >
>> > Yes, sure!
>> >
>> > While working on performance tests for smt-exposing patches from Joao
>> > I run CPU bound workload in HVM guest and using same kernel in baremetal
>> > run same test.
>> > While testing cpu-bound workload on baremetal linux (4.1.0-rc2)
>> > I found that the time to complete the same test is few times more that
>> > as it takes for the same under HVM guest.
>> > I have tried tests where kernel threads pinned to cores and without pinning.
>> > The execution times are most of the times take as twice longer, sometimes 4
>> > times longer that HVM case.
>> >
>> > Interesting is not only that it takes sometimes 3-4 times more
>> > than HVM guest, but also that test with bound threads (to cores) takes almost
>> > 3 times longer
>> > to execute than running same cpu-bound test under HVM (in all
>> > configurations).
>>
>>
>> wow~ I didn't expect the native performance can be so "bad".... ;-)
>
> Yes, quite a surprise :)
>>
>> >
>> >
>> > I run each test 5 times and here are the execution times (seconds):
>> >
>> > -------------------------------------------------
>> >          baremetal          |
>> > thread_bind | thread unbind | HVM pinned to cores
>> > ----------- |---------------|---------------------
>> >     74      |      83       |        28
>> >     74      |      88       |        28
>> >     74      |      38       |        28
>> >     74      |      73       |        28
>> >     74      |      87       |        28
>> >
>> > Sometimes better times were on unbinded tests, but not often enough
>> > to present it here. Some results are much worse and reach up to 120
>> > seconds.
>> >
>> > Each test has 8 kernel threads. In baremetal case I tried the following:
>> > - numa off,on;
>> > - all cpus are on;
>> > - isolate cpus from first node;
>> > - set intel_idle.max_cstate=1;
>> > - disable intel_pstate;
>> >
>> > I dont think I have exhausted all the options here, but it looked like
>> > two last changes did improve performance, but was still not comparable to
>> > HVM case.
>> > I am trying to find where regression had happened.
>> > Performance on newer kernel (I tried 4.5.0-rc4+) was close or better than HVM.
>> >
>> > I am trying to find if there were some relevant regressions to understand
>> > the reason of this.
>>
>>
>> I see. If this is only happening for the SMT, it may be caused by the
>> SMT-related load balancing in Linux scheduler.
>> However, I have disabled the HT on my machine. Probably, that's also
>> the reason why I didn't see so much difference in performance.
>
> I did enable tracing to see if maybe there is extensive migration:
> Test machine has two nodes, 8 cores each, 2 threads per core, total 32 logical cpus.
>
> Kernel threads are not bound and here is the output for the life of one of the threads:
>
> cat ./t-komp_trace |grep t-kompressor|grep 18883
>
> t-kompressor-18883 [028] d... 69458.596403: sched_switch: prev_comm=kthreadd prev_pid=18883 prev_prio=120 prev_state=D ==> next_comm=swapper/28 next_pid=0 next_prio=120
> insmod-18875 [027] dN.. 69458.669180: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=28 dest_cpu=9
> <idle>-0 [009] d... 69458.669205: sched_switch: prev_comm=swapper/9 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
> t-kompressor-18883 [009] d... 69486.997626: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=R ==> next_comm=migration/9 next_pid=52 next_prio=0
> migration/9-52 [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
> <idle>-0 [025] d... 69486.997641: sched_switch: prev_comm=swapper/25 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=t-kompressor next_pid=18883 next_prio=120
> t-kompressor-18883 [025] d... 69486.997710: sched_switch: prev_comm=t-kompressor prev_pid=18883 prev_prio=120 prev_state=x ==> next_comm=swapper/25 next_pid=0 next_prio=120
> insmod-18875 [011] .N.. 69503.119960: sched_kthread_stop: comm=t-kompressor pid=18883
>
>
> Threads are being spawned from two cores, then some of the threads migrate to other cores.
> In the example above the thread is being spawned on cpu 27 and when woken up, runs on cpu 009.
> Later it migrated to 025 which is the second thread of the same core (009).
> While I am not sure why this migration happens, it does not seem to contribute a lot.
> Anyway this picture repeats for some other threads (some stay where they were woken up):
>
> t-kompressor-18880 [014] dNh. 69485.802729: sched_migrate_task: comm=hald pid=3820 prio=120 orig_cpu=14 dest_cpu=11
> migration/13-72 [013] d... 69486.707459: sched_migrate_task: comm=t-kompressor pid=18878 prio=120 orig_cpu=13 dest_cpu=29
> migration/14-77 [014] d... 69486.783818: sched_migrate_task: comm=t-kompressor pid=18880 prio=120 orig_cpu=14 dest_cpu=30
> migration/8-47 [008] d... 69486.792667: sched_migrate_task: comm=t-kompressor pid=18882 prio=120 orig_cpu=8 dest_cpu=24
> migration/15-82 [015] d... 69486.796429: sched_migrate_task: comm=t-kompressor pid=18881 prio=120 orig_cpu=15 dest_cpu=31
> migration/10-57 [010] d... 69486.857848: sched_migrate_task: comm=t-kompressor pid=18884 prio=120 orig_cpu=10 dest_cpu=26
> migration/9-52 [009] d... 69486.997632: sched_migrate_task: comm=t-kompressor pid=18883 prio=120 orig_cpu=9 dest_cpu=25
> migration/28-147 [028] d... 69503.073577: sched_migrate_task: comm=t-kompressor pid=18876 prio=120 orig_cpu=28 dest_cpu=10
>
> All threads are running on their own cores and some migrate to second smt-thread over time.
> I probably should have traced some other scheduling events, but I did not find any relevant ones yet.
>
>>
>> >
>> >
>> >
>> > What kernel do you guys use?
>>
>>
>> I'm using a quite old kernel, 3.10.31.
>> The reason why I'm using this kernel is because I want to use the
>> LITMUS^RT [1], which is a linux testbed for real-time scheduling
>> research.
>> (It has a new version though, and I can upgrade to the
>> latest version to see if the "problem" still occurs.)
>
> Yes, it will be interesting to see the outcome.
>
> What difference in numbers do you see?

Below is the evaluation data for the canneal and streamcluster
benchmarks, which are part of the PARSEC benchmark suite:

Benchmark: canneal, input=simlarge, conf=gcc-serial
Native: 6.387s
Virtualization: 5.890s

Benchmark: streamcluster, input=simlarge, conf=gcc-serial
Native: 5.276s
Virtualization: 5.240s

> What the machines you are seeing it on?

Below is the CPU info from /proc/cpuinfo:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping        : 9
microcode       : 0x12
cpu MHz         : 1600.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 6784.70
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

> Is your workload is purely cpu-bound?

Nope. The canneal and streamcluster benchmarks are cache-sensitive (or
memory-sensitive) tasks. The execution time of these two benchmarks
depends on how much cache and memory they can get.
Under the current kernel, IIRC, I didn't see the "abnormal performance
behavior" for cpu-bound tasks.

Thanks and Best Regards,

Meng
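
The migration pattern Elena describes in the quoted trace can be tallied mechanically instead of read by eye. The sketch below parses `sched_migrate_task` lines and separates moves between SMT siblings from all other moves. It rests on one assumption inferred from her remark that cpu 025 is the second thread of core 009 on a 32-cpu machine: logical cpus N and N+16 share a core. `classify_migrations` and `SIBLING_OFFSET` are hypothetical names, and the sample lines are copied from the trace above.

```python
import re

# Matches the payload of a sched_migrate_task trace event.
MIGRATE_RE = re.compile(
    r"sched_migrate_task: comm=(?P<comm>\S+) pid=(?P<pid>\d+) "
    r"prio=\d+ orig_cpu=(?P<orig>\d+) dest_cpu=(?P<dest>\d+)"
)

# Assumption: cpu N and cpu N+16 are the two SMT threads of one core.
SIBLING_OFFSET = 16


def classify_migrations(lines, comm="t-kompressor"):
    """Split migrations of `comm` into (sibling_moves, other_moves).

    Each move is a (pid, orig_cpu, dest_cpu) tuple.
    """
    sibling, other = [], []
    for line in lines:
        match = MIGRATE_RE.search(line)
        if match is None or match.group("comm") != comm:
            continue
        orig = int(match.group("orig"))
        dest = int(match.group("dest"))
        move = (int(match.group("pid")), orig, dest)
        if abs(orig - dest) == SIBLING_OFFSET:
            sibling.append(move)
        else:
            other.append(move)
    return sibling, other


SAMPLE = """\
t-kompressor-18880 [014] dNh. 69485.802729: sched_migrate_task: comm=hald pid=3820 prio=120 orig_cpu=14 dest_cpu=11
migration/13-72 [013] d... 69486.707459: sched_migrate_task: comm=t-kompressor pid=18878 prio=120 orig_cpu=13 dest_cpu=29
migration/28-147 [028] d... 69503.073577: sched_migrate_task: comm=t-kompressor pid=18876 prio=120 orig_cpu=28 dest_cpu=10
""".splitlines()

sibling_moves, other_moves = classify_migrations(SAMPLE)
print(sibling_moves)  # [(18878, 13, 29)]
print(other_moves)    # [(18876, 28, 10)]
```

On a different machine the sibling layout should be read from the topology files under /sys/devices/system/cpu/ rather than assumed from a fixed offset.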
end of thread, other threads:[~2016-03-02 19:44 UTC | newest]

Thread overview: 13+ messages
2016-02-26  5:02 [Question] PARSEC benchmark has smaller execution time in VM than in native? Meng Xu
2016-02-29 16:06 ` Konrad Rzeszutek Wilk
2016-02-29 17:29 ` Meng Xu
2016-02-29 17:59 ` Konrad Rzeszutek Wilk
2016-03-01 13:48 ` Meng Xu
2016-03-01 18:20 ` Elena Ufimtseva
2016-03-01 19:52 ` Meng Xu
2016-03-01 20:39 ` Elena Ufimtseva
2016-03-01 21:51 ` Sander Eikelenboom
2016-03-01 22:06 ` Elena Ufimtseva
2016-03-01 22:12 ` Dario Faggioli
2016-03-02 19:44 ` Meng Xu
2016-03-02 19:41 ` Meng Xu