* schedulers and topology exposing questions
@ 2016-01-22 16:54 Elena Ufimtseva
2016-01-22 17:29 ` Dario Faggioli
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Elena Ufimtseva @ 2016-01-22 16:54 UTC (permalink / raw)
To: xen-devel, dario.faggioli, george.dunlap, konrad.wilk,
joao.m.martins, boris.ostrovsky
[-- Attachment #1: Type: text/plain, Size: 6756 bytes --]
Hello all!
Dario, George, or anyone else, your help will be appreciated.
Let me give some introduction to our findings. I may forget something or not be
explicit enough, so please ask me.
A customer filed a bug where some of their applications were running slow in their HVM DomU setups.
The running times were compared against bare metal running the same kernel version as the HVM DomU.
After some investigation by different parties, a test case was found
where the problem was easily reproduced. The test app is a UDP server/client pair where the
client passes a message n times.
The test case was executed on bare metal and on a Xen DomU, both with kernel version 2.6.39.
Bare metal showed a 2x better result than the DomU.
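(The actual test program is not attached here. For illustration only, a minimal Python
sketch of this kind of ping-pong pair - host, port, message size and iteration count are
made-up values, not the real test parameters - would look like the following.)

# udp_pingpong.py - minimal sketch of a UDP ping-pong client/server pair.
# The real test binary is not shown in this thread; host, port, message size
# and iteration count are illustrative assumptions only.
import socket
import sys
import time

HOST, PORT, N, MSG = "127.0.0.1", 9999, 100000, b"x"

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((HOST, PORT))
    for _ in range(N):
        data, addr = s.recvfrom(64)   # wait for the client's message
        s.sendto(data, addr)          # echo it straight back

def client():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    start = time.time()
    for _ in range(N):
        s.sendto(MSG, (HOST, PORT))   # one tiny datagram per round trip
        s.recvfrom(64)                # block until the server echoes it back
    print("%d round trips in %.2f s" % (N, time.time() - start))

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client()

(Run "python udp_pingpong.py server" in one shell and "python udp_pingpong.py client"
in another; each round trip is a single tiny datagram, so the workload is dominated by
wakeup/scheduling latency rather than data transfer.)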
Konrad came up with a workaround: setting a flag for the scheduling domain in the Linux guest.
As the guest is not aware of the SMT-related topology, it is initialized with a flat topology.
In 2.6.39 the kernel sets the scheduling domain flags for the CPU domain to 4143.
Konrad discovered that changing the CPU scheduling domain flags to 4655
works as a workaround and makes Linux think that the topology has SMT threads.
With this workaround the test completes in almost the same time as on bare metal (or insignificantly worse).
As we discovered, this workaround is not suitable for newer kernel versions.
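(For reference, here is a small Python sketch that decodes the two flag values. The SD_*
bit definitions below are taken from the 2.6.39-era include/linux/sched.h and are an
assumption on my side, so please re-check them against the exact tree; on kernels built
with CONFIG_SCHED_DEBUG the flags can usually be inspected and changed under
/proc/sys/kernel/sched_domain/cpu*/domain*/flags.)

# decode_sd_flags.py - decode Linux scheduling-domain flag values.
# The SD_* bit values are assumed from 2.6.39-era include/linux/sched.h.
SD_FLAGS = {
    0x0001: "SD_LOAD_BALANCE",
    0x0002: "SD_BALANCE_NEWIDLE",
    0x0004: "SD_BALANCE_EXEC",
    0x0008: "SD_BALANCE_FORK",
    0x0010: "SD_BALANCE_WAKE",
    0x0020: "SD_WAKE_AFFINE",
    0x0040: "SD_PREFER_LOCAL",
    0x0080: "SD_SHARE_CPUPOWER",
    0x0100: "SD_POWERSAVINGS_BALANCE",
    0x0200: "SD_SHARE_PKG_RESOURCES",
    0x0400: "SD_SERIALIZE",
    0x0800: "SD_ASYM_PACKING",
    0x1000: "SD_PREFER_SIBLING",
}

def decode(flags):
    return [name for bit, name in sorted(SD_FLAGS.items()) if flags & bit]

for val in (4143, 4655):
    print(val, hex(val), decode(val))
# Under these assumed definitions the only difference between 4143 and 4655
# is the 0x0200 bit, i.e. SD_SHARE_PKG_RESOURCES.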
The hackish way of making the DomU Linux think that it has SMT threads (along with a matching cpuid)
made us think that the problem comes from the fact that the CPU topology is not exposed to the
guest, so the Linux scheduler cannot make intelligent scheduling decisions.
Joao Martins from Oracle developed a set of patches that fixes the SMT/core/cache
topology numbering, provides matching pinning of vcpus, and adds enabling options,
which allows the correct topology to be exposed to the guest.
I guess Joao will be posting them at some point.
With these patches we decided to test the performance impact on different kernel versions and Xen versions.
The test described above was labeled as the IO-bound test.
We have run the io-bound test with and without the smt patches. Compared
to the base case (no smt patches, flat topology), the improvement is a 22-23% gain.
While we have seen improvement with the io-bound tests, the same did not happen with the cpu-bound workload.
As the cpu-bound test we use a kernel module which runs the requested number of kernel threads,
and each thread compresses and decompresses some data.
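(The kernel module itself is not included in this mail. Purely as an illustration of the
workload shape - not the actual test - a rough userspace analogue in Python could look like
this; worker count, buffer size and iteration count are made-up values.)

# cpu_bound.py - rough userspace analogue of the compress/decompress test.
# The real test is an in-kernel module; the parameters here are illustrative
# assumptions only.
import multiprocessing
import os
import time
import zlib

WORKERS, ITERS, BUF = 8, 200, os.urandom(1 << 20)  # 1 MiB of random data

def worker(_):
    for _ in range(ITERS):
        # compress and decompress the buffer to burn CPU with no I/O
        assert zlib.decompress(zlib.compress(BUF)) == BUF

if __name__ == "__main__":
    start = time.time()
    with multiprocessing.Pool(WORKERS) as pool:
        pool.map(worker, range(WORKERS))
    print("completed in %.1f s" % (time.time() - start))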
Here is the setup for tests:
Intel Xeon E5-2600
8 cores, 25MB cache, 2 sockets, 2 threads per core.
Xen 4.4.3, default timeslice and ratelimit
Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+.
Dom0: kernel 4.1.0, 2 vcpus, not pinned.
DomU has 8 vcpus (except some cases).
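(For the pinned cases, the guest configuration was roughly of the following shape. This is
only a sketch, not the config file actually used; the memory size and the pinning string
are assumptions and depend on the host's pcpu numbering.)

# domU.cfg - illustrative sketch only, not the configuration actually used.
builder = "hvm"
name    = "perf-domu"
memory  = 4096            # assumption: the actual memory size is not stated above
vcpus   = 8               # 16 in the later runs
# For the pinned runs the vcpus were pinned to match the host topology,
# e.g. something along the lines of:
# cpus = "2-9"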
For the io-bound tests, results were better with the smt patches applied, for every kernel.
For the cpu-bound test the results differed depending on whether the vcpus were
pinned or not and on how many vcpus were assigned to the guest.
Please take a look at the graphs captured with xentrace -e 0x0002f000.
On the graphs X is the time in seconds since xentrace started and Y is the pcpu number;
each point represents the event of the scheduler placing a vcpu on a pcpu.
The graphs #1 & #2:
trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io bound test, one client/server
trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu bound test, 8 kernel threads
config: DomU, 8 vcpus not pinned, smt patches not applied, 2.6.39 kernel.
As can be seen here, the scheduler places the vcpus correctly on empty cores.
As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this?
Take a look at trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png
where I split data per vcpus.
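(For anyone who wants to reproduce the per-vcpu split: a small Python sketch of that kind of
post-processing is below. It assumes the formatted trace lines look like
"35v2 9.881103815 7", i.e. domid'v'vcpu, seconds since xentrace start, pcpu, as in the
excerpt quoted further down; the scripts actually used may differ.)

# split_per_vcpu.py - sketch of splitting scheduler-placement events per vcpu.
# Assumes formatted trace lines of the form "35v2 9.881103815 7"
# (domid 'v' vcpu, seconds since xentrace start, pcpu).
import collections
import sys

def load(path):
    per_vcpu = collections.defaultdict(list)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3 or "v" not in parts[0]:
                continue                      # skip anything else in the file
            vcpu, t, pcpu = parts[0], float(parts[1]), int(parts[2])
            per_vcpu[vcpu].append((t, pcpu))  # one (time, pcpu) point per event
    return per_vcpu

if __name__ == "__main__":
    for vcpu, points in sorted(load(sys.argv[1]).items()):
        print(vcpu, "events:", len(points))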
Now to cpu-bound tests.
With the smt patches applied, the vcpus pinned correctly to match the topology, and the
guest made aware of the topology, the cpu-bound tests did not show improvement with kernel 2.6.39.
With an upstream kernel we see some improvements. The test was repeated 5 times back to back.
The number of vcpus was increased to 16 to match the test case where Linux was not
aware of the topology and assumed all cpus were cores.
On some iterations one can see that the vcpus are being scheduled as expected.
For some runs the vcpus are placed on the same core (core/thread siblings) (see trace_cpu_16vcpus_8threads_5runs.out.plot.err.png).
This doubles the time it takes for the test to complete (the first three runs show close-to-baremetal execution time).
END: cycles: 31209326708 (29 seconds)
END: cycles: 30928835308 (28 seconds)
END: cycles: 31191626508 (29 seconds)
END: cycles: 50117313540 (46 seconds)
END: cycles: 49944848614 (46 seconds)
Since the vcpus are pinned, my guess is that the Linux scheduler makes wrong decisions?
So I ran the test with the smt patches enabled, but with the vcpus not pinned.
The result shows the same as above (see trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png):
Also see the per-cpu graph (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png).
END: cycles: 49740185572 (46 seconds)
END: cycles: 45862289546 (42 seconds)
END: cycles: 30976368378 (28 seconds)
END: cycles: 30886882143 (28 seconds)
END: cycles: 30806304256 (28 seconds)
I cut out the time slice where it can be seen that vcpu0 and vcpu2 run on the same core while other cores are idle:
35v2 9.881103815 7
35v0 9.881104013 6
35v2 9.892746452 7
35v0 9.892746546 6 -> vcpu0 gets scheduled right after vcpu2 on same core
35v0 9.904388175 6
35v2 9.904388205 7 -> same here
35v2 9.916029791 7
35v0 9.916029992 6
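(A quick way to spot such cases automatically is to scan the formatted trace for two vcpus
of the same domain landing on sibling pcpus at almost the same time. The Python sketch below
assumes pcpus 2n and 2n+1 are SMT siblings - true for the pcpu 6/7 pair above, but
host-dependent - and the same "35v2 9.881103815 7" line format as before; illustrative only.)

# sibling_check.py - flag events where two vcpus of one domain are placed on
# sibling hyperthreads within a small time window.  Assumes pcpus 2n and 2n+1
# are SMT siblings, which matches the pcpu 6/7 pair above but is host-dependent.
import sys

WINDOW = 0.001  # seconds: events closer than this count as "at the same time"

def events(path):
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3 and "v" in parts[0]:
                dom, vcpu = parts[0].split("v")
                yield dom, int(vcpu), float(parts[1]), int(parts[2])

last = {}  # pcpu -> (dom, vcpu, time) of the most recent placement on that pcpu
for dom, vcpu, t, pcpu in events(sys.argv[1]):
    sib = pcpu ^ 1                      # sibling thread under the 2n/2n+1 assumption
    if sib in last:
        sdom, svcpu, st = last[sib]
        if sdom == dom and svcpu != vcpu and abs(t - st) < WINDOW:
            print("%.9f: d%sv%d on pcpu %d while d%sv%d is on sibling pcpu %d"
                  % (t, dom, vcpu, pcpu, sdom, svcpu, sib))
    last[pcpu] = (dom, vcpu, t)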
Disabling the SMT option in the Linux config (which essentially means that the guest does not
have the correct topology and it is just flat) shows slightly better results - there
are no core/thread pairs being scheduled together while other cores are empty.
END: cycles: 41823591845 (38 seconds)
END: cycles: 41105093568 (38 seconds)
END: cycles: 30987224290 (28 seconds)
END: cycles: 31138979573 (29 seconds)
END: cycles: 31002228982 (28 seconds)
The graph is attached (trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.png).
I may have forgotten something here... Please ask me questions if I did.
Maybe you have some ideas what can be done here?
We are trying to make guests topology-aware, but it looks like for cpu-bound workloads it is
not that easy.
Any suggestions are welcome.
Thank you.
Elena
--
Elena
[-- Attachment #2: trace_cpu_16vcpus_16threads_5runs.out.plot.err.png --]
[-- Type: image/png, Size: 23710 bytes --]
[-- Attachment #3: trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.png --]
[-- Type: image/png, Size: 15999 bytes --]
[-- Attachment #4: trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png --]
[-- Type: image/png, Size: 16211 bytes --]
[-- Attachment #5: trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png --]
[-- Type: image/png, Size: 17335 bytes --]
[-- Attachment #6: trace_cpu_16vcpus_8threads_5runs.out.plot.err.png --]
[-- Type: image/png, Size: 27962 bytes --]
[-- Attachment #7: trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png --]
[-- Type: image/png, Size: 12940 bytes --]
[-- Attachment #8: trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png --]
[-- Type: image/png, Size: 15137 bytes --]
[-- Attachment #9: trace_iobound_nosmt_dom0notpinned.out.plot.err.png --]
[-- Type: image/png, Size: 17837 bytes --]
[-- Attachment #10: trace_cpu_smtapplied_smt0_totalcpus18_notpinned_5iters.out.plot.err.png --]
[-- Type: image/png, Size: 17769 bytes --]
[-- Attachment #11: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: schedulers and topology exposing questions 2016-01-22 16:54 schedulers and topology exposing questions Elena Ufimtseva @ 2016-01-22 17:29 ` Dario Faggioli 2016-01-22 23:58 ` Elena Ufimtseva 2016-01-26 11:21 ` George Dunlap 2016-01-27 14:01 ` Dario Faggioli 2 siblings, 1 reply; 22+ messages in thread From: Dario Faggioli @ 2016-01-22 17:29 UTC (permalink / raw) To: Elena Ufimtseva, xen-devel, george.dunlap, konrad.wilk, joao.m.martins, boris.ostrovsky [-- Attachment #1.1: Type: text/plain, Size: 4464 bytes --] On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote: > Hello all! > Hello, > Let me put some intro to our findings. I may forget something or put > something > not too explicit, please ask me. > > Customer filled a bug where some of the applications were running > slow in their HVM DomU setups. > These running times were compared against baremetal running same > kernel version as HVM DomU. > > After some investigation by different parties, the test case scenario > was found > where the problem was easily seen. The test app is a udp > server/client pair where > client passes some message n number of times. > The test case was executed on baremetal and Xen DomU with kernel > version 2.6.39. > Bare metal showed 2x times better result that DomU. > > Konrad came up with a workaround that was setting the flag for domain > scheduler in linux > As the guest is not aware of SMT-related topology, it has a flat > topology initialized. > Kernel has domain scheduler flags for scheduling domain CPU set to > 4143 for 2.6.39. > Konrad discovered that changing the flag for CPU sched domain to 4655 > works as a workaround and makes Linux think that the topology has SMT > threads. > This workaround makes the test to complete almost in same time as on > baremetal (or insignificantly worse). > > This workaround is not suitable for kernels of higher versions as we > discovered. > > The hackish way of making domU linux think that it has SMT threads > (along with matching cpuid) > made us thinks that the problem comes from the fact that cpu topology > is not exposed to > guest and Linux scheduler cannot make intelligent decision on > scheduling. > So, me an Juergen (from SuSE) have been working on this for a while too. As far as my experiments goes, there are at least two different issues, both traceable to Linux's scheduler behavior. One has to do with what you just say, i.e., topology. Juergen has developed a set of patches, and I'm running benchamrks with them applied to both Dom0 and DomU, to see how they work. I'm not far from finishing running a set of 324 different test cases (each one run both without and with Juergen's patches). I am running different benchamrks, such as: - iperf, - a Xen build, - sysbench --oltp, - sysbench --cpu, - unixbench and I'm also varying how loaded the host is, how big the VMs are, and how loaded the VMs are. 324 is the result of various combinations of the above... It's quite an extensive set! :-P As soon as everything finishes running, I'll data mine the results, and let you know how they look like. The other issue that I've observed is that tweaking some _non_ topology related scheduling domains' flags also impact performance, sometimes in a quite sensible way. I have got the results from the 324 test cases described above of running with flags set to 4131 inside all the DomUs. That value was chosen after quite a bit of preliminary benchmarking and investigation as well. 
I'll share the results of that data set as well as soon as I manage to extract them from the raw output. > Joao Martins from Oracle developed set of patches that fixed the > smt/core/cashe > topology numbering and provided matching pinning of vcpus and > enabling options, > allows to expose to guest correct topology. > I guess Joao will be posting it at some point. > That is one way of approaching the topology issue. The other, which is what me and Juergen are pursuing, is the opposite one, i.e., make the DomU (and Dom0, actually) think that the topology is always completely flat. I think, ideally, we want both: flat topology as the default, if no pinning is specifying. Matching topology if it is. > With this patches we decided to test the performance impact on > different kernel versionand Xen versions. > That is really interesting, and thanks a lot for sharing it with us. I'm in the middle of something here, so I just wanted to quickly let you know that we're also working on something related... I'll have a look at the rest of the email and at the graphs ASAP. Thanks again and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-22 17:29 ` Dario Faggioli @ 2016-01-22 23:58 ` Elena Ufimtseva 0 siblings, 0 replies; 22+ messages in thread From: Elena Ufimtseva @ 2016-01-22 23:58 UTC (permalink / raw) To: Dario Faggioli; +Cc: george.dunlap, joao.m.martins, boris.ostrovsky, xen-devel [-- Attachment #1: Type: text/plain, Size: 4970 bytes --] On Fri, Jan 22, 2016 at 06:29:19PM +0100, Dario Faggioli wrote: > On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote: > > Hello all! > > > Hello, > > > Let me put some intro to our findings. I may forget something or put > > something > > not too explicit, please ask me. > > > > Customer filled a bug where some of the applications were running > > slow in their HVM DomU setups. > > These running times were compared against baremetal running same > > kernel version as HVM DomU. > > > > After some investigation by different parties, the test case scenario > > was found > > where the problem was easily seen. The test app is a udp > > server/client pair where > > client passes some message n number of times. > > The test case was executed on baremetal and Xen DomU with kernel > > version 2.6.39. > > Bare metal showed 2x times better result that DomU. > > > > Konrad came up with a workaround that was setting the flag for domain > > scheduler in linux > > As the guest is not aware of SMT-related topology, it has a flat > > topology initialized. > > Kernel has domain scheduler flags for scheduling domain CPU set to > > 4143 for 2.6.39. > > Konrad discovered that changing the flag for CPU sched domain to 4655 > > works as a workaround and makes Linux think that the topology has SMT > > threads. > > This workaround makes the test to complete almost in same time as on > > baremetal (or insignificantly worse). > > > > This workaround is not suitable for kernels of higher versions as we > > discovered. > > > > The hackish way of making domU linux think that it has SMT threads > > (along with matching cpuid) > > made us thinks that the problem comes from the fact that cpu topology > > is not exposed to > > guest and Linux scheduler cannot make intelligent decision on > > scheduling. > > > So, me an Juergen (from SuSE) have been working on this for a while > too. > > As far as my experiments goes, there are at least two different issues, > both traceable to Linux's scheduler behavior. One has to do with what > you just say, i.e., topology. > > Juergen has developed a set of patches, and I'm running benchamrks with > them applied to both Dom0 and DomU, to see how they work. > > I'm not far from finishing running a set of 324 different test cases > (each one run both without and with Juergen's patches). I am running > different benchamrks, such as: > - iperf, > - a Xen build, > - sysbench --oltp, > - sysbench --cpu, > - unixbench > > and I'm also varying how loaded the host is, how big the VMs are, and > how loaded the VMs are. Thats pretty cool. I also tried in my tests oversubscribed tests. > > 324 is the result of various combinations of the above... It's quite an > extensive set! :-P It is! Even with my few tests its a lot of work. > > As soon as everything finishes running, I'll data mine the results, and > let you know how they look like. > > > The other issue that I've observed is that tweaking some _non_ topology > related scheduling domains' flags also impact performance, sometimes in > a quite sensible way. 
> > I have got the results from the 324 test cases described above of > running with flags set to 4131 inside all the DomUs. That value was > chosen after quite a bit of preliminary benchmarking and investigation > as well. > > I'll share the results of that data set as well as soon as I manage to > extract them from the raw output. > > > Joao Martins from Oracle developed set of patches that fixed the > > smt/core/cashe > > topology numbering and provided matching pinning of vcpus and > > enabling options, > > allows to expose to guest correct topology. > > I guess Joao will be posting it at some point. > > > That is one way of approaching the topology issue. The other, which is > what me and Juergen are pursuing, is the opposite one, i.e., make the > DomU (and Dom0, actually) think that the topology is always completely > flat. > > I think, ideally, we want both: flat topology as the default, if no > pinning is specifying. Matching topology if it is. > > > With this patches we decided to test the performance impact on > > different kernel versionand Xen versions. > > > That is really interesting, and thanks a lot for sharing it with us. > > I'm in the middle of something here, so I just wanted to quickly let > you know that we're also working on something related... I'll have a > look at the rest of the email and at the graphs ASAP. Great! I am attaching the io and cpu-bound tests that were used to get the data. Thanks Dario! > > Thanks again and Regards, > Dario > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > [-- Attachment #2: perf_tests.tar.gz --] [-- Type: application/gzip, Size: 7455 bytes --] [-- Attachment #3: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-22 16:54 schedulers and topology exposing questions Elena Ufimtseva 2016-01-22 17:29 ` Dario Faggioli @ 2016-01-26 11:21 ` George Dunlap 2016-01-27 14:25 ` Dario Faggioli 2016-01-27 14:33 ` Konrad Rzeszutek Wilk 2016-01-27 14:01 ` Dario Faggioli 2 siblings, 2 replies; 22+ messages in thread From: George Dunlap @ 2016-01-26 11:21 UTC (permalink / raw) To: Elena Ufimtseva, xen-devel, dario.faggioli, george.dunlap, konrad.wilk, joao.m.martins, boris.ostrovsky On 22/01/16 16:54, Elena Ufimtseva wrote: > Hello all! > > Dario, Gerorge or anyone else, your help will be appreciated. > > Let me put some intro to our findings. I may forget something or put something > not too explicit, please ask me. > > Customer filled a bug where some of the applications were running slow in their HVM DomU setups. > These running times were compared against baremetal running same kernel version as HVM DomU. > > After some investigation by different parties, the test case scenario was found > where the problem was easily seen. The test app is a udp server/client pair where > client passes some message n number of times. > The test case was executed on baremetal and Xen DomU with kernel version 2.6.39. > Bare metal showed 2x times better result that DomU. > > Konrad came up with a workaround that was setting the flag for domain scheduler in linux > As the guest is not aware of SMT-related topology, it has a flat topology initialized. > Kernel has domain scheduler flags for scheduling domain CPU set to 4143 for 2.6.39. > Konrad discovered that changing the flag for CPU sched domain to 4655 > works as a workaround and makes Linux think that the topology has SMT threads. > This workaround makes the test to complete almost in same time as on baremetal (or insignificantly worse). > > This workaround is not suitable for kernels of higher versions as we discovered. > > The hackish way of making domU linux think that it has SMT threads (along with matching cpuid) > made us thinks that the problem comes from the fact that cpu topology is not exposed to > guest and Linux scheduler cannot make intelligent decision on scheduling. > > Joao Martins from Oracle developed set of patches that fixed the smt/core/cashe > topology numbering and provided matching pinning of vcpus and enabling options, > allows to expose to guest correct topology. > I guess Joao will be posting it at some point. > > With this patches we decided to test the performance impact on different kernel versionand Xen versions. > > The test described above was labeled as IO-bound test. So just to clarify: The client sends a request (presumably not much more than a ping) to the server, and waits for the server to respond before sending another one; and the server does the reverse -- receives a request, responds, and then waits for the next request. Is that right? How much data is transferred? If the amount of data transferred is tiny, then the bottleneck for the test is probably the IPI time, and I'd call this a "ping-pong" benchmark[1]. I would only call this "io-bound" if you're actually copying large amounts of data. Regarding placement wrt topology: If two threads are doing a large amount of communication, then putting them close in the topology will increase perfomance, because they share cache, and the IPI distance between them is much shorter. If they rarely run at the same time, being on the same thread is probably the ideal. 
On the other hand, if two threads are running mostly independently, and each one is using a lot of cache, then having the threads at opposite ends of the topology will increase performance, since that will increase the aggregate cache used by both. The ideal in this case would certainly be for each thread to run on a separate socket. At the moment, neither the Credit1 and Credit2 schedulers take communication into account; they only account for processing time, and thus silently assume that all workloads are cache-hungry and non-communicating. [1] https://www.google.co.uk/search?q=ping+pong+benchmark > We have run io-bound test with and without smt-patches. The improvement comparing > to base case (no smt patches, flat topology) shows 22-23% gain. > > While we have seen improvement with io-bound tests, the same did not happen with cpu-bound workload. > As cpu-bound test we use kernel module which runs requested number of kernel threads > and each thread compresses and decompresses some data. > > Here is the setup for tests: > Intel Xeon E5 2600 > 8 cores, 25MB Cashe, 2 sockets, 2 threads per core. > Xen 4.4.3, default timeslice and ratelimit > Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+. > Dom0: kernel 4.1.0, 2 vcpus, not pinned. > DomU has 8 vcpus (except some cases). > > > For io-bound tests results were better with smt patches applied for every kernel. > > For cpu-bound test the results were different depending on wether > vcpus were pinned or not, how many vcpus were assigned to the guest. Looking through your mail, I can't quite figure out if "io-bound tests with the smt patches applied" here means "smt+pinned" or just "smt" (unpinned). (Or both.) Assuming that the Linux kernel takes process communication into account in its scheduling decisions, I would expect smt+pinning to have the kind of performance improvement you observe. I would expect that smt without pinning would have very little effect -- or might be actively worse, since the topology information would then be actively wrong as soon as the scheduler moved the vcpus. The fact that exposing topology of the cpu-bound workload didn't help sounds expected to me -- the Xen scheduler already tries to optimize for the cpu-bound case, so in the [non-smt, unpinned] case probably places things on the physical hardware similar to the way Linux places it in the [smt, pinned] case. > Please take a look at the graph captured by xentrace -e 0x0002f000 > On the graphs X is time in seconds since xentrace start, Y is the pcpu number, > the graph itself represent the event when scheduler places vcpu to pcpu. > > The graphs #1 & #2: > trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io bound test, one client/server > trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu bound test, 8 kernel theads > config: domu, 8vcpus not pinned, smt patches not applied, 2.3.69 kernel. > > As can be seen here scheduler places the vcpus correctly on empty cores. > As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this? Well it looks like vcpu0 does the lion's share of the work, while the other vcpus more or less share the work. So the scheduler gives vcpu0 its own socket (more or less), while the other ones share the other socket (optimizing for maximum cache usage). > Take a look at trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png > where I split data per vcpus. > > > Now to cpu-bound tests. 
> When smt patches applied and vcpus pinned correctly to match the topology and > guest become aware of the topology, cpu-bound tests did not show improvement with kernel 2.6.39. > With upstream kernel we see some improvements. The tes was repeated 5 times back to back. > The number of vcpus was increased to 16 to match the test case where linux was not > aware of the topology and assumed all cpus as cores. > > On some iterations one can see that vcpus are being scheduled as expected. > For some runs the vcpus are placed on came core (core/thread) (see trace_cpu_16vcpus_8threads_5runs.out.plot.err.png). > It doubles the time it takes for test to complete (first three runs show close to baremetal execution time). > > END: cycles: 31209326708 (29 seconds) > END: cycles: 30928835308 (28 seconds) > END: cycles: 31191626508 (29 seconds) > END: cycles: 50117313540 (46 seconds) > END: cycles: 49944848614 (46 seconds) > > Since the vcpus are pinned, then my guess is that Linux scheduler makes wrong decisions? Hmm -- could it be that the logic detecting whether the threads are "cpu-bound" (and thus want their own cache) vs "communicating" (and thus want to share a thread) is triggering differently in each case? Or maybe neither is true, and placement from the Linux side is more or less random. :-) > So I ran the test with smt patches enabled, but not pinned vcpus. > > result is also shows the same as above (see trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png): > Also see the per-cpu graph (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png). > > END: cycles: 49740185572 (46 seconds) > END: cycles: 45862289546 (42 seconds) > END: cycles: 30976368378 (28 seconds) > END: cycles: 30886882143 (28 seconds) > END: cycles: 30806304256 (28 seconds) > > I cut the timeslice where its seen that vcpu0 and vcpu2 run on same core while other cores are idle: > > 35v2 9.881103815 7 > 35v0 9.881104013 6 > > 35v2 9.892746452 7 > 35v0 9.892746546 6 -> vcpu0 gets scheduled right after vcpu2 on same core > > 35v0 9.904388175 6 > 35v2 9.904388205 7 -> same here > > 35v2 9.916029791 7 > 35v0 9.916029992 6 > > Disabling smt option in linux config (what essentially means that guest does not > have correct topology and its just flat shows slightly better results - there > are no cores and threads being scheduled in pair while other cores are empty. > > END: cycles: 41823591845 (38 seconds) > END: cycles: 41105093568 (38 seconds) > END: cycles: 30987224290 (28 seconds) > END: cycles: 31138979573 (29 seconds) > END: cycles: 31002228982 (28 seconds) > > and graph is attached (trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.png). This is a bit strange. You're showing that for *unpinned* vcpus, with empty cores, there are vcpus sharing the same thread for significant periods of time? That definitely shouldn't happen. It looks like you still have a fairly "bimodal" distribution even in the "no-smt unpinned" scenario -- just 28<->38 rather than 28<->45-ish. Could you try a couple of these tests with the credit2 scheduler, just to see? You'd have to make sure and use one of the versions that has hard pinning enabled; I don't think that made 4.6, so you'd have to use xen-unstable I think. > I may have forgotten something here.. Please ask me questions if I did. > > Maybe you have some ideas what can be done here? > > We try to make guests topology aware but looks like for cpu bound workloads its > not that easy. > Any suggestions are welcome. 
Well one option is always, as you say, to try to expose the topology to the guest. But that is a fairly limited solution -- in order for that information to be accurate, the vcpus need to be pinned, which in turn means 1) a lot more effort required by admins, and 2) a lot less opportunity for sharing of resources which is one of the big 'wins' for virtualization. The other option is, as Dario said, to remove all topology information from Linux, and add functionality to the Xen schedulers to attempt to identify vcpus which are communicating or sharing in some other way, and try to co-locate them. This is a lot easier and more flexible for users, but a lot more work for us. -George ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-26 11:21 ` George Dunlap @ 2016-01-27 14:25 ` Dario Faggioli 2016-01-27 14:33 ` Konrad Rzeszutek Wilk 1 sibling, 0 replies; 22+ messages in thread From: Dario Faggioli @ 2016-01-27 14:25 UTC (permalink / raw) To: George Dunlap, Elena Ufimtseva, xen-devel, george.dunlap, konrad.wilk, joao.m.martins, boris.ostrovsky [-- Attachment #1.1: Type: text/plain, Size: 9144 bytes --] On Tue, 2016-01-26 at 11:21 +0000, George Dunlap wrote: > On 22/01/16 16:54, Elena Ufimtseva wrote: > > > Regarding placement wrt topology: If two threads are doing a large > amount of communication, then putting them close in the topology will > increase perfomance, because they share cache, and the IPI distance > between them is much shorter. If they rarely run at the same time, > being on the same thread is probably the ideal. > Yes, this make sense to me... a bit hard to do the guessing right, but if we could, it would be a good thing to do. > On the other hand, if two threads are running mostly independently, > and > each one is using a lot of cache, then having the threads at opposite > ends of the topology will increase performance, since that will > increase > the aggregate cache used by both. The ideal in this case would > certainly be for each thread to run on a separate socket. > > At the moment, neither the Credit1 and Credit2 schedulers take > communication into account; they only account for processing time, > and > thus silently assume that all workloads are cache-hungry and > non-communicating. > I don't think Linux's scheduler does anything like that either. One can say that --speaking again about the flags of the scheduling domains-- you can try to use the SD_BALANCE_FORK and EXEC, together with the knowledge of who runs first, between father or child, something like what you say could be implemented... But even in this case, it's not at all explicit, and it would only be effective near fork() and exec() calls. The same is true, with all the due differences, for the other flags. Also, there is a great amount of logic to deal with task groups, and, e.g., provide fairness to task groups instead than to single tasks, etc. An I guess one can assume that the tasks in the same group does communicate, and things like that, but that's again nothing specifically taking any communication pattern into account (and it changed --growing and loosing features-- quite frenetically over the last years, so I don't think this has a say in what Elena is seeing. > Assuming that the Linux kernel takes process communication into > account > in its scheduling decisions, I would expect smt+pinning to have the > kind > of performance improvement you observe. I would expect that smt > without > pinning would have very little effect -- or might be actively worse, > since the topology information would then be actively wrong as soon > as > the scheduler moved the vcpus. > We better check then (if Linux has these characteristics), because I don't think it does. I can, and will, check myself, just not right now. :-/ So I ran the test with smt patches enabled, but not pinned vcpus. > > > > result is also shows the same as above (see > > trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.pn > > g): > > Also see the per-cpu graph > > (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_p > > ervcpu.png). 
> > > > END: cycles: 49740185572 (46 seconds) > > END: cycles: 45862289546 (42 seconds) > > END: cycles: 30976368378 (28 seconds) > > END: cycles: 30886882143 (28 seconds) > > END: cycles: 30806304256 (28 seconds) > > > > I cut the timeslice where its seen that vcpu0 and vcpu2 run on same > > core while other cores are idle: > > > > 35v2 9.881103815 > > 7 > > 35v0 9.881104013 6 > > > > 35v2 9.892746452 > > 7 > > 35v0 9.892746546 6 -> vcpu0 gets scheduled right after vcpu2 on > > same core > > > > 35v0 9.904388175 > > 6 > > 35v2 9.904388205 7 -> same here > > > > 35v2 9.916029791 > > 7 > > 35v0 9.916029992 > > 6 > > > > Disabling smt option in linux config (what essentially means that > > guest does not > > have correct topology and its just flat shows slightly better > > results - there > > are no cores and threads being scheduled in pair while other cores > > are empty. > > > > END: cycles: 41823591845 (38 seconds) > > END: cycles: 41105093568 (38 seconds) > > END: cycles: 30987224290 (28 seconds) > > END: cycles: 31138979573 (29 seconds) > > END: cycles: 31002228982 (28 seconds) > > > > and graph is attached > > (trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.p > > ng). > > This is a bit strange. You're showing that for *unpinned* vcpus, > with > empty cores, there are vcpus sharing the same thread for significant > periods of time? That definitely shouldn't happen. > I totally agree with George on this, the key word of what's he is saying being _significant_. In fact, this is the perfect summary/bottom line of my explanation of how our SMT load balancer in Credit1 works... :-) From just looking at the graph, I can't spot many places where this happens really for a significant amount of time. Am I wrong? For knowing for sure, we need to check the full trace. Ah, given the above, one could ask why we do not change Credit1 to actually do the SMT load balancing more frequently, e.g., at each vcpu wakeup. That is certainly a possibility, but there is the risk that the overhead of doing that too frequently (and there is indeed some overhead!) absorb the benefits of a more efficient placing (and I do think that will be the case, the wakeup path, in Credit1, is already complex and crowded enough! :-/). > Could you try a couple of these tests with the credit2 scheduler, > just > to see? You'd have to make sure and use one of the versions that has > hard pinning enabled; I don't think that made 4.6, so you'd have to > use > xen-unstable I think. > Nope, sorry, I would not do that, yet. I've got things half done already, but I have been sidetracked by other things (including this one), and so Credit2 is not yet in a shape where running the benchmarks with it, even if using staging, would represent a fair comparison between Credit1. I'll get back to the work that is still pending to make that possible in a bit (after FOSDEM, i.e., next week). > > We try to make guests topology aware but looks like for cpu bound > > workloads its > > not that easy. > > Any suggestions are welcome. > > well one option is always, as you say, to try to expose the topology > to > But that is a fairly limited solution -- in order for that > information to be accurate, the vcpus need to be pinned, which in > turn > means 1) a lot more effort required by admins, and 2) a lot less > opportunity for sharing of resources which is one of the big 'wins' > for > virtualization. > Exactly. 
I indeed think that it would be good to support this mode, but it has to be put very clear that, either the pinning does not ever ever ever ever ever...ever change, or you'll get back to pseudo-random scheduler(s)' behavior, leading to unpredictable and inconsistent performance (maybe better, maybe worse, but certainly inconsistent). Or, when pinning changes, we figure out a way to tell Linux (the scheduler and every pother component that needs to know) that something like that happened. As far as scheduling domains goes, I think there is a way to ask the kernel to rebuild the hierarcy, but I've never tried that, and I don't know it's available to userspace (already). > The other option is, as Dario said, to remove all topology > information > from Linux, and add functionality to the Xen schedulers to attempt to > identify vcpus which are communicating or sharing in some other way, > and > try to co-locate them. This is a lot easier and more flexible for > users, but a lot more work for us. > This is the only way that Linux will see something that is not wrong, if pinning is not used, so I think we really want this, and want it to be the default (and Juergen has patches for this! :-D) Thanks again and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-26 11:21 ` George Dunlap 2016-01-27 14:25 ` Dario Faggioli @ 2016-01-27 14:33 ` Konrad Rzeszutek Wilk 2016-01-27 15:10 ` George Dunlap 1 sibling, 1 reply; 22+ messages in thread From: Konrad Rzeszutek Wilk @ 2016-01-27 14:33 UTC (permalink / raw) To: George Dunlap Cc: Elena Ufimtseva, george.dunlap, dario.faggioli, xen-devel, joao.m.martins, boris.ostrovsky On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: > On 22/01/16 16:54, Elena Ufimtseva wrote: > > Hello all! > > > > Dario, Gerorge or anyone else, your help will be appreciated. > > > > Let me put some intro to our findings. I may forget something or put something > > not too explicit, please ask me. > > > > Customer filled a bug where some of the applications were running slow in their HVM DomU setups. > > These running times were compared against baremetal running same kernel version as HVM DomU. > > > > After some investigation by different parties, the test case scenario was found > > where the problem was easily seen. The test app is a udp server/client pair where > > client passes some message n number of times. > > The test case was executed on baremetal and Xen DomU with kernel version 2.6.39. > > Bare metal showed 2x times better result that DomU. > > > > Konrad came up with a workaround that was setting the flag for domain scheduler in linux > > As the guest is not aware of SMT-related topology, it has a flat topology initialized. > > Kernel has domain scheduler flags for scheduling domain CPU set to 4143 for 2.6.39. > > Konrad discovered that changing the flag for CPU sched domain to 4655 > > works as a workaround and makes Linux think that the topology has SMT threads. > > This workaround makes the test to complete almost in same time as on baremetal (or insignificantly worse). > > > > This workaround is not suitable for kernels of higher versions as we discovered. > > > > The hackish way of making domU linux think that it has SMT threads (along with matching cpuid) > > made us thinks that the problem comes from the fact that cpu topology is not exposed to > > guest and Linux scheduler cannot make intelligent decision on scheduling. > > > > Joao Martins from Oracle developed set of patches that fixed the smt/core/cashe > > topology numbering and provided matching pinning of vcpus and enabling options, > > allows to expose to guest correct topology. > > I guess Joao will be posting it at some point. > > > > With this patches we decided to test the performance impact on different kernel versionand Xen versions. > > > > The test described above was labeled as IO-bound test. > > So just to clarify: The client sends a request (presumably not much more > than a ping) to the server, and waits for the server to respond before > sending another one; and the server does the reverse -- receives a > request, responds, and then waits for the next request. Is that right? Yes. > > How much data is transferred? 1 packet, UDP > > If the amount of data transferred is tiny, then the bottleneck for the > test is probably the IPI time, and I'd call this a "ping-pong" > benchmark[1]. I would only call this "io-bound" if you're actually > copying large amounts of data. What we found is that on baremetal the scheduler would put both apps on the same CPU and schedule them right after each other. This would have a high IPI as the scheduler would poke itself. On Xen it would put the two applications on seperate CPUs - and there would be hardly any IPI. 
Digging deeper in the code I found out that if you do an UDP sendmsg without any timeouts - it would put it in a queue and just call schedule. On baremetal the schedule would result in scheduler picking up the other task, and starting it - which would dequeue immediately. On Xen - the schedule() would go HLT.. and then later be woken up by the VIRQ_TIMER. And since the two applications were on seperate CPUs - the single packet would just stick in the queue until the VIRQ_TIMER arrived. I found out that if I expose the SMT topology to the guest (which is what baremetal sees) suddenly the Linux scheduler would behave the same way as under baremetal. To be fair - this is a very .. ping-pong no-CPU bound workload. If the amount of communication was huge it would probably behave a bit differently - as the queue would fill up - and by the time the VIRQ_TIMER hit the other CPU - it would have a nice chunk of data to eat through. > > Regarding placement wrt topology: If two threads are doing a large > amount of communication, then putting them close in the topology will > increase perfomance, because they share cache, and the IPI distance > between them is much shorter. If they rarely run at the same time, > being on the same thread is probably the ideal. This is a ping-pong type workload - very much serialized. > > On the other hand, if two threads are running mostly independently, and > each one is using a lot of cache, then having the threads at opposite > ends of the topology will increase performance, since that will increase > the aggregate cache used by both. The ideal in this case would > certainly be for each thread to run on a separate socket. > > At the moment, neither the Credit1 and Credit2 schedulers take > communication into account; they only account for processing time, and > thus silently assume that all workloads are cache-hungry and > non-communicating. And this is very much the opposite of that :-) > > [1] https://www.google.co.uk/search?q=ping+pong+benchmark > > > We have run io-bound test with and without smt-patches. The improvement comparing > > to base case (no smt patches, flat topology) shows 22-23% gain. > > > > While we have seen improvement with io-bound tests, the same did not happen with cpu-bound workload. > > As cpu-bound test we use kernel module which runs requested number of kernel threads > > and each thread compresses and decompresses some data. > > > > Here is the setup for tests: > > Intel Xeon E5 2600 > > 8 cores, 25MB Cashe, 2 sockets, 2 threads per core. > > Xen 4.4.3, default timeslice and ratelimit > > Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+. > > Dom0: kernel 4.1.0, 2 vcpus, not pinned. > > DomU has 8 vcpus (except some cases). > > > > > > For io-bound tests results were better with smt patches applied for every kernel. > > > > For cpu-bound test the results were different depending on wether > > vcpus were pinned or not, how many vcpus were assigned to the guest. > > Looking through your mail, I can't quite figure out if "io-bound tests > with the smt patches applied" here means "smt+pinned" or just "smt" > (unpinned). (Or both.) > > Assuming that the Linux kernel takes process communication into account > in its scheduling decisions, I would expect smt+pinning to have the kind > of performance improvement you observe. I would expect that smt without > pinning would have very little effect -- or might be actively worse, > since the topology information would then be actively wrong as soon as > the scheduler moved the vcpus. 
> > The fact that exposing topology of the cpu-bound workload didn't help > sounds expected to me -- the Xen scheduler already tries to optimize for > the cpu-bound case, so in the [non-smt, unpinned] case probably places > things on the physical hardware similar to the way Linux places it in > the [smt, pinned] case. > > > Please take a look at the graph captured by xentrace -e 0x0002f000 > > On the graphs X is time in seconds since xentrace start, Y is the pcpu number, > > the graph itself represent the event when scheduler places vcpu to pcpu. > > > > The graphs #1 & #2: > > trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io bound test, one client/server > > trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu bound test, 8 kernel theads > > config: domu, 8vcpus not pinned, smt patches not applied, 2.3.69 kernel. > > > > As can be seen here scheduler places the vcpus correctly on empty cores. > > As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this? > > Well it looks like vcpu0 does the lion's share of the work, while the > other vcpus more or less share the work. So the scheduler gives vcpu0 > its own socket (more or less), while the other ones share the other > socket (optimizing for maximum cache usage). > > > Take a look at trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png > > where I split data per vcpus. > > > > > > Now to cpu-bound tests. > > When smt patches applied and vcpus pinned correctly to match the topology and > > guest become aware of the topology, cpu-bound tests did not show improvement with kernel 2.6.39. > > With upstream kernel we see some improvements. The tes was repeated 5 times back to back. > > The number of vcpus was increased to 16 to match the test case where linux was not > > aware of the topology and assumed all cpus as cores. > > > > On some iterations one can see that vcpus are being scheduled as expected. > > For some runs the vcpus are placed on came core (core/thread) (see trace_cpu_16vcpus_8threads_5runs.out.plot.err.png). > > It doubles the time it takes for test to complete (first three runs show close to baremetal execution time). > > > > END: cycles: 31209326708 (29 seconds) > > END: cycles: 30928835308 (28 seconds) > > END: cycles: 31191626508 (29 seconds) > > END: cycles: 50117313540 (46 seconds) > > END: cycles: 49944848614 (46 seconds) > > > > Since the vcpus are pinned, then my guess is that Linux scheduler makes wrong decisions? > > Hmm -- could it be that the logic detecting whether the threads are > "cpu-bound" (and thus want their own cache) vs "communicating" (and thus > want to share a thread) is triggering differently in each case? > > Or maybe neither is true, and placement from the Linux side is more or > less random. :-) > > > So I ran the test with smt patches enabled, but not pinned vcpus. > > > > result is also shows the same as above (see trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png): > > Also see the per-cpu graph (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png). 
> > > > END: cycles: 49740185572 (46 seconds) > > END: cycles: 45862289546 (42 seconds) > > END: cycles: 30976368378 (28 seconds) > > END: cycles: 30886882143 (28 seconds) > > END: cycles: 30806304256 (28 seconds) > > > > I cut the timeslice where its seen that vcpu0 and vcpu2 run on same core while other cores are idle: > > > > 35v2 9.881103815 7 > > 35v0 9.881104013 6 > > > > 35v2 9.892746452 7 > > 35v0 9.892746546 6 -> vcpu0 gets scheduled right after vcpu2 on same core > > > > 35v0 9.904388175 6 > > 35v2 9.904388205 7 -> same here > > > > 35v2 9.916029791 7 > > 35v0 9.916029992 6 > > > > Disabling smt option in linux config (what essentially means that guest does not > > have correct topology and its just flat shows slightly better results - there > > are no cores and threads being scheduled in pair while other cores are empty. > > > > END: cycles: 41823591845 (38 seconds) > > END: cycles: 41105093568 (38 seconds) > > END: cycles: 30987224290 (28 seconds) > > END: cycles: 31138979573 (29 seconds) > > END: cycles: 31002228982 (28 seconds) > > > > and graph is attached (trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.png). > > This is a bit strange. You're showing that for *unpinned* vcpus, with > empty cores, there are vcpus sharing the same thread for significant > periods of time? That definitely shouldn't happen. > > It looks like you still have a fairly "bimodal" distribution even in the > "no-smt unpinned" scenario -- just 28<->38 rather than 28<->45-ish. > > Could you try a couple of these tests with the credit2 scheduler, just > to see? You'd have to make sure and use one of the versions that has > hard pinning enabled; I don't think that made 4.6, so you'd have to use > xen-unstable I think. > > > I may have forgotten something here.. Please ask me questions if I did. > > > > Maybe you have some ideas what can be done here? > > > > We try to make guests topology aware but looks like for cpu bound workloads its > > not that easy. > > Any suggestions are welcome. > > Well one option is always, as you say, to try to expose the topology to > the guest. But that is a fairly limited solution -- in order for that > information to be accurate, the vcpus need to be pinned, which in turn > means 1) a lot more effort required by admins, and 2) a lot less > opportunity for sharing of resources which is one of the big 'wins' for > virtualization. > > The other option is, as Dario said, to remove all topology information > from Linux, and add functionality to the Xen schedulers to attempt to > identify vcpus which are communicating or sharing in some other way, and > try to co-locate them. This is a lot easier and more flexible for > users, but a lot more work for us. > > -George > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-27 14:33 ` Konrad Rzeszutek Wilk @ 2016-01-27 15:10 ` George Dunlap 2016-01-27 15:27 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 22+ messages in thread From: George Dunlap @ 2016-01-27 15:10 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Elena Ufimtseva, george.dunlap, dario.faggioli, xen-devel, joao.m.martins, boris.ostrovsky On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: >> On 22/01/16 16:54, Elena Ufimtseva wrote: >>> Hello all! >>> >>> Dario, Gerorge or anyone else, your help will be appreciated. >>> >>> Let me put some intro to our findings. I may forget something or put something >>> not too explicit, please ask me. >>> >>> Customer filled a bug where some of the applications were running slow in their HVM DomU setups. >>> These running times were compared against baremetal running same kernel version as HVM DomU. >>> >>> After some investigation by different parties, the test case scenario was found >>> where the problem was easily seen. The test app is a udp server/client pair where >>> client passes some message n number of times. >>> The test case was executed on baremetal and Xen DomU with kernel version 2.6.39. >>> Bare metal showed 2x times better result that DomU. >>> >>> Konrad came up with a workaround that was setting the flag for domain scheduler in linux >>> As the guest is not aware of SMT-related topology, it has a flat topology initialized. >>> Kernel has domain scheduler flags for scheduling domain CPU set to 4143 for 2.6.39. >>> Konrad discovered that changing the flag for CPU sched domain to 4655 >>> works as a workaround and makes Linux think that the topology has SMT threads. >>> This workaround makes the test to complete almost in same time as on baremetal (or insignificantly worse). >>> >>> This workaround is not suitable for kernels of higher versions as we discovered. >>> >>> The hackish way of making domU linux think that it has SMT threads (along with matching cpuid) >>> made us thinks that the problem comes from the fact that cpu topology is not exposed to >>> guest and Linux scheduler cannot make intelligent decision on scheduling. >>> >>> Joao Martins from Oracle developed set of patches that fixed the smt/core/cashe >>> topology numbering and provided matching pinning of vcpus and enabling options, >>> allows to expose to guest correct topology. >>> I guess Joao will be posting it at some point. >>> >>> With this patches we decided to test the performance impact on different kernel versionand Xen versions. >>> >>> The test described above was labeled as IO-bound test. >> >> So just to clarify: The client sends a request (presumably not much more >> than a ping) to the server, and waits for the server to respond before >> sending another one; and the server does the reverse -- receives a >> request, responds, and then waits for the next request. Is that right? > > Yes. >> >> How much data is transferred? > > 1 packet, UDP >> >> If the amount of data transferred is tiny, then the bottleneck for the >> test is probably the IPI time, and I'd call this a "ping-pong" >> benchmark[1]. I would only call this "io-bound" if you're actually >> copying large amounts of data. > > What we found is that on baremetal the scheduler would put both apps > on the same CPU and schedule them right after each other. This would > have a high IPI as the scheduler would poke itself. 
> On Xen it would put the two applications on seperate CPUs - and there > would be hardly any IPI. Sorry -- why would the scheduler send itself an IPI if it's on the same logical cpu (which seems pretty pointless), but *not* send an IPI to the *other* processor when it was actually waking up another task? Or do you mean high context switch rate? > Digging deeper in the code I found out that if you do an UDP sendmsg > without any timeouts - it would put it in a queue and just call schedule. You mean, it would mark the other process as runnable somehow, but not actually send an IPI to wake it up? Is that a new "feature" designed for large systems, to reduce the IPI traffic or something? > On baremetal the schedule would result in scheduler picking up the other > task, and starting it - which would dequeue immediately. > > On Xen - the schedule() would go HLT.. and then later be woken up by the > VIRQ_TIMER. And since the two applications were on seperate CPUs - the > single packet would just stick in the queue until the VIRQ_TIMER arrived. I'm not sure I understand the situation right, but it sounds a bit like what you're seeing is just a quirk of the fact that Linux doesn't always send IPIs to wake other processes up (either by design or by accident), but relies on scheduling timers to check for work to do. Presumably they knew that low performance on ping-pong workloads might be a possibility when they wrote the code that way; I don't see a reason why we should try to work around that in Xen. -George ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-27 15:10 ` George Dunlap @ 2016-01-27 15:27 ` Konrad Rzeszutek Wilk 2016-01-27 15:53 ` George Dunlap ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Konrad Rzeszutek Wilk @ 2016-01-27 15:27 UTC (permalink / raw) To: George Dunlap Cc: Elena Ufimtseva, george.dunlap, dario.faggioli, xen-devel, joao.m.martins, boris.ostrovsky On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: > > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: > >> On 22/01/16 16:54, Elena Ufimtseva wrote: > >>> Hello all! > >>> > >>> Dario, Gerorge or anyone else, your help will be appreciated. > >>> > >>> Let me put some intro to our findings. I may forget something or put something > >>> not too explicit, please ask me. > >>> > >>> Customer filled a bug where some of the applications were running slow in their HVM DomU setups. > >>> These running times were compared against baremetal running same kernel version as HVM DomU. > >>> > >>> After some investigation by different parties, the test case scenario was found > >>> where the problem was easily seen. The test app is a udp server/client pair where > >>> client passes some message n number of times. > >>> The test case was executed on baremetal and Xen DomU with kernel version 2.6.39. > >>> Bare metal showed 2x times better result that DomU. > >>> > >>> Konrad came up with a workaround that was setting the flag for domain scheduler in linux > >>> As the guest is not aware of SMT-related topology, it has a flat topology initialized. > >>> Kernel has domain scheduler flags for scheduling domain CPU set to 4143 for 2.6.39. > >>> Konrad discovered that changing the flag for CPU sched domain to 4655 > >>> works as a workaround and makes Linux think that the topology has SMT threads. > >>> This workaround makes the test to complete almost in same time as on baremetal (or insignificantly worse). > >>> > >>> This workaround is not suitable for kernels of higher versions as we discovered. > >>> > >>> The hackish way of making domU linux think that it has SMT threads (along with matching cpuid) > >>> made us thinks that the problem comes from the fact that cpu topology is not exposed to > >>> guest and Linux scheduler cannot make intelligent decision on scheduling. > >>> > >>> Joao Martins from Oracle developed set of patches that fixed the smt/core/cashe > >>> topology numbering and provided matching pinning of vcpus and enabling options, > >>> allows to expose to guest correct topology. > >>> I guess Joao will be posting it at some point. > >>> > >>> With this patches we decided to test the performance impact on different kernel versionand Xen versions. > >>> > >>> The test described above was labeled as IO-bound test. > >> > >> So just to clarify: The client sends a request (presumably not much more > >> than a ping) to the server, and waits for the server to respond before > >> sending another one; and the server does the reverse -- receives a > >> request, responds, and then waits for the next request. Is that right? > > > > Yes. > >> > >> How much data is transferred? > > > > 1 packet, UDP > >> > >> If the amount of data transferred is tiny, then the bottleneck for the > >> test is probably the IPI time, and I'd call this a "ping-pong" > >> benchmark[1]. I would only call this "io-bound" if you're actually > >> copying large amounts of data. 
> >
> > What we found is that on baremetal the scheduler would put both apps
> > on the same CPU and schedule them right after each other. This would
> > have a high IPI as the scheduler would poke itself.
> > On Xen it would put the two applications on seperate CPUs - and there
> > would be hardly any IPI.
>
> Sorry -- why would the scheduler send itself an IPI if it's on the same
> logical cpu (which seems pretty pointless), but *not* send an IPI to the
> *other* processor when it was actually waking up another task?
>
> Or do you mean high context switch rate?

Yes, very high.

>
> > Digging deeper in the code I found out that if you do an UDP sendmsg
> > without any timeouts - it would put it in a queue and just call schedule.
>
> You mean, it would mark the other process as runnable somehow, but not
> actually send an IPI to wake it up?  Is that a new "feature" designed

Correct - because the other process was not on its vCPU runqueue.

> for large systems, to reduce the IPI traffic or something?

This is just the normal Linux scheduler. The only way it would do an IPI
to the other CPU was if the UDP message had a timeout. The default
timeout is infinite, so it didn't bother to send an IPI.

>
> > On baremetal the schedule would result in scheduler picking up the other
> > task, and starting it - which would dequeue immediately.
> >
> > On Xen - the schedule() would go HLT.. and then later be woken up by the
> > VIRQ_TIMER. And since the two applications were on seperate CPUs - the
> > single packet would just stick in the queue until the VIRQ_TIMER arrived.
>
> I'm not sure I understand the situation right, but it sounds a bit like
> what you're seeing is just a quirk of the fact that Linux doesn't always
> send IPIs to wake other processes up (either by design or by accident),

It does and it does not :-)

> but relies on scheduling timers to check for work to do.  Presumably

It .. I am not explaining it well. The Linux kernel scheduler, when
called for 'schedule' (from the UDP sendmsg), would either pick the next
application and do a context swap - or, if there were none, go to sleep.
[Kind of - it may also do an IPI to the other CPU if requested, but that
requires some hints from underlying layers]
Since there were only two apps on the runqueue - udp sender and udp receiver -
it would run them back-to-back (this is on baremetal).

However, if SMT was not exposed, the Linux kernel scheduler would put those
on each CPU's runqueue. Meaning each CPU only had one app on its runqueue.

Hence no need to do a context switch.
[unless you modified the UDP message to have a timeout, then it would
send an IPI]

> they knew that low performance on ping-pong workloads might be a
> possibility when they wrote the code that way; I don't see a reason why
> we should try to work around that in Xen.

Which is not what I am suggesting.

Our first idea was that since this is a Linux kernel scheduler characteristic,
let us give the guest all the information it needs to do this. That is,
make it look as baremetal as possible - and that is where the vCPU
pinning and the exposing of SMT information came about. That (Elena,
pls correct me if I am wrong) did indeed show that the guest was doing
what we expected.

But naturally that requires pinning and all that - and while it is a useful
case for those that have the vCPUs to spare and can do it - that is not
a general use-case.

So Elena started looking at the CPU-bound case, seeing how Xen behaves there
and whether we can improve the floating situation, as she saw some abnormal
behaviour.

I do not see any way to fix the udp single message mechanism except
by modifying the Linux kernel scheduler - and indeed it looks like later
kernels modified their behavior. Also, doing the vCPU pinning and SMT
exposing did not hurt in those cases (Elena?).

>
> -George

^ permalink raw reply	[flat|nested] 22+ messages in thread
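For concreteness, the kind of udp ping-pong pair being discussed looks roughly like the sketch below. This is illustrative only (port, message size and iteration count are made up; it is not the customer's actual test program): each iteration sends one tiny datagram and then blocks in recvfrom(), so the measured time is dominated by how quickly the peer task is scheduled after the packet is queued.

/* Minimal sketch of a UDP ping-pong client/server pair (illustrative only).
 *
 * Build:  cc -o pingpong pingpong.c
 * Run:    ./pingpong server 5000
 *         ./pingpong client 127.0.0.1 5000 100000
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[64];
    struct sockaddr_in addr, peer;
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;

    if (argc >= 3 && !strcmp(argv[1], "server")) {
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(atoi(argv[2]));
        bind(s, (struct sockaddr *)&addr, sizeof(addr));
        for (;;) {                       /* echo each message straight back */
            socklen_t plen = sizeof(peer);
            ssize_t n = recvfrom(s, buf, sizeof(buf), 0,
                                 (struct sockaddr *)&peer, &plen);
            if (n > 0)
                sendto(s, buf, n, 0, (struct sockaddr *)&peer, plen);
        }
    } else if (argc >= 5 && !strcmp(argv[1], "client")) {
        long i, iters = atol(argv[4]);
        addr.sin_addr.s_addr = inet_addr(argv[2]);
        addr.sin_port = htons(atoi(argv[3]));
        memset(buf, 'x', sizeof(buf));
        for (i = 0; i < iters; i++) {    /* one packet out, wait for the reply */
            sendto(s, buf, sizeof(buf), 0, (struct sockaddr *)&addr, sizeof(addr));
            recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);
        }
    } else {
        fprintf(stderr, "usage: %s server <port> | client <ip> <port> <iters>\n",
                argv[0]);
        return 1;
    }
    close(s);
    return 0;
}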
* Re: schedulers and topology exposing questions 2016-01-27 15:27 ` Konrad Rzeszutek Wilk @ 2016-01-27 15:53 ` George Dunlap 2016-01-27 16:12 ` Konrad Rzeszutek Wilk 2016-01-28 9:55 ` Dario Faggioli 2016-01-27 16:03 ` Elena Ufimtseva 2016-01-28 15:10 ` Dario Faggioli 2 siblings, 2 replies; 22+ messages in thread From: George Dunlap @ 2016-01-27 15:53 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Elena Ufimtseva, george.dunlap, dario.faggioli, xen-devel, joao.m.martins, boris.ostrovsky On 27/01/16 15:27, Konrad Rzeszutek Wilk wrote: > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: >> On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: >>> On Xen - the schedule() would go HLT.. and then later be woken up by the >>> VIRQ_TIMER. And since the two applications were on seperate CPUs - the >>> single packet would just stick in the queue until the VIRQ_TIMER arrived. >> >> I'm not sure I understand the situation right, but it sounds a bit like >> what you're seeing is just a quirk of the fact that Linux doesn't always >> send IPIs to wake other processes up (either by design or by accident), > > It does and it does not :-) > >> but relies on scheduling timers to check for work to do. Presumably > > It .. I am not explaining it well. The Linux kernel scheduler when > called for 'schedule' (from the UDP sendmsg) would either pick the next > appliction and do a context swap - of if there were none - go to sleep. > [Kind of - it also may do an IPI to the other CPU if requested ,but that requires > some hints from underlaying layers] > Since there were only two apps on the runqueue - udp sender and udp receiver > it would run them back-to back (this is on baremetal) I think I understand at a high level from your description what's happening (No IPIs -> happens to run if on the same cpu, waits until next timer tick if on a different cpu); but what I don't quite get is *why* Linux doesn't send an IPI. It's been quite a while since I looked at the Linux scheduling code, so I'm trying to understand it based a lot on the Xen code. In Xen a vcpu can be "runnable" (has something to do) and "blocked" (waiting for something to do). Whenever a vcpu goes from "blocked" to "runnable", the scheduler will call vcpu_wake(), which sends an IPI to the appropriate pcpu to get it to run the vcpu. What you're describing is a situation where a process is blocked (either in 'listen' or 'read'), and another process does something which should cause it to become 'runnable' (sends it a UDP message). If anyone happens to run the scheduler on its cpu, it will run; but no proactive actions are taken to wake it up (i.e., sending an IPI). The idea of not sending an IPI when a process goes from "waiting for something to do" to "has something to do" seems strange to me; and if it wasn't a mistake, then my only guess why they would choose to do that would be to reduce IPI traffic on large systems. But whether it's a mistake or on purpose, it's a Linux thing, so... >> they knew that low performance on ping-pong workloads might be a >> possibility when they wrote the code that way; I don't see a reason why >> we should try to work around that in Xen. > > Which is not what I am suggesting. I'm glad we agree on this. :-) > Our first ideas was that since this is a Linux kernel schduler characteristic > - let us give the guest all the information it needs to do this. That is > make it look as baremetal as possible - and that is where the vCPU > pinning and the exposing of SMT information came about. 
That (Elena > pls correct me if I am wrong) did indeed show that the guest was doing > what we expected. > > But naturally that requires pinning and all that - and while it is a useful > case for those that have the vCPUs to spare and can do it - that is not > a general use-case. > > So Elena started looking at the CPU bound and seeing how Xen behaves then > and if we can improve the floating situation as she saw some abnormal > behavious. OK -- if the focus was on the two cases where the Xen credit1 scheduler (apparently) co-located two cpu-burning vcpus on sibling threads, then yeah, that's behavior we should probably try to get to the bottom of. -George ^ permalink raw reply [flat|nested] 22+ messages in thread
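As a toy model of the two behaviours being contrasted in this exchange (nothing below is Xen or Linux source code; all names and the 10ms tick are made up for the sketch): with kick_on_wake set, the waiting "cpu" is poked as soon as the task becomes runnable, which is the vcpu_wake()/IPI model George describes; with it cleared, the newly runnable task just sits there until the next simulated timer tick, which is the latency the ping-pong test keeps paying.

/* Build: cc -o wake wake.c -lpthread */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define TICK_NS (10 * 1000 * 1000)           /* pretend 10ms scheduler tick */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  kick = PTHREAD_COND_INITIALIZER;  /* stands in for the IPI */
static bool runnable;
static bool kick_on_wake = true;             /* set false to mimic the no-IPI case */

static void *other_cpu(void *arg)
{
    struct timespec deadline;
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!runnable) {
        /* Sleep until either a kick arrives or the next tick fires. */
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += TICK_NS;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec++;
            deadline.tv_nsec -= 1000000000L;
        }
        pthread_cond_timedwait(&kick, &lock, &deadline);
    }
    pthread_mutex_unlock(&lock);
    printf("task picked up\n");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, other_cpu, NULL);
    usleep(1000);                            /* let the other "cpu" go idle */

    pthread_mutex_lock(&lock);
    runnable = true;                         /* blocked -> runnable */
    if (kick_on_wake)
        pthread_cond_signal(&kick);          /* wake it right now... */
    pthread_mutex_unlock(&lock);             /* ...otherwise it waits for the tick */

    pthread_join(t, NULL);
    return 0;
}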
* Re: schedulers and topology exposing questions
  2016-01-27 15:53 ` George Dunlap
@ 2016-01-27 16:12   ` Konrad Rzeszutek Wilk
  2016-01-28  9:55   ` Dario Faggioli
  1 sibling, 0 replies; 22+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-27 16:12 UTC (permalink / raw)
To: George Dunlap
Cc: Elena Ufimtseva, george.dunlap, dario.faggioli, xen-devel,
	joao.m.martins, boris.ostrovsky

On Wed, Jan 27, 2016 at 03:53:38PM +0000, George Dunlap wrote:
> On 27/01/16 15:27, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote:
> >> On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote:
> >>> On Xen - the schedule() would go HLT.. and then later be woken up by the
> >>> VIRQ_TIMER. And since the two applications were on seperate CPUs - the
> >>> single packet would just stick in the queue until the VIRQ_TIMER arrived.
> >>
> >> I'm not sure I understand the situation right, but it sounds a bit like
> >> what you're seeing is just a quirk of the fact that Linux doesn't always
> >> send IPIs to wake other processes up (either by design or by accident),
> >
> > It does and it does not :-)
> >
> >> but relies on scheduling timers to check for work to do.  Presumably
> >
> > It .. I am not explaining it well. The Linux kernel scheduler when
> > called for 'schedule' (from the UDP sendmsg) would either pick the next
> > appliction and do a context swap - of if there were none - go to sleep.
> > [Kind of - it also may do an IPI to the other CPU if requested ,but that requires
> > some hints from underlaying layers]
> > Since there were only two apps on the runqueue - udp sender and udp receiver
> > it would run them back-to back (this is on baremetal)
>
> I think I understand at a high level from your description what's
> happening (No IPIs -> happens to run if on the same cpu, waits until
> next timer tick if on a different cpu); but what I don't quite get is
> *why* Linux doesn't send an IPI.

Wait no no. "happens to run if on the same cpu" - only if on baremetal
or if we expose SMT topology to a guest. Otherwise the applications are
not on the same CPU.

The sending-IPI part is because there are two CPUs - and the apps on
those two runqueues are not intertwined from the perspective of the
scheduler. (Unless the udp code has given the scheduler hints).

However, if I taskset the applications on the same vCPU (this being
without exposing SMT threads, or just the normal situation as today),
the scheduler will send IPIs and do context switches.

Then I found that if I enable vAPIC and disable event channels for IPIs,
and only use the native APIC machinery (aka vAPIC), we can do even fewer
VMEXITs, but that is a different story:

http://lists.xenproject.org/archives/html/xen-devel/2015-10/msg00897.html

>
> It's been quite a while since I looked at the Linux scheduling code, so
> I'm trying to understand it based a lot on the Xen code.  In Xen a vcpu
> can be "runnable" (has something to do) and "blocked" (waiting for
> something to do).  Whenever a vcpu goes from "blocked" to "runnable", the
> scheduler will call vcpu_wake(), which sends an IPI to the appropriate
> pcpu to get it to run the vcpu.
>
> What you're describing is a situation where a process is blocked (either
> in 'listen' or 'read'), and another process does something which should
> cause it to become 'runnable' (sends it a UDP message).  If anyone
> happens to run the scheduler on its cpu, it will run; but no proactive
> actions are taken to wake it up (i.e., sending an IPI).

Right. And that is a UDP code decision.
It called the schedule without any timeout or hints. > > The idea of not sending an IPI when a process goes from "waiting for > something to do" to "has something to do" seems strange to me; and if it > wasn't a mistake, then my only guess why they would choose to do that > would be to reduce IPI traffic on large systems. > > But whether it's a mistake or on purpose, it's a Linux thing, so... Yes :-) > > >> they knew that low performance on ping-pong workloads might be a > >> possibility when they wrote the code that way; I don't see a reason why > >> we should try to work around that in Xen. > > > > Which is not what I am suggesting. > > I'm glad we agree on this. :-) > > > Our first ideas was that since this is a Linux kernel schduler characteristic > > - let us give the guest all the information it needs to do this. That is > > make it look as baremetal as possible - and that is where the vCPU > > pinning and the exposing of SMT information came about. That (Elena > > pls correct me if I am wrong) did indeed show that the guest was doing > > what we expected. > > > > But naturally that requires pinning and all that - and while it is a useful > > case for those that have the vCPUs to spare and can do it - that is not > > a general use-case. > > > > So Elena started looking at the CPU bound and seeing how Xen behaves then > > and if we can improve the floating situation as she saw some abnormal > > behavious. > > OK -- if the focus was on the two cases where the Xen credit1 scheduler > (apparently) co-located two cpu-burning vcpus on sibling threads, then > yeah, that's behavior we should probably try to get to the bottom of. Right! ^ permalink raw reply [flat|nested] 22+ messages in thread
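For reference, the "taskset the applications on the same vCPU" experiment Konrad mentions can be reproduced either with the taskset tool or from inside the program itself; the sketch below is illustrative only (CPU 0 is an arbitrary choice) and uses the standard sched_setaffinity() call rather than the taskset wrapper.

/* Sketch: pin the calling process to (v)CPU 0 before starting the
 * ping-pong loop, equivalent to launching it under "taskset -c 0".
 * CPU 0 is an arbitrary choice for illustration.
 *
 * Build: cc -o pin pin.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                    /* run only on CPU 0 */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pid %d now restricted to CPU 0\n", getpid());
    /* ... start the UDP send/receive loop here ... */
    return 0;
}

On the Xen side, the matching step would be pinning the guest's vCPUs to pcpus (e.g. with xl vcpu-pin), which is what the SMT-exposure experiments discussed in this thread rely on.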
* Re: schedulers and topology exposing questions
  2016-01-27 15:53 ` George Dunlap
  2016-01-27 16:12   ` Konrad Rzeszutek Wilk
@ 2016-01-28  9:55   ` Dario Faggioli
  2016-01-29 21:59     ` Elena Ufimtseva
  1 sibling, 1 reply; 22+ messages in thread
From: Dario Faggioli @ 2016-01-28 9:55 UTC (permalink / raw)
To: George Dunlap, Konrad Rzeszutek Wilk
Cc: Elena Ufimtseva, george.dunlap, joao.m.martins, boris.ostrovsky,
	xen-devel

[-- Attachment #1.1: Type: text/plain, Size: 1152 bytes --]

On Wed, 2016-01-27 at 15:53 +0000, George Dunlap wrote:
> On 27/01/16 15:27, Konrad Rzeszutek Wilk wrote:
> >
> > So Elena started looking at the CPU bound and seeing how Xen
> > behaves then
> > and if we can improve the floating situation as she saw some
> > abnormal
> > behavious.
>
> OK -- if the focus was on the two cases where the Xen credit1
> scheduler
> (apparently) co-located two cpu-burning vcpus on sibling threads,
> then
> yeah, that's behavior we should probably try to get to the bottom of.
>
Well, let's see the trace.

In any case, I'm up for trying to hook the SMT load balancer into
runq_tickle (which would mean doing it upon every vcpu wakeup).

My gut feeling is that the overhead may outweigh the benefit, and that
it will actually prove useful only in a minority of the cases/workloads,
but it's maybe worth a try.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-28 9:55 ` Dario Faggioli @ 2016-01-29 21:59 ` Elena Ufimtseva 2016-02-02 11:58 ` Dario Faggioli 0 siblings, 1 reply; 22+ messages in thread From: Elena Ufimtseva @ 2016-01-29 21:59 UTC (permalink / raw) To: Dario Faggioli Cc: george.dunlap, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky [-- Attachment #1: Type: text/plain, Size: 1478 bytes --] On Thu, Jan 28, 2016 at 09:55:45AM +0000, Dario Faggioli wrote: > On Wed, 2016-01-27 at 15:53 +0000, George Dunlap wrote: > > On 27/01/16 15:27, Konrad Rzeszutek Wilk wrote: > > > > > > So Elena started looking at the CPU bound and seeing how Xen > > > behaves then > > > and if we can improve the floating situation as she saw some > > > abnormal > > > behavious. > > > > OK -- if the focus was on the two cases where the Xen credit1 > > scheduler > > (apparently) co-located two cpu-burning vcpus on sibling threads, > > then > > yeah, that's behavior we should probably try to get to the bottom of. > > > Well, let's see the trace. Hey Dario Please disregard the previous email with topology information. It was incorrect and I am attaching the topology that is actually result of Joao smt patches application. Elena > > In any case, I'm up to trying hooking the SMT load balancer in > runq_tickle (which would mean doing it upon every vcpus wakeup). > > My gut feeling is that the overhead my outwieght the benefit, and that > it will actually reveal useful only in a minority of the > cases/workloads, but it's maybe worth a try. > > Regards, > Dario > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > [-- Attachment #2: cpuinfo --] [-- Type: text/plain, Size: 12658 bytes --] processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 0 cpu cores : 8 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 0 cpu cores : 8 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : 
Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 1 cpu cores : 8 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 1 cpu cores : 8 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 2 cpu cores : 8 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 2 cpu cores : 8 apicid : 5 initial apicid : 5 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 6 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 3 cpu cores : 8 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq 
ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 3 cpu cores : 8 apicid : 7 initial apicid : 7 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 8 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 4 cpu cores : 8 apicid : 8 initial apicid : 8 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 9 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 4 cpu cores : 8 apicid : 9 initial apicid : 9 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 10 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 5 cpu cores : 8 apicid : 10 initial apicid : 10 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 11 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 
cache size : 25600 KB physical id : 0 siblings : 16 core id : 5 cpu cores : 8 apicid : 11 initial apicid : 11 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 12 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 6 cpu cores : 8 apicid : 12 initial apicid : 12 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 13 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 6 cpu cores : 8 apicid : 13 initial apicid : 13 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 14 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 7 cpu cores : 8 apicid : 14 initial apicid : 14 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 15 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 7 cpu cores : 8 apicid : 15 initial apicid : 15 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes 
xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: [-- Attachment #3: sched_domains --] [-- Type: text/plain, Size: 366 bytes --] cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 cat /proc/sys/kernel/sched_domain/cpu*/domain*/names SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC [-- Attachment #4: topology_smt_patches --] [-- Type: text/plain, Size: 6425 bytes --] Advisory to Users on system topology enumeration This utility is for demonstration purpose only. It assumes the hardware topology configuration within a coherent domain does not change during the life of an OS session. If an OS support advanced features that can change hardware topology configurations, more sophisticated adaptation may be necessary to account for the hardware configuration change that might have added and reduced the number of logical processors being managed by the OS. User should also`be aware that the system topology enumeration algorithm is based on the assumption that CPUID instruction will return raw data reflecting the native hardware configuration. When an application runs inside a virtual machine hosted by a Virtual Machine Monitor (VMM), any CPUID instructions issued by an app (or a guest OS) are trapped by the VMM and it is the VMM's responsibility and decision to emulate/supply CPUID return data to the virtual machines. When deploying topology enumeration code based on querying CPUID inside a VM environment, the user must consult with the VMM vendor on how an VMM will emulate CPUID instruction relating to topology enumeration. Software visible enumeration in the system: Number of logical processors visible to the OS: 16 Number of logical processors visible to this process: 16 Number of processor cores visible to this process: 8 Number of physical packages visible to this process: 1 Hierarchical counts by levels of processor topology: # of cores in package 0 visible to this process: 8 . # of logical processors in Core 0 visible to this process: 2 . # of logical processors in Core 1 visible to this process: 2 . # of logical processors in Core 2 visible to this process: 2 . # of logical processors in Core 3 visible to this process: 2 . # of logical processors in Core 4 visible to this process: 2 . # of logical processors in Core 5 visible to this process: 2 . # of logical processors in Core 6 visible to this process: 2 . # of logical processors in Core 7 visible to this process: 2 . 
Affinity masks per SMT thread, per core, per package: Individual: P:0, C:0, T:0 --> 1 P:0, C:0, T:1 --> 2 Core-aggregated: P:0, C:0 --> 3 Individual: P:0, C:1, T:0 --> 4 P:0, C:1, T:1 --> 8 Core-aggregated: P:0, C:1 --> c Individual: P:0, C:2, T:0 --> 10 P:0, C:2, T:1 --> 20 Core-aggregated: P:0, C:2 --> 30 Individual: P:0, C:3, T:0 --> 40 P:0, C:3, T:1 --> 80 Core-aggregated: P:0, C:3 --> c0 Individual: P:0, C:4, T:0 --> 100 P:0, C:4, T:1 --> 200 Core-aggregated: P:0, C:4 --> 300 Individual: P:0, C:5, T:0 --> 400 P:0, C:5, T:1 --> 800 Core-aggregated: P:0, C:5 --> c00 Individual: P:0, C:6, T:0 --> 1z3 P:0, C:6, T:1 --> 2z3 Core-aggregated: P:0, C:6 --> 3z3 Individual: P:0, C:7, T:0 --> 4z3 P:0, C:7, T:1 --> 8z3 Core-aggregated: P:0, C:7 --> cz3 Pkg-aggregated: P:0 --> ffff APIC ID listings from affinity masks OS cpu 0, Affinity mask 000001 - apic id 0 OS cpu 1, Affinity mask 000002 - apic id 1 OS cpu 2, Affinity mask 000004 - apic id 2 OS cpu 3, Affinity mask 000008 - apic id 3 OS cpu 4, Affinity mask 000010 - apic id 4 OS cpu 5, Affinity mask 000020 - apic id 5 OS cpu 6, Affinity mask 000040 - apic id 6 OS cpu 7, Affinity mask 000080 - apic id 7 OS cpu 8, Affinity mask 000100 - apic id 8 OS cpu 9, Affinity mask 000200 - apic id 9 OS cpu 10, Affinity mask 000400 - apic id a OS cpu 11, Affinity mask 000800 - apic id b OS cpu 12, Affinity mask 001000 - apic id c OS cpu 13, Affinity mask 002000 - apic id d OS cpu 14, Affinity mask 004000 - apic id e OS cpu 15, Affinity mask 008000 - apic id f Package 0 Cache and Thread details Box Description: Cache is cache level designator Size is cache size OScpu# is cpu # as seen by OS Core is core#[_thread# if > 1 thread/core] inside socket AffMsk is AffinityMask(extended hex) for core and thread CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache CmbMsk will differ from AffMsk if > 1 hw_thread/cache Extended Hex replaces trailing zeroes with 'z#' where # is number of zeroes (so '8z5' is '0x800000') L1D is Level 1 Data cache, size(KBytes)= 32, Cores/cache= 2, Caches/package= 8 L1I is Level 1 Instruction cache, size(KBytes)= 32, Cores/cache= 2, Caches/package= 8 L2 is Level 2 Unified cache, size(KBytes)= 256, Cores/cache= 2, Caches/package= 8 L3 is Level 3 Unified cache, size(KBytes)= 25600, Cores/cache= 16, Caches/package= 1 +-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ Cache | L1D | L1D | L1D | L1D | L1D | L1D | L1D | L1D | Size | 32K | 32K | 32K | 32K | 32K | 32K | 32K | 32K | OScpu#| 0 1| 2 3| 4 5| 6 7| 8 9| 10 11| 12 13| 14 15| Core |c0_t0 c0_t1|c1_t0 c1_t1|c2_t0 c2_t1|c3_t0 c3_t1|c4_t0 c4_t1|c5_t0 c5_t1|c6_t0 c6_t1|c7_t0 c7_t1| AffMsk| 1 2| 4 8| 10 20| 40 80| 100 200| 400 800| 1z3 2z3| 4z3 8z3| CmbMsk| 3 | c | 30 | c0 | 300 | c00 | 3z3 | cz3 | +-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ Cache | L1I | L1I | L1I | L1I | L1I | L1I | L1I | L1I | Size | 32K | 32K | 32K | 32K | 32K | 32K | 32K | 32K | +-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ Cache | L2 | L2 | L2 | L2 | L2 | L2 | L2 | L2 | Size | 256K | 256K | 256K | 256K | 256K | 256K | 256K | 256K | +-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ Cache | L3 | Size | 25M | CmbMsk| ffff | +-----------------------------------------------------------------------------------------------+ [-- Attachment #5: Type: text/plain, Size: 126 bytes --] 
_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions
  2016-01-29 21:59 ` Elena Ufimtseva
@ 2016-02-02 11:58   ` Dario Faggioli
  0 siblings, 0 replies; 22+ messages in thread
From: Dario Faggioli @ 2016-02-02 11:58 UTC (permalink / raw)
To: Elena Ufimtseva
Cc: george.dunlap, George Dunlap, xen-devel, joao.m.martins,
	boris.ostrovsky

[-- Attachment #1.1: Type: text/plain, Size: 1216 bytes --]

On Fri, 2016-01-29 at 16:59 -0500, Elena Ufimtseva wrote:
>
> Hey Dario
>
> Please disregard the previous email with topology information.
> It was incorrect and I am attaching the topology that is actually
> result
> of Joao smt patches application.
>
Ok :-)

Well, this:

...
physical id : 0
siblings    : 16
core id     : 0
cpu cores   : 8
...
physical id : 0
siblings    : 16
core id     : 0
cpu cores   : 8
...
physical id : 0
siblings    : 16
core id     : 1
cpu cores   : 8
...
[etc]

And this:

cat /proc/sys/kernel/sched_domain/cpu*/domain*/names
SMT
MC
SMT
MC
SMT
MC
SMT
MC
[etc]

Makes me think that the patches are doing their job, and that the OS is
correctly seeing a multicore, SMT-enabled topology.

Good work on that! ;-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-27 15:27 ` Konrad Rzeszutek Wilk 2016-01-27 15:53 ` George Dunlap @ 2016-01-27 16:03 ` Elena Ufimtseva 2016-01-28 9:46 ` Dario Faggioli 2016-01-28 15:10 ` Dario Faggioli 2 siblings, 1 reply; 22+ messages in thread From: Elena Ufimtseva @ 2016-01-27 16:03 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: george.dunlap, dario.faggioli, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky On Wed, Jan 27, 2016 at 10:27:01AM -0500, Konrad Rzeszutek Wilk wrote: > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > > On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: > > > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: > > >> On 22/01/16 16:54, Elena Ufimtseva wrote: > > >>> Hello all! > > >>> > > >>> Dario, Gerorge or anyone else, your help will be appreciated. > > >>> > > >>> Let me put some intro to our findings. I may forget something or put something > > >>> not too explicit, please ask me. > > >>> > > >>> Customer filled a bug where some of the applications were running slow in their HVM DomU setups. > > >>> These running times were compared against baremetal running same kernel version as HVM DomU. > > >>> > > >>> After some investigation by different parties, the test case scenario was found > > >>> where the problem was easily seen. The test app is a udp server/client pair where > > >>> client passes some message n number of times. > > >>> The test case was executed on baremetal and Xen DomU with kernel version 2.6.39. > > >>> Bare metal showed 2x times better result that DomU. > > >>> > > >>> Konrad came up with a workaround that was setting the flag for domain scheduler in linux > > >>> As the guest is not aware of SMT-related topology, it has a flat topology initialized. > > >>> Kernel has domain scheduler flags for scheduling domain CPU set to 4143 for 2.6.39. > > >>> Konrad discovered that changing the flag for CPU sched domain to 4655 > > >>> works as a workaround and makes Linux think that the topology has SMT threads. > > >>> This workaround makes the test to complete almost in same time as on baremetal (or insignificantly worse). > > >>> > > >>> This workaround is not suitable for kernels of higher versions as we discovered. > > >>> > > >>> The hackish way of making domU linux think that it has SMT threads (along with matching cpuid) > > >>> made us thinks that the problem comes from the fact that cpu topology is not exposed to > > >>> guest and Linux scheduler cannot make intelligent decision on scheduling. > > >>> > > >>> Joao Martins from Oracle developed set of patches that fixed the smt/core/cashe > > >>> topology numbering and provided matching pinning of vcpus and enabling options, > > >>> allows to expose to guest correct topology. > > >>> I guess Joao will be posting it at some point. > > >>> > > >>> With this patches we decided to test the performance impact on different kernel versionand Xen versions. > > >>> > > >>> The test described above was labeled as IO-bound test. > > >> > > >> So just to clarify: The client sends a request (presumably not much more > > >> than a ping) to the server, and waits for the server to respond before > > >> sending another one; and the server does the reverse -- receives a > > >> request, responds, and then waits for the next request. Is that right? > > > > > > Yes. > > >> > > >> How much data is transferred? 
> > > > > > 1 packet, UDP > > >> > > >> If the amount of data transferred is tiny, then the bottleneck for the > > >> test is probably the IPI time, and I'd call this a "ping-pong" > > >> benchmark[1]. I would only call this "io-bound" if you're actually > > >> copying large amounts of data. > > > > > > What we found is that on baremetal the scheduler would put both apps > > > on the same CPU and schedule them right after each other. This would > > > have a high IPI as the scheduler would poke itself. > > > On Xen it would put the two applications on seperate CPUs - and there > > > would be hardly any IPI. > > > > Sorry -- why would the scheduler send itself an IPI if it's on the same > > logical cpu (which seems pretty pointless), but *not* send an IPI to the > > *other* processor when it was actually waking up another task? > > > > Or do you mean high context switch rate? > > Yes, very high. > > > > > Digging deeper in the code I found out that if you do an UDP sendmsg > > > without any timeouts - it would put it in a queue and just call schedule. > > > > You mean, it would mark the other process as runnable somehow, but not > > actually send an IPI to wake it up? Is that a new "feature" designed > > Correct - because the other process was not on its vCPU runqueue. > > > for large systems, to reduce the IPI traffic or something? > > This is just a normal Linux scheduler. The only way it would do an IPI > to the other CPU was if the UDP message had an timeout. The default > timeout is infite so it didn't bother to send an IPI. > > > > > > On baremetal the schedule would result in scheduler picking up the other > > > task, and starting it - which would dequeue immediately. > > > > > > On Xen - the schedule() would go HLT.. and then later be woken up by the > > > VIRQ_TIMER. And since the two applications were on seperate CPUs - the > > > single packet would just stick in the queue until the VIRQ_TIMER arrived. > > > > I'm not sure I understand the situation right, but it sounds a bit like > > what you're seeing is just a quirk of the fact that Linux doesn't always > > send IPIs to wake other processes up (either by design or by accident), > > It does and it does not :-) > > > but relies on scheduling timers to check for work to do. Presumably > > It .. I am not explaining it well. The Linux kernel scheduler when > called for 'schedule' (from the UDP sendmsg) would either pick the next > appliction and do a context swap - of if there were none - go to sleep. > [Kind of - it also may do an IPI to the other CPU if requested ,but that requires > some hints from underlaying layers] > Since there were only two apps on the runqueue - udp sender and udp receiver > it would run them back-to back (this is on baremetal) > > However if SMT was not exposed - the Linux kernel scheduler would put those > on each CPU runqueue. Meaning each CPU only had one app on its runqueue. > > Hence no need to do an context switch. > [unless you modified the UDP message to have a timeout, then it would > send an IPI] > > they knew that low performance on ping-pong workloads might be a > > possibility when they wrote the code that way; I don't see a reason why > > we should try to work around that in Xen. > > Which is not what I am suggesting. > > Our first ideas was that since this is a Linux kernel schduler characteristic > - let us give the guest all the information it needs to do this. 
That is
> make it look as baremetal as possible - and that is where the vCPU
> pinning and the exposing of SMT information came about. That (Elena
> pls correct me if I am wrong) did indeed show that the guest was doing
> what we expected.
>
> But naturally that requires pinning and all that - and while it is a useful
> case for those that have the vCPUs to spare and can do it - that is not
> a general use-case.
>
> So Elena started looking at the CPU bound and seeing how Xen behaves then
> and if we can improve the floating situation as she saw some abnormal
> behavious.

Maybe it's normal? :)

While having satisfactory results with the ping-pong test and having Joao's
SMT patches in hand, we decided to try a cpu-bound workload.
And that is where exposing SMT does not work that well.
I mostly refer here to the case where two vCPUs are being placed on the same
core while there are idle cores.

This, I think, is what Dario is asking me more details about in another
reply, and I am going to answer his questions.

>
> I do not see any way to fix the udp single message mechanism except
> by modifying the Linux kernel scheduler - and indeed it looks like later
> kernels modified their behavior. Also doing the vCPU pinning and SMT exposing
> did not hurt in those cases (Elena?).

Yes, the drastic performance differences with bare metal were only
observed with the 2.6.39-based kernel.
For this ping-pong udp test, exposing the SMT topology to kernels of
higher versions did help: tests show about a 20 percent performance
improvement compared to the tests where SMT topology is not exposed.
This assumes that SMT exposure goes along with pinning.


kernel.

> >
> > > -George

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-27 16:03 ` Elena Ufimtseva @ 2016-01-28 9:46 ` Dario Faggioli 2016-01-29 16:09 ` Elena Ufimtseva 0 siblings, 1 reply; 22+ messages in thread From: Dario Faggioli @ 2016-01-28 9:46 UTC (permalink / raw) To: Elena Ufimtseva, Konrad Rzeszutek Wilk Cc: george.dunlap, joao.m.martins, George Dunlap, boris.ostrovsky, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 9847 bytes --] On Wed, 2016-01-27 at 11:03 -0500, Elena Ufimtseva wrote: > On Wed, Jan 27, 2016 at 10:27:01AM -0500, Konrad Rzeszutek Wilk > wrote: > > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > > > On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: > > > > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: > > > > > On 22/01/16 16:54, Elena Ufimtseva wrote: > > > > > > Hello all! > > > > > > > > > > > > Dario, Gerorge or anyone else, your help will be > > > > > > appreciated. > > > > > > > > > > > > Let me put some intro to our findings. I may forget > > > > > > something or put something > > > > > > not too explicit, please ask me. > > > > > > > > > > > > Customer filled a bug where some of the applications were > > > > > > running slow in their HVM DomU setups. > > > > > > These running times were compared against baremetal running > > > > > > same kernel version as HVM DomU. > > > > > > > > > > > > After some investigation by different parties, the test > > > > > > case scenario was found > > > > > > where the problem was easily seen. The test app is a udp > > > > > > server/client pair where > > > > > > client passes some message n number of times. > > > > > > The test case was executed on baremetal and Xen DomU with > > > > > > kernel version 2.6.39. > > > > > > Bare metal showed 2x times better result that DomU. > > > > > > > > > > > > Konrad came up with a workaround that was setting the flag > > > > > > for domain scheduler in linux > > > > > > As the guest is not aware of SMT-related topology, it has a > > > > > > flat topology initialized. > > > > > > Kernel has domain scheduler flags for scheduling domain CPU > > > > > > set to 4143 for 2.6.39. > > > > > > Konrad discovered that changing the flag for CPU sched > > > > > > domain to 4655 > > > > > > works as a workaround and makes Linux think that the > > > > > > topology has SMT threads. > > > > > > This workaround makes the test to complete almost in same > > > > > > time as on baremetal (or insignificantly worse). > > > > > > > > > > > > This workaround is not suitable for kernels of higher > > > > > > versions as we discovered. > > > > > > > > > > > > The hackish way of making domU linux think that it has SMT > > > > > > threads (along with matching cpuid) > > > > > > made us thinks that the problem comes from the fact that > > > > > > cpu topology is not exposed to > > > > > > guest and Linux scheduler cannot make intelligent decision > > > > > > on scheduling. > > > > > > > > > > > > Joao Martins from Oracle developed set of patches that > > > > > > fixed the smt/core/cashe > > > > > > topology numbering and provided matching pinning of vcpus > > > > > > and enabling options, > > > > > > allows to expose to guest correct topology. > > > > > > I guess Joao will be posting it at some point. > > > > > > > > > > > > With this patches we decided to test the performance impact > > > > > > on different kernel versionand Xen versions. > > > > > > > > > > > > The test described above was labeled as IO-bound test. 
> > > > > > > > > > So just to clarify: The client sends a request (presumably > > > > > not much more > > > > > than a ping) to the server, and waits for the server to > > > > > respond before > > > > > sending another one; and the server does the reverse -- > > > > > receives a > > > > > request, responds, and then waits for the next request. Is > > > > > that right? > > > > > > > > Yes. > > > > > > > > > > How much data is transferred? > > > > > > > > 1 packet, UDP > > > > > > > > > > If the amount of data transferred is tiny, then the > > > > > bottleneck for the > > > > > test is probably the IPI time, and I'd call this a "ping- > > > > > pong" > > > > > benchmark[1]. I would only call this "io-bound" if you're > > > > > actually > > > > > copying large amounts of data. > > > > > > > > What we found is that on baremetal the scheduler would put both > > > > apps > > > > on the same CPU and schedule them right after each other. This > > > > would > > > > have a high IPI as the scheduler would poke itself. > > > > On Xen it would put the two applications on seperate CPUs - and > > > > there > > > > would be hardly any IPI. > > > > > > Sorry -- why would the scheduler send itself an IPI if it's on > > > the same > > > logical cpu (which seems pretty pointless), but *not* send an IPI > > > to the > > > *other* processor when it was actually waking up another task? > > > > > > Or do you mean high context switch rate? > > > > Yes, very high. > > > > > > > Digging deeper in the code I found out that if you do an UDP > > > > sendmsg > > > > without any timeouts - it would put it in a queue and just call > > > > schedule. > > > > > > You mean, it would mark the other process as runnable somehow, > > > but not > > > actually send an IPI to wake it up? Is that a new "feature" > > > designed > > > > Correct - because the other process was not on its vCPU runqueue. > > > > > for large systems, to reduce the IPI traffic or something? > > > > This is just a normal Linux scheduler. The only way it would do an > > IPI > > to the other CPU was if the UDP message had an timeout. The default > > timeout is infite so it didn't bother to send an IPI. > > > > > > > > > On baremetal the schedule would result in scheduler picking up > > > > the other > > > > task, and starting it - which would dequeue immediately. > > > > > > > > On Xen - the schedule() would go HLT.. and then later be woken > > > > up by the > > > > VIRQ_TIMER. And since the two applications were on seperate > > > > CPUs - the > > > > single packet would just stick in the queue until the > > > > VIRQ_TIMER arrived. > > > > > > I'm not sure I understand the situation right, but it sounds a > > > bit like > > > what you're seeing is just a quirk of the fact that Linux doesn't > > > always > > > send IPIs to wake other processes up (either by design or by > > > accident), > > > > It does and it does not :-) > > > > > but relies on scheduling timers to check for work to > > > do. Presumably > > > > It .. I am not explaining it well. The Linux kernel scheduler when > > called for 'schedule' (from the UDP sendmsg) would either pick the > > next > > appliction and do a context swap - of if there were none - go to > > sleep. 
> > [Kind of - it also may do an IPI to the other CPU if requested ,but > > that requires > > some hints from underlaying layers] > > Since there were only two apps on the runqueue - udp sender and udp > > receiver > > it would run them back-to back (this is on baremetal) > > > > However if SMT was not exposed - the Linux kernel scheduler would > > put those > > on each CPU runqueue. Meaning each CPU only had one app on its > > runqueue. > > > > Hence no need to do an context switch. > > [unless you modified the UDP message to have a timeout, then it > > would > > send an IPI] > > > they knew that low performance on ping-pong workloads might be a > > > possibility when they wrote the code that way; I don't see a > > > reason why > > > we should try to work around that in Xen. > > > > Which is not what I am suggesting. > > > > Our first ideas was that since this is a Linux kernel schduler > > characteristic > > - let us give the guest all the information it needs to do this. > > That is > > make it look as baremetal as possible - and that is where the vCPU > > pinning and the exposing of SMT information came about. That (Elena > > pls correct me if I am wrong) did indeed show that the guest was > > doing > > what we expected. > > > > But naturally that requires pinning and all that - and while it is > > a useful > > case for those that have the vCPUs to spare and can do it - that is > > not > > a general use-case. > > > > So Elena started looking at the CPU bound and seeing how Xen > > behaves then > > and if we can improve the floating situation as she saw some > > abnormal > > behavious. > > Maybe its normal? :) > > While having satisfactory results with ping-pong test and having > Joao's > SMT patches in hand, we decided to try cpu-bound workload. > And that is where exposing SMT does not work that well. > I mostly here refer to the case where two vCPUs are being placed on > same > core while there are idle cores. > > This I think what Dario is asking me more details about in another > reply and I am going to > answer his questions. > Yes, exactly. We need to see more trace entries around the one where we see the vcpus being placed on SMT-siblings. You can well send, or upload somewhere, the full trace, and I'll have a look myself as soon as I can. :-) > > I do not see any way to fix the udp single message mechanism except > > by modifying the Linux kernel scheduler - and indeed it looks like > > later > > kernels modified their behavior. Also doing the vCPU pinning and > > SMT exposing > > did not hurt in those cases (Elena?). > > Yes, the drastic performance differences with bare metal were only > observed with 2.6.39-based kernel. > For this ping-pong udp test exposing the SMT topology to the kernels > if > higher versions did help as tests show about 20 percent performance > improvement comparing to the tests where SMT topology is not exposed. > This assumes that SMT exposure goes along with pinning. > > > kernel. > hypervisor. 
:-D :-D :-D Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-28 9:46 ` Dario Faggioli @ 2016-01-29 16:09 ` Elena Ufimtseva 0 siblings, 0 replies; 22+ messages in thread From: Elena Ufimtseva @ 2016-01-29 16:09 UTC (permalink / raw) To: Dario Faggioli Cc: george.dunlap, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky [-- Attachment #1: Type: text/plain, Size: 11053 bytes --] On Thu, Jan 28, 2016 at 09:46:46AM +0000, Dario Faggioli wrote: > On Wed, 2016-01-27 at 11:03 -0500, Elena Ufimtseva wrote: > > On Wed, Jan 27, 2016 at 10:27:01AM -0500, Konrad Rzeszutek Wilk > > wrote: > > > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > > > > On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: > > > > > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: > > > > > > On 22/01/16 16:54, Elena Ufimtseva wrote: > > > > > > > Hello all! > > > > > > > > > > > > > > Dario, Gerorge or anyone else, your help will be > > > > > > > appreciated. > > > > > > > > > > > > > > Let me put some intro to our findings. I may forget > > > > > > > something or put something > > > > > > > not too explicit, please ask me. > > > > > > > > > > > > > > Customer filled a bug where some of the applications were > > > > > > > running slow in their HVM DomU setups. > > > > > > > These running times were compared against baremetal running > > > > > > > same kernel version as HVM DomU. > > > > > > > > > > > > > > After some investigation by different parties, the test > > > > > > > case scenario was found > > > > > > > where the problem was easily seen. The test app is a udp > > > > > > > server/client pair where > > > > > > > client passes some message n number of times. > > > > > > > The test case was executed on baremetal and Xen DomU with > > > > > > > kernel version 2.6.39. > > > > > > > Bare metal showed 2x times better result that DomU. > > > > > > > > > > > > > > Konrad came up with a workaround that was setting the flag > > > > > > > for domain scheduler in linux > > > > > > > As the guest is not aware of SMT-related topology, it has a > > > > > > > flat topology initialized. > > > > > > > Kernel has domain scheduler flags for scheduling domain CPU > > > > > > > set to 4143 for 2.6.39. > > > > > > > Konrad discovered that changing the flag for CPU sched > > > > > > > domain to 4655 > > > > > > > works as a workaround and makes Linux think that the > > > > > > > topology has SMT threads. > > > > > > > This workaround makes the test to complete almost in same > > > > > > > time as on baremetal (or insignificantly worse). > > > > > > > > > > > > > > This workaround is not suitable for kernels of higher > > > > > > > versions as we discovered. > > > > > > > > > > > > > > The hackish way of making domU linux think that it has SMT > > > > > > > threads (along with matching cpuid) > > > > > > > made us thinks that the problem comes from the fact that > > > > > > > cpu topology is not exposed to > > > > > > > guest and Linux scheduler cannot make intelligent decision > > > > > > > on scheduling. > > > > > > > > > > > > > > Joao Martins from Oracle developed set of patches that > > > > > > > fixed the smt/core/cashe > > > > > > > topology numbering and provided matching pinning of vcpus > > > > > > > and enabling options, > > > > > > > allows to expose to guest correct topology. > > > > > > > I guess Joao will be posting it at some point. > > > > > > > > > > > > > > With this patches we decided to test the performance impact > > > > > > > on different kernel versionand Xen versions. 
> > > > > > > > > > > > > > The test described above was labeled as IO-bound test. > > > > > > > > > > > > So just to clarify: The client sends a request (presumably > > > > > > not much more > > > > > > than a ping) to the server, and waits for the server to > > > > > > respond before > > > > > > sending another one; and the server does the reverse -- > > > > > > receives a > > > > > > request, responds, and then waits for the next request. Is > > > > > > that right? > > > > > > > > > > Yes. > > > > > > > > > > > > How much data is transferred? > > > > > > > > > > 1 packet, UDP > > > > > > > > > > > > If the amount of data transferred is tiny, then the > > > > > > bottleneck for the > > > > > > test is probably the IPI time, and I'd call this a "ping- > > > > > > pong" > > > > > > benchmark[1]. I would only call this "io-bound" if you're > > > > > > actually > > > > > > copying large amounts of data. > > > > > > > > > > What we found is that on baremetal the scheduler would put both > > > > > apps > > > > > on the same CPU and schedule them right after each other. This > > > > > would > > > > > have a high IPI as the scheduler would poke itself. > > > > > On Xen it would put the two applications on seperate CPUs - and > > > > > there > > > > > would be hardly any IPI. > > > > > > > > Sorry -- why would the scheduler send itself an IPI if it's on > > > > the same > > > > logical cpu (which seems pretty pointless), but *not* send an IPI > > > > to the > > > > *other* processor when it was actually waking up another task? > > > > > > > > Or do you mean high context switch rate? > > > > > > Yes, very high. > > > > > > > > > Digging deeper in the code I found out that if you do an UDP > > > > > sendmsg > > > > > without any timeouts - it would put it in a queue and just call > > > > > schedule. > > > > > > > > You mean, it would mark the other process as runnable somehow, > > > > but not > > > > actually send an IPI to wake it up? Is that a new "feature" > > > > designed > > > > > > Correct - because the other process was not on its vCPU runqueue. > > > > > > > for large systems, to reduce the IPI traffic or something? > > > > > > This is just a normal Linux scheduler. The only way it would do an > > > IPI > > > to the other CPU was if the UDP message had an timeout. The default > > > timeout is infite so it didn't bother to send an IPI. > > > > > > > > > > > > On baremetal the schedule would result in scheduler picking up > > > > > the other > > > > > task, and starting it - which would dequeue immediately. > > > > > > > > > > On Xen - the schedule() would go HLT.. and then later be woken > > > > > up by the > > > > > VIRQ_TIMER. And since the two applications were on seperate > > > > > CPUs - the > > > > > single packet would just stick in the queue until the > > > > > VIRQ_TIMER arrived. > > > > > > > > I'm not sure I understand the situation right, but it sounds a > > > > bit like > > > > what you're seeing is just a quirk of the fact that Linux doesn't > > > > always > > > > send IPIs to wake other processes up (either by design or by > > > > accident), > > > > > > It does and it does not :-) > > > > > > > but relies on scheduling timers to check for work to > > > > do. Presumably > > > > > > It .. I am not explaining it well. The Linux kernel scheduler when > > > called for 'schedule' (from the UDP sendmsg) would either pick the > > > next > > > appliction and do a context swap - of if there were none - go to > > > sleep. 
> > > [Kind of - it also may do an IPI to the other CPU if requested ,but > > > that requires > > > some hints from underlaying layers] > > > Since there were only two apps on the runqueue - udp sender and udp > > > receiver > > > it would run them back-to back (this is on baremetal) > > > > > > However if SMT was not exposed - the Linux kernel scheduler would > > > put those > > > on each CPU runqueue. Meaning each CPU only had one app on its > > > runqueue. > > > > > > Hence no need to do an context switch. > > > [unless you modified the UDP message to have a timeout, then it > > > would > > > send an IPI] > > > > they knew that low performance on ping-pong workloads might be a > > > > possibility when they wrote the code that way; I don't see a > > > > reason why > > > > we should try to work around that in Xen. > > > > > > Which is not what I am suggesting. > > > > > > Our first ideas was that since this is a Linux kernel schduler > > > characteristic > > > - let us give the guest all the information it needs to do this. > > > That is > > > make it look as baremetal as possible - and that is where the vCPU > > > pinning and the exposing of SMT information came about. That (Elena > > > pls correct me if I am wrong) did indeed show that the guest was > > > doing > > > what we expected. > > > > > > But naturally that requires pinning and all that - and while it is > > > a useful > > > case for those that have the vCPUs to spare and can do it - that is > > > not > > > a general use-case. > > > > > > So Elena started looking at the CPU bound and seeing how Xen > > > behaves then > > > and if we can improve the floating situation as she saw some > > > abnormal > > > behavious. > > > > Maybe its normal? :) > > > > While having satisfactory results with ping-pong test and having > > Joao's > > SMT patches in hand, we decided to try cpu-bound workload. > > And that is where exposing SMT does not work that well. > > I mostly here refer to the case where two vCPUs are being placed on > > same > > core while there are idle cores. > > > > This I think what Dario is asking me more details about in another > > reply and I am going to > > answer his questions. > > > Yes, exactly. We need to see more trace entries around the one where we > see the vcpus being placed on SMT-siblings. You can well send, or > upload somewhere, the full trace, and I'll have a look myself as soon > as I can. :-) Hi Dario So here is the trace with smt patches applied, 5 iterations of cpu-bound workload. dom0 has 2 not-pinned vcpus, domU has 16 vcpus, not pinned as well, 8 active threads of cpu-bound test. Here is a trace output: https://drive.google.com/file/d/0ByVx1zSzgzLIbjFLTXFsbDJ4QVU/view?usp=sharing Topology in guest after smt patches applied can be seen in topology_smt_patches, along with cpuinfo and sched_domains. The thing what we are trying to figure out is why data cache is not being shared between threads and why number of packages is 4. We are looking at this right now. Dario Let me know if you would think of any other data what may help. Elena > > > > I do not see any way to fix the udp single message mechanism except > > > by modifying the Linux kernel scheduler - and indeed it looks like > > > later > > > kernels modified their behavior. Also doing the vCPU pinning and > > > SMT exposing > > > did not hurt in those cases (Elena?). > > > > Yes, the drastic performance differences with bare metal were only > > observed with 2.6.39-based kernel. 
> > For this ping-pong udp test exposing the SMT topology to the kernels > > if > > higher versions did help as tests show about 20 percent performance > > improvement comparing to the tests where SMT topology is not exposed. > > This assumes that SMT exposure goes along with pinning. > > > > > > kernel. > > > hypervisor. > > :-D :-D :-D > > Regards, > Dario > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > [-- Attachment #2: topology_smt_patches --] [-- Type: text/plain, Size: 8617 bytes --] Advisory to Users on system topology enumeration This utility is for demonstration purpose only. It assumes the hardware topology configuration within a coherent domain does not change during the life of an OS session. If an OS support advanced features that can change hardware topology configurations, more sophisticated adaptation may be necessary to account for the hardware configuration change that might have added and reduced the number of logical processors being managed by the OS. User should also`be aware that the system topology enumeration algorithm is based on the assumption that CPUID instruction will return raw data reflecting the native hardware configuration. When an application runs inside a virtual machine hosted by a Virtual Machine Monitor (VMM), any CPUID instructions issued by an app (or a guest OS) are trapped by the VMM and it is the VMM's responsibility and decision to emulate/supply CPUID return data to the virtual machines. When deploying topology enumeration code based on querying CPUID inside a VM environment, the user must consult with the VMM vendor on how an VMM will emulate CPUID instruction relating to topology enumeration. Software visible enumeration in the system: Number of logical processors visible to the OS: 16 Number of logical processors visible to this process: 16 Number of processor cores visible to this process: 8 Number of physical packages visible to this process: 4 Hierarchical counts by levels of processor topology: # of cores in package 0 visible to this process: 2 . # of logical processors in Core 0 visible to this process: 2 . # of logical processors in Core 1 visible to this process: 2 . # of cores in package 1 visible to this process: 2 . # of logical processors in Core 0 visible to this process: 2 . # of logical processors in Core 1 visible to this process: 2 . # of cores in package 2 visible to this process: 2 . # of logical processors in Core 0 visible to this process: 2 . # of logical processors in Core 1 visible to this process: 2 . # of cores in package 3 visible to this process: 2 . # of logical processors in Core 0 visible to this process: 2 . # of logical processors in Core 1 visible to this process: 2 . 
Affinity masks per SMT thread, per core, per package: Individual: P:0, C:0, T:0 --> 1 P:0, C:0, T:1 --> 2 Core-aggregated: P:0, C:0 --> 3 Individual: P:0, C:1, T:0 --> 4 P:0, C:1, T:1 --> 8 Core-aggregated: P:0, C:1 --> c Pkg-aggregated: P:0 --> f Individual: P:1, C:0, T:0 --> 10 P:1, C:0, T:1 --> 20 Core-aggregated: P:1, C:0 --> 30 Individual: P:1, C:1, T:0 --> 40 P:1, C:1, T:1 --> 80 Core-aggregated: P:1, C:1 --> c0 Pkg-aggregated: P:1 --> f0 Individual: P:2, C:0, T:0 --> 100 P:2, C:0, T:1 --> 200 Core-aggregated: P:2, C:0 --> 300 Individual: P:2, C:1, T:0 --> 400 P:2, C:1, T:1 --> 800 Core-aggregated: P:2, C:1 --> c00 Pkg-aggregated: P:2 --> f00 Individual: P:3, C:0, T:0 --> 1z3 P:3, C:0, T:1 --> 2z3 Core-aggregated: P:3, C:0 --> 3z3 Individual: P:3, C:1, T:0 --> 4z3 P:3, C:1, T:1 --> 8z3 Core-aggregated: P:3, C:1 --> cz3 Pkg-aggregated: P:3 --> fz3 APIC ID listings from affinity masks OS cpu 0, Affinity mask 000001 - apic id 0 OS cpu 1, Affinity mask 000002 - apic id 1 OS cpu 2, Affinity mask 000004 - apic id 2 OS cpu 3, Affinity mask 000008 - apic id 3 OS cpu 4, Affinity mask 000010 - apic id 4 OS cpu 5, Affinity mask 000020 - apic id 5 OS cpu 6, Affinity mask 000040 - apic id 6 OS cpu 7, Affinity mask 000080 - apic id 7 OS cpu 8, Affinity mask 000100 - apic id 8 OS cpu 9, Affinity mask 000200 - apic id 9 OS cpu 10, Affinity mask 000400 - apic id a OS cpu 11, Affinity mask 000800 - apic id b OS cpu 12, Affinity mask 001000 - apic id c OS cpu 13, Affinity mask 002000 - apic id d OS cpu 14, Affinity mask 004000 - apic id e OS cpu 15, Affinity mask 008000 - apic id f Package 0 Cache and Thread details Box Description: Cache is cache level designator Size is cache size OScpu# is cpu # as seen by OS Core is core#[_thread# if > 1 thread/core] inside socket AffMsk is AffinityMask(extended hex) for core and thread CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache CmbMsk will differ from AffMsk if > 1 hw_thread/cache Extended Hex replaces trailing zeroes with 'z#' where # is number of zeroes (so '8z5' is '0x800000') L1D is Level 1 Data cache, size(KBytes)= 32, Cores/cache= 1, Caches/package= 4 L1I is Level 1 Instruction cache, size(KBytes)= 32, Cores/cache= 2, Caches/package= 2 L2 is Level 2 Unified cache, size(KBytes)= 256, Cores/cache= 2, Caches/package= 2 L3 is Level 3 Unified cache, size(KBytes)= 25600, Cores/cache= 16, Caches/package= 0 +-----+-----+-----+-----+ Cache | L1D| L1D| L1D| L1D| Size | 32K| 32K| 32K| 32K| OScpu#| 0| 1| 2| 3| Core |c0_t0|c0_t1|c1_t0|c1_t1| AffMsk| 1| 2| 4| 8| +-----+-----+-----+-----+ Cache | L1I | L1I | Size | 32K | 32K | CmbMsk| 3 | c | +-----------+-----------+ Cache | L2 | L2 | Size | 256K | 256K | +-----------+-----------+ Cache | L3 | Size | 25M | CmbMsk| f | +-----------------------+ Package 1 Cache and Thread details Box Description: Cache is cache level designator Size is cache size OScpu# is cpu # as seen by OS Core is core#[_thread# if > 1 thread/core] inside socket AffMsk is AffinityMask(extended hex) for core and thread CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache CmbMsk will differ from AffMsk if > 1 hw_thread/cache Extended Hex replaces trailing zeroes with 'z#' where # is number of zeroes (so '8z5' is '0x800000') +-----+-----+-----+-----+ Cache | L1D| L1D| L1D| L1D| Size | 32K| 32K| 32K| 32K| OScpu#| 4| 5| 6| 7| Core |c0_t0|c0_t1|c1_t0|c1_t1| AffMsk| 10| 20| 40| 80| +-----+-----+-----+-----+ Cache | L1I | L1I | Size | 32K | 32K | CmbMsk| 30 | c0 | +-----------+-----------+ Cache | L2 | L2 | 
Size | 256K | 256K | +-----------+-----------+ Cache | L3 | Size | 25M | CmbMsk| f0 | +-----------------------+ Package 2 Cache and Thread details Box Description: Cache is cache level designator Size is cache size OScpu# is cpu # as seen by OS Core is core#[_thread# if > 1 thread/core] inside socket AffMsk is AffinityMask(extended hex) for core and thread CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache CmbMsk will differ from AffMsk if > 1 hw_thread/cache Extended Hex replaces trailing zeroes with 'z#' where # is number of zeroes (so '8z5' is '0x800000') +-----+-----+-----+-----+ Cache | L1D| L1D| L1D| L1D| Size | 32K| 32K| 32K| 32K| OScpu#| 8| 9| 10| 11| Core |c0_t0|c0_t1|c1_t0|c1_t1| AffMsk| 100| 200| 400| 800| +-----+-----+-----+-----+ Cache | L1I | L1I | Size | 32K | 32K | CmbMsk| 300 | c00 | +-----------+-----------+ Cache | L2 | L2 | Size | 256K | 256K | +-----------+-----------+ Cache | L3 | Size | 25M | CmbMsk| f00 | +-----------------------+ Package 3 Cache and Thread details Box Description: Cache is cache level designator Size is cache size OScpu# is cpu # as seen by OS Core is core#[_thread# if > 1 thread/core] inside socket AffMsk is AffinityMask(extended hex) for core and thread CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache CmbMsk will differ from AffMsk if > 1 hw_thread/cache Extended Hex replaces trailing zeroes with 'z#' where # is number of zeroes (so '8z5' is '0x800000') +-----+-----+-----+-----+ Cache | L1D| L1D| L1D| L1D| Size | 32K| 32K| 32K| 32K| OScpu#| 12| 13| 14| 15| Core |c0_t0|c0_t1|c1_t0|c1_t1| AffMsk| 1z3| 2z3| 4z3| 8z3| +-----+-----+-----+-----+ Cache | L1I | L1I | Size | 32K | 32K | CmbMsk| 3z3 | cz3 | +-----------+-----------+ Cache | L2 | L2 | Size | 256K | 256K | +-----------+-----------+ Cache | L3 | Size | 25M | CmbMsk| fz3 | +-----------------------+ [-- Attachment #3: sched_domains --] [-- Type: text/plain, Size: 364 bytes --] cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 cat /proc/sys/kernel/sched_domain/cpu*/domain*/names MT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC [-- Attachment #4: cpuinfo --] [-- Type: text/plain, Size: 12642 bytes --] processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.67 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.67 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 2 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.67 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 2 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.67 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 1 siblings : 4 core id : 0 cpu cores : 2 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.43 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 1 siblings : 4 core id : 0 cpu cores : 2 apicid : 5 initial apicid : 5 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.43 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 6 vendor_id : GenuineIntel cpu 
family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 1 siblings : 4 core id : 1 cpu cores : 2 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.43 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 1 siblings : 4 core id : 1 cpu cores : 2 apicid : 7 initial apicid : 7 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.43 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 8 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 2 siblings : 4 core id : 0 cpu cores : 2 apicid : 8 initial apicid : 8 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.48 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 9 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 2 siblings : 4 core id : 0 cpu cores : 2 apicid : 9 initial apicid : 9 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.48 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 10 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 2 siblings : 4 core id : 1 cpu cores : 2 apicid : 10 initial apicid : 10 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc 
rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.48 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 11 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 2 siblings : 4 core id : 1 cpu cores : 2 apicid : 11 initial apicid : 11 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.48 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 12 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 3 siblings : 4 core id : 0 cpu cores : 2 apicid : 12 initial apicid : 12 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5608.87 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 13 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 3 siblings : 4 core id : 0 cpu cores : 2 apicid : 13 initial apicid : 13 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5608.87 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 14 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 3 siblings : 4 core id : 1 cpu cores : 2 apicid : 14 initial apicid : 14 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5608.87 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 15 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 
2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 3 siblings : 4 core id : 1 cpu cores : 2 apicid : 15 initial apicid : 15 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5608.87 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: [-- Attachment #5: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
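On the question raised above about the guest reporting 4 packages and L1D caches that are not shared between SMT siblings: one way to tell whether that comes from the CPUID values the guest is actually given, or from the enumeration tool's interpretation of them, is to dump the raw topology and cache leaves from inside the domU. The following is only an illustrative sketch (it assumes an x86 guest and GCC's <cpuid.h>; the file name is made up, and it is not part of Joao's patches or of the test harness):

/* cpuid-topo.c (name made up): print the CPUID data the guest sees for
 * topology (leaf 0xb) and cache sharing (leaf 4), for comparison with the
 * enumeration tool output attached above.
 * Build with: gcc -o cpuid-topo cpuid-topo.c */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 0xb: x2APIC topology levels (level type 1 = SMT, 2 = Core). */
    for (unsigned int lvl = 0; ; lvl++) {
        if (!__get_cpuid_count(0xb, lvl, &eax, &ebx, &ecx, &edx))
            break;
        unsigned int type = (ecx >> 8) & 0xff;
        if (type == 0)
            break;
        printf("leaf 0xb level %u: type %u, shift %u, logical CPUs %u, x2APIC id %u\n",
               lvl, type, eax & 0x1f, ebx & 0xffff, edx);
    }

    /* Leaf 4: for each cache, how many logical CPUs may share it. */
    for (unsigned int idx = 0; ; idx++) {
        if (!__get_cpuid_count(4, idx, &eax, &ebx, &ecx, &edx))
            break;
        if ((eax & 0x1f) == 0)   /* cache type 0: no more caches */
            break;
        printf("cache %u: level L%u, shared by up to %u logical CPUs, cores per package %u\n",
               idx, (eax >> 5) & 0x7,
               ((eax >> 14) & 0xfff) + 1,
               ((eax >> 26) & 0x3f) + 1);
    }
    return 0;
}

Both the guest kernel and the enumeration utility derive package/core/cache-sharing information from these two leaves plus the APIC IDs, so running this on each vCPU in turn should show directly whether the "4 packages" and the unshared L1D are what the emulated CPUID actually says, or an artefact of how the tool interprets it.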
* Re: schedulers and topology exposing questions 2016-01-27 15:27 ` Konrad Rzeszutek Wilk 2016-01-27 15:53 ` George Dunlap 2016-01-27 16:03 ` Elena Ufimtseva @ 2016-01-28 15:10 ` Dario Faggioli 2016-01-29 3:27 ` Konrad Rzeszutek Wilk 2 siblings, 1 reply; 22+ messages in thread From: Dario Faggioli @ 2016-01-28 15:10 UTC (permalink / raw) To: Konrad Rzeszutek Wilk, George Dunlap Cc: Elena Ufimtseva, george.dunlap, joao.m.martins, boris.ostrovsky, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 1755 bytes --] On Wed, 2016-01-27 at 10:27 -0500, Konrad Rzeszutek Wilk wrote: > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > > > > I'm not sure I understand the situation right, but it sounds a bit > > like > > what you're seeing is just a quirk of the fact that Linux doesn't > > always > > send IPIs to wake other processes up (either by design or by > > accident), > > It does and it does not :-) > > > but relies on scheduling timers to check for work to > > do. Presumably > > It .. I am not explaining it well. The Linux kernel scheduler when > called for 'schedule' (from the UDP sendmsg) would either pick the > next > appliction and do a context swap - of if there were none - go to > sleep. > [Kind of - it also may do an IPI to the other CPU if requested ,but > that requires > some hints from underlaying layers] > Since there were only two apps on the runqueue - udp sender and udp > receiver > it would run them back-to back (this is on baremetal) > > However if SMT was not exposed - the Linux kernel scheduler would put > those > on each CPU runqueue. Meaning each CPU only had one app on its > runqueue. > > Hence no need to do an context switch. > [unless you modified the UDP message to have a timeout, then it would > send an IPI] > So, may I ask what piece of (Linux) code are we actually talking about? Because I had a quick look, and could not find where what you describe happens.... Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
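To make the workload in this exchange concrete for readers joining here: the user-space side of the ping-pong boils down to a client loop like the sketch below. This is a reconstruction for illustration only; the actual test program is not posted in this thread, and the server address, port and iteration count are made up.

/* udp ping-pong client sketch: send one small message, block until the
 * echo comes back, repeat n times. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(5001);                      /* assumed port */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);  /* assumed server address */

    char msg[32] = "ping", reply[32];
    for (int i = 0; i < 100000; i++) {               /* "n number of times" */
        sendto(fd, msg, sizeof(msg), 0, (struct sockaddr *)&srv, sizeof(srv));
        /* Blocks in the kernel's UDP receive path until the reply arrives;
         * this is where the schedule()/wakeup behaviour under discussion
         * comes into play. */
        recv(fd, reply, sizeof(reply), 0);
    }
    return 0;
}

The matching server/receiver side of the sketch appears after Konrad's reply below.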
* Re: schedulers and topology exposing questions 2016-01-28 15:10 ` Dario Faggioli @ 2016-01-29 3:27 ` Konrad Rzeszutek Wilk 2016-02-02 11:45 ` Dario Faggioli 0 siblings, 1 reply; 22+ messages in thread From: Konrad Rzeszutek Wilk @ 2016-01-29 3:27 UTC (permalink / raw) To: Dario Faggioli Cc: Elena Ufimtseva, george.dunlap, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote: > On Wed, 2016-01-27 at 10:27 -0500, Konrad Rzeszutek Wilk wrote: > > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > > > > > > I'm not sure I understand the situation right, but it sounds a bit > > > like > > > what you're seeing is just a quirk of the fact that Linux doesn't > > > always > > > send IPIs to wake other processes up (either by design or by > > > accident), > > > > It does and it does not :-) > > > > > but relies on scheduling timers to check for work to > > > do. Presumably > > > > It .. I am not explaining it well. The Linux kernel scheduler when > > called for 'schedule' (from the UDP sendmsg) would either pick the > > next > > appliction and do a context swap - of if there were none - go to > > sleep. > > [Kind of - it also may do an IPI to the other CPU if requested ,but > > that requires > > some hints from underlaying layers] > > Since there were only two apps on the runqueue - udp sender and udp > > receiver > > it would run them back-to back (this is on baremetal) > > > > However if SMT was not exposed - the Linux kernel scheduler would put > > those > > on each CPU runqueue. Meaning each CPU only had one app on its > > runqueue. > > > > Hence no need to do an context switch. > > [unless you modified the UDP message to have a timeout, then it would > > send an IPI] > > > So, may I ask what piece of (Linux) code are we actually talking about? > Because I had a quick look, and could not find where what you describe > happens.... udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT but you can alter the UDP by having a diffrent timeout. And MAX_SCHEDULE_TIMEOUT when it eventually calls 'schedule()' just goes to sleep (HLT) and eventually gets woken up VIRQ_TIMER. The other side - udp_sendmsg is more complex, and I don't seem to have the stacktrace. > > Thanks and Regards, > Dario > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > ^ permalink raw reply [flat|nested] 22+ messages in thread
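The matching server/receiver side of the sketch above is where the knob Konrad mentions lives in user space: left alone, the socket's receive timeout stays at its default, which is what ends up as MAX_SCHEDULE_TIMEOUT in sk->sk_rcvtimeo, while setting SO_RCVTIMEO hands __skb_recv_datagram() a finite timeout to pass down to schedule_timeout(). Again only a sketch; the port number is made up.

/* udp ping-pong server sketch: echo each packet straight back. */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5001);                     /* assumed port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* With no SO_RCVTIMEO set, sk->sk_rcvtimeo keeps its default
     * (MAX_SCHEDULE_TIMEOUT) and the recvfrom() below sleeps in schedule()
     * with no timer armed.  Uncommenting this gives the receive path a real
     * timeout, i.e. the schedule_timeout()-with-timer branch. */
    /*
    struct timeval tv = { 1, 0 };
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    */

    char buf[64];
    struct sockaddr_in peer;
    socklen_t plen;
    for (;;) {
        plen = sizeof(peer);
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                             (struct sockaddr *)&peer, &plen);
        if (n > 0)   /* echo the single packet straight back */
            sendto(fd, buf, n, 0, (struct sockaddr *)&peer, plen);
    }
}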
* Re: schedulers and topology exposing questions 2016-01-29 3:27 ` Konrad Rzeszutek Wilk @ 2016-02-02 11:45 ` Dario Faggioli 2016-02-03 18:05 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 22+ messages in thread From: Dario Faggioli @ 2016-02-02 11:45 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Elena Ufimtseva, george.dunlap, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky [-- Attachment #1.1: Type: text/plain, Size: 5976 bytes --] On Thu, 2016-01-28 at 22:27 -0500, Konrad Rzeszutek Wilk wrote: > On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote: > > > > So, may I ask what piece of (Linux) code are we actually talking > > about? > > Because I had a quick look, and could not find where what you > > describe > > happens.... > > udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout > The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT but you can alter the > UDP by having a diffrent timeout. > Ha, recvmsg! At some point you mentioned sendmsg, and I was looking there and seeing nothing! But yes, it indeed makes sense to consider the receiving side... let me have a look... So, it looks to me that this is what happens:
udp_recvmsg(noblock=0)
  |
  ---> __skb_recv_datagram(flags=0) {
         timeo = sock_rcvtimeo(flags=0) /* returns sk->sk_rcvtimeo */
         do {...} wait_for_more_packets(timeo);
                    |
                    ---> schedule_timeout(timeo)
So, at least in Linux 4.4, the timeout used is the one defined in sk->sk_rcvtimeo, which looks to me to be this (unless I've followed some link wrong, which can well be the case): http://lxr.free-electrons.com/source/include/uapi/asm-generic/socket.h#L31 #define SO_RCVTIMEO 20 So there looks to be a timeout. But anyways, let's check schedule_timeout(). > And MAX_SCHEDULE_TIMEOUT when it eventually calls 'schedule()' just > goes to sleep (HLT) and eventually gets woken up VIRQ_TIMER. > So, if the timeout is MAX_SCHEDULE_TIMEOUT, the function does:
schedule_timeout(MAX_SCHEDULE_TIMEOUT) {
    schedule();
    return;
}
If the timeout is anything other than MAX_SCHEDULE_TIMEOUT (but still a valid value), the function does:
schedule_timeout(timeout) {
    struct timer_list timer;

    setup_timer_on_stack(&timer);
    __mod_timer(&timer);
    schedule();
    del_singleshot_timer_sync(&timer);
    destroy_timer_on_stack(&timer);
    return;
}
So, in both cases, it pretty much calls schedule() just about immediately. And when schedule() is called, the calling process -- which would be our UDP receiver -- goes to sleep. The difference is that, in case of MAX_SCHEDULE_TIMEOUT, it does not arrange for anyone to wakeup the thread that is going to sleep. In theory, it could even be stuck forever... Of course, this depends on whether the receiver thread is on a runqueue or not and, in case it's not, on whether its status is TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, etc., and, in practice, it never happens! :-D In this case, I think we take the other branch (the one 'with timeout'). But even if we would take this one, I would expect the receiver thread to not be on any runqueue, but rather to be (either in interruptible or not state) in a blocking list from where it is taken out when a packet arrives. In case of anything other than MAX_SCHEDULE_TIMEOUT, all the above is still true, but a timer is set before calling schedule() and putting the thread to sleep. 
And in fact, schedule_timeout() is not a different way, with respect to just calling schedule(), to going to sleep. It is the way you go to sleep for at most some amount of time... But in all cases, you just and immediately go to sleep! And I also am not sure I see where all that discussion you've had with George about IPIs fit into this all... The IPI that will trigger the call to schedule() that will actually put back to execution the thread that we're putting to sleep in here (i.e., the receiver), happens when the sender manages to send a packet (actually, when the packet arrives, I think) _or_ when the timer expires. The two possible calls to schedule() in schedule_timeout() behave exactly in the same way, and I don't think having a timeout or not is responsible for any particular behavior. What I think it's happening is this: when such a call to schedule() (from inside schedule_timeout(), I mean) is made what happens is that the receiver task just goes to sleep, and another one, perhaps the sender, is executed. The sender sends the packet, which arrives before the timeout, and the receiver is woken up. *Here* is where an IPI should or should not happen, depending on where our receiver task is going to be executed! And where would that be? Well, that depends on the Linux scheduler's load balancer, the behavior of which is controlled by scheduling domains flags like BALANCE_FORK, BALANCE_EXEC, BALANCE_WAKE, WAKE_AFFINE and PREFER_SIBLINGS (and others, but I think these are the most likely ones to be involved here). So, in summary, where the receiver executes when it wakes up on what is the configuration of such flags in the (various) scheduling domain(s). Check, for instance, this path: try_to_wakeu_up() --> select_task_irq() --> select_task_rq_fair() The reason why the tests 'reacts' to topology changes is that which set of flags is used for the various scheduling domains is, during the time the scheduling domains themselves are created and configured-- depends on topology... So it's quite possible that exposing the SMT topology, wrt to not doing so, makes one of the flag flip in a way which makes the benchmark work better. If you play with the flags above (or whatever they equivalents were in 2.6.39) directly, even without exposing the SMT-topology, I'm quite sure you would be able to trigger the same behavior. Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-02-02 11:45 ` Dario Faggioli @ 2016-02-03 18:05 ` Konrad Rzeszutek Wilk 0 siblings, 0 replies; 22+ messages in thread From: Konrad Rzeszutek Wilk @ 2016-02-03 18:05 UTC (permalink / raw) To: Dario Faggioli Cc: Elena Ufimtseva, george.dunlap, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky On Tue, Feb 02, 2016 at 12:45:00PM +0100, Dario Faggioli wrote: > On Thu, 2016-01-28 at 22:27 -0500, Konrad Rzeszutek Wilk wrote: > > On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote: > > > > > > So, may I ask what piece of (Linux) code are we actually talking > > > about? > > > Because I had a quick look, and could not find where what you > > > describe > > > happens.... > > > > udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout > > The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT but you can alter the > > UDP by having a diffrent timeout. > > > Ha, recvmsg! At some point you mentioned sendmsg, and I was looking > there and seeing nothing! But yes, it indeed makes sense to consider > the receiving side... let me have a look... > > So, it looks to me that this is what happens: > > udp_recvmsg(noblock=0) > | > ---> __skb_recv_datagram(flags=0) { > timeo = sock_rcvtimeo(flags=0) /* returns sk->sk_rcvtimeo */ > do {...} wait_for_more_packets(timeo); > | > ---> schedule_timeor(timeo) > > So, at least in Linux 4.4, the timeout used is the one defined in > sk->sk_rcvtimeo, which it looks to me to be this (unless I've followed > some link wrong, which can well be the case): > > http://lxr.free-electrons.com/source/include/uapi/asm-generic/socket.h#L31 > #define SO_RCVTIMEO 20 > > So there looks to be a timeout. But anyways, let's check > schedule_timeout(). > > > And MAX_SCHEDULE_TIMEOUT when it eventually calls 'schedule()' just > > goes to sleep (HLT) and eventually gets woken up VIRQ_TIMER. > > > So, if the timeout is MAX_SCHEDULE_TIMEOUT, the function does: > > schedule_timeout(SCHEDULE_TIMEOUT) { > schedule(); > return; > } > > If the timeout is anything else than MAX_SCHEDULE_TIMEOUT (but still a > valid value), the function does: > > schedule_timeout(timeout) { > struct timer_list timer; > > setup_timer_on_stack(&timer); > __mod_timer(&timer); > schedule(); > del_singleshot_timer_sync(&timer); > destroy_timer_on_stack(&timer); > return; > } > > So, in both cases, it pretty much calls schedule() just about > immediately. And when schedule() it's called, the calling process -- > which would be out UDP receiver-- goes to sleep. > > The difference is that, in case of MAX_SCHEDULE_TIMEOUT, it does not > arrange for anyone to wakeup the thread that is going to sleep. In > theory, it could even be stuck forever... Of course, this depends on > whether the receiver thread is on a runqueue or not, if (in case it's > not) if it's status is TASK_INTERRUPTIBLE OR TASK_UNINTERRUPTIBLE, > etc., and, in prractice, it never happens! :-D > > In this case, I think we take the other branch (the one 'with > timeout'). But even if we would take this one, I would expect the > receiver thread to not be on any runqueue, but yet to be (either in > interruptible or not state) in a blocking list from where it is taken > out when a packet arrives. > > In case of anything different than MAX_SCHEDULE_TIMEOUT, all the above > is still true, but a timer is set before calling schedule() and putting > the thread to sleep. 
This means that, in case nothing that would wakeup > such thread happens, or in case it hasn't happened yet when the timeout > expires, the thread is woken up by the timer. Right. > > And in fact, schedule_timeout() is not a different way, with respect to > just calling schedule(), to going to sleep. It is the way you go to > sleep for at most some amount of time... But in all cases, you just and > immediately go to sleep! > > And I also am not sure I see where all that discussion you've had with > George about IPIs fit into this all... The IPI that will trigger the > call to schedule() that will actually put back to execution the thread > that we're putting to sleep in here (i.e., the receiver), happens when > the sender manages to send a packet (actually, when the packet arrives, > I think) _or_ when the timer expires. The IPI were observed when SMT was exposed to the guest. That is because the Linux scheduler put both applications on the same CPU - udp_sender and udp_receiver. Which meant that the 'schedule' call would immediately pick the next application (udp_sender) and schedule it (and send an IPI to itself to do that). > > The two possible calls to schedule() in schedule_timeout() behave > exactly in the same way, and I don't think having a timeout or not is > responsible for any particular behavior. Correct. The quirk was that if the applications were on seperate CPUs - the "thread [would be] woken up by the timer". While if they were on the same CPU - the scheduler would pick the next application on the run-queue (which coincidentally was the UDP sender - or receiver). > > What I think it's happening is this: when such a call to schedule() > (from inside schedule_timeout(), I mean) is made what happens is that > the receiver task just goes to sleep, and another one, perhaps the > sender, is executed. The sender sends the packet, which arrives before > the timeout, and the receiver is woken up. Yes! > > *Here* is where an IPI should or should not happen, depending on where > our receiver task is going to be executed! And where would that be? > Well, that depends on the Linux scheduler's load balancer, the behavior > of which is controlled by scheduling domains flags like BALANCE_FORK, > BALANCE_EXEC, BALANCE_WAKE, WAKE_AFFINE and PREFER_SIBLINGS (and > others, but I think these are the most likely ones to be involved > here). Probably. > > So, in summary, where the receiver executes when it wakes up on what is > the configuration of such flags in the (various) scheduling domain(s). > Check, for instance, this path: > > try_to_wakeu_up() --> select_task_irq() --> select_task_rq_fair() > > The reason why the tests 'reacts' to topology changes is that which set > of flags is used for the various scheduling domains is, during the time > the scheduling domains themselves are created and configured-- depends > on topology... So it's quite possible that exposing the SMT topology, > wrt to not doing so, makes one of the flag flip in a way which makes > the benchmark work better. /me nods. > > If you play with the flags above (or whatever they equivalents were in > 2.6.39) directly, even without exposing the SMT-topology, I'm quite > sure you would be able to trigger the same behavior. I did. And that was the work-around - echo 4xyz flag in the SysFS domain and suddenly things go much faster. . 
> > Regards, > Dario > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > ^ permalink raw reply [flat|nested] 22+ messages in thread
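For completeness: the work-around Konrad describes amounts to rewriting the per-CPU flags files shown in the sched_domains attachment earlier in the thread, i.e. something along the lines of "for f in /proc/sys/kernel/sched_domain/cpu*/domain0/flags; do echo 4655 > $f; done" inside the 2.6.39 guest (4655 being the value discussed above). Whether those files are writable at all, and whether domain0 is the right domain to touch, depends on the kernel version and on it being built with SCHED_DEBUG, so this is a debugging aid for confirming the hypothesis rather than a fix.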
* Re: schedulers and topology exposing questions 2016-01-22 16:54 schedulers and topology exposing questions Elena Ufimtseva 2016-01-22 17:29 ` Dario Faggioli 2016-01-26 11:21 ` George Dunlap @ 2016-01-27 14:01 ` Dario Faggioli 2016-01-28 18:51 ` Elena Ufimtseva 2 siblings, 1 reply; 22+ messages in thread From: Dario Faggioli @ 2016-01-27 14:01 UTC (permalink / raw) To: Elena Ufimtseva, xen-devel, george.dunlap, konrad.wilk, joao.m.martins, boris.ostrovsky [-- Attachment #1.1: Type: text/plain, Size: 14426 bytes --] On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote: > Hello all! > Hey, here I am again, > Konrad came up with a workaround that was setting the flag for domain > scheduler in linux > As the guest is not aware of SMT-related topology, it has a flat > topology initialized. > Kernel has domain scheduler flags for scheduling domain CPU set to > 4143 for 2.6.39. > Konrad discovered that changing the flag for CPU sched domain to 4655 > So, as you've seen, I also have been up to doing quite a few of benchmarking doing soemthing similar (I used more recent kernels, and decided to test 4131 as flags. In your casse, according to this: http://lxr.oss.org.cn/source/include/linux/sched.h?v=2.6.39#L807 4655 means: SD_LOAD_BALANCE | SD_BALANCE_EXEC | SD_BALANCE_WAKE | SD_PREFER_LOCAL | [*] SD_SHARE_PKG_RESOURCES | SD_SERIALIZE and another bit (0x4000) that I don't immediately see what it is. Things have changed a bit since then, it appears. However, I'm quite sure I've tested turning on SD_SERIALIZE in 4.2.0 and 4.3.0, and results were really pretty bad (as you also seem to say later). > works as a workaround and makes Linux think that the topology has SMT > threads. > Well, yes and no. :-). I don't want to make this all a terminology bunfight, something that also matters here is how many scheduling domains you have. To check that (although in recent kernels) you check here: ls /proc/sys/kernel/sched_domain/cpu2/ (any cpu is ok) and see how many domain[0-9] you have. On baremetal, on an HT cpu, I've got this: $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name SMT MC So, two domains, one of which is the SMT one. If you check their flags, they're different: $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags 4783 559 So, yes, you are right in saying that 4655 is related to SMT. In fact, it is what (among other things) tells the load balancer that *all* the cpus (well, all the scheduling groups, actually) in this domain are SMT siblings... Which is a legitimate thing to do, but it's not what happens on SMT baremetal. At least is consistent, IMO. I.e., it still creates a pretty flat topology, like there was a big core, of which _all_ the vcpus are part of, as SMT siblings. The other option (the one I'm leaning toward) was too get rid of that one flag. I've only done preliminary experiments with it on and off, and the ones with it off were better looking, so I did keep it off for the big run... but we can test with it again. > This workaround makes the test to complete almost in same time as on > baremetal (or insignificantly worse). > > This workaround is not suitable for kernels of higher versions as we > discovered. > There may be more than one reason for this (as said, a lot changed!) but it matches what I've found when SD_SERIALIZE was kept on for the scheduling domain where all the vcpus are. 
> The hackish way of making domU linux think that it has SMT threads > (along with matching cpuid) > made us thinks that the problem comes from the fact that cpu topology > is not exposed to > guest and Linux scheduler cannot make intelligent decision on > scheduling. > As said, I think it's the other way around: we expose too much of it (and this is more of an issue for PV rather than for HVM). Basically, either you do the pinning you're doing or, whatever you expose, will be *wrong*... and the only way to expose not wrong data is to actually don't expose anything! :-) > The test described above was labeled as IO-bound test. > > We have run io-bound test with and without smt-patches. The > improvement comparing > to base case (no smt patches, flat topology) shows 22-23% gain. > I'd be curious to see the content of the /proc/sys/kernel/sched_domain directory and subdirectories with Joao's patches applied. > While we have seen improvement with io-bound tests, the same did not > happen with cpu-bound workload. > As cpu-bound test we use kernel module which runs requested number of > kernel threads > and each thread compresses and decompresses some data. > That is somewhat what I would have expected, although up to what extent, it's hard to tell in advance. It also matches my findings, both for the results I've already shared on list, and for others that I'll be sharing in a bit. > Here is the setup for tests: > Intel Xeon E5 2600 > 8 cores, 25MB Cashe, 2 sockets, 2 threads per core. > Xen 4.4.3, default timeslice and ratelimit > Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+. > Dom0: kernel 4.1.0, 2 vcpus, not pinned. > DomU has 8 vcpus (except some cases). > > > For io-bound tests results were better with smt patches applied for > every kernel. > > For cpu-bound test the results were different depending on wether > vcpus were > pinned or not, how many vcpus were assigned to the guest. > Right. In general, this also makes sense... Can we see the actual numbers? I mean the results of the tests with improvements/regressions highlighted, in addition to the traces that you already shared? > Please take a look at the graph captured by xentrace -e 0x0002f000 > On the graphs X is time in seconds since xentrace start, Y is the > pcpu number, > the graph itself represent the event when scheduler places vcpu to > pcpu. > > The graphs #1 & #2: > trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io bound test, > one client/server > trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu bound test, > 8 kernel theads > config: domu, 8vcpus not pinned, smt patches not applied, 2.3.69 > kernel. > Ok, so this is the "baseline", the result of just running your tests in a pretty standard Xen and Dom0 and DomU status and configurations, right? > As can be seen here scheduler places the vcpus correctly on empty > cores. > As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this? > Take a look at > trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png > where I split data per vcpus. > Well, why not, I would say? I mean, where a vcpu starts to run at an arbitrary point in time, especially if the system is otherwise idle before, it can be considered random (it's not, it depends on both the vcpu's and system's previous history, but in a non-linear way, and that is not in the graph anyway). In any case, since there are idle cores, the fact that vcpus do not move much, even if they're not pinned, I consider it a good thing, don't you? 
If vcpuX wakes up on processor Y, where it has always run before, and it find out it can still run there, migrating somewhere else would be pure overhead. The only potential worry of mine about trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png, is that vcpus 4 and 7 (or 4 and 2, colors are too similar to be sure) run for some time (the burst around t=17), on pcpus 5 and 6. Are these two pcpus SMT siblings? Doing the math myself on pCPUs IDs, I don't think they are, so all would be fine. If they are, that should not happen. However, you're using 4.4, so even if you had an issue there, we don't know if it's still in staging. In any case and just to be sure, can you produce the output of `xl vcpu-list', while this case is running? > Now to cpu-bound tests. > When smt patches applied and vcpus pinned correctly to match the > topology and > guest become aware of the topology, cpu-bound tests did not show > improvement with kernel 2.6.39. > With upstream kernel we see some improvements. The tes was repeated 5 > times back to back. > Again, 'some' being? > The number of vcpus was increased to 16 to match the test case where > linux was not > aware of the topology and assumed all cpus as cores. > > On some iterations one can see that vcpus are being scheduled as > expected. > For some runs the vcpus are placed on came core (core/thread) (see > trace_cpu_16vcpus_8threads_5runs.out.plot.err.png). > It doubles the time it takes for test to complete (first three runs > show close to baremetal execution time). > No, sorry, I don't think I fully understood this part. So: 1. can you point me at where (time equal to ?) what you are saying happens? 2. more important, you are saying that the vcpus are pinned. If you pin the vcpus they just should not move. Period. If they move, if's a bug, no matter where they go and what the other SMT sibling of the pcpu where they go is busy or idle! :-O So, are you saying that you pinned the vcpus of the guest and you see them moving and/or not being _always_ scheduled where you pinned them? Can we see `xl vcpu-list' again, to see how they're actually pinned. > END: cycles: 31209326708 (29 seconds) > END: cycles: 30928835308 (28 seconds) > END: cycles: 31191626508 (29 seconds) > END: cycles: 50117313540 (46 seconds) > END: cycles: 49944848614 (46 seconds) > > Since the vcpus are pinned, then my guess is that Linux scheduler > makes wrong decisions? > Ok, so now it seems to me that you agree that the vcpus don't have much alternatives. If yes (which would be of great relief for me :-) ), it could indeed be that indeed the Linux scheduler is working suboptimally. Perhaps it's worth trying running the benchmark inside the guest with the Linux's threads pinned to the vcpus. That should give you perfectly consistent results over all the 5 runs. One more thing. You say you the guest has 16 vcpus, and that there are 8 threads running inside it. However, I seem to be able to identify in the graphs at least a few vertical lines where more than 8 vcpus are running on some pcpu. So, if Linux is working well, and it really only has to place 8 vcpus, it would put them on different cores. However, if at some point in time, there is more than that it has to place, it will have to necessarily 'invade' an already busy core. Am I right in seeing those lines, or are my eyes deceiving me? (I think a per-vcpu breakup of the graph above, like you did for dom0, would help figure this out). > So I ran the test with smt patches enabled, but not pinned vcpus. 
> So I ran the test with the smt patches enabled, but with the vcpus
> not pinned.

AFAICT, this does not make much sense. So, if I understood correctly
what you mean, by doing as you say you're telling Linux that, for
instance, vcpu0 and vcpu1 are SMT siblings, but then Xen is free to run
vcpu0 and vcpu1 at the same time wherever it likes... same core,
different core on the same socket, different socket, etc.

This, I would say, brings us back to the pseudo-random situation we
have by default already, without any patching and any pinning, or just
to a different variant of it.

> The result also shows the same as above (see
> trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png).
> Also see the per-cpu graph
> (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png).

Ok. I'll look at this graph better, with the aim of showing an example
of my theory above, as soon as my brain (which is not in its best shape
today) manages to deal with all the colors (I'm not complaining, BTW,
there's no other way in which you could show this, it's just me! :-D).

> END: cycles: 49740185572 (46 seconds)
> END: cycles: 45862289546 (42 seconds)
> END: cycles: 30976368378 (28 seconds)
> END: cycles: 30886882143 (28 seconds)
> END: cycles: 30806304256 (28 seconds)
>
> I cut the timeslice where it's seen that vcpu0 and vcpu2 run on the
> same core while other cores are idle:
>
> 35v2 9.881103815 7
> 35v0 9.881104013 6
>
> 35v2 9.892746452 7
> 35v0 9.892746546 6   -> vcpu0 gets scheduled right after vcpu2 on the
>                         same core
>
> 35v0 9.904388175 6
> 35v2 9.904388205 7   -> same here
>
> 35v2 9.916029791 7
> 35v0 9.916029992 6

Yes, this, in theory, should not happen. However, our scheduler (like
Linux's, or any other OS's, each perhaps in its own way) can't always
be _instantly_ perfect! In this case, for instance, the SMT load
balancing logic in Credit1 is triggered:
 - from outside of sched_credit.c, by vcpu_migrate(), which is called
   in response to a bunch of events, but _not_ at every vcpu wakeup;
 - from inside sched_credit.c, by csched_vcpu_acct(), if the vcpu has
   been active for a while.

This means it is not triggered upon each and every vcpu wakeup (it
might be, but not for the vcpu that is waking up). So, seeing samples
of a vcpu not being scheduled according to optimal SMT load balancing,
especially right after it woke up, is to be expected. Then, after a
while, the logic should indeed trigger (via csched_vcpu_acct()) and
move the vcpu away to an idle core.

To tell for how long the violation of perfect SMT balancing lasts, and
whether or not it happens as a consequence of task wakeups, we need
more records from the trace file, from around the point where the
violation happens.

Does this make sense to you?

Regards, and thanks for sharing all this! :-)
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
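(For reference, the "pinning to match the topology" discussed in the
exchange above amounts, from dom0, to something like the sketch below,
and `xl vcpu-list' is what confirms it took effect. The guest name and
the SMT sibling numbering, i.e. the second thread of core n being pcpu
n+16 on this 2-socket, 8-core, 2-thread box, are assumptions; the real
layout should be checked first, e.g. with `xl info -n' or `xenpm
get-cpu-topology'.)

    # hypothetical: pair vcpus 2n and 2n+1 of a 16-vcpu guest "domU" on
    # the two SMT threads of core n
    for n in $(seq 0 7); do
        xl vcpu-pin domU $((2 * n))     $n
        xl vcpu-pin domU $((2 * n + 1)) $((n + 16))
    done

    # every vcpu should now show a single pcpu in its affinity and,
    # per the discussion above, never be scheduled anywhere else
    xl vcpu-list domU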
* Re: schedulers and topology exposing questions
  2016-01-27 14:01 ` Dario Faggioli
@ 2016-01-28 18:51 ` Elena Ufimtseva
  0 siblings, 0 replies; 22+ messages in thread
From: Elena Ufimtseva @ 2016-01-28 18:51 UTC (permalink / raw)
To: Dario Faggioli; +Cc: george.dunlap, joao.m.martins, boris.ostrovsky, xen-devel

On Wed, Jan 27, 2016 at 02:01:35PM +0000, Dario Faggioli wrote:
> On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote:
> > Hello all!
>
> Hey, here I am again,
>
> > Konrad came up with a workaround that was setting the flag for the
> > domain scheduler in Linux.
> > As the guest is not aware of SMT-related topology, it has a flat
> > topology initialized.
> > The kernel has the domain scheduler flags for the CPU scheduling
> > domain set to 4143 for 2.6.39.
> > Konrad discovered that changing the flags for the CPU sched domain
> > to 4655
>
> So, as you've seen, I have also been doing quite a bit of benchmarking
> on something similar (I used more recent kernels, and decided to test
> 4131 as the flags value).
>
> In your case, according to this:
> http://lxr.oss.org.cn/source/include/linux/sched.h?v=2.6.39#L807
>
> 4655 means:
>   SD_LOAD_BALANCE |
>   SD_BALANCE_EXEC |
>   SD_BALANCE_WAKE |
>   SD_PREFER_LOCAL |
>   SD_SHARE_PKG_RESOURCES |
>   SD_SERIALIZE
>
> and another bit (0x4000) that I don't immediately see what it is.
>
> Things have changed a bit since then, it appears. However, I'm quite
> sure I've tested turning on SD_SERIALIZE in 4.2.0 and 4.3.0, and the
> results were really pretty bad (as you also seem to say later).
>
> > works as a workaround and makes Linux think that the topology has
> > SMT threads.
>
> Well, yes and no. :-) I don't want to make this all a terminology
> bunfight; something that also matters here is how many scheduling
> domains you have.
>
> To check that (at least in recent kernels) you look here:
>
>   ls /proc/sys/kernel/sched_domain/cpu2/   (any cpu is ok)
>
> and see how many domain[0-9] you have.
>
> On baremetal, on an HT cpu, I've got this:
>
>   $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
>   SMT
>   MC
>
> So, two domains, one of which is the SMT one. If you check their
> flags, they're different:
>
>   $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags
>   4783
>   559
>
> So, yes, you are right in saying that 4655 is related to SMT. In fact,
> it is what (among other things) tells the load balancer that *all* the
> cpus (well, all the scheduling groups, actually) in this domain are
> SMT siblings... which is a legitimate thing to do, but it's not what
> happens on SMT baremetal.
>
> At least it is consistent, IMO. I.e., it still creates a pretty flat
> topology, as if there were one big core of which _all_ the vcpus are
> part, as SMT siblings.
>
> The other option (the one I'm leaning toward) was to get rid of that
> one flag. I've only done preliminary experiments with it on and off,
> and the ones with it off looked better, so I kept it off for the big
> run... but we can test with it again.
>
> > This workaround makes the test complete in almost the same time as
> > on baremetal (or insignificantly worse).
> >
> > This workaround is not suitable for kernels of higher versions, as
> > we discovered.
>
> There may be more than one reason for this (as said, a lot has
> changed!) but it matches what I've found when SD_SERIALIZE was kept on
> for the scheduling domain where all the vcpus are.
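(For completeness, the same inspection can be done inside the guest,
and the numeric flags values that keep coming up in this thread can be
decoded with a rough helper like the one below. The bit assignments
are copied, as an assumption, from a 2.6.39-era include/linux/sched.h;
they get reshuffled between kernel releases, so they should be
re-checked against the exact tree being run.)

    # inside the guest: which scheduling domains were built, and their flags
    grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name \
           /proc/sys/kernel/sched_domain/cpu0/domain*/flags

    # decode a flags value into SD_* names (bit order assumed from 2.6.39)
    decode_sd_flags() {
        val=$1; bit=1
        for name in SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC \
                    SD_BALANCE_FORK SD_BALANCE_WAKE SD_WAKE_AFFINE \
                    SD_PREFER_LOCAL SD_SHARE_CPUPOWER SD_POWERSAVINGS_BALANCE \
                    SD_SHARE_PKG_RESOURCES SD_SERIALIZE SD_ASYM_PACKING \
                    SD_PREFER_SIBLING; do
            [ $((val & bit)) -ne 0 ] && printf '%s ' "$name"
            bit=$((bit << 1))
        done
        echo
    }

    for v in 4143 4655 4783 559; do    # values mentioned in this thread
        printf '%5s: ' "$v"; decode_sd_flags "$v"
    done

On the kernels discussed here the flags files are writable, which is
presumably how the 4143 -> 4655 change was applied at runtime.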
[...]

> > So I ran the test with the smt patches enabled, but with the vcpus
> > not pinned.
>
> AFAICT, this does not make much sense. So, if I understood correctly
> what you mean, by doing as you say you're telling Linux that, for
> instance, vcpu0 and vcpu1 are SMT siblings, but then Xen is free to
> run vcpu0 and vcpu1 at the same time wherever it likes... same core,
> different core on the same socket, different socket, etc.

Correct. I did run this to see what happens in this pseudo-random case.

[...]

> Ok. I'll look at this graph better, with the aim of showing an example
> of my theory above, as soon as my brain (which is not in its best
> shape today) manages to deal with all the colors (I'm not complaining,
> BTW, there's no other way in which you could show this, it's just
> me! :-D).

At the same time, if you think I can improve the data representation,
it will be awesome!

[...]

> Does this make sense to you?

Dario, thanks for the explanations. I am going to verify some numbers,
and I am also collecting more trace data. I will send it shortly;
sorry for the delay.
Elena