* schedulers and topology exposing questions
@ 2016-01-22 16:54 Elena Ufimtseva
2016-01-22 17:29 ` Dario Faggioli
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Elena Ufimtseva @ 2016-01-22 16:54 UTC (permalink / raw)
To: xen-devel, dario.faggioli, george.dunlap, konrad.wilk,
joao.m.martins, boris.ostrovsky
[-- Attachment #1: Type: text/plain, Size: 6756 bytes --]
Hello all!
Dario, George, or anyone else, your help will be appreciated.
Let me give some introduction to our findings. I may forget something or not be
explicit enough, so please ask me.
A customer filed a bug where some of their applications were running slow in their HVM DomU setups.
The running times were compared against bare metal running the same kernel version as the HVM DomU.
After some investigation by different parties, a test case was found
where the problem was easily reproduced. The test app is a UDP server/client pair where the
client passes a message n times.
The test case was executed on bare metal and on a Xen DomU, both with kernel version 2.6.39.
Bare metal showed a 2x better result than the DomU.
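(The actual test program is not attached here. For illustration only, a minimal Python
sketch of this kind of ping-pong pair - host, port, message size and iteration count are
made-up values, not the real test parameters - would look like the following.)

# udp_pingpong.py - minimal sketch of a UDP ping-pong client/server pair.
# The real test binary is not shown in this thread; host, port, message size
# and iteration count are illustrative assumptions only.
import socket
import sys
import time

HOST, PORT, N, MSG = "127.0.0.1", 9999, 100000, b"x"

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((HOST, PORT))
    for _ in range(N):
        data, addr = s.recvfrom(64)   # wait for the client's message
        s.sendto(data, addr)          # echo it straight back

def client():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    start = time.time()
    for _ in range(N):
        s.sendto(MSG, (HOST, PORT))   # one tiny datagram per round trip
        s.recvfrom(64)                # block until the server echoes it back
    print("%d round trips in %.2f s" % (N, time.time() - start))

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client()

(Run "python udp_pingpong.py server" in one shell and "python udp_pingpong.py client"
in another; each round trip is a single tiny datagram, so the workload is dominated by
wakeup/scheduling latency rather than data transfer.)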
Konrad came up with a workaround: setting a flag for the scheduling domain in the Linux guest.
As the guest is not aware of the SMT-related topology, it is initialized with a flat topology.
In 2.6.39 the kernel sets the scheduling domain flags for the CPU domain to 4143.
Konrad discovered that changing the CPU scheduling domain flags to 4655
works as a workaround and makes Linux think that the topology has SMT threads.
With this workaround the test completes in almost the same time as on bare metal (or insignificantly worse).
As we discovered, this workaround is not suitable for newer kernel versions.
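(For reference, here is a small Python sketch that decodes the two flag values. The SD_*
bit definitions below are taken from the 2.6.39-era include/linux/sched.h and are an
assumption on my side, so please re-check them against the exact tree; on kernels built
with CONFIG_SCHED_DEBUG the flags can usually be inspected and changed under
/proc/sys/kernel/sched_domain/cpu*/domain*/flags.)

# decode_sd_flags.py - decode Linux scheduling-domain flag values.
# The SD_* bit values are assumed from 2.6.39-era include/linux/sched.h.
SD_FLAGS = {
    0x0001: "SD_LOAD_BALANCE",
    0x0002: "SD_BALANCE_NEWIDLE",
    0x0004: "SD_BALANCE_EXEC",
    0x0008: "SD_BALANCE_FORK",
    0x0010: "SD_BALANCE_WAKE",
    0x0020: "SD_WAKE_AFFINE",
    0x0040: "SD_PREFER_LOCAL",
    0x0080: "SD_SHARE_CPUPOWER",
    0x0100: "SD_POWERSAVINGS_BALANCE",
    0x0200: "SD_SHARE_PKG_RESOURCES",
    0x0400: "SD_SERIALIZE",
    0x0800: "SD_ASYM_PACKING",
    0x1000: "SD_PREFER_SIBLING",
}

def decode(flags):
    return [name for bit, name in sorted(SD_FLAGS.items()) if flags & bit]

for val in (4143, 4655):
    print(val, hex(val), decode(val))
# Under these assumed definitions the only difference between 4143 and 4655
# is the 0x0200 bit, i.e. SD_SHARE_PKG_RESOURCES.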
The hackish way of making the DomU Linux think that it has SMT threads (along with a matching cpuid)
made us think that the problem comes from the fact that the CPU topology is not exposed to the
guest, so the Linux scheduler cannot make intelligent scheduling decisions.
Joao Martins from Oracle developed a set of patches that fixes the SMT/core/cache
topology numbering, provides matching pinning of vcpus, and adds enabling options,
which allows the correct topology to be exposed to the guest.
I guess Joao will be posting them at some point.
With these patches we decided to test the performance impact on different kernel versions and Xen versions.
The test described above was labeled as the IO-bound test.
We have run the io-bound test with and without the smt patches. Compared
to the base case (no smt patches, flat topology), the improvement is a 22-23% gain.
While we have seen improvement with the io-bound tests, the same did not happen with the cpu-bound workload.
As the cpu-bound test we use a kernel module which runs the requested number of kernel threads,
and each thread compresses and decompresses some data.
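(The kernel module itself is not included in this mail. Purely as an illustration of the
workload shape - not the actual test - a rough userspace analogue in Python could look like
this; worker count, buffer size and iteration count are made-up values.)

# cpu_bound.py - rough userspace analogue of the compress/decompress test.
# The real test is an in-kernel module; the parameters here are illustrative
# assumptions only.
import multiprocessing
import os
import time
import zlib

WORKERS, ITERS, BUF = 8, 200, os.urandom(1 << 20)  # 1 MiB of random data

def worker(_):
    for _ in range(ITERS):
        # compress and decompress the buffer to burn CPU with no I/O
        assert zlib.decompress(zlib.compress(BUF)) == BUF

if __name__ == "__main__":
    start = time.time()
    with multiprocessing.Pool(WORKERS) as pool:
        pool.map(worker, range(WORKERS))
    print("completed in %.1f s" % (time.time() - start))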
Here is the setup for tests:
Intel Xeon E5-2600
8 cores, 25MB cache, 2 sockets, 2 threads per core.
Xen 4.4.3, default timeslice and ratelimit
Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+.
Dom0: kernel 4.1.0, 2 vcpus, not pinned.
DomU has 8 vcpus (except some cases).
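(For the pinned cases, the guest configuration was roughly of the following shape. This is
only a sketch, not the config file actually used; the memory size and the pinning string
are assumptions and depend on the host's pcpu numbering.)

# domU.cfg - illustrative sketch only, not the configuration actually used.
builder = "hvm"
name    = "perf-domu"
memory  = 4096            # assumption: the actual memory size is not stated above
vcpus   = 8               # 16 in the later runs
# For the pinned runs the vcpus were pinned to match the host topology,
# e.g. something along the lines of:
# cpus = "2-9"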
For the io-bound tests, results were better with the smt patches applied, for every kernel.
For the cpu-bound test the results differed depending on whether the vcpus were
pinned or not and on how many vcpus were assigned to the guest.
Please take a look at the graphs captured with xentrace -e 0x0002f000.
On the graphs X is the time in seconds since xentrace started and Y is the pcpu number;
each point represents the event of the scheduler placing a vcpu on a pcpu.
The graphs #1 & #2:
trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io bound test, one client/server
trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu bound test, 8 kernel threads
config: DomU, 8 vcpus not pinned, smt patches not applied, 2.6.39 kernel.
As can be seen here, the scheduler places the vcpus correctly on empty cores.
As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this?
Take a look at trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png
where I split data per vcpus.
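(For anyone who wants to reproduce the per-vcpu split: a small Python sketch of that kind of
post-processing is below. It assumes the formatted trace lines look like
"35v2 9.881103815 7", i.e. domid'v'vcpu, seconds since xentrace start, pcpu, as in the
excerpt quoted further down; the scripts actually used may differ.)

# split_per_vcpu.py - sketch of splitting scheduler-placement events per vcpu.
# Assumes formatted trace lines of the form "35v2 9.881103815 7"
# (domid 'v' vcpu, seconds since xentrace start, pcpu).
import collections
import sys

def load(path):
    per_vcpu = collections.defaultdict(list)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3 or "v" not in parts[0]:
                continue                      # skip anything else in the file
            vcpu, t, pcpu = parts[0], float(parts[1]), int(parts[2])
            per_vcpu[vcpu].append((t, pcpu))  # one (time, pcpu) point per event
    return per_vcpu

if __name__ == "__main__":
    for vcpu, points in sorted(load(sys.argv[1]).items()):
        print(vcpu, "events:", len(points))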
Now to cpu-bound tests.
With the smt patches applied, the vcpus pinned correctly to match the topology, and the
guest made aware of the topology, the cpu-bound tests did not show improvement with kernel 2.6.39.
With an upstream kernel we see some improvements. The test was repeated 5 times back to back.
The number of vcpus was increased to 16 to match the test case where Linux was not
aware of the topology and assumed all cpus were cores.
On some iterations one can see that the vcpus are being scheduled as expected.
For some runs the vcpus are placed on the same core (core/thread siblings) (see trace_cpu_16vcpus_8threads_5runs.out.plot.err.png).
This doubles the time it takes for the test to complete (the first three runs show close-to-baremetal execution time).
END: cycles: 31209326708 (29 seconds)
END: cycles: 30928835308 (28 seconds)
END: cycles: 31191626508 (29 seconds)
END: cycles: 50117313540 (46 seconds)
END: cycles: 49944848614 (46 seconds)
Since the vcpus are pinned, my guess is that the Linux scheduler makes wrong decisions?
So I ran the test with the smt patches enabled, but with the vcpus not pinned.
The result shows the same as above (see trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png):
Also see the per-cpu graph (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png).
END: cycles: 49740185572 (46 seconds)
END: cycles: 45862289546 (42 seconds)
END: cycles: 30976368378 (28 seconds)
END: cycles: 30886882143 (28 seconds)
END: cycles: 30806304256 (28 seconds)
I cut out the time slice where it can be seen that vcpu0 and vcpu2 run on the same core while other cores are idle:
35v2 9.881103815 7
35v0 9.881104013 6
35v2 9.892746452 7
35v0 9.892746546 6 -> vcpu0 gets scheduled right after vcpu2 on same core
35v0 9.904388175 6
35v2 9.904388205 7 -> same here
35v2 9.916029791 7
35v0 9.916029992 6
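(A quick way to spot such cases automatically is to scan the formatted trace for two vcpus
of the same domain landing on sibling pcpus at almost the same time. The Python sketch below
assumes pcpus 2n and 2n+1 are SMT siblings - true for the pcpu 6/7 pair above, but
host-dependent - and the same "35v2 9.881103815 7" line format as before; illustrative only.)

# sibling_check.py - flag events where two vcpus of one domain are placed on
# sibling hyperthreads within a small time window.  Assumes pcpus 2n and 2n+1
# are SMT siblings, which matches the pcpu 6/7 pair above but is host-dependent.
import sys

WINDOW = 0.001  # seconds: events closer than this count as "at the same time"

def events(path):
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3 and "v" in parts[0]:
                dom, vcpu = parts[0].split("v")
                yield dom, int(vcpu), float(parts[1]), int(parts[2])

last = {}  # pcpu -> (dom, vcpu, time) of the most recent placement on that pcpu
for dom, vcpu, t, pcpu in events(sys.argv[1]):
    sib = pcpu ^ 1                      # sibling thread under the 2n/2n+1 assumption
    if sib in last:
        sdom, svcpu, st = last[sib]
        if sdom == dom and svcpu != vcpu and abs(t - st) < WINDOW:
            print("%.9f: d%sv%d on pcpu %d while d%sv%d is on sibling pcpu %d"
                  % (t, dom, vcpu, pcpu, sdom, svcpu, sib))
    last[pcpu] = (dom, vcpu, t)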
Disabling the SMT option in the Linux config (which essentially means that the guest does not
have the correct topology and it is just flat) shows slightly better results - there
are no core/thread pairs being scheduled together while other cores are empty.
END: cycles: 41823591845 (38 seconds)
END: cycles: 41105093568 (38 seconds)
END: cycles: 30987224290 (28 seconds)
END: cycles: 31138979573 (29 seconds)
END: cycles: 31002228982 (28 seconds)
The graph is attached (trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.png).
I may have forgotten something here... Please ask me questions if I did.
Maybe you have some ideas what can be done here?
We are trying to make guests topology-aware, but it looks like for cpu-bound workloads it is
not that easy.
Any suggestions are welcome.
Thank you.
Elena
--
Elena
[-- Attachment #2: trace_cpu_16vcpus_16threads_5runs.out.plot.err.png --]
[-- Type: image/png, Size: 23710 bytes --]
[-- Attachment #3: trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.png --]
[-- Type: image/png, Size: 15999 bytes --]
[-- Attachment #4: trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png --]
[-- Type: image/png, Size: 16211 bytes --]
[-- Attachment #5: trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png --]
[-- Type: image/png, Size: 17335 bytes --]
[-- Attachment #6: trace_cpu_16vcpus_8threads_5runs.out.plot.err.png --]
[-- Type: image/png, Size: 27962 bytes --]
[-- Attachment #7: trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png --]
[-- Type: image/png, Size: 12940 bytes --]
[-- Attachment #8: trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png --]
[-- Type: image/png, Size: 15137 bytes --]
[-- Attachment #9: trace_iobound_nosmt_dom0notpinned.out.plot.err.png --]
[-- Type: image/png, Size: 17837 bytes --]
[-- Attachment #10: trace_cpu_smtapplied_smt0_totalcpus18_notpinned_5iters.out.plot.err.png --]
[-- Type: image/png, Size: 17769 bytes --]
[-- Attachment #11: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: schedulers and topology exposing questions 2016-01-22 16:54 schedulers and topology exposing questions Elena Ufimtseva @ 2016-01-22 17:29 ` Dario Faggioli 2016-01-22 23:58 ` Elena Ufimtseva 2016-01-26 11:21 ` George Dunlap 2016-01-27 14:01 ` Dario Faggioli 2 siblings, 1 reply; 22+ messages in thread From: Dario Faggioli @ 2016-01-22 17:29 UTC (permalink / raw) To: Elena Ufimtseva, xen-devel, george.dunlap, konrad.wilk, joao.m.martins, boris.ostrovsky [-- Attachment #1.1: Type: text/plain, Size: 4464 bytes --] On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote: > Hello all! > Hello, > Let me put some intro to our findings. I may forget something or put > something > not too explicit, please ask me. > > Customer filled a bug where some of the applications were running > slow in their HVM DomU setups. > These running times were compared against baremetal running same > kernel version as HVM DomU. > > After some investigation by different parties, the test case scenario > was found > where the problem was easily seen. The test app is a udp > server/client pair where > client passes some message n number of times. > The test case was executed on baremetal and Xen DomU with kernel > version 2.6.39. > Bare metal showed 2x times better result that DomU. > > Konrad came up with a workaround that was setting the flag for domain > scheduler in linux > As the guest is not aware of SMT-related topology, it has a flat > topology initialized. > Kernel has domain scheduler flags for scheduling domain CPU set to > 4143 for 2.6.39. > Konrad discovered that changing the flag for CPU sched domain to 4655 > works as a workaround and makes Linux think that the topology has SMT > threads. > This workaround makes the test to complete almost in same time as on > baremetal (or insignificantly worse). > > This workaround is not suitable for kernels of higher versions as we > discovered. > > The hackish way of making domU linux think that it has SMT threads > (along with matching cpuid) > made us thinks that the problem comes from the fact that cpu topology > is not exposed to > guest and Linux scheduler cannot make intelligent decision on > scheduling. > So, me an Juergen (from SuSE) have been working on this for a while too. As far as my experiments goes, there are at least two different issues, both traceable to Linux's scheduler behavior. One has to do with what you just say, i.e., topology. Juergen has developed a set of patches, and I'm running benchamrks with them applied to both Dom0 and DomU, to see how they work. I'm not far from finishing running a set of 324 different test cases (each one run both without and with Juergen's patches). I am running different benchamrks, such as: - iperf, - a Xen build, - sysbench --oltp, - sysbench --cpu, - unixbench and I'm also varying how loaded the host is, how big the VMs are, and how loaded the VMs are. 324 is the result of various combinations of the above... It's quite an extensive set! :-P As soon as everything finishes running, I'll data mine the results, and let you know how they look like. The other issue that I've observed is that tweaking some _non_ topology related scheduling domains' flags also impact performance, sometimes in a quite sensible way. I have got the results from the 324 test cases described above of running with flags set to 4131 inside all the DomUs. That value was chosen after quite a bit of preliminary benchmarking and investigation as well. 
I'll share the results of that data set as well as soon as I manage to extract them from the raw output. > Joao Martins from Oracle developed set of patches that fixed the > smt/core/cashe > topology numbering and provided matching pinning of vcpus and > enabling options, > allows to expose to guest correct topology. > I guess Joao will be posting it at some point. > That is one way of approaching the topology issue. The other, which is what me and Juergen are pursuing, is the opposite one, i.e., make the DomU (and Dom0, actually) think that the topology is always completely flat. I think, ideally, we want both: flat topology as the default, if no pinning is specifying. Matching topology if it is. > With this patches we decided to test the performance impact on > different kernel versionand Xen versions. > That is really interesting, and thanks a lot for sharing it with us. I'm in the middle of something here, so I just wanted to quickly let you know that we're also working on something related... I'll have a look at the rest of the email and at the graphs ASAP. Thanks again and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-22 17:29 ` Dario Faggioli @ 2016-01-22 23:58 ` Elena Ufimtseva 0 siblings, 0 replies; 22+ messages in thread From: Elena Ufimtseva @ 2016-01-22 23:58 UTC (permalink / raw) To: Dario Faggioli; +Cc: george.dunlap, joao.m.martins, boris.ostrovsky, xen-devel [-- Attachment #1: Type: text/plain, Size: 4970 bytes --] On Fri, Jan 22, 2016 at 06:29:19PM +0100, Dario Faggioli wrote: > On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote: > > Hello all! > > > Hello, > > > Let me put some intro to our findings. I may forget something or put > > something > > not too explicit, please ask me. > > > > Customer filled a bug where some of the applications were running > > slow in their HVM DomU setups. > > These running times were compared against baremetal running same > > kernel version as HVM DomU. > > > > After some investigation by different parties, the test case scenario > > was found > > where the problem was easily seen. The test app is a udp > > server/client pair where > > client passes some message n number of times. > > The test case was executed on baremetal and Xen DomU with kernel > > version 2.6.39. > > Bare metal showed 2x times better result that DomU. > > > > Konrad came up with a workaround that was setting the flag for domain > > scheduler in linux > > As the guest is not aware of SMT-related topology, it has a flat > > topology initialized. > > Kernel has domain scheduler flags for scheduling domain CPU set to > > 4143 for 2.6.39. > > Konrad discovered that changing the flag for CPU sched domain to 4655 > > works as a workaround and makes Linux think that the topology has SMT > > threads. > > This workaround makes the test to complete almost in same time as on > > baremetal (or insignificantly worse). > > > > This workaround is not suitable for kernels of higher versions as we > > discovered. > > > > The hackish way of making domU linux think that it has SMT threads > > (along with matching cpuid) > > made us thinks that the problem comes from the fact that cpu topology > > is not exposed to > > guest and Linux scheduler cannot make intelligent decision on > > scheduling. > > > So, me an Juergen (from SuSE) have been working on this for a while > too. > > As far as my experiments goes, there are at least two different issues, > both traceable to Linux's scheduler behavior. One has to do with what > you just say, i.e., topology. > > Juergen has developed a set of patches, and I'm running benchamrks with > them applied to both Dom0 and DomU, to see how they work. > > I'm not far from finishing running a set of 324 different test cases > (each one run both without and with Juergen's patches). I am running > different benchamrks, such as: > - iperf, > - a Xen build, > - sysbench --oltp, > - sysbench --cpu, > - unixbench > > and I'm also varying how loaded the host is, how big the VMs are, and > how loaded the VMs are. Thats pretty cool. I also tried in my tests oversubscribed tests. > > 324 is the result of various combinations of the above... It's quite an > extensive set! :-P It is! Even with my few tests its a lot of work. > > As soon as everything finishes running, I'll data mine the results, and > let you know how they look like. > > > The other issue that I've observed is that tweaking some _non_ topology > related scheduling domains' flags also impact performance, sometimes in > a quite sensible way. 
> > I have got the results from the 324 test cases described above of > running with flags set to 4131 inside all the DomUs. That value was > chosen after quite a bit of preliminary benchmarking and investigation > as well. > > I'll share the results of that data set as well as soon as I manage to > extract them from the raw output. > > > Joao Martins from Oracle developed set of patches that fixed the > > smt/core/cashe > > topology numbering and provided matching pinning of vcpus and > > enabling options, > > allows to expose to guest correct topology. > > I guess Joao will be posting it at some point. > > > That is one way of approaching the topology issue. The other, which is > what me and Juergen are pursuing, is the opposite one, i.e., make the > DomU (and Dom0, actually) think that the topology is always completely > flat. > > I think, ideally, we want both: flat topology as the default, if no > pinning is specifying. Matching topology if it is. > > > With this patches we decided to test the performance impact on > > different kernel versionand Xen versions. > > > That is really interesting, and thanks a lot for sharing it with us. > > I'm in the middle of something here, so I just wanted to quickly let > you know that we're also working on something related... I'll have a > look at the rest of the email and at the graphs ASAP. Great! I am attaching the io and cpu-bound tests that were used to get the data. Thanks Dario! > > Thanks again and Regards, > Dario > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > [-- Attachment #2: perf_tests.tar.gz --] [-- Type: application/gzip, Size: 7455 bytes --] [-- Attachment #3: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-22 16:54 schedulers and topology exposing questions Elena Ufimtseva 2016-01-22 17:29 ` Dario Faggioli @ 2016-01-26 11:21 ` George Dunlap 2016-01-27 14:25 ` Dario Faggioli 2016-01-27 14:33 ` Konrad Rzeszutek Wilk 2016-01-27 14:01 ` Dario Faggioli 2 siblings, 2 replies; 22+ messages in thread From: George Dunlap @ 2016-01-26 11:21 UTC (permalink / raw) To: Elena Ufimtseva, xen-devel, dario.faggioli, george.dunlap, konrad.wilk, joao.m.martins, boris.ostrovsky On 22/01/16 16:54, Elena Ufimtseva wrote: > Hello all! > > Dario, Gerorge or anyone else, your help will be appreciated. > > Let me put some intro to our findings. I may forget something or put something > not too explicit, please ask me. > > Customer filled a bug where some of the applications were running slow in their HVM DomU setups. > These running times were compared against baremetal running same kernel version as HVM DomU. > > After some investigation by different parties, the test case scenario was found > where the problem was easily seen. The test app is a udp server/client pair where > client passes some message n number of times. > The test case was executed on baremetal and Xen DomU with kernel version 2.6.39. > Bare metal showed 2x times better result that DomU. > > Konrad came up with a workaround that was setting the flag for domain scheduler in linux > As the guest is not aware of SMT-related topology, it has a flat topology initialized. > Kernel has domain scheduler flags for scheduling domain CPU set to 4143 for 2.6.39. > Konrad discovered that changing the flag for CPU sched domain to 4655 > works as a workaround and makes Linux think that the topology has SMT threads. > This workaround makes the test to complete almost in same time as on baremetal (or insignificantly worse). > > This workaround is not suitable for kernels of higher versions as we discovered. > > The hackish way of making domU linux think that it has SMT threads (along with matching cpuid) > made us thinks that the problem comes from the fact that cpu topology is not exposed to > guest and Linux scheduler cannot make intelligent decision on scheduling. > > Joao Martins from Oracle developed set of patches that fixed the smt/core/cashe > topology numbering and provided matching pinning of vcpus and enabling options, > allows to expose to guest correct topology. > I guess Joao will be posting it at some point. > > With this patches we decided to test the performance impact on different kernel versionand Xen versions. > > The test described above was labeled as IO-bound test. So just to clarify: The client sends a request (presumably not much more than a ping) to the server, and waits for the server to respond before sending another one; and the server does the reverse -- receives a request, responds, and then waits for the next request. Is that right? How much data is transferred? If the amount of data transferred is tiny, then the bottleneck for the test is probably the IPI time, and I'd call this a "ping-pong" benchmark[1]. I would only call this "io-bound" if you're actually copying large amounts of data. Regarding placement wrt topology: If two threads are doing a large amount of communication, then putting them close in the topology will increase perfomance, because they share cache, and the IPI distance between them is much shorter. If they rarely run at the same time, being on the same thread is probably the ideal. 
On the other hand, if two threads are running mostly independently, and each one is using a lot of cache, then having the threads at opposite ends of the topology will increase performance, since that will increase the aggregate cache used by both. The ideal in this case would certainly be for each thread to run on a separate socket. At the moment, neither the Credit1 and Credit2 schedulers take communication into account; they only account for processing time, and thus silently assume that all workloads are cache-hungry and non-communicating. [1] https://www.google.co.uk/search?q=ping+pong+benchmark > We have run io-bound test with and without smt-patches. The improvement comparing > to base case (no smt patches, flat topology) shows 22-23% gain. > > While we have seen improvement with io-bound tests, the same did not happen with cpu-bound workload. > As cpu-bound test we use kernel module which runs requested number of kernel threads > and each thread compresses and decompresses some data. > > Here is the setup for tests: > Intel Xeon E5 2600 > 8 cores, 25MB Cashe, 2 sockets, 2 threads per core. > Xen 4.4.3, default timeslice and ratelimit > Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+. > Dom0: kernel 4.1.0, 2 vcpus, not pinned. > DomU has 8 vcpus (except some cases). > > > For io-bound tests results were better with smt patches applied for every kernel. > > For cpu-bound test the results were different depending on wether > vcpus were pinned or not, how many vcpus were assigned to the guest. Looking through your mail, I can't quite figure out if "io-bound tests with the smt patches applied" here means "smt+pinned" or just "smt" (unpinned). (Or both.) Assuming that the Linux kernel takes process communication into account in its scheduling decisions, I would expect smt+pinning to have the kind of performance improvement you observe. I would expect that smt without pinning would have very little effect -- or might be actively worse, since the topology information would then be actively wrong as soon as the scheduler moved the vcpus. The fact that exposing topology of the cpu-bound workload didn't help sounds expected to me -- the Xen scheduler already tries to optimize for the cpu-bound case, so in the [non-smt, unpinned] case probably places things on the physical hardware similar to the way Linux places it in the [smt, pinned] case. > Please take a look at the graph captured by xentrace -e 0x0002f000 > On the graphs X is time in seconds since xentrace start, Y is the pcpu number, > the graph itself represent the event when scheduler places vcpu to pcpu. > > The graphs #1 & #2: > trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io bound test, one client/server > trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu bound test, 8 kernel theads > config: domu, 8vcpus not pinned, smt patches not applied, 2.3.69 kernel. > > As can be seen here scheduler places the vcpus correctly on empty cores. > As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this? Well it looks like vcpu0 does the lion's share of the work, while the other vcpus more or less share the work. So the scheduler gives vcpu0 its own socket (more or less), while the other ones share the other socket (optimizing for maximum cache usage). > Take a look at trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png > where I split data per vcpus. > > > Now to cpu-bound tests. 
> When smt patches applied and vcpus pinned correctly to match the topology and > guest become aware of the topology, cpu-bound tests did not show improvement with kernel 2.6.39. > With upstream kernel we see some improvements. The tes was repeated 5 times back to back. > The number of vcpus was increased to 16 to match the test case where linux was not > aware of the topology and assumed all cpus as cores. > > On some iterations one can see that vcpus are being scheduled as expected. > For some runs the vcpus are placed on came core (core/thread) (see trace_cpu_16vcpus_8threads_5runs.out.plot.err.png). > It doubles the time it takes for test to complete (first three runs show close to baremetal execution time). > > END: cycles: 31209326708 (29 seconds) > END: cycles: 30928835308 (28 seconds) > END: cycles: 31191626508 (29 seconds) > END: cycles: 50117313540 (46 seconds) > END: cycles: 49944848614 (46 seconds) > > Since the vcpus are pinned, then my guess is that Linux scheduler makes wrong decisions? Hmm -- could it be that the logic detecting whether the threads are "cpu-bound" (and thus want their own cache) vs "communicating" (and thus want to share a thread) is triggering differently in each case? Or maybe neither is true, and placement from the Linux side is more or less random. :-) > So I ran the test with smt patches enabled, but not pinned vcpus. > > result is also shows the same as above (see trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png): > Also see the per-cpu graph (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png). > > END: cycles: 49740185572 (46 seconds) > END: cycles: 45862289546 (42 seconds) > END: cycles: 30976368378 (28 seconds) > END: cycles: 30886882143 (28 seconds) > END: cycles: 30806304256 (28 seconds) > > I cut the timeslice where its seen that vcpu0 and vcpu2 run on same core while other cores are idle: > > 35v2 9.881103815 7 > 35v0 9.881104013 6 > > 35v2 9.892746452 7 > 35v0 9.892746546 6 -> vcpu0 gets scheduled right after vcpu2 on same core > > 35v0 9.904388175 6 > 35v2 9.904388205 7 -> same here > > 35v2 9.916029791 7 > 35v0 9.916029992 6 > > Disabling smt option in linux config (what essentially means that guest does not > have correct topology and its just flat shows slightly better results - there > are no cores and threads being scheduled in pair while other cores are empty. > > END: cycles: 41823591845 (38 seconds) > END: cycles: 41105093568 (38 seconds) > END: cycles: 30987224290 (28 seconds) > END: cycles: 31138979573 (29 seconds) > END: cycles: 31002228982 (28 seconds) > > and graph is attached (trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.png). This is a bit strange. You're showing that for *unpinned* vcpus, with empty cores, there are vcpus sharing the same thread for significant periods of time? That definitely shouldn't happen. It looks like you still have a fairly "bimodal" distribution even in the "no-smt unpinned" scenario -- just 28<->38 rather than 28<->45-ish. Could you try a couple of these tests with the credit2 scheduler, just to see? You'd have to make sure and use one of the versions that has hard pinning enabled; I don't think that made 4.6, so you'd have to use xen-unstable I think. > I may have forgotten something here.. Please ask me questions if I did. > > Maybe you have some ideas what can be done here? > > We try to make guests topology aware but looks like for cpu bound workloads its > not that easy. > Any suggestions are welcome. 
Well one option is always, as you say, to try to expose the topology to the guest. But that is a fairly limited solution -- in order for that information to be accurate, the vcpus need to be pinned, which in turn means 1) a lot more effort required by admins, and 2) a lot less opportunity for sharing of resources which is one of the big 'wins' for virtualization. The other option is, as Dario said, to remove all topology information from Linux, and add functionality to the Xen schedulers to attempt to identify vcpus which are communicating or sharing in some other way, and try to co-locate them. This is a lot easier and more flexible for users, but a lot more work for us. -George ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-26 11:21 ` George Dunlap @ 2016-01-27 14:25 ` Dario Faggioli 2016-01-27 14:33 ` Konrad Rzeszutek Wilk 1 sibling, 0 replies; 22+ messages in thread From: Dario Faggioli @ 2016-01-27 14:25 UTC (permalink / raw) To: George Dunlap, Elena Ufimtseva, xen-devel, george.dunlap, konrad.wilk, joao.m.martins, boris.ostrovsky [-- Attachment #1.1: Type: text/plain, Size: 9144 bytes --] On Tue, 2016-01-26 at 11:21 +0000, George Dunlap wrote: > On 22/01/16 16:54, Elena Ufimtseva wrote: > > > Regarding placement wrt topology: If two threads are doing a large > amount of communication, then putting them close in the topology will > increase perfomance, because they share cache, and the IPI distance > between them is much shorter. If they rarely run at the same time, > being on the same thread is probably the ideal. > Yes, this make sense to me... a bit hard to do the guessing right, but if we could, it would be a good thing to do. > On the other hand, if two threads are running mostly independently, > and > each one is using a lot of cache, then having the threads at opposite > ends of the topology will increase performance, since that will > increase > the aggregate cache used by both. The ideal in this case would > certainly be for each thread to run on a separate socket. > > At the moment, neither the Credit1 and Credit2 schedulers take > communication into account; they only account for processing time, > and > thus silently assume that all workloads are cache-hungry and > non-communicating. > I don't think Linux's scheduler does anything like that either. One can say that --speaking again about the flags of the scheduling domains-- you can try to use the SD_BALANCE_FORK and EXEC, together with the knowledge of who runs first, between father or child, something like what you say could be implemented... But even in this case, it's not at all explicit, and it would only be effective near fork() and exec() calls. The same is true, with all the due differences, for the other flags. Also, there is a great amount of logic to deal with task groups, and, e.g., provide fairness to task groups instead than to single tasks, etc. An I guess one can assume that the tasks in the same group does communicate, and things like that, but that's again nothing specifically taking any communication pattern into account (and it changed --growing and loosing features-- quite frenetically over the last years, so I don't think this has a say in what Elena is seeing. > Assuming that the Linux kernel takes process communication into > account > in its scheduling decisions, I would expect smt+pinning to have the > kind > of performance improvement you observe. I would expect that smt > without > pinning would have very little effect -- or might be actively worse, > since the topology information would then be actively wrong as soon > as > the scheduler moved the vcpus. > We better check then (if Linux has these characteristics), because I don't think it does. I can, and will, check myself, just not right now. :-/ So I ran the test with smt patches enabled, but not pinned vcpus. > > > > result is also shows the same as above (see > > trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.pn > > g): > > Also see the per-cpu graph > > (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_p > > ervcpu.png). 
> > > > END: cycles: 49740185572 (46 seconds) > > END: cycles: 45862289546 (42 seconds) > > END: cycles: 30976368378 (28 seconds) > > END: cycles: 30886882143 (28 seconds) > > END: cycles: 30806304256 (28 seconds) > > > > I cut the timeslice where its seen that vcpu0 and vcpu2 run on same > > core while other cores are idle: > > > > 35v2 9.881103815 > > 7 > > 35v0 9.881104013 6 > > > > 35v2 9.892746452 > > 7 > > 35v0 9.892746546 6 -> vcpu0 gets scheduled right after vcpu2 on > > same core > > > > 35v0 9.904388175 > > 6 > > 35v2 9.904388205 7 -> same here > > > > 35v2 9.916029791 > > 7 > > 35v0 9.916029992 > > 6 > > > > Disabling smt option in linux config (what essentially means that > > guest does not > > have correct topology and its just flat shows slightly better > > results - there > > are no cores and threads being scheduled in pair while other cores > > are empty. > > > > END: cycles: 41823591845 (38 seconds) > > END: cycles: 41105093568 (38 seconds) > > END: cycles: 30987224290 (28 seconds) > > END: cycles: 31138979573 (29 seconds) > > END: cycles: 31002228982 (28 seconds) > > > > and graph is attached > > (trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.p > > ng). > > This is a bit strange. You're showing that for *unpinned* vcpus, > with > empty cores, there are vcpus sharing the same thread for significant > periods of time? That definitely shouldn't happen. > I totally agree with George on this, the key word of what's he is saying being _significant_. In fact, this is the perfect summary/bottom line of my explanation of how our SMT load balancer in Credit1 works... :-) From just looking at the graph, I can't spot many places where this happens really for a significant amount of time. Am I wrong? For knowing for sure, we need to check the full trace. Ah, given the above, one could ask why we do not change Credit1 to actually do the SMT load balancing more frequently, e.g., at each vcpu wakeup. That is certainly a possibility, but there is the risk that the overhead of doing that too frequently (and there is indeed some overhead!) absorb the benefits of a more efficient placing (and I do think that will be the case, the wakeup path, in Credit1, is already complex and crowded enough! :-/). > Could you try a couple of these tests with the credit2 scheduler, > just > to see? You'd have to make sure and use one of the versions that has > hard pinning enabled; I don't think that made 4.6, so you'd have to > use > xen-unstable I think. > Nope, sorry, I would not do that, yet. I've got things half done already, but I have been sidetracked by other things (including this one), and so Credit2 is not yet in a shape where running the benchmarks with it, even if using staging, would represent a fair comparison between Credit1. I'll get back to the work that is still pending to make that possible in a bit (after FOSDEM, i.e., next week). > > We try to make guests topology aware but looks like for cpu bound > > workloads its > > not that easy. > > Any suggestions are welcome. > > well one option is always, as you say, to try to expose the topology > to > But that is a fairly limited solution -- in order for that > information to be accurate, the vcpus need to be pinned, which in > turn > means 1) a lot more effort required by admins, and 2) a lot less > opportunity for sharing of resources which is one of the big 'wins' > for > virtualization. > Exactly. 
I indeed think that it would be good to support this mode, but it has to be put very clear that, either the pinning does not ever ever ever ever ever...ever change, or you'll get back to pseudo-random scheduler(s)' behavior, leading to unpredictable and inconsistent performance (maybe better, maybe worse, but certainly inconsistent). Or, when pinning changes, we figure out a way to tell Linux (the scheduler and every pother component that needs to know) that something like that happened. As far as scheduling domains goes, I think there is a way to ask the kernel to rebuild the hierarcy, but I've never tried that, and I don't know it's available to userspace (already). > The other option is, as Dario said, to remove all topology > information > from Linux, and add functionality to the Xen schedulers to attempt to > identify vcpus which are communicating or sharing in some other way, > and > try to co-locate them. This is a lot easier and more flexible for > users, but a lot more work for us. > This is the only way that Linux will see something that is not wrong, if pinning is not used, so I think we really want this, and want it to be the default (and Juergen has patches for this! :-D) Thanks again and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-26 11:21 ` George Dunlap 2016-01-27 14:25 ` Dario Faggioli @ 2016-01-27 14:33 ` Konrad Rzeszutek Wilk 2016-01-27 15:10 ` George Dunlap 1 sibling, 1 reply; 22+ messages in thread From: Konrad Rzeszutek Wilk @ 2016-01-27 14:33 UTC (permalink / raw) To: George Dunlap Cc: Elena Ufimtseva, george.dunlap, dario.faggioli, xen-devel, joao.m.martins, boris.ostrovsky On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: > On 22/01/16 16:54, Elena Ufimtseva wrote: > > Hello all! > > > > Dario, Gerorge or anyone else, your help will be appreciated. > > > > Let me put some intro to our findings. I may forget something or put something > > not too explicit, please ask me. > > > > Customer filled a bug where some of the applications were running slow in their HVM DomU setups. > > These running times were compared against baremetal running same kernel version as HVM DomU. > > > > After some investigation by different parties, the test case scenario was found > > where the problem was easily seen. The test app is a udp server/client pair where > > client passes some message n number of times. > > The test case was executed on baremetal and Xen DomU with kernel version 2.6.39. > > Bare metal showed 2x times better result that DomU. > > > > Konrad came up with a workaround that was setting the flag for domain scheduler in linux > > As the guest is not aware of SMT-related topology, it has a flat topology initialized. > > Kernel has domain scheduler flags for scheduling domain CPU set to 4143 for 2.6.39. > > Konrad discovered that changing the flag for CPU sched domain to 4655 > > works as a workaround and makes Linux think that the topology has SMT threads. > > This workaround makes the test to complete almost in same time as on baremetal (or insignificantly worse). > > > > This workaround is not suitable for kernels of higher versions as we discovered. > > > > The hackish way of making domU linux think that it has SMT threads (along with matching cpuid) > > made us thinks that the problem comes from the fact that cpu topology is not exposed to > > guest and Linux scheduler cannot make intelligent decision on scheduling. > > > > Joao Martins from Oracle developed set of patches that fixed the smt/core/cashe > > topology numbering and provided matching pinning of vcpus and enabling options, > > allows to expose to guest correct topology. > > I guess Joao will be posting it at some point. > > > > With this patches we decided to test the performance impact on different kernel versionand Xen versions. > > > > The test described above was labeled as IO-bound test. > > So just to clarify: The client sends a request (presumably not much more > than a ping) to the server, and waits for the server to respond before > sending another one; and the server does the reverse -- receives a > request, responds, and then waits for the next request. Is that right? Yes. > > How much data is transferred? 1 packet, UDP > > If the amount of data transferred is tiny, then the bottleneck for the > test is probably the IPI time, and I'd call this a "ping-pong" > benchmark[1]. I would only call this "io-bound" if you're actually > copying large amounts of data. What we found is that on baremetal the scheduler would put both apps on the same CPU and schedule them right after each other. This would have a high IPI as the scheduler would poke itself. On Xen it would put the two applications on seperate CPUs - and there would be hardly any IPI. 
Digging deeper in the code I found out that if you do an UDP sendmsg without any timeouts - it would put it in a queue and just call schedule. On baremetal the schedule would result in scheduler picking up the other task, and starting it - which would dequeue immediately. On Xen - the schedule() would go HLT.. and then later be woken up by the VIRQ_TIMER. And since the two applications were on seperate CPUs - the single packet would just stick in the queue until the VIRQ_TIMER arrived. I found out that if I expose the SMT topology to the guest (which is what baremetal sees) suddenly the Linux scheduler would behave the same way as under baremetal. To be fair - this is a very .. ping-pong no-CPU bound workload. If the amount of communication was huge it would probably behave a bit differently - as the queue would fill up - and by the time the VIRQ_TIMER hit the other CPU - it would have a nice chunk of data to eat through. > > Regarding placement wrt topology: If two threads are doing a large > amount of communication, then putting them close in the topology will > increase perfomance, because they share cache, and the IPI distance > between them is much shorter. If they rarely run at the same time, > being on the same thread is probably the ideal. This is a ping-pong type workload - very much serialized. > > On the other hand, if two threads are running mostly independently, and > each one is using a lot of cache, then having the threads at opposite > ends of the topology will increase performance, since that will increase > the aggregate cache used by both. The ideal in this case would > certainly be for each thread to run on a separate socket. > > At the moment, neither the Credit1 and Credit2 schedulers take > communication into account; they only account for processing time, and > thus silently assume that all workloads are cache-hungry and > non-communicating. And this is very much the opposite of that :-) > > [1] https://www.google.co.uk/search?q=ping+pong+benchmark > > > We have run io-bound test with and without smt-patches. The improvement comparing > > to base case (no smt patches, flat topology) shows 22-23% gain. > > > > While we have seen improvement with io-bound tests, the same did not happen with cpu-bound workload. > > As cpu-bound test we use kernel module which runs requested number of kernel threads > > and each thread compresses and decompresses some data. > > > > Here is the setup for tests: > > Intel Xeon E5 2600 > > 8 cores, 25MB Cashe, 2 sockets, 2 threads per core. > > Xen 4.4.3, default timeslice and ratelimit > > Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+. > > Dom0: kernel 4.1.0, 2 vcpus, not pinned. > > DomU has 8 vcpus (except some cases). > > > > > > For io-bound tests results were better with smt patches applied for every kernel. > > > > For cpu-bound test the results were different depending on wether > > vcpus were pinned or not, how many vcpus were assigned to the guest. > > Looking through your mail, I can't quite figure out if "io-bound tests > with the smt patches applied" here means "smt+pinned" or just "smt" > (unpinned). (Or both.) > > Assuming that the Linux kernel takes process communication into account > in its scheduling decisions, I would expect smt+pinning to have the kind > of performance improvement you observe. I would expect that smt without > pinning would have very little effect -- or might be actively worse, > since the topology information would then be actively wrong as soon as > the scheduler moved the vcpus. 
> > The fact that exposing topology of the cpu-bound workload didn't help > sounds expected to me -- the Xen scheduler already tries to optimize for > the cpu-bound case, so in the [non-smt, unpinned] case probably places > things on the physical hardware similar to the way Linux places it in > the [smt, pinned] case. > > > Please take a look at the graph captured by xentrace -e 0x0002f000 > > On the graphs X is time in seconds since xentrace start, Y is the pcpu number, > > the graph itself represent the event when scheduler places vcpu to pcpu. > > > > The graphs #1 & #2: > > trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io bound test, one client/server > > trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu bound test, 8 kernel theads > > config: domu, 8vcpus not pinned, smt patches not applied, 2.3.69 kernel. > > > > As can be seen here scheduler places the vcpus correctly on empty cores. > > As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this? > > Well it looks like vcpu0 does the lion's share of the work, while the > other vcpus more or less share the work. So the scheduler gives vcpu0 > its own socket (more or less), while the other ones share the other > socket (optimizing for maximum cache usage). > > > Take a look at trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png > > where I split data per vcpus. > > > > > > Now to cpu-bound tests. > > When smt patches applied and vcpus pinned correctly to match the topology and > > guest become aware of the topology, cpu-bound tests did not show improvement with kernel 2.6.39. > > With upstream kernel we see some improvements. The tes was repeated 5 times back to back. > > The number of vcpus was increased to 16 to match the test case where linux was not > > aware of the topology and assumed all cpus as cores. > > > > On some iterations one can see that vcpus are being scheduled as expected. > > For some runs the vcpus are placed on came core (core/thread) (see trace_cpu_16vcpus_8threads_5runs.out.plot.err.png). > > It doubles the time it takes for test to complete (first three runs show close to baremetal execution time). > > > > END: cycles: 31209326708 (29 seconds) > > END: cycles: 30928835308 (28 seconds) > > END: cycles: 31191626508 (29 seconds) > > END: cycles: 50117313540 (46 seconds) > > END: cycles: 49944848614 (46 seconds) > > > > Since the vcpus are pinned, then my guess is that Linux scheduler makes wrong decisions? > > Hmm -- could it be that the logic detecting whether the threads are > "cpu-bound" (and thus want their own cache) vs "communicating" (and thus > want to share a thread) is triggering differently in each case? > > Or maybe neither is true, and placement from the Linux side is more or > less random. :-) > > > So I ran the test with smt patches enabled, but not pinned vcpus. > > > > result is also shows the same as above (see trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png): > > Also see the per-cpu graph (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png). 
> > > > END: cycles: 49740185572 (46 seconds) > > END: cycles: 45862289546 (42 seconds) > > END: cycles: 30976368378 (28 seconds) > > END: cycles: 30886882143 (28 seconds) > > END: cycles: 30806304256 (28 seconds) > > > > I cut the timeslice where its seen that vcpu0 and vcpu2 run on same core while other cores are idle: > > > > 35v2 9.881103815 7 > > 35v0 9.881104013 6 > > > > 35v2 9.892746452 7 > > 35v0 9.892746546 6 -> vcpu0 gets scheduled right after vcpu2 on same core > > > > 35v0 9.904388175 6 > > 35v2 9.904388205 7 -> same here > > > > 35v2 9.916029791 7 > > 35v0 9.916029992 6 > > > > Disabling smt option in linux config (what essentially means that guest does not > > have correct topology and its just flat shows slightly better results - there > > are no cores and threads being scheduled in pair while other cores are empty. > > > > END: cycles: 41823591845 (38 seconds) > > END: cycles: 41105093568 (38 seconds) > > END: cycles: 30987224290 (28 seconds) > > END: cycles: 31138979573 (29 seconds) > > END: cycles: 31002228982 (28 seconds) > > > > and graph is attached (trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.png). > > This is a bit strange. You're showing that for *unpinned* vcpus, with > empty cores, there are vcpus sharing the same thread for significant > periods of time? That definitely shouldn't happen. > > It looks like you still have a fairly "bimodal" distribution even in the > "no-smt unpinned" scenario -- just 28<->38 rather than 28<->45-ish. > > Could you try a couple of these tests with the credit2 scheduler, just > to see? You'd have to make sure and use one of the versions that has > hard pinning enabled; I don't think that made 4.6, so you'd have to use > xen-unstable I think. > > > I may have forgotten something here.. Please ask me questions if I did. > > > > Maybe you have some ideas what can be done here? > > > > We try to make guests topology aware but looks like for cpu bound workloads its > > not that easy. > > Any suggestions are welcome. > > Well one option is always, as you say, to try to expose the topology to > the guest. But that is a fairly limited solution -- in order for that > information to be accurate, the vcpus need to be pinned, which in turn > means 1) a lot more effort required by admins, and 2) a lot less > opportunity for sharing of resources which is one of the big 'wins' for > virtualization. > > The other option is, as Dario said, to remove all topology information > from Linux, and add functionality to the Xen schedulers to attempt to > identify vcpus which are communicating or sharing in some other way, and > try to co-locate them. This is a lot easier and more flexible for > users, but a lot more work for us. > > -George > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-27 14:33 ` Konrad Rzeszutek Wilk @ 2016-01-27 15:10 ` George Dunlap 2016-01-27 15:27 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 22+ messages in thread From: George Dunlap @ 2016-01-27 15:10 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Elena Ufimtseva, george.dunlap, dario.faggioli, xen-devel, joao.m.martins, boris.ostrovsky On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: >> On 22/01/16 16:54, Elena Ufimtseva wrote: >>> Hello all! >>> >>> Dario, Gerorge or anyone else, your help will be appreciated. >>> >>> Let me put some intro to our findings. I may forget something or put something >>> not too explicit, please ask me. >>> >>> Customer filled a bug where some of the applications were running slow in their HVM DomU setups. >>> These running times were compared against baremetal running same kernel version as HVM DomU. >>> >>> After some investigation by different parties, the test case scenario was found >>> where the problem was easily seen. The test app is a udp server/client pair where >>> client passes some message n number of times. >>> The test case was executed on baremetal and Xen DomU with kernel version 2.6.39. >>> Bare metal showed 2x times better result that DomU. >>> >>> Konrad came up with a workaround that was setting the flag for domain scheduler in linux >>> As the guest is not aware of SMT-related topology, it has a flat topology initialized. >>> Kernel has domain scheduler flags for scheduling domain CPU set to 4143 for 2.6.39. >>> Konrad discovered that changing the flag for CPU sched domain to 4655 >>> works as a workaround and makes Linux think that the topology has SMT threads. >>> This workaround makes the test to complete almost in same time as on baremetal (or insignificantly worse). >>> >>> This workaround is not suitable for kernels of higher versions as we discovered. >>> >>> The hackish way of making domU linux think that it has SMT threads (along with matching cpuid) >>> made us thinks that the problem comes from the fact that cpu topology is not exposed to >>> guest and Linux scheduler cannot make intelligent decision on scheduling. >>> >>> Joao Martins from Oracle developed set of patches that fixed the smt/core/cashe >>> topology numbering and provided matching pinning of vcpus and enabling options, >>> allows to expose to guest correct topology. >>> I guess Joao will be posting it at some point. >>> >>> With this patches we decided to test the performance impact on different kernel versionand Xen versions. >>> >>> The test described above was labeled as IO-bound test. >> >> So just to clarify: The client sends a request (presumably not much more >> than a ping) to the server, and waits for the server to respond before >> sending another one; and the server does the reverse -- receives a >> request, responds, and then waits for the next request. Is that right? > > Yes. >> >> How much data is transferred? > > 1 packet, UDP >> >> If the amount of data transferred is tiny, then the bottleneck for the >> test is probably the IPI time, and I'd call this a "ping-pong" >> benchmark[1]. I would only call this "io-bound" if you're actually >> copying large amounts of data. > > What we found is that on baremetal the scheduler would put both apps > on the same CPU and schedule them right after each other. This would > have a high IPI as the scheduler would poke itself. 
> On Xen it would put the two applications on seperate CPUs - and there > would be hardly any IPI. Sorry -- why would the scheduler send itself an IPI if it's on the same logical cpu (which seems pretty pointless), but *not* send an IPI to the *other* processor when it was actually waking up another task? Or do you mean high context switch rate? > Digging deeper in the code I found out that if you do an UDP sendmsg > without any timeouts - it would put it in a queue and just call schedule. You mean, it would mark the other process as runnable somehow, but not actually send an IPI to wake it up? Is that a new "feature" designed for large systems, to reduce the IPI traffic or something? > On baremetal the schedule would result in scheduler picking up the other > task, and starting it - which would dequeue immediately. > > On Xen - the schedule() would go HLT.. and then later be woken up by the > VIRQ_TIMER. And since the two applications were on seperate CPUs - the > single packet would just stick in the queue until the VIRQ_TIMER arrived. I'm not sure I understand the situation right, but it sounds a bit like what you're seeing is just a quirk of the fact that Linux doesn't always send IPIs to wake other processes up (either by design or by accident), but relies on scheduling timers to check for work to do. Presumably they knew that low performance on ping-pong workloads might be a possibility when they wrote the code that way; I don't see a reason why we should try to work around that in Xen. -George ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-27 15:10 ` George Dunlap @ 2016-01-27 15:27 ` Konrad Rzeszutek Wilk 2016-01-27 15:53 ` George Dunlap ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Konrad Rzeszutek Wilk @ 2016-01-27 15:27 UTC (permalink / raw) To: George Dunlap Cc: Elena Ufimtseva, george.dunlap, dario.faggioli, xen-devel, joao.m.martins, boris.ostrovsky On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: > > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: > >> On 22/01/16 16:54, Elena Ufimtseva wrote: > >>> Hello all! > >>> > >>> Dario, Gerorge or anyone else, your help will be appreciated. > >>> > >>> Let me put some intro to our findings. I may forget something or put something > >>> not too explicit, please ask me. > >>> > >>> Customer filled a bug where some of the applications were running slow in their HVM DomU setups. > >>> These running times were compared against baremetal running same kernel version as HVM DomU. > >>> > >>> After some investigation by different parties, the test case scenario was found > >>> where the problem was easily seen. The test app is a udp server/client pair where > >>> client passes some message n number of times. > >>> The test case was executed on baremetal and Xen DomU with kernel version 2.6.39. > >>> Bare metal showed 2x times better result that DomU. > >>> > >>> Konrad came up with a workaround that was setting the flag for domain scheduler in linux > >>> As the guest is not aware of SMT-related topology, it has a flat topology initialized. > >>> Kernel has domain scheduler flags for scheduling domain CPU set to 4143 for 2.6.39. > >>> Konrad discovered that changing the flag for CPU sched domain to 4655 > >>> works as a workaround and makes Linux think that the topology has SMT threads. > >>> This workaround makes the test to complete almost in same time as on baremetal (or insignificantly worse). > >>> > >>> This workaround is not suitable for kernels of higher versions as we discovered. > >>> > >>> The hackish way of making domU linux think that it has SMT threads (along with matching cpuid) > >>> made us thinks that the problem comes from the fact that cpu topology is not exposed to > >>> guest and Linux scheduler cannot make intelligent decision on scheduling. > >>> > >>> Joao Martins from Oracle developed set of patches that fixed the smt/core/cashe > >>> topology numbering and provided matching pinning of vcpus and enabling options, > >>> allows to expose to guest correct topology. > >>> I guess Joao will be posting it at some point. > >>> > >>> With this patches we decided to test the performance impact on different kernel versionand Xen versions. > >>> > >>> The test described above was labeled as IO-bound test. > >> > >> So just to clarify: The client sends a request (presumably not much more > >> than a ping) to the server, and waits for the server to respond before > >> sending another one; and the server does the reverse -- receives a > >> request, responds, and then waits for the next request. Is that right? > > > > Yes. > >> > >> How much data is transferred? > > > > 1 packet, UDP > >> > >> If the amount of data transferred is tiny, then the bottleneck for the > >> test is probably the IPI time, and I'd call this a "ping-pong" > >> benchmark[1]. I would only call this "io-bound" if you're actually > >> copying large amounts of data. 
> >
> > What we found is that on baremetal the scheduler would put both apps
> > on the same CPU and schedule them right after each other. This would
> > have a high IPI as the scheduler would poke itself.
> > On Xen it would put the two applications on seperate CPUs - and there
> > would be hardly any IPI.
>
> Sorry -- why would the scheduler send itself an IPI if it's on the same
> logical cpu (which seems pretty pointless), but *not* send an IPI to the
> *other* processor when it was actually waking up another task?
>
> Or do you mean high context switch rate?

Yes, very high.

>
> > Digging deeper in the code I found out that if you do an UDP sendmsg
> > without any timeouts - it would put it in a queue and just call schedule.
>
> You mean, it would mark the other process as runnable somehow, but not
> actually send an IPI to wake it up?  Is that a new "feature" designed

Correct - because the other process was not on its vCPU runqueue.

> for large systems, to reduce the IPI traffic or something?

This is just the normal Linux scheduler. The only way it would do an IPI
to the other CPU was if the UDP message had a timeout. The default
timeout is infinite, so it didn't bother to send an IPI.

>
> > On baremetal the schedule would result in scheduler picking up the other
> > task, and starting it - which would dequeue immediately.
> >
> > On Xen - the schedule() would go HLT.. and then later be woken up by the
> > VIRQ_TIMER. And since the two applications were on seperate CPUs - the
> > single packet would just stick in the queue until the VIRQ_TIMER arrived.
>
> I'm not sure I understand the situation right, but it sounds a bit like
> what you're seeing is just a quirk of the fact that Linux doesn't always
> send IPIs to wake other processes up (either by design or by accident),

It does and it does not :-)

> but relies on scheduling timers to check for work to do.  Presumably

It .. I am not explaining it well. The Linux kernel scheduler, when
called for 'schedule' (from the UDP sendmsg), would either pick the next
application and do a context swap - or, if there were none, go to sleep.
[Kind of - it may also do an IPI to the other CPU if requested, but that
requires some hints from underlying layers]
Since there were only two apps on the runqueue - udp sender and udp receiver -
it would run them back-to-back (this is on baremetal).

However, if SMT was not exposed, the Linux kernel scheduler would put those
on each CPU's runqueue. Meaning each CPU only had one app on its runqueue.

Hence no need to do a context switch.
[unless you modified the UDP message to have a timeout, then it would
send an IPI]

> they knew that low performance on ping-pong workloads might be a
> possibility when they wrote the code that way; I don't see a reason why
> we should try to work around that in Xen.

Which is not what I am suggesting.

Our first idea was that since this is a Linux kernel scheduler characteristic,
let us give the guest all the information it needs to do this. That is,
make it look as baremetal as possible - and that is where the vCPU
pinning and the exposing of SMT information came about. That (Elena,
pls correct me if I am wrong) did indeed show that the guest was doing
what we expected.

But naturally that requires pinning and all that - and while it is a useful
case for those that have the vCPUs to spare and can do it - that is not
a general use-case.

So Elena started looking at the CPU-bound case, seeing how Xen behaves there
and whether we can improve the floating situation, as she saw some abnormal
behaviour.

I do not see any way to fix the udp single message mechanism except
by modifying the Linux kernel scheduler - and indeed it looks like later
kernels modified their behavior. Also, doing the vCPU pinning and SMT
exposing did not hurt in those cases (Elena?).

>
> -George

^ permalink raw reply	[flat|nested] 22+ messages in thread
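For concreteness, the kind of udp ping-pong pair being discussed looks roughly like the sketch below. This is illustrative only (port, message size and iteration count are made up; it is not the customer's actual test program): each iteration sends one tiny datagram and then blocks in recvfrom(), so the measured time is dominated by how quickly the peer task is scheduled after the packet is queued.

/* Minimal sketch of a UDP ping-pong client/server pair (illustrative only).
 *
 * Build:  cc -o pingpong pingpong.c
 * Run:    ./pingpong server 5000
 *         ./pingpong client 127.0.0.1 5000 100000
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[64];
    struct sockaddr_in addr, peer;
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;

    if (argc >= 3 && !strcmp(argv[1], "server")) {
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(atoi(argv[2]));
        bind(s, (struct sockaddr *)&addr, sizeof(addr));
        for (;;) {                       /* echo each message straight back */
            socklen_t plen = sizeof(peer);
            ssize_t n = recvfrom(s, buf, sizeof(buf), 0,
                                 (struct sockaddr *)&peer, &plen);
            if (n > 0)
                sendto(s, buf, n, 0, (struct sockaddr *)&peer, plen);
        }
    } else if (argc >= 5 && !strcmp(argv[1], "client")) {
        long i, iters = atol(argv[4]);
        addr.sin_addr.s_addr = inet_addr(argv[2]);
        addr.sin_port = htons(atoi(argv[3]));
        memset(buf, 'x', sizeof(buf));
        for (i = 0; i < iters; i++) {    /* one packet out, wait for the reply */
            sendto(s, buf, sizeof(buf), 0, (struct sockaddr *)&addr, sizeof(addr));
            recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);
        }
    } else {
        fprintf(stderr, "usage: %s server <port> | client <ip> <port> <iters>\n",
                argv[0]);
        return 1;
    }
    close(s);
    return 0;
}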
* Re: schedulers and topology exposing questions 2016-01-27 15:27 ` Konrad Rzeszutek Wilk @ 2016-01-27 15:53 ` George Dunlap 2016-01-27 16:12 ` Konrad Rzeszutek Wilk 2016-01-28 9:55 ` Dario Faggioli 2016-01-27 16:03 ` Elena Ufimtseva 2016-01-28 15:10 ` Dario Faggioli 2 siblings, 2 replies; 22+ messages in thread From: George Dunlap @ 2016-01-27 15:53 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Elena Ufimtseva, george.dunlap, dario.faggioli, xen-devel, joao.m.martins, boris.ostrovsky On 27/01/16 15:27, Konrad Rzeszutek Wilk wrote: > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: >> On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: >>> On Xen - the schedule() would go HLT.. and then later be woken up by the >>> VIRQ_TIMER. And since the two applications were on seperate CPUs - the >>> single packet would just stick in the queue until the VIRQ_TIMER arrived. >> >> I'm not sure I understand the situation right, but it sounds a bit like >> what you're seeing is just a quirk of the fact that Linux doesn't always >> send IPIs to wake other processes up (either by design or by accident), > > It does and it does not :-) > >> but relies on scheduling timers to check for work to do. Presumably > > It .. I am not explaining it well. The Linux kernel scheduler when > called for 'schedule' (from the UDP sendmsg) would either pick the next > appliction and do a context swap - of if there were none - go to sleep. > [Kind of - it also may do an IPI to the other CPU if requested ,but that requires > some hints from underlaying layers] > Since there were only two apps on the runqueue - udp sender and udp receiver > it would run them back-to back (this is on baremetal) I think I understand at a high level from your description what's happening (No IPIs -> happens to run if on the same cpu, waits until next timer tick if on a different cpu); but what I don't quite get is *why* Linux doesn't send an IPI. It's been quite a while since I looked at the Linux scheduling code, so I'm trying to understand it based a lot on the Xen code. In Xen a vcpu can be "runnable" (has something to do) and "blocked" (waiting for something to do). Whenever a vcpu goes from "blocked" to "runnable", the scheduler will call vcpu_wake(), which sends an IPI to the appropriate pcpu to get it to run the vcpu. What you're describing is a situation where a process is blocked (either in 'listen' or 'read'), and another process does something which should cause it to become 'runnable' (sends it a UDP message). If anyone happens to run the scheduler on its cpu, it will run; but no proactive actions are taken to wake it up (i.e., sending an IPI). The idea of not sending an IPI when a process goes from "waiting for something to do" to "has something to do" seems strange to me; and if it wasn't a mistake, then my only guess why they would choose to do that would be to reduce IPI traffic on large systems. But whether it's a mistake or on purpose, it's a Linux thing, so... >> they knew that low performance on ping-pong workloads might be a >> possibility when they wrote the code that way; I don't see a reason why >> we should try to work around that in Xen. > > Which is not what I am suggesting. I'm glad we agree on this. :-) > Our first ideas was that since this is a Linux kernel schduler characteristic > - let us give the guest all the information it needs to do this. That is > make it look as baremetal as possible - and that is where the vCPU > pinning and the exposing of SMT information came about. 
That (Elena > pls correct me if I am wrong) did indeed show that the guest was doing > what we expected. > > But naturally that requires pinning and all that - and while it is a useful > case for those that have the vCPUs to spare and can do it - that is not > a general use-case. > > So Elena started looking at the CPU bound and seeing how Xen behaves then > and if we can improve the floating situation as she saw some abnormal > behavious. OK -- if the focus was on the two cases where the Xen credit1 scheduler (apparently) co-located two cpu-burning vcpus on sibling threads, then yeah, that's behavior we should probably try to get to the bottom of. -George ^ permalink raw reply [flat|nested] 22+ messages in thread
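As a toy model of the two behaviours being contrasted in this exchange (nothing below is Xen or Linux source code; all names and the 10ms tick are made up for the sketch): with kick_on_wake set, the waiting "cpu" is poked as soon as the task becomes runnable, which is the vcpu_wake()/IPI model George describes; with it cleared, the newly runnable task just sits there until the next simulated timer tick, which is the latency the ping-pong test keeps paying.

/* Build: cc -o wake wake.c -lpthread */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define TICK_NS (10 * 1000 * 1000)           /* pretend 10ms scheduler tick */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  kick = PTHREAD_COND_INITIALIZER;  /* stands in for the IPI */
static bool runnable;
static bool kick_on_wake = true;             /* set false to mimic the no-IPI case */

static void *other_cpu(void *arg)
{
    struct timespec deadline;
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!runnable) {
        /* Sleep until either a kick arrives or the next tick fires. */
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += TICK_NS;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec++;
            deadline.tv_nsec -= 1000000000L;
        }
        pthread_cond_timedwait(&kick, &lock, &deadline);
    }
    pthread_mutex_unlock(&lock);
    printf("task picked up\n");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, other_cpu, NULL);
    usleep(1000);                            /* let the other "cpu" go idle */

    pthread_mutex_lock(&lock);
    runnable = true;                         /* blocked -> runnable */
    if (kick_on_wake)
        pthread_cond_signal(&kick);          /* wake it right now... */
    pthread_mutex_unlock(&lock);             /* ...otherwise it waits for the tick */

    pthread_join(t, NULL);
    return 0;
}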
* Re: schedulers and topology exposing questions
  2016-01-27 15:53 ` George Dunlap
@ 2016-01-27 16:12   ` Konrad Rzeszutek Wilk
  2016-01-28  9:55   ` Dario Faggioli
  1 sibling, 0 replies; 22+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-27 16:12 UTC (permalink / raw)
To: George Dunlap
Cc: Elena Ufimtseva, george.dunlap, dario.faggioli, xen-devel,
	joao.m.martins, boris.ostrovsky

On Wed, Jan 27, 2016 at 03:53:38PM +0000, George Dunlap wrote:
> On 27/01/16 15:27, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote:
> >> On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote:
> >>> On Xen - the schedule() would go HLT.. and then later be woken up by the
> >>> VIRQ_TIMER. And since the two applications were on seperate CPUs - the
> >>> single packet would just stick in the queue until the VIRQ_TIMER arrived.
> >>
> >> I'm not sure I understand the situation right, but it sounds a bit like
> >> what you're seeing is just a quirk of the fact that Linux doesn't always
> >> send IPIs to wake other processes up (either by design or by accident),
> >
> > It does and it does not :-)
> >
> >> but relies on scheduling timers to check for work to do.  Presumably
> >
> > It .. I am not explaining it well. The Linux kernel scheduler when
> > called for 'schedule' (from the UDP sendmsg) would either pick the next
> > appliction and do a context swap - of if there were none - go to sleep.
> > [Kind of - it also may do an IPI to the other CPU if requested ,but that requires
> > some hints from underlaying layers]
> > Since there were only two apps on the runqueue - udp sender and udp receiver
> > it would run them back-to back (this is on baremetal)
>
> I think I understand at a high level from your description what's
> happening (No IPIs -> happens to run if on the same cpu, waits until
> next timer tick if on a different cpu); but what I don't quite get is
> *why* Linux doesn't send an IPI.

Wait no no. "happens to run if on the same cpu" - only if on baremetal
or if we expose SMT topology to a guest. Otherwise the applications are
not on the same CPU.

The sending-IPI part is because there are two CPUs - and the apps on
those two runqueues are not intertwined from the perspective of the
scheduler. (Unless the udp code has given the scheduler hints).

However, if I taskset the applications on the same vCPU (this being
without exposing SMT threads, or just the normal situation as today),
the scheduler will send IPIs and do context switches.

Then I found that if I enable vAPIC and disable event channels for IPIs,
and only use the native APIC machinery (aka vAPIC), we can do even fewer
VMEXITs, but that is a different story:

http://lists.xenproject.org/archives/html/xen-devel/2015-10/msg00897.html

>
> It's been quite a while since I looked at the Linux scheduling code, so
> I'm trying to understand it based a lot on the Xen code.  In Xen a vcpu
> can be "runnable" (has something to do) and "blocked" (waiting for
> something to do).  Whenever a vcpu goes from "blocked" to "runnable", the
> scheduler will call vcpu_wake(), which sends an IPI to the appropriate
> pcpu to get it to run the vcpu.
>
> What you're describing is a situation where a process is blocked (either
> in 'listen' or 'read'), and another process does something which should
> cause it to become 'runnable' (sends it a UDP message).  If anyone
> happens to run the scheduler on its cpu, it will run; but no proactive
> actions are taken to wake it up (i.e., sending an IPI).

Right. And that is a UDP code decision.
It called the schedule without any timeout or hints. > > The idea of not sending an IPI when a process goes from "waiting for > something to do" to "has something to do" seems strange to me; and if it > wasn't a mistake, then my only guess why they would choose to do that > would be to reduce IPI traffic on large systems. > > But whether it's a mistake or on purpose, it's a Linux thing, so... Yes :-) > > >> they knew that low performance on ping-pong workloads might be a > >> possibility when they wrote the code that way; I don't see a reason why > >> we should try to work around that in Xen. > > > > Which is not what I am suggesting. > > I'm glad we agree on this. :-) > > > Our first ideas was that since this is a Linux kernel schduler characteristic > > - let us give the guest all the information it needs to do this. That is > > make it look as baremetal as possible - and that is where the vCPU > > pinning and the exposing of SMT information came about. That (Elena > > pls correct me if I am wrong) did indeed show that the guest was doing > > what we expected. > > > > But naturally that requires pinning and all that - and while it is a useful > > case for those that have the vCPUs to spare and can do it - that is not > > a general use-case. > > > > So Elena started looking at the CPU bound and seeing how Xen behaves then > > and if we can improve the floating situation as she saw some abnormal > > behavious. > > OK -- if the focus was on the two cases where the Xen credit1 scheduler > (apparently) co-located two cpu-burning vcpus on sibling threads, then > yeah, that's behavior we should probably try to get to the bottom of. Right! ^ permalink raw reply [flat|nested] 22+ messages in thread
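For reference, the "taskset the applications on the same vCPU" experiment Konrad mentions can be reproduced either with the taskset tool or from inside the program itself; the sketch below is illustrative only (CPU 0 is an arbitrary choice) and uses the standard sched_setaffinity() call rather than the taskset wrapper.

/* Sketch: pin the calling process to (v)CPU 0 before starting the
 * ping-pong loop, equivalent to launching it under "taskset -c 0".
 * CPU 0 is an arbitrary choice for illustration.
 *
 * Build: cc -o pin pin.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                    /* run only on CPU 0 */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pid %d now restricted to CPU 0\n", getpid());
    /* ... start the UDP send/receive loop here ... */
    return 0;
}

On the Xen side, the matching step would be pinning the guest's vCPUs to pcpus (e.g. with xl vcpu-pin), which is what the SMT-exposure experiments discussed in this thread rely on.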
* Re: schedulers and topology exposing questions
  2016-01-27 15:53 ` George Dunlap
  2016-01-27 16:12   ` Konrad Rzeszutek Wilk
@ 2016-01-28  9:55   ` Dario Faggioli
  2016-01-29 21:59     ` Elena Ufimtseva
  1 sibling, 1 reply; 22+ messages in thread
From: Dario Faggioli @ 2016-01-28 9:55 UTC (permalink / raw)
To: George Dunlap, Konrad Rzeszutek Wilk
Cc: Elena Ufimtseva, george.dunlap, joao.m.martins, boris.ostrovsky,
	xen-devel

[-- Attachment #1.1: Type: text/plain, Size: 1152 bytes --]

On Wed, 2016-01-27 at 15:53 +0000, George Dunlap wrote:
> On 27/01/16 15:27, Konrad Rzeszutek Wilk wrote:
> >
> > So Elena started looking at the CPU bound and seeing how Xen
> > behaves then
> > and if we can improve the floating situation as she saw some
> > abnormal
> > behavious.
>
> OK -- if the focus was on the two cases where the Xen credit1
> scheduler
> (apparently) co-located two cpu-burning vcpus on sibling threads,
> then
> yeah, that's behavior we should probably try to get to the bottom of.
>
Well, let's see the trace.

In any case, I'm up for trying to hook the SMT load balancer into
runq_tickle (which would mean doing it upon every vcpu wakeup).

My gut feeling is that the overhead may outweigh the benefit, and that
it will actually prove useful only in a minority of the cases/workloads,
but it's maybe worth a try.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-28 9:55 ` Dario Faggioli @ 2016-01-29 21:59 ` Elena Ufimtseva 2016-02-02 11:58 ` Dario Faggioli 0 siblings, 1 reply; 22+ messages in thread From: Elena Ufimtseva @ 2016-01-29 21:59 UTC (permalink / raw) To: Dario Faggioli Cc: george.dunlap, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky [-- Attachment #1: Type: text/plain, Size: 1478 bytes --] On Thu, Jan 28, 2016 at 09:55:45AM +0000, Dario Faggioli wrote: > On Wed, 2016-01-27 at 15:53 +0000, George Dunlap wrote: > > On 27/01/16 15:27, Konrad Rzeszutek Wilk wrote: > > > > > > So Elena started looking at the CPU bound and seeing how Xen > > > behaves then > > > and if we can improve the floating situation as she saw some > > > abnormal > > > behavious. > > > > OK -- if the focus was on the two cases where the Xen credit1 > > scheduler > > (apparently) co-located two cpu-burning vcpus on sibling threads, > > then > > yeah, that's behavior we should probably try to get to the bottom of. > > > Well, let's see the trace. Hey Dario Please disregard the previous email with topology information. It was incorrect and I am attaching the topology that is actually result of Joao smt patches application. Elena > > In any case, I'm up to trying hooking the SMT load balancer in > runq_tickle (which would mean doing it upon every vcpus wakeup). > > My gut feeling is that the overhead my outwieght the benefit, and that > it will actually reveal useful only in a minority of the > cases/workloads, but it's maybe worth a try. > > Regards, > Dario > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > [-- Attachment #2: cpuinfo --] [-- Type: text/plain, Size: 12658 bytes --] processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 0 cpu cores : 8 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 0 cpu cores : 8 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : 
Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 1 cpu cores : 8 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 1 cpu cores : 8 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 2 cpu cores : 8 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 2 cpu cores : 8 apicid : 5 initial apicid : 5 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 6 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 3 cpu cores : 8 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq 
ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 3 cpu cores : 8 apicid : 7 initial apicid : 7 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 8 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 4 cpu cores : 8 apicid : 8 initial apicid : 8 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 9 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 4 cpu cores : 8 apicid : 9 initial apicid : 9 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 10 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 5 cpu cores : 8 apicid : 10 initial apicid : 10 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 11 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 
cache size : 25600 KB physical id : 0 siblings : 16 core id : 5 cpu cores : 8 apicid : 11 initial apicid : 11 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 12 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 6 cpu cores : 8 apicid : 12 initial apicid : 12 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 13 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 6 cpu cores : 8 apicid : 13 initial apicid : 13 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 14 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 7 cpu cores : 8 apicid : 14 initial apicid : 14 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 15 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.360 cache size : 25600 KB physical id : 0 siblings : 16 core id : 7 cpu cores : 8 apicid : 15 initial apicid : 15 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes 
xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.72 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: [-- Attachment #3: sched_domains --] [-- Type: text/plain, Size: 366 bytes --] cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 cat /proc/sys/kernel/sched_domain/cpu*/domain*/names SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC [-- Attachment #4: topology_smt_patches --] [-- Type: text/plain, Size: 6425 bytes --] Advisory to Users on system topology enumeration This utility is for demonstration purpose only. It assumes the hardware topology configuration within a coherent domain does not change during the life of an OS session. If an OS support advanced features that can change hardware topology configurations, more sophisticated adaptation may be necessary to account for the hardware configuration change that might have added and reduced the number of logical processors being managed by the OS. User should also`be aware that the system topology enumeration algorithm is based on the assumption that CPUID instruction will return raw data reflecting the native hardware configuration. When an application runs inside a virtual machine hosted by a Virtual Machine Monitor (VMM), any CPUID instructions issued by an app (or a guest OS) are trapped by the VMM and it is the VMM's responsibility and decision to emulate/supply CPUID return data to the virtual machines. When deploying topology enumeration code based on querying CPUID inside a VM environment, the user must consult with the VMM vendor on how an VMM will emulate CPUID instruction relating to topology enumeration. Software visible enumeration in the system: Number of logical processors visible to the OS: 16 Number of logical processors visible to this process: 16 Number of processor cores visible to this process: 8 Number of physical packages visible to this process: 1 Hierarchical counts by levels of processor topology: # of cores in package 0 visible to this process: 8 . # of logical processors in Core 0 visible to this process: 2 . # of logical processors in Core 1 visible to this process: 2 . # of logical processors in Core 2 visible to this process: 2 . # of logical processors in Core 3 visible to this process: 2 . # of logical processors in Core 4 visible to this process: 2 . # of logical processors in Core 5 visible to this process: 2 . # of logical processors in Core 6 visible to this process: 2 . # of logical processors in Core 7 visible to this process: 2 . 
Affinity masks per SMT thread, per core, per package: Individual: P:0, C:0, T:0 --> 1 P:0, C:0, T:1 --> 2 Core-aggregated: P:0, C:0 --> 3 Individual: P:0, C:1, T:0 --> 4 P:0, C:1, T:1 --> 8 Core-aggregated: P:0, C:1 --> c Individual: P:0, C:2, T:0 --> 10 P:0, C:2, T:1 --> 20 Core-aggregated: P:0, C:2 --> 30 Individual: P:0, C:3, T:0 --> 40 P:0, C:3, T:1 --> 80 Core-aggregated: P:0, C:3 --> c0 Individual: P:0, C:4, T:0 --> 100 P:0, C:4, T:1 --> 200 Core-aggregated: P:0, C:4 --> 300 Individual: P:0, C:5, T:0 --> 400 P:0, C:5, T:1 --> 800 Core-aggregated: P:0, C:5 --> c00 Individual: P:0, C:6, T:0 --> 1z3 P:0, C:6, T:1 --> 2z3 Core-aggregated: P:0, C:6 --> 3z3 Individual: P:0, C:7, T:0 --> 4z3 P:0, C:7, T:1 --> 8z3 Core-aggregated: P:0, C:7 --> cz3 Pkg-aggregated: P:0 --> ffff APIC ID listings from affinity masks OS cpu 0, Affinity mask 000001 - apic id 0 OS cpu 1, Affinity mask 000002 - apic id 1 OS cpu 2, Affinity mask 000004 - apic id 2 OS cpu 3, Affinity mask 000008 - apic id 3 OS cpu 4, Affinity mask 000010 - apic id 4 OS cpu 5, Affinity mask 000020 - apic id 5 OS cpu 6, Affinity mask 000040 - apic id 6 OS cpu 7, Affinity mask 000080 - apic id 7 OS cpu 8, Affinity mask 000100 - apic id 8 OS cpu 9, Affinity mask 000200 - apic id 9 OS cpu 10, Affinity mask 000400 - apic id a OS cpu 11, Affinity mask 000800 - apic id b OS cpu 12, Affinity mask 001000 - apic id c OS cpu 13, Affinity mask 002000 - apic id d OS cpu 14, Affinity mask 004000 - apic id e OS cpu 15, Affinity mask 008000 - apic id f Package 0 Cache and Thread details Box Description: Cache is cache level designator Size is cache size OScpu# is cpu # as seen by OS Core is core#[_thread# if > 1 thread/core] inside socket AffMsk is AffinityMask(extended hex) for core and thread CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache CmbMsk will differ from AffMsk if > 1 hw_thread/cache Extended Hex replaces trailing zeroes with 'z#' where # is number of zeroes (so '8z5' is '0x800000') L1D is Level 1 Data cache, size(KBytes)= 32, Cores/cache= 2, Caches/package= 8 L1I is Level 1 Instruction cache, size(KBytes)= 32, Cores/cache= 2, Caches/package= 8 L2 is Level 2 Unified cache, size(KBytes)= 256, Cores/cache= 2, Caches/package= 8 L3 is Level 3 Unified cache, size(KBytes)= 25600, Cores/cache= 16, Caches/package= 1 +-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ Cache | L1D | L1D | L1D | L1D | L1D | L1D | L1D | L1D | Size | 32K | 32K | 32K | 32K | 32K | 32K | 32K | 32K | OScpu#| 0 1| 2 3| 4 5| 6 7| 8 9| 10 11| 12 13| 14 15| Core |c0_t0 c0_t1|c1_t0 c1_t1|c2_t0 c2_t1|c3_t0 c3_t1|c4_t0 c4_t1|c5_t0 c5_t1|c6_t0 c6_t1|c7_t0 c7_t1| AffMsk| 1 2| 4 8| 10 20| 40 80| 100 200| 400 800| 1z3 2z3| 4z3 8z3| CmbMsk| 3 | c | 30 | c0 | 300 | c00 | 3z3 | cz3 | +-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ Cache | L1I | L1I | L1I | L1I | L1I | L1I | L1I | L1I | Size | 32K | 32K | 32K | 32K | 32K | 32K | 32K | 32K | +-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ Cache | L2 | L2 | L2 | L2 | L2 | L2 | L2 | L2 | Size | 256K | 256K | 256K | 256K | 256K | 256K | 256K | 256K | +-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ Cache | L3 | Size | 25M | CmbMsk| ffff | +-----------------------------------------------------------------------------------------------+ [-- Attachment #5: Type: text/plain, Size: 126 bytes --] 
_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions
  2016-01-29 21:59 ` Elena Ufimtseva
@ 2016-02-02 11:58   ` Dario Faggioli
  0 siblings, 0 replies; 22+ messages in thread
From: Dario Faggioli @ 2016-02-02 11:58 UTC (permalink / raw)
To: Elena Ufimtseva
Cc: george.dunlap, George Dunlap, xen-devel, joao.m.martins,
	boris.ostrovsky

[-- Attachment #1.1: Type: text/plain, Size: 1216 bytes --]

On Fri, 2016-01-29 at 16:59 -0500, Elena Ufimtseva wrote:
>
> Hey Dario
>
> Please disregard the previous email with topology information.
> It was incorrect and I am attaching the topology that is actually
> result
> of Joao smt patches application.
>
Ok :-)

Well, this:

...
physical id : 0
siblings    : 16
core id     : 0
cpu cores   : 8
...
physical id : 0
siblings    : 16
core id     : 0
cpu cores   : 8
...
physical id : 0
siblings    : 16
core id     : 1
cpu cores   : 8
...
[etc]

And this:

cat /proc/sys/kernel/sched_domain/cpu*/domain*/names
SMT
MC
SMT
MC
SMT
MC
SMT
MC
[etc]

Makes me think that the patches are doing their job, and that the OS is
correctly seeing a multicore, SMT-enabled topology.

Good work on that! ;-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-27 15:27 ` Konrad Rzeszutek Wilk 2016-01-27 15:53 ` George Dunlap @ 2016-01-27 16:03 ` Elena Ufimtseva 2016-01-28 9:46 ` Dario Faggioli 2016-01-28 15:10 ` Dario Faggioli 2 siblings, 1 reply; 22+ messages in thread From: Elena Ufimtseva @ 2016-01-27 16:03 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: george.dunlap, dario.faggioli, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky On Wed, Jan 27, 2016 at 10:27:01AM -0500, Konrad Rzeszutek Wilk wrote: > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > > On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: > > > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: > > >> On 22/01/16 16:54, Elena Ufimtseva wrote: > > >>> Hello all! > > >>> > > >>> Dario, Gerorge or anyone else, your help will be appreciated. > > >>> > > >>> Let me put some intro to our findings. I may forget something or put something > > >>> not too explicit, please ask me. > > >>> > > >>> Customer filled a bug where some of the applications were running slow in their HVM DomU setups. > > >>> These running times were compared against baremetal running same kernel version as HVM DomU. > > >>> > > >>> After some investigation by different parties, the test case scenario was found > > >>> where the problem was easily seen. The test app is a udp server/client pair where > > >>> client passes some message n number of times. > > >>> The test case was executed on baremetal and Xen DomU with kernel version 2.6.39. > > >>> Bare metal showed 2x times better result that DomU. > > >>> > > >>> Konrad came up with a workaround that was setting the flag for domain scheduler in linux > > >>> As the guest is not aware of SMT-related topology, it has a flat topology initialized. > > >>> Kernel has domain scheduler flags for scheduling domain CPU set to 4143 for 2.6.39. > > >>> Konrad discovered that changing the flag for CPU sched domain to 4655 > > >>> works as a workaround and makes Linux think that the topology has SMT threads. > > >>> This workaround makes the test to complete almost in same time as on baremetal (or insignificantly worse). > > >>> > > >>> This workaround is not suitable for kernels of higher versions as we discovered. > > >>> > > >>> The hackish way of making domU linux think that it has SMT threads (along with matching cpuid) > > >>> made us thinks that the problem comes from the fact that cpu topology is not exposed to > > >>> guest and Linux scheduler cannot make intelligent decision on scheduling. > > >>> > > >>> Joao Martins from Oracle developed set of patches that fixed the smt/core/cashe > > >>> topology numbering and provided matching pinning of vcpus and enabling options, > > >>> allows to expose to guest correct topology. > > >>> I guess Joao will be posting it at some point. > > >>> > > >>> With this patches we decided to test the performance impact on different kernel versionand Xen versions. > > >>> > > >>> The test described above was labeled as IO-bound test. > > >> > > >> So just to clarify: The client sends a request (presumably not much more > > >> than a ping) to the server, and waits for the server to respond before > > >> sending another one; and the server does the reverse -- receives a > > >> request, responds, and then waits for the next request. Is that right? > > > > > > Yes. > > >> > > >> How much data is transferred? 
> > > > > > 1 packet, UDP > > >> > > >> If the amount of data transferred is tiny, then the bottleneck for the > > >> test is probably the IPI time, and I'd call this a "ping-pong" > > >> benchmark[1]. I would only call this "io-bound" if you're actually > > >> copying large amounts of data. > > > > > > What we found is that on baremetal the scheduler would put both apps > > > on the same CPU and schedule them right after each other. This would > > > have a high IPI as the scheduler would poke itself. > > > On Xen it would put the two applications on seperate CPUs - and there > > > would be hardly any IPI. > > > > Sorry -- why would the scheduler send itself an IPI if it's on the same > > logical cpu (which seems pretty pointless), but *not* send an IPI to the > > *other* processor when it was actually waking up another task? > > > > Or do you mean high context switch rate? > > Yes, very high. > > > > > Digging deeper in the code I found out that if you do an UDP sendmsg > > > without any timeouts - it would put it in a queue and just call schedule. > > > > You mean, it would mark the other process as runnable somehow, but not > > actually send an IPI to wake it up? Is that a new "feature" designed > > Correct - because the other process was not on its vCPU runqueue. > > > for large systems, to reduce the IPI traffic or something? > > This is just a normal Linux scheduler. The only way it would do an IPI > to the other CPU was if the UDP message had an timeout. The default > timeout is infite so it didn't bother to send an IPI. > > > > > > On baremetal the schedule would result in scheduler picking up the other > > > task, and starting it - which would dequeue immediately. > > > > > > On Xen - the schedule() would go HLT.. and then later be woken up by the > > > VIRQ_TIMER. And since the two applications were on seperate CPUs - the > > > single packet would just stick in the queue until the VIRQ_TIMER arrived. > > > > I'm not sure I understand the situation right, but it sounds a bit like > > what you're seeing is just a quirk of the fact that Linux doesn't always > > send IPIs to wake other processes up (either by design or by accident), > > It does and it does not :-) > > > but relies on scheduling timers to check for work to do. Presumably > > It .. I am not explaining it well. The Linux kernel scheduler when > called for 'schedule' (from the UDP sendmsg) would either pick the next > appliction and do a context swap - of if there were none - go to sleep. > [Kind of - it also may do an IPI to the other CPU if requested ,but that requires > some hints from underlaying layers] > Since there were only two apps on the runqueue - udp sender and udp receiver > it would run them back-to back (this is on baremetal) > > However if SMT was not exposed - the Linux kernel scheduler would put those > on each CPU runqueue. Meaning each CPU only had one app on its runqueue. > > Hence no need to do an context switch. > [unless you modified the UDP message to have a timeout, then it would > send an IPI] > > they knew that low performance on ping-pong workloads might be a > > possibility when they wrote the code that way; I don't see a reason why > > we should try to work around that in Xen. > > Which is not what I am suggesting. > > Our first ideas was that since this is a Linux kernel schduler characteristic > - let us give the guest all the information it needs to do this. 
That is
> make it look as baremetal as possible - and that is where the vCPU
> pinning and the exposing of SMT information came about. That (Elena
> pls correct me if I am wrong) did indeed show that the guest was doing
> what we expected.
>
> But naturally that requires pinning and all that - and while it is a useful
> case for those that have the vCPUs to spare and can do it - that is not
> a general use-case.
>
> So Elena started looking at the CPU bound and seeing how Xen behaves then
> and if we can improve the floating situation as she saw some abnormal
> behavious.

Maybe it's normal? :)

While having satisfactory results with the ping-pong test and having Joao's
SMT patches in hand, we decided to try a cpu-bound workload.
And that is where exposing SMT does not work that well.
I mostly refer here to the case where two vCPUs are being placed on the same
core while there are idle cores.

This, I think, is what Dario is asking me more details about in another
reply, and I am going to answer his questions.

>
> I do not see any way to fix the udp single message mechanism except
> by modifying the Linux kernel scheduler - and indeed it looks like later
> kernels modified their behavior. Also doing the vCPU pinning and SMT exposing
> did not hurt in those cases (Elena?).

Yes, the drastic performance differences with bare metal were only
observed with the 2.6.39-based kernel.
For this ping-pong udp test, exposing the SMT topology to kernels of
higher versions did help: tests show about a 20 percent performance
improvement compared to the tests where SMT topology is not exposed.
This assumes that SMT exposure goes along with pinning.


kernel.

> >
> > > -George

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-27 16:03 ` Elena Ufimtseva @ 2016-01-28 9:46 ` Dario Faggioli 2016-01-29 16:09 ` Elena Ufimtseva 0 siblings, 1 reply; 22+ messages in thread From: Dario Faggioli @ 2016-01-28 9:46 UTC (permalink / raw) To: Elena Ufimtseva, Konrad Rzeszutek Wilk Cc: george.dunlap, joao.m.martins, George Dunlap, boris.ostrovsky, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 9847 bytes --] On Wed, 2016-01-27 at 11:03 -0500, Elena Ufimtseva wrote: > On Wed, Jan 27, 2016 at 10:27:01AM -0500, Konrad Rzeszutek Wilk > wrote: > > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > > > On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: > > > > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: > > > > > On 22/01/16 16:54, Elena Ufimtseva wrote: > > > > > > Hello all! > > > > > > > > > > > > Dario, Gerorge or anyone else, your help will be > > > > > > appreciated. > > > > > > > > > > > > Let me put some intro to our findings. I may forget > > > > > > something or put something > > > > > > not too explicit, please ask me. > > > > > > > > > > > > Customer filled a bug where some of the applications were > > > > > > running slow in their HVM DomU setups. > > > > > > These running times were compared against baremetal running > > > > > > same kernel version as HVM DomU. > > > > > > > > > > > > After some investigation by different parties, the test > > > > > > case scenario was found > > > > > > where the problem was easily seen. The test app is a udp > > > > > > server/client pair where > > > > > > client passes some message n number of times. > > > > > > The test case was executed on baremetal and Xen DomU with > > > > > > kernel version 2.6.39. > > > > > > Bare metal showed 2x times better result that DomU. > > > > > > > > > > > > Konrad came up with a workaround that was setting the flag > > > > > > for domain scheduler in linux > > > > > > As the guest is not aware of SMT-related topology, it has a > > > > > > flat topology initialized. > > > > > > Kernel has domain scheduler flags for scheduling domain CPU > > > > > > set to 4143 for 2.6.39. > > > > > > Konrad discovered that changing the flag for CPU sched > > > > > > domain to 4655 > > > > > > works as a workaround and makes Linux think that the > > > > > > topology has SMT threads. > > > > > > This workaround makes the test to complete almost in same > > > > > > time as on baremetal (or insignificantly worse). > > > > > > > > > > > > This workaround is not suitable for kernels of higher > > > > > > versions as we discovered. > > > > > > > > > > > > The hackish way of making domU linux think that it has SMT > > > > > > threads (along with matching cpuid) > > > > > > made us thinks that the problem comes from the fact that > > > > > > cpu topology is not exposed to > > > > > > guest and Linux scheduler cannot make intelligent decision > > > > > > on scheduling. > > > > > > > > > > > > Joao Martins from Oracle developed set of patches that > > > > > > fixed the smt/core/cashe > > > > > > topology numbering and provided matching pinning of vcpus > > > > > > and enabling options, > > > > > > allows to expose to guest correct topology. > > > > > > I guess Joao will be posting it at some point. > > > > > > > > > > > > With this patches we decided to test the performance impact > > > > > > on different kernel versionand Xen versions. > > > > > > > > > > > > The test described above was labeled as IO-bound test. 
> > > > > > > > > > So just to clarify: The client sends a request (presumably > > > > > not much more > > > > > than a ping) to the server, and waits for the server to > > > > > respond before > > > > > sending another one; and the server does the reverse -- > > > > > receives a > > > > > request, responds, and then waits for the next request. Is > > > > > that right? > > > > > > > > Yes. > > > > > > > > > > How much data is transferred? > > > > > > > > 1 packet, UDP > > > > > > > > > > If the amount of data transferred is tiny, then the > > > > > bottleneck for the > > > > > test is probably the IPI time, and I'd call this a "ping- > > > > > pong" > > > > > benchmark[1]. I would only call this "io-bound" if you're > > > > > actually > > > > > copying large amounts of data. > > > > > > > > What we found is that on baremetal the scheduler would put both > > > > apps > > > > on the same CPU and schedule them right after each other. This > > > > would > > > > have a high IPI as the scheduler would poke itself. > > > > On Xen it would put the two applications on seperate CPUs - and > > > > there > > > > would be hardly any IPI. > > > > > > Sorry -- why would the scheduler send itself an IPI if it's on > > > the same > > > logical cpu (which seems pretty pointless), but *not* send an IPI > > > to the > > > *other* processor when it was actually waking up another task? > > > > > > Or do you mean high context switch rate? > > > > Yes, very high. > > > > > > > Digging deeper in the code I found out that if you do an UDP > > > > sendmsg > > > > without any timeouts - it would put it in a queue and just call > > > > schedule. > > > > > > You mean, it would mark the other process as runnable somehow, > > > but not > > > actually send an IPI to wake it up? Is that a new "feature" > > > designed > > > > Correct - because the other process was not on its vCPU runqueue. > > > > > for large systems, to reduce the IPI traffic or something? > > > > This is just a normal Linux scheduler. The only way it would do an > > IPI > > to the other CPU was if the UDP message had an timeout. The default > > timeout is infite so it didn't bother to send an IPI. > > > > > > > > > On baremetal the schedule would result in scheduler picking up > > > > the other > > > > task, and starting it - which would dequeue immediately. > > > > > > > > On Xen - the schedule() would go HLT.. and then later be woken > > > > up by the > > > > VIRQ_TIMER. And since the two applications were on seperate > > > > CPUs - the > > > > single packet would just stick in the queue until the > > > > VIRQ_TIMER arrived. > > > > > > I'm not sure I understand the situation right, but it sounds a > > > bit like > > > what you're seeing is just a quirk of the fact that Linux doesn't > > > always > > > send IPIs to wake other processes up (either by design or by > > > accident), > > > > It does and it does not :-) > > > > > but relies on scheduling timers to check for work to > > > do. Presumably > > > > It .. I am not explaining it well. The Linux kernel scheduler when > > called for 'schedule' (from the UDP sendmsg) would either pick the > > next > > appliction and do a context swap - of if there were none - go to > > sleep. 
> > [Kind of - it also may do an IPI to the other CPU if requested ,but > > that requires > > some hints from underlaying layers] > > Since there were only two apps on the runqueue - udp sender and udp > > receiver > > it would run them back-to back (this is on baremetal) > > > > However if SMT was not exposed - the Linux kernel scheduler would > > put those > > on each CPU runqueue. Meaning each CPU only had one app on its > > runqueue. > > > > Hence no need to do an context switch. > > [unless you modified the UDP message to have a timeout, then it > > would > > send an IPI] > > > they knew that low performance on ping-pong workloads might be a > > > possibility when they wrote the code that way; I don't see a > > > reason why > > > we should try to work around that in Xen. > > > > Which is not what I am suggesting. > > > > Our first ideas was that since this is a Linux kernel schduler > > characteristic > > - let us give the guest all the information it needs to do this. > > That is > > make it look as baremetal as possible - and that is where the vCPU > > pinning and the exposing of SMT information came about. That (Elena > > pls correct me if I am wrong) did indeed show that the guest was > > doing > > what we expected. > > > > But naturally that requires pinning and all that - and while it is > > a useful > > case for those that have the vCPUs to spare and can do it - that is > > not > > a general use-case. > > > > So Elena started looking at the CPU bound and seeing how Xen > > behaves then > > and if we can improve the floating situation as she saw some > > abnormal > > behavious. > > Maybe its normal? :) > > While having satisfactory results with ping-pong test and having > Joao's > SMT patches in hand, we decided to try cpu-bound workload. > And that is where exposing SMT does not work that well. > I mostly here refer to the case where two vCPUs are being placed on > same > core while there are idle cores. > > This I think what Dario is asking me more details about in another > reply and I am going to > answer his questions. > Yes, exactly. We need to see more trace entries around the one where we see the vcpus being placed on SMT-siblings. You can well send, or upload somewhere, the full trace, and I'll have a look myself as soon as I can. :-) > > I do not see any way to fix the udp single message mechanism except > > by modifying the Linux kernel scheduler - and indeed it looks like > > later > > kernels modified their behavior. Also doing the vCPU pinning and > > SMT exposing > > did not hurt in those cases (Elena?). > > Yes, the drastic performance differences with bare metal were only > observed with 2.6.39-based kernel. > For this ping-pong udp test exposing the SMT topology to the kernels > if > higher versions did help as tests show about 20 percent performance > improvement comparing to the tests where SMT topology is not exposed. > This assumes that SMT exposure goes along with pinning. > > > kernel. > hypervisor. 
:-D :-D :-D Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-01-28 9:46 ` Dario Faggioli @ 2016-01-29 16:09 ` Elena Ufimtseva 0 siblings, 0 replies; 22+ messages in thread From: Elena Ufimtseva @ 2016-01-29 16:09 UTC (permalink / raw) To: Dario Faggioli Cc: george.dunlap, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky [-- Attachment #1: Type: text/plain, Size: 11053 bytes --] On Thu, Jan 28, 2016 at 09:46:46AM +0000, Dario Faggioli wrote: > On Wed, 2016-01-27 at 11:03 -0500, Elena Ufimtseva wrote: > > On Wed, Jan 27, 2016 at 10:27:01AM -0500, Konrad Rzeszutek Wilk > > wrote: > > > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > > > > On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote: > > > > > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote: > > > > > > On 22/01/16 16:54, Elena Ufimtseva wrote: > > > > > > > Hello all! > > > > > > > > > > > > > > Dario, Gerorge or anyone else, your help will be > > > > > > > appreciated. > > > > > > > > > > > > > > Let me put some intro to our findings. I may forget > > > > > > > something or put something > > > > > > > not too explicit, please ask me. > > > > > > > > > > > > > > Customer filled a bug where some of the applications were > > > > > > > running slow in their HVM DomU setups. > > > > > > > These running times were compared against baremetal running > > > > > > > same kernel version as HVM DomU. > > > > > > > > > > > > > > After some investigation by different parties, the test > > > > > > > case scenario was found > > > > > > > where the problem was easily seen. The test app is a udp > > > > > > > server/client pair where > > > > > > > client passes some message n number of times. > > > > > > > The test case was executed on baremetal and Xen DomU with > > > > > > > kernel version 2.6.39. > > > > > > > Bare metal showed 2x times better result that DomU. > > > > > > > > > > > > > > Konrad came up with a workaround that was setting the flag > > > > > > > for domain scheduler in linux > > > > > > > As the guest is not aware of SMT-related topology, it has a > > > > > > > flat topology initialized. > > > > > > > Kernel has domain scheduler flags for scheduling domain CPU > > > > > > > set to 4143 for 2.6.39. > > > > > > > Konrad discovered that changing the flag for CPU sched > > > > > > > domain to 4655 > > > > > > > works as a workaround and makes Linux think that the > > > > > > > topology has SMT threads. > > > > > > > This workaround makes the test to complete almost in same > > > > > > > time as on baremetal (or insignificantly worse). > > > > > > > > > > > > > > This workaround is not suitable for kernels of higher > > > > > > > versions as we discovered. > > > > > > > > > > > > > > The hackish way of making domU linux think that it has SMT > > > > > > > threads (along with matching cpuid) > > > > > > > made us thinks that the problem comes from the fact that > > > > > > > cpu topology is not exposed to > > > > > > > guest and Linux scheduler cannot make intelligent decision > > > > > > > on scheduling. > > > > > > > > > > > > > > Joao Martins from Oracle developed set of patches that > > > > > > > fixed the smt/core/cashe > > > > > > > topology numbering and provided matching pinning of vcpus > > > > > > > and enabling options, > > > > > > > allows to expose to guest correct topology. > > > > > > > I guess Joao will be posting it at some point. > > > > > > > > > > > > > > With this patches we decided to test the performance impact > > > > > > > on different kernel versionand Xen versions. 
> > > > > > > > > > > > > > The test described above was labeled as IO-bound test. > > > > > > > > > > > > So just to clarify: The client sends a request (presumably > > > > > > not much more > > > > > > than a ping) to the server, and waits for the server to > > > > > > respond before > > > > > > sending another one; and the server does the reverse -- > > > > > > receives a > > > > > > request, responds, and then waits for the next request. Is > > > > > > that right? > > > > > > > > > > Yes. > > > > > > > > > > > > How much data is transferred? > > > > > > > > > > 1 packet, UDP > > > > > > > > > > > > If the amount of data transferred is tiny, then the > > > > > > bottleneck for the > > > > > > test is probably the IPI time, and I'd call this a "ping- > > > > > > pong" > > > > > > benchmark[1]. I would only call this "io-bound" if you're > > > > > > actually > > > > > > copying large amounts of data. > > > > > > > > > > What we found is that on baremetal the scheduler would put both > > > > > apps > > > > > on the same CPU and schedule them right after each other. This > > > > > would > > > > > have a high IPI as the scheduler would poke itself. > > > > > On Xen it would put the two applications on seperate CPUs - and > > > > > there > > > > > would be hardly any IPI. > > > > > > > > Sorry -- why would the scheduler send itself an IPI if it's on > > > > the same > > > > logical cpu (which seems pretty pointless), but *not* send an IPI > > > > to the > > > > *other* processor when it was actually waking up another task? > > > > > > > > Or do you mean high context switch rate? > > > > > > Yes, very high. > > > > > > > > > Digging deeper in the code I found out that if you do an UDP > > > > > sendmsg > > > > > without any timeouts - it would put it in a queue and just call > > > > > schedule. > > > > > > > > You mean, it would mark the other process as runnable somehow, > > > > but not > > > > actually send an IPI to wake it up? Is that a new "feature" > > > > designed > > > > > > Correct - because the other process was not on its vCPU runqueue. > > > > > > > for large systems, to reduce the IPI traffic or something? > > > > > > This is just a normal Linux scheduler. The only way it would do an > > > IPI > > > to the other CPU was if the UDP message had an timeout. The default > > > timeout is infite so it didn't bother to send an IPI. > > > > > > > > > > > > On baremetal the schedule would result in scheduler picking up > > > > > the other > > > > > task, and starting it - which would dequeue immediately. > > > > > > > > > > On Xen - the schedule() would go HLT.. and then later be woken > > > > > up by the > > > > > VIRQ_TIMER. And since the two applications were on seperate > > > > > CPUs - the > > > > > single packet would just stick in the queue until the > > > > > VIRQ_TIMER arrived. > > > > > > > > I'm not sure I understand the situation right, but it sounds a > > > > bit like > > > > what you're seeing is just a quirk of the fact that Linux doesn't > > > > always > > > > send IPIs to wake other processes up (either by design or by > > > > accident), > > > > > > It does and it does not :-) > > > > > > > but relies on scheduling timers to check for work to > > > > do. Presumably > > > > > > It .. I am not explaining it well. The Linux kernel scheduler when > > > called for 'schedule' (from the UDP sendmsg) would either pick the > > > next > > > appliction and do a context swap - of if there were none - go to > > > sleep. 
> > > [Kind of - it also may do an IPI to the other CPU if requested ,but > > > that requires > > > some hints from underlaying layers] > > > Since there were only two apps on the runqueue - udp sender and udp > > > receiver > > > it would run them back-to back (this is on baremetal) > > > > > > However if SMT was not exposed - the Linux kernel scheduler would > > > put those > > > on each CPU runqueue. Meaning each CPU only had one app on its > > > runqueue. > > > > > > Hence no need to do an context switch. > > > [unless you modified the UDP message to have a timeout, then it > > > would > > > send an IPI] > > > > they knew that low performance on ping-pong workloads might be a > > > > possibility when they wrote the code that way; I don't see a > > > > reason why > > > > we should try to work around that in Xen. > > > > > > Which is not what I am suggesting. > > > > > > Our first ideas was that since this is a Linux kernel schduler > > > characteristic > > > - let us give the guest all the information it needs to do this. > > > That is > > > make it look as baremetal as possible - and that is where the vCPU > > > pinning and the exposing of SMT information came about. That (Elena > > > pls correct me if I am wrong) did indeed show that the guest was > > > doing > > > what we expected. > > > > > > But naturally that requires pinning and all that - and while it is > > > a useful > > > case for those that have the vCPUs to spare and can do it - that is > > > not > > > a general use-case. > > > > > > So Elena started looking at the CPU bound and seeing how Xen > > > behaves then > > > and if we can improve the floating situation as she saw some > > > abnormal > > > behavious. > > > > Maybe its normal? :) > > > > While having satisfactory results with ping-pong test and having > > Joao's > > SMT patches in hand, we decided to try cpu-bound workload. > > And that is where exposing SMT does not work that well. > > I mostly here refer to the case where two vCPUs are being placed on > > same > > core while there are idle cores. > > > > This I think what Dario is asking me more details about in another > > reply and I am going to > > answer his questions. > > > Yes, exactly. We need to see more trace entries around the one where we > see the vcpus being placed on SMT-siblings. You can well send, or > upload somewhere, the full trace, and I'll have a look myself as soon > as I can. :-) Hi Dario So here is the trace with smt patches applied, 5 iterations of cpu-bound workload. dom0 has 2 not-pinned vcpus, domU has 16 vcpus, not pinned as well, 8 active threads of cpu-bound test. Here is a trace output: https://drive.google.com/file/d/0ByVx1zSzgzLIbjFLTXFsbDJ4QVU/view?usp=sharing Topology in guest after smt patches applied can be seen in topology_smt_patches, along with cpuinfo and sched_domains. The thing what we are trying to figure out is why data cache is not being shared between threads and why number of packages is 4. We are looking at this right now. Dario Let me know if you would think of any other data what may help. Elena > > > > I do not see any way to fix the udp single message mechanism except > > > by modifying the Linux kernel scheduler - and indeed it looks like > > > later > > > kernels modified their behavior. Also doing the vCPU pinning and > > > SMT exposing > > > did not hurt in those cases (Elena?). > > > > Yes, the drastic performance differences with bare metal were only > > observed with 2.6.39-based kernel. 
> > For this ping-pong udp test exposing the SMT topology to the kernels > > if > > higher versions did help as tests show about 20 percent performance > > improvement comparing to the tests where SMT topology is not exposed. > > This assumes that SMT exposure goes along with pinning. > > > > > > kernel. > > > hypervisor. > > :-D :-D :-D > > Regards, > Dario > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > [-- Attachment #2: topology_smt_patches --] [-- Type: text/plain, Size: 8617 bytes --] Advisory to Users on system topology enumeration This utility is for demonstration purpose only. It assumes the hardware topology configuration within a coherent domain does not change during the life of an OS session. If an OS support advanced features that can change hardware topology configurations, more sophisticated adaptation may be necessary to account for the hardware configuration change that might have added and reduced the number of logical processors being managed by the OS. User should also`be aware that the system topology enumeration algorithm is based on the assumption that CPUID instruction will return raw data reflecting the native hardware configuration. When an application runs inside a virtual machine hosted by a Virtual Machine Monitor (VMM), any CPUID instructions issued by an app (or a guest OS) are trapped by the VMM and it is the VMM's responsibility and decision to emulate/supply CPUID return data to the virtual machines. When deploying topology enumeration code based on querying CPUID inside a VM environment, the user must consult with the VMM vendor on how an VMM will emulate CPUID instruction relating to topology enumeration. Software visible enumeration in the system: Number of logical processors visible to the OS: 16 Number of logical processors visible to this process: 16 Number of processor cores visible to this process: 8 Number of physical packages visible to this process: 4 Hierarchical counts by levels of processor topology: # of cores in package 0 visible to this process: 2 . # of logical processors in Core 0 visible to this process: 2 . # of logical processors in Core 1 visible to this process: 2 . # of cores in package 1 visible to this process: 2 . # of logical processors in Core 0 visible to this process: 2 . # of logical processors in Core 1 visible to this process: 2 . # of cores in package 2 visible to this process: 2 . # of logical processors in Core 0 visible to this process: 2 . # of logical processors in Core 1 visible to this process: 2 . # of cores in package 3 visible to this process: 2 . # of logical processors in Core 0 visible to this process: 2 . # of logical processors in Core 1 visible to this process: 2 . 
Affinity masks per SMT thread, per core, per package: Individual: P:0, C:0, T:0 --> 1 P:0, C:0, T:1 --> 2 Core-aggregated: P:0, C:0 --> 3 Individual: P:0, C:1, T:0 --> 4 P:0, C:1, T:1 --> 8 Core-aggregated: P:0, C:1 --> c Pkg-aggregated: P:0 --> f Individual: P:1, C:0, T:0 --> 10 P:1, C:0, T:1 --> 20 Core-aggregated: P:1, C:0 --> 30 Individual: P:1, C:1, T:0 --> 40 P:1, C:1, T:1 --> 80 Core-aggregated: P:1, C:1 --> c0 Pkg-aggregated: P:1 --> f0 Individual: P:2, C:0, T:0 --> 100 P:2, C:0, T:1 --> 200 Core-aggregated: P:2, C:0 --> 300 Individual: P:2, C:1, T:0 --> 400 P:2, C:1, T:1 --> 800 Core-aggregated: P:2, C:1 --> c00 Pkg-aggregated: P:2 --> f00 Individual: P:3, C:0, T:0 --> 1z3 P:3, C:0, T:1 --> 2z3 Core-aggregated: P:3, C:0 --> 3z3 Individual: P:3, C:1, T:0 --> 4z3 P:3, C:1, T:1 --> 8z3 Core-aggregated: P:3, C:1 --> cz3 Pkg-aggregated: P:3 --> fz3 APIC ID listings from affinity masks OS cpu 0, Affinity mask 000001 - apic id 0 OS cpu 1, Affinity mask 000002 - apic id 1 OS cpu 2, Affinity mask 000004 - apic id 2 OS cpu 3, Affinity mask 000008 - apic id 3 OS cpu 4, Affinity mask 000010 - apic id 4 OS cpu 5, Affinity mask 000020 - apic id 5 OS cpu 6, Affinity mask 000040 - apic id 6 OS cpu 7, Affinity mask 000080 - apic id 7 OS cpu 8, Affinity mask 000100 - apic id 8 OS cpu 9, Affinity mask 000200 - apic id 9 OS cpu 10, Affinity mask 000400 - apic id a OS cpu 11, Affinity mask 000800 - apic id b OS cpu 12, Affinity mask 001000 - apic id c OS cpu 13, Affinity mask 002000 - apic id d OS cpu 14, Affinity mask 004000 - apic id e OS cpu 15, Affinity mask 008000 - apic id f Package 0 Cache and Thread details Box Description: Cache is cache level designator Size is cache size OScpu# is cpu # as seen by OS Core is core#[_thread# if > 1 thread/core] inside socket AffMsk is AffinityMask(extended hex) for core and thread CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache CmbMsk will differ from AffMsk if > 1 hw_thread/cache Extended Hex replaces trailing zeroes with 'z#' where # is number of zeroes (so '8z5' is '0x800000') L1D is Level 1 Data cache, size(KBytes)= 32, Cores/cache= 1, Caches/package= 4 L1I is Level 1 Instruction cache, size(KBytes)= 32, Cores/cache= 2, Caches/package= 2 L2 is Level 2 Unified cache, size(KBytes)= 256, Cores/cache= 2, Caches/package= 2 L3 is Level 3 Unified cache, size(KBytes)= 25600, Cores/cache= 16, Caches/package= 0 +-----+-----+-----+-----+ Cache | L1D| L1D| L1D| L1D| Size | 32K| 32K| 32K| 32K| OScpu#| 0| 1| 2| 3| Core |c0_t0|c0_t1|c1_t0|c1_t1| AffMsk| 1| 2| 4| 8| +-----+-----+-----+-----+ Cache | L1I | L1I | Size | 32K | 32K | CmbMsk| 3 | c | +-----------+-----------+ Cache | L2 | L2 | Size | 256K | 256K | +-----------+-----------+ Cache | L3 | Size | 25M | CmbMsk| f | +-----------------------+ Package 1 Cache and Thread details Box Description: Cache is cache level designator Size is cache size OScpu# is cpu # as seen by OS Core is core#[_thread# if > 1 thread/core] inside socket AffMsk is AffinityMask(extended hex) for core and thread CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache CmbMsk will differ from AffMsk if > 1 hw_thread/cache Extended Hex replaces trailing zeroes with 'z#' where # is number of zeroes (so '8z5' is '0x800000') +-----+-----+-----+-----+ Cache | L1D| L1D| L1D| L1D| Size | 32K| 32K| 32K| 32K| OScpu#| 4| 5| 6| 7| Core |c0_t0|c0_t1|c1_t0|c1_t1| AffMsk| 10| 20| 40| 80| +-----+-----+-----+-----+ Cache | L1I | L1I | Size | 32K | 32K | CmbMsk| 30 | c0 | +-----------+-----------+ Cache | L2 | L2 | 
Size | 256K | 256K | +-----------+-----------+ Cache | L3 | Size | 25M | CmbMsk| f0 | +-----------------------+ Package 2 Cache and Thread details Box Description: Cache is cache level designator Size is cache size OScpu# is cpu # as seen by OS Core is core#[_thread# if > 1 thread/core] inside socket AffMsk is AffinityMask(extended hex) for core and thread CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache CmbMsk will differ from AffMsk if > 1 hw_thread/cache Extended Hex replaces trailing zeroes with 'z#' where # is number of zeroes (so '8z5' is '0x800000') +-----+-----+-----+-----+ Cache | L1D| L1D| L1D| L1D| Size | 32K| 32K| 32K| 32K| OScpu#| 8| 9| 10| 11| Core |c0_t0|c0_t1|c1_t0|c1_t1| AffMsk| 100| 200| 400| 800| +-----+-----+-----+-----+ Cache | L1I | L1I | Size | 32K | 32K | CmbMsk| 300 | c00 | +-----------+-----------+ Cache | L2 | L2 | Size | 256K | 256K | +-----------+-----------+ Cache | L3 | Size | 25M | CmbMsk| f00 | +-----------------------+ Package 3 Cache and Thread details Box Description: Cache is cache level designator Size is cache size OScpu# is cpu # as seen by OS Core is core#[_thread# if > 1 thread/core] inside socket AffMsk is AffinityMask(extended hex) for core and thread CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache CmbMsk will differ from AffMsk if > 1 hw_thread/cache Extended Hex replaces trailing zeroes with 'z#' where # is number of zeroes (so '8z5' is '0x800000') +-----+-----+-----+-----+ Cache | L1D| L1D| L1D| L1D| Size | 32K| 32K| 32K| 32K| OScpu#| 12| 13| 14| 15| Core |c0_t0|c0_t1|c1_t0|c1_t1| AffMsk| 1z3| 2z3| 4z3| 8z3| +-----+-----+-----+-----+ Cache | L1I | L1I | Size | 32K | 32K | CmbMsk| 3z3 | cz3 | +-----------+-----------+ Cache | L2 | L2 | Size | 256K | 256K | +-----------+-----------+ Cache | L3 | Size | 25M | CmbMsk| fz3 | +-----------------------+ [-- Attachment #3: sched_domains --] [-- Type: text/plain, Size: 364 bytes --] cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 4783 559 cat /proc/sys/kernel/sched_domain/cpu*/domain*/names MT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC SMT MC [-- Attachment #4: cpuinfo --] [-- Type: text/plain, Size: 12642 bytes --] processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.67 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.67 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 2 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.67 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 2 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5586.67 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 1 siblings : 4 core id : 0 cpu cores : 2 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.43 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 1 siblings : 4 core id : 0 cpu cores : 2 apicid : 5 initial apicid : 5 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.43 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 6 vendor_id : GenuineIntel cpu 
family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 1 siblings : 4 core id : 1 cpu cores : 2 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.43 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 1 siblings : 4 core id : 1 cpu cores : 2 apicid : 7 initial apicid : 7 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.43 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 8 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 2 siblings : 4 core id : 0 cpu cores : 2 apicid : 8 initial apicid : 8 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.48 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 9 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 2 siblings : 4 core id : 0 cpu cores : 2 apicid : 9 initial apicid : 9 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.48 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 10 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 2 siblings : 4 core id : 1 cpu cores : 2 apicid : 10 initial apicid : 10 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc 
rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.48 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 11 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 2 siblings : 4 core id : 1 cpu cores : 2 apicid : 11 initial apicid : 11 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5606.48 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 12 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 3 siblings : 4 core id : 0 cpu cores : 2 apicid : 12 initial apicid : 12 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5608.87 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 13 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 3 siblings : 4 core id : 0 cpu cores : 2 apicid : 13 initial apicid : 13 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5608.87 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 14 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 3 siblings : 4 core id : 1 cpu cores : 2 apicid : 14 initial apicid : 14 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5608.87 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 15 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Genuine Intel(R) CPU @ 2.80GHz stepping : 
2 microcode : 0x209 cpu MHz : 2793.338 cache size : 25600 KB physical id : 3 siblings : 4 core id : 1 cpu cores : 2 apicid : 15 initial apicid : 15 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt bugs : bogomips : 5608.87 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: [-- Attachment #5: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
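On the question raised above about the guest reporting 4 packages and L1D caches that are not shared between SMT siblings: one way to tell whether that comes from the CPUID values the guest is actually given, or from the enumeration tool's interpretation of them, is to dump the raw topology and cache leaves from inside the domU. The following is only an illustrative sketch (it assumes an x86 guest and GCC's <cpuid.h>; the file name is made up, and it is not part of Joao's patches or of the test harness):

/* cpuid-topo.c (name made up): print the CPUID data the guest sees for
 * topology (leaf 0xb) and cache sharing (leaf 4), for comparison with the
 * enumeration tool output attached above.
 * Build with: gcc -o cpuid-topo cpuid-topo.c */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 0xb: x2APIC topology levels (level type 1 = SMT, 2 = Core). */
    for (unsigned int lvl = 0; ; lvl++) {
        if (!__get_cpuid_count(0xb, lvl, &eax, &ebx, &ecx, &edx))
            break;
        unsigned int type = (ecx >> 8) & 0xff;
        if (type == 0)
            break;
        printf("leaf 0xb level %u: type %u, shift %u, logical CPUs %u, x2APIC id %u\n",
               lvl, type, eax & 0x1f, ebx & 0xffff, edx);
    }

    /* Leaf 4: for each cache, how many logical CPUs may share it. */
    for (unsigned int idx = 0; ; idx++) {
        if (!__get_cpuid_count(4, idx, &eax, &ebx, &ecx, &edx))
            break;
        if ((eax & 0x1f) == 0)   /* cache type 0: no more caches */
            break;
        printf("cache %u: level L%u, shared by up to %u logical CPUs, cores per package %u\n",
               idx, (eax >> 5) & 0x7,
               ((eax >> 14) & 0xfff) + 1,
               ((eax >> 26) & 0x3f) + 1);
    }
    return 0;
}

Both the guest kernel and the enumeration utility derive package/core/cache-sharing information from these two leaves plus the APIC IDs, so running this on each vCPU in turn should show directly whether the "4 packages" and the unshared L1D are what the emulated CPUID actually says, or an artefact of how the tool interprets it.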
* Re: schedulers and topology exposing questions 2016-01-27 15:27 ` Konrad Rzeszutek Wilk 2016-01-27 15:53 ` George Dunlap 2016-01-27 16:03 ` Elena Ufimtseva @ 2016-01-28 15:10 ` Dario Faggioli 2016-01-29 3:27 ` Konrad Rzeszutek Wilk 2 siblings, 1 reply; 22+ messages in thread From: Dario Faggioli @ 2016-01-28 15:10 UTC (permalink / raw) To: Konrad Rzeszutek Wilk, George Dunlap Cc: Elena Ufimtseva, george.dunlap, joao.m.martins, boris.ostrovsky, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 1755 bytes --] On Wed, 2016-01-27 at 10:27 -0500, Konrad Rzeszutek Wilk wrote: > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > > > > I'm not sure I understand the situation right, but it sounds a bit > > like > > what you're seeing is just a quirk of the fact that Linux doesn't > > always > > send IPIs to wake other processes up (either by design or by > > accident), > > It does and it does not :-) > > > but relies on scheduling timers to check for work to > > do. Presumably > > It .. I am not explaining it well. The Linux kernel scheduler when > called for 'schedule' (from the UDP sendmsg) would either pick the > next > appliction and do a context swap - of if there were none - go to > sleep. > [Kind of - it also may do an IPI to the other CPU if requested ,but > that requires > some hints from underlaying layers] > Since there were only two apps on the runqueue - udp sender and udp > receiver > it would run them back-to back (this is on baremetal) > > However if SMT was not exposed - the Linux kernel scheduler would put > those > on each CPU runqueue. Meaning each CPU only had one app on its > runqueue. > > Hence no need to do an context switch. > [unless you modified the UDP message to have a timeout, then it would > send an IPI] > So, may I ask what piece of (Linux) code are we actually talking about? Because I had a quick look, and could not find where what you describe happens.... Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
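To make the workload in this exchange concrete for readers joining here: the user-space side of the ping-pong boils down to a client loop like the sketch below. This is a reconstruction for illustration only; the actual test program is not posted in this thread, and the server address, port and iteration count are made up.

/* udp ping-pong client sketch: send one small message, block until the
 * echo comes back, repeat n times. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(5001);                      /* assumed port */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);  /* assumed server address */

    char msg[32] = "ping", reply[32];
    for (int i = 0; i < 100000; i++) {               /* "n number of times" */
        sendto(fd, msg, sizeof(msg), 0, (struct sockaddr *)&srv, sizeof(srv));
        /* Blocks in the kernel's UDP receive path until the reply arrives;
         * this is where the schedule()/wakeup behaviour under discussion
         * comes into play. */
        recv(fd, reply, sizeof(reply), 0);
    }
    return 0;
}

The matching server/receiver side of the sketch appears after Konrad's reply below.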
* Re: schedulers and topology exposing questions 2016-01-28 15:10 ` Dario Faggioli @ 2016-01-29 3:27 ` Konrad Rzeszutek Wilk 2016-02-02 11:45 ` Dario Faggioli 0 siblings, 1 reply; 22+ messages in thread From: Konrad Rzeszutek Wilk @ 2016-01-29 3:27 UTC (permalink / raw) To: Dario Faggioli Cc: Elena Ufimtseva, george.dunlap, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote: > On Wed, 2016-01-27 at 10:27 -0500, Konrad Rzeszutek Wilk wrote: > > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote: > > > > > > I'm not sure I understand the situation right, but it sounds a bit > > > like > > > what you're seeing is just a quirk of the fact that Linux doesn't > > > always > > > send IPIs to wake other processes up (either by design or by > > > accident), > > > > It does and it does not :-) > > > > > but relies on scheduling timers to check for work to > > > do. Presumably > > > > It .. I am not explaining it well. The Linux kernel scheduler when > > called for 'schedule' (from the UDP sendmsg) would either pick the > > next > > appliction and do a context swap - of if there were none - go to > > sleep. > > [Kind of - it also may do an IPI to the other CPU if requested ,but > > that requires > > some hints from underlaying layers] > > Since there were only two apps on the runqueue - udp sender and udp > > receiver > > it would run them back-to back (this is on baremetal) > > > > However if SMT was not exposed - the Linux kernel scheduler would put > > those > > on each CPU runqueue. Meaning each CPU only had one app on its > > runqueue. > > > > Hence no need to do an context switch. > > [unless you modified the UDP message to have a timeout, then it would > > send an IPI] > > > So, may I ask what piece of (Linux) code are we actually talking about? > Because I had a quick look, and could not find where what you describe > happens.... udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT but you can alter the UDP by having a diffrent timeout. And MAX_SCHEDULE_TIMEOUT when it eventually calls 'schedule()' just goes to sleep (HLT) and eventually gets woken up VIRQ_TIMER. The other side - udp_sendmsg is more complex, and I don't seem to have the stacktrace. > > Thanks and Regards, > Dario > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > ^ permalink raw reply [flat|nested] 22+ messages in thread
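The matching server/receiver side of the sketch above is where the knob Konrad mentions lives in user space: left alone, the socket's receive timeout stays at its default, which is what ends up as MAX_SCHEDULE_TIMEOUT in sk->sk_rcvtimeo, while setting SO_RCVTIMEO hands __skb_recv_datagram() a finite timeout to pass down to schedule_timeout(). Again only a sketch; the port number is made up.

/* udp ping-pong server sketch: echo each packet straight back. */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5001);                     /* assumed port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* With no SO_RCVTIMEO set, sk->sk_rcvtimeo keeps its default
     * (MAX_SCHEDULE_TIMEOUT) and the recvfrom() below sleeps in schedule()
     * with no timer armed.  Uncommenting this gives the receive path a real
     * timeout, i.e. the schedule_timeout()-with-timer branch. */
    /*
    struct timeval tv = { 1, 0 };
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    */

    char buf[64];
    struct sockaddr_in peer;
    socklen_t plen;
    for (;;) {
        plen = sizeof(peer);
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                             (struct sockaddr *)&peer, &plen);
        if (n > 0)   /* echo the single packet straight back */
            sendto(fd, buf, n, 0, (struct sockaddr *)&peer, plen);
    }
}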
* Re: schedulers and topology exposing questions 2016-01-29 3:27 ` Konrad Rzeszutek Wilk @ 2016-02-02 11:45 ` Dario Faggioli 2016-02-03 18:05 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 22+ messages in thread From: Dario Faggioli @ 2016-02-02 11:45 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Elena Ufimtseva, george.dunlap, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky [-- Attachment #1.1: Type: text/plain, Size: 5976 bytes --] On Thu, 2016-01-28 at 22:27 -0500, Konrad Rzeszutek Wilk wrote: > On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote: > > > > So, may I ask what piece of (Linux) code are we actually talking > > about? > > Because I had a quick look, and could not find where what you > > describe > > happens.... > > udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout > The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT but you can alter the > UDP by having a diffrent timeout. > Ha, recvmsg! At some point you mentioned sendmsg, and I was looking there and seeing nothing! But yes, it indeed makes sense to consider the receiving side... let me have a look... So, it looks to me that this is what happens:
udp_recvmsg(noblock=0)
  |
  ---> __skb_recv_datagram(flags=0) {
         timeo = sock_rcvtimeo(flags=0) /* returns sk->sk_rcvtimeo */
         do {...} wait_for_more_packets(timeo);
                    |
                    ---> schedule_timeout(timeo)
So, at least in Linux 4.4, the timeout used is the one defined in sk->sk_rcvtimeo, which looks to me to be this (unless I've followed some link wrong, which can well be the case): http://lxr.free-electrons.com/source/include/uapi/asm-generic/socket.h#L31 #define SO_RCVTIMEO 20 So there looks to be a timeout. But anyways, let's check schedule_timeout(). > And MAX_SCHEDULE_TIMEOUT when it eventually calls 'schedule()' just > goes to sleep (HLT) and eventually gets woken up VIRQ_TIMER. > So, if the timeout is MAX_SCHEDULE_TIMEOUT, the function does:
schedule_timeout(MAX_SCHEDULE_TIMEOUT) {
    schedule();
    return;
}
If the timeout is anything other than MAX_SCHEDULE_TIMEOUT (but still a valid value), the function does:
schedule_timeout(timeout) {
    struct timer_list timer;

    setup_timer_on_stack(&timer);
    __mod_timer(&timer);
    schedule();
    del_singleshot_timer_sync(&timer);
    destroy_timer_on_stack(&timer);
    return;
}
So, in both cases, it pretty much calls schedule() just about immediately. And when schedule() is called, the calling process -- which would be our UDP receiver -- goes to sleep. The difference is that, in case of MAX_SCHEDULE_TIMEOUT, it does not arrange for anyone to wakeup the thread that is going to sleep. In theory, it could even be stuck forever... Of course, this depends on whether the receiver thread is on a runqueue or not and, in case it's not, on whether its status is TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, etc., and, in practice, it never happens! :-D In this case, I think we take the other branch (the one 'with timeout'). But even if we would take this one, I would expect the receiver thread to not be on any runqueue, but rather to be (either in interruptible or not state) in a blocking list from where it is taken out when a packet arrives. In case of anything other than MAX_SCHEDULE_TIMEOUT, all the above is still true, but a timer is set before calling schedule() and putting the thread to sleep. 
And in fact, schedule_timeout() is not a different way, with respect to just calling schedule(), to going to sleep. It is the way you go to sleep for at most some amount of time... But in all cases, you just and immediately go to sleep! And I also am not sure I see where all that discussion you've had with George about IPIs fit into this all... The IPI that will trigger the call to schedule() that will actually put back to execution the thread that we're putting to sleep in here (i.e., the receiver), happens when the sender manages to send a packet (actually, when the packet arrives, I think) _or_ when the timer expires. The two possible calls to schedule() in schedule_timeout() behave exactly in the same way, and I don't think having a timeout or not is responsible for any particular behavior. What I think it's happening is this: when such a call to schedule() (from inside schedule_timeout(), I mean) is made what happens is that the receiver task just goes to sleep, and another one, perhaps the sender, is executed. The sender sends the packet, which arrives before the timeout, and the receiver is woken up. *Here* is where an IPI should or should not happen, depending on where our receiver task is going to be executed! And where would that be? Well, that depends on the Linux scheduler's load balancer, the behavior of which is controlled by scheduling domains flags like BALANCE_FORK, BALANCE_EXEC, BALANCE_WAKE, WAKE_AFFINE and PREFER_SIBLINGS (and others, but I think these are the most likely ones to be involved here). So, in summary, where the receiver executes when it wakes up on what is the configuration of such flags in the (various) scheduling domain(s). Check, for instance, this path: try_to_wakeu_up() --> select_task_irq() --> select_task_rq_fair() The reason why the tests 'reacts' to topology changes is that which set of flags is used for the various scheduling domains is, during the time the scheduling domains themselves are created and configured-- depends on topology... So it's quite possible that exposing the SMT topology, wrt to not doing so, makes one of the flag flip in a way which makes the benchmark work better. If you play with the flags above (or whatever they equivalents were in 2.6.39) directly, even without exposing the SMT-topology, I'm quite sure you would be able to trigger the same behavior. Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: schedulers and topology exposing questions 2016-02-02 11:45 ` Dario Faggioli @ 2016-02-03 18:05 ` Konrad Rzeszutek Wilk 0 siblings, 0 replies; 22+ messages in thread From: Konrad Rzeszutek Wilk @ 2016-02-03 18:05 UTC (permalink / raw) To: Dario Faggioli Cc: Elena Ufimtseva, george.dunlap, George Dunlap, xen-devel, joao.m.martins, boris.ostrovsky On Tue, Feb 02, 2016 at 12:45:00PM +0100, Dario Faggioli wrote: > On Thu, 2016-01-28 at 22:27 -0500, Konrad Rzeszutek Wilk wrote: > > On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote: > > > > > > So, may I ask what piece of (Linux) code are we actually talking > > > about? > > > Because I had a quick look, and could not find where what you > > > describe > > > happens.... > > > > udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout > > The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT but you can alter the > > UDP by having a diffrent timeout. > > > Ha, recvmsg! At some point you mentioned sendmsg, and I was looking > there and seeing nothing! But yes, it indeed makes sense to consider > the receiving side... let me have a look... > > So, it looks to me that this is what happens: > > udp_recvmsg(noblock=0) > | > ---> __skb_recv_datagram(flags=0) { > timeo = sock_rcvtimeo(flags=0) /* returns sk->sk_rcvtimeo */ > do {...} wait_for_more_packets(timeo); > | > ---> schedule_timeor(timeo) > > So, at least in Linux 4.4, the timeout used is the one defined in > sk->sk_rcvtimeo, which it looks to me to be this (unless I've followed > some link wrong, which can well be the case): > > http://lxr.free-electrons.com/source/include/uapi/asm-generic/socket.h#L31 > #define SO_RCVTIMEO 20 > > So there looks to be a timeout. But anyways, let's check > schedule_timeout(). > > > And MAX_SCHEDULE_TIMEOUT when it eventually calls 'schedule()' just > > goes to sleep (HLT) and eventually gets woken up VIRQ_TIMER. > > > So, if the timeout is MAX_SCHEDULE_TIMEOUT, the function does: > > schedule_timeout(SCHEDULE_TIMEOUT) { > schedule(); > return; > } > > If the timeout is anything else than MAX_SCHEDULE_TIMEOUT (but still a > valid value), the function does: > > schedule_timeout(timeout) { > struct timer_list timer; > > setup_timer_on_stack(&timer); > __mod_timer(&timer); > schedule(); > del_singleshot_timer_sync(&timer); > destroy_timer_on_stack(&timer); > return; > } > > So, in both cases, it pretty much calls schedule() just about > immediately. And when schedule() it's called, the calling process -- > which would be out UDP receiver-- goes to sleep. > > The difference is that, in case of MAX_SCHEDULE_TIMEOUT, it does not > arrange for anyone to wakeup the thread that is going to sleep. In > theory, it could even be stuck forever... Of course, this depends on > whether the receiver thread is on a runqueue or not, if (in case it's > not) if it's status is TASK_INTERRUPTIBLE OR TASK_UNINTERRUPTIBLE, > etc., and, in prractice, it never happens! :-D > > In this case, I think we take the other branch (the one 'with > timeout'). But even if we would take this one, I would expect the > receiver thread to not be on any runqueue, but yet to be (either in > interruptible or not state) in a blocking list from where it is taken > out when a packet arrives. > > In case of anything different than MAX_SCHEDULE_TIMEOUT, all the above > is still true, but a timer is set before calling schedule() and putting > the thread to sleep. 
This means that, in case nothing that would wakeup > such thread happens, or in case it hasn't happened yet when the timeout > expires, the thread is woken up by the timer. Right. > > And in fact, schedule_timeout() is not a different way, with respect to > just calling schedule(), to going to sleep. It is the way you go to > sleep for at most some amount of time... But in all cases, you just and > immediately go to sleep! > > And I also am not sure I see where all that discussion you've had with > George about IPIs fit into this all... The IPI that will trigger the > call to schedule() that will actually put back to execution the thread > that we're putting to sleep in here (i.e., the receiver), happens when > the sender manages to send a packet (actually, when the packet arrives, > I think) _or_ when the timer expires. The IPI were observed when SMT was exposed to the guest. That is because the Linux scheduler put both applications on the same CPU - udp_sender and udp_receiver. Which meant that the 'schedule' call would immediately pick the next application (udp_sender) and schedule it (and send an IPI to itself to do that). > > The two possible calls to schedule() in schedule_timeout() behave > exactly in the same way, and I don't think having a timeout or not is > responsible for any particular behavior. Correct. The quirk was that if the applications were on seperate CPUs - the "thread [would be] woken up by the timer". While if they were on the same CPU - the scheduler would pick the next application on the run-queue (which coincidentally was the UDP sender - or receiver). > > What I think it's happening is this: when such a call to schedule() > (from inside schedule_timeout(), I mean) is made what happens is that > the receiver task just goes to sleep, and another one, perhaps the > sender, is executed. The sender sends the packet, which arrives before > the timeout, and the receiver is woken up. Yes! > > *Here* is where an IPI should or should not happen, depending on where > our receiver task is going to be executed! And where would that be? > Well, that depends on the Linux scheduler's load balancer, the behavior > of which is controlled by scheduling domains flags like BALANCE_FORK, > BALANCE_EXEC, BALANCE_WAKE, WAKE_AFFINE and PREFER_SIBLINGS (and > others, but I think these are the most likely ones to be involved > here). Probably. > > So, in summary, where the receiver executes when it wakes up on what is > the configuration of such flags in the (various) scheduling domain(s). > Check, for instance, this path: > > try_to_wakeu_up() --> select_task_irq() --> select_task_rq_fair() > > The reason why the tests 'reacts' to topology changes is that which set > of flags is used for the various scheduling domains is, during the time > the scheduling domains themselves are created and configured-- depends > on topology... So it's quite possible that exposing the SMT topology, > wrt to not doing so, makes one of the flag flip in a way which makes > the benchmark work better. /me nods. > > If you play with the flags above (or whatever they equivalents were in > 2.6.39) directly, even without exposing the SMT-topology, I'm quite > sure you would be able to trigger the same behavior. I did. And that was the work-around - echo 4xyz flag in the SysFS domain and suddenly things go much faster. . 
> > Regards, > Dario > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > ^ permalink raw reply [flat|nested] 22+ messages in thread
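For completeness: the work-around Konrad describes amounts to rewriting the per-CPU flags files shown in the sched_domains attachment earlier in the thread, i.e. something along the lines of "for f in /proc/sys/kernel/sched_domain/cpu*/domain0/flags; do echo 4655 > $f; done" inside the 2.6.39 guest (4655 being the value discussed above). Whether those files are writable at all, and whether domain0 is the right domain to touch, depends on the kernel version and on it being built with SCHED_DEBUG, so this is a debugging aid for confirming the hypothesis rather than a fix.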
* Re: schedulers and topology exposing questions 2016-01-22 16:54 schedulers and topology exposing questions Elena Ufimtseva 2016-01-22 17:29 ` Dario Faggioli 2016-01-26 11:21 ` George Dunlap @ 2016-01-27 14:01 ` Dario Faggioli 2016-01-28 18:51 ` Elena Ufimtseva 2 siblings, 1 reply; 22+ messages in thread From: Dario Faggioli @ 2016-01-27 14:01 UTC (permalink / raw) To: Elena Ufimtseva, xen-devel, george.dunlap, konrad.wilk, joao.m.martins, boris.ostrovsky [-- Attachment #1.1: Type: text/plain, Size: 14426 bytes --] On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote: > Hello all! > Hey, here I am again, > Konrad came up with a workaround that was setting the flag for domain > scheduler in linux > As the guest is not aware of SMT-related topology, it has a flat > topology initialized. > Kernel has domain scheduler flags for scheduling domain CPU set to > 4143 for 2.6.39. > Konrad discovered that changing the flag for CPU sched domain to 4655 > So, as you've seen, I also have been up to doing quite a few of benchmarking doing soemthing similar (I used more recent kernels, and decided to test 4131 as flags. In your casse, according to this: http://lxr.oss.org.cn/source/include/linux/sched.h?v=2.6.39#L807 4655 means: SD_LOAD_BALANCE | SD_BALANCE_EXEC | SD_BALANCE_WAKE | SD_PREFER_LOCAL | [*] SD_SHARE_PKG_RESOURCES | SD_SERIALIZE and another bit (0x4000) that I don't immediately see what it is. Things have changed a bit since then, it appears. However, I'm quite sure I've tested turning on SD_SERIALIZE in 4.2.0 and 4.3.0, and results were really pretty bad (as you also seem to say later). > works as a workaround and makes Linux think that the topology has SMT > threads. > Well, yes and no. :-). I don't want to make this all a terminology bunfight, something that also matters here is how many scheduling domains you have. To check that (although in recent kernels) you check here: ls /proc/sys/kernel/sched_domain/cpu2/ (any cpu is ok) and see how many domain[0-9] you have. On baremetal, on an HT cpu, I've got this: $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name SMT MC So, two domains, one of which is the SMT one. If you check their flags, they're different: $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags 4783 559 So, yes, you are right in saying that 4655 is related to SMT. In fact, it is what (among other things) tells the load balancer that *all* the cpus (well, all the scheduling groups, actually) in this domain are SMT siblings... Which is a legitimate thing to do, but it's not what happens on SMT baremetal. At least is consistent, IMO. I.e., it still creates a pretty flat topology, like there was a big core, of which _all_ the vcpus are part of, as SMT siblings. The other option (the one I'm leaning toward) was too get rid of that one flag. I've only done preliminary experiments with it on and off, and the ones with it off were better looking, so I did keep it off for the big run... but we can test with it again. > This workaround makes the test to complete almost in same time as on > baremetal (or insignificantly worse). > > This workaround is not suitable for kernels of higher versions as we > discovered. > There may be more than one reason for this (as said, a lot changed!) but it matches what I've found when SD_SERIALIZE was kept on for the scheduling domain where all the vcpus are. 
> The hackish way of making domU linux think that it has SMT threads > (along with matching cpuid) > made us thinks that the problem comes from the fact that cpu topology > is not exposed to > guest and Linux scheduler cannot make intelligent decision on > scheduling. > As said, I think it's the other way around: we expose too much of it (and this is more of an issue for PV rather than for HVM). Basically, either you do the pinning you're doing or, whatever you expose, will be *wrong*... and the only way to expose not wrong data is to actually don't expose anything! :-) > The test described above was labeled as IO-bound test. > > We have run io-bound test with and without smt-patches. The > improvement comparing > to base case (no smt patches, flat topology) shows 22-23% gain. > I'd be curious to see the content of the /proc/sys/kernel/sched_domain directory and subdirectories with Joao's patches applied. > While we have seen improvement with io-bound tests, the same did not > happen with cpu-bound workload. > As cpu-bound test we use kernel module which runs requested number of > kernel threads > and each thread compresses and decompresses some data. > That is somewhat what I would have expected, although up to what extent, it's hard to tell in advance. It also matches my findings, both for the results I've already shared on list, and for others that I'll be sharing in a bit. > Here is the setup for tests: > Intel Xeon E5 2600 > 8 cores, 25MB Cashe, 2 sockets, 2 threads per core. > Xen 4.4.3, default timeslice and ratelimit > Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+. > Dom0: kernel 4.1.0, 2 vcpus, not pinned. > DomU has 8 vcpus (except some cases). > > > For io-bound tests results were better with smt patches applied for > every kernel. > > For cpu-bound test the results were different depending on wether > vcpus were > pinned or not, how many vcpus were assigned to the guest. > Right. In general, this also makes sense... Can we see the actual numbers? I mean the results of the tests with improvements/regressions highlighted, in addition to the traces that you already shared? > Please take a look at the graph captured by xentrace -e 0x0002f000 > On the graphs X is time in seconds since xentrace start, Y is the > pcpu number, > the graph itself represent the event when scheduler places vcpu to > pcpu. > > The graphs #1 & #2: > trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io bound test, > one client/server > trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu bound test, > 8 kernel theads > config: domu, 8vcpus not pinned, smt patches not applied, 2.3.69 > kernel. > Ok, so this is the "baseline", the result of just running your tests in a pretty standard Xen and Dom0 and DomU status and configurations, right? > As can be seen here scheduler places the vcpus correctly on empty > cores. > As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this? > Take a look at > trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png > where I split data per vcpus. > Well, why not, I would say? I mean, where a vcpu starts to run at an arbitrary point in time, especially if the system is otherwise idle before, it can be considered random (it's not, it depends on both the vcpu's and system's previous history, but in a non-linear way, and that is not in the graph anyway). In any case, since there are idle cores, the fact that vcpus do not move much, even if they're not pinned, I consider it a good thing, don't you? 
If vcpuX wakes up on processor Y, where it has always run before, and it find out it can still run there, migrating somewhere else would be pure overhead. The only potential worry of mine about trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png, is that vcpus 4 and 7 (or 4 and 2, colors are too similar to be sure) run for some time (the burst around t=17), on pcpus 5 and 6. Are these two pcpus SMT siblings? Doing the math myself on pCPUs IDs, I don't think they are, so all would be fine. If they are, that should not happen. However, you're using 4.4, so even if you had an issue there, we don't know if it's still in staging. In any case and just to be sure, can you produce the output of `xl vcpu-list', while this case is running? > Now to cpu-bound tests. > When smt patches applied and vcpus pinned correctly to match the > topology and > guest become aware of the topology, cpu-bound tests did not show > improvement with kernel 2.6.39. > With upstream kernel we see some improvements. The tes was repeated 5 > times back to back. > Again, 'some' being? > The number of vcpus was increased to 16 to match the test case where > linux was not > aware of the topology and assumed all cpus as cores. > > On some iterations one can see that vcpus are being scheduled as > expected. > For some runs the vcpus are placed on came core (core/thread) (see > trace_cpu_16vcpus_8threads_5runs.out.plot.err.png). > It doubles the time it takes for test to complete (first three runs > show close to baremetal execution time). > No, sorry, I don't think I fully understood this part. So: 1. can you point me at where (time equal to ?) what you are saying happens? 2. more important, you are saying that the vcpus are pinned. If you pin the vcpus they just should not move. Period. If they move, if's a bug, no matter where they go and what the other SMT sibling of the pcpu where they go is busy or idle! :-O So, are you saying that you pinned the vcpus of the guest and you see them moving and/or not being _always_ scheduled where you pinned them? Can we see `xl vcpu-list' again, to see how they're actually pinned. > END: cycles: 31209326708 (29 seconds) > END: cycles: 30928835308 (28 seconds) > END: cycles: 31191626508 (29 seconds) > END: cycles: 50117313540 (46 seconds) > END: cycles: 49944848614 (46 seconds) > > Since the vcpus are pinned, then my guess is that Linux scheduler > makes wrong decisions? > Ok, so now it seems to me that you agree that the vcpus don't have much alternatives. If yes (which would be of great relief for me :-) ), it could indeed be that indeed the Linux scheduler is working suboptimally. Perhaps it's worth trying running the benchmark inside the guest with the Linux's threads pinned to the vcpus. That should give you perfectly consistent results over all the 5 runs. One more thing. You say you the guest has 16 vcpus, and that there are 8 threads running inside it. However, I seem to be able to identify in the graphs at least a few vertical lines where more than 8 vcpus are running on some pcpu. So, if Linux is working well, and it really only has to place 8 vcpus, it would put them on different cores. However, if at some point in time, there is more than that it has to place, it will have to necessarily 'invade' an already busy core. Am I right in seeing those lines, or are my eyes deceiving me? (I think a per-vcpu breakup of the graph above, like you did for dom0, would help figure this out). > So I ran the test with smt patches enabled, but not pinned vcpus. 
> So I ran the test with the smt patches enabled, but with the vcpus
> not pinned.

AFAICT, this does not make much sense. So, if I understood correctly
what you mean, by doing as you say you're telling Linux that, for
instance, vcpu0 and vcpu1 are SMT siblings, but then Xen is free to run
vcpu0 and vcpu1 at the same time wherever it likes... same core,
different core on the same socket, different socket, etc.

This, I would say, brings us back to the pseudo-random situation we
have by default already, without any patching and any pinning, or just
to a different variant of it.

> The result also shows the same as above (see
> trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png).
> Also see the per-cpu graph
> (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png).

Ok. I'll look at this graph better, with the aim of showing an example
of my theory above, as soon as my brain (which is not in its best shape
today) manages to deal with all the colors (I'm not complaining, BTW,
there's no other way in which you could show this, it's just me! :-D).

> END: cycles: 49740185572 (46 seconds)
> END: cycles: 45862289546 (42 seconds)
> END: cycles: 30976368378 (28 seconds)
> END: cycles: 30886882143 (28 seconds)
> END: cycles: 30806304256 (28 seconds)
>
> I cut the timeslice where it's seen that vcpu0 and vcpu2 run on the
> same core while other cores are idle:
>
> 35v2 9.881103815 7
> 35v0 9.881104013 6
>
> 35v2 9.892746452 7
> 35v0 9.892746546 6   -> vcpu0 gets scheduled right after vcpu2 on the
>                         same core
>
> 35v0 9.904388175 6
> 35v2 9.904388205 7   -> same here
>
> 35v2 9.916029791 7
> 35v0 9.916029992 6

Yes, this, in theory, should not happen. However, our scheduler (like
Linux's, or any other OS's, each perhaps in its own way) can't always
be _instantly_ perfect! In this case, for instance, the SMT load
balancing logic in Credit1 is triggered:
 - from outside of sched_credit.c, by vcpu_migrate(), which is called
   in response to a bunch of events, but _not_ at every vcpu wakeup;
 - from inside sched_credit.c, by csched_vcpu_acct(), if the vcpu has
   been active for a while.

This means it is not triggered upon each and every vcpu wakeup (it
might be, but not for the vcpu that is waking up). So, seeing samples
of a vcpu not being scheduled according to optimal SMT load balancing,
especially right after it woke up, is to be expected. Then, after a
while, the logic should indeed trigger (via csched_vcpu_acct()) and
move the vcpu away to an idle core.

To tell for how long the violation of perfect SMT balancing lasts, and
whether or not it happens as a consequence of task wakeups, we need
more records from the trace file, from around the point where the
violation happens.

Does this make sense to you?

Regards, and thanks for sharing all this! :-)
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
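(For reference, the "pinning to match the topology" discussed in the
exchange above amounts, from dom0, to something like the sketch below,
and `xl vcpu-list' is what confirms it took effect. The guest name and
the SMT sibling numbering, i.e. the second thread of core n being pcpu
n+16 on this 2-socket, 8-core, 2-thread box, are assumptions; the real
layout should be checked first, e.g. with `xl info -n' or `xenpm
get-cpu-topology'.)

    # hypothetical: pair vcpus 2n and 2n+1 of a 16-vcpu guest "domU" on
    # the two SMT threads of core n
    for n in $(seq 0 7); do
        xl vcpu-pin domU $((2 * n))     $n
        xl vcpu-pin domU $((2 * n + 1)) $((n + 16))
    done

    # every vcpu should now show a single pcpu in its affinity and,
    # per the discussion above, never be scheduled anywhere else
    xl vcpu-list domU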
* Re: schedulers and topology exposing questions
  2016-01-27 14:01 ` Dario Faggioli
@ 2016-01-28 18:51 ` Elena Ufimtseva
  0 siblings, 0 replies; 22+ messages in thread
From: Elena Ufimtseva @ 2016-01-28 18:51 UTC (permalink / raw)
To: Dario Faggioli; +Cc: george.dunlap, joao.m.martins, boris.ostrovsky, xen-devel

On Wed, Jan 27, 2016 at 02:01:35PM +0000, Dario Faggioli wrote:
> On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote:
> > Hello all!
>
> Hey, here I am again,
>
> > Konrad came up with a workaround that was setting the flag for the
> > domain scheduler in Linux.
> > As the guest is not aware of SMT-related topology, it has a flat
> > topology initialized.
> > The kernel has the domain scheduler flags for the CPU scheduling
> > domain set to 4143 for 2.6.39.
> > Konrad discovered that changing the flags for the CPU sched domain
> > to 4655
>
> So, as you've seen, I have also been doing quite a bit of benchmarking
> on something similar (I used more recent kernels, and decided to test
> 4131 as the flags value).
>
> In your case, according to this:
> http://lxr.oss.org.cn/source/include/linux/sched.h?v=2.6.39#L807
>
> 4655 means:
>   SD_LOAD_BALANCE |
>   SD_BALANCE_EXEC |
>   SD_BALANCE_WAKE |
>   SD_PREFER_LOCAL |
>   SD_SHARE_PKG_RESOURCES |
>   SD_SERIALIZE
>
> and another bit (0x4000) that I don't immediately see what it is.
>
> Things have changed a bit since then, it appears. However, I'm quite
> sure I've tested turning on SD_SERIALIZE in 4.2.0 and 4.3.0, and the
> results were really pretty bad (as you also seem to say later).
>
> > works as a workaround and makes Linux think that the topology has
> > SMT threads.
>
> Well, yes and no. :-) I don't want to make this all a terminology
> bunfight; something that also matters here is how many scheduling
> domains you have.
>
> To check that (at least in recent kernels) you look here:
>
>   ls /proc/sys/kernel/sched_domain/cpu2/   (any cpu is ok)
>
> and see how many domain[0-9] you have.
>
> On baremetal, on an HT cpu, I've got this:
>
>   $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
>   SMT
>   MC
>
> So, two domains, one of which is the SMT one. If you check their
> flags, they're different:
>
>   $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags
>   4783
>   559
>
> So, yes, you are right in saying that 4655 is related to SMT. In fact,
> it is what (among other things) tells the load balancer that *all* the
> cpus (well, all the scheduling groups, actually) in this domain are
> SMT siblings... which is a legitimate thing to do, but it's not what
> happens on SMT baremetal.
>
> At least it is consistent, IMO. I.e., it still creates a pretty flat
> topology, as if there were one big core of which _all_ the vcpus are
> part, as SMT siblings.
>
> The other option (the one I'm leaning toward) was to get rid of that
> one flag. I've only done preliminary experiments with it on and off,
> and the ones with it off looked better, so I kept it off for the big
> run... but we can test with it again.
>
> > This workaround makes the test complete in almost the same time as
> > on baremetal (or insignificantly worse).
> >
> > This workaround is not suitable for kernels of higher versions, as
> > we discovered.
>
> There may be more than one reason for this (as said, a lot has
> changed!) but it matches what I've found when SD_SERIALIZE was kept on
> for the scheduling domain where all the vcpus are.
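(For completeness, the same inspection can be done inside the guest,
and the numeric flags values that keep coming up in this thread can be
decoded with a rough helper like the one below. The bit assignments
are copied, as an assumption, from a 2.6.39-era include/linux/sched.h;
they get reshuffled between kernel releases, so they should be
re-checked against the exact tree being run.)

    # inside the guest: which scheduling domains were built, and their flags
    grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name \
           /proc/sys/kernel/sched_domain/cpu0/domain*/flags

    # decode a flags value into SD_* names (bit order assumed from 2.6.39)
    decode_sd_flags() {
        val=$1; bit=1
        for name in SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC \
                    SD_BALANCE_FORK SD_BALANCE_WAKE SD_WAKE_AFFINE \
                    SD_PREFER_LOCAL SD_SHARE_CPUPOWER SD_POWERSAVINGS_BALANCE \
                    SD_SHARE_PKG_RESOURCES SD_SERIALIZE SD_ASYM_PACKING \
                    SD_PREFER_SIBLING; do
            [ $((val & bit)) -ne 0 ] && printf '%s ' "$name"
            bit=$((bit << 1))
        done
        echo
    }

    for v in 4143 4655 4783 559; do    # values mentioned in this thread
        printf '%5s: ' "$v"; decode_sd_flags "$v"
    done

On the kernels discussed here the flags files are writable, which is
presumably how the 4143 -> 4655 change was applied at runtime.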
[...]

> > So I ran the test with the smt patches enabled, but with the vcpus
> > not pinned.
>
> AFAICT, this does not make much sense. So, if I understood correctly
> what you mean, by doing as you say you're telling Linux that, for
> instance, vcpu0 and vcpu1 are SMT siblings, but then Xen is free to
> run vcpu0 and vcpu1 at the same time wherever it likes... same core,
> different core on the same socket, different socket, etc.

Correct. I did run this to see what happens in this pseudo-random case.

[...]

> Ok. I'll look at this graph better, with the aim of showing an example
> of my theory above, as soon as my brain (which is not in its best
> shape today) manages to deal with all the colors (I'm not complaining,
> BTW, there's no other way in which you could show this, it's just
> me! :-D).

At the same time, if you think I can improve the data representation,
it will be awesome!

[...]

> Does this make sense to you?

Dario, thanks for the explanations. I am going to verify some numbers,
and I am also collecting more trace data. I will send it shortly;
sorry for the delay.
Elena