From: George Dunlap <george.dunlap@citrix.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Elena Ufimtseva <elena.ufimtseva@oracle.com>,
	george.dunlap@eu.citrix.com, dario.faggioli@citrix.com,
	xen-devel@lists.xen.org, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com
Subject: Re: schedulers and topology exposing questions
Date: Wed, 27 Jan 2016 15:10:01 +0000
Message-ID: <56A8DDC9.4080307@citrix.com>
In-Reply-To: <20160127143303.GA1094@char.us.oracle.com>

On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote:
>> On 22/01/16 16:54, Elena Ufimtseva wrote:
>>> Hello all!
>>>
>>> Dario, George, or anyone else, your help will be appreciated.
>>>
>>> Let me give a short introduction to our findings. I may forget something or not be
>>> explicit enough, so please ask me.
>>>
>>> A customer filed a bug report saying that some of their applications were running slowly in their HVM DomU setups.
>>> The running times were compared against bare metal running the same kernel version as the HVM DomU.
>>>
>>> After some investigation by different parties, a test case was found
>>> in which the problem was easily seen. The test app is a UDP server/client pair where the
>>> client passes a message n number of times.
>>> The test case was executed on bare metal and in a Xen DomU, both with kernel version 2.6.39.
>>> Bare metal showed a 2x better result than the DomU.
>>>
>>> Konrad came up with a workaround: setting the scheduling-domain flags in the Linux guest.
>>> As the guest is not aware of the SMT-related topology, it initializes a flat topology.
>>> In 2.6.39 the kernel has the flags for the CPU scheduling domain set to 4143.
>>> Konrad discovered that changing the flags for the CPU scheduling domain to 4655
>>> works as a workaround and makes Linux think that the topology has SMT threads.
>>> This workaround makes the test complete in almost the same time as on bare metal (or insignificantly worse).
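As an aside: decoding those two numbers against the SD_* flag definitions as
I remember them from a 2.6.39-era include/linux/sched.h (worth re-checking
against the exact kernel the customer runs), the single bit the workaround
adds is 0x200, which would be SD_SHARE_PKG_RESOURCES on that domain.  A quick
sketch of the decode -- the flag values here are my assumption, not something
taken from the bug report:

/* Decode the two sched-domain flag values quoted above.  The SD_* values
 * below are what I believe a ~2.6.39 include/linux/sched.h defines; if the
 * customer's kernel differs, so does this decode. */
#include <stdio.h>

#define SD_LOAD_BALANCE         0x0001
#define SD_BALANCE_NEWIDLE      0x0002
#define SD_BALANCE_EXEC         0x0004
#define SD_BALANCE_FORK         0x0008
#define SD_BALANCE_WAKE         0x0010
#define SD_WAKE_AFFINE          0x0020
#define SD_PREFER_LOCAL         0x0040
#define SD_SHARE_CPUPOWER       0x0080
#define SD_POWERSAVINGS_BALANCE 0x0100
#define SD_SHARE_PKG_RESOURCES  0x0200
#define SD_SERIALIZE            0x0400
#define SD_ASYM_PACKING         0x0800
#define SD_PREFER_SIBLING       0x1000

int main(void)
{
    unsigned int stock      = 4143;   /* 0x102f, the default quoted above     */
    unsigned int workaround = 4655;   /* 0x122f, the value from the workaround */

    /* XOR shows which bits the workaround flips: 0x200 here. */
    printf("bits changed by the workaround: 0x%x\n", stock ^ workaround);
    return 0;
}

IIRC on kernels of that vintage the value could be poked at runtime through
/proc/sys/kernel/sched_domain/cpu*/domain*/flags with SCHED_DEBUG enabled,
which I assume is how the workaround was applied.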
>>>
>>> As we discovered, this workaround is not suitable for newer kernel versions.
>>>
>>> The hackish way of making DomU Linux think that it has SMT threads (along with a matching cpuid)
>>> made us think that the problem comes from the fact that the CPU topology is not exposed to the
>>> guest, so the Linux scheduler cannot make intelligent scheduling decisions.
>>>
>>> Joao Martins from Oracle developed a set of patches that fixes the SMT/core/cache
>>> topology numbering and provides matching pinning of vCPUs and enabling options,
>>> allowing the correct topology to be exposed to the guest.
>>> I guess Joao will be posting them at some point.
>>>
>>> With these patches we decided to test the performance impact on different kernel and Xen versions.
>>>
>>> The test described above was labeled as an IO-bound test.
>>
>> So just to clarify: The client sends a request (presumably not much more
>> than a ping) to the server, and waits for the server to respond before
>> sending another one; and the server does the reverse -- receives a
>> request, responds, and then waits for the next request.  Is that right?
> 
> Yes.
>>
>> How much data is transferred?
> 
> 1 packet, UDP
>>
>> If the amount of data transferred is tiny, then the bottleneck for the
>> test is probably the IPI time, and I'd call this a "ping-pong"
>> benchmark[1].  I would only call this "io-bound" if you're actually
>> copying large amounts of data.
> 
> What we found is that on bare metal the scheduler would put both apps
> on the same CPU and schedule them right after each other. This would
> have a high IPI rate, as the scheduler would poke itself.
> On Xen it would put the two applications on separate CPUs - and there
> would be hardly any IPI.
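OK -- so the workload is essentially a single-datagram ping-pong.  For the
sake of having something concrete to point at, the shape of it as I picture
it is roughly the sketch below; the port, message size and iteration count
are my own invention, not taken from the customer's test, and the server
side is just the matching recvfrom()/sendto() echo loop:

/* Minimal UDP ping-pong client sketch (assumed shape of the workload). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define ITERATIONS 1000000L   /* "n number of times" -- value made up */
#define MSG_SIZE   64         /* one small datagram per round trip    */

int main(void)
{
    char buf[MSG_SIZE] = "ping";
    struct sockaddr_in srv = { 0 };
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    srv.sin_family = AF_INET;
    srv.sin_port = htons(12345);              /* port chosen arbitrarily */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);

    for (long i = 0; i < ITERATIONS; i++) {
        /* One small datagram out... */
        if (sendto(fd, buf, MSG_SIZE, 0,
                   (struct sockaddr *)&srv, sizeof(srv)) < 0) {
            perror("sendto");
            return 1;
        }
        /* ...then block until the echo comes back.  Almost all of the
         * elapsed time is wakeup/scheduling latency, not data copying. */
        if (recv(fd, buf, MSG_SIZE, 0) < 0) {
            perror("recv");
            return 1;
        }
    }
    close(fd);
    return 0;
}

With only one small datagram in flight at a time, the round-trip rate is
dominated by how quickly the peer gets woken up, which is why I'd call this
a ping-pong rather than an io-bound test.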

Sorry -- why would the scheduler send itself an IPI if it's on the same
logical cpu (which seems pretty pointless), but *not* send an IPI to the
*other* processor when it was actually waking up another task?

Or do you mean high context switch rate?

> Digging deeper into the code, I found out that if you do a UDP sendmsg
> without any timeouts, it puts the message in a queue and just calls schedule().

You mean, it would mark the other process as runnable somehow, but not
actually send an IPI to wake it up?  Is that a new "feature" designed
for large systems, to reduce the IPI traffic or something?

> On bare metal the schedule() would result in the scheduler picking up the
> other task and starting it - which would dequeue the packet immediately.
> 
> On Xen - the schedule() would go HLT... and then later be woken up by the
> VIRQ_TIMER. And since the two applications were on separate CPUs - the
> single packet would just stick in the queue until the VIRQ_TIMER arrived.

I'm not sure I understand the situation right, but it sounds a bit like
what you're seeing is just a quirk of the fact that Linux doesn't always
send IPIs to wake other processes up (either by design or by accident),
but relies on scheduling timers to check for work to do.  Presumably
they knew that low performance on ping-pong workloads might be a
possibility when they wrote the code that way; I don't see a reason why
we should try to work around that in Xen.
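That said, if it would be useful to confirm the diagnosis, forcing the client
and server onto the same vCPU (and then onto different vCPUs) inside the
guest ought to reproduce both behaviours on demand.  Something along these
lines -- purely illustrative, and taskset -c <n> on each process would do the
same job:

/* Pin ourselves to one vCPU, then exec the ping-pong client or server.
 * Run both halves with the same CPU number to mimic the bare-metal
 * placement, or with different numbers to mimic what Xen ends up doing. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    cpu_set_t set;

    if (argc < 3) {
        fprintf(stderr, "usage: %s <cpu> <command> [args...]\n", argv[0]);
        return 1;
    }

    CPU_ZERO(&set);
    CPU_SET(atoi(argv[1]), &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Hand over to the actual test binary given on the command line. */
    execvp(argv[2], &argv[2]);
    perror("execvp");
    return 1;
}

(The binary names would be whatever the test pair is actually called.)  If
the same-vCPU case runs close to bare metal and the split-vCPU case sits
waiting for VIRQ_TIMER, that would at least confirm where the latency is
coming from.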

 -George

