From: Dario Faggioli
Subject: Re: schedulers and topology exposing questions
Date: Thu, 28 Jan 2016 09:46:46 +0000
Message-ID: <1453974406.26691.5.camel@citrix.com>
In-Reply-To: <20160127160331.GA26453@elena.ufimtseva>
References: <20160122165423.GA8595@elena.ufimtseva>
 <56A756C0.20501@citrix.com>
 <20160127143303.GA1094@char.us.oracle.com>
 <56A8DDC9.4080307@citrix.com>
 <20160127152701.GF552@char.us.oracle.com>
 <20160127160331.GA26453@elena.ufimtseva>
To: Elena Ufimtseva, Konrad Rzeszutek Wilk
Cc: george.dunlap@eu.citrix.com, joao.m.martins@oracle.com, George Dunlap, boris.ostrovsky@oracle.com, xen-devel@lists.xen.org

On Wed, 2016-01-27 at 11:03 -0500, Elena Ufimtseva wrote:
> On Wed, Jan 27, 2016 at 10:27:01AM -0500, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote:
> > > On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote:
> > > > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote:
> > > > > On 22/01/16 16:54, Elena Ufimtseva wrote:
> > > > > > Hello all!
> > > > > >
> > > > > > Dario, George or anyone else, your help will be appreciated.
> > > > > >
> > > > > > Let me give some introduction to our findings. I may forget something or not put it explicitly enough, so please ask me.
> > > > > >
> > > > > > A customer filed a bug where some of their applications were running slowly in their HVM DomU setups. The running times were compared against bare metal running the same kernel version as the HVM DomU.
> > > > > >
> > > > > > After some investigation by different parties, a test case was found where the problem was easily seen. The test app is a UDP server/client pair where the client passes a message n number of times. The test case was executed on bare metal and on a Xen DomU, both with kernel version 2.6.39. Bare metal showed a 2x better result than DomU.
> > > > > >
> > > > > > Konrad came up with a workaround that sets the flags of the CPU scheduling domain in Linux. As the guest is not aware of SMT-related topology, it initializes a flat topology, and the kernel's scheduling-domain flags for the CPU domain end up as 4143 on 2.6.39. Konrad discovered that changing the CPU sched-domain flags to 4655 works as a workaround and makes Linux think that the topology has SMT threads. This workaround makes the test complete in almost the same time as on bare metal (or insignificantly worse).
> > > > > >
> > > > > > This workaround is not suitable for kernels of higher versions, as we discovered.
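For reference, the two flag values just mentioned differ in exactly one bit. A minimal decoder is sketched below; the flag names and bit positions are assumptions taken from 2.6.39-era include/linux/sched.h and should be verified against the exact tree in use (under those definitions, the extra bit in 4655 would be SD_SHARE_PKG_RESOURCES).

/* Decode the sched-domain flag values mentioned above (4143 vs 4655).
 * The names/values below are assumptions based on 2.6.39-era
 * include/linux/sched.h; check them against the kernel actually in use. */
#include <stdio.h>

struct sd_flag { unsigned int bit; const char *name; };

static const struct sd_flag flags[] = {
    { 0x0001, "SD_LOAD_BALANCE" },
    { 0x0002, "SD_BALANCE_NEWIDLE" },
    { 0x0004, "SD_BALANCE_EXEC" },
    { 0x0008, "SD_BALANCE_FORK" },
    { 0x0010, "SD_BALANCE_WAKE" },
    { 0x0020, "SD_WAKE_AFFINE" },
    { 0x0040, "SD_PREFER_LOCAL" },
    { 0x0080, "SD_SHARE_CPUPOWER" },
    { 0x0100, "SD_POWERSAVINGS_BALANCE" },
    { 0x0200, "SD_SHARE_PKG_RESOURCES" },
    { 0x0400, "SD_SERIALIZE" },
    { 0x0800, "SD_ASYM_PACKING" },
    { 0x1000, "SD_PREFER_SIBLING" },
};

static void decode(unsigned int val)
{
    size_t i;

    printf("%u (0x%x):", val, val);
    for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++)
        if (val & flags[i].bit)
            printf(" %s", flags[i].name);
    printf("\n");
}

int main(void)
{
    decode(4143);          /* default CPU-domain flags seen in the report */
    decode(4655);          /* the workaround value                        */
    decode(4143 ^ 4655);   /* the single differing bit                    */
    return 0;
}

On kernels built with CONFIG_SCHED_DEBUG these flags show up under /proc/sys/kernel/sched_domain/cpu*/domain*/flags, which is presumably where the workaround value was applied.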
> > > > > > The hackish way of making the DomU Linux think that it has SMT threads (along with a matching CPUID) made us think that the problem comes from the fact that the CPU topology is not exposed to the guest, so the Linux scheduler cannot make intelligent scheduling decisions.
> > > > > >
> > > > > > Joao Martins from Oracle developed a set of patches that fix the SMT/core/cache topology numbering and provide matching pinning of vCPUs and enabling options, allowing the correct topology to be exposed to the guest. I guess Joao will be posting them at some point.
> > > > > >
> > > > > > With these patches we decided to test the performance impact on different kernel versions and Xen versions.
> > > > > >
> > > > > > The test described above was labeled as an IO-bound test.
> > > > >
> > > > > So just to clarify: the client sends a request (presumably not much more than a ping) to the server, and waits for the server to respond before sending another one; and the server does the reverse -- receives a request, responds, and then waits for the next request. Is that right?
> > > >
> > > > Yes.
> > > > >
> > > > > How much data is transferred?
> > > >
> > > > 1 packet, UDP.
> > > > >
> > > > > If the amount of data transferred is tiny, then the bottleneck for the test is probably the IPI time, and I'd call this a "ping-pong" benchmark[1]. I would only call this "io-bound" if you're actually copying large amounts of data.
> > > >
> > > > What we found is that on bare metal the scheduler would put both apps on the same CPU and schedule them right after each other. This would show a high IPI count, as the scheduler would poke itself. On Xen it would put the two applications on separate CPUs - and there would be hardly any IPIs.
> > >
> > > Sorry -- why would the scheduler send itself an IPI if it's on the same logical cpu (which seems pretty pointless), but *not* send an IPI to the *other* processor when it was actually waking up another task?
> > >
> > > Or do you mean a high context switch rate?
> >
> > Yes, very high.
> > >
> > > > Digging deeper into the code I found out that if you do a UDP sendmsg without any timeout, it would put the message in a queue and just call schedule().
> > >
> > > You mean, it would mark the other process as runnable somehow, but not actually send an IPI to wake it up? Is that a new "feature" designed
> >
> > Correct - because the other process was not on its vCPU's runqueue.
> >
> > > for large systems, to reduce the IPI traffic or something?
> >
> > This is just the normal Linux scheduler. The only way it would send an IPI to the other CPU was if the UDP message had a timeout. The default timeout is infinite, so it didn't bother to send an IPI.
> >
> > > > On bare metal the schedule() would result in the scheduler picking up the other task and starting it, which would dequeue immediately.
> > > >
> > > > On Xen, the schedule() would go to HLT.. and then later be woken up by the VIRQ_TIMER. And since the two applications were on separate CPUs, the single packet would just stick in the queue until the VIRQ_TIMER arrived.
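To make the workload concrete, here is a minimal sketch of this kind of UDP ping-pong pair. It is purely illustrative (message size, port handling and iteration count are made up, not the customer's actual test), but it shows the pattern discussed above: the client sends one small datagram and then blocks in recvfrom() until the echo comes back, which is exactly where the sender ends up in schedule().

/* Minimal UDP ping-pong sketch - illustrative only, not the original test.
 *   server: ./pingpong server <port>
 *   client: ./pingpong client <server-ip> <port> <iterations>
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[64] = "ping";
    struct sockaddr_in addr;
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    if (s < 0) {
        perror("socket");
        return 1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;

    if (argc >= 3 && strcmp(argv[1], "server") == 0) {
        struct sockaddr_in peer;
        socklen_t plen;

        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(atoi(argv[2]));
        if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            return 1;
        }
        for (;;) {  /* echo every datagram straight back to its sender */
            ssize_t n;

            plen = sizeof(peer);
            n = recvfrom(s, buf, sizeof(buf), 0,
                         (struct sockaddr *)&peer, &plen);
            if (n > 0)
                sendto(s, buf, n, 0, (struct sockaddr *)&peer, plen);
        }
    } else if (argc >= 5 && strcmp(argv[1], "client") == 0) {
        long i, iters = atol(argv[4]);

        addr.sin_port = htons(atoi(argv[3]));
        if (inet_pton(AF_INET, argv[2], &addr.sin_addr) != 1) {
            fprintf(stderr, "bad address: %s\n", argv[2]);
            return 1;
        }
        for (i = 0; i < iters; i++) {
            /* send one message, then block until the reply arrives */
            sendto(s, buf, sizeof(buf), 0,
                   (struct sockaddr *)&addr, sizeof(addr));
            recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);
        }
    } else {
        fprintf(stderr, "usage: %s server <port> | client <ip> <port> <n>\n",
                argv[0]);
        return 1;
    }
    close(s);
    return 0;
}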
> > >
> > > I'm not sure I understand the situation right, but it sounds a bit like what you're seeing is just a quirk of the fact that Linux doesn't always send IPIs to wake other processes up (either by design or by accident),
> >
> > It does and it does not :-)
> >
> > > but relies on scheduling timers to check for work to do. Presumably
> >
> > It .. I am not explaining it well. The Linux kernel scheduler, when schedule() is called (from the UDP sendmsg), will either pick the next application and do a context switch or, if there is none, go to sleep. [Kind of - it may also send an IPI to the other CPU if requested, but that requires some hints from the underlying layers.] Since there were only two apps on the runqueue, the UDP sender and the UDP receiver, it would run them back-to-back (this is on bare metal).
> >
> > However, if SMT was not exposed, the Linux kernel scheduler would put those apps on separate per-CPU runqueues, meaning each CPU only had one app on its runqueue.
> >
> > Hence there was no need to do a context switch. [Unless you modified the UDP message to have a timeout; then it would send an IPI.]
> >
> > > they knew that low performance on ping-pong workloads might be a possibility when they wrote the code that way; I don't see a reason why we should try to work around that in Xen.
> >
> > Which is not what I am suggesting.
> >
> > Our first idea was: since this is a Linux kernel scheduler characteristic, let us give the guest all the information it needs to handle it. That is, make it look as much like bare metal as possible - and that is where the vCPU pinning and the exposing of SMT information came about. That (Elena, please correct me if I am wrong) did indeed show that the guest was doing what we expected.
> >
> > But naturally that requires pinning and all that - and while it is a useful case for those who have the vCPUs to spare and can do it, it is not a general use case.
> >
> > So Elena started looking at the CPU-bound case, seeing how Xen behaves there and whether we can improve the floating (unpinned) situation, as she saw some abnormal behaviour.
>
> Maybe it's normal? :)
>
> While having satisfactory results with the ping-pong test, and having Joao's SMT patches in hand, we decided to try a CPU-bound workload. And that is where exposing SMT does not work that well. I mostly refer here to the case where two vCPUs are placed on the same core while there are idle cores.
>
> This, I think, is what Dario is asking me for more details about in another reply, and I am going to answer his questions.
>
Yes, exactly. We need to see more trace entries around the ones where we see the vcpus being placed on SMT-siblings.

You are welcome to send, or upload somewhere, the full trace, and I'll have a look myself as soon as I can. :-)

> > I do not see any way to fix the UDP single-message mechanism except by modifying the Linux kernel scheduler - and indeed it looks like later kernels modified their behavior. Also, doing the vCPU pinning and SMT exposing did not hurt in those cases (Elena?).
>
> Yes, the drastic performance differences with bare metal were only observed with the 2.6.39-based kernel.
>
> For this ping-pong UDP test, exposing the SMT topology to kernels of higher versions did help: tests show about a 20 percent performance improvement compared to the tests where the SMT topology is not exposed. This assumes that SMT exposure goes along with pinning.
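Whether a guest actually sees SMT siblings (with or without the topology patches applied) can be double-checked from inside the DomU via the standard sysfs topology nodes. Below is a small sketch of such a check; the paths are standard Linux sysfs, but the exact format of thread_siblings_list varies a bit between kernel versions, and "lscpu" or /proc/cpuinfo give the same information.

/* Print the CPU topology the (guest) kernel believes it has, using the
 * standard sysfs topology nodes. Illustrative only. */
#include <stdio.h>
#include <string.h>

static int read_node(int cpu, const char *node, char *out, size_t len)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, node);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(out, (int)len, f))
        out[0] = '\0';
    fclose(f);
    out[strcspn(out, "\n")] = '\0';   /* strip trailing newline */
    return 0;
}

int main(void)
{
    char core[64], siblings[64];
    int cpu;

    for (cpu = 0; ; cpu++) {
        if (read_node(cpu, "core_id", core, sizeof(core)) < 0)
            break;   /* assume no more online CPUs */
        if (read_node(cpu, "thread_siblings_list",
                      siblings, sizeof(siblings)) < 0)
            break;
        /* with SMT exposed, thread_siblings_list has more than one entry */
        printf("cpu%d: core_id=%s thread_siblings=%s\n",
               cpu, core, siblings);
    }
    return 0;
}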
>
> kernel.
> hypervisor.

:-D :-D :-D

Regards,
Dario
--
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)