From: Dario Faggioli
Subject: Re: Xen on ARM IRQ latency and scheduler overhead
Date: Fri, 10 Feb 2017 09:40:22 +0100
Message-ID: <1486716022.3042.112.camel@citrix.com>
To: Stefano Stabellini, xen-devel@lists.xen.org
Cc: george.dunlap@eu.citrix.com, edgar.iglesias@xilinx.com, julien.grall@arm.com
List-Id: xen-devel@lists.xenproject.org

On Thu, 2017-02-09 at 16:54 -0800, Stefano Stabellini wrote:
> Hi all,
>
Hi,

> I have run some IRQ latency measurements on Xen on ARM on a Xilinx
> ZynqMP board (four Cortex A53 cores, GICv2).
>
> Dom0 has 1 vcpu pinned to cpu0, DomU has 1 vcpu pinned to cpu2.
> Dom0 is Ubuntu. DomU is an ad-hoc baremetal app to measure interrupt
> latency: https://github.com/edgarigl/tbm
>
Right, interesting use case. I'm glad to see there's some interest in
it, and I'm happy to help investigate and try to make things better.

> I modified the app to use the phys_timer instead of the virt_timer.
> You can build it with:
>
> make CFG=configs/xen-guest-irq-latency.cfg
>
Ok, do you (or anyone else) mind explaining in a bit more detail what
the app tries to measure and how it does that?
As a matter of fact, I'm quite familiar with the scenario (I've spent a
lot of time playing with cyclictest,
https://rt.wiki.kernel.org/index.php/Cyclictest ), but I don't
immediately understand how the timer is programmed, what is supposed to
be in the various variables/registers, what 'freq' actually is, etc.

> These are the results, in nanosec:
>
>                         AVG     MIN     MAX     WARM MAX
>
> NODEBUG no WFI          1890    1800    3170    2070
> NODEBUG WFI             4850    4810    7030    4980
> NODEBUG no WFI credit2  2217    2090    3420    2650
> NODEBUG WFI credit2     8080    7890    10320   8300
>
> DEBUG no WFI            2252    2080    3320    2650
> DEBUG WFI               6500    6140    8520    8130
> DEBUG WFI, credit2      8050    7870    10680   8450
>
> DEBUG means Xen DEBUG build.
>
Mmm, Credit2 (with WFI) behaves almost the same (and even a bit better,
in some cases) with debug enabled, while in Credit1, debug on or off
makes quite a difference, AFAICT, especially in the WFI case.
That looks a bit strange, as I'd have expected the effect to be similar
(there are actually quite a few debug checks in Credit2, maybe even more
than in Credit1).

> WARM MAX is the maximum latency, taking out the first few interrupts
> to warm the caches.
> WFI is the ARM and ARM64 sleeping instruction, trapped and emulated
> by Xen by calling vcpu_block.
>
> As you can see, depending on whether the guest issues a WFI or not
> while waiting for interrupts, the results change significantly.
> Interestingly, credit2 does worse than credit1 in this area.
>
This is with current staging, right? If yes, note that in Credit1 on
ARM, you never stop the scheduler tick, as we do on x86. This means the
system is, in general, "more awake" than with Credit2, which does not
have a periodic tick (and, FWIW, also "more awake" than Credit1 on x86,
as far as the scheduler is concerned, at least).

Whether or not this impacts your measurements significantly, I don't
know, as it depends on a bunch of factors. What we do know is that it
has enough impact to trigger the RCU bug Julien discovered (in a
different scenario, I know), so I would not rule it out.

I can try sending a quick patch for disabling the tick when a CPU is
idle, but I'd need your help in testing it.

> Trying to figure out where those 3000-4000ns of difference between
> the WFI and non-WFI cases come from, I wrote a patch to zero the
> latency introduced by xen/arch/arm/domain.c:schedule_tail. That saves
> about 1000ns. There are no other arch specific context switch
> functions worth optimizing.
>
Yeah. It would be interesting to see a trace, but we still don't have
that for ARM. :-(

> We are down to 2000-3000ns. Then, I started investigating the
> scheduler. I measured how long it takes to run "vcpu_unblock":
> 1050ns, which is significant.
>
How did you measure that, if I can ask?

> I don't know what is causing the remaining 1000-2000ns, but
> I bet on another scheduler function.
> Do you have any suggestions on which one?
>
Well, when a vcpu is woken up, it is put in a runqueue, and a pCPU is
poked to go pick it up and run it. The other thing you may want to try
to measure is how much time passes between when the vCPU becomes
runnable and is added to the runqueue, and when it is actually put to
run. Again, this would be visible in tracing. :-/

> Assuming that the problem is indeed the scheduler, one workaround
> that we could introduce today would be to avoid calling vcpu_unblock
> on guest WFI and call vcpu_yield instead. This change makes things
> significantly better:
>
>                                      AVG     MIN     MAX     WARM MAX
> DEBUG WFI (yield, no block)          2900    2190    5130    5130
> DEBUG WFI (yield, no block) credit2  3514    2280    6180    5430
>
> Is that a reasonable change to make? Would it cause significantly
> more power consumption in Xen (because
> xen/arch/arm/domain.c:idle_loop might not be called anymore)?
>
Exactly. So, I think that, since Linux has 'idle=poll', it is
conceivable to have something similar in Xen, and if we do, I guess it
can be implemented as you suggest. But, no, I don't think this is
satisfactory as a default, not before trying to figure out what is
going on, and whether we can improve things in other ways.

> If we wanted to zero the difference between the WFI and non-WFI
> cases, would we need a new scheduler?
> A simple "noop scheduler" that statically assigns vcpus to pcpus,
> one by one, until they run out, then returns an error?
>
Well, writing such a scheduler would at least be useful as a reference.
As in, the latency that you measure on it is the minimum possible
latency the scheduler is responsible for, and we can compare that with
what we get from 'regular' schedulers.

As a matter of fact, it may also turn out to be useful for a couple of
other issues/reasons, so I may indeed give this a go. But it would not
be much more useful than that, IMO.

> Or do we need more extensive modifications to
> xen/common/schedule.c? Any other ideas?
>
Not yet. :-/

Regards,
Dario
--
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)