From: Luwei Cheng
Subject: Re: [PROPOSAL] Event channel for SMP-VMs: per-vCPU or per-OS?
Date: Wed, 30 Oct 2013 15:35:15 +0800
To: David Vrabel
Cc: George Dunlap, xen-devel@lists.xenproject.org, Wei Liu, david.vrabel@citrix.com
List-Id: xen-devel@lists.xenproject.org

On Tue, Oct 29, 2013 at 11:21 PM, David Vrabel <dvrabel@cantab.net> wrote:
> On 28/10/2013 15:26, Luwei Cheng wrote:
> > The following idea was first discussed with George Dunlap, David Vrabel
> > and Wei Liu at XenDevSummit13. Many thanks for their encouragement to
> > post this idea to the community for a wider discussion.
> >
> > [Current Design]
> > Each event channel is associated with only “one” notified vCPU: one-to-one.
> >
> > [Problem]
> > Some events are per-vCPU (such as local timer interrupts) while others
> > are per-OS (such as I/O interrupts: network and disk).
> > For SMP-VMs, it is possible that while one vCPU is waiting in the scheduling
> > queue, another vCPU is running. So, if I/O events can be dynamically
> > routed to the running vCPU, they can be processed quickly, without
> > suffering from VM scheduling delays (tens of milliseconds). On the other
> > hand, no reschedule operations are introduced.
> >
> > Though users can set IRQ affinity in the guest OS, the current
> > implementation forces the IRQ to be bound to the first vCPU of the
> > affinity mask [events.c: set_affinity_irq].
> > If the hypervisor delivers the event to a different vCPU, the event
> > will be lost, because the guest OS has masked out this event on all
> > non-notified vCPUs [events.c: bind_evtchn_to_cpu].
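Just to illustrate the two events.c references above: below is a toy model of
the current guest-side behaviour (a simplified paraphrase, not the real kernel
code; the 64-bit masks and the demo values are my own). Whatever mask the user
writes to /proc/irq/N/smp_affinity, only its first vCPU ends up able to see
the channel, so an event delivered anywhere else is filtered out by the guest:

/* Toy model of the guest-side behaviour described above: a simplified
 * paraphrase, not the actual Linux events.c code. */
#include <stdio.h>
#include <stdint.h>

#define MAX_VCPUS 64

/* Per-vCPU bitmap of event channels that this vCPU will process
 * (models the guest's per-cpu event-channel filtering). */
static uint64_t cpu_evtchn_mask[MAX_VCPUS];

static void bind_evtchn_to_cpu(unsigned chn, unsigned cpu)
{
    for (unsigned v = 0; v < MAX_VCPUS; v++)
        cpu_evtchn_mask[v] &= ~(1ULL << chn);   /* mask out everywhere...  */
    cpu_evtchn_mask[cpu] |= 1ULL << chn;        /* ...except on one vCPU   */
}

static void set_affinity_irq(unsigned chn, uint64_t smp_affinity)
{
    unsigned first = 0;                         /* "first vCPU of the mask" */
    while (first < MAX_VCPUS - 1 && !(smp_affinity & (1ULL << first)))
        first++;
    bind_evtchn_to_cpu(chn, first);
}

int main(void)
{
    set_affinity_irq(5, 0xC);      /* user asks for channel 5 on vCPUs {2,3} */

    /* Suppose the hypervisor notifies vCPU 3: the channel is masked there,
     * so from the guest's point of view the event is lost. */
    printf("vCPU 3 handles channel 5? %s\n",
           (cpu_evtchn_mask[3] & (1ULL << 5)) ? "yes" : "no");   /* no  */
    printf("vCPU 2 handles channel 5? %s\n",
           (cpu_evtchn_mask[2] & (1ULL << 5)) ? "yes" : "no");   /* yes */
    return 0;
}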
> >
> > [New Design]
> > For per-OS event channels, add “vCPU affinity” support: one-to-many.
> > The “affinity” should be consistent with the ‘/proc/irq/#/smp_affinity’
> > of the guest OS, and users can change the mapping at runtime. But by
> > default, all vCPUs should be enabled to serve I/O.
> >
> > When such flexibility is enabled, I/O balancing among vCPUs can be
> > offloaded to the hypervisor. “irqbalance” is designed for physical
> > SMP systems, not virtual SMP systems.

Thanks for the reply, David.
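To make the one-to-many binding in [New Design] above a bit more concrete,
the interface I am imagining is roughly shaped like the structure below. This
is only a sketch: the structure and field names are hypothetical and nothing
like it exists in the current ABI.

#include <stdint.h>

/* HYPOTHETICAL sketch only, no such binding exists today.  It illustrates
 * the proposed one-to-many binding: the guest hands Xen a set of
 * acceptable vCPUs per event channel, mirroring what it writes to
 * /proc/irq/<N>/smp_affinity, instead of a single vcpu id. */
struct evtchn_bind_vcpu_mask {
    /* IN parameters. */
    uint32_t port;         /* per-OS event channel to (re)bind        */
    uint64_t vcpu_mask;    /* bitmap of vCPUs allowed to be notified; */
                           /* all bits set by default, so every vCPU  */
                           /* may serve I/O; changeable at runtime    */
};

The hypervisor would then be free to notify any vCPU whose bit is set in
vcpu_mask.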
> It's an interesting idea but I'm not sure how useful it will be in
> practice, as often work is deferred to threads in the guest rather than
> done directly in the interrupt handler.
Sure, but if the interrupt handler is not called in a timely manner, no IRQ
threads will be woken either.
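Just to spell out the mechanism we are both referring to: with a threaded
handler, the thread only runs after the primary handler has fired, so a late
event delivery delays the deferred work by the same amount. A minimal
fragment (kernel context assumed; the device names are placeholders):

#include <linux/interrupt.h>

/* Primary (hard) handler: runs only when the event actually reaches a
 * vCPU; it just acknowledges the device and asks for the thread. */
static irqreturn_t dev_hardirq(int irq, void *dev_id)
{
    return IRQ_WAKE_THREAD;            /* defer the real work */
}

/* Threaded handler: does the heavy lifting, but is woken only after
 * dev_hardirq() has run. */
static irqreturn_t dev_thread_fn(int irq, void *dev_id)
{
    /* ... process the I/O ... */
    return IRQ_HANDLED;
}

/* In the driver's setup path (irq and dev are placeholders):
 *   err = request_threaded_irq(irq, dev_hardirq, dev_thread_fn,
 *                              0, "mydev", dev);
 */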
> I don't see any way this could be implemented using the 2-level ABI.
Probably the implementation does not need to bother with the 2-level ABI.

> With the FIFO ABI, queues cannot move between VCPUs without some
> additional locking (dequeuing an event is only safe with a single
> consumer) but it may be possible (when an event is set pending) for Xen
> to pick a queue from a set of queues, instead of always using the same
> queue.
>
> I don't think this would result in balanced I/O between VCPUs, but the
> opposite -- events would crowd onto the few VCPUs that are currently
> running.
I think it is the hypervisor that plays the role of deciding which vCPU
should be kicked to serve I/O, and different routing policies will lead to
different outcomes. Since all vCPUs are scheduled symmetrically, the events
can be distributed evenly across them: at one moment vCPUx is running, at
another moment vCPUy is running. So the events will not always crowd onto
just a few vCPUs.
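To make the kind of policy I have in mind concrete, here is a minimal sketch
(illustrative only, not Xen code; the names and the 64-vCPU bitmaps are my
own simplification): when an event is set pending, prefer a vCPU that is both
allowed by the channel's affinity mask and currently running, and rotate
among the candidates so that no single vCPU is favoured.

/* Illustrative model of one possible routing policy, not Xen code.
 * A vCPU set is modelled as a 64-bit bitmap for simplicity. */
#include <stdio.h>
#include <stdint.h>

struct evtchn_model {
    unsigned default_vcpu;    /* today's single notified vCPU            */
    uint64_t allowed_mask;    /* proposed per-channel vCPU affinity mask */
};

/* running_mask: vCPUs of this domain currently running on some pCPU. */
static unsigned pick_notify_vcpu(const struct evtchn_model *ch,
                                 uint64_t running_mask)
{
    static unsigned last;     /* rotate so events do not crowd onto one vCPU */
    uint64_t candidates = ch->allowed_mask & running_mask;

    if (!candidates)
        return ch->default_vcpu;     /* nothing running: today's behaviour */

    for (unsigned i = 1; i <= 64; i++) {
        unsigned v = (last + i) % 64;
        if (candidates & (1ULL << v))
            return last = v;
    }
    return ch->default_vcpu;         /* not reached */
}

int main(void)
{
    struct evtchn_model ch = { .default_vcpu = 0, .allowed_mask = 0xF };

    /* vCPU0 is waiting in the runqueue; vCPU2 and vCPU3 are running now. */
    printf("notify vCPU %u\n", pick_notify_vcpu(&ch, 0xC));   /* -> 2 */
    printf("notify vCPU %u\n", pick_notify_vcpu(&ch, 0xC));   /* -> 3 */
    return 0;
}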

Currently, all I/O events are bound to vCPU0, which is just what you
described: events crowd onto that one vCPU. As a result, vCPU0 consumes many
more CPU cycles than the other vCPUs, leading to unfairness. If some of this
workload can be dynamically migrated to the other vCPUs, I believe we can
get at least some benefit.

Thanks,
Luwei