From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=60462 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1PTF0k-00025V-LR
	for qemu-devel@nongnu.org; Thu, 16 Dec 2010 09:40:53 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mst@redhat.com>) id 1PTF0h-0006Ay-Ks
	for qemu-devel@nongnu.org; Thu, 16 Dec 2010 09:40:50 -0500
Received: from mx1.redhat.com ([209.132.183.28]:18548)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mst@redhat.com>) id 1PTF0h-0006AD-4P
	for qemu-devel@nongnu.org; Thu, 16 Dec 2010 09:40:47 -0500
Date: Thu, 16 Dec 2010 16:40:10 +0200
From: "Michael S. Tsirkin" <mst@redhat.com>
Message-ID: <20101216144010.GA25333@redhat.com>
References: <1290665220-26478-6-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
	<20101128092857.GA3342@redhat.com>
	<AANLkTinP6o-kmRExEWguhHUkjXe52VKsoyQgq5CqduPO@mail.gmail.com>
	<20101128114627.GC4499@redhat.com>
	<AANLkTim2CojXtyGVDNngzaYMCXfan3kkS=thDE9i-yT=@mail.gmail.com>
	<20101202120213.GA2454@redhat.com>
	<AANLkTimAkmqtAP4e_rvjc_NAsr7D86L3pD64HtqXa7DD@mail.gmail.com>
	<AANLkTimvsozOXJpwyUYAqWBKLFsY==x8AzCkJ4CapgTg@mail.gmail.com>
	<20101216095140.GB19495@redhat.com>
	<AANLkTikW+9CnDhqqMgtbdavScZ-Kg2UA-9xhGw988Qp9@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
In-Reply-To: <AANLkTikW+9CnDhqqMgtbdavScZ-Kg2UA-9xhGw988Qp9@mail.gmail.com>
Content-Transfer-Encoding: quoted-printable
Subject: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to
 handle inuse variable.
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com, kvm@vger.kernel.org, ohmura.kei@lab.ntt.co.jp, Marcelo Tosatti <mtosatti@redhat.com>, qemu-devel@nongnu.org, vatsa@linux.vnet.ibm.com, avi@redhat.com, psuriset@linux.vnet.ibm.com, stefanha@linux.vnet.ibm.com

On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >>> >> >>
> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >>> >> >
> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >>> >> > state that is not guest visible is always a mistake
> >> >>> >> > as it ties migration format to an internal implementation
> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
> >> >>> >> > is to flush outstanding
> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >>> >>
> >> >>> >> Could you give me the link of your patches?  I'd like to test
> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
> >> >>> >> happy to drop this patch.
> >> >>> >>
> >> >>> >> Yoshi
> >> >>> >
> >> >>> > Look for this:
> >> >>> > stable migration image on a stopped vm
> >> >>> > sent on:
> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >>>
> >> >>> Thanks for the info.
> >> >>>
> >> >>> However, The patch series above didn't solve the issue.  In
> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >>> output, and while last_avail_idx gets incremented
> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >>> between Primary and Secondary.
> >> >>
> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >
> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> > and update the external when inuse is decremented.  I'll try
> >> > whether it work w/ w/o Kemari.
> >>
> >> Hi Michael,
> >>
> >> Could you please take a look at the following patch?
> >
> > Which version is this against?
>
> Oops.  It should be very old.
> 67f895bfe69f323b427b284430b6219c8a62e8d4
>
> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> Date:   Thu Dec 16 14:50:54 2010 +0900
> >>
> >>     virtio: update last_avail_idx when inuse is decreased.
> >>
> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >
> > It would be better to have a commit description explaining why a change
> > is made, and why it is correct, not just repeating what can be seen from
> > the diff anyway.
>
> Sorry for being lazy here.
>
> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> index c8a0fc6..6688c02 100644
> >> --- a/hw/virtio.c
> >> +++ b/hw/virtio.c
> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >>      wmb();
> >>      trace_virtqueue_flush(vq, count);
> >>      vring_used_idx_increment(vq, count);
> >> +    vq->last_avail_idx += count;
> >>      vq->inuse -= count;
> >>  }
> >>
> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >>      unsigned int i, head, max;
> >>      target_phys_addr_t desc_pa = vq->vring.desc;
> >>
> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >>          return 0;
> >>
> >>      /* When we start there are none of either input nor output. */
> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >>
> >>      max = vq->vring.num;
> >>
> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >>
> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >>
> >
> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>
> I think there are two problems.
>
> 1. When to update last_avail_idx.
> 2. The ordering issue you're mentioning below.
>
> The patch above is only trying to address 1 because last time you
> mentioned that modifying last_avail_idx upon save may break the
> guest, which I agree.  If virtio_queue_empty and
> virtqueue_avail_bytes are only used internally, meaning invisible
> to the guest, I guess the approach above can be applied too.

So IMHO 2 is the real issue. This is what was problematic
with the save patch, otherwise of course changes in save
are better than changes all over the codebase.

> > Previous patch version sure looked simpler, and this seems functionally
> > equivalent, so my question still stands: here it is rephrased in a
> > different way:
> >
> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >
> >        host pops A, then B, then completes B and flushes
> >
> >        now with this patch last_avail_idx will be 1, and then
> >        remote will get it, it will execute B again. As a result
> >        B will complete twice, and apparently A will never complete.
> >
> >
> > This is what I was saying below: assuming that there are
> > outstanding requests when we migrate, there is no way
> > a single index can be enough to figure out which requests
> > need to be handled and which are in flight already.
> >
> > We must add some kind of bitmask to tell us which is which.
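
To make the bitmask idea concrete, here is a rough sketch, not a
proposal for the actual migration format.  InflightMap and the
inflight_* helpers are hypothetical names that do not exist in
hw/virtio.c, and RING_MAX assumes the 512-entry upper bound mentioned
further down in this thread.  A set bit means the guest made the entry
available and it has not completed yet:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_MAX 512  /* largest ring size mentioned in this thread */

/* Hypothetical type, not part of hw/virtio.c: one bit per avail ring slot.
 * A set bit means "made available by the guest, not yet completed". */
typedef struct InflightMap {
    uint8_t bits[RING_MAX / 8];   /* 512 bits = 64 bytes per ring */
} InflightMap;

static void inflight_init(InflightMap *m)
{
    memset(m->bits, 0, sizeof(m->bits));
}

/* Request at avail ring slot `slot` was popped and is now in flight. */
static void inflight_set(InflightMap *m, unsigned slot)
{
    assert(slot < RING_MAX);
    m->bits[slot / 8] |= (uint8_t)(1u << (slot % 8));
}

/* Request at `slot` was pushed to the used ring; it must not be replayed. */
static void inflight_clear(InflightMap *m, unsigned slot)
{
    assert(slot < RING_MAX);
    m->bits[slot / 8] &= (uint8_t)~(1u << (slot % 8));
}

/* Returns 1 if the request at `slot` still needs to be (re)submitted. */
static int inflight_test(const InflightMap *m, unsigned slot)
{
    assert(slot < RING_MAX);
    return (m->bits[slot / 8] >> (slot % 8)) & 1;
}

/* Count the requests the destination would still have to resubmit. */
static unsigned inflight_count(const InflightMap *m, unsigned ring_size)
{
    unsigned slot, n = 0;

    for (slot = 0; slot < ring_size; slot++) {
        n += (unsigned)inflight_test(m, slot);
    }
    return n;
}
```

With A in slot 0 and B in slot 1, popping both and then completing B
leaves only bit 0 set, so the remote would know to resubmit A and to
leave B alone.
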
>
> I should understand why this inversion can happen before solving
> the issue.

It's a fundamental thing in virtio: requests may complete out of order.
I think it is currently only likely to happen with block; I think tap
currently completes things in order.  In any case, relying on this in the
frontend is a mistake.
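
As an illustration only, the A/B sequence quoted above can be replayed
on a toy model of the two counters.  ToyQueue and the toy_* helpers are
made up for this mail and are not QEMU code; they only mimic what the
patch does to last_avail_idx and inuse:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, not QEMU code: only the two counters the patch touches. */
typedef struct ToyQueue {
    uint16_t last_avail_idx;  /* the single index that would be migrated */
    uint16_t inuse;           /* requests popped but not yet flushed */
} ToyQueue;

/* Pop as in the patch: read slot last_avail_idx + inuse and leave the
 * index itself alone.  Returns the avail ring slot that was popped. */
static uint16_t toy_pop(ToyQueue *q)
{
    uint16_t slot = (uint16_t)(q->last_avail_idx + q->inuse);

    q->inuse++;
    return slot;
}

/* Flush as in the patch: the index advances by count, regardless of
 * which of the in-flight requests actually completed. */
static void toy_flush(ToyQueue *q, uint16_t count)
{
    q->last_avail_idx = (uint16_t)(q->last_avail_idx + count);
    q->inuse = (uint16_t)(q->inuse - count);
}

/* First slot the destination would process after loading the index. */
static uint16_t toy_resume_slot(const ToyQueue *q)
{
    return q->last_avail_idx;
}
```

Pop A (slot 0), pop B (slot 1), then toy_flush(&q, 1) for B:
last_avail_idx becomes 1 and inuse stays at 1, so toy_resume_slot()
points at B, which already completed, and nothing points at A.
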

>  Currently, how are you making virtio-net flush
> every request for live migration?  Is it qemu_aio_flush()?
>
> Yoshi

Think so.


> >
> >> >
> >> >>
> >> >>>  I'm wondering why
> >> >>> last_avail_idx is OK to send but not inuse.
> >> >>
> >> >> last_avail_idx is at some level a mistake, it exposes part of
> >> >> our internal implementation, but it does *also* express
> >> >> a guest observable state.
> >> >>
> >> >> Here's the problem that it solves: just looking at the rings in virtio
> >> >> there is no way to detect that a specific request has already been
> >> >> completed. And the protocol forbids completing the same request twice.
> >> >>
> >> >> Our implementation always starts processing the requests
> >> >> in order, and since we flush outstanding requests
> >> >> before save, it works to just tell the remote 'process only requests
> >> >> after this place'.
> >> >>
> >> >> But there's no such requirement in the virtio protocol,
> >> >> so to be really generic we could add a bitmask of valid avail
> >> >> ring entries that did not complete yet. This would be
> >> >> the exact representation of the guest observable state.
> >> >> In practice we have rings of up to 512 entries.
> >> >> That's 64 byte per ring, not a lot at all.
> >> >>
> >> >> However, if we ever do change the protocol to send the bitmask,
> >> >> we would need some code to resubmit requests
> >> >> out of order, so it's not trivial.
> >> >>
> >> >> Another minor mistake with last_avail_idx is that it has
> >> >> some redundancy: the high bits in the index
> >> >> (> vq size) are not necessary as they can be
> >> >> got from avail idx.  There's a consistency check
> >> >> in load but we really should try to use formats
> >> >> that are always consistent.
> >> >>
> >> >>> The following patch does the same thing as original, yet
> >> >>> keeps the format of the virtio.  It shouldn't break live
> >> >>> migration either because inuse should be 0.
> >> >>>
> >> >>> Yoshi
> >> >>
> >> >> Question is, can you flush to make inuse 0 in kemari too?
> >> >> And if not, how do you handle the fact that some requests
> >> >> are in flight on the primary?
> >> >
> >> > Although we try flushing requests one by one making inuse 0,
> >> > there are cases when it failovers to the secondary when inuse
> >> > isn't 0.  We handle these in flight request on the primary by
> >> > replaying on the secondary.
> >> >
> >> >>
> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >>> index c8a0fc6..875c7ca 100644
> >> >>> --- a/hw/virtio.c
> >> >>> +++ b/hw/virtio.c
> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >>>      qemu_put_be32(f, i);
> >> >>>
> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >> >>> +        uint16_t last_avail_idx;
> >> >>> +
> >> >>>          if (vdev->vq[i].vring.num == 0)
> >> >>>              break;
> >> >>>
> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >> >>> +
> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
> >> >>>          if (vdev->binding->save_queue)
> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >>>      }
> >> >>>
> >> >>>
> >> >>
> >> >> This looks wrong to me.  Requests can complete in any order, can they
> >> >> not?  So if request 0 did not complete and request 1 did,
> >> >> you send avail - inuse and on the secondary you will process and
> >> >> complete request 1 the second time, crashing the guest.
> >> >
> >> > In case of Kemari, no.  We sit between devices and net/block, and
> >> > queue the requests.  After completing each transaction, we flush
> >> > the requests one by one.  So there won't be completion inversion,
> >> > and therefore won't be visible to the guest.
> >> >
> >> > Yoshi
> >> >
> >> >>
> >> >>>
> >> >>> >
> >> >>> >> >
> >> >>> >> >> ---
> >> >>> >> >>  hw/virtio.c |    8 +++++++-
> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
> >> >>> >> >>
> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >>> >> >> index 849a60f..5509644 100644
> >> >>> >> >> --- a/hw/virtio.c
> >> >>> >> >> +++ b/hw/virtio.c
> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >>> >> >>      VRing vring;
> >> >>> >> >>      target_phys_addr_t pa;
> >> >>> >> >>      uint16_t last_avail_idx;
> >> >>> >> >> -    int inuse;
> >> >>> >> >> +    uint16_t inuse;
> >> >>> >> >>      uint16_t vector;
> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >>> >> >>      VirtIODevice *vdev;
> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >>> >> >>          if (vdev->binding->save_queue)
> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >>> >> >>      }
> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >>> >> >> +
> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >>> >> >> +        vdev->vq[i].inuse = 0;
> >> >>> >> >>
> >> >>> >> >>          if (vdev->vq[i].pa) {
> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
> >> >>> >> >> --
> >> >>> >> >> 1.7.1.2
> >> >>> >> >>
> >> >>> >> >> --
> >> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >>> >> >> the body of a message to majordomo@vger.kernel.org
> >> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>> >> >
> >> >>> >
> >> >>
> >> >
> >