From: Yoshiaki Tamura
Sender: tamura.yoshiaki@gmail.com
Date: Thu, 16 Dec 2010 16:36:16 +0900
Subject: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
To: "Michael S. Tsirkin"
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com, kvm@vger.kernel.org, ohmura.kei@lab.ntt.co.jp, Marcelo Tosatti , qemu-devel@nongnu.org, vatsa@linux.vnet.ibm.com, avi@redhat.com, psuriset@linux.vnet.ibm.com, stefanha@linux.vnet.ibm.com
References: <1290665220-26478-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp> <1290665220-26478-6-git-send-email-tamura.yoshiaki@lab.ntt.co.jp> <20101128092857.GA3342@redhat.com> <20101128114627.GC4499@redhat.com> <20101202120213.GA2454@redhat.com>
List-Id: qemu-devel.nongnu.org

2010/12/3 Yoshiaki Tamura :
> 2010/12/2 Michael S. Tsirkin :
>> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>>> 2010/11/28 Michael S. Tsirkin :
>>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>>> >> 2010/11/28 Michael S. Tsirkin :
>>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>>> >> >> Modify inuse type to uint16_t, let save/load handle it, and revert
>>> >> >> last_avail_idx with inuse if there is outstanding emulation.
>>> >> >>
>>> >> >> Signed-off-by: Yoshiaki Tamura
>>> >> >
>>> >> > This changes migration format, so it will break compatibility with
>>> >> > existing drivers. More generally, I think migrating internal
>>> >> > state that is not guest visible is always a mistake
>>> >> > as it ties migration format to an internal implementation
>>> >> > (yes, I know we do this sometimes, but we should at least
>>> >> > try not to add such cases). I think the right thing to do in this
>>> >> > case is to flush outstanding work when vm is stopped.
>>> >> > Then, we are guaranteed that inuse is 0.
>>> >> > I sent patches that do this for virtio net and block.
>>> >>
>>> >> Could you give me the link of your patches? I'd like to test
>>> >> whether they work with Kemari upon failover. If they do, I'm
>>> >> happy to drop this patch.
>>> >>
>>> >> Yoshi
>>> >
>>> > Look for this:
>>> > stable migration image on a stopped vm
>>> > sent on:
>>> > Wed, 24 Nov 2010 17:52:49 +0200
>>>
>>> Thanks for the info.
>>>
>>> However, the patch series above didn't solve the issue. In
>>> case of Kemari, inuse is mostly > 0 because it queues the
>>> output, and while last_avail_idx gets incremented
>>> immediately, not sending inuse makes the state inconsistent
>>> between Primary and Secondary.
>>
>> Hmm. Can we simply avoid incrementing last_avail_idx?
>
> I think we can calculate or prepare an internal last_avail_idx,
> and update the external one when inuse is decremented. I'll check
> whether it works with and without Kemari.

Hi Michael,

Could you please take a look at the following patch?

commit 36ee7910059e6b236fe9467a609f5b4aed866912
Author: Yoshiaki Tamura
Date:   Thu Dec 16 14:50:54 2010 +0900

    virtio: update last_avail_idx when inuse is decreased.

    Signed-off-by: Yoshiaki Tamura

diff --git a/hw/virtio.c b/hw/virtio.c
index c8a0fc6..6688c02 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
     wmb();
     trace_virtqueue_flush(vq, count);
     vring_used_idx_increment(vq, count);
+    vq->last_avail_idx += count;
     vq->inuse -= count;
 }

@@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
     unsigned int i, head, max;
     target_phys_addr_t desc_pa = vq->vring.desc;

-    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
+    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
         return 0;

     /* When we start there are none of either input nor output. */
@@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)

     max = vq->vring.num;

-    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
+    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);

     if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
         if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {

>
>>
>>> I'm wondering why
>>> last_avail_idx is OK to send but not inuse.
>>
>> last_avail_idx is at some level a mistake, it exposes part of
>> our internal implementation, but it does *also* express
>> a guest observable state.
>>
>> Here's the problem that it solves: just looking at the rings in virtio
>> there is no way to detect that a specific request has already been
>> completed. And the protocol forbids completing the same request twice.
>>
>> Our implementation always starts processing the requests
>> in order, and since we flush outstanding requests
>> before save, it works to just tell the remote 'process only requests
>> after this place'.
>>
>> But there's no such requirement in the virtio protocol,
>> so to be really generic we could add a bitmask of valid avail
>> ring entries that did not complete yet. This would be
>> the exact representation of the guest observable state.
>> In practice we have rings of up to 512 entries.
>> That's 64 bytes per ring, not a lot at all.
>>
>> However, if we ever do change the protocol to send the bitmask,
>> we would need some code to resubmit requests
>> out of order, so it's not trivial.
>>
>> Another minor mistake with last_avail_idx is that it has
>> some redundancy: the high bits in the index
>> (> vq size) are not necessary as they can be
>> derived from avail idx. There's a consistency check
>> in load but we really should try to use formats
>> that are always consistent.
>>
>>> The following patch does the same thing as the original, yet
>>> keeps the virtio migration format. It shouldn't break live
>>> migration either because inuse should be 0.
>>>
>>> Yoshi
>>
>> Question is, can you flush to make inuse 0 in kemari too?
>> And if not, how do you handle the fact that some requests
>> are in flight on the primary?
>
> Although we try flushing requests one by one to bring inuse to 0,
> there are cases where it fails over to the secondary while inuse
> isn't 0. We handle these in-flight requests on the primary by
> replaying them on the secondary.
>
>>
>>> diff --git a/hw/virtio.c b/hw/virtio.c
>>> index c8a0fc6..875c7ca 100644
>>> --- a/hw/virtio.c
>>> +++ b/hw/virtio.c
>>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>>      qemu_put_be32(f, i);
>>>
>>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>>> +        uint16_t last_avail_idx;
>>> +
>>>          if (vdev->vq[i].vring.num == 0)
>>>              break;
>>>
>>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>>> +
>>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>>>          qemu_put_be64(f, vdev->vq[i].pa);
>>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> +        qemu_put_be16s(f, &last_avail_idx);
>>>          if (vdev->binding->save_queue)
>>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>>      }
>>>
>>>
>>
>> This looks wrong to me. Requests can complete in any order, can they
>> not? So if request 0 did not complete and request 1 did,
>> you send avail - inuse and on the secondary you will process and
>> complete request 1 the second time, crashing the guest.
>
> In case of Kemari, no. We sit between devices and net/block, and
> queue the requests. After completing each transaction, we flush
> the requests one by one. So there won't be any completion inversion,
> and therefore none will be visible to the guest.
>
> Yoshi
>
>>
>>>
>>> >
>>> >> >
>>> >> >> ---
>>> >> >>  hw/virtio.c |    8 +++++++-
>>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>>> >> >>
>>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> >> index 849a60f..5509644 100644
>>> >> >> --- a/hw/virtio.c
>>> >> >> +++ b/hw/virtio.c
>>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>>> >> >>      VRing vring;
>>> >> >>      target_phys_addr_t pa;
>>> >> >>      uint16_t last_avail_idx;
>>> >> >> -    int inuse;
>>> >> >> +    uint16_t inuse;
>>> >> >>      uint16_t vector;
>>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>>> >> >>      VirtIODevice *vdev;
>>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>>> >> >>          if (vdev->binding->save_queue)
>>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> >> >>      }
>>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>>> >> >> +
>>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
>>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>>> >> >> +        vdev->vq[i].inuse = 0;
>>> >> >>
>>> >> >>          if (vdev->vq[i].pa) {
>>> >> >>              virtqueue_init(&vdev->vq[i]);
>>> >> >> --
>>> >> >> 1.7.1.2
>>> >> >>
>>> >> >> --
>>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> >> >> the body of a message to majordomo@vger.kernel.org
>>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >
>>> >
>>
>
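
To make the index arithmetic in the Dec 16 patch easier to follow, here is a minimal standalone sketch of the idea. It is not QEMU code: struct sketch_vq and the sketch_* helpers are hypothetical names, descriptors and the used ring are left out, and it assumes requests complete in order, as described for Kemari above. A pop no longer advances last_avail_idx. The next entry to pop is last_avail_idx + inuse, and last_avail_idx only moves when requests are flushed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for QEMU's VirtQueue. Only the
 * counters discussed in this thread are modeled. */
struct sketch_vq {
    uint16_t avail_idx;      /* index published by the guest */
    uint16_t last_avail_idx; /* first request that has not completed yet */
    uint16_t inuse;          /* requests popped but not yet flushed */
};

/* Requests made available by the guest that have not been popped yet.
 * The uint16_t casts give the modulo-65536 wraparound of ring indices. */
static uint16_t sketch_num_heads(const struct sketch_vq *vq)
{
    uint16_t next = (uint16_t)(vq->last_avail_idx + vq->inuse);
    return (uint16_t)(vq->avail_idx - next);
}

/* Pop one request. Returns the avail index consumed, or -1 if the
 * queue has nothing new. last_avail_idx is deliberately left alone. */
static int sketch_pop(struct sketch_vq *vq)
{
    if (sketch_num_heads(vq) == 0)
        return -1;
    uint16_t idx = (uint16_t)(vq->last_avail_idx + vq->inuse);
    vq->inuse++;
    return idx;
}

/* Complete `count` requests, assumed to finish in the order they were
 * popped. Only now does last_avail_idx advance. */
static void sketch_flush(struct sketch_vq *vq, uint16_t count)
{
    assert(count <= vq->inuse);
    vq->last_avail_idx = (uint16_t)(vq->last_avail_idx + count);
    vq->inuse = (uint16_t)(vq->inuse - count);
}
```

With this bookkeeping, the last_avail_idx value that virtio_save() writes always points at the oldest request that has not completed, so the migration format stays unchanged and the receiving side can start again from that request. The sketch says nothing about out-of-order completion, which is the case the bitmask discussion above is about.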