Date: Fri, 24 Dec 2010 11:27:10 +0200
From: "Michael S. Tsirkin"
Message-ID: <20101224092710.GA23271@redhat.com>
References: <20101128114627.GC4499@redhat.com> <20101202120213.GA2454@redhat.com> <20101216095140.GB19495@redhat.com> <20101216144010.GA25333@redhat.com>
Subject: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
List-Id: qemu-devel.nongnu.org
To: Yoshiaki Tamura
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com, kvm@vger.kernel.org, ohmura.kei@lab.ntt.co.jp, Marcelo Tosatti, qemu-devel@nongnu.org, vatsa@linux.vnet.ibm.com, avi@redhat.com, psuriset@linux.vnet.ibm.com, stefanha@linux.vnet.ibm.com

On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> 2010/12/16 Michael S. Tsirkin:
> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/16 Michael S. Tsirkin:
> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/3 Yoshiaki Tamura:
> >> >> > 2010/12/2 Michael S. Tsirkin:
> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >>> 2010/11/28 Michael S. Tsirkin:
> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >>> >> 2010/11/28 Michael S. Tsirkin:
> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >>> >> >>
> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura
> >> >> >>> >> >
> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >>> >> > try not to add such cases). I think the right thing to do in this case
> >> >> >>> >> > is to flush outstanding
> >> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >>> >>
> >> >> >>> >> Could you give me the link of your patches? I'd like to test
> >> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
> >> >> >>> >> happy to drop this patch.
> >> >> >>> >>
> >> >> >>> >> Yoshi
> >> >> >>> >
> >> >> >>> > Look for this:
> >> >> >>> > stable migration image on a stopped vm
> >> >> >>> > sent on:
> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >>>
> >> >> >>> Thanks for the info.
> >> >> >>>
> >> >> >>> However, the patch series above didn't solve the issue. In
> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >>> between Primary and Secondary.
> >> >> >>
> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >
> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> > and update the external when inuse is decremented. I'll check
> >> >> > whether it works with and without Kemari.
> >> >>
> >> >> Hi Michael,
> >> >>
> >> >> Could you please take a look at the following patch?
> >> >
> >> > Which version is this against?
> >>
> >> Oops. It should be very old.
> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >>
> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> Author: Yoshiaki Tamura
> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
> >> >>
> >> >>     virtio: update last_avail_idx when inuse is decreased.
> >> >>
> >> >>     Signed-off-by: Yoshiaki Tamura
> >> >
> >> > It would be better to have a commit description explaining why a change
> >> > is made, and why it is correct, not just repeating what can be seen from
> >> > the diff anyway.
> >>
> >> Sorry for being lazy here.
> >>
> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> index c8a0fc6..6688c02 100644
> >> >> --- a/hw/virtio.c
> >> >> +++ b/hw/virtio.c
> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >>      wmb();
> >> >>      trace_virtqueue_flush(vq, count);
> >> >>      vring_used_idx_increment(vq, count);
> >> >> +    vq->last_avail_idx += count;
> >> >>      vq->inuse -= count;
> >> >>  }
> >> >>
> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >>      unsigned int i, head, max;
> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
> >> >>
> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >>          return 0;
> >> >>
> >> >>      /* When we start there are none of either input nor output. */
> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >>
> >> >>      max = vq->vring.num;
> >> >>
> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >>
> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >>
> >> >
> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >>
> >> I think there are two problems.
> >>
> >> 1. When to update last_avail_idx.
> >> 2. The ordering issue you're mentioning below.
> >>
> >> The patch above is only trying to address 1 because last time you
> >> mentioned that modifying last_avail_idx upon save may break the
> >> guest, which I agree with. If virtio_queue_empty and
> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> to the guest, I guess the approach above can be applied too.
> >
> > So IMHO 2 is the real issue. This is what was problematic
> > with the save patch, otherwise of course changes in save
> > are better than changes all over the codebase.
>
> All right. Then let's focus on 2 first.
>
> >> > Previous patch version sure looked simpler, and this seems functionally
> >> > equivalent, so my question still stands: here it is rephrased in a
> >> > different way:
> >> >
> >> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >
> >> >        host pops A, then B, then completes B and flushes
> >> >
> >> >        now with this patch last_avail_idx will be 1, and then
> >> >        remote will get it, it will execute B again. As a result
> >> >        B will complete twice, and apparently A will never complete.
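
The A/B scenario quoted above can be played out in a small model. This is only a sketch under stated assumptions: `struct toy_vq` and the `toy_*` helpers are hypothetical names invented for illustration, not QEMU code, and the model assumes the remote receives nothing but last_avail_idx and replays every avail entry from there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 2  /* toy ring: request A in slot 0, request B in slot 1 */

/* Hypothetical, simplified stand-in for the device-side queue state. */
struct toy_vq {
    uint16_t last_avail_idx;   /* advanced on flush, as in the quoted patch */
    uint16_t inuse;            /* popped but not yet flushed */
    bool completed[RING_SIZE]; /* ground truth that one index cannot express */
};

/* Pop the next request from slot last_avail_idx + inuse. */
static uint16_t toy_pop(struct toy_vq *vq)
{
    uint16_t slot = (uint16_t)(vq->last_avail_idx + vq->inuse) % RING_SIZE;
    vq->inuse++;
    return slot;
}

/* Complete one request, possibly out of order, and flush it. */
static void toy_complete(struct toy_vq *vq, uint16_t slot)
{
    assert(slot < RING_SIZE && vq->inuse > 0);
    vq->completed[slot] = true;
    vq->last_avail_idx += 1;
    vq->inuse -= 1;
}

/* Slots a remote would replay if it only knows last_avail_idx. */
static unsigned toy_replay_mask(const struct toy_vq *vq, uint16_t avail_idx)
{
    unsigned mask = 0;
    for (uint16_t i = vq->last_avail_idx; i != avail_idx; i++)
        mask |= 1u << (i % RING_SIZE);
    return mask;
}

/* Slots that really are still outstanding. */
static unsigned toy_pending_mask(const struct toy_vq *vq)
{
    unsigned mask = 0;
    for (uint16_t i = 0; i < RING_SIZE; i++)
        if (!vq->completed[i])
            mask |= 1u << i;
    return mask;
}
```

Popping A and B and then completing only B leaves last_avail_idx at 1, so in this model the replay set is {B} while the set that is really outstanding is {A}.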
> >> >
> >> >
> >> > This is what I was saying below: assuming that there are
> >> > outstanding requests when we migrate, there is no way
> >> > a single index can be enough to figure out which requests
> >> > need to be handled and which are in flight already.
> >> >
> >> > We must add some kind of bitmask to tell us which is which.
> >>
> >> I should understand why this inversion can happen before solving
> >> the issue.
> >
> > It's a fundamental thing in virtio.
> > I think it is currently only likely to happen with block, I think tap
> > currently completes things in order. In any case relying on this in the
> > frontend is a mistake.
> >
> >> Currently, how are you making virtio-net flush
> >> every request for live migration? Is it qemu_aio_flush()?
> >
> > Think so.
>
> If qemu_aio_flush() is responsible for flushing the outstanding
> virtio-net requests, I'm wondering why it's a problem for Kemari.
> As I described in the previous message, Kemari queues the
> requests first. So in your example above, it should start with
>
> virtio-net: last_avail_idx 0 inuse 2
> event-tap: {A,B}
>
> As you know, the requests are still in order because the net
> layer initiates them in order. This is not about completing.
>
> In the first synchronization, the status above is transferred. In
> the next synchronization, the status will be as follows.
>
> virtio-net: last_avail_idx 1 inuse 1
> event-tap: {B}

OK, this answers the ordering question.

Another question: at this point we transfer this status: both
event-tap and virtio ring have the command B,
so the remote will have:

virtio-net: inuse 0
event-tap: {B}

Is this right? This already seems to be a problem as when B completes
inuse will go negative?

Next it seems that the remote virtio will resubmit B to event-tap. The
remote will then have:

virtio-net: inuse 1
event-tap: {B, B}

This looks kind of wrong ... will two packets go out?

> Why? Because Kemari flushes the first virtio-net request using
> qemu_aio_flush() before each synchronization. If
> qemu_aio_flush() doesn't guarantee the order, what you pointed
> out would be problematic. So in the final synchronization, the
> state should be,
>
> virtio-net: last_avail_idx 2 inuse 0
> event-tap: {}
>
> where A,B were completed in order.
>
> Yoshi

It might be better to discuss block because that's where
requests can complete out of order.

So let me see if I understand:
- each command passed to event tap is queued by it,
  it is not passed directly to the backend
- later requests are passed to the backend,
  always in the same order that they were submitted
- each synchronization point flushes all requests
  passed to the backend so far
- each synchronization transfers all requests not passed to the backend,
  to the remote, and they are replayed there

Now to analyse this for correctness I am looking at the original patch
because it is smaller so easier to analyse and I think it is
functionally equivalent, correct me if I am wrong in this.

So the reason there's no out of order issue is this
(and might be a good thing to put in commit log
or a comment somewhere):

At the point of the save callback, event tap has already flushed
the commands passed to the backend. Thus at the point of
the save callback, if a command has completed,
all previous commands have been flushed and completed.

Therefore inuse is
in fact the # of requests passed to event tap but not yet
passed to the backend (for non-event tap case all commands are
passed to the backend immediately and because of this
inuse is 0) and these are the last inuse commands submitted.

Right?

Now a question:

When we pass last_avail_idx - inuse to the remote,
the remote virtio will resubmit the request.
Since the request is also passed by event tap, we get
the request twice, why is this not a problem?

> >
> >> >
> >> >> >
> >> >> >>
> >> >> >>> I'm wondering why
> >> >> >>> last_avail_idx is OK to send but not inuse.
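
The double-submission question asked above (saving last_avail_idx - inuse while event tap also transfers its queue) can be put into a toy calculation. This is a sketch of the concern only, not a description of how Kemari actually behaves: `saved_avail_idx` and `remote_submissions` are hypothetical helpers, and the model assumes that virtio on the remote re-pops everything from the saved index and that event tap independently replays its transferred queue.

```c
#include <assert.h>
#include <stdint.h>

/* Index written by the save handler in the smaller patch:
 * last_avail_idx - inuse, computed modulo 2^16 like the ring indices. */
static uint16_t saved_avail_idx(uint16_t last_avail_idx, uint16_t inuse)
{
    return (uint16_t)(last_avail_idx - inuse);
}

/*
 * Hypothetical count of how often the request at ring index idx would be
 * submitted on the remote, ASSUMING:
 *  (a) virtio re-pops every index in [saved_idx, avail_idx), and
 *  (b) event tap replays its own queue, which holds the last `queued`
 *      requests, i.e. indices [avail_idx - queued, avail_idx).
 */
static unsigned remote_submissions(uint16_t idx, uint16_t saved_idx,
                                   uint16_t avail_idx, uint16_t queued)
{
    unsigned n = 0;

    /* Event tap cannot hold more than what is outstanding since saved_idx. */
    assert(queued <= (uint16_t)(avail_idx - saved_idx));

    /* (a) re-popped by virtio; unsigned subtraction keeps this wrap-safe */
    if ((uint16_t)(idx - saved_idx) < (uint16_t)(avail_idx - saved_idx))
        n++;

    /* (b) replayed from the transferred event-tap queue */
    uint16_t first_queued = (uint16_t)(avail_idx - queued);
    if ((uint16_t)(idx - first_queued) < queued)
        n++;

    return n;
}
```

With A at index 0 already completed and B at index 1 still queued (last_avail_idx 2, inuse 1 in the smaller patch), the saved index is 1 and the model counts B twice on the remote. Whether the second submission really happens depends on event tap, which is exactly the open question.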
> >> >> >>
> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
> >> >> >> our internal implementation, but it does *also* express
> >> >> >> a guest observable state.
> >> >> >>
> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
> >> >> >> there is no way to detect that a specific request has already been
> >> >> >> completed. And the protocol forbids completing the same request twice.
> >> >> >>
> >> >> >> Our implementation always starts processing the requests
> >> >> >> in order, and since we flush outstanding requests
> >> >> >> before save, it works to just tell the remote 'process only requests
> >> >> >> after this place'.
> >> >> >>
> >> >> >> But there's no such requirement in the virtio protocol,
> >> >> >> so to be really generic we could add a bitmask of valid avail
> >> >> >> ring entries that did not complete yet. This would be
> >> >> >> the exact representation of the guest observable state.
> >> >> >> In practice we have rings of up to 512 entries.
> >> >> >> That's 64 bytes per ring, not a lot at all.
> >> >> >>
> >> >> >> However, if we ever do change the protocol to send the bitmask,
> >> >> >> we would need some code to resubmit requests
> >> >> >> out of order, so it's not trivial.
> >> >> >>
> >> >> >> Another minor mistake with last_avail_idx is that it has
> >> >> >> some redundancy: the high bits in the index
> >> >> >> (> vq size) are not necessary as they can be
> >> >> >> got from avail idx. There's a consistency check
> >> >> >> in load but we really should try to use formats
> >> >> >> that are always consistent.
> >> >> >>
> >> >> >>> The following patch does the same thing as original, yet
> >> >> >>> keeps the format of the virtio. It shouldn't break live
> >> >> >>> migration either because inuse should be 0.
> >> >> >>>
> >> >> >>> Yoshi
> >> >> >>
> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
> >> >> >> And if not, how do you handle the fact that some requests
> >> >> >> are in flight on the primary?
> >> >> >
> >> >> > Although we try flushing requests one by one making inuse 0,
> >> >> > there are cases when it fails over to the secondary when inuse
> >> >> > isn't 0. We handle these in-flight requests on the primary by
> >> >> > replaying on the secondary.
> >> >> >
> >> >> >>
> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >>> index c8a0fc6..875c7ca 100644
> >> >> >>> --- a/hw/virtio.c
> >> >> >>> +++ b/hw/virtio.c
> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >>>      qemu_put_be32(f, i);
> >> >> >>>
> >> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >> >> >>> +        uint16_t last_avail_idx;
> >> >> >>> +
> >> >> >>>          if (vdev->vq[i].vring.num == 0)
> >> >> >>>              break;
> >> >> >>>
> >> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >> >> >>> +
> >> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
> >> >> >>>          if (vdev->binding->save_queue)
> >> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >>>      }
> >> >> >>>
> >> >> >>>
> >> >> >>
> >> >> >> This looks wrong to me. Requests can complete in any order, can they
> >> >> >> not? So if request 0 did not complete and request 1 did,
> >> >> >> you send avail - inuse and on the secondary you will process and
> >> >> >> complete request 1 the second time, crashing the guest.
> >> >> >
> >> >> > In case of Kemari, no. We sit between devices and net/block, and
> >> >> > queue the requests. After completing each transaction, we flush
> >> >> > the requests one by one. So there won't be completion inversion,
> >> >> > and therefore won't be visible to the guest.
> >> >> >
> >> >> > Yoshi
> >> >> >
> >> >> >>
> >> >> >>>
> >> >> >>> >
> >> >> >>> >> >
> >> >> >>> >> >> ---
> >> >> >>> >> >>  hw/virtio.c |    8 +++++++-
> >> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
> >> >> >>> >> >>
> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >>> >> >> index 849a60f..5509644 100644
> >> >> >>> >> >> --- a/hw/virtio.c
> >> >> >>> >> >> +++ b/hw/virtio.c
> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >> >>> >> >>      VRing vring;
> >> >> >>> >> >>      target_phys_addr_t pa;
> >> >> >>> >> >>      uint16_t last_avail_idx;
> >> >> >>> >> >> -    int inuse;
> >> >> >>> >> >> +    uint16_t inuse;
> >> >> >>> >> >>      uint16_t vector;
> >> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >> >>> >> >>      VirtIODevice *vdev;
> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >> >>> >> >>          if (vdev->binding->save_queue)
> >> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >>> >> >>      }
> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
> >> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >> >>> >> >> +
> >> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
> >> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >> >>> >> >> +        vdev->vq[i].inuse = 0;
> >> >> >>> >> >>
> >> >> >>> >> >>          if (vdev->vq[i].pa) {
> >> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
> >> >> >>> >> >> --
> >> >> >>> >> >> 1.7.1.2
> >> >> >>> >> >>
> >> >> >>> >> >> --
> >> >> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >>> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> >>> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
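
The bitmask idea raised in this thread (one bit per avail ring entry that has not completed yet, 64 bytes for a 512-entry ring) could look roughly like the sketch below. Everything here is an assumption made for illustration: `struct toy_inflight` and the `toy_inflight_*` helpers are hypothetical, nothing like them exists in the patches quoted above, and the sketch does not cover resubmitting requests out of order on the remote.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_RING_MAX   512                 /* ring size quoted in the thread */
#define TOY_MASK_BYTES (TOY_RING_MAX / 8)  /* 512 bits = 64 bytes */

/* Hypothetical per-ring record of entries popped but not yet completed. */
struct toy_inflight {
    uint8_t bits[TOY_MASK_BYTES];
};

/* Start with no entry marked as in flight. */
static void toy_inflight_init(struct toy_inflight *m)
{
    memset(m->bits, 0, sizeof(m->bits));
}

/* Mark a ring slot as pending (on pop) or done (on completion). */
static void toy_inflight_set(struct toy_inflight *m, uint16_t slot, bool pending)
{
    assert(slot < TOY_RING_MAX);
    if (pending)
        m->bits[slot / 8] |= (uint8_t)(1u << (slot % 8));
    else
        m->bits[slot / 8] &= (uint8_t)~(1u << (slot % 8));
}

/* Report whether a ring slot is still in flight. */
static bool toy_inflight_test(const struct toy_inflight *m, uint16_t slot)
{
    assert(slot < TOY_RING_MAX);
    return (m->bits[slot / 8] >> (slot % 8)) & 1u;
}
```

In the A/B example, marking slots 0 and 1 on pop and clearing slot 1 when B completes leaves only A marked, which is the information a single index cannot carry.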