Date: Sun, 26 Dec 2010 14:17:49 +0200
From: "Michael S. Tsirkin"
Subject: Re: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
Message-ID: <20101226121749.GB1926@redhat.com>
References: <20101216095140.GB19495@redhat.com> <20101216144010.GA25333@redhat.com>
 <20101224092710.GA23271@redhat.com> <20101226104919.GB32000@redhat.com>
 <20101226120151.GA1926@redhat.com>
List-Id: qemu-devel.nongnu.org
To: Yoshiaki Tamura
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com, kvm@vger.kernel.org,
 Marcelo Tosatti, ohmura.kei@lab.ntt.co.jp, qemu-devel@nongnu.org, avi@redhat.com,
 vatsa@linux.vnet.ibm.com, psuriset@linux.vnet.ibm.com, stefanha@linux.vnet.ibm.com

On Sun, Dec 26, 2010 at 09:16:28PM +0900, Yoshiaki Tamura wrote:
> 2010/12/26 Michael S. Tsirkin :
> > On Sun, Dec 26, 2010 at 07:57:52PM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/26 Michael S. Tsirkin :
> >> > On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/24 Michael S. Tsirkin :
> >> >> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> >> >> >> 2010/12/16 Michael S. Tsirkin :
> >> >> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> 2010/12/16 Michael S. Tsirkin :
> >> >> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> 2010/12/3 Yoshiaki Tamura :
> >> >> >> >> >> > 2010/12/2 Michael S. Tsirkin :
> >> >> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> >>> 2010/11/28 Michael S. Tsirkin :
> >> >> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin :
> >> >> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load handle it, and revert
> >> >> >> >> >> >>> >> >> last_avail_idx with inuse if there is outstanding emulation.
> >> >> >> >> >> >>> >> >>
> >> >> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura
> >> >> >> >> >> >>> >> >
> >> >> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >> >> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
> >> >> >> >> >> >>> >> > is to flush outstanding
> >> >> >> >> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
> >> >> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >> >> >> >>> >>
> >> >> >> >> >> >>> >> Could you give me the link of your patches?  I'd like to test
> >> >> >> >> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
> >> >> >> >> >> >>> >> happy to drop this patch.
> >> >> >> >> >> >>> >>
> >> >> >> >> >> >>> >> Yoshi
> >> >> >> >> >> >>> >
> >> >> >> >> >> >>> > Look for this:
> >> >> >> >> >> >>> > stable migration image on a stopped vm
> >> >> >> >> >> >>> > sent on:
> >> >> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> Thanks for the info.
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> However, the patch series above didn't solve the issue.  In
> >> >> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >> >> >> >>> between Primary and Secondary.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >> >> >> >
> >> >> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> >> >> >> > and update the external when inuse is decremented.  I'll try
> >> >> >> >> >> > whether it works w/ and w/o Kemari.
> >> >> >> >> >>
> >> >> >> >> >> Hi Michael,
> >> >> >> >> >>
> >> >> >> >> >> Could you please take a look at the following patch?
> >> >> >> >> >
> >> >> >> >> > Which version is this against?
> >> >> >> >>
> >> >> >> >> Oops.  It should be very old.
> >> >> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >> >> >> >>
> >> >> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> >> >> >> Author: Yoshiaki Tamura
> >> >> >> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
> >> >> >> >> >>
> >> >> >> >> >>     virtio: update last_avail_idx when inuse is decreased.
> >> >> >> >> >>
> >> >> >> >> >>     Signed-off-by: Yoshiaki Tamura
> >> >> >> >> >
> >> >> >> >> > It would be better to have a commit description explaining why a change
> >> >> >> >> > is made, and why it is correct, not just repeating what can be seen from
> >> >> >> >> > the diff anyway.
> >> >> >> >>
> >> >> >> >> Sorry for being lazy here.
> >> >> >> >>
> >> >> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >> >> index c8a0fc6..6688c02 100644
> >> >> >> >> >> --- a/hw/virtio.c
> >> >> >> >> >> +++ b/hw/virtio.c
> >> >> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >> >> >> >>      wmb();
> >> >> >> >> >>      trace_virtqueue_flush(vq, count);
> >> >> >> >> >>      vring_used_idx_increment(vq, count);
> >> >> >> >> >> +    vq->last_avail_idx += count;
> >> >> >> >> >>      vq->inuse -= count;
> >> >> >> >> >>  }
> >> >> >> >> >>
> >> >> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> >> >>      unsigned int i, head, max;
> >> >> >> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
> >> >> >> >> >>
> >> >> >> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> >> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >> >> >> >>          return 0;
> >> >> >> >> >>
> >> >> >> >> >>      /* When we start there are none of either input nor output. */
> >> >> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> >> >>
> >> >> >> >> >>      max = vq->vring.num;
> >> >> >> >> >>
> >> >> >> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> >> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >> >> >> >>
> >> >> >> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >> >> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >> >> >> >>
> >> >> >> >> I think there are two problems.
> >> >> >> >>
> >> >> >> >> 1. When to update last_avail_idx.
> >> >> >> >> 2. The ordering issue you're mentioning below.
> >> >> >> >>
> >> >> >> >> The patch above is only trying to address 1 because last time you
> >> >> >> >> mentioned that modifying last_avail_idx upon save may break the
> >> >> >> >> guest, with which I agree.  If virtio_queue_empty and
> >> >> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> >> >> >> to the guest, I guess the approach above can be applied too.
> >> >> >> >
> >> >> >> > So IMHO 2 is the real issue. This is what was problematic
> >> >> >> > with the save patch, otherwise of course changes in save
> >> >> >> > are better than changes all over the codebase.
> >> >> >>
> >> >> >> All right.  Then let's focus on 2 first.
> >> >> >>
> >> >> >> >> > Previous patch version sure looked simpler, and this seems functionally
> >> >> >> >> > equivalent, so my question still stands: here it is rephrased in a
> >> >> >> >> > different way:
> >> >> >> >> >
> >> >> >> >> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >> >> >> >
> >> >> >> >> >        host pops A, then B, then completes B and flushes
> >> >> >> >> >
> >> >> >> >> >        now with this patch last_avail_idx will be 1, and then
> >> >> >> >> >        remote will get it, it will execute B again. As a result
> >> >> >> >> >        B will complete twice, and apparently A will never complete.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > This is what I was saying below: assuming that there are
> >> >> >> >> > outstanding requests when we migrate, there is no way
> >> >> >> >> > a single index can be enough to figure out which requests
> >> >> >> >> > need to be handled and which are in flight already.
> >> >> >> >> >
> >> >> >> >> > We must add some kind of bitmask to tell us which is which.
> >> >> >> >>
> >> >> >> >> I should understand why this inversion can happen before solving
> >> >> >> >> the issue.
> >> >> >> >
> >> >> >> > It's a fundamental thing in virtio.
> >> >> >> > I think it is currently only likely to happen with block, I think tap
> >> >> >> > currently completes things in order.  In any case relying on this in the
> >> >> >> > frontend is a mistake.
> >> >> >> >
> >> >> >> >>  Currently, how are you making virtio-net flush
> >> >> >> >> every request for live migration?  Is it qemu_aio_flush()?
> >> >> >> >
> >> >> >> > Think so.
> >> >> >>
> >> >> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> >> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> >> >> As I described in the previous message, Kemari queues the
> >> >> >> requests first.  So in your example above, it should start with
> >> >> >>
> >> >> >> virtio-net: last_avail_idx 0 inuse 2
> >> >> >> event-tap: {A,B}
> >> >> >>
> >> >> >> As you know, the requests are still in order because the net
> >> >> >> layer initiates in order.  Not about completing.
> >> >> >>
> >> >> >> In the first synchronization, the status above is transferred.  In
> >> >> >> the next synchronization, the status will be as follows.
> >> >> >>
> >> >> >> virtio-net: last_avail_idx 1 inuse 1
> >> >> >> event-tap: {B}
> >> >> >
> >> >> > OK, this answers the ordering question.
> >> >>
> >> >> Glad to hear that!
> >> >>
> >> >> > Another question: at this point we transfer this status: both
> >> >> > event-tap and virtio ring have the command B,
> >> >> > so the remote will have:
> >> >> >
> >> >> > virtio-net: inuse 0
> >> >> > event-tap: {B}
> >> >> >
> >> >> > Is this right? This already seems to be a problem as when B completes
> >> >> > inuse will go negative?
> >> >>
> >> >> I think the state above is wrong.  inuse 0 means there shouldn't be
> >> >> any requests in event-tap.  Note that the callback is called only
> >> >> when event-tap flushes the requests.
> >> >>
> >> >> > Next it seems that the remote virtio will resubmit B to event-tap. The
> >> >> > remote will then have:
> >> >> >
> >> >> > virtio-net: inuse 1
> >> >> > event-tap: {B, B}
> >> >> >
> >> >> > This looks kind of wrong ... will two packets go out?
> >> >>
> >> >> No.  Currently, we're just replaying the requests with pio/mmio.
> >> >
> >> > You do?  What purpose do the hooks in bdrv/net serve then?
> >> > A placeholder for the future?
> >>
> >> Not only for that reason.  The hooks in bdrv/net are the main
> >> function that queues requests and starts synchronization.
> >> pio/mmio hooks are there for recording what initiated the
> >> requests monitored in bdrv/net layer.  I would like to remove
> >> the pio/mmio part if we could make bdrv/net level replay possible.
> >>
> >> Yoshi
> >
> > I think I begin to see. So when event-tap does a replay,
> > we will probably need to pass the inuse value.
>
> Completely correct.
>
> > But since we generally don't try to support new->old
> > cross-version migrations in qemu, my guess is that
> > it is better not to change the format in anticipation
> > right now.
>
> I agree.
>
> > So basically for now we just need to add a comment explaining
> > the reason for moving last_avail_idx back.
> > Does something like the below (completely untested) make sense?
>
> Yes, it does. Thank you for putting a decent comment. Can I put
> the patch into my series as is?
>
> Yoshi

Sure.

> >
> > Signed-off-by: Michael S. Tsirkin
> >
> > diff --git a/hw/virtio.c b/hw/virtio.c
> > index 07dbf86..d1509f28 100644
> > --- a/hw/virtio.c
> > +++ b/hw/virtio.c
> > @@ -665,12 +665,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >     qemu_put_be32(f, i);
> >
> >     for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> > +        /* For regular migration inuse == 0 always as
> > +         * requests are flushed before save. However,
> > +         * event-tap log when enabled introduces an extra
> > +         * queue for requests which is not being flushed,
> > +         * thus the last inuse requests are left in the event-tap queue.
> > +         * Move the last_avail_idx value sent to the remote back
> > +         * to make it repeat the last inuse requests. */
> > +        uint16_t last_avail = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >         if (vdev->vq[i].vring.num == 0)
> >             break;
> >
> >         qemu_put_be32(f, vdev->vq[i].vring.num);
> >         qemu_put_be64(f, vdev->vq[i].pa);
> > -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> > +        qemu_put_be16s(f, &last_avail);
> >         if (vdev->binding->save_queue)
> >             vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >     }
> >
> >
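
For anyone following along, here is a minimal standalone sketch of the arithmetic the patch above relies on. It is not QEMU code: ToyQueue, toy_pop, toy_flush and toy_saved_last_avail are hypothetical names made up for illustration, and the model assumes requests complete in the order they were popped (as discussed for event-tap/net above; it does not cover the out-of-order block case).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two VirtQueue fields discussed in this
 * thread. This is an illustration only, not QEMU's VirtQueue. */
typedef struct ToyQueue {
    uint16_t last_avail_idx; /* next avail-ring slot the host will pop */
    uint16_t inuse;          /* requests popped but not yet flushed */
} ToyQueue;

/* Pop one request: the index advances and the request becomes in flight.
 * The uint16_t index wraps modulo 65536, like the ring index. */
static void toy_pop(ToyQueue *q)
{
    q->last_avail_idx++;
    q->inuse++;
}

/* Flush `count` completed requests back to the guest. */
static void toy_flush(ToyQueue *q, uint16_t count)
{
    assert(count <= q->inuse);
    q->inuse -= count;
}

/* Value that would be written to the migration stream: rewind the index
 * by the number of in-flight requests so the remote pops them again.
 * The cast makes the modulo-65536 wraparound explicit. */
static uint16_t toy_saved_last_avail(const ToyQueue *q)
{
    return (uint16_t)(q->last_avail_idx - q->inuse);
}
```

With the A/B example from earlier in the thread: after popping A and B the queue holds last_avail_idx 2, inuse 2, so the saved index is 0 and the remote replays both. Once A is flushed the saved index becomes 1 and only B is replayed. If the index has just wrapped (say last_avail_idx 1, inuse 2), the saved value is 65535, which is why the subtraction is done in uint16_t.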