From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=46360 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1PT8Nx-0007A6-Vs
	for qemu-devel@nongnu.org; Thu, 16 Dec 2010 02:36:23 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <tamura.yoshiaki@gmail.com>) id 1PT8Nw-0001C6-7F
	for qemu-devel@nongnu.org; Thu, 16 Dec 2010 02:36:21 -0500
Received: from mail-wy0-f173.google.com ([74.125.82.173]:50278)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <tamura.yoshiaki@gmail.com>) id 1PT8Nv-0001C0-Q7
	for qemu-devel@nongnu.org; Thu, 16 Dec 2010 02:36:20 -0500
Received: by wyg36 with SMTP id 36so2413877wyg.4
	for <qemu-devel@nongnu.org>; Wed, 15 Dec 2010 23:36:18 -0800 (PST)
MIME-Version: 1.0
Sender: tamura.yoshiaki@gmail.com
In-Reply-To: <AANLkTimAkmqtAP4e_rvjc_NAsr7D86L3pD64HtqXa7DD@mail.gmail.com>
References: <1290665220-26478-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
	<1290665220-26478-6-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
	<20101128092857.GA3342@redhat.com>
	<AANLkTinP6o-kmRExEWguhHUkjXe52VKsoyQgq5CqduPO@mail.gmail.com>
	<20101128114627.GC4499@redhat.com>
	<AANLkTim2CojXtyGVDNngzaYMCXfan3kkS=thDE9i-yT=@mail.gmail.com>
	<20101202120213.GA2454@redhat.com>
	<AANLkTimAkmqtAP4e_rvjc_NAsr7D86L3pD64HtqXa7DD@mail.gmail.com>
Date: Thu, 16 Dec 2010 16:36:16 +0900
Message-ID: <AANLkTimvsozOXJpwyUYAqWBKLFsY==x8AzCkJ4CapgTg@mail.gmail.com>
From: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Subject: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to
	handle inuse varialble.
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com, kvm@vger.kernel.org, ohmura.kei@lab.ntt.co.jp, Marcelo Tosatti <mtosatti@redhat.com>, qemu-devel@nongnu.org, vatsa@linux.vnet.ibm.com, avi@redhat.com, psuriset@linux.vnet.ibm.com, stefanha@linux.vnet.ibm.com

2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>>> >> >> Modify inuse type to uint16_t, let save/load handle it, and revert
>>> >> >> last_avail_idx by inuse if there is outstanding emulation.
>>> >> >>
>>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> >> >
>>> >> > This changes migration format, so it will break compatibility with
>>> >> > existing drivers. More generally, I think migrating internal
>>> >> > state that is not guest visible is always a mistake
>>> >> > as it ties migration format to an internal implementation
>>> >> > (yes, I know we do this sometimes, but we should at least
>>> >> > try not to add such cases).  I think the right thing to do in this case
>>> >> > is to flush outstanding
>>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
>>> >> > I sent patches that do this for virtio net and block.
>>> >>
>>> >> Could you give me a link to your patches?  I'd like to test
>>> >> whether they work with Kemari upon failover.  If they do, I'm
>>> >> happy to drop this patch.
>>> >>
>>> >> Yoshi
>>> >
>>> > Look for this:
>>> > stable migration image on a stopped vm
>>> > sent on:
>>> > Wed, 24 Nov 2010 17:52:49 +0200
>>>
>>> Thanks for the info.
>>>
>>> However, the patch series above didn't solve the issue.  In
>>> the case of Kemari, inuse is mostly > 0 because it queues the
>>> output, and since last_avail_idx gets incremented
>>> immediately, not sending inuse makes the state inconsistent
>>> between Primary and Secondary.
>>
>> Hmm. Can we simply avoid incrementing last_avail_idx?
>
> I think we can calculate or prepare an internal last_avail_idx,
> and update the external one when inuse is decremented.  I'll test
> whether it works with and without Kemari.

Hi Michael,

Could you please take a look at the following patch?

commit 36ee7910059e6b236fe9467a609f5b4aed866912
Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Date:   Thu Dec 16 14:50:54 2010 +0900

    virtio: update last_avail_idx when inuse is decreased.

    Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>

diff --git a/hw/virtio.c b/hw/virtio.c
index c8a0fc6..6688c02 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
     wmb();
     trace_virtqueue_flush(vq, count);
     vring_used_idx_increment(vq, count);
+    vq->last_avail_idx += count;
     vq->inuse -= count;
 }

@@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
     unsigned int i, head, max;
     target_phys_addr_t desc_pa = vq->vring.desc;

-    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
+    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
         return 0;

     /* When we start there are none of either input nor output. */
@@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)

     max = vq->vring.num;

-    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
+    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);

     if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
         if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
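
To make the index arithmetic in the patch above easier to check, here is
a minimal sketch of the intended behaviour.  It is a toy model and not
the QEMU code: ToyQueue, toy_num_heads(), toy_pop() and toy_flush() are
hypothetical names, and it assumes requests are flushed strictly in
order (as Kemari does), which virtio itself does not require.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (hypothetical): only the counters discussed in this thread. */
typedef struct ToyQueue {
    uint16_t avail_idx;      /* requests published by the guest so far */
    uint16_t last_avail_idx; /* with the patch: oldest request not yet flushed */
    uint16_t inuse;          /* requests popped but not yet flushed */
} ToyQueue;

/* Published requests that have not been popped yet (16-bit wraparound). */
static uint16_t toy_num_heads(const ToyQueue *q)
{
    uint16_t next = (uint16_t)(q->last_avail_idx + q->inuse);
    return (uint16_t)(q->avail_idx - next);
}

/* Pop one request: returns 1 and stores its avail index, or 0 if none. */
static int toy_pop(ToyQueue *q, uint16_t *idx)
{
    if (toy_num_heads(q) == 0) {
        return 0;
    }
    /* last_avail_idx is left alone; only inuse moves at pop time. */
    *idx = (uint16_t)(q->last_avail_idx + q->inuse);
    q->inuse++;
    return 1;
}

/* Complete `count` requests in order; last_avail_idx advances only here. */
static void toy_flush(ToyQueue *q, uint16_t count)
{
    assert(count <= q->inuse);
    q->last_avail_idx = (uint16_t)(q->last_avail_idx + count);
    q->inuse = (uint16_t)(q->inuse - count);
}
```

In this model, with avail_idx = 3 and two requests popped,
last_avail_idx stays at 0 until toy_flush() runs, so a save taken at
that point would tell the secondary to start again from request 0 and
replay both in-flight requests.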


>
>>
>>> I'm wondering why
>>> last_avail_idx is OK to send but not inuse.
>>
>> last_avail_idx is at some level a mistake, it exposes part of
>> our internal implementation, but it does *also* express
>> a guest observable state.
>>
>> Here's the problem that it solves: just looking at the rings in virtio
>> there is no way to detect that a specific request has already been
>> completed. And the protocol forbids completing the same request twice.
>>
>> Our implementation always starts processing the requests
>> in order, and since we flush outstanding requests
>> before save, it works to just tell the remote 'process only requests
>> after this place'.
>>
>> But there's no such requirement in the virtio protocol,
>> so to be really generic we could add a bitmask of valid avail
>> ring entries that did not complete yet. This would be
>> the exact representation of the guest observable state.
>> In practice we have rings of up to 512 entries.
>> That's 64 bytes per ring, not a lot at all.
>>
>> However, if we ever do change the protocol to send the bitmask,
>> we would need some code to resubmit requests
>> out of order, so it's not trivial.
>>
>> Another minor mistake with last_avail_idx is that it has
>> some redundancy: the high bits in the index
>> (> vq size) are not necessary as they can be
>> got from avail idx.  There's a consistency check
>> in load but we really should try to use formats
>> that are always consistent.
>>
>>> The following patch does the same thing as original, yet
>>> keeps the format of the virtio.  It shouldn't break live
>>> migration either because inuse should be 0.
>>>
>>> Yoshi
>>
>> Question is, can you flush to make inuse 0 in kemari too?
>> And if not, how do you handle the fact that some requests
>> are in flight on the primary?
>
> Although we try flushing requests one by one to make inuse 0,
> there are cases where it fails over to the secondary while inuse
> isn't 0.  We handle these in-flight requests on the primary by
> replaying them on the secondary.
>
>>
>>> diff --git a/hw/virtio.c b/hw/virtio.c
>>> index c8a0fc6..875c7ca 100644
>>> --- a/hw/virtio.c
>>> +++ b/hw/virtio.c
>>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>>      qemu_put_be32(f, i);
>>>
>>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>>> +        uint16_t last_avail_idx;
>>> +
>>>          if (vdev->vq[i].vring.num == 0)
>>>              break;
>>>
>>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>>> +
>>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>>>          qemu_put_be64(f, vdev->vq[i].pa);
>>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> +        qemu_put_be16s(f, &last_avail_idx);
>>>          if (vdev->binding->save_queue)
>>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>>      }
>>>
>>>
>>
>> This looks wrong to me.  Requests can complete in any order, can they
>> not?  So if request 0 did not complete and request 1 did,
>> you send avail - inuse and on the secondary you will process and
>> complete request 1 the second time, crashing the guest.
>
> In the case of Kemari, no.  We sit between devices and net/block, and
> queue the requests.  After completing each transaction, we flush
> the requests one by one.  So there won't be any completion inversion,
> and therefore nothing of the kind will be visible to the guest.
>
> Yoshi
>
>>
>>>
>>> >
>>> >> >
>>> >> >> ---
>>> >> >>  hw/virtio.c |    8 +++++++-
>>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>>> >> >>
>>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> >> index 849a60f..5509644 100644
>>> >> >> --- a/hw/virtio.c
>>> >> >> +++ b/hw/virtio.c
>>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>>> >> >>      VRing vring;
>>> >> >>      target_phys_addr_t pa;
>>> >> >>      uint16_t last_avail_idx;
>>> >> >> -    int inuse;
>>> >> >> +    uint16_t inuse;
>>> >> >>      uint16_t vector;
>>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>>> >> >>      VirtIODevice *vdev;
>>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>>> >> >>          if (vdev->binding->save_queue)
>>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> >> >>      }
>>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>>> >> >> +
>>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
>>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>>> >> >> +        vdev->vq[i].inuse = 0;
>>> >> >>
>>> >> >>          if (vdev->vq[i].pa) {
>>> >> >>              virtqueue_init(&vdev->vq[i]);
>>> >> >> --
>>> >> >> 1.7.1.2
>>> >> >>
>>> >> >> --
>>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> >> >> the body of a message to majordomo@vger.kernel.org
>>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >
>>> >
>>
>