From: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com,
kvm@vger.kernel.org, ohmura.kei@lab.ntt.co.jp,
Marcelo Tosatti <mtosatti@redhat.com>,
qemu-devel@nongnu.org, vatsa@linux.vnet.ibm.com, avi@redhat.com,
psuriset@linux.vnet.ibm.com, stefanha@linux.vnet.ibm.com
Subject: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
Date: Thu, 16 Dec 2010 16:36:16 +0900 [thread overview]
Message-ID: <AANLkTimvsozOXJpwyUYAqWBKLFsY==x8AzCkJ4CapgTg@mail.gmail.com> (raw)
In-Reply-To: <AANLkTimAkmqtAP4e_rvjc_NAsr7D86L3pD64HtqXa7DD@mail.gmail.com>
2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>>> >> >>
>>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> >> >
>>> >> > This changes migration format, so it will break compatibility with
>>> >> > existing drivers. More generally, I think migrating internal
>>> >> > state that is not guest visible is always a mistake
>>> >> > as it ties migration format to an internal implementation
>>> >> > (yes, I know we do this sometimes, but we should at least
>>> >> > try not to add such cases). I think the right thing to do in this case
>>> >> > is to flush outstanding
>>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
>>> >> > I sent patches that do this for virtio net and block.
>>> >>
>>> >> Could you give me the link of your patches? I'd like to test
>>> >> whether they work with Kemari upon failover. If they do, I'm
>>> >> happy to drop this patch.
>>> >>
>>> >> Yoshi
>>> >
>>> > Look for this:
>>> > stable migration image on a stopped vm
>>> > sent on:
>>> > Wed, 24 Nov 2010 17:52:49 +0200
>>>
>>> Thanks for the info.
>>>
>>> However, The patch series above didn't solve the issue. In
>>> case of Kemari, inuse is mostly > 0 because it queues the
>>> output, and while last_avail_idx gets incremented
>>> immediately, not sending inuse makes the state inconsistent
>>> between Primary and Secondary.
>>
>> Hmm. Can we simply avoid incrementing last_avail_idx?
>
> I think we can calculate or prepare an internal last_avail_idx,
> and update the external when inuse is decremented. I'll try
> whether it work w/ w/o Kemari.
Hi Michael,
Could you please take a look at the following patch?
commit 36ee7910059e6b236fe9467a609f5b4aed866912
Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Date: Thu Dec 16 14:50:54 2010 +0900
virtio: update last_avail_idx when inuse is decreased.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
diff --git a/hw/virtio.c b/hw/virtio.c
index c8a0fc6..6688c02 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
wmb();
trace_virtqueue_flush(vq, count);
vring_used_idx_increment(vq, count);
+ vq->last_avail_idx += count;
vq->inuse -= count;
}
@@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
unsigned int i, head, max;
target_phys_addr_t desc_pa = vq->vring.desc;
- if (!virtqueue_num_heads(vq, vq->last_avail_idx))
+ if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
return 0;
/* When we start there are none of either input nor output. */
@@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
max = vq->vring.num;
- i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
+ i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>
>>
>>> I'm wondering why
>>> last_avail_idx is OK to send but not inuse.
>>
>> last_avail_idx is at some level a mistake, it exposes part of
>> our internal implementation, but it does *also* express
>> a guest observable state.
>>
>> Here's the problem that it solves: just looking at the rings in virtio
>> there is no way to detect that a specific request has already been
>> completed. And the protocol forbids completing the same request twice.
>>
>> Our implementation always starts processing the requests
>> in order, and since we flush outstanding requests
>> before save, it works to just tell the remote 'process only requests
>> after this place'.
>>
>> But there's no such requirement in the virtio protocol,
>> so to be really generic we could add a bitmask of valid avail
>> ring entries that did not complete yet. This would be
>> the exact representation of the guest observable state.
>> In practice we have rings of up to 512 entries.
>> That's 64 byte per ring, not a lot at all.
>>
>> However, if we ever do change the protocol to send the bitmask,
>> we would need some code to resubmit requests
>> out of order, so it's not trivial.
>>
>> Another minor mistake with last_avail_idx is that it has
>> some redundancy: the high bits in the index
>> (> vq size) are not necessary as they can be
>> got from avail idx. There's a consistency check
>> in load but we really should try to use formats
>> that are always consistent.
>>
>>> The following patch does the same thing as original, yet
>>> keeps the format of the virtio. It shouldn't break live
>>> migration either because inuse should be 0.
>>>
>>> Yoshi
>>
>> Question is, can you flush to make inuse 0 in kemari too?
>> And if not, how do you handle the fact that some requests
>> are in flight on the primary?
>
> Although we try flushing requests one by one making inuse 0,
> there are cases when it failovers to the secondary when inuse
> isn't 0. We handle these in flight request on the primary by
> replaying on the secondary.
>
>>
>>> diff --git a/hw/virtio.c b/hw/virtio.c
>>> index c8a0fc6..875c7ca 100644
>>> --- a/hw/virtio.c
>>> +++ b/hw/virtio.c
>>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> qemu_put_be32(f, i);
>>>
>>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>>> + uint16_t last_avail_idx;
>>> +
>>> if (vdev->vq[i].vring.num == 0)
>>> break;
>>>
>>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>>> +
>>> qemu_put_be32(f, vdev->vq[i].vring.num);
>>> qemu_put_be64(f, vdev->vq[i].pa);
>>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> + qemu_put_be16s(f, &last_avail_idx);
>>> if (vdev->binding->save_queue)
>>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> }
>>>
>>>
>>
>> This looks wrong to me. Requests can complete in any order, can they
>> not? So if request 0 did not complete and request 1 did not,
>> you send avail - inuse and on the secondary you will process and
>> complete request 1 the second time, crashing the guest.
>
> In case of Kemari, no. We sit between devices and net/block, and
> queue the requests. After completing each transaction, we flush
> the requests one by one. So there won't be completion inversion,
> and therefore won't be visible to the guest.
>
> Yoshi
>
>>
>>>
>>> >
>>> >> >
>>> >> >> ---
>>> >> >> hw/virtio.c | 8 +++++++-
>>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
>>> >> >>
>>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> >> index 849a60f..5509644 100644
>>> >> >> --- a/hw/virtio.c
>>> >> >> +++ b/hw/virtio.c
>>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>>> >> >> VRing vring;
>>> >> >> target_phys_addr_t pa;
>>> >> >> uint16_t last_avail_idx;
>>> >> >> - int inuse;
>>> >> >> + uint16_t inuse;
>>> >> >> uint16_t vector;
>>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>>> >> >> VirtIODevice *vdev;
>>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
>>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
>>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
>>> >> >> if (vdev->binding->save_queue)
>>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> >> >> }
>>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
>>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
>>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
>>> >> >> +
>>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
>>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>>> >> >> + vdev->vq[i].inuse = 0;
>>> >> >>
>>> >> >> if (vdev->vq[i].pa) {
>>> >> >> virtqueue_init(&vdev->vq[i]);
>>> >> >> --
>>> >> >> 1.7.1.2
>>> >> >>
>>> >> >> --
>>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> >> >> the body of a message to majordomo@vger.kernel.org
>>> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> >> > --
>>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> >> > the body of a message to majordomo@vger.kernel.org
>>> >> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> >> >
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> > the body of a message to majordomo@vger.kernel.org
>>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
next prev parent reply other threads:[~2010-12-16 7:36 UTC|newest]
Thread overview: 112+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 01/21] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 02/21] Introduce read() to FdMigrationState Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 03/21] Introduce skip_header parameter to qemu_loadvm_state() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 04/21] qemu-char: export socket_set_nodelay() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble Yoshiaki Tamura
2010-11-28 9:28 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 11:27 ` Yoshiaki Tamura
2010-11-28 11:46 ` Michael S. Tsirkin
2010-12-01 8:03 ` Yoshiaki Tamura
2010-12-02 12:02 ` Michael S. Tsirkin
2010-12-03 6:28 ` Yoshiaki Tamura
2010-12-16 7:36 ` Yoshiaki Tamura [this message]
2010-12-16 9:51 ` Michael S. Tsirkin
2010-12-16 14:28 ` Yoshiaki Tamura
2010-12-16 14:40 ` Michael S. Tsirkin
2010-12-16 15:59 ` Yoshiaki Tamura
2010-12-17 16:22 ` Yoshiaki Tamura
2010-12-24 9:27 ` Michael S. Tsirkin
2010-12-24 11:42 ` Yoshiaki Tamura
2010-12-24 13:21 ` Michael S. Tsirkin
2010-12-26 9:05 ` Michael S. Tsirkin
2010-12-26 10:14 ` Yoshiaki Tamura
2010-12-26 10:46 ` Michael S. Tsirkin
2010-12-26 10:50 ` Yoshiaki Tamura
2010-12-26 10:49 ` Michael S. Tsirkin
2010-12-26 10:57 ` Yoshiaki Tamura
2010-12-26 12:01 ` Michael S. Tsirkin
2010-12-26 12:16 ` Yoshiaki Tamura
2010-12-26 12:17 ` Michael S. Tsirkin
2010-11-25 6:06 ` [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs Yoshiaki Tamura
2010-12-08 7:03 ` Isaku Yamahata
2010-12-08 8:11 ` Yoshiaki Tamura
2010-12-08 14:22 ` Anthony Liguori
2010-11-25 6:06 ` [Qemu-devel] [PATCH 07/21] Introduce fault tolerant VM transaction QEMUFile and ft_mode Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 08/21] savevm: introduce util functions to control ft_trans_file from savevm layer Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 09/21] Introduce event-tap Yoshiaki Tamura
2010-11-29 11:00 ` [Qemu-devel] " Stefan Hajnoczi
2010-11-30 9:50 ` Yoshiaki Tamura
2010-11-30 10:04 ` Stefan Hajnoczi
2010-11-30 10:20 ` Yoshiaki Tamura
2011-01-04 11:02 ` Yoshiaki Tamura
2011-01-04 11:14 ` Stefan Hajnoczi
2011-01-04 11:19 ` Michael S. Tsirkin
2011-01-04 12:20 ` Yoshiaki Tamura
2011-01-04 13:10 ` Michael S. Tsirkin
2011-01-04 13:45 ` Yoshiaki Tamura
2011-01-04 14:42 ` Michael S. Tsirkin
2011-01-06 8:47 ` Yoshiaki Tamura
2011-01-06 9:36 ` Michael S. Tsirkin
2011-01-06 9:41 ` Yoshiaki Tamura
[not found] ` <20101130011914.GA9015@amt.cnet>
2010-11-30 9:28 ` Yoshiaki Tamura
2010-11-30 10:25 ` Marcelo Tosatti
2010-11-30 10:35 ` Yoshiaki Tamura
2010-11-30 13:11 ` Marcelo Tosatti
2010-11-25 6:06 ` [Qemu-devel] [PATCH 10/21] Call init handler of event-tap at main() in vl.c Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write() Yoshiaki Tamura
2010-11-28 9:40 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 12:00 ` Yoshiaki Tamura
2010-12-16 7:37 ` Yoshiaki Tamura
2010-12-16 9:22 ` Michael S. Tsirkin
2010-12-16 9:50 ` Yoshiaki Tamura
2010-12-16 9:54 ` Michael S. Tsirkin
2010-12-16 16:27 ` Stefan Hajnoczi
2010-12-17 16:19 ` Yoshiaki Tamura
2010-12-18 8:36 ` Stefan Hajnoczi
2010-11-25 6:06 ` [Qemu-devel] [PATCH 12/21] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy() Yoshiaki Tamura
2010-11-28 9:33 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 11:55 ` Yoshiaki Tamura
2010-11-28 12:28 ` Michael S. Tsirkin
2010-11-29 9:52 ` Kevin Wolf
2010-11-29 12:56 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 14/21] virtio-blk: replace bdrv_aio_multiwrite() with bdrv_aio_multiwrite_proxy() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 15/21] virtio-net: replace qemu_sendv_packet_async() with qemu_sendv_packet_async_proxy() Yoshiaki Tamura
2010-11-28 9:31 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 11:43 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 16/21] e1000: replace qemu_send_packet() with qemu_send_packet_proxy() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 17/21] savevm: introduce qemu_savevm_trans_{begin, commit} Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 18/21] migration: introduce migrate_ft_trans_{put, get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 19/21] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 20/21] Introduce -k option to enable FT migration mode (Kemari) Yoshiaki Tamura
2010-11-25 6:07 ` [Qemu-devel] [PATCH 21/21] migration: add a parser to accept FT migration incoming mode Yoshiaki Tamura
2010-11-26 18:39 ` [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Blue Swirl
2010-11-27 4:29 ` Yoshiaki Tamura
2010-11-27 7:23 ` Stefan Hajnoczi
2010-11-27 8:53 ` Yoshiaki Tamura
2010-11-27 11:03 ` Blue Swirl
2010-11-27 12:21 ` Yoshiaki Tamura
2010-11-27 11:54 ` Stefan Hajnoczi
2010-11-27 13:11 ` Yoshiaki Tamura
2010-11-29 10:17 ` Stefan Hajnoczi
2010-11-29 13:00 ` Paul Brook
2010-11-29 13:13 ` Yoshiaki Tamura
2010-11-29 13:19 ` Paul Brook
2010-11-29 13:41 ` Yoshiaki Tamura
2010-11-29 14:12 ` Paul Brook
2010-11-29 14:37 ` Yoshiaki Tamura
2010-11-29 14:56 ` Paul Brook
2010-11-29 15:00 ` Yoshiaki Tamura
2010-11-29 15:56 ` Paul Brook
2010-11-29 16:23 ` Stefan Hajnoczi
2010-11-29 16:41 ` Dor Laor
2010-11-29 16:53 ` Paul Brook
2010-11-29 17:05 ` Anthony Liguori
2010-11-29 17:18 ` Paul Brook
2010-11-29 17:33 ` Anthony Liguori
2010-11-30 7:13 ` Yoshiaki Tamura
2010-11-30 6:43 ` Yoshiaki Tamura
2010-11-30 9:13 ` Takuya Yoshikawa
2010-11-27 11:20 ` Paul Brook
2010-11-27 12:35 ` Yoshiaki Tamura
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='AANLkTimvsozOXJpwyUYAqWBKLFsY==x8AzCkJ4CapgTg@mail.gmail.com' \
--to=tamura.yoshiaki@lab.ntt.co.jp \
--cc=aliguori@us.ibm.com \
--cc=ananth@in.ibm.com \
--cc=avi@redhat.com \
--cc=dlaor@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=mst@redhat.com \
--cc=mtosatti@redhat.com \
--cc=ohmura.kei@lab.ntt.co.jp \
--cc=psuriset@linux.vnet.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@linux.vnet.ibm.com \
--cc=vatsa@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).