From: "Michael S. Tsirkin" <mst@redhat.com>
To: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Cc: aliguori@us.ibm.com, dlaor@redhat.com, ananth@in.ibm.com,
kvm@vger.kernel.org, ohmura.kei@lab.ntt.co.jp,
Marcelo Tosatti <mtosatti@redhat.com>,
qemu-devel@nongnu.org, vatsa@linux.vnet.ibm.com, avi@redhat.com,
psuriset@linux.vnet.ibm.com, stefanha@linux.vnet.ibm.com
Subject: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
Date: Thu, 16 Dec 2010 11:51:40 +0200 [thread overview]
Message-ID: <20101216095140.GB19495@redhat.com> (raw)
In-Reply-To: <AANLkTimvsozOXJpwyUYAqWBKLFsY==x8AzCkJ4CapgTg@mail.gmail.com>
On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >>> >> >>
> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >>> >> >
> >>> >> > This changes migration format, so it will break compatibility with
> >>> >> > existing drivers. More generally, I think migrating internal
> >>> >> > state that is not guest visible is always a mistake
> >>> >> > as it ties migration format to an internal implementation
> >>> >> > (yes, I know we do this sometimes, but we should at least
> >>> >> > try not to add such cases). I think the right thing to do in this case
> >>> >> > is to flush outstanding
> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
> >>> >> > I sent patches that do this for virtio net and block.
> >>> >>
> >>> >> Could you give me the link of your patches? I'd like to test
> >>> >> whether they work with Kemari upon failover. If they do, I'm
> >>> >> happy to drop this patch.
> >>> >>
> >>> >> Yoshi
> >>> >
> >>> > Look for this:
> >>> > stable migration image on a stopped vm
> >>> > sent on:
> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >>>
> >>> Thanks for the info.
> >>>
> >>> However, The patch series above didn't solve the issue. In
> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >>> output, and while last_avail_idx gets incremented
> >>> immediately, not sending inuse makes the state inconsistent
> >>> between Primary and Secondary.
> >>
> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >
> > I think we can calculate or prepare an internal last_avail_idx,
> > and update the external when inuse is decremented. I'll try
> > whether it work w/ w/o Kemari.
>
> Hi Michael,
>
> Could you please take a look at the following patch?
Which version is this against?
> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> Date: Thu Dec 16 14:50:54 2010 +0900
>
> virtio: update last_avail_idx when inuse is decreased.
>
> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
It would be better to have a commit description explaining why a change
is made, and why it is correct, not just repeating what can be seen from
the diff anyway.
> diff --git a/hw/virtio.c b/hw/virtio.c
> index c8a0fc6..6688c02 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> wmb();
> trace_virtqueue_flush(vq, count);
> vring_used_idx_increment(vq, count);
> + vq->last_avail_idx += count;
> vq->inuse -= count;
> }
>
> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> unsigned int i, head, max;
> target_phys_addr_t desc_pa = vq->vring.desc;
>
> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> return 0;
>
> /* When we start there are none of either input nor output. */
> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>
> max = vq->vring.num;
>
> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>
> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>
Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
Previous patch version sure looked simpler, and this seems functionally
equivalent, so my question still stands: here it is rephrased in a
different way:
assume that we have in avail ring 2 requests at start of ring: A and B in this order
host pops A, then B, then completes B and flushes
now with this patch last_avail_idx will be 1, and then
remote will get it, it will execute B again. As a result
B will complete twice, and apparently A will never complete.
This is what I was saying below: assuming that there are
outstanding requests when we migrate, there is no way
a single index can be enough to figure out which requests
need to be handled and which are in flight already.
We must add some kind of bitmask to tell us which is which.
> >
> >>
> >>> I'm wondering why
> >>> last_avail_idx is OK to send but not inuse.
> >>
> >> last_avail_idx is at some level a mistake, it exposes part of
> >> our internal implementation, but it does *also* express
> >> a guest observable state.
> >>
> >> Here's the problem that it solves: just looking at the rings in virtio
> >> there is no way to detect that a specific request has already been
> >> completed. And the protocol forbids completing the same request twice.
> >>
> >> Our implementation always starts processing the requests
> >> in order, and since we flush outstanding requests
> >> before save, it works to just tell the remote 'process only requests
> >> after this place'.
> >>
> >> But there's no such requirement in the virtio protocol,
> >> so to be really generic we could add a bitmask of valid avail
> >> ring entries that did not complete yet. This would be
> >> the exact representation of the guest observable state.
> >> In practice we have rings of up to 512 entries.
> >> That's 64 byte per ring, not a lot at all.
> >>
> >> However, if we ever do change the protocol to send the bitmask,
> >> we would need some code to resubmit requests
> >> out of order, so it's not trivial.
> >>
> >> Another minor mistake with last_avail_idx is that it has
> >> some redundancy: the high bits in the index
> >> (> vq size) are not necessary as they can be
> >> got from avail idx. There's a consistency check
> >> in load but we really should try to use formats
> >> that are always consistent.
> >>
> >>> The following patch does the same thing as original, yet
> >>> keeps the format of the virtio. It shouldn't break live
> >>> migration either because inuse should be 0.
> >>>
> >>> Yoshi
> >>
> >> Question is, can you flush to make inuse 0 in kemari too?
> >> And if not, how do you handle the fact that some requests
> >> are in flight on the primary?
> >
> > Although we try flushing requests one by one making inuse 0,
> > there are cases when it failovers to the secondary when inuse
> > isn't 0. We handle these in flight request on the primary by
> > replaying on the secondary.
> >
> >>
> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >>> index c8a0fc6..875c7ca 100644
> >>> --- a/hw/virtio.c
> >>> +++ b/hw/virtio.c
> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >>> qemu_put_be32(f, i);
> >>>
> >>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >>> + uint16_t last_avail_idx;
> >>> +
> >>> if (vdev->vq[i].vring.num == 0)
> >>> break;
> >>>
> >>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >>> +
> >>> qemu_put_be32(f, vdev->vq[i].vring.num);
> >>> qemu_put_be64(f, vdev->vq[i].pa);
> >>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >>> + qemu_put_be16s(f, &last_avail_idx);
> >>> if (vdev->binding->save_queue)
> >>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >>> }
> >>>
> >>>
> >>
> >> This looks wrong to me. Requests can complete in any order, can they
> >> not? So if request 0 did not complete and request 1 did not,
> >> you send avail - inuse and on the secondary you will process and
> >> complete request 1 the second time, crashing the guest.
> >
> > In case of Kemari, no. We sit between devices and net/block, and
> > queue the requests. After completing each transaction, we flush
> > the requests one by one. So there won't be completion inversion,
> > and therefore won't be visible to the guest.
> >
> > Yoshi
> >
> >>
> >>>
> >>> >
> >>> >> >
> >>> >> >> ---
> >>> >> >> hw/virtio.c | 8 +++++++-
> >>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
> >>> >> >>
> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >>> >> >> index 849a60f..5509644 100644
> >>> >> >> --- a/hw/virtio.c
> >>> >> >> +++ b/hw/virtio.c
> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >>> >> >> VRing vring;
> >>> >> >> target_phys_addr_t pa;
> >>> >> >> uint16_t last_avail_idx;
> >>> >> >> - int inuse;
> >>> >> >> + uint16_t inuse;
> >>> >> >> uint16_t vector;
> >>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >>> >> >> VirtIODevice *vdev;
> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
> >>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
> >>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
> >>> >> >> if (vdev->binding->save_queue)
> >>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >>> >> >> }
> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
> >>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
> >>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
> >>> >> >> +
> >>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
> >>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >>> >> >> + vdev->vq[i].inuse = 0;
> >>> >> >>
> >>> >> >> if (vdev->vq[i].pa) {
> >>> >> >> virtqueue_init(&vdev->vq[i]);
> >>> >> >> --
> >>> >> >> 1.7.1.2
> >>> >> >>
> >>> >> >> --
> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >>> >> >> the body of a message to majordomo@vger.kernel.org
> >>> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>> >> > --
> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >>> >> > the body of a message to majordomo@vger.kernel.org
> >>> >> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>> >> >
> >>> > --
> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >>> > the body of a message to majordomo@vger.kernel.org
> >>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>> >
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>
> >
next prev parent reply other threads:[~2010-12-16 9:52 UTC|newest]
Thread overview: 112+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 01/21] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 02/21] Introduce read() to FdMigrationState Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 03/21] Introduce skip_header parameter to qemu_loadvm_state() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 04/21] qemu-char: export socket_set_nodelay() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble Yoshiaki Tamura
2010-11-28 9:28 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 11:27 ` Yoshiaki Tamura
2010-11-28 11:46 ` Michael S. Tsirkin
2010-12-01 8:03 ` Yoshiaki Tamura
2010-12-02 12:02 ` Michael S. Tsirkin
2010-12-03 6:28 ` Yoshiaki Tamura
2010-12-16 7:36 ` Yoshiaki Tamura
2010-12-16 9:51 ` Michael S. Tsirkin [this message]
2010-12-16 14:28 ` Yoshiaki Tamura
2010-12-16 14:40 ` Michael S. Tsirkin
2010-12-16 15:59 ` Yoshiaki Tamura
2010-12-17 16:22 ` Yoshiaki Tamura
2010-12-24 9:27 ` Michael S. Tsirkin
2010-12-24 11:42 ` Yoshiaki Tamura
2010-12-24 13:21 ` Michael S. Tsirkin
2010-12-26 9:05 ` Michael S. Tsirkin
2010-12-26 10:14 ` Yoshiaki Tamura
2010-12-26 10:46 ` Michael S. Tsirkin
2010-12-26 10:50 ` Yoshiaki Tamura
2010-12-26 10:49 ` Michael S. Tsirkin
2010-12-26 10:57 ` Yoshiaki Tamura
2010-12-26 12:01 ` Michael S. Tsirkin
2010-12-26 12:16 ` Yoshiaki Tamura
2010-12-26 12:17 ` Michael S. Tsirkin
2010-11-25 6:06 ` [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs Yoshiaki Tamura
2010-12-08 7:03 ` Isaku Yamahata
2010-12-08 8:11 ` Yoshiaki Tamura
2010-12-08 14:22 ` Anthony Liguori
2010-11-25 6:06 ` [Qemu-devel] [PATCH 07/21] Introduce fault tolerant VM transaction QEMUFile and ft_mode Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 08/21] savevm: introduce util functions to control ft_trans_file from savevm layer Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 09/21] Introduce event-tap Yoshiaki Tamura
2010-11-29 11:00 ` [Qemu-devel] " Stefan Hajnoczi
2010-11-30 9:50 ` Yoshiaki Tamura
2010-11-30 10:04 ` Stefan Hajnoczi
2010-11-30 10:20 ` Yoshiaki Tamura
2011-01-04 11:02 ` Yoshiaki Tamura
2011-01-04 11:14 ` Stefan Hajnoczi
2011-01-04 11:19 ` Michael S. Tsirkin
2011-01-04 12:20 ` Yoshiaki Tamura
2011-01-04 13:10 ` Michael S. Tsirkin
2011-01-04 13:45 ` Yoshiaki Tamura
2011-01-04 14:42 ` Michael S. Tsirkin
2011-01-06 8:47 ` Yoshiaki Tamura
2011-01-06 9:36 ` Michael S. Tsirkin
2011-01-06 9:41 ` Yoshiaki Tamura
[not found] ` <20101130011914.GA9015@amt.cnet>
2010-11-30 9:28 ` Yoshiaki Tamura
2010-11-30 10:25 ` Marcelo Tosatti
2010-11-30 10:35 ` Yoshiaki Tamura
2010-11-30 13:11 ` Marcelo Tosatti
2010-11-25 6:06 ` [Qemu-devel] [PATCH 10/21] Call init handler of event-tap at main() in vl.c Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write() Yoshiaki Tamura
2010-11-28 9:40 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 12:00 ` Yoshiaki Tamura
2010-12-16 7:37 ` Yoshiaki Tamura
2010-12-16 9:22 ` Michael S. Tsirkin
2010-12-16 9:50 ` Yoshiaki Tamura
2010-12-16 9:54 ` Michael S. Tsirkin
2010-12-16 16:27 ` Stefan Hajnoczi
2010-12-17 16:19 ` Yoshiaki Tamura
2010-12-18 8:36 ` Stefan Hajnoczi
2010-11-25 6:06 ` [Qemu-devel] [PATCH 12/21] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy() Yoshiaki Tamura
2010-11-28 9:33 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 11:55 ` Yoshiaki Tamura
2010-11-28 12:28 ` Michael S. Tsirkin
2010-11-29 9:52 ` Kevin Wolf
2010-11-29 12:56 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 14/21] virtio-blk: replace bdrv_aio_multiwrite() with bdrv_aio_multiwrite_proxy() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 15/21] virtio-net: replace qemu_sendv_packet_async() with qemu_sendv_packet_async_proxy() Yoshiaki Tamura
2010-11-28 9:31 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 11:43 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 16/21] e1000: replace qemu_send_packet() with qemu_send_packet_proxy() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 17/21] savevm: introduce qemu_savevm_trans_{begin, commit} Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 18/21] migration: introduce migrate_ft_trans_{put, get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 19/21] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 20/21] Introduce -k option to enable FT migration mode (Kemari) Yoshiaki Tamura
2010-11-25 6:07 ` [Qemu-devel] [PATCH 21/21] migration: add a parser to accept FT migration incoming mode Yoshiaki Tamura
2010-11-26 18:39 ` [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Blue Swirl
2010-11-27 4:29 ` Yoshiaki Tamura
2010-11-27 7:23 ` Stefan Hajnoczi
2010-11-27 8:53 ` Yoshiaki Tamura
2010-11-27 11:03 ` Blue Swirl
2010-11-27 12:21 ` Yoshiaki Tamura
2010-11-27 11:54 ` Stefan Hajnoczi
2010-11-27 13:11 ` Yoshiaki Tamura
2010-11-29 10:17 ` Stefan Hajnoczi
2010-11-29 13:00 ` Paul Brook
2010-11-29 13:13 ` Yoshiaki Tamura
2010-11-29 13:19 ` Paul Brook
2010-11-29 13:41 ` Yoshiaki Tamura
2010-11-29 14:12 ` Paul Brook
2010-11-29 14:37 ` Yoshiaki Tamura
2010-11-29 14:56 ` Paul Brook
2010-11-29 15:00 ` Yoshiaki Tamura
2010-11-29 15:56 ` Paul Brook
2010-11-29 16:23 ` Stefan Hajnoczi
2010-11-29 16:41 ` Dor Laor
2010-11-29 16:53 ` Paul Brook
2010-11-29 17:05 ` Anthony Liguori
2010-11-29 17:18 ` Paul Brook
2010-11-29 17:33 ` Anthony Liguori
2010-11-30 7:13 ` Yoshiaki Tamura
2010-11-30 6:43 ` Yoshiaki Tamura
2010-11-30 9:13 ` Takuya Yoshikawa
2010-11-27 11:20 ` Paul Brook
2010-11-27 12:35 ` Yoshiaki Tamura
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20101216095140.GB19495@redhat.com \
--to=mst@redhat.com \
--cc=aliguori@us.ibm.com \
--cc=ananth@in.ibm.com \
--cc=avi@redhat.com \
--cc=dlaor@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=mtosatti@redhat.com \
--cc=ohmura.kei@lab.ntt.co.jp \
--cc=psuriset@linux.vnet.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@linux.vnet.ibm.com \
--cc=tamura.yoshiaki@lab.ntt.co.jp \
--cc=vatsa@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).