From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Michael S. Tsirkin" Subject: Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap. Date: Tue, 4 Jan 2011 16:42:14 +0200 Message-ID: <20110104144214.GC8734@redhat.com> References: <1290665220-26478-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp> <1290665220-26478-10-git-send-email-tamura.yoshiaki@lab.ntt.co.jp> <20110104111908.GA5694@redhat.com> <20110104131052.GA8734@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: aliguori@us.ibm.com, mtosatti@redhat.com, ananth@in.ibm.com, kvm@vger.kernel.org, Stefan Hajnoczi , dlaor@redhat.com, ohmura.kei@lab.ntt.co.jp, qemu-devel@nongnu.org, avi@redhat.com, vatsa@linux.vnet.ibm.com, psuriset@linux.vnet.ibm.com, stefanha@linux.vnet.ibm.com To: Yoshiaki Tamura Return-path: Received: from mx1.redhat.com ([209.132.183.28]:1354 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751419Ab1ADOmy (ORCPT ); Tue, 4 Jan 2011 09:42:54 -0500 Content-Disposition: inline In-Reply-To: Sender: kvm-owner@vger.kernel.org List-ID: On Tue, Jan 04, 2011 at 10:45:13PM +0900, Yoshiaki Tamura wrote: > 2011/1/4 Michael S. Tsirkin : > > On Tue, Jan 04, 2011 at 09:20:53PM +0900, Yoshiaki Tamura wrote: > >> 2011/1/4 Michael S. Tsirkin : > >> > On Tue, Jan 04, 2011 at 08:02:54PM +0900, Yoshiaki Tamura wrote: > >> >> 2010/11/29 Stefan Hajnoczi : > >> >> > On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura > >> >> > wrote: > >> >> >> event-tap controls when to start FT transaction, and provide= s proxy > >> >> >> functions to called from net/block devices. =A0While FT tran= saction, it > >> >> >> queues up net/block requests, and flush them when the transa= ction gets > >> >> >> completed. > >> >> >> > >> >> >> Signed-off-by: Yoshiaki Tamura > >> >> >> Signed-off-by: OHMURA Kei > >> >> >> --- > >> >> >> =A0Makefile.target | =A0 =A01 + > >> >> >> =A0block.h =A0 =A0 =A0 =A0 | =A0 =A09 + > >> >> >> =A0event-tap.c =A0 =A0 | =A0794 ++++++++++++++++++++++++++++= +++++++++++++++++++++++++++ > >> >> >> =A0event-tap.h =A0 =A0 | =A0 34 +++ > >> >> >> =A0net.h =A0 =A0 =A0 =A0 =A0 | =A0 =A04 + > >> >> >> =A0net/queue.c =A0 =A0 | =A0 =A01 + > >> >> >> =A06 files changed, 843 insertions(+), 0 deletions(-) > >> >> >> =A0create mode 100644 event-tap.c > >> >> >> =A0create mode 100644 event-tap.h > >> >> > > >> >> > event_tap_state is checked at the beginning of several functi= ons. =A0If > >> >> > there is an unexpected state the function silently returns. =A0= Should > >> >> > these checks really be assert() so there is an abort and back= trace if > >> >> > the program ever reaches this state? > >> >> > > >> >> >> +typedef struct EventTapBlkReq { > >> >> >> + =A0 =A0char *device_name; > >> >> >> + =A0 =A0int num_reqs; > >> >> >> + =A0 =A0int num_cbs; > >> >> >> + =A0 =A0bool is_multiwrite; > >> >> > > >> >> > Is multiwrite logging necessary? =A0If event tap is called fr= om within > >> >> > the block layer then multiwrite is turned into one or more > >> >> > bdrv_aio_writev() calls. > >> >> > > >> >> >> +static void event_tap_replay(void *opaque, int running, int= reason) > >> >> >> +{ > >> >> >> + =A0 =A0EventTapLog *log, *next; > >> >> >> + > >> >> >> + =A0 =A0if (!running) { > >> >> >> + =A0 =A0 =A0 =A0return; > >> >> >> + =A0 =A0} > >> >> >> + > >> >> >> + =A0 =A0if (event_tap_state !=3D EVENT_TAP_LOAD) { > >> >> >> + =A0 =A0 =A0 =A0return; > >> >> >> + =A0 =A0} > >> >> >> + > >> >> >> + =A0 =A0event_tap_state =3D EVENT_TAP_REPLAY; > >> >> >> + > >> >> >> + =A0 =A0QTAILQ_FOREACH(log, &event_list, node) { > >> >> >> + =A0 =A0 =A0 =A0EventTapBlkReq *blk_req; > >> >> >> + > >> >> >> + =A0 =A0 =A0 =A0/* event resume */ > >> >> >> + =A0 =A0 =A0 =A0switch (log->mode & ~EVENT_TAP_TYPE_MASK) { > >> >> >> + =A0 =A0 =A0 =A0case EVENT_TAP_NET: > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0event_tap_net_flush(&log->net_req); > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0break; > >> >> >> + =A0 =A0 =A0 =A0case EVENT_TAP_BLK: > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0blk_req =3D &log->blk_req; > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0if ((log->mode & EVENT_TAP_TYPE_MAS= K) =3D=3D EVENT_TAP_IOPORT) { > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0switch (log->ioport.index) = { > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0case 0: > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cpu_outb(log->iopor= t.address, log->ioport.data); > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0break; > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0case 1: > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cpu_outw(log->iopor= t.address, log->ioport.data); > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0break; > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0case 2: > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cpu_outl(log->iopor= t.address, log->ioport.data); > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0break; > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0} > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0} else { > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0/* EVENT_TAP_MMIO */ > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cpu_physical_memory_rw(log-= >mmio.address, > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 log->mmio.buf, > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 log->mmio.len, 1); > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0} > >> >> >> + =A0 =A0 =A0 =A0 =A0 =A0break; > >> >> > > >> >> > Why are net tx packets replayed at the net level but blk requ= ests are > >> >> > replayed at the pio/mmio level? > >> >> > > >> >> > I expected everything to replay either as pio/mmio or as net/= block. > >> >> > >> >> Stefan, > >> >> > >> >> After doing some heavy load tests, I realized that we have to > >> >> take a hybrid approach to replay for now. =A0This is because wh= en a > >> >> device moves to the next state (e.g. virtio decreases inuse) is > >> >> different between net and block. =A0For example, virtio-net > >> >> decreases inuse upon returning from the net layer, > >> >> but virtio-blk > >> >> does that inside of the callback. > >> > > >> > For TX, virtio-net calls virtqueue_push from virtio_net_tx_compl= ete. > >> > For RX, virtio-net calls virtqueue_flush from virtio_net_receive= =2E > >> > Both are invoked from a callback. > >> > > >> >> If we only use pio/mmio > >> >> replay, even though event-tap tries to replay net requests, som= e > >> >> get lost because the state has proceeded already. > >> > > >> > It seems that all you need to do to avoid this is to > >> > delay the callback? > >> > >> Yeah, if it's possible. =A0But if you take a look at virtio-net, > >> you'll see that virtio_push is called immediately after calling > >> qemu_sendv_packet > >> while virtio-blk does that in the callback. > > > > This is only if the packet was sent immediately. > > I was referring to the case where the packet is queued. >=20 > I see. I usually don't see packets get queued in the net layer. > What would be the effect to devices? Restraint sending packets? Yes. > > > >> > > >> >> This doesn't > >> >> happen with block, because the state is still old enough to > >> >> replay. =A0Note that using hybrid approach won't cause duplicat= ed > >> >> requests on the secondary. > >> > > >> > An assumption devices make is that a buffer is unused once > >> > completion callback was invoked. Does this violate that assumpti= on? > >> > >> No, it shouldn't. =A0In case of net with net layer replay, we copy > >> the content of the requests, and in case of block, because we > >> haven't called the callback yet, the requests remains fresh. > >> > >> Yoshi > >> > > > > Yes, as long as you copy it should be fine. =A0Maybe it's a good id= ea for > > event-tap to queue all packets to avoid the copy and avoid the need= to > > replay at the net level. >=20 > If queuing works fine for the devices, it seems to be a good > idea. I think the ordering issue doesn't happen still. >=20 > Yoshi If you replay and both net and pio level, it becomes complex. Maybe it's ok, but certainly harder to reason about. > > > >> > > >> > -- > >> > MST > >> > > >> > > > > >