From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=39595 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Pa5sn-0006Ri-A8
	for qemu-devel@nongnu.org; Tue, 04 Jan 2011 07:21:00 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <tamura.yoshiaki@gmail.com>) id 1Pa5sl-0000sZ-CF
	for qemu-devel@nongnu.org; Tue, 04 Jan 2011 07:20:57 -0500
Received: from mail-wy0-f173.google.com ([74.125.82.173]:39995)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <tamura.yoshiaki@gmail.com>) id 1Pa5sk-0000sF-To
	for qemu-devel@nongnu.org; Tue, 04 Jan 2011 07:20:55 -0500
Received: by wyg36 with SMTP id 36so15756201wyg.4
	for <qemu-devel@nongnu.org>; Tue, 04 Jan 2011 04:20:54 -0800 (PST)
MIME-Version: 1.0
Sender: tamura.yoshiaki@gmail.com
In-Reply-To: <20110104111908.GA5694@redhat.com>
References: <1290665220-26478-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
	<1290665220-26478-10-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
	<AANLkTimGqh7v5gHrax_Yjt8vJwJPda6rrFhs+rY4UDik@mail.gmail.com>
	<AANLkTin=0msj=vAyWskTJBTWOJC9xYH6fOzVT3m=WUJH@mail.gmail.com>
	<20110104111908.GA5694@redhat.com>
Date: Tue, 4 Jan 2011 21:20:53 +0900
Message-ID: <AANLkTim7=bm9v95c1D+8z8RDe-7XiaMM9Ufa7XQNvK-E@mail.gmail.com>
Subject: Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
From: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: aliguori@us.ibm.com, mtosatti@redhat.com, ananth@in.ibm.com, kvm@vger.kernel.org, Stefan Hajnoczi <stefanha@gmail.com>, dlaor@redhat.com, ohmura.kei@lab.ntt.co.jp, qemu-devel@nongnu.org, avi@redhat.com, vatsa@linux.vnet.ibm.com, psuriset@linux.vnet.ibm.com, stefanha@linux.vnet.ibm.com

2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
> On Tue, Jan 04, 2011 at 08:02:54PM +0900, Yoshiaki Tamura wrote:
>> 2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
>> > On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>> > <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> >> event-tap controls when to start FT transaction, and provides proxy
>> >> functions to called from net/block devices. =A0While FT transaction, =
it
>> >> queues up net/block requests, and flush them when the transaction get=
s
>> >> completed.
>> >>
>> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
>> >> ---
>> >> =A0Makefile.target | =A0 =A01 +
>> >> =A0block.h =A0 =A0 =A0 =A0 | =A0 =A09 +
>> >> =A0event-tap.c =A0 =A0 | =A0794 +++++++++++++++++++++++++++++++++++++=
++++++++++++++++++
>> >> =A0event-tap.h =A0 =A0 | =A0 34 +++
>> >> =A0net.h =A0 =A0 =A0 =A0 =A0 | =A0 =A04 +
>> >> =A0net/queue.c =A0 =A0 | =A0 =A01 +
>> >> =A06 files changed, 843 insertions(+), 0 deletions(-)
>> >> =A0create mode 100644 event-tap.c
>> >> =A0create mode 100644 event-tap.h
>> >
>> > event_tap_state is checked at the beginning of several functions. =A0I=
f
>> > there is an unexpected state the function silently returns. =A0Should
>> > these checks really be assert() so there is an abort and backtrace if
>> > the program ever reaches this state?
>> >
>> >> +typedef struct EventTapBlkReq {
>> >> + =A0 =A0char *device_name;
>> >> + =A0 =A0int num_reqs;
>> >> + =A0 =A0int num_cbs;
>> >> + =A0 =A0bool is_multiwrite;
>> >
>> > Is multiwrite logging necessary? =A0If event tap is called from within
>> > the block layer then multiwrite is turned into one or more
>> > bdrv_aio_writev() calls.
>> >
>> >> +static void event_tap_replay(void *opaque, int running, int reason)
>> >> +{
>> >> + =A0 =A0EventTapLog *log, *next;
>> >> +
>> >> + =A0 =A0if (!running) {
>> >> + =A0 =A0 =A0 =A0return;
>> >> + =A0 =A0}
>> >> +
>> >> + =A0 =A0if (event_tap_state !=3D EVENT_TAP_LOAD) {
>> >> + =A0 =A0 =A0 =A0return;
>> >> + =A0 =A0}
>> >> +
>> >> + =A0 =A0event_tap_state =3D EVENT_TAP_REPLAY;
>> >> +
>> >> + =A0 =A0QTAILQ_FOREACH(log, &event_list, node) {
>> >> + =A0 =A0 =A0 =A0EventTapBlkReq *blk_req;
>> >> +
>> >> + =A0 =A0 =A0 =A0/* event resume */
>> >> + =A0 =A0 =A0 =A0switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
>> >> + =A0 =A0 =A0 =A0case EVENT_TAP_NET:
>> >> + =A0 =A0 =A0 =A0 =A0 =A0event_tap_net_flush(&log->net_req);
>> >> + =A0 =A0 =A0 =A0 =A0 =A0break;
>> >> + =A0 =A0 =A0 =A0case EVENT_TAP_BLK:
>> >> + =A0 =A0 =A0 =A0 =A0 =A0blk_req =3D &log->blk_req;
>> >> + =A0 =A0 =A0 =A0 =A0 =A0if ((log->mode & EVENT_TAP_TYPE_MASK) =3D=3D=
 EVENT_TAP_IOPORT) {
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0switch (log->ioport.index) {
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0case 0:
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cpu_outb(log->ioport.address=
, log->ioport.data);
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0break;
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0case 1:
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cpu_outw(log->ioport.address=
, log->ioport.data);
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0break;
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0case 2:
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cpu_outl(log->ioport.address=
, log->ioport.data);
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0break;
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0}
>> >> + =A0 =A0 =A0 =A0 =A0 =A0} else {
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0/* EVENT_TAP_MMIO */
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cpu_physical_memory_rw(log->mmio.add=
ress,
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 log->mmio.buf,
>> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 log->mmio.len, 1);
>> >> + =A0 =A0 =A0 =A0 =A0 =A0}
>> >> + =A0 =A0 =A0 =A0 =A0 =A0break;
>> >
>> > Why are net tx packets replayed at the net level but blk requests are
>> > replayed at the pio/mmio level?
>> >
>> > I expected everything to replay either as pio/mmio or as net/block.
>>
>> Stefan,
>>
>> After doing some heavy load tests, I realized that we have to
>> take a hybrid approach to replay for now. =A0This is because when a
>> device moves to the next state (e.g. virtio decreases inuse) is
>> different between net and block. =A0For example, virtio-net
>> decreases inuse upon returning from the net layer,
>> but virtio-blk
>> does that inside of the callback.
>
> For TX, virtio-net calls virtqueue_push from virtio_net_tx_complete.
> For RX, virtio-net calls virtqueue_flush from virtio_net_receive.
> Both are invoked from a callback.
>
>> If we only use pio/mmio
>> replay, even though event-tap tries to replay net requests, some
>> get lost because the state has proceeded already.
>
> It seems that all you need to do to avoid this is to
> delay the callback?

Yeah, if it's possible.  But if you take a look at virtio-net,
you'll see that virtio_push is called immediately after calling
qemu_sendv_packet while virtio-blk does that in the callback.

>
>> This doesn't
>> happen with block, because the state is still old enough to
>> replay. =A0Note that using hybrid approach won't cause duplicated
>> requests on the secondary.
>
> An assumption devices make is that a buffer is unused once
> completion callback was invoked. Does this violate that assumption?

No, it shouldn't.  In case of net with net layer replay, we copy
the content of the requests, and in case of block, because we
haven't called the callback yet, the requests remains fresh.

Yoshi

>
> --
> MST
>
>