From: Marcelo Tosatti
Subject: Re: [Qemu-devel] [PATCH] kvm: First step to push iothread lock out of inner run loop
Date: Wed, 27 Jun 2012 08:19:03 -0300
Message-ID: <20120627111903.GA3994@amt.cnet>
References: <4FE4F56D.1020201@web.de> <4FE4F7F5.7030400@web.de> <20120626193420.GA19852@amt.cnet>
To: Stefan Hajnoczi
Cc: Jan Kiszka, Liu Ping Fan, kvm, qemu-devel, Alexander Graf, Avi Kivity, Anthony Liguori

On Wed, Jun 27, 2012 at 08:41:49AM +0100, Stefan Hajnoczi wrote:
> On Wed, Jun 27, 2012 at 8:39 AM, Stefan Hajnoczi wrote:
> > On Tue, Jun 26, 2012 at 8:34 PM, Marcelo Tosatti wrote:
> >> On Sat, Jun 23, 2012 at 12:55:49AM +0200, Jan Kiszka wrote:
> >> net.txt
> >> --------
> >>
> >>
> >> iothread flow
> >> =============
> >>
> >> 1) Skip-work-if-device-locked
> >>
> >> select(tap fd ready)
> >> tap_send
> >>     if (trylock(TAPState->NetClientState->dev))
> >>         proceed;
> >>     else
> >>         continue; (data remains in queue)
> >>     tap_read_packet
> >>     qemu_send_packet_async
> >>
> >> DRAWBACK: lock intensive.
> >> DRAWBACK: VCPU threads can starve IOTHREAD (can be fixed with
> >> a scheme such as having trylock() mark a flag saying "iothread
> >> wants lock").
> >
> > One alternative is to say the lock cannot be held across system calls
> > or other long-running operations (similar to interrupt handler
> > spinlocks in the kernel). But QEMU threads are still at the mercy of
> > the scheduler or page faults, so perhaps this is not a good idea
> > because waiting on the lock could tie up the iothread.
> >
> >> 2) Queue-work-to-vcpu-context
> >>
> >> select(tap fd ready)
> >> tap_send
> >>     if (trylock(TAPState->NetClientState->dev)) {
> >>         proceed;
> >>     } else {
> >>         AddToDeviceWorkQueue(work);
> >>     }
> >>     tap_read_packet
> >>     qemu_send_packet_async
> >>
> >> DRAWBACK: lock intensive
> >> DRAWBACK: cache effects
> >
> > This almost suggests we should invert packet reception for NICs. tap
> > fd ready should add a work item and the guest device calls into
> > net/tap.c to pull out any packets from the work function:
> >
> > tap_send
> >     dev_add_work(work);
> >
> > virtio_net_rx_work_fn
> >     while ((req = virtqueue_pop(rx_queue))) {
> >         if (!net_receive_packet(netclient, req->buf)) {
> >             break; /* no more packets available */
> >         }
> >         /* ...mark req complete and push onto virtqueue... */
> >     }
> >     virtqueue_notify(rx_queue);
> >
> > The idea is to avoid copies on the rx path and make the locking simple
> > by always deferring to a work function (which can be invoked
> > immediately if the device isn't locked).
>
> I just realized this approach bypasses net/queue.c. I think it works
> really well for the peer model (no "VLANs"). Basically flow control
> is dictated by the peer (receiver) and we don't need to use NetQueue
> anymore. Whether this works elegantly for slirp and other backends,
> I'm not sure yet.
>
> Stefan

Right. The advantage is that only backends where performance matters
need to be converted (and likewise only the performance-critical net
hw drivers).

Apparently you guys have very different ideas on the higher-level model
here.
It would be good to agree on one structure (including modifications to the one above) to follow, and to share the work in that direction. Having competing implementations in this case is a waste of time.