From: Stefan Hajnoczi
To: Karl Rister
Cc: Stefan Hajnoczi, qemu-devel@nongnu.org, Andrew Theurer, Paolo Bonzini, Fam Zheng
Subject: Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
Date: Mon, 14 Nov 2016 15:26:42 +0000
Message-ID: <20161114152642.GE26198@stefanha-x1.localdomain>
In-Reply-To: <5826231D.7070208@redhat.com>

On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
> > Recent performance investigation work done by Karl Rister shows that the
> > guest->host notification takes around 20 us.  This is more than the
> > "overhead" of QEMU itself (e.g. block layer).
> >
> > One way to avoid the costly exit is to use polling instead of
> > notification.  The main drawback of polling is that it consumes CPU
> > resources.  In order to benefit performance the host must have extra CPU
> > cycles available on physical CPUs that aren't used by the guest.
> >
> > This is an experimental AioContext polling implementation.  It adds a
> > polling callback into the event loop.  Polling functions are implemented
> > for virtio-blk virtqueue guest->host kick and Linux AIO completion.
> >
> > The QEMU_AIO_POLL_MAX_NS environment variable sets the number of
> > nanoseconds to poll before entering the usual blocking poll(2) syscall.
> > Try setting this variable to the time from old request completion to new
> > virtqueue kick.
> >
> > By default no polling is done.  QEMU_AIO_POLL_MAX_NS must be set to get
> > any polling!
> >
> > Karl: I hope you can try this patch series with several
> > QEMU_AIO_POLL_MAX_NS values.  If you don't find a good value we should
> > double-check the tracing data to see if this experimental code can be
> > improved.
>
> Stefan
>
> I ran some quick tests with your patches and got some pretty good gains,
> but also some seemingly odd behavior.
>
> These results are for a 5 minute test doing sequential 4KB requests from
> fio using O_DIRECT, libaio, and an IO depth of 1.  The requests are
> performed directly against the virtio-blk device (no filesystem), which
> is backed by a 400GB NVMe card.
>
> QEMU_AIO_POLL_MAX_NS      IOPS
>                unset      31,383
>                    1      46,860
>                    2      46,440
>                    4      35,246
>                    8      34,973
>                   16      46,794
>                   32      46,729
>                   64      35,520
>                  128      45,902
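For reference, the workload above corresponds roughly to an fio invocation
like the one below.  This is only a sketch: the /dev/vdb device path and the
read direction are assumptions, since the report just says "sequential 4KB
requests".

    # sequential 4KB requests, O_DIRECT, libaio, queue depth 1, 5 minutes
    # /dev/vdb stands in for the guest's virtio-blk device
    fio --name=seq4k --filename=/dev/vdb --rw=read --bs=4k \
        --ioengine=libaio --iodepth=1 --direct=1 \
        --time_based --runtime=300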
The environment variable is in nanoseconds.  The range of values you tried
is very small (all <1 usec).  It would be interesting to try larger values
in the ballpark of the latencies you have traced, for example 2000, 4000,
8000, 16000, and 32000 ns.

Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without much
CPU overhead.

> I found the results for 4, 8, and 64 odd so I re-ran some tests to check
> for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
> is what I got:
>
> Iteration    QEMU_AIO_POLL_MAX_NS=2    QEMU_AIO_POLL_MAX_NS=4
>         1                    46,972                    35,434
>         2                    46,939                    35,719
>         3                    47,005                    35,584
>         4                    47,016                    35,615
>         5                    47,267                    35,474
>
> So the results seem consistent.

That is interesting.  I don't have an explanation for the consistent
difference between 2 and 4 ns polling time.  The time difference is so
small yet the IOPS difference is clear.  Comparing traces could shed light
on the cause of this difference.

> I saw some discussion on the patches which makes me think you'll be
> making some changes, is that right?  If so, I may wait for the updates
> and then we can run the much more exhaustive set of workloads
> (sequential read and write, random read and write) at various block
> sizes (4, 8, 16, 32, 64, 128, and 256 KB) and multiple IO depths (1 and
> 32) that we were doing when we started looking at this.

I'll send an updated version of the patches.

Stefan
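To make the suggested sweep concrete: QEMU_AIO_POLL_MAX_NS is read from
QEMU's environment, so each value means a fresh QEMU launch.  A minimal
launch sketch follows; the memory/SMP sizing, guest.img name, and drive
options are placeholders, not taken from Karl's setup.

    # try one of the suggested larger values, e.g. 4000 ns of polling
    # before falling back to the blocking poll(2) syscall
    QEMU_AIO_POLL_MAX_NS=4000 qemu-system-x86_64 \
        -enable-kvm -m 4096 -smp 4 \
        -drive file=guest.img,if=virtio,format=raw,cache=none,aio=native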