From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <53FBAF8A.3050005@windriver.com>
Date: Mon, 25 Aug 2014 15:50:02 -0600
From: Chris Friesen
References: <53C9A440.7020306@windriver.com> <53CA06ED.1090102@redhat.com>
 <53CA0FC4.8080802@windriver.com> <53CA1D06.9090601@redhat.com>
 <20140719084537.GA3058@irqsave.net> <53CD2AE1.6080803@windriver.com>
 <20140721151540.GA22161@irqsave.net> <53CD3341.60705@windriver.com>
 <20140721161034.GC22161@irqsave.net> <53F7E77A.9050509@windriver.com>
 <20140823075658.GA6687@irqsave.net>
In-Reply-To: <20140823075658.GA6687@irqsave.net>
Subject: Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
To: Benoît Canet
Cc: Paolo Bonzini, qemu-devel@nongnu.org

On 08/23/2014 01:56 AM, Benoît Canet wrote:
> The Friday 22 Aug 2014 à 18:59:38 (-0600), Chris Friesen wrote :
>> On 07/21/2014 10:10 AM, Benoît Canet wrote:
>>> The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
>>>> On 07/21/2014 09:15 AM, Benoît Canet wrote:
>>>>> The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
>>>>>> On 07/19/2014 02:45 AM, Benoît Canet wrote:
>>>>>>
>>>>>>> I think in the throttling case the number of in-flight operations is limited by
>>>>>>> the emulated hardware queue. Otherwise requests would pile up and throttling would be
>>>>>>> ineffective.
>>>>>>>
>>>>>>> So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like that.
>>>>>>
>>>>>> Okay, that makes sense. Do you know how much data can be written as part of
>>>>>> a single operation? We're using 2MB hugepages for the guest memory, and we
>>>>>> saw the qemu RSS numbers jump from 25-30MB during normal operation up to
>>>>>> 120-180MB when running dbench. I'd like to know what the worst-case would
>>>
>>> Sorry, I didn't understand this part at first read.
>>>
>>> In the linux guest can you monitor:
>>> benoit@Laure:~$ cat /sys/class/block/xyz/inflight ?
>>>
>>> This would give us a fairly precise number of the requests actually in flight between the guest and qemu.
>>
>>
>> After a bit of a break I'm looking at this again.
>>
>
> Strange.
>
> I would use dd with the flag oflag=nocache to make sure the write request
> does not go in the guest cache though.
>
> Best regards
>
> Benoît
>
>> While doing "dd if=/dev/zero of=testfile bs=1M count=700" in the guest, I
>> got a max "inflight" value of 181. This seems quite a bit higher than
>> VIRTIO_PCI_QUEUE_MAX.
>>
>> I've seen throughput as high as ~210 MB/sec, which also kicked the RSS
>> numbers up above 200MB.
>>
>> I tried dropping VIRTIO_PCI_QUEUE_MAX down to 32 (it didn't seem to work at
>> all for values much less than that, though I didn't bother getting an exact
>> value) and it didn't really make any difference; I saw inflight values as
>> high as 177.

I think I might have a glimmering of what's going on. Someone please
correct me if I get something wrong.

I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with
respect to max inflight operations, and neither does virtio-blk calling
virtio_add_queue() with a queue size of 128.

I think what's happening is that virtio_blk_handle_output() spins,
pulling data off the 128-entry queue and calling
virtio_blk_handle_request(). At this point that queue entry can be
reused, so the queue size isn't really relevant.

In virtio_blk_handle_write() we add the request to a MultiReqBuffer and
every 32 writes we'll call virtio_submit_multiwrite(), which calls down
into bdrv_aio_multiwrite(). That tries to merge requests and then for
each resulting request calls bdrv_aio_writev(), which ends up calling
qemu_rbd_aio_writev(), which calls rbd_start_aio().

rbd_start_aio() allocates a buffer and converts from iovec to a single
buffer. This buffer stays allocated until the request is acked, which
is where the bulk of the memory overhead with rbd is coming from (has
anyone considered adding iovec support to rbd to avoid this extra copy?).

The only limit I see in the whole call chain from
virtio_blk_handle_request() on down is the call to
bdrv_io_limits_intercept() in bdrv_co_do_writev(). However, that
doesn't provide any limit on the absolute number of inflight operations,
only on operations/sec. If the ceph server cluster can't keep up with
the aggregate load, then the number of inflight operations can still
grow indefinitely.

Chris
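
To make the last point concrete, here is a toy model of a limit on
operations/sec that puts no cap on the number of requests in flight.
It is only a sketch and not QEMU code: the function inflight_after()
and its parameters limit_ops and backend_ops are hypothetical names,
and the numbers are made up for illustration. It assumes the limiter
admits a fixed number of requests each second and the backend acks at
most a fixed number each second.

```python
def inflight_after(limit_ops, backend_ops, seconds):
    """Return the number of requests still in flight after `seconds`.

    limit_ops   -- requests admitted per second by the rate limiter
    backend_ops -- requests the backend can complete (ack) per second
    Both are assumed constant; this is a model, not a measurement.
    """
    inflight = 0
    for _ in range(seconds):
        inflight += limit_ops                    # admitted this second
        inflight -= min(inflight, backend_ops)   # acked this second
    return inflight


if __name__ == "__main__":
    # Backend keeps up with the limit: nothing accumulates.
    print(inflight_after(100, 100, 60))
    # Backend is slower than the limit: the backlog grows every second.
    print(inflight_after(100, 80, 60))
```

In this model the backlog grows by (limit_ops - backend_ops) every
second whenever the backend is the slower side. Since each in-flight
request holds its buffer until it is acked, memory would grow with it.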