From: Josh Durgin
Date: Wed, 26 Aug 2015 17:56:03 -0700
Subject: Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
To: Andrey Korolyov
Cc: Benoît Canet, Chris Friesen, qemu-devel@nongnu.org, Paolo Bonzini
Message-ID: <55DE6023.10709@redhat.com>

On 08/26/2015 04:47 PM, Andrey Korolyov wrote:
> On Thu, Aug 27, 2015 at 2:31 AM, Josh Durgin wrote:
>> On 08/26/2015 10:10 AM, Andrey Korolyov wrote:
>>> On Thu, May 14, 2015 at 4:42 PM, Andrey Korolyov wrote:
>>>> On Wed, Aug 27, 2014 at 9:43 AM, Chris Friesen wrote:
>>>>> On 08/25/2014 03:50 PM, Chris Friesen wrote:
>>>>>
>>>>>> I think I might have a glimmering of what's going on. Someone please correct me if I get something wrong.
>>>>>>
>>>>>> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with respect to the maximum number of in-flight operations, and neither does virtio-blk calling virtio_add_queue() with a queue size of 128.
>>>>>>
>>>>>> I think what's happening is that virtio_blk_handle_output() spins, pulling data off the 128-entry queue and calling virtio_blk_handle_request(). At that point the queue entry can be reused, so the queue size isn't really relevant.
>>>>>>
>>>>>> In virtio_blk_handle_write() we add the request to a MultiReqBuffer, and every 32 writes we call virtio_submit_multiwrite(), which calls down into bdrv_aio_multiwrite(). That tries to merge requests and then, for each resulting request, calls bdrv_aio_writev(), which ends up calling qemu_rbd_aio_writev(), which calls rbd_start_aio().
>>>>>>
>>>>>> rbd_start_aio() allocates a buffer and converts from an iovec to a single buffer. This buffer stays allocated until the request is acked, which is where the bulk of the memory overhead with rbd is coming from (has anyone considered adding iovec support to rbd to avoid this extra copy?).
>>>>>>
>>>>>> The only limit I see in the whole call chain from virtio_blk_handle_request() on down is the call to bdrv_io_limits_intercept() in bdrv_co_do_writev(). However, that doesn't provide any limit on the absolute number of in-flight operations, only on operations/sec. If the ceph server cluster can't keep up with the aggregate load, then the number of in-flight operations can still grow indefinitely.
>>>>>>
>>>>>> Chris
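The bounce-buffer conversion described above is the per-request memory cost in question. As a rough illustration only (the helper name flatten_iov is made up here, this is not the actual rbd_start_aio() code), flattening a scatter/gather list looks like the sketch below; the resulting allocation has to live until the backend acknowledges the request, so every in-flight write pins a full contiguous copy of its payload:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>

    /* Copy an iovec into one contiguous buffer.  The caller keeps the
     * buffer until the AIO completion fires, which is why a burst of
     * unacknowledged writes directly inflates RSS. */
    static char *flatten_iov(const struct iovec *iov, int iovcnt, size_t *total)
    {
        size_t len = 0, off = 0;
        for (int i = 0; i < iovcnt; i++) {
            len += iov[i].iov_len;
        }
        char *buf = malloc(len);
        if (!buf) {
            return NULL;
        }
        for (int i = 0; i < iovcnt; i++) {
            memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
            off += iov[i].iov_len;
        }
        *total = len;
        return buf;
    }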
>>>>> I was a bit concerned that I'd need to extend the I/O throttling code to support a limit on total in-flight bytes, but it doesn't look like that will be necessary.
>>>>>
>>>>> It seems that using mallopt() to set the trim/mmap thresholds to 128K is enough to minimize the increase in RSS and also drop it back down after an I/O burst. For now this looks like it should be sufficient for our purposes.
>>>>>
>>>>> I'm actually a bit surprised I didn't have to go lower, but it seems to work for both "dd" and dbench test cases, so we'll give it a try.
>>>>>
>>>>> Chris
>>>>
>>>> Bumping this...
>>>>
>>>> We are still occasionally hitting an unbounded cache growth issue that can be observed on all post-1.4 versions of qemu with the rbd backend in writeback mode and a certain pattern of guest operations. The issue is confirmed for virtio and can be re-triggered by issuing an excessive number of write requests without completing the acks returned from the emulator's cache in a timely manner. Since most applications behave correctly, the OOM issue is very rare (and we developed an ugly workaround for such situations long ago). If anybody is interested in fixing this, I can send a prepared image for reproduction, or instructions to make one, whichever is preferable.
>>>>
>>>> Thanks!
>>>
>>> A gentle bump: for at least the rbd backend with a writethrough/writeback cache it is possible to achieve unlimited growth with a lot of large unfinished ops, which can be considered a DoS. Usually it is triggered by poorly written applications in the wild, like proprietary KV databases or MSSQL under Windows, but regular applications, primarily OSS databases, can easily push RSS growth to hundreds of megabytes. There is probably no straightforward way to limit in-flight request size by re-chunking it, since a malicious guest could inflate it to very high numbers, but it's fine to crash such a guest; protecting real-world workloads with a simple in-flight op count limiter looks like the more achievable option.
>>
>> Hey, sorry I missed this thread before.
>>
>> What version of ceph are you running? There was an issue with ceph 0.80.8 and earlier that could cause lots of extra memory usage by rbd's cache (even in writethrough mode) due to copy-on-write triggering whole-object (default 4MB) reads and sticking those in the cache without proper throttling [1]. I'm wondering if this could be causing the large RSS growth you're seeing.
>>
>> In-flight requests do have buffers and structures allocated for them in librbd, but these should have lower overhead than cow. If these are the problem, it seems to me a generic limit on in-flight ops in qemu would be a reasonable fix. Other backends have resources tied up by in-flight ops as well.
>>
>> Josh
>>
>> [1] https://github.com/ceph/ceph/pull/3410
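For reference, the mallopt() tuning Chris describes a few messages up amounts to something like the sketch below (the wrapper name is made up; glibc also honors the MALLOC_TRIM_THRESHOLD_ and MALLOC_MMAP_THRESHOLD_ environment variables if calling mallopt() from the process itself is impractical):

    #include <malloc.h>

    /* Lower glibc's trim and mmap thresholds to 128K so memory freed after
     * an I/O burst is returned to the kernel instead of being retained in
     * the malloc arena, which is what otherwise shows up as lingering RSS. */
    static void tune_malloc_for_bursty_io(void)
    {
        mallopt(M_TRIM_THRESHOLD, 128 * 1024);
        mallopt(M_MMAP_THRESHOLD, 128 * 1024);
    }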
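The "generic limit on in-flight ops" suggested above could take roughly the following shape. This is only a sketch of the idea in plain pthreads with invented names (InflightLimiter and friends); QEMU's block layer would express the waiting with coroutines rather than a blocking condition variable:

    #include <pthread.h>

    /* A counter plus a condition variable: submitters block once the
     * backend already has max_inflight requests outstanding, and the AIO
     * completion path wakes them up again. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        unsigned inflight;
        unsigned max_inflight;
    } InflightLimiter;

    static void limiter_init(InflightLimiter *l, unsigned max_inflight)
    {
        pthread_mutex_init(&l->lock, NULL);
        pthread_cond_init(&l->cond, NULL);
        l->inflight = 0;
        l->max_inflight = max_inflight;
    }

    /* Call before submitting a request to the backend. */
    static void limiter_acquire(InflightLimiter *l)
    {
        pthread_mutex_lock(&l->lock);
        while (l->inflight >= l->max_inflight) {
            pthread_cond_wait(&l->cond, &l->lock);
        }
        l->inflight++;
        pthread_mutex_unlock(&l->lock);
    }

    /* Call from the completion callback once the request is acked. */
    static void limiter_release(InflightLimiter *l)
    {
        pthread_mutex_lock(&l->lock);
        l->inflight--;
        pthread_cond_signal(&l->cond);
        pthread_mutex_unlock(&l->lock);
    }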
> I honestly believe that this is the second case. I have had your pull in my dumpling branch since mid-February, but the number of near-OOM events we have had to handle has stayed about the same over the last few months compared to earlier, with the excess ranging from a hundred megabytes to a gigabyte on top of the VM's theoretical maximum consumption. Since the issue is highly transient by nature (the RSS can grow fast, shrink fast, and still eventually hit the cgroup limit), I have only a bare reproducer and a couple of indirect symptoms driving my thinking in the direction above; there is still no direct confirmation that unfinished disk requests always cause unbounded additional memory allocation.

Could you run massif on one of these guests with a problematic workload to see where most of the memory is being used? Like in this bug report, where it pointed to reads for cow as the culprit:

http://tracker.ceph.com/issues/6494#note-1
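For anyone trying to reproduce this, a massif run against the affected guest would look roughly like the commands below; the binary path and guest options are placeholders for whatever the VM normally uses, and --pages-as-heap=yes makes massif account for all mapped pages rather than only the malloc heap:

    valgrind --tool=massif --pages-as-heap=yes /usr/bin/qemu-system-x86_64 <usual guest options>
    ms_print massif.out.<pid>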