From: Josh Durgin
Date: Wed, 26 Aug 2015 17:56:03 -0700
Subject: Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
To: Andrey Korolyov
Cc: Benoît Canet, Chris Friesen, qemu-devel@nongnu.org, Paolo Bonzini
Message-ID: <55DE6023.10709@redhat.com>

On 08/26/2015 04:47 PM, Andrey Korolyov wrote:
> On Thu, Aug 27, 2015 at 2:31 AM, Josh Durgin wrote:
>> On 08/26/2015 10:10 AM, Andrey Korolyov wrote:
>>> On Thu, May 14, 2015 at 4:42 PM, Andrey Korolyov wrote:
>>>> On Wed, Aug 27, 2014 at 9:43 AM, Chris Friesen wrote:
>>>>> On 08/25/2014 03:50 PM, Chris Friesen wrote:
>>>>>
>>>>>> I think I might have a glimmering of what's going on. Someone please correct me if I get something wrong.
>>>>>>
>>>>>> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with respect to the maximum number of in-flight operations, and neither does virtio-blk calling virtio_add_queue() with a queue size of 128.
>>>>>>
>>>>>> I think what's happening is that virtio_blk_handle_output() spins, pulling data off the 128-entry queue and calling virtio_blk_handle_request(). At that point the queue entry can be reused, so the queue size isn't really relevant.
>>>>>>
>>>>>> In virtio_blk_handle_write() we add the request to a MultiReqBuffer, and every 32 writes we call virtio_submit_multiwrite(), which calls down into bdrv_aio_multiwrite(). That tries to merge requests and then, for each resulting request, calls bdrv_aio_writev(), which ends up calling qemu_rbd_aio_writev(), which calls rbd_start_aio().
>>>>>>
>>>>>> rbd_start_aio() allocates a buffer and converts from an iovec to a single buffer. This buffer stays allocated until the request is acked, which is where the bulk of the memory overhead with rbd is coming from (has anyone considered adding iovec support to rbd to avoid this extra copy?).
>>>>>>
>>>>>> The only limit I see in the whole call chain from virtio_blk_handle_request() on down is the call to bdrv_io_limits_intercept() in bdrv_co_do_writev(). However, that doesn't provide any limit on the absolute number of in-flight operations, only on operations/sec. If the ceph server cluster can't keep up with the aggregate load, then the number of in-flight operations can still grow indefinitely.
>>>>>>
>>>>>> Chris
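The bounce-buffer conversion described above is the per-request memory cost in question. As a rough illustration only (the helper name flatten_iov is made up here, this is not the actual rbd_start_aio() code), flattening a scatter/gather list looks like the sketch below; the resulting allocation has to live until the backend acknowledges the request, so every in-flight write pins a full contiguous copy of its payload:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>

    /* Copy an iovec into one contiguous buffer.  The caller keeps the
     * buffer until the AIO completion fires, which is why a burst of
     * unacknowledged writes directly inflates RSS. */
    static char *flatten_iov(const struct iovec *iov, int iovcnt, size_t *total)
    {
        size_t len = 0, off = 0;
        for (int i = 0; i < iovcnt; i++) {
            len += iov[i].iov_len;
        }
        char *buf = malloc(len);
        if (!buf) {
            return NULL;
        }
        for (int i = 0; i < iovcnt; i++) {
            memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
            off += iov[i].iov_len;
        }
        *total = len;
        return buf;
    }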
>>>>> I was a bit concerned that I'd need to extend the I/O throttling code to support a limit on total in-flight bytes, but it doesn't look like that will be necessary.
>>>>>
>>>>> It seems that using mallopt() to set the trim/mmap thresholds to 128K is enough to minimize the increase in RSS and also drop it back down after an I/O burst. For now this looks like it should be sufficient for our purposes.
>>>>>
>>>>> I'm actually a bit surprised I didn't have to go lower, but it seems to work for both "dd" and dbench test cases, so we'll give it a try.
>>>>>
>>>>> Chris
>>>>
>>>> Bumping this...
>>>>
>>>> We are still occasionally hitting an unbounded cache growth issue that can be observed on all post-1.4 versions of qemu with the rbd backend in writeback mode and a certain pattern of guest operations. The issue is confirmed for virtio and can be re-triggered by issuing an excessive number of write requests without completing the acks returned from the emulator's cache in a timely manner. Since most applications behave correctly, the OOM issue is very rare (and we developed an ugly workaround for such situations long ago). If anybody is interested in fixing this, I can send a prepared image for reproduction, or instructions to make one, whichever is preferable.
>>>>
>>>> Thanks!
>>>
>>> A gentle bump: for at least the rbd backend with a writethrough/writeback cache it is possible to achieve unlimited growth with a lot of large unfinished ops, which can be considered a DoS. Usually it is triggered by poorly written applications in the wild, like proprietary KV databases or MSSQL under Windows, but regular applications, primarily OSS databases, can easily push RSS growth to hundreds of megabytes. There is probably no straightforward way to limit in-flight request size by re-chunking it, since a malicious guest could inflate it to very high numbers, but it's fine to crash such a guest; protecting real-world workloads with a simple in-flight op count limiter looks like the more achievable option.
>>
>> Hey, sorry I missed this thread before.
>>
>> What version of ceph are you running? There was an issue with ceph 0.80.8 and earlier that could cause lots of extra memory usage by rbd's cache (even in writethrough mode) due to copy-on-write triggering whole-object (default 4MB) reads and sticking those in the cache without proper throttling [1]. I'm wondering if this could be causing the large RSS growth you're seeing.
>>
>> In-flight requests do have buffers and structures allocated for them in librbd, but these should have lower overhead than cow. If these are the problem, it seems to me a generic limit on in-flight ops in qemu would be a reasonable fix. Other backends have resources tied up by in-flight ops as well.
>>
>> Josh
>>
>> [1] https://github.com/ceph/ceph/pull/3410
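For reference, the mallopt() tuning Chris describes a few messages up amounts to something like the sketch below (the wrapper name is made up; glibc also honors the MALLOC_TRIM_THRESHOLD_ and MALLOC_MMAP_THRESHOLD_ environment variables if calling mallopt() from the process itself is impractical):

    #include <malloc.h>

    /* Lower glibc's trim and mmap thresholds to 128K so memory freed after
     * an I/O burst is returned to the kernel instead of being retained in
     * the malloc arena, which is what otherwise shows up as lingering RSS. */
    static void tune_malloc_for_bursty_io(void)
    {
        mallopt(M_TRIM_THRESHOLD, 128 * 1024);
        mallopt(M_MMAP_THRESHOLD, 128 * 1024);
    }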
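The "generic limit on in-flight ops" suggested above could take roughly the following shape. This is only a sketch of the idea in plain pthreads with invented names (InflightLimiter and friends); QEMU's block layer would express the waiting with coroutines rather than a blocking condition variable:

    #include <pthread.h>

    /* A counter plus a condition variable: submitters block once the
     * backend already has max_inflight requests outstanding, and the AIO
     * completion path wakes them up again. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        unsigned inflight;
        unsigned max_inflight;
    } InflightLimiter;

    static void limiter_init(InflightLimiter *l, unsigned max_inflight)
    {
        pthread_mutex_init(&l->lock, NULL);
        pthread_cond_init(&l->cond, NULL);
        l->inflight = 0;
        l->max_inflight = max_inflight;
    }

    /* Call before submitting a request to the backend. */
    static void limiter_acquire(InflightLimiter *l)
    {
        pthread_mutex_lock(&l->lock);
        while (l->inflight >= l->max_inflight) {
            pthread_cond_wait(&l->cond, &l->lock);
        }
        l->inflight++;
        pthread_mutex_unlock(&l->lock);
    }

    /* Call from the completion callback once the request is acked. */
    static void limiter_release(InflightLimiter *l)
    {
        pthread_mutex_lock(&l->lock);
        l->inflight--;
        pthread_cond_signal(&l->cond);
        pthread_mutex_unlock(&l->lock);
    }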
> I honestly believe that this is the second case. I have had your pull in my dumpling branch since mid-February, but the number of near-OOM events we have had to handle has stayed about the same over the last few months compared to earlier, with the excess ranging from a hundred megabytes to a gigabyte on top of the VM's theoretical maximum consumption. Since the issue is highly transient by nature (the RSS can grow fast, shrink fast, and still eventually hit the cgroup limit), I have only a bare reproducer and a couple of indirect symptoms driving my thinking in the direction above; there is still no direct confirmation that unfinished disk requests always cause unbounded additional memory allocation.

Could you run massif on one of these guests with a problematic workload to see where most of the memory is being used? Like in this bug report, where it pointed to reads for cow as the culprit:

http://tracker.ceph.com/issues/6494#note-1
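For anyone trying to reproduce this, a massif run against the affected guest would look roughly like the commands below; the binary path and guest options are placeholders for whatever the VM normally uses, and --pages-as-heap=yes makes massif account for all mapped pages rather than only the malloc heap:

    valgrind --tool=massif --pages-as-heap=yes /usr/bin/qemu-system-x86_64 <usual guest options>
    ms_print massif.out.<pid>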