From: Josh Durgin <jdurgin@redhat.com>
To: Andrey Korolyov <andrey@xdel.ru>
Cc: "Benoît Canet" <benoit.canet@irqsave.net>,
	"Chris Friesen" <chris.friesen@windriver.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
Date: Wed, 26 Aug 2015 17:56:03 -0700
Message-ID: <55DE6023.10709@redhat.com>
In-Reply-To: <CABYiri9S7zCPMc5m9FZN8Uhb9d8Aq_C-SFAxPaZZ5jEAO7QSzg@mail.gmail.com>

On 08/26/2015 04:47 PM, Andrey Korolyov wrote:
> On Thu, Aug 27, 2015 at 2:31 AM, Josh Durgin <jdurgin@redhat.com> wrote:
>> On 08/26/2015 10:10 AM, Andrey Korolyov wrote:
>>>
>>> On Thu, May 14, 2015 at 4:42 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>
>>>> On Wed, Aug 27, 2014 at 9:43 AM, Chris Friesen
>>>> <chris.friesen@windriver.com> wrote:
>>>>>
>>>>> On 08/25/2014 03:50 PM, Chris Friesen wrote:
>>>>>
>>>>>> I think I might have a glimmering of what's going on.  Someone please
>>>>>> correct me if I get something wrong.
>>>>>>
>>>>>> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with
>>>>>> respect to max inflight operations, and neither does virtio-blk calling
>>>>>> virtio_add_queue() with a queue size of 128.
>>>>>>
>>>>>> I think what's happening is that virtio_blk_handle_output() spins,
>>>>>> pulling data off the 128-entry queue and calling
>>>>>> virtio_blk_handle_request().  At this point that queue entry can be
>>>>>> reused, so the queue size isn't really relevant.
>>>>>>
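
For reference, the loop being described looks roughly like this (a
simplified sketch of the qemu code path of that era; exact function and
field names vary between versions):

    static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
    {
        VirtIOBlock *s = VIRTIO_BLK(vdev);
        VirtIOBlockReq *req;
        MultiReqBuffer mrb = {};

        /* Pop descriptors for as long as the guest has queued them; each
         * popped entry frees its vring slot immediately, so the 128-entry
         * ring does not bound the number of requests handed to the
         * block layer. */
        while ((req = virtio_blk_get_request(s)) != NULL) {
            virtio_blk_handle_request(req, &mrb);
        }

        /* Submit any writes still batched in the MultiReqBuffer. */
        virtio_submit_multiwrite(s->bs, &mrb);
    }
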
>>>>>> In virtio_blk_handle_write() we add the request to a MultiReqBuffer and
>>>>>> every 32 writes we'll call virtio_submit_multiwrite() which calls down
>>>>>> into bdrv_aio_multiwrite().  That tries to merge requests and then for
>>>>>> each resulting request calls bdrv_aio_writev() which ends up calling
>>>>>> qemu_rbd_aio_writev(), which calls rbd_start_aio().
>>>>>>
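
The batching step referred to there is, again in simplified form:

    /* Inside virtio_blk_handle_write() (sketch; field names approximate).
     * Once 32 writes have accumulated, the whole batch goes down to
     * bdrv_aio_multiwrite() and completes asynchronously. */
    if (mrb->num_writes == 32) {
        virtio_submit_multiwrite(req->dev->bs, mrb);
    }
    mrb->blkreq[mrb->num_writes].qiov   = &req->qiov;
    mrb->blkreq[mrb->num_writes].cb     = virtio_blk_rw_complete;
    mrb->blkreq[mrb->num_writes].opaque = req;
    mrb->num_writes++;
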
>>>>>> rbd_start_aio() allocates a buffer and converts from iovec to a single
>>>>>> buffer.  This buffer stays allocated until the request is acked, which
>>>>>> is where the bulk of the memory overhead with rbd is coming from (has
>>>>>> anyone considered adding iovec support to rbd to avoid this extra
>>>>>> copy?).
>>>>>>
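
The allocation in question, from block/rbd.c (simplified, error handling
omitted):

    /* rbd_start_aio(): the scatter-gather list is flattened into one
     * bounce buffer, which stays allocated until librbd acknowledges
     * the request. */
    acb->bounce = qemu_blockalign(bs, qiov->size);
    if (cmd == RBD_AIO_WRITE) {
        qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
    }
    rbd_aio_create_completion(rcb, (rbd_callback_t) rbd_finish_aiocb, &c);
    rbd_aio_write(s->image, off, size, acb->bounce, c);
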
>>>>>> The only limit I see in the whole call chain from
>>>>>> virtio_blk_handle_request() on down is the call to
>>>>>> bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that
>>>>>> doesn't provide any limit on the absolute number of inflight
>>>>>> operations,
>>>>>> only on operations/sec.  If the ceph server cluster can't keep up with
>>>>>> the aggregate load, then the number of inflight operations can still
>>>>>> grow indefinitely.
>>>>>>
>>>>>> Chris
>>>>>
>>>>>
>>>>>
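
To illustrate that last point: the throttling intercepted there is
configured purely as a rate, for example

    # rate limits only; nothing here caps how many requests may be
    # outstanding at once if the backend falls behind
    -drive file=rbd:pool/image,if=virtio,cache=writeback,iops=500,bps=104857600

so a slow ceph cluster can still accumulate an arbitrary number of
in-flight requests while staying within that budget.
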
>>>>> I was a bit concerned that I'd need to extend the IO throttling code to
>>>>> support a limit on total inflight bytes, but it doesn't look like that
>>>>> will
>>>>> be necessary.
>>>>>
>>>>> It seems that using mallopt() to set the trim/mmap thresholds to 128K is
>>>>> enough to minimize the increase in RSS and also drop it back down after
>>>>> an
>>>>> I/O burst.  For now this looks like it should be sufficient for our
>>>>> purposes.
>>>>>
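
A minimal sketch of that mallopt() tuning (glibc-specific; 128K thresholds
as described above):

    #include <malloc.h>

    /* Return freed memory to the kernel more aggressively: trim the heap
     * and switch to mmap for allocations above 128K, so RSS drops back
     * down once a burst of large I/O buffers is freed. */
    mallopt(M_TRIM_THRESHOLD, 128 * 1024);
    mallopt(M_MMAP_THRESHOLD, 128 * 1024);
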
>>>>> I'm actually a bit surprised I didn't have to go lower, but it seems to
>>>>> work
>>>>> for both "dd" and dbench testcases so we'll give it a try.
>>>>>
>>>>> Chris
>>>>>
>>>>
>>>> Bumping this...
>>>>
>>>> For now, we still occasionally run into an unbounded cache growth
>>>> issue that can be observed on all post-1.4 versions of qemu with the
>>>> rbd backend in writeback mode and a certain pattern of guest
>>>> operations. The issue is confirmed for virtio and can be re-triggered
>>>> by issuing an excessive number of write requests without waiting for
>>>> the acks returned by the emulator's cache in a timely manner. Since
>>>> most applications behave correctly, the OOM issue is very rare (and
>>>> we developed an ugly workaround for such situations long ago). If
>>>> anybody is interested in fixing this, I can send a prepared image for
>>>> reproduction, or instructions to make one, whichever is preferable.
>>>>
>>>> Thanks!
>>>
>>>
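
A hypothetical guest-side workload that keeps a large number of writes
outstanding might look like the sketch below (libaio with O_DIRECT; this
is only an illustration, not the prepared reproducer image mentioned
above, and the target device /dev/vdb is an assumption):

    /* flood.c - keep many large writes in flight.
     * Build: gcc -O2 flood.c -o flood -laio
     * Error handling omitted for brevity. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>
    #include <string.h>

    #define DEPTH    512             /* outstanding requests */
    #define BUF_SIZE (1024 * 1024)   /* 1 MiB per request */

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb iocbs[DEPTH], *ptrs[DEPTH];
        struct io_event events[DEPTH];
        int fd = open("/dev/vdb", O_WRONLY | O_DIRECT);

        io_setup(DEPTH, &ctx);
        for (int i = 0; i < DEPTH; i++) {
            void *buf;
            posix_memalign(&buf, 4096, BUF_SIZE);
            memset(buf, 0xab, BUF_SIZE);
            io_prep_pwrite(&iocbs[i], fd, buf, BUF_SIZE,
                           (long long)i * BUF_SIZE);
            ptrs[i] = &iocbs[i];
        }

        for (;;) {
            /* Submit the whole batch, then reap completions, keeping the
             * device saturated with DEPTH requests at a time. */
            io_submit(ctx, DEPTH, ptrs);
            io_getevents(ctx, DEPTH, DEPTH, events, NULL);
        }
    }
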
>>> A gentle bump: at least for the rbd backend with a writethrough or
>>> writeback cache, it is possible to achieve unbounded memory growth
>>> with a lot of large unfinished ops, which can be considered a DoS.
>>> Usually it is triggered by poorly written applications in the wild,
>>> like proprietary KV databases or MSSQL under Windows, but regular
>>> applications, primarily OSS databases, can easily push the RSS growth
>>> into the hundreds of megabytes. There is probably no straightforward
>>> way to limit in-flight request size by re-chunking it, since a
>>> malicious guest could inflate the count to very high numbers anyway
>>> (and it is fine to crash such a guest); protecting real-world
>>> workloads with a simple in-flight op count limiter looks like the
>>> more achievable option.
>>
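
One possible shape for such a limiter, sketched here generically with a
counting semaphore rather than as actual qemu code (the names and the
256 limit are invented for illustration):

    #include <semaphore.h>

    #define MAX_INFLIGHT_OPS 256

    static sem_t inflight_slots;   /* initialised to MAX_INFLIGHT_OPS */

    void limiter_init(void)
    {
        sem_init(&inflight_slots, 0, MAX_INFLIGHT_OPS);
    }

    /* Called before handing a request to the backend: blocks once
     * MAX_INFLIGHT_OPS requests are outstanding, bounding the memory
     * pinned by unacknowledged requests. */
    void limiter_acquire(void)
    {
        sem_wait(&inflight_slots);
    }

    /* Called from the request's completion callback. */
    void limiter_release(void)
    {
        sem_post(&inflight_slots);
    }

A real implementation inside qemu would queue the request in the block
layer's coroutine machinery instead of blocking a thread, but the
accounting would be the same.
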
>>
>> Hey, sorry I missed this thread before.
>>
>> What version of ceph are you running? There was an issue with ceph
>> 0.80.8 and earlier that could cause lots of extra memory usage by rbd's
>> cache (even in writethrough mode) due to copy-on-write triggering
>> whole-object (default 4MB) reads, and sticking those in the cache without
>> proper throttling [1]. I'm wondering if this could be causing the large
>> RSS growth you're seeing.
>>
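
(For anyone checking their setup: the librbd cache bounds are client-side
settings in ceph.conf, typically along these lines; the values shown are
my understanding of the stock defaults:)

    [client]
    rbd cache = true
    rbd cache size = 33554432        # 32 MiB
    rbd cache max dirty = 25165824   # 24 MiB
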
>> In-flight requests do have buffers and structures allocated for them in
>> librbd, but these should have lower overhead than the cow reads. If
>> these are the problem, it seems to me a generic limit on in-flight ops
>> in qemu would be a reasonable fix. Other backends have resources tied
>> up by in-flight ops as well.
>>
>> Josh
>>
>> [1] https://github.com/ceph/ceph/pull/3410
>>
>>
>>
>
> I honestly believe that this is the second case. I have had your pull
> in my dumpling branch since mid-February, but the number of 'near-OOM
> to handle' events has stayed about the same over the last few months
> compared to earlier times, with the overshoot ranging from a hundred
> megabytes to a gigabyte beyond the theoretical top of the VM's
> consumption. Since the issue is very transient - the RSS can grow
> fast, shrink fast, and eventually hit the cgroup limit - I have only a
> bare reproducer and a couple of indirect symptoms driving my thoughts
> in the direction above; there is still no direct confirmation that
> unfinished disk requests are always the cause of the unbounded extra
> memory allocation.

Could you run massif on one of these guests with a problematic workload
to see where most of the memory is being used?
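
Roughly along these lines (the qemu arguments are whatever the guest
normally runs with):

    valgrind --tool=massif --pages-as-heap=yes \
        qemu-system-x86_64 <usual guest options>
    ms_print massif.out.<pid>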

Like in this bug report, where it pointed to reads for cow as the
culprit:

http://tracker.ceph.com/issues/6494#note-1


Thread overview: 40+ messages
2014-07-18 14:58 [Qemu-devel] is there a limit on the number of in-flight I/O operations? Chris Friesen
2014-07-18 15:24 ` Paolo Bonzini
2014-07-18 16:22   ` Chris Friesen
2014-07-18 20:13     ` Paolo Bonzini
2014-07-18 22:48       ` Chris Friesen
2014-07-19  5:49         ` Paolo Bonzini
2014-07-19  6:27           ` Chris Friesen
2014-07-19  7:23             ` Paolo Bonzini
2014-07-19  8:45               ` Benoît Canet
2014-07-21 14:59                 ` Chris Friesen
2014-07-21 15:15                   ` Benoît Canet
2014-07-21 15:35                     ` Chris Friesen
2014-07-21 15:54                       ` Benoît Canet
2014-07-21 16:10                       ` Benoît Canet
2014-08-23  0:59                         ` Chris Friesen
2014-08-23  7:56                           ` Benoît Canet
2014-08-25 15:12                             ` Chris Friesen
2014-08-25 17:43                               ` Chris Friesen
2015-08-27 16:37                                 ` Stefan Hajnoczi
2015-08-27 16:33                               ` Stefan Hajnoczi
2014-08-25 21:50                             ` Chris Friesen
2014-08-27  5:43                               ` Chris Friesen
2015-05-14 13:42                                 ` Andrey Korolyov
2015-08-26 17:10                                   ` Andrey Korolyov
2015-08-26 23:31                                     ` Josh Durgin
2015-08-26 23:47                                       ` Andrey Korolyov
2015-08-27  0:56                                         ` Josh Durgin [this message]
2015-08-27 16:48                               ` Stefan Hajnoczi
2015-08-27 17:05                                 ` Stefan Hajnoczi
2015-08-27 16:49                               ` Stefan Hajnoczi
2015-08-28  0:31                                 ` Josh Durgin
2015-08-28  8:31                                   ` Andrey Korolyov
2014-07-21 19:47                       ` Benoît Canet
2014-07-21 21:12                         ` Chris Friesen
2014-07-21 22:04                           ` Benoît Canet
2014-07-18 15:54 ` Andrey Korolyov
2014-07-18 16:26   ` Chris Friesen
2014-07-18 16:30     ` Andrey Korolyov
2014-07-18 16:46       ` Chris Friesen
     [not found] <1000957815.25879188.1441820902018.JavaMail.zimbra@redhat.com>
2015-09-09 18:51 ` Jason Dillaman
