From: Josh Durgin <jdurgin@redhat.com>
To: Andrey Korolyov <andrey@xdel.ru>,
	Chris Friesen <chris.friesen@windriver.com>
Cc: "Benoît Canet" <benoit.canet@irqsave.net>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
Date: Wed, 26 Aug 2015 16:31:12 -0700	[thread overview]
Message-ID: <55DE4C40.3000306@redhat.com> (raw)
In-Reply-To: <CABYiri-y+LQLSUV2hnsE_C8mgAv=ex-_VLM9fGaHuZVRiBtyiQ@mail.gmail.com>

On 08/26/2015 10:10 AM, Andrey Korolyov wrote:
> On Thu, May 14, 2015 at 4:42 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>> On Wed, Aug 27, 2014 at 9:43 AM, Chris Friesen
>> <chris.friesen@windriver.com> wrote:
>>> On 08/25/2014 03:50 PM, Chris Friesen wrote:
>>>
>>>> I think I might have a glimmering of what's going on.  Someone please
>>>> correct me if I get something wrong.
>>>>
>>>> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with
>>>> respect to max inflight operations, and neither does virtio-blk calling
>>>> virtio_add_queue() with a queue size of 128.
>>>>
>>>> I think what's happening is that virtio_blk_handle_output() spins,
>>>> pulling data off the 128-entry queue and calling
>>>> virtio_blk_handle_request().  At this point that queue entry can be
>>>> reused, so the queue size isn't really relevant.
>>>>
>>>> In virtio_blk_handle_write() we add the request to a MultiReqBuffer and
>>>> every 32 writes we'll call virtio_submit_multiwrite() which calls down
>>>> into bdrv_aio_multiwrite().  That tries to merge requests and then for
>>>> each resulting request calls bdrv_aio_writev() which ends up calling
>>>> qemu_rbd_aio_writev(), which calls rbd_start_aio().
>>>>
>>>> rbd_start_aio() allocates a buffer and converts from iovec to a single
>>>> buffer.  This buffer stays allocated until the request is acked, which
>>>> is where the bulk of the memory overhead with rbd is coming from (has
>>>> anyone considered adding iovec support to rbd to avoid this extra copy?).
>>>>
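[The iovec-to-single-buffer copy described above can be sketched roughly as
follows. This is a simplified stand-alone illustration, not the actual
rbd_start_aio() code; the helper name iov_flatten is invented for the
example.]

```c
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical helper: flatten a scatter-gather list into one
 * contiguous buffer, as qemu's rbd driver must do because librbd's
 * classic aio calls take a single (char *, size_t) pair rather than
 * an iovec.  The buffer must stay allocated until the request is
 * acked, which is where the extra per-request memory comes from. */
static char *iov_flatten(const struct iovec *iov, int iovcnt, size_t *total)
{
    size_t len = 0;
    for (int i = 0; i < iovcnt; i++) {
        len += iov[i].iov_len;
    }
    char *buf = malloc(len);
    if (!buf) {
        return NULL;
    }
    size_t off = 0;
    for (int i = 0; i < iovcnt; i++) {
        memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }
    *total = len;
    return buf;
}
```

[Each in-flight write therefore pins a second copy of its payload until
completion, which is why the number of in-flight ops translates directly
into RSS growth.]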
>>>> The only limit I see in the whole call chain from
>>>> virtio_blk_handle_request() on down is the call to
>>>> bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that
>>>> doesn't provide any limit on the absolute number of inflight operations,
>>>> only on operations/sec.  If the ceph server cluster can't keep up with
>>>> the aggregate load, then the number of inflight operations can still
>>>> grow indefinitely.
>>>>
>>>> Chris
>>>
>>>
>>> I was a bit concerned that I'd need to extend the IO throttling code to
>>> support a limit on total inflight bytes, but it doesn't look like that will
>>> be necessary.
>>>
>>> It seems that using mallopt() to set the trim/mmap thresholds to 128K is
>>> enough to minimize the increase in RSS and also drop it back down after an
>>> I/O burst.  For now this looks like it should be sufficient for our
>>> purposes.
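[For reference, the mallopt() tuning described here would look something
like the following; this is glibc-specific and the function name is
invented for the example.]

```c
#include <malloc.h>

/* Lower glibc's trim and mmap thresholds to 128K so that memory
 * freed after an I/O burst is returned to the kernel rather than
 * retained in the heap, and larger allocations go through mmap()
 * (and so are unmapped on free).  mallopt() returns 1 on success. */
static int tune_malloc_for_bursts(void)
{
    int ok = 1;
    ok &= mallopt(M_TRIM_THRESHOLD, 128 * 1024);
    ok &= mallopt(M_MMAP_THRESHOLD, 128 * 1024);
    return ok;
}
```

[Note that setting these explicitly also disables glibc's dynamic
adjustment of the thresholds, which is what otherwise lets RSS creep up
after bursts.]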
>>>
>>> I'm actually a bit surprised I didn't have to go lower, but it seems to work
>>> for both "dd" and dbench testcases so we'll give it a try.
>>>
>>> Chris
>>>
>>
>> Bumping this...
>>
>> For now, we occasionally suffer from an unbounded cache-growth issue
>> that can be observed on all post-1.4 versions of qemu with an rbd
>> backend in writeback mode and a certain pattern of guest operations.
>> The issue is confirmed for virtio and can be re-triggered by issuing
>> an excessive number of write requests without timely completion of
>> the acks returned from the emulator's cache. Since most applications
>> behave correctly, the OOM issue is very rare (and we developed an
>> ugly workaround for such situations long ago). If anybody is
>> interested in fixing this, I can send a prepared image for
>> reproduction, or instructions to make one, whichever is preferable.
>>
>> Thanks!
>
> A gentle bump: for at least the rbd backend with a
> writethrough/writeback cache, it is possible to achieve unlimited
> growth with a lot of large unfinished ops, which can be considered a
> DoS. Usually it is triggered by poorly written applications in the
> wild, like proprietary KV databases or MSSQL under Windows, but
> regular applications, primarily OSS databases, can easily trigger RSS
> growth of hundreds of megabytes. There is probably no straightforward
> way to limit in-flight request size by re-chunking it, since a
> malicious guest could inflate it to very high numbers, but it's fine
> to crash such a guest; protecting real-world workloads with a simple
> in-flight op count limiter looks like the more achievable option.

Hey, sorry I missed this thread before.

What version of ceph are you running? There was an issue with ceph
0.80.8 and earlier that could cause lots of extra memory usage by rbd's
cache (even in writethrough mode) due to copy-on-write triggering 
whole-object (default 4MB) reads, and sticking those in the cache without
proper throttling [1]. I'm wondering if this could be causing the large
RSS growth you're seeing.

In-flight requests do have buffers and structures allocated for them in
librbd, but these should have lower overhead than CoW. If they are the
problem, it seems to me a generic limit on in-flight ops in qemu would
be a reasonable fix. Other backends have resources tied up by in-flight
ops as well.
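[A generic in-flight limit of the kind suggested here could be as simple
as a counter checked before each request is issued. This is a sketch
only, not qemu code; the type and function names are invented.]

```c
#include <stdbool.h>

/* Hypothetical per-device accounting: cap the absolute number of
 * in-flight requests, independently of the ops/sec throttling that
 * bdrv_io_limits_intercept() already provides. */
typedef struct {
    unsigned in_flight;
    unsigned max_in_flight;
} InflightLimit;

/* Called before submitting a request; on false the caller would
 * queue the request (or yield its coroutine) until a completion
 * calls inflight_release(). */
static bool inflight_try_acquire(InflightLimit *l)
{
    if (l->in_flight >= l->max_in_flight) {
        return false;
    }
    l->in_flight++;
    return true;
}

/* Called from the request's completion path. */
static void inflight_release(InflightLimit *l)
{
    l->in_flight--;
}
```

[In real qemu the check would live in the coroutine submission path and
block rather than fail, but the accounting would be the same.]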

Josh

[1] https://github.com/ceph/ceph/pull/3410


Thread overview: 40+ messages
2014-07-18 14:58 [Qemu-devel] is there a limit on the number of in-flight I/O operations? Chris Friesen
2014-07-18 15:24 ` Paolo Bonzini
2014-07-18 16:22   ` Chris Friesen
2014-07-18 20:13     ` Paolo Bonzini
2014-07-18 22:48       ` Chris Friesen
2014-07-19  5:49         ` Paolo Bonzini
2014-07-19  6:27           ` Chris Friesen
2014-07-19  7:23             ` Paolo Bonzini
2014-07-19  8:45               ` Benoît Canet
2014-07-21 14:59                 ` Chris Friesen
2014-07-21 15:15                   ` Benoît Canet
2014-07-21 15:35                     ` Chris Friesen
2014-07-21 15:54                       ` Benoît Canet
2014-07-21 16:10                       ` Benoît Canet
2014-08-23  0:59                         ` Chris Friesen
2014-08-23  7:56                           ` Benoît Canet
2014-08-25 15:12                             ` Chris Friesen
2014-08-25 17:43                               ` Chris Friesen
2015-08-27 16:37                                 ` Stefan Hajnoczi
2015-08-27 16:33                               ` Stefan Hajnoczi
2014-08-25 21:50                             ` Chris Friesen
2014-08-27  5:43                               ` Chris Friesen
2015-05-14 13:42                                 ` Andrey Korolyov
2015-08-26 17:10                                   ` Andrey Korolyov
2015-08-26 23:31                                     ` Josh Durgin [this message]
2015-08-26 23:47                                       ` Andrey Korolyov
2015-08-27  0:56                                         ` Josh Durgin
2015-08-27 16:48                               ` Stefan Hajnoczi
2015-08-27 17:05                                 ` Stefan Hajnoczi
2015-08-27 16:49                               ` Stefan Hajnoczi
2015-08-28  0:31                                 ` Josh Durgin
2015-08-28  8:31                                   ` Andrey Korolyov
2014-07-21 19:47                       ` Benoît Canet
2014-07-21 21:12                         ` Chris Friesen
2014-07-21 22:04                           ` Benoît Canet
2014-07-18 15:54 ` Andrey Korolyov
2014-07-18 16:26   ` Chris Friesen
2014-07-18 16:30     ` Andrey Korolyov
2014-07-18 16:46       ` Chris Friesen
     [not found] <1000957815.25879188.1441820902018.JavaMail.zimbra@redhat.com>
2015-09-09 18:51 ` Jason Dillaman
