Re: [PATCH] coroutine: cap per-thread local pool size

From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>,
	qemu-devel@nongnu.org, Sanjay Rao <srao@redhat.com>,
	Boaz Ben Shabat <bbenshab@redhat.com>,
	Joe Mario <jmario@redhat.com>
Subject: Re: [PATCH] coroutine: cap per-thread local pool size
Date: Tue, 19 Mar 2024 17:10:58 +0000
Message-ID: <ZfnHIv9W-tVoF4Bm@redhat.com>
In-Reply-To: <ZfnDTkh5CCHX1WFK@redhat.com>

On Tue, Mar 19, 2024 at 05:54:38PM +0100, Kevin Wolf wrote:
> On 19.03.2024 at 14:43, Daniel P. Berrangé wrote:
> > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > The coroutine pool implementation can hit the Linux vm.max_map_count
> > > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > > or "failed to set up stack guard page" during coroutine creation.
> > > 
> > > This happens because per-thread pools can grow to tens of thousands of
> > > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > 
> > This sounds quite alarming. What usage scenario is justified in
> > creating so many coroutines?
> 
> Basically we try to allow pooling coroutines for as many requests as
> there can be in flight at the same time. That is, adding a virtio-blk
> device increases the maximum pool size by num_queues * queue_size. If
> you have a guest with many CPUs, the default num_queues is relatively
> large (the bug referenced by Stefan had 64), and queue_size is 256 by
> default. That's 16k potential requests in flight per disk.

If we have more than one virtio-blk device, does the maximum number
of coroutines scale up with them too?

E.g. would 32 virtio-blk devices imply 16k * 32 -> 512k potential
requests/coroutines?

> > IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> > coroutines implies 10's of GB of memory just on stacks alone.
> 
> That's only virtual memory, though. Not sure how much of it is actually
> used in practice.

True, by default Linux doesn't care much about virtual memory. Total
virtual memory only becomes relevant if 'vm.overcommit_memory' is
changed from its default, such that Linux applies an overcommit ratio
on RAM.



> > > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > 
> > On my system max_map_count is 1048576, quite alot higher than
> > 10's of 1000's. Hitting that would imply ~500,000 coroutines and
> > ~500 GB of stacks !
> 
> Did you change the configuration some time in the past, or is this just
> a newer default? I get 65530, and that's the same default number I've
> seen in the bug reports.

It turns out it is a Fedora change, rather than a kernel change:

  https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount

> > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > index 5fd2dbaf8b..2790959eaf 100644
> > > --- a/util/qemu-coroutine.c
> > > +++ b/util/qemu-coroutine.c
> > 
> > > +static unsigned int get_global_pool_hard_max_size(void)
> > > +{
> > > +#ifdef __linux__
> > > +    g_autofree char *contents = NULL;
> > > +    int max_map_count;
> > > +
> > > +    /*
> > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > +     * VMAs.
> > > +     */
> > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > +                            NULL) &&
> > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > +        /*
> > > +         * This is a conservative upper bound that avoids exceeding
> > > +         * max_map_count. Leave half for non-coroutine users like library
> > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > +         * halve the amount again.
> > > +         */
> > > +        return max_map_count / 4;
> > 
> > That's 256,000 coroutines, which still sounds incredibly large
> > to me.
> 
> The whole purpose of the limitation is that you won't ever get -ENOMEM
> back, which will likely crash your VM. Even if this hard limit is high,
> that doesn't mean that it's fully used. Your setting of 1048576 probably
> means that you would never have hit the crash anyway.
> 
> Even the benchmarks that used to hit the problem don't even get close to
> this hard limit any more because the actual number of coroutines stays
> much smaller after applying this patch.

I'm thinking more about the worst-case behaviour that a malicious
guest can inflict on QEMU, causing unexpectedly high memory usage
on the host.

ENOMEM is bad for a friendly VM, but there's also the risk to the
host from an unfriendly VM exploiting the high limits.

> 
> > > +    }
> > > +#endif
> > > +
> > > +    return UINT_MAX;
> > 
> > Why UINT_MAX as a default ?  If we can't read procfs, we should
> > assume some much smaller sane default IMHO, that corresponds to
> > what current linux default max_map_count would be.
> 
> I don't think we should artificially limit the pool size and with this
> potentially limit the performance with it even if the host could do more
> if we only allowed it to. If we can't read it from procfs, then it's
> your responsibility as a user to make sure that it's large enough for
> your VM configuration.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
