From: "Michael S. Tsirkin" <mst@redhat.com>
To: David Stevens <stevensd@chromium.org>
Cc: virtualization@lists.linux-foundation.org
Subject: Re: [PATCH v2] virtio_balloon: add param to skip adjusting pages
Date: Wed, 24 Nov 2021 03:37:56 -0500 [thread overview]
Message-ID: <20211124032622-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <CAD=HUj5wPYLKJxsjgcnMu_NYQ6eMwmd-VDU0gbWbqgzOPkV6fg@mail.gmail.com>
On Wed, Nov 24, 2021 at 01:55:16PM +0900, David Stevens wrote:
> > >> And if you look at things from the context of
> > >> a specific userspace process, there will be other processes running
> > >> and using memory. So while that statement is true with respect to this
> > >> change, that is also true without this change. The specific details
> > >> might be changed by the proposed parameter, but it wouldn't be
> > >> introducing any fundamentally new behavior to Linux.
> > >>
> > >
> > > Please note that the hyper-v balloon just recently switched to using
> > > adjust_managed_page_count() for proper accounting reasons:
> > >
> > > commit d1df458cbfdb0c3384c03c7fbcb1689bc02a746c
> > > Author: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > Date: Wed Dec 2 17:12:45 2020 +0100
> > >
> > > hv_balloon: do adjust_managed_page_count() when ballooning/un-ballooning
> > >
> > > Unlike virtio_balloon/virtio_mem/xen balloon drivers, Hyper-V balloon driver
> > > does not adjust managed pages count when ballooning/un-ballooning and this leads
> > > to incorrect stats being reported, e.g. unexpected 'free' output.
> > >
> > > Note, the calculation in post_status() seems to remain correct: ballooned out
> > > pages are never 'available' and we manually add dm->num_pages_ballooned to
> > > 'commited'.
> > >
>
> I saw this commit, but it wasn't entirely clear to me what problem it
> was addressing. Is it the issue Michael pointed out on v1 of my
> patch set, where memory in the balloon shouldn't be included in the
> free stat reported to the device? This version of my patch should
> address that specific issue. The managed page count is metadata
> specific to the Linux kernel, so there's no fundamental reason that it
> needs to line up exactly with anything reported via the virtio-balloon
> API.
>
> > >> That approach would require a lot of changes to userspace - probably
> > >> nearly everywhere that uses _SC_PHYS_PAGES or get_phys_pages, or
> > >> anywhere that parses /proc/meminfo. Actually properly using "logically
> > >> offline pages" would require an additional API for monitoring changes
> > >> to the value, and updating to that sort of listener API would not be a
> > >> localized change, especially since most programs do not account for
> > >> memory hotplug and just use the amount of physical memory during
> > >> initialization. Realistically, nearly all of the callers would simply
> > >> add together "logically offline pages" and MemTotal.
> > >
> > > I'd appreciate a more generic approach for user space to figure out the
> > > "initial memory size" in a virtualized environment than adding
> > > some module parameter to virtio-balloon -- if that makes sense.
> > >
> > > MemTotal as is expresses how much memory the buddy currently manages,
> > > for example, excluding early allocations during boot, excluding actually
> > > unplugged memory and excluding logically unplugged memory. Adjusting that
> > > value makes perfect sense for virtio-balloon without deflate-on-oom.
> > >
>
> That's a definition of how MemTotal is implemented, but it's not
> really a specification of the MemTotal API. The closest thing to a
> real specification I can find is "Total usable RAM (i.e., physical RAM
> minus a few reserved bits and the kernel binary code)", from the proc
> man pages. I think there is quite a bit of leeway in changing how
> exactly MemTotal is implemented without violating the (quite vague)
> specification or changing any observable semantics of the API. In
> particular, leaving the pages in the balloon as part of MemTotal is
> essentially indistinguishable from simply having a non-OOM killable
> process locking an equivalent amount of memory. So this proposal isn't
> really introducing any fundamentally new behavior to the Linux kernel.
>
> > >> It's also not clear to me what utility the extra information would
> > >> provide to userspace. If userspace wants to know how much memory is
> > >> available, they should use MemAvailable. If userspace wants to have a
> > >> rough estimate for the maximum amount of memory in the system, they
> > >> would add together MemTotal and "logically offline pages". The value
> > >> of MemTotal with a no-deflate-on-oom virtio-balloon is a value with a
> > >> vague meaning that lies somewhere between the "maximum amount of
> > >> memory" and the "current amount of memory". I don't really see any
> > >> situations where it should clearly be used over one of MemAvailable or
> > >> MemTotal + "logically offline pages".
> > >
> > > The issue is that any application that relies on MemTotal in a virtualized
> > > environment is most probably already suboptimal in some cases. You can
> > > rely on it and actually later someone will unplug (inflate balloon)
> > > memory or plug (deflate balloon) memory. Even MemAvailable is suboptimal
> > > because what about two applications that rely on that information at
> > > the same time?
> > >
> >
> > BTW, the general issue here is that "we don't know what the hypervisor
> > will do".
>
> I do agree that this is a significant problem. I would expand on it a
> bit more, to be "since we don't know what the hypervisor will do, we
> don't know how to treat memory in the balloon". The proposed module
> parameter is more or less a mechanism to allow the system
> administrator to tell the virtio_balloon driver how the hypervisor
> behaves.
Now that you put it that way, it looks more like this should
be a feature bit, not a module parameter.
> And if the hypervisor will give memory back to the guest when
> the guest needs it, then I don't think it's necessary to logically
> unplug the memory.
Ideally we would also pair this with sending a signal to the device
that memory is needed.
> It might be a bit cleaner to explicitly address this in the
> virtio_balloon protocol. We could add a min_num_pages field to the
> balloon config, with semantics along the lines of "The host will
> respond to memory pressure in the guest by deflating the balloon down
> to min_num_pages, unless it would cause system instability in the
> host". Given that feature, I think it would be reasonable to only
> consider min_num_pages as logically unplugged.
Okay. I think I would do it a bit differently though, make num_pages be
the min_num_pages, and add an extra_num_pages field for memory that is
nice to have but ok to drop.
As long as we are here, can we add a page_shift field please
so more than 2^44 bytes can be requested?
> > Maybe "MemMax" actually could make sense, where we expose the maximum
> > "MemTotal" we had so far since we were up and running. So the semantics
> > wouldn't be "maximum possible", because we don't know that, but instead
> > "maximum we had".
>
> Rather than add a new API, I think it is much better to make existing
> APIs behave closer to how they behave in a non-virtualized
> environment. It is true that we could go through and fix a limited
> number of special user space applications, but sysconf(_SC_PHYS_PAGES)
> and /proc/meminfo are not special APIs. Fixing every application that
> uses them is not feasible, especially when taking into account systems
> with closed-source applications (e.g. Android). Also, while MemMax is
> well defined, it has the same issues you brought up earlier -
> specifically, applications don't know whether the hypervisor will
> actually ever provide MemMax again, and they don't know whether MemMax
> is actually the real maximum amount of memory that could be available
> in the future. It's not clear to me that it's significantly better or
> more useful to userspace than simply changing how MemTotal is
> implemented.
>
> -David
Agreed on trying to avoid changing applications; limiting changes
to the device and guest kernel has a lot of value.
--
MST