All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Alejandro Vallejo" <alejandro.vallejo@cloud.com>
To: "Jan Beulich" <jbeulich@suse.com>
Cc: "Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Anthony PERARD" <anthony.perard@vates.tech>,
	"Michal Orzel" <michal.orzel@amd.com>,
	"Julien Grall" <julien@xen.org>,
	"Roger Pau Monné" <roger.pau@citrix.com>,
	"Stefano Stabellini" <sstabellini@kernel.org>,
	xen-devel@lists.xenproject.org
Subject: Re: [PATCH] xen/page_alloc: Simplify domain_adjust_tot_pages
Date: Wed, 05 Mar 2025 13:22:58 +0000	[thread overview]
Message-ID: <D88D5732H4EQ.3770M7EIO3TW1@cloud.com> (raw)
In-Reply-To: <f50d8ee7-a00f-4f4f-99f6-4313af7a4fdc@suse.com>

On Wed Mar 5, 2025 at 10:49 AM GMT, Jan Beulich wrote:
> On 27.02.2025 15:36, Alejandro Vallejo wrote:
> > On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote:
> >> On 24.02.2025 15:49, Alejandro Vallejo wrote:
> >>> Open question to whoever reviews this...
> >>>
> >>> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
> >>>>      spin_lock(&heap_lock);
> >>>> -    /* adjust domain outstanding pages; may not go negative */
> >>>> -    dom_before = d->outstanding_pages;
> >>>> -    dom_after = dom_before - pages;
> >>>> -    BUG_ON(dom_before < 0);
> >>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> >>>> -    d->outstanding_pages = dom_claimed;
> >>>> -    /* flag accounting bug if system outstanding_claims would go negative */
> >>>> -    sys_before = outstanding_claims;
> >>>> -    sys_after = sys_before - (dom_before - dom_claimed);
> >>>> -    BUG_ON(sys_after < 0);
> >>>> -    outstanding_claims = sys_after;
> >>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> >>>> +    if ( pages > 0 && d->outstanding_pages < pages )
> >>>> +    {
> >>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> >>>> +        outstanding_claims -= d->outstanding_pages;
> >>>> +        d->outstanding_pages = 0;
> >>>
> >>> While this matches the previous behaviour, do we _really_ want it? It's weird,
> >>> quirky, and it hard to extend to NUMA-aware claims (which is something in
> >>> midway through).
> >>>
> >>> Wouldn't it make sense to fail the allocation (earlier) if the claim has run
> >>> out? Do we even expect this to ever happen this late in the allocation call
> >>> chain?
> >>
> >> This goes back to what a "claim" means. Even without any claim, a domain may
> >> allocate memory. So a claim having run out doesn't imply allocation has to
> >> fail.
> > 
> > Hmmm... but that violates the purpose of the claim infra as far as I understand
> > it. If a domain may overallocate by (e.g) ballooning in memory it can distort the
> > ability of another domain to start up, even if it succeeded in its own claim.
>
> Why would that be? As long as we hold back enough memory to cover the claim, it
> shouldn't matter what kind of allocation we want to process. I'd say that a PV
> guest starting ballooned ought to be able to deflate its balloon as far as
> there was a claim established for it up front.

The fact a domain allocated something does not mean it had it claimed before. A
claim is a restriction to others' ability to allocate/claim, not to the domain
that made the claim. e.g:

  0. host is idle. No domU.
  1. dom1 is created with a claim to 10% of host memory and max_mem of 80% of
     host meomory.
  2. dom1 balloons in 70% of host memory.
  3. dom1 ballons out 70% of host memory.
  4. dom1 now holds a a claim to 80% of host memory.

It's all quite perverse. Fortunately, looking at adjacent claims-related code
xl seems to default to making a claim prior to populating the physmap and
cancelling the claim at the end of the meminit() hook so this is never a real
problem.

This tells me that the logic intent is to force early failure of
populate_physmap and nothing else. It's never active by the time ballooning or
memory exchange matter at all.

Xen ought to cancel the claim by itself though, toolstack should not NEED to do
it.

And furthermore, the fact that the claim goes up and down interacts really
poorly with the per-node claims I want to implement, because it's just
impossible to know if a freed page actually comes from a prior decrease of a
claim or not.

>
> > We might also break the invariant that total claims are strictly >=
> > total_avail_pages.
>
> Same here - I don't see why this would happen as long as all accounting is
> working correctly.

See previous example.

>
> > I'm somewhat puzzled at the "why" of having separate concepts for max_mem and
> > claims. I guess it simply grew the way it did. Reinstating sanity would
> > probably involve making max_mem effectively the claim, but that's a ton of
> > work I really would rather not do for now.
>
> To me the two are different (beyond claim being global while max-mem is per-
> domain). max-mem is a hard boundary (beyond which allocations _will_ fail),
> whereas a claim is a softer one, beyond which allocations _may_ fail.
>
> Jan

What I meant is that I'd rather have a non-oversubscribed memory model by which
I can statically partition my memory and make 100% sure the sum of all max_mem
of every existing domain never exceeds host capacity. But this is neither the
intent of the existing claim infra nor what I'm after right now. I was just
thinking out loud in text form. :)

Cheers,
Alejandro


  reply	other threads:[~2025-03-05 13:23 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-24 13:27 [PATCH] xen/page_alloc: Simplify domain_adjust_tot_pages Alejandro Vallejo
2025-02-24 14:49 ` Alejandro Vallejo
2025-02-26 14:05   ` Jan Beulich
2025-02-27 14:36     ` Alejandro Vallejo
2025-03-05 10:49       ` Jan Beulich
2025-03-05 13:22         ` Alejandro Vallejo [this message]
2025-03-05 13:39           ` Jan Beulich
2025-03-05 14:55             ` Alejandro Vallejo
2025-03-11  9:46             ` Alejandro Vallejo
2025-03-11 10:06               ` Jan Beulich
2025-02-26 14:18   ` Roger Pau Monné
2025-02-26 13:56 ` Roger Pau Monné
2025-02-26 14:08   ` Jan Beulich
2025-02-26 14:28     ` Roger Pau Monné
2025-02-26 14:36       ` Jan Beulich
2025-02-26 16:04         ` Roger Pau Monné
2025-02-26 16:06           ` Jan Beulich
2025-02-26 16:34             ` Andrew Cooper
2025-02-26 16:42               ` Jan Beulich
2025-02-26 16:51                 ` Andrew Cooper
2025-02-27 14:39                   ` Alejandro Vallejo
2025-02-27 14:50       ` Alejandro Vallejo
2025-03-05 10:50         ` Jan Beulich
2025-02-26 14:02 ` Jan Beulich
2025-02-27 14:59   ` Alejandro Vallejo
2025-03-05 10:52     ` Jan Beulich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=D88D5732H4EQ.3770M7EIO3TW1@cloud.com \
    --to=alejandro.vallejo@cloud.com \
    --cc=andrew.cooper3@citrix.com \
    --cc=anthony.perard@vates.tech \
    --cc=jbeulich@suse.com \
    --cc=julien@xen.org \
    --cc=michal.orzel@amd.com \
    --cc=roger.pau@citrix.com \
    --cc=sstabellini@kernel.org \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.