All of lore.kernel.org
 help / color / mirror / Atom feed
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Tim Deegan <tim@xen.org>
Cc: dan.magenheimer@oracle.com, xen-devel@lists.xensource.com,
	keir@xen.org, keir.xen@gmail.com
Subject: Re: [PATCH 1/6] mmu: Introduce XENMEM_claim_pages (subop of memory ops).
Date: Tue, 5 Mar 2013 16:43:20 -0500	[thread overview]
Message-ID: <20130305214320.GB8235@phenom.dumpdata.com> (raw)
In-Reply-To: <20130305120156.GA1139@ocelot.phlegethon.org>

On Tue, Mar 05, 2013 at 12:01:56PM +0000, Tim Deegan wrote:
> Hi,
> 
> At 12:47 -0500 on 04 Mar (1362401229), Konrad Rzeszutek Wilk wrote:
> > --- a/xen/common/page_alloc.c
> > +++ b/xen/common/page_alloc.c
> > @@ -252,11 +252,124 @@ static long midsize_alloc_zone_pages;
> >  #define MIDSIZE_ALLOC_FRAC 128
> >  
> >  static DEFINE_SPINLOCK(heap_lock);
> > +static long total_unclaimed_pages; /* total outstanding claims by all domains */
> 
> Could this field (& associated per-domain stuff) have a better name?

Yes. d->outstanding_pages ?

> AFAICT from the code, this is a count of _claimed_ pages (specifically,
> claimed but not allocated).  It caused me to double-take almost every
> time I saw it used in this patch.

The word 'unclaimed' does not help either.

> 
> How about outstanding_claimed_pages, or just outstanding_claims?

outstanding_claims is a great name.

> 
> >  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >  {
> > +    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> > +
> >      ASSERT(spin_is_locked(&d->page_alloc_lock));
> > -    return d->tot_pages += pages;
> > +    d->tot_pages += pages;
> > +
> > +    /*
> > +     * can test d->claimed_pages race-free because it can only change
> > +     * if d->page_alloc_lock and heap_lock are both held, see also
> > +     * domain_set_unclaimed_pages below
> > +     */
> > +    if ( !d->unclaimed_pages )
> > +        goto out;
> > +
> > +    spin_lock(&heap_lock);
> > +    /* adjust domain unclaimed_pages; may not go negative */
> > +    dom_before = d->unclaimed_pages;
> > +    dom_after = dom_before - pages;
> > +    if ( (dom_before > 0) && (dom_after < 0) )
> > +        dom_claimed = 0;
> 
> Why the test for dom_before > 0 ?  I think it might be better to
> BUG_ON() any of these counts ever being negative, rather than to carry
> on regardless.

I put in a BUG_ON( dom_before <= 0 ) and simplified that logic to be

    BUG_ON(dom_before < 0);
    dom_claimed = dom_after < 0 ? 0 : dom_after;


> 
> Or is this meant to be a cunning way of handling the case where 
> sizeof (long) == 4 and unclaimed_pages > 2^31 ?  I suspect that
> will fall foul of the undefinedness of signed overflow.
> 
> > +    else
> > +        dom_claimed = dom_after;
> > +    d->unclaimed_pages = dom_claimed;
> > +    /* flag accounting bug if system unclaimed_pages would go negative */
> > +    sys_before = total_unclaimed_pages;
> > +    sys_after = sys_before - (dom_before - dom_claimed);
> > +    BUG_ON((sys_before > 0) && (sys_after < 0));
> 
> Same question here.

That can certainly be simplified to be:
	+    BUG_ON(sys_after < 0);

> 
> > +/*
> > + * XENMEM_claim_pages flags:
> > + *  normal: claim is successful only if sufficient free pages
> > + *    are available.
> > + *  tmem: claim is successful only if sufficient free pages
> > + *    are available and (if tmem is enabled) hypervisor
> > + *    may also consider tmem "freeable" pages to satisfy the claim.
> 
> The 'normal' restriction isn't actually enforced except at claim time.
> Since the gate in alloc_heap_pages is effectively:
> 
>   (request > free + freeable - total_reserved)
> 
> later allocations (from any source) can take up all the free memory as
> long as freeable >= total_reserved).
> 
> Is there a use for it?  Presumably by turning on tmem you're happy with
> the idea that allocations might have to come from freeable memory?


The big thing is *might*. I put this in the code path to explain better:

  /* note; The usage of tmem claim means that allocation from a guest *might*
+     * have to come from freeable memory. Using free memory is always better, if
+     * it is available, than using freeable memory. This flag is for the use
+     * case where the toolstack cannot be constantly aware of the exact current
+     * value of free and/or freeable on each machine and is multi-machine
+     * capable. It can try/fail a "normal" claim on all machines first then,
+     * and if the normal claim on all machines fail, then "fallback" to a
+     * tmem-flag type claim.
+     *
+     * The memory claimed by a normal claim isn't enforced against "freeable
+     * pages" once the claim is successful. That's by design. Once the claim
+     * has been made, it still can take minutes before the claim is fully
+     * satisfied. Tmem can make use of the unclaimed pages during this time
+     * (to store ephemeral/freeable pages only, not persistent pages).
+     */

Here is the updated patch (with the long git commit description removed).

diff --git a/xen/common/domain.c b/xen/common/domain.c
index b360de1..90ce40b 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -507,6 +507,7 @@ int domain_kill(struct domain *d)
         evtchn_destroy(d);
         gnttab_release_mappings(d);
         tmem_destroy(d->tmem);
+        domain_set_unclaimed_pages(d, 0, 0);
         d->tmem = NULL;
         /* fallthrough */
     case DOMDYING_dying:
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index b7f6619..c98e99c 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -154,6 +154,7 @@ void getdomaininfo(struct domain *d, struct xen_domctl_getdomaininfo *info)
 
     info->tot_pages         = d->tot_pages;
     info->max_pages         = d->max_pages;
+    info->outstanding_pages = d->outstanding_pages;
     info->shr_pages         = atomic_read(&d->shr_pages);
     info->paged_pages       = atomic_read(&d->paged_pages);
     info->shared_info_frame = mfn_to_gmfn(d, virt_to_mfn(d->shared_info));
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 08550ef..aa698b1 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -712,6 +712,37 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         break;
     }
 
+    case XENMEM_claim_pages:
+        if ( !IS_PRIV(current->domain) )
+            return -EPERM;
+
+        if ( copy_from_guest(&reservation, arg, 1) )
+            return -EFAULT;
+
+        if ( !guest_handle_is_null(reservation.extent_start) )
+            return -EINVAL;
+
+        if ( reservation.extent_order != 0 )
+            return -EINVAL;
+
+        d = rcu_lock_domain_by_id(reservation.domid);
+        if ( d == NULL )
+            return -EINVAL;
+
+        rc = domain_set_unclaimed_pages(d, reservation.nr_extents,
+                                        reservation.mem_flags);
+
+        rcu_unlock_domain(d);
+
+        break;
+
+    case XENMEM_get_unclaimed_pages:
+        if ( !IS_PRIV(current->domain) )
+            return -EPERM;
+
+        rc = get_outstanding_claims();
+        break;
+
     default:
         rc = arch_memory_op(op, arg);
         break;
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 9e9fb15..73e2392 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -252,11 +252,137 @@ static long midsize_alloc_zone_pages;
 #define MIDSIZE_ALLOC_FRAC 128
 
 static DEFINE_SPINLOCK(heap_lock);
+static long outstanding_claims; /* total outstanding claims by all domains */
 
 unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
 {
+    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
+
     ASSERT(spin_is_locked(&d->page_alloc_lock));
-    return d->tot_pages += pages;
+    d->tot_pages += pages;
+
+    /*
+     * can test d->claimed_pages race-free because it can only change
+     * if d->page_alloc_lock and heap_lock are both held, see also
+     * domain_set_unclaimed_pages below
+     */
+    if ( !d->outstanding_pages )
+        goto out;
+
+    spin_lock(&heap_lock);
+    /* adjust domain outstanding pages; may not go negative */
+    dom_before = d->outstanding_pages;
+    dom_after = dom_before - pages;
+    BUG_ON(dom_before < 0);
+    dom_claimed = dom_after < 0 ? 0 : dom_after;
+    d->outstanding_pages = dom_claimed;
+    /* flag accounting bug if system outstanding_claims would go negative */
+    sys_before = outstanding_claims;
+    sys_after = sys_before - (dom_before - dom_claimed);
+    BUG_ON(sys_after < 0);
+    outstanding_claims = sys_after;
+    spin_unlock(&heap_lock);
+
+out:
+    return d->tot_pages;
+}
+
+int domain_set_unclaimed_pages(struct domain *d, unsigned long pages,
+                                unsigned long flags)
+{
+    int ret = -ENOMEM;
+    unsigned long claim, avail_pages;
+
+    /*
+     * Sanity check.
+     */
+    switch ( flags ) {
+    case XENMEM_CLAIMF_normal:
+    case XENMEM_CLAIMF_tmem:
+        if ( pages == 0 )
+            return -EINVAL;
+        break;
+    case XENMEM_CLAIMF_cancel:
+        pages = 0;
+        break;
+    default:
+    return -ENOSYS;
+    }
+
+    /*
+     * take the domain's page_alloc_lock, else all d->tot_page adjustments
+     * must always take the global heap_lock rather than only in the much
+     * rarer case that d->outstanding_pages is non-zero
+     */
+    spin_lock(&d->page_alloc_lock);
+    spin_lock(&heap_lock);
+
+    /* pages==0 means "unset" the claim. */
+    if ( pages == 0 )
+    {
+        outstanding_claims -= d->outstanding_pages;
+        d->outstanding_pages = 0;
+        ret = 0;
+        goto out;
+    }
+
+    /* only one active claim per domain please */
+    if ( d->outstanding_pages )
+    {
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* disallow a claim not exceeding current tot_pages or above max_pages */
+    if ( (pages <= d->tot_pages) || (pages > d->max_pages) )
+    {
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* how much memory is available? */
+    avail_pages = total_avail_pages;
+    /* note; The usage of tmem claim means that allocation from a guest *might*
+     * have to come from freeable memory. Using free memory is always better, if
+     * it is available, than using freeable memory. This flag is for the use
+     * case where the toolstack cannot be constantly aware of the exact current
+     * value of free and/or freeable on each machine and is multi-machine
+     * capable. It can try/fail a "normal" claim on all machines first then,
+     * and if the normal claim on all machines fail, then "fallback" to a
+     * tmem-flag type claim.
+     *
+     * The memory claimed by a normal claim isn't enforced against "freeable
+     * pages" once the claim is successful. That's by design. Once the claim
+     * has been made, it still can take minutes before the claim is fully
+     * satisfied. Tmem can make use of the unclaimed pages during this time
+     * (to store ephemeral/freeable pages only, not persistent pages).
+     */
+    if ( flags & XENMEM_CLAIMF_tmem )
+        avail_pages += tmem_freeable_pages();
+    avail_pages -= outstanding_claims;
+
+    /*
+     * note, if domain has already allocated memory before making a claim
+     * then the claim must take tot_pages into account
+     */
+    claim = pages - d->tot_pages;
+    if ( claim > avail_pages )
+        goto out;
+
+    /* yay, claim fits in available memory, stake the claim, success! */
+    d->outstanding_pages = claim;
+    outstanding_claims += d->outstanding_pages;
+    ret = 0;
+
+out:
+    spin_unlock(&heap_lock);
+    spin_unlock(&d->page_alloc_lock);
+    return ret;
+}
+
+long get_outstanding_claims(void)
+{
+    return outstanding_claims;
 }
 
 static unsigned long init_node_heap(int node, unsigned long mfn,
@@ -397,7 +523,7 @@ static void __init setup_low_mem_virq(void)
 static void check_low_mem_virq(void)
 {
     unsigned long avail_pages = total_avail_pages +
-        (opt_tmem ? tmem_freeable_pages() : 0);
+        (opt_tmem ? tmem_freeable_pages() : 0) - outstanding_claims;
 
     if ( unlikely(avail_pages <= low_mem_virq_th) )
     {
@@ -466,6 +592,15 @@ static struct page_info *alloc_heap_pages(
     spin_lock(&heap_lock);
 
     /*
+     * Claimed memory is considered unavailable unless the request
+     * is made by a domain with sufficient unclaimed pages.
+     */
+    if ( (outstanding_claims + request >
+          total_avail_pages + tmem_freeable_pages()) &&
+          (d == NULL || d->outstanding_pages < request) )
+        goto not_found;
+
+    /*
      * TMEM: When available memory is scarce due to tmem absorbing it, allow
      * only mid-size allocations to avoid worst of fragmentation issues.
      * Others try tmem pools then fail.  This is a workaround until all
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index deb19db..113b8dc 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -36,7 +36,7 @@
 #include "grant_table.h"
 #include "hvm/save.h"
 
-#define XEN_DOMCTL_INTERFACE_VERSION 0x00000008
+#define XEN_DOMCTL_INTERFACE_VERSION 0x00000009
 
 /*
  * NB. xen_domctl.domain is an IN/OUT parameter for this operation.
@@ -95,6 +95,7 @@ struct xen_domctl_getdomaininfo {
     uint32_t flags;              /* XEN_DOMINF_* */
     uint64_aligned_t tot_pages;
     uint64_aligned_t max_pages;
+    uint64_aligned_t outstanding_pages;
     uint64_aligned_t shr_pages;
     uint64_aligned_t paged_pages;
     uint64_aligned_t shared_info_frame; /* GMFN of shared_info struct */
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 1c5ca19..5c264ca 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -68,6 +68,8 @@ struct xen_memory_reservation {
      *   IN:  GPFN bases of extents to populate with memory
      *   OUT: GMFN bases of extents that were allocated
      *   (NB. This command also updates the mach_to_phys translation table)
+     * XENMEM_claim_pages:
+     *   IN: must be zero
      */
     XEN_GUEST_HANDLE(xen_pfn_t) extent_start;
 
@@ -430,10 +432,51 @@ typedef struct xen_mem_sharing_op xen_mem_sharing_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
 
 /*
- * Reserve ops for future/out-of-tree "claim" patches (Oracle)
+ * Attempt to stake a claim for a domain on a quantity of pages
+ * of system RAM, but _not_ assign specific pageframes.  Only
+ * arithmetic is performed so the hypercall is very fast and need
+ * not be preemptible, thus sidestepping time-of-check-time-of-use
+ * races for memory allocation.  Returns 0 if the hypervisor page
+ * allocator has atomically and successfully claimed the requested
+ * number of pages, else non-zero.
+ *
+ * Any domain may have only one active claim.  When sufficient memory
+ * has been allocated to resolve the claim, the claim silently expires.
+ * Claiming zero pages effectively resets any outstanding claim and
+ * is always successful.
+ *
+ * Note that a valid claim may be staked even after memory has been
+ * allocated for a domain.  In this case, the claim is not incremental,
+ * i.e. if the domain's tot_pages is 3, and a claim is staked for 10,
+ * only 7 additional pages are claimed.
+ *
+ * Caller must be privileged or the hypercall fails.
  */
 #define XENMEM_claim_pages                  24
-#define XENMEM_get_unclaimed_pages          25
+
+/*
+ * XENMEM_claim_pages flags:
+ *  normal: claim is successful only if sufficient free pages
+ *    are available.
+ *  tmem: claim is successful only if sufficient free pages
+ *    are available and (if tmem is enabled) hypervisor
+ *    may also consider tmem "freeable" pages to satisfy the claim.
+ *  cancel: cancel the outstanding claim for the domain.
+ */
+#define XENMEM_CLAIMF_ignored                0
+
+#define _XENMEM_CLAIMF_normal                1
+#define XENMEM_CLAIMF_normal                 (1U<<_XENMEM_CLAIMF_normal)
+#define _XENMEM_CLAIMF_tmem                  2
+#define XENMEM_CLAIMF_tmem                   (1U<<_XENMEM_CLAIMF_tmem)
+#define _XENMEM_CLAIMF_cancel                3
+#define XENMEM_CLAIMF_cancel                 (1U<<_XENMEM_CLAIMF_cancel)
+/*
+ * Get the number of pages currently claimed (but not yet "possessed")
+ * across all domains.  The caller must be privileged but otherwise
+ * the call never fails.
+ */
+#define XENMEM_get_unclaimed_pages            25
 
 #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
 
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 2f701f5..8a355b6 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -49,7 +49,11 @@ void free_xenheap_pages(void *v, unsigned int order);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 
+/* Claim handling */
 unsigned long domain_adjust_tot_pages(struct domain *d, long pages);
+int domain_set_unclaimed_pages(
+    struct domain *d, unsigned long pages, unsigned long flags);
+long get_outstanding_claims(void);
 
 /* Domain suballocator. These functions are *not* interrupt-safe.*/
 void init_domheap_pages(paddr_t ps, paddr_t pe);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index e108436..569e76e 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -242,6 +242,7 @@ struct domain
     struct page_list_head page_list;  /* linked list */
     struct page_list_head xenpage_list; /* linked list (size xenheap_pages) */
     unsigned int     tot_pages;       /* number of pages currently possesed */
+    unsigned int     outstanding_pages; /* pages claimed but not possessed  */
     unsigned int     max_pages;       /* maximum value for tot_pages        */
     atomic_t         shr_pages;       /* number of shared pages             */
     atomic_t         paged_pages;     /* number of paged-out pages          */

  reply	other threads:[~2013-03-05 21:43 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-03-04 17:47 [PATCH v10] claim and its friends for allocating multiple self-ballooning guests Konrad Rzeszutek Wilk
2013-03-04 17:47 ` [PATCH 1/6] mmu: Introduce XENMEM_claim_pages (subop of memory ops) Konrad Rzeszutek Wilk
2013-03-04 21:50   ` Keir Fraser
2013-03-05 12:01   ` Tim Deegan
2013-03-05 21:43     ` Konrad Rzeszutek Wilk [this message]
2013-03-06  9:07       ` Tim Deegan
2013-03-06 15:36         ` Konrad Rzeszutek Wilk
2013-03-07 14:59           ` Konrad Rzeszutek Wilk
2013-03-07 15:56           ` Tim Deegan
2013-03-08 20:07             ` Konrad Rzeszutek Wilk
2013-03-11  9:25               ` Tim Deegan
2013-03-04 17:47 ` [PATCH 2/6] xc: use XENMEM_claim_pages during guest creation Konrad Rzeszutek Wilk
2013-03-04 17:47 ` [PATCH 3/6] xl: Implement XENMEM_claim_pages support via 'claim_mode' global config Konrad Rzeszutek Wilk
2013-03-04 17:47 ` [PATCH 4/6] xc: XENMEM_claim_pages claimed but not possesed value Konrad Rzeszutek Wilk
2013-03-04 17:47 ` [PATCH 5/6] xl: export 'unclaimed_pages' value from xcinfo Konrad Rzeszutek Wilk
2013-03-04 17:47 ` [PATCH 6/6] xl: 'xl list' supports '-c' for global claim information Konrad Rzeszutek Wilk
  -- strict thread matches above, loose matches on Subject: below --
2013-03-11 14:20 [PATCH v11] claim and its friends for allocating multiple self-ballooning guests Konrad Rzeszutek Wilk
2013-03-11 14:20 ` [PATCH 1/6] mmu: Introduce XENMEM_claim_pages (subop of memory ops) Konrad Rzeszutek Wilk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130305214320.GB8235@phenom.dumpdata.com \
    --to=konrad.wilk@oracle.com \
    --cc=dan.magenheimer@oracle.com \
    --cc=keir.xen@gmail.com \
    --cc=keir@xen.org \
    --cc=tim@xen.org \
    --cc=xen-devel@lists.xensource.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.