From: George Dunlap <george.dunlap@citrix.com>
To: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
George Dunlap <George.Dunlap@eu.citrix.com>,
"andres@gridcentric.ca" <andres@gridcentric.ca>,
"Tim (Xen.org)" <tim@xen.org>,
"keir.xen@gmail.com" <keir.xen@gmail.com>,
"adin@gridcentric.ca" <adin@gridcentric.ca>
Subject: Re: [PATCH 3 of 5] Rework locking in the PoD layer
Date: Thu, 2 Feb 2012 15:33:15 +0000 [thread overview]
Message-ID: <1328196795.21722.86.camel@elijah> (raw)
In-Reply-To: <56ceab0118cba85cdcd3.1328126167@xdev.gridcentric.ca>
On Wed, 2012-02-01 at 19:56 +0000, Andres Lagar-Cavilla wrote:
> @@ -114,15 +114,20 @@ p2m_pod_cache_add(struct p2m_domain *p2m
> unmap_domain_page(b);
> }
>
> + /* First, take all pages off the domain list */
> lock_page_alloc(p2m);
> -
> - /* First, take all pages off the domain list */
> for(i=0; i < 1 << order ; i++)
> {
> p = page + i;
> page_list_del(p, &d->page_list);
> }
>
> + /* Ensure that the PoD cache has never been emptied.
> + * This may cause "zombie domains" since the page will never be freed. */
> + BUG_ON( d->arch.relmem != RELMEM_not_started );
> +
> + unlock_page_alloc(p2m);
> +
This assert should stay where it is. The idea is to verify
after-the-fact that the page_list_add_tail()s *have not* raced with
emptying the PoD cache. Having the assert before cannot guarantee that
they *will not* race with emptying the PoD cache. Alternately, if we're
positive that condition can never happen again, we should just remove
the BUG_ON().
If I recall correctly, I put this in here because something ended up
calling cache_add after empty_cache(), potentially with the p2m lock
already held; that's why I put the BUG_ON() there to begin with.
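To make the placement concrete, here's roughly where I'd expect the
check to sit (a sketch of p2m_pod_cache_add() from memory, so the
surrounding code and exact locking may not match the tree; only the
position of the BUG_ON() relative to the page_list_add_tail()s is the
point):

    /* Then add to the appropriate populate-on-demand list */
    switch ( order )
    {
    case PAGE_ORDER_2M:
        page_list_add_tail(page, &p2m->pod.super);
        break;
    case PAGE_ORDER_4K:
        page_list_add_tail(page, &p2m->pod.single);
        break;
    default:
        BUG();
    }
    p2m->pod.count += 1 << order;

    /* Verify *after the fact* that the additions did not race with
     * p2m_pod_empty_cache(); if they did, these pages will never be
     * freed and the domain becomes a zombie. */
    BUG_ON( d->arch.relmem != RELMEM_not_started );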
> @@ -922,6 +929,12 @@ p2m_pod_emergency_sweep(struct p2m_domai
> limit = (start > POD_SWEEP_LIMIT) ? (start - POD_SWEEP_LIMIT) : 0;
>
> /* FIXME: Figure out how to avoid superpages */
> + /* NOTE: Promote to globally locking the p2m. This will get complicated
> + * in a fine-grained scenario. Even if we're to lock each gfn
> + * individually we must be careful about recursion limits and
> + * POD_SWEEP_STRIDE. This is why we don't enforce deadlock constraints
> + * between p2m and pod locks */
> + p2m_lock(p2m);
> for ( i=p2m->pod.reclaim_single; i > 0 ; i-- )
> {
> p2m_access_t a;
> @@ -940,7 +953,7 @@ p2m_pod_emergency_sweep(struct p2m_domai
> /* Stop if we're past our limit and we have found *something*.
> *
> * NB that this is a zero-sum game; we're increasing our cache size
> - * by re-increasing our 'debt'. Since we hold the p2m lock,
> + * by re-increasing our 'debt'. Since we hold the pod lock,
> * (entry_count - count) must remain the same. */
> if ( p2m->pod.count > 0 && i < limit )
> break;
> @@ -949,6 +962,7 @@ p2m_pod_emergency_sweep(struct p2m_domai
> if ( j )
> p2m_pod_zero_check(p2m, gfns, j);
>
> + p2m_unlock(p2m);
> p2m->pod.reclaim_single = i ? i - 1 : i;
>
> }
> @@ -965,8 +979,9 @@ p2m_pod_demand_populate(struct p2m_domai
> int i;
>
> ASSERT(gfn_locked_by_me(p2m, gfn));
> + pod_lock(p2m);
>
> - /* This check is done with the p2m lock held. This will make sure that
> + /* This check is done with the pod lock held. This will make sure that
> * even if d->is_dying changes under our feet, p2m_pod_empty_cache()
> * won't start until we're done. */
> if ( unlikely(d->is_dying) )
> @@ -977,6 +992,7 @@ p2m_pod_demand_populate(struct p2m_domai
> * 1GB region to 2MB chunks for a retry. */
> if ( order == PAGE_ORDER_1G )
> {
> + pod_unlock(p2m);
> gfn_aligned = (gfn >> order) << order;
> /* Note that we are supposed to call set_p2m_entry() 512 times to
> * split 1GB into 512 2MB pages here. But We only do once here because
> @@ -1000,11 +1016,15 @@ p2m_pod_demand_populate(struct p2m_domai
>
> /* If we're low, start a sweep */
> if ( order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) )
> + /* Note that sweeps scan other ranges in the p2m. In a scenario
> + * in which p2m locks are order-enforced wrt pod lock and p2m
> + * locks are fine grained, this will result in deadlock panic */
> p2m_pod_emergency_sweep_super(p2m);
>
> if ( page_list_empty(&p2m->pod.single) &&
> ( ( order == PAGE_ORDER_4K )
> || (order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) ) ) )
> + /* Same comment regarding deadlock applies */
> p2m_pod_emergency_sweep(p2m);
> }
>
Regarding locking on emergency sweeps: I think it should suffice if we
could do the equivalent of a spin_trylock() on each gpfn, and just move
on (not checking that gfn) if it fails. What do you think?
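Something along these lines, with a hypothetical gfn_trylock() helper
(no such helper exists yet; this is just to illustrate the idea):

    for ( i = p2m->pod.reclaim_single; i > 0 ; i-- )
    {
        /* Skip any gfn whose lock we can't take without blocking;
         * it will simply be revisited on a later sweep.
         * gfn_trylock() is hypothetical. */
        if ( !gfn_trylock(p2m, i, 0) )
            continue;

        /* ... existing lookup and zero-check logic for gfn i ... */

        gfn_unlock(p2m, i, 0);
    }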
Other than that, I don't see anything wrong with this locking at first
glance; but it's complicated enough that I don't think I've quite
grokked it yet. I'll look at it again tomorrow and see if things are
more clear. :-)
-George
> @@ -1012,8 +1032,6 @@ p2m_pod_demand_populate(struct p2m_domai
> if ( q == p2m_guest && gfn > p2m->pod.max_guest )
> p2m->pod.max_guest = gfn;
>
> - lock_page_alloc(p2m);
> -
> if ( p2m->pod.count == 0 )
> goto out_of_memory;
>
> @@ -1026,8 +1044,6 @@ p2m_pod_demand_populate(struct p2m_domai
>
> BUG_ON((mfn_x(mfn) & ((1 << order)-1)) != 0);
>
> - unlock_page_alloc(p2m);
> -
> gfn_aligned = (gfn >> order) << order;
>
> set_p2m_entry(p2m, gfn_aligned, mfn, order, p2m_ram_rw, p2m->default_access);
> @@ -1038,7 +1054,7 @@ p2m_pod_demand_populate(struct p2m_domai
> paging_mark_dirty(d, mfn_x(mfn) + i);
> }
>
> - p2m->pod.entry_count -= (1 << order); /* Lock: p2m */
> + p2m->pod.entry_count -= (1 << order);
> BUG_ON(p2m->pod.entry_count < 0);
>
> if ( tb_init_done )
> @@ -1056,20 +1072,24 @@ p2m_pod_demand_populate(struct p2m_domai
> __trace_var(TRC_MEM_POD_POPULATE, 0, sizeof(t), &t);
> }
>
> + pod_unlock(p2m);
> return 0;
> out_of_memory:
> - unlock_page_alloc(p2m);
> + pod_unlock(p2m);
>
> printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n",
> __func__, d->tot_pages, p2m->pod.entry_count);
> domain_crash(d);
> out_fail:
> + pod_unlock(p2m);
> return -1;
> remap_and_retry:
> BUG_ON(order != PAGE_ORDER_2M);
> - unlock_page_alloc(p2m);
> + pod_unlock(p2m);
>
> /* Remap this 2-meg region in singleton chunks */
> + /* NOTE: In a p2m fine-grained lock scenario this might
> + * need promoting the gfn lock from gfn->2M superpage */
> gfn_aligned = (gfn>>order)<<order;
> for(i=0; i<(1<<order); i++)
> set_p2m_entry(p2m, gfn_aligned+i, _mfn(0), PAGE_ORDER_4K,
> @@ -1137,9 +1157,11 @@ guest_physmap_mark_populate_on_demand(st
> rc = -EINVAL;
> else
> {
> - p2m->pod.entry_count += 1 << order; /* Lock: p2m */
> + pod_lock(p2m);
> + p2m->pod.entry_count += 1 << order;
> p2m->pod.entry_count -= pod_count;
> BUG_ON(p2m->pod.entry_count < 0);
> + pod_unlock(p2m);
> }
>
> gfn_unlock(p2m, gfn, order);
> diff -r fb9c06df05f2 -r 56ceab0118cb xen/arch/x86/mm/p2m-pt.c
> --- a/xen/arch/x86/mm/p2m-pt.c
> +++ b/xen/arch/x86/mm/p2m-pt.c
> @@ -954,6 +954,7 @@ long p2m_pt_audit_p2m(struct p2m_domain
> struct domain *d = p2m->domain;
>
> ASSERT(p2m_locked_by_me(p2m));
> + ASSERT(pod_locked_by_me(p2m));
>
> test_linear = ( (d == current->domain)
> && !pagetable_is_null(current->arch.monitor_table) );
> diff -r fb9c06df05f2 -r 56ceab0118cb xen/arch/x86/mm/p2m.c
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -72,6 +72,7 @@ boolean_param("hap_2mb", opt_hap_2mb);
> static void p2m_initialise(struct domain *d, struct p2m_domain *p2m)
> {
> mm_lock_init(&p2m->lock);
> + mm_lock_init(&p2m->pod.lock);
> INIT_LIST_HEAD(&p2m->np2m_list);
> INIT_PAGE_LIST_HEAD(&p2m->pages);
> INIT_PAGE_LIST_HEAD(&p2m->pod.super);
> @@ -587,8 +588,10 @@ guest_physmap_add_entry(struct domain *d
> rc = -EINVAL;
> else
> {
> - p2m->pod.entry_count -= pod_count; /* Lock: p2m */
> + pod_lock(p2m);
> + p2m->pod.entry_count -= pod_count;
> BUG_ON(p2m->pod.entry_count < 0);
> + pod_unlock(p2m);
> }
> }
>
> @@ -1372,6 +1375,7 @@ p2m_flush_table(struct p2m_domain *p2m)
> /* "Host" p2m tables can have shared entries &c that need a bit more
> * care when discarding them */
> ASSERT(p2m_is_nestedp2m(p2m));
> + /* Nested p2m's do not do pod, hence the asserts (and no pod lock)*/
> ASSERT(page_list_empty(&p2m->pod.super));
> ASSERT(page_list_empty(&p2m->pod.single));
>
> @@ -1529,6 +1533,7 @@ void audit_p2m(struct domain *d,
> P2M_PRINTK("p2m audit starts\n");
>
> p2m_lock(p2m);
> + pod_lock(p2m);
>
> if (p2m->audit_p2m)
> pmbad = p2m->audit_p2m(p2m);
> @@ -1589,6 +1594,7 @@ void audit_p2m(struct domain *d,
> }
> spin_unlock(&d->page_alloc_lock);
>
> + pod_unlock(p2m);
> p2m_unlock(p2m);
>
> P2M_PRINTK("p2m audit complete\n");
> diff -r fb9c06df05f2 -r 56ceab0118cb xen/include/asm-x86/p2m.h
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -261,25 +261,12 @@ struct p2m_domain {
> unsigned long max_mapped_pfn;
>
> /* Populate-on-demand variables
> - * NB on locking. {super,single,count} are
> - * covered by d->page_alloc_lock, since they're almost always used in
> - * conjunction with that functionality. {entry_count} is covered by
> - * the domain p2m lock, since it's almost always used in conjunction
> - * with changing the p2m tables.
> - *
> - * At this point, both locks are held in two places. In both,
> - * the order is [p2m,page_alloc]:
> - * + p2m_pod_decrease_reservation() calls p2m_pod_cache_add(),
> - * which grabs page_alloc
> - * + p2m_pod_demand_populate() grabs both; the p2m lock to avoid
> - * double-demand-populating of pages, the page_alloc lock to
> - * protect moving stuff from the PoD cache to the domain page list.
> - *
> - * We enforce this lock ordering through a construct in mm-locks.h.
> - * This demands, however, that we store the previous lock-ordering
> - * level in effect before grabbing the page_alloc lock. The unlock
> - * level is stored in the arch section of the domain struct.
> - */
> + * All variables are protected with the pod lock. We cannot rely on
> + * the p2m lock if it's turned into a fine-grained lock.
> + * We only use the domain page_alloc lock for additions and
> + * deletions to the domain's page list. Because we use it nested
> + * within the PoD lock, we enforce its ordering (by remembering
> + * the unlock level in the arch_domain sub struct). */
> struct {
> struct page_list_head super, /* List of superpages */
> single; /* Non-super lists */
> @@ -288,6 +275,8 @@ struct p2m_domain {
> unsigned reclaim_super; /* Last gpfn of a scan */
> unsigned reclaim_single; /* Last gpfn of a scan */
> unsigned max_guest; /* gpfn of max guest demand-populate */
> + mm_lock_t lock; /* Locking of private pod structs, *
> + * not relying on the p2m lock. */
> } pod;
> };
>
Thread overview: 17+ messages
2012-02-01 19:56 [PATCH 0 of 5] Synchronized p2m lookups Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 1 of 5] Make p2m lookups fully synchronized wrt modifications Andres Lagar-Cavilla
2012-02-02 13:08 ` Tim Deegan
2012-02-02 14:01 ` Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 2 of 5] Clean up locking now that p2m lockups are fully synchronized Andres Lagar-Cavilla
2012-02-02 13:10 ` Tim Deegan
2012-02-02 14:31 ` George Dunlap
2012-02-02 19:40 ` Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 3 of 5] Rework locking in the PoD layer Andres Lagar-Cavilla
2012-02-02 13:14 ` Tim Deegan
2012-02-02 14:04 ` Andres Lagar-Cavilla
2012-02-02 15:33 ` George Dunlap [this message]
2012-02-02 20:03 ` Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 4 of 5] Re-order calls to put_gfn() around wait queu invocations Andres Lagar-Cavilla
2012-02-02 13:26 ` Tim Deegan
2012-02-01 19:56 ` [PATCH 5 of 5] x86/mm: Revert changeset 24582:f6c33cfe7333 Andres Lagar-Cavilla
2012-02-02 13:27 ` Tim Deegan