All of lore.kernel.org
 help / color / mirror / Atom feed
From: George Dunlap <george.dunlap@citrix.com>
To: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	"andres@gridcentric.ca" <andres@gridcentric.ca>,
	"Tim (Xen.org)" <tim@xen.org>,
	"keir.xen@gmail.com" <keir.xen@gmail.com>,
	"adin@gridcentric.ca" <adin@gridcentric.ca>
Subject: Re: [PATCH 3 of 5] Rework locking in the PoD layer
Date: Thu, 2 Feb 2012 15:33:15 +0000	[thread overview]
Message-ID: <1328196795.21722.86.camel@elijah> (raw)
In-Reply-To: <56ceab0118cba85cdcd3.1328126167@xdev.gridcentric.ca>

On Wed, 2012-02-01 at 19:56 +0000, Andres Lagar-Cavilla wrote:
> @@ -114,15 +114,20 @@ p2m_pod_cache_add(struct p2m_domain *p2m
>          unmap_domain_page(b);
>      }
> 
> +    /* First, take all pages off the domain list */
>      lock_page_alloc(p2m);
> -
> -    /* First, take all pages off the domain list */
>      for(i=0; i < 1 << order ; i++)
>      {
>          p = page + i;
>          page_list_del(p, &d->page_list);
>      }
> 
> +    /* Ensure that the PoD cache has never been emptied.
> +     * This may cause "zombie domains" since the page will never be freed. */
> +    BUG_ON( d->arch.relmem != RELMEM_not_started );
> +
> +    unlock_page_alloc(p2m);
> +

This assert should stay where it is.  The idea is to verify
after-the-fact that the page_list_add_tail()s *have not* raced with
emptying the PoD cache.  Having the assert before cannot guarantee that
they *will not* race with emptying the PoD cache.  Alternately, if we're
positive that condition can never happen again, we should just remove
the BUG_ON().

If I recall correctly, I put this in here because something ended up
calling cache_add after empty_cache(), potentially with the p2m lock
already held; that's why I put the BUG_ON() there to begin with.

> @@ -922,6 +929,12 @@ p2m_pod_emergency_sweep(struct p2m_domai
>      limit = (start > POD_SWEEP_LIMIT) ? (start - POD_SWEEP_LIMIT) : 0;
> 
>      /* FIXME: Figure out how to avoid superpages */
> +    /* NOTE: Promote to globally locking the p2m. This will get complicated
> +     * in a fine-grained scenario. Even if we're to lock each gfn
> +     * individually we must be careful about recursion limits and
> +     * POD_SWEEP_STRIDE. This is why we don't enforce deadlock constraints
> +     * between p2m and pod locks */
> +    p2m_lock(p2m);
>      for ( i=p2m->pod.reclaim_single; i > 0 ; i-- )
>      {
>          p2m_access_t a;
> @@ -940,7 +953,7 @@ p2m_pod_emergency_sweep(struct p2m_domai
>          /* Stop if we're past our limit and we have found *something*.
>           *
>           * NB that this is a zero-sum game; we're increasing our cache size
> -         * by re-increasing our 'debt'.  Since we hold the p2m lock,
> +         * by re-increasing our 'debt'.  Since we hold the pod lock,
>           * (entry_count - count) must remain the same. */
>          if ( p2m->pod.count > 0 && i < limit )
>              break;
> @@ -949,6 +962,7 @@ p2m_pod_emergency_sweep(struct p2m_domai
>      if ( j )
>          p2m_pod_zero_check(p2m, gfns, j);
> 
> +    p2m_unlock(p2m);
>      p2m->pod.reclaim_single = i ? i - 1 : i;
> 
>  }
> @@ -965,8 +979,9 @@ p2m_pod_demand_populate(struct p2m_domai
>      int i;
> 
>      ASSERT(gfn_locked_by_me(p2m, gfn));
> +    pod_lock(p2m);
> 
> -    /* This check is done with the p2m lock held.  This will make sure that
> +    /* This check is done with the pod lock held.  This will make sure that
>       * even if d->is_dying changes under our feet, p2m_pod_empty_cache()
>       * won't start until we're done. */
>      if ( unlikely(d->is_dying) )
> @@ -977,6 +992,7 @@ p2m_pod_demand_populate(struct p2m_domai
>       * 1GB region to 2MB chunks for a retry. */
>      if ( order == PAGE_ORDER_1G )
>      {
> +        pod_unlock(p2m);
>          gfn_aligned = (gfn >> order) << order;
>          /* Note that we are supposed to call set_p2m_entry() 512 times to
>           * split 1GB into 512 2MB pages here. But We only do once here because
> @@ -1000,11 +1016,15 @@ p2m_pod_demand_populate(struct p2m_domai
> 
>          /* If we're low, start a sweep */
>          if ( order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) )
> +            /* Note that sweeps scan other ranges in the p2m. In an scenario
> +             * in which p2m locks are order-enforced wrt pod lock and p2m
> +             * locks are fine grained, this will result in deadlock panic */
>              p2m_pod_emergency_sweep_super(p2m);
> 
>          if ( page_list_empty(&p2m->pod.single) &&
>               ( ( order == PAGE_ORDER_4K )
>                 || (order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) ) ) )
> +            /* Same comment regarding deadlock applies */
>              p2m_pod_emergency_sweep(p2m);
>      }
> 

Regarding locking on emergency sweeps: I think it should suffice if we
could do the equivalent of a spin_trylock() on each gpfn, and just move
on (not checking that gfn) if it fails.  What do you think?

Other than that, I don't see anything wrong with this locking at first
glance; but it's complicated enough that I don't think I've quite
grokked it yet.  I'll look at it again tomorrow and see if things are
more clear. :-)

-George


> @@ -1012,8 +1032,6 @@ p2m_pod_demand_populate(struct p2m_domai
>      if ( q == p2m_guest && gfn > p2m->pod.max_guest )
>          p2m->pod.max_guest = gfn;
> 
> -    lock_page_alloc(p2m);
> -
>      if ( p2m->pod.count == 0 )
>          goto out_of_memory;
> 
> @@ -1026,8 +1044,6 @@ p2m_pod_demand_populate(struct p2m_domai
> 
>      BUG_ON((mfn_x(mfn) & ((1 << order)-1)) != 0);
> 
> -    unlock_page_alloc(p2m);
> -
>      gfn_aligned = (gfn >> order) << order;
> 
>      set_p2m_entry(p2m, gfn_aligned, mfn, order, p2m_ram_rw, p2m->default_access);
> @@ -1038,7 +1054,7 @@ p2m_pod_demand_populate(struct p2m_domai
>          paging_mark_dirty(d, mfn_x(mfn) + i);
>      }
> 
> -    p2m->pod.entry_count -= (1 << order); /* Lock: p2m */
> +    p2m->pod.entry_count -= (1 << order);
>      BUG_ON(p2m->pod.entry_count < 0);
> 
>      if ( tb_init_done )
> @@ -1056,20 +1072,24 @@ p2m_pod_demand_populate(struct p2m_domai
>          __trace_var(TRC_MEM_POD_POPULATE, 0, sizeof(t), &t);
>      }
> 
> +    pod_unlock(p2m);
>      return 0;
>  out_of_memory:
> -    unlock_page_alloc(p2m);
> +    pod_unlock(p2m);
> 
>      printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n",
>             __func__, d->tot_pages, p2m->pod.entry_count);
>      domain_crash(d);
>  out_fail:
> +    pod_unlock(p2m);
>      return -1;
>  remap_and_retry:
>      BUG_ON(order != PAGE_ORDER_2M);
> -    unlock_page_alloc(p2m);
> +    pod_unlock(p2m);
> 
>      /* Remap this 2-meg region in singleton chunks */
> +    /* NOTE: In a p2m fine-grained lock scenario this might
> +     * need promoting the gfn lock from gfn->2M superpage */
>      gfn_aligned = (gfn>>order)<<order;
>      for(i=0; i<(1<<order); i++)
>          set_p2m_entry(p2m, gfn_aligned+i, _mfn(0), PAGE_ORDER_4K,
> @@ -1137,9 +1157,11 @@ guest_physmap_mark_populate_on_demand(st
>          rc = -EINVAL;
>      else
>      {
> -        p2m->pod.entry_count += 1 << order; /* Lock: p2m */
> +        pod_lock(p2m);
> +        p2m->pod.entry_count += 1 << order;
>          p2m->pod.entry_count -= pod_count;
>          BUG_ON(p2m->pod.entry_count < 0);
> +        pod_unlock(p2m);
>      }
> 
>      gfn_unlock(p2m, gfn, order);
> diff -r fb9c06df05f2 -r 56ceab0118cb xen/arch/x86/mm/p2m-pt.c
> --- a/xen/arch/x86/mm/p2m-pt.c
> +++ b/xen/arch/x86/mm/p2m-pt.c
> @@ -954,6 +954,7 @@ long p2m_pt_audit_p2m(struct p2m_domain
>      struct domain *d = p2m->domain;
> 
>      ASSERT(p2m_locked_by_me(p2m));
> +    ASSERT(pod_locked_by_me(p2m));
> 
>      test_linear = ( (d == current->domain)
>                      && !pagetable_is_null(current->arch.monitor_table) );
> diff -r fb9c06df05f2 -r 56ceab0118cb xen/arch/x86/mm/p2m.c
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -72,6 +72,7 @@ boolean_param("hap_2mb", opt_hap_2mb);
>  static void p2m_initialise(struct domain *d, struct p2m_domain *p2m)
>  {
>      mm_lock_init(&p2m->lock);
> +    mm_lock_init(&p2m->pod.lock);
>      INIT_LIST_HEAD(&p2m->np2m_list);
>      INIT_PAGE_LIST_HEAD(&p2m->pages);
>      INIT_PAGE_LIST_HEAD(&p2m->pod.super);
> @@ -587,8 +588,10 @@ guest_physmap_add_entry(struct domain *d
>              rc = -EINVAL;
>          else
>          {
> -            p2m->pod.entry_count -= pod_count; /* Lock: p2m */
> +            pod_lock(p2m);
> +            p2m->pod.entry_count -= pod_count;
>              BUG_ON(p2m->pod.entry_count < 0);
> +            pod_unlock(p2m);
>          }
>      }
> 
> @@ -1372,6 +1375,7 @@ p2m_flush_table(struct p2m_domain *p2m)
>      /* "Host" p2m tables can have shared entries &c that need a bit more
>       * care when discarding them */
>      ASSERT(p2m_is_nestedp2m(p2m));
> +    /* Nested p2m's do not do pod, hence the asserts (and no pod lock)*/
>      ASSERT(page_list_empty(&p2m->pod.super));
>      ASSERT(page_list_empty(&p2m->pod.single));
> 
> @@ -1529,6 +1533,7 @@ void audit_p2m(struct domain *d,
>      P2M_PRINTK("p2m audit starts\n");
> 
>      p2m_lock(p2m);
> +    pod_lock(p2m);
> 
>      if (p2m->audit_p2m)
>          pmbad = p2m->audit_p2m(p2m);
> @@ -1589,6 +1594,7 @@ void audit_p2m(struct domain *d,
>      }
>      spin_unlock(&d->page_alloc_lock);
> 
> +    pod_unlock(p2m);
>      p2m_unlock(p2m);
> 
>      P2M_PRINTK("p2m audit complete\n");
> diff -r fb9c06df05f2 -r 56ceab0118cb xen/include/asm-x86/p2m.h
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -261,25 +261,12 @@ struct p2m_domain {
>      unsigned long max_mapped_pfn;
> 
>      /* Populate-on-demand variables
> -     * NB on locking.  {super,single,count} are
> -     * covered by d->page_alloc_lock, since they're almost always used in
> -     * conjunction with that functionality.  {entry_count} is covered by
> -     * the domain p2m lock, since it's almost always used in conjunction
> -     * with changing the p2m tables.
> -     *
> -     * At this point, both locks are held in two places.  In both,
> -     * the order is [p2m,page_alloc]:
> -     * + p2m_pod_decrease_reservation() calls p2m_pod_cache_add(),
> -     *   which grabs page_alloc
> -     * + p2m_pod_demand_populate() grabs both; the p2m lock to avoid
> -     *   double-demand-populating of pages, the page_alloc lock to
> -     *   protect moving stuff from the PoD cache to the domain page list.
> -     *
> -     * We enforce this lock ordering through a construct in mm-locks.h.
> -     * This demands, however, that we store the previous lock-ordering
> -     * level in effect before grabbing the page_alloc lock. The unlock
> -     * level is stored in the arch section of the domain struct.
> -     */
> +     * All variables are protected with the pod lock. We cannot rely on
> +     * the p2m lock if it's turned into a fine-grained lock.
> +     * We only use the domain page_alloc lock for additions and
> +     * deletions to the domain's page list. Because we use it nested
> +     * within the PoD lock, we enforce it's ordering (by remembering
> +     * the unlock level in the arch_domain sub struct). */
>      struct {
>          struct page_list_head super,   /* List of superpages                */
>                           single;       /* Non-super lists                   */
> @@ -288,6 +275,8 @@ struct p2m_domain {
>          unsigned         reclaim_super; /* Last gpfn of a scan */
>          unsigned         reclaim_single; /* Last gpfn of a scan */
>          unsigned         max_guest;    /* gpfn of max guest demand-populate */
> +        mm_lock_t        lock;         /* Locking of private pod structs,   *
> +                                        * not relying on the p2m lock.      */
>      } pod;
>  };
> 

  parent reply	other threads:[~2012-02-02 15:33 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-01 19:56 [PATCH 0 of 5] Synchronized p2m lookups Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 1 of 5] Make p2m lookups fully synchronized wrt modifications Andres Lagar-Cavilla
2012-02-02 13:08   ` Tim Deegan
2012-02-02 14:01     ` Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 2 of 5] Clean up locking now that p2m lockups are fully synchronized Andres Lagar-Cavilla
2012-02-02 13:10   ` Tim Deegan
2012-02-02 14:31   ` George Dunlap
2012-02-02 19:40     ` Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 3 of 5] Rework locking in the PoD layer Andres Lagar-Cavilla
2012-02-02 13:14   ` Tim Deegan
2012-02-02 14:04     ` Andres Lagar-Cavilla
2012-02-02 15:33   ` George Dunlap [this message]
2012-02-02 20:03     ` Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 4 of 5] Re-order calls to put_gfn() around wait queu invocations Andres Lagar-Cavilla
2012-02-02 13:26   ` Tim Deegan
2012-02-01 19:56 ` [PATCH 5 of 5] x86/mm: Revert changeset 24582:f6c33cfe7333 Andres Lagar-Cavilla
2012-02-02 13:27   ` Tim Deegan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1328196795.21722.86.camel@elijah \
    --to=george.dunlap@citrix.com \
    --cc=George.Dunlap@eu.citrix.com \
    --cc=adin@gridcentric.ca \
    --cc=andres@gridcentric.ca \
    --cc=andres@lagarcavilla.org \
    --cc=keir.xen@gmail.com \
    --cc=tim@xen.org \
    --cc=xen-devel@lists.xensource.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.