From: George Dunlap <george.dunlap@citrix.com>
To: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
George Dunlap <George.Dunlap@eu.citrix.com>,
"andres@gridcentric.ca" <andres@gridcentric.ca>,
"Tim (Xen.org)" <tim@xen.org>,
"keir.xen@gmail.com" <keir.xen@gmail.com>,
"adin@gridcentric.ca" <adin@gridcentric.ca>
Subject: Re: [PATCH 3 of 5] Rework locking in the PoD layer
Date: Thu, 2 Feb 2012 15:33:15 +0000 [thread overview]
Message-ID: <1328196795.21722.86.camel@elijah> (raw)
In-Reply-To: <56ceab0118cba85cdcd3.1328126167@xdev.gridcentric.ca>
On Wed, 2012-02-01 at 19:56 +0000, Andres Lagar-Cavilla wrote:
> @@ -114,15 +114,20 @@ p2m_pod_cache_add(struct p2m_domain *p2m
> unmap_domain_page(b);
> }
>
> + /* First, take all pages off the domain list */
> lock_page_alloc(p2m);
> -
> - /* First, take all pages off the domain list */
> for(i=0; i < 1 << order ; i++)
> {
> p = page + i;
> page_list_del(p, &d->page_list);
> }
>
> + /* Ensure that the PoD cache has never been emptied.
> + * This may cause "zombie domains" since the page will never be freed. */
> + BUG_ON( d->arch.relmem != RELMEM_not_started );
> +
> + unlock_page_alloc(p2m);
> +
This assert should stay where it is. The idea is to verify
after-the-fact that the page_list_add_tail()s *have not* raced with
emptying the PoD cache. Having the assert before cannot guarantee that
they *will not* race with emptying the PoD cache. Alternately, if we're
positive that condition can never happen again, we should just remove
the BUG_ON().
If I recall correctly, I put this in here because something ended up
calling cache_add after empty_cache(), potentially with the p2m lock
already held; that's why I put the BUG_ON() there to begin with.
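To make the placement concrete, here's roughly where I'd expect the
check to sit (a sketch of p2m_pod_cache_add() from memory, so the
surrounding code and exact locking may not match the tree; only the
position of the BUG_ON() relative to the page_list_add_tail()s is the
point):

    /* Then add to the appropriate populate-on-demand list */
    switch ( order )
    {
    case PAGE_ORDER_2M:
        page_list_add_tail(page, &p2m->pod.super);
        break;
    case PAGE_ORDER_4K:
        page_list_add_tail(page, &p2m->pod.single);
        break;
    default:
        BUG();
    }
    p2m->pod.count += 1 << order;

    /* Verify *after the fact* that the additions did not race with
     * p2m_pod_empty_cache(); if they did, these pages will never be
     * freed and the domain becomes a zombie. */
    BUG_ON( d->arch.relmem != RELMEM_not_started );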
> @@ -922,6 +929,12 @@ p2m_pod_emergency_sweep(struct p2m_domai
> limit = (start > POD_SWEEP_LIMIT) ? (start - POD_SWEEP_LIMIT) : 0;
>
> /* FIXME: Figure out how to avoid superpages */
> + /* NOTE: Promote to globally locking the p2m. This will get complicated
> + * in a fine-grained scenario. Even if we're to lock each gfn
> + * individually we must be careful about recursion limits and
> + * POD_SWEEP_STRIDE. This is why we don't enforce deadlock constraints
> + * between p2m and pod locks */
> + p2m_lock(p2m);
> for ( i=p2m->pod.reclaim_single; i > 0 ; i-- )
> {
> p2m_access_t a;
> @@ -940,7 +953,7 @@ p2m_pod_emergency_sweep(struct p2m_domai
> /* Stop if we're past our limit and we have found *something*.
> *
> * NB that this is a zero-sum game; we're increasing our cache size
> - * by re-increasing our 'debt'. Since we hold the p2m lock,
> + * by re-increasing our 'debt'. Since we hold the pod lock,
> * (entry_count - count) must remain the same. */
> if ( p2m->pod.count > 0 && i < limit )
> break;
> @@ -949,6 +962,7 @@ p2m_pod_emergency_sweep(struct p2m_domai
> if ( j )
> p2m_pod_zero_check(p2m, gfns, j);
>
> + p2m_unlock(p2m);
> p2m->pod.reclaim_single = i ? i - 1 : i;
>
> }
> @@ -965,8 +979,9 @@ p2m_pod_demand_populate(struct p2m_domai
> int i;
>
> ASSERT(gfn_locked_by_me(p2m, gfn));
> + pod_lock(p2m);
>
> - /* This check is done with the p2m lock held. This will make sure that
> + /* This check is done with the pod lock held. This will make sure that
> * even if d->is_dying changes under our feet, p2m_pod_empty_cache()
> * won't start until we're done. */
> if ( unlikely(d->is_dying) )
> @@ -977,6 +992,7 @@ p2m_pod_demand_populate(struct p2m_domai
> * 1GB region to 2MB chunks for a retry. */
> if ( order == PAGE_ORDER_1G )
> {
> + pod_unlock(p2m);
> gfn_aligned = (gfn >> order) << order;
> /* Note that we are supposed to call set_p2m_entry() 512 times to
> * split 1GB into 512 2MB pages here. But We only do once here because
> @@ -1000,11 +1016,15 @@ p2m_pod_demand_populate(struct p2m_domai
>
> /* If we're low, start a sweep */
> if ( order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) )
> + /* Note that sweeps scan other ranges in the p2m. In a scenario
> + * in which p2m locks are order-enforced wrt pod lock and p2m
> + * locks are fine grained, this will result in deadlock panic */
> p2m_pod_emergency_sweep_super(p2m);
>
> if ( page_list_empty(&p2m->pod.single) &&
> ( ( order == PAGE_ORDER_4K )
> || (order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) ) ) )
> + /* Same comment regarding deadlock applies */
> p2m_pod_emergency_sweep(p2m);
> }
>
Regarding locking on emergency sweeps: I think it should suffice if we
could do the equivalent of a spin_trylock() on each gpfn, and just move
on (not checking that gfn) if it fails. What do you think?
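Something along these lines, with a hypothetical gfn_trylock() helper
(no such helper exists yet; this is just to illustrate the idea):

    for ( i = p2m->pod.reclaim_single; i > 0 ; i-- )
    {
        /* Skip any gfn whose lock we can't take without blocking;
         * it will simply be revisited on a later sweep.
         * gfn_trylock() is hypothetical. */
        if ( !gfn_trylock(p2m, i, 0) )
            continue;

        /* ... existing lookup and zero-check logic for gfn i ... */

        gfn_unlock(p2m, i, 0);
    }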
Other than that, I don't see anything wrong with this locking at first
glance; but it's complicated enough that I don't think I've quite
grokked it yet. I'll look at it again tomorrow and see if things are
more clear. :-)
-George
> @@ -1012,8 +1032,6 @@ p2m_pod_demand_populate(struct p2m_domai
> if ( q == p2m_guest && gfn > p2m->pod.max_guest )
> p2m->pod.max_guest = gfn;
>
> - lock_page_alloc(p2m);
> -
> if ( p2m->pod.count == 0 )
> goto out_of_memory;
>
> @@ -1026,8 +1044,6 @@ p2m_pod_demand_populate(struct p2m_domai
>
> BUG_ON((mfn_x(mfn) & ((1 << order)-1)) != 0);
>
> - unlock_page_alloc(p2m);
> -
> gfn_aligned = (gfn >> order) << order;
>
> set_p2m_entry(p2m, gfn_aligned, mfn, order, p2m_ram_rw, p2m->default_access);
> @@ -1038,7 +1054,7 @@ p2m_pod_demand_populate(struct p2m_domai
> paging_mark_dirty(d, mfn_x(mfn) + i);
> }
>
> - p2m->pod.entry_count -= (1 << order); /* Lock: p2m */
> + p2m->pod.entry_count -= (1 << order);
> BUG_ON(p2m->pod.entry_count < 0);
>
> if ( tb_init_done )
> @@ -1056,20 +1072,24 @@ p2m_pod_demand_populate(struct p2m_domai
> __trace_var(TRC_MEM_POD_POPULATE, 0, sizeof(t), &t);
> }
>
> + pod_unlock(p2m);
> return 0;
> out_of_memory:
> - unlock_page_alloc(p2m);
> + pod_unlock(p2m);
>
> printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n",
> __func__, d->tot_pages, p2m->pod.entry_count);
> domain_crash(d);
> out_fail:
> + pod_unlock(p2m);
> return -1;
> remap_and_retry:
> BUG_ON(order != PAGE_ORDER_2M);
> - unlock_page_alloc(p2m);
> + pod_unlock(p2m);
>
> /* Remap this 2-meg region in singleton chunks */
> + /* NOTE: In a p2m fine-grained lock scenario this might
> + * need promoting the gfn lock from gfn->2M superpage */
> gfn_aligned = (gfn>>order)<<order;
> for(i=0; i<(1<<order); i++)
> set_p2m_entry(p2m, gfn_aligned+i, _mfn(0), PAGE_ORDER_4K,
> @@ -1137,9 +1157,11 @@ guest_physmap_mark_populate_on_demand(st
> rc = -EINVAL;
> else
> {
> - p2m->pod.entry_count += 1 << order; /* Lock: p2m */
> + pod_lock(p2m);
> + p2m->pod.entry_count += 1 << order;
> p2m->pod.entry_count -= pod_count;
> BUG_ON(p2m->pod.entry_count < 0);
> + pod_unlock(p2m);
> }
>
> gfn_unlock(p2m, gfn, order);
> diff -r fb9c06df05f2 -r 56ceab0118cb xen/arch/x86/mm/p2m-pt.c
> --- a/xen/arch/x86/mm/p2m-pt.c
> +++ b/xen/arch/x86/mm/p2m-pt.c
> @@ -954,6 +954,7 @@ long p2m_pt_audit_p2m(struct p2m_domain
> struct domain *d = p2m->domain;
>
> ASSERT(p2m_locked_by_me(p2m));
> + ASSERT(pod_locked_by_me(p2m));
>
> test_linear = ( (d == current->domain)
> && !pagetable_is_null(current->arch.monitor_table) );
> diff -r fb9c06df05f2 -r 56ceab0118cb xen/arch/x86/mm/p2m.c
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -72,6 +72,7 @@ boolean_param("hap_2mb", opt_hap_2mb);
> static void p2m_initialise(struct domain *d, struct p2m_domain *p2m)
> {
> mm_lock_init(&p2m->lock);
> + mm_lock_init(&p2m->pod.lock);
> INIT_LIST_HEAD(&p2m->np2m_list);
> INIT_PAGE_LIST_HEAD(&p2m->pages);
> INIT_PAGE_LIST_HEAD(&p2m->pod.super);
> @@ -587,8 +588,10 @@ guest_physmap_add_entry(struct domain *d
> rc = -EINVAL;
> else
> {
> - p2m->pod.entry_count -= pod_count; /* Lock: p2m */
> + pod_lock(p2m);
> + p2m->pod.entry_count -= pod_count;
> BUG_ON(p2m->pod.entry_count < 0);
> + pod_unlock(p2m);
> }
> }
>
> @@ -1372,6 +1375,7 @@ p2m_flush_table(struct p2m_domain *p2m)
> /* "Host" p2m tables can have shared entries &c that need a bit more
> * care when discarding them */
> ASSERT(p2m_is_nestedp2m(p2m));
> + /* Nested p2m's do not do pod, hence the asserts (and no pod lock)*/
> ASSERT(page_list_empty(&p2m->pod.super));
> ASSERT(page_list_empty(&p2m->pod.single));
>
> @@ -1529,6 +1533,7 @@ void audit_p2m(struct domain *d,
> P2M_PRINTK("p2m audit starts\n");
>
> p2m_lock(p2m);
> + pod_lock(p2m);
>
> if (p2m->audit_p2m)
> pmbad = p2m->audit_p2m(p2m);
> @@ -1589,6 +1594,7 @@ void audit_p2m(struct domain *d,
> }
> spin_unlock(&d->page_alloc_lock);
>
> + pod_unlock(p2m);
> p2m_unlock(p2m);
>
> P2M_PRINTK("p2m audit complete\n");
> diff -r fb9c06df05f2 -r 56ceab0118cb xen/include/asm-x86/p2m.h
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -261,25 +261,12 @@ struct p2m_domain {
> unsigned long max_mapped_pfn;
>
> /* Populate-on-demand variables
> - * NB on locking. {super,single,count} are
> - * covered by d->page_alloc_lock, since they're almost always used in
> - * conjunction with that functionality. {entry_count} is covered by
> - * the domain p2m lock, since it's almost always used in conjunction
> - * with changing the p2m tables.
> - *
> - * At this point, both locks are held in two places. In both,
> - * the order is [p2m,page_alloc]:
> - * + p2m_pod_decrease_reservation() calls p2m_pod_cache_add(),
> - * which grabs page_alloc
> - * + p2m_pod_demand_populate() grabs both; the p2m lock to avoid
> - * double-demand-populating of pages, the page_alloc lock to
> - * protect moving stuff from the PoD cache to the domain page list.
> - *
> - * We enforce this lock ordering through a construct in mm-locks.h.
> - * This demands, however, that we store the previous lock-ordering
> - * level in effect before grabbing the page_alloc lock. The unlock
> - * level is stored in the arch section of the domain struct.
> - */
> + * All variables are protected with the pod lock. We cannot rely on
> + * the p2m lock if it's turned into a fine-grained lock.
> + * We only use the domain page_alloc lock for additions and
> + * deletions to the domain's page list. Because we use it nested
> + * within the PoD lock, we enforce its ordering (by remembering
> + * the unlock level in the arch_domain sub struct). */
> struct {
> struct page_list_head super, /* List of superpages */
> single; /* Non-super lists */
> @@ -288,6 +275,8 @@ struct p2m_domain {
> unsigned reclaim_super; /* Last gpfn of a scan */
> unsigned reclaim_single; /* Last gpfn of a scan */
> unsigned max_guest; /* gpfn of max guest demand-populate */
> + mm_lock_t lock; /* Locking of private pod structs, *
> + * not relying on the p2m lock. */
> } pod;
> };
>
Thread overview: 17+ messages
2012-02-01 19:56 [PATCH 0 of 5] Synchronized p2m lookups Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 1 of 5] Make p2m lookups fully synchronized wrt modifications Andres Lagar-Cavilla
2012-02-02 13:08 ` Tim Deegan
2012-02-02 14:01 ` Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 2 of 5] Clean up locking now that p2m lockups are fully synchronized Andres Lagar-Cavilla
2012-02-02 13:10 ` Tim Deegan
2012-02-02 14:31 ` George Dunlap
2012-02-02 19:40 ` Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 3 of 5] Rework locking in the PoD layer Andres Lagar-Cavilla
2012-02-02 13:14 ` Tim Deegan
2012-02-02 14:04 ` Andres Lagar-Cavilla
2012-02-02 15:33 ` George Dunlap [this message]
2012-02-02 20:03 ` Andres Lagar-Cavilla
2012-02-01 19:56 ` [PATCH 4 of 5] Re-order calls to put_gfn() around wait queu invocations Andres Lagar-Cavilla
2012-02-02 13:26 ` Tim Deegan
2012-02-01 19:56 ` [PATCH 5 of 5] x86/mm: Revert changeset 24582:f6c33cfe7333 Andres Lagar-Cavilla
2012-02-02 13:27 ` Tim Deegan