* Domain relinquish resources racing with p2m access
@ 2012-02-01 20:49 Andres Lagar-Cavilla
From: Andres Lagar-Cavilla @ 2012-02-01 20:49 UTC (permalink / raw)
  To: xen-devel, tim, keir

So we've run into this interesting (race?) condition while
stress-testing. We pummel the domain with paging, sharing and mmap
operations from dom0, and concurrently launch a domain destruction. We
often see something along these lines in the logs:

(XEN) mm.c:958:d0 Error getting mfn 859b1a (pfn ffffffffffffffff) from L1
entry 8000000859b1a625 for l1e_owner=0, pg_owner=1

We're using the synchronized p2m patches just posted, so my analysis is as
follows:

- the domain destroy domctl kicks in. It calls relinquish resources. This
disowns and puts most domain pages, resulting in invalid (0xff...ff) m2p
entries.

- In parallel, a do_mmu_update is making progress. It has no issues
performing a p2m lookup because the p2m has not been torn down yet; we
haven't gotten to the RCU callback. Eventually, the mapping fails in
page_get_owner in get_page_from_l1e.

The map attempt fails, as expected, but what makes me uneasy is the fact
that there is a still-active p2m lurking around, with seemingly valid
translations to valid MFNs, while all the domain pages are gone.
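
For reference, the check we're tripping is roughly the following (a loose
sketch paraphrased from memory, not the literal mm.c code; variable names
are assumptions):

    /* Sketch of the failing path in get_page_from_l1e(): once relinquish
     * resources has disowned the page, the ownership check rejects the
     * mapping and logs the error above. */
    struct page_info *page = mfn_to_page(mfn);

    if ( unlikely(page_get_owner(page) != pg_owner) )
    {
        MEM_LOG("Error getting mfn %lx (pfn %lx) from L1 entry ...",
                mfn, get_gpfn_from_mfn(mfn));  /* pfn comes back ~0UL */
        return 0;  /* the map attempt is failed */
    }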

Is this a race condition? Can this lead to trouble?

Thanks!
Andres

* Re: Domain relinquish resources racing with p2m access
From: Tim Deegan @ 2012-02-02 13:34 UTC (permalink / raw)
  To: Andres Lagar-Cavilla; +Cc: xen-devel, keir

At 12:49 -0800 on 01 Feb (1328100564), Andres Lagar-Cavilla wrote:
> So we've run into this interesting (race?) condition while
> stress-testing. We pummel the domain with paging, sharing and mmap
> operations from dom0, and concurrently launch a domain destruction. We
> often see something along these lines in the logs:
> 
> (XEN) mm.c:958:d0 Error getting mfn 859b1a (pfn ffffffffffffffff) from L1
> entry 8000000859b1a625 for l1e_owner=0, pg_owner=1
> 
> We're using the synchronized p2m patches just posted, so my analysis is as
> follows:
> 
> - the domain destroy domctl kicks in. It calls relinquish resources. This
> disowns and puts most domain pages, resulting in invalid (0xff...ff) m2p
> entries.
> 
> - In parallel, a do_mmu_update is making progress. It has no issues
> performing a p2m lookup because the p2m has not been torn down yet; we
> haven't gotten to the RCU callback. Eventually, the mapping fails in
> page_get_owner in get_page_from_l1e.
> 
> The map attempt fails, as expected, but what makes me uneasy is the fact
> that there is a still-active p2m lurking around, with seemingly valid
> translations to valid MFNs, while all the domain pages are gone.

Yes.  That's OK as long as we know that any user of that page will
fail, but I'm not sure that we do.   

At one point we talked about get_gfn() taking a refcount on the
underlying MFN, which would fix this more cleanly.  ISTR the problem was
how to make sure the refcount was moved when the gfn->mfn mapping
changed. 
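
Roughly this sort of thing, I mean (a hypothetical sketch only, not a
worked-out interface; the wrapper name is made up):

    /* Hypothetical refcounting variant of get_gfn(): pin the underlying
     * frame so a racing free can't pull it out from under the caller.
     * mfn_t/unsigned long conversions are glossed over. */
    static mfn_t get_gfn_pinned(struct domain *d, unsigned long gfn,
                                p2m_type_t *t)
    {
        mfn_t mfn = get_gfn(d, gfn, t);

        /* If the owner has already dropped its references (e.g. the
         * domain is being torn down), refuse the lookup. */
        if ( mfn_valid(mfn) && !get_page(mfn_to_page(mfn), d) )
            mfn = INVALID_MFN;

        return mfn;  /* caller put_gfn()s as usual, plus put_page()
                      * on success */
    }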

Can you stick a WARN() in mm.c to get the actual path that leads to the
failure?
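
Something along these lines, perhaps (a sketch):

    /* immediately before the "Error getting mfn" printk in mm.c */
    WARN();  /* dumps a stack trace for the offending path */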

Tim.

* Re: Domain relinquish resources racing with p2m access
From: Andres Lagar-Cavilla @ 2012-02-10 18:05 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, keir

> At 12:49 -0800 on 01 Feb (1328100564), Andres Lagar-Cavilla wrote:
>> So we've run into this interesting (race?) condition while
>> stress-testing. We pummel the domain with paging, sharing and mmap
>> operations from dom0, and concurrently launch a domain destruction. We
>> often see something along these lines in the logs:
>>
>> (XEN) mm.c:958:d0 Error getting mfn 859b1a (pfn ffffffffffffffff) from L1
>> entry 8000000859b1a625 for l1e_owner=0, pg_owner=1
>>
>> We're using the synchronized p2m patches just posted, so my analysis is
>> as follows:
>>
>> - the domain destroy domctl kicks in. It calls relinquish resources. This
>> disowns and puts most domain pages, resulting in invalid (0xff...ff) m2p
>> entries.
>>
>> - In parallel, a do_mmu_update is making progress. It has no issues
>> performing a p2m lookup because the p2m has not been torn down yet; we
>> haven't gotten to the RCU callback. Eventually, the mapping fails in
>> page_get_owner in get_page_from_l1e.
>>
>> The map attempt fails, as expected, but what makes me uneasy is the fact
>> that there is a still-active p2m lurking around, with seemingly valid
>> translations to valid MFNs, while all the domain pages are gone.
>
> Yes.  That's OK as long as we know that any user of that page will
> fail, but I'm not sure that we do.
>
> At one point we talked about get_gfn() taking a refcount on the
> underlying MFN, which would fix this more cleanly.  ISTR the problem was
> how to make sure the refcount was moved when the gfn->mfn mapping
> changed.

Oh, I ditched that because it's too hairy and error-prone. There are
plenty of nested get_gfn()s, with the n>1 call changing the mfn. So unless
we make a point of remembering the mfn at the point of get_gfn, it's just
impossible to make this work. And "remembering the mfn" means a serious
uglification of existing code.
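
Schematically (illustrative pseudo-C; foo() and helper() are made-up
stand-ins for real call sites):

    /* The inner get_gfn() can change the gfn->mfn mapping (paging the
     * page in, breaking sharing, ...), so a ref taken at the outer
     * get_gfn() would pin the wrong frame by the time it's dropped. */
    static void foo(struct domain *d, unsigned long gfn)
    {
        p2m_type_t t;
        mfn_t mfn = get_gfn(d, gfn, &t);  /* a ref here would pin mfn A */

        helper(d, gfn);   /* does its own get_gfn()/put_gfn(); the gfn
                           * may map mfn B afterwards */

        put_gfn(d, gfn);  /* which frame's ref should this drop, A or B? */
    }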

>
> Can you stick a WARN() in mm.c to get the actual path that leads to the
> failure?

As a debug aid, or as actual code to make it into the tree? This typically
happens in batches of a few dozen, so a WARN is going to massively spam
the console with stack traces. Guess how I found out ...

The moral is that the code is reasonably defensive, so this gets caught,
albeit in a rather verbose way. But it might eventually bite someone who
does a get_gfn and neither checks that the domain is dying nor ensures
that a get_page succeeds.
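
Concretely, the safe caller pattern would be something like this (just a
sketch with the usual names; error handling, context and mfn_t/unsigned
long conversions are assumed):

    /* After get_gfn(), either notice the domain is dying or pin the
     * frame with get_page() before touching it. */
    p2m_type_t t;
    mfn_t mfn = get_gfn(d, gfn, &t);

    if ( !mfn_valid(mfn) || d->is_dying ||
         !get_page(mfn_to_page(mfn), d) )
    {
        put_gfn(d, gfn);
        return -EINVAL;  /* frame is gone, or the domain is on its way out */
    }

    /* ... safe to use the frame here ... */

    put_page(mfn_to_page(mfn));
    put_gfn(d, gfn);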

Andres

>
> Tim.
>
