From: "Andres Lagar-Cavilla"
Subject: Domain relinquish resources racing with p2m access
Date: Wed, 1 Feb 2012 12:49:24 -0800
Message-ID: <0642c1aa7bb490b322c1a5c7d12ebb54.squirrel@webmail.lagarcavilla.org>
To: xen-devel@lists.xensource.com, tim@xen.org, keir@xen.org

So we've run into an interesting (race?) condition while stress-testing. We pummel the domain with paging, sharing and mmap operations from dom0, and concurrently we launch a domain destruction. We often get something along these lines in the logs:

(XEN) mm.c:958:d0 Error getting mfn 859b1a (pfn ffffffffffffffff) from L1 entry 8000000859b1a625 for l1e_owner=0, pg_owner=1

We're using the synchronized p2m patches just posted, so my analysis is as follows:

- The domain destroy domctl kicks in and calls relinquish resources. This disowns and puts most domain pages, leaving invalid (0xff...ff) m2p entries behind.
- In parallel, a do_mmu_update is making progress. It has no trouble performing a p2m lookup, because the p2m has not been torn down yet; we haven't reached the RCU callback. Eventually, the mapping fails at the page_get_owner check in get_page_from_l1e.

The mapping fails, as expected. What makes me uneasy is that there is still an active p2m lurking around, with seemingly valid translations to valid mfns, while all the domain pages are gone.

Is this a race condition? Can this lead to trouble?

Thanks!
Andres