From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tomasz Wroblewski Subject: Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present Date: Mon, 26 Aug 2013 19:00:28 +0200 Message-ID: <521B89AC.1040509@citrix.com> References: <1377511404-3365-1-git-send-email-tomasz.wroblewski@citrix.com> <521B543902000078000EE55D@nat28.tlf.novell.com> <521B48FF.1040904@citrix.com> <521B6D7302000078000EE64E@nat28.tlf.novell.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: Received: from mail6.bemta4.messagelabs.com ([85.158.143.247]) by lists.xen.org with esmtp (Exim 4.72) (envelope-from ) id 1VE0AI-00048I-8r for xen-devel@lists.xenproject.org; Mon, 26 Aug 2013 17:01:18 +0000 In-Reply-To: <521B6D7302000078000EE64E@nat28.tlf.novell.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Jan Beulich Cc: xen-devel@lists.xenproject.org, dgdegra@tycho.nsa.gov, Andrew Cooper List-Id: xen-devel@lists.xenproject.org On 08/26/2013 03:00 PM, Jan Beulich wrote: >>>> On 26.08.13 at 14:24, Tomasz Wroblewski wrote: >> On 08/26/2013 01:12 PM, Jan Beulich wrote: >>>>>> On 26.08.13 at 12:03, Tomasz Wroblewski wrote: >>>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not >>>> specified. >>>> This seems to have worked on 4.1 at least. >>> Looking at the code (4.1.5) I can't see what would prevent the >>> same NULL pointer deref. Care to explain? >> The crash doesn't happen at the NULL pointer dereference site though, > Then does it deref the NULL pointer, or does it not? If it does and > merely doesn't crash because something happens to be mapped > there, that's still a bug. So after bit more tracing it looks like the issue is not a null pointer deref (it is avoided as Daniel pointed out), but rather some xmalloc/xfree pairs which happen inside security_load_policy even if pointer is null. security_load_policy calls policydb_init, then tries to read the policy which fails since length of is 0, so it calls policydb_destroy. Inside these functions there is some hashtable construction/destruction happening, and a reasonably sizable ones too (there are 7 or so hashtables, biggest one uses an array of 512 pointers, they total to use 768 pointers). I've verified that replacing security_load_policy call completely with the following allocation/deallocaiton is enough to cause this crash: //ret = security_load_policy(policy_buffer, policy_size); { void ** p = xmalloc_array(void*, 768); xfree(p); } Note that this allocation succeeds, and also if you would not call xfree (which is not called if say a policy was succesfully loaded), there is no crash. So yeah my original patch accidentaly fixes it by just avoiding the alloc/free completely. The shaky manually constructed call graph for the assertion failure: setup.c: init_idle_domain schedule.c: scheduler_init domain.c: domain_create domain.c: alloc_domain_struct domain.c: alloc_xenheap_pages .. page_alloc.c: alloc_heap_pages flushtlb.h: flush_tlb_mask flushtlb.h: flush_mask smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here I apparently can't get a real stacktrace because adding dump_execution_state in flush_area_mask just causes the "Unknown interrupt" error, similarily to what hitting the ASSERT fail does. I printed the assert condition manually to verify it tho and interrupts are disabled there so its bound to fail. So it looks like the flush_tlb is called too early (with interrupts disabled) due to large deallocations in the policy loader code? On 08/26/2013 03:00 PM, Jan Beulich wrote: >>>> On 26.08.13 at 14:24, Tomasz Wroblewski wrote: >> On 08/26/2013 01:12 PM, Jan Beulich wrote: >>>>>> On 26.08.13 at 12:03, Tomasz Wroblewski wrote: >>>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not >>>> specified. >>>> This seems to have worked on 4.1 at least. >>> Looking at the code (4.1.5) I can't see what would prevent the >>> same NULL pointer deref. Care to explain? >> The crash doesn't happen at the NULL pointer dereference site though, >> but a bit later, when xen tries to flush tlbs for first time I believe, >> which happens during page allocation for the initial domain structure. I >> traced it to the following ASSERT in smp.c (so yes I should add this >> particular crash likely is limited to debug builds then) >> >> void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int >> flags) >> { >> ASSERT(local_irq_is_enabled()); >> ... >> >> The actual crash message is unhelpful since it's basically only >> >> ... >> (XEN) Using scheduler: SMP Credit Scheduler (credit) >> (XEN) Unknown interrupt (cr2=0000000000000000) >> >> >> Either removing the assert (which is obviously bad), or checking for the > The assertion is in no way bad. It's the too early use of the > function that is the problem here. > >> null pointer deref as in the submitted patch seems to be fixing it. I'm >> suspecting it was always broken somehow but just was hidden or had >> different side effects on 4.1 than it does now. I do lack for a good >> explanation why fiddling with null addresses breaks up this assert, though. > Also, you didn't show the call trace that made things get here (yes, > you may need to construct this manually). I'm in no way convinced > that there's a NULL pointer involved here at all - the fact that CR2 > is zero doesn't mean a page fault occurred in the first place. > > Jan >