* RE: xm pause causing lockup
From: Ian Pratt @ 2005-04-15 19:29 UTC (permalink / raw)
  To: Kip Macy; +Cc: xen-devel


I need to think about this more, but it looks like you have an L2 page
that has a type count of 1 but hasn't been validated. You're then
looping when you try and increment it to 2 thinking that you're racing
someone else. 

Does this happen if you boot with 'nosmp'? I don't really believe it's a
race, but might be worth checking.

Also, it's worth adding a printk into this loop just to check that that
is where you're getting caught.

            /* Someone else is updating validation of this page. Wait... */
            while ( (y = page->u.inuse.type_info) == x )
                cpu_relax();
            goto again;

We need to figure out how the type count managed to get to one without
the page being validated. I presume you're doing a debug=y build of Xen?
Do you get any warnings about illegal mmu_update attempts when you boot
FreeBSD?

Ian
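
The wait described above can be reproduced in a toy, single-threaded model. This is only a sketch and not the Xen source: the `toy_*` names, the bit position of the validated flag, and the spin limit are all made up for illustration. It assumes that validating a page table takes a type reference on every page the table maps, so a table that maps itself re-enters the same function while its own count is already 1 and its validated flag is still clear. That is the shape the quoted backtrace suggests, where frames #5 and #0 are both get_page_type on page=0xfd494968.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen for illustration only. */
#define TOY_VALIDATED  (1u << 28)
#define TOY_COUNT_MASK 0xffffu
#define TOY_SPIN_LIMIT 1000u   /* the real loop has no limit */

struct toy_page {
    uint32_t type_info;   /* type bits | validated flag | type count */
    int self_ref;         /* nonzero if the table has an entry mapping itself */
};

enum toy_result { TOY_OK, TOY_STUCK };

static enum toy_result toy_get_type(struct toy_page *pg, uint32_t type);

/* Validation takes a type reference on each mapped page; with a
 * self-referential entry that page is the table being validated. */
static enum toy_result toy_validate(struct toy_page *pg, uint32_t type)
{
    return pg->self_ref ? toy_get_type(pg, type) : TOY_OK;
}

static enum toy_result toy_get_type(struct toy_page *pg, uint32_t type)
{
    uint32_t x;
    unsigned int spins = 0;

    assert(pg != NULL);
    x = pg->type_info;

    if ((x & TOY_COUNT_MASK) == 0) {
        /* First reference: count goes 0 -> 1, then the page is validated. */
        pg->type_info = type | 1u;
        if (toy_validate(pg, type) != TOY_OK)
            return TOY_STUCK;
        pg->type_info |= TOY_VALIDATED;
        return TOY_OK;
    }

    if (!(x & TOY_VALIDATED)) {
        /* Count is nonzero but validation has not finished: wait for the
         * validator. If the validator is our own caller, nothing changes. */
        while (pg->type_info == x) {
            if (++spins == TOY_SPIN_LIMIT)
                return TOY_STUCK;
        }
    }

    pg->type_info++;
    return TOY_OK;
}
```

In this model a page with self_ref == 0 ends up validated with a count of 1, while a page with self_ref != 0 is left at type | 1 with the validated flag clear, which is the same pattern as the x = 0x40000001 seen in the gdb session.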

> Without the ability to continue and only a very basic 
> understanding of the page typing code there is not a whole 
> lot to go on. Let me know if there is some other bit of 
> information that I can provide you with.
> 
>          -Kip
> 
> Before attaching:
> (XEN) 'd' pressed -> dumping registers
> (XEN) CPU:    1
> (XEN) EIP:    0808:[<fc52d59f>]      
> (XEN) EFLAGS: 00000246   CONTEXT: hypervisor
> (XEN) eax: 40000001   ebx: 00000000   ecx: fcfe3740   edx: fcfe3740
> (XEN) esi: 00007ff0   edi: 00000001   ebp: fcffbda0   esp: fcffbd58
> (XEN) ds: 0810   es: 0810   fs: 0810   gs: 0810   ss: 0810   cs: 0808
> (XEN) Stack trace from ESP=fcffbd58:
> (XEN)    80000003 00000001 fcfe3740 fcfe3740 fcfe3740 80000003
> 80000004 80000003
> (XEN)    00000000 00007ff0 fcffbda0 [fc52bfec] fd494968 fcfe3740
> fcffbdc0 40000001
> (XEN)    40000001 40000002 fcffbdd0 [fc52c07b] fd494968 25fe0000
> 00000000 00000000
> (XEN)    000003d1 00000000 fcffbde0 [fc52bcec] 00000000 fd494968
> fcffbe00 [fc52c52e]
> (XEN)    0000630f 25fe0000 fcfe3740 [fc52d100] fffffffc 00000000
> fcffe000 00000001
> (XEN)    00000001 ff85b000 fcffbe40 [fc52c889] 0630f061 0000630f
> fcfe3740 000002ff
> (XEN)    00000001 f0000000 f0000000 00000004 f0000001 f0000000
> 000002ff ff85b000
> (XEN)    0000630f fcfe3740 fcffbe60 [fc52d0f0] fd494968 000001fa
> fc5b20c0 [fc53185d]
> (XEN)    40000000 00000002 fcffbeb0 [fc52d771] fd494968 40000000
> fcfe3740 fcfe3740
> (XEN)    fcfe3740 80000002 80000003 00000004 00000000 f0000000
> f0000000 00000004
> (XEN)    40000001 f0000000 fd49497c f0000000 f0000000 40000001
> fcffbee0 [fc52c07b]
> (XEN)    fd494968 40000000 002ed518 00000000 a089075b 00000001
> fcfe3740 00000000
> (XEN)    00007ff0 fd494968 fcffbfb0 [fc52df98] 0000630f 40000000
> fcfe3740 00000292
> (XEN)    fc5781c0 00000001 0019b901 00000000 00804e95 00000000
> a089075b 000000a1
> (XEN)    a10955f0 000000a1 00000001 fcfea040 00007ff0 00000001
> fcffbf80 00000000
> (XEN)    fcfe3740 00000000 fcfe3740 00000000 a10955f0 000000a1
> 00000000 fcffbf98
> (XEN)    c0293bac 0000000c 00000003 [fc515bfc] a08902cd 000000a1
> 00000002 fcfe3740
> (XEN)    fcfea040 fd494968 00000000 40000000 00000001 00000001
> 00000000 00000000
> (XEN)    00000001 0000630f c018a19b 00000001 fcfea040 00007ff0
> c0293bc8 [fc54e923]
> (XEN)    c0293bac 00000001 00000000 00007ff0 00000001 c0293bc8
> 0000001a 00000000
> (XEN) Call Trace from ESP=fcffbd58:
> (XEN)    [<fc52bfec>] [<fc52c07b>] [<fc52bcec>] [<fc52c52e>]
> [<fc52d100>] [<fc52c889>]
> (XEN)    [<fc52d0f0>] [<fc53185d>] [<fc52d771>] [<fc52c07b>]
> [<fc52df98>] [<fc515bfc>]
> (XEN)    [<fc54e923>] 
> (XEN) Waiting for GDB to attach to XenDBG
> 
> 
> (gdb) bt
> #0  0xfc52d59f in get_page_type (page=0xfd494968, type=0x25fe0000) at mm.c:1235
> #1  0xfc52c07b in get_page_and_type_from_pagenr (page_nr=0x630f, type=0x25fe0000, d=0xfcfe3740) at mm.c:360
> #2  0xfc52c52e in get_page_from_l2e (l2e={l2_lo = 0x630f061}, pfn=0x630f, d=0xfcfe3740, va_idx=0x2ff) at mm.c:495
> #3  0xfc52c889 in alloc_l2_table (page=0xfd494968) at mm.c:679
> #4  0xfc52d0f0 in alloc_page_type (page=0xfd494968, type=0x40000000) at mm.c:1083
> #5  0xfc52d771 in get_page_type (page=0xfd494968, type=0x40000000) at mm.c:1269
> #6  0xfc52c07b in get_page_and_type_from_pagenr (page_nr=0x630f, type=0x40000000, d=0xfcfe3740) at mm.c:360
> #7  0xfc52df98 in do_mmuext_op (uops=0xc0293bac, count=0x1, pdone=0x0, foreigndom=0x7ff0) at mm.c:1499
> #8  0xfc54e923 in test_all_events () at bitops.h:239
> #9  0xc0293bac in ?? ()
> 
> (gdb) f 7
> #7  0xfc52df98 in do_mmuext_op (uops=0xc0293bac, count=0x1, pdone=0x0, foreigndom=0x7ff0) at mm.c:1499
> 1499                okay = get_page_and_type_from_pagenr(op.mfn, type, FOREIGNDOM);
> (gdb) p op
> $9 = {
>   cmd = 0x1,
>   {
>     mfn = 0x630f,
>     linear_addr = 0x630f
>   },
>   {
>     nr_ents = 0xc018a19b,
>     cpuset = 0xc018a19b
>   }
> }
> (gdb) p x
> $1 = 0x40000001
> (gdb) x nx
> 0x40000002:     Ignoring packet error, continuing...
> Reply contains invalid hex digit 40
> (gdb) p y
> $2 = 0x40000001
> (gdb) p page->u.inuse.type_info
> $3 = 0x40000001
> (gdb) p x
> $4 = 0x40000001
> (gdb) p nx
> $5 = 0x40000002
> (gdb) p y
> $6 = 0x40000001
> (gdb) p x
> $7 = 0x40000001
> (gdb) p sizeof(page->u.inuse.type_info)
> $8 = 0x4
> 
> 
> 
> On 4/15/05, Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> wrote:
> > Wild! It really is looping in get_page_type.
> > 
> > Any chance you could use the serial debugger to find out what x, nx 
> > and y are in the cmpxchg?
> > 
> > I've tried to think of duff inputs that could cause it to loop, but 
> > I'm not smart enough.
> > 
> > Ian
> > 
> > > -----Original Message-----
> > > From: Kip Macy [mailto:kip.macy@gmail.com]
> > > Sent: 15 April 2005 18:13
> > > To: Ian Pratt
> > > Cc: Keir Fraser; xen-devel; ian.pratt@cl.cam.ac.uk
> > > Subject: Re: [Xen-devel] xm pause causing lockup
> > >
> > > Great, thanks. I'm now running a completely fresh tree from last 
> > > night.
> > >
> > > Over the course of several minutes I hit 'd' a number of times. The
> > > addresses I got were:
> > >
> > > 0xfc51c742
> > > 0xfc51c746
> > > 0xfc51c74b
> > > 0xfc51c740
> > >
> > > (gdb) x/i 0xfc51c742
> > > 0xfc51c742 <get_page_type+1218>:        mov    0x40(%esp,1),%eax
> > > (gdb) x/i 0xfc51c746
> > > 0xfc51c746 <get_page_type+1222>:        mov    0x14(%eax),%ebx
> > > (gdb) x/i 0xfc51c74b
> > > 0xfc51c74b <get_page_type+1227>:        je     0xfc51c740 <get_page_type+1216>
> > > (gdb) x/i 0xfc51c740
> > > 0xfc51c740 <get_page_type+1216>:        repz nop
> > >
> > >
> > >                -Kip
> > >
> > > On 4/14/05, Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: xen-devel-bounces@lists.xensource.com
> > > > > [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Kip Macy
> > > > > Sent: 15 April 2005 05:36
> > > > > To: Keir Fraser
> > > > > Cc: xen-devel
> > > > > Subject: Re: [Xen-devel] xm pause causing lockup
> > > > >
> > > > > To further check this I added:
> > > > >  printk("%s %d %d %d %d %d\n", __FUNCTION__, op->cmd,
> > > > > op->mfn, count, success_count, domid); to
> > > > > HYPERVISOR_mmuext_op and something similar to mmu_update.
> > > >
> > > > Is your hypothesis that Xen gets stuck in either the mmuext_op or
> > > > mmu_update loops?
> > > > Are you running with watchdog enabled?
> > > >
> > > > It might be good to add a printk at the end so that you can prove this.
> > > >
> > > > Hitting 'd' on the debug console will give us an EIP on CPU 1.
> > > >
> > > > Ian
> > > >
> > >
> >
> 


* Re: xm pause causing lockup
From: Kip Macy @ 2005-04-15 21:04 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel

> Does this happen if you boot with 'nosmp'? I don't really believe it's a
> race, but might be worth checking.

Yes, it still happens. I would have found it quite astonishing if it
were a race.
(XEN) EIP:    0808:[<fc52d5a3>]
(gdb) x/i 0xfc52d5a3
0xfc52d5a3 <get_page_type+265>: mov    0x14(%eax),%eax
(gdb) info line *0xfc52d5a3
Line 1236 of "mm.c" starts at address 0xfc52d5a0 <get_page_type+262>
and ends at 0xfc52d5b0 <get_page_type+278>.
(gdb) 

Lines 1236-1240 of local mm.c:
            while ( (y = page->u.inuse.type_info) == x )
                cpu_relax();
            counter++;
            printk("page was not validated");
            goto again;

> Also, it's worth adding a printk into this loop just to check that that
> is where you're getting caught.

Obviously wasn't thinking and stuck it in the wrong place.
Nonetheless, even without the printk I think I've proven my point.
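
A diagnostic placed after the loop never fires if the loop never exits, so the message has to live inside the loop body and be rate limited. The following is a minimal standalone sketch of that idea, not a patch against mm.c: `spin_with_report`, its parameters, and the give-up limit are hypothetical, and `printf` stands in for printk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not part of Xen. Spins while *word still equals
 * `seen`, prints a message every `report_every` iterations, and gives up
 * after `max_spins`. Returns 1 if the word changed, 0 on giving up. */
static int spin_with_report(const volatile uint32_t *word, uint32_t seen,
                            unsigned long report_every,
                            unsigned long max_spins)
{
    unsigned long spins = 0;

    assert(report_every != 0);

    while (*word == seen) {
        spins++;
        if (spins % report_every == 0)
            printf("still spinning: type_info=%#x after %lu spins\n",
                   (unsigned int)seen, spins);
        if (spins >= max_spins)
            return 0;   /* the value never changed */
    }
    return 1;
}
```

With the word stuck at 0x40000001 the helper reports periodically and then returns 0; if the word has already moved on, it returns 1 without printing anything.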


> 
>             /* Someone else is updating validation of this page. Wait... */
>             while ( (y = page->u.inuse.type_info) == x )
>                 cpu_relax();
>             goto again;

Yep.

> 
> We need to figure out how the type count managed to get to one without
> the page being validated. I presume you're doing a debug=y build of Xen?

Correct. Nothing comes out on the console apart from debug output from FreeBSD.

> Do you get any warnings about illegal mmu_update attempts when you boot
> FreeBSD?

No, I don't. This is the offending code snippet from pmap_pinit:

        /* install self-referential address mapping entry(s) */
	for (i = 0; i < NPGPTD; i++) {
		ma = xpmap_ptom(VM_PAGE_TO_PHYS(ptdpg[i]));
		pmap->pm_pdir[PTDPTDI + i] = ma | PG_V | PG_A | PG_M;
#ifdef PAE
		pmap->pm_pdpt[i] = ma | PG_V;
#endif
		/* re-map page directory read-only */
		PT_SET_MA(pmap->pm_pdir, *vtopte((vm_offset_t)pmap->pm_pdir) & ~PG_RW);
		xen_pgd_pin(ma);
	}

PT_SET_MA is just a wrapper for update_va_mapping. Have there been any
recent changes to the page typing code that would cause it to get
confused by a self-referential mapping?

                          -Kip
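
For readers unfamiliar with self-referential (linear) mappings, the address arithmetic can be sketched as follows. This assumes a 32-bit non-PAE layout with 4 KiB pages, 1024-entry tables and 4-byte entries; the `toy_*` names are invented for this example. Treating the va_idx=0x2ff from the earlier backtrace as the self-referential slot is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u   /* 4 KiB pages */
#define TOY_PDE_SHIFT  22u   /* each directory entry covers 4 MiB */
#define TOY_ENTRY_SIZE 4u    /* 32-bit table entries */

/* Base of the 4 MiB window through which all page tables become visible
 * once directory slot `slot` points back at the directory itself. */
static uint32_t toy_window_base(uint32_t slot)
{
    assert(slot < 1024u);
    return slot << TOY_PDE_SHIFT;
}

/* Virtual address of the page-table entry that maps `va`. */
static uint32_t toy_pte_va(uint32_t slot, uint32_t va)
{
    return toy_window_base(slot) + (va >> TOY_PAGE_SHIFT) * TOY_ENTRY_SIZE;
}

/* Virtual address at which the page directory itself appears. */
static uint32_t toy_pd_va(uint32_t slot)
{
    return toy_window_base(slot) + (slot << TOY_PAGE_SHIFT);
}
```

With slot 0x2ff the window starts at 0xbfc00000 and the directory shows up at 0xbfeff000. Under these assumptions the directory page is reachable through one of its own entries, so it is referenced both as a directory and as an ordinary page table, which is the case the page typing code has to handle.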


* buggy linear page table handling Re: xm pause causing lockup
From: Kip Macy @ 2005-04-16 19:59 UTC (permalink / raw)
  To: xen-devel

I went through a few quick iterations to test page table reference
counting. In short, if I L2 pin a zeroed page that I've re-mapped
read-only the pin succeeds. If the page has a self-referential mapping
before it is remapped read-only the pin never returns. It is probably
safe to conclude that the type count is not correctly changed when the
page is re-mapped if there is a self-referential entry. This used to
work, thus it is also safe to say that this is a regression introduced
some time between 3/22 and 4/11. Test code from pmap_pinit below.

                          -Kip 


	/* ***** TEMP \/ ********** */
	ma = xpmap_ptom(VM_PAGE_TO_PHYS(ptdpg[0]));
#if 0
	/* works */
	pmap_qremove((vm_offset_t)pmap->pm_pdir, NPGPTD);
#elif 0
 	/* works */
	PT_SET_MA(pmap->pm_pdir, 0);
#elif 0
	/* works */
	PT_SET_MA(pmap->pm_pdir, ma | PG_V | PG_A);
#else 		
	/* causes lockup on pin */
	pmap->pm_pdir[PTDPTDI + i] = ma | PG_V | PG_A | PG_M;
	PT_SET_MA(pmap->pm_pdir, ma | PG_V | PG_A);
#endif
	
	printk("pinning %p - pass 0\n", ma);
	xen_pgd_pin(xpmap_ptom(VM_PAGE_TO_PHYS(ptdpg[0])));
	printk("pinned %p - pass 0\n", ma);
	/* ***** TEMP ^ ********** */

On 4/15/05, Kip Macy <kip.macy@gmail.com> wrote:
> > Does this happen if you boot with 'nosmp'? I don't really believe it's a
> > race, but might be worth checking.
> 
> Yes, it still happens. I would have found it quite astonishing if it
> were a race.
> (XEN) EIP:    0808:[<fc52d5a3>]
> (gdb) x/i 0xfc52d5a3
> 0xfc52d5a3 <get_page_type+265>: mov    0x14(%eax),%eax
> (gdb) info line *0xfc52d5a3
> Line 1236 of "mm.c" starts at address 0xfc52d5a0 <get_page_type+262>
> and ends at 0xfc52d5b0 <get_page_type+278>.
> (gdb)
> 
> Lines 1236-1240 of local mm.c:
>             while ( (y = page->u.inuse.type_info) == x )
>                 cpu_relax();
>             counter++;
>             printk("page was not validated");
>             goto again;
> 
> > Also, it's worth adding a printk into this loop just to check that that
> > is where you're getting caught.
> 
> Obviously wasn't thinking and stuck it in the wrong place.
> Nonetheless, even without the printk I think I've proven my point.
> 
> 
> >
> >             /* Someone else is updating validation of this page. Wait... */
> >             while ( (y = page->u.inuse.type_info) == x )
> >                 cpu_relax();
> >             goto again;
> 
> Yep.
> 
> >
> > We need to figure out how the type count managed to get to one without
> > the page being validated. I presume you're doing a debug=y build of Xen?
> 
> Correct. Nothing comes out on the console apart from debug output from FreeBSD.
> 
> > Do you get any warnings about illegal mmu_update attempts when you boot
> > FreeBSD?
> 
> No, I don't. This is the offending code snippet from pmap_pinit:
> 
>         /* install self-referential address mapping entry(s) */
>         for (i = 0; i < NPGPTD; i++) {
>                 ma = xpmap_ptom(VM_PAGE_TO_PHYS(ptdpg[i]));
>                 pmap->pm_pdir[PTDPTDI + i] = ma | PG_V | PG_A | PG_M;
> #ifdef PAE
>                 pmap->pm_pdpt[i] = ma | PG_V;
> #endif
>                 /* re-map page directory read-only */
>                 PT_SET_MA(pmap->pm_pdir, *vtopte((vm_offset_t)pmap->pm_pdir) & ~PG_RW);
>                 xen_pgd_pin(ma);
>         }
> 
> PT_SET_MA is just a wrapper for update_va_mapping. Have there been any
> recent changes to the page typing code that would cause it to get
> confused by a self-referential mapping?
> 
>                           -Kip
>
