* Xen dom0 crash in get_phys_to_machine
@ 2010-10-12  7:55 Alan J. Wylie
  2010-10-13  0:19 ` Jeremy Fitzhardinge
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Alan J. Wylie @ 2010-10-12  7:55 UTC (permalink / raw)
  To: xen-devel; +Cc: Jeremy Fitzhardinge


Further to my previous report:

http://lists.xensource.com/archives/html/xen-devel/2010-10/msg00257.html
Message-ID: <19629.39326.337589.71778@wylie.me.uk>

I've added some debugging and have tracked down the crash to the
recently modified code in arch/x86/xen/mmu.c

Since the last version of the code that worked for me, mmu.c has been
modified with a lot of P2M changes. It now crashes in
get_phys_to_machine().

Having tracked down the crash and the offending value of pfn, I then
further modified the code only to print if ( pfn == 0x18C3 ), and also
to print intermediate values.

<7>ALANW get_phys_to_machine pfn 000018C3
<7> topidx 00000000
<7> mididx 0000000C
<7> idx 000000C3
(XEN) d0:v0: unhandled page fault (ec=0000)

If there is any more debugging that I can do, I'll be only too happy to
oblige.
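
The three indices printed above are consistent with a three-level table that has 512 entries per level. A minimal sketch of that split follows, assuming 4 KiB pages and 8-byte entries; `ENTRIES_PER_PAGE`, `struct p2m_indices` and `split_pfn` are illustrative names, not the helpers used in mmu.c.

```c
/* Assumption: 4 KiB pages and 8-byte entries, so 512 entries per level.
 * These values are not quoted from mmu.c. */
#define ENTRIES_PER_PAGE 512UL

struct p2m_indices {
    unsigned long topidx;   /* which mid-level page */
    unsigned long mididx;   /* which leaf page within that mid-level page */
    unsigned long idx;      /* which entry within the leaf page */
};

/* Split a pfn into the three indices of a 512 x 512 x 512 table. */
static struct p2m_indices split_pfn(unsigned long pfn)
{
    struct p2m_indices r;

    r.topidx = pfn / (ENTRIES_PER_PAGE * ENTRIES_PER_PAGE);
    r.mididx = (pfn / ENTRIES_PER_PAGE) % ENTRIES_PER_PAGE;
    r.idx    = pfn % ENTRIES_PER_PAGE;
    return r;
}
```

For pfn 0x18C3 this yields topidx 0, mididx 0xC and idx 0xC3, which matches the debug output, so the index arithmetic itself looks sound.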

System: Supermicro SM-SC825TQ-R720LPB, 8GB RAM
Motherboard: X8DTL
Processor: 1 x Intel XEON E5506 quad core
RAID controller: LSI MegaRAID SAS 8708

git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen
branch xen/stable-2.6.32.x

commit 179eca50d08fa05d7650fcb8a0d3e6598cf2388a
    Merge commit 'v2.6.32.24' into xen/next-2.6.32

------8<------8<------8<------8<------8<------8<------8<------8<------8<------8<
/* initial changes to mmu.c to track down crash */
+static char hex[9];

 unsigned long get_phys_to_machine(unsigned long pfn)
 {
        unsigned topidx, mididx, idx;
+       unsigned long rv;
+
+       longtohex(pfn);
+       xen_raw_printk(KERN_DEBUG "ALANW get_phys_to_machine %s", hex );

        if (unlikely(pfn >= MAX_P2M_PFN))
                return INVALID_P2M_ENTRY;
@@ -406,7 +432,12 @@ unsigned long get_phys_to_machine(unsigned long pfn)
        mididx = p2m_mid_index(pfn);
        idx = p2m_index(pfn);

-       return p2m_top[topidx][mididx][idx];
+       rv=p2m_top[topidx][mididx][idx];
+
+       longtohex(rv);
+       xen_raw_printk(KERN_DEBUG " returns %s\n", hex );
+
+       return rv;
 }
------8<------8<------8<------8<------8<------8<------8<------8<------8<------8<
...

(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: ffffffff81000000->ffffffff816b1000
(XEN)  Init. ramdisk: ffffffff816b1000->ffffffff816b1000
(XEN)  Phys-Mach map: ffffffff816b1000->ffffffff818b1000
(XEN)  Start info:    ffffffff818b1000->ffffffff818b14b4
(XEN)  Page tables:   ffffffff818b2000->ffffffff818c3000
(XEN)  Boot stack:    ffffffff818c3000->ffffffff818c4000
(XEN)  TOTAL:         ffffffff80000000->ffffffff81c00000
(XEN)  ENTRY ADDRESS: ffffffff814cc200

...

<7>ALANW get_phys_to_machine 0003FFFC<7> returns 0017A544
<7>ALANW get_phys_to_machine 0003FFFD<7> returns 0017A545
<7>ALANW get_phys_to_machine 0003FFFE<7> returns 0017A546
<7>ALANW get_phys_to_machine 0003FFFF<7> returns 0017A547
<7>ALANW get_phys_to_machine 000002ED<7> returns 002382ED
<7>ALANW get_phys_to_machine 000002ED<7> returns 002382ED
init_memory_mapping: 0000000100000000-00000002bf780000
 0100000000 - 02bf780000 page 4k
kernel direct mapping tables up to 2bf780000 @ 18c3000-2ecb000
<7>ALANW get_phys_to_machine 000018C3(XEN) d0:v0: unhandled page fault (ec=0000)
(XEN) Pagetable walk from ffffffff816bd618:
(XEN)  L4[0x1ff] = 0000000239003067 0000000000001003
(XEN)  L3[0x1fe] = 0000000239007067 0000000000001007
(XEN)  L2[0x00b] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.0.2-rc1-pre  x86_64  debug=n  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e033:[<ffffffff8100c393>]
(XEN) RFLAGS: 0000000000000206   EM: 1   CONTEXT: pv guest
(XEN) rax: ffffffff816bd000   rbx: 00000000000000c3   rcx: 0000000000000000
(XEN) rdx: ffffffff8158b000   rsi: 0000000000000025   rdi: 0000000000000000
(XEN) rbp: ffffffffffffffff   rsp: ffffffff81445c00   r8:  000000000000000a
(XEN) r9:  ffffffff8157bf90   r10: ffffffff8157bd90   r11: 0000000000000200
(XEN) r12: 00000000018c3000   r13: 8000000000000163   r14: 0000000000000001
(XEN) r15: 00000000000009ff   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000239001000   cr2: ffffffff816bd618
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff81445c00:
(XEN)    0000000000000000 0000000000000200 0000000000000000 ffffffff8100c393
(XEN)    000000010000e030 0000000000010006 ffffffff81445c40 000000000000e02b
(XEN)    ffffffff81445de8 80000000018c3063 00000000018c3000 ffffffff8100c657
(XEN)    80000000018c3063 ffffffff8100c72a 0000000000000000 ffffffff8100b789
(XEN)    0000000239002040 ffffffff815553e0 000000000000000f 8000000000000163
(XEN)    80000000018c3063 ffffffffff400000 ffffffff81536000 00000002bf780000
(XEN)    ffffffff814de1a3 0000000000000001 ffffffff814c30a0 ffffffffff400000
(XEN)    0000000139002038 ffffffff815553e0 ffffffff81445d90 00000000018c3000
(XEN)    ffff8802bf780000 00000002bf780000 00000002bf780000 0000000000000005
(XEN)    ffffffff813085a8 ffff880001002048 0000000240000000 ffff8802bf780000
(XEN)    ffffffff814f8091 0000000000000001 ffffffff814c30a0 8000000000000163
(XEN)    0000000000000000 0000000000000004 0000000000000000 0000000000000000
(XEN)    ffff880001002000 ffffffff8100b76b 00000000000003bf ffffffff815553e0
(XEN)    ffffffff81001880 00000002bf780000 ffff8802bf780000 ffffffff813c7fad
(XEN)    ffff8802bf780000 0000000000000000 ffffffff814f823a 00000002bf780000
(XEN)    ffffffff813196e4 0000000000000020 ffff880100000000 ffffffff81445e08
(XEN)    0000000040000000 0000000040000000 ffffffff81445e78 0000000000000001
(XEN)    0000000000000001 ffffffff813c7fad 0000000000000000 00000002bf780000
(XEN)    ffffffff813083d2 0000000000000000 0000000000000000 ffffffff00000000
(XEN)    0000000100000000 ffff880000000000 0000000000000000 0000000100000000
(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.
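
The register dump and the pagetable walk can be cross-checked with a little arithmetic. The sketch below assumes standard x86-64 4-level paging (9 index bits per level above a 12-bit page offset) and 8-byte table entries; `pt_index` and `entry_addr` are hypothetical helper names used only for this illustration.

```c
#include <stdint.h>

/* Index into the page table at the given level for a virtual address.
 * level 1 covers bits 12..20, level 2 bits 21..29, level 3 bits 30..38,
 * level 4 bits 39..47 (assumed x86-64 4-level paging). */
static unsigned pt_index(uint64_t vaddr, int level)
{
    return (unsigned)((vaddr >> (12 + 9 * (level - 1))) & 0x1ff);
}

/* Address of entry idx in an array of 8-byte entries starting at base. */
static uint64_t entry_addr(uint64_t base, uint64_t idx)
{
    return base + idx * 8;
}
```

With cr2 = ffffffff816bd618 the indices come out as L4 0x1ff, L3 0x1fe and L2 0x00b, as in the walk. Also, rax + rbx * 8 (ffffffff816bd000 + 0xc3 * 8) reproduces cr2, which suggests that the faulting access is the final p2m_top[topidx][mididx][idx] load. The address lies inside the Phys-Mach map range that Xen reported at boot. This only checks the arithmetic; it does not explain why the L2 entry is empty.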

-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

* Re: Xen dom0 crash in get_phys_to_machine
  2010-10-12  7:55 Xen dom0 crash in get_phys_to_machine Alan J. Wylie
@ 2010-10-13  0:19 ` Jeremy Fitzhardinge
  2010-10-22 12:47 ` Alan J. Wylie
  2010-10-22 13:05 ` Gianni Tedesco
  2 siblings, 0 replies; 10+ messages in thread
From: Jeremy Fitzhardinge @ 2010-10-13  0:19 UTC (permalink / raw)
  To: Alan J. Wylie; +Cc: xen-devel, Jeremy Fitzhardinge

 On 10/12/2010 12:55 AM, Alan J. Wylie wrote:
> Further to my previous report:
>
> http://lists.xensource.com/archives/html/xen-devel/2010-10/msg00257.html
> Message-ID: <19629.39326.337589.71778@wylie.me.uk>
>
> I've added some debugging and have tracked down the crash to the
> recently modified code in arch/x86/xen/mmu.c

Thanks, this is useful info.  I'll try to get to it tomorrow.

    J


* Re: Xen dom0 crash in get_phys_to_machine
  2010-10-12  7:55 Xen dom0 crash in get_phys_to_machine Alan J. Wylie
  2010-10-13  0:19 ` Jeremy Fitzhardinge
@ 2010-10-22 12:47 ` Alan J. Wylie
  2010-11-12 22:41   ` sven
  2010-10-22 13:05 ` Gianni Tedesco
  2 siblings, 1 reply; 10+ messages in thread
From: Alan J. Wylie @ 2010-10-22 12:47 UTC (permalink / raw)
  Cc: xen-devel@lists.xensource.com, Jeremy Fitzhardinge, sven,
	Andreas Kinzler, Gianni Tedesco


I've pulled the latest tree from Jeremy's git repository and would
just like to report that the changes to mmu.c in 
7510ae89101a20046d03c551bd7db056ada84933 haven't stopped the crashing.

-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

* Re: Xen dom0 crash in get_phys_to_machine
  2010-10-12  7:55 Xen dom0 crash in get_phys_to_machine Alan J. Wylie
  2010-10-13  0:19 ` Jeremy Fitzhardinge
  2010-10-22 12:47 ` Alan J. Wylie
@ 2010-10-22 13:05 ` Gianni Tedesco
  2010-10-22 13:33   ` Alan J. Wylie
  2010-10-22 22:26   ` Jeremy Fitzhardinge
  2 siblings, 2 replies; 10+ messages in thread
From: Gianni Tedesco @ 2010-10-22 13:05 UTC (permalink / raw)
  To: Alan J. Wylie; +Cc: xen-devel@lists.xensource.com, Jeremy Fitzhardinge

On Tue, 2010-10-12 at 08:55 +0100, Alan J. Wylie wrote:
> Further to my previous report:
> 
> http://lists.xensource.com/archives/html/xen-devel/2010-10/msg00257.html
> Message-ID: <19629.39326.337589.71778@wylie.me.uk>
> 
> I've added some debugging and have tracked down the crash to the
> recently modified code in arch/x86/xen/mmu.c
> 
> Since the last version of the code that worked for me, mmu.c has been
> modified with a lot of P2M changes. It now crashes in
> get_phys_to_machine().
> 
> Having tracked down the crash and the offending value of pfn, I then
> further modified the code only to print if ( pfn == 0x18C3 ), and also
> to print intermediate values.
> 
> <7>ALANW get_phys_to_machine pfn 000018C3
> <7> topidx 00000000
> <7> mididx 0000000C
> <7> idx 000000C3
> (XEN) d0:v0: unhandled page fault (ec=0000)
> 
> If there is any more debugging that I can do, I'll be only too happy to
> oblige.

FWIW, I was checking for any call where pfn > max_pfn, and I got:

  p2m_top[0][10][104] max_pfn=0

The p2m seems to have been correctly initialised:

 xen_build_dynamic_phys_to_machine: topidx=0 mididx=375 max_pfn=192512

But then it looks like something is trampling max_pfn and possibly other
important data structures.
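
To relate these two printouts, the indices can be folded back into a pfn. This is a sketch under the assumption of 512 entries per level (4 KiB pages, 8-byte entries); `join_pfn` and `last_mididx` are made-up names for illustration.

```c
/* Assumption: 512 entries per level, not quoted from mmu.c. */
#define ENTRIES_PER_PAGE 512UL

/* Rebuild the pfn addressed by p2m_top[top][mid][idx]. */
static unsigned long join_pfn(unsigned long top, unsigned long mid,
                              unsigned long idx)
{
    return (top * ENTRIES_PER_PAGE + mid) * ENTRIES_PER_PAGE + idx;
}

/* mid-level index of the last pfn when nr_pfns pfns are populated.
 * nr_pfns must be at least 1. */
static unsigned long last_mididx(unsigned long nr_pfns)
{
    return ((nr_pfns - 1) / ENTRIES_PER_PAGE) % ENTRIES_PER_PAGE;
}
```

Under that assumption p2m_top[0][10][104] corresponds to pfn 5224 (0x1468), and a table built for 192512 pfns ends at mididx 375, in line with the initialisation message.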

I can get a working pvops dom0 by reverting to commit
e6b9b2cbca5093e8e38d3e314e2f6415ad951c60 - with the same config.

git-bisect between that commit and head turned up some nonsense about an
ata_piix change which just added a spinlock
(876b3a81850fc237f643a065ea78ce2ad7665767), so I assume that is a bisect
problem and that this commit is unrelated...

Gianni

* Re: Xen dom0 crash in get_phys_to_machine
  2010-10-22 13:05 ` Gianni Tedesco
@ 2010-10-22 13:33   ` Alan J. Wylie
  2010-10-22 13:45     ` Gianni Tedesco
  2010-10-22 22:26   ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 10+ messages in thread
From: Alan J. Wylie @ 2010-10-22 13:33 UTC (permalink / raw)
  To: Gianni Tedesco, Andreas Kinzler, sven, Jeremy Fitzhardinge

at 14:05 on Fri 22-Oct-2010 Gianni Tedesco (gianni.tedesco@citrix.com) wrote:

> FWIW, I was checking for any call where pfn > max_pfn, and I got:
> 
>   p2m_top[0][10][104] max_pfn=0
> 
> The p2m seems to have been correctly initialised:
> 
>  xen_build_dynamic_phys_to_machine: topidx=0 mididx=375 max_pfn=192512
> 
> But then it looks like something is trampling max_pfn and possibly other
> important data structures.

I've just been reading through the Documentation/development-process
and discovered "sparse".

Five minutes ago I ran it on mmu.c and got the following interesting
output:

/usr/src/jeremy-git-xen/arch/x86/xen/mmu.c:385:23: warning: symbol 'max_pfn' shadows an earlier one
/usr/src/jeremy-git-xen/arch/x86/include/asm/page_64_types.h:58:22: originally declared here
/usr/src/jeremy-git-xen/arch/x86/xen/mmu.c:289:47: warning: potentially expensive pointer subtraction
/usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9: warning: incorrect type in argument 1 (different address spaces)
/usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9:    expected void const volatile [noderef] <asn:1>*<noident>
/usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9:    got unsigned long *
/usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9: warning: cast adds address space to expression (<asn:1>)
/usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9: warning: cast adds address space to expression (<asn:1>)
/usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9: warning: cast adds address space to expression (<asn:1>)
/usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9: warning: cast adds address space to expression (<asn:1>)
/usr/src/jeremy-git-xen/include/linux/mm.h:603:16: warning: potentially expensive pointer subtraction
/usr/src/jeremy-git-xen/arch/x86/xen/mmu.c:1269:37: warning: potentially expensive pointer subtraction
/usr/src/jeremy-git-xen/include/linux/mm.h:603:16: warning: potentially expensive pointer subtraction
/usr/src/jeremy-git-xen/include/linux/mm.h:603:16: warning: potentially expensive pointer subtraction
/usr/src/jeremy-git-xen/arch/x86/xen/mmu.c:1410:37: warning: potentially expensive pointer subtraction
/usr/src/jeremy-git-xen/include/linux/mm.h:603:16: warning: potentially expensive pointer subtraction
/usr/src/jeremy-git-xen/arch/x86/xen/mmu.c:1684:17: error: bad constant expression

Is it just a coincidence that the first two lines refer to the same
symbol that you have just mentioned?

I'm going to try renaming the local symbol and see if things still crash.

The trouble is that the box I've been testing on is supposed to be our
backup file server and is currently doing an rsync of 280GB of files
from a 7-year-old Windows box. At least I'll be able to leave it
running undisturbed over the weekend.

-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

* Re: Xen dom0 crash in get_phys_to_machine
  2010-10-22 13:33   ` Alan J. Wylie
@ 2010-10-22 13:45     ` Gianni Tedesco
  2010-10-22 14:20       ` Alan J. Wylie
  0 siblings, 1 reply; 10+ messages in thread
From: Gianni Tedesco @ 2010-10-22 13:45 UTC (permalink / raw)
  To: Alan J. Wylie
  Cc: Jeremy Fitzhardinge, sven, xen-devel@lists.xensource.com,
	Andreas Kinzler

On Fri, 2010-10-22 at 14:33 +0100, Alan J. Wylie wrote:
> at 14:05 on Fri 22-Oct-2010 Gianni Tedesco (gianni.tedesco@citrix.com) wrote:
> 
> > FWIW, I was checking for any call where pfn > max_pfn, and I got:
> > 
> >   p2m_top[0][10][104] max_pfn=0
> > 
> > The p2m seems to have been correctly initialised:
> > 
> >  xen_build_dynamic_phys_to_machine: topidx=0 mididx=375 max_pfn=192512
> > 
> > But then it looks like something is trampling max_pfn and possibly other
> > important data structures.
> 
> I've just been reading through the Documentation/development-process
> and discovered "sparse".
> 
> Five minutes ago I ran it on mmu.c and got the following interesting
> output:
> 
> /usr/src/jeremy-git-xen/arch/x86/xen/mmu.c:385:23: warning: symbol 'max_pfn' shadows an earlier one
> /usr/src/jeremy-git-xen/arch/x86/include/asm/page_64_types.h:58:22: originally declared here
> /usr/src/jeremy-git-xen/arch/x86/xen/mmu.c:289:47: warning: potentially expensive pointer subtraction
> /usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9: warning: incorrect type in argument 1 (different address spaces)
> /usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9:    expected void const volatile [noderef] <asn:1>*<noident>
> /usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9:    got unsigned long *
> /usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9: warning: cast adds address space to expression (<asn:1>)
> /usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9: warning: cast adds address space to expression (<asn:1>)
> /usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9: warning: cast adds address space to expression (<asn:1>)
> /usr/src/jeremy-git-xen/arch/x86/include/asm/xen/page.h:84:9: warning: cast adds address space to expression (<asn:1>)
> /usr/src/jeremy-git-xen/include/linux/mm.h:603:16: warning: potentially expensive pointer subtraction
> /usr/src/jeremy-git-xen/arch/x86/xen/mmu.c:1269:37: warning: potentially expensive pointer subtraction
> /usr/src/jeremy-git-xen/include/linux/mm.h:603:16: warning: potentially expensive pointer subtraction
> /usr/src/jeremy-git-xen/include/linux/mm.h:603:16: warning: potentially expensive pointer subtraction
> /usr/src/jeremy-git-xen/arch/x86/xen/mmu.c:1410:37: warning: potentially expensive pointer subtraction
> /usr/src/jeremy-git-xen/include/linux/mm.h:603:16: warning: potentially expensive pointer subtraction
> /usr/src/jeremy-git-xen/arch/x86/xen/mmu.c:1684:17: error: bad constant expression
> 
> Is it just a coincidence that the first two lines refer to the same
> symbol that you have just mentioned?

Hmm, sort of, I assumed I was printing the global max_pfn but it looks
like the shadowing is deliberate (if a little thoughtless in the
naming). It does reverse my finding that 'max_pfn' (the global one) is
getting corrupted.
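
The effect being described can be shown in a few lines. This is an illustrative sketch, not code from mmu.c: the names `build_table` and `read_global` are hypothetical, and the values are only borrowed from the printouts above.

```c
/* Stand-in for the kernel-wide variable. It starts at 0 here to mimic
 * a global that has not been set yet (an assumption for illustration). */
unsigned long max_pfn = 0;

static unsigned long build_table(unsigned long nr_pages)
{
    /* This local declaration shadows the file-scope max_pfn, which is
     * what sparse warns about. */
    unsigned long max_pfn = nr_pages;

    return max_pfn;     /* refers to the local, not the global */
}

static unsigned long read_global(void)
{
    return max_pfn;     /* refers to the file-scope variable */
}
```

Here `build_table(192512)` sees its own local value while `read_global()` still reports 0, so a zero read elsewhere need not mean that anything was overwritten.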

> I'm going to try renaming the local symbol and see if things still crash.

Sadly, I'm almost certain things will still crash. You may get more play
out of initialising the global max_pfn. But I am not sure how this code
is supposed to work and am busy with other things right now.

> The trouble is that the box I've been testing on is supposed to be our
> backup file server and is currently doing an rsync of 280GB of files
> from a 7-year-old Windows box. At least I'll be able to leave it
> running undisturbed over the weekend.

This happens for you after a full boot then? Mine gets as far as this:

(XEN) Freed 204kB init memory.
xen_build_dynamic_phys_to_machine: topidx=0 mididx=375 max_pfn=192512
mapping kernel into physical memory
Xen: setup ISA identity maps
xen_build_mfn_list_list: topidx=0 mididx=375
about to get started...
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.32.24-g2472a9c-dirty (scara@dt09) (gcc version 4.5.1 20100907 (Red Hat 4.5.1-3) (GCC) ) #51 SMP Thu Oct 21 15:46:56 BST 2010
[    0.000000] Command line: ro root=/dev/sda2 console=hvc0 initcall_debug max_cstate=1 earlyprintk=xen
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000]   Centaur CentaurHauls
[    0.000000] xen_release_chunk: looking at area pfn 9e-a0: 2 pages freed
[    0.000000] released 2 pages of unused memory
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  Xen: 0000000000000000 - 000000000009e000 (usable)
[    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000]  Xen: 0000000000100000 - 000000002f000000 (usable)
[    0.000000]  Xen: 00000000bf699000 - 00000000bf6af000 (reserved)
[    0.000000]  Xen: 00000000bf6af000 - 00000000bf6ce000 (ACPI data)
[    0.000000]  Xen: 00000000bf6ce000 - 00000000c0000000 (reserved)
[    0.000000]  Xen: 00000000e0000000 - 00000000f0000000 (reserved)
[    0.000000]  Xen: 00000000fe000000 - 0000000100000000 (reserved)
[    0.000000]  Xen: 0000000240000000 - 00000002d069b000 (usable)
[    0.000000] bootconsole [xenboot0] enabled
[    0.000000] DMI 2.6 present.
[    0.000000] last_pfn = 0x2d069b max_arch_pfn = 0x400000000
[    0.000000] x86 PAT enabled: cpu 0, old 0x50100070406, new 0x7010600070106
[    0.000000] last_pfn = 0x2f000 max_arch_pfn = 0x400000000
[    0.000000] init_memory_mapping: 0000000000000000-000000002f000000
[    0.000000] init_memory_mapping: 0000000100000000-00000002d069b000
(XEN) d0:v0: unhandled page fault (ec=0000)
(XEN) Pagetable walk from ffffffff817d2030:
(XEN)  L4[0x1ff] = 0000000239003067 0000000000001003
(XEN)  L3[0x1fe] = 0000000239007067 0000000000001007
(XEN)  L2[0x00b] = 0000000000000000 ffffffffffffffff 
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.1-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e033:[<ffffffff81212bbf>]
(XEN) RFLAGS: 0000000000000246   EM: 1   CONTEXT: pv guest
(XEN) rax: ffffffff817d2000   rbx: 0000000000000046   rcx: 00000000ffffffff
(XEN) rdx: 00000000deadbeef   rsi: 00000000deadbeef   rdi: 00000000deadbeef
(XEN) rbp: ffffffff813c7c58   rsp: ffffffff813c7be0   r8:  0000000000000766
(XEN) r9:  00000000ffffffff   r10: 0000000000000006   r11: ffffffff813c7c88
(XEN) r12: ffffffff8148bb87   r13: 0000000000000767   r14: 0000000000000046
(XEN) r15: 00000000ffffffff   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000239001000   cr2: ffffffff817d2030
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff813c7be0:
(XEN)    00000000ffffffff ffffffff813c7c88 0000000000000000 ffffffff81212bbf
(XEN)    000000010000e030 0000000000010046 ffffffff813c7c28 000000000000e02b
(XEN)    ffffffff81212b13 ffffffff813c7c78 ffffffff813e7ee0 ffffffff813ffae0
(XEN)    0000000000000767 0000000000000046 00000000ffffffff ffffffff813c7c88
(XEN)    ffffffff8104c284 00000000000007ad 00000000000007ad 00000000000007ad
(XEN)    0000000000000000 ffffffff813c7ca8 ffffffff8104c2e5 ffffffff813c7ca8
(XEN)    00000000fffff853 ffffffff813c7cd8 ffffffff8104c580 ffffffff8150b55a
(XEN)    000000000000004c ffffffff813c7d08 0000000000000036 ffffffff813c7d78
(XEN)    ffffffff8104cb25 ffffffff813c7d16 0000000000000000 0000000faaaaaaaa
(XEN)    ffffffff813c7d17 302e30202020205b 00205d3030303030 0000000000000000
(XEN)    aaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaa 0000000000007ff0 0000000000000000
(XEN)    00000000deadbeef 0000000000000000 000000000153b3c5 0000000000000000
(XEN)    ffffffff813c7f60 ffffffffffffffff 0000000000000000 ffffffff813c7dd8
(XEN)    ffffffff81322650 0000000000000018 ffffffff813c7de8 ffffffff813c7da8
(XEN)    00003ffffffff000 ffffffff813c7dd8 0000000100000000 00000002d069b000
(XEN)    0000000000100000 0000000000007ff0 aaaaaaaaaaaaaaaa ffffffff813c7eb8
(XEN)    ffffffff81311686 302e30202020205b 0000000000000000 0000000100000000
(XEN)    00000002d069b000 0000000000000000 000000002f000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.

* Re: Xen dom0 crash in get_phys_to_machine
  2010-10-22 13:45     ` Gianni Tedesco
@ 2010-10-22 14:20       ` Alan J. Wylie
  0 siblings, 0 replies; 10+ messages in thread
From: Alan J. Wylie @ 2010-10-22 14:20 UTC (permalink / raw)
  To: Gianni Tedesco, Andreas Kinzler, sven, Jeremy Fitzhardinge

at 14:45 on Fri 22-Oct-2010 Gianni Tedesco (gianni.tedesco@citrix.com) wrote:

> On Fri, 2010-10-22 at 14:33 +0100, Alan J. Wylie wrote:

>>> But then it looks like something is trampling max_pfn and possibly other
>>> important data structures.

>> I've just been reading through the Documentation/development-process
>> and discovered "sparse".

>> Five minutes ago I ran it on mmu.c and got the following interesting
>> output:

>> /usr/src/jeremy-git-xen/arch/x86/xen/mmu.c:385:23: warning: symbol 'max_pfn'
>> shadows an earlier one
>> /usr/src/jeremy-git-xen/arch/x86/include/asm/page_64_types.h:58:22:
>> originally declared here

>> Is it just a coincidence that the first two lines refer to the
>> same symbol that you have just mentioned?

> Hmm, sort of, I assumed I was printing the global max_pfn but it
> looks like the shadowing is deliberate (if a little thoughtless in
> the naming). It does reverse my finding that 'max_pfn' (the global
> one) is getting corrupted.

>> I'm going to try renaming the local symbol and see if things still crash.

> Sadly, I'm almost certain things will still crash.

You are quite right - it still crashes.

>> At least I'll be able to leave it running undisturbed over the
>> weekend.

> This happens for you after a full boot then?

No - I have an old 2.6.32.18 kernel that boots fine. 

-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

* Re: Xen dom0 crash in get_phys_to_machine
  2010-10-22 13:05 ` Gianni Tedesco
  2010-10-22 13:33   ` Alan J. Wylie
@ 2010-10-22 22:26   ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 10+ messages in thread
From: Jeremy Fitzhardinge @ 2010-10-22 22:26 UTC (permalink / raw)
  To: Gianni Tedesco
  Cc: Alan J. Wylie, xen-devel@lists.xensource.com,
	Jeremy Fitzhardinge

 On 10/22/2010 06:05 AM, Gianni Tedesco wrote:
> On Tue, 2010-10-12 at 08:55 +0100, Alan J. Wylie wrote:
>> Further to my previous report:
>>
>> http://lists.xensource.com/archives/html/xen-devel/2010-10/msg00257.html
>> Message-ID: <19629.39326.337589.71778@wylie.me.uk>
>>
>> I've added some debugging and have tracked down the crash to the
>> recently modified code in arch/x86/xen/mmu.c
>>
>> Since the last version of the code that worked for me, mmu.c has been
>> modified with a lot of P2M changes. It now crashes in
>> get_phys_to_machine().
>>
>> Having tracked down the crash and the offending value of pfn, I then
>> further modified the code only to print if ( pfn == 0x18C3 ), and also
>> to print intermediate values.
>>
>> <7>ALANW get_phys_to_machine pfn 000018C3
>> <7> topidx 00000000
>> <7> mididx 0000000C
>> <7> idx 000000C3
>> (XEN) d0:v0: unhandled page fault (ec=0000)
>>
>> If there is any more debugging that I can do, I'll be only too happy to
>> oblige.
> FWIW, I was checking for any call where pfn > max_pfn, and I got:
>
>   p2m_top[0][10][104] max_pfn=0
>
> The p2m seems to have been correctly initialised:
>
>  xen_build_dynamic_phys_to_machine: topidx=0 mididx=375 max_pfn=192512
>
> But then it looks like something is trampling max_pfn and possibly other
> important data structures.
>
> I can get a working pvops dom0 by reverting to commit
> e6b9b2cbca5093e8e38d3e314e2f6415ad951c60 - with the same config.
>
> git-bisect between that commit and head turned up some nonsense about an
> ata_piix change which just added a spinlock
> (876b3a81850fc237f643a065ea78ce2ad7665767), so I assume that is a bisect
> problem and that this commit is unrelated...

Yeah.  If the problem appears as a function of kernel size, then
bisection is going to give you more or less random results, unfortunately.

    J

* Re: Xen dom0 crash in get_phys_to_machine
  2010-10-22 12:47 ` Alan J. Wylie
@ 2010-11-12 22:41   ` sven
  2010-11-13  0:53     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 10+ messages in thread
From: sven @ 2010-11-12 22:41 UTC (permalink / raw)
  To: Alan J. Wylie
  Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com,
	Andreas Kinzler, Gianni Tedesco

Hello,

Is there any news regarding this problem?

This evening I tried next-2.6.32 again, but still no luck.

(I had a slight hope that the recent changes to the lowest-megabyte
memory area might have fixed this too, but they have not.)

Regards,
  Sven

* Re: Xen dom0 crash in get_phys_to_machine
  2010-11-12 22:41   ` sven
@ 2010-11-13  0:53     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 10+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-13  0:53 UTC (permalink / raw)
  To: sven
  Cc: Alan J. Wylie, xen-devel@lists.xensource.com, Gianni Tedesco,
	Andreas Kinzler

On 11/12/2010 02:41 PM, sven wrote:
> Is there any news regarding this problem?
>
> This evening I tried next-2.6.32 again, but still no luck.
>
> (I had a slight hope that the recent changes to the lowest-megabyte
> memory area might have fixed this too, but they have not.)

Sorry, I'd set it to one side while dealing with all the 2.6.37
upstreaming work.  Unfortunately I don't have a machine which reproduces
this problem, so I've been relying on other people's reports, and they
haven't shown any smoking guns yet.

But I have been seeing other odd things occasionally which could be the
same problem in different guises, so I'll see if I can track those down.

Thanks,
    J
