2.6.32.27 dom0 - BUG: unable to handle kernel paging request

* 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
@ 2010-12-30 22:57 Christopher S. Aker
  2010-12-31  1:29 ` Jeremy Fitzhardinge
  2011-01-04  9:16 ` Ian Campbell
  0 siblings, 2 replies; 21+ messages in thread
From: Christopher S. Aker @ 2010-12-30 22:57 UTC (permalink / raw)
  To: xen devel; +Cc: Jeremy Fitzhardinge

Xen: 3.4.4-rc1-pre 64bit (xenbits @ 19986)
Dom0: 2.6.32.27-1 PAE (xen/stable-2.6.32.x @ 
75cc13f5aa29b4f3227d269ca165dfa8937c94fe)

We've been running our xen-thrash testsuite on a bunch of hosts against 
a very recent build, and we've just hit this on one box:

BUG: unable to handle kernel paging request at 15555d60
IP: [<c1022781>] vmalloc_sync_all+0xd1/0x1f0
*pdpt = 000000001d8ee027 *pde = 0000000000000000
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb
Modules linked in: dm_snapshot iTCO_wdt usbhid
Pid: 44, comm: xenwatch Not tainted (2.6.32.27-1 #1) X8DTU
EIP: 0061:[<c1022781>] EFLAGS: 00010007 CPU: 0
EIP is at vmalloc_sync_all+0xd1/0x1f0
EAX: 15555d60 EBX: c1a50c00 ECX: 55555001 EDX: 06855067
ESI: c173ad60 EDI: dd8f85c4 EBP: 00000009 ESP: dfd7fe64
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
Process xenwatch (pid: 44, ti=dfd7e000 task=dfce90f0 task.ti=dfd7e000)
Stack:
  00000018 0001fc55 00000000 c0000d60 f5800000 1fc55067 00000000 1fc55067
<0> 00000000 c170025c dd2abe00 c898f180 dfd7ff20 c8a1171c c10829d0 c1081370
<0> 00000000 00000000 c12bbbd6 c1006bf4 0000000d dfd7fee4 dd2a0200 00000008
Call Trace:
  [<c10829d0>] ? alloc_vm_area+0x40/0x60
  [<c1081370>] ? f+0x0/0x10
  [<c12bbbd6>] ? blkif_map+0x36/0x1c0
  [<c1006bf4>] ? check_events+0x8/0xc
  [<c12b2b0f>] ? xenbus_gather+0x5f/0x90
  [<c12bb36c>] ? frontend_changed+0x25c/0x2d0
  [<c12b36c5>] ? xenbus_otherend_changed+0x95/0xa0
  [<c12b38bf>] ? frontend_changed+0xf/0x20
  [<c12b1f57>] ? xenwatch_thread+0x87/0x130
  [<c1048700>] ? autoremove_wake_function+0x0/0x40
  [<c12b1ed0>] ? xenwatch_thread+0x0/0x130
  [<c10484e4>] ? kthread+0x74/0x80
  [<c1048470>] ? kthread+0x0/0x80
  [<c1009e67>] ? kernel_thread_helper+0x7/0x10
Code: 04 8b 45 00 ff 15 14 2e 65 c1 8b 54 24 0c 25 00 f0 ff ff 8d 34 10 
8b 16 8b 6e 04 f6 c2 01 74 7d 89 c8 25 00 f0 ff ff 03 44 24 0c <8b> 08 
89 4c 24 04 8b 48 04 f6 44 24 04 01 75 67 89 e9 e8 18 32
EIP: [<c1022781>] vmalloc_sync_all+0xd1/0x1f0 SS:ESP 0069:dfd7fe64
CR2: 0000000015555d60
---[ end trace 7a29128cd8a0e564 ]---

And then a whole load of soft lockup traces.  Full output, hypervisor, 
and dom0 kernel are in here:

http://theshore.net/~caker/xen/BUGS/2.6.32.27/

-Chris
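
For reference, the "Oops: 0000" line in the trace above carries the x86 page-fault error code. The following is a minimal sketch of how that code breaks down, using the architectural bit layout; the helper name decode_oops_error_code is hypothetical and not a kernel API. A value of 0000 reads as a kernel-mode read of a not-present page, which is consistent with the "*pde = 0000000000000000" line.

```python
# Decode the x86 page-fault error code printed after "Oops:" (hex).
# Bit meanings follow the x86 architecture definition; the helper name
# decode_oops_error_code is illustrative only, not a kernel interface.

def decode_oops_error_code(code: int) -> dict:
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = fault taken in kernel mode, 1 = user mode
        "mode": "user" if code & 0x4 else "kernel",
        # bit 3: a reserved bit was set in a paging-structure entry
        "reserved_bit_set": bool(code & 0x8),
        # bit 4: the fault was caused by an instruction fetch
        "instruction_fetch": bool(code & 0x10),
    }

oops = decode_oops_error_code(0x0000)  # value taken from "Oops: 0000 [#1] SMP"
print(oops)
```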

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2010-12-30 22:57 2.6.32.27 dom0 - BUG: unable to handle kernel paging request Christopher S. Aker
@ 2010-12-31  1:29 ` Jeremy Fitzhardinge
  2010-12-31 17:19   ` Christopher S. Aker
  2011-01-04  9:16 ` Ian Campbell
  1 sibling, 1 reply; 21+ messages in thread
From: Jeremy Fitzhardinge @ 2010-12-31  1:29 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: xen devel, Ian Jackson, Ian Campbell

On 12/31/2010 09:57 AM, Christopher S. Aker wrote:
> Xen: 3.4.4-rc1-pre 64bit (xenbits @ 19986)
> Dom0: 2.6.32.27-1 PAE (xen/stable-2.6.32.x @
> 75cc13f5aa29b4f3227d269ca165dfa8937c94fe)
>
> We've been running our xen-thrash testsuite on a bunch of hosts
> against a very recent build, and we've just hit this on one box:

Ah, interesting.  This looks like something that Ian Jackson found on
one of his test machines.

What was going on at the time?

    J

>
> BUG: unable to handle kernel paging request at 15555d60
> IP: [<c1022781>] vmalloc_sync_all+0xd1/0x1f0
> *pdpt = 000000001d8ee027 *pde = 0000000000000000
> Oops: 0000 [#1] SMP
> last sysfs file:
> /sys/devices/system/xen_memory/xen_memory0/info/current_kb
> Modules linked in: dm_snapshot iTCO_wdt usbhid
> Pid: 44, comm: xenwatch Not tainted (2.6.32.27-1 #1) X8DTU
> EIP: 0061:[<c1022781>] EFLAGS: 00010007 CPU: 0
> EIP is at vmalloc_sync_all+0xd1/0x1f0
> EAX: 15555d60 EBX: c1a50c00 ECX: 55555001 EDX: 06855067
> ESI: c173ad60 EDI: dd8f85c4 EBP: 00000009 ESP: dfd7fe64
>  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
> Process xenwatch (pid: 44, ti=dfd7e000 task=dfce90f0 task.ti=dfd7e000)
> Stack:
>  00000018 0001fc55 00000000 c0000d60 f5800000 1fc55067 00000000 1fc55067
> <0> 00000000 c170025c dd2abe00 c898f180 dfd7ff20 c8a1171c c10829d0
> c1081370
> <0> 00000000 00000000 c12bbbd6 c1006bf4 0000000d dfd7fee4 dd2a0200
> 00000008
> Call Trace:
>  [<c10829d0>] ? alloc_vm_area+0x40/0x60
>  [<c1081370>] ? f+0x0/0x10
>  [<c12bbbd6>] ? blkif_map+0x36/0x1c0
>  [<c1006bf4>] ? check_events+0x8/0xc
>  [<c12b2b0f>] ? xenbus_gather+0x5f/0x90
>  [<c12bb36c>] ? frontend_changed+0x25c/0x2d0
>  [<c12b36c5>] ? xenbus_otherend_changed+0x95/0xa0
>  [<c12b38bf>] ? frontend_changed+0xf/0x20
>  [<c12b1f57>] ? xenwatch_thread+0x87/0x130
>  [<c1048700>] ? autoremove_wake_function+0x0/0x40
>  [<c12b1ed0>] ? xenwatch_thread+0x0/0x130
>  [<c10484e4>] ? kthread+0x74/0x80
>  [<c1048470>] ? kthread+0x0/0x80
>  [<c1009e67>] ? kernel_thread_helper+0x7/0x10
> Code: 04 8b 45 00 ff 15 14 2e 65 c1 8b 54 24 0c 25 00 f0 ff ff 8d 34
> 10 8b 16 8b 6e 04 f6 c2 01 74 7d 89 c8 25 00 f0 ff ff 03 44 24 0c <8b>
> 08 89 4c 24 04 8b 48 04 f6 44 24 04 01 75 67 89 e9 e8 18 32
> EIP: [<c1022781>] vmalloc_sync_all+0xd1/0x1f0 SS:ESP 0069:dfd7fe64
> CR2: 0000000015555d60
> ---[ end trace 7a29128cd8a0e564 ]---
>
> And then a whole load of soft lockup traces.  Full output, hypervisor,
> and dom0 kernel are in here:
>
> http://theshore.net/~caker/xen/BUGS/2.6.32.27/
>
> -Chris
>

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2010-12-31  1:29 ` Jeremy Fitzhardinge
@ 2010-12-31 17:19   ` Christopher S. Aker
  2011-01-02 20:08     ` Christopher S. Aker
  0 siblings, 1 reply; 21+ messages in thread
From: Christopher S. Aker @ 2010-12-31 17:19 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen devel, Ian Jackson, Ian Campbell

On Dec 30, 2010, at 8:29 PM, Jeremy Fitzhardinge wrote:

> On 12/31/2010 09:57 AM, Christopher S. Aker wrote:
>> Xen: 3.4.4-rc1-pre 64bit (xenbits @ 19986)
>> Dom0: 2.6.32.27-1 PAE (xen/stable-2.6.32.x @
>> 75cc13f5aa29b4f3227d269ca165dfa8937c94fe)
>> 
>> We've been running our xen-thrash testsuite on a bunch of hosts
>> against a very recent build, and we've just hit this on one box:
> 
> Ah, interesting.  This looks like something that Ian Jackson found on
> one of his test machines.
> 
> What was going on at the time?

Our xen-thrash testsuite was running.  It was configured to boot:

* 5 domUs swap thrashing (eatmem.c)
* 5 domUs that busy-loop CPU
* 5 domUs running crashme w/ 2.6.18 kernel
* 5 domUs running crashme w/ pv_ops kernel
* 5 domUs in a boot up -> sleep 60 -> shut down loop
* 5 domUs in a boot up -> sleep 60 -> xm destroy loop

We kicked off this identical run on 10 hosts.  On another host I cranked these numbers up to total about 80 domUs.  They're all still running fine.

The one that hit this BUG is still up, in case any hypervisor output would be helpful.

-Chris

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2010-12-31 17:19   ` Christopher S. Aker
@ 2011-01-02 20:08     ` Christopher S. Aker
  0 siblings, 0 replies; 21+ messages in thread
From: Christopher S. Aker @ 2011-01-02 20:08 UTC (permalink / raw)
  To: xen devel; +Cc: Jeremy Fitzhardinge, Ian Jackson, Ian Campbell

On Dec 31, 2010, at 12:19 PM, Christopher S. Aker wrote:
> We kicked off this identical run on 10 hosts.  On another host I cranked these numbers up to total about 80 domUs.  They're all still running fine.

Scratch that - another box just hit the identical trace:

BUG: unable to handle kernel paging request at 15555d60
IP: [<c1022781>] vmalloc_sync_all+0xd1/0x1f0
*pdpt = 000000000c0ea027 *pde = 0000000000000000
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb
Modules linked in: dm_snapshot iTCO_wdt usbhid
Pid: 44, comm: xenwatch Not tainted (2.6.32.27-1 #1) X8DTU
EIP: 0061:[<c1022781>] EFLAGS: 00010007 CPU: 2
EIP is at vmalloc_sync_all+0xd1/0x1f0
EAX: 15555d60 EBX: c1aa8f00 ECX: 55555001 EDX: 06855067
ESI: c173ad60 EDI: ddbd7944 EBP: 00000009 ESP: dfd7fe64
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
Process xenwatch (pid: 44, ti=dfd7e000 task=dfce90f0 task.ti=dfd7e000)
Stack:
00000018 0001fc55 00000000 c0000d60 f5800000 1fc55067 00000000 1fc55067
<0> 00000000 c170025c cc974c20 d422a180 dfd7ff20 c8d44118 c10829d0 c1081370
<0> 00000000 00000000 c12bbbd6 c1006bf4 00000025 dfd7fee4 cc970220 00000009
Call Trace:
[<c10829d0>] ? alloc_vm_area+0x40/0x60
[<c1081370>] ? f+0x0/0x10
[<c12bbbd6>] ? blkif_map+0x36/0x1c0
[<c1006bf4>] ? check_events+0x8/0xc
[<c12b2b0f>] ? xenbus_gather+0x5f/0x90
[<c12bb36c>] ? frontend_changed+0x25c/0x2d0
[<c12b36c5>] ? xenbus_otherend_changed+0x95/0xa0
[<c12b38bf>] ? frontend_changed+0xf/0x20
[<c12b1f57>] ? xenwatch_thread+0x87/0x130
[<c1048700>] ? autoremove_wake_function+0x0/0x40
[<c12b1ed0>] ? xenwatch_thread+0x0/0x130
[<c10484e4>] ? kthread+0x74/0x80
[<c1048470>] ? kthread+0x0/0x80
[<c1009e67>] ? kernel_thread_helper+0x7/0x10
Code: 04 8b 45 00 ff 15 14 2e 65 c1 8b 54 24 0c 25 00 f0 ff ff 8d 34 10 8b 16 8b 6e 04 f6 c2 01 74 7d 89 c8 25 00 f0 ff ff 03 44 24 0c <8b> 08 89 4c 24 04 8b 48 04 f6 44 24 04 01 75 67 89 e9 e8 18 32
EIP: [<c1022781>] vmalloc_sync_all+0xd1/0x1f0 SS:ESP 0069:dfd7fe64
CR2: 0000000015555d60
---[ end trace 48bbd5284e47e665 ]---

Thoughts or suggestions?  I'd be happy to provide any additional information or perform more tests.

Thanks!
-Chris

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2010-12-30 22:57 2.6.32.27 dom0 - BUG: unable to handle kernel paging request Christopher S. Aker
  2010-12-31  1:29 ` Jeremy Fitzhardinge
@ 2011-01-04  9:16 ` Ian Campbell
  2011-01-04 20:30   ` Christopher S. Aker
  1 sibling, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2011-01-04  9:16 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: Jeremy Fitzhardinge, xen devel

On Thu, 2010-12-30 at 22:57 +0000, Christopher S. Aker wrote:
> Xen: 3.4.4-rc1-pre 64bit (xenbits @ 19986)
> Dom0: 2.6.32.27-1 PAE (xen/stable-2.6.32.x @ 
> 75cc13f5aa29b4f3227d269ca165dfa8937c94fe)

> 
> We've been running our xen-thrash testsuite on a bunch of hosts against 
> a very recent build, and we've just hit this on one box:
> 
> BUG: unable to handle kernel paging request at 15555d60
> IP: [<c1022781>] vmalloc_sync_all+0xd1/0x1f0

This looks similar to the issue which we thought was resolved via
b2464c422fb44275deeb5770b668351860f68e0e.

Can you convert 0xc1022781 to an exact line number? If you have a
vmlinux with symbols then:
	$ gdb vmlinux
	(gdb) list *0xc1022781
should tell you the file and line.

	(gdb) disas 0xc1022781
might tell us something too.

Ian.

> *pdpt = 000000001d8ee027 *pde = 0000000000000000
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb
> Modules linked in: dm_snapshot iTCO_wdt usbhid
> Pid: 44, comm: xenwatch Not tainted (2.6.32.27-1 #1) X8DTU
> EIP: 0061:[<c1022781>] EFLAGS: 00010007 CPU: 0
> EIP is at vmalloc_sync_all+0xd1/0x1f0
> EAX: 15555d60 EBX: c1a50c00 ECX: 55555001 EDX: 06855067
> ESI: c173ad60 EDI: dd8f85c4 EBP: 00000009 ESP: dfd7fe64
>   DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
> Process xenwatch (pid: 44, ti=dfd7e000 task=dfce90f0 task.ti=dfd7e000)
> Stack:
>   00000018 0001fc55 00000000 c0000d60 f5800000 1fc55067 00000000 1fc55067
> <0> 00000000 c170025c dd2abe00 c898f180 dfd7ff20 c8a1171c c10829d0 c1081370
> <0> 00000000 00000000 c12bbbd6 c1006bf4 0000000d dfd7fee4 dd2a0200 00000008
> Call Trace:
>   [<c10829d0>] ? alloc_vm_area+0x40/0x60
>   [<c1081370>] ? f+0x0/0x10
>   [<c12bbbd6>] ? blkif_map+0x36/0x1c0
>   [<c1006bf4>] ? check_events+0x8/0xc
>   [<c12b2b0f>] ? xenbus_gather+0x5f/0x90
>   [<c12bb36c>] ? frontend_changed+0x25c/0x2d0
>   [<c12b36c5>] ? xenbus_otherend_changed+0x95/0xa0
>   [<c12b38bf>] ? frontend_changed+0xf/0x20
>   [<c12b1f57>] ? xenwatch_thread+0x87/0x130
>   [<c1048700>] ? autoremove_wake_function+0x0/0x40
>   [<c12b1ed0>] ? xenwatch_thread+0x0/0x130
>   [<c10484e4>] ? kthread+0x74/0x80
>   [<c1048470>] ? kthread+0x0/0x80
>   [<c1009e67>] ? kernel_thread_helper+0x7/0x10
> Code: 04 8b 45 00 ff 15 14 2e 65 c1 8b 54 24 0c 25 00 f0 ff ff 8d 34 10 
> 8b 16 8b 6e 04 f6 c2 01 74 7d 89 c8 25 00 f0 ff ff 03 44 24 0c <8b> 08 
> 89 4c 24 04 8b 48 04 f6 44 24 04 01 75 67 89 e9 e8 18 32
> EIP: [<c1022781>] vmalloc_sync_all+0xd1/0x1f0 SS:ESP 0069:dfd7fe64
> CR2: 0000000015555d60
> ---[ end trace 7a29128cd8a0e564 ]---
> 
> And then a whole load of soft lockup traces.  Full output, hypervisor, 
> and dom0 kernel are in here:
> 
> http://theshore.net/~caker/xen/BUGS/2.6.32.27/
> 
> -Chris
> 

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-04  9:16 ` Ian Campbell
@ 2011-01-04 20:30   ` Christopher S. Aker
  2011-01-04 20:34     ` Ian Campbell
  0 siblings, 1 reply; 21+ messages in thread
From: Christopher S. Aker @ 2011-01-04 20:30 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Jeremy Fitzhardinge, xen devel

On Jan 4, 2011, at 4:16 AM, Ian Campbell wrote:
> This looks similar to the issue which we thought was resolved via
> b2464c422fb44275deeb5770b668351860f68e0e.

Verified my tree has this changeset...

> Can you convert 0xc1022781 to an exact line number? If you have a
> vmlinux with symbols then:
> 	$ gdb vmlinux
> 	(gdb) list *0xc1022781
> should tell you the file and line.
> 
> 	(gdb) disas 0xc1022781
> might tell us something too.
> 
> Ian.

~# uname -rv
2.6.32.27-1 #1 SMP Wed Dec 29 17:47:30 UTC 2010
~# gdb vmlinux /proc/kcore -s /boot/System.map-2.6.32.27-1 
GNU gdb 6.4-debian
Copyright 2005 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

#0  0x00000000 in ?? ()
(gdb) list *0xc1022781 
No symbol table is loaded.  Use the "file" command.
(gdb) disas 0xc1022781
Dump of assembler code for function vmalloc_sync_all:
0xc10226b0 <vmalloc_sync_all+0>:        push   %ebp
0xc10226b1 <vmalloc_sync_all+1>:        push   %edi
0xc10226b2 <vmalloc_sync_all+2>:        push   %esi
0xc10226b3 <vmalloc_sync_all+3>:        push   %ebx
0xc10226b4 <vmalloc_sync_all+4>:        sub    $0x28,%esp
0xc10226b7 <vmalloc_sync_all+7>:        mov    0xc1652ca4,%eax
0xc10226bc <vmalloc_sync_all+12>:       test   %eax,%eax
0xc10226be <vmalloc_sync_all+14>:       jne    0xc1022898 <vmalloc_sync_all+488>
0xc10226c4 <vmalloc_sync_all+20>:       mov    0xc171e18c,%eax
0xc10226c9 <vmalloc_sync_all+25>:       add    $0x800000,%eax
0xc10226ce <vmalloc_sync_all+30>:       and    $0xffe00000,%eax
0xc10226d3 <vmalloc_sync_all+35>:       cmp    $0xbfffffff,%eax
0xc10226d8 <vmalloc_sync_all+40>:       mov    %eax,0x10(%esp)
0xc10226dc <vmalloc_sync_all+44>:       jbe    0xc1022898 <vmalloc_sync_all+488>
0xc10226e2 <vmalloc_sync_all+50>:       cmp    0xc1652ed8,%eax
0xc10226e8 <vmalloc_sync_all+56>:       jae    0xc1022898 <vmalloc_sync_all+488>
0xc10226ee <vmalloc_sync_all+62>:       data16
0xc10226ef <vmalloc_sync_all+63>:       nop    
0xc10226f0 <vmalloc_sync_all+64>:       mov    $0xc170785c,%eax
0xc10226f5 <vmalloc_sync_all+69>:       call   0xc14ed140 <_spin_lock_irqsave>
0xc10226fa <vmalloc_sync_all+74>:       mov    %eax,0x24(%esp)
0xc10226fe <vmalloc_sync_all+78>:       mov    0xc1652ec4,%eax
0xc1022703 <vmalloc_sync_all+83>:       lea    0xffffffe8(%eax),%ebx
0xc1022706 <vmalloc_sync_all+86>:       mov    0x18(%ebx),%edx
0xc1022709 <vmalloc_sync_all+89>:       prefetchnta (%edx)
0xc102270c <vmalloc_sync_all+92>:       nop    
0xc102270d <vmalloc_sync_all+93>:       cmp    $0xc1652ec4,%eax
0xc1022712 <vmalloc_sync_all+98>:       je     0xc1022868 <vmalloc_sync_all+440>
0xc1022718 <vmalloc_sync_all+104>:      mov    0x10(%esp),%eax
0xc102271c <vmalloc_sync_all+108>:      shr    $0x1e,%eax
0xc102271f <vmalloc_sync_all+111>:      shl    $0x3,%eax
0xc1022722 <vmalloc_sync_all+114>:      mov    %eax,(%esp)
0xc1022725 <vmalloc_sync_all+117>:      mov    0x10(%esp),%eax
0xc1022729 <vmalloc_sync_all+121>:      shr    $0x12,%eax
0xc102272c <vmalloc_sync_all+124>:      and    $0xff8,%eax
0xc1022731 <vmalloc_sync_all+129>:      sub    $0x40000000,%eax
0xc1022736 <vmalloc_sync_all+134>:      mov    %eax,0xc(%esp)
0xc102273a <vmalloc_sync_all+138>:      jmp    0xc10227c8 <vmalloc_sync_all+280>
0xc102273f <vmalloc_sync_all+143>:      nop    
0xc1022740 <vmalloc_sync_all+144>:      mov    (%esp),%edx
0xc1022743 <vmalloc_sync_all+147>:      mov    (%eax,%edx,1),%ecx
0xc1022746 <vmalloc_sync_all+150>:      mov    0x4(%eax,%edx,1),%edx
0xc102274a <vmalloc_sync_all+154>:      mov    %ecx,%eax
0xc102274c <vmalloc_sync_all+156>:      call   *0xc1652e14
0xc1022752 <vmalloc_sync_all+162>:      mov    %eax,%ecx
0xc1022754 <vmalloc_sync_all+164>:      mov    0x4(%ebp),%edx
0xc1022757 <vmalloc_sync_all+167>:      mov    0x0(%ebp),%eax
0xc102275a <vmalloc_sync_all+170>:      call   *0xc1652e14
0xc1022760 <vmalloc_sync_all+176>:      mov    0xc(%esp),%edx
0xc1022764 <vmalloc_sync_all+180>:      and    $0xfffff000,%eax
0xc1022769 <vmalloc_sync_all+185>:      lea    (%eax,%edx,1),%esi
0xc102276c <vmalloc_sync_all+188>:      mov    (%esi),%edx
0xc102276e <vmalloc_sync_all+190>:      mov    0x4(%esi),%ebp
0xc1022771 <vmalloc_sync_all+193>:      test   $0x1,%dl
0xc1022774 <vmalloc_sync_all+196>:      je     0xc10227f3 <vmalloc_sync_all+323>
0xc1022776 <vmalloc_sync_all+198>:      mov    %ecx,%eax
0xc1022778 <vmalloc_sync_all+200>:      and    $0xfffff000,%eax
0xc102277d <vmalloc_sync_all+205>:      add    0xc(%esp),%eax
0xc1022781 <vmalloc_sync_all+209>:      mov    (%eax),%ecx
0xc1022783 <vmalloc_sync_all+211>:      mov    %ecx,0x4(%esp)
0xc1022787 <vmalloc_sync_all+215>:      mov    0x4(%eax),%ecx
0xc102278a <vmalloc_sync_all+218>:      testb  $0x1,0x4(%esp)
0xc102278f <vmalloc_sync_all+223>:      jne    0xc10227f8 <vmalloc_sync_all+328>
0xc1022791 <vmalloc_sync_all+225>:      mov    %ebp,%ecx
0xc1022793 <vmalloc_sync_all+227>:      call   0xc10059b0 <xen_set_pmd>
0xc1022798 <vmalloc_sync_all+232>:      nop    
0xc1022799 <vmalloc_sync_all+233>:      lea    0x0(%esi),%esi
0xc10227a0 <vmalloc_sync_all+240>:      mov    %edi,%eax
0xc10227a2 <vmalloc_sync_all+242>:      call   0xc10074b0 <xen_spin_unlock>
0xc10227a7 <vmalloc_sync_all+247>:      nop    
0xc10227a8 <vmalloc_sync_all+248>:      test   %esi,%esi
0xc10227aa <vmalloc_sync_all+250>:      je     0xc1022868 <vmalloc_sync_all+440>
0xc10227b0 <vmalloc_sync_all+256>:      mov    0x18(%ebx),%eax
0xc10227b3 <vmalloc_sync_all+259>:      lea    0xffffffe8(%eax),%ebx
0xc10227b6 <vmalloc_sync_all+262>:      mov    0x18(%ebx),%edx
0xc10227b9 <vmalloc_sync_all+265>:      prefetchnta (%edx)
0xc10227bc <vmalloc_sync_all+268>:      nop    
0xc10227bd <vmalloc_sync_all+269>:      cmp    $0xc1652ec4,%eax
0xc10227c2 <vmalloc_sync_all+274>:      je     0xc1022868 <vmalloc_sync_all+440>
0xc10227c8 <vmalloc_sync_all+280>:      mov    %ebx,%eax
0xc10227ca <vmalloc_sync_all+282>:      call   0xc1025370 <pgd_page_get_mm>
0xc10227cf <vmalloc_sync_all+287>:      lea    0x44(%eax),%edi
0xc10227d2 <vmalloc_sync_all+290>:      mov    %edi,%eax
0xc10227d4 <vmalloc_sync_all+292>:      call   0xc14ed240 <_spin_lock>
0xc10227d9 <vmalloc_sync_all+297>:      mov    %ebx,%eax
0xc10227db <vmalloc_sync_all+299>:      call   0xc10757a0 <page_address>
0xc10227e0 <vmalloc_sync_all+304>:      mov    (%esp),%ebp
0xc10227e3 <vmalloc_sync_all+307>:      add    0xc1655f64,%ebp
0xc10227e9 <vmalloc_sync_all+313>:      testb  $0x1,0x0(%ebp)
0xc10227ed <vmalloc_sync_all+317>:      jne    0xc1022740 <vmalloc_sync_all+144>
0xc10227f3 <vmalloc_sync_all+323>:      xor    %esi,%esi
0xc10227f5 <vmalloc_sync_all+325>:      jmp    0xc10227a0 <vmalloc_sync_all+240>
0xc10227f7 <vmalloc_sync_all+327>:      nop    
0xc10227f8 <vmalloc_sync_all+328>:      mov    0x4(%esp),%eax
0xc10227fc <vmalloc_sync_all+332>:      mov    %ecx,%edx
0xc10227fe <vmalloc_sync_all+334>:      call   *0xc1652e2c
0xc1022804 <vmalloc_sync_all+340>:      mov    %edx,%ebp
0xc1022806 <vmalloc_sync_all+342>:      mov    %eax,%ecx
0xc1022808 <vmalloc_sync_all+344>:      mov    0x4(%esi),%edx
0xc102280b <vmalloc_sync_all+347>:      mov    (%esi),%eax
0xc102280d <vmalloc_sync_all+349>:      call   *0xc1652e2c
0xc1022813 <vmalloc_sync_all+355>:      mov    %edx,0x4(%esp)
0xc1022817 <vmalloc_sync_all+359>:      mov    %ecx,0x14(%esp)
0xc102281b <vmalloc_sync_all+363>:      mov    0x14(%esp),%edx
0xc102281f <vmalloc_sync_all+367>:      mov    %ebp,0x18(%esp)
0xc1022823 <vmalloc_sync_all+371>:      mov    0x18(%esp),%ecx
0xc1022827 <vmalloc_sync_all+375>:      mov    %eax,0x1c(%esp)
0xc102282b <vmalloc_sync_all+379>:      mov    0x4(%esp),%eax
0xc102282f <vmalloc_sync_all+383>:      shrd   $0xc,%ecx,%edx
0xc1022833 <vmalloc_sync_all+387>:      mov    %edx,%ecx
0xc1022835 <vmalloc_sync_all+389>:      mov    %eax,0x20(%esp)
0xc1022839 <vmalloc_sync_all+393>:      mov    0x1c(%esp),%eax
0xc102283d <vmalloc_sync_all+397>:      shl    $0x5,%ecx
0xc1022840 <vmalloc_sync_all+400>:      mov    0x20(%esp),%edx
0xc1022844 <vmalloc_sync_all+404>:      shrd   $0xc,%edx,%eax
0xc1022848 <vmalloc_sync_all+408>:      mov    %eax,0x4(%esp)
0xc102284c <vmalloc_sync_all+412>:      mov    0x4(%esp),%eax
0xc1022850 <vmalloc_sync_all+416>:      shr    $0xc,%edx
0xc1022853 <vmalloc_sync_all+419>:      mov    %edx,0x8(%esp)
0xc1022857 <vmalloc_sync_all+423>:      shl    $0x5,%eax
0xc102285a <vmalloc_sync_all+426>:      cmp    %eax,%ecx
0xc102285c <vmalloc_sync_all+428>:      je     0xc10227a0 <vmalloc_sync_all+240>
0xc1022862 <vmalloc_sync_all+434>:      ud2a   
0xc1022864 <vmalloc_sync_all+436>:      jmp    0xc1022864 <vmalloc_sync_all+436>
0xc1022866 <vmalloc_sync_all+438>:      data16
0xc1022867 <vmalloc_sync_all+439>:      nop    
0xc1022868 <vmalloc_sync_all+440>:      mov    0x24(%esp),%edx
0xc102286c <vmalloc_sync_all+444>:      mov    $0xc170785c,%eax
0xc1022871 <vmalloc_sync_all+449>:      call   0xc14ed260 <_spin_unlock_irqrestore>
0xc1022876 <vmalloc_sync_all+454>:      addl   $0x200000,0x10(%esp)
0xc102287e <vmalloc_sync_all+462>:      cmpl   $0xbfffffff,0x10(%esp)
0xc1022886 <vmalloc_sync_all+470>:      jbe    0xc1022898 <vmalloc_sync_all+488>
0xc1022888 <vmalloc_sync_all+472>:      mov    0x10(%esp),%edx
0xc102288c <vmalloc_sync_all+476>:      cmp    %edx,0xc1652ed8
0xc1022892 <vmalloc_sync_all+482>:      ja     0xc10226f0 <vmalloc_sync_all+64>
0xc1022898 <vmalloc_sync_all+488>:      add    $0x28,%esp
0xc102289b <vmalloc_sync_all+491>:      pop    %ebx
0xc102289c <vmalloc_sync_all+492>:      pop    %esi
0xc102289d <vmalloc_sync_all+493>:      pop    %edi
0xc102289e <vmalloc_sync_all+494>:      pop    %ebp
0xc102289f <vmalloc_sync_all+495>:      ret    
End of assembler dump.
(gdb) quit

-Chris
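
The faulting address can be cross-checked against the disassembly above. What follows is a rough sketch, not a diagnosis: it replays the three instructions before vmalloc_sync_all+209 using ECX from the first register dump and the fourth stack word, on the assumption that the "Stack:" dump starts at ESP so that word is 0xc(%esp). The variable names are illustrative.

```python
# Replay the instructions leading up to the faulting load at
# vmalloc_sync_all+209, using values copied from the first oops.
# Assumption: the "Stack:" dump begins at ESP, so the fourth word
# (0xc0000d60) is the value at 0xc(%esp).

MASK32 = 0xFFFFFFFF       # registers are 32 bits wide on this PAE kernel
PAGE_MASK = 0xFFFFF000    # mask applied by "and $0xfffff000,%eax"

ecx = 0x55555001          # ECX from the register dump
stack_0xc = 0xC0000D60    # fourth stack word, i.e. 0xc(%esp)

eax = ecx                           # +198: mov %ecx,%eax
eax &= PAGE_MASK                    # +200: and $0xfffff000,%eax
eax = (eax + stack_0xc) & MASK32    # +205: add 0xc(%esp),%eax (wraps at 2^32)

# The symbolic IP "vmalloc_sync_all+0xd1" relative to the function start
# shown in the disassembly (0xc10226b0).
fault_ip = 0xC10226B0 + 0xD1

print(hex(eax))       # address dereferenced by "mov (%eax),%ecx"
print(hex(fault_ip))  # address of the faulting instruction
```

Under those assumptions the computed pointer equals both EAX and CR2 in the oops (15555d60), and the instruction address equals the reported EIP (c1022781). The sum overflows 32 bits, which is why the result lands far below the kernel's address range.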

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-04 20:30   ` Christopher S. Aker
@ 2011-01-04 20:34     ` Ian Campbell
  2011-01-04 21:59       ` Christopher S. Aker
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2011-01-04 20:34 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: Jeremy Fitzhardinge, xen devel

On Tue, 2011-01-04 at 20:30 +0000, Christopher S. Aker wrote:
> 
> #0  0x00000000 in ?? ()
> (gdb) list *0xc1022781 
> No symbol table is loaded.  Use the "file" command. 

I think you need to enable CONFIG_DEBUG_INFO for this to work.

I'll see what I can figure out from the rest tomorrow.

Ian.

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-04 20:34     ` Ian Campbell
@ 2011-01-04 21:59       ` Christopher S. Aker
  2011-01-09 18:07         ` Christopher S. Aker
  0 siblings, 1 reply; 21+ messages in thread
From: Christopher S. Aker @ 2011-01-04 21:59 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Jeremy Fitzhardinge, xen devel

On Jan 4, 2011, at 3:34 PM, Ian Campbell wrote:
> On Tue, 2011-01-04 at 20:30 +0000, Christopher S. Aker wrote:
>> 
>> No symbol table is loaded.  Use the "file" command. 
> 
> I think you need to enable CONFIG_DEBUG_INFO for this to work.

I rebuilt with CONFIG_DEBUG_INFO, and surprisingly it appears valid at the same address:

# gdb vmlinux
GNU gdb (GDB) 7.0-ubuntu
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /build/xen/dom0/pv_ops/2.6.32.27-1-debug/vmlinux...done.
(gdb) list *0xc1022781
0xc1022781 is in vmalloc_sync_all (/build/xen/dom0/pv_ops/2.6.32.27-1-debug/arch/x86/include/asm/pgtable.h:434).
429     #define pud_page(pud)           pfn_to_page(pud_val(pud) >> PAGE_SHIFT)
430
431     /* Find an entry in the second-level page table.. */
432     static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
433     {
434             return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
435     }
436
437     static inline int pud_large(pud_t pud)
438     {
(gdb) quit

-Chris
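
To relate the pmd_offset() line above to the values in the oops, here is a small sketch of the index arithmetic for a 32-bit PAE kernel (4 top-level entries of 1 GiB each, 512 PMD entries of 2 MiB each, 8 bytes per entry). It assumes the fifth stack word in the oops, f5800000, is the vmalloc address being synced; that is an inference from the disassembly (the address is kept at 0x10(%esp)), not something stated in the thread.

```python
# Index arithmetic behind pmd_offset() on a 32-bit PAE kernel.
# Assumption: 0xf5800000 (fifth stack word in the oops) is the address
# that vmalloc_sync_all was processing when it faulted.

PGDIR_SHIFT = 30      # each of the 4 top-level entries covers 1 GiB
PTRS_PER_PGD = 4
PMD_SHIFT = 21        # each PMD entry covers 2 MiB
PTRS_PER_PMD = 512
ENTRY_SIZE = 8        # PAE page-table entries are 64 bits wide


def pgd_index(address: int) -> int:
    return (address >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)


def pmd_index(address: int) -> int:
    return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1)


address = 0xF5800000

pgd_byte_offset = pgd_index(address) * ENTRY_SIZE
pmd_byte_offset = pmd_index(address) * ENTRY_SIZE

print(pgd_index(address), hex(pgd_byte_offset))
print(pmd_index(address), hex(pmd_byte_offset))
```

With that address, the byte offsets come out as 0x18 and 0xd60, matching the first stack word (00000018) and the low bits of the fourth (c0000d60, which also includes the 0xc0000000 offset added by the compiled code). That suggests the index part of the expression is sound and the bad component is the page address returned for *pud.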

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-04 21:59       ` Christopher S. Aker
@ 2011-01-09 18:07         ` Christopher S. Aker
  2011-01-10 18:56           ` Konrad Rzeszutek Wilk
  2011-01-13 14:43           ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 21+ messages in thread
From: Christopher S. Aker @ 2011-01-09 18:07 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: Ian Campbell, Jeremy Fitzhardinge, xen devel

On Jan 4, 2011, at 4:59 PM, Christopher S. Aker wrote:
> 
> I rebuilt with CONFIG_DEBUG_INFO, and surprisingly it appears valid at the same address:
> 
> # gdb vmlinux
> (gdb) list *0xc1022781
> 0xc1022781 is in vmalloc_sync_all (/build/xen/dom0/pv_ops/2.6.32.27-1-debug/arch/x86/include/asm/pgtable.h:434).
> 429     #define pud_page(pud)           pfn_to_page(pud_val(pud) >> PAGE_SHIFT)
> 430
> 431     /* Find an entry in the second-level page table.. */
> 432     static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> 433     {
> 434             return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
> 435     }
> 436
> 437     static inline int pud_large(pud_t pud)
> 438     {

We hit the BUG again on a third test box -- at least it's fairly easy to reproduce.  Has anyone had a chance to poke at this, or have a suggestion for something for me to try/test?

Thanks,
-Chris

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-09 18:07         ` Christopher S. Aker
@ 2011-01-10 18:56           ` Konrad Rzeszutek Wilk
  2011-01-10 21:49             ` John Weekes
  2011-01-13 14:43           ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 21+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-10 18:56 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: Ian Campbell, Jeremy Fitzhardinge, xen devel

On Sun, Jan 09, 2011 at 01:07:26PM -0500, Christopher S. Aker wrote:
> On Jan 4, 2011, at 4:59 PM, Christopher S. Aker wrote:
> > 
> > I rebuilt with CONFIG_DEBUG_INFO, and surprisingly it appears valid at the same address:
> > 
> > # gdb vmlinux
> > (gdb) list *0xc1022781
> > 0xc1022781 is in vmalloc_sync_all (/build/xen/dom0/pv_ops/2.6.32.27-1-debug/arch/x86/include/asm/pgtable.h:434).
> > 429     #define pud_page(pud)           pfn_to_page(pud_val(pud) >> PAGE_SHIFT)
> > 430
> > 431     /* Find an entry in the second-level page table.. */
> > 432     static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> > 433     {
> > 434             return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
> > 435     }
> > 436
> > 437     static inline int pud_large(pud_t pud)
> > 438     {
> 
> We hit the BUG again on a third test box -- at least it's fairly easy to reproduce.  Has anyone had a chance to poke at this, or have a suggestion for something for me to try/test?

Which test makes it easy to reproduce? Oh wait, you have a whole bunch of guests
pounding. Is it possible to narrow down which type of test is causing this? Or can
you put up the domU guests along with the xm config files to try to reproduce this?

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-10 18:56           ` Konrad Rzeszutek Wilk
@ 2011-01-10 21:49             ` John Weekes
  2011-01-15 15:57               ` Christopher S. Aker
  0 siblings, 1 reply; 21+ messages in thread
From: John Weekes @ 2011-01-10 21:49 UTC (permalink / raw)
  To: xen-devel, Christopher S. Aker

> We hit the BUG again on a third test box -- at least it's fairly easy to reproduce.  Has anyone had a chance to poke at this, or have a suggestion for something for me to try/test?

Have you tried raising /proc/sys/vm/min_free_kbytes ? When I was seeing 
a similar "unable to handle kernel paging request" error on some 
machines, I bumped mine up to 32768, and it seemed to eliminate the problem.

-John
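
A minimal sketch of applying the suggestion above from a script, assuming it runs as root in dom0. The helper name raise_min_free_kbytes is hypothetical; only the procfs path comes from the message. It raises the setting when the current value is lower and never lowers it.

```python
# Raise vm.min_free_kbytes through procfs if it is below a target.
# The function name is illustrative; writing the real file needs root.
from pathlib import Path

MIN_FREE_KBYTES = Path("/proc/sys/vm/min_free_kbytes")


def raise_min_free_kbytes(target_kb: int, path: Path = MIN_FREE_KBYTES) -> int:
    """Set the value in `path` to target_kb if it is lower; return the result."""
    current = int(path.read_text().split()[0])
    if current >= target_kb:
        return current               # already at or above the target
    path.write_text(f"{target_kb}\n")
    return target_kb


# Example call on a real dom0 (as root), using the value from the message:
#     raise_min_free_kbytes(32768)
```

The path is a parameter so the helper can be tried against an ordinary file first. The change does not persist across reboots unless it is also added to the system's sysctl configuration.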

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-09 18:07         ` Christopher S. Aker
  2011-01-10 18:56           ` Konrad Rzeszutek Wilk
@ 2011-01-13 14:43           ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 21+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-13 14:43 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: Ian Campbell, Jeremy Fitzhardinge, xen devel

On Sun, Jan 09, 2011 at 01:07:26PM -0500, Christopher S. Aker wrote:
> On Jan 4, 2011, at 4:59 PM, Christopher S. Aker wrote:
> > 
> > I rebuilt with CONFIG_DEBUG_INFO, and surprisingly it appears valid at the same address:
> > 
> > # gdb vmlinux
> > (gdb) list *0xc1022781
> > 0xc1022781 is in vmalloc_sync_all (/build/xen/dom0/pv_ops/2.6.32.27-1-debug/arch/x86/include/asm/pgtable.h:434).
> > 429     #define pud_page(pud)           pfn_to_page(pud_val(pud) >> PAGE_SHIFT)
> > 430
> > 431     /* Find an entry in the second-level page table.. */
> > 432     static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> > 433     {
> > 434             return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
> > 435     }
> > 436
> > 437     static inline int pud_large(pud_t pud)
> > 438     {
> 
> We hit the BUG again on a third test box -- at least it's fairly easy to reproduce.  Has anyone had a chance to poke at this, or have a suggestion for something for me to try/test?

I don't have as much memory as you, so I've only been running a smaller subset of those guests.
Nothing so far. How long did it take you to hit it?

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-10 21:49             ` John Weekes
@ 2011-01-15 15:57               ` Christopher S. Aker
  2011-01-31 21:07                 ` Christopher S. Aker
  0 siblings, 1 reply; 21+ messages in thread
From: Christopher S. Aker @ 2011-01-15 15:57 UTC (permalink / raw)
  To: John Weekes; +Cc: Ian Campbell, Jeremy Fitzhardinge, xen devel

On Jan 10, 2011, at 4:49 PM, John Weekes wrote:
> Have you tried raising /proc/sys/vm/min_free_kbytes ? When I was seeing a similar "unable to handle kernel paging request" error on some machines, I bumped mine up to 32768, and it seemed to eliminate the problem.

Thanks for your suggestion, John.  I reset all 16 test machines and set min_free_kbytes to 32M (32768).  Unfortunately one machine hit the identical BUG within 48 hours.  So, this hasn't fixed it.

I'm willing to help in any way I can.

-Chris

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-15 15:57               ` Christopher S. Aker
@ 2011-01-31 21:07                 ` Christopher S. Aker
  2011-01-31 21:17                   ` Konrad Rzeszutek Wilk
                                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Christopher S. Aker @ 2011-01-31 21:07 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Jeremy Fitzhardinge, xen devel

> Xen: 3.4.4-rc1-pre 64bit (xenbits @ 19986)
> Dom0: 2.6.32.27-1 PAE (xen/stable-2.6.32.x)
>
> We've been running our xen-thrash testsuite on a bunch of hosts
> against a very recent build, and we've just hit this on one box:
>
> BUG: unable to handle kernel paging request at 15555d60

Two additional boxes from my last test round have also hit this,
at a rate of about one per week.

Ian / Jeremy:  Where do I go from here?

-Chris

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-31 21:07                 ` Christopher S. Aker
@ 2011-01-31 21:17                   ` Konrad Rzeszutek Wilk
  2011-01-31 22:19                     ` Jeremy Fitzhardinge
  2011-01-31 22:22                   ` Jeremy Fitzhardinge
  2011-01-31 22:25                   ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 21+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-31 21:17 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: Ian Campbell, Jeremy Fitzhardinge, xen devel

On Mon, Jan 31, 2011 at 04:07:18PM -0500, Christopher S. Aker wrote:
> > Xen: 3.4.4-rc1-pre 64bit (xenbits @ 19986)
> > Dom0: 2.6.32.27-1 PAE (xen/stable-2.6.32.x)
> >
> > We've been running our xen-thrash testsuite on a bunch of hosts
> > against a very recent build, and we've just hit this on one box:
> >
> > BUG: unable to handle kernel paging request at 15555d60

Oh, I hit that if I do cat /sys/kernel/debug/kernel_page_tables.

On 64-bit:
sh-4.1# cd /sys/kernel/debug
sh-4.1# ls
acpi  boot_params  kernel_page_tables  mce      usb             x86
bdi   dri          kprobes             tracing  wakeup_sources  xen
sh-4.1# cat kernel_page_tables
[  108.263615] BUG: unable to handle kernel paging request at ffff9d5555555000
[  108.270480] IP: [<ffffffff81036bf0>] ptdump_show+0xc6/0x2f6
[  108.276122] PGD 0 
[  108.278205] Oops: 0000 [#1] SMP 
[  108.281504] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:06:03.0/class
[  108.289316] CPU 3 
[  108.291137] Modules linked in: xen_evtchn video sg sd_mod radeon ahci libahci libata fbcon scsi_mod tileblit e1000e font bitblit ttm softcursor drm_kms_helper xen_blkfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea xenfs [last unloaded: dump_dma]
[  108.314658] 
[  108.316221] Pid: 3025, comm: cat Not tainted 2.6.38-rc2-00038-g7c92066 #1 DX58SO/        
[  108.324466] RIP: e030:[<ffffffff81036bf0>]  [<ffffffff81036bf0>] ptdump_show+0xc6/0x2f6
[  108.332538] RSP: e02b:ffff8800868f5dd8  EFLAGS: 00010286
[  108.337919] RAX: ffff800000000000 RBX: ffff880085117700 RCX: 0000000000000000
[  108.345126] RDX: ffff9d5555555ff8 RSI: 0000000000000000 RDI: 0000000152460067
[  108.345128] RBP: ffff8800868f5e78 R08: 0000000000000006 R09: 0000000000000000
[  108.345130] R10: 00007fffec97cc30 R11: 0000000000000246 R12: ffff9d5555555000
[  108.345132] R13: ffff880085117700 R14: ffffffff81803800 R15: ffff880000000000
[  108.345137] FS:  00007f6c5a3d7700(0000) GS:ffff88009ce83000(0000) knlGS:0000000000000000
[  108.345139] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[  108.345140] CR2: ffff9d5555555000 CR3: 000000008b2c6000 CR4: 0000000000002660
[  108.345142] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  108.345144] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  108.345146] Process cat (pid: 3025, threadinfo ffff8800868f4000, task ffff88009dbe8000)
[  108.345148] Stack:
[  108.345149]  ffffffff81006ca2 0000000000000246 00007fffec97cc30 ffffffff8110efe7
[  108.345152]  ffff9d5555555ff8 0000800000000000 ffff88009f0029c0 0000800000000000
[  108.345155]  0000000000000001 0000000000000000 0000000000000000 ffff800000000000
[  108.345158] Call Trace:
[  108.345163]  [<ffffffff81006ca2>] ? check_events+0x12/0x20
[  108.345168]  [<ffffffff8110efe7>] ? seq_read+0xbf/0x34a
[  108.345170]  [<ffffffff8110efe7>] ? seq_read+0xbf/0x34a
[  108.345173]  [<ffffffff8110f0a1>] seq_read+0x179/0x34a
[  108.345176]  [<ffffffff810f5c32>] vfs_read+0xa6/0x102
[  108.345178]  [<ffffffff810f5d47>] sys_read+0x45/0x6c
[  108.345181]  [<ffffffff8100a992>] system_call_fastpath+0x16/0x1b
[  108.345182] Code: 0f 00 00 00 88 ff ff 48 8d 14 10 4e 8d 24 38 48 8b 45 98 48 89 55 80 48 89 45 88 48 8b 45 88 48 c1 e0 10 48 c1 f8 10 48 89 45 b8 <49> 8b 3c 24 48 85 ff 0f 84 96 01 00 00 ff 14 25 b0 18 81 81 48 
[  108.345208] RIP  [<ffffffff81036bf0>] ptdump_show+0xc6/0x2f6
[  108.345211]  RSP <ffff8800868f5dd8>
[  108.345212] CR2: ffff9d5555555000
[  108.345214] ---[ end trace 9134d308b82bc832 ]---


The 32-bit kernel hits this too, with the 15555d60 address.

Does xen-thrash read this file too? Do you know which file it touches when
this happens?
> 
> Two additional boxes out of my last test round have also hit this.
> About one a week.
> 
> Ian / Jeremy:  Where do I go from here?
> 
> -Chris

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-31 21:17                   ` Konrad Rzeszutek Wilk
@ 2011-01-31 22:19                     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 21+ messages in thread
From: Jeremy Fitzhardinge @ 2011-01-31 22:19 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Ian Campbell, xen devel

On 01/31/2011 01:17 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Jan 31, 2011 at 04:07:18PM -0500, Christopher S. Aker wrote:
>>> Xen: 3.4.4-rc1-pre 64bit (xenbits @ 19986)
>>> Dom0: 2.6.32.27-1 PAE (xen/stable-2.6.32.x)
>>>
>>> We've been running our xen-thrash testsuite on a bunch of hosts
>>> against a very recent build, and we've just hit this on one box:
>>>
>>> BUG: unable to handle kernel paging request at 15555d60
> Oh, I hit that if I do cat /sys/kernel/debug/kernel_page_tables.
>
> On 64-bit:
> sh-4.1# cd /sys/kernel/debug
> sh-4.1# ls
> acpi  boot_params  kernel_page_tables  mce      usb             x86
> bdi   dri          kprobes             tracing  wakeup_sources  xen
> sh-4.1# cat kernel_page_tables
> [  108.263615] BUG: unable to handle kernel paging request at ffff9d5555555000
> [  108.270480] IP: [<ffffffff81036bf0>] ptdump_show+0xc6/0x2f6
> [  108.276122] PGD 0 
> [  108.278205] Oops: 0000 [#1] SMP 
> [  108.281504] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:06:03.0/class
> [  108.289316] CPU 3 
> [  108.291137] Modules linked in: xen_evtchn video sg sd_mod radeon ahci libahci libata fbcon scsi_mod tileblit e1000e font bitblit ttm softcursor drm_kms_helper xen_blkfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea xenfs [last unloaded: dump_dma]
> [  108.314658] 
> [  108.316221] Pid: 3025, comm: cat Not tainted 2.6.38-rc2-00038-g7c92066 #1 DX58SO/        
> [  108.324466] RIP: e030:[<ffffffff81036bf0>]  [<ffffffff81036bf0>] ptdump_show+0xc6/0x2f6
> [  108.332538] RSP: e02b:ffff8800868f5dd8  EFLAGS: 00010286
> [  108.337919] RAX: ffff800000000000 RBX: ffff880085117700 RCX: 0000000000000000
> [  108.345126] RDX: ffff9d5555555ff8 RSI: 0000000000000000 RDI: 0000000152460067
> [  108.345128] RBP: ffff8800868f5e78 R08: 0000000000000006 R09: 0000000000000000
> [  108.345130] R10: 00007fffec97cc30 R11: 0000000000000246 R12: ffff9d5555555000
> [  108.345132] R13: ffff880085117700 R14: ffffffff81803800 R15: ffff880000000000
> [  108.345137] FS:  00007f6c5a3d7700(0000) GS:ffff88009ce83000(0000) knlGS:0000000000000000
> [  108.345139] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  108.345140] CR2: ffff9d5555555000 CR3: 000000008b2c6000 CR4: 0000000000002660
> [  108.345142] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  108.345144] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  108.345146] Process cat (pid: 3025, threadinfo ffff8800868f4000, task ffff88009dbe8000)
> [  108.345148] Stack:
> [  108.345149]  ffffffff81006ca2 0000000000000246 00007fffec97cc30 ffffffff8110efe7
> [  108.345152]  ffff9d5555555ff8 0000800000000000 ffff88009f0029c0 0000800000000000
> [  108.345155]  0000000000000001 0000000000000000 0000000000000000 ffff800000000000
> [  108.345158] Call Trace:
> [  108.345163]  [<ffffffff81006ca2>] ? check_events+0x12/0x20
> [  108.345168]  [<ffffffff8110efe7>] ? seq_read+0xbf/0x34a
> [  108.345170]  [<ffffffff8110efe7>] ? seq_read+0xbf/0x34a
> [  108.345173]  [<ffffffff8110f0a1>] seq_read+0x179/0x34a
> [  108.345176]  [<ffffffff810f5c32>] vfs_read+0xa6/0x102
> [  108.345178]  [<ffffffff810f5d47>] sys_read+0x45/0x6c
> [  108.345181]  [<ffffffff8100a992>] system_call_fastpath+0x16/0x1b
> [  108.345182] Code: 0f 00 00 00 88 ff ff 48 8d 14 10 4e 8d 24 38 48 8b 45 98 48 89 55 80 48 89 45 88 48 8b 45 88 48 c1 e0 10 48 c1 f8 10 48 89 45 b8 <49> 8b 3c 24 48 85 ff 0f 84 96 01 00 00 ff 14 25 b0 18 81 81 48 
> [  108.345208] RIP  [<ffffffff81036bf0>] ptdump_show+0xc6/0x2f6
> [  108.345211]  RSP <ffff8800868f5dd8>
> [  108.345212] CR2: ffff9d5555555000
> [  108.345214] ---[ end trace 9134d308b82bc832 ]---
>
>
> The 32-bit hits this too, with the 15555d60 address.
>
> Does the xenthrash hit this file too? Do you know what file it touches when
> this happens?

I think you're seeing the same address because many bogus m2p lookups
return 0x55555555.  I don't think there's any more similarity between
what you're seeing and Christopher's report than that.

    J
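
As a hedged illustration of the 0x55555555 point above, the Python sketch below reproduces both fault addresses with plain arithmetic. Everything in it is an assumption rather than something stated in the thread: 4 KiB pages, the default 3G/1G split (PAGE_OFFSET 0xC0000000) and 8-byte PAE pmd entries on the 32-bit side; a direct map at 0xffff880000000000 and a 46-bit physical mask on the 64-bit side. The address 0xF5800000 is taken from the stack dump in the original oops, on the guess that it is the address being synced. The function names are made up for this example.

```python
# All constants below are assumptions, see the paragraph above.
PAGE_SHIFT = 12
PMD_SHIFT = 21              # each PAE pmd entry maps 2 MiB
PTRS_PER_PMD = 512
PMD_ENTRY_SIZE = 8          # bytes per PAE pmd entry
PAGE_OFFSET_32 = 0xC0000000
MASK_32 = 0xFFFFFFFF
PAGE_OFFSET_64 = 0xFFFF880000000000
PHYS_MASK_64 = (1 << 46) - 1


def pmd_entry_vaddr_pae(bogus_pfn, address):
    """Mimic pmd_offset() on 32-bit PAE when the pud yields bogus_pfn."""
    phys = (bogus_pfn << PAGE_SHIFT) & MASK_32            # truncated to 32 bits
    table = (phys + PAGE_OFFSET_32) & MASK_32             # __va() with wraparound
    index = (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1)   # pmd_index()
    return (table + index * PMD_ENTRY_SIZE) & MASK_32


def direct_map_vaddr_64(bogus_pfn):
    """Mimic __va() of a table address built from bogus_pfn on 64-bit."""
    phys = (bogus_pfn << PAGE_SHIFT) & PHYS_MASK_64
    return PAGE_OFFSET_64 + phys


print(hex(pmd_entry_vaddr_pae(0x55555555, 0xF5800000)))   # 0x15555d60
print(hex(direct_map_vaddr_64(0x5555555555555555)))       # 0xffff9d5555555000
```

Under those assumptions a 0x55-filled frame number is enough to produce both 15555d60 and ffff9d5555555000, which fits the reading that the shared address pattern says nothing about whether the two oopses have the same cause.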

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-31 21:07                 ` Christopher S. Aker
  2011-01-31 21:17                   ` Konrad Rzeszutek Wilk
@ 2011-01-31 22:22                   ` Jeremy Fitzhardinge
  2011-01-31 22:25                   ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 21+ messages in thread
From: Jeremy Fitzhardinge @ 2011-01-31 22:22 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: Ian Campbell, xen devel

On 01/31/2011 01:07 PM, Christopher S. Aker wrote:
>> Xen: 3.4.4-rc1-pre 64bit (xenbits @ 19986)
>> Dom0: 2.6.32.27-1 PAE (xen/stable-2.6.32.x)
>>
>> We've been running our xen-thrash testsuite on a bunch of hosts
>> against a very recent build, and we've just hit this on one box:
>>
>> BUG: unable to handle kernel paging request at 15555d60
>
> Two additional boxes out of my last test round have also hit this.
> About one a week.
>
> Ian / Jeremy:  Where do I go from here?

There seems to be a moderately difficult-to-hit (but still pretty large)
race in pagetable teardown.  It *should* be protected by the pgd lock,
so we need to work out where a teardown (or access) is happening without
that lock.  I think that's going to be a matter of close code-review
rather than any more testing.

The interesting thing is that this problem seems to have come to the
fore since the patch that was explicitly intended to avoid it was
put in :/...  Before that, the race was theoretical, but AFAIK had never
been observed in a pvops kernel (though it was seen in the Citrix
product in non-pvops kernels, which is why we fixed it).

I'll try to stare at it in the next couple of days.

    J

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-31 21:07                 ` Christopher S. Aker
  2011-01-31 21:17                   ` Konrad Rzeszutek Wilk
  2011-01-31 22:22                   ` Jeremy Fitzhardinge
@ 2011-01-31 22:25                   ` Jeremy Fitzhardinge
  2011-02-14 23:52                     ` Christopher S. Aker
  2 siblings, 1 reply; 21+ messages in thread
From: Jeremy Fitzhardinge @ 2011-01-31 22:25 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: Ian Campbell, xen devel

On 01/31/2011 01:07 PM, Christopher S. Aker wrote:
>> Xen: 3.4.4-rc1-pre 64bit (xenbits @ 19986)
>> Dom0: 2.6.32.27-1 PAE (xen/stable-2.6.32.x)
>>
>> We've been running our xen-thrash testsuite on a bunch of hosts
>> against a very recent build, and we've just hit this on one box:
>>
>> BUG: unable to handle kernel paging request at 15555d60
>
> Two additional boxes out of my last test round have also hit this.
> About one a week.
>
> Ian / Jeremy:  Where do I go from here?

It's also not impossible this bug is related to the "get_user_pages" bug
that has been discussed over the last few days.  I need to think about
that too.

    J

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-01-31 22:25                   ` Jeremy Fitzhardinge
@ 2011-02-14 23:52                     ` Christopher S. Aker
  2011-02-15  0:19                       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 21+ messages in thread
From: Christopher S. Aker @ 2011-02-14 23:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Ian Campbell, xen devel

On 1/31/11 5:25 PM, Jeremy Fitzhardinge wrote:
> On 01/31/2011 01:07 PM, Christopher S. Aker wrote:
>> Ian / Jeremy:  Where do I go from here?
>
> It's also not impossible this bug is related to the "get_user_pages" bug
> that has been discussed over the last few days.  I need to think about
> that too.

How's that going?  Any epiphanies?  I've been trying to follow the list 
(and changesets) but wasn't sure if a potential fix snuck in or if 
there's something else I should be pressure cooking (an unrelated patch 
that may have the side effect of fixing my issue, for example).

Thanks,
-Chris

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-02-14 23:52                     ` Christopher S. Aker
@ 2011-02-15  0:19                       ` Jeremy Fitzhardinge
  2011-02-15  1:15                         ` Christopher S. Aker
  0 siblings, 1 reply; 21+ messages in thread
From: Jeremy Fitzhardinge @ 2011-02-15  0:19 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: Ian Campbell, xen devel

On 02/14/2011 03:52 PM, Christopher S. Aker wrote:
> On 1/31/11 5:25 PM, Jeremy Fitzhardinge wrote:
>> On 01/31/2011 01:07 PM, Christopher S. Aker wrote:
>>> Ian / Jeremy:  Where do I go from here?
>>
>> It's also not impossible this bug is related to the "get_user_pages" bug
>> that has been discussed over the last few days.  I need to think about
>> that too.
>
> How's that going?  Any epiphanies?  I've been trying to follow the
> list (and changesets) but wasn't sure if a potential fix snuck in or
> if there's something else I should be pressure cooking (an unrelated
> patch that may have the side effect of fixing my issue, for example).

No, I had a close look at it the other day and remained stumped.  It
looks like the pgd is being freed while still in use, but I couldn't see
where that could happen unprotected from the lock.  This is really
bugging me - there's something strange going on, which worries me.

    J

* Re: 2.6.32.27 dom0 - BUG: unable to handle kernel paging request
  2011-02-15  0:19                       ` Jeremy Fitzhardinge
@ 2011-02-15  1:15                         ` Christopher S. Aker
  0 siblings, 0 replies; 21+ messages in thread
From: Christopher S. Aker @ 2011-02-15  1:15 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Ian Campbell, xen devel

On Feb 14, 2011, at 7:19 PM, Jeremy Fitzhardinge wrote:
> No, I had a close look at it the other day and remained stumped.  It
> looks like the pgd is being freed while still in use, but I couldn't see
> where that could happen unprotected from the lock.  This is really
> bugging me - there's something strange going on, which worries me.

Hmmm.  Well I doubt I can make any useful suggestions towards a solution at my current kernel hacking skill level, but perhaps some verbose debugging output sprinkled throughout may help?  I'd be happy to reset my test suite for another round.

-Chris

Thread overview: 21+ messages
2010-12-30 22:57 2.6.32.27 dom0 - BUG: unable to handle kernel paging request Christopher S. Aker
2010-12-31  1:29 ` Jeremy Fitzhardinge
2010-12-31 17:19   ` Christopher S. Aker
2011-01-02 20:08     ` Christopher S. Aker
2011-01-04  9:16 ` Ian Campbell
2011-01-04 20:30   ` Christopher S. Aker
2011-01-04 20:34     ` Ian Campbell
2011-01-04 21:59       ` Christopher S. Aker
2011-01-09 18:07         ` Christopher S. Aker
2011-01-10 18:56           ` Konrad Rzeszutek Wilk
2011-01-10 21:49             ` John Weekes
2011-01-15 15:57               ` Christopher S. Aker
2011-01-31 21:07                 ` Christopher S. Aker
2011-01-31 21:17                   ` Konrad Rzeszutek Wilk
2011-01-31 22:19                     ` Jeremy Fitzhardinge
2011-01-31 22:22                   ` Jeremy Fitzhardinge
2011-01-31 22:25                   ` Jeremy Fitzhardinge
2011-02-14 23:52                     ` Christopher S. Aker
2011-02-15  0:19                       ` Jeremy Fitzhardinge
2011-02-15  1:15                         ` Christopher S. Aker
2011-01-13 14:43           ` Konrad Rzeszutek Wilk