xen-devel.lists.xenproject.org archive mirror
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Scott Garron <xen-devel@sce.pridelands.org>,
	Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>,
	xen-devel@lists.xensource.com
Subject: BUG: unable to handle kernel NULL pointer dereference at IP: [<ffffffff8105ae4c>] process_one_work+
Date: Tue, 14 Jun 2011 09:55:44 -0400
Message-ID: <20110614135543.GA27849@dumpdata.com>
In-Reply-To: <4DF69B42.4080908@sce.pridelands.org>

On Mon, Jun 13, 2011 at 07:20:34PM -0400, Scott Garron wrote:
> On 06/13/2011 06:03 PM, Konrad Rzeszutek Wilk wrote:
> >Can you do one more thing - bootup the same kernel as baremetal?
> >Without any Xen and with the same options .. and also with
> >/proc/interrupts so I can see what native Linux sees?
> 
> The serial console plus cat /proc/interrupts pasted onto the end of it
> is here:

Thank you.
> 
> http://pridelands.org/~simba/xen/hailstorm-fullserial20110613.txt

So IRQ 9 is correct.

Somehow I thought that these:

[    1.646560]  dc 0FF ACPI Warning: Large Reference Count (0x1FEA) in object ffff88001ebb3b98 (20110316/utdelete-448)
[    4.136398] ACPI Warning: Large Reference Count (0x1FE9) in object ffff88001ebb3b98 (20110316/utdelete-448)
[    4.136426] BUG: unable to handle kernel NULL pointer dereference at           (null)
[    4.136436] IP: [<ffffffff8105ae4c>] process_one_work+0x27/0x286
[    4.136459] PGD 0 
[    4.136465] Oops: 0000 [#1] SMP 
[    4.136475] CPU 0 
[    4.136479] Modules linked in:
[    4.136485] 
[    4.136492] Pid: 374, comm: kworker/0:1 Tainted: G        W   2.6.39+ #2 To Be Filled By O.E.M. To Be Filled By O.E.M./TYAN High-End Dual AMD Opteron, S2882
[    4.136505] RIP: e030:[<ffffffff8105ae4c>]  [<ffffffff8105ae4c>] process_one_work+0x27/0x286
[    4.136516] RSP: e02b:ffff88001eb4be40  EFLAGS: 00010046
(from http://pridelands.org/~simba/xen/hailstorm-fullserial20110610.txt)

are related - as in, the ACPI IRQ gets triggered, it does something (which
appears to make the ACPI parser complain), and then it puts some function on the
workqueue, which dies trying to access ffff88001ebb3b80. It died, and whatever
that function was supposed to do never completed. I was thinking that IRQ 9
having the wrong polarity or the wrong trigger was causing this mayhem - but
neither is wrong, so that is not the case. Sorry about wasting your time
heading down this wrong path.

The boot process continues, the xen clocksource kicks in, and it does a hypercall
.. and is probably looping between the hypercall, the xen upcall handler, and back.
IRQ 9 is pending, so it hasn't been acknowledged by the Linux kernel. In fact, there
are a couple of events that are stuck and are locally masked. That means
'spin_lock_irqsave' has been called and has masked the vcpu, but
'spin_unlock_irqrestore' has not been called - which could be due to
process_one_work dying.

But the curious thing is that you have two CPUs assigned to Dom0 and while
CPU0 looks to be bouncing back and forth, CPU1 is doing something. The RIP
is 0xffffffff8108820c. Can you try to run this through System.map?
Or the whole bunch of these:

ffffffff8108820c
ffffffff81088100
ffffffff810881a7
ffffffff8108811a
ffffffff816101a8
ffffffff81006c32
ffffffff816114a4
ffffffff8108803a
ffffffff8105f5bd
ffffffff81618564
ffffffff81617973
ffffffff816117a1
ffffffff81618560
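
A minimal sketch of that lookup, assuming System.map has the usual
"address type name" layout, one symbol per line. The helper names are made up
for illustration, and the sample entry at the end is derived from the oops
above (IP 0xffffffff8105ae4c minus offset 0x27); the 't' type letter in it is
an assumption.

```python
import bisect


def parse_system_map(text):
    """Parse System.map text into a sorted list of (address, name) pairs."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(addr, symbols):
    """Return 'name+0xoffset' for the nearest symbol at or below addr."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None  # addr lies below the first symbol in the map
    base, name = symbols[i]
    return f"{name}+{addr - base:#x}"


def resolve_file(map_path, addresses):
    """Resolve a list of integer addresses against a System.map file."""
    with open(map_path) as f:
        symbols = parse_system_map(f.read())
    return [resolve(a, symbols) for a in addresses]


# Sample entry derived from the oops; the type letter 't' is assumed.
sample = parse_system_map("ffffffff8105ae25 t process_one_work\n")
print(resolve(0xffffffff8105ae4c, sample))
```

To use it on the list above, pass the path to the System.map that matches the
running kernel and the addresses as integers, e.g.
resolve_file("System.map", [0xffffffff8108820c, 0xffffffff81088100]).
The result is only meaningful if the map was generated from the same build.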

The other idea is to limit Dom0 to run on only one CPU. You can do this
by putting 'dom0_max_vcpus=1 dom0_vcpus_pin' on the Xen command line, and then
see if it fails somewhere else. It will probably die at 0xffffffff810013aa :-(

But regardless of what I mentioned above, we need to find out why
process_one_work got a toxic parameter. Can you disassemble 0xffffffff8105ae4c
and see what it does and how it corresponds to 'process_one_work' in kernel/workqueue.c?
You can also instrument the code to find out what:

1804         work_func_t f = work->func;

is.
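
For the disassembly, the oops line already gives enough to work out where the
function starts and ends: "process_one_work+0x27/0x286" means the faulting IP
is 0x27 bytes into a function that is 0x286 bytes long. A small sketch of that
arithmetic follows; the helper name is made up, and the 'vmlinux' path in the
printed objdump command is an assumption (it needs to be the uncompressed
image from the same build).

```python
import re


def function_range(ip, location):
    """Split 'symbol+0xOFF/0xLEN' and return (symbol, start, end)."""
    m = re.fullmatch(r"([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", location)
    if m is None:
        raise ValueError(f"unrecognised oops location: {location!r}")
    symbol = m.group(1)
    offset = int(m.group(2), 16)   # how far the IP is into the function
    length = int(m.group(3), 16)   # total size of the function in bytes
    start = ip - offset
    return symbol, start, start + length


symbol, start, end = function_range(
    0xffffffff8105ae4c, "process_one_work+0x27/0x286"
)
print(f"{symbol}: {start:#x} .. {end:#x}")
# Suggested command only; the vmlinux path is an assumption.
print(f"objdump -d --start-address={start:#x} --stop-address={end:#x} vmlinux")
```

The instruction at the faulting IP is the one 0x27 bytes past the printed
start address, which is where to begin matching against the C source.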

Jeremy, any thoughts on what else might be afoot here?