From: Eric Sandeen <sandeen@sandeen.net>
To: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Fabio Coatti <cova@ferrara.linux.it>,
linux-kernel@vger.kernel.org, xfs@oss.sgi.com
Subject: Re: [BUG] spinlock lockup on CPU#0
Date: Thu, 09 Apr 2009 10:27:37 -0500 [thread overview]
Message-ID: <49DE13E9.6040605@sandeen.net> (raw)
In-Reply-To: <19f34abd0904090707v7eb8b677gbda42595aa04a090@mail.gmail.com>
Vegard Nossum wrote:
> 2009/3/30 Fabio Coatti <cova@ferrara.linux.it>:
>> Hi all, I've got the following BUG: report on one of our servers running
>> 2.6.28.8; some background:
>> we are seeing several lockups in db (mysql) servers that show up as a sudden
>> load increase and then, very quickly, the server freezes. It happens in a
>> random way, sometimes after weeks, sometimes very quickly after a system
>> reboot. Trying to discover the problem we installed latest (at the time of
>> test) 2.6.28.X kernel and loaded it with some high disk I/O operations (find,
>> dd, rsync and so on).
>> We have been able to crash a server with these tests; unfortunately we were
>> only able to capture a remote screen snapshot, so I copied the data by hand
>> (hopefully without typos); the result is the following:
>
> Hi,
>
> Thanks for the report.
>
>> [<ffffffff80213590>] ? default_idle+0x30/0x50
>> [<ffffffff8021358e>] ? default_idle+0x2e/0x50
>> [<ffffffff80213793>] ? c1e_idle+0x73/0x120
>> [<ffffffff80259f11>] ? atomic_notifier_call_chain+0x11/0x20
>> [<ffffffff8020a31f>] ? cpu_idle+0x3f/0x70
>> BUG: spinlock lockup on CPU#0, find/13114, ffff8801363d2c80
>> Pid: 13114, comm: find Tainted: G D W 2.6.28.8 #5
>> Call Trace:
>> [<ffffffff8041a02e>] _raw_spin_lock+0x14e/0x180
>> [<ffffffff8060b691>] _spin_lock+0x51/0x70
>> [<ffffffff80231ca4>] ? task_rq_lock+0x54/0xa0
>> [<ffffffff80231ca4>] task_rq_lock+0x54/0xa0
>> [<ffffffff80234501>] try_to_wake_up+0x91/0x280
>> [<ffffffff80234720>] wake_up_process+0x10/0x20
>> [<ffffffff803bf863>] xfsbufd_wakeup+0x53/0x70
>> [<ffffffff802871e0>] shrink_slab+0x90/0x180
>> [<ffffffff80287526>] try_to_free_pages+0x256/0x3a0
>> [<ffffffff80285280>] ? isolate_pages_global+0x0/0x280
>> [<ffffffff80281166>] __alloc_pages_internal+0x1b6/0x460
>> [<ffffffff802a186d>] alloc_page_vma+0x6d/0x110
>> [<ffffffff8028d3ab>] handle_mm_fault+0x4ab/0x790
>> [<ffffffff80225293>] do_page_fault+0x463/0x870
>> [<ffffffff8060b199>] ? trace_hardirqs_off_thunk+0x3a/0x3c
>> [<ffffffff8060bf52>] error_exit+0x0/0xa9
>
> Seems like you hit this:
In _xfs_buf_lookup_pages? That's not on the stack, and we didn't see
the printk below...
> 	/*
> 	 * This could deadlock.
> 	 *
> 	 * But until all the XFS lowlevel code is revamped to
> 	 * handle buffer allocation failures we can't do much.
> 	 */
> 	if (!(++retries % 100))
> 		printk(KERN_ERR
> 			"XFS: possible memory allocation "
> 			"deadlock in %s (mode:0x%x)\n",
> 			__func__, gfp_mask);
>
...
so I don't think so. From the trace:
>> [<ffffffff803bf863>] xfsbufd_wakeup+0x53/0x70
>> [<ffffffff802871e0>] shrink_slab+0x90/0x180
this is the shrinker kicking off:
static struct shrinker xfs_buf_shake = {
	.shrink = xfsbufd_wakeup,
	.seeks = DEFAULT_SEEKS,
};
> ...so my guess is that you ran out of memory (and XFS simply can't
> handle it -- an error in the XFS code, of course).
Wrong guess, I think. XFS has been called via the shrinker mechanism
to *free* memory, and we're not able to get the task rq lock in the
wakeup path, though I'm not sure why...
> My first tip, if you simply want your servers not to crash, is to
> switch to another filesystem. You could at least try it and see if it
> helps your problem -- that's the most straight-forward solution I can
> think of.
>
>> The machine is a dual 2216HE (2 cores) AMD with 4 GB RAM; below you can find
>> the .config file. (from /proc/config.gz)
>>
>> we are seeing similar lockups (at least similar for the results) since several
>> kernel revisions (starting from 2.6.25.X) and on different hardware. Several
>> machines are hit by this, mostly databases (maybe for the specific usage, other
>> machines being apache servers, I don't know).
>>
>> Could someone give us some hints about this issue, or at least some
>> suggestions on how to dig into it? Of course we can do any sort of testing
>> and experiments.
If sysrq-t (show all running tasks) still works post-oops, capturing
that might help to see where other threads are at. Hook up a serial
console to make capturing the output possible.
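For reference, the usual setup for that is (a) make sure sysrq is
enabled, and (b) send console output to a serial port via kernel boot
parameters. A sketch of the standard knobs (the serial device and baud
rate below are examples; adjust to your hardware):

```shell
# Enable all sysrq functions (persist via sysctl kernel.sysrq = 1).
echo 1 > /proc/sys/kernel/sysrq

# Trigger the task dump ('t') from software, equivalent to Alt-SysRq-t
# on the console keyboard:
echo t > /proc/sysrq-trigger

# Kernel command-line addition to mirror console output on the first
# serial port at 115200 baud (set in your bootloader config):
#   console=ttyS0,115200 console=tty0
```

The dump goes to the kernel log, so a serial console (or netconsole)
is what makes it capturable after the machine wedges.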
-Eric
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
Thread overview: 3+ messages
[not found] <200903301936.08477.cova@ferrara.linux.it>
2009-04-09 14:07 ` [BUG] spinlock lockup on CPU#0 Vegard Nossum
2009-04-09 14:21 ` Vegard Nossum
2009-04-09 15:27 ` Eric Sandeen [this message]