From: Brian Candler <B.Candler@pobox.com>
To: Stan Hoeppner <stan@hardwarefreak.com>
Cc: xfs@oss.sgi.com
Subject: Re: Storage server, hung tasks and tracebacks
Date: Fri, 4 May 2012 17:32:37 +0100
Message-ID: <20120504163237.GA6128@nsrc.org>
In-Reply-To: <4FA3047D.8060908@hardwarefreak.com>

On Thu, May 03, 2012 at 05:19:41PM -0500, Stan Hoeppner wrote:
> Glad to hear you've got one running somewhat stable.  Could be a driver
> problem, but it's pretty rare for a SCSI driver to hard lock a box isn't
> it?

Yes, that bothers me too.

> Keep us posted.

Last night I fired up two more instances of bonnie++ on that box, so there
were four at once.  Going back to the box now, I find that they have all
hung :-(

They are stuck at:

    Delete files in random order...
    Stat files in random order...
    Stat files in random order...
    Stat files in sequential order...

respectively.

iostat 5 shows no activity. There are 9 hung processes:

$ uptime
 17:23:35 up 1 day, 20:39,  1 user,  load average: 9.04, 9.08, 8.91
$ ps auxwww | grep " D" | grep -v grep
root        35  1.5  0.0      0     0 ?        D    May02  42:10 [kswapd0]
root      1179  0.0  0.0      0     0 ?        D    May02   1:50 [xfsaild/md126]
root      3127  0.0  0.0  25096   312 ?        D    16:55   0:00 /usr/lib/postfix/master
tomi     29138  1.1  0.0 378860  3708 pts/1    D+   12:43   3:06 bonnie++ -d /disk/scratch/test -s 16384k -n 98:800k:500k:1000
tomi     29390  1.0  0.0 378860  3560 pts/3    D+   12:52   2:53 bonnie++ -d /disk/scratch/test -s 16384k -n 98:800k:500k:1000
tomi     30356  1.1  0.0 378860  3512 pts/2    D+   13:32   2:36 bonnie++ -d /disk/scratch/testb -s 16384k -n 98:800k:500k:1000
root     31075  0.0  0.0      0     0 ?        D    14:00   0:04 [kworker/0:0]
tomi     31796  0.6  0.0 378860  3864 pts/4    D+   14:30   1:05 bonnie++ -d /disk/scratch/testb -s 16384k -n 98:800k:500k:1000
root     31922  0.0  0.0      0     0 ?        D    14:35   0:00 [kworker/1:0]
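
The `grep " D"` filter above matches any line containing a space followed by a capital D, so a command line with such text would be listed even if the process is not blocked. A minimal sketch of a stricter filter, assuming the default `ps aux` column layout (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND); the function name `d_state_rows` and the last sample row are hypothetical and only there for illustration:

```python
def d_state_rows(ps_output: str) -> list[dict]:
    """Return rows of `ps aux`-style output whose STAT field starts with 'D'."""
    rows = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so COMMAND keeps its embedded spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header line and anything malformed
        if fields[7].startswith("D"):  # STAT column: D, D+, D<, ...
            rows.append({
                "user": fields[0],
                "pid": int(fields[1]),
                "stat": fields[7],
                "command": fields[10],
            })
    return rows


# Three rows copied from the ps output above.
SOURCE_ROWS = (
    "root        35  1.5  0.0      0     0 ?        D    May02  42:10 [kswapd0]\n"
    "root      1179  0.0  0.0      0     0 ?        D    May02   1:50 [xfsaild/md126]\n"
    "tomi     29138  1.1  0.0 378860  3708 pts/1    D+   12:43   3:06 "
    "bonnie++ -d /disk/scratch/test -s 16384k -n 98:800k:500k:1000\n"
)
# Hypothetical row (not from the original output): a sleeping process whose
# command line happens to contain " D", which a plain grep would also match.
HYPOTHETICAL_ROW = (
    "tomi     30001  0.0  0.0   9000   900 pts/5    S+   12:44   0:00 "
    "vi notes Draft.txt\n"
)
SAMPLE = SOURCE_ROWS + HYPOTHETICAL_ROW

if __name__ == "__main__":
    for row in d_state_rows(SAMPLE):
        print(row["pid"], row["stat"], row["command"])
```

Checking the STAT column directly also makes the `grep -v grep` step unnecessary.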

dmesg shows hung tasks and backtraces, starting with:

[150927.599920] INFO: task kswapd0:35 blocked for more than 120 seconds.
[150927.600263] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[150927.600698] kswapd0         D ffffffff81806240     0    35      2 0x00000000
[150927.600704]  ffff880212389330 0000000000000046 ffff880212389320 ffffffff81082df5
[150927.600710]  ffff880212389fd8 ffff880212389fd8 ffff880212389fd8 0000000000013780
[150927.600715]  ffff8802121816f0 ffff88020e538000 ffff880212389320 ffff88020e538000
[150927.600719] Call Trace:
[150927.600728]  [<ffffffff81082df5>] ? __queue_work+0xe5/0x320
[150927.600733]  [<ffffffff8165a55f>] schedule+0x3f/0x60
[150927.600739]  [<ffffffff814e82c6>] md_flush_request+0x86/0x140
[150927.600745]  [<ffffffff8105f990>] ? try_to_wake_up+0x200/0x200
[150927.600756]  [<ffffffffa0010419>] raid0_make_request+0x119/0x1c0 [raid0]
...
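
With nine tasks blocked, it can help to summarise the reports rather than read them one by one. The following is a rough sketch, not a tested tool: it assumes the message format shown in the excerpt above, and the names `parse_hung_tasks`, `HUNG_RE` and `FRAME_RE` are made up for this example. Frames the kernel prints with a leading `?` are not reliable parts of the call chain, so the sketch skips them.

```python
import re

# "INFO: task <comm>:<pid> blocked for more than <N> seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# " [<ffffffff8165a55f>] schedule+0x3f/0x60", optionally with "? " before the symbol
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_hung_tasks(dmesg_text):
    """Group reliable call-trace frames under each hung-task report."""
    reports = []
    for line in dmesg_text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            reports.append({
                "comm": hung.group("comm"),
                "pid": int(hung.group("pid")),
                "secs": int(hung.group("secs")),
                "frames": [],
            })
            continue
        frame = FRAME_RE.search(line)
        # Attach a frame to the most recent report, skipping "?" entries.
        if frame and reports and not frame.group("unreliable"):
            reports[-1]["frames"].append(frame.group("func"))
    return reports


# Lines copied from the dmesg excerpt above (which is itself truncated).
SAMPLE = """\
[150927.599920] INFO: task kswapd0:35 blocked for more than 120 seconds.
[150927.600719] Call Trace:
[150927.600728]  [<ffffffff81082df5>] ? __queue_work+0xe5/0x320
[150927.600733]  [<ffffffff8165a55f>] schedule+0x3f/0x60
[150927.600739]  [<ffffffff814e82c6>] md_flush_request+0x86/0x140
[150927.600745]  [<ffffffff8105f990>] ? try_to_wake_up+0x200/0x200
[150927.600756]  [<ffffffffa0010419>] raid0_make_request+0x119/0x1c0 [raid0]
"""

if __name__ == "__main__":
    for report in parse_hung_tasks(SAMPLE):
        print(report["comm"], report["pid"], " <- ".join(report["frames"]))
```

Because the trace above is cut off at the `...`, the frame list for kswapd0 is only the innermost part of the stack; on the full dmesg output the same function would return one entry per blocked task.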

Now, the only other thing I have found by googling is a suggestion that LSI
drivers lock up when there is any SMART or hddtemp activity: see end of
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/906873

On this system the smartmontools package is installed, but I have not
configured it, and smartd is not running.  I don't have hddtemp installed
either.

I am completely at a loss with all this... I've never seen a Unix/Linux
system behave so unreliably.  One of the company's directors has reminded me
that we have a Windows storage server with 48 disks which has been running
without incident for the last 3 or 4 years, and I don't have a good answer
for that :-(

Regards,

Brian.

