From: Dave Chinner <david@fromorbit.com>
To: Brian Candler <B.Candler@pobox.com>
Cc: Stan Hoeppner <stan@hardwarefreak.com>, xfs@oss.sgi.com
Subject: Re: Storage server, hung tasks and tracebacks
Date: Mon, 7 May 2012 11:53:22 +1000
Message-ID: <20120507015322.GY5091@dastard>
In-Reply-To: <20120504163237.GA6128@nsrc.org>

On Fri, May 04, 2012 at 05:32:37PM +0100, Brian Candler wrote:
> On Thu, May 03, 2012 at 05:19:41PM -0500, Stan Hoeppner wrote:
> > Glad to hear you've got one running somewhat stable.  Could be a driver
> > problem, but it's pretty rare for a SCSI driver to hard lock a box isn't
> > it?

No. The hardware does something bad to the PCI bus, or DMAs
something over kernel memory, or won't de-assert an interrupt line,
or .... and the system will hard hang. Hell, if it just stops and
you run out of memory because IO is needed to clean and free memory,
then the system can hang there as well....

> > Keep us posted.
> 
> Last night I fired up two more instances of bonnie++ on that box, so there
> were four at once.  Going back to the box now, I find that they have all
> hung :-(
> 
> They are stuck at:
> 
>     Delete files in random order...
>     Stat files in random order...
>     Stat files in random order...
>     Stat files in sequential order...
> 
> respectively.
> 
> iostat 5 shows no activity. There are 9 hung processes:
> 
> $ uptime
>  17:23:35 up 1 day, 20:39,  1 user,  load average: 9.04, 9.08, 8.91
> $ ps auxwww | grep " D" | grep -v grep
> root        35  1.5  0.0      0     0 ?        D    May02  42:10 [kswapd0]
> root      1179  0.0  0.0      0     0 ?        D    May02   1:50 [xfsaild/md126]
> root      3127  0.0  0.0  25096   312 ?        D    16:55   0:00 /usr/lib/postfix/master
> tomi     29138  1.1  0.0 378860  3708 pts/1    D+   12:43   3:06 bonnie++ -d /disk/scratch/test -s 16384k -n 98:800k:500k:1000
> tomi     29390  1.0  0.0 378860  3560 pts/3    D+   12:52   2:53 bonnie++ -d /disk/scratch/test -s 16384k -n 98:800k:500k:1000
> tomi     30356  1.1  0.0 378860  3512 pts/2    D+   13:32   2:36 bonnie++ -d /disk/scratch/testb -s 16384k -n 98:800k:500k:1000
> root     31075  0.0  0.0      0     0 ?        D    14:00   0:04 [kworker/0:0]
> tomi     31796  0.6  0.0 378860  3864 pts/4    D+   14:30   1:05 bonnie++ -d /disk/scratch/testb -s 16384k -n 98:800k:500k:1000
> root     31922  0.0  0.0      0     0 ?        D    14:35   0:00 [kworker/1:0]
> 
> dmesg shows hung tasks and backtraces, starting with:
> 
> [150927.599920] INFO: task kswapd0:35 blocked for more than 120 seconds.
> [150927.600263] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [150927.600698] kswapd0         D ffffffff81806240     0    35      2 0x00000000
> [150927.600704]  ffff880212389330 0000000000000046 ffff880212389320 ffffffff81082df5
> [150927.600710]  ffff880212389fd8 ffff880212389fd8 ffff880212389fd8 0000000000013780
> [150927.600715]  ffff8802121816f0 ffff88020e538000 ffff880212389320 ffff88020e538000
> [150927.600719] Call Trace:
> [150927.600728]  [<ffffffff81082df5>] ? __queue_work+0xe5/0x320
> [150927.600733]  [<ffffffff8165a55f>] schedule+0x3f/0x60
> [150927.600739]  [<ffffffff814e82c6>] md_flush_request+0x86/0x140
> [150927.600745]  [<ffffffff8105f990>] ? try_to_wake_up+0x200/0x200
> [150927.600756]  [<ffffffffa0010419>] raid0_make_request+0x119/0x1c0 [raid0]

That's most likely a hardware or driver problem - the IO request
queue is full, which means that IO completions are not occurring or
are being delayed excessively. The problem is below the level of the
filesystem....
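
One way to check this on a live box is to see whether requests are
sitting in the block layer while tasks pile up in D state. The
following is only a rough sketch, not a diagnostic tool: it assumes a
kernel that exposes /sys/block/<dev>/inflight (two counters, reads and
writes in flight) and the usual /proc/<pid>/stat layout. The function
names are made up for illustration.

```python
import os


def parse_inflight(text):
    """Parse /sys/block/<dev>/inflight contents: '<reads> <writes>'."""
    reads, writes = text.split()
    return int(reads), int(writes)


def task_state(stat_line):
    """Return the state letter from a /proc/<pid>/stat line.

    The comm field is wrapped in parentheses and may contain spaces,
    so split on the last ')' instead of splitting on whitespace.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def task_name(stat_line):
    """Return the comm field (the text between the outer parentheses)."""
    return stat_line[stat_line.index("(") + 1:stat_line.rindex(")")]


def read_text(path):
    """Read a small file, returning None if it vanished or is unreadable."""
    try:
        with open(path, errors="replace") as f:
            return f.read()
    except OSError:
        return None


def stuck_devices(sys_block="/sys/block"):
    """Map device name -> (reads, writes) for devices with IO in flight."""
    result = {}
    if not os.path.isdir(sys_block):
        return result
    for dev in sorted(os.listdir(sys_block)):
        text = read_text(os.path.join(sys_block, dev, "inflight"))
        if text and text.strip():
            reads, writes = parse_inflight(text)
            if reads or writes:
                result[dev] = (reads, writes)
    return result


def blocked_tasks(proc="/proc"):
    """List (pid, name) for tasks in uninterruptible sleep (state D)."""
    tasks = []
    if not os.path.isdir(proc):
        return tasks
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        line = read_text(os.path.join(proc, entry, "stat"))
        if line and ")" in line and task_state(line) == "D":
            tasks.append((int(entry), task_name(line)))
    return sorted(tasks)


if __name__ == "__main__":
    for dev, (reads, writes) in stuck_devices().items():
        print(f"{dev}: {reads} reads, {writes} writes in flight")
    for pid, name in blocked_tasks():
        print(f"D state: {pid} {name}")
```

Run it twice, a few seconds apart. If the same devices report the same
non-zero in-flight counts while iostat shows no activity, the requests
have been issued but are not being completed, which points at the
driver or the hardware rather than at the filesystem.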

> I am completely at a loss with all this... I've never seen a Unix/Linux
> system behave so unreliably.

If you are buying bottom of the barrel hardware, then you get the
reliability that you pay for. Spend a few more dollars and buy
something that is properly engineered - you've wasted more money
trying to diagnose this problem than you would have saved by buying
cheap hardware....

> One of the company's directors has reminded me
> that we have a Windows storage server with 48 disks which has been running
> without incident for the last 3 or 4 years, and I don't have a good answer
> for that :-(

If you buy bottom of the barrel hardware for Windows servers, then
you'll get similar results, only they'll be much harder to diagnose.
Software can't fix busted hardware...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs