From: Dave Chinner <david@fromorbit.com>
To: 韩国中 <vincent.han.megan@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: task xfssyncd blocked while raid5 was in recovery
Date: Thu, 11 Oct 2012 16:09:55 +1100
Message-ID: <20121011050955.GB2739@dastard>
In-Reply-To: <CAE-xvygu7j4zTBVRmm93-VbzhwiU2enEqwnFOPy5gbVHBn1CEQ@mail.gmail.com>
On Thu, Oct 11, 2012 at 11:55:01AM +0800, 韩国中 wrote:
> Hello, every one:
>
> Recently, a problem has troubled me for a long time.
>
> I created a 4*2T (sda, sdb, sdc, sdd) raid5 with XFS file system, 128K
> chunk size and 2048 stripe_cache_size. The mdadm 3.2.2, kernel 2.6.38
> and mkfs.xfs 3.1.1 were used. When the raid5 was in recovery and the
> schedule reached 47%, I/O errors occurred in sdb. The following was
> the output:
>
> ......
>
> ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>
> ata2: status=0x41 { DriveReady Error }
Looks like you've had a drive fail during rebuild.
> Then, there were lots of error messages about the file system. The
> following was the output:
>
>
>
> ......
>
> INFO: task xfssyncd/md127:1058 blocked for more than 120 seconds.
>
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> xfssyncd/md127 D fffffff7000216d0 0 1058 2 0x00000000
> frame 0: 0xfffffff700020570 __switch_to+0x1b8/0x1c0 (sp 0xfffffe008d7ff900)
> frame 1: 0xfffffff7000216d0 schedule+0x918/0x1538 (sp 0xfffffe008d7ff9d0)
> frame 2: 0xfffffff700022a90 schedule_timeout+0x268/0x5b0 (sp 0xfffffe008d7ffd18)
> frame 3: 0xfffffff700024ee0 __down+0xd8/0x158 (sp 0xfffffe008d7ffda8)
> frame 4: 0xfffffff70085da78 down.cold+0x8/0x28 (sp 0xfffffe008d7ffe18)
> frame 5: 0xfffffff700750788 xfs_buf_lock+0xd0/0x120 (sp 0xfffffe008d7ffe38)
> frame 6: 0xfffffff700821b40 xfs_getsb+0x38/0x78 (sp 0xfffffe008d7ffe50)
> frame 7: 0xfffffff70077e230 xfs_trans_getsb+0xe0/0x100 (sp 0xfffffe008d7ffe68)
> frame 8: 0xfffffff7006babc0 xfs_mod_sb+0x88/0x198 (sp 0xfffffe008d7ffe88)
> frame 9: 0xfffffff7007a6480 xfs_fs_log_dummy+0x68/0xe0 (sp 0xfffffe008d7ffeb8)
> frame 10: 0xfffffff70079c6c0 xfs_sync_worker+0xe0/0xe8 (sp 0xfffffe008d7ffed8)
> frame 11: 0xfffffff700570a00 xfssyncd+0x240/0x328 (sp 0xfffffe008d7ffef0)
> frame 12: 0xfffffff7000f0530 kthread+0xe0/0xe8 (sp 0xfffffe008d7fff80)
> frame 13: 0xfffffff7000bab38 start_kernel_thread+0x18/0x20 (sp 0xfffffe008d7fffe8)
Which is basically saying that the superblock buffer is under IO -
that's the only reason it ever gets locked.
> The output said “INFO: task xfssyncd/md127:1058 blocked for more than
> 120 seconds”. What did that mean? I used “cat /proc/mdstat” to see the
> state of the raid5. The output was:
>
> Personalities : [raid0] [raid6] [raid5] [raid4]
>
> md127 : active raid5 sdd[3] sdc[2] sdb[1](F) sda[0]
>
> 5860540032 blocks super 1.2 level 5, 128k chunk, algorithm 2 [4/3] [U_UU]
>
> resync=PENDING
>
> unused devices: <none>
>
>
> The state of the raid5 was “PENDING”. I had never seen such a
> state of raid5 when I used ext4. After that, I wrote a program to access the
> raid5, there was no response any more.
Waiting on IO to complete, but with the MD device down, it will
never complete.
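
For readers who want to spot this condition without reading the mdstat
output by eye, here is a minimal sketch in Python. It is an illustration
only, not part of any md or XFS tooling: the function name
failed_members and the regular expressions are assumptions based solely
on the output format quoted above, and other kernel versions may print
the status lines differently.

```python
import re

# Sample text copied from the /proc/mdstat output quoted in this message.
SAMPLE_MDSTAT = """\
Personalities : [raid0] [raid6] [raid5] [raid4]
md127 : active raid5 sdd[3] sdc[2] sdb[1](F) sda[0]
      5860540032 blocks super 1.2 level 5, 128k chunk, algorithm 2 [4/3] [U_UU]
      resync=PENDING
unused devices: <none>
"""


def failed_members(mdstat_text):
    """Map each md array to its failed members and a degraded flag.

    Hypothetical helper: it only understands the line shapes shown above.
    """
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line.strip())
        if header:
            current = header.group(1)
            # Members look like "sdb[1]" with an optional "(F)" failure flag.
            members = re.findall(r"(\w+)\[\d+\](\(F\))?", header.group(2))
            result[current] = {
                "failed": [name for name, flag in members if flag],
                "degraded": False,
            }
            continue
        if current:
            # "[4/3] [U_UU]": an underscore marks a slot that is not up.
            status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
            if status:
                result[current]["degraded"] = "_" in status.group(3)
    return result


if __name__ == "__main__":
    print(failed_members(SAMPLE_MDSTAT))
```

On a live system the text would come from reading /proc/mdstat instead
of the embedded sample.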
> Then I used “ps aux| task
> xfssyncd” to see the state of “xfssyncd”. Unfortunately, there was no
> response yet. Then I tried “ps aux”. There were outputs, but the
> program could exit with “Ctrl+d” or “Ctrl+z”. And when I tested the
> write performance for raid5, I/O errors often occurred. I did not know
> why this I/O errors occurred so frequently.
>
> What was the problem? Can any one help me?
Broken hardware causing MD to go into a bad state, which causes XFS
to stall because it can't make progress.
Bottom line: replace the broken disk, though given that MD was
already rebuilding a RAID5 when the disk died, you probably have
lost everything on the filesystem....
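
To illustrate why a failure during a RAID5 rebuild is so damaging, here
is a simplified sketch of single-parity reconstruction in Python. It is
not the md implementation: the chunk contents and the helper xor_blocks
are made up for the example, and a real array rotates parity across its
members. The point is only that XOR parity can regenerate one missing
chunk per stripe, so if the member being rebuilt is still incomplete
when a second member fails, those stripes cannot be reconstructed.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe on a 4-member array: three data chunks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# One chunk missing: XOR of the survivors regenerates it.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]

# Two chunks missing: the survivors only give the XOR of the two lost
# chunks, which is not enough to recover either one individually.
combined = xor_blocks([data[0], parity])
assert combined == xor_blocks([data[1], data[2]])
```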
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs