From: Dave Chinner <david@fromorbit.com>
To: 韩国中 <vincent.han.megan@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: task xfssyncd blocked while raid5 was in recovery
Date: Thu, 11 Oct 2012 16:09:55 +1100
Message-ID: <20121011050955.GB2739@dastard>
In-Reply-To: <CAE-xvygu7j4zTBVRmm93-VbzhwiU2enEqwnFOPy5gbVHBn1CEQ@mail.gmail.com>
On Thu, Oct 11, 2012 at 11:55:01AM +0800, 韩国中 wrote:
> Hello, everyone:
>
> Recently, a problem has troubled me for a long time.
>
> I created a 4*2T (sda, sdb, sdc, sdd) raid5 with XFS file system, 128K
> chunk size and 2048 stripe_cache_size. The mdadm 3.2.2, kernel 2.6.38
> and mkfs.xfs 3.1.1 were used. When the raid5 was in recovery and the
> schedule reached 47%, I/O errors occurred in sdb. The following was
> the output:
>
> ......
>
> ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>
> ata2: status=0x41 { DriveReady Error }
Looks like you've had a drive fail during rebuild.
> Then, there were lots of error messages about the file system. The
> following was the output:
>
>
>
> ......
>
> INFO: task xfssyncd/md127:1058 blocked for more than 120 seconds.
>
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> xfssyncd/md127 D fffffff7000216d0 0 1058 2 0x00000000
> frame 0: 0xfffffff700020570 __switch_to+0x1b8/0x1c0 (sp 0xfffffe008d7ff900)
> frame 1: 0xfffffff7000216d0 schedule+0x918/0x1538 (sp 0xfffffe008d7ff9d0)
> frame 2: 0xfffffff700022a90 schedule_timeout+0x268/0x5b0 (sp 0xfffffe008d7ffd18)
> frame 3: 0xfffffff700024ee0 __down+0xd8/0x158 (sp 0xfffffe008d7ffda8)
> frame 4: 0xfffffff70085da78 down.cold+0x8/0x28 (sp 0xfffffe008d7ffe18)
> frame 5: 0xfffffff700750788 xfs_buf_lock+0xd0/0x120 (sp 0xfffffe008d7ffe38)
> frame 6: 0xfffffff700821b40 xfs_getsb+0x38/0x78 (sp 0xfffffe008d7ffe50)
> frame 7: 0xfffffff70077e230 xfs_trans_getsb+0xe0/0x100 (sp 0xfffffe008d7ffe68)
> frame 8: 0xfffffff7006babc0 xfs_mod_sb+0x88/0x198 (sp 0xfffffe008d7ffe88)
> frame 9: 0xfffffff7007a6480 xfs_fs_log_dummy+0x68/0xe0 (sp 0xfffffe008d7ffeb8)
> frame 10: 0xfffffff70079c6c0 xfs_sync_worker+0xe0/0xe8 (sp 0xfffffe008d7ffed8)
> frame 11: 0xfffffff700570a00 xfssyncd+0x240/0x328 (sp 0xfffffe008d7ffef0)
> frame 12: 0xfffffff7000f0530 kthread+0xe0/0xe8 (sp 0xfffffe008d7fff80)
> frame 13: 0xfffffff7000bab38 start_kernel_thread+0x18/0x20 (sp 0xfffffe008d7fffe8)
Which is basically saying that the superblock buffer is under IO -
that's the only reason it ever gets locked.
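As a rough illustration of why the trace ends in down(): the buffer lock
behaves like a binary semaphore that is taken when IO is submitted and
released only when that IO completes. The sketch below is a toy model in
Python, not XFS code; ToyBuffer, submit_io, io_done and try_sync are
hypothetical names, and the timeout exists only so the example terminates
(the kernel waiter has no timeout, which is why the hung task detector
fires instead).

```python
import threading


class ToyBuffer:
    """Toy stand-in for a buffer whose lock is held while IO is in flight."""

    def __init__(self):
        # Binary semaphore: value 1 means unlocked, value 0 means locked.
        self._sema = threading.Semaphore(1)

    def submit_io(self):
        # The submitter takes the lock and keeps it until completion.
        self._sema.acquire()

    def io_done(self):
        # The completion path releases the lock.
        self._sema.release()

    def lock(self, timeout):
        # A second task waits for the lock, giving up after `timeout` seconds.
        return self._sema.acquire(timeout=timeout)

    def unlock(self):
        self._sema.release()


def try_sync(buf, timeout=0.1):
    """Return True if the lock was obtained, False if the wait timed out."""
    got = buf.lock(timeout)
    if got:
        # Release again so the buffer can be reused by later callers.
        buf.unlock()
    return got
```

If io_done() is never called, every later try_sync() times out, which is
the toy equivalent of the blocked xfssyncd task.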
> The output said “INFO: task xfssyncd/md127:1058 blocked for more than
> 120 seconds”. What did that mean? I used “cat /proc/mdstat” to see the
> state of the raid5. The output was:
>
> Personalities : [raid0] [raid6] [raid5] [raid4]
>
> md127 : active raid5 sdd[3] sdc[2] sdb[1](F) sda[0]
>
> 5860540032 blocks super 1.2 level 5, 128k chunk, algorithm 2 [4/3] [U_UU]
>
> resync=PENDING
>
> unused devices: <none>
>
>
> The state of the raid5 was “PENDING”. I had never seen such a
> state of raid5 when I used ext4. After that, I wrote a program to access the
> raid5, but there was no response any more.
Waiting on IO to complete, but with the MD device down, it will
never complete.
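The failed member is visible in the quoted /proc/mdstat output: sdb carries
the (F) flag and the array shows [4/3], i.e. 4 devices expected and 3
active. A minimal sketch of checking for that condition in Python follows.
The function name degraded_arrays is hypothetical, and the regular
expressions assume the mdstat layout shown in the quoted output rather than
every format the kernel can emit.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array: {"failed", "expected", "active"}} for arrays missing members."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header, e.g. "md127 : active raid5 sdd[3] sdc[2] sdb[1](F) sda[0]"
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line.strip())
        if header:
            current = header.group(1)
            # Members flagged (F) have been marked faulty by MD.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            result[current] = {"failed": failed, "expected": None, "active": None}
            continue
        # Status line, e.g. "... [4/3] [U_UU]": expected devices / active devices.
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and current:
            result[current]["expected"] = int(counts.group(1))
            result[current]["active"] = int(counts.group(2))
    # Keep only arrays with a faulty member or fewer active devices than expected.
    return {
        name: info
        for name, info in result.items()
        if info["failed"]
        or (info["expected"] is not None and info["active"] < info["expected"])
    }
```

Typical use would be degraded_arrays(open("/proc/mdstat").read()) on the
affected machine; an empty result means no array matched either condition.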
> Then I used “ps aux| task
> xfssyncd” to see the state of “xfssyncd”. Unfortunately, there was no
> response yet. Then I tried “ps aux”. There were outputs, but the
> program could exit with “Ctrl+d” or “Ctrl+z”. And when I tested the
> write performance for raid5, I/O errors often occurred. I did not know
> why these I/O errors occurred so frequently.
>
> What was the problem? Can anyone help me?
Broken hardware causing MD to go into a bad state, which causes XFS
to stall because it can't make progress.
Bottom line: replace the broken disk, though given that MD was
already rebuilding a RAID5 when the disk died, you probably have
lost everything on the filesystem....
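The reason a second failure is so damaging: RAID5 stores one XOR parity
chunk per stripe, so it can reconstruct exactly one missing chunk per
stripe and no more. The following Python sketch shows the arithmetic only.
It is a toy model, not MD's implementation, and xor_blocks and
rebuild_missing are hypothetical names.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild a stripe (data chunks plus parity) where None marks a missing chunk.

    Raises ValueError when more than one chunk is missing, because a single
    parity chunk carries enough information to recover only one.
    """
    missing = [i for i, chunk in enumerate(stripe) if chunk is None]
    if len(missing) > 1:
        raise ValueError("more than one chunk missing; parity cannot recover this stripe")
    rebuilt = list(stripe)
    if missing:
        # XOR of all surviving chunks (data and parity) yields the lost chunk.
        survivors = [chunk for chunk in stripe if chunk is not None]
        rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt
```

With three data chunks and their parity, dropping any one chunk and calling
rebuild_missing() returns the original stripe; dropping two raises the
error, which mirrors a rebuild that loses a second source of data.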
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs