From: Shivani Bhope <sbhope@redhat.com>
To: linux-raid@vger.kernel.org
Subject: md0_raid5 process consuming 100% CPU on disk failure
Date: Thu, 16 Jun 2011 15:01:17 -0400 (EDT)
Message-ID: <1349223387.724279.1308250877780.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
In-Reply-To: <542975669.724147.1308250559342.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>
Hello all,
I have been testing fault handling on an md RAID 5 array. I set up a
three-disk RAID 5, started some I/O, and then failed one member with:
mdadm /dev/md0 --fail /dev/sdc
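
For completeness, the array was created and loaded roughly as follows
(the create options are inferred from the -D output below; the dd is
only a stand-in for the actual I/O workload I ran):

    # 3-disk RAID5, 512K chunk, internal bitmap -- matches the -D output below
    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
          --chunk=512 --bitmap=internal /dev/sdb /dev/sdc /dev/sdd

    # background I/O against the array while the member is failed
    dd if=/dev/zero of=/dev/md0 bs=1M count=10000 oflag=direct &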
As soon as the disk is failed, the md0_raid5 process starts consuming
100% CPU. This happened in roughly 2 out of every 3 test runs.
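
The spinning thread is easy to spot with, for example:

    top -b -n 1 -p "$(pgrep md0_raid5)"

which shows md0_raid5 pegged at 100% CPU in state R.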
The details:
1. Output of mdadm -D /dev/md0 before the test:
/dev/md0:
Version : 1.2
Creation Time : Tue Jun 14 20:05:07 2011
Raid Level : raid5
Array Size : 143129600 (136.50 GiB 146.56 GB)
Used Dev Size : 71564800 (68.25 GiB 73.28 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Jun 15 13:42:59 2011
State : active
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : 0
UUID : 5a1f5b73:7ce46e00:2b16a389:cadd7ae8
Events : 4716
    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       3       8       48        2      active sync   /dev/sdd
2. Output of mdadm -D /dev/md0 after the test:
/dev/md0:
Version : 1.2
Creation Time : Wed Jun 15 17:44:11 2011
Raid Level : raid5
Array Size : 143129600 (136.50 GiB 146.56 GB)
Used Dev Size : 71564800 (68.25 GiB 73.28 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Jun 15 18:18:41 2011
State : active, degraded
Active Devices : 2
Working Devices : 2
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : 0
UUID : 8663119c:892e5b7d:a4f2be22:5bb2fdd4
Events : 1049
    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      faulty spare rebuilding   /dev/sdc
       3       8       48        2      active sync   /dev/sdd
Note that the failed device is not marked as removed, or simply as a
faulty spare, but as "faulty spare rebuilding".
The system has to be manually power-cycled to recover.
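
For reference, the usual way to replace the failed member would be
something like:

    mdadm /dev/md0 --remove /dev/sdc
    mdadm /dev/md0 --add /dev/sdc

but in this state only a power cycle brings the machine back.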
This system is running RHEL 6.1, kernel 2.6.32-131.0.15.el6.x86_64 and
mdadm-3.2.1-1.el6.x86_64. The same test was also run on Fedora 15, with
kernel 2.6.38.8-32.fc15.x86_64 and mdadm-3.1.5-2.fc15.x86_64.
The output for sysrq-t showing md0_raid5 after the hang is:
Jun 15 18:20:39 kernel: md0_raid5 R running task 0 1896 2 0x00000080
Jun 15 18:20:39 kernel: ffff880175e3fbd0 ffffffff814db337 ffff880175e3fb70 ffffffff8104af29
Jun 15 18:20:39 kernel: 000000000000100c 0000000300000001 ffffe8ffffc00270 ffff88002801e988
Jun 15 18:20:39 kernel: ffff8801747bdab8 ffff880175e3ffd8 000000000000f598 ffff8801747bdac0
Jun 15 18:20:39 kernel: Call Trace:
Jun 15 18:20:39 kernel: [<ffffffff814db337>] ? thread_return+0x4e/0x777
Jun 15 18:20:39 kernel: [<ffffffff8104af29>] ? __wake_up_common+0x59/0x90
Jun 15 18:20:39 kernel: [<ffffffff81103c56>] ? __perf_event_task_sched_out+0x36/0x50
Jun 15 18:20:39 kernel: [<ffffffff8105faba>] __cond_resched+0x2a/0x40
Jun 15 18:20:39 kernel: [<ffffffff814dbbb0>] _cond_resched+0x30/0x40
Jun 15 18:20:39 kernel: [<ffffffffa0362cee>] ops_run_io+0x2e/0x350 [raid456]
Jun 15 18:20:39 kernel: [<ffffffffa03659d1>] handle_stripe+0x501/0x2310 [raid456]
Jun 15 18:20:39 kernel: [<ffffffff8104f843>] ? __wake_up+0x53/0x70
Jun 15 18:20:39 kernel: [<ffffffffa0367c7f>] raid5d+0x49f/0x690 [raid456]
Jun 15 18:20:39 kernel: [<ffffffff813de266>] md_thread+0x116/0x150
Jun 15 18:20:39 kernel: [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40
Jun 15 18:20:39 kernel: [<ffffffff813de150>] ? md_thread+0x0/0x150
Jun 15 18:20:39 kernel: [<ffffffff8108ddf6>] kthread+0x96/0xa0
Jun 15 18:20:39 kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Jun 15 18:20:39 kernel: [<ffffffff8108dd60>] ? kthread+0x0/0xa0
Jun 15 18:20:39 kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
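
(The trace above was captured with

    echo t > /proc/sysrq-trigger

and shows the md0_raid5 thread still runnable inside
raid5d -> handle_stripe -> ops_run_io.)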
I tried increasing stripe_cache_size to 512 and then to 16384, but the
problem is still seen.
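
For reference, stripe_cache_size was changed via the sysfs interface:

    echo 16384 > /sys/block/md0/md/stripe_cache_size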
What could be the reason for this problem?
Any pointers would be greatly appreciated.
Thanks,
Shivani