* Raid5 device hangs in active state
@ 2012-01-08 22:03 Larkin Lowrey
2012-01-09 0:26 ` NeilBrown
2012-03-11 23:29 ` Asdo
0 siblings, 2 replies; 9+ messages in thread
From: Larkin Lowrey @ 2012-01-08 22:03 UTC (permalink / raw)
To: linux-raid
I've been chasing a fault since "upgrading" from Fedora 15 to Fedora 16.
When under heavy IO load my root volume will hang and block any
additional writes. Reading appears to be ok but I can't tell if I'm
reading the actual md device or cache memory. This problem occurs most
often when doing a weekly check of all md devices in the early AM hours
and particularly when the check fires before my backup job completes.
The checks do appear to complete normally, and without error.
There are no error or warning messages in any log or on the console.
There is no indication of any problem except that any IO to the root
volume hangs, and Ctrl-C does not get me back to a prompt.
Interestingly, to me, when in this state, 'iostat -dx 1' shows the root
LVM volume at 100% utilization, yet neither the md physical volume nor
any of the constituent devices show any activity and all read 0%
utilization. IO wait reads 50% (6 core machine) so it appears that
something is waiting for an event that will never occur.
The md device showed a value of 26 for stripe_cache_active during the
most recent occurrence and that number did not change over time.
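(For reference, those counters live under sysfs; this is roughly how I watch them while the hang is in progress, assuming the standard md sysfs layout:)

# cat /sys/block/md0/md/stripe_cache_active
# cat /sys/block/md0/md/stripe_cache_size
# watch -n1 cat /sys/block/md0/md/stripe_cache_active

stripe_cache_active should drop back to 0 once the array goes idle, which is why a value that sits fixed while the disks show no activity looks wrong to me.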
Further, mdadm -D /dev/md0 showed the following:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Dec 21 16:28:52 2010
     Raid Level : raid5
     Array Size : 2180641792 (2079.62 GiB 2232.98 GB)
  Used Dev Size : 311520256 (297.09 GiB 319.00 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Sun Jan  8 03:31:42 2012
          State : active
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : ****.****.com:1  (local to host ****.****.com)
           UUID : 4e95a658:13a5a387:dd62bdbe:ea655271
         Events : 736102

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
       9       8       34        2      active sync   /dev/sdc2
       3       8       50        3      active sync   /dev/sdd2
       4       8       66        4      active sync   /dev/sde2
       5       8       82        5      active sync   /dev/sdf2
       6       8       98        6      active sync   /dev/sdg2
       8       8      114        7      active sync   /dev/sdh2
I noted that the state is 'active' and not 'idle'. The output of 'mdadm -D
/dev/md0' did not change between executions.
It appears that either something is deadlocked or some event was missed
and something is now waiting forever for it to happen. I was able to read
from /dev/md0 and all the constituent devices via dd, and 'smartctl -a'
did not indicate any problems. I was also able to read /proc/mdstat and
no problems were indicated.
I have no idea how to debug this further. What else should I look at
when I encounter this problem? What kind of logging can I enable which
might show additional, and hopefully useful, information when the
problem occurs?
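The only generic knobs I know of are the blocked-task sysrq dump and the hung-task detector; a sketch, assuming the detector is compiled into the Fedora kernel:

# echo w > /proc/sysrq-trigger
# echo 120 > /proc/sys/kernel/hung_task_timeout_secs

The first dumps the stacks of tasks stuck in uninterruptible sleep to dmesg; the second makes the kernel warn whenever a task stays blocked for longer than that many seconds.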
I'm running Fedora 16 with the latest packages updated via yum. The
mdadm is v3.2.2 - 17th June 2011 and the kernel is 3.1.6-1.fc16.x86_64.
I have 6 devices connected to the AMD SB850 AHCI SATA controller and 2
devices to the built-in JMicron JMB362/363 controller to make /dev/md0.
I also have 6 devices connected to 3 sil3132 SATA controllers to make
/dev/md1. I have never encountered this problem with md1, but its I/O
load is nowhere near as heavy.
Suggestions?
--Larkin
* Re: Raid5 device hangs in active state
2012-01-08 22:03 Raid5 device hangs in active state Larkin Lowrey
@ 2012-01-09 0:26 ` NeilBrown
2012-02-28 18:23 ` Larkin Lowrey
` (2 more replies)
2012-03-11 23:29 ` Asdo
1 sibling, 3 replies; 9+ messages in thread
From: NeilBrown @ 2012-01-09 0:26 UTC (permalink / raw)
To: Larkin Lowrey; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 424 bytes --]
On Sun, 08 Jan 2012 16:03:10 -0600 Larkin Lowrey <llowrey@nuclearwinter.com>
wrote:
> Suggestions?
# echo t > /proc/sysrq-trigger
and capture the messages that go to 'dmesg'. Post them.
Hopefully your message ring buffer is big enough to collect the entire
output. If it isn't you might need to boot with
log_buf_len=1M
or similar.
That should show what process is blocking on what.
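Concretely, something like this from a root shell once the hang is in progress (the file name is just an example):

# echo t > /proc/sysrq-trigger
# dmesg > /tmp/sysrq-t.txt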
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
* Re: Raid5 device hangs in active state
2012-01-09 0:26 ` NeilBrown
@ 2012-02-28 18:23 ` Larkin Lowrey
[not found] ` <4F4D1B33.3010308@nuclearwinter.com>
2012-03-11 22:39 ` Larkin Lowrey
2 siblings, 0 replies; 9+ messages in thread
From: Larkin Lowrey @ 2012-02-28 18:23 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 1042 bytes --]
I did another sysrq dump and have attached the output.
Again, 'iostat -dx 1' showed 100% utilization on the LVM which uses
/dev/md0 as a pv and /sys/block/md0/md/stripe_cache_active was 29 and
that value did not change. There were no error messages in
/var/log/messages or 'dmesg'.
My suspicions lie with md0 since the stripe_cache_active value remains
at a fixed non-zero value even though all disks are (or appear to be)
idle. Should I be looking elsewhere? This hardware did not exhibit this
problem before "upgrading" from Fedora 15 to Fedora 16.
Thank you,
--Larkin
On 1/8/2012 6:26 PM, NeilBrown wrote:
> On Sun, 08 Jan 2012 16:03:10 -0600 Larkin Lowrey
<llowrey@nuclearwinter.com>
> wrote:
>
>> Suggestions?
>
> # echo t > /proc/sysrq-trigger
>
> and capture the messages that go to 'dmesg'. Post them.
>
> Hopefully your message ring buffer is big enough to collect the entire
> output. If it isn't you might need to boot with
> log_buf_len=1M
> or similar.
>
> That should show what process is blocking on what.
>
> NeilBrown
[-- Attachment #2: deadlock.zip --]
[-- Type: application/octet-stream, Size: 12696 bytes --]
* Re: Raid5 device hangs in active state
[not found] ` <4F4D1B33.3010308@nuclearwinter.com>
@ 2012-02-28 19:52 ` NeilBrown
2012-02-28 21:33 ` Larkin Lowrey
0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2012-02-28 19:52 UTC (permalink / raw)
To: Larkin Lowrey; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 2207 bytes --]
On Tue, 28 Feb 2012 12:21:39 -0600 Larkin Lowrey <llowrey@nuclearwinter.com>
wrote:
> I did another sysrq dump and have attached the output.
Thanks. Unfortunately it contains nothing of value - too much has been
lost. It seems that 'Show State' contains a lot more noise than it used to.
You will need to boot with
log_buf_len=4M
or something like that.
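On Fedora 16 that usually means appending the option to the kernel command line in the grub2 config; a sketch, assuming the stock grub2 setup (paths may differ on an upgraded install):

GRUB_CMDLINE_LINUX="<existing options> log_buf_len=4M"     (in /etc/default/grub)
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot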
>
> Again, 'iostat -dx 1' showed 100% utilization on the LVM which uses
> /dev/md0 as a pv and /sys/block/md0/md/stripe_cache_active was 29 and
> that value did not change. There were no error messages in
> /var/log/messages or 'dmesg'.
The '29' could simply mean that md/raid5 has sent 29 requests down to lower
levels which have not yet completed.
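One way to check whether requests really are sitting in the lower layers, assuming your kernel exposes the per-device 'inflight' counters:

# for d in sda sdb sdc sdd sde sdf sdg sdh; do echo -n "$d: "; cat /sys/block/$d/inflight; done

Each line shows reads and writes currently outstanding for that disk; non-zero values that never drain would point at the controller or driver rather than at md.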
>
> My suspicions lie with md0 since the stripe_cache_active value remains
> at a fixed non-zero value even though all disks are (or appear to be)
> idle. Should I be looking elsewhere? This hardware did not exhibit this
> problem before "upgrading" from Fedora 15 to Fedora 16.
My guess is a problem with one of the drive controllers. Your monthly 'sync'
puts a much heavier load on them than normal IO does. It is consistently
sending a bunch of requests to all devices at exactly the same time. This
could trigger race conditions that normal IO does not.
But that is just a guess. Unfortunately it is very hard to track exactly
what is going wrong in this sort of case.
I'd suggest shuffling devices so they are on different controllers, or maybe
replace a controller. See if you can get the problem to move, and then see
which controller it stayed with.
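A quick way to see which controller each member currently hangs off, for reference:

# for d in sda sdb sdc sdd sde sdf sdg sdh; do echo "$d -> $(readlink -f /sys/block/$d/device)"; done

The PCI address embedded in each path identifies the HBA the disk sits behind; lspci maps the address to the actual controller.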
NeilBrown
>
> Thank you,
>
> --Larkin
>
> On 1/8/2012 6:26 PM, NeilBrown wrote:
> > On Sun, 08 Jan 2012 16:03:10 -0600 Larkin Lowrey
> <llowrey@nuclearwinter.com>
> > wrote:
> >
> >> Suggestions?
> >
> > # echo t > /proc/sysrq-trigger
> >
> > and capture the messages that go to 'dmesg'. Post them.
> >
> > Hopefully your message ring buffer is big enough to collect the entire
> > output. If it isn't you might need to boot with
> > log_buf_len=1M
> > or similar.
> >
> > That should show what process is blocking on what.
> >
> > NeilBrown
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
* Re: Raid5 device hangs in active state
2012-02-28 19:52 ` NeilBrown
@ 2012-02-28 21:33 ` Larkin Lowrey
2012-02-28 21:46 ` NeilBrown
0 siblings, 1 reply; 9+ messages in thread
From: Larkin Lowrey @ 2012-02-28 21:33 UTC (permalink / raw)
Cc: linux-raid
Thank you for taking a look.
I should be able to move the drives to a completely different controller
(and driver) so that will be a good test.
Could NCQ be an issue? IOW, do you think it would be worth disabling NCQ
and re-running this scenario?
--Larkin
On 2/28/2012 1:52 PM, NeilBrown wrote:
> On Tue, 28 Feb 2012 12:21:39 -0600 Larkin Lowrey
<llowrey@nuclearwinter.com>
> wrote:
>
>> I did another sysrq dump and have attached the output.
>
> Thanks. Unfortunately it contains nothing of value - too much has been
> lost. It seems that 'Show State' contains a lot more noise than it used to.
>
> You will need to boot with
> log_buf_len=4M
>
> or something like that.
>
>>
>> Again, 'iostat -dx 1' showed 100% utilization on the LVM which uses
>> /dev/md0 as a pv and /sys/block/md0/md/stripe_cache_active was 29 and
>> that value did not change. There were no error messages in
>> /var/log/messages or 'dmesg'.
>
> The '29' could simply mean that md/raid5 has sent 29 requests down to lower
> levels which have not yet completed.
>>
>> My suspicions lie with md0 since the stripe_cache_active value remains
>> at a fixed non-zero value even though all disks are (or appear to be)
>> idle. Should I be looking elsewhere? This hardware did not exhibit this
>> problem before "upgrading" from Fedora 15 to Fedora 16.
>
> My guess is a problem with one of the drive controllers. Your monthly
'sync'
> puts a much heavier load on them than normal IO does. It is consistently
> sending a bunch of requests to all devices at exactly the same time. This
> could trigger race conditions that normal IO does not.
>
> But that is just a guess. Unfortunately it is very hard to track exactly
> what is going wrong in this sort of case.
>
> I'd suggest shuffling devices so they are on different controllers, or
maybe
> replace a controller. See if you can get the problem to move, and then see
> which controller it stayed with.
>
> NeilBrown
>
>
>>
>> Thank you,
>>
>> --Larkin
>>
>> On 1/8/2012 6:26 PM, NeilBrown wrote:
>>> On Sun, 08 Jan 2012 16:03:10 -0600 Larkin Lowrey
>> <llowrey@nuclearwinter.com>
>>> wrote:
>>>
>>>> Suggestions?
>>>
>>> # echo t > /proc/sysrq-trigger
>>>
>>> and capture the messages that go to 'dmesg'. Post them.
>>>
>>> Hopefully your message ring buffer is big enough to collect the entire
>>> output. If it isn't you might need to boot with
>>> log_buf_len=1M
>>> or similar.
>>>
>>> That should show what process is blocking on what.
>>>
>>> NeilBrown
>>
>
* Re: Raid5 device hangs in active state
2012-02-28 21:33 ` Larkin Lowrey
@ 2012-02-28 21:46 ` NeilBrown
0 siblings, 0 replies; 9+ messages in thread
From: NeilBrown @ 2012-02-28 21:46 UTC (permalink / raw)
To: Larkin Lowrey; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 887 bytes --]
On Tue, 28 Feb 2012 15:33:56 -0600 Larkin Lowrey <llowrey@nuclearwinter.com>
wrote:
> Thank you for taking a look.
>
> I should be able to move the drives to a completely different controller
> (and driver) so that will be a good test.
>
> Could NCQ be an issue? IOW, do you think it would be worth disabling NCQ
> and re-running this scenario?
>
At this point I wouldn't want to exclude anything as a possible issue. So
anything that you can turn off or on fairly easily would be worth checking.
If you turn off NCQ and the problem goes away, that isn't proof that NCQ is
the problem. If it is a race condition, then removing NCQ could upset
timings just enough that you stop hitting the race window. However if you
get acceptable performance without NCQ then it could turn out to be an
effective work-around even if it isn't the real problem.
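If you do want to try it, NCQ can usually be switched off per device by dropping the queue depth to 1; a sketch:

# cat /sys/block/sda/device/queue_depth
# echo 1 > /sys/block/sda/device/queue_depth        (repeat for each member disk)

Writing the original value back re-enables it; no reboot is required.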
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
* Re: Raid5 device hangs in active state
2012-01-09 0:26 ` NeilBrown
2012-02-28 18:23 ` Larkin Lowrey
[not found] ` <4F4D1B33.3010308@nuclearwinter.com>
@ 2012-03-11 22:39 ` Larkin Lowrey
2 siblings, 0 replies; 9+ messages in thread
From: Larkin Lowrey @ 2012-03-11 22:39 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 5772 bytes --]
I either have more info or a totally different scenario. I initiated a
raid5->raid6 reshape on a different machine. At the same time I (perhaps
stupidly) ran resize2fs to shrink the ext4 fs on the array being
reshaped. The reshape is going slowly (as I would expect) but the resize
is nearly dead. It is only able to write a single 4k block to the array
about every 5-6 seconds.
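For anyone reproducing this, a raid5 to raid6 reshape of this shape is started along these lines (device name, disk count and backup file are illustrative only):

# mdadm --grow /dev/md2 --level=6 --raid-devices=12 --backup-file=/root/md2-grow.backup

and the shrink was a plain resize2fs run against the LV sitting on top of the array.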
If that's expected then sorry for the noise and please ignore the rest.
Otherwise...
When I tried reducing sync_speed_min to 1000, the resize2fs write
interval increased to once per 8-10 seconds. When I lowered sync_speed_max
to 1000, I saw no more writes at all until I set the min/max back to
500000, at which point iostat reported a w_await roughly equal to the
time the array had spent at the lower max.
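These are the per-array sysfs knobs I was poking, for anyone following along; the system-wide equivalents are /proc/sys/dev/raid/speed_limit_min and speed_limit_max:

# echo 1000   > /sys/block/md2/md/sync_speed_min
# echo 1000   > /sys/block/md2/md/sync_speed_max
# echo 500000 > /sys/block/md2/md/sync_speed_max
# cat /sys/block/md2/md/sync_speed          (current resync/reshape speed, for comparison)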
The '# echo t > /proc/sysrq-trigger ' output seems to indicate that
resize2fs is stuck doing an fsync. The full dump is attached.
Here's an iostat showing 100% utilization of the LVM volume and the 4k
block writes.
Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await   svctm  %util
md2        0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00   0.00
dm-2       0.00    0.00   0.00   0.00    0.00    0.00     0.00     1.00    0.00    0.00    0.00    0.00 100.00

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await   svctm  %util
md2        0.00    0.00   0.00   1.00    0.00    4.00     8.00     0.00    0.00    0.00    0.00    0.00   0.00
dm-2       0.00    0.00   0.00   1.00    0.00    4.00     8.00     1.00 6200.00    0.00 6200.00 1000.00 100.00

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await   svctm  %util
md2        0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00   0.00
dm-2       0.00    0.00   0.00   0.00    0.00    0.00     0.00     1.00    0.00    0.00    0.00    0.00 100.00

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await   svctm  %util
md2        0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00   0.00
dm-2       0.00    0.00   0.00   0.00    0.00    0.00     0.00     1.00    0.00    0.00    0.00    0.00 100.00

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await   svctm  %util
md2        0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00   0.00
dm-2       0.00    0.00   0.00   0.00    0.00    0.00     0.00     1.00    0.00    0.00    0.00    0.00 100.00

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await   svctm  %util
md2        0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00   0.00
dm-2       0.00    0.00   0.00   0.00    0.00    0.00     0.00     1.00    0.00    0.00    0.00    0.00 100.00

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await   svctm  %util
md2        0.00    0.00   0.00   1.00    0.00    4.00     8.00     0.00    0.00    0.00    0.00    0.00   0.00
dm-2       0.00    0.00   0.00   1.00    0.00    4.00     8.00     1.00 5450.00    0.00 5450.00 1000.00 100.00

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await   svctm  %util
md2        0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00   0.00
dm-2       0.00    0.00   0.00   0.00    0.00    0.00     0.00     1.00    0.00    0.00    0.00    0.00 100.00
/proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [raid0]
md10 : active raid5 md11[3] sdo1[1] sdn1[0]
      1250259200 blocks super 1.2 level 5, 128k chunk, algorithm 2 [3/3] [UUU]

md11 : active raid0 sdm1[1] sdad1[0]
      625138176 blocks super 1.2 128k chunks

md5 : active raid5 sdq1[6] sdp1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      1220981760 blocks super 1.2 level 5, 128k chunk, algorithm 2 [6/6] [UUUUUU]

md3 : active raid6 sdy2[4] md4[2] sdx3[1] sdt3[0]
      1875411968 blocks super 1.2 level 6, 128k chunk, algorithm 2 [4/3] [UUU_]
        resync=DELAYED

md2 : active raid6 sdy1[11] sdz2[0] sdt2[10] sdx2[9] sdu2[8] sds2[7] sdw2[6] sdv2[5] sdr2[4] sdac2[3] sdaa2[2] sdab2[1]
      6641912320 blocks super 0.91 level 6, 128k chunk, algorithm 18 [12/11] [UUUUUUUUUUU_]
      [===============>.....]  reshape = 77.4% (514551296/664191232) finish=1030.3min speed=2420K/sec

md1 : active raid5 sdj1[0] sdu1[14] sdx1[13] sdt1[12] sdv1[11] sdw1[10] sdh1[9] sdac1[8] sdaa1[7] sdr1[6] sdab1[5] sdz1[4] sdk1[3] sds1[2] sdl1[1]
      4375960064 blocks level 5, 64k chunk, algorithm 2 [15/15] [UUUUUUUUUUUUUUU]

md4 : active raid0 sdi1[1] sdg1[0]
      976769024 blocks super 1.2 128k chunks

md0 : active raid1 sdb1[1] sda1[0]
      41941944 blocks super 1.2 [2/2] [UU]

unused devices: <none>
Kernel: 3.2.9-1.fc16.x86_64
mdadm: v3.2.3
resize2fs: 1.41.14
--Larkin
On 1/8/2012 6:26 PM, NeilBrown wrote:
> On Sun, 08 Jan 2012 16:03:10 -0600 Larkin Lowrey
<llowrey@nuclearwinter.com>
> wrote:
>
>> Suggestions?
>
> # echo t > /proc/sysrq-trigger
>
> and capture the messages that go to 'dmesg'. Post them.
>
> Hopefully your message ring buffer is big enough to collect the entire
> output. If it isn't you might need to boot with
> log_buf_len=1M
> or similar.
>
> That should show what process is blocking on what.
>
> NeilBrown
[-- Attachment #2: resize2fs hang.zip --]
[-- Type: application/octet-stream, Size: 24641 bytes --]
* Re: Raid5 device hangs in active state
2012-01-08 22:03 Raid5 device hangs in active state Larkin Lowrey
2012-01-09 0:26 ` NeilBrown
@ 2012-03-11 23:29 ` Asdo
2012-03-12 0:18 ` Larkin Lowrey
1 sibling, 1 reply; 9+ messages in thread
From: Asdo @ 2012-03-11 23:29 UTC (permalink / raw)
To: Larkin Lowrey; +Cc: linux-raid
On 01/08/12 23:03, Larkin Lowrey wrote:
> ...
> Interestingly, to me, when in this state, 'iostat -dx 1' shows the root
> ...
Do you have any "avgqu-sz" stuck at a nonzero value in 'iostat -dx 1'?
BTW, this bug could be related to the recent post by Milan Broz, subject:
"md raid5 fsync deadlock".
Because in your case, too, fsyncs apparently never complete... correct?
* Re: Raid5 device hangs in active state
2012-03-11 23:29 ` Asdo
@ 2012-03-12 0:18 ` Larkin Lowrey
0 siblings, 0 replies; 9+ messages in thread
From: Larkin Lowrey @ 2012-03-12 0:18 UTC (permalink / raw)
To: Asdo; +Cc: linux-raid
Observation:
Setting vm/dirty_background_bytes to 1MB increased my reshape speed by
50% and reduced the resize2fs write interval from once per 5-6 seconds
to once per 3-4 seconds. Changing vm/dirty_bytes didn't have any
noticeable effect until it got below 50MB, at which point performance
dropped dramatically.
Prior to dialing the knobs, dirty_ratio was 20% and
dirty_background_ratio was 10%. The system has 8GB installed.
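The exact knobs in sysctl form, for the record (1MB = 1048576 bytes; setting a *_bytes variant overrides the corresponding *_ratio):

# sysctl -w vm.dirty_background_bytes=1048576
# sysctl vm.dirty_bytes vm.dirty_background_bytes vm.dirty_ratio vm.dirty_background_ratio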
On 3/11/2012 6:29 PM, Asdo wrote:
> On 01/08/12 23:03, Larkin Lowrey wrote:
>> ...
>> Interestingly, to me, when in this state, 'iostat -dx 1' shows the root
>> ...
>
> Do you have any "avgqu-sz" stuck at a nonzero value in 'iostat -dx 1'?
Yes, avgqu-sz is 1.00.
>
>
> BTW, this bug could be related to the recent post by Milan Broz,
> subject: "md raid5 fsync deadlock".
> Because in your case, too, fsyncs apparently never complete... correct?
I do get writes, now about one 4k block every 3-4 seconds. It is quite
possible that our two issues are related.
--Larkin