* RAID extremely slow
@ 2012-07-25 22:52 Kevin Ross
2012-07-26 1:00 ` Phil Turmel
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Ross @ 2012-07-25 22:52 UTC (permalink / raw)
To: linux-kernel
Hello,
I'm having a problem. After a while, my software RAID rebuild becomes
extremely slow, and the filesystem on the RAID is essentially blocked.
I don't know what is causing this. I guess it could be a bad drive, but
how can I find out?
I used atop to show the transfer speeds to each drive. Here's a
screenshot:
http://img402.imageshack.us/img402/6484/screenshotfrom201207251.png
"smartctl -a" for all the drives looks good to me, no pending failures,
or errors logged. dmesg doesn't report anything wrong with any of the
drives. It does, however, report lots of hung tasks, which are trying
to access the RAID volume. For example:
[51000.672064] INFO: task mythbackend:10677 blocked for more than 120
seconds.
[51000.672098] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[51000.672143] mythbackend D 0000000e 0 10677 1 0x00000000
[51000.672146] f38bea00 00000086 c1095415 0000000e 00000002 00000000
00000000 c147aac0
[51000.672152] f38bebac c147aac0 eb2cff04 003d2f4b 00000000 c109cacb
01872f02 eb2cfe50
[51000.672157] c100f28b c13df480 01872f02 eb2cfe68 c10532b1 0069a8d0
f79d6ac0 00000000
[51000.672162] Call Trace:
[51000.672169] [<c1095415>] ? find_get_pages_tag+0x2f/0xa2
[51000.672173] [<c109cacb>] ? pagevec_lookup_tag+0x18/0x1e
[51000.672176] [<c100f28b>] ? read_tsc+0xa/0x28
[51000.672179] [<c10532b1>] ? timekeeping_get_ns+0x11/0x55
[51000.672182] [<c10536a4>] ? ktime_get_ts+0x7a/0x82
[51000.672186] [<c12bea8b>] ? io_schedule+0x4a/0x5f
[51000.672188] [<c1095659>] ? sleep_on_page+0x5/0x8
[51000.672191] [<c12bedeb>] ? __wait_on_bit+0x2f/0x54
[51000.672193] [<c1095654>] ? lock_page+0x1d/0x1d
[51000.672196] [<c1095754>] ? wait_on_page_bit+0x57/0x5e
[51000.672199] [<c104d171>] ? autoremove_wake_function+0x29/0x29
[51000.672201] [<c1095823>] ? filemap_fdatawait_range+0x71/0x11e
[51000.672205] [<c109630f>] ? filemap_write_and_wait_range+0x3e/0x4c
[51000.672232] [<f86bfb39>] ? xfs_file_fsync+0x68/0x214 [xfs]
[51000.672246] [<f86bfad1>] ? xfs_file_splice_write+0x144/0x144 [xfs]
[51000.672249] [<c10e7e3b>] ? vfs_fsync_range+0x27/0x2d
[51000.672252] [<c10e7e52>] ? vfs_fsync+0x11/0x15
[51000.672254] [<c10e80b8>] ? sys_fdatasync+0x20/0x2e
[51000.672258] [<c12c409f>] ? sysenter_do_call+0x12/0x28
[51000.672261] [<c12b0000>] ? quirk_usb_early_handoff+0x4a9/0x522
Here is some other possibly relevant info:
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
sdf1[3] sdg1[8] sdj1[1]
6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2
[9/9] [UUUUUUUUU]
[==========>..........] resync = 51.3% (501954432/976758784)
finish=28755.6min speed=275K/sec
unused devices: <none>
# cat /proc/sys/dev/raid/speed_limit_min
10000
# cat /proc/sys/dev/raid/speed_limit_max
200000
Thanks in advance!
-- Kevin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RAID extremely slow
2012-07-25 22:52 RAID extremely slow Kevin Ross
@ 2012-07-26 1:00 ` Phil Turmel
2012-07-26 1:55 ` Kevin Ross
2012-08-17 21:55 ` Jan Engelhardt
0 siblings, 2 replies; 19+ messages in thread
From: Phil Turmel @ 2012-07-26 1:00 UTC (permalink / raw)
To: Kevin Ross; +Cc: linux-kernel, linux-raid
[Added linux-raid to the CC]
Hi Kevin,
Notes interleaved:
On 07/25/2012 06:52 PM, Kevin Ross wrote:
> Hello,
>
> I'm having a problem. After a while, my software RAID rebuild becomes
> extremely slow, and the filesystem on the RAID is essentially blocked.
> I don't know what is causing this. I guess it could be a bad drive, but
> how can I find out?
Probably not. That pretty much always shows up in dmesg.
> I used atop to show the transfer speeds to each drive. Here's a
> screenshot:
> http://img402.imageshack.us/img402/6484/screenshotfrom201207251.png
Piles of small reads scattered across multiple drives, and a
concentration of queued writes to /dev/sda. What's on /dev/sda?
It's not a member of the raid, so it must be some other system task
involved.
[ The output of "lsdrv" [1] might be useful here, along with
"mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]
> "smartctl -a" for all the drives looks good to me, no pending failures,
> or errors logged. dmesg doesn't report anything wrong with any of the
> drives. It does, however, report lots of hung tasks, which are trying
> to access the RAID volume. For example:
>
> [51000.672064] INFO: task mythbackend:10677 blocked for more than 120
> seconds.
> [51000.672098] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [51000.672143] mythbackend D 0000000e 0 10677 1 0x00000000
> [51000.672146] f38bea00 00000086 c1095415 0000000e 00000002 00000000
> 00000000 c147aac0
> [51000.672152] f38bebac c147aac0 eb2cff04 003d2f4b 00000000 c109cacb
> 01872f02 eb2cfe50
> [51000.672157] c100f28b c13df480 01872f02 eb2cfe68 c10532b1 0069a8d0
> f79d6ac0 00000000
> [51000.672162] Call Trace:
> [51000.672169] [<c1095415>] ? find_get_pages_tag+0x2f/0xa2
> [51000.672173] [<c109cacb>] ? pagevec_lookup_tag+0x18/0x1e
> [51000.672176] [<c100f28b>] ? read_tsc+0xa/0x28
> [51000.672179] [<c10532b1>] ? timekeeping_get_ns+0x11/0x55
> [51000.672182] [<c10536a4>] ? ktime_get_ts+0x7a/0x82
> [51000.672186] [<c12bea8b>] ? io_schedule+0x4a/0x5f
> [51000.672188] [<c1095659>] ? sleep_on_page+0x5/0x8
> [51000.672191] [<c12bedeb>] ? __wait_on_bit+0x2f/0x54
> [51000.672193] [<c1095654>] ? lock_page+0x1d/0x1d
> [51000.672196] [<c1095754>] ? wait_on_page_bit+0x57/0x5e
> [51000.672199] [<c104d171>] ? autoremove_wake_function+0x29/0x29
> [51000.672201] [<c1095823>] ? filemap_fdatawait_range+0x71/0x11e
> [51000.672205] [<c109630f>] ? filemap_write_and_wait_range+0x3e/0x4c
> [51000.672232] [<f86bfb39>] ? xfs_file_fsync+0x68/0x214 [xfs]
> [51000.672246] [<f86bfad1>] ? xfs_file_splice_write+0x144/0x144 [xfs]
> [51000.672249] [<c10e7e3b>] ? vfs_fsync_range+0x27/0x2d
> [51000.672252] [<c10e7e52>] ? vfs_fsync+0x11/0x15
> [51000.672254] [<c10e80b8>] ? sys_fdatasync+0x20/0x2e
MythTV is trying to flush recorded video to disk, I presume. Sync is
known to cause stalls, and a great deal of work is ongoing to improve
this. How old is this kernel?
> [51000.672258] [<c12c409f>] ? sysenter_do_call+0x12/0x28
> [51000.672261] [<c12b0000>] ? quirk_usb_early_handoff+0x4a9/0x522
>
> Here is some other possibly relevant info:
>
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
> sdf1[3] sdg1[8] sdj1[1]
> 6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9]
> [UUUUUUUUU]
> [==========>..........] resync = 51.3% (501954432/976758784)
> finish=28755.6min speed=275K/sec
Is this resync a weekly check, or did something else trigger it?
> unused devices: <none>
>
> # cat /proc/sys/dev/raid/speed_limit_min
> 10000
MD is unable to reach its minimum rebuild rate while other system
activity is ongoing. You might want to lower this number to see if that
gets you out of the stalls.
Or temporarily shut down mythtv.
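Phil's suggestion above amounts to a pair of shell commands; the values
here are only illustrative, and writing them requires root:

```shell
# Lower the resync floor so normal I/O is not starved during a rebuild
# (example value; the default on this system is 10000 KB/s).
echo 1000 > /proc/sys/dev/raid/speed_limit_min

# Restore the default once the stall clears.
echo 10000 > /proc/sys/dev/raid/speed_limit_min
```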
> # cat /proc/sys/dev/raid/speed_limit_max
> 200000
>
> Thanks in advance!
> -- Kevin
HTH,
Phil
[1] http://github.com/pturmel/lsdrv
* Re: RAID extremely slow
2012-07-26 1:00 ` Phil Turmel
@ 2012-07-26 1:55 ` Kevin Ross
2012-07-26 2:09 ` CoolCold
` (2 more replies)
2012-08-17 21:55 ` Jan Engelhardt
1 sibling, 3 replies; 19+ messages in thread
From: Kevin Ross @ 2012-07-26 1:55 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-kernel, linux-raid
Thank you very much for taking the time to look into this.
On 07/25/2012 06:00 PM, Phil Turmel wrote:
> Piles of small reads scattered across multiple drives, and a
> concentration of queued writes to /dev/sda. What's on /dev/sda?
> It's not a member of the raid, so it must be some other system task
> involved.
/dev/sda1 is the root filesystem. The writes were most likely by MySQL,
but I would have to run iotop to be sure.
> [ The output of "lsdrv" [1] might be useful here, along with
> "mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]
Here you go: http://pastebin.ca/2174740
> MythTV is trying to flush recorded video to disk, I presume. Sync is
> known to cause stalls--a great deal of work is on-going to improve
> this. How old is this kernel?
After rebooting, MythTV is currently recording two shows, and the resync
is running at full speed.
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
sdf1[3] sdg1[8] sdj1[1]
6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2
[9/9] [UUUUUUUUU]
[=>...................] resync = 9.3% (91363840/976758784)
finish=1434.3min speed=10287K/sec
unused devices: <none>
atop shows the avio of all the drives to be less than 1ms, where before
they were much higher. It will run for a couple days under load just
fine, and then it will come to a halt.
It's a 3.2.21 kernel. I'm running Debian Testing, and the exact Debian
package version is:
ii linux-image-3.2.0-3-686-pae 3.2.21-3 Linux 3.2 for modern PCs
>
>> [51000.672258] [<c12c409f>] ? sysenter_do_call+0x12/0x28
>> [51000.672261] [<c12b0000>] ? quirk_usb_early_handoff+0x4a9/0x522
>>
>> Here is some other possibly relevant info:
>>
>> # cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
>> sdf1[3] sdg1[8] sdj1[1]
>> 6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9]
>> [UUUUUUUUU]
>> [==========>..........] resync = 51.3% (501954432/976758784)
>> finish=28755.6min speed=275K/sec
> Is this resync a weekly check, or did something else trigger it?
This is not a scheduled check; I believe it was triggered by an unclean
shutdown, which will start a resync. I don't think it used to do this,
but I could be remembering wrong.
>
>> unused devices:<none>
>>
>> # cat /proc/sys/dev/raid/speed_limit_min
>> 10000
> MD is unable to reach its minimum rebuild rate while other system
> activity is ongoing. You might want to lower this number to see if that
> gets you out of the stalls.
>
> Or temporarily shut down mythtv.
I will try lowering those numbers next time this happens, which will
probably be within the next day or two. That's about how often this
happens.
>> # cat /proc/sys/dev/raid/speed_limit_max
>> 200000
>>
>> Thanks in advance!
>> -- Kevin
> HTH,
>
> Phil
>
> [1] http://github.com/pturmel/lsdrv
>
Thanks!
-- Kevin
* Re: RAID extremely slow
2012-07-26 1:55 ` Kevin Ross
@ 2012-07-26 2:09 ` CoolCold
2012-07-26 2:18 ` Kevin Ross
2012-07-26 5:00 ` Kevin Ross
2012-07-27 2:15 ` David Dillow
2 siblings, 1 reply; 19+ messages in thread
From: CoolCold @ 2012-07-26 2:09 UTC (permalink / raw)
To: Kevin Ross; +Cc: Phil Turmel, linux-kernel, linux-raid
On Thu, Jul 26, 2012 at 5:55 AM, Kevin Ross <kevin@familyross.net> wrote:
>
> Thank you very much for taking the time to look into this.
>
>
> On 07/25/2012 06:00 PM, Phil Turmel wrote:
>>
>> Piles of small reads scattered across multiple drives, and a
>> concentration of queued writes to /dev/sda. What's on /dev/sda?
>> It's not a member of the raid, so it must be some other system task
>> involved.
>
>
> /dev/sda1 is the root filesystem. The writes were most likely by MySQL,
> but I would have to run iotop to be sure.
>
>
>> [ The output of "lsdrv" [1] might be useful here, along with
>> "mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]
>
>
> Here you go: http://pastebin.ca/2174740
>
>
>> MythTV is trying to flush recorded video to disk, I presume. Sync is
>> known to cause stalls--a great deal of work is on-going to improve
>> this. How old is this kernel?
>
>
> After rebooting, MythTV is currently recording two shows, and the resync
> is running at full speed.
>
>
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
> sdf1[3] sdg1[8] sdj1[1]
> 6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9]
> [UUUUUUUUU]
> [=>...................] resync = 9.3% (91363840/976758784)
> finish=1434.3min speed=10287K/sec
>
> unused devices: <none>
>
> atop shows the avio of all the drives to be less than 1ms, where before
> they were much higher. It will run for a couple days under load just fine,
> and then it will come to a halt.
>
> It's a 3.2.21 kernel. I'm running Debian Testing, and the exact Debian
> package version is:
>
> ii linux-image-3.2.0-3-686-pae 3.2.21-3
> Linux 3.2 for modern PCs
>
>
>>
>>> [51000.672258] [<c12c409f>] ? sysenter_do_call+0x12/0x28
>>> [51000.672261] [<c12b0000>] ? quirk_usb_early_handoff+0x4a9/0x522
>>>
>>> Here is some other possibly relevant info:
>>>
>>> # cat /proc/mdstat
>>> Personalities : [raid6] [raid5] [raid4]
>>> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
>>> sdf1[3] sdg1[8] sdj1[1]
>>> 6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2
>>> [9/9]
>>> [UUUUUUUUU]
>>> [==========>..........] resync = 51.3% (501954432/976758784)
>>> finish=28755.6min speed=275K/sec
>>
>> Is this resync a weekly check, or did something else trigger it?
>
>
> This is not a scheduled check. It was triggered by, I believe, an unclean
> shutdown. An unclean shutdown will trigger a resync. I don't think it used
> to do this, but I could be remembering wrong.
>
>
>>
>>> unused devices:<none>
>>>
>>> # cat /proc/sys/dev/raid/speed_limit_min
>>> 10000
>>
>> MD is unable to reach its minimum rebuild rate while other system
>> activity is ongoing. You might want to lower this number to see if that
>> gets you out of the stalls.
>>
>> Or temporarily shut down mythtv.
>
>
> I will try lowering those numbers next time this happens, which will
> probably be within the next day or two. That's about how often this
> happens.
You might be interested in a write-intent bitmap then; it should help a lot.
(resending in plain text)
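For reference, a write-intent bitmap can be added to an existing array
with mdadm; this is a sketch assuming the array is /dev/md0 as shown in
the mdstat output above, and it needs root:

```shell
# Add an internal write-intent bitmap so that, after an unclean shutdown,
# only regions marked dirty need to be resynced instead of the whole array.
mdadm --grow --bitmap=internal /dev/md0

# /proc/mdstat should now show a "bitmap:" line for md0.
cat /proc/mdstat
```

There is a small write-performance cost to maintaining the bitmap, but it
turns a post-crash full resync into a short one.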
>
>
>>> # cat /proc/sys/dev/raid/speed_limit_max
>>> 200000
>>>
>>> Thanks in advance!
>>> -- Kevin
>>
>> HTH,
>>
>> Phil
>>
>> [1] http://github.com/pturmel/lsdrv
>>
>
> Thanks!
> -- Kevin
>
>
--
Best regards,
[COOLCOLD-RIPN]
* Re: RAID extremely slow
2012-07-26 2:09 ` CoolCold
@ 2012-07-26 2:18 ` Kevin Ross
0 siblings, 0 replies; 19+ messages in thread
From: Kevin Ross @ 2012-07-26 2:18 UTC (permalink / raw)
To: CoolCold; +Cc: Phil Turmel, linux-kernel, linux-raid
On 07/25/2012 07:09 PM, CoolCold wrote:
> You might be interested in a write-intent bitmap then; it should help
> a lot. (resending in plain text)
Thanks, I'll look into that!
-- Kevin
* Re: RAID extremely slow
2012-07-26 1:55 ` Kevin Ross
2012-07-26 2:09 ` CoolCold
@ 2012-07-26 5:00 ` Kevin Ross
2012-07-26 22:36 ` Kevin Ross
2012-07-27 19:08 ` Bill Davidsen
2012-07-27 2:15 ` David Dillow
2 siblings, 2 replies; 19+ messages in thread
From: Kevin Ross @ 2012-07-26 5:00 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-kernel, linux-raid
>>
>>> unused devices:<none>
>>>
>>> # cat /proc/sys/dev/raid/speed_limit_min
>>> 10000
>> MD is unable to reach its minimum rebuild rate while other system
>> activity is ongoing. You might want to lower this number to see if that
>> gets you out of the stalls.
>>
>> Or temporarily shut down mythtv.
>
> I will try lowering those numbers next time this happens, which will
> probably be within the next day or two. That's about how often this
> happens.
Unfortunately, it has happened again, with speeds at near zero.
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
sdf1[3] sdg1[8] sdj1[1]
6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2
[9/9] [UUUUUUUUU]
[=>...................] resync = 8.3% (81251712/976758784)
finish=1057826.4min speed=14K/sec
unused devices: <none>
atop doesn't show ANY activity on the raid device or the individual drives.
http://img687.imageshack.us/img687/2913/screenshotfrom201207252.png
Also, I tried writing to a test file with the following command, and it
hangs. I let it go for about 30 minutes, with no change.
# dd if=/dev/zero of=test bs=1M count=1
dmesg only reports hung tasks. It doesn't report any other problems.
Here's my dmesg output:
http://pastebin.ca/2174778
I'm going to try rebooting into single user mode, and see if the rebuild
succeeds without stalling.
-- Kevin
* Re: RAID extremely slow
2012-07-26 5:00 ` Kevin Ross
@ 2012-07-26 22:36 ` Kevin Ross
2012-07-27 19:08 ` Bill Davidsen
1 sibling, 0 replies; 19+ messages in thread
From: Kevin Ross @ 2012-07-26 22:36 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-kernel, linux-raid
On 07/25/2012 10:00 PM, Kevin Ross wrote:
>
>>>
>>>> unused devices:<none>
>>>>
>>>> # cat /proc/sys/dev/raid/speed_limit_min
>>>> 10000
>>> MD is unable to reach its minimum rebuild rate while other system
>>> activity is ongoing. You might want to lower this number to see if
>>> that
>>> gets you out of the stalls.
>>>
>>> Or temporarily shut down mythtv.
>>
>> I will try lowering those numbers next time this happens, which will
>> probably be within the next day or two. That's about how often this
>> happens.
>
> Unfortunately, it has happened again, with speeds at near zero.
>
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
> sdf1[3] sdg1[8] sdj1[1]
> 6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [9/9] [UUUUUUUUU]
> [=>...................] resync = 8.3% (81251712/976758784)
> finish=1057826.4min speed=14K/sec
>
> unused devices: <none>
>
> atop doesn't show ANY activity on the raid device or the individual
> drives.
> http://img687.imageshack.us/img687/2913/screenshotfrom201207252.png
>
> Also, I tried writing to a test file with the following command, and
> it hangs. I let it go for about 30 minutes, with no change.
>
> # dd if=/dev/zero of=test bs=1M count=1
>
> dmesg only reports hung tasks. It doesn't report any other problems.
> Here's my dmesg output:
> http://pastebin.ca/2174778
>
> I'm going to try rebooting into single user mode, and see if the
> rebuild succeeds without stalling.
>
> -- Kevin
It rebuilt fine in single user mode, with speeds usually around
50MB/sec. But after exiting single user mode, and allowing MythTV and
other programs to start, within 30 minutes I had the problem again.
Basically a hung filesystem. I couldn't even run "cat /proc/mdstat";
that just hung. Lots of hung task warnings in dmesg.
Because Phil suggested that fsync calls might cause stalls, I commented
out the fsync in MythTV. I'll run with that for awhile, and see how
things work out. So far it isn't adversely affecting MythTV.
Thanks!
-- Kevin
* Re: RAID extremely slow
2012-07-26 1:55 ` Kevin Ross
2012-07-26 2:09 ` CoolCold
2012-07-26 5:00 ` Kevin Ross
@ 2012-07-27 2:15 ` David Dillow
2012-07-27 2:17 ` David Dillow
2 siblings, 1 reply; 19+ messages in thread
From: David Dillow @ 2012-07-27 2:15 UTC (permalink / raw)
To: Kevin Ross; +Cc: Phil Turmel, linux-kernel, linux-raid
On Wed, 2012-07-25 at 18:55 -0700, Kevin Ross wrote:
> On 07/25/2012 06:00 PM, Phil Turmel wrote:
> > Piles of small reads scattered across multiple drives, and a
> > concentration of queued writes to /dev/sda. What's on /dev/sda?
> > It's not a member of the raid, so it must be some other system task
> > involved.
> After rebooting, MythTV is currently recording two shows, and the resync
> is running at full speed.
>
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
> sdf1[3] sdg1[8] sdj1[1]
> 6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [9/9] [UUUUUUUUU]
> [=>...................] resync = 9.3% (91363840/976758784)
> finish=1434.3min speed=10287K/sec
>
> unused devices: <none>
>
> atop shows the avio of all the drives to be less than 1ms, where before
> they were much higher. It will run for a couple days under load just
> fine, and then it will come to a halt.
>
> It's a 3.2.21 kernel. I'm running Debian Testing, and the exact Debian
> package version is:
I suspect you are being hit by the same bug I was: delayed stripes never
got processed. If you get into the state where the rebuild isn't
progressing, and you find that increasing the size of the stripe cache
allows the rebuild to proceed (but the filesystem stays wedged), then
that cinches it.
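The diagnostic David describes can be sketched as follows; the device
name md0 and the value 8192 are examples, and this requires root:

```shell
# Check the current stripe cache size (commonly 256 entries by default).
cat /sys/block/md0/md/stripe_cache_size

# Enlarge it; if the stalled rebuild starts moving again in /proc/mdstat
# while the filesystem stays wedged, that points at the delayed-stripe
# bug rather than a bad drive.
echo 8192 > /sys/block/md0/md/stripe_cache_size
watch -n1 cat /proc/mdstat
```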
If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
As far as I can see, the latest 3.2 stable does not contain the delayed
stripe fix.
After applying those fixes to my kernel, my MythTV setup over a 5 disk
RAID5 has been pretty solid, where before I was getting lockups every
few days. It still seems to be getting slower over time, but I've not
looked into it yet as it is not as catastrophic as the wedging.
HTH,
Dave
* Re: RAID extremely slow
2012-07-27 2:15 ` David Dillow
@ 2012-07-27 2:17 ` David Dillow
2012-07-27 2:17 ` Kevin Ross
0 siblings, 1 reply; 19+ messages in thread
From: David Dillow @ 2012-07-27 2:17 UTC (permalink / raw)
To: Kevin Ross; +Cc: Phil Turmel, linux-kernel, linux-raid
On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
> As far as I can see, the latest 3.2 stable does not contain the delayed
> stripe fix.
And I was looking at the wrong version; 3.2.24 does indeed have the fix.
* Re: RAID extremely slow
2012-07-27 2:17 ` David Dillow
@ 2012-07-27 2:17 ` Kevin Ross
2012-07-27 2:27 ` David Dillow
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Ross @ 2012-07-27 2:17 UTC (permalink / raw)
To: David Dillow; +Cc: Phil Turmel, linux-kernel, linux-raid
On 07/26/2012 07:17 PM, David Dillow wrote:
> On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
>> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
>> As far as I can see, the latest 3.2 stable does not contain the delayed
>> stripe fix.
> And I was looking at the wrong version; 3.2.24 does indeed have the fix.
>
I'm running 3.2.21, does that contain the fix?
* Re: RAID extremely slow
2012-07-27 2:17 ` Kevin Ross
@ 2012-07-27 2:27 ` David Dillow
2012-07-27 2:53 ` Kevin Ross
0 siblings, 1 reply; 19+ messages in thread
From: David Dillow @ 2012-07-27 2:27 UTC (permalink / raw)
To: Kevin Ross; +Cc: Phil Turmel, linux-kernel, linux-raid
On Thu, 2012-07-26 at 19:17 -0700, Kevin Ross wrote:
> On 07/26/2012 07:17 PM, David Dillow wrote:
> > On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
> >> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
> >> As far as I can see, the latest 3.2 stable does not contain the delayed
> >> stripe fix.
> > And I was looking at the wrong version; 3.2.24 does indeed have the fix.
> >
>
> I'm running 3.2.21, does that contain the fix?
No, that was the one I looked at. It is commit
c0159c780e8d42309d04e83271986274d3880826.
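One way to check which stable releases carry that fix, assuming a local
clone of the linux-stable tree and that the quoted id is the commit as it
appears in that tree, is:

```shell
# List release tags whose history contains the delayed-stripe fix.
git tag --contains c0159c780e8d42309d04e83271986274d3880826
```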
* Re: RAID extremely slow
2012-07-27 2:27 ` David Dillow
@ 2012-07-27 2:53 ` Kevin Ross
2012-07-27 3:17 ` Kevin Ross
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Ross @ 2012-07-27 2:53 UTC (permalink / raw)
To: David Dillow; +Cc: Phil Turmel, linux-kernel, linux-raid
On 07/26/2012 07:27 PM, David Dillow wrote:
> On Thu, 2012-07-26 at 19:17 -0700, Kevin Ross wrote:
>> On 07/26/2012 07:17 PM, David Dillow wrote:
>>> On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
>>>> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
>>>> As far as I can see, the latest 3.2 stable does not contain the delayed
>>>> stripe fix.
>>> And I was looking at the wrong version; 3.2.24 does indeed have the fix.
>>>
>> I'm running 3.2.21, does that contain the fix?
> No, that was the one I looked at. It is commit
> c0159c780e8d42309d04e83271986274d3880826.
>
Okay, I grabbed 3.4.4 from Debian experimental, and I'm running with
that now. Hopefully this fixes the problem.
Thanks for your help!
-- Kevin
* Re: RAID extremely slow
2012-07-27 2:53 ` Kevin Ross
@ 2012-07-27 3:17 ` Kevin Ross
0 siblings, 0 replies; 19+ messages in thread
From: Kevin Ross @ 2012-07-27 3:17 UTC (permalink / raw)
To: David Dillow; +Cc: Phil Turmel, linux-kernel, linux-raid
On 07/26/2012 07:53 PM, Kevin Ross wrote:
> On 07/26/2012 07:27 PM, David Dillow wrote:
>> On Thu, 2012-07-26 at 19:17 -0700, Kevin Ross wrote:
>>> On 07/26/2012 07:17 PM, David Dillow wrote:
>>>> On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
>>>>> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right
>>>>> now).
>>>>> As far as I can see, the latest 3.2 stable does not contain the
>>>>> delayed
>>>>> stripe fix.
>>>> And I was looking at the wrong version; 3.2.24 does indeed have the
>>>> fix.
>>>>
>>> I'm running 3.2.21, does that contain the fix?
>> No, that was the one I looked at. It is commit
>> c0159c780e8d42309d04e83271986274d3880826.
>>
>
> Okay, I grabbed 3.4.4 from Debian experimental, and I'm running with
> that now. Hopefully this fixes the problem.
>
> Thanks for your help!
> -- Kevin
Just noticed I need 3.4.5 or later. Doh! I'll grab a vanilla kernel
from kernel.org and build it.
Thanks!
-- Kevin
* Re: RAID extremely slow
2012-07-26 5:00 ` Kevin Ross
2012-07-26 22:36 ` Kevin Ross
@ 2012-07-27 19:08 ` Bill Davidsen
2012-07-27 21:45 ` Kevin Ross
1 sibling, 1 reply; 19+ messages in thread
From: Bill Davidsen @ 2012-07-27 19:08 UTC (permalink / raw)
To: Kevin Ross, Linux RAID, Linux Kernel mailing List
Kevin Ross wrote:
>
>>>
>>>> unused devices:<none>
>>>>
>>>> # cat /proc/sys/dev/raid/speed_limit_min
>>>> 10000
>>> MD is unable to reach its minimum rebuild rate while other system
>>> activity is ongoing. You might want to lower this number to see if that
>>> gets you out of the stalls.
>>>
>>> Or temporarily shut down mythtv.
>>
>> I will try lowering those numbers next time this happens, which will probably
>> be within the next day or two. That's about how often this happens.
>
> Unfortunately, it has happened again, with speeds at near zero.
>
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3]
> sdg1[8] sdj1[1]
> 6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9]
> [UUUUUUUUU]
> [=>...................] resync = 8.3% (81251712/976758784)
> finish=1057826.4min speed=14K/sec
>
> unused devices: <none>
>
> atop doesn't show ANY activity on the raid device or the individual drives.
> http://img687.imageshack.us/img687/2913/screenshotfrom201207252.png
>
> Also, I tried writing to a test file with the following command, and it hangs.
> I let it go for about 30 minutes, with no change.
>
> # dd if=/dev/zero of=test bs=1M count=1
>
> dmesg only reports hung tasks. It doesn't report any other problems. Here's my
> dmesg output:
> http://pastebin.ca/2174778
>
> I'm going to try rebooting into single user mode, and see if the rebuild
> succeeds without stalling.
>
Have you set the io scheduler to deadline on all members of the array? That's
kind of "job one" on older kernels.
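Bill's suggestion can be applied at runtime; a sketch, using the member
device letters b through j from the mdstat output earlier in the thread:

```shell
# Switch each array member to the deadline elevator (root required;
# the change lasts only until reboot).
for d in b c d e f g h i j; do
    echo deadline > /sys/block/sd${d}/queue/scheduler
done
```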
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
* Re: RAID extremely slow
2012-07-27 19:08 ` Bill Davidsen
@ 2012-07-27 21:45 ` Kevin Ross
2012-07-28 4:45 ` Grant Coady
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Ross @ 2012-07-27 21:45 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Linux RAID, Linux Kernel mailing List
On 07/27/2012 12:08 PM, Bill Davidsen wrote:
> Have you set the io scheduler to deadline on all members of the array?
> That's kind of "job one" on older kernels.
>
I have not, thanks for the tip, I'll look into that now.
Thanks!
-- Kevin
* Re: RAID extremely slow
2012-07-27 21:45 ` Kevin Ross
@ 2012-07-28 4:45 ` Grant Coady
2012-07-28 8:34 ` Kevin Ross
0 siblings, 1 reply; 19+ messages in thread
From: Grant Coady @ 2012-07-28 4:45 UTC (permalink / raw)
To: Kevin Ross; +Cc: Bill Davidsen, Linux RAID, Linux Kernel mailing List
On Fri, 27 Jul 2012 14:45:18 -0700, you wrote:
>On 07/27/2012 12:08 PM, Bill Davidsen wrote:
>> Have you set the io scheduler to deadline on all members of the array?
>> That's kind of "job one" on older kernels.
>>
>
>I have not, thanks for the tip, I'll look into that now.
Plus I disable the on-drive queuing (NCQ) during startup, right now
I don't have benchmarks to show the difference. This on a six by 1TB
drive RAID6 array I built over a year ago on Slackware64-13.37:
# cat /etc/rc.d/rc.local
...
# turn off NCQ on the RAID drives by adjusting queue depth to 1
n=1
echo "rc.local: Disable RAID drives' NCQ"
for d in a b c d e f
do
echo " set NCQ depth to $n on sd${d}"
echo $n > /sys/block/sd${d}/device/queue_depth
done
...
Maybe you could try that? See if it makes a difference. My drives
are Seagate.
Grant.
* Re: RAID extremely slow
2012-07-28 4:45 ` Grant Coady
@ 2012-07-28 8:34 ` Kevin Ross
2012-08-01 3:16 ` Bill Davidsen
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Ross @ 2012-07-28 8:34 UTC (permalink / raw)
To: Grant Coady; +Cc: Bill Davidsen, Linux RAID, Linux Kernel mailing List
On 07/27/2012 09:45 PM, Grant Coady wrote:
> On Fri, 27 Jul 2012 14:45:18 -0700, you wrote:
>
>> On 07/27/2012 12:08 PM, Bill Davidsen wrote:
>>> Have you set the io scheduler to deadline on all members of the array?
>>> That's kind of "job one" on older kernels.
>>>
>> I have not, thanks for the tip, I'll look into that now.
> Plus I disable the on-drive queuing (NCQ) during startup, right now
> I don't have benchmarks to show the difference. This on a six by 1TB
> drive RAID6 array I built over a year ago on Slackware64-13.37:
>
> # cat /etc/rc.d/rc.local
> ...
> # turn off NCQ on the RAID drives by adjusting queue depth to 1
> n=1
> echo "rc.local: Disable RAID drives' NCQ"
> for d in a b c d e f
> do
> echo " set NCQ depth to $n on sd${d}"
> echo $n > /sys/block/sd${d}/device/queue_depth
> done
> ...
>
> Maybe you could try that? See if it makes a difference. My drives
> are Seagate.
>
> Grant.
>
Does disabling NCQ improve performance?
The suggestion to use kernel 3.4.6 has been working quite well so far,
hopefully that fixes the problem. I'll know for sure in a few more days...
Thanks!
-- Kevin
* Re: RAID extremely slow
2012-07-28 8:34 ` Kevin Ross
@ 2012-08-01 3:16 ` Bill Davidsen
0 siblings, 0 replies; 19+ messages in thread
From: Bill Davidsen @ 2012-08-01 3:16 UTC (permalink / raw)
To: Kevin Ross; +Cc: Grant Coady, Linux RAID, Linux Kernel mailing List
Kevin Ross wrote:
> On 07/27/2012 09:45 PM, Grant Coady wrote:
>> On Fri, 27 Jul 2012 14:45:18 -0700, you wrote:
>>
>>> On 07/27/2012 12:08 PM, Bill Davidsen wrote:
>>>> Have you set the io scheduler to deadline on all members of the array?
>>>> That's kind of "job one" on older kernels.
>>>>
>>> I have not, thanks for the tip, I'll look into that now.
>> Plus I disable the on-drive queuing (NCQ) during startup, right now
>> I don't have benchmarks to show the difference. This on a six by 1TB
>> drive RAID6 array I built over a year ago on Slackware64-13.37:
>>
>> # cat /etc/rc.d/rc.local
>> ...
>> # turn off NCQ on the RAID drives by adjusting queue depth to 1
>> n=1
>> echo "rc.local: Disable RAID drives' NCQ"
>> for d in a b c d e f
>> do
>> echo " set NCQ depth to $n on sd${d}"
>> echo $n > /sys/block/sd${d}/device/queue_depth
>> done
>> ...
>>
>> Maybe you could try that? See if it makes a difference. My drives
>> are Seagate.
>>
>> Grant.
>>
>
> Does disabling NCQ improve performance?
Does for me!
>
> The suggestion to use kernel 3.4.6 has been working quite well so far,
> hopefully that fixes the problem. I'll know for sure in a few more days...
>
> Thanks!
> -- Kevin
>
--
Bill Davidsen <davidsen@tmr.com>
We are not out of the woods yet, but we know the direction and have
taken the first step. The steps are many, but finite in number, and if
we persevere we will reach our destination. -me, 2010
* Re: RAID extremely slow
2012-07-26 1:00 ` Phil Turmel
2012-07-26 1:55 ` Kevin Ross
@ 2012-08-17 21:55 ` Jan Engelhardt
1 sibling, 0 replies; 19+ messages in thread
From: Jan Engelhardt @ 2012-08-17 21:55 UTC (permalink / raw)
To: Phil Turmel; +Cc: Kevin Ross, linux-kernel, linux-raid
On Thursday 2012-07-26 03:00, Phil Turmel wrote:
>> I used atop to show the transfer speeds to each drive. Here's a
>> screenshot:
>> http://img402.imageshack.us/img402/6484/screenshotfrom201207251.png
>
>[ The output of "lsdrv" [1] might be useful here, along with
>"mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]
>[1] http://github.com/pturmel/lsdrv
lsdrv? Shows a bunch, but for this, using standard tools
like lsscsi & lsblk seems sufficient :)
Thread overview: 19+ messages
2012-07-25 22:52 RAID extremely slow Kevin Ross
2012-07-26 1:00 ` Phil Turmel
2012-07-26 1:55 ` Kevin Ross
2012-07-26 2:09 ` CoolCold
2012-07-26 2:18 ` Kevin Ross
2012-07-26 5:00 ` Kevin Ross
2012-07-26 22:36 ` Kevin Ross
2012-07-27 19:08 ` Bill Davidsen
2012-07-27 21:45 ` Kevin Ross
2012-07-28 4:45 ` Grant Coady
2012-07-28 8:34 ` Kevin Ross
2012-08-01 3:16 ` Bill Davidsen
2012-07-27 2:15 ` David Dillow
2012-07-27 2:17 ` David Dillow
2012-07-27 2:17 ` Kevin Ross
2012-07-27 2:27 ` David Dillow
2012-07-27 2:53 ` Kevin Ross
2012-07-27 3:17 ` Kevin Ross
2012-08-17 21:55 ` Jan Engelhardt