* stuck tasks
From: Jeremy Sanders @ 2010-04-12 10:40 UTC
To: linux-raid
Hi - I'm not getting any joy with Fedora's bugzilla. Has anyone seen
problems like this with Fedora 12? Our systems have recently been getting
stuck while rsyncing data onto an MD device:
https://bugzilla.redhat.com/show_bug.cgi?id=578549
INFO: task kthreadd:2 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kthreadd D 0000000000000002 0 2 0 0x00000000
ffff88007dbfd4c0 0000000000000046 0000000000000000 0000000a00000000
ffff880000000001 ffff880079f9b800 ffff88007dbfdfd8 ffff88007dbfdfd8
ffff88007dbf1b38 000000000000f980 0000000000015740 ffff88007dbf1b38
Call Trace:
[<ffffffff8107c30d>] ? ktime_get_ts+0x85/0x8e
[<ffffffff810d604d>] ? sync_page+0x0/0x4a
[<ffffffff810d604d>] ? sync_page+0x0/0x4a
[<ffffffff814546f5>] io_schedule+0x43/0x5d
[<ffffffff810d6093>] sync_page+0x46/0x4a
[<ffffffff81454c48>] __wait_on_bit+0x48/0x7b
...
Several processes end up stuck in a D state:
USER       PID %CPU %MEM   VSZ  RSS TTY STAT START TIME COMMAND
root         2  0.0  0.0     0    0 ?   D    Mar31 0:07 [kthreadd]
root        14  0.0  0.0     0    0 ?   D    Mar31 9:38 [async/mgr]
root        17  0.0  0.0     0    0 ?   D    Mar31 0:00 [bdi-default]
root        34  0.0  0.0     0    0 ?   D    Mar31 10:06 [kswapd0]
root      5509  0.0  0.3 50732 7900 ?   D    Apr09 0:03 rsync -raHSx --stats --whole-file --numeric-ids --link-dest=/xback2_back1/YY/20100407-000501 --exclude=/lost+found --exclude=.mozilla/*/*/Cache/* XX:/XX_data1/data/YY/ /xback2_back1/YY/20100409-000502/
root     17457  0.0  0.2 61920 5756 ?   D    Apr11 0:00 python /data/soft3/backup/diskbackup/diskbackup.py /data/soft3/backup/diskbackup/main.cfg
root     18402  0.0  0.0     0    0 ?   D    Apr09 0:11 [flush-9:0]
root     20259  0.0  0.1  4284 3424 ?   DN   Apr11 0:00 /usr/sbin/prelink -av -mR -q
It only seems to affect our MD systems. The kernel is
2.6.32.10-90.fc12.x86_64, and the systems have 3ware 96xx controllers.
This kernel does hit the issue when there are lots of async (aio) kernel
threads running.
The two affected systems use different file systems: xfs and ext3.
Jeremy
* Re: stuck tasks
From: MRK @ 2010-04-12 11:06 UTC
To: Jeremy Sanders; +Cc: linux-raid
On 04/12/2010 12:40 PM, Jeremy Sanders wrote:
> Hi - I'm not getting any joy with Fedora's bugzilla. Has anyone seen
> problems like this with Fedora 12? Our systems have recently been getting
> stuck while rsyncing data onto an MD device:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=578549
>
>
You need to lower the sync speed by echoing a value into
/sys/block/md{n}/md/sync_speed_max. The value should be about 1/3 lower
than the maximum speed you currently see (cat /proc/mdstat), while the
sync is still unthrottled.
Set up a script to set it at boot.
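A minimal sketch (the device name md0 and an unthrottled speed of ~90 MB/s
are assumptions here; the sysfs value is in KB/s):

  # cap the resync at roughly 2/3 of the observed unthrottled speed
  echo 60000 > /sys/block/md0/md/sync_speed_max

Putting that echo line in /etc/rc.local (or your distro's equivalent)
makes it persist across reboots.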
If this was not happening on older kernels, that might be a good sign:
it might mean that the resync is faster now...
What sync speed do you see (cat /proc/mdstat)? How many drives do you
have, and what type of RAID is it?
* Re: stuck tasks
From: Jeremy Sanders @ 2010-04-12 11:14 UTC
To: linux-raid
MRK wrote:
> You need to lower the sync speed by echoing a value into
> /sys/block/md{n}/md/sync_speed_max. The value should be about 1/3 lower
> than the maximum speed you currently see (cat /proc/mdstat), while the
> sync is still unthrottled.
> Set up a script to set it at boot.
Thanks - I'll try it. It didn't happen during a RAID resync, though, just
during an rsync run. Would this have any effect on normal operation?
> If this was not happening on older kernels, that might be a good sign:
> it might mean that the resync is faster now...
>
> What sync speed do you see (cat /proc/mdstat)? How many drives do you
> have, and what type of RAID is it?
We're only getting 30 MB/s. I thought it used to be quite a lot faster. It
seems to slow down as the sync progresses:
[root@xback2 ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[0] sdj1[10](S) sdk1[9] sdl1[8] sdi1[7] sdh1[6]
sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1]
8788959360 blocks level 5, 32k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      [>....................]  resync =  1.4% (14078544/976551040) finish=501.4min speed=31990K/sec
unused devices: <none>
The drives are connected to a single 3ware 9650SE SATA-II RAID PCIe card
on this particular system. They're 1TB SATA drives.
Jeremy
* Re: stuck tasks
From: MRK @ 2010-04-12 12:49 UTC
To: Jeremy Sanders; +Cc: linux-raid
On 04/12/2010 01:14 PM, Jeremy Sanders wrote:
> MRK wrote:
>
>
>> You need to lower the sync speed by echoing a value into
>> /sys/block/md{n}/md/sync_speed_max. The value should be about 1/3 lower
>> than the maximum speed you currently see (cat /proc/mdstat), while the
>> sync is still unthrottled.
>> Set up a script to set it at boot.
>>
> Thanks - I'll try it. It didn't happen during a RAID resync, though, just
> during an rsync run. Would this have any effect on normal operation?
>
!?!!
My mistake, but I may have gotten the right answer by chance:
I had read "resyncing" but you wrote "rsyncing".
From what you write below, though, you were in fact also resyncing:
>> If this was not happening on older kernels, that might be a good sign:
>> it might mean that the resync is faster now...
>>
>> What sync speed do you see (cat /proc/mdstat)? How many drives do you
>> have, and what type of RAID is it?
>>
> We're only getting 30 MB/s. I thought it used to be quite a lot faster. It
> seems to slow down as the sync progresses:
>
> [root@xback2 ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 sdb1[0] sdj1[10](S) sdk1[9] sdl1[8] sdi1[7] sdh1[6]
> sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1]
> 8788959360 blocks level 5, 32k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
>       [>....................]  resync =  1.4% (14078544/976551040) finish=501.4min speed=31990K/sec
>
That resync speed is indeed quite low, if you can confirm there is no
other disk activity.
If rsync is also running, you need to stop it to get a proper resync
speed measurement (to compute the value to enter into sync_speed_max,
as per my previous email).
Do you have the disk write caches enabled? You can check that with
tw_cli (3ware's CLI). What is /sys/block/md{n}/md/stripe_cache_size set
to? Try pumping it up to 32768.
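A rough sketch (md0 and controller c0 are assumptions, adjust to your
setup):

  # current stripe cache size, in pages (4 KB each) per member device
  cat /sys/block/md0/md/stripe_cache_size
  # raise it; memory used is roughly 32768 * 4 KB * number of devices
  echo 32768 > /sys/block/md0/md/stripe_cache_size
  # show the controller (and its unit cache settings) via 3ware's CLI
  tw_cli /c0 show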
> unused devices: <none>
>
> The drives are connected to a single 3ware 9650SE SATA-II RAID PCIe card
> on this particular system. They're 1TB SATA drives.
>
> Jeremy
>
* Re: stuck tasks
From: Jeremy Sanders @ 2010-04-12 13:02 UTC
To: linux-raid
MRK wrote:
>
> My mistake, but I may have gotten the right answer by chance:
> I had read "resyncing" but you wrote "rsyncing".
> From what you write below, though, you were in fact also resyncing:
Not at the time of the crash! Sorry for the confusion. The resync started
on reboot, as the array hadn't been unmounted cleanly.
> That resync speed is indeed quite low, if you can confirm there is no
> other disk activity.
It's not doing anything else on that drive. I think it would be faster,
but CONFIG_MULTICORE_RAID456 was switched on in this kernel, so there are
lots of async kernel threads fighting each other (190 of them on this
system).
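For reference, something along these lines counts them, since the threads
use the [async/...] naming visible in the ps listing in my first mail:

  ps ax | grep -c '\[async/'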
> If rsync is also running, you need to stop it to get a proper resync
> speed measurement (to compute the value to enter into sync_speed_max,
> as per my previous email).
> Do you have the disk write caches enabled? You can check that with
> tw_cli (3ware's CLI). What is /sys/block/md{n}/md/stripe_cache_size set
> to? Try pumping it up to 32768.
The disk write caches are on. The RAID device is fast if you run bonnie
tests against it (>100 MB/s).
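For reference, a typical bonnie++ invocation would be something like this
(the mount point is an assumption):

  bonnie++ -d /xback2_back1 -u root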
I tried the stripe_cache_size option. The speed stays around 10 MB/s on
this system. It looks like the sync speed has slowed down a lot since the
sync started.
Jeremy
* Re: stuck tasks
From: MRK @ 2010-04-12 13:38 UTC
To: Jeremy Sanders; +Cc: linux-raid
On 04/12/2010 03:02 PM, Jeremy Sanders wrote:
> It's not doing anything else on that drive. I think it would be faster,
> but CONFIG_MULTICORE_RAID456 was switched on in this kernel, so there are
> lots of async kernel threads fighting each other (190 of them on this
> system).
>
Oh, it's this one then. I read it has already been discussed here... you
really need to remove the multicore implementation: it's experimental as
of now and *much* slower than single core. I don't know why your distro
maintainer has activated it. What distro is that? Someone should probably
tell the maintainers to disable it by default. I think all your problems
will go away if you recompile with multicore off.
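A minimal sketch of flipping the option, run from the kernel source tree
(scripts/config ships with the kernel sources; the exact rebuild steps
depend on how your distro packages kernels):

  # confirm the option is on in the running kernel
  grep MULTICORE_RAID456 /boot/config-$(uname -r)
  # disable it in .config, then rebuild
  scripts/config --disable MULTICORE_RAID456
  make oldconfig && make && make modules_install install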