stuck tasks

* stuck tasks
@ 2010-04-12 10:40 Jeremy Sanders
  2010-04-12 11:06 ` MRK
  0 siblings, 1 reply; 6+ messages in thread
From: Jeremy Sanders @ 2010-04-12 10:40 UTC (permalink / raw)
  To: linux-raid

Hi - I'm not getting any joy with Fedora's bugzilla. Has anyone seen 
problems like this with Fedora 12? Our systems have recently been getting 
stuck while rsyncing data onto an MD device:

https://bugzilla.redhat.com/show_bug.cgi?id=578549

INFO: task kthreadd:2 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kthreadd      D 0000000000000002     0     2      0 0x00000000
 ffff88007dbfd4c0 0000000000000046 0000000000000000 0000000a00000000
 ffff880000000001 ffff880079f9b800 ffff88007dbfdfd8 ffff88007dbfdfd8
 ffff88007dbf1b38 000000000000f980 0000000000015740 ffff88007dbf1b38
Call Trace:
 [<ffffffff8107c30d>] ? ktime_get_ts+0x85/0x8e
 [<ffffffff810d604d>] ? sync_page+0x0/0x4a
 [<ffffffff810d604d>] ? sync_page+0x0/0x4a
 [<ffffffff814546f5>] io_schedule+0x43/0x5d
 [<ffffffff810d6093>] sync_page+0x46/0x4a
 [<ffffffff81454c48>] __wait_on_bit+0x48/0x7b
 ...
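
When these warnings need to be collected from several machines, a small parser can pull out the task name, PID and timeout. This is only a sketch: the name `find_hung_tasks` is made up for illustration, and the pattern is based solely on the message format quoted above.

```python
import re

# Pattern for the hung-task warning exactly as it appears in the report above.
# The greedy ".+" lets task names that contain ':' (e.g. "flush-9:0") parse too.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (command, pid, seconds) for each warning in log_text."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]
```

The input would typically be the text of `dmesg` or /var/log/messages.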

Several processes end up stuck in a D state:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         2  0.0  0.0      0     0 ?        D    Mar31   0:07 [kthreadd]
root        14  0.0  0.0      0     0 ?        D    Mar31   9:38 [async/mgr]
root        17  0.0  0.0      0     0 ?        D    Mar31   0:00 [bdi-default]
root        34  0.0  0.0      0     0 ?        D    Mar31  10:06 [kswapd0]
root      5509  0.0  0.3  50732  7900 ?        D    Apr09   0:03 rsync -raHSx --stats --whole-file --numeric-ids --link-dest=/xback2_back1/YY/20100407-000501 --exclude=/lost+found --exclude=.mozilla/*/*/Cache/* XX:/XX_data1/data/YY/ /xback2_back1/YY/20100409-000502/
root     17457  0.0  0.2  61920  5756 ?        D    Apr11   0:00 python /data/soft3/backup/diskbackup/diskbackup.py /data/soft3/backup/diskbackup/main.cfg
root     18402  0.0  0.0      0     0 ?        D    Apr09   0:11 [flush-9:0]
root     20259  0.0  0.1   4284  3424 ?        DN   Apr11   0:00 /usr/sbin/prelink -av -mR -q
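
A listing like this can be produced by filtering `ps aux` output for a STAT column that starts with "D" (uninterruptible sleep). The following is a minimal sketch, assuming the standard 11-column `ps aux` layout with one process per line (no wrapping); the function name is hypothetical.

```python
def uninterruptible_tasks(ps_output):
    """Return (pid, command) pairs for processes whose STAT begins with 'D'."""
    stuck = []
    for line in ps_output.splitlines():
        # COMMAND is the 11th column and may itself contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or malformed line
        if fields[7].startswith("D"):
            stuck.append((int(fields[1]), fields[10]))
    return stuck
```

Feeding it the text from something like `subprocess.run(["ps", "auxww"], capture_output=True, text=True).stdout` avoids the line wrapping that a narrow terminal would introduce.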

It only seems to affect our MD systems. The kernel is 
2.6.32.10-90.fc12.x86_64. The systems have 3ware 96xx controllers. This 
kernel does have the issue when there are lots of aio processes.

The two affected systems have different file systems: xfs and ext3.

Jeremy




* Re: stuck tasks
  2010-04-12 10:40 stuck tasks Jeremy Sanders
@ 2010-04-12 11:06 ` MRK
  2010-04-12 11:14   ` Jeremy Sanders
  0 siblings, 1 reply; 6+ messages in thread
From: MRK @ 2010-04-12 11:06 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: linux-raid

On 04/12/2010 12:40 PM, Jeremy Sanders wrote:
> Hi - I'm not getting any joy with Fedora's bugzilla. Has anyone seen
> problems like this with Fedora 12? Our systems have recently been getting
> stuck while rsyncing data onto an MD device:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=578549
>
>    

You need to lower the sync speed by echoing a value into
/sys/block/md{n}/md/sync_speed_max.
The value should be about 1/3 lower than the max speed you see (cat
/proc/mdstat) now, while it is not yet limited.
Set up a script to set it at boot.

If it was not happening for you on older kernels, that might be a good
sign: it might mean that the resync is faster now...

What is the sync speed you see (cat /proc/mdstat)? How many drives do 
you have and what type of raid is that?
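
The "about 1/3 lower" rule can be sketched as follows. This is an illustration, not a tested tuning tool: the function names are invented here, and it assumes that sync_speed_max takes a value in the same K/sec unit that /proc/mdstat reports.

```python
import re


def parse_resync_speed(mdstat_text):
    """Return the first 'speed=NNNK/sec' figure from /proc/mdstat text, or None."""
    match = re.search(r"speed=(\d+)K/sec", mdstat_text)
    return int(match.group(1)) if match else None


def suggested_sync_speed_max(observed_kps):
    """Two thirds of the observed speed, i.e. 'about 1/3 lower'."""
    return observed_kps * 2 // 3


def apply_sync_speed_max(md_name, limit_kps):
    """Write the limit to sysfs (needs root). md_name is e.g. 'md0'."""
    with open(f"/sys/block/{md_name}/md/sync_speed_max", "w") as f:
        f.write(str(limit_kps))
```

The setting does not survive a reboot, which is why a boot-time script is suggested above.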


* Re: stuck tasks
  2010-04-12 11:06 ` MRK
@ 2010-04-12 11:14   ` Jeremy Sanders
  2010-04-12 12:49     ` MRK
  0 siblings, 1 reply; 6+ messages in thread
From: Jeremy Sanders @ 2010-04-12 11:14 UTC (permalink / raw)
  To: linux-raid

MRK wrote:

> You need to lower the sync speed by echoing a value into
> /sys/block/md{n}/md/sync_speed_max.
> The value should be about 1/3 lower than the max speed you see (cat
> /proc/mdstat) now, while it is not yet limited.
> Set up a script to set it at boot.

Thanks - I'll try it. It didn't happen in a raid sync, just during an rsync 
run. Would this have any effect on normal operation?
 
> If it was not happening for you on older kernels, that might be a good
> sign: it might mean that the resync is faster now...
> 
> What is the sync speed you see (cat /proc/mdstat)? How many drives do
> you have and what type of raid is that?

We're only getting 30MB/s. I thought it used to be quite a lot faster. It 
seems to slow down as the sync progresses:

[root@xback2 ~]#  cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sdb1[0] sdj1[10](S) sdk1[9] sdl1[8] sdi1[7] sdh1[6] 
sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1]
      8788959360 blocks level 5, 32k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      [>....................]  resync =  1.4% (14078544/976551040) 
finish=501.4min speed=31990K/sec
      
     
unused devices: <none>
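
As a sanity check on the numbers above, the finish estimate can be recomputed from the progress counters and the current speed. This is a sketch that assumes the counters are in 1K blocks and the speed in K/sec, as the mdstat output suggests; the function name is hypothetical.

```python
import re


def resync_eta_minutes(mdstat_text):
    """Recompute the resync ETA in minutes from '(done/total) ... speed=NK/sec'."""
    m = re.search(
        r"\((\d+)/(\d+)\)\s+finish=[\d.]+min\s+speed=(\d+)K/sec", mdstat_text
    )
    if m is None:
        return None  # no resync in progress, or an unexpected format
    done, total, speed = (int(g) for g in m.groups())
    # Remaining blocks divided by blocks-per-second gives seconds.
    return (total - done) / speed / 60
```

For the output above this gives roughly 501.4 minutes, matching the kernel's own finish= figure, so the estimate simply assumes the current speed holds.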

The drives are connected to a single 3ware Inc 9650SE SATA-II RAID PCIe card 
on this particular system. They're SATA 1TB drives.

Jeremy



* Re: stuck tasks
  2010-04-12 11:14   ` Jeremy Sanders
@ 2010-04-12 12:49     ` MRK
  2010-04-12 13:02       ` Jeremy Sanders
  0 siblings, 1 reply; 6+ messages in thread
From: MRK @ 2010-04-12 12:49 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: linux-raid

On 04/12/2010 01:14 PM, Jeremy Sanders wrote:
> MRK wrote:
>
>    
>> You need to lower the sync speed by echoing a value into
>> /sys/block/md{n}/md/sync_speed_max.
>> The value should be about 1/3 lower than the max speed you see (cat
>> /proc/mdstat) now, while it is not yet limited.
>> Set up a script to set it at boot.
>>      
> Thanks - I'll try it. It didn't happen in a raid sync, just during an rsync
> run. Would this have any effect on normal operation?
>    

!?!!

My mistake, though I might have gotten the right answer by chance:
I had read "resyncing" but you wrote "rsyncing".
But from what you write below, you were in fact also resyncing:


>> If it was not happening for you on older kernels, that might be a good
>> sign: it might mean that the resync is faster now...
>>
>> What is the sync speed you see (cat /proc/mdstat)? How many drives do
>> you have and what type of raid is that?
>>      
> We're only getting 30MB/s. I thought it used to be quite a lot faster. It
> seems to slow down as the sync progresses:
>
> [root@xback2 ~]#  cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 sdb1[0] sdj1[10](S) sdk1[9] sdl1[8] sdi1[7] sdh1[6]
> sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1]
>        8788959360 blocks level 5, 32k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
>        [>....................]  resync =  1.4% (14078544/976551040)
> finish=501.4min speed=31990K/sec
>    


Resync speed is indeed quite low if you confirm there is no other disk 
activity.
If rsync is also running, however, you need to stop it to get a
proper resync speed measurement (to compute the value to be entered into
sync_speed_max as per my previous email).
Do you have disk write caches activated? You can check with tw_cli
(3ware's CLI).
What is /sys/block/md{n}/md/stripe_cache_size set to? Pump it up to 32768.
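
Before raising stripe_cache_size it is worth estimating the memory it will pin. The sketch below assumes the commonly cited md formula (page size x number of member disks x stripe_cache_size) and a 4 KiB page; both the function name and the choice of 10 disks (the active members shown in the quoted mdstat output) are illustrative assumptions.

```python
def stripe_cache_memory_bytes(stripe_cache_size, nr_disks, page_size=4096):
    """Approximate RAM used by the RAID5/6 stripe cache, in bytes."""
    return stripe_cache_size * nr_disks * page_size


# 32768 entries on a 10-disk array with 4 KiB pages:
needed = stripe_cache_memory_bytes(32768, 10)
print(f"{needed / 2**30:.2f} GiB")
```

On a machine with little RAM a smaller value may be the safer starting point.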

> unused devices:<none>
>
> The drives are connected to a single 3ware Inc 9650SE SATA-II RAID PCIe card
> on this particular system. They're SATA 1TB drives.
>
> Jeremy
>    




* Re: stuck tasks
  2010-04-12 12:49     ` MRK
@ 2010-04-12 13:02       ` Jeremy Sanders
  2010-04-12 13:38         ` MRK
  0 siblings, 1 reply; 6+ messages in thread
From: Jeremy Sanders @ 2010-04-12 13:02 UTC (permalink / raw)
  To: linux-raid

MRK wrote:

> 
> My mistake, though I might have gotten the right answer by chance:
> I had read "resyncing" but you wrote "rsyncing".
> But from what you write below, you were in fact also resyncing:

Not at the time of the crash! Sorry for the confusion. The resync started on
reboot because the drive didn't unmount properly.

> Resync speed is indeed quite low if you confirm there is no other disk
> activity.

It's not doing anything else on that drive. I think it would be faster but 
the option  CONFIG_MULTICORE_RAID456 was switched on in this kernel, so 
there are lots of async processes fighting each other (190 of them on this 
system).

> If rsync is also running, however, you need to stop it to get a
> proper resync speed measurement (to compute the value to be entered into
> sync_speed_max as per my previous email).
> Do you have disk write caches activated? You can check with tw_cli
> (3ware's CLI). What is /sys/block/md{n}/md/stripe_cache_size set to?
> Pump it up to 32768.

The disk write caches are on. The RAID device is fast in bonnie tests
(>100 MB/s).

I tried the stripe_cache_size option. The speed stays around 10 MB/s on this
system. It looks like the sync speed has slowed down a lot since the sync
started.

Jeremy


* Re: stuck tasks
  2010-04-12 13:02       ` Jeremy Sanders
@ 2010-04-12 13:38         ` MRK
  0 siblings, 0 replies; 6+ messages in thread
From: MRK @ 2010-04-12 13:38 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: linux-raid

On 04/12/2010 03:02 PM, Jeremy Sanders wrote:
> It's not doing anything else on that drive. I think it would be faster but
> the option  CONFIG_MULTICORE_RAID456 was switched on in this kernel, so
> there are lots of async processes fighting each other (190 of them on this
> system).
>    

Oh it's this one then. I read it has already been discussed here... you 
really need to remove the multicore implementation: it's experimental as 
of now and *much* slower than single core. I don't know why your
distro maintainer has activated it. What distro is that? Someone should
probably tell the maintainers to disable it by default. I think all your 
problems will go away if you can recompile with multicore off.
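
To check whether a given kernel was built with the option, the build config can be inspected. This is a sketch: the function name is invented, and the /boot/config-<release> location mentioned in the comment is a distro convention (used by Fedora, among others) rather than something guaranteed on every system.

```python
def kernel_option_enabled(config_text, option):
    """True if config_text sets OPTION=y (built in) or OPTION=m (module)."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # "# CONFIG_FOO is not set" means the option is disabled
        name, _, value = line.partition("=")
        if name == option:
            return value in ("y", "m")
    return False


# Typical use (the path is an assumption, see above):
#   import platform
#   text = open(f"/boot/config-{platform.release()}").read()
#   kernel_option_enabled(text, "CONFIG_MULTICORE_RAID456")
```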

