possible bus loading problem during resync

* possible bus loading problem during resync
@ 2010-03-09  6:31 Timothy D. Lenz
  2010-03-09 10:30 ` Kristleifur Daðason
  0 siblings, 1 reply; 6+ messages in thread
From: Timothy D. Lenz @ 2010-03-09  6:31 UTC (permalink / raw)
  To: linux-raid

I'm working on 2 systems that are mainly for running vdr. I've had these
running, more or less, with raid for a while. But a couple of nights ago,
as I was quitting for the night, I noticed that the drive light on one of
the computers was staying on. I had just made some changes to xine and
didn't know if something had crashed. I turned on the TV and found the
video was freezing for 10-20 secs every 10-20 secs. Logging in using
putty and winscp, I found it very sluggish to respond. Starting top, I
found it was doing the regular array check/resync. The process was using
about 64% cpu, and the cpu was staying at idle speed (1000 MHz). These
computers use Athlon64 X2 CPUs. A problem with the AM2 socket systems is
that when the cpu is throttled back, it also slows the bus. This has been
found to be a problem on boards with integrated graphics when using
nvidia's vdpau for hardware video decoding, because they use system ram.
The fix is to set the lower speed limit to 1800 MHz and/or change the
up_threshold to ~50%. However, I am using PCIe video cards, so up till
now I have not had a problem.

I stopped vdr, but putty and winscp were still sluggish. This tells me
that it is loading the bus so much that both the video card and the
network are affected. It would also affect any tuner cards, interfering
with any recording that may be going on at the time. I changed the
up_threshold from the default 95% to 50%, which should kick the speed up
when it's syncing. But I'm not sure that will be enough. Could there be
some other setting that is wrong and is raising the priority of the
process? It seems like it would be a problem for any system to have raid
maintenance bring it to its knees. The eta to finish was 75 minutes.
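
For reference, the two governor tweaks mentioned in this message can be sketched as below. This is a minimal sketch, assuming the ondemand governor and the usual cpufreq sysfs layout. The location of up_threshold differs between kernel versions (per CPU on some, global on others), so the sketch tries both. SYSFS_CPU and the two function names are hypothetical and were introduced only for illustration.

```shell
#!/bin/sh
# Sketch only: SYSFS_CPU and both function names are illustrative,
# they do not come from the thread.
SYSFS_CPU="${SYSFS_CPU:-/sys/devices/system/cpu}"

# Set the ondemand governor's up_threshold (a percentage) wherever
# this kernel exposes it: globally, per CPU, or both.
set_up_threshold() {
    pct="$1"
    for f in "$SYSFS_CPU"/cpufreq/ondemand/up_threshold \
             "$SYSFS_CPU"/cpu[0-9]*/cpufreq/ondemand/up_threshold; do
        if [ -w "$f" ]; then
            echo "$pct" > "$f"
        fi
    done
}

# Raise the lowest clock the governor may pick.
# scaling_min_freq is expressed in kHz.
set_min_freq_khz() {
    khz="$1"
    for f in "$SYSFS_CPU"/cpu[0-9]*/cpufreq/scaling_min_freq; do
        if [ -w "$f" ]; then
            echo "$khz" > "$f"
        fi
    done
}
```

Run as root, "set_up_threshold 50" would correspond to the 95% to 50% change described above, and "set_min_freq_khz 1800000" to the 1800 MHz lower limit. Neither setting survives a reboot.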

* Re: possible bus loading problem during resync
  2010-03-09  6:31 possible bus loading problem during resync Timothy D. Lenz
@ 2010-03-09 10:30 ` Kristleifur Daðason
  2010-03-09 11:00   ` Asdo
  0 siblings, 1 reply; 6+ messages in thread
From: Kristleifur Daðason @ 2010-03-09 10:30 UTC (permalink / raw)
  To: linux-raid

Sorry about the incredibly brief answer: Not to dismiss other issues,
but that behavior seems like exactly what I've seen when a disk has
been failing.

-- Kristleifur

* Re: possible bus loading problem during resync
  2010-03-09 10:30 ` Kristleifur Daðason
@ 2010-03-09 11:00   ` Asdo
  2010-03-11  5:53     ` Goswin von Brederlow
  0 siblings, 1 reply; 6+ messages in thread
From: Asdo @ 2010-03-09 11:00 UTC (permalink / raw)
  To: Kristleifur Daðason, Timothy D. Lenz; +Cc: linux-raid

Kristleifur Daðason wrote:
> On Tue, Mar 9, 2010 at 6:31 AM, Timothy D. Lenz <tlenz@vorgon.com> wrote:
>   
>> I'm working on 2 systems that are mainly for running vdr. I've had these
>> running somewhat for awhile with raid. But a couple nights ago as I was
>> quitting for the night, I noticed one of the computers drive light staying
>> on. I had just made some changes to xine and didn't know if something had
>> crashed. Turned on the TV and found the video was freezing for 10-20secs
>> every 10-20secs. Logging in using putty and winscp I found it very sluggish
>> to respond. Starting top I found it was doing the regular array check/resync...
>> --
>>     
>
>
> Sorry about the incredibly brief answer: Not to dismiss other issues,
> but that behavior seems like exactly what I've seen when a disk has
> been failing.
>   

If that is true, how does it happen? Is the driver hung? And anyway,
how can such things happen when there is more than one CPU core?

Try disabling NCQ by running echo 1 > /sys/block/sdX/device/queue_depth
for every drive. After that, at most one request can be outstanding per
drive: nothing new is issued until the drive has serviced the current
one.

With that done, I'd first expect the sluggishness to disappear, at
least over SSH when not touching the disks. Then you can look at
"iostat -x 1": the bad drive will probably show a service time (svctm)
or await much worse than the others.

Just guesses, correct me if I'm wrong
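
The NCQ suggestion above can be sketched as a small loop. This is a minimal sketch, assuming the disks show up as sd* and expose device/queue_depth in sysfs. SYSFS_BLOCK and the function name are hypothetical and were introduced only for illustration.

```shell
#!/bin/sh
# Sketch only: SYSFS_BLOCK and the function name are illustrative,
# they do not come from the thread.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# Limit every sd* disk to one outstanding request, which effectively
# disables NCQ. Prints "disk old -> 1" for each disk changed, so the
# previous depth can be written back afterwards.
disable_ncq() {
    for q in "$SYSFS_BLOCK"/sd*/device/queue_depth; do
        if [ -w "$q" ]; then
            disk="${q#"$SYSFS_BLOCK"/}"   # strip the sysfs prefix
            disk="${disk%%/*}"            # keep only the disk name
            old="$(cat "$q")"
            echo 1 > "$q"
            echo "$disk $old -> 1"
        fi
    done
}
```

After running "disable_ncq" as root, "iostat -x 1" can be left running during the check to compare the await column across the member disks. Newer sysstat releases may no longer print svctm, so await is the safer column to rely on.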
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: possible bus loading problem during resync
  2010-03-09 11:00   ` Asdo
@ 2010-03-11  5:53     ` Goswin von Brederlow
  2010-03-12 11:00       ` Asdo
  0 siblings, 1 reply; 6+ messages in thread
From: Goswin von Brederlow @ 2010-03-11  5:53 UTC (permalink / raw)
  To: Asdo; +Cc: Kristleifur Daðason, Timothy D. Lenz, linux-raid

Asdo <asdo@shiftmail.org> writes:

> Kristleifur Daðason wrote:
>> On Tue, Mar 9, 2010 at 6:31 AM, Timothy D. Lenz <tlenz@vorgon.com> wrote:
>>
>>> I'm working on 2 systems that are mainly for running vdr. I've had these
>>> running somewhat for awhile with raid. But a couple nights ago as I was
>>> quitting for the night, I noticed one of the computers drive light staying
>>> on. I had just made some changes to xine and didn't know if something had
>>> crashed. Turned on the TV and found the video was freezing for 10-20secs
>>> every 10-20secs. Logging in using putty and winscp I found it very sluggish
>>> to respond. Starting top I found it was doing the regular array check/resync...
>>> --
>>>
>>
>>
>> Sorry about the incredibly brief answer: Not to dismiss other issues,
>> but that behavior seems like exactly what I've seen when a disk has
>> been failing.
>>
>
> If that is true, how does that happen, the driver is hung? But anyway,
> how can such things happen when there is more than one CPU-core?

A drive produces an error, the whole controller hangs and resets all
ports, and all drives have to finish being reset before any IO can
continue. Happens easily enough.

> try disabling NCQ by echo 1 > /sys/block/sdX/device/queue_depth for
> all drives. After doing this, at most 1 request can be issued to one
> drive until the drive has serviced such request.
>
> After doing this, firstly I'd say the sluggishness should disappear,
> at least on SSH when not touching the disks. And then you can look
> with "iostat -x 1": probably the bad drive will have a service time
> (svctm) or await much worse than the others.
>
> Just guesses, correct me if I'm wrong

What I would start with is checking the resync/check speed of the raid and
the kernel messages. If it is running at high speed and there are no kernel
messages about IO errors then it is probably just a case of the IO
subsystem being busy. I got similar sluggish behaviour when I increased
the stripe cache to 16384 for a reshape.

If there are no hardware problems on the disks causing this then try
setting the max speed for the resync lower. That way the resync will
leave pauses where other IO and bus activity can happen. The raid should
slow down automatically if there is normal IO pending but in my
experience that doesn't always work.

MfG
        Goswin
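
The checks suggested in this message can be sketched as follows. This is a minimal sketch, assuming the array is md0 and that the md sysfs attributes sync_action, sync_speed and sync_speed_max are present. MD_SYSFS and the function names are hypothetical and were introduced only for illustration.

```shell
#!/bin/sh
# Sketch only: MD_SYSFS, md0 and the function names are illustrative,
# they do not come from the thread.
MD_SYSFS="${MD_SYSFS:-/sys/block/md0/md}"

# Show what the array is doing and how fast. The md documentation
# gives sync_speed in K/sec; it reads "none" when the array is idle.
show_sync_state() {
    echo "action: $(cat "$MD_SYSFS/sync_action")"
    echo "speed: $(cat "$MD_SYSFS/sync_speed")"
}

# Cap the resync/check rate for this one array (K/sec).
cap_sync_speed() {
    echo "$1" > "$MD_SYSFS/sync_speed_max"
}
```

For example, "show_sync_state" followed by a look at dmesg covers the first step, and "cap_sync_speed 20000" lowers the ceiling for one array. The figure 20000 is an arbitrary example, not a recommendation. The system-wide equivalent is writing to /proc/sys/dev/raid/speed_limit_max.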

* Re: possible bus loading problem during resync
  2010-03-11  5:53     ` Goswin von Brederlow
@ 2010-03-12 11:00       ` Asdo
  2010-03-12 11:43         ` Kristleifur Daðason
  0 siblings, 1 reply; 6+ messages in thread
From: Asdo @ 2010-03-12 11:00 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: Kristleifur Daðason, Timothy D. Lenz, linux-raid

>> If that is true, how does that happen, the driver is hung? But anyway,
>> how can such things happen when there is more than one CPU-core?
>>     
>
> A drive produces an error, the whole controller hangs and resets all
> ports, all drives have to finish being reset before any IO can continue.
> Happens easily enough.
>   
Ok, but this is a multi-core CPU and he said Putty and WinSCP were hung.
Ok for WinSCP... but Putty?

Timothy, is Putty still hung during an array check even with NCQ
disabled?

What is the resync speed? If it is very high, it could be CPU
starvation, but that would be strange with only 2 drives. If it is very
low, I am not sure.
Are the disks' write caches enabled?

I am not able to spot any problem in your iostat or dmesg output.
Note: yes, the iostat -x 1 output was supposed to be captured during a
resync. Trigger one manually for the test: you can start it with
echo check > /sys/block/mdX/md/sync_action and stop it with
echo idle > /sys/block/mdX/md/sync_action
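
The manual check described above can be wrapped in two small helpers. This is a minimal sketch, assuming the array is md0. MD_SYSFS and the function names are hypothetical and were introduced only for illustration.

```shell
#!/bin/sh
# Sketch only: MD_SYSFS, md0 and the function names are illustrative,
# they do not come from the thread.
MD_SYSFS="${MD_SYSFS:-/sys/block/md0/md}"

# Start a read-only consistency check of the array.
start_check() {
    echo check > "$MD_SYSFS/sync_action"
}

# Abort whatever sync action is currently running.
stop_check() {
    echo idle > "$MD_SYSFS/sync_action"
}
```

A test run could then be "start_check", followed by capturing "iostat -x 1" for a minute or so, followed by "stop_check". For the write cache question, "hdparm -W /dev/sdX" with no value reports the drive's current write-caching setting.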

* Re: possible bus loading problem during resync
  2010-03-12 11:00       ` Asdo
@ 2010-03-12 11:43         ` Kristleifur Daðason
  0 siblings, 0 replies; 6+ messages in thread
From: Kristleifur Daðason @ 2010-03-12 11:43 UTC (permalink / raw)
  To: Asdo; +Cc: Goswin von Brederlow, Timothy D. Lenz, linux-raid

On Fri, Mar 12, 2010 at 11:00 AM, Asdo <asdo@shiftmail.org> wrote:
>
>>> If that is true, how does that happen, the driver is hung? But anyway,
>>> how can such things happen when there is more than one CPU-core?
>>>
>>
>> A drive produces an error, the whole controller hangs and resets all
>> ports, all drives have to finish being reset before any IO can continue.
>> Happens easily enough.
>>
>
> Ok but this is a multi-core CPU and he said Putty and WinSCP were hung.
> Ok for WinSCP... but Putty?
>
> Timothy is Putty hung on array check even on NCQ disabled?
>
> What is the resync speed? If it is very high it could be a CPU starvation
> but it's strange with only 2 drives. If it is very low I am not sure.
> Are disks write caches enabled?
>
> I am not able to spot any problem from your iostat or dmesg.
> Note: yes the iostat -x 1 was supposed to be captured during resync.
> Trigger a resync manually for the test. You can start it with echo check and
> stop it with echo idle > /sys/block/mdX/md/sync_action
>

I find "gnome-disk-utility" A.K.A. "palimpsest" to be a very good
heuristic to tell whether any drives are giving me physical trouble.
If there is a high remapped-sector-count or if Palimpsest otherwise
thinks a drive is suspect, there is a good chance that any unexplained
slowdowns in the machine are due to that drive.

If nothing more, it's a cheap way to get more information.

(My way of getting Palimpsest to check out drives in a machine that
doesn't have the program available in its installed distro's
repositories is to run Ubuntu 9.10 from a USB stick. Boot, and up
comes Palimpsest.)

Hope this helps. Best of luck.

-- Kristleifur
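
On a machine without a desktop, the remapped-sector count mentioned above can also be read from the command line. The sketch below swaps in smartctl (from smartmontools) in place of Palimpsest, so it is not the method described in this message. It assumes the drive reports the attribute under the name Reallocated_Sector_Ct, which not every drive does. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: the function name is illustrative. It uses smartctl
# output rather than Palimpsest, which is what the thread describes.

# Reads `smartctl -A` output on stdin and prints the raw value (the
# last column) of the Reallocated_Sector_Ct attribute, if present.
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $NF }'
}
```

Typical use, as root, would be "smartctl -A /dev/sdX | reallocated_sectors" for each member disk. A non-zero and growing count is the kind of hint Palimpsest would flag.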
