Question about fault-tolerant redundant disk reading

linux-raid.vger.kernel.org archive mirror

* Question about fault-tolerant redundant disk reading
From: Jeremy Friesner @ 2004-03-30 20:21 UTC
  To: linux-raid; +Cc: jeffk, ronb

Hi all,

I'm trying to add some basic drive-failure tolerance into my audio playback 
system, and I could use some advice as to the best way to go about it.

Some background info:  My company is working on a high-end(ish) embedded audio 
playback device.  It's a PowerPC-based computer-on-a-card with an external 
SCSI connector, running Linux 2.4.  The idea is that the user will connect 
one or more external SCSI drives to this device, then power it on, and our 
software will automount any ext3 partitions it finds on these drives, look 
for audio files in the partitions, and play the audio out over a speaker.

So far, so good.  But what we'd like to add is some fault tolerance -- 
specifically, if one of the drives in the system was to fail or lose power, 
we'd like the system to seamlessly fail-over to another drive so that the 
music playback isn't interrupted.

Some possibilities:

1) Do it without RAID.  This would be my preference if it is possible, since 
it would be the simplest for the user to set up.  Ideally all the user would 
have to do is copy the same audio files onto several ext3-formatted drives 
and plug them all in.  Our audio-playback program would read the files as 
usual, but if the read() system call returned an error, we would assume that 
the drive was toasted and would switch over to reading from the file with the 
same name on another drive.  The only problem is that it's not clear that 
Linux can be made to handle SCSI drive errors quickly and cleanly enough for 
this to work... is this idea practical?  If so, what steps need to be taken 
to obtain "quick-fail" behaviour?

2) Use software RAID.  This would require more setup work and knowledge on the 
part of the user, but it might handle drive failures more gracefully.  Is 
software RAID capable of handling a failover promptly enough to avoid audio 
dropouts?  (i.e. within 3-5 seconds?)

3) Use a separate hardware RAID device so that the Linux card never even 
realizes a failure occurred.  I think this would be the most reliable 
solution, but we're trying to keep the price down and this would blow the 
budget for a lot of our customers.
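
For illustration, option 1 could be sketched roughly as follows. This is a 
minimal sketch, assuming the same file exists at a known path on every 
mounted drive; the class name FailoverReader and its attributes are 
hypothetical and not taken from any existing code. It only reacts once the 
read call actually reports an error, so it does nothing about how long the 
kernel takes to report one, which is the open question in option 1.

```python
import os


class FailoverReader:
    """Read one logical file from the first replica that still works."""

    def __init__(self, paths):
        # paths: the same file on each mounted drive (hypothetical layout)
        self.paths = list(paths)
        self.active = 0    # index of the replica currently in use
        self.offset = 0    # playback position, shared by all replicas
        self.fd = None

    def _drop_active(self):
        # Give up on the current replica and move on to the next one.
        if self.fd is not None:
            try:
                os.close(self.fd)
            except OSError:
                pass  # the descriptor may already be unusable
            self.fd = None
        self.active += 1

    def read(self, size):
        # Keep trying replicas until one returns data or none are left.
        while self.active < len(self.paths):
            try:
                if self.fd is None:
                    self.fd = os.open(self.paths[self.active], os.O_RDONLY)
                # pread takes an explicit offset, so a new replica resumes
                # exactly where the failed one stopped.
                data = os.pread(self.fd, size, self.offset)
                self.offset += len(data)
                return data
            except OSError:
                self._drop_active()
        raise OSError("all replicas failed")
```

A player would construct it with one path per drive, for example 
FailoverReader(["/mnt/a/track.wav", "/mnt/b/track.wav"]) (paths made up 
here), and call read() in its normal buffering loop.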

If someone knowledgeable can provide some advice, I'd be very appreciative.

Thanks,

Jeremy Friesner
Level Control Systems
jaf@lcsaudio.com


* Re: Question about fault-tolerant redundant disk reading
From: Mark Bellon @ 2004-03-31 15:38 UTC
  To: jaf; +Cc: linux-raid, jeffk, ronb

Jeremy Friesner wrote:

>Hi all,
>
>I'm trying to add some basic drive-failure tolerance into my audio playback 
>system, and I could use some advice as to the best way to go about it.
>
>Some background info:  My company is working on a high-end(ish) embedded audio 
>playback device.  It's a PowerPC-based computer-on-a-card with an external 
>SCSI connector, running Linux 2.4.  The idea is that the user will connect 
>one or more external SCSI drives to this device, then power it on, and our 
>software will automount any ext3 partitions it finds on these drives, look 
>for audio files in the partitions, and play the audio out over a speaker.
>
>So far, so good.  But what we'd like to add is some fault tolerance -- 
>specifically, if one of the drives in the system was to fail or lose power, 
>we'd like the system to seamlessly fail-over to another drive so that the 
>music playback isn't interrupted.
>
>Some possibilities:
>
>1) Do it without RAID.  This would be my preference if it is possible, since 
>it would be the simplest for the user to set up.  Ideally all the user would 
>have to do is copy the same audio files onto several ext3-formatted drives 
>and plug them all in.  Our audio-playback program would read the files as 
>usual, but if the read() system call returned an error, we would assume that 
>the drive was toasted and would switch over to reading from the file with the 
>same name on another drive.  The only problem is that it's not clear that 
>Linux can be made to handle SCSI drive errors quickly and cleanly enough for 
>this to work... is this idea practical?  If so, what steps need to be taken 
>to obtain "quick-fail" behaviour?
>
>2) Use software RAID.  This would require more setup work and knowledge on the 
>part of the user, but it might handle drive failures more gracefully.  Is 
>software RAID capable of handling a failover promptly enough to avoid audio 
>dropouts?  (i.e. within 3-5 seconds?)
>
>3) Use a separate hardware RAID device so that the Linux card never even 
>realizes a failure occurred.  I think this would be the most reliable 
>solution, but we're trying to keep the price down and this would blow the 
>budget for a lot of our customers.
>
I've been dealing with a similar situation in critical data 
collection. The hardware is definitely not low end (dual 2Gb 
FibreChannels, each with a high-performance disk), but the problems are 
similar.  Because of the incoming data rate, only a few seconds of 
buffering are available, so I must be able to complete the I/O within a 
very small time frame: 3-4 seconds at most.

I had great difficulty tuning the SCSI layer and FibreChannel to bring a 
worst-case failure (the defaults could take over 3 minutes!) 
within my time frame (the 2.6 FastFail doesn't help that much either). 
This started me thinking about a non-responsive device timer for the 
RAID 1 driver: if a device took longer than I cared to wait, it 
would be marked faulty and the RAID 1 driver would move on. I would 
no longer care about what happens "down below"; I could fail over at the 
point I wanted and ensure the necessary level of service.
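
To make the idea concrete, here is a user-space analogue of such a timer. 
It is only a sketch under stated assumptions, not the kernel patch 
discussed in this message: the names Mirror, read_fn and read_with_deadline 
are hypothetical, and each mirror is modeled as a plain callable rather 
than a real block device. A read that errors or misses the deadline gets 
its mirror marked faulty, and the next mirror is tried.

```python
import threading


class Mirror:
    """One replica plus a 'faulty' flag, loosely like a RAID 1 member."""

    def __init__(self, name, read_fn):
        self.name = name
        self.read_fn = read_fn  # hypothetical: callable(offset, size) -> bytes
        self.faulty = False


def read_with_deadline(mirrors, offset, size, deadline_s):
    """Return data from the first mirror that answers within deadline_s."""
    for m in mirrors:
        if m.faulty:
            continue
        result = {}

        def worker(m=m, result=result):
            try:
                result["data"] = m.read_fn(offset, size)
            except Exception as exc:
                result["error"] = exc

        # Daemon thread: a read stuck "down below" cannot be cancelled from
        # here, so it is abandoned rather than waited for.
        t = threading.Thread(target=worker, daemon=True)
        t.start()
        t.join(deadline_s)
        if "data" in result:
            return result["data"]
        # Either the read failed or it did not finish in time.
        m.faulty = True
    raise OSError("no responsive mirror")
```

The abandoned thread stays blocked until the lower layers give up, so this 
bounds the caller's wait without making the stuck I/O itself go away. 
Similarly, if all disks share one hung bus, every mirror may miss the 
deadline.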

In your case, avoiding the worst-case situation is complicated if you 
only have a single SCSI bus: a hung SCSI bus can take a very long time 
to clear, and this would delay access to all of your disks. 
Still, there are disk failures that a scheme like the one I described 
above would help with.

I've started playing with a prototype against the 2.4.25 version of MD 
RAID 1, but I would like to hear from the RAID community whether it is 
something worth exploring further. If so, I would post my patch once it 
is done and see what everyone thinks.

Comments?

mark

