Raid5 over sbp2 : sbp2 command abort

linux-raid.vger.kernel.org archive mirror

* Raid5 over sbp2 : sbp2 command abort
@ 2006-01-30 13:50 Francois Barre
  2006-01-30 20:00 ` Stefan Richter
  0 siblings, 1 reply; 4+ messages in thread
From: Francois Barre @ 2006-01-30 13:50 UTC
  To: linux1394-devel, linux-raid

Hello all,

This is a cross-post (sorry for that), but I don't yet know which layer the problem comes from.
A. The setup
VIA EPIA 10k Nehemiah, VIA OHCI controller
4 x 250 GB IDE drives attached via sbp2
Vanilla 2.6.15.1 kernel, mdadm 2.2, superblock 0.90
ohci1394+sbp2 built into the kernel (default params: serialize_io=1, ...),
raid5 as a module.

B. The tests
Test0: Creating a 4-drive raid5 with 1 drive missing, then copying the
4th drive's content to the raid5, works great.
Stress-testing with multiple drive copies (Test0 + various tests) seems
OK: very responsive, absolutely no errors. Test1, however, produces a lot
of 'command abort' errors, which block I/O for seconds before it starts
again.

Test1: Building the raid5 from scratch with all 4 drives (i.e. none
missing) causes 'sbp2 : command abort' messages.
At the end of Test1, the raid5 is not created: one drive is marked faulty.
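
For readers who want to reproduce the two tests, the sketch below builds the
corresponding mdadm command lines without running them. The exact commands
used in the original tests are not given, so this is only an assumption about
their shape; the device names (/dev/md0, /dev/sda1, ...) and the helper name
are hypothetical. "missing" is the mdadm keyword for an absent member.

```python
from typing import List, Optional


def mdadm_create_raid5(md_dev: str, members: List[Optional[str]]) -> List[str]:
    """Build an argv list for `mdadm --create` of a RAID5 array.

    A member given as None is emitted as the mdadm keyword "missing",
    which creates the array in degraded mode (as in Test0).
    """
    if len(members) < 3:
        raise ValueError("this sketch expects at least 3 RAID5 slots")
    if sum(m is None for m in members) > 1:
        raise ValueError("RAID5 tolerates at most one missing member")
    slots = ["missing" if m is None else m for m in members]
    return ["mdadm", "--create", md_dev, "--level=5",
            f"--raid-devices={len(slots)}", *slots]


# Hypothetical device names, for illustration only.
test0 = mdadm_create_raid5(
    "/dev/md0", ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", None])
test1 = mdadm_create_raid5(
    "/dev/md0", ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"])
print(" ".join(test0))
print(" ".join(test1))
```

One practical difference between the two: a degraded array has nothing to
synchronise, whereas creating the array with all members starts an initial
sync that keeps all four drives busy at the same time.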

C. The questions:
How could I run in a paranoid/reduced-bandwidth mode? I tried lowering
/proc/sys/dev/raid/speed_limit_max far below the maximum bandwidth, but
it did not behave as expected: I/O runs at the highest bandwidth for
seconds, then stops, then runs again at the highest rate, ...
Is there a way to avoid write-back at the sbp2 level? I could not find
any way to do so...

Which kernel version should I use instead? SCSI on 2.6.15.x does not
seem really trustworthy; should I run 2.6.14.x?
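
The speed_limit_max experiment described above can be scripted roughly as
follows. This is a minimal sketch, assuming that speed_limit_min and
speed_limit_max under /proc/sys/dev/raid each hold a single integer in
kilobytes per second per device, as the kernel's md documentation describes.
The function names and the `base` parameter are made up for illustration;
`base` only exists so the code can be tried against a scratch directory.

```python
from pathlib import Path

RAID_SYSCTL_DIR = Path("/proc/sys/dev/raid")


def read_speed_limits(base: Path = RAID_SYSCTL_DIR) -> dict:
    """Return the md resync speed limits as integers."""
    return {name: int((base / name).read_text().split()[0])
            for name in ("speed_limit_min", "speed_limit_max")}


def set_speed_limit_max(kb_per_sec: int, base: Path = RAID_SYSCTL_DIR) -> None:
    """Write a new resync ceiling; needs root on a real system."""
    if kb_per_sec <= 0:
        raise ValueError("limit must be positive")
    if kb_per_sec < read_speed_limits(base)["speed_limit_min"]:
        raise ValueError("new maximum is below speed_limit_min")
    (base / "speed_limit_max").write_text(f"{kb_per_sec}\n")


# Only touch the real files if md's sysctl directory exists here.
if RAID_SYSCTL_DIR.is_dir():
    print(read_speed_limits())
else:
    print("md sysctl directory not present on this machine")
```

Note that these limits only cap md's own resync rate; they do not change how
sbp2 issues commands to the drives.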

Best regards,

F.-E.B.

* Re: Raid5 over sbp2 : sbp2 command abort
  2006-01-30 13:50 Raid5 over sbp2 : sbp2 command abort Francois Barre
@ 2006-01-30 20:00 ` Stefan Richter
  2006-01-30 22:09   ` Neil Brown
  0 siblings, 1 reply; 4+ messages in thread
From: Stefan Richter @ 2006-01-30 20:00 UTC
  To: Francois Barre; +Cc: linux1394-devel, linux-raid

Francois Barre wrote:
> This is a cross-post (sorry for that), but I don't know where it comes from yet.

Alas, we get similar reports about software RAID over SBP-2 now and then
on linux1394-devel or -user. I strongly suspect sbp2 is the culprit.

One person reported different results with different software RAID
levels, but I am too lazy right now to dig for the post in the list archive.

Question to the linux-raid folks: Does md support disks on different
SCSI host adapters in the same RAID set?

> A. The setup
> VIA EPIA 10k Nehemiah, OHCI with VIA
> 4 sbp2 250Go IDE drives

Are these drives' bridges based on a Prolific chip? If yes, check whether
you can get a firmware update.

> Vanilla 2.6.15.1 kernel, mdadm 2.2, superblock 0.90
> ohci1394+sbp2 in kernel (default params : serialize_io=1, ...), raid5
> as a module.

I recommend building the FireWire drivers as modules. This lets you
unload and reload them, e.g. to recover from some failures or to try
different parameters. However, whether they are linked statically or
built as modules has no effect on reliability during data transfers.
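
To illustrate the "try different parameters" point, here is a small sketch
that lists a module's current parameters and builds a reload command. It
assumes the parameters are exported under /sys/module/<name>/parameters,
which depends on the kernel version and on how each parameter is declared;
the helper names and the `sysfs_root` argument are hypothetical. The
modprobe command is only constructed, never executed.

```python
from pathlib import Path


def read_module_params(module: str,
                       sysfs_root: Path = Path("/sys/module")) -> dict:
    """Return {parameter: value} for a module, or {} if none are exported."""
    param_dir = sysfs_root / module / "parameters"
    if not param_dir.is_dir():
        return {}
    params = {}
    for entry in sorted(param_dir.iterdir()):
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameters are not readable by unprivileged users.
            params[entry.name] = "<unreadable>"
    return params


def modprobe_argv(module: str, **params) -> list:
    """Build (but do not run) a modprobe command with key=value parameters."""
    return ["modprobe", module] + [f"{k}={v}" for k, v in sorted(params.items())]


print(read_module_params("sbp2"))
print(" ".join(modprobe_argv("sbp2", serialize_io=1)))
```

On a machine without sbp2 the first call simply prints an empty dictionary.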

> B. The tests
> Test0 : Creating a 4-drive raid5 with 1 drive missing, copying the 4th
> drive content to the raid5, works great.
> Stress-testing multiple drive copy seems to be ok (Test0 + various
> tests), very responsive, absolutely no error, but Test1 has a lot of
> 'command abort' errors, which blocks io for seconds, then starts
> again.
> 
> Test1 : Building from scratch the raid5 with 4 drives (i.e. none
> missing), causes 'sbp2 : command abort' messages.

Are there any other suspicious messages from sbp2, ieee1394, or ohci1394?

> At the end of Test1, raid5 is not created : one drive is set faulty.
> 
> C. The questions :
> How could I run a paranoïd/degraded bandwidth mode ? I tried playing
> with /proc/sys/dev/raid/speed_limit_max, reducing to far away from
> highest bandwidth, but it did not have the expected behaviour : io
> runs to highest bw for seconds, then stops, then runs again at highest
> rate, ...

What about sbp2's max_speed parameter?

> Is there a way to avoid write back at sbp2 level ? I could not find
> any way to do so...

What do you mean by that?

> What kernel version should I rather use ? Seems like scsi on 2.6.15.x
> is not really trustworthy, should I run 2.6.14.x ?

"aborting sbp2 command" issues have been reported for quite a long time 
now, especially for Linux 2.6, although 2.4's sbp2 isn't fundamentally
different. I don't think 2.6.14.x would behave any differently from
2.6.15.x with this particular problem.

BTW, I'm hoping to get some spare time in February to work on this
particular problem. I have never used software RAID over sbp2 myself and
don't intend to do so any time soon, but I see what I suspect to be the
same type of failures with a 1394a disk and with a 1394b JBOD device (or
hardware "R"AID-0).

In the case of my 1394a disk, the failures vanish either with
serialize_io=1 (this was not required with an older kernel; I don't
remember which one) or, curiously enough, with "gap count optimization".
As I wrote an hour ago on linux1394-user, gap count optimization is a
performance tuning of the FireWire bus and is not yet implemented in the
kernel. You can apply gap count optimization manually with
"echo p 0x00450000 | 1394commander" for a single external device or
"echo p 0x004a0000 | 1394commander" if 4 external devices are
daisy-chained. Run the command after all disks have been connected and
switched on; otherwise the command may inhibit access to newly added
devices. www.linux1394.org has a link to 1394commander.
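
The two magic numbers above can be decoded with a short sketch. It assumes
that the value passed to 1394commander's "p" command is the first quadlet of
an IEEE 1394 PHY configuration packet (bits 29..24 root ID, bit 23 "force
root" flag, bit 22 "set gap count" flag, bits 21..16 gap count). That
interpretation is not stated in this thread, and the function names are made
up for illustration.

```python
from typing import Optional

FORCE_ROOT_FLAG = 1 << 23   # R bit: the root_id field is valid
GAP_COUNT_FLAG = 1 << 22    # T bit: the gap_count field is valid


def phy_config_quadlet(gap_count: Optional[int] = None,
                       root_id: Optional[int] = None) -> int:
    """Encode the first quadlet of a PHY configuration packet."""
    quadlet = 0  # bits 31..30 = 00 identify a PHY configuration packet
    for name, value in (("gap_count", gap_count), ("root_id", root_id)):
        if value is not None and not 0 <= value <= 0x3F:
            raise ValueError(f"{name} must fit in 6 bits")
    if root_id is not None:
        quadlet |= FORCE_ROOT_FLAG | (root_id << 24)
    if gap_count is not None:
        quadlet |= GAP_COUNT_FLAG | (gap_count << 16)
    return quadlet


def decode_phy_config(quadlet: int) -> dict:
    """Decode the fields; a field is None when its flag bit is clear."""
    return {
        "root_id": ((quadlet >> 24) & 0x3F) if quadlet & FORCE_ROOT_FLAG else None,
        "gap_count": ((quadlet >> 16) & 0x3F) if quadlet & GAP_COUNT_FLAG else None,
    }


for value in (0x00450000, 0x004A0000):
    print(f"0x{value:08x} -> {decode_phy_config(value)}")
```

Under that assumption, 0x00450000 requests a gap count of 5 and 0x004a0000 a
gap count of 10, without forcing a particular root node.
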
-- 
Stefan Richter
-=====-=-==- ---= ====-
http://arcgraph.de/sr/


* Re: Raid5 over sbp2 : sbp2 command abort
  2006-01-30 20:00 ` Stefan Richter
@ 2006-01-30 22:09   ` Neil Brown
  2006-01-31 18:26     ` Stefan Richter
  0 siblings, 1 reply; 4+ messages in thread
From: Neil Brown @ 2006-01-30 22:09 UTC
  To: Stefan Richter; +Cc: Francois Barre, linux1394-devel, linux-raid

On Monday January 30, stefanr@s5r6.in-berlin.de wrote:
> 
> Question to the linux-raid folks: Does md support disks on different 
> SCSI host adapters to be in the same RAID set?
> 

md doesn't notice and doesn't care what the underlying devices are.
It just sees Linux block devices, and sends read/write requests as
appropriate.  Obviously if the different devices are badly mis-matched
you might get poor performance, but it should "work".

NeilBrown


* Re: Raid5 over sbp2 : sbp2 command abort
  2006-01-30 22:09   ` Neil Brown
@ 2006-01-31 18:26     ` Stefan Richter
  0 siblings, 0 replies; 4+ messages in thread
From: Stefan Richter @ 2006-01-31 18:26 UTC
  To: Francois Barre; +Cc: linux1394-devel, linux-raid

Francois Barre wrote in personal mail:
> it appeared that it was a harddrive problem. A
> simple dd from was showing the abort messages as well

Over SBP-2 or IDE? Either way, md is no longer a suspect, and we don't 
need to bother linux-raid anymore. :-)

>> What about sbp2's max_speed parameter?
> 
> Hidden option of the level37 ? I've never seen it...

It is listed on www.linux1394.org's sbp2 page (Linux 2.4 syntax) and of
course in the source. Anyway, it is not important. Since you don't get
any error messages from the 1394 stack's lower layers, this is obviously
not a case of an electrically unreliable bus, which would be the main
reason to use sbp2's max_speed parameter.

> I was wondering if sbp2 didn't behave as if it was buffering
> writes, waiting for a sufficient amount of data before sending it to
> drives... Regardless of the io scheduler I mean...

No, there is no additional scheduling in sbp2 or in the ieee1394 
transactions layer.

It works a bit differently anyway: sbp2 gets pointers to the SCSI layer's
data buffers, puts SCSI commands into additional small buffers, and
notifies the target (disk) of new commands. The target fetches commands,
fetches data or sends data, and sends completion status. IOW the target
performs all the data movement. So, strictly speaking, there is also
some kind of scheduler involved on the other side of the wire (the
target's fetch agent).
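
The division of labour described above can be pictured with a toy model.
This is only a sketch of the flow (queue command, notify, target fetches and
moves data, target reports status); it is not the sbp2 driver's code. All
class and field names are invented, and the target is run synchronously
here, whereas a real device works asynchronously.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class Command:
    """Toy command block: an opcode plus a data buffer in host memory."""
    opcode: str                    # "READ" or "WRITE" in this sketch
    lba: int
    buffer: bytearray
    status: Optional[str] = None   # filled in by the target


@dataclass
class Target:
    """Disk side: its fetch agent pulls commands and does all transfers."""
    blocks: Dict[int, bytes] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def notify(self, pending: Deque[Command]) -> None:
        while pending:
            cmd = pending.popleft()                  # target fetches command
            self.log.append(f"fetch {cmd.opcode} lba={cmd.lba}")
            if cmd.opcode == "WRITE":
                self.blocks[cmd.lba] = bytes(cmd.buffer)   # target fetches data
            elif cmd.opcode == "READ":
                zeros = bytes(len(cmd.buffer))
                cmd.buffer[:] = self.blocks.get(cmd.lba, zeros)  # target sends data
            cmd.status = "GOOD"                      # target sends status


@dataclass
class Initiator:
    """Host side: queues commands and notifies; it moves no data itself."""
    pending: Deque[Command] = field(default_factory=deque)

    def submit(self, cmd: Command, target: Target) -> None:
        self.pending.append(cmd)
        target.notify(self.pending)


host, disk = Initiator(), Target()
host.submit(Command("WRITE", 7, bytearray(b"abcd")), disk)
reply = Command("READ", 7, bytearray(4))
host.submit(reply, disk)
print(disk.log, bytes(reply.buffer), reply.status)
```

The point of the model is that the host never copies payload data: every
transfer happens inside Target.notify, which is why the pacing of I/O is
partly decided on the device side.
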
-- 
Stefan Richter
-=====-=-==- ---= =====
http://arcgraph.de/sr/
