* Raid5 over sbp2 : sbp2 command abort
@ 2006-01-30 13:50 Francois Barre
2006-01-30 20:00 ` Stefan Richter
0 siblings, 1 reply; 4+ messages in thread
From: Francois Barre @ 2006-01-30 13:50 UTC (permalink / raw)
To: linux1394-devel, linux-raid
Hello all,
This is a cross-post (sorry for that), but I don't know yet where the problem comes from.
A. The setup
VIA EPIA 10k Nehemiah, OHCI with VIA
4 sbp2 250 GB IDE drives
Vanilla 2.6.15.1 kernel, mdadm 2.2, superblock 0.90
ohci1394+sbp2 in kernel (default params : serialize_io=1, ...), raid5
as a module.
B. The tests
Test0 : Creating a 4-drive raid5 with 1 drive missing, then copying the
4th drive's content to the raid5, works great.
Stress-testing with multiple drive copies (Test0 + various tests) seems
to be OK: very responsive, absolutely no errors. Test1, however, produces
a lot of 'command abort' errors, which block I/O for seconds before it
starts again.
Test1 : Building the raid5 from scratch with all 4 drives (i.e. none
missing) causes 'sbp2 : command abort' messages.
At the end of Test1, the raid5 is not created: one drive is marked faulty.
C. The questions :
How could I run in a paranoid, reduced-bandwidth mode? I tried playing
with /proc/sys/dev/raid/speed_limit_max, reducing it to far below the
highest bandwidth, but it did not behave as expected: I/O runs at the
highest bandwidth for a few seconds, then stops, then runs again at the
highest rate, ...
Is there a way to avoid write-back at the sbp2 level? I could not find
any way to do so...
Which kernel version should I use instead? SCSI on 2.6.15.x does not
seem really trustworthy; should I run 2.6.14.x?
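As a rough illustration of the throttle mentioned above, here is a minimal Python sketch that reads and lowers the md resync ceiling. It assumes that a sibling file speed_limit_min sits next to speed_limit_max and that each file holds a single integer. The function names and the clamping rule are hypothetical choices made for this sketch, not part of md or mdadm.

```python
from pathlib import Path

# Assumption: the md throttle files live in this directory, as the
# speed_limit_max path quoted in the question suggests.
MD_SYSCTL_DIR = Path("/proc/sys/dev/raid")


def read_limits(sysctl_dir=MD_SYSCTL_DIR):
    """Return the current (min, max) resync limits as integers."""
    base = Path(sysctl_dir)
    lo = int((base / "speed_limit_min").read_text().split()[0])
    hi = int((base / "speed_limit_max").read_text().split()[0])
    return lo, hi


def set_max_limit(limit, sysctl_dir=MD_SYSCTL_DIR):
    """Write a new ceiling and return the value actually written.

    The value is clamped so that it never drops below speed_limit_min;
    that clamp is a choice of this sketch, not documented md behaviour.
    Writing to the real /proc files requires root.
    """
    lo, _ = read_limits(sysctl_dir)
    value = max(int(limit), lo)
    (Path(sysctl_dir) / "speed_limit_max").write_text(f"{value}\n")
    return value
```

Passing a scratch directory as sysctl_dir allows a dry run without touching the live system.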
Best regards,
F.-E.B.
* Re: Raid5 over sbp2 : sbp2 command abort
2006-01-30 13:50 Raid5 over sbp2 : sbp2 command abort Francois Barre
@ 2006-01-30 20:00 ` Stefan Richter
2006-01-30 22:09 ` Neil Brown
0 siblings, 1 reply; 4+ messages in thread
From: Stefan Richter @ 2006-01-30 20:00 UTC (permalink / raw)
To: Francois Barre; +Cc: linux1394-devel, linux-raid
Francois Barre wrote:
> This is a cross-post (sorry for that), but I don't know yet where the problem comes from.
Alas we get similar reports about software RAID over SBP-2 now and then
on linux1394-devel or -user. I very much suspect sbp2 to be the culprit.
One person reported different results with different software RAID
levels but I am too lazy right now to dig for the post in the list archive.
Question to the linux-raid folks: Does md support having disks on
different SCSI host adapters in the same RAID set?
> A. The setup
> VIA EPIA 10k Nehemiah, OHCI with VIA
> 4 sbp2 250 GB IDE drives
Are these drives' bridges based on a Prolific chip? If yes, check if
you could get a firmware update.
> Vanilla 2.6.15.1 kernel, mdadm 2.2, superblock 0.90
> ohci1394+sbp2 in kernel (default params : serialize_io=1, ...), raid5
> as a module.
I recommend building the FireWire drivers as modules. This lets you
unload and reload them, e.g. to recover from some failures or to try
different parameters. However, whether the drivers are linked statically
or built as modules has no effect on reliability during data transfers.
> B. The tests
> Test0 : Creating a 4-drive raid5 with 1 drive missing, then copying the
> 4th drive's content to the raid5, works great.
> Stress-testing with multiple drive copies (Test0 + various tests) seems
> to be OK: very responsive, absolutely no errors. Test1, however, produces
> a lot of 'command abort' errors, which block I/O for seconds before it
> starts again.
>
> Test1 : Building the raid5 from scratch with all 4 drives (i.e. none
> missing) causes 'sbp2 : command abort' messages.
Are there any other suspicious messages from sbp2, ieee1394, or ohci1394?
> At the end of Test1, the raid5 is not created: one drive is marked faulty.
>
> C. The questions :
> How could I run in a paranoid, reduced-bandwidth mode? I tried playing
> with /proc/sys/dev/raid/speed_limit_max, reducing it to far below the
> highest bandwidth, but it did not behave as expected: I/O runs at the
> highest bandwidth for a few seconds, then stops, then runs again at the
> highest rate, ...
What about sbp2's max_speed parameter?
> Is there a way to avoid write-back at the sbp2 level? I could not find
> any way to do so...
What do you mean by that?
> Which kernel version should I use instead? SCSI on 2.6.15.x does not
> seem really trustworthy; should I run 2.6.14.x?
"aborting sbp2 command" issues have been reported for quite a long time
now, especially for Linux 2.6, although 2.4's sbp2 isn't fundamentally
different. I don't think 2.6.14.x would behave any differently from
2.6.15.x with regard to this particular problem.
BTW, I'm hoping to get some spare time in February in order to work on
this particular problem. I never used software RAID over sbp2 myself and
don't intend to do so any time soon, but I get what I suspect to be the
same type of failures with a 1394a disk and with a 1394b JBOD device (or
hardware "R"AID-0) myself.
In case of my 1394a disk, the failures vanish either with serialize_io=1
(this was not required with an older kernel; I don't remember which one)
or --- curiously enough --- with "gap count optimization". As I wrote an
hour ago on linux1394-user, gap count optimization is a performance
tuning of the FireWire bus and is not yet implemented in the kernel. You
can get gap count optimization manually with "echo p 0x00450000 |
1394commander" for a single external device or "echo p 0x004a0000 |
1394commander" if 4 external devices are daisy-chained. Run the command
after all disks were connected and switched on, otherwise the command
may inhibit access to newly added devices. www.linux1394.org has a link
to 1394commander.
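The two constants above can be read as PHY configuration quadlets. The following minimal Python sketch shows how such a value is assembled, assuming the usual IEEE 1394 PHY configuration packet layout (T bit at bit 22, gap_cnt in bits 21-16, R bit at bit 23, root_ID in bits 29-24). Under that assumption, 0x00450000 corresponds to a gap count of 5 and 0x004a0000 to a gap count of 10. The function names are hypothetical; only the "p <quadlet>" input line comes from the commands quoted above.

```python
# Assumed layout of the first quadlet of a PHY configuration packet:
#   bits 31-30  packet identifier (00)
#   bits 29-24  root_ID
#   bit  23     R (root_ID field is valid)
#   bit  22     T (gap_cnt field is valid)
#   bits 21-16  gap_cnt
#   bits 15-0   zero
R_BIT = 1 << 23
T_BIT = 1 << 22


def phy_config_quadlet(gap_count, root_id=None):
    """Build the quadlet that sets gap_cnt (and optionally forces a root)."""
    if not 0 <= gap_count <= 0x3F:
        raise ValueError("gap_count must fit in 6 bits")
    quadlet = T_BIT | (gap_count << 16)
    if root_id is not None:
        if not 0 <= root_id <= 0x3F:
            raise ValueError("root_id must fit in 6 bits")
        quadlet |= R_BIT | (root_id << 24)
    return quadlet


def commander_line(gap_count):
    """Format the line that is piped into 1394commander in the text above."""
    return f"p 0x{phy_config_quadlet(gap_count):08x}"


if __name__ == "__main__":
    print(commander_line(5))    # single external device, per the text above
    print(commander_line(10))   # 4 daisy-chained devices, per the text above
```

Which gap count suits a given topology is not derived here; the sketch only reproduces the two values given above.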
--
Stefan Richter
-=====-=-==- ---= ====-
http://arcgraph.de/sr/
* Re: Raid5 over sbp2 : sbp2 command abort
2006-01-30 20:00 ` Stefan Richter
@ 2006-01-30 22:09 ` Neil Brown
2006-01-31 18:26 ` Stefan Richter
0 siblings, 1 reply; 4+ messages in thread
From: Neil Brown @ 2006-01-30 22:09 UTC (permalink / raw)
To: Stefan Richter; +Cc: Francois Barre, linux1394-devel, linux-raid
On Monday January 30, stefanr@s5r6.in-berlin.de wrote:
>
> Question to the linux-raid folks: Does md support having disks on
> different SCSI host adapters in the same RAID set?
>
md doesn't notice and doesn't care what the underlying devices are.
It just sees Linux block devices, and sends read/write requests as
appropriate. Obviously if the different devices are badly mis-matched
you might get poor performance, but it should "work".
NeilBrown
* Re: Raid5 over sbp2 : sbp2 command abort
2006-01-30 22:09 ` Neil Brown
@ 2006-01-31 18:26 ` Stefan Richter
0 siblings, 0 replies; 4+ messages in thread
From: Stefan Richter @ 2006-01-31 18:26 UTC (permalink / raw)
To: Francois Barre; +Cc: linux1394-devel, linux-raid
Francois Barre wrote in personal mail:
> it appeared that it was a harddrive problem. A
> simple dd from was showing the abort messages as well
Over SBP-2 or IDE? Either way, md is no longer a suspect, and we don't
need to bother linux-raid anymore. :-)
>> What about sbp2's max_speed parameter?
>
> Hidden option of the level37 ? I've never seen it...
It is listed on www.linux1394.org's sbp2 page (Linux 2.4 syntax) and of
course in the source. Anyway, it is not important. Since you don't get
any error messages from the 1394 stack's lower layers, this is evidently
not a case of an electrically unreliable bus, which would be the main
reason to use sbp2's max_speed parameter.
> I was wondering if sbp2 didn't behave as if it was buffering
> writes, waiting for a sufficient amount of data before sending it to
> drives... Regardless of the io scheduler I mean...
No, there is no additional scheduling in sbp2 or in the ieee1394
transactions layer.
It works a bit differently anyway: sbp2 gets pointers to the SCSI layer's
data buffers, puts SCSI commands into additional small buffers, and
notifies the target (disk) of new commands. The target fetches commands,
fetches data or sends data, and sends completion status. IOW the target
is performing all the data movement. So, strictly speaking, there is
also some kind of scheduler on the other side of the wire involved (the
target's fetch agent).
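The division of labour described above can be pictured with a toy model. This Python sketch is not the sbp2 driver and does not model the real SBP-2 data structures; every class and field name is hypothetical. It only shows the shape of the interaction: the initiator queues a small command that points at a data buffer and notifies the target, and the target then fetches the command, moves the data, and reports status.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Command:
    """Stand-in for a small command buffer pointing at a SCSI data buffer."""
    op: str             # "read" or "write" (hypothetical opcodes)
    lba: int            # block address on the toy disk
    buffer: bytearray   # initiator-side data buffer


@dataclass
class Initiator:
    queue: deque = field(default_factory=deque)   # commands in host memory
    status: list = field(default_factory=list)    # completions from target

    def submit(self, target, cmd):
        self.queue.append(cmd)   # place the command in host memory
        target.notify(self)      # then only tell the target it is there


@dataclass
class Target:
    blocks: dict = field(default_factory=dict)    # lba -> bytes, the "disk"

    def notify(self, host):
        # The target decides when to fetch; this toy drains the queue at once.
        while host.queue:
            cmd = host.queue.popleft()                       # fetch command
            if cmd.op == "write":
                self.blocks[cmd.lba] = bytes(cmd.buffer)     # fetch data
            else:
                data = self.blocks.get(cmd.lba, bytes(len(cmd.buffer)))
                cmd.buffer[:] = data                         # send data
            host.status.append((cmd.op, cmd.lba, "ok"))      # send status


if __name__ == "__main__":
    host, disk = Initiator(), Target()
    host.submit(disk, Command("write", 7, bytearray(b"abcd")))
    readback = bytearray(4)
    host.submit(disk, Command("read", 7, readback))
    print(bytes(readback), host.status)
```

Note that the initiator never copies data itself: all movement happens inside Target.notify, which mirrors the point that the target performs the data transfer.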
--
Stefan Richter
-=====-=-==- ---= =====
http://arcgraph.de/sr/