Re: [PATCH] ENBD for 2.5.64

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH] ENBD for 2.5.64
       [not found] <1048623613.25914.14.camel@lotte>
@ 2003-03-25 20:53 ` Peter T. Breuer
  2003-03-26  2:40   ` Jeff Garzik
  2003-03-28 11:19   ` Pavel Machek
  0 siblings, 2 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-25 20:53 UTC (permalink / raw)
  To: Justin Cormack; +Cc: linux kernel

"Justin Cormack wrote:"
> On Tue, 2003-03-25 at 17:27, Peter T. Breuer wrote:
> > > ENBD is not a replacement for NBD - the two are alternatives, aimed
> > > at different niches.  ENBD is a sort of heavyweight industrial NBD.  It
> > > does many more things and has a different achitecture.  Kernel NBD is
> > > like a stripped down version of ENBD.  Both should be in the kernel.
> 
> hmm, I would argue that nbd is ok, as it is a nice lightweight block
> device (though I have not been able to use it due to the fact that I can
> never find a userspace and kernel that work together), while ENBD should
> be replaced by iscsi, now that is a real ietf standard, and can burn CDs
> across the net and all that extra stuff.

It's not a bad idea. But ENBD in particular can use any transport,
precisely becuase its networking is done in userspace. One only has to
write a stream.c module for it that implements

    read
    write
    open
    close

(There are currently implementations for three transports, including
tcp of course).    

> And I am intending to write an iscsi client sometime, but it got
> delayed. The server stuff is already available from 3com.

Possibly, but ENBD is designed to fail :-). And networks fail.
What will your iscsi implementation do when somebody resets the
router? All those issues are handled by ENBD. ENBD breaks off and
reconnects automatically. It reacts right to removable media.

I should also have said that ENBD has the following features (I said I
forgot some!)

  9) it drops into a mode where it md5sums both ends and skips writes
  of equal blocks, if that's faster. It moves in and out of this mode
  automatically. This helps RAID resyncs (2* overspeed is common on
  100BT nets, that is 20MB/s.).

  10) integration with RAID - it advises raid correctly of its state
  and does hot add and remove correctly (well, you need my patches to 
  raid, but there you are ...).

Of course, if somebody wants me to make enbd appear like a scis device
instead of a generic block device, I guess I could do that. Except,
that yecch, I have seen the scsi code, and I do not understand it.

Another good idea is to make the wire protocol iscsi compatible. I
have no objection to that.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-25 20:53 ` [PATCH] ENBD for 2.5.64 Peter T. Breuer
@ 2003-03-26  2:40   ` Jeff Garzik
  2003-03-26  5:55     ` Matt Mackall
  2003-03-28 11:19   ` Pavel Machek
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Garzik @ 2003-03-26  2:40 UTC (permalink / raw)
  To: ptb; +Cc: Justin Cormack, linux kernel

Peter T. Breuer wrote:
> "Justin Cormack wrote:"
>>And I am intending to write an iscsi client sometime, but it got
>>delayed. The server stuff is already available from 3com.
> 
> 
> Possibly, but ENBD is designed to fail :-). And networks fail.
> What will your iscsi implementation do when somebody resets the
> router? All those issues are handled by ENBD. ENBD breaks off and
> reconnects automatically. It reacts right to removable media.


Yeah, iSCSI handles all that and more.  It's a behemoth of a 
specification.  (whether a particular implementation implements all that 
stuff correctly is another matter...)

BTW, I'm a big enbd fan :)  I like enbd for it's _simplicity_ compared 
to iSCSI.

	Jeff




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26  2:40   ` Jeff Garzik
@ 2003-03-26  5:55     ` Matt Mackall
  2003-03-26  6:31       ` Peter T. Breuer
                         ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Matt Mackall @ 2003-03-26  5:55 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: ptb, Justin Cormack, linux kernel

On Tue, Mar 25, 2003 at 09:40:44PM -0500, Jeff Garzik wrote:
> Peter T. Breuer wrote:
> >"Justin Cormack wrote:"
> >>And I am intending to write an iscsi client sometime, but it got
> >>delayed. The server stuff is already available from 3com.
> >
> >
> >Possibly, but ENBD is designed to fail :-). And networks fail.
> >What will your iscsi implementation do when somebody resets the
> >router? All those issues are handled by ENBD. ENBD breaks off and
> >reconnects automatically. It reacts right to removable media.
> 
> Yeah, iSCSI handles all that and more.  It's a behemoth of a 
> specification.  (whether a particular implementation implements all that 
> stuff correctly is another matter...)

Indeed, there are iSCSI implementations that do multipath and
failover.

Both iSCSI and ENBD currently have issues with pending writes during
network outages. The current I/O layer fails to report failed writes
to fsync and friends.

> BTW, I'm a big enbd fan :)  I like enbd for it's _simplicity_ compared 
> to iSCSI.

Definitely. The iSCSI protocol is more powerful but _much_ more
complex than ENBD. I've spent two years working on iSCSI but guess
which I use at home..

-- 
Matt Mackall : http://www.selenic.com : of or relating to the moon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26  5:55     ` Matt Mackall
@ 2003-03-26  6:31       ` Peter T. Breuer
  2003-03-26  6:48         ` Matt Mackall
  2003-03-26  6:59       ` Andre Hedrick
  2003-03-26  7:31       ` Lincoln Dale
  2 siblings, 1 reply; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-26  6:31 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel

"A month of sundays ago Matt Mackall wrote:"
> On Tue, Mar 25, 2003 at 09:40:44PM -0500, Jeff Garzik wrote:
> > Peter T. Breuer wrote:
> > >"Justin Cormack wrote:"
> > >>And I am intending to write an iscsi client sometime, but it got
> > >>delayed. The server stuff is already available from 3com.
> > >
> > >Possibly, but ENBD is designed to fail :-). And networks fail.
> > >What will your iscsi implementation do when somebody resets the
> > >router? All those issues are handled by ENBD. ENBD breaks off and
> > >reconnects automatically. It reacts right to removable media.
> > 
> > Yeah, iSCSI handles all that and more.  It's a behemoth of a 
> > specification.  (whether a particular implementation implements all that 
> > stuff correctly is another matter...)
> 
> Indeed, there are iSCSI implementations that do multipath and
> failover.

Somebody really ought to explain it to me :-). I can't keep up with all
this!

> Both iSCSI and ENBD currently have issues with pending writes during
> network outages. The current I/O layer fails to report failed writes
> to fsync and friends.

ENBD has two (configurable) behaviors here. Perhaps it should have
more. By default it blocks pending reads and writes during times when
the connection is down. It can be configured to error them instead. The 
erroring behavior is what you want when running under soft RAID, as
it's raid that should do the deciding about how to treat the requests
according to the overall state of the array, so it needs definite
yes/no info on each request, no "maybe".

Perhaps in a third mode requests should be blocked and time out after
about half an hour (or some number, in an infinite spectrum).

What I would like is some way of telling how backed up the VM is
against us. If the VM is full of dirty buffers aimed at us, then
I think we should consider erroring instead of blocking. The problem is
that at that point we're likely not getting any requests at all,
because the kernel long ago ran out of the 256 requests it has in
hand to send us. 

There is indeed an information disconnect with VMS in those
circumstances that I've never known how to solve.  FSs too are a
problem, because unless they are mounted sync they will happily permit
writes to a file on a fs on a blocked device even if that fills the
machine with buffers that can't go anywhere.  Among other things that
will run tcp out of buffer space that is necessary in order to flush
those buffers even if the connection does come back.  And even if the
mount is sync then some fs's (e.g.  ext2) still allow infinitely much
writing to a blocked device under some circumstances (start two
processes writing to the same file ..  the second will write to
buffers).

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26  6:31       ` Peter T. Breuer
@ 2003-03-26  6:48         ` Matt Mackall
  2003-03-26  7:05           ` Peter T. Breuer
  0 siblings, 1 reply; 27+ messages in thread
From: Matt Mackall @ 2003-03-26  6:48 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Jeff Garzik, Justin Cormack, linux kernel

> > Both iSCSI and ENBD currently have issues with pending writes during
> > network outages. The current I/O layer fails to report failed writes
> > to fsync and friends.
> 
> ENBD has two (configurable) behaviors here. Perhaps it should have
> more. By default it blocks pending reads and writes during times when
> the connection is down. It can be configured to error them instead.

And in this case, the upper layers will silently drop write errors on
current kernels.

Cisco's Linux iSCSI driver has a configurable timeout, defaulting to
'infinite', btw.

> What I would like is some way of telling how backed up the VM is
> against us. If the VM is full of dirty buffers aimed at us, then
> I think we should consider erroring instead of blocking. The problem is
> that at that point we're likely not getting any requests at all,
> because the kernel long ago ran out of the 256 requests it has in
> hand to send us. 

Hrrmm. The potential to lose data by surprise here is not terribly
appealing. It might be better to add an accounting mechanism to say
"never go above x dirty pages against block device n" or something of
the sort but you can still get into trouble if you happen to have
hundreds of iSCSI devices each with their own request queue..

-- 
Matt Mackall : http://www.selenic.com : of or relating to the moon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26  6:48         ` Matt Mackall
@ 2003-03-26  7:05           ` Peter T. Breuer
  0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-26  7:05 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Peter T. Breuer, Jeff Garzik, Justin Cormack, linux kernel

"A month of sundays ago Matt Mackall wrote:"
> > > Both iSCSI and ENBD currently have issues with pending writes during
> > > network outages. The current I/O layer fails to report failed writes
> > > to fsync and friends.
> > 
> > ENBD has two (configurable) behaviors here. Perhaps it should have
> > more. By default it blocks pending reads and writes during times when
> > the connection is down. It can be configured to error them instead.
> 
> And in this case, the upper layers will silently drop write errors on
> current kernels.
> 
> Cisco's Linux iSCSI driver has a configurable timeout, defaulting to
> 'infinite', btw.

That corresponds to enbd's default behavior.  Sigh. Guess I'll have to
make it 0-infty, instead of 0 or infty. It's easy enough - just need to
make it settable in proc (and think about which is the one line I need
to touch ...).

> Hrrmm. The potential to lose data by surprise here is not terribly
> appealing.

Making the driver "intelligent" is indeed bad news for the more
intelligent admin.  I was thinking of making it default to 0 timeout if
it knows it's running under raid, but I have a natural antipathy to
such in-driver decisions. My conscience would be slightly less on
alert if the userspace daemon did the decision-making. I suppose it
could.

>  It might be better to add an accounting mechanism to say
> "never go above x dirty pages against block device n" or something of
> the sort but you can still get into trouble if you happen to have
> hundreds of iSCSI devices each with their own request queue..

Well, you can get in trouble if you allow even a single dirty page to
be outstanding to something, and have thousands of those somethings.

That's not the normal situation, however, whereas it is normal to
have a single network device and to be writing pell-mell to it
oblivious to the state of the device itself.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26  5:55     ` Matt Mackall
  2003-03-26  6:31       ` Peter T. Breuer
@ 2003-03-26  6:59       ` Andre Hedrick
  2003-03-26 13:58         ` Jeff Garzik
  2003-03-26  7:31       ` Lincoln Dale
  2 siblings, 1 reply; 27+ messages in thread
From: Andre Hedrick @ 2003-03-26  6:59 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel

On Tue, 25 Mar 2003, Matt Mackall wrote:

> > Yeah, iSCSI handles all that and more.  It's a behemoth of a 
> > specification.  (whether a particular implementation implements all that 
> > stuff correctly is another matter...)
> 
> Indeed, there are iSCSI implementations that do multipath and
> failover.
> 
> Both iSCSI and ENBD currently have issues with pending writes during
> network outages. The current I/O layer fails to report failed writes
> to fsync and friends.
> 
> > BTW, I'm a big enbd fan :)  I like enbd for it's _simplicity_ compared 
> > to iSCSI.
> 
> Definitely. The iSCSI protocol is more powerful but _much_ more
> complex than ENBD. I've spent two years working on iSCSI but guess
> which I use at home..

To be totally fair ENBD/NBD is not SAN nor will it ever become a qualified
SAN.  I would like to see what happens to your data if you wrap your
ethernet around a ballast resistor or even run it near by the associated
light fixture, and blink the the power/lights.  This is where goofy people
run cables in the drop ceilings.

We have almost finalized our initiator to be submitted under OSL/GPL.
This will be a full RFC ERL=2 w/ Sync-n-Steering.

I have seen to much GPL code stolen and could do nothing about it.

While the code is not in binary executable form it shall be under OSL
only.  Only at compile and execution time will it become GPL compliant
period.  This is designed to extend the copyright holders rights to force
anyone who uses the code and changes anything to return to open to the
copyright holder period.  Additional language may be added to permit not
return code to exist under extremely heavy licensing fees to be used to
promote OSL projects and assist any GPL holder with litigation fees to
protect their rights.  To many of us do not have the means to defend our
copyrights, this is one way I can see to provide a plausable solution to a
dirty problem.  This may not be the best answer but it is doable.  This
will apply some teeth into the license to stop this from happening again.

Like it or hate it, OSL/GPL looks to be the best match out there.

Regards,

Andre Hedrick, CTO & Founder 
iSCSI Software Solutions Provider
http://www.PyXTechnologies.com/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26  6:59       ` Andre Hedrick
@ 2003-03-26 13:58         ` Jeff Garzik
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-03-26 13:58 UTC (permalink / raw)
  To: Andre Hedrick; +Cc: Matt Mackall, ptb, Justin Cormack, linux kernel

Andre Hedrick wrote:
> We have almost finalized our initiator to be submitted under OSL/GPL.
> This will be a full RFC ERL=2 w/ Sync-n-Steering.

That's pretty good news.

Also, I tangent and mention that I have been won over WRT OSL:  with its 
more tight "lawyerspeak" and mutual patent defense clauses, I consider 
OSL to be a "better GPL" license.

I would in fact love to see the Linux kernel relicensed under OSL.  I 
think that would close some "holes" that exist with the GPL, and give us 
a better legal standing.  But relicensing the kernel would be huge 
political undertaking, and I sure as hell don't have the energy, even if 
it possible.  No idea if Linus, Alan, Andrew, or any of the other major 
contributors would go for it, either.

	Jeff, the radical

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26  5:55     ` Matt Mackall
  2003-03-26  6:31       ` Peter T. Breuer
  2003-03-26  6:59       ` Andre Hedrick
@ 2003-03-26  7:31       ` Lincoln Dale
  2003-03-26  9:59         ` Lars Marowsky-Bree
                           ` (3 more replies)
  2 siblings, 4 replies; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26  7:31 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel

At 11:55 PM 25/03/2003 -0600, Matt Mackall wrote:
> > Yeah, iSCSI handles all that and more.  It's a behemoth of a
> > specification.  (whether a particular implementation implements all that
> > stuff correctly is another matter...)
>
>Indeed, there are iSCSI implementations that do multipath and
>failover.

iSCSI is a transport.
logically, any "multipathing" and "failover" belongs in a layer above it -- 
typically as a block-layer function -- and not as a transport-layer function.

multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, DevMapper 
-- or in a commercial implementation such as Veritas VxDMP, HDS HDLM, EMC 
PowerPath, ...

>Both iSCSI and ENBD currently have issues with pending writes during
>network outages. The current I/O layer fails to report failed writes
>to fsync and friends.

these are not "iSCSI" or "ENBD" issues.  these are issues with VFS.


cheers,

lincoln.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26  7:31       ` Lincoln Dale
@ 2003-03-26  9:59         ` Lars Marowsky-Bree
  2003-03-26 10:18           ` Andrew Morton
  2003-03-26 13:49         ` Jeff Garzik
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Lars Marowsky-Bree @ 2003-03-26  9:59 UTC (permalink / raw)
  To: linux kernel

On 2003-03-26T18:31:31,
   Lincoln Dale <ltd@cisco.com> said:

> >Indeed, there are iSCSI implementations that do multipath and
> >failover.
> iSCSI is a transport.
> logically, any "multipathing" and "failover" belongs in a layer above it -- 

"Multipathing" on iSCSI is actually a layer below - network resiliency is
handled by routing protocols, the switching fabric etc.

> >Both iSCSI and ENBD currently have issues with pending writes during
> >network outages. The current I/O layer fails to report failed writes
> >to fsync and friends.
> these are not "iSCSI" or "ENBD" issues.  these are issues with VFS.

Yes, and it is a fairly annoying issue... In particular with ENBD, a partial
write could occur at the block device layer. Now try to report that upwards to
the write() call. Good luck.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
SuSE Labs - Research & Development, SuSE Linux AG
  
"If anything can go wrong, it will." "Chance favors the prepared (mind)."
  -- Capt. Edward A. Murphy            -- Louis Pasteur

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26  9:59         ` Lars Marowsky-Bree
@ 2003-03-26 10:18           ` Andrew Morton
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2003-03-26 10:18 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: linux-kernel

Lars Marowsky-Bree <lmb@suse.de> wrote:
>
> In particular with ENBD, a partial write could occur at the block device
> layer. Now try to report that upwards to the write() call. Good luck.

Well you can't, unless it is an O_SYNC or O_DIRECT write...

But yes, for a normal old write() followed by an fsync() the IO error can be
lost.  We'll fix this for 2.6.  I have oxymoron's patches lined up, but they
need a couple of quality hours' worth of thinking about yet.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26  7:31       ` Lincoln Dale
  2003-03-26  9:59         ` Lars Marowsky-Bree
@ 2003-03-26 13:49         ` Jeff Garzik
  2003-03-26 16:09         ` Matt Mackall
       [not found]         ` <5.1.0.14.2.20030327085031.04aa7128@mira-sjcm-3.cisco.com>
  3 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-03-26 13:49 UTC (permalink / raw)
  To: Lincoln Dale; +Cc: Matt Mackall, ptb, Justin Cormack, linux kernel

Lincoln Dale wrote:
> At 11:55 PM 25/03/2003 -0600, Matt Mackall wrote:
> 
>> > Yeah, iSCSI handles all that and more.  It's a behemoth of a
>> > specification.  (whether a particular implementation implements all 
>> that
>> > stuff correctly is another matter...)
>>
>> Indeed, there are iSCSI implementations that do multipath and
>> failover.
> 
> 
> iSCSI is a transport.
> logically, any "multipathing" and "failover" belongs in a layer above it 
> -- typically as a block-layer function -- and not as a transport-layer 
> function.
> 
> multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, 
> DevMapper -- or in a commercial implementation such as Veritas VxDMP, 
> HDS HDLM, EMC PowerPath, ...

I think you will find that most Linux kernel developers agree w/ you :)

That said, iSCSI error recovery can be considered as supporting some of 
what multipathing and failover accomplish.  iSCSI can be shoving bits 
through multiple TCP connections, or fail over from one TCP connection 
to another.


>> Both iSCSI and ENBD currently have issues with pending writes during
>> network outages. The current I/O layer fails to report failed writes
>> to fsync and friends.

...not if your iSCSI implementation is up to spec.  ;-)


> these are not "iSCSI" or "ENBD" issues.  these are issues with VFS.

VFS+VM.  But, agreed.

	Jeff




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26  7:31       ` Lincoln Dale
  2003-03-26  9:59         ` Lars Marowsky-Bree
  2003-03-26 13:49         ` Jeff Garzik
@ 2003-03-26 16:09         ` Matt Mackall
       [not found]         ` <5.1.0.14.2.20030327085031.04aa7128@mira-sjcm-3.cisco.com>
  3 siblings, 0 replies; 27+ messages in thread
From: Matt Mackall @ 2003-03-26 16:09 UTC (permalink / raw)
  To: Lincoln Dale; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel

On Wed, Mar 26, 2003 at 06:31:31PM +1100, Lincoln Dale wrote:
> At 11:55 PM 25/03/2003 -0600, Matt Mackall wrote:
> >> Yeah, iSCSI handles all that and more.  It's a behemoth of a
> >> specification.  (whether a particular implementation implements all that
> >> stuff correctly is another matter...)
> >
> >Indeed, there are iSCSI implementations that do multipath and
> >failover.
> 
> iSCSI is a transport.
> logically, any "multipathing" and "failover" belongs in a layer above it -- 
> typically as a block-layer function -- and not as a transport-layer 
> function.
>
> multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, DevMapper 
> PowerPath, ...

Funny then that I should be talking about Cisco's driver. :P

iSCSI inherently has more interesting reconnect logic than other block
devices, so it's fairly trivial to throw in recognition of identical
devices discovered on two or more iSCSI targets..
 
> >Both iSCSI and ENBD currently have issues with pending writes during
> >network outages. The current I/O layer fails to report failed writes
> >to fsync and friends.
> 
> these are not "iSCSI" or "ENBD" issues.  these are issues with VFS.

Except that the issue simply doesn't show up for anyone else, which is
why it hasn't been fixed yet. Patches are in the works, but they need
more testing:

http://www.selenic.com/linux/write-error-propagation/

-- 
Matt Mackall : http://www.selenic.com : of or relating to the moon

^ permalink raw reply	[flat|nested] 27+ messages in thread

[parent not found: <5.1.0.14.2.20030327085031.04aa7128@mira-sjcm-3.cisco.com>]

* Re: [PATCH] ENBD for 2.5.64
       [not found]         ` <5.1.0.14.2.20030327085031.04aa7128@mira-sjcm-3.cisco.com>
@ 2003-03-26 22:40           ` Matt Mackall
  0 siblings, 0 replies; 27+ messages in thread
From: Matt Mackall @ 2003-03-26 22:40 UTC (permalink / raw)
  To: Lincoln Dale; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel

On Thu, Mar 27, 2003 at 09:10:14AM +1100, Lincoln Dale wrote:
> At 10:09 AM 26/03/2003 -0600, Matt Mackall wrote:
> >> >Indeed, there are iSCSI implementations that do multipath and
> >> >failover.
> >>
> >> iSCSI is a transport.
> >> logically, any "multipathing" and "failover" belongs in a layer above 
> >it --
> >> typically as a block-layer function -- and not as a transport-layer
> >> function.
> >>
> >> multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, 
> >DevMapper
> >> PowerPath, ...
> >
> >Funny then that I should be talking about Cisco's driver. :P
> 
> :-)
> 
> see my previous email to Jeff.  iSCSI as a transport protocol does have a 
> muxing capability -- but its usefulness is somewhat limited (imho).
> 
> >iSCSI inherently has more interesting reconnect logic than other block
> >devices, so it's fairly trivial to throw in recognition of identical
> >devices discovered on two or more iSCSI targets..
> 
> what logic do you use to identify "identical devices"?
> same data reported from SCSI Report_LUNs?  or perhaps the same data 
> reported from a SCSI_Inquiry?

Sorry, can't remember.
 
> does one now need to add logic into the kernel to provide some multipathing 
> for HDS disks?

No, most of it was done in userspace.

> >> >Both iSCSI and ENBD currently have issues with pending writes during
> >> >network outages. The current I/O layer fails to report failed writes
> >> >to fsync and friends.
> >>
> >> these are not "iSCSI" or "ENBD" issues.  these are issues with VFS.
> >
> >Except that the issue simply doesn't show up for anyone else, which is
> >why it hasn't been fixed yet. Patches are in the works, but they need
> >more testing:
> >
> >http://www.selenic.com/linux/write-error-propagation/
> 
> oh, but it does show up for other people.  it may be that the issue doesn't 
> show up at fsync() time, but rather at close() time, or perhaps neither of 
> those!

Write errors basically don't happen for people who have attached
storage unless their drives die. Which is why the fact that the
pagecache completely ignores I/O errors has gone unnoticed for years..
 
> code looks interesting.  i'll take a look.
> hmm, must find out a way to intentionally introduce errors now and see what 
> happens!

We stumbled on it by pulling cables to make failover happen.

-- 
Matt Mackall : http://www.selenic.com : of or relating to the moon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-25 20:53 ` [PATCH] ENBD for 2.5.64 Peter T. Breuer
  2003-03-26  2:40   ` Jeff Garzik
@ 2003-03-28 11:19   ` Pavel Machek
  2003-03-30 20:48     ` Peter T. Breuer
  1 sibling, 1 reply; 27+ messages in thread
From: Pavel Machek @ 2003-03-28 11:19 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Justin Cormack, linux kernel

Hi!

>   9) it drops into a mode where it md5sums both ends and skips writes
>   of equal blocks, if that's faster. It moves in and out of this mode
>   automatically. This helps RAID resyncs (2* overspeed is common on
>   100BT nets, that is 20MB/s.).

Great way to find md5 collisions, I guess
:-).
				Pavel
-- 
				Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-28 11:19   ` Pavel Machek
@ 2003-03-30 20:48     ` Peter T. Breuer
  0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-30 20:48 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Peter T. Breuer, Justin Cormack, linux kernel

"A month of sundays ago Pavel Machek wrote:"
> Hi!
> 
> >   9) it drops into a mode where it md5sums both ends and skips writes
> >   of equal blocks, if that's faster. It moves in and out of this mode
> >   automatically. This helps RAID resyncs (2* overspeed is common on
> >   100BT nets, that is 20MB/s.).
> 
> Great way to find md5 collisions, I guess
> :-).

Don't worry, I'm not planning on claiming the Turing medal! Or living for
the lifetime of the universe .. :-(.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
@ 2003-03-26 22:16 Lincoln Dale
  2003-03-26 22:56 ` Lars Marowsky-Bree
  0 siblings, 1 reply; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 22:16 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel

At 10:09 AM 26/03/2003 -0600, Matt Mackall wrote:
> > >Indeed, there are iSCSI implementations that do multipath and
> > >failover.
> >
> > iSCSI is a transport.
> > logically, any "multipathing" and "failover" belongs in a layer above 
> it --
> > typically as a block-layer function -- and not as a transport-layer
> > function.
> >
> > multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, 
> DevMapper
> > PowerPath, ...
>
>Funny then that I should be talking about Cisco's driver. :P

:-)

see my previous email to Jeff.  iSCSI as a transport protocol does have a 
muxing capability -- but its usefulness is somewhat limited (imho).

>iSCSI inherently has more interesting reconnect logic than other block
>devices, so it's fairly trivial to throw in recognition of identical
>devices discovered on two or more iSCSI targets..

what logic do you use to identify "identical devices"?
same data reported from SCSI Report_LUNs?  or perhaps the same data 
reported from a SCSI_Inquiry?

in reality, all multipathing software tends to use some blocks at the end 
of the disk (just in the same way that most LVMs do also).

for example, consider the following output from a set of two SCSI_Inquiry 
and Report_LUNs on two paths to storage:
         Lun Description Table
         WWPN             Lun   Capacity Vendor       Product      Serial
         ---------------- ----- -------- ------------ ------------ ------
         Path A:
         21000004cf8c21fb 0     16GB     HP 18.2G     ST318452FC   3EV0BD8E
         21000004cf8c21c5 0     16GB     HP 18.2G     ST318452FC   3EV0KHHP
         50060e8000009591 0     50GB     HITACHI      DF500F       DF500-00B
         50060e8000009591 1     50GB     HITACHI      DF500F       DF500-00B
         50060e8000009591 2     50GB     HITACHI      DF500F       DF500-00B
         50060e8000009591 3     50GB     HITACHI      DF500F       DF500-00B

         Path B:
         31000004cf8c21fb 0     16GB     HP 18.2G     ST318452FC   3EV0BD8E
         31000004cf8c21c5 0     16GB     HP 18.2G     ST318452FC   3EV0KHHP
         50060e8000009591 0     50GB     HITACHI      DF500F       DF500-00A
         50060e8000009591 1     50GB     HITACHI      DF500F       DF500-00A
         50060e8000009591 2     50GB     HITACHI      DF500F       DF500-00A
         50060e8000009591 3     50GB     HITACHI      DF500F       DF500-00A

the "HP 18.2G" devices are 18G FC disks in a FC JBOD.  each disk will 
report an identical Serial # regardless of the interface/path used to get 
to that device.  no issues there right -- you can identify the disk as 
being unique via its "Serial #" and can see the interface used to get to it 
via its WWPN.

now, take a look at some disk from an intelligent disk array (in this case, 
a HDS 9200).
it reports a _different_ serial number for the same disk, dependent on the 
interface used.  (DF500 is the model # of a HDS 9200, interfaces are 
numbered 00A/00B/01A/01B).

does one now need to add logic into the kernel to provide some multipathing 
for HDS disks?
does using linux mean that one had to change some settings on the HDS disk 
array to get it to report different information via a SCSI_Inquiry?  (it 
can - but thats not the point - the point is that any multipathing software 
out there just 'works' right now).

this is just one example.  i could probably find another 50 of 
slightly-different-behavior if you wanted me to!

> > >Both iSCSI and ENBD currently have issues with pending writes during
> > >network outages. The current I/O layer fails to report failed writes
> > >to fsync and friends.
> >
> > these are not "iSCSI" or "ENBD" issues.  these are issues with VFS.
>
>Except that the issue simply doesn't show up for anyone else, which is
>why it hasn't been fixed yet. Patches are in the works, but they need
>more testing:
>
>http://www.selenic.com/linux/write-error-propagation/

oh, but it does show up for other people.  it may be that the issue doesn't 
show up at fsync() time, but rather at close() time, or perhaps neither of 
those!

code looks interesting.  i'll take a look.
hmm, must find out a way to intentionally introduce errors now and see what 
happens!

cheers,

lincoln.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26 22:16 Lincoln Dale
@ 2003-03-26 22:56 ` Lars Marowsky-Bree
  2003-03-26 23:21   ` Lincoln Dale
  0 siblings, 1 reply; 27+ messages in thread
From: Lars Marowsky-Bree @ 2003-03-26 22:56 UTC (permalink / raw)
  To: Lincoln Dale, Matt Mackall; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel

On 2003-03-27T09:16:18,
   Lincoln Dale <ltd@cisco.com> said:

> what logic do you use to identify "identical devices"?
> same data reported from SCSI Report_LUNs?  or perhaps the same data 
> reported from a SCSI_Inquiry?

That would work well.

We do parse device specific information in order to auto-configure the md
multipath at setup time. After that, magic is on disk...

> does one now need to add logic into the kernel to provide some multipathing
> for HDS disks?

Topology discovery is user-space! It does not need to live in the kernel.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
SuSE Labs - Research & Development, SuSE Linux AG
  
"If anything can go wrong, it will." "Chance favors the prepared (mind)."
  -- Capt. Edward A. Murphy            -- Louis Pasteur

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26 22:56 ` Lars Marowsky-Bree
@ 2003-03-26 23:21   ` Lincoln Dale
  0 siblings, 0 replies; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 23:21 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Matt Mackall, Jeff Garzik, ptb, Justin Cormack, linux kernel

Hi Lars,

At 11:56 PM 26/03/2003 +0100, Lars Marowsky-Bree wrote:
[..]
>We do parse device specific information in order to auto-configure the md
>multipath at setup time. After that, magic is on disk...
>
> > does one now need to add logic into the kernel to provide some multipathing
> > for HDS disks?
>
>Topology discovery is user-space! It does not need to live in the kernel.

i think we're agreeing on the same thing here!

yes, i believe topology discovery should only belong in userspace.
i believe it should be in userspace for both (a) setup and (b) at 
kernel-boot-time

likewise, i believe policy of deciding what mix of i/o's to put down 
different paths also belongs in userspace.
this could take the form of a daemon that frequently looks up statistics 
from the kernel (e.g. average latency per target), and uses that 
information in conjunction with some 'policy' to tweak what paths are used.
but i definitely don't think that the kernel should make any wide-ranging 
decisions about multiple paths, except beyond something like "deviceA has 
disappeared.  i know that deviceB is an alternate path, so will swing all 
outstanding i/o plugged into A to B".

cheers,

lincoln.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
@ 2003-03-26 22:16 Lincoln Dale
  2003-03-26 22:32 ` Andre Hedrick
       [not found] ` <Pine.LNX.4.10.10303261422580.25072-100000@master.linux-ide .org>
  0 siblings, 2 replies; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 22:16 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Matt Mackall, ptb, Justin Cormack, linux kernel

At 08:49 AM 26/03/2003 -0500, Jeff Garzik wrote:
>>>Indeed, there are iSCSI implementations that do multipath and
>>>failover.
>>
>>iSCSI is a transport.
>>logically, any "multipathing" and "failover" belongs in a layer above it 
>>-- typically as a block-layer function -- and not as a transport-layer 
>>function.
>>multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, 
>>DevMapper -- or in a commercial implementation such as Veritas VxDMP, HDS 
>>HDLM, EMC PowerPath, ...
>
>I think you will find that most Linux kernel developers agree w/ you :)
>
>That said, iSCSI error recovery can be considered as supporting some of 
>what multipathing and failover accomplish.  iSCSI can be shoving bits 
>through multiple TCP connections, or fail over from one TCP connection to 
>another.

while the iSCSI spec has the concept of a "network portal" that can have 
multiple TCP streams for i/o, in the real world, i'm yet to see anything 
actually use those multiple streams.
the reason why goes back to how SCSI works.  take a ethereal trace of iSCSI 
and you'll see the way that 2 round-trips are used before any typical i/o 
operation (read or write op) occurs.
multiple TCP streams for a given iSCSI session could potentially be used to 
achieve greater performance when the maximum-window-size of a single TCP 
stream is being hit.
but its quite rare for this to happen.

in reality, if you had multiple TCP streams, its more likely you're doing 
it for high-availability reasons (i.e. multipathing).
if you're multipathing, the chances are you want to multipath down two 
separate paths to two different iSCSI gateways.  (assuming you're talking 
to traditional SAN storage and you're gatewaying into Fibre Channel).

handling multipathing in that manner is well beyond the scope of what an 
iSCSI driver in the kernel should be doing.
determining the policy (read-preferred / write-preferred / round-robin / 
ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are 
used is most definitely something that should NEVER be in the kernel.

btw, the performance of iSCSI over a single TCP stream is also a moot one also.
from a single host (IBM x335 Server i think?) communicating with a FC disk 
via an iSCSI gateway:
         mds# sh int gig2/1
         GigabitEthernet2/1 is up
             Hardware is GigabitEthernet, address is xxxx.xxxx.xxxx
             Internet address is xxx.xxx.xxx.xxx/24
             MTU 1500  bytes, BW 1000000 Kbit
             Port mode is IPS
             Speed is 1 Gbps
             Beacon is turned off
             5 minutes input rate 21968640 bits/sec, 2746080 bytes/sec, 
40420 frames/sec
             5 minutes output rate 929091696 bits/sec, 116136462 bytes/sec, 
80679 frames/sec
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
             74228360 packets input, 13218256042 bytes
               15409 multicast frames, 0 compressed
               0 input errors, 0 frame, 0 overrun 0 fifo
             169487726 packets output, 241066793565 bytes, 0 underruns
               0 output errors, 0 collisions, 0 fifo
               0 carrier errors

not bad for a single TCP stream and a software iSCSI stack. :-)
(kernel is 2.4.20)

>>>Both iSCSI and ENBD currently have issues with pending writes during
>>>network outages. The current I/O layer fails to report failed writes
>>>to fsync and friends.
>
>...not if your iSCSI implementation is up to spec.  ;-)
>
>>these are not "iSCSI" or "ENBD" issues.  these are issues with VFS.
>
>VFS+VM.  But, agreed.

sure - the devil is in the details - but the issue holds true for 
traditional block devices at this point also.

cheers,

lincoln.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26 22:16 Lincoln Dale
@ 2003-03-26 22:32 ` Andre Hedrick
       [not found] ` <Pine.LNX.4.10.10303261422580.25072-100000@master.linux-ide .org>
  1 sibling, 0 replies; 27+ messages in thread
From: Andre Hedrick @ 2003-03-26 22:32 UTC (permalink / raw)
  To: Lincoln Dale; +Cc: Jeff Garzik, Matt Mackall, ptb, Justin Cormack, linux kernel

On Thu, 27 Mar 2003, Lincoln Dale wrote:

> while the iSCSI spec has the concept of a "network portal" that can have 
> multiple TCP streams for i/o, in the real world, i'm yet to see anything 
> actually use those multiple streams.

Want a DEMO?  It is call Sync-WAN-Raid-Relay.

> the reason why goes back to how SCSI works.  take a ethereal trace of iSCSI 
> and you'll see the way that 2 round-trips are used before any typical i/o 
> operation (read or write op) occurs.
> multiple TCP streams for a given iSCSI session could potentially be used to 
> achieve greater performance when the maximum-window-size of a single TCP 
> stream is being hit.
> but its quite rare for this to happen.
> 
> in reality, if you had multiple TCP streams, its more likely you're doing 
> it for high-availability reasons (i.e. multipathing).
> if you're multipathing, the chances are you want to multipath down two 
> separate paths to two different iSCSI gateways.  (assuming you're talking 
> to traditional SAN storage and you're gatewaying into Fibre Channel).

Why a SAN gateway switch, they are all LAN limited.

> handling multipathing in that manner is well beyond the scope of what an 
> iSCSI driver in the kernel should be doing.
> determining the policy (read-preferred / write-preferred / round-robin / 
> ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are 
> used is most definitely something that should NEVER be in the kernel.

Only "NEVER" if you are depending on classic bloated SAN
hardware/gateways.  The very operations you are calling never, is done in
the gateways which is nothing more or less than an embedded system on
crack.  So if this is an initiator which can manage sequencing streams, it
is far superior than dealing with the SAN traps of today.

> btw, the performance of iSCSI over a single TCP stream is also a moot one also.
> from a single host (IBM x335 Server i think?) communicating with a FC disk 
> via an iSCSI gateway:
>          mds# sh int gig2/1
>          GigabitEthernet2/1 is up
>              Hardware is GigabitEthernet, address is xxxx.xxxx.xxxx
>              Internet address is xxx.xxx.xxx.xxx/24
>              MTU 1500  bytes, BW 1000000 Kbit
>              Port mode is IPS
>              Speed is 1 Gbps
>              Beacon is turned off
>              5 minutes input rate 21968640 bits/sec, 2746080 bytes/sec, 
> 40420 frames/sec
>              5 minutes output rate 929091696 bits/sec, 116136462 bytes/sec, 
> 80679 frames/sec
>                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>              74228360 packets input, 13218256042 bytes
>                15409 multicast frames, 0 compressed
>                0 input errors, 0 frame, 0 overrun 0 fifo
>              169487726 packets output, 241066793565 bytes, 0 underruns
>                0 output errors, 0 collisions, 0 fifo
>                0 carrier errors

What do you have for real iSCSI and no FC junk not supporting 
interoperability?

FC is dying and nobody who has wasted money on FC junk will be interested
in iSCSI.  They wasted piles of money and have to justify it.

> not bad for a single TCP stream and a software iSCSI stack. :-)
> (kernel is 2.4.20)

Nice numbers, now do it over WAN.

> >>>Both iSCSI and ENBD currently have issues with pending writes during
> >>>network outages. The current I/O layer fails to report failed writes
> >>>to fsync and friends.
> >
> >...not if your iSCSI implementation is up to spec.  ;-)
> >
> >>these are not "iSCSI" or "ENBD" issues.  these are issues with VFS.
> >
> >VFS+VM.  But, agreed.
> 
> sure - the devil is in the details - but the issue holds true for 
> traditional block devices at this point also.

Sweet kicker here, if you only allow the current rules of SAN to apply.
This is what the big dogs want, and no new ideas allowed.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

PS poking back at you for fun and serious points.


^ permalink raw reply	[flat|nested] 27+ messages in thread

[parent not found: <Pine.LNX.4.10.10303261422580.25072-100000@master.linux-ide .org>]

* Re: [PATCH] ENBD for 2.5.64
       [not found] ` <Pine.LNX.4.10.10303261422580.25072-100000@master.linux-ide .org>
@ 2003-03-26 23:03   ` Lincoln Dale
  2003-03-26 23:39     ` Andre Hedrick
  0 siblings, 1 reply; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 23:03 UTC (permalink / raw)
  To: Andre Hedrick
  Cc: Jeff Garzik, Matt Mackall, ptb, Justin Cormack, linux kernel

Andre,

At 02:32 PM 26/03/2003 -0800, Andre Hedrick wrote:
> > in reality, if you had multiple TCP streams, its more likely you're doing
> > it for high-availability reasons (i.e. multipathing).
> > if you're multipathing, the chances are you want to multipath down two
> > separate paths to two different iSCSI gateways.  (assuming you're talking
> > to traditional SAN storage and you're gatewaying into Fibre Channel).
>
>Why a SAN gateway switch, they are all LAN limited.

?
hmm, where to start:

why a SAN gateway?
because (a) that's what is out there right now, (b) iSCSI is really the 
enabler for people to connect to consolodated storage (that they already 
have) at a cheaper price-point than FC.

LAN limited?
10GE is reality.  so is etherchannel where you have 8xGE trunked 
together.  "LAN is limited" is a rather bold statement that doesn't support 
the facts.

in reality, most applications do NOT want to push 100mbyte/sec of i/o -- or 
even 20mbyte/sec.
sure -- benchmarking programs do -- and i could show you a single host 
pushing 425mbyte/sec using 2 x 2gbit/s FC HBAs -- but in reality, thats way 
overkill for most people.

i know that your company is working on native iSCSI storage arrays; 
obviously its in your interests to talk about native iSCSI access to disks, 
but right now, i'll talk about how people deploy TB of storage today.  this 
is most likely a different market segment to what you're working on (at 
least i hope you think it is) - but a discussion on those merits are not 
something that is useful in l-k.

> > handling multipathing in that manner is well beyond the scope of what an
> > iSCSI driver in the kernel should be doing.
> > determining the policy (read-preferred / write-preferred / round-robin /
> > ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are
> > used is most definitely something that should NEVER be in the kernel.
>
>Only "NEVER" if you are depending on classic bloated SAN
>hardware/gateways.  The very operations you are calling never, is done in
>the gateways which is nothing more or less than an embedded system on
>crack.  So if this is an initiator which can manage sequencing streams, it
>is far superior than dealing with the SAN traps of today.

err, either you don't understand multipathing or i don't.

"multipathing" is end-to-end between an initiator and a target.
typically that initiator is a host and multipathing software is installed 
on that host.
the target is typically a disk or disk-array.  the disk array may have 
multiple controllers and those show up as multiple targets.

the thing about multipathing is that it doesn't depend on any magic in "SAN 
hardware/gateways" (sic) -- its simply a case of the host seeing the same 
disk via two interfaces and choosing to use one/both of those interfaces to 
talk to that disk.

[..]
>What do you have for real iSCSI and no FC junk not supporting
>interoperability?

?
no idea what you're talking about here.

>FC is dying and nobody who has wasted money on FC junk will be interested
>in iSCSI.  They wasted piles of money and have to justify it.

lets just agree to disagree.  i don't hold that view.

> > not bad for a single TCP stream and a software iSCSI stack. :-)
> > (kernel is 2.4.20)
>
>Nice numbers, now do it over WAN.

sustaining the throughput is simply a matter of:
  - having a large enough TCP window
  - ensuring all the TCP go-fast options are enabled
  - ensuring you can have a sufficient number of IO operations outstanding
    to allow SCSI to actually be able to fill the TCP window.

realistically, yes, this can sustain high throughput across a WAN.  but 
that WAN has to be built right in the first place.
i.e. if its moving other traffic, provide QoS to allow storage traffic to 
have preference.

>Sweet kicker here, if you only allow the current rules of SAN to apply.
>This is what the big dogs want, and no new ideas allowed.

i definitely don't subscribe to your conspiracy theories here.  sorry.

>PS poking back at you for fun and serious points.

yes - i figured.  i'm happy to have a meaningful technical discussion, but 
don't have the cycles to discuss the universe.

cheers,

lincoln.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26 23:03   ` Lincoln Dale
@ 2003-03-26 23:39     ` Andre Hedrick
  0 siblings, 0 replies; 27+ messages in thread
From: Andre Hedrick @ 2003-03-26 23:39 UTC (permalink / raw)
  To: Lincoln Dale; +Cc: Jeff Garzik, Matt Mackall, ptb, Justin Cormack, linux kernel

On Thu, 27 Mar 2003, Lincoln Dale wrote:

> > > to traditional SAN storage and you're gatewaying into Fibre Channel).
> >
> >Why a SAN gateway switch, they are all LAN limited.
> 
> ?
> hmm, where to start:
> 
> why a SAN gateway?
> because (a) that's what is out there right now, (b) iSCSI is really the 
> enabler for people to connect to consolodated storage (that they already 
> have) at a cheaper price-point than FC.
> 
> LAN limited?
> 10GE is reality.  so is etherchannel where you have 8xGE trunked 
> together.  "LAN is limited" is a rather bold statement that doesn't support 
> the facts.
> 
> in reality, most applications do NOT want to push 100mbyte/sec of i/o -- or 
> even 20mbyte/sec.
> sure -- benchmarking programs do -- and i could show you a single host 
> pushing 425mbyte/sec using 2 x 2gbit/s FC HBAs -- but in reality, thats way 
> overkill for most people.

We agree this is even overkill for people like Pixar and the movie people.

> i know that your company is working on native iSCSI storage arrays; 
> obviously its in your interests to talk about native iSCSI access to disks, 
> but right now, i'll talk about how people deploy TB of storage today.  this 
> is most likely a different market segment to what you're working on (at 
> least i hope you think it is) - but a discussion on those merits are not 
> something that is useful in l-k.

Well we deploy ERL=1 or ERL=2(%80) today on 6TB platforms now.
So the democratization of SAN is now and today.

> > > handling multipathing in that manner is well beyond the scope of what an
> > > iSCSI driver in the kernel should be doing.
> > > determining the policy (read-preferred / write-preferred / round-robin /
> > > ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are
> > > used is most definitely something that should NEVER be in the kernel.
> >
> >Only "NEVER" if you are depending on classic bloated SAN
> >hardware/gateways.  The very operations you are calling never, is done in
> >the gateways which is nothing more or less than an embedded system on
> >crack.  So if this is an initiator which can manage sequencing streams, it
> >is far superior than dealing with the SAN traps of today.
> 
> err, either you don't understand multipathing or i don't.
> 
> "multipathing" is end-to-end between an initiator and a target.
> typically that initiator is a host and multipathing software is installed 
> on that host.
> the target is typically a disk or disk-array.  the disk array may have 
> multiple controllers and those show up as multiple targets.

Agreed, and apply as series of head-to-toe target-initiator pairs and you
get multipathing support native from the super target.  This is all a SAN
gateway/switch does.  Not much more than LVM on crack and a six-pack.

> the thing about multipathing is that it doesn't depend on any magic in "SAN 
> hardware/gateways" (sic) -- its simply a case of the host seeing the same 
> disk via two interfaces and choosing to use one/both of those interfaces to 
> talk to that disk.

Well Storage is nothing by a LIE and regardless if one spoofs and ident
mode pages or not, they must track and manage the resource reporting
properly.

> [..]
> >What do you have for real iSCSI and no FC junk not supporting
> >interoperability?
> 
> ?
> no idea what you're talking about here.

Erm, shove a MacData and Brocade switch on the same FC network and watch
it turn into a degraded dog.

> >FC is dying and nobody who has wasted money on FC junk will be interested
> >in iSCSI.  They wasted piles of money and have to justify it.
> 
> lets just agree to disagree.  i don't hold that view.

Guess that is why NetAPP snaked a big share of EMC's marketspace with a
cheaper mousetrap.  Agreed to "agree to disagree" erm whatever I just
typed.

> > > not bad for a single TCP stream and a software iSCSI stack. :-)
> > > (kernel is 2.4.20)
> >
> >Nice numbers, now do it over WAN.
> 
> sustaining the throughput is simply a matter of:
>   - having a large enough TCP window
>   - ensuring all the TCP go-fast options are enabled
>   - ensuring you can have a sufficient number of IO operations outstanding
>     to allow SCSI to actually be able to fill the TCP window.
> 
> realistically, yes, this can sustain high throughput across a WAN.  but 
> that WAN has to be built right in the first place.

Well sell more of those high bandwidth switches to the world of
internet-ether to make it faster, I would be happier.

> i.e. if its moving other traffic, provide QoS to allow storage traffic to 
> have preference.
> 
> >Sweet kicker here, if you only allow the current rules of SAN to apply.
> >This is what the big dogs want, and no new ideas allowed.
> 
> i definitely don't subscribe to your conspiracy theories here.  sorry.

You should listen to more Art Bell at night, well morning for you.

> >PS poking back at you for fun and serious points.
> 
> yes - i figured.  i'm happy to have a meaningful technical discussion, but 
> don't have the cycles to discuss the universe.

I did the universe once as an academic, it was fun.

http://schwab.tsuniv.edu/t13.html

This was my last time of stargazing and I miss it too!

Cheers,

Andre Hedrick
LAD Storage Consulting Group


^ permalink raw reply	[flat|nested] 27+ messages in thread

[parent not found: <5.1.0.14.2.20030327083757.037c0760@mira-sjcm-3.cisco.com>]

* Re: [PATCH] ENBD for 2.5.64
       [not found] <5.1.0.14.2.20030327083757.037c0760@mira-sjcm-3.cisco.com>
@ 2003-03-26 22:02 ` Peter T. Breuer
  2003-03-26 23:49   ` Lincoln Dale
  0 siblings, 1 reply; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-26 22:02 UTC (permalink / raw)
  To: Lincoln Dale; +Cc: Jeff Garzik, Matt Mackall, ptb, Justin Cormack, linux kernel

"Lincoln Dale wrote:"
> >what multipathing and failover accomplish.  iSCSI can be shoving bits 
> >through multiple TCP connections, or fail over from one TCP connection to 
> >another.
> 
> while the iSCSI spec has the concept of a "network portal" that can have 
> multiple TCP streams for i/o, in the real world, i'm yet to see anything 
> actually use those multiple streams.

I'll content myself with mentioning that ENBD has /always/ throughout
its five years of life had automatic failover between channels.  Mind
you, I don't think anybody makes use of the multichannel architecture in
practice for the purposes of redundancy (i.e.  people using multiple
channels don't pass them through different interfaces or routes, which
is the idea!), they may do it for speed/bandwidth.

But then surely they might as well use channel bonding in the network layer?
I've never tried it, or possibly never figured out how ..

> the reason why goes back to how SCSI works.  take a ethereal trace of iSCSI 
> and you'll see the way that 2 round-trips are used before any typical i/o 
> operation (read or write op) occurs.

Hmm.

I have some people telling me that I should pile up network packets
in order to avoid too many interrupts firing on Ge cards, and other
people telling me to send partial packets as soon as possible in order
to avoid buffer buildup.  My head spins.

> multiple TCP streams for a given iSCSI session could potentially be used to 
> achieve greater performance when the maximum-window-size of a single TCP 
> stream is being hit.
> but its quite rare for this to happen.

My considered opinion is that there are way too many variables here for
anyone to make sense of them.

> in reality, if you had multiple TCP streams, its more likely you're doing 
> it for high-availability reasons (i.e. multipathing).

Except that in real life most people don't know what they're doing and
they certainly don't know why they're doing it! In particular they
don't seem to get the idea that more redundancy is what they want.

I can almost see why.

But they can be persuaded to run multichannel by being promised more
speed.

> if you're multipathing, the chances are you want to multipath down two 
> separate paths to two different iSCSI gateways.  (assuming you're talking 
> to traditional SAN storage and you're gatewaying into Fibre Channel).

Yes. This is all that really makes sense for redundancy. And make sure
the routing is distinct too.

Then you start having problems maintaining request order across
multiple paths. At least I do.  But ENBD does it.

> determining the policy (read-preferred / write-preferred / round-robin / 
> ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are 
> used is most definitely something that should NEVER be in the kernel.

ENBD doesn't have any problem - it uses all the channels, by demand.
Each userspace daemon runs a different channel and each daemon picks
up requests to treat as soon as it can, as soon as there are any. The
kernel does not dictate. It's async.

(iscsi stream over tcp)
>              5 minutes output rate 929091696 bits/sec, 116136462 bytes/sec, 
> 80679 frames/sec
>                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Very impressive. I think the most that's been seen over ENBD is 60MB/s
sustained, across Ge.

> not bad for a single TCP stream and a software iSCSI stack. :-)
> (kernel is 2.4.20)

Ditto.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26 22:02 ` Peter T. Breuer
@ 2003-03-26 23:49   ` Lincoln Dale
  2003-03-27  0:08     ` Peter T. Breuer
  0 siblings, 1 reply; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 23:49 UTC (permalink / raw)
  To: ptb; +Cc: Jeff Garzik, Matt Mackall, ptb, Justin Cormack, linux kernel

Hi Peter,

At 11:02 PM 26/03/2003 +0100, Peter T. Breuer wrote:
>I'll content myself with mentioning that ENBD has /always/ throughout
>its five years of life had automatic failover between channels.  Mind
>you, I don't think anybody makes use of the multichannel architecture in
>practice for the purposes of redundancy (i.e.  people using multiple
>channels don't pass them through different interfaces or routes, which
>is the idea!), they may do it for speed/bandwidth.
>
>But then surely they might as well use channel bonding in the network layer?
>I've never tried it, or possibly never figured out how ..

"channel bonding" can handle cases whereby you lose a single NIC or port -- 
but typically channeling means that you need multiple paths into a single 
ethernet switch.
single ethernet switch = single point of failure.

hence, from a high-availability (HA) perspective, you're better off 
connecting N NICs into N switches -- and then load-balance (multipath) 
across those.

an interesting side-note is that channel-bonding doesn't necessarily mean 
higher performance.
i haven't looked at linux's channel-bonding, but many NICs on higher-end 
servers offer this as an option, and when enabled, you end up with multiple 
NICs with the same MAC address.  typically only one NIC is used for one 
direction of traffic.

> > the reason why goes back to how SCSI works.  take a ethereal trace of 
> iSCSI
> > and you'll see the way that 2 round-trips are used before any typical i/o
> > operation (read or write op) occurs.
>
>Hmm.
>I have some people telling me that I should pile up network packets
>in order to avoid too many interrupts firing on Ge cards, and other
>people telling me to send partial packets as soon as possible in order
>to avoid buffer buildup.  My head spins.

:-)
most "storage" people care more about latency than they do about raw 
performance.  coalescing packets = bad for latency.

i figure there has to be middle ground somewhere -- implement both and have 
it as a config option.

decent GE cards will do coalescing themselves anyway.

cheers,

lincoln.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
  2003-03-26 23:49   ` Lincoln Dale
@ 2003-03-27  0:08     ` Peter T. Breuer
  0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-27  0:08 UTC (permalink / raw)
  To: Lincoln Dale; +Cc: ptb, Jeff Garzik, Matt Mackall, Justin Cormack, linux kernel

"Lincoln Dale wrote:"
> Hi Peter,

Hi!

> decent GE cards will do coalescing themselves anyway.

 From what I confusedly remember of my last interchange with someone
convinced that packet coalescing (or lack of it, I forget which)
was the root of all evil, it's "all because" there's some magic limit
of 8K interrupts per second somewhere, and at 1.5KB per packet, that
would be only 12MB/s. So Ge cards wait after each interrupt to see if
there's some more stuff coming, so that they can treat more than one
packet at a time.

Apparently that means that if you have a two-way interchange in
your protocol at low level, they wait at the end of each half of
the protocol, even though you can't proceed with the protocol
until they decide to stop listening and start working. And the 
result is a severe slowdown.

In my naive opinion, hat should make ENBD's architecture (in which all
the channels going through the same NIC nevertheless work independently
and asynchronously) have an advantage, because pipelining effects
will fill up the slack time spaces in one channel's protocol with
activity from other channels.

But surely the number of channels required to fill up the waiting time
woulod be astronomical? Oh well.

Anyway, my head still spins. 

The point is that none of this is as easy or straightforward as it
seems. I suspect that pure storage people like andre will make a real
mess of the networking considerations. It's just not easy.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ENBD for 2.5.64
@ 2003-03-25 17:27 Peter T. Breuer
  0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-25 17:27 UTC (permalink / raw)
  To: linux kernel

"a litle while ago ptb wrote:"
> Here's a patch to incorporate Enhanced NBD (ENBD) into kernel 2.5.64.
> I'll put the patch on the list first, and then post again with a
> technical breakdown and various arguments/explanations.

I'll now put up the technical discussion I promised. (the patch is
also in the patches/ subdir in the archive at
ftp://oboe.it.uc3m.es/pub/Programs/nbd-2.4.31.tgz)

I'll repeat the dates .. Pavel's kernel NBD, 1997, the ENBD 1998,
derived initially from Pavel's code backported to stable kernels.
Pavel and I have been in contact many times over the years. 

Technical differences
---------------------
1) One of the original changes made was technical, and is perhaps
   the biggest reason for what incompatibilities there are (I can
   equalize the wire formats, but not the functional protocols, so you
   need different userspace support for the different kernel drivers).

   - kernel nbd runs a single thread transferring data between kernel
   and net.

   - ENBD runs multiple threads running asynchronously of each other

   The result is that ENBD can get a pipelining benefit ..  while one
   thread is sending to the net another is talking to the kernel and
   so on.  This shows up in different ways.  Obviously you do best if
   you have two cpus or two nics, etc.

   Also ENBD doesn't die when one thread gets stuck. I'll talk about
   that.

2) There is a difference in philosophy, which results in different
   code, different behaviors, etc. Basically, ENBD must not /fail/.
   It's supposed to keep working first and foremost, and deal with
   errors when they crop up, and it's supposed to expect errors.

   - kernel nbd runs a full kernel thread which cannot die. It loops
     inside the kernel.

   - ENBD runs userspace threads which can die and are expected to die
     and which are restarted by a master when they die. They only dip
     into the kernel occasionally.

   This originally arose because I was frustrated with not being able
   to kill the kernel nbd client daemon, and thus free up its "space".
   It certainly used to start what nowadays we know as a kernel thread,
   but from user space. It dove into the kernel in an ioctl and
   executed a forever loop there. ENBD doesn't do that. It runs the
   daemon cycle from user space via separate ioctls for each stage.

   That's why you need different user space utilities.

   - kernel nbd has daemons which are quite lightweight

   - ENBD has daemons which disconnect if they detect network failures
     and reconnect as soon as the net comes up again. Servers and
     clients can die, and be restarted, and they'll reconnect, entirely
     automatically, all on their little ownsomes ..

   ENBD is prepared internally to retract requests from client daemons
   which don't respond any longer, and pass them to others instead.
   It's tehrefore also prepared to receive acks out fo order, etc. etc.

   Another facet of all that is the following:

   - kernel nbd does networking from within the kernel

   - ENBD does its networking from userspace. It has to, to manage the
     complex reconnect handshakes, authentication, brownouts, etc.

   As a result, ENBD is much more flexible in its transport protocols.
   There is a single code module which implements a "stream", and
   the three or four method within need to be reimplementd for each
   protocol, but that's all. There are two standard transports in the
   distribution code - tcp and ssl, and other transport modules have 
   been implemented, including ones for very low overhead raw networking
   protocols.

OK, I can't think of any more "basic" things at the moment. But ENBD
also suffers from galloping featurism. All the features can be added to 
kernel nbd too, of course, but some of them are not point changes at
all! It would take just as long as it took to add them to ENBD in the
first place.  I'll make a list ...

Featuritus
----------

  1) remote ioctls. ENBD does pass ioctls over the net to the
     server. Only the ones it knows about of course, but that's 
     at least a hundred off. You can eject cdroms over the net.
     More ioctls can be added to its list anytime. Well, it knows about
     at least 4 different techniques for moving ioctls, and you can
     invent more ..

  2) support for removable media. Maybe I should have included that in
     the  technical differnces part. Basically, ENBD expects the
     server to report errors that are on-disk, and it distinguishes
     them from on-wire errors. It proactively pings both the server, and
     asks the server to check its media, every second or so. A change 
     in an exported floppy is spotted, and the kernel notified.

  3) ENBD has a "local write/remote read" mode, which  is useful for
     replacing NFS root with. A single server can be replicated to
     many clients, each of which then makes its own local changes.
     The writes stay in memory, of course (this IS a kind of point
     change).

  4) ENBD has an async mode (well, two), in which no acks are expected
     for requests. This is useful for swapping over ENBD (the daemons
     also have to fixed in memory for that, and thats's a "-s" flag).
     Really, there are several async modes. Either the client doesn't
     need to ack the kernel, or can ack it late, or the server doesn't
     need to ack the client, etc.

  5) ENBD has an evolved accounting and control interface in /proc.
     It amounts to about 25% of its code.

  6) ENBD supports several sync modes, direct i/o on client, sync 
     on server, talking to raw devices, etc.

  7) ENBD supports partitions.

Maybe there are more features. There are enough that I forget them at
times. I try and split them out into add-on modules. These are things
that have been requested or even requested and funded! So they satisfy
real needs.

Extra badness
-------------

One thing that's obvious is that ENBD has vastly more code than kernel
enbd. Look at these stats:

csize output, enbd vs kernel nbd ..

   total    blank lines w/   nb, nc    semi- preproc. file
   lines    lines comments    lines   colons  direct.
--------+--------+--------+--------+--------+--------+----
    4172      619      800     2789     1438       89 enbd_base.c
     405       38       67      304       70       38 enbd_ioctl.c
      30        4        3       23       10        4 enbd_ioctl_stub.c
      99       13        8       78       34        8 enbd_md.c
    1059      134       32      902      447       15 enbd_proc.c
      75        8       16       51       20        2 enbd_seqno.c
      64       14        5       45       18        2 enbd_speed.c
    5943      839      931     4222     2043      167 total

   total    blank lines w/   nb, nc    semi- preproc. file
   lines    lines comments    lines   colons  direct.
--------+--------+--------+--------+--------+--------+----
     631       77       68      487      307       34 nbd.c

You should see that ENBD has between 5 and 10 times as much code as
kernel nbd.  I've tried to split things up so that enbd_base.c is
roughly equivalent to kernel nbd, but it still looks that way.  But it's
not quite true ..  one thing that distorts stats is that ENBD needs many
more trivial support functions just to allow things to be split up!  The
extra functions become methods in a struct, and the struct is exported
to the other module, and then the caller uses the method.  Pavel was
probably able to just do a straight bitop instead!

Another thing that distorts the stats is the proc interface. Although I
split it out in the code (it's about 1000 of 5000 lines total), the 
support functions for its read and writes are still in the main code.
Yes, I could have not written a function and instead embedded the code
directly in the proc interface, but then maintenance would have been
impossible. So that's another reason ...

... because of the extra size of the code, ENBD has many more internal
code interfaces, in order to keep things separated and sane. It would
be unmanagable as a single monolithic lump. You get some idea of that
from the function counts in the following list:

   ccount 1.0:    NCSS  Comnts  Funcs  Blanks  Lines
------------------+-----+-------+------+-------+----
  enbd_base.c:    1449    739     71    615   4174
 enbd_ioctl.c:      70     59     12     42    409
enbd_ioctl_stub.c:  10      3      3      3     30
    enbd_md.c:      34      7      6     13     99
  enbd_proc.c:     452     32     16    133   1060
 enbd_seqno.c:      20     13      5      8     75
 enbd_speed.c:      18      4      2     14     64
       Totals:    2059    857    115    837   5950

------------------+-----+-------+------+-------+----
   ccount 1.0:    NCSS  Comnts  Funcs  Blanks  Lines
        nbd.c:     314     63     13     75    631

Note that Pavel averages 48 lines per function and I average 51,
so we probably have the same sense of "diffculty". We both comment
at about the same rate too, Pavel 1 in every 10 lines, me 1 in
every 7 lines.

But I know that I have considerable swathes of code that have to be done
inline, because they mess with request struct fields (for the remote
ioctl stuff), and have to complete and reverse the manipulations within
a single routine.

I'll close with what I said earlier ...

> ENBD is not a replacement for NBD - the two are alternatives, aimed
> at different niches.  ENBD is a sort of heavyweight industrial NBD.  It
> does many more things and has a different achitecture.  Kernel NBD is
> like a stripped down version of ENBD.  Both should be in the kernel.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2003-03-30 20:37 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1048623613.25914.14.camel@lotte>
2003-03-25 20:53 ` [PATCH] ENBD for 2.5.64 Peter T. Breuer
2003-03-26  2:40   ` Jeff Garzik
2003-03-26  5:55     ` Matt Mackall
2003-03-26  6:31       ` Peter T. Breuer
2003-03-26  6:48         ` Matt Mackall
2003-03-26  7:05           ` Peter T. Breuer
2003-03-26  6:59       ` Andre Hedrick
2003-03-26 13:58         ` Jeff Garzik
2003-03-26  7:31       ` Lincoln Dale
2003-03-26  9:59         ` Lars Marowsky-Bree
2003-03-26 10:18           ` Andrew Morton
2003-03-26 13:49         ` Jeff Garzik
2003-03-26 16:09         ` Matt Mackall
     [not found]         ` <5.1.0.14.2.20030327085031.04aa7128@mira-sjcm-3.cisco.com>
2003-03-26 22:40           ` Matt Mackall
2003-03-28 11:19   ` Pavel Machek
2003-03-30 20:48     ` Peter T. Breuer
2003-03-26 22:16 Lincoln Dale
2003-03-26 22:56 ` Lars Marowsky-Bree
2003-03-26 23:21   ` Lincoln Dale
  -- strict thread matches above, loose matches on Subject: below --
2003-03-26 22:16 Lincoln Dale
2003-03-26 22:32 ` Andre Hedrick
     [not found] ` <Pine.LNX.4.10.10303261422580.25072-100000@master.linux-ide .org>
2003-03-26 23:03   ` Lincoln Dale
2003-03-26 23:39     ` Andre Hedrick
     [not found] <5.1.0.14.2.20030327083757.037c0760@mira-sjcm-3.cisco.com>
2003-03-26 22:02 ` Peter T. Breuer
2003-03-26 23:49   ` Lincoln Dale
2003-03-27  0:08     ` Peter T. Breuer
2003-03-25 17:27 Peter T. Breuer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox