public inbox for linux-kernel@vger.kernel.org
* Re: [PATCH] ENBD for 2.5.64
@ 2003-03-26 22:16 Lincoln Dale
  2003-03-26 22:56 ` Lars Marowsky-Bree
  0 siblings, 1 reply; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 22:16 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Jeff Garzik, ptb, Justin Cormack, linux kernel

At 10:09 AM 26/03/2003 -0600, Matt Mackall wrote:
> > >Indeed, there are iSCSI implementations that do multipath and
> > >failover.
> >
> > iSCSI is a transport.
> > logically, any "multipathing" and "failover" belongs in a layer above 
> it --
> > typically as a block-layer function -- and not as a transport-layer
> > function.
> >
> > multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, 
> DevMapper
> > PowerPath, ...
>
>Funny then that I should be talking about Cisco's driver. :P

:-)

see my previous email to Jeff.  iSCSI as a transport protocol does have a 
muxing capability -- but its usefulness is somewhat limited (imho).

>iSCSI inherently has more interesting reconnect logic than other block
>devices, so it's fairly trivial to throw in recognition of identical
>devices discovered on two or more iSCSI targets..

what logic do you use to identify "identical devices"?
same data reported from SCSI Report_LUNs?  or perhaps the same data 
reported from a SCSI_Inquiry?

in reality, all multipathing software tends to use some blocks at the end 
of the disk (just in the same way that most LVMs do also).
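the end-of-disk convention is concrete enough to sketch: the old MD 0.90 superblock, for instance, sits in the last 64 KiB-aligned 64 KiB of the device (a sketch assuming that 0.90 layout; the helper name is mine):

```python
MD_RESERVED_BYTES = 64 * 1024  # MD 0.90 reserves the trailing 64 KiB

def md_superblock_offset(device_bytes: int) -> int:
    """Round the device size down to a 64 KiB boundary, then step
    back one 64 KiB block: that's where the metadata lives."""
    return (device_bytes & ~(MD_RESERVED_BYTES - 1)) - MD_RESERVED_BYTES

# e.g. an 18.2 GB disk: the superblock sits just shy of the end
print(md_superblock_offset(18_200_000_000))
```

the point being that the metadata location depends only on the device size, so it is found identically down every path.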

for example, consider the following output from a set of two SCSI_Inquiry 
and Report_LUNs on two paths to storage:
         Lun Description Table
         WWPN             Lun   Capacity Vendor       Product      Serial
         ---------------- ----- -------- ------------ ------------ ------
         Path A:
         21000004cf8c21fb 0     16GB     HP 18.2G     ST318452FC   3EV0BD8E
         21000004cf8c21c5 0     16GB     HP 18.2G     ST318452FC   3EV0KHHP
         50060e8000009591 0     50GB     HITACHI      DF500F       DF500-00B
         50060e8000009591 1     50GB     HITACHI      DF500F       DF500-00B
         50060e8000009591 2     50GB     HITACHI      DF500F       DF500-00B
         50060e8000009591 3     50GB     HITACHI      DF500F       DF500-00B

         Path B:
         31000004cf8c21fb 0     16GB     HP 18.2G     ST318452FC   3EV0BD8E
         31000004cf8c21c5 0     16GB     HP 18.2G     ST318452FC   3EV0KHHP
         50060e8000009591 0     50GB     HITACHI      DF500F       DF500-00A
         50060e8000009591 1     50GB     HITACHI      DF500F       DF500-00A
         50060e8000009591 2     50GB     HITACHI      DF500F       DF500-00A
         50060e8000009591 3     50GB     HITACHI      DF500F       DF500-00A


the "HP 18.2G" devices are 18G FC disks in a FC JBOD.  each disk will 
report an identical Serial # regardless of the interface/path used to get 
to that device.  no issues there, right -- you can identify the disk as 
being unique via its "Serial #" and can see the interface used to get to it 
via its WWPN.

now, take a look at some disk from an intelligent disk array (in this case, 
a HDS 9200).
it reports a _different_ serial number for the same disk, dependent on the 
interface used.  (DF500 is the model # of a HDS 9200, interfaces are 
numbered 00A/00B/01A/01B).

does one now need to add logic into the kernel to provide some multipathing 
for HDS disks?
does using linux mean that one has to change some settings on the HDS disk 
array to get it to report different information via a SCSI_Inquiry?  (it 
can - but that's not the point - the point is that any multipathing software 
out there just 'works' right now).
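a minimal sketch of the naive matching logic (field values lifted from the tables above; the structure is mine) shows the failure: grouping by serial collapses the JBOD paths correctly but splits each HDS LUN in two:

```python
from collections import defaultdict

# (wwpn, lun, vendor, product, serial) as seen on each path
paths = [
    ("21000004cf8c21fb", 0, "HP 18.2G", "ST318452FC", "3EV0BD8E"),  # path A
    ("31000004cf8c21fb", 0, "HP 18.2G", "ST318452FC", "3EV0BD8E"),  # path B
    ("50060e8000009591", 0, "HITACHI", "DF500F", "DF500-00B"),      # path A
    ("50060e8000009591", 0, "HITACHI", "DF500F", "DF500-00A"),      # path B
]

devices = defaultdict(list)
for wwpn, lun, vendor, product, serial in paths:
    devices[(vendor, product, lun, serial)].append(wwpn)

# JBOD disk: one key with two paths.  HDS LUN 0: two keys with one
# path each, so the matcher wrongly sees two distinct disks.
for key, wwpns in sorted(devices.items()):
    print(key, "->", len(wwpns), "path(s)")
```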

this is just one example.  i could probably find another 50 examples of 
slightly-different behavior if you wanted me to!

> > >Both iSCSI and ENBD currently have issues with pending writes during
> > >network outages. The current I/O layer fails to report failed writes
> > >to fsync and friends.
> >
> > these are not "iSCSI" or "ENBD" issues.  these are issues with VFS.
>
>Except that the issue simply doesn't show up for anyone else, which is
>why it hasn't been fixed yet. Patches are in the works, but they need
>more testing:
>
>http://www.selenic.com/linux/write-error-propagation/

oh, but it does show up for other people.  it may be that the issue doesn't 
show up at fsync() time, but rather at close() time, or perhaps neither of 
those!

code looks interesting.  i'll take a look.
hmm, must find out a way to intentionally introduce errors now and see what 
happens!
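for reference, a sketch of checking every point where a delayed write error could surface; write(), fsync() and close() can each be the messenger (the path handling here is illustrative only):

```python
import os, tempfile

def careful_write(path: str, data: bytes) -> None:
    """Surface a write error at whichever call reports it first."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)   # immediate errors show up here
        os.fsync(fd)         # errors on already-queued writes show up here
    finally:
        os.close(fd)         # ... or, on some setups, only here

demo = os.path.join(tempfile.gettempdir(), "careful_write_demo")
careful_write(demo, b"hello")
```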


cheers,

lincoln.


^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] ENBD for 2.5.64
@ 2003-03-26 22:16 Lincoln Dale
  2003-03-26 22:32 ` Andre Hedrick
       [not found] ` <Pine.LNX.4.10.10303261422580.25072-100000@master.linux-ide.org>
  0 siblings, 2 replies; 27+ messages in thread
From: Lincoln Dale @ 2003-03-26 22:16 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Matt Mackall, ptb, Justin Cormack, linux kernel

At 08:49 AM 26/03/2003 -0500, Jeff Garzik wrote:
>>>Indeed, there are iSCSI implementations that do multipath and
>>>failover.
>>
>>iSCSI is a transport.
>>logically, any "multipathing" and "failover" belongs in a layer above it 
>>-- typically as a block-layer function -- and not as a transport-layer 
>>function.
>>multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, 
>>DevMapper -- or in a commercial implementation such as Veritas VxDMP, HDS 
>>HDLM, EMC PowerPath, ...
>
>I think you will find that most Linux kernel developers agree w/ you :)
>
>That said, iSCSI error recovery can be considered as supporting some of 
>what multipathing and failover accomplish.  iSCSI can be shoving bits 
>through multiple TCP connections, or fail over from one TCP connection to 
>another.

while the iSCSI spec has the concept of a "network portal" that can have 
multiple TCP streams for i/o, in the real world, i've yet to see anything 
actually use those multiple streams.
the reason why goes back to how SCSI works.  take an ethereal trace of iSCSI 
and you'll see the way that 2 round-trips are used before any typical i/o 
operation (read or write op) occurs.
multiple TCP streams for a given iSCSI session could potentially be used to 
achieve greater performance when the maximum-window-size of a single TCP 
stream is being hit.
but it's quite rare for this to happen.
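the window-size ceiling is easy to put numbers on: a single stream moves at most one window per round trip (illustrative figures below, not from any particular trace):

```python
def max_tcp_throughput_bits(window_bytes: int, rtt_s: float) -> float:
    """A single TCP stream moves at most one window per round trip."""
    return window_bytes * 8 / rtt_s

# 64 KiB window on a 0.5 ms LAN round trip: ~1.05 Gbit/s, gigabit is fine
print(max_tcp_throughput_bits(64 * 1024, 0.0005) / 1e6, "Mbit/s")
# same window over a 20 ms path: the stream caps out near 26 Mbit/s
print(max_tcp_throughput_bits(64 * 1024, 0.020) / 1e6, "Mbit/s")
```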

in reality, if you had multiple TCP streams, it's more likely you're doing 
it for high-availability reasons (i.e. multipathing).
if you're multipathing, the chances are you want to multipath down two 
separate paths to two different iSCSI gateways.  (assuming you're talking 
to traditional SAN storage and you're gatewaying into Fibre Channel).

handling multipathing in that manner is well beyond the scope of what an 
iSCSI driver in the kernel should be doing.
determining the policy (read-preferred / write-preferred / round-robin / 
ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are 
used is most definitely something that should NEVER be in the kernel.

btw, the performance of iSCSI over a single TCP stream is a moot point also.
from a single host (IBM x335 Server i think?) communicating with a FC disk 
via an iSCSI gateway:
         mds# sh int gig2/1
         GigabitEthernet2/1 is up
             Hardware is GigabitEthernet, address is xxxx.xxxx.xxxx
             Internet address is xxx.xxx.xxx.xxx/24
             MTU 1500  bytes, BW 1000000 Kbit
             Port mode is IPS
             Speed is 1 Gbps
             Beacon is turned off
             5 minutes input rate 21968640 bits/sec, 2746080 bytes/sec, 
40420 frames/sec
             5 minutes output rate 929091696 bits/sec, 116136462 bytes/sec, 
80679 frames/sec
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
             74228360 packets input, 13218256042 bytes
               15409 multicast frames, 0 compressed
               0 input errors, 0 frame, 0 overrun 0 fifo
             169487726 packets output, 241066793565 bytes, 0 underruns
               0 output errors, 0 collisions, 0 fifo
               0 carrier errors

not bad for a single TCP stream and a software iSCSI stack. :-)
(kernel is 2.4.20)
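a quick check on the highlighted counter: the box is pushing roughly 93% of raw gigabit, in near-full-size frames:

```python
out_bits_per_s = 929_091_696      # "5 minutes output rate" from above
line_rate_bps = 1_000_000_000     # raw gigabit ethernet
print(f"{out_bits_per_s / line_rate_bps:.1%} of line rate")   # 92.9%

frames_per_s = 80_679
avg_frame = out_bits_per_s / 8 / frames_per_s
print(f"{avg_frame:.0f} bytes/frame on average")  # close to a full 1500 MTU
```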

>>>Both iSCSI and ENBD currently have issues with pending writes during
>>>network outages. The current I/O layer fails to report failed writes
>>>to fsync and friends.
>
>...not if your iSCSI implementation is up to spec.  ;-)
>
>>these are not "iSCSI" or "ENBD" issues.  these are issues with VFS.
>
>VFS+VM.  But, agreed.

sure - the devil is in the details - but the issue holds true for 
traditional block devices at this point also.


cheers,

lincoln.


[parent not found: <5.1.0.14.2.20030327083757.037c0760@mira-sjcm-3.cisco.com>]
* Re: [PATCH] ENBD for 2.5.64
@ 2003-03-25 17:27 Peter T. Breuer
  0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-25 17:27 UTC (permalink / raw)
  To: linux kernel

"a little while ago ptb wrote:"
> Here's a patch to incorporate Enhanced NBD (ENBD) into kernel 2.5.64.
> I'll put the patch on the list first, and then post again with a
> technical breakdown and various arguments/explanations.

I'll now put up the technical discussion I promised. (the patch is
also in the patches/ subdir in the archive at
ftp://oboe.it.uc3m.es/pub/Programs/nbd-2.4.31.tgz)

I'll repeat the dates .. Pavel's kernel NBD dates from 1997, and ENBD from
1998, derived initially from Pavel's code backported to stable kernels.
Pavel and I have been in contact many times over the years.

Technical differences
---------------------
1) One of the original changes made was technical, and is perhaps
   the biggest reason for what incompatibilities there are (I can
   equalize the wire formats, but not the functional protocols, so you
   need different userspace support for the different kernel drivers).

   - kernel nbd runs a single thread transferring data between kernel
   and net.

   - ENBD runs multiple threads running asynchronously with respect to
   each other

   The result is that ENBD can get a pipelining benefit ..  while one
   thread is sending to the net another is talking to the kernel and
   so on.  This shows up in different ways.  Obviously you do best if
   you have two cpus or two nics, etc.

   Also ENBD doesn't die when one thread gets stuck. I'll talk about
   that.
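   the pipelining effect can be sketched with two workers joined by a
   queue: while one is fetching the next request, the other is already
   shipping the previous one (a toy model, not ENBD's actual code):

```python
import queue, threading, time

requests = queue.Queue()

def kernel_side():
    """Pull requests from the 'block layer' and hand them onward."""
    for req in range(5):
        time.sleep(0.01)          # pretend: fetch next request from kernel
        requests.put(req)
    requests.put(None)            # shutdown marker

def net_side(done):
    """Ship requests to the server; overlaps with kernel_side's fetches."""
    while (req := requests.get()) is not None:
        time.sleep(0.01)          # pretend: send over the wire
        done.append(req)

done = []
t1 = threading.Thread(target=kernel_side)
t2 = threading.Thread(target=net_side, args=(done,))
t1.start(); t2.start(); t1.join(); t2.join()
print(done)                       # all five shipped, fetch/send overlapped
```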

2) There is a difference in philosophy, which results in different
   code, different behaviors, etc. Basically, ENBD must not /fail/.
   It's supposed to keep working first and foremost, and deal with
   errors when they crop up, and it's supposed to expect errors.

   - kernel nbd runs a full kernel thread which cannot die. It loops
     inside the kernel.

   - ENBD runs userspace threads which can die and are expected to die
     and which are restarted by a master when they die. They only dip
     into the kernel occasionally.

   This originally arose because I was frustrated with not being able
   to kill the kernel nbd client daemon, and thus free up its "space".
   It certainly used to start what nowadays we know as a kernel thread,
   but from user space. It dove into the kernel in an ioctl and
   executed a forever loop there. ENBD doesn't do that. It runs the
   daemon cycle from user space via separate ioctls for each stage.

   That's why you need different user space utilities.

   - kernel nbd has daemons which are quite lightweight

   - ENBD has daemons which disconnect if they detect network failures
     and reconnect as soon as the net comes up again. Servers and
     clients can die, and be restarted, and they'll reconnect, entirely
     automatically, all on their little ownsomes ..

   ENBD is prepared internally to retract requests from client daemons
   which don't respond any longer, and pass them to others instead.
   It's therefore also prepared to receive acks out of order, etc. etc.

   Another facet of all that is the following:

   - kernel nbd does networking from within the kernel

   - ENBD does its networking from userspace. It has to, to manage the
     complex reconnect handshakes, authentication, brownouts, etc.

   As a result, ENBD is much more flexible in its transport protocols.
   There is a single code module which implements a "stream", and
   the three or four methods within need to be reimplemented for each
   protocol, but that's all. There are two standard transports in the
   distribution code - tcp and ssl, and other transport modules have 
   been implemented, including ones for very low overhead raw networking
   protocols.
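   the shape of that transport abstraction is roughly the following
   (class and method names are mine, not ENBD's; the ssl variant simply
   rewraps the tcp socket):

```python
import socket, ssl

class Stream:
    """Minimal transport interface: reimplement these per protocol."""
    def connect(self, host, port): raise NotImplementedError
    def send(self, data): raise NotImplementedError
    def recv(self, n): raise NotImplementedError
    def close(self): raise NotImplementedError

class TcpStream(Stream):
    def connect(self, host, port):
        self.sock = socket.create_connection((host, port))
    def send(self, data): self.sock.sendall(data)
    def recv(self, n): return self.sock.recv(n)
    def close(self): self.sock.close()

class SslStream(TcpStream):
    def connect(self, host, port):
        super().connect(host, port)        # reuse the tcp setup ...
        ctx = ssl.create_default_context()
        self.sock = ctx.wrap_socket(self.sock, server_hostname=host)
```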


OK, I can't think of any more "basic" things at the moment. But ENBD
also suffers from galloping featurism. All the features can be added to 
kernel nbd too, of course, but some of them are not point changes at
all! It would take just as long as it took to add them to ENBD in the
first place.  I'll make a list ...


Featuritus
----------

  1) remote ioctls. ENBD does pass ioctls over the net to the
     server. Only the ones it knows about of course, but that's 
     at least a hundred-odd.  You can eject cdroms over the net.
     More ioctls can be added to its list anytime. Well, it knows about
     at least 4 different techniques for moving ioctls, and you can
     invent more ..

  2) support for removable media. Maybe I should have included that in
     the technical differences part. Basically, ENBD expects the
     server to report errors that are on-disk, and it distinguishes
     them from on-wire errors. It proactively pings both the server, and
     asks the server to check its media, every second or so. A change 
     in an exported floppy is spotted, and the kernel notified.

  3) ENBD has a "local write/remote read" mode, which is useful for
     replacing NFS root. A single server can be replicated to
     many clients, each of which then makes its own local changes.
     The writes stay in memory, of course (this IS a kind of point
     change).

  4) ENBD has an async mode (well, two), in which no acks are expected
     for requests. This is useful for swapping over ENBD (the daemons
     also have to be fixed in memory for that, and that's a "-s" flag).
     Really, there are several async modes. Either the client doesn't
     need to ack the kernel, or can ack it late, or the server doesn't
     need to ack the client, etc.

  5) ENBD has an evolved accounting and control interface in /proc.
     It amounts to about 25% of its code.

  6) ENBD supports several sync modes, direct i/o on client, sync 
     on server, talking to raw devices, etc.

  7) ENBD supports partitions.


Maybe there are more features. There are enough that I forget them at
times. I try and split them out into add-on modules. These are things
that have been requested or even requested and funded! So they satisfy
real needs.

Extra badness
-------------

One thing that's obvious is that ENBD has vastly more code than kernel
nbd. Look at these stats:

csize output, enbd vs kernel nbd ..

   total    blank lines w/   nb, nc    semi- preproc. file
   lines    lines comments    lines   colons  direct.
--------+--------+--------+--------+--------+--------+----
    4172      619      800     2789     1438       89 enbd_base.c
     405       38       67      304       70       38 enbd_ioctl.c
      30        4        3       23       10        4 enbd_ioctl_stub.c
      99       13        8       78       34        8 enbd_md.c
    1059      134       32      902      447       15 enbd_proc.c
      75        8       16       51       20        2 enbd_seqno.c
      64       14        5       45       18        2 enbd_speed.c
    5943      839      931     4222     2043      167 total

   total    blank lines w/   nb, nc    semi- preproc. file
   lines    lines comments    lines   colons  direct.
--------+--------+--------+--------+--------+--------+----
     631       77       68      487      307       34 nbd.c

You should see that ENBD has between 5 and 10 times as much code as
kernel nbd.  I've tried to split things up so that enbd_base.c is
roughly equivalent to kernel nbd, but it still looks that way.  But it's
not quite true ..  one thing that distorts stats is that ENBD needs many
more trivial support functions just to allow things to be split up!  The
extra functions become methods in a struct, and the struct is exported
to the other module, and then the caller uses the method.  Pavel was
probably able to just do a straight bitop instead!

Another thing that distorts the stats is the proc interface. Although I
split it out in the code (it's about 1000 of 5000 lines total), the 
support functions for its read and writes are still in the main code.
Yes, I could have not written a function and instead embedded the code
directly in the proc interface, but then maintenance would have been
impossible. So that's another reason ...

... because of the extra size of the code, ENBD has many more internal
code interfaces, in order to keep things separated and sane. It would
be unmanageable as a single monolithic lump. You get some idea of that
from the function counts in the following list:

   ccount 1.0:    NCSS  Comnts  Funcs  Blanks  Lines
------------------+-----+-------+------+-------+----
  enbd_base.c:    1449    739     71    615   4174
 enbd_ioctl.c:      70     59     12     42    409
enbd_ioctl_stub.c:  10      3      3      3     30
    enbd_md.c:      34      7      6     13     99
  enbd_proc.c:     452     32     16    133   1060
 enbd_seqno.c:      20     13      5      8     75
 enbd_speed.c:      18      4      2     14     64
       Totals:    2059    857    115    837   5950

------------------+-----+-------+------+-------+----
   ccount 1.0:    NCSS  Comnts  Funcs  Blanks  Lines
        nbd.c:     314     63     13     75    631

Note that Pavel averages 48 lines per function and I average 51,
so we probably have the same sense of "difficulty". We both comment
at about the same rate too, Pavel 1 in every 10 lines, me 1 in
every 7 lines.
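Those averages fall straight out of the ccount tables above:

```python
# (functions, comment lines, total lines) from the ccount tables
stats = {"enbd": (115, 857, 5950), "nbd": (13, 63, 631)}

for name, (funcs, comments, lines) in stats.items():
    print(f"{name}: {lines / funcs:.1f} lines/function, "
          f"1 comment per {lines / comments:.1f} lines")
# enbd: 51.7 lines/function, 1 comment per 6.9 lines
# nbd: 48.5 lines/function, 1 comment per 10.0 lines
```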

But I know that I have considerable swathes of code that have to be done
inline, because they mess with request struct fields (for the remote
ioctl stuff), and have to complete and reverse the manipulations within
a single routine.

I'll close with what I said earlier ...

> ENBD is not a replacement for NBD - the two are alternatives, aimed
> at different niches.  ENBD is a sort of heavyweight industrial NBD.  It
> does many more things and has a different architecture.  Kernel NBD is
> like a stripped down version of ENBD.  Both should be in the kernel.

Peter


end of thread, other threads:[~2003-03-30 20:37 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1048623613.25914.14.camel@lotte>
2003-03-25 20:53 ` [PATCH] ENBD for 2.5.64 Peter T. Breuer
2003-03-26  2:40   ` Jeff Garzik
2003-03-26  5:55     ` Matt Mackall
2003-03-26  6:31       ` Peter T. Breuer
2003-03-26  6:48         ` Matt Mackall
2003-03-26  7:05           ` Peter T. Breuer
2003-03-26  6:59       ` Andre Hedrick
2003-03-26 13:58         ` Jeff Garzik
2003-03-26  7:31       ` Lincoln Dale
2003-03-26  9:59         ` Lars Marowsky-Bree
2003-03-26 10:18           ` Andrew Morton
2003-03-26 13:49         ` Jeff Garzik
2003-03-26 16:09         ` Matt Mackall
     [not found]         ` <5.1.0.14.2.20030327085031.04aa7128@mira-sjcm-3.cisco.com>
2003-03-26 22:40           ` Matt Mackall
2003-03-28 11:19   ` Pavel Machek
2003-03-30 20:48     ` Peter T. Breuer
2003-03-26 22:16 Lincoln Dale
2003-03-26 22:56 ` Lars Marowsky-Bree
2003-03-26 23:21   ` Lincoln Dale
  -- strict thread matches above, loose matches on Subject: below --
2003-03-26 22:16 Lincoln Dale
2003-03-26 22:32 ` Andre Hedrick
     [not found] ` <Pine.LNX.4.10.10303261422580.25072-100000@master.linux-ide.org>
2003-03-26 23:03   ` Lincoln Dale
2003-03-26 23:39     ` Andre Hedrick
     [not found] <5.1.0.14.2.20030327083757.037c0760@mira-sjcm-3.cisco.com>
2003-03-26 22:02 ` Peter T. Breuer
2003-03-26 23:49   ` Lincoln Dale
2003-03-27  0:08     ` Peter T. Breuer
2003-03-25 17:27 Peter T. Breuer
