[Drbd-dev] GFS support in DRBD-0.8

Distributed Replicated Block Device (DRBD) development

* [Drbd-dev] GFS support in DRBD-0.8
@ 2004-09-21 14:16 Philipp Reisner
       [not found] ` <3+B6NrvPixc+6shooEioqTc=lge@web.de>
  0 siblings, 1 reply; 3+ messages in thread
From: Philipp Reisner @ 2004-09-21 14:16 UTC
  To: drbd-dev

Hi Lars,

I have thought about it and written this item for the roadmap.txt
document:

--snip--
8 Support shared disk semantics  ( for GFS, OCFS etc... )

    All the thoughts in this area imply that the cluster deals
    with split brain situations as discussed in item 6.

  In order to offer a shared disk mode for GFS, we introduce a 
  new state "shared" (in addition to primary and secondary).

  In a cluster of two nodes in shared state we determine a
  coordinator node (e.g. by selecting the node with the
  numerically higher IP address).
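
A minimal sketch of the selection rule above, assuming both nodes know
each other's address as a string. The function name pick_coordinator is
hypothetical and not part of DRBD; the point is only that the comparison
has to be numeric, not textual:

```python
import ipaddress


def pick_coordinator(addr_a: str, addr_b: str) -> str:
    """Return the address of the node that becomes coordinator.

    Hypothetical helper: the node with the numerically higher IP wins.
    """
    a = ipaddress.ip_address(addr_a)
    b = ipaddress.ip_address(addr_b)
    if a.version != b.version:
        # IPv4 and IPv6 addresses cannot be ordered against each other.
        raise ValueError("both nodes must use the same IP version")
    if a == b:
        raise ValueError("nodes must have distinct addresses")
    return str(max(a, b))
```

Both nodes evaluate the same rule on the same pair of addresses, so they
reach the same answer without exchanging any extra packet.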

 read after write dependencies

  The shared state is available to clusters using protocol C
  and B. It is not usable with protocol A.

  To support the shared state with protocol B, upon a read
  request the node has to check whether a new version of the block
  is in the process of being written. (== search for it on
  active_ee and done_ee; we must make sure that it is on active_ee
  before the RecvAck is sent. [This is already the case.] )
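
The check described above could look roughly like this. Only the list
names active_ee and done_ee come from the text; the PendingWrite record,
the sector/size fields and the function names are assumptions made for
the sketch:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingWrite:
    sector: int  # first sector touched by the write
    size: int    # length of the write in sectors


def overlaps(write: PendingWrite, sector: int, size: int) -> bool:
    """True if the write and the range [sector, sector + size) intersect."""
    return write.sector < sector + size and sector < write.sector + write.size


def read_must_wait(active_ee, done_ee, sector: int, size: int) -> bool:
    """A read has to wait while a newer version of the block is in flight.

    Both lists are searched, as the roadmap text suggests.
    """
    in_flight = list(active_ee) + list(done_ee)
    return any(overlaps(w, sector, size) for w in in_flight)
```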

 global write order

  As far as I understand the topic up to now, we have two options
  to establish a global write order.

  Proposed Solution 1, using the order of a coordinator node:

  Writes from the coordinator node are carried out as they are
  carried out on the primary node in conventional DRBD. ( Write
  to disk and send to peer simultaneously. )

  Writes from the other node are sent to the coordinator first;
  then the coordinator inserts a small "write-now" packet into
  its stream of write packets.
  The node commits the write to its local IO subsystem as soon
  as it gets the "write-now" packet from the coordinator.

  Note: With protocol C it does not matter which node is the
        coordinator from the performance viewpoint.
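
To illustrate Solution 1, here is a small in-memory simulation, not DRBD
code. The class and method names are hypothetical, and it assumes that
the coordinator puts the forwarded write on its own disk at the moment
it emits the "write-now" packet:

```python
from collections import deque


class Coordinator:
    def __init__(self):
        self.disk_log = []     # order in which writes reach the local disk
        self.stream = deque()  # ordered packet stream towards the peer

    def local_write(self, write_id):
        # Conventional path: write locally and send the data to the peer.
        self.disk_log.append(write_id)
        self.stream.append(("data", write_id))

    def peer_write(self, write_id):
        # Data arrived from the peer: order it, then tell the peer to commit.
        self.disk_log.append(write_id)
        self.stream.append(("write-now", write_id))


class Peer:
    def __init__(self, coordinator):
        self.coordinator = coordinator
        self.disk_log = []
        self.held = set()      # own writes waiting for "write-now"

    def local_write(self, write_id):
        self.held.add(write_id)
        self.coordinator.peer_write(write_id)

    def drain(self):
        # Process the coordinator's stream strictly in order.
        while self.coordinator.stream:
            kind, write_id = self.coordinator.stream.popleft()
            if kind == "write-now":
                self.held.remove(write_id)
            self.disk_log.append(write_id)
```

Because the peer applies everything in stream order, both disks see the
writes in the order the coordinator chose.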

  Proposed Solution 2, use ALs as distributed locks:

  Only one node may mark an extent as active at a time. New
  packets are introduced to request the locking of an extent.
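
A rough sketch of the rule in Solution 2, assuming a single table that
records which node holds each extent. The class ExtentLocks, its methods
and the extent_of helper are made up for illustration; the text does not
specify how the lock packets or the release would actually work:

```python
def extent_of(sector: int, sectors_per_extent: int) -> int:
    """Map a sector to its extent number (extent size is a free parameter)."""
    return sector // sectors_per_extent


class ExtentLocks:
    def __init__(self):
        self.owner = {}  # extent number -> node currently holding it

    def request(self, node: str, extent: int) -> bool:
        """Grant the extent if it is free or already held by this node."""
        holder = self.owner.setdefault(extent, node)
        return holder == node

    def release(self, node: str, extent: int) -> None:
        """Give the extent up; only the current holder can release it."""
        if self.owner.get(extent) == node:
            del self.owner[extent]
```

A node whose request is refused has to wait until the holder releases
the extent, which is where the cost of two nodes alternating on the same
extent would come from.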
--snap--

PS: I think that we do not need to use the AL extents as
    distributed locks.

PS2: Comments about the wording ("coordinator") are also welcome.

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :


* Re: [Drbd-dev] GFS support in DRBD-0.8
       [not found] ` <3+B6NrvPixc+6shooEioqTc=lge@web.de>
@ 2004-09-22 13:18   ` Philipp Reisner
  2004-09-22 14:53     ` Lars Ellenberg
  0 siblings, 1 reply; 3+ messages in thread
From: Philipp Reisner @ 2004-09-22 13:18 UTC
  To: drbd-dev


[...]
> >   Proposed Solution 1, using the order of a coordinator node:
> >
> >   Writes from the coordinator node are carried out as they are
> >   carried out on the primary node in conventional DRBD. ( Write
> >   to disk and send to peer simultaneously. )
> >
> >   Writes from the other node are sent to the coordinator first;
> >   then the coordinator inserts a small "write-now" packet into
> >   its stream of write packets.
> >   The node commits the write to its local IO subsystem as soon
> >   as it gets the "write-now" packet from the coordinator.
> >
> >   Note: With protocol C it does not matter which node is the
> >         coordinator from the performance viewpoint.
> >
> >   Proposed Solution 2, use ALs as distributed locks:
> >
> >   Only one node may mark an extent as active at a time. New
> >   packets are introduced to request the locking of an extent.
> > --snap--
> >
> > PS: I think that we do not need to use the AL extents as
> >     distributed locks.
>
> we don't need to, and it will probably be simpler to implement with S1.
> but S2 will most likely scale better as soon as we introduce more than
> two nodes, and maybe already with only two nodes, since I expect GFS
> and similar systems to coordinate on the higher level already, so that
> typically (think of for example the per-node-journals) there won't be
> real concurrent access to the same area of the device.

DRBD-0.8 will strictly be 2 nodes. For the two-node case both solutions
have in principle the same latency with protocol C
 (see the attached PDF: N2 initiates the write, ... the path until
  IO completion can be signalled is equally long.)

With S2 we have one packet less travelling over the wire per write
request, thus fewer interrupts, less CPU load etc... more performance
in real life.

But with S2 an extent ping-pong will be *really* expensive.

PS: You mentioned that you want to use another term for
    extent. Why? The expression extent is used in LVM1 for
    the smallest unit of allocation, by default 4M.
    I think it is a good term for what we mean...

Ok, let's consider S2:
Why is it a good idea to unify the AL-extents and the lock-extents ?

pro: we already have AL-extents.
con: it is a different thing!

I think it would be wise to have an independent LRU cache for lock-extents

pro: other extent sizes possible.
pro: other cache sizes possible.
pro: deletion from cache (other node needs that extent) is cheap! no
     meta-data update.
con: more code. (but LRU is already nicely abstracted anyway)
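
To make the idea of an independent LRU cache for lock-extents concrete,
here is a sketch under stated assumptions: it is purely in-memory, the
class LockExtentCache and its method names are invented, and it says
nothing about how DRBD's existing LRU code is structured:

```python
from collections import OrderedDict


class LockExtentCache:
    """LRU set of extents this node currently holds the lock for."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.extents = OrderedDict()  # least recently used extent first

    def touch(self, extent: int):
        """Use an extent; return the extent evicted to make room, if any."""
        if extent in self.extents:
            self.extents.move_to_end(extent)
            return None
        evicted = None
        if len(self.extents) >= self.capacity:
            evicted, _ = self.extents.popitem(last=False)
        self.extents[extent] = True
        return evicted

    def give_up(self, extent: int) -> bool:
        """The other node needs this extent: drop it, no meta-data write."""
        return self.extents.pop(extent, None) is not None
```

Since nothing here touches on-disk meta-data, handing an extent to the
other node is just a dictionary removal, which is the "cheap deletion"
argument from the list above.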

I am willing to agree on S2 as soon as I know that it will fit
GFS's usage pattern. I tried to find a paper on the on-disk
layout of GFS, but a 30-minute search was not successful....

> note that I think either way we need to get rid of the current scheme of
> "throttling" io in the tcp buffer by doing all network and disk io
> directly in the process context of the submitting process. we should
> instead have our own queue, with some maximum length, and let the worker
> do the work. yes this introduces more context switches.  but I really
> doubt that this is a performance problem on today's boxes.

Tell me one reason for this other than "I think we need..."

>
> I'd like to keep Primary, but introduce "active" as well, so we can have
> active Secondaries. a Primary is by definition always active.
>

So it would be Primary/Active ?? What is the difference between
an Active and a Primary node ?

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :

[-- Attachment #2: GFS-mode-options.pdf --]
[-- Type: application/pdf, Size: 9808 bytes --]


* Re: [Drbd-dev] GFS support in DRBD-0.8
  2004-09-22 13:18   ` Philipp Reisner
@ 2004-09-22 14:53     ` Lars Ellenberg
  0 siblings, 0 replies; 3+ messages in thread
From: Lars Ellenberg @ 2004-09-22 14:53 UTC
  To: drbd-dev

/ 2004-09-22 15:18:45 +0200
\ Philipp Reisner:
> > I'd like to keep Primary, but introduce "active" as well, so we can have
> > active Secondaries. a Primary is by definition always active.
> >
> 
> So it would be Primary/Active ?? What is the difference between
> an Active and a Primary node ?

Primary and Active => is the writable Coordinator
just Active is a writable Secondary.

it's just a suggestion anyway.
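
The naming suggested above could be modelled as two flags. This is only
an illustration of the proposal; the Role type and the helper functions
are hypothetical and do not correspond to anything in the DRBD source:

```python
from enum import Flag, auto


class Role(Flag):
    SECONDARY = 0     # neither flag set
    PRIMARY = auto()
    ACTIVE = auto()


def normalize(role: Role) -> Role:
    """A Primary is by definition always active."""
    return role | Role.ACTIVE if Role.PRIMARY in role else role


def describe(role: Role) -> str:
    role = normalize(role)
    if Role.PRIMARY in role:
        return "writable coordinator"
    if Role.ACTIVE in role:
        return "writable secondary"
    return "plain secondary"
```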


I'll think about your other questions...

	lge

