[Drbd-dev] How Locking in GFS works...

* [Drbd-dev] How Locking in GFS works...
@ 2004-10-04 12:56 Philipp Reisner
  2004-10-04 13:01 ` Lars Marowsky-Bree
  0 siblings, 1 reply; 21+ messages in thread
From: Philipp Reisner @ 2004-10-04 12:56 UTC (permalink / raw)
  To: drbd-dev

I finally found a text that usefully describes what the on-disk
layout of GFS (actually OpenGFS) looks like and how its locking
works.
[I was not able to find anything about Sistina's, ahem, Red Hat's
 GFS.]

Find everything at http://opengfs.sourceforge.net/docs.php

The interesting part from the document on Locking:

2.  Lock name

  lm_lockname_t    gl_name   -- Unique "name" (but not a string!) for lock

  The lockname structure has two components:

    uint64         ln_number -- lock number
    unsigned int   ln_type   -- type of protected entity

  For most locks, the lock number is the block number (within the filesystem's
  64-bit linear block space, which can span many storage devices) of the
  protected entity, left shifted to be equivalent to a 512-byte sector.
  Details are in src/fs/glock.c, ogfs_blk2lockname().

  As an example, if we wanted to protect an inode at block 0x100, and we
  are using 4-kByte blocks, the lock number would be 0x0800 (0x100 << 3).

  I believe the block-to-sector conversion is for support of hardware-based
  DMEP protocols, which address the DMEP storage space in terms of 512-byte
  sectors.  This could turn out to be problematic in *very large* 64-bit
  filesystems, if they want to use the upper 3 bits of the 64-bit block
  number.
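
  The arithmetic above can be sketched in a few lines. This is only an
  illustration: blk2lock_number() is a made-up helper, not the real
  ogfs_blk2lockname(), and it assumes a power-of-two block size of at
  least 512 bytes and a lock number truncated to 64 bits.

```python
SECTOR_SIZE = 512           # bytes per sector, the unit used for lock numbers
MASK64 = (1 << 64) - 1      # assumption: ln_number is truncated to 64 bits

def blk2lock_number(block: int, block_size: int) -> int:
    """Hypothetical helper: shift a block number to 512-byte sector units."""
    if block_size < SECTOR_SIZE or block_size & (block_size - 1):
        raise ValueError("block size must be a power of two >= 512")
    shift = (block_size // SECTOR_SIZE).bit_length() - 1  # 4096 bytes -> shift by 3
    return (block << shift) & MASK64

print(hex(blk2lock_number(0x100, 4096)))    # 0x800, the example from the text
print(hex(blk2lock_number(1 << 61, 4096)))  # 0x0, the upper 3 bits are shifted out
```

  The second call shows the concern about very large filesystems: two
  block numbers that differ only in their upper 3 bits would map to the
  same lock number.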

  There is a special lock for the disk-based superblock, defined in
  src/fs/ogfs_ondisk.h.  Note that this lock is not based on the block
  number (the superblock is *not* stored in block 0):

    OGFS_SB_LOCK      (0) -- protects superblock read accesses from fs upgrades

  In addition to the block-based number assignments, OpenGFS uses some
  special, non-disk lock numbers.  They are defined in src/fs/ogfs_ondisk.h
  (even though they don't show up on disk!):
[...]

This is intended as food for thought on how we should design our
support for shared disk file systems.

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-04 12:56 [Drbd-dev] How Locking in GFS works Philipp Reisner
@ 2004-10-04 13:01 ` Lars Marowsky-Bree
  2004-10-04 13:20   ` Lars Ellenberg
  2004-10-04 13:26   ` Philipp Reisner
  0 siblings, 2 replies; 21+ messages in thread
From: Lars Marowsky-Bree @ 2004-10-04 13:01 UTC (permalink / raw)
  To: drbd-dev

On 2004-10-04T14:56:21, Philipp Reisner <philipp.reisner@linbit.com> wrote:

> This is intended as food for thought on how we should design our
> support for shared disk file systems.

I'm still not sure what kind of special support you need. The only
guarantee you need to provide is that after a barrier all reads on all
nodes return the same data for those blocks affected by the flush.

The shared disk file system itself will take care of issuing the
appropriate barriers and flushing the OS caches.

Am I missing something? ;-)

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-04 13:01 ` Lars Marowsky-Bree
@ 2004-10-04 13:20   ` Lars Ellenberg
  2004-10-04 13:41     ` Lars Marowsky-Bree
  2004-10-04 13:26   ` Philipp Reisner
  1 sibling, 1 reply; 21+ messages in thread
From: Lars Ellenberg @ 2004-10-04 13:20 UTC (permalink / raw)
  To: drbd-dev

/ 2004-10-04 15:01:58 +0200
\ Lars Marowsky-Bree:
> On 2004-10-04T14:56:21, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> 
> > This is intended as food for thought on how we should design our
> > support for shared disk file systems.
> 
> I'm still not sure what kind of special support you need. The only
> guarantee you need to provide is that after a barrier all reads on all
> nodes return the same data for those blocks affected by the flush.
> 
> The shared disk file system itself will take care of issueing
> appropriate barrier and flushing the OS caches.
> 
> Am I missing something? ;-)

"In case our user goes up the wall",
we need to guarantee that whatever our users do,
our lower-level devices are identical. always.

so *in case* gfs had a bug, or something else does strange things with
us, we can not trust it to not write concurrently on both nodes to the
same block at the same time.

we have to assume that this can indeed happen,
and do some serialization stuff internally. just in case.

and if we know the expected access pattern of our users,
we can optimize our own internal serialization stuff to
not conflict and degrade performance,
but to only be there as a safety net.

and the (wanted) side effect is, that we always know
which regions of the device have been active,
so we can do the resync correctly eventually.

	lge

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-04 13:01 ` Lars Marowsky-Bree
  2004-10-04 13:20   ` Lars Ellenberg
@ 2004-10-04 13:26   ` Philipp Reisner
  2004-10-04 13:49     ` Lars Marowsky-Bree
  1 sibling, 1 reply; 21+ messages in thread
From: Philipp Reisner @ 2004-10-04 13:26 UTC (permalink / raw)
  To: drbd-dev

On Monday 04 October 2004 15:01, Lars Marowsky-Bree wrote:
> On 2004-10-04T14:56:21, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> > This is intended as food for thought on how we should design our
> > support for shared disk file systems.
>
> I'm still not sure what kind of special support you need. The only
> guarantee you need to provide is that after a barrier all reads on all
> nodes return the same data for those blocks affected by the flush.
>
> The shared disk file system itself will take care of issueing
> appropriate barrier and flushing the OS caches.
>
> Am I missing something? ;-)
>

If everything works (esp. the locking of the shared disk FS), no.

But just consider that the locking of the shared disk FS on
top of us is broken, and that it issues a write request to
the same block number on both nodes.

Then each node would write its own copy first and the peer's
version of the data second to that block number.

=> We would have different data in this block on our
   two copies. - And we would not even know about it!

What would have happened on a real shared disk?
The real shared disk would have carried out the writes in some order,
and one of the writes would have overwritten the other version.
(This is the basic design idea of proposed solution 1)

(For proposed solution 2 the lock "granularity" of the
 shared disk FS is interesting...)

--snip from ROADMAP file--
 global write order

  As far as I understand the topic up to now we have two options
  to establish a global write order. 

  Proposed Solution 1, using the order of a coordinator node:

  Writes from the coordinator node are carried out, as they are
  carried out on the primary node in conventional DRBD. ( Write 
  to disk and send to peer simultaneously. )

  Writes from the other node are sent to the coordinator first, 
  then the coordinator inserts a small "write now" packet into
  its stream of write packets.
  The node commits the write to its local IO subsystem as soon 
  as it gets the "write-now" packet from the coordinator.

  Note: With protocol C it does not matter which node is the
        coordinator from the performance viewpoint.
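
  A toy model of Proposed Solution 1, to show why both nodes end up with
  the same data. Everything here is an assumption made for illustration
  (the function names, the tuple layout, the request ids); it is not DRBD
  code. The coordinator turns the writes it sees into one stream, and
  both nodes apply that stream in order.

```python
def coordinator_order(arrivals):
    """Build the coordinator's outgoing stream from writes in arrival order.

    arrivals: list of (origin, req_id, block, data); origin is "coord" or "peer".
    Own writes go out as full data packets, peer writes only as "write-now".
    """
    stream = []
    for origin, req_id, block, data in arrivals:
        kind = "data" if origin == "coord" else "write-now"
        stream.append((kind, req_id, block, data if kind == "data" else None))
    return stream

def replay(stream, known):
    """Apply a stream in order; `known` maps req_id -> data already held."""
    disk = {}
    for kind, req_id, block, data in stream:
        disk[block] = data if kind == "data" else known[req_id]
    return disk

# Both nodes write block 7; the coordinator happens to see its own write first.
arrivals = [("coord", 1, 7, "green"), ("peer", 2, 7, "blue")]
stream = coordinator_order(arrivals)
peer_pending = {2: "blue"}    # the peer's own write, waiting for "write-now"
coord_received = {2: "blue"}  # the same write, as received by the coordinator
assert replay(stream, peer_pending) == replay(stream, coord_received) == {7: "blue"}
```

  The "write-now" entry carries no payload, which matches the idea of a
  small packet: the non-coordinator node already holds the data and only
  waits to learn its position in the order.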

  Proposed Solution 2, use a dedicated LRU to implement locking:

  Each extent in the locking LRU can have one of these states:
    requested
    locked-by-peer
    locked-by-me
    locked-by-me-and-requested-by-peer

  We allow application writes only to extents which are in
  locked-by-me* state. 

  New Packets:
    LockExtent
    LockExtentAck

  Configuration directives: dl-extents , dl-extent-size

  TODO: Need to verify with GFS that this makes sense.
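
  The four extent states and the write rule of Proposed Solution 2 could
  be written down roughly like this. The class and function names are
  invented for the sketch, state transitions are left out because the
  ROADMAP text does not define them, and extent_sectors merely stands in
  for the dl-extent-size directive, whose unit is not specified above.

```python
from enum import Enum

class ExtentState(Enum):
    REQUESTED = "requested"
    LOCKED_BY_PEER = "locked-by-peer"
    LOCKED_BY_ME = "locked-by-me"
    LOCKED_BY_ME_AND_REQUESTED_BY_PEER = "locked-by-me-and-requested-by-peer"

def may_write(state: ExtentState) -> bool:
    """Application writes are allowed only in the locked-by-me* states."""
    return state.value.startswith("locked-by-me")

def extent_of(sector: int, extent_sectors: int) -> int:
    """Map a sector to the number of the lock extent that covers it."""
    return sector // extent_sectors

assert may_write(ExtentState.LOCKED_BY_ME)
assert not may_write(ExtentState.LOCKED_BY_PEER)
```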

-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-04 13:20   ` Lars Ellenberg
@ 2004-10-04 13:41     ` Lars Marowsky-Bree
  0 siblings, 0 replies; 21+ messages in thread
From: Lars Marowsky-Bree @ 2004-10-04 13:41 UTC (permalink / raw)
  To: drbd-dev

On 2004-10-04T15:20:14, Lars Ellenberg <Lars.Ellenberg@linbit.com> wrote:

> "In case our user goes up the wall",
> we need to guarantee that whatever our users do,
> out lower level devices are identical. allways.

You can't. Not efficiently. You'd need global ordering and global
write/read locking. That would totally kill performance.

> so *in case* gfs had a bug, or something other does strange things with
> us, we can not trust it to not write concurrently on both nodes to the
> same block at the same time.

In case GFS or whatever else messes up its internal write ordering and
cache coherency mechanisms, you're SOL anyway.

All that is needed is to guarantee consistency when barriers come
down; ie, after a barrier (or tagged command sequence, as in SCSI) the
devices need to be consistent (for writes which have happened so far).

> and the (wanted) side effect is, that we always know
> which regions of the device have been active,
> so we can do the resync correctly eventually.

Well, yes, but that's a different issue. Of course the activity logs etc
need to be kept.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-04 13:26   ` Philipp Reisner
@ 2004-10-04 13:49     ` Lars Marowsky-Bree
  2004-10-04 14:09       ` Philipp Reisner
  0 siblings, 1 reply; 21+ messages in thread
From: Lars Marowsky-Bree @ 2004-10-04 13:49 UTC (permalink / raw)
  To: Philipp Reisner, drbd-dev

On 2004-10-04T15:26:15, Philipp Reisner <philipp.reisner@linbit.com> wrote:

> If everything works (esp. the locking of the shared disk fs) no.
> 
> But just consider that the locking of the shared disk FS on 
> top of us is broken, and that it issues a write request to
> the same block number on both nodes.
> 
> Then each node would write its copy first and the peers
> version of the data at second to that block number.
> 
> => We would have different data in this block on our
>    two copies. - And we would event know about it!

You would know the moment the replicated write from the remote end came
in, no?

"Oh my, this is dirty locally too and unacked. We better arbitate now;
ie one side wins and the other one is silently discarded."

(This arbitation doesn't even require an additional communication step
as long as it's consistent; you can simply always let the one with the
lower node id or whatever else win.)

In protocol C mode that's enough if in that case one side becomes the
winner, as the write hasn't returned to the application yet and what is
read() returns until then is undefined anyway.
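
A minimal sketch of such a consistent rule, assuming the "lower node id
wins" variant; the function name and the node ids are made up for
illustration. Each node evaluates the rule on its own, and because the
rule is symmetric both reach the same verdict without exchanging
another packet.

```python
def arbitrate(my_id, my_data, peer_id, peer_data):
    """Return the data that survives a conflicting write to one block."""
    return my_data if my_id < peer_id else peer_data

# Each node evaluates the same rule from its own point of view.
on_node1 = arbitrate(1, "green", 2, "blue")
on_node2 = arbitrate(2, "blue", 1, "green")
assert on_node1 == on_node2 == "green"
```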

You don't need to implement global ordering with heavy weaponry; if you
really wanted that (and I don't think you do) the only sane choice would
be to make drbd use the total or causal ordering mechanisms in the
generic cluster infrastructure. Those are not algorithms you want to
implement internally.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-04 13:49     ` Lars Marowsky-Bree
@ 2004-10-04 14:09       ` Philipp Reisner
  2004-10-04 14:17         ` Philipp Reisner
  0 siblings, 1 reply; 21+ messages in thread
From: Philipp Reisner @ 2004-10-04 14:09 UTC (permalink / raw)
  To: drbd-dev

On Monday 04 October 2004 15:49, Lars Marowsky-Bree wrote:
> On 2004-10-04T15:26:15, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> > If everything works (esp. the locking of the shared disk fs) no.
> >
> > But just consider that the locking of the shared disk FS on
> > top of us is broken, and that it issues a write request to
> > the same block number on both nodes.
> >
> > Then each node would write its copy first and the peers
> > version of the data at second to that block number.
> >
> > => We would have different data in this block on our
> >    two copies. - And we would event know about it!
>
> You would know the moment the replicated write from the remote end came
> in, no?
>
> "Oh my, this is dirty locally too and unacked. We better arbitate now;
> ie one side wins and the other one is silently discarded."
>

This is what I like about mailing lists. This is a new idea that
certainly needs to be considered.

Hmm, I just took a sheet of paper and drew a few diagrams of it.

It works as long as writing the block takes longer than transmitting
the block.

The scheme simply fails if transmitting takes longer than writing.

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-04 14:09       ` Philipp Reisner
@ 2004-10-04 14:17         ` Philipp Reisner
  2004-10-04 15:12           ` Lars Ellenberg
  2004-10-05 19:37           ` Philipp Reisner
  0 siblings, 2 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-04 14:17 UTC (permalink / raw)
  To: drbd-dev

On Monday 04 October 2004 16:09, Philipp Reisner wrote:
> On Monday 04 October 2004 15:49, Lars Marowsky-Bree wrote:
> > On 2004-10-04T15:26:15, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> > > If everything works (esp. the locking of the shared disk fs) no.
> > >
> > > But just consider that the locking of the shared disk FS on
> > > top of us is broken, and that it issues a write request to
> > > the same block number on both nodes.
> > >
> > > Then each node would write its copy first and the peers
> > > version of the data at second to that block number.
> > >
> > > => We would have different data in this block on our
> > >    two copies. - And we would event know about it!
> >
> > You would know the moment the replicated write from the remote end came
> > in, no?
> >
> > "Oh my, this is dirty locally too and unacked. We better arbitate now;
> > ie one side wins and the other one is silently discarded."
>
> This is what I like about mailinglists. This is a new idea, that
> certainly needs to be considered.
>
> Hmm, I just tooks a sheet of paper and drew a view diagrams of it.
>
> It works as long as writing the block takes longer than transmitting
> the block.
>
> The scheme simply fails if transmitting takes longer than writing.
>

No. It works... I will write a text describing it.

-Philipp

-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-04 14:17         ` Philipp Reisner
@ 2004-10-04 15:12           ` Lars Ellenberg
  2004-10-04 20:24             ` Lars Marowsky-Bree
  2004-10-08 12:32             ` Philipp Reisner
  2004-10-05 19:37           ` Philipp Reisner
  1 sibling, 2 replies; 21+ messages in thread
From: Lars Ellenberg @ 2004-10-04 15:12 UTC (permalink / raw)
  To: drbd-dev

/ 2004-10-04 16:17:21 +0200
\ Philipp Reisner:
> On Monday 04 October 2004 16:09, Philipp Reisner wrote:
> > On Monday 04 October 2004 15:49, Lars Marowsky-Bree wrote:
> > > On 2004-10-04T15:26:15, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> > > > If everything works (esp. the locking of the shared disk fs) no.
> > > >
> > > > But just consider that the locking of the shared disk FS on
> > > > top of us is broken, and that it issues a write request to
> > > > the same block number on both nodes.
> > > >
> > > > Then each node would write its copy first and the peers
> > > > version of the data at second to that block number.
> > > >
> > > > => We would have different data in this block on our
> > > >    two copies. - And we would event know about it!
> > >
> > > You would know the moment the replicated write from the remote end came
> > > in, no?
> > >
> > > "Oh my, this is dirty locally too and unacked. We better arbitate now;
> > > ie one side wins and the other one is silently discarded."
> >
> > This is what I like about mailinglists. This is a new idea, that
> > certainly needs to be considered.
> >
> > Hmm, I just tooks a sheet of paper and drew a view diagrams of it.
> >
> > It works as long as writing the block takes longer than transmitting
> > the block.
> >
> > The scheme simply fails if transmitting takes longer than writing.
> >
> 
> No. It works... I will write a text describing it.

I think for two nodes (and drbd will stay that way for some time),
the easiest to implement would be "solution one" anyways.
but, I may be wrong. and, it involves additional latency,
even if it does not need an additional comm step (we can take the
write ack of one node as the "submit now locally" for the other).
or it involves one additional comm step (the extra "submit now" packet),
and still introduces additional latency.

but yes, I think a consistent arbitration
would do the trick much cheaper.

though for the (N>2)-node case I'd like to see your paper first ;)

	lge

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-04 15:12           ` Lars Ellenberg
@ 2004-10-04 20:24             ` Lars Marowsky-Bree
  2004-10-08 12:32             ` Philipp Reisner
  1 sibling, 0 replies; 21+ messages in thread
From: Lars Marowsky-Bree @ 2004-10-04 20:24 UTC (permalink / raw)
  To: drbd-dev

On 2004-10-04T17:12:24, Lars Ellenberg <Lars.Ellenberg@linbit.com> wrote:

> but yes, I think a consistent arbitration would do the trick much
> cheaper.

For two nodes yes. I think it's the optimal scheme assuming that write
contention is not the regular case; if a large percentage (>40% or so)
of writes would overlap I assume a coordination algorithm would be
better. But, I assume such workloads have a much more fundamental
problem. ;-)

> though for the (N>2)-node case I'd like to see your paper first ;)

I don't think this scheme will work well for >2 node active scenarios if
more than two nodes all try to write and each receives the writes in a
different order.

But >2 nodes would likely wish to have an efficient multicast protocol
anyway. Three you could do in a triangle, but 4 already would suck for
such a full mesh anyway.

Actually, the 2-node active:active seems so straightforward it may make
sense for 0.8 already. A passive replication to more than 1 standby may
also be doable. >2 node active/active is 0.9 material...

I need to add that to our funding plans ;-)

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-04 14:17         ` Philipp Reisner
  2004-10-04 15:12           ` Lars Ellenberg
@ 2004-10-05 19:37           ` Philipp Reisner
  2004-10-05 19:39             ` Philipp Reisner
  1 sibling, 1 reply; 21+ messages in thread
From: Philipp Reisner @ 2004-10-05 19:37 UTC (permalink / raw)
  To: drbd-dev

[-- Attachment #1: Type: text/plain, Size: 2586 bytes --]

Hi!

Please also look at the nice PDF!

> > > "Oh my, this is dirty locally too and unacked. We better arbitate now;
> > > ie one side wins and the other one is silently discarded."

9 Support shared disk semantics  ( for GFS, OCFS etc... )

    All the thoughts in this area imply that the cluster deals
    with split brain situations as discussed in item 6.

  In order to offer a shared disk mode for GFS, we allow both
  nodes to become primary. (This needs to be enabled with the
  config statement net { allow-two-primaries; } )

 Read after write dependencies

  The shared state is available to clusters using protocol C
  and B. It is not usable with protocol A.

  To support the shared state with protocol B, upon a read
  request the node has to check if a new version of the block
  is in the process of being written. (== search for it on
  active_ee and done_ee. [ Since it is on active_ee before the
  RecvAck is sent. ] )

 Global write order

  The major pitfall is the handling of concurrent writes to the
  same block. (Concurrent writes to the same blocks should not 
  happen, but we have to assume that it is possible that the
  synchronisation methods of our upper layer [i.e. openGFS] 
  may fail.)

  Without further handling, concurrent writes to the same block
  would get written on each node locally first, then sent
  to the peer, and would then overwrite the local version on the peer.
  In other words, each node would write its local version first,
  and then the peer's version of the data.

  Both nodes need to agree on _one_ order in which such
  conflicting writes should be carried out.

  Proposed Solution

  We arbitrarily select one node (e.g. the node that did the first
  accept() in the drbd_connect() function) and mark it with the
  discard-concurrent-write-flag.

  The algorithm which is performed upon the reception of a
  data packet:

  1. Do we have a concurrent request? (i.e. Do I have a request
     to the same block in my transfer log?) If not -> write now.
  2. Have I already got an ACK packet for the concurrent
     request? (Does the request already have the RQ_DRBD_SENT bit set?)
     If yes -> write the data from the data packet afterwards.
  3. Do I have the "discard-concurrent-write-flag"?
     If yes -> discard the data packet and send a discard notify.
     If no -> Write data from the data packet afterwards.
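
  The three questions above reduce to a small decision function. This is
  a sketch only: the function name and the returned strings are invented
  here, and the three booleans stand for the transfer log lookup, the
  RQ_DRBD_SENT bit and the discard-concurrent-write-flag.

```python
def on_data_packet(concurrent: bool, acked: bool, have_discard_flag: bool) -> str:
    """Decide what to do with a data packet received from the peer."""
    if not concurrent:          # step 1: no request to the same block pending
        return "write-now"
    if acked:                   # step 2: our own request was already ACKed
        return "write-afterwards"
    if have_discard_flag:       # step 3: we are the node that discards
        return "discard-and-notify"
    return "write-afterwards"

assert on_data_packet(False, False, True) == "write-now"
assert on_data_packet(True, False, True) == "discard-and-notify"
assert on_data_packet(True, False, False) == "write-afterwards"
```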

  BTW, each time we have a concurrent write access, we print
  a warning to the syslog, since this indicates that the layer
  above us is broken!

  [ see also GFS-mode-arbitration.pdf for illustration. ]

[-- Attachment #2: GFS-mode-options.pdf --]
[-- Type: application/pdf, Size: 9808 bytes --]

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-05 19:37           ` Philipp Reisner
@ 2004-10-05 19:39             ` Philipp Reisner
  0 siblings, 0 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-05 19:39 UTC (permalink / raw)
  To: drbd-dev

[-- Attachment #1: Type: text/plain, Size: 179 bytes --]

On Tuesday, 5 October 2004 21:37, Philipp Reisner wrote:
> Hi!
>
> Please also look at the nice PDF!

I accidentally attached the wrong one!
Here is the right one.

-philipp



[-- Attachment #2: GFS-mode-arbitration.pdf --]
[-- Type: application/pdf, Size: 8009 bytes --]

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-04 15:12           ` Lars Ellenberg
  2004-10-04 20:24             ` Lars Marowsky-Bree
@ 2004-10-08 12:32             ` Philipp Reisner
  2004-10-08 12:55               ` Lars Marowsky-Bree
  2004-10-08 13:51               ` Lars Ellenberg
  1 sibling, 2 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-08 12:32 UTC (permalink / raw)
  To: drbd-dev

[-- Attachment #1: Type: text/plain, Size: 543 bytes --]

Hi Friends,

In reality it is much more complex than we thought in the first
place.

I think that the solution with the "coordinator node" and the write
now packet would be simpler, but its drawback is that the additional
write now packet means that we have more packets on the wire....

... But please read it first!

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :

[-- Attachment #2: GFS-mode-arbitration2-c.pdf --]
[-- Type: application/pdf, Size: 10404 bytes --]

[-- Attachment #3: ROADMAP.i9 --]
[-- Type: text/plain, Size: 5790 bytes --]

9 Support shared disk semantics  ( for GFS, OCFS etc... )

    All the thoughts in this area imply that the cluster deals
    with split brain situations as discussed in item 6.

  In order to offer a shared disk mode for GFS, we allow both
  nodes to become primary. (This needs to be enabled with the
  config statement net { allow-two-primaries; } )

 Read after write dependencies

  The shared state is available to clusters using protocol C
  and B. It is not usable with protocol A.

  To support the shared state with protocol B, upon a read
  request the node has to check if a new version of the block
  is in the process of being written. (== search for it on
  active_ee and done_ee. [ Since it is on active_ee before the
  RecvAck is sent. ] )
  
 Global write order

  [ Description of GFS-mode-arbitration2.pdf ]

  1. Basic mirroring with protocol C.
    The file system on N2 issues a write request towards DRBD,
    which is written to the local disk and sent to N1. There
    the data block is written to the local disk and an
    acknowledge packet is sent back. As soon as both the
    write to the local disk and the ACK from N1 have completed on N2,
    DRBD signals the completion of IO to the file system.

    The major pitfall is the handling of concurrent writes to the
    same block. (Concurrent writes to the same blocks should not 
    happen, but we have to assume that it is possible that the
    synchronisation methods of our upper layer [i.e. openGFS] 
    may fail.)

    There are many cases in which such concurrent writes would
    lead to different data on our two copies of the block. 

  2. Concurrent writes, network latency is lower than disk latency
    As we can see on the left side of figure two, this could lead
    to N1 having the blue version (=data from FS on N2) while N2
    ends up with the green version (=data from FS on N1).
    The solution is to flag one node (in the example N2 has the
    discard-concurrent-writes-flag).
    As we can see on the right side, both nodes now end up with
    the blue data.

  3. Concurrent writes, high latency for data packets.
    The problem now is that N2 can not detect that this was
    a concurrent write, since it got the ACK before the conflicting
    data packet came in.
    This can happen since in DRBD, data packets and ACK packets are
    transmitted via two independent TCP connections, therefore an
    ACK packet can overtake a data packet.
    The solution is to send a discard info packet along with the ACK
    packet, which identifies the data packet by its sequence number.
    N2 will keep this discard info as long as it has not yet seen
    higher sequence numbers.
    With this both nodes will end up with the blue data.

  4. Concurrent writes, high latency for data packets.
    This is the inverse case to case 3 and already handled by the means
    introduced with item 1.

  5. New write while processing a write from the peer.
    Without further measures this would lead to an inconsistency in
    our mirror, as the figure on the left side shows.
    If we are currently writing a conflicting block from the peer, we
    simply discard the write request from our FS and signal IO
    completion immediately.

  6. High disk latency on N2.
    Through IO reordering in the layers below us this could lead to
    having the blue data on N2 and the green data on N1.
    The solution to this case is to delay the write of the peer's
    data to the local disk on N2 until the local write is done. This is
    different from case two since we already got the write ACK for the
    conflicting block.

  7. A data packet overtakes an ACK packet on the network.
    Although this case is quite unlikely, we have to take it into
    account.
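
  Case 2 from the list above can be replayed with a toy model. It only
  tracks which data ends up in the one contested block; the function name
  and the dictionaries are made up for this sketch, and timing, ACKs and
  the other cases are ignored.

```python
def case2(discard_flag_on=None):
    """Replay case 2: both nodes write the same block at the same time."""
    from_fs = {"N1": "green", "N2": "blue"}   # data issued by each node's FS
    peer_of = {"N1": "N2", "N2": "N1"}
    disk = dict(from_fs)                      # each node writes its own data first
    for node in ("N1", "N2"):
        if node == discard_flag_on:
            continue                          # flag holder discards the peer's packet
        disk[node] = from_fs[peer_of[node]]   # peer's data overwrites the local write
    return disk

print(case2())        # {'N1': 'blue', 'N2': 'green'}
print(case2("N2"))    # {'N1': 'blue', 'N2': 'blue'}
```

  Without a flag the two copies diverge; with the flag on N2 both copies
  hold the blue data, as described for the right side of the figure.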

 Proposed solution

  We arbitrarily select one node (e.g. the node that did the first
  accept() in the drbd_connect() function) and mark it with the
  discard-concurrent-writes-flag.

  Each data packet and each ACK packet gets a sequence
  number, which is increased with every packet sent.
  (This is a common space of sequence numbers)

  The algorithm which is performed upon the reception of a 
  data packet [drbd_receiver].

  *  If the sequence number of the data packet is higher than
     last_seq+1, sleep until last_seq+1 == seq_num(data packet)

  1. If the packet's sequence number is on the discard list,
     simply drop it.
  2. Do we have a concurrent request? (i.e. Do I have a request
     to the same block in my transfer log.) If not -> write now.
  3. Have I already got an ACK packet for the concurrent
     request? (Does the request already have the RQ_DRBD_SENT bit set?)
     If yes -> write the data from the data packet afterwards.
  4. Do I have the "discard-concurrent-writes-flag"?
     If yes -> discard the data packet.
     If no -> Write data from the data packet afterwards and set
              the RQ_DRBD_SENT bit in the request object (since
              we will not get an ACK from our peer)

  The algorithm which is performed upon the reception of an 
  ACK packet [drbd_asender]

  * If we get an ACK, store the sequence number in last_seq.

  The algorithm which is performed upon the reception of a
  discard info packet [drbd_asender]

  * If the current last_seq is lower than the sequence number of the
    packet that should be discarded, store it in the to-discard list.
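
  The sequence number bookkeeping of the three handlers above might look
  like the following sketch. The class and method names are invented.
  Two things are assumptions that the text does not state: the receiver
  also advances last_seq when it processes a data packet (otherwise two
  consecutive data packets would wait forever), and a discard entry is
  removed once it has been used. Steps 2 to 4 of the data packet handling
  are collapsed into the result "check-conflicts".

```python
class SeqGate:
    """Toy model of last_seq and the to-discard list."""

    def __init__(self):
        self.last_seq = 0
        self.to_discard = set()

    def on_ack(self, seq):
        self.last_seq = seq                       # ACK: store the sequence number

    def on_discard_info(self, seq):
        if self.last_seq < seq:                   # packet not seen yet: remember it
            self.to_discard.add(seq)

    def on_data(self, seq):
        if seq > self.last_seq + 1:
            return "wait"                         # stands in for the sleep in the text
        self.last_seq = max(self.last_seq, seq)   # assumption, see the note above
        if seq in self.to_discard:
            self.to_discard.discard(seq)          # assumption: entries are used once
            return "drop"
        return "check-conflicts"                  # continue with steps 2 to 4

gate = SeqGate()
gate.on_discard_info(1)
assert gate.on_data(1) == "drop"
assert gate.on_data(3) == "wait"
gate.on_ack(2)
assert gate.on_data(3) == "check-conflicts"
```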

  BTW, each time we have a concurrent write access, we print
  a warning to the syslog, since this indicates that the layer
  above us is broken!

  Note: In Item 6 we created a hash table over all requests in the
        transfer log, keyed with (sector & ~0x7). This allows us
        to find IO operations starting in the same 4k block of
        data quickly. -> With two lookups in the hash table we can
        find any concurrent access.
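
  A sketch of such a lookup, with a Python dict standing in for the hash
  table. The class and its methods are hypothetical. The note does not
  say which two buckets are probed; this version probes the buckets of
  the first and the last sector of the new request, which is an
  assumption.

```python
from collections import defaultdict

def bucket(sector: int) -> int:
    """Key from the note: the first sector of the 4k block (8 sectors)."""
    return sector & ~0x7

class TransferLogIndex:
    def __init__(self):
        self.by_bucket = defaultdict(list)        # bucket -> [(sector, nsectors)]

    def add(self, sector, nsectors):
        self.by_bucket[bucket(sector)].append((sector, nsectors))

    def find_conflicts(self, sector, nsectors):
        """Return logged requests that overlap [sector, sector + nsectors)."""
        last = sector + nsectors - 1
        hits = []
        for key in {bucket(sector), bucket(last)}:    # at most two lookups
            for s, n in self.by_bucket.get(key, []):
                if s <= last and sector < s + n:      # the sector ranges intersect
                    hits.append((s, n))
        return hits

log = TransferLogIndex()
log.add(8, 8)                                     # a 4k request at sectors 8..15
assert log.find_conflicts(12, 8) == [(8, 8)]      # sectors 12..19 overlap it
assert log.find_conflicts(16, 8) == []            # sectors 16..23 do not
```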

* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-08 12:32             ` Philipp Reisner
@ 2004-10-08 12:55               ` Lars Marowsky-Bree
  2004-10-08 13:37                 ` Philipp Reisner
  2004-10-08 13:51               ` Lars Ellenberg
  1 sibling, 1 reply; 21+ messages in thread
From: Lars Marowsky-Bree @ 2004-10-08 12:55 UTC (permalink / raw)
  To: Philipp Reisner, drbd-dev

On 2004-10-08T14:32:09, Philipp Reisner <philipp.reisner@linbit.com> wrote:

>   3. Concurrent writes, high latency for data packets.
>     The problem now is that N2 does can not detect that this was
>     a concurrent write, since it got the ACK before the conflicting
>     data packets comes in. 

Uhm. I don't see how this can be a problem.

In this case, one write has logically happened before the other, and
they don't overlap - the second write will simply wipe out the
first one, which seems fine?

>   5. New write while processing a write from the peer.

Sounds just like case 1.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company


* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-08 12:55               ` Lars Marowsky-Bree
@ 2004-10-08 13:37                 ` Philipp Reisner
  0 siblings, 0 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-08 13:37 UTC (permalink / raw)
  To: drbd-dev

On Friday, 8 October 2004 14:55, Lars Marowsky-Bree wrote:
> On 2004-10-08T14:32:09, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >   3. Concurrent writes, high latency for data packets.
> >     The problem now is that N2 cannot detect that this was
> >     a concurrent write, since it got the ACK before the conflicting
> >     data packet comes in.
>
> Uhm. I don't see how this can be a problem.
>
> In this case, one write has logically happened before the other, and
> they don't overlap - the second write will simply wipe out the
> first one, which seems fine?
>

Just look at it again. In the left figure you will find that N1 has
the blue data on its block and N2 has the green data on its disk.

I do see a problem here.

> >   5. New write while processing a write from the peer.
>
> Sounds just like case 1.
>

In case 1 there is no concurrent access at all ?!?

Have you had a look at the PDF?

-Philipp


* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-08 12:32             ` Philipp Reisner
  2004-10-08 12:55               ` Lars Marowsky-Bree
@ 2004-10-08 13:51               ` Lars Ellenberg
  2004-10-11  7:12                 ` Philipp Reisner
  1 sibling, 1 reply; 21+ messages in thread
From: Lars Ellenberg @ 2004-10-08 13:51 UTC (permalink / raw)
  To: drbd-dev

/ 2004-10-08 14:32:09 +0200
\ Philipp Reisner:
> Hi Friends,
> 
> In reality it is much more complex than we thought in the first
> place.
> 
> I think that the solution with the "coordinator node" and the write
> now packet would be simpler, but its drawback is that the additional
> write now packet means that we have more packets on the wire....
> 
> ... But please read it first!

now, I did not, yet...

but, 

network packets we have in the active/non-active case:

	data -> 
	     <- ack (recv (B) or write (C))
	
packets we have in the active/active case,
let's do this strictly for protocol C first:

  write on non-coordinator:
	data -> 
	     <- write now               [ when? is this already a write ack? ]
	ack  ->             (write ack)
	     <- ack         (write ack) [ when? is this necessary? ]

  write on coordinator:
	     <- data
	ack  ->

packets we have in the active/active case, arbitration mode:

	data ->
		[ cancel it, or write it.
		  if canceled, send "cancel ack",
		  if written, send write ack ]
	     <- ack


do we agree so far?
or is anything else necessary?
an additional ack in the other direction, maybe?
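
to compare the variants, the per-write packet counts from the diagrams
above can be tallied in a few lines. this is just a sketch: the enum and
function names are made up for illustration, and the trailing_ack flag
stands for the last ack marked "[ when? is this necessary? ]".

```c
#include <stdbool.h>

/* The four write paths drawn in the diagrams above. */
enum write_path {
	WRITE_ACTIVE_PASSIVE,      /* data, ack */
	WRITE_ON_NON_COORDINATOR,  /* data, write now, ack [, ack] */
	WRITE_ON_COORDINATOR,      /* data, ack */
	WRITE_ARBITRATION,         /* data, write ack or cancel ack */
};

/* Number of packets on the wire for one write (protocol C assumed). */
static unsigned int packets_per_write(enum write_path path, bool trailing_ack)
{
	switch (path) {
	case WRITE_ON_NON_COORDINATOR:
		return trailing_ack ? 4 : 3;
	case WRITE_ACTIVE_PASSIVE:
	case WRITE_ON_COORDINATOR:
	case WRITE_ARBITRATION:
		return 2;
	}
	return 0;
}
```

so only the write on the non-coordinator costs more than the two packets
we have today, and arbitration stays at two as long as no additional ack
turns out to be needed.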


I think I like the "locking extents" best.
this assumes that a typical usage pattern would have distinct active
sets on both nodes. then, most of the time writes go through normally as
if this was active, and the other node non-active.
sometimes, i.e. whenever I modify the activity-log, I need to communicate:
	want-extent -> 
			[**]
		    <- there you go

and this is expected to be as infrequent as actlog updates now.
but [**] can be expensive, if both nodes try to write to the same
"lock region", and we have a lock-extent ping-pong, because it would
basically mean 
	if I don't use it, tell peer "you have it",
	if I did use it, but it's no longer in active use now, drop it from
		my activity log and tell peer "you have it"
	if it is still in use, mark it to be sent to the peer,
		which implies to not accept new requests,
		and as soon as the local usage count drops to zero,
		it is sent to the peer.
now, if the nodes alternately write blocks to the same lock-region, that's bad.
expectation is they don't, because upper layers have the same
problem, and therefore will optimize to not do so.
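
the three cases for [**] could look roughly like this. it is a sketch
under assumptions, not an implementation: struct lock_extent, its fields
and the three functions are hypothetical names, and sending the actual
"there you go" packet is left to the caller.

```c
#include <stdbool.h>

enum extent_owner {
	EXT_OURS,       /* we hold the extent */
	EXT_HANDOVER,   /* promised to the peer, draining local requests */
	EXT_PEERS,      /* the peer holds the extent */
};

struct lock_extent {
	enum extent_owner owner;
	bool in_act_log;          /* listed in our activity log */
	unsigned int in_flight;   /* local requests still using the extent */
};

/* Peer sent "want-extent". Returns true if "there you go" can go out now. */
static bool ext_peer_wants(struct lock_extent *e)
{
	if (e->owner != EXT_OURS)
		return false;            /* nothing to give, or already draining */
	if (e->in_flight > 0) {
		e->owner = EXT_HANDOVER; /* still in use: stop taking new requests */
		return false;
	}
	e->in_act_log = false;       /* unused or idle: drop it from our log */
	e->owner = EXT_PEERS;
	return true;
}

/* A local write wants to start. Returns false if it has to wait. */
static bool ext_start_request(struct lock_extent *e)
{
	if (e->owner != EXT_OURS)
		return false;
	e->in_act_log = true;
	e->in_flight++;
	return true;
}

/* A local write finished. Returns true if "there you go" is now due. */
static bool ext_end_request(struct lock_extent *e)
{
	e->in_flight--;
	if (e->owner == EXT_HANDOVER && e->in_flight == 0) {
		e->in_act_log = false;
		e->owner = EXT_PEERS;
		return true;
	}
	return false;
}
```

the ping-pong case shows up here as every ext_start_request() on one
node failing until the other node has drained and handed the extent
back.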


but, yes, I will have a look at the arbitration logic, too.

	lge


* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-08 13:51               ` Lars Ellenberg
@ 2004-10-11  7:12                 ` Philipp Reisner
  2004-10-11 10:09                   ` Lars Ellenberg
  2004-10-11 10:11                   ` Lars Ellenberg
  0 siblings, 2 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-11  7:12 UTC (permalink / raw)
  To: drbd-dev


> but, yes, I will have a look at the arbitration logic, too.
>

Hi Lars,

Did you find any loose ends in my description of the 
arbitration logic ? -- If we do not find any loose ends
my vote goes for it.

-Phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :


* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-11  7:12                 ` Philipp Reisner
@ 2004-10-11 10:09                   ` Lars Ellenberg
  2004-10-11 10:11                   ` Lars Ellenberg
  1 sibling, 0 replies; 21+ messages in thread
From: Lars Ellenberg @ 2004-10-11 10:09 UTC (permalink / raw)
  To: drbd-dev

/ 2004-10-11 09:12:02 +0200
\ Philipp Reisner:
> 
> > but, yes, I will have a look at the arbitration logic, too.
> >
> 
> Hi Lars,
> 
> Did you find any loose ends in my description of the 
> arbitration logic ? -- If we do not find any loose ends
> my vote goes for it.

not sure yet...


* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-11  7:12                 ` Philipp Reisner
  2004-10-11 10:09                   ` Lars Ellenberg
@ 2004-10-11 10:11                   ` Lars Ellenberg
  2004-10-11 12:28                     ` Philipp Reisner
  1 sibling, 1 reply; 21+ messages in thread
From: Lars Ellenberg @ 2004-10-11 10:11 UTC (permalink / raw)
  To: drbd-dev

/ 2004-10-11 09:12:02 +0200
\ Philipp Reisner:
> 
> > but, yes, I will have a look at the arbitration logic, too.
> >
> 
> Hi Lars,
> 
> Did you find any loose ends in my description of the 
> arbitration logic ? -- If we do not find any loose ends
> my vote goes for it.

not sure yet ...

especially what exactly should happen in failure cases.
need to think about it some more.

	lge


* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-11 10:11                   ` Lars Ellenberg
@ 2004-10-11 12:28                     ` Philipp Reisner
  2004-10-11 12:41                       ` Philipp Reisner
  0 siblings, 1 reply; 21+ messages in thread
From: Philipp Reisner @ 2004-10-11 12:28 UTC (permalink / raw)
  To: drbd-dev

On Monday 11 October 2004 12:11, Lars Ellenberg wrote:
> / 2004-10-11 09:12:02 +0200
>
> \ Philipp Reisner:
> > > but, yes, I will have a look at the arbitration logic, too.
> >
> > Hi Lars,
> >
> > Did you find any loose ends in my description of the
> > arbitration logic ? -- If we do not find any loose ends
> > my vote goes for it.
>
> not sure yet ...
>
> especially what exactly should happen in failure cases.
> need to think about it some more.
>

Right, I thought about that too, and came to the conclusion,
that everything is covered by the AL nicely.

-philipp


* Re: [Drbd-dev] How Locking in GFS works...
  2004-10-11 12:28                     ` Philipp Reisner
@ 2004-10-11 12:41                       ` Philipp Reisner
  0 siblings, 0 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-11 12:41 UTC (permalink / raw)
  To: drbd-dev


> > not sure yet ...
> >
> > especially what exactly should happen in failure cases.
> > need to think about it some more.
>
> Right, I thought about that too, and came to the conclusion,
> that everything is covered by the AL nicely.
>

Ahhh... I think I know what you mean ... Hmmm...

-Philipp



