* [Drbd-dev] How Locking in GFS works...
@ 2004-10-04 12:56 Philipp Reisner
2004-10-04 13:01 ` Lars Marowsky-Bree
0 siblings, 1 reply; 21+ messages in thread
From: Philipp Reisner @ 2004-10-04 12:56 UTC (permalink / raw)
To: drbd-dev
I finally found a text that usefully describes what the on-disk
layout of GFS (actually OpenGFS) looks like and how its locking
works.
[I was not able to find anything about Sistina's, ahem, Red Hat's
GFS.]
Find everything at http://opengfs.sourceforge.net/docs.php
The interesting part of the document on locking:
2. Lock name
lm_lockname_t gl_name -- Unique "name" (but not a string!) for lock
The lockname structure has two components:
uint64 ln_number -- lock number
unsigned int ln_type -- type of protected entity
For most locks, the lock number is the block number (within the filesystem's
64-bit linear block space, which can span many storage devices) of the
protected entity, left shifted to be equivalent to a 512-byte sector.
Details are in src/fs/glock.c, ogfs_blk2lockname().
As an example, if we wanted to protect an inode at block 0x100, and we
are using 4-kByte blocks, the lock number would be 0x0800 (0x100 << 3).
I believe the block-to-sector conversion is for support of hardware-based
DMEP protocols, which address the DMEP storage space in terms of 512-byte
sectors. This could turn out to be problematic in *very large* 64-bit
filesystems, if they want to use the upper 3 bits of the 64-bit block
number.
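The shift in the example above can be checked with a small sketch. This is not the code from src/fs/glock.c; the names blk_to_lock_number and loses_high_bits are made up for illustration, and the only assumptions are 512-byte sectors and a power-of-two block size.

```python
SECTOR_SIZE = 512  # bytes; the text says DMEP addresses storage in 512-byte sectors
UINT64_MASK = 0xFFFFFFFFFFFFFFFF


def sector_shift(block_size: int) -> int:
    """Number of bits to shift a block number so that it counts sectors."""
    sectors_per_block = block_size // SECTOR_SIZE
    return sectors_per_block.bit_length() - 1  # log2 of a power of two


def blk_to_lock_number(block: int, block_size: int) -> int:
    """Hypothetical stand-in for the block-to-lock-number conversion."""
    return (block << sector_shift(block_size)) & UINT64_MASK


def loses_high_bits(block: int, block_size: int) -> bool:
    """True if the shift pushes set bits out of a 64-bit lock number."""
    return (block << sector_shift(block_size)) >> 64 != 0


# The inode at block 0x100 with 4-kByte blocks from the quoted text:
print(hex(blk_to_lock_number(0x100, 4096)))
```

With 4-kByte blocks the shift is 3 bits, which is why a block number that uses any of its upper 3 bits would no longer map to a unique 64-bit lock number.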
There is a special lock for the disk-based superblock, defined in
src/fs/ogfs_ondisk.h. Note that this lock is not based on the block
number (the superblock is *not* stored in block 0):
OGFS_SB_LOCK (0) -- protects superblock read accesses from fs
upgrades
In addition to the block-based number assignments, OpenGFS uses some
special, non-disk lock numbers. They are defined in src/fs/ogfs_ondisk.h
(even though they don't show up on disk!):
[...]
This is intended as food for thought on how we should design our
support for shared disk file systems.
-Philipp
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com :
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-04 12:56 [Drbd-dev] How Locking in GFS works Philipp Reisner
@ 2004-10-04 13:01 ` Lars Marowsky-Bree
2004-10-04 13:20 ` Lars Ellenberg
2004-10-04 13:26 ` Philipp Reisner
0 siblings, 2 replies; 21+ messages in thread
From: Lars Marowsky-Bree @ 2004-10-04 13:01 UTC (permalink / raw)
To: drbd-dev
On 2004-10-04T14:56:21, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> This is intended as food for thought on how we should design our
> support for shared disk file systems.
I'm still not sure what kind of special support you need. The only
guarantee you need to provide is that after a barrier all reads on all
nodes return the same data for those blocks affected by the flush.
The shared disk file system itself will take care of issuing
appropriate barriers and flushing the OS caches.
Am I missing something? ;-)
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-04 13:01 ` Lars Marowsky-Bree
@ 2004-10-04 13:20 ` Lars Ellenberg
2004-10-04 13:41 ` Lars Marowsky-Bree
2004-10-04 13:26 ` Philipp Reisner
1 sibling, 1 reply; 21+ messages in thread
From: Lars Ellenberg @ 2004-10-04 13:20 UTC (permalink / raw)
To: drbd-dev
/ 2004-10-04 15:01:58 +0200
\ Lars Marowsky-Bree:
> On 2004-10-04T14:56:21, Philipp Reisner <philipp.reisner@linbit.com> wrote:
>
> > This is intended as food for thought on how we should design our
> > support for shared disk file systems.
>
> I'm still not sure what kind of special support you need. The only
> guarantee you need to provide is that after a barrier all reads on all
> nodes return the same data for those blocks affected by the flush.
>
> The shared disk file system itself will take care of issueing
> appropriate barrier and flushing the OS caches.
>
> Am I missing something? ;-)
"In case our user goes up the wall",
we need to guarantee that whatever our users do,
our lower level devices are identical. always.
so *in case* gfs had a bug, or something else does strange things with
us, we can not trust it to not write concurrently on both nodes to the
same block at the same time.
we have to assume that this can indeed happen,
and do some serialization stuff internally. just in case.
and if we know the expected access pattern of our users,
we can optimize our own internal serialization stuff to
not conflict and degrade performance,
but to only be there as a safety net.
and the (wanted) side effect is, that we always know
which regions of the device have been active,
so we can do the resync correctly eventually.
lge
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-04 13:01 ` Lars Marowsky-Bree
2004-10-04 13:20 ` Lars Ellenberg
@ 2004-10-04 13:26 ` Philipp Reisner
2004-10-04 13:49 ` Lars Marowsky-Bree
1 sibling, 1 reply; 21+ messages in thread
From: Philipp Reisner @ 2004-10-04 13:26 UTC (permalink / raw)
To: drbd-dev
On Monday 04 October 2004 15:01, Lars Marowsky-Bree wrote:
> On 2004-10-04T14:56:21, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> > This is intended as food for thought on how we should design our
> > support for shared disk file systems.
>
> I'm still not sure what kind of special support you need. The only
> guarantee you need to provide is that after a barrier all reads on all
> nodes return the same data for those blocks affected by the flush.
>
> The shared disk file system itself will take care of issueing
> appropriate barrier and flushing the OS caches.
>
> Am I missing something? ;-)
>
If everything works (esp. the locking of the shared disk fs) no.
But just consider that the locking of the shared disk FS on
top of us is broken, and that it issues a write request to
the same block number on both nodes.
Then each node would write its own copy first and the peer's
version of the data second to that block number.
=> We would have different data in this block on our
two copies. - And we would even know about it!
What would have happened on a real shared disk?
The real shared disk would have carried out the writes in some order,
and one of the writes would overwrite the other version.
(This is the basic design idea of proposed solution 1)
(For proposed solution 2 the lock "granularity" of the
shared disk FS is interesting...)
--snip from ROADMAP file--
global write order
As far as I understand the topic up to now we have two options
to establish a global write order.
Proposed Solution 1, using the order of a coordinator node:
Writes from the coordinator node are carried out, as they are
carried out on the primary node in conventional DRBD. ( Write
to disk and send to peer simultaneously. )
Writes from the other node are sent to the coordinator first,
then the coordinator inserts a small "write now" packet into
its stream of write packets.
The node commits the write to its local IO subsystem as soon
as it gets the "write-now" packet from the coordinator.
Note: With protocol C it does not matter which node is the
coordinator from the performance viewpoint.
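To make the coordinator idea concrete, here is a minimal single-threaded sketch. It is only a model under stated assumptions: the Node class, the run() helper and the tuple packet format are invented for illustration and are not DRBD code. The point it shows is that the non-coordinator applies everything in the order of the coordinator's packet stream, so both disks end up identical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    disk: dict = field(default_factory=dict)     # block number -> data
    pending: dict = field(default_factory=dict)  # own writes awaiting "write now"


def run(events):
    """events: (origin, block, data) tuples in the order the coordinator sees them."""
    coord, other = Node("coordinator"), Node("other")
    stream = []  # the coordinator's single, ordered stream of packets to its peer

    for req_id, (origin, block, data) in enumerate(events):
        if origin == "coordinator":
            coord.disk[block] = data               # write locally ...
            stream.append(("data", block, data))   # ... and send to the peer
        else:
            other.pending[req_id] = (block, data)  # peer holds its own write back
            coord.disk[block] = data               # coordinator writes the peer's data
            stream.append(("write-now", req_id))   # and slots it into its stream

    # The other node applies the stream strictly in order.
    for packet in stream:
        if packet[0] == "data":
            _, block, data = packet
        else:
            block, data = other.pending.pop(packet[1])
        other.disk[block] = data

    return coord.disk, other.disk


coord_disk, other_disk = run([("other", 7, "B"), ("coordinator", 7, "A")])
print(coord_disk == other_disk)
```

The cost is visible in the model as well: a write from the non-coordinator cannot reach its own disk before a round trip to the coordinator has completed.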
Proposed Solution 2, use a dedicated LRU to implement locking:
Each extent in the locking LRU can have one of these states:
requested
locked-by-peer
locked-by-me
locked-by-me-and-requested-by-peer
We allow application writes only to extents which are in
locked-by-me* state.
New Packets:
LockExtent
LockExtentAck
Configuration directives: dl-extents , dl-extent-size
TODO: Need to verify with GFS that this makes sense.
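A rough sketch of how the four extent states could fit together. The state names are taken from the list above; the events and the transition table are assumptions made for illustration, since the ROADMAP only names the states and the two new packets.

```python
from enum import Enum


class ExtentState(Enum):
    REQUESTED = "requested"
    LOCKED_BY_PEER = "locked-by-peer"
    LOCKED_BY_ME = "locked-by-me"
    LOCKED_BY_ME_AND_REQUESTED_BY_PEER = "locked-by-me-and-requested-by-peer"


# Application writes are only allowed in the locked-by-me* states.
WRITABLE = {
    ExtentState.LOCKED_BY_ME,
    ExtentState.LOCKED_BY_ME_AND_REQUESTED_BY_PEER,
}


def may_write(state: ExtentState) -> bool:
    return state in WRITABLE


# Hypothetical transitions; event names are invented for this sketch.
TRANSITIONS = {
    # we want to write, so we send LockExtent to the peer
    (ExtentState.LOCKED_BY_PEER, "local-write-wanted"): ExtentState.REQUESTED,
    # the peer answered with LockExtentAck
    (ExtentState.REQUESTED, "LockExtentAck"): ExtentState.LOCKED_BY_ME,
    # the peer sent LockExtent while we hold the extent
    (ExtentState.LOCKED_BY_ME, "LockExtent"):
        ExtentState.LOCKED_BY_ME_AND_REQUESTED_BY_PEER,
    # our in-flight IO drained, so we send LockExtentAck and give it up
    (ExtentState.LOCKED_BY_ME_AND_REQUESTED_BY_PEER, "local-io-done"):
        ExtentState.LOCKED_BY_PEER,
}


def next_state(state: ExtentState, event: str) -> ExtentState:
    """Return the follow-up state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Whether a node in locked-by-me-and-requested-by-peer should still accept new writes, or only finish the ones already in flight, is exactly the kind of detail that would need to be checked against the lock granularity GFS uses.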
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-04 13:20 ` Lars Ellenberg
@ 2004-10-04 13:41 ` Lars Marowsky-Bree
0 siblings, 0 replies; 21+ messages in thread
From: Lars Marowsky-Bree @ 2004-10-04 13:41 UTC (permalink / raw)
To: drbd-dev
On 2004-10-04T15:20:14, Lars Ellenberg <Lars.Ellenberg@linbit.com> wrote:
> "In case our user goes up the wall",
> we need to guarantee that whatever our users do,
> out lower level devices are identical. allways.
You can't. Not efficiently. You'd need global ordering and global
write/read locking. That would totally kill performance.
> so *in case* gfs had a bug, or something other does strange things with
> us, we can not trust it to not write concurrently on both nodes to the
> same block at the same time.
In case GFS or whatever else messes up its internal write ordering and
cache coherency mechanisms, you're SOL anyway.
All that is needed is to guarantee consistency when barriers come
down; i.e., after a barrier (or tagged command sequence, as in SCSI) the
devices need to be consistent (for writes which have happened so far).
> and the (wanted) side effect is, that we always know
> which regions of the device have been active,
> so we can do the resync correctly eventually.
Well, yes, but that's a different issue. Of course the activity logs etc
need to be kept.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-04 13:26 ` Philipp Reisner
@ 2004-10-04 13:49 ` Lars Marowsky-Bree
2004-10-04 14:09 ` Philipp Reisner
0 siblings, 1 reply; 21+ messages in thread
From: Lars Marowsky-Bree @ 2004-10-04 13:49 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev
On 2004-10-04T15:26:15, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> If everything works (esp. the locking of the shared disk fs) no.
>
> But just consider that the locking of the shared disk FS on
> top of us is broken, and that it issues a write request to
> the same block number on both nodes.
>
> Then each node would write its copy first and the peers
> version of the data at second to that block number.
>
> => We would have different data in this block on our
> two copies. - And we would event know about it!
You would know the moment the replicated write from the remote end came
in, no?
"Oh my, this is dirty locally too and unacked. We better arbitrate now;
i.e. one side wins and the other one is silently discarded."
(This arbitration doesn't even require an additional communication step
as long as it's consistent; you can simply always let the one with the
lower node id or whatever else win.)
In protocol C mode it's enough if in that case one side becomes the
winner, as the write hasn't returned to the application yet and what
read() returns until then is undefined anyway.
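A minimal sketch of such a consistent rule, assuming each node knows its own and its peer's numeric node id (the function name and parameters are made up for illustration). Both sides evaluate the same comparison on the same two ids, so they reach the same verdict without exchanging another packet.

```python
def arbitrate(local_node_id: int, peer_node_id: int, local_data, peer_data):
    """Return the version of a conflicting block that this node should keep.

    Rule assumed here: the write that originated on the node with the
    lower id wins; the other one is silently discarded.
    """
    if local_node_id < peer_node_id:
        return local_data
    return peer_data


# Node 0 wrote "green", node 1 wrote "blue" to the same block concurrently.
kept_on_node0 = arbitrate(0, 1, "green", "blue")
kept_on_node1 = arbitrate(1, 0, "blue", "green")
print(kept_on_node0 == kept_on_node1)
```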
You don't need to implement global ordering with heavy weaponry; if you
really wanted that (and I don't think you do) the only sane choice would
be to make drbd use the total or causal ordering mechanisms in the
generic cluster infrastructure. Those are not algorithms you want to
implement internally.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-04 13:49 ` Lars Marowsky-Bree
@ 2004-10-04 14:09 ` Philipp Reisner
2004-10-04 14:17 ` Philipp Reisner
0 siblings, 1 reply; 21+ messages in thread
From: Philipp Reisner @ 2004-10-04 14:09 UTC (permalink / raw)
To: drbd-dev
On Monday 04 October 2004 15:49, Lars Marowsky-Bree wrote:
> On 2004-10-04T15:26:15, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> > If everything works (esp. the locking of the shared disk fs) no.
> >
> > But just consider that the locking of the shared disk FS on
> > top of us is broken, and that it issues a write request to
> > the same block number on both nodes.
> >
> > Then each node would write its copy first and the peers
> > version of the data at second to that block number.
> >
> > => We would have different data in this block on our
> > two copies. - And we would event know about it!
>
> You would know the moment the replicated write from the remote end came
> in, no?
>
> "Oh my, this is dirty locally too and unacked. We better arbitate now;
> ie one side wins and the other one is silently discarded."
>
This is what I like about mailing lists. This is a new idea that
certainly needs to be considered.
Hmm, I just took a sheet of paper and drew a few diagrams of it.
It works as long as writing the block takes longer than transmitting
the block.
The scheme simply fails if transmitting takes longer than writing.
-Philipp
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-04 14:09 ` Philipp Reisner
@ 2004-10-04 14:17 ` Philipp Reisner
2004-10-04 15:12 ` Lars Ellenberg
2004-10-05 19:37 ` Philipp Reisner
0 siblings, 2 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-04 14:17 UTC (permalink / raw)
To: drbd-dev
On Monday 04 October 2004 16:09, Philipp Reisner wrote:
> On Monday 04 October 2004 15:49, Lars Marowsky-Bree wrote:
> > On 2004-10-04T15:26:15, Philipp Reisner <philipp.reisner@linbit.com>
wrote:
> > > If everything works (esp. the locking of the shared disk fs) no.
> > >
> > > But just consider that the locking of the shared disk FS on
> > > top of us is broken, and that it issues a write request to
> > > the same block number on both nodes.
> > >
> > > Then each node would write its copy first and the peers
> > > version of the data at second to that block number.
> > >
> > > => We would have different data in this block on our
> > > two copies. - And we would event know about it!
> >
> > You would know the moment the replicated write from the remote end came
> > in, no?
> >
> > "Oh my, this is dirty locally too and unacked. We better arbitate now;
> > ie one side wins and the other one is silently discarded."
>
> This is what I like about mailinglists. This is a new idea, that
> certainly needs to be considered.
>
> Hmm, I just tooks a sheet of paper and drew a view diagrams of it.
>
> It works as long as writing the block takes longer than transmitting
> the block.
>
> The scheme simply fails if transmitting takes longer than writing.
>
No. It works... I will write a text describing it.
-Philipp
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-04 14:17 ` Philipp Reisner
@ 2004-10-04 15:12 ` Lars Ellenberg
2004-10-04 20:24 ` Lars Marowsky-Bree
2004-10-08 12:32 ` Philipp Reisner
2004-10-05 19:37 ` Philipp Reisner
1 sibling, 2 replies; 21+ messages in thread
From: Lars Ellenberg @ 2004-10-04 15:12 UTC (permalink / raw)
To: drbd-dev
/ 2004-10-04 16:17:21 +0200
\ Philipp Reisner:
> On Monday 04 October 2004 16:09, Philipp Reisner wrote:
> > On Monday 04 October 2004 15:49, Lars Marowsky-Bree wrote:
> > > On 2004-10-04T15:26:15, Philipp Reisner <philipp.reisner@linbit.com>
> wrote:
> > > > If everything works (esp. the locking of the shared disk fs) no.
> > > >
> > > > But just consider that the locking of the shared disk FS on
> > > > top of us is broken, and that it issues a write request to
> > > > the same block number on both nodes.
> > > >
> > > > Then each node would write its copy first and the peers
> > > > version of the data at second to that block number.
> > > >
> > > > => We would have different data in this block on our
> > > > two copies. - And we would event know about it!
> > >
> > > You would know the moment the replicated write from the remote end came
> > > in, no?
> > >
> > > "Oh my, this is dirty locally too and unacked. We better arbitate now;
> > > ie one side wins and the other one is silently discarded."
> >
> > This is what I like about mailinglists. This is a new idea, that
> > certainly needs to be considered.
> >
> > Hmm, I just tooks a sheet of paper and drew a view diagrams of it.
> >
> > It works as long as writing the block takes longer than transmitting
> > the block.
> >
> > The scheme simply fails if transmitting takes longer than writing.
> >
>
> No. It works... I will write a text describing it.
I think for two nodes (and drbd will stay that way for some time),
the easiest to implement would be "solution one" anyways.
but, I may be wrong. and, it involves additional latency,
even though it does not need an additional comm step (we can take the
write ack of one node as the "submit now locally" for the other).
or it involves one additional comm step (the extra "submit now" packet),
and still introduces additional latency.
but yes, I think a consistent arbitration
would do the trick much cheaper.
though for the (N>2)-node case I'd like to see your paper first ;)
lge
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-04 15:12 ` Lars Ellenberg
@ 2004-10-04 20:24 ` Lars Marowsky-Bree
2004-10-08 12:32 ` Philipp Reisner
1 sibling, 0 replies; 21+ messages in thread
From: Lars Marowsky-Bree @ 2004-10-04 20:24 UTC (permalink / raw)
To: drbd-dev
On 2004-10-04T17:12:24, Lars Ellenberg <Lars.Ellenberg@linbit.com> wrote:
> but yes, I think a consistent arbitration would do the trick much
> cheaper.
For two nodes yes. I think it's the optimal scheme assuming that write
contention is not the regular case; if a large percentage (>40% or so)
of writes would overlap I assume a coordination algorithm would be
better. But, I assume such workloads have a much more fundamental
problem. ;-)
> though for the (N>2)-node case I'd like to see your paper first ;)
I don't think this scheme will work well for >2 node active scenarios if
more than two nodes all try to write and each receives the writes in a
different order.
But >2 nodes would likely wish to have an efficient multicast protocol
anyway. Three you could do in a triangle, but 4 already would suck for
such a full mesh anyway.
Actually, the 2-node active:active seems so straightforward it may make
sense for 0.8 already. A passive replication to more than 1 standby may
also be doable. >2 node active/active is 0.9 material...
I need to add that to our funding plans ;-)
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-04 14:17 ` Philipp Reisner
2004-10-04 15:12 ` Lars Ellenberg
@ 2004-10-05 19:37 ` Philipp Reisner
2004-10-05 19:39 ` Philipp Reisner
1 sibling, 1 reply; 21+ messages in thread
From: Philipp Reisner @ 2004-10-05 19:37 UTC (permalink / raw)
To: drbd-dev
[-- Attachment #1: Type: text/plain, Size: 2586 bytes --]
Hi!
Please also look at the nice PDF!
> > > "Oh my, this is dirty locally too and unacked. We better arbitate now;
> > > ie one side wins and the other one is silently discarded."
9 Support shared disk semantics ( for GFS, OCFS etc... )
All the thoughts in this area imply that the cluster deals
with split brain situations as discussed in item 6.
In order to offer a shared disk mode for GFS, we allow both
nodes to become primary. (This needs to be enabled with the
config statement net { allow-two-primaries; } )
Read after write dependencies
The shared state is available to clusters using protocol C
and B. It is not usable with protocol A.
To support the shared state with protocol B, upon a read
request the node has to check if a new version of the block
is in the process of being written. (== search for it on
active_ee and done_ee. [ Since it is on active_ee before the
RecvAck is sent. ] )
Global write order
The major pitfall is the handling of concurrent writes to the
same block. (Concurrent writes to the same blocks should not
happen, but we have to assume that it is possible that the
synchronisation methods of our upper layer [i.e. openGFS]
may fail.)
Without further handling, concurrent writes to the same block
would get written on each node locally first, then sent
to the peer, and then overwrite the local version on the peer.
In other words, each node would write its local version first,
and then the peer's version of the data.
Both nodes need to agree on _one_ order in which such
conflicting writes should be carried out.
Proposed Solution
We arbitrarily select one node (e.g. the node that did the first
accept() in the drbd_connect() function) and mark it with the
discard-concurrent-write-flag.
The following algorithm is performed upon the reception of a
data packet:
1. Do we have a concurrent request? (i.e. Do I have a request
to the same block in my transfer log.) If not -> write now.
2. Have I already got an ACK packet for the concurrent
request ? (Has the request the RQ_DRBD_SENT bit already set)
If yes -> write the data from the data packet afterwards.
3. Do I have the "discard-concurrent-write-flag" ?
If yes -> discard the data packet and send a discard notification.
If no -> Write data from the data packet afterwards.
BTW, each time we have a concurrent write access, we print
a warning to the syslog, since this indicates that the layer
above us is broken!
[ see also GFS-mode-arbitration.pdf for illustration. ]
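The three steps above can be sketched as a single decision function. This is an illustration, not DRBD code: LocalRequest, on_data_packet and the returned strings are invented names, the transfer log is modelled as a plain list, and the acked field stands in for the RQ_DRBD_SENT bit.

```python
from dataclasses import dataclass


@dataclass
class LocalRequest:
    block: int
    acked: bool = False  # stands in for the RQ_DRBD_SENT bit


def on_data_packet(block, transfer_log, have_discard_flag):
    """Decide what to do with a data packet for `block` received from the peer."""
    # Step 1: is there a request of our own to the same block?
    concurrent = next((r for r in transfer_log if r.block == block), None)
    if concurrent is None:
        return "write-now"

    # Step 2: our own request was already acknowledged by the peer,
    # so the peer's write is logically the later one.
    if concurrent.acked:
        return "write-afterwards"

    # Step 3: a real conflict; the flagged node drops the peer's data.
    if have_discard_flag:
        return "discard-and-notify"
    return "write-afterwards"


log = [LocalRequest(block=42)]
print(on_data_packet(42, log, have_discard_flag=True))
```

In every branch that detects a concurrent request, a real implementation would also emit the syslog warning mentioned above.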
[-- Attachment #2: GFS-mode-options.pdf --]
[-- Type: application/pdf, Size: 9808 bytes --]
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-05 19:37 ` Philipp Reisner
@ 2004-10-05 19:39 ` Philipp Reisner
0 siblings, 0 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-05 19:39 UTC (permalink / raw)
To: drbd-dev
[-- Attachment #1: Type: text/plain, Size: 179 bytes --]
On Tuesday, 5 October 2004 21:37, Philipp Reisner wrote:
> Hi!
>
> Please also look at the nice PDF!
I accidentally attached the wrong one!
Here is the right one
-philipp
[-- Attachment #2: GFS-mode-arbitration.pdf --]
[-- Type: application/pdf, Size: 8009 bytes --]
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-04 15:12 ` Lars Ellenberg
2004-10-04 20:24 ` Lars Marowsky-Bree
@ 2004-10-08 12:32 ` Philipp Reisner
2004-10-08 12:55 ` Lars Marowsky-Bree
2004-10-08 13:51 ` Lars Ellenberg
1 sibling, 2 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-08 12:32 UTC (permalink / raw)
To: drbd-dev
[-- Attachment #1: Type: text/plain, Size: 543 bytes --]
Hi Friends,
In reality it is much more complex than we thought in the first
place.
I think that the solution with the "coordinator node" and the write
now packet would be simpler, but its drawback is that the additional
write now packet means that we have more packets on the wire....
... But please read it first!
-Philipp
[-- Attachment #2: GFS-mode-arbitration2-c.pdf --]
[-- Type: application/pdf, Size: 10404 bytes --]
[-- Attachment #3: ROADMAP.i9 --]
[-- Type: text/plain, Size: 5790 bytes --]
9 Support shared disk semantics ( for GFS, OCFS etc... )
All the thoughts in this area imply that the cluster deals
with split brain situations as discussed in item 6.
In order to offer a shared disk mode for GFS, we allow both
nodes to become primary. (This needs to be enabled with the
config statement net { allow-two-primaries; } )
Read after write dependencies
The shared state is available to clusters using protocol C
and B. It is not usable with protocol A.
To support the shared state with protocol B, upon a read
request the node has to check if a new version of the block
is in the process of being written. (== search for it on
active_ee and done_ee. [ Since it is on active_ee before the
RecvAck is sent. ] )
Global write order
[ Description of GFS-mode-arbitration2.pdf ]
1. Basic mirroring with protocol C.
The file system on N2 issues a write request towards DRBD,
which is written to the local disk and sent to N1. Then
the data block is written to the local disk there and an
acknowledge packet is sent back. As soon as both the
write to the local disk and the ACK from N1 reach N2,
DRBD signals the completion of IO to the file system.
The major pitfall is the handling of concurrent writes to the
same block. (Concurrent writes to the same blocks should not
happen, but we have to assume that it is possible that the
synchronisation methods of our upper layer [i.e. openGFS]
may fail.)
There are many cases in which such concurrent writes would
lead to different data on our two copies of the block.
2. Concurrent writes, network latency is lower than disk latency
As we can see on the left side in figure two this could lead
to N1 having the blue version (=data from FS on N2) while N2
ends up with the green version (=data from FS on N1).
The solution is to flag one node (in the example N2 has the
discard-concurrent-writes-flag).
As we can see on the right side, now both nodes end up with
the blue data.
3. Concurrent writes, high latency for data packets.
The problem now is that N2 cannot detect that this was
a concurrent write, since it got the ACK before the conflicting
data packet comes in.
This can happen since in DRBD, data packets and ACK packets are
transmitted via two independent TCP connections, therefore an
ACK packet can overtake a data packet.
The solution is to send with the ACK packet a discard info packet,
which identifies the data packet by its sequence number.
N2 will keep this discard info as long as it has not yet seen higher
sequence numbers.
With this both nodes will end up with the blue data.
4. Concurrent writes, high latency for data packets.
This is the inverse of case 3 and is already handled by the means
introduced with item 1.
5. New write while processing a write from the peer.
Without further measures this would lead to an inconsistency in
our mirror as the figure on the left side shows.
If we currently write a conflicting block from the peer, we simply
discard the write request from our FS and signal IO completion
immediately.
6. High disk latency on N2.
By IO reordering in the layers below us this could lead to
having the blue data on N2 and the green data on N1.
The solution to this case is to delay the write to the local
disk on N2 until the local write is done. This is different from
case two since we already got the write ACK to the conflicting
block.
7. A data packet overtakes an ACK packet on the network.
Although this case is quite unlikely, we have to take it into
account.
Proposed solution
We arbitrarily select one node (e.g. the node that did the first
accept() in the drbd_connect() function) and mark it with the
discard-concurrent-writes-flag.
Each data packet and each ACK packet gets a sequence
number, which is increased with every packet sent.
(This is a common space of sequence numbers)
The algorithm which is performed upon the reception of a
data packet [drbd_receiver].
* If the sequence number of the data packet is higher than
last_seq+1 sleep until last_seq-1 == seq_num(data packet)
1. If the packet's sequence number is on the discard list,
simply drop it.
2. Do we have a concurrent request? (i.e. Do I have a request
to the same block in my transfer log.) If not -> write now.
3. Have I already got an ACK packet for the concurrent
request ? (Has the request the RQ_DRBD_SENT bit already set)
If yes -> write the data from the data packet afterwards.
4. Do I have the "discard-concurrent-writes-flag" ?
If yes -> discard the data packet.
If no -> Write data from the data packet afterwards and set
the RQ_DRBD_SENT bit in the request object ( Since
we will not get an ACK from our peer )
The algorithm which is performed upon the reception of an
ACK packet [drbd_asender]
* If we get an ACK, store the sequence number in last_seq.
The algorithm which is performed upon the reception of a
discard info packet [drbd_asender]
* if the current last_seq is lower than the sequence number of the
packet that should be discarded, store it in the to-discard list.
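The three handlers described above can be put together in one small model. This is only a sketch under assumptions: the Receiver and Request classes and the returned strings are invented for illustration, both handler threads are collapsed into one object, and the initial "sleep until the packet is in sequence" step is left out because it needs real concurrency. The sent field stands in for the RQ_DRBD_SENT bit.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    block: int
    sent: bool = False  # stands in for the RQ_DRBD_SENT bit


@dataclass
class Receiver:
    have_discard_flag: bool
    last_seq: int = 0
    discard_list: set = field(default_factory=set)
    transfer_log: list = field(default_factory=list)

    def on_ack(self, seq: int) -> None:
        """ACK handler: remember the sequence number."""
        self.last_seq = seq

    def on_discard_info(self, seq: int) -> None:
        """Discard info handler: only note packets we have not passed yet."""
        if self.last_seq < seq:
            self.discard_list.add(seq)

    def on_data(self, seq: int, block: int) -> str:
        """Data packet handler, steps 1 to 4 (ordering wait omitted)."""
        if seq in self.discard_list:                       # step 1
            return "drop"
        concurrent = next(
            (r for r in self.transfer_log if r.block == block), None)
        if concurrent is None:                             # step 2
            return "write-now"
        if concurrent.sent:                                # step 3
            return "write-afterwards"
        if self.have_discard_flag:                         # step 4, flagged node
            return "discard"
        concurrent.sent = True  # no ACK will come from the peer for it
        return "write-afterwards"


n1 = Receiver(have_discard_flag=False)
n1.on_ack(4)
n1.on_discard_info(6)
print(n1.on_data(6, block=10))
```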
BTW, each time we have a concurrent write access, we print
a warning to the syslog, since this indicates that the layer
above us is broken!
Note: In Item 6 we created a hash table over all requests in the
transfer log, keyed with (sector & ~0x7). This allows us
to find IO operations starting in the same 4k block of
data quickly. -> With two lookups in the hash table we can
find any concurrent access.
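A sketch of such a lookup, with two assumptions that go beyond the note above: requests are at most 4k (8 sectors of 512 bytes), and each request is indexed under every 4k bucket it touches rather than only the one it starts in. Under those assumptions a request touches at most two buckets, so two lookups are enough. The class and method names are made up for illustration.

```python
from collections import defaultdict

SECTORS_PER_4K = 8  # 8 sectors of 512 bytes


def bucket(sector: int) -> int:
    """Hash key from the note above: first sector of the enclosing 4k block."""
    return sector & ~0x7


class TransferLogIndex:
    def __init__(self):
        self.buckets = defaultdict(list)  # key -> list of (start, nr_sectors)

    def _keys(self, start: int, nr_sectors: int) -> set:
        # A request of at most 8 sectors touches at most two 4k buckets.
        return {bucket(start), bucket(start + nr_sectors - 1)}

    def add(self, start: int, nr_sectors: int) -> None:
        if not 0 < nr_sectors <= SECTORS_PER_4K:
            raise ValueError("sketch only handles requests of 1 to 8 sectors")
        for key in self._keys(start, nr_sectors):
            self.buckets[key].append((start, nr_sectors))

    def find_overlaps(self, start: int, nr_sectors: int) -> set:
        """Return all indexed requests sharing at least one sector."""
        end = start + nr_sectors
        hits = set()
        for key in self._keys(start, nr_sectors):   # at most two lookups
            for s, n in self.buckets.get(key, ()):
                if s < end and start < s + n:
                    hits.add((s, n))
        return hits


index = TransferLogIndex()
index.add(3, 8)                    # covers sectors 3..10
print(index.find_overlaps(9, 2))   # covers sectors 9..10
```

If requests were only indexed under their starting bucket, a request that begins in the previous 4k block and runs into the new one could be missed, which is why this sketch indexes both ends.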
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-08 12:32 ` Philipp Reisner
@ 2004-10-08 12:55 ` Lars Marowsky-Bree
2004-10-08 13:37 ` Philipp Reisner
2004-10-08 13:51 ` Lars Ellenberg
1 sibling, 1 reply; 21+ messages in thread
From: Lars Marowsky-Bree @ 2004-10-08 12:55 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev
On 2004-10-08T14:32:09, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> 3. Concurrent writes, high latency for data packets.
> The problem now is that N2 does can not detect that this was
> a concurrent write, since it got the ACK before the conflicting
> data packets comes in.
Uhm. I don't see how this can be a problem.
In this case, one write has logically happened before the other, and
they don't overlap - the second write will simply wipe out the
first one, which seems fine?
> 5. New write while processing a write from the peer.
Sounds just like case 1.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-08 12:55 ` Lars Marowsky-Bree
@ 2004-10-08 13:37 ` Philipp Reisner
0 siblings, 0 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-08 13:37 UTC (permalink / raw)
To: drbd-dev
On Friday, 8 October 2004 14:55, Lars Marowsky-Bree wrote:
> On 2004-10-08T14:32:09, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> > 3. Concurrent writes, high latency for data packets.
> > The problem now is that N2 does can not detect that this was
> > a concurrent write, since it got the ACK before the conflicting
> > data packets comes in.
>
> Uhm. I don't see how this can be a problem.
>
> In this case, one write has logically happened before the other, and
> from they don't overlap - the second write will simply wipe out the
> first one, which seems fine?
>
Just look at it again. In the left figure you will find that N1 has
the blue data in its block and N2 has the green data on its disk.
I do see a problem here.
> > 5. New write while processing a write from the peer.
>
> Sounds just like case 1.
>
In case 1 there is no concurrent access at all ?!?
Have you had a look at the pdf ?
-Philipp
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-08 12:32 ` Philipp Reisner
2004-10-08 12:55 ` Lars Marowsky-Bree
@ 2004-10-08 13:51 ` Lars Ellenberg
2004-10-11 7:12 ` Philipp Reisner
1 sibling, 1 reply; 21+ messages in thread
From: Lars Ellenberg @ 2004-10-08 13:51 UTC (permalink / raw)
To: drbd-dev
/ 2004-10-08 14:32:09 +0200
\ Philipp Reisner:
> Hi Friends,
>
> In reality it is much more complex than we thought in the first
> place.
>
> I think that the solution with the "coordinator node" and the write
> now packet would be simpler, but its drawback is that the additional
> write now packet means that we have more packets on the wire....
>
> ... But please read it first!
now, I did not, yet...
but,
network packets we have in the active/non-active case:
data ->
<- ack (recv (B) or write (C))
packets we have in the active/active case,
let's do this strictly for protocol C first:
write on non-coordinator:
data ->
<- write now [ when? is this already a write ack? ]
ack -> (write ack)
<- ack (write ack) [ when? is this necessary? ]
write on coordinator:
<- data
ack ->
packets we have in the active/active case, arbitration mode:
data ->
[ cancel it, or write it.
if canceled, send "cancel ack",
if written, send write ack ]
<- ack
do we agree so far?
or is anything else necessary?
an additional ack in the other direction, maybe?
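
To compare the exchanges listed above side by side, here is a minimal
sketch that just writes them down as data and counts packets per write.
The names Packet, EXCHANGES and packet_count are hypothetical and not
part of DRBD; the arrows keep the direction used in the listings, and the
one packet the mail itself questions ("is this necessary?") is marked
optional.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    direction: str          # "->" or "<-", as drawn in the listings above
    name: str
    optional: bool = False  # True for the ack the mail marks as questionable


# Hypothetical transcription of the listings; not taken from DRBD source.
EXCHANGES = {
    "active/non-active": [
        Packet("->", "data"),
        Packet("<-", "ack"),            # recv ack (B) or write ack (C)
    ],
    "active/active, write on non-coordinator": [
        Packet("->", "data"),
        Packet("<-", "write now"),
        Packet("->", "write ack"),
        Packet("<-", "write ack", optional=True),
    ],
    "active/active, write on coordinator": [
        Packet("<-", "data"),
        Packet("->", "ack"),
    ],
    "active/active, arbitration": [
        Packet("->", "data"),
        Packet("<-", "write ack or cancel ack"),
    ],
}


def packet_count(mode: str, include_optional: bool = True) -> int:
    """Number of packets on the wire for one write in the given mode."""
    return sum(1 for p in EXCHANGES[mode] if include_optional or not p.optional)
```

Under these assumptions the coordinator scheme costs three or four packets
for a write on the non-coordinator, while arbitration stays at two, the
same as the active/non-active case.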
I think I like the "locking extents" best.
this assumes that a typical usage pattern would have distinct active
sets on both nodes. then, most of the time writes go through normally as
if this was active, and the other node non-active.
sometimes, i.e. whenever I modify the activity-log, I need to communicate:
want-extent ->
[**]
<- there you go
and this is expected to be as infrequent as actlog updates are now.
but [**] can be expensive, if both nodes try to write to the same
"lock region", and we have a lock-extent ping-pong, because it would
basically mean
  if I don't use it, tell peer "you have it",
  if I did use it, but it's no longer in active use now, drop it from
    my activity log and tell peer "you have it"
  if it is still in use, mark it to be sent to the peer,
    which implies to not accept new requests,
    and as soon as the local usage count drops to zero,
    it is sent to the peer.
now, if they alternately write blocks to the same lock-region, that's bad.
expectation is they don't, because upper layers have the same
problem, and therefore will optimize to not do so.
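
The three cases for answering a "want-extent" request can be sketched as
a small state machine. This is only an illustration under assumptions:
the class and method names (Extent, ExtentOwner, on_want_extent,
start_write, end_write) are made up here, the node is assumed to already
own the extents it is asked about, and requesting an extent back from
the peer is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    in_activity_log: bool = False
    usage_count: int = 0            # local requests currently in flight
    handover_pending: bool = False  # peer asked for it while it was busy


class ExtentOwner:
    """Hypothetical model of one node's side of the lock-extent handover."""

    def __init__(self, owned):
        self.extents = {nr: Extent() for nr in owned}
        self.granted = []  # records each "there you go" sent to the peer

    def _grant(self, nr):
        del self.extents[nr]
        self.granted.append(nr)

    def on_want_extent(self, nr):
        ext = self.extents.get(nr)
        if ext is None:
            return  # not ours; this case is not modeled
        if ext.usage_count > 0:
            # still in use: mark it, stop accepting new requests
            ext.handover_pending = True
            return
        # unused, or only still listed in the activity log: drop and grant
        ext.in_activity_log = False
        self._grant(nr)

    def start_write(self, nr) -> bool:
        ext = self.extents.get(nr)
        if ext is None or ext.handover_pending:
            return False  # caller would have to wait or ask the peer
        ext.in_activity_log = True
        ext.usage_count += 1
        return True

    def end_write(self, nr):
        ext = self.extents[nr]
        ext.usage_count -= 1
        if ext.handover_pending and ext.usage_count == 0:
            self._grant(nr)  # usage count dropped to zero: send it now
```

The ping-pong cost shows up in the last branch: every alternating write
has to wait for the other node's usage count to reach zero before the
extent moves.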
but, yes, I will have a look at the arbitration logic, too.
lge
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-08 13:51 ` Lars Ellenberg
@ 2004-10-11 7:12 ` Philipp Reisner
2004-10-11 10:09 ` Lars Ellenberg
2004-10-11 10:11 ` Lars Ellenberg
0 siblings, 2 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-11 7:12 UTC (permalink / raw)
To: drbd-dev
> but, yes, I will have a look at the arbitration logic, too.
>
Hi Lars,
Did you find any loose ends in my description of the
arbitration logic ? -- If we do not find any loose ends
my vote goes for it.
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com :
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-11 7:12 ` Philipp Reisner
@ 2004-10-11 10:09 ` Lars Ellenberg
2004-10-11 10:11 ` Lars Ellenberg
1 sibling, 0 replies; 21+ messages in thread
From: Lars Ellenberg @ 2004-10-11 10:09 UTC (permalink / raw)
To: drbd-dev
/ 2004-10-11 09:12:02 +0200
\ Philipp Reisner:
>
> > but, yes, I will have a look at the arbitration logic, too.
> >
>
> Hi Lars,
>
> Did you find any loose ends in my description of the
> arbitration logic ? -- If we do not find any loose ends
> my vote goes for it.
not sure yet...
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-11 7:12 ` Philipp Reisner
2004-10-11 10:09 ` Lars Ellenberg
@ 2004-10-11 10:11 ` Lars Ellenberg
2004-10-11 12:28 ` Philipp Reisner
1 sibling, 1 reply; 21+ messages in thread
From: Lars Ellenberg @ 2004-10-11 10:11 UTC (permalink / raw)
To: drbd-dev
/ 2004-10-11 09:12:02 +0200
\ Philipp Reisner:
>
> > but, yes, I will have a look at the arbitration logic, too.
> >
>
> Hi Lars,
>
> Did you find any loose ends in my description of the
> arbitration logic ? -- If we do not find any loose ends
> my vote goes for it.
not sure yet ...
especially what exactly should happen in failure cases.
need to think about it some more.
lge
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-11 10:11 ` Lars Ellenberg
@ 2004-10-11 12:28 ` Philipp Reisner
2004-10-11 12:41 ` Philipp Reisner
0 siblings, 1 reply; 21+ messages in thread
From: Philipp Reisner @ 2004-10-11 12:28 UTC (permalink / raw)
To: drbd-dev
On Monday 11 October 2004 12:11, Lars Ellenberg wrote:
> / 2004-10-11 09:12:02 +0200
>
> \ Philipp Reisner:
> > > but, yes, I will have a look at the arbitration logic, too.
> >
> > Hi Lars,
> >
> > Did you find any loose ends in my description of the
> > arbitration logic ? -- If we do not find any loose ends
> > my vote goes for it.
>
> not sure yet ...
>
> especially what exactly should happen in failure cases.
> need to think about it some more.
>
Right, I thought about that too, and came to the conclusion
that everything is covered nicely by the AL.
-philipp
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com :
* Re: [Drbd-dev] How Locking in GFS works...
2004-10-11 12:28 ` Philipp Reisner
@ 2004-10-11 12:41 ` Philipp Reisner
0 siblings, 0 replies; 21+ messages in thread
From: Philipp Reisner @ 2004-10-11 12:41 UTC (permalink / raw)
To: drbd-dev
> > not sure yet ...
> >
> > especially what exactly should happen in failure cases.
> > need to think about it some more.
>
> Right, I thought about that too, and came to the conclusion
> that everything is covered nicely by the AL.
>
Ahhh... I think I know what you mean ... Hmmm...
-Philipp
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com :